Add ZFS replication metrics #243
Conversation
Signed-off-by: Sven Gerber <[email protected]>
Hi, thanks for filing this PR. It makes sense to scrape the replication jobs per node, so the code location, naming, and overall approach are correct. I'm not too happy with the metric structure, though. This project follows the common practice of exposing an
Also, it would be great if the

Finally, there is a best-practice document by Prometheus about metric and label naming: metric names should have a suffix describing the unit, in plural form, and an accumulating count has `total` as a suffix, in addition to the unit if applicable. Thus, I'd expect a similar structure from the replication metrics.
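For illustration, metrics following those naming rules could be declared roughly like this with `prometheus_client`; the metric and label names below are placeholders, not the names this PR has to end up with:

```python
# Hypothetical example of the naming scheme described above; these are not
# the final metric names of this PR.
from prometheus_client.core import CounterMetricFamily, GaugeMetricFamily

# Gauge with the unit (seconds, plural) as suffix.
duration = GaugeMetricFamily(
    'pve_replication_duration_seconds',
    'Duration of the last replication run in seconds',
    labels=['id'])

# Accumulating count: unit-less here, so only the _total suffix applies.
failures = CounterMetricFamily(
    'pve_replication_failed_syncs_total',
    'Total number of failed replication runs',
    labels=['id'])
```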
Signed-off-by: Sven Gerber <[email protected]>
Thank you for the review. I have updated the following points:
Replication jobs in Proxmox can be enabled or disabled. I have left the
Cool! Thanks a lot for this update. I have to confess that I do not yet fully grasp the data model of the PVE API regarding replication jobs, nor do I have access to a cluster where I could take a look myself. So please bear with me a little while I put together a mental model for it.

According to the commits in this PR and the API docs, there are two different API routes where information about replication jobs can be gathered: one on the cluster level and one per node. Regrettably, the API docs do not specify the structure of the returned data, so I took a look at the respective admin UI sections in my cluster. It looks like the interesting metrics (i.e., the various timestamps) are only available from the node endpoint; the cluster endpoint only lists configuration values and no state. If that observation is correct, then scraping the state from the nodes is certainly the best option we have.

The current PR seems to scrape the
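To make the distinction between the two routes concrete, here is a rough sketch of how both could be queried with proxmoxer (the client library this exporter builds on). The endpoint paths follow my reading of the API docs, and the printed fields are assumptions about the payload, not verified against a live cluster:

```python
# Rough sketch, not verified against a live cluster. Endpoint paths follow
# the PVE API docs; the printed fields are assumptions about the payload.
from proxmoxer import ProxmoxAPI

pve = ProxmoxAPI('pve.example.com', user='monitor@pve',
                 password='secret', verify_ssl=True)

# GET /cluster/replication: job *configuration* (schedule, target, ...),
# available cluster-wide.
for job in pve.cluster.replication.get():
    print(job.get('id'), job.get('schedule'), job.get('target'))

# GET /nodes/{node}/replication: per-node job *status*, which is where the
# interesting timestamps (last/next sync, duration, failures) should live.
for node in pve.nodes.get():
    for status in pve.nodes(node['node']).replication.get():
        print(status.get('id'), status.get('last_sync'), status.get('fail_count'))
```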
The PVE docs state:
I think that would be a good data source for
According to the PVE source code, there is a chance that a failed job contains an

On the other hand, the code also indicates that
According to this blog post, the
So, I guess we could rename
I pushed my proposed changes. @svengerber is that usable in your case?
Thank you for the updates! In our case we will be monitoring for replication failures and successful syncs, so these metrics will work for us.
Perfect. Did you ever experience replication failures? Do you think the assumption is correct that
I have tested this in our environment and the

Example return data after a sync failure:
I'm looking at this with a fresh mind, this time regarding the coding style. I fixed the following things:
I think with those changes, the new code blends in quite nicely with the existing stuff. @svengerber would you mind checking whether I broke something with the latest commits?
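For readers following along without opening the diff, the result is roughly a collector of the shape sketched below; the class name, metric names, and payload fields are illustrative assumptions, not a verbatim copy of the code in this PR:

```python
# Illustrative sketch of a per-node replication collector in the style of
# the existing collectors; names and fields are assumptions, not the exact
# code merged in this PR.
from prometheus_client.core import GaugeMetricFamily


class ReplicationCollector:
    """Collects replication job status from a single PVE node."""

    def __init__(self, pve, node):
        self._pve = pve      # proxmoxer ProxmoxAPI client
        self._node = node    # node name, e.g. 'pve1'

    def collect(self):
        last_sync = GaugeMetricFamily(
            'pve_replication_last_sync_timestamp_seconds',
            'Unix timestamp of the last successful replication run',
            labels=['id'])
        fail_count = GaugeMetricFamily(
            'pve_replication_failed_syncs',
            'Number of consecutive failed replication runs',
            labels=['id'])

        for job in self._pve.nodes(self._node).replication.get():
            job_id = str(job['id'])
            last_sync.add_metric([job_id], job.get('last_sync', 0))
            fail_count.add_metric([job_id], job.get('fail_count', 0))

        yield last_sync
        yield fail_count
```

An instance of such a collector would presumably be registered once per scraped node, alongside the other node-level collectors.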
Thank you for the update.
This PR adds replication metrics as requested in issue #112.
I reworked the original PR #166 to fit the new file structure.