Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ZFS replication metrics #243

Merged
merged 12 commits into from
Apr 27, 2024
Merged

Add ZFS replication metrics #243

merged 12 commits into from
Apr 27, 2024

Conversation

svengerber
Copy link
Contributor

This PR adds replication metrics as requested in issue #112.

  • Replication Metrics are fetched per node
  • The metrics can be enabled or disabled

I reworked the original PR #166 to the new file structure.

@znerol
Copy link
Member

znerol commented Apr 19, 2024

Hi, thanks for filing this PR. It makes sense to scrape the replications per node, so code location, naming and overall approach is correct.

I'm not too happy with the metric structure though. This project follows the common practice of exposing an X_info gauge with a complete set of labels. The actual metrics only have an id label. The following excerpt from the metrics example in the readme demonstrates the approach nicely:

# HELP pve_uptime_seconds Number of seconds since the last boot
# TYPE pve_uptime_seconds gauge
pve_uptime_seconds{id="qemu/100"} 315039.0
[...]
# HELP pve_guest_info VM/CT info
# TYPE pve_guest_info gauge
pve_guest_info{id="qemu/100",name="samplevm1",node="proxmox",type="qemu"} 1.0

Also, it would be great if the guest, source and target labels would follow the same pattern (i.e., {type}/{id}). This does simplify joining the new metrics to other time series. I'm not sure whether the id should be changed as well. This could be interesting in combination with a pve_up metric. I.e., if there is a way to determine whether a replication job is up (if there is a useful definition of up), then it could be interesting to emit pve_up{id="replication/1-0"} 1 with the same {type}/{id} pattern in order to avoid conflicts with other time series.

Finally, there is a best practice document by prometheus about metric and label naming. Metric names ... should have a suffix describing the unit, in plural form. Note that an accumulating count has total as a suffix, in addition to the unit if applicable.

Thus, I'd expect a similar structure from the replication metrics.

    # HELP pve_replication_info Proxmox replication info
    # TYPE pve_replication_info gauge
    pve_replication_info{id="1-0",guest="qemu/1",source="node/proxmox1",target="node/proxmox2",type="local"} 1

    # HELP pve_replication_duration_seconds Proxmox vm replication duration
    # TYPE pve_replication_duration_seconds gauge
    pve_replication_duration_seconds{id="1-0"} 7.73584
    # HELP pve_replication_last_sync_timestamp_seconds Proxmox vm replication last_sync
    # TYPE pve_replication_last_sync_timestamp_seconds gauge
    pve_replication_last_sync_timestamp_seconds{id="1-0"} 1.713382503e+09
    # HELP pve_replication_last_try_timestamp_seconds Proxmox vm replication last_try
    # TYPE pve_replication_last_try_timestamp_seconds gauge
    pve_replication_last_try_timestamp_seconds{id="1-0"} 1.713382503e+09
    # HELP pve_replication_next_sync_timestamp_seconds Proxmox vm replication next_sync
    # TYPE pve_replication_next_sync_timestamp_seconds gauge
    pve_replication_next_sync_timestamp_seconds{id="1-0"} 1.7134689e+09
    # HELP pve_replication_failures_total Proxmox vm replication fail_count
    # TYPE pve_replication_failures_total gauge
    pve_replication_failures_total{id="1-0"} 0.0

@svengerber
Copy link
Contributor Author

Thank you for the review.

I have updated the following points:

  • metric naming
  • added info metric with labels (source,target,guest) in {type}/{id} format
  • remove unnecessary labels from other metrics
  • Added pve_up metric for replication jobs

Replication Jobs in Proxmox can be enabled or disabled.
The Proxomx API will return disabled: 1 is the replication job is disabled and nothing if the status is enabled.
Based on this I am currently setting the pve_up value.

I have left the id to be the replication id from the Proxmox API.
This value is unique in a cluster and a single VM can have multiple replication jobs, therefore we cannot use the qemu/lxc id.

@znerol
Copy link
Member

znerol commented Apr 20, 2024

Cool! Thanks a lot for this update. I have to confess, that I do not yet fully grasp the data model of the PVE API regarding replication jobs. I neither have access to a cluster where I could take a look myself. Thus, please bear with me a little bit while I'm putting together a mental model for it.

According to the commits in this PR and the API docs, there are two different API routes where information about replication jobs can be gathered:

Regrettably, the API docs do not specify the structure of the returned data. Thus I took a look at the respective admin ui sections in my cluster. It looks like the interesting metrics (i.e., the various timestamps) are only available from the node-endpoint. The cluster endpoint only lists configuration values and no state. If that observation is correct, then scraping the state from nodes is certainly the best option we have.

The current PR seems to scrape the pve_up metric from cluster resources. Hence, people running separate scrape jobs for node and cluster metrics end up scraping replication related stuff in two places. I fear that might lead to surprises. Also basing the pve_up metric on the enabled flag doesn't mix well with the other definitions of pve_up (i.e., vm running, container running, node running). People are using pve_up metric for alerting purposes. Alerting on the basis of a configurable value (enabled/disabled) doesn't seem very useful to me. Thus, let's drop the pve_up metric for replication jobs unless there is a more useful data source for its value.

@znerol
Copy link
Member

znerol commented Apr 20, 2024

The PVE docs state:

If a replication job encounters problems, it is placed in an error state. In this state, the configured replication intervals get suspended temporarily.

I think that would be a good data source for pve_up if the error state is available from the nodes/{node}/replication/{id}/status endpoint.

@znerol
Copy link
Member

znerol commented Apr 20, 2024

According to the pve source code, there is a chance that a failed job contains an error key in the state object. The value seems to be a textual representation of the error.

On the other hand, the code also indicates that fail_count is reset to zero after a job has recovered. As a result, it is easy to define an alerting rule using the following expression: pve_replication_failures_total != 0. But that indicates that the _total suffix isn't exactly correct here.

@znerol
Copy link
Member

znerol commented Apr 20, 2024

According to this blog post, the _count and _total suffixes shouldn't be used for gauges. Browsing through my metrics, I found one which seems to behave quite similar to fail_count: alertmanager_cluster_failed_peers:

This metric is a gauge that shows you the current number of failed peers.

So, I guess we could rename pve_replication_failures_total to pve_replication_failed_syncs. I feel that this name naturally hints towards an alerting condition of pve_replication_failed_syncs!=0.

@znerol
Copy link
Member

znerol commented Apr 20, 2024

I pushed my proposed changes. @svengerber is that usable in your case?

@svengerber
Copy link
Contributor Author

@znerol

Thank you for the updates!
I have tested them against our cluster with replication jobs and it looks good.

In our case we are be monitoring agains replication failures and successfull syncs, so these metrics will work for us.

@znerol
Copy link
Member

znerol commented Apr 20, 2024

Perfect. Did you ever experience replication failures? Do you think the assumption is correct that fail_count is reset to zero after a replication job recovers?

@svengerber
Copy link
Contributor Author

I have tested this in our environment and the fail_count will go back to 0 after a successfull replication.

Example return data after a sync failure:

        "fail_count": 1,
        "schedule": "XX:XX",
        "id": "1-0",
        "type": "local",
        "last_try": X,
        "jobnum": 0,
        "rate": 500,
        "guest": 1,
        "vmtype": "qemu",
        "source": "XXX",
        "next_sync": 0,
        "error": "command '/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=XXX' [email protected] pvecm mtunnel -migration_network X.X.X.X/X -get_migration_ip' failed: exit code 255",
        "duration": 2.24147,
        "target": "XXX",
        "last_sync": XXXX

@znerol
Copy link
Member

znerol commented Apr 23, 2024

I'm looking at this with a fresh mind, this time regarding the coding style. I fixed the following things:

  • Adapt API call style (use self._pve.nodes(node).replication.get() instead of self._pve(f"nodes/{node}/replication/").get()).
  • Change variable name of for-loop. Use jobdata instead of vmdata.

I think with those changes, the new code blends in quite nicely with the existing stuff. @svengerber would you mind checking whether I broke something with the latest commits?

@svengerber
Copy link
Contributor Author

@znerol

Thank you for the update.
I have tested your changes against my clusters with replication enabled and it looks good.

@znerol znerol merged commit f52d43e into prometheus-pve:main Apr 27, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants