Add ZFS replication metrics #243
Conversation
Signed-off-by: Sven Gerber <[email protected]>
Hi, thanks for filing this PR. It makes sense to scrape the replication jobs per node, so the code location, naming, and overall approach are correct. I'm not too happy with the metric structure, though. This project follows the common practice of exposing an
Also, it would be great if the

Finally, there is a best-practice document by Prometheus about metric and label naming: metric names should have a suffix describing the unit, in plural form, and an accumulating count has `total` as a suffix, in addition to the unit if applicable. Thus, I'd expect a similar structure from the replication metrics.
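For illustration, metrics following those naming rules could be declared roughly like this with `prometheus_client`; the metric and label names below are placeholders, not the names this PR has to end up with:

```python
# Hypothetical example of the naming scheme described above; these are not
# the final metric names of this PR.
from prometheus_client.core import CounterMetricFamily, GaugeMetricFamily

# Gauge with the unit (seconds, plural) as suffix.
duration = GaugeMetricFamily(
    'pve_replication_duration_seconds',
    'Duration of the last replication run in seconds',
    labels=['id'])

# Accumulating count: unit-less here, so only the _total suffix applies.
failures = CounterMetricFamily(
    'pve_replication_failed_syncs_total',
    'Total number of failed replication runs',
    labels=['id'])
```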
Signed-off-by: Sven Gerber <[email protected]>
Thank you for the review. I have updated the following points:
Replication jobs in Proxmox can be enabled or disabled. I have left the
Cool! Thanks a lot for this update. I have to confess that I do not yet fully grasp the data model of the PVE API regarding replication jobs, nor do I have access to a cluster where I could take a look myself. So please bear with me a little while I put together a mental model for it.

According to the commits in this PR and the API docs, there are two different API routes where information about replication jobs can be gathered: one on the cluster level and one per node. Regrettably, the API docs do not specify the structure of the returned data, so I took a look at the respective admin UI sections in my cluster. It looks like the interesting metrics (i.e., the various timestamps) are only available from the node endpoint; the cluster endpoint only lists configuration values and no state. If that observation is correct, then scraping the state from the nodes is certainly the best option we have.

The current PR seems to scrape the
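To make the distinction between the two routes concrete, here is a rough sketch of how both could be queried with proxmoxer (the client library this exporter builds on). The endpoint paths follow my reading of the API docs, and the printed fields are assumptions about the payload, not verified against a live cluster:

```python
# Rough sketch, not verified against a live cluster. Endpoint paths follow
# the PVE API docs; the printed fields are assumptions about the payload.
from proxmoxer import ProxmoxAPI

pve = ProxmoxAPI('pve.example.com', user='monitor@pve',
                 password='secret', verify_ssl=True)

# GET /cluster/replication: job *configuration* (schedule, target, ...),
# available cluster-wide.
for job in pve.cluster.replication.get():
    print(job.get('id'), job.get('schedule'), job.get('target'))

# GET /nodes/{node}/replication: per-node job *status*, which is where the
# interesting timestamps (last/next sync, duration, failures) should live.
for node in pve.nodes.get():
    for status in pve.nodes(node['node']).replication.get():
        print(status.get('id'), status.get('last_sync'), status.get('fail_count'))
```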
The PVE docs state:
I think that would be a good data source for
According to the PVE source code, there is a chance that a failed job contains an

On the other hand, the code also indicates that
According to this blog post, the
So, I guess we could rename
I pushed my proposed changes. @svengerber is that usable in your case?
Thank you for the updates! In our case we will be monitoring for replication failures and successful syncs, so these metrics will work for us.
Perfect. Did you ever experience replication failures? Do you think the assumption is correct that
I have tested this in our environment and the

Example return data after a sync failure:
I'm looking at this with a fresh mind, this time regarding the coding style. I fixed the following things:
I think with those changes, the new code blends in quite nicely with the existing stuff. @svengerber would you mind checking whether I broke something with the latest commits?
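For readers following along without opening the diff, the result is roughly a collector of the shape sketched below; the class name, metric names, and payload fields are illustrative assumptions, not a verbatim copy of the code in this PR:

```python
# Illustrative sketch of a per-node replication collector in the style of
# the existing collectors; names and fields are assumptions, not the exact
# code merged in this PR.
from prometheus_client.core import GaugeMetricFamily


class ReplicationCollector:
    """Collects replication job status from a single PVE node."""

    def __init__(self, pve, node):
        self._pve = pve      # proxmoxer ProxmoxAPI client
        self._node = node    # node name, e.g. 'pve1'

    def collect(self):
        last_sync = GaugeMetricFamily(
            'pve_replication_last_sync_timestamp_seconds',
            'Unix timestamp of the last successful replication run',
            labels=['id'])
        fail_count = GaugeMetricFamily(
            'pve_replication_failed_syncs',
            'Number of consecutive failed replication runs',
            labels=['id'])

        for job in self._pve.nodes(self._node).replication.get():
            job_id = str(job['id'])
            last_sync.add_metric([job_id], job.get('last_sync', 0))
            fail_count.add_metric([job_id], job.get('fail_count', 0))

        yield last_sync
        yield fail_count
```

An instance of such a collector would presumably be registered once per scraped node, alongside the other node-level collectors.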
Thank you for the update.
This PR adds replication metrics as requested in issue #112.
I reworked the original PR #166 to fit the new file structure.