check-vmware | `check_vmware_datastore_performance` plugin
- Overview
- Output
- Requirements
- Stability of this plugin
- Performance Data
- Optional evaluation
- Installation
- Configuration options
- Contrib
- Examples
- Troubleshooting
- License
- References
Nagios plugin used to monitor datastore performance.
In addition to reporting current datastore performance details, this plugin also reports which VMs reside on the datastore along with their percentage of the total datastore space used. This is intended to help pinpoint potential causes of high latency at a glance.
The output for these plugins is designed to provide the one-line summary needed by Nagios for quick identification of a problem while providing longer, more detailed information for display within the web UI and for use in email and Teams notifications (atc0005/send2teams).
This plugin requires that the `Statistics Collection` setting (part of `Storage I/O Control`) for a monitored datastore be enabled. If it is not, this plugin is unable to evaluate performance for a specified datastore. This plugin attempts to detect and report this condition so that vSphere administrators can assist with enabling this feature.
NOTE: Changing this setting requires elevated privileges in the vSphere environment.
The privileges needed to perform normal sysadmin duties (creating VMs, moving VMs, deleting VMs, uploading/downloading files from the datastore, etc.) are not sufficient to change this setting. If you have a dedicated team that manages your virtual environment you will need to contact them to have this setting changed for every datastore you wish to monitor with this plugin.
To help with locating datastores in need of adjustment, the following PowerCLI snippet may be used:
```powershell
# Prompt for credentials and connect to the vCenter instance.
$credential = Get-Credential -Message "Enter your credentials (DOMAIN\ID)"
$server = Connect-VIServer -Server vc1.example.com -Credential $credential

# List datastores with Statistics Collection disabled, sorted by name.
Get-View -ViewType Datastore |
    Where-Object {$_.IormConfiguration.StatsCollectionEnabled -eq $false} |
    Select-Object -Property Name, @{Label="StatsCollectionEnabled"; Expression={$_.IormConfiguration.StatsCollectionEnabled}} |
    Sort-Object -Property Name

Disconnect-VIServer $server
```
Available settings for `Storage I/O Control`:

- Disabled
- Statistics enabled but Storage I/O disabled
- Statistics and Storage I/O enabled
NOTE: This plugin uses the `QueryDatastorePerformanceSummary()` method provided by the `StorageResourceManager` Managed Object. While available since vSphere API 5.1, this API is marked as experimental (and subject to change/removal):
This is an experimental interface that is not intended for use in production code.
In addition to using the experimental `QueryDatastorePerformanceSummary()` API, this plugin uses the deprecated `statsCollectionEnabled` property from the `StorageIORMInfo` Data Object to determine whether `Statistics Collection` is enabled for a datastore. Using the prescribed `enabled` property for that Data Object to determine `Statistics Collection` status does not work.
If you use this plugin, please provide feedback by opening a new discussion thread.
Initial support has been added for emitting Performance Data / Metrics, but refinement suggestions are welcome.
Consult the list below for the metrics implemented thus far, the original discussion thread and the Add Performance Data / Metrics support project board for an index of the initial implementation work.
Please add to an existing Discussion thread or open a new one with any feedback that you may have. Thanks in advance!
Performance metrics are provided by vSphere in aggregated quantiles over a period of time (intervals). Aggregated metrics correspond with a specific percentile. As of this plugin's initial development, vSphere provides metrics associated with these percentiles:
90
80
70
60
50
If not otherwise specified, percentile `90` is used to evaluate datastore performance metrics. While the vSphere API provides metrics in multiple intervals (one active, up to seven historical), only the active interval is used for evaluating current datastore performance.
There is a window between when the current interval ends and the new active interval begins during which no metrics are available for the active interval. Testing shows that this window lasts approximately 30 minutes. The plugin currently omits latency performance data metrics if no metrics are available, in an attempt to avoid skewing historical data already collected.
This plugin accepts flags to:
- specify individual latency metric thresholds (e.g., read latency CRITICAL, read latency WARNING, write latency ...)
- specify percentile sets
- multiple sets supported, each composed of a percentile and pairs of CRITICAL and WARNING threshold values
If you specify a percentile set, the plugin will not accept individual latency threshold flags. The reverse is also true: specifying one or more latency threshold flags is incompatible with specifying one or more percentile sets.
By specifying multiple percentile sets, you are indicating that crossing the thresholds of any one set is enough to trigger a state change.
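As a rough illustration of how a percentile set could be interpreted, the Python sketch below parses sets in the `P,RLW,RLC,WLW,WLC,VMLW,VMLC` order given in the configuration options and rejects a mix of percentile sets and individual threshold flags. The names `PercentileSet`, `parse_percentile_set` and `validate_flags` are hypothetical and are not taken from the plugin's source; the plugin's actual validation rules may differ.

```python
from dataclasses import dataclass
from typing import Dict, List

# Percentiles this document lists as provided by vSphere.
SUPPORTED_PERCENTILES = (90, 80, 70, 60, 50)


@dataclass(frozen=True)
class PercentileSet:
    """One percentile plus WARNING/CRITICAL latency thresholds (in ms)."""
    percentile: int
    read_warning: float
    read_critical: float
    write_warning: float
    write_critical: float
    vm_warning: float
    vm_critical: float


def parse_percentile_set(raw: str) -> PercentileSet:
    """Parse 'P,RLW,RLC,WLW,WLC,VMLW,VMLC'; all seven fields are required."""
    fields = [field.strip() for field in raw.split(",")]
    if len(fields) != 7:
        raise ValueError(f"expected 7 comma-separated fields, got {len(fields)}")
    percentile = int(fields[0])
    if percentile not in SUPPORTED_PERCENTILES:
        raise ValueError(f"unsupported percentile: {percentile}")
    thresholds = [float(field) for field in fields[1:]]
    return PercentileSet(percentile, *thresholds)


def validate_flags(percentile_sets: List[str],
                   individual_flags: Dict[str, float]) -> List[PercentileSet]:
    """Reject mixing percentile sets with individual latency threshold flags."""
    if percentile_sets and individual_flags:
        raise ValueError(
            "percentile sets and individual latency flags are mutually exclusive"
        )
    return [parse_percentile_set(raw) for raw in percentile_sets]
```

Because the flag may be repeated, a caller would pass every supplied set to `validate_flags` and evaluate each resulting `PercentileSet` in turn; crossing the thresholds of any one of them is enough to change state.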
This plugin emits Nagios performance data metrics for each percentile in the active interval that is not completely of value `0`. Any percentile whose metrics are all `0` is omitted from the performance data metrics collected and emitted by the plugin.
Please provide feedback by opening a new issue if you find that this decision causes problems with gathering metrics.
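The omission rule can be pictured with a short Python sketch. The nested dictionary layout (percentile, then metric name, then value) and the function name `drop_all_zero_percentiles` are assumptions made for illustration only; they do not describe the plugin's internal data structures, and the sample values are invented.

```python
from typing import Dict

# Hypothetical layout: percentile -> metric name -> value (active interval only).
Metrics = Dict[int, Dict[str, float]]


def drop_all_zero_percentiles(active: Metrics) -> Metrics:
    """Keep only percentiles that have at least one non-zero metric."""
    return {
        percentile: values
        for percentile, values in active.items()
        if any(value != 0 for value in values.values())
    }


# Illustrative values only, not real measurements.
sample = {
    90: {"read_latency": 1.2, "write_latency": 0.0, "vm_latency": 2.5},
    80: {"read_latency": 0.0, "write_latency": 0.0, "vm_latency": 0.0},
}
kept = drop_all_zero_percentiles(sample)
print(sorted(kept))  # [90]
```

Note that a percentile with a single `0` value (such as the write latency for percentile 90 above) is kept; only percentiles where every metric is `0` are dropped.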
See the main project README for details.
NOTE: These metrics are based on the visibility of the service account used to login to the target VMware environment. If the service account cannot see a resource, it cannot evaluate the resource.
Metric | Unit of Measurement | Description |
---|---|---|
`time` | milliseconds | plugin runtime |
`vms` | | all (visible) virtual machines in the inventory |
`vms_powered_on` | | virtual machines powered on |
`vms_powered_off` | | virtual machines powered off |
`p*_read_latency` | milliseconds | aggregated datastore latency for read operations |
`p*_write_latency` | milliseconds | aggregated datastore latency for write operations |
`p*_vm_latency` | milliseconds | aggregated datastore latency as observed by VirtualMachines using the datastore |
`p*_read_iops` | reads per second | aggregated datastore read I/O rate |
`p*_write_iops` | writes per second | aggregated datastore write I/O rate |
NOTE: `*` is a placeholder for the `90`, `80`, `70`, `60` & `50` percentiles.
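To make the placeholder concrete, the following Python sketch expands a `p*_...` pattern into one metric name per percentile and formats a value using the general Nagios `'label'=value[UOM]` performance data convention. The helper names `expand_metric_names` and `format_perfdata` are hypothetical, and the sketch does not claim to reproduce the exact strings emitted by the plugin (for example, whether thresholds are appended).

```python
from typing import List, Sequence

# Percentiles this document lists as provided by vSphere.
PERCENTILES = (90, 80, 70, 60, 50)


def expand_metric_names(pattern: str,
                        percentiles: Sequence[int] = PERCENTILES) -> List[str]:
    """Replace the '*' placeholder with each percentile value."""
    return [pattern.replace("*", str(p), 1) for p in percentiles]


def format_perfdata(label: str, value: float, uom: str = "") -> str:
    """Format one metric as 'label'=value[UOM] (thresholds omitted)."""
    return f"'{label}'={value}{uom}"


names = expand_metric_names("p*_read_latency")
print(names[0])                                # p90_read_latency
print(format_perfdata(names[0], 1.5, "ms"))    # 'p90_read_latency'=1.5ms
```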
Some plugins provide optional support to limit evaluation of VMs to specific Resource Pools (explicitly including or excluding) and power states (on or off). Other plugins support similar filtering options (e.g., `Acknowledged` state of Triggered Alarms). See the configuration options, examples and contrib sections for more information.
See the main project README for details.
TODO: Research & note why metric sets might contain all values of `0`.
Nagios State | Description |
---|---|
`OK` | Ideal state, Datastore performance within bounds for the active interval for the chosen percentile(s). |
`UNKNOWN` | Datastore performance metric sets are all value 0 or metrics collection for a datastore is disabled. |
`WARNING` | Datastore performance crossed user-specified latency thresholds for this state. |
`CRITICAL` | Datastore performance crossed user-specified latency thresholds for this state. |
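The state table can be read as a small decision procedure. The Python sketch below is one possible interpretation, not the plugin's implementation: the `evaluate` function and its argument shapes are hypothetical, treating a value equal to a threshold as "crossed" is an assumption, and the effect of the `ds-ignore-missing-metrics` flag is not modelled. The numeric values of `State` are the conventional Nagios plugin exit codes.

```python
from enum import IntEnum
from typing import Dict, Tuple


class State(IntEnum):
    """Conventional Nagios plugin exit codes."""
    OK = 0
    WARNING = 1
    CRITICAL = 2
    UNKNOWN = 3


def evaluate(metrics: Dict[str, float],
             thresholds: Dict[str, Tuple[float, float]],
             collection_enabled: bool = True) -> State:
    """Map active-interval latency values (ms) to a Nagios state.

    metrics:    e.g. {"read": 4.0, "write": 6.0, "vm": 5.0}
    thresholds: e.g. {"read": (15, 30), ...} as (WARNING, CRITICAL) pairs
    """
    # Disabled collection or an all-zero metric set cannot be evaluated.
    if not collection_enabled or all(v == 0 for v in metrics.values()):
        return State.UNKNOWN
    # Assumption: ">=" counts as crossing a threshold.
    if any(metrics[name] >= crit for name, (_, crit) in thresholds.items()):
        return State.CRITICAL
    if any(metrics[name] >= warn for name, (warn, _) in thresholds.items()):
        return State.WARNING
    return State.OK


limits = {"read": (15, 30), "write": (15, 30), "vm": (15, 30)}
print(evaluate({"read": 4.0, "write": 20.0, "vm": 5.0}, limits).name)  # WARNING
```

With multiple percentile sets, the same check would be run once per set and the most severe result reported.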
- Use the `-h` or `--help` flag to display current usage information.
- Flags marked as `required` must be set via CLI flag.
- Flags not marked as required are for settings where a useful default is already defined, but may be overridden if desired.
Flag | Required | Default | Repeat | Possible | Description |
---|---|---|---|---|---|
`branding` | No | `false` | No | `branding` | Toggles emission of branding details with plugin status details. This output is disabled by default. |
`h`, `help` | No | `false` | No | `h`, `help` | Show Help text along with the list of supported flags. |
`v`, `version` | No | `false` | No | `v`, `version` | Whether to display application version and then immediately exit application. |
`ll`, `log-level` | No | `info` | No | `disabled`, `panic`, `fatal`, `error`, `warn`, `info`, `debug`, `trace` | Log message priority filter. Log messages with a lower level are ignored. Log messages are sent to `stderr` by default. See Output for more information. |
`p`, `port` | No | `443` | No | positive whole number between 1-65535, inclusive | TCP port of the remote ESXi host or vCenter instance. This is usually 443 (HTTPS). |
`t`, `timeout` | No | `10` | No | positive whole number of seconds | Timeout value in seconds allowed before a plugin execution attempt is abandoned and an error returned. |
`s`, `server` | Yes | | No | fully-qualified domain name or IP Address | The fully-qualified domain name or IP Address of the remote ESXi host or vCenter instance. |
`u`, `username` | Yes | | No | valid username | Username with permission to access specified ESXi host or vCenter instance. |
`pw`, `password` | Yes | | No | valid password | Password used to login to ESXi host or vCenter instance. |
`domain` | No | | No | valid user domain | (Optional) domain for user account used to login to ESXi host or vCenter instance. This is needed for user accounts residing in a non-default domain (e.g., SSO specific domain). |
`trust-cert` | No | `false` | No | `true`, `false` | Whether the certificate should be trusted as-is without validation. WARNING: TLS is susceptible to man-in-the-middle attacks if enabling this option. |
`dc-name` | No | | No | valid vSphere datacenter name | Specifies the name of a vSphere Datacenter. If not specified, applicable plugins will attempt to use the default datacenter found in the vSphere environment. Not applicable to standalone ESXi hosts. |
`ds-name` | Yes | | No | valid datastore name | Datastore name as it is found within the vSphere inventory. |
`dsim`, `ds-ignore-missing-metrics` | No | `false` | No | `true`, `false` | Toggles how missing Datastore Performance metrics will be handled. This is believed to occur when a datastore is newly created and metrics have not yet been collected. |
`dshhms`, `ds-hide-historical-metric-sets` | No | `false` | No | `true`, `false` | Toggles display of historical Datastore Performance metrics at plugin completion. By default historical metrics are listed. |
`dsrlc`, `ds-read-latency-critical` | No | `30` | No | positive whole number or float | Specifies the read latency of a datastore's storage (in ms) when a CRITICAL threshold is reached. The default percentile is used (`90`). |
`dsrlw`, `ds-read-latency-warning` | No | `15` | No | positive whole number or float | Specifies the read latency of a datastore's storage (in ms) when a WARNING threshold is reached. The default percentile is used (`90`). |
`dswlc`, `ds-write-latency-critical` | No | `30` | No | positive whole number or float | Specifies the write latency of a datastore's storage (in ms) when a CRITICAL threshold is reached. The default percentile is used (`90`). |
`dswlw`, `ds-write-latency-warning` | No | `15` | No | positive whole number or float | Specifies the write latency of a datastore's storage (in ms) when a WARNING threshold is reached. The default percentile is used (`90`). |
`dsvmlc`, `ds-vm-latency-critical` | No | `30` | No | positive whole number or float | Specifies the latency (in ms) as observed by VMs using the datastore when a CRITICAL threshold is reached. The default percentile is used (`90`). |
`dsvmlw`, `ds-vm-latency-warning` | No | `15` | No | positive whole number or float | Specifies the latency (in ms) as observed by VMs using the datastore when a WARNING threshold is reached. The default percentile is used (`90`). |
`dslps`, `ds-latency-percentile-set` | No | `90,15,30,15,30,15,30` | Yes | complete percentile set in `P,RLW,RLC,WLW,WLC,VMLW,VMLC` format | Specifies the performance percentile set used for threshold calculations. Incompatible with individual latency threshold flags. All comma-separated field values are required for each set. |
Not currently supported. This feature may be added later if there is sufficient interest.
See the main project README for details.
```shell
/usr/lib/nagios/plugins/check_vmware_datastore_performance \
    --server vc1.example.com \
    --username SERVICE_ACCOUNT_NAME \
    --password "SERVICE_ACCOUNT_PASSWORD" \
    --ds-latency-percentile-set '90,15,30,15,30,15,30' \
    --ds-name "HUSVM-DC1-vol6" \
    --trust-cert \
    --log-level info
```
See the configuration options section for all command-line settings supported by this plugin along with descriptions of each. See the contrib section for information regarding example command definitions and Nagios configuration files.
Of note:

- We use a datastore performance percentile set instead of individual latency flags
  - `90`th percentile
  - read latency `WARNING` threshold of `15 ms`
  - read latency `CRITICAL` threshold of `30 ms`
  - write latency `WARNING` threshold of `15 ms`
  - write latency `CRITICAL` threshold of `30 ms`
  - vm latency `WARNING` threshold of `15 ms`
  - vm latency `CRITICAL` threshold of `30 ms`
- Due to plugin design, only the active interval is evaluated for threshold violations
  - historical interval metrics are reported via `LongServiceOutput` unless the flag to skip emitting those metrics is specified
- Certificate warnings are ignored.
  - not best practice, but many vCenter instances use self-signed certs per various freely available guides
- Service Check results output is sent to `stdout`
- Logging output is enabled at the `info` level.
  - logging output is sent to `stderr` by default
  - logging output is intended to be seen when invoking the plugin directly via CLI (often for troubleshooting)
  - see the Output section of the main README for potential conflicts with some monitoring systems
```shell
# /etc/nagios-plugins/config/vmware-datastores-performance.cfg

# Look at specific datastore and explicitly provide custom WARNING and
# CRITICAL latency threshold values via individual flags.
define command{
    command_name    check_vmware_datastore_performance_via_individual_flags
    command_line    $USER1$/check_vmware_datastore_performance --server '$HOSTNAME$' --domain '$ARG1$' --username '$ARG2$' --password '$ARG3$' --ds-read-latency-warning '$ARG4$' --ds-read-latency-critical '$ARG5$' --ds-write-latency-warning '$ARG6$' --ds-write-latency-critical '$ARG7$' --ds-vm-latency-warning '$ARG8$' --ds-vm-latency-critical '$ARG9$' --ds-name '$ARG10$' --trust-cert --log-level info
    }

# Look at specific datastore and explicitly provide custom WARNING and
# CRITICAL latency threshold values for a single percentile via a percentile
# flag set.
define command{
    command_name    check_vmware_datastore_performance_via_1percentile_set
    command_line    $USER1$/check_vmware_datastore_performance --server '$HOSTNAME$' --domain '$ARG1$' --username '$ARG2$' --password '$ARG3$' --ds-latency-percentile-set '$ARG4$' --ds-name '$ARG5$' --trust-cert --log-level info
    }
```
If you see an error message like this one:
UNKNOWN: Unable to retrieve performance summary for datastore "DATASTORE_NAME_HERE": datastore storage I/O statistics collection disabled **ERRORS** * datastore storage I/O statistics collection disabled: assistance needed from vmware administrators to resolve issue
then it means that the required `Statistics Collection` setting for the specified datastore is not enabled. See the Requirements section of this documentation for more information.
See the main project README for details.