check-vmware | `check_vmware_datastore_performance` plugin

Main project README
Documentation index

Overview

Nagios plugin used to monitor datastore performance.

In addition to reporting current datastore performance details, this plugin also reports which VMs reside on the datastore along with their percentage of the total datastore space used. This is intended to help pinpoint potential causes of high latency at a glance.

Output

The output for these plugins is designed to provide the one-line summary needed by Nagios for quick identification of a problem, along with longer, more detailed information for display within the web UI and for use in email and Teams notifications (atc0005/send2teams).

Requirements

This plugin requires that the Statistics Collection setting (part of Storage I/O Control) for a monitored datastore be enabled. If it is not, this plugin is unable to evaluate performance for a specified datastore. This plugin attempts to detect and report this condition so that vSphere administrators can assist with enabling this feature.

NOTE: Changing this setting requires elevated privileges in the vSphere environment.

The privileges needed to perform normal sysadmin duties (creating VMs, moving VMs, deleting VMs, uploading/downloading files from the datastore, etc.) are not sufficient to change this setting. If you have a dedicated team that manages your virtual environment you will need to contact them to have this setting changed for every datastore you wish to monitor with this plugin.

To help with locating datastores in need of adjustment, the following PowerCLI snippet may be used:

# Prompt for credentials and connect to the vCenter instance.
$credential = Get-Credential -Message "Enter your credentials (DOMAIN\ID)"
$server = Connect-VIServer -Server vc1.example.com -Credential $credential

# List datastores where Statistics Collection is disabled, sorted by name.
Get-View -ViewType Datastore |
    Where-Object {$_.IormConfiguration.StatsCollectionEnabled -eq $false} |
    Select-Object -Property Name, @{Label="StatsCollectionEnabled"; Expression={$_.IormConfiguration.StatsCollectionEnabled}} |
    Sort-Object -Property Name

# Close the session when finished.
Disconnect-VIServer $server

Available settings for Storage I/O Control:

Disabled
Statistics enabled but Storage I/O disabled
Statistics and Storage I/O enabled

Stability of this plugin

NOTE: This plugin uses the QueryDatastorePerformanceSummary() method provided by the StorageResourceManager Managed Object. While available since vSphere API 5.1, this API is marked as experimental (and subject to change/removal):

This is an experimental interface that is not intended for use in production code.

In addition to using the experimental QueryDatastorePerformanceSummary() API, this plugin uses the deprecated statsCollectionEnabled property from the StorageIORMInfo Data Object to determine whether Statistics Collection is enabled for a datastore. Using the prescribed enabled property for that Data Object to determine Statistics Collection does not work.

If you use this plugin, please provide feedback by opening a new discussion thread.

Performance Data

Background

Initial support has been added for emitting Performance Data / Metrics, but refinement suggestions are welcome.

Consult the list below for the metrics implemented thus far, the original discussion thread and the Add Performance Data / Metrics support project board for an index of the initial implementation work.

Please add to an existing Discussion thread or open a new one with any feedback that you may have. Thanks in advance!

How datastore performance metrics are evaluated

Performance metrics are provided by vSphere in aggregated quantiles over a period of time (intervals). Aggregated metrics correspond with a specific percentile. As of this plugin's initial development, vSphere provides metrics associated with these percentiles:

90
80
70
60
50

If not otherwise specified, percentile 90 is used to evaluate datastore performance metrics. While the vSphere API provides metrics in multiple intervals (one active, up to seven historical), only the active interval is used for evaluating current datastore performance.

There is a window between the end of the current interval and the start of the new active interval during which no metrics are available for the active interval. Testing shows that this window lasts approximately 30 minutes. By design, the plugin omits performance data latency metrics when no metrics are available, in an attempt to avoid skewing historical data already collected.

This plugin accepts flags to:

specify individual latency metric thresholds (e.g., read latency CRITICAL, read latency WARNING, write latency ...)
specify percentile sets
- multiple sets supported, each composed of a percentile and pairs of CRITICAL and WARNING threshold values

If you specify a percentile set, the plugin will not accept individual latency threshold flags. The reverse is also true: specifying one or more latency threshold flags is incompatible with specifying one or more percentile sets.

By specifying multiple percentile sets, you are indicating that crossing the thresholds of any one set is enough to trigger a state change.
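The "any one set is enough" rule can be sketched as follows for a single metric (read latency). This is not the plugin's implementation: the function names, the sample values, the `>=` comparison and the "highest state wins" rule are all assumptions made for illustration.

```shell
#!/bin/sh
# Sketch only: evaluate one latency value per percentile against its own
# WARNING/CRITICAL pair, then report the worst state across all sets.
# Assumptions: "crossing" means value >= threshold; highest state wins.

# usage: eval_set OBSERVED_MS WARNING_MS CRITICAL_MS  -> prints 0, 1 or 2
eval_set() {
    awk -v v="$1" -v w="$2" -v c="$3" 'BEGIN {
        if (v + 0 >= c + 0)      print 2   # CRITICAL
        else if (v + 0 >= w + 0) print 1   # WARNING
        else                     print 0   # OK
    }'
}

# usage: worst_state STATE...  -> prints the highest state seen
worst_state() {
    worst=0
    for s in "$@"; do
        if [ "$s" -gt "$worst" ]; then
            worst=$s
        fi
    done
    echo "$worst"
}

# Hypothetical read latencies (ms) for two percentiles.
p90_read=12.5
p50_read=9.0

# Hypothetical sets: percentile 90 with W=15/C=30, percentile 50 with W=8/C=20.
s1=$(eval_set "$p90_read" 15 30)
s2=$(eval_set "$p50_read" 8 20)

# The second set crosses its WARNING threshold, so the overall state is 1.
worst_state "$s1" "$s2"
```

Here the 90th percentile is within bounds, but the 50th percentile set alone is enough to move the overall result to WARNING.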

Omitted metrics

This plugin emits Nagios performance data metrics for each percentile in the active interval whose metrics are not all 0. Any percentile whose metrics are all 0 is omitted from the performance data metrics collected and emitted by the plugin.
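As a rough illustration of this filtering rule (not the plugin's code), the sketch below drops any percentile row whose values are all 0. The input layout (percentile followed by five metric values) and the function name `nonzero_percentiles` are hypothetical.

```shell
#!/bin/sh
# Sketch only: print the percentiles that have at least one non-zero metric.
# Assumed input: one line per percentile, "PERCENTILE VALUE VALUE ...".
nonzero_percentiles() {
    awk '{
        keep = 0
        for (i = 2; i <= NF; i++) {
            if ($i + 0 != 0) keep = 1   # any non-zero value keeps the row
        }
        if (keep) print $1
    }'
}

# Hypothetical metric rows; the row for percentile 80 is all zeros.
printf '%s\n' \
    '90 1.2 0.8 1.0 120 45' \
    '80 0 0 0 0 0' \
    '70 0.9 0 0 100 0' | nonzero_percentiles
```

With this sample input, percentiles 90 and 70 are kept and percentile 80 is skipped.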

Please provide feedback by opening a new issue if you find that this decision causes problems with gathering metrics.

See the main project README for details.

Supported metrics

NOTE: These metrics are based on the visibility of the service account used to log in to the target VMware environment. If the service account cannot see a resource, the plugin cannot evaluate that resource.

Metric	Unit of Measurement	Description
`time`	milliseconds	plugin runtime
`vms`		all (visible) virtual machines in the inventory
`vms_powered_on`		virtual machines powered on
`vms_powered_off`		virtual machines powered off
`p*_read_latency`	milliseconds	aggregated datastore latency for read operations
`p*_write_latency`	milliseconds	aggregated datastore latency for write operations
`p*_vm_latency`	milliseconds	aggregated datastore latency as observed by VirtualMachines using the datastore
`p*_read_iops`	reads per second	aggregated datastore read I/O rate
`p*_write_iops`	writes per second	aggregated datastore write I/O rate

NOTE: * is a placeholder for 90, 80, 70, 60 & 50 percentiles.
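Metrics are emitted using the standard Nagios performance data layout (`'label'=value[UOM];warn;crit;min;max` after a `|` separator). A minimal sketch for pulling one metric value out of plugin output follows; the sample output line and the helper name `perfdata_value` are made up for illustration, and the approach assumes labels without spaces.

```shell
#!/bin/sh
# Sketch only: extract the numeric value of one perfdata label.
# usage: perfdata_value LABEL < plugin_output
perfdata_value() {
    sed -n 's/^[^|]*|//p' |   # keep only the text after the first "|"
        tr -d "'" |           # drop the quotes around labels
        tr ' ' '\n' |         # one metric per line
        awk -F'=' -v want="$1" '
            $1 == want {
                split($2, f, ";")            # f[1] is value plus unit
                sub(/[^0-9.]+$/, "", f[1])   # strip the unit suffix (e.g. ms)
                print f[1]
                exit
            }'
}

# Hypothetical output line; real summary text and values will differ.
sample="OK: example summary | 'time'=874ms;;;; 'p90_read_latency'=1.2ms;;;; 'p90_write_latency'=0.8ms;;;;"

printf '%s\n' "$sample" | perfdata_value p90_read_latency
```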

Optional evaluation

Some plugins provide optional support to limit evaluation of VMs to specific Resource Pools (explicitly including or excluding) and power states (on or off). Other plugins support similar filtering options (e.g., Acknowledged state of Triggered Alarms). See the configuration options, examples and contrib sections for more information.

Installation

See the main project README for details.

Configuration options

Threshold calculations

TODO: Research & note why metric sets might contain all values of 0.

Nagios State	Description
`OK`	Ideal state, Datastore performance within bounds for the active interval for the chosen percentile(s).
`UNKNOWN`	Datastore performance metric sets are all value `0` or metrics collection for a datastore is disabled.
`WARNING`	Datastore performance crossed user-specified latency thresholds for this state.
`CRITICAL`	Datastore performance crossed user-specified latency thresholds for this state.
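These states correspond to the standard Nagios plugin exit codes (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN). A small sketch of a helper that a wrapper script might use to turn an exit code back into a state name; the function name `nagios_state_name` is hypothetical and not part of this project.

```shell
#!/bin/sh
# Map a Nagios plugin exit code to its state name.
# The 0-3 mapping is the standard Nagios plugin convention;
# the function itself is illustrative only.
nagios_state_name() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        3) echo "UNKNOWN" ;;
        *) echo "UNKNOWN (unexpected exit code: $1)" ;;
    esac
}

nagios_state_name 2
```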

Command-line arguments

Use the -h or --help flag to display current usage information.
Flags marked as required must be set via CLI flag.
Flags not marked as required are for settings where a useful default is already defined, but may be overridden if desired.

Flag	Required	Default	Repeat	Possible	Description
`branding`	No	`false`	No	`branding`	Toggles emission of branding details with plugin status details. This output is disabled by default.
`h`, `help`	No	`false`	No	`h`, `help`	Show Help text along with the list of supported flags.
`v`, `version`	No	`false`	No	`v`, `version`	Whether to display application version and then immediately exit application.
`ll`, `log-level`	No	`info`	No	`disabled`, `panic`, `fatal`, `error`, `warn`, `info`, `debug`, `trace`	Log message priority filter. Log messages with a lower level are ignored. Log messages are sent to `stderr` by default. See Output for more information.
`p`, `port`	No	`443`	No	positive whole number between 1-65535, inclusive	TCP port of the remote ESXi host or vCenter instance. This is usually 443 (HTTPS).
`t`, `timeout`	No	`10`	No	positive whole number of seconds	Timeout value in seconds allowed before a plugin execution attempt is abandoned and an error returned.
`s`, `server`	Yes		No	fully-qualified domain name or IP Address	The fully-qualified domain name or IP Address of the remote ESXi host or vCenter instance.
`u`, `username`	Yes		No	valid username	Username with permission to access specified ESXi host or vCenter instance.
`pw`, `password`	Yes		No	valid password	Password used to login to ESXi host or vCenter instance.
`domain`	No		No	valid user domain	(Optional) domain for user account used to login to ESXi host or vCenter instance. This is needed for user accounts residing in a non-default domain (e.g., SSO specific domain).
`trust-cert`	No	`false`	No	`true`, `false`	Whether the certificate should be trusted as-is without validation. WARNING: TLS is susceptible to man-in-the-middle attacks if enabling this option.
`dc-name`	No		No	valid vSphere datacenter name	Specifies the name of a vSphere Datacenter. If not specified, applicable plugins will attempt to use the default datacenter found in the vSphere environment. Not applicable to standalone ESXi hosts.
`ds-name`	Yes		No	valid datastore name	Datastore name as it is found within the vSphere inventory.
`dsim`, `ds-ignore-missing-metrics`	No	`false`	No	`true`, `false`	Toggles how missing Datastore Performance metrics will be handled. This is believed to occur when a datastore is newly created and metrics have not yet been collected.
`dshhms`, `ds-hide-historical-metric-sets`	No	`false`	No	`true`, `false`	Toggles display of historical Datastore Performance metrics at plugin completion. By default historical metrics are listed.
`dsrlc`, `ds-read-latency-critical`	No	`30`	No	positive whole number or float	Specifies the read latency of a datastore's storage (in ms) when a `CRITICAL` threshold is reached. The default percentile is used (`90`).
`dsrlw`, `ds-read-latency-warning`	No	`15`	No	positive whole number or float	Specifies the read latency of a datastore's storage (in ms) when a `WARNING` threshold is reached. The default percentile is used (`90`).
`dswlc`, `ds-write-latency-critical`	No	`30`	No	positive whole number or float	Specifies the write latency of a datastore's storage (in ms) when a `CRITICAL` threshold is reached. The default percentile is used (`90`).
`dswlw`, `ds-write-latency-warning`	No	`15`	No	positive whole number or float	Specifies the write latency of a datastore's storage (in ms) when a `WARNING` threshold is reached. The default percentile is used (`90`).
`dsvmlc`, `ds-vm-latency-critical`	No	`30`	No	positive whole number or float	Specifies the latency (in ms) as observed by VMs using the datastore when a `CRITICAL` threshold is reached. The default percentile is used (`90`).
`dsvmlw`, `ds-vm-latency-warning`	No	`15`	No	positive whole number or float	Specifies the latency (in ms) as observed by VMs using the datastore when a `WARNING` threshold is reached. The default percentile is used (`90`).
`dslps`, `ds-latency-percentile-set`	No	`90,15,30,15,30,15,30`	Yes	complete percentile set in `P,RLW,RLC,WLW,WLC,VMLW,VMLC` format	Specifies the performance percentile set used for threshold calculations. Incompatible with individual latency threshold flags. All comma-separated field values are required for each set.
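Because `ds-latency-percentile-set` is repeatable and every field is required, a pre-flight check in a wrapper script can catch malformed sets before the plugin runs. The sketch below is not the plugin's own validation: the helper names are hypothetical, the second example set uses made-up thresholds, and restricting the percentile to 90, 80, 70, 60 or 50 is an assumption based on the percentiles listed earlier.

```shell
#!/bin/sh
# Sketch only: check that a set has the P,RLW,RLC,WLW,WLC,VMLW,VMLC shape.
# usage: valid_percentile_set SET  -> exit status 0 if SET looks well-formed
valid_percentile_set() {
    echo "$1" | awk -F',' '
        NF != 7 { exit 1 }                       # percentile plus six thresholds
        $1 !~ /^(90|80|70|60|50)$/ { exit 1 }    # assumed allowed percentiles
        {
            for (i = 2; i <= NF; i++) {
                if ($i !~ /^[0-9]+(\.[0-9]+)?$/) exit 1   # whole number or float
            }
        }'
}

# usage: percentile_set_flags SET...  -> prints one flag per valid set
percentile_set_flags() {
    for pset in "$@"; do
        if valid_percentile_set "$pset"; then
            printf '%s ' "--ds-latency-percentile-set '$pset'"
        else
            echo "invalid percentile set: $pset" >&2
            return 1
        fi
    done
    echo
}

# First set is the documented default; the second is a hypothetical example.
percentile_set_flags '90,15,30,15,30,15,30' '50,5,10,5,10,5,10'
```

The printed flags can be appended to the plugin command line; each set is passed with its own `--ds-latency-percentile-set` flag.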

Configuration file

Not currently supported. This feature may be added later if there is sufficient interest.

Contrib

See the main project README for details.

Examples

CLI invocation

/usr/lib/nagios/plugins/check_vmware_datastore_performance --server vc1.example.com --username SERVICE_ACCOUNT_NAME --password "SERVICE_ACCOUNT_PASSWORD" --ds-latency-percentile-set '90,15,30,15,30,15,30' --ds-name "HUSVM-DC1-vol6" --trust-cert  --log-level info

See the configuration options section for all command-line settings supported by this plugin along with descriptions of each. See the contrib section for information regarding example command definitions and Nagios configuration files.

Of note:

We use a datastore performance percentile set instead of individual latency flags
- 90th percentile
- read latency WARNING threshold of 15 ms
- read latency CRITICAL threshold of 30 ms
- write latency WARNING threshold of 15 ms
- write latency CRITICAL threshold of 30 ms
- vm latency WARNING threshold of 15 ms
- vm latency CRITICAL threshold of 30 ms
Due to plugin design, only the active interval is evaluated for threshold violations
- historical interval metrics are reported via LongServiceOutput unless the flag to skip emitting those metrics is specified
Certificate warnings are ignored.
- not best practice, but many vCenter instances use self-signed certs per various freely available guides
Service Check results output is sent to stdout
Logging output is enabled at the info level.
- logging output is sent to stderr by default
- logging output is intended to be seen when invoking the plugin directly via CLI (often for troubleshooting)
  - see the Output section of the main README for potential conflicts with some monitoring systems
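Since service check results go to stdout and logging goes to stderr, the two streams can be captured separately when troubleshooting from a shell. The sketch below uses a stand-in function, `fake_plugin`, with made-up output so that it runs anywhere; substitute the real plugin invocation in practice.

```shell
#!/bin/sh
# Stand-in for the real plugin: writes a made-up service check result to
# stdout and a made-up log line to stderr. Purely illustrative.
fake_plugin() {
    echo "OK: hypothetical one-line summary"
    echo "hypothetical log message" >&2
    return 0
}

log_file=$(mktemp)

# Capture stdout (the service check result) while sending stderr
# (logging output) to a file for later review.
result=$(fake_plugin 2>"$log_file")
status=$?

echo "result: $result"
echo "exit:   $status"
echo "log:    $(cat "$log_file")"
rm -f "$log_file"
```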

Command definition

# /etc/nagios-plugins/config/vmware-datastores-performance.cfg

# Look at specific datastore and explicitly provide custom WARNING and
# CRITICAL latency threshold values via individual flags.
define command{
    command_name    check_vmware_datastore_performance_via_individual_flags
    command_line    $USER1$/check_vmware_datastore_performance --server '$HOSTNAME$' --domain '$ARG1$' --username '$ARG2$' --password '$ARG3$' --ds-read-latency-warning '$ARG4$' --ds-read-latency-critical '$ARG5$' --ds-write-latency-warning '$ARG6$' --ds-write-latency-critical '$ARG7$' --ds-vm-latency-warning '$ARG8$' --ds-vm-latency-critical '$ARG9$' --ds-name '$ARG10$' --trust-cert  --log-level info
    }

# Look at specific datastore and explicitly provide custom WARNING and
# CRITICAL latency threshold values for a single percentile via a percentile
# flag set.
define command{
    command_name    check_vmware_datastore_performance_via_1percentile_set
    command_line    $USER1$/check_vmware_datastore_performance --server '$HOSTNAME$' --domain '$ARG1$' --username '$ARG2$' --password '$ARG3$' --ds-latency-percentile-set '$ARG4$' --ds-name '$ARG5$' --trust-cert  --log-level info
    }

See the configuration options section for all command-line settings supported by this plugin along with descriptions of each. See the contrib section for information regarding example command definitions and Nagios configuration files.

Troubleshooting

Datastore storage I/O statistics collection disabled

If you see an error message like this one:

UNKNOWN: Unable to retrieve performance summary for datastore "DATASTORE_NAME_HERE": datastore storage I/O statistics collection disabled

**ERRORS**

* datastore storage I/O statistics collection disabled: assistance needed from vmware administrators to resolve issue

then it means that the required Statistics Collection setting for the specified datastore is not enabled. See the Requirements section of this documentation for more information.
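When running the plugin by hand or from a wrapper, this condition can be spotted by searching the output for the message shown above. A minimal sketch, assuming the message text stays as documented here (matching on message text is fragile); the helper name `stats_collection_disabled` is hypothetical.

```shell
#!/bin/sh
# Sketch only: succeed if the plugin output on stdin reports that
# Statistics Collection is disabled for the datastore.
stats_collection_disabled() {
    grep -q 'datastore storage I/O statistics collection disabled'
}

# Sample taken from the error message documented in this section.
sample='UNKNOWN: Unable to retrieve performance summary for datastore "DATASTORE_NAME_HERE": datastore storage I/O statistics collection disabled'

if printf '%s\n' "$sample" | stats_collection_disabled; then
    echo "Statistics Collection is disabled; ask a vSphere administrator to enable it"
fi
```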

License

See the main project README for details.

References

Main project README
Documentation index
Project repo
