Look here first: RDSS Portfolio in Checkmk staging
- We should know before our users do when there is a problem with systems we manage.
- During emergency production outages, we should have information to help us resolve the issue quickly.
- During development, we should have performance telemetry telling us whether the systems we are building will scale and perform as needed.
All alerts should go to the #rdss-alerts channel in Slack.
- Checkmk - Not yet in production, but on its way. This should become our primary monitoring system, and Honeybadger and DataDog uptime alerts should move here.
- Honeybadger - for capturing exceptions in a running application. This makes it easier for us to fix bugs, because it gives us a stack trace and lots of context about how the exception occurred. We sometimes use Honeybadger uptime monitoring too, but it is very simple (just a binary up or down) and only works for systems that are available on the open Internet. New staff need to be added to Honeybadger manually. It is also advisable that each team member subscribe to uptime notifications via email or SMS.
- DataDog - for Application Performance Monitoring (APM). This service can be expensive, so we only turn it on as needed, and we follow practices to keep costs down. However, it can be invaluable for diagnosing performance issues.
The suite of applications that makes up the Princeton Data Commons includes (so far):
- PDC Discovery
- Our current setup checks the application once a minute and reports after two failures (i.e., we get notified after two minutes). See PDC Discovery Monitoring.
- PDC Describe
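
The PDC Discovery check cadence described above (one check per minute, alert after two failures) can be sketched as a small piece of logic. This is only an illustration, not the actual Checkmk or Honeybadger configuration: the function names are hypothetical, and it assumes the two failures must be consecutive, which the text does not state explicitly.

```python
from typing import Iterable, Optional

CHECK_INTERVAL_SECONDS = 60  # one check per minute, per the text above
FAILURES_BEFORE_ALERT = 2    # report after two failures, per the text above


def first_alert_index(
    results: Iterable[bool],
    threshold: int = FAILURES_BEFORE_ALERT,
) -> Optional[int]:
    """Return the 0-based index of the check that triggers an alert.

    `results` is a sequence of check outcomes (True = up, False = down).
    Assumption: failures must be consecutive; a success resets the count.
    Returns None if the threshold is never reached.
    """
    consecutive_failures = 0
    for index, is_up in enumerate(results):
        if is_up:
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= threshold:
                return index
    return None


def worst_case_alert_delay_seconds(
    threshold: int = FAILURES_BEFORE_ALERT,
    interval: int = CHECK_INTERVAL_SECONDS,
) -> int:
    """Upper bound on the time from outage start to notification.

    Assumes the outage begins right after a successful check, so the
    first failed check happens one full interval later.
    """
    return threshold * interval
```

With the default values, `worst_case_alert_delay_seconds()` gives 120 seconds, which matches the "notified after two minutes" figure above. A single failed check followed by a success does not alert under the consecutive-failure assumption.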