Monitoring in RDSS

Where to start

Look here first: RDSS Portfolio in Checkmk staging

Goals of monitoring:

  1. We should know about a problem with the systems we manage before our users do.
  2. During emergency production outages, we should have information to help us resolve the issue quickly.
  3. During development, we should have performance telemetry telling us whether the systems we are building will scale and perform as needed.

Where the Alerts Go

All alerts should go to the #rdss-alerts channel in Slack.

The Monitoring Systems

  1. Checkmk - Not yet in production, but on its way. It should become our primary monitoring system, and the Honeybadger and DataDog uptime alerts should move here.
  2. Honeybadger - for capturing exceptions in a running application. This makes it easier for us to fix bugs, because it gives us a stack trace and lots of context about how the exception occurred. We sometimes use Honeybadger uptime monitoring too, but it is very simple (just a binary Up or Down) and only works for systems that are available on the open Internet. New staff need to be added to Honeybadger manually. Each team member should also subscribe to uptime notifications via email or SMS.
  3. DataDog - for Application Performance Monitoring (APM). This service can be expensive, so we only turn it on as needed, and we follow practices to keep costs down. However, it can be invaluable for diagnosing performance issues.
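To make the "binary Up or Down" idea concrete, here is a minimal Python sketch of such a check. It is an illustration only, not how Honeybadger, Checkmk, or DataDog implement uptime monitoring. The function names (`http_status`, `is_up`) and the rule that any 2xx or 3xx response counts as Up are assumptions made for this example.

```python
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url, or 0 if the host is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (4xx or 5xx).
        return err.code
    except OSError:
        # Covers URLError, DNS failures, refused connections, and timeouts.
        return 0


def is_up(url: str, fetch_status: Callable[[str], int] = http_status) -> bool:
    """Binary check (assumed rule): Up means a 2xx or 3xx response, else Down."""
    return 200 <= fetch_status(url) < 400
```

A check like this can only reach URLs that are routable from wherever it runs, which is the same limitation noted above for systems that are not on the open Internet. It also says nothing about why a system is down or how slowly it is responding.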

Monitoring by Application Group

Princeton Data Commons (PDC*)

The suite of applications that makes up the Princeton Data Commons includes (so far):