Skip to content

Latest commit

 

History

History
109 lines (90 loc) · 2.23 KB

monitoring.md

File metadata and controls

109 lines (90 loc) · 2.23 KB

Glitch Gremlin Monitoring

Overview

Comprehensive monitoring system for tracking chaos test execution, system health, and performance metrics.

Metrics Collected

System Health

  • CPU/Memory usage of AI engine workers
  • Queue depth and processing latency
  • API response times
  • Error rates and types

Test Execution

  • Active test count
  • Test completion rate
  • Average test duration
  • Success/failure ratios

Token Economics

  • Transaction volume
  • Fee collection rate
  • Stake amounts and duration
  • Governance participation rate

Implementation

Prometheus Integration

import { Registry, Counter, Gauge } from 'prom-client';

// Initialize metrics
const registry = new Registry();
const activeTests = new Gauge({
    name: 'glitch_active_tests',
    help: 'Number of currently running chaos tests'
});
const testCompletions = new Counter({
    name: 'glitch_test_completions_total',
    help: 'Total number of completed tests'
});

Grafana Dashboards

  • System Overview
  • Test Execution Metrics
  • Token Economics
  • Governance Activity

Alerting Rules

groups:
  - name: glitch-alerts
    rules:
      - alert: HighErrorRate
        expr: rate(glitch_errors_total[5m]) > 0.1
        for: 2m
        labels:
          severity: warning
      - alert: QueueBacklog
        expr: glitch_queue_depth > 100
        for: 5m
        labels:
          severity: critical

Setup Instructions

  1. Install Dependencies:
npm install prom-client winston @opentelemetry/api
  1. Configure Prometheus:
scrape_configs:
  - job_name: 'glitch-gremlin'
    static_configs:
      - targets: ['localhost:9090']
  1. Start Monitoring:
import { startMetricsServer } from './monitoring';
await startMetricsServer(9090);

Best Practices

  1. Regular Metric Review
  • Monitor error rates daily
  • Review performance weekly
  • Analyze token metrics monthly
  1. Alert Configuration
  • Set appropriate thresholds
  • Avoid alert fatigue
  • Document response procedures
  1. Dashboard Organization
  • Group related metrics
  • Use clear labels
  • Include helpful descriptions

Completed Monitoring Setup

  • Prometheus instance configured
  • Grafana dashboards created
  • Basic alerting implemented
  • Custom metrics added