Comprehensive monitoring system for tracking chaos test execution, system health, and performance metrics. Tracked metrics fall into three groups:
- System health
  - CPU/Memory usage of AI engine workers
  - Queue depth and processing latency
  - API response times
  - Error rates and types
- Test execution
  - Active test count
  - Test completion rate
  - Average test duration
  - Success/failure ratios
- Token economics and governance
  - Transaction volume
  - Fee collection rate
  - Stake amounts and duration
  - Governance participation rate
The snippet below registers the core metrics with prom-client against a dedicated registry:

```typescript
import { Registry, Counter, Gauge } from 'prom-client';

// Dedicated registry so Glitch Gremlin metrics can be exposed separately
// from any default Node.js metrics
const registry = new Registry();

// Number of chaos tests currently running
const activeTests = new Gauge({
  name: 'glitch_active_tests',
  help: 'Number of currently running chaos tests',
  registers: [registry]
});

// Monotonic count of tests that have finished
const testCompletions = new Counter({
  name: 'glitch_test_completions_total',
  help: 'Total number of completed tests',
  registers: [registry]
});
```
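The alert rules later in this section reference `glitch_errors_total` and `glitch_queue_depth`, which are not registered in the snippet above. A minimal sketch of how they, plus an API-latency histogram, could be added to the same registry; the `type` label, the `glitch_api_response_seconds` name, and the bucket boundaries are assumptions, not part of the original setup:

```typescript
import { Counter, Gauge, Histogram } from 'prom-client';
// `registry` is the Registry created in the snippet above

// Error counter used by the HighErrorRate alert; the `type` label is an assumption
const errors = new Counter({
  name: 'glitch_errors_total',
  help: 'Total number of errors, labelled by error type',
  labelNames: ['type'],
  registers: [registry]
});

// Queue depth gauge used by the QueueBacklog alert
const queueDepth = new Gauge({
  name: 'glitch_queue_depth',
  help: 'Number of chaos test requests waiting in the queue',
  registers: [registry]
});

// API response time histogram; bucket boundaries (in seconds) are illustrative
const apiLatency = new Histogram({
  name: 'glitch_api_response_seconds',
  help: 'API response time in seconds',
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
  registers: [registry]
});
```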
Grafana dashboards are organized into four views:

- System Overview
- Test Execution Metrics
- Token Economics
- Governance Activity
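These dashboards can be loaded through Grafana's file-based dashboard provisioning. A minimal sketch; the provider name, folder, and paths are assumptions to adapt to your deployment:

```yaml
# grafana/provisioning/dashboards/glitch.yml (assumed location)
apiVersion: 1
providers:
  - name: glitch-gremlin
    folder: Glitch Gremlin
    type: file
    options:
      path: /var/lib/grafana/dashboards/glitch
```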
Prometheus alerting rules cover the most important failure modes (e.g. saved as `glitch-alerts.yml`):

```yaml
groups:
  - name: glitch-alerts
    rules:
      - alert: HighErrorRate
        expr: rate(glitch_errors_total[5m]) > 0.1
        for: 2m
        labels:
          severity: warning
      - alert: QueueBacklog
        expr: glitch_queue_depth > 100
        for: 5m
        labels:
          severity: critical
```
- Install Dependencies:

  ```bash
  npm install prom-client winston @opentelemetry/api
  ```
- Configure Prometheus:

  ```yaml
  scrape_configs:
    - job_name: 'glitch-gremlin'
      static_configs:
        - targets: ['localhost:9090']
  ```
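  If the alert rules above live in their own file, they also need to be referenced from `prometheus.yml`. A minimal sketch, assuming the rules are saved as `glitch-alerts.yml` and an Alertmanager runs on its default port:

  ```yaml
  # Assumed file name and Alertmanager target – adjust to your deployment
  rule_files:
    - glitch-alerts.yml

  alerting:
    alertmanagers:
      - static_configs:
          - targets: ['localhost:9093']
  ```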
- Start Monitoring:

  ```typescript
  import { startMetricsServer } from './monitoring';

  await startMetricsServer(9090);
  ```
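`startMetricsServer` itself is not shown in this section; a minimal sketch of what it could look like, assuming the shared registry from the metrics snippet above is exported by a `./metrics` module (that module name is an assumption):

```typescript
import http from 'http';
import { registry } from './metrics'; // hypothetical module exporting the shared Registry

export async function startMetricsServer(port: number): Promise<void> {
  const server = http.createServer(async (req, res) => {
    if (req.url === '/metrics') {
      // Serve all registered metrics in the Prometheus text exposition format
      res.setHeader('Content-Type', registry.contentType);
      res.end(await registry.metrics());
    } else {
      res.statusCode = 404;
      res.end();
    }
  });
  await new Promise<void>((resolve) => server.listen(port, () => resolve()));
}
```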
- Regular Metric Review (see the recording-rule sketch after this list)
  - Monitor error rates daily
  - Review performance weekly
  - Analyze token metrics monthly
- Alert Configuration
  - Set appropriate thresholds
  - Avoid alert fatigue
  - Document response procedures
- Dashboard Organization
  - Group related metrics
  - Use clear labels
  - Include helpful descriptions
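For the periodic reviews, Prometheus recording rules can precompute the relevant rates so dashboards and ad-hoc queries stay cheap. A minimal sketch using the metric names defined earlier; the recorded series names and time windows are assumptions:

```yaml
groups:
  - name: glitch-review
    rules:
      # 5-minute error rate for the daily error review
      - record: job:glitch_errors:rate5m
        expr: rate(glitch_errors_total[5m])
      # Hourly test completion rate for the weekly performance review
      - record: job:glitch_test_completions:rate1h
        expr: rate(glitch_test_completions_total[1h])
```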
Monitoring setup checklist:

- Prometheus instance configured
- Grafana dashboards created
- Basic alerting implemented
- Custom metrics added