Monitor Critical GitHub Actions Workflows Across Organization Repositories #4941

Closed
rishabh6788 opened this issue Aug 13, 2024 · 9 comments
Labels
enhancement New Enhancement

@rishabh6788
Collaborator

rishabh6788 commented Aug 13, 2024

Is your feature request related to a problem? Please describe

Background

We recently had a situation where the GitHub Actions workflow that publishes snapshots to Maven started failing across all repositories because of an issue on the Sonatype Central side. User tokens had been accidentally deleted during maintenance, and our jobs started failing with 401 errors.
An operator happened to check the failed workflow on a commit they had merged and noticed the snapshot workflow failure. Further investigation showed that the same workflow had been failing across all repositories with the same error for the past 24 hours.
We need to implement a system to monitor critical GitHub Actions workflows across multiple repositories in our organization. This will help us quickly identify and respond to workflow failures or issues.

Describe the solution you'd like

Proposed Solutions

We have identified two broad categories of approaches: pull-based and push-based monitoring.

1. Pull-based Monitoring

Description

a) Onboard GitHub Actions workflow metrics onto the existing metrics framework (Recommended)

  • Use the existing metrics framework to onboard GitHub Actions metrics
  • Add a monitor on the failure metric and notify in a Slack channel (already implemented)

b) Use GitHub REST APIs to periodically fetch the GitHub Actions status

  • Index the collected data in an OpenSearch cluster
  • Implement a pull job that runs on a cron schedule in Jenkins

Advantages

  • Centralized monitoring solution
  • Can provide historical data and trends
  • Allows for custom alerting based on various criteria

Challenges

  • May have a slight delay in detecting issues due to the polling interval
  • Need to manage API rate limits
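As a rough illustration of option 1b, one polling pass might look like the sketch below. It assumes the "list workflow runs for a repository" REST endpoint with its `status` and `created` query parameters, and it reads the `X-RateLimit-Remaining` response header so the caller can back off. The function names and the workflow name in the usage note are hypothetical; indexing into OpenSearch and the Jenkins cron schedule are left out.

```python
import json
import urllib.parse
import urllib.request

API_ROOT = "https://api.github.com"


def failed_runs_url(owner, repo, created_since, per_page=100):
    """Build the 'list workflow runs' URL, limited to failed runs created since a timestamp."""
    query = urllib.parse.urlencode(
        {"status": "failure", "created": f">={created_since}", "per_page": per_page}
    )
    return f"{API_ROOT}/repos/{owner}/{repo}/actions/runs?{query}"


def summarize_failures(payload, critical_workflows):
    """Keep only failed runs that belong to workflows we treat as critical."""
    return [
        {"id": run["id"], "name": run["name"], "html_url": run["html_url"]}
        for run in payload.get("workflow_runs", [])
        if run.get("conclusion") == "failure" and run.get("name") in critical_workflows
    ]


def fetch_failures(owner, repo, created_since, critical_workflows, token=None):
    """Perform one poll and return (failures, remaining_rate_limit)."""
    request = urllib.request.Request(failed_runs_url(owner, repo, created_since))
    request.add_header("Accept", "application/vnd.github+json")
    if token:
        request.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(request, timeout=30) as response:
        # The caller can pause polling when this value gets close to zero.
        remaining = int(response.headers.get("X-RateLimit-Remaining", "0"))
        payload = json.load(response)
    return summarize_failures(payload, critical_workflows), remaining
```

A scheduled job could call `fetch_failures` once per repository, passing the timestamp of the previous poll as `created_since` so that each pass only looks at new runs.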

2. Push-based Monitoring

Description

a) Slack Notifications Integration in Workflows

  • Add a Slack action to critical workflows
  • Configure the action to send a Slack message notification when a job fails

b) Email Notifications

  • Use GitHub's built-in email notification system or a custom email action
  • Send detailed email reports for workflow failures

c) Webhook Integration

  • Set up a custom webhook endpoint in our infrastructure
  • Configure GitHub to send workflow status updates to this endpoint
  • Process incoming webhooks to trigger appropriate actions (e.g., update a status page, send notifications)
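The webhook option could be sketched roughly as follows. This assumes the endpoint is subscribed to the `workflow_run` event and that a webhook secret is configured, so deliveries carry an `X-Hub-Signature-256` header. The function names are hypothetical, and the HTTP server and the follow-up actions (status page, notifications) are omitted.

```python
import hashlib
import hmac
import json


def signature_is_valid(secret, body, signature_header):
    """Check the HMAC-SHA256 signature GitHub sends in X-Hub-Signature-256."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_header or "")


def failure_from_event(event_name, body):
    """Return a small failure record for a failed, completed workflow run, else None."""
    if event_name != "workflow_run":
        return None
    event = json.loads(body)
    run = event.get("workflow_run", {})
    if event.get("action") != "completed" or run.get("conclusion") != "failure":
        return None
    return {
        "repository": event.get("repository", {}).get("full_name"),
        "workflow": run.get("name"),
        "html_url": run.get("html_url"),
    }
```

The handler would first reject any delivery whose signature does not validate, then pass the value of the `X-GitHub-Event` header and the raw body to `failure_from_event`.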

Advantages

  • Real-time notifications
  • Simple to set up and maintain
  • No additional infrastructure required

Challenges

  • No centralized data storage for historical analysis
  • Requires updating each workflow file individually
  • Notification storm/fatigue when the same failure occurs across all repos
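One way to soften the notification storm with push-based alerts is to aggregate failures before posting. The sketch below is only an illustration: it groups failure records by workflow name and produces one message per workflow in the `{"text": ...}` shape that Slack incoming webhooks accept. The record fields (`workflow`, `repository`) and the function name are assumptions, and actually posting the payload is left out.

```python
from collections import defaultdict


def build_slack_messages(failures):
    """Group failures by workflow and emit one Slack payload per workflow."""
    repos_by_workflow = defaultdict(set)
    for failure in failures:
        repos_by_workflow[failure["workflow"]].add(failure["repository"])

    messages = []
    for workflow, repos in sorted(repos_by_workflow.items()):
        repo_list = ", ".join(sorted(repos))
        messages.append(
            {"text": f"Workflow '{workflow}' is failing in {len(repos)} repo(s): {repo_list}"}
        )
    return messages
```

With this grouping, an outage like the snapshot publishing failure described in the background would produce a single message listing every affected repository instead of one message per repository.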

Next Steps

  1. Discuss and decide on the preferred approach (pull-based, push-based, or a combination)
  2. Create a detailed implementation plan for the chosen approach(es)
  3. Assign team members to various tasks
  4. Set up a timeline for implementation and testing
  5. Plan for gradual rollout and monitoring of the new system

Questions to Consider

  • What defines a "critical" workflow in our organization?
  • How quickly do we need to be notified of issues?
  • Do we need historical data for analysis, or are real-time alerts sufficient?
  • Who should receive notifications, and how should they be prioritized?
  • How will we handle false positives or transient failures?

Please comment with your thoughts, preferences, or any additional considerations for this monitoring system.

Describe alternatives you've considered

No response

Additional context

No response

@rishabh6788 rishabh6788 added enhancement New Enhancement untriaged Issues that have not yet been triaged labels Aug 13, 2024
@rishabh6788
Collaborator Author

Tagging @peterzhuamazon @gaiksaya @getsaurabh02 @prudhvigodithi @dblock for feedback and way forward.

@prudhvigodithi
Member

Thanks @rishabh6788, this is an important enhancement. With the gathered GitHub Actions workflow data we can even produce a summary of force-merged pull requests, which is an important metric for OpenSearch repo health. @getsaurabh02 @dblock

I would vote for the 1st option: collect the incremental PR workflows, index the data, and create a monitoring tool on top of the indexed raw data. With option 2, even if we created a custom GitHub action for this purpose, it would be tough to update the hundreds of workflow files across all the repos, and ensuring that the action exists in every new repo is a tedious job.
If we go with solution 1, running the job more frequently and monitoring only the incremental PR workflows would reduce the delay in detecting issues.

Thank you

@peterzhuamazon
Member

peterzhuamazon commented Aug 13, 2024

I am also in favor of pull-based monitoring, with a careful choice of the data sources we want to monitor. However, there will still be gaps, since certain actions only run about once a month during the release phase.

We need to figure out a consistent way to dry-run these actions in order to detect issues beforehand.

Thanks.

@gaiksaya gaiksaya removed the untriaged Issues that have not yet been triaged label Aug 15, 2024
@prudhvigodithi
Member

prudhvigodithi commented Aug 22, 2024

Going with option 1 we can do the following:

[Image: Screenshot 2024-08-21 at 6 53 33 PM]

Example https://api.github.com/repos/opensearch-project/query-insights/commits/1f4c4c635d6704e637004e9f363735461db21c2d/check-runs

  • The check-runs response gives all the information about the CI runs for that commit (coming from a PR). Index the relevant fields such as name, status, and conclusion.

  • Build the monitoring tool around the indexed data by running a query on the cluster to find the runs with "conclusion": "failure". We can even target specific runs, for example "name": "build-and-publish-snapshots", that have a failure conclusion.

  • We can even use this information to derive a new metric (force-merged PRs and their trend) by finding the PRs that were merged with failing CI checks.
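As a hedged sketch of the first two points, the snippet below filters a check-runs API response for failed runs and builds an equivalent query body for the metrics cluster. It assumes the indexed `conclusion` and `name` fields are mapped as keywords so that `term` filters match exactly; the function names are hypothetical and no cluster client is shown.

```python
def failed_check_runs(payload, name=None):
    """Pick failed runs out of a check-runs API response, optionally by run name."""
    return [
        {"name": run["name"], "status": run["status"], "conclusion": run["conclusion"]}
        for run in payload.get("check_runs", [])
        if run.get("conclusion") == "failure" and (name is None or run.get("name") == name)
    ]


def failure_query(name=None):
    """Build a query body that matches indexed runs with a failure conclusion."""
    filters = [{"term": {"conclusion": "failure"}}]
    if name is not None:
        # Narrow the search to one specific run name.
        filters.append({"term": {"name": name}})
    return {"query": {"bool": {"filter": filters}}}
```

For example, `failure_query("build-and-publish-snapshots")` yields a body that could be sent to the cluster's search API to list only failed snapshot publishing runs.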

@getsaurabh02 @dblock @rishabh6788 @peterzhuamazon @gaiksaya

@prudhvigodithi
Member

prudhvigodithi commented Sep 6, 2024

The following is a sample schema that can be indexed to the metrics cluster.

{
  id: <The id of the workflow run, given directly in the check-runs API response; can be used as the document ID>
  repository: <The repo name>
  organization: <Optional: the repo org>
  number: <PR number for which the workflow was triggered>
  pull_commit: <The head commit of the PR for which the workflow was triggered, inferred from the pull API>
  merged: <The current state of the PR, true/false if merged, inferred from the pull API>
  commit_id: <The commit ID of the PR for which the workflow was triggered, inferred from the pull API>
  html_url: <The html_url of the workflow run, given directly in the check-runs API response>
  url: <The url of the workflow run, given directly in the check-runs API response>
  name: <The name of the workflow run, given directly in the check-runs API response>
  conclusion: <The result of the workflow run, given directly in the check-runs API response>
  started_at: <The started timestamp of the workflow run, given directly in the check-runs API response>
  completed_at: <The completed timestamp of the workflow run, given directly in the check-runs API response>
}
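To make the schema concrete, here is a minimal sketch that assembles one document from a check-run object and a pull request object as returned by the GitHub REST API. The mapping of `pull_commit` to the PR's `head.sha` and of `commit_id` to the PR's `merge_commit_sha` is an assumption, since the schema only says both come from the pull API; the function name is hypothetical.

```python
def build_document(check_run, pull, repository, organization=None):
    """Combine a check-run and its pull request into one indexable document."""
    return {
        "id": check_run["id"],
        "repository": repository,
        "organization": organization,
        "number": pull["number"],
        # Assumption: pull_commit is the PR head commit.
        "pull_commit": pull["head"]["sha"],
        "merged": bool(pull.get("merged", False)),
        # Assumption: commit_id is the merge commit reported by the pull API.
        "commit_id": pull.get("merge_commit_sha"),
        "html_url": check_run["html_url"],
        "url": check_run["url"],
        "name": check_run["name"],
        "conclusion": check_run["conclusion"],
        "started_at": check_run["started_at"],
        "completed_at": check_run["completed_at"],
    }
```

Using the check-run `id` as the document ID means that re-indexing the same run overwrites the earlier document instead of duplicating it.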

Once we have the above information:

  • We should be able to monitor the desired workflows.
  • Create visualizations and trend graphs of repos with failing CI workflows, with the ability to filter per repo.
  • Monitor and create visualizations of repos where PRs are merged without passing CI.
  • Create issues that directly include the PR and workflow run information and URLs.
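The force-merge check in the list above could be sketched as follows, assuming documents shaped like the sample schema (`repository`, `number`, `merged`, `conclusion`). The function name is hypothetical, and in practice this would more likely be an aggregation on the cluster than a loop in client code.

```python
def force_merged_prs(documents):
    """Return sorted (repository, PR number) pairs that merged with a failed run."""
    flagged = {
        (doc["repository"], doc["number"])
        for doc in documents
        if doc.get("merged") and doc.get("conclusion") == "failure"
    }
    return sorted(flagged)
```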

Thank you
@rishabh6788 @getsaurabh02

@peterzhuamazon peterzhuamazon moved this from Planned (Next Quarter) to 🏗 In progress in Engineering Effectiveness Board Sep 23, 2024
@prudhvigodithi
Member

prudhvigodithi commented Sep 24, 2024

Did a deeper dive into the possible repo workflows.

@peterzhuamazon
Member

Synced up with Prudhvi today and confirmed that the automation app is able to grab all the necessary context for the requirements.

We will see if we can combine the automation app and metrics cluster together on this.

Thanks.

@prudhvigodithi
Member

Here are the final flow details, implemented through the merged pull requests linked to this issue.

graph LR
    A[GitHub Workflow Events] --> B[GitHub Automation App]
    B --> C[Failure Detection]
    C --> D[Workflow Failure Identified]
    D --> E[CloudWatch Alarms Update]
    D --> F[Failures Indexed]
    E --> I{Alarm Triggered?}
    I -- Yes --> G[Alerts Sent to Teams]
    I -- No --> J[No Action]
    F --> H[Data for Debugging and Trend Analysis]
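The flow above might be approximated by the sketch below. It is not the automation app's actual code: it assumes the app receives `workflow_run` event payloads, uses plain lists as stand-ins for the index and the alarm metric stream, and shapes the metric as keyword arguments for CloudWatch `put_metric_data` without sending them. The namespace, metric name, and function names are hypothetical.

```python
def metric_for_failure(failure, namespace="GitHubWorkflows"):
    """Build put_metric_data keyword arguments for one failure (nothing is sent here)."""
    return {
        "Namespace": namespace,
        "MetricData": [
            {
                "MetricName": "WorkflowFailure",
                "Dimensions": [{"Name": "Workflow", "Value": failure["workflow"]}],
                "Value": 1.0,
                "Unit": "Count",
            }
        ],
    }


def handle_workflow_event(event, index_sink, metric_sink):
    """Detect a failed run, then record it for alarms and for trend analysis."""
    run = event.get("workflow_run", {})
    if run.get("conclusion") != "failure":
        return False  # Nothing to do for successful or unfinished runs.
    failure = {
        "repository": event.get("repository", {}).get("full_name"),
        "workflow": run.get("name"),
        "html_url": run.get("html_url"),
    }
    metric_sink.append(metric_for_failure(failure))  # Feeds the alarm branch.
    index_sink.append(failure)  # Feeds debugging and trend analysis.
    return True
```

Whether an alert is sent is then decided by the alarm's own threshold on the metric, which matches the "Alarm Triggered?" decision in the diagram.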

@prudhvigodithi
Member

Closing this issue.
@rishabh6788 @getsaurabh02

Projects
Status: ✅ Done
Development

No branches or pull requests

4 participants