In December, we experienced no incidents resulting in service downtime. This month’s GitHub Availability Report will provide a summary and follow-up details on how we addressed an incident mentioned in November’s report.
Upon further investigation around one of the incidents mentioned in November’s Availability Report, we discovered an edge case that triggered a large number of GitHub App token requests. This caused abnormal levels of replication lag within one of our MySQL clusters, specifically affecting the GitHub Actions service. This particular scenario resulted in amplified queries and increased the database lag, which impacted the database nodes that process GitHub App token requests.
When a GitHub Action is invoked, the Action is passed a GitHub App token to perform tasks on GitHub. In this case, the database lag resulted in the failure of some of those token requests because the database replicas did not have up to date information.
To help avoid this class of failure, we are updating the queries to prevent large quantities of token requests from overloading the database servers in the future.
Whether we’re introducing a system to manage flaky tests or improving our CI workflow, we’ve continued to invest in our engineering systems and overall reliability. To learn more about what we’re working on, visit GitHub’s engineering blog.