Closed Bug 1972745 Opened 3 months ago Closed 2 months ago

Let's Encrypt: Deployed Unreviewed Boulder Code

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jsha, Assigned: jsha)

Details

(Whiteboard: [ca-compliance] [policy-failure])

Full Incident Report

Summary

One of our core developers accidentally tagged a release of our CA software, Boulder, based on a commit that had not been reviewed and merged into the main branch. That release was deployed. After detection, the Boulder team immediately reviewed the commit and confirmed that it only removed dead code and did not cause incorrect operation of the CA.

  • CA Owner CCADB unique ID: A000320
  • Incident description: Unreviewed commit by core developer tagged in release.
  • Timeline summary:
    • Non-compliance start date: 2025-06-09 18:53 UTC
    • Non-compliance identified date: 2025-06-16 19:46 UTC
    • Non-compliance end date: 2025-06-16 23:01 UTC
  • Relevant policies: Boulder code review requirements ("All pull requests must receive at least one approval by a CODEOWNER other than the author"); ISRG Combined CP/CPS, System Development Controls ("When ISRG develops software to be used in CA operations, software development policies are put into place and methodologies are followed in order to ensure software quality and integrity. This always includes a requirement for peer review of code changes.")
  • Source of incident disclosure: Third Party Reported

Impact

  • Total number of certificates: 0
  • Total number of “remaining valid” certificates: 0
  • Affected certificate types: N/A
  • Incident heuristic: N/A
  • Was issuance stopped in response to this incident, and why or why not? No. After discovery, the developer team reviewed the relevant commit and concluded it could not cause incorrect CA operation.
  • Analysis: N/A
  • Additional considerations: Code review is a core part of how we maintain correctness and security in our systems. While the practical impact in this case was 0 certificates, it could have been larger.

Timeline

All times are UTC.

2025-06-09

  • 18:53 release-2025-06-09 tagged and signed locally on the same developer’s machine, accidentally containing commit 63310a3 instead of the current HEAD of the main branch
  • 18:53 release-2025-06-09 tag pushed to the boulder repo, uploading commit 63310a3 for the first time and triggering the “Build release” workflow (Incident begins)
  • 18:55 release workflow completes successfully, adding the resulting artifacts to the tagged release
  • 20:56 IN-11435 created to track deployment of release-2025-06-09
  • 21:53 PR #8234 created, containing commit 63310a3

2025-06-12

  • 19:42 release-2025-06-09 deployed to Prod

2025-06-16

  • 19:46 external developer files a bug noting that commit 63310a3 is not reachable from main nor from the recently-tagged release-2025-06-16 (Incident detected)
  • 19:56 Let’s Encrypt engineers begin incident response
  • 20:10 proximate cause of the incorrect tag identified (see below)
  • 21:40 responders select a plan for immediate remediation (see below)
  • 21:45 new release-branch-2025-06-09 branch is created, per the hotfix release procedure
  • 21:48 PR #8256 created, to merge commit 63310a3 into release-branch-2025-06-09
  • 22:50 PR #8256 approved by two reviewers, who determined that the CI test failures were due to lack of test-code changes, not deficiencies in the changes to production code
  • 23:01 commit 63310a3 pushed to release-branch-2025-06-09, closing PR #8256 (Incident ends)
  • 23:06 summarized the fix and closed the initial report

Related Incidents

  • Bug: Telekom Security: Finding in 2020 ETSI-Audit regarding weekly review of changes to configurations
    Date: 2020-07-09
    Description: This incident concerned timely review of changes made to certificate systems, but is from a very different environment where post-hoc review was the norm.

Root Cause Analysis

We enforce our policy of code review for all Boulder commits using GitHub's automated branch protection features. Core developers can't push directly to main; instead, they make pull requests. Once a pull request has been approved, the author or another team member can merge it using the GitHub UI.

Approximately weekly, a Boulder developer tags a release based on the current content of the main branch, signs it using GPG, and uploads the tag with a name like release-2025-06-09. This process is manual. During deploy, an SRE uses a pair of automated tools to fetch the release tag, verify that it has passed tests on GitHub Actions, verify that the signature on the tag was made by a known Boulder developer, and build release artifacts for our staging and production datacenter environments. There is a separate process in GitHub Actions that builds release artifacts for consumption by our cloud infrastructure for MPIC validation. A second SRE reviews the config change ticket to deploy the new release artifact.
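The signature-verification portion of that deploy-time check can be sketched as a small shell helper. The function name and the exact checks here are illustrative assumptions, not Boulder's actual SRE tooling:

```shell
# Sketch only: approximate the deploy-time tag checks described above.
# verify_release_tag <tag> succeeds only if the tag exists, points to a
# commit, and carries a valid signature verifiable from the local
# keyring. (The real tooling also confirms the signer is a known
# Boulder developer.)
verify_release_tag() {
  tag="$1"
  git rev-parse -q --verify "refs/tags/${tag}^{commit}" >/dev/null &&
  git verify-tag "$tag"
}
```

A real implementation would additionally pin the set of acceptable signing keys, since `git verify-tag` accepts any key trusted by the local keyring.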

When tagging, the Boulder developer normally runs these commands in a local clone of the Boulder repository:

git fetch
git switch -d origin/main
git tag -s -m 'Boulder release 2025-06-09' release-2025-06-09

These commands update the remote-tracking origin/main ref from the git remote, switch to a "detached HEAD" (-d) at that commit, and tag it. On the date of the affected release, the Boulder developer instead ran:

git fetch
git switch -c origin/main
git tag -s -m 'Boulder release 2025-06-09' release-2025-06-09

The middle command, instead of switching to a "detached HEAD", creates (-c) a new branch named origin/main based on the current HEAD of the repository. In this case, the current HEAD was an in-progress branch named deprecate-doh containing one non-main commit, deprecating a feature flag (DOH) that is now enabled in all environments, and removing the dead code that used to execute when the feature flag was not enabled. Shortly after the release tag was pushed, the commit was pushed to origin as part of pull request #8234, along with some test fixes.

The root cause is that there was a gap between what our automated systems enforced (all commits on main are reviewed) and what our release process assumed (signed release tags are based on a recent main, except in the case of hotfixes).
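One way to remove this class of error is to tag the remote-tracking ref directly, so the state of the local checkout cannot leak into the release. A minimal sketch under assumed names; this is not Boulder's actual release tooling:

```shell
# Sketch only: tag origin/main directly instead of whatever HEAD
# happens to be checked out locally. Production use would pass -s to
# GPG-sign the tag; the extra git-tag flags are parameterized so the
# sketch can be exercised without a signing key.
tag_release() {
  name="$1"; shift
  git fetch --quiet origin
  git tag "$@" -m "Boulder release ${name#release-}" "$name" origin/main
}
```

With this shape, even if the operator's working tree is on an unmerged branch like deprecate-doh, the tag still points at the reviewed tip of main.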

Contributing Cause #1: Lack of automation in release tagging process

  • Description: The process of a developer tagging a release involves manually executing a series of commands, which introduces potential for human error.
  • Timeline: Since Boulder started tagging releases, in 2015.
  • Detection: In 2021 we identified a need to better automate the process, and in 2023 developed a candidate tool to do so.
  • Interaction with other factors: The lack of automation in tagging the release created the necessary preconditions to make this category of error (a typo in a command line), combining with the lack of detection at later stages (#2) to allow the commit to reach deployment.

Contributing Cause #2: Lack of detection of non-main tagged commits

  • Description: The process of pulling, building, and deploying a tagged Boulder release does not detect when the release tag is not an ancestor of main or does not otherwise point to reviewed code.
  • Timeline: Since we started deploying Boulder releases, in 2015.
  • Detection: As part of this incident. Also, in 2022 when reorganizing our release documentation, we introduced the notion of build-time verification of release tag properties but didn't follow through with implementation of those checks.
  • Interaction with other factors: If we had implemented detection of non-main commits during the release build process, the presence of an error in the manual process related to #1 would have been detected and the build process halted.
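The missing property is cheap to check mechanically. A sketch of the rule the build process could enforce, using hypothetical names (this is not the implementation Boulder later deployed):

```shell
# Sketch only: a tag is acceptable if its commit is reachable from
# origin/main (i.e. was merged through the reviewed pull-request flow),
# or if it is exactly the tip of the correspondingly named hotfix
# release branch.
tag_is_reviewed() {
  tag="$1"
  commit="$(git rev-parse "refs/tags/${tag}^{commit}")"
  if git merge-base --is-ancestor "$commit" origin/main; then
    return 0
  fi
  branch="origin/release-branch-${tag#release-}"
  [ "$commit" = "$(git rev-parse -q --verify "${branch}^{commit}")" ]
}
```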

Contributing Cause #3: Check for passing CI not thorough enough

  • Description: The script that builds a Boulder release in our datacenter checks for passing tests in GitHub Actions, using GitHub's check-suites API. Since the affected commit had failing unittests, this should have halted the build. There were two problems:
    • It filtered the responses by app id "15368" and app slug "github-actions" and processed the last item that matched. However, in this API, many checks can have the same app id and slug.
    • Our integration and unittest suites are triggered by pull requests and pushes to main. They are not triggered by tag pushes, since the expectation is that tags represent previously-tested commits. However, we do have a GitHub Action that triggers on tag push: the release action that automatically builds artifacts for deployment to our cloud infrastructure for MPIC validation. The action passed. When our build script in the datacenter ran, it saw one check_suite that matched the app id and app slug above, and treated that as a successful test run.
  • Timeline: Likely from 2021-03-19, when we switched from Travis CI to GitHub Actions for this check.
  • Detection: As part of this incident.
  • Interaction with other factors: This check is not intended to validate a commit's review status, so it would not have fully solved #2 even if it was working correctly. But it would have caught this specific occurrence of an accidentally tagged commit.
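A more robust version of this check keys on a specific, named check run rather than on an app id and slug. A sketch using jq over a saved check-runs API response; the function name and the "Boulder CI" check name are illustrative assumptions (field names follow GitHub's commits/{ref}/check-runs API):

```shell
# Sketch only: require a specific named check run to have succeeded,
# instead of accepting any matching check suite from the
# "github-actions" app.
# $1 = path to JSON from GET /repos/{owner}/{repo}/commits/{ref}/check-runs
# $2 = required check name, e.g. "Boulder CI" (assumed for illustration)
required_check_passed() {
  jq -e --arg name "$2" \
    'any(.check_runs[]; .name == $name and .conclusion == "success")' \
    "$1" >/dev/null
}
```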

Lessons Learned

What went well?

  • Our openness and collaboration with the community allowed a third party to detect and report the problem to us.

What went poorly?

  • We had previously identified a need for better automation of release processes but had not executed on it.
  • We did not detect the issue until a third party reported it to us.

Where did we get lucky?

  • The unreviewed commit only removed unused code and did not add or modify code in a way that would have caused incorrect operation of the CA.

Action items

  1. GitHub Actions release workflow: detect and error on release tags that are not ancestors of main or equal to the HEAD of a branch named release-branch-XXXX-YY-ZZ
     • Kind: Detect and mitigate
     • Corresponding Root Cause(s): #1
     • Evaluation Criteria: Copying the relevant workflow to a test repository and pushing non-main release tags causes build errors.
     • Date: 2025-07-30
     • Status: Ongoing
  2. Boulder: implement automated release tagging
     • Kind: Prevent
     • Corresponding Root Cause(s): #1
     • Evaluation Criteria: Core developers execute a single command for routine release tagging; that command tags current main from the remote repository regardless of local repository state.
     • Date: 2025-07-30
     • Status: Ongoing
  3. Production release builds: tags are verified to be ancestors of main or equal to the HEAD of a branch named release-branch-XXXX-YY-ZZ
     • Kind: Detect and mitigate
     • Corresponding Root Cause(s): #2
     • Evaluation Criteria: Unittests of tag-verification code.
     • Date: 2025-07-30
     • Status: Ongoing
  4. Historical releases: verify that each release tag is an ancestor of main or a hotfix release with fully reviewed changes
     • Kind: Detect
     • Corresponding Root Cause(s): #2
     • Evaluation Criteria: Summary of results posted here.
     • Date: 2025-07-30
     • Status: Ongoing
  5. Production release builds: check for specific test jobs passing
     • Kind: Detect
     • Corresponding Root Cause(s): #3
     • Evaluation Criteria: Copying the relevant workflow to a test repository and pushing a release tag with no corresponding pull request or merge to main results in the production build scripts rejecting that release tag.
     • Date: 2025-07-30
     • Status: Ongoing
Assignee: nobody → jsha
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Whiteboard: [ca-compliance] [policy-failure]

Can you explain why you didn’t stop issuance while the significance of the unreviewed code was being determined? It seems like there was a point at which LE was in the state “we know that unreviewed code went out but we don’t yet know what that code’s impact was”, and that issuance in that state is inappropriate (even if issuance was OK once the nature of the code had been analyzed, about which…)

That more minor point aside, I must as they say sigh and draw my katana to ask: why isn’t this a revocation event? Certificates were issued under a CPS saying that a practice was followed, and for a period of time that practice wasn’t followed correctly. That sounds like a classic misissuance-requiring-revocation scenario to me, even given recent discussions about possibly reducing the need for revocation in the case of certain kinds of CPS-truth error.

Flags: needinfo?(jsha)

I would also be interested in an action item to mechanically reject unreviewed changes (anything without r=whoever or a signed-off or whatever is used) from main and release-branch-…. What do you think?

Hi Mike, thank you for the questions! It's Juneteenth, so we will address them in our next update, after the holiday weekend.

(In reply to Mike Shaver from comment #1)

Can you explain why you didn’t stop issuance while the significance of the unreviewed code was being determined?

Because we determined the significance of the unreviewed code essentially instantaneously, as part of the process of evaluating the report. The issue was reported to us with a link directly to the unreviewed commit, and the first responder who received that report was the author of that commit and the PR it was part of. Additionally, the other early responders were engineers who had already reviewed that PR and were familiar with this change. Although it took us some time to formally document (via the publicly documented review process) that the commit did not affect correct operation of the CA, we had already made that determination internally. In more complicated circumstances, stopping issuance would certainly have been the appropriate first step after confirming an issue.


(In reply to Mike Shaver from comment #1)

why isn’t this a revocation event?

Because none of the circumstances for mandatory revocation laid out in Section 4.9.1.1 of the Baseline Requirements apply. In particular, all certificates issued during the incident period were issued in accordance with both the Baseline Requirements and our own CP/CPS, so paragraph (12) does not apply.

This circumstance is similar to the recent trusted role access, privilege access removal, and secure zone multiparty access controls incidents. In each of those cases, it’s clear that a security-critical process was not followed correctly. We agree with the implicit decision in those incidents to not revoke all certificates issued during the affected period. The correct action in those situations and in this one is to find the certificates affected by the non-compliance and revoke them.


(In reply to Mike Shaver from comment #2)

I would also be interested in an action item to mechanically reject unreviewed changes from `main` and `release-branch-…`. What do you think?

We already have such protections for main: GitHub itself enforces that no commit can be added to that branch without first going through the pull request and approval process. However, that couldn’t protect us from this failure mode: the issue here was that this commit was not merged to any branch at the time that it was tagged.

We have not historically had similar protections on our release-branch-... branches. This is because we only create those branches at times when we’re preparing a “hotfix” release, and our hotfix release process has a developer cherry-pick commits from main to the release branch; such branch protections would prevent that process from working correctly. Although this would not have prevented this incident, we have now extended our existing branch protection rule to also cover hotfix release branches. This will restrict the degree to which we can automate creation of hotfix release branches, but the tradeoff of requiring a PR for all commits to those branches is worth it.

Finally, as a result of this conversation, we have discovered the existence of GitHub’s “tag protection rules”, a comparatively new feature similar to the “branch protection rules” discussed above. We have now created a tag protection rule that enforces that all tags point to commits which have been merged to `main` or a `release-branch-` branch.


This action item is now complete and tested:

  • Production release builds: tags are verified to be ancestors of main or equal to the HEAD of a branch named release-branch-XXXX-YY-ZZ

While our new tag protection rule serves largely the same purpose as our first committed action item, it is not quite identical and is less publicly verifiable, so we still intend to implement the GitHub Action as planned.

Update on our action items:

  • Item 1 is complete. We have added a step to our GitHub Actions release workflow which verifies that the tag being built points to a commit which is either the tip of a release branch or an ancestor of the main branch.
  • Item 2 is in progress. We have created a new tool to automate the creation of normal release tags, but have not yet extended it to also handle the (much rarer) hotfix release tags.
  • Item 3 is complete, as previously reported.
  • Item 4 is in progress. We have collated the full set of historical release tags and release branches, and are reviewing them for discrepancies.
  • Item 5 is in progress. We have an internal pull request which updates the production release build script to check for specific CI jobs to have passed, but it has not yet been fully reviewed and merged.

We will provide our next update on or before 2025-07-07.

Whiteboard: [ca-compliance] [policy-failure] → [ca-compliance] [policy-failure] Next update 2025-07-07

Update on our action items:

  • Item 2 is in progress. We've merged a tool for automated release tagging. We are reviewing an update to support hotfix releases as well.
  • Item 4 is still in progress. We will provide a summary of our findings soon.
  • Item 5 is complete. The production release build script now specifically checks for the “Boulder CI” workflow to have completed successfully.
Flags: needinfo?(jsha)
Whiteboard: [ca-compliance] [policy-failure] Next update 2025-07-07 → [ca-compliance] [policy-failure]

Update on our action items:

  • Items 1, 3, and 5 are complete, as previously reported.
  • Item 2 is now complete. We've merged support for hotfix releases in our automated release tool (https://github.com/letsencrypt/boulder/pull/8290).
  • Item 4 is now complete. Our analysis is below.

This marks completion of all our action items for this incident, and we intend to post a report closure summary shortly.

Analysis for action item 4

We reviewed past release tags on the Boulder repository for other instances of deployed, unreviewed code. We found none.

We did find some instances of hotfix release tags where a change was correctly cherry-picked from main and tagged, but the correspondingly named release-branch-XXXX-YY-ZZ branch was not pushed. We were able to verify correct correspondence by reproducing the cherry-pick process.

There was also one example of a release tag that had a similar problem to the one in this incident, but it was caught before it was deployed to production. This reinforces our belief that the automated processes we've introduced as part of this incident's remediation will be a valuable protection.

We used the following command to list all release tags that are not ancestors of main:

$ git tag | while read -r tagname ; do git merge-base --is-ancestor "$tagname" origin/main || echo "$tagname" ; done

For some of the resulting tags, we used this script to confirm that they contained only code which was previously reviewed and merged to main, then cherry-picked and tagged:

# is-cherry-pick.sh
#
# Usage: is-cherry-pick.sh <release-tag> <commit>
#
# Given a release tag and a cherry-picked commit, verify that the release tag is
# the result of a cherry-pick of that single commit onto an earlier revision of
# `main`.
set -eu
git checkout "${1}~"
git merge-base --is-ancestor HEAD origin/main || exit 1
git merge-base --is-ancestor "${2}" origin/main || exit 1
PARENT="$(git rev-parse --short HEAD)"
git cherry-pick "${2}"
git diff --exit-code "${1}"
echo "${1} is a cherry-pick of ${2} onto an earlier revision of main (${PARENT})"

Results of manual analysis combined with the two shell snippets above:

These tags are ancestors of release branches:

  • release-2018-05-18
  • release-2018-05-18a
  • release-2020-07-23
  • release-2021-05-24-a
  • release-2021-06-07-a
  • release-2021-08-31a
  • release-2021-08-31b
  • release-2021-08-31c
  • release-2021-09-21a
  • release-2021-10-05a
  • release-2021-10-05b
  • release-2022-01-18a
  • release-2022-01-18b
  • release-2022-01-18c
  • release-2022-01-18d
  • release-2023-05-22a
  • release-2024-06-17a
  • release-2025-06-09 (the tag which incited this incident; the release branch was created after the fact)

These tags represent cherry-picks of reviewed commits onto an earlier version of main.

  • release-2021-11-16b (8eb7272 onto b7989d0b5)
  • release-2025-03-10a (0a72637 onto ad651d4a3)
  • release-2021-07-12-sql (4dc2df6 onto 915510b5a)
  • hotfixes/2016-06-30 (d0eef4b and fd095b8 onto 3a6a254)

Each of these tags points to a merge commit with the description "Merge staging into release". This was the product of an earlier version of our release process, in which we periodically merged master into staging, and then staging into release.

  • release-2016-07-14
  • hotfixes/2017-02-01

This was a mistaken tag from a working branch, similar to the one in the incident. An engineer at the time noticed the discrepancy and we didn't deploy it.

  • release-2018-04-02

Report Closure Summary

  • Incident description: One of our core developers accidentally tagged a release of our CA software, Boulder, based on a commit that had not been reviewed and merged into the main branch. That release was deployed. After detection, the Boulder team immediately reviewed the commit and confirmed that it only removed dead code and did not cause incorrect operation of the CA.
  • Incident Root Cause(s): The root cause is that there was a gap between what our automated systems enforced (all commits on main are reviewed) and what our release process assumed (signed release tags are based on a recent main, except in the case of hotfixes). Specifically, the process of producing a release tag was not automated, and the processes for building a release tag (both in CI and in our datacenters) did not verify that the release tag pointed at an appropriately-reviewed commit.
  • Remediation description: We have improved our automation in several places: the process of producing both mainline and hotfix releases is now automated, and the process of building release artifacts both in GitHub Actions and in our datacenters now checks that the tagged commit has been reviewed and passed our integration tests. We have also reviewed all of our historical release tags to ensure that no other unreviewed code has been deployed to our production environment in the past.
  • Commitment summary: We commit to continue improving and automating our release process, to remove even more opportunities for human error.

All Action Items disclosed in this report have been completed as described, and we request its closure.

Flags: needinfo?(incident-reporting)

This is a final call for comments or questions on this Incident Report.

Otherwise, it will be closed on approximately 2025-07-29.

Whiteboard: [ca-compliance] [policy-failure] → [close on 2025-07-29] [ca-compliance] [policy-failure]
Status: ASSIGNED → RESOLVED
Closed: 2 months ago
Flags: needinfo?(incident-reporting)
Resolution: --- → FIXED
Whiteboard: [close on 2025-07-29] [ca-compliance] [policy-failure] → [ca-compliance] [policy-failure]