As part of remediating Bug 1751984 we discovered a small number of entries in our database for which we had precertificate data stored but did not have corresponding certificate status (particularly, OCSP response) data stored.
These certificates never had OCSP data available. As no authoritative records for these certificates were available, all requests for their OCSP responses resulted in an “unauthorized” response, as required by RFC 5019, Section 2.2.3 and RFC 6960, Section 2.3.
We believe that this incident is most accurately described as a violation of Section 4.10.2 of the BRs, which requires that the CA maintain a service that application software can “use to automatically check the current status of all unexpired Certificates issued by the CA” (emphasis added). Although there are other sections of the BRs which pertain to OCSP responses and their update periods, all such requirements appear to pertain to extant OCSP responses, of which these affected certs had none.
We have both populated OCSP responses for all affected certificates, and fixed the error which allowed certificates without corresponding OCSP responses to be stored in our database.
How we first became aware of the problem.
While revoking certificates affected by Bug 1751984, our automated revocation process reported that two revocations had failed because they did not have an existing status to update. We immediately began investigating. The investigation confirmed that we were not serving OCSP responses for the two certificates. Further investigation found an additional 130 unexpired affected certificates.
Timeline of incident and actions taken in response.
All times are UTC.
- 17:46 Bug deployed to Production (Incident Begins)
- 01:20 Discovery that two serial numbers present in our certificate storage are not present in our OCSP response storage
- 01:22 Confirmation that our OCSP responder service returns “Unauthorized” for both certificates
- 01:36 Began searching for additional affected certificates
- 01:46 Confirmation that both serials are present in all other relevant tables
- 02:18 Root cause determined
- 03:30 Confirmation that we have never served any OCSP response for any certificate affected in this way
- 03:43 OCSP responses generated and served for two initial affected certificates
- 04:34 130 additional affected certificates identified
- 04:40 Fix merged into Boulder
- 05:18 Confirmation that revocation was never requested for any of the affected certificates
- 05:30 Updated version of Boulder deployed to Staging
- 06:22 Updated version of Boulder deployed to Production
- 06:39 OCSP responses generated and served for the 130 additional affected certificates (Incident Ends)
- 21:21 Confirmation that all unexpired certificates have available OCSP responses
Whether we have stopped the process giving rise to the problem or incident.
We are now serving (and regularly updating) OCSP responses for all affected certificates. We have merged and deployed a fix for the bug which gave rise to certificates without corresponding status information in our database, and so are no longer generating new affected certificates.
Summary of the affected certificates.
132 certificates issued between 2021-12-02 17:46 UTC and 2022-01-29 06:22 UTC were affected. The problem manifested very rarely, as we issued approximately 200M total certificates during that period, and with no correlation to the contents of the certificate.
Complete certificate data for the affected certificates.
The attached file contains crt.sh urls in the format
https://crt.sh/?sha256=<cert fingerprint> for all affected certificates.
Note that the certificate data itself was not affected, just the presence of OCSP information for that certificate. As such, now that the incident has been resolved, the certificates are now indistinguishable from any other certificate that was not initially affected.
Explanation of how and why the mistakes were made or bugs introduced, and how they avoided detection until now.
The Let’s Encrypt issuance pipeline ensures that every certificate has a corresponding precertificate, every precertificate has a corresponding OCSP response, and every OCSP response has a corresponding serial. As part of maintaining this invariant, it stores both the precertificate and the OCSP response inside a single database transaction: if either of those operations fails, the whole transaction will be rolled back and neither object will have been persisted to the database.
Almost all error handling in Go is done via a standard best-practices system: a helper function returns both the normal return value and an error. The calling code checks to see if the error value is populated, and if it is, returns early and propagates that error to its own caller.
However, our code accidentally skipped checking the error returned by the operation which stored the OCSP response. This meant that it was possible for the OCSP-storing operation to fail, but for the transaction as a whole to still succeed, and thereby successfully store the precertificate. This bug was introduced in a change which otherwise greatly simplified the code at hand, but the bug was not caught by code review.
This error-checking pattern is so common in Go that there are a number of widely-used linters which check for it. The ineffassign linter emits errors when a variable is assigned to but then never read from, as happened in the buggy code. The SA4006 check within the larger staticcheck linter does the same. Both of these linters were enabled in our continuous-integration and presubmit tests at the time the bug was introduced, and continuously since then. Neither detected this bug.
The reason neither was able to detect this is due to an idiosyncratic interaction between Go’s variable declaration/assignment syntax and nested scopes. In Go, when you want to assign to an existing variable, you write
name = value. However, if that variable doesn’t exist yet, you must first declare it, either with a
var name type statement, or by combining the declaration and the assignment using the shorter
name := value syntax. In general, you cannot declare a new variable with the same name as an existing variable in the same scope. But nested scopes such as if-clauses, loops, and closures change this: inside a nested scope, you can both assign to variables already declared in the outer scope (using the
name = value syntax), or you can declare a new variable (which shadows the outer one) which exists only in the inner scope (using the
name := value syntax). Unlike in flat functions, both syntaxes can be valid at the same time inside a nested scope, albeit with slightly different semantics. This makes it easy to accidentally use the wrong form and not realize the mistake.
In this particular case, the database-writing operations take place inside a closure, since the database library uses a closure to bundle together all things that should happen during a transaction. The code saved the result of storing the OCSP response not to a new error variable local to the closure, but instead to an error variable which already existed in the closed-over scope. And even though a human can read the enclosing function and determine that that error variable is never read from, that analysis is difficult for an automated system, and so neither linter was able to detect it.
Finally, in order for this bug to trigger, exactly one of the operations within the transaction (writing the OCSP response to the database) had to fail, while all other operations in the same transaction were successful. This circumstance was very rare (0.00006%), and affected new certificate issuances essentially at random.
List of steps we are taking to resolve the situation and ensure that such situation or incident will not be repeated in the future, accompanied with a binding timeline of when your CA expects to accomplish each of these remediation steps.
We have already fixed the proximate bug which allowed certificates to be stored without their corresponding OCSP responses being stored. In order to ensure similar bugs don’t occur in the future, we will be taking a few additional measures.
First, we will audit the entire Boulder codebase for places where we assign to (but don’t declare) an error variable inside a closure. We will ensure that all such places are correctly checking the error, since our linters can’t do so for us.
We had already established best-practices for error handling that make it easier for both humans and linters to catch bugs like this, but we had not yet gone back and applied those best practices to all existing code. We will do so.
Finally, we will add new metrics to our OCSP response server that allow us to track why we serve “Unauthorized” responses – for example because we don’t recognize the requested issuer, we don’t recognize the requested serial, or the requested certificate has already expired – so that we can detect bugs similar to this and correct them proactively.
We will complete these remediation items by Tuesday, March 1st, and will provide an update on this ticket on or before that date.
|Publish OCSP responses for affected certificates
|Ensure that failing to store OCSP fails issuance
|Audit codebase for error checks inside closures
|Not yet started
|Improve readability and lintability of error checks
|Add metrics for Unauthorized response causes