Closed Bug 1753123 Opened 3 years ago Closed 3 years ago

Let's Encrypt: Failure to provide OCSP Responses for some certificates

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: aaron, Assigned: aaron)

Details

(Whiteboard: [ca-compliance] [ocsp-failure])

Attachments

(1 file)

missing-ocsp-affected-crtsh-uris.txt 3 years ago Aaron Gable 11.34 KB, text/plain		Details

Aaron Gable

Assignee

Description

•

3 years ago

Attached file missing-ocsp-affected-crtsh-uris.txt — Details

Summary

As part of remediating Bug 1751984 we discovered a small number of entries in our database for which we had precertificate data stored but did not have corresponding certificate status (particularly, OCSP response) data stored.

These certificates never had OCSP data available. As no authoritative records for these certificates were available, all requests for their OCSP responses resulted in an “unauthorized” response, as required by RFC 5019, Section 2.2.3 and RFC 6960, Section 2.3.

We believe that this incident is most accurately described as a violation of Section 4.10.2 of the BRs, which requires that the CA maintain a service that application software can “use to automatically check the current status of all unexpired Certificates issued by the CA” (emphasis added). Although there are other sections of the BRs which pertain to OCSP responses and their update periods, all such requirements appear to pertain to extant OCSP responses, of which these affected certs had none.

We have both populated OCSP responses for all affected certificates, and fixed the error which allowed certificates without corresponding OCSP responses to be stored in our database.

Incident Report

How we first became aware of the problem.

While revoking certificates affected by Bug 1751984, our automated revocation process reported that two revocations had failed because they did not have an existing status to update. We immediately began investigating. The investigation confirmed that we were not serving OCSP responses for the two certificates. Further investigation found an additional 130 unexpired affected certificates.

Timeline of incident and actions taken in response.

All times are UTC.

2021-11-25

18:29 Bug introduced to Boulder

2021-12-02

17:46 Bug deployed to Production (Incident Begins)

2022-01-29

01:20 Discovery that two serial numbers present in our certificate storage are not present in our OCSP response storage
01:22 Confirmation that our OCSP responder service returns “Unauthorized” for both certificates
01:36 Began searching for additional affected certificates
01:46 Confirmation that both serials are present in all other relevant tables
02:18 Root cause determined
03:30 Confirmation that we have never served any OCSP response for any certificate affected in this way
03:43 OCSP responses generated and served for two initial affected certificates
04:34 130 additional affected certificates identified
04:40 Fix merged into Boulder
05:18 Confirmation that revocation was never requested for any of the affected certificates
05:30 Updated version of Boulder deployed to Staging
06:22 Updated version of Boulder deployed to Production
06:39 OCSP responses generated and served for the 130 additional affected certificates (Incident Ends)
21:21 Confirmation that all unexpired certificates have available OCSP responses

Whether we have stopped the process giving rise to the problem or incident.

We are now serving (and regularly updating) OCSP responses for all affected certificates. We have merged and deployed a fix for the bug which gave rise to certificates without corresponding status information in our database, and so are no longer generating new affected certificates.

Summary of the affected certificates.

132 certificates issued between 2021-12-02 17:46 UTC and 2022-01-29 06:22 UTC were affected. The problem manifested very rarely, as we issued approximately 200M total certificates during that period, and with no correlation to the contents of the certificate.

Complete certificate data for the affected certificates.

The attached file contains crt.sh urls in the format https://crt.sh/?sha256=<cert fingerprint> for all affected certificates.

Note that the certificate data itself was not affected, just the presence of OCSP information for that certificate. As such, now that the incident has been resolved, the certificates are now indistinguishable from any other certificate that was not initially affected.

Explanation of how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

The Let’s Encrypt issuance pipeline ensures that every certificate has a corresponding precertificate, every precertificate has a corresponding OCSP response, and every OCSP response has a corresponding serial. As part of maintaining this invariant, it stores both the precertificate and the OCSP response inside a single database transaction: if either of those operations fails, the whole transaction will be rolled back and neither object will have been persisted to the database.

Almost all error handling in Go is done via a standard best-practices system: a helper function returns both the normal return value and an error. The calling code checks to see if the error value is populated, and if it is, returns early and propagates that error to its own caller.

However, our code accidentally skipped checking the error returned by the operation which stored the OCSP response. This meant that it was possible for the OCSP-storing operation to fail, but for the transaction as a whole to still succeed, and thereby successfully store the precertificate. This bug was introduced in a change which otherwise greatly simplified the code at hand, but the bug was not caught by code review.

This error-checking pattern is so common in Go that there are a number of widely-used linters which check for it. The ineffassign linter emits errors when a variable is assigned to but then never read from, as happened in the buggy code. The SA4006 check within the larger staticcheck linter does the same. Both of these linters were enabled in our continuous-integration and presubmit tests at the time the bug was introduced, and continuously since then. Neither detected this bug.

The reason neither was able to detect this is due to an idiosyncratic interaction between Go’s variable declaration/assignment syntax and nested scopes. In Go, when you want to assign to an existing variable, you write name = value. However, if that variable doesn’t exist yet, you must first declare it, either with a var name type statement, or by combining the declaration and the assignment using the shorter name := value syntax. In general, you cannot declare a new variable with the same name as an existing variable in the same scope. But nested scopes such as if-clauses, loops, and closures change this: inside a nested scope, you can both assign to variables already declared in the outer scope (using the name = value syntax), or you can declare a new variable (which shadows the outer one) which exists only in the inner scope (using the name := value syntax). Unlike in flat functions, both syntaxes can be valid at the same time inside a nested scope, albeit with slightly different semantics. This makes it easy to accidentally use the wrong form and not realize the mistake.

In this particular case, the database-writing operations take place inside a closure, since the database library uses a closure to bundle together all things that should happen during a transaction. The code saved the result of storing the OCSP response not to a new error variable local to the closure, but instead to an error variable which already existed in the closed-over scope. And even though a human can read the enclosing function and determine that that error variable is never read from, that analysis is difficult for an automated system, and so neither linter was able to detect it.

Finally, in order for this bug to trigger, exactly one of the operations within the transaction (writing the OCSP response to the database) had to fail, while all other operations in the same transaction were successful. This circumstance was very rare (0.00006%), and affected new certificate issuances essentially at random.

List of steps we are taking to resolve the situation and ensure that such situation or incident will not be repeated in the future, accompanied with a binding timeline of when your CA expects to accomplish each of these remediation steps.

We have already fixed the proximate bug which allowed certificates to be stored without their corresponding OCSP responses being stored. In order to ensure similar bugs don’t occur in the future, we will be taking a few additional measures.

First, we will audit the entire Boulder codebase for places where we assign to (but don’t declare) an error variable inside a closure. We will ensure that all such places are correctly checking the error, since our linters can’t do so for us.

We had already established best-practices for error handling that make it easier for both humans and linters to catch bugs like this, but we had not yet gone back and applied those best practices to all existing code. We will do so.

Finally, we will add new metrics to our OCSP response server that allow us to track why we serve “Unauthorized” responses – for example because we don’t recognize the requested issuer, we don’t recognize the requested serial, or the requested certificate has already expired – so that we can detect bugs similar to this and correct them proactively.

We will complete these remediation items by Tuesday, March 1st, and will provide an update on this ticket on or before that date.

Remediation	Status	Date
Publish OCSP responses for affected certificates	Complete	2022-01-29
Ensure that failing to store OCSP fails issuance	Complete	2022-01-29
Audit codebase for error checks inside closures	Not yet started	2022-03-01
Improve readability and lintability of error checks	Started	2022-03-01
Add metrics for Unauthorized response causes	Started	2022-03-01

Ben Wilson

Updated

•

3 years ago

Assignee: bwilson → aaron

Status: UNCONFIRMED → ASSIGNED

Ever confirmed: true

Whiteboard: [ca-compliance]

Aaron Gable

Assignee

Comment 1

•

3 years ago

Weekly update: We have merged and deployed the changes to improve lintability of error checks and add metrics for Unauthorized response codes. We have not yet started the full audit for error checks inside closures.

Please set the Next Update field to 2022-03-01, when we have committed to completing remediations.

Remediation	Status	Date
Publish OCSP responses for affected certificates	Complete	2022-01-29
Ensure that failing to store OCSP fails issuance	Complete	2022-01-29
Improve readability and lintability of error checks	Complete	2022-02-01
Add metrics for Unauthorized response causes	Complete	2022-02-04
Audit codebase for error checks inside closures	Not yet started	2022-03-01

Aaron Gable

Assignee

Comment 2

•

3 years ago

We have completed our audit of error assignments inside closures.

This concludes our committed remediation items for this incident. If there are no further questions or comments, we ask that this issue be closed.

Ben Wilson

Comment 3

•

3 years ago

I'll close this on Wed. next week, 23-Feb-2022, unless there are issues or concerns remaining to be discussed.

Flags: needinfo?(bwilson)

Ben Wilson

Updated

•

3 years ago

Status: ASSIGNED → RESOLVED

Closed: 3 years ago

Flags: needinfo?(bwilson)

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Updated

•

3 years ago

Product: NSS → CA Program

Ben Wilson

Updated

•

3 years ago

Whiteboard: [ca-compliance] → [ca-compliance] [ocsp-failure]

You need to log in before you can comment on or make changes to this bug.

Bugzilla

Let's Encrypt: Failure to provide OCSP Responses for some certificates

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

People

(Reporter: aaron, Assigned: aaron)

References

Details

(Whiteboard: [ca-compliance] [ocsp-failure])

Crash Data

Security

(public)

User Story

Attachments

(1 file)

Description

Summary

Incident Report

How we first became aware of the problem.

Timeline of incident and actions taken in response.

Whether we have stopped the process giving rise to the problem or incident.

Summary of the affected certificates.

Complete certificate data for the affected certificates.

Explanation of how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

List of steps we are taking to resolve the situation and ensure that such situation or incident will not be repeated in the future, accompanied with a binding timeline of when your CA expects to accomplish each of these remediation steps.

Updated

Comment 1

Comment 2

Comment 3

Updated

Updated

Updated

Attachment

General

Description

File Name

Content Type