Closed Bug 1648840 Opened 5 years ago Closed 5 years ago

Let's Encrypt: OCSP responses with no revocationReason

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: jsha, Assigned: jsha)

Details

(Whiteboard: [ca-compliance] [ocsp-failure])

Attachments

(1 file)

incorrect-revocationreason-certs-unexpired-as-of-2020-06-19.tsv 5 years ago Jacob Hoffman-Andrews 3.09 MB, text/tab-separated-values		Details

Jacob Hoffman-Andrews

Assignee

Description

•

5 years ago

Attached file incorrect-revocationreason-certs-unexpired-as-of-2020-06-19.tsv — Details

User Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36

Steps to reproduce:

Until 2020-06-19, Let’s Encrypt certificates that were revoked with revocationReasons 1 (keyCompromise), 3 (affiliationChanged), 4 (superseded), and 5 (cessationOfOperation) had OCSP served with their correct revocationReasons for the first 3 days after revocation, but after that were served with no revocationReason encoded. The certStatus (revoked) was correctly served throughout. This was due to a bug in our code that updates OCSP responses. The initial OCSP response after revocation was signed with the correct revocationReason, while updated responses were signed using the default value for revocationReason (unspecified) instead of the value stored in our database.

How your CA first became aware of the problem.

At 2020-06-18 23:02 UTC, a Let’s Encrypt SRE noticed during routine maintenance that an OCSP response for a certificate which had been revoked with revocationReason = keyCompromise (1) was being served with revocationReason = unspecified (0). The certStatus field was still correctly set to revoked (1). We identified the cause of the issue and fixed it (https://github.com/letsencrypt/boulder/pull/4880). The fix was deployed at 2020-06-19 02:13 UTC.

A timeline of the actions your CA took in response.

2016-09-30 21:52 UTC: Change was merged that introduced the bug.
2016-10-20 18:52 UTC: Boulder release containing the bug was deployed (https://letsencrypt.status.io/pages/maintenance/55957a99e800baa4470002da/5808f538e60a2e2019001144)
2020-06-18 23:02 UTC: Let’s Encrypt SRE discovers problem.
2020-06-18 23:39 UTC: Cause of problem identified.
2020-06-19 00:37 UTC: Fix merged.
2020-06-19 02:13 UTC: Fix deployed.

Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem.

We have stopped generating OCSP responses with the problem.

A summary of the problematic certificates.

Between 2016-10-20 and 2020-06-19, all revoked certificates with a non-zero revocationReason had their OCSP response served with a zero revocationReason starting approximately three days after revocation.

The complete certificate data for the problematic certificates.

Attached, find a list of certificates, with crt.sh URLs, that were unexpired as of 2020-06-19 and had a nonzero revocationReason.

Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

The bug was introduced in https://github.com/letsencrypt/boulder/pull/2177. We were making preparations to add new columns to the table we use to store certificate status. As part of that change, we moved from using a SELECT * on the relevant table to a SELECT <field1>,<field2>,.... We did this due to some challenges in managing migrations that add or remove columns, combined with our Deployability requirements (https://github.com/letsencrypt/boulder/blob/master/CONTRIBUTING.md#gating-migrations). As part of that change, we accidentally omitted the reasonCode field, so it was populated with the default value of 0, corresponding to “unspecified.”

We use an ORM library to map columns in the database to fields in our structs, and vice versa. Some methods in this library (Get, Insert) require that every field in a struct must exist as a column in the corresponding table. Other methods (Select, when used with SELECT *) require that every column in a table must exist in the corresponding struct. This creates a Catch-22 when writing code that must run successfully before and after a migration.

Generally, we’ve resolved this by transforming our SELECT * queries into SELECT <field1>,<field2> queries. The set of fields is then decided based on a feature flag that we flip to once the migration has been done and the additional fields can be added to the query. This creates a risk that the list of field names becomes out of date or does not match the full set of fields in a struct. We mitigate this risk by centralizing the list of field names in one place: our “SA” component (https://github.com/letsencrypt/boulder/pull/2259). However, the particular query that caused this bug was not included as part of that mitigation. I can’t find a record of discussing that, but I presume it’s because the query was a JOIN and would not line up straightforwardly with the scheme we chose to centralize field lists.

Additional reasons why this was not detected until now: Our integration tests around revocation were verifying certificate status (good vs revoked) but were not verifying revocationReason (unspecified, keyCompromise, cessationOfOperation, etc). The same was true of our end-to-end tests against production. Also, while the integration tests simulate the passing of time for certificate expiration, they do not currently simulate the passing of time after a revocation, to trigger the re-signing code path of our ocsp-updater component.

List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.

We’ve fixed the immediate bug, and added a unit test and an integration test: https://github.com/letsencrypt/boulder/pull/4880. This is done.

We are centralizing the list of field names used in this particular query in the SA: https://github.com/letsencrypt/boulder/pull/4902/files. Estimate: 2020-07-30.

We are adding checks for the revocation reason to the tooling we use for end-to-end testing of production: https://github.com/letsencrypt/boulder/issues/4885. Estimate: 2020-07-30.

We are adding support in our integration tests for simulating the passage of time after a revocation, so we can exercise the re-signing code path: https://github.com/letsencrypt/boulder/issues/4907. Estimate: 2020-08-28.

We are discussing how to change our approach to migrations that add or remove columns. We don’t know yet what changes we might make, but we will provide an update with our conclusions: https://github.com/letsencrypt/boulder/issues/4909. Estimate: 2020-08-28.

Ryan Sleevi

Updated

•

5 years ago

Assignee: bwilson → jsha

Status: UNCONFIRMED → ASSIGNED

Ever confirmed: true

Whiteboard: [ca-compliance]

Ryan Sleevi

Comment 1

•

5 years ago

Thanks for filing this, and for the detailed reporting!

I'm not sure that this represents a compliance bug against any policies, although this is being proposed as part of an update to the Baseline Requirements via SC31.

At present, under RFC 5280 a revocationReason is allowed to change during the lifetime remaining in a certificate (e.g. to go from 'unspecified' to 'keyCompromise', for example), and there's no clear requirements that I'm aware of about going from a more or less specific reason. In part, this facilitates Delta CRL use cases, and similarly enables certificate suspension (although suspension is forbidden for TLS certificates)

I'm also not aware of any provision in Let's Encrypt's CP/CPS that has been violated, but perhaps I'm missing something?

I say all of that because this is incredibly useful to report, and definitely doesn't match the general expectations, but also seems to not be a program violation.

Ben: Have I missed anything? I'm incredibly appreciative of Let's Encrypt's proactive notification here, and think it's the model all CAs should aspire to regardless of whether we deem it an "incident".

Flags: needinfo?(bwilson)

Jacob Hoffman-Andrews

Assignee

Comment 2

•

5 years ago

I agree with your reading that this is not a compliance bug, and it doesn't violate our CPS. We felt it was worth notifying about because it was unexpected and unintended behavior. And doing the root cause analysis in public led us to explain the problem in more detail and, I think, come to clearer understanding about how to prevent similar bugs in the future.

Ben Wilson

Comment 3

•

5 years ago

Thanks. I also do not consider this to be a compliance incident, and I appreciate this information and explanation provided by Let's Encrypt. I plan to close this as Fixed, rather than as Invalid. I believe that updates can still be posted to this bug after it is marked as fixed.

Flags: needinfo?(bwilson)

Ben Wilson

Updated

•

5 years ago

Type: defect → task

Ben Wilson

Updated

•

5 years ago

Status: ASSIGNED → RESOLVED

Closed: 5 years ago

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Updated

•

3 years ago

Product: NSS → CA Program

David Lawrence [:dkl]

Updated

•

3 years ago

Whiteboard: [ca-compliance] → [ca-compliance] [ocsp-failure]

You need to log in before you can comment on or make changes to this bug.

Bugzilla

Let's Encrypt: OCSP responses with no revocationReason

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

People

(Reporter: jsha, Assigned: jsha)

References

Details

(Whiteboard: [ca-compliance] [ocsp-failure])

Crash Data

Security

(public)

User Story

Attachments

(1 file)

Description

Updated

Comment 1

Comment 2

Comment 3

Updated

Updated

Updated

Updated

Attachment

General

Description

File Name

Content Type