Closed Bug 1666047 Opened 1 year ago Closed 1 year ago

Let's Encrypt: 302 total OCSP responses available beyond acceptable timelines

Categories

(NSS :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: kchris, Assigned: kchris)

Details

(Whiteboard: [ca-compliance])

Attachments

(3 files)

Summary:

From 2020-09-07 at 05:44:35 UTC to 2020-09-08 at 17:48:28 UTC, we served OCSP responses older than 3.5 days for 268 certificate serial numbers. From 2020-09-12 at 09:40:31 UTC to 2020-09-13 at 07:22:13 UTC, we served OCSP responses older than 3.5 days for an additional 34 certificate serial numbers. None of the OCSP responses were served beyond their validity period (nextUpdate). The maximum age an OCSP response ever reached was 5 days. For OCSP responses with a 7-day validity period, the Microsoft Root Program specifies that updated responses be available within 3.5 days and the CA/B Forum Baseline Requirements specify 4 days.

On 2020-09-13 at 17:22 UTC, the final manual remediation query was executed on the database and we verified that all potentially-affected Certificate Status entries had been remediated.


Incident Report:

  1. How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.

    • On 2020-09-08 at 15:00 UTC during an on-call shift rotation, our SREs triaged a non-paging alert which had fired over the preceding weekend for elevated error-level logs. The log contents were reviewed and prompted the investigation which began at 15:11 UTC.
  2. A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.

    • 2020-09-03 at 17:40 UTC: Boulder release-2020-08-31a deployed which finalized migration to proto3.
    • 2020-09-07 at 10:14 UTC: Non-paging alert begins firing for elevated error-level logs.
    • 2020-09-08 at 15:11 UTC: Alert is triaged for elevated error-level logs.
    • 2020-09-08 at 17:47 UTC: First remediation query is executed, 268 affected status entries.
    • 2020-09-10 at 17:37 UTC: Boulder release deployed in Production fixes error scenario.
    • 2020-09-12: Remediation query interval is missed, 34 additional affected status entries.
    • 2020-09-13 at 17:22 UTC: Final remediation query is executed. All problematic entries have been updated.
  3. Whether your CA has stopped, or has not yet stopped, certificate issuance or the process giving rise to the problem or incident. A statement that you have stopped will be considered a pledge to the community; a statement that you have not stopped requires an explanation.

    • Deployment of a fix for the root cause began on 2020-09-10 at 17:37 UTC and concluded at 17:59 UTC. All certificate status entries which would have raised errors and had the potential to be served beyond acceptable timelines have been remediated as of 2020-09-13 at 17:22:17 UTC.
  4. In a case involving certificates, a summary of the problematic certificates. For each problem: the number of certificates, and the date the first and last certificates with that problem were issued. In other incidents that do not involve enumerating the affected certificates (e.g. OCSP failures, audit findings, delayed responses, etc.), please provide other similar statistics, aggregates, and a summary for each type of problem identified. This will help us measure the severity of each problem.

    • OCSP responses for a total of 302 certificate serial numbers were served beyond acceptable timelines. The small number of affected certificate status entries relative to our total certificate volume is due to the very specific error scenario which caused the entries. The first affected precertificate with serial number 04274330a750fbf1bacbd7810e9363cdc2a7 was issued on 2020-09-03 at 17:43:37 UTC and will expire on 2020-12-02 at 16:43:37 UTC.
      The last affected precertificate with serial number 044f4ad743250e88a2f9f4bd38aa9711e6e7 was issued on 2020-09-09 at 18:56:01 UTC and will expire on 2020-12-08 at 17:56:01 UTC.
  5. In a case involving certificates, the complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem. In other cases not involving a review of affected certificates, please provide other similar, relevant specifics, if any.

    • Because of the specific error scenario, only issuances that errored after precertificate issuance and before final certificate issuance were affected. No final certificates were delivered to subscribers from affected precertificates. All of the affected certificate serial numbers are “reserved” as per BRs 4.9.10. We are serving OCSP responses for the reserved certificate serial numbers. We will attach a file containing all of the affected certificate serial numbers. We will attach a file containing crt.sh URLs for each of the affected serial numbers. We will also attach one current and decoded example of an OCSP response for a randomly selected certificate serial number from the affected list.
  6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

    • We had two transitions in our code ongoing at the time of the incident: A gradual migration of our OCSP table to include a new column “issuerID” and a switch from proto2 to proto3. Rows inserted before 2020-06-25 have a NULL “issuerID” field. Rows in the OCSP table inserted after 2020-06-25 are supposed to have a non-zero “issuerID” field. There was a bug in our logic: When a precertificate fails insertion into the database, it gets added to a retry queue. When the precertificate was eventually inserted into the database from that retry queue, it was inserted with a NULL “issuerID” field.

      In general, this was not a problem. Our code that does periodic re-signing of OCSP responses expects either a non-zero “issuerID” field, or a NULL (in which case it will find the relevant issuer by looking up the certificate serial number). However, the proto2 to proto3 migration introduced a semantics change: in proto3 there is no distinction between a field that is absent and one that is “present, but zero.” Our storage component, receiving requests to insert a certificate where the issuerID field was absent, interpreted that as a 0 and wrote a 0 to the database instead of a NULL.

      Later, our periodic re-signing component would read that entry, see that the “issuerID” field was non-nil, and try to find an issuer with ID 0. Not finding one, it would return an error, and the OCSP response would not get updated.

      We did not immediately catch this because our production OCSP monitoring is primarily end-to-end: We issue certificates frequently, and monitor those certificates’ OCSP responses over time to ensure they are successful and not too stale. This was designed with the anticipation that OCSP problems would affect our whole infrastructure equally. However, it does not adequately cover the case where OCSP problems affect a small subset of certificates. In particular, it does not cover the case of issuances that error after precertificate issuance but before returning the certificate to the end-user.

  7. List of steps your CA is taking to resolve the situation and ensure that such situation or incident will not be repeated in the future, accompanied with a binding timeline of when your CA expects to accomplish each of these remediation steps.

    • We will be exporting additional metrics from our ocsp-updater component, and our ocsp-responder component (https://github.com/letsencrypt/boulder/issues/5080). These will allow us to do clear-box monitoring of all OCSP responses that we sign and serve, in addition to our current end-to-end monitoring of a sample of OCSP responses.
    • We will be adding an all-hours paging alert on certificate status age from the new metrics.
    • These will be in production by 2020-12-09.
Attached file crt-dot-sh-urls.txt
Summary: Let's Encrypt: 302 total OCSP responses served beyond acceptable timelines → Let's Encrypt: 302 total OCSP responses available beyond acceptable timelines
Assignee: bwilson → kchris
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Whiteboard: [ca-compliance]

However, the proto2 to proto3 migration introduced a semantics change: in proto3 there is no distinction between a field that is absent and one that is “present, but zero.”

This is fascinating, and a change in Proto3 that I wasn't aware of; as best I can tell, it is only documented by the Protobuf team in the 3.0 release notes.

Could you talk about whether you've examined the rest of the code for latent bugs like this? I realize that's a big question, so a different way of framing it would be, if you haven't, can you describe why you haven't (e.g. because of unit tests, XX% code coverage, etc). If OCSP is truly the exception here, due to the ongoing service requirement, then what sort of remediations beyond this specific issue might be useful for OCSP?

Flags: needinfo?(kchris)

Any one of three changes would have prevented or caught this issue:
a) correctly ensuring that the CA's orphan-handling code sets the IssuerID when sending orphaned precertificates to the database;
b) correctly ensuring that the database requires that the IssuerID field be set when it receives a request to add a precertificate; or
c) correctly ensuring that the ocsp-updater knows that "0" is an invalid issuer ID, and to fall back to the same codepath it uses for handling NULL in that case.
(Another way to look at it: three things had to go wrong for this bug to occur.) Fixing (a) was our immediate fix, deployed on 2020-09-10. Fixes for (b) and (c) are in-flight right now. Only (c) is related to our proto3 migration; the first two were simply oversights when adding the codepath which stores and retrieves certificates based on their issuer and serial, rather than based on their full DER.

We did do an extensive review of our proto message validity checking code in the lead-up to our proto3 migration. We were aware of the change in nil/zero semantics, and needed to ensure compatibility between services running proto2 and services running proto3 during our rolling deployments. We believe that we did not catch this particular instance for two reasons. First, the vast majority of the proto2 -> proto3 semantics changes require removing checks, not adding new ones: for example, while we used to be able to assert that a revocation request's revocationReason must be non-nil, we cannot assert that it be non-zero because 0 semantically means Unspecified. Therefore, thinking about which fields might have been previously missed and should be added to our validity checks was not part of our review. Second, because the ocsp-updater does not receive RPC requests (it queries the database to retrieve rows to be processed), it does not perform validity checks of its own, so there was not an existing check there for us to update.

Now that our migration to proto3 is complete, our next step is to clean up our transitional validity checks (which had to handle both proto2 and proto3 semantics). As part of this cleanup, we will be checking all places where we pass integers via gRPC and reconfirming whether those integers can be 0. We will be doing the same for strings and the empty string.
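
The nil/zero change and the kind of check this cleanup is looking for can be sketched in Go as follows. The structs are hand-written stand-ins that only mimic the shape of generated code (a pointer for an optional proto2 scalar, a plain value for a proto3 scalar); `toProto3`, `validate`, and the type names are hypothetical and not taken from Boulder.

```go
package main

import (
	"errors"
	"fmt"
)

// proto2Request mimics proto2-style generated code: an optional scalar is
// a pointer, so nil means "field absent".
type proto2Request struct {
	IssuerID *int64
}

// proto3Request mimics proto3-style generated code: the scalar is a plain
// value, so an absent field is indistinguishable from 0.
type proto3Request struct {
	IssuerID int64
}

// toProto3 shows how absence collapses into the zero value.
func toProto3(r proto2Request) proto3Request {
	if r.IssuerID == nil {
		return proto3Request{}
	}
	return proto3Request{IssuerID: *r.IssuerID}
}

var errMissingIssuerID = errors.New("issuerID must be non-zero")

// validate rejects 0, which is only appropriate for fields where 0 is
// never a meaningful value.
func validate(r proto3Request) error {
	if r.IssuerID == 0 {
		return errMissingIssuerID
	}
	return nil
}

func main() {
	absent := toProto3(proto2Request{})
	fmt.Println(absent.IssuerID, validate(absent))

	id := int64(7)
	present := toProto3(proto2Request{IssuerID: &id})
	fmt.Println(present.IssuerID, validate(present))
}
```

A check like `validate` fits a field such as an issuer ID, but not a field such as revocationReason, where 0 is itself a legitimate value.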

Are there any additional questions or comments? If not, then on or about 9-October-2020 I intend to close this bug as completed.

Flags: needinfo?(bwilson)
Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Flags: needinfo?(kchris)
Flags: needinfo?(bwilson)
Resolution: --- → FIXED