Closed Bug 1838667 Opened 2 years ago Closed 2 years ago

Let's Encrypt: Duplicate Serial Numbers

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: agwa-bugs, Assigned: jsha)

Details

(Whiteboard: [ca-compliance] [dv-misissuance] Next update 2023-07-27)

Attachments

(2 files)

Let's Encrypt has issued certificates with duplicate serial numbers.

The first certificate is presumed to exist based on this precertificate: https://api.certspotter.com/v1/certs/22700bd0d70ac5790e6ae5b5f10afc998bb062b4fb1a153fd71b3f1b98fb8b00.pem

The second certificate is: https://api.certspotter.com/v1/certs/c0916d24ac8844522b3695093fb66b3e9ff71b5f25b6d563030113ed03833f0b.pem

Assignee: nobody → jsha
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Whiteboard: [ca-compliance] [dv-misissuance]

On 2023-06-15, Let’s Encrypt updated our subscriber certificate profile to remove the ISRG CPS OID and URL from the Certificate Policies extension. While this change was being deployed, it was possible for a single ACME Order Finalization flow to produce a precertificate and final certificate with the same serial number but different contents in this extension.

Out of an abundance of caution we halted issuance while we investigated the issue. Once we confirmed that the issue was transient and occurred only during the deploy, we resumed issuance.

We have identified a preliminary set of 645 affected serial numbers. We are in the process of confirming the affected certificates and developing remediations to prevent similar incidents from happening in the future. We will revoke the affected certificates within 5 days. We will post a full incident report on or before 2023-06-20.

(In reply to Aaron Gable from comment #1)

We are in the process of...developing remediations to prevent similar incidents from happening in the future.

Hi Aaron.

In case it helps, here's the gist of the Go code we use at Sectigo to create each final TBSCertificate from the corresponding precertificate.

Rather than construct the final certificate from scratch using the original inputs and the current template (as Boulder seems to do), our code minimally parses the precertificate's TBSCertificate, removes the CT Poison extension, adds the SCT List extension, and then encodes the new TBSCertificate that will be signed. Have you considered adopting this sort of approach?

Yeah, we have considered taking that approach. The code you identified is the result of a deliberate decision not to do things that way. Of course, in light of this incident, we're considering changing that decision. A full write-up of why our code works the way it does will be included in our root cause analysis. Thanks for sharing your code!

Attached file precert_urls.txt
Attached file cert_urls.txt

Summary

Let's Encrypt deployed a certificate profile configuration change that removed the embedded ISRG CPS OID and related CPS URL from the Certificate Policies extension of newly issued end-entity certificates. For a few moments during the deploy, it was possible to issue a precertificate and final certificate with the same serial number but mismatched Certificate Policies extensions. This occurred when the respective issuance requests were routed to different backend instances with different configured certificate profiles. This is a violation of BRs v2.0.0 Section 7.1.2.9.1, which requires that the Extensions field of a precertificate be byte-for-byte identical to the Extensions field of the certificate (with the exception of ct-poison and SCTs).

This can equivalently be viewed as a violation of the requirement that no two certificates from the same issuer share the same serial number, because the precertificate implies the existence of a final certificate which is not identical to the final certificate that was actually issued.

Incident Report

How we first became aware of the problem.

At 15:53 UTC on 2023-06-15, Andrew Ayer emailed cert-prob-reports@letsencrypt.org alerting us of this incident and supplying a single affected serial.

Timeline of incident and actions taken in response.

All times are UTC

2023-05-16:

  • 14:35 Profile configuration change begins deployment in Staging
  • 14:35:49 Staging: First task starts using the new configuration
  • 14:35:57 Staging: First affected (untrusted) precertificate issued
  • 14:36:02 Staging: Last affected (untrusted) certificate issued
  • 14:36:03 Staging: Last task starts using the new configuration

2023-06-15:

  • 15:34:17 Profile configuration change begins deployment to Production
  • 15:36:01 Datacenter 1: First task starts using the new configuration
  • 15:36:15 Datacenter 1: First affected precertificate issued (Incident Begins)
  • 15:37:07 Datacenter 1: Last affected certificate issued
  • 15:37:08 Datacenter 1: Last task starts using the new configuration
  • 15:43:30 Datacenter 2: First task starts using the new configuration
  • 15:44:01 Datacenter 2: First affected precertificate issued
  • 15:44:54 Datacenter 2: Last affected certificate issued (Incident Ends)
  • 15:44:55 Datacenter 2: Last task starts using the new configuration
  • 15:53 Andrew Ayer submits a certificate problem report
  • 16:01 Issuance halted
  • 16:09 Certificate problem report acknowledged
  • 16:19 Confirmed that the issue had resolved itself prior to issuance being halted
  • 16:24 Services restarted to purge any pending issuance requests
  • 16:38 Test certificate issued to confirm issue resolved
  • 16:43 Public issuance resumed
  • 18:13 645 affected serials identified
  • 20:30 Preliminary incident report posted
  • 22:43 Began serving “renew immediately” ARI responses for all affected serials

2023-06-16:

  • 22:31 Notification emails sent to affected subscribers

2023-06-19:

  • 18:00 All affected certificates revoked

Whether we have stopped the process giving rise to the problem or incident.

We halted issuance to investigate the incident, and resumed issuance when we concluded that the issue was transient and would not reoccur. The process giving rise to the incident resolved itself prior to issuance being halted and resumed.

Summary of the affected certificates.

645 serials were affected. They are a random subset of the certificates which were issued during two one-minute-long periods as the profile change was deployed to each datacenter.

Note that, for some affected serials, the precertificate contained the ISRG CPS OID and URL while the certificate lacked them; for other affected serials, the precertificate lacked the OID and URL while the certificate contained them:

Serial 03ab31f8cdae7b94725b8c65e1dd3d27d361 affected:
  - precertificate at 2023-06-15T15:36:41.980255+00:00 has policy: true
  - certificate at 2023-06-15T15:36:42.152536+00:00 has policy: false
Serial 049d2f3ba953817eba31814e06a249f618e0 affected:
  - precertificate at 2023-06-15T15:44:39.485942+00:00 has policy: false
  - certificate at 2023-06-15T15:44:39.596283+00:00 has policy: true

(In addition, 7 certificates issued from our Staging hierarchy were also affected. This hierarchy is untrusted and not subject to any root program requirements, but it is useful to note that the issue affected both environments.)

Complete certificate data for the affected certificates.

Attached are two files containing URLs in the format https://crt.sh?sha256=&lt;certificate fingerprint&gt;. One file contains URLs pointing to the precertificate form of each affected serial, and the other contains URLs pointing to the certificate form.

Explanation of how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

Historically, Let’s Encrypt has included two policyIdentifiers in the Certificate Policies extension of all of our Subscriber certificates: the Domain Validated Reserved Certificate Policy Identifier (OID 2.23.140.1.2.1), and our own ISRG Domain Validated Policy Identifier (OID 1.3.6.1.4.1.44947.1.1.1). In addition, we included the URL of our CPS in a policyQualifier attached to the ISRG Domain Validated policyIdentifier.

In preparation for CA/Browser Forum Ballot SC62v2 “Certificate Profiles Update” coming into effect on 2023-09-15, Let's Encrypt staff reviewed the updated profiles and determined that the inclusion of that CPS URL was NOT RECOMMENDED, and that CAs only MAY include a policyIdentifier other than the reserved identifiers defined in the Baseline Requirements. As a result, we decided to remove both our own policyIdentifier and its associated policyQualifier from our Subscriber certificates.

When we deployed this change, a combination of three factors led to this incident.

Factor 1: Secure and Resilient Infrastructure

Let’s Encrypt divides our issuance services into logically separated virtual networks, protected by firewalls, with different levels of access to and from the public internet. The most-protected services – the “CA” instances which are connected to HSMs – have no ability to reach the public web, and therefore cannot themselves submit precertificates to CT logs and get SCTs in return.

Instead, the issuance process is managed by an “RA” (please excuse the naming, this service does not match the BR’s definition of a Registration Authority). This service requests that the CA sign a precertificate, submits that precert to CT logs to get SCTs, and then makes a subsequent request for the CA to issue a final certificate.

Because the precertificate and final certificate issuance occur in two separate cross-service requests from the RA to the CA, it is possible for these requests to be routed to different CA instances within the same datacenter. In theory, this is a good thing: it allows for better load-balancing across CA instances, allows for graceful recovery if a CA instance experiences an error during the lengthy wait for SCTs, and enables rapid rolling restarts where CA instances are restarted one at a time to pick up new binary versions or new configs without disrupting issuance.

However, this rolling restart meant that, during the profile configuration change, it was possible for the precertificate issuance request to go to a CA instance with the old profile, and the final certificate issuance request to go to a CA instance with the new profile (or vice versa).

We did not identify that this change was not safe to deploy in this way because certificate profile changes are very rare for us: for example, we have not changed the validity period, KUs, or EKUs of the certificates we issue in over three years. We incorrectly believed that this change was safe for a rolling deploy, just like our other routine CA configuration changes.

One of our core tenets is that all configuration changes should be safe to deploy in a rolling fashion. Had we identified this change was not rolling-safe, we could have done an “atomic” (stop all CAs, then restart all CAs) deploy, but that implies a process fix rather than a systematic fix. The issue is not that we didn’t do an atomic deploy for this change, but rather that this change couldn’t be safely deployed without an atomic deploy.

It is worth noting that this rapid restart mechanism is also why the number of impacted certificates is so small: the deploy window during which this issue was possible lasted only one minute in each datacenter, and only a fraction of issuances during that period (those whose CA requests went to instances with different configuration) were affected. The system automatically recovered and resumed normal issuance as soon as the deploy was complete.

Factor 2: Avoiding Manipulating DER

As Rob Stradling suggests in Comment #2, having requests for pre- and final certificate issuance routed to CA instances with different profiles configured would not be an issue if the final certificate was produced as a direct manipulation of the precertificate (effectively, by reversing the algorithm described in RFC6962 Section 3.1).

However, Let’s Encrypt is aware of multiple incidents that have arisen due to CAs trusting client input (e.g. SANs or extensions in a CSR) and/or directly manipulating DER in this way: Bug 1672423, Bug 1445857, Bug 1716123, Bug 1542793, and Bug 1695786 are just a few examples.

We designed our issuance pipeline specifically to avoid bugs such as these. Every issuance, both of precertificates and of final certificates, follows the same basic pattern: a limited set of variables are combined with a strict profile to produce a new certificate from scratch.

Keeping the flow this simple makes it easy to confirm that the code does what we think it does. The Go crypto/x509 library does not explicitly guarantee that parsing and re-serializing a certificate will result in the same extension ordering, and we don’t want to rely on undocumented behavior. Building the certificate from scratch also means we can’t accidentally share information between certificates (as in Bug 1815874).

It simply did not occur to us while designing and developing this locked-down issuance system that it opened the door to mismatches between the precertificate and the final certificate. It is fortunate that it took nearly three years for this issue to actually arise, and that it affected so few certificates.

Factor 3: Not Checking for Correspondence

Let’s Encrypt conducts extensive linting of all certificates. We run zlint on every precertificate and every final certificate, both before and after that (pre)certificate is issued. We abort issuance on any finding from zlint, including Notice and Warn results. We only skip two of zlint’s many checks: e_dnsname_not_valid_tld (because we ensure that the TLD is valid via a separate, faster-updating mechanism), and n_subject_common_name_included (because we still include CNs in our certificates). We also have developed additional checks which we add to the zlint registry to be included in each run.
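The gating policy above (abort on any finding, including Notice and Warn, minus an explicit skip list) can be illustrated with a small sketch. The types here are hypothetical, not zlint's actual API; only the two skipped lint names come from the text.

```go
package main

import "fmt"

// Severity of a lint finding. Hypothetical type, not zlint's API.
type Severity int

const (
	Pass Severity = iota
	Notice
	Warn
	Error
)

type Finding struct {
	Lint     string
	Severity Severity
}

// shouldAbort implements "abort issuance on any finding, including Notice
// and Warn results", with an allowlist of intentionally skipped lints.
func shouldAbort(findings []Finding, skipped map[string]bool) bool {
	for _, f := range findings {
		if skipped[f.Lint] {
			continue
		}
		if f.Severity >= Notice {
			return true
		}
	}
	return false
}

func main() {
	skipped := map[string]bool{
		"e_dnsname_not_valid_tld":        true,
		"n_subject_common_name_included": true,
	}
	findings := []Finding{
		{Lint: "n_subject_common_name_included", Severity: Notice}, // skipped
		{Lint: "w_some_other_lint", Severity: Warn},                // hypothetical name
	}
	fmt.Println(shouldAbort(findings, skipped)) // prints true: the Warn is not skipped
}
```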

But our current lints run without any context: you can’t supply an issuer certificate to verify that the leaf certificate’s Issuer field is byte-for-byte identical; you can’t supply a CT log’s public key to verify that the signatures in SCTs are correct; and, critically for this incident, you can’t supply a precertificate to compare against when linting a final certificate.

This means that our linting regime was incapable of detecting this particular kind of error. It also means that we did not catch this error when we deployed the same profile configuration change to our Staging environment a month prior to our production deploy, despite 7 (untrusted) issuances experiencing the same error there.

Root Cause

The root cause here is the confluence of all three of the above factors. If any one factor had not been present, this incident would not have occurred:

  • If both requests were routed to CA instances running the same configuration, then the profile used in both cases would have been the same.
  • If the final certificate was produced as a transformation of the precertificate, then requests being routed to different CA instances would not have mattered.
  • If we had explicit correspondence checks in place, then the process would have been aborted prior to issuance of the final certificate, and we would have caught the error during the Staging deploy.

Rapid Detection

One final note: our thanks go out to Andrew Ayer for quickly noticing and notifying us of this issue. That rapid detection was made possible largely because Let’s Encrypt submits all of our final certificates to CT logs. This enabled CertSpotter to identify the mismatch almost immediately, without having to wait for the certificates to be observed by relying parties and submitted to CT logs from there. We recommend that all CAs submit their final certificates to CT logs to reap the same benefits.

List of steps we are taking to resolve the situation and ensure that such situation or incident will not be repeated in the future, accompanied with a binding timeline of when your CA expects to accomplish each of these remediation steps.

We have already revoked all affected certificates. Prior to revocation, we marked all affected certificates as “impacted by an active incident” in our database, which caused our ACME Renewal Info endpoint to serve suggested renewal windows that would cause compliant ACME clients to renew immediately.

We commit to introducing a new check, similar to our existing lints, which explicitly verifies correspondence between the precertificate and the proposed final certificate. This check will run pre-issuance. We already perform pre-issuance linting before signing the precertificate and before signing the final certificate, by signing an equivalent "linting" certificate with an untrusted private key. This will allow us to compare the actual precertificate against the final linting certificate and abort before the actual signing if there is a mismatch. We will have this check deployed in production within the next six weeks, which gives us sufficient time to design, implement, review, and deploy the change, taking into account the approaching July 4th holiday.

We acknowledge that we could commit to making additional changes, such as the one suggested by Rob in Comment #2. However, in light of the checks we are adding, those changes would not serve to prevent another incident: they would only allow us to successfully issue during a similar profile configuration change, rather than aborting issuance. We do intend to make improvements here, but are currently discussing many different designs, and do not intend to commit to a timeline to implement any one of them.

Remediation                            Status    Date
Revoke affected certificates           Complete  2023-06-19
Add pre-issuance correspondence check  Started   2023-07-27
Whiteboard: [ca-compliance] [dv-misissuance] → [ca-compliance] [dv-misissuance] Next update 2023-07-27

Our change to add an explicit correspondence check has been deployed to our Production environment. This concludes our remediation items for this incident, and we ask that it be closed if there are no further questions.

I'll close this next week, on or about 5-July-2023, unless there are issues that still need to be discussed.

Flags: needinfo?(bwilson)
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Flags: needinfo?(bwilson)
Resolution: --- → FIXED
