Closed Bug 1793114 Opened 2 years ago Closed 1 year ago

Let's Encrypt: Incomplete and Inconsistent CRLs

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: aaron, Assigned: aaron)

Details

(Whiteboard: [ca-compliance] [crl-failure])

Attachments

(14 files)

Let’s Encrypt has detected a pair of issues affecting our new CRL infrastructure:

  • For a period of 30 hours, our CRLs included only approximately 10% of our total unexpired and revoked certificates. This is a violation of BRs Section 4.10.1, which requires that a revoked serial not be removed from a CRL until after it expires.
  • For a period of approximately 15 days, our revoked serials have not consistently appeared within the same CRL shard; they have been migrating between partitions of their appropriate full and complete CRL. Although both the Apple and Mozilla root programs have confirmed that they do not consider this to be a violation of BRs Section 4.10.1 (because the serial is still present in the full and complete CRL synthesized from its constituent shards), we are treating it as an incident because this behavior is unexpected.

We have resolved the first issue, and are actively working on resolving the second. We intend to supply a full incident report by Thursday, October 6.

Assignee: bwilson → aaron
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Whiteboard: [ca-compliance]

What follows is our incident report for the first bullet point in the original preliminary report. We will provide additional information about the second bullet point in a follow-up post tomorrow, Friday Oct 7.

Summary

For a period of approximately 6 days, from 2022-09-23 15:54 UTC through 2022-09-29 23:34 UTC, Let’s Encrypt was generating incomplete CRLs containing only 10% of all unexpired and revoked certificates. For the majority of that time, up until 2022-09-29 17:24 UTC, these CRLs were not published as a certificate status service (i.e. were not yet disclosed in CCADB). However, for the 6 hours between that disclosure and the end of the incident, we were in violation of BRs Section 4.10.1, which requires that certificate status services must not remove revocation entries until after the expiry date of the revoked certificate.

Incident Report

How we first became aware of the problem.

On 2022-09-29 17:54 UTC a Let’s Encrypt engineer was inspecting the output of downloading and parsing all of our CRLs and noticed that the total number of revocation entries was significantly smaller than expected. Some back-of-the-envelope math confirmed that the number of revocation entries was approximately an order of magnitude too small, and an investigation was immediately opened.

Timeline of incident and actions taken in response.

2022-07-18

  • 16:00 CA configuration augmented to set CRL lifetime to 216h (9 days)

2022-07-22

  • 17:00 CRL configuration augmented to know that certificate lifetime is 2160h (90 days)

2022-09-08

  • 00:00 First CRLs signed

2022-09-23

  • 14:20 As part of a configuration refactoring, the CRL configuration "certificate lifetime" value (previously 2160h) is accidentally set to reference the "CRL lifetime" value (216h)
  • 15:00 Incorrect configuration deployed
  • 15:54 First incomplete CRLs signed (incompleteness incident begins)

2022-09-29

  • 17:24 CRL URLs disclosed in CCADB
  • 17:54 CRL inconsistency detected by an engineer (investigation begins)
  • 18:15 CRL URLs removed from CCADB
  • 18:48 Historical CRLs retrieved and compared (incompleteness incident confirmed)
  • 21:57 Potential fix, based on the idea that the CRL update process was silently ignoring errors, merged
  • 22:47 New version of crl-updater deployed
  • 22:49 New CRLs generated, incompleteness persists
  • 23:20 Confirmed that database was being queried for too-small shards, rather than running into errors
  • 23:23 Discovered updater was configured with certificate lifetime of 216h (incompleteness proximate cause found)
  • 23:25 Deployed configuration with correct value of 2160h
  • 23:34 Complete CRLs generated (incompleteness incident ends)

Whether we have stopped the process giving rise to the problem or incident.

We did not halt certificate or CRL issuance during our investigation or remediation. We stopped generating incomplete CRLs at 2022-09-29 23:34 UTC.

Summary of the affected certificates.

Approximately 97% of unexpired revoked Let’s Encrypt subscriber certificates were affected by the incomplete CRLs incident. The unaffected 3% were those revoked certificates whose expiration date remained within the 9-day window covered by the incomplete CRLs for the whole 6-day duration of the incident. In total, about 570k unexpired revoked certificates were affected.

Complete certificate data for the affected certificates.

Attached is a collection of files containing crt.sh URLs for all affected certificates. The URLs are in the format https://crt.sh/?sha256=<cert fingerprint>, and all link to the precertificate version of the certificate in question. The files are sharded with at most 100k URLs per file, and are compressed via zstd for ease of upload and download.

Explanation of how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

For many years, Let’s Encrypt has used a templating system to generate the configuration files for the various components of Boulder, our issuance software. For example, the following two snippets (extracted from different files) result in the CA knowing that the certificates it issues should have a validity interval of 90 days:

boulder:
  cert:
    lifespan: "2160h"

"ca": {
  "expiry": "{{ boulder.cert.lifespan }}",
}

This templating system allows us to reuse values in a modular fashion, so that we don’t have to hard-code constant values in multiple places. For example, the CRL Updater also needs to know how long a certificate is valid for, so it knows how many days worth of issuance need to be covered by the CRL shards:

"crlUpdater": {
  "certificateLifetime": "{{ boulder.cert.lifespan }}",
}

As we deployed our CRL infrastructure, we had to add several new configuration stanzas to our existing files. One of the necessary additions was to tell the CA the desired validity interval for the CRLs it issues. We selected a validity interval of 9 days, to come in safely below the BRs maximum of 10 days:

boulder:
  crl:
    lifespan: "216h"

"ca": {
  "lifespanCRL": "{{ boulder.crl.lifespan }}",
}

Note the similar names of boulder.cert.lifespan and boulder.crl.lifespan. When a refactoring changed the way values are templated into our configuration files, a simple mistake slipped into the diff, replacing the CRL Updater’s "certificateLifetime" value with the CRL lifespan instead. This mistake was not caught by human reviewers, and it does not inherently prevent the service from deploying (a 9-day certificate lifetime is a very real possibility in the future!), so the service deployed successfully with this configuration.

As a result, the CRL Updater believed that it only needed to cover 9 days of issuance across all of its shards. It obliged, successfully queried the database for all revoked certificates expiring in the next 9 days, and generated CRLs containing all of those entries. This left 81 days (90%) worth of unexpired certificates excluded from the CRLs.
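
To make the failure mode concrete, the following is a minimal sketch of this kind of purely config-driven window computation (illustrative only, not Boulder's actual code; the function and variable names are ours):

package main

import (
    "fmt"
    "time"
)

// coverageWindow returns the span of notAfter timestamps the updater
// believes it must cover, derived purely from configuration.
func coverageWindow(now time.Time, certLifetime time.Duration) (time.Time, time.Time) {
    // Only certificates expiring between now and now+certLifetime will be
    // requested from the database and included in the shards.
    return now, now.Add(certLifetime)
}

func main() {
    now := time.Date(2022, 9, 23, 15, 54, 0, 0, time.UTC)

    // Intended configuration: 2160h (90 days) of issuance is covered.
    start, end := coverageWindow(now, 2160*time.Hour)
    fmt.Println("intended window:", start, "->", end)

    // Misconfigured value: 216h (9 days), so roughly 90% of unexpired
    // revoked certificates fall outside the window and are silently dropped.
    start, end = coverageWindow(now, 216*time.Hour)
    fmt.Println("actual window:  ", start, "->", end)
}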

This points to another contributing root cause: the CRL Updater’s behavior is fragile. It bases its conception of the work it needs to do solely on its configuration, rather than on the actual set of unexpired revoked certificates stored in the database. If the CRL Updater had queried the database for all unexpired revoked certificates, rather than just those falling within the shard boundaries it computes from its config, it would have continued to cover all certificates regardless of this misconfiguration.
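
For comparison, a less fragile approach (a rough sketch under assumed table and column names, not Boulder's schema) would derive the window from the data itself, which is what the remediation item below commits to:

package crlupdater

import (
    "context"
    "database/sql"
    "time"
)

// farthestNotAfter asks the database for the latest notAfter among
// unexpired certificates, so the coverage window is derived from the data
// itself and a bad configuration value cannot silently shrink it.
// The table and column names below are assumptions for illustration.
func farthestNotAfter(ctx context.Context, db *sql.DB) (time.Time, error) {
    var max time.Time
    err := db.QueryRowContext(ctx,
        "SELECT MAX(notAfter) FROM certificateStatus WHERE notAfter > ?",
        time.Now(),
    ).Scan(&max)
    return max, err
}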

Finally, the error would have been caught much more quickly, and would have been easier to diagnose, if we had sufficient black-box monitoring of our CRLs in place. For our OCSP infrastructure, we have systems which regularly issue and revoke certificates for domains we control, then check that the OCSP status is served and updated appropriately. We have plans and partial designs for similar systems which would issue and revoke a certificate and then ensure that its entry remains in the full and complete CRL for the certificate's whole lifetime, but we were not able to complete and deploy them prior to the October 1st deadline.
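
The planned check is conceptually simple; a rough sketch (using recent Go crypto/x509 CRL types, with the shard URL list and the alerting side left out as placeholders) might look like this:

package crlmonitor

import (
    "crypto/x509"
    "fmt"
    "io"
    "math/big"
    "net/http"
)

// serialPresent downloads and parses every CRL shard and reports whether
// the given serial appears in any of them, i.e. whether it is present in
// the full and complete CRL. A monitor would call this repeatedly for a
// certificate we issued and revoked ourselves, until that certificate expires.
func serialPresent(shardURLs []string, serial *big.Int) (bool, error) {
    for _, u := range shardURLs {
        resp, err := http.Get(u)
        if err != nil {
            return false, err
        }
        der, err := io.ReadAll(resp.Body)
        resp.Body.Close()
        if err != nil {
            return false, err
        }
        crl, err := x509.ParseRevocationList(der)
        if err != nil {
            return false, fmt.Errorf("parsing %s: %w", u, err)
        }
        for _, entry := range crl.RevokedCertificateEntries {
            if entry.SerialNumber.Cmp(serial) == 0 {
                return true, nil
            }
        }
    }
    return false, nil
}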

List of steps we are taking to resolve the situation and ensure that such situation or incident will not be repeated in the future, accompanied with a binding timeline of when your CA expects to accomplish each of these remediation steps.

We have already put in place the configuration change which resolved the immediate incident. We updated the CRL Updater’s "certificateLifetime" configuration value to have the correct value of "2160h".

For long term prevention of similar incidents, we intend to make two changes. First, we will change CRL Updater’s logic to begin by querying for the furthest-future notAfter date among unexpired certificates, and have it ensure that every single unexpired revoked certificate is included in the resulting CRLs. Second, we will deploy a black-box monitoring system that issues and revokes certificates on a regular basis, and watches to ensure that those certificates remain in the same shard for every generation of CRLs we produce until they expire. We commit to having both of these items complete within the next six weeks, to give us time to deploy them carefully.

Remediation | Status | Date
Set the CRL Updater’s "certificateLifetime" config to the correct value | Complete | 2022-09-29
Deploy a monitor which revokes certificates and watches for them to remain in their assigned shard until after they expire | Not started | 2022-11-18
Change the CRL Updater to query for the farthest-future expiration in the database, and ensure its shards cover that whole period | Not started | 2022-11-18

What follows is our report for the second bullet point in the original preliminary report. We have not separated this report out into its own bug here on Bugzilla for two reasons: first, the remediation items are largely shared with the primary incident here; and second, as far as we are aware, no root program considers this to be a compliance incident. We are providing a full report for our own benefit and for the benefit of the community.

Summary

For a period of approximately 28 days, from 2022-09-08 06:00 UTC through 2022-10-07 13:01 UTC, Let’s Encrypt was generating inconsistent CRL partitions, with each shard containing a different subset of the revoked serials each time it was issued. The Apple and Mozilla root programs have both stated that they do not consider a revocation entry moving between CRL partitions to be a compliance incident, as long as the entry remains in the full and complete CRL composed of those partitions. Despite this, we decided to design our system to minimize that possibility, to make a future transition to including the CRL URL in the CRL Distribution Point extension of our certificates easier. Because this behavior was undesired and unexpected, we are disclosing these details for the benefit of the community.

Report

How we first became aware of the problem.

During the investigation of the incompleteness incident, we also noticed that the number of revocation entries in individual CRL shards seemed to be fluctuating, going up and down by small amounts with every generation. Due to our notAfter-based sharding strategy, this should not happen: only the single oldest shard at any given time should be losing entries as they expire. We hypothesized that this was due to serials moving between shards, and consciously decided to delay investigation until after the incomplete CRLs incident was fully understood. We returned to investigate this once the primary incident was resolved.

Timeline and actions taken in response.

2022-09-08

  • 00:00 First CRLs signed
  • 06:00 Second CRLs signed (inconsistency begins)

2022-09-29

  • 17:24 CRL URLs disclosed in CCADB
  • 17:54 CRL inconsistency detected by an engineer (investigation begins)
  • 18:15 CRL URLs removed from CCADB

2022-09-30

  • 16:35 Confirmed that serials are moving between shards (inconsistency confirmed)
  • 16:37 Discovered that shard timestamp boundaries are shifting
  • 18:03 CRL URLs re-disclosed in CCADB
  • 18:04 Preliminary incident report filed
  • 19:02 Determined that shard boundaries are shifting because the math which computes them surpasses the maximum representable time duration in Go, 290 years (inconsistency proximate cause found)
  • 19:09 Potential fix developed
  • 19:45 Decision made to not rush fix to production before the weekend

2022-10-01

  • 00:00 Apple and Mozilla’s CRL CCADB disclosure requirements go into effect

2022-10-03

  • 18:24 Fix merged

2022-10-07

  • 14:56 Fix deployed
  • 16:36 Set one datacenter’s run frequency to every 1h, instead of every 6h, to observe the fix faster
  • 18:09 Suspicion that CRL shard boundaries are still not fully stable
  • 18:45 Suspicion confirmed: each datacenter is staying consistent with itself, but shard boundaries differ based on which datacenter is running the updater
  • 19:20 Determine that update frequency affects shard boundaries via intentional-but-obscure interaction
  • 19:32 Reset all datacenters to have a run frequency of 6h
  • 20:08 All datacenters produce consistent shards (inconsistency ends)

Whether we have stopped the process giving rise to the problem or incident.

We did not halt certificate or CRL issuance during our investigation or remediation. The last inconsistent CRL shards were generated prior to 2022-10-07 20:08 UTC.

Summary of the affected certificates.

Approximately 99% of unexpired revoked Let’s Encrypt subscriber certificates were affected, with nearly every revoked certificate changing shards at least once. The unaffected 1% were those certificates which were revoked less than one shard width prior to the end of the inconsistency. In total, about 740k unexpired revoked certificates were affected.

Complete certificate data for the affected certificates.

Attached is a collection of files containing crt.sh URLs for all affected certificates. The URLs are in the format https://crt.sh/?sha256=<cert fingerprint>, and all link to the precertificate version of the certificate in question. The files are sharded with at most 100k URLs per file, and are compressed via zstd for ease of upload and download.

Explanation of how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

This inconsistency issue has an easier-to-pinpoint but harder-to-explain root cause. Our sharding strategy is time-based, with individual certificates being assigned to the shard which contains their expiration date. To determine which shards correspond to which dates, we divide the whole history of time into repeating, unchanging chunks and label them 0 through N-1, where N is the number of shards our CRL is being partitioned into. We size the chunks based on the current lifetime of our certificates (90 days), so for instance if we configured 90 shards, each shard would cover approximately 1 day's worth of certificates. As time moves forward, the only thing that changes is which chunk with a given label contains unexpired certificates, and therefore which chunk with label X will be output as shard X.
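
A minimal sketch of this assignment (illustrative, not Boulder's implementation; the 90-shard / 24-hour parameters simply mirror the example above, and the anchor is the value Let’s Encrypt later adopted, per the remediation section below, so the arithmetic stays well-behaved):

package main

import (
    "fmt"
    "time"
)

// shardFor maps a certificate's notAfter into one of numShards labels by
// dividing all of time into fixed-width chunks anchored at `anchor` and
// reusing the labels 0..numShards-1 cyclically.
func shardFor(notAfter, anchor time.Time, width time.Duration, numShards int) int {
    chunk := notAfter.Sub(anchor) / width // which fixed chunk contains notAfter
    return int(chunk) % numShards         // chunks cycle through the shard labels
}

func main() {
    // Hypothetical parameters: 90 shards, each 24h wide, covering roughly
    // 90 days of unexpired certificates at any given time.
    anchor := time.Date(2015, 6, 4, 11, 4, 38, 0, time.UTC)
    notAfter := time.Date(2022, 12, 25, 6, 0, 0, 0, time.UTC)
    fmt.Println("shard:", shardFor(notAfter, anchor, 24*time.Hour, 90))
}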

In order to perform this division of time, we need an anchor time that all chunks exist relative to. Because Go has a zero value for all types, and the zero value for time.Time is midnight UTC on January 1st of year 1, we selected that time as our anchor. However, subtracting one timestamp from another in Go returns a time.Duration value, which has a maximum of approximately 290 years (2^63 - 1 nanoseconds), significantly less than the roughly 2,000 years between the present day and the zero time. This meant that, no matter what the current time was, the result of time.Now().Sub(time.Time{}) was always the same: time.Duration(math.MaxInt64).
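
The saturating behavior is easy to reproduce: time.Time.Sub clamps to the maximum representable Duration rather than overflowing or returning an error.

package main

import (
    "fmt"
    "math"
    "time"
)

func main() {
    // Subtracting the Go zero time from the present exceeds what a
    // time.Duration can hold, and Sub saturates at 2^63 - 1 nanoseconds
    // (roughly 290 years) instead of failing.
    d := time.Now().Sub(time.Time{})
    fmt.Println(d == time.Duration(math.MaxInt64)) // true
    fmt.Println(d)                                 // 2562047h47m16.854775807s
}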

In turn, this caused the math which computed chunk boundaries relative to the current time to not work as expected. Rather than having chunks fixed in time, and simply selecting which chunks to include in the CRL shards, the chunk boundaries themselves were drifting forward in time. Certificates had their notAfter timestamp fall into different chunks, and therefore their revocation entries appeared in different shards.

This behavior of Go’s time types is documented, but it was not noticed by the engineers writing or reviewing the sharding code. There were unit tests to ensure that chunk boundaries were computed correctly relative to the current time, but those tests used “current time” values very close to the Go zero time, so they did not run afoul of this bug.
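
A regression test exercising a realistic clock would have surfaced the problem; a hypothetical sketch of such a test:

package sharding

import (
    "math"
    "testing"
    "time"
)

// TestSubNearAndFarFromZeroTime illustrates why the original unit tests
// passed: near the zero-time anchor the subtraction is exact, while with a
// realistic 2022 clock it silently saturates at the maximum Duration.
func TestSubNearAndFarFromZeroTime(t *testing.T) {
    anchor := time.Time{} // the zero time, as originally used for the anchor

    // Close to the anchor (what the original tests used): exact result.
    if got := anchor.Add(48 * time.Hour).Sub(anchor); got != 48*time.Hour {
        t.Fatalf("near zero time: got %v, want 48h", got)
    }

    // With a production-like timestamp: the subtraction saturates, which is
    // the case the original tests never exercised.
    now := time.Date(2022, 9, 8, 0, 0, 0, 0, time.UTC)
    if got := now.Sub(anchor); got != time.Duration(math.MaxInt64) {
        t.Fatalf("expected saturated duration, got %v", got)
    }
}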

Although the fix for the initial issue was correct, we fortuitously ran into a second issue while deploying the fix. We set the run frequency of one of our CRL Updaters to be 1 hour, instead of 6 hours, so that we could observe the fix in a more timely fashion. This resulted in the shard boundaries being inconsistent for a different reason.

The CRL Updater doesn’t only include currently-unexpired certificates in its CRLs. Due to the requirement that a certificate appear in at least one CRL even if it is revoked mere seconds before it expires, we make sure to include every revoked certificate in at least one CRL issued after it expires, just in case. This “lookback period” is proportional to the CRL Updater’s run frequency: the less often it runs, the further back it has to look to make sure it catches any certificates that were both revoked and expired since its last run.

But the CRL Updater includes this lookback period in the calculation of the total window of time it has to cover, and then divides that total window by the number of shards to determine the shard width. And when the shard width changes, of course the boundary timestamps change as well. From a stability standpoint, this computation is backwards; the shard width and boundaries should be fully static, with other values derived from them instead.
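
A sketch of the problematic derivation (the names and the exact lookback formula are illustrative assumptions, not Boulder's code) shows how the run frequency leaks into every boundary:

package main

import (
    "fmt"
    "time"
)

// shardWidth illustrates the backwards dependency: run frequency feeds the
// lookback period, the lookback period feeds the total window, and the
// window divided by the shard count gives the shard width, so changing how
// often the updater runs moves every shard boundary.
func shardWidth(certLifetime, runFrequency time.Duration, numShards int) time.Duration {
    lookback := 2 * runFrequency // illustrative: keep just-expired entries around for an extra run
    window := certLifetime + lookback
    return window / time.Duration(numShards)
}

func main() {
    // Hypothetical parameters: 90-day certificates, 90 shards.
    fmt.Println(shardWidth(2160*time.Hour, 6*time.Hour, 90)) // 24h8m0s
    fmt.Println(shardWidth(2160*time.Hour, 1*time.Hour, 90)) // 24h1m20s
}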

Similar to the incompleteness incident above, black-box monitoring to detect certificates moving between shards would have helped detect and diagnose these issues. This work was planned but not yet completed prior to the October 1st deadline.

List of steps we are taking to resolve the situation and ensure that such situation or incident will not be repeated in the future, accompanied with a binding timeline of when your CA expects to accomplish each of these remediation steps.

We have already put in place the code and configuration changes which resolved the immediate issue described above. We updated the CRL Updater’s anchor time to be 2015-06-04 11:04:38 UTC, the time at which Let’s Encrypt’s first root key material was generated, significantly less than 290 years ago. We also ensured that all datacenters have the same run frequency configured.

For long term prevention of similar issues, we re-commit to the same remediation item described for the incompleteness incident above: deploying a black-box monitoring system that issues and revokes certificates on a regular basis, and watches to ensure that those certificates remain in the same shard for every generation of CRLs we produce until they expire. We also will continue to make the CRL Updater’s configuration less fragile, and critical portions of its behavior less tied to its configuration.

Remediation | Status | Date
Set the CRL Updater’s shard computation anchor time to less than 290 years in the past | Complete | 2022-10-07
Make shard boundaries independent of run frequency | Not started | 2022-11-17

We have changes under review for our remediation items on stabilizing shard contents, and we have a design (currently internal only, apologies) for our monitoring system. We will continue to provide updates until the work is complete.

We are deploying the scaffolding for our remediation items this week, and expect to deploy the last set of fixes next week. We will continue to provide updates until the work is complete.

Update: The last set of Boulder-side fixes for shard stability is being deployed at this moment. The implementation of the black-box monitor is nearly complete, and works in local test environments. We expect to deploy it next week.

No major update from last week: the black-box monitor is nearly complete and we expect to deploy it soon.

Product: NSS → CA Program

As of yesterday, 2022-11-17, the external monitor for our CRLs has been deployed. It regularly issues and revokes certificates for a test domain we control, verifies that entries for those certificates appear in the next generation of CRLs, and checks that all entries removed from a CRL are for certificates which expired prior to the issuance of the last CRL they appeared on. These checks are done on a per-shard basis, so we will be alerted not only if an entry is removed entirely, but also if an entry moves between shards.
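
As a rough sketch of the per-shard removal check described above (illustrative only; the real monitor's implementation differs, and the expiry lookup is assumed to be populated elsewhere from the certificates we issued):

package crlmonitor

import (
    "crypto/x509"
    "fmt"
    "time"
)

// checkShardRemovals compares two consecutive generations of one shard.
// Every entry present in the previous generation must either still be
// present in the new one, or belong to a certificate that had already
// expired when the previous generation was issued. expiry maps a decimal
// serial string to that certificate's notAfter.
func checkShardRemovals(prev, next *x509.RevocationList, expiry map[string]time.Time) error {
    present := make(map[string]bool)
    for _, e := range next.RevokedCertificateEntries {
        present[e.SerialNumber.String()] = true
    }
    for _, e := range prev.RevokedCertificateEntries {
        serial := e.SerialNumber.String()
        if present[serial] {
            continue
        }
        notAfter, ok := expiry[serial]
        if !ok || notAfter.After(prev.ThisUpdate) {
            return fmt.Errorf("serial %s was removed from its shard before expiry", serial)
        }
    }
    return nil
}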

This concludes our committed remediation items. We do not intend to supply any further updates on this issue, and ask that it be closed. We will monitor it for questions and comments.

Are there any additional questions or issues to be raised by the community? If not, I plan to close this on or about Wed. 23-Nov-2022.

Flags: needinfo?(bwilson)
Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Flags: needinfo?(bwilson)
Resolution: --- → FIXED
Whiteboard: [ca-compliance] → [ca-compliance] [crl-failure]