Let’s Encrypt signs and publishes OCSP responses with a validity interval of 7 days. Automated systems are configured to produce updates for all OCSP responses whose thisUpdate field is 3 or more days in the past.
For a period of approximately 65 hours, Let’s Encrypt served OCSP responses whose thisUpdate field was more than 3.5 days in the past, a violation of the Microsoft Trusted Root Program Requirements, Section 3.C.2.b, which requires that “the next update must be available at ½ of the validity period” for OCSP responses with validity intervals greater than sixteen hours, like ours.
For a period of approximately 34 hours, Let’s Encrypt served OCSP responses whose thisUpdate field was more than 4 days in the past, a violation of the Baseline Requirements, Section 4.9.10, which requires that “the CA SHALL update the information… no later than four days after the thisUpdate” for OCSP responses with validity intervals greater than sixteen hours, like ours.
1. How we became aware of the problem
At 2021-09-05 01:51 UTC, our internal monitoring detected that the system responsible for updating OCSP responses (ocsp-updater) had fallen 2 hours behind our target 3-day update schedule. A warning alert fired, but was not received by the oncall because it was configured as a working-hours-only alert. The alert was received and incident response began at 2021-09-07 14:00 UTC.
2. Timeline
All times are given in UTC. Code changes which occurred in the public Boulder repository are linked; configuration changes which occurred in our internal repositories are not.
- 2021-09-02 20:51: An alert, indicating that at least one OCSP response is more than 10k seconds (2.8 hours) older than our desired threshold of 3 days, is sent to the SRE oncall. Although not a violation of any program requirements at this point, it is decided that action should be taken to prevent future potential regressions.
- 2021-09-02 22:52: A change to parallelize ocsp-updater is landed to our main branch. The change uses the config key “serialSuffixShards” to control what subset of the space of all possible final serial characters each instance should cover.
- 2021-09-02 22:53: The change is cherry-picked to the current release branch.
- 2021-09-02 23:10: The release branch is re-tagged for a hotfix release.
- 2021-09-03 19:35: The new version is deployed to production.
- 2021-09-03 20:13: Configuration is deployed to split one ocsp-updater instance into four instances. The configs deployed use the key “serialSuffixes” to control which quarter of the overall serial space each instance should cover.
- 2021-09-05 01:51: An alert, indicating that at least one OCSP response is more than 10k seconds (2.8 hours) beyond our desired threshold of 3 days, fires. The alert is not sent to the current oncall because it is configured as a “warning” (working-hours-only) level alert.
- 2021-09-06 01:00: The first OCSP response which was updated more than 3.5 days prior is served.
- 2021-09-07 05:00: The first OCSP response which was updated more than 4 days prior is served.
- 2021-09-07 14:00: The SRE oncall becomes aware of the issue as they begin their workday after the long weekend. An incident is declared and multiple staff begin responding.
- 2021-09-07 15:34: To reduce database connection churn, the maximum number of connections the ocsp-updater is allowed to maintain is increased. This has a negligible effect on the size of the backlog.
- 2021-09-07 16:00: Two of the four ocsp-updater instances are moved to an alternate data center to spread load between database replicas and HSMs. This has a negligible effect on the size of the backlog.
- 2021-09-07 16:30: A developer realizes that all four ocsp-updater instances are each trying to process the same number of OCSP responses per cycle as the single unified instance was, resulting in heavier database queries and bursty update patterns.
- 2021-09-07 16:36: A config change to reduce the batch size to one quarter of its current value is deployed. This results in smoother activity, but has a negligible effect on the size of the backlog.
- 2021-09-07 18:03: A change to reduce the database and network load of ocsp-updater is landed.
- 2021-09-07 18:32: The change is cherry-picked to the current release branch.
- 2021-09-07 18:33: The release branch is re-tagged for a hotfix release.
- 2021-09-07 19:37: The new version is deployed to production. This has no impact on the size of the backlog, as it is later discovered that a copy-paste error caused the change to have no effect.
- 2021-09-07 22:12: The four ocsp-updater instances are doubled to eight instances. This has a negligible effect on the size of the backlog.
- 2021-09-07 23:44: An SRE realizes that the config files deployed in production use the incorrect key “serialSuffixes”, rather than the correct key “serialSuffixShards”. Without shards correctly configured, each ocsp-updater instance was performing redundant work.
- 2021-09-08 00:28: Updated configs using the correct key are deployed to production. With the eight ocsp-updater instances now doing non-overlapping work, the backlog starts to burn down at a rate of about 8.2 million OCSP updates per hour.
- 2021-09-08 02:44: All OCSP responses served by our frontend were updated less than 4 days prior.
- 2021-09-08 05:58: All OCSP responses served by our frontend were updated less than 3.5 days prior.
- 2021-09-08 09:30: All OCSP responses served by our frontend were updated less than 3 days prior. Our internal alert stops firing.
- 2021-09-08 14:44: All OCSP responses cached by our CDN were updated less than 4 days prior. We are in compliance with the Baseline Requirements.
- 2021-09-08 17:58: All OCSP responses cached by our CDN were updated less than 3.5 days prior. We are in compliance with the Microsoft Trusted Root Program Requirements.
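The sharding scheme referenced throughout the timeline divides work by the final character of each certificate serial, with each ocsp-updater instance assigned a subset of suffixes via “serialSuffixShards”. A sketch of that dispatch logic, with an assumed helper name (this is illustrative, not Boulder’s actual implementation):

```go
package main

import (
	"fmt"
	"strings"
)

// belongsToShard reports whether this instance should update the OCSP
// response for a given certificate serial, based on the serial's final
// hex character. The suffixes argument stands in for this instance's
// "serialSuffixShards" config value.
func belongsToShard(serial string, suffixes []string) bool {
	if len(suffixes) == 0 {
		// Default when no shards are configured: process everything,
		// as the single pre-parallelization instance did.
		return true
	}
	last := serial[len(serial)-1:]
	for _, s := range suffixes {
		if strings.EqualFold(s, last) {
			return true
		}
	}
	return false
}

func main() {
	// One of four instances, covering a quarter of the hex serial space.
	shard := []string{"0", "1", "2", "3"}
	fmt.Println(belongsToShard("00abc2", shard)) // ends in "2": ours
	fmt.Println(belongsToShard("00abcf", shard)) // ends in "f": another instance's
}
```

With the misspelled key, every instance fell into the empty-suffix default branch, which is how all four (and later eight) instances ended up doing the same work.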
3. Continued issuance
Let’s Encrypt has stopped serving OCSP responses which are more than 3.5 days past their thisUpdate timestamp. Therefore Let’s Encrypt has also stopped serving OCSP responses which are more than 4 days past their thisUpdate timestamp.
At no time during the incident did Let’s Encrypt stop issuing certificates, nor did we ever stop issuing updated OCSP responses entirely.
At no time did any OCSP response become invalid due to being more than 7 days old.
4. Certificate summary
No problematic certificates were issued as part of this incident.
Approximately 35,853,345 certificates had OCSP responses which were not updated for more than 3.5 days, and approximately 12,249,311 certificates had OCSP responses which were not updated for more than 4 days. The affected certificates were distributed uniformly among those of our certificates which had been issued at least 3.5 days (or 4 days, respectively) before the incident began and had not yet expired.
5. Certificate details
No problematic certificates were issued as part of this incident.
6. Root Cause Analysis
From a narrow perspective, the root cause of this incident was that the configuration change on September 3rd contained a bug: a mistake in the name of the new configuration key. This bug meant that, rather than parallelizing OCSP updates into four streams, we instead quadrupled the amount of work that we were asking the database and HSMs to perform. The systems were not able to keep up with this increase in load and progressively fell behind. That said, it is somewhat reassuring that our systems nearly kept pace even under 4x, and later 8x, the normal load.
From a wider perspective, we believe this incident has two primary antecedents.
First, the alert which fired on September 5th was configured to be working-hours-only, and therefore did not page the current oncall. This was not an accidental misconfiguration. Historically this alert has been somewhat noisy, with many instances of OCSP updates becoming slightly (e.g. one or two hours) stale, and then fixing themselves with no human intervention. The combination of spurious alerts and significant headroom (4 days before any OCSP responses actually expire) led to the alert being classified as a “warning”. After the alert and accompanying deployment on September 3rd, a task was created to upgrade the alert to “critical” (i.e. one that would page during all hours), but that task was left for the following week.
Second, our services did not complain about an expected key missing from their configuration files, nor about an unexpected key being present. Again, both of these behaviors were conscious decisions.
For the sake of smooth deployability, a new version of our services that accepts a new config key must be able to function even if that config key is not provided. Many configuration keys have reasonable default values to use when that configuration key is not provided, and in this case there was an obvious default: if no shards are specified, process everything just like before. The ocsp-updater instances were designed to continue to function even in the absence of the new “serialSuffixShards” configuration key.
All of our services are configured with JSON-formatted configuration files. Famously, JSON does not support comments. To work around this, our configuration loading system intentionally ignores unrecognized keys: they might simply be throwaway key-value pairs used to annotate the reasoning behind the meaningful config values. So the ocsp-updater instances were also designed to continue to function even in the presence of the unrecognized “serialSuffixes” key.
These two factors, lenient configuration loading and an insufficiently critical alert, combined to allow the system to misbehave, undetected, for long enough to cause this incident.
7. Remediations and Action Items
As part of the incident response, we have already fixed the proximate cause: our production configuration files now use the correct “serialSuffixShards” key, and our ocsp-updater instances are not performing duplicate work.
As follow-ups, we have a number of remediation items that will address the root causes discussed above. We have already upgraded the criticality of this specific alert so that it will page at all hours; we have also added a new critical alert based on the age of the OCSP responses we serve, rather than just the age of the OCSP responses stored in our database. We will also perform an audit of all of our existing non-critical alerts to see if any others represent circumstances severe enough to merit an off-hours page. Finally, we will update our configuration loading system so that unexpected keys result in errors, allowing automation to detect bugs of the kind that caused this incident.
| Action item | Status |
|---|---|
| Correct the serialSuffixShards configuration key | Complete |
| Make stale OCSP responses alert “critical” | Complete |
| Add second stale OCSP responses alert representing client perspective | Complete |
| Review all existing non-critical alerts for whether they should be critical | Not yet started |
| Make config loading stricter | Not yet started |
We also have a variety of performance improvements planned for ocsp-updater. These will help increase its headroom so that it can continue to keep pace even in the face of increased workload, and are intended to allow us to increase the frequency at which we update OCSP responses, but we do not intend to commit to specific timelines for implementing them as they do not address the root causes of this incident.