Closed Bug 1729567 Opened 3 months ago Closed 2 months ago

Let's Encrypt: Delay updating OCSP responses

Categories

(NSS :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: aaron, Assigned: aaron)

Details

(Whiteboard: [ca-compliance] Next update 2021-10-01)

This is a preliminary incident report.

At 2021-09-05 01:51 UTC, our internal monitoring detected that the system responsible for updating OCSP responses had fallen 2 hours behind our target 3-day update schedule. A warning alert was fired, but was not received by the oncall due to being configured as a working-hours-only alert.

At approximately 2021-09-06 01:00 UTC, Let’s Encrypt began serving OCSP responses which had not been updated in the previous 3.5 days, in violation of the Microsoft Trusted Root Program Requirements, Section C. At approximately 2021-09-07 05:00 UTC, Let’s Encrypt began serving OCSP responses which had not been updated in the previous 4 days, in violation of the Baseline Requirements, Section 4.9.10.

We became aware of the issue during an oncall shift change at 2021-09-07 14:00 UTC and began incident response at that time. We have deployed a few technical mitigations, which appear to have allowed our systems to keep up with ongoing OCSP update volume, but not to work through the backlog and bring current staleness back within program requirements. The root cause is not yet clear.

We will provide updated information as we have it. Our next update will be on 2021-09-08.

Assignee: bwilson → aaron
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Whiteboard: [ca-compliance]

As of 2021-09-08 02:44 UTC our OCSP responses were back in compliance with the Baseline Requirements, all having been updated within the last 4 days. As of 2021-09-08 05:58 UTC our OCSP responses were back in compliance with the Microsoft Trusted Root Program Requirements, all having been updated within the last 3.5 days. As of 2021-09-08 09:30 UTC our OCSP responses were back in compliance with our internal target of having been updated within the last 3 days.

The root cause has been identified and fixed, and multiple other remediation items have been identified for immediate or future work.

A full incident report will be provided by 2021-09-10.

A note for the Mozilla Root Program: A strict reading of https://wiki.mozilla.org/CA/Responding_To_An_Incident indicates that neither a summary of nor complete certificate data for all affected certificates is necessary in this case, as this incident does not involve problematic certificates. Please clarify if you would like such information provided regardless.

I don't believe that the complete certificate data for all affected certificates is necessary in this situation involving OCSP.

The specific question is addressed in the existing language:

In other incidents that do not involve enumerating the affected certificates (e.g. OCSP failures, audit findings, delayed responses, etc.), please provide other similar statistics, aggregates, and a summary for each type of problem identified. This will help us measure the severity of each problem.

This was added in the 2020-06-29 modifications.

Incident Report

Let’s Encrypt signs and publishes OCSP responses with a validity interval of 7 days. Automated systems are configured to produce updates for all OCSP responses whose thisUpdate field is 3 or more days in the past.

For a period of approximately 65 hours, Let’s Encrypt served OCSP responses whose thisUpdate field was more than 3.5 days in the past, a violation of the Microsoft Trusted Root Program Requirements, Section 3.C.2.b, which requires that “the next update must be available at ½ of the validity period” for OCSP responses with validity intervals greater than sixteen hours, like ours.

For a period of approximately 34 hours, Let’s Encrypt served OCSP responses whose thisUpdate field was more than 4 days in the past, a violation of the Baseline Requirements, Section 4.9.10, which requires that “the CA SHALL update the information… no later than four days after the thisUpdate” for OCSP responses with validity intervals greater than sixteen hours, like ours.

1. How we became aware of the problem

At 2021-09-05 01:51 UTC, our internal monitoring detected that the system responsible for updating OCSP responses (ocsp-updater) had fallen 2 hours behind our target 3-day update schedule. A warning alert was fired, but was not received by the oncall due to being configured as a working-hours-only alert. The alert was received and incident response began at 2021-09-07 14:00 UTC.

2. Timeline

All times are given in UTC. Code changes which occurred in the public Boulder repository are linked; configuration changes which occurred in our internal repositories are not.

The pledge:

  • 2021-09-02 20:51: An alert, indicating that at least one OCSP response is more than 10k seconds (2.8 hours) older than our desired threshold of 3 days, is sent to the SRE oncall. Although not a violation of any program requirements at this point, it is decided that action should be taken to prevent future potential regressions.
  • 2021-09-02 22:52: A change to parallelize ocsp-updater is landed to our main branch. The change uses the config key “serialSuffixShards” to control what subset of the space of all possible final serial characters each instance should cover.
  • 2021-09-02 22:53: The change is cherry-picked to the current release branch.
  • 2021-09-02 23:10: The release branch is re-tagged for a hotfix release.
  • 2021-09-03 19:35: The new version is deployed to production.
  • 2021-09-03 20:13: Configuration is deployed to split one ocsp-updater instance into four instances. The configs deployed use the key “serialSuffixes” to control which quarter of the overall serial space each instance should cover.
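The sharding approach deployed above can be sketched as follows. This is a simplified illustration, assuming shards are expressed as a space-separated list of final serial hex characters; the struct shape and matching logic are assumptions, not Boulder's actual implementation:

```go
package main

import (
	"fmt"
	"strings"
)

// Config mirrors the shape of the new configuration key described above.
type Config struct {
	// SerialSuffixShards lists the final hex characters of certificate
	// serials that this instance is responsible for, e.g. "0 1 2 3".
	SerialSuffixShards string `json:"serialSuffixShards"`
}

// inShard reports whether this instance should process the given serial.
// With four instances covering "0 1 2 3", "4 5 6 7", "8 9 a b", and
// "c d e f", each serial is handled by exactly one instance.
func inShard(cfg Config, serial string) bool {
	suffix := serial[len(serial)-1:]
	for _, s := range strings.Fields(cfg.SerialSuffixShards) {
		if strings.EqualFold(s, suffix) {
			return true
		}
	}
	return false
}

func main() {
	cfg := Config{SerialSuffixShards: "0 1 2 3"}
	fmt.Println(inShard(cfg, "03a7f2")) // true
	fmt.Println(inShard(cfg, "03a7c9")) // false
}
```

With a 16-character hex alphabet, splitting the suffix space four ways gives each instance a disjoint quarter of all serials, which is what makes the parallelization safe.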

The turn:

  • 2021-09-05 01:51: An alert, indicating that at least one OCSP response is more than 10k seconds (2.8 hours) beyond our desired threshold of 3 days, fires. The alert is not sent to the current oncall because it is configured as a “warning” (working-hours-only) level alert.
  • 2021-09-06 01:00: The first OCSP response which was updated more than 3.5 days prior is served.
  • 2021-09-07 05:00: The first OCSP response which was updated more than 4 days prior is served.
  • 2021-09-07 14:00: The SRE oncall becomes aware of the issue as they begin their workday after the long weekend. An incident is declared and multiple staff begin responding.
  • 2021-09-07 15:34: To reduce database connection churn, the maximum number of connections the ocsp-updater is allowed to maintain is increased. This has a negligible effect on the size of the backlog.
  • 2021-09-07 16:00: Two of the four ocsp-updater instances are moved to an alternate data center to spread load between database replicas and HSMs. This has a negligible effect on the size of the backlog.
  • 2021-09-07 16:30: A dev realizes that all four ocsp-updater instances are each trying to process the same number of OCSP responses per cycle as the single unified instance was, resulting in heavier database queries and bursty update patterns.
  • 2021-09-07 16:36: A config change to reduce the batch size to one quarter of its current value is deployed. This results in smoother activity, but has a negligible effect on the size of the backlog.
  • 2021-09-07 18:03: A change to reduce the database and network load of ocsp-updater is landed.
  • 2021-09-07 18:32: The change is cherry-picked to the current release branch.
  • 2021-09-07 18:33: The release branch is re-tagged for a hotfix release.
  • 2021-09-07 19:37: The new version is deployed to production. This has no impact on the size of the backlog, as it is later discovered that a copy-paste error caused the change to have no effect.
  • 2021-09-07 22:12: The four ocsp-updater instances are doubled to eight instances. This has a negligible effect on the size of the backlog.

The prestige:

  • 2021-09-07 23:44: An SRE realizes that the config files deployed in production use the incorrect key “serialSuffixes”, rather than the correct key “serialSuffixShards”. Without shards correctly configured, each ocsp-updater instance was performing redundant work.
  • 2021-09-08 00:28: Updated configs using the correct key are deployed to production. With the eight ocsp-updater instances now doing non-overlapping work, the backlog starts to burn down at a rate of about 8.2 million OCSP updates per hour.
  • 2021-09-08 02:44: All OCSP responses served by our frontend were updated less than 4 days prior.
  • 2021-09-08 05:58: All OCSP responses served by our frontend were updated less than 3.5 days prior.
  • 2021-09-08 09:30: All OCSP responses served by our frontend were updated less than 3 days prior. Our internal alert stops firing.
  • 2021-09-08 14:44: All OCSP responses cached by our CDN were updated less than 4 days prior. We are in compliance with the Baseline Requirements.
  • 2021-09-08 17:58: All OCSP responses cached by our CDN were updated less than 3.5 days prior. We are in compliance with the Microsoft Trusted Root Program Requirements.

3. Continued issuance

Let’s Encrypt has stopped serving OCSP responses which are more than 3.5 days past their thisUpdate timestamp. Therefore Let’s Encrypt has also stopped serving OCSP responses which are more than 4 days past their thisUpdate timestamp.

At no time during the incident did Let’s Encrypt stop issuing certificates, nor did we ever stop issuing updated OCSP responses entirely.

At no time did any OCSP response become invalid due to being more than 7 days old.

4. Certificate summary

No problematic certificates were issued as part of this incident.

Approximately 35,853,345 certificates had OCSP responses which were not updated for more than 3.5 days, and approximately 12,249,311 certificates had OCSP responses which were not updated for more than 4 days. The affected certificates were distributed uniformly from among our certificates which had been issued at least 3.5 days (or 4 days, respectively) before the incident began and had not yet expired.

5. Certificate details

No problematic certificates were issued as part of this incident.

6. Root Cause Analysis

From a narrow perspective, the root cause of this incident was that the configuration change on September 3rd contained a bug: a mistake in the name of the new configuration key. This bug meant that, rather than parallelizing OCSP updates into four streams, we instead quadrupled the amount of work that we were asking the database and HSMs to perform. The systems were not able to keep up with this increase in load and progressively fell behind. That said, it is somewhat reassuring that our systems kept pace with 4x, and later 8x, the normal load as well as they did.

From a wider perspective, we believe this incident has two primary antecedents.

First, the fact that the alert which fired on September 5th was configured to be working-hours-only, and therefore did not page the current oncall. This was not an accidental misconfiguration. Historically this alert has been somewhat noisy, with many instances of OCSP updates becoming slightly (e.g. one or two hours) stale, and then fixing themselves with no human intervention. The combination of spurious alerts and significant headroom (4 days before any OCSP responses actually expire) led to the alert being classified as a “warning”. After the alert and accompanying deployment on September 3rd, a task was created to update the alert to “critical” (i.e. would page during all hours), but that task was left for the following week.

Second, the fact that our services did not complain about missing an expected key in their configuration files, nor about seeing an unexpected key. Again, both of these behaviors were conscious decisions.

For the sake of smooth deployability, a new version of our services that accepts a new config key must be able to function even if that config key is not provided. Many configuration keys have reasonable default values to use when that configuration key is not provided, and in this case there was an obvious default: if no shards are specified, process everything just like before. The ocsp-updater instances were designed to continue to function even in the absence of the new “serialSuffixShards” configuration key.

All of our services are configured with JSON-formatted configuration files. Famously, JSON does not support comments. To work around this, our configuration loading system intentionally ignores unrecognized keys: they might simply be throwaway key-value pairs used to annotate the reasoning behind the meaningful config values. So the ocsp-updater instances were also designed to continue to function even in the presence of the unrecognized “serialSuffixes” key.

These two factors -- lack of strict configuration, and lack of an appropriately critical alert -- combined to allow the system to misbehave, undetected, for a long enough period to cause this incident.

7. Remediations and Action Items

As part of the incident response, we have already fixed the proximate cause: our production configuration files now use the correct “serialSuffixShards” key, and our ocsp-updater instances are not performing duplicate work.

As follow-ups, we have a number of remediation items that will address the root causes discussed above. We have already upgraded the criticality of this specific alert so that it will page at all hours; we have also added a new critical alert based on the age of the OCSP responses we serve, rather than just the age of the OCSP responses stored in our database. We will also perform an audit of all of our existing non-critical alerts to see if any other represent circumstances severe enough to merit an off-hours page. Finally, we will update our configuration loading system so that unexpected keys result in errors, allowing automation to detect bugs of the kind that caused this incident.

Description | Kind | Status | Due Date
Correct the serialSuffixShards configuration key | Mitigate | Complete | 2021-09-07
Make stale OCSP responses alert “critical” | Detect | Complete | 2021-09-07
Add second stale OCSP responses alert representing client perspective | Detect | Complete | 2021-09-07
Review all existing non-critical alerts for whether they should be critical | Detect | Not yet started | 2021-10-01
Make config loading stricter | Prevent | Not yet started | 2021-10-08

We also have a variety of performance improvements planned for ocsp-updater. These will help increase its headroom so that it can continue to keep pace even in the face of increased workload, and are intended to allow us to increase the frequency at which we update OCSP responses, but we do not intend to commit to specific timelines for implementing them as they do not address the root causes of this incident.

Providing an update after one week of inactivity:

We have begun work on making config loading stricter. We have a change ready to land, and are in the process of removing comment fields from our production configs so that the change can be deployed.

We are in the process of dividing up all existing warning-level alerts so that they can be distributed among SREs for review, which we expect to conduct largely next week.

Please set our next-update date to 2021-10-01, when we expect to report completion of the review task.

Whiteboard: [ca-compliance] → [ca-compliance] Next update 2021-10-01
Summary: Let’s Encrypt: Delay updating OCSP responses → Let's Encrypt: Delay updating OCSP responses

We have completed our review of existing alerts, and have updated the criticality of several. We have also merged and deployed our strict config loading change. This completes all work items listed in Comment #4. Unless there are additional questions from interested parties, we do not intend to provide further updates on this ticket.

As there are no open action items remaining on this bug and there have been no further questions, I am closing it as fixed.

Status: ASSIGNED → RESOLVED
Closed: 2 months ago
Resolution: --- → FIXED