Closed Bug 1799755 Opened 2 years ago Closed 1 year ago

Let’s Encrypt: End Entity CRLs Not Reissued On Time

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jcj, Assigned: jcj)

Details

(Whiteboard: [ca-compliance] [crl-failure])

Attachments

(1 file)

Let’s Encrypt was informed last night via our problem reporting email address that our CRLs issued by our R3 and E1 intermediates had last been updated at 27 Oct 2022 15:55:40 UTC.

For a period of approximately 11 days, Let’s Encrypt failed to update our published CRLs. As of 06:11 UTC today, 8 Nov 2022, we have resumed publishing up-to-date CRLs.

Our thanks to Sam Harrington and Ryan Dickson for both independently reporting this issue.

We intend to supply a full incident report by Monday, 14 November (accounting for the Friday US and Canadian holiday).

Product: NSS → CA Program

We have our incident report prepared, but are finishing internal review of it. With apologies, I will post it within 24 hours.

Summary

For a period of approximately 11 days, from 2022-10-27 15:55:40 UTC to 2022-11-08 06:11 UTC, Let’s Encrypt failed to issue new CRLs for our R3 and E1 intermediates. For the last four days of this period, Let's Encrypt was in violation of BR 4.9.7, which says “If the CA publishes a CRL, then the CA SHALL update and reissue CRLs at least once every seven days.”

Incident Report

How we first became aware of the problem.

At 2022-11-08 00:36 UTC, Sam Harrington emailed our certificate problem report address that a recent Mozilla CRLite run had flagged our CRLs as having an unusually-large age:

Updated CRL http://prod.c.lencr.org/20506757847264211/0.crl (path=/persistent/crls/jQJTbIh0grw0_1TkHSumWb-Fs0Ggogr621gT3PvPKG0=/prod.c.lencr.org-0-f9a75380e6701635.crl) (sz=267938) (age=260h9m42.582351073s)

Timeline of incident and actions taken in response.

2022-09-28

  • CRL Updater and CRL Storer become the first Boulder services to run only in the new orchestration environment. Uniquely among Boulder services, they receive certificates named for the orchestration workers rather than for their individual services, and the services they depend on are configured to accept only certificates belonging to the orchestration workers.

2022-10-26

  • 16:46 SRE configures service-specific certificates for Staging environment workers and lowers the CRL Updater’s timeouts from 10m to 15s. CRL Updater is unable to connect to the CA because its certificate SAN is not in the allow-list; even if it could connect, the job would hit the shortened timeout before completing. CRLs stop being updated in Staging.

2022-10-27

  • 15:55:40 Last CRL Update published.
  • 16:44 CRLsNotGenerated alerts are added to staging.
  • 16:50 CRLsNotGenerated alerts are added to production.
  • 17:06 SRE merges change defining service-specific certificates to all workers.
  • 17:09 SRE configures service-specific certificates for DC2 workers and lowers the CRL Updater’s timeouts from 10m to 10s. CRL Updater is unable to connect to the CA or SSA due to the certificate SAN not being in the allow-list; if it could connect, the job would reach the timeout too early. CRLs stop being updated in DC2.
  • 17:51 SRE configures all DC1 workers as above. CRLs also stop being updated in DC1.

… 11 days, 8 hours, 41 minutes later….

2022-11-08

  • 00:36 Third party report from Sam Harrington arrives at cert-prob-reports@ describing the issue.
  • 01:01 SRE reviews report, reproduces issue and engages SRE oncall for second opinion.
  • 01:13 SRE confirms reproduction and begins incident response.
  • 01:21 SRE identifies a cause (incorrect FQDN for gRPC certificate validation) & begins preparing a fix.
  • 01:26 SRE deploys the code change portion of the fix in staging.
  • 01:45 SRE executes the remainder of the fix in staging.
  • 01:58 SRE identifies an additional error (timeout of a DB query introduced in the most recent Boulder release). SRE decides to roll back the version of Boulder in staging.
  • 02:30 SRE executes the Boulder version rollback in staging.
  • 02:36 SRE identifies an additional cause (missing permitted client name for SSA).
  • 02:43 SRE deploys and executes the additional fix in staging.
  • 03:15 SRE identifies an additional cause (missing permitted client name for CA).
  • 03:20 SRE deploys and executes the additional fix in staging.
  • 05:21 SRE identifies an additional cause (gRPC timeouts regressed during configuration rewrite from 10m to 15s).
  • 05:27 SRE deploys and executes in staging the return to the current Boulder release, along with the timeout fix.
  • 05:57 SRE deploys and executes the full fix in production, fixing the permitted client names and the timeout regression.
  • 06:11 CRL Updater and CRL Storer batch jobs succeed. Incident ends.
  • 14:14 Third party report from Ryan Dickson arrives at cert-prob-reports@ reporting both the issue and confirming the overnight fix.

Whether we have stopped the process giving rise to the problem or incident.

We did not halt certificate issuance during our investigation or remediation. We resumed generating up-to-date CRLs at 2022-11-08 06:11 UTC.

Summary of the affected certificates.

Certificates revoked during the 11 day period were not reflected in a CRL until the update occurred. This affected 96,354 revoked certificates. For all certificates during this period, up-to-date certificate revocation status was available via OCSP.

Complete certificate data for the affected certificates.

Attached is bug1799755-affected_cert_urls.txt.zst, which contains the 96,354 certificates revoked during the incident period.

Explanation of how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

Let’s Encrypt began issuing CRLs for our R3 and E1 intermediates in September of this year (see Bug 1793114), but our planned external monitoring of those CRLs has lagged behind the implementation; it is coming online this week.

In parallel with the external CRL monitoring work, an independent effort inside Let’s Encrypt has been migrating our Boulder software stack onto a more feature-rich orchestration framework. Because that migration is incomplete, both orchestration frameworks are currently running side-by-side.

This incident was caused by a combination of two factors: an incomplete configuration change, and monitoring of our CRL publication pipeline that was itself incomplete.

Factor 1: CRL publication pipeline disruption by internal PKI changes

On 2022-10-27, as part of the ongoing transition to the new orchestration framework, we changed Boulder’s mechanism for discovery of all its microservices to use dynamically-populated SRV records. The SRV record change necessitated rewriting all of Boulder’s gRPC configuration blocks, and required reissuance of a handful of its internal PKI client certificates: Prior to that date, services on the new orchestration framework had temporary client certificates with SANs of the form ${vlan}-worker.service.${datacenter}.letsencrypt.org, which was different from the SAN style used for existing services. After the change, all running services throughout Let’s Encrypt, including the CRL publication services, were again using internal client certificates with SANs of the form ${service_name}.service.${datacenter}.letsencrypt.org.

Our Boulder-CA and SSA services did not receive a corresponding configuration update, because we understood that all services were running in both the old and the new orchestration simultaneously: by homogenizing the SAN names, we believed we were eliminating a potential misconfiguration point rather than adding one.

However, having only come online in September, the CRL services had never before used the name crl-updater.service.${datacenter}.letsencrypt.org. Because of that, the Boulder-CA and SSA client allowlists only permitted the CRL services to connect as clients presenting internal client certificates of the ${vlan}-worker form. This discrepancy was the main trigger for the bug.

An additional problem arose during the mass-rewrite of the gRPC configuration blocks. All gRPC connections within Boulder have timeouts measured in seconds, in the form Xs, except for the new CRL Updater service, whose timeouts were measured in minutes, in the form Xm. During the reconfiguration, the CRL Updater’s connection timeouts were erroneously changed from minutes to seconds, and that change was also not caught in review.
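
To make the shape of those two mistakes concrete, here is a minimal illustrative sketch. Boulder’s real configuration is JSON with its own field names, so the keys below (clientNames, caService, timeout) and the layout are assumptions made purely for illustration:

  # Illustrative sketch only; not Boulder's actual configuration schema.
  # Server side (Boulder-CA / SSA): the client allowlist still knew only the
  # old worker-style name for the CRL services.
  ca:
    grpc:
      clientNames:
        - ${vlan}-worker.service.${datacenter}.letsencrypt.org
        # crl-updater.service.${datacenter}.letsencrypt.org was never added
  # Client side (CRL Updater): the timeout was re-entered with the wrong unit.
  crl-updater:
    caService:
      timeout: 15s   # previously 10m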

Factor 2: CRL publication alerts

We had internal monitoring deployed to alert on failures logged by our CRL Updater tool:

  - alert: CRLsNotGenerated12h
    expr: sum by (issuer, result) (increase(crl_updater_ticks_count{issuer=~"(E1|R3) \\(Overall\\)",result="success"}[6h])) < 1
    for: 12h
    labels:
      severity: warning
    annotations:
      description: 'No CRLs for {{ $labels.issuer }} have been generated in 12 hours.'
  - alert: CRLsNotGenerated3d
    expr: sum by (issuer, result) (increase(crl_updater_ticks_count{issuer=~"(E1|R3) \\(Overall\\)",result="success"}[6h])) < 1
    for: 3d
    labels:
      severity: critical
    annotations:
      description: 'No CRLs for {{ $labels.issuer }} have been generated in 3 days.'

However, the failure caused by the gRPC certificate misconfiguration meant that the crl_updater_ticks_count metric no longer emitted results labeled “success”, emitting instead only “failure”. The alerting expression only checked for a low increase in the “success” series; because that series was no longer emitted at all, the expression returned no data and the alert did not fire. For metrics that behave similarly, we have taken care to write alerts so that loss of the metric is itself an alert; in this case, that was omitted during development and not caught in review.
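
As an illustration of that pattern, a minimal sketch of a rewritten rule (not the exact alert we later deployed) could OR the original expression with Prometheus’s absent(), so that disappearance of the success-labeled series is itself enough to fire the alert:

  # Sketch only, not the deployed rule. absent() returns 1 when the selector
  # matches no series, so this fires both when successful runs are too few and
  # when the success-labeled series stops being emitted entirely. absent() only
  # carries labels from equality matchers, so the sketch names a single issuer.
  - alert: CRLsNotGenerated12h
    expr: >
      sum(increase(crl_updater_ticks_count{issuer="R3 (Overall)",result="success"}[6h])) < 1
      or
      absent(crl_updater_ticks_count{issuer="R3 (Overall)",result="success"})
    for: 12h
    labels:
      severity: warning
    annotations:
      description: 'No CRLs for R3 have been generated in 12 hours, or the success metric has disappeared.'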

External monitoring would also have caught this issue before it became an incident; however, that monitoring was not yet ready.

Incident response

Ultimately, the SRV record change was promoted all the way to production with this gRPC certificate misconfiguration because SRE relied too heavily on alerting to indicate that all services were still functioning nominally, while the brand-new alerts for the CRLs were inadequate. Detailed monitoring did show the failures, but the commonly-accessed roll-up dashboards did not include CRL age or generation: age is planned to come from the not-yet-deployed external monitor, and generation only happens every few hours, unlike all other continuously-monitored metrics.

Once the gRPC certificate misconfiguration was resolved, the timeout errors began. SRE suspected these errors were due to recent changes to the CRL Updater’s database queries, made as a remediation of Bug 1793114, and reverted to the previous Boulder version. When that did not resolve the timeouts, the minutes-to-seconds misconfiguration was noticed in the code differences and quickly reverted.

List of steps we are taking to resolve the situation and ensure that such situation or incident will not be repeated in the future, accompanied with a binding timeline of when your CA expects to accomplish each of these remediation steps.

We’ve already resolved the configuration issues that led directly to the incident: the Boulder-CA microservice refusing connections from the CRL Updater as unauthorized, and the timeouts having been inappropriately shortened for the RPC methods the Updater has to perform.

For long-term prevention of CRL-specific issues, we re-commit to the same remediation item described for Bug 1793114: deploying an external monitoring system that issues and revokes certificates on a regular basis and watches to ensure that those certificates remain in the same shard for every generation of CRLs we produce until they expire. That remediation item was, as noted above, almost ready for testing when this incident occurred, and we expect to deploy it this month.

The alert errors are difficult to address as a class: when writing metrics-based alerts, it’s helpful to have enough stored metrics data to tune the alert, and when the CRL Updater’s alerts were developed there was no history of how the metrics behaved with the service in a failure state. However, we can institute universal alerts on RPC failure rates. Such alerts would have caught this incident, as well as other classes of failures. We have not had such alerts in the past because they tend to be noisy, but making them workable is a worthwhile remediation and goal.
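
For illustration, a minimal sketch of such a rate alert, assuming the standard go-grpc-prometheus metric grpc_server_handled_total and its grpc_code and grpc_service labels, and using a placeholder threshold and window rather than tuned values, might look like:

  # Sketch only: assumes go-grpc-prometheus metric names; the 10% threshold and
  # 30m window are placeholders, not tuned values.
  - alert: GRPCErrorRateHigh
    expr: >
      sum by (grpc_service) (rate(grpc_server_handled_total{grpc_code!="OK"}[30m]))
      /
      sum by (grpc_service) (rate(grpc_server_handled_total[30m]))
      > 0.10
    for: 30m
    labels:
      severity: warning
    annotations:
      description: 'More than 10% of RPCs to {{ $labels.grpc_service }} have failed over the last 30 minutes.'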

One positive aspect of this incident is that, despite its cause being rooted in the upheaval of changing orchestration platforms, the new platform made the incident response significantly more efficient. The reversion, and subsequent un-reversion, of the Boulder deployment was rapid and mostly automated.

Remediation                                        | Status    | Date
Correct Allowed gRPC Client Names for CRL Updater  | Completed | 2022-11-08
Revert gRPC Timeouts for CRL Updater               | Completed | 2022-11-08
Deploy External CRL Monitor                        | Completed | 2022-11-17
Add Alerts for CRL Age at External CRL Monitor     | Completed | 2022-12-16
Rewrite CRL Updater Alerts                         | Completed | 2022-12-16
Add System-wide Percentage of gRPC Error Alerts    | Completed | 2022-12-16

The remediation "Deploy External CRL Monitor", originally from Bug 1793114, was completed on time yesterday, and I've marked it so in Comment 3.

Whiteboard: [ca-compliance] → [ca-compliance][crl-failure]

We're continuing to work on remediation items for this incident, and are monitoring this bug for any questions or comments. We still expect to finish the last three remediation items by the target date of 2022-12-16.

Whiteboard: [ca-compliance][crl-failure] → [ca-compliance][crl-failure] Next update 2022-12-16

The last three remediation items for this incident are completed as of today. I've updated Comment 3 to reflect that.

We have no further updates planned for this bug and consider the incident remediation complete. If there are any further questions, we're happy to answer them, and we will continue to monitor the bug until it is closed, but it can be closed at Mozilla's convenience.

I plan to close this on or about 9-Dec-2022.

Flags: needinfo?(bwilson)
Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Flags: needinfo?(bwilson)
Resolution: --- → FIXED
Whiteboard: [ca-compliance][crl-failure] Next update 2022-12-16 → [ca-compliance] [crl-failure]