Closed Bug 1771238 Opened 2 years ago Closed 2 years ago

Certainly: Serving Expired OCSP Responses

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: wthayer, Assigned: wthayer)

Details

(Whiteboard: [ca-compliance] [ocsp-failure])

On 24-May 2022, Certainly became aware that we were serving some expired responses from our OCSP service. Specifically, the initial responses were being generated as expected when a certificate was issued, but the service that periodically renews these responses had been failing in an unexpected manner since 17-May. The preliminary timeline is:

5/17 13:45 UTC OCSP updater service begins to panic and becomes unable to sign fresh responses
5/24 19:58 UTC Certainly engineer checks the validity of a group of certificates via openssl ocsp command and finds that some responses return errors
5/24 20:35 UTC Investigation finds that some OCSP responses are expired
5/24 20:59 UTC Incident declared
5/24 23:18 UTC Determined problem was likely caused by a configuration change related to the latest Boulder release
5/24 23:24 UTC Determined that reverting the config didn’t resolve the problem
5/25 01:28 UTC Confirmed that the prior Boulder release does not exhibit the issue
5/25 01:36 UTC Rolled back Boulder to prior release in production. OCSP response generation begins to function normally
5/25 03:15 UTC Service fully restored once new OCSP responses were generated for all non-expired certificates

Further investigation is ongoing. A full incident report is being prepared and will be published next week.

Assignee: bwilson → wthayer
Status: NEW → ASSIGNED
Whiteboard: [ca-compliance]

1. How your CA first became aware of the problem.
We became aware of the issue while gathering annual audit evidence that required retrieving OCSP responses for a group of certificates. Some of the responses that were returned were expired. This finding triggered an investigation of our OCSP system during which we determined that the ocsp-updater process that periodically generates new OCSP responses for existing certificates was in a failed state.

2. A timeline of the actions your CA took in response.

  • 2022-05-10 18:11 UTC Deployed the Boulder 2022-05-02 release to the Certainly staging environment for testing, along with a configuration change that added a read-only database connection to the ocsp-updater service. Verified that new responses were being generated and signed.
  • 2022-05-12 20:30 UTC Deployed Boulder 2022-05-02 release and associated configuration change described above to production environment.
  • 2022-05-12 21:24 UTC OCSP updater service begins to panic and becomes unable to sign fresh responses for existing certificates when the container is restarted and loads the updated configuration.
  • 2022-05-24 19:58 UTC Certainly engineer checks the validity of a group of certificates via the openssl ocsp command and finds that some responses return errors. Investigation begins.
  • 2022-05-24 20:35 UTC Determined that some OCSP responses are expired.
  • 2022-05-24 20:59 UTC Incident declared.
  • 2022-05-24 23:18 UTC Determined problem was caused by the configuration change described above.
  • 2022-05-24 23:24 UTC Determined that reverting the config didn’t resolve the problem. [In retrospect, we believe the prior configuration was not successfully loaded at this time.]
  • 2022-05-25 01:28 UTC Confirmed that reverting the Boulder update to the 2022-04-18 release along with the prior configuration resolved the issue in staging environment.
  • 2022-05-25 01:36 UTC Deployed fix to production. OCSP response generation begins.
  • 2022-05-25 03:15 UTC Service fully restored once new OCSP responses were generated for all non-expired certificates.

3. Whether your CA has stopped, or has not yet stopped, certificate issuance or the process giving rise to the problem or incident.
The production environment was rolled back to a known good configuration on 2022-05-25 at 01:36 UTC, and the generation of fresh OCSP responses was completed at 03:15 UTC, at which point service was fully restored.

4. In a case involving certificates, a summary of the problematic certificates. For each problem: the number of certificates, and the date the first and last certificates with that problem were issued. In other incidents that do not involve enumerating the affected certificates (e.g. OCSP failures, audit findings, delayed responses, etc.), please provide other similar statistics, aggregates, and a summary for each type of problem identified. This will help us measure the severity of each problem.
Approximately 10,647 certificates were affected by this issue during the period from 2022-05-12, when generation of renewed OCSP responses stopped, until 2022-05-25, when the issue was discovered and fixed.

5. In a case involving certificates, the complete certificate data for the problematic certificates.
N/A - the problem was with OCSP, not the certificates themselves.

6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.
We determined that the ocsp-updater service was panicking due to a configuration change that added a read-only database connection. This new database connection requires an accompanying configuration change so that its metrics are registered distinctly with the metrics monitoring service. Without that second change, both the read-only and read-write database connections attempted to register with the same metrics interface and thus triggered the process panic.
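For illustration only (and drawing on the analysis later in this bug), here is a minimal, self-contained Go sketch of the failure mode; the metric name and label names are hypothetical stand-ins, not Boulder's actual metrics. When two database handles are configured with the same username and address, the per-connection metric descriptions come out identical, and registering the second one with Prometheus's MustRegister panics the process.

    package main

    import "github.com/prometheus/client_golang/prometheus"

    // newDBMetric builds a gauge whose constant labels are derived from the
    // database connection's username and address, mimicking per-connection
    // DB metrics. The metric name and label names are hypothetical.
    func newDBMetric(user, addr string) prometheus.Gauge {
        return prometheus.NewGauge(prometheus.GaugeOpts{
            Name:        "db_open_connections",
            Help:        "Open connections for a database handle.",
            ConstLabels: prometheus.Labels{"dbUser": user, "dbAddr": addr},
        })
    }

    func main() {
        // The read-write connection registers first and succeeds.
        prometheus.MustRegister(newDBMetric("boulder", "db:3306"))

        // A read-only connection configured with the same username and address
        // produces an identical metric description, so this second MustRegister
        // panics ("duplicate metrics collector registration attempted"),
        // taking the whole process down.
        prometheus.MustRegister(newDBMetric("boulder", "db:3306"))
    }

With distinct label values for the two connections, both registrations succeed.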

We have been unable to determine why reverting the configuration in staging on 2022-05-24 at 23:24 UTC during the incident did not resolve the problem. We assume that the reverted configuration was not loaded despite re-instantiating the service's container.

Certainly currently monitors the health of all Boulder services and would normally have detected the unhealthy state of the ocsp-updater service prior to the deployment of this change to our production environment. In this case, the service failed in an unexpected manner that did not trigger a restart and thus was not detected by our existing monitors. Specifically, the wrapper script that limits the number of running instances of the service did not exit when the ocsp-updater process terminated abnormally.

We also monitor the availability of valid OCSP responses, but this check currently relies on newly-issued certificates. OCSP responses are initially generated by the RA service and thus were not affected by this issue.

Certainly had identified OCSP monitoring as a risk and was in the process of implementing new external monitors at the time of this incident.

7. List of steps your CA is taking to resolve the situation and ensure that such situation or incident will not be repeated in the future, accompanied with a binding timeline of when your CA expects to accomplish each of these remediation steps.
We have deployed a new version of Boulder without the read-only database connection configuration and are actively monitoring for OCSP issues.

We have also fixed the bug in the container startup script that allowed the ocsp-updater process to crash without the intended behavior of shutting down the entire container and triggering an alert.
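The actual wrapper is a shell script, but the principle behind the fix is easy to sketch. The following Go illustration (the binary path is a placeholder) runs the child process, waits on it, and propagates an abnormal exit, so a crashed ocsp-updater brings the container down with it and the existing restart and alerting machinery can react.

    package main

    import (
        "log"
        "os"
        "os/exec"
    )

    func main() {
        // Placeholder path; the real wrapper also passes the service's
        // configuration and enforces the single-instance limit.
        cmd := exec.Command("/path/to/ocsp-updater")
        cmd.Stdout = os.Stdout
        cmd.Stderr = os.Stderr

        if err := cmd.Run(); err != nil {
            if exitErr, ok := err.(*exec.ExitError); ok {
                // Exit with the child's status so the container terminates
                // instead of lingering with a dead ocsp-updater inside it.
                os.Exit(exitErr.ExitCode())
            }
            log.Fatalf("failed to run ocsp-updater: %v", err)
        }
    }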

Certainly recognizes that changes to Boulder represent a significant risk to the stability of our CA, and we have therefore implemented a process in which we validate every new Boulder build in staging before deploying it to production. Our existing test plan includes confirming that OCSP responses are generated successfully, but only for new certificates. We plan to improve this pre-release testing to specifically confirm that the ocsp-updater service generates renewed OCSP responses.

As noted above, we had already identified OCSP monitoring as a risk that we needed to address, had begun working on improvements to external monitoring, and had planned to implement proactive detection of expiring responses. We are aware of Let’s Encrypt’s 2021 OCSP incident and intend to implement monitors functionally equivalent to those described in that incident.
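As a rough sketch of the kind of external staleness check we are implementing (illustrative only, not our production monitor; the file names and the 24-hour warning threshold are assumptions), a monitor can fetch the response for an existing certificate from the responder named in the certificate's AIA extension and alert when nextUpdate is approaching:

    package main

    import (
        "bytes"
        "crypto/x509"
        "encoding/pem"
        "fmt"
        "io"
        "log"
        "net/http"
        "os"
        "time"

        "golang.org/x/crypto/ocsp"
    )

    // loadCert reads a single PEM-encoded certificate from disk.
    func loadCert(path string) (*x509.Certificate, error) {
        data, err := os.ReadFile(path)
        if err != nil {
            return nil, err
        }
        block, _ := pem.Decode(data)
        if block == nil {
            return nil, fmt.Errorf("no PEM block in %s", path)
        }
        return x509.ParseCertificate(block.Bytes)
    }

    func main() {
        // File names and warning threshold are placeholders.
        cert, err := loadCert("cert.pem")
        if err != nil {
            log.Fatal(err)
        }
        issuer, err := loadCert("issuer.pem")
        if err != nil {
            log.Fatal(err)
        }
        if len(cert.OCSPServer) == 0 {
            log.Fatal("certificate has no OCSP responder URL")
        }

        ocspReq, err := ocsp.CreateRequest(cert, issuer, nil)
        if err != nil {
            log.Fatal(err)
        }
        httpResp, err := http.Post(cert.OCSPServer[0], "application/ocsp-request", bytes.NewReader(ocspReq))
        if err != nil {
            log.Fatal(err)
        }
        defer httpResp.Body.Close()
        body, err := io.ReadAll(httpResp.Body)
        if err != nil {
            log.Fatal(err)
        }

        resp, err := ocsp.ParseResponseForCert(body, cert, issuer)
        if err != nil {
            log.Fatalf("bad OCSP response: %v", err)
        }

        // Alert well before the response actually expires.
        const warnBefore = 24 * time.Hour
        if time.Until(resp.NextUpdate) < warnBefore {
            log.Fatalf("stale or near-expiry OCSP response: thisUpdate=%s nextUpdate=%s",
                resp.ThisUpdate, resp.NextUpdate)
        }
        fmt.Printf("fresh OCSP response: thisUpdate=%s nextUpdate=%s\n", resp.ThisUpdate, resp.NextUpdate)
    }

Run periodically against a rotating sample of unexpired certificates, a check like this detects stale renewed responses rather than only the freshly generated responses covered by our existing new-issuance checks.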

In summary, our remediation plan is as follows.

Action Item | Due Date
Upgrade to more current (2022-05-16) Boulder release. | Done (31-May)
Fix bug in startup script that allowed ocsp-updater container to continue running even when the process was not. | Done (31-May)
Pre-release OCSP testing in staging environment - check for renewed OCSP response generation with each new Boulder release. | 30-June
Implement stale OCSP response alerts in staging and production. | 30-June
Implement external OCSP monitors for renewed responses. | 30-June
Perform review of all alerts looking for gaps and incorrect priorities that could lead to similar incidents. | 30-June

> This new database connection requires another configuration change to account for the metrics monitoring service. Without the second change, both the read-only and read-write database connections attempted to register with the same metrics interface and thus triggered the process panic.

For the edification of other CAs running the Boulder software stack, can you provide more detail on this item from the root cause analysis?

I believe that what was happening here is that the new read-only database connection was configured as a duplicate of the existing read-write database connection, specifically with the same username and address set in the DbConfig (which usually looks something like this). Because the DSN had the same username and address, those labels got the same values when constructing the metric descriptions. Thus, when the duplicate descriptions were registered with prometheus, it detected the duplicate and panic'd.

Does that analysis match your own? If so, that suggests that there should be safety checks (in the ocsp-updater, the sa, and any other service which has multiple database connections) to exit gracefully if two database connections have the same details, rather than relying on the metrics subsystem to panic. (For example, InitDbMetrics could call prometheus.Register rather than MustRegister, return the resulting error if any, and let sa.InitSqlDb handle the error gracefully.) Would you consider including a bug and PR against the Boulder repo to address this as part of your remediation items?

(In reply to Aaron Gable from comment #2)

> > This new database connection requires another configuration change to account for the metrics monitoring service. Without the second change, both the read-only and read-write database connections attempted to register with the same metrics interface and thus triggered the process panic.

> For the edification of other CAs running the Boulder software stack, can you provide more detail on this item from the root cause analysis?

> I believe that what was happening here is that the new read-only database connection was configured as a duplicate of the existing read-write database connection, specifically with the same username and address set in the DbConfig (which usually looks something like this). Because the DSN had the same username and address, those labels got the same values when constructing the metric descriptions. Thus, when the duplicate descriptions were registered with prometheus, it detected the duplicate and panic'd.

Aaron, you are entirely correct and we have nothing to add to your description above.

> Does that analysis match your own? If so, that suggests that there should be safety checks (in the ocsp-updater, the sa, and any other service which has multiple database connections) to exit gracefully if two database connections have the same details, rather than relying on the metrics subsystem to panic. (For example, InitDbMetrics could call prometheus.Register rather than MustRegister, return the resulting error if any, and let sa.InitSqlDb handle the error gracefully.) Would you consider including a bug and PR against the Boulder repo to address this as part of your remediation items?

We believe that our existing action items represent the best immediate steps to prevent a broad class of problems related to OCSP and we would like to focus our immediate effort on those tasks. However, we also welcome the opportunity to contribute to Boulder and will plan to submit a PR along the lines of what you’ve proposed. We’ve created a task on our end and have submitted issue #6150 to the Boulder repo to track this work. Thank you for making this suggestion.
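To make the shape of the proposed change concrete, here is a minimal sketch of the approach (the function name and metric details are illustrative, not the actual Boulder code or the eventual PR): register the per-connection collector with prometheus.Register, detect AlreadyRegisteredError, and return it to the caller so the service can fail with a clear configuration error instead of panicking.

    package dbmetrics

    import (
        "errors"
        "fmt"

        "github.com/prometheus/client_golang/prometheus"
    )

    // initDBMetrics is an illustrative stand-in for Boulder's InitDbMetrics.
    // It reports a duplicate registration as an ordinary error rather than
    // letting MustRegister panic the whole process.
    func initDBMetrics(reg prometheus.Registerer, user, addr string) error {
        g := prometheus.NewGauge(prometheus.GaugeOpts{
            Name:        "db_open_connections",
            Help:        "Open connections for a database handle.",
            ConstLabels: prometheus.Labels{"dbUser": user, "dbAddr": addr},
        })

        if err := reg.Register(g); err != nil {
            var already prometheus.AlreadyRegisteredError
            if errors.As(err, &already) {
                // Two connections share a username/address; surface this as a
                // configuration error the caller can handle gracefully.
                return fmt.Errorf("duplicate database metrics for %s@%s: %w", user, addr, err)
            }
            return err
        }
        return nil
    }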

Update:

Our work on improved pre-release testing and OCSP staleness monitoring is currently on track to be completed by the end of June.

We have finished a comprehensive review of our alerts that identified the following improvement opportunities:

  • Implement a catch-all alert for process panics.
  • To avoid missing important issues that might evade existing alerts on various Boulder functions, configure a catch-all alert for Boulder errors of any kind.
  • Implement additional alerts to detect failures in a few specific parts of our monitoring pipeline for which we currently rely on downstream alerts.

I’d like to request a next-update of 23-June to report back on the status of the remaining action items.

Here is the current status of our remediation plan for this incident:

Action Item (from original incident report) | Due Date
Upgrade to more current (2022-05-16) Boulder release. | Done (31-May)
Fix bug in startup script that allowed ocsp-updater container to continue running even when the process was not. | Done (31-May)
Pre-release OCSP testing in staging environment - check for renewed OCSP response generation with each new Boulder release. | On track (30-June)
Implement stale OCSP response alerts in staging and production. | Delayed (30-July)
Implement external OCSP monitors for renewed responses. | On track (30-June)
Perform review of all alerts looking for gaps and incorrect priorities that could lead to similar incidents. | Done (7-June)

Our plan to implement monitors for OCSP staleness within our infrastructure has been delayed due to a major operating system upgrade that is underway and has proven more difficult than expected. This upgrade is currently blocking the proper testing and deployment of additional internal OCSP monitoring functionality. We expect to complete the OS upgrade and then be able to properly test and deploy the additional monitors by 30-July or sooner.

Here is an update on our work to remediate this incident:

Action Item (from original incident report) | Due Date
Upgrade to more current (2022-05-16) Boulder release. | Done (31-May)
Fix bug in startup script that allowed ocsp-updater container to continue running even when the process was not. | Done (31-May)
Pre-release OCSP testing in staging environment - check for renewed OCSP response generation with each new Boulder release. | Done (1-July)
Implement stale OCSP response alerts in staging and production. | Delayed (30-July)
Implement external OCSP monitors for renewed responses. | Done (1-July)
Perform review of all alerts looking for gaps and incorrect priorities that could lead to similar incidents. | Done (7-June)

We are on track to complete the remaining task by 30-July or sooner, and will update this bug when that is done. I’d like to request that the next-update be set to 30-July.

Whiteboard: [ca-compliance] → [ca-compliance] Next update 2022-07-31

Here is the current status of our remediation plan for this incident:

Action Item (from original incident report) | Due Date
Upgrade to more current (2022-05-16) Boulder release. | Done (31-May)
Fix bug in startup script that allowed ocsp-updater container to continue running even when the process was not. | Done (31-May)
Pre-release OCSP testing in staging environment - check for renewed OCSP response generation with each new Boulder release. | Done (1-July)
Implement stale OCSP response alerts in staging and production. | Done (30-July)
Implement external OCSP monitors for renewed responses. | Done (1-July)
Perform review of all alerts looking for gaps and incorrect priorities that could lead to similar incidents. | Done (7-June)

We have now completed all planned remediation tasks for this incident.

I'll close this on or about 5-Aug-2022.

Flags: needinfo?(bwilson)
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Flags: needinfo?(bwilson)
Resolution: --- → FIXED
Product: NSS → CA Program
Whiteboard: [ca-compliance] Next update 2022-07-31 → [ca-compliance] [ocsp-failure]