Open Bug 1882904 Opened 2 months ago Updated 14 hours ago

Google Trust Services: Incorrect OCSP responses for new ICAs under test

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

ASSIGNED

People

(Reporter: gts-external, Assigned: gts-external, NeedInfo)

Details

(Whiteboard: [ca-compliance] [ocsp-failure] Next update 2024-04-26)

Attachments

(1 file)

Google Trust Services is investigating an issue with OCSP status information not being correctly updated on new intermediate CAs that we recently issued and are in the process of testing.

We will post a full incident report by Friday, March 8.

Assignee: nobody → gts-external
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Whiteboard: [ca-compliance] [ocsp-failure]

Incident Report

Summary

While validating certificate issuance from recently configured intermediate CAs (ICAs), the Google Trust Services (GTS) OCSP responders incorrectly answered some requests with a response status of “unauthorized”.
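For readers less familiar with the protocol, the following is a minimal sketch of how a client can recognize such a response. It is not GTS code. The status values are those defined for OCSPResponseStatus in RFC 6960; the function name `response_status` is an illustrative assumption, and the parser reads only the first field of the response.

```python
from enum import IntEnum


class OCSPResponseStatus(IntEnum):
    """OCSPResponseStatus values as defined in RFC 6960."""
    SUCCESSFUL = 0
    MALFORMED_REQUEST = 1
    INTERNAL_ERROR = 2
    TRY_LATER = 3
    SIG_REQUIRED = 5  # the value 4 is not used
    UNAUTHORIZED = 6


def response_status(der: bytes) -> OCSPResponseStatus:
    """Read only the responseStatus field of a DER-encoded OCSPResponse.

    Hypothetical helper for illustration; it does not parse or verify
    the signed response body.
    """
    if len(der) < 2 or der[0] != 0x30:
        raise ValueError("not a DER SEQUENCE")
    idx = 2
    if der[1] & 0x80:
        # Long-form length: the low 7 bits give the number of length bytes.
        idx += der[1] & 0x7F
    if len(der) < idx + 3 or der[idx] != 0x0A or der[idx + 1] != 0x01:
        raise ValueError("expected a one-byte ENUMERATED responseStatus")
    return OCSPResponseStatus(der[idx + 2])


# An error response carries no responseBytes, so "unauthorized" is just
# SEQUENCE { ENUMERATED 6 }.
print(response_status(bytes.fromhex("30030a0106")).name)
```

An "unauthorized" response contains no certificate status at all, so a relying party cannot tell from it whether the certificate is good or revoked.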

Background

GTS’ OCSP responder architecture has gone through a long evolution beginning with its first implementation over 12 years ago. We operate two variants of the software: a newer version, which serves OCSP for the majority of our issued certificates, and a legacy version, which now exists exclusively to serve status information for GTS’ intermediate CAs and some Google-owned domains. Maintaining this dual infrastructure, whose two halves are managed in several different ways, increases complexity and was a contributing factor in this incident. Because the legacy version has been the subject of past incidents, we are providing additional background on its history and the changes that have been made to it.

The legacy version was first implemented at a time when OCSP responses for only a few thousand certificates were being served at any given time, so the responses were backed by the filesystem. It originally received data refreshes via a bundled fileset push propagated across our global infrastructure, which best met our reliability and availability goals with the tools available at the time. It later incorporated pub/sub to propagate status information updates more quickly following issuance and revocation. As GTS expanded, this approach would have reached scalability limits, so a new version was implemented, now backed by Spanner, Google's globally-distributed database. The newer version has been in use without issues since its introduction, and all other OCSP serving has been migrated to it, but the legacy version has proven to reliably serve traffic at a very high volume and still serves an order of magnitude more requests. In 2022 we performed migrations to the newer version, which resulted in bug 1773556 after a testing issue led to a configuration error. Earlier in 2022 we also improved the propagation time to the legacy responders, as described in https://bugzilla.mozilla.org/show_bug.cgi?id=1758372#c16.

We decided to continue using the legacy version for the remaining use cases for two reasons. The primary reason is its reliability and scalability, which have been confirmed over a long period of time. In addition, we anticipate the eventual deprecation of our OCSP responders due to ballot SC-63. Deprecation is still pending acceptance by all major root programs. For a similar reason, the proposal we committed to bring to the CA/B Forum in bug 1758372 was abandoned after learning of the Chrome Root Program’s plan to make OCSP optional. We engaged with the community on an early draft; the incomplete result of that collaboration with some members of the ecosystem is SCXY Draft: SLOs for Certificate Status Information.

At the time this incident was detected, certificate issuance under the new ICAs, named WE2 and WR2, was being validated after they were enabled on the CA platform. Several other certificates were issued in the days prior for manual testing and automated health checks. These are included in the timeline below.

Impact

There were 3301 OCSP requests for status information that incorrectly received a response of “unauthorized”. In total, 7922 certificates were issued while the problem existed; they are listed below in the “Details of affected certificates” section of the report.

Timeline

All times are UTC.

2023-11-02

  • 16:03 Test ICAs corresponding with WE2 and WR2 are enabled in our test environment.

2023-12-11

  • 07:49 Health checks are enabled to monitor the test WE2 and WR2 ICAs. The health checks perform issuance and revocation in addition to verifying OCSP, but as described in Root Cause Analysis, the timing of the bundled fileset push containing a status information refresh masked the issue.

2024-02-12

  • 15:12 The new ICAs are deployed in the production environment with issuance disabled.

2024-02-21

  • 14:54 The new ICAs are enabled across the production environment.

2024-02-22

  • 21:00 A certificate signed by WR2 is issued for manual validation, including validating status information and revocation data. OCSP is verified to return a status of “good”.
  • 21:04 A certificate signed by WE2 is issued for the same manual validation. OCSP is verified to return a status of “good”.
  • 21:12 The WR2 leaf certificate is revoked. OCSP is verified to return a status of “revoked”.
  • 21:22 The WE2 leaf certificate is revoked. OCSP is verified to return a status of “revoked”.

2024-02-23

  • 14:28 Health checks are enabled for the recently-enabled ICAs following our successful manual verification. The health checks test the same things as those running in the test environment.

2024-02-28

  • 10:46 Additional certificates are issued while validating certificate management functionality with the new ICAs.
  • 15:56 While updating documentation, a CA Engineer realizes that configuration files are missing for the pipeline to push status information refreshes for the new ICAs. New ones are submitted to source control.
  • 17:53 We verify that some OCSP responses are not working as expected. We begin a further investigation with the help of more engineers.
  • 18:35 A CA Engineer runs the job that creates the bundled fileset of OCSP responses, outside its regular schedule.
  • 18:49 We invoke our incident response procedure after confirming the issue.
  • 19:29 We begin a push of the bundled fileset, including the outputs from the newly-configured jobs. The propagation uses a staggered approach to reach global consistency over several hours.

2024-02-29

  • 00:38 The bundled file push concludes. The issue is confirmed to be resolved.
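The durations implied by the last timeline entries can be worked out directly from the timestamps. The short calculation below is an illustrative sketch; the variable and function names are ours, and the timestamps are copied from the timeline above.

```python
from datetime import datetime, timedelta, timezone


def utc(stamp: str) -> datetime:
    """Parse a 'YYYY-MM-DD HH:MM' timeline entry as a UTC datetime."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)


def fmt(delta: timedelta) -> str:
    """Format a duration as hours and minutes."""
    minutes = int(delta.total_seconds() // 60)
    return f"{minutes // 60}h{minutes % 60:02d}m"


missing_config_noticed = utc("2024-02-28 15:56")
push_started = utc("2024-02-28 19:29")
push_finished = utc("2024-02-29 00:38")

print("staggered push duration:", fmt(push_finished - push_started))            # 5h09m
print("noticed to resolved:    ", fmt(push_finished - missing_config_noticed))  # 8h42m
```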

Root Cause Analysis

The newer version of the OCSP responder software has a key configuration difference related to this incident: once a new ICA is configured for issuance by the CA, status information begins persisting to a database that is accessed by the responders. In contrast, for the legacy version, we must also configure a pipeline to periodically push status information refreshes for each ICA. Propagation of status information updates over pub/sub is part of the ICA configuration, but the responder depends upon both sources to remain up-to-date.

This incident occurred because the pipeline was not configured for the two new ICAs that rely on the legacy OCSP responders, so those responders were receiving only the updates sent over pub/sub. Those updates were enough for the responders to return the expected responses during manual tests of issuance and revocation, and the health check jobs likewise received the expected results. However, when legacy responders receive a new bundled file push, they effectively invalidate any pub/sub updates that precede the earliest time at which the bundle's contents were generated. Because the bundle contained no entries for the new ICAs, the status information for their certificates was lost, and the responders began returning a response of “unauthorized” for the impacted certificates.
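The interaction between the two update sources can be illustrated with a toy model. This is a simplified sketch based only on the description above, not the responder's actual implementation: the class, method names, integer timestamps, and serial labels are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class LegacyResponderModel:
    """Toy model of a responder fed by bundle pushes plus pub/sub updates."""
    bundle: dict = field(default_factory=dict)  # serial -> status from the last bundle
    pubsub: dict = field(default_factory=dict)  # serial -> (received_at, status)

    def on_pubsub_update(self, serial: str, status: str, at: int) -> None:
        self.pubsub[serial] = (at, status)

    def on_bundle_push(self, entries: dict, generated_at: int) -> None:
        # The new bundle replaces the old one, and pub/sub updates older
        # than the bundle's generation time are discarded.
        self.bundle = dict(entries)
        self.pubsub = {
            serial: (at, status)
            for serial, (at, status) in self.pubsub.items()
            if at >= generated_at
        }

    def lookup(self, serial: str) -> str:
        if serial in self.pubsub:
            return self.pubsub[serial][1]
        # No data from either source: the responder cannot answer.
        return self.bundle.get(serial, "unauthorized")


def simulate(pipeline_covers_new_ica: bool) -> tuple:
    """Return the status of a new-ICA leaf before and after a bundle push."""
    responder = LegacyResponderModel()
    responder.on_pubsub_update("wr2-leaf", "good", at=10)
    before = responder.lookup("wr2-leaf")
    entries = {"other-leaf": "good"}
    if pipeline_covers_new_ica:
        entries["wr2-leaf"] = "good"
    responder.on_bundle_push(entries, generated_at=20)
    return before, responder.lookup("wr2-leaf")


print(simulate(pipeline_covers_new_ica=False))  # ('good', 'unauthorized')
print(simulate(pipeline_covers_new_ica=True))   # ('good', 'good')
```

In this model a check performed shortly after issuance always passes, which matches why the manual tests and health checks did not surface the problem.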

There are two factors that contributed to this incident:

  • Introducing new ICAs is an infrequently-performed task and relies on manual actions to make configuration changes. Although status information served by the newer responder version is published once a configuration change is made to enable an ICA, the older version has more moving parts and requires additional manual work to enable its operation. Enabling a new ICA for use in issuance should enable it in all supporting services. Lacking this automatic enablement was the root cause of this incident.

  • We had not yet enabled all of the health checking we rely on, specifically synthetic monitoring. The synthetic monitoring works by regularly probing our endpoints to verify OCSP and CRL responses are working as expected. The probes take a random sample of valid certificates on each invocation and verify the endpoints return the expected results. Had this been turned on in our test environment, we would most likely have caught the issue.
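The sampling approach used by synthetic monitoring can be sketched as follows. This is an assumption-laden illustration rather than our monitoring code: `probe_once`, its parameters, and the serial labels are hypothetical, and `fetch_status` stands in for a real OCSP query against the endpoint.

```python
import random
from typing import Callable, Iterable, List, Optional, Tuple


def probe_once(
    valid_serials: Iterable[str],
    fetch_status: Callable[[str], str],
    sample_size: int,
    rng: Optional[random.Random] = None,
) -> List[Tuple[str, str]]:
    """Probe a random sample of valid certificates; return unexpected results."""
    rng = rng or random.Random()
    population = list(valid_serials)
    sample = rng.sample(population, min(sample_size, len(population)))
    failures = []
    for serial in sample:
        status = fetch_status(serial)
        # Every sampled certificate is valid, so anything but "good" is a failure.
        if status != "good":
            failures.append((serial, status))
    return failures


def fake_fetch(serial: str) -> str:
    """Stand-in for an OCSP query: certificates under the new ICA fail."""
    return "unauthorized" if serial.startswith("new-ica-") else "good"


serials = ["old-1", "old-2", "new-ica-1", "new-ica-2"]
print(sorted(probe_once(serials, fake_fetch, sample_size=10)))
```

Because the probe samples certificates that already exist rather than ones it has just issued, it can observe a status that degrades some time after issuance.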

Lessons Learned

What went well

N/A

What didn't go well

  • The problem was masked due to the timing of propagation of status information updates between two pipelines. Despite having manually validated that OCSP was working, we did not verify whether it continued to work correctly over time.
  • We did not set up synthetic monitoring at the same time as the ICAs were configured. This would most likely have caught the issue in test before it became an issue in our production environment.

Where we got lucky

  • We identified the issue before widespread use of the new ICAs.

Action Items

Action Item | Kind | Due Date
Determine a plan to make the legacy OCSP pipeline agnostic to ICA addition and removal. We require additional time for planning to ensure this change can be made safely. We will share a due date for the implementation in our next update. | Prevent | 2024-03-15
Automatically enable synthetic monitoring when new ICAs are configured | Detect | 2024-04-26
Refresh the checklist used when configuring new ICAs to include more validation and emphasize the step differences when using the legacy OCSP responders | Prevent | 2024-03-22
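As an illustration of the kind of validation a refreshed checklist or automated check could perform, the sketch below flags enabled ICAs that use the legacy responders but have no refresh pipeline configured. It is a hypothetical example: the function, its inputs, and the name "EXAMPLE-ICA" are assumptions and do not describe our actual configuration format.

```python
from typing import Dict, Iterable, List, Set


def missing_pipeline_configs(
    enabled_icas: Iterable[str],
    responder_for: Dict[str, str],
    pipeline_configs: Set[str],
) -> List[str]:
    """List enabled ICAs on the legacy responders that lack a refresh pipeline."""
    return sorted(
        ica
        for ica in enabled_icas
        if responder_for.get(ica) == "legacy" and ica not in pipeline_configs
    )


# "EXAMPLE-ICA" is a placeholder for an ICA served by the newer responders.
enabled = ["WE2", "WR2", "EXAMPLE-ICA"]
responders = {"WE2": "legacy", "WR2": "legacy", "EXAMPLE-ICA": "newer"}

print(missing_pipeline_configs(enabled, responders, pipeline_configs=set()))
```

Run against the state described in this report, such a check would have listed WE2 and WR2 before issuance was enabled.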

Appendix

Details of affected certificates

See the attached text file.

Based on Incident Reporting Template v. 2.0

Attached file certs.txt

Google Trust Services has completed our plan to make the legacy OCSP pipeline agnostic to ICA addition and removal. We will finish the implementation by 2024-04-26 and have updated the table below accordingly.

Action Item | Kind | Due Date | Delivery | Steps Taken
Determine a plan to make the legacy OCSP pipeline agnostic to ICA addition and removal. | Prevent | 2024-03-15 | 2024-03-15 | Completed
Refresh the checklist used when configuring new ICAs to include more validation and emphasize the step differences when using the legacy OCSP responders | Prevent | 2024-03-22 | In progress | Updates are being made
Automatically enable synthetic monitoring when new ICAs are configured | Detect | 2024-04-26 | In progress | A design is in progress
Implement changes to make the legacy OCSP pipeline agnostic to ICA addition and removal | Prevent | 2024-04-26 | In progress | The design is complete, work is in progress

GTS will have an update by March 22nd for our next AI.

GTS has completed the remediation of the AI due for 2024-03-22 and is continuing work on the last two remaining AIs. All AIs related to this incident, and their status, are listed below.

Action Item | Kind | Due Date | Delivery | Steps Taken
Refresh the checklist used when configuring new ICAs to include more validation and emphasize the step differences when using the legacy OCSP responders | Prevent | 2024-03-22 | 2024-03-21 | The ICA deployment checklist has been extended to detail the missing OCSP setup steps, with links to configuration examples.
Automatically enable synthetic monitoring when new ICAs are configured | Detect | 2024-04-26 | In progress | The design is in progress.
Implement changes to make the legacy OCSP pipeline agnostic to ICA addition and removal | Prevent | 2024-04-26 | In progress | The design is complete and implementation is underway.

GTS will provide an update by April 12th for the last two AIs. We kindly request that the NextUpdate field be set to that date.

Flags: needinfo?(bwilson)
Whiteboard: [ca-compliance] [ocsp-failure] → [ca-compliance] [ocsp-failure] Next update 2024-04-12

GTS is providing an update for the remaining two AIs due 2024-04-26.

Action Item | Kind | Due Date | Delivery | Steps Taken
Automatically enable synthetic monitoring when new ICAs are configured | Detect | 2024-04-26 | In progress | The implementation is underway. We are on track to complete it next week.
Implement changes to make the legacy OCSP pipeline agnostic to ICA addition and removal | Prevent | 2024-04-26 | In progress | The implementation is currently being tested as a dark launch in our test environment. We are progressing towards doing a dark launch in our production environment once we have greater confidence.

GTS will provide our next update by April 26 for these last two AIs. We kindly request that the NextUpdate field be set to that date.

Flags: needinfo?(bwilson)
Whiteboard: [ca-compliance] [ocsp-failure] Next update 2024-04-12 → [ca-compliance] [ocsp-failure] Next update 2024-04-26

Google Trust Services has completed the remaining two AIs due today, 2024-04-26.

Action Item | Kind | Due Date | Delivery | Steps Taken
Automatically enable synthetic monitoring when new ICAs are configured | Detect | 2024-04-26 | 2024-04-23 | This is now active in our production environment.
Implement changes to make the legacy OCSP pipeline agnostic to ICA addition and removal | Prevent | 2024-04-26 | 2024-04-26 | Traffic in our production environment is now being served by the updated implementation.

All AIs were completed within our intended timeline. If there are no further questions or comments, we kindly request to close this incident.

Flags: needinfo?(bwilson)

Now that SC-063 has passed, what is GTS's plan with the future of the OCSP-1 (not legacy) and OCSP-2 (legacy) systems? Are there plans to remove the OCSP responders and rely exclusively on CRLs?

(In reply to comment #7)

> Now that SC-063 has passed, what is GTS's plan with the future of the OCSP-1 (not legacy) and OCSP-2 (legacy) systems? Are there plans to remove the OCSP responders and rely exclusively on CRLs?

We currently plan to continue supporting both responders until it is possible to deprecate them, for the reasons we described in comment #1. Once all root programs adopt the language in SC-063, GTS will work to deprecate OCSP.
