Closed Bug 1630079 Opened 5 years ago Closed 5 years ago

Google Trust Services: Invalid OCSP responses

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ryan.sleevi, Assigned: awarner)

Details

(Whiteboard: [ca-compliance] [ocsp-failure])

On 2018-01-21, a user on mozilla.dev.security.policy reported issues with Google's OCSP responses.

This was acknowledged by a representative of GTS on 2018-01-21.

The following incident report was provided on 2018-02-21. Although it did not cover the then-current Responding to an Incident page, this was not followed up on at the time.

I could not find an existing Bugzilla bug for this, which was an oversight.

Incident Report

Summary

On January 19th, 2018, at 08:40 UTC, a code push to improve OCSP generation for a subset of the Google-operated Certificate Authorities was initiated. The change was related to the packaging of generated OCSP responses. The first time this change was invoked in production was January 19th at 16:40 UTC.

NOTE: The publication of new revocation information to all geographies can take up to 6 hours to propagate. Additionally, clients and middle-boxes commonly implement caching behavior. This results in a large window where clients may have begun to observe the outage.

NOTE: Most modern web browsers “soft-fail” in response to OCSP server availability issues, masking outages. Firefox, however, supports an advanced option that allows users to opt in to “hard-fail” behavior for revocation checking. An unknown percentage of Firefox users enable this setting. We believe most users who were impacted by the outage were these Firefox users.

About 9 hours after the deployment of the change began (2018-01-20 01:36 UTC), a user on Twitter mentioned that they were having problems with their hard-fail OCSP checking configuration in Firefox when visiting Google properties. This tweet and the few that followed during the outage period were not noticed by any Google employees until after the incident’s post-mortem investigation had begun.

About 1 day and 22 hours after the push was initiated (2018-01-21 15:07 UTC), a user posted a message to the mozilla.dev.security.policy mailing list mentioning that they too were having problems with their hard-fail configuration in Firefox when visiting Google properties.

About two days after the push was initiated, a Google employee discovered the post and opened a ticket (2018-01-21 16:10 UTC). This triggered the remediation procedures, which began in under an hour.

The issue was resolved about 2 days and 6 hours from the time it was introduced (2018-01-21 22:56 UTC). Once Google became aware of the issue, it took 1 hour and 55 minutes to resolve the issue, and an additional 4 hours and 51 minutes for the fix to be completely deployed.
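The elapsed times quoted in this summary can be cross-checked with a short calculation. The sketch below uses only the timestamps stated in the report; it assumes that the "about N hours/days" figures are measured from the first production invocation at 16:40 UTC, which is an interpretation rather than something the report states. The variable names are illustrative only.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M"

# Timestamps (UTC) as stated in the incident report.
# Assumption: elapsed times are measured from the first production invocation.
first_in_production = datetime.strptime("2018-01-19 16:40", FMT)
events = {
    "tweet": datetime.strptime("2018-01-20 01:36", FMT),
    "mailing_list_post": datetime.strptime("2018-01-21 15:07", FMT),
    "ticket_opened": datetime.strptime("2018-01-21 16:10", FMT),
    "resolved": datetime.strptime("2018-01-21 22:56", FMT),
}


def elapsed(start, end):
    """Return (days, hours, minutes) between two timestamps."""
    delta = end - start
    hours, rem = divmod(delta.seconds, 3600)
    return delta.days, hours, rem // 60


timeline = {name: elapsed(first_in_production, ts) for name, ts in events.items()}

# Time from the ticket being opened to the fix being completely deployed,
# compared with the two durations the report gives for that period.
ticket_to_resolved = events["resolved"] - events["ticket_opened"]
stated = timedelta(hours=1, minutes=55) + timedelta(hours=4, minutes=51)

for name, (d, h, m) in timeline.items():
    print(f"{name}: {d}d {h}h {m}m after first production invocation")
print("ticket to resolved:", ticket_to_resolved, "| stated:", stated)
```

Under that assumption, the computed intervals line up with the rounded figures in the text, and the 1 hour 55 minutes plus 4 hours 51 minutes add up to the interval between the ticket being opened and the stated resolution time.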

No customer reports regarding this issue were sent to the notification addresses listed in Google's CPSs or on the repository websites for the duration of the outage. This extended the duration of the outage.

Background

Google's OCSP Infrastructure works by generating OCSP responses in batches, with each batch being made up of the certificates issued by an individual CA.

In the case of GIAG2, this batch is produced in chunks of certificates issued in the last 370 days. For each chunk, the GIAG2 CA is asked to produce the corresponding OCSP responses, the results of which are placed into a separate .tar file.

The issuer of GIAG2 has chosen to issue new certificates to GIAG2 periodically; as a result, GIAG2 has multiple certificates. Two of these certificates no longer have unexpired certificates associated with them. As a result, and as expected, the CA does not produce responses for the corresponding periods.

All .tar files produced during this process are then concatenated using the --concatenate option of GNU tar. This produces a single .tar file containing all of the OCSP responses for the given Certificate Authority. This .tar file is then distributed to our global CDN infrastructure for serving.

A change was made in how we batch these responses: specifically, instead of outputting many .tar files within a batch, a concatenation of all tar files was produced.

The change in question triggered an unexpected behaviour in GNU tar which then manifested as an empty tarball. These "empty" updates ended up being distributed to our global CDN, effectively dropping some responses, while continuing to serve responses for other CAs.

During testing of the change, this behaviour was not detected, as the tests did not cover the scenario in which some chunks did not contain unexpired certificates.
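The untested scenario, a batch in which some chunks contain no responses, can be illustrated with a small merge routine. This is a minimal sketch using Python's standard tarfile module rather than GNU tar; it is neither Google's pipeline nor a reproduction of the GNU tar behaviour described above, and the function names and file names are hypothetical.

```python
import io
import tarfile


def make_chunk(responses):
    """Build an in-memory .tar with one file per (hypothetical) OCSP response."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in responses.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def merge_chunks(chunks):
    """Copy every member of every chunk archive into a single new archive."""
    out = io.BytesIO()
    with tarfile.open(fileobj=out, mode="w") as merged:
        for chunk in chunks:
            with tarfile.open(fileobj=io.BytesIO(chunk), mode="r:") as tar:
                for member in tar:
                    merged.addfile(member, tar.extractfile(member))
    return out.getvalue()


def count_members(archive):
    """Return how many entries an in-memory .tar contains."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:") as tar:
        return len(tar.getnames())


# Three chunks, the middle one empty, mirroring a period with no unexpired certificates.
chunks = [
    make_chunk({"a.der": b"A", "b.der": b"B"}),
    make_chunk({}),
    make_chunk({"c.der": b"C"}),
]
merged = merge_chunks(chunks)
print(count_members(merged))
```

Including an empty chunk in the test input is the point of the example: a test suite that only ever merges non-empty chunks would not exercise the case the report identifies as missing.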

Findings

  • The outage only impacted sites with TLS certificates issued by the GIAG2 CA as it was the only CA that met the required pre-conditions of the bug.
  • The bug that introduced this failure manifested itself as an empty container of OCSP responses. The root cause of the issue was an unexpected behavior of GNU tar relating to concatenating tar files.
  • The outage was observed by revocation service monitoring as “unknown certificate” (HTTP 404) errors. HTTP 404 errors are expected in OCSP responder operations; they typically are the result of poorly configured clients. These events are monitored and a threshold does exist for an on-call escalation.
  • Due to a configuration error, the designated Google team did not receive an escalation message.
  • External users did not use the contact details Google provided in the CPS.

Remediation Plan

  • A bug fix has been applied to prevent the same issue from happening again.
  • Test cases looking for a minimum number of OCSP responses in each tar were added to the test automation suites to catch similar issues in the future.
  • The monitoring system that was misconfigured was updated to use the correct address for escalations.
  • Both the Google Trust Services CPS (found on pki.goog) and the Google CPS (found on pki.google.com) have been updated to make it clear what email address is the most expedient path to reach the PKI team for non-security incidents.
  • The Google PKI repository page was updated to show contact details in the same way the Google Trust Services repository page already did, in the hope of helping users find a path of escalation.
  • The wizard that is returned for mails to the security email address has been updated to also include an explicit option for issues related to the “Google Certificate Authority” in the hopes of helping users who choose this path of escalation.
  • Existing procedures that are relied upon for periodic verification of effective escalation have been updated to include unknown certificate checking.
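The remediation item about checking for a minimum number of OCSP responses in each tar could look roughly like the following. This is a hedged sketch, not the test that was actually added: the exception class, function names, file names, and thresholds are all hypothetical, and it uses Python's standard tarfile module.

```python
import io
import tarfile


class TooFewResponses(Exception):
    """Raised when an archive holds fewer responses than expected."""


def check_min_responses(archive_bytes, minimum):
    """Return the number of regular files in the tar, or raise if below minimum."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:") as tar:
        count = sum(1 for member in tar if member.isfile())
    if count < minimum:
        raise TooFewResponses(
            f"expected at least {minimum} responses, found {count}"
        )
    return count


def build_archive(names):
    """Example helper: an in-memory tar with one placeholder file per name."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in names:
            payload = b"placeholder"
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()
```

Calling check_min_responses(build_archive(["1.der", "2.der"]), minimum=2) returns the count, while an empty archive checked with minimum=1 raises TooFewResponses. Run before distribution, a check of this kind would stop an empty archive from being published.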

Wayne: In responding to Bug 1630040 I noticed we never filed a bug for the previous incident. Let me know if there's anything you'd like to add, but I suspect this can be Resolved/Fixed for accounting/bookkeeping purposes.

Flags: needinfo?(wthayer)

GTS has nothing additional to add. The CA infrastructure that was affected by the 'tar' bug at the time of this issue has all been turned down, and new pipelines use independent code with checks specifically for the January 2018 issue. The CDN infrastructure from which results are ultimately served is currently the same as it was at the time of this report.

Resolving per comment #1.

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Flags: needinfo?(wthayer)
Resolution: --- → FIXED
Product: NSS → CA Program
Whiteboard: [ca-compliance] → [ca-compliance] [ocsp-failure]