Closed Bug 1634795 Opened 5 years ago Closed 4 years ago

Google Trust Services: Incorrect revocation data temporarily served for GTS Y3 & Y4

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: awarner, Assigned: awarner)

Details

(Whiteboard: [ca-compliance] [crl-failure] [ocsp-failure])

  1. How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.

During an internal review on 2020-04-17 we identified that some of the batch revocation data we rolled out on 2020-04-08 erroneously overwrote previously published revocation data for our old subordinate CAs Google Trust Services (GTS) Y3 and Y4.

  2. A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.

2019-09-05 – A scheduled, periodic, manual ceremony is conducted to produce CRLs and OCSP responses for GIAG4, GIAG4ECC, GTS CA 1O1, GTS CA 1D2, GTS Y1, GTS Y2, GTS Y3 and GTS Y4. Revocation information is produced in batches to reduce the need to access offline key material. In this case “Batch 1” was valid from 2019-09-30 to 2020-04-15, and “Batch 2” was valid from 2020-04-01 to 2020-10-15.

2019-09-30 – On schedule, the previously deployed batch of revocation information is replaced with “Batch 1.”

2020-01-27 – Post deployment, an internal review of production configuration finds a signature algorithm mismatch in the chains of GTS Y3 and GTS Y4 (EC 256 vs 384).

2020-01-30 – Mozilla bug 1612389 is filed.

2020-01-31 – As part of the response to this incident (bug 1612389), GTS Y3 and GTS Y4 are re-issued with the correct signature algorithm and new revocation data for them is produced, valid from 2020-01-31 to 2020-10-15.

2020-02-03 – New GTS Y3 and GTS Y4 and their corresponding revocation data, as well as revocation data for old GTS Y3 and GTS Y4, are published.

2020-04-08 08:15 UTC – As “Batch 1” is approaching expiration (on 2020-04-15), the “Batch 2” revocation data is retrieved from the safe, on schedule.

2020-04-08 14:55 UTC – “Batch 2” is installed, overwriting revocation data produced on 2020-01-31 for old GTS Y3 and old GTS Y4, effectively un-revoking them.

2020-04-17 16:30 UTC – An internal review identifies that “Batch 2” contained the old versions of the revocation data for the old GTS Y3 and GTS Y4, and that the valid data was erroneously overwritten by outdated information.

2020-04-17 17:02 UTC – Partial rollback of “Batch 2” to restore intended revocation data begins.

2020-04-17 18:31 UTC – The rollout of the correct CRLs finishes and the correct CRLs are now being served.

2020-04-18 01:25 UTC – Rollout of the correct OCSP responses finishes and the correct OCSP responses for old GTS Y3 and GTS Y4 are now being served.
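
The dates above imply two intervals that are worth making explicit: the overlap between the validity windows of the two batches, and how long the outdated data was served before the rollback finished. The following is a minimal Python sketch of that arithmetic. It uses only timestamps stated in the timeline; the variable names are illustrative, and because the report gives the batch validity bounds without a time of day, they are treated as plain dates.

```python
from datetime import date, datetime, timezone

# Validity windows as stated in the timeline (dates only; no time of day given).
batch1 = (date(2019, 9, 30), date(2020, 4, 15))
batch2 = (date(2020, 4, 1), date(2020, 10, 15))

# Days during which both batches were within their validity windows.
overlap_days = (min(batch1[1], batch2[1]) - max(batch1[0], batch2[0])).days


def utc(year, month, day, hour, minute):
    """Build a timezone-aware UTC timestamp."""
    return datetime(year, month, day, hour, minute, tzinfo=timezone.utc)


batch2_installed = utc(2020, 4, 8, 14, 55)  # outdated data starts being served
crl_restored = utc(2020, 4, 17, 18, 31)     # correct CRL rollout finishes
ocsp_restored = utc(2020, 4, 18, 1, 25)     # correct OCSP rollout finishes

crl_exposure = crl_restored - batch2_installed
ocsp_exposure = ocsp_restored - batch2_installed

print("batch overlap (days):", overlap_days)  # 14
print("CRL exposure:", crl_exposure)          # 9 days, 3:36:00
print("OCSP exposure:", ocsp_exposure)        # 9 days, 10:30:00
```

These figures are upper bounds measured to the end of each rollout; the timeline does not say how quickly individual serving locations picked up the corrected data.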

  3. Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem.

The outdated revocation data was rolled back and replaced with correct CRL and OCSP responses.

  4. A summary of the problematic certificates. For each problem: number of certs, and the date the first and last certs with that problem were issued.

Two certificates were impacted:

  5. The complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem.

Revocation data for:

  6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

The offline nature of root keys means that generation of CRLs and OCSP responses for issuing certificates is a mostly manual process.

Bringing root keys online to produce these objects is a sensitive security task. Each time keys are brought online there is a risk of unauthorized use. As such, processes are designed to limit the need to access these keys. This is accomplished by producing OCSP responses and CRLs for the associated certificates in batches, which greatly reduces the need to bring these root keys online.

The manual nature of this process exposes it to the risk of human error. This is what happened in this case.

We have review procedures in place to help catch these manual errors, but the ongoing COVID-19 pandemic made accessing the associated root keys and revocation data more complicated. The extraordinary access restrictions associated with the pandemic introduced additional special safety procedures for entering the facility and for leaving it in a timely manner with limited human contact.

These safety procedures also limited our ability to include a wider real-time peer review as we normally would. While not an excuse, highly abnormal operating conditions were a contributing factor to the human error that led to this incident.

  7. List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.

Correct revocation information is now being served for the old GTS Y3 and GTS Y4.
We have added new checklist items to our procedures when publishing pre-generated revocation data and are evaluating additional options to further improve our processes. In particular we are implementing presubmit checks:

  • to prevent submitting CRL files that have the same or lower CRLNumber
  • to prevent accidental removal of revoked entries in CRLs
  • to prevent accidental change from Revoked to Good for OCSP responses
  • to prevent overwriting revocation data that has a newer thisUpdate

We are also improving our Sub CA creation and revocation procedures to include an explicit action for pre-produced revocation data.
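
The presubmit checks described above can be sketched as comparisons between the currently published revocation data and the candidate about to replace it. This is a minimal illustration only, not GTS's actual tooling. `CrlInfo`, `OcspInfo`, `check_crl` and `check_ocsp` are hypothetical names, the inputs are assumed to be already parsed from the DER-encoded objects, and the sample values are invented. The sketch also ignores the case where an entry is legitimately dropped from a CRL, for example after the certificate has expired.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class CrlInfo:
    """Hypothetical summary of an already-parsed CRL."""
    crl_number: int
    this_update: datetime
    revoked_serials: frozenset


@dataclass(frozen=True)
class OcspInfo:
    """Hypothetical summary of an already-parsed OCSP response."""
    serial: int
    status: str  # "good" or "revoked"
    this_update: datetime


def check_crl(published: CrlInfo, candidate: CrlInfo) -> list:
    """Return the reasons the candidate CRL must not replace the published one."""
    problems = []
    if candidate.crl_number <= published.crl_number:
        problems.append("CRLNumber is the same as or lower than the published one")
    removed = published.revoked_serials - candidate.revoked_serials
    if removed:
        problems.append(f"revoked entries would be removed: {sorted(removed)}")
    if candidate.this_update < published.this_update:
        problems.append("published CRL has a newer thisUpdate")
    return problems


def check_ocsp(published: OcspInfo, candidate: OcspInfo) -> list:
    """Return the reasons the candidate OCSP response must not be published."""
    problems = []
    if published.status == "revoked" and candidate.status == "good":
        problems.append("status would change from Revoked to Good")
    if candidate.this_update < published.this_update:
        problems.append("published response has a newer thisUpdate")
    return problems


# Invented sample data: a stale candidate that would un-revoke serial 4097.
newer = datetime(2021, 2, 1, tzinfo=timezone.utc)
older = datetime(2021, 1, 1, tzinfo=timezone.utc)

crl_problems = check_crl(
    CrlInfo(crl_number=5, this_update=newer, revoked_serials=frozenset({4097})),
    CrlInfo(crl_number=3, this_update=older, revoked_serials=frozenset()),
)
ocsp_problems = check_ocsp(
    OcspInfo(serial=4097, status="revoked", this_update=newer),
    OcspInfo(serial=4097, status="good", this_update=older),
)

for problem in crl_problems + ocsp_problems:
    print("BLOCKED:", problem)
```

Running each check independently, rather than relying on thisUpdate alone, matters because a pre-generated batch can carry a later thisUpdate than data it should not replace.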

Assignee: wthayer → awarner
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Whiteboard: [ca-compliance]

Regarding the Response to Comment #7, the response is notably lacking "a timeline of when your CA expects to accomplish these things."

https://wiki.mozilla.org/CA/Responding_To_An_Incident#Keeping_Us_Informed makes it clear weekly updates are expected, in the absence of positive affirmation as shown via the status. This would have mitigated the above issue, had it been followed.

Do you have an update on when these changes will be implemented, and on what steps are being taken to ensure incident reports include the necessary information and receive timely updates?

Flags: needinfo?(awarner)

Sorry, I'm out on leave and I thought another team member was going to cover updates, but it looks like we may have gotten our wires crossed on who was handling follow-ups. 3 of the 4 presubmits are in place and have been since the beginning of May. The 4th appears to still be in progress and the procedure updates happened in early May as well. I'll get an ETA on the final presubmit and provide another update.

Type: enhancement → task

The final presubmit did not make the last production push, but is likely to make the next one during June.

Still on track for the next production push, but it has not happened yet.

Status still the same.

Flags: needinfo?(awarner)

The final presubmit is in place and has been exercised enough to be confident it is working as intended. All mitigations are complete at this point.

Assuming that there are no other preventative measures or remediations planned, I will schedule this bug to be closed on or after 7 August 2020 unless there are additional questions or issues to be discussed.

Flags: needinfo?(bwilson)
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Flags: needinfo?(bwilson)
Resolution: --- → FIXED
Product: NSS → CA Program
Whiteboard: [ca-compliance] → [ca-compliance] [crl-failure] [ocsp-failure]