Closed Bug 1838707 Opened 2 years ago Closed 2 years ago

Google Trust Services: Revocation data publication delay for revoked unused subordinate CAs

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nullsem, Assigned: nullsem)

Details

(Whiteboard: [ca-compliance] Next update 2023-07-28)

Google Trust Services has identified an issue related to the publication timing for revocation data following the revocation of several unused subordinate CAs. We will post a full report with our findings within the next seven days.

Assignee: nobody → nullsem
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Whiteboard: [ca-compliance]

1. How your CA first became aware of the problem (e.g., via a problem report submitted to your Problem Reporting Mechanism, a discussion in the MDSP or CCADB public mailing list, a Bugzilla bug, or internal self-audit), and the time and date.

The engineers who revoked the subordinate CA certificates realized on 2023-06-15 at 04:32 UTC that the revocation information had not been published within 24 hours of the revocation date specified in the revocation data.

2. A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a requirement became applicable, a document changed, a bug was introduced, or an audit was performed.

YYYY-MM-DD HH:MM (UTC) Description
2022-10-12 00:00 The last non-expired leaf certificate issued by subordinate CA GIAG4x expires.
2022-12-19 00:00 The last non-expired leaf certificate issued by subordinate CA GIAG4 ECC expires.
2022-12-19 00:00 The last non-expired leaf certificate issued by subordinate CA GTS Y1 expires.
2022-12-19 00:00 The last non-expired leaf certificate issued by subordinate CA GTS Y2 expires.
2022-12-19 00:00 The last non-expired leaf certificate issued by subordinate CA GTS Y3 expires.
2022-12-19 00:00 The last non-expired leaf certificate issued by subordinate CA GTS Y4 expires.
2023-05-25 06:43 Preparation for the revocation ceremony begins.
2023-05-31 04:42 Ceremony input configuration is prepared and reviewed with the time of revocation set to 2023-06-13 00:00:00 UTC.
2023-05-31 13:44 Ceremony is tested and test revocation data produced in the test ceremony is reviewed.
2023-06-13 18:34 Ceremony starts in the WebTrust audited space in a data center; revocation data is generated.
2023-06-13 22:03 Ceremony ends and participants leave the WebTrust audited space.
2023-06-13 23:17 The engineers re-enter the WebTrust audited space to work on other tasks unrelated to the ceremony.
2023-06-14 00:00 The 24-hour window to publish the revocation information elapses.
2023-06-14 01:12 The engineers exit the WebTrust audited space after having completed their work.
2023-06-14 04:56 An engineer on-site prepares a commit to our internal versioning system with the newly generated revocation information.
2023-06-14 15:03 An engineer reviews the commit and reports a documentation mistake.
2023-06-14 15:20 The author of the commit and the reviewer escort external auditors in the WebTrust audited space for the yearly data center audit and other tasks.
2023-06-14 23:56 Work with external auditors in the WebTrust audited space ends.
2023-06-15 04:32 The engineers on-site identify the failure to meet the 24-hour window.
2023-06-15 04:35 The documentation mistake is fixed and the commit gets approved and submitted to version control.
2023-06-15 04:48 Automated rollout begins to publish the new CRLs.
2023-06-15 05:07 The engineers on-site declare an internal security event.
2023-06-15 06:30 Automated rollout begins to publish the new OCSP responses.
2023-06-15 06:40 An investigation into the issue begins.
2023-06-15 06:41 CRL rollout finishes; the new CRLs are publicly available globally.
2023-06-15 09:00 Evidence and output of the investigation are collected into a document and sent to the Policy Authority for determination.
2023-06-15 12:37 OCSP rollout finishes; the new OCSP responses are publicly available globally.

3. Whether your CA has stopped, or has not yet stopped, certificate issuance or the process giving rise to the problem or incident. A statement that you have stopped will be considered a pledge to the community; a statement that you have not stopped requires an explanation.

N/A. Certificate issuance was not affected by this incident.

4. In a case involving certificates, a summary of the problematic certificates. For each problem: the number of certificates, and the date the first and last certificates with that problem were issued. In other incidents that do not involve enumerating the affected certificates (e.g., OCSP failures, delayed responses, etc.), please provide other similar statistics, aggregates, and a summary for each type of problem identified. This will help measure the severity of each problem.

Six unused subordinate CA certificates were revoked with a revocation timestamp of 2023-06-13 00:00:00 UTC. The updated revocation data was submitted for publication on 2023-06-15 at 04:35 UTC. The corresponding CRLs had fully propagated globally by 2023-06-15 at 06:41 UTC and the corresponding OCSP responses had fully propagated globally by 2023-06-15 at 12:37 UTC.
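As a rough illustration of these figures, the following Python sketch computes how long after the revocation timestamp each data type became globally available, and how far past the 24-hour window that was. The timestamps come from the timeline in section 2; the variable and function names are illustrative only, and the sketch assumes the window is counted from the revocation timestamp to global availability.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc

# Values taken from the timeline in section 2 (all UTC).
revocation_time = datetime(2023, 6, 13, 0, 0, tzinfo=UTC)
available_globally = {
    "CRL": datetime(2023, 6, 15, 6, 41, tzinfo=UTC),
    "OCSP": datetime(2023, 6, 15, 12, 37, tzinfo=UTC),
}

# Assumption: the deadline is 24 hours after the revocation timestamp.
deadline = revocation_time + timedelta(hours=24)


def time_past_deadline(available_at: datetime) -> timedelta:
    """Return how long after the deadline the data became available."""
    return available_at - deadline


for name, available_at in available_globally.items():
    print(
        f"{name}: {available_at - revocation_time} after the revocation time, "
        f"{time_past_deadline(available_at)} past the deadline"
    )
```

Under these assumptions, the CRLs were available 30 hours 41 minutes past the deadline and the OCSP responses 36 hours 37 minutes past it.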

5. In a case involving TLS server certificates, the complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem. It is also recommended that you use this form in your list “https://crt.sh/?sha256=[sha256-hash]”, unless circumstances dictate otherwise. When the incident being reported involves an SMIME certificate, if disclosure of personally identifiable information in the certificate may be contrary to applicable law, please provide at least the certificate serial number and SHA256 hash of the certificate. In other cases not involving a review of affected certificates, please provide other similar, relevant specifics, if any.

The revoked subordinate certificates in question:

6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

Google Trust Services is in the process of turning down our secondary CA platform, which runs EJBCA and originally served as a backup while transitioning to our primary CA platform running our ACME implementation. The secondary CA platform maintained several subordinate CAs that were revoked in a ceremony on 2023-06-13 and are listed in section 5 of this report. During the ceremony, we successfully generated updated revocation information for the revoked CAs, but it was published more than 24 hours after the revocation time.

The revocation ceremony for the subordinate CAs took place in the WebTrust audited space of one of our data centers. Each visit to these facilities typically spans a week and involves multiple activities. In addition to the revocation ceremony, our agenda for the week included an auditor walkthrough for our annual WebTrust audit, a key loading ceremony, and some general maintenance work, all of which had been planned for months.

Preparation for the trip took place over several weeks and included a dry run. We based input to the ceremony tool on parameters that were used in previous ceremonies, which all used 00:00:00 UTC as revocation time. The author of the ceremony input data and its reviewers did not take into consideration the time difference between UTC, which is used to specify the revocation time, and the location where the ceremony took place (UTC-5), nor the afternoon start time of the ceremony. The revocation time in the generated CRLs and OCSP responses was set to 2023-06-13 00:00:00 UTC. Since the ceremony ended at 22:03 UTC, this left a little under two hours to publish the revocation data within the mandated 24 hours.
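The timing problem described above can be sketched in a few lines of Python. The timestamps are taken from the timeline in section 2 and the UTC-5 offset from this section; the names used here are illustrative and are not part of our ceremony tooling.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
SITE_TZ = timezone(timedelta(hours=-5))  # fixed offset stated in this report

revocation_time = datetime(2023, 6, 13, 0, 0, tzinfo=UTC)
ceremony_start = datetime(2023, 6, 13, 18, 34, tzinfo=UTC)
ceremony_end = datetime(2023, 6, 13, 22, 3, tzinfo=UTC)
deadline = revocation_time + timedelta(hours=24)


def remaining_window(finished_at: datetime, deadline: datetime) -> timedelta:
    """Return the time left to publish once the ceremony has finished."""
    return deadline - finished_at


# Midnight UTC is the previous evening at the ceremony site.
print("Revocation time, local:", revocation_time.astimezone(SITE_TZ).isoformat())
print("Ceremony start, local: ", ceremony_start.astimezone(SITE_TZ).isoformat())
print("Window used before start:", ceremony_start - revocation_time)
print("Window left at ceremony end:", remaining_window(ceremony_end, deadline))
```

Seen from the ceremony site, the configured revocation time fell on the evening of 2023-06-12, so more than 18 hours of the 24-hour window had already elapsed before the ceremony started.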

Multiple automated checks are performed to ensure that we do not publish incorrect revocation data. Our ceremony document template has a section titled "Updating Records" at the end, which includes tasks such as updating the inventories, archiving ceremony evidence, and publishing ceremony outputs, but it did not specify the deadline by which these actions must be completed.

The engineers who conducted the ceremony did not consider the implications of the date specified in the ceremony tool parameters and concluded work for the day after sending the revocation data update for review. The review did not occur until the following day after the walkthrough, which contributed to the delay. The team published the change immediately after realizing that they had violated the 24-hour requirement.

Two primary factors contributed to this incident: (1) the revocation time specified in the ceremony input did not account for the offset between UTC and local time or for the afternoon start time of the ceremony, and (2) publishing the updated revocation data in a timely manner was not a scripted part of the post-ceremony tasks.

7. List of steps your CA is taking to resolve the situation and ensure that such a situation or incident will not be repeated in the future. The steps should include the action(s) for resolving the issue, the status of each action, and the date each action will be completed.

To ensure we have sufficient time to publish updated revocation information following a ceremony, we will implement a linter for the ceremony tool input that will warn us when the revocation time does not allow sufficient time for the publishing and propagation of revocation data.

We will also make changes for future ceremonies:

  • For non-critical revocations, our preparation procedures will emphasize setting a revocation time that leaves a sufficient buffer to publish the updated information following a ceremony.
  • We will add an exhaustive post-ceremony checklist, with deadlines, to our ceremony document template.

We plan to complete these changes by July 28, 2023.

Whiteboard: [ca-compliance] → [ca-compliance] Next update 2023-07-28

We completed all actions described in section 7 of the incident report.

A linter is now in place to verify the input configuration of our ceremony tool. The linter is aware of the scheduled date, location (to determine timezone) and which half of the day (morning or afternoon) ceremonies are planned for. The linter ensures that the revocation time of newly revoked certificates is consistent with the estimated ceremony end time to allow sufficient time for publishing the generated revocation data. The linter is executed automatically on each commit modifying the input configuration of ceremonies.
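A minimal sketch of the kind of check described above is shown below. It is not our actual linter: the `CeremonyPlan` fields, the estimated end times per half of the day, and the 12-hour required buffer are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date, datetime, time, timedelta, timezone

PUBLICATION_DEADLINE = timedelta(hours=24)
# Hypothetical values: buffer needed to publish and propagate, and the
# estimated local end time for each half of the day.
REQUIRED_BUFFER = timedelta(hours=12)
ESTIMATED_END = {"morning": time(12, 0), "afternoon": time(18, 0)}


@dataclass(frozen=True)
class CeremonyPlan:
    local_date: date           # scheduled date at the ceremony site
    utc_offset_hours: int      # derived from the ceremony location
    half_of_day: str           # "morning" or "afternoon"
    revocation_time: datetime  # timezone-aware, as written to CRL/OCSP


def lint_revocation_time(plan: CeremonyPlan) -> list[str]:
    """Return warnings if the revocation time leaves too little time to publish."""
    site_tz = timezone(timedelta(hours=plan.utc_offset_hours))
    estimated_end = datetime.combine(
        plan.local_date, ESTIMATED_END[plan.half_of_day], tzinfo=site_tz
    )
    deadline = plan.revocation_time + PUBLICATION_DEADLINE
    buffer = deadline - estimated_end
    if buffer < REQUIRED_BUFFER:
        return [
            f"only {buffer} between estimated ceremony end and publication "
            f"deadline; at least {REQUIRED_BUFFER} required"
        ]
    return []


# The input used in this incident would have been flagged.
incident = CeremonyPlan(
    local_date=date(2023, 6, 13),
    utc_offset_hours=-5,
    half_of_day="afternoon",
    revocation_time=datetime(2023, 6, 13, 0, 0, tzinfo=timezone.utc),
)
print(lint_revocation_time(incident))
```

With these assumed parameters, a revocation time at or after the estimated ceremony end passes the check, while the midnight UTC value used in this incident produces a warning.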

We also amended our procedures to emphasize that the revocation time must be set to a timestamp that allows a sufficient time buffer to publish the generated revocation data following a ceremony. Note that we decided to treat both urgent and non-urgent revocations in the same manner to keep our procedures simple and consistent.

To ensure revocation data is published in a timely manner, we also added a checklist for all post-ceremony activities, with deadlines ranging from "immediately" to "7 days", to our ceremony document template.

If there are no further questions or comments, we ask that this bug be closed.

Flags: needinfo?(bwilson)

If there are any questions, we can keep this open. Otherwise, I plan to close this on Friday, 28-July-2023.

Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Flags: needinfo?(bwilson)
Resolution: --- → FIXED