Closed Bug 1793848 Opened 4 months ago Closed 2 months ago

GoDaddy: Failure to revoke 210 subscriber certificates within 24 hours

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jcoover, Assigned: brittany)

Details

(Whiteboard: [ca-compliance] [delayed-revocation-leaf])

Attachments

(1 file)

Attached file List of affected certs

Incident Report

Problem Summary

We failed to process 210 customer revocation requests within 24 hours. This violates Section 4.9.1.1 of the Baseline Requirements (BRs) for publicly trusted SSL certificates, which states: "The CA SHALL revoke a Certificate within 24 hours if the Subscriber requests in writing that the CA revoke the Certificate."

1. How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.

On 09/23/22 at 8:47 MST, during the normal course of business, the RA team noticed a delayed revocation request and escalated it to the PKI Engineering team. On 09/23/22 at 10:55 MST, the PKI Engineering team confirmed that 210 customer requests for revocation had not been processed as expected.

2. A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.

MM/DD/YY HH:MM (all times are MST)

  • 09/23/22 8:47 - RA team noticed a delayed revocation request and escalated
  • 09/23/22 10:55 - PKI Engineering confirmed issue and identified root cause
  • 09/23/22 13:45 - PKI Engineering revoked all 210 certificates. Note: only 123 of these certificates were still active.
  • 09/23/22 13:45 - PKI Engineering added monitoring to mitigate further issues with revocation requests

3. Whether your CA has stopped or has not yet stopped, issuing certificates with the problem. A statement that you have will be considered a pledge to the community; a statement that you have not requires an explanation.

The bug was limited to certificate revocation requests and subscriber certificate issuance was not impacted.

4. A summary of the problematic certificates. For each problem: number of certs, and the date the first and last certs with that problem were issued.

Summary: 210 customer requests for revocation were not processed; 123 of the affected certificates were still active.
Date of First: 11/13/2020
Date of Last: 09/22/2022

5. The complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem.

<refer to attached file revocation_certsh.txt>

6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now. See Google's guidance on root cause analysis for ideas of what to include.

We introduced a bug when we integrated a new event queue system (RabbitMQ) in April 2020. In rare cases, the event queue can become unresponsive, causing revocation requests to go unprocessed. The requests are still persisted in a database but are never processed. Because the requests were stored in the database, GoDaddy engineers were able to review them and process the revocations with the originally requested reasons.

The delay in the detection of the bug was largely due to the rarity of the situation required for the revocation requests to go unprocessed. Since the introduction of the new queue system, we have processed over 2 million revocation requests successfully.
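To illustrate the general failure mode described above (a request is persisted, but the hand-off to the queue fails), here is a minimal sketch. It is not GoDaddy's code: `RequestStore`, `submit_revocation`, the status strings, and the use of `ConnectionError` are all hypothetical stand-ins, and no real RabbitMQ client is involved. The point is only that a failed publish should leave a visible marker on the stored request rather than a row that looks like any other waiting request.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class RequestStore:
    """Hypothetical in-memory stand-in for the database that persists requests."""
    status: Dict[str, str] = field(default_factory=dict)


def submit_revocation(store: RequestStore,
                      publish: Callable[[str], None],
                      serial: str) -> bool:
    """Persist a revocation request, then try to enqueue it.

    `publish` is an assumed callable that hands the serial to the event
    queue and raises ConnectionError when the broker is unreachable.
    Returns True if the request was enqueued, False otherwise.
    """
    # Persist first, so the request survives even if the queue is down.
    store.status[serial] = "pending"
    try:
        publish(serial)
    except ConnectionError:
        # Record the failure explicitly so a monitor or retry job can find it.
        store.status[serial] = "publish_failed"
        return False
    store.status[serial] = "queued"
    return True
```

With this shape, a query for `publish_failed` rows gives an immediate list of requests that never reached the queue.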

7. List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied by a timeline of when your CA expects to accomplish these things.

  • On 9/23/22, PKI Engineering revoked all 210 certificates with the originally requested revocation reason.
  • On 9/23/22, PKI Engineering implemented automated alerts that notify the team to take action if any revocations are not processed as scheduled.
  • Additional System Updates (Pending 1/31/23): Implement an automated failsafe to process delayed revocation requests.
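The alerting step above could be approximated by a periodic check for requests that are still unprocessed after some threshold. The sketch below is an assumption-laden illustration, not the deployed monitor: the `RevocationRequest` record, the `find_overdue` function, and the one-hour default threshold are all hypothetical. Only the 24-hour deadline comes from the BRs, so any alert threshold has to sit well below it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Optional


@dataclass
class RevocationRequest:
    """Hypothetical shape of a persisted revocation request."""
    serial: str
    requested_at: datetime
    processed_at: Optional[datetime] = None


def find_overdue(requests: Iterable[RevocationRequest],
                 now: datetime,
                 threshold: timedelta = timedelta(hours=1)) -> List[RevocationRequest]:
    """Return requests that are still unprocessed and older than `threshold`.

    The one-hour default is an illustrative value chosen to leave ample
    time before the 24-hour BR deadline; it is not taken from the report.
    """
    return [
        r for r in requests
        if r.processed_at is None and now - r.requested_at > threshold
    ]
```

A scheduler would call `find_overdue` at a fixed interval and raise an alert whenever the returned list is non-empty. An empty list after the backlog is processed corresponds to the alert clearing.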
Assignee: bwilson → jcoover
Type: defect → task
Whiteboard: [ca-compliance] [delayed-revocation-leaf]
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true

No formal updates at this time. We will continue to monitor for questions/comments.

Thank you for providing this report. Some follow-up questions:

  • How did you determine that the bug was introduced in April 2020?
  • What action(s)/situation(s) were determined to be the cause of the queue to become unresponsive?
  • Can you describe how the automated alerts configured on 9/23 have been tested to ensure the alerting is working properly? (i.e., have you designed positive and negative test cases?)

Just wanted to acknowledge that we have received these questions and expect to post answers early next week.

(In reply to Chris Clements from comment #4)

  • How did you determine that the bug was introduced in April 2020?

The bug is an oversight in our write-failure logic for the current queue framework (RabbitMQ), which was introduced in April 2020. Prior to this introduction, our queue system was built around MySQL DB tables and the failure logic was quite different.

  • What action(s)/situation(s) were determined to be the cause of the queue to become unresponsive?

Networking changes and failures are the cause behind the queue becoming unresponsive.

  • Can you describe how the automated alerts configured on 9/23 have been tested to ensure the alerting is working properly? (i.e., have you designed positive and negative test cases?)

We can reproduce unprocessed revocation requests for testing purposes with the monitor. We also put the monitor in place before processing the affected revocation requests, which resulted in successful alerting for the 210 unprocessed requests and the subsequent clearing of the alerts after processing was completed.

On 10/24/2022, the PKI development team deployed a failsafe which processes delayed revocation requests automatically (bullet three from action plan above). We have formally completed the last step in our remediation plan. We will continue to monitor for any questions/comments.
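A failsafe of this kind can be sketched as a periodic sweep over the persisted requests. The report does not describe the actual implementation, so everything named here is hypothetical: the `PendingRevocation` record, the `sweep_delayed` function, the `revoke` callable, and the one-hour `max_age` default. The sketch keeps the originally requested reason, as the manual remediation did.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Iterable, List, Optional


@dataclass
class PendingRevocation:
    """Hypothetical persisted request, including the subscriber's reason."""
    serial: str
    reason: str
    requested_at: datetime
    processed_at: Optional[datetime] = None


def sweep_delayed(pending: Iterable[PendingRevocation],
                  revoke: Callable[[str, str], None],
                  now: datetime,
                  max_age: timedelta = timedelta(hours=1)) -> List[str]:
    """Revoke every unprocessed request at least `max_age` old.

    `revoke(serial, reason)` is an assumed callable that performs the
    revocation. Returns the serials handled in this pass.
    """
    handled = []
    for req in pending:
        # Skip requests already processed or still within the normal window.
        if req.processed_at is not None or now - req.requested_at < max_age:
            continue
        try:
            revoke(req.serial, req.reason)
        except Exception:
            # Leave the request untouched so the next sweep retries it.
            continue
        req.processed_at = now
        handled.append(req.serial)
    return handled
```

Catching the exception per request means that one failing revocation does not block the rest of the backlog, and the failed request stays visible to monitoring until a later sweep succeeds.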

Assignee: jcoover → brittany
Product: NSS → CA Program

Are there any additional questions or issues to be raised by the community? If not, I plan to close this on or about Wed. 23-Nov-2022.

Flags: needinfo?(bwilson)
Status: ASSIGNED → RESOLVED
Closed: 2 months ago
Flags: needinfo?(bwilson)
Resolution: --- → FIXED