Closed Bug 1645832 Opened 4 years ago Closed 3 years ago

GoDaddy: Expired CRLs

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dxhood, Assigned: dxhood)

Details

(Whiteboard: [ca-compliance] [crl-failure])

Attachments

(1 file)

1- How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.
We maintain continuous communication between our customer-facing teams and our development teams. On the morning of June 4, our customer-facing team received an inquiry from a customer reporting that the CRL next update date was June 3. The customer-facing team promptly escalated this to the developers, who investigated and identified that the CRLs had expired the day prior.
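A freshness check of this kind is straightforward to automate. The following is a minimal sketch (not GoDaddy's actual tooling; the URL and warning threshold are placeholders) of fetching a CRL and flagging a stale or soon-to-expire nextUpdate:

# Minimal sketch of an automated CRL-freshness check (placeholder URL, not
# GoDaddy's actual tooling). Requires the "cryptography" package (>= 42 for
# the *_utc accessors).
from datetime import datetime, timedelta, timezone
from urllib.request import urlopen

from cryptography import x509

CRL_URL = "http://example.com/example.crl"   # placeholder distribution point
WARN_MARGIN = timedelta(hours=12)            # alert before the CRL actually expires

def check_crl(url: str) -> str:
    crl = x509.load_der_x509_crl(urlopen(url).read())
    now = datetime.now(timezone.utc)
    if crl.next_update_utc < now:
        return f"EXPIRED: nextUpdate {crl.next_update_utc} is in the past"
    if crl.next_update_utc - now < WARN_MARGIN:
        return f"WARNING: CRL expires soon, at {crl.next_update_utc}"
    return f"OK: CRL valid until {crl.next_update_utc}"

if __name__ == "__main__":
    print(check_crl(CRL_URL))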

2- A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.
6/3/2020 10:59 AM - the CRLs expired
6/4/2020 6:27 AM - our customer-facing team received an inquiry from a customer indicating the CRL next update was showing as 6/3/2020
6/4/2020 7:01 AM - our customer-facing team contacted the engineering team to begin the investigation
6/4/2020 8:26 AM - engineering identified that the CRLs had expired and manually kicked off the process to regenerate them
6/4/2020 10:40 AM - new CRLs were published

3- Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem. A statement that you have will be considered a pledge to the community; a statement that you have not requires an explanation.
N/A

4- A summary of the problematic certificates. For each problem: number of certs, and the date the first and last certs with that problem were issued.
N/A

5- The complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem.
N/A

6- Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.​
GoDaddy is performing a hardware uplift of critical systems to improve the overall redundancy and performance of the environment. While decommissioning the old environment and swinging over to the new environment, the node serving the CRLs was brought down. Normal procedure for taking systems down includes placing systems in 'maintenance' mode, which suppresses alarms. As part of the investigation it was identified that the CRL node had not completed the swing to the new environment, which prevented the automatic refresh on June 3. Once identified, the engineering team promptly reinstated the CRL server and published updated CRLs.
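For illustration only (the hostname and freshness threshold below are hypothetical, not GoDaddy's infrastructure), a pre-decommission gate of this kind could be a simple check that the replacement node is already publishing a current CRL before the old node is taken down:

# Hypothetical pre-decommission gate (placeholder hostname): refuse to proceed
# unless the replacement node is already serving a CRL that is both recently
# generated and not yet expired. Requires "cryptography" >= 42.
from datetime import datetime, timedelta, timezone
from urllib.request import urlopen

from cryptography import x509

NEW_NODE_CRL = "http://crl-new.internal.example/ca.crl"  # placeholder URL
MAX_AGE = timedelta(hours=24)   # how recent the thisUpdate field must be

def new_node_ready(url: str) -> bool:
    crl = x509.load_der_x509_crl(urlopen(url).read())
    now = datetime.now(timezone.utc)
    recently_generated = now - crl.last_update_utc < MAX_AGE
    not_expired = crl.next_update_utc > now
    return recently_generated and not_expired

if not new_node_ready(NEW_NODE_CRL):
    raise SystemExit("Abort decommission: new node is not serving a current CRL")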

7- List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.​
CRLs were republished immediately. The engineering team improved procedures and playbooks for verifying system status before decommissioning. As part of the hardware uplift project, GoDaddy is migrating the CRL generation process to a clustered environment. This change should virtually eliminate the risk of the CRLs failing to generate on time in the future, as an outage in one server or an entire site would not affect availability of the service. Delivery of this change is anticipated for August.
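One common way to obtain that availability property (stated here as an assumption about the general pattern, not GoDaddy's actual design; names below are placeholders) is to make CRL generation idempotent and schedule it on every node in the cluster, so any surviving node regenerates a CRL that is approaching its nextUpdate:

# Sketch of an idempotent, cluster-wide regeneration trigger (an assumed
# pattern, not GoDaddy's stated design). Every node runs this on a schedule;
# whichever node sees the published CRL nearing expiry regenerates it, so no
# single node or site outage stops publication.
from datetime import datetime, timedelta, timezone

REGENERATE_MARGIN = timedelta(hours=48)  # regenerate well ahead of nextUpdate

def should_regenerate(published_next_update: datetime) -> bool:
    """Return True when the published CRL is close enough to expiry to renew."""
    return published_next_update - datetime.now(timezone.utc) < REGENERATE_MARGIN

# Each node would pair this check with its CA software's signing and
# publication step (not shown), which must tolerate concurrent regeneration.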

Assignee: bwilson → dxhood
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Whiteboard: [ca-compliance]

Thanks for the report. I'm concerned that this incident report lacks the level of detail that would help other CAs.

For example, the following statement:

Normal procedure for taking systems down includes placing systems in 'maintenance' mode, which suppresses alarms.

It's ambiguous whether this transition followed "normal procedure" or not. It's similarly ambiguous whether GoDaddy already had monitoring in place for the validity period and had disabled that check as part of the transition, as well as why such disabling would be necessary. It's also ambiguous why GoDaddy did not detect that the "CRL node had not completed the swing to the new environment", as well as what that "swing to the new environment" means.

The engineering team improved procedures and playbooks for verifying system status before decommissioning.

At this point, incident reports are starting to read like the subtitles to a Monty Python movie. "We apologize for the fault in the playbook. Those responsible have been sacked. (another incident occurs). We apologize again for the fault in the playbook. Those responsible for sacking the people who have just been sacked have been sacked. (another incident) The directors of the firm hired to continue improving the playbook after the other people had been sacked, wish it to be known that they have just been sacked."

I think a more concrete description of what the playbook steps involved, how they were reasonably believed to have been sufficient, and what specific changes have been made are all extremely relevant. While I can appreciate "Things will be better in the future", the goal is to help provide systemic improvement, and there's no way another CA could, from reading this incident, know what GoDaddy had in place before this incident, how that failed, or what steps GoDaddy believes are sufficient going forward.

I highlight all of this, because it's troubling that it took a customer/external party to notice this, as it "seems" like it should be easy to catch. If that's an uncharitable read of the situation, providing more details about why it's more complicated than it seems is exactly what GoDaddy should be doing.

Flags: needinfo?(dxhood)

Hello Ryan,

Thank you for your comments.

Our process includes placing systems in maintenance mode to silence alarms for short periods of time, such as when systems are being patched, or when systems are being permanently retired. This is standard procedure and was followed properly.

As part of migrating environments and retiring old hardware, our procedures include verifying services were properly migrated. In this case, the team missed a step to verify the CRLs were being generated from the new environment before shutting off the old one.

Since retiring environments is extremely rare (this is the first time since our inception), we are adding procedures to our disaster recovery (DR) ceremonies to practice generating the CRLs from alternate sites. Additionally, our DR ceremonies will be expanded to include multiple teams, each responsible for verifying that specific functionality is working as intended, and a master of ceremonies to ensure all teams check in.

This, coupled with the new environment, will improve our overall resiliency and help prevent further issues.

Flags: needinfo?(dxhood)

Can you provide more concrete details about your DR procedures, such as providing the checklist of things to check and/or verify?

I think the concern here is that, just as this issue found a gap, other gaps may exist. The best way to address and ameliorate that concern is to build a picture of what GoDaddy's view of Best Practices for DR should be / is, to ensure that there are no policy or compliance issues.

Flags: needinfo?(dxhood)

Ryan,
Thank you for your question.
For security purposes, GoDaddy will not disclose our Disaster Recovery procedures, although we understand your concern.
We can assure you we have a well-developed playbook for our DR Ceremony, which happens twice a year and tests all capabilities of our PKI environment. The test consists of two parts:
1- We swing the Global Site Selector across all of our servers to verify that traffic is redirected appropriately with no interruption in service.
2- We build a Database Virtual Machine to ensure that it contains usable data.
These parts are divided into three stages, and each stage has an extensive checklist that must be completed entirely:
• Prep Stage: (i) preparation of the testing environment, including arranging for an internal auditor to participate in the Ceremony; (ii) gathering the preparation documentation that needs to be submitted for review; and (iii) approval and preparation of the VM.
• Test Day: Perform the DR, supervised by developers, engineers, management, and internal audit.
• Post-Test Paperwork: The completed paperwork and results of the DR Ceremony are submitted to the department Director for final review and approval.
Also, as we have mentioned previously, we will be implementing an end-to-end internal review that should start at the beginning of Q3.
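As a neutral illustration of what a post-swing service check in such a test might look like (the endpoints below are placeholders, not GoDaddy's actual services), each critical service can be probed after the Global Site Selector swing to confirm it still responds:

# Illustrative post-failover smoke test (placeholder endpoints, not GoDaddy's
# actual services): after swinging traffic, probe each critical service and
# report any that fail to respond.
from urllib.request import urlopen

SERVICES = {
    "CRL distribution point": "http://crl.example.test/ca.crl",
    "OCSP responder": "http://ocsp.example.test/",
    "Issuance API health": "https://api.example.test/health",
}

def probe(name: str, url: str) -> bool:
    try:
        with urlopen(url, timeout=10):
            ok = True                      # connection succeeded, 2xx/3xx response
    except OSError:
        ok = False                         # HTTP error status, timeout, or refusal
    print(f"{'PASS' if ok else 'FAIL'}: {name} ({url})")
    return ok

results = [probe(name, url) for name, url in SERVICES.items()]
if not all(results):
    raise SystemExit("One or more critical services failed the post-swing check")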

Flags: needinfo?(dxhood)

Prior to this incident, GoDaddy assured the community that it had well-developed playbooks to ensure compliance with all applicable requirements, by virtue of Management’s Assertion. This incident, however, shows serious gaps and oversights, and so it’s natural to take no reassurance from similar statements.

I am concerned by blanket statements like “For security purposes, GoDaddy will not disclose”. The trust placed in a CA is conditioned upon the degree of transparency put forward by the CA in demonstrating they can comply with all applicable requirements. I can understand and appreciate exceptional circumstances where this might be true, but I think GoDaddy should be more than capable of disclosing a substantive amount of detail of this checklist. For comparison, other CAs have been more than able to provide meaningful engagements in offering transparency, such as Bug 1640805 or Bug 1588001, so GoDaddy is not following industry best practices here.

I think this is particularly concerning given bugs like Bug 1524815 in the past, where GoDaddy-specific procedures ignored lessons from other CAs in a way that led to incidents.

Obviously, I am and remain concerned about the testing procedures in play here, and the limited information shared here does not build further confidence in GoDaddy’s remediation, nor does it provide sufficient insight that is useful for other CAs to integrate into their own procedures for robustness. Incidents that don’t help reassure the community or improve the ecosystem are the most troubling of incidents, and so my hope is that GoDaddy will carefully re-evaluate its answer and consider how it might provide a more appropriate response.

Flags: needinfo?(dxhood)

Ryan,
We are interested in continuing to be good stewards of the community and contribute in every way we can for the betterment of the ecosystem.
With that being said, the DR plan is an internal document that contains sensitive information about our environment; disclosing that document therefore poses a security risk that we are unwilling to accept.
We are in the process of updating our DR procedures, including moving the CRLs to a clustered environment.
We would be happy to share a sanitized version of our updated procedures when they are ready. We believe that will be sufficient to satisfy your request.

Flags: needinfo?(dxhood)

Thanks Daniela. Indeed, a sanitized version may be sufficient: the goal is to ensure transparency so that we can learn and incorporate good practices. I don't see a timeline attached here: do you have a sense of when that will be?

Flags: needinfo?(dxhood)

Hi Ryan,
We will be posting the document within this Bug on 11/30/2020.

Flags: needinfo?(dxhood)
Whiteboard: [ca-compliance] → [ca-compliance] Next Update - 2020-11-30

Hello,

As previously committed, attached is a sanitized excerpt from our operations guide showing the teams involved (section 11.2.1) and the checks used to verify critical services are operating as expected during tests (section 11.7.1.1). This practice, combined with the move of our CRLs to a clustered environment, is expected to mitigate potential outages from future moves.

I have reviewed the Disaster Recovery Guide. I don't think it presents a "checklist of things to check and/or verify" but I don't have any further questions, and I think this matter can be closed.

Flags: needinfo?(ryan.sleevi)
Whiteboard: [ca-compliance] Next Update - 2020-11-30 → [ca-compliance] Next Update - 2021-01-01
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Flags: needinfo?(ryan.sleevi)
Product: NSS → CA Program
Whiteboard: [ca-compliance] Next Update - 2021-01-01 → [ca-compliance] [crl-failure]