Closed Bug 1715421 Opened 3 years ago Closed 3 years ago

Google Trust Services: Failure to revoke subscriber certificates within BR timeframe

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ryan.sleevi, Assigned: fotisl)

Details

(Whiteboard: [ca-compliance] [leaf-revocation-delay])

In Bug 1706967, GTS was informed, and confirmed, on 2021-04-22 that they were using an unauthorized/undisclosed method of domain control validation.

In Bug 1706967, Comment #4, GTS acknowledges that they did not complete revocation until 2021-05-01, a total of 9 days to revoke.

Although GTS did not make the determination themselves until 2021-04-30, they were unambiguously made aware of the non-compliance on 2021-04-22.

With respect to CP/CPS violations, Section 4.9.1.1 of the Baseline Requirements requires that a CA MUST revoke within 5 days if:

  1. The CA is made aware that the Certificate was not issued in accordance with these
    Requirements or the CA’s Certificate Policy or Certification Practice Statement;

However, with respect to Domain Control Validation, the CA only has 24 hours to revoke. This is due to the following clause:

  1. The CA obtains evidence that the validation of domain authorization or control for
    any Fully‐Qualified Domain Name or IP address in the Certificate should not be
    relied upon.
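As a rough illustration of the two clocks described above, the following Python sketch computes both deadlines from a single "made aware" timestamp. The function name `revocation_deadlines` is hypothetical, and starting the clock at the 2021-04-22 12:31 UTC filing time of Bug 1706967 is an assumption made for the example; the 24-hour and 5-day windows are the ones stated in the quoted clauses.

```python
from datetime import datetime, timedelta, timezone

# Windows taken from the clauses quoted above (BR Section 4.9.1.1).
DOMAIN_VALIDATION_WINDOW = timedelta(hours=24)
POLICY_VIOLATION_WINDOW = timedelta(days=5)


def revocation_deadlines(made_aware: datetime) -> dict:
    """Return the latest compliant revocation time under each clause."""
    return {
        "24h": made_aware + DOMAIN_VALIDATION_WINDOW,
        "5d": made_aware + POLICY_VIOLATION_WINDOW,
    }


# Assumed start of the clock: the time Bug 1706967 was filed.
reported = datetime(2021, 4, 22, 12, 31, tzinfo=timezone.utc)
deadlines = revocation_deadlines(reported)
print(deadlines["24h"].isoformat())
print(deadlines["5d"].isoformat())
```

Under that assumption, both deadlines fall well before the 2021-05-01 revocation date acknowledged in Bug 1706967, Comment #4.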

In Bug 1706967, Comment #7, on 2021-05-11, this non-compliance was highlighted to Google Trust Services, but no subsequent issue was filed. This incident report is to track the factors that caused a delay in revocation, and the steps Google Trust Services is taking to prevent such future delays.

Flags: needinfo?(fotisl)

GTS acknowledges this bug and will provide a response shortly.

1. How your CA first became aware of the problem

During the investigation of Bug 1706967, we identified that a number of certificates were not issued in accordance with our CPS. Per 4.9.1.1 the certificates needed to be revoked and reissued under the corrected CPS.

2. A timeline of the actions your CA took in response.

YYYY-MM-DD (UTC) Description
2021-04-22 12:31 Bug 1706967 is filed noting that GTS includes a forbidden method for domain control validation in the CPS
2021-04-22 22:09 GTS acknowledges receipt of the bug
2021-04-26 14:00 While preparing the incident report, GTS identifies that an update to our CPS is required
2021-04-26 16:25 CPS is updated to version 3.4 including the new validation method
2021-04-30 12:12 Initial contact with our subscribers takes place to inform them that revocation of all certificates issued using the tls-alpn-01 challenge may be necessary.
2021-04-30 15:30 The decision to revoke all certificates is finalized.
2021-04-30 17:28 CPS 3.4 is published to the repository
2021-05-01 00:20 Mass reissuance of all certificates begins
2021-05-01 12:10 Revocation of the first batch of certificates
2021-05-01 15:45 Revocation of the remaining certificates begins
2021-05-01 20:02 Revocation of all but 3 certificates is complete.
2021-05-03 17:26 Revocation of all certificates is complete.
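The elapsed time between the initial report and each later milestone can be derived directly from the timeline above. The sketch below is only an illustration: the helper `utc` and the milestone labels are hypothetical names, while the timestamps are copied from the timeline.

```python
from datetime import datetime, timezone


def utc(text: str) -> datetime:
    """Parse a 'YYYY-MM-DD HH:MM' timeline entry as a UTC timestamp."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)


reported = utc("2021-04-22 12:31")
milestones = {
    "decision to revoke finalized": utc("2021-04-30 15:30"),
    "all but 3 certificates revoked": utc("2021-05-01 20:02"),
    "all certificates revoked": utc("2021-05-03 17:26"),
}

# Time elapsed since the report for each milestone.
elapsed = {name: when - reported for name, when in milestones.items()}
for name, delta in elapsed.items():
    hours, remainder = divmod(delta.seconds, 3600)
    print(f"{name}: {delta.days}d {hours}h {remainder // 60}m after the report")
```

Measured this way, the final revocation completes 11 days, 4 hours and 55 minutes after Bug 1706967 was filed.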

3. Whether your CA has stopped, or has not yet stopped, certificate issuance or the process giving rise to the problem or incident.

As per Bug 1706967, we have stopped issuance of certificates under the older CPS that referred to the disallowed domain control validation method.

4. In a case involving certificates, a summary of the problematic certificates. For each problem: the number of certificates, and the date the first and last certificates with that problem were issued. In other incidents that do not involve enumerating the affected certificates (e.g. OCSP failures, audit findings, delayed responses, etc.)

248142 certificates issued using the tls-alpn-01 ACME challenge before the CPS update.

5. In a case involving certificates, the complete certificate data for the problematic certificates.

The list of crt.sh links for the associated certificates is too large to attach to the bug so we have provided the list at the following URL: https://drive.google.com/file/d/1DBVKRZPlbaNMCvPF8e0W9_7UPEoFmKv6/view?usp=sharing.

The SHA256 hash of the file is 5f4a450e5cfb2c5f933293c67ffcf0fe3c53752049186407762ecec857141d9a.
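For readers who download the list, a minimal way to check it against the published hash is sketched below using Python's standard `hashlib` module. The function names and the local file path in the usage note are hypothetical; only the expected digest comes from this report.

```python
import hashlib

# Digest published in this incident report.
EXPECTED_SHA256 = "5f4a450e5cfb2c5f933293c67ffcf0fe3c53752049186407762ecec857141d9a"


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large lists need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str) -> bool:
    """Return True if the file at `path` has the digest stated in the report."""
    return sha256_of_file(path) == EXPECTED_SHA256
```

Calling `matches_published_hash("affected-certificates.txt")`, where the path is whatever name the download was saved under, would indicate whether the local copy matches the file GTS described.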

6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

For a full analysis of the events that led to the revocation of the affected certificates, please refer to Bug 1706967.

Within one day of taking the decision to revoke, GTS reissued all certificates, and started the mass revocation process. In 5 hours, all but 3 certificates were revoked.  The 3 remaining certificates were delayed by 2 days as they required manual handling.

We believe the root cause of this incident is the delayed identification of the need for reissuance. Had the decision to revoke been taken earlier, we would have revoked the entirety of the certificates within the timeframe imposed by the BRs. 

7. List of steps your CA is taking to resolve the situation and ensure that such situation or incident will not be repeated in the future, accompanied with a binding timeline of when your CA expects to accomplish each of these remediation steps.

For a detailed analysis of our mitigation plan and timelines, please refer to Bug 1706967 Comment 11.

Flags: needinfo?(fotisl)

(In reply to Fotis Loukos from comment #2)

We believe the root cause of this incident is the delayed identification of the need for reissuance. Had the decision to revoke been taken earlier, we would have revoked the entirety of the certificates within the timeframe imposed by the BRs. 

I'm struggling to see how this is a root cause. This seems indistinguishable from saying "a wrong decision was made" or, put differently, "human error".

Note from https://wiki.mozilla.org/CA/Responding_To_An_Incident

For example, it’s not sufficient to say that “human error” or “lack of training” was a root cause for the incident, nor that “training has been improved” as a solution. While a lack of training may have contributed to the issue, it’s also possible that error-prone tools or practices were required, and making those tools less reliant on training is the correct solution. When training or a process is improved, the CA is expected to provide specific details about the original and corrected material, and specifically detail the changes that were made, and how they tie to the issue. Training alone should not be seen as a sufficient mitigation, and focus should be made on removing error-prone manual steps from the system entirely.

Question 6 provides no details or explanation about why revocation was delayed, nor do I see a discussion on Bug 1706967. Similarly, I see no discussion from GTS about the failure to report this incident when it was emphasized to GTS (as an incident) on 2021-05-11, in Bug 1706967, Comment #7.

GTS' reference to Bug 1706967, Comment #11 also leaves much to be desired in the context of this incident report. The failure to analyze the factors here, on this incident, equally leads to a failure to see how those problems are mitigated by the solutions outlined in Bug 1706967, Comment #11. In particular, that comment identifies several changes being made:

  • More automation
  • Better issue tracking
  • Weekly reviews
  • Tools for large-scale revocations
  • Shuffling deckchairs New compliance role

None of these seem to be directly related to this incident and failure. Equally important is that the new compliance role and increased tracking/reviews were completed on 2021-05-03 (Bug 1706967, Comment #11), and yet this bug is evidence that GTS still failed to file an incident report, suggesting that perhaps those controls are insufficient.

Without reference to Bug 1706967, and without simply copy/pasting the answer here, I think it'd be useful to understand the factors that led GTS to take 8 days to identify the following language from the Baseline Requirements (v1.7.6, Section 4.9.1.1)

The CA SHOULD revoke a certificate within 24 hours and MUST revoke a Certificate within 5 days if one or more of the following occurs:
7. The CA is made aware that the Certificate was not issued in accordance with these Requirements or the CA’s Certificate Policy or Certification Practice Statement;

Because GTS incident reports continue to be significantly lacking in what we expect, here are attempts at addressing several problematic responses ahead of time:

  • If GTS incorrectly believed the timer started from 2021-04-26, it should be aware that this has been repeatedly addressed within the Mozilla community as an unacceptable answer. Any examination, then, of 6 and 7 needs to specifically work to identify those past discussions and incorporate them within the updated incident report. While GTS has provided much explanation as to why their compliance program was insufficient prior to 2021-05-03, its incident reports need to do better in order for us to believe things have improved.
  • Similarly, GTS would need to explain why the incident failed to get the necessary attention from the period it was reported, 2021-04-22, to when it actually led to action, on 2021-04-26, given that the normal expectation for a directly-reported incident report is the CA making a determination within 24 hours. The delay here, in this incident, suggests that GTS does not actually have processes to ensure such a timely determination, and so the burden of evidence rests with GTS in demonstrating exactly how it is capable of that, and why it failed to exercise that ability here.
  • Regardless, GTS needs to meaningfully address why, even after the changes instituted on 2021-05-03, it failed to recognize the report on 2021-05-11 as indicative of a new incident report needing to be filed. The failure to file here suggests things have not substantially improved, and this undermines trust in all of GTS' ongoing remediation efforts.
  • Similarly, given that Comment #0 specifically highlighted this issue, GTS needs to further identify and address why Comment #2 failed to address or acknowledge this concern. If GTS believes Comment #2 was sufficient for addressing that concern, it needs to specifically provide detail about how and why, so that further suggestions can be made to ensure GTS has appropriate controls to prevent such faulty mistakes in the future.

In short: there's a failure to identify that revocation was necessary, a failure to enact that revocation, and a failure to report on the delayed revocation. The root cause is not the decision that was made, but the system and factors that enabled that series of decisions, including the most recent (failure to report).

Flags: needinfo?(fotisl)

Google Trust Services is monitoring this thread and will soon provide a response.

(In reply to Ryan Sleevi from comment #3)

Thank you for your comment and for giving us the opportunity to provide more clarity in our incident report. We expected this incident response to be read in conjunction with the rest of the open incidents, and in particular Bug 1708516 and Bug 1706967; however, we recognize that we should shed some light on this incident response.

(In reply to Fotis Loukos from comment #2)

We believe the root cause of this incident is the delayed identification of the need for reissuance. Had the decision to revoke been taken earlier, we would have revoked the entirety of the certificates within the timeframe imposed by the BRs. 

I'm struggling to see how this is a root cause. This seems indistinguishable from saying "a wrong decision was made" or, put differently, "human error".

Note from https://wiki.mozilla.org/CA/Responding_To_An_Incident

For example, it’s not sufficient to say that “human error” or “lack of training” was a root cause for the incident, nor that “training has been improved” as a solution. While a lack of training may have contributed to the issue, it’s also possible that error-prone tools or practices were required, and making those tools less reliant on training is the correct solution. When training or a process is improved, the CA is expected to provide specific details about the original and corrected material, and specifically detail the changes that were made, and how they tie to the issue. Training alone should not be seen as a sufficient mitigation, and focus should be made on removing error-prone manual steps from the system entirely.

We would like to state that this was not a case of human error. It is our belief that the root cause was a failure of the process, and in particular the same process that led to the incidents described in Bug 1708516 and Bug 1706967.

This is why we have implemented tooling and monitoring to facilitate a process change and ensure that it operates effectively. For example, as a result of the changes presented in Bug 1708516 Comment 27, if this incident were to occur today, the reporting we use in managing the service would make the delay obvious and the automatic escalation mechanisms would be invoked.

Question 6 provides no details or explanation about why revocation was delayed, nor do I see a discussion on Bug 1706967. Similarly, I see no discussion from GTS about the failure to report this incident when it was emphasized to GTS (as an incident) on 2021-05-11, in Bug 1706967, Comment #7.

As mentioned, our prior process did not have sufficient tooling, monitoring and compensating controls to ensure the actions and responses happened in a timely manner. With that said, the creation of an incident report was flagged as an action item. However, meeting the obligations of timely responses for four parallel incidents heavily impacted our capacity, so we deprioritized opening a fifth incident report believing that postponing it would be acceptable. We now recognize that opening a bug and providing the incident report later would have been a better approach to handling this issue.

GTS' reference to Bug 1706967, Comment #11 also leaves much to be desired in the context of this incident report. The failure to analyze the factors here, on this incident, equally leads to a failure to see how those problems are mitigated by the solutions outlined in Bug 1706967, Comment #11. In particular, that comment identifies several changes being made:

  • More automation
  • Better issue tracking
  • Weekly reviews
  • Tools for large-scale revocations
  • Shuffling deckchairs New compliance role

None of these seem to be directly related to this incident and failure. Equally important is that the new compliance role and increased tracking/reviews were completed on 2021-05-03 (Bug 1706967, Comment #11), and yet this bug is evidence that GTS still failed to file an incident report, suggesting that perhaps those controls are insufficient.

As described in Bug 1708516 Comment 27, the automation we have implemented ensures that we track our response to Mozilla incidents like this one via our daily operations processes.

This automation ensures issues reported via Mozilla are more quickly funneled into our operational incident response processes.

As a result of these changes the entire team is involved in this portion of our compliance program. It accomplishes this by ensuring that existing automation, monitoring, reporting and escalation frameworks are used.

To give this more context, the internal issue management system requires first responders to make a number of triaging decisions, including (1) the severity of the issue, (2) the deadline by which it has to be resolved and (3) who is driving the resolution.

The queue of cases is monitored by a wider team consisting of representatives from engineering and compliance who can double-check the determinations made by the first responders. This "second line of defense" will help detect incorrect triaging decisions.

Additionally, cases that have not been resolved in time are escalated automatically.

This integrates these scenarios into the same process we use to maintain our high level of availability.
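To make the triage record and the overdue check more concrete, the following Python sketch models a case with the three triaging decisions and selects the cases that would be escalated. It is an illustration under assumptions only: the `Case` fields, the `overdue_cases` helper, the owner names and the sample deadlines are all hypothetical and do not describe the actual internal system.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Case:
    """One reported issue with the three triage decisions attached."""
    title: str
    severity: str                          # (1) severity of the issue
    resolve_by: datetime                   # (2) when it has to be resolved
    owner: str                             # (3) who drives the resolution
    resolved_at: Optional[datetime] = None


def overdue_cases(queue: List[Case], now: datetime) -> List[Case]:
    """Return unresolved cases whose deadline has passed (escalation candidates)."""
    return [c for c in queue if c.resolved_at is None and now > c.resolve_by]


# Hypothetical queue; titles, owners and deadlines are made up for the example.
queue = [
    Case("Delayed revocation report", "high",
         datetime(2021, 4, 27, 12, 31, tzinfo=timezone.utc), "compliance-oncall"),
    Case("CPS wording update", "low",
         datetime(2021, 6, 1, 0, 0, tzinfo=timezone.utc), "policy-team"),
]
now = datetime(2021, 4, 30, 15, 30, tzinfo=timezone.utc)
escalate = overdue_cases(queue, now)
print([c.title for c in escalate])
```

Keeping the deadline on the case itself means the escalation check does not depend on anyone remembering the applicable timeframe.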

Furthermore, we do not consider creating two new compliance roles to be shuffling of the deckchairs. As we have previously stated, this is part of restructuring our compliance program within the larger effort of revisiting how we approach compliance as a whole.

Without reference to Bug 1706967, and without simply copy/pasting the answer here, I think it'd be useful to understand the factors that led GTS to take 8 days to identify the following language from the Baseline Requirements (v1.7.6, Section 4.9.1.1)

The CA SHOULD revoke a certificate within 24 hours and MUST revoke a Certificate within 5 days if one or more of the following occurs:
7. The CA is made aware that the Certificate was not issued in accordance with these Requirements or the CA’s Certificate Policy or Certification Practice Statement;

We believe that the answers provided in the previous paragraphs should address your concerns. To restate, it was not a human error, but a process failure.

Because GTS incident reports continue to be significantly lacking in what we expect, here are attempts at addressing several problematic responses ahead of time:

  • If GTS incorrectly believed the timer started from 2021-04-26, it should be aware that this has been repeatedly addressed within the Mozilla community as an unacceptable answer. Any examination, then, of 6 and 7 needs to specifically work to identify those past discussions and incorporate them within the updated incident report. While GTS has provided much explanation as to why their compliance program was insufficient prior to 2021-05-03, its incident reports need to do better in order for us to believe things have improved.

We would like to clarify that this is not the case and that it is clear to us that the language in BRs Section 4.9.1.1 in combination with BRs Section 4.9.5 specifies that the timer started from 2021-04-22.

  • Similarly, GTS would need to explain why the incident failed to get the necessary attention from the period it was reported, 2021-04-22, to when it actually led to action, on 2021-04-26, given that the normal expectation for a directly-reported incident report is the CA making a determination within 24 hours. The delay here, in this incident, suggests that GTS does not actually have processes to ensure such a timely determination, and so the burden of evidence rests with GTS in demonstrating exactly how it is capable of that, and why it failed to exercise that ability here.
  • Regardless, GTS needs to meaningfully address why, even after the changes instituted on 2021-05-03, it failed to recognize the report on 2021-05-11 as indicative of a new incident report needing to be filed. The failure to file here suggests things have not substantially improved, and this undermines trust in all of GTS' ongoing remediation efforts.
  • Similarly, given that Comment #0 specifically highlighted this issue, GTS needs to further identify and address why Comment #2 failed to address or acknowledge this concern. If GTS believes Comment #2 was sufficient for addressing that concern, it needs to specifically provide detail about how and why, so that further suggestions can be made to ensure GTS has appropriate controls to prevent such faulty mistakes in the future.

In short: there's a failure to identify that revocation was necessary, a failure to enact that revocation, and a failure to report on the delayed revocation. The root cause is not the decision that was made, but the system and factors that enabled that series of decisions, including the most recent (failure to report).

We hope that we have answered your questions and will be happy to provide further clarifications if needed. If you feel that additional clarifications are needed, we would be happy to provide you with a follow-up response.

Flags: needinfo?(fotisl)

(In reply to Fotis Loukos from comment #5)

(In reply to Fotis Loukos from comment #2)

We believe the root cause of this incident is the delayed identification of the need for reissuance. Had the decision to revoke been taken earlier, we would have revoked the entirety of the certificates within the timeframe imposed by the BRs. 
We would like to state that this was not a case of human error. It is our belief that the root cause was a failure of the process, and in particular the same process that led to the incidents described in Bug 1708516 and Bug 1706967.

Please explain how the statement "delayed identification of the need for reissuance" is a "failure of the process".

This is why we have implemented tooling and monitoring to facilitate a process change and ensure that it operates effectively. For example, as a result of the changes presented in Bug 1708516 Comment 27, if this incident were to occur today, the reporting we use in managing the service would make the delay obvious and the automatic escalation mechanisms would be invoked.

Please understand that, given the set of incidents with GTS, this is not an ideal response. Two weeks ago, in Bug 1715672, Comment #2, I highlighted the importance of "Show, don't tell". The problem here is that there's not sufficient detail to see how or why GTS believes this is the case: the link between "the issue" and "the fix" is non-obvious, which Comment #3 was touching on.

As mentioned, our prior process did not have sufficient tooling, monitoring and compensating controls to ensure the actions and responses happened in a timely manner. With that said, the creation of an incident report was flagged as an action item. However, meeting the obligations of timely responses for four parallel incidents heavily impacted our capacity, so we deprioritized opening a fifth incident report believing that postponing it would be acceptable. We now recognize that opening a bug and providing the incident report later would have been a better approach to handling this issue.

This is not encouraging, because GTS has consistently shown extremely poor judgement, especially in unilateral decisions. There are literally two different incident reports discussing this (Bug 1709223, Bug 1708516). As part of these incidents, GTS has been reminded repeatedly about https://wiki.mozilla.org/CA/Responding_To_An_Incident, which includes:

Each incident should result in an incident report, written as soon as the problem is fully diagnosed and (temporary or permanent) measures have been put in place to make sure it will not re-occur. If the permanent fix is going to take significant time to implement, you should not wait until this is done before issuing the report. We expect to see incident reports as soon as possible, and certainly within two weeks of the initial issue report.

Similarly, for Chrome, http://g.co/chrome/root-policy states:

When a CA becomes aware of or suspects an incident, they should notify chrome-root-authority-program@google.com with a description of the incident. If the CA has publicly disclosed this incident, this notification should include a link to the disclosure. If the CA has not yet disclosed this incident, this notification should include an initial timeline for public disclosure. Chrome uses the information on the public disclosure as the basis for evaluating incidents.

So this is where I'm trying to square away a better understanding as to how GTS decided to delay the report, effectively indefinitely (but at least a month between Bug 1706967, Comment #4 and me finally filing this issue).

Your response suggests that the newly implemented (as of 2021-06-14) process would have prevented such lapses in judgement, but it seems to focus on the specific poor judgement call made, without any discussion about the poor judgement call itself: to decide not to file an issue because you were busy with other issues, despite having just assigned someone to manage and respond to issues and compliance.

Furthermore, we do not consider creating two new compliance roles to be shuffling of the deckchairs. As we have previously stated, this is part of restructuring our compliance program within the larger effort of revisiting how we approach compliance as a whole.
We believe that the answers provided in the previous paragraphs should address your concerns. To restate, it was not a human error, but a process failure.

This bug fairly accurately captures that this process is still failing to address the set of concerns raised that led to the creation of this role in the first place. It further demonstrates the pattern of poor judgement, specifically: "we deprioritized opening a fifth incident report believing that postponing it would be acceptable", and the response fails to examine the process and systems that led to that determination.

Automation cannot address the "believing that postponing it would be acceptable": that's a failure of systemic processes and understanding.

We hope that we have answered your questions and will be happy to provide further clarifications if needed. If you feel that additional clarifications are needed, we would be happy to provide you with a follow-up response.

Unfortunately, it appears quite a bit was overlooked in terms of understanding the expectations, and this remains deeply concerning. I'm not sure how best to help GTS reset their expectations, but it fundamentally appears that GTS is struggling to understand the baseline expectations placed on all CAs (as evidenced by "believing that postponing it would be acceptable"), and similarly failing to recognize the good-faith, repeated efforts to highlight the deeper issues at play here, and that it explicitly dodged/failed to answer the questions that hopefully would have led to such an awareness.

For example, in Comment #3, I tried to highlight the "show, don't tell" importance in the last bullet point, but I find myself having to repeat it again here in this comment. Similarly, the lack of reply to the second-to-last bullet point is reflected in much of the substance in this reply, highlighting, again, the failure of judgement, not just automation.

I'm hoping this can lead to a deeper evaluation here, and an updated response.

Flags: needinfo?(fotisl)

Google Trust Services is monitoring this thread and will soon provide a response.

Google Trust Services is monitoring this thread and will provide a response soon.

It has been 18 days since Comment #6 asked a question.

Ryan, thank you for your comment. At the moment we are preparing a response for Bug 1708516 Comment 35, which will be helpful in providing more context for the larger set of changes we have implemented and which address the concerns for the current bug as well. After that response is sent, we will focus on providing a complementary response for this bug as well.

Flags: needinfo?(fotisl)

We are monitoring this thread for any additional updates or questions.

Ryan, as mentioned in Comment #10 we will be providing an answer to Bug 1708516 Comment 35 and come back with a complementary answer to your questions.

Google Trust Services is monitoring this thread for any additional updates or questions.

Please explain how the statement "delayed identification of the need for reissuance" is a "failure of the process".

As covered in Bug 1708516 (comment 44), we've made a number of changes and recapped this event there. In this case specifically, the relevant incident notification was not triaged correctly. The process failed because it did not enforce that the triaging action was reviewed and signed off on by a larger portion of the team.

Please understand that, given the set of incidents with GTS, this is not an ideal response. Two weeks ago, in Bug 1715672, Comment #2, I highlighted the importance of "Show, don't tell". The problem here is that there's not sufficient detail to see how or why GTS believes this is the case: the link between "the issue" and "the fix" is non-obvious, which Comment #3 was touching on.

Automated escalation was missing. The fix was to enforce escalation through tooling.

So this is where I'm trying to square away a better understanding as to how GTS decided to delay the report, effectively indefinitely (but at least a month between Bug 1706967, Comment #4 and me finally filing this issue).

Your response suggests that the newly implemented (as of 2021-06-14) process would have prevented such lapses in judgement, but it seems to focus on the specific poor judgement call made, without any discussion about the poor judgement call itself: to decide not to file an issue because you were busy with other issues, despite having just assigned someone to manage and respond to issues and compliance.

[...]

Automation cannot address the "believing that postponing it would be acceptable": that's a failure of systemic processes and understanding.

The new process was proposed in response to Bug 1706967. Its definition began shortly afterwards and its implementation was completed shortly after you filed this incident bug.

While this new process was being designed and implemented by our new Compliance Engineering function, we focused on the incident reports that we were using to keep the Web PKI community informed of our ongoing technical remediation efforts.

Automation cannot fix this issue alone. Multiple changes were made to have defence in depth. We have added additional rigor to the overall triaging process. All triaging decisions are recorded in a consistent manner and reviewed by a larger portion of the team. This reduces the risk that mistakes occur. The automation we developed provides a mechanism to automatically escalate issues nearing an established SLO. This enables a wider group of stakeholders to intervene if required, and prevent missed deadlines.
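The two controls described here, review of triaging decisions by more than one person and escalation of issues nearing an SLO, could be sketched roughly as follows. This is an assumption-laden illustration rather than the actual tooling: the function names, the minimum of two reviewers, the 80% warning threshold and the 24-hour SLO in the usage lines are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def triage_is_signed_off(reviewers: set, minimum_reviewers: int = 2) -> bool:
    """A triage decision counts as final only once enough distinct people reviewed it."""
    return len(reviewers) >= minimum_reviewers


def should_escalate(opened: datetime, slo: timedelta, now: datetime,
                    warn_fraction: float = 0.8) -> bool:
    """Escalate once the elapsed time reaches a fraction of the SLO, before it is missed."""
    return (now - opened) >= slo * warn_fraction


# Hypothetical usage with a 24-hour SLO starting at the Bug 1706967 filing time.
opened = datetime(2021, 4, 22, 12, 31, tzinfo=timezone.utc)
slo = timedelta(hours=24)
early = should_escalate(opened, slo, datetime(2021, 4, 22, 20, 0, tzinfo=timezone.utc))
late = should_escalate(opened, slo, datetime(2021, 4, 23, 8, 0, tzinfo=timezone.utc))
print(early, late)
```

Escalating before the deadline, rather than after it, is what gives a wider group of stakeholders time to intervene.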

Unfortunately, it appears quite a bit was overlooked in terms of understanding the expectations, and this remains deeply concerning. I'm not sure how best to help GTS reset their expectations, but it fundamentally appears that GTS is struggling to understand the baseline expectations placed on all CAs (as evidenced by "believing that postponing it would be acceptable"), and similarly failing to recognize the good-faith, repeated efforts to highlight the deeper issues at play here, and that it explicitly dodged/failed to answer the questions that hopefully would have led to such an awareness.

For example, in Comment #3, I tried to highlight the "show, don't tell" importance in the last bullet point, but I find myself having to repeat it again here in this comment. Similarly, the lack of reply to the second-to-last bullet point is reflected in much of the substance in this reply, highlighting, again, the failure of judgement, not just automation.

I'm hoping this can lead to a deeper evaluation here, and an updated response.

We appreciate your input and your endeavors to help us improve our incident response process. A deeper evaluation of our process changes can be found in our response to Bug 1708516. Given the changes we implemented, we hope that the Web PKI community will notice an improvement in the timeliness of our responses and updates.

Google Trust Services is monitoring this thread for any additional updates or questions.

We are monitoring this thread for any additional updates or questions.

We are monitoring this thread for any additional updates or questions.

Google Trust Services is monitoring this bug for any additional updates or questions.

It seems to me that this bug can be closed. I'll schedule it for closure next Wednesday, 25-Aug-2021.

Flags: needinfo?(bwilson)
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Flags: needinfo?(bwilson)
Resolution: --- → FIXED
Product: NSS → CA Program
Whiteboard: [ca-compliance] [delayed-revocation-leaf] → [ca-compliance] [leaf-revocation-delay]