Open Bug 1708516 Opened 3 months ago Updated 2 days ago

Google Trust Services: Failure to provide regular and timely incident updates

Categories

(NSS :: CA Certificate Compliance, task)

Tracking

(Not tracked)

ASSIGNED

People

(Reporter: agwa-bugs, Assigned: awarner)

Details

(Whiteboard: [ca-compliance])

Google Trust Services has repeatedly failed to provide timely responses to incident reports, or failed to provide responses altogether, as exemplified by the following:

This is a violation of Mozilla's requirements (https://wiki.mozilla.org/CA/Responding_To_An_Incident#Keeping_Us_Informed), and as with other CAs (e.g. Bug 1563579, Bug 1572992), warrants its own incident report. This incident report should explain the root cause of the non-response/delays and what steps Google Trust Services is taking to prevent this in the future.

Assignee: bwilson → awarner
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Whiteboard: [ca-compliance]

An incident report is being prepared with full details.

1. How your CA first became aware of the problem

Though we are responding to Bug 1708516 in this incident, which calls out that we have been slow to respond to incident reports and public communications, this is unfortunately an issue we are aware of. The problem began a little less than a year ago.

2. A timeline of the actions your CA took in response.

YYYY-MM-DD (UTC) Description
2020-05-01 10 months ago [1] we detected a certificate profile issue in a root certificate and created an issue to track the resolution
2020-11-18 5 months ago [2] a vendor notified us of a defect in their software that impacted compliance
2021-04-20 13 days ago in m.d.s.p a response was requested [3] but was not responded to in a timely manner
2021-04-22 10 days ago [4] an incident response was requested and not posted in a timely manner
2021-04-29 Bug 1708516 [5] is filed raising concerns over our slow responses to recent incident reports and public communications.
2021-05-01 GTS shares this incident report

3. Whether your CA has stopped, or has not yet stopped, certificate issuance or the process giving rise to the problem or incident.

N/A. Not related to certificate issuance.

4. In a case involving certificates, a summary of the problematic certificates. For each problem: the number of certificates, and the date the first and last certificates with that problem were issued. In other incidents that do not involve enumerating the affected certificates (e.g. OCSP failures, audit findings, delayed responses, etc.)

N/A. Not related to certificate issuance.

5. In a case involving certificates, the complete certificate data for the problematic certificates.

N/A. Not related to certificate issuance.

6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

We have built our processes for interacting with the forums around a small set of individuals. This is in support of corporate policies that discourage posting to public forums on behalf of the company without approval. This makes our response processes dependent on the availability of those individuals.

The nature of public interactions is such that they are infrequent and difficult to pre-allocate time for. As these individuals take personal/medical leave, go on vacation or get allocated additional work, we lack the immediate resources to respond in a timely manner.

When Google Trust Services launched we had a single individual and a fall back in place for handling public interactions. As the primary individual became busier the secondary took on the primary responsibility for handling the public communications but we did not increase the size of this group to ensure there would always be a reliable backup.

Additionally our processes around tracking these interactions have largely been manual in nature which meant that delays introduced by the availability of these individuals could delay technical responses.

7. List of steps your CA is taking to resolve the situation and ensure that such situation or incident will not be repeated in the future, accompanied with a binding timeline of when your CA expects to accomplish each of these remediation steps.

To improve the timeliness of interactions we have scheduled work to create automation that watches for updates to the Baseline requirements, Mozilla tickets and m.d.s.p. This will decrease the dependency on these individuals always being available.

We have also defined a new engineering role and assigned an experienced engineer to participate in the associated communications on-call rotation. This will provide additional resources so that the rotation is not understaffed when others are unavailable.

Additionally, we have planned to offer training to more of the Google Trust Services organization so more individuals are aware of the processes and constraints associated with public participation.

We believe that this combination of changes will address the concern wholistically.

(In reply to ryan_hurst from comment #2)

Thanks for the update

1. How your CA first became aware of the problem

Though we are responding to Bug 1708516 in this incident, which calls out that we have been slow to respond to incident reports and public communications, this is unfortunately an issue we are aware of. The problem began a little less than a year ago.

I would argue that "a little less than a year ago" is false if we're talking about the cause. For further reading, see section 6.

2. A timeline of the actions your CA took in response.

YYYY-MM-DD (UTC) Description
2020-05-01 10 months ago [1] we detected a certificate profile issue in a root certificate and created an issue to track the resolution
2020-11-18 5 months ago [2] a vendor notified us of a defect in their software that impacted compliance
2021-04-20 13 days ago in m.d.s.p a response was requested [3] but was not responded to in a timely manner
2021-04-22 10 days ago [4] an incident response was requested and not posted in a timely manner
2021-04-29 Bug 1708516 [5] is filed raising concerns over our slow responses to recent incident reports and public communications.
2021-05-01 GTS shares this incident report

This section was already out-of-date and contained incorrect data when it was posted on 2021-05-04 Anywhere on Earth. What I presume are footnotes ([1], [2], [3], [4] and [5]) are also undefined, and the dates fail to mention any actions taken by Google Trust Services LLC except for posting this report.

[skip]

6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

We have built our processes for interacting with the forums around a small set of individuals. This is in support of corporate policies that discourage posting to public forums on behalf of the company without approval. This makes our response processes dependent on the availability of those individuals.

I believe that if you mention this as a cause, then this should be added as a relevant (roughly) date-stamped moment at section 2.

Also, I fail to understand how you went from "Don't post to public forums without prior approval" (policy) to "only a select few may post anything" (status).

Additionally: "Only a select few are responsible for handling issues" was also the root cause of Bug 1563579, and can also be seen as a root cause of Bug 1572992 (and also other delays in CA communications on this forum). Could you explain why this problem wasn't considered as a problem earlier?

The nature of public interactions is such that they are infrequent and difficult to pre-allocate time for. As these individuals take personal/medical leave, go on vacation or get allocated additional work, we lack the immediate resources to respond in a timely manner.

Please note that this issue has been going on for over 10 months. You've failed to provide timely updates for over 10 months. I fail to see how personal/medical leave, vacation and additional work could result in delays lasting more than 10 months.

When Google Trust Services launched we had a single individual and a fall back in place for handling public interactions. As the primary individual became busier the secondary took on the primary responsibility for handling the public communications but we did not increase the size of this group to ensure there would always be a reliable backup.

I can't parse that last sentence. If your primary individual is not available most of the time and is delegated to the secondary individual, there is no reliable backup for the secondary individual.

Additionally our processes around tracking these interactions have largely been manual in nature which meant that delays introduced by the availability of these individuals could delay technical responses.

7. List of steps your CA is taking to resolve the situation and ensure that such situation or incident will not be repeated in the future, accompanied with a binding timeline of when your CA expects to accomplish each of these remediation steps.

To improve the timeliness of interactions we have scheduled work to create automation that watches for updates to the Baseline requirements, Mozilla tickets and m.d.s.p. This will decrease the dependency on these individuals always being available.

I do not believe that the "provide regular updates" clause is fulfilled when you implement a bot user that posts a message every 7 days. As such, could you further explain what the goal of this automation is? Additionally, do you have an expected timeframe when you will start using this tool?

We have also defined a new engineering role and assigned an experienced engineer to participate in the associated communications on-call rotation. This will provide additional resources so that the rotation is not understaffed when others are unavailable.

Additionally, we have planned to offer training to more of the Google Trust Services organization so more individuals are aware of the processes and constraints associated with public participation.

We believe that this combination of changes will address the concern wholistically.

I fail to see how Google Trust Services LLC would guarantee that this issue won't happen again, when they won't provide supervision for the individuals that are supposed to handle the updates on MDSP, Bugzilla and the CA/B Forum. As seen before (see this bug, Bug 1563579 and others), individuals make mistakes, and these changes do not detail any structural changes in GTS that I believe to be sufficient to wholistically solve the issue. Allocating more individuals does not fix the issue of individuals failing to respond, it only makes it less likely to happen; throwing more dice doesn't guarantee you'll roll at least one 6.
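The dice analogy can be made concrete with a short calculation. This is a minimal sketch that assumes each responder acts independently and with the same chance of success, which is a simplification and not something stated in the incident report. The helper name `p_at_least_one_success` is hypothetical.

```python
from fractions import Fraction


def p_at_least_one_success(n: int, p: Fraction) -> Fraction:
    """Probability that at least one of n independent tries succeeds.

    Assumes every try succeeds with the same probability p and that the
    tries do not influence each other (an illustrative simplification).
    """
    return 1 - (1 - p) ** n


if __name__ == "__main__":
    # Rolling more dice raises the chance of seeing a 6, but never to 1.
    for n in (1, 2, 4, 8):
        prob = p_at_least_one_success(n, Fraction(1, 6))
        print(f"{n} dice: P(at least one 6) = {float(prob):.3f}")
```

Under these assumptions the probability approaches 1 as the number of dice grows but never reaches it, which is the point of the analogy: adding people lowers the risk of a missed response without removing it.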

I share Matthias' concerns in Comment 3 about the low quality of this incident report. In addition to the factual error and undefined footnotes in the timeline, it seems woefully incomplete. For instance, it fails to mention Bug 1532842, which is an earlier and additional example of non-responsiveness by GTS.

Note that in Bug 1532842 Comment 10, dated 2019-07-15, GTS stated:

Sorry we failed to provide full information previously. We had an internal communication breakdown, which caused the responses to be delayed and incomplete. We have added a recurring sync to ensure that all open items are reviewed, discussed and closed out within specified timelines.

Could GTS please describe in detail the "recurring sync" that was added in 2019, analyze why it failed to prevent subsequent issues, and explain how the newly proposed automation will address the shortcomings of the previous system?

Additionally, note that Bug 1634795 Comment 2, dated 2020-05-20, said:

Sorry, I'm out on leave and I thought another team member was going to cover updates, but it looks like we may have gotten our wires crossed on who was handling follow-ups

Thus, GTS has been aware for almost a year now that leave is a cause of delayed responses. It's concerning to see it being blamed yet again for delays. Could GTS please describe what steps they've been taking during this time frame to eliminate leave as a cause for delayed responses?

Finally, as Matthias points out, having too few people responsible for incidents was also a cause of Bug 1563579, and I echo Matthias' question about why GTS did not previously consider this problem.

Flags: needinfo?(rmh)

I apologize for the loss of the links. I did not notice they were stripped when pasting in the incident report into Bugzilla.

The links in question are [1][2][3][4][5].

It seems I was not as clear as I had hoped in my earlier explanation. Like most large organizations Google has processes for speaking on behalf of the organization. This is not a prohibition, which is why I used the word “discouraged” in the above text.

Our mistake was that we were too late to adjust the number of people on the rotation when the availability of those in the rotations ebbed and flowed.

Our solution to this is twofold:

  • By increasing the size of the pool we have more resource elasticity.
  • By automating the monitoring of forums and automatically tracking the status of interactions in our internal case management system we will ensure all interactions are tracked in the issue management processes of our wider operations team. This will give more individuals the ability to escalate Bugzilla bugs and forum discussions if they see a risk that we are running late.

As for why this was not detected earlier. The nature of personal/medical/vacation leave, as well as the volume of work each individual in the on call rotation are subject to, is difficult to predict and quantify as changes tend to be subtle and unexpected when they occur. Our failure, again as acknowledged, was not having sufficient resource elasticity in that rotation to accommodate that reality or having monitoring to ensure communication timelines are met.

Our recurring compliance reviews were initially set to occur every other week and required a quorum of at least 2 CA engineers and 2 policy authority members. Product management representatives are also invited, but not required. In cases where we had back to back sessions with less than 4 people for quorum, we would allow a session with 3 people and at least one person each from engineering and policy. Starting in March 2021, these reviews were moved to weekly.
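The quorum rule described above can be sketched as a small check. This is only an illustration under stated assumptions: "back to back" is read as two consecutive sessions that lacked quorum, the three people in the relaxed rule are counted among engineers and policy members only, and the names `Attendance` and `has_quorum` are hypothetical rather than part of any GTS tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attendance:
    ca_engineers: int
    policy_members: int
    product_managers: int = 0  # invited, but not required for quorum


def has_quorum(att: Attendance, prior_failed_sessions: int = 0) -> bool:
    """Return True if a compliance review session may proceed.

    Standard rule: at least 2 CA engineers and 2 policy authority members.
    Relaxed rule (assumption: applies after 2 consecutive sessions without
    quorum): 3 people, with at least one each from engineering and policy.
    """
    if att.ca_engineers >= 2 and att.policy_members >= 2:
        return True
    if prior_failed_sessions >= 2:
        counted = att.ca_engineers + att.policy_members
        return counted >= 3 and att.ca_engineers >= 1 and att.policy_members >= 1
    return False
```

For example, two engineers and one policy member would not meet quorum on a first attempt, but would under the relaxed rule after two failed sessions.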

To be clear, we have no intent to automate providing updates. This is why in the above text I used the word “watches” to describe what the automation is intended to do. It will give more individuals across our teams visibility on pending issues that require our response.

The goal is to use the automation to incorporate the tracking of interactions in the same way we track activities for other portions of our operations rather than having it be a separate activity fully dependent on individual availability.

To be honest, when it comes to manual processes, which are dependent on people, it is impossible for any organization to offer a reliable guarantee that a process will never fail. What we can, and will do, with the above mitigations, is to improve our processes by introducing compensating controls (such as automation to monitor response times) and increased staffing to reduce the likelihood of future communication delays.

(In reply to ryan_hurst from comment #5)

I apologize for the loss of the links. I did not notice they were stripped when pasting in the incident report into Bugzilla.

The links in question are [1][2][3][4][5].

Thanks!

It seems I was not as clear as I had hoped in my earlier explanation. Like most large organizations Google has processes for speaking on behalf of the organization. This is not a prohibition, which is why I used the word “discouraged” in the above text.

Our mistake was that we were too late to adjust the number of people on the rotation when the availability of those in the rotations ebbed and flowed.

Our solution to this is twofold:

  • By increasing the size of the pool we have more resource elasticity.
  • By automating the monitoring of forums and automatically tracking the status of interactions in our internal case management system we will ensure all interactions are tracked in the issue management processes of our wider operations team. This will give more individuals the ability to escalate Bugzilla bugs and forum discussions if they see a risk that we are running late.

How are you going to prevent the issue of "we were too late to adjust the number of people on the rotation" from occurring again?

As for why this was not detected earlier. The nature of personal/medical [...] leave [...] is difficult to predict and quantify as changes tend to be subtle and unexpected when they occur

Both medical and personal leave can indeed be unexpected, but I would hardly call the changes "subtle", and they are highly unlikely to stay unexpected whilst lasting for months. Unless you're easily surprised, that is.

As for why this was not detected earlier. The nature of [...] vacation leave [...] is difficult to predict and quantify as changes tend to be subtle and unexpected when they occur.

I am shocked that a CA finds vacation leaves 'unpredictable', and would argue that the changes here are also not "subtle". Could you explain how vacation leave is not planned in advance and/or discussed with the relevant teams? This seems to be a critical oversight in guarantee of the day-to-day operations of a CA; how can you run a publicly trusted CA without knowing when your employees are going on holiday? Can all your employees just go on holiday without prior coordination with management and/or the rest of the teams they're working with?

The nature of [...] the volume of work each individual in the on call rotation are subject to is difficult to predict and quantify as changes tend to be subtle and unexpected when they occur.

Isn't this why you would have a management that manages what volume of work each individual is subject to when on the call rotation, so that the changes are known and not unexpected?

Our failure, again as acknowledged, was not having sufficient resource elasticity in that rotation to accommodate that reality or having monitoring to ensure communication timelines are met.

Our recurring compliance reviews were initially set to occur every other week and required a quorum of at least 2 CA engineers and 2 policy authority members.

If the goal of these compliance reviews was to prevent issues like what this issue is about, then 'every other week' was bound to fail. You're expected to provide regular (weekly) updates, and every other week leaves a lot of time on the table to not detect a failure to respond within the required week.

Product management representatives are also invited, but not required. In cases where we had back to back sessions with less than 4 people for quorum, we would allow a session with 3 people and at least one person each from engineering and policy. Starting in March 2021, these reviews were moved to weekly.

And yet, despite this weekly review, here we are with several issues (some recent) that did not receive the expected regular updates.

To be clear, we have no intent to automate providing updates. This is why in the above text I used the word “watches” to describe what the automation is intended to do. It will give more individuals across our teams visibility on pending issues that require our response.

The goal is to use the automation to incorporate the tracking of interactions in the same way we track activities for other portions of our operations rather than having it be a separate activity fully dependent on individual availability.

That sounds like a good start, but I see no significant structural change being proposed. Yes, some more alerting, but other than that your "solution" is reported as a group of individuals, without supervision of any form of management to speak of, being given the option to meet with that one engineer, and each given the understanding that the community needs to be communicated with. With the response as written, I can only see this issue recurring.

To be honest, when it comes to manual processes, which are dependent on people, it is impossible for any organization to offer a reliable guarantee that a process will never fail. What we can, and will do, with the above mitigations, is to improve our processes by introducing compensating controls (such as automation to monitor response times) and increased staffing to reduce the likelihood of future communication delays.

What I'm missing most in the steps you're taking to resolve this issue is something in line with the actions taken by Sectigo, as detailed in Bug 1563579 Comment 25, actions named A through H. Of those, I can only detect a trace of their actions A, C and F in your report, whereas I was at the very least also expecting something along the lines of their actions B, E and H being mentioned as part of your workflow.

It seems to me the upper management of Google Trust Services does not seem to want to take the responsibilities that are expected from a CA that is included in the Mozilla Root Store, but seems to push the parts of these responsibilities down to the engineers, who may or may not be working because (paraphrased) "their holidays are like thunder at a blue sky", does not seem to learn from other mistakes, and believes that adding resources will make the problem go away.

5 days have passed with no response to the questions posed in Comment 4. Per https://wiki.mozilla.org/CA/Responding_To_An_Incident GTS is required to "respond promptly to questions that are asked". Although the policy permits a week to respond, this is an upper bound, and leaving questions unanswered for 5 days does not give the impression that GTS is taking this incident seriously.

How are you going to prevent the issue of "we were too late to adjust the number of people on the rotation" from occurring again?
We believe that by automating incident tracking and monitoring our response time we will be able to more readily detect when delays are occurring and make the appropriate accommodations.
We also believe that by over provisioning the rotation we will reduce the chance of this occurring again.

Could you explain how vacation leave is not planned in advance and/or discussed with the relevant teams?
We did not say that vacation is not planned in advance. All we said is that there are many reasons why individuals may take both planned and unplanned leave or become busy with other obligations and the nature of that makes it difficult to predict when all of these classes of events occur at the same time. We believe that the improvements we have proposed will enable us to better accommodate those realities in the future.

You're expected to provide regular (weekly) updates, and every other week leaves a lot of time on the table to not detect a failure to respond within the required week.
I should have been clearer on the role of the associated meeting. It exists to ensure issues that occur in the ecosystem are reviewed for relevance to operations, and to ensure emerging changes to policy are accommodated in our plans. It does not replace the community on call rotation or incident response, which will be responsible for providing weekly responses.

Flags: needinfo?(rmh)

5 days have passed with no response to the questions posed in Comment 4. Per https://wiki.mozilla.org/CA/Responding_To_An_Incident GTS is required to "respond promptly to questions that are asked". Although the policy permits a week to respond, this is an upper bound, and leaving questions unanswered for 5 days does not give the impression that GTS is taking this incident seriously.

According to Bugzilla it has been 4 days, 5 hours since Matthias posted comment 6.

According to Bugzilla it has been 4 days, 5 hours since Matthias posted comment 6.

I said Comment 4, not Comment 6. Comment 4 posed questions that have not yet been answered.

Flags: needinfo?(rmh)

Just noting that the Incident Report in Comment 2 objectively does not yet meet the minimum expectations of all CAs, as required in https://wiki.mozilla.org/CA/Responding_To_An_Incident as well. Given this, and the concerns raised in Comment 3 and the unanswered concerns in Comment 4 and Comment 6, it may be better for GTS to submit a new report (as a new comment, rather than editing Comment 2 in place), keeping in mind the concerns raised in GTS’ other open incidents about the nature and quality of its incident reports.

I was attempting to address the questions outlined in both comment 3 and comment 4 at the same time. I am sorry that you feel the questions in comment 4 were not sufficiently answered. I will review them again and try to address that.

Flags: needinfo?(rmh)

I will also take the action item to re-review the report provided, try to incorporate the additional detail provided in our other responses, and better address the questions in comment 4.

I will review them again and try to address that.

Thank you.

Resetting the needinfo to track that there are still outstanding questions.

Flags: needinfo?(rmh)

We'd like to thank the community for their diligence, comments, and suggestions. As requested, GTS is preparing a new report to submit and expects that this will sufficiently address all parties' concerns.

Note that 7 days have passed in Bug 1709223 without either a progress update from GTS or an answer to Bug 1709223 Comment 11.

Additionally, it has now been 11 days with no answers to the questions in Comment 4. I realize GTS is preparing a new incident report for this bug that will presumably address those questions, and it's certainly better to take the time to prepare a good answer than to rush it. However, https://wiki.mozilla.org/CA/Responding_To_An_Incident does require CAs to at least provide a date by which an answer will be delivered. Could GTS please provide the dates when the redone incident reports can be expected?

Flags: needinfo?(doughornyak)

1. How your CA first became aware of the problem

A member of the Mozilla forum filed Bug 1708516 pointing out that GTS communications on some incidents have failed to meet the requirements of the Mozilla Root Store Policy.

2. A timeline of the actions your CA took in response.

Analysis
2019-03-05 Incident 1532842 filed. Issue was open for 4 months. There were 10 interactions and GTS made a total of 7 posts. The incident report contained an action plan for completion by 2019-03-31. The Next-update flag in Bugzilla was set to 2019-04-01. Our 2019-03-06 update confirmed completion of the last action item. When a follow-up question was asked on 2019-06-19, we responded after 3 weeks because the response was overlooked.
2020-05-01 Incident 1634795 filed. Issue was open for 3 months. There were a total of 7 interactions and GTS provided 6 updates. Of these, one remediation status update was not provided for 20 days.
2020-07-13 Incident 1652581 filed. Issue was open for 10 months and had 33 interactions with 12 posts from GTS. On 2020-09-30 a post was made whose language was not interpreted as a question for GTS; however, the needinfo flag was set and assigned to an individual on the GTS team. On 2020-12-08, Mozilla commented that the bug would be closed on 2020-12-11. We failed to track that the needinfo flag was set to a GTS representative, as there did not appear to be a clear question directed at GTS, so we failed to respond.
2020-11-18 Incident 1678183 filed. Issue was open for 6 months, had 9 interactions and 5 posts from GTS. It was opened 2020-11-18 with no action items. The first question was asked 2021-01-20 and responded to on 2021-01-26. No questions were asked after our 2021-04-02 post. We did not, however, provide weekly updates indicating no new information was available after this date. On 2021-04-13 we asked if further updates were desired. The same day, Mozilla set the needinfo flag to the root store manager and scheduled the bug for closure on 2021-04-16. The bug was closed on 2021-04-19.
2021-04-22 Incident 1706967 filed. This issue was open for 1 month, had 11 interactions and 8 posts from GTS. The bug was filed and acknowledged on 2021-04-22. After 8 days, 1708516 was opened indicating that GTS had not responded within 7 days. The next update was posted on 2021-05-01.
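The response gaps in this analysis can be checked with simple date arithmetic. The following is a minimal sketch, not GTS tooling. The 7-day limit reflects the weekly update expectation discussed in this bug. The 2019-07-15 response date is an assumption taken from the date of Bug 1532842 Comment 10 quoted earlier in this thread, and the 1706967 dates are taken as 2021-04-22 and 2021-05-01, matching that row's filing date. `EVENTS`, `gap_days` and `late_events` are hypothetical names.

```python
from datetime import date

MAX_GAP_DAYS = 7  # weekly update expectation discussed in this bug

# (bug number, what was pending, date it became pending, date GTS posted)
EVENTS = [
    ("1532842", "follow-up question", date(2019, 6, 19), date(2019, 7, 15)),
    ("1678183", "first question", date(2021, 1, 20), date(2021, 1, 26)),
    ("1706967", "update after filing", date(2021, 4, 22), date(2021, 5, 1)),
]


def gap_days(pending_since: date, responded_on: date) -> int:
    """Whole days between an item becoming pending and the response."""
    return (responded_on - pending_since).days


def late_events(events, max_gap_days: int = MAX_GAP_DAYS):
    """Return (bug, description, gap) for every gap above the limit."""
    return [
        (bug, what, gap_days(start, end))
        for bug, what, start, end in events
        if gap_days(start, end) > max_gap_days
    ]


if __name__ == "__main__":
    for bug, what, gap in late_events(EVENTS):
        print(f"Bug {bug}: {what} took {gap} days")
```

With these inputs the first and third entries exceed the limit, while the 6-day response in Bug 1678183 does not.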

3. Whether your CA has stopped, or has not yet stopped, certificate issuance or the process giving rise to the problem or incident.

N/A. Not related to certificate issuance.

4. In a case involving certificates, a summary of the problematic certificates. For each problem: the number of certificates, and the date the first and last certificates with that problem were issued. In other incidents that do not involve enumerating the affected certificates (e.g. OCSP failures, audit findings, delayed responses, etc.)

N/A. Not related to certificate issuance.

5. In a case involving certificates, the complete certificate data for the problematic certificates.

N/A. Not related to certificate issuance.

6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

We understand we need to do better in communicating with this community and you have our commitment that we will. To that end we want to capture the data we have used in our root cause analysis that has led to our plan to remediate the issue as outlined in section 7.

As noted in Bug 1706967, prior to 2020-02-19 our compliance program was largely driven from the governance perspective. In response to past incidents, we restructured it and introduced a new cadence of compliance meetings, structurally improved the associated processes and increased engineering and operations participation. During those compliance meetings we reviewed bugs, forums, ballots, requirement updates, and discussed public communications. As a result of these changes we improved a number of the associated processes to better integrate the program with other elements of the operation.

Unfortunately, those improvements did not sufficiently reduce reliance on human elements of responses to public communications. Additionally, even after these changes the staffing of the compliance program did not adequately accommodate the ebb and flow of resource availability that results from both planned and unplanned leave or unexpected work assignments.

This is particularly important as the process is based on participation including individuals with expertise in various aspects of the operation of the program which makes having adequate planning for resource availability that much more important. As a result, communications could get delayed to ensure the appropriate stakeholders are present.

Our analysis of this particular incident showed that our handling of these communications was overly reliant on human elements and that a lack of staffing depth in the response creation was one manifestation of that.

Also this analysis showed the lack of tracking of response times allowed the issue to go unnoticed. As such we believe that automation of incoming Mozilla bugs, forum posts and time since the last post would have helped ensure timely updates.
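As an illustration of tracking time since the last post, the sketch below classifies each tracked item by how long it has been since the CA last posted. It is a sketch under stated assumptions, not the tooling described in this report: the 7-day limit reflects the weekly update expectation discussed in this bug, the 5-day warning threshold is an arbitrary illustrative value, and `TrackedItem` and `classify` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class TrackedItem:
    source: str             # e.g. "bugzilla" or "mdsp"
    ref: str                # bug number or thread identifier
    last_ca_post: datetime  # time of the CA's most recent post (UTC)


def classify(
    item: TrackedItem,
    now: datetime,
    warn_after: timedelta = timedelta(days=5),     # illustrative threshold
    overdue_after: timedelta = timedelta(days=7),  # weekly update expectation
) -> str:
    """Return "ok", "warn" or "overdue" based on time since the last CA post."""
    age = now - item.last_ca_post
    if age >= overdue_after:
        return "overdue"
    if age >= warn_after:
        return "warn"
    return "ok"


if __name__ == "__main__":
    now = datetime(2021, 4, 29, tzinfo=timezone.utc)
    item = TrackedItem("bugzilla", "1706967",
                       datetime(2021, 4, 22, tzinfo=timezone.utc))
    print(item.ref, classify(item, now))
```

A warning state ahead of the deadline gives anyone watching the tracker a chance to escalate before an update is actually late.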

We believe that with these and the other changes outlined in Section 7 we will successfully prevent similar incidents in the future. This is also in line with our goal of automating as much of our compliance program as possible and will help reduce the likelihood of manual process failures.

7. List of steps your CA is taking to resolve the situation and ensure that such situation or incident will not be repeated in the future, accompanied with a binding timeline of when your CA expects to accomplish each of these remediation steps.

To prevent similar issues from happening again, we are implementing automation that watches for relevant Mozilla tickets and MDSP forum posts and automatically creates tickets in our internal tracking system.

We believe that by running these communications through our general issue tracking processes, we will ensure that there is a documented action item for every public communication item and that the implementation status of each item is visible to a larger audience in both the compliance and the engineering team.

This will reduce the process dependency on specific individuals and work on open action items can begin independent of the compliance meeting schedule. It will also enable us to integrate measuring our conformance with these requirements with our other business instrumentation. The implementation will be complete by 2021-06-15.
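As a rough illustration of the kind of check such automation could perform, the sketch below flags open bugs that have gone more than a week without an update from the CA. It is not a description of the actual GTS tooling: the `TrackedBug` record, the `overdue_bugs` helper and the seven-day threshold are assumptions chosen for the example, with the threshold taken from the weekly-update expectation in Mozilla's incident response guidance.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Assumption: an update is expected at least weekly on every open incident bug.
UPDATE_INTERVAL = timedelta(days=7)


@dataclass
class TrackedBug:
    """Hypothetical internal record mirroring one public incident bug."""
    bug_id: int
    last_ca_update: datetime  # time of the CA's most recent public comment
    is_open: bool = True


def overdue_bugs(bugs, now, interval=UPDATE_INTERVAL):
    """Return open bugs whose last CA update is older than `interval`,
    ordered with the longest-silent bug first."""
    late = [b for b in bugs if b.is_open and now - b.last_ca_update > interval]
    return sorted(late, key=lambda b: b.last_ca_update)


if __name__ == "__main__":
    now = datetime(2021, 5, 1, tzinfo=timezone.utc)
    bugs = [
        TrackedBug(1, now - timedelta(days=10)),
        TrackedBug(2, now - timedelta(days=3)),
        TrackedBug(3, now - timedelta(days=30), is_open=False),
    ]
    for bug in overdue_bugs(bugs, now):
        print(f"Bug {bug.bug_id} needs an update")
```

A check like this only measures response timing; it says nothing about whether the content of an update is adequate.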

In addition, we have increased the frequency of the bi-weekly reviews to weekly and allocated additional time to them.

Finally, we have created a new engineering compliance role and appointed an experienced engineer to it. This role will strengthen our internal review processes by ensuring that the engineering aspects of public communications are well covered and give us more staffing depth in the response creation so we can better accommodate reviewer availability. The first task of this role is to own the implementation projects for the new automation tooling.

We will also be working to prepare more people to participate in this rotation from throughout the engineering and operations organizations supporting GTS.

Our timeline for implementing the aforementioned changes is shown in the following table.

YYYY-MM-DD Description
2021-05-03 The new engineering role has been created and an experienced engineer has been added to the compliance team.
2021-05-03 The frequency of the bi-weekly review has been increased to weekly and participation from the engineering side has been increased
2021-06-15 Solution for automated issue tracking to be complete

We believe that the updated incident report combined with our prior responses has addressed the aforementioned questions. If you feel otherwise, you could restate your questions and we will do our best to answer them.

Flags: needinfo?(rmh)

(In reply to Fotis Loukos from comment #17)

On 2020-09-30 a post was made and the language of the post was not interpreted as a question for GTS, however the needsinfo flag was set and assigned to an individual on the GTS team.

Could you clarify: Is this referring to Comment #22, which stated:

Does this seem like the correct interpretation of the policy, or have I missed something?

Or Comment #23, which set needsinfo for GTS' answer to that very question?

Overall, with respect to this section, it appears that GTS has decided to significantly deviate from the expectation from Section 2. It does not appear to be a Timeline, which is clearly defined in https://wiki.mozilla.org/CA/Responding_To_An_Incident , as being both date and timestamped. This is relevant to the preceding question, but also because it appears that in GTS' summarization, it overlooks many relevant details to this incident. It's unclear if this was intentional, but the concern is that this may suggest that GTS does not understand the concerns, that it disagrees with the concerns, or it does not understand the expectations. This may not have been how GTS was wanting to appear, but that's how this incident report presently appears, and thus, is concerning. Can you clarify the thinking behind the choice to summarize - is this accurately reflecting GTS' understanding of the concerns?

Similarly, can GTS explicitly clarify which of its past incidents it reviewed? Particularly relevant for this timeline, it appears that GTS has missed important and relevant details, shared on these and other incidents, which would be critically important to helping assure the community that these issues will not repeat.

This is not meant to be pedantry, but in fact meant to highlight to GTS' management that it appears GTS has overlooked precisely how these are repeats of issues that have been raised in the past. This incident report does not give any indication that GTS is aware of these facts, and that's concerning, because it suggests that there are still deeper root causes at play here that remain unaddressed. This is why the timeline asks for CAs to include the details they find relevant, including events in the past (predating this incident).

The response here in Section 2 is meant to demonstrate GTS's understanding and awareness. I'm wanting to extend the benefit of the doubt to GTS, despite Comment #17 being clear evidence to the contrary, in the hopes that this was merely an oversight. However, as noted in other incident bugs, if the issue is truly that GTS is simply not aware, then I'm more than happy to provide a timeline for Section 2 that highlights and illustrates the concerns and what might have been expected from GTS.

Our analysis of this particular incident showed that our handling of these communications was over reliant on human elements and lack of staffing depth in the response creation was one manifestation of that.

Also this analysis showed the lack of tracking of response times allowed the issue to go unnoticed. As such we believe that automation of incoming Mozilla bugs, forum posts and time since the last post would have helped ensure timely updates.

The problem with the answer in Question 2, independent of the issues with the fact that it's not a timeline as expected, is that it appears GTS has omitted a number of important relevant details. Because of that, it seems reasonable to conclude that, even despite Bug 1563579 being mentioned in Comment #6, GTS has not performed any evaluation of that, either while that incident was occurring (at a separate CA) or in response to this incident.

It's important to note that the omission here is quite relevant to understanding root causes. Would GTS agree that it seems reasonable for the public community here for GTS to have been aware of that incident, to have discussed and reviewed that incident, and taken their own steps to learn from and incorporate changes in response?

If it is reasonable, then the answer here in Question 6 does not give sufficient clarity to understand why that was not done.
If it is not reasonable, which appears to be the current conclusion from this incident report, then it seems important to ask GTS for more details about why it's not reasonable to expect this of CAs.

Bug 1563579 is but one bug, among many, that would and should have provided insight for GTS into the expectations. The failure here, then, is not just on GTS' own failure to communicate, but seems to strike a deeper chord: A failure to be aware of these incidents or to recognize the risk to GTS' operations.

This isn't a trick question, but an honest and earnest one. Comments such as Comment 1678183, Comment #3 suggest that GTS is indeed reviewing all of these bugs, and indeed agrees it's reasonable to expect, but if this isn't an accurate reflection of GTS' views, then we should sort this out first and foremost.

Finally, we have created a new engineering compliance role and appointed an experienced engineer to it. This role will strengthen our internal review processes by ensuring that the engineering aspects of public communications are well covered and give us more staffing depth in the response creation so we can better accommodate reviewer availability. The first task of this role is to own the implementation projects for the new automation tooling.

While this may seem like I'm harping on a point, the proposed mitigations reflect the failure of the timeline and the failure of the root cause analysis, and thus, fail to recognize what appears to be another issue here of GTS that is directly contributory to this incident.

As mentioned above, one concern is the lack of awareness and learning from other CAs' bugs, particularly as they relate to this incident.
As discussed elsewhere, another concern is that certain expectations have been repeatedly stated to GTS, over a period now of at least several years, and these incidents keep reoccurring.

The mitigations proposed overall, and in particular, the role of compliance being suggested here, do not seem to speak to either of the root causes for these issues. For example, automated tooling may watch GTS bugs and ensure prompt replies, but the replies could, such as Comment #17, still fail to meet the measure and expectation. They could, like this issue itself, be repeats of issues previously clarified to GTS, but not integrated or acted upon.

GTS' approach to a singular compliance role is, similarly, reflective of past mistakes other CAs have made in trying to address compliance. This incident report, as currently provided, does not appear that GTS has considered those other CAs' incidents or lessons, and thus is similarly poised to fail to appropriately remediate things.

To try to help GTS understand why Comment #17 is seen as being indicative of more problems, rather than being assuring of solutions, and because GTS specifically requested assistance in Comment #18:

Lack of Monitoring other CAs' bugs

Question

Comment #3:

Additionally: "Only a select few are responsible for handling issues" was also the root cause of Bug 1563579, and can also be seen as a root cause of Bug 1572992 (and also other delays in CA communications on this forum). Could you explain why this problem wasn't considered as a problem earlier?

Comment #4 states:

Finally, as Matthias points out, having too few people responsible for incidents was also a cause of Bug 1563579, and I echo Matthias' question about why GTS did not previously consider this problem.

Answer

Comment #5:

As for why this was not detected earlier. The nature of personal/medical/vacation leave, as well as the volume of work each individual in the on call rotation are subject to, is difficult to predict and quantify as changes tend to be subtle and unexpected when they occur. Our failure, again as acknowledged, was not having sufficient resource elasticity in that rotation to accommodate that reality or having monitoring to ensure communication timelines are met.

Explanation for why the answer is troubling

The answer here seems to have missed that the question is asking the same concerns being raised here, namely: Why, when past incidents occurred for other CAs, did GTS not recognize its design for compliance was fundamentally flawed and insufficient? The answer here appears to be answering a completely different question, which is "Why didn't GTS discover this on its own" - when the real question is "Why did GTS fail to discover its own flaws, if it has appropriate processes to monitor for other CAs' incidents?"

This is not a unique concern for this incident, but it's still concerning. For example, the same question was posed to GTS in Bug 1678183, Comment #2 related to OCSP. In particular, in GTS' response to those concerns it effectively said "We knew about it, but we dropped the ball, but we've hired more people". That comment was made on 2021-04-02, this bug was opened on 2021-04-29, and so it seems like we have clear evidence that the process is not working like it should.

Lack of awareness of patterns

Question

Comment #3 states:

Please note that this issue has been going on for over 10 months. You've failed to provide timely updates for over 10 months. I fail to see how personal/medical leave, vacation and additional work could result in delays lasting more than 10 months.

Comment #4 asks:

Could GTS please describe in detail the "recurring sync" that was added in 2019, analyze why it failed to prevent subsequent issues, and explain how the newly proposed automation will address the shortcomings of the previous system?

Answer

It does not appear, to date, that GTS has provided an explanation for how such issues can fail to be detected for so long.

Comment #5 attempts to address only a small portion of Comment #4's question, and even that incompletely. Namely, in response to the request to "describe in detail", GTS states in Comment #5

Our recurring compliance reviews were initially set to occur every other week and required a quorum of at least 2 CA engineers and 2 policy authority members. Product management representatives are also invited, but not required. In cases where we had back to back sessions with less than 4 people for quorum, we would allow a session with 3 people and at least one person each from engineering and policy. Starting in March 2021, these reviews were moved to weekly.

Explanation for why this is troubling

The concern here, which is similar to the previous, is that there is a pattern here where the issue extends beyond a single incident. For example, expectations for timelines are clarified to GTS, but then across all GTS bugs, GTS fails to take action. This has happened for several years now, but has also happened where clarifications are provided to several CAs about the expectations, and then GTS makes the same mistake.

The root cause analysis focuses on GTS' failure to provide prompt updates (as a lack of automation), but that fails to address what is the far more concerning systemic issue: that it would appear that GTS is simply failing to learn from incidents, both their own and others. The discussion here, being raised of "over 10 months", is highlighting this.

Perhaps the phrasing here, which states it as a concern, is missing that it's implicitly asking for GTS to address and clarify the concern. Put differently, the question here might be seen as "Can you explain how personal/medical leave, vacation, and additional work results in delays lasting more than 10 months?"

Comment #4 was explicitly trying to get GTS to answer this, but the explanation fails to describe in detail the process of what those reviews consider (for example: all CAs' bugs? GTS' open bugs? Responses to GTS' past comments? Discussions in the CABF? Etc), and importantly, fails to "analyze why it failed to prevent subsequent issues, and explain how the newly proposed automation will address the shortcomings of the previous system"

Put differently: It's not simply "why did GTS not immediately detect this", but rather, "How, over a series of months, with both GTS receiving clarification on their own incidents and on other CAs' incidents, did this still occur, if GTS is supposed to be reviewing these bi-weekly?" The incident is not just a one-off: it appears every one of these meetings failed to achieve the most critical goal of the meeting, which is learning from incidents, and that's why the response in Comment #5 is so troubling, and why Comment #17 still systemically fails to address.

As a concrete illustration of the "failure to spot patterns"/"keep making the same mistake":

  • Bug 1563579, Comment #6 (a Sectigo issue), the following was stated to Sectigo:

    In no situation should the time period between an acknowledgement and update be longer than a week, the absolute upper bound

  • Bug 1634795, Comment #1 (a GTS issue), the following was stated to GTS:

    https://wiki.mozilla.org/CA/Responding_To_An_Incident#Keeping_Us_Informed makes it clear weekly updates are expected, in the absence of positive affirmation as shown via the status. This would have mitigated the above issue, had it been followed.

  • Bug 1678183, Comment #4 (a GTS issue), the following was stated to GTS:

    Ensure regular communication on all outstanding issues until GTS has received confirmation that the bug is closed.

  • Bug 1678173, Comment #5, GTS states:

    We would appreciate clarification on the expectations around weekly updates for items that are considered remediated

This is an example of a pattern of GTS failing to learn from other CAs' incidents, as well as failing to learn from their own incidents, and which led to this incident. It appears that GTS' explanation may be "We thought this was different", but that's not supported with the above text, and if it is GTS' explanation, is itself a repeat of a pattern of GTS' application of their own flawed interpretations, rather than proactively seeking clarity and assuming the "worst". This is similarly seen by its recent SHA-1 failure: a failure to learn from both their own incidents and other CAs' incidents.

Lack of "compliance-in-depth"

Question

Comment #3 states:

I can't parse that last sentence. If your primary individual is not available most of the time and is delegated to the secondary individual, there is no reliable backup for the secondary individual.

Answer

It does not appear that GTS has provided clarity for how this sentence should be interpreted.

Explanation for why this is troubling

For better or worse, this was equally "a question not phrased as a question". "I can't parse this last sentence" is trying to ask "Can you please explain?"

The concern here is real, and carries through GTS' latest explanation in Comment #17: It fails to address the concern raised here (as was suggested GTS do in Comment #11 ), and instead, further emphasizes that it's still concerning. "We've created a new engineering compliance role and appointed an experienced engineer to it" appears to be saying that "We've stopped calling our secondary individual secondary, they are now primary. We have no more secondary engineers" - which appears to be failing to address the problem that GTS itself was highlighting.

It's concerning here because it would be unfortunate if future GTS issues were now blamed/scapegoated on this individual, both because that's contrary to a blameless post-mortem and contrary to good practice of defense-in-depth and redundancy. Should this engineering compliance engineer take vacation, are we to expect that this was unpredictable and/or a suitable explanation for why new incidents occurred?

Overall concerns

At present, it appears GTS is treating this as a task-tracking/handoff issue. Namely, an individual was assigned, they went on vacation/medical leave/etc, and things weren't handed off. The latest explanations, such as "more automation", appear to support that conclusion.

Throughout this issue, the concern being raised is that there is ample data to point out that this is a pattern of problems, and the concerns being raised here are GTS' failure to recognize patterns. The automation approach does nothing to suggest that GTS will recognize the patterns, and that's a critical goal of the whole process. Prompt responses that are devoid of content or new information are far less valuable than late responses, and just as problematic.

It's unclear if GTS recognizes this: whether it recognizes its own patterns from past incidents (The answer in Comment #17 to Question 2 suggests it does not), it's unclear if GTS recognizes patterns it continues to show, even now, that have been criticized and addressed with other CAs, and it's concerning that it appears to be either making the same mistakes (such as a single point of failure) or failing to learn from them.

For example, if one examines Google's SRE book on Incident Management, there are heavy parallels to be drawn to these Google bugs as being "unmanaged incidents". Indeed, the ad-hoc communication, which Comment #9 is perhaps a textbook example of (as pointed out by Comment #10), can equally be seen in these other incidents. If GTS was following these practices - handoffs, clear communication and responsibility, delegation - and treating "Incident Resolution" as being the actual closing out of the Bugzilla bug, it's difficult to imagine how these issues could have happened in the first place. Many of these practices are further expanded upon in the related workbook, which talks about ways to mitigate the issues GTS raised (such as in Comment #2 / Comment #5).

This may seem overly critical on the delay issue, but it's because we're viewing this holistically with contemporary and past GTS issues. For example, Bug 1706967 highlights the same concerns being touched on here, where instead of "expectations around weekly updates", we're talking about "patterns of incidents of other CAs". This is not new to GTS either, as reflected in bugs like Bug 1678183 or Bug 1612389, which both distinctively left the impression of "not paying attention to other bugs or expectations".

What does it take to close this incident

I'm sure at this point, GTS is feeling quite beat up, because its responses have, for several months, been failing to meet the bar. I'm glad to see Comment #17 at least recognized one of the issues with Comment #2 - namely, the failure to provide a binding timeline for changes. This does suggest GTS is at least recognizing that its past answers have been deficient, and is working to improve them.

Holistically, what I think the community wants to see is for GTS to be using this incident as an opportunity to take a deeper and systemic look for how it approaches compliance, and building an answer that they believe demonstrates to the community they understand the root causes and complexities and how they play together. This means a deeper analysis of their own incidents (which is normally what Question 2 is trying to provoke). This is something highlighted to GTS in https://bugzilla.mozilla.org/show_bug.cgi?id=1709223#c5

Realistically, this is a "cluster" issue - although it's a distinct issue, and needs to be dealt with on its merits, it's almost certain that its root causes are closely coupled with Google's other outstanding incidents. Trying to tackle this as an incident in isolation, then, is both failing to see the real root causes, and failing to provide assurance. So when thinking about how to fix this, it's best to think in terms of what large-scale organizational changes need to happen at GTS to ensure compliance is core to its operations. This likely means multiple full-time engineers dedicated to ensuring compliance, with strong processes in place and strong incident management playbooks (modeled after Google's own playbooks for SREs, for example), so that every single report of an incident, from any CA, but especially GTS, gets attention, discussion, and introspection. It means rethinking how GTS applies its own interpretations to expectations and how it seeks clarity for those, and what it does when it's still lacking clarity. It also means thinking about how it uses the clarity that's given - such as in past incident reports - and ensures that it applies that understanding to future incident reports.

Consider, for example, a world in which GTS has multiple compliance engineers, performing adversarial reviews in which they're encouraged to disagree with GTS' management to apply the "worst possible interpretation" (such as by red teaming / acting like literal genies), in which onboarding an engineer means spending months reviewing the past X years of incidents, with clear processes for documenting and distributing lessons from those to GTS staff. That's the sort of holistic picture being talked about.

Flags: needinfo?(doughornyak) → needinfo?(rmh)

Similarly, can GTS explicitly clarify which of its past incidents it reviewed? Particularly relevant for this timeline, it appears that GTS has missed important and relevant details, shared on these and other incidents, which would be critically important to helping assure the community that these issues will not repeat.

Since March 2020 our records indicate we have reviewed all newly opened Mozilla bugs and closely analyzed more than 100 applicable bugs looking for ways we can improve our operations. This included looking for opportunities to improve procedural controls, address technical gaps, as well as looking for conformity issues. Two public examples of where this process identified issues that led to improvements include Bug 1678183 on the Invalid ASN.1 encoding of singleExtensions in OCSP responses and Bug 1652581 on the digitalSignature KeyUsage bit not being set.

Bug 1563579 is but one bug, among many, that would and should have provided insight for GTS into the expectations. The failure here, then, is not just on GTS' own failure to communicate, but seems to strike a deeper chord: A failure to be aware of these incidents or to recognize the risk to GTS' operations.

We reviewed Bug 1563579 multiple times as it has been updated. Our records indicate that we noted that its root cause was human error. At the time we reviewed our own practices and we felt there was no indication that we were at risk of a similar occurrence within our own operations. We have since determined that our process for responding to Mozilla bugs was not sufficient and that by automating the monitoring and response handling, we will be able to reduce our dependence on human factors effectively.

GTS' approach to a singular compliance role is, similarly, reflective of past mistakes other CAs have made in trying to address compliance. This incident report, as currently provided, does not appear that GTS has considered those other CAs' incidents or lessons, and thus is similarly poised to fail to appropriately remediate things.

By aligning how we track Mozilla incidents within our internal bug tracking system, issue triage and workflow we remove the potential for a single point of failure. The definition of the new engineering role will also help us better utilize the engineering organization to help with compliance and we believe, will increase the overall focus placed on incidents and policy changes. We also believe additional engineer training and expansion of the program will help ensure more opinions are included in associated adversarial reviews. As the internal bug tracking system is durable, viewed by the entire team, and monitored closely by the on-call rotation we believe it will allow us to address our past tracking failures as well as response timing issues.

This is an example of a pattern of GTS failing to learn from other CAs' incidents, as well as failing to learn from their own incidents, and which led to this incident. It appears that GTS' explanation may be "We thought this was different", but that's not supported with the above text, and if it is GTS' explanation, is itself a repeat of a pattern of GTS' application of their own flawed interpretations, rather than proactively seeking clarity and assuming the "worst". This is similarly seen by its recent SHA-1 failure: a failure to learn from both their own incidents and other CAs' incidents.

To be clear, we believe it is important for all organizations and processes to continually improve and while we have certainly made mistakes, we also believe we have demonstrated we are continually evolving our practices to incorporate both our learning from our own experiences and the community's. Moving forward we also plan to increase our participation in discussions at the Mozilla mailing list and Mozilla Bugzilla. In addition, we are expanding the scope of our contributions to other forums on WebPKI, such as the CA/B Forum. Simply put, we are committed to doing more in the community, which we believe will help with this concern.

It's concerning here because it would be unfortunate if future GTS issues were now blamed/scapegoated on this individual, both because that's contrary to a blameless post-mortem and contrary to good practice of defense-in-depth and redundancy. Should this engineering compliance engineer take vacation, are we to expect that this was unpredictable and/or a suitable explanation for why new incidents occurred?

The question, as worded, suggests our issue handling process is dependent on one individual and the addition of this new role shifts the responsibility to yet another single point of failure. As we have tried to elaborate in our earlier responses, while the compliance program is ultimately managed by an individual it is supported by team members across various roles and responsibilities. The new compliance engineering role is intended to augment this team as a dedicated resource who can help enhance the program with further engineering practices such as automation and playbooks and better incorporate other engineering resources into the compliance program.

The root cause analysis focuses on GTS' failure to provide prompt updates (as a lack of automation), but that fails to address what is the far more concerning systemic issue: that it would appear that GTS is simply failing to learn from incidents, both their own and others. The discussion here, being raised of "over 10 months", is highlighting this.

As previously detailed, we did not see any open questions on Bug 1652581, thus we were not producing updates in our weekly (previously bi-weekly) reviews and we failed to provide regular updates for the stated period. Our new automation to ensure all open issues are clearly flagged and assigned is intended to address this issue.

Consider, for example, a world in which GTS has multiple compliance engineers, performing adversarial reviews in which they're encouraged to disagree with GTS' management to apply the "worst possible interpretation" (such as by red teaming / acting like literal genies), in which onboarding an engineer means spending months reviewing the past X years of incidents, with clear processes for documenting and distributing lessons from those to GTS staff. That's the sort of holistic picture being talked about.

We agree with your proposal and in fact we are actively acting on it. We have already created a formal compliance engineering role to augment the existing engineering, project management and compliance teams’ members that manage the compliance program. This new role will focus on automation and monitoring of compliance. However, we intend to grow this team with more engineers being trained and involved. During the process of their onboarding, lessons learned and outcomes of bug triaging will be taught. These lessons will be further disseminated throughout the entire team. Furthermore, utilizing their enhanced engineering compliance training, the team will focus on creating additional playbooks in accordance with Google's best practices and policies.

Thank you for Comment #20.

At this point, the responses are not encouraging that GTS has adequate measures to detect and prevent future compliance issues. While it is clear that GTS feels differently, the replies to date do not demonstrate the level of understanding or responsiveness expected from CAs. Despite multiple efforts to highlight this, it does appear that GTS is still misunderstanding core expectations.

The following response best exemplifies this, but please understand that this is simply an example:

Our records indicate that we noted that its root cause was human error.

A more comprehensive reply will be forthcoming that will more fully detail the failures here, in the hopes of better highlighting the concerns to GTS so that we might end up with analyses and solutions that properly address the concerns. I wanted to acknowledge Comment #20, however, so there was not a misunderstanding that the concerns had been addressed, which they have not been.

Google Trust Services is monitoring this thread for any additional updates or questions.

Whilst awaiting Ryan's response, I've written down my thoughts as well. This might overlap with Ryan's upcoming response, but these are my thoughts on the current status of both this issue and GTS' Compliance plans.

Incomplete information on the state of GTS' Compliance team

With your provided information, I know that you have

  • a rotating on-call team (pre-existing),
  • regular meetings on a weekly basis (was bi-weekly) for that on-call team,
  • a singular full-time compliance engineer (new) whose task it is to own the implementation of internal compliance tools, and
  • (new) an integration between the various compliance inputs (Bugzilla, CA/B Forum, ...) and your internal issue system.

But even with all the above comments to and from, how come I still cannot tell who is responsible for picking up any compliance issues, what resources are dedicated to compliance, how compliance issues are generally handled within GTS, and how GTS ensures that their compliance team actually works? How come I still cannot distill what concrete problems were found? ("not enough resources" is very vague, "a weekly meeting for the compliance team" doesn't tell me much).

If I can take Bug 1709223 Comment 21 at face value, it seems like before the assignment of the new compliance engineer role there was no person dedicated to compliance. Even now, it appears there is only 1 person fully dedicated to compliance, which (as seen in other issues) has been a source of compliance problems in other CAs before, and this is especially problematic considering that in Comment 5 you mention that changes in availability due to leaves for personal, medical or vacational reasons are subtle and unexpected (the root cause of which has not yet been clarified).

(Dis)similarities with Bug 1563579

In Bug 1563579 Comment 25 (which is a great example of answering both q6 and q7 of the IR template for procedural shortcomings in compliance), it is also noted that a new position was opened (and later filled) for an extra compliance engineer shortly after that bug was opened. Yet it took 6 more months and one employee that was not in their compliance team sending an urgent letter to upper management before any visible progress was made in explaining and solving that compliance issue. So, even with a team of 2 full-time employees (1 more than what I understand GTS has) there were significant delays. I won't say that you are in exactly the same situation, but there are striking similarities.

Furthermore, with Bug 1563579 Comment 25 Sectigo proactively and fully disclosed what problems they detected in their compliance workflow, how they came to be, what their resources allocated to compliance were, what their processes were, and what systemic changes have been made to these so that this should not occur again. But in this issue, I have failed to find a similar take on the issue from GTS: Responses have been limited in information, documented changes were limited in scope and value, and GTS has failed to adequately respond to several of the issues raised.

Incomplete / lacking responses

The problematic responses that were named in Comment 19 have not been adequately addressed in Comment 20, even after Comment 19 explained why these responses were problematic: The "Lack of Monitoring other CAs' bugs" and "overall concerns" sections did not receive any updated information, while "Lack of awareness of patterns" and "Lack of compliance-in-depth" only received a re-hash of earlier responses while still not responding to some of the highlighted problems mentioned in these sections ("Put differently: ... fails to address", "The concern ... was highlighting.").

Comment 20 does address some issues, but fails to look at the overarching issues. For instance: "As previously detailed, we did not see any open questions on Bug 1652581, thus we were not producing updates in our weekly (previously bi-weekly) reviews and we failed to provide regular updates for the stated period. Our new automation to ensure all open issues are clearly flagged and assigned is intended to address this issue.". This does indeed cover why it was "over 10 months" and not "over 9 months", but the problem is that in 4 distinct issues over the past 10 months you've been pointed to the same requirement of regular updates. Failing to discover such systematic issues is exactly something that I cannot trust to be fixed by your current changes. Sure, you'll fix the systematic issue that is delayed responses, but that doesn't fix the systematic issue of certificate profile non-compliance, or RFC non-compliance, or otherwise non-compliance with the BR / MRSP.

Open concerns

Overall, with Comment 20, GTS does not seem to recognise the concerns that Ryan named in his Comment 19 in the "Overall concerns" section. I share all of these concerns with Ryan.

I find it deeply concerning that GTS does not seem to be able to recognise these problematic patterns in their compliance workflow even when they're called out, does not seem to be able to discover these problematic patterns on their own, and seemingly has no plan to improve their detection of problematic patterns.

Suggestions

I would like to suggest that GTS re-examines the "what it takes to close this issue" section of Comment 19. Here, I want to explicitly call out the request for a holistic review of GTS' compliance approach, and I would like to suggest that such review contains at least the state of GTS' Compliance approach from before this issue was filed, any changes made since and the holistic compliance approach that GTS is targeting.

Google Trust Services is monitoring this thread and will provide an update.

Thanks again for your response. It seems we need to do a better job framing the responses we have already provided.

First, we should make it clear that our compliance program is much larger than Webtrust for CAs. For example, Google Trust Services utilises system infrastructure that is subject to many other audits such as ISO 27001. The breadth and complexity of the program is such that it is hard to express the entirety of the program in a bug like this. This is why we have attempted to focus on the elements we felt were the most relevant to the issue at hand, the failure to provide regular and timely incident updates.

To put the larger program in context, for WebTrust, there are 115 controls addressing elements ranging from information classification to cryptographic key handling. These controls are backed with a set of standard operating procedures which are further backed with playbooks and tooling that enable us to operate our services while meeting our compliance obligations. These controls, procedures and processes are supported by numerous technical controls and by dozens of developers, security engineers, program managers, compliance specialists, physical security staff and legal professionals.

Beyond this, we leverage internal and external teams to support security and policy reviews. For example, as we mentioned earlier we engage with two different auditing firms to help ensure we are able to get a diverse set of opinions on compliance related topics.

Though the intention of this bug is to track and discuss the issues relating to our failure to provide regular and timely incident updates, I want to try to answer your specific questions about the larger compliance program. Based on your post, I believe the following are your core questions:

Who is responsible to identify and execute on compliance issues?

We have a clear definition of roles and responsibilities for all operational tasks. All of our staff handle compliance related issues in accordance with the responsibilities of their job role.

The CA Policy Authority (composed of members of the CA compliance team, program management, engineering management and company wide compliance) is responsible for ensuring that issues like Mozilla policy changes are captured in tracking bugs and assigned to an appropriate owner in either the compliance or engineering teams.

To ensure timely updates on Mozilla bugs we are implementing automation to monitor and create tracking bugs. The new compliance engineering role will oversee this implementation and train our engineering teams on how to triage these bugs and assign them to a suitable owner. The CA Policy Authority measures the process continuously to ensure that it is effective and that suitable owners are assigned.

What resources are dedicated to compliance?

Multiple people on our teams are dedicated to compliance work. While some of them are handling compliance tasks full time, others are solving specific compliance problems on a per project or part time basis. Since our recent incidents, we have expanded the full time roles and included two additional engineers who support the activities we have mentioned in this and the other bugs. Before this year, we had 3 people shouldering most of the load and we acknowledge that was not sufficient.

How are compliance issues generally handled within GTS?

It is difficult to summarize the overall compliance program in a single bug. We have tried above to answer the specific elements of how reviews of changes and updates are handled in this and the other related bugs. If you have a more specific question we will do our best to answer it.

How does GTS ensure that their compliance program actually works?

We have numerous controls in place that we use as checks and balances for the program. Our use of multiple auditors is an example. Additionally, we implement automation that supports the program which helps reduce the risk of human failure. We also have compensating controls that help mitigate failures should they occur which enable us to identify problems and fix them while still meeting our obligations.

Beyond the scope of questions here, we wanted to provide a bit of insight into other areas our compliance program also covers. Since inception, Google Trust Services has focused heavily on technical, engineering and security controls, both logical and physical, a few examples include:

We run all certificates through syntactic linters like ZLint and run 100% of the certificates we issue each day through audit checks instead of only inspecting 3% samples per 90 days as required by the Baseline Requirements. The checks also cover the correct performance of operational processes such as CAA and domain control validation steps. The reports for all these checks are provided to auditors as part of our regular audit processes as well.

We leverage all the security practices and systems covered by https://www.google.com/about/datacenters/data-security/ and implement significant additional safeguards on top of them.

On top of this our certificate issuance model relies heavily on standards based automation. This approach reduces the chance of domain validation related compliance issues. In addition, we only use static certificate profiles and this way avoid the risk of a large number of manual validation errors which based on our analysis made up some 30% of the issues reported on Bugzilla between 2019 and 2020.

To be more agile to policy changes and generally encourage the adoption of more secure practices, we have adopted 90 day certificate lifetimes as our default. We have designed our systems with the goal of eliminating the chance of manual errors which are often the source of compliance issues.

We take both security and compliance very seriously and have made significant investments into both. This combined with our heavy focus on change control, removing manual tasks via automation, instrumentation and monitoring along with compensating controls help us ensure the compliance program works as expected.

That is not to say that we do not have room to improve; we do. We simply want to paint the larger picture to help answer your question.

What concrete problems were found (in this case)?

There were several problems identified in this incident:

  • We did not have sufficient controls in place to detect and manage response times for updates when the availability of team members on the rotations ebbed and flowed.
  • We had insufficient monitoring and tracking in place to measure our conformance to our public update obligations

Is it true that before 1709223 Comment 21 there was no person dedicated to compliance?

No. Please see the earlier explanation.

Is it true there is now only 1 person fully dedicated to compliance?

No. Please see the earlier explanation.

What systemic changes have been made to these so that this should not occur again?

We have committed to make a number of changes, some of which are already in place. These include:

  • We have created a new role within the organization focused on compliance from an engineering perspective.
  • We have staffed that role with an experienced and dedicated engineer to better support the compliance program overall.
  • We have committed to the development of tooling to incorporate the tracking of Bugzilla incident and m.d.s.p responses into our internal issue tracker. This will ensure our response times fulfill program requirements.
  • We have committed to use the dedicated engineering role to better incorporate the larger engineering team into the public incident response process.

Additionally, as mentioned above we have also recently added a second compliance engineer into the program to help accelerate the work we are doing in this area.

We leverage all the security practices and systems covered by https://www.google.com/about/datacenters/data-security/ and implement significant additional safeguards on top of them.

I'm sorry, but are you telling me that Google Trust Services is running their CA infrastructure on infrastructure owned by Google LLC [0]? That seems contrary to what the CPS [1] tells me.

Section 5.1 clearly specifies that Google Trust Services is located in and operated from "secure Google facilities". In the CPS, "Google" is specified as "Google Trust Services LLC (a Delaware corporation)" in Appendix A through section 1.6, and also defined as a common shorthand for "Google Trust Services LLC" in section 1.1. Combined, this tells me that Google Trust Services' CA infrastructure is located in Google Trust Services' secure facilities, not in the facilities of Google LLC.

Could you provide some clarification?


A different curiosity I just discovered is that the facilities audited in the WebTrust Standard Audit and BR Audit seem to only include facilities in New York, USA, and Zurich, Switzerland, whereas the communications by GTS about potential audit delays due to local Covid-19 mitigations (see Bug 1625498) mentions auditing 3 locations, of which only Zurich seems to be mentioned on the audit reports[2]; the other facilities that were mentioned in that bug but not included in the audit report being in Oklahoma, US, and in South Carolina, US.

Could you help me understand what happened there as well?

[0] the company detailed here https://en.wikipedia.org/wiki/Google
[1] https://pki.goog/repo/cps/3.4/GTS-CPS.pdf
[2] https://www.cpacanada.ca/generichandlers/CPACHandler.ashx?attachmentid=246228

In accordance with the timeline provided, we have implemented and deployed the tooling that will monitor GTS related Mozilla Bugzilla bugs. 

It does this by mirroring the relevant Mozilla Bugzilla incidents to our internal bug tracking system. This will ensure that the existing processes used daily to operate and maintain GTS services can also track community reported issues. This will importantly also give us reporting on SLA compliance including:

  • New bugs related to GTS,
  • Bugs that need update,
  • Bugs assigned to GTS, and
  • Bugs with the needinfo flag set for GTS.
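As an illustration only, the four report categories listed above could be computed from mirrored bug records roughly as follows. This is a minimal sketch and not GTS's actual tooling, which is not described in this thread. The `MirroredBug` record, its field names, the one-day window for "new" bugs, and the `sla_report` function are all hypothetical; the 7-day threshold is an assumption based on the weekly update expectation in Mozilla's incident guidance.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional

# Assumption: weekly progress updates are expected unless otherwise specified.
UPDATE_INTERVAL = timedelta(days=7)


@dataclass
class MirroredBug:
    """Hypothetical internal copy of a public Bugzilla bug."""
    bug_id: int
    created: datetime
    last_ca_update: Optional[datetime]  # None if the CA has not commented yet
    assigned_to_ca: bool
    needinfo_for_ca: bool


def sla_report(bugs: List[MirroredBug], now: datetime,
               new_window: timedelta = timedelta(days=1)) -> Dict[str, List[int]]:
    """Group bug IDs into the four report categories."""
    report: Dict[str, List[int]] = {
        "new": [], "needs_update": [], "assigned": [], "needinfo": [],
    }
    for bug in bugs:
        # New: filed within the (assumed) reporting window.
        if now - bug.created <= new_window:
            report["new"].append(bug.bug_id)
        # Needs update: no CA comment within the update interval,
        # measured from bug creation if the CA has never commented.
        last_activity = bug.last_ca_update or bug.created
        if now - last_activity > UPDATE_INTERVAL:
            report["needs_update"].append(bug.bug_id)
        if bug.assigned_to_ca:
            report["assigned"].append(bug.bug_id)
        if bug.needinfo_for_ca:
            report["needinfo"].append(bug.bug_id)
    return report


if __name__ == "__main__":
    # Made-up sample data; the IDs do not refer to real bugs.
    now = datetime(2021, 6, 1, tzinfo=timezone.utc)
    sample = [
        MirroredBug(1, now - timedelta(hours=12), None, False, False),
        MirroredBug(2, now - timedelta(days=30), now - timedelta(days=10), True, True),
    ]
    print(sla_report(sample, now))
```

A bug can appear in more than one category (for example, assigned and overdue for an update), which is why the sketch reports each category independently rather than assigning a single status.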

Additionally this tooling enables cloning of other CA incidents into the same tracking system which will provide an internal forum where all GTS engineers can participate in adversarial reviews of other CA incidents. Some benefits include:

  • Improve our ability to learn from other incidents,
  • Improve our ability to take action when needed, and
  • More quickly and completely incorporate lessons learned into trainings and playbooks.

We believe this closes the committed deliverable for the automation mentioned in this bug.

Based on your post, I believe the following are your core questions:

Are you telling me that Google Trust Services is running their CA infrastructure on infrastructure owned by Google LLC [0] (see link reference in comment 26)? That seems contrary to what the CPS [1] (see link reference in comment 26) tells me.

Google Trust Services runs our CA services from dedicated facilities located within Google data centers. It is common practice for dedicated CA facilities to be built out within existing data centers. There are no root program or audit requirements relating to ownership of physical facilities.

A different curiosity I just discovered is that the facilities audited in the WebTrust Standard Audit and BR Audit seem to only include facilities in New York, USA, and Zurich, Switzerland, whereas the communications by GTS about potential audit delays due to local Covid-19 mitigations (see Bug 1625498) mentions auditing 3 locations, of which only Zurich seems to be mentioned on the audit reports[2]; the other facilities that were mentioned in that bug but not included in the audit report being in Oklahoma, US, and in South Carolina, US.

There is no requirement in WebTrust or applicable AICPA standards as to what locations have to be included in WebTrust audit reports. We included the data center locations in Bug 1625498 since the WebTrust audits require assessments of the operational facilities. Future audits, in conformance with Mozilla Root Store Policy 2.7.1 which became effective May 1st, 2021, will also incorporate data center locations.

There is no requirement in WebTrust or applicable AICPA standards as to what locations have to be included in WebTrust audit reports. We included the data center locations in Bug 1625498 since the WebTrust audits require assessments of the operational facilities. Future audits, in conformance with Mozilla Root Store Policy 2.7.1 which became effective May 1st, 2021, will also incorporate data center locations.

This seems to directly contradict https://www.mozilla.org/en-US/about/governance/policies/security-group/certs/policy/#314-public-audit-information

Could you explain?

To be more precise:

  • The change to Mozilla's wiki regarding expectations for audit locations was performed 2020-03-23, and received additional updates the following day.
  • WebTrust's Illustrative Guidance, dated 2017-09-01 (and still in force) states the following regarding disclosure of "CA Processing Locations" (emphasis added)

    All reports issued should list the city, state/province (if applicable), and country of all physical locations used in CA operations. This includes data center locations (primary and alternate sites), registration authority locations (for registration authority operations performed by the CA), and all other locations where general IT and business process controls that are relevant to CA operations are performed.

Although it would appear that GTS is referencing the set of changes from 2.7 to 2.7.1, and the policy effective date, it would seem that the statement regarding "root program or audit requirements", as well as "no requirement in WebTrust", appears to be contrary to the available facts. I'm hoping GTS can better explain its rationale and conclusions here.

Google Trust Services is monitoring this thread and will soon provide a response.

The suggestion to include data center locations as a policy rather than as guidance was brought up at m.d.s.p on 2020-03-20[1] within the context of reports for delayed audits during the COVID-19 pandemic.

Mozilla's wiki was updated on 2020-03-23 to include this language under the Minimum Expectations subsection of Audit Delays. Although GTS created Bug 1625498 to track a possible delay, the audit was completed on time.

In your comment, you reference the updated version of Section 3.1.4 MRSP which was not in force at the time our audit report was issued. The addition of point 12 was not merely an editorial change as you can see from the discussions here:
[2, 3]

You also mention the WebTrust Practitioner Guidance, which is not directed towards CAs, it is published for WebTrust auditors.

This document has been prepared by the WebTrust for Certification Authorities Task Force (the “Task Force”) for use by those auditors licensed to perform WebTrust for Certification Authorities audits by CPA Canada.

It is correct that the Guidance mentions CA Processing Locations including data centers. The Guidance is however of illustrative nature and auditors decide how to apply it.

This discussion, which took place from December 2020 to February 2021 between Mozilla and WebTrust task force representatives indicates that at least for WebTrust auditors, the rules on data center location disclosures were still in debate.

Finally, we want to mention that our current audit report expires in about three months (on September 30). In accordance with current requirements, future reports will include the data center locations we disclosed in Bug 1625498 and any sites added going forward.

We hope that this additional information addresses your concern.

(In reply to Andy Warner from comment #32)

You also mention the WebTrust Practitioner Guidance, which is not directed towards CAs, it is published for WebTrust auditors.

This document has been prepared by the WebTrust for Certification Authorities Task Force (the “Task Force”) for use by those auditors licensed to perform WebTrust for Certification Authorities audits by CPA Canada.

It is correct that the Guidance mentions CA Processing Locations including data centers. The Guidance is however of illustrative nature and auditors decide how to apply it.

This discussion, which took place from December 2020 to February 2021 between Mozilla and WebTrust task force representatives indicates that at least for WebTrust auditors, the rules on data center location disclosures were still in debate.

That discussion quite clearly states that the requirements for audited location disclosure are clear: (all emphasis added)

Task Force Comments on Proposed Requirement

  1. Disclosure of each location that was included in the audit.
    At the present time, both the public WebTrust report, as well as the detailed controls report, require disclosure of each location involved in the audit. The specificity of disclosure can vary, often at the city level or higher, in order to protect the confidentiality of locations required by some CAs.

The discussion did, however, mention that the term 'facility' is defined neither in the CA/B Forum Baseline Requirements nor in the WebTrust materials, but that doesn't matter here, as in Bug 1625498 you mention audits of facilities that are located in Oklahoma, US, and in South Carolina, US. These are to the best of my knowledge 'audited locations', and consequently should have been included in the audit report. Alternatively, if not the set of locations mentioned in Bug 1625498, but a different set of locations were audited, then the audit letter would still be incorrect, as it only mentions two of the three locations that we would expect based on the "3rd location" that was completed according to Bug 1625498 Comment 5.

Google Trust Services is monitoring this thread for any additional updates or questions.

As discussed in Bug 1708516, Comment #21:

GTS

As part of a series of outstanding issues, Google Trust Services (GTS) has shared their plans to overhaul their compliance program. This is collectively described across Bug 1706967, Comment #11, Bug 1708516, Comment #17, and Bug 1709223, Comment #21.

GTS' analysis suggests that the root causes were a lack of dedicated compliance resources, an over-reliance on private communications conducted verbally, and a lack of a holistic approach to compliance. GTS has outlined a plan to address these matters through additional resourcing and automation, and through a binding commitment to raise questions in the future in the public, rather than private. However I'm concerned this overlooks a deeper underlying cause that hasn't been addressed yet, namely a failure of incident management.

It's by observing how a CA communicates with the community, through its CP/CPS, descriptions of its systems, and incident reports that the community is able to understand how a CA takes steps to prevent situations like the issuance of a test "google.com" certificate, a failure to validate domains, or even the next DigiNotar. Audit reports and auditors have never been sufficient to fully assess these systems, and incident reports are essential to engaging with specific CAs to learn how their controls failed, and providing a model for other CAs to learn and prevent similar failures. Poor incident reports are the leading predictor of there being deeper issues at a CA, and the continued failure to appropriately manage, appropriately respond, and prevent incidents have revealed serious concerns with CAs, such as PROCERT and Camerfirma. It's for these reasons that CAs are held to the standard set by https://wiki.mozilla.org/CA/Responding_To_An_Incident for all incidents, no matter how serious the CA views them, because it's critical to user safety to ensure that every CA is capable of providing detailed, thorough, and correct analysis of issues.

Unfortunately, GTS’ incident reports show a pattern of problematic practices that run counter to that goal. This message includes a detailed look at past and current GTS issues, to highlight these patterns, and to suggest substantive changes to prevent re-occurrences. In short, these changes involve:

  • Treating each CA incident as a major, ongoing incident which should be managed according to incident management good practices, such as those covered in Google’s SRE book.
  • Setting up a process to ensure that communication is detailed, accurate, and consistent, and addressing any internal challenges that may prevent that.
  • GTS re-evaluating their own past incident reports to identify problematic patterns, to ensure that they are addressed going forward. This message includes an analysis of several such patterns in order to help highlight this, but is not meant to be exhaustive.
  • Re-examining the process used for root cause analysis to avoid identifying human error or errors in judgement as a root cause (per https://wiki.mozilla.org/CA/Responding_To_An_Incident#Incident_Report). The current process has resulted in repeat incidents that a more thorough analysis and remediation should have addressed.

Timeliness and Responsiveness - A Case Study

Bug 1708516 summarizes a series of recent events where GTS failed to provide timely updates, going back to Bug 1634795 on 2020-05-17. GTS' initial response to this pattern being highlighted, as captured in Bug 1708516, Comment #5, has been to increase "the size of the pool [so] we have more resource elasticity" and "automating the monitoring of forums and automatically tracking the status of interactions". Later, after concerns were raised about how well this would address the issue, GTS further explained in Bug 1708516, Comment #17, that, prior to 2020-02-19, their "compliance program was largely driven from the governance perspective". In response to Bug 1708516, GTS stated that they've taken several steps to address this by introducing a new engineering compliance role, switching from bi-weekly to weekly meetings with engineering present, and automating monitoring of outstanding incidents.

However, this analysis does not seem to fully identify or address root causes, nor does it adequately provide reassurance that future incidents will be prevented. For example, in Bug 1532842, Google Trust Services failed to provide regular communication. On 2019-07-04, the CA contacts were directly contacted by a Module Peer, with an explicit statement "Please see https://wiki.mozilla.org/CA/Responding_To_An_Incident#Keeping_Us_Informed for the expectations regarding timely updates on incidents, including weekly progress reports unless otherwise specified." (Bug 1532842, Comment #5). This was reiterated on 2020-05-17, in Bug 1634795, Comment #1, with a statement "https://wiki.mozilla.org/CA/Responding_To_An_Incident#Keeping_Us_Informed makes it clear weekly updates are expected, in the absence of positive affirmation as shown via the status." When GTS had another incident fail to receive updates (Bug 1678183), with GTS again being reminded (Bug 1678183, Comment #4), GTS stated on 2021-04-02 that they felt the expectations around incident management for items that they (GTS) consider addressed were unclear, with a statement "If there is an expectation that CAs provide updates / checkins until an issue is fully closed, it would be good to have that clarified in the Mozilla incident guide." (Bug 1678183, Comment #5).

There are three elements to extract from this: A failure for GTS to provide timely updates, a failure for GTS to correctly understand requirements, and communication by GTS that falls below the expectation for CAs.

Failure to provide updates

Failure to provide updates is something that GTS has struggled with since before they were ever included. In a thread started on mozilla.dev.security.policy on 2017-02-09 titled "Google Trust Services roots", a number of questions and concerns were raised about the path that GTS took to acquiring ubiquity. Ultimately, this thread would lead to policy changes that prevent other CAs from pursuing GTS' path. However, during the thread, after a period of initial engagement on 2017-02-10, GTS then ignored the thread (and the questions asked) until nearly a full month after, on 2017-03-06, only commenting after Kathleen Wilson raised the lack of responsiveness as a blocker to making further progress on GTS' root inclusion on 2017-03-03 (Bug 1325532, Comment #5).

That first response from GTS identified the delay as "due to scheduling conflicts", a trend of explanations that would continue through bugs such as Bug 1634795, Comment #2 to the most recent incident, in Bug 1708516, Comment #2.

While GTS' incident response focuses on the lack of staffing and need for automation, what it doesn't address is why GTS failed to identify this pattern in its own compliance issues, especially after they were highlighted in Bug 1708516, Comment #0. The latest set of mitigations appears to focus on the delays as the issue to be addressed, without identifying the failure to recognize that GTS has long had issues with timely and complete responses. It might be argued that GTS recognized the risk in Bug 1532842, Comment #10, on 2019-07-15, with the introduction of their recurring sync. However, it's also clear from the incidents that GTS did not view such a sync as an essential compliance function to be prioritized (Bug 1634795, Comment #2; Bug 1678183, Comment #5).

This is why responding solely to the delays is insufficient: it lacks an analysis about why GTS failed to see the delays as a pattern. Compare this to an example response that has been highlighted to GTS as a model of what is expected: Bug 1563579, Sectigo's incident. Sectigo shared an incredibly detailed timeline on the internal deliberations on these compliance incidents, highlighting the degree to which some members were aware of both incidents and patterns, and how that was being dealt with going forward to ensure those voices were listened to. GTS' explanation is that nearly identical factors (too much work on too few critical individuals) led to GTS' failure, but GTS' responses fail to come close to the level of transparency and detail provided by Sectigo.

This seems to suggest that the root cause is one of philosophy or approach to compliance. This is highlighted by the lack of detail provided in bugs such as Bug 1706967, Comment #11, which were in response to explicitly requested details in Bug 1706967, Comment #7.

Failure to properly understand requirements

There also seems to be a pattern of GTS' failure to properly understand requirements and expectations, which leads to not only incidents, but incomplete incident reports or responses when incidents occur. Like issues with timely communication, this pattern goes back to prior to GTS' own root inclusion.

The 2017-02-09 mozilla.dev.security.policy thread "Google Trust Services roots" highlighted a variety of issues with GTS' CP/CPS and approach to audit management. Although GTS privately disclosed its intent to acquire key material from GlobalSign for the GlobalSign Root R2 and R4 roots, the actual process GTS followed resulted in several incidents that arose from GTS' failure to understand and adhere to requirements. As highlighted by Peter Bowen at the time, and later confirmed by Mozilla (Bug 1325532, Comment #41), GTS failed to maintain the requisite audits. Peter Bowen also highlighted that the scope of the audit engagement omitted key criteria with respect to the issuance of subordinate CA certificates; the audit report, by omitting this, failed to represent an auditor opinion on the matter, and was thus, from a public perspective, unaudited.

This ultimately led to substantial policy language changes to prevent CAs from making this mistake in the future. It also highlighted that it's often necessary to reiterate questions or requirements to GTS, with GTS failing to properly understand either the existing requirements or the question. For example, Peter's original message highlighted the audit issue, GTS provided an incomplete answer that did not address the issue, the issue was again reiterated, subsequently ignored, and Peter had to again repeat and explain the question. It was not the only issue GTS had with their application; as noted in Bug 1325532, Comment #33, GTS was failing to meet Mozilla's long-standing requirements around disclosure of domain validation methods for DV certificates, and was still, within a month of it being forbidden, relying on an insecure method of domain validation for OV certificates.

This pattern repeats itself throughout GTS incidents, without any measurable sign of improvement. Bug 1652581 / Bug 1709223 are examples of this: GTS' failure to properly understand the role of digitalSignature and delegated OCSP responders would lead them to incorrectly use direct responders, and then incorrectly use root key material to sign with SHA-1. Bug 1667844 would result from GTS failing to properly understand the requirement "direct or transitively issued", which had been repeatedly addressed by Mozilla previously. GTS acknowledged they repeatedly overlooked requirements in Bug 1581183, and failed to understand existing requirements or correctly evaluate their own systems in Bug 1612389 (Bug 1612389, Comment #2).

These issues are compounded by GTS' tendency to fail to meet the Incident Report expectations, such as clear timelines of relevant events and clear timelines for action and remediation. Examples of this can be seen from Bug 1634795, Comment #1, Bug 1522975, Comment #1, Bug 1706967, Bug 1652581, Comment #1, Bug 1532842, Comment #5, Bug 1532842, Comment #7, and Bug 1532842, Comment #9. The repeated nature of this suggests either a failure to properly understand the expectations or a failure to communicate effectively.

GTS' assessment and solution for these problems has been to hire more people and create a dedicated role, but that alone doesn't give any assurance as to how these patterns would be addressed. At issue is an apparent lack of a holistic system in place to ensure awareness and understanding. This was something flagged to GTS explicitly in Bug 1678183, Comment #8, and previously, in Bug 1612389, Comment #1, Bug 1612389, Comment #3, and Bug 1612389, Comment #7. Indeed, much of the current solution appears to be a repeat of Bug 1612389, with the same criticisms and concerns captured in Bug 1612389, Comment #7, without substantive change or improvement.

Problematic communications and statements

One of the most concerning elements of GTS' incident reports is a clear pattern of communication that mirrors that of some of the most problematic CAs. The pattern is difficult to pin down in any one comment, but is readily apparent, both through GTS' own incidents, and through the surrounding contemporary context of GTS' incidents. The concern here is the longstanding pattern that is associated with CAs who do not view compliance as essential to trust, and are dismissive of concerns or errors that they view as minor or unfounded, despite how regularly these concerns and errors reveal more systemic issues.

Broadly, this takes a variety of forms: omission of relevant or important details or considerations, broad statements that are presented conclusively, but without supporting evidence, or statements which contradict previous statements or other information. This might be described as "confidently incorrect", and each instance of this undermines confidence that GTS is taking appropriate safeguards or performing the correct analyses. It also plays through with a preference for informal, off-the-record verbal conversations, which are demonstrably prone to error and misinterpretation and contrary to root program expectations for CAs. It equally shows up with either a failure to show how a proposed solution fits the described problem, or through the inclusion of unrelated detail. Most importantly, it includes a framing of compliance issues as "disagreement of opinion", without acknowledging that ultimately, those opinions are key to deciding whether or not GTS is trustworthy.

While there is certain irony in Ryan Sleevi lecturing a CA about communication style, in this particular case, these patterns are at the core of a CA's main asset: trust. These failures in communication significantly undermine that trust and thus require further scrutiny. The centrality of communication to trust in a CA is called out by https://wiki.mozilla.org/CA/Responding_To_An_Incident, noting "Our confidence in a CA is in part affected by the number and severity of incidents, but it is also significantly affected by the speed and quality of incident response."

  • At the time of GTS' initial root inclusion, GTS's CPS asserted that during the period of unaudited operation, the newly-acquired GlobalSign roots were bound by the Google Inc. CPS. However, the scope of the Google Inc. CPS made it explicitly clear that it did not cover the newly-acquired roots, as highlighted by Peter Bowen. Only after this was again emphasized did GTS update their CPS, even though such updates do not address such concerns, because they happen after-the-fact.
  • In Bug 1652581, Comment #17, in explaining why GTS failed to detect an issue, GTS incorrectly asserted ZLint lacked a particular lint. Bug 1652581, Comment #18 highlighted to GTS that lint had existed for years prior to the incident. Only after this was highlighted did it reveal that GTS had been incorrectly running ZLint the entire time, as captured in Bug 1652581, Comment #19, but GTS failed to acknowledge that until being explicitly asked when they made the changes.
  • Similarly, Bug 1706967, Comment #9 and Bug 1709223, Comment #6 show that these issues are still present in the most recent GTS issues.
  • GTS regularly fails to include key required information, such as affected certificates, full incident timelines, and timelines for remediation, as shown by Bug 1522975, Bug 1630040, Bug 1634795, Bug 1652581, Bug 1667844, and Bug 1678183.
  • Bug 1652581, Comment #0 highlights both an omission of relevant technical details, such as the factors GTS considered, and only describes the decision that was made, not how it was made. In sharing their decision, GTS indicated their intent to continue the problematic operation because "the missing digitalSignature bit does not have an immediate security or compatibility impact for subscribers or relying parties", language similar to that used in Bug 1532842, Comment #0 ("we also believe that this issue does not represent a material security risk to the community"), both of which mirror the problematic language explicitly called out in https://wiki.mozilla.org/CA/Responding_To_An_Incident, specifically: "Responses similar to “we do not deem this non-compliant certificate to be a security risk” are not acceptable.".
    • The lack of analysis was highlighted in Bug 1652581, Comment #1. In Bug 1652581, Comment #2, GTS acknowledged that they had not done any a priori analysis or testing, and instead relied on user error reports to determine if something was wrong. However, in Bug 1325532, Comment #26, GTS' first incident, GTS acknowledged that relying on user error reports is not acceptable and error prone.
  • Bug 1708516 is marred by technical errors and misstatements as highlighted by Bug 1708516, Comment #3, Bug 1708516, Comment #10, Bug 1708516, Comment #26, and Bug 1708516, Comment #30.
  • In Bug 1532842, Comment #0, GTS committed to revoke all certificates on 2019-03-31. GTS then promptly failed to do so, initially failing to provide the requested information despite repeated follow-ups (Bug 1532842, Comment #6), and then ultimately acknowledging after-the-fact that they decided to delay their previously committed to compliance deadline to 2019-04-08 (Bug 1532842, Comment #8). Even in doing so, despite two direct reminders over the previous weeks, they failed to provide the necessary information until further requested (Bug 1532842, Comment #10).
  • Bug 1652581 is an example of GTS improperly, insecurely using its root key material in a way that was unambiguously prohibited. GTS' explanation, however, was that they contacted root programs and weren't told it was prohibited, so assumed it was OK.
    • On 2020-08-13, GTS improperly used its root key.
    • On 2020-08-28, GTS stated that they reached out to Mozilla, Apple, and Microsoft "last week", according to Bug 1652581, Comment #14, placing such contact between 2020-08-15 and 2020-08-21.
    • On 2020-10-05, GTS stated that they reached out to "the various root programs" somewhere between 2020-07, according to Bug 1667844, Comment #5.
    • On 2021-04-19, GTS stated that "Before taking this action, GTS consulted with root program contacts for all the root programs we participate in.", in a post to mozilla.dev.security.policy.
    • While it's possible GTS performed contact both before and after, the explanations provided do not make sense if that is what GTS is stating. Further, the number and nature of root programs they contacted has changed. For example, the Chrome Root Program was never contacted about this.
    • In Bug 1709223, Comment #16, GTS ultimately acknowledges it has no records about when it contacted these root programs, nor contemporaneous notes or transcripts.
  • Bug 1709223, Comment #0 on 2021-05-03 discusses how GTS "discussed our plan to only make the targeted changes necessary to address the conformity issues", while Bug 1667844, Comment #10 highlighted that GTS had made other technical changes that were not only unnecessary, but as Bug 1709223, Comment #22 highlighted, had compatibility risk that was explicitly tested by linters.
  • In Bug 1708516, Comment #25, GTS highlights the use of multiple auditors, despite a recurring theme of discussion in the CA/Browser Forum and Mozilla dev-security-policy about how auditors do not check for root program expectations.
  • Similarly, in Bug 1708516, Comment #25, GTS notes ZLint as a syntactic linter, despite repeated discussions, including with GTS in Bug 1652581, Comment #18, about ZLint not checking syntax, only semantics, and the necessity of other linters such as certlint to achieve that.
  • In Bug 1709223, GTS gives the explanation for their decision to insecurely use their root CAs' private key as being a decision about compatibility. However, GTS acknowledges that not only did they fail to do compatibility testing, they failed to do compatibility research. When pushed for concrete details about the factors that led to GTS' concern, GTS listed a set of software that does not support SHA-2 as the explanation for their decision to use SHA-1 for the root, despite the fact that certificates from GTS' roots today already do not work in such software, and so there was no compatibility risk.

Failure to manage incidents

Ultimately, these patterns present themselves as a failure of incident management by Google Trust Services. GTS seems to now be aware of the problematic pattern of responsiveness, but sadly, its increased responsiveness has led to more cases where it hasn't correctly or adequately responded, or has given answers that aren't internally consistent with past explanations. Similarly, the emphasis on timeliness of response may have led GTS to overlook both an analysis of the systemic factors behind its incidents and an examination of how its incident reports have not met the expectations called out in https://wiki.mozilla.org/CA/Responding_To_An_Incident, both in practice and in principle.

GTS' mitigation for this spate of incidents is to assign an engineer to a role of engineering compliance, and to focus on automation, but these alone do not address a need for GTS to fundamentally change how they approach incident management. Bug 1708516, Comment #19 highlighted that these patterns seem to fit the anti-pattern examples of being "unmanaged incidents", a term from Google's book on how it approaches Incident Management. Examples have been highlighted to GTS of other CA’s incidents, which do meet expectations, but GTS has yet to adopt improvements or acknowledge these gaps. Even recently, comparing the changes GTS proposes in Bug 1706967, Comment #11 with those of another CA, Let's Encrypt, in Bug 1715455, Comment #27, shows a radically different approach to compliance and thoroughness.

Closing thoughts and recommendations

When GTS first applied for inclusion, there was significant discussion on the mailing list (mozilla.dev.security.policy, "Google Trust Services Root Inclusion Request"), with concerns that GTS' relationship with Google would mean that, no matter how bad an incident, Mozilla would be unable to distrust them. There was also concern that GTS would benefit from favorable interpretation by Chrome. As was noted then, it is essential to hold GTS to the same standard that all CAs are held to.

This is an attempt to highlight to GTS the seriousness of the situation, and the critical importance of breaking the pattern outlined above. The reason why I'm concerned about closing the remaining open bugs is that, in GTS' incident reports so far, GTS has failed to realize that GTS's failure to manage their incidents is itself an ongoing incident. This is a core and repeated theme in Bug 1708516, and so it makes the most sense to address remediating the issue there.

At the core, GTS needs to have a clear strategy for overhauling how it handles incidents. To be able to successfully do that, GTS needs to be able to identify its past patterns and make sure they have a strategy that can be explained in how the pattern is being prevented. If GTS finds it difficult to get approvals for public comment, as suggested in Bug 1708516, Comment #2, then this plan needs to show how they're addressing that to ensure communications requirements can still be met. This is not unique to GTS: it's the same standard that CAs such as those by Apple (Bug 1588001) and Microsoft (Bug 1598390) have been held to. This includes information that may be seen as embarrassing to Google: transparency is key to trust in the Web PKI. It's the same standard that all CAs are held to.

Any remediation plan almost certainly relies on adopting practices for incident management from the SRE book. Despite the previous suggestion, in Bug 1708516, Comment #19, not only has Bug 1708516, Comment #20 not addressed that, it suggests that GTS does not view their incident management as part of a systemic issue.

This overhaul needs to also address communications. GTS needs to ensure that when it's providing detail, those details are correct, accurate, and sourced. Similarly, every incident report should lead GTS to perform a thorough analysis of past discussions and bugs to ensure they are not overlooking relevant details. In doing so, GTS also needs to perform deeper root cause analysis than it currently is. For example, it’s critical to recognize that the failure of controls or processes is a symptom of deeper issues, and that incident reports need to tackle those deeper issues. It’s insufficient to say a mistake in judgement was made as a root cause. A root cause analysis would evaluate the process for gathering information to inform the judgement, as well as the process for making that judgement, and examine why it failed to reach the right result. The incident report would share the details about what information was gathered and considered, and how the decision was ultimately reached, to better explain how the corrective changes will address this. As mentioned, Bug 1715455, Comment #27 is a good example of this sort of analysis.

To see these issues closed, I would like to see GTS have a plan for how they're going to overhaul their incident management, from analysis to disclosure, and incident communications. Then, it'd be good to see that plan executed for the existing incidents, to make sure that no concern is unaddressed, question unanswered, or inconsistency uncorrected. Closing incidents like Bug 1709223 requires analyzing past GTS incidents to understand how this was highlighted in past GTS incidents as a risk, and an understanding about why that past advice didn't lead to changes with GTS.

This is why these incidents are intertwined: GTS' present incident responses are sub-par, which is consistent with past CA incident responses, and the present situations could have been prevented if the past incidents had been acted upon. We need to believe the present incidents will result in meaningful change, so that we can be confident again that future incidents will be prevented. At present, we just see more of the "same ol", and it's not encouraging.

Google Trust Services is monitoring this thread and will soon provide a response.

Google Trust Services is monitoring this thread and will provide a response soon.

Does GTS have anything more substantive to share?

It’s been pointed out to several CAs recently that these content free updates are not in line with expectations. For example, if you’re not yet ready to provide an update, when will you actually do so, concretely, and why?

I don’t deny that Comment #35 is a big message, and requires careful analysis in the response, but GTS is still expected to provide substance and explanations if they aren’t able to provide actual responses yet.

Flags: needinfo?(fotisl)

In Bug 1715421, Comment #9, it was highlighted that there have been 16 days while GTS has left a question unresponded to. This continues to highlight the concern about substance and responsiveness.

I realize and recognize that Comment #35 - two weeks ago - is a lot, but as highlighted in Comment #38, there are clear and actionable steps GTS could be taking in the interim. As Comment #35 stated, open CA bugs should be treated as a "production outage incident", but it would seem GTS disagrees with that viewpoint, which is quite concerning. Comment #38 tried to align expectations, but it does not seem to have led to actionable changes in the interim.

We would like to clarify that we are handling this incident as a production outage incident, and we are actively working on a response. However, we want to do our best to holistically address the concerns raised in your comment and that is taking longer than we would like. We are committed to posting it no later than Thursday, August 22 2021.

(In reply to Fotis Loukos from comment #40)

We are committed to posting it no later than Thursday, August 22 2021.

Was this a typo?

Yes, sorry, that was a typo, July 22nd.

Flags: needinfo?(rmh)

We are monitoring this thread for any additional updates or questions.

Ryan, as mentioned, we will provide a response for Comment #35 by July 22nd.

We appreciate the write up, Ryan. It is useful to see your consolidated interpretation of the issues and suggestions for a path forward.

TL;DR GTS recognizes that we have not always met community expectations as we transitioned from running a subordinate CA to a publicly trusted Root CA. We have made iterative progress throughout this time and continue to evolve our approach and improve. More importantly, the recent incidents, their effect on our resourcing, and the associated community responses have driven us to holistically revisit our compliance and incident management approach and devise a series of processes which will prepare us to improve even further. We have outlined these in the Action Plan section below. Our goal is to exceed expectations of the Mozilla Community.

In order to respond, we feel it is best to provide a brief summary of related past issues, provide our perspective on them then and now, expand on what we learned from them, highlight where incremental changes have been made, and recap how we believe this approach is the best path forward for GTS and the WebPKI ecosystem.

Retrospective

N.B. in the summaries below the high level details are recapped along with the issues identified and actions taken immediately following the incident. Additional changes made independently of an incident or due to patterns emerging are covered in the final summary.

Bug 1522975

GTS' first related incident was filed on January 25th 2019 and detailed in Bug 1522975. This incident was related to the issuance of an invalid OCSP response. What went well during the management of that incident was that a remediation plan was presented that addressed the root cause of the incident. The plan was implemented in a timely manner and prevented similar issues from happening again. However, in our analysis we found several areas we could improve. As pointed out in Bug 1522975 Comment 1, the response did not use the standard template, although it addressed all elements requested by the program. In addition, there was a delay in providing the final updates for closing the bug. This was caused by a drop in our engagement as soon as the items included in our action plan were complete. We also did not follow-up weekly after the work was completed.

Bug 1532842

Before February 28th 2019, GTS issued 100,813 (7,171 were non-expired) certificates with less than 64 bits of entropy from a CSPRNG in the serial number due to a bug in EJBCA.
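For readers unfamiliar with the serial number requirement referenced here, the following is a minimal illustrative sketch, not GTS' or EJBCA's actual code. The function name `new_serial_number` and the choice of 127 random bits are assumptions made for illustration; the only requirement taken from the incident is that a serial number contain at least 64 bits of CSPRNG output.

```python
import secrets

# Hypothetical helper for illustration only; not taken from EJBCA or GTS.
MIN_RANDOM_BITS = 64    # minimum CSPRNG output required in a serial number
MAX_RANDOM_BITS = 159   # keeps a positive DER INTEGER within 20 octets


def new_serial_number(random_bits: int = 127) -> int:
    """Return a positive serial built from `random_bits` bits of CSPRNG output."""
    if random_bits < MIN_RANDOM_BITS:
        raise ValueError("serial must contain at least 64 bits of CSPRNG output")
    if random_bits > MAX_RANDOM_BITS:
        raise ValueError("serial would not fit in 20 octets as a positive integer")
    serial = secrets.randbits(random_bits)
    # Serial numbers must be greater than zero; redraw in the (very unlikely)
    # case that every random bit came out as zero.
    while serial == 0:
        serial = secrets.randbits(random_bits)
    return serial
```

Drawing more random bits than the minimum leaves headroom, so that constraints such as keeping the integer positive do not push the effective entropy below 64 bits.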

We began investigating our serial numbers on February 26, 2019, after the public discussion began on m.d.s.p. When we received an external report regarding this bug one day later, we requested additional information from the software vendor PrimeKey and received an updated version. The same day, we backported the updated code to our environment and tested it. On February 28, we deployed the fix to production and identified the certificates that were affected by the bug and needed revocation.

On March 1, 2019, GTS started to revoke all affected certificates. Due to the large number of certificates, the revocation risked causing significant outages for some of our subscribers. We made the decision to work with these subscribers to help them rotate their certificates and minimize the interruption to their business. This delayed the revocation for a subset of certificates beyond the 5 day target defined in the BR. Like other CAs affected by the bug, we committed to a revocation plan (28 days) and completed the work within the committed time frame.

While we believe that we decided and executed on the revocation in a timely manner, we also acknowledge that the decision not to disrupt subscribers who needed a bit of additional time meant that we deviated from the 5 day revocation requirement.

During the revocation process, we did not provide interim updates and did not add a post to the bug when the revocation was complete. Although we acknowledged the reminder in Bug 1532842 Comment 5, at this time we did not introduce process changes to track more rigorously when incident bug updates are due.

Bug 1581183

On August 30th 2019, we posted a disclosure to m.d.s.p which resulted in a Bugzilla bug being opened by Wayne on September 13th 2019 capturing the initial disclosure and follow-ups. We reported this incident after discovering that certificates for one of our environments were not included in the regularly scheduled CRL if they had expired before the regular CRL was generated. Including these certificates is a requirement under RFC 5280.
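The RFC 5280 rule at issue is that a revoked certificate's entry must stay on the CRL until it has appeared on one regularly scheduled CRL issued beyond the certificate's validity period. A minimal sketch of that selection logic follows. The `RevokedCert` type and `crl_entries` function are hypothetical names used for illustration, and do not describe GTS' CRL generation.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class RevokedCert:
    # Hypothetical record type; field names are assumptions for illustration.
    serial: int
    revoked_at: datetime
    not_after: datetime
    listed_after_expiry: bool = False  # already on a CRL issued after not_after?


def crl_entries(revoked: List[RevokedCert], this_update: datetime) -> List[RevokedCert]:
    """Select entries for a regularly scheduled CRL issued at `this_update`.

    An expired certificate is still listed once on a CRL issued after its
    expiry, and is only dropped from later CRLs.
    """
    entries = []
    for cert in revoked:
        if cert.revoked_at > this_update:
            continue  # not yet revoked as of this CRL
        expired = cert.not_after < this_update
        if expired and cert.listed_after_expiry:
            continue  # already appeared on one post-expiry CRL
        entries.append(cert)
        if expired:
            cert.listed_after_expiry = True  # record the post-expiry appearance
    return entries
```

Under this sketch, a certificate that is revoked and then expires before the next scheduled CRL still appears on that CRL once, which is the case the incident describes as having been missed.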

The bug was identified during a recurring review of requirements and fixed within 11 days. Our incident response was provided in time and we did not receive any questions on the incident bug. The incident ticket was closed 4 months later.

After posting that the remediation was complete, we considered the incident response a success and again deprioritized posting interim "no update" messages. No specific action was taken to change our incident response procedure.

Bug 1612389

On January 30th 2020, we reported that two of our subordinate CAs were generated with a curve-hash combination that was not permitted under the Mozilla Root Store Policy. The affected CA certificates were revoked and replaced within 4 days.

In response to our incident report, it was pointed out that similar incidents had occurred at other CAs in the past (specifically in Bug 1527423, Bug 1534145 and Bug 1530971). It was suggested that because we did not link to those past incidents we were not aware of them.

Comments on the report also suggested that GTS should make process changes to “better receive clarification for interpretative differences” (Bug 1612389 Comment 3). To prevent similar incidents from occurring again it was suggested that GTS should conduct adversarial reviews (Bug 1612389 Comment 9), “seek clarity from Peers/Owners/Other UAs” or raise unclear requirements on mozilla.dev.security.policy (Bug 1612389 Comment 9).

This event prompted a change to always employ dedicated group reviews with the goal of ensuring appropriate conversations took place while improving the technical depth of the reviews through broader participation. Thus, as of February 2020, we involved more people from our engineering team in the review of policy requirements and WebPKI incidents.

This included more structured bi-weekly reviews of incidents by other CAs including participants from CA Policy Authority and engineering. As part of this we also began keeping detailed logs of each assessment on the applicability of those incidents to GTS noting both current and potential future issues. This change has helped us improve our ability to provide more historical context to our incident reports, ease root cause identification and strengthen our team by increasing overall team awareness of related incidents in the ecosystem.

Analysis

As we look back at historical incidents from GTS and others along with our currently open incidents, we identified the following patterns:

Several past and current incidents (including Bug 1634795, Bug 1652581, Bug 1715421, Bug 1708516, Bug 1709223, and Bug 1706967) did not provide required weekly updates to Mozilla. We primarily attribute the lack of weekly updates to the inefficient process we were utilizing to keep track of our communication obligations. In particular, we did not use our incident response process to track our communications on Bugzilla bugs to ensure that we provide updates within the applicable due dates. Also, the incident response procedure was not specific enough and it was executed outside of the regular cadence of the larger team's work. This limited the involved individuals to a small subset of the GTS organization leading to tough prioritization decisions.
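The kind of due-date tracking described as missing above could be sketched roughly as follows. This is an illustration under stated assumptions, not GTS' tooling: the `OpenIncident` type, the `overdue` function, and the placeholder bug IDs are hypothetical, and the 7-day default reflects the weekly update cadence mentioned in this comment.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Iterable, List, Optional

UPDATE_INTERVAL = timedelta(days=7)  # weekly update cadence


@dataclass
class OpenIncident:
    # Hypothetical record type; field names are assumptions for illustration.
    bug_id: int
    last_update: date
    agreed_next_update: Optional[date] = None  # explicit date agreed on the bug, if any

    def due_date(self) -> date:
        """Date by which the next public update is owed."""
        if self.agreed_next_update is not None:
            return self.agreed_next_update
        return self.last_update + UPDATE_INTERVAL


def overdue(incidents: Iterable[OpenIncident], today: date) -> List[OpenIncident]:
    """Return incidents whose update is past due, most overdue first."""
    late = [i for i in incidents if i.due_date() < today]
    return sorted(late, key=lambda i: i.due_date())


# Usage with placeholder bug IDs and dates (not real incident data).
incidents = [
    OpenIncident(bug_id=1000001, last_update=date(2021, 6, 20)),
    OpenIncident(bug_id=1000002, last_update=date(2021, 6, 28)),
    OpenIncident(bug_id=1000003, last_update=date(2021, 6, 1),
                 agreed_next_update=date(2021, 7, 22)),
]
late_ids = [i.bug_id for i in overdue(incidents, today=date(2021, 7, 1))]
```

Feeding a list like this into the same queue the team already uses for operational work is one way to keep communication deadlines from living outside the regular cadence.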

Additionally, in many cases our incident reports did not make it clear how complete our root cause analysis was on these issues, nor did they adequately link to all publicly available resources that were reviewed, such as related bugs and forum posts. While we have performed adversarial reviews on all incident reports as part of our approach to drafting, reviewing and approving them, the criteria for which relevant discussions and past incidents to cover were not a well-defined part of the reviewers' remit. As a result these artifacts were inconsistently provided.

Another communication related issue is apparent when reviewing Bug 1612389 and Bug 1709223. We were overly reliant on verbal communication with root programs for clarifications on policy interpretations and did not reliably use m.d.s.p for clarifications as we had committed to in the earlier of the two bugs.

As we considered the nexus of these issues, we saw an opportunity to reevaluate some elements of how we approach compliance, how we handle incidents, and how we engage the WebPKI community. While we have invested heavily in compliance, much of the effort has been towards building a highly secure, compliant, reliable and fully audited service. Our heavy focus in these areas was not matched by similar investment and focus on public arenas such as incident response and community engagement, so we have made changes to ensure there is comparable focus.

Action Plan

Based on the above analysis, we have identified a set of changes we believe will ensure our compliance efforts and incident response process address the above issues and generally improve. While more work continues, many of these changes have been implemented already. They are centered around incident response, organizational structure, and the training and involvement of the majority of the GTS organization in compliance related issues.

GTS has integrated its compliance incident management process with its operational incident process which strictly follows the Incident Command System model to ensure the best practices from IMAG and the SRE book incident response process are followed. The new process is more holistic and will help:

  • Align the incident response for compliance issues to follow industry best practices,
  • Integrate the compliance process with the operational incident response process, and
  • Create effective tracking and prioritization, while improving the richness of information captured.

We have created Senior Compliance Engineer and Compliance Engineer roles. While we already have roles defined for CA Engineering (development and operations) and Policy Authority (audit and compliance), we have not had roles which focus entirely on the technical implementation of policy and our interaction with the community. These new roles are intended to address this shortcoming by providing additional expertise and focus, and building a stronger link between the CA Engineering and Policy Authority functions. The roles have already been staffed with two engineers, and we expect the team to grow over time.

The Senior Compliance Engineer is responsible for training the CA Engineering team and further involving the entire GTS organization in compliance related efforts. They have already created a team, developed and delivered training, and established a weekly rotation for monitoring, triaging, and summarizing ballot proposals, policy discussions and incident reports. These artifacts are shared with the entire GTS organization. The benefits we expect from this are:

  • A more thorough and structured approach to reviewing compliance incidents
  • GTS-wide familiarity with our process for managing public WebPKI incidents
  • Expanded practical knowledge gained from other CA incidents, and
  • More accessible and reusable knowledge collected for GTS incidents.

To ensure our responses are timely, and to ensure we fully evaluate all WebPKI incidents, discussions, and proposed policy changes, we have developed automation to better integrate the review of Bugzilla, m.d.s.p., and CA/B Forum mailing lists into our regular operating cadence. This automation creates bugs in our internal bug tracking system which are evaluated by members of the Compliance Engineering team. This will result in a more proactive approach to incident handling and ensures more of the organization engages in adversarial reviews of all Bugzilla bugs, evaluating them within the context of GTS applicability. The goals of these measures are to:

  • Rapidly identify new requirements that may apply to GTS,
  • Rapidly identify issues from other CAs that may apply to GTS,
  • Collect good and bad practices, document them and create trainings,
  • Identify common root causes and ways we can avoid similar issues.

We have also defined an enhanced process for authoring external incident reports and responses to the community, and are regularly improving upon it. The process is based on standard templates that conform to Mozilla's requirements and augments them (e.g. including clear deadlines presented in tables for mitigation action items when applicable). The process designates multiple adversarial reviewers who are responsible for, among other things, reviewing past incidents for patterns, inconsistencies, and anomalies. Also, where appropriate, we will provide a retrospective which will include items such as what went well, what went poorly, and what we could have done better.

All incident reports and responses are then evaluated against internal rubrics, which also incorporate Mozilla's requirements. Further, we have increased agility in collecting internal approvals and posting publicly and will improve on this further. We expect that this process will ensure:

  • Increased clarity and detail,
  • Responses that conform to Mozilla's requirements, and
  • Fuller use of the insights and availability of the larger team.

We believe that these changes and new processes will allow us to avoid repeats of our current incidents, and provide us a way to proactively address policy changes efficiently and effectively in the future.

We do expect there to be continued opportunity to improve training, monitoring and execution of these changes. This is why we have added this process to our annual reviews as we believe this will help ensure we continuously improve upon it.

The timeline for the delivery of these items is the following:

YYYY-MM-DD HH:MM Description
2021-05-03 09:30 Pacific Dedicated Engineering role defined and resources assigned
2021-06-14 ~13:00 Pacific Solution for automated issue tracking is complete.
2021-06-16 09:30 Pacific New process for authoring external incident reports and responses to the community is established.
2021-07-05 06:00 Pacific Bugzilla, m.d.s.p. and CA/B Forum mailing list monitoring is operational.
2021-08-13 Time TBD New incident response process to be fully enacted (time is required to complete training for additional resources and conduct internal test runs).
Flags: needinfo?(fotisl)