(In reply to Fotis Loukos from comment #17)
On 2020-09-30 a post was made and the language of the post was not interpreted as a question for GTS, however the needsinfo flag was set and assigned to an individual on the GTS team.
Could you clarify: Is this referring to Comment #22, which stated:
Does this seem like the correct interpretation of the policy, or have I missed something?
Or Comment #23, which set needsinfo for GTS' answer to that very question?
Overall, with respect to this section, it appears that GTS has decided to significantly deviate from the expectations for Section 2. It does not appear to be a Timeline, which is clearly defined in https://wiki.mozilla.org/CA/Responding_To_An_Incident , as being both date and timestamped. This is relevant to the preceding question, but also because it appears that in GTS' summarization, it overlooks many relevant details to this incident. It's unclear if this was intentional, but the concern is that this may suggest that GTS does not understand the concerns, that it disagrees with the concerns, or it does not understand the expectations. This may not have been how GTS wanted to appear, but that's how this incident report presently appears, and thus, is concerning. Can you clarify the thinking behind the choice to summarize - is this accurately reflecting GTS' understanding of the concerns?
Similarly, can GTS explicitly clarify which of its past incidents it reviewed? Particularly relevant for this timeline, it appears that GTS has missed important and relevant details, shared on these and other incidents, which would be critically important to helping assure the community that these issues will not repeat.
This is not meant to be pedantry, but in fact meant to highlight to GTS' management that it appears GTS has overlooked precisely how these are repeats of issues that have been raised in the past. This incident report does not give any indication that GTS is aware of these facts, and that's concerning, because it suggests that there are still deeper root causes at play here that remain unaddressed. This is why the timeline asks for CAs to include the details they find relevant, including events in the past (predating this incident).
The response here in Section 2 is meant to demonstrate GTS' understanding and awareness. I want to extend the benefit of the doubt to GTS, despite Comment #17 being clear evidence to the contrary, in the hopes that this was merely an oversight. However, as noted in other incident bugs, if the issue is truly that GTS is simply not aware, then I'm more than happy to provide a timeline for Section 2 that highlights and illustrates the concerns and what might have been expected from GTS.
Our analysis of this particular incident showed that our handling of these communications was over reliant on human elements and lack of staffing depth in the response creation was one manifestation of that.
Also this analysis showed the lack of tracking of response times allowed the issue to go unnoticed. As such we believe that automation of incoming Mozilla bugs, forum posts and time since the last post would have helped ensure timely updates.
The problem with the answer in Question 2, independent of the issues with the fact that it's not a timeline as expected, is that it appears GTS has omitted a number of important relevant details. Because of that, it seems reasonable to conclude that, even despite Bug 1563579 being mentioned in Comment #6, GTS has not performed any evaluation of that, either while that incident was occurring (at a separate CA) or in response to this incident.
It's important to note that the omission here is quite relevant to understanding root causes. Would GTS agree that it seems reasonable for the public community here to expect GTS to have been aware of that incident, to have discussed and reviewed that incident, and to have taken its own steps to learn from and incorporate changes in response?
If it is reasonable, then the answer here in Question 6 does not give sufficient clarity to understand why that was not done.
If it is not reasonable, which appears to be the current conclusion from this incident report, then it seems important to ask GTS for more details about why it's not reasonable to expect this of CAs.
Bug 1563579 is but one bug, among many, that would and should have provided insight for GTS into the expectations. The failure here, then, is not just on GTS' own failure to communicate, but seems to strike a deeper chord: A failure to be aware of these incidents or to recognize the risk to GTS' operations.
This isn't a trick question, but an honest and earnest one. Comments such as Bug 1678183, Comment #3 suggest that GTS is indeed reviewing all of these bugs, and indeed agrees it's reasonable to expect, but if this isn't an accurate reflection of GTS' views, then we should sort this out first and foremost.
Finally, we have created a new engineering compliance role and appointed an experienced engineer to it. This role will strengthen our internal review processes by ensuring that the engineering aspects of public communications are well covered and give us more staffing depth in the response creation so we can better accommodate reviewer availability. The first task of this role is to own the implementation projects for the new automation tooling.
While this may seem like I'm harping on a point, the proposed mitigations reflect the failure of the timeline and the failure of the root cause analysis, and thus, fail to recognize what appears to be another issue here of GTS that is directly contributory to this incident.
As mentioned above, one concern is the lack of awareness and learning from other CAs' bugs, particularly as they relate to this incident.
As discussed elsewhere, another concern is that certain expectations have been repeatedly stated to GTS, over a period now of at least several years, and these incidents keep reoccurring.
The mitigations proposed overall, and in particular, the role of compliance being suggested here, do not seem to speak to either of the root causes for these issues. For example, automated tooling may watch GTS bugs and ensure prompt replies, but the replies could, such as Comment #17, still fail to meet the measure and expectation. They could, like this issue itself, be repeats of issues previously clarified to GTS, but not integrated or acted upon.
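To make that limitation concrete, here is a minimal sketch of the kind of staleness check such tooling might perform. Everything in it is an assumption for illustration: the `OpenBug` record, its field names, and the `overdue_bugs` helper are hypothetical and do not reflect GTS' actual tooling or any Bugzilla schema. The one-week threshold follows the weekly-update expectation in the Mozilla incident guidance.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumed threshold: the weekly-update expectation from the incident guidance.
MAX_UPDATE_GAP = timedelta(days=7)


@dataclass
class OpenBug:
    # Hypothetical record; field names are illustrative only.
    bug_id: int
    last_ca_update: datetime


def overdue_bugs(bugs, now):
    """Return the IDs of open bugs whose last CA update is more than a week old."""
    return [b.bug_id for b in bugs if now - b.last_ca_update > MAX_UPDATE_GAP]


# Example with made-up bug IDs and dates.
now = datetime(2021, 6, 1)
sample = [
    OpenBug(bug_id=1, last_ca_update=datetime(2021, 5, 30)),
    OpenBug(bug_id=2, last_ca_update=datetime(2021, 5, 1)),
]
print(overdue_bugs(sample, now))
```

Note that a check like this only measures elapsed time. Nothing in it evaluates whether a reply actually answers the question asked or repeats a previously clarified mistake, which is the gap being described above.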
GTS' approach to a singular compliance role is, similarly, reflective of past mistakes other CAs have made in trying to address compliance. This incident report, as currently provided, does not show that GTS has considered those other CAs' incidents or lessons, and thus is similarly poised to fail to appropriately remediate things.
To try to help GTS understand why Comment #17 is seen as being indicative of more problems, rather than being assuring of solutions, and because GTS specifically requested assistance in Comment #18:
Lack of Monitoring other CAs' bugs
Additionally: "Only a select few are responsible for handling issues" was also the root cause of Bug 1563579, and can also be seen as a root cause of Bug 1572992 (and also other delays in CA communications on this forum). Could you explain why this problem wasn't considered as a problem earlier?
Comment #4 states:
Finally, as Matthias points out, having too few people responsible for incidents was also a cause of Bug 1563579, and I echo Matthias' question about why GTS did not previously consider this problem.
As for why this was not detected earlier. The nature of personal/medical/vacation leave, as well as the volume of work each individual in the on call rotation are subject to, is difficult to predict and quantify as changes tend to be subtle and unexpected when they occur. Our failure, again as acknowledged, was not having sufficient resource elasticity in that rotation to accommodate that reality or having monitoring to ensure communication timelines are met.
Explanation for why the answer is troubling
The answer here seems to have missed that the question is raising the same concerns being raised here, namely: Why, when past incidents occurred for other CAs, did GTS not recognize its design for compliance was fundamentally flawed and insufficient? The answer here appears to be answering a completely different question, which is "Why didn't GTS discover this on its own" - when the real question is "Why did GTS fail to discover its own flaws, if it has appropriate processes to monitor for other CAs' incidents?"
This is not a unique concern for this incident, but it's still concerning. For example, the same question was posed to GTS in Bug 1678183, Comment #2 related to OCSP. In particular, in GTS' response to those concerns it effectively said "We knew about it, but we dropped the ball, but we've hired more people". That comment was made on 2021-04-02, this bug was opened on 2021-04-29, and so it seems like we have clear evidence that the process is not working like it should.
Lack of awareness of patterns
Comment #3 states:
Please note that this issue has been going on for over 10 months. You've failed to provide timely updates for over 10 months. I fail to see how personal/medical leave, vacation and additional work could result in delays lasting more than 10 months.
Comment #4 asks:
Could GTS please describe in detail the "recurring sync" that was added in 2019, analyze why it failed to prevent subsequent issues, and explain how the newly proposed automation will address the shortcomings of the previous system?
It does not appear, to date, that GTS has provided an explanation for how such issues can fail to be detected for so long.
Comment #5 attempts to address only a small portion of Comment #4's question, and even that incompletely. Namely, in response to the request to "describe in detail", GTS states in Comment #5:
Our recurring compliance reviews were initially set to occur every other week and required a quorum of at least 2 CA engineers and 2 policy authority members. Product management representatives are also invited, but not required. In cases where we had back to back sessions with less than 4 people for quorum, we would allow a session with 3 people and at least one person each from engineering and policy. Starting in March 2021, these reviews were moved to weekly.
Explanation for why this is troubling
The concern here, which is similar to the previous, is that there is a pattern here where the issue extends beyond a single incident. For example, expectations for timelines are clarified to GTS, but then across all GTS bugs, GTS fails to take action. This has happened for several years now, but has also happened where clarifications are provided to several CAs about the expectations, and then GTS makes the same mistake.
The root cause analysis focuses on GTS' failure to provide prompt updates (as a lack of automation), but that fails to address what is the far more concerning systemic issue: that it would appear that GTS is simply failing to learn from incidents, both their own and others'. The discussion here, being raised of "over 10 months", is highlighting this.
Perhaps the phrasing here, which states it as a concern, is missing that it's implicitly asking for GTS to address and clarify the concern. Put differently, the question here might be seen as "Can you explain how personal/medical leave, vacation, and additional work results in delays lasting more than 10 months?"
Comment #4 was explicitly trying to get GTS to answer this, but the explanation fails to describe in detail the process of what those reviews consider (for example: all CAs' bugs? GTS' open bugs? Responses to GTS' past comments? Discussions in the CABF? Etc), and importantly, fails to "analyze why it failed to prevent subsequent issues, and explain how the newly proposed automation will address the shortcomings of the previous system".
Put differently: It's not simply "why did GTS not immediately detect this", but rather, "How, over a series of months, with both GTS receiving clarification on their own incidents and on other CAs' incidents, did this still occur, if GTS is supposed to be reviewing these bi-weekly?" The incident is not just a one-off: it appears every one of these meetings failed to achieve the most critical goal of the meeting, which is learning from incidents, and that's why the response in Comment #5 is so troubling, and why Comment #17 still systemically fails to address it.
As a concrete illustration of the "failure to spot patterns"/"keep making the same mistake":
- Bug 1563579, Comment #6 (a Sectigo issue), the following was stated to Sectigo:
In no situation should the time period between an acknowledgement and update be longer than a week, the absolute upper bound
- Bug 1634795, Comment #1 (a GTS issue), the following was stated to GTS:
https://wiki.mozilla.org/CA/Responding_To_An_Incident#Keeping_Us_Informed makes it clear weekly updates are expected, in the absence of positive affirmation as shown via the status. This would have mitigated the above issue, had it been followed.
- Bug 1678183, Comment #4 (a GTS issue), the following was stated to GTS:
Ensure regular communication on all outstanding issues until GTS has received confirmation that the bug is closed.
- Bug 1678173, Comment #5, GTS states:
We would appreciate clarification on the expectations around weekly updates for items that are considered remediated
This is an example of a pattern of GTS failing to learn from other CAs' incidents, as well as failing to learn from their own incidents, and which led to this incident. It appears that GTS' explanation may be "We thought this was different", but that's not supported by the above text, and if it is GTS' explanation, is itself a repeat of a pattern of GTS' application of their own flawed interpretations, rather than proactively seeking clarity and assuming the "worst". This is similarly seen by its recent SHA-1 failure: a failure to learn from both their own incidents and other CAs' incidents.
Lack of "compliance-in-depth"
Comment #3 states:
I can't parse that last sentence. If your primary individual is not available most of the time and is delegated to the secondary individual, there is no reliable backup for the secondary individual.
It does not appear that GTS has provided clarity for how this sentence should be interpreted.
Explanation for why this is troubling
For better or worse, this was equally "a question not phrased as a question". "I can't parse this last sentence" is trying to ask "Can you please explain?"
The concern here is real, and carries through GTS' latest explanation in Comment #17: It fails to address the concern raised here (as was suggested GTS do in Comment #11), and instead, further emphasizes that it's still concerning. "We've created a new engineering compliance role and appointed an experienced engineer to it" appears to be saying that "We've stopped calling our secondary individual secondary, they are now primary. We have no more secondary engineers" - which appears to be failing to address the problem that GTS itself was highlighting.
It's concerning here because it would be unfortunate if future GTS issues were now blamed/scapegoated on this individual, both because that's contrary to a blameless post-mortem and contrary to good practice of defense-in-depth and redundancy. Should this engineering compliance engineer take vacation, are we to expect that this was unpredictable and/or a suitable explanation for why new incidents occurred?
At present, it appears GTS is treating this as a task-tracking/handoff issue. Namely, an individual was assigned, they went on vacation/medical leave/etc, and things weren't handed off. The latest explanations, such as "more automation", appear to support that conclusion.
Throughout this issue, the concern being raised is that there is ample data to point out that this is a pattern of problems, and the concerns being raised here are GTS' failure to recognize patterns. The automation approach does nothing to suggest that GTS will recognize the patterns, and that's a critical goal of the whole process. Prompt responses that are devoid of content or new information are far less valuable than late responses, and just as problematic.
It's unclear if GTS recognizes this. It's unclear whether it recognizes its own patterns from past incidents (the answer in Comment #17 to Question 2 suggests it does not), and unclear whether it recognizes patterns it continues to show, even now, that have been criticized and addressed with other CAs. It's concerning that it appears to be either making the same mistakes (such as a single point of failure) or failing to learn from them.
For example, if one examines Google's SRE book on Incident Management, there are heavy parallels to be drawn to these Google bugs as being "unmanaged incidents". Indeed, the ad-hoc communication, which Comment #9 is perhaps a textbook example of (as pointed out by Comment #10), can equally be seen in these other incidents. If GTS was following these practices - handoffs, clear communication and responsibility, delegation - and treating "Incident Resolution" as being the actual closing out of the Bugzilla bug, it's difficult to imagine how these issues could have happened in the first place. Many of these practices are further expanded upon in the related workbook, which talks about ways to mitigate the issues GTS raised (such as in Comment #2 / Comment #5).
This may seem overly critical on the delay issue, but it's because we're viewing this holistically with contemporary and past GTS issues. For example, Bug 1706967 highlights the same concerns being touched on here, where instead of "expectations around weekly updates", we're talking about "patterns of incidents of other CAs". This is not new to GTS either, as reflected in bugs like Bug 1678183 or Bug 1612389, which both distinctively left the impression of "not paying attention to other bugs or expectations".
What does it take to close this incident
I'm sure at this point, GTS is feeling quite beat up, because its responses have, for several months, been failing to meet the bar. I'm glad to see Comment #17 at least recognized one of the issues with Comment #2 - namely, the failure to provide a binding timeline for changes. This does suggest GTS is at least recognizing that its past answers have been deficient, and is working to improve them.
Holistically, what I think the community wants to see is for GTS to be using this incident as an opportunity to take a deeper and systemic look for how it approaches compliance, and building an answer that they believe demonstrates to the community they understand the root causes and complexities and how they play together. This means a deeper analysis of their own incidents (which is normally what Question 2 is trying to provoke). This is something highlighted to GTS in https://bugzilla.mozilla.org/show_bug.cgi?id=1709223#c5
Realistically, this is a "cluster" issue - although it's a distinct issue, and needs to be dealt with on its merits, it's almost certain that its root causes are closely coupled with Google's other outstanding incidents. Trying to tackle this as an incident in isolation, then, is both failing to see the real root causes, and failing to provide assurance. So when thinking about how to fix this, it's best to think of it in terms of what large-scale organizational changes need to happen at GTS to ensure compliance is core to its operations. This likely means multiple full-time engineers dedicated to ensuring compliance, with strong processes in place and strong incident management playbooks (modeled after Google's own playbooks for SREs, for example), so that every single report of an incident, from any CA, but especially GTS, gets attention, discussion, and introspection. It means rethinking how GTS applies its own interpretations to expectations and how it seeks clarity for those, and what it does when it's still lacking clarity. It also means thinking about how it uses the clarity that's given - such as in past incident reports - and ensures that it applies that understanding to future incident reports.
Consider, for example, a world in which GTS has multiple compliance engineers, performing adversarial reviews in which they're encouraged to disagree with GTS' management to apply the "worst possible interpretation" (such as by red teaming / acting like literal genies), in which onboarding an engineer means spending months reviewing the past X years of incidents, with clear processes for documenting and distributing lessons from those to GTS staff. That's the sort of holistic picture being talked about.