(In reply to Rob Stradling from comment #24)
Next week we will post an update to sections 6 and 7 of the incident report...
Although Robin eventually offered an incident report in comment #9 and responded to feedback during the two months that followed, it's clear that that effort fizzled out. With this update, we hope to clearly demonstrate that Sectigo fully understands the root causes of our lack of timeliness in incident responses and that we have put in place a robust set of measures to ensure that from now onwards our incident response efforts will meet expectations. I'm going to lay this out as a mostly personal account interspersed with a timeline of pertinent events. I feel it's necessary to name and focus on the actions (and inactions) of specific individuals, which I realize falls short of the goal of a blameless postmortem, but I don't see any other way to effectively communicate the root causes, explain why it has taken Sectigo so long to identify them fully, and to convince you that we have now mitigated them effectively.
Many of the senior CA operations personnel at Sectigo have been with the company for a very long time, both before and since the carve out from Comodo Group. Cast your minds back a few years and I hope you'll recall and agree that this same team was doing a reasonable job of incident reporting, both in terms of content and timeliness. Under Comodo Group's management, our approach to incident response was fairly informal and “bottom-up”, due in large part to what was a rather flat management structure: no single person had overall responsibility for incident response, so in practice it tended to rely on me, Robin, and a few other folks keeping a close eye on m.d.s.p / Bugzilla and then proactively taking appropriate action as and when necessary, in addition to doing our regular jobs.
2017-11-01: As Sectigo (then Comodo CA) was carved out of Comodo Group, we had to extricate our systems and processes from those of the original company, which also remained a going concern. That meant disentangling twenty years' worth of utterly interwoven software, data, and employee responsibilities. This task proved massive and elusive and presented challenges we didn't foresee when we set out to separate the two businesses. In the early days of the new company, a small set of experienced individuals simply did what they knew needed to be done to keep operations running, but we all understood that a more systematic, codified approach was necessary in the long term, for the company to scale. Our new CEO quickly set about building a far more structured Management team, with clear allocation of roles and responsibilities, plenty of ambition to grow the company in terms of headcount and revenue, and with a plan to transform each department into a more effective and streamlined operation.
2018-11-01: One year on, our rebrand as Sectigo was announced. With the goal of building and maintaining a positive image for this new brand, staff members who had previously represented the company freely on social media and industry forums were instructed to adhere to certain disciplines. One of those disciplines was that all communication to external parties now needed to be reviewed and sanctioned by the Marketing department. Although this stance was subsequently refined and somewhat softened, the message was clear that Senior Management wanted a controlled approach to “top-down” external communication, which of course included incident response.
2019-01-15: Continuing the same “top-down”, structured, and growth-minded approach, our CEO Bill Holtz wrote to all Sectigo employees (portions redacted) to announce two Management team changes that had occurred:
“Compliance is becoming more involved and complicated and costly for those who fail in the mission. As we move ahead with our planned growth which will include acquisitions, the topic grows in importance and complexity. Robin Alden will now be the Chief Compliance Officer for the company reporting to me. Robin is well versed in compliance requirements and challenges and works with the browsers and the CA/B forum. He is well suited for this role.
Nick France is promoted to SSL CTO reporting to me. SSL continues to be a large revenue contributor for us but is facing challenges. Customers are becoming more inquisitive and are looking more deeply at security companies before committing to a purchase. Speed of engagement with customers and timely resolution of technical and contractual issues are paramount to our success. Nick will serve the company well in this role.”
Chief Compliance Officer was a newly created position, responsible for both proactive and reactive Compliance matters, and so for the first time we had a designated individual with responsibility for incident response. Our CEO had, in my view, correctly identified that Compliance needed to have a seat at the executive table, and so Robin's appointment seemed to make perfect sense. Likewise, Nick's promotion (taking the CTO title that Robin had held for many years until that time) was well-deserved and seemed like a smart move.
Dear reader, I imagine you're probably thinking that all of these Management changes and decisions were fairly canny and reasonable. But with the benefit of hindsight, this is actually where things began to unravel.
Whereas (using the terminology from https://www.allthingsdistributed.com/2007/07/the_different_cto_roles.html) Robin operated primarily in a Technology Visionary and Operations Manager role, Nick has always been more of an External Facing Technologist. And so while Robin was relieved of his CTO duties in theory, this wasn't really what happened in practice. Due to his years of oversight of some aspects of our CA systems and operations, passing on those Operations Manager duties to one or more other people was neither quick nor simple. Consequently, Robin began to find it increasingly difficult to keep all of his plates spinning.
The other members of the Compliance department were very effective at assisting Robin to keep on top of the proactive compliance activities (writing policies, internal audits, liaising with external auditors, etc). However, despite the best of intentions, responsibility for incident response (which included permission to write and post incident reports on behalf of Sectigo) had now become concentrated on just one person, who gradually found himself unable to give it the attention it needed.
From the sidelines, I and other colleagues watched the decline in the timeliness of our incident responses, but, besides nagging Robin (with various other Management team members on CC) and offering to help, there didn't seem to be much that we could do. Our CEO is quick to deal with issues, but they have to be brought up at the weekly Senior Management meeting. Robin focused on trying to catch up on his duties as opposed to blowing the whistle at the senior table.
2019-07-04: Ryan opened this bug. I emailed Senior Management immediately to make sure they were aware. My email began “You're probably fed up of hearing me moan about this trend, but we simply MUST fix this”, and I requested that the matter be added to the agenda for the next face-to-face Management meeting, which happened to be scheduled for the following week.
2019-07-11: Incident response was discussed at the face-to-face Management meeting. Our CEO reiterated to Robin that it was his responsibility to ensure that our incident responses met Mozilla's expectations, and demanded that Robin make this his top priority. Robin responded by promising to do better and (in what I suppose was a partial admission that he was struggling with the workload) by requesting authorization to recruit to expand his (very small) Compliance team. Our CEO granted that request.
2019-11-18: Robin posted comment #18:
Our recruitment process continues. I'd love to expand on why that is taking so long, but I just can't in a public forum at least until the appointment is made.
With this recruitment drive, Robin was trying to achieve two things: to add manpower to our incident response activity, and to build for Sectigo a presence in Europe so that we could pursue eIDAS accreditation. With Brexit looming, our base in the UK was no longer going to be suitable for the latter. Some months earlier, Robin had identified a suitable candidate, Iñigo Barreira, who was keen to take on the role. For eIDAS purposes we needed to establish a legal presence in Spain, and it was agreed that this new entity would employ Iñigo. Unfortunately, establishing a legal presence in Spain took much, much longer than anticipated, which of course meant that Robin's main plan to address the manpower shortage for our incident response was severely delayed.
(In reply to Robin Alden from comment #9)
A leadership group drawn from other areas of the company (software development, product management, information technology)
will meet weekly with Compliance to ensure that development tasks related to compliance are making satisfactory progress,
that all external reporting requirements, such as Mozilla's guidance on incident report responses, are being met,
and that all necessary resources are available to meet all of our compliance obligations.
This was a great idea! However, AIUI this group managed to meet just once. Robin was supposed to be coordinating it, but it was just one more of the many screamingly urgent tasks demanding his attention, and so momentum was immediately lost.
2020-01-30: Iñigo was finally able to start working for Sectigo. Naturally it took time for him to get up to speed and start contributing, but at least we had finally made some progress on the manpower problem.
Meanwhile, I and other colleagues continued to watch our progress closely and remained concerned, until…
2020-07-21: I reviewed our open incident bugs once again, and I finally snapped. I spent most of the day pouring out my frustration in an email to Senior Management. I threw caution to the wind and wrote plainly that the current strategy for incident response was not working. For conciseness, I focused mainly just on the (many) ways we were failing to even meet the bare minimum responsiveness requirements of “in no circumstances should a question linger without a response for more than one week, even if the response is only to acknowledge the question and provide a later date when an answer will be delivered” and “You should also provide updates at least every week giving your progress”.
With this email, our CEO was persuaded that a fundamental shift in strategy was needed, and so he asked for proposals. I proposed that we “share the workload between an Incident Response Team that consists of more people than just Robin”, and everyone agreed. Now, this proposal may sound exactly like the “leadership group” that Robin mentioned in comment #9, but (and this is why I felt I needed to name specific individuals) there is in fact a crucial difference…
- List of steps your CA is taking to resolve the situation and ensure that such situation or incident will not be repeated in the future, accompanied with a binding timeline of when your CA expects to accomplish each of these remediation steps.
Responsibility for Sectigo's incident response is no longer concentrated on one overworked individual who has displayed a tendency, albeit with the best of intentions, to overpromise and underdeliver rather than ask for help.
Instead, responsibility is now shared between all the members of our new Incident Response Team, who represent all areas of our CA operations. This means that I and several other long-standing senior staff, whose potential had largely lain dormant as a result of management strategies, are now once again participating in our incident responses.
This sharing of responsibility for incident response across multiple Sectigo departments resembles what was done under Comodo Group's management, but it has a much more formal approach:
a. Regular meetings, the first of which occurred on 2020-07-23 and which have continued (and will continue) to occur on at least a weekly basis. These meetings: review progress on our responses for all open Sectigo incident bugs, consider how to respond to any questions that have been asked, define strategies for tackling any new bugs that have been filed, consider any matters for which we may need to self-report a new incident, ensure that no previously allocated tasks have been forgotten, and ensure that for each open bug somebody owns the task of posting an update that week.
b. Each Bugzilla bug has a parallel ticket in Sectigo's Jira system, so that the team can better track our everyday internal discussions and coordinate the preparation of our responses. We also have a shared tracking spreadsheet, for a simple one-page view of upcoming deadlines for posting updates and responding to questions on each bug plus a summary of known outstanding tasks for each bug that will need to be addressed before the bugs can be closed.
c. Each proposed incident response and substantive comment is peer-reviewed for correctness and clarity by at least one other team member.
d. Where we need to “Scan your corpus of certificates to look for others with the same issue”, the database queries and methodology are peer-reviewed, to ensure the correct scope and results.
e. The team member with the most relevant expertise usually takes primary responsibility for a bug and is marked as that bug's assignee in Bugzilla.
f. A new email distribution list - firstname.lastname@example.org - is now CC'd on each Bugzilla bug, to ensure that nobody on the team misses a message.
g. The team monitors activity on all Bugzilla bugs in the “CA Certificate Compliance” component and considers whether Sectigo may be affected by the same or similar issues.
h. Senior Management receives regular progress reports, and will most certainly hold us to account.
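Purely as an illustration of the weekly-deadline check described in (a) and (b), and not a description of Sectigo's actual tooling, the rule that no open bug should go more than one week without an update could be sketched in Python roughly as follows. The `OpenBug` record, its field names, and the bug numbers and owner names in the usage example are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Assumption: the one-week limit from the responsiveness guidance quoted above.
MAX_SILENCE = timedelta(days=7)


@dataclass
class OpenBug:
    bug_id: int        # hypothetical Bugzilla bug number
    owner: str         # hypothetical: who posts this week's update
    last_update: date  # date of the CA's most recent comment on the bug


def next_deadline(bug: OpenBug) -> date:
    """Latest date by which the next update should be posted."""
    return bug.last_update + MAX_SILENCE


def overdue(bugs: list[OpenBug], today: date) -> list[OpenBug]:
    """Bugs whose deadline has already passed, most overdue first."""
    late = [b for b in bugs if next_deadline(b) < today]
    return sorted(late, key=next_deadline)


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    sample = [
        OpenBug(101, "owner-a", date(2020, 7, 1)),
        OpenBug(102, "owner-b", date(2020, 7, 20)),
    ]
    for bug in overdue(sample, today=date(2020, 7, 23)):
        print(bug.bug_id, bug.owner, next_deadline(bug))
```

A check like this only flags silence; it says nothing about whether an update was substantive, which is what the peer review in (c) is for.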
The initial members of the Incident Response Team are all seasoned veterans whose names you will recognize: Robin Alden, Iñigo Barreira, Tim Callan, Nick France, Rich Smith, and myself. We have a mandate to call upon the services of other Sectigo staff as and when necessary.