Open Bug 1563579 Opened 1 year ago Updated 5 days ago

Sectigo: Failure to provide timely incident reports

Categories

(NSS :: CA Certificate Compliance, task)

Tracking

(Not tracked)

ASSIGNED

People

(Reporter: ryan.sleevi, Assigned: rob)

References

(Depends on 4 open bugs)

Details

(Whiteboard: [ca-compliance])

Robin,

Given the pattern in the following issues, I felt it best to spin this into a dedicated issue. Sectigo has begun demonstrating a worrying trend of not providing timely responses to incident reports, and this seriously undermines confidence in the operations of the CA.

The following incidents exemplify this:

Mozilla's policy regarding Responding to an Incident notes that:

in no circumstances should a question linger without a response for more than one week, even if the response is only to acknowledge the question and provide a later date when an answer will be delivered. You should also provide updates at least every week giving your progress, and confirm when the remediation steps have been completed - unless Mozilla representatives agree to a different schedule by setting a “Next Update” date in the “Whiteboard” field of the bug.

In light of this pattern, please treat this as an incident report. This will help the community understand what the causes of these delays are and what steps Sectigo is taking to prevent future delays in responses. If Sectigo has already made changes with respect to how it handles incidents, please detail those changes for the community, including explaining how they relate to the (potentially various) root causes.

Flags: needinfo?(Robin.Alden)

Ryan,
Thanks for splitting this into a separate report. We will follow up on this next week.

We have made progress on resolving an underlying cause of this issue. I will provide the requested incident response next week.

Although we did make progress last week on addressing the underlying cause, I do not yet have the incident report prepared. I will follow up early next week.

Ryan, I can't update this issue, but I think this should also depend on https://bugzilla.mozilla.org/show_bug.cgi?id=1567060, as the last update for that issue was an acknowledgement made over 50 days ago.

Depends on: 1567060

Thanks. As with Comment #3 on this issue, the update for Bug 1567060 was promised for "next week".

Depends on: 1575022

Robin, another week has elapsed since Comment #5, with no acknowledgement.

I want to emphasize that Sectigo needs to be treating this as the highest priority issue, and ensuring these issues are responded to in a timely fashion. If you're proposing deferred updates, then without an explicit acknowledgement of a new date, as reflected in the whiteboard status, continue to provide updates.

In no situation should the time period between an acknowledgement and update be longer than a week, the absolute upper bound. Anything greater jeopardizes the continued and future trust of Sectigo issued certificates.

Sectigo folks: I am deeply concerned about the lack of response here. I'm hoping that there's a simple explanation, such as that Robin may be out of the office this week, or that messages are getting marked as bug spam.

Please confirm that Sectigo will be responding with an incident report by EOD of Friday, 20 September, GMT, and that Sectigo will ensure weekly updates on all associated Sectigo bugs. Considering that CAs have been distrusted over a lack of acknowledgements, this is gravely concerning.

Flags: needinfo?(rob)
Flags: needinfo?(rich)

Ryan, Thank you for your reminder.
I appreciate your considerable patience on the matter so far.
I will respond on this bug by the end of this week, i.e. by 20-sep-2019, and no less than weekly thereafter.

Flags: needinfo?(Robin.Alden)
  1. How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.

2019-07-04 Ryan opened bug 1563579

  2. A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.

I have not analyzed all of the times that we have not provided a response to Mozilla within the expected one week.
Ryan provided these examples:

  3. Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem. A statement that you have will be considered a pledge to the community; a statement that you have not requires an explanation.

Sectigo commits to respond to all open incident reports in Bugzilla according to Mozilla's guidance.
That guidance (from https://wiki.mozilla.org/CA/Responding_To_An_Incident) is "Once the report is posted, you should respond promptly to questions that are asked, and in no circumstances should a question linger without a response for more than one week, even if the response is only to acknowledge the question and provide a later date when an answer will be delivered. You should also provide updates at least every week giving your progress, and confirm when the remediation steps have been completed - unless Mozilla representatives agree to a different schedule by setting a “Next Update” date in the “Whiteboard” field of the bug. Such updates should be posted to the m.d.s.p. thread, if there is one, and the Bugzilla bug. The bug will be closed when remediation is completed."

  4. A summary of the problematic certificates. For each problem: number of certs, and the date the first and last certs with that problem were issued.

NA

  5. The complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem.

NA

  6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

The underlying cause of this issue (bug 1563579) is that we found ourselves
under-resourced in terms of having people with the particular skillset to
provide the detailed reporting required for incident responses for Mozilla's
and wider public consumption.

We fully understand the benefit to the community of having these incidents
acted upon and the actions reported, and the essential nature of our response
to them in terms of both acts and reports if we are to retain the trust of the
community.

I further appreciate that, chiefly by my own tardy responses in bugzilla, we
have given the impression that little of anything is happening in response to
reported incidents but this is not the case. We have expended considerable
resource from several parts of our organization to deal with reported
incidents and will continue to do so until we have brought the incident
responses to a conclusion.

  7. List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.

If Sectigo has already made changes with respect to how it handles incidents, please detail those changes for the community, including explaining how they relate to the (potentially various) root causes.

Immediately:
Our senior management has reiterated that our compliance with Mozilla's and other root store operators' policies
is an essential pre-requisite for our success and that where additional resources are required to achieve
our compliance, whether in action or in reporting, those resources will be provided.

Immediately:
A leadership group drawn from other areas of the company (software development, product management, information technology)
will meet weekly with Compliance to ensure that development tasks related to compliance are making satisfactory progress,
that all external reporting requirements, such as Mozilla's guidance on incident report responses, are being met,
and that all necessary resources are available to meet all of our compliance obligations.

Ongoing:
We are recruiting to expand our compliance team with a particular focus on identifying personnel with directly relevant experience in Web PKI compliance activities.
This will further enable us to more fully and actively engage (beyond the minimum requirements) with bugzilla and m.d.s.p.

Flags: needinfo?(rob)
Flags: needinfo?(rich)

This bug was a topic of conversation at our weekly management meeting and the commitment to comply with Mozilla's policy and other requirements was re-iterated.

The initial meeting of our leadership team is set for early next week and the meeting will recur on a weekly basis while any compliance incidents remain open.

This week we had the first of what will be regular dedicated meetings of a leadership group that will track the development tasks relating to open incident responses, track that we are meeting Mozilla's guidance on incident report responses, and that the necessary resources are available to meet all of our compliance obligations.

We have done further work towards recruiting further suitably skilled staff to assist with our compliance obligations.

Flags: needinfo?(Robin.Alden)

This week we reviewed and discussed the level of development effort that is being employed in compliance-related tasks.

The next regular meeting of the leadership group to track open incident response progress has been scheduled.

We are in HR negotiation with a highly skilled potential recruit for the compliance and CA management team.

Robin: Thanks for making sure to provide weekly updates. I realize we haven't acknowledged them explicitly, in part to see how things are progressing without those explicit acknowledgements. I appreciate that you're making sure to update weekly on progress.

That said, as I shared in the related bugs, I want to encourage Sectigo to be clearer and more detailed in its reports, to help build confidence that it's not just weekly updates, but that meaningful progress is being made. Obviously, sharing who the candidate is would definitely be on the /oversharing/ side, but as it relates to some of the technical controls and investigation on the dependent bugs, there's a lot of opportunity to actually dive in and explain.

While we all work in PKI and are subject matter experts in our respective fields, it might help to adopt a position of "assume they don't know what we mean", and try to add detail about the design and/or implementation. Alternatively, you might think of it like describing an algorithm: how could another CA implement a similar process, and get similar results, to Sectigo, which will hopefully be the "right" results.

For example, for this issue, sharing how you structure your meeting, what are the challenges you face, and how you're overcoming these challenges. Comment #9 said y'all were underresourced for the experts, but presumably, Comment #10 and Comment #11 are the meetings that involve the folks with the skills, making the explicit time to discuss and address the issues. Comment #9 provided assurances that significant resources were expended despite the lack of communication, but we're not really getting a picture of what that was then, or what that is now. Several bugs, two months later, don't even feel like there's a good preventative mitigation in place.

Helping understand where the challenges are helps browsers build confidence that things are going as expected. It also provides opportunities to help, to the best of their ability, to find solutions, or to allow the community to do more research into options.

Basically, what's missing here is the narrative I'd previously discussed. Comment #12 feels a bit like "Last week, we met in Rivendell. There was some debate about next steps" which... like, is both accurate and completely overlooks the critical discussions ;)

(In reply to Ryan Sleevi from comment #13)

.. as I shared in the related bugs, I want to encourage Sectigo to be clearer and more detailed in its reports, to help build confidence that it's not just weekly updates, but that meaningful progress is being made. Obviously, sharing who the candidate is would definitely be on the /oversharing/ side, but as it relates to some of the technical controls and investigation on the dependent bugs, there's a lot of opportunity to actually dive in and explain.

I appreciate the drivers for your comments there and we are making a sincere effort to usefully increase the level of detail in communication.

While we all work in PKI and are subject matter experts in our respective fields, it might help to adopt a position of "assume they don't know what we mean", and try to add detail about the design and/or implementation. Alternatively, you might think of it like describing an algorithm: how could another CA implement a similar process, and get similar results, to Sectigo, which will hopefully be the "right" results.

That's a useful pointer. The JoI policy checker, for example, we can share an outline of the pseudo-code for that. I'll ask for that to be prepared.
As it currently exists the source code itself is not going to be useful to anyone else, but the pseudo-code can be useful.
Having that alongside the ruleset would allow other CAs to implement the same checks. Or of course it could be incorporated into one of the existing lint-checkers.

For example, for this issue, sharing how you structure your meeting, what are the challenges you face, and how you're overcoming these challenges. Comment #9 said y'all were underresourced for the experts, but presumably, Comment #10 and Comment #11 are the meetings that involve the folks with the skills, making the explicit time to discuss and address the issues. Comment #9 provided assurances that significant resources were expended despite the lack of communication, but we're not really getting a picture of what that was then, or what that is now. Several bugs, two months later, don't even feel like there's a good preventative mitigation in place.

This meeting is structured around incident reports. All of the incident reports we're currently examining are the public Mozilla ones.
It really helps compliance and development to talk through the tasks and the responses so far and to get direction from a wider team of clueful people in the company and it also helps to avoid unnecessary rat-holing and headlight-staring.

For the three open incident bugs (so excluding this one) there is preventative mitigation in place.
The mitigation can always be improved.

We have our adversarial second review process which is proving far more effective than before. By 'second review' there I'm really referring to the 'Final Cross-Correlation and Due Diligence' step outlined in section 11.3 of the EV guidelines. We now separate that second review team in the same way that you'd separate an IT security team from developers.

We also have improved automated checks in place to identify and reject the most egregious examples of incorrect address data, such as what we have come to call 'meta-data' (although I think that term is probably misused here) by which we mean data that consists of dots, dashes, spaces, and other punctuation and especially sequences or repetitions of those characters instead of city or state names or other address fields.
Those same automated checks also aim to identify frequently used words and phrases that identify that these fields do not contain real data, such as '(none)', 'NA', 'Not Applicable', and we use that same mechanism to catch the common default subject phrases ('Some-State', 'Default City', etc).
Those checks may seem unnecessary since we have the required initial verification checks and second review process in place, but those characters and phrases are exactly the ones that become so familiar to our human validators that they can become practically invisible to them.
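As a rough illustration of the kind of automated check described above, the following is a minimal sketch in Python. It is not Sectigo's implementation: the function names, the phrase list (seeded only with the examples quoted above) and the choice to treat an empty value as a placeholder are all assumptions made for the example.

```python
import re
import string
from typing import Dict, List

# Hypothetical phrase list, seeded only with the examples quoted in the comment above.
PLACEHOLDER_PHRASES = {"(none)", "na", "not applicable", "some-state", "default city"}

# Matches values made up solely of punctuation and whitespace (dots, dashes, spaces, ...).
# Assumption: an empty value is also treated as a placeholder.
PUNCTUATION_ONLY = re.compile(r"^[\s" + re.escape(string.punctuation) + r"]*$")


def is_placeholder(value: str) -> bool:
    """Return True if a subject field value looks like filler rather than real data."""
    if PUNCTUATION_ONLY.match(value):
        return True
    # Case-insensitive comparison against the known filler phrases.
    return value.strip().casefold() in PLACEHOLDER_PHRASES


def flag_fields(subject: Dict[str, str]) -> List[str]:
    """Return the names of the subject fields whose values look like placeholders."""
    return [name for name, value in subject.items() if is_placeholder(value)]
```

A request whose locality is 'Default City' and whose state is '..' would have both fields flagged, while a real organization name passes. A check like this only catches the obvious filler values; it says nothing about whether a plausible-looking value is actually correct.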

The further automated checks of subject address fields that we are still working on are to further improve the accuracy and further reduce the scope for error by human validators. These have yielded some improvements but remain a work in progress, and something on which we are expending R&D effort. I can feel you willing me to go into more detail about the R&D avenues being explored but in this I'm going to fail you, for today at least. I think there are limits on how much of this R&D work is reasonably responsive to the incidents we're dealing with, although we probably aren't quite there yet and I absolutely realize we can't close out at least Bug 1575022 until the identification and remediation of prior problematic issuance is completed.

The more we automate the subject detail checks, the less we rely on our human validators and that is a win both because it serves to reduce ongoing costs and because it decreases error rates.

Helping understand where the challenges are helps browsers build confidence that things are going as expected. It also provides opportunities to help, to the best of their ability, to find solutions, or to allow the community to do more research into options.

Basically, what's missing here is the narrative I'd previously discussed. Comment #12 feels a bit like "Last week, we met in Rivendell. There was some debate about next steps" which... like, is both accurate and completely overlooks the critical discussions ;)

I've never been a Tolkien reader. I had friends at school who read him at 9 and 10, but I was more into running around with a stick. My running days are now behind me (unless chased by something large) and I left my stick somewhere.
But I take your point.

We have been looking some more at our internal auditing of EV SSL subject information, i.e. the 'Regular Self Audits' discussed at 17.5 in the EVGLs.

One of the lessons learned as a result of some of the things we've missed which we should have caught, which has been already applied to our internal auditing, and which we're working toward applying to our systems is that when we have human validators or auditors verifying information there is still a danger of missing the wood for the trees. E.g. by looking to verify the requirement of Physical Existence and viewing that set of fields as a whole, we have in some cases missed the fact that the Locality was deformed (mis-spelled for instance).

When examining whether a certificate request consistently matches a set of data from one of the information sources we have to compare the data in multiple phases. One phase is whether the certificate request data is complete and entirely internally consistent.
E.g. Whether the certificate request data (as has been already checked by at least one person) is complete in every respect. Are all of the expected fields present (mostly automatable) and do they form a single consistent address that conforms with the known form of addresses for that country and includes only accurate city and state names, and is that city really in that state, and does that zip code refer to that city?
Another phase of comparison is whether the certificate request data matches our Q?IS data.
E.g. Are these the same entity at the same address? Not a different branch office. Does the Q?IS data have exactly the same field values? - It may not, in which case is the Q?IS data summarized or abbreviated in such a way as the address fields no longer exactly match but we may still see that these are exactly the same entity at exactly the same location. Some Q?IS data can use outdated address forms, but the addresses can still be recognized to refer to the same place. GB is one place this happens a lot because the stateOrProvince boundaries move over time and the postal system develops over time (historic county names that no longer exist but which are still used, introduction of post codes, dropping of stateOrProvince from postal addresses) - but perhaps that is local observer bias since I happen to be in GB and other countries are equally as fluid.

Robin: thank you for these thoughtful updates. They are very helpful in making the case that Sectigo is learning and improving.

For the lessons in comment #15, can you explain how they are being applied to your internal auditing?

(In reply to Wayne Thayer [:wayne] from comment #16)
Wayne,
As Robin mentioned, traditionally we’ve approached validation by looking at holistic requirements such as verify Legal Existence, or verify Physical Existence, and in doing that our focus has also been on the whole of the document as well, again, sometimes overlooking the individual pieces (individual specific fields encoded in the certificate).
So as an example, we might have verified that Acme, Inc. is a legally registered company in the state of Delaware, and maintains a physical location in New York City. We’ve collected the correct documentation from the Delaware Secretary of State to verify Legal Existence and obtained documentation from a QIIS to verify the Physical Existence, but in looking at these things from a requirements/documents basis we may have missed that the QIIS has postal code 10034, but the customer, when formulating the request has fat-fingered the postal code and put 10334, or the person doing the final 2nd approval, cross-correlation and due diligence check may have missed that the person who performed the initial validation had a typo in the corporation registration number (subject:serialNumber) when transferring it from the Delaware Secretary of State data.
In one sense, we’ve validated Acme, Inc. correctly, in that we’ve collected and verified the documentation to verify the various requirements within the EV Guidelines, Acme, Inc. is a legal entity and we know and have verified the facts about the organization, but we may have fallen short in making certain that only that verified data is what made it into the final certificate issuance because both our training and our systems were placing the emphasis a level higher than it should have been.
So the two main things we’re doing are (1) taking a more adversarial outlook in our internal audits, and (2) placing an emphasis, both for our validation team and our internal audit team, on drilling down to the specific certificate fields and verifying them one by one against the documentation.
We’ve revamped our training materials both for validation and auditing to put a much stronger emphasis on verification of the individual fields, and will also look for ways that we can visually place more emphasis on the individual fields w/in our validation platform.
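The field-by-field verification described above can be sketched as a strict comparison between the data in the certificate request and the data taken from the verification source. This is only an illustrative sketch in Python: the function name, the field names and the street address are hypothetical, and nothing here reflects how Sectigo's validation platform actually works.

```python
from typing import Dict, List, Tuple


def compare_fields(
    requested: Dict[str, str], verified: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Return (field, requested value, verified value) for every field that differs.

    A field missing from one side is reported with an empty string on that side.
    """
    mismatches = []
    for name in sorted(set(requested) | set(verified)):
        req = requested.get(name, "")
        ver = verified.get(name, "")
        if req != ver:
            mismatches.append((name, req, ver))
    return mismatches


# The Acme, Inc. example from the comment above: the source says 10034,
# but the request carries the mistyped 10334.
request = {"organizationName": "Acme, Inc.", "localityName": "New York", "postalCode": "10334"}
source = {"organizationName": "Acme, Inc.", "localityName": "New York", "postalCode": "10034"}
print(compare_fields(request, source))
```

An exact-match comparison like this would surface the mistyped postal code even when the entity as a whole has been validated correctly. Since source data can be abbreviated or use outdated address forms, a mismatch is better treated as a prompt for human review than as an automatic rejection.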

Our recruitment process continues. I'd love to expand on why that is taking so long, but I just can't in a public forum at least until the appointment is made.

  • In Bug 1575022, Comment #14 promised an update by 2019-11-20 and a list by 2019-11-22, but no update followed until Comment #15 on 2019-12-09.
  • In Bug 1575022, a question regarding this delay was asked in Comment #16 on 2019-12-09, and the response in Comment #17, which did not answer this question, was not posted until 2020-01-22.
  • In Bug 1593776, an incident report was provided in Comment #5 on 2019-12-09, omitting the list of impacted certificates (Question #4 and Question #5) until Comment #6, on 2020-01-22.
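To make the size of these gaps concrete against the one-week expectation, the dates in the list above can be compared directly. The following is a small sketch in Python; the helper names and the 7-day default are assumptions for illustration, and it measures calendar days only.

```python
from datetime import date


def days_between(start: str, end: str) -> int:
    """Number of calendar days from one ISO date (YYYY-MM-DD) to another."""
    return (date.fromisoformat(end) - date.fromisoformat(start)).days


def exceeds_weekly_bound(start: str, end: str, limit_days: int = 7) -> bool:
    """True if the gap is longer than the one-week upper bound."""
    return days_between(start, end) > limit_days


# Promised update (2019-11-20) to actual update (2019-12-09): 19 days.
print(days_between("2019-11-20", "2019-12-09"))
# Question or partial report (2019-12-09) to follow-up (2020-01-22): 44 days.
print(days_between("2019-12-09", "2020-01-22"))
```

Both gaps are well beyond seven days, which is the point the list is making.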

Sectigo has again failed to provide timely reports

Ben's recent sweep of bugs revealed three additional bugs with no update from Sectigo for 6 months, despite a commitment to provide a timely incident report:

Probably worth adding

  • Bug 1639805 (half an update after ~12 days and a poke from Ryan); and
  • Bug 1639804 (radio silence for the 12 days since reporting)

to the list.

Depends on: 1639805, 1639804
Depends on: 1645686
Depends on: 1648717

Hi Robin,
I am looking for an update on this bug.
Thanks,
Ben

Assignee: Robin.Alden → nick
Depends on: 1620561

(In reply to Ben Wilson from comment #23)

Hi Robin,
I am looking for an update on this bug.
Thanks,
Ben

Hi Ben. You've probably noticed a flurry of activity on Sectigo's open incident bugs over the past week or two. Next week we will post an update to sections 6 and 7 of the incident report that Robin posted in comment #9, which will: get right to the heart of what went wrong over the past couple of years with the timeliness of our incident reports, lay bare why previous efforts to improve were ultimately ineffective, and explain what we have now begun to do differently to ensure both swift resolutions to the current backlog of Sectigo incident bugs and timely responses to new incidents as and when they occur.

Assignee: nick → rob

(In reply to Rob Stradling from comment #24)

Next week we will post an update to sections 6 and 7 of the incident report...

Ben, Ryan,

Although Robin eventually offered an incident report in comment #9 and responded to feedback during the two months that followed, it's clear that that effort fizzled out. With this update, we hope to clearly demonstrate that Sectigo fully understands the root causes of our lack of timeliness in incident responses and that we have put in place a robust set of measures to ensure that from now onwards our incident response efforts will meet expectations. I'm going to lay this out as a mostly personal account interspersed with a timeline of pertinent events. I feel it's necessary to name and focus on the actions (and inactions) of specific individuals, which I realize falls short of the goal of a blameless postmortem, but I don't see any other way to effectively communicate the root causes, explain why it has taken Sectigo so long to identify them fully, and to convince you that we have now mitigated them effectively.

Many of the senior CA operations personnel at Sectigo have been with the company for a very long time, both before and since the carve out from Comodo Group. Cast your minds back a few years and I hope you'll recall and agree that this same team was doing a reasonable job of incident reporting, both in terms of content and timeliness. Under Comodo Group's management, our approach to incident response was fairly informal and “bottom-up”, due in large part to what was a rather flat management structure: no single person had overall responsibility for incident response, so in practice it tended to rely on me, Robin, and a few other folks keeping a close eye on m.d.s.p / Bugzilla and then proactively taking appropriate action as and when necessary, in addition to doing our regular jobs.

2017-11-01: As Sectigo (then Comodo CA) was carved out of Comodo Group, we had to extricate our systems and processes from those of the original company, which also was a going-forward concern. That meant disentangling twenty years' worth of utterly interwoven software, data, and employee responsibilities. This task proved massive and elusive and presented challenges we didn't foresee when we set out to separate the two businesses. In the early days of the new company, a small set of experienced individuals simply did what they knew needed to be done to keep operations running, but we all understood that a more systematic, codified approach was necessary in the long term, for the company to scale. Our new CEO quickly set about building a far more structured Management team, with clear allocation of roles and responsibilities, plenty of ambition to grow the company in terms of headcount and revenue, and with a plan to transform each department into a more effective and streamlined operation.

2018-11-01: One year on, our rebrand as Sectigo was announced. With the goal of building and maintaining a positive image for this new brand, staff members who had previously represented the company freely on social media and industry forums were instructed to adhere to certain disciplines. One of those disciplines was that all communication to external parties now needed to be reviewed and sanctioned by the Marketing department. Although this stance was subsequently refined and somewhat softened, the message was clear that Senior Management wanted a controlled approach to “top-down” external communication, which of course included incident response.

2019-01-15: Continuing the same “top-down”, structured, and growth-minded approach, our CEO Bill Holtz wrote to all Sectigo employees (portions redacted) to announce two Management team changes that had occurred:
“Compliance is becoming more involved and complicated and costly for those who fail in the mission. As we move ahead with our planned growth which will include acquisitions, the topic grows in importance and complexity. Robin Alden will now be the Chief Compliance Officer for the company reporting to me. Robin is well versed in compliance requirements and challenges and works with the browsers and the CA/B forum. He is well suited for this role.
Nick France is promoted to SSL CTO reporting to me. SSL continues to be a large revenue contributor for us but is facing challenges. Customers are becoming more inquisitive and are looking more deeply at security companies before committing to a purchase. Speed of engagement with customers and timely resolution of technical and contractual issues are paramount to our success. Nick will serve the company well in this role.”

Chief Compliance Officer was a newly created position, responsible for both proactive and reactive Compliance matters, and so for the first time we had a designated individual with responsibility for incident response. Our CEO had, in my view, correctly identified that Compliance needed to have a seat at the executive table, and so Robin's appointment seemed to make perfect sense. Likewise, Nick's promotion (taking the CTO title that Robin had held for many years until that time) was well-deserved and seemed like a smart move.

Dear reader, I imagine you're probably thinking that all of these Management changes and decisions were fairly canny and reasonable. But with the benefit of hindsight, this is actually where things began to unravel.

Whereas (using the terminology from https://www.allthingsdistributed.com/2007/07/the_different_cto_roles.html) Robin operated primarily in a Technology Visionary and Operations Manager role, Nick has always been more of an External Facing Technologist. And so while Robin was relieved of his CTO duties in theory, this wasn't really what happened in practice. Due to his years of oversight of some aspects of our CA systems and operations, passing on those Operations Manager duties to one or more other people was neither quick nor simple. Consequently, Robin began to find it increasingly difficult to keep all of his plates spinning.

The other members of the Compliance department were very effective at assisting Robin to keep on top of the proactive compliance activities (writing policies, internal audits, liaising with external auditors, etc). However, despite the best of intentions, responsibility for incident response (which included permission to write and post incident reports on behalf of Sectigo) had now become concentrated on just one person, who gradually found himself unable to give it the attention it needed.

From the sidelines, I and other colleagues watched the decline in the timeliness of our incident responses, but, besides nagging Robin (with various other Management team members on CC) and offering to help, there didn't seem to be much that we could do. Our CEO is quick to deal with issues, but they have to be brought up at the weekly Senior Management meeting. Robin focused on trying to catch up on his duties as opposed to blowing the whistle at the senior table.

2019-07-04: Ryan opened this bug. I emailed Senior Management immediately to make sure they were aware. My email began “You're probably fed up of hearing me moan about this trend, but we simply MUST fix this”, and I requested that the matter be added to the agenda for the next face-to-face Management meeting, which happened to be scheduled for the following week.

2019-07-11: Incident response was discussed at the face-to-face Management meeting. Our CEO reiterated to Robin that it was his responsibility to ensure that our incident responses met Mozilla's expectations, and demanded that Robin make this his top priority. Robin responded by promising to do better and (in what I suppose was a partial admission that he was struggling with the workload) by requesting authorization to recruit to expand his (very small) Compliance team. Our CEO granted that request.

2019-11-18: Robin posted comment #18:

Our recruitment process continues. I'd love to expand on why that is taking so long, but I just can't in a public forum at least until the appointment is made.

With this recruitment drive, Robin was trying to achieve two things: to add manpower to our incident response activity, and to build for Sectigo a presence in Europe so that we could pursue eIDAS accreditation. With Brexit looming, our base in the UK was no longer going to be suitable for the latter. Some months earlier, Robin had identified a suitable candidate, Iñigo Barreira, who was keen to take on the role. For eIDAS purposes we needed to establish a legal presence in Spain, and it was agreed that this new entity would employ Iñigo. Unfortunately, establishing a legal presence in Spain took much, much longer than anticipated, which of course meant that Robin's main plan to address the manpower shortage for our incident response was severely delayed.

(In reply to Robin Alden from comment #9)

Immediately:
A leadership group drawn from other areas of the company (software development, product management, information technology) will meet weekly with Compliance to ensure that development tasks related to compliance are making satisfactory progress, that all external reporting requirements, such as Mozilla's guidance on incident report responses, are being met, and that all necessary resources are available to meet all of our compliance obligations.

This was a great idea! However, AIUI this group managed to meet just once. Robin was supposed to be coordinating it, but it was just one more of the many screamingly urgent tasks demanding his attention, and so momentum was immediately lost.

2020-01-30: Iñigo was finally able to start working for Sectigo. Naturally it took time for him to get up to speed and start contributing, but at least we had finally made some progress on the manpower problem.

Meanwhile, I and other colleagues continued to watch our progress closely and remained concerned, until…

2020-07-21: I reviewed our open incident bugs once again, and I finally snapped. I spent most of the day pouring out my frustration in an email to Senior Management. I threw caution to the wind and wrote plainly that the current strategy for incident response was not working. For conciseness, I focused mainly just on the (many) ways we were failing to even meet the bare minimum responsiveness requirements of “in no circumstances should a question linger without a response for more than one week, even if the response is only to acknowledge the question and provide a later date when an answer will be delivered” and “You should also provide updates at least every week giving your progress”.

With this email, our CEO was persuaded that a fundamental shift in strategy was needed, and so he asked for proposals. I proposed that we “share the workload between an Incident Response Team that consists of more people than just Robin”, and everyone agreed. Now, this proposal may sound exactly like the “leadership group” that Robin mentioned in comment #9, but (and this is why I felt I needed to name specific individuals) there is in fact a crucial difference…

  1. List of steps your CA is taking to resolve the situation and ensure that such situation or incident will not be repeated in the future, accompanied with a binding timeline of when your CA expects to accomplish each of these remediation steps.

Responsibility for Sectigo's incident response is no longer concentrated on one, overworked individual who has displayed a tendency, albeit with the best of intentions, to overpromise and underdeliver, rather than ask for help.

Instead, responsibility is now shared between all the members of our new Incident Response Team, who represent all areas of our CA operations. This means that I and several other long-standing senior staff, whose potential had largely lain dormant as a result of management strategies, are now once again participating in our incident responses.

This sharing of responsibility for incident response across multiple Sectigo departments resembles what was done under Comodo Group's management, but it has a much more formal approach:

a. Regular meetings, the first of which occurred on 2020-07-23 and which have continued (and will continue) to occur on at least a weekly basis. These meetings: review progress on our responses for all open Sectigo incident bugs, consider how to respond to any questions that have been asked, define strategies for tackling any new bugs that have been filed, consider any matters for which we may need to self-report a new incident, ensure that no previously allocated tasks have been forgotten, and ensure that for each open bug somebody owns the task of posting an update that week.
b. Each Bugzilla bug has a parallel ticket in Sectigo's Jira system, so that the team can better track our everyday internal discussions and coordinate the preparation of our responses. We also have a shared tracking spreadsheet, for a simple one-page view of upcoming deadlines for posting updates and responding to questions on each bug plus a summary of known outstanding tasks for each bug that will need to be addressed before the bugs can be closed.
c. Each proposed incident response and substantive comment is peer-reviewed for correctness and clarity by at least one other team member.
d. Where we need to “Scan your corpus of certificates to look for others with the same issue”, the database queries and methodology are peer-reviewed, to ensure the correct scope and results.
e. The team member with the most relevant expertise usually takes primary responsibility for a bug and is marked as that bug's assignee in Bugzilla.
f. A new email distribution list - incident-response@sectigo.com - is now CC'd on each Bugzilla bug, to ensure that nobody on the team misses a message.
g. The team monitors activity on all Bugzilla bugs in the “CA Certificate Compliance” component and considers whether Sectigo may be affected by the same or similar issues.
h. Senior Management receives regular progress reports, and will most certainly hold us to account.
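
As a purely illustrative sketch of the deadline tracking described in points (a) and (b), and not a description of Sectigo's actual Jira or spreadsheet tooling, the seven-day update rule could be checked along these lines. The `OpenBug` record, the `updates_due` helper, the `warn_days` parameter, and the bug IDs and dates in the usage note are all hypothetical, introduced only for this example.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Mozilla's guidance: no question should linger, and no bug should go
# without a progress update, for more than one week.
MAX_SILENCE = timedelta(days=7)


@dataclass
class OpenBug:
    bug_id: int          # hypothetical identifier for an open incident bug
    owner: str           # team member who owns this week's update
    last_update: date    # date of the CA's most recent comment on the bug


def updates_due(bugs, today, warn_days=2):
    """Return (bug, deadline) pairs whose next update is overdue or due
    within `warn_days` days of `today`, soonest deadline first."""
    due = []
    for bug in bugs:
        deadline = bug.last_update + MAX_SILENCE
        if deadline - today <= timedelta(days=warn_days):
            due.append((bug, deadline))
    return sorted(due, key=lambda pair: pair[1])
```

For instance, calling `updates_due` with `today=date(2020, 8, 7)` on three made-up bugs last updated on 2020-08-01, 2020-08-06 and 2020-07-28 would flag the third (already overdue) and the first (due the next day), in that order, and leave the second alone.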

The initial members of the Incident Response Team are all seasoned veterans whose names you will recognize: Robin Alden, Iñigo Barreira, Tim Callan, Nick France, Rich Smith, and myself. We have a mandate to call upon the services of other Sectigo staff as and when necessary.

Flags: needinfo?(Robin.Alden)
QA Contact: wthayer → bwilson

We have no further update.

Rob, Nick, et al: Bug 1648717 is an example of a bug that, following Comment #25, still shows Sectigo does not have adequate controls in place for timely updates. While I appreciate the broad spectrum of folks empowered to respond, I am greatly concerned that "ownership by all leads to ownership by none"; here's a bug where an update for a remediation is planned, and nearly two weeks elapse without even an update on the update (which itself had to be prodded).

I think it's useful to understand the people now involved, but I'm concerned about the steps being taken to ensure timely updates. Why did your "step b" fail? This is not acceptable behaviour for a publicly trusted CA, and while it may seem minor, I think this bug tracks that there is a clear systemic failure across the board here.

Flags: needinfo?(rob)
Flags: needinfo?(nick)

See also Bug 1620561

This is a new process, and we are QAing and fine-tuning it at the same time that we use it in the real world. In this instance a combination of user error and PTO led to a few gaps in the seven-day window for updates on active bugs. In response we have upped the cadence of our working meeting to twice per week and implemented a “buddy system” with a pair of co-owners for each bug.

Flags: needinfo?(rob)
Flags: needinfo?(nick)

We have no further update.

We have no further update.

We have no further update.

We have no further update.
