DigiCert: Late incident report for bug 1925106
Categories
(CA Program :: CA Certificate Compliance, task)
Tracking
(Not tracked)
People
(Reporter: tim.hollebeek, Assigned: dcbugzillaresponse)
Details
(Whiteboard: [ca-compliance] [disclosure-failure])
Incident Report
Summary
On bug https://bugzilla.mozilla.org/show_bug.cgi?id=1925106 , we failed to post the full incident report within the time frame promised and within the two weeks required by Mozilla. The deadline posted in the initial report had the correct date, but an incorrect date was introduced during a tracker re-organization / consolidation. The root cause was an incorrect counting in days when updating an internal tracker.
Impact
The community was not provided with the full report within the timeframe expected by the Mozilla community.
Timeline
All times are UTC.
2024-10-16 19:56: DigiCert files bug (https://bugzilla.mozilla.org/show_bug.cgi?id=1925106) with a preliminary incident report.
2024-10-31 19:42: Multiple incident reporting trackers consolidated; full report deadline calculated incorrectly.
2024-11-01 1:33: Ben acknowledges this is a question of first impression for CCADB and is being discussed both on the bug and with CCADB.
2024-11-06 4:40: Wayne provides a historical overview of the issue.
2024-11-08 21:36: Tim provides a full incident report.
Root Cause Analysis
The root cause was two-fold:
- After Jeremy left, DigiCert temporarily separated his team into respective components with new managers. There was some confusion as to who should be tracking the timelines, and status started getting tracked separately in multiple locations. When this started causing confusion and trackers were consolidated, the full report deadline was calculated incorrectly. We have re-established our compliance program manager position and the individual tracking these timelines under our new Chief Trust Officer, who we recently hired.
- One of our internal incident trackers wasn’t explicitly tracking full report deadlines. When it was updated to include them, the date was calculated erroneously and was not based on the time of the original email from Sectigo.
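For illustration only, the date arithmetic behind the missed deadline can be sketched as follows. This is a minimal sketch, not DigiCert's tracker logic: it assumes the two-week window is counted from the timestamp of the preliminary report in the timeline above, and the names `FULL_REPORT_WINDOW` and `full_report_deadline` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the two-week window runs from the preliminary report's timestamp.
FULL_REPORT_WINDOW = timedelta(days=14)


def full_report_deadline(preliminary_filed: datetime) -> datetime:
    """Return the latest time the full report may be posted."""
    return preliminary_filed + FULL_REPORT_WINDOW


# Timestamps taken from the timeline (UTC).
preliminary = datetime(2024, 10, 16, 19, 56, tzinfo=timezone.utc)
full_report = datetime(2024, 11, 8, 21, 36, tzinfo=timezone.utc)

deadline = full_report_deadline(preliminary)
overdue = full_report - deadline

print(deadline.isoformat())  # 2024-10-30T19:56:00+00:00
print(overdue)               # 9 days, 1:40:00
```

Under that assumption the computed deadline falls on October 30, and the full report arrived a little over nine days after it.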
Lessons Learned
What went well
We filed the preliminary report right away and promptly raised the issue to CCADB.
What didn't go well
We didn’t file the full incident report within the appropriate timeline.
Action Items
Action Item | Kind | Due Date |
---|---|---|
Make sure all program management is handled by a single compliance program manager | Prevent | Already completed |
Track full report dates in the integrated project tracker | Prevent | Already completed |
Appendix
Details of affected certificates
None
Reporter | ||
Comment 1•8 months ago
We will be updating bugs on Mondays over the holidays.
Comment 2•8 months ago
(In reply to Tim Hollebeek from comment #0)
One of our internal incident trackers wasn’t explicitly tracking full report deadlines. When it was updated to include them, the date was calculated erroneously and was not based on the time of the original email from Sectigo.
What the root cause analysis in this incident report does not mention or account for is that on October 16 in bug 1925106 comment 0 you stated you would have the full incident report by October 30 (which fact was mentioned in bug 1925106 comment 14). This strongly suggests that as of October 16 the author of that post was aware of the October 30 deadline.
While your analysis discusses the tracker problem and offers action items to address it, it seems that an additional procedural failure occurred here. At least one team member who is intimately involved in the Bugzilla process at DigiCert was aware of this deadline, and yet DigiCert still missed it by more than a week. One would hope in a case like this that the individual(s) with this knowledge would bring up the need to post a report within the specified timeframe, regardless of what the spreadsheet said.
I believe your root cause analysis should look into why that knowledge didn’t translate into action. I believe you should look for action items to deal with that failure as well.
Reporter | ||
Comment 3•8 months ago
As the individual in question, I thank you for your kind words about your expectations for my memory, but like the rest of us, I'm human.
There are very good reasons why DigiCert tries to avoid relying on human memory for these sorts of things, and one of them is because it doesn't work reliably, as shown here.
We actually discussed including more details about all the things going on at the time, including CAB Forum and other events, but we left them out because they sounded like excuses to us. Yes, we're busy. That's why we have managers and trackers. Those are what failed.
Reporter | ||
Comment 4•8 months ago
Happy holidays!
Reporter | ||
Comment 5•8 months ago
Onward to 2025!
Reporter | ||
Comment 6•7 months ago
Nothing new here.
Incident Description:
Due to some confusion with multiple trackers, an incorrect deadline was being tracked internally for the final report on bug 1925106, causing it to be submitted late.
Incident Root Cause(s):
Due to a number of exceptional issues related to staffing changes, travel, and hardware failures, one of our incident reports inadvertently got filed late.
Remediation Description:
Our existing manager for our report tracking is now back in place and reporting to our new Chief Trust Officer, and the tracking spreadsheet has been updated to track dates for final reports.
Requesting closure on this bug since all actions are complete and a closing summary has been posted.
Comment 9•7 months ago
(In reply to DigiCert from comment #7)
Tim,
Your root cause analysis attributes this error to a change in personnel and an error in consolidation of trackers, which your timeline places on October 31, 2024. However, DigiCert has a pattern of late Bugzilla responses on bugs that extends before October 31.
For example, looking at bug 1910805 (Digicert: Delayed Revocation for bug 1894560), I find these instances:
Comment | Comment date | Time to respond |
---|---|---|
Bug 1910805 comment 3 | 7/31/24 | Never answered. |
Bug 1910805 comment 12 | 8/26/24 | Never answered. |
Bug 1910805 comment 14 | 8/28/24 | Eight days to answer. |
Bug 1910805 comment 16 | 9/6/24 | Never answered. |
Bug 1910805 comment 17 | 9/6/24 | Six questions never answered. (Note that one question posed to Chrome in this comment was answered after ten days, although Chrome does not own this bug and has no obligation to timeliness of response.) |
Bug 1910805 comment 33 | 10/20/24 | Nine days to answer some questions. One question answered after 36 days and repetition of the question. |
Bug 1910805 comment 44 | 12/8/24 | Never answered. |
If it’s useful for this conversation, I can go back through other DigiCert bugs from 2024 and make similar lists, but I hope this is sufficient to demonstrate the pattern.
I will point out that most of these predate the October 31 error that you have attributed as the root cause of this problem. This suggests to me that there has been at least one additional root cause of delays in expected behaviors when reporting and remediating incidents. I believe DigiCert should look back on its responsiveness to the Bugzilla community in 2024 and search for a root cause of this performance. Once you identify that root cause, I think you should post a new incident report that includes credible action items for this root cause and report on their delivery before contemplating closing this bug. Alternately, it may be appropriate to open a new bug and deal with it there instead.
Ben, I recommend keeping this bug open for the time being.
Assignee | ||
Comment 10•7 months ago
The root cause was Jeremy leaving. Perhaps we should have included that in the timeline. That happened on August 1st. We never implied that the problem started on October 31st.
All action items for this bug have been completed. If you have additional issues with bug 1910805, please take them there.
Comment 11•7 months ago
I'll close this on or about Friday, 24-Jan-2025.
Comment 12•7 months ago
(In reply to DigiCert from comment #10)
The root cause was Jeremy leaving.
The loss of a single employee, albeit an important one, should not result in the breakdown of fundamental expectations of a public CA for the next four months.
“Jeremy leaving” is not a root cause. At best, it is an identification of a triggering event, and I believe DigiCert owes the community a proper root cause analysis that includes action items to rectify the underlying problems. Rather than a six-word response, this persistent issue deserves its own, new Bugzilla incident with a fresh incident report that addresses the full scope of the problem and credibly promises to rectify this performance.
Can I count on you to open such a bug?
Comment 13•7 months ago
(In reply to DigiCert from comment #10)
The root cause was Jeremy leaving. Perhaps we should have included that in the timeline. That happened on August 1st. We never implied that the problem started on October 31st.
Perhaps indeed, if that is the root cause and not the other root causes listed in the DigiCert report and subsequent comments. Please amend the timeline and report to include the actual root cause and the subsequent actions or inactions that produce a causal path to this incident.
The report in this bug says “the root cause was an incorrect counting in days when updating an internal tracker”, when that is clearly a consequence of something and not a cause that stands alone.
Later, DigiCert says of the incident:
Incident Root Cause(s):
Due to a number of exceptional issues related to staffing changes, travel, and hardware failures, one of our incident reports inadvertently got filed late
While one would hope that anything leading to an incident was exceptional, rather than a routine practice, nothing in this report or its action items explains how the real apparent root cause—DigiCert failing to properly manage transfer of knowledge and responsibilities during a staffing change—is being remediated. What happens when the “single compliance manager” listed as the result of the first action items leaves DigiCert? If someone leaving is a “root cause”
Furthermore, none of the action items address how the travel or hardware failures were allowed to impact DigiCert meeting its commitments, or make clear why a repeat occurrence will be avoided if travel or hardware failures occur in the future (as I assume is very likely).
Ben, I don’t think that this incident report is satisfactory by the standards of the CCADB or MRSP, and the bug should be kept open until it is improved appropriately.
Assignee | ||
Comment 15•7 months ago
Mike, thanks for the reasonable perspective. We'll see if we can make some adjustments to the timeline and root cause.
Assignee | ||
Comment 16•7 months ago
Tim, we will not be opening a new bug.
Assignee | ||
Comment 17•6 months ago
Mike,
As we noted, we didn't include the details about the travel or hardware failures for exactly the reason you note: we don't feel they are relevant. There was a rather annoying cluster of rare events right at that particular time, but that's just something that happened, and none of them are particularly interesting, other than the fact that they made tracking these things really, really hard for a week or two. The increased difficulty and overhead almost certainly did contribute significantly to this happening, but like we said, we aren't going to blame those things.
Also, losing a compliance manager and losing a CISO are two entirely different things. A rather massive amount of knowledge and experience did get transferred in quite a short time. The fact that Jeremy's direct reports ended up reporting to different managers temporarily is the choice that if we had to do it over again, we would probably do differently. The good news is they are back together again under our new Chief Trust Officer.
There's already a ton of information about the root cause here, to be honest, in the post Jeremy wrote when he resigned. We were busy fixing a bunch of other things, and the trackers got a bit out of sync and this error happened. We'll get a new root cause analysis up, but what happened is exactly what we described. We wish it hadn't, but it did.
Assignee | ||
Comment 18•6 months ago
We are working on an updated incident report and will provide it next week.
Assignee | ||
Comment 19•6 months ago
Incident Report (Revised)
Summary
On bug https://bugzilla.mozilla.org/show_bug.cgi?id=1925106 , we failed to post the full incident report within the time frame promised and within the two weeks required by Mozilla. The deadline posted in the initial report had the correct date, but an incorrect date was introduced during a tracker re-organization / consolidation. The root cause was an incorrect counting in days when updating an internal tracker while following our standard operating procedure.
Impact
The community was not provided with the full report within the timeframe expected by the Mozilla community and the delay exposed a weakness in our standard incident response process.
Timeline
All times are UTC.
2024-10-16 19:56: DigiCert files bug (https://bugzilla.mozilla.org/show_bug.cgi?id=1925106) with a preliminary incident report.
2024-10-31 19:42: Multiple incident reporting trackers consolidated; full report deadline calculated incorrectly.
2024-11-01 1:33: Ben acknowledges this is a question of first impression for CCADB and is being discussed both on the bug and with CCADB.
2024-11-06 4:40: Wayne provides a historical overview of the issue.
2024-11-08 21:36: Tim provides a full incident report.
Root Cause Analysis
Our compliance incident response plan designates roles and responsibilities for managing incidents. These roles include a program manager who tracks and manages the timelines, a compliance leader to explain/research the issue and document the timelines, a product manager to drive any fixes required by engineering, a support leader to drive customer communication, and a marketing leader to schedule and send the communication. The plan assigns specific individuals to each role. When we had a changeover, several assigned leaders were no longer operating in their assigned roles. In particular, the program manager was moved to other tasks, the compliance leader was no longer with DigiCert, and the product manager assigned to the team changed. This led to confusion when following the plan.
The root cause was two-fold. First, we did not have sufficient contingencies in the incident response plan to account for the organization’s restructuring. Second, the incident response plan did not name sufficient successors to ensure the continued operation of the plan in accordance with the timelines. This includes posting a response within the required timeframe.
Although we did have people take over management of the incident response roles, they failed to notice that the clock for a response started when DigiCert last posted, not when the bug was last updated. The tracker automatically pulls in the last updated time, which led to a belief in the program management role that DigiCert had more time. This error has been corrected in training. We’ve also updated our SOP to account for changes in role and better document how dates are calculated.
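The difference between the two date calculations described above can be sketched as follows. This is an illustrative sketch only, not the actual tracker: the `Comment` type, the function names, the seven-day `RESPONSE_WINDOW`, and the sample data are all hypothetical assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Assumption: a fixed response window; the real interval is set by policy.
RESPONSE_WINDOW = timedelta(days=7)


@dataclass
class Comment:
    author: str        # account that posted the comment
    posted: datetime   # time the comment was posted (UTC)


def due_from_last_update(comments: list[Comment]) -> datetime:
    """Mistaken rule: the clock restarts on any update to the bug."""
    return max(c.posted for c in comments) + RESPONSE_WINDOW


def due_from_last_ca_post(comments: list[Comment], ca_authors: set[str]) -> datetime:
    """Corrected rule: the clock starts at the CA's own most recent post."""
    ca_posts = [c.posted for c in comments if c.author in ca_authors]
    if not ca_posts:
        raise ValueError("the CA has not posted on this bug")
    return max(ca_posts) + RESPONSE_WINDOW


# Hypothetical thread: the CA posts, then a community member comments later.
thread = [
    Comment("ca-account", datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)),
    Comment("community-member", datetime(2025, 1, 5, 12, 0, tzinfo=timezone.utc)),
]

wrong_due = due_from_last_update(thread)
right_due = due_from_last_ca_post(thread, {"ca-account"})
print(wrong_due - right_due)  # 4 days, 0:00:00
```

In this made-up thread, a later community comment pushes the mistaken due date four days past the correct one, which is the kind of false slack the paragraph above describes.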
Lessons Learned
What went well
We filed the preliminary report right away and promptly raised the issue to CCADB.
What didn't go well
We didn’t file the full incident report within the appropriate timeline.
Action Items
Action Item | Kind | Due Date |
---|---|---|
Make sure all program management is handled by a single compliance program manager | Prevent | Already completed |
Track full report dates in the integrated project tracker | Prevent | Already completed |
Update the SOP to include contingencies for changes in personnel | Prevent | Already completed |
Update the SOP to include specifically that the date must be calculated based on the last date DigiCert updated the bug | Prevent | Already completed |
Appendix
Details of affected certificates
None
Hopefully, this better identifies the root cause and where our process failed. This was an unusual circumstance as we’ve actually tested the incident response plan with scenarios where key staff members were absent. However, we never had the date of the last update vs. when DigiCert last updated the bug taken into consideration.
Assignee | ||
Comment 20•6 months ago
Incident Report Closure Summary
- Incident Description: We failed to post the full incident report within the time frame promised and within the two weeks required.
- Incident Root Cause(s): The root cause was an incorrect counting in days when updating an internal tracker while following our standard operating procedure.
- Remediation Description:
  - Make sure all program management is handled by a single compliance program manager
  - Track full report dates in the integrated project tracker
  - Update the SOP to include contingencies for changes in personnel
  - Update the SOP to include specifically that the date must be calculated based on the last date DigiCert updated the bug
- Commitment Summary: Our commitment is to improve our SOPs to take into account staff changes and other unforeseen events which may affect our ability to respond to incidents within the required timeframes. To that end, we’ve made several changes to our internal processes to better track due dates, to better identify risks, and to have contingency plans laid out ahead of time. We are also committed to trying to post responses well ahead of the required due dates so that when unforeseen things happen, we’re not up against a due date and have some recovery time before being faced with non-compliance.
All Action Items disclosed in this Incident Report have been completed as described, and we request its closure.
Assignee | ||
Comment 21•6 months ago
All action items complete and closing summary posted. We are again requesting closure.
Comment 22•6 months ago
I will close this tomorrow, Friday, 28-Feb-2025.