Closed Bug 1927675 Opened 10 months ago Closed 9 months ago

iTrusChina: CPR was not responded to within 24 hours

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: vTrus_contact, Assigned: vTrus_contact)

Details

(Whiteboard: [ca-compliance] [policy-failure])

iTrusChina was notified that it did not respond to a Certificate Problem Report (CPR) about mis-issuance within 24 hours.

Incident Report

Summary

On October 26, 2024 (Saturday), the Google team emailed iTrusChina a Certificate Problem Report (CPR) about potentially mis-issued certificates used on our test websites. iTrusChina responded to the email on the morning of October 28, 2024 (Monday), more than 24 hours after it was received. This is considered a violation of Section 4.9.5 of the TLS BRs, which states: “Within 24 hours after receiving a Certificate Problem Report, the CA SHALL investigate the facts and circumstances related to a Certificate Problem Report and provide a preliminary report on its findings to both the Subscriber and the entity who filed the Certificate Problem Report.”

Impact

The mis-issued certificates mentioned in the CPR were issued to iTrusChina itself; no other subscribers were affected.

Timeline (All times are UTC+8)

Time Event
2024-10-26 3:14 AM: The Google team sent the first email to iTrusChina’s CPR mailbox “compliance@itrus.com.cn”, notifying us about the potentially mis-issued certificates.
2024-10-28 9:34 AM: iTrusChina responded to the first email and began the investigation.
2024-10-28 11:26 AM: iTrusChina disclosed a Preliminary Report of the above-mentioned mis-issuance in Bug 1927384.
2024-10-29 9:00 AM: The Google team replied to our first email, notifying us that iTrusChina may have violated the TLS BRs, which require CAs to respond to CPRs within 24 hours.
2024-10-29 11:10 AM: iTrusChina confirmed that the delayed response to the CPR was mainly caused by our limited number of issued certificates and by human error: the relevant staff were inexperienced in handling CPRs and did not check the mailbox over the weekend.
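
To make the size of the delay concrete, the short sketch below computes the 24-hour deadline and the overrun from the first two timeline entries. The timestamps and the UTC+8 offset are taken from the timeline above; the variable names are illustrative only, and the calculation treats the first response as the end of the interval.

```python
from datetime import datetime, timedelta, timezone

# All timeline entries in the report are given in UTC+8.
TZ = timezone(timedelta(hours=8))

received = datetime(2024, 10, 26, 3, 14, tzinfo=TZ)   # CPR email received
responded = datetime(2024, 10, 28, 9, 34, tzinfo=TZ)  # first response sent

# 24-hour window from Section 4.9.5 of the TLS BRs.
RESPONSE_WINDOW = timedelta(hours=24)

deadline = received + RESPONSE_WINDOW  # latest compliant response time
elapsed = responded - received         # time actually taken
overrun = elapsed - RESPONSE_WINDOW    # how far past the deadline

print("deadline:", deadline.isoformat())
print("elapsed:", elapsed)
print("overrun:", overrun)
```

Under these inputs the response came 54 hours 20 minutes after receipt, which is 30 hours 20 minutes past the deadline of 2024-10-27 3:14 AM (UTC+8).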

Root Cause Analysis

The main causes of this incident are human negligence and inexperience: no personnel checked the CPR mailbox over the weekend.

Lessons Learned

What went well

We started dealing with it as soon as we read the email.

What didn't go well

Failure to check the CPR email box on the weekend.

Where we got lucky

N/A

Action Items

Action Item | Kind | Due Date
Train the staff about the TLS BRs’ requirements for CPR handling to avoid such response delays. | process | 2024-10-30
Arrange multiple dedicated personnel to check the CPR mailbox, and establish a double-check mechanism to avoid human error. | process | 2024-10-30
Assignee: nobody → vTrus_contact
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Whiteboard: [ca-compliance] [policy-failure]

iTrusChina has finished the above-mentioned action items. This incident and the TLS BRs' requirements have been shared with our Compliance Team, and at least two personnel will constantly monitor the CPR mailbox. Thanks.

I'm just a random guy on the Internet, but this "root cause" looks more like a restatement of the symptom rather than digging into the actual root of why it happened. What was the process to take the BR requirements and make sure that the necessary people knew about them? (I know that some requirements on CAs are buried in labyrinths of old RFCs that need to be pieced together, but I didn't think needing to be monitoring 24x7 and responding within 24 hours was one that was hard to find.) Is it just that, like, the only weekend staff person called out sick and some supervisor missed the implications? How was weekend staffing planned to be able to provide coverage, and why didn't that plan work? Or has there just historically been no weekend staff, with all levels of management and compliance personnel somehow not realizing how they wouldn't be able to handle CPRs on the weekend? How did auditors check for their ability to comply with this rule in the past? What is the "double-check mechanism" now being used, and why wasn't it implemented previously? How would they know if they were missing any other obvious requirements that they would be expected to comply with?

I think some kind of "Five Whys", or at least a couple more "Whys" than are given here, would help with understanding what actually went wrong more holistically for the organization.

(In reply to Peter Cooper Jr. from comment #3)

Hello Peter, thanks for your kind feedback.

To make sure iTrusChina’s staff is aware of BRs’ requirements, iTrusChina arranges teams to constantly track the changes of BRs, root store policies, Bugzilla incidents, and other relevant policies. These requirements are shared with the relevant teams, and necessary training and system updates are conducted to make sure our operation and systems comply with the relevant requirements.

Regarding your questions about the double-check mechanism, we have assigned two dedicated people from our compliance team and one senior manager to check the CPR mailbox around the clock (7*24) and respond to CPRs.

Our qualified auditor follows their standards and processes to constantly audit iTrusChina’s operation and management as a Certification Authority, and we will keep improving our service and compliance level through their recommendations and the community’s supervision. Thanks again.

(In reply to iTrusChina Co.,Ltd. from comment #4)

To make sure iTrusChina’s staff is aware of BRs’ requirements, iTrusChina arranges teams to constantly track the changes of BRs, root store policies, Bugzilla incidents, and other relevant policies. These requirements are shared with the relevant teams, and necessary training and system updates are conducted to make sure our operation and systems comply with the relevant requirements.

So then, what was the plan that those relevant teams had made in order to comply with the 24-hour response requirement, and why didn't that plan work in this case?

(In reply to Peter Cooper Jr. from comment #5)

Hello Peter, thanks for your comment.

Before this incident, we had established a workflow that required a designated individual to check and respond to CPR emails around the clock (7*24). However, the duty officer failed to check emails as stipulated, resulting in this incident. We have now implemented a double-check mechanism for responding to CPRs, which we believe can effectively prevent such incidents in the future.
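
The thread does not describe how the double-check mechanism is implemented. As one possible reading, the sketch below escalates an unacknowledged CPR from the duty officer to a second reviewer and then to a senior manager, well inside the 24-hour limit. The function name `escalation_target`, the role labels, and the 4-hour and 8-hour thresholds are all hypothetical and are not taken from iTrusChina's process.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical escalation thresholds (assumptions, not from the report).
# Each entry: (age of the unacknowledged CPR, who gets alerted at that age).
ESCALATION_STEPS = [
    (timedelta(hours=4), "secondary reviewer"),
    (timedelta(hours=8), "senior manager"),
]


def escalation_target(
    received: datetime,
    acknowledged: Optional[datetime],
    now: datetime,
) -> Optional[str]:
    """Return who should be alerted about a CPR, or None if nobody needs to be."""
    if acknowledged is not None:
        return None  # someone has already picked up the report
    age = now - received
    target = None
    for threshold, role in ESCALATION_STEPS:
        if age >= threshold:
            target = role  # keep the highest level whose threshold has passed
    return target


# Example using the receipt time from the timeline (UTC+8).
TZ = timezone(timedelta(hours=8))
cpr_received = datetime(2024, 10, 26, 3, 14, tzinfo=TZ)
print(escalation_target(cpr_received, None, cpr_received + timedelta(hours=9)))
```

A check like this would be run periodically against the mailbox or ticket queue, so that a single person missing an email produces an alert to someone else rather than a silent delay.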

Incident Report Closure Summary

  • Incident Description: On October 28, 2024, iTrusChina was notified that it did not respond to a Certificate Problem Report (CPR) within 24 hours, which is considered a violation of Section 4.9.5 of the TLS BRs.
  • Incident Root Cause(s): The root causes of this incident are human negligence and inexperience: the duty officer failed to check emails as required over the weekend.
  • Remediation Description: iTrusChina trained the relevant staff and established a double-check mechanism to avoid human error. Multiple dedicated personnel are assigned to check the CPR mailbox around the clock (7*24).
  • Commitment Summary: We will enhance our management and implementation of the CPR response mechanism, regularly train our staff, and continuously inspect our double-check mechanism, ensuring the CPR response mechanism is working properly.

All Action Items disclosed in this Incident Report have been completed as described, and we request its closure.

I will close this sometime later this week, unless there are questions or issues to discuss.

Flags: needinfo?(bwilson)
Status: ASSIGNED → RESOLVED
Closed: 9 months ago
Flags: needinfo?(bwilson)
Resolution: --- → FIXED