Closed Bug 1898847 Opened 4 months ago Closed 2 months ago

Entrust: Delayed reporting of Jurisdiction issue in some EV TLS & Code Signing certificates

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ngook.kong, Assigned: ngook.kong)

Details

(Whiteboard: [ca-compliance] [policy-failure])

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36

Incident Report

Summary

On May 18, 2024, we created bug 1897630 to report an incident with the jurisdiction data of some EV TLS & Code Signing certificates. That incident report was submitted more than 72 hours after Entrust should have been aware of the incident, and we are therefore submitting this report regarding the causes of the delay.

Our analysis shows that we did not investigate, escalate, confirm, and report this incident quickly enough to meet the required timeframes, due to insufficient processes and resources.

Impact

This delayed reporting bug impacts the same certificates affected by bug 1897630. All affected certificates had expired or were revoked by May 21, 2024, 9:30 AM UTC.

Timeline

All times are UTC.

2021-03-03:

  • Bug 1696227 (Mar 2021): Incorrect Jurisdiction Country Value in an EV Certificate
    • That incident concerned certificates whose jurisdiction country was set to “ZA” when it should have been “BW”. While it also involved incorrect information in the jurisdiction data of a certificate, it does not appear to be directly related to this incident.

2022-11-28:

  • Bug 1802916 (Nov 2022): EV TLS Certificate incorrect jurisdiction
    • In that incident report we identified certificates where the jurisdiction state or province was populated even though the registry was at the country level. We did not identify cases where the jurisdiction state or province was missing while the registry was at the country level.

2023-03-14:

  • Dropdown functionality was implemented for Private Organizations as described in bug 1802916 comment 7.

2023-11-28:

  • Bug 1867130 (Nov 2023): Jurisdiction Locality Wrong in EV Certificate
    • In that incident a postal code was detected in the jurisdiction locality field of a certificate for a government entity (whose jurisdiction data does not come from the drop-down list); it was caused by insufficient indication of changes.

2024-03-09:

  • An ad hoc scan with pkilint was run in relation to bug 1883843, as part of the preparation work to implement pkilint as a post-issuance linter. The report shared by the engineer was taken as confirming the known incident (bug 1883843) and was not reviewed or investigated further by the compliance team, because it was only a list of errors that did not identify the affected certificates; the team requested a report that would include certificate numbers. Unbeknownst to the compliance team, among the thousands of errors in the report were 42 errors relating to the locality issue. The engineer working on this fix then left the company, the task stalled, and the escalation process was not followed. (See the per-certificate linting sketch below.)
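As an illustration only (not Entrust's actual tooling), the sketch below shows how a bulk pkilint run can be invoked once per certificate file so that every finding is tied to the certificate that produced it. The lint_cabf_serverauth_cert console script ships with pkilint, but the exact subcommand and flag shape shown here is an assumption that may differ between pkilint versions.

    # Hedged sketch: run pkilint once per PEM file and prefix each finding
    # with the certificate filename, so no error is reported anonymously.
    # The CLI invocation is assumed from recent pkilint releases and may
    # need adjusting for your installed version.
    import pathlib
    import subprocess

    def lint_directory(cert_dir: str) -> None:
        for pem in sorted(pathlib.Path(cert_dir).glob("*.pem")):
            result = subprocess.run(
                ["lint_cabf_serverauth_cert", "lint", str(pem)],
                capture_output=True, text=True,
            )
            for line in result.stdout.splitlines():
                print(f"{pem.name}: {line}")

    if __name__ == "__main__":
        lint_directory("certs/")  # hypothetical directory of issued certificates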

2024-04-03:

  • 13:12 A new scan with pkilint was started; initial results (while the scan was still running) highlighted an error where the jurisdiction locality was present and the state or province was missing (the pattern illustrated in the sketch after this timeline entry). The issue was escalated to our verification team for further investigation.
  • 19:50 Verification data indicated that the organization profiles had been validated at the country level and that the locality was not listed for these jurisdictions.
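For illustration, a minimal consistency check for the pattern described above (jurisdictionLocalityName present while jurisdictionStateOrProvinceName is absent) can be written with the cryptography library; the OIDs below are the EV Guidelines jurisdiction-of-incorporation attributes. This is a sketch of the class of check involved, not Entrust's production control.

    # Minimal sketch of an EV jurisdiction-field consistency check.
    from cryptography import x509
    from cryptography.x509.oid import ObjectIdentifier

    JURISDICTION_LOCALITY = ObjectIdentifier("1.3.6.1.4.1.311.60.2.1.1")
    JURISDICTION_STATE = ObjectIdentifier("1.3.6.1.4.1.311.60.2.1.2")
    JURISDICTION_COUNTRY = ObjectIdentifier("1.3.6.1.4.1.311.60.2.1.3")

    def jurisdiction_problems(pem_bytes: bytes) -> list[str]:
        subject = x509.load_pem_x509_certificate(pem_bytes).subject

        def present(oid: ObjectIdentifier) -> bool:
            return bool(subject.get_attributes_for_oid(oid))

        problems = []
        if not present(JURISDICTION_COUNTRY):
            problems.append("jurisdictionCountryName missing")
        if present(JURISDICTION_LOCALITY) and not present(JURISDICTION_STATE):
            # The defect pattern in this incident: a locality recorded for
            # what was actually a country-level registration.
            problems.append("jurisdictionLocalityName present without "
                            "jurisdictionStateOrProvinceName")
        return problems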

2024-04-04:

  • 12:42 A compliance team member reviewed the issue and determined that the locality information in the certificates was likely incorrect. The issue was discussed with the verification team, who started a deeper investigation into why their data did not match the data in the certificates.
  • 15:00 The issue was discussed between compliance and the verification team resulting in the need for further investigation.

2024-04-08:

  • We detected a certificate issued to a government entity with the same issue; however, the logic for government entities is different from that for private organizations. Whereas Private Organizations leverage our pre-verified jurisdiction list located at https://www.entrust.com/legal-compliance/approved-incorporating-agencies, jurisdiction data for government organizations is entered manually. (A sketch of a lookup-based validation control follows this entry.)
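A hedged sketch of the kind of lookup-based input-validation control this distinction suggests (and which the "Implement additional input validation controls" action item later describes) appears below. The table contents and field names are hypothetical.

    # Hypothetical input-validation control: manually entered jurisdiction
    # data is accepted only if it is consistent with a pre-verified
    # registry table keyed by (country, state_or_province).
    APPROVED_REGISTRIES = {
        ("US", "Delaware"): "state",    # example: state-level registry
        ("CA", None): "country",        # example: country-level registry
    }

    def validate_jurisdiction(country: str, state: str | None,
                              locality: str | None) -> None:
        level = APPROVED_REGISTRIES.get((country, state))
        if level is None:
            raise ValueError("jurisdiction not on the pre-verified list; "
                             "route to manual compliance review")
        if level == "country" and (state or locality):
            raise ValueError("country-level registry: state/locality must be empty")
        if locality and not state:
            raise ValueError("a jurisdiction locality requires a state or province")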

2024-04-11:

  • 10:05 pkilint was added as a post-issuance linter in production.

2024-04-15:

  • 11:48 The problem was included in a report from a partial scan with pkilint during the implementation of the post-issuance linter. The results of this scan were included in the communication to the product compliance team manager and should have been escalated at that time through established processes. It was not escalated because the compliance team manager incorrectly assumed the data reflected the cPSuri problem from bug 1883843, which was already being addressed.

2024-05-12:

  • 02:19 A member of the product compliance team identified, unprompted, that this issue had not been reported and actioned. Following process, senior leadership was informed and our incident handling procedure was initiated.

2024-05-13:

  • 13:00 The product compliance team manager formally started an investigation.

2024-05-16:

  • 11:55 Mis-issuance confirmed and final certificate data verified.
  • 11:55 We started the 5-day revocation clock.
  • 16:00 Notified subscribers of the impacted certificates and that they would be revoked within 5 days.

2024-05-21:

  • 09:30 All remaining impacted certificates were revoked.

Root Cause Analysis

1. Why was there a problem?

According to the CCADB incident reporting requirements, a mis-issuance must be reported with at least a preliminary incident report within 72 hours of the CA Owner becoming aware of it. However, Entrust did not investigate and escalate data that would have identified and confirmed an incident in a timely manner.
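As a trivial illustration of the 72-hour rule, using the 2024-04-03 escalation as the awareness point:

    # 72-hour preliminary-report deadline, counted from awareness.
    from datetime import datetime, timedelta, timezone

    became_aware = datetime(2024, 4, 3, 13, 12, tzinfo=timezone.utc)
    deadline = became_aware + timedelta(hours=72)
    print(deadline.isoformat())  # 2024-04-06T13:12:00+00:00; bug 1897630 was filed 2024-05-18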

2. Why did the initial scan on 2024-03-09 not trigger further investigation?

The 2024-03-09 scan that initially generated the relevant data was run by an engineer as part of an experimental quick scan to learn more about pkilint; the work was not being tracked as a production project. The report generated by the scan included only the linting output, without identifying the certificates that generated the errors. The compliance team reviewed the report only far enough to identify deficiencies in the report itself; based on this high-level review, the team believed the report merely confirmed the known incident (bug 1883843). The 42 locality-related errors were buried within a list of thousands of cPSuri errors and were not noticed in the high-level review.
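One way to keep a handful of novel findings from being buried among thousands of known ones is to summarize a bulk report by finding type before anyone reads it line by line. The sketch below assumes each report line begins with a finding identifier, which is a simplification of real pkilint output.

    # Summarize a bulk linter report by finding type; rare finding types
    # (e.g., 42 locality errors among thousands of cPSuri errors) surface
    # immediately instead of being lost in the raw listing.
    from collections import Counter

    def summarize(report_lines: list[str]) -> None:
        counts = Counter(line.split()[0] for line in report_lines if line.strip())
        for finding, count in counts.most_common():
            print(f"{count:6d}  {finding}")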

3. Why didn’t the results of the pkilint scans on 2024-04-03 or 2024-04-15 trigger further investigation?

The pkilint scan on 2024-04-03 produced preliminary results that highlighted an error; this triggered further review and led a compliance team member to conclude that information in the certificates was likely incorrect and warranted further investigation. However, the email reporting this finding and recommending further investigation was missed due to the volume of email received during this period.

A report from a scan on 2024-04-15 also included the problem; this report was communicated by email to the product compliance team manager for investigation and escalation. It was not escalated because the manager assumed the report related to the cPSuri problem from bug 1883843, which was already being addressed.

4. Why were the existing processes and resources not sufficient to trigger investigation, escalation, and confirmation of an incident within an acceptable timeframe?

At relevant points in the timeline for this bug, the compliance team was responding to other certificate revocation-event incidents that impacted a large number of subscribers and generated a large volume of questions from the community, demanding the team's time, energy, and attention.

Existing processes rely on communication, tracking, and reporting of potential incidents via email, team discussions, and other channels that are susceptible to human error, especially in situations such as the one described above.

In addition, the authority to launch and conduct formal investigations, confirm incidents, and initiate incident reporting processes is held by a small number of individuals within the compliance team. These same individuals were responsible for helping to respond to incidents, communicating with impacted subscribers, responding to questions from the Bugzilla community, and drafting and submitting incident reports, in addition to other day-to-day responsibilities.

5. Why was there not sufficient capacity to meet the reporting timelines?

The product compliance and verification teams were organizationally separate from the broader resources that could have provided additional capacity and redundancy in high-volume situations.

Lessons Learned

  • We need to improve investigation and escalation processes for potential incidents and provide additional support to teams responsible for certificate compliance and verification.

What went well

  • Once senior leadership was notified of the event, the mis-issuance was confirmed and the impacted certificates were revoked within 5 days as required.

What didn't go well

  • Human errors resulting in data not being investigated, escalated or actioned in a timely way.

Where we got lucky

  • A product compliance team member eventually realized (unprompted) that the issue had not been reported or actioned and notified senior leadership.

Action Items

We have identified the following action items and will consider this issue during our reflection on recent incidents, the report on which we will publish to Mozilla and the community on or before June 7.

Action Item | Kind | Due Date
Review applicable policy and procedures | Prevent | June 28, 2024
Improve our internal problem reporting mechanism for issues reported by internal staff | Detect | July 31, 2024
Reorganize product compliance and verification teams to provide additional organizational resources and oversight | Prevent | July 31, 2024
Implement additional input validation controls for verification | Mitigate | July 26, 2024
Implement pkilint as post-issuance linter | Detect | Done

I will not reiterate my questions posed in #1897630 and #1898848.

What we need from this incident report is an explanation of why the incident report didn't get submitted within 72 hours, and why it will not happen again. The original incident is generally immaterial to this and the action items should reflect this issue alone.

Flags: needinfo?(ngook.kong)

(In reply to Wayne from comment #2)

I will not reiterate my questions posed in #1897630 and #1898848.
What we need from this incident report is an explanation of why the incident report didn't get submitted within 72 hours, and why it will not happen again. The original incident is generally immaterial to this and the action items should reflect this issue alone.

Hi Wayne, as noted in the timeline, Entrust confirmed the mis-issuance in bug 1897630 on May 16th. We filed a preliminary incident report for it on May 18th, within 72 hours of the incident being confirmed, and we revoked all affected certificates within 5 days of confirmation. However, we filed a delayed-reporting incident and a late-revocation incident in anticipation that the community might object that Entrust should have confirmed the incident and revoked the certificates earlier. If the community believes that, under the circumstances, the report and revocation should not be considered late, we would welcome that. We have outlined in this report all of the missed opportunities that, had they been taken, would have resulted in earlier investigation and confirmation. We have also identified the reasons these opportunities were missed, and the actions we are taking to rectify this. This exercise has helped us identify the process, technology, and talent gaps we will address to improve compliance in the future.

Flags: needinfo?(ngook.kong)

All that has been established is a systemic failure in following incident response at multiple layers, and a lack of change across several months to improve this matter.

This is why I say that this incident report needs to focus on the missed timeframe and why it will not happen again.

Now that Entrust have clarified that their incident response practices have not changed in the past few months via comment #28:

  • Has this been the same process from the start of the incident until now? If not, what lessons were learned and applied?

Yes, it is the same process. With the volume of certificates that had to be revoked and re-issued, this approach has put a strain on our Support teams. We are trying to prevent delayed revocation, but if this process were needed in the future, the use of a standard form for incidents could be useful for purposes of review and management. However, we will not change our approach without discussion and agreement from the community.

Then I would expect Delayed Reporting incidents to be raised for the following, to show consistency:
Entrust: Failed to provide a preliminary incident report according to TLS BR 4.9.5
Entrust: CPS typographical (text placement) error
Entrust: Not updating Problem Reporting Mechanism fields in CCADB

Flags: needinfo?(ngook.kong)
Assignee: nobody → ngook.kong
Status: UNCONFIRMED → ASSIGNED
Type: defect → task
Ever confirmed: true
Whiteboard: [ca-compliance] [policy-failure]

(In reply to Wayne from comment #4)


Then I would expect Delayed Reporting incidents to be raised for the following, to show consistency:
Entrust: Failed to provide a preliminary incident report according to TLS BR 4.9.5
Entrust: CPS typographical (text placement) error
Entrust: Not updating Problem Reporting Mechanism fields in CCADB

Noted. We request direction from Mozilla on whether new incidents need to be opened for these 3.

Flags: needinfo?(ngook.kong)

(In reply to ngook.kong from comment #5)


Noted. We request direction from Mozilla on whether new incidents need to be opened for these 3.

Entrust are claiming that their incident response has not changed, opening incidents, and then claiming they now need direction from a Root Program on whether they should open incidents at all? We are all seeing this, correct?

Flags: needinfo?(ngook.kong)

(In reply to Wayne from comment #6)


Entrust are claiming that their incident response has not changed, opening incidents, and then claiming they now need direction from a Root Program on whether they should open incidents at all? We are all seeing this, correct?

This question appears to be addressed to the broader community, and not to Entrust. If there is a question or request for information from Entrust, please clarify.

We are monitoring the bug. Request next update to be 28 June 2024.

We are monitoring the bug. Request next update to be 31 July 2024.

Flags: needinfo?(ngook.kong)
Whiteboard: [ca-compliance] [policy-failure] → [ca-compliance] [policy-failure] Next update 2024-07-31

Action Items

Action Item | Kind | Due Date
Review applicable policy and procedures | Prevent | Done
Improve our internal problem reporting mechanism for issues reported by internal staff | Detect | July 31, 2024
Reorganize product compliance and verification teams to provide additional organizational resources and oversight | Prevent | Done
Implement additional input validation controls for verification | Mitigate | July 26, 2024
Implement pkilint as post-issuance linter | Detect | Done

Action Items

Action Item | Kind | Due Date
Review applicable policy and procedures | Prevent | Done
Improve our internal problem reporting mechanism for issues reported by internal staff | Detect | Done
Reorganize product compliance and verification teams to provide additional organizational resources and oversight | Prevent | Done
Implement additional input validation controls for verification | Mitigate | Done
Implement pkilint as post-issuance linter | Detect | Done

All actions have been addressed. We request this incident be closed. Thanks.

We have no further updates, and all action items have been completed.

We kindly request closing this incident.

Flags: needinfo?(bwilson)

I will close this on or about Friday, 9-August-2024, unless there are still questions or issues to address.

Whiteboard: [ca-compliance] [policy-failure] Next update 2024-07-31 → [ca-compliance] [policy-failure]

We will continue to monitor for comments/questions.

Status: ASSIGNED → RESOLVED
Closed: 2 months ago
Flags: needinfo?(bwilson)
Resolution: --- → FIXED