Closed Bug 1941009 Opened 10 months ago Closed 7 months ago

eMudhra emSign PKI Services : Issue with revocation as part of automated reissuance

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: naveen.ml, Assigned: naveen.ml)

Details

(Whiteboard: [ca-compliance] [uncategorized])

Attachments

(2 files)

Incident Report

Summary

On 2025-01-08, eMudhra identified an issue where 311 unique domain certificates tied to base orders were erroneously revoked due to a technical misconfiguration in the certificate lifecycle management system. NOTE: When initiating a subsequent reissuance programmatically, eMudhra TLS customers can choose whether or not to revoke any certificates associated with the original base order.
The issue arose from a subroutine within the automated processing of revocation of endpoints that unintentionally triggered the revocation of valid certificates during a reissuance process. In addition to the unintended revocation, the subroutine marked the reason code for each of the revoked certificates as Key Compromise instead of Superseded. While this incident caused the revocation of base certificates, the active replacement certificates linked to these orders remained valid and fully functional, so the impact on our customers was minimal.
To address the issue, eMudhra formed a dedicated team to contact affected customers, verify their environments, and ensure there were minimal operational disruptions. Communication was sent to all affected certificate owners with detailed instructions and support for resolving the issue. Immediate corrective actions were also implemented to prevent recurrence.

Impact

The impact was minimal as most of the customers were using their reissued certificates and not the newly revoked base certificate. The affected customers were contacted by eMudhra’s dedicated team to verify their environments and confirm the validity of their active replacement certificates. There has been no evidence of misuse or compromise associated with the revoked base certificates.

Timeline

All times are IST.
[2025-01-08 00:30]: Issue detected during routine internal monitoring.
[2025-01-08 01:00]: Root cause analysis initiated, and internal incident (EMINCPKI0022) was created. The incident was immediately escalated to the engineering and compliance teams for investigation.
[2025-01-08 02:10]: Root cause identified as a misconfigured subroutine in the automated revocation process, which revoked base certificates tied to replacement orders and erroneously published the CRL with incorrect revocation code as Key Compromise (instead of Superseded).
[2025-01-08 02:30]: Full list of affected base certificates, covering 311 unique domain names, identified.
[2025-01-08 02:45]: Communication sent to affected certificate owners, providing guidance and support.
[2025-01-08 10:00]: A dedicated team was deployed to contact affected customers, verify their environments, and confirm the validity of active replacement certificates.
[2025-01-08 13:00]: Internal corrective actions implemented to prevent recurrence.

Root Cause Analysis

The root cause of this incident was traced to a misconfigured subroutine in the automated revocation process within the certificate lifecycle management system. This subroutine mistakenly flagged base certificates for revocation. This highlights the need for stronger review, validation and testing of programmatic logic to ensure all revocation conditions are thoroughly validated before any revocation process is initiated.
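The report does not disclose the subroutine's actual logic, so the following is only a minimal sketch of the kind of guard the root cause points to, under stated assumptions. The `ReissuanceRequest` type, its field names, and `plan_revocations` are hypothetical; only the numeric reason codes come from RFC 5280. The idea is that certificates from the base order are revoked only when the customer explicitly opted in, and that the reason defaults to Superseded unless a key compromise was actually reported.

```python
from dataclasses import dataclass
from enum import IntEnum


class ReasonCode(IntEnum):
    """Subset of the CRLReason values defined in RFC 5280."""
    KEY_COMPROMISE = 1
    SUPERSEDED = 4


@dataclass(frozen=True)
class ReissuanceRequest:
    # Hypothetical request shape; field names are illustrative only.
    base_order_id: str
    revoke_previous: bool = False          # the optional revocation parameter
    key_compromise_reported: bool = False  # set only on an explicit report


def plan_revocations(request, base_order_serials):
    """Return (serial, reason) pairs to revoke for one reissuance request."""
    if not request.revoke_previous:
        # The customer did not opt in: nothing from the base order is revoked.
        return []
    reason = (
        ReasonCode.KEY_COMPROMISE
        if request.key_compromise_reported
        else ReasonCode.SUPERSEDED
    )
    return [(serial, reason) for serial in base_order_serials]
```

Returning a plan instead of revoking directly keeps the decision testable: a pipeline test can assert on the plan before anything reaches the CRL or OCSP responder.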

Lessons Learned

This incident revealed the need for robust validation and testing of actions exposed in certificate lifecycle processes, especially for revocation and reissuance workflows. Moving forward, end-to-end testing (manual and automated) and enhanced review as part of internal audit will address such gaps. Proactive communication and a dedicated response team minimized the impact, reinforcing the importance of clear processes and swift coordination. These improvements will strengthen both technical and operational reliability in managing certificates.

What went well

The issue was detected promptly through routine internal monitoring before any external reports were received. Replacement certificates tied to the affected base certificates remained unaffected, ensuring continuity of services for all customers. Internal teams quickly assessed the extent of impact, collaborated effectively to identify the root cause and implement immediate corrective measures including reaching out to the customers in a timely manner.

What didn't go well

Validation gaps in the outcome of programmatic routines underpinning the certificate lifecycle management system allowed the erroneous revocation of base certificates to occur. Additionally, operational disruptions required internal remediation and customer outreach to resolve the situation and prevent escalation.

Where we got lucky

The issue was limited to cases where an active replacement certificate requested by a customer had a corresponding base certificate, which was revoked. There was minimal service disruption for customers as most of them had deployed the reissued certificate. Additionally, the incident was detected through internal monitoring before external parties reported the issue, enabling swift resolution. The dedicated team’s quick action prevented the situation from escalating further.

Action Items

Action Item | Kind | Due Date
Conduct an end-to-end audit of the certificate lifecycle management system and fix the subroutine to prevent the system from processing automated reissuance requests that carry an optional parameter for revocation of associated certificates from the base order. | Mitigate | 2025-01-08
Do a thorough analysis of the programmatic logic that supports automated revocation of certificates to understand whether similar gaps are exposed, and fix any such gap. Add automated testing procedures to the testing pipeline for reverification. | Prevent | 2025-01-15
Implement an internal audit checklist for quarterly review to ensure such gaps are not reintroduced to the system. | Prevent | 2025-01-15

Based on Incident Reporting Template v. 2.0

Assignee: nobody → naveen.ml
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Whiteboard: [ca-compliance] [uncategorized]

The Impact section is for measurements and amounts, not for saying how minimal it was. This is a revocation incident and yet I see almost no discussion of OCSP or CRLs. From https://www.ccadb.org/cas/incident-report#incident-reports

The Impact section should contain a short description of the size and nature of the incident. For example: how many certificates, OCSP responses, or CRLs were affected; whether the affected objects share features (such as issuance time, signature algorithm, or validation type); and whether the CA Owner had to cease issuance during the incident.

Also missing are the complete certificate details that are required for all reports.

In particular, in the case of incidents which directly impacted certificates, the Appendix must include a listing of the complete certificate details of all affected certificates. The recommended format is to ensure that all affected certificates are logged to CT, then to attach a text file where each line is of the form https://crt.sh/?sha256=[sha256 fingerprint of the certificate].
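As a small illustration of the recommended appendix format, the sketch below builds one line per certificate from its DER encoding using Python's standard `hashlib`. The helper names `crtsh_url` and `appendix_lines` are assumptions made for this example; the URL shape is the one quoted above.

```python
import hashlib


def crtsh_url(der_bytes: bytes) -> str:
    """Build a crt.sh lookup URL from a certificate's DER encoding."""
    # The fingerprint is the SHA-256 digest of the DER bytes, in hex.
    fingerprint = hashlib.sha256(der_bytes).hexdigest()
    return f"https://crt.sh/?sha256={fingerprint}"


def appendix_lines(der_certificates) -> str:
    """Return the appendix text: one crt.sh URL per line."""
    return "\n".join(crtsh_url(der) for der in der_certificates)
```

Because the fingerprint is derived from the certificate itself, anyone holding the certificate can reproduce the line, which is not true of a crt.sh database ID.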

The timeline is incomplete. Most glaring is that the timeline doesn't even include when the incident happened. The incident doesn't start only when you detect it. When was the "misconfigured subroutine" introduced?

It's not clear from the report what corrective actions eMudhra took. You said that the published revocation reason was set to "Key Compromise", which is incorrect. Did you fix that?

The Impact section is for measurements and amounts, not for saying how minimal it was. This is a revocation incident and yet I see almost no discussion of OCSP or CRLs. From https://www.ccadb.org/cas/incident-report#incident-reports

The incident affected 311 unique domain certificates tied to a total of 854 base certificates. These certificates were erroneously revoked due to a misconfigured subroutine within the certificate lifecycle management system. Also, while the revocation reason should have been “Superseded”, as the certificates in question were valid and not compromised, the revocation reason for the affected certificates was erroneously published as "Key Compromise" and remains so.
Key Impact Metrics:
• The affected certificates cover 311 unique domain names tied to a total of 854 base certificates.
• Active replacement certificates remained valid and fully functional, ensuring continuity of customer services.
• The affected certificates remain listed with the revocation reason code "Key Compromise".
• The OCSP responses for the affected certificates also reflect the "Key Compromise" revocation reason.
• There was no suspension of certificate issuance during the incident.
Customers relying on active replacement certificates faced no disruption, ensuring continuity of their services and operations.

Also missing are the complete certificate details that are required for all reports.

Certificate crt.sh urls are enclosed as an attachment.

The timeline is incomplete. Most glaring is that the timeline doesn't even include when the incident happened. The incident doesn't start only when you detect it. When was the "misconfigured subroutine" introduced?

Timeline (All Times IST)
• 2024-12-29 02:00: The misconfigured subroutine was introduced during a routine update to the certificate lifecycle management system.
• 2025-01-07 21:21: The subroutine triggered the erroneous revocation of a total of 854 base certificates covering 311 unique domain names.
• 2025-01-07 22:07: The CRL was published, marking the revoked certificates with the reason code "Key Compromise."
• 2025-01-08 00:30: Issue detected during routine internal monitoring.
• 2025-01-08 01:00: Root cause analysis initiated, and internal incident (EMINCPKI0022) was created. The incident was immediately escalated to the engineering and compliance teams for investigation.
• 2025-01-08 02:10: Root cause identified as a misconfigured subroutine in the automated revocation process, which revoked base certificates tied to replacement orders and erroneously published the CRL with the revocation code 'Key Compromise,' which remains unchanged.
• 2025-01-08 05:00: The misconfigured subroutine in the automated revocation process was identified and immediately disabled to prevent further erroneous revocations.
• 2025-01-08 10:00: A dedicated team was deployed to contact affected customers, verify their environments, and confirm the validity of active replacement certificates.
• 2025-01-08 13:00: Internal corrective actions were implemented, including fixes to the subroutine and enhanced validation processes.
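The intervals implied by this timeline can be worked out directly from the timestamps. The snippet below is a minimal sketch using Python's standard `datetime`; the event labels are shorthand chosen for this example, while the timestamps are copied from the entries above (all IST).

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# Timestamps copied from the timeline above; the keys are shorthand labels.
events = {
    "introduced": "2024-12-29 02:00",
    "revoked": "2025-01-07 21:21",
    "crl_published": "2025-01-07 22:07",
    "detected": "2025-01-08 00:30",
    "disabled": "2025-01-08 05:00",
}


def elapsed(start: str, end: str):
    """Return the time between two labelled events as a timedelta."""
    return datetime.strptime(events[end], FMT) - datetime.strptime(events[start], FMT)


print(elapsed("introduced", "revoked"))  # 9 days, 19:21:00
print(elapsed("revoked", "detected"))    # 3:09:00
print(elapsed("detected", "disabled"))   # 4:30:00
```

On these figures, the flawed logic was in place for nearly ten days before it fired, detection followed the revocation by a little over three hours, and the subroutine was disabled four and a half hours after detection.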

It's not clear from the report what corrective actions eMudhra took. You said that the published revocation reason was set to "Key Compromise", which is incorrect. Did you fix that?

Corrective Actions
• The revocation reason for the affected certificates remains "Key Compromise" as originally published, even though the certificates themselves were valid and unaffected by any compromise.
• An end-to-end audit of the certificate lifecycle management system was conducted, and the subroutine was fixed to prevent the system from mis-processing automated reissuance requests that carry an optional parameter for revocation of associated certificates from the base order.
• Logic safeguards have been implemented to validate revocation requests before processing.
• An additional layer of validation was added to the CRL and OCSP publication processes to verify the consistency of revocation codes and prevent erroneous entries before publication.
• End-to-end testing for certificate lifecycle workflows now includes scenarios for revocation logic validation.
• Real-time monitoring tools were deployed to flag anomalies in CRL and OCSP entries, enabling quicker intervention.
• A dedicated customer outreach team was established to contact affected customers, verify their environments, and provide clear communication during incidents.
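The pre-publication consistency check described in the list above is not specified in detail, so the following is only a sketch of one way such a check could work. The function name and the dictionary-based inputs (certificate serial mapped to an RFC 5280 reason code) are assumptions for illustration, not eMudhra's implementation. It compares the reasons recorded with the approved revocation requests against the entries staged for the next CRL and reports every staged entry that was never requested or whose reason differs.

```python
# CRLReason values from RFC 5280 used in this example.
KEY_COMPROMISE = 1
SUPERSEDED = 4


def find_reason_mismatches(intended, staged):
    """Compare intended revocation reasons against staged CRL entries.

    Both arguments map a certificate serial (str) to a reason code (int).
    Returns a list of (serial, intended_code_or_None, staged_code) tuples
    for each staged entry that has no matching request or a different reason.
    """
    mismatches = []
    for serial, staged_code in sorted(staged.items()):
        intended_code = intended.get(serial)  # None if never requested
        if intended_code != staged_code:
            mismatches.append((serial, intended_code, staged_code))
    return mismatches


# Hypothetical data: one request for "superseded", two staged entries.
intended = {"0a": SUPERSEDED}
staged = {"0a": KEY_COMPROMISE, "0b": KEY_COMPROMISE}
problems = find_reason_mismatches(intended, staged)
```

A publication step could refuse to sign the CRL while `problems` is non-empty, which would have held back both the unrequested revocations and the incorrect reason code described in this incident.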

Attached file CertificateURLs.txt

As noted in Comment 1, the recommended format for crt.sh URLs is:

https://crt.sh/?sha256=[sha256 fingerprint of the certificate]

Use of crt.sh IDs, as the existing Appendix does, makes it difficult to study the corpus of affected certificates.

Can eMudhra please use the recommended format?

Flags: needinfo?(naveen.ml)

As noted in Comment 1, the recommended format for crt.sh URLs is:
https://crt.sh/?sha256=[sha256 fingerprint of the certificate]
Use of crt.sh IDs, as the existing Appendix does, makes it difficult to study the corpus of affected certificates.
Can eMudhra please use the recommended format?

Attached the certificate URLs in the recommended format.

Could eMudhra review:
(1) all comments to ensure that questions have been answered, and
(2) action items in this bug to provide us with a status update?
Thanks,
Ben

Flags: needinfo?(naveen.ml)

(1) all comments to ensure that questions have been answered

We have carefully reviewed all comments and ensured that all questions raised during this incident have been addressed. Below are specific responses to key points raised:

  1. The 311 certificates were revoked due to a misconfigured subroutine that incorrectly flagged base certificates for revocation when replacement certificates were issued. The revocation reason in the CRL was initially set to "Key Compromise" in error, and we have corrected our processes to ensure that similar cases are correctly categorized as "Superseded" moving forward.
  2. OCSP responses and CRLs were updated correctly once the revocation was completed. Validation enhancements have been introduced to prevent incorrect revocation reasons from being applied in the future.
  3. Customers using active replacement certificates did not experience service disruption. All affected customers were contacted directly to inform them of the issue and confirm that their operational certificates remained unaffected.
  4. Enhancements have been made as outlined in the table below to prevent a similar issue, including updates to the automated revocation workflow, stricter validation checks, and additional monitoring mechanisms. Quarterly audits have been introduced to verify proper revocation processing.

(2) action items in this bug to provide us with a status update?

Action Item | Status | Completion Date
Fixed subroutine logic to prevent unintended revocation of base certificates tied to replacement orders. | Completed | 2025-01-08
Conducted a thorough review of all automated revocation workflows to detect similar gaps and applied fixes where necessary. | Completed | 2025-01-17
Implemented additional validation steps before processing automated revocation requests to ensure correct classification of base vs. active replacement certificates. | Completed | 2025-01-24
Enhanced monitoring and alerting mechanisms for revocation events to flag anomalies in real-time. | Completed | 2025-01-24
Improved customer communication procedures for revocation notifications to ensure impacted certificate holders receive timely guidance. | Completed | 2025-01-14
Updated internal processes to conduct quarterly audits of revocation handling to prevent similar misconfigurations in the future. | In Progress (First audit scheduled) | 2025-03-31
We have updated our internal procedures, and any open bugs will be followed up on a weekly basis. | Completed | 2025-03-10
Flags: needinfo?(naveen.ml)

All action items are completed except for the following which is anticipated to be completed on 2025-03-31:

  1. Updated internal processes to conduct quarterly audits of revocation handling to prevent similar misconfigurations in the future.

All action items are completed except for the following which is anticipated to be completed on 2025-03-31:

  1. Update internal processes to conduct quarterly audits of revocation handling to prevent similar misconfigurations in the future.

All action items have been completed. The internal review process has been finalized, and the first review has already been conducted to ensure compliance and prevent similar misconfigurations in the future.

If there are no action items remaining, and you believe that this case can be closed, then please submit a Closure Summary:
https://www.ccadb.org/cas/incident-report#how-are-reports-closed
https://www.ccadb.org/cas/incident-report#closure-report
https://www.ccadb.org/cas/incident-report#incident-closure-summary
Thanks,
Ben

Flags: needinfo?(naveen.ml)

Report Closure Summary

  • Incident description:
    On 2025-01-08, eMudhra identified a misconfiguration in its certificate lifecycle management system, resulting in the unintended revocation of 311 domain certificates tied to base orders. A flaw in the automated revocation process mistakenly revoked valid certificates during reissuance and incorrectly applied the Key Compromise reason instead of Superseded. Despite this, replacement certificates remained valid, minimizing customer impact.
    Further investigation revealed that the subroutine logic responsible for this issue was introduced on 2024-12-29 during a routine update to the certificate lifecycle management system.
    Since all certificate replacements were issued with new keys, and attempting to rectify CRL revocation codes in issued CRLs could introduce additional risk, it was decided in consultation with customers to retain the revocation status as “Key Compromise,” as this posed no identified additional risk to the industry.

  • Incident Root Cause(s):
    The root cause of this incident was traced to a misconfigured subroutine in the automated revocation process within the certificate lifecycle management system. This subroutine mistakenly flagged base certificates for revocation.

  • Remediation description:
    To address the identified issue, we have implemented multiple corrective actions to strengthen our revocation handling process and prevent unintended certificate revocations.

  1. Subroutine Logic Fix – The revocation subroutine has been corrected to ensure that base certificates tied to replacement orders are not unintentionally revoked.
  2. Automated Workflow Review & Fixes – A comprehensive review of all automated revocation workflows was conducted to identify and rectify similar vulnerabilities.
  3. Additional Validation Steps – Enhanced validation measures have been introduced to correctly classify base and active replacement certificates before processing revocation requests.
  4. Real-Time Monitoring & Alerts – Revocation events are now monitored with improved alerting mechanisms to detect and flag anomalies immediately.
  5. Customer Communication Enhancement – Revocation notification procedures have been improved to ensure timely and clear communication with impacted certificate holders.
  6. Quarterly Audit Implementation – Internal processes have been updated to include regular quarterly audits of revocation handling, ensuring compliance and preventing future misconfigurations.
  • Commitment summary:
    eMudhra is committed to ensuring robust certificate lifecycle management and has implemented corrective actions to prevent similar incidents. The revocation subroutine has been fixed, automated workflows reviewed, and validation steps enhanced. Real-time monitoring and alerts have been introduced, along with improved customer communication. Additionally, quarterly audits have been established to ensure ongoing compliance and prevent future misconfigurations.

All Action Items disclosed in this report have been completed as described, and we request its closure.

Last call. I'll pull this up for a review and closure later next week (Wed-Fri). Please provide any additional concerns or questions before then.

Flags: needinfo?(naveen.ml) → needinfo?(bwilson)
Status: ASSIGNED → RESOLVED
Closed: 7 months ago
Flags: needinfo?(bwilson)
Resolution: --- → FIXED