Open Bug 1885568 Opened 4 months ago Updated 7 days ago

VikingCloud: Delayed revocation of TLS certificates in connection to bug #1883779

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

ASSIGNED

People

(Reporter: andreaholland, Assigned: andreaholland, NeedInfo)

Details

(Whiteboard: [ca-compliance] [ov-misissuance] [leaf-revocation-delay] 2024-06-26)

Attachments

(2 files)

Steps to reproduce:

Incident Report

This is a preliminary report.

Summary

While investigating incident https://bugzilla.mozilla.org/show_bug.cgi?id=1883779, we identified 3167 OV certificates impacted by the same Subject RDN attribute order reversal. Based on that incident, the affected certificates should be revoked within 5 days according to Section 4.9.1.1 of the TLS Baseline Requirements (revoke by 2024-03-16 14:53 UTC). However, we will not achieve the 5-day timeline because of several factors that will be detailed in the full incident report, along with the list of OV certificates whose revocation was delayed.

Assignee: nobody → andreaholland
Status: UNCONFIRMED → ASSIGNED
Type: defect → task
Ever confirmed: true
Whiteboard: [ca-compliance] [ov-misissuance]

Can you confirm that the 3000-some certificates represent almost every single valid certificate of Viking Cloud?
I believe that to be the case.

If a CA mis-issues their ENTIRE certificate base and then can't follow guidelines to revoke them in a suitable time...why are they trusted?

Our full report will give details about this delay, including the problem areas and our plans to improve. The full report will be posted no later than March 29, 2024.

Incident Report

Summary

VikingCloud had issued TLS certificates with a reversed Subject RDN attribute order as described in https://bugzilla.mozilla.org/show_bug.cgi?id=1883779. According to TLS BR Section 4.9.1.1 #12, the affected certificates should have been revoked within 5 days.

Our reissuance and revocation process was initially slowed because our ticketing system delayed our ability to send email notifications to all relevant Subscribers, ultimately impacting response time. While our systems would have allowed us to revoke all the certificates within the 5-day timeframe, Subscribers had not reissued all impacted certificates and we did not have confirmation of Subscriber installation within that timeframe. Additionally, the continuity of some Subscribers' critical infrastructure depended on the affected certificates. Although the Subscribers had processes in place to handle emergency situations, they determined that pushing certificate reissuance at emergency speed, skipping normal installation steps such as quality assurance and security testing, would impose higher risks in this particular situation. Therefore, revocation of the affected certificates was delayed to avoid negatively impacting Subscribers' services and their end-users.

It is important to note that no security impact or misuse of the misissued certificates has been identified, as the information contained in the certificates was valid. Regardless, the misissuance event was an opportunity to exercise our large-scale reissuance and revocation process and to engage with Subscribers regarding the revocation requirements in our CPS as well as the TLS Baseline Requirements. Effective preventive actions will be put in place to reduce the likelihood of recurrence.

Impact

3082 certificates were not revoked in the 5-day period.

Even though the non-compliance incident does not pose a security risk and the Subject information provided in the certificates was accurate, industry rules require that we revoke certificates not issued in full compliance with the TLS Baseline Requirements. We will comply with the requirements and revoke the affected certificates. The expected final revocation date is April 30, 2024.

Timeline (All times are UTC)

2024-03-11 14:53 – Confirmed that other OV certificates were affected.
2024-03-11 17:04 – Notifications to Subscribers began.
2024-03-12 14:50 – Sales organization engaged to aid Subscriber contact and began setting up meetings with heavily impacted Subscribers to agree on a plan of action.
2024-03-12 20:45 – Subscriber notified us of email communication issues.
2024-03-14 17:43 – A second round of email notifications sent out.
2024-03-15 16:20 – Preliminary report for delayed revocation created.
2024-03-16 14:53 – TLS BR-required revocation deadline for the affected OV certificates; this deadline was missed (see the arithmetic sketch below).
Continuing to work with Subscribers on action plans for reissuance, installation, and revocation.
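
For reference, the required revocation deadline in the timeline above is simply the confirmation timestamp plus 120 hours. A minimal sketch of that arithmetic (illustrative Python, not part of our production tooling):

    # Verify that the required revocation deadline equals confirmation + 120 hours
    from datetime import datetime, timedelta, timezone

    confirmed = datetime(2024, 3, 11, 14, 53, tzinfo=timezone.utc)  # confirmation time from the timeline
    deadline = confirmed + timedelta(hours=120)                     # TLS BR 4.9.1.1 five-day limit
    print(deadline.isoformat())  # -> 2024-03-16T14:53:00+00:00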

Root Cause Analysis

Root Cause # 1

Why was there a problem?

Because the 5-day revocation requirement for a misissuance event was not met for several of the impacted certificates.

Why was the 5-day revocation requirement missed?

Because the Subscriber did not take immediate action.

Why did the Subscriber not take immediate action?

Because our ticketing system caused a delay in reaching out to inform Subscribers about this situation and the need for revocation.

Why did our communication plan and system not properly inform Subscribers?

Because the communication didn’t reach the Subscriber who had the ability to take immediate action on the request.

Why did the communication not reach the Subscriber who had the ability to take immediate action on the request?

The following are some, but not all, of the specific reasons why this communication did not properly inform the Subscriber:

  • Subscriber contact information was outdated in the Subscriber Portal due to change in personnel within the Subscriber organization.
  • Subscriber mailbox rules dropped the communication into a folder which was not viewed as often by the Subscriber.
  • Subscriber mail systems had labeled the communication as possible SPAM.
  • The communication system did not send the messages to all of the Subscriber contact email addresses that had been entered.
  • The communication system had an issue that caused support-initiated emails to be treated as though they were Subscriber-created issues, triggering an incorrect automated response.

In addition to this communication breakdown, the situation was also impacted by another root cause:

Root Cause # 2:

Why was there a problem?

Because the 5-day revocation requirement for a misissuance event was not met for several of the impacted certificates.

Why was the 5-day revocation requirement missed?

Because Subscribers in critical industries and more heavily impacted Subscribers needed more time for installation of reissued certificates before the revocation of the misissued certificates.

Why did the Subscribers need more time?

Because manual replacement, especially on devices without automation and on critical infrastructure, requires more time and change control due to the risk of error.

Why was the risk of error a factor in manual replacement?

Because rapid replacement, which involves skipping normal installation steps like security testing, was deemed too high a risk for the Subscriber organizations in this situation.

Why did the risk of error outweigh the need to rapidly replace?

Because an outage caused by rapid replacement could directly impact the Subscribers’ end-users and introduce potential security issues, especially in some of the industries and on some of the devices on which these certificates were installed.

Lessons Learned

What went well

Once the appropriate Subscriber contact had received the correct email, they reached out to support with questions and for help.

What didn’t go well

Subscriber Notification
  • The notices sent out to Subscribers took longer than anticipated:
    • Some of the initial emails did not include the misissuance event content and instead sent a “Thank you for contacting support” notice.
    • We ran into issues with the notification system sending to only one of the contact email addresses instead of all of them.
  • Some Subscribers had not kept the contact information on their accounts up to date, so the proper contacts were not aware of the need to reissue or of the revocation requirement.
  • Notifications back from Subscribers confirming installation completion were not properly linked to the original reissuance emails, so matching them up prior to revocation was time-consuming.
Subscriber Infrastructure
  • Subscribers with wildcard certificates in use across multiple devices required more time to install replacements on all the impacted devices.
  • A large number of our Subscribers do not utilize the automation we currently have in place due to security concerns, system isolation, or device requirements.
  • Subscribers who were starting to implement automation ran into issues with certificates spread across both automated and non-automated platforms.
  • Subscribers who integrate with multiple divisions and/or organizations needed time to reach out to get an action plan in place.

Where we got lucky

We were lucky that this misissuance event was not a security incident requiring 24-hour revocation.

Action Items

Action Item | Kind | Due Date
Investigate new system for misissuance notifications and responses | Mitigation | 2024-10-23
Set up annual request for Subscribers to check the contact information provided to us | Prevent | 2024-06-26
Annual test of misissuance process in our risk assessment procedure | Prevent | 2024-07-31
Update ACME implementation to provide more Subscriber value | Mitigation | 2024-12-18
Improve Subscriber adoption of automation | Mitigation | 2024-12-18
Communicate process improvements to Subscribers in the event of misissuance | Mitigation | 2024-06-26

Appendix

Details of affected certificates

Attached.

Number of revoked certificates

As of 2024-03-29 20:29 UTC 1177 affected certificates have been revoked.

Update

Additional EV certificates were impacted by the reversed Subject RDN attribute order described in https://bugzilla.mozilla.org/show_bug.cgi?id=1883779. The affected certificates should be revoked within 5 days according to Section 4.9.1.1 of the TLS Baseline Requirements (revoke by 2024-04-06 15:15 UTC). However, we will not achieve the 5-day timeline. The list of EV certificates with delayed revocation will be provided by Monday, April 8.

Number of revoked certificates

As of 2024-04-05 21:00 UTC 1424 affected OV certificates have been revoked.

Number of revoked certificates

As of 2024-04-12 21:00 UTC 1615 affected OV certificates have been revoked.
As of 2024-04-12 21:00 UTC 8 affected EV certificates have been revoked.

Number of revoked certificates

As of 2024-04-19 21:00 UTC 1871 affected OV certificates have been revoked.
As of 2024-04-19 21:00 UTC 19 affected EV certificates have been revoked.

Whiteboard: [ca-compliance] [ov-misissuance] → [ca-compliance] [ov-misissuance] [leaf-revocation-delay]

Number of revoked certificates

As of 2024-04-26 21:00 UTC 2291 affected OV certificates have been revoked.
As of 2024-04-26 21:00 UTC 22 affected EV certificates have been revoked.

Update

A small number of customers who were more heavily impacted will be unable to meet the April 30th deadline for some of their critical certificates. A further update will be provided on Friday. We are actively revoking all misissued certificates and working diligently with these customers to ensure that any further delay will be of short duration.

Number of revoked certificates

As of 2024-05-03 21:00 UTC 2909 affected OV certificates have been revoked.
As of 2024-05-03 21:00 UTC 23 affected EV certificates have been revoked.

Number of revoked certificates

As of 2024-05-10 21:00 UTC 2995 affected OV certificates have been revoked.
As of 2024-05-10 21:00 UTC 23 affected EV certificates have been revoked.

Number of revoked certificates

As of 2024-05-17 15:00 UTC 3082 affected OV certificates have been revoked.
As of 2024-05-17 15:00 UTC 25 affected EV certificates have been revoked.

All affected certificates have been revoked. Please set the next update to 06-26-2024.

I would recommend pushing an update to your CRL. At least one certificate shows as revoked via OCSP but not on the CRL as of 18:00 UTC (i.e. 3 hours after the claim was made):
https://crt.sh/?sha256=0e2e81e5e83e25dd344db8add0e60310d024eb909dadff91ca89ffb1356c9e20

While checking this, I found that the CRL for EV certs (http://crl.vikingcloud.com/VCEVCA_L1.crl) last had an entry added on May 3rd, but says its last update was May 17th at 09:00 UTC.
Another CRL, for OV certs (http://crl.securetrust.com/OVCA2_L1.crl), shows its last entry was added on May 15th, and its last update was May 17th at 10:00 UTC.
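
For anyone who wants to reproduce this check, here is a minimal sketch (Python with the cryptography package; the _utc accessors need version 42 or later, older versions expose .last_update / .revocation_date instead) that downloads each CRL quoted above and compares its stated update time with its newest entry:

    # Compare a CRL's thisUpdate/nextUpdate with the date of its newest entry.
    # A stale CDN copy typically shows a recent thisUpdate but no recent entries.
    import urllib.request
    from cryptography import x509

    CRL_URLS = [
        "http://crl.vikingcloud.com/VCEVCA_L1.crl",   # EV CRL quoted above
        "http://crl.securetrust.com/OVCA2_L1.crl",    # OV CRL quoted above
    ]

    for url in CRL_URLS:
        der = urllib.request.urlopen(url).read()
        crl = x509.load_der_x509_crl(der)
        newest = max((entry.revocation_date_utc for entry in crl), default=None)
        print(url)
        print("  thisUpdate:", crl.last_update_utc)
        print("  nextUpdate:", crl.next_update_utc)
        print("  newest revocation entry:", newest)
        # To check a specific certificate, look up its serial number (e.g. via
        # the crt.sh link above) and call:
        #   crl.get_revoked_certificate_by_serial_number(serial)  # None if absent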

Flags: needinfo?(andreaholland)

The timing of our last Bugzilla update about the completed revocations did not take the CDN CRL caching time into account. Thank you for bringing this to our attention. Upon seeing your post, we cleared our CDN cache, which pushed the latest CRLs to the endpoints.

Flags: needinfo?(andreaholland)

Hi Andrea,
You mention critical industries / critical infrastructure in your incident report. Can you be more specific about the type of customers you're talking about? I didn't see any categorization in that regard.
Thanks,
Ben

Flags: needinfo?(andreaholland)

The following types of customers were impacted: Government entities, healthcare technology providers (hospitals, labs, virtual medical services, etc.), telecommunications, energy providers, finance (banking, payment services, etc.), transportation, and essential business chains.

Flags: needinfo?(andreaholland)
Whiteboard: [ca-compliance] [ov-misissuance] [leaf-revocation-delay] → [ca-compliance] [ov-misissuance] [leaf-revocation-delay] 2024-06-26

If there are no further comments, please set Next Update to 2024-06-26.

(In reply to Andrea Holland from comment #18)

If there are no further comments, please set Next Update to 2024-06-26.

Andrea,

A five-day revocation event took more than two months to complete. It’s difficult to blame this late revocation on a glitch in your ticketing system that, according to your timeline, was solved in three days. I do not see any discussion on this bug about the true root cause of this massive delay. I do not see any action items to address this root cause. This bug does not meet expectations without these elements.

I will remind you of what has been repeatedly stated on many other current Bugzilla incidents, which is that Subscribers’ stated unwillingness to install replacement certificates within 120 hours is not the root cause of your failure to revoke, and encouraging Subscribers to become agile in their certificate management is insufficient as an action item to address this issue.

Also, comment 17 does not meet the requirements of Mozilla’s policy, which states in part,

The decision and rationale for delaying revocation will be disclosed in the form of a preliminary incident report immediately; preferably before the BR-mandated revocation deadline. The rationale must include detailed and substantiated explanations for why the situation is exceptional. Responses similar to “we do not deem this non-compliant certificate to be a security risk” are not acceptable. When revocation is delayed at the request of specific Subscribers, the rationale must be provided on a per-Subscriber basis.

Rather than setting Next Updates that are weeks in the future, I would like to see VikingCloud focus on these gaps in this incident report. Please provide a root cause analysis that addresses the real root cause of this failure and list action items that the community can reasonably expect to rectify this problem. Please provide detailed and substantiated explanations on a per-subscriber basis.

As mentioned in the Root Cause Analysis section, the issues with Subscriber communications were due to multiple factors. As our timeline shows, sending a second round of notifications did not resolve the issues underlying the communication root cause. In many Subscriber situations, several communications (email and telephone) were required to help impacted customers understand the misissuance, the actions that needed to take place, and the people within their organization who had to perform those actions. It is important to rectify the communications issues that occurred so that, in any future mass revocation event, Subscribers are better able to address revocations in a timely manner, thereby reducing the impact on relying parties. We have action items around improving communications with Subscribers.

Since this was the first time that most of our Subscribers had been required by us to revoke and replace their certificates within mandated timeframes, their emergency replacement processes, procedures, and tools had not been tested. This misissuance event also served as an ideal opportunity to interact with impacted Subscribers: to push them toward automation or private PKI for non-public TLS use cases, to discuss additional solutions, and to help them be better prepared for the next security-style event that impacts the entire WebPKI. As an action item, we are reaching out to each impacted Subscriber to discuss improvements that streamline their certificate replacement process and prevent a recurrence of this event.

Another of our action items is a technological improvement: upgrading our ACME implementation to include ARI (ACME Renewal Information), which addresses some of the renewal-related shortcomings of our current ACME implementation. Having Subscribers update their processes and their interactions with TLS benefits the WebPKI ecosystem, as every Subscriber that moves toward automation or private PKI means less impact and risk to relying parties overall.
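
For context, a rough sketch of what an ARI lookup looks like from the client side, based on the public draft-ietf-acme-ari specification (the directory URL and certificate path below are placeholders, not our production endpoints, and the exact certID encoding should be checked against the current draft):

    # Query an ACME server's renewalInfo (ARI) endpoint for a certificate's
    # suggested renewal window, per draft-ietf-acme-ari.
    import base64
    import json
    import urllib.request
    from cryptography import x509

    def b64url(data: bytes) -> str:
        return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

    def ari_cert_id(cert: x509.Certificate) -> str:
        # certID = base64url(AKI keyIdentifier) "." base64url(serialNumber bytes).
        # The serial is encoded as the DER INTEGER content bytes; the "+ 8" keeps
        # the leading zero byte DER requires when the serial's high bit is set.
        aki = cert.extensions.get_extension_for_class(x509.AuthorityKeyIdentifier).value
        serial = cert.serial_number.to_bytes((cert.serial_number.bit_length() + 8) // 8, "big")
        return f"{b64url(aki.key_identifier)}.{b64url(serial)}"

    directory_url = "https://acme.example.com/directory"  # placeholder ACME directory
    with urllib.request.urlopen(directory_url) as resp:
        directory = json.load(resp)

    with open("cert.pem", "rb") as f:                     # placeholder certificate path
        cert = x509.load_pem_x509_certificate(f.read())

    with urllib.request.urlopen(f"{directory['renewalInfo']}/{ari_cert_id(cert)}") as resp:
        info = json.load(resp)
    print(info["suggestedWindow"])  # {"start": "...", "end": "..."}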

Action Items Update

Action Item | Kind | Due Date
Investigate new system for misissuance notifications and responses | Mitigation | 2024-10-23
Set up annual request for Subscribers to check the contact information provided to us | Prevent | Completed
Annual test of misissuance process in our risk assessment procedure | Prevent | 2024-07-31
Update ACME implementation to provide more Subscriber value | Mitigation | 2024-12-18
Improve Subscriber adoption of automation | Mitigation | 2024-12-18
Communicate process improvements to Subscribers in the event of misissuance | Mitigation | Continuous

(In reply to Andrea Holland from comment #20)

As mentioned in the Root Cause Analysis section, the issues with Subscriber communications were due to multiple factors. In our timeline, the sending of a second round of notifications did not resolve the issues underlying the communication root cause.

In my opinion, communication issues cannot really be considered a root cause of a delayed revocation, because no subscriber communication is required in order to revoke a certificate. Do you agree with this? Under 9.6.3(8), Subscribers must agree to a legally binding acknowledgement and acceptance of the possibility of immediate revocation.

There are really only two things that can cause a delayed revocation:

  1. a technical or operational failure of a CA to complete an attempted revocation, for which the root cause will likely be a failure to ensure that those capabilities are always functioning, or;
  2. a deliberate decision by the CA to not revoke a certificate when they are technically capable of it, for which the root cause will virtually always be either an incorrect revocation policy or a truly unforeseeable situation in which intolerable harm to the web ecosystem would result from timely revocation

It has to be “can’t due to literal inability to generate the revocation entry in the CRL” or “won’t because of decision to put disruption avoidance ahead of web PKI integrity due to the exceptional scale of harm that would result”.

Subscriber communication may help your subscribers avoid outages due to limits on their operational capability around certificate management, and I am certainly supportive of you helping them that way, but that is not an element of your commitments to the web PKI under the BRs or MRSP.

This misissuance event also served as an ideal opportunity to interact with impacted Subscribers to push them in the direction of automation, private PKI for non-public TLS use cases, or discuss additional solutions and to be better prepared for the next security-style event which impacts all WebPKI.

This is an excellent step to take, in that it might eliminate an unforeseeable circumstance of exceptional harm.

Is there a reason that you are not also contacting the subscribers who were not affected in this particular incident to push them similarly? Were the subscribers affected by the underlying incident unusually likely, among your population of subscribers, to have these issues of missing automation or misapplication of public web PKI? If this is likely to affect other subscribers who didn’t receive misissued certificates in this incident, then it seems that the proactive response would be to also contact them.

Another one of our action items is technological improvements by upgrading our ACME implementation to include ARI which addresses some of the shortcomings of non-renewal functionality of our current ACME implementation.

As with Subscriber communication, this does not in my opinion serve as an action item to address a root cause for delayed revocation, as lack of automation by either CA or Subscriber is not sufficient justification for delaying revocation of an invalid certificate within the 5/120 limits. Do you agree with this?

While I have questions about this, I do want to say that I appreciate the thoughtfulness and candor evident in this incident report.

Flags: needinfo?(andreaholland)
