Open Bug 1910322 Opened 1 month ago Updated 2 days ago

DigiCert: Random value in CNAME without underscore prefix

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

ASSIGNED

People

(Reporter: jeremy.rowley, Assigned: jeremy.rowley)

Details

(Whiteboard: [ca-compliance])

Attachments

(5 files)

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36

Steps to reproduce:

We received a certificate problem report that indicated DigiCert might have an issue with our implementation of method 7 – DNS-based validation. We have been investigating the issue and discovered a path that could allow mis-issuance. We are still investigating the root cause and gathering a report on the certificates impacted. We will file a full report of our findings when we have the information.

For background, DigiCert supports multiple DNS-related verification processes:
a) Adding a random value or request token to a TXT record,
b) Adding a random value to a CNAME pointed to by _dcv.[domain] or another subdomain starting with a prefix as agreed upon with the subscriber,
c) Adding a random value or request token to a TXT record of a domain that is followed via CNAME,
d) Adding the random value with an underscore prefix to the CNAME as the subdomain of an authorization domain name (a rough sketch of this check appears after this list),
e) Adding the random value to a CAA record, and
f) Adding the random value to a TXT record of a subdomain that includes a prefixed underscore.
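
For illustration, here is a minimal sketch of the check that method (d) is intended to perform, written against dnspython. This is not DigiCert's implementation; the dcv.digicert.com target is borrowed from the record shapes shown later in this report's timeline, and the function name is invented for this example.

    # Sketch of a method (d) style check: the underscore-prefixed random value
    # must exist as a CNAME label under the authorization domain and point at
    # the CA's validation target. Requires the dnspython package.
    import dns.resolver

    def cname_random_value_satisfied(auth_domain: str, random_value: str,
                                     expected_target: str = "dcv.digicert.com") -> bool:
        # The underscore prefix is the point of method (d); without it, the label
        # could collide with a subdomain an ordinary user is allowed to register.
        label = random_value if random_value.startswith("_") else "_" + random_value
        name = f"{label}.{auth_domain}"
        try:
            answers = dns.resolver.resolve(name, "CNAME")
        except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer, dns.resolver.NoNameservers):
            return False
        return any(str(rr.target).rstrip(".").lower() == expected_target.lower()
                   for rr in answers)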

Spot checks performed on each of the methods above showed accurate validation information, but we initiated a thorough code review on July 27th to confirm. This review found a potential issue under (d) on our current system.

The code review found one path where a certificate could issue when the random value was used as the host in a CNAME resource record without an underscore prefix. Our preliminary investigation shows that the code that should have prepended an underscore to the random value when using the CNAME method was not working properly. The code worked in our original monolithic system but was not implemented properly when we moved to our microservices system. In some cases, underscores might not have been prepended to random values used for CNAME verification where the random value was used as a subdomain.

We also found that the bug in the code was inadvertently remediated when engineering completed a user-experience enhancement project that collapsed multiple random value generation microservices into a single service. This service began including an underscore prefix before each random value, regardless of which validation method the user chose. This project allows DigiCert to simplify the random value generation process. It also reduced customer support calls related to the manual addition of the underscore prefix, fixed a bug in our certificate management platform’s display of validation status, and inadvertently ensured that every CNAME-based verification included an underscore prefix on each random value.

We verified that, after the UX modification, certificate issuance is not possible without proper CNAME verification, and we reviewed recent issuance to confirm that certificates are not issuing without an underscore included as part of the random value.

The 24-hour revocation rule applies in this circumstance as the issue impacts domain validation. We are currently investigating the root cause and gathering a list of impacted certificate serial numbers. We will post this information in the full bug report along with a complete list of impacted certificates. The revocation process will begin as soon as we’ve identified the impacted serial numbers. We do not expect to have a delayed revocation in regard to this issue.

Assignee: nobody → jeremy.rowley
Status: UNCONFIRMED → ASSIGNED
Type: defect → task
Ever confirmed: true
Whiteboard: [ca-compliance]

Thank you for this report, Jeremy, and for the responsiveness exhibited by DigiCert. We’ve been getting inbound requests from various sources for guidance about this incident.

As you know, DigiCert, as a publicly trusted CA, has to comply with the CA/Browser Forum Baseline Requirements, which are written as a collaboration between CAs and Browsers, including DigiCert.

The Chrome Root Program does not have the authority to grant exceptions to the CA/Browser Forum TLS Baseline Requirements. However, we do recognize other programs include exceptions for delayed revocations in some exceptional circumstances.

As detailed in our policy, we evaluate all incidents on a case-by-case basis and work to identify ecosystem-wide risk or potential improvements that can help prevent future incidents. The Chrome Root Program continuously points to the factors significant to our program when evaluating incidents which include (but are not limited to):

  • a demonstration of understanding of the root causes of an incident,
  • a substantive commitment and timeline to changes that clearly and persuasively address the root cause,
  • past history by the Chrome Root Program Participant in its incident handling and its follow through on commitments, and,
  • the severity of the security impact of the incident.

When evaluating an incident response, the Chrome Root Program’s primary concern is ensuring that Browsers, other CA Owners, users, and website developers have the necessary information to identify improvements, and that the Chrome Root Program Participant is responsive to addressing identified issues. Outside egregious cases (e.g., abject security failures), we do not make trust decisions on individual incidents, and always consider the wider context.

We note the ongoing discussions in community forums that seek to minimize impact for revocation activities, and will be looking for opportunities to strike the right balance between the security of the Web PKI and ecosystem considerations.

I will note that currently there seems to be confusion on DigiCert's internal list of impacted certificates and how they are displayed to subscribers. I have heard directly from parties of email-validated certificates being impacted, so there is a high likelihood of an overbroad initial revocation.

While this is preferable to a partial revocation list, it would explain the high volume of customer outreach at present. There is also an additional complication of resellers not providing the revocation information to their subscribers. Please consider this internally as you analyze the situation and prepare a report.

We wanted to update the community on the status of this issue. A full incident report is forthcoming. We have identified 83,267 certs impacting 6,807 subscribers. We are planning to begin revoking within the 24-hour time window. Some of these customers have been able to reissue their certificates. Unfortunately, many other customers operating critical infrastructure, vital telecommunications networks, cloud services, and healthcare industries are not in a position to be revoked without critical service interruptions. While we have deployed automation with several willing customers, the reality is that many large organizations cannot reissue and deploy new certificates everywhere in time. We note that other customers have also initiated legal action against us to block revocation.

We have been working around the clock with customers and remain committed to adhering to the BRs. We are aware of and are participating in the active industry discussion happening about the applicability of revocation timelines given the widespread impact and the relative severity of incidents. Our response to this incident to date reflects our commitment to comply with the BRs. We also note that browsers have mentioned that delayed revocation might still be acceptable under “exceptional circumstances.” However, given no clear definition of what would constitute an exceptional circumstance, we are seeking root store feedback as soon as possible, as we stand ready to begin revocations within the timeline.

The 24-hour timer starts when the CA is notified of the issue, or when the issue is identified without a third-party report. I fear that DigiCert are acting under the impression that the timer begins when the final certificate list is generated, or when subscribers are informed?

I will not reiterate the discussion held in #1896053, but will note it is pertinent to this incident as well.

For clarity, we expect customers will have all impacted certs remediated within 120 hours of our having the serial numbers. For those ready sooner, we'll begin revocations.

Can we have some more clarity on certificate count, please?
"approximately 0.4% of the applicable domain validations we have in effect" - from your website and LinkedIn CEO post.

Can you confirm what 'applicable domain validations we have in effect' means, please?
0.4% of all domain validations? 0.4% of all DNS-based? 0.4% of the specific method of DNS validations?

"We note that other customers have also initiated legal action against us to block revocation."
Action with no legal basis, if the customers have agreed to your terms & conditions and subscriber agreements, correct?
How can legal action have merit when clearly stated in your legal documentation and industry guidelines?

If there are customers attempting to bypass these guidelines through legal means affecting all internet relying parties, they should be named.

"Unfortunately, many other customers operating critical infrastructure, vital telecommunications networks, cloud services, and healthcare industries are not in a position to be revoked without critical service interruptions. "
Very specific details are needed. Customer names, certificate information, and explicit details as to why they cannot replace.

The suggestion from your own colleague:
"For organizations that can’t, we would have transparent information about the existence of the issue, the nature of the problem, and the severity (feasible timeline). "
"And furthermore, it would be better if these disclosures came directly from the subscribers, because they are the only ones in a position to know the ground truth. Subscribers could be held publicly accountable for the accuracy and reasonableness of their determination and need."

...seems to imply this should be possible, no? Why not start now and lead by example?

"The 24-hour revocation rule applies in this circumstance as the issue impacts domain validation. "
How are we now already talking about 120-hour revocation periods?

DigiCert, please note that if a delayed revocation is happening, a preliminary incident for it should be filed ASAP, as the questions regarding this incident and the reasons for the delay are going to be different. Also, we’re already seeing questions about the delayed revocation pop up.

Flags: needinfo?(jeremy.rowley)

To answer everyone’s question:

I will note that currently there seems to be confusion on DigiCert's internal list of impacted certificates and how they are displayed to subscribers. I have heard directly from parties of email-validated certificates being impacted, so there is a high likelihood of an overbroad initial revocation.

The confusion is around mixed certificates where one domain was verified using a non-impacted method and another was verified using CNAME without an underscore. The data does overcount, but only because some underscores were included in random values. We cannot distinguish between those that had underscores and those that did not within the OEM system and are therefore revoking all of them.

While this is preferable to a partial revocation list, it would explain the high volume of customer outreach at present. There is also an additional complication of resellers not providing the revocation information to their subscribers. Please consider this internally as you analyze the situation and prepare a report.

The total number of impacted certificates is 83,267. The primary source of confusion is the 24-hour notice that was done by email. We added an in-console message to alert users about the revocation but communicating this information in a short period of time outside of email proved difficult.

Can you confirm what 'applicable domain validations we have in effect' means, please?
0.4% of all domain validations? 0.4% of all DNS-based? 0.4% of the specific method of DNS validations?

Yes – DigiCert has one main validation system for high-volume issuance (CDNs and cloud providers specifically) and one for lower volume. The reason for the separation is the amount of control a user wants over their certificate process. 0.4% is the total over just OEM (the lower-volume issuing system). We confirmed that the CNAME method in the higher-volume issuance system (CIS) worked correctly.

Action with no legal basis, if the customers have agreed to your terms & conditions and subscriber agreements, correct? How can legal action have merit when clearly stated in your legal documentation and industry guidelines?

Temporary Restraining Orders (TROs) are designed to be temporary while the facts are figured out. Courts routinely grant these to prevent material harm. TROs are legally binding. We did receive a TRO in connection with this revocation.

If there are customers attempting to bypass these guidelines through legal means affecting all internet relying parties, they should be named.

I will need to work with my legal counsel to see if we can name them. I have no idea what’s allowed in these circumstances.

"Unfortunately, many other customers operating critical infrastructure, vital telecommunications networks, cloud services, and healthcare industries are not in a position to be revoked without critical service interruptions. " Very specific details are needed. Customer names, certificate information, and explicit details as to why they cannot replace.

This will be coming in the delayed revocation bug.

The suggestion from your own colleague: "For organizations that can’t, we would have transparent information about the existence of the issue, the nature of the problem, and the severity (feasible timeline). " "And furthermore, it would be better if these disclosures came directly from the subscribers, because they are the only ones in a position to know the ground truth. Subscribers could be held publicly accountable for the accuracy and reasonableness of their determination and need." ...seems to imply this should be possible, no? Why not start now and lead by example?

Yes. I would love for them to jump in on Bugzilla. I think that information would be very beneficial to the community.

How are we now already talking about 120-hour revocation periods?

That’s the long-tail timeframe cited in discussions for when all certificates can be revoked. We will not be granting that timeline, but we need to untangle the mass revocation process before we can revoke the certificates that do not involve exceptional circumstances. We will provide a burn-down on the delayed revocation bug.

DigiCert, please note that if a delayed revocation is happening, a preliminary incident for it should be filed ASAP, as the questions regarding this incident and the reasons for the delay are going to be different. Also, we’re already seeing questions about the delayed revocation pop up.

We will be filing a preliminary delayed revocation bug today and are working on a draft. Our systems were prepared to execute the entire revocation before the 24-hour mark.

As a note to anyone interested, US Court Records are public: https://www.courtlistener.com/docket/68995396/alegeus-technologies-llc-v-digicert/

DigiCert's public communication about this incident (which has been quoted in at least one news report) gets the security impact of the noncompliance completely wrong:

The underscore prefix ensures that the random value cannot collide with an actual domain name that uses the same random value. While the odds of that happening are practically negligible, the validation is still deemed as non-compliant if it does not include the underscore prefix.

Failing to include the underscore is considered a security risk because there is potential for a collision between an actual domain and the subdomain used for verification. Although the chance of a collision is extremely low because the random value has at least 150 bits of entropy, there is still a chance. Because there is a finite chance of collision, revocation is strictly required per CABF rules.

The actual reason for the underscore is so that services which allow users to create DNS records at subdomains (e.g. dynamic DNS services) can block users from registering subdomains starting with an underscore and be safe from unwanted certificate issuance. It serves the same purpose that /.well-known does for Agreed-Upon Change To Website, and that admin/administrator/webmaster/hostmaster/postmaster do for Constructed Email to Domain Contact. By using DNS records without underscores, DigiCert has violated a security-critical assumption that these services have made.
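
To make that assumption concrete, a provider-side policy might look like the following sketch. This is illustrative only and not any real provider's code; the names are invented.

    # Hypothetical label policy for a service that lets users create subdomains
    # (e.g. a dynamic DNS provider). Rejecting labels that begin with an underscore
    # keeps users from creating the records that DNS-based validation looks for --
    # but only if the CA actually requires the underscore prefix.
    RESERVED_PREFIXES = ("_",)  # covers _dcv, _acme-challenge, _<random-value>, etc.

    def user_label_allowed(label: str) -> bool:
        label = label.strip().lower()
        if not label or label.startswith(RESERVED_PREFIXES):
            return False  # underscore-prefixed labels are reserved for the zone operator
        # other provider-specific rules (length, charset, blocklists) omitted
        return True

    # If a CA accepts a CNAME at <random-value>.ddns.example (no underscore), this
    # policy no longer protects the parent domain: user_label_allowed("ry4kx...") is True.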

Therefore, this is truly a security-critical incident, as there is a real risk (not a negligible 2^-150 risk as implied by DigiCert) that this flaw could have been exploited to get unauthorized certificates. Revocation of the improperly validated certificates is security-critical.

It's troubling that DigiCert is no longer treating this with the urgency required by the Baseline Requirements.

It's also troubling that DigiCert has disseminated misinformation to the public that minimizes the security impact of the noncompliance.

I had to split this into two parts as the size is too big for one file.

I had to split this into two files

There are 632 crt.sh links in these files that are invalid, e.g. https://crt.sh/?sha256=7D445A62932D377A83D444842953AF0804E5ACF5 is too short to be a SHA-256 hash.
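
A quick way to flag such malformed links, as a minimal sketch (the file name, and the assumption that each link sits in its own CSV field, are hypothetical):

    # Yields crt.sh links whose sha256 parameter is not a 64-character hex string.
    import csv
    import re
    from urllib.parse import urlparse, parse_qs

    SHA256_HEX = re.compile(r"^[0-9A-Fa-f]{64}$")

    def invalid_crtsh_links(path: str):
        with open(path, newline="") as fh:
            for row in csv.reader(fh):
                for field in row:
                    if "crt.sh/?sha256=" in field:
                        value = parse_qs(urlparse(field).query).get("sha256", [""])[0]
                        if not SHA256_HEX.match(value):
                            yield field

    # for link in invalid_crtsh_links("impacted_certificates.csv"): print(link)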

Can we have a column with a human-readable attribute (such as the Common Name) in the CSV, so that not everyone has to open thousands of URLs by themselves on crt.sh?

Has DigiCert confirmed whether this vulnerability has been exploited against major service providers that permit the creation of custom subdomains?

For instance, No-IP (https://www.noip.com) allows the creation of arbitrary labels under their domains like ddns.net, enabling users to easily create subdomains such as <some_arbitrary_string>.ddns.net and map them as CNAMEs to any target.

In this scenario, this bug could have allowed an attacker to obtain a valid TLS certificate for ddns.net.

In the CSVs provided by DigiCert there are 83_267 unique serials and 166_397 crt.sh links (137 have #N/A in the precert column).
Please note that crt.sh is Precertificate-heavy for DigiCert (see https://crt.sh/cert-populations?group=RootOwner).
I did a lookup of all serials and, based on 84 batch requests to crt.sh between 2024-07-31T20:06:00Z and 2024-07-31T21:06:00Z, this was the result:

Precertificate | Leaf certificate | Count  | Percentage | Note
-              | 0                | 137    | 0.16%      | These all have #N/A in the precert column
0              | 1                | 2_105  | 2.53%      |
1              | 0                | 71_732 | 86.15%     |
1              | 1                | 9_293  | 11.16%     |

These are the match counts based on the serial and SHA-256 fingerprint combination. Only 13.69% of the leaf certificates are found, while 97.31% of the precertificates are found. Given these numbers, it is not surprising that the 137 certificates without precertificates cannot be found.

All 92_423 matched certificates and precertificates are included in this bzip2-compressed attachment in tab-separated values format.
These are certificates for 172_047 unique domains, of which 20_702 are wildcards and 71 are IP addresses (63 IPv4 and 8 IPv6).
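
For reference, the breakdown in the table above can be reproduced from per-serial lookup results with a tally along these lines; the TSV column names here are assumptions, not the actual layout of the attachment:

    # Tallies how many serials were found on crt.sh as precertificate only,
    # leaf only, both, or neither.
    import csv
    from collections import Counter

    def tally(tsv_path: str) -> Counter:
        counts = Counter()
        with open(tsv_path, newline="") as fh:
            for row in csv.DictReader(fh, delimiter="\t"):
                pre = row.get("precert_found") == "1"
                leaf = row.get("leaf_found") == "1"
                counts[(pre, leaf)] += 1
        return counts

    # counts[(True, False)], counts[(False, True)], etc. correspond to the rows above.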

Incident Report

Summary

DigiCert verified domains using a random value in a CNAME without an underscore prefixed to the random value; the impact is limited to issuance through DigiCert's OEM validation system. Validation paths through CertCentral and CIS, DigiCert’s high-volume issuance engine for cloud providers, validated domains correctly and are unaffected.

Impact

83,267 valid certificates were issued based on this method. A list of active impacted certificates is attached as a CSV file to this bug report. We are revoking all certificates in the database listed as using CNAME-based DNS validation that are currently valid and issued before the date of the fix. This likely overcounts the number of certificates without an underscore, but the system controls in OEM did not adequately store information about whether an underscore was present.

Timeline

Executive Summary

DigiCert is continuously deploying changes to its RA system to consolidate and improve workflows. DigiCert deployed a change to its RA system that consolidated domain validation flows to reuse random values across multiple methods. This exposed the fact that one path through the system did not include the underscore when using CNAME verification. The issue was fixed by the consolidation but not found until a third party reported it. The root causes were siloing between engineering and compliance, a failure to take certificate problem reports (CPRs) seriously if they did not include serial numbers, and a lack of engineering rigor.

Background

In 2019, DigiCert started creating a services-oriented architecture to extract all validation from the monolithic code. This plan separated the system into distinct components made up of different services: a front-end portal for customers (called CertCentral), two domain validation systems (CIS for high-volume issuance and OEM for lower-volume issuance), one org validation system, and a CA. CIS and OEM share some but not all services. In 2019, DNS CNAME was added to the services-oriented system. Both the monolithic system and the services-oriented system ran in parallel while we measured performance and workflows to ensure 1:1 parity. The monolithic code included a function in the front end that added an underscore if the customer requested a certificate for CNAME validation. The backend system did not check for the underscore, just the random value, and the underscore was not mandatory for any certificate issued through OEM.

Starting May 3, 2024, engineering initiated a project to consolidate and containerize domain validation to prepare for Multi-Perspective Validation (MPV). The goal was to execute on a directive from leadership made in January to open source our domain validation system for the community to use and deploy as needed to facilitate MPV from different global vantage points. This project also finalized the 2019 project by removing paths through our systems that are rarely used, cleaning up code, and consolidating similar processes. One of these efforts consolidated the various method 7 steps into a single path. The last few months have been spent heavily on revising our validation system, which led to confusion about what changes occurred and how the system worked. The design goal behind these changes (per engineering) was to support a request from a very large customer in 2023 to pull all DNS records at each authorization domain name and check a single random value across CNAME, TXT, or CAA resource records. The following timeline discusses all changes to the random value processing code relevant to this incident.

All times are UTC

Dec 15, 2023 – Engineering architected a plan to consolidate domain validation methods into a streamlined process that included reusing the same random values across different methods, including non-DNS-based methods as needed. This included a plan for making random value generation work uniformly regardless of method. This plan was part of a larger, customer-requested project for a cloud provider where we wanted to be able to deploy our validation system easily to different regions. Part of this plan was to consolidate to one random value generation service that works with every validation workflow.

Jan 30, 2024 at 2:24 – Engineering created a plan to consolidate workflows into a single validation flow when random values are used. The project is scheduled for Q1 by leadership.

April 16, 2024 at 21:59 – Engineering makes a change to fix a display issue in CertCentral if a validation is originally requested using one method but later completed with another method.

May 1, 2024 at 21:34 – Engineering discusses how to update the DCV methods to use the same random values across domains if the domains are related. Random values being tied to specific validation types is identified as a blocker.

May 17, 2024 at 16:47 – Random value consolidation work starts with fixes to response codes that are causing issues in the API and UX.

May 18, 2024 – JIRA discussion on how to consolidate the user experience for file auth and DNS methods.

May 21, 2024 at 16:00 – During the weekly staff meeting, the team presents the RA consolidation architecture and plan to consolidate the random value process across methods. Compliance is not involved in the architecture design meetings.

May 29, 2024 at 17:00 – Sprint planning meeting to discuss operational changes required for consolidation.

May 29, 2024 at 23:18 – Modification of DCV checking strategy in preparation for consolidation of random value checking to a single process.

May 30, 2024 – Sprint where random value service change is planned begins.

June 4, 2024 at 16:12 – Globalsign sends a DigiCert employee a personal email with a question on CNAME-based verification.

June 6, 2024 at 23:53 – Engineering deployed a change that consolidated random values to a single system that uses an underscore across all systems whether they require one or not. There were 7 processes that used a random value (File auth dynamic, File auth static, CNAME token, CNAME RND, TXT, DNS Token, CAA) subject to the plan. Although there is some overlap between CIS and OEM with these methods, each system had partially unique workflows that required consolidation. Random values were tied to the process and could not be reused across methods. The underscore path was chosen for the consolidation because engineering understood that the underscore was required in some cases but not others. There is no engineering documentation on which methods were understood as needing an underscore. Compliance was not consulted about this consolidation. After the deployment, random values can be used across different validation methods, but random values without underscores cease functioning in the system. Engineering does not provide notice of the change as it is considered technical debt and non-customer impacting.

June 7, 2024 at 19:16 – I received the Globalsign email as the forwarded message (sent June 6) was missed due to multiple travel schedules. I investigated and found that certificates without an underscore could not issue. I thought the question was about documentation and requested more information from Globalsign. I spot-checked random values in CIS and CertCentral and found them included.

June 10, 2024 at 13:00 – Globalsign confirmed that they could not issue, but that validation showed as complete. I asked engineering to fix the UX bug associated with random values (based on the email from Globalsign). Engineering was not consulted on the Globalsign email nor brought into the investigation. This is where a mistake was made, as I should have investigated further into whether something had changed.

June 11, 2024 at 11:24 – Engineering deployed a fix based on the UX request. Engineering also modified the random value format. These modifications were based on the Globalsign email but were not considered a compliance change. Engineering does not provide notice of change as customers are not expected to rely on the format of random values.

June 13, 2024 at 6:44 – Engineering merged code to remove DCV methods (ones that relied on the deprecated random value components) that were redundant and no longer necessary. Compliance was not consulted on this change. Notices are not made about the deprecation per DigiCert standard procedure (as no impact was expected).

June 18, 2024 – Our engineering team fixes another UX bug related to random values where the random value displayed to the customer is not the same as expected in the backend. Engineering fixes but does not provide notice of the change.

July 3, 2024 at 20:43 – Our support team received a certificate problem report from a researcher through revoke@digicert.com. The researcher asked whether they needed to revoke their certificates that did not include an underscore as part of the random value. The question was not about CNAME-based verification, just random values in general.

July 4, 2024 at 11:24 - Our support team responded that engineering had made a UX change to show accurate random values in the console and requested serial numbers to investigate further. This support email was not escalated.

July 7, 2024 at 23:56 - The researcher asked for clarification on the random value change. The support team responded with information about the validation process and the underscore change that occurred on June 7.

July 9, 2024 at 23:41 – The researcher asked about our documentation, which did not specify that an underscore was required in front of the random value. DigiCert support responded that the change was a trivial change to fix the UX to update the console. Support escalated to the industry standards team, who informed a browser representative about the questions and asked how to address the issue, as we were unaware of an issue and could not get a serial number from the researcher. The standards team met to review the validation process and did not find an issue with respect to DNS records. The review included a demonstration of the CNAME process but showed the random value in the CNAME as retrieved from the base domain, not the authorization domain (e.g., foo.example.com CNAME [_rnd].dcv.digicert.com), and from an authorization domain with an underscore prefix (e.g., [_rnd].domain.com CNAME dcv.digicert.com). No compliance issues were detected.

July 14, 2024 at 21:28 – The researcher responded to the request for serials stating that certificates on CertCentral are impacted and asked for clarification on what happened. No serial numbers are provided.

July 15, 2024 at 15:01 – Support again requests serial numbers and reiterates that the change was a user experience change that consolidated random values between various methods and internal workflows. 

July 15, 2024 at 21:49 - The researcher responded asking for more information on what DigiCert investigated but did not provide serial numbers.

July 16, 2024 at 15:40 – Support decides this is an industry standards or compliance question and escalates to me to answer.

July 17, 2024 at 16:33 - I ask questions about what’s going on to clarify exactly what the researcher is finding given the previous investigation and demo provided. Engineering is not consulted on the issue.

July 17, 2024 at 23:20 – The researcher resends a list of questions.

July 18, 2024 at 14:50 – Engineering deploys code to consolidate file auth and DNS methods into single process.

July 18, 2024 at 21:21 – I again tried to clarify the scope of issues as there’s confusion around why the researcher thinks that adding an underscore in front of all random values is a compliance issue. The CNAME aspect of the question is missed as random values are investigated across the methods.

July 18, 2024 at 23:16 - The researcher replied that they were upset that notice of the change was not sent to customers and that they were not getting the answers needed. Note that the researcher is correct that notice was not sent to customers. This is not an issue as notices are not typically sent for changes that are expected to have negligible impact. DigiCert’s CI/CD pipeline deploys changes daily and the review process only sends notices for material changes. The researcher claims the change was to comply with the Baseline Requirements. Spot checks were again performed on domain validation and found that the sampled certificates complied with the Baseline Requirements. Spot checks include verification of CNAME validation, TXT validation, and other types of validation that included random values as it’s still unclear whether CNAME is the issue. The spot checks were primarily new certificates that had been issued after the consolidation change.

July 20, 2024 at 03:48 - I responded that the change should have been trivial as customers should not care about the contents of a random value and asked for more details. As DigiCert splits method 7 into 6 paths, there was ambiguity about which path might have an issue.

July 22, 2024 at 17:07 – The researcher claimed a customer integration was broken because of the change made to the random value, citing a change to our onion process (and provided a link to the GitHub pull request where another third party commented on noticing the addition of the underscore). This was not considered an issue because onions are not expected to have underscore characters but could have them at the start. Onions were not considered when evaluating the end-user impact of the change.

July 22, 2024 at 22:00 – I reply that I am unaware of an outage and ask for clarification of the outage so I can investigate. An outage was discovered with onion certs where the length of the random value changed.

July 24, 2024 at 22:42 – The researcher stated that they have serial numbers and this was a test to see if DigiCert would acknowledge that there was an issue with domain validation. DigiCert had not detected a mis-issuance, but we gave the other browsers notice that there was a potential issue and informed them that we were investigating. One browser rep offered to assist with understanding the ask from the researcher.

July 25, 2024 at 18:23 – I introduced the individual to the browser rep on the continuing email thread to help identify the issue.

July 26, 2024 at 09:33 - The browser rep helped establish a list of questions that narrow the scope and describe the issue.

July 27, 2024 at 07:36 – DigiCert began code reviews of DNS methods using random values and their paths through its validation system. This review was done on the code version that existed before the random value consolidation project. Engineering found a possible path in its services-oriented system that allowed certificates to issue without an underscore prefix using CNAME validation where the host is the random value. An underscore was required for all other subdomains (e.g., _dcv.domain.com CNAME [rnd].dcv.digicert.com). This is the path described in our preliminary report.

July 28, 2024 at 03:56 – DigiCert engineering reported back the root cause and found that the service-oriented system had several paths through the system for random number generation. One of those (OEM) did not add the underscore. Consolidating the random value generator remediated the issue by ensuring that all paths included an underscore at the start of the random value, even if an underscore was not required.

July 29, 2024 at 02:17 – DigiCert filed the preliminary incident report.

July 29, 2024 at 22:36 – DigiCert identified impacted certificates and sent notices about revocation.

July 30, 2024 at 2:10-12:56 – DigiCert informed the root stores about the impact of revocation. Based on the discussion, DigiCert decided these were exceptional circumstances.

July 30, 2024 at 19:01 – The court granted the TRO prohibiting revocation. https://www.courtlistener.com/docket/68995396/alegeus-technologies-llc-v-digicert/.

July 30, 2024 at 23:12 – DigiCert filed a delayed revocation bug.

Root Cause Analysis

In August 2019, we began modernizing our domain and organization validation systems towards a service-oriented architecture with a goal of improving performance and simplifying workflows. Legacy code in CertCentral (our TLS certificate management portal) automatically added an underscore prefix to random values if a customer selected CNAME-based verification. Our new architecture redirected all validation through separate services instead of the legacy monolithic code structure. The code adding an underscore prefix was removed from CertCentral and added to some paths in the updated system. The underscore prefix addition was not separated into a distinct service. One path through the updated system did not automatically add the underscore nor check whether the random value had a prepended underscore.

We recently found that the omission of an automatic underscore prefix was not caught during the cross-functional team reviews that occurred before deployment of the updated system. While we had regression testing in place, those tests failed to alert us to the change in functionality because the regression tests were scoped to workflows and functionality instead of the content/structure of the random value. Other paths through the system either added underscores automatically or required customers to manually add the random value before verification completed. Unfortunately, no reviews were done to compare the legacy random value implementations with the random value implementations in the new system. Had we conducted those evaluations, we would have learned earlier that the system was not automatically adding the underscore prefix to the random value.
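
As an illustration of the kind of content-level check that was missing, a regression test scoped to the random value itself rather than to the workflow might look like the sketch below. The generator here is a hypothetical stand-in, not DigiCert's service; the 112-bit figure is the minimum entropy the Baseline Requirements define for a Random Value.

    # Hypothetical stand-in for the generator under test plus pytest-style checks
    # on the content/structure of the CNAME validation record.
    import re
    import secrets

    def generate_cname_validation_record(auth_domain: str) -> str:
        # Stand-in only: 40 hex characters = 160 bits of entropy, underscore-prefixed.
        return f"_{secrets.token_hex(20)}.{auth_domain}"

    def test_cname_label_has_underscore_prefix():
        label = generate_cname_validation_record("example.com").split(".")[0]
        assert label.startswith("_"), "CNAME validation label must be underscore-prefixed"

    def test_random_value_entropy_and_charset():
        value = generate_cname_validation_record("example.com").split(".")[0].lstrip("_")
        assert re.fullmatch(r"[0-9a-f]+", value)
        assert len(value) * 4 >= 112  # hex carries 4 bits per character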

On June 11, 2024, engineering completed a user-experience project that collapsed multiple random value generation microservices into a single service. This service began including an underscore prefix before each random value, regardless of which validation method the user intended to use. This project allows DigiCert to ignore the random value generation process when verifying a domain and only check whether a random value appears in an authorization domain name. This deployment also reduced customer support calls related to the manual addition of underscore characters, fixed a bug in CertCentral’s display of validation status, and inadvertently ensured that every CNAME-based verification included an underscore prefix to each random value. As before, we did not compare this UX change against the underscore flow in the legacy system.

Several weeks ago, a researcher contacted our problem report alias over email asking about random values used in validation. Although the reporter did not provide serial numbers for any certificates, DigiCert conducted a preliminary investigation. This initial investigation did not uncover any issues with random value generation or validation. After the reporter requested answers to their repeated questions without providing any certificate serial numbers, DigiCert sought guidance from external CABF participants, who suggested DigiCert conduct an additional review. Upon further review, DigiCert discovered an issue regarding the underscore prefix for random values. DigiCert then initiated this incident management process.

Lessons Learned

First, we need to have a compliance sign-off process in engineering. Engineering is expected to read and understand the standards, and we have a process for questions from engineering. However, there was no sign-off process before RA or CA deployments, nor were compliance people included in the planning process.

Second, this would have been better handled if we’d moved the emails immediately from our certificate problem notification process to MDSP. The back and forth wasted valuable time and led to confusion about both the issue and the process. The certificate problem reporting process is best equipped to handle reports that include serial numbers. Anything that does not include a serial number needs to leave that queue as quickly as possible and move to MDSP.

Third, this incident and the last one made it very apparent that our teams are badly siloed. Although we reorganized the compliance team to try to facilitate broader investigations, there is little communication between compliance and engineering. Had engineering sent the change to compliance for review, compliance would have known this was a critical change. Had compliance discussed the issue with engineering, the issue would have been caught in June. Compliance right now is verifying post-change instead of pre-change, which is causing serious issues. Amit initiated a project to shift compliance left in the dev process during the last issue, and we’ve made significant progress even though additional steps are needed. We will be adding additional remediation actions to this JIRA as we discuss internally how to ensure more technical and thorough compliance reviews done in partnership with engineering.

What went well

We were ready to revoke at the 24-hour mark and could process that number of revocations in about 1.5 hours. The DigiCert team was aligned on the plan to revoke at the 24-hour mark, even if that plan was interrupted by additional information on the impact of the revocation.

We were able to organize on the weekend to investigate the issue.

We deployed code after the last bug that permitted customers to see whether they were impacted by the issue, which helped with notification.

We notified browser representatives early.

What didn't go well

  • We were unable to revoke all certificates within the 24-hour timeframe and unable to quickly detangle the customer list of those that had exceptional circumstances vs. those that did not.

  • We never received an actionable certificate problem report. One problematic certificate would have simplified the scope. We rely too heavily on a serial for a certificate problem report and should have treated the allegation more seriously, despite not having any evidence of an issue.

  • Getting personal emails is not a good way to kick off an investigation. Those are easily missed and were not investigated as they did not constitute a certificate problem report.

  • We did not understand the original email about the issue and believed the issue was about the random value generator.

  • We should have moved the discussion to the public earlier.

  • The 24-hour turnaround for certificate problem reports meant our responses were less than ideal and didn’t include the research necessary to accurately answer the questions.

  • Our testing needs to focus as much on compliance checks as it does on customer workflow checks.

  • Notices of system changes should always be sent out, regardless of size or complexity, and especially if the changes have any relevance to the Baseline Requirements.

The ultimate root cause ended up being me. I have led the compliance team for the past several years. The fact that this went unnoticed in our many reviews during that time shows that we need a different approach to both our internal investigations and compliance controls. I also dropped the ball on the certificate problem report by failing to escalate the issue to engineering and give it the attention it deserved. Although I did some investigation, I failed to treat the allegations with sufficient seriousness based on what could have been wrong. I assumed I knew the systems and what was happening in them rather than deeply investigating the report. Finally, I didn’t do enough to eliminate the silos between compliance and engineering. We’ve done a lot to rectify those silos, including a complete reorganization of the compliance function. Unfortunately, those changes were too late to rectify the problem.

I apologize to the community and our customers for the events and circumstances that led to this incident. This incident made me realize that I am no longer the person for this role. As such, after consulting with Amit (CEO), we have agreed that the path forward is for me to tender my resignation at the company. I will definitely miss the community, browsers, and public interactions as the PKI ecosystem has been my home and life for such a long time.

Going forward, Amit has asked Tim Hollebeek to lead a task force to implement thorough technical compliance controls (that go above and beyond pkilint) and provide oversight to our engineering team and processes to ensure strict compliance. Tim has 25+ years of computer security experience and was a security architect before joining DigiCert. He is chair of several IETF working groups, has been involved with the CA/B forum for 10 years and is a leader in our industry. I have no doubt that with Tim’s background and Amit’s oversight, he and our other compliance and engineering colleagues will do what’s needed to ensure the rigor the community and our customers expect from DigiCert.

Where we got lucky

We patched the system without knowing it. That patch ultimately ended up exposing the issue, which led to a researcher reporting the problem.

The number of certificates was relatively small compared to the overall scope (2.8% of the non-ACME DNS-based validations through OEM).

Action Items

Modify the certificate problem report process to create a Bugzilla discussion for all non-compromised certificate serial numbers. The current process is to wait until an issue is confirmed. Although this will result in more “Invalid” bugs filed, the extra transparency will be valuable. – In progress

Consolidation of random value generators. – Done

Ensure all random values are prefixed with an underscore. – Done

Ensure compliance is part of each architecture review meeting and has the technical expertise to provide solutions. Compliance now has the ability to access architecture meetings but the technical expertise is lacking. – In Progress; eta August 15.

Remove the ability of Product teams to self-identify changes that require a formal compliance review. All changes in the CA and RA systems require a compliance review. Compliance team members will be included in early stakeholder reviews with the RA sprint teams and will determine changes that require formal review. – In Progress; eta August 15.

Add all applicable tests (not just entropy and functionality) for random value content/structure in the context of all validation flows. – In Progress; eta August 15.

Complete review of DCV methods with Industry Standards Development team. – In Progress; eta of August 10th

Eliminate all infrequently used processes and funnel all users through a single flow for each method. – In Progress, eta of Nov 1

Open Source DCV system for community review – In Progress; eta of December 1.

Flags: needinfo?(jeremy.rowley)

Hello,

This is Tim.

We failed to live up to your expectations. Obviously, it is going to take a bit of time for me to get up to speed, but we're going to be doing things differently going forward. We have some amazing people here at DigiCert, and they've done some amazing things, but they don't always work together internally as well as they need to. And since literally the entire planet is relying on them, that's not acceptable and is going to change.

Right now, we're working hard to get all of these certificates replaced as soon as possible, but as you can see from the actions above, we take this very seriously, and have already initiated a variety of efforts to make sure our critical processes are as technically excellent and correct as you deserve them to be.

Thank you.

-Tim

Thank you for the incident report. This is quite detailed and really allows the reader to understand the full timeline and how this mistake was introduced. I have two passing thoughts here:

First:

The 24-hour turnaround for certificate problem reports meant our responses were less than ideal and didn’t include the research necessary to accurately answer the questions.

From my understanding, it is acceptable to say "We're going to be doing a deep look into this, and we'll update you again in 24 hours" to get more time to do a deeper analysis. I do not think the 24-hour requirement for CPRs explicitly says that you have to have it fully done and complete within 24 hours, just that you have to have some response within 24 hours, and that response could be a good-faith representation that you're looking further into it.

Maybe the community can correct my understanding here.

Second:

If I'm understanding this correctly, the security impact of this incident could be quite significant if an attacker knew about this flaw. Will there be an action item to go back and check the DCVs performed since this bug was introduced in production for anomalies? For example, looking at the logs of DNS queries where the domain being queried didn't start with an _ and seeing if there are anomalies in any of those issuance patterns?

I'd be interested if the community has any thoughts on how best a search like that can be executed.
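
On the feasibility of such a search, one very rough sketch follows. The log format, field names, and file layout are entirely hypothetical; real validation logs will differ.

    # Scans hypothetical DCV audit logs (one JSON event per line) for CNAME
    # validations where the queried label did not start with an underscore,
    # then groups the hits by account for manual review.
    import json
    from collections import Counter

    def suspicious_cname_validations(log_path: str):
        hits, by_account = [], Counter()
        with open(log_path) as fh:
            for line in fh:
                event = json.loads(line)
                if event.get("method") != "dns-cname":
                    continue
                label = event.get("queried_name", "").split(".")[0]
                if label and not label.startswith("_"):
                    hits.append(event)
                    by_account[event.get("account_id", "unknown")] += 1
        return hits, by_account.most_common()

    # Clusters of hits for a single account, or hits against high-value parent
    # domains, would be candidates for comparing against actual issuance records.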

(Speaking personally, not on behalf of my employer.)

Thank you for this detailed and well-written incident report. The timeline is very clear and interesting, and I think the points about siloing between various compliance-critical departments within the same organization are very valuable insights that should be taken to heart by all CAs. The remediation items listed are clear, actionable, and make sense to me.

I believe strongly in blameless postmortem culture. Incidents like this are not the result of individual actions, but of years of systemic failures that have allowed incorrect actions to be taken. I believe this applies at all levels, from newly-hired line engineers to CISOs. Even when leadership changes do need to occur, I believe they should not be precipitated by -- nor announced as part of -- individual incidents. Doing so harms our industry, and sends the wrong message especially to junior employees.

Jeremy, if this resignation is your own idea, done for your own health and well-being, then I hope you know that the rest of this community does not hold you personally responsible or blame you for this incident. We wish you the best. If this resignation is at the prompting of DigiCert leadership, then I hope they recognize that this move reflects poorly upon them.

Thanks again for the report, and I sincerely look forward to reviewing DigiCert's open-source DCV system!

(In reply to Aaron Gable from comment #20)

(Speaking personally, not on behalf of my employer.)

Thank you for this detailed and well-written incident report. The timeline is very clear and interesting, and I think the points about siloing between various compliance-critical departments within the same organization are very valuable insights that should be taken to heart by all CAs. The remediation items listed are clear, actionable, and make sense to me.

I believe strongly in blameless postmortem culture. Incidents like this are not the result of individual actions, but of years of systemic failures that have allowed incorrect actions to be taken. I believe this applies at all levels, from newly-hired line engineers to CISOs. Even when leadership changes do need to occur, I believe they should not be precipitated by -- nor announced as part of -- individual incidents. Doing so harms our industry, and sends the wrong message especially to junior employees.

Jeremy, if this resignation is your own idea, done for your own health and well-being, then I hope you know that the rest of this community does not hold you personally responsible or blame you for this incident. We wish you the best. If this resignation is at the prompting of DigiCert leadership, then I hope they recognize that this move reflects poorly upon them.

Thanks again for the report, and I sincerely look forward to reviewing DigiCert's open-source DCV system!

Thank you for saying this, Aaron. I have similar feelings and didn’t really know an appropriate way to put them into a comment.

I’d even go as far as saying I would like the CCADB incident response template to specifically prohibit these types of mentions in incident reports.

Personnel changes are not and have never been expected or appreciated as part of an incident response.

Why? I think they are an important part of accountability. I decided to step down because of the incident. That's part of the incident report and something that should be disclosed. Knowing what changes are happening is part of transparency and ensuring accountability within the organization.

(In reply to Jeremy Rowley from comment #22)

Why? I think they are an important part of accountability. I decided to step down because of the incident. That's part of the incident report and something that should be disclosed. Knowing what changes are happening is part of transparency and ensuring accountability within the organization.

One reason: it can inadvertently set an expectation that other CAs need to make personnel changes when an incident happens. It has a chilling effect on incident reporting even if that’s not the intention.

(In reply to Jeremy Rowley from comment #22)

Why? I think they are an important part of accountability. I decided to step down because of the incident. That's part of the incident report and something that should be disclosed. Knowing what changes are happening is part of transparency and ensuring accountability within the organization.

Aaron laid out the why in his comment:

Incidents like this are not the result of individual actions, but of years of systemic failures that have allowed incorrect actions to be taken. I believe this applies at all levels, from newly-hired line engineers to CISOs.

Additionally, the lesson many people learn from shouldering blame, or from seeing blame shouldered, is to hide/bury mistakes instead of learning to uncover ways to make the mistakes easier to spot, earlier, and with more thorough understanding.

And I'd like to echo again Aaron's statement:

if this is your own idea, done for your own health and well-being, then I hope you know that the rest of this community does not hold you personally responsible or blame you for this incident. We wish you the best.

Hundreds of other humans helping to run the PKI ecosystem upon which the current Web is built have a lot to learn from these reports; a lot to learn about improving the systems, policies and procedures that keep it going. We will all be digging into the technical details of the report and finding items that can help us improve - but learning to shoulder blame is not constructive in that regard.

(In reply to Aaron Gable from comment #20)

(Speaking personally, not on behalf of my employer.)

Thank you for this detailed and well-written incident report. The timeline is very clear and interesting, and I think the points about siloing between various compliance-critical departments within the same organization are very valuable insights that should be taken to heart by all CAs. The remediation items listed are clear, actionable, and make sense to me.

I believe strongly in blameless postmortem culture. Incidents like this are not the result of individual actions, but of years of systemic failures that have allowed incorrect actions to be taken. I believe this applies at all levels, from newly-hired line engineers to CISOs. Even when leadership changes do need to occur, I believe they should not be precipitated by -- nor announced as part of -- individual incidents. Doing so harms our industry, and sends the wrong message especially to junior employees.

Jeremy, if this resignation is your own idea, done for your own health and well-being, then I hope you know that the rest of this community does not hold you personally responsible or blame you for this incident. We wish you the best. If this resignation is at the prompting of DigiCert leadership, then I hope they recognize that this move reflects poorly upon them.

Thanks again for the report, and I sincerely look forward to reviewing DigiCert's open-source DCV system!

Aaron, you took every sentiment I was going to share privately and expressed them better than I could. I don't see a single piece of this incident that is the result of an individual failing, but rather a lack of support in getting them help. Jeremy, I know you've made comments in the past that reflect you taking issues personally, but I see that more as you caring about doing your job properly. We need more people across the industry who care, so please reconsider. I see no one externally who thinks you were at all a problem throughout this, or in prior incidents.

Back to this incident: At least one organization has reached out regarding a certificate with a single wildcard SAN that was email-verified but is in the revocation group, so I don't think the mixed-verification scenario explains the full story. Now unfortunately I can't name names, but I was being serious in prior comments. I realize the irony here, when one unknown researcher was wasting everyone's time with an actual security incident and not providing serials - but that's where we are. I'd rather there be an overbroad search, but this will be why some subscribers are upset.

With regards to when the clock starts: until we get stronger recommendations from CAB/F or Root Programs, I am aiming to get CAs to be clearer on their methodology in incident reports. This makes it easier for people to review later, and to reach a consensus across incidents.

I strongly appreciate that an actual, realistic time for revocation of 1.5 hours was given, rather than a bold claim that it could be done immediately. I am hesitant about the security of the CNAME underscore method itself, but if it is recognized in the industry then I don't see it as relevant to this incident. I do think it leans too much on overlapping RFCs as a security property, but it is what it is.

(In reply to Jeremy Rowley from comment #22)

Why? I think they are an important part of accountability. I decided to step down because of the incident. That's part of the incident report and something that should be disclosed. Knowing what changes are happening is part of transparency and ensuring accountability within the organization.

I wrote my thoughts about your decision to resign privately, and won’t restate them all here (roughly: oh god, what a blow to the web), but I want to respond to this point.

“Accountability” is not about punishment or scorekeeping or setting an example pour encourager les autres. It is about having appropriate attention paid to the things that caused a problem to occur, by making that reflection and transparency the responsibility of the person who was placed to observe those things and best understand how they happened. That reflection and transparency is only valuable to the extent that it changes the future handling of related situations. I know that in your place I would be watching the director’s commentary of Looper over and over to see if I could pull off time travel even just once, but alas that’s not one of the choices available. What matters now is what happens to the web, to DigiCert’s operations, and to Jeremy Rowley as a person and community member.

When DigiCert has another incident (and while I have tremendous faith in Tim, it will happen), I would rather that they have Jeremy Rowley with his wisdom and scar tissue around to guide their response and subsequent improvement. When another CA has a crisis related to domain validation, I want their panicked CISO to be able to reach out to Jeremy Rowley for help, and to see that these crises can be used as powerful tools for change. And, more personally, when DigiCert has rolled out its changes and demonstrated that it is exactly the company that we need in such a critical role on the web, I want Jeremy Rowley to get the highest of fives at the after party. DigiCert and its customers and the web already have to pay the cost of this incident, and it would be so much better if we could also benefit maximally from what you’ve learned (so so painfully) along the way.

Of course, I have no right whatsoever to tell you that you can’t resign, Jeremy, or that you “shouldn’t” feel as you do. While I very much share Aaron’s love of blameless postmortem, I know that I can be caustic in these forums when I’m frustrated or disappointed; I hope that hasn’t contributed to you feeling that you should be punished or exiled. If this is your exit, please know that you will be missed.

I have over 24 years of experience in the DNS (Domain Name System) space. I founded and developed DNS Made Easy in 2002 and have managed DNS Made Easy and Constellix since their inception. Initially, we handled DDNS (Dynamic DNS) entries for millions of hostnames before transitioning to the authoritative DNS space, focusing on the SMB and enterprise market. Over the past 24 years, I have been responsible for serving hundreds of millions of hostnames across millions of domains. For full transparency, DNS Made Easy and Constellix were acquired by DigiCert in May of 2022.

The current DNS-related issue centers on the ability for someone to create a CNAME (Canonical Name) record without being the domain owner. Historically, some services have allowed this due to the challenges of domain ownership at the time.

From a DNS perspective, it is crucial to understand that this issue arises from services that permit third parties to create hostnames within their domain names as a service or value to their customers. This is common among DDNS (dynamic DNS) providers, who typically allow users to map a name to a dynamic IP address, thus enabling the creation of A or AAAA records (IPv4 and IPv6 addresses, respectively). Most DDNS providers are not part of this conversation since they do not permit CNAME record creation by third parties.

However, a subset of DDNS providers do allow CNAME record creation for vanity reasons. Although the practice has diminished over the years, it still exists among a few companies. Within this group, the potential problem is further divided into those allowing CNAME records with hostnames starting with an underscore and those that do not. Providers allowing underscores inherently assume the risk of certificate creation for their root domain and are not part of this discussion. This risk can be mitigated with a properly secured domain using specific CAA (Certification Authority Authorization) records, but that is another conversation entirely.

Additionally, many providers do not allow the creation of CNAME records with hostnames of 32 characters or more, which further narrows the potentially affected group. After thoroughly investigating and testing notable DDNS services, it appears that No-IP allows CNAME record creation. While free accounts on their platform are limited to 19 characters, paid users can create longer hostnames, which would have allowed the 32 characters necessary for the CNAME validation.
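
For illustration only, here is a minimal sketch, assuming the random value is a 128-bit secret rendered as 32 hexadecimal characters (one plausible encoding; actual CA implementations may differ), of how the 32-character length and the underscore prefix fit together:

import secrets

def make_cname_validation_label() -> str:
    # 128 bits of entropy, hex-encoded: 32 characters (above the BR minimum of 112 bits)
    rnd = secrets.token_hex(16)
    label = "_" + rnd            # underscore prefix marks this as validation data, not a real host
    assert len(label) <= 63      # a single DNS label may not exceed 63 octets
    return label

if __name__ == "__main__":
    print(make_cname_validation_label())   # e.g. "_9f2c...", 33 characters

A provider that caps third-party hostnames at 19 characters could never host such a label; one that permits 32 or more characters (without requiring the underscore) is the narrow case discussed above.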

Following discussions with the No-IP executive team and conducting an internal search, we were able to determine that none of these domains had a certificate created for the base domain or a wildcard certificate created. Therefore, I can confidently say that none of the potentially affected domains were impacted. Based on a DNS-level understanding of this situation, no certificates were wrongfully or maliciously created.

Please note that any DDNS provider that allows the creation of A, AAAA, or CNAME records inherently permits certificates for the individual hostnames it delegates, since the holder of a hostname can obtain a certificate for it at any time. With any DDNS (or CNAME) provider, you have always been able to perform an "HTTP-01 challenge" to request a certificate for your individual hostname. The issue was whether you could obtain a certificate for other hostnames in the domain.

This overview is presented purely from a DNS perspective, and I am not implying that the security policies of the appropriate agencies should not be followed. My goal was to assess the likelihood of such an occurrence and to cross-reference it against known domains; that search found zero collisions. From a DNS standpoint, I believe there is virtually a 0% chance that a domain certificate was wrongfully created.

First off, my peers at Sectigo and I want to echo others in applauding Jeremy for his contribution to this community and the WebPKI. Thank you for being an esteemed colleague and a friend.

Now, on to this TRO. I see that the TRO was filed on July 30, 2024. As of posting time it’s August 2. I can’t find any motion to dissolve on Pacer.

Question 1: Did you file a motion to dissolve? Can you post it here?

The BRs require

An acknowledgment and acceptance that the CA is entitled to revoke the certificate immediately if the Applicant were to violate the terms of the Subscriber Agreement or Terms of Use or if revocation is required by the CA’s CP, CPS, or these Baseline Requirements.

Question 2: Can you share the specific wording in your agreement with Alegeus that meets or was intended to meet this requirement? This may be illuminating in understanding how to protect CAs from this kind of offense in the future.

Flags: needinfo?(tim.hollebeek)

On Saturday, August 3, 2024, at 20:47 UTC, DigiCert completed the revocation of the 83,267 certificates affected by this bug, without exception. This was a large, coordinated effort across many organizations to get all certificates revoked within 5 days. We’re very thankful to everyone for working with us on such a short timescale to make sure all impacted certificates could be revoked while minimizing impact to critical infrastructure and services. Faced with incidents of similar scale, many other CAs have simply let the affected certificates expire naturally. We originally planned to revoke all of these certificates within 24 hours, but a few legal and critical-infrastructure-related concerns, as well as the number of organizations we were dealing with, caused us to take a few more days to get everything resolved and revoked safely. Those concerns have now all been dealt with.

Obviously, the best outcome would have been for the validation method to have been implemented correctly, so improving the rigor of our approval processes and continuing to tighten up our technical controls is where we will be concentrating most of our efforts going forward. It’s unfortunate that this bug was exposed as a side-effect of us simplifying and coalescing our validation systems; we intend to provide more details going forward about our new, simpler validation architecture, and we even intend to open-source it so that we can benefit from community examination of the implementations.

We’ve had our DNS experts looking closely at whether any certificates were improperly issued due to this bug and have found no evidence that any of these certificates were issued to anyone other than the intended recipients. We’ve examined the list of affected domains and compared them to the theoretical attacks that have been suggested and found in most cases the suggested actions cannot be carried out successfully, and there’s no evidence anyone even attempted to do so. If people have additional scenarios or evidence that needs to be investigated, we can do so.

Questions have also been asked about whether any S/MIME certificates are affected by this validation bug, since the S/MIME Baseline Requirements include the TLS Baseline Requirement validation methods by reference, and Mozilla Policy has long required domain validation prior to the issuance of S/MIME certificates. We have investigated and found 1,308 S/MIME certificates which were based on flawed validations using this same method, and we will be revoking those as well. A list of affected serial numbers is attached.

Flags: needinfo?(tim.hollebeek)
Attached file SMIME serial numbers

As of Aug 9 21:48:44 2024 GMT, all of the SMIME certificates have been revoked as well.

Tim,

Comment 28 has two unambiguous questions and is thirteen days old. This information will be helpful to the WebPKI community as we explore how to prevent this kind of exploitation by Subscribers dealing in bad faith in the future. I’d appreciate it if you could provide that information.

Flags: needinfo?(tim.hollebeek)

The TRO dismissal is already public record (https://storage.courtlistener.com/recap/gov.uscourts.utd.149707/gov.uscourts.utd.149707.9.0.pdf). But there are open legal matters on which our legal counsel has advised we cannot comment. And while this particular wrinkle is interesting, and while our terms are easily found on our website, we think it unwise to discuss publicly any future legal strategy to combat such actions, as such discussions are far more likely to be used against us than to help move the industry forward. If Sectigo has any relevant current or proposed strategies they would like to disclose or share, though, we're listening.

Flags: needinfo?(tim.hollebeek)

Any other questions?

Hi DigiCert,

I don't know whether this is a question, a recommendation, a request for clarification, or just highlighting a detail for the attention of other participants in the ecosystem. Regardless, I'd appreciate your thoughts.

The error that occurred in this bug was two-fold, but only one part of it was a compliance issue:

  • The domain validation system generated random values intended for use with Method 7 of the TLS Baseline Requirements, where the random value was expected to be used as a subdomain of the Applicant's Authorization Domain Name.
    • While generating such a random value without prepending an underscore character contributed to this bug, it is not itself a non-compliance.
  • The domain validation system queried DNS names and accepted them as valid Authorization Domain Names where the queried hostname was not in-scope of what could be considered a valid Authorization Domain Name.
    • This is the core non-compliance present in this incident. The domain validation system should never have been able to treat a subdomain that did not begin with an underscore character as demonstrating domain control for its parent domain.

While I'm under the impression that this is understood and has been addressed, it's not 100% clear, and the listed action items don't indicate that it has been directly addressed.
To provide clarity to the community, can you confirm that the domain validation system is no longer capable of accepting a request to query a DNS zone that cannot qualify as a valid Authorization Domain Name for the FQDN intended to be validated? Similarly, can you confirm that if a query were performed against a hostname of the format [rnd].domain.com, the resultant domain validation would be limited to the specific FQDN queried rather than its parent zone?

Phrased differently, the following points in the domain validation workflow seem to necessitate changes based on the descriptions in this bug; have changes been made to or functionality confirmed in these components to ensure a similar issue could not present itself again?

  1. Random Value Generation
  2. Requests internal to the Domain Validation System to perform a Domain Control Validation lookup for a given FQDN
  3. The Domain Validation System's interpretation of retrieved DNS records and association thereof with stored domain approvals

While my understanding is that DigiCert has addressed these things, part of my motivation for asking is to present these interactions differently with the intent that it could allow other CAs to perform additional confirmation of their own system functionality/logic.
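
To make the scope question above concrete, here is a minimal sketch of the decision being described, written in Python purely for illustration; the helper and its names are hypothetical and are not any CA's actual implementation:

from typing import Optional

def validation_scope(queried_name: str, random_value: str) -> Optional[str]:
    """Hypothetical helper: which domain, if any, may this DNS lookup validate?"""
    label, sep, parent = queried_name.partition(".")
    if not sep:
        return None                      # a bare label has no parent and proves nothing
    if label == "_" + random_value:
        return parent                    # underscore-prefixed random label: may speak for the parent
    if label == random_value:
        # No underscore: [rnd].domain.com is an ordinary hostname a third party
        # might be able to create, so at most it could speak for that exact FQDN,
        # never for domain.com.
        return queried_name
    return None                          # unrelated name: not evidence of anything

# Hypothetical values:
assert validation_scope("_abc123.example.com", "abc123") == "example.com"
assert validation_scope("abc123.example.com", "abc123") == "abc123.example.com"

Other CAs could compare their own logic against this kind of check: anything that lets the non-underscore branch speak for the parent domain reproduces the non-compliance described above.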

Flags: needinfo?(tim.hollebeek)

Hi Clint,

Thanks for posting this, these are good questions.

Yes, we've verified that the technical bug side of this issue has been fixed. We will provide more details about what we're doing to find any similar errors and prevent recurrences in an update next week.

Forward and up,

-Tim

Comment 13 and Comment 16 identified invalid entries in the list of affected certificates, which DigiCert has yet to address. Could DigiCert please provide a complete and valid list of affected certificates?

Clint, the answers are slightly complicated, since this particular incident happened as fallout from in-progress work on our validation systems, so the action plan doesn't translate into something as reusable for other CAs as it would had it happened during steady-state operations.

Also, the fix involved changes in several places only because the implementation was more complicated than it needed to be, with various pieces of code in various layers relying on implementation details in other layers. The lesson for other CAs is DON'T DO THAT. Have a clean implementation of DCV in one place that does DCV and is not responsible for anything else.
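
One way to picture that advice, as a sketch under my own assumptions rather than DigiCert's actual code, is a single module that both issues the challenge and decides what a DNS answer proves, so that no other layer re-derives record names or random values (all names here are illustrative):

import secrets
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class CnameChallenge:
    fqdn: str          # the FQDN the applicant wants validated
    record_name: str   # where the CNAME must be created, e.g. "_<rnd>.<fqdn>"
    random_value: str

def issue_cname_challenge(fqdn: str) -> CnameChallenge:
    rnd = secrets.token_hex(16)                       # 128-bit random value
    return CnameChallenge(fqdn, f"_{rnd}.{fqdn}", rnd)

def check_cname_challenge(ch: CnameChallenge, observed_record_name: str) -> Optional[str]:
    """The only place that decides what a DNS answer proves."""
    if observed_record_name != ch.record_name:
        return None                                   # wrong location: proves nothing
    return ch.fqdn                                    # validated scope is exactly the requested FQDN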

As a reminder, our plan for assuring that our DCV is bulletproof and compliant going forward is to finally complete our transition away from these complicated, historic legacy systems, and use a clean, modern implementation of DCV validation instead:

Eliminate all infrequently used processes and funnel all users through a single flow for each method. – In Progress, eta of Nov 1

Open Source DCV system for community review – In Progress; eta of December 1.

So similar issues will not present themselves in the future, because the complicated legacy code that is the root cause here is itself going away.

Flags: needinfo?(tim.hollebeek)

Andrew, thanks for pointing out that there were issues with the original list; an updated list is attached.

Flags: needinfo?(tim.hollebeek)
Attached file cname-crtsh-links.csv

Updated affected certificate list.

Anything else?

There are still 15 certificates in the list without complete details. They look like:

4b099e6d7cee5956ac3b6a32c1edc95,,This certificate is expired.
Flags: needinfo?(tim.hollebeek)

Thanks, Andrew; I'll have the assigned engineer look into what happened here.

Flags: needinfo?(tim.hollebeek)