DigiCert Validation Scope Incident

People

(Reporter: jeremy.rowley, Assigned: jeremy.rowley)

Details

(Whiteboard: [ca-compliance] Next Update - 01-October 2019)

Attachments

(2 attachments)

45.44 KB, image/png
53.90 KB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet

Plus Incident Report

  1. How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.

On May 29, 2019 at 12:55 pm PT, a DigiCert support engineer was assisting a customer and noticed the SANs in the TLS certificate did not match the specified domain validation scope. The support engineer escalated to his manager, who initiated research with both the dev and compliance teams. This launched a full investigation. The DigiCert engineering team traced the issue to a seldom-used feature of legacy DigiCert. That feature, identified as “Certificates Plus,” allowed a customer to automatically add www.domain to an order that contained domain and vice versa. Unfortunately, the feature added the domains after validation completed but before issuance instead of during the initial order process. This allowed improper issuance of example.com if only www.example.com was verified.

Engineering had previously investigated this code while the Comodo issue referenced here was pending: https://groups.google.com/forum/#!msg/mozilla.dev.security.policy/PoMZvss_PRo/TK8L-lK0EwAJ. The report back was that the system ensured proper validation. However, the individuals who performed the assessment were let go some time ago, so we cannot ask them how this was missed during the previous review. We recognize that this is an issue that required immediate attention.

  2. A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.

May 29, 2019 – Support engineer noticed a mismatch between the validation scope and the SANs
May 29, 2019 – Escalated to compliance to initiate root cause investigation.
May 30, 2019 - Root cause identified as Plus feature. Faulty logic was disabled.
May 31, 2019 - Began running reports to identify impacted certificates.
June 1, 2019 – Began internally reviewing problematic cert report. Revised script to eliminate false positives.
June 3, 2019 – Investigation completed and identified final list of 1,069 problematic certs. Customers notified of three options: revalidate, reissue to remove problem domain, or revoke within 24 hours.
June 4, 2019 – 390 certificates were reissued/revalidated; 679 certificates were revoked.

  3. Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem. A statement that you have will be considered a pledge to the community; a statement that you have not requires an explanation.

Yes, we have stopped issuing certificates with this problem. The DigiCert plus feature with the faulty logic was disabled across all platforms on May 30, 2019. The Plus feature will be re-coded to enable the option only when the CN is the base domain. We are also refactoring that part of the code shortly to eliminate its ability to touch orders post validation but pre-issuance. We want to ensure that nothing can be inserted between validation and issuance that can change the order. This way all validation and issuance runs through the same set of services, keeping all compliance tests and operations focused and consistent in that code.
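
As a rough illustration of the re-coded rule described above, the following Python sketch adds the www name only when the CN is the base domain, and never adds a base domain to a www order. It is a sketch under stated assumptions, not DigiCert's implementation: the function name `plus_names` and the `registered_domain` parameter are hypothetical, and the caller is assumed to have already determined the base domain. The sketch also assumes the names are computed when the order is created, so that every name goes through validation.

```python
def plus_names(common_name: str, registered_domain: str) -> list[str]:
    """Return the names to place on an order under the re-coded Plus rule.

    Hypothetical helper: `registered_domain` is assumed to be supplied by
    the caller (for example, from a public-suffix lookup done elsewhere).
    """
    cn = common_name.strip().lower().rstrip(".")
    base = registered_domain.strip().lower().rstrip(".")

    # The option applies only when the CN is the base domain itself.
    if cn == base:
        return [cn, f"www.{cn}"]

    # A www (or any other) CN gets no automatically added base domain.
    return [cn]


# Computed at order time, before validation, so nothing is added afterwards.
print(plus_names("example.com", "example.com"))
print(plus_names("www.example.com", "example.com"))
```

The key design point is where the function runs rather than what it returns: any name it adds still has to pass validation like every other name on the order.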

  4. A summary of the problematic certificates. For each problem: number of certs, and the date the first and last certs with that problem were issued.

1,069 problematic certificates identified
The first certificate was issued on April 27, 2017
The last certificate was issued on May 28, 2019
All certs were logged to CT

  5. The complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem.

<Attach file to bug report> Incident - Plus\serials-crtsh.csv

  6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

A legacy system contained a seldom used path to issuance that was adding the base domain to certificates for a www subdomain (e.g., www.example.com) after validation of the subdomain had been completed. The system added the base domain to the certificate after passing through our RA system and our compliance checks, just before the information was passed to the CA for signing. We attached a diagram of the flow and systems to show how the systems interconnect.

The issue only occurred in isolated cases when validation was done using file auth and DNS text where the customer elected to include our “plus” feature–an older setting that gives customers a free alternative domain. Unfortunately, the feature was designed to work both ways (adding a www subdomain to base domains and adding the base domain to www subdomain cert). The total impact is 1,069 certs where the www subdomain was validated but the base domain was not.

We acknowledge that issuance without the proper level of validation is a BR violation, as evidenced in the Mozilla discussion <https://groups.google.com/forum/#!msg/mozilla.dev.security.policy/PoMZvss_PRo/TK8L-lK0EwAJ> in 2016. We thought we’d investigated issuance paths like this back in March 2017 to ensure we did not have this issue, but somehow we missed a part of the code logic. As mentioned, the engineers who did the code review in 2017 are no longer with DigiCert, so we cannot ask them what happened and how this was missed. The real failure is similar to our CAA record checking: an engineer who didn’t understand the importance of the CAB Forum requirements didn’t realize they had missed a significant issue. We also don’t have formal hand-off procedures, so there wasn’t a code review by the engineering resources taking over responsibility for that code. The new engineer is quite good. BJ is his name.

The certs represent .003% of total TLS issuance, which is why they were not caught earlier by an audit. None of these were selected during the 3% audits. We do have automated auditing components but they primarily check certificate profiles and content, not validation information.

  7. List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.

We are taking a multi-pronged approach to resolving the situation and eliminating the possibility of future errors as follows:

  1. May 30, 2019 - We turned off the feature for issuing plus certificates without the proper validation at the subdomain or base domain. This prevented future mis-issuances from this particular system.
  2. June 30, 2019 - We have scheduled training for the engineering staff the last week of June—this will be considered mandatory for any engineer working on CA code, RA code, or any other code that interfaces with the CA and RA systems.
    During that training, we are going to assign a developer who is ultimately responsible for each code segment and for ensuring compliance with each portion of the BRs. This compliance requirement includes writing unit tests for the various BRs under their responsibility to ensure compliance checks before code is released into production. We will emphasize that peer reviews involve a comprehensive understanding of the requirements. We will run the training annually or more often, as necessitated by BR changes. We believe this will improve the quality and operation of our code substantially.

During the training, we will also come up with a better hand-off procedure for when employees leave. We want a procedure that establishes a new chain of custody for code and provides additional oversight in case things were missed during an employee’s last days. We’ll ensure that the hand-off procedures follow industry best practices, including a transitional code review. Any additional ideas for the training, or topics that you’d like to see included, are appreciated.
3. October 31, 2019 – We previously mentioned that we have a new validation engine that we’ve developed. This includes a new workbench for validation staff and rigorous compliance controls. Migration to this new capability is underway with a target completion date of October 31, 2019. At that point all certificates issued, regardless of platform, will go through the new domain validation tool. At that time, all orders and requests will go through the RA into the CA, without the possibility of order modifications post validation. Any new modifications will reset the process and go through the new validation engine.
4. April 30, 2020 - We are currently working on legacy system migration, including migration of legacy DigiCert and Symantec systems. The first DigiCert system has about 4 more customers that we are migrating. The others have more users but are being migrated at the same time. The target for beginning the decommissioning process for the systems is January 2020. The target date for final shut down is April 2020. This largely depends on how the shutdown process proceeds. We are constantly working towards these dates and will post updates as things change.

Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true

Thanks Jeremy. The DigiCert incident reports are continuing to grow in the level of detail, and that's fantastic, because it shows that a holistic approach is being taken to understand and explain the issues.

I have a few questions:

  1. From the diagram/flow of information, it seems that all code paths (both Legacy Symantec and DigiCert API) pass through the Legacy DigiCert Issuer Code as part of the path to and from the validation workbench. Is that correct? If so, it doesn't seem that it's quite "legacy" yet. Understanding what component(s) the migrations on October 31, 2019 and April 30, 2020 are going to look like may help better understand how these systemic issues are being addressed.
  2. While you focused on training, it doesn't seem like it's been examined as to what sort of secondary systemic controls could have caught this. For example, you discussed linting in the context of certificate profiles. However, it seems like there could be pre- and/or post-issuance reconciliation with the DigiCert DB. If I understand correctly, that could have caught the addition of the domain.example to the www.domain.example by having the CA reconciliation system not find an equivalent authorization entry for that domain scope. Functionally, this is a second layer of controls, but is focused on treating all incoming data as 'hostile'/'unverified', and making sure it finds the associated signals in the canonical source of truth (presumably, your DigiCert DB). This design may not be appropriate or necessary, though, based on the rearchitecture being proposed for October 2019 and April 2020 - that's part of why I asked to understand the diagrams. However, it might be useful to understand what sort of checks can be done at the DigiCert CA and DigiCert Signer levels, either prior to signing or as post-issuance automated quality control.

I also want to acknowledge and thank you for ensuring that all affected certificates were promptly revoked or replaced. This shows an improvement over past incident reports, which either suggests more consistent application of policies or is the result of improved communications. If there are practices that DigiCert is following that help ensure they can be effective on this, it may be worth proposing them at the CA/Browser Forum, as ways to formalize best practices so that other CAs can equally be prepared and respond timely.

Assignee: wthayer → jeremy.rowley
Type: defect → task
Flags: needinfo?(jeremy.rowley)
Whiteboard: [ca-compliance]
  1. The orange lines incidentally pass through the legacy issuer code to get to the validation system. They don't actually use that code. Just a drawing issue where they pass under the box rather than connect through it

  2. Yes - we are adding a check to the CA that calls back to the RA to confirm what is being signed matches the validation. Of course, we are removing anything that can come after the RA, but this check will be an added precaution. If the CA can't verify the proposed cert contents match what the RA saw, the signing will be rejected.
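
The pre-signing check described in the reply above could look roughly like the following Python sketch. Everything here is an assumption made for illustration: `ValidationRecord`, `uncovered_names`, and `approve_signing` are hypothetical names, the RA callback is modeled as a plain list of records, and whether a validation also covers subdomains is represented by a simple flag rather than by the actual BR method rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidationRecord:
    """Hypothetical RA record of one completed domain validation."""
    domain: str              # the name that was actually validated
    covers_subdomains: bool  # assumed flag: method also authorizes subdomains


def normalize(name: str) -> str:
    """Lowercase a DNS name and drop any trailing dot."""
    return name.strip().lower().rstrip(".")


def is_covered(san: str, records: list[ValidationRecord]) -> bool:
    """True if at least one validation record authorizes this SAN."""
    san = normalize(san)
    for rec in records:
        validated = normalize(rec.domain)
        if san == validated:
            return True
        if rec.covers_subdomains and san.endswith("." + validated):
            return True
    return False


def uncovered_names(sans: list[str], records: list[ValidationRecord]) -> list[str]:
    """Names in the proposed certificate that no validation record covers."""
    return [san for san in sans if not is_covered(san, records)]


def approve_signing(sans: list[str], records: list[ValidationRecord]) -> bool:
    """Reject signing if any proposed name lacks a matching validation."""
    return not uncovered_names(sans, records)


# The failure mode from this incident: only the www name was validated,
# but the base domain was added to the certificate afterwards.
records = [ValidationRecord("www.example.com", covers_subdomains=True)]
proposed = ["www.example.com", "example.com"]
print(uncovered_names(proposed, records))   # ['example.com']
print(approve_signing(proposed, records))   # False
```

Because the check treats the proposed certificate contents as unverified input and looks each name up independently, it does not depend on which upstream component added the name.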

Flags: needinfo?(jeremy.rowley)

Can you provide an update regarding the training that was to be completed by 30 June 2019?

Is the next update still expected to be 1 October 2019?

Flags: needinfo?(jeremy.rowley)

Yes! The training went very well. We went over RFC5280, RFC6844, RFC6960, RFC5019 (since we use that one), the Mozilla policy, the Microsoft policy, and the CAB Forum requirements. Long training but good. We gave the engineering team links to the documents and talked about building better unit tests around some of the things that are not covered by zlint. We also talked about contributing more to zlint and finishing porting everything from cablint to zlint. We assigned specific people to take a look at a couple of systems where we suspect there might be additional issues and identified another section of code that needs refactoring. Not that it's mis-issuing certs currently, but it's one that we identified as risky. We're going to schedule refactoring of that one into one of the upcoming sprints.

The next update on the system migration is October 1, 2019. We are still tracking to migration of the validation system.

Flags: needinfo?(jeremy.rowley)
Flags: needinfo?(wthayer)
Whiteboard: [ca-compliance] → [ca-compliance] Next Update - 01-October 2019

Wayne: Sorry, the N-I for you was if you had further questions or were satisfied with October for progress.

We've split it up into smaller tasks. I can provide smaller updates based on the actual sprints if you want, but they may not make much sense without knowing the system.

Flags: needinfo?(wthayer)

Providing weekly progress updates about the remediation is better than providing no updates, especially if dates encounter slippage. However, focusing on substantive or public details is desired. If there aren't details that are clearly related to this, it may not be useful to share updates - but it also means don't miss deadlines.

Okay - I'll try to provide bi-weekly updates. Weekly updates will be a little much since we run two-week sprints and every other one will be "in the middle of the sprint". I'll post an overall breakdown of the tasks next.

Met with the dev team to look at what still needs to be done. This has been an ongoing task for a long time so we thought we'd break it down into more actionable deadlines. The first step (per the previous note on mis-communication) is to define what exactly we are saying.

By October 31, 2019, DigiCert is targeting completion of the engineering work necessary to consolidate public TLS domain validation. All new domain validation (for the purpose of BR:3.2.2.4 compliance) which is completed after that milestone will be processed entirely by a single, dedicated service purpose-built for domain validation.

Excluding Quovadis, 165k (as of July 24, 2019) accounts exist outside of the consolidated domain validation workflow's immediate infrastructure. These accounts are being migrated over to the customer platform which will leverage the consolidated system. All customer accounts will be migrated by April 1, 2020 and a majority of customer accounts will be migrated by October 31, 2019. Additionally, customer accounts will be automatically migrated to the newer platform upon requesting a certificate, between October 31, 2019 and April 1, 2020.

In the last sprint, we’ve focused on the following consolidation efforts:
• Defining API contracts, building tests, and adding feature flags that will allow us, on an account by account basis, to point to the go-forward domain validation flows so that we can safely and methodically switch over customers to the consolidated flows rather than in a riskier wholesale migration.
• Initial work to migrate remaining customers from legacy portals to our go-forward customer portal so that they organically leverage the go-to domain validation flows.
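
The account-by-account switch-over mentioned in the first bullet can be pictured with a small Python sketch. This is only an illustration of per-account feature flagging in general: the `ValidationRouter` class, the flow labels, and the account IDs are hypothetical and do not describe DigiCert's actual systems.

```python
class ValidationRouter:
    """Hypothetical per-account flag deciding which validation flow is used."""

    def __init__(self, consolidated_accounts: set[str]):
        # Accounts whose flag points at the go-forward validation flow.
        self._consolidated = set(consolidated_accounts)

    def enable(self, account_id: str) -> None:
        """Switch one account over to the consolidated flow."""
        self._consolidated.add(account_id)

    def disable(self, account_id: str) -> None:
        """Roll one account back to the legacy flow."""
        self._consolidated.discard(account_id)

    def flow_for(self, account_id: str) -> str:
        """Return the flow label for an account; legacy is the default."""
        return "consolidated" if account_id in self._consolidated else "legacy"


router = ValidationRouter({"acct-1"})
print(router.flow_for("acct-1"))  # consolidated
print(router.flow_for("acct-2"))  # legacy

router.enable("acct-2")           # migrate one more account
print(router.flow_for("acct-2"))  # consolidated
```

Keeping the legacy flow as the default and flipping accounts individually allows a single account to be rolled back without affecting the others, which is the safety property the bullet contrasts with a wholesale migration.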
