Closed Bug 1556948 Opened 1 year ago Closed 11 months ago

DigiCert Validation Scope Incident

Categories

(NSS :: CA Certificate Compliance, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jeremy.rowley, Assigned: jeremy.rowley)

Details

(Whiteboard: [ca-compliance])

Attachments

(3 files)

45.44 KB, image/png
Details
53.90 KB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
Details
70.61 KB, image/png
Details

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36

Steps to reproduce:

Plus Incident Report

  1. How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.

On May 29, 2019 at 12:55 pm PT, a DigiCert support engineer was assisting a customer and noticed the SANs in the TLS certificate did not match the specified domain validation scope. The support engineer escalated to his manager, who initiated research with both the dev and compliance teams. This launched a full investigation. The DigiCert engineering team identified the issue with a seldom used feature of legacy DigiCert. That feature, identified as “Certificates Plus”, allowed a customer to automatically add www.domain to an order that contained domain and vice versa. Unfortunately, the feature added the domains after validation completed but before issuance instead of during the initial order process. This allowed improper issuance of example.com if only www.example.com was verified.

Engineering had previously investigated this code while the Comodo issue was pending referenced here: https://groups.google.com/forum/#!msg/mozilla.dev.security.policy/PoMZvss_PRo/TK8L-lK0EwAJ. The report back was that the system ensured proper validation. However, the individuals performing the assessment were let go some time ago and are no longer available to find out how this was missed during the previous review. We recognize that this is an issue that required immediate attention.

  1. A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.

May 29, 2019 – Support engineer noticed mis-match in validation compared to SANs
May 29, 2019 – Escalated to compliance to initiate root cause investigation.
May 30, 2019 - Root cause identified as Plus feature. Faulty logic was disabled.
May 31, 2019 - Began running reports to identify impacted certificates.
June 1, 2019 – Began internally reviewing problematic cert report. Revised script to eliminate false positives.
June 3, 2019 – Investigation completed and identified final list of 1,069 problematic certs. Customers notified of three options: revalidate, reissue to remove problem domain, or revoke within 24 hours.
June 4, 2019 – 390 certificates were reissued/revalidated; 679 certificates were revoked.

  1. Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem. A statement that you have will be considered a pledge to the community; a statement that you have not requires an explanation.

Yes, we have stopped issuing certificates with this problem. The DigiCert plus feature with the faulty logic was disabled across all platforms on May 30, 2019. The Plus feature will be re-coded to enable the option only when the CN is the base domain. We are also refactoring that part of the code shortly to eliminate its ability to touch orders post validation but pre-issuance. We want to ensure that nothing can be inserted between validation and issuance that can change the order. This way all validation and issuance runs through the same set of services, keeping all compliance tests and operations focused and consistent in that code.
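One way to picture the goal of "nothing can be inserted between validation and issuance" is a fingerprint taken over the order's names at validation time and re-checked just before signing. The following is only a minimal sketch under that assumption; `order_fingerprint`, `assert_unchanged`, and `OrderChangedError` are hypothetical names for illustration and do not describe DigiCert's actual code.

```python
import hashlib
import json


class OrderChangedError(Exception):
    """Raised when an order no longer matches what was validated."""


def order_fingerprint(common_name, sans):
    """Return a stable digest of the names an order asks to certify."""
    canonical = json.dumps(
        {"cn": common_name.lower(), "sans": sorted(s.lower() for s in sans)},
        sort_keys=True,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def assert_unchanged(validated_fingerprint, common_name, sans):
    """Refuse to proceed if the order differs from the validated snapshot."""
    if order_fingerprint(common_name, sans) != validated_fingerprint:
        raise OrderChangedError("order modified after validation; revalidate")


# Snapshot taken when validation completes (hypothetical order).
fp = order_fingerprint("www.example.com", ["www.example.com"])

# A "Plus"-style step that appends the base domain afterwards is detected.
try:
    assert_unchanged(fp, "www.example.com", ["www.example.com", "example.com"])
    tampered_detected = False
except OrderChangedError:
    tampered_detected = True
```

In this sketch any change to the name set forces the order back through validation, which mirrors the stated intent that modifications reset the process.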

  1. A summary of the problematic certificates. For each problem: number of certs, and the date the first and last certs with that problem were issued.

1,069 problematic certificates identified
The first certificate was issued on April 27, 2017
The last certificate was issued on May 28, 2019
All certs were logged to CT

  1. The complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem.

<Attach file to bug report> Incident - Plus\serials-crtsh.csv

  1. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

A legacy system contained a seldom used path to issuance that was adding the base domain to certificates for a www subdomain (e.g., www.example.com) after validation of the subdomain had been completed. The system added the base domain to the certificate after passing through our RA system and our compliance checks, just before the information was passed to the CA for signing. We attached a diagram of the flow and systems to show how the systems interconnect.

The issue only occurred in isolated cases when validation was done using file auth and DNS text where the customer elected to include our “plus” feature–an older setting that gives customers a free alternative domain. Unfortunately, the feature was designed to work both ways (adding a www subdomain to base domains and adding the base domain to www subdomain cert). The total impact is 1,069 certs where the www subdomain was validated but the base domain was not.
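The scope mismatch described above can be illustrated with a small check that lists SANs not covered by any validated domain. This is a simplified sketch: it assumes that validating a name covers that name and its subdomains but never its parent, which is a stand-in for the real method-specific rules. The function names `is_covered` and `unvalidated_sans` are hypothetical.

```python
def is_covered(san, validated_domains):
    """True if `san` equals a validated domain or is a subdomain of one.

    Simplified assumption: validation of a name extends downward to its
    subdomains only, never upward to the parent domain.
    """
    san = san.lower().rstrip(".")
    for validated in validated_domains:
        validated = validated.lower().rstrip(".")
        if san == validated or san.endswith("." + validated):
            return True
    return False


def unvalidated_sans(sans, validated_domains):
    """Return the SANs that fall outside the validated scope."""
    return [s for s in sans if not is_covered(s, validated_domains)]


# Only the www subdomain was validated, but the base domain was added later.
problem = unvalidated_sans(["www.example.com", "example.com"], ["www.example.com"])

# The reverse direction is fine under the simplified rule.
ok = unvalidated_sans(["www.example.com", "example.com"], ["example.com"])
```

Here `problem` contains `example.com`, the case described in this report, while `ok` is empty.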

We acknowledge that issuance without the proper level of validation is a BR violation as evidenced in the Mozilla discussion <https://groups.google.com/forum/#!msg/mozilla.dev.security.policy/PoMZvss_PRo/TK8L-lK0EwAJ> in 2016. We thought we’d investigated issuance paths like this back in March 2017 to ensure we did not have this issue, but somehow we missed a part of the code logic. As mentioned, the engineers who did the code review in 2017 are no longer with DigiCert so we cannot ask them what happened and how this was missed. The real failure is similar to our CAA record checking: an engineer who didn’t understand the importance of the CAB Forum requirements didn’t realize they missed a significant issue. We also don’t have formal hand-off procedures so there wasn’t a code review by the engineering resources taking over responsibility for that code. The new engineer is quite good. BJ is his name.

The certs represent .003% of total TLS issuance, which is why they were not caught earlier by an audit. None of these were selected during the 3% audits. We do have automated auditing components but they primarily check certificate profiles and content, not validation information.

  1. List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.

We are taking a multi-pronged approach to resolving the situation and eliminating the possibility of future errors as follows:

  1. May 30, 2019 - We turned off the feature for issuing plus certificates without the proper validation at the subdomain or base domain. This prevented future mis-issuances from this particular system.
  2. June 30, 2019 - We have scheduled training for the engineering staff the last week of June—this will be considered mandatory for any engineer working on CA code, RA code, or any other code that interfaces with the CA and RA systems.
    During that training, we are going to assign a developer that is ultimately responsible for each code segment and for ensuring compliance with each portion of the BRs. This compliance requirement includes writing unit tests for the various BRs under their responsibility to ensure compliance checks before code is released into production. We will emphasize that peer reviews involve a comprehensive understanding of the requirements. We will run the training annually or more often, as necessitated by BR changes. We believe this will improve the quality and operation of our code substantially.

During the training, we will also come up with a better hand-off procedure for when employees leave. We want a procedure that establishes a new chain of custody for code and provides additional oversight in case things were missed during an employee’s last days. We’ll ensure that the hand-off procedures follow industry best practices, including a transitional code review. Any additional ideas for the training, or anything you’d like to see included, are appreciated.
  3. October 31, 2019 – We previously mentioned that we have a new validation engine that we’ve developed. This includes a new workbench for validation staff and rigorous compliance controls. Migration to this new capability is underway with a target completion date of October 31, 2019. At that point all certificates issued, regardless of platform, will go through the new domain validation tool. At that time, all orders and requests will go through the RA into the CA, without the possibility of order modifications post validation. Any new modifications will reset the process and go through the new validation engine.
  4. April 30, 2020 - We are currently working on legacy system migration, including migration of legacy DigiCert and Symantec systems. The first DigiCert system has about 4 more customers that we are migrating. The others have more users but are being migrated at the same time. The target for beginning the decommissioning process for the systems is January 2020. The target date for final shut down is April 2020. This largely depends on how the shutdown process proceeds. We are constantly working towards these dates and will post updates as things change.

Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true

Thanks Jeremy. The DigiCert incident reports are continuing to grow in the level of detail, and that's fantastic, because it shows that a holistic approach is being taken to understand and explain the issues.

I have a few questions:

  1. From the diagram/flow of information, it seems that all code paths (both Legacy Symantec and DigiCert API) pass through the Legacy DigiCert Issuer Code as part of the path to and from the validation workbench. Is that correct? If so, it doesn't seem that it's quite "legacy" yet. Understanding what component(s) the migrations on October 31, 2019 and April 30, 2020 are going to look like may help better understand how these systemic issues are being addressed.
  2. While you focused on training, it doesn't seem like it's been examined as to what sort of secondary systemic controls could have caught this. For example, you discussed linting in the context of certificate profiles. However, it seems like there could be pre- and/or post-issuance reconciliation with the DigiCert DB. If I understand correctly, that could have caught the addition of the domain.example to the www.domain.example by having the CA reconciliation system not find an equivalent authorization entry for that domain scope. Functionally, this is a second layer of controls, but is focused on treating all incoming data as 'hostile'/'unverified', and making sure it finds the associated signals in the canonical source of truth (presumably, your DigiCert DB). This design may not be appropriate or necessary, though, based on the rearchitecture being proposed for October 2019 and April 2020 - that's part of why I asked to understand the diagrams. However, it might be useful to understand what sort of checks can be done at the DigiCert CA and DigiCert Signer levels, either prior to signing or as post-issuance automated quality control.

I also want to acknowledge and thank you for ensuring that all affected certificates were promptly revoked or replaced. This shows an improvement over past incident reports, which either suggests more consistent application of policies or is the result of improved communications. If there are practices that DigiCert is following that help ensure they can be effective on this, it may be worth proposing them at the CA/Browser Forum, as ways to formalize best practices so that other CAs can equally be prepared and respond timely.

Assignee: wthayer → jeremy.rowley
Type: defect → task
Flags: needinfo?(jeremy.rowley)
Whiteboard: [ca-compliance]
  1. The orange lines incidentally pass through the legacy issuer code to get to the validation system. They don't actually use that code. It's just a drawing issue where they pass under the box rather than connect through it.

  2. Yes - we are adding a check to the CA that calls back to the RA to confirm what is being signed matches the validation. Of course, we are removing anything that can come after the RA, but this check will be an added precaution. If the CA can't verify the proposed cert contents match what the RA saw, the signing will be rejected.
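The CA-side callback described in item 2 could look roughly like the sketch below. It assumes a hypothetical `ra_lookup` callable that returns the names the RA validated for an order, and a hypothetical `signer` callable standing in for the CA's signing step; neither reflects a real DigiCert interface.

```python
class SigningRejected(Exception):
    """Raised when proposed certificate names differ from what the RA saw."""


def sign_if_validated(order_id, proposed_sans, ra_lookup, signer):
    """Sign only if the proposed names exactly match the RA's record.

    `ra_lookup` and `signer` are hypothetical callables supplied by the caller.
    """
    seen = {name.lower() for name in ra_lookup(order_id)}
    proposed = {name.lower() for name in proposed_sans}
    if proposed != seen:
        mismatch = sorted(proposed ^ seen)  # names present on only one side
        raise SigningRejected(f"names differ from RA record: {mismatch}")
    return signer(sorted(proposed))


# Hypothetical stand-ins for the RA database and the signing service.
ra_records = {"order-1": ["www.example.com"]}


def fake_ra_lookup(order_id):
    return ra_records.get(order_id, [])


def fake_signer(names):
    return "signed:" + ",".join(names)


accepted = sign_if_validated("order-1", ["www.example.com"], fake_ra_lookup, fake_signer)

try:
    sign_if_validated(
        "order-1", ["www.example.com", "example.com"], fake_ra_lookup, fake_signer
    )
    rejected = False
except SigningRejected:
    rejected = True
```

The sketch rejects on any difference in either direction, treating the data arriving at the CA as unverified until the RA record confirms it.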

Flags: needinfo?(jeremy.rowley)

Can you provide an update regarding the training that was to be completed by 30 June 2019?

Is the next update still expected to be 1 October 2019?

Flags: needinfo?(jeremy.rowley)

Yes! The training went very well. We went over RFC5280, RFC6844, RFC6960, RFC5019 (since we use that one), the Mozilla policy, the Microsoft policy, and the CAB Forum requirements. Long training but good. We gave the engineering team links to the documents and talked about building better unit tests around some of the things that are not covered by zlint. We also talked about contributing more to zlint and finishing porting everything from cablint to zlint. We assigned specific people to take a look at a couple of systems where we suspect there might be additional issues and identified another section of code that needs refactoring. Not that it's mis-issuing certs currently, but it's one that we identified as risky. We're going to schedule refactoring of that one into one of the upcoming sprints.

The next update on the system migration is October 1, 2019. We are still on track for the migration of the validation system.

Flags: needinfo?(jeremy.rowley)
Flags: needinfo?(wthayer)
Whiteboard: [ca-compliance] → [ca-compliance] Next Update - 01-October 2019

Wayne: Sorry, the N-I for you was if you had further questions or were satisfied with October for progress.

We've split it up into smaller tasks. I can provide smaller updates based on the actual sprints if you want but they may not make much sense without knowing the system.

Flags: needinfo?(wthayer)

Providing weekly progress updates about the remediation is better than providing no updates, especially if dates encounter slippage. However, focusing on substantive or public details is desired. If there aren't details that are clearly related to this, it may not be useful to share updates - but it also means don't miss deadlines.

Okay - I'll try to provide bi-weekly updates. Weekly updates will be a little much since we run two-week sprints and every other one will be "in the middle of the sprint". I'll post an overall breakdown on the tasks next.

Met with the dev team to look at what still needs to be done. This has been an ongoing task for a long time so we thought we'd break it down into more actionable deadlines. The first step (per the previous note on mis-communication) is to define what exactly we are saying.

By October 31, 2019, DigiCert is targeting completion of the engineering work necessary to consolidate public TLS domain validation. All new domain validation (for the purpose of BR:3.2.2.4 compliance) which is completed after that milestone will be processed entirely by a single, dedicated service purpose-built for domain validation.

Excluding Quovadis, 165k (as of July 24, 2019) accounts exist outside of the consolidated domain validation workflow's immediate infrastructure. These accounts are being migrated over to the customer platform which will leverage the consolidated system. All customer accounts will be migrated by April 1, 2020 and a majority of customer accounts will be migrated by October 31, 2019. Additionally, customer accounts will be automatically migrated to the newer platform upon requesting a certificate, between October 31, 2019 and April 1, 2020.

In the last sprint, we’ve focused on the following consolidation efforts:
• Defining API contracts, building tests, and adding feature flags that will allow us, on an account by account basis, to point to the go-forward domain validation flows so that we can safely and methodically switch over customers to the consolidated flows rather than in a riskier wholesale migration.
• Initial work to migrate remaining customers from legacy portals to our go-forward customer portal so that they organically leverage the go-to domain validation flows.
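The account-by-account feature flags mentioned in the bullets above can be sketched as a small router that sends flagged accounts to the consolidated validation flow and everyone else to the legacy one. This is an illustrative assumption only; `ValidationRouter` and the flow labels are hypothetical names, not the actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ValidationRouter:
    """Per-account switch between legacy and consolidated validation flows."""

    consolidated_accounts: set = field(default_factory=set)

    def enable(self, account_id):
        """Flip one account over to the consolidated flow."""
        self.consolidated_accounts.add(account_id)

    def disable(self, account_id):
        """Roll one account back to the legacy flow if problems appear."""
        self.consolidated_accounts.discard(account_id)

    def flow_for(self, account_id):
        """Return which validation flow should handle this account."""
        if account_id in self.consolidated_accounts:
            return "consolidated"
        return "legacy"


router = ValidationRouter()
router.enable("acct-42")
flows = [router.flow_for("acct-42"), router.flow_for("acct-7")]
```

Keeping the switch per account allows a gradual rollout with an easy rollback path, in contrast to a wholesale migration.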

In this sprint, we focused on transitioning core domain pre-validation endpoints (used primarily by enterprise customers) to the go-forward domain validation services. We identified places in which our scope of work may increase due to recent enhancements made on the front-end, for customer benefit. We are evaluating whether we can support such features in the go-forward domain validation service by the October date or if we'll need to defer support and remove these features from the front end.

We're looking at whether we can drop certain items in order to accelerate.

This sprint was a bit rerouted due to work on the JOI issue. However, we have some progress on consolidating our legacy customer portals into the go-forward portal (CertCentral) which is the cleanest route to having all domain validation go through the new domain validation system. We've been moving existing customers from their legacy portals to CertCentral, and new TLS purchases from digicert.com now result in the creation of a CertCentral account. We're about 75% complete with the stories tied to the legacy portal deprecation work. We've identified additional APIs needed so that CertCentral can leverage the new domain validation system for all flows (pre-validation and order-based validation), and we're about 50% complete with the work to build out the new APIs. Most of the domain consolidation team though worked the first half of the sprint on the JOI/State issuing tool and the second half on improving the tool and fixing bugs with it.

In this iteration, we continued the efforts to move existing customers using legacy DigiCert portals into CertCentral. We are also preparing to deprecate new orders from retail Symantec, Geotrust, Thawte, and RapidSSL, and enable new account signups directly in CertCentral.
Development was primarily focused on the following Domain Consolidation efforts:

  • DCV Email Flows (70% complete)
  • DCV Approval APIs (100% complete)
  • Domain Control Workflows (66%)
  • Update: API build progress (56%)

Additionally, we focused effort on unit testing tasks in order to make our testing more robust.

In this sprint, we made significant progress on building out the APIs for CertCentral and other services to consume the go-forward domain validation flows, and the major tasks within this bucket of work are complete, with some minor tasks still in progress and in the plan. In our efforts to move customers off of legacy portals, we're going to force the remaining legacy DigiCert platform customers to CertCentral (and the new validation system) on October 23rd.

For the next week and a half, we will be doing QA on the integration points and working on associated bug fixes. We've made some customer experience trade-offs to ensure we remain on track with the October 31 date. Heavy QA this week will show us if there are any minor date adjustments that we need to make, but we are still on track for October 31. Next update is full deployment! Woot!

We’ve been working on consolidating our domain validation systems into one master system that includes all legacy platforms. The project was taken on as a way to improve the performance of both our in-house validation service as well as the domain validation systems we inherited from Symantec. This project represents a huge effort to improve accuracy and decrease the chance of systemic mis-issuances; in addition, it reduces human error by collapsing different nuanced processes into one primary flow. As we’re nearing the completion date that we targeted early on, I wanted to update you on the status of this project. We’ve completed development of the new domain validation system and have been heavily QAing for the last week. Today, we are starting a roll out process to all customers, and we expect to complete roll out by Monday, November 4. We're taking a phased roll out approach because of how critical this system is and we want to carefully monitor the certs to see how things go. We're going to continue banging on the system all weekend as well. As we wrap up this project, I’ll keep you informed of the deployment and testing. I don't foresee any issues with the weekend testing and roll-out plan, but you can't be too careful with domain validation. Sorry for the late notice on this phased approach. We decided to take the more cautious approach this week after looking at the test results and talking to the engineering teams.

Jeremy: Thanks for the continued updates.

To make sure we've got a good picture of where things stand (and looking back 5 months ago, in Comment #0)

DigiCert has designed, and implemented, its new validation process, which is rolling out to 100% of DigiCert brands. Excluded from this are (at least identified so far), Symantec and QuoVadis (Comment #0, Comment #11, Comment #18); it's unclear the status of Verizon, and whether or not APIs are also considered in scope (Comment #15). It also appears there are some legacy customers excluded (Comment #0).

I think it'd be useful if you could summarize where things stand, following this coming Monday, about how DigiCert identifies its various systems and brands, so that it's clear what's transitioned and what remains, since the latter would still be a risk for the situations described in Comment #0.

Flags: needinfo?(jeremy.rowley)

This specific bug applies to all brands except Quovadis (Quovadis is really a separate entity still). On Monday, all domain validation will be consolidated to a single flow, and all domain validation is in scope except Quovadis and the sub CAs, each of which has its own validation process. The only two Sub CAs still issuing TLS are Microsoft and Apple.

Org validation already completed its consolidation into a single system. However, as identified in the EV JOI issue, there's some manual typing that goes on in that system. One of our next projects is to revamp that project to eliminate typing. On Monday, all validation will go through a single path, one for org validation and one for domain validation. Different processes are used depending on the BR method, but the system is the same. There is also one CA issuing all publicly trusted TLS certificates. This means that any changes or standards can be implemented quickly and efficiently at the appropriate level.

All of our end-state storefronts are built on our API to eliminate possible paths into our system. The idea is that everything will flow the same way, regardless of entry point. Basically, storefront->API->validation->CA and then back out. I'll prepare a better architecture diagram and share it so you can see the systems better. This is the one area where we need additional consolidation as the legacy Symantec storefronts are still operational. We've been working on this project since the acquisition and are getting close. The current target is Apr 2020, but I suspect this date will push based on customer feedback.

In summary, consolidations remaining are:

  1. Symantec storefronts and APIs (in progress, expected consolidation Apr 2020)
  2. Quovadis (starting in Q1 next year)
  3. Sub CA shut down (currently planned for end of Q1 2020 for all TLS except MS and Apple)

Everything else is consolidated.

By end of Q1 2020, all legacy systems from Verizon and Symantec should be pretty much gone.

Deployed! And running without incident or issue. The architecture diagrams previously provided describe what we're doing pretty well. Basically, we're zippering up everything so that eventually DigiCert flows through one system for each operation - no exception. With this change, we have one system for domain validation, consolidating about half a dozen issuance paths from legacy systems down to one. This means we have one system for org validation (which needs refactoring to remove manually typed strings), one system for domain validation, one CA for publicly trusted TLS certs, and several systems for storefronts.

The next opportunity for simplification is storefronts. We've consolidated two of these already and are working on many more, all of which are tracking for April 2020. However, the storefronts themselves do not impact compliance directly so aren't really part of this bug. Happy to continue providing updates though. The main advantage is the speed at which we can make changes and the reduced complexity in operations, which is an advantage simply because it increases flexibility in adapting to new requirements.

We'll let this run a few days, continuing to monitor issuance for any surprises. Looks good though, and I think we're ready to close this bug and the other one.

Flags: needinfo?(jeremy.rowley)

Jeremy, glad to hear it went well.

Just wanting to make sure I'm absolutely clear, so that if there's any future incidents regarding domain validation, we're all on the same page.

"eventually DigiCert flows through one system for each operation - no exception."

I'm still trying to parse that with Comment #20 and trying to understand where things are with Symantec. It seems based on Comment #21, namely:

"The next opportunity for simplification is storefronts. We've consolidated two of these already and are working on many more, all of which are tracking for April 2020."

that Symantec systems may - or may not - be using the new validation flow. The diagram in Comment #0 describes how things were, and I'm not sure I've got a full picture of how things are, especially with respect to Comment #20's comment, namely:

"Basically, storefront->API->validation->CA and then back out."

I know this seems like getting hung up on details, but I'm specifically trying to make sure that any exceptions to "Everything on one system" are clear and called out. As it stands, QuoVadis is still a separate system, so I get that. What's not clear is Symantec, Verizon, or any other legacy system or API. I want to make sure we've got a clear picture as to the systems in play, primarily so that if any future incidents happen (e.g. like the recently repeated failure to disclose), it's clear and unambiguous where and why it happened.

Flags: needinfo?(jeremy.rowley)
Attached image Architecture.png

Attached is an architecture diagram that better shows the remaining systems and how things are operating. Hopefully this helps. All domain validation now passes through the domain validation service. Org validation passes through the org validation service. The domain validation service is a consolidation of several different bits of code that were throughout the system.

What remains of the legacy Symantec system is the platform that sits on the migration API they built before the acquisition. This API calls into our platform that accesses our validation and issuance system. Once the migration is complete, this whole migration side of the code can be shut down.

I hope this answers the question.

Flags: needinfo?(jeremy.rowley)

Any follow-up questions? Is this bug ready to close?

It appears that all questions have been answered and remediation is complete.

Status: ASSIGNED → RESOLVED
Closed: 11 months ago
Resolution: --- → FIXED
Whiteboard: [ca-compliance] Next Update - 01-October 2019 → [ca-compliance]