Closed Bug 1662346 Opened 4 years ago Closed 4 years ago

DigiCert: OCSP responder returning invalid responses

Categories

(CA Program :: CA Certificate Compliance, task)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: martin.sullivan, Assigned: martin.sullivan)

Details

(Whiteboard: [ca-compliance] [ocsp-failure])

Attachments

(1 file)

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36

Actual results:

  1. How your CA first became aware of the problem (e.g., via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.

The issue was discovered internally on August 24, 2020, while we were making changes to our system. We noticed a bug, introduced in code checked in on July 22, 2020, that caused OCSP to return “good” for revoked certificates at ocspx.digicert.com. The correct response was returned at ocsp.digicert.com and was included in the CRLs (where the certificate included a CRL URI).
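
For illustration only, here is a minimal sketch of the kind of cross-check that exposes this sort of mismatch: query both responders about the same certificate and compare the statuses they return. The file names are placeholders, and this script is illustrative, not our production tooling.

    # Illustrative only: ask two responders about the same certificate and
    # compare the statuses they return. "ee.pem" and "issuer.pem" are
    # placeholders for a locally saved end-entity cert and its issuing CA.
    import urllib.request

    from cryptography import x509
    from cryptography.hazmat.primitives import hashes, serialization
    from cryptography.x509 import ocsp


    def ocsp_status(responder_url, cert, issuer):
        # Build a DER-encoded OCSP request for this cert/issuer pair.
        builder = ocsp.OCSPRequestBuilder().add_certificate(cert, issuer, hashes.SHA1())
        der_req = builder.build().public_bytes(serialization.Encoding.DER)
        http_req = urllib.request.Request(
            responder_url,
            data=der_req,
            headers={"Content-Type": "application/ocsp-request"},
        )
        with urllib.request.urlopen(http_req) as resp:
            ocsp_resp = ocsp.load_der_ocsp_response(resp.read())
        if ocsp_resp.response_status != ocsp.OCSPResponseStatus.SUCCESSFUL:
            return ocsp_resp.response_status
        return ocsp_resp.certificate_status


    cert = x509.load_pem_x509_certificate(open("ee.pem", "rb").read())
    issuer = x509.load_pem_x509_certificate(open("issuer.pem", "rb").read())

    for url in ("http://ocsp.digicert.com", "http://ocspx.digicert.com"):
        print(url, ocsp_status(url, cert, issuer))

Run against a known-revoked certificate during the affected window, a check like this would have shown “good” from one endpoint and “revoked” from the other.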

  2. A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.

July 22, 2020 – We checked in code related to migrating all on-demand OCSP to pre-signed responses (per bug https://bugzilla.mozilla.org/show_bug.cgi?id=1640805). This code was only supposed to affect ICA status information, but it inadvertently affected end-entity certificates as well. Because the impact was supposed to be limited to issuing CAs, we tested issuing-CA status information extensively, but we didn't have acceptance tests to confirm whether end-entity certificates were impacted by the change, nor were there unit tests for revocation mismatches in the on-demand system.
July 23, 2020 – The code was deployed, and the CA began signing all responses as “good” for ocspx.digicert.com. The real status information was sent to ocsp.digicert.com. The AIA of each affected certificate points to ocspx.digicert.com, meaning clients following the AIA received the wrong status.
August 24, 2020 – The CA team met to discuss the final steps in shutting down on-demand OCSP. While going through the plan, the team noticed a bug in the system that was causing on-demand OCSP to return “good” for revoked certificates. Upon investigation, we determined that all revoked certificates were returning “good” for on-demand signing instead of revoked.
August 25, 2020 – We investigated the impact and determined that two issuing CAs were impacted by the change. We found that the wrong responses were being delivered to the on-demand OCSP service (ocspx.digicert.com), while the correct responses were showing up at our pre-signing service (ocsp.digicert.com).
August 26, 2020 – We ran a query to determine whether any certificates were revoked for key compromise. We found that all certificates were revoked by automated systems on the customer side. The certificates are essentially used as session certificates and revoked when a session terminates.
August 27, 2020 – We deployed a code fix to send the correct responses to both the on-demand and pre-signed service.
August 31, 2020 – We moved the remaining CAs to pre-signing only and shut down on-demand signing.

  3. Whether your CA has stopped, or has not yet stopped, certificate issuance or the process giving rise to the problem or incident. A statement that you have stopped will be considered a pledge to the community; a statement that you have not stopped requires an explanation.

We have fixed the issue and finished shutting down on-demand signing.

  4. In a case involving certificates, a summary of the problematic certificates. For each problem: the number of certificates, and the date the first and last certificates with that problem were issued. In other incidents that do not involve enumerating the affected certificates (e.g., OCSP failures, audit findings, delayed responses, etc.), please provide other similar statistics, aggregates, and a summary for each type of problem identified. This will help us measure the severity of each problem.

The CAs using on-demand signing are:
Plex Devices High Assurance CA2 - https://crt.sh/?id=7710111
DigiCert Cloud Services CA-1 - https://crt.sh/?id=12624881

  5. In a case involving certificates, the complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem. In other cases not involving a review of affected certificates, please provide other similar, relevant specifics, if any.

The issue was with all revoked certificates under the ICAs mentioned above. We looked at the certificates and our records of key compromise. None of the revoked certificates were revoked for key compromise. Instead, each certificate is issued and revoked automatically by the customer system as part of a per-session certificate deployment.
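
For illustration, here is a minimal sketch of the kind of reason-code check involved, assuming a locally downloaded DER-encoded CRL. The file name is a placeholder; our actual review ran against internal revocation records rather than a script like this.

    # Illustrative only: list CRL entries whose reason code is keyCompromise.
    # "ica.crl" is a placeholder for a CRL downloaded from the distribution point.
    from cryptography import x509


    def key_compromise_entries(crl_path):
        with open(crl_path, "rb") as f:
            crl = x509.load_der_x509_crl(f.read())
        hits = []
        for entry in crl:
            try:
                reason = entry.extensions.get_extension_for_class(x509.CRLReason).value.reason
            except x509.ExtensionNotFound:
                continue  # no reason code on this entry
            if reason == x509.ReasonFlags.key_compromise:
                hits.append((entry.serial_number, entry.revocation_date))
        return hits


    print(key_compromise_entries("ica.crl"))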

  6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

We were moving all certificates to pre-signing per https://bugzilla.mozilla.org/show_bug.cgi?id=1640805. In connection with this shutdown, we introduced a change that ended up sending inconsistent OCSP responses to the two endpoints. We did not test the impact on end-entity certificates.

  7. List of steps your CA is taking to resolve the situation and ensure that such situation or incident will not be repeated in the future, accompanied with a binding timeline of when your CA expects to accomplish each of these remediation steps.

We shut down all on-demand signing responses, meaning all OCSP services are using the same pre-signing system described at https://bugzilla.mozilla.org/show_bug.cgi?id=1640805.

Assignee: bwilson → martin.sullivan
Status: UNCONFIRMED → ASSIGNED
Type: defect → task
Ever confirmed: true
Whiteboard: [ca-compliance]

Hey Ben - what additional info do you want on this bug? We turned off our on-demand signing service for TLS. Any additional details you want?

Flags: needinfo?(bwilson)

It seems that this problem was discovered by chance an entire month after the code was implemented. To ensure that this type of situation or incident does not occur in the future, shouldn't item number 7 include more thorough pre- or post-deployment testing?

Flags: needinfo?(bwilson)
Flags: needinfo?(martin.sullivan)

To echo Ben's sentiment: I'm concerned that answers 6 and 7 are responses not to root causes or systemic issues, but more about plans regarding this specific incident.

The timeline in Comment #0, Question 2, is very much in the spirit of incident reports, and provides a useful understanding of how DigiCert responded. However, we're lacking an understanding of how (systemically) things went wrong, only that things went wrong beginning July 22 after mistakes were made.

My concern is that another systemic issue seems just as likely, and future remediations are unlikely to involve "DigiCert shutting down X operation", especially if X is, say, a CA. DigiCert has introduced bugs several times now in response to rolling out 'quick' fixes due to compliance issues, and while responding quickly is good, I think there's an understandable concern regarding quality control, especially when it's appearing to become a pattern.

I don't think it was discovered by chance - we were actively working on shutting down the system and deploying code regularly in furtherance of that objective. We noticed the issue during routine sprint planning, while discussing the next step in shutting down the system. We run a continuous deployment model with daily sprint standups to review what we are working on, which means a system under active work gets looked at regularly. We had the retrospective yesterday to talk about what happened with this particular issue.

What we have done for other systems:

  1. We now have automated acceptance tests verifying that the on-demand endpoint path returns the correct status for end-entity certs (even though this path has been disabled in prod)
  2. We already had automated acceptance tests that verify the pre-gen responses have the correct status for EE certs
  3. We agreed as a team that going forward, whenever the stored procedures for the pre-gen or on-demand OCSP endpoints are changed, both end-entity and CA certs need to be tested

The acceptance tests have been and will continue to be run against every CA branch, which verifies that the pre-gen OCSP system creates OCSP responses with the correct status. We don't merge any code unless all of the acceptance tests pass.
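
To make the shape of these acceptance tests concrete, here is an illustrative sketch only - the responder URL, file names, and helper are hypothetical stand-ins, not our actual test code.

    # Illustrative pytest-style sketch: a known-revoked end-entity test cert
    # must come back REVOKED from the responder. The URL and file names are
    # hypothetical stand-ins for test fixtures.
    import urllib.request

    from cryptography import x509
    from cryptography.hazmat.primitives import hashes, serialization
    from cryptography.x509 import ocsp
    from cryptography.x509.ocsp import OCSPCertStatus, OCSPResponseStatus

    RESPONDER = "http://ocsp.example.test"  # stand-in for the test-environment responder


    def fetch_status(cert, issuer):
        der_req = (
            ocsp.OCSPRequestBuilder()
            .add_certificate(cert, issuer, hashes.SHA1())
            .build()
            .public_bytes(serialization.Encoding.DER)
        )
        http_req = urllib.request.Request(
            RESPONDER, data=der_req, headers={"Content-Type": "application/ocsp-request"}
        )
        with urllib.request.urlopen(http_req) as resp:
            parsed = ocsp.load_der_ocsp_response(resp.read())
        assert parsed.response_status == OCSPResponseStatus.SUCCESSFUL
        return parsed.certificate_status


    def test_revoked_ee_cert_reports_revoked():
        cert = x509.load_pem_x509_certificate(open("revoked-ee.pem", "rb").read())
        issuer = x509.load_pem_x509_certificate(open("test-ica.pem", "rb").read())
        assert fetch_status(cert, issuer) == OCSPCertStatus.REVOKED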

I think there's an understandable concern regarding quality control, especially when it's appearing to become a pattern.

On this particular point, we knew the code was rough. We mentioned this in the other bug. If the system had been an issuing system, we would have just turned it off, but we couldn't do that with a revocation system, as it would then fail to generate revocation responses. There isn't a pattern in quality control except on legacy code, which is actively being deprecated (like this code). As for the never-again plan, I think all software development requires better unit testing and controls. No matter who you work for, software perfection is the goal and we will continue to improve. After all, this is what Agile is all about - learning from mistakes and improving as you go.

(In reply to Jeremy Rowley from comment #4)

No matter who you work for, software perfection is the goal and we will continue to improve. After all, this is what Agile is all about - learning from mistakes and improving as you go.

I can't help but feel that this rings a little hollow. While an admirable exhortation, I think my question is still trying to understand the sort of robust controls in place. From the overall comment in Comment #4, it sounds like the extent of testing is that you have some acceptance tests in place in prod, and you require them to pass to land code. While testing is certainly a good idea, if not outright required to meet the control objectives, I'm a bit concerned when the scenario is "Developer makes a mistake in a test, doesn't notice it because the tests pass".

Tests help you check the things you know you're looking for ("known knowns", if you will), but I don't see a set of controls to detect or mitigate issues with the development of tests. For example, you found this issue as part of a turndown, so it suggests that one example control would be to periodically spot-check and review different code areas. Similarly, it suggests that perhaps an internal auditing sampling approach, where you actually take a random sample and attempt to aggressively test for 'odd things', becomes an opportunity to detect blindspots or otherwise missing test coverage.

A CA is as much about developing code as running a service, and understanding the approach to service testing, and not just code testing, seems a missing part of the reply in Comment #4.

We have an independent developer review all code prior to deployment. Both the independent reviewer and the developers who wrote the original code missed the impact of this change on end-entity certificates. We do have regression testing and unit tests, but they were insufficient to cover this particular system. We also sample code and test against it. We have a separate system for linting certificates both before and after issuance to ensure everything is going into the CA correctly and coming out of the CA correctly.
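
As an illustration of what a post-issuance check of this kind can look like, here is a minimal sketch that verifies the AIA of an issued certificate points at the expected OCSP responder. The expected URL and file name are placeholder assumptions, and this sketch is not a description of our actual linting system.

    # Illustrative only: a lint-style post-issuance check that an issued
    # certificate's AIA names the expected OCSP responder. The expected URL
    # and "issued.pem" are hypothetical placeholders.
    from cryptography import x509
    from cryptography.x509.oid import AuthorityInformationAccessOID, ExtensionOID

    EXPECTED_OCSP = "http://ocsp.digicert.com"


    def ocsp_urls(cert):
        aia = cert.extensions.get_extension_for_oid(
            ExtensionOID.AUTHORITY_INFORMATION_ACCESS
        ).value
        return [
            desc.access_location.value
            for desc in aia
            if desc.access_method == AuthorityInformationAccessOID.OCSP
        ]


    cert = x509.load_pem_x509_certificate(open("issued.pem", "rb").read())
    urls = ocsp_urls(cert)
    if urls != [EXPECTED_OCSP]:
        print("AIA OCSP URI mismatch:", urls)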

We probably should do more random code sampling though. Or a review of changes a month after they are deployed (so there are fresh eyes on the problem). I'll bring that back to the team and set something up and then report back.

Looking at the incident, the issue was that we didn't have adequate test coverage to verify a change to the system providing ICA information, which unintentionally modified the handling of end-entity certs. We've traced this problem to the fact that we didn't have good visibility into the workflows necessary to test cross-functional changes on this part of the CA. We do have this visibility on validation and other parts of the CA.

To solve this, we are going to take the lessons learned and apply them uniformly across the CA. Specifically, engineering is implementing a GitHub PR template on each of the CA services to verify that all major workflows are manually tested and/or automated on each PR. We are currently looking at implementing standard checklists throughout the services, along with customized templates where necessary. We think this will help us better think through the issues that could arise while making modifications to the CA and the potential cross-system interactions. This, combined with the automated regression testing we already have and any new regression testing opportunities this reveals, will better shore up the engineering process. We plan to have this review done by [Oct 15] and will provide information about the workflows and automation we add on PRs.

Whiteboard: [ca-compliance] → [ca-compliance] Next Update - 15-Oct 2020
Attached image PR Template.JPG

Okay - we ended up implementing the new process a bit earlier than the 15th. At this point we have added PR templates to our git repos for the CA team and are using these as part of the dev process. More specifically, they are part of the key ceremony tool, the CA, the CA manager, and other CA-related code. The screenshot shows one of the PR templates.
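
For anyone who can't open the attachment, here is a hypothetical example of the kind of checklist such a template might contain. This is an illustration based on the items discussed in this bug, not the actual template in the screenshot.

    ## CA service PR checklist (illustrative example)
    - [ ] Describe the change and the workflows it touches (issuance, revocation,
          OCSP pre-gen, on-demand OCSP, CRL generation)
    - [ ] End-entity certificate paths manually tested or covered by automation
    - [ ] Issuing-CA / ICA paths manually tested or covered by automation
    - [ ] Acceptance tests added or updated; all acceptance tests pass
    - [ ] Cross-system interactions considered (e.g., pre-gen vs. on-demand endpoints)
    - [ ] Independent developer review completed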

Any suggestions or additional thoughts on what we should do to mitigate dev risk? If not, then I think we are ready to close this bug.

Flags: needinfo?(bwilson)

I will schedule this for closure on or about 16-October-2020 unless there are additional questions, issues or concerns.

Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Flags: needinfo?(martin.sullivan)
Flags: needinfo?(bwilson)
Resolution: --- → FIXED
Product: NSS → CA Program
Whiteboard: [ca-compliance] Next Update - 15-Oct 2020 → [ca-compliance] [ocsp-failure]