Closed Bug 1577014 Opened 5 months ago Closed 3 months ago

DigiCert OCSP services returns 1 byte

Categories

(NSS :: CA Certificate Compliance, task)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: cspann, Assigned: jeremy.rowley)

Details

(Whiteboard: [ca-compliance])

User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.1 Safari/605.1.15

Steps to reproduce:

When attempting to validate this certificate (https://crt.sh/?id=990545190) the OCSP response returned is 1 byte in length.

Example command: curl -v http://ocsp.digicert.com/MFYwVKADAgEAME0wSzBJMAkGBSsOAwIaBQAEFPKKiVLNMEsaHyQMgXzdNzmnizkKBBR435GQX%2B7erPbFdevVTFVT7yRKtgIQCZqu7ycbeBh9gglQHvcz3g%3D%3D

Results:
< HTTP/1.1 200 OK
< Accept-Ranges: bytes
< Cache-Control: public, max-age=300
< Content-Type: application/ocsp-response
< Date: Tue, 27 Aug 2019 20:00:27 GMT
< Etag: "5bf875fc-1"
< Last-Modified: Fri, 23 Nov 2018 21:49:48 GMT
< Server: ECS (sjc/4E79)
< X-Cache: HIT
< Content-Length: 1
<
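For reference, the path component of the URL in the example command is the URL-encoded, base64-encoded DER OCSPRequest (the GET form described in RFC 6960, Appendix A). Below is a minimal Python sketch, standard library only, that decodes it offline. Reading the serial number from the last 16 bytes is an assumption that holds for this particular request (a single CertID, no extensions); it is not a general ASN.1 parser.

```python
import base64
from urllib.parse import unquote, urlsplit

# URL copied from the example curl command in this report.
OCSP_GET_URL = (
    "http://ocsp.digicert.com/"
    "MFYwVKADAgEAME0wSzBJMAkGBSsOAwIaBQAEFPKKiVLNMEsaHyQMgXzdNzmnizkKBBR435GQX"
    "%2B7erPbFdevVTFVT7yRKtgIQCZqu7ycbeBh9gglQHvcz3g%3D%3D"
)


def decode_ocsp_get_request(url: str) -> bytes:
    """Return the DER-encoded OCSPRequest carried in an OCSP GET URL."""
    encoded = urlsplit(url).path.lstrip("/")  # still percent-encoded here
    return base64.b64decode(unquote(encoded))


request_der = decode_ocsp_get_request(OCSP_GET_URL)
# Assumption: the serial number is the final 16-byte INTEGER in this request.
serial_hex = request_der[-16:].hex().upper()
print(len(request_der), serial_hex)
```

The decoded serial can be compared against the certificate at the crt.sh link above to confirm which certificate the request is asking about.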

Actual results:

I received a 1-byte OCSP response. This violates the following sections of the Baseline Requirements (https://cabforum.org/wp-content/uploads/CA-Browser-Forum-BR-1.6.5.pdf):

  1. 4.9.9. On-line Revocation/Status Checking Availability
  • the response does not conform to RFC 6960 and/or RFC 5019, and it is not signed.
  2. 4.9.10. On-line Revocation Checking Requirements
  • the response has not been updated in the required time

Expected results:

I should have received a validly signed OCSP response.

Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true

Jeremy: Could you route this appropriately? I saw some CCADB contacts recently changed on DigiCert's side, so I'm not sure if you are still the best Bugzilla contact for compliance issues.

Assignee: wthayer → jeremy.rowley
Flags: needinfo?(jeremy.rowley)
Whiteboard: [ca-compliance]

Yeah - I've got Rick Roos looking at it. He should post something this evening.

Flags: needinfo?(jeremy.rowley)

Hi Curt,

Thank you for reporting this issue. We looked into it and we now have a valid response being returned for this certificate. I’ll be following up with more details once we are sure we have all the information about the root cause.

We figured out the root cause. As mentioned on the Mozilla dev forum, the issue occurs when OCSP is checked for a pre-cert where an actual cert was not issued for whatever reason. We have a script ready to deploy tomorrow morning to fix it. We'll draft the incident report tomorrow as well and give more details about the OCSP service and what's going on. Figured I'd give an update tonight, though, to say that we've figured out the issue and its scope (pre-certs), and to let you know that we are working on the remediation and incident report.

  1. How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.

On 2019-08-27 14:03 MDT a bug on Bugzilla was opened reporting that a valid OCSP response was not being returned for the certificate with the serial number 099AAEEF271B78187D8209501EF733DE and instead a single byte file with the value of ‘0’ was received. This certificate was valid and should have been returning a valid signed OCSP response.

  2. A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.

2019-08-27 14:03 MDT a bug on Bugzilla was opened reporting that a valid OCSP response was not being returned

2019-08-27 14:10 MDT Investigation begins

During the investigation, two issues were discovered that caused an invalid OCSP response to be returned.

2019-08-27 15:55 MDT
Original issue was discovered; the file was corrected and the CDN cache was flushed.

2019-08-27 18:01 MDT
The pre-certificate mentioned in the response had its OCSP enabled and a valid response pushed live.

2019-08-28 17:30 MDT
Patch to fix new issuance tested and implemented.

  3. Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem. A statement that you have will be considered a pledge to the community; a statement that you have not requires an explanation.

The CA is still issuing, as a fix has been made to ensure that all pre-certificates are enabled for OCSP when they are initially saved during the CT pre-cert process, so that any failure later in the process will not prevent OCSP generation for that certificate. A script has also been created to go back and enable OCSP for any previous pre-certs that are in the same invalid state. This is still running and is expected to be finished on the 29th.

  4. A summary of the problematic certificates. For each problem: number of certs, and the date the first and last certs with that problem were issued.

This impacts OCSP responses only; there is no issue with the certificates themselves. However, it would have impacted OCSP responses for all pre-certs, which is just over 1 million.

  5. The complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem.

This impacts OCSP responses only; there is no issue with the certificates themselves.

  6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

The first issue caused a single byte with the value ‘0’ to be returned instead of a 5-byte ASN.1 OCSP UnAuthorized response when a valid OCSP response for the requested certificate was not found. DigiCert’s OCSP infrastructure consists of a front-facing CDN backed by several origin servers spread throughout the world. When a request comes through the CDN cache and is received by an origin server, the origin server looks to see whether it has a pre-generated response for the requested certificate. If it finds one, that response is returned; if it does not, there is a default response file saved on the file system that is supposed to contain the OCSP UnAuthorized response. Following origin server upgrades, that default file on some of the servers was overwritten with the incorrect response, causing it to be returned and cached on the CDN. Once the error was discovered, the file was corrected and the CDN cache was flushed.
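To make the two byte patterns concrete: an OCSP error response is a DER SEQUENCE holding only the ENUMERATED responseStatus, so Unauthorized (status 6 in RFC 6960) encodes as the five bytes 30 03 0A 01 06. The Python sketch below classifies a response body on that basis. The function name and return labels are illustrative only, it does not parse or verify signed responses, and because the report does not say whether the stray byte was ASCII '0' or a zero byte, both are treated as malformed.

```python
# Error values of OCSPResponseStatus from RFC 6960, section 4.2.1.
# 0 (successful) is excluded because a successful response must carry
# responseBytes and so cannot be five bytes long; 4 is not used.
OCSP_ERROR_STATUS = {
    1: "malformedRequest",
    2: "internalError",
    3: "tryLater",
    5: "sigRequired",
    6: "unauthorized",
}

# DER prefix: SEQUENCE (length 3) containing ENUMERATED (length 1).
ERROR_PREFIX = bytes.fromhex("30030a01")


def classify_ocsp_body(body: bytes) -> str:
    """Roughly classify an OCSP response body by its leading bytes."""
    if len(body) == 5 and body[:4] == ERROR_PREFIX and body[4] in OCSP_ERROR_STATUS:
        return "error:" + OCSP_ERROR_STATUS[body[4]]
    if len(body) > 5 and body[0] == 0x30:
        # Looks like a DER SEQUENCE; full parsing and signature checks still needed.
        return "needs-full-validation"
    return "malformed"


print(classify_ocsp_body(bytes.fromhex("30030a0106")))
print(classify_ocsp_body(b"0"))
```

A check like this is cheap enough to run against the default response file on every origin server, although a real monitor would also need to validate signed responses properly.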

As the first issue was being investigated, it was also determined that the origin servers did not actually have a pre-generated response for that certificate when they should have. When a certificate that requires CT is in the process of being issued, the pre-certificate is created, signed, and then saved into the database in a pending final issuance state before it is sent to the CT logs. Once the CT process is complete, the final certificate is created and saved in the database in a final issuance state and, at the same time, enabled for OCSP. The problem occurred because OCSP was not properly enabled when the CT process failed for whatever reason (network issues, not enough working CT logs, etc.). During a revocation event, OCSP does get re-enabled correctly and OCSP responses would start to be generated and pushed out to the origin servers for delivery. The pre-certificate mentioned in the response had its OCSP enabled and a valid response pushed live.
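The ordering problem described above can be illustrated with a toy model. This is a hypothetical Python sketch, not DigiCert's code: the `Issuance` class, its state names, and the `enable_ocsp_at_precert` switch are invented for illustration. With the switch off, OCSP is enabled only at final issuance, so a CT failure leaves the pre-certificate without a response. With it on, OCSP is enabled when the pre-certificate is first saved, which is the remediation described in this report.

```python
from dataclasses import dataclass


@dataclass
class Issuance:
    """Toy record for one certificate request (all names hypothetical)."""
    serial: str
    state: str = "new"
    ocsp_enabled: bool = False


def issue(record: Issuance, ct_succeeds: bool, enable_ocsp_at_precert: bool) -> Issuance:
    # Step 1: the pre-certificate is created, signed, and saved as pending.
    record.state = "pending_final_issuance"
    if enable_ocsp_at_precert:
        record.ocsp_enabled = True  # remediated ordering

    # Step 2: CT submission, which can fail (network issues, too few logs).
    if not ct_succeeds:
        record.state = "ct_failed"
        return record  # no final certificate is signed

    # Step 3: final certificate saved; the original flow enabled OCSP only here.
    record.state = "issued"
    record.ocsp_enabled = True
    return record


old_flow = issue(Issuance("A1"), ct_succeeds=False, enable_ocsp_at_precert=False)
new_flow = issue(Issuance("A2"), ct_succeeds=False, enable_ocsp_at_precert=True)
print(old_flow.ocsp_enabled, new_flow.ocsp_enabled)
```

The same model doubles as the shape of a regression test: force the CT step to fail and assert that OCSP is still enabled for the pre-certificate.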

When a pre-cert ultimately fails to be successfully submitted to CT, a final certificate is not signed or returned to the customer. Most of our OCSP testing has covered cases where a certificate was fully issued, or where a fully issued certificate or pre-cert was revoked; a test case for OCSP on an unused, failed pre-cert had not been added.

  7. List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.

For the UnAuthorized file on the origin servers, we are adding a check to our external monitoring tool to check for a correct response for an unknown certificate. This check will also be performed before a new origin server is brought into the rotation.

We are adding an automated test to our integration test pipeline that will cause a certificate request to fail on CT submission and we will ensure that an OCSP response is still generated for the certificate.

A post mortem will be conducted prior to 12th Sept to see what we can learn from this and what can be done to stop similar incidents in the future.

Quick Update.

After going through the post-mortem and this workflow at length we can report some discoveries and next steps we are taking.

We found another way to trigger this issue. There is a potential issue in the event of a server crash after signing the pre-cert and before completing the issuance process, which leads to an unformed response.

We are building a background process to check for and rectify this. It will run across the board as a secondary check that responses have been created for all pre-certificates and, if not, generate them in real time.
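A secondary check of this kind might look like the following Python sketch. It is an illustration under stated assumptions rather than the deployed process: the in-memory collections and the `generate_response` callback are hypothetical stand-ins for whatever database and signing service are actually used.

```python
from typing import Callable, Iterable, List, Set


def reconcile_ocsp(
    precert_serials: Iterable[str],
    serials_with_response: Set[str],
    generate_response: Callable[[str], None],
) -> List[str]:
    """Generate OCSP responses for pre-certs lacking one; return those serials."""
    repaired = []
    for serial in precert_serials:
        if serial not in serials_with_response:
            generate_response(serial)          # hypothetical signing hook
            serials_with_response.add(serial)  # recorded so reruns are no-ops
            repaired.append(serial)
    return repaired


have_response = {"01", "03"}
generated: List[str] = []
first_pass = reconcile_ocsp(["01", "02", "03", "04"], have_response, generated.append)
second_pass = reconcile_ocsp(["01", "02", "03", "04"], have_response, generated.append)
print(first_pass, second_pass)
```

Making the pass idempotent means it can run on a schedule, or be triggered after a crash, without generating duplicate work.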

New code and test cases to examine these scenarios are in the process of being developed and executed. We will provide our next update in another week (9/18) with a status of when the code and test case execution will be completed.

Just touching base.

The QA tests to check these results for future software releases have been incorporated.

We are still finalizing the scope of the background process to monitor for a server crash. I will update when we have more information on this.

Question on this bug. Considering the debate on Mozilla dev forum and the fact that there is not clear guidance on how this should have happened, do we want to close this bug while the Mozilla policy is revised? I think we've answered all the questions with the current policy. We'll continue to change how the pre-certs work as the Mozilla policy evolves, but our OCSP service is currently responding 'Good' for pre-certs.

Also, are bugs similar to this being opened for all PrimeKey systems, or is this particular issue being classified as a non-issue until the policy debate is decided?

Flags: needinfo?(wthayer)

Just updating on the status whilst the above is discussed.

The scoping of the background process check has been completed, and the work has been added to the 2-week sprint starting 10 Oct.

Jeremy: I don’t think this bug is similar, is it? There’s a meaningful difference between providing a syntactically valid but incorrect response (e.g. Unknown) and providing junk data (a 1-byte response).

From the incident report, which was helpful, it seems clear this issue was a server misconfiguration, not a misunderstanding of policy. Right?

Flags: needinfo?(jeremy.rowley)
Type: defect → task

I thought the bug represented a mix of the two: there was an invalid response, and then the question of what the response should be for pre-certs that didn't issue. Should we continue to talk just about the invalid response here and discuss the pre-cert question on the Mozilla thread?

Flags: needinfo?(jeremy.rowley)

I think we've clarified the policy here, and to Ryan's point, this is a little different than the "unknown" response. Please continue to regularly update this bug with remediation status.

Flags: needinfo?(wthayer)

Getting in on the weekly update: we managed to sneak this into the current sprint, so the background hang check is currently being worked on.

This will be a worker process that is called when a hung system is located. It will go through the queue and check for stalled pre-certificates; if there are any, it will send them to a working system for processing, thus removing the chance of a stalled issuance not having a correct OCSP response.

The script has been deployed in a test environment and is going through QA now.

We should have an update in the next week, when this has been finished and is live.

Noting here that I have closed the related bugs in which OCSP responders were returning "unknown" for precertificates [1], but I'm leaving this one open because in this case there is an RFC violation.

[1] https://groups.google.com/d/msg/mozilla.dev.security.policy/LC_y8yPDI9Q/tPrL7rNkBAAJ

Code is still going through QA currently.

I will update in 7 days, or sooner if there is an important update.

Code has been completed and QA'ed and is now waiting for deployment to the production system.

This patch was deployed to production on the 15th.

After a week of monitoring, everything is working as expected.

I think this can now be closed.
The issue has been identified, and code has been developed and released to test for and resolve the issue in real time.

As can be seen from the extensive MDSP conversation, this has assisted both the community and other CAs in their learning, monitoring, testing, and now adjusting of standards.

Wayne: Is there anything else you see pending on this bug before closing?

Flags: needinfo?(wthayer)

It appears that all questions have been answered and remediation is complete.

Status: ASSIGNED → RESOLVED
Closed: 3 months ago
Flags: needinfo?(wthayer)
Resolution: --- → FIXED