Steps to reproduce:
From 2020-04-08 16:25 UTC to 2020-04-09 05:40 UTC, Google Trust Services' EJBCA-based CAs (GIAG4, GIAG4ECC, GTSY1-4) served empty OCSP data, which led the OCSP responders to return unauthorized.
These CAs exist for issuance of custom certificate profiles and certificates for test sites for inactive roots. Our primary CAs (GTS CA 1O1 and GTS CA 1D2) were unaffected. The problem self-corrected, but we have added safeguards to prevent recurrence.
- How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.
Monitoring detected the issue on 2020-04-08 at 16:35 UTC. The root cause was identified within hours. The issue was remediated automatically by the next OCSP generation and CDN push cycle while debugging and fixes were still in progress.
- A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.
2020-04-08, 11:29 UTC - Scheduled system update begins
2020-04-08, 14:00 UTC - Incorrect OCSP archives are generated
2020-04-08, 15:03 UTC - Scheduled system update concludes
2020-04-08, 16:20 UTC - Incorrect OCSP responses pushed to CDN
2020-04-08, 16:35 UTC - First production monitoring alert fires
2020-04-08, 22:00 UTC - Correct OCSP archives are generated automatically
2020-04-09, 00:20 UTC - Correct OCSP responses pushed to CDN
2020-04-09, 05:40 UTC - Monitoring confirms all probes are passing
- Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem. A statement that you have will be considered a pledge to the community; a statement that you have not requires an explanation.
The affected CAs are used only for infrequent, manual issuance of custom certificates. No certificates were issued during this period apart from one manually issued post-update test certificate, which was used to validate the upgrade that resolved the issue. The issue was also specific to refreshing OCSP responses, not to certificate issuance.
- A summary of the problematic certificates. For each problem: number of certs, and the date the first and last certs with that problem were issued.
No certificates were issued during this period apart from one manually issued post-update test certificate, which was used to validate the upgrade that resolved the issue. The test certificate was a valid and fully compliant issuance.
- The complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem.
No certificates were issued apart from the manually issued post-update test certificate used to validate the upgrade.
- Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.
Our process for creating OCSP responses and packaging them for serving is designed to fail if any sub-command fails, using set -e. However, if a function call is part of an AND or OR list (i.e. using the '&&' or '||' control operators), set -e is suppressed inside the function.
The tool we use to fetch OCSP responses from EJBCA correctly returned a non-zero exit code (no OCSP responses were generated because EJBCA was not running). However, because it was called inside a function with its own error handling (using && syntax), the script continued without handling the error properly and wrongly used empty tar.gz files with no responses in them. The bug had existed for multiple years as a potential race condition, and we had not encountered it previously.
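The behavior described above can be reproduced in a few lines of POSIX shell. This is a minimal sketch, not our production script: the function names `fetch_unchecked` and `fetch_checked` are hypothetical, and `false` stands in for the fetch tool returning a non-zero exit code.

```shell
#!/bin/sh
set -e

# Hypothetical stand-in for the fetch step; `false` simulates the tool failing.
fetch_unchecked() {
  false                               # failure is ignored while set -e is suppressed
  unchecked_result="packaged-anyway"  # still runs; the function then returns 0
}

# Same step, but the failure is propagated explicitly instead of relying on set -e.
fetch_checked() {
  false || return 1
  checked_result="packaged-anyway"
}

unchecked_result="not-packaged"
checked_result="not-packaged"

# Called as part of an && list, so set -e has no effect inside the function.
fetch_unchecked && echo "unchecked: reported success"

# An `if` condition suppresses set -e as well, but the explicit return still works.
if fetch_checked; then
  echo "checked: reported success"
else
  echo "checked: fetch failed, stopping before packaging"
fi

echo "unchecked_result=$unchecked_result checked_result=$checked_result"
```

In the unchecked variant the function not only continues past the failure but also returns success, because its exit status is that of its last command. Propagating the failure explicitly (`|| return 1`) does not depend on whether set -e is in effect at the call site.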
Quality tests are executed before publication to the CDN; however, those tests accept empty responses as a valid condition because it is something that can and does happen.
This condition did not repeat on the following update of the OCSP responses, so the next update resolved the issue. Our monitoring caught the issue, enabling expedient root cause analysis and resolution.
- List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.
No certificates were issued during this period apart from a valid, manually issued post-update test certificate used to validate the upgrade.
The logic error that led to incorrect OCSP responses being served has been corrected; the fix is checked in and running in production. Additionally, checks have been added to ensure that bad data cannot replace known good data.
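As an illustration of the kind of check described above, the following sketch refuses to replace a known good set of responses with an empty one. It is an assumption-laden example rather than our actual safeguard: the helper names `count_responses` and `promote_if_nonempty`, the directory layout, and the file names are all hypothetical, and the responses are modeled as plain files in a directory rather than tar.gz archives.

```shell
#!/bin/sh
set -e

# Hypothetical helper: count regular files under a staging directory.
count_responses() {
  find "$1" -type f | wc -l | tr -d ' '
}

# Hypothetical guard: promote the candidate only if it contains responses;
# otherwise keep serving the known good set and report failure explicitly.
promote_if_nonempty() {
  candidate=$1
  live=$2
  if [ "$(count_responses "$candidate")" -eq 0 ]; then
    echo "refusing to replace $live with an empty candidate" >&2
    return 1
  fi
  # Not atomic; a real deployment would likely swap the directories atomically.
  rm -rf "$live"
  cp -R "$candidate" "$live"
}

# Demo with throwaway data in a temporary directory.
work=$(mktemp -d)
mkdir "$work/live" "$work/empty" "$work/fresh"
echo "good" > "$work/live/resp1.der"
echo "new" > "$work/fresh/resp2.der"

if promote_if_nonempty "$work/empty" "$work/live"; then
  empty_result="promoted"
else
  empty_result="rejected"
fi
after_empty=$(ls "$work/live")

promote_if_nonempty "$work/fresh" "$work/live"
after_fresh=$(ls "$work/live")

echo "empty candidate: $empty_result (live still has $after_empty)"
echo "fresh candidate: live now has $after_fresh"
```

The guard returns its failure explicitly instead of relying on set -e, so it behaves the same whether it is called on its own line, in an `if` condition, or in an && list.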
We reviewed all existing monitoring of response generation and publishing and found no gaps.
A review of similar code has also been conducted to ensure we do not have other instances where similar logic could incorrectly suppress errors.
The only unexpired, revoked certificates under these CAs are those used by our six demo sites.
Users or automation using these sites for testing may have interpreted the unauthorized responses to mean these revoked demo certificates were to be considered valid during the window in which bad data was served.
The issue was limited to OCSP handling; CRL data was correct during the same period.
No additional improvements are outstanding at this time.