Closed Bug 1486650 Opened Last year Closed Last year
Let's Encrypt: OCSP "unauthorized" responses
Josh Aas posted the following incident report to the mozilla.dev.security.policy forum on August 24:

Let's Encrypt OCSP Responder Incident

To see the original communication on our Community Forums, click here: https://community.letsencrypt.org/t/2018-08-23-ocsp-responder-incident/70350

At 17:47 UTC on August 23rd, 2018 we deployed a configuration change to our OCSP responder service that resulted in 90% of traffic to our origin incorrectly receiving OCSP "unauthorized" statuses for valid OCSP requests. Most OCSP responses that were cached at our CDN prior to the incident were not affected. The change was reverted at 19:33 UTC the same day to resolve the problem, though CDN caching may have resulted in affected statuses being served for a limited period of time after resolution.

The root technical cause of this incident was [a change](https://github.com/letsencrypt/boulder/pull/3815) developed during a previous incident in which malformed OCSP traffic was causing excessive strain on the OCSP responder. Unfortunately, [a bug in the implementation](https://github.com/letsencrypt/boulder/issues/3829) improperly rejected OCSP requests unless they matched the last configured serial prefix rather than any configured serial prefix. We have since [fixed the bug](https://github.com/letsencrypt/boulder/pull/3830).

We first became aware of the problem at 17:52 UTC after our internal alerting flagged invalid OCSP responses for certificates issued by our monitoring systems, though the scale of the issue was not immediately clear. We began investigating the root cause, identified the problem at 19:26 UTC and immediately disabled the prefix validation feature in staging and production.

The bug was not caught during testing because the unit test accompanying the initial PR did not cover the case of multiple acceptable prefixes.
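
Editorial illustration (not part of the original report): the class of bug described above, where a request is accepted only if it matches the last configured serial prefix rather than any of them, can be sketched in Go roughly as follows. This is a minimal sketch and not Boulder's actual code; the function names `matchesLastPrefixOnly` and `matchesAnyPrefix`, the hex-string serial representation, and the rule that an empty prefix list disables the check are all assumptions made for illustration.

```go
package main

import "strings"

// matchesLastPrefixOnly is a hypothetical reproduction of the bug class:
// the result is overwritten on every iteration, so only the final
// configured prefix decides whether the serial is accepted.
func matchesLastPrefixOnly(serial string, prefixes []string) bool {
	match := false
	for _, p := range prefixes {
		match = strings.HasPrefix(serial, p)
	}
	return match
}

// matchesAnyPrefix is a hypothetical corrected version: the serial is
// accepted as soon as it starts with any configured prefix.
// Assumption: an empty prefix list means the check is disabled.
func matchesAnyPrefix(serial string, prefixes []string) bool {
	if len(prefixes) == 0 {
		return true
	}
	for _, p := range prefixes {
		if strings.HasPrefix(serial, p) {
			return true
		}
	}
	return false
}
```

With two configured prefixes such as `[]string{"ff", "fa"}`, the first function rejects a serial beginning with `ff` while the second accepts it. A unit test that configures only one prefix cannot tell the two apart, which is consistent with the testing gap the report describes.
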
The bug was not caught in our staging environment for two reasons: (1) Our internal OCSP monitoring looks for HTTP 500s, but ignores OCSP "unauthorized" responses, because a large number of such responses can be triggered externally by misconfigured clients; (2) Our end-to-end OCSP monitoring tests were working in production, but not in staging.

Remediation items:

1. Review our procedures for ensuring that all monitoring tools are applied to both production and staging environments.
2. Extend OCSP monitoring to include OCSP statuses (unauthorized, revoked, ok, etc) in addition to HTTP statuses.
3. Add alerts when the fraction of unauthorized or revoked OCSP responses is extremely high.

Timeline:

2018-08-23 01:43 UTC - feature configured in staging
2018-08-23 17:47 UTC - feature configured in production
2018-08-23 19:31 UTC - feature disabled in staging
2018-08-23 19:33 UTC - feature disabled in production
Josh: please update this bug as remediation items are completed.
1. Review our procedures for ensuring that all monitoring tools are applied to both production and staging environments.

This review has been completed and equivalent monitoring has been applied to staging.

2. Extend OCSP monitoring to include OCSP statuses (unauthorized, revoked, ok, etc) in addition to HTTP statuses.

We have added capabilities to better monitor OCSP response statuses such as InternalError, Malformed, and Unauthorized: https://github.com/letsencrypt/boulder/pull/3841

This enabled better testing for the issue that caused this incident ("unauthorized" status).

3. Add alerts when fraction of unauthorized or revoked OCSP responses is extremely high.

We have added alerts covering unauthorized response counts, and we monitor whether revoked certificates are getting OCSP statuses other than "revoked".

We consider remediation for this incident to be complete.
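
Editorial illustration (not part of the original comment): the third remediation item, alerting when the fraction of "unauthorized" responses is extremely high, could be checked along the following lines. This is only a sketch under stated assumptions and does not describe Let's Encrypt's actual alerting: the `ocspCounts` type, the status label strings, the function names, and the threshold value are all hypothetical.

```go
package main

// ocspCounts is a hypothetical tally of OCSP responses per status label
// (for example "good", "revoked", "unauthorized") over a monitoring window.
type ocspCounts map[string]int

// unauthorizedFraction returns the share of responses in the window that
// carried the "unauthorized" status. It returns 0 for an empty window so
// that a quiet period never divides by zero.
func unauthorizedFraction(c ocspCounts) float64 {
	total := 0
	for _, n := range c {
		total += n
	}
	if total == 0 {
		return 0
	}
	return float64(c["unauthorized"]) / float64(total)
}

// shouldAlert reports whether the unauthorized share exceeds the given
// threshold. The threshold is an assumed tuning parameter; it has to sit
// above the background rate caused by misconfigured clients.
func shouldAlert(c ocspCounts, threshold float64) bool {
	return unauthorizedFraction(c) > threshold
}
```

A window resembling the incident, for instance `ocspCounts{"good": 10, "unauthorized": 90}`, would trip a threshold of 0.5, whereas a window with only a handful of unauthorized responses from misconfigured clients would not. Alerting on a fraction rather than on any single unauthorized response is what lets the check coexist with that externally triggered noise.
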
Status: NEW → RESOLVED
Closed: Last year
Resolution: --- → FIXED