Open Bug 1905419 Opened 3 months ago Updated 1 month ago

GoDaddy: Intermittent unauthorized OCSP response when certificate is freshly issued

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

ASSIGNED

People

(Reporter: amir, Assigned: star)

Details

(Whiteboard: [ca-compliance] [ocsp-failure])

Through https://sslmate.com/labs/ocsp_watch/ I've noticed that for some period of time, OCSP status is unavailable for freshly minted certificates on GoDaddy.

I reached out to GoDaddy through their CPR process with specific certificates that had issues, along with the OCSP Watch link, but I have not received an answer indicating that they are taking a deeper look at why OCSP Watch keeps flagging GoDaddy intermittently.

E.g: https://crt.sh/?id=13553082022&opt=ocsp This certificate is returning "ocsp.ParseResponseForCert => ocsp: error from server: unauthorized" at the time of opening this bug - I suspect that the problem will go away after a couple of minutes.

As far as I understand, OCSP responses are supposed to be available as soon as the precertificate is issued.

I can confirm that OCSP Watch is flagging GoDaddy and can see similar errors on crt.sh for recent certificates.

Flags: needinfo?(jreading)

Thank you for assigning this to me. Just wanted to provide an update that we are reviewing this internally.

Flags: needinfo?(jreading)
Assignee: nobody → jreading
Status: NEW → ASSIGNED
Type: defect → task
Whiteboard: [ca-compliance] [ocsp-failure]

When I check these, they might appear in OCSP Watch, but then I immediately go to crt.sh, and they are reported as "good". What kind of latency are we talking about here? How soon after creation of a precertificate do these OCSP responses need to be available?

Of some relevance, in the CA/B Forum, I had mentioned adopting service level requirements for OCSP uptime, but that was dropped or hasn't been brought up again. See https://lists.cabforum.org/pipermail/servercert-wg/2020-May/001905.html.

An issue was filed in GitHub (https://github.com/mozilla/pkipolicy/issues/214), but that did not go into the latency encountered when new OCSP responses are distributed via a CDN, and we have not explored whether measuring that time starts from publication of the precertificate (although we do require that OCSP responses be available for precertificates). But no metric has been suggested for measuring this. Because of the complexity of setting such timeframes and the lack of feedback, I removed it from the version 2.9 changes to the MRSP.

As of 2024-07-02 19:57:06 UTC, this certificate had been deployed onto the TLS server and the OCSP responder was still responding unknown for it. So it's not just an issue with precertificates.

Thank you again for bringing this to our attention. Wanted to provide a preliminary response to some of the items raised here. What you’re seeing is latency in our current OCSP batching process. We believe this latency can be reduced and are taking actions to streamline and update this from a batch process to an on-demand OCSP signing solution, targeting late Q3 2024/early Q4 2024. In the interim, we are continuing to look at other avenues to drive these numbers down (i.e., so that the response being served is not “unauthorized”).

We would also like to clarify what the “unauthorized” response means for our certificates. When there is a delay with the current OCSP batch process, our system fails as “closed” to align with RFC 5019. We have our response fail in this way to align with the RFC and to err on the side of caution to prevent returning a “good” response when we should not.
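For readers following along, the "unauthorized" result discussed here is an OCSP response status (value 6 in RFC 6960), which applies to the whole unsigned response. It is distinct from the "unknown" certificate status, which appears inside a successful, signed response. The following is a minimal sketch, not GoDaddy's or crt.sh's code, of reading that status from the raw DER bytes. The function name and the sample bytes are illustrative assumptions.

```python
# OCSPResponseStatus values from RFC 6960, section 4.2.1 (value 4 is unused).
OCSP_RESPONSE_STATUS = {
    0: "successful",
    1: "malformedRequest",
    2: "internalError",
    3: "tryLater",
    5: "sigRequired",
    6: "unauthorized",
}


def ocsp_response_status(der: bytes) -> str:
    """Return the responseStatus name from a DER-encoded OCSPResponse.

    Only the outer SEQUENCE header and the leading ENUMERATED are read;
    responseBytes (present only for "successful") is not parsed or verified.
    """
    if len(der) < 2 or der[0] != 0x30:
        raise ValueError("not a DER SEQUENCE")
    # Skip the SEQUENCE length: short form is one byte, long form is 1 + n bytes.
    pos = 2 if der[1] < 0x80 else 2 + (der[1] & 0x7F)
    # Expect ENUMERATED (tag 0x0A) with a one-byte value.
    if der[pos:pos + 2] != b"\x0a\x01" or len(der) < pos + 3:
        raise ValueError("missing responseStatus")
    code = der[pos + 2]
    return OCSP_RESPONSE_STATUS.get(code, f"unrecognized({code})")


# Hypothetical sample: an error response carries no responseBytes,
# so the entire message is five bytes.
UNAUTHORIZED = bytes.fromhex("30030a0106")
print(ocsp_response_status(UNAUTHORIZED))  # unauthorized
```

Because an error response like this is unsigned and carries no certificate status at all, a client cannot distinguish "not yet propagated" from any other reason the responder declines to answer.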

Consistent with Ben’s observations in his comment above, we do not see a metric that we could use to baseline our OCSP responses that are propagated across CDNs. We would be happy to reignite the discussion that Ben mentioned above and drive a ballot proposal within the CA/B Forum to make the TLS BRs clearer and to better drive expectations and accountability in this space, especially if it minimizes confusion and concern from relying parties.

Assignee: jreading → star

I'm wondering whether a 15-minute latency for publication of OCSP responses for pre-certificates would be something that should be adopted either in the Mozilla Root Store Policy or by the CA/B Forum. I filed an issue in GitHub for this: https://github.com/mozilla/pkipolicy/issues/280.

(In reply to Ben Wilson from comment #6)
> I'm wondering whether a 15-minute latency for publication of OCSP responses for pre-certificates would be something that should be adopted either in the Mozilla Root Store Policy or by the CA/B Forum. I filed an issue in GitHub for this: https://github.com/mozilla/pkipolicy/issues/280.

As noted above, this issue appears to exist for final certificates, not just pre-certificates, so such a grace period would not help in this particular case.


It's slightly unclear to me whether the Root Programs want to treat this issue as an incident. The BRs do not contain a general prohibition against serving "unknown" OCSP responses for actually-issued certificates. RFC 6960 says that the "unknown" state indicates that "the responder doesn't know about the certificate being requested", which appears to be an accurate description of the latency inherent in GoDaddy's OCSP infrastructure.

The Mozilla Root Program requires that "CA operators MUST maintain an online 24x7 repository mechanism whereby application software can automatically check online the current status of all unexpired certificates issued by the CA" (emphasis added). These certificates are clearly issued and unexpired, and yet their status cannot be checked at the OCSP responder. However, I assume that GoDaddy also produces CRLs in compliance with BRs 4.9.7, and likely satisfies this requirement via that mechanism instead.

Regardless of whether this is a full-blown incident or not, it seems clear to me that this chronic OCSP latency is both: a) unique to GoDaddy; and b) unexpected and undesired. It feels like a violation of the spirit of the requirements, if not their letter.

As such, as a community member, I would like to request a full report matching the CCADB incident report template, detailing how GoDaddy's system design led to this issue and what steps (including changes to the BRs!) are being taken to remediate it.

Once such a report has been provided for the benefit of the community, I think it would likely be appropriate to close this issue as INVALID.

Thank you for the request, Aaron. (I'm responding on behalf of Star Simmons, who is on PTO this week. I am her manager at GoDaddy.)

We have been looking into the issue and agree we could be better. While we may be technically fulfilling the requirements, we are working to address these issues.

As we’ve investigated this closely, the best solution isn’t something that will be fixed overnight. So, our team is currently working on a short-term patch aimed at incremental improvements. We also have a better solution that we’re starting work on, but it’s longer term.

Regarding your request for a report, we plan on issuing a fully detailed report once we have finalized our project plans for the new long-term solution. We will do our best to fit this issue into the CCADB incident form, but we would like to note it may not be a perfect fit as this issue is not a formal violation.

Summary

GoDaddy’s OCSP response sync mechanism experienced a degradation in performance as the CA scaled up issuance. This led to the intermittent “unauthorized” responses from GoDaddy’s OCSP responders for newly issued certificates until the response was propagated to the public responder nodes.

Impact

It is not possible to identify the number of impacted certificates as the issue was intermittent and there is no history we can look back at to say which certificates had a delay in OCSP response propagation.

Timeline

All times are in UTC

2024-06-27 03:20:00 - CPR to GoDaddy sent by Security Researcher regarding example certificate giving “Unauthorized” OCSP response
2024-06-27 11:54:00 - GoDaddy acknowledges the CPR and starts investigation
2024-06-27 19:14:00 - GoDaddy responds to CPR with findings
2024-06-28 19:25:00 - Bug Report 1905419 filed
2024-06-28 23:04:00 - GoDaddy acknowledges bug report
2024-07-04 01:05:00 - GoDaddy makes initial response regarding latency
2024-07-18 03:19:00 - GoDaddy commits to filing an incident report once short-term improvements are deployed
2024-07-19 00:10:00 - Responder syncing schedules tuned for more consistent propagation
2024-08-16 19:39:00 - Added additional script to “fast-track” propagation of newly generated responses to Responders

Root Cause Analysis

Background

GoDaddy currently uses a two-tiered OCSP Response generation system, where responses are pre-generated on a cluster, and then shipped to responder nodes that serve them up to clients when requested. As GoDaddy certificate issuance recently scaled higher, the OCSP response propagation time to OCSP responders increased. This delay caused our responders to return a 401 Unauthorized response for newly issued certificates upwards of an hour after issuance while the response was being shipped to the appropriate nodes.

Missed Requirement

GoDaddy does not believe this issue constitutes a violation of the CA/B Forum BRs, RFC 6960, or RFC 5019. GoDaddy has made, and is continuing to make, improvements to scale our OCSP system.

Requirement Fix

Improvements to the OCSP responder sync schedule for more consistent syncing

Fast-track syncing of OCSP responses for newly issued certificates

Deployment

See timeline above.

Lessons Learned

Our OCSP response generation and syncing mechanism needed an upgrade. The increased scale of certificate issuance caused a tipping point for properly responding to OCSP requests for newly issued certificates.

We also learned that further monitoring of our OCSP response generation and syncing was required. We proactively added monitoring to alert if OCSP response syncing is falling behind.
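The report does not describe how this monitoring works. As a rough illustration only, a lag check of this kind could compare when each response was signed with when it became servable. Everything here is an assumption: the function name, the two timestamp maps, the sample serials, and the 15-minute threshold, which is borrowed from the figure floated earlier in this bug rather than from any GoDaddy setting.

```python
from datetime import datetime, timedelta, timezone


def lagging_serials(generated_at, published_at, now, max_lag=timedelta(minutes=15)):
    """Return (serial, lag) pairs whose propagation exceeds max_lag, worst first.

    generated_at: serial -> time the response was signed on the generation cluster
    published_at: serial -> time the response became servable on responder nodes
    A serial missing from published_at is still in flight and is measured
    against `now`.
    """
    late = []
    for serial, signed in generated_at.items():
        lag = published_at.get(serial, now) - signed
        if lag > max_lag:
            late.append((serial, lag))
    return sorted(late, key=lambda item: item[1], reverse=True)


# Hypothetical data: three responses signed at the same moment.
t0 = datetime(2024, 7, 2, 19, 0, tzinfo=timezone.utc)
generated = {"0a": t0, "0b": t0, "0c": t0}
published = {"0a": t0 + timedelta(minutes=5), "0b": t0 + timedelta(minutes=50)}

late = lagging_serials(generated, published, now=t0 + timedelta(minutes=57))
for serial, lag in late:
    print(serial, lag)
```

Measuring unpublished serials against the current time means a stalled sync shows up as growing lag rather than as missing data, which is the condition an alert would most need to catch.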

What went well

The improvements made to our OCSP response propagation have vastly improved our ability to properly respond to OCSP requests for newly issued certificates.

What didn't go well

Scale of OCSP response propagation did not match the increased scale of certificate issuance.

Where we got lucky

N/A

Action Items

| Action Item | Kind | Due Date |
| ----------- | ---- | -------- |
| Adjust response syncing schedules for more consistent propagation to Responders | Prevent | 2024-07-19 (completed) |
| Add additional script to “fast-track” propagation of newly generated responses to Responders | Prevent | 2024-08-16 (completed) |
