Bugzilla

Comment 1

•

5 years ago

Jacob,

Thanks for the quick investigation and response. Was this something in the open-source Boulder code? Could you perhaps provide pointers to it?

In particular, it'd be interesting to understand the flow and sequencing of a pre-certificate being signed, an OCSP response being generated, the pre-certificate being submitted to logs, and the actual final certificate being issued. While normally this would be a question for the CA to report back, the open-source nature hopefully makes it easier to find answers in the source to the questions and any follow-ons?

Assignee: wthayer → jsha

Flags: needinfo?(jsha)

Whiteboard: [ca-compliance]

Assignee

Comment 2

•

5 years ago

Yes, this was in the open-source Boulder. These were the specific error cases that could result in a precertificate not having OCSP available:

https://github.com/letsencrypt/boulder/blob/fe23dabd69ce5f07abb901faa63b5a0959978753/ra/ra.go#L1275-L1282

However, it's more informative to think of this as a design flaw than to look at specific error lines. Boulder treats OCSP responses as a property of the final certificate. Instead, we will probably need to move things around to treat OCSP responses as a property of a precertificate.

Flags: needinfo?(jsha)

Comment 3

•

5 years ago

Right. I wanted to understand the overall design, and thought it might be more useful to look at the source to see how the design worked than to ask an increasing number of questions that have you describe the design in pseudo-code.

I think your summary of the design flaw sounds spot on. Perhaps naively, I would have assumed that the process was:

1. Assign a serial number and a certificate "to be issued"
2. Confirm it's recorded in durable storage (i.e. you haven't forgotten the cert)
3. Actually sign it (as a pre-certificate). This is done after database recording, so that you know what you "were going to sign" rather than only what you actually signed, in case there are any issues in recording that you signed something
4. As a pre-certificate, produce and distribute status information for this certificate
5. Once the status information is distributed, log the certificate to one or more logs
6. Once sufficient logs have returned, record and issue the final certificate
7. Release the final certificate to the Subscriber
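
As a rough illustration of the ordering above, here is a minimal sketch. It is hypothetical and is not Boulder's code: `IssuanceLog`, `issue`, and the step names are invented for illustration, and the real services are replaced by a list that records which step ran when.

```python
# Hypothetical sketch of the proposed ordering; not Boulder's actual code.
from dataclasses import dataclass, field


@dataclass
class IssuanceLog:
    """Records the order in which steps ran, standing in for real services."""
    events: list = field(default_factory=list)

    def record(self, step: str, serial: str) -> None:
        self.events.append((step, serial))


def issue(serial: str, log: IssuanceLog, scts_needed: int = 2, scts_returned: int = 2) -> bool:
    """Run the steps in the proposed order; return True if the final certificate is released."""
    log.record("store_intent", serial)         # serial and to-be-issued cert are durable first
    log.record("sign_precertificate", serial)  # sign only after the intent is recorded
    log.record("publish_ocsp", serial)         # status exists before anything is logged
    log.record("submit_to_ct", serial)         # log the precertificate
    if scts_returned < scts_needed:
        # Too few SCTs: no final certificate, but OCSP for the precertificate already exists.
        return False
    log.record("sign_final_certificate", serial)
    log.record("release_to_subscriber", serial)
    return True
```

The point of the sketch is only that a failure after CT submission still leaves a published status behind, because `publish_ocsp` runs before `submit_to_ct`.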

This flow is what I assumed the 'legacy' flow for CAs in a pre-CT world did, namely:

1. Assign a serial number and a certificate "to be issued"
2. Confirm it's recorded in durable storage (i.e. you haven't forgotten the cert)
3. Actually sign it (as a certificate)
4. Produce and distribute status information for this certificate
5. Release the certificate to the Subscriber

Part of this sequence is to try to avoid issues like Bug 1524815, Bug 1462844, or Bug 1526154 - where things fall between the cracks and lead to non-compliance.

Of course, if I've misunderstood considerations, that'd be useful to know. As I said, it was a somewhat naive assumption I'm making as to how things work.

Comment 4

•

5 years ago

Hi Ryan,

I have a variation:

1. Assign a serial number and a certificate "to be issued"
2. Sign it as a pre-certificate
3. As a pre-certificate, do not distribute status information for this certificate, leave it as "unknown" as the final state of the transaction is not yet known (BR 4.9.10)
4. Log to CT logs, get SCTs and audit log everything
5. Issue the final certificate
6. Choice, if:
** the final certificate should not be distributed (not enough SCTs, failed linting, or other cause), set status to a revoked state and do not release to subscriber
** the final certificate should be distributed, set status to good state and release to subscriber
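
The final choice in this variation can be sketched as a single decision function. This is a hypothetical illustration only: `Status` and `final_status` are invented names, and a real CA would have more inputs than an SCT count and a lint result.

```python
# Hypothetical sketch of the final choice in the variation above.
from enum import Enum


class Status(Enum):
    UNKNOWN = "unknown"
    GOOD = "good"
    REVOKED = "revoked"


def final_status(scts_received: int, scts_required: int, lint_ok: bool) -> tuple[Status, bool]:
    """Return (status to publish, whether to release to the subscriber).

    Until this decision is made the status stays UNKNOWN, so a status is
    published exactly once per serial.
    """
    if scts_received >= scts_required and lint_ok:
        return Status.GOOD, True
    return Status.REVOKED, False
```

For example, `final_status(1, 2, True)` yields a revoked status and no release, while `final_status(2, 2, True)` yields a good status and a release.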

A benefit of not distributing status for the pre-certificate is that it saves on latency. It also keeps the issuer/serial unusable (anything-but-good) for practical matters as the cert can/should not yet be used (it has not been issued in real bytes yet). The time a CA takes to publish to OCSP may range from milliseconds to higher values, depending on volume, load and technology used.
Latency is a concern, and flipping statuses when there is no need to takes unnecessary time, and also introduces additional possible error paths.

For the non-TLS use-case there are other processes as well (issuing in on-hold state), but here we limit ourselves to the TLS use case of course.
Even though I have no reference, I can envision TLS use cases where you want to make sure CDN caches are populated with OCSP responses before releasing the certificate to the subscriber, making the "distribute status" step a process of its own.
(validation/linting can be done at any of the stages).

Comment 5

•

5 years ago

The reason for the suggested ordering is exactly as you mentioned: to ensure clients that use OCSP stapling can obtain a fresh, usable status once they receive the certificate. The presumption here is that there is both latency, and more importantly for CAs (and related to this bug), complexity in getting the HTTP caching layer correctly configured.

Adopting a flow as you suggest is understandably easier for the issuance software, but it moves the complexity to the CA interacting with the CDN and the Subscriber tasked with caching. This is because if the unknown status is not evicted, or has an unduly long TTL, then Subscribers will need to continually refresh and not productionize the certificate until the newly signed response is available.

Admittedly, a common path I’ve seen deployed is to simply set a default handler at the CDN to map any “unknown to the CDN” serials into a default state (Unknown or simply 404 are both valid and popular). In that scenario, there is only one publication event, and thus fewer state transitions. However, because CAs must monitor such requests, I suspect it’s easier to publish precisely to reduce the false negative.

Note that, for clarity's sake, “failed linting” is hopefully an edge case, as any linting should have been done and successful well before the serial and TBSCertificate were assigned. It should only refer to linting the SCT extension OID, as linting that late is still misissuance.

Comment 6

•

5 years ago

Let me number the points in order to make my point clearer.

1. Assign a serial number and a certificate "to be issued"
2. Sign it as a pre-certificate
3. As a pre-certificate, do not distribute status information for this certificate, leave it as "unknown" as the final state of the transaction is not yet known (BR 4.9.10)
4. Log to CT logs, get SCTs and audit log everything
5. Issue the final certificate
6. Choice, if:
** the final certificate should not be distributed (not enough SCTs, failed linting, or other cause), set status to a revoked state and do not release to subscriber
** the final certificate should be distributed, set status to good state and release to subscriber

The time period between 1 and 6 is measured in milliseconds, or seconds if CT logs are slow. This is far lower than the merge delay of CT logs and will likely not help clients in obtaining fresh statuses. To be fair, this bug (Let's Encrypt) did not affect any clients, right, as no certificate was distributed to clients? Are we therefore not looking at different issues, where CT may not necessarily be the right tool?

Yes, ensuring the intent to issue is fulfilled is a tool for CT
No, ensuring that clients can use OCSP efficiently is not a tool for CT

The issue with OCSP in this bug looks like a symptom of the underlying issue (not issuing a certificate even though a pre-certificate was issued), and not the issue itself.
The process for ensuring fresh OCSP responses directly when they need it is definitely important. I see the need to avoid things falling through the cracks for sure (the bugs you linked), but we don't want to open up new cracks.

Comment 7

•

5 years ago

•

Edited

Recall that the original proposed mitigation is:

For each precertificate issued according to our audit logs, verify that we are serving a corresponding OCSP response (if the precertificate is currently valid).

This is very reasonable, and it seems we disagree on whether the Precertificate should return something like 404, or "unknown" - both of which may be cached by clients and intermediaries without the ability to flush - or whether it's desirable to have the 'correct' response being served.

Attempting the scenario as you've described requires that the CA, whether Let's Encrypt or other, ensure that they can reliably evict such responses from caches. One way is to guarantee that the response cannot be cached; but that, of course, increases load on the CA. Thus, any increase in the cache time increases the risk that, by the time #6 happens (using your numbering), the certificate cannot be safely used / does not have a correct response.
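
To make the caching risk concrete, here is a minimal arithmetic sketch. It assumes a single cache that fetched the earlier ("unknown" or 404) response at `fetched_at` and honors a `max_age` TTL with no way to flush it; the function and parameter names are hypothetical.

```python
def stale_seconds(fetched_at: float, max_age: float, published_at: float) -> float:
    """Seconds after publication during which a cache may still serve the old response.

    All arguments are in seconds on the same clock. A result of 0.0 means the
    cached entry has already expired by the time the new response is published.
    """
    return max(0.0, fetched_at + max_age - published_at)
```

With a one-hour TTL and a response fetched two seconds before publication, the old response can linger for almost the full hour; with a one-second TTL it is already gone.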

CT is entirely orthogonal to this. No one is proposing that CT be used to ensure clients can use OCSP efficiently. Rather, it's about understanding what commitments the CA is making when they log a precertificate. There are, for example, some CAs that delay generating an OCSP response until it's actually queried for (even for TLS cases), as a means of trying to reduce load on their signing servers; however, with CT, any client can monitor CT logs and query such certificates, thus defeating that load reduction.

I do not view not issuing a certificate, even though a pre-certificate was issued, to inherently be a bug. However, once a precertificate has been issued, all the necessary services for that response "should" be provisioned. This complexity is precisely why relying on Precertificates in OCSP is so fraught with peril; specifically, it requires the CA have a reliable mechanism to evict any caches that exist between them, the Subscriber, and the Client, as they transition from the "pre-precertificate" provisioned response to the "post-precertificate" provisioned response.

Again, #3 in Comment #6, places significant burden on CAs with respect to monitoring if #6 is failing, which is why the flow in Comment #3 was my naive understanding. I would normally think CAs would want to reduce risk.

Comment 8

•

5 years ago

(In reply to Ryan Sleevi from comment #7)

> Recall that the original proposed mitigation is:
>
> For each precertificate issued according to our audit logs, verify that we are serving a corresponding OCSP response (if the precertificate is currently valid).
>
> This is very reasonable, and it seems we disagree on whether the Precertificate should return something like 404, or "unknown" - both of which may be cached by clients and intermediaries without the ability to flush - or whether it's desirable to have the 'correct' response being served.

I agree that this is reasonable. My assumption here was that monitoring the audit logs for pre-certificates and verifying against OCSP is something typically done after step 6 completes (1-6 completes in ~1 second). I know that CAs do monitor audit logs to find and mitigate any issues in the process, including with pre-certificates. The assumption is absolutely that the verification using audit logs results in a proper ok/revoked response of that issuerDN/serialNumber. Correct me if I'm wrong, but the verification is not done within steps 1-6?

> Attempting the scenario as you've described requires that the CA, whether Let's Encrypt or other, ensure that they can reliably evict such responses from caches. One way is to guarantee that the response cannot be cached; but that, of course, increases load on the CA. Thus, any increase in the cache time increases the risk that, by the time #6 happens (using your numbering), the certificate cannot be safely used / does not have a correct response.

We're still talking about the time gap of a few seconds right?

> CT is entirely orthogonal to this. No one is proposing that CT be used to ensure clients can use OCSP efficiently. Rather, it's about understanding what commitments the CA is making when they log a precertificate. There are, for example, some CAs that delay generating an OCSP response until it's actually queried for (even for TLS cases), as a means of trying to reduce load on their signing servers; however, with CT, any client can monitor CT logs and query such certificates, thus defeating that load reduction.
>
> I do not view not issuing a certificate, even though a pre-certificate was issued, to inherently be a bug. However, once a precertificate has been issued, all the necessary services for that response "should" be provisioned. This complexity is precisely why relying on Precertificates in OCSP is so fraught with peril; specifically, it requires the CA have a reliable mechanism to evict any caches that exist between them, the Subscriber, and the Client, as they transition from the "pre-precertificate" provisioned response to the "post-precertificate" provisioned response.

Thanks for this clarification. I had, perhaps erroneously, thought that the "intent to issue" mandated by CT really meant "issue", and I know others have made the same interpretation. Meaning that if not enough SCTs are received, the certificate MUST be issued anyhow, and preferably revoked as it cannot be used. If this is not the case, that can affect implementation quite a bit.
The more details such as this that are included in the process description the better; it gives CAs a better understanding of what is ok and what is not. Not actually issuing can be considered a shortcut, allowed or not, but it may actually save from some obscure corner-case errors.

> Again, #3 in Comment #6, places significant burden on CAs with respect to monitoring if #6 is failing, which is why the flow in Comment #3 was my naive understanding. I would normally think CAs would want to reduce risk.

I'm not working for a CA, but I'm sure they want to reduce risk.
I am aware of real issues of pre-certificates existing without certificates, but I have not heard about problems with OCSP cache poisoning due to pre-certificate logging and OCSP requests occurring before the certificate was issued. That doesn't mean it doesn't happen, of course; I would like to hear about such cases in order to learn.

Comment 9

•

5 years ago

(In reply to Ryan Sleevi from comment #7)

> I do not view not issuing a certificate, even though a pre-certificate was issued, to inherently be a bug. However, once a precertificate has been issued, all the necessary services for that response "should" be provisioned. This complexity is precisely why relying on Precertificates in OCSP is so fraught with peril; specifically, it requires the CA have a reliable mechanism to evict any caches that exist between them, the Subscriber, and the Client, as they transition from the "pre-precertificate" provisioned response to the "post-precertificate" provisioned response.

I didn't see any process suggestion involving a "pre-precertificate" provisioned response. Was that theoretical, or did I misunderstand something?
Your process suggestion has a "post-precertificate" provisioned response, if I understand correctly, and mine I would call a "post-certificate" provisioned response.

Assignee

Comment 10

•

5 years ago

Just chiming in to say we've deployed the code change as of 2019-09-12 17:44 UTC. I'll be following up next week with a full incident report.

Updated

•

5 years ago

Summary: Let's Encrypt OCSP Responder Returned "Unauthorized" for Some Precertificates → Let's Encrypt: OCSP Responder Returned "Unauthorized" for Some Precertificates

Assignee

Comment 11

•

5 years ago

We've reverted the flag controlling the code change mentioned above. We have a component in our signing service that handles timeouts from our database, locally queuing certificates that received a timeout and retrying until they are successfully written. Currently that component only handles certificates. When we made our fix for this issue, we decided to add precertificate handling for this queue as a follow-on task, rather than block the fix on it. In the meantime we would monitor our audit logs and ensure that any precertificates that failed initial insertion to the database would be manually inserted.
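
For readers unfamiliar with the pattern, the queue-and-retry behavior described above can be sketched roughly as follows. This is a hypothetical illustration and not the component in our signing service: `RetryQueue` and the `write` callback are invented names, and a database timeout is modelled as Python's built-in `TimeoutError`.

```python
# Hypothetical sketch of a local queue that retries timed-out database writes.
from collections import deque
from typing import Callable, Deque


class RetryQueue:
    def __init__(self, write: Callable[[str], None]) -> None:
        self._write = write                  # function that stores one item, may time out
        self._pending: Deque[str] = deque()  # items whose write timed out

    def submit(self, item: str) -> None:
        """Try to write immediately; queue the item locally on timeout."""
        try:
            self._write(item)
        except TimeoutError:
            self._pending.append(item)

    def retry_pending(self) -> int:
        """Retry each queued item once; return how many are still pending."""
        for _ in range(len(self._pending)):
            item = self._pending.popleft()
            try:
                self._write(item)
            except TimeoutError:
                self._pending.append(item)
        return len(self._pending)
```

A caller would invoke `retry_pending()` periodically until it returns 0. The sketch treats certificates and precertificates alike, which is the follow-on behavior mentioned above rather than what is currently deployed.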

However, what we found in practice is that the original issue (precertificates lacking OCSP due to an error before final issuance) was significantly less common than the timed-out inserts, so we've reverted the flag. This both reduces the amount of manual work required and reduces the number of precertificates affected. We were seeing about 80 precertificates per day affected by timeouts, versus 0 affected by the original issue on most days.

Given that we've reverted the flag, we plan to continue our original interim remediation plan: Regularly check our audit logs against our available OCSP data, and fix any instances where a precertificate is missing an OCSP response.
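
The check itself amounts to a set difference, which can be sketched as below. The data shapes are assumptions made for illustration: `precerts` maps each serial found in the audit logs to its notAfter time, and `ocsp_serials` is the set of serials for which an OCSP response is stored. The name `missing_ocsp` is hypothetical.

```python
# Hypothetical sketch of reconciling audit logs against available OCSP data.
from datetime import datetime


def missing_ocsp(precerts: dict[str, datetime], ocsp_serials: set[str], now: datetime) -> list[str]:
    """Return serials of currently valid precertificates that have no OCSP response."""
    return sorted(
        serial
        for serial, not_after in precerts.items()
        if not_after > now and serial not in ocsp_serials
    )
```

Expired precertificates are skipped, matching the "if the precertificate is currently valid" condition in the proposed mitigation.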

Still working on the incident report for later this week.

BugBot [:suhaib / :marco/ :calixte]

Assignee

Comment 12

•

5 years ago

Here’s our incident report per https://wiki.mozilla.org/CA/Responding_To_An_Incident#Incident_Report.

How your CA first became aware of the problem

(previously reported) On 2019-08-28 we (Let's Encrypt) read Apple’s bug report at https://bugzilla.mozilla.org/show_bug.cgi?id=1577014.

A timeline of the actions your CA took in response.

2019-08-28 21:44 (UTC) Incident began
2019-08-29 02:15 Generated and stored OCSP for first batch of precertificates (based on errors in logs)
2019-08-30 23:10 Generated and stored OCSP for second batch of precertificates (based on analysis of all issuance logs)
2019-08-31 01:59 Generated and stored OCSP for third batch of precertificates (based on analysis of all issuance logs since the start of the incident)
2019-09-05 and 2019-09-10: Generated and stored OCSP for additional batches of precertificates identified by our monitoring processes while we worked on deploying a code change.
2019-09-12 17:44: Deployed code change to Boulder.
2019-09-17 23:56: Reverted code change in Boulder (see comment above for details).

Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem.

Our fix is not currently live. On most days we are issuing zero certificates with this problem. On peak traffic days we may issue some certificates with this problem, but we have processes in place to ensure we manually generate OCSP for them until the fix is live again.

A summary of the problematic certificates.

217 currently-valid precertificates were missing OCSP responses until we remediated them. The problem began 2018-03-29 when we first started issuing precertificates.

The complete certificate data for the problematic certificates.

Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

This was a conceptual problem in Boulder, where we considered OCSP responses to be a property of certificates rather than of precertificates, and only generated OCSP when a final certificate was issued. We did not detect the problem until now because, while we have tests for OCSP generation and tests for SCT submission failures, we don’t have tests that combine the two. Part of our pending fix is to explicitly test that combination, and expect valid OCSP responses when SCT submission fails (which is a subset of the possible “failed to issue final certificate” cases).

List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.

We intend to redeploy the fix to Boulder on 2019-10-03.

Comment 13

•

5 years ago

The priority flag is not set for this bug.
:kwilson, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(kwilson)

Julien Cristau [:jcristau]

Updated

•

5 years ago

Type: defect → task

Kathleen Wilson

Updated

•

5 years ago

Flags: needinfo?(kwilson)

Wayne Thayer

Updated

•

5 years ago

Whiteboard: [ca-compliance] → [ca-compliance] - Next Update - 04-October 2019