Closed Bug 1619047 Opened 8 months ago Closed 1 month ago

Let's Encrypt: CAA Rechecking bug

Categories

(NSS :: CA Certificate Compliance, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jsha, Assigned: jsha)

References

Details

(Whiteboard: [ca-compliance] Next Update - 14-August 2020)

Attachments

(1 file)

User Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.116 Safari/537.36

Steps to reproduce:

On 2020-02-29 UTC, Let’s Encrypt found a bug in our CAA code. Our CA software, Boulder, checks for CAA records at the same time it validates a subscriber’s control of a domain name. Most subscribers issue a certificate immediately after domain control validation, but we consider a validation good for 30 days. That means in some cases we need to check CAA records a second time, just before issuance. Specifically, we have to check CAA within 8 hours prior to issuance (per BRs §3.2.2.8), so any domain name that was validated more than 8 hours ago requires rechecking.

The bug: when a certificate request contained N domain names that needed CAA rechecking, Boulder would pick one domain name and check it N times. What this means in practice is that if a subscriber validated a domain name at time X, and the CAA records for that domain at time X allowed Let’s Encrypt issuance, that subscriber would be able to issue a certificate containing that domain name until X+30 days, even if someone later installed CAA records on that domain name that prohibit issuance by Let’s Encrypt.

We confirmed the bug at 2020-02-29 03:08 UTC, and halted issuance at 03:10. We deployed a fix at 05:22 UTC and then re-enabled issuance.

Our preliminary investigation suggests the bug was introduced on 2019-07-25. We will conduct a more detailed investigation and provide a postmortem when it is complete.

Assignee: wthayer → jsha
Status: UNCONFIRMED → ASSIGNED
Type: defect → task
Ever confirmed: true
Whiteboard: [ca-compliance]
Blocks: 1619179

Here's our incident report:

  1. How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.

On 2020-02-18, a Let’s Encrypt community member reported a bug in how our API reported errors when rechecking CAA at issuance time: we included multiple identical error messages for a single hostname. https://community.letsencrypt.org/t/rechecking-caa-fails-with-99-identical-subproblems/113517. Another community member suggested there might be a more serious bug causing some CAA records not to be checked. We checked the logs, confirmed that CAA was checked, and concluded the bug was not in the CAA checking itself but in the error message generation. We filed an issue at regular priority and incorporated it into our next sprint. https://github.com/letsencrypt/boulder/issues/4681.

On 2020-02-29 UTC, while working on a fix for the bug, a Let’s Encrypt engineer discovered that it was in fact a bug with CAA checking, not with error generation. Once we confirmed that bug (03:08), we immediately disabled issuance (03:10).

We’d like to thank our community members @lancedolan and @_az for originally bringing the issue to our attention.

  2. A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.

2019-04-24 16:40 UTC: Bug committed to Boulder, behind a feature flag.
https://github.com/letsencrypt/boulder/pull/4134/files#diff-2285b0268539881fde96d9928ecef358R2250
2019-08-23 01:19 UTC: NewAuthorizationSchema feature flag first enabled in prod. This triggers the bug for the first time.
2020-02-18: Message about error handling posted to community support forum.
2020-02-29 03:08 UTC: Let’s Encrypt engineer confirms bug.
2020-02-29 03:10 UTC: Let’s Encrypt SRE team disables issuance.
2020-02-29 05:22 UTC: Fixed version of Boulder is deployed (containing https://github.com/letsencrypt/boulder/pull/4690). Issuance is re-enabled.

  3. Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem. A statement that you have will be considered a pledge to the community; a statement that you have not requires an explanation.

Yes, we have stopped issuing certificates with this problem.

  4. A summary of the problematic certificates. For each problem: number of certs, and the date the first and last certs with that problem were issued.

Of currently-valid certificates, 3,048,289 certificates are affected (out of approximately 116 million overall active certificates). Of the affected certificates, about 1 million are duplicates of other affected certificates, in the sense of covering the same set of domain names.

Because of the way this bug operated, the most commonly affected certificates were those that are reissued very frequently, which is why so many affected certificates are duplicates.

Our first priority has been identifying the affected currently-valid certificates, for revocation purposes. We will next analyze logs back to the date the problem began.

This set of certificates was issued between 2019-12-04 and 2020-02-29 03:08 UTC.

Of the hostnames in those affected certificates, 443 currently have CAA records that reject issuance by Let’s Encrypt. By the nature of the bug, we know that at some point in the 30 days prior to issuance, each hostname either had no CAA records or had CAA records that allowed issuance by Let’s Encrypt. However, we don’t know what the CAA status of those 443 hostnames was at the time the certificates containing them were issued.

  5. The complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem.

The list of certificate serial numbers and affected hostnames is in caa-rechecking-incident-affected-serials.txt.gz, available for download at https://letsencrypt.org/caaproblem, with the sha256 hash b9a5dd16a1b048c328df130e75a6d08f059e96d3fc42f7e925823d254a63a715. All certificates are logged in CT. We can provide certificate bodies, or links to crt.sh, if needed.

A list of hostnames for which CAA presently rejects issuance by Let’s Encrypt is attached to this issue.

  6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

The proximate cause of the bug was a common mistake in Go: taking a reference to a loop iterator variable. https://github.com/golang/go/wiki/CommonMistakes#using-reference-to-loop-iterator-variable. The relevant code was:

// authz2ModelMapToPB converts a mapping of domain name to authz2Models into a
// protobuf authorizations map
func authz2ModelMapToPB(m map[string]authz2Model) (*sapb.Authorizations, error) {
	resp := &sapb.Authorizations{}
	for k, v := range m {
		// Make a copy of k because it will be reassigned with each loop.
		kCopy := k
		authzPB, err := modelToAuthzPB(&v)
		if err != nil {
			return nil, err
		}
		resp.Authz = append(resp.Authz, &sapb.Authorizations_MapElement{Domain: &kCopy, Authz: authzPB})
	}
	return resp, nil
}

As the comment about kCopy shows, we specifically handled this issue for one loop iterator variable (k), but failed to handle it for the second (v). In part, that was because the problematic reference-taking happens inside modelToAuthzPB. Since authzPB is freshly scoped on each iteration, it’s easy to assume it’s unaffected by the loop-variable problem. However, modelToAuthzPB actually takes references to two important fields of its input: IdentifierValue and RegistrationID.

func modelToAuthzPB(am *authzModel) (*corepb.Authorization, error) {
	expires := am.Expires.UTC().UnixNano()
	id := fmt.Sprintf("%d", am.ID)
	status := uintToStatus[am.Status]
	pb := &corepb.Authorization{
		Id:             &id,
		Status:         &status,
		Identifier:     &am.IdentifierValue,
		RegistrationID: &am.RegistrationID,
		Expires:        &expires,
	}
	// (remainder of function elided)

The result is that authz2ModelMapToPB creates a map whose keys consist of the correct set of FQDNs, but whose contents all have the same IdentifierValue. That gets returned, via RPC, to boulder-ra: the component that handles authorization checking prior to issuance. Boulder-ra checks that all needed FQDNs are present using the map keys, which works correctly. It also checks the Expires field, which is not affected by the loop iterator variable problem.

However, when boulder-ra determines that a given FQDN needs CAA rechecking, it uses the Identifier field from the authorization object. Because of this bug, that field was the same for every value in a given map, so if multiple authorizations needed CAA rechecking, boulder-ra would recheck one FQDN multiple times but not the others.
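The failure mode can be reproduced in miniature. This is a simplified sketch, not Boulder's actual code: the types are reduced to the fields that matter, and the explicitly reused variable v stands in for a pre-Go-1.22 range variable, which was a single variable reassigned on each iteration.

```go
package main

import "fmt"

type authzModel struct {
	IdentifierValue string
}

type authorization struct {
	Identifier *string // proto2-style pointer field
}

// modelToPB mirrors the buggy conversion: it takes a reference to a
// field of its input rather than copying the value.
func modelToPB(am *authzModel) *authorization {
	return &authorization{Identifier: &am.IdentifierValue}
}

// convertBuggy reproduces the bug: v is one variable reused across
// iterations, so every returned authorization aliases the same backing
// storage, which ends up holding the last model's value.
func convertBuggy(models []authzModel) []*authorization {
	out := make([]*authorization, 0, len(models))
	var v authzModel // reused each iteration, like a pre-Go-1.22 range variable
	for i := range models {
		v = models[i]
		out = append(out, modelToPB(&v))
	}
	return out
}

func main() {
	models := []authzModel{{"a.example"}, {"b.example"}, {"c.example"}}
	for _, pb := range convertBuggy(models) {
		fmt.Println(*pb.Identifier) // every element prints "c.example"
	}
}
```

This is why boulder-ra saw the correct set of map keys but the same Identifier in every value: the keys were copied (kCopy), while the Identifier pointers all aliased the one reused loop variable.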

Stepping back a little, another cause was the presence of a conversion layer to take us from native Go types to RPC structures and back. We’ve found in the past that such conversion layers can be a source of hard-to-find bugs, and have tried to simplify or eliminate them when possible.

Another, secondary cause: The proto2 library we use for RPC relies heavily on pointers as fields of long-lived structures. This causes us to take references in many cases where it would be simpler, and more correct, to pass a field by value. The newer proto3 library improves on this by using values more frequently. We’ve had a longstanding plan to upgrade to it, and will increase priority on that.
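The proto2-vs-proto3 difference can be illustrated with simplified struct shapes (these are hand-written illustrations, not actual generated code):

```go
package main

import "fmt"

// AuthorizationV2 mimics a proto2-style generated struct: scalar
// fields are pointers, which invites taking references into other
// structures.
type AuthorizationV2 struct {
	Identifier *string
}

// AuthorizationV3 mimics a proto3-style generated struct: scalar
// fields are plain values, so populating one forces a copy that cannot
// alias the source.
type AuthorizationV3 struct {
	Identifier string
}

func main() {
	src := "a.example"
	v2 := AuthorizationV2{Identifier: &src} // aliases src
	v3 := AuthorizationV3{Identifier: src}  // copies src

	src = "b.example"           // later mutation of the source...
	fmt.Println(*v2.Identifier) // "b.example": the alias observed the change
	fmt.Println(v3.Identifier)  // "a.example": the copy did not
}
```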

Why wasn’t this caught by our tests? The bug was committed along with a unittest that covers the affected code: TestGetValidOrderAuthorizations2. This unittest exercises the broken code path, creating and retrieving an order with multiple authorizations. However, it does not adequately verify the contents of the returned authorizations, only that there were the correct number. We should have written a more thorough unittest here. Also, we should have written a lower-level unittest that tested modelToAuthzPB directly and verified its output. We will add both of these within the next three weeks.
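The kind of content check that was missing can be sketched as follows. The types and function names here are simplified illustrations, not Boulder's: the point is that the test verifies every returned identifier against its input, not just the element count.

```go
package main

import "fmt"

type authzModel struct{ IdentifierValue string }

type authorization struct{ Identifier *string }

// modelToPB is a correct conversion: it copies the field into a fresh
// variable before taking its address, so the result cannot alias the input.
func modelToPB(am authzModel) *authorization {
	id := am.IdentifierValue
	return &authorization{Identifier: &id}
}

func convertAll(models []authzModel) []*authorization {
	out := make([]*authorization, 0, len(models))
	for _, m := range models {
		out = append(out, modelToPB(m))
	}
	return out
}

// verifyContents is the assertion the original unittest lacked: it
// checks every returned identifier, not merely the count.
func verifyContents(models []authzModel, out []*authorization) error {
	if len(out) != len(models) {
		return fmt.Errorf("got %d authorizations, want %d", len(out), len(models))
	}
	for i, m := range models {
		if *out[i].Identifier != m.IdentifierValue {
			return fmt.Errorf("authz %d: got identifier %q, want %q",
				i, *out[i].Identifier, m.IdentifierValue)
		}
	}
	return nil
}

func main() {
	models := []authzModel{{"a.example"}, {"b.example"}, {"c.example"}}
	if err := verifyContents(models, convertAll(models)); err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	fmt.Println("ok: all identifiers verified")
}
```

Run against a buggy conversion that aliases a reused loop variable, this check fails on the first mismatched identifier, which a count-only assertion never would.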

We also have a longstanding integration test that checks the CAA rechecking behavior. However, it failed to find this bug because it requests a certificate with only one FQDN. We have already updated this integration test as part of our fix and confirmed that the updated version catches the bug on earlier versions of Boulder.

  7. List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.

We’ve already deployed a fix for the immediate, known issue. Here are the steps we plan to take to avoid a recurrence of this bug, or similar bugs, in the future:

  1. Improve TestGetValidOrderAuthorizations2 unittest (3 weeks).
  2. Implement modelToAuthzPB unittest (3 weeks).
  3. Productionize automated logs verification for CAA behavior (8 weeks).
  4. Review code for other examples of loop iterator bugs (4 weeks).
  5. Evaluate adding stronger static analysis tools to our automated linting suite and add one or more if we find a good fit (4 weeks).
  6. Upgrade to proto3 (6 months).

Additional information:

We have begun notifying affected subscribers of revocation on the BR-mandated timeline. We have a FAQ and the email contents available at https://community.letsencrypt.org/t/revoking-certain-certificates-on-march-4/114864.

List of hostnames for which the current CAA record forbids issuance by Let's Encrypt.

Jacob: It's been 24 days since the last update on this bug; do you have updates on the TestGetValidOrderAuthorizations2 and modelToAuthzPB unit tests?

Trying to extrapolate the next dates based on Comment #1, if I understood correctly:

  • 2020-03-24 - Improve TestGetValidOrderAuthorizations2 unittests
  • 2020-03-24 - Implement modelToAuthzPB unittest
  • 2020-03-31 - Review code for other examples of loop iterator bugs
  • 2020-03-31 - Evaluate adding stronger static analysis tools to our automated linting suite and add one or more if we find a good fit
  • 2020-04-28 - Productionize automated logs verification for CAA behavior
  • 2020-09-03 - Upgrade to Proto3

I'd also note that since Comment #1, Let's Encrypt filed Bug 1619179 to note that they decided not to revoke after all.

Flags: needinfo?(jsha)

2020-03-24 - Improve TestGetValidOrderAuthorizations2 unittests
2020-03-24 - Implement modelToAuthzPB unittest

These are completed.

2020-03-31 - Review code for other examples of loop iterator bugs
2020-03-31 - Evaluate adding stronger static analysis tools to our automated linting suite and add one or more if we find a good fit

These are in progress.

2020-04-28 - Productionize automated logs verification for CAA behavior
2020-09-03 - Upgrade to Proto3

Work on these will start when the above items are done.

Flags: needinfo?(jsha)
Whiteboard: [ca-compliance] → [ca-compliance] Next Update - 31-March 2020

2020-03-31 - Review code for other examples of loop iterator bugs
2020-03-31 - Evaluate adding stronger static analysis tools to our automated linting suite and add one or more if we find a good fit

These items are completed. We did not find other examples of loop iterator bugs during our review.

Our evaluation of static analysis tools is done. We're planning to add staticcheck to our CI: https://github.com/letsencrypt/boulder/pull/4731. It would not have caught this particular bug, but we've found its output useful and it may catch other bugs in the future.

Whiteboard: [ca-compliance] Next Update - 31-March 2020 → [ca-compliance] Next Update - 28-April 2020

2020-04-28 - Productionize automated logs verification for CAA behavior

I'm updating our estimate for this item to 2020-05-22. While working on deployment, we realized we need to replicate some logs between datacenters in order to perform this verification without false positives.

2020-04-28 - Productionize automated logs verification for CAA behavior

I'm updating our estimate for this item once more, to 2020-05-23.

An update on the “Productionize automated logs verification for CAA behavior” remediation item. We began deployment and discovered that skew in our log timestamps sometimes meant the CAA check was logged with a timestamp a few milliseconds in the future relative to the issuance event. We’re making a change to our checking tool to accommodate this and will redeploy. Pushing out our estimate by small increments and then having to update it doesn’t serve anyone, so I’m going to be generous with the next estimate and move this item out to 2020-06-30.

For the “Upgrade to Proto3” remediation item: We’ve landed our first code change migrating one of our RPC calls, and plan to follow on with more changes soon.

Another update on "Productionize automated logs verification for CAA behavior": We've addressed the timestamp skew issue and were close to being able to turn on alerting for this check, but realized our strategy for replicating logs cross-datacenter for the check was causing problems when one datacenter was down for maintenance. We're planning a new strategy for logs replication to unblock this mitigation. I'm moving my estimate out to 2020-09-03, in line with our estimate for the "Upgrade to Proto3" mitigation.

Jacob: Could you provide more details on Comment #9? What was the old strategy, why did it cause problems, and what's the new strategy? Why the (roughly) 2 month delay?

While Let's Encrypt has historically been quite good about incident reports, and while I appreciate the update sharing the struggles, I think one of the goals here is to help all CAs learn from and implement best practices. The more details shared about what doesn't work are just as valuable as sharing what does work.

Flags: needinfo?(jsha)
QA Contact: wthayer → bwilson
Whiteboard: [ca-compliance] Next Update - 28-April 2020 → [ca-compliance] Next Update - 14-August 2020

Could you provide more details on Comment #9? What was the old strategy, why did it cause problems, and what's the new strategy? Why the (roughly) 2 month delay?

Sure! In our logs system, we forward logs to a central host in each datacenter, where we process and archive the logs. Normally none of this traffic flows between our datacenters. The log validation component we wrote needs a comprehensive view of logs from both datacenters. Our initial deployment used the same, live, log-forwarding infrastructure to copy the relevant logs cross-DC for analysis. This worked initially, but failed when we had to bring down one datacenter for maintenance. When that happened, our log forwarding infrastructure was unable to send logs to the downed datacenter. A message queue filled up, blocking all processing of log messages, and we automatically halted issuance. We were able to quickly restore service by turning off this log-forwarding configuration. Our conclusion was that live log forwarding cross-datacenter introduced too much outage risk. This is particularly the case because the new check we are adding runs on a day's worth of logs at a time.

We’re now working on a batched process that sends chunks of logs at a time and does not create backpressure on our live issuance systems. As for the roughly two-month delay: in part, our team has been delayed by time-sensitive work, and in part, I am trying to learn from our work to date on this item. So far, we’ve run into two unexpected delays, so I want to be conservative in my updated estimate. We are of course working to finish before the estimate if possible.

By the way, I am pleased to report an improvement on this remediation item:

2020-03-31 - Evaluate adding stronger static analysis tools to our automated linting suite and add one or more if we find a good fit

We originally implemented staticcheck, which catches some categories of “reference to loop iterator variable” mistakes (https://github.com/golang/go/wiki/CommonMistakes#using-reference-to-loop-iterator-variable), but would not have caught the specific variant of this mistake that caused this incident. We’ve now added another tool, gosec, which has a more thorough lint for this type of mistake and would have caught the specific variant that caused this incident: https://github.com/letsencrypt/boulder/pull/4991.

Flags: needinfo?(jsha)

Productionize automated logs verification for CAA behavior
Upgrade to Proto3

These two items are done. I believe this completes all extant remediation items for this incident.

Are you requesting that this bug be closed?

Flags: needinfo?(jsha)

Yep, I think it can be closed. If anyone has followup questions we can reopen.

Flags: needinfo?(jsha)

I will close this on 2020-09-09 unless there are objections.

Flags: needinfo?(bwilson)
Status: ASSIGNED → RESOLVED
Closed: 1 month ago
Flags: needinfo?(bwilson)
Resolution: --- → FIXED