Open Bug 1619047 Opened 1 month ago Updated 3 days ago

Let's Encrypt: CAA Rechecking bug

Categories

(NSS :: CA Certificate Compliance, task)


Tracking

(Not tracked)

ASSIGNED

People

(Reporter: jsha, Assigned: jsha)

References

(Blocks 1 open bug)

Details

(Whiteboard: [ca-compliance] Next Update - 31-March 2020)

Attachments

(1 file)

User Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.116 Safari/537.36

Steps to reproduce:

On 2020-02-29 UTC, Let’s Encrypt found a bug in our CAA code. Our CA software, Boulder, checks for CAA records at the same time it validates a subscriber’s control of a domain name. Most subscribers issue a certificate immediately after domain control validation, but we consider a validation good for 30 days. That means in some cases we need to check CAA records a second time, just before issuance. Specifically, we have to check CAA within 8 hours prior to issuance (per BRs §3.2.2.8), so any domain name that was validated more than 8 hours ago requires rechecking.
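For illustration, the recheck decision reduces to a time comparison along these lines. This is a minimal sketch, not Boulder's actual code; the names needsCAARecheck and recheckWindow are ours:

package main

import (
    "fmt"
    "time"
)

// recheckWindow reflects BRs §3.2.2.8: CAA results may be relied upon
// for at most 8 hours prior to issuance.
const recheckWindow = 8 * time.Hour

// needsCAARecheck reports whether a domain validated at validatedAt
// requires a fresh CAA lookup before issuing now.
func needsCAARecheck(validatedAt, now time.Time) bool {
    return now.Sub(validatedAt) > recheckWindow
}

func main() {
    now := time.Now()
    fmt.Println(needsCAARecheck(now.Add(-2*time.Hour), now))  // false: CAA result still fresh
    fmt.Println(needsCAARecheck(now.Add(-72*time.Hour), now)) // true: validation reused, must recheck CAA
}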

The bug: when a certificate request contained N domain names that needed CAA rechecking, Boulder would pick one domain name and check it N times. What this means in practice is that if a subscriber validated a domain name at time X, and the CAA records for that domain at time X allowed Let’s Encrypt issuance, that subscriber would be able to issue a certificate containing that domain name until X+30 days, even if someone later installed CAA records on that domain name that prohibit issuance by Let’s Encrypt.

We confirmed the bug at 2020-02-29 03:08 UTC, and halted issuance at 03:10. We deployed a fix at 05:22 UTC and then re-enabled issuance.

Our preliminary investigation suggests the bug was introduced on 2019-07-25. We will conduct a more detailed investigation and provide a postmortem when it is complete.

Assignee: wthayer → jsha
Status: UNCONFIRMED → ASSIGNED
Type: defect → task
Ever confirmed: true
Whiteboard: [ca-compliance]
Blocks: 1619179

Here's our incident report:

  1. How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.

On 2020-02-18, a Let’s Encrypt community member reported a bug in how our API reported errors when rechecking CAA at issuance time: we included multiple identical error messages for a single hostname. https://community.letsencrypt.org/t/rechecking-caa-fails-with-99-identical-subproblems/113517. Another community member suggested that there might be a more serious underlying bug that caused some CAA records not to be checked. We checked the logs, confirmed that CAA was checked, and concluded the bug was in the error message generation rather than in the CAA checking itself. We filed an issue at regular priority and incorporated it into our next sprint. https://github.com/letsencrypt/boulder/issues/4681.

On 2020-02-29 UTC, while working on a fix, a Let’s Encrypt engineer discovered that the bug was in fact in CAA checking, not in error message generation. Once we confirmed this (03:08 UTC), we immediately disabled issuance (03:10 UTC).

We’d like to thank our community members @lancedolan and @_az for originally bringing the issue to our attention.

  2. A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.

2019-04-24 16:40 UTC: Bug committed to Boulder, behind a feature flag.
https://github.com/letsencrypt/boulder/pull/4134/files#diff-2285b0268539881fde96d9928ecef358R2250
2019-08-23 01:19 UTC: NewAuthorizationSchema feature flag first enabled in prod. This triggers the bug for the first time.
2020-02-18: Message about error handling posted to community support forum.
2020-02-29 03:08 UTC: Let’s Encrypt engineer confirms bug.
2020-02-29 03:10 UTC: Let’s Encrypt SRE team disables issuance.
2020-02-29 05:22 UTC: Fixed version of Boulder is deployed (containing https://github.com/letsencrypt/boulder/pull/4690). Issuance is re-enabled.

  3. Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem. A statement that you have will be considered a pledge to the community; a statement that you have not requires an explanation.

Yes, we have stopped issuing certificates with this problem.

  4. A summary of the problematic certificates. For each problem: number of certs, and the date the first and last certs with that problem were issued.

Of currently-valid certificates, 3,048,289 are affected (out of approximately 116 million active certificates overall). About 1 million of the affected certificates are duplicates of other affected certificates, in the sense that they cover the same set of domain names.

Because the bug only affected issuances that reused a validation more than 8 hours old, and subscribers who reissue very frequently reuse existing validations on nearly every issuance, the most commonly affected certificates were those that are reissued very frequently. This is why so many affected certificates are duplicates.

Our first priority has been determining which currently-valid certificates are affected, for revocation purposes. We will next analyze logs back to the date the problem began.

This set of certificates was issued between 2019-12-04 and 2020-02-29 03:08 UTC.

Of the hostnames in the affected certificates, 443 currently have CAA records that reject issuance by Let’s Encrypt. By the nature of the bug, we know that at some point in the 30 days prior to issuance, each hostname either had no CAA records or had CAA records that allowed issuance by Let’s Encrypt. However, we don’t know what the CAA status of those 443 hostnames was at the time the certificates that include them were issued.
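For context, CAA (RFC 8659) lets a domain owner restrict which CAs may issue for its names. A record like the following would reject issuance by Let's Encrypt (illustrative zone-file syntax with a hypothetical CA domain, not data from this incident):

example.com.  3600  IN  CAA  0 issue "other-ca.example"
; or, to forbid issuance by every CA:
example.com.  3600  IN  CAA  0 issue ";"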

  5. The complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem.

The list of certificate serial numbers and affected hostnames is in caa-rechecking-incident-affected-serials.txt.gz, available for download at https://letsencrypt.org/caaproblem, with the sha256 hash b9a5dd16a1b048c328df130e75a6d08f059e96d3fc42f7e925823d254a63a715. All certificates are logged in CT. We can provide certificate bodies, or links to crt.sh, if needed.
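Anyone downloading the file can check it against the published digest with a standard tool before relying on it:

$ sha256sum caa-rechecking-incident-affected-serials.txt.gz
b9a5dd16a1b048c328df130e75a6d08f059e96d3fc42f7e925823d254a63a715  caa-rechecking-incident-affected-serials.txt.gz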

A list of hostnames for which CAA presently rejects issuance by Let’s Encrypt is attached to this issue.

  6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

The proximate cause of the bug was a common mistake in Go: taking a reference to a loop iterator variable. https://github.com/golang/go/wiki/CommonMistakes#using-reference-to-loop-iterator-variable. The relevant code was:

// authz2ModelMapToPB converts a mapping of domain name to authz2Models into a
// protobuf authorizations map
func authz2ModelMapToPB(m map[string]authz2Model) (*sapb.Authorizations, error) {
    resp := &sapb.Authorizations{}
    for k, v := range m {
        // Make a copy of k because it will be reassigned with each loop.
        kCopy := k
        authzPB, err := modelToAuthzPB(&v)
        if err != nil {
            return nil, err
        }
        resp.Authz = append(resp.Authz, &sapb.Authorizations_MapElement{Domain: &kCopy, Authz: authzPB})
    }
    return resp, nil
}

Interestingly, you can see by the comment about kCopy that we specifically handled the issue for one loop iterator variable (k), but we failed to handle it properly for a second loop iterator variable (v). In part that was because the problematic reference-taking was inside modelToAuthzPB. Since authzPB is freshly scoped with each iteration, it’s easy to assume it’s not affected by the loop variable problem. However, modelToAuthzPB actually takes a reference to two important fields from its input: IdentifierValue and RegistrationID.

func modelToAuthzPB(am *authzModel) (*corepb.Authorization, error) {
    expires := am.Expires.UTC().UnixNano()
    id := fmt.Sprintf("%d", am.ID)
    status := uintToStatus[am.Status]
    pb := &corepb.Authorization{
        Id:             &id,
        Status:         &status,
        Identifier:     &am.IdentifierValue,
        RegistrationID: &am.RegistrationID,
        Expires:        &expires,
    }
    // ... (remainder of function elided)

The result is that authz2ModelMapToPB creates a map whose keys consist of the correct set of FQDNs, but whose contents all have the same IdentifierValue. That gets returned, via RPC, to boulder-ra: the component that handles authorization checking prior to issuance. Boulder-ra checks that all needed FQDNs are present using the map keys, which works correctly. It also checks the Expires field, which is not affected by the loop iterator variable problem.

However, when boulder-ra determines that a given FQDN needs CAA rechecking, it uses the Identifier field from the authorization object, and because of this bug that field was identical across all values in a single map. As a result, if multiple authorizations needed CAA rechecking, boulder-ra would recheck one FQDN repeatedly and skip the others.
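The mistake is easy to reproduce outside Boulder. The following standalone sketch (our illustration, not Boulder code) mirrors the shape of the bug: a conversion function keeps a pointer into its argument, and the caller passes a pointer to a range-loop variable. Note that Go 1.22 later made loop variables per-iteration, so this repro only misbehaves on earlier Go versions such as the one in use at the time:

package main

import "fmt"

// pb mimics a protobuf message whose Identifier field is a *string,
// analogous to corepb.Authorization above.
type pb struct {
    Identifier *string
}

// convert mirrors modelToAuthzPB: it stores a reference into its input
// instead of copying the value.
func convert(name *string) pb {
    return pb{Identifier: name}
}

func main() {
    names := map[string]bool{"a.example": true, "b.example": true, "c.example": true}
    var out []pb
    for k := range names {
        // BUG (Go < 1.22): k is a single variable reused on every
        // iteration, so each pb points at the same storage.
        out = append(out, convert(&k))
    }
    for _, p := range out {
        fmt.Println(*p.Identifier) // prints one arbitrary name three times
    }
}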

Stepping back a little, another cause was the presence of a conversion layer to take us from native Go types to RPC structures and back. We’ve found in the past that such conversion layers can be a source of hard-to-find bugs, and have tried to simplify or eliminate them when possible.

Another, secondary cause: The proto2 library we use for RPC relies heavily on pointers as fields of long-lived structures. This causes us to take references in many cases where it would be simpler, and more correct, to pass a field by value. The newer proto3 library improves on this by using values more frequently. We’ve had a longstanding plan to upgrade to it, and will increase priority on that.
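Schematically, the difference in the generated Go types looks like this (a sketch of generated-code shapes, not the actual Boulder definitions):

// proto2-style: optional scalar fields become pointers, which invites
// taking references to existing values.
type AuthorizationV2 struct {
    Identifier     *string
    RegistrationID *int64
}

// proto3-style: scalar fields are plain values, so populating them
// copies the data rather than aliasing it.
type AuthorizationV3 struct {
    Identifier     string
    RegistrationID int64
}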

Why wasn’t this caught by our tests? The bug was committed along with a unittest that covers the affected code: TestGetValidOrderAuthorizations2. This unittest exercises the broken code path, creating and retrieving an order with multiple authorizations. However, it does not adequately verify the contents of the returned authorizations; it only checks that the correct number were returned. We should have written a more thorough unittest here. We should also have written a lower-level unittest that tests modelToAuthzPB directly and verifies its output. We will add both within the next three weeks.
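A direct test of modelToAuthzPB could catch exactly this class of aliasing by mutating the input after conversion. This is a sketch under the field names visible above, with illustrative fixture values; the real test will live in the storage-authority package and use its existing helpers:

package sa

import (
    "testing"
    "time"
)

func TestModelToAuthzPBCopiesFields(t *testing.T) {
    am := authzModel{
        ID:              1,
        Status:          0, // illustrative status value
        Expires:         time.Now(),
        IdentifierValue: "a.example",
        RegistrationID:  10,
    }
    pb, err := modelToAuthzPB(&am)
    if err != nil {
        t.Fatal(err)
    }
    // Overwrite the input the way a reused range variable would be.
    am.IdentifierValue = "b.example"
    am.RegistrationID = 20
    // With the aliasing bug, these reads would observe the mutated values.
    if *pb.Identifier != "a.example" {
        t.Errorf("Identifier = %q; want %q (output aliases its input)", *pb.Identifier, "a.example")
    }
    if *pb.RegistrationID != 10 {
        t.Errorf("RegistrationID = %d; want 10 (output aliases its input)", *pb.RegistrationID)
    }
}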

We also have a longstanding integration test that checks the CAA rechecking behavior. However, this integration test failed to find the bug because it requests a certificate with only one FQDN in it. We have already updated this integration test as part of our fix, and confirmed that the updated version catches the bug on earlier versions of Boulder.

  7. List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.

We’ve already deployed a fix for the immediate, known issue. Here are the steps we plan to take to avoid a recurrence of this bug, or similar bugs, in the future:

  1. Improve TestGetValidOrderAuthorizations2 unittest (3 weeks).
  2. Implement modelToAuthzPB unittest (3 weeks).
  3. Productionize automated logs verification for CAA behavior (8 weeks).
  4. Review code for other examples of loop iterator bugs (4 weeks).
  5. Evaluate adding stronger static analysis tools to our automated linting suite and add one or more if we find a good fit (4 weeks; one candidate is sketched after this list).
  6. Upgrade to proto3 (6 months).
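As one illustration of what item 5 might look like (a candidate, not a committed choice): golangci-lint includes a linter aimed at pointers escaping range-loop variables, which can be enabled in its configuration file:

# .golangci.yml (illustrative; final tool selection depends on our evaluation)
linters:
  enable:
    - exportloopref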

Additional information:

We have begun notifying affected subscribers of revocation on the BR-mandated timeline. We have a FAQ and the email contents available at https://community.letsencrypt.org/t/revoking-certain-certificates-on-march-4/114864.

List of hostnames for which the current CAA record forbids issuance by Let's Encrypt.

Jacob: It's been 24 days since the last update on this bug; do you have updates on the TestGetValidOrderAuthorizations2 and modelToAuthzPB unit tests?

Trying to extrapolate the next dates based on Comment #1, if I understood correctly:

  • 2020-03-24 - Improve TestGetValidOrderAuthorizations2 unittests
  • 2020-03-24 - Implement modelToAuthzPB unittest
  • 2020-03-31 - Review code for other examples of loop iterator bugs
  • 2020-03-31 - Evaluate adding stronger static analysis tools to our automated linting suite and add one or more if we find a good fit
  • 2020-04-28 - Productionize automated logs verification for CAA behavior
  • 2020-09-03 - Upgrade to Proto3

I'd also note that since Comment #1, Let's Encrypt filed Bug 1619179 to note that they decided not to revoke after all.

Flags: needinfo?(jsha)

2020-03-24 - Improve TestGetValidOrderAuthorizations2 unittests
2020-03-24 - Implement modelToAuthzPB unittest

These are completed.

2020-03-31 - Review code for other examples of loop iterator bugs
2020-03-31 - Evaluate adding stronger static analysis tools to our automated linting suite and add one or more if we find a good fit

These are in progress.

2020-04-28 - Productionize automated logs verification for CAA behavior
2020-09-03 - Upgrade to Proto3

Work on these will start when the above items are done.

Flags: needinfo?(jsha)
Whiteboard: [ca-compliance] → [ca-compliance] Next Update - 31-March 2020