Open Bug 1715455 Opened 12 days ago Updated 3 days ago

Let's Encrypt: certificate lifetimes 90 days plus one second

Categories

(NSS :: CA Certificate Compliance, task)

Tracking

(Not tracked)

ASSIGNED

People

(Reporter: jaas, Assigned: jaas)

References

Details

(Whiteboard: [ca-compliance])

Let’s Encrypt is well-known for issuing certificates that are valid for only 90 days. Since the very first certificates issued by Let’s Encrypt’s infrastructure, those certificates have been given a 90 day validity period by our CA software by taking the issuance time and adding exactly 2,160 hours to yield the certificate’s “not after” date. However, RFC 5280 defines the validity period of a certificate as being the duration between the “not before” and the “not after” timestamps, inclusive. This inclusivity means that Let’s Encrypt’s certificates have all been actually valid for 90 days plus 1 second.

ISRG CPS v3.2 Section 7.1 states that end-entity certificates have a lifetime of 90 days. Section 6.3.2 states that lifetimes will be less than 100 days, but we understand that we are responsible for the more specific lifetime stated in Section 7.1.

Note that CPS v3.3 was released on June 8, 2021, and changed Section 7.1 to match Section 6.3.2 in stating that end-entity certificates will have a lifetime of less than 100 days, but most unexpired certificates issued by Let’s Encrypt at this time were issued under CPS v3.2. We chose to remediate this issue as if the CPS change had not already brought us into compliance for future issuance, mainly in order to prevent future issues with certificate lifetime configuration.

How your CA first became aware of the problem.

We were notified of this via an email from Jesper Kristensen to security@letsencrypt.org on June 8, 2021, at 19:45:49 +0000 (UTC).

A timeline of the actions your CA took in response.

  • 2021-06-08 19:45 UTC: We received an email from Jesper Kristensen to security@letsencrypt.org
  • 2021-06-08 21:00 UTC: ISRG Security Officers met for initial review of the report
  • 2021-06-09 02:53 UTC: Internal incident declared
  • 2021-06-09 04:11 UTC: ISRG deploys a staging boulder-ca config change setting certificate lifetime to 7775999 seconds
  • 2021-06-09 04:41 UTC: ISRG deploys a production boulder-ca config change setting certificate lifetime to 7775999 seconds
  • 2021-06-09 04:41 UTC: Incident resolved. All new certificates issued have a certificate lifetime of 7775999 seconds, which conforms with an RFC 5280 validity period of 7776000 seconds

Whether your CA has stopped, or has not yet stopped, certificate issuance or the process giving rise to the problem or incident.

A fix was deployed quickly. We did not stop issuance at any time in response to this issue. Such a disruption was deemed to be unnecessary given the severity of the issue and the speed at which our team was able to develop and deploy a fix.

A summary of the problematic certificates.

All unexpired certificates issued by Let’s Encrypt are affected, approximately 185 million certificates. The oldest unexpired certificate is approximately 90 days old.

In a case involving certificates, the complete certificate data for the problematic certificates.

An example certificate showing the problem: https://crt.sh/?id=4670841263

An example certificate showing the corrected validity period: https://crt.sh/?id=4671255473

Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

The software Let’s Encrypt uses to issue certificates, Boulder, calculates its certificate validity periods in two steps: First, it determines the “not before” date by subtracting exactly 1 hour from the issuance time. Then, 2160 hours — 90 days — are added to that “not before” date to yield the “not after” date. Reviewing that code, it appears obvious that adding 90 days to some date yields a total duration of 90 days.

However, the certificate profile from RFC 5280 has a more specific definition of lifespan that all Web PKI clients are expected to adhere to: RFC 5280, Section 4.1.2.5, defines the "validity period" to be "the period of time from notBefore through notAfter, inclusive."

This inclusivity means that a hypothetical certificate with a “not before” date of 9 June 2021 at 03:42:01 and a “not after” date of 7 Sept 2021 at 03:42:01 becomes valid at the beginning of the :01 second, and only becomes invalid at the :02 second, a period that is 90 days plus 1 second. The correct 90-day “not after” time is actually 03:42:00.

Throughout the lifespan of Let’s Encrypt, we’ve always reasoned about certificate lifespans in terms of hours. Unfortunately, the RFC 5280 definition either requires the CA software to explicitly subtract 1 second from calculated validity periods to account for the inclusivity, or requires the configuration to be defined in seconds rather than the easier-to-analyze hours that we had always relied upon.
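To make the off-by-one concrete, here is a minimal sketch of the arithmetic described above (Python for illustration only; Boulder itself is written in Go, and these function names are hypothetical):

```python
from datetime import datetime, timedelta

ONE_HOUR = timedelta(hours=1)

def old_validity(issuance):
    """Pre-fix behaviour: notAfter = notBefore + 2160 hours (90 days).
    Per RFC 5280's inclusive definition, this yields 90 days + 1 second."""
    not_before = issuance - ONE_HOUR               # notBefore is backdated 1 hour
    not_after = not_before + timedelta(hours=2160)
    return not_before, not_after

def new_validity(issuance):
    """Post-fix behaviour: lifetime configured as 7,775,999 seconds, so the
    inclusive RFC 5280 validity period is exactly 7,776,000 s (90 days)."""
    not_before = issuance - ONE_HOUR
    not_after = not_before + timedelta(seconds=7775999)
    return not_before, not_after

def rfc5280_validity_seconds(not_before, not_after):
    # RFC 5280 4.1.2.5: the period from notBefore through notAfter, inclusive,
    # so the validity period is one second longer than the simple difference.
    return int((not_after - not_before).total_seconds()) + 1

issued = datetime(2021, 6, 9, 4, 41, 0)
print(rfc5280_validity_seconds(*old_validity(issued)))  # 7776001 (90 days + 1 s)
print(rfc5280_validity_seconds(*new_validity(issued)))  # 7776000 (exactly 90 days)
```

The 7,775,999-second figure matches the production configuration change in the timeline above.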

This error was not caught by any form of automated certificate linting.

Our issuance pipeline uses the Zlint certificate linter as a mandatory, must-pass step at two different stages: after construction of the “precertificate” for Certificate Transparency logs, and after issuance of the final certificate but before delivery to a subscriber.

The Zlint project only attempts to check whether a certificate exceeds the maximum validity allowed by the Baseline Requirements, and is not configurable. So while Let’s Encrypt does run Zlint against certificates being issued, Zlint was enforcing the 398-day limit of the Baseline Requirements, and thus was not in a position to catch this violation of our CPS.

Notably, Zlint also makes the same error as Boulder in calculating the maximum validity period allowed, permitting the extra 1 second that should be prohibited. We have filed issue #608 with the Zlint project with tests demonstrating the incorrect behavior.

List of steps your CA is taking to resolve the situation and ensure that such situation or incident will not be repeated in the future, accompanied with a binding timeline of when your CA expects to accomplish each of these remediation steps.

  • A fix has been deployed so that our certificates are issued with 90 day validity periods per RFC 5280.
  • ISRG CPS Section 7.1 has been updated to state that Let’s Encrypt certificates will have a validity period of less than 100 days, matching Section 6.3.2.
  • We will further improve our codebase to ensure that everywhere we compute or handle a validity period, we do so in compliance with RFC 5280’s inclusive definition of such: https://github.com/letsencrypt/boulder/issues/5473. This work will be completed by 2021-09-09 (inclusive).
Assignee: bwilson → jaas
Status: NEW → ASSIGNED
Summary: 2021.06.08 ISRG / Let's Encrypt certificate lifetimes 90 days plus one second → Let's Encrypt: certificate lifetimes 90 days plus one second
Whiteboard: [ca-compliance]

Can you describe the processes Let’s Encrypt has in place to review changes to the Baseline Requirements, as well as to Mozilla CA incidents?

This bears a concerning similarity to Bug 1708965, reported a month ago, which seems like it could and should have prompted a similar examination of your CP/CPS.

Additionally, the language from RFC 5280 was explicitly incorporated into the BRs as part of Ballot SC31. Can you clarify how this was missed?

Finally, it does not seem there is a clear statement about revocation plans. Section 4.9.1.1 requires revocation within five days if:

The CA is made aware that the Certificate was not issued in accordance with these Requirements or the CA’s Certificate Policy or Certification Practice Statement;

Will Let’s Encrypt be revoking, or will it be filing a separate bug, as required by https://wiki.mozilla.org/CA/Responding_To_An_Incident

Flags: needinfo?(jaas)

I am a member of the general public and noticed this issue. I just read through this incident and compared it with Baseline Requirements Section 6.3.2, which states:

For the purpose of calculations, a day is measured as 86,400 seconds. Any amount of time greater than this, including fractional seconds and/or leap seconds, shall represent an additional day.

This causes a problem for the current fix when there is a leap second within the validity period. I guess Let's Encrypt needs to somehow deal with leap seconds in Boulder, or amend the CPS to accommodate them.

Ignore the previous. I just noticed that a new CPS v3.3 was issued to accommodate the leap second issue:

Note that CPS v3.3 was released on June 8, 2021, and changed Section 7.1 to match Section 6.3.2 in stating that end-entity certificates will have a lifetime of less than 100 days

As was previously stated in https://bugzilla.mozilla.org/show_bug.cgi?id=1708965#c3 by Jesper Kristensen

The reason I want to read it is that while the BRs are very precise about what 398 days means, should we assume the same level of precision in all CPSes? I think doing so would encourage all CAs to be as vague as possible in their CPS, if a small typo in there will be treated as misissuance.

I think that the desired outcome and solution to such issues should not be generalizing the CPS and making it as vague as possible to prevent compliance issues.

I do not understand the math behind the mentioned problem.

Is either the RFC or the BRs really specifying a DATE of 1 s precision to be an INTERVAL rather than a POINT IN TIME?
The spec constrains which points are allowed to be endpoints of a validity period, but these are still POINTS, AFAICT.
It means that, for example, 10:00:01 is AFTER 10:00, even though 10:00:01 is not a valid endpoint of a validity period (and it doesn't have to be, as it is a measurement result, not a validity period endpoint).

Does the RFC or BR require a browser to round down the measured current time to full seconds before comparing it against the validity period?

I believe it is important that the ISRG come to a decision on whether 0.999... = 1, as the question seems to me to be whether the measure of the closed interval [not before, not after] is equal to the measure of the half-open interval [not before, not after). This is the "inclusive" question. To wit: LetsEncrypt engineers thought that those two had equal measure, but some implementers and security researchers believe they do not. Let us assume they do not have equal measure.

If the spec defines LetsEncrypt's certificates to be valid for 90 days 1 second, and LetsEncrypt would like to issue certificates that are valid for 90 days, they will need to issue certificates of the form:

not_before = t0
not_after = t0 + 2159 hours 59 minutes 59.99999999... seconds

However, if 0.999... = 1, then we have a contradiction with our assumption. Therefore, the ISRG should consider adopting 0.999... != 1 as part of the next CPS revision.

I believe this could spur real innovation in the number systems used by software developers to go beyond integers, beyond fixed-point decimal, beyond floating point, and even beyond the real numbers! The CPS would allow CAs and validators to express "not after = 2160 hours - ε". There is a unique opportunity here to supplant the IEEE as the premier standards organization for defining numbers. As we all know, the IEEE is needlessly bogged down on new ways to express smaller mantissas and exponents in floating-point numbers so that computers can solve linear regressions faster. (A silly pursuit which I am sure will never amount to anything of import.)

A more modest proposal would be to consider adopting a half-open interval for ease of understanding and implementation.

Some background. When reading one of the many bugs about references to BR section 3.2.2.4.6 and 3.2.2.4.10 I noticed a similarity to a comment made by Ben in Bug 1701317. Since I did not see a response to Ben's comment from Let's Encrypt at the time, I decided to notify them of the similarity. Before sending the email I fact checked it by reading the CPS myself. When reading the CPS I compared it to the end entity certificate at https://letsencrypt.org/ and noticed the similarity to Bug 1708965. I did not check the intermediate certificate.

(In reply to Ryan Sleevi from comment #1)

Will Let’s Encrypt be revoking, or will it be filing a separate bug, as required by https://wiki.mozilla.org/CA/Responding_To_An_Incident

We do not plan to revoke any certificates. We will file another issue about that soon. We felt that part of our response could wait for the following working day (rather than the middle of the night for our staff, when this was filed).

We will respond regarding the rest of your questions shortly.

Flags: needinfo?(jaas)

(In reply to Grzegorz Prusak from comment #5)

Does the RFC or BR require a browser to round down the measured current time to full seconds before comparing it against the validity period?

Here's the short answer version (I have a longer answer, but like... woah, wall of text)

Since X.509/88, the notBefore is the first date that the certificate is valid, and the notAfter is the last date that the certificate is valid. Specifically, from Section 7.2

TA indicates the period of validity of the certificate, and consists of two dates, the first and last on which the certificate is valid. Since TA is assumed to be changed in periods not less than 24 hours, it is expected that systems would use Coordinated Universal Time as a reference time base

The question I understand you to be then asking is what happens if the notAfter is expressed in seconds, but the system measures time in fractional seconds. That is, if notAfter (for HHMMSS) is 000000, and the system determines the time is 000000.123, do you:

  1. Truncate the system time to the same precision as UTCTime/GeneralizedTime (e.g. 000000.123 -> 000000)
  2. Extend the notAfter to the same precision as the system time (e.g. 000000 -> 000000.000)
  3. (Bonus) Round the system time up (i.e. ceil()) to the same precision as UTCTime, treating fractional seconds as a whole second (000000.123 -> 000001)

The answer here is (historically) 1. A time of 000000 represents all values from 000000.000 to 000000.999, which is clearer when you examine the notBefore. The same logic applies to the notAfter: it represents all fractional values within that range, and the certificate is not expired until 000001.000.
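A short sketch of option 1 (truncation) may help; the dates and helper below are hypothetical, chosen to mirror the HHMMSS example above:

```python
from datetime import datetime, timezone

# Sketch of the "historical" option 1: truncate (floor) the system clock to
# whole seconds, the precision of UTCTime/GeneralizedTime, before comparing.
# The certificate bounds here are hypothetical, for illustration only.

def is_valid(not_before, not_after, now):
    now_truncated = now.replace(microsecond=0)   # floor to 1 s precision
    return not_before <= now_truncated <= not_after

not_before = datetime(2021, 6, 9, 0, 0, 0, tzinfo=timezone.utc)
not_after = datetime(2021, 9, 7, 0, 0, 0, tzinfo=timezone.utc)

# 00:00:00.123 on the notAfter day truncates to 00:00:00, so still valid:
print(is_valid(not_before, not_after,
               datetime(2021, 9, 7, 0, 0, 0, 123000, tzinfo=timezone.utc)))  # True
# 00:00:01.000 truncates to 00:00:01, which is past notAfter, so expired:
print(is_valid(not_before, not_after,
               datetime(2021, 9, 7, 0, 0, 1, tzinfo=timezone.utc)))  # False
```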

Comment #6 may be cheeky, but this is then the question of "Is 0.999... equal to 1?", which, like many great things, has a whole Wikipedia page answering that question.

I realize that folks are no doubt interested in this, given the implications to (effectively) 100% of Let's Encrypt issued certificates, and potentially other CAs simply not-yet-detected. Comment #1 captures that this has come up before and been answered, while Comment #4 accurately captures that there are tradeoffs and implications here for the ecosystem. Ultimately, incidents like this are up to the CA to demonstrate how they balance these equities and expectations, and the processes to improve and prevent things (which Comment #8 will further expand on)

The answer here is (historically) 1.
Thank you for the answer. Still, I do not understand how one can claim that the validity period doesn't match the declared one (by merely 1 s), given that the X.509 standard is far more imprecise at defining it. Why would one refer to undocumented (aka "historical") reasoning to make the standard more specific, rather than to a canonical definition of real numbers?

Are browsers still doing this validation in the "historical" way? Are they required to do so? Are 1 day and 1 second the only available date precision units? How about, let's say, 3 seconds?

As I read the other comments and the linked bug, the recommendation is to make the declaration of the validity period more vague, to avoid that kind of concern and thereby leave people to have their own opinions on how numbers and intervals are defined.

I think that if it's not practical to guarantee a precision of 1 s, then it should be OK for CAs to state that certificates will be valid for x ± y. Now CAs can just change their CPS to state that certificates will be valid for no longer than the value in the BRs just to be safe, but I think that's not the desired outcome.

Ryan - definitely agree that I was being cheeky :) but there is a point I would like to drive home.

It does seem like option #1 of truncating defies the plain meaning of "not after". If we can truncate to seconds, then why not minutes, hours, months, or years? Perhaps that's in the spec, though, and I'm not familiar with it.

I would like to introduce a 4th and equivalent option to #3:

  4. Treat not_after and the validity period in general as a half-open interval [not_before, not_after) and treat a system time (derived by any method) equal to the not_after as invalid.

This has several benefits:

  • Half-open intervals are a common paradigm in both math and database systems; see, e.g., the default intervals used by Postgres in their range operators. They are also the default in random number generators and in most systems that perform some sort of iteration.
  • A finite union of half-open intervals is an algebra, and any set of valid times (any set or subset of our time range) a user would desire can be uniquely expressed as a smallest equivalent disjoint union of half-open intervals, but that may not mean much to most people. Instead, there are some useful properties, such as:
  • It's possible to create a finite set of intervals (read: certificates) that tile N/Z (or some fixed decimal number system that's equivalent) (read: there are no gaps or overlaps)
  • We can trivially define the complement without using a different system: the complement of [not_before=00000, not_after=00001], for example, is not expressible using the same interval notation. On the other hand, the complement of [00000, 00001) is expressible as the union of two half-open intervals. This is more than just pedantic; it's really useful for making sure code is correct, because you can use the same comparison function to determine if a point is in those intervals as in the original!

Options 1, 2, and 3 all result in easy programming errors in the system, though option 3 has the "least" error. Even so, option 3 means it is not possible to create a set of certificates, S = {A, B, C}, such that:

A: not_before t=0, not_after t=1
B: not_before t=1, not_after t=2
C: not_before t=2, not_after t=3

And to define a function f(t) => S ∪ ∅, i.e., for each time, what is the unique valid certificate during that time?

My two cents would be: use half-open intervals. They are the mathematically sound and useful thing to do, and I'd recommend that TLS validators do the same. Every other option results in easy programming errors which could have downstream effects. For example, a TLS certificate valid for 1 second would have not_before=0, not_after=0. Programmatic methods of creating sequentially valid TLS certificates will be plagued by off-by-one errors, and such methods are increasingly common. For an industry example, see Facebook Engineering's blog post on generating mutual TLS certificates: https://engineering.fb.com/2019/05/29/security/service-encryption/

I think it would be very surprising to engineers building systems like theirs, which construct sequentially valid certificates like A, B, C above, that this either leads to multiple certs being valid at the same time for 1 s periods or to 1 s gaps during which no certificate is valid.
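The off-by-one hazard with the A, B, C certificates above can be sketched in a few lines (Python, with hypothetical integer timestamps standing in for real dates):

```python
# Sketch of the tiling problem with sequentially valid certificates A, B, C.
# With the inclusive ("closed") rule, adjacent certificates overlap at their
# shared endpoint; with a half-open rule [not_before, not_after), every
# instant maps to exactly one certificate.

certs = [("A", 0, 1), ("B", 1, 2), ("C", 2, 3)]  # (name, not_before, not_after)

def valid_closed(cert, t):
    _, nb, na = cert
    return nb <= t <= na

def valid_half_open(cert, t):
    _, nb, na = cert
    return nb <= t < na

# At t=1, the closed rule says both A and B are valid:
print([c[0] for c in certs if valid_closed(c, 1)])     # ['A', 'B']
# The half-open rule yields exactly one valid certificate at that instant:
print([c[0] for c in certs if valid_half_open(c, 1)])  # ['B']
```

Under the half-open rule every time in [0, 3) belongs to exactly one certificate, so there are no gaps or overlaps.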

Correction: when I wrote N/Z above, I meant "ℕ or ℤ", as some astute readers may know that N/Z is notation for something different.

(In reply to Ryan Sleevi from comment #9)

The question I understand you to be then asking is what happens if the notAfter is expressed in seconds, but the system measures time in fractional seconds.

That's just one way to demonstrate the ambiguity in the definition.

But really I think the question is more fundamental, that is, does the certificate stop being valid after the moment in time at exactly 000000, or just before the start of the moment at exactly 000001?

To me, the spec definition is more reasonably interpreted as the former, that the certificate stops being valid after the moment specified in "notAfter". It is inclusive of the moment specified by "notAfter", but no moment after that one. If that's the case then the old behaviour was exactly correct, and the new behaviour is creating validity periods that are ~1 second too short.

The interpretation being applied with the new behaviour seems to be that the validity is inclusive of the entire second starting with notAfter. But why should it be inclusive of 1 second exactly? The spec doesn't say "inclusive of the whole second" and there is no reason to interpret it that way other than the fact that the resolution of the timestamp is 1 second (which is irrelevant to comparisons).

To those suggesting that the interval should be made half-open, I agree that would be a good change, but then the phrase "notAfter" ought to be changed to "before" or similar.

(In reply to Ryan Sleevi from comment #1)

Can you describe the processes Let’s Encrypt has in place to review changes to the Baseline Requirements [...]?

[T]he language from RFC 5280 was explicitly incorporated into the BRs as part of Ballot SC31. Can you clarify how this was missed?

During the discussion and IPR period of every ballot, engineers review the text of the ballot to determine if it will have any effect on our operations, or if we appear to be in compliance with all of its provisions already. If a ballot seems like it will require us to make changes to our documents or operations, bugs are filed and the Policy Management Authority may get involved to review. The Policy Management Authority reviews new additions to the BRs during each quarterly meeting.

In this particular instance, the incorporation of RFC 5280’s language into the definition of “validity period” was not examined in detail because we believed ourselves to already be in compliance with RFC 5280 in this regard. No new requirement was being put in place; an existing requirement was simply being expressed in a second location.

We acknowledge that we have continuously misinterpreted the requirement over the course of the last five years. The profusion of other systems which make the exact same error points to this being a systemic problem with the structure or phrasing of the requirement itself. As noted in the original report, zlint does not compute validity periods inclusive of the notAfter interval. A number of other PKI implementations on the Web seem to have interpreted the relevant requirements similarly.

Can you describe the processes Let’s Encrypt has in place to review [...] Mozilla CA incidents?

This bears a concerning similarity to Bug 1708965, reported a month ago, which seems like it could and should have prompted a similar examination of your CP/CPS.

We often review Mozilla CA incidents, but we recognize that we have not put a process in place to ensure that we review all CA incidents. Going forward, we will establish a new triage rotation staffed by our engineering team. This rotation will have responsibility for:

  • Reviewing all Mozilla CA incidents with updates during the rotation
  • Reviewing all CABF ballots which enter the IPR review period during the rotation
  • Filing tickets for any investigations or CP/CPS updates that need to happen as a result of such review

(In reply to Aaron Friel from comment #12)

It does seem like the plain meaning of "not after" means that option #1 of truncating defies the plain meaning of "not after". If we can truncate to seconds, then why not minutes, hours, months, years? Perhaps that's in the spec though and I'm not familiar with it.

Right, both the "plain meaning" and the truncation are addressed by the specs (X.509 and X.690, respectively). That is, the certificate is valid on its notAfter date, to the precision of seconds.

The context behind this is actually a fairly elegant trick: when you're verifying certificates, rather than converting the GeneralizedTime/UTCTime into something that might trigger the Y2038 problem (like a time_t / struct tm), you convert the system time into the appropriate representation, which allows you to perform a simple string comparison here. This is where the truncate (or floor()) comes from.

Obviously, this approach doesn't help if you want to compute the difference (you've gotta break out your math there), but if you want to see if notBefore <= now <= notAfter, you can do so with a harmonized string comparison.
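A minimal sketch of that string-comparison trick (Python here; the helper name and certificate bounds are hypothetical):

```python
from datetime import datetime, timezone

# Instead of parsing the certificate's GeneralizedTime into a time_t
# (risking Y2038 issues), render the system clock in the same
# "YYYYMMDDHHMMSSZ" format and compare strings. Because the format orders
# fields from most to least significant, lexicographic order matches
# chronological order, and formatting the clock to whole seconds is exactly
# the floor()/truncation discussed above.

def to_generalized_time(t):
    return t.strftime("%Y%m%d%H%M%SZ")

not_before = "20210609000000Z"   # hypothetical certificate bounds
not_after = "20210907000000Z"

now = datetime(2021, 9, 7, 0, 0, 0, 500000, tzinfo=timezone.utc)
now_str = to_generalized_time(now)         # fractional second is floored away
print(not_before <= now_str <= not_after)  # True: still valid at notAfter
```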

My two cents would be: use half-open intervals. They are the mathematically sound and useful thing to do, and I'd recommend that TLS validators do the same. Every other option results in easy programming errors which could have downstream effects. For example, a TLS certificate valid for 1 second would have not_before=0, not_after=0.

Sure, and I get that there's a desire to discuss changes to the definition. But this incident report wouldn't be the place, that'd be better on mozilla.dev.security.policy.

Incident reports, and incidents in general, are based on past discussion and on the facts: for example, past CA incidents or other relevant discussion. That's not to rule out changing things, even if they've been this way since '88, but we don't do that within incidents.

CAs can and should raise areas of confusion beforehand; in general, that helps prevent incidents and clarify expectations. When things are raised, CAs are expected to be aware of the context and discussion. So it's not that there's inflexibility, but rather, we try to hold CAs to consistent standards during incidents, and consider the relevant facts, rather than how folks would like things to be.

Programmatic methods of creating sequentially valid TLS certificates will be plagued by off-by-one errors, and such methods are increasingly common. For an industry example, see Facebook Engineer's blog post on generating mutual TLS certificates: https://engineering.fb.com/2019/05/29/security/service-encryption/

"mTLS is bad" is my hobbyhorse :P Just search for it on Google ;)

We have filed Bug 1715672 to track our non-revocation of the affected certificates.

(In reply to Josh Aas from comment #16)

As noted in the original report, zlint does not compute validity periods inclusive of the notAfter interval.

I believe that incident report may be invalid/WorksAsIntended, as I recently commented.

A number of other PKI implementations on the Web seem to have interpreted the relevant requirements similarly.

I was hoping this comparison wouldn't be made, because it's similar in nature/rationale to discussions such as underscores-in-SANs or the many intentional, but undesired workarounds that mozilla::pkix ended up implementing.

The goal here is that CAs, as the issuers, have the most influence to ensure correctness, and there's a tension in PKI implementations about "accept the badness that exists" versus "break it". Indeed, the evolution of PKIX, and the compliance issues that have resulted, have largely been a variation of "It worked in OpenSSL" - which predated the PKIX standards and never particularly adhered to them (and now they, like others, are afraid of change that would break their users, even though it's explicitly prohibited behaviour).

We often review Mozilla CA incidents, but we recognize that we have not put a process in place to ensure that we review all CA incidents. Going forward, we will establish a new triage rotation staffed by our engineering team. This rotation will have responsibility for:

  • Reviewing all Mozilla CA incidents with updates during the rotation
  • Reviewing all CABF ballots which enter the IPR review period during the rotation
  • Filing tickets for any investigations or CP/CPS updates that need to happen as a result of such review

You can use component subscriptions to help ensure this, as the majority of CAs have done (judging by the CC lists).

I think it would benefit Let's Encrypt to re-do the root cause analysis. While I referred to Bug 1708965 as an example incident, my hope and expectation is that Let's Encrypt would have more deeply examined "are there incidents or discussions where we could/should have caught this?" I understand that Let's Encrypt erred on the side of providing a prompt incident report, and LE continues to have one of the fastest remediation track records of any CA, but I'd like to encourage taking a little time to deep-dive and research here as part of the RCA, to look for other opportunities for systemic improvements or gaps in systemic controls.

I highlight this, because the Mozilla Root Store Policy explicitly requires CAs to follow discussions in dev-security-policy, and this specific class of error was raised on the list on 2020-10-28.

This disclosure by EJBCA was prompted in part by Bug 1667744, filed 2020-09-28.

So there was a deeper opportunity here to catch this, 9 months ago, with the existing Mozilla policy requirements. It's unclear if Let's Encrypt reviewed that thread from PrimeKey, which explicitly called out non-EJBCA-based CAs, and examined their systems.

  1. Can you share more about your process for reviewing m.d.s.p. and how this thread was missed?
  2. Can you share more about the nature and structure of your triage team, and whether you've reviewed past CA incidents to examine how such teams may succeed or fail depending on how the CA structures them?
    • I recognize there is a bootstrap problem here. Ideally, every CA would already be familiar with every incident, and thus would be familiar with which incidents to examine. Let's Encrypt hasn't been following these incidents, and so could understandably benefit from having specific examples. On the other hand, because Let's Encrypt has not been following past CA incidents, doing a wholesale sweep of CA incidents over the past N years is an exercise that would invariably benefit LE and their triage team in understanding patterns, problems, challenges, and risks, and that exercise could provide a better incident report here.
    • Suggestion: The best thing to do would be to develop a process for reviewing the extant (open and closed) CA incidents, to learn from them both in incident management as well as to identify other potential gaps of understanding for Let's Encrypt. Then, provide an update here with a timeline that commits to review those incidents, followed by a timeline to deliver a report based on that about how Let's Encrypt's processes can be improved (both for monitoring for future incidents and to see if there are any other gaps, misunderstandings, or opportunities for improvements that LE identifies)
Flags: needinfo?(jaas)

(In reply to Ryan Sleevi from comment #17)

... both the "plain meaning" and the truncation are addressed by the specs (X.509 and X.690, respectively). That is, the certificate is valid on its notAfter date, to the precision of seconds. ... you convert the system time into the appropriate representation

I don't think the question was about truncating the date as stored in the certificate (for which the behaviour is clearly specified), but just truncating the current system time when doing the comparison (for which the behaviour isn't specified as far as I can see, just conventional).

The presence or absence of this "bug" seems to entirely depend on how an implementation answers this question.

(In reply to zivontsis from comment #20)

The presence or absence of this "bug" seems to entirely depend on how an implementation answers this question.

I appreciate that this may seem subjective, but hopefully the relevant specifications I've cited help explain. In any event, Aaron's now started the thread at https://groups.google.com/a/mozilla.org/g/dev-security-policy/c/-BogZx_IJyk/m/gHm3l613AgAJ for the discussion of "What if we redefined this", so it may be more appropriate to continue over there.

If you have questions for Let's Encrypt regarding this incident, it would remain appropriate to ask them, as that's part of the Responding to an Incident process. If there are relevant technical details or discussions that are overlooked, it's definitely appropriate to bring them up, but hopefully Comment #1 and Comment #19 capture some of the relevant past discussion to be aware of. While I've personally carefully reviewed the contemporaneous IETF lists when this language was introduced, if I've overlooked relevant ITU discussion, that would also be greatly appreciated.

See Also: → 1715672

If this is a real problem, wouldn't Section 7.2 of the CPS (CRL profile, nextUpdate = thisUpdate + 30 days) be likely to have the same problem?
By the way, is the length of a year defined anywhere? It looks like the BRs don't define it.
The Root OCSP Signing Certificate ("5 years") may have the same problem, but I can't find the definition of a year in the BRs.

(In reply to Ryan Sleevi from comment #21)

Thank you for the link and my apologies for crowding this thread. But I have checked the specifications and past discussions and don't believe they directly address my concern, and if it is truly a matter of the requirement being unspecified then I worry unnecessary action will be taken as a result of this bug.

For example some of those past discussions focus on the use of the word "inclusive", but I think that is a red herring. As I mentioned above that doesn't imply anything about whether it is inclusive of the entire second starting at that timestamp or just of the moment of the timestamp itself. This small difference in interpretation accounts for the entire 1 second difference being discussed here.
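For illustration, a hypothetical sketch (the dates are ours) of how the inclusive reading turns a 2,160-hour offset into a 90-day-plus-1-second lifetime, and the one-second adjustment that avoids it:

```python
# Off-by-one-second sketch: adding exactly 2,160 hours to notBefore yields a
# certificate that, under RFC 5280's inclusive reading of both endpoints, is
# valid for 90 days plus 1 second.
from datetime import datetime, timedelta

not_before = datetime(2021, 6, 8, 0, 0, 0)

# As issued: notAfter = notBefore + 2160h.
not_after = not_before + timedelta(hours=2160)

# Inclusive lifetime in seconds: every second from notBefore through
# notAfter, i.e. the difference plus one.
inclusive_seconds = int((not_after - not_before).total_seconds()) + 1
assert inclusive_seconds == 90 * 24 * 3600 + 1  # 90 days + 1 second

# A fix that keeps the inclusive lifetime at exactly 90 days:
fixed_not_after = not_before + timedelta(hours=2160) - timedelta(seconds=1)
assert int((fixed_not_after - not_before).total_seconds()) + 1 == 90 * 24 * 3600
```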

(In reply to tjtncks from comment #22)

if this is a real problem,

It is.

By the way, is the length of a year defined anywhere? It looks like the BRs don't define it.
The Root OCSP Signing Certificate ("5 years") may have the same problem, but I can't find the definition of a year in the BRs.

While not all places have been updated, sections like Section 6.3.2 and Section 4.9.10 have tried to address this. Yes, you're correct that past CA incidents have examined "what is a year", "what is a month", and "what is a day", all with surprising (and often disappointing) results.

I'm definitely not wanting to shut down discussion here, because I think it's fantastic that we have more people involved in this bug. I would encourage participation in the dev-sec-policy thread if folks would like to ask general questions. Historically, we've used these bugs to track specific communications with the CA in a public and transparent manner, as well as giving members of the public the opportunity to ask questions of the CA about their responses that are relevant or germane to the incident.

The meta-discussion happening here, while certainly interesting, has the downside that it makes it significantly more complex to maintain and ensure that Let's Encrypt is meeting all of the requirements of https://wiki.mozilla.org/CA/Responding_To_An_Incident , and that all relevant facts are available when reviewing such incidents in the future (e.g. in the unfortunate event of discussion trends of problematic practices or behaviours). For example, it was overlooked here that Let's Encrypt failed to appropriately answer Question 5, which expects the complete certificate details to be provided. Yes, this means enumerating them all, which is done in part because multiple CAs in the past have failed to disclose/discover relevant certificates, and then claimed after-the-fact that they were in scope of the incident, despite no such notice or discovery (which is its own issue).

To that end, it would be more useful if folks with general questions about the BRs, or proposals to change policies or expectations, engaged on dev-sec-policy, and we kept this incident focused on the incident response itself. Ultimately, the setting of expectations is done by the root programs that participate here (Mozilla and Chrome primarily, Apple occasionally), and while public feedback is certainly valuable and considered, it's most useful when supported with relevant technical details.

(In reply to Ryan Sleevi from comment #24)

(In reply to tjtncks from comment #22)

if this is a real problem,

It is.

Sorry if this drifts into the meta discussion, but what I mean is that since this CRL profile has a fixed length (30 days), if CRLs work the same way (RFC 5280 covers both), CRLs would have the same problem as certificates (one second too long).

(In reply to tjtncks from comment #25)

Sorry if this drifts into the meta discussion, but what I mean is that since this CRL profile has a fixed length (30 days), if CRLs work the same way (RFC 5280 covers both), CRLs would have the same problem as certificates (one second too long).

And this isn't fixed in CPS v3.3.

Mozilla Root Store Policy explicitly requires CAs to follow discussions in dev-security-policy, and this specific class of error was raised on the list on 2020-10-28.
Can you share more about your process for reviewing m.d.s.p. and how this thread was missed?

We did not miss the thread of 2020-10-28 ("EJBCA performs incorrect calculation of validities"). At 2020-10-28 17:57:36 UTC, one of our engineers sent to all staff:

There's a post on mdsp today about EJBCA having an off-by-one-second bug in calculating validity intervals. A good reminder that we should always aim to be well within requirements about timing, rather than right up on the edge.

In other words, we evaluated the issue, and concluded that our existing practices more than adequately addressed it. Our validity intervals are shorter than the BR-set limits by approximately 308 days, which more than covers this sort of off-by-one-second bug, and similar issues like leap seconds.
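As a hypothetical worked check of that margin (the 398-day figure is the then-current BR maximum validity, an assumption of ours rather than a number stated in the comment):

```python
# Safety-margin reasoning: the BR limit at the time was 398 days, while
# Let's Encrypt certificates were valid for roughly 90 days, leaving a
# margin of about 308 days -- matching the figure quoted in the comment.
br_limit_days = 398      # assumed then-current BR maximum validity
le_lifetime_days = 90    # Let's Encrypt's chosen validity
margin_days = br_limit_days - le_lifetime_days
assert margin_days == 308

# The margin dwarfs a one-second off-by-one or a leap second:
margin_seconds = margin_days * 24 * 3600
assert margin_seconds > 1
```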

One could argue that even though we believed our configuration of Boulder (our in-house CA software) adequately mitigated this category of issue, we should have checked the implementation for this off-by-one-second bug in case someone runs an instance of Boulder with a different configuration that goes right up to the BR limits. However, we maintain Boulder primarily for our own consumption, informed by our policy decisions. Since it was inconceivable that we would configure Boulder with a validity period within one second of the BR limits, we did not start an investigation into whether Boulder had this off-by-one-second bug.

Another possible response to the EJBCA thread would have been to re-review our CPS to ensure we were not setting requirements for ourselves that reintroduced the sort of "right up against the limit" compliance issue we seek to avoid with respect to BR requirements. We did not do so at that time. Even with the benefit of hindsight, we do not believe that would have been a reasonable response to the thread.

This is consistent with the discussion on the original EJBCA thread. For instance, Mike Kushner posted:

all PrimeKey customers were alerted a week ago and recommended to review their certificate profiles and responder settings to be within thresholds.

Our settings were well within thresholds already, as the result of a series of conscious design choices since launch. Those choices had been informed by even earlier discussions on mdsp about CAs that missed deprecation targets or maximum validities by a second or a day.

Bug 1708965, reported a month ago
You can use component subscriptions to help ensure this, as the majority of CAs have done (judging by the CC lists).

By contrast with the EJBCA thread, if we had read the KIR S.A. incident report in a timely manner, a correct response would have included a review of our CPS for this sort of "right up against the limit" condition. That's because the KIR S.A. incident specifically discusses a disagreement between drafting of the CPS and implementation / configuration, which is not a problem we would expect to have mitigated by our conservative approach to setting validity intervals.

Several of our engineers are subscribed to bugzilla for the relevant components. As noted, we don’t have a formalized process for reviewing all bugs in the ca-compliance category, so one root cause is the absence of such a process. We plan to implement one as a remediation item.

I think it would benefit Let's Encrypt to re-do the root cause analysis.

Looking deeper into root causes, the main issue seems to be: we preemptively mitigated a large class of boundary condition bugs by setting our validity intervals well within BR requirements. And then we broke that mitigation by setting additional requirements in our CPS that put our actual implementation right up against our CPS limits. How did we introduce that error? How did we miss it in multiple CPS self-reviews?

Looking back through our CPS history at https://letsencrypt.org/repository/, our 1.x series of CPS said that end-entity certificates could have a lifetime of "up to 12 months." In 2017 we extensively revised our CP and CPS to make them more accessible (HTML instead of PDF), more maintainable (versioned history on GitHub), and clearer to read.

On 2017-02-09, during the drafting period, we introduced the "less than 100 days" language in section 6.3.2:
https://github.com/letsencrypt/cp-cps/commit/063e56225885ed216e9ba150bef40d07334d6ffa. This was based on a desire to make our CPS more specific, combined with our understanding of boundary condition problems that had already been documented on mdsp as of 2017.

On 2017-03-08, still in the drafting period, we introduced the "90 days" language in section 7.1: https://github.com/letsencrypt/cp-cps/pull/6. This was a mistake, but it went uncommented at the time.

Our PMA met on 2017-04-13 11:35 AM US Central time to review the new CPS. According to our records, the PMA did not remark on differing language between section 6.3.2 and 7.1.

On 2017-04-13 4:16 PM US Central time, we published the new CPS v2.0: https://github.com/letsencrypt/website/commit/aa5c552f63b175fceb79e0c199ce06ef3b92d9c8 / https://letsencrypt.org/documents/isrg-cp-v2.0/. At the time of publication we were out of compliance because of the strict "90 days" language we introduced in 7.1.

The PMA review (and our subsequent quarterly PMA re-reviews) are when we really should have caught the discrepancy between the two requirements. We will re-review our CP and CPS with regards to relationships between different sections, to see if there are any similar inconsistencies. We will also update our guidelines for reviewing CP and CPS to include this type of review in the future.

Can you share more about the nature and structure of your triage team, and whether you've reviewed past CA incidents to examine how such teams may succeed or fail depending on how the CA structures them?

All of our CA staff subscribes to and reads mdsp, with the exception of one team member who recently switched from a non-CA role (and who subscribes as of now). Instructions to subscribe are part of our new staff onboarding. A smaller set of staff (six people) subscribes to bugzilla's CA Certificate Compliance category, but does not necessarily read all incidents in detail.

As an organization, we have reviewed and learned from many of the bugzilla CA incidents including ones that started before Let’s Encrypt was publicly available. As one publicly-documented example, in https://bugzilla.mozilla.org/show_bug.cgi?id=1577652 we revised our OCSP behavior for precertificates and filed an incident based on reading about Apple's related incident. We have a new-engineer reading syllabus that includes several incident reports. We also regularly review Mozilla’s ‘Responding to an Incident’ document which includes exemplary reports to reference.

The missing pieces are (a) more attention to incident reports on bugzilla (we focus most of our attention on mdsp), and (b) a process to ensure that each item is read.

We have reviewed discussions of triage teams from past incidents, and those reviews will inform the structuring of our updated processes.

Suggestion: The best thing to do would be to develop a process for reviewing the extant (open and closed) CA incidents

Our plan is to review all bugzilla issues in the "CA Certificate Compliance" category opened between 2019-06-01 and today (439 issues, 5514 comments). We will also establish a formal review rotation where:

  • Each week a staff member will be assigned to review new bugzilla issues and mdsp threads since the last review.
  • Each month a staff member will be assigned to review updated bugzilla issues and mdsp threads since the last review.

The output of that review will be (a) a summary of issues that are relevant to Let's Encrypt, to be reviewed at the next PMA quarterly meeting, and (b) an immediate page to oncall engineers if there are items that require immediate attention. This will be in addition to our current practice of staff reading mdsp as part of the regular workflow and responding to items that require immediate attention.

The reasoning behind separate review cycles for new vs updated issues: Items that require prompt response are most likely to manifest as new threads or new issues, while comprehensive understanding of a long thread or issue is best achieved by reading the whole thread or whole issue after all participants have had time to contribute.

We intend to have the mdsp/bugzilla review rotation in place by 2021-06-15. We intend to complete the retrospective review by 2021-11-12.

We intend to have our re-review of CP and CPS and updated review guidelines done by 2021-06-21.

Question 5, which expects the complete certificate details to be provided.

The list of affected certificate serial numbers that were unexpired as of the time we declared an incident is available at this URL: https://le-https-stats.s3.amazonaws.com/one-second-incident-affected-serials.txt.gz

Interested parties can obtain the full certificate bodies either from Certificate Transparency logs, or via an HTTP GET request to the ACME endpoint https://acme-v02.api.letsencrypt.org/acme/cert/<serial in hex>, e.g. https://acme-v02.api.letsencrypt.org/acme/cert/030000033ee153519e6734086f560282082f
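As a small illustrative helper (the base URL is quoted from this comment; the function name is ours):

```python
# Build the ACME certificate-retrieval URL for a given serial number in hex.
# The base URL is taken from the comment above; `cert_url` is a hypothetical
# helper name, not part of any Let's Encrypt API.
ACME_CERT_BASE = "https://acme-v02.api.letsencrypt.org/acme/cert/"

def cert_url(serial_hex: str) -> str:
    """Return the URL from which the full certificate body can be fetched."""
    return ACME_CERT_BASE + serial_hex.lower()

print(cert_url("030000033ee153519e6734086f560282082f"))
```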

We intend to have the mdsp/bugzilla review rotation in place by 2021-06-15.

This rotation is now in place.

Type: defect → task

(In reply to Jacob Hoffman-Andrews from comment #27)

We did not do so at that time. Even with the benefit of hindsight, we do not believe that would have been a reasonable response to the thread.

Could you explain a bit more the thinking here? It's unclear if you're speaking specifically to this thread, or more generally with respect to incident disclosures and investigations.

More generally: What are the things that would cause LE to review their CP/CPS? Given the many CP/CPS issues that have been raised for CAs recently, presumably, there are some incidents that would, and I think it's useful to better understand where/how LE defines the line.

I think it would benefit Let's Encrypt to re-do the root cause analysis.

I just want to commend LE for actually delivering on this request with the level of detail that demonstrates understanding and helps assuage the concerns.

We have a new-engineer reading syllabus that includes several incident reports.

Could you share some more details about this? This seems to be within the class of helping other CAs learn from and prevent similar issues.

Given Let's Encrypt's use of GitHub and transparency in activities, has there been any thought to maintaining this in a public way, e.g. in a public Git instance (such as GitHub), similar to the CP/CPS? I don't think that this decision is a blocker for this incident, but it seems that there's an opportunity here in the spirit (but not requirement) of "Therefore, the incident report should share lessons learned that could be helpful to all CAs to build better systems." (from https://wiki.mozilla.org/CA/Responding_To_An_Incident)

We intend to have the mdsp/bugzilla review rotation in place by 2021-06-15. We intend to complete the retrospective review by 2021-11-12.

We intend to have our re-review of CP and CPS and updated review guidelines done by 2021-06-21.

Mostly since I missed these in the first read through, just reformatting this for the next reader :)

  • 2021-06-15: m.d.s.p. and Bugzilla review rotation established (Complete; Comment #28)
  • 2021-06-21: Review of CP/CPS to ensure internal consistency between sections
  • 2021-11-12: Review of historic CA compliance incidents in Bugzilla for other areas of interpretation, clarification, or expectation
Flags: needinfo?(jaas)

(In reply to Ryan Sleevi from comment #29)

Could you explain a bit more the thinking here? It's unclear if you're speaking specifically to this thread, or more generally with respect to incident disclosures and investigations.

Yes, in this case we’re referring specifically to that EJBCA thread. While it was about computing validity periods, our practices ensure that we are always well below the maximum validity periods set by root programs, so a difference of one second was not concerning to us. The contents of that thread did not concern non-compliance with one’s own CPS, so it did not prompt us to review our own CPS.

What are the things that would cause LE to review their CP/CPS?

Historically, primarily changes to root program requirements and to the BRs, as well as changes in our own practices. Now that we have our new rotation in place, compliance tickets and MDSP threads that discuss the particulars of other CAs’ CP/CPS will prompt us to re-examine our own as well. The similar KIR ticket (Bug 1708965) is a good example of this.

Could you share some more details about this? This seems to be within the class of helping other CAs learn from and prevent similar issues.

Of course. Much of our new engineer syllabus is already reflected in our public GitHub docs. I have just expanded that existing list with additional content (primarily valuable and well-done incident reports) from our new-engineer syllabus.
