Closed Bug 1743943 Opened 3 years ago Closed 1 year ago

Amazon Trust Services: Delayed Revocation of Subordinate CA

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: trevolip, Assigned: trevolip)

Details

(Whiteboard: [ca-compliance] [ca-revocation-delay])

Steps to reproduce:

1. How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in the MDSP mailing list (https://groups.google.com/a/mozilla.org/g/dev-security-policy), a Bugzilla bug, or internal self-audit), and the time and date.

As reported in https://bugzilla.mozilla.org/show_bug.cgi?id=1743935, Amazon Trust Services received a report that one of our certificates may violate our CPS. Amazon Trust Services reviewed the report and determined that revocation would be required. After assessing the impact, we determined that revoking this subordinate within the seven-day time frame would have a negative impact: revocation would cause all of the active certificates issued from https://crt.sh/?id=10739079 (greater than 24 million as of Dec 1, 2021) to be implicitly revoked. We plan to transfer all issuance to another intermediate certificate, https://crt.sh/?id=11334874, and investigate options for retiring https://crt.sh/?id=10739079. In parallel, we need to resolve several open questions before we would feel comfortable revoking https://crt.sh/?id=10739079.

DigiCert operates this subordinate on our behalf. However, this issue is within the scope of the Amazon Trust Services CPS and our area of responsibility, since Amazon Trust Services issued this certificate.

2. A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.

Previously reported timeline from: https://bugzilla.mozilla.org/show_bug.cgi?id=1719920
Oct 21, 2015 - Amazon Trust Services issues intermediates that will be used to create the test certificates for its repository.
Nov 27-29, 2015 - Amazon Trust Services corresponds with Google regarding a question related to path building libraries.
Nov 30, 2015 - Amazon Trust Services determines that the dates on the intermediates created on Oct 21, 2015 may create undesirable behavior in certain browsers and decides to correct the dates associated with the key pairs to eliminate the identified path building issues. At this time it is also determined that this doesn’t meet the mis-issuance criteria and that revocation is not necessary.
Dec 3, 2015 - Amazon Trust Services corrects the dates associated with the previously generated key pair. The old certificates are deleted as previously described in https://bugzilla.mozilla.org/show_bug.cgi?id=1713668#c7.

We only deleted the certificates in our CA system associated with the key pairs that we controlled. We provided DigiCert a new certificate at this time with a 2025 expiration. We didn’t follow up with DigiCert and ask them to destroy https://crt.sh/?id=10739079. We included that certificate in our repository. Finally, we failed to revoke the certificate.

New events:
Nov 25, 2021 - 5:46am PST - Amazon Trust Services received the report.
Nov 25, 2021 - 6:59am PST - Amazon Trust Services initiates an investigation of the report.
Nov 25, 2021 - 10:36am PST - Amazon Trust Services determines that the certificate was issued in violation of the CPS and that revocation is required per 4.9.1.2.
Nov 29, 2021 - Amazon Trust Services sets the target date for revocation of this certificate for Dec 1, 2021.
Nov 29, 2021 - Amazon Trust Services identifies several places where the certificate is used. Examples: 1) Firefox displays https://crt.sh/?id=10739079 as part of the chain, even though the server hosting the leaf certificate is sending the chain from https://crt.sh/?id=11334874. 2) An active GitHub repo (https://github.com/spulec/moto) where https://crt.sh/?id=10739079 is hardcoded.
Nov 29, 2021 - Amazon Trust Services initiates a conversation with DigiCert regarding potential impact.
Nov 29, 2021 - 3:39pm PST - Based on discussions and investigations Amazon Trust Services decides that revoking on Dec 1, 2021 will have unintended negative consequences and delays plans for revocation in order to investigate issues more deeply.
Nov 30, 2021 - 8:10-8:14am PST - Amazon Trust Services notifies Microsoft, Apple, Mozilla, Chrome, and Cisco of our intent to delay this revocation and potentially seek other options depending on the outcome of our investigations.
Nov 30, 2021 - 12:30pm PST - Amazon Trust Services determines that revoking the certificate will cause all end entity certificates issued from the subordinate to be implicitly revoked.

3. Whether your CA has stopped, or has not yet stopped, certificate issuance or the process giving rise to the problem or incident. A statement that you have stopped will be considered a pledge to the community; a statement that you have not stopped requires an explanation.

The impacted certificate has not yet been revoked. We are developing a plan to safely switch issuance to the 2025 intermediate CA certificate and retire or revoke the 2040 intermediate CA certificate.

4. In a case involving certificates, a summary of the problematic certificates. For each problem: the number of certificates, and the date the first and last certificates with that problem were issued. In other incidents that do not involve enumerating the affected certificates (e.g. OCSP failures, audit findings, delayed responses, etc.), please provide other similar statistics, aggregates, and a summary for each type of problem identified. This will help us measure the severity of each problem.

https://crt.sh/?id=10739079

5. In a case involving TLS server certificates, the complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem. When the incident being reported involves an SMIME certificate, if disclosure of personally identifiable information in the certificate may be contrary to applicable law, please provide at least the certificate serial number and SHA256 hash of the certificate. In other cases not involving a review of affected certificates, please provide other similar, relevant specifics, if any.

https://crt.sh/?id=10739079

6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

We identified that this subordinate needed to be revoked within seven days of receiving the report, and we determined that revoking it on such short notice could have severe consequences.

7. List of steps your CA is taking to resolve the situation and ensure that such situation or incident will not be repeated in the future, accompanied with a binding timeline of when your CA expects to accomplish each of these remediation steps.

We are doing three things in parallel. A) Switching to a different issuer, B) investigating options for retiring the certificate, and C) resolving open questions.

A) Switch issuance over to https://crt.sh/?id=11334874.

B) We’ve initiated conversations with Microsoft, Apple, Mozilla, Chrome, and Cisco on options for retiring instead of revoking.

C) We’ve identified the following list of open items that need to be resolved before we would feel comfortable revoking:

  1. Firefox appears to be pre-caching the 2040 intermediate CA certificate. What actions does Mozilla need us to take so that the 2025 certificate is presented? If the 2040 certificate is revoked, will that cause any disruptions for Firefox users accessing websites?
  2. Other browsers - our limited testing so far indicates that Safari, Edge, and Chrome are displaying the 2025 intermediate CA certificate, but we don’t know for sure if this behavior is consistent across all versions.
  3. OpenSSL - we need to understand what the OpenSSL behavior will be if we revoke the 2040 certificate.
  4. Java - we need to understand what the Java behavior will be if we revoke the 2040 certificate.
  5. Outreach to the owner of the GitHub repo https://github.com/spulec/moto where this is hardcoded. We plan to help these users remove the hardcoded 2040 certificate in favor of the 2025 certificate.
  6. Complete our impact review of Amazon services that have the 2040 certificate hardcoded or had issues during the Let’s Encrypt root expiration.
  7. Completion of action A.

We will update this ticket with a timeline.

(In reply to Trevoli (Amazon Trust Services) from comment #0)

  1. Firefox appears to be pre-caching the 2040 intermediate CA certificate. What actions does Mozilla need us to take so that the 2025 certificate is presented? If the 2040 certificate is revoked, will that cause any disruptions for Firefox users accessing websites?

Most likely, this is due to intermediate preloading, and picking up this intermediate from CCADB.

Attempting with a fresh clean install may help confirm (since there's some delay in propagating intermediates - one sync, as https://bugzil.la/1667930 captures). You should also be able to locally simulate revocation by adding an explicit distrust record for this certificate, as IIRC, this will cause the alternative paths to be explored.

Without expressing an opinion on whether it's compliant with CCADB policy (i.e. follow up with Kathleen), my understanding of the intermediate preloading logic is that this data is sourced from https://ccadb-public.secure.force.com/mozilla/MozillaIntermediateCertsCSVReport , which according to https://wiki.mozilla.org/CA/Intermediate_Certificates is the "Non-revoked, non-expired intermediate CA certificates ...". As such, marking the certificate as revoked in CCADB (without the CRL itself published) should, AIUI, on the next Kinto/CRLite update, update the preloads. That is, of course, assuming intermediate preloads expire/are revoked (there were issues with CRLite expiration/removals, and they may also exist for intermediate preloading if the issue wasn't resolved by then).

I can totally understand not wanting to "test in production", even with CCADB, so https://wiki.mozilla.org/Security/CryptoEngineering/Intermediate_Preloading gives a hint at some of the knobs and whistles you can manipulate to simulate that further. Beyond the security.remote_settings.intermediates.downloads_per_poll flag, you can also manipulate the RKV database itself used for the cert_storage service to remove that intermediate :) The path should be $PROFILE_DIR/security_state/cert_storage.
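
As a rough sketch of that local simulation (all of this is assumption-laden: a disposable test profile, Python used purely as a convenience for the file manipulation, and note that deleting the database only resets the preloaded-intermediate state, it does not itself simulate revocation):

  import pathlib
  import shutil

  # Assumption: a throwaway Firefox test profile; adjust the path for your system.
  # The security.remote_settings.intermediates.downloads_per_poll pref mentioned
  # above can additionally be pinned via user.js in this profile to control how
  # quickly intermediates get re-downloaded.
  profile = pathlib.Path.home() / ".mozilla" / "firefox" / "test-profile"

  # Deleting the RKV-backed cert_storage database makes Firefox forget previously
  # preloaded intermediates (including the 2040 certificate); the directory is
  # recreated on the next startup.
  cert_storage = profile / "security_state" / "cert_storage"
  if cert_storage.exists():
      shutil.rmtree(cert_storage)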

  2. Other browsers - our limited testing so far indicates that Safari, Edge, and Chrome are displaying the 2025 intermediate CA certificate, but we don’t know for sure if this behavior is consistent across all versions.

There are a lot of variables at play; however, what's most relevant is that Safari, Edge, and Chrome all support path building on their platforms that support revocation, so it should "mostly" be fine.

The "mostly" here is that there are some AIA caching issues on Safari (if this intermediate was served via AIA, which AIUI, was not), and if CAPI has locally cached the 2040 intermediate (such as a previous verification) and CryptNet doesn't have the ability to contact the internet (implying it won't get revocation details anyways), then the AIA fetch for the 2025 version could fail.

  3. OpenSSL - we need to understand what the OpenSSL behavior will be if we revoke the 2040 certificate.

The devil is in the details here, because OpenSSL performs revocation checking after its path verification, and doesn't support path building. https://medium.com/@sleevi_/path-building-vs-path-verifying-implementation-showdown-39a9272b2820 looks at a few other implementations that may or may not be relevant, and how they process CRLs (generally, if explicitly presented, since they don't typically fetch CRLs).

Concretely:

  • If the client doesn't fetch CRLs, obviously, no impact
  • If the client does fetch CRLs (few do), then the question about which intermediate will be used may depend on which version of OpenSSL you're using, whether the client has locally cached this intermediate into its trust store (... sadly, several products do this), and whether the server is supplying the full chain.

By default, OpenSSL will use the server-supplied chain and begin constructing from that point, although flags (like trusted first or alt-chains) can affect this somewhat. If the intermediate is missing, and it's looking into the X509_STORE[_CTX], then it'll be based on the hash of the subject name and FIFO (in the event of multiple certs).
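
To make that concrete, here is a minimal sketch using Python's ssl module (which wraps OpenSSL); the file names and host are hypothetical, and the point is only that CRL checking happens when the client explicitly loads CRLs, and that flags like trusted-first decide which copy of an intermediate gets used:

  import socket
  import ssl

  # Hypothetical file names. OpenSSL-based clients don't fetch CRLs on their own,
  # so the CRL has to be loaded explicitly alongside the trusted roots.
  ctx = ssl.create_default_context(cafile="trusted-roots.pem")
  ctx.load_verify_locations(cafile="intermediate-crls.pem")

  # VERIFY_CRL_CHECK_CHAIN requires a CRL for every CA in the chain and fails
  # verification if one is missing; VERIFY_X509_TRUSTED_FIRST prefers a locally
  # trusted copy of an intermediate over the server-supplied one when building
  # the chain (the "trusted first" behavior discussed above).
  ctx.verify_flags |= ssl.VERIFY_CRL_CHECK_CHAIN | ssl.VERIFY_X509_TRUSTED_FIRST

  with socket.create_connection(("tls.example.com", 443)) as sock:
      with ctx.wrap_socket(sock, server_hostname="tls.example.com") as tls:
          print(tls.getpeercert()["subject"])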

  4. Java - we need to understand what the Java behavior will be if we revoke the 2040 certificate.

"It depends". Java has always supported path building, and performs revocation checking as part of that path building, This is the java.security.cert.CertPathBuilder (and specifically, PKIXBuilderParameters). However, if an implementation bypasses the CertPathBuilder and instead directly construct a CertPathBuilder (which can optionally add in revocation checking), you are more likely to encounter issues.

While this may sound like "mostly seems fine", the logical path to avoid any issues entirely is to transition issuance to a new intermediate (i.e. new subject name), replace the existing certs, and then revoke both the 2040 and 2025 intermediate. That is generally the 'reliable' path for issues like this, which is topically relevant because of the Sectigo discussion in Bug 1741777 which touches on similar issues.

Assignee: bwilson → trevolip
Type: defect → task
Whiteboard: [ca-compliance] [delayed-revocation-ca]

[Posted on behalf of Google Chrome]

Hi Trevoli,

Thank you for sharing this incident summary.

Both the issuance of and prolonged failure to revoke the 2040 certificate violate Amazon’s policy and the BRs. We expect this incident to be included in Amazon’s annual audit report.

We agree with Amazon’s plan of transitioning leaf certificate issuance to a different issuing CA. However, is the following a typo?:

We are doing three things in parallel. A) Switching to a different issuer, B) investigating options for retiring the certificate, and C) resolving open questions.
A) Switch issuance over to https://crt.sh/?id=11334874.

The linked certificate shares the same key pair as the 2040 certificate, so this doesn’t result in any change in terms of possible path building (for as long as the 2040 certificate is not revoked).

We will continue to monitor this bug to follow Amazon’s proposed and actual remediation efforts.

Please let me know if you have any questions.

Thanks,
Ryan

Sleevi, thank you for the feedback. This is very helpful.

Dickson, to address your question, that was not a typo. We will retain the existing keypair and switch the CA system to issue from the valid 2025 intermediate CA certificate (https://crt.sh/?id=11334874). This will make the issuing certificate consistent with the chain returned by TLS servers. Based on Sleevi’s feedback we are also evaluating other paths to ensure zero customer impact.

This incident will be included in our audit report.

For actions 1 and 2 we have discussed potential impact with Microsoft, Apple, Chrome, and Mozilla.

For action 5 we have reached out to a contributor of https://github.com/spulec/moto.

For action 6 we have completed the first part. The second part is related to actions 3 and 4. Based on Sleevi’s comments we are evaluating this further internally.

For action 7 we will have a date in the next update.

[Posted on behalf of Google Chrome]

Hi Trevoli,

Thank you for the clarification.

C) We’ve identified the following list of open items that need to be resolved before we would feel comfortable revoking:

We'll continue following this bug in anticipation of Amazon sharing a timeline for the prerequisite remediation activities described in your original post and the eventual revocation of the 2040 certificate.

Thanks,
Ryan

We plan to complete action 7 in January 2022. We’ll provide a more exact timeline for action 7 and the other actions on January 7, 2022.

Whiteboard: [ca-compliance] [delayed-revocation-ca] → [ca-compliance] [delayed-revocation-ca] Next update 2022-01-07

Amazon Trust Services is monitoring this bug for feedback. We still plan to provide a timeline on January 7, 2022.

Amazon Trust Services is monitoring this bug for feedback. We still plan to provide a timeline on January 7, 2022.

Amazon Trust Services is monitoring this bug for feedback. We still plan to provide a timeline on January 7, 2022.

Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true

We will complete action items 3, 4, and 6 by March 4, 2022 and action 7 by January 31, 2022.

Whiteboard: [ca-compliance] [delayed-revocation-ca] Next update 2022-01-07 → [ca-compliance] [delayed-revocation-ca] Next update 2022-01-31

Amazon Trust Services is monitoring this bug for feedback. We will complete action items 3, 4, and 6 by March 4, 2022 and action 7 by January 31, 2022.

Amazon Trust Services is monitoring this bug for feedback. We will complete action items 3, 4, and 6 by March 4, 2022 and action 7 by January 31, 2022.

Amazon Trust Services is monitoring this bug for feedback. We will complete action items 3, 4, and 6 by March 4, 2022 and action 7 by January 31, 2022.

Should I re-set the "next update" for this?

Flags: needinfo?(trevolip)

We have completed action 7 and will complete action items 3, 4, and 6 by March 4, 2022. Ben, you can reset the next update to that.

Flags: needinfo?(trevolip)
Whiteboard: [ca-compliance] [delayed-revocation-ca] Next update 2022-01-31 → [ca-compliance] [delayed-revocation-ca] Next update 2022-03-04

Amazon Trust Services is monitoring this bug for feedback. We will complete action items 3, 4, and 6 by March 4, 2022.

Summary: Amazon Trust Services - Delayed Revocation of Subordinate CA → Amazon Trust Services: Delayed Revocation of Subordinate CA

Amazon Trust Services is monitoring this bug for feedback. We will complete action items 3, 4, and 6 by March 4, 2022.

Since Comment #9, Amazon has not provided any substance of what it's doing, why it requires until March 4, what progress has been (or hasn't been) made, or any sort of understanding about why it takes three months (from Comment #0).

Comment #1 tried to highlight that the plan proposed was unwise, with risk of compatibility issues. Comment #3 suggested that, based on the feedback, you were examining other strategies to mitigate the risk, such as not reusing key + DNs, which would obviate the issue.

Could you share something of substance as to what's going on? Are you still planning to do the dangerous thing? Why not just keep it simple?

Flags: needinfo?(trevolip)

Thanks Ryan, your first comment on this bug was helpful. We agreed with the risks you highlighted related to action items 3 and 4 and have been working on revised action items that account for those risks. Based on those internal discussions and this new information, we are finalizing a plan and will provide it by March 4, 2022.

We have completed action item 6.

Flags: needinfo?(trevolip)

Based on the feedback we received from discussions with the browsers, on the Mozilla bug, and our internal review, we’ve determined that revoking the 2040 intermediate while there are still valid end entity certificates from this key pair will impact customers and relying parties. As such, we plan to rotate all customers to a new intermediate key pair and then revoke the 2040 intermediate after the 2025 intermediate certificate expires. Because the 2025 intermediate is expiring so soon, we are going to align this action with the expiration of that intermediate to cause the least amount of disruption. This will allow us to message customers well in advance of this change and safely manage the switch over.

We will begin migrating customers in 2023. Issuance from the old key pair will stop by Sept 17, 2024. We will revoke the 2040 intermediate after the 2025 intermediate expires on Oct 28, 2025.

Trev,

I don’t want this to seem personally targeting, either individually or as if I have a bone to pick with Amazon. However, this incident report seems vastly inconsistent with the level of transparency and detail, both of what is expected on incident reports and in terms of what Amazon has demonstrated it is capable of for other industries.

I’m going to point out a few examples in Comment #19, because I am hoping this is just a temporary oversight, and not representative of a pattern for Amazon Trust Services, despite the growing evidence that it unfortunately is.

You refer to your internal review as being part of the conclusion, but share zero technical details here. This review has been ongoing for three months, and it is not unreasonable to expect a report with details commensurate with the policy team working full time on this issue for months. If this wasn’t a full time affair, which is at least the baseline expectation when delaying, then I think the past updates were misleading the community about this, which is why I flagged Comment #17.

It also appears that these details were and are a contributing factor to Amazon intentionally and knowingly continuing to violate stated policy. If the risk is so great, then it benefits both CAs and users to articulate precisely what you determined, to be aware of and consider the risk. Details, not conclusions, help inform CAs and help the community validate that the conclusion is supported by accurate and actionable data. It would mostly seem like you discovered that Comment #1’s recommendation three months ago was a good recommendation, but that it took you three months to confirm that.

The timing for the revocation itself is concerning. It sounds like Amazon Trust Services is not capable of, nor prepared for, complying with the Baseline Requirements when it comes to misissued certificates. The plan, as best I can tell, is to let everything issued naturally expire, rather than automatically rotate and replace the certificates. If Amazon looks through the countless CA incidents of the past decade, it would be clear that when CA certificates are misissued, they get revoked, and those revocations can invalidate the extant certificates. This is why CAs have been encouraged to automate their issuance, combined with shorter lifetimes, and to continue to invest in ways to further respond (e.g. ARI). Based on the proposed action, and with the complete dearth of meaningful detail, it would seem that Amazon has completely failed to prepare to meet industry baselines or be aware of other CAs’ incidents and lessons. My hope is that this is not the case, but this is where and why being detailed and clear in the reports helps avoid that appearance.

While I am sympathetic to the argument that “we didn’t misissue the leaves, and we didn’t ‘intend’ for this intermediate to be the one for issuance,” the basic reality that all CAs are expected to be prepared for is change, so that whether the next Heartbleed or the next DigiNotar, replacing certs is so boring and routine that it doesn’t even merit angst. I would hope Amazon would at least hold itself internally to the standard set by CAs like GlobalSign, DigiCert, Entrust, and others, in terms of recognizing that when intermediates need to get replaced, they are done so in a timely fashion. If others can and have done it, why can’t and doesn’t ATS, and what does that say about the state of operations and implementation?

As to the actual timelines, in the absence of data, they seem to demonstrate either no awareness of the requirements or no respect for the community of users that rely on CAs. Amazon is stating that it takes it two and a half years to stop issuing, when the expectation is that CAs should be prepared closer to the order of 2.5 minutes to 2.5 days. If there is some data that justifies this, then it’s essential to share, which Amazon has not.

In the event this feels like a “bring me a rock” by listing everything wrong with this report, a more constructive question/goal:

What would it take to have this certificate revoked within the next 30 days, which is still 18x longer than the time permitted by the BRs? This would mean transitioning new issuance to a new intermediate, as well as replacing certificates to minimize path building issues.

Ryan, thanks for your feedback. What follows is additional information in further support of our timeline, including what led to the decision, actions we’ve taken, alternatives we considered, and what our next steps are.

When this was initially reported in December, our assumption was that clients wouldn’t be impacted if we revoked the 2040 intermediate. We’ve never returned it as part of the chain and anecdotally, other than Firefox, we haven’t observed other clients return the 2040 intermediate. We directed an engineer to work on the detailed plan for how we would revoke the 2040 cert and replace all leaf certificates based on our existing runbook for manual certificate rotation. We determined it would take several weeks to issue a new CA with DigiCert, and rotate all certificates. Most certificates would be rotate-able with a smaller number of certificates taking a longer time due to requiring customer interaction to re-validate.

In parallel to that, we also started our investigation regarding the impact of revoking this intermediate. We had previously identified several types of customers that would be impacted by this type of rapid rotation:

  1. Customers that use certificate status information in their workflows and have hard-coded CRL and OCSP endpoints.
  2. IoT customers that use the intermediate in place of the root.
  3. Other types of customers that pin to intermediates.
  4. Customers that make use of email validation and have an expired validation.
  5. Customers that make use of DNS and have removed their CNAME record.

During our discussions with the browsers, and initial internal discussions, we identified a new category of customers that would be impacted:

  1. Customers and relying parties that make use of older clients or less popular/custom mechanisms for looking up certificate status information.

The analysis of specialists in client-side crypto behavior was that it would be unsafe to revoke the 2040 version of our key pair while we still had active certificates for the 2025 key pair in use. The number of customers that would be impacted was determined to be small but also a group that we wouldn’t be able to identify, and hence warn, in advance.

In addition to our investigations related to how crypto clients handle a situation like this, we also did a deep dive with two AWS services that have been moving to using leaf certs that chain to an ATS root instead of the DigiCert root. While this is a root migration, in talking the issues through with them we learned that a significant portion of the issues encountered were due to reliance on the intermediate and not the root.

Based on these discussions, as well as learnings from the community discussions on bugs, and articles written by other CAs on this subject, we determined that changes needed to be made to the system used by AWS customers to obtain certificates from ATS. ATS has made several recommendations to the AWS Certificate Manager (ACM) product team to change how certificates are vended so that customers are better positioned to be ready for rapid rotations. This will continue work we started last year when we worked with the ACM product team to remove reuse of email validation. Prior to this, customers would go so long between validations that they weren’t always aware of the actions needed. The primary area we plan to tackle first is moving from a system that supports fixed assumptions about infrastructure to a system that is more dynamic. We plan to create multiple intermediates that ACM will utilize to simultaneously issue from. As part of this change we will also create new documentation for AWS customers and training for AWS support staff that help customers understand best practices for working with PKI. This will address two of the biggest areas of impact: “other types of customers that pin to intermediates” and “customers that use certificate status information in their workflows and have hard-coded CRL and OCSP endpoints”.

After reviewing options internally, we believe we can provide a scaled-down version of what we planned to launch on the initial timeline we reported. This scaled-down solution will allow us to begin migrating certificates in October 2022. This will allow us to rotate customers to the new paradigm and revoke the 2040 intermediate when the active certificates expire.

I'm a little confused here.

The opening states

(In reply to Trevoli (Amazon Trust Services) from comment #21)

We determined it would take several weeks to issue a new CA with DigiCert, and rotate all certificates. Most certificates would be rotate-able with a smaller number of certificates taking a longer time due to requiring customer interaction to re-validate.

and then later,

After reviewing options internally, we believe we can provide a scaled-down version of what we planned to launch on the initial timeline we reported. This scaled-down solution will allow us to begin migrating certificates in October 2022. This will allow us to rotate customers to the new paradigm and revoke the 2040 intermediate when the active certificates expire.

These seem wildly conflicting, but the description seems to be suggesting they're two different things.

I'm a little confused by this, though. It suggests that Amazon needs 10 months (December -> October) to transition to a new intermediate, and then from October 2022 to October 2023 to actually phase out that old intermediate. That sounds a lot like not taking any proactive steps to revoke, and indeed, based on the delays here to get to this point, seems like it might be optimistic to think this would happen 7 months from now.

Have I misunderstood anything? This seems like a well-understood problem for any publicly trusted CA. Amazon has been an actively trusted CA long enough to be aware of many CA incidents with challenges in revoking, so it doesn't seem like any of this is particularly new information: CAs have shared their challenges, and shared their steps to resolve these challenges, in past incidents. Why has Amazon not taken action to similarly prepare, given the years of advance warning of these situations? Indeed, DigiCert, Amazon's partner here, has long recognized the risks, and hence regularly rotates intermediates to reduce some of this.

  1. Customers that use certificate status information in their workflows and have hard-coded CRL and OCSP endpoints.
  2. IoT customers that use the intermediate in place of the root.
  3. Other types of customers that pin to intermediates.
  4. Customers that make use of email validation and have an expired validation.
  5. Customers that make use of DNS and have removed their CNAME record.

In the abstract, this is useful, but it still doesn't seem to meet the minimum requirements of https://wiki.mozilla.org/CA/Responding_To_An_Incident#Revocation . Have I missed something? This doesn't break down the individual customers, or even a below-minimum-standard aggregate breakdown.

It's unclear what is meant by "hard-coded CRL and OCSP endpoints", and why that would be relevant? Is it saying that new intermediates cannot be introduced because, rather than using the information from the certificate, they hardcode it? If so, can you provide specific examples?

With respect to 4 and 5, these are entirely predictable, as we've seen from other CA incidents, so perhaps you can share what ATS' strategy is for monitoring and learning from CA incidents. The past several years hold valuable lessons that, had Amazon properly recognized the risk of non-compliance its own systems and processes were introducing, it could have learned from and acted on. That it hasn't happened until now is, to put it mildly, concerning.

During our discussions with the browsers, and initial internal discussions, we identified a new category of customers that would be impacted:

  1. Customers and relying parties that make use of older clients or less popular/custom mechanisms for looking up certificate status information.

I am entirely uncertain what this means. Can you provide more detail here, and its relevance to this discussion? It sounds like it's an argument about why you can't revoke 2040 and keep 2025, but its relevance to why you can't immediately transition to a new intermediate and begin actively replacing existing certs is unclear.

The analysis of specialists in client-side crypto behavior was that it would be unsafe to revoke the 2040 version of our key pair while we still had active certificates for the 2025 key pair in use. The number of customers that would be impacted was determined to be small but also a group that we wouldn’t be able to identify, and hence warn, in advance.

Yes. But it's unclear the relevance to this discussion. Amazon is already required to be prepared to fully revoke the 2025 intermediate within 7 days, as per the BRs. So how has Amazon designed its systems to support that? It seems, based on this bug, that Amazon has been completely unprepared for this. If Amazon was relying on supplemental controls to make this less likely, then it's unclear why the years of discussion about why supplemental controls are insufficient, and issuance practices are key, haven't been learnt from.

The primary area we plan to tackle first is moving from a system that supports fixed assumptions about infrastructure to a system that is more dynamic. We plan to create multiple intermediates that ACM will utilize to simultaneously issue from. As part of this change we will also create new documentation for AWS customers and training for AWS support staff that help customers understand best practices for working with PKI.

This sounds like (customer) "training has been improved" as a solution. Not only does this deliverable lack a date, it lacks a way to quantify whether or not it produces the intended result. Lessons from other CAs show it's less likely to succeed ("We can't control our customers"), and thus, it seems likely to lead to future issues.

The subtext here is it sounds like the decision is being made that, even though the BRs do not support it, and even though it's clear to Amazon that it's problematic, the choice is being made to "support" these customers by violating the BRs and the public commitments by ATS, which were the basis of trust, to adhere to them. I appreciate the "customer first" mentality here, but it seems concerning to think that Amazon would place its interests, and its customers, over the BRs. We've seen this in the past with internal server names, underscores, and skipping domain validation, and that's a dangerous path to hew to.

This scaled-down solution will allow us to begin migrating certificates in October 2022. This will allow us to rotate customers to the new paradigm and revoke the 2040 intermediate when the active certificates expire.

Definitely seeing more detail about this plan will be interesting, but it doesn't seem to remotely get close to aligning with the BRs. Since I realize much of my reply here is critical of the stated plan, perhaps you can clarify one further detail.

Prior to this incident, what steps did ATS take to ensure it was able to meet the BRs obligation to revoke within 7 days for the 2025 intermediate that was actively issuing? If "minimize customer disruption" was seen as a requirement by ATS, how did it plan to ensure that it could achieve that, while still meeting the timelines set forth, which ATS publicly committed to, and which ATS has been audited against?

Flags: needinfo?(trevolip)

If ATS had an issue with our 2025 intermediate we could revoke it in 48 hours or less.

For this issue with the 2040 intermediate, we want to focus on improving the PKI agility of our customers, which will also improve the PKI agility of the community. Our plan is not to rotate customers to a single intermediate. The reason that it will take until October is that we instead plan to introduce several intermediates so that there is no longer a guaranteed intermediate that ATS certificates are issued from. This is based on feedback from other CAs regarding difficulties they faced when moving customers to new intermediates. As you stated, merely improving training isn’t sufficient to drive change. Instead we plan to prioritize adding agility as part of this rotation.

We’ve made a series of changes over the past few years to improve our ability to rapidly and reliably revoke an intermediate certificate. Here is a sample of the changes we have made related to items that have impacted the time it would take to revoke an intermediate:

2018
Introduced formal ceremony post-mortems to identify improvements and assign action items to complete those improvements. Several of the changes listed below are a result of this process.

Changed how the ceremony scripts are managed and reviewed. Previously we would create and review the entire script each time. In the updated model, repeatable actions are broken out separately and reviewed when added to the template. During the script review, we can track that new actions were multi-person reviewed and approved and repeatable actions are reviewed to make sure they match the approved template. This change decreased pre-ceremony preparation time by 30 – 90 minutes.

2019
Changed our backup site testing process. Previously we would perform a minimal operation at the backup site to test the equipment. In 2019 we began running full ceremonies at the backup site in addition to our primary site. We did this to ensure that both sites were equally prepared to run any operation needed.

Established a new baseline environment for HSM operations. Worked with Security and our hardware vendor to identify a scaled down set of software and updated the process for vulnerability management for our air gapped system. This change decreased pre-ceremony preparation time by 3-5 days.

Moved the pre/post ceremony runbook into the change management tooling. This allows us to take advantage of automated checks, such as requiring three approvers for the steps. It also makes it clear which steps must be completed in order to run a ceremony. Although the motivation for this change was aimed at improving the tracking mechanism for the ceremony related steps, this also decreased pre-ceremony preparation time by about 30 – 60 minutes.

2020
Automated the linting check that is run pre and post certificate issuance during ceremonies to decrease ceremony time and make it easier to spot issues. This change decreased pre-ceremony preparation time by 60-90 minutes and ceremony time by 30-45 minutes.

Expanded backup site cross-training to include an unassisted Ceremony Game Day, where the second site runs a practice ceremony without guidance from someone that has previously run a ceremony. This exercise is done to identify gaps in tools or documentation that would slow down our ability to run a ceremony. Following this activity, we implemented improvements to our runbook so that it is now managed in our change management system. Steps in the script were also improved to provide more detail and/or clarity. Both of these improvements increase the reliability of ceremony operations.

2021
Migrated CA equipment to a new secured location. This new location provides security upgrades, such as automated enforcement of multiple trusted persons, and also decreased the time to access equipment from two hours to 30 minutes.

Created a log dive guide that explains which logs to look in and what tools to use to verify certificate data for revocations. Prior to this, while the locations of the logs were documented, there wasn’t a detailed guide explaining how to map the information together. This saves about 16 hours.

Flags: needinfo?(trevolip)

Thanks Trev.

I greatly appreciate the level of detail here, and there are a number of useful lessons and considerations for other CAs that are well worth examining.

However, as impressive as this is in what it includes, it’s also notable in its omission. This largely speaks to your processes of failing over to a second facility, but doesn’t seem to speak at all to the operational aspects of simply replacing the certificate. That is, it seems to contemplate “what happens if a meteor strikes our primary site”, without contemplating “What if the information was wrong and misleading and we had to simply revoke” - as two contrived examples. Please don’t misunderstand: I do think the level of detail is great here, and these sorts of lessons and improvements outside of incidents are absolutely the kind of thing that would be great for more CAs to share, but it doesn’t speak to lessons or changes related to this incident.

I ask because this omission is where ATS’ challenge seems to lie, based on the current timelines. There have been ample CA incidents in the past that demonstrate the importance of a multi-layered approach, focusing on both the CA’s operations and the customer’s practical experience. Situations such as “customers rely on manual validation” have absolutely been a theme explored and the limits noted, for years, when it comes to replacing intermediates.

It’s unclear if this was an intentional oversight, e.g. due to a belief that ATS had sufficient compensating controls in other areas, or if this was unintentional, and that ATS doesn’t treat every CA incident from other CAs as a chance to introspect and analyze not only the incident itself, but the challenges and long-term remediations. That’s why I’m continuing to press for more details here.

If ATS felt its controls were sufficient, such that past CA incidents were not relevant or did not include lessons to learn, then it’d be useful to provide a sample of such bugs from other CAs, how ATS evaluated them, and why it was deemed unnecessary to change. However, if ATS simply didn’t yet have a process in place that examined every CA bug both for the incident itself, and any operational challenges or long-term remediations, then it seems absolutely critical that ATS introduce such a process immediately, and then work through these past CA incidents using the process, and document the bugs and lessons learned.

The severity/concern here is that a fairly benign mistake has revealed a significant operational gap, which is critical to risk mitigation and securing users, and the goal is making sure that gap is demonstrably addressed. While it may sound extreme to require “revoke every certificate because the intermediate has a doppelgänger that needs to be revoked,” and that will be hugely disruptive to customers/users, my point in comment #22 was that they’re not supposed to be disruptive at all, and we count on the CA to ensure that. If it is disruptive, we the community need to understand how ATS is going to ensure that it’s not disruptive the next time, and how that may reasonably justify the prolonged remediation this time.

Flags: needinfo?(trevolip)

To provide further context: most of the items listed above were improvements we determined needed to be made as a result of lessons learned from our intermediate creation ceremony in 2018. From that ceremony we determined we needed to reduce the time required, and increase the clarity of the steps taken, in the event we need to revoke and replace an intermediate.

We also worked with the AWS Certificate Manager team to have them implement changes in the ACM service to better handle these types of events. For example, the service no longer serves a chain that is static in code. Another example is that changes have been made to the rekey operational workflow to both reduce the probability of domain validation failing and to reduce the steps needed to recover if the domain validation fails.

Amazon Trust Services does have a process specifically dedicated to reviewing Mozilla bugs and discussions. Each week we review open bugs and then MDSP discussions. The bugs/discussions are triaged to determine:

  1. Is there enough information yet for a review?
    a. Bug reports without details such as what happened or root cause are skipped for the week.
  2. Does this apply to our operations?
    a. If no, bug is skipped and not looked at again.
    b. If yes, do we have recently reviewed controls in place to mitigate it?
    i. If we haven’t recently reviewed the controls in place, or we find something new, we identify the specific items from the bug/discussion that apply and then follow up on those items.
  3. Does this require further engineering review?
    a. If the bug requires deeper engineering analysis the bug is passed to the engineering team for a review.

Here is a selection of bugs and the outcome at the time of review.

https://bugzilla.mozilla.org/show_bug.cgi?id=1636544 - IdenTrust: OCSP Outage
When this bug was reviewed we determined that our monitoring practices were sufficient to catch issues with our OCSP Responder. Additionally, we already had a runbook for how to handle surges.

https://bugzilla.mozilla.org/show_bug.cgi?id=1540961 - Atos: Insufficient Serial Number Entropy
https://bugzilla.mozilla.org/show_bug.cgi?id=1539307 - Buypass: Insufficient Serial Number Entropy
https://bugzilla.mozilla.org/show_bug.cgi?id=1538673 - Consorci AOC: EC-SECTORPUBLIC insufficient serial number entropy
When the bugs about insufficient serial number entropy began to be logged we verified internally that our software had sufficient entropy.

https://www.mail-archive.com/dev-security-policy@lists.mozilla.org/msg13493.html - SECURITY RELEVANT FOR CAs: The curious case of the Dangerous Delegated Responder Cert
https://bugzilla.mozilla.org/show_bug.cgi?id=1649963
https://bugzilla.mozilla.org/show_bug.cgi?id=1653680
Even though we weren't listed, we confirmed that none of our OCSP certificates violated the BRs as described and that our certificate profile ensures our certificates meet the requirements.

https://bugzilla.mozilla.org/show_bug.cgi?id=1730291 - Apple: Test web page certificates expired
When this bug was reviewed we verified that our certificate monitoring for test website certificates was functioning as expected, and that we had a plan in place to renew them prior to expiration.

https://bugzilla.mozilla.org/show_bug.cgi?id=1759959 - GoDaddy: OV Documentation Reuse
While we don't issue OV or EV certificates, we do evaluate bugs related to these types of certificates to determine if the issues in the bug are relevant to DV. For this bug we determined that it isn't specific to OV or EV and is generically about reuse of domain validation information. We verified that we had recently analyzed and refreshed the runbook for our process by which validation information is reused.

https://bugzilla.mozilla.org/show_bug.cgi?id=1561013 - Entrust: Certificate issued with validity greater than 825-days
We reviewed this bug; however, since Amazon had just had the same issue (https://bugzilla.mozilla.org/show_bug.cgi?id=1525710), we determined that our controls had been recently reviewed and follow-up wasn't required.

https://bugzilla.mozilla.org/show_bug.cgi?id=1559376 - Entrust: Certificate Issued with Incorrect Country Code
https://bugzilla.mozilla.org/show_bug.cgi?id=1696227 - Entrust: Incorrect Jurisdiction Country Value in an EV Certificate
We reviewed these bugs but determined there isn't a lot that applies to our operations. We don't have a process for customers to order certificates from us, so we don't have an override process. If you generalize these to "certificates should have the correct information", we had also recently reviewed our process for making sure that the certificates we issue have a fixed set of information in all fields.

https://bugzilla.mozilla.org/show_bug.cgi?id=1731164 - Google Trust Services: CRL validity period set to expected value plus one second
When this bug was reviewed we verified that our script still ensured that CRLs are created with a lifetime less than the maximum to avoid this kind of error.

Flags: needinfo?(trevolip)

We are still committed to begin migrating certificates to new ICAs in or before October 2022.

Trev,

Thanks for sharing the added details in Comment #25. I think the analysis shared for some of these bugs highlights an opportunity to improve the review process for Amazon Trust Services, and that ATS may not be leveraging the opportunity in a way that helps them proactively assess and prevent problems.

For example, in the discussion, you noted ATS examined bugs such as Bug 1540961, Bug 1539307, as well as https://www.mail-archive.com/dev-security-policy@lists.mozilla.org/msg13493.html

While these reports examined a particular instance of a situation, they also revealed systemic issues, such as the failure to be able to promptly revoke in a minimally disruptive way, and a reminder of the existing commitment by CAs to do so, which seems very relevant to the situation we're in presently. In the discussion of these bugs (just a few I'd examined from the list shared), we can see that even if a particular symptom may not affect ATS, ATS presently shares the same systemic issues. Had those risks been identified at the time - that is, risks to existing and long-standing obligations and commitments - it's easy to imagine that ATS would be able to promptly rotate the intermediate now, rather than later.

This suggests that the approach of review is failing to consider "What can we learn", and rather just looking at "Are we affected". This is somewhat counter to the purpose of these incident reports. For example, it would be deeply unfortunate if other CAs failed to consider the advice and lessons of ATS in Comment #23, although I'm certain few, if any, will find themselves in the exact situation of Comment #0.

These two steps in ATS' current process stand out as worrisome because, as implemented, they prevent ATS from being a good CA citizen and from learning from and improving the ecosystem.

Bug reports without details such as what happened or root cause are skipped for the week.

I would encourage ATS to request details and analysis in these situations. There's no reason this needs to be a consistent few individuals, and rather, the process is one that encourages all participants to ensure better details, not only to determine if they're affected, but also to learn what controls existed, how they failed, and why the controls for a CA, such as ATS, may be similarly defective.

Does this apply to our operations? If no, bug is skipped and not looked at again.

It's difficult to imagine that there are any CA incidents that don't apply to ATS. That's not to say the same root causes will exist, but rather, every bug, until it's fully closed, is an opportunity to learn. This bug is a great example: it's not uncommon to require multiple iterations to work through and address what the true root cause is, as well as what systemic factors may exist and what controls may have been expected or failed. Until a bug is closed, it's difficult to imagine there's nothing more to learn.

While it's important to ensure there is a good triage process for immediate risk, there is also a need for longer-term analysis. The issues Amazon Trust Services has highlighted in Comment #21 are not really unique or new, and had ATS been examining such bugs (including revocation delay bugs, which I didn't see listed), then it should have been clear there exist systemic challenges that ATS needs to take action on.

The current approach may feel like it's doing the minimum necessary, but as an observer, it cannot help but seem ATS is doing less than the minimum, and thus, is at risk of future issues of failing to recognize ongoing systemic risks and patterns in the ecosystem, in order to proactively take steps to ensure ATS adheres to its obligations and commitments. That makes it all the more difficult to see such delays as reasonable, when they were preventable.

Ryan,

Thank you for the feedback. ATS does take past incidents where we failed to meet our CA obligations seriously and as a learning opportunity to improve and prevent future issues. This is why we instituted the more extensive recurring review mechanisms, as discussed above.

To clarify, we didn’t mean to imply in Comment #21 that those were all unique issues we discovered as a result of this specific incident. As with Comment #25, those were meant as examples of items we discovered via bug reviews (our own and others’ bugs), which were then added to our engineering roadmap or tracked as items the broader team needs to be aware of.

However, we can always get better, both as a CA and as a citizen of the CA community. We accept the feedback that we could do more in our ongoing bug reviews, and will expand how we analyze bug information and leverage it to improve our service. If there are bugs that we believe might be helpful to us but we have questions or concerns about, we will address them directly using bug comments.

ATS continues to monitor this bug. We are on track to begin migrating certificates to new ICAs in or before October 2022, as previously committed.

ATS continues to monitor this bug. We are on track to begin migrating certificates to new ICAs in or before October 2022, as previously committed.

ATS continues to monitor this bug. We are on track to begin migrating certificates to new ICAs in or before October 2022, as previously committed.

ATS continues to monitor this bug. We are on track to begin migrating certificates to new ICAs in or before October 2022, as previously committed.

ATS continues to monitor this bug. We are on track to begin migrating certificates to new ICAs in or before October 2022, as previously committed.

ATS continues to monitor this bug. We are on track to begin migrating certificates to new ICAs in or before October 2022, as previously committed.

ATS continues to monitor this bug. We are on track to begin migrating certificates to new ICAs in or before October 2022, as previously committed.

Whiteboard: [ca-compliance] [delayed-revocation-ca] Next update 2022-03-04 → [ca-compliance] [delayed-revocation-ca] Next update 2022-10-01

ATS continues to monitor this bug. We are on track to begin migrating certificates to new ICAs in or before October 2022, as previously committed.

(In reply to Trevoli (Amazon Trust Services) from comment #21)

... The number of customers that would be impacted was determined to be small but also a group that we wouldn’t be able to identify, and hence warn, in advance.

Based on these discussions, as well as learnings from the community discussions on bugs, and articles written by other CAs on this subject, we determined that changes needed to be made to the system used by AWS customers to obtain certificates from ATS. ATS has made several recommendations to the AWS Certificate Manager (ACM) product team to change how certificates are vended so that customers are better positioned to be ready for rapid rotations. This will continue work we started last year when we worked with the ACM product team to remove reuse of email validation. Prior to this, customers would go so long between validations that they weren’t always aware of the actions needed. The primary area we plan to tackle first is moving from a system that supports fixed assumptions about infrastructure to a system that is more dynamic. We plan to create multiple intermediates that ACM will utilize to simultaneously issue from. As part of this change we will also create new documentation for AWS customers and training for AWS support staff that help customers understand best practices for working with PKI. This will address two of the biggest areas of impact: “other types of customers that pin to intermediates” and “customers that use certificate status information in their workflows and have hard-coded CRL and OCSP endpoints”.

These improvements are valuable, but from a certain perspective, I'm not sure they address Amazon's commitments to follow policy.

No matter how good your documentation is, or how much you nudge customers to follow best practices, it will never be possible to immediately prove that no one would be negatively impacted by a large revocation.

If there were an incident requiring revocation in the future, why would ATS do so, instead of choosing to bring out the same justifications to delay?

(I'm wearing the hat, or hats, of a subscriber and relying party.)

Thank you for your feedback; we will post a full response next week.

Thank you for your feedback, Matt. Our focus with these changes is on enabling PKI agility, which will directly reduce the number of customers that would be impacted by a revocation event of any size. That said, as an overarching principle, we prioritize security over availability. So while we are working to minimize the impact of any revocation, in the event of a security issue we would revoke a subordinate within the required time frame.

ATS continues to monitor this bug. We are on track to begin migrating certificates to new ICAs in or before October 2022, as previously committed.

ATS continues to monitor this bug. We are on track to begin migrating certificates to new ICAs in or before October 2022, as previously committed.

ATS continues to monitor this bug. We are on track to begin migrating certificates to new ICAs in or before October 2022, as previously committed.

ATS continues to monitor this bug. We are on track to begin migrating certificates to new ICAs in or before October 2022, as previously committed.

ATS continues to monitor this bug. We are on track to begin migrating certificates to new ICAs in or before October 2022, as previously committed.

ATS continues to monitor this bug. We are on track to begin migrating certificates to new ICAs in or before October 2022, as previously committed.

ATS continues to monitor this bug. We are on track to begin migrating certificates to new ICAs in or before October 2022, as previously committed.

(In reply to Aaron Poulsen (Amazon Trust Services) from comment #46)

ATS continues to monitor this bug. We are on track to begin migrating certificates to new ICAs in or before October 2022, as previously committed.

Thanks for these weekly updates, but they aren't necessary until after October 1. Then, if we need to update the whiteboard to indicate a new "next update", the weekly reports won't be necessary until after the "next update".

AWS just sent a notification to subscribers that (paraphrasing liberally) they would start issuing compliant certificates from the replacement intermediates on 2022-10-08, linking to a blog post [0] that says they would start doing so on 2022-10-11 at 09:00 PDT.

(The blog post also links back here.)

[0] https://aws.amazon.com/blogs/security/amazon-introduces-dynamic-intermediate-certificate-authorities/

Matt - AWS is aware of the date mismatch between its customer communication email and blog post. The correct date is 2022-10-11. If you have further questions for AWS, please reach out via the mechanisms described in the email.

Amazon Trust Services will revoke https://crt.sh/?id=11334874 and https://crt.sh/?id=10739079 on May 31, 2023, regardless of the number of valid, non-expired end-entity certificates issued from them.

Our subscribers use certificates from ATS in two primary ways: one, certificates that are used for the majority of their lifetime and then superseded with a new certificate with the same SANs; two, certificates that are used for a shorter time and not superseded. We can identify which certificates fall into the second category, so we do not need to take action on them prior to revoking the ICA. Beginning on October 11, 2022, all requests to ATS for certificates that are not superseding a previous certificate will be served from one of two new ICAs. This is sufficient time ahead of the May 31, 2023 revocation date to effectively cover all shorter-use certificates.

For the certificates that are used for longer durations, we have asked the AWS Certificate Manager team to begin replacing in-use certificates with new certificates from one of the new ICAs. They will begin this activity on November 1, 2022. Replacement will be done in waves, with a pause over the winter holidays to minimize disruption. We do not plan to revoke end-entity certificates after they are replaced, since we are revoking the ICA. In parallel with this proactive replacement, on January 17, 2023 we will change issuance so that new certificates superseding a previously issued certificate are also issued from a new ICA.

Going forward, ATS will maintain a cadence of introducing new ICAs at regular intervals and adding them to our issuing pool. Every subscriber certificate will come from an ICA chosen at random by the system at request time. Over time, ICAs will leave the pool ahead of their expiration. While we will still rotate ICAs as needed for security events, we want to move away from rotations being part of the natural life cycle of how subscribers interact with certificate authorities. As we stated in March, our focus is on PKI agility, and this change will enable customers to begin operating in a more agile way with regard to certificate changes.
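
To make the issuing-pool idea concrete, here is a minimal sketch of request-time ICA selection. The names, data structures, and pool contents below are our own illustration of the behavior described above, not the actual ACM/ATS implementation.

    import random
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class IntermediateCA:
        name: str
        key_algorithm: str  # e.g. "RSA", "EC_P256", "EC_P384" (labels are hypothetical)

    # Hypothetical pool; in practice, pool membership changes over time as
    # ICAs are introduced and retired ahead of expiration.
    ISSUING_POOL = [
        IntermediateCA("ICA-RSA-1", "RSA"),
        IntermediateCA("ICA-RSA-2", "RSA"),
        IntermediateCA("ICA-EC-P256-1", "EC_P256"),
        IntermediateCA("ICA-EC-P384-1", "EC_P384"),
    ]

    def pick_issuer(requested_algorithm: str) -> IntermediateCA:
        """Pick an ICA uniformly at random among those matching the requested
        key algorithm, so no subscriber can rely on always chaining through
        the same intermediate."""
        candidates = [ica for ica in ISSUING_POOL
                      if ica.key_algorithm == requested_algorithm]
        if not candidates:
            raise ValueError(f"no ICA available for {requested_algorithm}")
        return random.choice(candidates)

    print(pick_issuer("RSA").name)  # e.g. "ICA-RSA-2"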

Hi Trev,

Overall, the delayed revocation of the mis-issued ICA certificate highlighted two areas of concern related to agility:

  1. Relying-party applications hardcoded certificate chains, and because of this dependency revocation of the mis-issued ICA certificate could not take place.
  2. If revocation of the ICA took place, subscriber certificates could not easily be replaced in the BR-prescribed timelines.

The described approach of using a pool of rotating ICAs will help address the first concern (or, at least, better highlight applications that are hardcoding certificates). We also see other CA owners following a similar rotation approach, like DigiCert, and we consider this a positive practice that promotes agility.

On the second concern, it’s not clear through your write-up how the plan better prepares ATS or its customers to comply with BR-prescribed timelines (i.e., revocation within 5 days or 24 hours) if another unforeseen incident happens in the future. Can you please share more detail related to how this concern is being addressed?

General questions:

  • Will these new ICAs support ACME? Reading Amazon’s CP/CPS we know it’s possible that they might, but are curious to know the actual implementation plan.
  • What intervals will ICAs be introduced into the issuance pool?
  • What is the expected validity of the new ICAs?
  • How soon before expiration will ICAs leave the pool?

Thanks!

Whiteboard: [ca-compliance] [delayed-revocation-ca] Next update 2022-10-01 → [ca-compliance] [delayed-revocation-ca]

Thank you for the feedback, Ryan. Answers as follows:

  1. “how the plan better prepares ATS or its customers to comply with BR-prescribed timelines (i.e., revocation within 5 days or 24 hours) if another unforeseen incident happens in the future.”
    In parallel to deploying new ICAs we’ve improved our bulk replacement and load testing tools.

  2. "Will these new ICAs support ACME? Reading Amazon’s CP/CPS we know it’s possible that they might, but are curious to know the actual implementation plan."
    ATS currently only provides certificates for use in Amazon managed systems. Monitoring for expiration, requesting replacements, and binding to resources is already being done by mechanisms other than ACME. We currently have no plans to support ACME.

  3. "What intervals will ICAs be introduced into the issuance pool?"
    In keeping with our agility commitment we do not plan to pre-announce ICA introduction. However, we plan to introduce at least two more prior to the end of Q2 2023.

  4. "What is the expected validity of the new ICAs?"
    Our newest set of ICAs has an 8-year validity. Previously our ICAs were 10 years. We lowered the validity for this set to bring it closer in line with the useful life their corresponding roots will have in the Chrome and Mozilla trust stores. We appreciate feedback on ICA lifetime for future ICA sets.

  5. "How soon before expiration will ICAs leave the pool?"
    We plan to remove an ICA with sufficient time to avoid unnecessary migrations.

ATS continues to monitor this bug. We have begun migrating certificates to new ICAs. We are on track to revoke our ICAs on or before May 31, 2023.

ATS continues to monitor this bug. Ben, we want to suggest a next update date of Nov 18, 2022, when we will provide an update on our progress.

Whiteboard: [ca-compliance] [delayed-revocation-ca] → [ca-compliance] [delayed-revocation-ca] Next update 2022-11-18
Product: NSS → CA Program

An update on progress against the plan outlined in comment #50: Amazon Trust Services successfully implemented a change on October 11, 2022 whereby new issuance of certificates occurs from one of two new RSA-based ICAs. This change addresses the second use case described in comment #50.

For the first use case described in comment #50, where certificates are used for longer durations within their validity period, we began proactive replacement the week of November 1, 2022. We will continue to manage replacement in waves, as previously described. We remain on track to revoke https://crt.sh/?id=11334874 and https://crt.sh/?id=10739079 on May 31, 2023.

In parallel, and to ensure wider adoption of the multi-ICA paradigm, we also introduced a set of ECDSA ICAs (P-256 and P-384) on November 8, 2022 as new-issuance options, in support of promoting agility in our certificate community. Both sets of ECDSA-based ICAs operate identically to the RSA-based ICAs introduced in October: the issuing ICA is selected randomly based on the desired algorithm.

Ben, we would like to propose a next update of January 25, 2023.

Whiteboard: [ca-compliance] [delayed-revocation-ca] Next update 2022-11-18 → [ca-compliance] [delayed-revocation-ca] Next update 2023-01-25

ATS is continuing its commitment to reach our May 31, 2023 target date for ICA revocation as described in comment #50.

On January 17, 2023, we implemented changes so that certificates superseding a previously issued certificate are now issued from one of the newly deployed ICAs, as described in comment #55.

With this recent change, approximately 85% of issuance is using one of the newly-deployed RSA or ECDSA ICAs, each randomly selected based on key type. As described earlier in this report, random ICA selection discourages the use of pinning and promotes agility in the certificate community.
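
As an illustration of the pinning problem from the relying-party side, the sketch below validates a server's chain against the platform's trusted roots instead of a hard-coded intermediate; the host name and code are illustrative assumptions on our part, not guidance taken from ACM documentation.

    import socket
    import ssl

    def chain_validates(host: str, port: int = 443) -> bool:
        # create_default_context() loads the platform's trusted roots and enables
        # hostname checking, so validation keeps working when the CA rotates ICAs.
        ctx = ssl.create_default_context()
        try:
            with socket.create_connection((host, port), timeout=10) as sock:
                with ctx.wrap_socket(sock, server_hostname=host) as tls:
                    # The handshake has already performed full path validation to a
                    # trusted root. Code that instead compares the intermediate's
                    # fingerprint to a hard-coded value breaks as soon as a new ICA
                    # enters the issuing pool.
                    return tls.version() is not None
        except (ssl.SSLError, OSError):
            return False

    print(chain_validates("example.com"))  # host name is illustrative only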

Beginning in February, we will do a second phase of targeted replacement of certificates that expire after our planned revocation date.

Ben, we would like to propose March 17, 2023 for our next update.

Whiteboard: [ca-compliance] [delayed-revocation-ca] Next update 2023-01-25 → [ca-compliance] [delayed-revocation-ca] Next update 2023-03-17
Whiteboard: [ca-compliance] [delayed-revocation-ca] Next update 2023-03-17 → [ca-compliance] [ca-revocation-delay] Next update 2023-03-17

ATS is continuing its commitment to reach our May 31, 2023 target date for ICA revocation. As mentioned previously (comment #50) we will revoke the ICA regardless of the number of valid end-entity certificates remaining.

Ben, we would like to propose May 26, 2023 for our next update.

Whiteboard: [ca-compliance] [ca-revocation-delay] Next update 2023-03-17 → [ca-compliance] [ca-revocation-delay] Next update 2023-05-26

On May 24, 2023 we revoked https://crt.sh/?id=10739079 and https://crt.sh/?id=11334874. CCADB has been updated. We'd like to request that this issue be closed as resolved.

Flags: needinfo?(bwilson)

I'll close this on or about Friday, 2-June-2023, unless there are additional questions or issues to raise or discuss.

Whiteboard: [ca-compliance] [ca-revocation-delay] Next update 2023-05-26 → [ca-compliance] [ca-revocation-delay]
Status: ASSIGNED → RESOLVED
Closed: 1 years ago
Flags: needinfo?(bwilson)
Resolution: --- → FIXED