1. How your CA first became aware of the problem
A member of the Mozilla forum filed Bug 1706967 pointing out that our CPS still referenced BR method 10, which had been retired when version 1.7.3 of the Baseline Requirements (BRs) became effective on 2020-09-22 after Ballot SC33.
2. A timeline of the actions your CA took in response.
||GTS introduces support for validation method TLS ALPN
||The BRs are updated to remove method 10 and add method 20 for TLS ALPN.
||BR version 1.7.1 and portions of 1.7.2 are reviewed in a compliance meeting but a review of the remaining parts of 1.7.2 is not scheduled
||Mozilla Bug 1706967 is filed.
||GTS acknowledges the Bug
||GTS checks its CA configurations to determine which validation methods are actually used and determines that they do not include BR method 10
||The GTS CPS is updated to remove method 10
||The GTS CPS is updated to add BR method 20
||GTS CPS 3.4 is published in the Repository
||Compliance team determines revocation is needed and begins reaching out to customers to plan re-issuance and revocation
||The re-issuance of associated certificates begins
||The re-issuance completes
||The revocation of associated certificates completes
||GTS shares a first incident report
||GTS shares this revised incident report
3. Whether your CA has stopped, or has not yet stopped, certificate issuance or the process giving rise to the problem or incident.
We have reviewed the configurations of all our CAs and confirmed that TLS ALPN (now method 20) is the only random number based validation method we have used since 2020-09-22. Section 188.8.131.52 of the GTS CPS has been updated accordingly.
4. In a case involving certificates, a summary of the problematic certificates. For each problem: the number of certificates, and the date the first and last certificates with that problem were issued. In other incidents that do not involve enumerating the affected certificates (e.g. OCSP failures, audit findings, delayed responses, etc.)
1,029,743 certificates were issued using TLS-ALPN when our CPS did not correctly document our use of TLS-ALPN in accordance with current Baseline Requirements. The root of the issue is not the certificates themselves but rather which section number we referenced within the CPS for the use of this method.
5. In a case involving certificates, the complete certificate data for the problematic certificates.
The list of crt.sh links for the associated certificates is too large to attach to the bug so we have provided the list at the following URL: https://drive.google.com/file/d/1aQH0bVSieXfpsuz2r40wO9_XyOOln_V5/view?usp=sharing. The SHA256 hash of the file is 84ae395dfcdf38fc908b5d24587d7b263a9d492a920b0b662a234d15f077825a.
The above list consists of the hex encoded serial numbers of the certificates, which in many cases will work directly with crt.sh, but in some cases crt.sh requires additional formatting.
In terms of implementation, our serials are generated using 64 bits of output from a CSPRNG followed by encoded information. If the most significant bit of the CSPRNG's output is 1, we prepend a zero byte, leading to 17 bytes in the encoding instead of 16. It seems that, in some cases, crt.sh needs this prepended zero byte to be able to query directly on the serial number.
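As a minimal sketch of the formatting described above (not the tooling GTS used), the following Python normalizes a hex serial to the form a DER INTEGER would carry, adding a leading zero byte when the most significant bit is set. The function names are hypothetical, and the use of the `?serial=` query parameter on crt.sh with this encoding is an assumption.

```python
def der_integer_hex(serial_hex: str) -> str:
    """Return the serial as hex of DER INTEGER content bytes.

    A leading zero byte is added when the most significant bit is 1,
    so that the value is not interpreted as a negative integer.
    """
    digits = serial_hex.strip().lower().replace(":", "")
    if len(digits) % 2:
        # Pad to a whole number of bytes.
        digits = "0" + digits
    raw = bytes.fromhex(digits)
    # Drop any existing leading zero bytes, keeping at least one byte.
    raw = raw.lstrip(b"\x00") or b"\x00"
    if raw[0] & 0x80:
        # Most significant bit is set: prepend the zero byte.
        raw = b"\x00" + raw
    return raw.hex()


def crtsh_url(serial_hex: str) -> str:
    """Build a crt.sh lookup link (assumed query format)."""
    return "https://crt.sh/?serial=" + der_integer_hex(serial_hex)
```

A serial that already carries the leading zero byte passes through unchanged, so the function can be applied to every entry in the list regardless of its original form.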
To make it easier to use this list to query crt.sh, we have produced a new file containing the crt.sh link for each certificate in the expected encoding.
The SHA256 hash of the file is f87d705c5a4a49f88bba3b28b36638a96feed500c858f007c41ed7f407621bb9.
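For readers who download either file, a check along the following lines can confirm that the content matches a published SHA256 hash. This is a sketch only; the function names and the local file path passed in are hypothetical.

```python
import hashlib


def sha256_hex(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str, expected_hex: str) -> bool:
    """Compare the file's hash with the hash stated in the report."""
    return sha256_hex(path) == expected_hex.strip().lower()
```

Reading in chunks keeps memory use flat, which matters for a list covering more than a million certificates.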
6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.
Google Trust Services launched in 2016 with an EJBCA based CA utilizing EJBCA's proprietary APIs and custom tooling built around those APIs. Starting in 2019, we began to migrate most GTS issuance to an in-house developed CA based on the ACME protocol with the goals of accelerating the automation of certificate lifecycle management, increasing scalability, reducing manual processes involved in CA operation, and improving security through a reduced attack surface.
This system launched in July of 2019 and now represents the large majority of all certificates issued by GTS. This release included support for TLS-ALPN as the only TLS based validation mechanism supported. At the time, support for TLS-ALPN was described by method 10.
Let’s Encrypt experienced an incident in January 2018 relating to TLS-SNI, the precursor to TLS-ALPN, which resulted in them stopping issuance using this method. TLS-ALPN was proposed in September 2018 and ratified in February 2020 as a replacement for TLS-SNI.
This change led to a September 2020 update of the Baseline Requirements disallowing method 10, which accommodated TLS-SNI as well as TLS-ALPN, and replacing it with method 20, a provision specifically for TLS-ALPN. When this update took place we missed updating our CPS accordingly.
To better explain the root cause of this mistake, it's best if we start with Bug 1612389, where an internal review identified a compliance issue in a certificate profile. In that incident we determined the root cause was an over-reliance on human review for certificate profile correctness.
At the time the program was largely driven from the governance perspective. In response to that incident we restructured the program, introduced a new cadence of compliance meetings, structurally improved the associated processes, and increased engineering and operations participation. During those compliance meetings we reviewed bugs, ballots, and requirement updates, and as a result we improved a number of processes.
Unfortunately, those improvements did not sufficiently reduce reliance on human elements of reviews. Additionally the staffing of this group did not adequately accommodate the ebb and flow of resource availability that results from both planned and unplanned leave or unexpected work assignments. As a result this review process could get delayed to ensure the appropriate stakeholders are present.
This is exactly what happened in this case. BR version 1.7.1 was reviewed and, while 1.7.2 was discussed, its review was not completed and we missed scheduling the completion of the 1.7.2 review.
When we reviewed our root cause analysis for incident 1612389, we saw parallels and determined the true root causes for this incident and 1708516 were two other process elements.
First, we still had insufficient engineering representation in the compliance review meetings to accommodate the ebb and flow of resource availability. When the engineering representatives who regularly participated in this process were not available, we could miss the opportunity to assess how the changes affect subtle implementation details and how these align with the stated practices in our CPS. If this additional staffing had been in place it would have led to an earlier identification of the mismatch.
Second, we had no automation to monitor policy changes and response times. We believe that automation would have helped identify missing reviews and, through alerts based on timelines, would have notified our Policy Authority and Engineering Compliance Lead. In addition, automation would help flag issues that directly affect us as higher priority. This is why we have adopted the changes outlined in Section 7.
We believe that with these and the other changes outlined in Section 7 we will successfully prevent similar incidents in the future. This is also in line with our goal of automating as much of our compliance program as possible and will help reduce the likelihood of manual process failures.
As a result of the above we have since reviewed all changes to the BRs from version 1.7.1 to 1.7.4 and have verified that our CPS only lists approved validation methods as defined in the current BRs.
In parallel we also reviewed all issuances that have taken place since 1.7.2 and what methods were used for domain control validation in each issuance. In that analysis we have identified 1,029,722 certificates that were issued using TLS-ALPN, which was allowed by the Baseline Requirements but not accommodated by our CPS. Of the associated certificates, 248,139 were non-expired.
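The kind of cross-check described above can be pictured as a comparison of three sets: the methods listed in the CPS, the methods actually configured on the CAs, and the methods the current BRs allow. The sketch below is illustrative only; the function name and the set contents in the usage note are assumptions drawn from this incident, not an export of our actual configuration.

```python
from typing import Dict, List, Set


def compare_methods(
    cps_listed: Set[int], in_use: Set[int], br_allowed: Set[int]
) -> Dict[str, List[int]]:
    """Report mismatches between CPS, CA configuration and the BRs.

    Each argument is a set of BR validation method numbers.
    """
    return {
        # Methods the CPS still references although the BRs retired them.
        "cps_not_in_br": sorted(cps_listed - br_allowed),
        # Methods used for issuance that the CPS does not document.
        "used_not_in_cps": sorted(in_use - cps_listed),
        # Methods used for issuance that the BRs do not allow.
        "used_not_in_br": sorted(in_use - br_allowed),
    }
```

For the situation in this incident, `compare_methods({10}, {20}, {20})` reports method 10 under `cps_not_in_br` and method 20 under `used_not_in_cps`, while `used_not_in_br` stays empty.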
Following our analysis, we determined the most conservative action was to revoke and re-new the associated certificates.
Per the BRs, the revocation was to be completed within 5 days. Because our analysis of the implications started on 2021-04-30, we missed the original timeframe for the revocation. Once we identified the revocation obligations, we revoked all but 3 of the certificates in 2 days. The final 3 revocations took us 2 additional days due to coordination with customers.
Several other questions have come up since our initial incident response that we would like to provide clarity on.
The first is related to our CPS versioning. When this incident was opened, we immediately removed method 10 from our CPS. This was versioned as 3.3, but this version was not published because further changes were needed to reflect the reliance on method 20. Once those changes were made, the version became 3.4 and was published.
The second is our compliance budget. We cannot share financial figures concerning our compliance efforts, nor is it possible to offer a holistic description of a company’s compliance program in a forum post. Instead we can mention a few items that provide an indication of the size and seriousness of our investment in this area.
We engage with two auditors, both of which are active participants in the WebPKI and the Web Trust Task Force. The first auditor has worked with us since 2017 in both pre-audit and advisory capacities. The second auditor has been engaged for our certification audits.
The engagement with the first WebTrust practitioner helped us design internal controls and obtain expert opinions on important questions. Among other things, both auditors performed a full review of our CP and CPS in 2020 prior to the BR change.
Our internal compliance program is led by a dedicated resource and is supported by a Policy Authority team made up of representatives of different disciplines within Google Trust Services. In addition the program is supported by a compliance organization made up of people helping with audit coordination, compliance process consulting, risk management, internal audits, managing physical security, etc.
On the implementation level, we engage with an independent security reviewer every year who performs an assessment of our CA infrastructure to identify vulnerabilities that might exist at any level of the technology stack. The assessment covers CA software, the operating system and its configuration, network infrastructure, CA servers and other hardware, security support systems and people.
7. List of steps your CA is taking to resolve the situation and ensure that such situation or incident will not be repeated in the future, accompanied with a binding timeline of when your CA expects to accomplish each of these remediation steps.
To prevent similar issues from happening again, we are implementing automation that monitors the Baseline Requirements document repository and Mozilla tickets and automatically creates tickets in our internal tracking system.
We believe that by running these updates through our general issue tracking processes, we will ensure that there is a documented action item for every BR or policy change and that the implementation status of each item is visible to a larger audience in both the compliance and the engineering team. This will reduce the process dependency on specific individuals and work on open action items can begin independent of the compliance meeting schedule. The implementation will be complete by 2021-06-15.
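To illustrate the idea of timeline-based alerts, the following sketch flags policy changes whose review is still open after a chosen number of days. It is a simplified model under stated assumptions, not our production tooling: the `PolicyChange` type, the `overdue_reviews` function and the 14-day default are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class PolicyChange:
    """One tracked item, e.g. a new BR version or a policy update."""
    identifier: str
    published: date
    review_completed: Optional[date] = None


def overdue_reviews(
    changes: List[PolicyChange],
    today: date,
    max_age: timedelta = timedelta(days=14),  # assumed threshold
) -> List[PolicyChange]:
    """Return items that have no completed review after max_age."""
    return [
        change
        for change in changes
        if change.review_completed is None
        and today - change.published > max_age
    ]
```

Because the check depends only on the recorded dates, it keeps flagging an unfinished review even when the meeting in which it would normally be discussed is postponed.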
In addition, we have increased the frequency of the bi-weekly reviews to weekly and allocated additional time to them.
To augment our certificate rotation capabilities, we are developing additional tooling that can execute large scale revocations without significant human involvement. This will further bring down the turn-around time for such operations. We expect this to be done by 2021-06-30.
Finally we have created a new engineering compliance role and appointed an experienced engineer to it. This role will strengthen our internal review processes by ensuring that the engineering implications of future BR changes are well covered and give us more staffing depth in the compliance reviews so we can better accommodate reviewer availability.
The first task of this role is to own the implementation projects for the new automation tooling.
Our timeline for implementing the aforementioned changes is the following. We will provide regular progress updates as they are completed.
||The new engineering role has been created and an experienced engineer has been added to the compliance team.
||The frequency of the bi-weekly review has been increased to weekly and participation from the engineering side has been increased
||Solution for automated issue tracking to be complete
||Improved tools that enable easier mass revocation.