Closed Bug 1706967 Opened 4 years ago Closed 3 years ago

Google Trust Services: Forbidden Domain Validation Method 3.2.2.4.10

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: agwa-bugs, Assigned: awarner)

Details

(Whiteboard: [ca-compliance] [policy-failure])

Attachments

(1 file)

GTS has disclosed the following CPS for their roots and intermediates: https://pki.goog/repo/cps/3.0/GTS-CPS.pdf

This CPS is dated 2021-03-19 and https://pki.goog/repository/ states that it is used for "certificates issued on or after 2021-03-25".

Section 3.2.2.4 of this CPS specifies that GTS uses the following domain validation method:

3.2.2.4.10 TLS Using a Random Number Confirming the Applicant's control over the FQDN by confirming the presence of a Random Value within a Certificate on the Authorization Domain Name which is accessible by the CA via TLS over an Authorized Port.

This domain validation method is forbidden by the Baseline Requirements.

Assignee: bwilson → awarner
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Whiteboard: [ca-compliance]

GTS acknowledges this report. An incident report is being prepared with full details.

Summary: GTS: Forbidden Domain Validation Method 3.2.2.4.10 → Google Trust Services: Forbidden Domain Validation Method 3.2.2.4.10

An incident report is being prepared with full details.

Attached file Incident Report

1. How your CA first became aware of the problem

A member of the Mozilla forum filed Bug 1706967 pointing out that our CPS still referenced BR method 10, which had been retired when version 1.7.3 of the Baseline Requirements (BRs) became effective on 2020-09-22 after Ballot SC33.

2. A timeline of the actions your CA took in response.

YYYY-MM-DD (UTC) Description
2019-07-25 GTS introduces support for validation method TLS ALPN
2020-09-22 The BRs are updated to remove method 10 and add method 20 for TLS ALPN.
2020-09-23 BR version 1.7.1 and portions of 1.7.2 are reviewed in a compliance meeting, but a review of the remaining parts of 1.7.2 is not scheduled
2021-04-22 Mozilla Bug 1706967 is filed.
2021-04-22 GTS acknowledges the Bug
2021-04-22 GTS checks its CA configurations to determine which validation methods are actually used and determines that they do not include BR method 10
2021-04-26 The GTS CPS is updated to remove method 10 and add BR method 20
2021-04-30 GTS CPS 3.4 is published in the Repository
2021-04-30 Compliance team determines revocation is needed and begins reaching out to customers to plan re-issuance and revocation
2021-05-01 The re-issuance of associated certificates begins
2021-05-01 The re-issuance completes
2021-05-01 The revocation of associated certificates completes
2021-05-03 GTS shares this incident report

3. Whether your CA has stopped, or has not yet stopped, certificate issuance or the process giving rise to the problem or incident.

We have reviewed the configurations of all our CAs and confirmed that TLS ALPN (now method 20) is the only random-number-based validation method we have used since 2020-09-22. Section 3.2.2.4 of the GTS CPS has been updated accordingly.

4. In a case involving certificates, a summary of the problematic certificates. For each problem: the number of certificates, and the date the first and last certificates with that problem were issued. In other incidents that do not involve enumerating the affected certificates (e.g. OCSP failures, audit findings, delayed responses, etc.)

1,029,743 certificates were issued using TLS-ALPN when our CPS did not correctly document our use of TLS-ALPN in accordance with the current Baseline Requirements. The root of the issue is not the certificates themselves, but rather which section number we referenced within the CPS for the use of this method.

5. In a case involving certificates, the complete certificate data for the problematic certificates.

The list of serial numbers for the associated certificates is too large to attach to the bug so we have provided the list at the following URL: https://drive.google.com/file/d/1aQH0bVSieXfpsuz2r40wO9_XyOOln_V5/view?usp=sharing. The SHA256 hash of the file is 84ae395dfcdf38fc908b5d24587d7b263a9d492a920b0b662a234d15f077825a.

6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

Let’s Encrypt experienced an incident relating to TLS-SNI, which resulted in them stopping issuance using this method [1][2]; later, TLS-ALPN was proposed to allow TLS session based domain control validation while being resilient to the issues they experienced.

This led to an update of the Baseline Requirements removing method 10, the general purpose allowance for TLS based validation, and replacing it with method 20, a provision specifically for TLS-ALPN[3].

Google Trust Services initially launched with an EJBCA based CA utilizing its proprietary APIs and custom tooling built around it. We later migrated much of that workload to an in-house developed CA based on the ACME protocol. It launched in July of 2019 and included TLS-ALPN as the only TLS based validation mechanism supported. At the time, support for TLS-ALPN was allowed by method 10, and later by method 20, of the Baseline Requirements.

To ensure continual compliance with the Baseline Requirements and other operational obligations, we have a formal process involving engineering and compliance stakeholders in which changes to the audit criteria and contractual requirements are reviewed for impact and integrated into our practices. This process takes place bi-weekly unless an urgent matter is identified, triggering a dedicated session.

Due to staffing vacations and time off, this review process may occasionally be delayed to ensure the appropriate stakeholders are present for the review. As a result, when the BR change in question took place, the review was postponed for several weeks. During the subsequent review, when all stakeholders were available, BR version 1.7.1 was reviewed and, while 1.7.2 was discussed, its review was not completed and we missed scheduling the completion of the 1.7.2 review.

In short, as a result of this mistake we did not track the review of 1.7.2, where method 20 was introduced. We have reviewed all changes from 1.7.1 through 1.7.4 and have verified that our CPS properly reflects all of our practices and that they are aligned with the current version of the Baseline Requirements.

We also reviewed all issuances that have taken place since 1.7.2 and what methods were used for domain control validation in each issuance. In that analysis we have identified 1,029,722 certificates that were issued using TLS-ALPN while allowed by the Baseline Requirements but not accommodated by our CPS. Of the associated certificates 248,139 were non-expired and needed to be replaced and revoked.

To address this issue we have published version 3.4 of our CPS which clarifies our use of TLS-ALPN in our issuance practices. In accordance with Baseline Requirements Section 4.9.1.1, we revoked all certificates issued using a method not allowed in our CPS but completed the revocation within 7 days.

We were made aware of this non-conformance on April 22nd and again, due to staffing availability, did not complete review and response until May 1st.

7. List of steps your CA is taking to resolve the situation and ensure that such situation or incident will not be repeated in the future, accompanied with a binding timeline of when your CA expects to accomplish each of these remediation steps.

To ensure that this sort of issue does not occur again, we have scheduled work to create automation that watches for updates to the Baseline Requirements and to Mozilla tickets and automatically creates tickets within our internal tracking systems. This will ensure our day-to-day issue tracking incorporates all changes to the Baseline Requirements without reliance on manual processes.

It will also reduce hard dependencies on specific individuals and enable us to kick off reviews of changes as they occur, without waiting for manual processes.

To augment and monitor this automated system we are increasing the frequency of the bi-weekly review to a weekly review and expanding its scope to ensure the automation and associated processes are operating as designed.

Further, we have scheduled work to reduce the manual steps involved in bulk revocations, which will help reduce the toil and time necessary to respond to such incidents should they occur in the future.

Finally we have defined a new engineering role and assigned an experienced engineer to work with our compliance team. This role will strengthen our compliance review process and enable further compliance related engineering automation.

Sorry for the accidental duplication.

GTS is currently monitoring this incident for comments.

In Comment #4, it was stated:

Due to staffing vacations and time off, this review process may occasionally be delayed to ensure the appropriate stakeholders are present for the review. As a result, when the BR change in question took place, the review was postponed for several weeks. During the subsequent review, when all stakeholders were available, BR version 1.7.1 was reviewed and, while 1.7.2 was discussed, its review was not completed and we missed scheduling the completion of the 1.7.2 review.

In short, as a result of this mistake we did not track the review of 1.7.2, where method 20 was introduced. We have reviewed all changes from 1.7.1 through 1.7.4 and have verified that our CPS properly reflects all of our practices and that they are aligned with the current version of the Baseline Requirements.

In Bug 1612389, Comment #2, there's remarkable similarity here, specifically:

I believe the primary problem with our consensus driven approach is that we were largely discussing the matters in regular weekly meetings where other matters were also covered. This meant that there was sometimes pressure to fit the discussion into an agenda that may have not allowed sufficient research and debate. The right participants were involved, but time pressure appears to have contributed to a bad outcome.

On the surface, it appears to be a duplicate of a past incident, and this is quite concerning for several reasons, chief among them that GTS has seemingly failed to address the past problem, but, equally important for this incident report, that GTS failed to discuss the similarity to their own past incidents and show how they've learned from them. In particular, the responses here seem rather similar to the concerns raised in https://bugzilla.mozilla.org/show_bug.cgi?id=1612389#c7 , which at the time was pointed out as problematic, and thus this report also remains problematic.

Equally, replies such as:

Due to staffing vacations and time off, this review process may occasionally be delayed to ensure the appropriate stakeholders are present for the review. As a result, when the BR change in question took place, the review was postponed for several weeks.

Raise some concern about whether or not GTS has adequately staffed its efforts in supervising compliance. I think it's concerning that this incident report doesn't address why such delays were seen as acceptable, nor how their processes failed to capture what is otherwise a reasonable and normal part of business operation, which does not exclude a CA from any requirements.

In light of the recent thread on m.d.s.p, it would be useful to provide holistic details about GTS' compliance operations, because they do not appear to be sufficient for that of a publicly trusted CA, based on the current repeat issues and failure to appropriately implement compliance requirements. In particular:

  • What is the CA's compliance budget?
  • What resources have been set aside specifically for compliance?
  • To what extent has GTS reviewed the CA incidents that have been reported in Bugzilla over the past several years, including their own incidents?

Given the duplication with Bug 1612389, it appears that GTS has failed to properly perform self-review of its own documentation, policies, and practices even after the following incidents were disclosed:

An even longer list of CA incident bugs related to "Failure to adopt to new requirements" exists, including failures to revoke certificates with underscores, to disclose validation sources, and to disclose problem reporting mechanisms, to name a few. Each of these fits a pattern of a failure to properly monitor and implement changes, and so it does not seem reasonable to accept "human error" as the sole root cause, as is currently proposed.

Consider, for example, the following statement:

We were made aware of this non-conformance on April 22nd and again, due to staffing availability, did not complete review and response until May 1st.

It needs to be pointed out that GTS was reminded on 2021-03-30 of the expectations, but failed to take adequate measures to meet expectations. While this is part of a trend being dealt with on Bug 1708516, it does speak to GTS' failure to review even their own incident reports over the past year prior to compiling this incident report, and that is quite troubling, and fitting with the pattern of problematic behaviour being called out here.

This might seem like nitpicking, but I think it is part of a pattern of deep concern with GTS' incidents that are significantly undermining trust in its operations. In particular, the following text:

In accordance with Baseline Requirements Section 4.9.1.1, we revoked all certificates issued using a method not allowed in our CPS but completed the revocation within 7 days.

This demonstrates a significant misrepresentation of what the Baseline Requirements require, which is revocation within 5 days. Given that this incident is predicated on GTS not adjusting their practices to the BRs, it would appear that GTS has also just admitted to another BR violation, and failed to detect it. If this is excused as misspeaking or a typo, then it highlights a lack of adequate review on these incident reports, which I think is something equally to be raised with Bug 1709223.

Note that the incident report in Comment #4 fails to meet the requirements set forth in https://wiki.mozilla.org/CA/Responding_To_An_Incident, which certainly does not help. If it would be useful, it would certainly be possible to point out the specifics, but given that part of the concerns raised here are GTS' failure to read and follow requirements, and especially in light of the similarities to Bug 1612389 and the suggestion in https://bugzilla.mozilla.org/show_bug.cgi?id=1612389#c9 to implement adversarial reviews, it seems reviewing and redoing the incident report in Comment #4 against the requirements, for all CAs, in https://wiki.mozilla.org/CA/Responding_To_An_Incident would be useful for GTS to demonstrate some progress and awareness of the concerns.

In particular, it would benefit GTS greatly to make sure it's clear and explicit the CA incident reports they have reviewed that can be seen as related to this incident, even if they ultimately determine it to be not applicable. There are numerous and stark parallels to be made, and it's critical for GTS when performing a root cause analysis, which has arguably not yet been performed, to make sure they're aware of industry trends and risks as part of any proposed solutions.

Flags: needinfo?(doughornyak)

5. In a case involving certificates, the complete certificate data for the problematic certificates.

The list of serial numbers for the associated certificates is too large to attach to the bug so we have provided the list at the following URL: https://drive.google.com/file/d/1aQH0bVSieXfpsuz2r40wO9_XyOOln_V5/view?usp=sharing. The SHA256 hash of the file is 84ae395dfcdf38fc908b5d24587d7b263a9d492a920b0b662a234d15f077825a.

I'm having trouble finding some of these serials on crt.sh. Of the first 10 in the list, only the 3rd, 6th and 10th could be found on crt.sh [0] (I haven't tried more, and couldn't get Censys working).

Could you please ensure that all these serials are resolvable to the complete certificate data? That may be either by providing the full certificate data, or ensuring that all these certificate ids are logged to CT logs and subsequently providing hashes over the certificate data. I trust that crt.sh stores all relevant certificates from the CT logs it ingests, and that the CT logs used by Google are ingested by crt.sh.
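For illustration, a minimal sketch of the kind of "hashes over the certificate data" I have in mind, assuming the certificates are available as DER files; the file name is hypothetical, and whether crt.sh resolves a given SHA-256 fingerprint is an assumption on my part:

    # Hypothetical sketch: compute a SHA-256 fingerprint per DER-encoded
    # certificate; such fingerprints identify the exact certificate bytes and
    # (assumption) can also be used as crt.sh search input.
    import hashlib
    from pathlib import Path

    def sha256_fingerprint(der_path: str) -> str:
        der = Path(der_path).read_bytes()   # raw DER certificate bytes
        return hashlib.sha256(der).hexdigest()

    # print(sha256_fingerprint("cert_000001.der"))  # hypothetical file name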

Apart from that, I'd be delighted if you could share which bytes in your serial number are generated from a CSPRNG, as the 0x0900000000 and 0x0a00000000 patterns in the middle were an unexpected surprise to me.

[0]
crt.sh/?serial=e684ee34ee6db1b10900000000660961
crt.sh/?serial=9fa2800c239f7b2f0a00000000c3c71c
crt.sh/?serial=44aeee6c2965cfd50a00000000bfed05
crt.sh/?serial=8f5b4e5cd384f92309000000006349a6
crt.sh/?serial=ee7f54b98465a1b50a00000000be5830
crt.sh/?serial=654f7698d7b52a2f0900000000616501
crt.sh/?serial=ae14f4015c92dc400a00000000c80eb1
crt.sh/?serial=e61cbaccbfc62f4e0a00000000c9eb0b
crt.sh/?serial=ad69aede4afc09070a00000000bdecf8
crt.sh/?serial=018333a67ca2b9dc0a00000000c4ba4a

2021-04-26 The GTS CPS is updated to remove method 10 and add BR method 20
2021-04-30 GTS CPS 3.4 is published in the Repository

The above timeline doesn't match the changelog in GTS' CPS, which states that method 10 was removed in version 3.3 on 2021-04-22. This is a relatively minor mistake, but when coupled with the numerous other mistakes in GTS' recent incident reports (Bug 1708516 and Bug 1709223) it suggests that GTS has a serious problem with accuracy of their incident reports.

We'd like to thank the community for their diligence, comments, and suggestions. As requested, GTS is preparing a new report to submit and expects that it will sufficiently address all parties' concerns.

Flags: needinfo?(doughornyak)

1. How your CA first became aware of the problem

A member of the Mozilla forum filed Bug 1706967 pointing out that our CPS still referenced BR method 10, which had been retired when version 1.7.3 of the Baseline Requirements (BRs) became effective on 2020-09-22 after Ballot SC33.

2. A timeline of the actions your CA took in response.

YYYY-MM-DD (UTC) Description
2019-07-25 GTS introduces support for validation method TLS ALPN
2020-09-22 The BRs are updated to remove method 10 and add method 20 for TLS ALPN.
2020-09-23 BR version 1.7.1 and portions of 1.7.2 are reviewed in a compliance meeting, but a review of the remaining parts of 1.7.2 is not scheduled
2021-04-22 Mozilla Bug 1706967 is filed.
2021-04-22 GTS acknowledges the Bug
2021-04-22 GTS checks its CA configurations to determine which validation methods are actually used and determines that they do not include BR method 10
2021-04-23 The GTS CPS is updated to remove method 10
2021-04-26 The GTS CPS is updated to add BR method 20
2021-04-30 GTS CPS 3.4 is published in the Repository
2021-04-30 Compliance team determines revocation is needed and begins reaching out to customers to plan re-issuance and revocation
2021-05-01 The re-issuance of associated certificates begins
2021-05-01 The re-issuance completes
2021-05-01 The revocation of associated certificates completes
2021-05-03 GTS shares a first incident report
2021-05-17 GTS shares this revised incident report

3. Whether your CA has stopped, or has not yet stopped, certificate issuance or the process giving rise to the problem or incident.

We have reviewed the configurations of all our CAs and confirmed that TLS ALPN (now method 20) is the only random-number-based validation method we have used since 2020-09-22. Section 3.2.2.4 of the GTS CPS has been updated accordingly.

4. In a case involving certificates, a summary of the problematic certificates. For each problem: the number of certificates, and the date the first and last certificates with that problem were issued. In other incidents that do not involve enumerating the affected certificates (e.g. OCSP failures, audit findings, delayed responses, etc.)

1,029,743 certificates were issued using TLS-ALPN when our CPS did not correctly document our use of TLS-ALPN in accordance with the current Baseline Requirements. The root of the issue is not the certificates themselves, but rather which section number we referenced within the CPS for the use of this method.

5. In a case involving certificates, the complete certificate data for the problematic certificates.

The list of serial numbers for the associated certificates is too large to attach to the bug so we have provided the list at the following URL: https://drive.google.com/file/d/1aQH0bVSieXfpsuz2r40wO9_XyOOln_V5/view?usp=sharing. The SHA256 hash of the file is 84ae395dfcdf38fc908b5d24587d7b263a9d492a920b0b662a234d15f077825a.

The above list consists of the hex encoded serial numbers in the certificates, which in many cases will work directly with crt.sh, but in some cases crt.sh requires additional formatting.

Implementation-wise, our serials are generated using 64 bits of output from a CSPRNG followed by encoded information. If the most significant bit of the CSPRNG's output is 1, we prepend a 0, leading to 17 bytes in the encoding instead of 16; it appears that in those cases crt.sh needs the prepended 0 to query directly on the serial number.
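To make the encoding concrete, the following is a minimal sketch, not our production code, of the layout described above and of the zero-padding that some crt.sh serial queries appear to need:

    # Sketch only: 8 CSPRNG bytes followed by 8 bytes of encoded information.
    import secrets

    def make_serial(encoded_info: bytes) -> bytes:
        assert len(encoded_info) == 8
        return secrets.token_bytes(8) + encoded_info

    def crtsh_serial_query(serial: bytes) -> str:
        # DER represents positive INTEGERs whose top bit is set with a leading
        # 0x00 byte (17 bytes instead of 16), and crt.sh sometimes only matches
        # that zero-padded form.
        hex_serial = serial.hex()
        if serial[0] & 0x80:
            hex_serial = "00" + hex_serial
        return "https://crt.sh/?serial=" + hex_serial

    # One of the serials quoted earlier in this bug:
    print(crtsh_serial_query(bytes.fromhex("e684ee34ee6db1b10900000000660961")))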

To make it easier to use this list to query crt.sh, we have produced a new file containing the crt.sh link for each certificate in the expected encoding.

URL https://drive.google.com/file/d/1YnQMVA5r06HcMn0cZ4Jvp0G1y2qO5hy7/view?usp=sharing
The SHA256 hash of the file is f87d705c5a4a49f88bba3b28b36638a96feed500c858f007c41ed7f407621bb9.

6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

Google Trust Services launched in 2016 with an EJBCA based CA utilizing EJBCA's proprietary APIs and custom tooling built around those APIs. Starting in 2019, we began to migrate most GTS issuance to an in-house developed CA based on the ACME protocol with the goals of accelerating the automation of certificate lifecycle management, increasing scalability, reducing manual processes involved in CA operation, and improving security through a reduced attack surface.

This system launched in July of 2019 and now represents the large majority of all certificates issued by GTS. This release included TLS-ALPN[1] as the only TLS based validation mechanism supported. At the time, support for TLS-ALPN was described by method 10.

Let’s Encrypt experienced an incident in January 2018 relating to TLS-SNI, the precursor to TLS-ALPN[1], which resulted in them stopping issuance using this method [2][3]. TLS-ALPN[1] was proposed in September 2018 and ratified in February 2020 as a replacement for TLS-SNI.

This change led to a September 2020 update of the Baseline Requirements disallowing method 10, which accommodated TLS-SNI as well as TLS-ALPN[1], and replacing it with method 20, a provision specifically for TLS-ALPN[4]. When this update took place, we missed updating our CPS accordingly.
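For readers less familiar with the mechanism, the following is a hedged sketch of the checks at the heart of tls-alpn-01 (RFC 8737), which method 20 accommodates; the function and parameter names are illustrative and this is not our validation implementation:

    # Illustrative only. Under RFC 8737 the applicant serves a self-signed
    # certificate for the ALPN protocol "acme-tls/1" whose acmeIdentifier
    # extension carries SHA-256(key authorization) and whose only SAN is the
    # FQDN being validated.
    import hashlib

    ACME_TLS_1 = "acme-tls/1"
    ID_PE_ACME_IDENTIFIER = "1.3.6.1.5.5.7.1.31"  # acmeIdentifier extension OID

    def expected_digest(token: str, account_key_thumbprint: str) -> bytes:
        key_authorization = token + "." + account_key_thumbprint
        return hashlib.sha256(key_authorization.encode("ascii")).digest()

    def tls_alpn_01_ok(negotiated_alpn: str, san_dns_names: list,
                       acme_ext_digest: bytes, fqdn: str,
                       token: str, account_key_thumbprint: str) -> bool:
        # All three conditions must hold for the validation to pass.
        return (negotiated_alpn == ACME_TLS_1
                and san_dns_names == [fqdn]
                and acme_ext_digest == expected_digest(token, account_key_thumbprint))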

To better explain the root cause of this mistake, it's best if we start with Bug 1612389[5], where an internal review identified a compliance issue in a certificate profile. In that incident we determined the root cause was an over-reliance on human review for certificate profile correctness.

At the time, the program was largely driven from the governance perspective. As a result of that incident, we made several changes to address this limitation: we restructured the program, introduced a new cadence of compliance meetings, structurally improved the associated processes, and increased engineering and operations participation. During those compliance meetings we reviewed bugs, ballots, and requirement updates, and as a result we improved a number of processes.

Unfortunately, those improvements did not sufficiently reduce reliance on the human elements of reviews. Additionally, the staffing of this group did not adequately accommodate the ebb and flow of resource availability that results from both planned and unplanned leave or unexpected work assignments. As a result, this review process could be delayed to ensure the appropriate stakeholders are present.

This is exactly what happened in this case. During the review, BR version 1.7.1[6] was reviewed and, while 1.7.2[7] was discussed, its review was not completed and we missed scheduling the completion of the 1.7.2[7] review.

When we reviewed our root cause analysis for incident 1612389[5], we saw parallels and determined the true root causes for this incident and 1708516[8] were two other process elements.

First, we still had insufficient engineering representation in the compliance review meetings to accommodate the ebb and flow of resource availability. When the engineering representatives who regularly participated in this process were not available, we could miss the opportunity to assess how the changes map to subtle implementation details and how those details align with the stated practices in our CPS. If this additional staffing had been in place, it would have led to an earlier identification of the mismatch.

Second, we had no automation to monitor policy changes and response times. We believe that automation would have helped identify missing reviews and, through alerts based on timelines, notify our Policy Authority and Engineering Compliance Lead. In addition, automation would help flag issues that directly affect us as higher priority. This is why we have adopted the changes outlined in Section 7.

We believe that with these and the other changes outlined in Section 7 we will successfully prevent similar incidents in the future. This is also in line with our goal of automating as much of our compliance program as possible and will help reduce the likelihood of manual process failures.

As a result of the above, we have since reviewed all changes from 1.7.1[6] through 1.7.4[9] of the BRs and have verified that our CPS only lists approved validation methods as defined in the current BRs.

In parallel we also reviewed all issuances that have taken place since 1.7.2[7] and what methods were used for domain control validation in each issuance. In that analysis we have identified 1,029,722 certificates that were issued using TLS-ALPN while allowed by the Baseline Requirements but not accommodated by our CPS. Of the associated certificates 248,139 were non-expired.

Following our analysis, we determined the most conservative action was to revoke and renew the associated certificates.

Per the BRs, the revocation was to be completed within 5 days. Because our analysis of the implications only started on 2021-04-30, we missed the original timeframe for the revocation. Once we identified the revocation obligations, we revoked all but 3 of the certificates within 2 days. The final 3 revocations took 2 additional days due to coordination with customers.

Several other questions have come up since our initial incident response that we would like to provide clarity on.

The first is related to our CPS versioning. When this incident was opened, we immediately removed method 10 from our CPS; this was versioned as 3.3, but that version was not published publicly because further changes to reflect the reliance on method 20 were needed. Once those changes were made, the result became version 3.4 and was published.

The second is our compliance budget. We cannot share financial figures concerning our compliance efforts, nor is it possible to offer a holistic description of a company’s compliance program in a forum post. Instead we can mention a few items that provide an indication of the size and seriousness of our investment in this area.

We engage with two auditors, both of which are active participants in the WebPKI and the WebTrust Task Force. The first auditor has worked with us since 2017 in both pre-audit and advisory capacities. The second auditor has been engaged for our certification audits.

The engagement with the first WebTrust practitioner helped us design internal controls and obtain expert opinions on important questions. Among other things, both auditors performed a full review of our CP and CPS in 2020 prior to the BR change.

Our internal compliance program is led by a dedicated resource and is supported by a Policy Authority team made up of representatives of different disciplines within Google Trust Services. In addition the program is supported by a compliance organization made up of people helping with audit coordination, compliance process consulting, risk management, internal audits, managing physical security, etc.

On the implementation level, we engage with an independent security reviewer every year who performs an assessment of our CA infrastructure to identify vulnerabilities that might exist at any level of the technology stack. The assessment covers CA software, the operating system and its configuration, network infrastructure, CA servers and other hardware, security support systems and people.

7. List of steps your CA is taking to resolve the situation and ensure that such situation or incident will not be repeated in the future, accompanied with a binding timeline of when your CA expects to accomplish each of these remediation steps.

To prevent similar issues from happening again, we are implementing automation that monitors the Baseline Requirements document repository and Mozilla tickets and automatically creates tickets in our internal tracking system.

We believe that by running these updates through our general issue tracking processes, we will ensure that there is a documented action item for every BR or policy change and that the implementation status of each item is visible to a larger audience in both the compliance and the engineering teams. This will reduce the process dependency on specific individuals, and work on open action items can begin independently of the compliance meeting schedule. The implementation will be complete by 2021-06-15.
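As a rough sketch of the shape this automation will take, not the final implementation, the following polls Mozilla's Bugzilla REST API for CA compliance bugs and hands them to a placeholder internal ticketing call; the query parameters and the internal API are assumptions for illustration only:

    # Illustrative sketch; create_internal_ticket is a hypothetical placeholder
    # for our internal tracking system's API.
    import requests

    BUGZILLA = "https://bugzilla.mozilla.org/rest/bug"

    def fetch_ca_compliance_bugs() -> list:
        params = {
            "whiteboard": "[ca-compliance]",  # assumed search field
            "include_fields": "id,summary,status,last_change_time",
        }
        resp = requests.get(BUGZILLA, params=params, timeout=30)
        resp.raise_for_status()
        return resp.json().get("bugs", [])

    def create_internal_ticket(bug: dict) -> None:
        # Placeholder: would file a ticket in the internal tracking system.
        print("would file internal ticket for bug %d: %s" % (bug["id"], bug["summary"]))

    for bug in fetch_ca_compliance_bugs():
        create_internal_ticket(bug)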

In addition, we have increased the frequency of the bi-weekly reviews to weekly and allocated additional time to them.

To augment our certificate rotation capabilities, we are developing additional tooling that can execute large scale revocations without significant human involvement. This will further bring down the turnaround time for such operations. We expect this to be done by 2021-06-30.

Finally we have created a new engineering compliance role and appointed an experienced engineer to it. This role will strengthen our internal review processes by ensuring that the engineering implications of future BR changes are well covered and give us more staffing depth in the compliance reviews so we can better accommodate reviewer availability.

The first task of this role is to own the implementation projects for the new automation tooling.

Our timeline for implementing the aforementioned changes is the following. We will provide regular progress updates as they are completed.

YYYY-MM-DD Description
2021-05-03 The new engineering role has been created and an experienced engineer has been added to the compliance team.
2021-05-03 The frequency of the bi-weekly review has been increased to weekly and participation from the engineering side has been increased
2021-06-15 Solution for automated issue tracking to be complete
2021-06-30 Improved tools that enable easier mass revocation.

Google Trust Services is monitoring this tread for any additional updates or questions.

s/tread/thread/

Thank you for Comment #11.

Unfortunately, this incident report does not properly identify what are clear root causes, externally, and thus fails to provide sufficient assurance that GTS has adequately understood the concerns or taken steps to mitigate them.

As mentioned in Bug 1708516 Comment #19 and Bug 1708516 Comment #21, we will provide a more detailed report in the coming days that hopefully highlights these concerns to GTS more comprehensively, in a way that GTS is able to understand and act upon, as the current Comment #11 does not address the concerns raised. While I recognize that GTS is confident in its proposed changes, ample evidence from GTS’ past incidents suggests that there are still major compliance issues undermining public trust.

Google Trust Services is monitoring this thread for any additional updates or questions.

Google Trust Services is monitoring this thread for any additional updates or questions.

(In reply to Ryan Sleevi from comment #7)

This demonstrates a significant misrepresentation of what the Baseline Requirements require, which is revocation within 5 days. Given that this incident is predicated on GTS not adjusting their practices to the BRs, it would appear that GTS has also just admitted to another BR violation, and failed to detect it.

Filed Bug 1715421 for tracking this incident, as GTS has not yet done so.

Google Trust Services is monitoring this thread for any additional updates or questions.

In accordance with the timeline provided, we have implemented and deployed the tooling that will monitor GTS-related Mozilla Bugzilla bugs.

It does this by mirroring the relevant Mozilla Bugzilla incidents into our internal bug tracking system. This will ensure that the existing processes used daily to operate and maintain GTS services can also track community-reported issues. Importantly, it will also give us reporting on SLA compliance (an illustrative sketch follows the list below), including:

  • New bugs related to GTS,
  • Bugs that need an update,
  • Bugs assigned to GTS, and
  • Bugs with the needinfo flag set for GTS.
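As an illustration of the kind of SLA reporting described above, the following sketch flags a mirrored bug as needing attention; the 7-day threshold, field names, and helper are assumptions rather than the deployed tool:

    # Hypothetical sketch of an SLA check over mirrored bug records.
    from datetime import datetime, timedelta, timezone

    UPDATE_SLA = timedelta(days=7)  # assumed cadence for public updates

    def needs_attention(last_change_time: str, needinfo_for_gts: bool,
                        now: datetime) -> bool:
        last_change = datetime.fromisoformat(last_change_time.replace("Z", "+00:00"))
        stale = now - last_change > UPDATE_SLA
        return stale or needinfo_for_gts

    # Example: a bug untouched for two weeks with an open needinfo flag.
    print(needs_attention("2021-06-01T00:00:00Z", True,
                          now=datetime(2021, 6, 15, tzinfo=timezone.utc)))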

Additionally, this tooling enables cloning other CA incidents into the same tracking system, which will provide an internal forum where all GTS engineers can participate in adversarial reviews of other CA incidents. Some benefits include:

  • Improve our ability to learn from other incidents,
  • Improve our ability to take action when needed, and
  • More quickly and completely incorporate lessons learned into trainings and playbooks.

We believe this closes the committed deliverable for the automation mentioned in this bug.

Google Trust Services is monitoring this thread for any additional updates or questions.

Google Trust Services is monitoring this thread for any additional updates or questions.

Thanks Fotis. I'm still trying to get what I mentioned in Comment #14 out to GTS here.

Google Trust Services is monitoring this thread for any additional updates or questions.

Google Trust Services is monitoring this thread and will provide a response soon.

We are monitoring this thread for any additional updates or questions.

Ryan, just to ensure that we are addressing all of your comments, can you verify that your concerns have been presented at Bug 1708516 Comment 35 and can be addressed there?

Flags: needinfo?(ryan.sleevi)

Yes.

Note that Bug 1708516, Comment #35 suggests that there are concerns with the incident management/root cause analysis side, and that it's useful and important to resolve those issues first, to see how those improvements can be applied to these existing issues. So I think the path here would be a dependency on getting Bug 1708516 on track to resolution, and then revisiting to see how those lessons can be applied for a better analysis and resolution here.

Flags: needinfo?(ryan.sleevi)

We are monitoring this thread for any additional updates or questions.

Google Trust Services is monitoring this thread for any additional updates or questions.

Google Trust Services is monitoring this thread for any additional updates or questions.

We are monitoring this thread for any additional updates or questions.

We are monitoring this thread for any additional updates or questions.

Google Trust Services is monitoring this bug for any additional updates or questions.

It seems to me that this bug can now be closed. I will schedule doing so for next Wed. 25-Aug-2021.

Flags: needinfo?(bwilson)
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Flags: needinfo?(bwilson)
Resolution: --- → FIXED
Product: NSS → CA Program
Whiteboard: [ca-compliance] → [ca-compliance] [policy-failure]