Google Trust Services: Missing authorization audit log entry for certificate issuance
Categories
(CA Program :: CA Certificate Compliance, task)
Tracking
(Not tracked)
People
(Reporter: gts-external, Assigned: gts-external)
Details
(Whiteboard: [ca-compliance] [policy-failure])
Preliminary Incident Report
Summary
- Incident description: Google Trust Services (GTS) was alerted by our self-auditing tool of a corner case in our software that resulted in one certificate [1] being issued without one of its certificate lifecycle events being recorded in our audit log. We revoked the certificate in question. GTS is developing a fix and has mitigations in place to prevent a recurrence in the meantime.
- Relevant policies:
- Source of incident disclosure: GTS was alerted by our self-auditing tool.
GTS will publish a full incident report by 2025-08-06.
Updated•8 months ago
Comment 1•8 months ago
Google Trust Services (GTS) is continuing to actively work on the full incident report. We're on track to publish the complete report by the originally stated date of August 6, 2025.
Comment 2•8 months ago
Full Incident Report
Summary
- CA Owner CCADB unique ID: A004159
- Incident description: Google Trust Services (GTS) was alerted by our self-auditing tool of a missing log detail for one certificate. The certificate was issued without one of its certificate lifecycle authorization events being recorded in our audit log due to a race condition. We initially thought the issue was caused by a load test in our staging environment affecting other environments and conducted an expedited restart of jobs across environments and locations to clear any lingering load test remnants. It turned out the load test timing correlation was coincidental. We determined during debugging and recovery that the actual trigger was the expiration of a security token used by our resolver, which happened to be refreshed by the service restart. Regardless of the trigger, the expired credential exposed a previously unknown race condition that could allow an ACME authorization to go unrecorded.
- Timeline summary:
- Non-compliance start date: 2025-07-22
- Non-compliance identified date: 2025-07-25
- Non-compliance end date: 2025-07-25
- Relevant policies:
- Source of incident disclosure: GTS was alerted by our self-auditing tool.
Impact
- Total number of certificates: 1
- Total number of "remaining valid" certificates: 0
- Affected certificate types: This incident affected a DV certificate (OID 2.23.140.1.2.1).
- Incident heuristic: It is not possible for a 3rd party to identify this issue as it involves reconciliation of data internal to the issuance process.
- Was issuance stopped in response to this incident, and why or why not?: Issuance was not stopped because we had mitigations in place to prevent a recurrence of the issue that affected the one certificate.
- Analysis: N/A - revocation was not delayed
- Additional considerations: None
Timeline
All times in UTC
2020-03-17 07:25 - A code change introducing the race condition is submitted to our version control system
2025-07-06 21:45 - The workload authentication credential used by a new resolver service being tested in a dark launch expires. Since it is in dark launch, the lookup results of this resolver are not used in domain control validation
2025-07-22 12:48 - GTS begins a load test against its staging infrastructure
2025-07-22 13:15 - The resolver attempts to authenticate using the expired credential and fails. A bug in the implementation of the dark launch caused its failure to increase latency on production traffic
2025-07-22 14:56 - A GTS engineer notices an increase in latency on its production infrastructure and begins an investigation
2025-07-22 15:52 - The load test is terminated early as it appears to be causing impact outside of staging infrastructure. This turned out to be a red herring
2025-07-22 17:43 - GTS notices elevated error rates for ACME requests
2025-07-22 17:51 - Domain control validation is initiated for the certificate that is the subject of this incident
2025-07-22 17:54 - Domain authorization is granted following successful and correctly logged domain control validation, but the authorization event is not logged due to a timeout error caused by the increased latency
2025-07-22 17:54 - The certificate that is the subject of this incident is issued
2025-07-22 18:00 - GTS identifies that the dark launched resolver service is the cause of the latency and begins procedures to reset it to alleviate elevated error rates
2025-07-22 19:25 - Latency is confirmed to decrease following the service reset. The latency issue is considered resolved
2025-07-25 16:45 - The internal self-auditing tool flags a record as missing from one of the two log sources. The tool fails to reconcile the two data sources against each other and automatically files a bug
2025-07-25 17:01 - A GTS engineer investigates the finding and confirms that the authorization log entry is missing from one of the log sets
2025-07-25 17:13 - The GTS incident response procedure is initiated
2025-07-25 17:45 - A full review of all logs and events related to the finding is completed. GTS confirms that the certificate was validated correctly at the time of issuance, however the authorization event was not properly added to the audit log
2025-07-25 18:14 - The Subscriber for the certificate is notified that we are investigating issues with their certificate
2025-07-25 19:05 - The Subscriber preemptively renews their certificate
2025-07-25 19:42 - The GTS Policy Authority declares that failing to write the entry to the audit log is a violation of GTS’s CP/CPS section 5.4.1 and declares an incident
2025-07-25 20:03 - The certificate is revoked
2025-07-25 20:39 - As a short term mitigation, the entire GTS team is notified that expedited service restarts are to be avoided until a fix can be deployed
2025-07-29 14:34 - A fix that records the successful authorization in the audit logs before updating the database is submitted
2025-07-30 15:38 - During the review of the incident timeline, GTS determines that the incident trigger was not the service restarts as initially thought. Further investigation begins
2025-07-30 18:32 - Investigation is complete. The trigger for the failure to log was that a security token used by the dark-launched resolver had expired; once the resolver attempted to re-authenticate with the expired token, it silently failed
Related Incidents
| Bug | Date | Description |
|---|---|---|
| 1948368 | 2025-02-14 | While different in nature, both bugs have to do with GTS’s self-auditing tool and the way audit logs are recorded and used. |
| 1423624 | 2017-12-06 | This bug covers a Comodo CAA checking bug due to a race condition. The infrastructures and code are completely independent, but there is a bit of commonality in terms of how race conditions can be caused by seemingly isolated changes. |
Root Cause Analysis
**Root Cause:** The root cause of the incident was that an authorization database record was created to indicate it was possible to issue a certificate, but the corresponding audit log entry was not written due to lack of atomicity.
Contributing Factor #1: Lack of atomicity in the code that updates the authorizations database and the logs
- Description: The lack of atomicity in the code that updates the authorizations database and the logs allowed for the incident to occur. This is because a request and its associated thread were cancelled in between the two write events, which occur across two datastore systems.
- Timeline: The code that introduced the bug was pushed into production on 2020-03-17 at 07:25 UTC.
- Detection: GTS detected the issue when the auditor tool noticed a mismatch between the two databases and flagged the issue.
- Interaction with other factors: N/A
- Root Cause Analysis methodology used: 5 whys
Contributing Factor #2: The dark launched resolver went into lame-duck mode (not supposed to receive new connections, but allowing open requests to complete), but still received new traffic.
- Description: Several instances of the service went into lame-duck mode due to a dependent service not being available. However, traffic was still being sent to those instances. Requests in the affected instances would sit there until they were cancelled. This increased the overall latency of the validation process. The increased latency resulted in a higher probability of timeouts, and in this case a timeout occurred and terminated the thread after updating the database, but before it could write to the audit log.
- Timeline: The dark launched resolver went into lame-duck mode on 2025-07-22 at 13:15 UTC when the resolver attempted to authenticate using an expired credential and failed.
- Detection: An investigation was conducted to identify why the error rate in the service was increased.
- Interaction with other factors: This contributing factor increased the likelihood of the scenario where the request was cancelled between updating the database and recording the authorization in the logs.
- Root Cause Analysis methodology used: 5 whys
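The fix noted in the timeline and action items (write the audit log entry first, then update the database, and block issuance on the database record) can be sketched roughly as follows. All class and function names here are hypothetical Python stand-ins, not GTS's actual code; the point is the failure mode: a request cancelled between the two writes can now only leave a logged-but-not-yet-authorized state, which blocks issuance rather than permitting an unlogged one.

```python
# Illustrative sketch of the write-ordering fix; names are hypothetical.

class AuditLog:
    def __init__(self):
        self.entries = []

    def append(self, event):
        # Durable append of a lifecycle event.
        self.entries.append(event)

class AuthzStore:
    def __init__(self):
        self.authorized = set()

    def mark_authorized(self, authz_id):
        self.authorized.add(authz_id)

def record_authorization(audit_log, authz_store, authz_id, cancel_between=False):
    # Step 1: record the lifecycle event in the audit log FIRST. If the
    # request is cancelled before step 2, the authorization simply does not
    # exist yet in the database, which is a safe failure mode.
    audit_log.append({"event": "authorization_granted", "id": authz_id})
    if cancel_between:
        return  # simulates the request/thread being cancelled mid-way
    # Step 2: only now mark the authorization as usable for issuance.
    authz_store.mark_authorized(authz_id)

def may_issue(authz_store, authz_id):
    # Issuance blocks on the database record, which (by the ordering above)
    # can only exist if the audit log entry was written first.
    return authz_id in authz_store.authorized
```

With the original ordering reversed (database first, log second), the same mid-way cancellation would have produced the incident's authorized-but-unlogged state.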
Lessons Learned
- What went well:
- The authorization database record was updated successfully. Its existence, along with the other audit log entries, confirms that the code correctly granted the authorization and that only a single audit log entry was missed.
- The affected subscriber was able to rotate to a new certificate quickly.
- What didn’t go well:
- The authentication credential used by the resolver expired before being rotated.
- We lost time investigating the load test as a potential cause of the issue because the timing was coincidental.
- Where we got lucky:
- The issue only affected a single certificate
- The race condition was hard to trigger and had not triggered in the past five years, as verified by our self-auditing tool’s log reconciliation
- The service restarts refreshed the authentication credential despite that not being the reason we conducted them
- Additional:
Action Items
| Action Item | Kind | Corresponding Root Cause(s) | Evaluation Criteria | Due Date | Status |
|---|---|---|---|---|---|
| Record the successful authorization in the audit logs before updating the database to indicate the certificate is authorized to issue. The CA software will not permit issuance until the authorization record exists in the database. | Prevent | Root Cause | A test was created to send out many validation attempts at once and abruptly terminate the service to replicate the circumstances that caused the incident. Then we confirmed our logs properly matched the state of the databases. Our daily audit runs will also continue to verify whether the fix succeeded. | 2025-07-31 | Complete |
| Ensure services respect lame-duck mode | Mitigate | Contributing Factor #2 | Place a service in lame-duck mode and verify that traffic does not go to it | 2025-09-12 | In Progress |
| Ensure that resolver authentication credentials do not expire before being refreshed | Prevent | Contributing Factor #2 | Rotation logic and alerting are in place for resolver authentication credentials. Other services already have both | 2025-09-12 | In Progress |
| Identify and remediate other instances in our code that could fail to write a log before issuing a certificate. | Prevent | Contributing Factor #1 | GTS will perform a focused review of its code | 2025-10-10 | In Progress |
Appendix
Comment 3•7 months ago
Thank you for the full report in Comment 2. We especially appreciate the detailed timeline of events and the technical explanation of the race condition in the RCA.
(Comment) Adding explanation for what might not be widely-known terminology (e.g., “lame-duck mode” and “dark launch”) would presumably be beneficial to readers.
(Comment) The appendix appears blank.
(Q1) A complete and verifiable audit trail is a foundational requirement for a CA. Given this:
- a. Why was a reconciliation process to ensure every issued certificate has a corresponding authorization log event not included in the system's initial design?
- b. Following this incident, is GTS planning to review other certificate lifecycle events (e.g., validation, revocation) to ensure similar reconciliation audits are in place to guarantee the integrity of the entire audit log?
(Q2) Unrelated to the above, we understand the timeline to show the internal self-auditing tool detected the issue (by way of the missing audit record) approximately three days after issuance. Has GTS considered ways of reducing this timing such that alerting and response would be near instantaneous?
(Q3) Also unrelated to the above, we note the affected subscriber renewed their certificate within an hour of being notified.
- a. Can you describe the process for contacting the affected subscriber? We’re interested to understand how this would have scaled, for example, if 500, 50,000 or 500,000 subscribers were affected.
- b. Was this renewal triggered by ARI, or was this a result of manual subscriber intervention?
- c. Did GTS assess whether ARI could have benefitted the subscriber's renewal given the observed in-use ACME client?
- d. Has GTS performed the same ARI capability evaluation on its other subscribers in an effort to encourage broader, proactive adoption of ARI? If so, what has this looked like? If not, what barriers have prevented you from doing so?
Comment 4•7 months ago
Thank you for the comments and questions. GTS is reviewing and will provide responses by this Friday, August 15th at the latest.
Comment 5•7 months ago
(In reply to chrome-root-program from comment #3)
Thank you for the full report in Comment 2. We especially appreciate the detailed timeline of events and the technical explanation of the race condition in the RCA.
(Comment) Adding explanation for what might not be widely-known terminology (e.g., “lame-duck mode” and “dark launch”) would presumably be beneficial to readers.
We tried to use the terms in context that explained them, but good call on defining them explicitly. 'Lame-duck' mode is an option for a server to stop accepting new connections but complete in-flight work. If a server is put in lame-duck mode behind a load balancer, the load balancer will notice it is no longer accepting new requests and send new requests to another server to gracefully transfer load or avoid an instance that is not going to be available soon.
A 'dark launch' is a launch where we send production traffic to a new service component, but do not rely on the component for issuance or validation decisions. This allows us to compare results from both the existing component and the new component to ensure any behavior or performance differences are intended.
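The dark-launch isolation described above can be sketched as follows; the resolver callables and diff log are illustrative assumptions, not GTS's code. The candidate component's answer is only compared, never used, and its failures are meant to be invisible to production (the latency impact in this incident came from a bug in exactly this isolation).

```python
def resolve_with_dark_launch(primary, candidate, name, diff_log):
    # The primary resolver's answer is authoritative and is the only one
    # used for validation decisions.
    result = primary(name)
    try:
        shadow = candidate(name)  # dark-launched component under evaluation
        if shadow != result:
            # Record the disagreement for offline analysis.
            diff_log.append((name, result, shadow))
    except Exception:
        # A candidate failure must never affect the production result.
        pass
    return result
```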
(Comment) The appendix appears blank.
Sorry, we omitted it because only one certificate was affected and its details were provided in the body of the report, so we did not think it made sense to repeat them in the appendix. We understand that even when a very small number of certificates is inlined in the body of the report, they should also be listed in the appendix. We will do so in the future.
(Q1) A complete and verifiable audit trail is a foundational requirement for a CA. Given this:
- a. Why was a reconciliation process to ensure every issued certificate has a corresponding authorization log event not included in the system's initial design?
We agree that a complete and verifiable audit trail is a foundational requirement for a CA. Our system was initially designed to ensure that every issued certificate has corresponding log entries that authorization took place correctly. To explain this, first let us give a few more details about the authorization process and the log entries that are written:
- When a client indicates to us that it is ready for challenge validation to be attempted, GTS systems validate the challenge from 6 different perspectives.
- Each perspective must successfully write an audit log entry of its validation attempt and outcome. A perspective will not report success unless this log write is successful. Validation logs for each perspective were present for the affected certificate.
- The log which was missing recorded aggregate results from all of the perspectives. The only information in this log required by the BRs is the quorum result. Our system still computed quorum correctly; only the quorum value was not logged. We know this because we can re-compute the quorum by looking at each individual perspective’s audit log.
MPIC was not part of the initial system’s design. The initial system was designed to block on an audit log entry written from the primary perspective which was the only validation required at the time. The oversight occurred when MPIC logging was added, and the quorum information was written to a log entry that didn’t block issuance. Our code has been updated to ensure this can no longer happen. The action item “Identify and remediate other instances in our code that could fail to write a log before issuing a certificate” is intended to identify any other occurrences of this.
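The re-computation described above can be sketched like this; the entry shape and the quorum threshold are illustrative assumptions, since the report does not specify the exact quorum policy.

```python
def recompute_quorum(perspective_entries, required_successes):
    # Each perspective wrote its own audit entry before reporting success,
    # so the aggregate quorum result can be reconstructed from those
    # per-perspective entries even when the aggregate log record is missing.
    successes = sum(1 for e in perspective_entries if e["outcome"] == "success")
    return successes >= required_successes
```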
- b. Following this incident, is GTS planning to review other certificate lifecycle events (e.g., validation, revocation) to ensure similar reconciliation audits are in place to guarantee the integrity of the entire audit log?
Our existing action item “Identify and remediate other instances in our code that could fail to write a log before issuing a certificate” is intended to cover such work.
(Q2) Unrelated to the above, we understand the timeline to show the internal self-auditing tool detected the issue (by way of the missing audit record) approximately three days after issuance. Has GTS considered ways of reducing this timing such that alerting and response would be near instantaneous?
We've updated configurations to reduce detection time to approximately 24 hours. Due to the end-to-end delivery latency SLOs of our event record logging system - i.e. the amount of time between one system writing the log and that log being able to be read by another, independent system - near instantaneous reconciliation is not desirable. Reducing the time further would significantly increase the possibility of false findings, causing alert fatigue.
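At its core, the reconciliation the self-auditing tool performs can be sketched as a set difference between two log sources; the field names here are illustrative assumptions, not GTS's schema.

```python
def find_missing_authz_logs(issuance_log, authz_audit_log):
    # Every issued certificate must have a corresponding authorization
    # audit entry; anything left over is flagged for investigation.
    issued = {e["serial"] for e in issuance_log}
    logged = {e["serial"] for e in authz_audit_log}
    return sorted(issued - logged)
```

Because the two sources have independent delivery latencies, comparing them before both have settled would surface false positives, which is the alert-fatigue trade-off described above.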
(Q3) Also unrelated to the above, we note the affected subscriber renewed their certificate within an hour of being notified.
- a. Can you describe the process for contacting the affected subscriber? We’re interested to understand how this would have scaled, for example, if 500, 50,000 or 500,000 subscribers were affected.
The subscriber in question was internal to Google. In this case we were able to contact the subscriber's service team directly. However, GTS has implemented automation to email subscribers in case of a mass revocation event. The automation is capable of emailing all current subscribers with active certificates within 2 hours, in addition to scheduling the certificates for revocation within timelines mandated in BRs 4.9.1.1. We use this automation any time revocation is required for more than a small handful of users.
The load testing we mentioned in our report was related to further improving performance of our mass revocation automation, which we’ve been improving in response to mass revocation preparedness requirements in MRSP 3.0 and the recent ballot SC-089.
- b. Was this renewal triggered by ARI, or was this a result of manual subscriber intervention?
GTS believes that ARI could have benefited this use case as well as the ecosystem as a whole. In this instance, the subscriber manually renewed their certificate. The subscriber has implemented ARI but has not fully released their ARI support so they conducted a manual renewal and re-issuance.
- c. Did GTS assess whether ARI could have benefitted the subscriber's renewal given the observed in-use ACME client?
See above.
- d. Has GTS performed the same ARI capability evaluation on its other subscribers in an effort to encourage broader, proactive adoption of ARI? If so, what has this looked like? If not, what barriers have prevented you from doing so?
We require all subscribers who have quota limits above our defaults to commit to enabling ARI support as a condition to get the increased quota. We do not block quota increases on ARI being active but a commitment to implement it within a reasonable period of time must be in place.
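For readers unfamiliar with ARI (ACME Renewal Information, RFC 9773): a client periodically fetches a per-certificate renewalInfo resource and renews within the CA-suggested window, which lets the CA pull renewals forward during an incident. A hedged sketch of the client-side check follows; the suggestedWindow field names follow the RFC, but the surrounding code is illustrative.

```python
from datetime import datetime, timezone

def in_suggested_window(renewal_info, now=None):
    # 'renewal_info' mirrors a renewalInfo resource body per RFC 9773, e.g.
    # {"suggestedWindow": {"start": "<RFC 3339>", "end": "<RFC 3339>"}}.
    now = now or datetime.now(timezone.utc)
    window = renewal_info["suggestedWindow"]
    start = datetime.fromisoformat(window["start"])
    end = datetime.fromisoformat(window["end"])
    # Renew as soon as the current time falls inside the window.
    return start <= now <= end
```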
We’ve added a new action item to reflect the work that was completed to reduce the delay in the self-auditing tool to ~24 hours. Updated action item table below.
Action Items
| Action Item | Kind | Corresponding Root Cause(s) | Evaluation Criteria | Due Date | Status |
|---|---|---|---|---|---|
| Record the successful authorization in the audit logs before updating the database to indicate the certificate is authorized to issue. The CA software will not permit issuance until the authorization record exists in the database. | Prevent | Root Cause | A test was created to send out many validation attempts at once and abruptly terminate the service to replicate the circumstances that caused the incident. Then we confirmed our logs properly matched the state of the databases. Our daily audit runs will also continue to verify whether the fix succeeded. | 2025-07-31 | Complete |
| Reduce the delay for the self-auditing tool to ~24 hours | Detect | Root Cause | The results of the self-auditing tool report back the logs recorded from the day before instead of 3 days before | 2025-09-12 | Complete |
| Ensure services respect lame-duck mode | Mitigate | Contributing Factor #2 | Place a service in lame-duck mode and verify that traffic does not go to it | 2025-09-12 | In Progress |
| Ensure that resolver authentication credentials do not expire before being refreshed | Prevent | Contributing Factor #2 | Rotation logic and alerting are in place for resolver authentication credentials. Other services already have both | 2025-09-12 | In Progress |
| Identify and remediate other instances in our code that could fail to write a log before issuing a certificate. | Prevent | Contributing Factor #1 | GTS will perform a focused review of its code | 2025-10-10 | In Progress |
Comment 6•7 months ago
In response to Comment 5:
Thank you for responding to our questions.
Further promoting adoption of ARI continues to be top of mind for us and is something we’d like to see improved across the ecosystem. With that in mind, we’d like to dig a bit deeper on the following statement: “We require all subscribers who have quota limits above our defaults to commit to enabling ARI support as a condition to get the increased quota. We do not block quota increases on ARI being active but a commitment to implement it within a reasonable period of time must be in place.”
(Q1): Can you describe the terms of this subscriber commitment re: enabling ARI, and how it’s formally documented and evaluated (e.g., what’s considered a “reasonable period”)?
For example, it’s easy to imagine a subscriber organization making a good-faith commitment to adopt ARI in order to be granted the exception, to only then later have the work delayed or deprioritized due to competing priorities, or other unexpected circumstances. It’s also possible to imagine the opposite. We’d like to understand how subscriber organizations are held accountable for accomplishing this commitment.
(Q2): Along the lines of the above, for those subscriber organizations who have committed to enabling ARI, has GTS performed an analysis to identify organizations which have not yet done so (either by the committed timeline, or ever)?
(Q3): [Multi-part, please see below]
-
(A) Does GTS have a clearly defined process for working with organizations that have committed to adopting ARI but have not done so within the described timeframe?
-
(B): Are there well-defined consequences (and corresponding timelines) for subscriber organizations that have committed to adopting ARI but have not yet done so (e.g., removing the granted exception(s)).
-
(C): Has GTS considered enabling technical controls for the organizations who have committed to adopt ARI such that requests made by ACME clients that do not support ARI are rejected?
Beyond our interest in ARI, we were also hoping to learn more about some of the existing Action Items:
(Q4): There’s an action item to review code for other potential logging failures. Could you provide more detail on the methodology for this review? For example, will this be a manual, line-by-line audit of all logging-related code, or will you also employ automated static/dynamic analysis tools to identify similar race conditions or non-atomic operations across all certificate lifecycle events (e.g., validation, revocation)?
(Q5): Has GTS considered a tiered detection approach? For example, a 'fast path' reconciliation that runs more frequently (e.g., hourly) on critical log data, designed to flag potential anomalies for immediate investigation, supplemented by the robust 24-hour full reconciliation? Could something like this be useful for closing the detection gap for certain categories of issues?
Updated•7 months ago
Comment 7•7 months ago
(In reply to chrome-root-program from comment #6)
In response to Comment 5:
Thank you for responding to our questions.
Further promoting adoption of ARI continues to be top of mind for us and is something we’d like to see improved across the ecosystem. With that in mind, we’d like to dig a bit deeper on the following statement: “We require all subscribers who have quota limits above our defaults to commit to enabling ARI support as a condition to get the increased quota. We do not block quota increases on ARI being active but a commitment to implement it within a reasonable period of time must be in place.”
(Q1): Can you describe the terms of this subscriber commitment re: enabling ARI, and how it’s formally documented and evaluated (e.g., what’s considered a “reasonable period”)?
For example, it’s easy to imagine a subscriber organization making a good-faith commitment to adopt ARI in order to be granted the exception, to only then later have the work delayed or deprioritized due to competing priorities, or other unexpected circumstances. It’s also possible to imagine the opposite. We’d like to understand how subscriber organizations are held accountable for accomplishing this commitment.
When we receive a request for increased quota, we provide the requester with a list of best practices, which we will post on pki.goog in the near future, that we expect them to adhere to. To date, we have used the honor system and we have not enforced timelines or penalties. We're hoping to incentivize users to adopt ARI because they also see the value, not because we're forcing them to adopt it. GTS does not have details about requirements from other CAs, but we suspect we would be an outlier if we were to start enforcing ARI requirements.
(Q2): Along the lines of the above, for those subscriber organizations who have committed to enabling ARI, has GTS performed an analysis to identify organizations which have not yet done so (either by the committed timeline, or ever)?
We don't have firm numbers, and the way some organizations use ACME makes getting such numbers challenging. Adoption is not as strong as we would like it to be but we would be open to finding ways to encourage adoption if there are suggestions.
(Q3): [Multi-part, please see below]
- (A) Does GTS have a clearly defined process for working with organizations that have committed to adopting ARI but have not done so within the described timeframe?
No. We remind larger subscribers we talk with on a recurring basis if they still have work outstanding to adopt ARI, but we do so as the opportunity naturally arises rather than on a specific timeframe or via a strict process. We believe it is better for subscribers to move to ACME without ARI, rather than not migrating at all due to the extra requirements, and to then encourage ARI adoption.
- (B): Are there well-defined consequences (and corresponding timelines) for subscriber organizations that have committed to adopting ARI but have not yet done so (e.g., removing the granted exception(s)).
Partially. We remind large subscribers that the WebPKI is now so large that ARI is essential to handle a Heartbleed-like event in an orderly fashion and that they are taking a risk by not implementing it. To date, we have not reduced quota or applied other penalties.
- (C): Has GTS considered enabling technical controls for the organizations who have committed to adopt ARI such that requests made by ACME clients that do not support ARI are rejected?
We have not. The idea is interesting but requires further consideration to ensure it would not introduce new risks or problems. Most subscribers use a single client or a small number of different clients so enforcement may not be practical. It is not hard to imagine a scenario where most of a large global enterprise are using ACME with ARI enabled, but a small division is just starting to move to ACME and they could be frustrated by the requirement.
Beyond our interest in ARI, we were also hoping to learn more about some of the existing Action Items:
(Q4): There’s an action item to review code for other potential logging failures. Could you provide more detail on the methodology for this review? For example, will this be a manual, line-by-line audit of all logging-related code, or will you also employ automated static/dynamic analysis tools to identify similar race conditions or non-atomic operations across all certificate lifecycle events (e.g., validation, revocation)?
We plan to do a manual review of seven auditable events that should happen concurrently with DB updates. These cover authorization (subject of this incident), deactivating an authorization, certificate acceptance, issuance, revocation, and CRL and OCSP updates.
While we used the term race condition, this was not referring to a threading issue, but to a coding mistake that misread whether an asynchronous operation had completed. This mistake is not detectable by scanning tools meant for detecting data races and memory errors. We continuously run code analysis tools in our development environment to detect those issues.
We do not expect reviewing the code to be the major part of the work, so we may finish this action item more quickly. Remediation might take more time if we find additional cases we need to fix.
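The bug class described above (treating the dispatch of an asynchronous operation as proof of its completion) can be illustrated with a generic Python sketch; this is not GTS's implementation.

```python
from concurrent.futures import Future

def buggy_write_confirmed(future):
    # Bug class: treats "an operation was dispatched" (a future exists)
    # as "the operation completed successfully".
    return future is not None

def correct_write_confirmed(future):
    # Correct: confirm the operation actually finished without error
    # before proceeding to the dependent write.
    return future.done() and future.exception() is None
```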
(Q5): Has GTS considered a tiered detection approach? For example, a 'fast path' reconciliation that runs more frequently (e.g., hourly) on critical log data, designed to flag potential anomalies for immediate investigation, supplemented by the robust 24-hour full reconciliation? Could something like this be useful for closing the detection gap for certain categories of issues?
Yes, we have considered tiered detection but made the decision to focus on one tool that runs quickly rather than two similar tools. Our focus has been less on frequency and more on scaling to meet ecosystem growth and increasingly complex requirements. We try to focus on changes that improve ecosystem security and prefer prevention to mitigation. We believe that focusing on completeness will help us identify more issues in pre-production environments before they reach production.
Comment 8•7 months ago
[In response to Comment 7.]
Thank you for the candid follow-up in response to the questions in Comment 5.
While we appreciate the goal of encouraging subscribers to see the value in ARI, it is unclear whether an honor system without firm timelines or direct accountability will result in the kind of reliable and widespread adoption necessary to make a meaningful difference in a future mass-revocation event. Ultimately, a strategy that relies on subscribers to independently and voluntarily prioritize ARI adoption may not achieve the goal and corresponding benefits (to both the subscriber organization and GTS) in a reasonable timeframe, if at all. Understandably, ARI finished standardization just a few weeks ago, and adoption is still in its beginning stages.
Driving ecosystem-wide change is a shared responsibility, but it often requires someone to take the first step. To that end, we welcome further ideas from GTS and/or the broader community on how we can collectively improve ARI adoption across the ecosystem. We’d welcome that discussion here or on community mailing lists such as public@ccadb.org.
Adding to Comment 8: in addition to the ARI limitations, another issue is that many ACME clients are tied to a specific CA. Perhaps a discussion of broader ACME client interoperability could be added here? Perhaps even an ACME client conformity checker?
| Assignee | ||
Comment 10•7 months ago
(In reply to chrome-root-program from comment #8)
> [In response to Comment 7.]
> Thank you for the candid follow-up in response to the questions in Comment 5.
> While we appreciate the goal of encouraging subscribers to see the value in ARI, it is unclear whether an honor system without firm timelines or direct accountability will result in the kind of reliable and widespread adoption necessary to make a meaningful difference in a future mass-revocation event. Ultimately, a strategy that relies on subscribers to independently and voluntarily prioritize ARI adoption may not achieve the goal and corresponding benefits (to both the subscriber organization and GTS) in a reasonable timeframe, if at all. To be fair, ARI finished standardization only a few weeks ago, and adoption is still in its early stages.
> Driving ecosystem-wide change is a shared responsibility, but it often requires someone to take the first step. To that end, we welcome further ideas from GTS and/or the broader community on how we can collectively improve ARI adoption across the ecosystem. We’d welcome that discussion here or on community mailing lists such as public@ccadb.org.
GTS plans to use this Bugzilla thread to reinforce the importance of ARI with customers who have not yet adopted it. We are also exploring meaningful ways to report ARI adoption data, which we could add to https://pki.goog/ and update on a recurring basis.
(In reply to Charter77 from comment #9)
> Adding to Comment 8: in addition to the ARI limitations, another issue is that many ACME clients are tied to a specific CA. Perhaps a discussion of broader ACME client interoperability could be added here? Perhaps even an ACME client conformity checker?
GTS does not find this surprising. Let's Encrypt was the original ACME CA, and much documentation uses Let's Encrypt as its example. We are more concerned with increasing ACME adoption overall than with which ACME CA a user chooses. If a directory URI is provided in documentation, it is preferable that it work rather than being an RFC 2606-style https://dv.acme-01.example.com/directory path that may confuse users. It might make sense for some documentation to suggest a directory URI like https://<pick an option from: https://acmeclients.com/certificate-authorities/>/directory to avoid ossification, but that also seems potentially confusing, and it introduces a dependency on the Certify the Web team, which graciously provides the acmeclients.com site but may not want to maintain it forever. In terms of conformity, sites like acmeclients.com and https://acmeprotocol.dev/getting-started/#acme-clients provide a good summary of key capabilities and conformance.
If there are further comments or questions, an MDSP email thread is probably the most appropriate place for that discussion.
We are working on the remaining action items and kindly request that the nextUpdate field be set to 2025-09-12, when we will give an update on the remaining AIs.
Updated•7 months ago
| Assignee | ||
Comment 11•6 months ago
GTS has completed the two remaining action items that are due on 2025-09-12.
GTS confirmed that 'lame-duck mode' works as intended by forcing services into lame-duck mode and verifying that services that depend on them detect the change and stop sending requests to them.
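The lame-duck behavior verified above can be sketched as follows. The names are hypothetical, not GTS's implementation: a backend entering lame-duck mode is skipped by callers, which keep routing to the remaining healthy backends.

```python
class Backend:
    """A hypothetical backend that can advertise lame-duck status."""

    def __init__(self, name):
        self.name = name
        self.lame_duck = False

    def enter_lame_duck(self):
        # Stop accepting new requests; a real system would also drain
        # in-flight requests before shutting the backend down.
        self.lame_duck = True

def pick_backend(backends):
    # Callers consult lame-duck status and never route to a draining backend.
    healthy = [b for b in backends if not b.lame_duck]
    if not healthy:
        raise RuntimeError("no healthy backends available")
    return healthy[0]
```

The verification GTS describes amounts to forcing `enter_lame_duck` on a service and confirming that its callers stop selecting it.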
Authentication credentials are normally refreshed when services restart. To ensure that they do not expire, GTS opted for a "belt and suspenders" approach by:
- Increasing their lifetime
- Adding monitoring to notify us well in advance of expiry
- Implementing a mechanism to automatically restart services when their authentication credentials are about to expire (with a random jitter to avoid simultaneous restarts)
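The jittered-restart idea in the last bullet can be sketched as below. The margin and jitter values are illustrative assumptions, not GTS's actual parameters.

```python
import random
from datetime import timedelta

def schedule_restart(credential_expiry,
                     margin=timedelta(hours=12),
                     max_jitter=timedelta(hours=1),
                     rng=random):
    # Restart well before the credential expires, backing the restart off
    # by a random jitter so that replicas whose credentials expire at the
    # same time do not all restart at the same moment.
    jitter = timedelta(seconds=rng.uniform(0, max_jitter.total_seconds()))
    return credential_expiry - margin - jitter
```

Each replica independently draws its own jitter, so a fleet-wide credential expiry produces staggered restarts instead of a simultaneous outage.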
Action Items
| Action Item | Kind | Corresponding Root Cause(s) | Evaluation Criteria | Due Date | Status |
|---|---|---|---|---|---|
| Record the successful authorization in the audit logs before updating the database to indicate the certificate is authorized to issue. The CA software will not permit issuance until the authorization record exists in the database. | Prevent | Root Cause | A test was created to send out many validation attempts at once and abruptly terminate the service to replicate the circumstances that caused the incident. Then we confirmed our logs properly matched the state of the databases. Our daily audit runs will also continue to verify whether the fix succeeded. | 2025-07-31 | Complete |
| Reduce the delay for the self-auditing tool to ~24 hours | Detect | Root Cause | The results of the self-auditing tool report back the logs recorded from the day before instead of 3 days before | 2025-09-12 | Complete |
| Ensure services respect lame-duck mode | Mitigate | Contributing Factor #2 | Place a service in lame-duck mode and verify that traffic does not go to it | 2025-09-12 | Complete |
| Ensure that resolver authentication credentials do not expire before being refreshed | Prevent | Contributing Factor #2 | Rotation logic and alerting are in place for resolver authentication credentials. Other services already have both | 2025-09-12 | Complete |
| Identify and remediate other instances in our code that could fail to write a log before issuing a certificate. | Prevent | Contributing Factor #1 | GTS will perform a focused review of its code | 2025-10-10 | In Progress |
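The ordering described in the first action item, audit entry first, database update second, issuance gated on the record, can be sketched as follows. All names are illustrative, not GTS's CA software.

```python
def record_authorization(audit_log, db, authz_id):
    # Step 1: write the audit entry before any database state changes.
    audit_log.append(authz_id)
    # Step 2: only then mark the authorization in the database.
    db[authz_id] = "authorized"

def may_issue(audit_log, db, authz_id):
    # Issuance requires both the database state and the matching audit
    # record; a DB entry without a log entry is refused.
    return db.get(authz_id) == "authorized" and authz_id in audit_log
```

With this ordering, a crash between the two steps can leave an audit entry without a database record (harmless, since issuance never happens), but never the reverse.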
We kindly request that the nextUpdate field be set to 2025-10-10 when we will give an update on the last remaining action item if not sooner. Thank you.
Updated•6 months ago
| Assignee | ||
Comment 12•6 months ago
GTS has completed the final action item that was due on 2025-10-10.
GTS performed a review of our code and found three other instances where the code could have failed to write a log before a database update. GTS remediated all three of the instances using the same mitigation method as the issue that prompted this incident, and verified that no certificates were affected by the potential failure modes.
Action Items
| Action Item | Kind | Corresponding Root Cause(s) | Evaluation Criteria | Due Date | Status |
|---|---|---|---|---|---|
| Record the successful authorization in the audit logs before updating the database to indicate the certificate is authorized to issue. The CA software will not permit issuance until the authorization record exists in the database. | Prevent | Root Cause | A test was created to send out many validation attempts at once and abruptly terminate the service to replicate the circumstances that caused the incident. Then we confirmed our logs properly matched the state of the databases. Our daily audit runs will also continue to verify whether the fix succeeded. | 2025-07-31 | Complete |
| Reduce the delay for the self-auditing tool to ~24 hours | Detect | Root Cause | The results of the self-auditing tool report back the logs recorded from the day before instead of 3 days before | 2025-09-12 | Complete |
| Ensure services respect lame-duck mode | Mitigate | Contributing Factor #2 | Place a service in lame-duck mode and verify that traffic does not go to it | 2025-09-12 | Complete |
| Ensure that resolver authentication credentials do not expire before being refreshed | Prevent | Contributing Factor #2 | Rotation logic and alerting are in place for resolver authentication credentials. Other services already have both | 2025-09-12 | Complete |
| Identify and remediate other instances in our code that could fail to write a log before issuing a certificate. | Prevent | Contributing Factor #1 | GTS will perform a focused review of its code | 2025-10-10 | Complete |
GTS will continue to monitor this report for any comments or questions.
| Assignee | ||
Comment 13•6 months ago
Report Closure Summary
- Incident description: Google Trust Services (GTS) was alerted by our self-auditing tool of a missing log detail for one certificate. The certificate was issued without one of its certificate lifecycle authorization events being recorded in our audit log due to a race condition. We determined during debugging and recovery that the trigger was a security token for our resolver expiring and it happened to be refreshed by a service restart. Regardless of the trigger, the expiration of the credential exposed a previously unknown race condition that could allow an ACME authorization to not be audit logged.
- Incident Root Cause(s): The root cause of the incident was that an authorization database record was created to indicate it was possible to issue a certificate but the corresponding audit log entry was not written.
- Remediation description:
  - GTS reviewed and updated our code to ensure that this race condition, and other similar race conditions, were remediated.
  - GTS updated our systems to respect lame-duck mode, where they didn’t already.
  - GTS put measures in place to ensure that authentication credentials are refreshed well before expiry.
- Commitment summary:
  - GTS will continue to make improvements to the self-auditing tool.
  - GTS will continue to monitor our systems for issues.
  - GTS will continue to update and improve our code.
All Action Items disclosed in this report have been completed as described, and we request its closure.
| Assignee | ||
Updated•5 months ago
Comment 14•5 months ago
This is a final call for comments or questions on this Incident Report.
Otherwise, it will be closed on approximately 2025-10-20.
| Assignee | ||
Comment 15•5 months ago
GTS is continuing to monitor this bug, pending closure.
Updated•5 months ago