Closed Bug 1837519 Opened 2 years ago Closed 2 years ago

Google Trust Services: Failure to respond to CPR within 24 hours

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: cadecairns, Assigned: cadecairns)

Details

(Whiteboard: [ca-compliance] [policy-failure] Next update 2023-Oct-27)

Steps to reproduce:

We are investigating a potential issue with the automation used to triage Certificate Problem Reports. We believe a reporter may not have received a reply from GTS for 26 hours, falling outside the timeline required by BR section 4.9.5. We are currently investigating the issue. A full report in the Mozilla format will be provided by the end of day on Tuesday, June 13.

Assignee: nobody → cadecairns
Status: UNCONFIRMED → ASSIGNED
Type: defect → task
Ever confirmed: true
Whiteboard: [ca-compliance]

1. How your CA first became aware of the problem (e.g., via a problem report submitted to your Problem Reporting Mechanism, a discussion in the MDSP or CCADB public mailing list, a Bugzilla bug, or internal self-audit), and the time and date.

On 2023-06-08 at 13:55, a Google Trust Services (GTS) engineer investigating an error raised by an internal application realized that the contact form on our website, which is also used for Certificate Problem Reports (CPRs), was no longer passing new inquiries into our pipeline for review.

2. A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.

YYYY-MM-DD (UTC) Description
2022-05-20 20:38 GTS files Bug 1770510 to report an issue that occurred with sending the preliminary investigation results for a CPR to both the reporter and the subscriber. Several process and automation changes are made to mitigate the risk of a recurrence.
2022-08-04 19:46 GTS files Bug 1783272 to report an issue that occurred when the preliminary investigation results for a CPR were sent to a reporter and the incorrect subscriber. New automation is introduced to further improve the process for handling CPRs.
2022-12-05 08:35 A contact form with a multi-step guided process to handle CPRs and other inquiries is published to our website, as promised in Bug 1770510.
2023-01-01 00:00 GTS changes our inquiry handling pipeline, removing the external team that had been performing initial triage. Inquiries are now handled directly by GTS engineers.
2023-06-04 03:19 An intermittent error in dependent services results in inquiries failing to enter into the processing pipeline.
2023-06-07 05:45 The certificate that is the subject of the below CPR is issued.
2023-06-07 13:47 A CPR is submitted via the contact form. The report requests revocation of a certificate that was issued on behalf of the reporter to their third-party service provider.
2023-06-07 22:28 An unrelated error in the contact form automation is raised due to a temporary data store access issue.
2023-06-08 23:23 The GTS on-call engineer begins to investigate the error.
2023-06-09 00:59 The GTS on-call engineer determines that there may be unprocessed inquiries submitted via the contact form and begins troubleshooting issues. Deeper troubleshooting required multi-party approval because of the potential for PII in the data.
2023-06-09 02:21 A failsafe scheduled job in the contact form automation executes successfully, leading the engineer to reduce the urgency of the investigation.
2023-06-09 13:41 The GTS on-call engineer is granted access to the data store to continue debugging.
2023-06-09 13:55 The GTS on-call engineer informs GTS PA there are unprocessed form submissions.
2023-06-09 14:36 The GTS on-call engineer declares an event based on the single valid CPR form submission not being processed.
2023-06-09 15:28 GTS sends a preliminary investigation report to the CPR reporter and the subscriber, 25 hours and 41 minutes after the CPR was submitted.
2023-06-09 20:50 A fix for the issue is deployed.

3. Whether your CA has stopped, or has not yet stopped, certificate issuance or the process giving rise to the problem or incident. A statement that you have stopped will be considered a pledge to the community; a statement that you have not stopped requires an explanation.

GTS has not stopped issuing certificates as this incident did not produce misissued certificates.

4. In a case involving certificates, a summary of the problematic certificates. For each problem: the number of certificates, and the date the first and last certificates with that problem were issued. In other incidents that do not involve enumerating the affected certificates (e.g. OCSP failures, audit findings, delayed responses, etc.), please provide other similar statistics, aggregates, and a summary for each type of problem identified. This will help measure the severity of each problem.

This incident report is about a delayed CPR response for one certificate. As noted in the timeline and explained in further detail in section 6, GTS has had two other CPR-related incidents that occurred while improving our tooling and processes. We’ve outlined a plan in section 7 for further improvements to the tooling and processes for handling CPRs.

5. In a case involving TLS server certificates, the complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem. It is also recommended that you use this form in your list “https://crt.sh/?sha256=[sha256-hash]”, unless circumstances dictate otherwise. When the incident being reported involves an SMIME certificate, if disclosure of personally identifiable information in the certificate may be contrary to applicable law, please provide at least the certificate serial number and SHA256 hash of the certificate. In other cases not involving a review of affected certificates, please provide other similar, relevant specifics, if any.

N/A

6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

Google Trust Services has made ongoing improvements to our CPR handling processes and tools as our service continues to grow. Two incidents involving past iterations of the process occurred last year, which has been another driving factor in making further improvements. A software bug introduced during those improvements coincided with an issue in a service we rely on, with the result that a CPR did not receive a preliminary investigation report within the timeline mandated by section 4.9.5 of the BRs. Below we describe the past incidents and the changes made leading up to the problem.

In Bug 1770510, the subscriber was different from the reporter and was not successfully included in our correspondence with the preliminary report. At the time, we had recently partnered with a large third-party service provider that uses our service to obtain certificates on behalf of its users, resulting in an increase in CPRs when its users were surprised to see alerts from CT monitoring tools about certificates issued by GTS. We described several improvements in the incident report to reduce the likelihood of a recurrence, including introducing a new, guided process for subscribers and other entities to submit CPRs. The guided process is online today at https://pki.goog/cpr and has resulted in a significant improvement in the quality of CPRs we receive, in addition to helping some domain owners in similar circumstances self-serve and not need to submit CPRs at all.

In Bug 1783272, the subscriber was again different from the reporter and the incorrect subscriber, another partner service provider, was sent the preliminary report after a mistake was made when specifying the email recipient. This occurred because tooling used to investigate CPRs was immature and required manual effort. To address the issue, we implemented a processing tool used internally to guide the GTS engineer handling the CPR through a step-by-step workflow to reduce the risk of mistakes being made.

Since the time of their introduction, these changes have worked well and we have continued to make improvements to our processes, including no longer relying on a separate team to perform initial triage of requests as described in Bug 1770510, owing to the improvement made by introducing the guided process. We made this change at the beginning of 2023. It has allowed us to handle CPRs more quickly.

The guided CPR process uses a form and automation code to handle form submissions, routing each one to the appropriate next step in the pipeline for triage, which sends an email to be handled by an on-call engineer. The same application also runs as a periodically scheduled job, intended as a failsafe, that verifies the email was sent without errors and alerts the on-call engineer if it was not. In this incident, an issue occurred in both the trigger and the failsafe.

On 2023-06-04, an issue in a dependent service caused triggers to begin failing intermittently. This resulted in our automation not triggering for four form submissions. Two of the submissions were the same CPR. The other two were complaints about phishing. When the trigger did not fire, the form submission would not be passed to the next step in the pipeline and the state field was not updated to reflect the state of the operation. A failsafe was designed to handle errors and exceptions, but did not account for an unset value, so it did nothing. The root cause of this incident is lacking monitoring, handling, and automated testing for an unanticipated failure scenario.
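
To illustrate the class of defect, the following is a minimal hypothetical sketch (field and state names are assumptions, not GTS's actual automation code): a failsafe that only re-processes submissions already marked with an error state silently skips submissions whose state field was never set, whereas checking for the absence of a positive "sent" confirmation catches both.

    # Hypothetical sketch of the failure mode; names are illustrative only.
    STATE_SENT = "SENT"
    STATE_ERROR = "ERROR"

    def failsafe_check_buggy(submissions, alert):
        for sub in submissions:
            if sub.get("state") == STATE_ERROR:
                alert(sub)  # explicit error states are caught and re-alerted
            # Bug: when the trigger never ran, "state" was never set at all, so the
            # submission falls through silently and no alert reaches the on-call engineer.

    def failsafe_check_fixed(submissions, alert):
        for sub in submissions:
            if sub.get("state") != STATE_SENT:
                alert(sub)  # anything not positively confirmed as sent is surfaced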

Our PII data access procedures added time to the initial investigation. The problem described above was uncovered after a transient error in the application caused the scheduled job to raise an error on 2023-06-07 at 22:28. Access restrictions on the data store to protect PII prevented the engineer from doing an in-depth review of the problem at the time, however a subsequent invocation of the scheduled job succeeded on 2023-06-08 at 02:21, and the investigation was no longer treated as urgently. When access was granted on 2023-06-08 at 13:41, the engineer realized that four form submissions, one of which was a valid CPR, had not been acted upon. The engineer informed our Policy Authority of the issue and an email was sent to the reporter at 15:28 informing them that the certificate had been issued to their third-party service provider. The reporter has not responded to the email, presumably satisfied with knowing the certificate was issued to their service provider.

7. List of steps your CA is taking to resolve the situation and ensure that such a situation or incident will not be repeated in the future. The steps should include the action(s) for resolving the issue, the status of each action, and the date each action will be completed.

A fix for the issue in the failsafe was deployed on 2023-06-09 at 20:50, and the dependent service that caused the issue with the trigger has been fixed by the Google team that is responsible for the service. In the immediate term, GTS is continuing to review the application code and plans to write additional automated tests where potential corner cases are discovered. We are also adjusting access control for the underlying data store to expedite access while retaining strong privacy protections. We will complete this work by June 23, 2023.
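
As an illustration of the kind of corner-case regression test such a review could add, the sketch below uses hypothetical names and assumes a failsafe shaped like the one sketched in section 6; it verifies that a submission whose state field was never set is still surfaced to the on-call engineer.

    import unittest

    STATE_SENT = "SENT"

    def failsafe_check(submissions, alert):
        # Simplified stand-in for the real failsafe: surface anything not confirmed sent.
        for sub in submissions:
            if sub.get("state") != STATE_SENT:
                alert(sub)

    class FailsafeCornerCaseTest(unittest.TestCase):
        def test_submission_with_unset_state_is_alerted(self):
            # A submission the trigger never processed has no state field at all.
            stuck = {"id": "cpr-example"}  # hypothetical unprocessed CPR
            alerted = []
            failsafe_check([stuck], alerted.append)
            self.assertEqual(alerted, [stuck])

    if __name__ == "__main__":
        unittest.main()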

In the medium term, GTS has identified a better toolkit to expand on the CPR investigation tooling described in Bug 1783272, resulting in a more robust implementation. This approach will allow us to send more contextual automated responses to inquiries such as CPRs of the kind described above, allowing more CPR reporters to self-serve and reducing the cycle time when a revocation must be performed. This work will begin in Q3 and we will complete it by October 27, 2023.

(In reply to Cade Cairns from comment #1)

Thank you GTS and Cade for the incident report! I have a few comments and questions below to fully understand what happened here so I would appreciate any response to that. I am including them inline below to make the discussion flow more easily.

YYYY-MM-DD (UTC) Description
2023-06-04 03:19 An intermittent error in dependent services results in inquiries failing to enter into the processing pipeline.
2023-06-09 20:50 A fix for the issue is deployed.

Do I understand correctly that this is the entire time frame Google Trust Services remained without a valid contact point for Certificate Problem Reports or other communication, e.g. from Root Programs or otherwise?

YYYY-MM-DD (UTC) Description
2023-06-07 22:28 An unrelated error in the contact form automation is raised due to a temporary data store access issue.
2023-06-08 23:23 The GTS on-call engineer begins to investigate the error.
2023-06-09 02:21 A failsafe scheduled job in the contact form automation executes successfully, leading the engineer to reduce the urgency of the investigation.

If I understand correctly, despite the error being “unrelated”, it was still “in the contact form automation”. Is there any reason why it would take over 24 hours for the on-call engineer to investigate this if you have to be able to handle incoming cases in less than 24 hours? Was that because your on-call engineer determined this to be less important than it actually ended up being, or was it a lack of understanding of the failure modes and reliability characteristics of your system?

YYYY-MM-DD (UTC) Description
2023-06-09 00:59 The GTS on-call engineer determines that there may be unprocessed inquiries submitted via the contact form and begins troubleshooting issues. Deeper troubleshooting required multi-party approval because of the potential for PII in the data.

Can you expand further on your process for PII as this is being cited as the primary cause for the delay of this particular CPR? If I understand your incident report correctly, this pipeline gives GTS CA Engineers the post that was sent so they can respond. It seems to me that the data the on-caller didn’t have access to was the same information (the post). I don’t understand why the engineers don’t have access to the information when the system fails (even if they know it failed), but have access to that when the system works.

YYYY-MM-DD (UTC) Description
2023-06-09 02:21 A failsafe scheduled job in the contact form automation executes successfully, leading the engineer to reduce the urgency of the investigation.

Access restrictions on the data store to protect PII prevented the engineer from doing an in-depth review of the problem at the time, however a subsequent invocation of the scheduled job succeeded on 2023-06-08 at 02:21, and the investigation was no longer treated as urgently.

Can you please address this inconsistency? Was this investigation deprioritized on June 9th, or June 8th? It is probably a problem in timezone conversion on your end, that slipped through the reviews, but I think it is important for this incident so we can get a better understanding.

YYYY-MM-DD (UTC) Description
2023-06-09 13:41 The GTS on-call engineer is granted access to the data store to continue debugging.

Can you provide more information on this pipeline perhaps so we can understand this better? Who has access to this information when the system fails and they have to grant it to the person on-call? Is this person or group of people available enough for you to meet your 24 hour response requirements? Are they internal or external to Google Trust Services? Are they the 3rd Party Vendor you are using? I am trying to grasp whether all of these design principles were taken into account during the design phase of your new pipeline, whether all of this was thought out, and simply something failed there, or if it wasn’t considered at all. Especially with my comment above on the necessity of this access control justification: you clearly sacrifice latency, reliability, etc. -- what do you gain in return? Do you try to minimize the residual risk with further controls?

6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

Google Trust Services has made ongoing improvements to our CPR handling processes and tools as our service continues to grow. Two incidents involving past iterations of the process occurred last year, which has been another driving factor in making further improvements. A software bug introduced during those improvements coincided with an issue in a service we rely on, with the result that a CPR did not receive a preliminary investigation report within the timeline mandated by section 4.9.5 of the BRs. Below we describe the past incidents and the changes made leading up to the problem.

In Bug 1770510, the subscriber was different from the reporter and was not successfully included in our correspondence with the preliminary report. At the time, we had recently partnered with a large third-party service provider that uses our service to obtain certificates on behalf of its users, resulting in an increase in CPRs when its users were surprised to see alerts from CT monitoring tools about certificates issued by GTS. We described several improvements in the incident report to reduce the likelihood of a recurrence, including introducing a new, guided process for subscribers and other entities to submit CPRs. The guided process is online today at https://pki.goog/cpr and has resulted in a significant improvement in the quality of CPRs we receive, in addition to helping some domain owners in similar circumstances self-serve and not need to submit CPRs at all.

In Bug 1783272, the subscriber was again different from the reporter and the incorrect subscriber, another partner service provider, was sent the preliminary report after a mistake was made when specifying the email recipient. This occurred because tooling used to investigate CPRs was immature and required manual effort. To address the issue, we implemented a processing tool used internally to guide the GTS engineer handling the CPR through a step-by-step workflow to reduce the risk of mistakes being made.

Since the time of their introduction, these changes have worked well and we have continued to make improvements to our processes, including no longer relying on a separate team to perform initial triage of requests as described in Bug 1770510, owing to the improvement made by introducing the guided process. We made this change at the beginning of 2023. It has allowed us to handle CPRs more quickly.

The guided CPR process uses a form and automation code to handle form submissions, routing each one to the appropriate next step in the pipeline for triage, which sends an email to be handled by an on-call engineer. The same application also runs as a periodically scheduled job, intended as a failsafe, that verifies the email was sent without errors and alerts the on-call engineer if it was not. In this incident, an issue occurred in both the trigger and the failsafe.

On 2023-06-04, an issue in a dependent service caused triggers to begin failing intermittently. This resulted in our automation not triggering for four form submissions. Two of the submissions were the same CPR. The other two were complaints about phishing. When the trigger did not fire, the form submission would not be passed to the next step in the pipeline and the state field was not updated to reflect the state of the operation. A failsafe was designed to handle errors and exceptions, but did not account for an unset value, so it did nothing. The root cause of this incident is lacking monitoring, handling, and automated testing for an unanticipated failure scenario.

Our PII data access procedures added time to the initial investigation. The problem described above was uncovered after a transient error in the application caused the scheduled job to raise an error on 2023-06-07 at 22:28. Access restrictions on the data store to protect PII prevented the engineer from doing an in-depth review of the problem at the time, however a subsequent invocation of the scheduled job succeeded on 2023-06-08 at 02:21, and the investigation was no longer treated as urgently. When access was granted on 2023-06-08 at 13:41, the engineer realized that four form submissions, one of which was a valid CPR, had not been acted upon. The engineer informed our Policy Authority of the issue and an email was sent to the reporter at 15:28 informing them that the certificate had been issued to their third-party service provider. The reporter has not responded to the email, presumably satisfied with knowing the certificate was issued to their service provider.

7. List of steps your CA is taking to resolve the situation and ensure that such a situation or incident will not be repeated in the future. The steps should include the action(s) for resolving the issue, the status of each action, and the date each action will be completed.

A fix for the issue in the failsafe was deployed on 2023-06-09 at 20:50, and the dependent service that caused the issue with the trigger has been fixed by the Google team that is responsible for the service. In the immediate term, GTS is continuing to review the application code and plans to write additional automated tests where potential corner cases are discovered. We are also adjusting access control for the underlying data store to expedite access while retaining strong privacy protections. We will complete this work by June 23, 2023.

In the medium term, GTS has identified a better toolkit to expand on the CPR investigation tooling described in Bug 1783272, resulting in a more robust implementation. This approach will allow us to send more contextual automated responses to inquiries such as CPRs of the kind described above, allowing more CPR reporters to self-serve and reducing the cycle time when a revocation must be performed. This work will begin in Q3 and we will complete it by October 27, 2023.

Thank you for digging up these old bugs and providing a nice background. This is extremely useful in incident reports, and saves everyone a lot of time. I have a few points based on the above:

From this incident report, it seems you are bitten again by the same problem: you relied on a 3rd Party Software / Vendor for an important part of your infrastructure (granted, not issuance or revocation related). Obviously, all software fails, and there’s human error all the time. This is why it’s important to discuss these here transparently, so we can all learn and improve, and eliminate classes of problems across the entire ecosystem. But what I’d like to understand is your process for picking software vendors, configuring their solutions, etc. Do you have any criteria in place, such as access to monitoring, SLAs or SLOs, or anything else of importance to ensure that you won’t be bitten again by a 3rd Party? If it happens once or twice in a short amount of time it may be understandable, I am interested in what you are doing to make sure this won’t become a pattern.

On the point of software development, you mention a bug in the code of your pipeline system that relies on the 3rd Party. Can you expand on your Software Development Lifecycle for this system? Does it have the same multi-party approval, code review, deployment, version control, etc. checks and balances as your CA software, or does it follow a different process, potentially with less strict requirements?

My next point is about software, system, and solution quality. In your message you mention that immature systems, unfinished solutions, and other such problems have led to incidents. In this current incident, I also see a lack of design principles, at least from the limited information presented (PII access, reliability, monitoring, etc.). I think for example an important aspect would be monitoring, whose lack is cited as a root cause here. Are there any issues that are systemic with Google Trust Services on this front?

You always commit to aggressive deadlines on your incident reports, which is commendable, and you almost always deliver on those on time, but I am left wondering what the cost of this is. Does it lead to cutting corners in software development, design, procurement, or other important areas? Is there pressure to ship “anything” out the door, to make the deadline, without proper thought and study? There have been a few times over the past few years where a subsequent incident was caused because of this. This is also slightly evident in your report above: the metrics you provide for the success of this pipeline / CPR system are business focused (lower number of CPRs, less time spent, etc.) and not engineering focused (more accurate, more robust, etc.).

Also, if I follow the GTS incident reports correctly, it seems you are redesigning your CPR pipeline for a 3rd or 4th time with your proposed actions, within 1-1.5 years. This seems to add to the point I am making above. Perhaps this time a well thought out and carefully considered engineering-first approach that takes into account all the business requirements will be more successful. I don’t think it’s the right approach to just kick the can down the road and deal with this again a few months later.

I am not saying that you should be slower, I am just trying to understand whether the tradeoff is worth it. This tradeoff was made by GTS, which is a business, but I would like to explore whether it’s the right balance and whether it adequately serves this community. Perhaps it is, or maybe there is an adjustment needed...

Regarding the phrase that you are “adjusting access control for the underlying data store to expedite access while retaining strong privacy protections”, what does this mean exactly? Did GTS identify that there wasn’t a reason for the lack of access in the first place, or are you building something new here? Are you just relaxing privacy requirements (as long as it’s still good enough?)?

Your root cause according to this report is stated to be “lacking monitoring, handling, and automated testing for an unanticipated failure scenario.” -- I am not sure this makes a lot of sense given the report above. It seems like you didn’t dig deep enough given my comments above, so I’d like some further discussion on that point.

Indeed, monitoring was found lacking. But it doesn’t seem like an “unexpected scenario” to me if storage runs out or a dependency service is down or a cron job fails to run. These seem basic things to monitor or plan for. Moreover, this is the 3rd redesign and there have been multiple iterations on top of it. Was this level of monitoring (or monitoring at all) not considered in the design phase? Was it deemed less important and the product launched without it, to be added later - if at all? Was it planned to be added? I find it hard to believe there’s no answer to “Why we lacked monitoring?” and this is the end node in the graph traversal, unless it was designed like that, or accepted somehow as okay. Which would be concerning.

Finally, in this incident you cite the problem being “Failure to respond to CPR within 24 hours”. You then state that a single CPR (that was received as a duplicate) was delayed by 41 minutes. I wanted to ask you: if there wasn’t a CPR during your ~6 days of downtime, and you didn’t miss anything, would you file this incident report?

Thanks,

Flags: needinfo?(cadecairns)

Thank you GTS and Cade for the incident report! I have a few comments and questions below to fully understand what happened here so I would appreciate any response to that. I am including them inline below to make the discussion flow more easily.

Thank you, Antonis, we're happy to provide more details on the questions you've asked.

YYYY-MM-DD (UTC) Description
2023-06-04 03:19 An intermittent error in dependent services results in inquiries failing to enter into the processing pipeline.
2023-06-09 20:50 A fix for the issue is deployed.

Do I understand correctly that this is the entire time frame Google Trust Services remained without a valid contact point for Certificate Problem Reports or other communication, e.g. from Root Programs or otherwise?

The form was still accepting and storing reports throughout this period. The response interval was the issue. During the interval between problem identification and the fix being deployed and validated, our on-caller was checking for form submissions roughly every three hours instead of our normal twice daily human checks.

YYYY-MM-DD (UTC) Description
2023-06-07 22:28 An unrelated error in the contact form automation is raised due to a temporary data store access issue.
2023-06-08 23:23 The GTS on-call engineer begins to investigate the error.
2023-06-09 02:21 A failsafe scheduled job in the contact form automation executes successfully, leading the engineer to reduce the urgency of the investigation.

If I understand correctly, despite the error being “unrelated”, it was still “in the contact form automation”. Is there any reason why it would take over 24 hours for the on-call engineer to investigate this if you have to be able to handle incoming cases in less than 24 hours? Was that because your on-call engineer determined this to be less important than it actually ended up being, or was it a lack of understanding of the failure modes and reliability characteristics of your system?

The on-call engineer began investigating within 55 minutes of the error being raised. The date listed is not correct. A corrected timeline can be found later in this response. When the failsafe was executed successfully not long after, the urgency of the investigation was reduced.

YYYY-MM-DD (UTC) Description
2023-06-09 00:59 The GTS on-call engineer determines that there may be unprocessed inquiries submitted via the contact form and begins troubleshooting issues. Deeper troubleshooting required multi-party approval because of the potential for PII in the data.

Can you expand further on your process for PII as this is being cited as the primary cause for the delay of this particular CPR? If I understand your incident report correctly, this pipeline gives GTS CA Engineers the post that was sent so they can respond. It seems to me that the data the on-caller didn’t have access to was the same information (the post). I don’t understand why the engineers don’t have access to the information when the system fails (even if they know it failed), but have access to that when the system works.

We have multi-party authorization for raw access to systems that may contain PII. In this case, the primary system (the automation pipeline that normally checks and stores the report) was not processing submissions in time, so raw access to the data store was required.

YYYY-MM-DD (UTC) Description
2023-06-09 02:21 A failsafe scheduled job in the contact form automation executes successfully, leading the engineer to reduce the urgency of the investigation.

Access restrictions on the data store to protect PII prevented the engineer from doing an in-depth review of the problem at the time, however a subsequent invocation of the scheduled job succeeded on 2023-06-08 at 02:21, and the investigation was no longer treated as urgently.

Can you please address this inconsistency? Was this investigation deprioritized on June 9th, or June 8th? It is probably a problem in timezone conversion on your end, that slipped through the reviews, but I think it is important for this incident so we can get a better understanding.

Thank you for flagging this error. In section 6 where it is stated that the priority was dropped, the date is incorrect. Events after 2023-06-07 22:28 were incorrectly advanced a day in the timeline. The corrected timeline is:

YYYY-MM-DD (UTC) Description
2023-06-07 22:28 An unrelated error in the contact form automation is raised due to a temporary data store access issue.
2023-06-07 23:23 The GTS on-call engineer begins to investigate the error.
2023-06-08 00:59 The GTS on-call engineer determines that there may be unprocessed inquiries submitted via the contact form and begins troubleshooting issues. Deeper troubleshooting required multi-party approval because of the potential for PII in the data.
2023-06-08 02:21 A failsafe scheduled job in the contact form automation executes successfully, leading the engineer to reduce the urgency of the investigation.
2023-06-08 13:41 The GTS on-call engineer is granted access to the data store to continue debugging.
2023-06-08 13:55 The GTS on-call engineer informs GTS PA there are unprocessed form submissions.
2023-06-08 14:36 The GTS on-call engineer declares an event based on the single valid CPR form submission not being processed.
2023-06-08 15:28 GTS sends a preliminary investigation report to the CPR reporter and the subscriber, 25 hours and 41 minutes after the CPR was submitted.
2023-06-08 20:50 A fix for the issue is deployed.

YYYY-MM-DD (UTC) Description
2023-06-09 13:41 The GTS on-call engineer is granted access to the data store to continue debugging.

Can you provide more information on this pipeline perhaps so we can understand this better? Who has access to this information when the system fails and they have to grant it to the person on-call? Is this person or group of people available enough for you to meet your 24 hour response requirements? Are they internal or external to Google Trust Services? Are they the 3rd Party Vendor you are using? I am trying to grasp whether all of these design principles were taken into account during the design phase of your new pipeline, whether all of this was thought out, and simply something failed there, or if it wasn’t considered at all. Especially with my comment above on the necessity of this access control justification: you clearly sacrifice latency, reliability, etc. -- what do you gain in return? Do you try to minimize the residual risk with further controls?

The pipeline uses common Google systems that are available to customers. In line with Google policies, there is no direct access to the raw data. However, access to the data store was not related to the root cause of this incident.

Our CA engineers provide 24/7/365 coverage and have full access to team owned automation, monitoring, and error reporting. Raw access to some data is only allowed once multi-party authorization has been granted; however, we have a break-glass emergency access mechanism for use in exceptional cases and to ensure rapid response when necessary. We have used third-party support in the past for some report triage, but no longer do so, as discussed in comment 1.

The system in question went through a design review process, through several iterations of enhancements, and followed engineering and coding best practices. Despite those practices, this edge case was not handled by the primary logic or exception handling and resulted in the delayed response.

6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

Google Trust Services has made ongoing improvements to our CPR handling processes and tools as our service continues to grow. Two incidents involving past iterations of the process occurred last year, which has been another driving factor in making further improvements. A software bug introduced during those improvements coincided with an issue in a service we rely on, with the result that a CPR did not receive a preliminary investigation report within the timeline mandated by section 4.9.5 of the BRs. Below we describe the past incidents and the changes made leading up to the problem.

In Bug 1770510, the subscriber was different from the reporter and was not successfully included in our correspondence with the preliminary report. At the time, we had recently partnered with a large third-party service provider that uses our service to obtain certificates on behalf of its users, resulting in an increase in CPRs when its users were surprised to see alerts from CT monitoring tools about certificates issued by GTS. We described several improvements in the incident report to reduce the likelihood of a recurrence, including introducing a new, guided process for subscribers and other entities to submit CPRs. The guided process is online today at https://pki.goog/cpr and has resulted in a significant improvement in the quality of CPRs we receive, in addition to helping some domain owners in similar circumstances self-serve and not need to submit CPRs at all.

In Bug 1783272, the subscriber was again different from the reporter and the incorrect subscriber, another partner service provider, was sent the preliminary report after a mistake was made when specifying the email recipient. This occurred because tooling used to investigate CPRs was immature and required manual effort. To address the issue, we implemented a processing tool used internally to guide the GTS engineer handling the CPR through a step-by-step workflow to reduce the risk of mistakes being made.

Since the time of their introduction, these changes have worked well and we have continued to make improvements to our processes, including no longer relying on a separate team to perform initial triage of requests as described in Bug 1770510, owing to the improvement made by introducing the guided process. We made this change at the beginning of 2023. It has allowed us to handle CPRs more quickly.

The guided CPR process uses a form and automation code to handle form submissions, routing each one to the appropriate next step in the pipeline for triage, which sends an email to be handled by an on-call engineer. The same application also runs as a periodically scheduled job, intended as a failsafe, that verifies the email was sent without errors and alerts the on-call engineer if it was not. In this incident, an issue occurred in both the trigger and the failsafe.

On 2023-06-04, an issue in a dependent service caused triggers to begin failing intermittently. This resulted in our automation not triggering for four form submissions. Two of the submissions were the same CPR. The other two were complaints about phishing. When the trigger did not fire, the form submission would not be passed to the next step in the pipeline and the state field was not updated to reflect the state of the operation. A failsafe was designed to handle errors and exceptions, but did not account for an unset value, so it did nothing. The root cause of this incident is lacking monitoring, handling, and automated testing for an unanticipated failure scenario.

Our PII data access procedures added time to the initial investigation. The problem described above was uncovered after a transient error in the application caused the scheduled job to raise an error on 2023-06-07 at 22:28. Access restrictions on the data store to protect PII prevented the engineer from doing an in-depth review of the problem at the time, however a subsequent invocation of the scheduled job succeeded on 2023-06-08 at 02:21, and the investigation was no longer treated as urgently. When access was granted on 2023-06-08 at 13:41, the engineer realized that four form submissions, one of which was a valid CPR, had not been acted upon. The engineer informed our Policy Authority of the issue and an email was sent to the reporter at 15:28 informing them that the certificate had been issued to their third-party service provider. The reporter has not responded to the email, presumably satisfied with knowing the certificate was issued to their service provider.

7. List of steps your CA is taking to resolve the situation and ensure that such a situation or incident will not be repeated in the future. The steps should include the action(s) for resolving the issue, the status of each action, and the date each action will be completed.

A fix for the issue in the failsafe was deployed on 2023-06-09 at 20:50, and the dependent service that caused the issue with the trigger has been fixed by the Google team that is responsible for the service. In the immediate term, GTS is continuing to review the application code and plans to write additional automated tests where potential corner cases are discovered. We are also adjusting access control for the underlying data store to expedite access while retaining strong privacy protections. We will complete this work by June 23, 2023.

In the medium term, GTS has identified a better toolkit to expand on the CPR investigation tooling described in Bug 1783272, resulting in a more robust implementation. This approach will allow us to send more contextual automated responses to inquiries such as CPRs of the kind described above, allowing more CPR reporters to self-serve and reducing the cycle time when a revocation must be performed. This work will begin in Q3 and we will complete it by October 27, 2023.

Thank you for digging up these old bugs and providing a nice background. This is extremely useful in incident reports, and saves everyone a lot of time. I have a few points based on the above:

From this incident report, it seems you are bitten again by the same problem: you relied on a 3rd Party Software / Vendor for an important part of your infrastructure (granted, not issuance or revocation related). Obviously, all software fails, and there’s human error all the time. This is why it’s important to discuss these here transparently, so we can all learn and improve, and eliminate classes of problems across the entire ecosystem. But what I’d like to understand is your process for picking software vendors, configuring their solutions, etc. Do you have any criteria in place, such as access to monitoring, SLAs or SLOs, or anything else of importance to ensure that you won’t be bitten again by a 3rd Party? If it happens once or twice in a short amount of time it may be understandable, I am interested in what you are doing to make sure this won’t become a pattern.

The pipeline is built on top of Google systems and frameworks. We use them because we can access the source, we know the supply chain and change control protections that are in place, we are familiar with the alerting and monitoring they provide, and they are typically low or no cost for us.

On the point of software development, you mention a bug in the code of your pipeline system that relies on the 3rd Party. Can you expand on your Software Development Lifecycle for this system? Does it have the same multi-party approval, code review, deployment, version control, etc. checks and balances as your CA software, or does it follow a different process, potentially with less strict requirements?

We follow the same process for production development, including design reviews, automated testing, multi-party approval, code review before check-in, staged deployment with rollback options where possible (some operations have to be roll-forward for compliance or data-integrity reasons), version control, et cetera. Our CA software and infrastructure are deployed via more stages and we over-provision our issuing systems more than our support systems, but the same principles are applied.

My next point is about software, system, and solution quality. In your message you mention that immature systems, unfinished solutions, and other such problems have led to incidents. In this current incident, I also see a lack of design principles, at least from the limited information presented (PII access, reliability, monitoring, etc.). I think for example an important aspect would be monitoring, whose lack is cited as a root cause here. Are there any issues that are systemic with Google Trust Services on this front?

You always commit to aggressive deadlines on your incident reports, which is commendable, and you almost always deliver on those on time, but I am left wondering what the cost of this is. Does it lead to cutting corners in software development, design, procurement, or other important areas? Is there pressure to ship “anything” out the door, to make the deadline, without proper thought and study? There have been a few times over the past few years where a subsequent incident was caused because of this. This is also slightly evident in your report above: the metrics you provide for the success of this pipeline / CPR system are business focused (lower number of CPRs, less time spent, etc.) and not engineering focused (more accurate, more robust, etc.).

Also, if I follow the GTS incident reports correctly, it seems you are redesigning your CPR pipeline for a 3rd or 4th time with your proposed actions, within 1-1.5 years. This seems to add to the point I am making above. Perhaps this time a well thought out and carefully considered engineering-first approach that takes into account all the business requirements will be more successful. I don’t think it’s the right approach to just kick the can down the road and deal with this again a few months later.

I am not saying that you should be slower, I am just trying to understand whether the tradeoff is worth it. This tradeoff was made by GTS, which is a business, but I would like to explore whether it’s the right balance and whether it adequately serves this community. Perhaps it is, or maybe there is an adjustment needed…

In the past 1.5 years, our customer base has grown and changed significantly, which has been a driving factor in the improvements for our CPR process and tooling. Triaging and responding to valid and invalid certificate problem reports is a significant source of toil, so seeking to streamline that effort while retaining high quality is worthy of ongoing development. While we strive to avoid incidents, we think there are valuable lessons for us and the community as we've evolved our report handling processes.

We take an iterative approach to making changes and system improvements for this kind of software and have gone through several rounds of improvements over the past year. We have significant monitoring in place and are always refining it. Our overall monitoring approach aligns with the practices detailed in https://sre.google/workbook/monitoring/. In terms of code quality and reviews, we follow https://google.github.io/eng-practices/.

Regarding the phrase that you are “adjusting access control for the underlying data store to expedite access while retaining strong privacy protections”, what does this mean exactly? Did GTS identify that there wasn’t a reason for the lack of access in the first place, or are you building something new here? Are you just relaxing privacy requirements (as long as it’s still good enough?)?

The group of approvers is globally distributed and available. We expanded the list of approvers to allow faster response, but will continue to balance it with maintaining strong privacy protections.

Your root cause according to this report is stated to be “lacking monitoring, handling, and automated testing for an unanticipated failure scenario.” -- I am not sure this makes a lot of sense given the report above. It seems like you didn’t dig deep enough given my comments above, so I’d like some further discussion on that point.

Indeed, monitoring was found lacking. But it doesn’t seem like an “unexpected scenario” to me if storage runs out or a dependency service is down or a cron job fails to run. These seem basic things to monitor or plan for. Moreover, this is the 3rd redesign and there have been multiple iterations on top of it. Was this level of monitoring (or monitoring at all) not considered in the design phase? Was it deemed less important and the product launched without it, to be added later - if at all? Was it planned to be added? I find it hard to believe there’s no answer to “Why we lacked monitoring?” and this is the end node in the graph traversal, unless it was designed like that, or accepted somehow as okay. Which would be concerning.

There was monitoring and error alerting in place covering the common failure modes you cited, and this monitoring is how the issue was initially discovered. The gap was coverage of an unplanned corner case, and the way it raised an alert did not make the severity immediately clear.
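
For illustration only, a state-agnostic staleness check is one way to cover this kind of corner case; the sketch below is hypothetical (names and threshold are assumptions, not a description of GTS's actual monitoring) and alerts on any submission not positively confirmed processed within a deadline, regardless of whether its state field was ever set.

    from datetime import datetime, timedelta, timezone

    # Hypothetical threshold, chosen to stay well inside the 24-hour CPR response window.
    PROCESSING_DEADLINE = timedelta(hours=6)

    def find_stale_submissions(submissions, now=None):
        # Flag every submission past the deadline that is not marked processed,
        # including submissions whose state field was never set by the trigger.
        now = now or datetime.now(timezone.utc)
        return [
            sub for sub in submissions
            if sub.get("state") != "PROCESSED"
            and now - sub["received_at"] > PROCESSING_DEADLINE
        ]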

Finally, in this incident you cite the problem being “Failure to respond to CPR within 24 hours”. You then state that a single CPR (that was received as a duplicate) was delayed by 41 minutes. I wanted to ask you: if there wasn’t a CPR during your ~6 days of downtime, and you didn’t miss anything, would you file this incident report?

We wish to clarify that there was no downtime or data loss. There was a delay in a pipeline. If we had a critical system down for an extended period of time that had even a small chance of failing to comply with BRs or root program requirements we would file an incident report.

Flags: needinfo?(cadecairns)

(In reply to Cade Cairns from comment #3)

Thanks for getting back Cade, I am adding some comments and questions inline below:

YYYY-MM-DD (UTC) Description
2023-06-04 03:19 An intermittent error in dependent services results in inquiries failing to enter into the processing pipeline.
2023-06-09 20:50 A fix for the issue is deployed.

Do I understand correctly that this is the entire time frame Google Trust Services remained without a valid contact point for Certificate Problem Reports or other communication, e.g. from Root Programs or otherwise?

The form was still accepting and storing reports throughout this period. The response interval was the issue. During the interval between problem identification and the fix being deployed and validated, our on-caller was checking for form submissions roughly every three hours instead of our normal twice daily human checks.

So with the above you confirm that you were not processing / acting on any of these events, despite them being delivered successfully to you, yes?

It is not a clear answer and simply accepting and storing reports is not enough to meet the requirements that CAs should be held up to. Otherwise, you can set up an email inbox, “accept and store” everything in there, and never look at it once. Is this an attempt to downplay the severity of the problem or a misunderstanding of the requirements?

As far as the frequent check-in by the on-caller goes, was someone practically sleepless for 2 days checking this form? Further in your response you mention that you have a “globally distributed and available” team, yet there was no hand-over? This seems a bit odd, so I hope I misunderstood something.

YYYY-MM-DD (UTC) Description
2023-06-09 00:59 The GTS on-call engineer determines that there may be unprocessed inquiries submitted via the contact form and begins troubleshooting issues. Deeper troubleshooting required multi-party approval because of the potential for PII in the data.

Can you expand further on your process for PII as this is being cited as the primary cause for the delay of this particular CPR? If I understand your incident report correctly, this pipeline gives GTS CA Engineers the post that was sent so they can respond. It seems to me that the data the on-caller didn’t have access to was the same information (the post). I don’t understand why the engineers don’t have access to the information when the system fails (even if they know it failed), but have access to that when the system works.

We have multi-party authorization for raw access to systems that may contain PII. In this case, the primary system (the automation pipeline that normally checks and stores the report) was not processing submissions in time, so raw access to the data store was required.

If that is the case, then I am a bit confused about what happened. The “automation pipeline” and the “raw access” contain the same information, right? How is PII then relevant in this case? If you require multi-party authorization for one, you must require it for the other. And if you don’t require it for one, then it shouldn’t be stated as the reason of the delay for the other. From your answer here I understand it’s the former, so both systems have multi-party authorization?

YYYY-MM-DD (UTC) Description
2023-06-09 02:21 A failsafe scheduled job in the contact form automation executes successfully, leading the engineer to reduce the urgency of the investigation.

Access restrictions on the data store to protect PII prevented the engineer from doing an in-depth review of the problem at the time, however a subsequent invocation of the scheduled job succeeded on 2023-06-08 at 02:21, and the investigation was no longer treated as urgently.

Can you please address this inconsistency? Was this investigation deprioritized on June 9th, or June 8th? It is probably a problem in timezone conversion on your end, that slipped through the reviews, but I think it is important for this incident so we can get a better understanding.

Thank you for flagging this error. In section 6 where it is stated that the priority was dropped, the date is incorrect. Events after 2023-06-07 22:28 were incorrectly advanced a day in the timeline. The corrected timeline is:

YYYY-MM-DD (UTC) Description
2023-06-07 22:28 An unrelated error in the contact form automation is raised due to a temporary data store access issue.
2023-06-07 23:23 The GTS on-call engineer begins to investigate the error.
2023-06-08 00:59 The GTS on-call engineer determines that there may be unprocessed inquiries submitted via the contact form and begins troubleshooting issues. Deeper troubleshooting required multi-party approval because of the potential for PII in the data.
2023-06-08 02:21 A failsafe scheduled job in the contact form automation executes successfully, leading the engineer to reduce the urgency of the investigation.
2023-06-08 13:41 The GTS on-call engineer is granted access to the data store to continue debugging.
2023-06-08 13:55 The GTS on-call engineer informs GTS PA there are unprocessed form submissions.
2023-06-08 14:36 The GTS on-call engineer declares an event based on the single valid CPR form submission not being processed.
2023-06-08 15:28 GTS sends a preliminary investigation report to the CPR reporter and the subscriber, 25 hours and 41 minutes after the CPR was submitted.
2023-06-08 20:50 A fix for the issue is deployed.

If that is the case and more than half of the timeline GTS provided was wrong, I have a couple of items:

Is the timeline complete? It ends on 2023-06-08 20:50 (where the fix was deployed) when at 2023-06-08 21:51 you said you were still investigating what happened. Or is there nothing noteworthy that came out of this investigation?

The CPR that you missed came in at 2023-06-07 13:47. With the revised timeline that you now provided, GTS was aware of potentially unprocessed data at 2023-06-08 00:59. This is now under the 24h mark, while it was above it before. Which makes my question on the PII policy more relevant: could that be the root cause of the delay, or a better candidate at least, as according to the Google SRE Book you link below systems should be expected to fail and you should plan for it and be prepared? Did GTS operate under the assumption that this system will never fail, or was there an issue in the “globally distributed and available” group of approvers? This entire incident happened during working days, not on a weekend, and not on public holidays. I am asking this because it is relevant for the better understanding of the rest of your report and whether the community can rely on your existing processes or further improvements can be made.

YYYY-MM-DD (UTC) Description
2023-06-09 13:41 The GTS on-call engineer is granted access to the data store to continue debugging.

Can you provide more information on this pipeline perhaps so we can understand this better? Who has access to this information when the system fails and they have to grant it to the person on-call? Is this person or group of people available enough for you to meet your 24 hour response requirements? Are they internal or external to Google Trust Services? Are they the 3rd Party Vendor you are using? I am trying to grasp whether all of these design principles were taken into account during the design phase of your new pipeline, whether all of this was thought out, and simply something failed there, or if it wasn’t considered at all. Especially with my comment above on the necessity of this access control justification: you clearly sacrifice latency, reliability, etc. -- what do you gain in return? Do you try to minimize the residual risk with further controls?

The pipeline uses common Google systems that are available to customers. In line with Google policies, there is no direct access to the raw data. However, access to the data store was not related to the root cause of this incident.

I am not yet convinced of the correctness of the root cause analysis performed by GTS, and I go over a few points above on why this can be relevant here. Understanding this better may provide more hints as to why this incident occurred.

Based on what you write here this is what I have: you used a 3rd Party Google tool that follows Google policy for PII, and then you wrote some code (?) or you somehow created a system that exposes this tool’s database ('s content?) to the on-callers. Is that the case? As you failed to answer my questions above on who this “multi-party” is, etc., I can’t figure out exactly what happened.

I would please ask you to respond to my questions as their answers can be highly relevant to this incident report. From your comment I feel like you don’t want to go into more depth here.

The issue of whether the decisions and the balance struck by GTS is the right one to serve this community is being resurfaced here. This exact decision or policy or process or design may be the root cause of this incident, yet you want us to focus on another area.

I think that proper root cause analysis, actual identification and remediation of the correct problems, and openness can help everyone learn and avoid the same mistakes in the future. If CAs treat this forum as a place where they post something (which may even be factually incorrect!), wait a week, and then continue on with their lives it’s not helping anyone. The same goes for treating this community adversarially. Mozilla and the other Root Programs choose to place trust in CAs, and this is indicative of what this relationship should look like. They can’t audit every CA themselves, they rely on some transparency and good communication.

Our CA engineers provide 24/7/365 coverage and have full access to team owned automation, monitoring, and error reporting. Raw access to some data is only allowed once multi-party authorization has been granted; however, we have a break-glass emergency access mechanism for use in exceptional cases and to ensure rapid response if necessary. We have used third-party support in the past for some report triage, but no longer do so, as discussed in comment #1.

Are you saying here that “multi-party” means “two or more CA Engineers”?

Do you maintain “two or more CA Engineers” 24/7/365? If not, then I imagine you rely on this break-glass mechanism you describe? If yes, why was this not used? Was it because you were below the 12h mark at that point and it shouldn’t have been activated yet?

Is “automation, monitoring, and error reporting” of this 3rd Party Tool “team owned”? From what you’re saying I understand it is not, so how is this sentence relevant?

The system in question went through a design review process, through several iterations of enhancements, and followed engineering and coding best practices. Despite those practices, this edge case was not handled by the primary logic or exception handling and resulted in the delayed response.

By “The system in question” you refer to the 3rd Party Google tool (Forms?) or the code you wrote on top of it yourselves?

Obviously it is normal to expect bugs in the code, this is not a problem, I am just once again trying to understand where the problem was and its nature. If it was an unanticipated bug in a 3rd Party tool, or if this was a problem in your code, configuration, use of the tool, etc.

From this incident report, it seems you are bitten again by the same problem: you relied on a 3rd Party Software / Vendor for an important part of your infrastructure (granted, not issuance or revocation related). Obviously, all software fails, and there’s human error all the time. This is why it’s important to discuss these here transparently, so we can all learn and improve, and eliminate classes of problems across the entire ecosystem. But what I’d like to understand is your process for picking software vendors, configuring their solutions, etc. Do you have any criteria in place, such as access to monitoring, SLAs or SLOs, or anything else of importance to ensure that you won’t be bitten again by a 3rd Party? If it happens once or twice in a short amount of time it may be understandable, I am interested in what you are doing to make sure this won’t become a pattern.

The pipeline is built on top of Google systems and frameworks, which we use because we can access the source, know the supply chain and change control protections that are in place, are familiar with the alerting and monitoring provided, and they are typically no or low cost for us.

First, let me address the “no or low cost”. I don’t think it looks good for anyone if we ever have to say “There was this incident because Google couldn’t pay $500 / month”. This is not to say you are required to be spending unlimited money, but next to all of these good reasons it just stands out badly.

I have one question for this answer, which is whether the tools you are using are made for the purpose you use them for. There’s no better way to put it, but is Google Forms the go-to Google solution for a customer support system? It looks like you are bolting 3rd Party things together. You have a Google Form and then some code that takes the form responses and sends you an e-mail with their content? During your design (/review) phase of the CPR pipeline, was there no other or better system that would still have all of these properties (e.g. made by Google)?

On the point of software development, you mention a bug in the code of your pipeline system that relies on the 3rd Party. Can you expand on your Software Development Lifecycle for this system? Does it have the same multi-party approval, code review, deployment, version control, etc. checks and balances as your CA software, or does it follow a different process, potentially with less strict requirements?

We follow the same process for production development including design reviews, automated testing, multi-party approval, code review before check-in, staged deployment with rollback options where possible (some operations have to be roll-forward for compliance or data-integrity reasons), version control, et cetera. Our CA software and infrastructure is deployed via more stages and we over-provision our issuing systems more than support systems, but the same principles are applied.

Okay, this is great. Obviously the CA infrastructure seems to handle more traffic, so this makes sense. Thanks!

My next point is about software, system, and solution quality. In your message you mention that immature systems, unfinished solutions, and other such problems have led to incidents. In this current incident, I also see a lack of design principles, at least from the limited information presented (PII access, reliability, monitoring, etc.). I think for example an important aspect would be monitoring, whose lack is cited as a root cause here. Are there any issues that are systemic with Google Trust Services on this front?

You always commit to aggressive deadlines on your incident reports, which is commendable, and you almost always deliver on those on time, but I am left wondering what the cost of this is. Does it lead to cutting corners in software development, design, procurement, or other important areas? Is there pressure to ship “anything” out the door, to make the deadline, without proper thought and study? There have been a few times over the past few years where a subsequent incident was caused because of this. This is also slightly evident in your report above: the metrics you provide for the success of this pipeline / CPR system are business focused (lower number of CPRs, less time spent, etc.) and not engineering focused (more accurate, more robust, etc.).

Also, if I follow the GTS incident reports correctly, it seems you are redesigning your CPR pipeline for a 3rd or 4th time with your proposed actions, within 1-1.5 years. This seems to add to the point I am making above. Perhaps this time a well thought out and carefully considered engineering-first approach that takes into account all the business requirements will be more successful. I don’t think it’s the right approach to just kick the can down the road and deal with this again a few months later.

I am not saying that you should be slower, I am just trying to understand whether the tradeoff is worth it. This tradeoff was made by GTS, which is a business, but I would like to explore whether it’s the right balance and whether it adequately serves this community. Perhaps it is, or maybe there is an adjustment needed…

In the past 1.5 years, our customer base has grown and changed significantly, which has been a driving factor in the improvements for our CPR process and tooling. Triaging and responding to valid and invalid certificate problem reports is a significant source of toil, so seeking to streamline that effort while retaining high quality is worthy of ongoing development. While we strive to avoid incidents, we think there are valuable lessons for us and the community as we've evolved our report handling processes.

You’re right, there’s definitely a lot to learn here for everyone. I was just wondering if this is being treated with the severity it demands by GTS. Have you contacted other CAs to figure out what they are doing for CPRs?

Let’s Encrypt is 8-10x larger in terms of currently unexpired certificates, registration for them is anonymous (while you require a Google Cloud Billing Account), and they are the default CA for almost all ACME clients. I am guessing they should be receiving at least as many CPRs as you are and every time I talked to them they were very nice and happy to share some know-how. I’m not volunteering them, it’s just an idea ;) If this seems to be a pain point perhaps it’s also worth asking in M.D.S.P. what everyone else is doing? Mozilla is trying to cultivate a community here, so I’d like to hope that there is help and knowledge exchange among participants.

Your root cause according to this report is stated to be “lacking monitoring, handling, and automated testing for an unanticipated failure scenario.” -- I am not sure this makes a lot of sense given the report above. It seems like you didn’t dig deep enough given my comments above, so I’d like some further discussion on that point.

Indeed, monitoring was found lacking. But it doesn’t seem like an “unexpected scenario” to me if storage runs out or a dependency service is down or a cron job fails to run. These seem basic things to monitor or plan for. Moreover, this is the 3rd redesign and there have been multiple iterations on top of it. Was this level of monitoring (or monitoring at all) not considered in the design phase? Was it deemed less important and the product launched without it, to be added later - if at all? Was it planned to be added? I find it hard to believe there’s no answer to “Why we lacked monitoring?” and this is the end node in the graph traversal, unless it was designed like that, or accepted somehow as okay. Which would be concerning.

There was monitoring and error alerting in place covering the common failure modes you cited. This monitoring is how the issue was initially discovered. The issue was coverage of an unplanned corner case, and the way it raised an alert did not make the severity immediately clear.

From the incident report so far I understand that monitoring existed for an “unrelated error”, and not for this one. It just so happened that as the on-caller was investigating this unrelated problem, they saw the other one. I think we both agree on that so far.

In the original full report I see the following text:

The application also runs as a periodically scheduled job to ensure sending the email had not failed due to errors and alerts the on-call engineer

This just seems to be adding more complexity. Why not alert when the email failure occurs? What if the job’s email fails too?

Is the following summary of the issue correct?

You received a Google Form submission. You then have a system that picks up form submissions and does something (the pipeline steps you are mentioning above) and then sends an email with the form content to the on-caller. Due to a non-GTS issue the email wasn’t sent and your system knew it. The software error is that the GTS “Finite State Machine” did not set this to the “failed-to-send-email” state, it left it dangling. The periodically scheduled job ran every time, but the only thing it was doing was moving items in the “failed-to-send-email” state to some other state that warns you through a different medium other than email. And this is why these messages were left in a state you couldn’t recover from (the “NULL” state).

Now if your software never sends items to the “failed-to-send-email” state, you would have caught it. Since you said that you test this software and it follows good coding practices, you’d notice it. So the problem was that the 3rd Party service hosting your “FSM” was intermittently unavailable, and your software did not account for this, and exited early? And also maybe a few times the periodically scheduled job didn’t run, but then it did, downgrading the investigation because you thought things would be surfaced?
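To make sure we are talking about the same failure mode, here is a minimal sketch, in Python, of the kind of bug I am describing. Every name, state, and function here is an illustrative assumption on my part, not your actual code:

from enum import Enum

class State(Enum):
    FAILED_TO_SEND_EMAIL = "failed-to-send-email"
    EMAIL_SENT = "email-sent"

class MailerError(Exception):
    pass

class DependencyError(Exception):
    pass

def handle_submission(item, send_email):
    # Only the anticipated failure (the mailer) is caught here.
    try:
        send_email(item)  # may raise MailerError or DependencyError
        item["state"] = State.EMAIL_SENT
    except MailerError:
        item["state"] = State.FAILED_TO_SEND_EMAIL
    # If the intermittently failing dependent service raises DependencyError
    # instead, nothing catches it and item["state"] is never set: the item is
    # left dangling in the "NULL" state I described above.

def failsafe_job(items, alert_on_call):
    # The scheduled failsafe only surfaces the explicit failure state, so
    # dangling items are silently skipped and never reach the on-call engineer.
    for item in items:
        if item.get("state") == State.FAILED_TO_SEND_EMAIL:
            alert_on_call(item)

If something like that matches reality, the bug is not in the failsafe itself but in the fact that the state machine can be exited without ever recording a state.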

Is the “FSM” I use above the “dependent service” that was “failing intermittently”? Isn’t this monitoring of whether a dependency service is down or not?

I don’t know how monitoring was performed here, but a typical way to collect FSM / queue metrics is to count by state. In Prometheus that would be:

fsm_object_cnt{state="failed-to-send-email"} 5
fsm_object_cnt{state=""} 4

Was there an additional software error in the monitoring that failed to alert on (or accept) an empty state label?
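For completeness, this is the kind of exporter I have in mind. It is only a rough sketch assuming a Python job using the prometheus_client library; the metric and state names are my own illustrative choices:

from collections import Counter
from prometheus_client import Gauge

# One gauge time series per state, including the empty string for items that
# never had a state recorded. Alerting on state="" (or on any label value
# outside an expected allow-list) would surface the dangling submissions.
fsm_objects = Gauge("fsm_object_cnt", "Form submissions by FSM state", ["state"])

def export_counts(items):
    counts = Counter((item.get("state") or "") for item in items)
    for state, count in counts.items():
        fsm_objects.labels(state=state).set(count)

The point is that if a count-by-state view like this existed, the unrecovered items would have shown up as a non-zero series under some label value, even an empty one.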

Finally, in this incident you cite the problem being “Failure to respond to CPR within 24 hours”. You then state that a single CPR (that was received as a duplicate) was delayed by 1 hour and 41 minutes past the deadline. I wanted to ask you: if there wasn’t a CPR during your ~6 days of downtime, and you didn’t miss anything, would you file this incident report?

We wish to clarify that there was no downtime or data loss. There was a delay in a pipeline. If we had a critical system down for an extended period of time that had even a small chance of failing to comply with BRs or root program requirements we would file an incident report.

This is my main problem with this incident report.

Thanks to your followup today I think I was able to understand the software problem. It is the kind of mistake that can happen to anyone, nobody should be expected never to make it, and there’s so much software out there with FSM initialization or corruption issues. That would have been fine.

But throughout this interaction, I see attempts to downplay the severity, confusing phraseology, inconsistencies and factual inaccuracies, and decisions that don’t make much sense. And I don’t see why... It just looks like you are either afraid of something, or you have something to hide. I really don’t think that’s necessarily the case. I perceive this as an adversarial treatment of this community without any obvious reasons to provoke this. The way it seems to me right now from your followup is that you want to draw attention to the software issue and avoid exploration into any other potential cause for this.

I still don’t know if the software bug was the main issue here or if it’s poor choice of solutions, poor process design for access to data (requirements not taken into account), process failure, etc. I hope we’ll get more clarity with a subsequent response here.

For a few days you were unable to fulfill your requirements for handling CPRs. Although you were enqueuing them, you were not processing them, which is how I would define “downtime”. If a CA wasn’t revoking certificates for a month, I wouldn’t call it a “delay in the pipeline” if they ran through that list of keyCompromises later. If Gmail stored everything in the “Outbox” folder and only sent the messages once every 3-4 days, I wouldn’t be happy either. I’d probably call that a “downtime” too, even if I could see the UI. I hope you understand it’s a bit of a more extreme example, but it should communicate what I’m trying to say here.

To summarize, I don’t think the software bug is anything special or concerning, it could happen to anyone, what we need to work on now here is to figure out whether that is the main problem or if there’s anything else that is a root-ier cause. Perhaps there was a bad design or a bad decision, that increased complexity, or a choice of tools where there was lack of operational capabilities (monitor, configure, understand, ...).

As incidents with the CPR pipeline are recurring for GTS lately I would like to make sure that we’re going after the right things, and we’re fixing the right problems, and we’re not affected by tunnel vision or the sunk cost fallacy. This is the best and most efficient way forward for everyone involved. Let’s make sure we won’t be here talking about a similar problem 2 months from now.

I'm not convinced that looking at this code more is enough to solve this problem for good. It feels bigger than that. I also don’t think self-service CPRs will extinguish this problem, as there will still be room for manual work. The latter would help GTS, so it’s a good thing, but it seems like it will just reduce the visible frequency of these by making it more difficult to know when they happen.

Flags: needinfo?(cadecairns)

We have completed a review of the application code that handles form submissions to assess whether there are other potential corner cases that might prevent a Certificate Problem Report from reaching the next stage of the pipeline for human review. We did not identify any additional issues and have added more unit tests. We also adjusted access controls on the data store to enable engineers to gain access sooner if in-depth debugging is required in the future.

We have begun work on a design to reduce the risk of problems occurring throughout the entire lifecycle of Certificate Problem Reports and other external inquiries we receive. We have not settled on final implementation details yet, but automation related to the pipeline will be more closely coupled and built using a different toolkit. This will provide a more robust implementation while further improving upon process changes we have already made.

We are preparing a response to the questions we were recently asked and will respond once completed. In the meantime, Google Trust Services will continue monitoring this bug for comments or questions.

(In reply to Antonis from comment #4)

(In reply to Cade Cairns from comment #3)

Thanks for getting back Cade, I am adding some comments and questions inline below:

YYYY-MM-DD (UTC) Description
2023-06-04 03:19 An intermittent error in dependent services results in inquiries failing to enter into the processing pipeline.
2023-06-09 20:50 A fix for the issue is deployed.

Do I understand correctly that this is the entire time frame Google Trust Services remained without a valid contact point for Certificate Problem Reports or other communication, e.g. from Root Programs or otherwise?

The form was still accepting and storing reports throughout this period. The response interval was the issue. During the interval between problem identification and the fix being deployed and validated, our on-caller was checking for form submissions roughly every three hours instead of our normal twice daily human checks.

So with the above you confirm that you were not processing / acting on any of these events, despite them being delivered successfully to you, yes?

It is not a clear answer, and simply accepting and storing reports is not enough to meet the requirements that CAs should be held to. Otherwise, you can set up an email inbox, “accept and store” everything in there, and never look at it once. Is this an attempt to downplay the severity of the problem or a misunderstanding of the requirements?

As far as the frequent check-in by the on-caller goes, was someone practically sleepless for 2 days checking this form? Further in your response you mention that you have a “globally distributed and available” team, yet there was no hand-over? This seems a bit odd, so I hope I misunderstood something.

Sorry, but our wording seems to have been interpreted differently than intended. In comment #1 we described how the form submission was not being passed to the next step in the pipeline to be handled by an on-call engineer. We mentioned that the form was accepting and storing reports to reflect that no data was lost due to these issues, not to try to minimize the severity of the issue. We did not meet the response deadline, but we are confident nothing was permanently missed or lost during this period.

Once we detected the failure, the on-call engineer increased the frequency of manual check-ins to ensure we did not risk missing any further CPRs. They checked every three hours during waking hours. Outside of those hours, other engineers were also looking until we were confident the problem had been fixed.

YYYY-MM-DD (UTC) Description
2023-06-09 00:59 The GTS on-call engineer determines that there may be unprocessed inquiries submitted via the contact form and begins troubleshooting issues. Deeper troubleshooting required multi-party approval because of the potential for PII in the data.

Can you expand further on your process for PII as this is being cited as the primary cause for the delay of this particular CPR? If I understand your incident report correctly, this pipeline gives GTS CA Engineers the post that was sent so they can respond. It seems to me that the data the on-caller didn’t have access to was the same information (the post). I don’t understand why the engineers don’t have access to the information when the system fails (even if they know it failed), but have access to that when the system works.

We have multi-party authorization for raw access to systems that may contain PII. In this case, the primary system (the automation pipeline that normally checks and stores the report) was not processing in time, so raw access to the data store was required.

If that is the case, then I am a bit confused about what happened. The “automation pipeline” and the “raw access” contain the same information, right? How is PII then relevant in this case? If you require multi-party authorization for one, you must require it for the other. And if you don’t require it for one, then it shouldn’t be stated as the reason for the delay for the other. From your answer here I understand it’s the former, so both systems have multi-party authorization?

In comment #1 we described how several factors contributed to a delay in processing the failed CPR, including a problem in a dependent service that began several days earlier and caused triggers to fail intermittently, which caused the delay and this incident. During investigation there was a further delay after the successful execution of the failsafe led to the urgency of the investigation being reduced.

The automation that had the problem requires read and write access to the data store to perform its function of sending the form submission for handling by an on-call engineer. Engineers do not have the same level of access as the automation without approval and, in the case of this system, that access was required to perform deeper troubleshooting.

YYYY-MM-DD (UTC) Description
2023-06-09 02:21 A failsafe scheduled job in the contact form automation executes successfully, leading the engineer to reduce the urgency of the investigation.

Access restrictions on the data store to protect PII prevented the engineer from doing an in-depth review of the problem at the time, however a subsequent invocation of the scheduled job succeeded on 2023-06-08 at 02:21, and the investigation was no longer treated as urgently.

Can you please address this inconsistency? Was this investigation deprioritized on June 9th, or June 8th? It is probably a problem in timezone conversion on your end, that slipped through the reviews, but I think it is important for this incident so we can get a better understanding.

Thank you for flagging this error. In section 6 where it is stated that the priority was dropped, the date is incorrect. Events after 2023-06-07 22:28 were incorrectly advanced a day in the timeline. The corrected timeline is:

YYYY-MM-DD (UTC) Description
2023-06-07 22:28 An unrelated error in the contact form automation is raised due to a temporary data store access issue.
2023-06-07 23:23 The GTS on-call engineer begins to investigate the error.
2023-06-08 00:59 The GTS on-call engineer determines that there may be unprocessed inquiries submitted via the contact form and begins troubleshooting issues. Deeper troubleshooting required multi-party approval because of the potential for PII in the data.
2023-06-08 02:21 A failsafe scheduled job in the contact form automation executes successfully, leading the engineer to reduce the urgency of the investigation.
2023-06-08 13:41 The GTS on-call engineer is granted access to the data store to continue debugging.
2023-06-08 13:55 The GTS on-call engineer informs GTS PA there are unprocessed form submissions.
2023-06-08 14:36 The GTS on-call engineer declares an event based on the single valid CPR form submission not being processed.
2023-06-08 15:28 GTS sends a preliminary investigation report to the CPR reporter and the subscriber, 25 hours and 41 minutes after the CPR was submitted.
2023-06-08 20:50 A fix for the issue is deployed.

If that is the case and more than half of the timeline GTS provided was wrong, I have a couple of items:

Is the timeline complete? It ends on 2023-06-08 20:50 (where the fix was deployed) when at 2023-06-08 21:51 you said you were still investigating what happened. Or is there nothing noteworthy that came out of this investigation?

We are confident that the fix deployed on 2023-06-08 at 20:50 corrected this issue. The scope of this issue was limited to four form submissions as we stated in comment #1 and nothing further was identified during our investigation. We continued to review the application as described in comment #5 but did not identify any additional issues.

The CPR that you missed came in at 2023-06-07 13:47. With the revised timeline that you now provided, GTS was aware of potentially unprocessed data at 2023-06-08 00:59. This is now under the 24h mark, while it was above it before. Which makes my question on the PII policy more relevant: could that be the root cause of the delay, or a better candidate at least, as according to the Google SRE Book you link below systems should be expected to fail and you should plan for it and be prepared? Did GTS operate under the assumption that this system will never fail, or was there an issue in the “globally distributed and available” group of approvers? This entire incident happened during working days, not on a weekend, and not on public holidays. I am asking this because it is relevant for the better understanding of the rest of your report and whether the community can rely on your existing processes or further improvements can be made.

As we described above and in comment #1, data access is not the root cause of this incident.

YYYY-MM-DD (UTC) Description
2023-06-09 13:41 The GTS on-call engineer is granted access to the data store to continue debugging.

Can you provide more information on this pipeline perhaps so we can understand this better? Who has access to this information when the system fails and they have to grant it to the person on-call? Is this person or group of people available enough for you to meet your 24 hour response requirements? Are they internal or external to Google Trust Services? Are they the 3rd Party Vendor you are using? I am trying to grasp whether all of these design principles were taken into account during the design phase of your new pipeline, whether all of this was thought out, and simply something failed there, or if it wasn’t considered at all. Especially with my comment above on the necessity of this access control justification: you clearly sacrifice latency, reliability, etc. -- what do you gain in return? Do you try to minimize the residual risk with further controls?

The pipeline uses common Google systems that are available to customers. In line with Google policies, there is no direct access to the raw data. However, access to the data store was not related to the root cause of this incident.

I am not yet convinced of the correctness of the root cause analysis performed by GTS, and I go over a few points above on why this can be relevant here. Understanding this better may provide more hints as to why this incident occurred.

Based on what you write here this is what I have: you used a 3rd Party Google tool that follows Google policy for PII, and then you wrote some code (?) or you somehow created a system that exposes this tool’s database ('s content?) to the on-callers. Is that the case? As you failed to answer my questions above on who this “multi-party” is, etc., I can’t figure out exactly what happened.

I would please ask you to respond to my questions as their answers can be highly relevant to this incident report. From your comment I feel like you don’t want to go into more depth here.

The issue of whether the decisions and the balance struck by GTS is the right one to serve this community is being resurfaced here. This exact decision or policy or process or design may be the root cause of this incident, yet you want us to focus on another area.

I think that proper root cause analysis, actual identification and remediation of the correct problems, and openness can help everyone learn and avoid the same mistakes in the future. If CAs treat this forum as a place where they post something (which may even be factually incorrect!), wait a week, and then continue on with their lives it’s not helping anyone. The same goes for treating this community adversarially. Mozilla and the other Root Programs choose to place trust in CAs, and this is indicative of what this relationship should look like. They can’t audit every CA themselves, they rely on some transparency and good communication.

In comment #1 we described how the automation that failed is responsible for handling form submissions by routing them to the next step in the pipeline for triage by an on-call engineer, which it does by sending an email with details dependent on the type of inquiry. The automation does not expose the database. For most types of inquiries, it also sends an automatic response to the entity who submitted the form. Email is used because the customer support management tool that manages the lifecycle of an inquiry operates on email. “Pipeline” in this context means a sequential process consisting of steps handled by both automation and the engineers who handle interactions related to an inquiry.

Our CA engineers provide 24/7/365 coverage and have full access to team owned automation, monitoring, and error reporting. Raw access to some data is only allowed once multi-party authorization has been granted; however, we have a break-glass emergency access mechanism for use in exceptional cases and to ensure rapid response if necessary. We have used third-party support in the past for some report triage, but no longer do so, as discussed in comment #1.

Are you saying here that “multi-party” means “two or more CA Engineers”?

Do you maintain “two or more CA Engineers” 24/7/365? If not, then I imagine you rely on this break-glass mechanism you describe? If yes, why was this not used? Was it because you were below the 12h mark at that point and it shouldn’t have been activated yet?

As we described above, the successful execution of the failsafe led to the urgency of the investigation being reduced.

Is “automation, monitoring, and error reporting” of this 3rd Party Tool “team owned”? From what you’re saying I understand it is not, so how is this sentence relevant?

The automation that is responsible for handling form submissions by routing them to the next step in the pipeline for triage by an on-call engineer is maintained by our team. Automation, monitoring, and error reporting of that automation are therefore our responsibility.

The system in question went through a design review process, through several iterations of enhancements, and followed engineering and coding best practices. Despite those practices, this edge case was not handled by the primary logic or exception handling and resulted in the delayed response.

By “The system in question” you refer to the 3rd Party Google tool (Forms?) or the code you wrote on top of it yourselves?

Obviously it is normal to expect bugs in the code, this is not a problem, I am just once again trying to understand where the problem was and its nature. If it was an unanticipated bug in a 3rd Party tool, or if this was a problem in your code, configuration, use of the tool, etc.

From this incident report, it seems you are bitten again by the same problem: you relied on a 3rd Party Software / Vendor for an important part of your infrastructure (granted, not issuance or revocation related). Obviously, all software fails, and there’s human error all the time. This is why it’s important to discuss these here transparently, so we can all learn and improve, and eliminate classes of problems across the entire ecosystem. But what I’d like to understand is your process for picking software vendors, configuring their solutions, etc. Do you have any criteria in place, such as access to monitoring, SLAs or SLOs, or anything else of importance to ensure that you won’t be bitten again by a 3rd Party? If it happens once or twice in a short amount of time it may be understandable, I am interested in what you are doing to make sure this won’t become a pattern.

The pipeline is built on top of Google systems and frameworks, which we use because we can access the source, know the supply chain and change control protections that are in place, are familiar with the alerting and monitoring provided, and they are typically no or low cost for us.

First, let me address the “no or low cost”. I don’t think it looks good for anyone if we ever have to say “There was this incident because Google couldn’t pay $500 / month”. This is not to say you are required to be spending unlimited money, but next to all of these good reasons it just stands out badly.

I have one question for this answer, which is whether the tools you are using are made for the purpose you use them for. There’s no better way to put it, but is Google Forms the go-to Google solution for a customer support system? It looks like you are bolting 3rd Party things together. You have a Google Form and then some code that takes the form responses and sends you an e-mail with their content? During your design (/review) phase of the CPR pipeline, was there no other or better system that would still have all of these properties (e.g. made by Google)?

We hope the distinction between our automation and the form is clear from our responses above. To respond to your other question, Google has a number of customer support tools, but they are all for different purposes. We investigated several options when the project began and chose to build our own solution because it provided a path to meeting our needs without introducing significant overhead.

On the point of software development, you mention a bug in the code of your pipeline system that relies on the 3rd Party. Can you expand on your Software Development Lifecycle for this system? Does it have the same multi-party approval, code review, deployment, version control, etc. checks and balances as your CA software, or does it follow a different process, potentially with less strict requirements?

We follow the same process for production development including design reviews, automated testing, multi-party approval, code review before check-in, staged deployment with rollback options where possible (some operations have to be roll-forward for compliance or data-integrity reasons), version control, et cetera. Our CA software and infrastructure is deployed via more stages and we over-provision our issuing systems more than support systems, but the same principles are applied.

Okay, this is great. Obviously the CA infrastructure seems to handle more traffic, so this makes sense. Thanks!

My next point is about software, system, and solution quality. In your message you mention that immature systems, unfinished solutions, and other such problems have led to incidents. In this current incident, I also see a lack of design principles, at least from the limited information presented (PII access, reliability, monitoring, etc.). I think for example an important aspect would be monitoring, whose lack is cited as a root cause here. Are there any issues that are systemic with Google Trust Services on this front?

You always commit to aggressive deadlines on your incident reports, which is commendable, and you almost always deliver on those on time, but I am left wondering what the cost of this is. Does it lead to cutting corners in software development, design, procurement, or other important areas? Is there pressure to ship “anything” out the door, to make the deadline, without proper thought and study? There have been a few times over the past few years where a subsequent incident was caused because of this. This is also slightly evident in your report above: the metrics you provide for the success of this pipeline / CPR system are business focused (lower number of CPRs, less time spent, etc.) and not engineering focused (more accurate, more robust, etc.).

Also, if I follow the GTS incident reports correctly, it seems you are redesigning your CPR pipeline for a 3rd or 4th time with your proposed actions, within 1-1.5 years. This seems to add to the point I am making above. Perhaps this time a well thought out and carefully considered engineering-first approach that takes into account all the business requirements will be more successful. I don’t think it’s the right approach to just kick the can down the road and deal with this again a few months later.

I am not saying that you should be slower, I am just trying to understand whether the tradeoff is worth it. This tradeoff was made by GTS, which is a business, but I would like to explore whether it’s the right balance and whether it adequately serves this community. Perhaps it is, or maybe there is an adjustment needed…

In the past 1.5 years, our customer base has grown and changed significantly, which has been a driving factor in the improvements for our CPR process and tooling. Triaging and responding to valid and invalid certificate problem reports is a significant source of toil, so seeking to streamline that effort while retaining high quality is worthy of ongoing development. While we strive to avoid incidents, we think there are valuable lessons for us and the community as we've evolved our report handling processes.

You’re right, there’s definitely a lot to learn here for everyone. I was just wondering if this is being treated with the severity it demands by GTS. Have you contacted other CAs to figure out what they are doing for CPRs?

Let’s Encrypt is 8-10x larger in terms of currently unexpired certificates, registration for them is anonymous (while you require a Google Cloud Billing Account), and they are the default CA for almost all ACME clients. I am guessing they should be receiving at least as many CPRs as you are and every time I talked to them they were very nice and happy to share some know-how. I’m not volunteering them, it’s just an idea ;) If this seems to be a pain point perhaps it’s also worth asking in M.D.S.P. what everyone else is doing? Mozilla is trying to cultivate a community here, so I’d like to hope that there is help and knowledge exchange among participants.

When the project began, we both spoke with several CAs and investigated the inquiry intake methods of additional CAs. As we stated in comment #1, our approach resulted in a significant improvement in the quality of CPRs we receive. It is unfortunate that a bug in new software resulted in this incident, but the improvements from the new guided process have been beneficial to customers in terms of clarity regarding their obligations and how to follow the process, which in turn means the data provided is more actionable than previous free-form submissions were.

Your root cause according to this report is stated to be “lacking monitoring, handling, and automated testing for an unanticipated failure scenario.” -- I am not sure this makes a lot of sense given the report above. It seems like you didn’t dig deep enough given my comments above, so I’d like some further discussion on that point.

Indeed, monitoring was found lacking. But it doesn’t seem like an “unexpected scenario” to me if storage runs out or a dependency service is down or a cron job fails to run. These seem basic things to monitor or plan for. Moreover, this is the 3rd redesign and there have been multiple iterations on top of it. Was this level of monitoring (or monitoring at all) not considered in the design phase? Was it deemed less important and the product launched without it, to be added later - if at all? Was it planned to be added? I find it hard to believe there’s no answer to “Why we lacked monitoring?” and this is the end node in the graph traversal, unless it was designed like that, or accepted somehow as okay. Which would be concerning.

There was monitoring and error alerting in place covering the common failure modes you cited. This monitoring is how the issue was initially discovered. The issue was coverage of an unplanned corner case, and the way it raised an alert did not make the severity immediately clear.

From the incident report so far I understand that monitoring existed for an “unrelated error”, and not for this one. It just so happened that as the on-caller was investigating this unrelated problem, they saw the other one. I think we both agree on that so far.

In the original full report I see the following text:

The application also runs as a periodically scheduled job to ensure sending the email had not failed due to errors and alerts the on-call engineer

This just seems to be adding more complexity. Why not alert when the email failure occurs? What if the job’s email fails too?

Is the following summary of the issue correct?

You received a Google Form submission. You then have a system that picks up form submissions and does something (the pipeline steps you are mentioning above) and then sends an email with the form content to the on-caller. Due to a non-GTS issue the email wasn’t sent and your system knew it. The software error is that the GTS “Finite State Machine” did not set this to the “failed-to-send-email” state, it left it dangling. The periodically scheduled job ran every time, but the only thing it was doing was moving items in the “failed-to-send-email” state to some other state that warns you through a different medium other than email. And this is why these messages were left in a state you couldn’t recover from (the “NULL” state).

Please refer to what we wrote in comment #1.

Now if your software never sends items to the “failed-to-send-email” state, you would have caught it. Since you said that you test this software and it follows good coding practices, you’d notice it. So the problem was that the 3rd Party service hosting your “FSM” was intermittently unavailable, and your software did not account for this, and exited early? And also maybe a few times the periodically scheduled job didn’t run, but then it did, downgrading the investigation because you thought things would be surfaced?

Is the “FSM” I use above the “dependent service” that was “failing intermittently”? Isn’t this monitoring of whether a dependency service is down or not?

I don’t know how monitoring was performed here, but a typical way to collect FSM / queue metrics is to count by state. In Prometheus that would be:

fsm_object_cnt{state="failed-to-send-email"} 5
fsm_object_cnt{state=""} 4

Was there an additional software error in the monitoring that failed to alert on (or accept) an empty state label?

Finally, in this incident you cite the problem being “Failure to respond to CPR within 24 hours”. You then state that a single CPR (that was received as a duplicate) was delayed by 1 hour and 41 minutes past the deadline. I wanted to ask you: if there wasn’t a CPR during your ~6 days of downtime, and you didn’t miss anything, would you file this incident report?

We wish to clarify that there was no downtime or data loss. There was a delay in a pipeline. If we had a critical system down for an extended period of time that had even a small chance of failing to comply with BRs or root program requirements we would file an incident report.

This is my main problem with this incident report.

Thanks to your followup today I think I was able to understand the software problem. It is the kind of mistake that can happen to anyone, nobody should be expected never to make it, and there’s so much software out there with FSM initialization or corruption issues. That would have been fine.

But throughout this interaction, I see attempts to downplay the severity, confusing phraseology, inconsistencies and factual inaccuracies, and decisions that don’t make much sense. And I don’t see why... It just looks like you are either afraid of something, or you have something to hide. I really don’t think that’s necessarily the case. I perceive this as an adversarial treatment of this community without any obvious reasons to provoke this. The way it seems to me right now from your followup is that you want to draw attention to the software issue and avoid exploration into any other potential cause for this.

I still don’t know if the software bug was the main issue here or if it’s poor choice of solutions, poor process design for access to data (requirements not taken into account), process failure, etc. I hope we’ll get more clarity with a subsequent response here.

For a few days you were unable to fulfill your requirements for handling CPRs. Although you were enqueuing them, you were not processing them, which is how I would define “downtime”. If a CA wasn’t revoking certificates for a month, I wouldn’t call it a “delay in the pipeline” if they ran through that list of keyCompromises later. If Gmail stored everything in the “Outbox” folder and only sent the messages once every 3-4 days, I wouldn’t be happy either. I’d probably call that a “downtime” too, even if I could see the UI. I hope you understand it’s a bit of a more extreme example, but it should communicate what I’m trying to say here.

To summarize, I don’t think the software bug is anything special or concerning, it could happen to anyone, what we need to work on now here is to figure out whether that is the main problem or if there’s anything else that is a root-ier cause. Perhaps there was a bad design or a bad decision, that increased complexity, or a choice of tools where there was lack of operational capabilities (monitor, configure, understand, ...).

As incidents with the CPR pipeline are recurring for GTS lately I would like to make sure that we’re going after the right things, and we’re fixing the right problems, and we’re not affected by tunnel vision or the sunk cost fallacy. This is the best and most efficient way forward for everyone involved. Let’s make sure we won’t be here talking about a similar problem 2 months from now.

I'm not convinced that looking at this code more is enough to solve this problem for good. It feels bigger than that. I also don’t think self-service CPRs will extinguish this problem, as there will still be room for manual work. The latter would help GTS, so it’s a good thing, but it seems like it will just reduce the visible frequency of these by making it more difficult to know when they happen.

In line with community standards, we have provided a full and factual recounting of the issue. We are sorry that you feel differently. We apologize for the timeline error; we fully accept that it was a preventable mistake on our part, but we do not understand the assertions of confusing phrasing and poor decisions. We hope that our replies have clarified any uncertainty you may have had about our report or the root cause of the incident. If you have further questions, we ask that you group them together, as this thread is becoming difficult to follow.

GTS self-reported this incident based on our own monitoring and investigation. As we described in comment #5, we are continuing to take steps to reduce the risk of problems occurring throughout the entire lifecycle of Certificate Problem Reports and other external inquiries we receive. Bugs sometimes happen in software, automated tests don’t always take every case into consideration, and the problem occurred in a new system for us. We will continue to make improvements to our systems to mitigate the risk of problems occurring.

Flags: needinfo?(cadecairns)

Cade,

If GTS insists on the root cause of this incident, and you’re not willing to explore any alternatives, there’s nothing more I can contribute to this thread.

Mozilla has some guidelines[1] on how to respond to incidents, that have been given to GTS many times in the past (Bug 1532842, Bug 1709223, Bug 1522975, Bug 1715421, Bug 1708516, Bug 1706967, Bug 1678183, Bug 1634795, possibly others) because you failed to follow them. They cover the important aspects of handling and reporting incidents and outline what should be done.

Since I thought that some of these parts were missing from your report, I asked questions in my previous comments to capture them in this report here.

Work out how the bug or problem was introduced. For a code bug, were the code review processes sufficient?

You responded that you followed the same processes for this code as anything else, so it just slipped(?)

Does your code have automated tests, and if so, why did they not catch this case?

You responded that it does, but you have expanded the tests now, and the case wasn’t caught because a dependency failing was an unpredicted corner case.

Work out why the problem was not detected earlier. [...] is the code or process you use for insufficient?

I understand your answer here is that the dependency just didn’t fail before, to your knowledge, and it did now.

If the problem is lack of compliance to an RFC, Baseline Requirement, or Mozilla Policy requirement: were you aware of this requirement? If not, why not? If so, was an attempt made to meet it? If not, why not? If so, why was that attempt flawed? Do any processes need updating for making sure your CA complies with the latest version of the various requirements placed upon it?

Here I was worried, but with your latest message I think that you understood the requirement, you admit that you failed to meet it, and that you probably took this into account when designing the CPR pipeline. It is okay.

If, as happens in a regrettably large number of cases, a problem report was sent to your CA but action in accordance with BR section 4.9.5 was not taken within 24 hours, investigate what happened to that report and whether your report handling processes are adequate.

I understand that you consider your report handling process after all the improvements you committed to make “adequate” now? So this is also done.

Scan your corpus of certificates to look for others with the same issue.

Although this is not a problem with certificates, I expected the same principle to apply: look for past times where this may have happened, and check whether you missed, or could have missed, reports. Since you had the monitoring, this shouldn’t be a problem to check. I understand that you checked, in your investigation, and you found that this problem never happened in the past?

Finally, I would like to bring up your CPS section 6.6.1[2] which I quote in its entirety here:

Google uses software that has been formally tested for suitability and fitness for purpose. Hardware is procured through a managed process leveraging industry-standard vendors.

Half of it talks about software, and what you’re using and developing to operate GTS. I thought that this Google Form plus the code wrapping it was not “formally tested for suitability and fitness for purpose”. At least that is how I perceive it. Perhaps by your definition it is. I don’t know, but if you want to add any comments, feel free.

In any case, to avoid future misunderstandings, perhaps you’d like to clarify this further or expand on it, beyond a single sentence. Because there’s also the chance that this sentence disqualifies most Open Source Software.

For example, in Bug 1708516, Bug 1678183, Bug 1652581, Bug 1630040, and Bug 1612389, you mention the use of the Zlint Open Source tool. In its license agreement[3] it explicitly states in section 7:

[...] without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE.

Would this (and possibly use of other software) violate this CPS? You perform annual reviews and audits, so you probably have an answer.

This is probably not the best place to discuss this, I’m just adding this here so you can review it, and then inform us, either by filing an incident or not I guess.

If you want some ideas on how to improve this section, I can refer you to Let’s Encrypt’s CPS[4] which I find strikes a much better balance.

Thanks,


[1] : https://wiki.mozilla.org/CA/Responding_To_An_Incident
[2] : https://pki.goog/repo/cps/4.15/GTS-CPS.html#6-6-1-system-development-controls
[3] : https://github.com/zmap/zlint/blob/master/LICENSE
[4] : https://github.com/letsencrypt/cp-cps/blob/main/CP-CPS.md

Flags: needinfo?(cadecairns)

(In reply to Antonis from comment #7)

Cade,

If GTS insists on the root cause of this incident, and you’re not willing to explore any alternatives, there’s nothing more I can contribute to this thread.

Mozilla has some guidelines[1] on how to respond to incidents, that have been given to GTS many times in the past (Bug 1532842, Bug 1709223, Bug 1522975, Bug 1715421, Bug 1708516, Bug 1706967, Bug 1678183, Bug 1634795, possibly others) because you failed to follow them. They cover the important aspects of handling and reporting incidents and outline what should be done.

Since I thought that some of these parts were missing from your report, I asked questions in my previous comments to capture them in this report here.

Work out how the bug or problem was introduced. For a code bug, were the code review processes sufficient?

You responded that you followed the same processes for this code as anything else, so it just slipped(?)

Does your code have automated tests, and if so, why did they not catch this case?

You responded that it does, but you expanded them now and it wasn’t catched because a dependency failing was an unpredicted corner case.

Work out why the problem was not detected earlier. [...] is the code or process you use for insufficient?

I understand your answer here is that the dependency just didn’t fail before, to your knowledge, and it did now.

If the problem is lack of compliance to an RFC, Baseline Requirement, or Mozilla Policy requirement: were you aware of this requirement? If not, why not? If so, was an attempt made to meet it? If not, why not? If so, why was that attempt flawed? Do any processes need updating for making sure your CA complies with the latest version of the various requirements placed upon it?

Here I was worried, but with your latest message I think that you understood the requirement, you admit that you failed to meet it, and that you probably took this into account when designing the CPR pipeline. It is okay.

If, as happens in a regrettably large number of cases, a problem report was sent to your CA but action in accordance with BR section 9.4.5 was not taken within 24 hours, investigate what happened to that report and whether your report handling processes are adequate.

I understand that you consider your report handling process after all the improvements you committed to make “adequate” now? So this is also done.

Scan your corpus of certificates to look for others with the same issue.

Although this is not a problem with certificates, I expected the same principle to apply: look for past times where this may have happened, and check whether you missed, or you could have missed reports. Since you had the monitoring, this shouldn’t be a problem to look. I understand that you checked, in your investigation, and you found that this problem never happened in the past?

Finally, I would like to bring up your CPS section 6.6.1[2] which I quote in its entirety here:

Google uses software that has been formally tested for suitability and fitness for purpose. Hardware is procured through a managed process leveraging industry-standard vendors.

Half of it talks about software and what you're using and developing to operate GTS. My impression was that this Google Form plus the code wrapping it was not "formally tested for suitability and fitness for purpose", at least as I understand that phrase. Perhaps by your definition it is. I don't know, but if you want to add any comments, feel free.

In any case, to avoid future misunderstandings, perhaps you'd like to clarify this further or expand on it beyond a single sentence, because there's also the chance that, as written, this sentence disqualifies most open source software.

For example, in Bug 1708516, Bug 1678183, Bug 1652581, Bug 1630040, and Bug 1612389, you mention the use of the Zlint Open Source tool. In its license agreement[3] it explicitly states in section 7:

[...] without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE.

Would this (and possibly use of other software) violate this CPS? You perform annual reviews and audits, so you probably have an answer.

This is probably not the best place to discuss this, I’m just adding this here so you can review it, and then inform us, either by filing an incident or not I guess.

If you want some ideas on how to improve this section, I can refer you to Let’s Encrypt’s CPS[4] which I find strikes a much better balance.

Thanks,


[1] : https://wiki.mozilla.org/CA/Responding_To_An_Incident
[2] : https://pki.goog/repo/cps/4.15/GTS-CPS.html#6-6-1-system-development-controls
[3] : https://github.com/zmap/zlint/blob/master/LICENSE
[4] : https://github.com/letsencrypt/cp-cps/blob/main/CP-CPS.md

Google Trust Services has provided a factual analysis of the issues that contributed to this incident along with changes we are making both to address the root cause and make our CPR process more reliable in order to reduce the risk of an incident happening again. We don’t feel that we can further expand on answers to questions being raised that relate to this incident.

We thank you for the point raised about the terms of our CPS. GTS attempts to take each of our dependencies into account and make sure they are used properly and in accordance with our CPS. In the case of ZLint, we do not depend on it to meet requirements; we only use it as a backup to our own checks and code.

We respectfully request that Mozilla consider setting the next-update to October 27, 2023, to allow us time to provide details on changes we’ve made once we have completed the next phase of promised improvements. We are continuing to monitor this bug for comments or questions.

Flags: needinfo?(cadecairns) → needinfo?(bwilson)

Just a quick comment regarding zlint, as it seems you didn't understand what I said:

The problem is the claim that the software you use "has been formally tested for suitability and fitness for purpose", not zlint specifically. That claim covers any software, and this is the part I think is too broad. For example, the Go programming language would also be covered, as would OpenSSL, EJBCA, Linux, SQL databases, and so on. Perhaps rewriting this section into something more realistic would serve you better.

Specifically, regarding your comment about ZLint, that you "do not depend upon it to meet requirements": this seems to contradict previous GTS statements:

Bug 1708516 :

We run all certificates through syntactic linters like Zlint and run 100% of the certificates we issue each day through audit checks instead of only inspecting 3% samples per 90 days as required by the Baseline Requirements. The checks also cover the correct performance of operational processes such as CAA- and domain control validation steps. The reports for all these checks are provided to auditors as part of our regular audit processes as well.

It seems it's being used to meet auditing requirements.

Bug 1678183 :

Our primary quality checks are zlint and internal test suites, including use of openssl asn1parse to check encodings.

It seems it's being used to ensure correct output, and producing correct certificates is a requirement.

Bug 1652581 :

we continually run the latest version zlint, keep it up to date and run it prior to all ceremonies in a test environment to see the implications caused by the issuance of the particular certificate

It seems it's being used to ensure your ceremonies meet the requirements and produce CAs that meet the requirements.

Bug 1612389:

We have included Zlint in our review process for some time

...

To be clear: I don't disagree with the next-update request and I don't need more information; the above was just a comment, in case that wasn't clear.

If the remaining task only involves an update to the CPS, then I'm inclined to close this incident rather than prolong it until the end of October.

I've set the next update for Oct. 27, 2023; however, I'd like to see this bug closed well before then.

Flags: needinfo?(bwilson)
Whiteboard: [ca-compliance] → [ca-compliance] [uncategorized] Next update 2023-Oct-27

If the remaining task only involves an update to the CPS, then I'm inclined to close this incident rather than prolong it until the end of October.

To clarify, we believe that we have fixed the immediate issue that caused problems with our CPR process. However, we are also planning additional significant changes to our CPR process (not the CPS). We will make these changes by the end of October. We are happy either to have this issue considered resolved now or to provide an update on GTS's progress in revising the CPR process by September 15th.

Google Trust Services would like to provide an update about the progress we’ve made to improve our CPR process.

Since our last update, we have made several changes.

  1. To improve the reliability of triaging CPRs, we started a secondary review rotation that checks periodically in case the on-call engineer is unavailable.
  2. We automated sending notifications to the chat service used by the team when new actions are pending (see the sketch after this list), which resulted in greater engagement by the entire team across time zones and reduced the time to respond to CPRs. These changes have improved the visibility of incoming reports and the state of our responses, making the process more robust.
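
For illustration only, a minimal sketch of the kind of periodic pending-action check and chat notification described in item 2 might look like the following; the queue format, webhook URL, and reminder threshold are assumptions, not GTS's actual implementation.

    # Hypothetical periodic check that surfaces pending CPR actions in the team chat.
    # The queue structure, webhook URL, and reminder threshold are illustrative only.
    import datetime
    import json
    import urllib.request

    CHAT_WEBHOOK_URL = "https://chat.example.internal/webhook/cpr-alerts"  # placeholder
    REMINDER_AFTER = datetime.timedelta(hours=4)  # assumed escalation threshold

    def build_summary(queue, now):
        """Summarize reports that have not yet received a first response."""
        pending = [r for r in queue if r.get("first_response_at") is None]
        if not pending:
            return None
        lines = []
        for r in pending:
            received = datetime.datetime.fromisoformat(r["received_at"])
            state = "needs escalation" if now - received >= REMINDER_AFTER else "pending"
            lines.append(f"- {r['id']} received {r['received_at']} ({state})")
        return "Pending CPR actions:\n" + "\n".join(lines)

    def post_to_chat(text):
        """Send the summary to the team chat via an incoming webhook."""
        payload = json.dumps({"text": text}).encode()
        req = urllib.request.Request(
            CHAT_WEBHOOK_URL, data=payload, headers={"Content-Type": "application/json"}
        )
        urllib.request.urlopen(req)

    if __name__ == "__main__":
        now = datetime.datetime.now(datetime.timezone.utc)
        queue = [{"id": "CPR-123", "received_at": "2023-07-20T09:00:00+00:00",
                  "first_response_at": None}]
        summary = build_summary(queue, now)
        if summary:
            print(summary)  # in production this would call post_to_chat(summary)

The value of a check like this is that it does not depend on a single on-call engineer noticing an error: any unanswered report keeps resurfacing in a channel the whole team watches.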

We are continuing development work and enhancements to our CPR tooling as a priority and, having collected more data in the past few months, are confident that we have made the necessary improvements. If there are no further questions we would like to request that this bug be closed.

Flags: needinfo?(bwilson)

I propose that this be closed on Friday, 22-Sept-2023.

Ben, the final decision is up to you, I just wanted to add a comment:

In the full incident report, GTS promised the following:

In the medium term, GTS has identified a better toolkit to expand on the CPR investigation tooling described in Bug 1783272, resulting in a more robust implementation. This approach will allow us to automatically send more contextual automated responses to inquiries such as CPRs of the nature described above, allowing more CPR reporters to provide self-service and reducing the cycle time when a revocation must be performed. This work will begin in Q3 and we will complete it by October 27, 2023.

In the last message before now, on the 5th of July, GTS promised:

However, we are also planning to make additional significant changes to our CPR process, not the CPS. We will make these changes by the end of October.

From Cade's latest message I understand that another rotation was added (it remains to be seen whether it will be effective) and that GTS added a chat integration/chatbot that posts a message when a new CPR is received.

I fail to see:

  • How creating a tool that posts a chat message justifies ~2 months of work (the time requested via the next-update)
  • How this is "additional significant changes"
  • How this allows GTS "to automatically send more contextual automated responses"
  • How this is "allowing more CPR reporters to provide self-service"
  • How this helps with "reducing the cycle time when a revocation must be performed"
  • Why this new system that just posts a message in a chat room somewhere cannot fail in the same ways described above

So far the bug has been patched and tests have been written, but those were the short-term promises.

What about the medium-term promises to ensure this won't happen again?

Without any additional information, just reading this report, I see GTS promising significant changes, then showing up 3 months later, saying that they added a second person to a shift and a chat integration, and signing off.

I haven't seen the "better toolkit", or how my points above are addressed, so I can only assume that there was perhaps a plan to do all of that and it was deprioritized during the summer for unknown reasons. In Cade's message two hours ago, all of these were gone; they weren't even postponed for another 3 months, there's no mention of them whatsoever.

I just don't see why GTS is "confident that we have made the necessary improvements", and I certainly don't see that the CPR tooling is "a priority".

Were all of these promises deemed unnecessary? What happened in this period of time?

Our efforts to further develop the process tooling will continue; however, this bug has been open for almost four months now. The data we have collected confirms that the implemented improvements are effective. At this point we consider this incident mitigated and the bug ready to be closed.

Cade,

this bug has been open for 107 days. For 99 of those days we were waiting for a response or action from Google Trust Services; granted, 76 of them were due to the Next Update.

This is behavior that has been exhibited before: Bug 1708516. In fact, reading that incident report again, many of the same causes seem to apply here as well. I would recommend that you look into that bug as well and make sure that all the issues discussed there no longer apply and that all the mitigations you performed are still effective. It seems that the issue is systemic and has either resurfaced or was never addressed properly.

Perhaps we need to file a new incident for GTS and discuss this there, not here, but there is a pattern of repeatedly not treating incidents seriously, or with high enough priority, as the Requirements demand. If you are treating these incidents seriously, it doesn't look like it; and if you're not, you really need to start.

I also see the same behavior of attempting to avoid answering questions, even though answering them is also part of the Requirements.

You requested a significant time frame (3 months) to do a lot of work. You came back 3 months later, after no interim updates, having done a week's worth of work that does not address your own root cause, and you ask for this incident to be closed. Why the extra 71 days?

What are we supposed to make of this? You keep repeating the same phrase over and over, avoiding comment when issues are raised and providing no responses. Is this a misunderstanding?

I am asking again: what happened in these past three months? It looks like you simply deprioritized all the work and then, when the deadline approached, collected whatever little was done and reported it as "several changes". If all you wanted to do was add a chat integration and another person to a rotation, why did you request 3 months? Are you familiar with your obligations for incident response? You've failed to meet them many times in the past, including several times in this report, so I want to make sure you are aware of them.


In any case, Ben, I will leave this up to you: do you want to close this and file a separate incident to discuss this behavior, or do you want to keep this open until the work has been done? I am only insisting this time because GTS has had many incidents in the past involving this exact software component and process, and we've always closed them with promises of improvements, yet the issues keep happening again and again. Similarly, GTS is once more asking for this to be closed on the strength of future promises and solutions. I feel that closing this without action would miss the point of the Mozilla Incident Response process for CAs.

My company is considering using GTS as a CA, so I've been following this from the sidelines for a while. I second Antonis: Google's response here is concerning. It's not clear to me that GTS has made a concerted effort to fulfill their obligations, for reasons that Antonis has explained much better than I could. I strongly oppose closing this incident until Google can prove that it has actually made significant progress toward improving during the three months that have elapsed.

Hello Antonis,

We would like to share some clarifications. Since the start of this incident, multiple staff members have been working on improvements. The root cause has been fully addressed, and further improvements continue.

The mitigations described in comment #14 were added in late July. They provided the improvements described in that same comment and resulted in signals and safeguards to ensure the process is functioning as expected.

The improvements to our CPR tooling and process described in comment #1 remain under active development and continue to be a priority for us, which is what we meant by "as a priority" in comment #14. This work has not been deprioritized. Cleanly triaging and handling all the possible types of reports sent to a CA is a lot of work, and we have taken an iterative approach as we work toward completing all of our planned improvements. An earlier update to this bug stated that it should be closed before Oct 27. We have now reported on progress thus far, and we are on track to complete deployment and testing of our intended changes before Oct 27.

We hope that the above clarifications answer any open questions you had about the state of our work. We request Mozilla consider what is sufficient to close this bug.

Cade, would your team be willing to provide more specific information about what you're planning, the challenges you've encountered over the past three months, and any additional information that might benefit both subscribers like myself and other CAs who may encounter similar challenges? It sounds as though some of these three months have been spent working on tooling that isn't yet live, but you haven't really offered any specifics. What sort of tooling is being implemented, and when will it be live?

On its own, I don't consider this incident concerning: failure to respond to a CPR within 24 hours isn't the end of the world. The initial response wasn't bad, either: your team assessed the situation, reported their findings, and offered a plan for improvement. But now that we're approaching the deadline to implement those improvements, answers are becoming vague, and there doesn't actually seem to be much to show for the months it took to get to this point.

PKI relies heavily on transparency and adherence to procedure. It's a big red flag when a CA stops communicating effectively over what would otherwise be a minor procedural error. You're increasingly attracting the attention of people such as myself who normally just watch from the sidelines.

Paul, thank you for your comments. There have not been specific challenges; it simply takes time to develop and test new systems. The CPR-handling tooling we're creating is very specific to our infrastructure, so it is unlikely to be open-sourced or to have elements that are reusable by the wider CA community.

We are working to automate workflows to remove as many manual steps as possible, integrate further with Google's internal task tracking system, and rely on dependencies that provide greater fault tolerance. Google's task tracker has robust escalation mechanisms and is already deeply integrated into our team's operations, which allows us to better enforce response times. In comment #1 and the incidents it references, we described the evolution of our CPR handling process, which incrementally improved our CPR handling but resulted in some fragmentation across systems and tools. We are merging the automation we've created into a single, cohesive application that manages the end-to-end workflow of handling CPRs. This also includes the capability to automatically respond with contextual replies based on information about certificates we've issued, which helps toward the goal of reducing manual steps.
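
As a rough, hypothetical sketch of what such a unified workflow might look like (the certificate store and task tracker interfaces below are assumptions for illustration, not GTS's actual APIs):

    # Hypothetical sketch of routing an incoming CPR and drafting a contextual reply.
    # The certificate store and task tracker objects below are assumed interfaces.
    from dataclasses import dataclass

    @dataclass
    class ProblemReport:
        reporter_email: str
        serial_number: str
        reason: str  # e.g. "key-compromise", "certificate-misissued"

    def handle_report(report, certificate_db, task_tracker):
        """Return an automated preliminary reply and open a tracked task if needed."""
        cert = certificate_db.lookup(report.serial_number)
        if cert is None:
            # Nothing we issued matches: reply immediately with self-service guidance.
            return ("We could not find a certificate with this serial number among those "
                    "we have issued. Please double-check the serial number or attach the "
                    "certificate in PEM form.")
        if cert.revoked:
            return (f"Certificate {cert.serial_number} was already revoked "
                    f"on {cert.revoked_at}; no further action is required.")
        # Otherwise open a tracked task with a hard deadline so the 24-hour
        # preliminary-report requirement cannot be missed silently.
        task_tracker.create(
            title=f"CPR for {cert.serial_number} ({report.reason})",
            deadline_hours=24,
            escalate_to="cpr-oncall",
        )
        return ("Thank you for your report. We have opened an investigation and will "
                "send a preliminary report within 24 hours.")

The point of the sketch is the shape of the flow described above: contextual replies handle the self-service cases immediately, and anything requiring investigation lands in a tracked task with an enforced deadline rather than in an ad-hoc queue.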

Completing the unification of our systems while simplifying our dependencies was important. Design, implementation, and testing have been ongoing since this incident was opened; we have been holding updates until we have something working. We are still working toward the restructuring we will complete by October 27, but since the additional mitigations we applied in July have proven effective, we felt it was worth sharing with the community. Given the effectiveness of our mitigations, and that our more complete rearchitecting extends beyond the typical scope of an incident, we believe we have satisfied the incident reporting guidelines and the incident is ready to be closed.

To keep everyone updated, the Next update remains set to 2023-Oct-27. GTS will post an update when the automation is deployed unless other comments or questions are raised first.

Flags: needinfo?(bwilson)
Whiteboard: [ca-compliance] [uncategorized] Next update 2023-Oct-27 → [ca-compliance] [policy-failure] Next update 2023-Oct-27

Google Trust Services has tested and deployed our new CPR management solution as described in Comment #1 and Comment #22.

The new CPR management system integrates more closely with Google's task tracking tool, is built on more fault-tolerant systems, and provides additional guidance for some CPR workflows to enable faster response times and give reporters more helpful information. The solution enforces deadlines for responses to ensure they are sent within the mandated timelines. Ongoing enhancements will be made to further improve the system and address any changing trends in CPR submissions.
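
A deadline-enforcement check of the kind described above could, in outline, look like the following; the warning margin and escalation levels are assumptions for illustration, not details of GTS's deployed system.

    # Hypothetical deadline enforcement for CPR responses.
    # The warning margin and escalation targets are illustrative assumptions.
    import datetime

    RESPONSE_DEADLINE = datetime.timedelta(hours=24)  # BR 4.9.5 preliminary report window
    WARNING_MARGIN = datetime.timedelta(hours=6)      # assumed internal buffer

    def check_deadline(received_at, responded_at, now, escalate):
        """Classify a report and escalate when its response deadline is at risk or missed."""
        if responded_at is not None:
            return "responded"
        age = now - received_at
        if age >= RESPONSE_DEADLINE:
            escalate(level="page", message=f"CPR response overdue by {age - RESPONSE_DEADLINE}")
            return "overdue"
        if age >= RESPONSE_DEADLINE - WARNING_MARGIN:
            escalate(level="chat", message=f"CPR response due in {RESPONSE_DEADLINE - age}")
            return "at-risk"
        return "on-track"

    # Example: a report received 20 hours ago with no response yet is "at-risk".
    if __name__ == "__main__":
        now = datetime.datetime.now(datetime.timezone.utc)
        print(check_deadline(now - datetime.timedelta(hours=20), None, now,
                             lambda level, message: print(level, message)))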

Thus far the system has performed well, but we are continuing to operate our now-legacy solution through the end of November 2023 in case we need to temporarily switch back while we further validate the new system.

We believe this addresses the root cause of this incident and fulfills the commitments we made, and we respectfully ask that this incident be closed.

Flags: needinfo?(bwilson)

I intend to close this next Wed. 2023-11-01.

Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Flags: needinfo?(bwilson)
Resolution: --- → FIXED