Let's Encrypt: Failure to Document Analysis of Detected Vulnerabilities
Categories: CA Program :: CA Certificate Compliance (task)
People: Reporter: pporada; Assigned: pporada
Whiteboard: [ca-compliance] [policy-failure]
Preliminary Incident Report
Summary
- Incident description: Let's Encrypt performs weekly system vulnerability scans. Not all detected vulnerabilities are relevant to Let's Encrypt's infrastructure and operations, and most are remediated by automatic operating system package upgrades. We have identified an instance where we failed to document an existing compensating control for a detected Critical Vulnerability which was not remediated within 96 hours of detection.
- Relevant policies: Network and Certificate System Security Requirements (NCSSRs), Version 1.7, Section 4, which says that any Critical Vulnerability not remediated within 96 hours of detection must have either a documented plan to do so or a documented basis upon which the vulnerability does not require remediation.
- Source of incident disclosure: Let's Encrypt self-detected this while gathering evidence for an ongoing WebTrust audit.
We will provide a full incident report on or before 2025-04-04.
Comment 1•5 months ago
While we continue working on the final report, could the next-update please be set to 04 APR 2025?
Comment 2•4 months ago
Full Incident Report
Summary
- CA Owner CCADB unique ID: A000320
- Incident description: Let’s Encrypt performs weekly vulnerability scans on our infrastructure. We have identified multiple instances where we failed to remediate or document critical vulnerabilities (defined as those with CVSSv2 scores >= 7.0) within 96 hours of receiving scan results, as required by Version 1.7 of the NCSSRs. Note that the preliminary incident disclosure described this incident as Let’s Encrypt failing to document compensating controls for a single critical vulnerability finding. Upon further investigation, we have expanded the scope of this incident to include failure to timely remediate or document our analyses of multiple critical vulnerabilities.
- Timeline summary:
- Non-compliance start date: 2024-07-15
- Non-compliance identified date: 2025-03-21
- Non-compliance end date: 2025-03-21
- Relevant policies: Section 4 of the NCSSRs v1.7 states:
  Certification Authorities and Delegated Third Parties SHALL:
  . . .
  f. Do one of the following within ninety-six (96) hours of discovery of a Critical Vulnerability not previously addressed by the CA’s vulnerability correction process:
    - Remediate the Critical Vulnerability;
    - If remediation of the Critical Vulnerability within ninety-six (96) hours is not possible, create and implement a plan to mitigate the Critical Vulnerability, giving priority to (i) vulnerabilities with high CVSS scores, starting with the vulnerabilities the CA determines are the most critical (such as those with a CVSS score of 10.0) and (ii) systems that lack sufficient compensating controls that, if the vulnerability were left unmitigated, would allow external system control, code execution, privilege escalation, or system compromise; or
    - Document the factual basis for the CA’s determination that the vulnerability does not require remediation because (i) the CA disagrees with the NVD rating, (ii) the identification is a false positive, (iii) the exploit of the vulnerability is prevented by compensating controls or an absence of threats, or (iv) other similar reasons.
- Source of incident disclosure: Self-reported. Let's Encrypt self-detected this while gathering evidence for an ongoing WebTrust audit.
Impact
This incident did not result in the misissuance of certificates or in incorrect revocation statuses.
We identified 58 vulnerabilities reported by our scanner between 2024-07-15 and 2025-03-21 for which remediation was not performed or an analysis was not documented within 96 hours. No single finding affected more than 4 logical hosts on our network, and only 8 of the vulnerabilities were still reported as present at the time this incident was discovered.
Of the reported vulnerabilities, scan logs showed:
- 4 vulnerabilities that were only present during a single weekly scan,
- 1 vulnerability that was present two weeks in a row,
- 17 vulnerabilities that were present for between 3 and 24 weeks in a row, and
- 36 vulnerabilities that were present for 6 months or more.
We are confident that the failure to timely remediate or document these vulnerabilities did not lead to any intrusion or compromise of our systems. We employ multiple layers of defense-in-depth that provide strong mitigating controls against the vulnerabilities that were identified, and we have observed no evidence of intrusion over the timespan of the incident.
Timeline
All times are UTC.
2022-07-02
- 00:14 New scanning infrastructure is set up and responsibility for the vulnerability response procedure is left primarily in the hands of a single SRE
2022-07-12
- 21:14 First vulnerability ticket for scan findings is manually created and triaged
2024-07-15
- 22:08 Last vulnerability ticket for scan findings is manually created and triaged
2024-07-17
- 01:39 Scanner finishes weekly scan and reports vulnerability findings via email
2024-07-21
- 01:39 Ninety-six hours pass from the last vulnerability scan with no tickets having been created or triaged according to our vulnerability response procedure - INCIDENT BEGINS
2024-07-31
- SRE who normally triages CVEs leaves on sabbatical, moving to a different team upon return
2025-03-19
- 01:27 Scanner finishes weekly scan and reports vulnerability findings via email, including 8 findings from the previous scan
2025-03-21
- 18:30 SREs collecting data for our annual WebTrust audit notice a discrepancy between documentation, practice, and requirements in the NCSSRs
- 19:16 SREs declare incident
- 19:34 SREs begin addressing findings from that week’s scan according to our vulnerability response procedure
- 21:24 Final outstanding vulnerability is addressed - INCIDENT ENDS
- 23:26 Preliminary incident report is posted to Bugzilla
Related Incidents
None.
Root Cause Analysis
To give some background, Let’s Encrypt has a formally documented vulnerability response procedure that the SRE team follows whenever we discover a critical security vulnerability (defined as a vulnerability with a CVSSv2 score of 7.0 or higher). We run automatic datacenter-wide vulnerability scans on a weekly basis, and we have the scanner configured to send an email notification to a team-wide mailing list every time a critical vulnerability is observed.
As part of our vulnerability response procedure, a member of the SRE team is expected to create a ticket in our tracking system for each new vulnerability that is reported, and they must see to it that either the vulnerability is remediated or a suitable justification and plan is documented in the ticket within 96 hours. Without this audit trail, we have no tangible evidence that we have complied with requirement 4.f of the NCSSRs v1.7.
This incident is a result of failing to follow our own vulnerability response procedure, which can be attributed to two contributing factors:
Contributing Factor #1: Failure to share knowledge and responsibilities
Description: Our current vulnerability response procedure relies heavily on manual action at multiple stages of the process, making it difficult to execute correctly from start to finish.
Firstly, the email notifications that the scanner sends to our team-wide mailing list are triggered per-vulnerability per-host, causing 50-100 false positive emails on average per week, with no historical context or deduplication on subsequent scans. The repetitive emails cause alert fatigue and create unnecessary toil to interpret, whereas a more aggregated style of report would make it easier to extract a list of CVEs to investigate.
Further, the vulnerability notifications we receive are not automatically linked to our ticketing system where they would have an auditable location to be analyzed and triaged. We rely on an SRE to translate the weekly barrage of emails into an appropriate set of tickets on a weekly basis.
Timeline: This cause dates back to when we first configured weekly scans in July of 2022. Because our vulnerability response procedure was difficult to execute, there was a large amount of friction involved in training other SREs to execute it, and so we never did. Then during July of 2024, the SRE that had been primarily responsible for this process began transitioning off of the SRE team and we failed to hand the responsibility over to other members of the team in a way that ensured it would get done in accordance with procedure.
Detection: In the course of investigating this incident, we revisited and updated our vulnerability response procedure.
Interaction with other factors: This cause played a role in Contributing Cause #1, in that the deficiencies in our current vulnerability response procedure exacerbated the difficulty of taking over the responsibility correctly.
Contributing Factor #2: Overly toilsome response procedure
Description: Our current vulnerability response procedure relies heavily on manual action at multiple stages of the process, making it difficult to execute correctly from start to finish.
Firstly, the email notifications that the scanner sends to our team-wide mailing list are triggered per-vulnerability per-host, causing 50-100 false positive emails on average per week, with no historical context or deduplication on subsequent scans. The repetitive emails cause alert fatigue and create unnecessary toil to interpret, whereas a more aggregated style of report would make it easier to extract a list of CVEs to investigate.
Further, the vulnerability notifications we receive are not automatically linked to our ticketing system where they would have an auditable location to be analyzed and triaged. We rely on an SRE to translate the weekly barrage of emails into an appropriate set of tickets on a weekly basis.
Timeline: This cause dates back to when we first configured weekly scans in July of 2022. Because our vulnerability response procedure was difficult to execute, there was a large amount of friction involved in training other SREs to execute it, and so we never did. Then during July of 2024, the SRE that had been primarily responsible for this process began transitioning off of the SRE team and we failed to hand the responsibility over to other members of the team in a way that ensured it would get done in accordance with procedure.
Detection: In the course of investigating this incident, we revisited and updated our vulnerability response procedure.
Interaction with other factors: This cause played a role in Contributing Cause #1, in that the deficiencies in our current vulnerability response procedure exacerbated the difficulty of taking over the responsibility correctly.
Lessons Learned
What went well:
- Our annual WebTrust audit proved effective in helping us uncover this issue.
- The majority of discovered vulnerabilities were eventually remediated through automatic weekly operating system updates, and the rest were addressed within a few hours of discovering this incident.
What didn’t go well:
- Responsibility for our vulnerability response procedure was assigned ad hoc to a single SRE and wasn’t properly handed off when they left the team.
- Lack of deduplication or aggregation of vulnerability findings created a toilsome review process that was hard to get right.
- Lack of automation meant that scan results were not being turned into an auditable set of tracking tickets each week.
Where we got lucky:
- N/A
Action Items
Action Item | Kind | Corresponding Root Cause(s) | Evaluation Criteria | Due Date | Status |
---|---|---|---|---|---|
Triage and address all outstanding vulnerabilities | Mitigate | #1 | Tracking tickets were created and closed for every scanner finding from the week that this incident was discovered. | 2025-03-21 | Complete |
Put in place an interim vulnerability review team | Mitigate | #1 | Team consists of at least two SREs who are familiar with the vulnerability response procedure and can conduct reviews until we put in place a policy to explicitly delegate the responsibility | 2025-03-21 | Complete |
Automatically generate aggregated vulnerability reports | Mitigate | #2 | Vulnerability findings can be reviewed in aggregate, rather than per-host-per-vulnerability. | 2025-03-27 | Complete |
Automate creation of tracking tickets based on vulnerability scan results | Prevent | #2 | Tracking tickets are created automatically with a summary of vulnerability scan findings. | 2025-04-02 | Complete |
Revise our vulnerability response procedure to require that the responsibility is shared appropriately among the SRE team | Prevent | #1 | Policy requires that at least two SREs are familiar with the vulnerability response procedure and are able to regularly review scan results. | 2025-04-02 | Ongoing |
Train relevant personnel on the vulnerability response process | Prevent | #1, #2 | Those responsible for vulnerability response according to the new procedure are properly trained to carry it out. | 2025-04-16 | Ongoing |
Comment 3•4 months ago
(In reply to Preston Locke from comment #2)
Thank you Preston and Let’s Encrypt for this very thorough and thoughtful incident report. However, I think that the content for Contributing Factor #1 has been replaced by a copy of Contributing Factor #2.
Contributing Factor #1: Failure to share knowledge and responsibilities
Description: Our current vulnerability response procedure relies heavily on manual action at multiple stages of the process, making it difficult to execute correctly from start to finish.
Firstly, the email notifications that the scanner sends to our team-wide mailing list are triggered per-vulnerability per-host, causing 50-100 false positive emails on average per week, with no historical context or deduplication on subsequent scans. The repetitive emails cause alert fatigue and create unnecessary toil to interpret, whereas a more aggregated style of report would make it easier to extract a list of CVEs to investigate.
Further, the vulnerability notifications we receive are not automatically linked to our ticketing system where they would have an auditable location to be analyzed and triaged. We rely on an SRE to translate the weekly barrage of emails into an appropriate set of tickets on a weekly basis.
Timeline: This cause dates back to when we first configured weekly scans in July of 2022. Because our vulnerability response procedure was difficult to execute, there was a large amount of friction involved in training other SREs to execute it, and so we never did. Then during July of 2024, the SRE that had been primarily responsible for this process began transitioning off of the SRE team and we failed to hand the responsibility over to other members of the team in a way that ensured it would get done in accordance with procedure.
Detection: In the course of investigating this incident, we revisited and updated our vulnerability response procedure.
Interaction with other factors: This cause played a role in Contributing Cause #1, in that the deficiencies in our current vulnerability response procedure exacerbated the difficulty of taking over the responsibility correctly.
Contributing Factor #2: Overly toilsome response procedure
Description: Our current vulnerability response procedure relies heavily on manual action at multiple stages of the process, making it difficult to execute correctly from start to finish.
Firstly, the email notifications that the scanner sends to our team-wide mailing list are triggered per-vulnerability per-host, causing 50-100 false positive emails on average per week, with no historical context or deduplication on subsequent scans. The repetitive emails cause alert fatigue and create unnecessary toil to interpret, whereas a more aggregated style of report would make it easier to extract a list of CVEs to investigate.
Further, the vulnerability notifications we receive are not automatically linked to our ticketing system where they would have an auditable location to be analyzed and triaged. We rely on an SRE to translate the weekly barrage of emails into an appropriate set of tickets on a weekly basis.
Timeline: This cause dates back to when we first configured weekly scans in July of 2022. Because our vulnerability response procedure was difficult to execute, there was a large amount of friction involved in training other SREs to execute it, and so we never did. Then during July of 2024, the SRE that had been primarily responsible for this process began transitioning off of the SRE team and we failed to hand the responsibility over to other members of the team in a way that ensured it would get done in accordance with procedure.
Detection: In the course of investigating this incident, we revisited and updated our vulnerability response procedure.
Interaction with other factors: This cause played a role in Contributing Cause #1, in that the deficiencies in our current vulnerability response procedure exacerbated the difficulty of taking over the responsibility correctly.
Comment 4•4 months ago
Oops... Thank you Zacharias for pointing that out! Here's the corrected report:
Full Incident Report
Summary
- CA Owner CCADB unique ID: A000320
- Incident description: Let’s Encrypt performs weekly vulnerability scans on our infrastructure. We have identified multiple instances where we failed to remediate or document critical vulnerabilities (defined as those with CVSSv2 scores >= 7.0) within 96 hours of receiving scan results, as required by Version 1.7 of the NCSSRs. Note that the preliminary incident disclosure described this incident as Let’s Encrypt failing to document compensating controls for a single critical vulnerability finding. Upon further investigation, we have expanded the scope of this incident to include failure to timely remediate or document our analyses of multiple critical vulnerabilities.
- Timeline summary:
- Non-compliance start date: 2024-07-15
- Non-compliance identified date: 2025-03-21
- Non-compliance end date: 2025-03-21
- Relevant policies: Section 4 of the NCSSRs v1.7 states:
  Certification Authorities and Delegated Third Parties SHALL:
  . . .
  f. Do one of the following within ninety-six (96) hours of discovery of a Critical Vulnerability not previously addressed by the CA’s vulnerability correction process:
    - Remediate the Critical Vulnerability;
    - If remediation of the Critical Vulnerability within ninety-six (96) hours is not possible, create and implement a plan to mitigate the Critical Vulnerability, giving priority to (i) vulnerabilities with high CVSS scores, starting with the vulnerabilities the CA determines are the most critical (such as those with a CVSS score of 10.0) and (ii) systems that lack sufficient compensating controls that, if the vulnerability were left unmitigated, would allow external system control, code execution, privilege escalation, or system compromise; or
    - Document the factual basis for the CA’s determination that the vulnerability does not require remediation because (i) the CA disagrees with the NVD rating, (ii) the identification is a false positive, (iii) the exploit of the vulnerability is prevented by compensating controls or an absence of threats, or (iv) other similar reasons.
- Source of incident disclosure: Self-reported. Let's Encrypt self-detected this while gathering evidence for an ongoing WebTrust audit.
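For readers mapping the three 4(f) options onto day-to-day practice, the requirement reduces to a per-finding deadline check with three acceptable documented outcomes. The sketch below is illustrative only and is not our production tooling; every name and type in it is an assumption made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum, auto
from typing import Optional

# Hypothetical sketch of the Section 4(f) decision logic quoted above.
# Names and structure are assumptions for illustration, not Let's Encrypt tooling.

REMEDIATION_WINDOW = timedelta(hours=96)
CRITICAL_THRESHOLD = 7.0  # CVSSv2 score at or above which a finding is treated as Critical


class Disposition(Enum):
    REMEDIATED = auto()                       # 4(f) option 1
    MITIGATION_PLAN_DOCUMENTED = auto()       # 4(f) option 2
    NO_REMEDIATION_BASIS_DOCUMENTED = auto()  # 4(f) option 3 (false positive, compensating controls, ...)


@dataclass
class Finding:
    cve_id: str
    cvss_v2: float
    detected_at: datetime
    disposition: Optional[Disposition] = None
    dispositioned_at: Optional[datetime] = None


def violates_96_hour_rule(finding: Finding, now: datetime) -> bool:
    """True if a Critical finding passed its 96-hour window without a documented outcome."""
    if finding.cvss_v2 < CRITICAL_THRESHOLD:
        return False  # not a Critical Vulnerability under this definition
    deadline = finding.detected_at + REMEDIATION_WINDOW
    if finding.disposition is not None and finding.dispositioned_at is not None:
        return finding.dispositioned_at > deadline
    return now > deadline
```

Our actual procedure records these outcomes as tickets in our tracking system rather than in code, but the deadline arithmetic is the same.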
Impact
This incident did not result in the misissuance of certificates or in incorrect revocation statuses.
We identified 58 vulnerabilities reported by our scanner between 2024-07-15 and 2025-03-21 for which remediation was not performed or an analysis was not documented within 96 hours. No single finding affected more than 4 logical hosts on our network, and only 8 of the vulnerabilities were still reported as present at the time this incident was discovered.
Of the reported vulnerabilities, scan logs showed:
- 4 vulnerabilities that were only present during a single weekly scan,
- 1 vulnerability that was present two weeks in a row,
- 17 vulnerabilities that were present for between 3 and 24 weeks in a row, and
- 36 vulnerabilities that were present for 6 months or more.
We are confident that the failure to timely remediate or document these vulnerabilities did not lead to any intrusion or compromise of our systems. We employ multiple layers of defense-in-depth that provide strong mitigating controls against the vulnerabilities that were identified, and we have observed no evidence of intrusion over the timespan of the incident.
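The persistence figures above come from comparing successive weekly scan results. As a rough illustration of that comparison (the input format here is an assumption, not our scanner's output), the following sketch tallies the longest run of consecutive weekly scans in which each CVE appears:

```python
from collections import defaultdict

# Hypothetical sketch: `scans` maps a scan date to the set of CVE IDs reported that
# week. The input shape is an assumption made for the example, not our scanner's format.


def longest_streaks(scans):
    """Return, per CVE, the longest run of consecutive weekly scans reporting it."""
    streaks = defaultdict(int)   # CVE -> longest streak seen so far
    current = defaultdict(int)   # CVE -> streak ending at the scan being processed
    for date in sorted(scans):
        present = scans[date]
        for cve in present:
            current[cve] += 1
            streaks[cve] = max(streaks[cve], current[cve])
        for cve in list(current):
            if cve not in present:
                current[cve] = 0  # streak broken; the CVE was not reported this week
    return dict(streaks)


# Example: a CVE present three weeks in a row, absent one week, then present once more.
example = {
    "2025-02-26": {"CVE-2025-0001"},
    "2025-03-05": {"CVE-2025-0001"},
    "2025-03-12": {"CVE-2025-0001"},
    "2025-03-19": set(),
    "2025-03-26": {"CVE-2025-0001"},
}
assert longest_streaks(example)["CVE-2025-0001"] == 3
```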
Timeline
All times are UTC.
2022-07-02
- 00:14 New scanning infrastructure is set up and responsibility for the vulnerability response procedure is left primarily in the hands of a single SRE
2022-07-12
- 21:14 First vulnerability ticket for scan findings is manually created and triaged
2024-07-15
- 22:08 Last vulnerability ticket for scan findings is manually created and triaged
2024-07-17
- 01:39 Scanner finishes weekly scan and reports vulnerability findings via email
2024-07-21
- 01:39 Ninety-six hours pass from the last vulnerability scan with no tickets having been created or triaged according to our vulnerability response procedure - INCIDENT BEGINS
2024-07-31
- SRE who normally triages CVEs leaves on sabbatical, moving to a different team upon return
2025-03-19
- 01:27 Scanner finishes weekly scan and reports vulnerability findings via email, including 8 findings from the previous scan
2025-03-21
- 18:30 SREs collecting data for our annual WebTrust audit notice a discrepancy between documentation, practice, and requirements in the NCSSRs
- 19:16 SREs declare incident
- 19:34 SREs begin addressing findings from that week’s scan according to our vulnerability response procedure
- 21:24 Final outstanding vulnerability is addressed - INCIDENT ENDS
- 23:26 Preliminary incident report is posted to Bugzilla
Related Incidents
None.
Root Cause Analysis
To give some background, Let’s Encrypt has a formally documented vulnerability response procedure that the SRE team follows whenever we discover a critical security vulnerability (defined as a vulnerability with a CVSSv2 score of 7.0 or higher). We run automatic datacenter-wide vulnerability scans on a weekly basis, and we have the scanner configured to send an email notification to a team-wide mailing list every time a critical vulnerability is observed.
As part of our vulnerability response procedure, a member of the SRE team is expected to create a ticket in our tracking system for each new vulnerability that is reported, and they must see to it that either the vulnerability is remediated or a suitable justification and plan is documented in the ticket within 96 hours. Without this audit trail, we have no tangible evidence that we have complied with requirement 4.f of the NCSSRs v1.7.
This incident is a result of failing to follow our own vulnerability response procedure, which can be attributed to two contributing factors:
Contributing Factor #1: Failure to share knowledge and responsibilities
Description: While Let’s Encrypt has a formally documented vulnerability response procedure, we have primarily relied on a single subject-matter expert to carry it out.
Ever since the current vulnerability scanning infrastructure was configured, the work of documenting and responding to findings has been dutifully performed by an individual SRE. This role was assigned ad hoc and not clearly documented in onboarding/offboarding materials as a responsibility that must be handed over, nor was it documented anywhere as a responsibility that should be shared among the team.
When the SRE that had been responsible for carrying out our vulnerability response procedure transitioned to a new team within Let’s Encrypt, we failed to hand off that responsibility so that it would continue to be done correctly. Other members of the SRE team reviewed weekly scan results informally, but they were not aware they needed to document their analyses in an auditable way.
This is not a failure of any individual; rather, it is a failure of Let’s Encrypt’s processes for ensuring continuity in the face of staffing changes. Had we shared this responsibility, and in turn the knowledge of how to perform it, more formally as a team, the transition of a single team member would not have caused us to stop following our procedures correctly.
Timeline: This cause dates back to when we first configured weekly scans in July of 2022 without establishing a shared pool of knowledge and responsibility concerning our vulnerability response procedure. Then during July of 2024, the SRE that had been primarily responsible for this process began transitioning off of the SRE team and we stopped following the procedure correctly.
Detection: While collecting evidence for a WebTrust audit, we discovered that the trail of vulnerability tickets ceased at approximately the same time that the individual who had been creating them was preparing to leave the SRE team.
Interaction with other factors: Contributing Factor #2 played a role in why we failed to hand off this work.
Contributing Factor #2: Overly toilsome response procedure
Description: Our current vulnerability response procedure relies heavily on manual action at multiple stages of the process, making it difficult to execute correctly from start to finish.
Firstly, the email notifications that the scanner sends to our team-wide mailing list are triggered per-vulnerability per-host, causing 50-100 false positive emails on average per week, with no historical context or deduplication on subsequent scans. The repetitive emails cause alert fatigue and create unnecessary toil to interpret, whereas a more aggregated style of report would make it easier to extract a list of CVEs to investigate.
Further, the vulnerability notifications we receive are not automatically linked to our ticketing system where they would have an auditable location to be analyzed and triaged. We rely on an SRE to translate the weekly barrage of emails into an appropriate set of tickets on a weekly basis. (An illustrative sketch of the kind of aggregated report described here appears after this section.)
Timeline: This cause dates back to when we first configured weekly scans in July of 2022. Because our vulnerability response procedure was difficult to execute, there was a large amount of friction involved in training other SREs to execute it, and so we never did. Then during July of 2024, the SRE that had been primarily responsible for this process began transitioning off of the SRE team and we failed to hand the responsibility over to other members of the team in a way that ensured it would get done in accordance with procedure.
Detection: In the course of investigating this incident, we revisited and updated our vulnerability response procedure.
Interaction with other factors: This cause played a role in Contributing Factor #1, in that the deficiencies in our current vulnerability response procedure exacerbated the difficulty of taking over the responsibility correctly.
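To illustrate the aggregated, deduplicated style of report referred to in the description above, here is a minimal sketch. It assumes a CSV export with cve_id and host columns, which is an assumption about the scanner's output format rather than a description of our actual pipeline:

```python
from collections import defaultdict
import csv

# Hypothetical sketch: collapse per-vulnerability-per-host findings into one line
# per CVE, and flag which CVEs are new versus carried over from the previous scan.
# Column names ("cve_id", "host") are assumptions about the scanner export.


def load_findings(path):
    """Yield (cve_id, host) pairs from a scanner CSV export."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield row["cve_id"], row["host"]


def aggregate(path):
    """Group affected hosts by CVE so each CVE appears once in the report."""
    by_cve = defaultdict(set)
    for cve, host in load_findings(path):
        by_cve[cve].add(host)
    return by_cve


def weekly_report(current_csv, previous_csv):
    current, previous = aggregate(current_csv), aggregate(previous_csv)
    for cve in sorted(current):
        status = "carried over" if cve in previous else "new this week"
        hosts = ", ".join(sorted(current[cve]))
        print(f"{cve} ({status}): {len(current[cve])} host(s): {hosts}")


# Example usage (file names are placeholders):
# weekly_report("scan-2025-03-19.csv", "scan-2025-03-12.csv")
```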
Lessons Learned
What went well:
- Our annual WebTrust audit proved effective in helping us uncover this issue.
- The majority of discovered vulnerabilities were eventually remediated through automatic weekly operating system updates, and the rest were addressed within a few hours of discovering this incident.
What didn’t go well:
- Responsibility for our vulnerability response procedure was assigned ad hoc to a single SRE and wasn’t properly handed off when they left the team.
- Lack of deduplication or aggregation of vulnerability findings created a toilsome review process that was hard to get right.
- Lack of automation meant that scan results were not being turned into an auditable set of tracking tickets each week.
Where we got lucky:
- N/A
Action Items
Action Item | Kind | Corresponding Root Cause(s) | Evaluation Criteria | Due Date | Status |
---|---|---|---|---|---|
Triage and address all outstanding vulnerabilities | Mitigate | #1 | Tracking tickets were created and closed for every scanner finding from the week that this incident was discovered. | 2025-03-21 | Complete |
Put in place an interim vulnerability review team | Mitigate | #1 | Team consists of at least two SREs who are familiar with the vulnerability response procedure and can conduct reviews until we put in place a policy to explicitly delegate the responsibility | 2025-03-21 | Complete |
Automatically generate aggregated vulnerability reports | Mitigate | #2 | Vulnerability findings can be reviewed in aggregate, rather than per-host-per-vulnerability. | 2025-03-27 | Complete |
Automate creation of tracking tickets based on vulnerability scan results | Prevent | #2 | Tracking tickets are created automatically with a summary of vulnerability scan findings. | 2025-04-02 | Complete |
Revise our vulnerability response procedure to require that the responsibility is shared appropriately among the SRE team | Prevent | #1 | Policy requires that at least two SREs are familiar with the vulnerability response procedure and are able to regularly review scan results. | 2025-04-02 | Ongoing |
Train relevant personnel on the vulnerability response process | Prevent | #1, #2 | Those responsible for vulnerability response according to the new procedure are properly trained to carry it out. | 2025-04-16 | Ongoing |
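As a rough illustration of the "Automate creation of tracking tickets" action item above, the sketch below files one summary ticket per weekly scan through a generic HTTP API. The endpoint, payload fields, and authentication scheme are placeholders; a real integration would use the ticketing system's documented API:

```python
import requests  # third-party HTTP client

# Hypothetical sketch only: the endpoint, payload shape, and auth are placeholders,
# not the interface of any particular ticketing system.

TICKET_API = "https://ticketing.example.internal/api/issues"


def file_weekly_ticket(summary_lines, api_token):
    """Create a single tracking ticket summarizing one weekly scan's findings."""
    body = {
        "title": "Weekly vulnerability scan review",
        "description": "\n".join(summary_lines),
        "labels": ["vulnerability-response", "ncssr-4f"],
    }
    resp = requests.post(
        TICKET_API,
        json=body,
        headers={"Authorization": f"Bearer {api_token}"},
        timeout=30,
    )
    resp.raise_for_status()  # surface failures so a missed ticket is noticed
    return resp.json().get("id")
```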
Comment 5•4 months ago
Ah, apologies once again! I've just noticed that I got the due dates on the final two remediation items wrong. Those should be 2025-05-02 and 2025-05-16, respectively.
Comment 6•4 months ago
We're still working on updates to our vulnerability response procedure and organizing the resulting training. We request a next-update of 2025-05-02.
Comment 7•4 months ago
Let's Encrypt has internally published a new version of our Vulnerability Response Procedure requiring a weekly meeting of SRE team members to review scan results and other sources of vulnerability reports. We have already held four of these weekly meetings, and we're optimistic about the resilience of this process compared to our previous one.
Now that the new procedure has been published, we'll be holding a training in the next couple of weeks to ensure relevant team members are caught up on the changes and can contribute effectively to the weekly meetings. We request a next-update of 2025-05-16.
Comment 8•3 months ago
Let's Encrypt SRE completed the training on our new procedure on 2025-05-14, finishing our remediation items for this incident. If there are no questions, we believe this incident can be closed.
Closure Summary
- Incident Description: Let’s Encrypt performs weekly vulnerability scans on our infrastructure. We identified multiple instances where we failed to remediate or document critical vulnerabilities (defined as those with CVSSv2 scores >= 7.0) within 96 hours of receiving scan results, as required by Version 1.7 of the NCSSRs.
- Incident Root Causes: We failed to follow our documented Vulnerability Response Procedure because it relied heavily on manual action and was owned by a single SRE who transitioned to a new team. When responsibility for carrying out the procedure was handed off to other team members, the requirement that each critical vulnerability receive a ticket in our tracking system was not properly communicated, and there was no automation in place to create those tickets for us.
- Remediation Description: Upon discovery of the incident, SREs immediately began triaging, addressing, and documenting all outstanding critical vulnerabilities reported in the most recent weekly scan. We also put in place an interim team of SREs that understood the relevant requirements and could review and document weekly scan results until we could put in place a more robust process. We improved the format in which we receive vulnerability data by creating an aggregated report that runs after each weekly scan, and we hooked that into our ticketing system to ensure the triage of critical vulnerabilities always has an audit trail. Finally, we updated our Vulnerability Response Procedure to require a weekly meeting of SRE team members to review vulnerabilities, and we carried out a training on the new procedure. We believe our new procedure effectively eliminates the single-point-of-failure issue that led to this incident, as we will now share the load of this responsibility more broadly among the team.
- Commitment Summary: N/A
Comment 9•3 months ago
Thank you for this report. This is a great example of effective and transparent incident reporting. We value the comprehensive detail in this self-reported incident, and we particularly appreciate that the descriptions provided in the root cause analysis were thorough and constructively focused on systemic process improvements rather than individual blame.
Comment 10•3 months ago
This is a final call for comments or questions on this Incident Report.
Otherwise, it will be closed on approximately 2025-06-10.