Bug 1869056 (Closed): Opened 9 months ago, Closed 7 months ago

Sectigo: Inadequate vulnerability scanning and patching

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: martijn.katerbarg, Assigned: martijn.katerbarg)

Details

(Whiteboard: [ca-compliance] [policy-failure])

Incident Report

Summary

During our ETSI audit, we became aware of deficiencies in our internal vulnerability scanning and patching of discovered vulnerabilities.

As a result of the events described below, between May 15, 2023 and the end of October 2023, internal agent-based scans were not running properly and over 400 previously discovered vulnerabilities were not acted upon.

All references to vulnerability scans within this document refer to our internal vulnerability scanning, unless stated otherwise.

Impact

The audit revealed a total of 418 vulnerabilities with a CVSS score of 7.0 or higher, marked as exploitable, for which patching or analysis had not been completed. Additionally, internal vulnerability scans were not executed for a total of 5 months.

External vulnerability scans on our CA infrastructure, which are performed on a weekly basis, were not impacted by this incident.

Timeline

All times are UTC.

2022:

  • We deploy a new vulnerability management and scanning system, Tenable. For our internal vulnerability scans, we opt to use agent-based scanning, which utilizes software agents installed on each machine for internal scanning of the individual machines.

2023-05-15:

  • 08:00 Unbeknownst to the IT Security team, agent-based scanning stopped working. The agent scanner isn’t being actively monitored and the breakdown goes undetected (until October).

2023-10-24:

  • 15:00 During a call with our ETSI auditors, we are asked to provide evidence of vulnerability scanning. Our newly appointed VP of IT / Security joins the call and presents the current standing within Tenable, showing 418 unresolved vulnerabilities with a CVSS score of 7.0 or higher.
  • 19:00 An internal call is held with stakeholders from IT Security, IT Operations and Compliance to discuss the findings and agree on an appropriate path forward. We start working on a mitigation and remediation plan and agree that these teams will work in parallel, with IT Ops taking the lead on patching where possible. During the call we discover that agent-based scanning stopped working on May 15th.

2023-10-26:

  • 14:30 We have a follow-up internal call to discuss current progress and further steps.

2023-10-27:

  • 21:00 All results from servers no longer in active service are removed from Tenable.

  • 22:17 We send our mitigation and remediation plan to our ETSI auditors. This plan includes dates by which we intend to complete patching within each of our datacenters, which are referred to as DC1, DC2 and DC3 in this incident report. We aim to:

    • Review the actual risk for each unique vulnerability and assess and document if patching is required or if other mitigating controls are already in place by 2023-11-10.
    • Apply relevant and required patches to DC2 and DC3 infrastructure where possible without impacting production systems by 2023-11-10.
    • Apply remaining patches to DC2 and DC3 infrastructure during planned maintenance windows by 2023-11-15.
    • Apply all patches to DC1 infrastructure during maintenance windows by 2023-12-15.

2023-11-01 – 2023-11-10:

  • 00:00 Patching of systems that can be updated without an outage is ongoing across all three datacenters.

2023-11-04:

  • 17:00 Agent-based vulnerability scanning of internal systems is re-enabled and confirmed working.

2023-11-08:

  • 14:09 We generate an updated report from Tenable and submit this to our auditors.

2023-11-12:

  • 00:00 During a scheduled outage window for planned maintenance, we apply patches to DC2 and DC3 infrastructure.

2023-11-14:

  • 15:01 A new Tenable report is requested and generated. It shows that all patching of DC2 has been completed, with 0 remaining vulnerabilities. For DC3, 18 vulnerabilities remain; these are deemed mitigated through compensating controls. DC3 is used only for development operations and our offsite backup infrastructure and does not directly interact with our Certificate Systems.

2023-11-19:

  • 00:00 During a scheduled outage window for planned maintenance, we patch nearly all remaining systems, with focus on DC1.

2023-11-20:

  • 14:25 A new Tenable report is run after the weekend's patching is complete. A total of 6 vulnerabilities remain, all in DC1. After investigation, all 6 remaining vulnerabilities are deemed false positives.

2023-11-27:

  • 09:39 We mark the remaining 6 vulnerabilities as false positives within Tenable.

2023-11-28:

  • 16:28 We run a new Tenable report, which shows a total of 0 remaining critical vulnerabilities, two weeks ahead of our planned schedule.

Root Cause Analysis

The solutions Sectigo used in the past, in common with other non-agent-based vulnerability scanning tools, rarely scan anything beyond the web applications themselves, which makes it nearly impossible to check server patch levels.

Tenable takes a different approach, in which every server is scanned from inside the network, at a much deeper level. Due to this deeper inspection, the number of discovered vulnerabilities in our network was far higher than expected, albeit with a false positive rate noticeably higher than that of other scanning tools we have used in the past. All of the discovered vulnerabilities were related to out-of-date packages, which external scanning tools generally are unable to detect. The high number of discovered vulnerabilities overwhelmed the IT / Security team, who failed to take the required action in a timely fashion.
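
To illustrate the difference in approach: an agent-style check runs on the host itself and can compare installed package versions against known fixed versions, something an external scan of a web application cannot see. The following is a minimal, hypothetical sketch of that idea for a Debian-based host; the advisory data is a placeholder and this is not how Tenable works internally.

    #!/usr/bin/env python3
    """Sketch of an agent-style package-level check (illustrative only)."""
    import subprocess

    # Hypothetical advisory data: package -> first fixed version.
    # A real scanner would pull this from a vendor feed; these values are placeholders.
    FIXED_VERSIONS = {
        "openssl": "3.0.11-1~deb12u1",
        "curl": "7.88.1-10+deb12u4",
    }

    def installed_packages():
        """Return {package: version} for everything dpkg knows about."""
        out = subprocess.run(
            ["dpkg-query", "--show", "--showformat", r"${Package} ${Version}\n"],
            capture_output=True, text=True, check=True,
        ).stdout
        return dict(line.split(" ", 1) for line in out.splitlines() if " " in line)

    def is_older(installed, fixed):
        """True if 'installed' sorts before 'fixed' under Debian version rules."""
        return subprocess.run(
            ["dpkg", "--compare-versions", installed, "lt", fixed]
        ).returncode == 0

    if __name__ == "__main__":
        pkgs = installed_packages()
        for name, fixed in FIXED_VERSIONS.items():
            if name in pkgs and is_older(pkgs[name], fixed):
                print(f"VULNERABLE: {name} {pkgs[name]} (fixed in {fixed})")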

Additionally, the IT Security team had not previously experienced an issue in which the scanning agents simply stopped reporting to the reporting server. The team also failed to review the agent scan logs to confirm that the data being presented was accurate.

In July and August we went through several personnel changes within the IT Security team, the team responsible for checking the agent scanner. This had the unfortunate effect of delaying discovery of the breakdown of our Tenable agent-based scanning. It was not until our ETSI audit in October, after the new VP of IT / Security and the new Security Engineering team had taken over, that we rediscovered the issue of unresolved vulnerabilities.

Lessons Learned

The IT Security Team has learned to review the agent-based scanning and confirm it is working correctly. Until we add automated monitoring, the team will log in to the system daily to check the agent scanner and confirm that the latest scan completed correctly.
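
As an illustration of the kind of automated check we intend to add: a small scheduled job could query the scanner for the timestamp of the most recent completed agent scan and alert if it is older than expected. The endpoint and field names below are placeholders rather than Tenable's actual API; this is a sketch of the monitoring concept only, assuming credentials are provided via environment variables.

    #!/usr/bin/env python3
    """Sketch of a scan-freshness check (placeholder endpoint, not Tenable's real API)."""
    import os
    import sys
    from datetime import datetime, timedelta, timezone

    import requests

    SCANNER_URL = os.environ["SCANNER_URL"]   # e.g. https://scanner.example.internal (placeholder)
    API_KEY = os.environ["SCANNER_API_KEY"]   # assumed to be provisioned separately
    MAX_AGE = timedelta(hours=36)             # alert if no completed agent scan within this window

    def latest_agent_scan_time():
        # Hypothetical endpoint returning completed agent scans, newest first.
        resp = requests.get(
            f"{SCANNER_URL}/api/agent-scans?status=completed&limit=1",
            headers={"X-Api-Key": API_KEY},
            timeout=30,
        )
        resp.raise_for_status()
        finished = resp.json()["scans"][0]["finished_at"]  # ISO 8601 with offset (assumed)
        return datetime.fromisoformat(finished)

    if __name__ == "__main__":
        age = datetime.now(timezone.utc) - latest_agent_scan_time()
        if age > MAX_AGE:
            print(f"ALERT: last completed agent scan finished {age} ago")
            sys.exit(1)   # non-zero exit lets cron/monitoring raise a notification
        print("OK: agent scans are current")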

What went well

  • We found several false positives and processed these as such within the next two days.
  • Sectigo has other security measures in place that prevented outside attackers from accessing these machines, either externally or from within the internal network.

What didn't go well

  • Tenable is a very powerful scanning tool. While we aim to use the best tools available, the amount of detail in the scanning results has been staggering. The shift in personnel created a steep learning curve for new employees.
  • Proper monitoring of the Tenable agent scan functionality was not in place. This, together with the personnel changes mentioned earlier, led to a failure to comply with the CA/Browser Forum Network and Certificate System Security Requirements (NSRs).
  • Several servers that were still included in the vulnerability reports had already been decommissioned. However, the results were still shown as valid within Tenable. These were cleaned up once we realized this.
  • While investigating this incident, we noticed that vulnerabilities discovered through external vulnerability scans, although analyzed and processed in a timely fashion, were not always marked as false positives (when deemed so after analysis) within the external vulnerability reports. Our policies around this have since been tightened.
  • Our Internal Audit procedures did not adequately monitor scanning and patching procedures.
  • With our initial focus squarely on presenting a remediation plan to our auditors in a timely fashion, we regrettably failed to complete and post this full incident report within two weeks.

Where we got lucky

  • While the number of 418 exploitable vulnerabilities seems high, the number of distinct vulnerabilities is far lower. Tenable reports each vulnerability individually for each affected server; there were in fact only 26 distinct vulnerabilities, with most servers sharing the same ones (see the sketch below).
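
For context on how the 418-versus-26 figure arises: each entry in a per-host findings export is one (host, vulnerability) pair, so counting distinct vulnerabilities is a matter of de-duplicating on the plugin or CVE identifier. A minimal sketch follows, assuming a CSV export with "Plugin ID", "Plugin Name" and "Host" columns (actual column names vary between report templates).

    #!/usr/bin/env python3
    """Count distinct vulnerabilities in a per-host findings export (column names assumed)."""
    import csv
    from collections import defaultdict

    def distinct_findings(path):
        hosts_by_plugin = defaultdict(set)
        names = {}
        with open(path, newline="") as fh:
            for row in csv.DictReader(fh):
                plugin = row["Plugin ID"]              # assumed column name
                hosts_by_plugin[plugin].add(row["Host"])
                names[plugin] = row.get("Plugin Name", "")
        return hosts_by_plugin, names

    if __name__ == "__main__":
        hosts_by_plugin, names = distinct_findings("findings.csv")
        total = sum(len(hosts) for hosts in hosts_by_plugin.values())
        print(f"{total} findings across {len(hosts_by_plugin)} distinct vulnerabilities")
        for plugin, hosts in sorted(hosts_by_plugin.items(), key=lambda kv: -len(kv[1])):
            print(f"  {plugin} {names[plugin]}: {len(hosts)} hosts")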

Action Items

Action Item | Kind | Due Date
Integrate automated JIRA ticket creation for discovered vulnerabilities (see the sketch below) | Mitigate | Completed
Continued education of IT security staff in working with Tenable | Prevent | N/A - Ongoing
Add automated monitoring of Tenable and set up notifications in case of scanning agent issues | Prevent | 2024-01-31
Update internal processes so any machine taken out of commission is also properly removed from Tenable | Prevent | 2023-12-31
Add controls to our internal audit process to monitor our internal and external vulnerability scanning compliance with the CA/Browser Forum NSRs | Detect | 2024-01-31
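
To illustrate the first action item above, the following is a rough sketch of what automated ticket creation against Jira's REST API can look like. The finding structure, project key and credentials are placeholders; the details of our actual integration are internal.

    #!/usr/bin/env python3
    """Sketch of automated Jira ticket creation for new findings (placeholder values)."""
    import os

    import requests

    JIRA_URL = os.environ["JIRA_URL"]                 # e.g. https://jira.example.internal (placeholder)
    AUTH = (os.environ["JIRA_USER"], os.environ["JIRA_TOKEN"])
    PROJECT_KEY = "VULN"                              # hypothetical project key

    def create_ticket(finding):
        """Open one Jira issue per finding; 'finding' is a dict derived from a scanner export."""
        payload = {
            "fields": {
                "project": {"key": PROJECT_KEY},
                "issuetype": {"name": "Task"},
                "summary": f"[{finding['severity']}] {finding['name']} on {finding['host']}",
                "description": f"CVSS {finding['cvss']}; plugin {finding['plugin_id']}. "
                               f"Patch or document compensating controls.",
            }
        }
        resp = requests.post(
            f"{JIRA_URL}/rest/api/2/issue",
            json=payload,
            auth=AUTH,
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()["key"]                     # e.g. "VULN-123"

    if __name__ == "__main__":
        example = {"severity": "High", "name": "Example out-of-date package",
                   "host": "host01", "cvss": 7.5, "plugin_id": "000000"}
        print("Created", create_ticket(example))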
Assignee: nobody → martijn.katerbarg
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Whiteboard: [ca-compliance]
Whiteboard: [ca-compliance] → [ca-compliance] [policy-failure]

From what I've gathered from the documentation & packages of Tenable, these agents have root-level access to the machines that the CA uses. Does this company have a delegated third party role with the CA? If not, what steps have been taken to not allow Tenable itself to be a vector of attack on these systems?

Am I also reading this correctly that for about five months, none of these servers received updates?

Is regular updating of these servers part of the normal operations workflows? If so, how did these servers go five months without receiving an update?

(In reply to amir from comment #1)

From what I've gathered from the documentation & packages of Tenable, these agents have root-level access to the machines that the CA uses. Does this company have a delegated third party role with the CA? If not, what steps have been taken to not allow Tenable itself to be a vector of attack on these systems?

We are using a self-hosted version of Tenable, to which only the Sectigo IT / Security team has access. Tenable employees do not have any kind of access to our servers or our installation of the Tenable security center. Therefore, Tenable is not a delegated third party.

Am I also reading this correctly that for about five months, none of these servers received updates?
Is regular updating of these servers part of the normal operations workflows? If so, how did these servers go five months without receiving an update?

For operating system package updates, unfortunately that is the case. The personnel changes mentioned in comment #0 led to miscommunication and misunderstanding about roles and responsibilities, and a meaningful amount of infrastructure patching suffered until we recently discovered the oversight.

Tenable is effective at monitoring which packages need patching, which is part of the reason we implemented it. We did, however, fail to monitor the output and the continued functioning of the product itself. As specified in the Action Items, we are implementing automated monitoring of Tenable; with that in place, we believe this risk will be mitigated going forward.

To clarify where this error fits in context, we would like to give some insight into our wider infrastructure, without revealing anything too sensitive. Our current infrastructure mainly relies on a few setups:

  • Bare-metal servers
  • Virtual Machines
  • Containers / Kubernetes

We began using Kubernetes in 2020, and at this stage run several of our Certificate (Management) Systems on it. We rely heavily on the concept of infrastructure-as-code and use additional tooling, such as “renovatebot”, to keep up with updates (patches) of dependencies. All these systems receive, and have received, regular updates.

We also have services running on bare-metal servers and VMs, many of which we build and maintain in-house. As with Kubernetes, the code for these services is covered by the above-mentioned additional tooling, which easily lets us update dependencies. Generally, for services running on VMs and bare-metal servers, we plan a release a few days after the updates have gone through our QA process. Deployments for these services occur when required, and historically have occurred 2-3 times per month for each service.

At the code and dependency level, all of these systems receive, and have received, regular updates. We regret that we failed to uphold the same standard across the entirety of our infrastructure.

Thank you for the information! I think that all makes sense and I don't really have other questions here. I do think that with the changes you're adding, and with the final action item to regularly check on the automation, the chance of recurrence here is minimal.

Completely off topic, I'm excited to hear about the use of containers & k8s in the infrastructure. Maybe your team can write up a technical blog post at some point :)?

I do not have any further questions here.

(In reply to amir from comment #3)

Completely off topic, I'm excited to hear about the use of containers & k8s in the infrastructure. Maybe your team can write up a technical blog post at some point :)?

While we cannot currently commit to this, we will take it into consideration.

We have now also updated our internal processes so that any machine taken out of commission is also properly removed from Tenable. This resolves the corresponding action item.

Ben, as we have 2 action items pending with a due date of January 31st, we would like to request a next-update for that date.

Flags: needinfo?(bwilson)
Flags: needinfo?(bwilson)
Whiteboard: [ca-compliance] [policy-failure] → [ca-compliance] [policy-failure] Next update 2024-01-31

Both action items have been completed on time.

Are there any further questions and/or comments on this bug?

I will close this on or about Friday, 2-Feb-2024.

Flags: needinfo?(bwilson)
Whiteboard: [ca-compliance] [policy-failure] Next update 2024-01-31 → [ca-compliance] [policy-failure]
Status: ASSIGNED → RESOLVED
Closed: 7 months ago
Flags: needinfo?(bwilson)
Resolution: --- → FIXED