Closed Bug 1602999 Opened 2 years ago Closed 2 years ago

Microsoft: Loss of Archived Firewall logs from Retention Store

Categories

(NSS :: CA Certificate Compliance, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mohanr, Assigned: mohanr)

Details

(Whiteboard: [ca-compliance])

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3974.0 Safari/537.36 Edg/80.0.345.0

Steps to reproduce:

  1. How your CA first became aware of the problem
    We lost approximately nine and a half months of archived firewall syslog data from the retention store of our monitoring and retention platform. Although the certificates issued during the timeframe of the loss were test certificates that have since expired, we decided to report the issue upon discovering it. The problem was noticed while reviewing enhancements to our monitoring and retention platform.

  2. A timeline of actions your CA took in response:

The problem was detected on Sept 12th 2019, and the misconfiguration was corrected within a couple of hours on the same day. We then added continuous monitoring and alerting to the platform so that any configuration drift or potential loss is detected immediately and can be corrected. The alerting system has been in place since Oct 7th 2019.
Timeline:

  • Problem Detection: Sept 12th 2019 ~10:30 AM
  • Engineering Engaged & Diagnosis: Sept 12th 2019 ~1:15 PM
  • Mitigation Implemented (configuration corrected): Sept 12th 2019 ~2:15 PM
  • Continuous Monitoring & Alerting Implemented: Oct 7th 2019
  3. Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem.

This problem had no impact on the issuing CAs or issued certificates.

  4. A summary of the problematic certs.

As noted above, there are no problem certificates to report.

  5. The complete certificate data for the problematic certificates.

As noted above, there are no problem certificates to report.

  6. Explanation about how the mistakes were made or bugs introduced, and how they avoided detection until now.
    The issue was caused by human error: incorrect parameters were set for the archival of firewall logs in the retention part of the monitoring and retention platform. The problem was not noticed earlier because periodic requests for evidence related to firewall logs happened to be fulfilled from the data stored in the monitoring part of the platform instead of being sourced from the retention part of the platform.

  7. List of steps the CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.

The two steps we have taken are:
a) Immediately corrected the misconfigured parameters in the retention part of the platform: Completed, Sept 12th 2019
b) Set up continuous monitoring and alerting on the retention part of the platform to alert on any potential loss of archived firewall logs or changes in the retention window configuration: Completed, Oct 7th 2019

Assignee: wthayer → mohanr
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Whiteboard: [ca-compliance]
Type: defect → task
Summary: Incident Report: Microsoft Loss of Archived Firewall logs from Retention Store → Microsoft: Loss of Archived Firewall logs from Retention Store

Thank you for the incident report. Questions:

  • Which Microsoft CA certificate(s) did this affect?
  • Why was this incident not reported until two months after it occurred?
  • Was a root cause analysis performed to determine what conditions led to the human error? Has anything been done to prevent future errors of this nature?
Flags: needinfo?(mohanr)

Answers to your questions:

Which Microsoft CA certificate(s) did this affect?
The following CAs were active during the time of the loss of the firewall logs. As noted earlier, although the archived firewall logs were lost, the system security events and firewall logs were being continuously received and monitored 24x7 by our dedicated security team, as has always been the case.

Microsoft TLS Issuing CA 01
Microsoft TLS ECC Issuing CA 01
Microsoft TLS EV Issuing CA 01
Microsoft TLS ECC EV Issuing CA 01
Microsoft TLS Issuing CA 05
Microsoft TLS ECC Issuing CA 05
Microsoft TLS EV Issuing CA 05
Microsoft TLS ECC EV Issuing CA 05
Microsoft Server PCA 2018
Microsoft ECC Server PCA 2018
Microsoft EV Server PCA 2018
Microsoft EV ECC Server PCA 2018
Microsoft RSA Root Certificate Authority 2017
Microsoft ECC Root Certificate Authority 2017
Microsoft EV RSA Root Certificate Authority 2017
Microsoft EV ECC Root Certificate Authority 2017

Why was this incident not reported until two months after it occurred?

During the two-month window, we attempted to find ways to recover the archived logs from systems managed by internal and partner security teams within our organizations, developed and implemented mitigations, performed root cause analysis, held leadership reviews, etc. Once it was determined that the data was not recoverable, we self-reported it to our auditors and reported it to this forum.

Was a root cause analysis performed to determine what conditions led to the human error? Has anything been done to prevent future errors of this nature?
Yes, the root cause was determined. The gap identified was that we needed continuous monitoring and alerting not only on the log retention data but also on the configuration parameters of the system, such as the retention period. This would enable us to detect and recover in short order even if a human error occurs in the future. As part of remediation, we have implemented an automated monitoring and alerting solution that continuously monitors the log storage configuration and alerts on any anomalies in the log data or changes to the configuration of the retention system. In addition, we have added a manual check to validate the state of the log retention as part of our internal periodic audit process.
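
For illustration only, the two checks described above (comparing the live retention configuration against an approved baseline, and looking for gaps in the archived data) might be sketched as follows. This is not the monitoring solution referred to in this report; `RetentionConfig`, its fields, and both function names are hypothetical, and a real deployment would read the configuration and the archive inventory from the platform itself.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class RetentionConfig:
    """Hypothetical retention settings; field names are illustrative only."""
    retention_days: int
    archive_enabled: bool


def config_drift(expected: RetentionConfig, actual: RetentionConfig) -> list[str]:
    """Return one message per setting that differs from the approved baseline."""
    findings = []
    if actual.retention_days != expected.retention_days:
        findings.append(
            f"retention_days is {actual.retention_days}, "
            f"expected {expected.retention_days}"
        )
    if actual.archive_enabled != expected.archive_enabled:
        findings.append(
            f"archive_enabled is {actual.archive_enabled}, "
            f"expected {expected.archive_enabled}"
        )
    return findings


def missing_archive_days(archived: set[date], start: date, end: date) -> list[date]:
    """Return every day in [start, end] for which no archived log exists."""
    total_days = (end - start).days + 1
    all_days = (start + timedelta(days=n) for n in range(total_days))
    return [day for day in all_days if day not in archived]
```

An alert would fire whenever either function returns a non-empty list. Checking the configuration separately from the data matters here because a shortened retention window does not produce an immediate gap; the loss only appears later, when older archives are purged.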

Flags: needinfo?(mohanr)

Not sure why the text font changed to bold and large in the middle part of the response, but it was unintentional.

It appears that all questions have been answered and remediation is complete.

Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED