Closed Bug 1663953 Opened 4 years ago Closed 3 years ago

TunTrust: OCSP unreachable

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: pki, Assigned: pki)

Details

(Whiteboard: [ca-compliance] [ocsp-failure])

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36

Steps to reproduce:

Deleted the OCSP server from the vCenter.

Actual results:

The OCSP server was deleted and therefore unreachable.

Expected results:

The server and its backup should not have been deleted.

Hello,

I represent the “Agence Nationale de Certification Electronique” (in Tunisia) and we currently have an assigned Bug (Bug 1587779) for the inclusion of our Root CA in the Mozilla Root Store.
We faced an incident that was detected yesterday, September 8th, 2020, at 3 pm (GMT) by our internal administrators. The main issue is that our OCSP server was unreachable for about 20 hours starting from that time, and the online revocation service on our website was not available for 20 hours (other methods of revocation were still available, so we can firmly assert that there were no revocation delays during this period). We will submit a more detailed report as soon as possible, hopefully within 24 hours, in accordance with the best practices set out in https://wiki.mozilla.org/CA/Responding_To_An_Incident .

Best regards,
Agence Nationale de Certification Electronique - TunTrust
https://www.tuntrust.tn

Assignee: bwilson → pki
Status: UNCONFIRMED → ASSIGNED
Type: defect → task
Ever confirmed: true
Whiteboard: [ca-compliance]

Note: All the times mentioned in this document are in CET time zone (GMT + 1).

Summary of the incident:
The following services were unavailable:

  • The OCSP server related to “TunTrust Root CA” and “TunTrust Services CA” was intermittently unavailable from 3:18pm on September 8th to 1:30pm on September 10th, 2020;
  • The CRLs of “TunTrust Root CA” and “TunTrust Services CA” were unavailable for about 20 hours (from 3:18pm on September 8th to 12pm on September 9th, 2020).

1. How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.

08 September 2020
03:19pm: The monitoring system alerted the system administrators that the OCSP server and the CRL repository were in status DOWN. The administrators responded to the alert by investigating the cause of the issue.
03:34pm: The root cause was identified: the system administrator had made an error while executing a script related to the automation of the patch management process. A configuration step was missing from the set of actions taken, which resulted in the inadvertent deletion of the OCSP and CRL VMs.
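
For illustration only, below is a minimal sketch (in Python, using the requests and cryptography libraries) of the kind of external availability probe a monitoring system like the one described above might run against an OCSP responder and a CRL repository. The responder URL, CRL URL and certificate paths are placeholders, not our actual endpoints.

# Minimal external availability probe for an OCSP responder and a CRL
# distribution point, of the kind a monitoring system might run every
# few minutes. URLs and certificate paths below are placeholders.
import requests
from cryptography import x509
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.x509 import ocsp

OCSP_URL = "http://ocsp.example.test"        # placeholder responder URL
CRL_URL = "http://crl.example.test/ca.crl"   # placeholder CRL URL
CERT_PEM = "subject.pem"                     # placeholder: a certificate issued by the CA
ISSUER_PEM = "issuer.pem"                    # placeholder: the issuing CA certificate

def check_ocsp() -> bool:
    """Return True if the responder answers with a SUCCESSFUL OCSP response."""
    with open(CERT_PEM, "rb") as f:
        cert = x509.load_pem_x509_certificate(f.read())
    with open(ISSUER_PEM, "rb") as f:
        issuer = x509.load_pem_x509_certificate(f.read())
    request_der = (
        ocsp.OCSPRequestBuilder()
        .add_certificate(cert, issuer, hashes.SHA1())
        .build()
        .public_bytes(serialization.Encoding.DER)
    )
    try:
        resp = requests.post(
            OCSP_URL,
            data=request_der,
            headers={"Content-Type": "application/ocsp-request"},
            timeout=10,
        )
        parsed = ocsp.load_der_ocsp_response(resp.content)
        return parsed.response_status == ocsp.OCSPResponseStatus.SUCCESSFUL
    except (requests.RequestException, ValueError):
        return False

def check_crl() -> bool:
    """Return True if the CRL downloads and parses as DER."""
    try:
        resp = requests.get(CRL_URL, timeout=10)
        x509.load_der_x509_crl(resp.content)
        return True
    except (requests.RequestException, ValueError):
        return False

if __name__ == "__main__":
    for name, ok in (("OCSP", check_ocsp()), ("CRL", check_crl())):
        print(f"{name}: {'UP' if ok else 'DOWN'}")  # a real monitor would page on DOWN

A real deployment would run such checks from several external networks and alert the on-call administrators after consecutive failures.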

2. A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.

08 September 2020, 3:19pm: The monitoring system alerts the system administrators that the OCSP server and the CRL repository are in status DOWN.
08 September 2020, 3:34pm: Automated patch management console powered off for offline troubleshooting. Certificate issuance stopped.
09 September 2020, 12pm: CRL repository restored and external-facing CRL services are fully operational. Our revocation request monitoring shows that no revocation request was received during the downtime of the CRL and OCSP services.
09 September 2020, 4:10pm: Post initial problem to Bugzilla.
10 September 2020, 1:30am: OCSP server restored and external-facing OCSP services are fully operational.
11 September 2020, 8:00am: Certificate issuance resumed.

3. Whether your CA has stopped, or has not yet stopped, certificate issuance or the process giving rise to the problem or incident. A statement that you have stopped will be considered a pledge to the community; a statement that you have not stopped requires an explanation.

Certificate issuance was stopped on September 8th at 3:34pm and resumed on September 11th, 2020 at 8am. In addition, our monitoring and audit logging processes confirmed that no certificate was issued during the entire downtime of the CRL and OCSP services.

4. In a case involving certificates, a summary of the problematic certificates. For each problem: the number of certificates, and the date the first and last certificates with that problem were issued. In other incidents that do not involve enumerating the affected certificates (e.g. OCSP failures, audit findings, delayed responses, etc.), please provide other similar statistics, aggregates, and a summary for each type of problem identified. This will help us measure the severity of each problem.

There was no certificate mis-issuance. We also verified that no certificate revocation requests were processed during the CRL and OCSP downtime and that no revocation requests were received.

5. In a case involving certificates, the complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem. In other cases not involving a review of affected certificates, please provide other similar, relevant specifics, if any.

This case does not affect the issuance of certificates; it affects the OCSP responder and CRLs related to “TunTrust Services CA” and “TunTrust Root CA”.

6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

The system administrator made an error while executing the script related to the automation of the patch management process. A configuration step was missing from the set of actions taken, which resulted in the inadvertent deletion of the OCSP and CRL VMs.

7. List of steps your CA is taking to resolve the situation and ensure that such situation or incident will not be repeated in the future, accompanied with a binding timeline of when your CA expects to accomplish each of these remediation steps.

Sept 10th 2020: Reverting to the legacy patch management process included in the scope of our prior and current ETSI and WebTrust audits. Deferring the automation of the patch management process considering our priorities and available resources.

I want to commend you for filing an incident report even as your inclusion request is in progress. This is exactly the kind of proactive communication that reflects positively on a CA: even though an incident happened, it provides an opportunity to learn.

That said, I'd like to push for a deeper evaluation of root causes. I think you've identified a proximate cause of where things went wrong, but it sounds like there are still opportunities to better understand what happened and why.

That is:

The system administrator made an error while executing the script related to the automation of the patch management process. A configuration step was missing from the set of actions taken, which resulted in the inadvertent deletion of the OCSP and CRL VMs.

is clear in describing what went wrong, but it doesn't look at what factors led to this being possible. Human error is rarely a "root cause"; such errors are often a product of systems and designs that allow them to have disproportionate impact.

There are various examples of how to do this, including examples of bad vs. good. To help highlight this, here are some of the questions to answer when looking for root causes and a resolution:

  • Why was it possible to delete VMs in the first place?
  • Why did it take 21 hours to recover CRLs?
  • Why did it take nearly 36 hours to recover OCSP?
  • Why did the configuration script error go undetected?

These are a few of the questions to unpack. Understandably, it is unfortunate that the system administrator made an error; the question, in trying to get to a root cause, is to understand what turned a regrettable mistake into a service outage, and what can be done to address those risks. I understand that "don't try automation yet" is an approach, but it avoids the problems rather than trying to address and resolve them.

Flags: needinfo?(pki)

Please find below our answers:

1. Why was it possible to delete VMs in the first place?

The patch management solution mentioned in this incident report is RedHat Satellite Server. This server needs an administrator-like account with privileges to fully manage virtual machines, as described in the RedHat configuration guide (https://access.redhat.com/documentation/en-us/red_hat_satellite/6.7/html/provisioning_guide/provisioning_virtual_machines_in_vmware_vsphere ).
The creation of this account was approved according to the internal change management process.
Please refer to question 3 for further details.

2. Why did it take 21 hours to recover CRLs? Why did it take nearly 36 hours to recover OCSP?

We have an active-active disaster recovery failover configuration covering our primary and secondary sites. Both sites are in an active mode, and if the primary site becomes unavailable, our online services will continue to operate through the secondary site to ensure business continuity. In this configuration, all the hosts belong to the same virtualization cluster.
The patch management solution was set up to operate on the virtualization cluster encompassing virtual machines in the primary and the secondary site. The inadvertent delete operation from the patch management Web UI affected virtual machines on the primary site and the secondary site at the same time.
Therefore, we used our offline backups to recover OCSP and CRL services. The following recovery steps were undertaken:
- Creating another virtual machine,
- Retrieving offline backups from the DR Site,
- Restoring the databases,
- Configuring the new virtual machines: hostname, network configuration, OS hardening, checking the vulnerabilities and installed patches, configuring the real time logging, monitoring and alerting, etc.

3. Why did the configuration script error go undetected?

In order to configure automated patch management, our servers had to be unregistered (unsubscribed from global RedHat support).
The RedHat Satellite WebUI allows hosts to be unregistered in two places:

  • 'Hosts --> All Hosts': In this case, a proper warning pops up explaining the impact of this action (irreversible deletion of the hosts and linked VMs);
  • 'Hosts --> Content Hosts': In this case, a pop-up window appears asking to confirm the action without explaining its impact; the window does not describe what is going to happen.

The latter path is the one our administrator used, and thus he was not aware of the impact of unregistering the hosts (which caused the inadvertent destruction of the OCSP and CRL virtual machines). The default behavior of the patch management solution is to delete virtual machines upon unregistration (please refer to the bug https://bugzilla.redhat.com/show_bug.cgi?id=1804669 for further details).
We reported this incident to the RedHat support team and they opened a high-severity bug to deal with this critical behavior within RedHat Satellite (please refer to https://bugzilla.redhat.com/show_bug.cgi?id=1882371 ).

The following action plan was produced to deal with this incident:

  • Reverting to the legacy patch management process included in the scope of our prior and current ETSI and WebTrust audits.
  • Deferring the automation of the patch management process based on available resources (This is a mid-term action because of health measures against the coronavirus impacting working hours and physical presence of employees).
  • Contracting with a third party service provider for assistance with the configuration of the automated patch management process. (This is a mid-term action)
  • Reviewing the disaster recovery (DR) configuration by adding an Active-Passive DR configuration in addition to the Active-Active DR already in place. (This is a long term action because it depends on the budget)
  • Keeping offline backups in the main site in addition to the offline backups in the Disaster Recovery Site (This is a short term action).
  • Having additional training for the technical team (This is a mid-term action).
Flags: needinfo?(pki)

(In reply to Agence Nationale de Certification Electronique from comment #4)

2. Why did it take 21 hours to recover CRLs? Why did it take nearly 36 hours to recover OCSP?

We have an active-active disaster recovery failover configuration covering our primary and secondary sites. Both sites are in an active mode, and if the primary site becomes unavailable, our online services will continue to operate through the secondary site to ensure business continuity. In this configuration, all the hosts belong to the same virtualization cluster.
The patch management solution was set up to operate on the virtualization cluster encompassing virtual machines in the primary and the secondary site. The inadvertent delete operation from the patch management Web UI affected virtual machines on the primary site and the secondary site at the same time.

This seems like an operational risk, and I'm not sure I saw clear steps for managing it. That is, you've got the risk that a single operation or error can take both your primary and your secondary sites offline (as we see here). I suspect there are some benefits from this approach (e.g. a common config for both sites), but there are also risks, and I'm hoping you can explain better how that's being managed here for situations like this.

    - Configuring the new virtual machines: hostname, network configuration, OS hardening, checking the vulnerabilities and installed patches, configuring the real time logging, monitoring and alerting, etc.

This is a really surprising aspect of the system design. The fact that you had to redo configuration and OS hardening/vulnerability scanning suggests that the configuration "is what's running", based on an assumption that it's only changed by authorized administrators, as opposed to being something that is defined declaratively/statically configured (i.e. a DevOps-like approach).

To highlight and explain the concern: what would prevent a system administrator from accessing a VM, installing or modifying some aspect of the configuration, and then deleting the audit logs for that? How would such an action be detected? This is particularly relevant if we substitute "attacker" for "system administrator".

It may be that these steps are already scripted/automated and maintained, but as a former system admin, I certainly get concerned when I see non-blessed-image-built-from-script approaches to systems management.

3. Why did the configuration script error go undetected?

I just wanted to say thanks for the depth and quality of the response here, and explicitly acknowledge it as exactly the kind of thing we look for in these incident reports.

The following action plan was produced to deal with this incident:

  • Reverting to the legacy patch management process included in the scope of our prior and current ETSI and WebTrust audits.
  • Deferring the automation of the patch management process based on available resources (This is a mid-term action because of health measures against the coronavirus impacting working hours and physical presence of employees).
  • Contracting with a third party service provider for assistance with the configuration of the automated patch management process. (This is a mid-term action)
  • Reviewing the disaster recovery (DR) configuration by adding an Active-Passive DR configuration in addition to the Active-Active DR already in place. (This is a long term action because it depends on the budget)
  • Keeping offline backups in the main site in addition to the offline backups in the Disaster Recovery Site (This is a short term action).
  • Having additional training for the technical team (This is a mid-term action).

Note that one of the principles captured in https://wiki.mozilla.org/CA/Responding_To_An_Incident is trying to get a timeline for these. Inevitably, these will be discussed in the inclusion request, but having timelines helps provide a better understanding of the priorities and resourcing of the CA, which is a critical component of trust, as well as providing objective timelines to follow up on. With that said, do you have a sense of timelines, now that we're three months later?

Flags: needinfo?(pki)

(In reply to Comment #5)

Q1:

2. Why did it take 21 hours to recover CRLs? Why did it take nearly 36 hours to recover OCSP?
We have an active-active disaster recovery failover configuration covering our primary and secondary sites. Both sites are in an active mode, and if the primary site becomes unavailable, our online services will continue to operate through the secondary site to ensure business continuity. In this configuration, all the hosts belong to the same virtualization cluster.
The patch management solution was set up to operate on the virtualization cluster encompassing virtual machines in the primary and the secondary site. The inadvertent delete operation from the patch management Web UI affected virtual machines on the primary site and the secondary site at the same time.

This seems like an operational risk, and I'm not sure I saw clear steps for managing it. That is, you've got the risk that a single operation or error can take both your primary and your secondary sites offline (as we see here). I suspect there are some benefits from this approach (e.g. a common config for both sites), but there are also risks, and I'm hoping you can explain better how that's being managed here for situations like this.

A1:
This is indeed an operational risk and the main steps that were taken for managing this risk (before the incident) are as follows:

  • According to our backup plan, we keep full backups of databases, configuration files and logs in two servers (one in the primary site and the other one in the secondary site), all encrypted and signed.
  • Also according to our backup plan, we keep encrypted and signed backups of databases, configuration files and logs in a safe in the secondary site.
  • We have a Business Recovery Plan that details all steps that need to be taken to restore VMs and data.

After the incident, we updated the Backup Plan so that we now keep full backups of all virtual machines (vmdk files) in a storage array. When exporting virtual machines, the vCenter automatically creates mf, vmdk and ovf files. The mf file contains sha256 fingerprints of the vmdk files. We keep copies of these mf files on external media in a safe in a high security zone. Before restoring a VM, the mf file is retrieved in order to detect whether any changes have been made to the VM image being restored. Also, any changes to the backup data in the storage array are automatically captured and sent to a central monitoring service, which automatically issues alerts in real time.
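
For illustration only, the following is a minimal sketch (in Python) of this kind of pre-restore integrity check, assuming the standard OVF manifest format in which each line reads SHA256(filename)= digest. The manifest and backup paths are placeholders.

# Sketch of a pre-restore integrity check: compare the SHA-256 fingerprints
# recorded in an OVF .mf manifest against the .vmdk/.ovf files retrieved
# from the backup array. Paths below are placeholders.
import hashlib
import re
from pathlib import Path

MANIFEST = Path("/safe/offline-media/ocsp-vm.mf")   # placeholder: copy kept on external media
BACKUP_DIR = Path("/backup-array/ocsp-vm")          # placeholder: exported VM directory

LINE_RE = re.compile(r"SHA256\((?P<name>[^)]+)\)\s*=\s*(?P<digest>[0-9a-fA-F]{64})")

def sha256_of(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

def verify() -> bool:
    ok = True
    for line in MANIFEST.read_text().splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        target = BACKUP_DIR / m.group("name")
        if not target.exists():
            print(f"MISSING: {target.name}")        # a real process would alert and abort
            ok = False
            continue
        if sha256_of(target) != m.group("digest").lower():
            print(f"MISMATCH: {target.name}")       # a real process would alert and abort
            ok = False
    return ok

if __name__ == "__main__":
    print("manifest verified" if verify() else "restore blocked: manifest mismatch")
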
In addition, all administrator accounts have been disabled in the vCenter, and the password of the only account capable of deleting and adding VMs has been divided into two parts, so that access to the vCenter management interface requires two trusted roles to be present at the same time.
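
As an illustration of the split-knowledge principle only (not a description of the exact mechanism we use), the following sketch shows how a credential can be divided into two shares such that neither trusted role alone can reconstruct it:

# Illustrative sketch of split knowledge (dual control) for a shared
# credential: the password is split into two shares so that neither
# trusted role alone can reconstruct it. Illustration of the principle
# only; this is not TunTrust's actual mechanism.
import secrets

def split(secret: bytes) -> tuple[bytes, bytes]:
    """XOR-split: share_a is random, share_b = secret XOR share_a."""
    share_a = secrets.token_bytes(len(secret))
    share_b = bytes(s ^ a for s, a in zip(secret, share_a))
    return share_a, share_b

def combine(share_a: bytes, share_b: bytes) -> bytes:
    """Both shares are required; either one alone is indistinguishable from random."""
    return bytes(a ^ b for a, b in zip(share_a, share_b))

if __name__ == "__main__":
    password = b"vcenter-admin-passphrase"   # placeholder secret
    a, b = split(password)                   # give one share to each trusted role
    assert combine(a, b) == password
    print("both shares present: access granted")

An XOR split has the property that either share alone is statistically independent of the secret, which is what the two-person rule relies on.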

Q2:

  • Configuring the new virtual machines: hostname, network configuration, OS hardening, checking the vulnerabilities and installed patches, configuring the real time logging, monitoring and alerting, etc.
    This is a really surprising aspect of the system design. The fact that you had to redo configuration and OS hardening/vulnerability scanning suggests that the configuration "is what's running", based on an assumption that it's only changed by authorized administrators, as opposed to being something that is defined declaratively/statically configured (i.e. a DevOps-like approach).

To highlight and explain the concern: what would prevent a system administrator from accessing a VM, installing or modifying some aspect of the configuration, and then deleting the audit logs for that? How would such an action be detected? This is particularly relevant if we substitute "attacker" for "system administrator".
It may be that these steps are already scripted/automated and maintained, but as a former system admin, I certainly get concerned when I see non-blessed-image-built-from-script approaches to systems management.

A2:
Prior to the incident, we had a RedHat template VM that was pre-configured for use in this kind of situation. Almost all of the configuration steps are applied to this VM. However, some configuration steps are specific to each host, such as the iptables configuration, the IDS configuration and signature keys. As for the OS hardening and vulnerability scanning, even though we keep the template VM up to date, we did not want to take the risk of having an unpatched VM in the production environment.

After the incident, we updated the Business Recovery Plan so that we now keep an exhaustive backup of our VMs (vmdk files) in an offline storage array (as described above), in order to reduce the RTO after an incident, since the host-specific configuration is already kept in those backups.
Configuration changes to systems, including the management interface and the systems that contain backups, are automatically captured and sent to a central monitoring service to determine whether any changes violated our security policies. Alerts are issued automatically and in real time when any changes are detected. The alerts detail the changes made and the account that made that change.
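
For illustration only, the following minimal sketch (in Python) shows the general idea behind this kind of configuration-change detection: monitored files are hashed, compared against a recorded baseline, and any difference produces an alert. The file list, baseline location and alert transport are placeholders.

# Sketch of configuration-change detection: hash a set of monitored
# configuration files, compare against a recorded baseline, and emit an
# alert for every difference. Paths and file list are placeholders.
import hashlib
import json
from pathlib import Path

BASELINE = Path("/var/lib/config-monitor/baseline.json")        # placeholder baseline store
MONITORED = [Path("/etc/hosts"), Path("/etc/ssh/sshd_config")]  # placeholder file list

def fingerprint(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest() if path.exists() else "MISSING"

def current_state() -> dict[str, str]:
    return {str(p): fingerprint(p) for p in MONITORED}

def check() -> list[str]:
    baseline = json.loads(BASELINE.read_text()) if BASELINE.exists() else {}
    alerts = []
    for path, digest in current_state().items():
        if baseline.get(path) != digest:
            alerts.append(f"configuration change detected: {path}")
    return alerts

if __name__ == "__main__":
    for alert in check():
        print(alert)  # a real deployment would forward this to the central monitoring service
    BASELINE.parent.mkdir(parents=True, exist_ok=True)
    # Record the new baseline only after the changes have been reviewed and approved.
    BASELINE.write_text(json.dumps(current_state(), indent=2))
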
In addition to that, the password of the account administering the monitoring system is split into two parts, so that access to this system requires two trusted roles to be present at the same time.

Q3:

Note that one of the principles captured in https://wiki.mozilla.org/CA/Responding_To_An_Incident is trying to get a timeline for these. Inevitably, these will be discussed in the inclusion request, but having timelines helps provide a better understanding of the priorities and resourcing of the CA, which is a critical component of trust, as well as providing objective timelines to follow up on. With that said, do you have a sense of timelines, now that we're three months later?

A3:
Please find below the timelines related to the action plan that was produced to deal with this incident:

1. Reverting to the legacy patch management process included in the scope of our prior and current ETSI and WebTrust audits.
   Status: Done
2. Deferring the automation of the patch management process based on available resources. (This is a mid-term action because of health measures against the coronavirus impacting working hours and physical presence of employees)
   Status: June 2021
3. Contracting with a third party service provider for assistance with the configuration of the automated patch management process. (This is a mid-term action)
   Status: June 2021
4. Reviewing the disaster recovery (DR) configuration by adding an Active-Passive DR configuration in addition to the Active-Active DR already in place. (This is a long-term action because it depends on the budget)
   Status: June 2021
   Observations: We updated the backup plan and thus currently keep full backups of VMs (vmdk files), databases, configuration files and logs (as described in A1 and A2). In addition to those actions, we are in the process of acquiring a fully automated backup and replication solution (VEEAM). The contract was signed with the service provider on November 30th, 2020, and we are currently waiting for the delivery of all the components, scheduled for April 2021.
5. Keeping offline backups in the main site in addition to the offline backups in the Disaster Recovery Site. (This is a short-term action)
   Status: Done
6. Having additional training for the technical team. (This is a mid-term action)
   Status: June 2021
   Observations: The training is part of the contract for acquiring a fully automated backup and replication solution (VEEAM), which was signed on November 30th, 2020. The training is scheduled just after the implementation of this solution.
Flags: needinfo?(pki)

I am setting a "Next Update" for 1-April-2021. Meanwhile, if you make progress on items 2, 3, 4 and 6 (set for June 2021) before then, please update your status here. Thanks.

Whiteboard: [ca-compliance] → [ca-compliance] Next Update 2021-04-01

This is just a reminder that a status update is due next week.

(In reply to Ben Wilson from comment #7 and comment #8)

Please find below the update regarding the action plan that was produced to deal with this incident:

1. Reverting to the legacy patch management process included in the scope of our prior and current ETSI and WebTrust audits.
   Status: Done (right after the incident)
2. Deferring the automation of the patch management process based on available resources. (This is a mid-term action because of health measures against the coronavirus impacting working hours and physical presence of employees)
   Status: Ongoing (scheduled to end by June 30th, 2021)
   Observations: The call-for-tenders process is done. The contract with the service provider will be signed during the first week of April 2021. The implementation of the automated patch management is scheduled to be completed by June 30th, 2021.
3. Contracting with a third party service provider for assistance with the configuration of the automated patch management process. (This is a mid-term action)
   Status: Ongoing (scheduled to end by June 30th, 2021)
   Observations: Please refer to the observations for action #2.
4. Reviewing the disaster recovery (DR) configuration by adding an Active-Passive DR configuration in addition to the Active-Active DR already in place. (This is a long-term action because it depends on the budget)
   Status: Done
   Observations: We now have an Active-Passive DR configuration in addition to the Active-Active DR configuration already in place. The contract for acquiring a fully automated backup and replication solution (VEEAM) was signed with the service provider on November 30th, 2020. The components were delivered on February 9th, 2021. We then installed the solution in our test environment and went through all necessary tests. Afterwards, we moved the solution to the production environment (March 18th, 2021). All the actions taken were approved and implemented according to our change management procedure. We also updated our internal documents, including our business continuity plan, our backup plan and our backup procedure. We are keeping the current backup process along with the new one: this is a temporary measure to provide mitigating controls, and the decision was taken based on a risk assessment. We intend to withdraw the older backup process within a two-month period. However, our updated procedures are clear that restoration will be made using the new backup solution. On March 26th, 2021, our WebTrust auditors came on site to independently assess the progress of the project and the implementation of the action plan as part of their audit procedures.
5. Keeping offline backups in the main site in addition to the offline backups in the Disaster Recovery Site. (This is a short-term action)
   Status: Done (September 30th, 2020)
6. Having additional training for the technical team. (This is a mid-term action)
   Status: Ongoing (scheduled to end in June 2021)
   Observations: The training for our administrators on the new backup solution is divided into two parts: a skill transfer («train-the-trainer») from the service provider's team to our system administrators, and a vendor training session provided by the service provider for our system administrators and PKI administrators. The skill transfer started in February and covered configuration, testing and putting into production. The training session is scheduled from April 12th to April 16th, 2021. We will provide an update once the training is delivered. A support contract has also been signed with the service provider to provide us with assistance with the backup solution for three years.
Whiteboard: [ca-compliance] Next Update 2021-04-01 → [ca-compliance] Next Update 2021-06-15

This comment is to provide an update regarding action #2 of the action plan that was produced to deal with this incident:

1. Reverting to the legacy patch management process included in the scope of our prior and current ETSI and WebTrust audits.
   Status: Done (right after the incident)
2. Deferring the automation of the patch management process based on available resources. (This is a mid-term action because of health measures against the coronavirus impacting working hours and physical presence of employees)
   Status: Ongoing (scheduled to end in June 2021)
   Observations: The call-for-tenders process is done. The contract with the service provider was signed on April 6th, 2021. The implementation of the automated patch management is scheduled to be completed by June 30th, 2021.
3. Contracting with a third party service provider for assistance with the configuration of the automated patch management process. (This is a mid-term action)
   Status: Ongoing (scheduled to end by June 30th, 2021)
   Observations: Please refer to action #2.
4. Reviewing the disaster recovery (DR) configuration by adding an Active-Passive DR configuration in addition to the Active-Active DR already in place. (This is a long-term action because it depends on the budget)
   Status: Done
   Observations: We now have an Active-Passive DR configuration in addition to the Active-Active DR configuration already in place. The contract for acquiring a fully automated backup and replication solution (VEEAM) was signed with the service provider on November 30th, 2020. The components were delivered on February 9th, 2021. We then installed the solution in our test environment and went through all necessary tests. Afterwards, we moved the solution to the production environment (March 18th, 2021). All the actions taken were approved and implemented according to our change management procedure. We also updated our internal documents, including our business continuity plan, our backup plan and our backup procedure. We are keeping the current backup process along with the new one: this is a temporary measure to provide mitigating controls, and the decision was taken based on a risk assessment. We intend to withdraw the older backup process within a two-month period. However, our updated procedures are clear that restoration will be made using the new backup solution. On March 26th, 2021, our WebTrust auditors came on site to independently assess the progress of the project and the implementation of the action plan as part of their audit procedures.
5. Keeping offline backups in the main site in addition to the offline backups in the Disaster Recovery Site. (This is a short-term action)
   Status: Done (September 30th, 2020)
6. Having additional training for the technical team. (This is a mid-term action)
   Status: Ongoing (scheduled to end in June 2021)
   Observations: The training for our administrators on the new backup solution is divided into two parts: a skill transfer («train-the-trainer») from the service provider's team to our system administrators, and a vendor training session provided by the service provider for our system administrators and PKI administrators. The skill transfer started in February and covered configuration, testing and putting into production. The training session is scheduled from April 12th to April 16th, 2021. We will provide an update once the training is delivered. A support contract has also been signed with the service provider to provide us with assistance with the backup solution for three years.

Please find below the update regarding the action plan that was produced to deal with this incident (actions #2, #3, #6 and #7). Please note that we have split action #6 into two rows (hence row #7).

1. Reverting to the legacy patch management process included in the scope of our prior and current ETSI and WebTrust audits.
   Status: Done (right after the incident)
2. Deferring the automation of the patch management process based on available resources. (This is a mid-term action because of health measures against the coronavirus impacting working hours and physical presence of employees)
   Status: Ongoing (scheduled to end in June 2021)
   Observations: The contract with the service provider was signed on April 6th, 2021. The kick-off meeting took place on April 20th, 2021.
3. Contracting with a third party service provider for assistance with the configuration of the automated patch management process. (This is a mid-term action)
   Status: Ongoing (scheduled to end by June 30th, 2021)
   Observations: Please refer to the observations for action #2.
4. Reviewing the disaster recovery (DR) configuration by adding an Active-Passive DR configuration in addition to the Active-Active DR already in place. (This is a long-term action because it depends on the budget)
   Status: Done
   Observations: We now have an Active-Passive DR configuration in addition to the Active-Active DR configuration already in place. The contract for acquiring a fully automated backup and replication solution (VEEAM) was signed with the service provider on November 30th, 2020. The components were delivered on February 9th, 2021. We then installed the solution in our test environment and went through all necessary tests. Afterwards, we moved the solution to the production environment (March 18th, 2021). All the actions taken were approved and implemented according to our change management procedure. We also updated our internal documents, including our business continuity plan, our backup plan and our backup procedure. We are keeping the current backup process along with the new one: this is a temporary measure to provide mitigating controls, and the decision was taken based on a risk assessment. We intend to withdraw the older backup process within a two-month period. However, our updated procedures are clear that restoration will be made using the new backup solution. On March 26th, 2021, our WebTrust auditors came on site to independently assess the progress of the project and the implementation of the action plan as part of their audit procedures.
5. Keeping offline backups in the main site in addition to the offline backups in the Disaster Recovery Site. (This is a short-term action)
   Status: Done (September 30th, 2020)
6. Having additional training for the technical team (on the new backup solution).
   Status: Done
   Observations: The training for our administrators on the new backup solution is divided into two parts: a skill transfer («train-the-trainer») from the service provider's team to our system administrators, and a vendor training session provided by the service provider for system administrators and PKI administrators. The skill transfer started in February and covered configuration, testing and putting into production. The internal training session took place from April 12th to April 16th, 2021. A support contract has been signed with the service provider to provide us with assistance with the backup solution for three years.
7. Having additional training for the technical team (on the new patch management solution).
   Status: Ongoing (scheduled to end in June 2021)
   Observations: The training regarding the automated patch management solution (please refer to actions #2 and #3) is scheduled for June 2021.
Summary: TunTrust : OCSP unreachable → TunTrust: OCSP unreachable

Please find below the update regarding the action plan that was produced to deal with this incident (actions #2, #3 and #7):

1. Reverting to the legacy patch management process included in the scope of our prior and current ETSI and WebTrust audits.
   Status: Done (right after the incident)
2. Deferring the automation of the patch management process based on available resources. (This is a mid-term action because of health measures against the coronavirus impacting working hours and physical presence of employees)
   Status: Done (June 18th, 2021)
   Observations: The contract with the service provider was signed on April 6th, 2021. The kick-off meeting took place on April 20th, 2021. The automation of the patch management process has been in place since June 18th, 2021.
3. Contracting with a third party service provider for assistance with the configuration of the automated patch management process. (This is a mid-term action)
   Status: Done (June 18th, 2021)
   Observations: Please refer to the observations for action #2.
4. Reviewing the disaster recovery (DR) configuration by adding an Active-Passive DR configuration in addition to the Active-Active DR already in place. (This is a long-term action because it depends on the budget)
   Status: Done
   Observations: We now have an Active-Passive DR configuration in addition to the Active-Active DR configuration already in place. The contract for acquiring a fully automated backup and replication solution (VEEAM) was signed with the service provider on November 30th, 2020. The components were delivered on February 9th, 2021. We then installed the solution in our test environment and went through all necessary tests. Afterwards, we moved the solution to the production environment (March 18th, 2021). All the actions taken were approved and implemented according to our change management procedure. We also updated our internal documents, including our business continuity plan, our backup plan and our backup procedure. We are keeping the current backup process along with the new one: this is a temporary measure to provide mitigating controls, and the decision was taken based on a risk assessment. We intend to withdraw the older backup process within a two-month period. However, our updated procedures are clear that restoration will be made using the new backup solution. On March 26th, 2021, our WebTrust auditors came on site to independently assess the progress of the project and the implementation of the action plan as part of their audit procedures.
5. Keeping offline backups in the main site in addition to the offline backups in the Disaster Recovery Site. (This is a short-term action)
   Status: Done (September 30th, 2020)
6. Having additional training for the technical team (on the new backup solution).
   Status: Done
   Observations: The training for our administrators on the new backup solution is divided into two parts: a skill transfer («train-the-trainer») from the service provider's team to our system administrators, and a vendor training session provided by the service provider for system administrators and PKI administrators. The skill transfer started in February and covered configuration, testing and putting into production. The internal training session took place from April 12th to April 16th, 2021. A support contract has been signed with the service provider to provide us with assistance with the backup solution for three years.
7. Having additional training for the technical team (on the new patch management solution).
   Status: Ongoing (scheduled to end in June 2021)
   Observations: The training regarding the automated patch management solution (please refer to actions #2 and #3) started on June 21st, 2021 and is scheduled to finish on June 25th, 2021.

Please find below the update regarding the action plan that was produced to deal with this incident (action #7):

1. Reverting to the legacy patch management process included in the scope of our prior and current ETSI and WebTrust audits.
   Status: Done (right after the incident)
2. Deferring the automation of the patch management process based on available resources. (This is a mid-term action because of health measures against the coronavirus impacting working hours and physical presence of employees)
   Status: Done (June 18th, 2021)
   Observations: The contract with the service provider was signed on April 6th, 2021. The kick-off meeting took place on April 20th, 2021. The automation of the patch management process has been in place since June 18th, 2021.
3. Contracting with a third party service provider for assistance with the configuration of the automated patch management process. (This is a mid-term action)
   Status: Done (June 18th, 2021)
   Observations: Please refer to the observations for action #2.
4. Reviewing the disaster recovery (DR) configuration by adding an Active-Passive DR configuration in addition to the Active-Active DR already in place. (This is a long-term action because it depends on the budget)
   Status: Done
   Observations: We now have an Active-Passive DR configuration in addition to the Active-Active DR configuration already in place. The contract for acquiring a fully automated backup and replication solution (VEEAM) was signed with the service provider on November 30th, 2020. The components were delivered on February 9th, 2021. We then installed the solution in our test environment and went through all necessary tests. Afterwards, we moved the solution to the production environment (March 18th, 2021). All the actions taken were approved and implemented according to our change management procedure. We also updated our internal documents, including our business continuity plan, our backup plan and our backup procedure. We are keeping the current backup process along with the new one: this is a temporary measure to provide mitigating controls, and the decision was taken based on a risk assessment. We intend to withdraw the older backup process within a two-month period. However, our updated procedures are clear that restoration will be made using the new backup solution. On March 26th, 2021, our WebTrust auditors came on site to independently assess the progress of the project and the implementation of the action plan as part of their audit procedures.
5. Keeping offline backups in the main site in addition to the offline backups in the Disaster Recovery Site. (This is a short-term action)
   Status: Done (September 30th, 2020)
6. Having additional training for the technical team (on the new backup solution).
   Status: Done (April 16th, 2021)
   Observations: The training for our administrators on the new backup solution is divided into two parts: a skill transfer («train-the-trainer») from the service provider's team to our system administrators, and a vendor training session provided by the service provider for system administrators and PKI administrators. The skill transfer started in February and covered configuration, testing and putting into production. The internal training session took place from April 12th to April 16th, 2021. A support contract has been signed with the service provider to provide us with assistance with the backup solution for three years.
7. Having additional training for the technical team (on the new patch management solution).
   Status: Done (June 25th, 2021)
   Observations: The training regarding the automated patch management solution (please refer to actions #2 and #3) started on June 21st, 2021 and ended on June 25th, 2021.

We have implemented the action plan related to the OCSP incident reported in this bug. Please let us know if there is any additional information needed from our side.

Are there any other questions from the Mozilla community, or can this bug be closed on Friday, 16-July-2021?

Flags: needinfo?(bwilson)
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Flags: needinfo?(bwilson)
Resolution: --- → FIXED
Product: NSS → CA Program
Whiteboard: [ca-compliance] Next Update 2021-06-15 → [ca-compliance] [ocsp-failure]