Closed Bug 1622505 Opened 4 years ago Closed 4 years ago

GlobalSign: OCSP Status HTTP 530

Categories

(CA Program :: CA Certificate Compliance, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: tdelmas, Assigned: arvid.vermote)

Details

(Whiteboard: [ca-compliance] [ocsp-failure])

Attachments

(1 file)

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0

Steps to reproduce:

Open https://www.velib-metropole.fr/

Actual results:

SEC_ERROR_OCSP_SERVER_ERROR

Expected results:

Not an OCSP error.

https://web.archive.org/web/20200314093203/https://www.globalsign.com/en/status

https://twitter.com/GSSystemAlerts/status/1238523405089079297

https://groups.google.com/forum/#!topic/mozilla.dev.security.policy/tt-ytBbW5po
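
For reference, the responder error can be reproduced outside the browser by querying the OCSP URL from the certificate's AIA extension directly. The sketch below is illustrative only (Python with the third-party cryptography package; the certificate file names are placeholders): it builds an OCSP request for a leaf/issuer pair and POSTs it to the responder, so a 503 or 530 from the CDN surfaces as an HTTP error, roughly what Firefox reported here as SEC_ERROR_OCSP_SERVER_ERROR.

import urllib.request

from cryptography import x509
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.x509 import ocsp


def probe_ocsp(leaf_pem: bytes, issuer_pem: bytes, responder_url: str) -> str:
    """Build a DER-encoded OCSP request for leaf/issuer and POST it to responder_url."""
    leaf = x509.load_pem_x509_certificate(leaf_pem)
    issuer = x509.load_pem_x509_certificate(issuer_pem)

    request = (
        ocsp.OCSPRequestBuilder()
        .add_certificate(leaf, issuer, hashes.SHA1())
        .build()
        .public_bytes(serialization.Encoding.DER)
    )

    http_req = urllib.request.Request(
        responder_url,
        data=request,
        headers={"Content-Type": "application/ocsp-request"},
    )
    # urlopen raises HTTPError on a 5xx (e.g. the 503/530 seen during the incident),
    # which is what the browser-side OCSP failure corresponds to.
    with urllib.request.urlopen(http_req, timeout=10) as resp:
        return ocsp.load_der_ocsp_response(resp.read()).response_status.name


# Example (hypothetical file names):
# print(probe_ocsp(open("leaf.pem", "rb").read(),
#                  open("issuer.pem", "rb").read(),
#                  "http://ocsp.globalsign.com/gsrsaovsslca2018"))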

As per the details shared through our status and alert channels, we are facing some instability at one of our data centers. This is being addressed and will be resolved in the coming hours.

Assignee: wthayer → arvid.vermote
Status: UNCONFIRMED → ASSIGNED
Type: defect → task
Ever confirmed: true
Whiteboard: [ca-compliance]

Hi Wayne - we had an outage in one of our data centers last week, which caused brief gaps in our OCSP responder availability. Please find attached the incident report as publicly disclosed.

We are aware of Baseline Requirements section 4.10.2, and under normal conditions our response times are well under 10 seconds.

Happy to create an additional tailored incident report if required or answer any questions you have.

Note: It's useful to include the content in the bug, for searchability and ease of reference.

4.10.2 also requires:

The CA SHALL maintain an online 24x7 Repository that application software can use to automatically check the current status of all unexpired Certificates issued by the CA.

Regarding the incident and mitigations, I have several questions. Some of these questions may be related to this document presumably being written for other consumers as well, and so don't necessarily represent serious issues so much as an attempt to better understand things systemically.

Timeline and Background

  1. What are CertSafe and Atlas?
  2. What is meant by "non-GSS, non-Japanese soil"?
  3. What were the affected CRLs?
    1. How many unexpired certificates referred to these CRLs?
  4. What were the affected OCSP responder endpoints URLs?
    1. How many unexpired certificates referred to these endpoints?

Root Cause Analysis

  1. It's mentioned that the combination of "an unforeseen increase in origin traffic" and "a significant amount of malicious traffic" was responsible. From the timeline, it's unclear when the malicious traffic was dropped, and the discussion that it originated with the CDN seems to suggest that this could have been addressed at the CDN through filtering?
    1. For example, in the absolute worst case, it seems like you could have filtered out unanswerable URLs (i.e. ones that you know the origin won't serve), as well as halted issuance and deployed an allow-list letting only known serials through. That's not to suggest these are "easy" things, by any means, but understanding more about the nature of the traffic, and how much was attributable to that (and the challenges of responding to that burst), seems essential in distinguishing what "normal" load is.

Mitigations

  1. Is it correct to understand that your OCSP services are or were single-homed?
    1. That is, it seems possible to move from a "pull" relationship with the CDN (where the CDN queries the origin for statuses it has not been provided) to a "push" relationship, where you publish the resources for distribution on the CDN. Different CAs have taken different approaches here, so understanding more about the architectural design of the OCSP responders can help further identify if there are good practices here worth implementing, or potentially challenging practices that other CAs have found non-viable for "normal" operation.
Flags: needinfo?(arvid.vermote)

Please find below the answers to your questions. We are also consolidating these, together with queries we are receiving from customers, into an updated incident report.

Timeline and Background
1. What are CertSafe and Atlas?

CertSafe is a product where we host OCSP responders on behalf of customers. This is specifically aimed at organizations that operate an on-premise CA but prefer outsourcing OCSP.

Atlas is our new certificate issuance and certificate life cycle management platform. Our old platform is named GCC and has components hosted in the affected data center. We are in the process of migrating all products from GCC to Atlas and eventually decommissioning the GCC platform.

2. What is meant by "non-GSS, non-Japanese soil"?

DSS (Digital Signing Service) is a document signing service for which we serve timestamps from a different location not affected by the incident. We also have specific TSA services provided on Japanese soil mainly under the Japan Accredited Time Stamping scheme which were also not affected by the incident.

3. What were the affected CRLs? - How many unexpired certificates referred to these CRLs?
4. What were the affected OCSP responder endpoints URLs? - How many unexpired certificates referred to these endpoints?

Determining which CRLs, OCSP responders and leaf certificates were affected is not trivial, as we use multiple CDNs and CDN caches are not distributed globally. A given URL might return a 503 for one user but work fine for another, depending on which CDN and geographical area they go through. Also, because of the way we cache OCSP queries, a recently queried certificate will have a proper response even if the origin is unstable at that moment (because it is cached at the CDN level), whereas an uncached query might return a failure.

Generally, because of caching, CRLs were globally available throughout the incident, except for a few uncached lists related to private PKI hierarchies.

In terms of the WebPKI, based on current data correlation, the following OCSP responders are known to have been unstable and unable to serve certain uncached queries at times during the incident window (an illustrative availability check follows the list):

http://ocsp2.globalsign.com/gsorganizationvalsha2g2
http://ocsp.globalsign.com/gseccovsslca2018
http://ocsp.globalsign.com/ca/gsovsha2g4r3
http://ocsp2.globalsign.com/gsorganizationvalsha2g3
http://ocsp.globalsign.com/gsrsaovsslca2018
http://ocsp2.globalsign.com/cloudsslsha2g3
http://ocsp.globalsign.com/gsrsadvsslca2018
http://ocsp2.globalsign.com/gsdomainvalsha2g3
http://ocsp2.globalsign.com/gsalphasha2g2
http://ocsp2.globalsign.com/gsextendvalsha2g3r3
http://ocsp.globalsign.com/gseccevsslca2019
http://ocsp.globalsign.com/gsrsaevqwac2019

Root Cause Analysis
It's mentioned that the combination of "an unforeseen increase in origin traffic" and "a significant amount of malicious traffic" was responsible. From the timeline, it's unclear when the malicious traffic was dropped, and the discussion that it originated with the CDN seems to suggest that this could have been addressed at the CDN through filtering?
For example, in the absolute worst case, it seems like you could have filtered out unanswerable URLs (i.e. ones that you know the origin won't serve), as well as halted issuance and deployed an allow-list letting only known serials through. That's not to suggest these are "easy" things, by any means, but understanding more about the nature of the traffic, and how much was attributable to that (and the challenges of responding to that burst), seems essential in distinguishing what "normal" load is.

There were two types of malicious traffic:

  1. Traffic through CDNs to HTTP resources that did not exist. Because these resources did not exist, each request was forwarded to the data center. This traffic was sinkholed at the CDN level on Wednesday, March 11 at 07:30 UTC.
  2. Distributed direct traffic to the data center for resources that do not exist. Some of it had a fixed source and was blocked on Wednesday, March 11 at 07:30 UTC. At the same time we added dummy responses returning an HTTP 200 for the queried non-existent resources, to reduce the general load and the log/alert noise this traffic was generating (an illustrative sketch of such a catch-all handler follows this list).
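
As an illustration of the second mitigation only (not GlobalSign's actual configuration; the known path prefixes and port are placeholders), a catch-all origin handler along these lines answers requests for non-existent resources with a tiny HTTP 200, so they are served cheaply instead of generating errors and log/alert noise.

from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical set of paths the origin actually serves; everything else gets a dummy 200.
KNOWN_PREFIXES = ("/gsrsaovsslca2018", "/gsdomainvalsha2g3")


class CatchAllHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path.startswith(KNOWN_PREFIXES):
            # Real handling (proxying to the OCSP/CRL backend) would go here.
            self.send_error(501, "backend handling not implemented in this sketch")
            return
        # Unknown resource: answer with a minimal 200 instead of an error, so that
        # malicious or misdirected requests are cheap to serve and do not flood logs.
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # suppress per-request logging for the dummy responses


if __name__ == "__main__":
    ThreadingHTTPServer(("0.0.0.0", 8080), CatchAllHandler).serve_forever()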

Mitigations
Is it correct to understand that your OCSP services are or were single-homed?
That is, it seems possible to move from a "pull" relationship with the CDN (where the CDN queries the origin for statuses it has not been provided) to a "push" relationship, where you publish the resources for distribution on the CDN. Different CAs have taken different approaches here, so understanding more about the architectural design of the OCSP responders can help further identify if there are good practices here worth implementing, or potentially challenging practices that other CAs have found non-viable for "normal" operation.

For our GCC platform, OCSP services are currently single-homed. Our new platform (Atlas) will be multi-homed, with the ability to fail OCSP responders over to other data centers. We are currently considering pre-generating OCSP responses for certain ICAs but have not yet reached a conclusion there.
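
For context on the "push" model being considered, a pre-generated "good" response could look roughly like the sketch below, using the Python cryptography package (the certificates, responder key and the four-day validity window are placeholders, not GlobalSign's values). The signed DER blobs could then be published to the CDN keyed by the requests they answer, so a cache miss never has to reach the origin; the trade-off is that responses must be regenerated and redistributed well before next_update expires.

import datetime

from cryptography.hazmat.primitives import hashes, serialization
from cryptography.x509 import ocsp


def pregenerate_good_response(cert, issuer, responder_cert, responder_key) -> bytes:
    """Pre-sign a 'good' OCSP response that a CDN could serve without hitting the origin."""
    now = datetime.datetime.utcnow()
    builder = (
        ocsp.OCSPResponseBuilder()
        .add_response(
            cert=cert,
            issuer=issuer,
            algorithm=hashes.SHA1(),
            cert_status=ocsp.OCSPCertStatus.GOOD,
            this_update=now,
            next_update=now + datetime.timedelta(days=4),  # placeholder validity window
            revocation_time=None,
            revocation_reason=None,
        )
        .responder_id(ocsp.OCSPResponderEncoding.HASH, responder_cert)
    )
    response = builder.sign(responder_key, hashes.SHA256())
    return response.public_bytes(serialization.Encoding.DER)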

Flags: needinfo?(arvid.vermote)

Arvid:

Can you please make sure to fill out the template at https://wiki.mozilla.org/CA/Responding_To_An_Incident#Incident_Report

In particular, it would highlight the things not yet delivered, the most important being:

a timeline of when your CA expects to accomplish these things.

Flags: needinfo?(arvid.vermote)

How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.

On Tuesday March 10th 11:00 UTC our system monitoring tools and Network Operations Center (NOC) started identifying issues with response times and response contents of the services served from the affected data center.

A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.

Time (UTC) Action
March 10 2020 11:00 Issue begins. Customers experience occasional timeouts and HTTP 503 or 504 errors on Timestamping services, and sporadic issuance failures
March 10 2020 12:30 Issue mitigated by moving some services to an alternative firewall and load balancer cluster. Some timeouts still remain, but issuance failures ceased.
March 10 2020 13:00 - 21:25 Further troubleshooting and tuning work performed. GlobalSign OCSP failures reduced to 0.5%. Timestamping time-outs still occurring, but customers given work-arounds
March 10 2020 21:25 Timestamping services using the VPN connection to Japan restored. Other timestamping services still taking 13-15 seconds to return as load drops off slowly
March 10 2020 23:15 Situation declared stable
March 11 2020 03:15 Firewall sessions are seen increasing, then at 3:15 UTC reach the critical point, interrupting services again.
March 11 2020 03:30 – 05:30 Troubleshooting and investigation of issue
March 11 2020 06:30 OCSP traffic moved over to DR firewalls
March 11 2020 07:30 The Security Team identifies a source of malicious incoming traffic, which is blocked to reduce impact. The firewall vendor arrives on-site to help troubleshoot
March 11 2020 08:30 All inbound traffic routed through DR firewalls, improving service slightly. However, traffic between the DMZ and the App zone still travels over the affected firewalls, still impacting service
March 11 2020 09:30 - 11:30 The engineer from the vendor brings the DMZ firewalls back online and traffic is slowly ramped up through them, restoring service. Timestamping is still slow.
March 11 2020 14:30 - 20:30 The network is largely stable, but with an elevated occurrence of HTTP 503 and 530 errors, and timestamps taking longer than usual to return (2-3 seconds). We see a high rate of certificate issuance failures
March 11 2020 20:30 Rate limits are put in place on the load balancers to prevent load causing issues as traffic ramps up in the morning
March 11 2020 20:30 - 00:30 Work with CertSafe customers on troubleshooting and resolving any outstanding issues they are still seeing
March 12 2020 03:00 A burst of traffic at around 09:00 SGT (01:00 UTC) caused a build-up of sessions on the load balancers, which started to cause further issues in the data centre. The nature of the issue meant that the rate limits put in place on the Wednesday were ineffectual at resolving the problem.
March 12 2020 03:00 - 06:00 Troubleshooting continues – further rate limits placed on connections
March 12 2020 06:00 System declared stable but with low throughput. Some certificate issuance failures showing, and performance issues for some CertSafe customers
March 12 2020 06:00 - 09:00 Monitoring and working on outstanding performance issues
March 12 2020 09:00 - 12:00 Inside leg of DMZ moved to DR firewalls to further reduce the reliance on the legacy units, which are still exhibiting high CPU at times.
March 12 2020 12:00 - 20:00 Most services stable, some errors on some OCSP services. Ongoing investigation of service issues for CertSafe customers
March 12 2020 20:00 System stable, however a small number of OCSP services are intermittently unavailable, showing with a CDN error.
March 12 2020 20:00 - 03:00 Further work with CertSafe customers resolving outstanding issues
March 13 2020 15:30 - 18:30 Some load-related latency and outages experienced by CertSafe customers
March 14 2020 07:00 - 12:30 Work on restoring full access to OCSP services which are still throwing errors
March 14 2020 12:30 All services now operational, and remain so over Sunday 15th March
March 16 2020 01:00-07:00 Some load-based outages on the CertSafe platform, leading to close monitoring throughout the day. CertSafe was moved to dedicated hardware at 04:00 on Tuesday to resolve this

Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem. A statement that you have will be considered a pledge to the community; a statement that you have not requires an explanation.

N/A given the nature of the incident.

A summary of the problematic certificates. For each problem: number of certs, and the date the first and last certs with that problem were issued.

Due to the nature of the incident there are no individual problematic certificates. Determining which CRLs, OCSP responders and leaf certificates were affected is not trivial, as we use multiple CDNs and CDN caches are not distributed globally. A given URL might return a 503 for one user but work fine for another, depending on which CDN and geographical area they go through. Also, because of the way we cache OCSP queries, a recently queried certificate will have a proper response even if the origin is unstable at that moment (because it is cached at the CDN level), whereas an uncached query might return a failure.

Generally, because of caching, CRLs were globally available throughout the incident, except for a few uncached lists related to private PKI hierarchies.

In terms of the WebPKI, based on current data correlation, the following OCSP responders are known to have been unstable and unable to serve certain uncached queries at times during the incident window:

http://ocsp2.globalsign.com/gsorganizationvalsha2g2
http://ocsp.globalsign.com/gseccovsslca2018
http://ocsp.globalsign.com/ca/gsovsha2g4r3
http://ocsp2.globalsign.com/gsorganizationvalsha2g3
http://ocsp.globalsign.com/gsrsaovsslca2018
http://ocsp2.globalsign.com/cloudsslsha2g3
http://ocsp.globalsign.com/gsrsadvsslca2018
http://ocsp2.globalsign.com/gsdomainvalsha2g3
http://ocsp2.globalsign.com/gsalphasha2g2
http://ocsp2.globalsign.com/gsextendvalsha2g3r3
http://ocsp.globalsign.com/gseccevsslca2019
http://ocsp.globalsign.com/gsrsaevqwac2019

The complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem.

See above.

Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

In-depth analysis of many tens of GB of firewall logs and packet captures has been performed to identify the root cause. The following factors have been identified as contributing to the outage, in a “perfect storm” for the data centre:

  1. A sharp, unforeseen increase in origin traffic on the CertSafe platform, causing an increase in database contention, and leading to intermittent sharp rises in load balancer sockets in the TIME_WAIT state (waiting for a response).
  2. As a result of #1 and #2, the CPU on the DMZ firewalls became saturated instantiating new sessions, which led to pauses in packet forwarding and further hanging of connections
  3. This increase in established but hung connections caused database connection pools on some services to fill up, meaning useful sessions were then unable to establish connectivity to the database.
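
To illustrate the third factor in generic terms (this is not GlobalSign's actual stack; the pool size, timeout and factory are placeholders), a bounded connection pool with an acquisition timeout fails fast when hung sessions occupy every slot, instead of letting useful requests queue indefinitely behind them.

import queue
from contextlib import contextmanager


class BoundedPool:
    """Generic bounded resource pool; `factory` creates a connection (placeholder)."""

    def __init__(self, factory, size: int = 20):
        self._slots = queue.Queue(maxsize=size)
        for _ in range(size):
            self._slots.put(factory())

    @contextmanager
    def acquire(self, timeout: float = 2.0):
        try:
            conn = self._slots.get(timeout=timeout)
        except queue.Empty:
            # All slots are held (e.g. by hung sessions): surface an error quickly
            # rather than stalling every new request behind the saturated pool.
            raise TimeoutError("connection pool exhausted")
        try:
            yield conn
        finally:
            self._slots.put(conn)


# Usage sketch:
# pool = BoundedPool(factory=lambda: object())
# with pool.acquire() as conn:
#     ...  # use the connection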

List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.

  • March 11 2020 07:30 UTC - The malicious traffic was black-holed
  • March 17 2020 04:12 UTC - The CertSafe platform has been moved to dedicated database and HSM hardware, meaning further changes to usage pattern cannot affect other services. (#2)
  • March 15 2020 15:23 UTC - The vendor was unable to find significant failings in the legacy DMZ firewalls (#3); however, they have now been completely retired as a precaution, and more powerful alternatives are being used instead
  • March 12 2020 09:50 UTC - Some services (e.g. secure.globalsign.com) have been failed over to our UK data centre to share load better between the sites
  • June 15 2020: Completion of detailed review of DR plans for the affected data center and start of any resulting activities.
  • July 1 2020: Completion of internal review and improvement of Incident Communication, in order to ensure more proactive and detailed communication and customer notifications in the event of any future incidents.
  • October 1 2020: TSA services available from multiple sites simultaneously (Previously they had been available from other data centres for DR purposes only)
  • January 31 2021: Finalization of improvements to our (D)DOS detection and prevention techniques for faster diagnosis and resolution in future
  • December 31 2022: Current EOL date of our legacy platform (GCC), for which OCSP/CRL services are single-homed. Our new platform (Atlas) will be multi-homed, with the ability to fail OCSP responders over to other data centers. The migration to the new Atlas platform happens on a per-product basis; enterprise TLS products are planned for 2020-2021, and retail SSL during the course of 2021-2022.
Flags: needinfo?(arvid.vermote)
Whiteboard: [ca-compliance] → [ca-compliance] Next Update - 15-June 2020

Arvid, do you have any updates on the status of your DR plan reviews and improvements to your Incident Communication process? Thanks.

Flags: needinfo?(arvid.vermote)

Completion of detailed review of DR plans for the affected data center and start of any resulting activities

A more extensive evaluation has been carried out assessing the details of the activities performed during the incident under discussion; based on this, various procedures have been changed and additional detail added to ensure quicker recovery should a similar event recur. Existing disaster recovery plans and recovery time objectives have been evaluated, and procedural and technical improvements have been made where appropriate. A full-scale failover and recovery exercise is planned for November 2020.

Additional plans for further reducing the recovery time objective and the manual complexity of failover have been suggested for presentation to the GlobalSign board, along with associated implementation time and costs. The appropriate actions will be scheduled and taken following board review of this data (expected to be concluded and approved September 2020).

Additionally, we are accelerating the migration of certain products and customers off our old platform (whose certificate status services are single-homed) to our new platform, which by design allows for better failover and recovery capabilities and is intended to be multi-homed.

Completion of internal review and improvement of Incident Communication

The task group in charge of this review has completed its appraisal and implemented new internal and external communication procedures and templates, in order to inform customers and relying parties more quickly when there is a service outage or status change.

Separate projects to implement new channels for improving direct communication with customers and relying parties, including a real-time status portal, have been initiated with their own timelines.

Flags: needinfo?(arvid.vermote)

Hi Ben - Can you confirm whether more information is required, or whether this ticket can now be closed? Thanks

Flags: needinfo?(bwilson)

I believe so. I'll close this on 14-Aug-2020 unless anyone else has questions or concerns that require further response.

Whiteboard: [ca-compliance] Next Update - 15-June 2020 → [ca-compliance]
QA Contact: wthayer → bwilson
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Flags: needinfo?(bwilson)
Resolution: --- → FIXED
Product: NSS → CA Program
Whiteboard: [ca-compliance] → [ca-compliance] [ocsp-failure]