Closed Bug 1677234 Opened 5 years ago Closed 4 years ago

Apple: OCSP availability 2020-11-12

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: certification_authority, Assigned: certification_authority)

Details

(Whiteboard: [ca-compliance] [ocsp-failure])

User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15

Apple’s OCSP services experienced a period of diminished availability in the early afternoon (Pacific time; UTC-8) on Thursday, November 12, 2020.
We continue to investigate and gather data and will provide more details in a forthcoming update.

Assignee: bwilson → certification_authority
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Type: defect → task
Whiteboard: [ca-compliance]

Full Incident Report

  1. How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date
    1. On November 12 at 12:46 PT the Apple CA team received an internal monitoring alert notifying us of a timeout in our certificate status services.
    2. Between 2020 Nov 12 11:54 PT and 2020 Nov 12 13:43 PT, Apple experienced diminished availability of certificate status services.
  2. A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.
    1. 2020 Nov 12 11:54 PT: Initial decline in successful OCSP queries begins
    2. 2020 Nov 12 12:46 PT: Apple CA’s monitoring alerted due to rate of timeouts in CRL retrieval surpassing threshold
    3. 2020 Nov 12 12:46 PT: Apple CA team confirmed healthy state in backend component of certificate status services
    4. 2020 Nov 12 12:50 PT: Apple CA team confirmed CDN as cause of diminished availability of certificate status services
    5. 2020 Nov 12 12:51 PT: Apple CA team engaged with CDN to track mitigations and resolution
    6. 2020 Nov 12 13:05 PT: Recovery in rate of successful OCSP queries begins due to mitigations implemented by CDN
    7. 2020 Nov 12 13:43 PT: Monitoring observed recovery of impacted services and rate of successful OCSP queries returned to expected volume
  3. Whether your CA has stopped, or has not yet stopped, certificate issuance or the process giving rise to the problem or incident. A statement that you have stopped will be considered a pledge to the community; a statement that you have not stopped requires an explanation.
    1. This did not affect certificate issuance.
  4. In a case involving certificates, a summary of the problematic certificates. For each problem: the number of certificates, and the date the first and last certificates with that problem were issued. In other incidents that do not involve enumerating the affected certificates (e.g. OCSP failures, audit findings, delayed responses, etc.), please provide other similar statistics, aggregates, and a summary for each type of problem identified. This will help us measure the severity of each problem.
    1. Though this issue didn’t involve certificate issuance, it did involve degradation of certificate status service availability. The impacted services were ocsp.apple.com (hosting OCSP services), crl.apple.com (hosting CRLs), and certs.apple.com (hosting CA Issuer files). The Apple CA team runs a service that monitors certificate status services from an external (i.e. client device) perspective. After an alert was triggered regarding CRL retrieval timeouts, the CA team verified that the backend component of the certificate status services was operating properly. It was then noted that the CDN fronting the certificate status services was experiencing diminished availability, causing an increase in failed OCSP queries.
    2. We estimate approximately 60% of traffic was impacted over the period of this incident.
  5. In a case involving certificates, the complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem. In other cases not involving a review of affected certificates, please provide other similar, relevant specifics, if any.
    1. This did not affect certificates.
  6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.
    1. Apple certificate status services rely on an Apple-managed CDN to serve certificate status information. This intermediary layer is what client devices interact with when checking a certificate’s status. The Apple CDN experienced a period of diminished availability, causing it to become non-responsive to some incoming requests, including certificate status checks.
    2. The CDN has multiple global points of presence and we noted that Apple CA internal monitoring does not effectively account for all regions, so while parts of the CDN began to experience issues, our monitoring did not detect it right away.
  7. List of steps your CA is taking to resolve the situation and ensure that such situation or incident will not be repeated in the future, accompanied with a binding timeline of when your CA expects to accomplish each of these remediation steps.
    1. We are currently working with the CDN team to identify areas where additional solutions may be applied to prevent future degradation of service.
    2. To address the delay in the CA team timeout alerts, we plan to geographically diversify Apple CA’s monitoring to enhance and expand overall coverage.
    3. Expected completion date TBD.
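
As a reading aid, the intervals implied by the timeline in item 2 can be worked out as follows. This is a minimal sketch, not part of Apple's report: the variable and helper names are ours, and "PT" is modeled as a fixed UTC-8 offset, as stated in the initial description.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "PT" in the report is a fixed UTC-8 offset (see the initial description).
PT = timezone(timedelta(hours=-8))


def at(hour, minute):
    """Build a timestamp on 2020-11-12 in the assumed PT offset."""
    return datetime(2020, 11, 12, hour, minute, tzinfo=PT)


def minutes(delta):
    """Convert a timedelta to whole minutes."""
    return int(delta.total_seconds() // 60)


# Timestamps copied from the timeline in item 2.
decline_start = at(11, 54)   # initial decline in successful OCSP queries
alert = at(12, 46)           # monitoring alert on CRL retrieval timeouts
recovery_start = at(13, 5)   # recovery begins after CDN mitigations
recovered = at(13, 43)       # successful query rate back to expected volume

time_to_detect = minutes(alert - decline_start)           # 52
alert_to_recovery_start = minutes(recovery_start - alert)  # 19
total_duration = minutes(recovered - decline_start)        # 109

print(f"decline to alert: {time_to_detect} min")
print(f"alert to start of recovery: {alert_to_recovery_start} min")
print(f"total period of diminished availability: {total_duration} min")
```

The 52-minute gap between the first decline and the first alert is the detection delay that item 6.2 attributes to incomplete regional coverage in monitoring.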

Due to the Thanksgiving holiday in the US, we expect to provide an update and/or reply to any questions no earlier than the week of 30 November.

We continue to work with the CDN team to identify solutions to prevent future degradation of services. Additionally, we continue to enhance and expand overall monitoring coverage. We expect to provide another update by 18 December.

To mitigate future degradation of services, we designed a solution that leverages 3rd party CDNs to complement the existing Apple CDN infrastructure. This will reduce the risk of a single CDN impacting Apple’s certificate status services. This will be implemented by the end of January 2021.

We also enhanced our existing monitoring solution to improve global coverage. The new solution will monitor each of our globally distributed CDN endpoints. This solution will be rolled out by the end of this month.
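
To illustrate why per-endpoint monitoring matters here, the sketch below compares a single aggregate failure rate with per-region failure rates. It is only an illustration under stated assumptions: the region names, the probe data, the 5% threshold, and the function `regions_over_threshold` are all hypothetical and do not describe Apple's actual monitoring.

```python
from collections import defaultdict


def regions_over_threshold(probes, threshold=0.05):
    """Return the sorted regions whose probe failure rate exceeds `threshold`.

    `probes` is an iterable of (region, succeeded) pairs, one per probe.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for region, succeeded in probes:
        totals[region] += 1
        if not succeeded:
            failures[region] += 1
    return sorted(r for r in totals if failures[r] / totals[r] > threshold)


# Hypothetical probe results: two well-sampled healthy regions and one
# thinly sampled region in which 4 of 10 probes time out.
sample = (
    [("us-west", True)] * 100
    + [("eu-central", True)] * 100
    + [("ap-south", True)] * 6
    + [("ap-south", False)] * 4
)

# The aggregate failure rate stays under the threshold...
aggregate_failure_rate = sum(1 for _, ok in sample if not ok) / len(sample)
print(aggregate_failure_rate < 0.05)

# ...while the per-region view flags the degraded region.
print(regions_over_threshold(sample))
```

With this sample, an alert keyed to the aggregate rate would stay silent while the per-region check flags `ap-south`, which is the kind of gap described in item 6.2.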

We plan to post our next update once the above solutions have been implemented or by the end of January 2021 at the latest.

Is this still on track to be completed by EOM?

Note that in the absence of a Mozilla representative setting a Next Update, CAs are expected to provide regular progress reports (https://wiki.mozilla.org/CA/Responding_To_An_Incident#Keeping_Us_Informed). Setting N-I for Ben.

Flags: needinfo?(certification_authority)
Flags: needinfo?(bwilson)

Thank you, Ryan. This is still on track for completion by the end of this month.

Flags: needinfo?(certification_authority)

We are still on track with implementing a new solution to leverage 3rd party CDNs by the end of this month. By way of clarification, we completed the described monitoring improvements within the expected timeframe (December 2020).

Setting N-I for Ben to set Next-Update to 2021-01-31 as needed.

Flags: needinfo?(bwilson)
Whiteboard: [ca-compliance] → [ca-compliance] Next Update 2021-01-31

The following tasks have been completed:

  • As previously stated, we enhanced our existing monitoring solution to improve global coverage
  • We implemented a solution to leverage 3rd party CDNs to complement the existing Apple CDN infrastructure

This completes the open tasks for this incident.

I will schedule this bug to be closed on Wed. 3-Feb-2021 unless there are additional issues to discuss.

Flags: needinfo?(bwilson)
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Flags: needinfo?(bwilson)
Resolution: --- → FIXED
Product: NSS → CA Program
Whiteboard: [ca-compliance] Next Update 2021-01-31 → [ca-compliance] [ocsp-failure]