Closed Bug 1640805 Opened 5 years ago Closed 5 years ago

DigiCert: delayed publication of revocation information

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mpalmer, Assigned: jeremy.rowley)

Details

(Whiteboard: [ca-compliance] [ocsp-failure])

Attachments

(2 files)

Digicert's OCSP responders have provided validly-signed "Good" responses to requests for six certificates revoked by Digicert, more than 24 hours after the timestamp later given as the revocation time in "Revoked" OCSP responses -- and around 45 hours after Digicert received the initial certificate problem report.

Whilst the BRs mention that revocation information must be "published" within 24 hours, they do not define exactly what "published" means. As a result, I believe it is open to some amount of interpretation as to whether what I have observed is -- or is not -- a BR violation. However, I do believe that delayed availability of revocation information via OCSP is not in the interests of relying parties, for two reasons:

  1. Delayed availability of revocation information means that relying parties cannot take appropriate action in response to a revoked certificate, because they do not know that the certificate has been revoked; and
  2. Malfeasance on the part of CA -- specifically, "backdating" revocation timestamps in order to appear compliant with BRs 4.9.5 -- cannot be detected by an outside observer, as it is indistinguishable from a simple and non-malicious delay in revocation information publication.

As such, I believe it would be valuable for Digicert to explain their understanding of the BR requirements around "publication", and how they believe the behaviour I've observed is consistent with the BRs, with a view to gathering sufficient information that a suitable improvement to the BRs or Mozilla Policy can be made, so that it is clear to all CAs that OCSP responses must be provided promptly. Of course, if what I've observed is deemed by Mozilla to be a BR violation, there'll be a need for an incident report, and so on, as well.

Gory Details

(All times UTC)

Here are query results for the relevant certificates, taken from the Revokinator:

                             digest                               |          sent_at           |      publication_time      | revocation_timestamp |    effective_delay    
------------------------------------------------------------------+----------------------------+----------------------------+----------------------+-----------------------
 08ab9b1d4f91b1309d733db349ae7e3535e6a370bb0d885d91946cef6d1f754f | 2020-05-20 06:57:22.07832  | 2020-05-22 05:25:55.334282 | 2020-05-21 03:35:47  | 1 day 22:28:33.255962
 33e331183e6e700db7d235fc015a5db065c497926fd586623179865ebab2ef63 | 2020-05-20 06:59:42.769947 | 2020-05-22 04:15:24.29826  | 2020-05-21 03:35:43  | 1 day 21:15:41.528313
 6928df582b90f9c3c451f9e39cca9464694898c2ccad04cf8ba8f1f8445622e0 | 2020-05-20 07:02:04.185273 | 2020-05-22 04:17:17.822975 | 2020-05-21 03:35:47  | 1 day 21:15:13.637702
 81783f7c272b8e13b210395e61f2ccdae6942e2a2e938f5910ca4b61a302b7ad | 2020-05-20 07:00:16.441478 | 2020-05-22 03:46:30.865371 | 2020-05-21 03:35:47  | 1 day 20:46:14.423893
 840516cf05c6a1e7cb838ab11ed139074e0175caf3c77a17ea9effc5c4b6fe2a | 2020-05-20 06:57:22.07832  | 2020-05-22 05:37:22.763886 | 2020-05-21 03:35:47  | 1 day 22:40:00.685566
 d9a01093e376cd7032d3d0275427e5316ea7c182f541d12604356e262cb08321 | 2020-05-20 07:02:04.185273 | 2020-05-22 04:13:33.329776 | 2020-05-21 03:35:47  | 1 day 21:11:29.144503

Where:

  • digest is the SHA256 digest of the certificate DER;
  • sent_at is the time at which an MX server listed in the recordset for digicert.com accepted delivery for the e-mail containing the relevant certificate problem report;
  • publication_time is the last time at which an OCSP responder listed in the A or AAAA recordset for the DNS name of the OCSP responder listed in the certificate returned a valid "Good" response for the certificate;
  • revocation_timestamp is the time given in a valid "Revoked" OCSP response; and
  • effective_delay is publication_time - sent_at, indicating the amount of time elapsed between the receipt of the problem report and the availability of a "Revoked" OCSP response from all OCSP responders listed in DNS.
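
To be concrete about how those last two columns are derived, here's a minimal sketch over per-request observation records; the record layout and example timestamps are illustrative, not the Revokinator's actual schema:

from datetime import datetime

# Illustrative observations: one record per OCSP request made to a responder
# listed in DNS, with the cert_status from the validly-signed response.
observations = [
    {"requested_at": datetime(2020, 5, 22, 5, 25, 55), "status": "good"},
    {"requested_at": datetime(2020, 5, 22, 5, 30, 12), "status": "revoked"},
]
sent_at = datetime(2020, 5, 20, 6, 57, 22)  # problem report accepted by an MX for digicert.com

# publication_time: the last time any responder still returned a valid "Good" response.
publication_time = max(o["requested_at"] for o in observations if o["status"] == "good")

# effective_delay: problem report received -> revocation visible from all responders.
effective_delay = publication_time - sent_at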

If it would assist Digicert in their investigation, I can provide Digicert with details of source IPs, exact timestamps of each HTTP request and the IP address to which they were made, and so on.

Assignee: bwilson → jeremy.rowley
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Whiteboard: [ca-compliance]

I'll post more info later, but here is one of the threads on this topic:

https://groups.google.com/forum/#!msg/mozilla.dev.security.policy/eV89JXcsBC0/7hkz9iJDAQAJ

Also here:

https://groups.google.com/forum/#!searchin/mozilla.dev.security.policy/revocation$20publication%7Csort:date/mozilla.dev.security.policy/LC_y8yPDI9Q/Z7pPGNW8AAAJ

But I do prefer Bugzilla over the Mozilla mailing list as the repository of record for how the various CAs operate, so I'll post the details on this thread.

Here's how revocation works at DigiCert:

After investigating a request for revocation and confirming the reason for revocation, we mark the cert as revoked in our database. This becomes the revocation time. Every two minutes the database is queried to see if a new certificate has been revoked. If a new revocation is found, the response is generated and pushed to our origin servers. Once we push it to the origin server, the certificate is considered revoked since we have "published" the certificate information for general consumption. The certificate is revoked at that point. Requests for certificate status information are routed through one of our CDN providers (depending on the cert). The cache is usually 4 hours. If the cache is older than 4 hours then the latest response is pulled from the origin server.
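
In rough pseudocode, that incremental flow looks something like the sketch below; the helper functions are placeholders for illustration, not our actual code:

import time
from datetime import datetime, timezone

def fetch_new_revocations(since):
    """Query the CA database for certs marked revoked after `since` (placeholder)."""
    raise NotImplementedError

def sign_revoked_response(cert):
    """Have the CA sign a fresh "revoked" OCSP response (placeholder)."""
    raise NotImplementedError

def push_to_origin(response):
    """Upload the signed response to the origin servers the CDN pulls from (placeholder)."""
    raise NotImplementedError

last_sweep = datetime.now(timezone.utc)
while True:
    for cert in fetch_new_revocations(last_sweep):
        push_to_origin(sign_revoked_response(cert))
    last_sweep = datetime.now(timezone.utc)
    time.sleep(120)  # the two-minute incremental sweep
    # Relying parties keep seeing the cached "Good" answer until the CDN
    # cache (nominally up to 4 hours) expires or is purged.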

OCSP responses for new certificates work the same way - every two minutes, the system scans the database looking for certs issued. We then sign a response for the cert and push the OCSP response to the origin servers. Once again, the CDN picks up the response when the cache is cleared and distributes it upon request. This time for pickup is dependent on the CDN and any caching by the user's browser.

However, the time to distribute should be less than 4 hours even at the extreme. Something else is happening that caused extra long delays on those certs (21 hours is outside of the expected range). I'm investigating and will provide an additional update when I know what happened.

Many CDNs provide the ability to proactively invalidate CDN caches. Does DigiCert use these, or does it solely rely on the CDN edge services? Some of these operate on the order of seconds.

Flags: needinfo?(jeremy.rowley)

We can clear the entire cache but not individual OCSP responses. If we cleared the entire cache every time a revocation occurred, we'd end up uploading tens of millions of certs every few minutes. This would be worse than not having a CDN distribute the responses, likely bringing down the origin server. However, there is definitely something wrong during the time period Matt cited since it should never be over 4 hours from revocation to distribution.

Flags: needinfo?(jeremy.rowley)

I’m still not sure I’ve got sufficient detail. Many CDNs, including both Akamai and CloudFlare (both of which I believe DigiCert may use or have used) support revoking by URL, which would be fine for RFC 5019. Similarly, POST responses can use CDN-specific tags that you can later selectively invalidate.

That’s why I’m not sure I understand why it would even take four hours, and it seems like there’s an opportunity here for a systemic, and long-term, fix.
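
For illustration, a targeted purge is typically a single API call per cached URL; the endpoint, token and payload below are purely hypothetical and not any particular CDN's API:

import requests

PURGE_ENDPOINT = "https://api.example-cdn.invalid/v1/purge"  # hypothetical endpoint
API_TOKEN = "..."  # hypothetical credential

def purge_ocsp_url(ocsp_get_url: str) -> None:
    """Ask the CDN to drop its cached copy of a single certificate's OCSP URL."""
    resp = requests.post(
        PURGE_ENDPOINT,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json={"urls": [ocsp_get_url]},
        timeout=10,
    )
    resp.raise_for_status()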

Flags: needinfo?(jeremy.rowley)

Thanks for the details, Jeremy. A couple of comments:

Once we push it to the origin server, the certificate is considered revoked since we have "published" the certificate information for general consumption.

If I'm understanding this sentence correctly, you're asserting that generating an OCSP response with a "revoked" status, even if no relying party can actually access that OCSP response for up to four hours, counts as "publication" for the purposes of BRs 4.9.5. Is that an accurate reading, or am I misunderstanding something?

The cache is usually 4 hours. If the cache is older than 4 hours then the latest response is pulled from the origin server.

If there can be an unavoidable delay of up to four hours between the point at which the OCSP response is "DigiCert published" and when it may actually end up being available to relying parties, I would expect that potential delay to be accounted for in the rest of DigiCert's procedures -- for example, ensuring that the revocation operation is completed within 20 hours of receiving the problem report. I know that reduces the time that DigiCert has to process the report and work with the subscriber, but DigiCert chose their CDN provider, and everything that goes along with it.

As an aside, since creating this bug, I've looked through the Revokinator OCSP history data, and I can't find any instances where another CA has failed to make an OCSP response available within 24 hours of receiving the problem report (except in cases where they failed to revoke within 24 hours, which is a whole other set of bugs). Yes, I was pretty surprised, too. It shows, however, that most CAs seem to have their OCSP publication pipelines under some sort of control.

However, the time to distribute should be less than 4 hours even at the extreme.

There are a lot of revocations that don't meet this bar. There are a total of 94 revoked certificates whose "time to effective revocation" (that is, the period between DigiCert receiving the problem report and the last time the Revokinator got a valid OCSP Response whose cert_status was 0) is greater than 28 hours (24 hours for BRs + a four hour CDN delay), where the "nominal revocation time" (revocation timestamp - sent_at) was less than 24 hours. That compares with only 53 revoked certificates from Digicert whose "time to effective revocation" was less than or equal to 28 hours. If we take a "strict" view, and require published revocation within 24 hours, 116 of 147 certificates nominally revoked by DigiCert within 24 hours (revocation_timestamp - sent_at < 24 hours) fail to meet the "no 'Good' OCSP response retrieved more than 24 hours after reporting" bar.

Something else is happening that caused extra long delays on those certs (21 hours is outside of the expected range).

If it helps, here are the more detailed observations of OCSP responses for one of the certificates in my initial bug report (the others demonstrate a similar pattern, but I can dump them too if it'd help):

      first_observed_time   |  produced_at_time   | revocation_timestamp 
----------------------------+---------------------+----------------------
 2020-05-20 10:37:09.779802 | 2020-05-19 03:16:07 | 
 2020-05-20 16:33:53.580367 | 2020-05-20 15:46:41 | 
 2020-05-21 04:05:52.709583 | 2020-05-21 03:16:12 | 
 2020-05-22 03:51:59.104504 | 2020-05-22 03:16:20 | 2020-05-21 03:35:47
 2020-05-23 03:57:31.503772 | 2020-05-23 03:16:19 | 2020-05-21 03:35:47
 2020-05-24 03:45:22.309719 | 2020-05-24 03:16:21 | 2020-05-21 03:35:47
 2020-05-25 03:52:39.187407 | 2020-05-25 03:14:59 | 2020-05-21 03:35:47
 2020-05-26 04:14:27.484904 | 2020-05-26 03:16:26 | 2020-05-21 03:35:47
 2020-05-27 05:12:01.590851 | 2020-05-27 03:16:29 | 2020-05-21 03:35:47

(that's 08ab9b1d4f91b1309d733db349ae7e3535e6a370bb0d885d91946cef6d1f754f, BTW; first_observed_time is the time at which we issued the HTTP request whose response was the first in which we saw a particular OCSP response produced_at timestamp)

Apart from the first row (which is when OCSP responses started getting requested -- that is, the time at which the initial report was made), it shows, consistent with the CDN caching you describe, a delay of ~1-2 hours from produced_at to first observation. What I'm not seeing here is any sort of indication of "scanning every two minutes" -- that is, if this had happened:

Every two minutes the database is queried to see if a new certificate has been revoked. If a new revocation is found, the response is generated and pushed to our origin servers.

I would have expected to see an OCSP response with a produced_at a minute or two after revocation_timestamp -- even if that OCSP response was first visible to the Revokinator a few hours later.
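
For reference, each row above boils down to one OCSP POST and a parse of the returned fields; here's a minimal sketch (not the Revokinator's actual code) using the Python cryptography and requests libraries:

import requests
from cryptography.x509 import ocsp, load_pem_x509_certificate
from cryptography.hazmat.primitives import hashes, serialization

def check_once(cert_pem: bytes, issuer_pem: bytes, responder_url: str):
    """Fetch one OCSP response and return the fields the tables above are built from."""
    cert = load_pem_x509_certificate(cert_pem)
    issuer = load_pem_x509_certificate(issuer_pem)
    req = ocsp.OCSPRequestBuilder().add_certificate(cert, issuer, hashes.SHA1()).build()
    http = requests.post(
        responder_url,
        data=req.public_bytes(serialization.Encoding.DER),
        headers={"Content-Type": "application/ocsp-request"},
        timeout=10,
    )
    resp = ocsp.load_der_ocsp_response(http.content)
    # For a freshly revoked cert picked up by a two-minute sweep, produced_at
    # would be expected to trail the revocation time by only a minute or two.
    return resp.produced_at, resp.certificate_status, resp.revocation_time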

There is another "unusually long publication delay" that's just come in, too, with a (presumably) cert/pre-cert pair:

                       certificate_sha256                         |    email_address    |          sent_at           |      publication_time      | revocation_timestamp |   publication_delay   | time_to_effective_revocation 
------------------------------------------------------------------+---------------------+----------------------------+----------------------------+----------------------+-----------------------+------------------------------
 1d7c0099607c14ebf0fc2d62a1f8930e801578e9e9e3292e1a9e147e59e9288e | revoke@digicert.com | 2020-05-26 14:31:58.083888 | 2020-05-27 17:20:21.520605 | 2020-05-26 17:11:01  | 1 day 00:09:20.520605 | 1 day 02:48:23.436717
 b9d72c75738298c40ea2259ae6b42e4dca0ce165550b8e7bd11e513cb210a01c | revoke@digicert.com | 2020-05-26 14:31:58.083888 | 2020-05-27 17:12:43.220959 | 2020-05-26 17:11:01  | 1 day 00:01:42.220959 | 1 day 02:40:45.137071

The revocations have a short "report-to-revocation-timestamp" interval, but they show the same "not picking up the revocation status out of the database and generating a new OCSP response" behaviour; for example:

      publication_time      |  produced_at_time   | revocation_timestamp 
----------------------------+---------------------+----------------------
 2020-05-26 13:31:56.872445 | 2020-05-25 15:34:23 | 
 2020-05-26 16:09:43.239413 | 2020-05-26 15:34:26 | 
 2020-05-27 16:23:33.592958 | 2020-05-27 15:34:29 | 2020-05-26 17:11:01

(that's 1d7c0099607c14ebf0fc2d62a1f8930e801578e9e9e3292e1a9e147e59e9288e)

Even for certificates whose time-to-published-revocation was less bad, I'm not seeing this "picked up within two minutes" behaviour reflected in the produced_at_time patterns:

      publication_time      |  produced_at_time   | revocation_timestamp 
----------------------------+---------------------+----------------------
 2020-05-20 09:39:41.698922 | 2020-05-19 05:34:06 | 
 2020-05-20 20:59:34.403614 | 2020-05-20 20:28:02 | 
 2020-05-21 06:14:06.59394  | 2020-05-21 05:34:13 | 2020-05-21 03:35:47
 2020-05-22 06:05:28.938222 | 2020-05-22 05:34:15 | 2020-05-21 03:35:47
 2020-05-23 06:04:06.423028 | 2020-05-23 05:34:18 | 2020-05-21 03:35:47
 2020-05-24 06:23:31.454361 | 2020-05-24 05:34:21 | 2020-05-21 03:35:47
 2020-05-25 06:24:18.417745 | 2020-05-25 05:34:24 | 2020-05-21 03:35:47
 2020-05-26 06:34:27.765186 | 2020-05-26 05:34:27 | 2020-05-21 03:35:47
 2020-05-27 07:56:08.798174 | 2020-05-27 05:34:30 | 2020-05-21 03:35:47

(that's a627c22f4ee8b890d72337dd6c2098b931daa9c64ca415345eda30128dc561da, whose time to effective revocation is 24 hours and 43 minutes, give-or-take)

For revocations that are within the bounds of the 24 hour period, there's still no sign of the two-minute bump:

      publication_time      |  produced_at_time   | revocation_timestamp 
----------------------------+---------------------+----------------------
 2020-05-25 13:32:13.614934 | 2020-05-25 10:10:23 | 
 2020-05-26 10:40:54.875547 | 2020-05-26 10:10:27 | 2020-05-26 07:42:49
 2020-05-27 10:43:28.172852 | 2020-05-27 10:10:30 | 2020-05-26 07:42:49

(1b9cb889b4956c430750606c8c1d25f529a14a2c05273efadf5e5dc4d59a5439)

To me, it seems more like whether or not a certificate revocation gets published in a timely manner is the luck of the draw -- if the daily OCSP response generation for a given certificate runs just before the revocation time, it'll be a long time before its revocation status gets published.

So overall, I'm having trouble reconciling your description of how DigiCert's revocation system works with what I'm seeing as an outside observer. I'd appreciate it if you've got any pointers as to how I'm misinterpreting my observations (or even misobserving in the first place), so I can make the Revokinator a more accurate reflection of the real world.

If I'm understanding this sentence correctly, you're asserting that generating an OCSP response with a "revoked" status, even if no relying party can actually access that OCSP response for up to four hours, counts as "publication" for the purposes of BRs 4.9.5. Is that an accurate reading, or am I misunderstanding something?

Once we publish a "revoked" revocation response it counts as published. Anyone with access to the fresh OCSP response can then no longer claim it is good.

If there can be an unavoidable delay of up to four hours between the point at which the OCSP response is "DigiCert published" and when it may actually end up being available to relying parties, I would expect that potential delay to be accounted for in the rest of DigiCert's procedures -- for example, ensuring that the revocation operation is completed within 20 hours of receiving the problem report. I know that reduces the time that DigiCert has to process the report and work with the subscriber, but DigiCert chose their CDN provider, and everything that goes along with it.

Nah. That's not what the requirement says. It says:
"The CA SHALL revoke a Certificate within 24 hours if one or more of the following occurs:"

4.9.3 says:
"The CA SHALL provide a process for Subscribers to request revocation of their own Certificates. The process
MUST be described in the CA's Certificate Policy or Certification Practice Statement. The CA SHALL maintain a
continuous 24x7 ability to accept and respond to revocation requests and Certificate Problem Reports."

Both of these are met. This interpretation is based on the fact that once we publish a response, it's outside of our control how the responses are consumed. You could be using a browser that doesn't check OCSP. You could be seeing network latency issues. You could have a government blocking OCSP.

As an aside, since creating this bug, I've looked through the Revokinator OCSP history data, and I can't find any instances where another CA has failed to make an OCSP response available within 24 hours of receiving the problem report (except in cases where they failed to revoke within 24 hours, which is a whole other set of bugs). Yes, I was pretty surprised, too. It shows, however, that most CAs seem to have their OCSP publication pipelines under some sort of control.

That's also why their OCSP is so slow. We could remove the CDN from the process, but that would make it very slow to distribute responses. I'm not interested in doing that unless it becomes a requirement.

However, the time to distribute should be less than 4 hours even at the extreme.

Why? And says who? Citation required.

To me, it seems more like whether or not a certificate revocation gets published in a timely manner is the luck of the draw -- if the daily OCSP response generation for a given certificate runs just before the revocation time, it'll be a long time before its revocation status gets published.

Yeah - something weird is going on with the OCSP server. Still investigating this.

So overall, I'm having trouble reconciling your description of how DigiCert's revocation system works with what I'm seeing as an outside observer. I'd appreciate it if you've got any pointers as to how I'm misinterpreting my observations (or even misobserving in the first place), so I can make the Revokinator a more accurate reflection of the real world.

Yeah - something isn't matching up. That was the design of the system described above. I think what's happening is things aren't getting picked up by the origin server, but I need more research.

I’m still not sure I’ve got sufficient detail. Many CDNs, including both Akamai and CloudFlare (both of which I believe DigiCert may use or have used) support revoking by URL, which would be fine for RFC 5019. Similarly, POST responses can use CDN-specific tags that you can later selectively invalidate.

We use EdgeCast. I was informed this morning that we actually changed the cache time of the POST to two hours not long ago. We don't have a way to selectively clear individual OCSP responses through EdgeCast.

Hey Matt - a dump would be great. That will help us track the issue and see what's going on.

(In reply to Jeremy Rowley from comment #9)

We use EdgeCast. I was informed this morning that we actually changed the cache time of the POST to two hours not long ago. We don't have a way to selectively clear individual OCSP responses through EdgeCast.

Thanks. It may be time to look for a different CDN that helps ensure you meet the BRs.

I call this out, because I don't think your response is correct, namely:

This interpretation is based on the fact that once we publish a response, it's outside of our control how the responses are consumed

You've published a repository link in the certificate. If I dereference that URI, I do not get the information you're claiming is there. And this matters, because the URL in the certificate is the Repository (cf. the Definitions section of the BRs for "Repository" and "OCSP Responder"), not whatever internal database you have.

What Matt is highlighting here is that you've not actually published your responses. You've signed them, sure, but you've not published them, since the URI itself doesn't provide them.

It's also worth calling out that, even when supporting RFC 5019 (the GET method), you can still quickly invalidate the URI, because the OCSP response for each certificate is itself at a distinct path, allowing you to invalidate individual certificates. Alternatively, it looks like Imperva, Fastly, Cloudflare (and the aforementioned Akamai) are all viable, in that they allow you to purge by distinct tags/headers.
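
To make the "distinct path" point concrete: the RFC 5019 GET URL is just the base64- and URL-encoded DER OCSP request appended to the responder URL, so every certificate's cached response lives at its own purgeable URL. A sketch using the Python cryptography library (an illustration, not anyone's production code):

import base64
import urllib.parse
from cryptography.x509 import ocsp
from cryptography.hazmat.primitives import hashes, serialization

def rfc5019_get_url(cert, issuer, responder_url: str) -> str:
    """Build the RFC 5019 GET URL for one certificate's OCSP response."""
    req = ocsp.OCSPRequestBuilder().add_certificate(cert, issuer, hashes.SHA1()).build()
    der = req.public_bytes(serialization.Encoding.DER)
    encoded = urllib.parse.quote(base64.b64encode(der).decode("ascii"), safe="")
    # Each certificate gets a distinct path under the responder, so a CDN that
    # can purge by URL can invalidate exactly this one cached response.
    return responder_url.rstrip("/") + "/" + encoded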

Attached file ocsp_checks.csv.gz

(In reply to Jeremy Rowley from comment #8)

If I'm understanding this sentence correctly, you're asserting that generating an OCSP response with a "revoked" status, even if no relying party can actually access that OCSP response for up to four hours, counts as "publication" for the purposes of BRs 4.9.5. Is that an accurate reading, or am I misunderstanding something?

Once we publish a "revoked" revocation response it counts as published. Anyone with access to the fresh OCSP response can then no longer claim it is good.

Sure, but your CDN isn't giving anyone access to the fresh OCSP response, so... ?

If there can be an unavoidable delay of up to four hours between the point at which the OCSP response is "DigiCert published" and when it may actually end up being available to relying parties, I would expect that potential delay to be accounted for in the rest of DigiCert's procedures -- for example, ensuring that the revocation operation is completed within 20 hours of receiving the problem report. I know that reduces the time that DigiCert has to process the report and work with the subscriber, but DigiCert chose their CDN provider, and everything that goes along with it.

Nah. That's not what the requirement says. It says:
"The CA SHALL revoke a Certificate within 24 hours if one or more of the following occurs:"

And 4.9.5 says:

"The period from receipt of the Certificate Problem Report or revocation-related notice to published revocation MUST NOT exceed the time frame set forth in Section 4.9.1.1"

The point on which I think this whole thing rests is what "published" means, exactly. I feel like DigiCert is relying on "published" meaning "signed on our origin server", while I (and, I think, Ryan) am of the opinion that "published" means "to make known to people in general" -- and if your CDN isn't handing out the Revoked response to "people in general", it isn't "published".

Both of these are met. This interpretation is based on the fact that once we publish a response, it's outside of our control how the responses are consumed. You could be using a browser that doesn't check OCSP. You could be seeing network latency issues. You could have a government blocking OCSP.

None of those hypotheticals are the case here. If you believe that the system which gathered the OCSP response data upon which this report is predicated is providing incorrect or misleading results, then I'd be interested in hearing about it.

As an aside, since creating this bug, I've looked through the Revokinator OCSP history data, and I can't find any instances where another CA has failed to make an OCSP response available within 24 hours of receiving the problem report (except in cases where they failed to revoke within 24 hours, which is a whole other set of bugs). Yes, I was pretty surprised, too. It shows, however, that most CAs seem to have their OCSP publication pipelines under some sort of control.

That's also why their OCSP is so slow. We could remove the CDN from the process, but that would make it very slow to distribute responses. I'm not interested in doing that unless it becomes a requirement.

I think there are more options than "use a CDN that doesn't allow selective forced expiration" and "don't use a CDN". Also, bear in mind that I keep response time data for all OCSP responses... and here is the scorecard (the mean response time of all OCSP requests for certificates reported to the associated e-mail address):

           email_address           |       avg       
-----------------------------------+-----------------
 cert-prob-reports@letsencrypt.org | 00:00:00.018406
 ecs.support@entrustdatacard.com   | 00:00:00.02948
 practices@starfieldtech.com       | 00:00:00.029119
 report-abuse@globalsign.com       | 00:00:00.019552
 revoke@certum.pl                  | 00:00:00.047368
 revoke@digicert.com               | 00:00:00.049018
 sslabuse@sectigo.com              | 00:00:00.026253

I'm well aware of the limitations of the mean as a statistic, and if you'd like I'd be willing to run a more in-depth analysis. However, I wouldn't recommend assuming that the data is on your side here.

However, the time to distribute should be less than 4 hours even at the extreme.

Why? And says who? Citation required.

Uhm... you did, in comment 3, first sentence of the fourth paragraph.

Yeah - something isn't matching up. That was the design of the system described above. I think what's happening is things aren't getting picked up by the origin server, but I need more research.

That's certainly the first thing I'd suspect, based on your system description and the results the Revokinator is seeing.

Hey Matt - a dump would be great. That will help us track the issue and see what's going on.

Attached. It's every OCSP request made for 08ab9b1d4f91b1309d733db349ae7e3535e6a370bb0d885d91946cef6d1f754f from when it was first flagged for pending revocation to nowish.

Thanks Matt. We figured it out - thanks a ton for the data. This was very helpful.

We discovered a bug in our revoke by serial number code where revocations were not pushed to the origin server as soon as they were revoked. Instead, they were getting picked up the next day during the batch process when new responses were signed. We've fixed the bug so you should be seeing immediate distribution by the CDN. I'll write up a formal incident response tomorrow.

Flags: needinfo?(jeremy.rowley)
  1. How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.

2020/5/25 – Matt Palmer opened this case on Bugzilla

  2. A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.

2020/5/25 - Began investigating the issue to see what was going on.
2020/5/28 - Thought we found an issue with the delay and deployed a patch. Turns out it wasn't the issue, as Matt reported it was still happening.
2020/5/29 - Found a bug in the code where, if a certificate was revoked by serial number instead of by order number, the system would generate an OCSP response that was not pushed to the origin server. This caused the certificate to wait until the next general OCSP update before the revoked response was picked up and pushed to origin.
2020/5/29 - Pushed a change to fix the bug causing the delay

  3. Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem. A statement that you have will be considered a pledge to the community; a statement that you have not requires an explanation.

These certificates were not mis-issued, so no changes to issuance have been made. However, we have stopped serving delayed revocation responses.

  4. A summary of the problematic certificates. For each problem: number of certs, and the date the first and last certs with that problem were issued.

The OCSP responses for certs revoked by serial number instead of by order number were impacted.

  5. The complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem.

How should I answer this one? The certs are already revoked. Do you want me to list the certs revoked for key compromise that were revoked by serial?

  6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

The bug was introduced in code some time ago where certificates revoked by serial number did not show up in the incremental batch as expected. The result is that they missed the two-minute queue for a push to origin and didn't get picked up until the next full batch. We missed the error as our monitoring showed revocation was happening as normal, an OCSP response was being generated by the CA, and we were seeing the revocation show up at the CDN.

  7. List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.

We fixed the bug so revocation responses will be picked up by the origin server as soon as they are generated. I'm still working with the team to figure out what monitoring to put into place to ensure this never happens again. I'm also working with them to figure out "lessons learned" from this that we can share to help improve the ecosystem. I'll provide an update when we figure out what to do there.

(In reply to Jeremy Rowley from comment #14)

  5. The complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem.

How should I answer this one? The certs are already revoked. Do you want me to list the certs revoked for key compromise that were revoked by serial?

I mean, this is a chance for the CA team to show how deeply they investigate incidents and what sort of information they think would be useful to relying parties evaluating the CA and the CA's response. Which I realize sounds like "you figure it out", but I do think it's important for CAs to be able to step up and lead incident responses and show comprehensive findings, or even just commit to what they're going to find out based on the info they have. That said, asking what should be done is still better than 90% of incident reports, and so I do appreciate that :)

For example, based on this description, it "seems" like you could determine how many relying parties/requests were affected by this issue, by doing an analysis that examined:

  1. When the bug was introduced
  2. How many certificates were revoked in a way that could cause them to be affected by this issue
  3. How many requests were made in the time window that this issue would have manifest (e.g. if Cert X was revoked at 6 pm, but wasn't pushed until 4am the next day, how many requests came in during that 10 hour window? And from how many clients)
  4. Gather that in aggregate (a sketch follows below)

I feel like responses like https://bugzilla.mozilla.org/show_bug.cgi?id=1619047#c1 are an example of that sort of response.
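
In sketch form, that aggregation could look like the following, assuming you can join revocation records (with both the revocation time and the time the response actually reached the origin) against OCSP access logs; all names and field layouts are illustrative:

from collections import defaultdict

def affected_request_stats(revocations, access_log):
    """revocations: iterable of (serial, revoked_at, pushed_to_origin_at);
    access_log: iterable of (serial, client_ip, requested_at).
    Count requests (and distinct clients) that were served stale "Good" answers."""
    windows = {s: (r, p) for s, r, p in revocations if p > r}
    requests_in_window = defaultdict(int)
    clients_in_window = defaultdict(set)
    for serial, client_ip, requested_at in access_log:
        if serial in windows:
            revoked_at, pushed_at = windows[serial]
            if revoked_at <= requested_at < pushed_at:
                requests_in_window[serial] += 1
                clients_in_window[serial].add(client_ip)
    total_requests = sum(requests_in_window.values())
    total_clients = len(set().union(*clients_in_window.values())) if clients_in_window else 0
    return total_requests, total_clients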

  6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

The bug was introduced in code some time ago where certificates revoked by serial number did not show up in the incremental batch as expected. The result is that they missed the two-minute queue for a push to origin and didn't get picked up until the next full batch. We missed the error as our monitoring showed revocation was happening as normal, an OCSP response was being generated by the CA, and we were seeing the revocation show up at the CDN.

Is there any chance of a deeper explanation? Say, similar to https://bugzilla.mozilla.org/show_bug.cgi?id=1619047#c1

You describe the effect of the bug, but not the cause. You describe that the problem was missed, but not why/how. These are useful, but I think the more detail you share here, the more confidence there is in your next answer, and in answering "why did it need the Revokinator to spot this?"

  7. List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.

We fixed the bug so revocation responses will be picked up by the origin server as soon as they are generated. I'm still working with the team to figure out what monitoring to put into place to ensure this never happens again. I'm also working with them to figure out "lessons learned" from this that we can share to help improve the ecosystem. I'll provide an update when we figure out what to do there.

Thanks. As you do that, it might be useful to revisit some of the details above and flesh them out. The incremental updates are appreciated, and I want to make sure the end result is something clear and detailed.

We have data related to revocation information where we had a comment on the revocation. We started adding revocation comments on June 1, 2017. Of the revocations over the last 90 days, 84% have comments. The total revocations by serial number are 2% of the total revocations. The total revocations over the last 90 days are 1,527,214.

One thing we've discovered is a significant number of problems with revocation related to on-demand signing compared to pre-signed responses. We used to support both. For TLS, we are moving entirely to pre-signed responses. Although this increased the number of responses required by 11.6%, the ability to have a uniform signing process with one path was worth the increase. We also added additional HSMs in support of the signing service to ensure the system doesn't fall behind on the required signings.

A couple of things we are doing to improve the system. We previously had monitors to ensure a certificate was marked as revoked in our database within 24 hours. We need to add monitoring as the revocation flows throughout the system to ensure it is published at the end point. We've added monitoring already to ensure the CA publishes the revocation. We have it specced out, and it will take about two weeks from when it is prioritized. Currently, we have it slated behind the key blacklist checker and a couple of other issues we are seeing.

Increased monitoring is a running theme throughout our incident reports. I'm going to kick off a project to see where else in the system we don't have monitoring to ensure there isn't an interruption. It's a bit embarrassing when someone external notices a break instead of our systems alerting us.

I think there is still an open request from Ryan to revise sections 5, 6 and 7 of the Incident Report. For section 5, instead of discussing the certificates, address the revocations and the OCSP requests/responses. Ryan suggests that in section 5 you could (1) address when the bug was introduced, and (2) provide some additional statistics with aggregates that help illustrate the size of the problem (how many were revoked by serial number, how many OCSP requests might have been affected by delayed/outdated information, etc. (see comments and references above)). Then for 6 and 7, more detail and explanation is needed.

(In reply to Jeremy Rowley from comment #17)

A couple of things we are doing to improve the system. We previously had monitors to ensure a certificate was marked as revoked in our database within 24 hours. We need to add monitoring as the revocation flows throughout the system to ensure it is published at the end point.

This is definitely a positive move. By "end point", do you mean the origin server from which the CDN retrieves the responses, or the CDN edge nodes?

Increased monitoring is a running theme throughout our incident reports. I'm going to kick off a project to see where else in the system we don't have monitoring to ensure there isn't an interruption.

In my experience, you can never have too much monitoring.

Hey Ben - see comment 16 where I provided this information (as much as we have).

Matt - I'm hoping we can do both. I'd like to monitor when it becomes available to end users and when it is pushed to the origin server. The first task is the origin server. We're still debating internally the best way to measure availability from a customer perspective.

In my experience, you can never have too much monitoring.
+1 to that. Unit tests and monitoring solve so many headaches.

Flags: needinfo?(jeremy.rowley)

(Adding myself back to the bug)

The blocker to closing this bug is an updated incident report based on the new information and deployment of the monitoring on OCSP responses. We haven't prioritized the dev work yet for the monitoring, but I'll work with the team to see when it can be ready. In the meantime, I'll update the incident report so everything is captured in one post.

Flags: needinfo?(jeremy.rowley)
  1. How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.

2020/5/25 – Matt Palmer opened this case on Bugzilla

  2. A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.

2020/5/25 - Began investigating the issue to see what was going on.
2020/5/28 - Thought we found an issue with the delay and deployed a patch. Turns out it wasn't the issue, as Matt reported it was still happening.
2020/5/29 - Found a bug in the code where, if a certificate was revoked by serial number instead of by order number, the system would generate an OCSP response that was not pushed to the origin server. This caused the certificate to wait until the next general OCSP update before the revoked response was picked up and pushed to origin.
2020/5/29 - Pushed a change to fix the bug causing the delay

  3. Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem. A statement that you have will be considered a pledge to the community; a statement that you have not requires an explanation.
    These certificates were not mis-issued, so no changes to issuance have been made. However, we have stopped serving delayed revocation responses.

  4. A summary of the problematic certificates. For each problem: number of certs, and the date the first and last certs with that problem were issued.
    The OCSP responses for certs revoked by serial number instead of by order number were impacted.

  5. The complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem.
    We started adding revocation comments on June 1, 2017. Of the revocations over the last 90 days, 84% of the revocations have comments. The total revocations by serial number are 2% of the total revocations. The total revocations over the last 90 days are 1,527,214.

  6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.
    The bug was introduced in code some time ago where certificates revoked by serial number did not show up in the incremental batch as expected. The result is that they missed the two-minute queue for a push to origin and didn't get picked up until the next full batch. We missed the error as our monitoring showed revocation was happening as normal, an OCSP response was being generated by the CA, and we were seeing the revocation show up at the CDN.

One thing we've discovered is a significant number of problems with revocations related to on-demand signing compared to pre-signed responses. We used to support both. For TLS, we are moving entirely to pre-signed responses. Although this increased the number of responses required by 11.6%, the ability to have a uniform signing process with one path was worth the increase. We also added additional HSMs in support of the signing service to ensure the system doesn't fall behind on the required signings.

  7. List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.
    We fixed the bug so revocation responses will be picked up by the origin server as soon as they are generated. We added additional monitoring along the way to ensure that we detect when the cert appears revoked. This will alert us if there is an issue or if the cert fails to be pushed to the CDN. It'll certainly give us more prompt alerts when something is going wrong.

Any additional questions?

Flags: needinfo?(jeremy.rowley)

(In reply to Jeremy Rowley from comment #22)

One thing we've discovered is a significant number of problems with revocations related to on-demand signing compared to pre-signed responses. We used to support both. For TLS, we are moving entirely to pre-signed responses. Although this increased the number of responses required by 11.6%, the ability to have a uniform signing process with one path was worth the increase. We also added additional HSMs in support of the signing service to ensure the system doesn't fall behind on the required signings.

I mean, I know it's a terrible time to suggest this, notwithstanding the broader ecosystem challenges... but have you considered delegated responder certificates, as an alternative to HSM-based scaling? :)

That is, delegated responders can be used complementary to pre-generated responses, and it's not all either/or with a CA. For example, you could examine your OCSP access logs, and figure out which certificates are getting frequently requested, and ensure they're signed directly by their issuer (to keep their size small, since they are, after all, popular). For your long-tail of certificates, which may not be accessed frequently, you could spin up a short-lived delegated responder (e.g. valid for only 12-14 days), use a software-backed key to bulk-sign your long tail (thus generating larger responses, but which may never be accessed).

This has the effect of still ensuring you always have pre-generated responses, but allows you to separately scale out your long-tail, while still optimizing for the lower size for your popular certificates.
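
As a sketch of the partitioning step (the threshold and names are purely illustrative):

from collections import Counter

def partition_by_popularity(ocsp_requests, threshold=1000):
    """ocsp_requests: iterable of certificate serial numbers seen in OCSP access logs.
    Split serials into a "popular" set (pre-sign directly with the issuing CA, keeping
    responses small) and a "long tail" (candidates for bulk signing by a short-lived
    delegated responder). The threshold is an illustrative cut-off, not a recommendation."""
    counts = Counter(ocsp_requests)
    popular = {serial for serial, n in counts.items() if n >= threshold}
    long_tail = set(counts) - popular
    return popular, long_tail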

The plan you described also works, but I'm assuming there are bounds to HSM scalability and whether or not it effectively scales with issuance.

My only caveat to closing this out is a timeline for the monitoring in Comment #21. If Comment #22 is meant to be that, I'm hoping you can more precisely describe your added monitoring. Consider this response by Apple, in Bug 1588001, for a good model.

Flags: needinfo?(jeremy.rowley)

We want to switch completely away from delegated OCSP responders because we like putting all of our eggs in one basket. :)

Really, it's about simplification. Although a delegated responder system would be more efficient, we wanted to use one process for revocation to ensure that it is hardened. Better to focus on one system for revocation rather than two. The narrower focus will keep the system efficient and compliant, even if we need more hardware. There are bounds in scaling.

As for the monitoring, we've implemented the following:

  1. We added an endpoint for retrieving recently revoked certificates. This endpoint uses an API key with the proper permission to check if a cert is revoked and when it was revoked. We can call this on the internal side to see if any certs fall outside of the expected revocation deadline and take action (see the sketch after this list). This monitor was designed specifically to provide alerts related to CA revocation operations.
  2. We added a log for each serial number added to an incremental batch. This log allows us to track serial numbers from revocation through the OCSP system. This way we can see where a specific serial number gets hung up in the revocation process.
  3. We set up an alert so that if the incremental OCSP sync fails, we can force a job that runs twice a day to sync all incremental OCSP responses.
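
As a rough sketch of the kind of check the first monitor enables -- the endpoint, field names and deadline below are illustrative, not our production code:

from datetime import datetime, timedelta, timezone
import requests

RECENT_REVOCATIONS_URL = "https://internal.example.invalid/api/recent-revocations"  # illustrative
API_KEY = "..."  # illustrative credential

def overdue_revocations(max_age=timedelta(hours=24)):
    """Return recently revoked certs that still aren't visible as revoked downstream."""
    now = datetime.now(timezone.utc)
    rows = requests.get(
        RECENT_REVOCATIONS_URL, headers={"X-Api-Key": API_KEY}, timeout=10
    ).json()
    return [
        r for r in rows
        if not r["published_to_cdn"]
        and now - datetime.fromisoformat(r["revoked_at"]) > max_age
    ]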

Any additional comments or can we close this bug?

Flags: needinfo?(jeremy.rowley)

I'll close this bug on or after 28-July-2020 unless we receive additional questions or issues.

Flags: needinfo?(bwilson)

Comment #23 provided a suggestion for a model of a good CA response, ironically from one of DigiCert's customers. Comment #24 suggests only three checks are made, which leaves plenty of gaps for issues to come up, as Google's own repeat failures in this area have shown.

I understand this isn't a straight apples-to-apples comparison, as this was related to a delay in publishing revocation, but I want to make sure that, in the effort to publish quickly, the need to publish correctly isn't overlooked. It sounds like the system is only monitoring "new" revocations, but doesn't ensure "old" revocations (i.e. those previously revoked) continue to serve correct responses as updated responses are published.

If there are holistic controls in place monitoring for this, Comment #24 was the opportunity to share the "bigger picture" here, as Comment #23 suggested.

I realize that's not exactly a "question", but I just want to highlight that while it responds on the surface, I'm not sure it gets to the meat of providing a good understanding of the holistic set of protections in place, especially as the system is proposed (in Comment #22) to undergo significant change.

Flags: needinfo?(jeremy.rowley)
Attached image OCSP diagram.png

I don't mind providing more information about what the OCSP system looks like if it is of interest. I've attached a diagram that describes what is going on with DigiCert OCSP. It's pretty standard, I think. We have daily batches and incremental batches that are uploaded to the origin server. These responses are then distributed via CDN everywhere.

The components in our OCSP system are:

  1. OCSP generator. The OCSP generator is triggered via cron to request the CA to create a new batch file and runs every 6 minutes. The batches are made by grouping all responses for certificates beginning with the same three hexadecimal characters (e.g. 010, 011, 0B2, etc.).

  2. Batch Delivery. The batch delivery process is a shell script run by cron every minute. The script checks whether another instance is already running and exits if there is one. The script creates a list of batches and uses rsync to send each batch out to the OCSP servers in AWS.

  3. Batch Unpacking. Batches are unpacked by a C++ application which runs every 10 minutes on each of the OCSP servers. It creates a list of batches and begins unpacking them.

  4. OCSP Responder. The OCSP responder answers OCSP requests on each of the OCSP servers. The responder is Nginx running a custom module which parses the OCSP request and then checks for the response. If the response exists, it is returned; if not, a 5-byte "unauthorized" file is returned.

  5. Monitoring. There are several monitors for OCSP and each of these processes.
    a. A monitor on batch generation and delivery. This monitor is handled by a web application which checks the age of the batch file. If a batch is older than the age specified in the application config, the HTTP endpoint returns an error with a list of the batches that are too old (see the sketch after this list).

    b. Batch unpack. This monitors batch unpacking in the same way, using a web application to check the age of the batch file. If a batch is older than the age specified in the application config, the HTTP endpoint returns an error with a list of the batches that are too old.

    c. Uptime monitoring. We deployed multiple monitors around the world to ping ocsp.digicert.com. Any downtime sends an automated notice of the incident to the 24x7 NOC team.

    d. Origin monitoring. There is a constant monitor that checks proxy and origin servers to ensure uptime and immediate response to incidents on those servers. Any failure of the origin or proxy servers to respond sends a notice to the NOC team.

    e. Performance monitoring. CDN logs are ingested into Splunk and reviewed for issues and slowdowns. Performance metrics for the proxy and origin servers are also collected by New Relic.

    f. Incremental response generation. This monitor checks the OCSP system for incremental response generation for revoked certificates. It monitors for creation of the OCSP response and delivery of the response to the origin server.

    g. Daily response generation. This monitor checks to see if the daily batch was created and uploaded to the origin server. Any batch that fails to be created triggers an alarm. This also monitors unpacking to see if the OCSP responses unpacked properly.

  6. Remediation. If a batch in the sent folder is old, the batch either failed to generate or it has been generated but not copied to all the OCSP servers. If the batch failed to generate, there is a cron task that runs every 15 minutes checking the unpack monitor endpoint. If there is a failure, it creates an array of all the batches in the response and checks if they already exist. If they do not exist, it will request a new batch from the CA. In this scenario, the alert should close within about 20 to 25 minutes.
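
As a sketch of the age check in monitor (a) -- the paths, threshold and tiny web server are illustrative, not our actual implementation:

import os
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

BATCH_DIR = "/var/ocsp/batches"   # illustrative path
MAX_AGE_SECONDS = 30 * 60         # illustrative threshold from the "application config"

class BatchAgeMonitor(BaseHTTPRequestHandler):
    def do_GET(self):
        now = time.time()
        stale = [
            name for name in os.listdir(BATCH_DIR)
            if now - os.path.getmtime(os.path.join(BATCH_DIR, name)) > MAX_AGE_SECONDS
        ]
        # Return an error status with the list of too-old batches, otherwise 200.
        body = ("\n".join(stale) or "ok").encode()
        self.send_response(500 if stale else 200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), BatchAgeMonitor).serve_forever()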

Hope this helps. Let me know if you want any additional specifics or if you think we are missing something. The one place I'm going to add additional monitoring (on my roadmap) is the time it takes from revocation to CDN. We monitor from revocation to push to origin and from origin to CDN. However, there isn't a time-monitor to see how long it takes from revocation to the response being publicly accessible via the CDN.
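
In sketch form, that missing end-to-end timer is roughly this (the polling interval, timeout, and callable are illustrative):

import time
from datetime import datetime, timezone

def time_to_public_revocation(fetch_status, revoked_at, poll_seconds=60, timeout_hours=30):
    """fetch_status: callable that queries the public (CDN-fronted) responder and
    returns the certificate status as a string, e.g. "good" or "revoked".
    Returns the delay from revocation until the revoked status is publicly visible."""
    deadline = revoked_at.timestamp() + timeout_hours * 3600
    while time.time() < deadline:
        if fetch_status() == "revoked":
            return datetime.now(timezone.utc) - revoked_at
        time.sleep(poll_seconds)
    raise TimeoutError("revoked status never became publicly visible within the window")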

I'll also review the Apple bug again and borrow from their monitor list to add to our own.

Flags: needinfo?(jeremy.rowley)

Thanks Jeremy. This is useful detail that helps further unpack and examine the system. For example, I think similar to the Google incidents, there's an opportunity for those steps (2) and (3) to have issues. For example, in Google's case, certain conditions in their batch packing caused issues in the overall batch delivery (an empty tarball, AIUI). I understand you check the batch age, but do you verify everything in that batch was correctly transmitted and successfully deployed? You mention origin/performance monitoring, but nothing seems to test the response itself?

If you look at Comment #14, for example, and its explanation of the issue, I'm having trouble seeing how Comment #28 would have detected this. However, at the same time, with the detail you provided in Comment #28, it also becomes clear that a number of things "can't" happen.

I think if details like Comment #28 had been the norm back during Comment #14, it would have been easier to understand the mitigations being proposed and how they address the problems.

Okay - we'll try to provide more details when we post incident reports. I find balancing faster disclosure with sufficient details difficult. We'll do better on that one and make sure our responses are fully researched. Speaking of researched...

We have internal monitors on the unpacking at the individual origin servers - it's part of the batch delivery monitoring. They are part of the same tool that monitors the delivery, meaning the tool provides alerts on failures in unpacking and delivery for the incremental and daily response generation for both new and revoked certificates. We log that information to Splunk and retry on failure. We currently set the retry limit to three before it becomes a sev1. We use Opsgenie to alert the team in case of a failure.

Some clarity on this:
"Monitoring:. There are several monitors for ocsp and each of these processes.
a. A monitor on batch generation and delivery. This monitor is handled by a web application which checks the age of the batch file. If a batch is older than the age specified in the application config, the http endpoint returns an error which a list of the batches that are too old."

This monitor checks distribution to the distribution server as well as the origin server. The origin server is constantly in sync with the distribution server for all new batches. Logs of the sync for batches are part of that monitor for distribution and are sent to Splunk. All logs work the same way - three retries and then a sev1 alert is sent out.

This means it does cover comment 14. We have monitoring throughout the process. What we don't have yet is something that flags the time required from revocation to distribution through the CDN. This is why I'm specifying a revocation time of 22 hours in our automated revocation system. It gives the system 2 hours to ensure everything is flowing through the system.

Hopefully that answers more questions.

Thanks.

Ben, i don't have further questions.

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Flags: needinfo?(bwilson)
Resolution: --- → FIXED
Product: NSS → CA Program
Summary: Digicert: delayed publication of revocation information → DigiCert: delayed publication of revocation information
Whiteboard: [ca-compliance] → [ca-compliance] [ocsp-failure]