Sectigo: Intermittent OCSP unauthorized responses for certificates older than 15 minutes
Categories
(CA Program :: CA Certificate Compliance, task)
Tracking
(Not tracked)
People
(Reporter: mnordhoff, Assigned: martijn.katerbarg)
Details
(Whiteboard: [ca-compliance] [ocsp-failure] [external] Next update 2025-04-30)
Steps to reproduce:
From around 2024-02-08 07:20 - 07:50 UTC, i have noticed OCSP unauthorized responses -- perhaps inconsistently, depending on client location -- for several Sectigo/ZeroSSL certificates issued on 2024-02-06, apparently issued at 2025-02-08 05:19, and apparently issued at 2025-02-08 07:01.
BR version 2.1.2 section 4.9.9 says that valid responses are required within 15 minutes. i assume they are supposed to work reliably, and CDN nodes should not be caching out-of-date nonsense for hours or days.
i noticed one on one of my own servers in Atlanta, and the other 3 on https://crt.sh/. (2 certs are mine, and 2 were picked arbitrarily.)
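A minimal way to run the same kind of check locally (a sketch; it assumes the leaf and issuer certificates have been saved as cert.pem and issuer.pem, which are placeholder names):

```sh
# Read the OCSP responder URL from the certificate's AIA extension.
OCSP_URL=$(openssl x509 -in cert.pem -noout -ocsp_uri)

# Query the responder. -resp_text prints the decoded response, so the
# responseStatus field is visible ("successful" for a signed good/revoked
# answer, "unauthorized" for the non-compliant case described here).
openssl ocsp -issuer issuer.pem -cert cert.pem \
    -url "$OCSP_URL" -resp_text -noverify
```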
Actual results:
- https://crt.sh/?id=16578370712&opt=ocsp issued 2025-02-06 11:27, checked 2025-02-08 07:18:59
- https://crt.sh/?id=16578372810&opt=ocsp issued 2025-02-06 11:27, checked 2025-02-08 07:19:14
- https://crt.sh/?id=16605538694&opt=ocsp Apparently issued 2025-02-08 05:17, checked 2025-02-08 07:48:22
- https://crt.sh/?id=16605146789&opt=ocsp Apparently issued 2025-02-08 07:01, checked 2025-02-08 07:36:32
(i believe crt.sh does not cache OCSP responses, so you may see a different result if you click those links.)
Expected results:
OCSP responses should have been authorized (or revoked, i suppose, though crt.sh says the CRLs consider them valid).
i have no reported this to Sectigo because their online ticketing system went down right after i opened it. :-)
Reporter
Comment 1•1 month ago
"i have no reported this..." Typo noticed. 💀
i was able to open Sectigo case number 03609993 on my third try. i only wrote one sentence because i thought the error page might not come up if i was fast enough. (Probably not true, but it's too late now.)
Reporter
Comment 2•1 month ago
To clarify, any references to 2024 were also typos. i must be having an off day. :-)
Reporter
Comment 3•1 month ago
i'm seeing more inconsistent results now (sometimes good, sometimes unauthorized), but for posterity:
- https://web.archive.org/web/20250208094917/https://crt.sh/?id=16578370712&opt=ocsp
- https://crt.sh/?id=16607838250&opt=ocsp (archived: https://web.archive.org/web/20250208100129/https://crt.sh/?id=16607838250&opt=ocsp) Another arbitrary certificate, apparently issued 2025-02-08 08:05, unauthorized when checked at 10:01:30.
Assignee
Comment 4•1 month ago
We are acknowledging this report and investigating the issue. Preliminary findings show that OCSP responses are being signed as expected, and point to an intermittent CDN issue.
Assignee
Comment 5•1 month ago
As an update to this issue: we believe the cause of this intermittent problem has been identified. While we still need to perform further analysis on the root cause, we have made changes to our OCSP infrastructure that have restored availability for the time being.
We will post a preliminary report within the next few days, and a full incident report no later than February 21st.
Assignee
Comment 6•1 month ago
Initial Incident Report
On Saturday, 2025-02-08, at 11:22 UTC, we were made aware (by noticing this bug) that we were serving OCSP “unauthorized” responses for multiple certificates issued more than 15 minutes prior.
This is a violation of section 4.9.9 of the TLS Baseline Requirements, which states:
“Effective 2025-01-15, an authoritative OCSP response MUST be available (i.e. the responder MUST NOT respond with the "unknown" status) starting no more than 15 minutes after the Certificate or Precertificate is first published or otherwise made available.”
Only signed “good” and “revoked” OCSP responses are compliant with this requirement, which means that unsigned “unauthorized” OCSP responses are not.
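To make the distinction concrete, this is roughly what the two cases look like when a captured DER-encoded response (response.der here, a placeholder name) is decoded with the OpenSSL command-line client; the output shown in the comments is abbreviated:

```sh
openssl ocsp -respin response.der -resp_text -noverify
# A compliant answer starts with a signed BasicOCSPResponse:
#   OCSP Response Status: successful (0x0)
#   ...followed by the certificate status ("good" or "revoked") and a signature.
# The non-compliant case carries only a bare status code and nothing signed:
#   OCSP Response Status: unauthorized (0x6)
```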
Initial testing seemed to suggest an issue with our CDN, but we were later able to reproduce the intermittent problem on our origin servers as well.
We pinpointed the root cause, which was that 6 out of 18 read-only replicas of our certificate status database were showing a replication delay. All of those 6 replicas are operated from the same site.
To quickly restore normal service, we disabled that site. We were able to do this due to the very high CDN cache hit rate we have for OCSP, plus having plenty of spare capacity across our other origin servers.
We are currently investigating the root cause of the replication delay, and will provide a full incident report no later than 2025-02-21.
Assignee
Comment 7•1 month ago
Incident Report
Summary
On Saturday, 2025-02-08, at 11:22 UTC, we were made aware (by noticing this bug) that we were serving OCSP “unauthorized” responses for multiple certificates issued more than 15 minutes prior.
This is a violation of section 4.9.9 of the TLS Baseline Requirements, which states:
“Effective 2025-01-15, an authoritative OCSP response MUST be available (i.e. the responder MUST NOT respond with the "unknown" status) starting no more than 15 minutes after the Certificate or Precertificate is first published or otherwise made available.”
Only signed “good” and “revoked” OCSP responses are compliant with this requirement, which means that unsigned “unauthorized” OCSP responses are not.
Impact
Although our CertStatus system pregenerated OCSP responses for all unexpired certificates within the required timeline, those responses were not propagated to 6 of our 18 database replicas in a timely fashion. These database replicas serve a total of 128 web nodes, which in turn handle traffic for our OCSP endpoints, with Cloudflare in front as a CDN.
Timeline
All times are UTC.
2025-02-03:
- 03:30 Replication lag starts to build up on 5 out of 9 database replicas at the affected site. We have monitoring for this metric in place, but no alerting for this scenario.
- 09:00 Replication breaks down on 5 out of 9 database replicas at the affected site.
2025-02-04:
- 10:42 Our DevOps team reports connectivity and high latency issues in the affected site.
2025-02-07:
- 01:00 Replication breaks down on a 6th database replica at the affected site.
2025-02-08:
- 08:04 This bug is opened.
- 08:21 We receive a regular support case (Case ID 03609993) requesting us to review this bug. Since this was not submitted as a Certificate Problem Report, no incident escalation mechanisms are invoked.
- 11:22 We notice the creation of this bug. We share this discovery with some internal stakeholders.
- 11:26 We confirm that crt.sh is still showing “unauthorized” OCSP responses for the provided crt.sh certificate IDs.
- 11:38 We notice that our statistics dashboard shows a slightly higher number of “unauthorized” OCSP responses from our certificate status infrastructure than usual.
- 11:56 We confirm we are still pregenerating OCSP responses as expected. Our OCSP response signing queue has no backlog. We divert our attention away from our OCSP response signing operations towards our frontend OCSP infrastructure (called CertStatus) and CDN.
- 12:04 We confirm the origin IP used by our CDN to access our CertStatus origin servers.
- 12:08 We perform several requests for the OCSP status of https://crt.sh/?id=16578372810 using the OpenSSL command-line client. Some of these requests target our CDN, whilst others bypass our CDN and target our CertStatus origin servers directly. All attempts through our CDN show an “unauthorized” OCSP response being returned, whereas all attempts sent directly to our origin servers show a “good” response. We therefore suspect that a CDN caching problem is the root cause for the “unauthorized” responses. (A sketch of this kind of CDN-versus-origin comparison appears after this timeline.)
- 12:12 We perform further testing using OpenSSL and cURL to compare headers between different responses. We continue to suspect our CDN of having a caching issue.
- 12:24 We post a message to our internal DevOps communication channel requesting a purge of our CDN cache.
- 12:26 We respond to the support case (Case ID 03609993).
- 12:27 We reach out directly to a member of our DevOps team to escalate the purging of the CDN cache.
- 12:43 We acknowledge this bug.
- 13:45 An internal case is created to track the incident and investigation.
- 13:47 We get confirmation that the CDN cache has been cleared.
- 13:48 We check the OCSP status of the certificates listed in comment 0. All now show a valid “good” OCSP response via our CDN.
- 14:22 We notice a new certificate, issued at 13:20, which shows an “unauthorized” OCSP response.
- 14:23 DevOps confirms an additional certificate also showing an “unauthorized” OCSP response.
- 14:25 Tests against these newly found certificates for the first time show an “unauthorized” OCSP response even when bypassing our CDN.
- 14:41 We compare different OCSP check results on the same certificate from the perspectives of different DevOps and Compliance staff members. We note that some cases return a “good” OCSP response and other cases an “unauthorized” OCSP response. We continue to investigate.
- 15:33 The on-call DevOps team members send a message to request assistance from the lead R&D engineer for the CertStatus system.
- 16:10 The R&D engineer acknowledges the message and is available to start helping with the investigation.
- 16:20 Now up to speed with the findings so far, the R&D engineer hypothesizes that a replication lag in a subset of our CertStatus database replicas could account for the intermittent “unauthorized” OCSP responses.
- 16:25 We investigate that hypothesis and discover a replication lag in 6 out of our 18 CertStatus database replicas. The 6 replicas are all located at the same site, with an additional 3 replicas at the same site having no replication lag.
- 16:47 We disable the CertStatus web frontends at the affected site. This disables a total of 64 web frontend pods. Since each site has plenty of spare capacity in order to deal with disaster scenarios, we are confident that the remainder of our infrastructure is able to cope with both current and peak demand. Shutting down these 64 web frontends halts reliance on the lagging database replicas.
- 16:54 We once again clear our CDN cache to flush out any cached “unauthorized” OCSP responses.
- 17:20 We expand the ongoing call to include our networking team, which begins to investigate reported network issues at the affected site as a priority.
- 21:30 Further investigation into the failing database replicas shows that replication had completely stopped on the affected replicas due to missing Write-Ahead-Log segments (WALs).
- 22:00 We start restoring the affected database replicas using snapshots of unaffected replicas.
- 22:30 Restoration is completed.
- 23:40 We notice that WALs are only kept for up to 15 minutes on the primary database node. We realize that, with ongoing network performance issues, replication broke down when the nodes were unable to process updates before the WALs were removed. We decide to leave the CertStatus services at the affected site switched off until the cause of the network performance issues can be found and resolved.
2025-02-12:
- 02:09 We find evidence of frequent routing convergence, causing temporary, but recurring, network instability. We shut down the affected interface and reroute traffic through an alternative interface. This solves the network issues.
2025-02-18:
- 10:52 We replace the faulty, previously disabled network interface that had caused the network issues.
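The CDN-versus-origin comparison described in the 12:08 and 12:12 entries can be illustrated with a sketch along these lines; the hostname, origin IP, and file names below are placeholders, not Sectigo's actual configuration:

```sh
# Build a DER-encoded OCSP request for the certificate under test.
openssl ocsp -issuer issuer.pem -cert cert.pem -reqout ocsp.req

OCSP_HOST="ocsp.example-ca.test"   # placeholder responder hostname
ORIGIN_IP="198.51.100.10"          # placeholder origin address behind the CDN

# 1) Through the CDN, using normal DNS resolution.
curl -s -H 'Content-Type: application/ocsp-request' --data-binary @ocsp.req \
     "http://$OCSP_HOST/" -o cdn.resp

# 2) Bypassing the CDN by pinning the responder hostname to the origin IP.
curl -s --resolve "$OCSP_HOST:80:$ORIGIN_IP" \
     -H 'Content-Type: application/ocsp-request' --data-binary @ocsp.req \
     "http://$OCSP_HOST/" -o origin.resp

# Decode both responses and compare their response status lines.
openssl ocsp -respin cdn.resp -resp_text -noverify | head -n 5
openssl ocsp -respin origin.resp -resp_text -noverify | head -n 5
```

If the two status lines differ, the CDN (or its cache) is serving something other than what the origin currently returns, which matches what was observed at 12:08.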
Root Cause Analysis
Sectigo utilizes 128 internal web nodes, fronted by a CDN, for CRL and OCSP services. These 128 web nodes are split evenly across two locations on two different continents. Each site of 64 web nodes is backed by 9 database replicas that provide the certificate status information. These databases are kept in sync through the use of WALs.
Under normal operating conditions, all replicas are kept in sync with almost zero lag. The replicas collect WALs frequently from the primary database node, and each WAL is removed from the primary database node after 15 minutes.
We started experiencing network performance issues at the affected site. Our DevOps and network teams observed these issues, but did not become aware of the corresponding impact on database replication. At the time, this site operated as our fallback location for the majority of our services. However, despite being in that fallback/standby role, the site was still serving traffic for our OCSP and CRL certificate status services.
The network performance issues eventually led to 6 database replicas being unable to retrieve WALs in time before the WALs were removed from the primary database node, which caused those replicas to no longer receive updates at all. The lack of alerting on the replication lag monitoring data meant that we were not made aware in time to prevent a compliance incident.
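The incident report does not name the database engine, so purely as an illustration: in a PostgreSQL-style streaming-replication setup, the kind of lag that went unnoticed here can be surfaced with checks along the following lines (a sketch, not Sectigo's production tooling; the threshold is arbitrary):

```sh
# On the primary: per-replica replication state and lag, as seen by the WAL sender.
psql -c "SELECT application_name, state, write_lag, replay_lag FROM pg_stat_replication;"

# On each read-only replica: seconds since the last replayed transaction.
# (Note: this value also grows while the primary is idle, so real alerting
# usually combines it with the sender-side view above.)
LAG=$(psql -At -c "SELECT COALESCE(EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()), 0);")

# Raise an alert when replay lag exceeds an illustrative five-minute threshold.
if [ "${LAG%.*}" -gt 300 ]; then
    echo "CRITICAL: replication lag is ${LAG}s" >&2
    exit 2   # non-zero exit so a monitoring agent treats this as an alert
fi
```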
Lessons Learned
What went well
- Once we became aware of the incident, we were able to investigate and resolve it within several hours.
- Once the issue was identified we were able to quickly shut down the CertStatus services at the affected site.
- Our capacity planning foresaw and allowed for the disaster scenario of losing the CertStatus services at one site.
- Our current primary site is able to handle all traffic to our CertStatus services.
What didn’t go well
- We did not have proper alerting in place to notify us of a lag in database replication.
- We did not take database replication issues into account when we first encountered network performance issues.
- WALs were removed from our primary database node after 15 minutes, which caused replication to the affected database replicas to break down once their lag exceeded that threshold.
Where we got lucky
- 3 out of 9 database replicas at the affected site were not impacted by replication lag.
- Staff members who were not on-call but who possessed essential knowledge were able to make themselves available fairly quickly and provide vital assistance.
Action Items
Action Item | Kind | Due Date |
---|---|---|
Improve documentation for staff on the configuration of our CDN in relation to our CertStatus services | Mitigate | 2025-03-07 |
Add proper alerting for replication lags to our CertStatus database replicas | Prevent & Detect | 2025-03-07 |
Increase the time before WALs are removed from the primary database node | Mitigate | 2025-03-07 |
Replace the faulty, disabled, network interface causing network issues | Prevent | Completed |
Re-enable the CertStatus web nodes at the affected site | Prevent | 2025-03-21 |
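For the WAL-retention and replication-lag action items above, and again only under the assumption of a PostgreSQL-style setup (the report does not name the engine, and the values and slot name below are illustrative placeholders rather than Sectigo's actual settings), retention on the primary can be extended so that a briefly lagging replica does not lose its place in the WAL stream:

```sh
# Keep more WAL on the primary so replicas that fall behind can still catch up.
psql -c "ALTER SYSTEM SET wal_keep_size = '4GB';"
psql -c "SELECT pg_reload_conf();"

# Alternatively, a physical replication slot retains WAL until the replica has
# consumed it (at the cost of unbounded disk growth if that replica stays down,
# so disk usage then needs its own monitoring).
psql -c "SELECT pg_create_physical_replication_slot('certstatus_replica_01');"
```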
Assignee
Comment 8•23 days ago
We are monitoring this bug for any questions and/or comments. Meanwhile, work to complete our action items continues as scheduled.
Assignee
Comment 9•16 days ago
Three out of our four remaining action items have been completed. This updates our action items to the following status:
Action Item | Kind | Due Date |
---|---|---|
Improve documentation for staff on the configuration of our CDN in relation to our CertStatus services | Mitigate | Completed |
Add proper alerting for replication lags to our CertStatus database replicas | Prevent & Detect | Completed |
Increase the time before WALs are removed from the primary database node | Mitigate | Completed |
Replace the faulty, disabled, network interface causing network issues | Prevent | Completed |
Re-enable the CertStatus web nodes at the affected site | Prevent | 2025-03-21 |
Assignee
Comment 10•9 days ago
We have noted additional synchronization delays caused by network issues, both internal and external. Our changes to how long WALs are retained have meant that the backlog is able to catch up.
We are monitoring network stability, following further changes to our network, before re-enabling the CertStatus web nodes at the affected site.
Reporter
Comment 11•9 days ago
(In reply to Martijn Katerbarg from comment #7)
- 08:21 We receive a regular support case (Case ID 03609993) requesting us to review this bug. Since this was not submitted as a Certificate Problem Report, no incident escalation mechanisms are invoked.
- 11:22 We notice the creation of this bug. We share this discovery with some internal stakeholders.
For what it's worth, I was leery of making a Certificate Problem Report. It may have guaranteed someone would look at the report faster, but it seemed like misuse and I felt it risked causing confusion.
With all due respect to the CPR staff, reporting a list of certificates that should not be revoked seemed like it could go badly in any number of ways. For example, someone triaging reports might think I was trying to get those certificates revoked, ask me for evidence, and then set the ticket to be ignored pending response. (I don't know what your ticket procedures are, but something like that seemed plausible.)
The web PKI does not really have a "generic urgent problem" or even "Generic Compliance Report" sort of mechanism, as far as I know.
Assignee
Comment 12•2 days ago
(In reply to Matt Nordhoff (aka Peng on IRC & forums) from comment #11)
Hi Matt,
For what it's worth, I was leery of making a Certificate Problem Report. It may have guaranteed someone would look at the report faster, but it seemed like misuse and I felt it risked causing confusion.
With all due respect to the CPR staff, reporting a list of certificates that should not be revoked seemed like it could go badly in any number of ways. For example, someone triaging reports might think I was trying to get those certificates revoked, ask me for evidence, and then set the ticket to be ignored pending response. (I don't know what your ticket procedures are, but something like that seemed plausible.)
The web PKI does not really have a "generic urgent problem" or even "Generic Compliance Report" sort of mechanism, as far as I know.
Thank you for raising this. We would class this incident as suitable for a CPR. The definition of a Certificate Problem Report according to the TLS BRs is: “Complaint of suspected Key Compromise, Certificate misuse, or other types of fraud, compromise, misuse, or inappropriate conduct related to Certificates.” We believe incorrect OCSP responses would fall under the “inappropriate conduct related to Certificates” part.
However, we believe this is a good point, and we acknowledge that not everyone may see it the same way. A more general “Generic Compliance Report” or “CA Problem Report” might be an appropriate future addition to the TLS BRs.
With regard to our final action item: we continue to monitor the stability of our network. To be cautious, we are opting not to re-enable the CertStatus web nodes at the affected site yet, and are proposing to move the deadline for this change to a future date. This updates our action items list to:
Action Item | Kind | Due Date |
---|---|---|
Improve documentation for staff on the configuration of our CDN in relation to our CertStatus services | Mitigate | Completed |
Add proper alerting for replication lags to our CertStatus database replicas | Prevent & Detect | Completed |
Increase the time before WALs are removed from the primary database node | Mitigate | Completed |
Replace the faulty, disabled, network interface causing network issues | Prevent | Completed |
Re-enable the CertStatus web nodes at the affected site | Prevent | 2025-04-30 |
Ben, we would like to request a next-update for 2025-04-30.