Closed Bug 1793445 Opened 2 years ago Closed 1 year ago

TWCA: "unknown" OCSP response for issued certificates

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: agwa-bugs, Assigned: hcli)

Details

(Whiteboard: [ca-compliance] [ocsp-failure])

Attachments

(3 files, 2 obsolete files)

Assignee: bwilson → hcli
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Whiteboard: [ca-compliance]

Times below are in UTC+8

How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in the MDSP mailing list, a Bugzilla bug, or internal self-audit), and the time and date.

  • We became aware of this problem when this bug was assigned on 2022-10-03 23:03.

A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.

  • 2022-07-07 08:10 A network issue occurred between TWCA and some CT log servers.
  • 2022-08-07 08:10 The first reported pre-certificate was created and sent to CT log servers.
  • 2022-08-07 08:11 Issuance was aborted for the first reported pre-certificate because not enough SCTs were collected.
  • 2022-08-08 08:44 The last reported pre-certificate was created and sent to CT log servers.
  • 2022-08-08 08:45 Issuance was aborted for the last reported pre-certificate because not enough SCTs were collected.
  • 2022-08-08 08:45 The network recovered.
  • 2022-10-03 23:03 Bug assigned and email from Bugzilla received.
  • 2022-10-04 08:34 We began investigating this problem.
  • 2022-10-04 19:00 We completed the investigation and are posting this report.

Whether your CA has stopped, or has not yet stopped, certificate issuance or the process giving rise to the problem or incident. A statement that you have stopped will be considered a pledge to the community; a statement that you have not stopped requires an explanation.

  • None of the certificates corresponding to the reported pre-certificates were issued. These pre-certificates will be revoked.

In a case involving certificates, a summary of the problematic certificates. For each problem: the number of certificates, and the date the first and last certificates with that problem were issued. In other incidents that do not involve enumerating the affected certificates (e.g. OCSP failures, audit findings, delayed responses, etc.), please provide other similar statistics, aggregates, and a summary for each type of problem identified. This will help us measure the severity of each problem.

  • There are 10 pre-certificates involved. The first pre-certificate was issued at 2022-08-07 08:10:46 and the last at 08:44:47.
  • The issuance processes of all pre-certificates involved were aborted due to the same network issue, and no final certificates were issued.

In a case involving TLS server certificates, the complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem. It is also recommended that you use this form in your list "https://crt.sh/?sha256=[sha256-hash]", unless circumstances dictate otherwise. When the incident being reported involves an SMIME certificate, if disclosure of personally identifiable information in the certificate may be contrary to applicable law, please provide at least the certificate serial number and SHA256 hash of the certificate. In other cases not involving a review of affected certificates, please provide other similar, relevant specifics, if any.

Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

  • For certificate issuance errors, the certificate is normally revoked automatically. In this case, however, when not enough SCTs were collected, our system did not correctly publish the serial number to the OCSP responder's database. The OCSP responder therefore responded “unknown” for these serial numbers.

List of steps your CA is taking to resolve the situation and ensure that such situation or incident will not be repeated in the future, accompanied with a binding timeline of when your CA expects to accomplish each of these remediation steps.

  • We will revoke the pre-certificates. We are still discussing the details of the system fix and will post an update here when both are completed.

The timeline in Comment 1 contains mistyped months. We are also still investigating the exact duration of the network issues.
This is the updated timeline:

  • Before 2022-07-07 08:10 A network issue occurred between TWCA and some CT log servers.
  • 2022-07-07 08:10 The first reported pre-certificate was created and sent to CT log servers.
  • 2022-07-07 08:11 Issuance was aborted for the first reported pre-certificate because not enough SCTs were collected.
  • 2022-07-08 08:44 The last reported pre-certificate was created and sent to CT log servers.
  • 2022-07-08 08:45 Issuance was aborted for the last reported pre-certificate because not enough SCTs were collected.
  • After 2022-07-08 08:45 The network recovered.
  • 2022-10-03 23:03 Bug assigned and email from Bugzilla received.
  • 2022-10-04 08:34 We began investigating this problem.
  • 2022-10-04 19:00 We completed the preliminary investigation and posted the report.
  • 2022-10-05 08:00 Began a full search for any pre-certificates with the same problem.
Attached file 2021-12-01.csv (obsolete): pre-certificates issued on 2021-12-01

Attached file 2022-07-07.csv (obsolete): pre-certificates issued on 2022-07-07

Below is the updated incident report. Times are in UTC+8.

How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in the MDSP mailing list, a Bugzilla bug, or internal self-audit), and the time and date.

We became aware of this problem when this bug was assigned on 2022-10-03 23:03.

A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.

  • 2021-12-01 TWCA began issuing certificates that expire in 2023, while the CT log list had not been updated to include log shards for 2023.
  • 2021-12-01 00:24 The first affected pre-certificate of the first batch was created and sent to CT logs; issuance was then aborted due to insufficient SCTs.
  • 2021-12-01 10:21 The last affected pre-certificate of the first batch was created and sent to CT logs; issuance was then aborted due to insufficient SCTs.
  • 2021-12-01 11:00 The CT log list was fixed for the non-EV issuance system.
  • 2021-12-02 11:22 One affected EV pre-certificate was created and sent to CT logs.
  • 2021-12-02 17:00 The CT log list was fixed for the EV issuance system.
  • 2022-07-07 03:00 A network issue occurred between TWCA and some CT log servers.
  • 2022-07-07 03:15 The first affected pre-certificate of the second batch was created and sent to CT logs; issuance was then aborted due to insufficient SCTs.
  • 2022-07-07 08:44 The last affected pre-certificate of the second batch was created and sent to CT logs; issuance was then aborted due to insufficient SCTs.
  • 2022-07-07 08:45 The network issue was resolved.
  • 2022-10-01 00:00 The MRSP requirement that CAs provide OCSP responses for certificates presumed to exist became effective.
  • 2022-10-03 23:03 Bug assigned and email from Bugzilla received.
  • 2022-10-04 08:34 We began investigating this problem.
  • 2022-10-04 19:00 We completed the preliminary investigation and posted the report.
  • 2022-10-05 08:00 Began a full search for any pre-certificates with the same problem.
  • 2022-10-05 15:00 Finished checking all unexpired pre-certificates and examined the causes.
  • 2022-10-05 18:00 Published the affected pre-certificate data to the OCSP responder and posted this report.

The previous timelines contained multiple date typos, which are now corrected. The pre-certificates in the second batch were all created on 2022-07-07.

Whether your CA has stopped, or has not yet stopped, certificate issuance or the process giving rise to the problem or incident. A statement that you have stopped will be considered a pledge to the community; a statement that you have not stopped requires an explanation.

  • None of the certificates corresponding to the affected pre-certificates were actually issued.
  • We are now providing correct OCSP responses for these pre-certificates.

In a case involving certificates, a summary of the problematic certificates. For each problem: the number of certificates, and the date the first and last certificates with that problem were issued. In other incidents that do not involve enumerating the affected certificates (e.g. OCSP failures, audit findings, delayed responses, etc.), please provide other similar statistics, aggregates, and a summary for each type of problem identified. This will help us measure the severity of each problem.

  • Two batches of pre-certificates are involved: 175 non-EV under TWCA Secure SSL Certification Authority and 1 EV under TWCA Global EVSSL Certification Authority.
  • The first batch was caused by a system configuration error. It contains 112 non-EV pre-certificates created on 2021-12-01 between 00:24 and 10:21, and the EV pre-certificate created at 2021-12-02 11:22.
  • The second batch was caused by external network issues. It contains 63 non-EV pre-certificates created on 2022-07-07 between 03:15 and 08:44.
  • None of the certificates corresponding to these pre-certificates were actually issued.
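
As a hedged illustration of the first-batch cause: CT logs are commonly sharded by certificate expiry date, so a log list that holds only 2022 shards cannot return SCTs for a certificate expiring in 2023. The Python sketch below is not TWCA's system. `LogShard`, `usable_shards`, `can_collect_enough_scts`, the shard names, and the `required=2` threshold are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class LogShard:
    """Hypothetical record for one temporally sharded CT log."""
    name: str
    start_inclusive: datetime  # earliest notAfter the shard accepts
    end_exclusive: datetime    # first notAfter the shard rejects


def usable_shards(shards, not_after):
    """Return the shards whose expiry window contains the certificate's notAfter."""
    return [s for s in shards if s.start_inclusive <= not_after < s.end_exclusive]


def can_collect_enough_scts(shards, not_after, required=2):
    """True if enough configured shards could accept this certificate.

    `required` is an illustrative threshold, not a statement of any CT policy.
    """
    return len(usable_shards(shards, not_after)) >= required


def utc(year, month=1, day=1):
    """Small helper to build timezone-aware UTC datetimes."""
    return datetime(year, month, day, tzinfo=timezone.utc)


# Illustrative stale configuration that lists only 2022 shards.
stale_config = [
    LogShard("example-log-a-2022", utc(2022), utc(2023)),
    LogShard("example-log-b-2022", utc(2022), utc(2023)),
]
```

Running a check like `can_collect_enough_scts(stale_config, not_after)` before creating the pre-certificate would let the system refuse issuance early instead of aborting after the pre-certificate has already been logged.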

In a case involving TLS server certificates, the complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem. It is also recommended that you use this form in your list "https://crt.sh/?sha256=[sha256-hash]", unless circumstances dictate otherwise. When the incident being reported involves an SMIME certificate, if disclosure of personally identifiable information in the certificate may be contrary to applicable law, please provide at least the certificate serial number and SHA256 hash of the certificate. In other cases not involving a review of affected certificates, please provide other similar, relevant specifics, if any.

Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

For certificate issuance errors, the certificate is normally revoked automatically by our system. In this case, however, when not enough SCTs were collected, we overlooked that our system did not correctly publish the serial number to the OCSP responder's database. The OCSP responder therefore responded “unknown” for these serial numbers.
Although we were aware of the MRSP requirements effective on 2022-10-01, we focused only on the ability to revoke pre-certificates and missed this case.
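
To make the failure mode concrete, here is a minimal sketch of how a responder backed by a serial-number database ends up answering "unknown". It assumes a hypothetical in-memory mapping from serial number to a revoked flag; `CertStatus` and `ocsp_status` are illustrative names, not part of TWCA's responder.

```python
from enum import Enum


class CertStatus(Enum):
    """The three certificate status values defined for OCSP responses."""
    GOOD = "good"
    REVOKED = "revoked"
    UNKNOWN = "unknown"


def ocsp_status(responder_db, serial):
    """Look up a serial in a hypothetical responder database.

    `responder_db` maps serial number -> True if revoked, False otherwise.
    A serial that was never published has no entry, so the responder has
    nothing to report and answers "unknown".
    """
    if serial not in responder_db:
        return CertStatus.UNKNOWN
    return CertStatus.REVOKED if responder_db[serial] else CertStatus.GOOD
```

Under this model, a pre-certificate whose serial never reached the database is indistinguishable from a serial the CA never assigned, which is why the aborted issuances surfaced as "unknown".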

List of steps your CA is taking to resolve the situation and ensure that such situation or incident will not be repeated in the future, accompanied with a binding timeline of when your CA expects to accomplish each of these remediation steps.

After further discussion we have decided on the following:

  • Publish the affected pre-certificate data to the OCSP responder so that it responds "good" for these serial numbers. (Completed)
  • Update the issuance system to ensure pre-certificates are correctly revoked when any error occurs after pre-certificates are published to CT logs. This is scheduled to be completed in 2023 Q1.
  • For the CT log list configuration, we already have a process to update it regularly.
  • Until the system update is completed, manually publish pre-certificate data to the OCSP responder whenever an SCT collection error occurs.
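
The planned issuance-system change could look roughly like the sketch below. It is only an illustration under stated assumptions: `issue_with_cleanup`, `InsufficientSCTs`, the `submit_to_ct` and `sign_final` callables, the dictionary standing in for the OCSP database, and the `required_scts=2` threshold are all hypothetical. Recording the serial before the pre-certificate is submitted is a design choice of this sketch, not something the report specifies.

```python
class InsufficientSCTs(Exception):
    """Raised when too few SCTs are returned for a pre-certificate."""


def issue_with_cleanup(serial, submit_to_ct, sign_final, ocsp_db, required_scts=2):
    """Issue a certificate while keeping the OCSP database consistent.

    `ocsp_db` is a hypothetical mapping of serial -> "good" or "revoked".
    The serial is recorded before the pre-certificate leaves the CA, so the
    responder never lacks an entry for a serial that appears in a CT log.
    """
    ocsp_db[serial] = "good"
    try:
        scts = submit_to_ct(serial)
        if len(scts) < required_scts:
            raise InsufficientSCTs(f"got {len(scts)} SCTs, need {required_scts}")
        return sign_final(serial, scts)
    except Exception:
        # Any failure after this point leaves a logged pre-certificate with no
        # final certificate, so mark the serial revoked before re-raising.
        ocsp_db[serial] = "revoked"
        raise
```

Catching every exception rather than only `InsufficientSCTs` mirrors the stated goal of handling "any error" after the pre-certificate is published.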
Flags: needinfo?(hcli)

Hi,

These two certificates were already included in the previous report, but when we manually created the OCSP import list, their serial numbers were missed due to copy-paste errors. We added them to OCSP today.
For the certificate that was not included in the previous report, I will post a separate report, since it was due to a separate issue.

Attached file 2021-12-01.csv
Attachment #9297236 - Attachment is obsolete: true
Attached file 2022-07-07.csv
Attachment #9297237 - Attachment is obsolete: true

Below is the preliminary report for the second issue. Times are in UTC+8.

How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in the MDSP mailing list, a Bugzilla bug, or internal self-audit), and the time and date.

We became aware of this problem while investigating the list of pre-certificates in Comment 7 and found that one of them has a corresponding final certificate, unlike the others, for which issuance was aborted.

A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.

  • 2022-10-14 01:29 Received notification of Comment 7.
  • 2022-10-14 09:30 Began investigating the three pre-certificates in Comment 7.
  • 2022-10-14 11:00 Found that the second one had a final certificate issued but OCSP was responding "unknown". Began investigating this as a separate issue.
  • 2022-10-14 14:30 Checked all unexpired certificates and their OCSP responses. Found 8 certificates, including the one reported, with "unknown" OCSP responses.
  • 2022-10-14 18:15 Manually added all of these certificates to OCSP and posted this report.

Whether your CA has stopped, or has not yet stopped, certificate issuance or the process giving rise to the problem or incident. A statement that you have stopped will be considered a pledge to the community; a statement that you have not stopped requires an explanation.

We already have monitoring and a manual process in place to publish certificates to OCSP when a certificate is not correctly published during issuance.

In a case involving certificates, a summary of the problematic certificates. For each problem: the number of certificates, and the date the first and last certificates with that problem were issued. In other incidents that do not involve enumerating the affected certificates (e.g. OCSP failures, audit findings, delayed responses, etc.), please provide other similar statistics, aggregates, and a summary for each type of problem identified. This will help us measure the severity of each problem.

There are 8 certificates with this problem.
The first was issued on 2021-10-02.
The last was issued on 2022-04-21.

In a case involving TLS server certificates, the complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem. It is also recommended that you use this form in your list "https://crt.sh/?sha256=[sha256-hash]", unless circumstances dictate otherwise. When the incident being reported involves an SMIME certificate, if disclosure of personally identifiable information in the certificate may be contrary to applicable law, please provide at least the certificate serial number and SHA256 hash of the certificate. In other cases not involving a review of affected certificates, please provide other similar, relevant specifics, if any.

https://crt.sh/?sha256=9058adabbad8aaed3e7259f0d0354325cc0c1d38d631e6de95cb54ec984eef40
https://crt.sh/?sha256=291d77f018a267fe7679a1e3315fda54b019563264e5cd8cd9647c728e37a82f
https://crt.sh/?sha256=e56a3f4d43e9704d007f6253dbefe55ebead179197eb5c931161f7cbbaf0246e
https://crt.sh/?sha256=b285e6495c093c86b6f9eba6ad1b647f873078f88308fb3c9ab334f9d2f06364
https://crt.sh/?sha256=e38c6c45143b47f1ceee72882af3801977d3059f07b93f6026b61cb3ede42727
https://crt.sh/?sha256=2dc286bc257c39feb880cd506991d73bbb9aa5046911eb4019f78635fc3daa5f
https://crt.sh/?sha256=de056a829fc5cc0fa965ac3d105682f9bdf81773affa205edd9b1689dcd2022b
https://crt.sh/?sha256=394e239f2dec7ee24ae04f6e6c0a93d2e015b30edf2d069e67be502364810b73

Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

As stated above, we already had monitoring and a manual process for these cases. We are investigating why they were not handled according to the process.

List of steps your CA is taking to resolve the situation and ensure that such situation or incident will not be repeated in the future, accompanied with a binding timeline of when your CA expects to accomplish each of these remediation steps.

  • Publish the affected certificates data to the OCSP responder so it will respond "good". (Completed)
  • We are still discussing how the current process could be improved.
Flags: needinfo?(hcli)

Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

The previous monitoring process sent an email alert to the operators and still required manual verification and publication to resolve the issue.
Our investigation concluded that, most likely, the alert emails in these cases were buried among other emails and missed, so the mitigation steps were not carried out.

List of steps your CA is taking to resolve the situation and ensure that such situation or incident will not be repeated in the future, accompanied with a binding timeline of when your CA expects to accomplish each of these remediation steps.

  • Publish the affected certificates data to the OCSP responder so it will respond "good". (Completed)
  • We have decided to implement independent monitoring that regularly verifies consistency between the CA and OCSP databases and automatically publishes any missing OCSP entries. This is also scheduled to be completed in 2023 Q1.
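
The consistency check described in the last step amounts to a set difference between the two databases, followed by publication of whatever is missing. The sketch below assumes both databases can be viewed as mappings from serial number to status. `find_missing_serials` and `reconcile` are hypothetical names, and a real job would also need scheduling, logging, and access to the actual data stores.

```python
def find_missing_serials(ca_serials, ocsp_serials):
    """Return serials known to the CA but absent from the OCSP database, sorted."""
    return sorted(set(ca_serials) - set(ocsp_serials))


def reconcile(ca_db, ocsp_db):
    """Publish every serial the OCSP database is missing.

    Both arguments are hypothetical mappings of serial -> status string
    ("good" or "revoked"). Entries already present in `ocsp_db` are left
    untouched. Returns the list of serials that were published.
    """
    missing = find_missing_serials(ca_db, ocsp_db)
    for serial in missing:
        ocsp_db[serial] = ca_db[serial]
    return missing
```

Because the job repairs gaps itself rather than emailing an operator, it does not depend on an alert being noticed.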
Product: NSS → CA Program
Whiteboard: [ca-compliance] → [ca-compliance] [ocsp-failure]

Hi,

The error handling after pre-certificate issuance proposed in Comment 6 and the independent synchronization process proposed in Comment 12 have been deployed to the production system.

Are there any follow-up questions to TWCA from the community? If not, I'll close this on or about Wed. 19-Apr-2023.

Flags: needinfo?(bwilson)
Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Flags: needinfo?(bwilson)
Resolution: --- → FIXED