Closed Bug 1924497 Opened 4 months ago Closed 5 days ago

certSIGN: Missing certificate from the list of bad order subject attributes

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: gabriel.petcu, Assigned: gabriel.petcu)

Details

(Whiteboard: [ca-compliance] [disclosure-failure])

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36

Steps to reproduce:

Incident Report

This was first reported by ryandickson@google.com on 2024-10-11 at 14:34 GMT.

Summary

On 20-Mar-2024 certSIGN opened a bug, 1886624 - certSIGN: Certificates with incorrect Subject attribute order (mozilla.org), with the list of certificates issued with an incorrect attribute order. This list missed one certificate, which was found by Ryan Dickson, who asked us for the reasons why that certificate had not been included in the certificate list of the initial bug.

Impact

There is no impact, as the only affected certificate has already expired.

Timeline

• October 11, 2024
• 14:34 – email received from Ryan Dickson reporting the issue
• 15:30 – Internal incident created
• 16:00 – Internal investigation started
• 16:17 – response email sent to Ryan Dickson
• October 14, 2024
• 08:00 – analysis of the process started
• 12:00 – analysis finalized
• 14:30 – current Bugzilla bug opened
All times are UTC.

Root Cause Analysis

The initial list of certificates to be checked for incorrect subject attribute order was correct and complete, and these certificates had been sent to be checked by linters. For only one certificate in the list, the response from the linter had a very large delay, so that certificate was not caught in the initial list of certificates. certSIGN opened the Bugzilla bug only with the certificates whose linter responses indicated an incorrect order of the subject attributes. This was the root cause of skipping one certificate.

Lessons Learned

Even under the time constraint of publishing a bug as soon as possible, we will continue to monitor for delayed answers from any tool or application we have used.

What went well

  • the certificates with issues had been properly identified and revoked, as indicated in the initial bug 1886624 - certSIGN: Certificates with incorrect Subject attribute order

What didn't go well

  • due to the delayed linter response we missed one certificate

Where we got lucky

  • the missed certificate had no impact on clients and had already expired

Action Items

| Action Item | Kind | Due Date |
| --- | --- | --- |
| Internal training on monitoring results | Prevent | 2024-12-31 |

Appendix

Details of affected certificates

https://crt.sh/?id=10494095254

Assignee: nobody → gabriel.petcu
Status: UNCONFIRMED → ASSIGNED
Type: defect → task
Ever confirmed: true
Whiteboard: [ca-compliance] [disclosure-failure]

Thank you for the report. I would like to ask for some clarifications.

Did you troubleshoot the issue to identify what was so special about this one certificate that was missed?

Can you please describe the process (specific steps) you followed to identify the problematic certificates?

Did you lint against your entire corpus of valid, revoked, expired certificates or just valid certificates?

Did you wait for the linting task to finish before you posted the list of affected certificates in 1886624?

I find it very unlikely that the Root Cause can be attributed to the delay of the linter. Even if linters take time to finish, the CA should wait as necessary to complete the investigation and ensure the final incident report contains accurate information.

(In reply to Dimitris Zacharopoulos from comment #1)

Thank you for the report. I would like to ask for some clarifications.
Did you troubleshoot the issue to identify what was so special about this one certificate that was missed?

Yes, we troubleshot the issue; we did not identify anything special about this certificate.

Can you please describe the process (specific steps) you followed to identify the problematic certificates?

The process started with the extraction from the certSIGN internal database of all certificates issued during the reporting period.
Next, we prepared a script to check the order of the subject attributes on each certificate.
We ran the script against the list extracted in step 1 and obtained the list of problematic certificates.
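To illustrate the kind of check described above, here is a minimal sketch in Python, not certSIGN's actual script: it loads a PEM certificate with the `cryptography` library and flags it when the subject attributes are not in a non-decreasing expected order. The expected ordering below is illustrative only, not the authoritative required ordering.

```python
# Minimal sketch, not certSIGN's actual script: flag certificates whose
# subject attributes are not in the expected relative order.
# EXPECTED_ORDER is illustrative only.
from cryptography import x509
from cryptography.x509.oid import NameOID

EXPECTED_ORDER = [
    NameOID.COUNTRY_NAME,
    NameOID.STATE_OR_PROVINCE_NAME,
    NameOID.LOCALITY_NAME,
    NameOID.ORGANIZATION_NAME,
    NameOID.COMMON_NAME,
]

def subject_order_ok(pem_bytes: bytes) -> bool:
    cert = x509.load_pem_x509_certificate(pem_bytes)
    # cert.subject yields the attributes in the order they are encoded
    ranks = [EXPECTED_ORDER.index(a.oid) for a in cert.subject if a.oid in EXPECTED_ORDER]
    return ranks == sorted(ranks)  # order must be non-decreasing
```

Certificates for which such a check returns False would end up on the list of problematic certificates.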

Did you lint against your entire corpus of valid, revoked, expired certificates or just valid certificates?

The list contains all valid certificates issued during the reporting period.

Did you wait for the linting task to finish before you posted the list of affected certificates in 1886624?

Yes, the linting task finished successfully before we posted the list of affected certificates in 1886624.

I find it very unlikely that the Root Cause can be attributed to the delay of the linter. Even if linters take time to finish, the CA should wait as necessary to complete the investigation and ensure the final incident report contains accurate information.

In the usual day-to-day service, a linter response is mandatory before issuance; there is no way that a request is processed or a certificate is issued without linter responses. However, in this case we sent all the valid (unexpired and not revoked) certificates to the linting engines, and this was a one-time exceptional situation.
Our conclusion is that the specialist running the bulk processing task did not observe the one-certificate difference between the number of input entries to the process and the number of responses from the linter.
That is why we considered that internal training on monitoring results in crisis situations is needed, including a cross-check of the activities the team performs during similar situations.

I'm still confused as to how there was a delay in the linter processing, and how this was even discovered.

Which linting libraries are you using? What was the batch script used for the process? In what format are the results returned to the specialist, and through what mechanism?

I appreciate there was a mismatch in the number of certificates to be linted, and the number of results that were returned. Did this difference in figures cause the singularly missed certificate to be discovered and linted again? What methodology are you using to handle data analysis here, and at what scale?

A browse of crt.sh gives a rough total corpus of about 12k pre-certificates, of which 4316 are unexpired. In terms of computational workload this should be a trivial quantity to lint and confirm results for; even as a relying party I've handled a larger corpus for linting in a more timely manner. The recent introduction of pkimetal should significantly improve things as well - please consider adopting it into your future practices.

(In reply to Gabriel PETCU from comment #2)

(In reply to Dimitris Zacharopoulos from comment #1)

Thank you for the report. I would like to ask for some clarifications.
Did you troubleshoot the issue to identify what was so special about this one certificate that was missed?

Yes, we troubleshot the issue; we did not identify anything special about this certificate.

Can you please describe the process (specific steps) you followed to identify the problematic certificates?

The process started with the extraction from the certSIGN internal database of all certificates issued during the reporting period.
Next, we prepared a script to check the order of the subject attributes on each certificate.
We ran the script against the list extracted in step 1 and obtained the list of problematic certificates.

Perhaps some more details would help the community understand and assist you with this issue.

Wayne already asked more details about the script, I was going to ask the same.

Also, please explain how the problematic certificates are identified based on the output of the linters. E.g. do you check for certain error codes?

Did you wait for the linting task to finish before you posted the list of affected certificates in 1886624?

Yes, the linting task finished successfully before we posted the list of affected certificates in 1886624.

Then I am not sure I understand your comment (emphasis mine)

For only one certificate in the list, the response from the linter had a very large delay, so that certificate was not caught in the initial list of certificates

Can you provide some clarification about this very large delay of the linter? I read it as "the linter took forever to complete".

I find it very unlikely that the Root Cause can be attributed to the delay of the linter. Even if linters take time to finish, the CA should wait as necessary to complete the investigation and ensure the final incident report contains accurate information.

In the usual day-to-day service, a linter response is mandatory before issuance; there is no way that a request is processed or a certificate is issued without linter responses. However, in this case we sent all the valid (unexpired and not revoked) certificates to the linting engines, and this was a one-time exceptional situation.

What was exceptional about this situation? You may need to lint all your valid certificates periodically. What happens when new linters come out, or existing linters get updated to support more cases?

Our conclusion is that the specialist running the bulk processing task did not observe the one-certificate difference between the number of input entries to the process and the number of responses from the linter.

So, this means there should be some technical control to ensure that the number of input certificates (for linting) matches the number of output results. For example (fictional numbers),

  • input 1000 certificates to be checked
  • output 1000 certificates checked (980 found to be ok, 20 found to be with errors)

Does that match what you described here as "difference"?

That is why we considered that internal training on monitoring results in crisis situations is needed, including a cross-check of the activities the team performs during similar situations.

Training alone may not be sufficient to prevent such an incident from reoccurring. Additional technical controls, like the improvement of the "bulk linting script" you mentioned earlier, would be preferable.
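A minimal sketch of such a technical control, using hypothetical data structures (one entry per certificate submitted, one per linter response), would be to fail loudly whenever the two counts do not reconcile, rather than relying on a person to notice a one-off difference:

```python
# Minimal sketch of an input/output reconciliation control.
# "submitted" and "results" are hypothetical mappings keyed by certificate
# serial number: one entry per certificate sent, one per linter response.
def reconcile(submitted: dict, results: dict) -> None:
    missing = set(submitted) - set(results)  # certificates with no linter result
    if missing:
        raise RuntimeError(
            f"{len(missing)} certificate(s) have no linter result: {sorted(missing)}"
        )
    errors = [serial for serial, findings in results.items() if findings]
    print(f"checked {len(results)} certificates: "
          f"{len(results) - len(errors)} ok, {len(errors)} with errors")
```

With the fictional numbers above, 1000 certificates submitted but only 999 results returned would abort the run instead of silently producing an incomplete list.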

(In reply to Wayne from comment #3)

I'm still confused as to how there was a delay in the linter processing, and how this was even discovered.
Which linting libraries are you using? What was the batch script used for the process? In what format are the results returned to the specialist, and through what mechanism?

  • We used zlint and a certSIGN-developed custom linter at the moment of the problem. Meanwhile we have added pkilint, and we are planning to use pkimetal, which is now in our internal testing process.
  • The script was developed by our team to validate the certificates that are selected by the internal auditor on a quarterly basis. It is a Postman script that takes as input all the certificates as entries in a CSV file and provides as output a JSON-formatted response. The script uses the pkilint implementation as a locally hosted service.
  • The list of the verified certificates with all their details was in JSON format; the responses from the linter are concatenated into this single large JSON file (a rough sketch of this flow is shown below).
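For illustration, a stripped-down Python equivalent of that flow might look like the sketch below. The service URL, the request payload, and the CSV/JSON field names are hypothetical placeholders, not pkilint's or certSIGN's actual interface.

```python
# Minimal sketch, assuming a locally hosted linting service that accepts a
# PEM certificate and returns JSON findings. The URL, the payload shape and
# the column names ("serial", "pem") are hypothetical placeholders.
import csv
import json
import requests

LINTER_URL = "http://localhost:8000/lint"  # hypothetical local endpoint

with open("certificates.csv", newline="") as f:
    rows = list(csv.DictReader(f))

results = []
for row in rows:
    resp = requests.post(LINTER_URL, json={"pem": row["pem"]}, timeout=30)
    resp.raise_for_status()  # fail loudly rather than silently skipping a cert
    results.append({"serial": row["serial"], "findings": resp.json()})

# One concatenated JSON file with every linter response, as described above
with open("linter-results.json", "w") as out:
    json.dump(results, out, indent=2)

assert len(results) == len(rows), "some certificates have no linter result"
```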

I appreciate there was a mismatch in the number of certificates to be linted, and the number of results that were returned. Did this difference in figures cause the singularly missed certificate to be discovered and linted again? What methodology are you using to handle data analysis here, and at what scale?

  • It was a human error: the difference between the number of returned results and the number of inputs was not noticed. The difference was exactly the one missing certificate.

A browse of crt.sh gives a rough total corpus of about 12k pre-certificates, of which 4316 are unexpired. In terms of computational workload this should be a trivial quantity to lint and confirm results for; even as a relying party I've handled a larger corpus for linting in a more timely manner. The recent introduction of pkimetal should significantly improve things as well - please consider adopting it into your future practices.

  • As we mentioned above, we already plan to adopt pkimetal by the end of November 2024.

(In reply to Dimitris Zacharopoulos from comment #4)

(In reply to Gabriel PETCU from comment #2)

(In reply to Dimitris Zacharopoulos from comment #1)

Thank you for the report. I would like to ask for some clarifications.
Did you troubleshoot the issue to identify what was so special about this one certificate that was missed?

Yes, we troubleshot the issue; we did not identify anything special about this certificate.

Can you please describe the process (specific steps) you followed to identify the problematic certificates?

The process started with the extraction from the certSIGN internal database of all certificates issued during the reporting period.
Next, we prepared a script to check the order of the subject attributes on each certificate.
We ran the script against the list extracted in step 1 and obtained the list of problematic certificates.

Perhaps some more details would help the community understand and assist you with this issue.

Wayne already asked more details about the script, I was going to ask the same.

  • The script was developed by our team to validate the certificates that are selected by the internal auditor on a quarterly basis. It is a Postman script that takes as input all the certificates as entries in a CSV file and provides as output a JSON-formatted response. The script uses the pkilint implementation as a locally hosted service.

Also, please explain how the problematic certificates are identified based on the output of the linters. E.g. do you check for certain error codes?

  • As the purpose of the check was to detect certificates with issues in the order of the Subject attributes, we focused mainly on the corresponding error codes.
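As a sketch of that filtering step (the finding identifier below is a hypothetical placeholder, not the exact error code emitted by the linters):

```python
# Minimal sketch: select certificates whose linter findings mention the
# subject attribute order error. The identifier is a hypothetical placeholder.
import json

ORDER_ERROR_CODE = "subject.attribute_order_invalid"  # placeholder, not a real code

with open("linter-results.json") as f:
    results = json.load(f)

problematic = [
    entry["serial"]
    for entry in results
    if ORDER_ERROR_CODE in json.dumps(entry["findings"])
]
print(f"{len(problematic)} certificate(s) with subject attribute order issues")
```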

Did you wait for the linting task to finish before you posted the list of affected certificates in 1886624?

Yes, the linting task finished successfully before we posted the list of affected certificates in 1886624.

Then I am not sure I understand your comment (emphasis mine)

For only one certificate in the list, the response from the linter had a very large delay, so that certificate was not caught in the initial list of certificates

Can you provide some clarification about this very large delay of the linter? I read it as "the linter took forever to complete".

  • Our conclusion was that for one certificate in the list, the linter response was delayed by more than 5 seconds, the timeout value encoded in the script running the check.
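That failure mode is easy to reproduce when a fixed per-request timeout is combined with an exception handler that simply skips the failed request; a minimal, hypothetical sketch (not the actual Postman script) is shown below.

```python
# Minimal hypothetical sketch of how a fixed 5-second timeout can silently
# drop one result when the timeout error is swallowed.
import requests

def lint_all(certs, url, timeout=5):
    results = []
    for pem in certs:
        try:
            resp = requests.post(url, json={"pem": pem}, timeout=timeout)
            results.append(resp.json())
        except requests.Timeout:
            # The slow response is simply lost and the run continues,
            # leaving len(results) < len(certs).
            continue
    return results

# Safer: use a generous timeout, retry failed requests, and verify that
# len(results) == len(certs) before publishing any list of certificates.
```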

I find it very unlikely that the Root Cause can be attributed to the delay of the linter. Even if linters take time to finish, the CA should wait as necessary to complete the investigation and ensure the final incident report contains accurate information.

In the usual day-to-day service, a linter response is mandatory before issuance; there is no way that a request is processed or a certificate is issued without linter responses. However, in this case we sent all the valid (unexpired and not revoked) certificates to the linting engines, and this was a one-time exceptional situation.

What was exceptional about this situation? You may need to lint all your valid certificates periodically. What happens when new linters come out, or existing linters get updated to support more cases?

  • There was time pressure related to the revocation of the certificates, which had not been the case before. We will consider re-running the linters whenever new rules or new linter versions appear. We plan to have a quarterly linter check of all the valid certificates issued during the last quarter.

Our conclusion is that the specialist running the bulk processing task did not observe the one-certificate difference between the number of input entries to the process and the number of responses from the linter.

So, this means there should be some technical control to ensure that the number of input certificates (for linting) matches the number of output results. For example (fictional numbers),

  • input 1000 certificates to be checked
  • output 1000 certificates checked (980 found to be ok, 20 found to be with errors)

Does that match what you described here as "difference"?

  • Yes, it is correct, we considered this approach.

That is why we considered that internal training on monitoring results in crisis situations is needed, including a cross-check of the activities the team performs during similar situations.

Training alone may not be sufficient to prevent such an incident from reoccurring. Additional technical controls, like the improvement of the "bulk linting script" you mentioned earlier, would be preferable.

  • We are working on improving the script to implement technical controls that prevent situations like this, to be completed by the end of November. The script will also be based on the pkimetal linter. And we plan to have a quarterly linter check of all the valid certificates issued during the last quarter.

certSIGN finished the internal testing of the updated script, which now includes the pkimetal linter, and deployed it in production on November 20, 2024.

If there are no other comments or observations, we propose to close this ticket.

Flags: needinfo?(bwilson)

Hi Gabriel,
Even though this has not yet been officially formalized as a bug-closure requirement, could you please provide a closing summary?
Thanks,
Ben

A closing summary should briefly:

  • describe the incident, its root cause(s), and remediation;
  • summarize any ongoing commitments made in response to the incident; and
  • attest that all Action Items have been completed.

Here is a markdown template if needed:

Incident Report Closure Summary

  • Incident Description: [Two or three sentences summarizing the incident.]
  • Incident Root Cause(s): [Two or three sentences summarizing the root cause(s).]
  • Remediation Description: [Two or three sentences summarizing the incident's remediation.]
  • Commitment Summary: [A few sentences summarizing ongoing commitments made in response to this incident.]

All Action Items disclosed in this Incident Report have been completed as described, and we request its closure.

Incident Report Closure Summary

  • Incident Description:
    On 20-Mar-2024 certSIGN opened a bug, 1886624 - certSIGN: Certificates with incorrect Subject attribute order (mozilla.org), with the list of certificates issued with an incorrect attribute order. This list missed one certificate, which was found by Ryan Dickson, who asked us for the reasons why that certificate had not been included in the certificate list of the initial bug.
  • Incident Root Cause(s):
    The initial list of certificates to be checked for incorrect subject attribute order was correct and complete, and these certificates had been sent to be checked by linters. Our conclusion was that for one certificate in the list, the linter response was delayed by more than 5 seconds, the timeout value encoded in the script running the check. The specialist running the bulk processing task did not observe the one-certificate difference between the number of input entries to the process and the number of responses from the linter.
  • Remediation Description:
    At the moment of the problem we used zlint and a certSIGN-developed custom linter. We have since added pkilint and pkimetal, completed by the end of November 2024.
    We improved the checking script to implement technical controls that prevent situations like this.
    The internal training on monitoring results was finalized in December 2024.
  • Commitment Summary:
    We now have a quarterly linter check of all the valid certificates issued during the last quarter, and we have added the specific linter verifications to the periodic trainings.

All Action Items disclosed in this Incident Report have been completed as described, and we request its closure.

I'll close this tomorrow - 31-Jan-2025.

Status: ASSIGNED → RESOLVED
Closed: 5 days ago
Flags: needinfo?(bwilson)
Resolution: --- → FIXED