D-TRUST: Precertificate OU > 64 Characters

Status: ASSIGNED
Type: task

People

(Reporter: enrico.entschew, Assigned: enrico.entschew, NeedInfo)

Tracking

trunk

Firefox Tracking Flags: (Not tracked)

Details

(Whiteboard: [ca-compliance])

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0

Steps to reproduce:

  1. How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.

2019-07-05, 04:29 UTC: Internal quality assurance noticed the error

  2. A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.

2019-07-05, 04:29 UTC: Internal quality assurance noticed the error / Start Incident
2019-07-05, 04:45 UTC: Issuing stopped
2019-07-05, 06:30 UTC: Start investigating the error
2019-07-05, 09:00 UTC: Short-term measures identified to prevent further errors; the measures were put into effect, the validation team was trained, and the certificate request was corrected.
2019-07-05, 10:10 UTC: Issuing restarted
2019-07-05, 10:35 UTC: Production of the certificate

  3. Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem. A statement that you have will be considered a pledge to the community; a statement that you have not requires an explanation.

The CA stopped production after detecting the error. Production was resumed after corrective action was taken.

  4. A summary of the problematic certificates. For each problem: number of certs, and the date the first and last certs with that problem were issued.

No final certificates were affected; only the precertificates for a single certificate were.

  5. The complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem.

https://crt.sh/?id=1638856360
https://crt.sh/?id=1638856369

  6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

In order to avoid incorrect certificate issuance, various automated checks were implemented, and in this case they successfully prevented the creation of a real SSL certificate. The checks were deliberately carried out as close as possible to the actual production step in order to detect any errors in processing. However, this ordering means that incorrect precertificates can still be produced. In the specific case, the organizationalUnitName field of the subject exceeded 64 characters.
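
The following is a minimal, illustrative sketch in Go (not D-TRUST's actual tooling; all names and limits shown are assumptions for illustration) of a pre-issuance check that enforces the RFC 5280 upper bound of 64 characters on organizationalUnitName and related subject attributes. Run before any precertificate is built, such a check would reject a request like this one outright.

```go
package main

import (
	"crypto/x509/pkix"
	"fmt"
)

// Upper bounds from RFC 5280, Appendix A (selection).
const (
	ubOrganizationalUnitName = 64
	ubOrganizationName       = 64
	ubCommonName             = 64
)

// checkSubjectLengths returns one error per subject attribute that exceeds
// its upper bound. A CA pipeline would abort the request, before any
// precertificate is created or logged, if the returned slice is non-empty.
func checkSubjectLengths(subject pkix.Name) []error {
	var errs []error
	for _, ou := range subject.OrganizationalUnit {
		if n := len([]rune(ou)); n > ubOrganizationalUnitName {
			errs = append(errs, fmt.Errorf("OU %q is %d characters, limit is %d", ou, n, ubOrganizationalUnitName))
		}
	}
	for _, o := range subject.Organization {
		if n := len([]rune(o)); n > ubOrganizationName {
			errs = append(errs, fmt.Errorf("O %q is %d characters, limit is %d", o, n, ubOrganizationName))
		}
	}
	if n := len([]rune(subject.CommonName)); n > ubCommonName {
		errs = append(errs, fmt.Errorf("CN %q is %d characters, limit is %d", subject.CommonName, n, ubCommonName))
	}
	return errs
}

func main() {
	subj := pkix.Name{
		CommonName: "example.com",
		OrganizationalUnit: []string{
			"An organisational unit name that is clearly much longer than the permitted sixty-four characters",
		},
	}
	for _, err := range checkSubjectLengths(subj) {
		fmt.Println("reject request:", err)
	}
}
```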

  7. List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.

We have realised that we also want to prevent the production of defective precertificates in the future. To this end, we will reassess the timing of the implemented checks from this perspective and add additional checks.
Until then, additional manual testing measures have been established and the validation team has been trained.

The text above will also be posted at mozilla.dev.security.policy

Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Assignee: wthayer → enrico.entschew
Type: defect → task
Summary: D-TRUST: incorrect precertificate → D-TRUST: Precertificate OU > 64 Characters
Whiteboard: [ca-compliance]

Enrico: This appears to describe the incident on the basis of internal QA noticing the error. It doesn't explain why these errors happened. Are you saying you were performing QA tests using your production CA?

I'm also not sure about the proposed mitigations. It sounds like you'll log precertificates later, which is good - they should always be the last step of your operations - but it also sounds like you lacked basic X.509 controls for your certificates.

Given that other CAs have encountered this issue in the past, I think it's important to discuss why D-TRUST was not following such discussions and why it had not implemented similar controls. That itself represents a failure, in that CAs should constantly be aware of where things can go wrong, by monitoring and examining their systems against all incident reports, whether from their own CA or others.

Flags: needinfo?(enrico.entschew)

Ryan: We have never used nor do we use our production system for QA testing.

About the current issue: Our X.509-controls worked well, so that at no point in time did we produce a defective productive TLS certificate. But we produced two defective pre-certificates. To prevent the aforementioned situation from happening again, we will continue to improve our control procedures.

Flags: needinfo?(enrico.entschew)

Enrico: Thanks for confirming you were not doing QA in production.

However, this incident report does not demonstrate a clear understanding of the root cause, nor does it adequately describe the proposed next steps, especially as they relate to that root cause. Obviously, all CAs are expected to "improve [their] control procedures", but the point of the incident report is to provide detailed descriptions about how they will do so, on the basis of demonstrating a clear understanding of the issue.

As mentioned, this is an issue that has been discussed in the past, on m.d.s.p. and in the context of other CA incident reports. Have you examined such reports - for example, by searching m.d.s.p. for discussion of precertificates? Were you following m.d.s.p. when such discussions occurred? Are you monitoring Bugzilla for other CAs' incident reports, in order to proactively identify opportunities for improvements or risks?

These certificates are not marked revoked, which means that, at present, there's still an incomplete immediate response. Given the confusion so far in this issue, I anticipate that the response would likely be that only pre-certificates were issued, not actual TLS certificates, therefore your CA software doesn't permit you to revoke them. If that's the case, that's also a topic that has been discussed on m.d.s.p. and other CA incidents, with the clear expectation that CAs do mark such certificates as revoked.

To that end, I'm hoping you can review the existing and past discussions. I encourage you to proactively look through m.d.s.p. and past CA compliance issues for the discussions yourself, precisely because awareness of them is an expectation already expressed within Mozilla's requirements for CAs, and because doing so demonstrates the ability of a CA to stay on top of evolving changes in the industry. If you find yourself unable to find the relevant discussion, you can say so, although please be aware that this would then also become a matter requiring a meaningful remediation and mitigation plan going forward.

Flags: needinfo?(enrico.entschew)

Ryan: This is not the final report but a status update on our analysis.

According to our internal procedures we did a thorough analysis of this case to effectively prevent similar cases in the future. As a result, we have determined that in our application processing system for retail certificates the X.509 control for pre-certificates has actually failed, contrary to our original assumption. The OU our customer requested from registration personnel exceeded the maximum length for the field. The field length check was not effective in this special use case. This was determined to be the root cause of the failure. The production of the final TLS certificate was successfully prevented.

Effective today (2019-07-12), 12:00 CEST, we shut down the affected application processing website until further notice. New applications are no longer accepted. The website will remain down until a full review and correction of the control procedures has been completed and released. If and when the affected website is reactivated, this will be announced here in advance.

We are following the ongoing discussions on m.d.s.p., on Bugzilla and at the CA/B Forum. However, we now realize that we have not given some of the issues discussed there the attention they deserve, and as a result we have not valued the role of pre-certificates to the degree that is the consensus here. We resolve to correct this.

A next update will follow as soon as we have achieved further results.

Flags: needinfo?(enrico.entschew)

Quick update on the status of the defective pre-certificates.
Today (2019-07-15), 08:54 CEST, both defective pre-certificates were revoked.
The status can be checked here:
https://crt.sh/?id=1638856360
https://crt.sh/?id=1638856369

A next update will follow as soon as we have achieved further results.

Thanks Enrico. This is good progress to see.

When do you anticipate the next update? Responding to an Incident Report highlights the importance of ensuring reports are "accompanied with a timeline of when your CA expects to accomplish these things."

I realize you're still investigating, but understanding a timeline for progress on that investigation is important. Do you anticipate it to take longer than a week?

Flags: needinfo?(enrico.entschew)

Ryan: Your assumption is correct. As mentioned before, we shut down the affected application processing website. So, certificates can no longer be requested using this channel. However, we are still investigating and checking the effort it takes to correct the control features of the affected application processing website. I expect that next week we will be able to say more about our next steps. On Wednesday (2019-07-24) I will publish the final report.

Flags: needinfo?(enrico.entschew)
Whiteboard: [ca-compliance] → [ca-compliance] Next Update - 24-July 2019

This is the final incident report.

  1. How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.

2019-07-05, 04:29 UTC: Internal quality assurance noticed the error

  2. A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.

2019-07-05, 04:29 UTC: Internal quality assurance noticed the error / Start Incident
2019-07-05, 04:45 UTC: Issuing stopped
2019-07-05, 06:30 UTC: Start investigating the error
2019-07-05, 09:00 UTC: Short-term measures identified to prevent further errors; the measures were put into effect, the validation team was trained, and the certificate request was corrected.
2019-07-05, 10:10 UTC: Issuing restarted
2019-07-05, 10:30 UTC: Production of the certificate with correct OU field length
2019-07-10, 12:30 UTC: Start of thorough analysis according to internal problem management procedures
2019-07-12, 09:30 UTC: Management decision to shut down the affected application processing website
2019-07-12, 10:00 UTC: Shut down of affected application processing website
2019-07-12, 12:40 UTC: Informing the Conformity Assessment Body about the issue
2019-07-15, 06:54 UTC: Revocation of defective pre-certificates
2019-07-18, 16:00 UTC: Management decision to permanently shut down the application processing system for PTC retail certificates
2019-07-19, 07:10 UTC: Shut down of affected application processing system
2019-07-23, 14:00 UTC: End of thorough analysis according to internal problem management procedures
2019-07-24: Final incident report

  3. Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem. A statement that you have will be considered a pledge to the community; a statement that you have not requires an explanation.

The CA stopped production after detecting the error. Production was resumed after corrective action was taken. As a result of our thorough analysis, we shut down the affected application processing website; new applications are no longer accepted. As part of the ongoing analysis, it was later decided to shut down the application processing system for PTC retail certificates as well. Certificates are no longer produced via this system. This application processing system is considered a legacy system and remains shut down for good.

  4. A summary of the problematic certificates. For each problem: number of certs, and the date the first and last certs with that problem were issued.

Problem: Precertificate OU > 64 characters
Number of affected certificates: 2
Issuing date of first certificate: 2019-07-04
Issuing date of last certificate: 2019-07-04

  5. The complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem.

https://crt.sh/?id=1638856360
https://crt.sh/?id=1638856369

  6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

We have determined that in our application processing system for retail certificates, the X.509 control for pre-certificates did in fact fail. The OU our customer requested via registration personnel exceeded the maximum length for the field, and the field length check was not effective in this particular use case. This was determined to be the root cause of the failure. The production of a final TLS certificate containing the error was successfully prevented.

The affected application processing system predates the submission of pre-certificates to CT logs; this functionality was added at a later stage. The last quality gate, which caught the error, acted right before the issuing of the final certificate but did not cover the newly implemented issuing of pre-certificates (see the sketch after this answer).

We have not valued the role of pre-certificates to the degree that is the consensus here.
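
To illustrate the ordering issue described above, here is a hedged sketch (hypothetical function names, not the actual D-TRUST system) of an issuance pipeline in which every check runs before the precertificate is built and submitted to CT, so a request that fails validation can never leave a defective precertificate behind.

```go
package main

import (
	"errors"
	"fmt"
)

// Request is a simplified stand-in for a certificate request.
type Request struct {
	SubjectOU []string
}

// validateRequest stands in for the full set of X.509 controls
// (field lengths, allowed characters, profile conformance, ...).
func validateRequest(req Request) error {
	for _, ou := range req.SubjectOU {
		if len([]rune(ou)) > 64 {
			return errors.New("subject OU exceeds 64 characters")
		}
	}
	return nil
}

// Hypothetical downstream steps.
func buildAndLogPrecertificate(req Request) error { return nil } // submit the precertificate to CT logs
func issueFinalCertificate(req Request) error     { return nil } // sign the final certificate

// issue places the quality gate before the precertificate step, not only
// before the final certificate, so a rejected request produces neither.
func issue(req Request) error {
	if err := validateRequest(req); err != nil {
		return fmt.Errorf("request rejected before any (pre)certificate was created: %w", err)
	}
	if err := buildAndLogPrecertificate(req); err != nil {
		return err
	}
	return issueFinalCertificate(req)
}

func main() {
	tooLong := Request{SubjectOU: []string{
		"An organisational unit name that is clearly much longer than the permitted sixty-four characters",
	}}
	fmt.Println(issue(tooLong))
}
```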

  7. List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.

We shut down the affected application processing website; new applications are no longer accepted. We shut down the application processing system as well, so certificates are no longer produced via this system. The application processing system is considered a legacy system and remains shut down for good.
We have also reassigned reading duties so that incident reports on Bugzilla and m.d.s.p. are followed.

Enrico: Thank you for the final incident report.

What is being done to ensure the timely detection and revocation of pre-certificates in the future?

Flags: needinfo?(enrico.entschew)
Whiteboard: [ca-compliance] Next Update - 24-July 2019 → [ca-compliance]

Wayne:
First of all, we've increased our monitoring frequency to ensure a more timely detection of possibly problematic (pre-)certificates.

As opposed to the now defunct legacy application system, the currently used automated application processing system for enterprise customers utilizes full X.509 controls before the creation of the pre-certificate. These controls catch errors of the type we encountered in this case.

Additionally, we have established an incident handling procedure for revoking pre-certificates in a timely fashion, in case they have not resulted in fully working TLS certificates.

The currently used automated application processing system for enterprise customers has been developed and improved in accordance with new development and testing procedures we set up at the end of 2017. These procedures were designed to prevent such issues more effectively.

Kim Nguyen, CEO, D-Trust
