Closed Bug 1610303 Opened 2 years ago Closed 1 year ago

D-TRUST: Issuance of non-conformant SSL certificate

Categories

(NSS :: CA Certificate Compliance, task)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: enrico.entschew, Assigned: enrico.entschew)

Details

(Whiteboard: [ca-compliance] - Next Update - 2-Oct-2020)

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0

Steps to reproduce:

This is a preliminary report.

1.) How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.

On 2020-01-20 at 08:51 UTC, one certificate with S/MIME attributes was wrongly issued by the SSL intermediate CA D-TRUST SSL Class 3 CA 1 2009. The customer revoked the certificate shortly after issuance. Thanks to our internal alerting systems, the case came to our attention on 2020-01-20 at 08:59 UTC.

2.) A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.

2020-01-20, 08:51 UTC: Issuance of one certificate (https://crt.sh/?id=2347922247) with S/MIME attributes by the SSL intermediate CA D-TRUST SSL Class 3 CA 1 2009
2020-01-20, 08:53 UTC: Revocation of the certificate by customer
2020-01-20, 08:59 UTC: Stop of the production line and start of the error investigation
2020-01-20, 09:36 UTC: Correction of configuration and verification
2020-01-20, 10:21 UTC: Restart of production line

3.) Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem. A statement that you have will be considered a pledge to the community; a statement that you have not requires an explanation.

D-TRUST stopped production and corrected the configuration.

4.) A summary of the problematic certificates. For each problem: number of certs, and the date the first and last certs with that problem were issued.

Number of affected certificates: 1
Issuing date of first certificate: 2020-01-20
Issuing date of last certificate: 2020-01-20

5.) The complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem.

All affected certificates can be found here:
https://crt.sh/?id=2347922247

6.) Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

Due to a misconfiguration, it was possible to issue non-conformant certificates for a short period. We corrected the configuration and began a thorough analysis to avoid further errors. We will provide our findings in the final report.

7.) List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.

D-TRUST corrected the configuration. Further steps will be reported in the final report, which we expect to deliver by end of business on Thursday, 2020-01-24.

Assignee: wthayer → enrico.entschew
Status: UNCONFIRMED → ASSIGNED
Type: defect → task
Ever confirmed: true
Whiteboard: [ca-compliance]
Flags: needinfo?(enrico.entschew)
Whiteboard: [ca-compliance] → [ca-compliance] Next Update - 24-January 2020

This is the final report.

1.) How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.

On 2020-01-20 at 08:51 UTC, one certificate with S/MIME attributes was wrongly issued by the SSL intermediate CA D-TRUST SSL Class 3 CA 1 2009. The customer revoked the certificate shortly after issuance. Thanks to our internal alerting systems, the case came to our attention on 2020-01-20 at 08:59 UTC.

2.) A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.

2020-01-20, 08:30 UTC: Simultaneous installation and activation of a set of configuration changes for several D-TRUST PKI products on the production system
2020-01-20, 08:51 UTC: Issuance of one certificate (https://crt.sh/?id=2347922247) with S/MIME attributes by the SSL intermediate CA D-TRUST SSL Class 3 CA 1 2009
2020-01-20, 08:53 UTC: Revocation of the certificate by customer
2020-01-20, 08:59 UTC: Stop of the production line and start of the error investigation
2020-01-20, 09:36 UTC: Correction of configuration and verification
2020-01-20, 10:21 UTC: Restart of production line
2020-01-22, 10:48 UTC: Notification of the Conformity Assessment Body about the issue
2020-01-23, 11:30 UTC: Completion of the thorough analysis and release of an action plan

3.) Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem. A statement that you have will be considered a pledge to the community; a statement that you have not requires an explanation.

D-TRUST stopped production and corrected the configuration.

4.) A summary of the problematic certificates. For each problem: number of certs, and the date the first and last certs with that problem were issued.

Number of affected certificates: 1
Issuing date of first certificate: 2020-01-20
Issuing date of last certificate: 2020-01-20

5.) The complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem.

All affected certificates can be found here:
https://crt.sh/?id=2347922247

6.) Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

Due to a misconfiguration, it was possible to issue non-conformant certificates for a short period. The misconfiguration was introduced as part of a larger configuration change covering several PKI products (product life cycle) on the production system. The configuration of the SSL production line contained an error that allowed non-conformant data to be accepted.

After the change, one customer applied for a certificate. On receiving and checking the certificate, the customer recognized that it had been wrongly issued and revoked it. At the same time, our post-issuance linting system raised a notification, and the PKI surveillance team stopped the production system.

We corrected the configuration and began a thorough analysis to avoid further errors. We discovered that the error was introduced through human error: one configuration attribute for producing S/MIME certificates was wrongly transferred to the configuration set for SSL certificates. The error was not caught during several checks because a large number of attributes had been changed at the same time and the attribute names were similar.
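To illustrate the class of defect (a hypothetical sketch, not D-TRUST's actual configuration schema; all product and attribute names here are invented), a simple profile consistency check can catch an S/MIME-only value that strays into an SSL profile:

```python
# Hypothetical profile consistency check. The product lines, attribute
# names, and allowed values below are invented for illustration only.

# Attribute values permitted per product line (illustrative).
ALLOWED_ATTRIBUTES = {
    "ssl": {"extended_key_usage": {"serverAuth", "clientAuth"}},
    "smime": {"extended_key_usage": {"emailProtection"}},
}

def check_profile(product: str, profile: dict) -> list[str]:
    """Return violations: attribute values not allowed for this product line."""
    violations = []
    allowed = ALLOWED_ATTRIBUTES[product]
    for attr, values in profile.items():
        extra = set(values) - allowed.get(attr, set())
        for value in sorted(extra):
            violations.append(f"{attr}={value} not permitted in {product} profile")
    return violations

# An S/MIME-only EKU wrongly copied into the SSL profile is flagged:
bad_ssl_profile = {"extended_key_usage": ["serverAuth", "emailProtection"]}
print(check_profile("ssl", bad_ssl_profile))
```

A check of this shape runs against the whole configuration set at once, so it does not depend on a human spotting one wrong value among many similarly named attributes.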

7.) List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.

D-TRUST corrected the configuration. No further certificates with a similar error can be produced.

Here is our action plan:

  1. We already carried out additional training of all members of the configuration team.
  2. As of now, if changes are necessary, we limit the number of attributes changed at the same time to an amount that can be reviewed on one screen.
  3. We are working on a process redesign to improve our extended check routines of configuration changes. We expect this work to be finished within the first half of 2020.
Flags: needinfo?(enrico.entschew)

Thank you Enrico. Please provide regular updates on the status of the "process redesign to improve our extended check routines of configuration changes".

Flags: needinfo?(enrico.entschew)
Whiteboard: [ca-compliance] Next Update - 24-January 2020 → [ca-compliance]

Thanks Enrico! I want to highlight my appreciation for this as a better-than-most report, in that it meaningfully looks at the systemic issue, which is too much information for effective review, and then proposes remediation to address that. I also appreciate the rapid response in which action was taken, roughly 8 minutes judging by the timeline.

While at this point I'm highly skeptical that additional training serves to mitigate or prevent issues, I'm quite keen to understand more about your process redesign, because that does sound meaningful and relevant.

I understand that not all ideas may work out, or work as well as expected, and so it's preferable to provide an update once things are deployed and known to be working, but I'm hoping you can share some thoughts about what your process redesign involves.

That said, it does seem like pre-issuance linting might have served to detect this quicker than post-issuance linting, preventing the original incident by eliminating the time from issuance ( 2020-01-20, 08:51 UTC ) to detection (sometime before 2020-01-20, 08:59 UTC, based on when production was halted). Is that a mitigation you considered with respect to this?

Ryan, we appreciate the chance to openly discuss our planned process redesign. Our focus is on several organizational changes, but also on some technical changes that support them. Next week, I will share a first draft of our considerations.

Independently of this, we have put the already planned and scheduled second generation of our pre-linting system into operation today. This will enable us to prevent incidents of the kind described in this report.

Flags: needinfo?(enrico.entschew)
Whiteboard: [ca-compliance] → [ca-compliance] - Next Update - 7-February 2020

As announced last week, here is the first draft of our considerations. I will start with the organizational changes. These steps focus on improving awareness during configuration changes.

  1. Any planned manual configuration changes have to be made in small steps. The number of attributes changed at the same time is limited to an amount that can be reviewed on one screen.
  2. No planned configuration changes are allowed to be done shortly before end of business.
  3. No planned configuration changes are allowed on a working day followed by a public holiday or a weekend.
These organizational improvements are already in place.

Currently, configuration changes are entered attribute by attribute via a frontend interface. We determined that when many attributes are affected, there is a chance that entry errors go unrecognized by the configuration specialist and the quality assurance manager. To address this, we want to implement the following technical improvements:

  1. Loading the current configuration as a configuration file from the production system.
  2. Adding changes by the development team.
  3. Checking the configuration file in the testing system.
  4. Checking the configuration file in the reference system.
  5. Installing the modified new configuration file on the production system.

Small, limited changes are still possible via the frontend interface by the configuration specialists. However, large-scale systematic changes can be made more safely using a configuration file and a suitable editor that visualizes changes in a before-after comparison. We believe this will help prevent errors from going unnoticed.
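The before-after comparison described above can be sketched with Python's standard difflib; the file names and attribute syntax here are illustrative only, not D-TRUST's actual configuration format:

```python
# Sketch of a before-after comparison of configuration files using the
# standard-library difflib. File names and attribute syntax are invented.
import difflib

current = """\
profile=ssl
extended_key_usage=serverAuth,clientAuth
key_length=2048
"""

# Proposed change set: an S/MIME-only EKU value has slipped in by mistake.
proposed = """\
profile=ssl
extended_key_usage=serverAuth,clientAuth,emailProtection
key_length=2048
"""

diff = list(difflib.unified_diff(
    current.splitlines(keepends=True),
    proposed.splitlines(keepends=True),
    fromfile="production.conf",
    tofile="proposed.conf",
))

# The stray emailProtection value surfaces as a single -/+ pair,
# instead of being one field among many re-entered in a frontend form.
print("".join(diff))
```

Reviewing a diff of the whole file confines the reviewer's attention to what actually changed, which is the property the frontend's attribute-by-attribute entry lacked.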

Enrico, Thanks again for a fantastic level of detail that really shows a solid understanding and approach, and reflects discussions seen with other CA incidents. The technical approach described really seems to reflect an emerging good practice.

Do you have a sense of timeframes involved here for these steps to be implemented?

Flags: needinfo?(enrico.entschew)
Whiteboard: [ca-compliance] - Next Update - 7-February 2020 → [ca-compliance]

Ryan, we have done our calculations. I have received feedback that the technical change is planned for mid-June.

Flags: needinfo?(enrico.entschew)
QA Contact: wthayer → bwilson
Whiteboard: [ca-compliance] → [ca-compliance] - Next Update - 15-June 2020

I have been informed by our development department that completion of the planned fix will be delayed. Unfortunately, this is due to pandemic-related adjustments to working arrangements. In the coming week I will publish more detailed information on how the completion schedule has been adjusted. In the meantime, the organizational changes already in place prevent a repetition of the error until the technical solution can be introduced.

I apologize for the delay.

As announced last week, the completion of the planned fix will be delayed. I'm still waiting for news on that; as soon as I get it, I will post it here, by the end of business this week at the latest.

The fix will be part of the next regular update release. The following schedule has been agreed with the development department:
Installation and system commissioning in the reference system: 2020-09-10
Installation and system commissioning in the production system: 2020-10-01

Weekly Update: The same as last week.

Weekly Update: The same as last week.

Enrico: As this is a rather significant delay, I'm hoping you can be a bit more transparent about the challenges that have led to this delay. What are the pandemic-related adjustments, and how do they relate? The more detailed the answer, the easier it becomes to understand the challenges here, while also helping better understand what priorities are being selected.

For example, I think we'd rightfully be concerned if "the pandemic delayed our target of getting 1,000 new customers, so we're not starting compliance activities until our marketing team believes we'll have new signups". I'm hoping that's just a made-up example, but sharing what activities are ongoing that have contributed to this delay is a useful exercise.

Flags: needinfo?(enrico.entschew)

SARS-CoV-2 (COVID-19) hit Germany in March 2020. As in many countries around the world, Germany tried, and still tries, to limit the number of victims as much as possible.

In this context, one action taken by the management of D-TRUST was that all areas which did not necessarily require a presence on site were transferred to home-based work. Employees in these affected areas are only allowed to enter the premises of D-TRUST in exceptional cases and for a short period of time. Production and security areas are exempted. Even though infection numbers are low, Germany continues with some restrictions.

The pandemic and the associated lockdown meant that the prerequisites for fully remote work first had to be created for many departments. This led to delays of several weeks in our software development. As a consequence, the planned fix became part of the next release.

We have now gained enough experience with working from home. Therefore, we think this kind of delay will not occur again.

Flags: needinfo?(enrico.entschew)

Weekly Update: There is nothing new to report. We continue working on the issue.

Thanks Enrico. I was hoping for a bit more detail: when you shared in February, in Comment #7, it seemed things were on track. It wasn't until June, two days before the deadline, in Comment #8, that things seemed to go off-track. I understand from Comment #14 that, in March, you began to re-prioritize development, but why wasn't there an update letting us know of the delays and/or uncertainty?

It seems reasonable to be concerned that, come September 8, we might hear about a similar delay two days before the deadline, and I'd like to avoid that scenario. It also seems like the tasks outlined in Comment #5 are readily available in common, off-the-shelf CA software. I'm hoping you can share more about why the challenges require three months of development work, and help build a better understanding about the development and release lifecycle. Even if the delay was "several weeks", as mentioned in Comment #14, the overall delay seems to be several months, and so I'm trying to square that away.

Flags: needinfo?(enrico.entschew)
Whiteboard: [ca-compliance] - Next Update - 15-June 2020 → [ca-compliance] - Next Update - 31-Aug 2020

Ryan: I would like to explain the underlying reasons in more detail. Our update cycles are quarterly, and for each cycle a number of changes or new features is planned. This is our standard procedure.

The fix was planned to be part of the June update. Unfortunately, the transfer from office work to work at home (COVID-19) caused the delay. Development worked until the last minute to get the fix into the June update; in the end, however, they decided to postpone it because the expected quality could not be met. This is why I informed the Bugzilla community so late.

Because of the precautions we took, I’m confident that there is no danger of further delays.

Flags: needinfo?(enrico.entschew)

We are on track, so there are no further delays.

The following is still planned:
Installation and system commissioning in the reference system: 2020-09-10
Installation and system commissioning in the production system: 2020-10-01

Whiteboard: [ca-compliance] - Next Update - 31-Aug 2020 → [ca-compliance] - Next Update - 2-Oct-2020

All deployments in the reference and production systems were completed successfully and on schedule. We have achieved the goal set out in Comment #5. As of today, the standard procedure for configuration changes is as described in Comment #5.

Flags: needinfo?(bwilson)

I believe that this bug can be closed, unless there are additional issues or questions to be raised/discussed. I plan to close this bug on or about 9-October-2020.

Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Flags: needinfo?(bwilson)
Resolution: --- → FIXED