Closed Bug 1703528 Opened 3 years ago Closed 3 years ago

Telekom Security: Key Encipherment in two ECC SAN TLS certificates

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Arnold.Essing, Assigned: Arnold.Essing)

Details

(Whiteboard: [ca-compliance] [ov-misissuance])

Two ECC SAN TLS certificates were issued with Key Usage “Key Encipherment”.
The certificates were revoked within 24 hours.
An incident report will be posted once the initial investigation actions are completed.

Assignee: bwilson → Arnold.Essing
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Whiteboard: [ca-compliance]

As part of a change to extend the CT logs (see Apple Root Program CT Policy as of April 21, 2021), the certificate templates had to be modified. In doing so, an incorrect template was referenced, resulting in the issuance of two incorrect ECC SAN TLS certificates.

1. How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.
As part of the QA of the change, we checked the first certificates after implementing the change and found an error in the keyUsage of these ECC SAN TLS certificates.

Note: Since the error only occurred with the ECC SAN TLS certificates, the other certificate types are not discussed further below.

2. A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.

Note: Times are given in CET.

2021-04-06 11:26 and 13:26 Two ECC SAN TLS certificates were issued for an internal customer
2021-04-06 13:40 Certificate testing started
2021-04-06 14:45 Result of the certificate test: the certificates had an incorrect keyUsage. Certificate issuance for ECC SAN TLS certificates was stopped.
2021-04-06 15:30 Information call with management, the product manager, the compliance and ISMS teams, and a representative of the admin team
2021-04-06 16:20 Result of the detailed error analysis: the error was caused by an incorrectly configured template
2021-04-06 18:00 The error was fixed by correcting the configuration
2021-04-06 19:00 Further call with management, the product manager, the compliance and ISMS teams, and admin team members. All aspects of this bug were critically reviewed. Since the cause of the error was clear and had been fixed, it was decided that the issuance of ECC SAN TLS certificates could be resumed the next morning.
2021-04-07 08:25 The issuance of ECC SAN TLS certificates was resumed
2021-04-07 08:43 and 08:46 Two new ECC SAN TLS certificates were issued for the internal customer
2021-04-07 08:59 The two erroneous certificates were revoked
2021-04-07 09:15 Testing of the new certificates was successfully completed

3. Whether your CA has stopped, or has not yet stopped, certificate issuance or the process giving rise to the problem or incident. A statement that you have stopped will be considered a pledge to the community; a statement that you have not stopped requires an explanation.
The issuance of ECC SAN TLS certificates was stopped temporarily on 2021-04-06 at 14:45 and resumed on 2021-04-07 at 08:25.

4. In a case involving certificates, a summary of the problematic certificates. For each problem: the number of certificates, and the date the first and last certificates with that problem were issued. In other incidents that do not involve enumerating the affected certificates (e.g. OCSP failures, audit findings, delayed responses, etc.), please provide other similar statistics, aggregates, and a summary for each type of problem identified. This will help us measure the severity of each problem.
Two certificates of one internal customer were affected:
https://crt.sh/?id=4334714631 Issued: 2021-04-06 11:26 CET Revoked: 2021-04-07 08:58 CET
https://crt.sh/?id=4335191968 Issued: 2021-04-06 13:26 CET Revoked: 2021-04-07 08:59 CET

5. In a case involving certificates, the complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem. In other cases that do not involve a review of affected certificates, please provide other similar, relevant specifics, if any.
https://crt.sh/?id=4334714631
https://crt.sh/?id=4335191968

6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.
Before the change, extensive tests were successfully completed in the test environment. During the transfer to the productive environment, an error occurred in the manual configuration of the templates that could not occur in the correctly configured test environment.
In order to be able to detect such misconfigurations, the first certificates issued in the productive environment are always additionally checked as a further QA measure after a change has been implemented. Within this QA, the error was detected.
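For illustration only, the following is a minimal sketch (in Go, standard library only) of the kind of post-issuance spot check that catches this class of error; it is not our actual QA tooling, and the file-path argument and output messages are assumptions made for the example:

// Hypothetical post-issuance spot check: parse an issued certificate and flag
// the keyUsage/key-algorithm mismatch described in this incident.
package main

import (
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"log"
	"os"
)

func main() {
	pemBytes, err := os.ReadFile(os.Args[1]) // path to an issued certificate (PEM)
	if err != nil {
		log.Fatal(err)
	}
	block, _ := pem.Decode(pemBytes)
	if block == nil || block.Type != "CERTIFICATE" {
		log.Fatal("no PEM certificate found")
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		log.Fatal(err)
	}
	// keyEncipherment describes RSA key transport and is not meaningful for an
	// EC public key; its presence indicates that an RSA profile was applied.
	if cert.PublicKeyAlgorithm == x509.ECDSA && cert.KeyUsage&x509.KeyUsageKeyEncipherment != 0 {
		fmt.Printf("FAIL: %s asserts keyEncipherment on an EC key\n", cert.Subject.CommonName)
		os.Exit(1)
	}
	fmt.Println("OK: keyUsage is consistent with the key algorithm")
}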

7. List of steps your CA is taking to resolve the situation and ensure that such situation or incident will not be repeated in the future, accompanied with a binding timeline of when your CA expects to accomplish each of these remediation steps.
To avoid errors caused by incorrect manual configuration of the templates of the productive environment, we will sharpen the description in the runbook for this purpose as an organizational measure. Furthermore, we are in discussion with the manufacturer to clarify whether additional technical measures can be taken to avoid such misconfigurations.

Update with further information
Another cause of the incorrect issuance of the two certificates was that certlint was never invoked: the incorrectly configured template bypassed the globally configured connection to certlint, so the issuance of the two incorrect certificates was not stopped by an error message from certlint.
As a technical measure, the manufacturer will change this behavior so that the globally defined connection to certlint takes effect in every case, regardless of the settings in a template.

In addition to certlint, zlint is also configured and was queried, but it only returned a NOTICE and no ERROR, so the issuance of the two certificates was not stopped on that basis either.
Regarding this behavior, we had planned to submit a pull request to the zlint project so that the corresponding lint returns ERROR instead of NOTICE. However, it appears such a pull request already exists (https://github.com/zmap/zlint/pull/479), so we will implement the new zlint version as soon as it is available.
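For illustration, a minimal sketch (in Go) of the issuance-gate policy we are aiming for, in which any finding from a lint marked as blocking stops issuance even if the linter classifies it only as a NOTICE; the lint name and the result vocabulary are assumptions made for the example, not zlint's or our vendor's actual identifiers:

// Hypothetical issuance gate: given linter results (lint name -> result string),
// block issuance if any lint marked as blocking reports a finding at all.
package main

import "fmt"

// blockingResults lists result strings that stop issuance when reported by a
// blocking lint (assumed vocabulary: notice/warn/error/fatal).
var blockingResults = map[string]bool{
	"notice": true, "warn": true, "error": true, "fatal": true,
}

// blockingLints are lints whose findings must never be downgraded; the name
// below is illustrative, not necessarily a real zlint identifier.
var blockingLints = map[string]bool{
	"e_key_usage_incorrect_for_ec_key": true,
}

func shouldBlockIssuance(results map[string]string) (bool, string) {
	for name, result := range results {
		if blockingLints[name] && blockingResults[result] {
			return true, fmt.Sprintf("lint %q returned %q", name, result)
		}
	}
	return false, ""
}

func main() {
	// Example: the pre-issuance run reported only a NOTICE for the keyUsage
	// lint; under this policy the certificate would still not be issued.
	results := map[string]string{"e_key_usage_incorrect_for_ec_key": "notice"}
	if block, reason := shouldBlockIssuance(results); block {
		fmt.Println("issuance blocked:", reason)
	}
}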

(In reply to Arnold Essing from comment #1)

As part of the QA of the change, we checked the first certificates after implementing the change and found an error in the keyUsage of these ECC SAN TLS certificates.
...
Note: Times are given in CET.

2021-04-06 11:26 and 13:26 Two ECC SAN TLS certificates were issued for an internal customer
...
Before the change, extensive tests were successfully completed in the test environment.

It seems there are some very important details that are omitted here, such as when, why, and how the change was made, and the testing that was performed.

Why did you omit these details, given their relevance in the rest of this incident report?

6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.
Before the change, extensive tests were successfully completed in the test environment. During the transfer to the productive environment, an error occurred in the manual configuration of the templates that could not occur in the correctly configured test environment.

This is a statement ("that could not occur in the correctly configured test environment") but with no supporting detail, or even a meaningful explanation of what the "error" was.

Put differently, "Something happened, but trust us, it can't happen" is how this comes across with the level of detail provided, and so more detail is needed here.

To avoid errors caused by incorrect manual configuration of the templates of the productive environment, we will sharpen the description in the runbook for this purpose as an organizational measure. Furthermore, we are in discussion with the manufacturer to clarify whether additional technical measures can be taken to avoid such misconfigurations.

Can you help me understand how this is not an example of what is called out as an unacceptable response in https://wiki.mozilla.org/CA/Responding_To_An_Incident#Keeping_Us_Informed

For example, it’s not sufficient to say that “human error” or “lack of training” was a root cause for the incident, nor that “training has been improved” as a solution. While a lack of training may have contributed to the issue, it’s also possible that error-prone tools or practices were required, and making those tools less reliant on training is the correct solution. When training or a process is improved, the CA is expected to provide specific details about the original and corrected material, and specifically detail the changes that were made, and how they tie to the issue. Training alone should not be seen as a sufficient mitigation, and focus should be made on removing error-prone manual steps from the system entirely.

This reads as "human error happened, we've improved the training and process to prevent human error". If this is the final answer, then more detail needs to be provided here, discussing exactly what went wrong and how. Alternatively, you need to more carefully examine the cause.

From Comment #2

However, it appears such a pull request already exists (https://github.com/zmap/zlint/pull/479), so we will implement the new zlint version as soon as it is available.

The pull request exists, but it's been pointed out by the ZLint maintainers that it's not a correct pull request. Are there any plans to contribute a correct pull request to ZLint? Or are you waiting until someone else does this for you?

Flags: needinfo?(Arnold.Essing)

Ryan, thanks for your constructive feedback. We recognize that the level of detail may not have been sufficient and would like to provide an update with further information and clarification.

With the aforementioned change, the certificate profiles were, on the one hand, expanded to include additional CT log entries, as required by Apple from April 21, 2021; on the other hand, certlint and x509lint were to be implemented consistently for all TLS certificate types (in addition to zlint). Since in our system the certificate profiles are configured via templates, the templates for each certificate profile had to be modified accordingly, i.e. extended with the additional entries. After the templates were modified, they were imported into the test environment and, as far as possible, successfully tested. “As far as possible” means that all configured lints were queried in the test environment when test certificates were generated. However, it was not possible to test the increase in the number of CT log entries, because only one CT log server is available in the test environment. So far so good.

The error (or rather the combination of two errors) occurred during the import of the modified and successfully tested templates into the productive environment. During the required manual assignment of the templates to the certificate types, the administrator made an incorrect assignment and did not assign the correctly modified template for EC certificates to the affected certificate type "ECC SAN TLS", but instead assigned a template for RSA certificates with an almost identical name. As a consequence, the keyUsage for RSA keys was set instead of the keyUsage for EC keys. As mentioned above, this was not the only cause of the misissuance: the incorrectly assigned template is an older template in which only zlint was configured, and as discussed above zlint only returns a NOTICE for this issue, so certificate issuance was not blocked.
In other words, the error(s) resulted from an incorrect configuration of the production environment, which is what we meant by our earlier statement that this error “could not occur in the correctly configured test environment”; that is also why we did not go into further detail about the tests performed.
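For illustration, a minimal sketch (in Go) of the kind of consistency check that could reject such a template assignment before any certificate is issued; the data structures, field names and the template name "SAN-TLS-RSA" are assumptions made for the example, not our vendor's actual data model:

// Hypothetical sanity check run after template import: verify that the keyUsage
// configured in the assigned template is plausible for the certificate type's
// key algorithm.
package main

import "fmt"

type Template struct {
	Name     string
	KeyUsage []string // e.g. "digitalSignature", "keyEncipherment"
}

type CertificateType struct {
	Name         string
	KeyAlgorithm string // "EC" or "RSA"
	Template     Template
}

func checkAssignment(ct CertificateType) error {
	for _, ku := range ct.Template.KeyUsage {
		// keyEncipherment only makes sense for RSA key transport, so an EC
		// certificate type must never be assigned a template that sets it.
		if ct.KeyAlgorithm == "EC" && ku == "keyEncipherment" {
			return fmt.Errorf("certificate type %q (%s) is assigned template %q with keyUsage %q",
				ct.Name, ct.KeyAlgorithm, ct.Template.Name, ku)
		}
	}
	return nil
}

func main() {
	// The misassignment described above: an RSA template mapped to the EC type.
	ecType := CertificateType{
		Name:         "ECC SAN TLS",
		KeyAlgorithm: "EC",
		Template:     Template{Name: "SAN-TLS-RSA", KeyUsage: []string{"digitalSignature", "keyEncipherment"}},
	}
	if err := checkAssignment(ecType); err != nil {
		fmt.Println("template assignment rejected:", err)
	}
}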

So, what can we change to prevent such errors from occurring in the future?
On the one hand, we see organizational measures to improve the quality of the manual steps that are still necessary. For this case specifically, this means that in the future the configuration of templates may only be carried out under dual control, and that changes requiring a template modification must be approved in advance by an extended change board.
On the other hand, we also see several short-term, medium-term and long-term technical measures:
As a short-term measure, we have commissioned a change to the lint configuration from the supplier. In the future, linting is to be configured in CA-wide settings, so that the lint configuration in the templates is no longer needed.
As a medium-term measure, we would like to push the adaptation of zlint, as stated in comment #2. We are therefore going to support the implementation ourselves, in this case in cooperation with D-Trust. Please note that we have actively contributed to zlint before by supporting the implementation of several lints (via our supplier “mtg”).
As a long-term measure, we are working with the supplier to improve template handling and other processes in general, so that misconfigurations are prevented as far as possible through technical measures (i.e. error-prone manual processes are to be removed or improved).

In addition, we are still considering how we can test the correct implementation in the production environment after a change. Perhaps having a test signer issue test certificates in the production environment for the final QA might be an appropriate measure. It might also be helpful to have the production CA issue test certificates that are clearly marked as such, but we do not see any way to do this in compliance with the BR. If you see a possibility here or have other good ideas, we would be very happy about helpful suggestions.

We hope that this description will bring more clarity to the situation. Even if it does not contain very much new information, we hope that the summarized description of the facts will be helpful. If there are any further questions, we will gladly answer them. Otherwise we will provide an update next week at the latest.

Flags: needinfo?(Arnold.Essing)

If you see a possibility here or have other good ideas, we would be very happy about helpful suggestions.

There is some discussion about this in Bug 1707073. Is this the level of feedback you were hoping for? It's unclear if the proposal from, say, https://bugzilla.mozilla.org/show_bug.cgi?id=1707073#c8 and https://bugzilla.mozilla.org/show_bug.cgi?id=1707073#c9, is what you were asking about.

We hope that this description will bring more clarity to the situation.

Indeed, this is hugely useful, and I think Comment #4 goes much further than the original Comment #1 / Comment #2, by giving a lot of concrete areas for improvement, their complexities, and their risks. I think the more future incident reports can incorporate details like Comment #4, the better as a whole. In particular, it helps move our understanding from "Doing the minimum required" to "Thinking and acting systemically", which is critical to understanding both individual incidents and patterns of incidents, and assuaging any concerns.

Flags: needinfo?(Arnold.Essing)

Update
As of today, our software manufacturer has provided a new update that, among other things, changes the behavior of our software so that the globally defined connection to the linters takes effect in every case, regardless of the settings in a template. We will start the corresponding testing in the test environment very soon.
Thanks for pointing out the information in https://bugzilla.mozilla.org/show_bug.cgi?id=1707073#c8 and #9 about the potential options to perform tests in the production environment. We are currently examining the extent to which these can be sensibly implemented in our architecture.
Regarding the changes to zlint (https://github.com/zmap/zlint/pull/479), we are in close contact with D-Trust and our software manufacturer, and work is in progress.
We will get back to you with another update next week at the latest.

Flags: needinfo?(Arnold.Essing)

The CA software update, which will centrally enforce the linters, is currently being tested in the test environment and is expected to be deployed to the production environments in week 21 or 22.
About the potential options to perform tests in the production environment, we are in ongoing discussions with our software manufacturer.
Regarding the update of zlint, the code has been overhauled and successfully tested in cooperation with D-Trust. The corresponding pull request on GitHub should follow very soon.
We will get back to you with another update next week.

Since Arnold Essing is currently not available, please allow me to answer on his behalf.

We have looked at the suggestions from bug 1707073 referenced in comment #5 of this bug and discussed them both internally and with our software vendor. Our announced update for week 21/22 goes in basically the same direction, i.e. the goal is to ensure that certlint and zlint are queried every time a certificate is issued, without exception.
However, our question went in a different direction: which tests can we perform in the production environment after an update that go beyond the certlint and zlint checks? In principle, we consider the checks by certlint and zlint to be fundamental, but there are properties which cannot be checked by them (such as the URLs in the CRL-DP, AIA, etc.) or are not checked (such as, in this case, an incorrect keyUsage that zlint did not flag as an error). One idea would be, for example, to initially issue only test certificates for certain of our own test domains after an update in the production environment and to clearly mark these as test certificates. The philosophy behind this is that the very first certificates after an update represent a final QA step (not to be understood as the aggressive testing (“very scary”) mentioned in https://bugzilla.mozilla.org/show_bug.cgi?id=1707073#c8). However, due to the unambiguous and clear marking as test certificates, the consequences of a misissuance with regard to security (and compliance?) could potentially be reduced to an acceptable level. We are aware that there are currently no provisions in the Baseline Requirements allowing such methods, and our thinking in this direction is not yet finished. So we would welcome ideas in this direction (or in entirely different directions) and will also seek out conversations with other CAs.

Regarding the software update and the pull request for zlint, work is in progress and we will provide further information next week at the latest.

I asked about this some time ago and the answer was that any misissuance should be reported, even if it is the result of a test.

The CA software update to centralize the configuration of linters has been deployed to the production environment of this PKI service as of today. The CA instances affected by https://bugzilla.mozilla.org/show_bug.cgi?id=1711432 will follow in week 23.
Regarding the update of zlint, a new pull request has been created: https://github.com/zmap/zlint/pull/599

We have further discussed our idea of “test certificates in the production environment”, both internally and with another CA, and concluded that this approach has been the subject of public discussion before (as Pedro already indicated) and is not desirable. We are therefore dropping this idea and will focus on ensuring proper linting of all certificates.

Thanks Arnold, and thanks Pedro for Comment #9, which is indeed correct :)

Yesterday, the CA software update to centralize the configuration of linters was also deployed to the remaining CA instance affected by https://bugzilla.mozilla.org/show_bug.cgi?id=1711432
Please let us know if more information is needed. Otherwise, we would consider this incident resolved.

I just want to note that Comment #3 explicitly called out https://wiki.mozilla.org/CA/Responding_To_An_Incident#Keeping_Us_Informed (although for a different reason), while Comment #12 came 12 days after Comment #10's update, and there has been no subsequent update in the 10 days since.

I realize Telekom Security views this incident as closed, but this has also been repeatedly addressed on a number of CAs' recent incidents, which suggests that either Telekom Security is not following those incidents, or has not implemented the necessary controls to ensure regular updates, even after Comment #6 / Comment #7 / Comment #8 / Comment #10 indicated positive trends in meeting expectations.

While I think it would be good for Telekom Security to detail its processes and procedures for ensuring it monitors both its own and other CAs' bugs and incidents, and ensures regular progress until closed or addressed via Next-Update, I think the mitigating factors above mean that we don't need to block closure of this bug on this disclosure. Sending to Ben to see if he'd like to close it.

Flags: needinfo?(bwilson)

I will close this next Wed. 23-June-2021, unless someone notifies me here that there are other issues still to resolve.

Please note that Comment #10 reported the completion of the software update and then only referenced the pending remediation of https://bugzilla.mozilla.org/show_bug.cgi?id=1711432, due to the parallelism of the remediation steps. In https://bugzilla.mozilla.org/show_bug.cgi?id=1711432#c5 itself it was stated that the next update was planned to be provided in 2 weeks. Admittedly, this information should have been included in this bug as well.
Apart from that, we understood https://wiki.mozilla.org/CA/Responding_To_An_Incident#Keeping_Us_Informed to require (as a “should”) updates within a maximum of 1 week when there are open questions or progress to be made on remediation steps, but not necessarily afterwards. Telekom Security monitors and evaluates all bugs, as stated in https://bugzilla.mozilla.org/show_bug.cgi?id=1705791#c25, and is aware that this interpretation of the requirement “has also been repeatedly addressed on a number of CAs' recent incidents” (e.g. https://bugzilla.mozilla.org/show_bug.cgi?id=1711147#c5, https://bugzilla.mozilla.org/show_bug.cgi?id=1705480#c8, https://bugzilla.mozilla.org/show_bug.cgi?id=1706950#c4), while we only know of one post in which the specific expectation to generally provide weekly updates (even after completion of all remediation steps) is stated, i.e. https://bugzilla.mozilla.org/show_bug.cgi?id=1705791#c24. It seems we interpreted this “expectation” too softly and not as a strict requirement, since it does not match our interpretation of https://wiki.mozilla.org/CA/Responding_To_An_Incident#Keeping_Us_Informed. Sorry for that.

Please consider the above as an explanation only and not as the start of an argument, as from now on we will make sure to always provide updates at least once a week, even after we consider our remediation to be completed.

Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Flags: needinfo?(bwilson)
Resolution: --- → FIXED
Product: NSS → CA Program
Whiteboard: [ca-compliance] → [ca-compliance] [ov-misissuance]