Closed Bug 1655698 Opened 1 year ago Closed 10 months ago

Telekom Security: CRL also contained unrevoked certificates

Categories

(NSS :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Arnold.Essing, Assigned: Arnold.Essing)

Details

(Whiteboard: [ca-compliance])

Attachments

(1 file)

35.69 KB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet

During a maintenance change, an unplanned, event-triggered CRL was issued. This CRL was incorrect: it mistakenly contained entries for certificates that had not been revoked.
No revocation lists were supposed to be created during the change. The plan was to create a revocation list after the database update.
Note: Although the incorrect CRL contained entries for certificates that had not been revoked, all certificates that really were revoked were included correctly. There was no security issue in the sense of revoked certificates remaining usable.

1. How your CA first became aware of the problem, and the time and date.
2020-07-23 06:22 CEST: We were informed about an incident from our internal management center.

2. A timeline of the actions your CA took in response.
2020-07-22 07:54 CEST Start of the change, beginning with preparing actions.
2020-07-22 12:50 CEST Access to the customer portal has been blocked.
2020-07-23 03:56 CEST Unplanned, event-triggered issuance of a new CRL. This CRL was incorrect.
2020-07-23 06:22 CEST We were informed about an incident from our internal management center.
2020-07-23 07:00 CEST First conference call about this incident. The ongoing change was assumed to be related to the error.
2020-07-23 07:10 CEST Start of verification actions to confirm the possible relationship.
2020-07-23 08:10 CEST The faulty routine was found and stopped. Start of mitigating actions.
2020-07-23 08:33 CEST Erroneous database records fixed.
2020-07-23 08:48 CEST Publication of a new CRL as a planned task of the change.
2020-07-23 09:45 CEST Further external tests completed. Status queries for all certificates (the affected ones and all others) return correct results.
2020-07-23 11:08 CEST Evaluation of the affected customers, preparation of the customer letter.
2020-07-23 14:40 CEST Customer letter sent.
2020-07-24 15:15 CEST 24 h safeguarding phase (awaiting customer feedback) completed. The QA task of the change was closed.
2020-07-24 15:34 CEST Access to the customer portal was re-enabled.

3. Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem. A statement that you have will be considered a pledge to the community; a statement that you have not requires an explanation.
No certificates could be issued while the customer portal was blocked.

4. In a case involving certificates, a summary of the problematic certificates. For each problem: the number of certificates, and the date the first and last certificates with that problem were issued. In other incidents that do not involve enumerating the affected certificates, please provide other similar statistics, aggregates, and a summary for each type of problem identified. This will help us measure the severity of each problem.
There were no misissued certificates. 907 valid (that is, not revoked) certificates from “TeleSec ServerPass Class 2 CA” were mistakenly included in the CRL.

5. The complete certificate data for the problematic certificates.
Serial numbers of the affected certificates are attached.

6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.
Within the planned change, a software error caused faulty data to be passed to a clean-up script. As a result, the script mistakenly deleted active customer accounts that should not have been deleted. As a consequence of this deletion, the certificates of the affected customer accounts were marked for inclusion in the revocation list, and an unplanned, event-triggered CRL containing them was issued. The plan had been to issue the revocation list only after the database update within the change was complete.

7. List of steps your CA is taking to resolve the situation and ensure that such situation or incident will not be repeated in the future, accompanied with a binding timeline of when your CA expects to accomplish each of these remediation steps.
Planned modification of the software: automated scripts will no longer delete customer accounts directly. Accounts are first deactivated, then checked, and only then released for deletion.
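
As a rough illustration of the staged deletion described here, the account lifecycle could be modeled so that a script cannot jump from "active" straight to "deleted". This is a sketch under assumptions, not the vendor's implementation; the state names and the `transition` helper are hypothetical.

```python
from enum import Enum


class AccountState(Enum):
    ACTIVE = "active"
    DEACTIVATED = "deactivated"
    RELEASED = "released_for_deletion"
    DELETED = "deleted"


# Allowed transitions (hypothetical): deletion is only reachable after an
# account has been deactivated, checked, and explicitly released.
ALLOWED = {
    AccountState.ACTIVE: {AccountState.DEACTIVATED},
    AccountState.DEACTIVATED: {AccountState.ACTIVE, AccountState.RELEASED},
    AccountState.RELEASED: {AccountState.DELETED},
    AccountState.DELETED: set(),
}


def transition(current: AccountState, target: AccountState) -> AccountState:
    """Return the new state, or raise ValueError if the step skips a stage."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

With this shape, a faulty script that tries `transition(AccountState.ACTIVE, AccountState.DELETED)` fails loudly instead of silently removing an active account, and a wrongly deactivated account can still be reactivated.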

Assignee: bwilson → Arnold.Essing
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Whiteboard: [ca-compliance]

I'm concerned here with the approach.

The incident wasn't the accidental revocation, but the intentional(?) unrevocation, of certificates. This can definitely cause compatibility issues with a number of systems.

I think that, upon realizing you accidentally revoked certificates, you would have replaced them. I'm concerned that the system supports unrevocation, as this seems like it poses potential ongoing security risk.

I'm hoping a more thorough analysis of the factors that contributed to things going wrong here will be carried out. I understand this incident report to be "mistakes happened", and the response to questions 6/7 to be focused on remediating that mistake, but I don't really see a systemic evaluation of how those mistakes were able to happen. Put differently, the question isn't just about "What checks do you have in place now", but about trying to understand "Why didn't you have those checks in the first place? What were the factors that led to this being overlooked?"

Flags: needinfo?(Arnold.Essing)

Root cause analysis
A detailed root cause analysis was carried out involving our specialists and the developers from the CA- and Frontend-software manufacturer. The root causes were found through collaboration with the manufacturer who provided a detailed report of their investigations.
Besides the software bug that was responsible for the deletion of certain customer accounts (and thus for the corrupted database that led to the incorrect CRL), an incorrect configuration parameter resulted in the bug not being detected in the test environment. The time parameter that triggers the cleanup of customer accounts was set too high in the test environment. Because of this, the effect of the incorrect cleanup did not occur during the testing phase. In the production environment, this parameter was set to the proper value.
The manufacturer will provide a fix for this bug which will be installed after being thoroughly tested. Until then the cleanup script will be deactivated to ensure that this problem does not occur again.
We are currently improving the QA of our test plans to prevent incorrect configuration settings in the test environment that may lead to problems of this type.
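
One way to catch this kind of drift between test and production is to compare the behavior-relevant parameters before a test run is accepted. The sketch below is illustrative only: the parameter name `cleanup_after_days`, its values, and the `config_drift` helper are assumptions, not details from the manufacturer's report.

```python
def config_drift(test_cfg: dict, prod_cfg: dict, must_match: set) -> dict:
    """Return {key: (test_value, prod_value)} for keys that must match but differ."""
    drift = {}
    for key in sorted(must_match):
        test_value = test_cfg.get(key)
        prod_value = prod_cfg.get(key)
        if test_value != prod_value:
            drift[key] = (test_value, prod_value)
    return drift


# Hypothetical values: the test environment waits far longer before cleanup,
# so the cleanup path is never exercised during testing.
test_cfg = {"cleanup_after_days": 3650, "crl_interval_hours": 24}
prod_cfg = {"cleanup_after_days": 90, "crl_interval_hours": 24}

for key, (t, p) in config_drift(test_cfg, prod_cfg, {"cleanup_after_days", "crl_interval_hours"}).items():
    print(f"{key}: test={t} prod={p}")
```

A non-empty result would be a reason to block the change, or at least to document why the two environments are allowed to differ.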

Possible compatibility issues
To check whether there were any further problems, we established a 24-hour safeguarding phase to help our customers in case of problems and to collect feedback from them. During this phase, the problems reported by our customers were resolved. Since then, no more problems have been reported, but we have noticed that the entries of the affected certificates (CRL Status) in crt.sh are still shown as revoked.

No support of unrevocation
Our CA management software is not capable of unrevoking certificates, therefore there is no security issue during regular operations.
For this case, we had to repair the corrupted database by restoring the relevant tables from the database backup which was created as one of the first tasks of the change.
By doing so we were able to restore the database to the correct status. This was not critical, because there were actually no certificates revoked and as described above no certificates issued during the change.
Different teams needed to be involved in the restoration of the database. On the technical side, the database team of the data center was involved along with our CA administrators. The restoration process had to be confirmed by management and has been documented in our audit-proof incident management system.
It is not possible for an individual (e.g. a CA administrator) to unrevoke certificates unnoticed by manipulating the database, because the implemented audit features raise an alarm in our monitoring system.

The software bug was fixed by the manufacturer. The provided update was tested in our test environment and was installed in the production environment on 2020-08-04. The test plans were optimized in coordination with the manufacturer.

Thanks for the added details! This is quite helpful, and a good consideration for inclusion "by default" in future incident reports.

I think a remaining question is still that when the database was hand-restored, the certificates that were not "supposed" to be revoked were still, in fact, unrevoked. I realize this was exceptional and as part of an incident response, but I think that still raises concerns for an entry appearing on a CRL and then subsequently disappearing. You can see the issues it causes for relying parties, such as crt.sh, which accurately treat these certificates as revoked.

There are elements of this story that strike me as very similar to incidents that happened re: DigiNotar and India CCA, both CAs that found themselves distrusted after certificates were not correctly detected for revocation. While there are indeed substantive differences, based on the information available, I think it's not unreasonable to be concerned that if Telekom Security encountered database corruption in the future, they could "roll back" changes such as records of issuance or revocation.

As part of this incident: I would expect the certificates that were "accidentally" revoked to be really revoked. Once that error was made, the certificates needed to remain revoked, even if the database issues were later resolved. That's one of the inviolate expectations of a CA: revocations only get added to a CRL, and they don't get removed until the certificate has expired.
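
The add-only expectation can be checked mechanically by comparing two consecutive CRLs. The following is a minimal sketch that assumes the serial numbers and certificate expiry dates have already been extracted; CRL parsing is omitted, and `illegally_removed` and the serial numbers are hypothetical.

```python
from datetime import datetime, timezone


def illegally_removed(prev_serials, new_serials, not_after, now):
    """Return serials dropped from the CRL although their certificate is unexpired.

    prev_serials / new_serials: sets of revoked serials from two consecutive CRLs.
    not_after: dict mapping serial -> certificate expiry (timezone-aware datetime).
    A serial with unknown expiry is treated as unexpired, to err on the safe side.
    """
    dropped = prev_serials - new_serials
    return {s for s in dropped if s not in not_after or not_after[s] > now}


now = datetime(2020, 7, 23, 9, 0, tzinfo=timezone.utc)
expiry = {
    "0A01": datetime(2021, 3, 1, tzinfo=timezone.utc),  # still valid at `now`
    "0A02": datetime(2020, 1, 1, tzinfo=timezone.utc),  # already expired at `now`
}
suspicious = illegally_removed({"0A01", "0A02", "0A03"}, {"0A03"}, expiry, now)
print(sorted(suspicious))
```

Any serial reported by such a check indicates an entry that disappeared while relying parties could still have been depending on it.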

Flags: needinfo?(stefan.kirch)

To make sure that there are no compatibility or other problems in the future, we will ask our customers to request new certificates by September 30, 2020. Because most of the affected customers are quite small and it is currently the holiday period in Germany, our customers need some time. Immediately afterwards we will revoke all affected certificates.

Flags: needinfo?(Arnold.Essing)

As planned, the last of the 907 affected certificates were revoked on October 1, 2020.

Flags: needinfo?(stefan.kirch)

Ben: This is an awkward situation. T-Systems revoked certificates, then unrevoked them, then took 2 months to "re"-revoke them (correct the error?). It's not clear to me whether we should treat this as a revocation delay (due to the two months taken to correct the "un"revocation), or whether we should treat this in line with "unrevocation is Super Bad to begin with".

I'm also not sure that the concerns in Comment #4 were entirely addressed, nor that Comment #2 really provides any useful details for the CA community to avoid a similar mistake. However, even with those details, I would say this issue remains deeply troubling, and I'm not sure that there's much reassurance as a third-party that such events were even possible.

Having recently been re-reading the Black Tulip report, this does strike me as remarkably similar, and while this case was asserted to be "benign" and a "bug", the fact that it was even possible is concerning.

Flags: needinfo?(bwilson)

I agree that the remaining focus should be on the un-revocation when the database was restored, even though this incident arose out of errors in vendor-provided software and then the attempt to correct the database. The software bug and its fix were explained in Comment #2, but the last event in the timeline provided so far in item 2 of Comment #0 was 2020-07-24. There was some information provided in Comment #5, but then "the last of the 907 affected certificates were revoked on October 1, 2020". I'd like to know what happened when T-Systems discovered that un-revocation occurred or would occur, and then what happened during the months of August and September as it evaluated its decision. It would also be good for T-Systems to evaluate its decision-making processes "under pressure" and explain improvements that it is making to those processes as a result of this experience.

Flags: needinfo?(bwilson)

Hello Ben, hello Ryan,
Please allow me to answer on Arnold’s and Stefan’s behalf since they are currently not available. As for the content, I will try to address Ben’s three questions as well as Ryan’s concerns about the possibility of un-revocations.

As you rightfully pointed out, the decision to un-revoke was rushed and made under pressure. Due to this, not all the necessary persons had been involved in time and the decision to restore the database (and thereby technically un-revoke the certificates) was made based on the prioritisation of customer interests. Had the compliance team been part of the decision-making, it would have advised against the “un-revocation” in order to be on the safe side compliance-wise. However, even the compliance team was not entirely sure whether this un-revocation should be handled as such, since the original revocation itself was not intended and not based on policy-reasons (no compromise of private key, no revocation request, no misuse etc.). Due to this uncertainty and the fact that the database had already been fixed before the compliance team was informed, it was decided not to revoke the certificates again and handle the incident as an erroneous database instead of a non-compliant un-revocation. Nonetheless, we decided to open this bug and provide an incident report because of the remaining uncertainties and to potentially receive further guidance on how to handle this special situation.

In August and September, the primary focus was on the technical aspects, such as fixing the bug in the software and evaluating further improvements to prevent such errors from happening in the future (e.g. the cleanup routine should automatically stop after reaching a specific threshold within one run to limit potential damage). Additionally, we evaluated our decisions to un-revoke and to not re-revoke and concluded that they were wrong and that, as Ryan stated, a final revocation of the affected certificates should be carried out to prevent any potential compatibility issues. We also realised that, from the point of view of the community/third parties, we actually did perform an un-revocation in a technical sense, and that the lack of intent behind the accidental revocation is not verifiable by an outsider and does not justify this un-revocation. However, since there was no policy reason (no compromise, misuse etc.) requiring the certificates to be revoked within a certain time, we at least wanted to give our customers enough time for a smooth transition, as we described in comment #5. Since there was no disagreement, we assumed that this approach was acceptable and would not do any harm to the community/third parties.
Regarding the potential ability to un-revoke certificates, we understand your concerns, but thought we had addressed them in comment #2, where we explained that, due to the organisational and technical measures in place, different teams and the management were required to carry out this un-revocation. It is not possible in regular operation. It is not supported by the software. And it most certainly has not occurred in the past. However, it seems that you are still concerned and that this topic is still a point to be addressed, which is why we also started discussions with our software vendor on further technical measures that shall entirely prevent the possibility of an “un-revocation” via database manipulation or via other work-arounds in the future.
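
To illustrate the per-run threshold mentioned above, a cleanup routine could carry a hard limit and abort once it is reached. This is only a sketch of the idea, not the vendor's design; `run_cleanup`, `CleanupLimitExceeded`, and the `deactivate` callback are hypothetical names.

```python
class CleanupLimitExceeded(RuntimeError):
    """Raised when a single cleanup run would exceed its configured limit."""


def run_cleanup(candidates, deactivate, max_per_run):
    """Process candidates one by one and abort once max_per_run is reached.

    Returns the number of accounts processed if the run completes normally.
    """
    processed = 0
    for account in candidates:
        if processed >= max_per_run:
            # Stop and leave the remaining accounts untouched for manual review.
            raise CleanupLimitExceeded(
                f"stopped after {processed} accounts; limit is {max_per_run}"
            )
        deactivate(account)
        processed += 1
    return processed
```

The limit does not prevent a faulty candidate list, but it bounds the damage of one run and turns an unusually large batch into an explicit error that someone has to look at.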

We also want to assure you that we do understand (as Ryan clearly and explicitly explained in the context of the “OCSP-EKU in Sub-CA-certificates”-issue) that our actions should not only consider our own interests but also always consider the interests of the webPKI community/third parties (and rate those as highly important). We hope to improve our decision-making process (under pressure or not) by increasing awareness for this attitude in all teams. Also, our compliance team (or core team of the requirements management) has been strengthened by two new members at the beginning of this year to improve our requirements management in general as well as to provide a central point of competence regarding compliance. However, due to this relatively new function of the core team currently being established, the team has not been considered in the incident response process yet. This has now been officially corrected and communicated to all teams.

Flags: needinfo?(bwilson)

Two remaining issues are brought to mind by comment #9:

  1. it is stated that you have "also started discussions with our software vendor on further technical measures that shall entirely prevent the possibility of an “un-revocation” via database manipulation or via other work-arounds in the future." What are the results of those discussions? Have design changes taken place to prevent un-revocation?
    and
  2. it is also stated, "our compliance team (or core team of the requirements management) has been strengthened by two new members at the beginning of this year". Thus, throughout 2020, how many personnel were devoted to compliance matters? Has the number of compliance team members stayed consistent? Have more members been added since the beginning of 2020? Has there been any turnover? How have these team members been trained, and how are they staying up-to-date on industry requirements of the CA/B Forum and Mozilla Policy? Specifically related to this issue, what documentation has been prepared and distributed to personnel regarding violation of RFC 5280 for un-revocation? What training is provided to personnel that is related to system changes resulting from 1) above?
Flags: needinfo?(bwilson)

1. Further technical measures to prevent “un-revocation” via database manipulation or via other work-arounds in the future
There were intensive discussions, both internally and with the software vendor. Several approaches have been discussed so far. However, we have not found a convincing technical way to completely prevent un-revocation at database level. Additional measures would only make it even more difficult/complex to perform an un-revocation than it already is right now. Should we be blind to the obvious or should the community have any ideas, we will gladly hear them out.
Irrespective of additional technical options, we are confident that there will not be another un-revocation in the future. This confidence is based on what we have learned from this bug, on the resulting management requirement, and on the already existing technical and organisational conditions (as stated in the comments above).

2. Compliance core team
Since the beginning of 2020, the root/compliance team has consisted of a total of four people, i.e. in addition to the two already existing members of the root team, the team has been strengthened by two further colleagues from another team at the Telekom Trust Center. One of the two new colleagues has been working at the Telekom Trust Center for 25 years and has over 20 years of PKI experience, while the other colleague has been working at Telekom Security for three years.
These two colleagues were deliberately added to the team: on the one hand, they bring many years of know-how from the operation of various (regulated) PKIs, and on the other hand, they have an unbiased view of things, so that they can critically question entrenched opinions and processes. This core team gets further support from other experienced staff (product managers, internal auditor, ISMS, technicians, …).
In 2020, the two new colleagues looked at all the requirements of the CA/B Forum (BR, EV-GL, Netsec Requirements), the relevant ETSI standards, the root programs as well as the relevant RFCs and other requirements from scratch and used them to create a new Certificate Policy for the Trust Center (approved by the auditor last week).
To stay up to date, the root/compliance team has subscribed to the relevant mailing lists and coordinates its activities on a weekly basis in a regular video conference. In addition to this, all new bugs in Bugzilla are considered in the weekly conference in order to also derive possible bugs and improvements from the errors of other CAs.
In the discussion on this specific bug, which involved the root/compliance team, those responsible for the affected service, the internal auditor, and all managers from all teams, it was decided that in future the compliance team should always be involved in the event of unclear decisions and that backups and database corrections should only be permitted after consultation and clarification of possible complications. In addition to raising employee awareness, the management team also issued clear instructions in this regard to all colleagues concerned.

(In reply to Jan Völkel from comment #9)

As you rightfully pointed out, the decision to un-revoke was rushed and made under pressure. Due to this, not all the necessary persons had been involved in time and the decision to restore the database (and thereby technically un-revoke the certificates) was made based on the prioritisation of customer interests. Had the compliance team been part of the decision-making, it would have advised against the “un-revocation” in order to be on the safe side compliance-wise.

Now, you try to assure the community that you only put customer interests over the interests of the WebPKI as a whole because the decision was made under pressure and without involvement of the compliance team. But, with full involvement of your compliance team and without any pressure, you, in https://bugzilla.mozilla.org/show_bug.cgi?id=1655698#c5, again, very severely, put customer interests over the interest of the WebPKI by intentionally letting this situation, which would not have been there if you had put the interests of the WebPKI over customer interests in the first place, go on for another 6 weeks. I think there thus can be no trust at all in your statements.

(In reply to Arnold Essing from comment #11)

Thanks for the update. I believe this matter can be closed and will do so on or about next Wed. 3-Feb-2021 unless there are other reasons presented.

Flags: needinfo?(bwilson)
Status: ASSIGNED → RESOLVED
Closed: 10 months ago
Flags: needinfo?(bwilson)
Resolution: --- → FIXED