Closed Bug 1620922 Opened 5 years ago Closed 5 years ago

GlobalSign: Untimely revocation of TLS certificate after submission of private key compromise

Categories

(CA Program :: CA Certificate Compliance, task)


RESOLVED FIXED

People

(Reporter: arvid.vermote, Assigned: arvid.vermote)

Details

(Whiteboard: [ca-compliance] [leaf-revocation-delay])

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0

This is a preliminary incident report concerning an abuse report submitted to us on Friday, March 6th, reporting the compromise of a private key.

The weekend duty staff did not recognize the initially submitted evidence as sufficient proof of compromise and engaged further with the reporter to obtain other acceptable evidence.

The initially submitted evidence (a CSR containing custom information) did, however, provide strong evidence of private key compromise. This was detected during further review and discussion of the case, and the corresponding certificate (https://crt.sh/?id=2522275549) was ultimately revoked on Sunday, March 8th.

Once we have concluded our internal investigation, we will submit a full incident report.

Assignee: wthayer → arvid.vermote
Status: UNCONFIRMED → ASSIGNED
Type: defect → task
Ever confirmed: true
Whiteboard: [ca-compliance] [delayed-revocation-leaf]

1. How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.

We became aware of the problem on 08/03/2020, when one of our compliance employees was reviewing open certificate problem reports. The compliance employee noted an issue with the handling of a certificate problem report about a private key compromise submitted on 06/03/2020 at 21:48 GMT.

The certificate problem report was flagged as needing more information / evidence of compromise by the on-duty technical support agent, whereas it actually contained sufficient evidence to confirm compromise of the reported private key (a CSR with unique information referencing the compromise / case). Since revocation did not happen within 24 hours, this certificate problem report was not handled in line with section 4.9.1.1 of the Baseline Requirements.
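To illustrate why such a CSR is strong proof of compromise: a CSR is self-signed with the subject's private key, so anyone who can produce a CSR whose public key matches the certificate — and whose subject contains unique, report-specific text — must hold the private key. A minimal sketch with the openssl CLI (hypothetical file names; the report does not describe GlobalSign's actual verification tooling, and here the key itself stands in for the certificate's key):

```shell
# Stand-in for the allegedly compromised key (hypothetical filenames throughout).
openssl genrsa -out compromised.key 2048 2>/dev/null

# The reporter creates a CSR signed with that key, embedding unique
# case-specific text in the subject so the proof is tied to this report.
openssl req -new -key compromised.key \
    -subj "/CN=Proof of key compromise - report of 2020-03-06" \
    -out proof.csr

# The CA checks the CSR's self-signature: only the private key holder
# could have produced it.
openssl req -in proof.csr -noout -verify

# ...and confirms the CSR's public key matches the certificate's public key
# (here compared against the key pair directly as a stand-in).
openssl req -in proof.csr -noout -pubkey > csr.pub
openssl rsa -in compromised.key -pubout -out expected.pub 2>/dev/null
diff csr.pub expected.pub && echo "public key matches"
```

If both checks pass, the reporter demonstrably controls the certificate's private key, which is why the initial submission already met the bar for revocation.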

2. A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.

  • Submission of initial certificate problem report: 06/03/2020 21:48 GMT
  • Request to reporter for additional information: 06/03/2020 22:21 GMT & 06/03/2020 22:59 GMT
  • First of several additional reminders sent to the reporter to submit acceptable evidence: 07/03/2020 14:26 GMT
  • Confirmation from reporter that sufficient evidence of key compromise was available in the initial certificate problem report: 08/03/2020 00:42 GMT
  • Identification of error by compliance team and instructions for revocation issued: 08/03/2020 18:56 GMT
  • Revocation of certificate: 08/03/2020 20:12:32 GMT

3. Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem. A statement that you have will be considered a pledge to the community; a statement that you have not requires an explanation.

N/A given the nature of the bug.

4. A summary of the problematic certificates. For each problem: number of certs, and the date the first and last certs with that problem were issued.

One problematic certificate: https://crt.sh/?id=2522275549

5. The complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem.

One problematic certificate: https://crt.sh/?id=2522275549

6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.
There are two elements related to the issue at hand:

  1. Evidencing a private key compromise through submission of a CSR was not defined as an acceptable method.
  2. An incorrect decision tree was in place for technical support agents responding to reported private key compromises. Rather than escalating to the compliance team when an unknown method of evidencing key compromise was used, technical support agents were directed to ask the reporter to submit evidence using one of the known and accepted mechanisms.

7. List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.

After internal discussion it was decided to require immediate escalation of any future private key compromise report to the compliance team. These employees have a different level of PKI knowledge and are better suited to handle this type of report. The process change will be completed by March 20, 2020.

Arvid: Thanks for providing this.

A few follow-up questions:

The method of evidencing a private key compromise through submission a CSR was not defined as an acceptable method.

Could you detail what methods are/were accepted, and where that information could be/is found?

Rather than escalating to the compliance team if an unknown method of evidencing key compromise was used the technical support agents were directed to asking the reporter to submit evidence using one of the known and accepted mechanisms.

There have been a string of issues in the CA ecosystem in the past several months that share this similar lack of escalation path. Could you describe what other incidents (if any) GlobalSign is aware of and has been following, and what the evaluation has been of those proposed mitigations?

Flags: needinfo?(arvid.vermote)

Could you detail what methods are/were accepted, and where that information could be/is found?

By acceptable methods we mean the documented methods technical support agents use to confirm that a key compromise certificate problem report represents a legitimate compromise, or to determine that they should ask the reporter for more information.

The instructions for technical support agents analyzing an initial submission detailed the following methods through which a key compromise could be reported:

  • Submission of the private key
  • Submission of a signed message
  • Providing us links to evidence or documentation that the private key has been compromised
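The second of the methods above — a signed message — can be sketched with the openssl CLI. The idea is that the reporter signs a unique, report-specific challenge with the allegedly compromised key, and the CA verifies the signature against the certificate's public key. (Hypothetical file names and challenge text; here a freshly generated key pair stands in for the compromised certificate key.)

```shell
# Demo key pair standing in for the compromised certificate key.
openssl genrsa -out leaked.key 2048 2>/dev/null
openssl rsa -in leaked.key -pubout -out leaked.pub 2>/dev/null

# The reporter signs a unique, report-specific challenge with the key...
echo "key compromise report - crt.sh 2522275549 - 2020-03-06" > challenge.txt
openssl dgst -sha256 -sign leaked.key -out challenge.sig challenge.txt

# ...and the CA verifies the signature with the certificate's public key.
openssl dgst -sha256 -verify leaked.pub -signature challenge.sig challenge.txt
# prints "Verified OK"
```

Tying the challenge text to the specific report prevents a signature produced for one purpose from being replayed as proof in another.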

We have a specific discussion room to discuss any of the above cases when they come in and guidance is needed. We have had cases in the past where a method other than the above was used (e.g. a mobile app binary that contained a private key of a TLS certificate) and guidance was provided by the compliance team, who are also part of that discussion room.

That step of seeking guidance from the compliance team when an unknown method was used was, however, not formally defined (and required) in the case handling procedure for technical support agents. Because the report under discussion arrived during non-business / weekend hours, nobody from the compliance team stepped in to provide that guidance. The technical support staff therefore reverted to the procedure and sent a template email asking the reporter for proof of compromise using a known method.

There have been a string of issues in the CA ecosystem in the past several months that share this similar lack of escalation path. Could you describe what other incidents (if any) GlobalSign is aware of and has been following, and what the evaluation has been of those proposed mitigations?

We have been watching the incidents related to certificate problem reports and have put the following additional measure in place based on those incidents and on internal audits and risk assessments. Our 24/7 Security / Compliance Operations Center is now independently alerted when a certificate problem report is received; it follows up with and monitors the technical support team to ensure the response times set forth in Baseline Requirements section 4.9.5 are met and, if the report is confirmed, that the certificate is revoked within the deadlines set in Baseline Requirements section 4.9.1.

However, since the case status was "unconfirmed", the Security / Compliance Operations Center did not escalate the certificate problem report under discussion, as according to its monitoring parameters no conditions had yet been breached. This flaw will not be present in the future process given the changes we are making, which also cover the SOC escalating certificate problem reports to the compliance team when they relate to key compromise, and monitoring the compliance team for timely responses according to Baseline Requirements section 4.9.5.

As a side note, based on previous feedback, we have now integrated our internal event system with Bugzilla. Any CA-Compliance bug creates a ticket in our Security / Compliance Operations Center. Every ticket must be handled, analyzed, and responded to by the compliance officer responsible for the affected compliance area, who must evaluate whether we are affected and/or should implement additional controls to prevent the issue from occurring in our environment.

Flags: needinfo?(arvid.vermote)

Thanks Arvid. I really appreciate the level of detail provided here, and understanding the other controls in place (e.g. monitoring CA compliance bugs). This is exactly the kind of quality detail we look for in incident reports, whether initial or final, as it helps build a better picture of how the CA handles compliance.

That said, I think it's useful to look beyond 'just' key compromise, and more holistically look at your approach to Certificate Problem Reports. The response provided rings similar to that provided by Entrust in Bug 1611241, which doesn't appear to be part of the list of responses reviewed (based on the request for specific bugs). I think the concerns captured with that response are equally applicable here, and bear careful consideration.

Wayne: Setting N-I if you have further questions.

Flags: needinfo?(wthayer)

Thank you Ryan. We have dedicated channels for receiving certificate problem reports; however, internal discussion led to the conclusion that it is not feasible for compliance employees to be the first line in processing those reports:

  • 24/7 monitoring of the problem reporting channels is required. Given the scarcity of PKI compliance talent and the resource cost, it is not viable to build a 24/7 function with compliance employees. We instead rely on 24/7 availability of our technical support team and a 24/7 Security / Compliance Operations Center capability (staffed by security analysts, who are not the same as (PKI) compliance employees), and have compliance employees on call 24/7.
  • Roughly 90% of the traffic coming in through these channels is noise/spam/malicious. We consciously decided not to apply automatic filtering to potential problem reports, as the nature of some legitimate messages makes them highly likely to be flagged / filtered. We rely on the technical support team and the Security / Compliance Operations Center to filter out that traffic and escalate only the legitimate problem reports to the compliance team.

Arvid: Again, thank you for the level of detail in your responses, as it continues to be a good model here.

In particular, the statement that roughly "90% of traffic coming in through these channels is noise/spam/malicious" is exactly the kind of useful data that can help improve things. Understanding what could be done to reduce that noise/spam/malicious traffic, or understanding its types and sources, is incredibly useful and might be worth thinking about. Having drunk from the firehose of both Google security bugs and Chrome security bugs, I fully appreciate just how truly awful some reports can be, and so I can appreciate not wanting to treat everything like an emergency while still being prepared to handle emergencies. I'd like to suggest that it might be worth following up on this by analyzing the sort of traffic that GlobalSign sees, and potentially other CAs see, to figure out whether there are systematic ways to better separate signal from noise for the set of problems affecting the CA industry.

It appears that all questions have been answered and remediation is complete.

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Flags: needinfo?(wthayer)
Resolution: --- → FIXED
Product: NSS → CA Program
Whiteboard: [ca-compliance] [delayed-revocation-leaf] → [ca-compliance] [leaf-revocation-delay]
You need to log in before you can comment on or make changes to this bug.