Closed Bug 1665763 Opened 4 years ago Closed 4 years ago

Sectigo: Failure to revoke within 5 days

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rich, Assigned: rich)

Details

(Whiteboard: [ca-compliance] [leaf-revocation-delay])

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4260.0 Safari/537.36 Edg/87.0.637.0

Steps to reproduce:

As per comment #45 in Bug 1645686, Sectigo has failed to revoke two batches of reported mis-issued certificates within the 5-day period required by the BR. The certificates in these batches were reported to us on September 11. Some were revoked today, 1 day late, and the rest will be revoked on or before Sunday, September 20. I will provide more detail in a full incident report tomorrow, September 18.

Assignee: bwilson → rich
Status: UNCONFIRMED → ASSIGNED
Type: defect → task
Ever confirmed: true
Whiteboard: [ca-compliance] [delayed-revocation-leaf]
  1. How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.

On the evening of Friday, September 11, 2020 Sectigo received a problem report into our problem reporting queue at sslabuse@sectigo.com detailing 3 separate batches of certificates which had been issued with invalid Subject:stateOrProvinceName (ST) fields. These batches were in the form of two separate files containing separate lists of certificates, as well as a small list of certificates in the body of the email message.

  2. A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.

Early on Sept. 12 our agents began investigating and processing the certificates listed in the body of the message and the list contained in one of the files, and submitted them to the batch processing system in order to ensure revocation within the timeframe allowed by the BR. During this time our agents also notified the affected subscribers so that the problem certificates could be replaced with corrected certificates prior to revocation. The 2nd file, containing another list of problem certificates, went unnoticed at this time due to a default display setting in the problem reporting ticketing system. More on that below.

  3. Whether your CA has stopped, or has not yet stopped, certificate issuance or the process giving rise to the problem or incident. A statement that you have stopped will be considered a pledge to the community; a statement that you have not stopped requires an explanation.

The problem is ongoing: the final batch of certificates has not yet been revoked, in order to allow time to properly notify subscribers and for them to replace the affected certificates. This final batch will be revoked on or before Sept. 20, 2020.

  4. In a case involving certificates, a summary of the problematic certificates. For each problem: the number of certificates, and the date the first and last certificates with that problem were issued. In other incidents that do not involve enumerating the affected certificates (e.g. OCSP failures, audit findings, delayed responses, etc.), please provide other similar statistics, aggregates, and a summary for each type of problem identified. This will help us measure the severity of each problem.
    Batch 1 (604 certificates, from the first file): all revoked within the timeframe allowed by the BR
    Batch 2 (5 certificates, from the body of the email): revoked 1 day beyond the 5-day window allowed by the BR
    Batch 3 (124 certificates, from the second file): to be revoked on or before Sept. 20, 2020

  5. In a case involving certificates, the complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem. In other cases not involving a review of affected certificates, please provide other similar, relevant specifics, if any.

  6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

There were two mistakes made:

  • When the first two batches were submitted for bulk processing, the first batch, consisting of the 604 certificates in the first file, was correctly input. The 2nd batch, consisting of the 5 certificates listed in the body of the problem report email message, was not picked up by the person inputting the certificates for bulk processing. This seems to have been a communication error: the person who sent the certificates for bulk input kept these two batches listed separately, but the person receiving the request thought that they had been combined.

This error was noticed and the 5 certificates were revoked on Sept. 17, one day late, during the process of generating this incident report.

  • The other mistake was the omission of the 2nd file, listing 124 certificates, from all initial processing and action. This happened because, unbeknownst to anyone, the default view setting of the ticketing system hid this 2nd file from the agent who initially handled the report. For this reason he simply was not aware that the 2nd file existed.

This error came to light when, on the evening of Sept 15, as the 5 day window for revoking these certificates was drawing near to a close, the department manager undertook a review of this incident to ensure that everything had been correctly handled. He was not using the ticketing system default view, and therefore was able to see the 2nd file. He immediately notified his manager as well as the incident response team and began the investigation and processing of this previously undetected list of problem certificates. These certificates will be revoked on or before Sept. 20.

  7. List of steps your CA is taking to resolve the situation and ensure that such situation or incident will not be repeated in the future, accompanied with a binding timeline of when your CA expects to accomplish each of these remediation steps.

We have made sure that the default view setting of our ticketing system shows ALL attachments to all incoming messages. In order to minimize communication errors between the teams we have given instructions that in cases where the problem report might contain multiple lists as in this case, the first step in processing should be to combine those separate lists to ensure that all subsequent investigation and processing handles all reported certificates.
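As an illustration only, the "combine the lists first" step could be sketched as below. This is a minimal sketch, not Sectigo's actual tooling: the function name `combine_reported_lists` and the sample identifiers are hypothetical, and it assumes each reported certificate is named by a string identifier (for example a serial number or fingerprint).

```python
def combine_reported_lists(*batches):
    """Merge several lists of certificate identifiers into one list.

    Identifiers are normalised (trimmed, lower-cased) and de-duplicated,
    keeping the order in which they were first seen.
    """
    seen = set()
    combined = []
    for batch in batches:
        for ident in batch:
            key = ident.strip().lower()
            if key and key not in seen:
                seen.add(key)
                combined.append(key)
    return combined


# Hypothetical sample data standing in for the two files and the email body.
file_one = ["AA01", "AA02"]
email_body = ["aa02 ", "BB01"]
file_two = ["CC01"]

all_reported = combine_reported_lists(file_one, email_body, file_two)
```

Working from a single merged list means a later step cannot silently skip one of the sources, which is the failure described for the 5 certificates in the email body.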

Rich: It's not clear to me why the overlooked batch wasn't revoked on Sept 15, when Sectigo was aware of this. That is, while it's understandably unfortunate Sectigo made a mistake in overlooking this second batch, what's not clear to me is why it deliberately delayed another 5 days. I can speculate reasons; for example, Sectigo might not have wanted to revoke, to save itself the embarrassment of having to acknowledge that the lack of customer notification was its fault. Or, worse, Sectigo might have made a decision that it doesn't revoke until it's notified customers, which would appear to put situational factors above and beyond its documented policies and practices. These, of course, are merely speculation, but would be troubling.

I'm hoping you can discuss more about why, when you knew about them, you didn't take the corrective action that Sectigo publicly commits to all relying parties that it does and will do.

Flags: needinfo?(rich)

In response to Ryan Sleevi in comment 2

Ryan, we acknowledge that this delay in revocation is a violation of the BR and of our CPS. As you know, the compliance team has been very focused on getting our audit back on track and characterizing the condition of Subject:stateOrProvinceName field data in our OV certificate base which, as you can see from my most recent post on bug 1645686, is substantial. With these powerful distractions the compliance team failed to keep the pressure on our customer service team to revoke these certificates in the specified timeframe. We don’t mean that as an excuse but simply as a frank explanation, which we believe is your desire.

As we seek to create a completely compliant operation, we have to look for incremental progress as a sign that our efforts are working. In this case a system bug prevented the agent, who otherwise acted promptly, correctly and in good faith, from taking action on one batch of certificates contained in a report that was itself a bit confusing, containing as it did 3 separate batches all in slightly different formats. In the past there is a good chance this oversight would have gone unnoticed until it was published by the reporting party here. In this case our internal review found the oversight and we took action to correct it. We will continue to learn from our mistakes and make further improvements.

Flags: needinfo?(rich)

Rich: Is my understanding that you don't have a system in place to automatically guarantee these certificates are revoked on the specified timeline? That is, I'm still missing information about when you knew about the issue to when you acted for revocation, and it seems understanding the timeline, and the steps being taken, is more useful and actionable than contrition.

If I'm understanding Comment #1 correctly,

This error came to light when, on the evening of Sept 15 ... immediately notified his manager as well as the incident response team and began the investigation and processing of this previously undetected list of problem certificates. These certificates will be revoked on or before Sept. 20.

From this statement, I don't understand Comment #3's remarks, specifically:

In this case our internal review found the oversight and we took action to correct it.

From what I understand, the department manager noticed this, escalated to incident response, and incident response team waited five days. This does not appear to be learning from mistakes, or a future improvement, and it's still not clear to me why the choice to wait five days was made.

Flags: needinfo?(rich)

In response to Ryan Sleevi, comment 4

Ryan asked:

Is my understanding that you don't have a system in place to automatically guarantee these certificates are revoked on the specified timeline?

We do not have software that guarantees that. It depends on a human being with the appropriate permissions choosing when the certificates are to be revoked.
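For illustration, the deadline arithmetic that such software would need is small. The sketch below is hypothetical and not something Sectigo runs: it assumes the 5-day window is counted as 5 × 24 hours from receipt of the report, and the timestamps are made-up placeholders, since the exact times are not stated in this bug.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the BR window is modelled as exactly 5 * 24 hours.
REVOCATION_WINDOW = timedelta(days=5)


def revocation_deadline(reported_at):
    """Latest moment at which revocation is still on time."""
    return reported_at + REVOCATION_WINDOW


def days_late(reported_at, revoked_at):
    """Whole days (rounded up) past the deadline; 0 if revoked on time."""
    overrun = revoked_at - revocation_deadline(reported_at)
    if overrun <= timedelta(0):
        return 0
    # Ceiling division of one timedelta by another yields an int.
    return -(-overrun // timedelta(days=1))


# Placeholder times: report received the evening of Sept. 11, 2020.
reported = datetime(2020, 9, 11, 20, 0, tzinfo=timezone.utc)
deadline = revocation_deadline(reported)
late_batch = days_late(reported, datetime(2020, 9, 17, 20, 0, tzinfo=timezone.utc))
on_time_batch = days_late(reported, datetime(2020, 9, 15, 9, 0, tzinfo=timezone.utc))
```

With these placeholder times the deadline falls on Sept. 16, so a revocation on Sept. 17 is one day late, matching the batch described in comment 1.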

Ryan said:

it's still not clear to me why the choice to wait five days was made.

This was my choice and in retrospect it was the wrong one. I knew that these certificates were all included in the larger set of OV suspects, with which I'm still contending, and thus that they would be processed as part of that batch. Therefore I didn’t push the timeline as I more or less thought of them as being part of that same effort. Clearly that was wrong as these certs had stopped being suspects and become known misissuance certificates once we completed our investigation of them.

Flags: needinfo?(rich)

(In reply to Rich Smith from comment #5)

We do not have software that guarantees that. It depends on a human being with the appropriate permissions choosing when the certificates are to be revoked.

This was my choice and in retrospect it was the wrong one.

So, in the very next sentence, you admit that relying on manual tracking of revocation timelines does not work?

Also, it seems like you are trying to avoid answering the questions in comment 2. In Comment 1, you establish evidence that you deliberately did not revoke at least batch 3 in time (at that time, you knew that batch 2 was already beyond the BR deadline, and announced that batch 3 would be revoked even later), but did not state any reason for this. So it can only be speculated that the reasons given in comment 2 are, in fact, true, which would reflect very badly on Sectigo.

(In reply to paul.leo.steinberg from comment #6)

Also, it seems like you are trying to avoid answering the questions in comment 2.

Specifically, the request:

Rich: It's not clear to me why the overlooked batch wasn't revoked on Sept 15, when Sectigo was aware of this. That is, while it's understandably unfortunate Sectigo made a mistake in overlooking this second batch, what's not clear to me is why it deliberately delayed another 5 days.

I believe Rich responded partially to this in Comment #5, namely:

I knew that these certificates were all included in the larger set of OV suspects, with which I'm still contending, and thus that they would be processed as part of that batch.

That said, the goal of the postmortem isn't to throw Rich under the proverbial bus here for that poor judgement. It's to understand why the controls relied on judgement in the first place, and what can be done to reduce the risk of judgement errors going forward. I think you're right, Paul, to highlight a clear takeaway, which is:

manual tracking of revocation timelines does not work

But I also think there's more opportunity for Rich to share about the process at Sectigo.

  • Are employees empowered and supported, from the support queue through to management, to state that "We will follow our CP/CPS as stated" and are able to insist certificates be revoked?
  • Are there checks in place to get second opinions and additional perspective, to help catch potential issues of confusion or judgement errors?
  • Similar to that first point, are folks empowered to be able to highlight those issues/errors? Or is there a risk of retaliation or penalization if they do?

If we view CAs like we do manufacturing (since they're the same), we want to make sure there's a culture that emphasizes the importance of being able to slap the proverbial "emergency stop" button (whether blocking issuance of a cert, halting all issuance, forcing revocation... the metaphor fits). We need to make sure employees feel safe to do so, and CAs understandably want to make sure employees don't do so unnecessarily. That's a difficult act to balance, but one that needs to err on the side of safety here.

With respect to this incident, it does seem Sectigo can make some deliberate choices. For example, starting with "revocation by default", such that it requires a quorum of the compliance team to override (and delay/abort revocation). This leans into compliance-by-default, while still offering the ability to recognize exceptional situations and/or overzealous employees. Similarly, sharing details going forward about when overrides were made, and the rationale behind each override (including "support made a mistake, compliance's review caught this, and it didn't actually need to be revoked"), all go far in helping share lessons and opportunities for transparency.
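To make "revocation by default" concrete, here is one rough way it could be modelled. Everything in this sketch is an assumption for illustration: the `RevocationTicket` class, the `should_revoke` function, and the quorum size of 2 are hypothetical and do not describe any existing CA system.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class RevocationTicket:
    serial: str          # hypothetical certificate identifier
    deadline: datetime   # latest compliant revocation time
    override_approvers: set = field(default_factory=set)
    override_reason: str = ""

    def approve_override(self, approver, reason):
        """Record one compliance team member's approval to delay or abort."""
        self.override_approvers.add(approver)
        self.override_reason = reason  # kept so the rationale can be shared later


def should_revoke(ticket, now, quorum=2):
    """Revoke once the deadline is reached unless a quorum has overridden."""
    if now < ticket.deadline:
        return False
    return len(ticket.override_approvers) < quorum


deadline = datetime(2020, 9, 16, 20, 0, tzinfo=timezone.utc)
ticket = RevocationTicket(serial="example-serial", deadline=deadline)

default_action = should_revoke(ticket, deadline)       # nobody overrode
ticket.approve_override("reviewer-a", "under review")
one_approval = should_revoke(ticket, deadline)         # still below quorum
ticket.approve_override("reviewer-b", "under review")
quorum_reached = should_revoke(ticket, deadline)       # quorum overrides
```

The point of the design is that doing nothing results in revocation; a delay needs several people to act and leaves a recorded reason behind.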

Safety-conscious industries have long recognized the value of these postmortems, retrospectives, and longitudinal analyses as valuable learning tools.

(In reply to Ryan Sleevi from comment #7)

  • Are employees empowered and supported, from the support queue through to management, to state that "We will follow our CP/CPS as stated" and are able to insist certificates be revoked?
  • Are there checks in place to get second opinions and additional perspective, to help catch potential issues of confusion or judgement errors?
  • Similar to that first point, are folks empowered to be able to highlight those issues/errors? Or is there a risk of retaliation or penalization if they do?

I would say the answer is yes. If we use this example (and I agree the point is not to pile on Rich), the person who was working both these issues had a group of certs that was the same in both sets and lumped them together, and then in retrospect realized that was an erroneous decision.

The need for improvement is not so much about pressure on employees not to revoke certificates. Certainly it’s human nature for, let’s say, a sales account manager to want forced revocations not to happen for her particular client, but the company is well schooled in the reason these must occur, and so that kind of pushback isn’t very effective. The past might have been different, which I couldn't comment on, but that's the case today.

Where I see the opportunity to improve is to make this kind of daily decision making systematic and clearly spelled out. Right now we have experienced employees taking care of tasks, and that involves a great deal of them doing what they understand to be correct. To the degree that we have spelled-out procedures for employees to follow, we can achieve consistent results that more closely match expectations. This is a lot like your second bullet, above.

Of course, it all hinges on the details. How many procedures are we talking about? How closely can they be codified, pragmatically? What are the meaningful metrics, and how do you collect them? All non-trivial to resolve. That’s the kind of question we’re digging into.

No further update at this time.

Ben, the certificates referenced in this bug have all been revoked, and there have been no additional comments or questions for several weeks. Can this bug be closed?

Flags: needinfo?(bwilson)

I'll schedule this bug for closure on or about next Wednesday, 11-11-2020, unless there are additional issues or questions to be raised.

We have no further update.

Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Flags: needinfo?(bwilson)
Resolution: --- → FIXED
Product: NSS → CA Program
Whiteboard: [ca-compliance] [delayed-revocation-leaf] → [ca-compliance] [leaf-revocation-delay]