Closed Bug 1551362 Opened 5 years ago Closed 4 years ago

Sectigo: "Some-State" in stateOrProvinceName

Categories

(CA Program :: CA Certificate Compliance, task)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: wthayer, Assigned: Robin.Alden)

References

Details

(Whiteboard: [ca-compliance] [ev-misissuance] [ov-misissuance])

A list of Sectigo certificates containing a stateOrProvinceName of "Some-State" was published at https://misissued.com/batch/53/. This is apparently the default value placed in OpenSSL CSRs, indicating that this field was not validated. BR section 7.1.4.2.2(f) states: If present, the subject:stateOrProvinceName field MUST contain the Subject’s state or province information as verified under Section 3.2.2.1. The EVGLs reference the BRs.
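For context, 'Some-State' is the placeholder that the interactive 'openssl req' command offers as the default stateOrProvinceName, so its presence in an issued certificate strongly suggests the CSR-supplied subject was copied through without verification. The following is a minimal, hypothetical sketch in Python (using the cryptography library) of how issued certificates could be scanned for such values; the flagged strings and command-line usage are illustrative assumptions, not Sectigo's actual tooling.

    import sys
    from cryptography import x509
    from cryptography.x509.oid import NameOID

    # Values that suggest an unverified, CSR-supplied field.
    # "Some-State" is the openssl req default; the others are illustrative.
    SUSPECT_VALUES = {"some-state", "default city", "(none)", "na"}

    def has_default_state(pem_path: str) -> bool:
        """Return True if stateOrProvinceName looks like an unvalidated default."""
        with open(pem_path, "rb") as f:
            cert = x509.load_pem_x509_certificate(f.read())
        attrs = cert.subject.get_attributes_for_oid(NameOID.STATE_OR_PROVINCE_NAME)
        return any(a.value.strip().lower() in SUSPECT_VALUES for a in attrs)

    if __name__ == "__main__":
        for path in sys.argv[1:]:
            if has_default_state(path):
                print(f"{path}: stateOrProvinceName appears to be an unverified default")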

Please provide an incident report, as described at https://wiki.mozilla.org/CA/Responding_To_An_Incident#Incident_Report

Robin, it's been a month with no incident report. Have I missed something?

Flags: needinfo?(Robin.Alden)

Three certificates reported above that chain up to Sectigo roots have not yet been revoked. BR section 4.9.1.1 requires revocation within 5 days if the CA is made aware that any information in a certificate is inaccurate. Why were these certificates not revoked with the other Sectigo Some-State certificates?

https://crt.sh/?id=179844604&opt=ocsp
https://crt.sh/?id=42115324&opt=ocsp
https://crt.sh/?id=306930656&opt=ocsp

(In reply to Ryan Sleevi from comment #1)

Robin, it's been a month with no incident report. Have I missed something?

I apologize for the delay in writing up this incident response. We will provide a full response this week.

(In reply to Alex Cohn from comment #2)

Three certificates reported above that chain up to Sectigo roots have not yet been revoked. BR section 4.9.1.1 requires revocation within 5 days if the CA is made aware that any information in a certificate is inaccurate. Why were these certificates not revoked with the other Sectigo Some-State certificates?

https://crt.sh/?id=179844604&opt=ocsp
https://crt.sh/?id=42115324&opt=ocsp
https://crt.sh/?id=306930656&opt=ocsp

Alex, I will investigate why these three were missed. We will revoke these certificates and provide an updated list of affected certificates as part of the incident response this week.

Whiteboard: [ca-compliance] → [ca-compliance] - Next Update - 29-June 2019
  1. How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.
    We received a report to our problem reporting mailbox sslabuse@sectigo.com from Alex Cohn at 18:36 BST on 11th May 2019. The same email was sent to mozilla.dev.security.policy.
  2. A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.
    2018-06-13 We identified that we needed to add “Some-state” to a filter mechanism so that it could not be added to certificate subjects.
    2018-11-12 We deployed code to add “Some-state” to a filter mechanism so that it could not be added to certificate subjects.
    2019-05-11 18:36 BST Initial Report received.
    2019-05-14 00:02 BST We started revoking the identified certificates.
    2019-05-14 01:10 BST This bug was created by Wayne.
    2019-05-14 14:18 BST 72 out of 78 identified certificates had been revoked.
    2019-05-16 14:34 BST 75 out of 78 identified certificates had been revoked.
    2019-06-25 01:19 BST Alex Cohn posted to this bug that three certificates remained unrevoked.
    2019-06-30 00:38 BST The remaining 3 certificates were revoked.
  3. Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem. A statement that you have will be considered a pledge to the community; a statement that you have not requires an explanation.
    We have stopped issuing certificates with the problem.
  4. A summary of the problematic certificates. For each problem: number of certs, and the date the first and last certs with that problem were issued.
    There were 78 time-valid certificates that contained Some-State.
    The earliest such certificate was issued on 2014-08-28.
    The latest such certificate was issued on 2018-11-07.
  5. The complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem.
    https://docs.google.com/spreadsheets/d/1_xuhRm4thL9Hd03Jiu4TUn33xMBHCMu-drQ1i1drppM/edit?usp=sharing
  6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.
    This is another case where visual comparison by human validators has failed because they see what they expect to see. Even with training, and (since some of these are EV certificates) even with two people looking over the data, spurious values such as this are sometimes missed.
    Although we had already implemented code to block this particular case, we are conscious that there will be other similar cases related to default values and to other problems of human perception.
    We developed the code to block Some-State because we were seeing too many rejections of certificates during manual issuance due to the use of this phrase, so we automated its blocking. We should have looked through the data to find examples of already-issued certificates containing the same phrase.
  7. List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.
    We have two programs of work underway that we believe will substantially address these visual comparison errors.
    We have had the first under development for some time; it aims to verify the addresses in OV and EV certificates against bulk address records. Subscriber-generated addresses that do not match any of the bulk data sources will require a higher level of scrutiny by validators before issuance. This is in final testing before release and we expect to have it in production in July 2019.
    The second has come out of suggestions made to (and by) CAs in the incident reports around these ‘default’ values appearing in certificate subjects, such as using ISO 3166-2 as a whitelist of state or province values and using GeoNames data (see the sketch after this list). We had not previously considered the latter source and are evaluating how best we can use it. We expect to have an initial implementation using these concepts live in production in July or August 2019.
    Until these automated systems are in place we have directed our validators to pay special attention to this class of error and have added training material to highlight these cases.
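To make the second program of work concrete, the following is a minimal sketch, assuming a whitelist built from ISO 3166-2 and/or GeoNames subdivision names, of a pre-issuance check that routes unrecognised state/province values to a validator for closer scrutiny. The inline data and function names are illustrative assumptions, not Sectigo's implementation.

    # Hypothetical pre-issuance check: the claimed stateOrProvinceName must match
    # a known subdivision for the subject country, otherwise the order is escalated
    # for manual review. Real data would be loaded from ISO 3166-2 and/or GeoNames.
    SUBDIVISIONS = {
        "US": {"california", "texas", "new york"},       # illustrative subset
        "GB": {"greater manchester", "west yorkshire"},  # illustrative subset
    }

    def needs_manual_review(country: str, state: str) -> bool:
        """Return True if the state/province is not a recognised subdivision."""
        known = SUBDIVISIONS.get(country.upper(), set())
        return state.strip().lower() not in known

    # "Some-State" is not a recognised US subdivision, so it would be escalated.
    assert needs_manual_review("US", "Some-State")
    assert not needs_manual_review("US", "California")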
Flags: needinfo?(Robin.Alden)

(In reply to Robin Alden from comment #4)

  1. How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.
    We received a report to our problem reporting mailbox sslabuse@sectigo.com from Alex Cohn at 18:36 BST on 11th May 2019. The same email was sent to mozilla.dev.security.policy.

Did you provide a Preliminary Incident Report within the timeframe required of the BRs? If so, why was that same report not provided to Mozilla? If not, then treating that as another actionable incident, please include a similar analysis and plan for remediation.

  1. A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.
    2018-06-13 We identified that we needed to add “Some-state” to a filter mechanism so that it could not be added to certificate subjects.
    2018-11-12 We deployed code to add “Some-state” to a filter mechanism so that it could not be added to certificate subjects.
    2019-05-11 18:36 BST Initial Report received.
    2019-05-14 00:02 BST We started revoking the identified certificates.
    2019-05-14 01:10 BST This bug was created by Wayne.
    2019-05-14 14:18 BST 72 out of 78 identified certificates had been revoked.
    2019-05-16 14:34 BST 75 out of 78 identified certificates had been revoked.
    2019-06-25 01:19 BST Alex Cohn posted to this bug that three certificates remained unrevoked.
    2019-06-30 00:38 BST The remaining 3 certificates were revoked.

Why did you fail to revoke the remaining three?
Why did you fail to acknowledge and respond to this bug for nearly two months?
Why did you fail to examine and remediate past issuance, when this matter was detected and the fix deployed, back in 2018?

  1. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.
    This is another case where visual comparison by human validators has failed because they see what they expect to see. Even with training, and (since some of these are EV certificates) even with two people looking over the data, spurious values such as this are sometimes missed.
    Although we had already implemented code to block this particular case, we are conscious that there will be other similar cases related to default values and to other problems of human perception.
    We developed the code to block Some-State because we were seeing too many rejections of certificates during manual issuance due to the use of this phrase, so we automated its blocking. We should have looked through the data to find examples of already-issued certificates containing the same phrase.

What steps are being put into place in the future to ensure such systemic issues are actionably addressed, including historic issuance? For example, reducing the lifetime of certificates in order to mitigate the risks of such issues going unaddressed, combined with systemic playbooks for compliance issues that include steps for performing and acting upon human review, surfacing broadly to the community, etc.

For example, had Sectigo alerted the broader community to these sorts of issues, and the challenges, the CA ecosystem might have been able to tackle this nearly a year ago, based on the timeline. Such a missed opportunity to be an industry leader seems regrettable, and it seems useful to consider what sort of changes, for both compliance issues and validation challenges, such as those adopted by other CAs, might be appropriate.

  1. List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.

Please also provide an analysis as to why the response was so significantly delayed, and what steps are being taken to prevent repeat behavior in the future.

Flags: needinfo?(Robin.Alden)
Whiteboard: [ca-compliance] - Next Update - 29-June 2019 → [ca-compliance] - Next Update - 31-July 2019
Blocks: 1563579

Robin?

I will respond on this bug by the end of this week, i.e. by 20-Sep-2019, and no less than weekly thereafter.

(In reply to Ryan Sleevi from comment #5)

Did you provide a Preliminary Incident Report within the timeframe required of the BRs? If so, why was that same report not provided to Mozilla? If not, then treating that as another actionable incident, please include a similar analysis and plan for remediation.

We did not provide a preliminary incident report to the original poster. This was an oversight.
We have created a new playbook to ensure consistency of treatment of Certificate Problem Reports and to accurately track the phases of response to a Certificate Problem Report, including the preliminary report. This has been added to the training curriculum for the support staff who deal with Certificate Problem Reports.

Why did you fail to revoke the remaining three?

The cause was administrative error, by which I really mean spreadsheet wrangling problems. We see a recurrent pattern of failure in which small portions of the lists of certificates tabled for revocation as part of incident responses miss their timely revocation. We have received internal requests from the team who run through these revocations for a system to help them track incident-related revocations.
We have created a (Jira) ticket to track the development of such a system. I will follow up with timelines for this system going into production.

Why did you fail to acknowledge and respond to this bug for nearly two months?

I apologize for the tardiness of our response to this bug. We are aware of and understand Mozilla’s guidance that “in no circumstances should a question linger without a response for more than one week”. We will in future adhere to this requirement.
I will follow up some more on this in bug 1563579.

Why did you fail to examine and remediate past issuance, when this matter was detected and the fix deployed, back in 2018?

We developed the code to block Some-State in new certificate requests because we were seeing too many rejections of certificates during initial validation or 2nd approval. We wrongly assumed that the 2nd approval process would have caught all instances of ‘Some-State’ before the certificates had been issued. This is evidently not the case, and it shows that, on implementing any new control, even where we believe that alternate controls should have been sufficient to prevent the issue in the past, we must apply that new control to our body of issued certificates to identify any past misissuance.

What steps are being put into place in the future to ensure such systemic issues are actionably addressed, including historic issuance? For example, reducing the lifetime of certificates in order to mitigate the risks of such issues going unaddressed, combined with systemic playbooks for compliance issues that include steps for performing and acting upon human review, surfacing broadly to the community, etc.

For example, had Sectigo alerted the broader community to these sorts of issues, and the challenges, the CA ecosystem might have been able to tackle this nearly a year ago, based on the timeline. Such a missed opportunity to be an industry leader seems regrettable, and it seems useful to consider what sort of changes, for both compliance issues and validation challenges, such as those adopted by other CAs, might be appropriate.

Historic Issuance: Although not directly responsive to the issue that we observed and fixed in 2018, we have created a fresh Incident Response playbook which sets out the steps required of us as part of any incident response. This includes our initial response, root cause investigation, prevention of future misissuance, and investigation of past misissuance, and also includes a recommendation to investigate whether other CAs are affected by the identified issue.

We will also add to our development process a step so that, for any change to CA systems that improves policy checking (including subject detail checking), we will assume that anything that could have gone wrong did go wrong in the past and perform a scan of our certificates for misissuance due to that cause (a sketch of such a scan follows below). If we find it, that brings the Incident Response Playbook into play and other CAs would be notified.

Where we amend our policy checking and we find no previous misissuance by Sectigo we will routinely check for misissuance by other CAs, where practicable, and bring it to bugzilla if we find it.
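As a rough illustration of what such a retroactive scan could look like: whenever a new pre-issuance rule is introduced, the same predicate would be run across the corpus of already-issued certificates. The database schema and table name below are illustrative assumptions, not Sectigo's systems.

    import sqlite3
    from typing import Callable, Iterator, Tuple

    def retroactive_scan(db_path: str,
                         violates_new_rule: Callable[[str], bool]) -> Iterator[Tuple[int, str]]:
        """Yield (certificate_id, subject_state) for issued certificates that a
        newly introduced rule would now reject. The 'issued_certs' table and its
        columns are purely illustrative."""
        conn = sqlite3.connect(db_path)
        try:
            rows = conn.execute(
                "SELECT id, subject_state FROM issued_certs WHERE subject_state IS NOT NULL")
            for cert_id, state in rows:
                if violates_new_rule(state):
                    yield cert_id, state
        finally:
            conn.close()

    # Example: scan for the 'Some-State' default with a simple predicate.
    # for cert_id, state in retroactive_scan("issued.db", lambda v: v.strip().lower() == "some-state"):
    #     print(cert_id, state)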

Please also provide an analysis as to why the response was so significantly delayed, and what steps are being taken to prevent repeat behavior in the future.

I will respond to that in bug 1563579, if I may.

The progress report for https://bugzilla.mozilla.org/show_bug.cgi?id=1548713#c7 applies similarly to this bug.

We have made some progress with the development of the system to help our staff track incident-related revocations and we now understand the level of effort involved. We do not yet have an ETA for the delivery of the system, but we anticipate having that ETA by next week.

Our EV JoI policy checking module will be deployed to our production systems this weekend (29-Sep). This will enable us to more easily apply those policy checks retrospectively to our issued body of certificates. As we further extend the data definitions available to the JoI policy checker, we anticipate being able to more quickly identify and disclose certificates with a mismatch in the JoI.

The JoI policy checking module was released to our live systems last weekend (29th September) as scheduled. We are working to extend the JoI 'rules' table that it uses to cover more countries. This remains a work in progress.
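For readers unfamiliar with what such a JoI 'rules' table might contain, the following is a hypothetical sketch; the field names, per-country entries, and consistency logic are assumptions for illustration only, not the actual Sectigo schema.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class JoiRule:
        # Whether incorporation in this country is recorded at state/province
        # and/or locality level (illustrative model only).
        state_required: bool
        locality_required: bool

    # Illustrative entries; a real table would cover every supported country.
    JOI_RULES = {
        "GB": JoiRule(state_required=False, locality_required=False),
        "US": JoiRule(state_required=True, locality_required=False),
    }

    def joi_fields_consistent(country: str,
                              state: Optional[str],
                              locality: Optional[str]) -> bool:
        """Check that the JoI fields present match the registration model for the country."""
        rule = JOI_RULES.get(country)
        if rule is None:
            return False  # unknown country: escalate for manual review
        if rule.state_required != bool(state):
            return False
        if rule.locality_required and not locality:
            return False
        return True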

I apologize that we do not yet have a list of JoI mismatches to publish. I had hoped we would have had it this week, but I now anticipate we will have an initial list to publish during next week.

We have identified a further batch of EV certificates that may have incorrect Subject:JoI information. We are refining and cleaning that list as we action the required revocations and reissuances, and we will publish it next week.

So, while my response is similar to that in https://bugzilla.mozilla.org/show_bug.cgi?id=1548713#c10, I'm a little confused why, in Comment #11, you didn't just share that list now. Could you share a bit more about the rationale, as I'm not sure I see it?

In general, transparency and disclosure early and often help build confidence in the CA. For example, if Sectigo shared a list of 20 certificates, and next week only 19 were revoked, we could have a discussion about why that 1 certificate was not revoked. However, if Sectigo only shares the list of the 19 certificates revoked, we won't even know about the 20th certificate, whether it was simply misidentified (and thus revocation not needed) or whether it was overlooked (a more serious issue). Consider that this issue noted challenges in ensuring prompt and timely revocation, due to spreadsheet wrangling issues.

There are similarly opportunities to be more transparent if you look at Comment #8. What was the old playbook? What's the new playbook? Walk us through the changes and the rationale, to help show how Sectigo is analyzing where things did go wrong, or could go wrong, so we have a sense that Sectigo really has thought of (and mitigated) everything. What's the model that other CAs can and/or should use to prevent issues like this?

It's good to know that steps are being taken. It's better to know what those steps are, so that Sectigo can demonstrate it knows where it's going and how to get there, and so that others can follow in Sectigo's footsteps.

(In reply to Ryan Sleevi from comment #12)

So, while my response is similar to that in https://bugzilla.mozilla.org/show_bug.cgi?id=1548713#c10, I'm a little confused why, in Comment #11, you didn't just share that list now. Could you share a bit more about the rationale, as I'm not sure I see it?

Sure. We didn't share the list because it was not an accurate list.

There are a few reasons why we don't want to publish and action an inaccurate list.

One reason is that when we set our customer service team the unpleasant task of reaching out to a list of subscribers to tell them that their certificates need to be revoked and re-issued, we want to be really sure that the list contains only certificates that we are obliged to revoke. That minimizes the disruption to our subscribers: if telling a subscriber that we need to revoke her certificate and asking her to do an out-of-cycle replacement will leave her unimpressed, that pain is multiplied several-fold if we subsequently have to say 'my bad, we could have left that one alone'.

Related to that first reason, publishing a list that labels a subscriber's certificate as problematic, especially when we subsequently determine that the certificate is in fact OK, causes subscriber anguish.

Finally, although this is not an overriding reason and not one I would try to overstate here, we would like to address all certificates with the same problem in the same phase of work. One reason I can't in good faith rely on that argument too much is that we are already in the position of having to split the revocation into phases.

I had intended to get that initial list out in this bug by today and start the revocation and replacement action by our customer service team, but on checking we find that some further work is needed.
I will publish that list here (or on Bug 1575022) on Monday 21st.

In general, transparency and disclosure early and often help build confidence in the CA. For example, if Sectigo shared a list of 20 certificates, and next week only 19 were revoked, we could have a discussion about why that 1 certificate was not revoked. However, if Sectigo only shares the list of the 19 certificates revoked, we won't even know about the 20th certificate, whether it was simply misidentified (and thus revocation not needed) or whether it was overlooked (a more serious issue). Consider that this issue noted challenges in ensuring prompt and timely revocation, due to spreadsheet wrangling issues.

I appreciate the point about transparency and entirely concur that it is for the good of the community.
We are not trying to hide anything, and even if we were so inclined (which we are not) it is not possible to hide misissuance given that all of our certificates are logged to CT.
I can see that you would rather have a list that is 95% accurate if it means having the data earlier, and I note that DigiCert shared a list of certificates 'Remaining to Review' in Bug 1576013, which is a partial disclosure of sorts that I suspect, despite their good intentions, was not greatly useful.
There is jeopardy in publishing an inaccurate list, not only for the reasons I touched upon above, but also because we risk being 'helped' to an early grave by a security researcher.

There are similarly opportunities to be more transparent if you look at Comment #8. What was the old playbook? What's the new playbook? Walk us through the changes and the rationale, to help show how Sectigo is analyzing where things did go wrong, or could go wrong, so we have a sense that Sectigo really has thought of (and mitigated) everything. What's the model that other CAs can and/or should use to prevent issues like this?

The new playbook points out that responding to the reporter of the problem is a requisite part of the process.
An initial response to the reporter is to be made acknowledging receipt of the report and indicating that the reporter should expect a further response from us within 24 hours.
Initial problem analysis.
Triage. Is this:
a) a valid report of misissuance that requires an immediate escalation response;
b) a revocation request or other problem report that we must handle within 24 hours or within 5 days;
c) not a valid or correct problem report?
For (a), we include details on how to open a bugzilla bug, including brief summary details of the report and the results of the initial analysis.
For (a) or (b), we highlight that in all cases we are still obliged to respond to both the original reporter and the certificate subscriber within 24 hours.
For (b), we include the relevant clauses from the BRs so that the responder doesn't have to work from memory or peruse the BRs.

While writing this comment I realize that the BRs also oblige us to respond to the problem reporter and the certificate subscriber even if we determine the problem report to be invalid and spurious. We will amend the playbook to reflect that, although to avoid the potential for abuse I am minded to propose an update to the BRs that allows the obligation to respond to be negated where the problem report is both invalid and apparently spurious or malicious. We probably don't want a malicious problem report that includes a URI to a crt.sh report identifying every Sectigo-issued certificate to force us to contact a hundred million subscribers, or to have to raise a CA Certificate Compliance bug because we could not (or would not) do so.

It's good to know that steps are being taken. It's better to know what those steps are, so that Sectigo can demonstrate it knows where it's going and how to get there, and so that others can follow in Sectigo's footsteps.

Fair points.

After preparing https://bugzilla.mozilla.org/show_bug.cgi?id=1563579#c14, I realize that I have so far left out of this bug the description of a mitigation that we put in place to prevent exactly this class of error ('Some-state') from recurring.
We have improved automated checks in place to identify and reject the most egregious examples of incorrect address data, such as what we have come to call 'meta-data' (although I think that term is probably misused here), by which we mean values consisting of dots, dashes, spaces, and other punctuation, especially sequences or repetitions of those characters, in place of city or state names or other address fields.
Those same automated checks also aim to identify frequently used words and phrases indicating that these fields do not contain real data, such as '(none)', 'NA', and 'Not Applicable', and we use the same mechanism to catch the common default subject phrases ('Some-State', 'Default City', etc.).
Although those checks may seem unnecessary given that we have the required initial verification checks and second review process in place, those characters and phrases are exactly the ones that become so familiar to our human validators that they can become practically invisible to them.
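As a rough sketch of the kind of automated check described above (the regular expression and phrase list are assumptions for illustration, not the production rules):

    import re

    # Reject fields that consist only of punctuation/whitespace ("meta-data" in the
    # sense used above), or that match well-known default or filler phrases.
    PUNCTUATION_ONLY = re.compile(r"^[\s.\-_,/\\]+$")
    FILLER_PHRASES = {"some-state", "default city", "na", "n/a", "none", "(none)",
                      "not applicable"}

    def is_spurious_subject_value(value: str) -> bool:
        """Return True if a subject field value looks like filler rather than real data."""
        v = value.strip()
        if not v or PUNCTUATION_ONLY.match(v):
            return True
        return v.lower() in FILLER_PHRASES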

We published a list of EV certificates with incorrect JoI information, as mentioned in comment #4, but it was published at https://bugzilla.mozilla.org/show_bug.cgi?id=1575022#c12 as it is more directly relevant to that ticket.

We have made some further progress towards the development of the system to help our staff track incident-related revocations. We are targeting a release around the end of November.

Flags: needinfo?(Robin.Alden)

The release date for the new system to help our staff track incident-related revocations has slipped to the weekend of 7th December.

We have populated some further records for our pre-issuance JoI policy checker.

Flags: needinfo?(wthayer)

Robin: This appears to be another incident bug that is long overdue for an update? Can you confirm that all the certificates specific to this incident (versus 1575022) have been revoked?

Flags: needinfo?(wthayer) → needinfo?(Robin.Alden)
Whiteboard: [ca-compliance] - Next Update - 31-July 2019 → [ca-compliance]

(In reply to Wayne Thayer from comment #16)

Robin: This appears to be another incident bug that is long overdue for an update? Can you confirm that all the certificates specific to this incident (versus 1575022) have been revoked?

I confirm that all of the certificates reported in this incident were revoked.

From comment #4:

2019-05-14 14:18 BST 72 out of 78 identified certificates had been revoked.
2019-05-16 14:34 BST 75 out of 78 identified certificates had been revoked.
2019-06-25 01:19 BST Alex Cohn posted to this bug that three certificates remained unrevoked.
2019-06-30 00:38 BST The remaining 3 certificates were revoked.

Flags: needinfo?(Robin.Alden)

It appears that all questions have been answered and remediation is complete.

Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED

In comment 15, the release date for "the new system to help our staff track incident-related revocations" was pushed back, but there was no follow up. Did that ever get deployed?

Flags: needinfo?(wthayer)

Yes, the system to help our staff track incident-related revocations was deployed.

Flags: needinfo?(wthayer)
Product: NSS → CA Program
Whiteboard: [ca-compliance] → [ca-compliance] [ev-misissuance] [ov-misissuance]