Closed Bug 1721271 Opened 5 months ago Closed 6 days ago

Sectigo: Missing registration numbers in EV certificates

Categories

(NSS :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: tim.callan, Assigned: tim.callan)

Details

(Whiteboard: [ca-compliance] Next update 2021-10-18)

Attachments

(1 file)

11.91 KB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
Details

1. How your CA first became aware of the problem

On Friday, June 11, another CA sent us a report of 332 certificates it stated were misissued. We investigated all of them within the next 24 hours and determined that 47 had been misissued due to missing registration numbers in cases where those numbers were available. The remainder were issued correctly or had previously been revoked.

We revoked the 47 misissued certificates on June 16.

2. Timeline

All times Eastern Daylight Time

June 11, 12:29 pm
We receive a report of 332 supposedly misissued certificates to our SSL abuse line. Investigation begins.

This is a time-consuming investigation in which we must compare each certificate individually to the original documentation used to establish legal existence.

June 12, 11:44 pm
Investigation completed.

June 16, 11:00 am
Certificates revoked.

The gap between the revocation date and this writeup is because we have been exploring options for programmatically defending against errors of this sort. In general we want to do that sort of analysis up front for new issues so that we can present a more complete picture to the community. Point 7 will go into our planned path forward.

3. Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem

These certificates are individual cases of an intermittent and unpredictable error. By way of example, even in the original error report, only 16% of the reported certificates were actually misissued.

4. A summary of the problematic certificates

47 certificates, issued between April 25, 2019 and June 2, 2021.

5. Certificate data

We have attached a list of affected certificates.

6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now

Sectigo does not presently have an automatic check in place to compare registration numbers in orders to their qualified information sources. Official government documentation varies widely in format and content, so such a programmatic check would be difficult to create.

This variability also introduces risk of error into the validation process. The validation rep may have trouble finding a registration number in the available documentation, and since companies don’t always have registration numbers, the rep may erroneously conclude that no such number is available when in fact it is. There is a great deal of variability in registration number information based on locality and business type, with QGISs operating at the country, state, county, and even city level. The high false positive rate of the report that spawned this incident illustrates the difficulty of drawing these conclusions in a systematically reliable way.

In each of these cases it appears the agent failed to find the registration number in the source information.

7. List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future.

We plan to add a feature that compares the presence of a registration number to our expectations based on the QGIS we use. We will create a lookup table with each QGIS and whether or not to expect a registration number (a simple Yes/No). The system will compare orders against this record and block issuance in the case of a mismatch.

For any QGIS with no value in this table, the system will skip this check. This lets us benefit from the check without requiring 100% coverage: we can start gaining value before completing the heavy lift of investigating and making a determination on every QGIS we have, and we can add a new QGIS without needing to know at that exact moment whether or not to expect a registration number.
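The skip-if-unknown behavior described above can be sketched as follows. This is an illustrative sketch only, not Sectigo's actual implementation; the table entries and function names are hypothetical.

```python
# Hypothetical lookup table: True = registration number expected for this
# QGIS, False = not expected. A QGIS absent from the table means "no
# determination yet", and the check is skipped entirely for that QGIS.
QGIS_EXPECTS_REG_NUMBER = {
    "Delaware Division of Corporations": True,  # illustrative entries
    "UK Companies House": True,
}

def check_registration_number(qgis: str, reg_number_present: bool) -> bool:
    """Return True if issuance may proceed, False to block on a mismatch."""
    expected = QGIS_EXPECTS_REG_NUMBER.get(qgis)
    if expected is None:
        # No Yes/No determination recorded for this QGIS: ignore the check.
        return True
    return expected == reg_number_present
```

The key design choice mirrored here is that an incomplete table never blocks issuance; it only narrows the set of orders the check applies to.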

We intend to expand this fundamental mechanism to other checks, including:

  • QGIS-JOI match
  • QGIS-businessCategory match
  • QGIS-registration number formatting match (where possible)

And there may be others.

We expect this functionality to be live not later than the end of August and are working on a more specific release date. Setting that date has been holding up this writeup, but we don’t want to delay publishing it any longer. We’ll announce more specific plans as they become firm.

Assignee: bwilson → tim.callan
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Whiteboard: [ca-compliance]

(In reply to Tim Callan from comment #0)

The gap between the revocation date and this writeup is because we have been exploring options for programmatically defending against errors of this sort. In general we want to do that sort of analysis up front for new issues so that we can present a more complete picture to the community.

I would strongly encourage you to reconsider this plan, not because this sort of analysis is not useful, but because waiting 5 weeks to disclose an incident is unacceptable.

https://wiki.mozilla.org/CA/Responding_To_An_Incident#Immediate_Actions

Each incident should result in an incident report, written as soon as the problem is fully diagnosed and (temporary or permanent) measures have been put in place to make sure it will not re-occur. If the permanent fix is going to take significant time to implement, you should not wait until this is done before issuing the report. We expect to see incident reports as soon as possible, and certainly within two weeks of the initial issue report.

https://g.co/chrome/root-policy

When a CA becomes aware of or suspects an incident, they should notify chrome-root-authority-program@google.com with a description of the incident. If the CA has publicly disclosed this incident, this notification should include a link to the disclosure. If the CA has not yet disclosed this incident, this notification should include an initial timeline for public disclosure. Chrome uses the information on the public disclosure as the basis for evaluating incidents.

In part in response to a series of CA incidents, including Sectigo's, the CA/Browser Forum adopted Ballot SC30, which, among other things, required that CAs already have QGIS-JOI matching in place. As JOI is an EV-specific concept, the term for the QGIS in those cases is "Registration Agency / Agency of Incorporation" (and generic QGISs are not accepted).

This ballot was adopted one year ago.

This ballot also included the following provision:

Effective as of 1 October 2020, if the CA has disclosed a set of acceptable format or formats for Registration Numbers for the applicable Registration Agency or Incorporating Agency, as described in Section 11.1.3, the CA MUST ensure, prior to issuance, that the Registration Number is valid according to at least one currently disclosed format for that applicable Registration Agency or Incorporating agency.

This provision was in response to past CA incidents with respect to Registration Numbers being invalid for the QGIS/Registration Agency. Two years ago, in Bug 1576013, Comment #2, DigiCert shared their plans to better validate registration numbers. The discussion continued on the bug, and DigiCert shared their own plans in great detail, as shown by Bug 1576013, Comment #33 and Bug 1576013, Comment #55.

DigiCert has publicly disclosed their own investigations, showing a rather comprehensive approach to compliance.

I realize that, during this same period, Sectigo was greatly struggling with their compliance operations, and given the failure to respond to their own bugs, I suppose it should not be surprising that there was a lack of awareness of DigiCert's bug. However, I'm still at a bit of a loss as to how or why it would take 5 weeks to determine the set of options in Comment #0, given that these had not only been widely discussed in the CA/Browser Forum Validation Subcommittee, but also adopted by Ballot and demonstrably implemented by other CAs. Setting aside the process failure in the past, it suggests that the present may still have process failures, if these conversations were unfamiliar.

Is there a more meaningful plan to improve here?

Flags: needinfo?(tim.callan)

(In reply to Ryan Sleevi from comment #1)

I would strongly encourage you to reconsider this plan, not because this sort of analysis is not useful, but because waiting 5 weeks to disclose an incident is unacceptable.

Thank you for the clear direction.

This strategy was actually born of responsiveness to feedback we received here on Bugzilla. In recent bugs we have felt like we were being pushed to be very buttoned-up with our information by the time we discuss a matter publicly. Here are a few examples I tracked down pretty quickly:

  • Bug 1708934 comment 11 states, “it would have been useful to explicitly state that source, rather than leave it to the community … to guess at it”
  • Bug 1694233 comment 3 states, “It is what I would expect of an incident report focused on fixing the bug, but it does not seem to be a systemic root cause analysis.”
  • Bug 1715929 comment 4 and bug 1715929 comment 8 make it clear that you desired more information about our specific development plans than we felt we had to offer at that time.
  • And of course bug 1714628 comment 13 states, “demonstrably, we're seeing improvements in the incident reports in terms of details and transparency … and because we're seeing quality explanations and systemic investigations taking place.”

These taken together (and others that I could dig up with time, I’m sure) left us with the impression that the preferred approach was for us to do our research up front and present an (at least reasonably) complete picture of the situation and our response. Add in this passage from the above, “written as soon as the problem is fully diagnosed and (temporary or permanent) measures have been put in place”, and we thought we were taking the approach we had been directed toward.

We certainly can emphasize speed to reporting, and based on your clear direction in comment 1, we shall. Of course we’ll still provide as much information as we can; just be aware that some of the blanks may not yet be filled in. We ask in advance for the understanding of Bugzilla readers if some details need to come in subsequent updates. It won’t be because we are leaving it to the community to guess.

This last point is directly relevant because, as I point out in bug 1714628 comment 13, we have been and will continue systematically examining our certificate base for misissuance. Bug 1712120, bug 1714193, and bug 1720744 are examples of this. As we discover these so-far unknown errors, our plan had been that for each we would initially report the misissued certificates and revocation dates, the root cause analysis, and ideally the programmatic fix we would implement with a target date.

Based on your comment, it looks instead like we will initially report the misissued certificates and revocation dates. We’ll include the root cause analysis and the programmatic fix if we’re able; otherwise we will have to follow up.

(In reply to Tim Callan from comment #3)

This strategy was actually born of responsiveness to feedback we received here on Bugzilla. In recent bugs we have felt like we were being pushed to be very buttoned-up with our information by the time we discuss a matter publicly.

To be clear, not having all the answers is fine: but you should be clear on when you don't have the answers, when you expect to have answers, and what the challenges, factors, or considerations are.

For example, Bug 1714968, Comment #1 through Bug 1714968, Comment #2 show an example of disclosure and context followed by a subsequent report. From there, you can see further exploration and collaboration. Similar examples exist for DigiCert.

Ultimately, the primary goal here is: do we have a shared understanding of the risks and urgency, followed by ensuring we have a good path forward for solutions, with concrete commitments to more information, as well as a clear understanding of any anticipated delays as they arise.

(In reply to Ryan Sleevi from comment #2)

Ballot SC30 that, among other things, required CAs already have in place QGIS-JOI matches

We have that as a manual process. The discussion in this bug is about replacing error-prone human procedures with software-based checks. QGIS-JOI matching is a perfect candidate for that.

I didn’t go into that level of detail in the bug as this was a one-line toss-off that wasn’t the point of the message. I believed in the context of the discussion it would be clear what I was talking about.

if the CA has disclosed a set of acceptable format or formats for Registration Numbers for the applicable Registration Agency or Incorporating Agency, as described in Section 11.1.3, the CA MUST ensure, prior to issuance, that the Registration Number is valid according to at least one currently disclosed format

Yes. By building this framework, we will then have a mechanism to enforce this kind of format matching. The next step is to populate our QGIS list with format information as we are able. Our plan is to start with the regions most used for EV to get the best bang for our buck, which in our case means we will start with North America and move to Western Europe next.
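The format-matching step described above — checking a registration number against disclosed formats for its Registration Agency, per the quoted SC30 provision — could be sketched as a per-QGIS regex lookup. This is a hypothetical illustration; the table names and the example patterns are assumptions, not Sectigo's actual disclosed formats.

```python
import re

# Hypothetical per-QGIS registration number formats. Per the SC30 provision,
# if a CA has disclosed formats for a Registration Agency, the number must
# match at least one of them prior to issuance.
QGIS_REG_NUMBER_FORMATS = {
    "UK Companies House": [r"\d{8}", r"[A-Z]{2}\d{6}"],        # illustrative
    "Delaware Division of Corporations": [r"\d{7}"],            # illustrative
}

def reg_number_format_ok(qgis: str, reg_number: str) -> bool:
    """Return True if the number matches a disclosed format, or if no
    formats have been populated yet for this QGIS (check not applicable)."""
    formats = QGIS_REG_NUMBER_FORMATS.get(qgis)
    if not formats:
        return True  # no disclosed formats yet: skip, per the rollout plan
    return any(re.fullmatch(f, reg_number) for f in formats)
```

As with the Yes/No table, a QGIS with no populated formats is skipped rather than blocked, which matches the staged-rollout approach of starting with North America and Western Europe.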

Is there a more meaningful plan to improve here?

As part of our process we read and review all new Bugzilla bugs opened against any CA, to understand what is going on in the industry. This is one of the practices brought in by our new WebPKI Incident Response team, which started last summer. Bug 1576013 was opened in 2019. At that time we had no such process in place.

Flags: needinfo?(tim.callan)

(In reply to Tim Callan from comment #5)

Bug 1576013 was opened in 2019. At that time we had no such process in place.

Rob has correctly pointed out that this isn’t an accurate statement. I should rather have said that we lacked a formal process at that time. We previously depended on individual team members to follow what was happening on an ad hoc basis. Our current policy is to look through all new Bugzilla posts every day and, when new bugs are created, share them in our bi-weekly WIR meeting.

In particular Rob has done a stellar job of watching industry goings-on. Nonetheless, our present process is probably better than what any individual contributor can do. I expect there is room to improve on this front still, but that’s where we are now.

We are monitoring this bug for additional questions and comments.

(In reply to Tim Callan from comment #6)

(In reply to Tim Callan from comment #5)

Bug 1576013 was opened in 2019. At that time we had no such process in place.

Rob has correctly pointed out that this isn’t an accurate statement. I should rather have said that we lacked a formal process at that time.

I appreciate Rob's clarification, and I think this points to the worrying trend mentioned in Bug 1715024, Comment #16 - with respect to ensuring adequate peer review and correctness. This seems to be a more substantive response, and seems like it would have gone through the same peer review, so I'm not sure what happened.

I realize this may seem like an impossible task: prompt, detailed, and technically correct is a difficult challenge to meet, but it's also a reasonable expectation, given the role that a CA plays. When things are technically incorrect, even minor, it raises concerns about understanding whether similar processes allow for "acceptable" errors, such as domain validation, or undermines confidence in the CA's ability to understand what is expected. We're not prescriptive on 'how' a CA achieves the necessary expectations, but it's a bar CAs are held to.

I realize that there are a number of CAs whose incident reports call into question their ability to spell PKI, and so in a report such as this, which does demonstrate some awareness of the expectations, it may seem overly harsh to focus on this. However, Sectigo's worrying trend of serious incidents has us looking to make sure there's a state-and-phase change from the "old Sectigo" to that of a modern CA. Other CAs have been able to execute such changes in a prompt manner (e.g. DigiCert's acquisition of Verizon, and the overhaul necessary), and so there is demonstrable evidence that it is possible.

At a minimum, it may help to think about all the factual statements that may be relevant to an incident. If X happened on Date Y, or Comment Z on Bug A, or Discussion T in Venue U, make a list of each of those relevant details, and ensure they're correct. Then, examine the things you want to know, but are not yet certain of: for example, you haven't yet searched Venue U for further details, but know you need to, or are planning to re-examine Bugs A, B, C. Those two sets of information should be extremely useful in compiling your incident report, after ensuring the necessary peer review, because they help put the story together about how Sectigo views the problem, both the specifics and the overall "big picture". The explanations for the other questions should logically flow from those factual details that are established (and, presumably, included on the timeline of events). Some factual statements may not have date/times associated, but you can at least review to make sure "Yes, we actually did do X".

Taking a step back from the meta-point, with respect to this issue, Comment #0 stated on 2021-07-19:

We expect this functionality to be live not later than the end of August and are working on a more specific release date

Yet, we're now at 2021-08-11, and we haven't received a more specific date, as far as I can tell. Comment #7 reflects a "keep-alive", which is useful if only to highlight that Sectigo is not presently tracking its promised deliverables associated with issues. Here, again, is both an opportunity for a process improvement (a list of everything you promised to do, when you promised to do it, and the progress on that promise weekly), as well as a specific request for more details on this bug: can you provide an update on this?

Flags: needinfo?(tim.callan)

(In reply to Ryan Sleevi from comment #8)
Ryan, this post is to acknowledge your comment and let you know we’re working on a response.

(In reply to Ryan Sleevi from comment #8)

Taking a step back from the meta-point, with respect to this issue, Comment #0 stated on 2021-07-19:

We expect this functionality to be live not later than the end of August and are working on a more specific release date

Yet, we're now at 2021-08-11, and we haven't received a more specific date, as far as I can tell. Comment #7 reflects a "keep-alive", which is useful if only to highlight that Sectigo is not presently tracking its promised deliverables associated with issues. Here, again, is both an opportunity for a process improvement (a list of everything you promised to do, when you promised to do it, and the progress on that promise weekly), as well as a specific request for more details on this bug: can you provide an update on this?

Based on our investigation and design exercise, we’ve decided the most effective approach will be to implement a single, greater QGIS matching project rather than breaking these pieces of functionality up and releasing them piecemeal. Our presently scoped project will perform programmatic checks of QGIS against seven different factors:

  • Registration number available (y/n)
  • Registration number format
  • Registration date available (y/n)
  • Business category
  • JOI country
  • JOI state or province
  • JOI locality

In short, QGIS-registration number (y/n) matching has been subsumed by the greater QGIS matching project. All seven of these checks will follow the same basic architecture: each will check against the expectations recorded in a table we are presently populating. As stated in comment 0, a given lookup will run for a QGIS only once the relevant record is complete. This approach means we can start to gain benefits prior to the complete population of the big table of QGIS properties.
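The shared architecture described above — one expectations record per QGIS, with each of the seven checks applied only where its field has been populated — could be sketched like this. The record type, field names, and order layout are hypothetical illustrations, not Sectigo's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical per-QGIS expectations record. A field left as None means
# "not yet determined", and that individual check is skipped for this QGIS.
@dataclass
class QgisExpectations:
    reg_number_available: Optional[bool] = None
    reg_date_available: Optional[bool] = None
    business_category: Optional[str] = None
    joi_country: Optional[str] = None
    joi_state: Optional[str] = None
    joi_locality: Optional[str] = None

def order_matches(expected: QgisExpectations, order: dict) -> bool:
    """Compare an order's fields against the QGIS expectations,
    ignoring any expectation that has not been populated yet."""
    for field, expectation in vars(expected).items():
        if expectation is None:
            continue  # record incomplete for this factor: skip the check
        if order.get(field) != expectation:
            return False  # mismatch on a populated factor: block issuance
    return True
```

A fully empty record matches everything, so newly added QGISs impose no checks until their rows are filled in — the same incremental-rollout property as the original Yes/No table.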

Our target delivery date for this greater QGIS matching project is October 17.

Flags: needinfo?(tim.callan)

(In reply to Tim Callan from comment #10)

We hope the scope and nature of our QGIS matching project is clear. If anything in the above comment doesn’t make sense, please let us know and we’ll be happy to elaborate.

We have also announced this project in our Guard Rails tracker, bug 1724476 comment 3.

(In reply to Ryan Sleevi from comment #8)

This seems to be a more substantive response, and seems like it would have gone through the same peer review, so I'm not sure what happened.

We've talked about it and feel it's worth saying a few words here. We agree that this “impossible task” is one we and other CAs should aspire to, and we do. It also may be that some small glitches or corrections are understandable and forgivable and not worth getting worked up about.

To look at this example, we employ peer review and did so in this case. Due to PTO, business travel, and the unforeseen things that happen in life, not every individual is necessarily available to review every comment. We pay attention to making sure that the right subject matter experts are looking at the right posts, but it’s still possible for a specific team member to have a valuable perspective that the others wouldn’t know about a priori.

That’s what happened in this case. The group of people looking at the original post were able to account for a certain timeframe. Rob has the benefit of more institutional memory than most other members of the WebPKI Incident Response (WIR) team, in part because we have been expanding that team in scope and skill set. That of necessity means team members sometimes may not be aware of additional facts and ideas that could be added to the conversation.

Rob, who is not chained to his desk, was on this occasion unable to review comment 5 before it was posted. When he saw it, he called my attention to something we both felt was worth adding. So we added it.

We feel that was the correct response to the situation and would do so again. We are committed to transparency and so will make sure we share relevant information, even if we missed it the first time and even at the risk of being criticized.

We have a target date of October 17 for deployment of QGIS matching on seven factors, including the original motivator of this incident. We suggest setting a Next Update of 2021-10-18 (the following Monday). If there are meaningful developments in the meantime, we will report them on this bug, and we’ll continue to monitor it for other comments.

Flags: needinfo?(bwilson)
Whiteboard: [ca-compliance] → [ca-compliance] Next update 2021-10-18

We will have to slip the release of our QGIS matching functionality. The development team that owns this project has been hit hard recently with COVID, and a few key members are unable to work. Several projects need to slip as a result, and this release is among them.

At present we are shooting for October 30 deployment. This schedule is built on the assumption that the COVID cases involved are typical in their duration and severity. If they prove worse than average, we may revise that date later.

We are presently on track for October 30 release of QGIS matching.

In addition to the seven QGIS-matching functions listed in comment 10, we have also added a check for United States financial institutions chartered and regulated at the federal level. State- and local-level JOI information is inappropriate for them. As these financial institutions will have local information associated with their actual business operations, the potential exists to incorrectly assign them one of these JOIs. Our programmatic check will prevent issuance of EV certificates containing jurisdictionStateOrProvinceName or jurisdictionLocalityName fields if the QGIS is flagged as a federal-only source.
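The federal-only check just described could be sketched as follows. The flag set, entry, and function names are hypothetical illustrations of the rule: if the QGIS is federal-only, EV issuance is blocked whenever state- or locality-level JOI fields are present.

```python
# Hypothetical set of QGISs flagged as federal-only sources.
FEDERAL_ONLY_QGIS = {
    "Office of the Comptroller of the Currency",  # illustrative entry
}

def federal_joi_check_ok(qgis, jurisdiction_state, jurisdiction_locality):
    """Return True if issuance may proceed. For a federal-only QGIS,
    jurisdictionStateOrProvinceName and jurisdictionLocalityName must
    both be absent (represented here as None)."""
    if qgis not in FEDERAL_ONLY_QGIS:
        return True  # check applies only to flagged sources
    return jurisdiction_state is None and jurisdiction_locality is None
```

Note that this check blocks on the *presence* of a field rather than its value, unlike the match-style checks above, since any sub-federal JOI is wrong for these institutions.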

Due to unforeseen circumstances unrelated to this additional check, we are not able to deploy in the expected window this weekend. That places this release in the next deployment window on November 6.

JOI-QGIS matching has passed QA and is awaiting deployment in this coming weekend’s window.

We are pleased to announce that our suite of QGIS matching Guard Rails as described in comment 10 and then amended in comment 16 went live this past weekend.

Is there anything else we can answer for the community about our QGIS matching release or any other part of this incident?

Ben, we appear to have covered the relevant information on this issue and our newest programmatic checks. Discussion of our QGIS-based check for US federal credit unions is ongoing at bug 1741026. Is it time to close this bug?

Flags: needinfo?(bwilson)

I'll schedule to close this on next Wed. 1-Dec-2021.

Status: ASSIGNED → RESOLVED
Closed: 6 days ago
Flags: needinfo?(bwilson)
Resolution: --- → FIXED