Closed Bug 1741026 Opened 4 years ago Closed 4 years ago

Sectigo: Incorrect JOI for federal credit unions

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: tim.callan, Assigned: tim.callan)

Details

(Whiteboard: [ca-compliance] [ev-misissuance])

1. How your CA first became aware of the problem

In researching our corpus of certificates for possible misissuance, we discovered eleven certificates issued to federal credit unions whose jOIStateName fields contained their local state names. Because federal credit unions in the United States are incorporated at the country level, jOIStateName is improper for them and should be omitted.
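To make the intended policy concrete, here is a purely illustrative sketch (not our production code; the field and source names are assumptions) of the kind of pre-issuance check this rule implies:

```python
# Illustrative sketch only -- not Sectigo's production validation code.
# Field names and the FEDERAL_SOURCES set are assumptions for this example;
# "NCUA" stands in for a federal (country-level) registration source.
FEDERAL_SOURCES = {"NCUA"}

def check_joi(registration_source: str, joi_country: str, joi_state: str | None) -> list[str]:
    """Return policy errors for the EV jurisdiction-of-incorporation fields."""
    errors = []
    if not joi_country:
        errors.append("jurisdictionCountryName is required for EV certificates")
    if registration_source in FEDERAL_SOURCES and joi_state:
        # Federally chartered entities are incorporated at the country level,
        # so a state-level JOI must be omitted.
        errors.append("jOIStateName must be absent for federal registrations")
    return errors

# The misissuance pattern described above would be rejected pre-issuance:
assert check_joi("NCUA", "US", "TX") != []
```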

2. Timeline

October 7, 2021, 19:00 UTC
Research of our corpus of certificates discovers a certificate for a federal credit union with a state-level JOI defined. We schedule this certificate for revocation and write a query to search for additional certificates with this error. We create a ticket to add this check to our upcoming QGIS matching release, which at the time had a target date of October 18.

October 8, 17:30 UTC
Query returns an additional ten certificates.

October 12, 18:11:29 UTC
Original discovered certificate revoked.
Due to high COVID-19 absenteeism among the relevant development team, we slip the QGIS matching release to October 30.

October 13, 16:41:29 UTC
Additional ten discovered certificates revoked.

October 28
The QGIS matching release further slips to November 6.

November 6
QGIS matching goes into production.

3. Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem.

We have programmatically blocked this form of misissuance. We announced this release in bug 1724476 comment 12.

4. Summary of the problematic certificates

Eleven certificates issued between November 11, 2019 and October 5, 2021.

5. Affected certificates

11 certificates

Serial Number Certificate Precertificate
00C39D638E94B523A6392176C99CEC528A Certificate Precertificate
1A2673A8300430830826A6D638C5C153 Certificate Precertificate
00BEBA51537DD55950E8CFD9A0468B9E24 Certificate Precertificate
00C99981AB6A9D51974B91A14F953B5662 Certificate Precertificate
0CD1E3D2673D8DB6E67723F42F37FD2C Certificate Precertificate
00D02F74CC8F46C02D93C0202905C514BD Certificate Precertificate
00941A4B3277DC8E52FC309CC7C411719F Certificate Precertificate
522BCA3A4FCD8DC8BAB00720382C43A5 Certificate Precertificate
00F116CDCAEAF27F1061CFE51E0E7CA913 Certificate Precertificate
4BD255164EBD303CD41888FEC043B695 Certificate Precertificate
00BEBA51537DD55950E8CFD9A0468B9E24 Certificate Precertificate

6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now

Until recently, assigning jurisdictions of incorporation was a manual process for us. We have participated in much discussion on Bugzilla this year about the value of replacing human processes with predictable, consistent, programmatic processes, and this incident is an example of why. Despite our best efforts at training and diligence, on a small number of occasions our reps mistakenly included the state in which the credit union was located as a JOI.

This issue avoided detection before now because occurrences were very few and this is a subtle error that is easy to miss. We added this capability to our QGIS-matching release, which we expected to ship on October 18, shortly after which we would announce this bug. Unfortunately, that development team was subsequently hit hard by COVID-19, forcing us to slip this release first to October 30 and then to November 6. Because each schedule change was small, in each case it seemed to make sense to hold on just a short while longer to provide a better report. Because we had multiple slips, that delay added up. In retrospect, we should have reported an incomplete situation earlier, and if we had known in advance that we would face multiple unexpected delays, we would have posted this incident report in October, prior to releasing the fix.

7. List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future

Our November 6 QGIS matching release, as described in bug 1721271 comment 10, addressed the root cause of this problem. This release is one example of our strong effort to programmatically enforce correct certificate issuance in all ways we can identify and implement.

Assignee: bwilson → tim.callan
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Whiteboard: [ca-compliance]

I was trying to remember what other CAs had mentioned COVID as the explanation for misissuance, but then I recalled: it was in Bug 1721271, another Sectigo incident.

Four months ago, Sectigo stated, in Bug 1721271, Comment #3

Based on your comment it looks instead like we will initially report the misissued certificates and revocation dates. We’ll include root cause analysis and the programmatic fix if we’re able, but otherwise we will have to follow up.

It would appear that, despite management’s direct awareness of the expectations, and indeed a nearly identical incident having occurred, Sectigo failed to meaningfully uphold its commitments to prompt reporting. Indeed, Bug 1721271, Comment #16 appears to have acknowledged that there was a known issue here 19 days ago, yet the incident still went unreported.

Why did that commitment fail to lead to meaningful change and prevent a near identical failure to adhere to expectations, with a delay of nearly 5 weeks from incident to reporting?

Similarly, why did these certificates evade detection until now? Sectigo has had a spate of incidents, including Bug 1721271, Bug 1575022, Bug 1548713, and Bug 1551362, which have highlighted systemic issues with validation.

Bug 1551362, Comment #10, two years ago, highlighted a "JoI policy checking module", which appears to be similar to what is being described now, and which appears to have been in effect when the misissuance happened. Why should the community believe the guard rails project will be effective now, given that there is a sustained, repeat pattern at play here, and a demonstrable failure to prevent these issues as expected?

Flags: needinfo?(tim.callan)

(In reply to Ryan Sleevi from comment #1)

Why did that commitment fail to lead to meaningful change and prevent a near identical failure to adhere to expectations, with a delay of nearly 5 weeks from incident to reporting?

We attempted to address this in part 6 of comment 0. The original discovery was October 7. After a quick bit of engineering research we believed the code fix would fit into our upcoming October 18 code release. As that was a short timeline we intended to report it all in a single report within two weeks of initial discovery. Then we unexpectedly had to slip that release. Here we made an error in judgement. We should have proceeded with the report. Because the release still felt close, we held off on the report. Then there was a further slip, and as it was just until the next release window, again it felt like we were doing the community a better service by holding off and releasing a full report. I will repeat that this was an error and we should have proceeded with the report.

If in early October we could have predicted the actual release date, we would have posted an initial report and followed up as developments occurred. Instead we made a series of adjustments to the schedule that each appeared to make sense on its own at the time, but the aggregated result was poor. We mismanaged it, and I’m not saying otherwise. This response is meant to explain how that came about.

As our WebPKI Incident Response (WIR) team is an active, working team, we are continuously evaluating and adjusting our work processes to fit changing circumstances and implement overall improvements. This bug and bug 1740493 are among the drivers of that examination right now. Some of our current changes aimed at continual process improvement include:

  • Adding a team member to our CABF representation to monitor more closely upcoming developments and increase our engagement in that forum
  • Maintaining a formal Timeline document for each issue in real time, accessible to all WIR team members, to be populated with relevant events as they occur
  • Creating an internal checklist for Bugzilla incident reports to make sure each report meets our expectations
  • Updating and tightening the formal procedures we have in place for both changes in guidelines as well as responding to incidents

Similarly, why did these certificates evade detection until now?

This is a subtle error. To detect it, one must discern that a specific credit union with the jOIStateName field in an EV certificate actually is registered at the federal level, even though the vast majority of credit unions with jOIStateName are correct. With a total of eleven problematic certificates, they were needles in a haystack.

To find these we had to write a query based on likely keywords and then individually confirm the resulting suspects. Our QGIS-matching functionality now gives us a clearly identified source of federal credit unions for which we can programmatically enforce the correct behavior. This technical capability didn’t exist prior to the recent release.
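To make the nature of that search concrete, the following is an illustrative sketch of the kind of keyword query involved (the schema, table, and column names are assumptions, not our actual data model); every hit still required individual human confirmation:

```python
# Illustrative only: surface EV certificates whose organization name suggests a
# federal credit union but which nonetheless carry a state-level JOI.
# Table and column names are assumed for this sketch.
import sqlite3

SUSPECT_QUERY = """
SELECT serial_number, subject_o, joi_state
FROM issued_certificates
WHERE cert_policy = 'EV'
  AND joi_country = 'US'
  AND joi_state IS NOT NULL
  AND UPPER(subject_o) LIKE '%FEDERAL CREDIT UNION%'
"""

def find_suspects(db_path: str) -> list[tuple]:
    # Each returned row is only a suspect; a person must still confirm that the
    # entity really is federally chartered before treating it as misissued.
    with sqlite3.connect(db_path) as conn:
        return conn.execute(SUSPECT_QUERY).fetchall()
```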

Sectigo has had a spate of incidents, including Bug 1721271, Bug 1575022, Bug 1548713, and Bug 1551362 which have highlighted systemic issues with validation.

We have discussed in multiple incident threads over the past year that we are introducing a series of programmatic checks for factors that previously depended on purely human processes – precisely because human processes allow a set of potential errors that programmatic checks can eliminate. “Guard Rails” is simply a convenient name for programmatic checks of this sort. We have gotten ideas for these checks in several ways. Sometimes we put a check in place in response to a Sectigo misissuance event, as we did here. Other times we see the opportunity to implement a technical limitation to prevent a possible misissuance scenario. We may find this by monitoring other CAs’ bugs or through our own code review or simply by examining the issuance process and asking where an error can slip in. We prioritize these opportunities by their assessed degree of risk reduction.

Some of the bugs you have highlighted here are resolvable by Guard Rails initiatives. Others need different solutions. We have addressed problems with state names by implementing a defined country-state list, which we’ve discussed extensively in earlier bugs, especially bug 1710243. We are addressing the difficulty with localityName by phasing out this field entirely, a process that is well underway. While these latter two measures are not technically “Guard Rails”, as they are not programmatic pre-issuance checks, they still accomplish the same goal of eliminating opportunities for human error to affect certificate contents.

Bug 1551362, Comment #10, two years ago, highlighted a "JoI policy checking module", which appears to be similar to what is being described now, and which appears to have been in effect when the misissuance happened. Why should the community believe the guard rails project will be effective now

The “JoI policy checking module” implemented in 2019 introduced a rules table at the country and state level. It was valuable in that it prevented some potential misissuance, but it was too crude to handle the specific error case we see here. For US organizations it had a single rule, which was that the jOIStateName field MUST be populated and the jOILocalityName field MUST be empty. Federally registered credit unions are an exception to this rule that we did not consider when we implemented the module.
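To illustrate why that module could not catch this case, here is a minimal sketch (the structure and names are assumptions, not the actual module) of a single country-level rule and how a federally chartered entity slips past it:

```python
# Sketch in the spirit of the 2019 JoI policy checking module; the data
# structure and names are assumptions for illustration only.
US_RULE = {"joi_state": "REQUIRED", "joi_locality": "FORBIDDEN"}

def old_module_check(joi_state: str | None, joi_locality: str | None) -> list[str]:
    errors = []
    if US_RULE["joi_state"] == "REQUIRED" and not joi_state:
        errors.append("jOIStateName is required for US organizations")
    if US_RULE["joi_locality"] == "FORBIDDEN" and joi_locality:
        errors.append("jOILocalityName must be empty for US organizations")
    return errors

# A federally chartered credit union with a state value passes the check, even
# though the field should have been omitted -- a single country-level rule has
# no way to express that exception.
assert old_module_check("TX", None) == []
```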

Our recent QGIS matching release now enables us to define rules on a source-by-source basis, a finer level of control than the previous rules, which could be set at the state level at best. We are now able to programmatically set the legal existence requirements for JOI, registration number, registration date, and business category based on each registration QGIS source. This additional capability greatly increases our level of software-based control over these factors.
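For contrast with the sketch above, a source-keyed rules table of the kind this release enables (the source identifiers and structure below are hypothetical) can forbid the state field outright for a federal registry:

```python
# Hypothetical source-keyed rules; the identifiers are invented for illustration.
SOURCE_RULES = {
    "US-DE-SOS": {"joi_state": "REQUIRED",  "joi_locality": "FORBIDDEN"},  # a state registry
    "US-NCUA":   {"joi_state": "FORBIDDEN", "joi_locality": "FORBIDDEN"},  # a federal registry
}

def check_by_source(source_id: str, joi_state: str | None, joi_locality: str | None) -> list[str]:
    rules = SOURCE_RULES[source_id]
    errors = []
    if rules["joi_state"] == "REQUIRED" and not joi_state:
        errors.append("jOIStateName is required for this source")
    if rules["joi_state"] == "FORBIDDEN" and joi_state:
        errors.append("jOIStateName must be absent for this source")
    if rules["joi_locality"] == "FORBIDDEN" and joi_locality:
        errors.append("jOILocalityName must be absent for this source")
    return errors

# The case that slipped past the country-level rule is now caught:
assert check_by_source("US-NCUA", "TX", None) != []
```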

Flags: needinfo?(tim.callan)

Resetting N-I, as I don’t believe this question was actually answered:

Why did that commitment fail to lead to meaningful change and prevent a near identical failure to adhere to expectations, with a delay of nearly 5 weeks from incident to reporting?

I appreciate the description of the flawed processes that again led to repeat failures, but the question is not about getting Sectigo to acknowledge that those processes are flawed; it is why the previous commitment to fix those flaws didn’t prevent this.

For example, the last time Sectigo delayed its reporting, and was called out, why did it fail to implement meaningful changes and alerting to ensure situations like those in Comment #2 never happened again? This is the key: not about understanding that Sectigo’s judgement was flawed, but rather, why these sorts of flaws weren’t prevented meaningfully.

it felt like we were doing the community a better service by holding off and releasing a full report.

What specifically in the delayed report was felt to be better here? The only distinction discernible from this issue is that the change was completed. It’s unclear if Sectigo feels there are other details that improved this report, as it generally feels like a substandard report (given the goals of these reports), but it would be useful to be explicit about these, to see if there is community agreement with Sectigo’s interpretations.

This is a subtle error.

It does not feel like the details provided actually support this. The goal of SC30 was specifically to take a register-centric approach, in which the JOI fields are set according to the data source used, and the CA enumerates all data sources (and expected field values). This should have been trivial to detect with such an approach, as confirming the EV details with a federal source would naturally prevent this, if implemented as such. This has come up in other CA incidents, and it’s clear other CAs have reached the same conclusion, so it’s unclear what approach Sectigo is using, and whether or not that systemically meets the compliance objectives.

That Sectigo only realizes this now, despite this exact implementation being described during the ballot discussions, is concerning.

Some of the bugs you have highlighted here are resolvable by Guard Rails initiatives.

These bugs highlight patterns of judgement failures, and that’s why I have little confidence in the Guard Rails initiative. If the guard rail is based on a flawed or incorrect understanding, it will produce flawed results. I am concerned that, as with delays in the reporting or the failure to ensure data source and JoI consistency, Sectigo is taking a piecemeal approach based on preventing past incidents, rather than a systemic approach of correctly meeting the requirements and preventing future incidents.

This suggests a deeper organizational change is needed, especially given the many ways in which it appears Sectigo has misunderstood past CABF ballots or improperly/incompletely ensured their systems complied. I realize this sounds harsh, but I hope you can see that the pattern of issues, including those continuing to be reported, supports these conclusions. I realize that many of these issues have roots in the 2019-2020 period, but that doesn’t excuse them, and it highlights the need for even greater diligence and responsiveness in 2021.

Flags: needinfo?(tim.callan)

(In reply to Ryan Sleevi from comment #3)

For example, the last time Sectigo delayed its reporting, and was called out, why did it fail to implement meaningful changes and alerting to ensure situations like those in Comment #2 never happened again. This is the key: not about understanding that Sectigo’s judgement was flawed, but rather, why these sorts of flaws weren’t prevented meaningfully.

We’ve been talking about this point, and in this case it’s clear our decision making didn’t get enough scrutiny. Our WebPKI Incident Response (WIR) team has been pushing hard on a broad variety of initiatives, and in our effort to get through our to-do lists, we came to conclusions that we should have looked at more closely and challenged, but did not.

When we formed WIR in August of 2020, it had six members. In the intervening time we have grown that group to our present lineup of fourteen including representation from Compliance, Development, QA, Validation, and Customer Service.

This is a hand-picked set of leaders from the company with a high degree of institutional knowledge and expertise in their areas of specialty. It would be impractical to expect our average junior employee to fully understand the nuances of PKI, public root store programs, and the CA/Browser Forum. Instead, we believe it’s important that this effort be led and conducted by experienced, senior, knowledgeable people. It may not be surprising that these people are in high demand and have a lot they are trying to get done.

Even though we can’t just throw bodies at it, our level of need nonetheless has increased dramatically. Our Bugzilla activity is way up, driven in no small part by our twin programs of certificate base investigation and code review. These activities themselves require a great deal of attention and cycles, as do other very important initiatives like our thorough CPS review and updates, our complete rework of validation documentation, and additional initiatives we have discussed extensively on Bugzilla this year. All this essential work has added pressure to our task lists and was a contributor to what you see here. As we intend to keep our foot on the gas for these initiatives, we must continue seeking out ways to maintain both quality and throughput in our WIR activities.

We see that effort falling into two main buckets. The first is to deepen our bench. We discussed this at length in bug 1712188 comment 16. One important benefit of the cross-functional WIR team is that members’ knowledge broadens and deepens. It gives someone who is very good at, let’s say, QA or development but is a little newer to the WebPKI world a chance to sit shoulder-to-shoulder with someone with twenty years of public CA experience. The more we can broaden and train this way, the better we can realize our ambitious plans. This has been a deliberate strategy since before bug 1712188 surfaced and continues as part of our effort to upgrade our organization.

The second bucket is continuing to improve the process and tools used by this team. A few of the improvements we have made recently include:

  • Real-time authoring of incident timelines beginning once we’re aware of a potential incident
  • Creation of an automated search to place new Bugzilla incidents in front of the WIR team as soon as they occur
  • Weekly CABF activity updates as a formal agenda item provided to the entire team by the CABF lead, whereas before they were an "as-needed" item
  • Creation of a tool for automatic generation of linked certificate markdown for bug reports (see the sketch below)

These improvements may be small and incremental, and they occur frequently as we continue to work our processes and identify ways to get better.
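As one small illustration of the last item in that list, a toy version of such a helper might look like the following (our internal tool differs; the crt.sh serial-search URL pattern is an assumption about the exact link format used):

```python
# Toy sketch of a report helper: turn serial numbers into markdown table rows
# linking to crt.sh. The real internal tool is more involved; the URL pattern
# assumes crt.sh's public serial search.
def crtsh_markdown(serials: list[str]) -> str:
    lines = ["| Serial Number | crt.sh |", "|---|---|"]
    for s in serials:
        lines.append(f"| {s} | [link](https://crt.sh/?serial={s}) |")
    return "\n".join(lines)

if __name__ == "__main__":
    print(crtsh_markdown(["00C39D638E94B523A6392176C99CEC528A"]))
```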

We very recently wrote on this same topic in bug 1740493 comment 6, and it’s worthwhile to include a passage from that comment here.

Let us point out that a year and a half ago the company had no formal WebPKI Incident Response process and we were barely forming our WIR group. Though we benefitted from the passionate activity of a few experienced individuals, we suffered from its ad hoc nature. In the intervening time we have been learning from experience, adding to our team and establishing procedures in response to friction points we observe internally, activity on Bugzilla and m.d.s.p, and, unfortunately, our own errors. While we regret these errors, we use them to adjust our actions for the sake of continuous improvement.

It is our intention to continue this trajectory of improvement.

What specifically in the delayed report was felt that was better here? The only distinction discernible from this issue is that the change was completed

Resolution of the problem and reporting affected certificates in accordance with community expectations seemed like worthy additions to the report.

the CA enumerates all data sources (and expected field values). This should have been trivial to detect with such an approach, as confirming the EV details with a federal source would naturally prevent this, if implemented as such

As discussed at length in bug 1721271, our QGIS-matching release is the first programmatic functionality we have put in place to enable this specific software-based check. Prior to that release, our validation team used a procedural approach, which as we’ve discussed this year in many comments to many bugs allows for the possibility of human error. Human error was the root cause of this bug, which is resolved by our November 6 QGIS-matching release.

With that capability now in place, it did in fact become trivial to detect the problematic certificates, which is how we were able to identify them here and revoke them. Prior to that it was not.

If the guard rail is based on a flawed or incorrect understanding, it will produce flawed results.

We are unaware of any flaws in any of the Guard Rails releases we have put out since the initiative began. These releases have reduced risk by eliminating possible issuance errors, some of which previously had resulted in misissued certificates and some of which merely held that potential. We see no indication that our understanding of the rules governing any of these programmatic checks was flawed. We believe it is achievable to continue identifying valuable programmatic checks, specifying them correctly, and putting them in place.

I realize that many of these issues have roots in the 2019-2020 period, but that doesn’t excuse them

We consider ourselves responsible for our performance and our behavior, and we believe our actions reflect that understanding. Nonetheless, we can only build upon what exists today. For the past occasions when the company deployed a software error, took on engineering debt, or failed to document code or processes, we do not have the ability to go back in time and undo that. What we CAN do is:

  • Aggressively act when these issues come to light
  • Proactively search our corpus of certificates for misissuance and report, revoke, and fix whatever we find
  • Conduct code review and cleanup

We are doing these things and will continue into 2022 and beyond. Smoking out and closing down previously unknown flaws in a complex system is a lengthy process, and it will take time. We have advised the community in the past that we expected to discover and write up bugs against ourselves as part of this process, and that prediction has proven true. We continue to expect new discoveries like these as we proceed with the dual actions of reviewing our body of certificates and cleaning up our legacy code.

Flags: needinfo?(tim.callan)

We are monitoring this bug for any additional comments.

We believe we have addressed all discussion surrounding this issue. Ben, is it time to close this bug?

Flags: needinfo?(bwilson)

I believe this bug can be closed. I'll schedule this for closure on or about this Friday, 17-Dec-2021, unless there is a need for more discussion.

Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Flags: needinfo?(bwilson)
Resolution: --- → FIXED
Product: NSS → CA Program
Whiteboard: [ca-compliance] → [ca-compliance] [ev-misissuance]