Sectigo: Forbidden Domain Validation Method
Categories
(CA Program :: CA Certificate Compliance, task)
Tracking
(Not tracked)
People
(Reporter: ryan.sleevi, Assigned: tim.callan)
Details
(Whiteboard: [ca-compliance] [dv-misissuance])
Attachments
(1 file)
1.33 MB, application/x-zip-compressed
In Bug 1694233, Comment #19, Sectigo disclosed that "we had not kept our list of supported DCV methods and their corresponding BR section numbers fully up to date".
While more details were provided in Bug 1694233, Comment #20, this is a distinct incident worth tracking in its own right.
Assignee
Comment 1•4 years ago
Acknowledging this bug. We will use it to track this update and corresponding certificate revocations.
Reporter
Comment 2•4 years ago
Thanks. As part of this, it's useful to examine why a month passed after Bug 1694233, Comment #19 without a bug being filed. While that was a useful comment, it was made on a closed issue (per Bug 1694233, Comment #16), and even after a reminder 8 days ago in Bug 1694233, Comment #21, Sectigo didn't file an issue.
Assignee
Comment 3•4 years ago
Assignee
Comment 4•4 years ago
- How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.
As part of our end-to-end DCV investigation and update, covered on an ongoing basis in bug #1694233, we discovered that our CPS was missing one DCV method we had in production and that other portions of CPS section 3.2.2.1.1, which covers DCV, warranted edits.
We added BR methods 3.2.2.4.19 (ACME-http-01) and 3.2.2.4.20 (ACME-tls-alpn-01). ACME-http-01 had been in use prior to this update. ACME-tls-alpn-01 had not. We also made editorial updates to our descriptions of our DNS-based, non-ACME-based, and CAA-email-based DCV methods (relevant BR section numbers: 3.2.2.4.6, 3.2.2.4.7, 3.2.2.4.13, and 3.2.2.4.18).
When we realized our CPS had been missing method 3.2.2.4.19, we knew we would have to investigate our certificate base. A few days afterward, a discussion of CPS updates on m.d.s.p. prompted us to inform the community that we would have our own report coming. This is that report.
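For reference, BR method 3.2.2.4.19 corresponds to the ACME http-01 challenge defined in RFC 8555. A minimal sketch of the CA-side check, assuming hypothetical function and parameter names rather than Sectigo's actual code, might look like this:

```python
# Minimal illustration of an ACME http-01 check (BR 3.2.2.4.19 / RFC 8555
# section 8.3). Function and parameter names are assumptions, not Sectigo's
# production code.
import urllib.request


def expected_key_authorization(token: str, account_key_thumbprint: str) -> str:
    # RFC 8555: keyAuthorization = token || "." || base64url(JWK thumbprint)
    return f"{token}.{account_key_thumbprint}"


def validate_http01(domain: str, token: str, account_key_thumbprint: str) -> bool:
    # The CA fetches the challenge file over plain HTTP and compares the body
    # to the expected key authorization for this token and account key.
    url = f"http://{domain}/.well-known/acme-challenge/{token}"
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            body = response.read().decode("ascii", errors="replace").strip()
    except OSError:
        return False
    return body == expected_key_authorization(token, account_key_thumbprint)
```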
- A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.
All times Eastern Daylight Time
April 15, 2021
Largely in response to bug #1694233 we undertake a wall-to-wall evaluation of our DCV processes including code review, process review, and investigation of our issued certificate base. We continue to post progress updates on that bug even after it is closed, to keep the community informed of what is happening.
May 3
Our CPS review reveals updates needed for our CPS to exactly match current and intended DCV methods.
May 5
We announce on bug #1694233 comment 19 that we will have a CPS update coming.
May 5 to May 20
We author, review, and sign off on CPS updates.
May 21, 11:14 am
Updated CPS published to site.
Week of May 24
We commence investigation of affected certificates. We direct customers and partners known to be regular ACME users to the updated CPS and encourage them to look into their own certificates and replace any they believe are affected.
We wind up building a custom script to find affected certificates (sketched in outline after this timeline). It’s a large query that must run for several days.
June 4
This bug opened by Ryan Sleevi.
June 6
Query script completes.
June 7
We collect the results of the query script, schedule revocation, and send revocation notices. We create a custom script to revoke the discovered certificates.
June 10, 8:22 to 9:44
Custom script runs, revoking all affected certificates.
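The timeline above refers to a custom query script that finds affected certificates and a separate script that revokes them. As a rough sketch only, assuming a hypothetical certificates table and a stand-in revoke_certificate() helper (Sectigo has not published its actual schema or tooling), such a find-then-revoke workflow could be structured like this:

```python
# Hypothetical sketch of the find-then-revoke workflow from the timeline.
# Table and column names and revoke_certificate() are illustrative assumptions,
# not Sectigo's actual schema or internal tooling.
import sqlite3

AFFECTED_WINDOW = ("2021-03-13", "2021-05-21")  # first/last issuance dates reported below


def find_affected(conn: sqlite3.Connection) -> list[str]:
    # Serial numbers of certificates whose domain control validation used
    # ACME http-01 while that method was absent from the published CPS.
    rows = conn.execute(
        """
        SELECT serial_number
        FROM certificates
        WHERE dcv_method = 'acme-http-01'
          AND date(issued_at) BETWEEN ? AND ?
        """,
        AFFECTED_WINDOW,
    )
    return [row[0] for row in rows]


def revoke_certificate(serial: str) -> None:
    # Stand-in for the internal CA call that revokes a certificate and
    # publishes updated CRL/OCSP responses; here it only records the intent.
    print(f"revoking {serial}")


def revoke_all(conn: sqlite3.Connection) -> None:
    for serial in find_affected(conn):
        revoke_certificate(serial)
```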
- Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem. A statement that you have will be considered a pledge to the community; a statement that you have not requires an explanation.
We published our updated CPS on May 21, 2021. As of that moment the CPS covered all in-production DCV methods.
- A summary of the problematic certificates. For each problem: number of certs, and the date the first and last certs with that problem were issued.
369,922 certificates
The first was issued March 13, 2021.
The last was issued May 21, 2021.
- The complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem.
File attached.
- Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.
In December 2019 Sectigo released ACME support, including DCV using ACME-http-01 and ACME-dns-01. The code changes required for this support were small. Because the new ACME support was so similar to existing DCV methods, our product management team did not identify that the new capability required modification to our CPS. At that time the compliance department took a reactive approach to CPS maintenance, responding to business needs as they were brought to it. When we changed our compliance leadership, we moved to proactive management of our CPS to ensure compliance.
After releasing ACME support, we failed to identify this gap until our recent CPS review discovered it.
There were aspects of our response to this bug we consider suboptimal. Ordinarily we would have taken less time to update and publish our CPS and less time to investigate for affected certificates. Our responses were impaired on this occasion by a huge influx of other compliance matters, some of which were quite urgent. In one month we have gone from zero open bugs to eight. In particular, bug #1712188 (test certificates) has proven to be extreme in both its complexity and importance, as it touches on software, training, and corporate culture.
We had to make tradeoffs in how we allocated our key team members’ time across multiple issues, several of which would in most periods have been the top priority. It’s worth mentioning that when we formed our WebPKI Incident Response team last summer, in order to ensure we were going in the right direction, we named a number of our most senior people to this team. Their experience, knowledge, cultural leadership, and decision-making authority have been irreplaceable assets as we’ve worked to build a clean and accurate issuance practice across all circumstances and business flows.
Unfortunately, these qualities mean that relatively few team members need to carry the load when unexpected, complex issues like this one and bug #1712188 come up. We are in the midst of a sudden burst of activity, and that has left us making the tough priority decisions on a daily basis. In this case, we knew that our technical implementation of ACME HTTP Challenge DCV fully matched the description in the Baseline Requirements. What was missing was the appropriate language in the CPS. There was no risk of fraud or misuse of certificates. It was, if I may say so, a clerical error.
That meant we ultimately allowed tasks associated with other bugs to push ahead of these, especially 1712188. Part of our tradeoff was to keep every bug moving forward and not let any of them simply sit. Part was to prioritize tasks like investigating and fixing flaws over those like writing them up. Part was to ensure that the most critical flaws made progress the fastest. And part was not to completely drop other projects that in the long term offer our best path to more reliably accurate issuance.
In the midst of all this, we also have aggressive improvement initiatives that we want to drive independently of our Bugzilla load. The biggest and most important of these are:
- The “Guard Rails” project, which ultimately will seek to vet every field against acceptable potential contents before issuance can occur (a rough sketch of the idea follows this list)
- Our complete DCV audit and update
- Review and revamp of all validation process documentation and training
- Elimination of potential failure points such as streetAddress
- Full CPS review and update
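As a rough sketch of the “Guard Rails” idea, assuming a hypothetical rule table rather than the project’s actual design, a pre-issuance field check could look like this:

```python
# Illustrative sketch only: the field names and rules below are assumptions,
# not the actual Guard Rails rule set.
import re

FIELD_RULES = {
    "countryName": re.compile(r"^[A-Z]{2}$"),  # ISO 3166-1 alpha-2 code
    "commonName": re.compile(r"^(\*\.)?[a-z0-9-]+(\.[a-z0-9-]+)+$"),  # FQDN or wildcard
    "organizationName": re.compile(r"^[\w .,&'-]{1,64}$"),
}


def vet_fields(tbs_fields: dict[str, str]) -> list[str]:
    """Return a list of violations; issuance proceeds only if it is empty."""
    violations = []
    for name, value in tbs_fields.items():
        rule = FIELD_RULES.get(name)
        if rule is None:
            violations.append(f"{name}: no acceptance rule defined")
        elif not rule.fullmatch(value):
            violations.append(f"{name}: value {value!r} outside acceptable contents")
    return violations
```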
We wanted to maintain momentum on these also. And finally, we’re deep into our annual WebTrust audit. Team members who ordinarily would be able to drop everything and focus on these bugs had to put significant time into keeping the audit on track. This year we are using a new audit firm, which means we’re receiving information requests in different forms than previous years and spending time explaining things we won’t need to explain again in future years. All these actions take time.
I do feel that on the whole we’ve made these tradeoffs well this past month or so. We have been responsive and have kept every train moving forward. We have understood root causes, implemented fixes, ceased issuance when necessary, and performed revocations on time. We have provided thorough, accurate writeups and offered detailed information in response to questions raised. We have kept our non-Bugzilla initiatives on track.
We believe this resourcing pinch is a short-term problem brought on by the perfect storm of projects and reports we’re facing right now. We do believe that as we work through a few major bugs, including this one, it will be easier to keep everything on the kind of schedule that has become the norm for us since revamping our incident handling as described in bug #1563579. Some of these bugs appear to be winding down, and for the major ones we have been making good progress. We will continue pushing on them until we all can consider them resolved as well.
- List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.
Our present CPS includes all in-production methods of DCV along with new methods we would like to support in the future.
In February 2021 we assigned an experienced team member as the designated internal owner of our CPS with responsibility for understanding on an ongoing basis our issuance practices alongside what our CPS says and ensuring the two are aligned. This process change would have prevented this error.
We also have added compliance review of CPS/process match as a release-blocking requirement for all product releases affecting certificate authentication. This change also would have prevented this error.
Assignee
Comment 5•4 years ago
Once again we have attempted to provide a thorough write-up, including an explanation of its timing. Thank you for your patience. We are here to respond to questions or comments based on what you see above.
Reporter
Comment 6•4 years ago
In February 2021 we assigned an experienced team member as the designated internal owner of our CPS with responsibility for understanding on an ongoing basis our issuance practices alongside what our CPS says and ensuring the two are aligned. This process change would have prevented this error.
Can you share more precise details about what this entails?
I'm trying to understand where along the spectrum of "We said X was responsible for this" (i.e. we can throw them under the bus next time) and "We have a playbook that has the following tasks ...", this falls, and what those tasks look like.
Assignee
Comment 7•4 years ago
(In reply to Ryan Sleevi from comment #6)
Can you share more precise details about what this entails?
There are four potential triggers for a CPS update:
- We monitor all CABF ballots for purposes of deciding how to vote, casting our vote, and preparing for changes that may come from them. If a ballot passes, we kick off the action plan to account for the ballot. Most will not require changes to our CPS, but a few do. When that occurs, updating the CPS is part of the action plan.
- Very occasionally a development outside CABF requires a CPS update. Last year’s browser deprecation of two-year SSL certificates is an example. We monitor communications from root store owners for these uncommon updates.
- The compliance team is looped into our release roadmap and actively evaluates which items may require an update. We have a team member who sits on our public CA steering committee, which meets weekly and is where key stakeholders discuss our priorities for this business. That member brings any developments affecting validation and issuance to the WebPKI Incident Response (WIR) team’s working meeting, and at that meeting we determine our action plan, including any required CPS changes.
- As part of our evolving strategy and lessons learned, the WIR team itself may realize that a CPS change would be of value and kick off a project based simply on its own internal knowledge.
The WIR team has twice-weekly working meetings. They were 90 minutes, but we have temporarily upped them to two hours per meeting while we have so much going on. We have referred to this meeting in previous comments on previous bugs, including bug #1563579, comment #25. This meeting includes a cross-functional group of leaders from compliance, development, validation, customer success, and QA. In this meeting we surface needs, kick off initiatives, work through open questions, assign actions, share findings, and report on results. All items resulting from one of the above bullets will find their way to the WIR working meeting, where they are documented, given owners, and tracked.
Should we determine that a CPS change is required, that action will be assigned to the designated CPS owner, and we will document it and track it in future iterations of the working meeting. The updated CPS goes to our four-person Policy Authority team for review and signoff before publication. The CPS owner tracks this review process. Once reviews are complete, the CPS owner hands the updated documents off to our web development team for publication.
I'm trying to understand where along the spectrum of "We said X was responsible for this" (i.e. we can throw them under the bus next time) and "We have a playbook that has the following tasks ...", this falls, and what those tasks look like.
Linus’s law states, “Given enough eyeballs, all bugs are shallow.” That is one of the benefits of a working team like the WIR team, to ensure we have enough attention and brainpower focused on public CA compliance. At the same time, if everybody owns an initiative, nobody owns it. So we consider it essential to have an assigned owner for every project or task. For our CPS that owner follows the processes explained earlier in this post.
(It happens we're not big fans of throwing our employees under the bus. We're fully on board with the concept of blameless post-mortems that enable our individual staff, Sectigo as a whole, and the wider industry to learn from mistakes and improve security for relying parties.)
Reporter
Comment 8•4 years ago
(In reply to Tim Callan from comment #7)
Linus’s law states, “Given enough eyeballs, all bugs are shallow.”
And Heartbleed shows us "Busy teams getting lots of feature requests can overlook critical bugs" ;-)
I appreciate the detail shared here in Comment #7, but also note the omission of retroactive examinations to see if clarifications were missed. I admit, I don't have a great solution for "We misunderstood a previous requirement, but we still misunderstood it", but I'm wondering to what extent periodic reviews of related discussions (CABF, Bugzilla, m.d.s.p.) could happen to challenge assumptions. Much in the same way quarterly self-audits exist to support existing controls, is something like that a potential solution for this problem? I don't know, but I'm wondering how we prevent the "cascade of misunderstandings" (e.g. as recently discussed on Bug 1713668 with Amazon Trust Services).
There's also something of note from the timeframe: The CP/CPS analysis was kicked off by Bug #1694233, but it seems the explanation for why this was not kicked off sooner (e.g. in response to other CAs' incidents) was that Sectigo was so busy with its own incidents (some of which are, understandably, quite concerning and deserving of dedicated attention).
As best I can understand from Comment #4, the answer to the questions in Comment #2 is "We were quite busy and made difficult triage calls based on the resources available to us". I mention this, because lack of resources has been raised (both by Sectigo, in response to its incident response delays, and by other CAs), and I am explicitly commenting on it here to make sure Sectigo has a path forward to prevent this in the future. I realize that this isn't exactly "Hire an intern to deal with this" territory - there's not an easy elastic slack built-in - but whatever path Sectigo takes, we need to make sure that there's a good compliance strategy in the lean times, and not just the good times.
Assignee
Comment 9•4 years ago
(In reply to Ryan Sleevi from comment #8)
I'm wondering to what extent periodic reviews of related discussions (CABF, Bugzilla, m.d.s.p.) could happen to challenge assumptions.
Our focus has been to monitor these in real time. We fully cover CABF with attendance at all working groups and attention to the mailing lists. We keep an eye on Bugzilla and m.d.s.p. for threads that are relevant to our ongoing operation, including evolving requirements that we’ll need to adjust to and other good ideas that we can benefit from.
For example, last fall we read bug #1673119 (Entrust: Subscriber provides private key with CSR) with great interest and conducted a full audit of all our own entry points where CSR submission might result in private key exposure. We implemented a real-time check for private keys on all these entry points, and submission of a private key will result in immediate programmatic revocation of all active certificates with that key pair. We maintain a persistent table of disallowed keys and searched our own logs and SSL abuse queue going back to the beginning of our operation for keys to include in that table. We reviewed other CAs’ reports on this issue and added keys from their reports. We revoked active certificates using these keys. All of this was triggered by the original Bugzilla post.
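A minimal sketch of that kind of entry-point check, assuming hypothetical names rather than Sectigo's actual implementation, might look like this:

```python
# Hypothetical sketch of screening a certificate request submission for
# private key material, as described above. Names are illustrative.
import hashlib
import re

# Persistent store of fingerprints of key blobs that must never be accepted.
DISALLOWED_KEY_FINGERPRINTS: set[str] = set()

PRIVATE_KEY_MARKER = re.compile(
    r"-----BEGIN (?:RSA |EC |ENCRYPTED )?PRIVATE KEY-----"
)


def screen_submission(pem_payload: str) -> bool:
    """Return True if the submitted payload exposes private key material."""
    if not PRIVATE_KEY_MARKER.search(pem_payload):
        return False
    # Record a fingerprint of the exposed blob so it stays blocked and any
    # active certificates using the key can be queued for revocation.
    fingerprint = hashlib.sha256(pem_payload.encode("utf-8")).hexdigest()
    DISALLOWED_KEY_FINGERPRINTS.add(fingerprint)
    return True
```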
That isn’t to suggest that we are perfect and couldn’t possibly miss something subtle. But we are active in following these sources and benefit from doing so.
I think your point is a little different than that, though. We do not presently conduct systematic backward-looking reviews of these sources to see if we understand them differently with the benefit of hindsight. We can see the potential value and want to get there. But first we hope to focus on the initiatives we have discussed that we believe will give us a great deal of bang for our buck. We do appreciate that in this case we might have discovered the CPS error earlier with a backward review of that particular product release. The challenge is that a great many product releases and other activities have occurred in the past and reviewing them all is a massive task. We continue to think that for the near future we have identified the highest priorities for optimizing clean issuance and a clean certificate base, and we want to focus on them until we feel that they are well in hand.
As best I can understand from Comment #4, the answer to the questions in Comment #2 is "We were quite busy and made difficult triage calls based on the resources available to us". I mention this, because lack of resources has been raised … and I am explicitly commenting on it here to make sure Sectigo has a path forward to prevent this in the future.
I didn’t mean to be obscure. Please excuse me if I was. There was just a lot to say.
Your capsule summary about triage is exactly correct.
The public CA compliance effort is spearheaded by our WebPKI Incident Response (WIR) team. While public CA compliance is every employee’s job (much more on this topic in tomorrow’s upcoming big post on bug #1712188), per my comment #7, we need a point of ownership for public CA compliance and the WIR team is it.
This team consists of designated compliance staff and a variety of hand-picked representatives of other key functions. This team has roughly doubled in size from when we formed it in the summer of 2020 as I have vetted and recruited potential additions. WIR is a bit of a hot ticket inside Sectigo and people want to be part of it because they see the value in what we’re doing. For our part we have kept the bar very high on who is invited to join, as the quality of this team’s operation is critical to our compliance efforts.
Of course, there is an entire company here to support us, but certain parts of these responses are difficult to offload in a hurry. If you don’t already have the underlying understanding of our process, our code, or our expected behavior from a CA compliance perspective, there are certain tasks you won’t really be competent to perform. So investigation, coordination of response, and some of the actual software updates of necessity had to come from the core WIR team.
This team is properly resourced for what our needs appeared to be up until about two months ago. At that time we had zero open bugs and were highly focused on putting systematic controls and process improvements in place. In other places like comment #4 I have listed the main initiatives we were focused on at that time (plus our WebTrust audit, of course). We were excited to push these initiatives forward and hopeful about the results we could get. And then suddenly this swarm of incidents came along, and a few of them were very important and quite complex, and we found ourselves resource constrained. And we made the hard triage decisions. You know the rest.
Certainly I have learned that my earlier assessment of our resource needs was wrong. I have secured an additional headcount and now need to somehow find the time to clear my head and put together a good description so our recruiters can start working on it. I’m calling this position a Project Engineer, and the job is to drive our programmatic initiatives meant to enforce high quality issuance. A lot like a technical product manager for the things the compliance department cares about. So for example this person should eventually be the owner for both the Guard Rails and DCV review projects. This will be a valuable addition as our programmatic issuance quality projects are the most important long-term compliance initiatives we have.
It will be a difficult hire. Rob Stradling likes to say to me, “PKI experts don’t grow on trees,” and he’s right. But if we can get the right key team member in place, it will be a huge asset to our going-forward quality strategy. And if upon onboarding that member we find our resources need to increase again, I’m confident in the ability to do so. One step at a time.
These last two paragraphs also highlight the need to develop our own bench strength. We can and should be able to increase our overall level of PKI savvy by imparting expertise to our existing team. I will touch on this point also in my upcoming big post for bug #1712188. But the high-level idea is that we have had to add staff very quickly. When we broke away from Comodo in November 2017, we came across with 76 employees. When I joined in April 2018 we had 100. Today we have 425. To make that growth happen, we of necessity had to recruit mostly for expertise in a certain job function and not in our specific industry. So we have a lot of developers who are good developers but they’ve never been developers for a public CA before. Or QA, or tech support. Etc.
We have been seeking ways to impart more of that expertise, which really depends on well-understood types of employee programs. For example, we now have Sectigo University. Sectigo University is an employee education program with designated staff meant for both onboarding and ongoing human capital improvement. This was a meaningful expenditure of effort and budget, and the platform comes with all the best-of-breed capabilities like content delivery, testing, role-based programs, and compliance tracking.
Sectigo University went live in May and we have been busy populating it with content. Modules relevant to this discussion include mandatory training for all employees on the expectations and proper behavior for a public CA and detailed training for appropriate staff like development or tech support on our compliance requirements and our process to meet those requirements.
To some degree the WIR team itself develops bench strength. The cross-pollination of knowledge in that team has been great to watch. By putting together a set of people who each come in with his own area of expertise, everyone has had his viewpoint broadened and his relevant knowledge increased. We like this as a small success case and are presently wondering about how to extend this idea more broadly, but nothing official is happening yet.
We believe these are a good start and we want to do more. We’re working on what that “more” might be. I know you hate the “We’re thinking about it” answer, but in this case, we’re thinking about it.
Assignee
Comment 10•4 years ago
(In reply to Tim Callan from comment #9)
I have secured an additional headcount and now need to somehow find the time to clear my head and put together a good description so our recruiters can start working on it. I’m calling this position a Project Engineer, and the job is to drive our programmatic initiatives meant to enforce high quality issuance. A lot like a technical product manager for the things the compliance department cares about.
I'm pleased to say we have moved a strong internal candidate into this position. Our new project engineer has years of history with the company and a good working knowledge of the world of public CAs and our own product development processes. I am already offloading tasks. All signs point to this addition being a benefit to our efforts to release programmatic features to ensure quality of certificate issuance.
Reporter
Comment 11•4 years ago
Sending this over to Ben to see about closing.
As called out by Comment #8, I'm still concerned that Comment #2 happened, because it seems very reminiscent of Bug 1563579. 11 months ago, Sectigo gave an update, in Bug 1563579, Comment #11, that highlighted the steps they were taking to resolve those concerns, including:
b. Each Bugzilla bug has a parallel ticket in Sectigo's Jira system, so that the team can better track our everyday internal discussions and coordinate the preparation of our responses. We also have a shared tracking spreadsheet, for a simple one-page view of upcoming deadlines for posting updates and responding to questions on each bug plus a summary of known outstanding tasks for each bug that will need to be addressed before the bugs can be closed.
This bug failed to meet that process, simply because it appears Sectigo did not create the bug as requested, as Comment #2 shows. Once this bug was created, we've seen regular communications, which is encouraging, but it's not enough to overcome the concerns about prioritization. I think the only reason we might not be having conversations about distrusting Sectigo right now is that, demonstrably, we're seeing improvements in the incident reports in terms of details and transparency (often erring on the side of including irrelevant details, for better or worse), and because we're seeing quality explanations and systemic investigations taking place. I don't want to leave those unacknowledged, just like I don't want to let this issue slip through as being concerning as to the "why", perhaps even more so than the "what". Comment #9 captures a lot of the challenges, and shows enough awareness of those challenges and the steps that will hopefully lead to improvement, that I'm willing to consider this resolved (in terms of concrete steps), while highlighting the concern (for the future).
Comment 12•4 years ago
I'll close this matter on or about Friday 16-July-2021 unless there are additional questions or concerns to be discussed.
Assignee
Comment 13•4 years ago
(In reply to Ryan Sleevi from comment #11)
I'm willing to consider this resolved (in terms of concrete steps), while highlighting the concern (for the future).
Ryan,
We understand your points here and in other recent bugs and take them to heart. I hope it has been visible during the past two months or so that we view issuance quality as a non-negotiable requirement and our key initiative in the latter half of 2021 and beyond. We have been on this arc since mid to late 2020. Of necessity in that time we had to focus on some hairy existing projects such as our Q4 OV revocation event and have been trying to get out in front of things.
We appreciate you working with us as we keep putting quality measures in place. As we actively search our own certificate base for misissuance (both to clean up the existing certificates and, more importantly IMHO, to find ways to prevent those types of misissuance moving forward), we do expect to write up more bugs against ourselves as this process continues. Our goal is that for each of these issues we can offer a clear fix that we can develop and deploy.