Open Bug 1563579 Opened 4 months ago Updated 6 days ago

Sectigo: Failure to provide timely incident reports

Categories

(NSS :: CA Certificate Compliance, task)


Tracking

(Not tracked)

ASSIGNED

People

(Reporter: ryan.sleevi, Assigned: Robin.Alden, NeedInfo)

References

(Depends on 3 open bugs)

Details

(Whiteboard: [ca-compliance])

Robin,

Given the pattern in the following issues, I felt it best to spin this into a dedicated issue. Sectigo has begun demonstrating a worrying trend of not providing timely response to incident reports, and this seriously undermines confidence in the operations of the CA.

The following incidents exemplify this:

Mozilla's policy regarding Responding to an Incident notes that:

in no circumstances should a question linger without a response for more than one week, even if the response is only to acknowledge the question and provide a later date when an answer will be delivered. You should also provide updates at least every week giving your progress, and confirm when the remediation steps have been completed - unless Mozilla representatives agree to a different schedule by setting a “Next Update” date in the “Whiteboard” field of the bug.

In light of this pattern, please treat this as an incident report. This will help the community understand what the causes of these delays are and what steps Sectigo is taking to prevent future delays in its responses. If Sectigo has already made changes with respect to how it handles incidents, please detail those changes for the community, including explaining how they relate to the (potentially various) root causes.

Flags: needinfo?(Robin.Alden)

Ryan,
Thanks for splitting this into a separate report. We will follow up on this next week.

We have made progress on resolving an underlying cause of this issue. I will provide the requested incident response next week.

Although we did make progress last week on addressing the underlying cause, I do not yet have the incident report prepared. I will follow up early next week.

Ryan, I can't update this issue, but I think this should also depend on https://bugzilla.mozilla.org/show_bug.cgi?id=1567060, as the last update for that issue was an acknowledgement made over 50 days ago.

Depends on: 1567060

Thanks. The update for Bug 1567060 was similarly promised "next week", as in Comment #3 on this issue.

Depends on: 1575022

Robin, another week has elapsed since Comment #5, with no acknowledgement.

I want to emphasize that Sectigo needs to treat this as its highest-priority issue and ensure these issues are responded to in a timely fashion. If you're proposing deferred updates, then absent an explicit agreement on a new date, as reflected in the whiteboard status, continue to provide weekly updates.

In no situation should the period between an acknowledgement and an update exceed one week; that is the absolute upper bound. Anything longer jeopardizes the continued and future trust in Sectigo-issued certificates.

Sectigo folks: I am deeply concerned about the lack of response here. I'm hoping that there's a simple explanation, such as that Robin may be out of the office this week, or that messages are getting marked as bug spam.

Please confirm that Sectigo will be responding with an incident report by EOD of Friday, 20 September, GMT, and that Sectigo will ensure weekly updates on all associated Sectigo bugs. Considering that CAs have been distrusted over a lack of acknowledgements, this is gravely concerning.

Flags: needinfo?(rob)
Flags: needinfo?(rich)

Ryan, Thank you for your reminder.
I appreciate your considerable patience on the matter so far.
I will respond on this bug by the end of this week, i.e. by 20-sep-2019, and no less than weekly thereafter.

Flags: needinfo?(Robin.Alden)
  1. How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.

2019-07-04 Ryan opened bug 1563579

  1. A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.

I have not analyzed all of the times that we have not provided a response to Mozilla within the expected one-week window.
Ryan provided these examples:

  1. Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem. A statement that you have will be considered a pledge to the community; a statement that you have not requires an explanation.

Sectigo commits to respond to all open incident reports in Bugzilla according to Mozilla's guidance.
That guidance (from https://wiki.mozilla.org/CA/Responding_To_An_Incident) is "Once the report is posted, you should respond promptly to questions that are asked, and in no circumstances should a question linger without a response for more than one week, even if the response is only to acknowledge the question and provide a later date when an answer will be delivered. You should also provide updates at least every week giving your progress, and confirm when the remediation steps have been completed - unless Mozilla representatives agree to a different schedule by setting a “Next Update” date in the “Whiteboard” field of the bug. Such updates should be posted to the m.d.s.p. thread, if there is one, and the Bugzilla bug. The bug will be closed when remediation is completed."

  1. A summary of the problematic certificates. For each problem: number of certs, and the date the first and last certs with that problem were issued.

NA

  1. The complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem.

NA

  1. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

The underlying cause of this issue (bug 1563579) is that we found ourselves
under-resourced in terms of having people with the particular skillset to
provide the detailed reporting required for incident responses for Mozilla's
and wider public consumption.

We fully understand the benefit to the community of having these incidents
acted upon and the actions reported, and the essential nature of our response
to them in terms of both acts and reports if we are to retain the trust of the
community.

I further appreciate that, chiefly through my own tardy responses in Bugzilla, we
have given the impression that little if anything is happening in response to
reported incidents, but this is not the case. We have expended considerable
resources from several parts of our organization to deal with reported
incidents, and will continue to do so until we have brought the incident
responses to a conclusion.

  1. List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.

If Sectigo has already made changes with respect to how it handles incidents, please detail those changes for the community, including explaining how they relate to the (potentially various) root causes.

Immediately:
Our senior management has reiterated that compliance with Mozilla's and other root store operators'
policies is an essential prerequisite for our success, and that where additional resources are required
to achieve compliance, whether in action or in reporting, those resources will be provided.

Immediately:
A leadership group drawn from other areas of the company (software development, product management, information technology)
will meet weekly with Compliance to ensure that development tasks related to compliance are making satisfactory progress,
that all external reporting requirements, such as Mozilla's guidance on incident report responses, are being met,
and that all necessary resources are available to meet all of our compliance obligations.

Ongoing:
We are recruiting to expand our compliance team with a particular focus on identifying personnel with directly relevant experience in Web PKI compliance activities.
This will further enable us to engage more fully and actively (beyond the minimum requirements) with Bugzilla and m.d.s.p.

Flags: needinfo?(rob)
Flags: needinfo?(rich)

This bug was a topic of conversation at our weekly management meeting, and the commitment to comply with Mozilla's policy and other requirements was reiterated.

The initial meeting of our leadership team is set for early next week and the meeting will recur on a weekly basis while any compliance incidents remain open.

This week we had the first of what will be regular dedicated meetings of a leadership group that will track the development tasks relating to open incident responses, verify that we are meeting Mozilla's guidance on incident report responses, and ensure that the necessary resources are available to meet all of our compliance obligations.

We have done further work toward recruiting suitably skilled staff to assist with our compliance obligations.

Flags: needinfo?(Robin.Alden)

This week we reviewed and discussed the level of development effort that is being employed in compliance-related tasks.

The next regular meeting of the leadership group to track open incident response progress has been scheduled.

We are in HR negotiation with a highly skilled potential recruit for the compliance and CA management team.

Robin: Thanks for making sure to provide weekly updates. I realize we haven't acknowledged them explicitly, in part to see how things are progressing without those explicit acknowledgements. I appreciate that you're making sure to update weekly on progress.

That said, as I shared in the related bugs, I want to encourage Sectigo to be clearer and more detailed in its reports, to help build confidence that it's not just weekly updates, but that meaningful progress is being made. Obviously, sharing who the candidate is would definitely be on the /oversharing/ side, but as it relates to some of the technical controls and investigation on the dependent bugs, there's a lot of opportunity to actually dive in and explain.

While we all work in PKI and are subject matter experts in our respective fields, it might help to adopt a position of "assume they don't know what we mean", and try to add detail about the design and/or implementation. Alternatively, you might think of it like describing an algorithm: how could another CA implement a similar process, and get similar results, to Sectigo, which will hopefully be the "right" results.

For example, for this issue, sharing how you structure your meeting, what are the challenges you face, and how you're overcoming these challenges. Comment #9 said y'all were under-resourced for the experts, but presumably, Comments #10 and #11 are the meetings that involve the folks with the skills, making the explicit time to discuss and address the issues. Comment #9 provided assurances that significant resources were expended despite the lack of communication, but we're not really getting a picture of what that was then, or what that is now. Several bugs, two months later, don't even feel like there's a good preventative mitigation in place.

Helping understand where the challenges are helps browsers build confidence that things are going as expected. It also provides opportunities to help, to the best of their ability, to find solutions, or to allow the community to do more research into options.

Basically, what's missing here is the narrative I'd previously discussed. Comment #12 feels a bit like "Last week, we met in Rivendell. There was some debate about next steps" which... like, is both accurate and completely overlooks the critical discussions ;)

(In reply to Ryan Sleevi from comment #13)

.. as I shared in the related bugs, I want to encourage Sectigo to be clearer and more detailed in its reports, to help build confidence that it's not just weekly updates, but that meaningful progress is being made. Obviously, sharing who the candidate is would definitely be on the /oversharing/ side, but as it relates to some of the technical controls and investigation on the dependent bugs, there's a lot of opportunity to actually dive in and explain.

I appreciate the drivers for your comments there and we are making a sincere effort to usefully increase the level of detail in communication.

While we all work in PKI and are subject matter experts in our respective fields, it might help to adopt a position of "assume they don't know what we mean", and try to add detail about the design and/or implementation. Alternatively, you might think of it like describing an algorithm: how could another CA implement a similar process, and get similar results, to Sectigo, which will hopefully be the "right" results.

That's a useful pointer. For the JoI policy checker, for example, we can share an outline of its pseudo-code. I'll ask for that to be prepared.
As it currently exists the source code itself is not going to be useful to anyone else, but the pseudo-code can be.
Having that alongside the ruleset would allow other CAs to implement the same checks. Or, of course, it could be incorporated into one of the existing lint checkers.
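As a rough illustration of what such shared pseudo-code might look like, here is a minimal sketch of a jurisdiction-of-incorporation (JoI) consistency check. The field hierarchy reflects the EV Guidelines' subject jurisdiction fields, but the ruleset, function names, and country table are illustrative assumptions, not Sectigo's actual checker:

```python
# Hypothetical sketch of a JoI field-hierarchy check. VALID_COUNTRIES is a
# stand-in for a full ISO 3166-1 alpha-2 table.
VALID_COUNTRIES = {"US", "GB", "DE"}

def check_joi(country, state=None, locality=None):
    """Return a list of problems with the EV jurisdiction fields.

    Enforces the hierarchy: a locality-level jurisdiction requires the
    state/province, and any jurisdiction requires a valid country.
    """
    problems = []
    if country is None:
        problems.append("jurisdictionCountryName is required")
    elif country not in VALID_COUNTRIES:
        problems.append(f"unknown country code {country!r}")
    if locality is not None and state is None:
        problems.append("jurisdictionLocalityName present without "
                        "jurisdictionStateOrProvinceName")
    return problems
```

Run against a ruleset like this, another CA (or an existing linter) could reproduce the same pass/fail decisions on the same inputs.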

For example, for this issue, sharing how you structure your meeting, what are the challenges you face, and how you're overcoming these challenges. Comment #9 said y'all were under-resourced for the experts, but presumably, Comments #10 and #11 are the meetings that involve the folks with the skills, making the explicit time to discuss and address the issues. Comment #9 provided assurances that significant resources were expended despite the lack of communication, but we're not really getting a picture of what that was then, or what that is now. Several bugs, two months later, don't even feel like there's a good preventative mitigation in place.

This meeting is structured around incident reports. All of the incident reports we're currently examining are the public Mozilla ones.
It really helps compliance and development to talk through the tasks and the responses so far, and to get direction from a wider team of clueful people in the company; it also helps us avoid unnecessary rat-holing and headlight-staring.

For the three open incident bugs (so excluding this one) there is preventative mitigation in place.
The mitigation can always be improved.

We have our adversarial second review process, which is proving far more effective than before. By 'second review' I'm really referring to the 'Final Cross-Correlation and Due Diligence' step outlined in section 11.3 of the EV Guidelines. We now separate that second review team in the same way that you'd separate an IT security team from developers.

We also have improved automated checks in place to identify and reject the most egregious examples of incorrect address data, such as what we have come to call 'meta-data' (although I think that term is probably misused here): data consisting of dots, dashes, spaces, and other punctuation, especially sequences or repetitions of those characters, in place of city or state names or other address fields.
Those same automated checks also aim to identify frequently used words and phrases indicating that these fields do not contain real data, such as '(none)', 'NA', and 'Not Applicable', and we use the same mechanism to catch the common default subject phrases ('Some-State', 'Default City', etc.).
Those checks may seem unnecessary given the required initial verification checks and second review process, but those characters and phrases are exactly the ones that become so familiar to our human validators that they can become practically invisible to them.
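The kind of punctuation-only and placeholder-phrase detection described here could be sketched as follows; the regex, phrase list, and function name are illustrative assumptions, not Sectigo's actual ruleset:

```python
import re

# Fields consisting only of dots, dashes, spaces, and other punctuation
# ("meta-data" in the loose sense used above).
PUNCT_ONLY = re.compile(r"^[\s.\-_,/\\'\"()*+#]+$")

# Common placeholder phrases, including OpenSSL-style default subject
# strings, compared case-insensitively.
PLACEHOLDERS = {"(none)", "na", "n/a", "not applicable",
                "some-state", "default city"}

def is_suspect_field(value):
    """Flag a subject address field that is empty, punctuation-only,
    or a known placeholder phrase rather than real address data."""
    if not value or PUNCT_ONLY.match(value):
        return True
    return value.strip().lower() in PLACEHOLDERS
```

A real deployment would maintain the phrase list from observed rejections, precisely because these are the strings human reviewers stop seeing.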

The further automated checks of subject address fields that we are still working on aim to improve accuracy and reduce the scope for error by human validators. These have yielded some improvements but remain a work in progress, and something on which we are expending R&D effort. I can feel you willing me to go into more detail about the R&D avenues being explored, but in this I'm going to fail you, for today at least. I think there are limits on how much of this R&D work is reasonably responsive to the incidents we're dealing with, although we probably aren't quite there yet, and I absolutely realize we can't close out at least Bug 1575022 until the identification and remediation of prior problematic issuance is completed.

The more we automate the subject detail checks, the less we rely on our human validators and that is a win both because it serves to reduce ongoing costs and because it decreases error rates.

Helping understand where the challenges are helps browsers build confidence that things are going as expected. It also provides opportunities to help, to the best of their ability, to find solutions, or to allow the community to do more research into options.

Basically, what's missing here is the narrative I'd previously discussed. Comment #12 feels a bit like "Last week, we met in Rivendell. There was some debate about next steps" which... like, is both accurate and completely overlooks the critical discussions ;)

I've never been a Tolkien reader. I had friends at school who read him at 9 and 10, but I was more into running around with a stick. My running days are now behind me (unless chased by something large) and I left my stick somewhere.
But I take your point.

We have been looking some more at our internal auditing of EV SSL subject information, i.e. the 'Regular Self Audits' discussed at 17.5 in the EVGLs.

One lesson learned from things we missed that we should have caught, which has already been applied to our internal auditing and which we are working to apply to our systems, is that when human validators or auditors verify information there is still a danger of missing the wood for the trees. E.g., by looking to verify the requirement of Physical Existence and viewing that set of fields as a whole, we have in some cases missed the fact that the Locality was deformed (mis-spelled, for instance).

When examining whether a certificate request consistently matches a set of data from one of our information sources, we compare the data in multiple phases. One phase checks whether the certificate request data is complete and entirely internally consistent.
E.g.: is the certificate request data (as already checked by at least one person) complete in every respect? Are all of the expected fields present (mostly automatable), and do they form a single consistent address that conforms to the known form of addresses for that country, containing only accurate city and state names, with the city really in that state and the zip code referring to that city?
Another phase checks whether the certificate request data matches our Q?IS data.
E.g.: are these the same entity at the same address, not a different branch office? Does the Q?IS data have exactly the same field values? It may not, in which case the Q?IS data may be summarized or abbreviated such that the address fields no longer match exactly, yet we can still see that this is exactly the same entity at exactly the same location. Some Q?IS data uses outdated address forms, but the addresses can still be recognized as referring to the same place. GB is one place where this happens a lot, because stateOrProvince boundaries move over time and the postal system develops over time (historic county names that no longer exist but are still used, the introduction of post codes, the dropping of stateOrProvince from postal addresses), though perhaps that is local observer bias, since I happen to be in GB and other countries are equally fluid.

Robin: thank you for these thoughtful updates. They are very helpful in making the case that Sectigo is learning and improving.

For the lessons in comment #15, can you explain how they are being applied to your internal auditing?

(In reply to Wayne Thayer [:wayne] from comment #16)
Wayne,
As Robin mentioned, traditionally we've approached validation by looking at holistic requirements, such as verifying Legal Existence or verifying Physical Existence, and in doing that our focus has also been on the whole of the document, again sometimes overlooking the individual pieces (the specific fields encoded in the certificate).

As an example, we might have verified that Acme, Inc. is a legally registered company in the state of Delaware and maintains a physical location in New York City. We've collected the correct documentation from the Delaware Secretary of State to verify Legal Existence and obtained documentation from a QIIS to verify Physical Existence, but in looking at these things on a requirements/documents basis we may have missed that the QIIS has postal code 10034 while the customer, when formulating the request, fat-fingered the postal code and put 10334. Or the person doing the final second approval, cross-correlation, and due diligence check may have missed that the person who performed the initial validation made a typo in the corporation registration number (subject:serialNumber) when transferring it from the Delaware Secretary of State data.

In one sense we've validated Acme, Inc. correctly, in that we've collected and verified the documentation to satisfy the various requirements within the EV Guidelines: Acme, Inc. is a legal entity and we know and have verified the facts about the organization. But we may have fallen short in making certain that only that verified data made it into the final certificate, because both our training and our systems were placing the emphasis a level higher than it should have been.

So the two main things we're doing are (1) taking a more adversarial outlook in our internal audits, and (2) placing an emphasis, both for our validation team and our internal audit team, on drilling down to the specific certificate fields and verifying them one by one against the documentation.
We've revamped our training materials for both validation and auditing to put a much stronger emphasis on verification of the individual fields, and will also look for ways to visually place more emphasis on the individual fields within our validation platform.
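The field-by-field cross-correlation described here could be sketched as follows; the field names and helper are hypothetical, for illustration only:

```python
# Hypothetical field-level cross-correlation: compare each certificate
# subject field in the request against the value verified from
# documentation, instead of signing off on the request as a whole.
def cross_correlate(request_fields, verified_fields):
    """Return {field: (requested, verified)} for every mismatch, so a
    single-field typo (10334 vs 10034) cannot hide behind an otherwise
    correct request."""
    return {f: (request_fields.get(f), v)
            for f, v in verified_fields.items()
            if request_fields.get(f) != v}
```

An empty result means every verified field matches; anything else is a specific, reviewable discrepancy rather than a judgment about the request "as a whole".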
