Closed Bug 1648717 Opened 4 years ago Closed 3 years ago

Sectigo: Failure to provide a preliminary report within 24 hours.

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: fozzie, Assigned: rich)

Details

(Whiteboard: [ca-compliance] [disclosure-failure])

I sent two problem reports to sslabuse@sectigo.com and have yet to receive a preliminary report:

Wednesday 24th June 12:59 UTC - I sent a report concerning https://crt.sh/?q=A10D502F5B6770CC633EA6BB4C472E6971B42C7AF33156D639E579B616C82BF2 (Manchester in the localityName when the stateOrProvinceName is London).
Wednesday 24th June 13:18 UTC - I sent another report concerning https://crt.sh/?q=D0A3B6E663D2AA07FA386AA70B4FA34A861473533684067AE4551FD38FE70558 (Buckinghamshire in the stateOrProvinceName yet a Welsh postalCode is set).
Wednesday 24th June 16:38 UTC - I received a response for my first report saying that the report has been received.
Wednesday 24th June 16:39 UTC - I received a response for my second report saying that the report has been received.

As of Friday 26th June 09:08 UTC I haven't had any further responses from Sectigo on these reports.

Thanks very much for this incident report. We are investigating it now and assuming it checks out at our side we will respond with an incident in Mozilla's preferred format.

Flags: needinfo?(Robin.Alden)
Assignee: bwilson → Robin.Alden
Blocks: 1563579
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Whiteboard: [ca-compliance]
  1. How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.

https://crt.sh/?q=A10D502F5B6770CC633EA6BB4C472E6971B42C7AF33156D639E579B616C82BF2
24-Jun-2020 12:59 UTC - Email was received to SSL abuse address detailing problem of incorrect Subject information in certificate.

https://crt.sh/?q=D0A3B6E663D2AA07FA386AA70B4FA34A861473533684067AE4551FD38FE70558
24-Jun-2020 13:18 UTC - Email was received to SSL abuse address detailing problem of incorrect Subject information in certificate.

  2. A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.

https://crt.sh/?q=A10D502F5B6770CC633EA6BB4C472E6971B42C7AF33156D639E579B616C82BF2:
24-Jun-2020 16:38 UTC - Reply sent by Sectigo staff to acknowledge the report.
24-Jun-2020 19:59 UTC - Message communicated to customer/partner regarding incorrect information, stating revocation will be actioned 'Sunday, June 28th, 2020 at 4:30pm EST'.
24-Jun-2020 20:06 UTC - Partner confirmed and requested more time to have the new certificate installed.
24-Jun-2020 20:09 UTC - Sectigo staff confirmed new revocation on 'Monday, June 29th at 10:00am EST'.
29-Jun-2020 14:12 UTC - Certificate was revoked.
29-Jun-2020 14:19 UTC - Email sent by Sectigo to reporter confirming revocation.

https://crt.sh/?q=D0A3B6E663D2AA07FA386AA70B4FA34A861473533684067AE4551FD38FE70558:
24-Jun-2020 16:39 UTC - Reply sent by Sectigo staff to acknowledge the report.
24-Jun-2020 19:40 UTC - Message communicated to customer regarding incorrect information, stating revocation will be actioned 'Sunday, June 28th, 2020 at 4:30pm EST'.
26-Jun-2020 20:30 UTC - Reply sent by Sectigo staff to reporter thanking for the report and confirming the certificate will be revoked on the date and time above.
29-Jun-2020 14:12 UTC - Certificate was revoked.
29-Jun-2020 14:18 UTC - Email sent by Sectigo to reporter confirming revocation.

  3. Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem. A statement that you have will be considered a pledge to the community; a statement that you have not requires an explanation.

NA

  4. A summary of the problematic certificates. For each problem: number of certs, and the date the first and last certs with that problem were issued.

https://crt.sh/?q=A10D502F5B6770CC633EA6BB4C472E6971B42C7AF33156D639E579B616C82BF2
https://crt.sh/?q=D0A3B6E663D2AA07FA386AA70B4FA34A861473533684067AE4551FD38FE70558

  5. The complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem.

See 4.

  6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

The original misissuance was related to the issues in: 1575022

  7. List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.

Issues with validation are being addressed in: 1575022
I have reached out directly to the reporter to further investigate the responses that were not received when sent from our ticketing system (Salesforce). The last report was before 3 of the responses were sent, so they may have been received now.
(Reporter confirmed they received the responses. They have suggested a change to better help with email delivery that we are investigating with Salesforce).

Thanks for the incident report.

I received your responses of "Thank you for bringing this to our attention. We will look into this immediately and inform you of any relevant updates." which are not considered preliminary reports.

When you sent emails to your customers about revocation, why did you not send this to me as well, as required by BR 4.9.5?

Within 24 hours after receiving a Certificate Problem Report, the CA SHALL investigate the facts and circumstances related to a Certificate Problem Report and provide a preliminary report on its findings to both the Subscriber and the entity who filed the Certificate Problem Report.
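
As a rough illustration of the 24-hour window quoted above, the deadline check can be sketched as follows. This is a minimal sketch, not a description of any CA's tooling: the function names are hypothetical, and it assumes the clock starts at the moment the problem report was received. The timestamps are the ones given in this bug for the first report.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumption: the 24-hour window starts when the problem report is received.
PRELIMINARY_REPORT_WINDOW = timedelta(hours=24)


def preliminary_report_deadline(received_at: datetime) -> datetime:
    """Latest time at which a preliminary report is still within the window."""
    return received_at + PRELIMINARY_REPORT_WINDOW


def is_overdue(received_at: datetime, now: datetime,
               report_sent_at: Optional[datetime] = None) -> bool:
    """True if a report was sent late, or none has been sent and the window has passed."""
    deadline = preliminary_report_deadline(received_at)
    if report_sent_at is not None:
        return report_sent_at > deadline
    return now > deadline


# Timestamps taken from this bug (first problem report).
first_report = datetime(2020, 6, 24, 12, 59, tzinfo=timezone.utc)
checked_at = datetime(2020, 6, 26, 9, 8, tzinfo=timezone.utc)

print(preliminary_report_deadline(first_report).isoformat())  # 2020-06-25T12:59:00+00:00
print(is_overdue(first_report, checked_at))                   # True
```

An acknowledgement that only says the report was received would not count as `report_sent_at` here, since it carries no findings.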

Flags: needinfo?(nick)

Thanks, George.

We do aim to provide preliminary reports for these Certificate Problem Reports confirming the issues, and in these cases our staff did not provide this report confirming your findings until around 48 hours later when the revocation date was given. As above, your findings were accurate and the certificates revoked in these cases.
I note other reports you have made were confirmed (or explained) within 24 hours.

Flags: needinfo?(nick)

I believe an incident report should be produced explaining how these two reports were not sent preliminary reports in 24 hours and what Sectigo is going to do to ensure they will in the future.

I agree with Comment #5, this report seems to have overlooked the underlying/systemic issue.

Flags: needinfo?(Robin.Alden) → needinfo?(nick)

I do not believe there is a systemic issue, but human error on these two reports (which were received close together and handled by the same staff member at the same time).

An initial response was made by a member of staff (not an auto-response), but this did not confirm the findings and revocation back to the reporter.
A communication was made to the subscriber within the 24 hour window, confirming revocation, but in error this was not also copied to the reporter.

I have checked each of the 24 reports sent by the original reporter throughout June and found all but the two listed here had preliminary reports confirming revocation within the required window.

Would you like me to expand that into a full incident report? I am happy to do so.

Flags: needinfo?(nick)

Your initial incident report hasn't really explained how the issue occurred or what Sectigo is doing to make sure it doesn't happen again. Under Responding To An Incident:

For example, it’s not sufficient to say that “human error” or “lack of training” was a root cause for the incident, nor that “training has been improved” as a solution. While a lack of training may have contributed to the issue, it’s also possible that error-prone tools or practices were required, and making those tools less reliant on training is the correct solution. When training or a process is improved, the CA is expected to provide specific details about the original and corrected material, and specifically detail the changes that were made, and how they tie to the issue. Training alone should not be seen as a sufficient mitigation, and focus should be made on removing error-prone manual steps from the system entirely.

I think a new incident report should be posted with all of these steps explained clearly.

I agree with George, and made a similar remark in the incident report provided in Bug 1650845 to the same effect, as it involved a separate reporter with a remarkably similar root cause.

Flags: needinfo?(nick)

We are still investigating what changes we can make to our customer-service systems for our staff responding to incident reports (that cannot be handled in an automated manner). This may require changes across our third-party ticketing system as well as our internal order-management systems.

We may combine the response with bug 1650845 and do a single (new) report with remediation for both.

Flags: needinfo?(nick)

When can we expect to see more information on this revamp?

Ben: We had a meeting late last week of the Incident Response group, and discussed a requirements document from the team directly handling these reports.
It's already being worked into a development specification.

Since the Sectigo/Comodo carve-out, we have used Salesforce as the primary system for our staff to communicate with partners and customers - Validation, Support and our Abuse teams.
Given the greatly increased volume of reports that we deal with (which we would still like to automate), we realise that, specifically for abuse reporting, a different system is needed: one tailored to handling these reports, verifying keys where needed, having humans check any Subject information or other parts that cannot be automated, completing revocation and then ensuring the various involved parties are notified correctly and on time.

I don't have an ETA for development yet, but will of course share once we do, with timely updates as needed.

It's now been nearly two weeks, and I would say "timely updates" have not been provided.

Flags: needinfo?(rob)
Flags: needinfo?(nick)

We're working on a system change which will, in cases where we agree with the reporting party and revoke the certificate, automate the preliminary report to let the reporting party know what action we're taking. I met with the manager of our customer service team last week to begin spec'ing this out. I don't yet have an ETA, but hope to have a first draft spec completed by end of this week or early next.

Flags: needinfo?(rob)
Flags: needinfo?(rich)
Flags: needinfo?(nick)

Still working on the spec, so no additional updates yet.

Draft spec is near completion, but no additional updates at this time.

No additional updates yet.

Flags: needinfo?(rich)

Rich: How are things progressing on the draft spec? What's causing the delay from Comment #14's projection to present?

Flags: needinfo?(rich)

In response to Ryan, comment 18;
The spec is 90% complete. Unfortunately, both I and the CS manager with whom I've been collaborating on it have been pulled in other directions over the past few weeks. I hope to pull it back up in the coming week and pin down the last bits it needs before sending on to the dev team, but I also know that there are several items higher on the priority list for both of us, so it will depend on both of us being able to work through our respective task lists and also being able to get our schedules synced to bring this over the finish line. Know that this has not fallen off the radar. It's still high on my task list but, as inevitably happens, it has been bumped down a couple slots due to unforeseen things coming up.

Rich: I'm hoping you can be more candid and transparent in the answer here. This doesn't really seem to fit with https://wiki.mozilla.org/CA/Responding_To_An_Incident#Keeping_Us_Informed

"I got busy, they got busy, it's important, but there's other things more important" doesn't really inspire confidence or trust. If there are competing concerns, being transparent about what those concerns are helps us recognize "OK, they've got the right priorities, and their priorities are what's best for users". Yet vague statements, especially with no concrete timelines or commitments other than hopes and aspirations, don't really do that.

I think Sectigo is at an inflection point where it really needs to be quite candid and transparent, given the systemic patterns of issues that are non-communication, a lack of transparency, and a lack of accountability. If you believe you've got the right priorities, share them. Otherwise, it feels like this is Sectigo not taking compliance seriously, and I doubt that's the impression Sectigo wants to leave.

In response to Ryan Sleevi;

Ryan, first off I’m happy to report that the initial draft of the functional specification for the above mentioned improvements to the revocation process has been completed and sent to the dev team for review and feedback. I will continue to report back here on progress until such time as the new code is complete and live.

I hope you can appreciate that this is an unusual period for us. We’ve had a host of extraordinary needs raise their heads all on top of each other. We have been trying to bring our audit to conclusion and to examine and potentially replace 21,000 or so OV certificates. We put in upgrades to systems to fix the bugs that allowed these misissuances, and we’ve been aggressively cleaning up our outstanding Bugzilla backlog. We are working on behind-the-scenes improvements like the one being discussed here. In the midst of all this we recently had a root expiration that went thermonuclear due to previously unknown bugs in OpenSSL and other popular tools. One might say these are mostly self-imposed problems, and though that may be, that doesn’t change the fact that Sectigo’s compliance team is firing at a lot of targets right now. So rightly or wrongly, I prioritized things like examining the OV certificate population over the specification for this bug.

You ask what our priorities are? We’re working to improve and clean up our public CA issuance and certificate base and to build the systems and business processes necessary to achieve total certificate agility with all public certificates. That is a complex and multivariate project, not so much a task as a large number of separate tasks of various scope and nature, some (or many) of which are not fully understood or perhaps even known about at this stage.

Our BHAGs are “No misissued certificates” and “All public certificates can be replaced within the specified timeframes with zero negative effect on subscribers or relying parties.” These are a tall order, but that’s why they’re BHAGs. We don’t have to get all the way there to make meaningful improvements that will matter to us, our subscribers, and the community as a whole. When I recently pointed out that an internal process control found a flaw that in prior times would have gotten through, that is a sign to us that the initiative is having an impact. We like to see that kind of thing.

The carve out from Comodo Group was a tough time for us. We had twenty years’ worth of completely intertwined systems that had to be disentangled ASAP, a vast hairball of legacy code to deal with, and a skeleton crew of employees that numbered well under half of what we needed to operate in any reasonable fashion. At that time we put first things first, addressing the most egregious needs and risk factors and working our way down the list. We have been grinding at it for nearly three years and will continue to grind for years to come. Examples of what we have worked on and improved have been topics of conversation in this forum (see 1645868 comment #41 for instance). We are not as automated as we would like to be. Sometimes an individual employee makes a wrong call. We still stub our toes on previously invisible software flaws. This will continue for a while. Every day we’re a little better than the day before, and that’s how these things get done.

Flags: needinfo?(rich)

That is a complex and multivariate project, not so much a task as a large number of separate tasks of various scope and nature, some (or many) of which are not fully understood or perhaps even known about at this stage.

I think the opportunity here is to share concretely what those separate tasks are, and to make sure that this bug and the related Bug 1563579, which are fundamentally highlighting that there seems to be systemic risk in continuing to trust Sectigo, help reassure folks, by making clear what's being prioritized and/or where time was spent.

Similarly, I hope you can understand how highlighting the Comodo/Sectigo split, which does seem to have resulted in better outcomes with respect to governance, also opens up questions about pending acquisitions, to ensure that the resources and support will be provided, rather than further pared back.

It sounds like Sectigo has identified a number of key areas for change, and is executing on them. I think when it comes to comments like Comment #19, we need more transparency about what's being done. Comment #21 speaks to goals, and that's laudable, but I think there's understandable concern that actions speak louder here. I understand that not everything will be ready yet, some is still being done, but if you're going to provide an update like Comment #19, it should really be more detailed and concrete about what's soaking that time.

When it comes to big hairy audacious goals, those are understandably the product of many smaller things, so sharing updates about progress being made to those BHAGs, especially if it's detracting from near-term feedback, is critical. I think that in the length of Comment #21, I only really see one concrete example of that, namely:

So rightly or wrongly, I prioritized things like examining the OV certificate population over the specification for this bug.

That's incredibly useful! That's the sort of detail that we're looking for. There's no information not appropriate to share, so even if it was down to the level of timesheets of "Rich was focused on this, Rob on this, Inigo was working to close out this issue, and Tim's focused on this; with everyone focused on those, this issue didn't get brought over the finish line. However, the focus helped us close those out, with minimal customer disruption, and that allows us to provide more attention to this issue going forward" is the sort of thing that helps highlight priority. From the perspective of module peers, we're no strangers to the fact that some bugs require a lot more time investment than others, but this is where sharing is key for CAs.

Hopefully this helps provide more concrete feedback here regarding updates like Comment #19, and frankly, like Comment #21.

Similarly, when you talk about "draft spec", that feels a bit like "training has been improved" as a solution, just with "code" in between. I think sharing more here is what we want. What was the old procedure? What's the new draft spec proposing? You can share those before review with the dev team, because that also allows updates like "The dev team pointed out X and Y were infeasible, but offered A, B, C as approaches that would do the same or better, so the final plan is A, B, and Z". Those are the things that help us learn, by helping us also highlight what doesn't work.

Flags: needinfo?(rich)
Assignee: Robin.Alden → rich

We met with the dev team to review the specification and answer any questions they might have. Out of that meeting we have a few actions to clarify the specification, but mostly it’s finalized. We pinned down the time required to code the changes to 5 to 10 days, and are now looking to get a concrete timeframe for getting that block of time inserted into the dev schedule. I’ll update on that as soon as I’m able.

In comment 22, Ryan said:

Similarly, when you talk about "draft spec", that feels a bit like "training has been improved" as a solution, just with "code" in between.

Point taken. To be more specific, we are undertaking to automate more of the actions required for both revoking the certificate, and notifying the various parties who need to be notified, including the reporting party.

Specifically within our certificate lifecycle management system, we are making the following improvements to the certificate revocation workflow:

  1. Add a field to contain the email address of the reporting party, and auto-generate the response to inform the reporting party as to the action to be taken as a result of the report. Automatic reporting to certificate Subscriber already exists.
  2. Add the exact time at which the initial report came in so that all other actions can be calculated based upon that time.
  3. Calculate the time in which the certificate MUST be revoked based upon the time the report came in and the stated reason for the revocation, to match max revocation window with BR requirements, and prevent the agent processing the revocation from setting a scheduled revocation time outside that window. Note: Currently the ability to schedule the revocation exists, but allows the agent to specify any date/time they choose without any reference to the date and time of the report or anything else.

These changes will help ensure both that the reporting party is advised of preliminary findings, as well as to further help ensure that the certificate is revoked w/in the timeframe required.
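
The three improvements listed above can be sketched roughly as follows. This is an illustrative sketch only, not Sectigo's implementation: the class, field and reason names are hypothetical, and the reason-to-window mapping (24 hours for key compromise, 5 days for inaccurate Subject information) is an assumption that should be checked against the current Baseline Requirements. The example timestamps are made up.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical mapping from revocation reason to the maximum revocation window.
MAX_REVOCATION_WINDOW = {
    "key_compromise": timedelta(hours=24),
    "inaccurate_subject_info": timedelta(days=5),
}


@dataclass
class RevocationRequest:
    reporter_email: str           # point 1: who must receive the preliminary report
    report_received_at: datetime  # point 2: exact time the report came in
    reason: str                   # point 3: drives the maximum window

    def revoke_by(self) -> datetime:
        """Latest permitted revocation time, counted from receipt of the report."""
        return self.report_received_at + MAX_REVOCATION_WINDOW[self.reason]

    def schedule(self, when: datetime) -> datetime:
        """Accept a scheduled revocation time only if it falls inside the window."""
        if when > self.revoke_by():
            raise ValueError(
                f"scheduled time {when.isoformat()} is after the deadline "
                f"{self.revoke_by().isoformat()}"
            )
        return when


# Made-up example values.
request = RevocationRequest(
    reporter_email="reporter@example.com",
    report_received_at=datetime(2021, 1, 4, 9, 0, tzinfo=timezone.utc),
    reason="inaccurate_subject_info",
)
print(request.revoke_by().isoformat())  # 2021-01-09T09:00:00+00:00
request.schedule(datetime(2021, 1, 8, 9, 0, tzinfo=timezone.utc))  # accepted

try:
    request.schedule(datetime(2021, 1, 10, 9, 0, tzinfo=timezone.utc))
except ValueError as err:
    print("rejected:", err)
```

Deriving the deadline from the stored receipt time, rather than letting the agent type in any date, is what removes the manual step described in the note on point 3.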

Flags: needinfo?(rich)

Still working with the dev team to get an ETA. Hope to have that pinned down NLT early next week.

I've been advised by the dev team lead that these improvements to our revocation processing should be live by the end of November.

No further update at this time.

No further update at this time.

On the 8th November 2020 at 10:58 UTC I sent an incident report to Sectigo, on 9th November 2020 at 17:29 UTC I received an email saying that the investigation had started.

I realise this bug is now focusing on the automation aspect of this but given that the investigation has to have at least started for a preliminary report to be sent out this seems like another big flaw in Sectigo's plan.

Agreed. Adding automation and error checking to this part of the process is one of our roadmap items for item 2b in bug 1563579, comment #35.

We have no further update.

We have no update at this time.

We have no update at this time.

The first round of improvements to our revocation portal has gone live this past weekend. We've already spec'd a second round, and sent to the dev team but don't yet have a firm implementation date. Further improvements will be ongoing.

We have no update at this time.

We have no update at this time.

We have no update at this time.

We have no update at this time.

We have no update at this time.

Tim: While I appreciate the weekly updates, it's not clear to me that there's anything of substance here. For example, Comment #25 highlighted "end of November", but we had no insight if that was on track or delayed until Comment #33 - a week after the scheduled deploy. Similarly, in Comment #33, we've got a commitment that there will at least be a firm implementation date forthcoming, but no update in the month that has followed.

Comment #33 doesn't clarify what the first round of improvements are (were they the ones from Comment #23? A subset?), nor does it clarify what the second round of improvements are. Comment #23 is an example of a good response, because it provides an understanding of what progress is being made, the challenges, and the priority. Comment #33 through Comment #38 are perhaps less than ideal, because they give no insight into what is happening, why, and when.

That's the difference between closing this bug out or not.

Flags: needinfo?(tim.callan)

Ryan,

Your point is taken. As you’ll see from the likes of bug 1645686 and bug 1563579, comment #51, in Q4 we were focused on execution of a few big initiatives, combined with a good amount of PTO as people finally had a chance to use up some of their backlog time. We understand the value of open communication with the community and still desire to maintain proactive transparency in our open bugs.

Let’s address the matter you have brought up here.

Comment 23 gives these criteria:

  1. Add a field to contain the email address of the reporting party, and auto-generate the response to inform the reporting party as to the action to be taken as a result of the report. Automatic reporting to certificate Subscriber already exists.
  2. Add the exact time at which the initial report came in so that all other actions can be calculated based upon that time.
  3. Calculate the time in which the certificate MUST be revoked based upon the time the report came in and the stated reason for the revocation, to match max revocation window with BR requirements, and prevent the agent processing the revocation from setting a scheduled revocation time outside that window. Note: Currently the ability to schedule the revocation exists, but allows the agent to specify any date/time they choose without any reference to the date and time of the report or anything else.

We added these three capabilities to our single-certificate revocation and points 2 and 3 to our bulk revocation in the first release.

With the next release we intend to add a few things related exclusively to bulk revocation:
• Add the point 1 functionality to bulk revocation
• Add a ticket number to the bulk revocation order
• Ability to automatically replace the revoked certificates where appropriate
• A few bug fixes and interface improvements we discovered with rev 1

In using rev 1 of the portal we identified potential for additional improvements in problem report handling and certificate revocation processing. We have started discussing what we would like to see, and at this point incremental improvement to our existing processing system or a rework of processing are both on the table. The decision to move forward on rev 2 as specced or to combine it with larger changes will determine the release schedule for rev 2 and further advancements. As we settle these questions, we’ll update the community.

Flags: needinfo?(tim.callan)

We have no update at this time.

We have no update to add right now.

We've submitted a specification to our dev team to extend the additional functionality that's been added to the single certificate revocation process to the bulk revocation processing. We are awaiting their review and should be able to give an ETA sometime next week.

We have no new information at this time.

Let me add that we are still working on an expected delivery date.

Newly submitted bug 1694233 discusses a problem with our Bulk Revocation engine, which is fixed and set for deployment on March 7. See that bug for more information.

We have no update right now.

We have no further update today.

Ben,

The subject matter of this bug has changed over time, and now is focused on the ongoing development roadmap for a capability that did not exist when the bug was opened. The original matter has been resolved for some months now. Can we close this bug?

Once we close this bug we should close bug 1563579 and bug 1650845 as they depend on this bug.

The updates to our bulk revocation functionality are still awaiting roadmap prioritization. It is a seldom used capability with other ways to get the job done, so it's okay for it to be in the queue for the moment, as illustrated by bug 1694233, for which we conducted a bulk revocation.

Cheers,

-tlc

Flags: needinfo?(bwilson)

I intend to close this bug on or about Friday, 26-Mar-2021.

(In reply to Tim Callan from comment #40)

  1. Add a field to contain the email address of the reporting party, and auto-generate the response to inform the reporting party as to the action to be taken as a result of the report. Automatic reporting to certificate Subscriber already exists.
  2. Add the exact time at which the initial report came in so that all other actions can be calculated based upon that time.
  3. Calculate the time in which the certificate MUST be revoked based upon the time the report came in and the stated reason for the revocation, to match max revocation window with BR requirements, and prevent the agent processing the revocation from setting a scheduled revocation time outside that window. Note: Currently the ability to schedule the revocation exists, but allows the agent to specify any date/time they choose without any reference to the date and time of the report or anything else.

We added these three capabilities to our single-certificate revocation and points 2 and 3 to our bulk revocation in the first release.

(In reply to Tim Callan from comment #49)

The subject matter of this bug has changed over time, and now is focused on the ongoing development roadmap for a capability that did not exist when the bug was opened. The original matter has been resolved for some months now. Can we close this bug?

Did you add automated responses to the reporting party for when no revocation is triggered? You still need to send a preliminary report within 24 hours even if you don't revoke.

(In reply to Mathew Hodson from comment #51)

Did you add automated responses to the reporting party for when no revocation is triggered? You still need to send a preliminary report within 24 hours even if you don't revoke.

Rather than give you an off-the-cuff yes, I took a day to confirm my understanding. It's a good thing I did because it turns out there remains a no-revocation workflow where the automation code is yet to be delivered. All the revocation scenarios are covered. We are paying close attention to this case to prevent any failures until the automated solution is in place.

I had gotten the false impression in my early days in this role that all dev items here were complete. I'm going to seek to escalate it so that we have all this functionality buttoned up.
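
To make the point from comment #51 concrete, an automated preliminary report needs a branch for both outcomes. The sketch below is hypothetical throughout: the names are invented for illustration and the message wording is not Sectigo's. It only shows that the no-revocation path produces a report too, and that the report goes to both the Subscriber and the reporter.

```python
from enum import Enum
from typing import List


class Outcome(Enum):
    REVOKE = "revoke"
    NO_REVOCATION = "no_revocation"


def build_preliminary_report(cert_id: str, outcome: Outcome, detail: str) -> str:
    """Build the report text; every outcome yields a report, not only revocation."""
    if outcome is Outcome.REVOKE:
        finding = f"We confirmed the problem and will revoke the certificate. {detail}"
    else:
        finding = f"We investigated and do not plan to revoke the certificate. {detail}"
    return f"Preliminary report for certificate {cert_id}: {finding}"


def recipients(subscriber_email: str, reporter_email: str) -> List[str]:
    """The preliminary report goes to both the Subscriber and the reporting party."""
    return [subscriber_email, reporter_email]


# Made-up example values.
message = build_preliminary_report(
    "example-cert-id", Outcome.NO_REVOCATION, "The Subject details were verified as accurate."
)
print(message)
print(recipients("subscriber@example.com", "reporter@example.com"))
```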

Flags: needinfo?(bwilson)

I think this matter can be closed and intend to do so on or about next Wednesday, 31-March-2021, unless there are items that need to be discussed.

Flags: needinfo?(bwilson)
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Flags: needinfo?(bwilson)
Resolution: --- → FIXED
Product: NSS → CA Program
Whiteboard: [ca-compliance] → [ca-compliance] [disclosure-failure]