Closed Bug 1740493 Opened 3 years ago Closed 2 years ago

Sectigo: Failure to block disallowed LDH labels in domain names

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: martijn.katerbarg, Assigned: martijn.katerbarg)

Details

(Whiteboard: [ca-compliance] [dv-misissuance])

Attachments

(1 file)

10.07 KB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
Details
No description provided.

1. How your CA first became aware of the problem

In the course of our review of recent CA/Browser Forum ballots and our responses to them, we came to realize that we had neglected to implement a portion of Ballot SC48v2 - Domain Name and IP Address Encoding.

2. Timeline

July 2017

CA/B Forum ballot 202 is proposed, discussed, and eventually fails.

During this time Comodo implements the prohibition on U-labels in CNs based on guidance from Chrome that it viewed this prohibition as a root program requirement regardless of the status of the ballot. With its subsequent failure, Comodo did not implement other proposed stipulations in the ballot.

July 2021

CA/B Forum ballot SC48v2 is proposed, discussed, and passes voting.

The Sectigo compliance team discusses the ballot and, based on its nature as a “rerun” of ballot 202, comes to the incorrect conclusion that Comodo addressed these needs in or before 2017.

October 1

The relevant section of SC48v2 - Domain Name and IP Address Encoding goes into effect, allowing only P-Labels or Non-Reserved LDH Labels as Domain Labels.

October 5

We realize that our previous conclusion was partly incorrect, in that our system was not rejecting Reserved LDH Labels that are not P-Labels. A development ticket is created to solve this issue.

October 9

A fix is deployed making us compliant with the changes from SC48v2.

October 11

SSL Abuse receives an email reporting 9 misissued certificates.

October 12

Our initial investigation into the affected certificates identifies 11 misissued certificates, including the 9 reported on October 11.

We commence a revocation event scheduled for Saturday, October 16th at 13:00 UTC.

Let’s Encrypt posts bug 1735247.

October 16

We complete revocation of the 11 misissued certificates.

October 19

Continued investigation reveals an additional 5 misissued certificates. This completes our investigation of our corpus of certificates for this problem.

October 24

The additional 5 certificates are revoked.

3. Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem.

We have stopped issuing certificates with this problem.

4. Summary of the problematic certificates

16 certificates issued between October 1 and October 8, 2021.

5. Affected certificates

The list of affected certificates is available in attachment 9250183 [details].
There are a total of 16 unique serial numbers. The attachment has 32 entries, encompassing both precertificates and leaf certificates.

6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now

When ballot 202 was originally in discussion, Comodo implemented a partial update to address some but not all of the requirements laid out in that ballot. When ballot SC48v2 came along considerably later, we mistakenly believed that these needs had been addressed in 2017.

This was an error on the part of our current Compliance team in understanding the precise nature of the changes we made in the wake of ballot 202. The membership of our Compliance team has entirely turned over in that interval, which contributed to this problem. We speculate that the 2017 team did not imagine that a future group of different people might have reason to revisit the decisions surrounding this failed ballot, and so made no particular effort to record for posterity that Comodo had made a partial but incomplete response to that ballot.

7. List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future

We have deployed a patch to prevent issuance to prohibited domains.
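
To illustrate the nature of the fix, below is a minimal sketch, assuming Python and the third-party idna package, of the SC48v2 rule that every Domain Label must be either a Non-Reserved LDH Label or a P-Label (a valid A-label per RFC 5890). This is illustrative only and is not our deployed patch.

```python
# Illustrative sketch only, not the deployed patch. It assumes the third-party
# "idna" package and checks the SC48v2 rule that each Domain Label must be
# either a Non-Reserved LDH Label or a P-Label (a valid A-label per RFC 5890).
import re

import idna

# LDH label: letters, digits, hyphens; no leading/trailing hyphen; max 63 octets.
LDH_LABEL = re.compile(r"^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$", re.IGNORECASE)


def label_is_allowed(label: str) -> bool:
    """Return True if the label may appear in a dNSName under SC48v2."""
    if not LDH_LABEL.match(label):
        return False                          # not an LDH label at all
    if len(label) >= 4 and label[2:4] == "--":
        # Reserved LDH Label: only XN-labels that are valid A-labels (P-labels)
        # are permitted; anything else (e.g. "ab--example") must be rejected.
        if not label.lower().startswith("xn--"):
            return False
        try:
            idna.decode(label)                # raises IDNAError for invalid A-labels
            return True
        except idna.IDNAError:
            return False
    return True                               # Non-Reserved LDH Label
```

Wildcard labels ("*") are not LDH labels and would need separate handling in real issuance code.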

As explained in part 6 of this comment, this error was the product of a highly unusual set of circumstances. We don’t expect to see many old ballots reincarnated years after the fact, with an intervening period that encompasses our change from Comodo to Sectigo and the updates in personnel, process, and philosophy that accompanied it. Nonetheless, we recognize that we were unknowingly making assumptions about the nature of earlier work that did not prove to be correct. In the unlikely event that something analogous occurs in the future, we intend to scrutinize the previous decisions and actions much more closely.

Four things stand out here:

  1. This doesn’t appear to be a date-and-time-stamped sequence of events. Some entries omit even the day. Why the lack of (expected) precision? This is a question about how incident reports are created and reviewed against expectations.
  2. Given that system changes were made on 2021-10-09, why were the misissued certificates not discovered until externally reported, on 2021-10-11? This is a question about the generic processes involved in any system change, whether a process exists to review past certificates, and how long that process takes to be prioritized and completed effectively.
  3. It’s unclear what process was executed to ensure compliance with SC48v2. It appears there was no empirical confirmation until sometime around 2021-10-05. The relevance of the discussion of past staff is also unclear, because one would normally and reasonably expect all CAs to examine their actual systems to ensure compliance. While this incident omits a discussion of the normal ballot review process, it would seem, based on these facts and timelines, that the process is simply asking Compliance whether they believe Sectigo complies, as opposed to a formal process to empirically verify compliance (and ensure the necessary tests and lints). This is trying to understand what formal process exists for reviewing ballots: before adoption, during the IP review, and once effective.
  4. The issue was reported to Sectigo on 2021-10-11, but the incident report was not posted until 2021-11-10, nearly a month later. Mozilla’s policy on incidents states, “We expect to see incident reports as soon as possible, and certainly within two weeks of the initial issue report.” What processes exist to ensure this obligation is met? This is a question about Sectigo’s overall incident management handling.
Flags: needinfo?(martijn.katerbarg)
Assignee: bwilson → martijn.katerbarg
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Whiteboard: [ca-compliancr]
Whiteboard: [ca-compliancr] → [ca-compliance]

This doesn’t appear to be a date-and-time-stamped sequence of events. Some entries omit even the day. Why the lack of (expected) precision? This is a question about how incident reports are created and reviewed against expectations.

Providing too precise a timestamp is misleading for events that span greater lengths of time than that level of granularity would suggest. The entries for July 2017 and July 2021 each give a capsule summary of a series of events that took weeks to play out. Listing each individual event with a date and time stamp would make the writeup considerably more difficult to read and understand while providing no valuable additional information beyond what is easily discernible from the existing public record. We believed and still believe it would not be in service to the community to arbitrarily clutter a report with such “irrelevant details” (Bug 1714628, comment 11) when they are not contributory to the discussion at hand.

The phrase “a date-and-time-stamped sequence of all relevant events” need not stipulate that every single entry is detailed to the date-and-time level. Rather, a sensible interpretation would be that the specific timing of events should be recorded to the degree of accuracy that is relevant to what is being reported. According to Wikipedia (https://en.wikipedia.org/wiki/Timestamp):

A timestamp is a sequence of characters or encoded information identifying when a certain event occurred, usually giving date and time of day, sometimes accurate to a small fraction of a second.

Note that it says “usually”, not “always”. It also says “when...occurred” rather than “when...started” or “when...finished”; since not all events start and finish within the same second, we understand “occurred” to mean a period of time that could be short or long.

Given that system changes were made on 2021-10-09, why were the misissued certificates not discovered until externally reported, on 2021-10-11? This is a question about the generic processes involved in any system change, whether a process exists to review past certificates, and how long that process takes to be prioritized and completed effectively.

Once we became aware that our system could in fact issue certificates with these disallowed names in the Subject Alternative Name extension, we placed initial priority on resolving this. As part of this, we planned to investigate our corpus of SSL certificates on the first work day after the fix was in place, to see whether any certificates had been issued that should be acted upon.

While in some cases we may do this even before a fix is in place, the fact that a change could be made to the system within only a few days led us to the decision to generate a report shortly after it had been done. That this internal report was generated on the same day we received the external report is pure coincidence; SC48v2 had come into force only a week or so earlier, and we often find that external researchers actively go looking for non-compliance with new rules.

When Let’s Encrypt posted Bug 1735247, we realized that while our initial fix was correct, we missed these cases in the report and thus initiated a second revocation batch. The certificates in the first and second batch were all issued within the same time period.

It’s unclear what process was executed to ensure compliance with SC48v2. It appears there was no empirical confirmation until sometime around 2021-10-05. The relevance of the discussion of past staff is also unclear, because one would normally and reasonably expect all CAs to examine their actual systems to ensure compliance. While this incident omits a discussion of the normal ballot review process, it would seem, based on these facts and timelines, that the process is simply asking Compliance whether they believe Sectigo complies, as opposed to a formal process to empirically verify compliance (and ensure the necessary tests and lints). This is trying to understand what formal process exists for reviewing ballots: before adoption, during the IP review, and once effective.

As mentioned earlier, this error was the product of a highly unusual set of circumstances. While we do indeed have a formal process for reviewing ballots, we had not found ourselves in this exact situation before and so made an error in dealing with the apparent facts. We have learned from the experience and are applying a stricter “trust but verify” model to our internal practices. You have seen us talk about applying peer review to our public posts, and we have extended that practice to other parts of our business, including continued review of our compliance with the root store programs and the CA/B Forum BRs as well as with any upcoming changes.

The issue was reported to Sectigo on 2021-10-11, but the incident report was not posted until 2021-11-10, nearly a month later. Mozilla’s policy on incidents states, “We expect to see incident reports as soon as possible, and certainly within two weeks of the initial issue report.” What processes exist to ensure this obligation is met? This is a question about Sectigo’s overall incident management handling.

Our draft bug report was ready on October 26th pending final review, and was scheduled for posting on the 27th, including the list of affected certificates. At that time we were using the serial numbers of the affected certificates, as we originally stated we would in Bug 1736064, Comment #4.

Then Bug 1736064 Comment #5 appeared and a discussion started on the use of serial numbers, throwing off our plan for posting Bug 1740493. We wanted to clarify reporting expectations before proceeding. In hindsight, this was not the correct decision as it delayed posting the bug for too long.

Flags: needinfo?(martijn.katerbarg)

(In reply to Martijn Katerbarg from comment #3)

This doesn’t appear to be a date-and-time-stamped sequence of events. Some entries omit even the day. Why the lack of (expected) precision? This is a question about how incident reports are created and reviewed against expectations.

<snip>
The phrase “a date-and-time-stamped sequence of all relevant events” need not stipulate that every single entry is detailed to the date-and-time level.

Thanks for the explanation. I think examples of this degree of interpretation naturally raise concern about Sectigo's trusted CA operations, and this seems to be an ongoing trend. The interpretation that "date-and-time-stamped" means "date-and/or-time-stamped" seems a stretched view, especially in light of past discussions regarding this (and the expectations for timestamps). I appreciate that Sectigo feels they should decide what details are important in the incident reports, but that seems to highlight growing concerns with Sectigo's operational compliance, the omission of key details, and the failure to abide by long-standing requirements.

While https://wiki.mozilla.org/CA/Responding_To_An_Incident notes

Therefore, failure to follow one or more of the recommendations here is not by itself sanctionable. However, failure to do so without good reason may affect Mozilla's general opinion of the CA.

I would suggest that, within the trend of incident reports, there is a worrying pattern about how they are approached and understood, and this omission fits within that. Among other things, a key detail the failure highlights is whether or not the certificates were revoked within the 5 day period permitted, or whether Sectigo further delayed revocation within the same calendar day, but well-exceeding the time (e.g. a report in the morning, a revocation in the evening 5 days later, exceeding the BR time). The discussion about serial numbers, in Bug 1736064, Comment #4, further exemplifies this trend of misinterpretation.

To emphasize: It would be in Sectigo's best interest, given the long-standing trend of serious compliance issues, to ensure complete, total, and prompt transparency, and opting to do so in a way that strives to remove doubt or concern, rather than doing the absolute minimum (or less than) required.

Given that system changes were made on 2021-10-09, why were the misissued certificates not discovered until externally reported, on 2021-10-11? This is a question about the generic processes involved in any system change, whether a process exists to review past certificates, and how long that process takes to be prioritized and completed effectively.

<SNIP>
That this internal report was generated on the same day we received the external report is pure coincidence; SC48v2 had come into force only a week or so earlier, and we often find that external researchers actively go looking for non-compliance with new rules.

What this seems to suggest is that compliance changes, and past evaluation, are not done until after a ballot is effective, rather than taking the opportunity for discussion - and the extended IP review period - to take steps to ensure compliance and remediation proactively. This speaks to a concerning approach to compliance, because while this establishes the factual timeframe, it doesn't help identify the decision making in this incident, or the standard policies in play, for ensuring compliance.

For example, based on the response presented in this bug, it seems natural to conclude that nothing has changed at Sectigo that would prevent similar issues in future ballots from being reported externally, or from incomplete detection and remediation. As an example of this latter point, this report notes that Sectigo's initial scan of its corpus was incorrect, and this was not noticed until Bug 1735247, but notably fails to identify why the original scan was incorrect, what the process and procedures were to develop that report, and how those processes and procedures have been changed to reduce the risk of future misinterpretation of requirements.

Given the remark about timestamps above, and the overall delay in this incident (a pattern for Sectigo, shown by Bug 1721271 and Bug 1741026, with multiple remarks about expectations), it seems reasonable to believe that Sectigo compliance will continue to miss relevant requirements and to rely on external supervision and reporting to operate in the expected way.

As mentioned earlier, this error was the product of a highly unusual set of circumstances.

It's been mentioned, but nothing about this incident seems unusual or unexpected. "We thought we were compliant the last time it was discussed, so we didn't look closely this time when it was discussed" suggests that there's a process failure to look closely at ballots, which is supported by the evidence. There's no substantive discussion about what the process is for ballot review, or how or why we should be confident about the results going forward.

Then Bug 1736064 Comment #5 appeared and a discussion started on the use of serial numbers, throwing off our plan for posting Bug 1740493. We wanted to clarify reporting expectations before proceeding. In hindsight, this was not the correct decision as it delayed posting the bug for too long.

This is why the timestamp response is so concerning: it fits a pattern of "this was not the correct decision". This incident is the result of not making the "correct decision" with respect to ballot review. Then this report was delayed, despite previous clarifications in Bug 1721271, Comment #1, where Bug 1721271 was itself the result of "not making the correct decision" on reporting. Then, in Bug 1741026, Sectigo again did not make the "correct decision" and delayed reporting.

I mention these not to try to beat up the individuals responsible, but to highlight that there's a clear pattern where the decision making process implemented by Sectigo is not resulting in correct decisions or interpretations. There is a noticeable lack of detail on that process, and these incident reports seem to treat each incident as highly exceptional. While it may truly be that they're exceptional, despite the lack of evidence for that, it still reveals that there is a flawed decision making process. The goal is not to get Sectigo to admit they made wrong decisions, but to understand what's changing to ensure future decisions are correct.

This level of detail is necessary, in the spirit of the remarks at https://wiki.mozilla.org/CA/Responding_To_An_Incident that:

Mozilla expects that the incident reports provide sufficient detail about the root cause, and the remediation, that would allow other CAs or members of the public to implement an equivalent solution.

I do not believe this incident report meets that bar, either in the details of the technical issues at play or in the process issues that allowed those technical issues to manifest in the first place.

Flags: needinfo?(martijn.katerbarg)

(In reply to Ryan Sleevi from comment #4)
Ryan, this comment is to acknowledge your comment 4 and inform the community that we are working on a detailed response. We simultaneously have been researching and preparing detailed comments for bug 1741026, bug 1736064, and bug 1741777, which we posted today.

(In reply to Ryan Sleevi from comment #4)

The interpretation that "date-and-time-stamped" means "date-and/or-time-stamped" seems a stretched view, especially in light of past discussions regarding this (and the expectations for timestamps).

a key detail the failure highlights is whether or not the certificates were revoked within the 5 day period permitted, or whether Sectigo further delayed revocation within the same calendar day, but well-exceeding the time (e.g. a report in the morning, a revocation in the evening 5 days later, exceeding the BR time).

While re-evaluating our initial response based on your comments, we do see the error of our ways regarding the times at which the report and revocation steps were executed. So please allow me to set the record straight on these items in the timeline:

October 11 - 21:00 UTC
SSL Abuse receives an email reporting 9 misissued certificates.

October 12 - 00:29 UTC
Let’s Encrypt posts bug 1735247.

October 12 - 13:13 UTC
Our initial investigation into the affected certificates identifies 11 misissued certificates, including the 9 reported on October 11.
We commence a revocation event scheduled for Saturday, October 16th at 13:00 UTC.

October 16 - 13:11 UTC
We complete revocation of the 11 misissued certificates.

October 19 - 16:58 UTC
Continued investigation reveals an additional 5 misissued certificates. This completes our investigation of our corpus of certificates for this problem.

October 24 - 13:14 UTC
The additional 5 certificates are revoked.

As for us moving forward on any potential new bugs, and in light of the community's wish for us to be more precise with our timelines, we are as of now actively keeping track of the exact timeline for any new issue or report we receive, or any issue we identify, that appears to have the potential to result in a Bugzilla incident. All members of the WebPKI Incident Response (WIR) team have access to this through Jira and are able to update it once their assigned items have been completed. Our process includes adding a “reporting deadline” two weeks after we initially discover misissued certificates or otherwise become aware that we need to post a bug report. Moreover, each incident is owned by a specific individual who is responsible for following up on these items. We have also set up reminders so that both the WIR team leader and the incident owner are notified 72 hours before this deadline arrives.
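
As a trivial illustration of the deadline arithmetic described above (a sketch only, not our Jira automation; the function and variable names are hypothetical):

```python
# A trivial illustration (not Sectigo's Jira automation) of the deadlines
# described above: a reporting deadline two weeks after discovery and a
# reminder 72 hours before that deadline, both in UTC.
from datetime import datetime, timedelta, timezone


def reporting_milestones(discovered_at: datetime) -> tuple[datetime, datetime]:
    """Return (reporting_deadline, reminder_at) for a newly discovered issue."""
    deadline = discovered_at + timedelta(days=14)   # "within two weeks" per Mozilla policy
    reminder = deadline - timedelta(hours=72)       # notify WIR lead and incident owner
    return deadline, reminder


# Example: an issue discovered 2021-10-12 13:13 UTC yields a reporting deadline
# of 2021-10-26 13:13 UTC and a reminder at 2021-10-23 13:13 UTC.
deadline, reminder = reporting_milestones(datetime(2021, 10, 12, 13, 13, tzinfo=timezone.utc))
```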

We continue to believe that exact timestamps do not always add value to every reported event, depending on the nature of the event and issue. To confirm our impression that this has been the accepted practice on Bugzilla, we reviewed all CA Compliance bugs that have been updated any time this calendar year. Of those containing timelines, 52% include at least one item that is not timestamped down to the minute. In all but two cases there was no question or objection raised about the level of specificity of these timelines. Some of them – such as bug 1716843, bug 1716902 and bug 1711432 – contain one or more entries of a month and year with no date included.

This indicates that matching the level of precision in time reporting to the level of precision important to the events reported is normative behavior on Bugzilla that the community has accepted. We would hope that reports maintain at least the level of specificity relevant to the events reported, and we fully support the expectation that timeline reporting matches the level of precision necessary for understanding the key issues surrounding the incident. We agree that our report of certificate discovery and revocation in comment 1 did not match that level of precision.

We note that in response to bug 1736064 Mozilla has started a discussion thread on m.d.s.p about practices for reporting certificates. We wonder if it makes sense to have a similar discussion about the specific expectations for time stamping as it relates to bug reporting and timelines.

To make sure our Bugzilla posts meet community expectations, we are in the process of creating an internal checklist for posts to support our ongoing peer-review process.

To emphasize: It would be in Sectigo's best interest, given the long-standing trend of serious compliance issues, to ensure complete, total, and prompt transparency, and opting to do so in a way that strives to remove doubt or concern, rather than doing the absolute minimum (or less than) required.

We fully agree, as we noted recently in bug 1736064 comment 15.

What this seems to suggest is that compliance changes, and past evaluation, are not done until after a ballot is effective, rather than taking the opportunity for discussion - and the extended IP review period - to take steps to ensure compliance and remediation proactively. This speaks to a concerning approach to compliance, because while this establishes the factual timeframe, it doesn't help identify the decision making in this incident, or the standard policies in play, for ensuring compliance.

We have not in recent years had another example of this situation occur. Our standard process is that there is a single member of the compliance team who owns the task of following proposed ballots and calling out their salient points for the greater group. We use these summaries in our working meetings to identify changes we need to make and create tickets for these changes. In this case we misidentified the ballot as a non-issue in that original review due to the factors explained in comment 1. That was the root of the problem. Therefore we did not ticket any changes and so did not write, QA, or deploy code updates.

To make sure we stay on top of upcoming CA/B Forum changes, we added an additional representative (yours truly) to the CA/B Forum last month and have thereby increased our engagement there. Our goal is to have our voice heard more over the coming months than it has been over the last year. We made this decision to help ensure we have full visibility into the compliance and technical implications of ongoing discussions and ballots.

This report notes that Sectigo's initial scan of its corpus was incorrect, and this was not noticed until Bug 1735247, but notably fails to identify why the original scan was incorrect, what the process and procedures were to develop that report, and how those processes and procedures have been changed to reduce the risk of future misinterpretation of requirements.

In this case the author of the original script made an error, and another team member failed to spot the error during peer review. (Note that writing the query was independent of writing the actual code fix, as the two were done by entirely different people; the code fix was correct from the outset.) As authoring and reviewing these queries is a skill set only a few members of the WebPKI Incident Response team possess, we don’t have the benefit of many eyes, as we do for other parts of our compliance activities. Reading bug 1735247 prompted the original query author to go back and examine his own query, which led to the discovery of our error there. We immediately updated the query and found the additional certificates.

The query process is pretty simple:

  • One member writes the query.
  • Another reviews it.
  • We run the query; sometimes we know we will want to run a query on a specific date in the future and will prepare it ahead of time (as with the certificates reported in bug 1736064 comment 16, for instance), although that was not the case here.
  • Depending on the nature of what we’re investigating, the results may be “suspects” that need further investigation to determine if they are misissued or not.
  • Depending on the skill set, level of access of the member, and the nature of what we’re investigating, the query will target crt.sh, Censys, or our internal CA database. Sometimes CT log aggregators can be expected to be aware of all possible cases of misissuance relating to what we’re investigating; sometimes this doesn’t actually matter because the query is intended to be exploratory rather than definitive; and sometimes there’s no substitute for querying our internal CA database.

Occasionally these queries are themselves a learning process, as occurred for example in our investigation of DCV behavior for bug 1718771. In general, however, they are straightforward. We prefer to keep our querying process simple so that we can rapidly identify and act upon affected certificates. The performance of the last year and a half indicates that this process is highly reliable, with many queries run and exactly one error seen to date. We will give some thought to how to update this process to protect against such errors. We are open to ideas from other CAs or the broader Bugzilla community in that regard.
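
For illustration, the following is a hypothetical corpus-scan sketch in Python, not our actual query tooling: it walks a directory of PEM-encoded certificates and flags any SAN dNSName containing a Reserved LDH Label that is not a valid P-Label, using the cryptography and idna packages. In practice the same label check can be pointed at crt.sh, Censys, or an internal CA database, as described above.

```python
# Hypothetical corpus-scan sketch, not Sectigo's actual query tooling. It flags
# certificates whose SAN dNSNames contain a Reserved LDH Label that is not a
# valid P-Label, using the "cryptography" and "idna" packages.
from pathlib import Path

import idna
from cryptography import x509
from cryptography.x509.oid import ExtensionOID


def label_is_suspect(label: str) -> bool:
    """True if the label is a Reserved LDH Label but not a valid P-Label."""
    if len(label) < 4 or label[2:4] != "--":
        return False                     # not a Reserved LDH Label
    if not label.lower().startswith("xn--"):
        return True                      # e.g. "r3--example": reserved, not XN
    try:
        idna.decode(label)               # valid A-label, i.e. a P-Label
        return False
    except idna.IDNAError:
        return True


def suspect_dns_names(cert: x509.Certificate) -> list[str]:
    """Return SAN dNSNames in the certificate that contain a suspect label."""
    try:
        san = cert.extensions.get_extension_for_oid(
            ExtensionOID.SUBJECT_ALTERNATIVE_NAME
        ).value
    except x509.ExtensionNotFound:
        return []
    return [
        name
        for name in san.get_values_for_type(x509.DNSName)
        if any(label_is_suspect(label) for label in name.split("."))
    ]


def scan_corpus(pem_dir: str) -> None:
    """Print each PEM certificate in pem_dir that has at least one suspect name."""
    for pem_path in sorted(Path(pem_dir).glob("*.pem")):
        cert = x509.load_pem_x509_certificate(pem_path.read_bytes())
        suspects = suspect_dns_names(cert)
        if suspects:
            print(f"{pem_path.name}: {suspects}")
```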

The goal is not to get Sectigo to admit they made wrong decisions, but to understand what's changing to ensure future decisions are correct.

Let us point out that a year and a half ago the company had no formal WebPKI Incident Response process and we were barely forming our WIR group. Though we benefitted from the passionate activity of a few experienced individuals, we suffered from its ad hoc nature. In the intervening time we have been learning from experience, adding to our team and establishing procedures in response to friction points we observe internally, activity on Bugzilla and m.d.s.p, and, unfortunately, our own errors. While we regret these errors, we use them to adjust our actions for the sake of continuous improvement.

While it would be excellent to anticipate and account for all “toe breakers” in advance of stubbing our toes on them, it’s an unfortunate fact that too often people learn how to improve the hard way, after something goes awry and they can formulate an idea for what to do about it. If in the world of Bugzilla, two years ago Sectigo was a D student, then today we’re a B student on a trajectory to becoming an A student. That is where we seek to be and learning through hard experience is part of the way we will get there.

Flags: needinfo?(martijn.katerbarg)

We have no further updates for this bug but are monitoring it for any questions or comments from the community.

It seems there are no further questions or comments and we would like to propose closure of this bug.

Flags: needinfo?(bwilson)

I'll close this bug on Friday, 17-Dec-2021, unless there are any objections.

Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Flags: needinfo?(bwilson)
Resolution: --- → FIXED
Product: NSS → CA Program
Whiteboard: [ca-compliance] → [ca-compliance] [dv-misissuance]