Open Bug 1551363 Opened 3 months ago Updated Yesterday

DigiCert: "Some-State" in stateOrProvinceName

Categories

(NSS :: CA Certificate Compliance, task)

Tracking

(Not tracked)

ASSIGNED

People

(Reporter: wayne, Assigned: brenda.bernal, NeedInfo)

Details

(Whiteboard: [ca-compliance] Next Update - 01-September 2019)

A list of DigiCert certificates containing a stateOrProvinceName of "Some-State" was published at https://misissued.com/batch/53/. This is apparently the default value that OpenSSL places in CSRs, indicating that this field was not validated. BR section 7.1.4.2.2(f) states: "If present, the subject:stateOrProvinceName field MUST contain the Subject’s state or province information as verified under Section 3.2.2.1." The EVGLs reference the BRs.

Please provide an incident report, as described at https://wiki.mozilla.org/CA/Responding_To_An_Incident#Incident_Report

Summary: DigiCert: → DigiCert: "Some-State" in stateOrProvinceName

Hi Wayne, We are working on the incident report and will post shortly.

Here is the incident report requested:

Incident Report – Mozilla Policy Violation

1.How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.

On May 11, 2019, DigiCert was informed via an MDSP discussion (subject: Certificates with subject stateOrProvinceName "Some-State") that certificates had been issued with a stateOrProvinceName of "Some-State". The value indicates that the State/Province field was not validated, which is a Baseline Requirements violation.

2.A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.

May 11, 2019 – DigiCert was informed via an MDSP discussion of certs we issued with the “Some-State” value. Investigation began and concluded that 8 certs required revocation.

May 13, 2019 – DigiCert was informed again via this bug that certificates were identified with a “some-state” value in the stateOrProvinceName field of the certificate record.

May 13, 2019 – DigiCert added “Some-State” and “some state” as flags for order processing. This will require the standard first and second checks as well as built in CA blockers and a required manager review.

May 15, 2019 – All identified problem certs were revoked.

3.Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem. A statement that you have will be considered a pledge to the community; a statement that you have not requires an explanation.

Our CA has stopped issuing these certificates. Please see below for our get well plan (section 7). All certificates identified with invalid values related to this incident have now been revoked and blockers have been added at a CA level to prevent additional signing of certificates with these invalid values.

4.A summary of the problematic certificates. For each problem: number of certs, and the date the first and last certs with that problem were issued.

8 certificates were identified (see below).

The first cert was issued: May 31, 2016
The last cert was issued: April 16, 2017

5.The complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem.

https://crt.sh/?id=26343383
https://crt.sh/?id=77643000
https://crt.sh/?id=42275918
https://crt.sh/?id=132635439
https://crt.sh/?id=83944720
https://crt.sh/?id=56030857
https://crt.sh/?id=81411368
https://crt.sh/?id=122821382

6.Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

The root issue was a combination of the auto-populator inserting incorrect data into the certificate request and improper identification of the invalid values during our verification process. While reviews were in place, unfortunately these values were missed.

7.List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.

Following our incident in 2017, we added a city and state field checking system that flags potentially incorrect city and state field values. When a field is flagged, Validation staff must perform an additional review and checking process. Additionally, on 13-May-2019, we added a blacklist check for the terms “default”, “city”, and “some”; a match triggers an additional review that must be cleared before the certificate is allowed to issue.
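As an illustration only, a term-based review flag like the one described above could be sketched as follows. The function and constant names are hypothetical and this is not DigiCert's implementation; the two exact-match placeholders are the defaults found in OpenSSL's sample configuration, and the substring terms are the ones named in the report.

```python
import re
from typing import List

# Placeholder values that ship as defaults in OpenSSL's sample openssl.cnf,
# stored in normalized form (see normalize below).
OPENSSL_DEFAULTS = {"some state", "internet widgits pty ltd"}

# Terms named in the report. A match triggers manual review, not rejection.
FLAG_TERMS = ("default", "city", "some")


def normalize(value: str) -> str:
    """Lower-case and collapse hyphens, underscores and whitespace to one space."""
    return re.sub(r"[-_\s]+", " ", value).strip().lower()


def review_flags(field_value: str) -> List[str]:
    """Return the reasons (possibly none) why a subject field needs review."""
    norm = normalize(field_value)
    flags = []
    if norm in OPENSSL_DEFAULTS:
        flags.append("openssl-default")
    for term in FLAG_TERMS:
        if term in norm:  # plain substring match, so false positives are expected
            flags.append(f"term:{term}")
    return flags


print(review_flags("Some-State"))  # flagged as an OpenSSL default
print(review_flags("Somerset"))    # legitimate value, still flagged for review
```

Substring matching catches legitimate values such as "Somerset" or "Oklahoma City", which is why a check of this kind can only route an order to review rather than block it outright.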

May 16, 2019 - We completed running scans to search for any more instances of some-state across our cert database and are reviewing the results. We will post an update once we have completed our analysis.

Additionally, we have scheduled additional training on soft-blocks, flags and triggers, to be completed no later than the end of May 2019, to prevent this issue in the future.

I am concerned that this approach to response does not adequately evaluate the underlying root cause, or look at systemic ways to improve. In particular, something this "obvious" calls into question the overall set of validation activities, and thus it remains unclear to what extent stateOrProvinceName was and has been adequately verified in all existing certificates.

Concerns I have with this incident report:

  • The timeline provided for question 2 does not include the facts and information provided in question 7. From a good incident report, it would be useful to ensure that all the information relevant to understanding the issue is adequately and chronologically presented and explained.
  • The scan on May 16, 2019 clearly indicates that the scan has only been for "some-state", while a more appropriate response would be to examine for any address information that is inconsistent, to the best extent possible:
    • Invalid countries
    • Invalid stateOrProvince
    • Invalid localities
  • The approach described for how the information makes it into the system clearly shows that it was derived from a CSR, rather than entered by hand. This should call into question how other information is ingested from the CSR.
    • For example, a holistic review of the CSR ingestion pipeline would be appropriate. Especially concerning would be for possible encoding issues with string types (e.g. if a user uses T61String, how does your system handle it), RDNs (for example, if there are multiple values within a SET field, how are they extracted), and other similar checks
  • The approach described for how this information is validated still leaves significant room for human error. Changes to the validation process, such as requiring two-person review, or reviewing the UI/UX and other signifiers, would go a long way.
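To make the CSR ingestion concerns above concrete, here is a minimal sketch of the kind of structural checks a pipeline might run on a parsed subject name. It assumes the CSR has already been parsed by some ASN.1 library into plain tuples; the data model, the function name, and the policy of flagging anything other than PrintableString or UTF8String are all assumptions for illustration.

```python
from typing import List, Tuple

# One attribute inside an RDN: (attribute type, ASN.1 string type, decoded value).
Attribute = Tuple[str, str, str]
# An RDN is a SET of attributes; a subject name is a SEQUENCE of RDNs.
Subject = List[List[Attribute]]

# Assumed policy: only these string types pass without manual review.
EXPECTED_STRING_TYPES = {"PrintableString", "UTF8String"}


def ingestion_findings(subject: Subject) -> List[str]:
    """List structural oddities in a parsed subject that deserve review."""
    findings = []
    seen_types = set()
    for index, rdn in enumerate(subject):
        if len(rdn) > 1:
            findings.append(
                f"rdn[{index}]: multi-valued RDN with {len(rdn)} attributes"
            )
        for attr_type, string_type, _value in rdn:
            if string_type not in EXPECTED_STRING_TYPES:
                findings.append(
                    f"{attr_type}: unexpected string type {string_type}"
                )
            if attr_type in seen_types:
                # Some types (e.g. OU) may legitimately repeat; this sketch
                # flags every repeat and leaves the decision to a reviewer.
                findings.append(f"{attr_type}: attribute type appears more than once")
            seen_types.add(attr_type)
    return findings


subject = [
    [("C", "PrintableString", "US")],
    [("ST", "T61String", "Utah"), ("L", "UTF8String", "Lehi")],
    [("ST", "UTF8String", "Some-State")],
]
for finding in ingestion_findings(subject):
    print(finding)
```

The point of the sketch is that a system which silently takes "the first ST value" from a CSR can end up signing something different from what a validation agent reviewed.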

With respect to the proposed next steps, such as additional training, please provide more detail about what this training entails. It is not reasonable, based on the description of the issue, to assume repeating the same training will produce different results. If it is believed to be the case, then more systemic changes to the training regime seem appropriate: For example, if 'spot training' on this issue is believed to fix it, then naturally this is a suggestion that more frequent training is necessary. Changing the requirements to require retraining every 6 months, instead of every year, shows a systemic approach. Hiring outside expertise to review training materials, to make sure the materials themselves lead to correct results, is another means.

In short, I do not believe it is acceptable to suggest this is human error, to be corrected by training, and to restrict-less certain phrases. Please also review the comments on other bugs, such as Bug 1551369, Bug 1551364, or Bug 1548714, to see similar remarks.

Flags: needinfo?(brenda.bernal)

Hi Ryan, Thank you for your feedback. As of today (17-May-2019), we have completed our scans for "some-state" and found no active certs apart from what was reported. We are reinforcing the training on this issue to ensure there is no repeat occurrence. With that said, we are continuing our scans to find any other errors with the locality, country and stateOrProvince fields, and will report back with our results, along with a response to your overall commentary.

Flags: needinfo?(brenda.bernal)

Hi Ryan, To address your bullet point 2 above, we are still working on analysis of a comprehensive data extract, and will report back as soon as we get more definitive results. We have a challenge with a large data set in review, with false positives and variations to weed through. We will post an update as soon as we complete the analysis, and will also address the additional points raised above about validation practices and re-training. On a side note, we do have two-person review, but clearly some things were missed with the some-state issuances.

Brenda: Thanks for the update. I'm hoping you can provide more details about the false positives and variations, to help understand the delay. Can you provide an ETA for an update about when that review is expected to be completed, or when more meaningful details can be provided?

I'm trying to make sure we have regular, weekly updates, which provide new information and insight. If there are delays, understanding progress, milestones, and successful completion is equally valuable. The worry with such delays is, unfortunately, that the CA may have identified more impacted certificates and is delaying disclosing these, in order to delay revocation requirements (as observed externally). To combat that impression, the regular updates, the methodology being used, and the challenges faced all help bridge and address that concern.

Flags: needinfo?(brenda.bernal)

Ryan, Our target for completion of this analysis is by no later than June 21st. I expect to provide an update as we have regular meetings between now and that date on progress. We do not plan to notify and revoke in bulk at the end of this analysis but expect to do so (if necessary) in a check-in we have next week, and the week after.

The entire population set was matched up against the ISO list for verified state/province and countries. The global team is now reviewing the items that have fallen out of that comparison. There are variations to review like special characters and abbreviations that need to be inspected to rule out false-positives. One other point is we haven't started on reviewing the locality/city-equivalent as there is not a definitive source for that information. We plan to tackle this in a wave 2 of this effort (post June 21st). If you have recommendations on good sources that we can leverage, we are open to suggestions.

We are definitely planning more automated controls and will update in this report once we get that more concrete with timelines.

Flags: needinfo?(brenda.bernal)

Other CAs have made use of the GeoNames set. An entry may be on that list that is not a valid city equivalent, but absence, given the comprehensiveness of that set, should be concerning.

In terms of other steps, developing a plan for how DigiCert will validate and maintain information seems useful. For example, DigiCert could develop a plan for every country it issues for, in order to ensure the quality and accuracy of the data and to canonically verify that it is current and accurate.

It would be beneficial to understand the false positives being encountered, and their cause, as part of the incident report, in order to allow broader feedback from the community and opportunities and suggestions for improvement.

I’m hoping future updates may share concrete numbers - how many certificates matched or did not match, how many are undergoing manual review, buckets for false positive cause, etc. This information not only addresses concerns regarding transparency, but can lead to systemic improvements for the ecosystem.

With respect to revocation, 4.9.1.1 of the BRs requires revocation for such reasons, so it wasn’t clear what the plan was there.

Ryan, Thanks for the feedback and here is our progress update. A data extract of active certificates was run to match against the ISO list as previously stated. About 60K certs fell out of that matching process and required manual verification. Our teams have completed about 11% of the checking, which so far has resulted in 3,469 certificates confirmed as needing to be revoked and reissued. The customers for these certs will be notified today with a revocation timeline of 5 days, as per 4.9.1.1, except for 39 code signing certificates, which will be revoked/re-issued in accordance with the CS guidelines.

As for examples of types of errors we are seeing, we have seen a mismatch of state/province and country (e.g. Hong Kong, Hong Kong, and Belgium, Hong Kong, respectively in the state, country fields). These are the types that are put in the "needs correction/revocation" batch.

Some of the false positives involve valid abbreviations (e.g. NSW for New South Wales) and we have perfectly valid encoded values in the cert (e.g. Cyrillic characters) that appear incorrect when exported to CSV or something less aware of full encoding.
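The two false-positive causes mentioned above can be reproduced in a few lines. This is a sketch under stated assumptions: the alias and subdivision tables are tiny illustrative samples rather than real ISO 3166-2 data, the function names are hypothetical, and the Latin-1 misreading is just one plausible way an export tool could garble a correctly encoded value.

```python
# Illustrative samples only; a real table would be built from ISO 3166-2 data.
SUBDIVISION_ALIASES = {("AU", "NSW"): "New South Wales"}
KNOWN_SUBDIVISIONS = {("AU", "New South Wales"), ("RU", "Москва")}


def is_known_subdivision(country: str, state: str) -> bool:
    """Look up a state/province, expanding known abbreviations first."""
    state = SUBDIVISION_ALIASES.get((country, state), state)
    return (country, state) in KNOWN_SUBDIVISIONS


def as_seen_by_latin1_tool(value: str) -> str:
    """Simulate a tool that reads the certificate's UTF-8 bytes as Latin-1."""
    return value.encode("utf-8").decode("latin-1")


# A valid abbreviation matches once aliases are applied.
print(is_known_subdivision("AU", "NSW"))

# A valid Cyrillic value fails the lookup after a lossy export, even though
# the value in the certificate itself is correct.
garbled = as_seen_by_latin1_tool("Москва")
print(is_known_subdivision("RU", "Москва"), is_known_subdivision("RU", garbled))
```

Running the comparison against the decoded certificate fields, rather than against a CSV export, avoids the second class of false positive entirely.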

We will post the crt.sh links for the affected certs as soon as we get them. Our next update on this incident will be next Wed (June 19). We will also post more details about system-based controls that we plan to implement at next week's updates. In the meantime, let us know if you have further questions or comments.

Thanks Brenda. This is an excellent incident report update and super-helpful to understand how things are progressing and the issues being encountered, and is a model for all CAs to follow in terms of detail and thoroughness.

Whiteboard: [ca-compliance] → [ca-compliance] Next Update - 19-June 2019

Status update on our progress:

  • We have conducted a final due diligence on the first batch list and are revoking those certs on Friday, 21-June-2019 by 5pm MT. Total number: 1,274 active certificates impacted.
  • We have conducted a review and completed due diligence check on the next batch of certificates and have identified 1,004 impacted active certs, which will be revoked on Monday, 24-June-2019.
  • The final batch is going through review and due diligence check; we will provide an update on our findings by Monday, 24-June-2019.
  • Note that we have identified code-signing certs in each confirmed batch so far, and those will be revoked according to CS guidelines.
Whiteboard: [ca-compliance] Next Update - 19-June 2019 → [ca-compliance] Next Update - 24-June 2019

Thanks for the update, Brenda.

In line with Comment #9, after the immediate remediation efforts are complete, it would be good to (broadly) categorize the type of errors. You can see how this is done at a much more detailed level at https://bugzilla.mozilla.org/show_bug.cgi?id=1551374#c8

The point of this exercise is to understand more specifically the types of challenges CAs face, and thus hopefully collaboratively develop best practices for the industry. There's an opportunity here by being transparent about the nature of certificates and their issues.

With respect to the 2278 certs (and the final batch) - is that the 11% mentioned in Comment #9, or is that the total set?

Flags: needinfo?(brenda.bernal)

Hi Ryan,
Here are the categories of errors that we found in our investigation of this issue.
(Note: Due to the volume of certificates affected, below are examples by problem category):

Country in State/Province field | Total: 1717
Examples: Italy, Slovakia, Switzerland

Country/state mismatch | Total: 24
Examples: Berkshire, DE; Georgia, GB

Not a valid state/province (city/town) | Total: 271
Examples: Minneapolis, US; Paris la Defense, FR; San Diego

Not valid country code | Total: 7
Examples: AN, XK

State/Province is a non-location value | Total: 113
Examples: N/A, NA, None, Other

State/Province contains a number | Total: 174
Examples: 0,1,2,3,4,5,10, 3000, 3600…

Grand Total 2,306
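For readers who want to see how records might be sorted into the buckets above, here is a rough sketch. The lookup tables are small hypothetical samples, the order of the checks is an assumption, and none of this is DigiCert's actual tooling. The last lines simply confirm that the reported per-category totals add up to the stated grand total.

```python
from typing import Optional

# Tiny illustrative samples; real checks would use full ISO 3166-1/-2 data.
VALID_COUNTRY_CODES = {"US", "GB", "DE", "FR", "IT"}
COUNTRY_NAMES = {"italy", "slovakia", "switzerland"}
SUBDIVISIONS = {"US": {"Georgia", "Minnesota"}, "DE": {"Bayern"}}
NON_LOCATION_VALUES = {"n/a", "na", "none", "other"}


def classify(state: str, country: str) -> Optional[str]:
    """Return the report's error category for a record, or None if it passes."""
    if country not in VALID_COUNTRY_CODES:
        return "Not valid country code"
    if state.strip().lower() in NON_LOCATION_VALUES:
        return "State/Province is a non-location value"
    if any(ch.isdigit() for ch in state):
        return "State/Province contains a number"
    if state.strip().lower() in COUNTRY_NAMES:
        return "Country in State/Province field"
    if state in SUBDIVISIONS.get(country, set()):
        return None
    if any(state in names for names in SUBDIVISIONS.values()):
        return "Country/state mismatch"
    return "Not a valid state/province (city/town)"


# Per-category totals as reported in this comment.
REPORTED_TOTALS = {
    "Country in State/Province field": 1717,
    "Country/state mismatch": 24,
    "Not a valid state/province (city/town)": 271,
    "Not valid country code": 7,
    "State/Province is a non-location value": 113,
    "State/Province contains a number": 174,
}
print(classify("Georgia", "GB"))
print(sum(REPORTED_TOTALS.values()))
```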

The total number of certs impacted is 2,306. Customers were notified to re-issue, with revocation of the last batch planned by this Saturday, June 29th, for serverAuth certs; code signing certs are handled separately as per CS guidelines. We will attach a file with the crt.sh links for all certs to this bug once that information is available.

Additionally, here are the different types of controls we have implemented or are in the process of implementing as part of the overall remediation plan:
1.      Validation -
Compliance is working with Validation on a revised training around location values to help ensure proper attention and training is given to the subject. The training will be focused on identifying real location values, including the use of ISO 3166-1 and ISO 3166-2. Retraining will involve all Validation staff and will require passing a test. As new policies and standards evolve on this subject, we will refresh the content and train the teams accordingly. This part of the plan will be completed over the next 30 days. We have already begun to notify the Validation team of the proper validation methodology for location values, as stated in our incident report above. The comprehensive training with required exam with passing score is the next focus.

2.      Systematic controls -
We will have an API integration with an ISO source for country codes; state / locality checks will make use of a QIIS. We will check to make sure there is a valid country code and state or province from these sources. We will also confirm that the state or province associates with the country properly. We will check if the locality is valid from our source and that it associates with the proper state or province. Anything that falls out of the matching logic mentioned will be flagged and not allowed to issue until corrected. This part of the plan is being worked on now, and will be implemented in the next 45-60 days.
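The matching logic described above could look roughly like the following pre-issuance gate. This is a sketch, not the planned implementation: the nested table stands in for whatever ISO source and QIIS are actually used, its contents are sample values, and the function name is hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical stand-in for the ISO / QIIS data: country -> state -> localities.
GAZETTEER: Dict[str, Dict[str, Set[str]]] = {
    "US": {
        "Utah": {"Lehi", "Salt Lake City"},
        "Minnesota": {"Minneapolis"},
    },
}


def issuance_blockers(country: str, state: str, locality: str) -> List[str]:
    """Check country, then state under country, then locality under state.

    An empty list means the location values matched at every level.
    """
    states = GAZETTEER.get(country)
    if states is None:
        return [f"unknown country code: {country!r}"]
    localities = states.get(state)
    if localities is None:
        return [f"state/province {state!r} not found under {country!r}"]
    if locality not in localities:
        return [f"locality {locality!r} not found under {state!r}, {country!r}"]
    return []


for subject in [("US", "Utah", "Lehi"), ("US", "Some-State", "Lehi")]:
    blockers = issuance_blockers(*subject)
    print(subject, "may issue" if not blockers else f"blocked: {blockers}")
```

Checking each level against its parent is what catches values that are individually valid but inconsistent with each other, such as a real city listed under the wrong state.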

Flags: needinfo?(brenda.bernal)

Did the revocation scheduled on 2019-06-29 (per Comment #13) occur?

Is it correct that:

  • By 2019-07-31, Phase 1, which is improved training, will be conducted
  • By 2019-08-31, Phase 2, comprehensive systemic controls, will be implemented?
Flags: needinfo?(brenda.bernal)

Wayne: An extra set of eyes on this one, to make sure I haven't missed any questions you feel are relevant.

Flags: needinfo?(wthayer)

We had an initial kick-off meeting with the validation staff to talk about training and train the trainers. During that meeting we talked about where things are going wrong and how to better red flag items that may be missed.

The product-side of the equation is currently under development. They should be ready sooner than August 31. We can keep that as a deadline though since I don't have an updated timeline. Basically, we are tying into a better API to confirm the address is real during ordering. It potentially breaks some of the legacy Symantec systems that expect to sign exactly what is passed, which is why we have the longer timeline for implementation (to make those better check the state/country combination). By August 31, we're just going to turn it on, making those systems even more brittle.

Ryan - Per your question on Comment 14, yes the revocation scheduled on 2019-06-29 occurred.

Flags: needinfo?(brenda.bernal)
Flags: needinfo?(wthayer)
Whiteboard: [ca-compliance] Next Update - 24-June 2019 → [ca-compliance] Next Update - 01-September 2019

An update on the completion of Validation training as noted in Comment 13: All Validation members completed the training on locality vetting, which had an exam component. Agents who took the training and the requisite exam have all passed.

We are on target with the August 31st remediation with the systematic fix. We will post an update for the close out around the end of this month.

Are there any thoughts why, in light of the training mentioned in Comment #16, a certificate would be reviewed by two different CA employees and they fail to detect https://crt.sh/?id=1733359144 , an EV certificate?

This seriously calls into question DigiCert’s ability to remediate this issue, and more generally, to validate certificate information.

Flags: needinfo?(brenda.bernal)

Well the remediation isn't complete yet, right? It's still pending (Aug 31 for the technical control). And this is a slightly different problem - JOI is not the same as location information. The training was on the location, not JOI. I know that sounds lame, but they are two separate areas of the system. In this case the system prompted them to enter a city or state of incorporation. County was closest to state, so that's where they put it. Obviously it should have been locality or state, with locality covering both city and county. We've changed the terminology, added an explanation, and are making sure to include the JOI fields in the technical controls.

JOI wasn't originally included in the scope of the technical controls... since we didn't think of it (just addresses). I'm looking to see what we can do to get it controlled as well.

This will also cover the mis-spellings that Brenda should be filing soon. Considering they have the same remediation, do you want a new incident report, or should we file them here? We're planning on running the full scan on JOI to see if there is anything interesting going on in JOIs.

Jeremy: This has to be the most disappointing response I've read from DigiCert in the past month - and there's been some real doozies. As I mentioned in https://bugzilla.mozilla.org/show_bug.cgi?id=1563573#c29 , I'm increasingly concerned about the nature, scope, and severity of issues happening at DigiCert, and its ability to fundamentally address the underlying root issues. This is all "classic Symantec" behaviour, and that's not a good thing.

It sounds incredibly lame to not recognize that the validation process or expectations for stateOrProvince might also affect jurisdictionStateOrProvince , and even more lame that even after all Validation Members were retrained (per Comment #19), the exact same underlying issue happened, just in a new field. For that matter, your response gives me zero confidence that, until this issue, you had even thought to look at technical controls for the EV fields. Either you didn't, and that was just an incomplete approach to compliance, or you did, and yet you failed to identify it for validation agents (again, per Comment #19) or the training failed to be effective. There's no winning there with the response in Comment #22.

I'm kicking this over to Wayne for his take.

I will share that I'm incredibly frustrated with the scope and nature of the DigiCert issues, as captured earlier today in https://bugzilla.mozilla.org/show_bug.cgi?id=1563573#c29 . I'm not happy with this response, because while I appreciate the honesty and clarity about why it happened, I'm not happy that it happened, and I'm not happy that it seems to shrug it off casually. I'm concerned DigiCert is not spending time to systemically understand or address the issues. What happened to the DigiCert that was able to recognize shoddy validation practices, and helped improve the whole industry with CA/B Forum Ballot 218 ? I would strongly encourage you to consider how DigiCert can demonstrate it systemically understands the issues and is working to address things - perhaps by evaluating all of the compliance issues that DigiCert has encountered over the past 18 months, looking at why they happened, sharing both that analysis and proposals to change and improve the Baseline Requirements and EV Guidelines? Or improve the existing linting tools? Or propose changes to Mozilla policy to reduce ambiguity? Or change business practices and perhaps no longer offering risky services? Things that can show DigiCert understands the seriousness of the issues and is working to address them, not just for itself, but the industry. Right now, the number and seriousness of the issues that DigiCert is facing put it very much on par with Symantec prior to the distrust, and I'd really like to see DigiCert move from reactive mode, which doesn't and can't last and ultimately leads to distrust, and switch to a proactive mode to engage, address, and prevent the issues holistically, and to make sure to drive and push the industry further.

Flags: needinfo?(wthayer)

Sorry - it wasn't intended to be a full response. I just wanted to respond while we are looking at it. Kind of give you a heads up on what's going on while I figure out what to do. I do plan on filing something more later, but I thought maybe you'd appreciate a response before then.

Hey Ryan - I've had lots of time to think about this today, and I think we should move the JOI issues to a different thread. The root cause is actually different as is the remediation. I alluded to it up there (poorly, and I apologize for that), but I'd like to address it in a different thread that captures some of the things that are going on in our system. The "Mitchel County" issue is actually very different from the some-state issue because of the way it works.
