Open Bug 1576013 Opened 2 months ago Updated 4 days ago

DigiCert: JOI Issue

Categories

(NSS :: CA Certificate Compliance, task)

task
Not set

Tracking

(Not tracked)

ASSIGNED

People

(Reporter: jeremy.rowley, Assigned: jeremy.rowley)

Details

(Whiteboard: [ca-compliance] Next Update - 5-Oct-2019)

Attachments

(3 files, 2 obsolete files)

5.35 MB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
38.20 KB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
13.36 KB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet


Steps to reproduce:

  1. How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.

On Aug 19th, we were sent a batch of certificates that lacked proper Jurisdiction of Incorporation information. The problems fell into a couple of categories: misspelled state/locality, county information entered into the locality field, and a mis-match between the country/incorporation.

  2. A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.

We decided to revoke all these certs and kicked off a notice to the customers that they will shortly be revoked. We’re working through the revocation and will post an update once complete. We did investigate why the incident happened, especially in light of the some-state issue. The underlying cause is the way our pre-validation system works. What happens is the JOI information is stored in something we call a “snapshot” that applies to the organization for the period of time prescribed by the EV Guidelines. The information can be changed by a validation agent but applies automatically to orders if not changed.

  3. Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem. A statement that you have will be considered a pledge to the community; a statement that you have not requires an explanation.

Sort of, but progressing towards stopping issuing. We expect this to be fixed by the Aug 31 2019 roll-out. We have fixed the system to stop issuing certificates with incorrect locality information. Validation staff now pick a country from a list, and the system lists all available sources for verifying jurisdiction of incorporation. When a source is picked, the system automatically fills in the jurisdiction information if possible. Sometimes this is not possible as multiple jurisdictions may share a single source (for example, Germany consolidates their information into a single database even though the jurisdiction may be different). To ensure that all of the jurisdiction information is correct, we need to run the entire set of snapshots through the new tool to verify the locality/state/country and invalidate any that do not pass. For example, Mitchell County would not pass with the new tool and should have its snapshot invalidated. It escaped detection because no one ever saw it after the training as the verification was stored in the snapshot.
This highlights one of the systemic issues we’ve identified with regard to ensuring fixes apply retroactively, described further under item 6.

  4. A summary of the problematic certificates. For each problem: number of certs, and the date the first and last certs with that problem were issued.

The problematic certificates have incorrect information in the jurisdiction of incorporation field, generally misspelling and a mis-match between country and state. One had a locality listed in the state field. These are similar to the some-state issue.

  5. The complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem.

"crt.sh URL(s)", notBefore, notAfter, "subject CN", "issuer CN", jurisdictionStateOrProvinceName
MIS-SPELLED JOI:
"https://crt.sh/?id=196281339 (precert); https://crt.sh/?id=198028214 (final)", 2017-08-22, 2019-10-21, www.secure-coloradolegal.org, "Symantec Class 3 EV SSL CA - G3", Delware
"https://crt.sh/?id=214329903 (precert)", 2017-09-20, 2019-09-25, uat-dwapp-a.umw.edu, "DigiCert SHA2 Extended Validation Server CA", Virgina
"https://crt.sh/?id=249260060 (precert); https://crt.sh/?id=252794682 (final)", 2017-11-07, 2020-02-05, www.titan-us.com, "DigiCert SHA2 Extended Validation Server CA", Wahington
"https://crt.sh/?id=266241830 (precert); https://crt.sh/?id=267948073 (final)", 2017-11-28, 2019-12-28, online.acnb.com, "Symantec Class 3 EV SSL CA - G3", Pennysylvania
"https://crt.sh/?id=284394342 (precert); https://crt.sh/?id=300119699 (final)", 2017-12-22, 2020-02-12, hie6.min-ns.net, "DigiCert SHA2 Extended Validation Server CA", Wahington
"https://crt.sh/?id=290128901 (precert)", 2017-12-27, 2020-01-03, www.friendsofksps.org, "DigiCert SHA2 Extended Validation Server CA", Wahington
"https://crt.sh/?id=311244726 (precert); https://crt.sh/?id=364038294 (final)", 2018-01-23, 2020-04-21, diplomat.sunmountainlodge.com, "DigiCert SHA2 Extended Validation Server CA", Wahington
"https://crt.sh/?id=316674948 (precert); https://crt.sh/?id=318886585 (final)", 2018-01-30, 2020-03-30, www.wvoems.org, "DigiCert SHA2 Extended Validation Server CA", "West Virgina"
"https://crt.sh/?id=326540523 (precert); https://crt.sh/?id=331458487 (final)", 2018-02-09, 2020-02-09, www.yarnspirations.com, "DigiCert SHA2 Extended Validation Server CA", Manitoba
"https://crt.sh/?id=332711155 (precert); https://crt.sh/?id=334978473 (final)", 2018-02-16, 2020-04-22, www.focusedfitness.org, "DigiCert SHA2 Extended Validation Server CA", Wahington
"https://crt.sh/?id=349404009 (precert); https://crt.sh/?id=380820028 (final)", 2018-03-07, 2020-05-21, healthagency.slocounty.ca.gov, "DigiCert SHA2 Extended Validation Server CA", Calfornia
"https://crt.sh/?id=370827666 (precert)", 2018-03-30, 2020-05-28, shop.2020Lifestyles.com, "DigiCert SHA2 Extended Validation Server CA", Wahington
"https://crt.sh/?id=370827654 (precert)", 2018-03-30, 2020-05-21, memberstatements.proclub.com, "DigiCert SHA2 Extended Validation Server CA", Wahington
"https://crt.sh/?id=376869063 (precert); https://crt.sh/?id=392480537 (final)", 2018-04-04, 2020-03-30, www.wvoems.org, "DigiCert SHA2 Extended Validation Server CA", "West Virgina"
"https://crt.sh/?id=606444487 (precert); https://crt.sh/?id=622129318 (final)", 2018-07-19, 2020-08-17, www.andigo.org, "DigiCert SHA2 Extended Validation Server CA", Illinnois
"https://crt.sh/?id=606544587 (precert); https://crt.sh/?id=629047865 (final)", 2018-07-21, 2019-10-20, www.hyperionbank.com, "DigiCert SHA2 Extended Validation Server CA", Philadelphia
"https://crt.sh/?id=1276777971 (precert); https://crt.sh/?id=1285494012 (final)", 2019-03-11, 2020-03-26, mail.ksps.org, "DigiCert SHA2 Extended Validation Server CA", Wahington
"https://crt.sh/?id=1431512740 (precert); https://crt.sh/?id=1478183063 (final)", 2019-04-30, 2021-06-28, business.numericacu.com, "DigiCert SHA2 Extended Validation Server CA", Washinton
"https://crt.sh/?id=1498948266 (precert)", 2019-05-22, 2021-05-26, cs.cfsd16.org, "DigiCert SHA2 Extended Validation Server CA", Arizaona

JOI MIS-MATCH:
"https://crt.sh/?id=244026031 (precert); https://crt.sh/?id=245606975 (final)", 2017-10-30, 2019-10-30, www.raffijewellers.com, "GeoTrust EV SSL CA - G4", Ontario
"https://crt.sh/?id=361459025 (precert); https://crt.sh/?id=366102552 (final)", 2018-03-21, 2020-06-05, bank.unitedcu.com, "Thawte EV RSA CA 2018", Ontario
"https://crt.sh/?id=422519300 (precert); https://crt.sh/?id=445716274 (final)", 2018-04-26, 2020-04-25, join.tvaoig.gov, "GeoTrust EV RSA CA 2018", Federal
"https://crt.sh/?id=546492895 (precert); https://crt.sh/?id=585700527 (final)", 2018-06-26, 2020-06-25, EGOV.COLUMBIACOUNTYGA.GOV, "DigiCert SHA2 Extended Validation Server CA", "Columbia County"
"https://crt.sh/?id=637789324 (precert); https://crt.sh/?id=747522687 (final)", 2018-08-08, 2020-08-12, drivesportswear.com, "DigiCert SHA2 Extended Validation Server CA", Alberta
"https://crt.sh/?id=647468336 (precert); https://crt.sh/?id=743688200 (final)", 2018-08-13, 2020-08-17, maxilite.lighting, "DigiCert SHA2 Extended Validation Server CA", "British Columbia"
"https://crt.sh/?id=811685910 (precert); https://crt.sh/?id=874408642 (final)", 2018-10-03, 2020-10-02, www.yarnspirations.com, "DigiCert SHA2 Extended Validation Server CA", Manitoba
"https://crt.sh/?id=1064118345 (precert); https://crt.sh/?id=1068543680 (final)", 2018-12-28, 2020-03-28, www.gfnationalonline.com, "DigiCert SHA2 Extended Validation Server CA", "Glens Falls"
"https://crt.sh/?id=1244083250 (precert); https://crt.sh/?id=1260172612 (final)", 2019-03-01, 2020-03-18, accountopen.allamericabank.net, "DigiCert SHA2 Extended Validation Server CA", USA
"https://crt.sh/?id=1254917906 (precert); https://crt.sh/?id=1258910719 (final)", 2019-03-04, 2020-04-01, www.yorkton.ca, "DigiCert SHA2 Extended Validation Server CA", Saskatchewan
"https://crt.sh/?id=1347315119 (precert); https://crt.sh/?id=1353232811 (final)", 2019-04-04, 2020-04-03, moe.do, "GeoTrust EV RSA CA 2018", Ontario

LOCALITY INCORRECT:
"https://crt.sh/?id=1726754693 (precert); https://crt.sh/?id=1733359144 (final)", 2019-07-31, 2020-08-29, www.kolbetrealtors.com, "Thawte EV RSA CA 2018", "Mitchell County"
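Most of the values listed above are near-misses of real state names, which is the kind of error a closed-list check catches. The following is only a sketch of that idea, not DigiCert's tool: the `US_STATES` set is a hypothetical and deliberately partial allowlist, and `check_state` is a name invented for illustration. It uses Python's standard `difflib` to suggest the most likely intended value.

```python
import difflib

# Hypothetical, partial allowlist for illustration only. A real check would
# load the full list from an authoritative source.
US_STATES = {
    "Arizona", "California", "Delaware", "Illinois", "Pennsylvania",
    "Virginia", "Washington", "West Virginia",
}


def check_state(value, allowed=US_STATES):
    """Return (is_valid, suggestion).

    suggestion is the closest allowed name when the value looks like a
    misspelling of one, otherwise None.
    """
    if value in allowed:
        return True, None
    close = difflib.get_close_matches(value, sorted(allowed), n=1, cutoff=0.8)
    return False, (close[0] if close else None)


for listed in ["Wahington", "Delware", "West Virgina", "Philadelphia"]:
    print(listed, check_state(listed))
```

A value such as "Philadelphia" is rejected with no suggestion, since it is a locality rather than a misspelled state; that case still needs a human or a richer data source to resolve.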

  6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

They avoided detection because of the low number of certs compared to the overall total. We’re currently combing through JOI data to see if there are any additional problems. However, although that is why the issues were missed, it’s not the core reason, as it doesn’t get to the root of the problems we’ve seen over the last couple of bugs. The two items that we see recurring are:

  1. The dependence on people to put the correct information in the correct spot
  2. The lack of an automated system to address changes across the system. With data stored in multiple areas (like the snapshot), simply updating the information in one area doesn’t guarantee all information is appropriately changed. Similarly, changing a validation requirement doesn’t go through and automatically invalidate all existing validations, which it should. Instead, this is something we do by hand every time the requirements change, and that’s a risk that is not acceptable.

Fixing these two problems requires two different solutions. First, we want to remove the people element as much as possible, or at least the typing element. For JOI, this means further locking things down so that the validation agent is never typing the jurisdiction; it’s always chosen from a list or populated directly from the incorporation/registration document. We’re working on locking that down more. In the meantime, we are implementing (as part of the Some-State fix) the tool described there for all addresses, meaning both physical location and jurisdiction information. The address matcher will check the address (to the extent applicable) against the Google Geo-Code API. Any override requires express permission from a manager and logs the cert for further review by compliance.

The second problem is the snapshots. We’re building tooling that will, first manually and then automatically, scan all data and run a validation test against it. If the data doesn’t conform to all requirements, it gets invalidated. For the initial framework, it’ll invalidate addresses using the same Geo-Code API. However, this will be expanded to cover additional use cases that can run whenever things change in the system. The snapshot tests end up becoming unit tests on the data, and are super important because they will end up detecting changes that may have weird cross-implementation impacts. The auto-scan will ensure any change to the system is reflected in the data and data sources across the Digi-verse once the change is rolled out, and will eliminate relying on a manual process to update the snapshots.
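To make the "unit tests on the data" idea concrete, here is a minimal sketch of a snapshot scan. Everything in it is an assumption made for illustration: the `Snapshot` fields, the `KNOWN_STATES` reference table (standing in for the Geo-Code lookup mentioned above), and the function names. It shows only the general shape: run every check against every stored snapshot and invalidate any snapshot that fails.

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    # Hypothetical shape of a stored validation snapshot.
    org: str
    country: str
    state: str
    valid: bool = True
    failures: list = field(default_factory=list)


# Hypothetical reference data standing in for an external lookup.
KNOWN_STATES = {
    "US": {"Delaware", "Washington"},
    "CA": {"Ontario", "Manitoba"},
}


def state_matches_country(snap):
    """Check that the stored state belongs to the stored country."""
    return snap.state in KNOWN_STATES.get(snap.country, set())


def scan(snapshots, checks):
    """Run every check on every snapshot; invalidate on any failure."""
    invalidated = []
    for snap in snapshots:
        snap.failures = [check.__name__ for check in checks if not check(snap)]
        if snap.failures:
            snap.valid = False
            invalidated.append(snap)
    return invalidated


stored = [Snapshot("Org A", "US", "Delaware"), Snapshot("Org B", "US", "Ontario")]
for snap in scan(stored, [state_matches_country]):
    print(snap.org, snap.failures)
```

Because the checks are plain functions, re-running the scan after a requirement changes is the same operation as the first run, which is the property the paragraph above is after.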

Next Steps:
This is just the certificates reported. We haven’t scanned our system yet for additional issues that were not reported. That should happen next week. I wanted to get the incident report filed right away while we worked on evaluating the data for additional issues. I’ll post an update once the scan is complete (end of next week).

Assignee: wthayer → jeremy.rowley
Status: UNCONFIRMED → ASSIGNED
Type: defect → task
Ever confirmed: true
Whiteboard: [ca-compliance]

Jeremy,

I realized you said you were planning for an update on Aug 31 for Bug 1551363, but it'd be good to have an update about what DigiCert's doing at present.

That is, it's important to understand what or how you're investigating, even if you don't yet have the results or all the changes to make. Basically, help us understand what's going on and when updates will follow.

Flags: needinfo?(jeremy.rowley)

Okay. Here's the timeline on what we are doing:

  1. On Aug 31, we are launching a system that will check the location and JOI information against map information to ensure the location exists. This will check spelling, prevent abbreviated information, eliminate mismatches (like Michigan, Italy) and confirm real locations.

  2. That same day we are going to run a check against all stored snapshot information to invalidate any information stored in the validation system. This will clear out cached validation information and invalidate any information that was incorrect. This will look at all data, including anything that was input by validation staff since the previous incident report on some-state.

  3. We'll deliver this information (on bad validations) to the internal auditing team that day and spend a day confirming that the information is bad. After confirmation of bad data, we'll kick off the incident report for this bug and then send notice to the customers about the bad validation information, informing them that the certificate will be revoked on Sept 6th (5 days from when we confirm the information).

  4. On Sept 6th, we will revoke the mis-issued certificates.

The tool is being built as a framework that can be added onto with different data sets and rules. This way we can add data sets and rules and run the tool over them as we want. You can hook up multiple data sets to the tool, which allows us to pull in multiple sources of data so we can hit data sets across platforms while we work on the consolidation projects. The multiple rule sets work great too, so we can turn this into a zlint for validation rules. This is the first, but we're currently scanning the EV dataset for additional issues we can build tools around. The current phase requires manual execution each time we want to run the tool. Eventually, we want it to run automatically.
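One way to picture a framework where data sets and rules are registered independently is sketched below. This is not the actual tool; the `Scanner` class, the `Finding` tuple, and the record layout (dicts with an `id` key) are all hypothetical. The point is only that each registered rule runs over each registered data set, and every failure becomes a finding, much as a linter reports per-rule results.

```python
from collections import namedtuple

# One finding per (data set, record, rule) failure.
Finding = namedtuple("Finding", "dataset record_id rule")


class Scanner:
    """Hypothetical registry of data sets and rules."""

    def __init__(self):
        self.datasets = {}  # name -> callable returning an iterable of records
        self.rules = {}     # name -> predicate returning True when a record is OK

    def add_dataset(self, name, loader):
        self.datasets[name] = loader

    def add_rule(self, name, predicate):
        self.rules[name] = predicate

    def run(self):
        findings = []
        for ds_name, loader in self.datasets.items():
            for record in loader():
                for rule_name, is_ok in self.rules.items():
                    if not is_ok(record):
                        findings.append(Finding(ds_name, record["id"], rule_name))
        return findings


scanner = Scanner()
scanner.add_dataset("ev", lambda: [{"id": 1, "state": "", "country": "US"}])
scanner.add_rule("state_present", lambda record: bool(record["state"]))
print(scanner.run())
```

Adding a new platform's data or a new rule is then a single registration call, and moving from manual to scheduled execution only changes who calls `run()`.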

As mentioned, we are investigating the corpus of EV certificates for additional issues. We know we have an issue where someone entered a telephone number into a registration number. One of the rules we are writing is to detect telephone numbers. Although not an issue that we've found, we are proactively limiting the number of sources we permit. This is intended to help with the registration number field indirectly. We are locking it down to sources recommended by GLEIF plus additional sources approved by legal review. We stripped out a few that GLEIF allows that aren't incorporating sources (like SEC). We don't have a timeline for this, except that we will block "+" characters in registration numbers first.
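A rule of the kind described here might look like the sketch below. The regular expression is a hypothetical heuristic for North American style phone formatting only, and `check_registration_number` is an invented name; real registration number formats differ by registry, so a production rule would need per-source tuning. The "+" check mirrors the first step mentioned above.

```python
import re

# Hypothetical heuristic: matches formats such as "(801) 555-0100",
# "801-555-0100" or "+1 801-555-0100". It is not a general phone parser.
PHONE_PATTERN = re.compile(
    r"(\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]\d{4}"
)


def check_registration_number(value):
    """Return a list of problems; an empty list means no rule fired."""
    problems = []
    text = value.strip()
    if "+" in text:
        problems.append("contains '+'")
    if PHONE_PATTERN.fullmatch(text):
        problems.append("looks like a telephone number")
    return problems


for candidate in ["+1 801-555-0100", "(801) 555-0100", "5551234", "HRB 12345"]:
    print(repr(candidate), check_registration_number(candidate))
```

An unformatted digit string is deliberately not flagged, since many legitimate registration numbers are plain digits; that is the over-catch versus under-catch trade-off a compliance reviewer would have to tune.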

Besides making it easier to lock down registration numbers, one reason to do this is we want to tie the JOI information to the source of the verification information whenever possible. This way when you select a German source, you can't have US as a jurisdiction. Likewise, as much as possible, we want to pre-select the state/locality for the state.
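Tying JOI to the verification source can be sketched as a lookup from source to the jurisdictions that source can attest to. The source identifiers and state lists below are made up for illustration, as are the function names. When a source covers exactly one state the value can be pre-selected; when it covers several (as with a consolidated national database) only the country is fixed and the state is left to be chosen from that source's list.

```python
# Hypothetical source registry: source id -> (country, states it covers).
SOURCES = {
    "example-us-delaware": ("US", frozenset({"Delaware"})),
    "example-de-national": ("DE", frozenset({"Bayern", "Berlin", "Hessen"})),
}


def prefill(source_id):
    """Return (country, state); state is None when the source covers several."""
    country, states = SOURCES[source_id]
    state = next(iter(states)) if len(states) == 1 else None
    return country, state


def is_consistent(source_id, country, state):
    """Reject any JOI the chosen source cannot attest to."""
    src_country, states = SOURCES[source_id]
    return country == src_country and state in states


print(prefill("example-us-delaware"))
print(prefill("example-de-national"))
print(is_consistent("example-de-national", "US", "Delaware"))
```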

Finally, we want to refactor how the validation system works. We want to evolve the validation system from a checklist system to a document system where the document controls the validation. I'll provide mocks as we spec it. This will be a longer project so I don't want to include it in this bug if possible. We're estimating it to be a year long project. Is there a way to start an informational bug that I could share updates on? Might be cool to share how it works? I wouldn't mind sharing the knowledge and designs freely.

Flags: needinfo?(jeremy.rowley)

Oh - I'll post the full amount and cert listings on Aug 31. Let me know how I can provide more detail. Happy to share anything/everything.

One point of clarification that I thought of today that is important to make. This incident/tool is only for DigiCert proper at the moment. Quovadis is still operated mostly as a separate business entity with some minimal integration between the two. However, the tool we built is a service, and we can deliver it to them after launch so they can implement it in their system. I've also asked them for a complete list of their EV certificates so we can scan it for issues using our tool as well. I'll have their compliance people file a separate incident report for anything we find in their system.

We do plan to run the tool on all OV certs as well but don't have a date. I wanted to see how this goes for EV before we commit to a timeframe for OV. Current expectation though is end of next week if that's okay with you all.

Just to make sure I understand the timelines here and what's being done when:

  • DigiCert is still investigating the scope of issue, but is confident it's identified the cause of the issue
  • New EV certificates are still being issued, and thus have the potential to be new violations
  • 2019-08-31: New Check deployed
    • For new validations (i.e. new Subscribers with no pre-existing relationship), regardless of certificate type (DV, OV, EV), DigiCert will validate that the geographic information is consistent.
    • This will be done by relying on non-authoritative sources, such as Google's GeoCode API, versus ISO 3166-1 / 3166-2 or any form of allowlists
    • This will not affect any existing validations or profiles
  • 2019-08-31: Legacy check run
    • DigiCert will run the same tool over all existing locality fields (including jurisdiction fields) within their EV certificates to ensure consistency
    • Any failures will be delivered to Validation for human examination
    • This review will be completed no later than 2019-09-01
  • 2019-09-01: DigiCert will inform its customers of intent to revoke
  • 2019-09-06: DigiCert will revoke these certificates
  • TBD: DigiCert will examine its historical OV issuance for mistakes
  • TBD: DigiCert will revoke its historical OV certificates that are not correct
  • TBD: DigiCert will re-evaluate how it handles serialNumbers and the sources it uses for JOI
  • 2020/2021-XX: DigiCert will deploy a new validation system approach

This only applies to "DigiCert", but not QuoVadis. It is unclear if this applies to third-party subordinate CAs, Verizon, or Symantec CAs that it has acquired.

DigiCert is proposing an update by 2019-09-06, correct? I think it would be useful to have that update on 2019-09-01 - i.e. after DigiCert staff have completed their investigation and intent to revoke. However, perhaps I misunderstood, and DigiCert plans to proactively revoke regardless, and then subsequently investigate the underlying causes.

Flags: needinfo?(jeremy.rowley)

The timeline looks correct with a couple of additions.

  1. We have pulled all our EV data to look for additional issues. Once we find an issue, we'll build a tool to pattern match so we can find all certs with similar issues. This will let us figure out what's going wrong. This is still a human review so it won't be perfect but I'm hoping it'll give us an indication of problems we may find.
  2. We're reviewing our system of checks to ensure they are up to date with the current requirements. That started earlier this week. Expected completion is early next week.
  3. We've kicked off an internal review of all data sources we use for jurisdiction of incorporation based on GLEIF. We've actually found that GLEIF has a lot of typos in their URLs. I was thinking I'd publish that list after we finish the review.

The check will apply to Symantec CAs and all subordinate CAs under DigiCert's control. The only external subordinate CA still issuing is CTJ. I've asked their rep to engage with them so they can complete a scan of all their certificates as well. We will probably need to pull the remaining on prem CAs from CT logs and scan them for issues. There are very few left (4 not counting CTJ) and all of them are in shut-down mode so that shouldn't be a difficult project.

I'm proposing an update on 2019-08-31 to confirm deployment, a follow up on 2019-09-01 for the list of impacted certs, and a post on 2019-09-06 to confirm revocation. I can post the root cause tonight.

Flags: needinfo?(jeremy.rowley)

(In reply to Jeremy Rowley from comment #6)

The check will apply to Symantec CAs and all subordinate CAs under DigiCert's control. The only external subordinate CA still issuing is CTJ. I've asked their rep to engage with them so they can complete a scan of all their certificates as well. We will probably need to pull the remaining on prem CAs from CT logs and scan them for issues. There are very few left (4 not counting CTJ) and all of them are in shut-down mode so that shouldn't be a difficult project.

Thanks Jeremy for reframing it like this. I was specifically thinking about all subordinate CAs - i.e. all certificates that directly or transitively chain to DigiCert. That is, as with all issues, if DigiCert has an issue, what are they doing / will they do to make sure their externally operated subordinates don't have the same issue? I think you've still got a few unrevoked external CAs. Are there any plans to examine those?

Flags: needinfo?(jeremy.rowley)

If I break it down by affiliated entity:

  1. .ABB - Being revoked on Aug 30

  2. CTJ - Already asked to implement a similar check and run something to verify no issues

  3/4. Siemens, T-Systems - We plan on pulling the few remaining certificates from the CT logs and scanning them ourselves. I don't have an ETA yet, but this should be pretty easy to do.

  5. Verizon - Currently working with them on the revocation timeline. If we can agree to an earlier revocation timeline then they'll be out of scope for this project as well (as everything will be revoked). If not, then we'll do the same thing as Siemens/T-Systems.

  6/7. Apple/Microsoft - they only issue for their own corporation. I wasn't planning on reaching out to them since it should all be Apple/Microsoft with the relevant location and JOI info. However, I can if deemed necessary. Since it's their own information though (and they aren't verifying other entities), I didn't think they fell into the same potential bucket of mis-issuance with only doing one or two validations.

  8. Quovadis - they should be delivering us their EV certificate data shortly to run through our tool. Although this won't remediate the problem going forward, we will know the entire set of mis-issued certificates and can revoke the ones that were done wrong. As mentioned above, we are going to make the service available to them/us and have them integrate it. Our eventual goal is to migrate them to the DigiCert platform. The major blocker is the "EU" requirement.

For timelines, I'm thinking we should be able to evaluate all of them in the next couple of weeks considering most don't have many certs remaining and the evaluation of Quovadis certs will be automated. I'm hesitant to put an exact timeline on it until I've seen the system run on the Digicert proper certs. However, it is the next thing I'm doing.

Flags: needinfo?(jeremy.rowley)

I'm not going to get the root cause drafted tonight. I need to think about it more and work with validation a bit more. I know what the root cause is, but I think it needs some reflection to make the contents more useful and give details on what we are going to do about it. I will get it out by the 31st though.

(In reply to Jeremy Rowley from comment #8)

they only issue for their own corporation. I wasn't planning on reaching out to them since it should all be Apple/Microsoft with the relevant location and JOI info. However, I can if deemed necessary. Since it's their own information though (and they aren't verifying other entities), I didn't think they fell into the same potential bucket of mis-issuance with only doing one or two validations.

I mean, with all CA compliance issues, I would think any time a parent CA becomes aware of a potential interpretation or gap in implementation, whether in their own systems or those they've cross-certified or acquired, that the "parent" CA would take responsibility for keeping their partners informed and ensure they are not affected by similar issues.

For example, it might mean routine meetings between Compliance teams and those of subordinate or acquired CAs, to make sure that all parties always have the latest information and understanding, especially since the "parent" CA will ultimately be responsible.

Here's my root cause analysis:

  1. Issue 1: Emphasis on quantity over quality

During the Symantec acquisition we hired a bunch of new staff that had issued a lot of certs. The emphasis was on completing cert validation, and we failed to notice that across large volumes accuracy was declining as a result of the ramp up. Customer expectation for fast validation plus the complexity of the requirements for new people led to poorer than hoped for accuracy in results like misspelled information. We need to change back to the legacy DigiCert value of quality. Processing high quantity is still a goal but we want to add better metrics around quality. The attached scorecard is what we are thinking of switching to as a measurement of performance. The numbers are illustrative and don’t reflect any actual performance. Each validation team member will be reviewed monthly. Although this is not a preventative-based approach, it’ll help ensure issues are caught quickly and that we raise the standard in validation. When combined with the rest of the items, we should have a good system.

In addition to the validation staff, we want to measure the compliance staff on issues detected. Prior to the Symantec acquisition, we audited 100% of the EV certificates issued for compliance concerns. The large volume made that impractical. However, with additional tooling (mentioned in this bug and on MDSP) the auditing should be nearly 100% automated to catch concerns. The goal is to over-catch and have the compliance team review for false positives. We’re working out the specifics of this but the rule-based system we developed for JOI and location information is the start.

  2. Issue 2: Lack of system controls

We have system controls in place to monitor for missing fields. We have system controls in place to catch mis-configured information (zlint). We have a lot of system controls in place to prevent certificates from issuing without proper domain verification. We don’t have enough controls in place to prevent validation agents from inputting the wrong data. Although there are fields locked down (for example, you can’t use a source from Germany to verify a company in France), there are too many editable fields and too many fields rely on validation staff knowledge of how companies operate. I’ve covered a lot of this in the MDSP post and above, but one example is that there is no check against an API to verify an address is real. The validation staff is required by policy to do this check, but the system does not enforce the check on the back end. Another is that there was no check to ensure the state and country match; that check is now in place.

I’m sure there are other checks we can build for this. In fact, I’ve outlined what we want to do long-term with our validation workbench rewrite on the Mozilla dev forum. Basically we want to turn it into a document system where the validation agent identifies the information on the document, eliminating any potential typing.

We also want to build some machine learning so that the system will flag errors that it detects based on other data. For example, if a registration number in France always has 12 digits (just making this up) and something comes through with 11 digits, the system should flag it as a strong possibility of being incorrect. On top of that, we want to lock things down where we can. If the registration number in France is always 12 digits, then we’d simply block all 11-digit numbers. Lots of rules can go into place, and the rules will be a continuous process to build, reform, and improve. We never see this project ending as the requirements will continuously change, and we’ll continuously find ways to improve.
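The flag-on-unusual-length idea can be sketched with simple frequency counting, which is far short of machine learning but shows the flow: learn the dominant length per country from existing data, then flag anything that deviates. The function names, the `min_share` threshold, and the sample data are all assumptions, and the 12-digit French figure is the made-up example from the paragraph above, not a real registry rule.

```python
from collections import Counter, defaultdict


def learn_lengths(records, min_share=0.95):
    """Learn the expected registration-number length per country.

    records is an iterable of (country, registration_number) pairs.
    A country is included only when one length accounts for at least
    min_share of its records.
    """
    by_country = defaultdict(Counter)
    for country, number in records:
        by_country[country][len(number)] += 1
    expected = {}
    for country, counts in by_country.items():
        length, hits = counts.most_common(1)[0]
        if hits / sum(counts.values()) >= min_share:
            expected[country] = length
    return expected


def flag(country, number, expected):
    """Return True when the number's length deviates from the learned one."""
    return country in expected and len(number) != expected[country]


# Made-up history: every "FR" number has 12 digits.
history = [("FR", "123456789012")] * 20
expected = learn_lengths(history)
print(flag("FR", "12345678901", expected))
```

Once a learned pattern is confirmed against the registry's documented format, the same rule can be promoted from "flag for review" to "block", as the paragraph describes.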

  3. Issue 3: Lack of process for improving detected issues

We lacked a clear process for resolving detected issues. When something was detected, there was an unclear path on how this was escalated to the appropriate PM/leader to fix the issue. If the issue was escalated, the path to appropriately prioritize was unknown. I think we’ve got this underway to resolution, primarily through a change in policy and attitude, although, like system controls, ensuring everyone remains aligned will be an ongoing effort.

We’ve ensured the entire company is looking at certificate issuance using a Kaizen approach, and any employee can blow the whistle to halt things if they detect a compliance concern. I’ve had all hands all week with the various teams to emphasize that they can escalate directly to me to get their problem addressed. The upsetting thing is some of these issues (such as misspelling of names) were detected previously – they were not escalated high enough. This won’t happen again as all of us are aligned to make sure that any compliance fix has the highest dev priority.

In addition, we’ve implemented a weekly call to ensure that all issues are reviewed for next action and to verify nothing is dropped. This is tracked internally on a shared Confluence page that everyone has access to. The page is evidence that each issue is being addressed and that each issue is being taken seriously.

We’re also implementing a stricter accountability program for any employee not adhering to current industry requirements. Possible consequences for employees flagged for non-compliance include:
o Remove 2nd auth privileges
o No nomination for 2nd auth or upskill for EV
o Change risk scores
o Removal of dev privileges
o Termination
o Sent to the corner
o A cone of shame (I couldn’t help but throw in a few of my own non-approved punishments)

Issues are classified by criticality, keyed to a color gradient:
Bright Red – mis-issuance with 24 hour revocation required
Orange – mis-issuance with 5 day revocation required
Yellow – book-keeping errors that need to be corrected for quality but not required for 24 hr/5 day revocation.
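The color scheme maps directly onto revocation windows, which can be written down as a small lookup. This is only an illustrative sketch; the `Severity` enum and `revocation_deadline` function are hypothetical names, and the starting point of the clock (here, the time the problem is confirmed) is an assumption based on the timeline given earlier in this bug.

```python
from datetime import datetime, timedelta
from enum import Enum


class Severity(Enum):
    BRIGHT_RED = timedelta(hours=24)  # mis-issuance, 24 hour revocation
    ORANGE = timedelta(days=5)        # mis-issuance, 5 day revocation
    YELLOW = None                     # book-keeping error, no revocation clock


def revocation_deadline(severity, confirmed_at):
    """Return the latest revocation time, or None if none is required."""
    window = severity.value
    return None if window is None else confirmed_at + window


# Confirmed on Sept 1 with a 5 day window gives a Sept 6 deadline.
print(revocation_deadline(Severity.ORANGE, datetime(2019, 9, 1)))
```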

  4. Issue 4: Wrong/poor training on second approvers

The current process is for the Validation managers to nominate a staff member at their discretion to have 2nd auth privilege and be upskilled to process EV. Once they are nominated, they are required to take the training and pass the exam. There are no criteria requiring some “flight time” in the Validation role for X period.

We are adjusting this to ensure that 2nd auth approvers have additional training before being allowed to approve certificates:

  • New Hires: 100% audit of all orders for 8-12 weeks (accounts for 250-400 orders);
    During audit phase, there is no cert issuance, only order processing.
  • 2nd Auth: 6 months of validation/order processing and passing audit requirements*
  • Requires nomination by Validation manager and approval by the highest ranked Validation officer in each center
  • Take required training and pass the exam (only 1 retry)
  • A focused audit on the validation agent is conducted once they get their privileged role for up to 4-6 weeks of order processing.
  • EV Upskills: 1 year of order processing before nomination, with no cert issuance and passing audit requirements*

Passing Audit Requirements

  • The minimum requirement for passing audits is the combination of:
    o 1% or less of Critical Errors (Red-Orange items)
    o 15% or less of Remediation Errors (Yellow items)
  • If the individual is within the passing range above and the minimum number of audits (# of weeks above) has been completed, an e-mail will be sent to the supervisor of the individual and to auditing requesting the individual to be released from audits.
  • Once the supervisor approves the release, then the individual has officially passed audits. An announcement will be communicated by email to the individual, supervisor of the individual, highest ranked validation officer in his/her center, trainer of the individual, and auditing team
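
As a rough illustration of the passing thresholds above, the release check could be written as follows. This is only a sketch: the `AuditRecord` structure and `eligible_for_release` function are hypothetical names, and it assumes the percentages are measured against the total number of audited orders.

```python
from dataclasses import dataclass


@dataclass
class AuditRecord:
    """Audit tallies for one validation agent (hypothetical structure)."""
    orders_audited: int
    critical_errors: int      # Red/Orange items
    remediation_errors: int   # Yellow items
    weeks_audited: int


def eligible_for_release(rec: AuditRecord, min_weeks: int) -> bool:
    """Return True if the agent meets the stated thresholds and minimum audit period."""
    if rec.orders_audited == 0 or rec.weeks_audited < min_weeks:
        return False
    # Integer arithmetic avoids floating-point edge cases at exactly 1% / 15%.
    critical_ok = rec.critical_errors * 100 <= rec.orders_audited
    remediation_ok = rec.remediation_errors * 100 <= 15 * rec.orders_audited
    return critical_ok and remediation_ok
```

A passing result here would only trigger the e-mail to the supervisor and auditing; the supervisor's approval is still a separate manual step.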

All agents are required to take a monthly quiz on recent advisories on new ballots or change in validation procedures.

As I mentioned, all of this is subject to continual improvement, and we welcome feedback. I’m planning on sharing some of our procedure and policy docs next week so you can see how we operate. Although it makes it hard to close this bug, I think compliance is an ongoing effort and neither the dev operations nor the improvement plan can really stop, even if this incident gets resolved.

Attached file Sample Scorecard.xlsx (obsolete) (deleted) —

FYI - We rolled the change out today and ran the script against all issued certificates. The number of false positives on initial inspection is quite high. It may take us more time than expected to go through it. We have a check-in tomorrow morning to see how the progress is going and determine any change in the plan outlined above.

I agree our responsibility is to inform all of our subordinate CAs about their obligations. We are sending out comms to everyone about this issue, including Microsoft and Apple. We're going to double check compliance via CT logs.

CTJ has already reported back that their scan indicated no mis-match of information on locality or state. We have a copy of the Quovadis data but need to figure out why we are seeing so many false positives before running the scan. We expect to run them later next week with a target revocation date of bad certificates on Sep 12.

Attachment #9089593 - Attachment is obsolete: true

Update:
We're over half way through the review and found about 700 certs that contain inaccurate information. The primary reason for inaccuracy is mis-spelled state/locality information. I don't have a crt.sh list yet of all the certs. Working on that and finishing the initial pass.

Update 2: I've been informed there was a bug in the script and the snapshots aren't revoked. Of the 700 certs with inaccurate information, we are clearing the stored validation data manually today.

Update 3: We're going to be blocking EV issuance until the script is fixed to invalidate all bad snapshots.

Update 4: Figured out a better way to do this. We're invalidating all the snapshots on certificates that are still under review as potentially having problematic address data or that were issued after the last data pull. I'll post a final update tonight on how the rest of the day goes for evaluating the information.

We deployed the code invalidating the snapshots and reviewed the certificates issued since the data was pulled for incorrect state/locality information. Because of the number of false positives, we ended up needing to split the data into two waves. The first wave is on track for the timelines above. We're still reviewing the rest of the data and are going to revoke within 5 days after confirming it's bad. Most of the false positives are due to internationalization, i.e. translation or transliteration issues.

The content of attachment 9089593 [details] has been deleted for the following reason:

User request.

Update on progress: We're still on track to revoke wave 1 on Sept 6. I'll post the crt.sh link for those certs on Sept 6th. We split the remaining certs into two waves. The second wave will go out tomorrow and represents most of the remaining certs (>90%). The final wave is the one-off certs that have to be manually reviewed for translation/transliteration issues. That wave should go out next week. The expected current timeline is then:
Sept 5 - wave 2 notice
Sept 10 - wave 2 revoke
Sept 12 - Wave 3 notice
Sept 17 - Wave 3 revoke

Wave 2 is about 2000 certs. Wave 3 is very small. If the current false positive rates hold true, it'll be a dozen or so.

After that we start looking at the corpus of EV certs for additional issues (250k) and the CT logs for additional issues by sub CAs. We've also scanned the Quovadis certs for issues and handed them the report for review.

Jeremy: This seems like a substantial divergence from the plan previously provided by DigiCert, and introduces new concepts - such as proposing a wave of revocations. I understand you shared in Comment #13 that things may change, and Comment #5 for the timelines. Did I miss a comment where DigiCert introduced the concept of waves, or discussed why this was their approach, what the challenges were? Comment #18 is so spartan that it functionally contains no actionable/useful information.

In particular, I'll be blunt: I'm concerned that DigiCert is trying to pull a fast one here, by moving the "customers you'd like to keep happy" into Wave 3 (or some future "Wave 4" or "Wave 5" in some yet-undiscovered corpus), allowing continual delays. That's how, given the current information, it appears. I'm hoping you can provide sufficient context and clarity to help disabuse that notion. I totally understand that CAs want to be thorough in understanding the issue, minimizing the harm, and making sure nothing is missed, and if that's the route they go, transparency is necessary here. To be clear, the bar set is Bug 1551374 for the level of detail provided post-facto, and you can see that the supplemental review process, and the time it took, was in part mitigated by having a holistic understanding of what's going on.

Flags: needinfo?(jeremy.rowley)

No fast one attempted - that's why I was posting about it over labor day and what was going on in case there were questions. Basically, we ran the script and 58k certs were caught as "wrong" by the checker. When we started reviewing them we found that most of these are false positives because of translation/transliteration issues. Although we reviewed over the weekend, we found that we couldn't get the certs all reviewed in a day like we'd hoped so (in comment 18) I suggested we do it in waves. Wasn't sure how many it would take since we weren't sure on the velocity of the review. Comment 20 is my proposal, but if it's not acceptable then we could do something different.

We could post the serial numbers of the certs that we plan on revoking here first, then it could be confirmed that they were revoked within five days. Funny enough the biggest customers were the ones generally included in the first wave (Sept 6th). The customers with the one or two certs are generally in the last wave.

One alternative is we could wait until all of the data is reviewed and then revoke them all at once. We discussed this approach over the weekend, but I thought you'd prefer to see some progress on revocation rather than wait until all the certs are identified.

Flags: needinfo?(jeremy.rowley)

Definitely making progress is desirable.

One way to address the desire for transparency might be:

  • List of serial numbers your tool has flagged
  • Some of the issues you've encountered that impeded progress / false positives. Comment #18 and Comment #20 and Comment #22 all mention "translation/transliteration" issues, but only in the context of 'most'; thus leaving ambiguous what sort of other issues are being encountered.
  • Concrete information about how the waves were selected / what's in which wave.

I totally understand that it may not be viable to 100% review certificates within 24 hours or 5 days, and I especially don't want to impede systemic investigations. The main goal here is understanding the process and logic and intermediate results as DigiCert works to understand and resolve this issue. This bug is a bit of a roller coaster of great detail, limited detail, great detail, limited detail, so trying to get a bit of a steady state here, and to make sure it's something that can be and is being consistently applied for all CAs. As noted from Bug 1551374, there was a month of investigation, so I totally appreciate it takes time. I do want to make sure we've got consistent transparency, particularly for CAs that may have had a number of recent issues, to help assuage some of those concerns.

Flags: needinfo?(jeremy.rowley)

Yeah - the large detail swings are me trying to balance providing information as it becomes available with providing accurate information. Often the information is scattered among different people at the time it is happening, then I bring people together for a retrospective to get all the information. For example, the full reasons for the false positives and percentages of occurrences is something I wanted to summarize later. I couldn't do that at the start of the data review as we only knew that the false positives were related to translation/transliteration and overly strict identification on locality. When I posted initially, I said "most" because that is what we were seeing based on the initial scan. However, I didn't want to commit that this was the only error until we completed the review.

The wave breakdown was simplistic. We took all of the certificates we could review over the weekend and sent them out. We grouped certs by common information to make them easier to review. Once we ran out of time, we kicked off the revocation notice. Phase 2 was everything we could finish by today. Phase 3 is everything that is left. The reason the review isn’t random is we grouped by organization ID so that the entities with the MOST certificates are being revoked on Sept 6th. The entities with the fewest certificates are the ones that will be reviewed in the last wave, which is also why the last 90% of the manual review will take the longest. The information is all different without commonalities.

Of the 250k valid EV certs, 58k were identified as potentially problematic. The reasons for false positives are:

  • Translation/Transliteration. For example, Dubai vs. Dubay. This was the biggest blocker since the tool was strict on what the translation should be.
  • Locality is identified incorrectly. The new system didn’t account well for non-US metros. This is being cleaned up in dev for the tool but the manual review is taking care of it for international locations.
  • State. State is not well defined, but better defined than locality. The first pass blocked state/province on a lot of non-US locations where there are states. ISO 3166-2 does define other states as does the Department of State.

I’ll post the crt.sh links for the certs we know are bad and the remaining certs to review. Will that help? I only have serial numbers right now so I'll need to run those through a script to generate the crt.sh links. That means I may need a day or two to get them generated.

Flags: needinfo?(jeremy.rowley)

Yeah. I definitely don't want to penalize providing updates or discouraging things being complete, just wanting to make sure there's substance/context to them. The more substantive the update, the easier it is to go longer without further updates.

Serial numbers, even without crt.sh links, is fine.

Whiteboard: [ca-compliance] → [ca-compliance] Next Update - 10-Sept-2019
Attached file Location and JOI.xlsx

Attached are the certs in round 1, round 2, and what we still need to review.
A bit more information about the review methodology and the impediments; a lot of this is repeat from above, but I wanted to have all the information in one place next to the data for clarity.
Take away: the most common issues we’ve found are locality in state field in JOI, misspellings, and city/state/country combinations not matching. However, we’re 94% of the way through manually reviewing all certificates identified as potentially affected.

Incident timeline:
2019-08-29: We pulled our entire corpus of valid EV certificates across all systems (250,000), ahead of the schedule set out in comment 5. We thought this prudent in case the address checker had issues so we could still hit the 2019-08-31 deployment date.
2019-08-30: We ran all EV certificates through the address checker. This identified 58,000 affected certificates. In parallel, we deployed the address checker in production, and we ran a tool to invalidate snapshots (saved data) that included addresses which failed the tool’s address check. We then kicked off a systematic review of the 58,000 affected certs.

Because the checker included an aggressive address check, not everything flagged was in violation of the EV guidelines. Some examples include the address checker flagging everything in KY (the Cayman Islands) thinking it was in the US, Kosovo certs, certs with international characters if there are insufficient characters to determine the appropriate language, and an overly strict interpretation on state and locality. The engineering team worked on fixing these issues in the tool while the compliance team cleared certificates from the list of ones reported as problematic.
The systematic review was basic as well. In order to process the certs as fast as possible and meet the revocation timeline, we divvied the cert data across teams, having them manually verify as much as they could before we needed to give notice of revocation. To process such a large batch while reducing as many false positives as we could, we used batch methodology where certs are revoked in rounds as cert data is verified, starting with our first revocation event on Sept 6. For redundancy, the verification process has one person who identifies the certificates as either misissued or clean and another person who confirms. Most of the reviews were done on a spreadsheet. However, the international characters didn’t translate well, and because they don’t translate to a spreadsheet, the agents are required to review either through Censys or in-console, which requires a much longer look-up process. Due to this, the majority of international character certs will be triaged in round 3.

2019-09-01: We discovered a bug in the snapshot script. This led to a gap where some certs were issued based on old addresses. We patched the script and reran it on 2019-09-04 and added all impacted certs to round 3 of our remediation and revocation process.

2019-09-05: Round 2 revocation notice was sent to affected customers. Remaining timeline is:
Sept 6: Revocation event, round 1
Sept 10: Revocation event, round 2

The following dates are tentative, depending on the review process:
Sept 12: Round 3 notice will be sent
Sept 17: Revocation event, round 3

As mentioned above, round 3 will be a smaller percentage of certs, but will take the longest time to complete due to international lookups. Luckily, we’re already 94% of the way through the total identified cert list.
In addition, we’ve scanned all QuoVadis certs. The QuoVadis team is looking at the results and will be putting together an incident report on the results.

After we’ve closed out these buckets of certs, we’ll scan the CT logs to identify sub CAs with issues and start looking at similar issues with OV certs.
We did communicate with the Sub CAs already asking them to scan their own data for issues. We’ve heard back from most, confirming no issue. The CT scan will confirm this report. In addition, we’re communicating the need to implement controls similar to our check for both OV and EV issuance.

Quick update:
We've completed revocation of the second batch. On track with the third batch. Unless there's an objection, I was planning on just posting a final list of all the revoked certs once this is complete.

We made a few updates that I thought I should share:

  1. We were uncomfortable with where the address check was sitting in the system as it had the potential to miss some certificates. The current implementation was during validation and not at issuance, meaning that the check could be bypassed if not tied to the correct product or the validation was deemed not applicable. We moved the address check from the validation system to the issuance system and ran the check over all of the certificates that issued since the last scan to see if any new certificates were detected. We have a batch to review and will post any non-compliant certificates. The gap covered information that had been verified but had not yet been used to issue a certificate, or that had been verified and used only for a certificate that has since expired (say a 90 day cert).

  2. We ran a sanity check on the data to see if there were certificates we revoked and certificates with identical information that we did not revoke. This could exist because of the manual nature of the review and the varying degree of strictness applied to the definition of what constitutes a "state" under the CAB Forum requirements. Because there is no formal definition of "state" in either the BRs or EV guideline there could be inconsistent rules being applied. A similar problem exists with international spellings. We will review the list of mis-matched revocations and see why some were revoked and others were not revoked and decide the proper course of action.

  3. Next week we will run a scan over all certificates issued since the last change to the address check. This is a sanity check on the system and a final check to ensure everything is working properly. We expect the results to come back as no problems detected. We want this final check to confirm that we've caught all address-related problems and that the address check is blocking all bad addresses.

  4. We are revoking the wave 3 certificates tonight. Later this week, we'll generate the crt.sh links for the revoked certs and post them here.

  5. We're going to look at serial numbers next. We already scanned the corpus of CT certs and found only the one telephone number. However, that doesn't answer the underlying question of how a telephone number got into the registration number field and how we prevent it from happening again. Our plan to identify incorrect serial numbers and remediate is:
    a) Pull all of our serial numbers and sources
    b) Group the serial numbers into patterns
    c) Review the serial number with the source to verify the correct pattern
    d) Code the pattern into the linter so it can block anything that doesn’t fit that pattern
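
Steps (b) through (d) above could be sketched roughly as follows. This is an illustrative assumption about how such a check might work, not the actual linter: the `shape`, `learn_patterns` and `lint` names, the digit/letter reduction, and the sample registry and numbers are all invented for the example.

```python
from collections import defaultdict


def shape(reg_number: str) -> str:
    """Reduce a registration number to a coarse pattern: digit -> 9, letter -> A."""
    out = []
    for ch in reg_number.strip():
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)  # keep separators such as '-' or '/'
    return "".join(out)


def learn_patterns(records):
    """Group verified (source, registration_number) pairs into per-source pattern sets."""
    patterns = defaultdict(set)
    for source, number in records:
        patterns[source].add(shape(number))
    return patterns


def lint(patterns, source, number) -> bool:
    """Return True if the number fits a known pattern for its source."""
    return shape(number) in patterns.get(source, set())


# Invented sample data for a made-up source.
known = learn_patterns([
    ("Example Registry", "C1234567"),
    ("Example Registry", "C7654321"),
])
ok = lint(known, "Example Registry", "C0000001")
phone_like = lint(known, "Example Registry", "+1-801-555-0100")
```

With this reduction, a telephone number entered into the registration number field would not match any pattern learned for the source and would be blocked, while a source with very few samples would still need manual lookup before its patterns could be trusted.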

After we associate each of the patterns with the EV sources, we can share that with the community. That way anyone who wants to run queries can do so. This should also help other CAs know what to look for with respect to their own registration numbers.

Summary: DigiCert JOI Issue → DigiCert: JOI Issue

Thanks for the update, Jeremy. This is all encouraging progress, and helping build a better understanding about how best to address this within the community and potentially within policy or the Baseline Requirements.

Attached file Revoked certs.xlsx (obsolete) —
Attached file CrtSh URLS.xlsx
Attachment #9094082 - Attachment is obsolete: true

We ran a sanity check on the revoked certificates and our EV dataset to determine if any certificates were incorrectly identified as false positives. We found that 81 certificates were mis-matched between what we revoked and what we identified as false positives. These are being revoked tonight. We also ran the sanity check to ensure no additional certificates are being issued outside of the validation checks. Addresses are being blocked appropriately. We will run another check next week to again ensure all systems are operating as expected. This project should take about two weeks.
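
For anyone wanting to run a similar consistency check on their own data, a minimal sketch is below. It assumes each certificate is a dict with a `serial`, a `revoked` flag and address fields; the field names and the normalization (case and whitespace only) are assumptions for illustration and would not catch transliteration variants.

```python
from collections import defaultdict


def address_key(cert: dict) -> tuple:
    """Case- and whitespace-insensitive key over the address fields being compared."""
    fields = ("joi_country", "joi_state", "joi_locality", "state", "locality")
    return tuple(" ".join(str(cert.get(f, "")).split()).casefold() for f in fields)


def inconsistent_groups(certs):
    """Return {key: [serials]} for non-revoked certs sharing address data with a revoked cert."""
    revoked_keys = set()
    kept = defaultdict(list)
    for cert in certs:
        key = address_key(cert)
        if cert["revoked"]:
            revoked_keys.add(key)
        else:
            kept[key].append(cert["serial"])
    return {k: v for k, v in kept.items() if k in revoked_keys}
```

Each serial returned is a certificate that was cleared as a false positive even though a certificate with the same address data was revoked, so one of the two decisions needs a second look.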

We started the review on registration numbers and are building a database of registration number patterns. We plan on sharing this publicly once completed. We also started the scan on OV certificates. This scan is running by month (starting with Sep 2019 and working backwards) because of the large volume of OV. The address checks implemented for EV certificates apply equally to OV certificates, meaning that no additional OV certificates are being issued with improper/misspelled address information. We just started this project so I don't have a good feel on the velocity or expected completion time.

The third thing we are working on is organizing the issues in a way we can post here. This way we can break down the certs by problem (mis-spelling, mis-matched state, city in state field) for better transparency. This information should be ready to post by the end of this week.

Whiteboard: [ca-compliance] Next Update - 10-Sept-2019 → [ca-compliance] Next Update - 5-Oct-2019

I added the ones we revoked from the first sanity check. Sanity check 2 is scheduled next week. The breakdown of the reasons for the revocations is as follows:

Revoke Reason                   Percentage   Number of Certs
Invalid State in JOI            23%          452
Invalid State in location       21%          448
Invalid Locality in location    19%          322
Misspelled                      17%          319
Invalid Locality in JOI         14%          264
Bad City-State Combo            4%           74
Abbreviation in JOI             2%           39

We were trying to run the address check over all OV certs and it was going too slow (the time required to run the check was tracking at over a month). Instead, we are getting a list of unique org values, running the check over that list, and then tying the results back to the certs. This should be faster. Still working out an ETA for the OV side of certs. I'll have an update on the registration number review this evening.
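
The unique-org approach amounts to checking each distinct set of org values once and fanning the result back out to every cert that uses it. A minimal sketch, assuming each cert is a dict with a `serial` and a hashable `org` value and that `check_address` is whatever address check is in use (both are hypothetical names):

```python
from collections import defaultdict


def check_by_unique_org(certs, check_address):
    """Run check_address once per distinct org value, then map failures back to certs.

    certs: iterable of dicts with 'serial' and 'org' (a hashable tuple of address fields).
    check_address: callable returning True when the address passes.
    """
    serials_by_org = defaultdict(list)
    for cert in certs:
        serials_by_org[cert["org"]].append(cert["serial"])

    flagged = []
    for org, serials in serials_by_org.items():
        if not check_address(org):  # one call per unique org value
            flagged.extend(serials)
    return flagged
```

The saving depends on how many certs share the same org values; if most organizations hold many certs, the number of address checks drops accordingly.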

From the team looking at the registration numbers:
In the past weeks, we have looked into a dataset of 136k EV certificates, each with the Jurisdiction of Incorporation validation source that was used for issuance, in order to identify the registration patterns of each source. As the usage of sources is not evenly distributed, we sorted and counted the validation sources and processed them in order of the number of certificates using each source, for efficiency. The top 30 sources cover 75% of the EV certificates issued, and we have identified 240+ patterns so far.
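
The sort-and-count step can be sketched as a cumulative coverage calculation. This is an illustration only; `sources_by_coverage` is a hypothetical helper and the input is assumed to be one validation source name per certificate.

```python
from collections import Counter


def sources_by_coverage(cert_sources):
    """Yield (source, cert_count, cumulative_share) from most-used source to least."""
    counts = Counter(cert_sources)
    total = sum(counts.values())
    running = 0
    for source, n in counts.most_common():
        running += n
        yield source, n, running / total
```

Reading down the output shows how many sources need to be processed before a given share of certificates is covered, which is how a figure like "top 30 sources cover 75%" would be obtained.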

We had challenges where there are not enough samples to identify a pattern, as sometimes a single source is used to issue only one certificate. For countries and regions that don't have a lot of enrollments for EV certificates, it is hard to tell whether we have a comprehensive list of the registration patterns. In some cases, this investigation could be done by going into each and every source, but that will be time consuming. (Note, the plan is to look up registration numbers for these countries and see what the pattern is).

As the results of our preliminary analysis look promising, we will now move on to checking the full dataset of approx. 250k EV certificates to perform the same exercise and further identify registration patterns.

Sorry about the weird formatting at the bottom. Not sure how that happened. It's not intended to convey any special message.
