Open Bug 1595921 Opened 24 days ago Updated 4 days ago

DigiCert: Domain validation skipped

Categories

(NSS :: CA Certificate Compliance, task)

task
Not set

Tracking

(Not tracked)

ASSIGNED

People

(Reporter: jeremy.rowley, Assigned: jeremy.rowley, NeedInfo)

Details

(Whiteboard: [ca-compliance])

Attachments

(3 files)

Attached file Serials.txt

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36

Steps to reproduce:

  1. How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.

We noticed the issue during an escape analysis after deploying a SEV1 storefront fix unrelated to validation. The issue was missed originally during testing, but the patch applied to a storefront caused issuance to skip the new domain validation system if the certificate was never-before seen and was org validated. We originally thought the issue related to how domain validation evidence was stored, but during the investigation we realized that the storefront had skipped domain validation. This led to mis-issuance of 123 OV certs and 36 EV certs. We have been monitoring certificate issuance for problems like this since we deployed the domain consolidation, which is why we caught it during the escape analysis.

  2. A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.

2019-11-04 – A SEV1 outage was reported for a storefront. The fix was deployed after hours with targeted testing instead of regression testing, which is standard for our SEV1 issues. Unfortunately, the patch for the SEV1 caused an issue where the storefront sent the certificate information, along with the evidence of domain validation, directly to issuance rather than to validation.
2019-11-07 – The problem was discovered during an escape analysis. From the initial investigation, it looked like the validation evidence storage was at issue. We rolled back the patch while investigating further.
2019-11-08 – We realized the issue was with domain validation but were not sure of the impact. We continued to investigate the impacted certificates and the conditions under which validation was missed.
2019-11-11 – A final list of impacted certificates was reported, and an incident report was written.
2019-11-12 – All impacted certificates were revoked within 24 hours of knowing which certificates were impacted.

  3. Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem. A statement that you have will be considered a pledge to the community; a statement that you have not requires an explanation.

A deployment on November 7th, as detailed above, reverted the patch and removed the bad code.

  4. A summary of the problematic certificates. For each problem: number of certs, and the date the first and last certs with that problem were issued.

123 OV and 36 EV certs. I’m working on getting crt.sh links and will post them as an attachment.

  5. The complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem.

See above.

  6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

We had a SEV1 issue that was escalated. The team fixed the issue after hours. We performed targeted testing but no regression testing of how the change would impact other systems. Unfortunately, the system impact was that the storefront started sending certificate requests directly to issuance, skipping the validation system. Despite having good unit tests on individual applications, we lack good cross-system automated tests, mostly because of the number of storefronts.

  7. List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.

The immediate fix is to add a report to our canary platform that will identify issues at this integration point on an ongoing basis. This will provide alerts through an out-of-band process while further system consolidations are performed that will provide even better testing around these integration points. The addition to our canary platform will be in place by 2019-11-16.
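
For illustration, here is a minimal sketch of the kind of canary check described above: it cross-references recently issued certificates against the domain validation records held by the validation system and alerts on any certificate that lacks a current validation. The table names, column names, and alerting mechanism are assumptions made for the sketch, not DigiCert's actual schema or tooling.

    # Hypothetical canary check: flag certificates issued without a matching,
    # unexpired domain validation record. Schema names are illustrative only.
    import sqlite3
    from datetime import datetime, timedelta

    ALERT_WINDOW = timedelta(hours=1)  # how far back each canary run looks

    def find_unvalidated_issuance(conn: sqlite3.Connection) -> list:
        """Return (serial, domain) pairs issued without a covering validation."""
        cutoff = (datetime.utcnow() - ALERT_WINDOW).isoformat()
        return conn.execute(
            """
            SELECT c.serial, c.domain
            FROM issued_certificates AS c
            LEFT JOIN domain_validations AS v
                   ON v.domain = c.domain
                  AND v.completed_at <= c.issued_at
                  AND v.expires_at   >= c.issued_at
            WHERE c.issued_at >= ?
              AND v.domain IS NULL
            """,
            (cutoff,),
        ).fetchall()

    def run_canary(conn: sqlite3.Connection) -> None:
        for serial, domain in find_unvalidated_issuance(conn):
            # In a real deployment this would raise an out-of-band alert/page.
            print(f"ALERT: cert {serial} for {domain} issued without current validation")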

In addition, we need to provide better automated system tests. These are more complicated because of the number of storefronts, but we plan to work on them in parallel with the system shutdown.

Assignee: wthayer → jeremy.rowley
Status: UNCONFIRMED → ASSIGNED
Type: defect → task
Ever confirmed: true
Whiteboard: [ca-compliance]

Jeremy,

Thanks for reporting. The serials.txt data is not sufficient to uniquely identify the affected certificates. The recommended approach in Responding to an Incident is to include the full certificate fingerprints, which can be easily mapped to crt.sh IDs. Serial numbers alone are tricky, as this issue spans multiple CAs.

If I'm understanding correctly, this is a follow-on to Bug 1556948 - that is, it occurred with the new system and with the new controls. The architecture diagram provided leaves it ambiguous as to how this could happen. All of the flows appear to require passing the validated domain information, so understanding how it was possible to cause issuance without passing that information is useful.

This is particularly critical given the sensitive nature of storefronts - being closest to the edge of a CA's network and the most likely to be exposed on the Web - that one was able to cause issuance. Understanding where those controls broke down is useful; that is, it seems like this should not have required automated system tests at the storefront layer, since it should have been possible to test the full API surface of the "DigiCert CA" portion in that diagram and ensure that the validation data was correct and actually came from the "Validation Workbench".

Flags: needinfo?(jeremy.rowley)

It's great that DigiCert caught this quickly, but quite concerning that a "storefront" change can result in bypassing domain validation completely. Questions:

  • Do the storefronts connect to the validation system via an API, as per https://bugzilla.mozilla.org/show_bug.cgi?id=1556948? If so, what is the reason for allowing an API to cause domain validation to be bypassed?
  • How does the CA system verify that validation is complete before signing, and how did a storefront change cause that check to be bypassed?
  • Was this related to legacy Symantec infrastructure?
  • More testing is nice, but what are the architectural causes of this issue and how will those be addressed?

Do the storefronts connect to the validation system via an API, as per https://bugzilla.mozilla.org/show_bug.cgi?id=1556948? If so, what is the reason for allowing an API to cause domain validation to be bypassed?

  • The patch was pushed to the DigiCert API (which is the backend of the legacy DigiCert storefront). This allowed the legacy DigiCert Issuer Code (shown in the diagram) to communicate directly with the CA instead of forcing it through the validation API. Without the patch, the system would have blocked communication between the two.
  • There isn't a good reason for the API to hook into issuance other than that's how the system was originally built. Our goal is to eliminate the legacy DigiCert Issuer Code shown in the diagram and force the API to talk only to the domain validation system. That should have happened for domain validation with the domain validation project. Unfortunately, the patch allowed it to circumvent that change, unwittingly undoing part of the domain validation project and restoring that risky API-to-issuance connection.

How does the CA system verify that validation is complete before signing, and how did a storefront change cause that check to be bypassed?
Was this related to legacy Symantec infrastructure?

  • This was related to legacy DigiCert infrastructure - specifically, the Legacy DigiCert Issuer Code on the architecture diagram that passes verified certificate information to the DigiCert CA. Since the domain validation project, the API cannot communicate directly with the issuance system without going through the validation system. The validation system queues/rejects anything that is not validated and blocks it from moving to the issuance system. However, the patch allowed the system to bypass the validation system under the following specific circumstances: a) org validation, b) a previously validated domain where the validation had expired, and c) no certificate previously issued for that domain. I inadvertently left out (b) from the previous disclosure. I suppose I could have called this bug "DigiCert relies on expired domain validation", but I didn't think that was accurate enough given the way the flow worked. If there was a previous domain validation, the system could skip the validation check.
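
To make the failure mode concrete, here is a minimal sketch of the pre-issuance check that the patch effectively bypassed under conditions (a)-(c): issuance proceeds only when a current, unexpired domain validation exists for the requested domain, regardless of org-validation status or prior issuance history. The data structure and field names are hypothetical.

    # Hypothetical pre-issuance gate: only a current domain validation permits
    # issuance; expired or missing evidence must route back through validation.
    from dataclasses import dataclass
    from datetime import datetime
    from typing import Optional

    @dataclass
    class DomainValidation:
        domain: str
        completed_at: datetime
        expires_at: datetime

    def may_issue(domain: str,
                  validation: Optional[DomainValidation],
                  now: Optional[datetime] = None) -> bool:
        """Return True only if unexpired validation evidence covers the domain."""
        now = now or datetime.utcnow()
        if validation is None:
            return False              # never validated: queue for domain validation
        if validation.domain != domain:
            return False              # evidence must match the requested domain
        if validation.expires_at <= now:
            return False              # expired evidence must not be reused
        return True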

More testing is nice, but what are the architectural causes of this issue and how will those be addressed?

  • The architectural issue is the Legacy DigiCert Issuer Code. That needs to be replaced with services. We currently have the domain validation service and the org validation service operating through our RA platform. However, the legacy DigiCert API still calls into the Legacy DigiCert Issuer Code for issuance. I'll provide a diagram showing the planned end-state architecture.

If I'm understanding correctly, this is a follow-on to Bug 1556948 - that is, it occurred with the new system and with the new controls. The architecture diagram provided leaves it ambiguous as to how this could happen. All of the flows appear to require passing the validated domain information, so understanding how it was possible to cause issuance without passing that information is useful.

  • The problem was not due to the domain validation system itself. Instead, this was a patch to the DigiCert API that routed requests past the updated validation system in specific circumstances. In reference to the architecture diagram, the patch allowed the DigiCert API to follow the black line through the DigiCert Issuer Code to the DigiCert CA. The reason we initially thought it was a data issue is that the patch inserted a state into the DB that made the system checks think accurate validation information had been returned from the validation system, which wasn't the case. The other part of the fix is to remove all communication from the DigiCert API to the DigiCert DB.

This is particularly critical given the sensitive nature of storefronts - being closest to the edge of a CA's network and the most likely to be exposed on the Web - that one was able to cause issuance. Understanding where those controls broke down is useful; that is, it seems like this should not have required automated system tests at the storefront layer, since it should have been possible to test the full API surface of the "DigiCert CA" portion in that diagram and ensure that the validation data was correct and actually came from the "Validation Workbench".

  • Agreed that we need that check. The temporary solution we can get in place right away is to detect these issues. The end state is to completely remove the Legacy DigiCert Issuer Code so that the state information can only be obtained from the RA and not checked or set by the DigiCert API. This enforces that only the RA system can communicate with the CA.
Flags: needinfo?(jeremy.rowley)

Attached is the next-state architecture that we're working towards. The RA platform acts as a coordinator that guarantees the correct validation completes prior to issuance. It sends information to the domain validation process and the org validation process. The validation workbench reads the information from the RA platform to show validation staff what's going on and to queue items for any manual process (high-risk review, org vetting, etc.). This eliminates the Legacy DigiCert Issuer Code and the API call into the DB. The DB becomes owned by the DigiCert API for order management only; it no longer stores any information about validation status and is not part of the issuance process.

The end-state architecture goes one step further, deprecating the Symantec systems and eliminating that whole side of the code. The red box will disappear, as will the existing RA platform. We are still targeting Q2 next year for this consolidation.

Am I understanding correctly that the controls in place between each box and arrows are logical controls, not physical controls?

That is, it seems like the DigiCert API's ability to communicate directly with the Legacy DigiCert Issuer Code should have had multiple opportunities to be rejected:

  • At the network layer (e.g. not routing packets with these hosts, only allowing traffic flows in one direction)
    • This is the classic routing-based security; hardly robust, but useful as a baseline
  • At the API layer (e.g. mutual authentication of each component and not letting one service identity use an API intended for another service identity)
    • Concepts like https://spiffe.io are examples of mutually-authenticated service identities, and concepts exist in many other Cloud platforms
  • At the validation layer (e.g. the DigiCert CA either (a) contacting the validation workbench to verify the data or (b) expecting that the verified certificate details are authenticated as having originated in the workbench)
    • This mostly relates to a question of "Who trusts whom", and designing components around the principles of least trust
    • The things closest to customers are "least trusted", and trust increases through each layer of controls. In general, more-trusted bits can request information from less-trusted bits, but less-trusted bits can't initiate stuff to more-trusted bits.
    • Here, we assume the CA is the Most Trusted thing. Either it contacts the moderately-trusted (validation workbench), or, if it allows the less-trusted Issuer Code to contact it, it requires that the Issuer Code provide 'proof' that it came from the Validation Workbench

Admittedly, these ideas may have fatal flaws with respect to the design of the validation system, but I am concerned, as it sounds like Wayne is as well, that simply calling a heavily-privileged API worked. It seems like there need to be robust ACLs and security checks in place there, and the above pointers reflect how other organizations (albeit not CAs) are approaching it.
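
As one concrete illustration of the "proof it came from the Validation Workbench" idea, here is a minimal sketch in which the workbench signs (here, HMACs) the validated-domain record with a key the storefront never holds, and the CA refuses to issue unless that signature verifies. Key management, replay protection, and the real message format are out of scope; all names and structures are assumptions.

    # Illustrative only: the workbench attests validation evidence, and the CA
    # verifies that attestation before issuance, regardless of which service
    # delivered the request.
    import hashlib
    import hmac
    import json

    WORKBENCH_KEY = b"held-only-by-the-validation-workbench"  # placeholder secret

    def workbench_sign(record: dict) -> str:
        msg = json.dumps(record, sort_keys=True).encode()
        return hmac.new(WORKBENCH_KEY, msg, hashlib.sha256).hexdigest()

    def ca_verify(record: dict, signature: str) -> bool:
        msg = json.dumps(record, sort_keys=True).encode()
        expected = hmac.new(WORKBENCH_KEY, msg, hashlib.sha256).hexdigest()
        return hmac.compare_digest(expected, signature)

    def ca_issue(request: dict) -> None:
        evidence = request["validation_evidence"]
        if not ca_verify(evidence["record"], evidence["signature"]):
            raise PermissionError("validation evidence not proven to come from the workbench")
        # ... proceed to issuance ...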

Am I understanding correctly that the controls in place between each box and arrows are logical controls, not physical controls?

I'm not sure I understand. We have both physical and logical controls in place. The Legacy DigiCert Issuer Code and the Existing RA Platform are the only two services that have access to the CA.

That is, it seems like the DigiCert API's ability to communicate directly with the Legacy DigiCert Issuer Code should have had multiple opportunities to be rejected:

There are multiple systems checking the state. In the current architecture they use the same DB information. In the new architecture, the order management will be isolated both physically and logically from the issuing code.

Here, we assume the CA is the Most Trusted thing. Either it contacts the moderately-trusted (validation workbench), or, if it allows the less-trusted Issuer Code to contact it, it requires that the Issuer Code provide 'proof' that it came from the Validation Workbench

Yes - this is how it should be. With this bug, the place the issuing system checked indicated that validation was complete. The issuing system did perform a check, but it checked the same place where the API had modified the state. The patch broke the isolation between the API and the domain validation system. However, we need to better isolate the issuing code from the API, as you suggest and as described in the diagram.

Admittedly, these ideas may have fatal flaws with respect to the design of the validation system, but I am concerned, as it sounds like Wayne is as well, that simply calling a heavily-privileged API worked. It seems like there need to be robust ACLs and security checks in place there, and the above pointers reflect how other organizations (albeit not CAs) are approaching it.

We have ACLs and security checks between the issuing code and the CA platform. The problem wasn't an authorization problem between the systems. It was that the state was changed, which effectively bypassed the RA system in the described scenarios - something that shouldn't have happened. The RA service is supposed to be the single source of truth for state, and state should not be changeable outside of that system.
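
As a sketch of that single-source-of-truth principle, the issuing code below asks the RA service, and only the RA service, whether an order is validated, with no fallback to a shared database that the storefront API can also write to. The client, endpoint path, and method names are hypothetical.

    # Hypothetical issuance gate that treats the RA service as the only
    # authority on validation state; there is deliberately no path that reads
    # a state flag from the shared order-management DB.
    class RAClient:
        def __init__(self, session):
            self._session = session  # e.g. a mutually authenticated HTTP session

        def is_order_validated(self, order_id: str) -> bool:
            resp = self._session.get(f"/ra/orders/{order_id}/validation-status")
            resp.raise_for_status()
            return resp.json()["status"] == "validated"

    def issue_certificate(order_id: str, ra: RAClient, ca_backend) -> None:
        if not ra.is_order_validated(order_id):
            raise PermissionError(f"order {order_id} is not validated by the RA")
        ca_backend.sign(order_id)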

As part of the root cause analysis and investigation, have there been any changes to the software development practices that allowed fixes like this to be landed? I can understand the risk involved with any major architectural shift, and with SEV1 failures, but were the existing practices and controls here sufficient? Or is the only conclusion that there was not enough testing?

Flags: needinfo?(jeremy.rowley)

That's something we are still evaluating, since how we categorize things as SEV1 should probably change. I thought it was pretty standard not to do regression testing on SEV1 issues, instead doing an escape analysis after deployment. The current root causes are:

  1. We need to get to the end-state architecture, although that could still be at risk, as a patch to the RA system could set the wrong validation state as well, right?
  2. We need to evaluate how we categorize things as SEV1. This particular change probably didn't need to be made without regression testing. That's speaking in hindsight, but stricter control over what gets implemented outside of our normal process would definitely help.
  3. We need a canary system on the storefront end, pre-issuance, to tell us if something is going out of sync. For example, here a query to the validation system that also queried the issuing CA system and the API would have detected that the dates were out of sync. This one is relatively straightforward but has a long dev time associated with it (to ensure the canary system itself isn't doing anything odd).

Additional steps in the dev and deployment processes are still being investigated to see if there are additional items/controls that could be implemented.

Flags: needinfo?(jeremy.rowley)
Attached file crt links.csv

Attached are the crt.sh links

After talking to the team today about root causes, I think the biggest issue is how we categorized SEV1. SEV1 is any outage that is technically blocking a customer from issuing certificates or that is causing a compliance issue. Going forward, SEV1 cannot include any system that can impact cert validation or issuance. Anything involving issuance will require full regression testing. This may lead to longer downtimes while we test systems for impact on validation, but it is a good policy. A more thorough root cause retrospective is planned for the end of November.

During the retrospective we identified the following process improvements:

  • All issues that can potentially affect issuance must go through an appropriate level of regression testing (with no time constraint due to SLA) before deployment, unless the particular SEV1 issue has been identified as affecting all issuance.
  • As a result, we implemented an additional designation on SEV1 escalations indicating the scale of impact on issuance (i.e., no impact, partial impact, or full impact).

We are currently training support and engineering support staff on appropriate handling of the above designation.

Any other questions or is this one ready to close?

Could you describe a bit more about your severity assessment process and levels? I think we've sort of inferred from context, but I think it'd be useful to understand both the categories and approaches here (e.g. does a SEV1 imply a SEV2)

In terms of regression testing, I'm hoping you can provide a bit more detail here. Are these manual tests or automated tests? Who develops them, and how do they get reviewed? I believe we've had discussions in past issues (although I'd have to track them down) about confusion between development and compliance, and the need to ensure that developers writing tests are working with compliance to make sure they're the right tests.

This is something that's a bit easier to understand for some other CAs, due to their use of COTS software or detailed descriptions of their "staging" and "prod" environments that allow testing with non-publicly trusted keys and probing for issues, along with extensive automated testing.

Flags: needinfo?(jeremy.rowley)