Closed Bug 1595921 Opened 3 years ago Closed 3 years ago

DigiCert: Domain validation skipped

Categories

(CA Program :: CA Certificate Compliance, task)

RESOLVED FIXED

People

(Reporter: jeremy.rowley, Assigned: jeremy.rowley)

Details

(Whiteboard: [ca-compliance])

Attachments

(3 files)

Attached file Serials.txt

Steps to reproduce:

  1. How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.

We noticed the issue during an escape analysis after deploying a SEV1 storefront fix unrelated to validation. The issue was missed during testing: the patch applied to a storefront caused issuance to skip the new domain validation system if the certificate had never been seen before and was org validated. We originally thought the issue related to how domain validation evidence was stored, but during investigation we realized that the storefront skipped domain validation. This led to mis-issuance of 123 OV certs and 36 EV certs. We have been monitoring certificate issuance for problems like this since we deployed the domain consolidation, which is why we caught it during the escape analysis.

  2. A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.

2019-11-04 – A SEV1 outage was reported for a storefront. The fix was deployed after hours with targeted testing instead of regression testing. This is standard for our SEV1 issues. Unfortunately, the patch for the SEV1 caused an issue where the storefront sent the certificate information to issuance with the evidence of domain validation rather than to validation.
2019-11-07 – The problem was discovered during an escape analysis. From the initial investigation, it looked like the validation evidence storage was at issue. We rolled back the patch while investigating further.
2019-11-08 – We realized the issue was with domain validation but were not sure of the impact. We continued to investigate the certificates impacted and conditions for missing validation.
2019-11-11 – A final list of impacted certificates was reported, and an incident report was written.
2019-11-12 – All impacted certificates were revoked within 24 hours of knowing which certificates were impacted.

  3. Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem. A statement that you have will be considered a pledge to the community; a statement that you have not requires an explanation.

A deployment on November 7th, as detailed above, reverted the patch and removed the bad code.

  4. A summary of the problematic certificates. For each problem: number of certs, and the date the first and last certs with that problem were issued.

123 OV and 36 EV certs. I’m working on getting crt.sh links and will post them as an attachment.

  5. The complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem.

See above.

  6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

We had a SEV1 issue that was escalated. The team fixed the issue after hours. We performed targeted testing but no regression testing on how the change would impact other systems. Unfortunately, the system impact was that the storefront started providing certificate requests for issuance, skipping the validation system. Despite having good unit tests on applications, we lack good cross-system automated tests, mostly because of the number of storefronts.

  7. List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.

The immediate fix is to add a report to our canary platform that will identify issues in this integration point on an ongoing basis. This will provide alerts in an out of band process, while further system consolidations are performed that will provide even better testing around these integration points. The addition to our canary platform will be in place by 2019-11-16.
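
As a rough illustration of what such a canary report could check, the sketch below flags every issued certificate whose domains lack a domain validation that completed before issuance and was still inside the permitted reuse window. The `Issuance` record, the `validations` mapping, and the `max_age` parameter are assumptions made for the example; they are not DigiCert's actual schema or configuration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Issuance:
    """Hypothetical record of one issued certificate."""
    serial: str
    domains: tuple
    issued_at: datetime


def unvalidated_issuances(issuances, validations, max_age):
    """Return (serial, domain) pairs that were issued without a fresh validation.

    `validations` maps a domain to the datetime its last validation completed.
    `max_age` is the longest period for which a validation may be reused.
    """
    findings = []
    for cert in issuances:
        for domain in cert.domains:
            validated_at = validations.get(domain)
            fresh = (
                validated_at is not None
                and validated_at <= cert.issued_at
                and cert.issued_at - validated_at <= max_age
            )
            if not fresh:
                findings.append((cert.serial, domain))
    return findings
```

Running a report like this out of band, against data read independently from the issuance and validation systems, is what lets it catch a storefront that has stopped calling validation.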

In addition, we need to provide better automated system tests. These are more complicated because of the number of storefronts, but we plan to work on them more in parallel with the system shut down.

Assignee: wthayer → jeremy.rowley
Status: UNCONFIRMED → ASSIGNED
Type: defect → task
Ever confirmed: true
Whiteboard: [ca-compliance]

Jeremy,

Thanks for reporting. The serials.txt data is not sufficient to uniquely identify the affected certificates. The recommended approach in Responding to an Incident is to include the full certificate fingerprints, which can be easily mapped to crt.sh IDs. Serial numbers are tricky here, as this issue spans multiple CAs.
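
For anyone preparing such a list, a minimal sketch of computing a SHA-256 fingerprint from a PEM-encoded certificate with the Python standard library is shown below. A serial number is only unique per issuer, whereas the fingerprint is a digest of the whole DER encoding and so identifies the certificate regardless of which CA issued it. The function name is just for illustration.

```python
import hashlib
import ssl


def sha256_fingerprint(pem_cert: str) -> str:
    """Return the uppercase hex SHA-256 digest of the DER-encoded certificate."""
    # PEM_cert_to_DER_cert strips the BEGIN/END lines and base64-decodes the body.
    der = ssl.PEM_cert_to_DER_cert(pem_cert)
    return hashlib.sha256(der).hexdigest().upper()
```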

If I'm understanding correctly, this is a follow-on to Bug 1556948 - that is, it's with the new system and with the new controls. The architecture diagram provided leaves it ambiguous as to how this could happen. All of the flows appear to require passing the validated domain information, so understanding how it was possible to cause issuance without passing that information is useful.

This is particularly critical given the sensitive nature of storefronts - being the closest on the edge of a CA's network and most likely to be exposed on the Web - that it was able to cause issuance. Understanding where those controls broke down is useful; that is, it seems like it should not have required automated system tests at the storefront layer, since it should have been possible to test the full API surface of the "DigiCert CA" portion in that diagram and ensure that the validation data from the "Validation Workbench" was correct and from the workbench.

Flags: needinfo?(jeremy.rowley)

It's great that DigiCert caught this quickly, but quite concerning that a "storefront" change can result in bypassing domain validation completely. Questions:

  • Do the storefronts connect to the validation system via an API, as per https://bugzilla.mozilla.org/show_bug.cgi?id=1556948? If so, what is the reason for allowing an API to cause domain validation to be bypassed?
  • How does the CA system verify that validation is complete before signing, and how did a storefront change cause that check to be bypassed?
  • Was this related to legacy Symantec infrastructure?
  • More testing is nice, but what are the architectural causes of this issue and how will those be addressed?

Do the storefronts connect to the validation system via an API, as per https://bugzilla.mozilla.org/show_bug.cgi?id=1556948? If so, what is the reason for allowing an API to cause domain validation to be bypassed?

  • The patch was pushed to the DigiCert API (which is the backend of the legacy DigiCert storefront). This allowed the legacy DigiCert Issuer Code (shown on the diagram box) to communicate directly with the CA instead of forcing it through the validation API. Without the patch, the system would have blocked communication between the two.
  • There isn't a good reason for the API to hook to issuance except that's how the system was built originally. Our goal is to eliminate that legacy DigiCert issuer code shown on the diagram and force the API to talk only to domain validation. That should have happened for domain validation with the domain validation project. Unfortunately, the patch allowed it to circumvent that change, unwittingly undoing part of the domain validation project and restoring that risky API->issuance connection.

How does the CA system verify that validation is complete before signing, and how did a storefront change cause that check to be bypassed?
Was this related to legacy Symantec infrastructure?

  • This was related to legacy DigiCert infrastructure, specifically the Legacy DigiCert Issuer Code on the architecture diagram that passes verified certificate information to the DigiCert CA. Following the domain validation project, the API cannot communicate directly with the issuance system without going through the validation system. The validation system queues/rejects anything that is not validated and blocks it from moving to the issuance system. However, the patch applied allowed the system to bypass the validation system in the following specific circumstances: a) org validation, b) a previously validated domain where the validation expired, and c) no certificate issued for that domain. I inadvertently left out (b) from the previous disclosure. I suppose I could have called this bug "DigiCert relies on expired domain validation" but I thought that wasn't accurate enough from the way the flow worked. If there was a previous domain validation, the system could skip the validation check.
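
Read as a predicate, conditions (a) through (c) could be sketched as follows. This is only an illustration of the description above, for example as a filter when searching for affected orders. The parameter names and the `reuse_window` argument are hypothetical and do not reflect the actual storefront code.

```python
from datetime import datetime, timedelta
from typing import Optional


def matches_bypass_conditions(
    validation_type: str,
    last_validated_at: Optional[datetime],
    certs_issued_for_domain: int,
    now: datetime,
    reuse_window: timedelta,
) -> bool:
    """Return True when an order meets all three bypass conditions."""
    # (a) the certificate is org validated (OV or EV)
    org_validated = validation_type in ("OV", "EV")
    # (b) the domain was validated before, but that validation has expired
    expired_prior_validation = (
        last_validated_at is not None
        and now - last_validated_at > reuse_window
    )
    # (c) no certificate has been issued for the domain
    never_issued = certs_issued_for_domain == 0
    return org_validated and expired_prior_validation and never_issued
```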

More testing is nice, but what are the architectural causes of this issue and how will those be addressed?

  • The architectural issue is the Legacy DigiCert Issuer Code. That needs to be replaced with services. We currently have the domain validation service and org validation services operating through our RA platform. However, the legacy DigiCert API still calls into the Legacy DigiCert Issuer Code for issuance. I'll provide a diagram showing the planned end-state architecture.

If I'm understanding correctly, this is a follow-on to Bug 1556948 - that is, it's with the new system and with the new controls. The architecture diagram provided leaves it ambiguous as to how this could happen. All of the flows appear to require passing the validated domain information, so understanding how it was possible to cause issuance without passing that information is useful.

  • The problem was not due to the domain validation system itself. Instead, this was a patch to the DigiCert API that routed things past the updated validation system in specific circumstances. In reference to the architecture diagram, the patch allowed the DigiCert API to follow the black line through the DigiCert issuer code and to the DigiCert CA. The reason we thought it was a data issue is that the patch inserted a state into the DB that made the system checks think it had accurate validation information returned from the validation system. This wasn't the case. The other part is to remove all communication from the DigiCert API to the DigiCert DB.
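
The failure mode described here can be shown with a toy model: an issuance check that trusts a state flag in a database the API can write to, versus one that asks the validation system directly. All class, function, and field names below are hypothetical and only illustrate the shape of the problem.

```python
class ValidationService:
    """Stand-in for the validation system; only it records completed validations."""

    def __init__(self):
        self._validated = set()

    def complete(self, domain):
        self._validated.add(domain)

    def is_validated(self, domain):
        return domain in self._validated


def can_issue_from_shared_db(order_db, order_id):
    # Weak check: trusts a flag that any writer to the shared DB can set.
    return order_db[order_id]["validation_state"] == "complete"


def can_issue_from_validation_service(order_db, order_id, validation):
    # Stronger check: asks the system that actually performed the validation.
    return validation.is_validated(order_db[order_id]["domain"])
```

If a patch writes `"complete"` into the shared record without validation having run, the first check passes and the second does not.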

This is particularly critical given the sensitive nature of storefronts - being the closest on the edge of a CA's network and most likely to be exposed on the Web - that it was able to cause issuance. Understanding where those controls broke down is useful; that is, it seems like it should not have required automated system tests at the storefront layer, since it should have been possible to test the full API surface of the "DigiCert CA" portion in that diagram and ensure that the validation data from the "Validation Workbench" was correct and from the workbench.

  • Agreed that we need that check. The temporary solution we can get in place right away is to detect the issues. The end state is to completely remove the Legacy DigiCert Issuer Code so the state information can only be obtained from the RA and not checked or set by the DigiCert API. This enforces that only the RA system can communicate with the CA.
Flags: needinfo?(jeremy.rowley)

Attached is the next state architecture that we're working towards. The RA platform is a coordinator to guarantee that the correct validation completes prior to issuance. It sends information to the domain validation process and org validation process. The validation workbench reads the information from the RA platform to show validation staff what's going on and queue items for any manual process (high risks, org vetting, etc). This eliminates the Legacy DigiCert Issuer Code and the API call into the DB. The DB becomes owned by the DigiCert API for order management only and is no longer storing any information about validation status and is not part of the issuance process.

The end-state architecture goes one step further, deprecating the Symantec systems and eliminating that whole side of the code. The red box will disappear as will the existing RA platform. Still targeting Q2 next year for this consolidation.

Am I understanding correctly that the controls in place between each box and arrows are logical controls, not physical controls?

That is, that it was even possible for the DigiCert API to communicate directly with the Legacy DigiCert Issuer Code seems like it should have had multiple opportunities to be rejected:

  • At the network layer (e.g. not routing packets with these hosts, only allowing traffic flows in one direction)
    • This is the classic routing-based security; hardly robust, but useful as a baseline
  • At the API layer (e.g. mutual authentication of each component and not letting one service identity use an API intended for another service identity)
    • Concepts like https://spiffe.io are examples of mutually-authenticated service identities, and concepts exist in many other Cloud platforms
  • At the validation layer (e.g. the DigiCert CA either (a) contacting the validation workbench to verify the data or (b) expecting that the verified certificate details are authenticated as having originated in the workbench)
    • This mostly relates to a question of "Who trusts whom", and designing components around the principles of least trust
    • The things closest to customers are "least trusted", and trust increases through each layer of controls. In general, more-trusted bits can request information from less-trusted bits, but less-trusted bits can't initiate stuff to more-trusted bits.
    • Here, we assume the CA is the Most Trusted thing. Either it contacts the moderately-trusted (validation workbench), or, if it allows the less-trusted Issuer Code to contact it, it requires that the Issuer Code provide 'proof' that it came from the Validation Workbench
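
One way to picture option (b), where the CA expects verified details to be authenticated as originating in the workbench, is a keyed tag over the validation result that the CA checks before signing. The sketch below uses an HMAC with a key assumed to be shared only between the workbench and the CA. That is an assumption made to keep the example within the standard library; a real design would more likely use asymmetric signatures or mutually authenticated service identities. All names are hypothetical.

```python
import hashlib
import hmac
import json


def sign_validation(key: bytes, domain: str, validated_at: str) -> dict:
    """Workbench side: attest that `domain` was validated at `validated_at`."""
    payload = json.dumps({"domain": domain, "validated_at": validated_at}, sort_keys=True)
    tag = hmac.new(key, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    return {"payload": payload, "tag": tag}


def ca_accepts(key: bytes, attestation: dict, requested_domain: str) -> bool:
    """CA side: accept only attestations that verify and match the request."""
    expected = hmac.new(
        key, attestation["payload"].encode("utf-8"), hashlib.sha256
    ).hexdigest()
    # Constant-time comparison avoids leaking how much of the tag matched.
    if not hmac.compare_digest(expected, attestation["tag"]):
        return False
    return json.loads(attestation["payload"])["domain"] == requested_domain
```

With this shape, a less-trusted component that never held the key cannot produce an attestation the CA will accept, whatever state it writes elsewhere.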

Admittedly, these ideas may have fatal flaws with respect to the design of the validation system, but I am concerned, as it sounds like Wayne is as well, that simply calling a heavily-privileged API worked. It seems like there needs to be robust ACLs and security checks in place there, and the above pointers reflect how other organizations are approaching it (albeit not-CAs)

Am I understanding correctly that the controls in place between each box and arrows are logical controls, not physical controls?

I'm not sure I understand. We have both physical and logical controls in place. The Legacy DigiCert Issuer Code and the Existing RA Platform are the only two services that have access to the CA.

That is, that it was even possible for the DigiCert API to communicate directly with the Legacy DigiCert Issuer Code seems like it should have had multiple opportunities to be rejected:

There are multiple systems checking the state. In the current architecture they use the same DB information. In the new architecture, the order management will be isolated both physically and logically from the issuing code.

Here, we assume the CA is the Most Trusted thing. Either it contacts the moderately-trusted (validation workbench), or, if it allows the less-trusted Issuer Code to contact it, it requires that the Issuer Code provide 'proof' that it came from the Validation Workbench

Yes - this is how it should be. With this bug, the place the issuing system called looked like the validation was complete. So the issuing system checked but it checked the same place the API decided to mess with the state. The patch broke the isolation between the system for domain validation. However, we need to isolate the issuing code better from the API as you suggest and as described in the diagram.

Admittedly, these ideas may have fatal flaws with respect to the design of the validation system, but I am concerned, as it sounds like Wayne is as well, that simply calling a heavily-privileged API worked. It seems like there needs to be robust ACLs and security checks in place there, and the above pointers reflect how other organizations are approaching it (albeit not-CAs)

We have ACLs and security checks between issuing code and the CA platform. The problem wasn't an authorization problem between the systems. It was that the state was changed which effectively bypassed the RA system in the described scenarios, which shouldn't have happened. The RA service is supposed to be the single source of truth for state and not changeable outside of that system.

As part of root cause analysis and investigation, have there been any changes to the software development practices that allowed fixes like this to be landed? I can understand the risk involved with any major architectural shift, and the sev-1 failures, but were the existing practices and controls here sufficient? Or is the only analysis not enough testing?

Flags: needinfo?(jeremy.rowley)

That's something we are still evaluating since how we categorize things as SEV1 should probably change. I thought it was pretty standard to not do regression testing on SEV1 issues, instead doing an escape analysis after deployment. The current root causes are:

  1. We need to get to the end-state architecture, although that still could be at risk as a patch to the RA system could set the wrong validation state as well, right?
  2. We need to evaluate how we categorize things as SEV1. This particular change probably didn't need to be made without regression testing. That's speaking in hindsight, but a more strict control over what gets implemented out of our normal process would definitely help.
  3. We need a canary system on the storefront end pre-issuance to tell us if something is going out of sync. For example, here a query to the validation system that also queried the issuing CA system and the API would have detected something out of sync with the dates. This is one that is relatively straightforward, but has a long dev time associated with it (to ensure the canary system itself isn't doing anything odd).
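
The cross-system check in item 3 could look something like the sketch below, which compares the validation date each system reports for a domain and lists the domains where the three views disagree. The three mappings are assumed inputs gathered by querying each system separately. Nothing here reflects DigiCert's real interfaces.

```python
def out_of_sync_domains(validation_view, api_view, ca_view):
    """Return the sorted domains whose validation dates differ across systems.

    Each view maps a domain to the validation completion date that system
    reports. A domain missing from a view counts as a mismatch.
    """
    domains = set(validation_view) | set(api_view) | set(ca_view)
    return sorted(
        d
        for d in domains
        if not (validation_view.get(d) == api_view.get(d) == ca_view.get(d))
    )
```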

Additional steps in the dev and deployment processes are still being investigated to see if there are additional items/controls that could be implemented.

Flags: needinfo?(jeremy.rowley)
Attached file crt links.csv

Attached are the crt.sh links

After talking to the team today about root causes, I think the biggest issue is how we categorized SEV1. SEV1 is any outage that is technically blocking a customer from issuing certificates or that is causing a compliance issue. Going forward, SEV1 cannot include any system that can impact cert validation or issuance. Anything involving issuance will require the full regression testing. This may lead to longer down times while we test systems for impact on validation, but is a good policy. A more thorough root cause retrospective is planned for the end of Nov.

During the retrospective we identified the following process improvements:
All issues raised that can potentially affect issuance must go through an appropriate level of regression testing (with no time constraint due to SLA) before deploying, unless the particular SEV1 issue has been identified as affecting all issuance.
As a result, we implemented an additional designation on SEV1 escalations indicating the scale of the impact on issuance (i.e. no impact, partial impact, or full impact).
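
One possible reading of this rule, written out as code: the designation is a three-valued flag on the escalation, and regression testing is mandatory for partial impact, while full impact (all issuance already affected) falls under the stated exception. This mapping is an interpretation of the text above, and the names are hypothetical.

```python
from enum import Enum


class IssuanceImpact(Enum):
    NONE = "no impact"
    PARTIAL = "partial impact"
    FULL = "full impact"


def regression_testing_required(impact: IssuanceImpact) -> bool:
    """Return True when a SEV1 fix must pass regression testing before deploy."""
    # NONE: the fix is designated as not touching issuance.
    # FULL: the exception for issues identified as affecting all issuance.
    return impact is IssuanceImpact.PARTIAL
```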

We are currently training support and engineering support staff on appropriate handling of the above designation.

Any other questions or is this one ready to close?

Could you describe a bit more about your severity assessment process and levels? I think we've sort of inferred from context, but I think it'd be useful to understand both the categories and approaches here (e.g. does a SEV1 imply a SEV2)

In terms of regression testing, I'm hoping you can provide a bit more detail here. Are these manual tests or automated tests? Who develops them, and how do they get reviewed? I believe past issues (although I'd have to track them down) have discussed confusion between development and compliance, and the need to ensure that developers writing tests are working with compliance to make sure they're the right tests.

This is something that's a bit easier to understand from some other CAs, due to the use of COTS software or detailed descriptions of their "staging" and "prod" environments that allow testing with non-publicly trusted keys, and probing for issues, along with extensive automated testing.

Flags: needinfo?(jeremy.rowley)

SEV1 = Outage where people can't issue certs.
SEV2 = Issue where validation is failing or there are major obstacles in getting certs
SEV3 = Functions or features are not working but people are able to issue certs and get validation

We have a combination of automated and manual tests. Engineering generally develops the tests to make sure services aren't cross-impacted by a change. The tests themselves aren't currently reviewed by compliance (just eng), although we talk weekly about what's being deployed and any implications for compliance. The policy we are moving toward is a regular cadence on which we review, with the compliance team, any tests that check for industry compliance, making sure we are testing for the right things and adding tasks to our backlog to create additional compliance tests. New products are already reviewed by compliance, but tests don't fall into this category.

Flags: needinfo?(jeremy.rowley)

Here are the more formal definitions:

Severity 1 problems include any unplanned events that have a major adverse impact on the operations of the system and on end users' use of the services, such as the problem types described below. You cannot classify a problem as Severity 1, and DigiCert will not classify an issue as a Severity 1 problem, unless a customer administrator with immediate access to the affected system(s) and related information contacts DigiCert by telephone to request support.

Severity 2 problems include any unplanned events (other than Severity 1 problems) that have a moderate adverse impact on the operations of the system and on end users' use of the services, such as:
· Errors that disable limited functions of the services and may result in degraded operations, including without limitation, errors that cause significant transaction processing delays
· Intermittent degradation of availability that moderately impairs the utility of the services
· System or application unavailability that prevents critical transactions from being processed
· Online application outages that significantly impact the online availability of the services
· Consistent degradation of availability of DigiCert’s systems that significantly impairs the utility of the services

Severity 3 problems include any unplanned events (other than Severity 1 or 2 problems) that have a minor impact on the operations of the system and on end users' use of the services. Customer requested improvements and system enhancements are not considered Severity 3 events.

Thanks Jeremy.

There's a lot of complexity in DigiCert's design and operations. This is perhaps understandable, given the scope, but also seems to be a result and product of how transparent DigiCert has been, relative to other CAs, and so there's simply a lot more information to ingest.

This overall incident leaves me somewhat worried, because I think it fits with the pattern of "DigiCert systems are so complex and sprawling that mistakes are made or slip through". However, I also appreciate that, in part, this incident was related to the fact that DigiCert had also recognized this pattern, and was consolidating the domain validation to reduce that sprawl and complexity, to ensure compliance.

While I think some of the actions, such as error classification, may end up being unique to DigiCert, and thus remediation about how those classifications are applied/reviewed may be unique, I greatly appreciate the transparency. I think Comment #8 captures some of those specific remediations, and it's not clear to me the timeframe for the other elements (beyond the classification).

What I think is potentially more generic and applicable, including to DigiCert, is trying to figure out how development and compliance can be tightly integrated. Not every CA necessarily splits their organization into distinct compliance, since "compliance is everyone's job," but there's arguably benefit from having folks steeped in and well versed in the compliance expectations. It seems there remain opportunities here to figure out how to reduce complexity, as well as to increase testing, so that things are comprehensively covered. It also seems an opportunity to re-evaluate the development process, to help ensure the systems, controls, and escalations have everyone aligned. Comment #8 usefully looks at concrete things, but it seems there's still missed opportunity to revisit the process for development and deployment.

Of course, what that process is (for DigiCert) and what that process might be (based on insight from other CAs) is... a bit harder to write down in a bug report. But I'm curious if any thought has been given to revisiting that process or systematizing knowledge across the industry.

Outstanding questions:

  1. What's the timeline for the remaining items in Comment #8?
  2. Are any changes or considerations being made for the development process, and not just the escalation process?
  3. Is there any chance for CAs to work together to systematize best practice for development and site reliability? Are there DevOps or SRE practices that reduce risks and/or complexity in these things?
Flags: needinfo?(jeremy.rowley)

There's a lot of complexity in DigiCert's design and operations. This is perhaps understandable, given the scope, but also seems to be a result and product of how transparent DigiCert has been, relative to other CAs, and so there's simply a lot more information to ingest.

Less complexity now than ever. All domain verification goes through one point, the CA is a single CA, the checks are all zlint. It's getting pretty simplified.

This overall incident leaves me somewhat worried, because I think it fits with the pattern of "DigiCert systems are so complex and sprawling that mistakes are made or slip through". However, I also appreciate that, in part, this incident was related to the fact that DigiCert had also recognized this pattern, and was consolidating the domain validation to reduce that sprawl and complexity, to ensure compliance.

There are two more systems that I think are risky. We plan on refactoring one of those in Q1-Q2. The other by the end. The former is the system that bypassed validation and ties the buy-flow to the validation and issuance. We'd like to separate that out a bit more so ordering and issuance aren't co-mingled. The second is refactoring org validation (which is more about EV and OV). These systems are compliant, but they are complex and sprawling. Turning each one into a service will simplify the operations and ensure there are better boundaries throughout. Note that this is part of the end-state architecture I mentioned in Comment #8.

While I think some of the actions, such as error classification, may end up being unique to DigiCert, and thus remediation about how those classifications are applied/reviewed may be unique, I greatly appreciate the transparency. I think Comment #8 captures some of those specific remediations, and it's not clear to me the timeframe for the other elements (beyond the classification).

  1. The end-state architecture is planned by end of the year, although we're also hoping to merge the Quovadis customers at the same time. This will be far easier than the Symantec systems. However, given the issues reported, I'm thinking we want to make this a high priority. At the start of the year, I'll be looking at both to see what we can do to have both happen during 2020.

  2. We've already categorized things. SEV1 is now outages of multiple customers, not just slow-downs and timeouts. Also dev leadership has to approve SEV1 categorization.

  3. The canary system is planned for Q3-Q4 as part of the final re-architecture. It's a long time, but building this is simplified if we do it as part of the final push to the new architecture.

What I think is potentially more generic and applicable, including to DigiCert, is trying to figure out how development and compliance can be tightly integrated. Not every CA necessarily splits their organization into distinct compliance, since "compliance is everyone's job," but there's arguably benefit from having folks steeped in and well versed in the compliance expectations. It seems there remain opportunities here to figure out how to reduce complexity, as well as to increase testing, so that things are comprehensively covered. It also seems an opportunity to re-evaluate the development process, to help ensure the systems, controls, and escalations have everyone aligned. Comment #8 usefully looks at concrete things, but it seems there's still missed opportunity to revisit the process for development and deployment.

That's been something we've worked on quite a bit this last year. We've tried embedding compliance in scrum, merging compliance and PM, and doing training. We do have people who do only compliance (like Brenda) and are talking about having dedicated compliance engineers who only build compliance tools. The difficulty is always that the weakest compliance person can do something incorrect and cause an issue. This is a higher risk with new hires. That's not to say we can't crack the nut, I'm just still experimenting to see if something sticks. Compliance is our #1 priority.

Of course, what that process is (for DigiCert) and what that process might be (based on insight from other CAs) is... a bit harder to write down in a bug report. But I'm curious if any thought has been given to revisiting that process or systematizing knowledge across the industry.

Yes - we plan on sharing everything that works. I have a meeting in two weeks to do a retrospective on the compliance side of things. I can share the results here and what worked vs. didn't. May help other people.

Are any changes or considerations being made for the development process, and not just the escalation process?
I don't think so. The dev process for that team is pretty well done. The refactor will make it so only the validation engineering team can touch validation. That team will also control issuance so no one can go around it.

Is there any chance for CAs to work together to systematize best practice for development and site reliability? Are there DevOps or SRE practices that reduce risks and/or complexity in these things?

Definitely a chance to work together. Probably a really good discussion for a CAB Forum working group. A WG so we get the right members there. I'd be happy to include our SRE people.

Flags: needinfo?(jeremy.rowley)

We had a retrospective to talk about how embedding compliance in scrum is going, and the results were surprising. The engineering teams that are using the embedded compliance people seemed to like it as it gave them fast answers to compliance questions. The compliance team also liked it quite a bit as they better understood how the certificate management system works and how the certificate sausages are made. Not all of the teams are super efficient using compliance (and not all of the compliance people fit well into scrum) so there's still a lot to do. I asked the compliance people to write up notes on what they see as working and what needs improvement. We can share that with Mozilla and the community when they get it done. Should we close this bug though? I don't think there's much else to do on this particular issue.

Ryan: any further questions?

Flags: needinfo?(ryan.sleevi)

It's not entirely clear to me that Comment #16 answers the specific questions in Comment #15.

I mention this because I'm trying to understand the timeline for systemic remediations, both at DigiCert and within the broader industry.

In terms of trying to identify mitigations and commitments on when they'll be in place, here's what I've got so far:

  • 2019-11-16 - Comment #0 - "The immediate fix is to add a report to our canary platform that will identify issues in this integration point on an ongoing basis."
  • 2019-11-13 - Comment #8 / Comment #10 - "We need to evaluate how we categorize things as SEV1."
  • 2020-Q2 - Comment #4 - "The end-state architecture goes one step further, deprecating the Symantec systems and eliminating that whole side of the code. The red box will disappear as will the existing RA platform."
  • 2020-Q1/Q2 - Comment #16 - "There are two more systems that I think are risky. We plan on refactoring one of those in Q1-Q2."
  • 2020-12-31 - Comment #8 - "We need a canary system on the store-front end pre-issuance to tell us if something is going out of sync." / Comment #16 - "The canary system is planned for Q3-Q4 as part of the final re-architcture."
  • 2020-12-31 - Comment #16 - "There are two more systems that I think are risky. We plan on refactoring one of those in Q1-Q2. The other by the end."
  • 2021-01-01 - Comment #3 - "The end state is to completely remove the Legacy DigiCert Issuer Code" / Comment #16 - "The end-state architecture is planned by end of the year"
  • XXXX-XX-XX - Comment #0 - "In addition, we need to provide better automated system tests."
  • XXXX-XX-XX - Comment #16 - "Yes - we plan on sharing everything that works." / "Probably a really good discussion for a CAB Forum working group. A WG so we get the right members there. I'd be happy to include our SRE people."

Did I get my analysis of dates and deliverables correct?

Flags: needinfo?(ryan.sleevi)
Flags: needinfo?(jeremy.rowley)

2019-11-16 - Comment #0 - "The immediate fix is to add a report to our canary platform that will identify issues in this integration point on an ongoing basis."

This one is done.

2019-11-13 - Comment #8 / Comment #10 - "We need to evaluate how we categorize things as SEV1."

This one is done. I figured this one and the previous one were the bare minimum for closing out this bug.

2020-Q2 - Comment #4 - "The end-state architecture goes one step further, deprecating the Symantec systems and eliminating that whole side of the code. The red box will disappear as will the existing RA platform."
2020-Q1/Q2 - Comment #16 - "There are two more systems that I think are risky. We plan on refactoring one of those in Q1-Q2."
2020-12-31 - Comment #8 - "We need a canary system on the store-front end pre-issuance to tell us if something is going out of sync." / Comment #16 - "The canary system is planned for Q3-Q4 as part of the final re-architcture."
2020-12-31 - Comment #16 - "There are two more systems that I think are risky. We plan on refactoring one of those in Q1-Q2. The other by the end."
2021-01-01 - Comment #3 - "The end state is to completely remove the Legacy DigiCert Issuer Code" / Comment #16 - "The end-state architecture is planned by end of the year"

These dates look about right, although the Symantec deprecation will bleed into Q3 I think. The end state locks down the CA so only the RA can talk to it. So, after thinking about it more, the canary system will really be tied to better monitoring that the RA and CA are correctly communicating. We already have this system in place but there are some minor improvements I think we could make.
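For illustration only, a reconciliation check of the kind described above (confirming that what the CA issued matches what the RA validated) might look roughly like the sketch below. This is not DigiCert's actual system: the `IssuedCert` record, the `find_unvalidated` function, and the sample data are all hypothetical, and the check ignores wildcards, evidence expiry, and validation method.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class IssuedCert:
    """Hypothetical record of a certificate the CA reports as issued."""
    serial: str
    domains: Tuple[str, ...]


def find_unvalidated(
    issued: Iterable[IssuedCert], validated_domains: Set[str]
) -> List[Tuple[str, str]]:
    """Return (serial, domain) pairs with no matching RA validation record.

    `validated_domains` is assumed to hold lowercase domain names for which
    the RA currently has domain validation evidence.
    """
    gaps = []
    for cert in issued:
        for domain in cert.domains:
            if domain.lower() not in validated_domains:
                gaps.append((cert.serial, domain))
    return gaps


# Made-up sample data: the second certificate includes a name the RA never validated.
issued = [
    IssuedCert("01", ("example.com",)),
    IssuedCert("02", ("example.org", "www.example.org")),
]
validated = {"example.com", "example.org"}

gaps = find_unvalidated(issued, validated)
for serial, domain in gaps:
    print(f"ALERT: serial {serial} issued for {domain} without validation evidence")
```

A non-empty result is the "out of sync" signal; run before issuance, the same comparison could block the request rather than only report it.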

XXXX-XX-XX - Comment #0 - "In addition, we need to provide better automated system tests."

This is an ongoing project. We are building better unit testing as we go.

XXXX-XX-XX - Comment #16 - "Yes - we plan on sharing everything that works." / "Probably a really good discussion for a CAB Forum working group. A WG so we get the right members there. I'd be happy to include our SRE people."

We had the meeting on this last week. I'll put together some notes and share them with the CAB Forum.

Flags: needinfo?(jeremy.rowley)

Wayne: Kicking this back to you. I'm still not sure we've got a concrete timeframe (mentioned previously in Comment #15) for all the deliverables, such as for the discussion in the CA/B Forum, but I don't have further questions. A number of systemic fixes look like they'll take the better part of half a year or longer, so I'm not sure if you'd like to keep this bug open with DigiCert providing regular status updates, or close it out.

Flags: needinfo?(wthayer)

The fixes outlined above are not actually related to this issue. They are where we've identified additional risk as part of this discussion. However, the system and process (the Sev1 process) related to this specific issue are resolved. We're already tracking the Symantec migration on another bug (https://bugzilla.mozilla.org/show_bug.cgi?id=1401384) and the other systems aren't causing mis-issuance - I just don't like the risk they impose. The CAB Forum discussion I can start soon, and I've asked Tim to look into creating an engineering-focused WG if possible.

In this exceptional case, I agree that the remaining actions are well beyond the scope of this particular incident. Given the high level of disclosure and comprehensive remediation steps provided by DigiCert in this incident, I am comfortable resolving it at this time.

Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Flags: needinfo?(wthayer)
Resolution: --- → FIXED
Product: NSS → CA Program