(In reply to Ryan Sleevi from comment #2)
> Thanks for sharing this report. It's very clear and very helpful, and it makes a great preliminary report. However, in the process of diverging from the Mozilla process ( https://wiki.mozilla.org/CA/Responding_To_An_Incident ), it lacks several critical details.
I am not sure why you think our response diverges from the Mozilla process. We are using a template that is based on Mozilla's template. This is not our final incident report, because our actions have not been completed yet. We have provided several critical details based on our initial investigation, trying to be as transparent as possible.
> I think one of the key aspects we're missing here is root cause analysis. Root cause is not simply about preventing this exact scenario next time, but about understanding what systemic issues might have lead or contributed to this, and understanding how they can be mitigated.
The distinction between causes, root causes and symptoms is very clear to us; without it, there would be no way to minimize the possibility of similar issues recurring. Even though we have not finished our work, we will try to elaborate on that:
The first cause was a misunderstanding on the part of the Validation Specialists, who assumed that for EV Certificates it would be redundant to include subject:jurisdictionOfIncorporationLocalityName and subject:localityName in the same certificate, and similarly for subject:jurisdictionOfIncorporationStateOrProvinceName and subject:stateOrProvinceName. Even though this example was not explicitly mentioned during training, our training material highlights that for EV Certificates, the EV Guidelines are applied in addition to the Baseline Requirements.
For this reason we decided to improve our training material with this clarification and to add it to our written exam questions. We are also considering a review of the entire training material in search of other places where similar misunderstandings could arise, even though misunderstandings are the hardest to predict. Although we have never had a similar incident that was attributed to improper training of our staff, we will make efforts to improve the training by adding more examples, based on different scenarios.
The second cause, which "allowed" the first cause to manifest as a misissuance, was that the new SubCAs were configured to use a linter that was not the best available for these types of Certificates. This was attributed to human error during the post-ceremony configuration activities. Had the recommended linter been configured, it would have prevented the misissuance.
The mitigation for this particular issue was to improve the linting script so that it auto-detects the type of certificate and applies the recommended linter. We are examining further improvements to automate post-ceremony CA configuration based on the types of Certificates that the Issuing CA is technically capable of issuing. HARICA currently uses several of the available CLI configuration options to automate post-ceremony CA configuration, but EJBCA has certain limitations: some configuration steps cannot easily be performed using the CLI.
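A minimal sketch of the auto-detection idea, for illustration only and not the actual linting script: it assumes the certificate type can be inferred from the CA/Browser Forum reserved policy OIDs in the certificatePolicies extension, and that the caller has already extracted those OIDs as strings. The linter names `ev-linter` and `br-linter` are hypothetical placeholders.

```python
from typing import Iterable, Optional

# CA/Browser Forum reserved certificate policy OIDs.
CABF_POLICY_TYPES = {
    "2.23.140.1.1": "EV",
    "2.23.140.1.2.1": "DV",
    "2.23.140.1.2.2": "OV",
    "2.23.140.1.2.3": "IV",
}

# If several reserved OIDs appear, prefer the most demanding profile.
TYPE_PRIORITY = ("EV", "OV", "IV", "DV")

# Hypothetical linter names; substitute whatever linters are deployed.
LINTER_FOR_TYPE = {
    "EV": "ev-linter",
    "OV": "br-linter",
    "IV": "br-linter",
    "DV": "br-linter",
}


def detect_certificate_type(policy_oids: Iterable[str]) -> Optional[str]:
    """Return the certificate type implied by the policy OIDs, or None."""
    found = {CABF_POLICY_TYPES[oid] for oid in policy_oids if oid in CABF_POLICY_TYPES}
    for cert_type in TYPE_PRIORITY:
        if cert_type in found:
            return cert_type
    return None


def select_linter(policy_oids: Iterable[str]) -> str:
    """Pick a linter from the certificate itself, not from per-CA configuration."""
    cert_type = detect_certificate_type(policy_oids)
    if cert_type is None:
        # Fail closed: an unrecognized type should block issuance, not skip linting.
        raise ValueError("unrecognized certificate type; refusing to pick a linter")
    return LINTER_FOR_TYPE[cert_type]
```

Deriving the linter from the certificate content means that a newly created SubCA cannot silently inherit the wrong linter through a missed manual configuration step.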
We have now reached the third cause, which is the technical restrictions imposed by the EJBCA software. For other certificate profile requirements (whether fields are required or optional, acceptable values and size per subject attribute, etc.), EJBCA provides the necessary tools and HARICA uses them to enforce the Certificate Profiles per the Baseline Requirements and EV Guidelines. The only rule that could not be expressed with the EJBCA end-entity profile tools was the conditional requirement that at least one of subject:localityName or subject:stateOrProvinceName be present.
To address this lack of support in the end-entity profile (EEP) configuration, we requested a feature from PrimeKey (https://jira.primekey.se/browse/ECA-8704) to allow for these conditional restrictions in end-entity profiles, as we believe this is a feature that most publicly-trusted CAs would like to use and implement. Until that feature is available, we have set our end-entity profiles for issuance of TLS Certificates to require subject:localityName for IV/OV/EV/QWACs (in addition to the other required fields).
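The conditional rule and the stricter interim rule can be sketched as follows. This is an illustration only, not EJBCA functionality or the actual profile configuration; it assumes the subject is available as a plain mapping from attribute name to value, and the function names are hypothetical.

```python
from typing import Mapping


def _present(subject: Mapping[str, str], attribute: str) -> bool:
    """True if the attribute exists and is not empty or whitespace-only."""
    return bool(subject.get(attribute, "").strip())


def satisfies_locality_rule(subject: Mapping[str, str]) -> bool:
    """Conditional rule: at least one of localityName or stateOrProvinceName.

    The jurisdictionOfIncorporation* attributes are deliberately not
    consulted; they do not substitute for the subject's own L/ST fields.
    """
    return _present(subject, "localityName") or _present(subject, "stateOrProvinceName")


def satisfies_interim_profile(subject: Mapping[str, str]) -> bool:
    """Stricter interim rule: localityName is always required."""
    return _present(subject, "localityName")


# An EV-style subject that carries only jurisdiction fields fails both checks.
ev_subject = {
    "organizationName": "Example Org",
    "jurisdictionOfIncorporationLocalityName": "Athens",
    "jurisdictionOfIncorporationStateOrProvinceName": "Attica",
}
print(satisfies_locality_rule(ev_subject))    # False
print(satisfies_interim_profile(ev_subject))  # False
```

The interim rule is a strict subset of the conditional rule: every subject it accepts also satisfies the conditional requirement, at the cost of rejecting subjects that carry only stateOrProvinceName.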
Analyzing the above causes further, especially the first two, which are part of HARICA’s internal organization, one could consider the “human factor” to be the source. For us, it is a clear indication that our strategy of using more automation and removing the possibility of human error in as many places as possible is correct.
Going deeper in our analysis, while automation is certainly valuable, it still transfers part of the risk to the systems themselves and to the engineers and developers who must properly implement, configure, operate and monitor them. Simply put, misunderstandings, omissions or errors, at the human or system level, can always happen. This is a good reminder of the value of our (costly) choice to maintain multiple overlapping controls, so that we have a fault-tolerant system. In this case, it is clear that more than two things had to go wrong for a misissuance to occur, and even when that happened, our processes were able to detect the issue and take immediate action.
I hope the above is not seen as praise of our own processes; a misissuance did take place. We remain alert and continue to evaluate specific good practices (some mentioned above and included in our incident report).
> I don't really see any introspection into the systemic controls, just about mitigating this specific issue. Can you clarify when you'll be providing a complete incident report? I think one of the key things will be understanding what controls existed, where and why they failed, and what sort of steps are being taken.
We focused on the controls and mitigations that apply to the L/ST issue, which we hope address the concerns raised. We expect to deliver the complete incident report next week, after the affected certificates are revoked and our analysis is completed. The final report will incorporate the additional information provided in this response.
> There seems to be a number of steps that could have caught or prevented this.
We agree. As mentioned above, despite the failure of two controls (training and pre-issuance linting), this issue was caught by HARICA precisely because we had a third control in place (internal checks/testing). This testing was conducted as part of our continuous preemptive efforts to introduce improved tools (linters in this case) and more automation. We are continuously trying to learn from existing incidents and improve our existing tools and practices.
Please let us know if you have any further questions or concerns.