Ben: This report seems to be in the spirit of the negative example from https://wiki.mozilla.org/CA/Responding_To_An_Incident#Incident_Report , specifically:
For example, it’s not sufficient to say that “human error” or “lack of training” was a root cause for the incident, nor that “training has been improved” as a solution.
That is, this response states that:
This was resulted due to a misconfiguration of one particular procedure for manual issuance for test certificates, where the certificate issuance was containing wrong key usage value.
and that the solution is:
All manual procedures are now updated in internal operating procedures to perform additional checks post certificate generation, specifically in these kind of cases. Necessary re-training of people has been made towards the same.
I can't help but feel like this provides zero useful information for any other CA, or any relying party, in understanding what went wrong, how it was corrected, and how to avoid similar mistakes.
The misconfiguration is a symptom/statement, not a root cause. What's missing is an understanding of how that misconfiguration was possible, what the previous steps were to ensure correctness prior to certificate generation, and what the new steps are to ensure correctness prior to certificate generation. The statement "necessary re-training" provides no useful detail, because it doesn't even establish that this was a training issue in the first place. However, if we assume it was a training issue, then what were the prior training materials, why were they deficient, and how was that corrected?
I'm concerned that this incident report approaches it as if this was a one-off, rather than a series of systemic failures that ultimately resulted in misissuance. For example, I see nothing that would prevent this from happening, say, next year, when "training wasn't as frequent" or "new folks were hired".
These are the sorts of questions I think we'd like to get to in the incident report. I can't help but say I'm also troubled by statements like:
No impact to external customers. Only internal test certificates impacted.
While that's descriptive, it's not dispositive as to how a CA treats incidents. It doesn't matter whether the customers were external or internal; the certificate was misissued in the first place.
An ideal incident report here would identify:
- How the original configuration error was introduced.
- What the previous steps were for performing manual ceremonies.
- How this escaped, say, preissuance linting (which is relevant, even for manual ceremonies).
- How manual ceremonies have been updated to prevent future mistakes.
- If this was training, what was the training focused on?
  - For example, was it training about what RFC 5280-et-al require?
  - Was it training about prior incidents of CAs having made this mistake?
  - Was it training on existing procedures for reviewing manual ceremonies?
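To make the preissuance linting point above concrete, here is a minimal sketch of the kind of automated check that could sit in front of a manual ceremony. It is an illustration under stated assumptions, not a description of any CA's tooling or of an existing linter: `TbsCertificate`, `Profile`, `lint_key_usage`, and `approve_for_signing` are hypothetical names, and the approved profile is whatever the CA has documented for that certificate type. The only external rules used are the key usage bit names from RFC 5280 section 4.2.1.3 and the requirement in section 4.2.1.9 that keyCertSign not be asserted when the cA boolean is not asserted.

```python
from dataclasses import dataclass
from typing import FrozenSet, List

# Key usage bit names as defined in RFC 5280, section 4.2.1.3.
KEY_USAGE_BITS = frozenset({
    "digitalSignature", "nonRepudiation", "keyEncipherment",
    "dataEncipherment", "keyAgreement", "keyCertSign",
    "cRLSign", "encipherOnly", "decipherOnly",
})


@dataclass(frozen=True)
class TbsCertificate:
    """Hypothetical, simplified view of a to-be-signed certificate."""
    subject: str
    is_ca: bool
    key_usage: FrozenSet[str]


@dataclass(frozen=True)
class Profile:
    """Hypothetical approved profile for one certificate type."""
    name: str
    is_ca: bool
    allowed_key_usage: FrozenSet[str]
    required_key_usage: FrozenSet[str]


def lint_key_usage(tbs: TbsCertificate, profile: Profile) -> List[str]:
    """Return a list of findings; an empty list means no issue was found."""
    findings = []
    unknown = tbs.key_usage - KEY_USAGE_BITS
    if unknown:
        findings.append(f"unknown key usage bits: {sorted(unknown)}")
    if tbs.is_ca != profile.is_ca:
        findings.append("cA flag does not match the approved profile")
    extra = tbs.key_usage - profile.allowed_key_usage
    if extra:
        findings.append(f"key usage bits not allowed by profile: {sorted(extra)}")
    missing = profile.required_key_usage - tbs.key_usage
    if missing:
        findings.append(f"key usage bits required by profile are missing: {sorted(missing)}")
    # RFC 5280, section 4.2.1.9: keyCertSign MUST NOT be asserted
    # when the cA boolean is not asserted.
    if "keyCertSign" in tbs.key_usage and not tbs.is_ca:
        findings.append("keyCertSign asserted on a non-CA certificate")
    return findings


def approve_for_signing(tbs: TbsCertificate, profile: Profile) -> None:
    """Refuse to proceed to signing if the lint produced any finding."""
    findings = lint_key_usage(tbs, profile)
    if findings:
        raise ValueError(f"{tbs.subject}: refusing to sign: " + "; ".join(findings))
```

The design point is that the check runs on the to-be-signed data, before CA key material is touched, and that a finding blocks issuance rather than being left to a post-generation review.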
There's a lot we can learn here, but I'm concerned with any CA that is able to perform manual issuance and miss something. If CAs aren't treating manual ceremonies as incredibly critical, with layers of review and tests before they ever touch CA key material, that's concerning. This incident report is the opportunity to provide a better understanding of what was done and why that shouldn't be a concern.