This is our final report on this issue.
1. How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.
The issue was discovered by our validation team while performing a routine check of an EV TLS order.
2. A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.
2021-07-10T02:34:38-00:00 EV TLS order reviewed and approved by one of our Validation Specialists.
2021-07-12T03:38:36-00:00 Customer care requests an update on the above-mentioned order.
2021-07-12T09:15:41-00:00 The issuance of an EV TLS certificate with serial number 776752935acf6697078d9cb5547921a6 is triggered with the invocation of an API call by one of our resellers.
2021-07-13T16:05:00-00:00 In the process of performing 2p approval, one of our senior Validation Specialists determines that the certificate was issued without stored evidence of 2p approval in the system. The issue is reported by registering an internal ticket.
2021-07-13T16:37:00-00:00 The ticket is processed with high priority by a senior software engineer. A bug is discovered in our API and a hotfix is immediately deployed. Investigation continues to determine the population of the possibly affected certificates.
2021-07-14T20:22:00-00:00 The population of possibly affected certificates is determined to be eleven (11) EV TLS certificates.
2021-07-15T15:00:00-00:00 The issue is picked up by our Security Auditing department and investigation begins.
2021-07-16T00:46:58-00:00 Additional technical information is shared by the software engineers.
2021-07-16T02:07:05-00:00 Audit logs of the approvals of the abovementioned population are presented. Security Auditors initiate a detailed review to determine whether and for which exact certificates there is a lack of 2p approval evidence.
2021-07-19T19:31:00-00:00 After a detailed review of the above-mentioned population by our Security Auditors and discussions with the validation team, two (2) additional problematic certificates are found. Subsequent review confirms that validation evidence supported issuance for all three (3) affected certificates; however, immediate revocation is requested due to the lack of 2p approval at the time of issuance.
2021-07-19T21:56:00-00:00 All three (3) affected certificates revoked.
2021-07-20T11:00:00-00:00 Security Auditors start gathering all the information and compiling a preliminary incident report.
2021-07-23T21:02:00-00:00 Filed initial Bugzilla report, with full report to follow pending completion of in-depth review.
2021-07-30 to 2021-09-03 Ongoing investigation and discussions between the engineering, validation and compliance departments to analyze any underlying weaknesses and, based on the results, decide on additional measures and improvements to our systems and processes so that such occurrences are not repeated. The analysis also takes into account incident no. 1724520, which was registered on 2021-08-06. Weekly updates were posted to the public bug to inform the community about the ongoing investigation and analysis and to address any questions raised.
2021-09-06 Started drafting the final Bugzilla report.
2021-09-10 Filed final Bugzilla report (this document).
3. Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem. A statement that you have will be considered a pledge to the community; a statement that you have not requires an explanation.
A hotfix was immediately deployed and tested after the issue was detected. No similar issuances can be performed currently.
4. A summary of the problematic certificates. For each problem: number of certs, and the date the first and last certs with that problem were issued.
Three (3) EV TLS certificates, issued between 2021-05-18 and 2021-07-12.
5. The complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem.
Impacted certificates (and their corresponding pre-certificates):
S/N: 4f3ff4dc563aa9c46f18254998643ba6 (https://crt.sh/?id=4549278351)
S/N: 45ba1526442ba72cc562493d99bcf757 (https://crt.sh/?id=4834857365)
S/N: 776752935acf6697078d9cb5547921a6 (https://crt.sh/?id=4851729123)
6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.
The bug was introduced when the 2-person approval was migrated from the CA software to the RA Portal on 2021-04-13. The purpose of this change was to allow full processing of validations in the RA Portal by the Validation Specialists.
How and why the mistakes were made or bugs introduced:
The source of the bug was a faulty conditional (IF) check in the certificate down-stepping process, which applies only to EV TLS products and involves more complex logic. This down-stepping mechanism allows issuance of a DV TLS certificate once an applicant for an EV TLS certificate completes the DV process, so that customers can receive a DV TLS certificate while waiting for the extended validation process to be completed.
In this case, the flawed conditional allowed the certificate request to result in an EV certificate instead of its DV counterpart after the order had been validated by the first Validation Specialist only. The scope of the bug was limited to EV TLS certificates issued through an API endpoint which was not intended for TLS issuance.
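The intended outcome of the down-stepping decision described above can be sketched as follows. This is only an illustration under stated assumptions: the `Order` type, its fields, and the `decide_issuance` function are hypothetical names introduced for this example and do not reflect the actual code of the RA system.

```python
from dataclasses import dataclass, field
from enum import Enum


class Product(Enum):
    DV_TLS = "DV TLS"
    EV_TLS = "EV TLS"


@dataclass
class Order:
    # Hypothetical fields that only model the states described in this report.
    dv_validated: bool = False
    # Identifiers of the distinct Validation Specialists who approved the order.
    approver_ids: set = field(default_factory=set)


def decide_issuance(order: Order):
    """Return the product that may be issued for an EV TLS order, or None."""
    if not order.dv_validated:
        return None  # nothing may be issued before the DV process completes
    if len(order.approver_ids) >= 2:
        return Product.EV_TLS  # 2-person approval is on record
    return Product.DV_TLS  # down-step while extended validation is pending
```

Under these assumptions, an order that has completed the DV process and carries a single approval, such as `Order(dv_validated=True, approver_ids={"specialist-a"})`, yields `Product.DV_TLS` rather than `Product.EV_TLS`.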
Our investigation determined that issuance via the RA Portal was not affected by this bug. The bug also did not affect EV Code Signing or other types of certificates, because this down-stepping mechanism only applies to TLS certificates.
Upon further investigation, we determined that only one user account was misusing this API to submit TLS requests.
(A side effect impacted the efficiency of our evidence gathering. In particular, our auditing mechanisms did capture all the relevant API calls, but the use of this non-TLS API for TLS issuance didn't allow for proper organization of audit information for our standard TLS reporting.)
Apart from fixing the bug and reminding the user of their obligation to only use documented API endpoints and procedures for their intended purposes going forward, we extended our investigation and proceeded with in-depth manual review of all possibly affected certificates to confirm no other such case exists.
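A review of this kind can be approximated by a small script over the audit records. The sketch below is hypothetical: the record layout (tuples of serial, approver and timestamp), the function name `find_missing_2p`, and the sample values are assumptions made for illustration, not our actual audit log format or data.

```python
from collections import defaultdict
from datetime import datetime


def find_missing_2p(issuances, approvals):
    """Return serials issued with fewer than two distinct approvers on record.

    issuances: iterable of (serial, issued_at)
    approvals: iterable of (serial, approver_id, approved_at)
    """
    by_serial = defaultdict(list)
    for serial, approver, approved_at in approvals:
        by_serial[serial].append((approver, approved_at))

    flagged = []
    for serial, issued_at in issuances:
        # Count only approvals recorded at or before the time of issuance.
        approvers = {a for a, t in by_serial[serial] if t <= issued_at}
        if len(approvers) < 2:
            flagged.append(serial)
    return flagged


# Illustrative placeholder records, not real certificate data.
issued = datetime(2021, 1, 1, 12, 0)
sample_issuances = [("serial-1", issued), ("serial-2", issued)]
sample_approvals = [
    ("serial-1", "specialist-a", datetime(2021, 1, 1, 10, 0)),
    ("serial-1", "specialist-b", datetime(2021, 1, 1, 11, 0)),
    ("serial-2", "specialist-a", datetime(2021, 1, 1, 10, 0)),
]
print(find_missing_2p(sample_issuances, sample_approvals))
```

Counting distinct approvers, and ignoring approvals recorded after issuance, matters here because a second approval added later would not change the fact that 2p approval was missing at the time of issuance.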
How they avoided detection until now:
The bug passed undetected to the production systems despite the fact that the change was reviewed both within the development department (code review) and by the compliance department (change review). Our investigation confirmed the following:
- With regard to the code, a documented code review by a separate developer took place before approving the pull request, in accordance with our standard code approval policy. The review did include the code in question; however, it failed to reveal the bug in the conditional logic (see point 7 for more details on this).
- A documented compliance review by our Security Auditors took place before approving the migration of the 2-person approval from the CA software to the RA Portal, in accordance with our Change Management Policy. It included a detailed review of the staging system's behavior after the change, against both our CP/CPS and the applicable requirements. The review did not include scenarios which involved misuse of API endpoints (see point 7 for more details on this).
The issue was also not detected during our quarterly certificate reviews due to the low number of affected certificates.
7. List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.
Immediate actions are described in steps 2-4 of this report and include the emergency deployment of the code fix to prevent further problematic issuances, the involvement of our internal auditing department, the thorough review of the possibly affected certificates and the revocation of the affected certificates after confirming the issue.
After determining why the bug was introduced in the first place and in particular why it affected only EV TLS certificates but not EV Code Signing certificates, which follow the same 2p approval principle (see point 6 above), our analysis focused on why the issue was not detected during the code or compliance review, before reaching production.
Several discussions and meetings took place between all involved departments: engineering, validation and compliance. Our objectives were dictated by our Incident Management Policy: (a) to understand where and why the process failed, (b) to decide and introduce the proper mitigating measures to address any underlying weaknesses.
With regard to the compliance review, our investigation determined that several tests and use cases with different actors were executed by our internal Security Auditors. In particular, the review included the following:
- Verify that the 2P rule is enforced by the RA for all EV TLS certs;
- Review implications with regards to the process/workflow of EV TLS application, validation, approval and enrollment;
- Identify any edge cases and review implementation for these.
Based on the test cases and the review notes, we consider the depth of the review to be satisfactory. The problem was that misuse of the API was not considered as an edge case; in other words, the process failed to widen the review to include such cases. Based on our analysis, the underlying weakness was that insufficient information was passed to the compliance team regarding the details of the RA system and its use in the validation process. This resulted in API misuse being overlooked as an edge case.
Our plan to address the above shortcomings in the compliance review process includes the following actions:
(a) collaboration between the compliance department, the engineering department and any other stakeholders shall be strengthened, such that all necessary information is duly passed to the compliance reviewers to minimize the risk of overlooking some aspect of the process in question (this action is already in progress: dedicated developer resources have been assigned to liaise with the compliance department, and our Change Management Policy is being updated to improve the flow of information and the level of inter-departmental coordination);
(b) technical documentation shall be extended to more fully specify all the steps of the validation process in detail. This documentation shall be used by the compliance department in preparing more complete compliance reviews and to broaden the scope (where required) of compliance testing;
(c) a registry of critical system components shall be made available to the compliance department and shall be consulted in every change review, so that any such critical component is considered in compliance reviews.
The plan is to complete action (a) by the end of October 2021, and actions (b) and (c) by the end of 2021.
Our investigation confirmed that the code review process has been in place, requires review by an additional developer, and is enforced in our code repository. Daily meetings, standups and collaboration tools are actively used within the development team (and thus by code reviewers) to request clarifications or raise issues when performing code reviews.
Based on our analysis, this particular bug was extremely difficult to detect via code review, since it required a combination of two conditions: a partially (1p) approved order and a misuse of an API which was not intended for TLS issuance. This highlights the fact that detection of such a bug in code is unlikely when reviewed outside of the context of the process.
Given the above, we believe that the detection of such bugs requires a more collaborative and systemic testing practice: a combination of code reviews, acceptance testing and compliance reviews that shares knowledge of both the system and the process.
We plan to update our software development lifecycle requirements to mandate more rigorous and collaborative testing and code review standards. This shall also extend automated testing (including unit and feature testing) of all critical areas of issuance and validation. Changes in our testing infrastructure (e.g. automated test environments) are part of this effort to increase our testing capacity. Going forward, we are also planning to adopt periodic code audits covering all abovementioned critical areas.
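As an illustration of the kind of automated test this could include, the sketch below checks an issuance rule exhaustively across every combination of entry point and approval count, so that a defect requiring two conditions at once is still exercised. Everything in it is assumed for the example: the endpoint names, the `violations` helper, and the expectation that an endpoint not intended for TLS never issues EV TLS certificates.

```python
from itertools import product

# Hypothetical entry points and approval counts to combine.
ENDPOINTS = ("ra_portal", "tls_api", "non_tls_api")
APPROVAL_COUNTS = (0, 1, 2)


def violations(may_issue_ev):
    """Return every (endpoint, approvals) pair where the rule gives the wrong answer."""
    bad = []
    for endpoint, approvals in product(ENDPOINTS, APPROVAL_COUNTS):
        # Assumed expectation: EV needs two approvals and a TLS-capable entry point.
        expected = approvals >= 2 and endpoint != "non_tls_api"
        if may_issue_ev(endpoint, approvals) != expected:
            bad.append((endpoint, approvals))
    return bad


def correct_policy(endpoint, approvals):
    # EV only with 2-person approval, and never via an endpoint not meant for TLS.
    return endpoint != "non_tls_api" and approvals >= 2


def buggy_policy(endpoint, approvals):
    # Models the class of defect in this report: one approval suffices on one path.
    if endpoint == "non_tls_api":
        return approvals >= 1
    return approvals >= 2
```

With these definitions, `violations(correct_policy)` is empty, while `violations(buggy_policy)` reports the `non_tls_api` combinations, which is the type of case a review limited to the documented entry points would not exercise.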
This update of our SDLC policies and procedures is already underway. Our plan is to finalize this update within the next month and proceed with implementation over the next two quarters.