ACCV: Certificates issued with Policy qualifiers other than id-qt-cps
Categories
(CA Program :: CA Certificate Compliance, task)
Tracking
(Not tracked)
People
(Reporter: jamador, Assigned: jamador)
Details
(Whiteboard: [ca-compliance] [ev-misissuance])
Attachments
(2 files)
On April 3, 2024, Sectigo notified our team about a revoked certificate that had been issued with the policy qualifier '1.3.6.1.5.5.7.2.2' (id-qt-unotice, user notice).
This qualifier should not have been included in the certificate profile, according to section '7.1.2.7.9 Subscriber Certificate Certificate Policies' of the Baseline Requirements for the Issuance and Management of Publicly-Trusted TLS Server Certificates (BR), version 2.0.0 and later.
A quick initial review found that certificates are no longer being issued with that qualifier.
We are examining the cause and impact of the problem, and we have already started checking which certificates are affected so that we can decide which actions to initiate and complete within the required deadlines.
Based on that first review, and pending confirmation, it does not appear that any active certificates have this non-compliance, because other processes had already led to their revocation (if there were any, ACCV would start the revocation process within five days).
This is a preliminary report. We expect to publish a report with the details and timeline shortly.
Thanks to the community (in this case Sectigo) for their help with these issues.
Comment 1 (Assignee) • 8 months ago
As indicated above, we confirm that there are no active certificates affected.
Comment 2 (Assignee) • 8 months ago
Comment 3 (Assignee) • 8 months ago
Incident Report
This is an updated report.
Summary
On April 3, 2024, Sectigo notified our team that certificates had been issued with the policy qualifier "1.3.6.1.5.5.7.2.2" (userNotice). This qualifier must not be included in the certificate profile, according to section "7.1.2.7.9 Subscriber Certificate Certificate Policies" of the Baseline Requirements for the Issuance and Management of Publicly-Trusted TLS Server Certificates (BR), version 2.0.0 (the profile requirements came into effect on September 15, 2023).
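The check behind this requirement can be sketched in a few lines. The following is only an illustration, assuming the certificatePolicies extension has already been decoded into a mapping from policy OID to a list of qualifier OIDs. The function name `find_disallowed_qualifiers` and the sample data are hypothetical; this is not ACCV's code nor the implementation used by any linter.

```python
# Policy qualifier OIDs defined in RFC 5280.
ID_QT_CPS = "1.3.6.1.5.5.7.2.1"      # CPS pointer: the only qualifier the profile permits
ID_QT_UNOTICE = "1.3.6.1.5.5.7.2.2"  # user notice: the qualifier reported in this incident


def find_disallowed_qualifiers(policies):
    """Return (policy_oid, qualifier_oid) pairs for every qualifier other than id-qt-cps.

    `policies` is a hypothetical, already-decoded view of the certificatePolicies
    extension: {policy OID: [qualifier OID, ...]}.
    """
    findings = []
    for policy_oid, qualifier_oids in policies.items():
        for qualifier_oid in qualifier_oids:
            if qualifier_oid != ID_QT_CPS:
                findings.append((policy_oid, qualifier_oid))
    return findings


# Illustrative data only; 2.23.140.1.2.2 is the CA/Browser Forum OV policy identifier.
sample = {"2.23.140.1.2.2": [ID_QT_CPS, ID_QT_UNOTICE]}
print(find_disallowed_qualifiers(sample))
```

An empty result means the decoded policies carry no qualifier other than the CPS pointer.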
Impact
There are no active certificates issued with this qualifier. We have checked: 379 certificates were issued with that qualifier after 15 September 2023, and all of them are revoked. Certificates are no longer issued with that qualifier.
Once the information about the issuance error had been received, processed, and confirmed, the technical team reviewed the most recently issued certificates and found that this field has not appeared in certificates since 2023-11-25. For this reason, it was not necessary to stop issuing certificates.
Timeline
All times are GMT+2.
2023-09-10:
- ACCV plans to roll out a new version that covers several issuance changes required by the policy update, including the removal of userNotice as a policy qualifier. This particular change is not propagated to the master production node (the last profile change is not applied there); it is propagated only to the backup node. Because the production lints do not implement this check, the error is not detected.
2023-09-15:
- CAs MUST use the updated certificate profiles adopted in BR version 2.0.0.
2023-11-25:
- An update of our systems causes the nodes to switch roles, and the node that had been updated with the correct profile becomes the master. From this moment on, userNotice no longer appears in the certificates.
2024-01-22:
- ZLint releases version 3.6.0, which includes detection of the disallowed policy qualifier.
2024-04-03:
- 10:42 An external observer from Sectigo sends an e-mail to accv@accv.es indicating a possible problem with the issuance of a certificate that has since been revoked.
- 11:00 The technical committee holds an urgent meeting to evaluate the incident and its scope.
- 12:20 Mis-issuance is confirmed. A report of impacted certificates is requested, and it is verified that certificates are currently being issued correctly.
- 13:30 The compliance committee confirms that all affected certificates are revoked.
- 14:01 ACCV responds to the external observer with the available information and thanks him for the notification.
- 16:30 The compliance committee meets again to investigate the cause of the error and how it occurred, and to draft the incident report.
2024-04-04:
- 09:53 ACCV files an initial Bugzilla bug with preliminary information: https://bugzilla.mozilla.org/show_bug.cgi?id=1889567
2024-04-05:
- 15:40 ACCV updates the Bugzilla bug to confirm that no active certificates are affected.
2024-04-06:
- 08:30 The compliance committee meets again together with the systems and development staff to discuss in detail the problem detected and the measures put in place to avoid it in the future. It is also confirmed that no similar errors have been detected in the rest of the deployments analyzed.
2024-04-08:
- 10:45 ACCV updates the Bugzilla with all the information collected.
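The dates in this timeline imply two intervals that are worth making explicit: how long the incorrect profile could have been in use after the new requirements took effect, and how long the problem went unreported. The short calculation below uses only the dates listed above; the variable names are illustrative.

```python
from datetime import date

profiles_mandatory = date(2023, 9, 15)  # BR 2.0.0 certificate profiles take effect
node_switch = date(2023, 11, 25)        # correctly updated node becomes master
zlint_release = date(2024, 1, 22)       # ZLint 3.6.0 adds the relevant check
external_report = date(2024, 4, 3)      # Sectigo notifies ACCV

# Days during which the outdated profile was live under the new requirements.
exposure_days = (node_switch - profiles_mandatory).days

# Days from the requirement taking effect until the external report.
undetected_days = (external_report - profiles_mandatory).days

# True when the lint appeared only after issuance had already been corrected,
# so pre-issuance linting could not have flagged a new certificate.
lint_arrived_after_fix = zlint_release > node_switch

print(exposure_days, undetected_days, lint_arrived_after_fix)
```

Because the ZLint check was released after the node switch, only re-linting already issued certificates could have surfaced the problem with that tool.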
Root Cause Analysis
In this case, the need for the change had been identified, as had the change to the order of the subject fields and other minor issues, but the implementation failed because of an error during deployment. Several causes can be established:
- A deployment system error on one of the nodes. From what we have investigated, a collision went uncaught when all the changes were applied. Several changes were merged across several updates, and one of them failed on one of the nodes. This error was not caught and went unnoticed because the system continued to function normally.
- No verification that the deployment had completed correctly. ACCV confirmed that issuance was working and that linting reported no problems, but it did not detect that this modification had not been applied correctly in production. The continuous integration process that automates the deployments did not confirm that the profile policies had been updated.
A further contributing cause is that, when the switch was performed and the secondary node came online, the resulting change in a certificate field went undetected, although in this case it happened to resolve the non-compliance.
The bug in the deployment system is already being corrected. With the new mechanism, a digest (hash) check of the profiles ensures that they are identical across all elements of the deployment (the planned task and the target nodes of the update). In this way ACCV ensures that the policies applied are the ones planned for the deployment and that they are the same on all nodes.
In addition, a manual task has been added as reinforcement to confirm that these changes have been applied, both in the profiles and in any other change that lends itself to this verification. Following another incident, ACCV has improved its detection process by adding personnel and strengthening controls, so we hope to avoid such incidents in the future.
Here we present the results of the “5 whys” root cause methodology that was followed:
Why was there a problem?
Because we were issuing certificates with the userNotice field when it was not allowed.
Why were we issuing certificates with this field?
Because a software bug in the deployment system let a failure go undetected when the changes were applied.
Why were the error and the non-compliance not detected?
Because exceptions and errors were caught only for code files, not for configuration files. In addition, no exhaustive manual check was carried out after deployment (and the linter in use did not detect the problem).
Why were these exceptions not captured?
Due to a problem with our continuous integration system at the time of deployments. In addition, the reviews and checks were performed by only one person.
Why was the review carried out by only one person?
Until the change to BR 2.0.0, a single reviewer had been sufficient. The modifications required by that change were numerous, however, and fields that had been in place since we began issuing certificates are no longer allowed. This has allowed us to identify this point of failure and to begin putting measures in place to address it.
Lessons Learned
What went well
The notification from the external source was received and processed promptly, and the compliance office carried out its initial investigation in a timely manner.
What didn't go well
- Errors occurred in update deployments that should have been detected. They were caused by bugs in the deployment system but also by human error in the review process.
- The procedure for reviewing changes to certificate profiles proved inadequate. It worked for a controlled and limited number of changes but failed for comprehensive changes such as the profile modifications established by BR 2.0.0.
- The latest versions of the linters we were using did not detect this issue.
Where we got lucky
- Due to previous incidents, the error reporting mechanism had been improved.
- The affected certificates had already been revoked through earlier processes, so there were no active certificates.
- Activating the secondary node, where the deployment had been applied correctly, stopped the incorrect issuance.
Action Items
Action Item | Kind | State | Due Date |
---|---|---|---|
The protocol for reviewing documentation associated with changes to certificate profiles will be improved. We will establish separate reviews by various team members following a matrix of certificate profile fields to check. The outcome of these reviews will be jointly evaluated by the compliance office. | Prevent | Done | 2024-04-05 |
Periodic sampling of already issued certificates is established to check them against new versions of the linters. | Detect | In progress | 2024-05-01 |
In addition to continuing to use ZLint, include pkilint as a complementary pre-lint tool. This tool seems to be updated more frequently to ensure compliance. Other linters that may help in the early detection of mis-issuances will also be assessed periodically. | Detect | In progress | 2024-04-15 |
Improve the system for monitoring update deployments and fix the system to verify that errors are captured and changes are consistent. | Prevent | In progress | 2024-05-01 |
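The periodic sampling action item above could look roughly like the following sketch. It assumes issued certificates are available as a mapping from serial number to some decoded representation, and that a lint is a callable returning a list of findings. `relint_sample`, `issued`, and `lint` are hypothetical names, and the sampling strategy (uniform random selection) is an assumption rather than ACCV's documented procedure.

```python
import random


def relint_sample(issued, lint, sample_size, seed=None):
    """Re-lint a random sample of already issued certificates.

    issued:      {serial: certificate representation} (hypothetical store)
    lint:        callable returning a list of findings for one certificate
    sample_size: how many certificates to check in this run
    Returns {serial: findings} for sampled certificates with at least one finding.
    """
    rng = random.Random(seed)
    serials = sorted(issued)  # stable order so a fixed seed gives a repeatable sample
    chosen = rng.sample(serials, min(sample_size, len(serials)))
    report = {}
    for serial in chosen:
        findings = lint(issued[serial])
        if findings:
            report[serial] = findings
    return report


# Illustrative data: each certificate is reduced to its list of policy qualifiers.
issued = {"01": ["cps"], "02": ["cps", "unotice"], "03": ["cps"]}
newer_lint = lambda qualifiers: [q for q in qualifiers if q != "cps"]
print(relint_sample(issued, newer_lint, sample_size=3))
```

Running the sample again whenever a linter is upgraded is what allows a newly added check to catch certificates that passed the older version.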
Appendix
Details of affected certificates
See attached file
Based on Incident Reporting Template v. 2.0
Comment 4 • 8 months ago
The following 13 certificates were issued by ACCV after 2023-09-15, contain User Notice policy qualifiers, are currently unrevoked, and are not listed in attachment 9395485. Jose, can you explain why ACCV's analysis missed these certificates?
Comment 5 (Assignee) • 8 months ago
Hi, Rob
We will look into the difference you have indicated and write shortly.
Thank you very much for your help.
Regards
Comment 6 (Assignee) • 8 months ago
Comment 7 (Assignee) • 8 months ago
We have found that, when compiling the list of affected certificates, we included only those for which a final certificate was delivered to the user. We omitted those for which delivery failed and which therefore never resulted in a certificate in use (these certificates were never put into use by the end user).
This was an error on our part when collecting the evidence: we counted the serial numbers of the final certificates delivered instead of those of all certificates generated. Attached is the complete list, including the orphan certificates (all revoked).
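The gap described here is a set difference: the affected population is every certificate the CA generated, while the first list was built from delivered certificates only. A minimal sketch with made-up serial numbers (`orphan_serials` and the data are hypothetical):

```python
def orphan_serials(generated, delivered):
    """Serials that were generated (signed) but never delivered to a subscriber."""
    return set(generated) - set(delivered)


# Illustrative serial numbers only.
generated = {"0A01", "0A02", "0A03", "0A04"}  # everything the CA signed
delivered = {"0A01", "0A03"}                  # what subscribers actually received

orphans = orphan_serials(generated, delivered)
print(sorted(orphans))
```

Scoping an incident from the `generated` set rather than the `delivered` set keeps orphan certificates in the analysis.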
Our apologies for the error.
Comment 8 • 8 months ago
(In reply to Jose Amador from comment #3)
Jose, the timeline states:
2023-11-25:
- An update of our systems causes the nodes to switch roles, and the node that had been updated with the correct profile becomes the master. From this moment on, userNotice no longer appears in the certificates.
We want to make sure we understand what you’re saying here.
- Are you stating that this bug was fixed as an unintended consequence of a different action on ACCV’s part?
- When did ACCV become aware that userNotice had ceased to appear in new certificates?
- When did ACCV understand the events that led to userNotice ceasing to appear?
Comment 9 (Assignee) • 8 months ago
Hi, Tim
This stems from a bug in our version deployment system that caused an uncaught error; it has already been identified and corrected. We had identified the need to remove the field and had planned the change along with others. Unfortunately, bugs happen.
Q1 Are you stating that this bug was fixed as an unintended consequence of a different action on ACCV’s part?
The cause of the mis-issuance was resolved when, during system maintenance, we switched to the second node, where the deployment had been performed correctly. That is why we listed it under where we got lucky. This happened on 2023-11-25.
Q2 When did ACCV become aware that userNotice had ceased to appear in new certificates?
We believed that it had been removed before the new profile came into effect. Because the lint did not identify the problem, no error was raised during issuance. In addition, ZLint added this lint in version 3.6.0, which went into production on 2024-01-22 (by which time we had already switched to the correct node). Neither the problem nor its disappearance was detected until the error was reported. From then on, the whole process was reviewed and the Bugzilla bug was created.
Q3 When did ACCV understand the events that led to userNotice ceasing to appear?
We became aware of the error in the deployment system once we investigated why certificates had been issued with that field. The error had not occurred previously. Based on these findings, the deployment system has been enhanced to identify these errors, and manual verification processes have been added to confirm the implementation of all changes.
Thank you very much
Comment 10 (Assignee) • 7 months ago
Update on actions.
After a two-week testing process, ACCV has deployed to production the version of the issuing system that introduces pkilint (v0.10.0) as a pre-linting mechanism. ACCV now uses both pkilint and ZLint as pre-lint tools. We hope this will help prevent future errors.
Action Item | Kind | State | Due Date |
---|---|---|---|
The protocol for reviewing documentation associated with changes to certificate profiles will be improved. We will establish separate reviews by various team members following a matrix of certificate profile fields to check. The outcome of these reviews will be jointly evaluated by the compliance office. | Prevent | Done | 2024-04-05 |
Periodic sampling of already issued certificates is established to check them against new versions of the linters. | Detect | In progress | 2024-05-01 |
In addition to continuing to use ZLint, include pkilint as a complementary pre-lint tool. This tool seems to be updated more frequently to ensure compliance. Other linters that may help in the early detection of mis-issuances will also be assessed periodically. | Detect | Done | 2024-04-15 |
Improve the system for monitoring update deployments and fix the system to verify that errors are captured and changes are consistent. | Prevent | In progress | 2024-05-01 |
Comment 11 • 7 months ago
- Could you please explain how deployments are done?
- What is the tooling you're using to automate the deployment?
- How did the deployment fail in one of your nodes, but not in the other?
- How often are you deploying new changes?
- How do you ensure that these deployments don't fail when you're patching, for example, a system dependency that has a critical vulnerability?
Comment 12 (Assignee) • 7 months ago
Thank you for your questions, Amir
Q) Could you please explain how deployments are done?
A) Once the testing process is finished, we plan a deployment with Jenkins. The development team releases a final version, which triggers the build and deployment of the application WAR.
Q) What is the tooling you're using to automate the deployment?
A) We are using Jenkins.
Q) How did the deployment fail in one of your nodes, but not in the other?
A) Because of a problem replacing a file that another application had open. This had not happened before (or we had not detected it). The deployment finished, but that file was not replaced. On the other node everything went fine.
Q) How often are you deploying new changes?
A) When necessary due to critical changes or when several minor changes have accumulated.
Q) How do you ensure that these deployments don't fail when you're patching, for example, a system dependency that has a critical vulnerability?
A) This continuous deployment system covers only the software that we develop. OS updates and patches, and third-party software releases (including security releases), are carried out by the Systems team with the assistance of the software manufacturer (or at least following its update guidelines). If errors are detected in one of our deployments, they are reported by console and e-mail so that rollback mechanisms can be initiated. Unfortunately, in this case the error was not detected, which led to the mis-issuance and this report.
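The failure mode described in these answers (a deployment step that finishes even though a file was not replaced) is the kind of thing a post-condition check on each step can surface. The sketch below is an assumption about how such a step could be written in a Python deployment script. It is not ACCV's Jenkins job, and `replace_and_verify`, the file names, and the file contents are hypothetical.

```python
import hashlib
import os
import shutil
import tempfile


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()


def replace_and_verify(source, target):
    """Copy `source` over `target`, then fail loudly unless the contents match."""
    try:
        shutil.copyfile(source, target)
    except OSError as exc:
        # Covers locked files, missing directories, permission problems, and so on.
        raise RuntimeError(f"could not replace {target}: {exc}") from exc
    if sha256_of(source) != sha256_of(target):
        raise RuntimeError(f"{target} does not match {source} after deployment")


# Demonstration with throwaway files; names and contents are illustrative only.
with tempfile.TemporaryDirectory() as tmp:
    new_profile = os.path.join(tmp, "profile.new")
    live_profile = os.path.join(tmp, "profile.live")
    with open(new_profile, "w") as fh:
        fh.write("policy_qualifiers = cps\n")
    with open(live_profile, "w") as fh:
        fh.write("policy_qualifiers = cps, unotice\n")
    replace_and_verify(new_profile, live_profile)
    print(sha256_of(new_profile) == sha256_of(live_profile))
```

Raising an exception, rather than logging and continuing, is what makes the pipeline stop and report the node as failed.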
Thank you!
Comment 13 • 7 months ago
This had not happened before (or we had not detected it).
Have you checked the previous invocations to see if you had this failure before and you missed it?
Some further questions:
ACCV has about 3600 active precerts in CT according to crt.sh. This means that the 378 certs that ACCV initially attached were revoked for other reasons before this incident was discovered.
Does that mean that ACCV had revoked about 10% of their active certs in routine operations?
How many other certs that weren't impacted by this incident (and that were delivered to subscribers that are not ACCV, e.g. not test certificates) were revoked between 2023-09-15 and 2024-01-10?
For the revoked certs (both impacted and not impacted by this incident), why were they revoked? (Note: I'm not necessarily asking for a reason code breakdown, but more "who made the revocation request and why?")
Comment 14 (Assignee) • 7 months ago
Hi, Amir
We have reviewed previous deployments and no similar issues have been detected.
With regard to further questions.
At the end of last year, for commercial reasons, we launched a promotion and replaced many of our certificates, offering very advantageous economic terms for as long as the relationship was maintained, depending on the type of user and their tenure with us. In addition, as a way of gaining market share and user loyalty, we were able to switch easily from one platform to another for issuing certificates. These certificates have been replaced by others. We understand that commercial agreements with our customers are not a problem.
As a result, the revocation impact of the detected failures was much smaller in this case.
In the interval you have indicated, 16 certificates not affected by the incident were revoked.
The rest of the certificates affected by this bug were revoked on those dates through the process described above. The commercial promotion ended at the end of January, so many were revoked during that month. Fortunately, communication with our users greatly facilitates these commercial initiatives. This ability is one of the goals we advocate for when it comes to improving the ecosystem.
Thank you very much!
Comment 15 • 7 months ago
Thanks for the response. I'll have to say, the numbers here are not adding up in my head.
Why this feels odd from my perspective:
In the initial incident report, there was a claim that 379 certificates were impacted, and revoked. Beyond that, in the previous comment, you mention that there were another 16 certificates that were revoked in the same time period.
So, in that time period, you had 395 revocations. That means ~95% of your revocations in that time period happened to be exactly these misissued certificates.
Are my numbers correct? If so, do you know what caused such a high percentage of your revoked certs in that time frame to be these misissued ones?
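The arithmetic behind this question can be reproduced from the two figures quoted in the thread (379 impacted and revoked certificates, 16 other revocations). The variable names are illustrative.

```python
impacted_revoked = 379  # revoked certificates affected by this incident
other_revoked = 16      # revocations in the same period not affected by it

total_revoked = impacted_revoked + other_revoked
share_impacted = impacted_revoked / total_revoked

print(total_revoked, f"{share_impacted:.1%}")
```

The share comes out just under 96%, in line with the roughly 95% figure used in the question.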
Comment 16 (Assignee) • 7 months ago
Hello, Amir
As we have said, two circumstances coincided in time in this case: the commercial initiative we undertook at the end of the year, which involved a change of platform, and the error that caused the incorrect issuance of certificates. When we came to revoke the certificates after detecting the error, it turned out that they had already been revoked during that process.
All TLS certificates issued in that period were impacted by the issuance error, so it is normal that the revoked certificates for that particular profile would match. In that respect we were lucky. There is no other explanation for this.
As an additional fact, because of Bug 1884532 we had to revoke more than 800 certificates (approximately 30% of our active TLS certificates) within a few days of detecting that incident on March 9th. Many more certificates were impacted than in the case under discussion, including those generated during the end-of-year change, which caused us additional problems with users. In that case we were unlucky and the certificates were active, but we were able to manage it in a reasonable time because we detected the problem ourselves. As you can see, we are able to handle these mass revocations relatively efficiently (and we hope to improve with the lessons learned).
In fact, Bug 1884532 predated this one by a few weeks, so we would have found all the certificates impacted by this case already revoked even if we had not replaced them. The situation for revocation purposes would be exactly the same.
We hope that, with all the measures being taken, these cases will not happen again, because they are a huge problem at all levels for the community, for us, and for our users. For our part, until a few months ago we had not had any similar incidents for many years.
As always thank you very much for your questions.
Comment 17 • 7 months ago
Understood, thank you for the information!
Comment 18 (Assignee) • 7 months ago
Update on actions
We have improved the deployment system by introducing a subsequent checksum process to programmatically ensure that what is deployed on all nodes is the same as what was planned for the deployment. Once the deployment is completed, the checksum is performed by hashing the deployment content at all locations, comparing the output and returning the result. With this process the error is detected if there are differences in the deployments.
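The post-deployment comparison described in this paragraph can be sketched as follows. This is a simplified illustration under stated assumptions, not ACCV's implementation: the file contents are passed in as bytes instead of being read from each node, SHA-256 is assumed as the hash, and `build_manifest`, `find_drift`, the node names, and the file name are hypothetical.

```python
import hashlib


def build_manifest(files):
    """Map each file name to the SHA-256 hex digest of its bytes."""
    return {name: hashlib.sha256(content).hexdigest() for name, content in files.items()}


def find_drift(planned, nodes):
    """Compare every node's manifest with the planned manifest.

    Returns {node name: sorted file names that are missing, extra, or different}.
    Nodes that match the plan exactly are omitted from the result.
    """
    drift = {}
    for node_name, deployed in nodes.items():
        names = set(planned) | set(deployed)
        mismatched = sorted(n for n in names if planned.get(n) != deployed.get(n))
        if mismatched:
            drift[node_name] = mismatched
    return drift


# Illustrative content: the master node still carries the old profile.
planned = build_manifest({"tls_profile.conf": b"policy_qualifiers = cps\n"})
nodes = {
    "master": build_manifest({"tls_profile.conf": b"policy_qualifiers = cps, unotice\n"}),
    "backup": build_manifest({"tls_profile.conf": b"policy_qualifiers = cps\n"}),
}
print(find_drift(planned, nodes))
```

Comparing against the planned manifest, and not only node against node, also catches the case where every node received the same wrong content.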
For periodic sampling, pkilint has been added alongside ZLint, and the staff have been trained in the use of the new tool. The staff involved in this function will perform at least the same tests as those performed in the post-issuance linting process.
Action Item | Kind | State | Due Date |
---|---|---|---|
The protocol for reviewing documentation associated with changes to certificate profiles will be improved. We will establish separate reviews by various team members following a matrix of certificate profile fields to check. The outcome of these reviews will be jointly evaluated by the compliance office. | Prevent | Done | 2024-04-05 |
Periodic sampling of already issued certificates is established to check them against new versions of the linters. | Detect | Done | 2024-05-01 |
In addition to continuing to use ZLint, include pkilint as a complementary pre-lint tool. This tool seems to be updated more frequently to ensure compliance. Other linters that may help in the early detection of mis-issuances will also be assessed periodically. | Detect | Done | 2024-04-15 |
Improve the system for monitoring update deployments and fix the system to verify that errors are captured and changes are consistent. | Prevent | Done | 2024-05-01 |
Comment 19 (Assignee) • 7 months ago
No further action is pending. We are monitoring this bug.
Comment 20 • 3 months ago
Have any new potential action items been discovered? Or can this incident be closed?
Comment 21 (Assignee) • 3 months ago
Hi, Ben
There are no outstanding issues. Can be closed.
Thank you very much.