Open Bug 1889567 Opened 3 months ago Updated 2 months ago

ACCV: Certificates issued with Policy qualifiers other than id-qt-cps

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

ASSIGNED

People

(Reporter: jamador, Assigned: jamador)

Details

(Whiteboard: [ca-compliance] [ev-misissuance])

Attachments

(2 files)

On April 3rd our team was notified by Sectigo about one revoked certificate that was issued with Policy Qualifier '1.3.6.1.5.5.7.2.2' (UNotice).
This field should not have been included in the certificate profile according to section '7.1.2.7.9 Subscriber Certificate Certificate Policies' of the Baseline Requirements for the Issuance and Management of Publicly-Trusted TLS Server Certificates (BR) version 2.0.0 and later.
A quick initial review found that certificates are no longer being issued with that qualifier.
We are examining the cause and impact of the problem, and we have already started checking which certificates are affected so that we can determine what actions to initiate and complete within the applicable deadlines.
Based on that first review, and pending confirmation, it does not appear that there are active certificates with this non-compliance, because of other processes (if there were, ACCV would initiate the revocation process in less than five days).

This is a preliminary report. We expect to provide a report with the details and timeline shortly.

Thanks to the community (in this case Sectigo) for their help with these issues.

Assignee: nobody → jamador
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Whiteboard: [ca-compliance]

As indicated above, we confirm that there are no active certificates affected.

Incident Report

This is an updated report.

Summary

On April 3 our team was notified by Sectigo that certificates had been issued with Policy Qualifier "1.3.6.1.5.5.7.2.2" (userNotice). This field must not be included in the certificate profile according to section “7.1.2.7.9 Subscriber Certificate Certificate Policies” of the Baseline Requirements for the Issuance and Management of Publicly-Trusted TLS Server Certificates (BR) version 2.0.0 (the application of the profile came into effect on September 15, 2023).
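
As a rough illustration of the rule cited above, the following Python sketch flags any policy qualifier other than id-qt-cps. It assumes the qualifier OIDs have already been extracted from the certificate's certificatePolicies extension by some parser; the function name and input shape are hypothetical and do not represent ACCV's systems or any linter's actual implementation.

```python
# Policy qualifier OIDs defined in RFC 5280.
ID_QT_CPS = "1.3.6.1.5.5.7.2.1"      # CPS pointer (URI)
ID_QT_UNOTICE = "1.3.6.1.5.5.7.2.2"  # user notice


def disallowed_qualifiers(qualifier_oids):
    """Return the qualifier OIDs that are not id-qt-cps, preserving order.

    qualifier_oids is assumed to be a list of dotted-decimal OID strings
    taken from one certificate's policy qualifiers (hypothetical input).
    """
    return [oid for oid in qualifier_oids if oid != ID_QT_CPS]


# A certificate carrying both a CPS pointer and a user notice is flagged.
print(disallowed_qualifiers([ID_QT_CPS, ID_QT_UNOTICE]))
```

An empty result means the certificate carries only the permitted qualifier type; any other result lists the qualifiers that would need review.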

Impact

There are no active certificates issued with this qualifier. We have checked and 379 certificates were issued with that qualifier after 15 September 2023 and they are all revoked. No certificates are now issued with that qualifier.

Once the information regarding the issuance error had been received, processed and confirmed, the technical team reviewed the most recently issued certificates and found that the field has not appeared in certificates since 2023-11-25. For this reason it has not been necessary to stop issuing certificates.

Timeline

All times are GMT+2.

2023-09-10:

  • ACCV plans to roll out a new version that controls various aspects of policy change issuance, including the removal of userNotice as a policy qualifier, but this particular change is not propagated to the master production node (the last change in the profile was not updated). This change is instead propagated to the backup node. As the check is not implemented in the production lints, the error is not detected.

2023-09-15:

  • CAs MUST use the updated Certificate Profiles passed in Version 2.0.0 BR

2023-11-25:

  • An update of our systems causes the nodes to switch, making the node that had been updated with the correct profile the master. From this moment on, the userNotice does not appear in the certificates.

2024-01-22:

  • ZLint releases version 3.6.0, which includes detection of the disallowed policy qualifier.

2024-04-03:

  • 10:42 An external observer from Sectigo sent an e-mail to the account accv@accv.es indicating a possible issuance problem in a revoked certificate.
  • 11:00 Urgent meeting of the technical committee to evaluate the incident and the scope.
  • 12:20 Mis-issuance confirmed. A report of impacted certificates is requested, and it is verified that certificates are currently being issued correctly.
  • 13:30 The compliance committee confirms that all affected certificates are revoked.
  • 14:01 ACCV responds to the external reporter with the available information, thanking him for the notification.
  • 16:30 The compliance committee meets again to investigate the cause of the error and how it occurred, and to draft the incident report.

2024-04-04:

2024-04-05:

  • 15:40 ACCV updates the Bugzilla to confirm that no active certificates are affected.

2024-04-06:

  • 08:30 The compliance committee meets again together with the systems and development staff to discuss in detail the problem detected and the measures put in place to avoid it in the future. It is also confirmed that no similar errors have been detected in the rest of the deployments analyzed.

2024-04-08:

  • 10:45 ACCV updates the Bugzilla with all the information collected.

Root Cause Analysis

In this case, although the need for the change was detected, as with the order of the fields in the subject and other minor issues, the implementation failed when an error occurred in the deployment. Several causes can be established:

  • A deployment system error on one of the nodes. From what we have investigated, there was an uncaught collision when applying all the changes. Several changes were merged across several updates, and one of them failed on one of the nodes. This error was not caught and went unnoticed because the system continued to function normally.

  • A failure to verify that the deployment had completed correctly. ACCV confirmed that issuance was working and that linting reported no problems, but did not detect that this modification had not been applied correctly in production. The continuous integration process that automates the deployments did not confirm that the profile policies had been updated.

A further contributing cause is that, when the switch was performed and the secondary node came online, the resulting change in a certificate field was not detected, although in this case it fortunately resolved the non-compliance.

The bug in the deployment system is already being corrected. With the new mechanism, a digest (hash) check of the profiles is made to ensure that they are identical across all elements of the deployment (the planned task and the target nodes of the update). In this way ACCV ensures that the policies applied are the ones planned in the deployment and that they are the same on all nodes.

In addition, a non-automated manual task has been added as a reinforcement to confirm the application of these changes, both in the profiles and in any other change susceptible to this verification. Due to another incident, ACCV has improved its detection process, including more personnel and improving controls so we hope to avoid such incidents in the future.

Here we present the results of the “5 whys” root cause methodology that was followed:

Why was there a problem?

Because we were issuing certificates with the userNotice field when it was not allowed.

Why were we issuing certificates with this field?

Because at the time of applying the changes an error occurred due to a software bug that did not detect a problem when applying the modifications.

Why was the error and non-compliance not detected?

Because exceptions or errors referring to configuration files were not caught, only to code files. In addition, an exhaustive review was not carried out after deployment by means of a manual check (the linter used did not detect it).

Why were these exceptions not captured?

Due to a problem with our continuous integration system at the time of deployments. In addition, the reviews and checks were performed by only one person.

Why was the review carried out by only one person?

Up to the moment of the change to BR 2.0.0, one reviewer had been sufficient, but the modifications required by the change have been numerous, and fields that had been in place since we began issuing certificates are no longer allowed. This has allowed us to identify this point of failure and to begin putting measures in place to address it.

Lessons Learned

What went well

Notification from the external source has been received and processed in a timely manner and the initial investigations conducted by the compliance office were carried out in a timely manner.

What didn't go well

  • Errors have occurred in update deployments that should have been detected. These errors have occurred due to bugs in the deployment system but also due to human error in the review process.
  • The procedure for reviewing changes to certificate profiles has proven to be inadequate. It was working in the case of controlled and limited number of changes but has failed in the case of comprehensive changes such as the policy modification established by BR 2.0.0.
  • The latest versions of linters that we were using have not detected this issue.

Where we got lucky

  • Due to previous incidents, the error reporting mechanism had been improved.
  • The certificates affected by previous processes had also been revoked, so there were no active certificates.
  • The activation of the secondary node, where the deployment had been applied correctly, stopped the incorrect issuance.

Action Items

Action Item | Kind | State | Due Date
----------- | ---- | -------- | -------------
The protocol for reviewing documentation associated with changes to certificate profiles will be improved. We will establish separate reviews by various team members following a matrix of certificate profile fields to check. The outcome of these reviews will be jointly evaluated by the compliance office. | Prevent | Done | 2024-04-05
Periodic sampling of the certificates already issued is established to check them against the new versions of the lint. | Detect | In progress | 2024-05-01
In addition to continuing to use ZLint, include pkilint as a complementary tool for pre-lint. This tool seems to be updated more frequently to ensure compliance. Other lints that may help in the early detection of mis-issuances will also be assessed periodically. | Detect | In progress | 2024-04-15
Improve the system for monitoring update deployments and fix the system to verify that errors are captured and changes are consistent. | Prevent | In progress | 2024-05-01

Appendix

Details of affected certificates

See attached file

Based on Incident Reporting Template v. 2.0

The following 13 certificates were issued by ACCV after 2023-09-15, contain User Notice policy qualifiers, are currently unrevoked, and are not listed in attachment 9395485. Jose, can you explain why ACCV's analysis missed these certificates?

crt.sh ID SHA-256(Certificate) Serial Number
10976782713 67F6F709AE69E39D3E26909A748241BB942D59A77D64CDE253503F4CB017BA35 244838BADBDDB85ED82E522808552169
10976793176 867BAFFB51BACDCAA55911A99DE9D037F5C069DFA2613C97747386D25AAA8D48 5849AFE6EB0CA512702240104773E1BA
10976898956 9813CEA1E6634FE2AA229A9368799A867E9BC18C929E99AC22B3D92BE272C588 14769D80D6EEBBBB8BEC245235894636
10976916630 65A0DA10672D1D029A84C1C102EC2A4276923E67F96EFADD97B3169FE3045B3B 7D5AA0D0E7E6688D419BB0989DF68714
10983158220 BBCACBB799D485B20BD9AFA5FAFF817FEE48BF792F8D1A14C9466F6B2B75888F 5D9C304DA065D21B988969FB50E58400
10983167431 E5C132D945E69CAA1A7E214A0D5E41D2C6A05EE822F04B06C8215D5DEAB5C97D 38377441A4BAADB70CCD4FF06049AFCA
10983176062 950411239EDBA08F998FD8DA2784E975D1D9805B799FDCAFB95EB77C538B2077 65E65A587A247B2ECCA3FE58E48BD6AC
10985130249 C350DADFE674F8B92490D8292232894E084EDFE7867F3A86138DC32330C50683 595F4B36BE7312BE10A1017366EA1390
10985808498 73F200ACCB97C64C2108F21F999DF5A77BC8CE1815D29EC6794C46D0BA6E0531 79E746D759AF0DDF74B8B6992FA4291C
11012719222 C9828C90EA895FEA5C43DAE27C596E001BE07D025CD8998ED2D084F6AD257338 79C8F0D17A709E94DE99B9C001099D3A
11047401191 DE0C579F97C9B372D62B30079E4788EB46F8727EFB8768EA9BA70F8DF0F440A2 1CB1C99864B79F27175E6E5186CCFAD7
11047416828 9F0486E820956F7FC93573D0B21482F3EDF8B5A4B980A62056CB367BA43DB9B8 2ED76BBFCF06239F0A6D50AF0F7D514E
11083593177 19EDD011A6AD2F82A699C26BE46A9051150BEE24D8BF9C6E147476D9F844C219 3B90270DC8DF817DA5B918649F2B7754
Flags: needinfo?(jamador)

Hi, Rob
We will look into the difference you have indicated and reply shortly.
Thank you very much for your help.
Regards

Flags: needinfo?(jamador)

We have checked, and when we sent the list of affected certificates we included only those for which a final certificate was delivered to the user. We omitted those whose delivery failed with an error and which therefore never ended up as a certificate in use (these certificates were never put into use by the end user).
This was an error on our part when collecting the evidence: we counted the serial numbers of the final certificates delivered instead of those generated. Attached is the complete list, including orphan certificates (all revoked).
Our apologies for the error.
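
The gap described above, between certificates delivered to users and certificates actually generated, can be illustrated with a simple set comparison. This is a minimal sketch only: the serial numbers are placeholders, and the two input lists (one from an issuance log, one from delivery records) are assumed for illustration and do not reflect ACCV's actual data sources.

```python
def orphan_serials(generated, delivered):
    """Return serials that were generated but never delivered to a subscriber."""
    return sorted(set(generated) - set(delivered))


# Placeholder data: every serial the CA signed vs. serials handed to users.
generated = ["0A01", "0A02", "0A03", "0A04"]
delivered = ["0A01", "0A03"]
print(orphan_serials(generated, delivered))
```

Building the affected list from the generated serials rather than the delivered ones means orphan certificates are included by construction.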

(In reply to Jose Amador from comment #3)
Jose, the timeline states:

2023-11-25:

  • An update of our systems causes the nodes to switch, making the node that had been updated with the correct profile the master. From this moment on, the userNotice does not appear in the certificates.

We want to make sure we understand what you’re saying here.

  • Are you stating that this bug was fixed as an unintended consequence of a different action on ACCV’s part?
  • When did ACCV become aware that userNotice had ceased to appear in new certificates?
  • When did ACCV understand the events that led to userNotice ceasing to appear?

Hi, Tim

This stems from a bug causing an uncaught error in our version deployment system, which has already been identified and corrected. We had identified the need to remove the field and planned the change along with others. Unfortunately bugs happen.

Q1 Are you stating that this bug was fixed as an unintended consequence of a different action on ACCV’s part?
The cause of the mis-generation was resolved when, during system maintenance, we switched to the second node, where the deployment had been performed correctly. That is why we listed it under where we got lucky. This happened on 2023-11-25.

Q2 When did ACCV become aware that userNotice had ceased to appear in new certificates?
We thought that it had been eliminated before the new profile came into effect. As the lint did not identify the problem, no error was raised during generation. In addition, ZLint included this lint in version 3.6.0, which went into production on 2024-01-22 (by which time we had already switched to the correct node). Neither the problem nor its disappearance was detected until the error was reported. From then on, the whole process was reviewed and this Bugzilla bug was created.

Q3 When did ACCV understand the events that led to userNotice ceasing to appear?

We became aware of the error in the deployment system once we investigated why certificates had been issued with that field. The error had not occurred previously. Based on these findings, the deployment system has been enhanced to identify these errors and include manual verification processes to confirm the implementation of all changes.

Thank you very much

Update on actions.

After a two-week testing process ACCV has deployed to production the version of the issuing system that introduces pkilint (v0.10.0) as a pre-linting mechanism. ACCV is currently using pkilint and zlint as pre-lint tools. We hope that in this way we can prevent future errors.

Action Item | Kind | State | Due Date
----------- | ---- | -------- | -------------
The protocol for reviewing documentation associated with changes to certificate profiles will be improved. We will establish separate reviews by various team members following a matrix of certificate profile fields to check. The outcome of these reviews will be jointly evaluated by the compliance office. | Prevent | Done | 2024-04-05
Periodic sampling of the certificates already issued is established to check them against the new versions of the lint. | Detect | In progress | 2024-05-01
In addition to continuing to use ZLint, include pkilint as a complementary tool for pre-lint. This tool seems to be updated more frequently to ensure compliance. Other lints that may help in the early detection of mis-issuances will also be assessed periodically. | Detect | Done | 2024-04-15
Improve the system for monitoring update deployments and fix the system to verify that errors are captured and changes are consistent. | Prevent | In progress | 2024-05-01

  • Could you please explain how deployments are done?
  • What is the tooling you're using to automate the deployment?
  • How did the deployment fail in one of your nodes, but not in the other?
  • How often are you deploying new changes?
  • How do you ensure that these deployments don't fail when you're patching, for example, a system dependency that has a critical vulnerability?
Whiteboard: [ca-compliance] → [ca-compliance] [ev-misissuance]

Thank you for your questions, Amir

Q) Could you please explain how deployments are done?
A) Once the testing process is finished, we plan a deployment with Jenkins. The development team releases a final version that triggers the build and deployment of the application WAR.

Q) What is the tooling you're using to automate the deployment?
A) We are using Jenkins.

Q) How did the deployment fail in one of your nodes, but not in the other?
A) Due to a problem when trying to replace a file that was held open by another application. This had not happened before (or we had not detected it). The deployment finished, but that file was not replaced. On the other node everything went fine.

Q) How often are you deploying new changes?
A) When necessary due to critical changes or when several minor changes are accumulated.

Q) How do you ensure that these deployments don't fail when you're patching, for example, a system dependency that has a critical vulnerability?
A) This continuous deployment system covers only the software that we develop. OS updates/patches and third-party software releases (including security ones) are carried out by the Systems team with the assistance of the software manufacturer (or at least following its update guidelines). If errors are detected in one of our deployments, they are reported by console and e-mail so that rollback mechanisms can be initiated. Unfortunately, in this case the error was not detected, which led to the mis-issuance and this report.
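
The failure mode described in these answers (a file that could not be replaced while the deployment still finished) suggests checking each replaced file before reporting success. The following is a minimal sketch under stated assumptions, not ACCV's Jenkins pipeline: the function names and paths are hypothetical, and it uses only the Python standard library.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the hex SHA-256 digest of a file's content."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def deploy_file(source, target):
    """Copy source over target and raise if the target does not match afterwards."""
    # copyfile may itself raise OSError if the target cannot be written.
    shutil.copyfile(source, target)
    # Re-read both files so a silently skipped replacement is not reported as success.
    if sha256_of(source) != sha256_of(target):
        raise RuntimeError(f"{target} was not updated to match {source}")
```

Raising an exception, rather than logging and continuing, lets the calling job mark the deployment as failed and trigger the notification and rollback path.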

Thank you!

This had not happened before (or we had not detected it).

Have you checked the previous invocations to see if you had this failure before and you missed it?


Some further questions:

ACCV has about 3600 active precerts in CT according to crt.sh. This means that the 378 certs that ACCV initially attached were revoked for reasons before this incident was discovered.

That means that ACCV had revoked about 10% of their active certs in routine operations?

How many other certs that weren't impacted by this incident (that were delivered to subscribers that are not ACCV, e.g. not test certificates) were revoked between 2023-09-15 and 2024-01-10?

Of the revoked certs (both impacted and not impacted by this incident), why were they revoked? (Note: I'm not necessarily asking for a reason code breakdown, but more "who made the revocation request and why".)

Flags: needinfo?(jamador)

Hi, Amir

We have reviewed previous deployments and no similar issues have been detected.

With regard to further questions.

At the end of last year, for commercial reasons, we launched a promotion and replaced many of our certificates, offering a very advantageous economic agreement for as long as the relationship was maintained, depending on the type of user and the permanence with us. In addition, as a way of gaining market share and user loyalty, we were able to easily switch from one platform to another in terms of issuing certificates. These certificates have been replaced by others. We understand that commercial agreements with our customers are not a problem.

This has had the effect that in this case the impact on revocation due to the failures detected has been much less.

In the interval you have indicated, 16 certificates not affected by the incident have been revoked.

The rest of the certificates affected by this Bugzilla have been revoked on those dates with the process indicated above. The commercial promotion ended at the end of January so there are many that were revoked during that month. Fortunately communication with our users greatly facilitates these commercial initiatives. This ability is one of the goals we advocate for when it comes to improving the ecosystem.

Thank you very much!

Flags: needinfo?(jamador)

Thanks for the response. I'll have to say, the numbers here are not adding up in my head.

Why this feels odd from my perspective:

In the initial incident report, there was a claim that 379 certificates were impacted, and revoked. Beyond that, in the previous comment, you mention that there were another 16 certificates that were revoked in the same time period.

So, in that time period, you had 395 revocations. That means ~95% of your revocations in that time period happened to happen to exactly these misissued certificates.

Are my numbers correct? If so, do you know what happened that caused such a high percentage of your revoked certs in that time frame be these misissued ones?

Flags: needinfo?(jamador)

Hello, Amir

As we have commented, in this case two circumstances coincided in time: the commercial initiative at the end of the year, which involved a change of platform, and the error that caused the incorrect issuance of certificates. So when we had to perform the revocation after detecting the error, it turned out that the certificates had already been revoked during that process.

All TLS certificates issued in that period were impacted by the issuance error, so it is normal that the revoked ones for that particular profile would match. So to say, in this case we were lucky in that respect. There is no other explanation for this.

As an additional fact, due to Bugzilla 1884532 on March 9th we had to revoke more than 800 certificates (approximately 30% of our active TLS certificates) in a few days since we detected the incident (many more certificates impacted than in the case we are talking about and including those generated during the end of year change which caused us additional problems with users). In this case we were unlucky and the certificates were active, but we were able to manage it in a reasonable time since the problem was detected by us, and as you can see we have the ability to solve these mass revocation problems in a relatively efficient way (we hope to improve with lessons learned).
In fact bugzilla 1884532 predated this one by a few weeks so we would have found all the certificates impacted by this case revoked as well, even if we had not replaced them. The situation for revocation purposes would be exactly the same.

We hope that with all the measures that are being taken, these cases will not happen again, because they are a huge problem at all levels for the community, for us and for our users. For our part, until a few months ago, we had not had any similar incidents for many years.

As always thank you very much for your questions.

Flags: needinfo?(jamador)

Understood, thank you for the information!

Update on actions

We have improved the deployment system by introducing a subsequent checksum process to programmatically ensure that what is deployed on all nodes is the same as what was planned for the deployment. Once the deployment is completed, the checksum is performed by hashing the deployment content at all locations, comparing the output and returning the result. With this process the error is detected if there are differences in the deployments.
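
The comparison described above can be sketched roughly as follows. This is an illustrative outline only, not ACCV's implementation: it assumes the planned content and each node's deployed content are available as mappings of relative path to bytes, and the node names, file names and contents are hypothetical placeholders.

```python
import hashlib


def manifest(files):
    """Map each relative path to the hex SHA-256 digest of its content."""
    return {name: hashlib.sha256(content).hexdigest() for name, content in files.items()}


def mismatched_nodes(planned, nodes):
    """Return the names of nodes whose deployed content differs from the plan."""
    expected = manifest(planned)
    return sorted(node for node, files in nodes.items() if manifest(files) != expected)


# Hypothetical example: one node still carries a stale profile.
planned = {"profiles/tls.cfg": b"qualifiers=cps"}
nodes = {
    "primary": {"profiles/tls.cfg": b"qualifiers=cps,unotice"},
    "backup": {"profiles/tls.cfg": b"qualifiers=cps"},
}
print(mismatched_nodes(planned, nodes))
```

Comparing every node against the planned content, rather than only against each other, also catches the case where all nodes agree but none received the intended change.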

For periodic sampling, pkilint has been added alongside zlint, and the staff have been trained in the use of the new tool. The staff involved in this function will perform at least the same tests as those performed in the post-linting process at issuance.

Action Item | Kind | State | Due Date
----------- | ---- | -------- | -------------
The protocol for reviewing documentation associated with changes to certificate profiles will be improved. We will establish separate reviews by various team members following a matrix of certificate profile fields to check. The outcome of these reviews will be jointly evaluated by the compliance office. | Prevent | Done | 2024-04-05
Periodic sampling of the certificates already issued is established to check them against the new versions of the lint. | Detect | Done | 2024-05-01
In addition to continuing to use ZLint, include pkilint as a complementary tool for pre-lint. This tool seems to be updated more frequently to ensure compliance. Other lints that may help in the early detection of mis-issuances will also be assessed periodically. | Detect | Done | 2024-04-15
Improve the system for monitoring update deployments and fix the system to verify that errors are captured and changes are consistent. | Prevent | Done | 2024-05-01

No further action is pending. We are monitoring this bug.
