Open Bug 1712664 Opened 2 months ago Updated 4 hours ago

iTrusChina: verification errors for the roots' CRLs(ARL)

Categories

(NSS :: CA Certificate Compliance, task)

Tracking

(Not tracked)

ASSIGNED

People

(Reporter: vTrus_contact, Assigned: vTrus_contact)

Details

(Whiteboard: [ca-compliance] Next update 2021-09-01)

Attachments

(1 file)

Steps to reproduce:

Our offline CA's ARL system has a design bug: a newly added extension item was not included in the data that was signed, so signature verification fails.

Actual results:

crt.sh reports verification errors for our roots' CRLs.

Type: defect → task
  1. How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in the MDSP mailing list, a Bugzilla bug, or internal self-audit), and the time and date.

Andrew Ayer reported this issue in the iTrusChina root inclusion public discussion on 21/05/2021 13:22 UTC, and we noticed it on 24/05/2021 01:20 UTC.

  2. A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.

24/05/2021 01:43 UTC Started troubleshooting.
24/05/2021 02:53 UTC The R&D team found that the CRL error was caused by a signature verification failure in the ARL, and began to analyze the reason for the failure.
24/05/2021 04:11 UTC Tested the subscriber certificate CRL; the problem was not reproduced.
24/05/2021 05:49 UTC Issued new ARLs in the test environment to check for the problem. The issue was reproduced, and we started troubleshooting the code.
24/05/2021 07:10 UTC Found the bug in the offline CA's ARL signing process and started to fix it.
24/05/2021 07:30 UTC The code was fixed.
24/05/2021 08:43 UTC The R&D team tested the fixed code by reissuing the ARL; the problem was resolved in the test environment.

  3. Whether your CA has stopped, or has not yet stopped, certificate issuance or the process giving rise to the problem or incident. A statement that you have stopped will be considered a pledge to the community; a statement that you have not stopped requires an explanation.

iTrusChina stopped certificate issuance during the resolution of the problem.

  4. In a case involving certificates, a summary of the problematic certificates. For each problem: the number of certificates, and the date the first and last certificates with that problem were issued. In other incidents that do not involve enumerating the affected certificates (e.g. OCSP failures, audit findings, delayed responses, etc.), please provide other similar statistics, aggregates, and a summary for each type of problem identified. This will help us measure the severity of each problem.

This issue involves the ARLs of two root certificates.

  5. In a case involving certificates, the complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem. In other cases not involving a review of affected certificates, please provide other similar, relevant specifics, if any.

http://wtca-cafiles.itrus.com.cn/crl/vTrusRootCA.crl
http://wtca-cafiles.itrus.com.cn/crl/vTrusECCRootCA.crl

  6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

This error was caused by a design bug in our offline CA. When we added extension items to the ARL, the complete ARL should have been assembled before signing, but the extension items were not included in the data that was signed. The signed data therefore differed from the published ARL, and signature verification failed.
We did not detect this error earlier because our system was not designed to automatically verify the signatures of newly issued ARLs and CRLs. Our CRLs are signed after all extensions have been assembled, so they did not encounter the same problem.

  7. List of steps your CA is taking to resolve the situation and ensure that such situation or incident will not be repeated in the future, accompanied with a binding timeline of when your CA expects to accomplish each of these remediation steps.

We will add a new feature to our CA system to automatically verify the signatures of newly issued ARLs and CRLs. For this issue, we will arrange testing and launch the new version of the system later today.

Completed
24/05/2021: Fixed the code of our offline CA system to automatically verify the signatures of newly issued ARLs and CRLs.
TBD
25/05/2021: Deploy the new version of the CA system and arrange testing in the test environment.
Launch the new version of the system and sign two new ARLs for our roots.
Verify the signatures of the new ARLs.

This sounds like a rather serious security issue. If I am understanding correctly, you have a process where the data that is signed with a CA’s offline key is not the same data that was verified, and until now, this was not noticed. Is that correct?

Equally, a critical part of asymmetric algorithms is verification after signing. This is required in a FIPS-approved mode of operation (known as a pairwise consistency check). Why were such pairwise checks not already part of your processes?

The ARL information should have been assembled before signing,but the extension item information was not added to the original signature, resulting in inconsistency between the original signature and the original when verifying the signature, so the signature verification failed.

Can you precisely describe the steps and processes used to generate CRLs/ARLs, with sufficient technical detail that they might be independently implemented? This process of assembly naturally raises questions here, and the best way to address those concerns is to understand the processes and controls that exist.

Flags: needinfo?(vTrus_contact)
Assignee: bwilson → vTrus_contact
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Whiteboard: [ca-compliance]

(In reply to Ryan Sleevi from comment #2)

Hi Ryan,

Thank you for your constructive feedback; it will help us improve our system and process.
We checked our certificate issuance system and confirmed that it verifies the signature when issuing certificates, but not when issuing CRLs/ARLs.
We have already released the new version of the CA system and signed two new ARLs for our roots to resolve this issue.
The steps and processes we use to generate CRLs/ARLs are:

  1. Query all the revoked certificates under the CA certificate and assemble them into the CRL's list of revoked certificates.
  2. Set the thisUpdate and nextUpdate fields: thisUpdate is the current system time, and nextUpdate is calculated from the issuance period.
  3. Add extension item information.
  4. Set the CRL issuer information, which is the SubjectDN of the CA certificate.
  5. Determine the signature algorithm of the CRL according to the CA certificate, and sign the CRL.
  6. Add the signature algorithm and signature value to the CRL structure.

The previous version had the six steps above; the new version of our system adds a step that verifies the signature.
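For illustration, the six steps plus the added verification step might look roughly like the sketch below. This is not iTrusChina's implementation. It assumes an HMAC as a stand-in for the HSM-backed root-key signature and canonical JSON as a stand-in for DER encoding, and every name in it (`DEMO_KEY`, `build_tbs`, `issue_crl`) is hypothetical.

```python
import hashlib
import hmac
import json
from datetime import datetime, timedelta, timezone

# Assumption: a shared-secret HMAC stands in for the CA's asymmetric signature.
DEMO_KEY = b"demo-key-not-a-real-ca-key"


def sign(data: bytes) -> bytes:
    return hmac.new(DEMO_KEY, data, hashlib.sha256).digest()


def verify(data: bytes, signature: bytes) -> bool:
    return hmac.compare_digest(sign(data), signature)


def build_tbs(issuer, revoked_serials, extensions, now: datetime, period: timedelta) -> bytes:
    """Steps 1-4: assemble every to-be-signed field before anything is signed."""
    tbs = {
        "revoked": sorted(revoked_serials),          # step 1
        "thisUpdate": now.isoformat(),               # step 2
        "nextUpdate": (now + period).isoformat(),    # step 2
        "extensions": extensions,                    # step 3
        "issuer": issuer,                            # step 4
    }
    # Assumption: canonical JSON stands in for DER encoding.
    return json.dumps(tbs, sort_keys=True).encode()


def issue_crl(issuer, revoked_serials, extensions, period: timedelta, now=None):
    now = now or datetime.now(timezone.utc)
    tbs = build_tbs(issuer, revoked_serials, extensions, now, period)
    signature = sign(tbs)                                               # step 5
    crl = {"tbs": tbs, "alg": "HMAC-SHA256 (demo)", "sig": signature}   # step 6
    # Added step: verify exactly the bytes that would be published.
    if not verify(crl["tbs"], crl["sig"]):
        raise ValueError("signature verification failed; CRL not released")
    return crl
```

The point of the last step is that it checks the final published structure, not an intermediate buffer, so any field added after signing would be caught before release.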

The current CRL/ARL issuance process is as follows:
The CRL is automatically issued according to the period set by the system and automatically synchronized to the CDN.

ARL is issued manually after the approval process is completed:

  1. The key administrator enters the shielded computer room together with the witness of the operation procedure and the holder of the (physical) safe key, turns on the power of the HSM where the root key is located, and uses a network cable to connect it directly to the operating host.
  2. Open the safe in the shielded computer room, take out the IC card corresponding to the HSM, insert it into the HSM, and read the permission of the activation key.
  3. Use the operating host to log in to the offline CA system, connect to the HSM, select the root CA key pair which is to sign the ARL, and complete the ARL issuance.
  4. Export the ARL file, HSM log, offline CA log, turn off the power of the HSM, and put the IC card back into the safe.
  5. Obtain the above files from the operation host through internal FTP.
  6. All the above operations are completed by the key administrator and witnessed by the witness. Both parties must sign and confirm each operation on the procedure script, and the procedure video is exported and backed up to hard disk for audit.

Regards,
vTrus team

Given that the six steps mentioned in Comment #3 were the same both before and after, the description of the failure (quoted in Comment #2) is difficult to make sense of.

It sounds as if Comment #1 was saying Step 3 was skipped, but even if it was skipped, one would expect a CRL with a valid signature to be produced, as the CRL itself would be missing the extensions. Alternatively, it sounds like the ARL/CRL was produced (without extensions), signed using the procedure you described, and then that signature transposed to a different ARL/CRL. However, that doesn’t seem to fit with the 6 steps you described; there shouldn’t be a second ARL/CRL generated, and so it’s difficult to understand here how the error happened.

(In reply to Ryan Sleevi from comment #4)

Hi Ryan,

Sorry for the confusion; we realize that our description was not clear enough.
Our certificate issuance system uses templates to load extensions, and we have also implemented certlint checks in the issuance process. For CRLs/ARLs, the extension items are written in code, and any change to them requires modifying the code. Due to a bug in the ARL code, the code that adds the extension field was written in the wrong place, so for ARLs step 3 was effectively executed last: the extension items were added after signing, which caused signature verification to fail.
Our CRLs are generated in accordance with the six steps described previously.
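As a rough illustration of the ordering problem described above (not the actual offline CA code), the sketch below signs a structure and then appends an extension, and compares that with signing after assembly. It assumes an HMAC in place of the root-key signature and a simple byte join in place of DER encoding; `issue_arl_buggy` and `issue_arl_fixed` are hypothetical names.

```python
import hashlib
import hmac

# Assumption: HMAC stands in for the root-key signature performed by the HSM.
KEY = b"demo-key"


def sign(data):
    return hmac.new(KEY, data, hashlib.sha256).digest()


def verify(data, signature):
    return hmac.compare_digest(sign(data), signature)


def encode(fields):
    # Assumption: a byte join stands in for DER encoding of the TBSCertList.
    return b"|".join(fields)


def issue_arl_buggy(base_fields, extension):
    signature = sign(encode(base_fields))          # signed before the extension exists
    published = encode(base_fields + [extension])  # extension added after signing
    return published, signature


def issue_arl_fixed(base_fields, extension):
    published = encode(base_fields + [extension])  # assemble everything first
    return published, sign(published)              # then sign the final bytes


base = [b"issuer=CN=Demo Root", b"thisUpdate=2021-05-01"]
ext = b"ext:cRLNumber=1"

published, signature = issue_arl_buggy(base, ext)
print(verify(published, signature))        # False: published bytes were never signed
print(verify(encode(base), signature))     # True: signature matches the extension-less data

published, signature = issue_arl_fixed(base, ext)
print(verify(published, signature))        # True
```

In the buggy order the signature is still valid, but only for the extension-less data, which is why a check run against the published ARL fails.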

Regards,
vTrus team

Flags: needinfo?(vTrus_contact)

Can you further clarify: are you running an in-house/custom CA software, or is this based on a common off-the-shelf commercial or open-source system?

I think the challenge here in understanding the flow of how your system works is that it sounds like you’re running some program on the offline CA to produce the CRL/ARL. That is, instead of producing a TBSCertList, which can be manually reviewed and linted before signing, it sounds like the process is integrated into a script that performs all of these steps automatically.

If that’s the case, I think there’s a question about understanding how that’s tested. The downside of such a scripted approach is that it can make it easy to forget to run in a testing environment, that the testing environment can differ from the production environment, or that errors that occur in production don’t occur in testing. On the other hand, such a script can remove a lot of room for manual error, such as skipping steps, and so it’s not without its benefits.

The concern here is trying to go deeper than the specific incident at play here - such as the bug in the code that led to wrong extensions, and trying to understand the process and design of the system, to understand if there are still systemic risks.

That said, to better help understand and confirm this explanation: Can you attach one of the invalid CRLs here, along with the TBSCertList that successfully validates with the (invalid) signature? Basically, one way to help confirm that the explanation provided is what happened is to help demonstrate the “right” TBSCertList that the signature was created for (e.g. without extensions). While this is something the community could try to work out manually, based on the description provided so far, the fact that iTrusChina has a better understanding of the bug hopefully makes it easier to provide these supporting artifacts. This is not meant to accuse you of impropriety, but rather, a useful process in incident reports (as with the previous comments) to provide detailed supporting technical artifacts.
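For readers who want to inspect such artifacts themselves: per RFC 5280, a CRL is a DER `SEQUENCE` of `tbsCertList`, `signatureAlgorithm`, and `signatureValue`, and the signature is computed over the DER encoding of `tbsCertList`. The following is a minimal standard-library sketch that splits a DER-encoded CRL into those three parts. The function names `read_tlv` and `split_certificate_list` are hypothetical, and the parser only handles the single-byte tags and definite lengths this structure needs.

```python
def read_tlv(data: bytes, offset: int = 0):
    """Return (tag, value_start, end) for the DER element starting at offset."""
    tag = data[offset]
    first = data[offset + 1]
    if first < 0x80:                      # short form: length fits in one byte
        length, header = first, 2
    else:                                 # long form: next n bytes hold the length
        n = first & 0x7F
        if n == 0:
            raise ValueError("indefinite length is not valid DER")
        length = int.from_bytes(data[offset + 2:offset + 2 + n], "big")
        header = 2 + n
    start = offset + header
    end = start + length
    if end > len(data):
        raise ValueError("truncated DER element")
    return tag, start, end


def split_certificate_list(der: bytes):
    """Split a DER CertificateList into (tbsCertList, signatureAlgorithm, signature)."""
    tag, start, end = read_tlv(der, 0)
    if tag != 0x30:
        raise ValueError("CertificateList must be a SEQUENCE")
    t1, _, e1 = read_tlv(der, start)      # tbsCertList
    t2, _, e2 = read_tlv(der, e1)         # signatureAlgorithm
    t3, s3, e3 = read_tlv(der, e2)        # signatureValue (BIT STRING)
    if (t1, t2, t3) != (0x30, 0x30, 0x03) or e3 != end or s3 >= e3:
        raise ValueError("unexpected CertificateList layout")
    if der[s3] != 0:
        raise ValueError("signature BIT STRING has unused bits")
    tbs = der[start:e1]                   # full TLV: these are the signed bytes
    algorithm = der[e1:e2]
    signature = der[s3 + 1:e3]            # skip the unused-bits octet
    return tbs, algorithm, signature
```

Checking the extracted signature against a candidate TBSCertList (for example, one re-encoded without extensions) would additionally need a cryptographic library and the root public key, which is outside the scope of this sketch.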

Flags: needinfo?(vTrus_contact)

(In reply to iTrusChina Co.,Ltd. from comment #3)

I noticed a potential miscommunication, so could you clarify this:

ARL is issued manually after the approval process is completed:

  1. The key administrator enters the shielded computer room together with the witness of the operation procedure and the holder of the safe key(Physical),turns on the power of the HSM where the root key is located, and uses a network cable to directly connect to the operating host.
  2. Open the safe in the shielded computer room, take out the IC card corresponding to the HSM, insert it to the HSM, and read the permission of the activation key.
  3. Export the ARL file, HSM log, offline CA log, turn off the power of the HSM, and put the IC card back into the safe.
    [snip]
  4. All the above operations are completed by the key administrator and witnessed by witness. Both parties need to sign and confirm each operation on the procedure script, and the procedure video will be exported and backup to the hard disk for audit.

It seems to me that you only have one key administrator in this key ceremony, i.e. single person control. Mozilla policy (and BR) specify that CA equipment must at all times be operated under multiple person control; and I don't think that adding a witness is adding a control person.

I do note that your CPS section 5.2.2 does specify that you use multiple person control (3-out-of-5 for cryptographic device access, 2-person access for the safe containing the HSM activation key), but the procedure you describe in Comment 3 does not reflect the controls specified in that section (only one person for opening the safe, only one person operating the cryptographic device).

(In reply to Ryan Sleevi from comment #6)

Hi Ryan,

We use in-house CA software developed by our R&D team.

CRLs/ARLs are not generated by scripts, but by a function of our offline CA program that performs all steps automatically.

As for the method of testing, our regular testing procedure is:
1. The R&D team conducts self-tests in the development environment;
2. The software is deployed and tested in the testing environment. We have test CA software and a test HSM (with test key pairs) of exactly the same versions as the production environment, for troubleshooting and reproducing issues;
3. After testing, the software is deployed to the production environment.

Attached are our invalid ARL and TBSCertList.

Thanks for your questions and suggestions; they help us better understand what details an incident report should provide to be clear.

Regards,
vTrus team

Flags: needinfo?(vTrus_contact)

(In reply to Matthias from comment #7)

Hi Matthias,

Opening the safe and taking out the IC card is performed by two persons: the key administrator and the holder of the safe key.

Regarding the admin privilege for key generation, backup, and recovery, we do split the knowledge among multiple persons (5 secret shareholders as described in the CPS, as required by the BR, Mozilla policy, and WebTrust for CA Chapter 4).

The key administrator and the 5 secret shareholders are different persons.

The key administrator only has an operator IC card, which does not have the authority to perform key generation, backup, or recovery. In this case, the key administrator uses the operator IC card to perform the activation and follows the operating process to generate the ARL. The witness checks that every single step follows the process. We consider this to be multiple person control (dual custody).

Regards,
vTrus team

We use a in house CA software developed by our R&D team.

Can you share more about the decision making to write something in-house, and about how the development is conducted? This is a rather serious error, and I think the concern here is that there may exist similar errors in other critical functions, especially with respect to how the CA private key is used. If there are relevant audits that your Software Development Lifecycle undergoes, can you also share details here?

My concern here is Comment #1 identified it as a "design bug", but I don't see any sort of deeper root cause analysis that allowed such design bugs to be introduced, or how those are systemically being prevented. While the specific manifestation of this bug appears to have been fixed as of Comment #3, I don't really see any sort of analysis that would explain how this class of errors should have been prevented, why they failed to do so, and what systemically has changed.

Flags: needinfo?(vTrus_contact)

(In reply to Ryan Sleevi from comment #11)

Hi Ryan,
Sorry for the late reply.
Regarding the question of why in-house CA software is used: iTrusChina is a commercial CA, and our domestic business in China is selling certificate services and CA software for enterprise PKI requirements.
We develop our CA software based on RFC 5280, the BR and NCSSR issued by the CABF, and the WebTrust audit standards. The ISO 27001 audits and CMMI 3 assessments that we have passed both regularly review software development life cycle management. The iTrusChina R&D team and compliance team held an internal discussion and conducted a root cause analysis for this bug. We believe the key cause of this bug is that we rely on manual testing, and manual testing will inevitably have omissions.

Our R&D team has started to build an automated inspection system. They will manually investigate when there is an alarm, and review the results of the automatic inspection every month when there is no alarm. The inspection covers the following:

  1. Automate the detection of critical processes. When a detection item fails, raise an alarm and block the operation that failed the check.
  2. When an entity certificate is issued, the inspection system inspects the subject information, extensions, issuer information, validity period, and signature value of the certificate. Any non-conformity is treated as a failure. When the check fails, raise an alarm and block the operation that failed the check.
  3. When an OCSP query is performed, the inspection system checks the validity, response time, algorithm, signature value, and other information of the signing certificate in the OCSP response. Any non-conformity is treated as a failure. When the check fails, raise an alarm and block the operation that failed the check.
  4. When CRLs and ARLs are released, the inspection system checks the update time, issuer, signature algorithm, signature value, and other information in the CRL and ARL. Any non-conformity is treated as a failure. When the check fails, raise an alarm and block the operation that failed the check.
  5. Periodically check the validity of the CA certificates in the system, mainly the validity period, algorithm, signature value, and other information, and raise an alarm on failure.

We will conduct a systematic control check on RFC 5280 compliance in the next three months, and will post updates on the new inspection content and methods here. The entire project is expected to go live within 3 months (expected date: Sept 30th).
Flags: needinfo?(vTrus_contact)

This largely seems to describe post-issuance linting (items 2 and 5 in particular), and suggests an independent development of tools rather than using/embracing/contributing to industry standard pre-issuance linting.

I appreciate the description of how the software is developed, but the ISO 27001 and CMMI 3 audits don't necessarily provide assurance about the set of concerns from Comment #11 - namely, misunderstanding or misapplication of the relevant standards that are expected (and mentioned in Comment #12).

To be fair, I acknowledge that design and testing go hand-in-hand; testing is meant to help detect these design bugs sooner than later. But this is something that's been discussed for some time, so it's rather surprising to only see the beginnings of steps towards addressing.

Concretely: Are there plans for pre-issuance linting? If so, will it be in-house or through existing OSS projects? Will iTrusChina be contributing to these efforts, or merely consuming them?

It's also a little concerning that it's been 14 days since Comment #12, given both the longstanding expectations and the recent discussion reiterating those requirements (iTrusChina is required to be aware of such discussions).

Flags: needinfo?(vTrus_contact)

(In reply to Ryan Sleevi from comment #13)

Hi Ryan,

The pre-issuance linting currently used by iTrusChina includes zlint, certlint, and x509lint. The links are as follows:
https://github.com/awslabs/certlint
https://github.com/kroeckx/x509lint
https://github.com/zmap/zlint
However, they are only used to check end-entity certificates. This incident lies in the CRL/ARL, which these three lint libraries cannot cover.
The inspection system we are developing includes a CRLlint function. After the CRLlint function is added, the process of generating a CRL will be roughly as follows:

  1. The CA system generates the CRL. At this point, the CRL exists only in the system's memory; it is not persisted or released, and it cannot be accessed by external systems;
  2. Use CRLlint to check the CRL;
  3. Depending on the result, perform one of the following operations:
    a) If the check passes, the CRL is persisted and released externally;
    b) If the check fails, the system retries by re-executing the above process and outputs a warning message. If the retry also fails, the system outputs an error message and aborts the entire generation process.

CRLlint will perform the following checks:

  1. Invalid signature or misuse of keys;
  2. Wrong data structure, CA system configuration error, or system bug not found during testing;
  3. The values of key variables (CRL number, issuer, thisUpdate/nextUpdate, signature algorithm, etc.), to find CA system configuration errors, operating environment configuration errors, or system bugs not found during testing;

In addition, iTrusChina plans to add the following pre-issuance checks to the system:

  1. Certification path and certification identifier checking;
  2. Encoding (for example, ASCII or UTF8) compliance checking;
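
The generate, lint, then release-or-retry flow described above could be sketched as follows. This is an illustrative outline only, not the CRLlint system under development: `release_with_lint` and `lint_crl` are hypothetical names, the CRL is modelled as a plain dictionary, and the signature check is passed in as a callback because the real check depends on the CA's cryptographic stack.

```python
def lint_crl(crl, verify):
    """Return a list of problems found in an in-memory CRL (empty means pass)."""
    problems = []
    if not verify(crl["tbs"], crl["signature"]):
        problems.append("invalid signature")
    if not crl.get("issuer"):
        problems.append("missing issuer")
    if crl["this_update"] >= crl["next_update"]:
        problems.append("nextUpdate is not after thisUpdate")
    if crl.get("crl_number", -1) < 0:
        problems.append("missing or negative CRL number")
    return problems


def release_with_lint(generate, lint, publish, max_attempts=2):
    """Generate a CRL in memory, lint it, and publish it only if the lint passes."""
    for attempt in range(1, max_attempts + 1):
        crl = generate()                  # step 1: exists in memory only
        problems = lint(crl)              # step 2: pre-issuance check
        if not problems:
            publish(crl)                  # step 3a: persist and release
            return crl
        # Step 3b: warn on the first failure, report an error on the last one.
        level = "WARNING" if attempt < max_attempts else "ERROR"
        print(f"{level}: lint attempt {attempt} failed: {problems}")
    raise RuntimeError("CRL generation aborted after repeated lint failures")
```

With `max_attempts=2` this matches the single retry described in step 3b; nothing reaches `publish` unless the lint returns no problems.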

Thank you very much for your suggestions. There are currently no plans to open-source the inspection system we are developing, but we will pay close attention to relevant discussions and update the progress of system development every week.

Regards,
vTrus team

Flags: needinfo?(vTrus_contact)
Flags: needinfo?(bwilson)

Regarding contributing to industry-standard pre-issuance linting: our R&D team held an internal discussion and decided to open-source our inspection system once it has been developed and has run stably after a period of improvement. The estimated time is early 2022.

vTrus team

We have completed the basic component design and system design. This week we will start coding the basic components, and will give priority to basic components and functions related to CRL detection.

The general schedule is as follows:

| Time | Item |
| --- | --- |
| Before August 6th | Development of basic components and functions related to CRL/ARL checking is completed |
| August 9th to August 20th | Development of basic components and functions related to certificate checking |
| August 23 to September 3 | System test |

This week we have coded the basic components and functions. About 70% of the basic components related to CRL detection have been completed.

Coding of the basic components and functions for CRL/ARL update time, issuer, and signature algorithm has been completed, and the signature verification function is in progress.

Flags: needinfo?(bwilson)
Whiteboard: [ca-compliance] → [ca-compliance] Next update 2021-09-01