Closed Bug 1623384 Opened 4 years ago Closed 4 years ago

Camerfirma: Invalid authorityKeyIdentifier - recurrent incident

Categories

(CA Program :: CA Certificate Compliance, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ana.lopes, Assigned: ana.lopes)

References

Details

(Whiteboard: [ca-compliance] [ov-misissuance] [ev-misissuance])

Attachments

(3 files)

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36

Steps to reproduce:

  1. How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.
    We first became aware of the problem when Michel Le Bihan reported it in a comment on the previous bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1586860#c15

  2. A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.
    Steps taken to solve the problem in the past (the details are in bug 1586860 - https://bugzilla.mozilla.org/show_bug.cgi?id=1586860)
    1º Development to change the affected profiles (October 8th, 2019)
    2º Deployment of the changes to production (Oct 25th, 2019)
    After detecting the problem again, we performed and planned the following actions:
    Detection of the new certificates with errors:
    1º Michel Le Bihan detected the new errors and added a comment to the previous bug https://bugzilla.mozilla.org/show_bug.cgi?id=1586860#c15 (March 11th)
    2º We responded to the comment, and the two certificates detected (both pre-certificates) were revoked immediately (March 11th).
    3º We examined the version of zlint that we were using (March 11th)
    4º We investigated all the profiles to identify every profile in use that contained errors (March 11th)
    5º Installation of the new zlint version 2.0.0 (March 12th)
    4º Change of the profiles with errors (March 13th)
    5º Test of the new profiles (March 13th)
    6º Deployment of the new profiles to the production environment (March 16th)
    7º Incorporation into the Operations department's procedure of a manual zlint check of each profile before the first certificate of each kind of profile is issued (March 16th)
    8º Revocation of all the affected certificates (by March 20th)
    9º Substitution of all the revoked certificates (by March 20th)
    10º Modification of the procedure for the correction of the profiles and establishment of an additional automatic control to verify the AKI before the issuance of the certificates (by April 1st)
    11º Establishment of a procedure to detect every new zlint release. We will receive a notification each time a new zlint version is released, and the department responsible for this tool will decide whether the update must be deployed to production.
    In parallel, we are analysing the possibility of an automatic procedure that detects each new zlint release and updates the version in the production environment without manual intervention. (March 18th)

  3. Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem. A statement that you have will be considered a pledge to the community; a statement that you have not requires an explanation.
    We stopped issuing certificates when we received the communication. The last certificate was issued on March 10th.
    We resumed issuing certificates on March 16th, after correcting all the affected profiles.

  4. A summary of the problematic certificates. For each problem: number of certs, and the date the first and last certs with that problem were issued.
    We have detected 37 active certificates affected by this problem. Three of them have already been revoked.
    The first certificate affected was issued on Dec 12th and the last one on March 10th.
    Regarding the profiles, we have detected 85 profiles with errors and we have already changed all the affected profiles.

  5. The complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem.
    Please, find the list of affected certificates attached.

  6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.
    We took measures to solve the problems detected in the previous bug (https://bugzilla.mozilla.org/show_bug.cgi?id=1586860).
    We were not aware that the problem had recurred until the communication, because the version of ZLint that we were using did not detect it.
    After reviewing the corrections and tests that we made to fix the profiles identified in the previous bug (https://bugzilla.mozilla.org/show_bug.cgi?id=1586860), we found that the procedure we followed was not effective, and we did not detect this before Michel Le Bihan's comment.

  7. List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.
    1º Revocation of the detected certificates (March 11th)
    2º Analysis of the cause of the problem (March 12th)
    3º Update of the zlint version (March 12th)
    4º Compilation of a list of all the affected certificates and communication of that list to the Operations department so that all of them can be revoked (March 13th)
    5º Correction of the problematic profiles (March 16th)
    6º Communication of the revocation and substitution of the problematic certificates to all the clients
    7º Revocation of all the problematic certificates (by March 20th)
    8º Substitution of all the problematic certificates (by March 20th)
    9º Modification of the procedure for the correction of the profiles and establishment of an additional automatic control to verify the AKI before the issuance of the certificates (by April 1st)
    11º Establishment of a procedure to detect every new zlint release. We will receive a notification each time a new zlint version is released, and the department responsible for this tool will decide whether the update must be deployed to production.
    In parallel, we are analysing the possibility of an automatic procedure that detects each new zlint release and updates the version in the production environment without manual intervention. (March 18th)
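
The "new version available" decision described in the last step can be sketched as below. This is a minimal illustration and not Camerfirma's tooling: the function names are hypothetical, the installed version `1.1.0` in the usage line is an invented placeholder, and the sketch assumes plain `MAJOR.MINOR.PATCH` tags with an optional leading `v`. How the latest tag is obtained (for example from a release feed) is deliberately left out.

```python
from typing import Tuple


def parse_version(tag: str) -> Tuple[int, ...]:
    """Turn a tag such as 'v2.0.0' or '2.0.0' into a comparable tuple of ints."""
    return tuple(int(part) for part in tag.lstrip("v").split("."))


def update_available(installed: str, latest: str) -> bool:
    """True when the latest released tag is newer than the installed one."""
    return parse_version(latest) > parse_version(installed)


# Hypothetical usage: notify the responsible department when this is True.
needs_review = update_available("1.1.0", "v2.0.0")
```

Comparing tuples of integers rather than raw strings avoids ordering mistakes such as treating `2.10.0` as older than `2.9.0`.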

I am really struggling to see this incident as anything other than negligence on the part of Camerfirma. Specifically:

  • ZLint didn't detect the original occurrence of this issue, so what does it have to do with this recurrence? I mean, it's convenient to think that ZLint could have caught it, but that doesn't excuse the lack of any tested control to prevent the problem from recurring.
  • This report does not explain how Camerfirma failed to prevent the use of profiles that were known to be bad, or how it will be prevented in the future.
Assignee: wthayer → ana.lopes
Status: UNCONFIRMED → ASSIGNED
Type: defect → task
Ever confirmed: true
Whiteboard: [ca-compliance]

Hi Wayne,

In response to your comments, please find more detailed information about the problem below.

• ZLint didn't detect the original occurrence of this issue, so what does it have to do with this recurrence? I mean, it's convenient to think that ZLint could have caught it, but that doesn't excuse the lack of any tested control to prevent the problem from recurring.
First of all, we want to highlight that the procedure followed for that massive correction of profiles is not a normal procedure for Camerfirma due to the huge number of profiles that we had to change.
Our normal procedure consists of the following steps every time we need to make a change:
1º Definition of the scope.
2º The PKI Expertise department implements the changes in the profiles.
3º The Development Department passes the changes to the pre-production environment.
4º The Development Department sends the changes to the Quality Department to be tested in the pre-production environment. PKI Expertise defines the test strategy to be followed by the Quality Department.
5º If the Quality tests pass, the Systems Department deploys the changes to the Production Environment.
In this particular case, due to the number of profiles, PKI Expertise defined a strategy of dividing the profiles into groups so that the Quality Department could test them more easily.
The profiles were split by filtering them on the number of certificates in each profile. This produced 6 different groups to test, which we sent to the Development department to be tested and passed to production.
All the tests were completed without problems and the changes were passed to the production environment.
Some of the profiles were never passed to the Quality and Production environments because a filter left the profiles with fewer than 20 certificates out of all the groups. (We were not aware of the problem until the comment in the previous bug and our subsequent analysis.)
The solution to avoid the repetition of this situation in the future is avoiding the separation of profiles into different groups to perform the tests, even if this makes the tests more difficult and slower. In this way we will not leave out any profiles by mistake again.
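
The failure mode described above (a grouping filter that silently drops some profiles) can also be caught by a coverage check that compares the full profile list against the union of the test groups. The following is a minimal sketch under stated assumptions: the function name and the profile names are hypothetical, and it is not a description of Camerfirma's systems.

```python
def missing_from_groups(all_profiles, groups):
    """Return the profiles that appear in no test group, sorted for stable output."""
    covered = set()
    for group in groups:
        covered.update(group)
    return sorted(set(all_profiles) - covered)


# Hypothetical profile names, for illustration only.
all_profiles = ["profile-a", "profile-b", "profile-c", "profile-d"]
groups = [["profile-a"], ["profile-b", "profile-c"]]

# A non-empty result means the split left something out and testing is incomplete.
missing = missing_from_groups(all_profiles, groups)
```

Running such a check before handing the groups to testing would flag an incomplete split regardless of which filter produced the groups.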

• This report does not explain how Camerfirma failed to prevent the use of profiles that were known to be bad, or how it will be prevented in the future.
Our process did not contain a second validation before issuing each certificate. After correcting the profiles, we only tested them using zlint, and the previous version did not detect any errors.
We are going to introduce a new control to avoid this situation in the future.
You can find the control in action number 9 of point 7 of the Incident Report.
This new control, to be incorporated by April 1st, will verify that the 'authorityKeyIdentifier' included in the configuration file used for the issuance of each certificate consists only of the keyid. If this requirement is not met, it will block the issuance and generate an alert message with information about the error to be corrected.
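
A pre-issuance check of this kind can be sketched as follows. This is an illustration only, not Camerfirma's implementation: it assumes an OpenSSL-style configuration line such as `authorityKeyIdentifier = keyid`, and the names `ProfileError` and `check_aki` are hypothetical. Treating a missing setting as a failure is a design choice of the sketch.

```python
class ProfileError(Exception):
    """Raised to block issuance when a profile fails the AKI check."""


def check_aki(config_text: str) -> None:
    """Accept the profile only if authorityKeyIdentifier is made up of keyid alone."""
    for raw_line in config_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        if not line.startswith("authorityKeyIdentifier"):
            continue
        _, _, value = line.partition("=")
        options = [opt.strip() for opt in value.split(",") if opt.strip()]
        # Allow 'keyid' (optionally 'keyid:always'); reject anything else,
        # for example an additional 'issuer' option.
        if options and all(opt.split(":")[0] == "keyid" for opt in options):
            return
        raise ProfileError(
            f"authorityKeyIdentifier must be keyid only, got: {value.strip()!r}"
        )
    raise ProfileError("authorityKeyIdentifier is not set")
```

A caller would run `check_aki` on the profile configuration before issuance and turn any `ProfileError` into the alert message described above.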

Additionally, we want to report the number of certificates revoked so far, because we are having problems with some clients as a result of the difficult situation that we are currently going through.
We informed the clients about the situation and the need to revoke and substitute their certificates by March 20th, and the current situation is as follows:

  • 32 certificates already revoked
  • 5 certificates will be revoked by March 30th. The client has asked us for additional days because they cannot substitute their certificates until the current global situation ends and they can work normally.

Please file a new bug for tracking the non-compliance with Section 4.9.1.1 of the Baseline Requirements.

Flags: needinfo?(ana.lopes)

We want to update the information about the certificates with errors that are still pending revocation.
We have already revoked two of the pending certificates that had the new deadline of March 30th.
The three remaining certificates will be revoked by March 30th, as we indicated before.
Additionally, we have detected five more certificates with errors during our review process. We had not detected them at the beginning because in the first review we could not find them in crt.sh.
They were detected during the second review, when we also checked our internal database for other possible certificates that were not in crt.sh.
Those five certificates were detected on March 25th and we initiated the revocation process immediately. To date, two of them have already been revoked and the other three will be revoked by March 30th.
The serial numbers and status of those new certificates are the following:
• 49DA3A39995E4EBA04: Revoked (March 25th)
• 228200E0CFEB03FD24: Revoked (March 25th)
• 38BA99D8248104D111: Pending revocation
• 28AD0D7EBEBBE2A93B: Pending revocation
• 4C96A56E3B446477AE: Pending revocation
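
The second review described above amounts to comparing the serial numbers in the internal database against those found through crt.sh. A minimal sketch of that reconciliation follows. The function names and the short serial numbers in the usage lines are hypothetical placeholders, and how each list is exported is not covered.

```python
def normalize(serial: str) -> str:
    """Compare serial numbers case-insensitively and ignore surrounding whitespace."""
    return serial.strip().upper()


def not_in_ct(internal_serials, ct_serials):
    """Serials present in the internal database but absent from the CT search results."""
    seen = {normalize(s) for s in ct_serials}
    return sorted({normalize(s) for s in internal_serials} - seen)


# Hypothetical serial numbers, for illustration only.
internal = ["0A01", "0a02", "0A03"]
found_in_ct = ["0a01"]
only_internal = not_in_ct(internal, found_in_ct)
```

Any serial returned here would need to be reviewed against the internal record directly, since it cannot be looked up in the CT search results.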

Flags: needinfo?(ana.lopes)

We want to inform you that all the pending certificates with errors have already been revoked.

Thanks Ana.

In terms of systemic risks, it seems that one of the risks we've identified is the large number of certificate profiles that Camerfirma has, which leads to difficulty configuring them.

(From Comment #3)

we want to highlight that the procedure followed for that massive correction of profiles is not a normal procedure for Camerfirma due to the huge number of profiles that we had to change.

While I understand that Comment #3 proposes the following:

The solution to avoid the repetition of this situation in the future is avoiding the separation of profiles into different groups to perform the tests, even if this makes the tests more difficult and slower. In this way we will not leave out any profiles by mistake again.

That seems mostly to focus on how you'll make changes to a large number of profiles going forward. However, it seems like a good opportunity to examine a few questions:

  1. Is that large number of profiles necessary? Has Camerfirma explored ways to reduce the complexity of their system?
     • Part of this is understanding which profiles are relevant for the Web use cases. How many are TLS? How many are S/MIME? How many are specific to a particular jurisdiction or regulation?
  2. Are those profiles configured through UI, or are they part of some source control?
     • Understanding this is about understanding whether it's possible to make programmatic changes to a large number of profiles, to reduce the risk of misconfiguration that might result from manual repetition and revalidation.

My concern is that as the Baseline Requirements continue to evolve, it's not unreasonable to expect profiles will change, and wanting to have good practices in place to support easy and safe changing of profiles seems like a net-positive.

In terms of current deliverables, it seems like the next step is:

10º Modification of the procedure for the correction of the profiles and establishment of an additional automatic control to verify the AKI before the issuance of the certificates (by April 1st)

Flags: needinfo?(ana.lopes)

Hi Ryan,
Thank you for your comments. We really appreciate your suggestions. We are analyzing all those questions to get a conclusion and raise possible actions to reduce that risk in the future.
Regarding point 10º, “Modification of the procedure for the correction of the profiles and establishment of an additional automatic control to verify the AKI before the issuance of the certificates”, which we planned for April 1st: the development finished successfully and the control has already been deployed to the production environment.

Flags: needinfo?(ana.lopes)

It's been six weeks since Comment #8. Where do things stand now?

Flags: needinfo?(ana.lopes)

Hi Ryan,

First of all, we want to apologize for not replying sooner with the details. We understood that you had only made some recommendations and were not expecting further information.

We have taken some measures to improve the management process. Please find more details below:

  1. We reviewed the profiles and deleted those that were obsolete, which reduced the number of active profiles at that time by more than 60%.
  2. We have established a yearly review of the profiles so that we can deactivate the ones that are not needed and not accumulate more profiles than necessary.
  3. In order to manage profiles more uniformly, we have developed a project to group shared information (for example, the AKI for all SSL certificates) so that it can be managed collectively and adapted to changes more easily.

We will give you more details as we advance with the project.
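
The idea in item 3 of keeping shared fields such as the AKI in a single definition can be illustrated with the sketch below. It is only an assumption about how such grouping could look, not a description of Camerfirma's project: the names `TLS_SHARED` and `build_profile`, the field values, and the placeholder policy string are all hypothetical.

```python
# Hypothetical fields shared by every profile in one family (illustrative values).
TLS_SHARED = {
    "authorityKeyIdentifier": "keyid",
    "basicConstraints": "CA:FALSE",
}


def build_profile(shared, specific):
    """Combine family-wide fields with profile-specific ones.

    A profile may not redefine a shared field, so a correction made to the
    shared definition reaches every profile in the family.
    """
    clash = set(shared) & set(specific)
    if clash:
        raise ValueError(f"profile may not redefine shared fields: {sorted(clash)}")
    return {**shared, **specific}


# Placeholder value; a real profile would carry its actual policy identifier.
ov_profile = build_profile(TLS_SHARED, {"certificatePolicies": "<OV policy OID>"})
```

Rejecting overrides of shared fields is a design choice of this sketch: it trades flexibility for the guarantee that a single edit updates all dependent profiles.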

Flags: needinfo?(ana.lopes)

Hi Ana,

This is a follow-up on Comment 10 above. Do you have more details to provide concerning your remediation efforts?

Also, I notice that section 5 of your incident report is incomplete because it only provides certificate serial numbers. Mozilla's incident reporting policy requires that you provide the complete certificate data, which may be accomplished by sha2 fingerprints or crt.sh IDs.

Finally, as a follow up to Comment 4, I did not see where Camerfirma opened a new incident report to explain the delayed revocation of these certificates. (See https://www.mozilla.org/en-US/about/governance/policies/security-group/certs/policy/#61-ssl). According to the Incident Reporting Policy, https://wiki.mozilla.org/CA/Responding_To_An_Incident#Revocation, https://wiki.mozilla.org/CA/Incident_Dashboard#Revocation_Delays, the filing of a new incident report with a whiteboard entry of "[delayed-revocation-leaf]" is required. So, I have created Bug 1647099, https://bugzilla.mozilla.org/show_bug.cgi?id=1647099.

Thanks,
Ben

Flags: needinfo?(ana.lopes)

Hi Ben,

Please find the answers to your questions below.

Do you have more details to provide concerning your remediation efforts?

We have defined a new, more effective process for substituting certificates that are affected by an error or incident.

You can find the diagram with the details of our new substitution process in the attached file “substitution process”.

Also, I notice that section 5 of your incident report is incomplete because it only provides certificate serial numbers. Mozilla's incident reporting policy requires that you provide the complete certificate data, which may be accomplished by sha2 fingerprints or crt.sh IDs.

You can find the list of fingerprints that Mozilla requires in the attached file “fingerprints”.
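
For reference, a SHA-256 certificate fingerprint is the SHA-256 digest of the certificate's DER encoding. The sketch below shows one way to compute it with the Python standard library. The function names are illustrative, and this is not necessarily how the attached list was produced.

```python
import hashlib
import ssl


def sha256_fingerprint(der_bytes: bytes) -> str:
    """Uppercase hex SHA-256 digest of a DER-encoded certificate."""
    return hashlib.sha256(der_bytes).hexdigest().upper()


def fingerprint_from_pem(pem_text: str) -> str:
    """Decode a single PEM certificate to DER, then fingerprint it."""
    return sha256_fingerprint(ssl.PEM_cert_to_DER_cert(pem_text))
```

The resulting hex string identifies one exact certificate, which a serial number alone does not do across issuers.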

Flags: needinfo?(ana.lopes)

Substitution process related to comment 12

Ana,

Thanks for the recent updates.

(In reply to Ana Lopes from comment #10)

  1. In order to manage profiles more uniformly, we have developed a project to group shared information (for example, the AKI for all SSL certificates) so that it can be managed collectively and adapted to changes more easily.

We will give you more details as we advance with the project.

Do you have more details here, regarding a timeframe for this?

Flags: needinfo?(ana.lopes)

Ben: Beyond the matter in Comment #15, I don't have further questions. The issue I raised in Comment #4 about having a new bug was something you ultimately had to address in Comment #11 with Bug 1647099, so I appreciate that (and am disappointed that this wasn't handled by the CA). The systemic concerns I raised in Comment #7 were responded to in Comment #10, although as Comment #15 captures, not all the details are there.

Flags: needinfo?(bwilson)

Hi Ryan,
In response to your question in comment 15 about the details of the project: we are still working on it, but we have already completed some tasks.
Please find the information about the progress below:

  • We classified the profiles according to the regulation that applies to each one, so that changes can be applied to groups of profiles according to their requirements and we can more easily identify the profiles that could be affected by an error. (June 2020)
  • In addition, we reviewed all the certificates in order to deactivate the profiles under which no certificates have been issued since 2019. We will repeat this review once a year. (June 2020)
  • We have started to develop a new procedure for managing the creation of profiles, to avoid unnecessary profiles being created by default. (In progress; we plan to have it finished by the end of July)
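
The yearly review in the second item can be sketched as a filter over each profile's most recent issuance date. This is an illustration under stated assumptions: the function name, the profile names, and the dates are hypothetical, and the cutoff of 1 January 2019 is one reading of "since 2019".

```python
from datetime import date


def stale_profiles(last_issued, cutoff):
    """Profiles whose latest issuance predates `cutoff`, or that never issued."""
    return sorted(
        name
        for name, issued in last_issued.items()
        if issued is None or issued < cutoff
    )


# Hypothetical data: profile name -> date of most recent issuance (None if never used).
last_issued = {
    "profile-a": date(2020, 3, 10),
    "profile-b": date(2018, 11, 2),
    "profile-c": None,
}
to_deactivate = stale_profiles(last_issued, cutoff=date(2019, 1, 1))
```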
Flags: needinfo?(ana.lopes)

I intend to close this bug on or after 7-Aug 2020 unless there are any additional issues or questions.

Flags: needinfo?(bwilson)
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Flags: needinfo?(bwilson)
Resolution: --- → FIXED
Product: NSS → CA Program
Whiteboard: [ca-compliance] → [ca-compliance] [ov-misissuance] [ev-misissuance]