Closed Bug 1705187 Opened 3 years ago Closed 3 years ago

KIR S.A.: CN domain not in SAN

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: michel, Assigned: piotr.grabowski)

Details

(Whiteboard: [ca-compliance] [ov-misissuance])

Hello,
I found this precertificate with the CN domain not in SAN: https://crt.sh/?id=4372848810&opt=zlint,cablint,x509lint,ocsp. It's revoked, but I couldn't find any related incident report.

There are more:
https://crt.sh/?id=4372848810
https://crt.sh/?id=4373791552

Also, none of the domains in the cert have any DNS records. How were they validated?
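
For reference, the check these linters flag is simply that every Subject CN of a TLS certificate must also appear as a SAN dNSName, as the Baseline Requirements require. A minimal sketch in Python, assuming the `cryptography` library (an illustration only, not any CA's actual tooling):

    # Minimal sketch of the CN-in-SAN check flagged by the linters above.
    # Assumes a PEM-encoded (pre)certificate and the 'cryptography' package.
    from cryptography import x509
    from cryptography.x509.oid import NameOID

    def cn_missing_from_san(pem_bytes: bytes) -> bool:
        cert = x509.load_pem_x509_certificate(pem_bytes)
        cns = [a.value for a in cert.subject.get_attributes_for_oid(NameOID.COMMON_NAME)]
        try:
            san = cert.extensions.get_extension_for_class(x509.SubjectAlternativeName).value
            dns_names = set(san.get_values_for_type(x509.DNSName))
        except x509.ExtensionNotFound:
            dns_names = set()
        # True means the certificate exhibits the problem reported in this bug.
        return any(cn not in dns_names for cn in cns)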

Assignee: bwilson → elzbie4
Status: NEW → ASSIGNED
Whiteboard: [ca-compliance]

Hello,
We are investigating the issue and have started preparing the incident response. In this case the domain pocztowy.pl was validated.

Could you please also explain why 4 certificates were issued for the same domains at around the same time?
https://crt.sh/?id=4373800534&opt=ocsp,zlint,x509lint,cablint (Not Before: Apr 14 11:23:25 2021 GMT)
https://crt.sh/?id=4372848810&opt=ocsp,zlint,cablint,x509lint (the one with CN domain not in SAN, Not Before: Apr 15 06:00:00 2021 GMT, Revoked, same key as previous one)
https://crt.sh/?id=4373791552&opt=ocsp,zlint,cablint,x509lint (Not Before: Apr 15 06:00:00 2021 GMT, Revoked, same key as previous one)
https://crt.sh/?id=4372844465&opt=zlint,x509lint,cablint (Not Before: Apr 15 06:00:00 2021 GMT, different key)

(In reply to Piotr Grabowski from comment #2)

Hello,
We are investigating the issue and have started preparing the incident response. In this case the domain pocztowy.pl was validated.

Could you show what validation method (of those mentioned in BR section 3.2.2.4) was used for this validation? Not all validation methods are created equal, and only some methods allow for reusing the validation for subdomains of the validated domain.

Sure, in this case DNS TXT validation was used.
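
For context, a DNS TXT check of that kind (BR 3.2.2.4.7, "DNS Change") amounts to looking up an agreed-upon record and comparing it to a random value provided to the applicant. A rough sketch, assuming the `dnspython` package and a hypothetical record label and token format (this is an illustration, not any CA's production code):

    # Rough sketch of a BR 3.2.2.4.7-style DNS TXT lookup, assuming 'dnspython'.
    # The "_validation" label and the token handling are hypothetical.
    import dns.resolver

    def txt_record_contains_token(domain: str, token: str) -> bool:
        name = "_validation." + domain
        try:
            answers = dns.resolver.resolve(name, "TXT")
        except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer):
            return False
        values = [b"".join(rdata.strings).decode() for rdata in answers]
        return token in values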

Assignee: elzbie4 → piotr.grabowski

I found another certificate with the same issue:
https://crt.sh/?id=3778293603&opt=zlint,ocsp (Not Revoked)

Thank you, Michel; we are aware of that. We have 2 certificates with that issue.

  1. How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.

In the case of https://crt.sh/?id=437284881081, KIR became aware of this by verifying the certificate content immediately after issuance. We then found another one, https://crt.sh/?id=3778293603, affected by the same issue.

  2. A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.

2021-04-14 07:09:29 UTC The certificate was issued.
2021-04-14 07:30:00 UTC Investigation of the root cause began.
2021-04-14 11:09:43 UTC One of the certificates (https://crt.sh/?id=437284881081) was revoked. The other one (https://crt.sh/?id=3778293603) will be replaced in the upcoming week, as it is used in our production system and the replacement needs some time to take place.
2021-04-14 14:09:43 UTC We identified the registration policy that issued the problematic certificate.
2021-04-16 10:00:00 UTC Internal analysis finished with the conclusion that we need CA software vendor support to avoid similar issues.
2021-04-16 14:00:00 UTC A ticket describing the issue was raised with the CA software vendor with high priority.
2021-04-16 15:00:00 UTC The SSL certificate issuance procedure was updated, and an awareness campaign about this update and the related issue was communicated to all operators.
2021-04-16 15:30:00 UTC The ticket was registered by CA software support: ticket number 403509.

  3. Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem. A statement that you have will be considered a pledge to the community; a statement that you have not requires an explanation.

Until the requested patch is deployed, we have only the new procedural mitigation and the awareness campaign to avoid issuing certificates with the problem.

  4. A summary of the problematic certificates. For each problem: number of certs, and the date the first and last certs with that problem were issued.

Impacted: 1 certificate, issued on 2021-04-14.

  5. The complete certificate data for the problematic certificates.

     https://crt.sh/?id=437284881081
     https://crt.sh/?id=3778293603

  6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

We have updated the SSL certificate issuance procedure to require the operator to check that the CN exists among the SAN entries, because so far this was well known but not clearly stated.
An awareness campaign about this update and the related issue was communicated to all operators.
The conclusion is that we are missing technical validation of the dependency between the CN and DNS fields in our registration policies.

  7. List of steps your CA is taking to resolve the situation and ensure it will not be repeated.

A change request was sent to our CA software provider with high priority, because implementing this technical validation is the only way to avoid similar issues in the future.

Hello,
Thanks for your report.

https://crt.sh/?id=437284881081

Is it https://crt.sh/?id=4372848810 ?

KIR became aware of this by verifying the certificate content immediately after issuance.

Was it manual verification or did you use some tool?

2021-04-14 07:09:29 UTC The certificate was issued.

Was it https://crt.sh/?id=4372848810 ?

We have updated the SSL certificate issuance procedure to require the operator to check that the CN exists among the SAN entries, because so far this was well known but not clearly stated.

Why aren't you first filling in the SAN entries and then choosing one of them and entering it as the CN? Do you have to enter the CN first?

Also, you haven't answered my question from Comment 3.

The other one (https://crt.sh/?id=3778293603) will be replaced in the upcoming week, as it is used in our production system and the replacement needs some time to take place.

Does that mean that you will not be able to replace it before the deadline?
https://wiki.mozilla.org/CA/Responding_To_An_Incident#Revocation

Yes, it was https://crt.sh/?id=4372848810, of course. We are using the crt.sh linters.
So far our software uses uncorrelated text fields for the CN and SAN values, but the solution you proposed will be my suggestion for how the CA software vendor should handle this issue.
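
For what it's worth, the SAN-first approach suggested above can be sketched like this (a hypothetical Python illustration: the CN is derived from the already validated SAN list rather than typed into a separate, uncorrelated field):

    # Hypothetical SAN-first profile: the CN is never entered separately by an
    # operator; it is chosen from the already validated SAN dNSName list, so by
    # construction it is always present in the SAN.
    def build_subject_names(validated_dns_names: list) -> tuple:
        if not validated_dns_names:
            raise ValueError("at least one validated SAN dNSName is required")
        common_name = validated_dns_names[0]  # or let the operator pick one entry
        return common_name, list(validated_dns_names)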

(In reply to Michel Le Bihan from comment #10)

The other one (https://crt.sh/?id=3778293603) will be replaced in the upcoming week, as it is used in our production system and the replacement needs some time to take place.

Does that mean that you will not be able to replace it before the deadline?
https://wiki.mozilla.org/CA/Responding_To_An_Incident#Revocation

We will do our best to replace it within 5 days, i.e. by Monday.

(In reply to Piotr Grabowski from comment #8)

  6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

We have updated the SSL certificate issuance procedure to require the operator to check that the CN exists among the SAN entries, because so far this was well known but not clearly stated.
An awareness campaign about this update and the related issue was communicated to all operators.
The conclusion is that we are missing technical validation of the dependency between the CN and DNS fields in our registration policies.

Just like KIR S.A.'s failure to read the BRs in Bug 1705832, this response feels like a failure to read https://wiki.mozilla.org/CA/Responding_To_An_Incident#Incident_Report

To quote:

For example, it’s not sufficient to say that “human error” or “lack of training” was a root cause for the incident, nor that “training has been improved” as a solution. While a lack of training may have contributed to the issue, it’s also possible that error-prone tools or practices were required, and making those tools less reliant on training is the correct solution. When training or a process is improved, the CA is expected to provide specific details about the original and corrected material, and specifically detail the changes that were made, and how they tie to the issue. Training alone should not be seen as a sufficient mitigation, and focus should be made on removing error-prone manual steps from the system entirely.

Similarly:

as it is used in our production system and the replacement needs some time to take place.

is a complete failure of justification for even remotely considering delaying revocation.

Ryan,
I have described that the root cause of the problem was the lack of technical validation in the software; we raised that issue and are awaiting a patch from the vendor. For the time being our registration policy form is built from uncorrelated CN and SAN fields and it worked fine for more than 10 years.
After the incident we decided to apply a change that will ensure this issue does not happen again: a validated list of SAN entries, with one of them then chosen as the CN. The updated procedure and the awareness campaign are how we are practicing due care while waiting for the fix.

Hello,
I agree that the technical cause of the problem was given.

For the time being our registration policy form is built from uncorrelated CN and SAN fields and it worked fine for more than 10 years.

Why didn't you foresee the possibility of such an issue happening and implement verification mechanisms earlier?

(In reply to Piotr Grabowski from comment #14)

I have described that the root cause of the problem was the lack of technical validation in the software; we raised that issue and are awaiting a patch from the vendor.

Yes, and it is a failure to actually examine root causes. "Lack of technical validation" is not itself a root cause, it's a symptom that stems from deeper systemic organizational issues. The failure to examine why you lacked technical validation is important to understanding what the root cause is here, especially in light of other KIR S.A. incidents that are best described as "Failure to meet the minimum requirements".

The decision by KIR S.A. management to rely on manual validation, without any examination of processes, especially in light of several years of participation in Root Programs that require the CA to be aware of incidents and discussions, is concerning. The failure to recognize the lack of technical validation beforehand, in this or in any other part of KIR S.A.'s infrastructure, is the concern.

Lack of technical validation is a symptom, but the processes, design decisions, and priorities of management to fail to recognize that speak closer to root causes. In particular, KIR S.A. appears to be treating this as a "local" incident, focused on the exact wrong certs that were issued, without an examination of the systemic causes.

For the time being our registration policy form is built from uncorrelated CN and SAN fields and it worked fine for more than 10 years.

The assertion that it "worked fine for more than 10 years" is rather dubious in light of past misissuances like:

https://crt.sh/?sha256=9227ca4f870e8920fdf313e2fd357a60fab1ae52a6f7010ae03d722effc35316

https://crt.sh/?sha256=ADE0592645EF2751D3A038976ED629F0F53CE6CBD83B0D379926898BB31614BA

(In reply to Andrew Ayer from comment #17)

For the time being our registration policy form is built from uncorrelated CN and SAN fields and it worked fine for more than 10 years.

The assertion that it "worked fine for more than 10 years" is rather dubious in light of past misissuances like:

https://crt.sh/?sha256=9227ca4f870e8920fdf313e2fd357a60fab1ae52a6f7010ae03d722effc35316

https://crt.sh/?sha256=ADE0592645EF2751D3A038976ED629F0F53CE6CBD83B0D379926898BB31614BA

You are right, Andrew. I forgot about the error from 2015.

From the management point of view, we keep constantly improving both our procedures and the software we use. Although the overall number of errors has decreased significantly, we are still very concerned about new incidents coming from the KIR CA.
Last year we initiated a project to automate the process of generating certificates. The project is complex and covers all steps of certificate pre-validation and certificate generation. Once it is finished, it will eliminate all human mistakes. The project is now in the implementation phase and is planned to be finished this year.

As emergency measures:

  • a patch request was sent to the vendor, and the certificate was revoked as soon as possible after it was generated and the verification result came back incorrect. The certificate was not delivered to the client.

Until the project mentioned at the beginning is finished, we have additionally started a deep analysis, and specific tasks have been delegated to the right resources:

  • to review all current business processes and procedures related to issuing certificates;
  • in cooperation with the software vendor, to perform a weak-point analysis and identify all places for improvement and potential flaws.

(In reply to Michel Le Bihan from comment #10)

The other one (https://crt.sh/?id=3778293603) will be replaced in the upcoming week, as it is used in our production system and the replacement needs some time to take place.

Does that mean that you will not be able to replace it before the deadline?
https://wiki.mozilla.org/CA/Responding_To_An_Incident#Revocation

It is already revoked.

As an additional control, we decided today to restrict access to the registration policies to the 2 most highly trained of our 100 operators. We introduced another testing environment to pre-lint all PKCS#10 requests and communicated the new procedure to everyone concerned.
The CA software vendor was asked to escalate our change request to the highest level.
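
A pre-lint of incoming PKCS#10 requests for the specific defect in this bug could look like the following sketch (Python with the `cryptography` library; illustrative only, mirroring the certificate-side check shown earlier in this bug):

    # Illustrative pre-lint for PKCS#10 (CSR) requests: reject any request whose
    # subject CN is not repeated among the requested SAN dNSNames.
    from cryptography import x509
    from cryptography.x509.oid import NameOID

    def prelint_csr(pem_bytes: bytes) -> list:
        csr = x509.load_pem_x509_csr(pem_bytes)
        cns = [a.value for a in csr.subject.get_attributes_for_oid(NameOID.COMMON_NAME)]
        try:
            san = csr.extensions.get_extension_for_class(x509.SubjectAlternativeName).value
            dns_names = set(san.get_values_for_type(x509.DNSName))
        except x509.ExtensionNotFound:
            dns_names = set()
        # A non-empty result means the request should be rejected before issuance.
        return [cn for cn in cns if cn not in dns_names]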

Update from the software vendor: the patch is being evaluated.

Update from the CA software vendor: the fully automatic pre-linting patch will be implemented in version 5.5.1 of the software.
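
For other CAs reading along: even without vendor support, this kind of gate can be approximated by running an external linter such as zmap/zlint over the to-be-signed certificate and refusing to issue on error-level findings. A rough sketch, assuming the zlint CLI is installed and emits its usual JSON map of lint results (flags and output format can differ between zlint versions):

    # Rough sketch of a pre-issuance linting gate around the external 'zlint' CLI.
    # Assumes zlint is on PATH and prints a JSON object of {lint_name: {"result": ...}};
    # adjust for the zlint version actually deployed.
    import json
    import subprocess

    def blocking_findings(cert_path: str) -> list:
        out = subprocess.run(["zlint", cert_path], capture_output=True, text=True)
        results = json.loads(out.stdout)
        # Policy choice: many CAs also block on "warn"; here only hard failures block.
        return [name for name, r in results.items() if r.get("result") in ("error", "fatal")]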

Piotr, do you have a timeframe when this version 5.5.1 will be released, and when you'll have that version implemented in your production workflows?

Flags: needinfo?(piotr.grabowski)

For now, the vendor has confirmed that the patch will be released in the current version of the software (good news), but I have no precise release date so far. I hope we will know the date within a week or two.

Flags: needinfo?(piotr.grabowski)

No specific update here. We have a meeting planned with the vendor on the 1st of June. We hope to know the estimated time of arrival of the patch then.

Did you get a timeframe for the updated release from the vendor?

Flags: needinfo?(piotr.grabowski)

We had a session with the vendor today.
The feasibility of the patch was confirmed.
For now they have only a rough estimate that it should take a couple of weeks to deliver the patch. We should get an official offer from them soon.

Flags: needinfo?(piotr.grabowski)

We received an official offer for the patch. At the moment, we are determining the details of its implementation.

Piotr: It's been 11 days since Comment #29, and this fits a broader pattern of delays, for nearly two months now (since Comment #21), of trying to understand when KIR S.A. is going to commit to taking concrete action.

This lack of transparency is quite problematic, and very much against the expectations in https://wiki.mozilla.org/CA/Responding_To_An_Incident#Keeping_Us_Informed

It gives every impression that KIR S.A. does not take this issue - or, broadly, compliance - very seriously, and is just dawdling and doing nothing. If that's not the impression you mean to give, then we need much more detail, regularly delivered, and we need concrete commitments from KIR S.A. that they're making systemic, long-term fixes. Delaying for months is not acceptable for CAs, and while I appreciate that KIR S.A. may depend on external parties, it is KIR S.A.'s responsibility to ensure those parties are able to meet the expectations that KIR S.A. is required to meet.

Please provide a meaningful update here, as well as a concrete commitment for a timeframe to address this.

Flags: needinfo?(piotr.grabowski)

Ryan, sorry for the slight delay in responding.
I was going to respond on Wednesday 30.06, right after the first checks of patch testing were finished.
That said, I would like to state that the basic functionality of the patch has been delivered. The patch has some extension functionality that we want to use to apply some additional own linters.
We are currently working on:

  • test scenarios and test data for the basic and extended functionality;
  • additional own linters;
  • development of a process for monitoring and updating the third-party linters;
  • documentation.

We estimate all these steps will be finished by August 15th. We plan to deploy this patch to production by the end of August.
Flags: needinfo?(piotr.grabowski)

(In reply to Piotr Grabowski from comment #31)

I was going to respond on Wednesday 30.06, right after the first checks of patch testing were finished.

Could you explain what led you to believe that was acceptable or appropriate?

That is, what part of https://wiki.mozilla.org/CA/Responding_To_An_Incident#Keeping_Us_Informed is confusing, so that we can make sure CAs such as KIR S.A. understand the requirements? Or is the issue that it was not confusing, but KIR S.A. did not read the existing requirements?

That said, I would like to state that the basic functionality of the patch has been delivered.

Am I correct in understanding Comment #23 that your CA software vendor had no support for pre-issuance linting until this month (Comment #29), and that you expect it will take you another 4-5 weeks to deploy in your infrastructure?

Can you share what CA software vendor you are using, so that other CAs using this software vendor are aware of the availability of what is considered, at this point, a minimum necessary process?

Flags: needinfo?(piotr.grabowski)

(In reply to Ryan Sleevi from comment #32)

(In reply to Piotr Grabowski from comment #31)

I was going to respond on Wednesday 30.06, right after the first checks of patch testing were finished.

Could you explain what led you to believe that was acceptable or appropriate?
That is, what part of https://wiki.mozilla.org/CA/Responding_To_An_Incident#Keeping_Us_Informed is confusing, so that we can make sure CAs such as KIR S.A. understand the requirements? Or is the issue that it was not confusing, but KIR S.A. did not read the existing requirements?

My intention was to provide an update with a realistic value and timeline. We are fully aware of the obligation to update periodically. I apologize for the slight delay and deviation from the rule.

That said, I would like to state that the basic functionality of the patch has been delivered.

Am I correct in understanding Comment #23 that your CA software vendor had no support for pre-issuance linting until this month (Comment #29), and that you expect it will take you another 4-5 weeks to deploy in your infrastructure?

To be precise, our software provider still does not have this functionality in place. As I wrote in comment
https://bugzilla.mozilla.org/show_bug.cgi?id=1705187#c29, 'We received an official offer for the patch. At the moment, we are determining the details of its implementation.' However, the details of implementation and integration were very unfavourable for us, because they would require constant changes to the CA software in the case of any updates to the linting software. In the course of discussing and negotiating the offer, we decided only to acquire the necessary know-how and a non-public API in order to produce this patch ourselves. And this is what we did. There are a few tasks left to perform, as described in comment https://bugzilla.mozilla.org/show_bug.cgi?id=1705187#c31.

Can you share what CA software vendor you are using, so that other CAs using this software vendor are aware of the availability of what is considered, at this point, a minimum necessary process?

Sure, we are using Verizon UniCERT

Flags: needinfo?(piotr.grabowski)

We prepared:

  • the first test scenarios and test data for the basic and extended functionality;
  • additional own linters.

The solution is deployed in the test environment.

Comment #31 and Comment #34 both repeat the phrase

additional own linters

Comment #10, Comment #13, and Comment #30 have all linked KIR S.A. to https://wiki.mozilla.org/CA/Responding_To_An_Incident , highlighting the expectations. This page includes the following paragraph:

The purpose of these incident reports is to provide transparency about the steps the CA is taking to address the immediate issue and prevent future issues, both the issue that originally led to the report, and other potential issues that might share a similar root cause. Additionally, they exist to help the CA community as a whole learn from potential incidents, and adopt and improve practices and controls, to better protect all CAs. Mozilla expects that the incident reports provide sufficient detail about the root cause, and the remediation, that would allow other CAs or members of the public to implement an equivalent solution.

Flags: needinfo?(piotr.grabowski)

(In reply to Ryan Sleevi from comment #35)

Comment #31 and Comment #34 both repeat the phrase

additional own linters

Comment #10, Comment #13, and Comment #30 have all linked KIR S.A. to https://wiki.mozilla.org/CA/Responding_To_An_Incident , highlighting the expectations. This page includes the following paragraph:

The purpose of these incident reports is to provide transparency about the steps the CA is taking to address the immediate issue and prevent future issues, both the issue that originally led to the report, and other potential issues that might share a similar root cause. Additionally, they exist to help the CA community as a whole learn from potential incidents, and adopt and improve practices and controls, to better protect all CAs. Mozilla expects that the incident reports provide sufficient detail about the root cause, and the remediation, that would allow other CAs or members of the public to implement an equivalent solution.

By saying "additional own linters" we didn't have anything new in mind that we didn't want to share.
We just reused the functionality of previously deployed features mentioned here:

Flags: needinfo?(piotr.grabowski)

Ben: Based on Comment #31, it looks like this can be Next-Update to 2021-08-15 for ensuring it's deployed to production.

Flags: needinfo?(bwilson)
Flags: needinfo?(bwilson)
Whiteboard: [ca-compliance] → [ca-compliance] Next update 2021-08-15

We have scheduled production deployment for 30/08.

Whiteboard: [ca-compliance] Next update 2021-08-15 → [ca-compliance] Next update 2021-09-01

The solution was deployed to production today.

I'm assuming that this matter can now be closed, so I'll pull it back up on Friday, 3-Sept-2021, to see if there are any other comments and if not, close it then.

Flags: needinfo?(bwilson)

Ben, can we close this bug?

I'll plan on closing this on Friday, 10-Sept-2021.

Whiteboard: [ca-compliance] Next update 2021-09-01 → [ca-compliance]
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Flags: needinfo?(bwilson)
Resolution: --- → FIXED
Product: NSS → CA Program
Whiteboard: [ca-compliance] → [ca-compliance] [ov-misissuance]