Closed Bug 1509002 Opened 10 months ago Closed 5 months ago

Camerfirma: MULTICERT certificates with a validity period greater than 825 days

Categories

(NSS :: CA Certificate Compliance, task)


RESOLVED FIXED

People

(Reporter: eusebio.herrera, Assigned: eusebio.herrera)

References

(Blocks 1 open bug)

Details

(Whiteboard: [ca-compliance])

Attachments

(1 file)

User Agent: Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0

Steps to reproduce:

1) How we became aware of the problem:
Our Quality Team has detected that one of our subCAs (Multicert) was issuing certificates with a validity period greater than 825 days.

2) Actions and Timeline to resolve:
CAMERFIRMA.2.1) Send a communication asking for information about the misissued certificates - duedate: 16/11/2018;
MULTICERT.2.1) Fix the problem in DEV and CERT environment and test it - duedate 16/11/2018
MULTICERT.2.2) Fix the problem in PROD environment and test it - duedate 16/11/2018
MULTICERT.2.3) Replace the certificates affected with the correct notafter date - duedate 16/11/2018
MULTICERT.2.4) Revoke the certificates affected - duedate 12/12/2018

3) Stop issuing certificates: 
From the moment MULTICERT received notification of this problem (16/11/2018), MULTICERT suspended the issuance of qualified web authentication certificates until the correction is applied in the PROD environment.

4) Summary of the problematic certificates:
All the certificates listed below were affected by this problem, and all of them will be subject to the actions defined above.

5) Complete certificate data for the problematic certificates:
All certificates listed below were affected by this problem. All of them were issued with a validity greater than 825 days.

6) Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now:
MULTICERT: We looked for the root cause and found a bug in the code that sets the notAfter date of the new certificate (the original validity length was used instead of the original notAfter date).
Furthermore, the test cases performed did not detect the error because, to reproduce such a condition, we would need to wait until the next day to observe the wrong behaviour (the notBefore date cannot be manipulated).

7) List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things:
Listed in steps MULTICERT.2.1 to MULTICERT.2.4.
(In reply to Eusebio Herrera from comment #0)

Thank you for reporting this incident. I have some additional questions.

> Created attachment 9026675 [details]
> MULTICERT certificates with a validity period greater than 825 days
> 
> User Agent: Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:63.0) Gecko/20100101
> Firefox/63.0
> 
> Steps to reproduce:
> 
> 1) How we became aware of the problem:
> Our Quality Team has detected that one of our subCAs (Multicert) was issuing
> certificates with a validity period greater than 825 days.
> 
> 2) Actions and Timeline to resolve:
> CAMERFIRMA.2.1) Send a comunication asking for information about the
> misissued certificates - duedate: 16/11/2018;

Was this communication sent to MULTICERT on 16/11/2018, or on an earlier date? I may be confused by the use of the term "duedate" - I am interpreting that to mean a response was due from MULTICERT by 16/11/2018.

> MULTICERT.2.1) Fix the problem in DEV and CERT environment and test it -
> duedate 16/11/2018
> MULTICERT.2.2) Fix the problem in PROD environment and test it - duedate
> 16/11/2018
> MULTICERT.2.3) Replace the certificates affected with the correct notafter
> date - duedate 16/11/2018
> MULTICERT.2.4) Revoke the certificates affected - duedate 12/12/2018
> 
Please confirm your understanding of this as a decision to violate BR section 4.9.1. Why have you decided to violate that requirement?

> 3) Stop issuing certificates: 
> From the moment MULTICERT receive notification of this problem (16/11/2018),
> MULTICERT stopped the issuance of qualified web authentication certificates
> until MULTICERT apply the correction in PROD environment.
> 
> 4) Summary of the problematic certificates:
> All the certificates listed below were affected with this problem and all of
> them will be submited to the actions defined above.
> 
Have all of the affected certificates been CT logged?
 
> 5) Complete certificate data for the problematic certificates:
> All certificates listed below were affected for this problem. All of them
> have been issued with a validity greater than 825 days.
> 
> 6) Explanation about how and why the mistakes were made or bugs introduced,
> and how they avoided detection until now:
> MULTICERT: We looked for the root cause and detected a bug in the code
> setting the notAfter date of the new certificate (the original length was
> used instead of the original notAfter date). 

Please provide a more thorough description of the root cause. When and how was the bug introduced? What circumstances caused the bug to be introduced?

> Furthermore, the test cases performed did not detect the error because, to
> reproduce such a condition, we would to need to wait for the next day to
> detect the wrong behaviour (the notBefore date cannot be manipulated).
> 
It sounds as if you are describing post-production testing. Why wasn't this detected by normal quality assurance processes prior to deploying the change?
> 7) List of steps your CA is taking to resolve the situation and ensure such
> issuance will not be repeated in the future, accompanied with a timeline of
> when your CA expects to accomplish these things:
> Listed in steps MULTICERT.2.1 to MULTICERT.2.4.

This answer only describes remediation for the current problem. Please describe all the steps that you are taking to ensure that this problem, or similar problems will be prevented in the future, and the timeline for implementing those steps.
Assignee: wthayer → eusebio.herrera
Flags: needinfo?(eusebio.herrera)
Summary: MULTICERT certificates with a validity period greater than 825 days → Camerfirma: MULTICERT certificates with a validity period greater than 825 days
Whiteboard: [ca-compliance]
Blocks: 1040072
Many thanks for the additional questions. Please find clarifications below inline.

(In reply to Wayne Thayer [:wayne] from comment #1)
> (In reply to Eusebio Herrera from comment #0)
> 
> Thank you for reporting this incident. I have some additional questions.
> 
> > Created attachment 9026675 [details]
> > MULTICERT certificates with a validity period greater than 825 days
> > 
> > User Agent: Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:63.0) Gecko/20100101
> > Firefox/63.0
> > 
> > Steps to reproduce:
> > 
> > 1) How we became aware of the problem:
> > Our Quality Team has detected that one of our subCAs (Multicert) was issuing
> > certificates with a validity period greater than 825 days.
> > 
> > 2) Actions and Timeline to resolve:
> > CAMERFIRMA.2.1) Send a comunication asking for information about the
> > misissued certificates - duedate: 16/11/2018;
> 
> Was this communication sent to MULTICERT on 16/11/2018, or on an earlier
> date? I may be confused by the use of the term "duedate" - I am interpreting
> that to mean a response was due from MULTICERT by 16/11/2018.
> 

This communication was sent to MULTICERT on 16/11/2018.

> > MULTICERT.2.1) Fix the problem in DEV and CERT environment and test it -
> > duedate 16/11/2018
> > MULTICERT.2.2) Fix the problem in PROD environment and test it - duedate
> > 16/11/2018
> > MULTICERT.2.3) Replace the certificates affected with the correct notafter
> > date - duedate 16/11/2018
> > MULTICERT.2.4) Revoke the certificates affected - duedate 12/12/2018
> > 
> Please confirm your understanding of this as a decision to violate BR
> section 4.9.1. Why have you decided to violate that requirement?

We proposed this date for the revocation because we wanted to guarantee that the affected client had enough time to replace the certificates with the new ones issued on 16/11/2018. We worked with the customer, who replaced all the affected certificates by 22/11/2018, and we revoked the certificates affected by this incident on that same day.

> 
> > 3) Stop issuing certificates: 
> > From the moment MULTICERT receive notification of this problem (16/11/2018),
> > MULTICERT stopped the issuance of qualified web authentication certificates
> > until MULTICERT apply the correction in PROD environment.
> > 
> > 4) Summary of the problematic certificates:
> > All the certificates listed below were affected with this problem and all of
> > them will be submited to the actions defined above.
> > 
> Have all of the affected certificated been CT logged?

Yes, all the affected certificates have been CT logged.

>  
> > 5) Complete certificate data for the problematic certificates:
> > All certificates listed below were affected for this problem. All of them
> > have been issued with a validity greater than 825 days.
> > 
> > 6) Explanation about how and why the mistakes were made or bugs introduced,
> > and how they avoided detection until now:
> > MULTICERT: We looked for the root cause and detected a bug in the code
> > setting the notAfter date of the new certificate (the original length was
> > used instead of the original notAfter date). 
> 
> Please provide a more thorough description of the root cause. When and how
> was the bug introduced? What circumstances caused the bug to be introduced?

We developed new functionality to reissue certificates affected by misissuance problems (as in incident https://bugzilla.mozilla.org/show_bug.cgi?id=1502957). The feature incorrectly set the notAfter date of the new certificate using the original validity length instead of the original notAfter date. Some of the certificates originally had a 3-year validity and were therefore effectively reissued with a notAfter date of 14/11/2018 + 3 years.

> 
> > Furthermore, the test cases performed did not detect the error because, to
> > reproduce such a condition, we would to need to wait for the next day to
> > detect the wrong behaviour (the notBefore date cannot be manipulated).
> > 
> It sounds as if you are describing post-production testing. Why wasn't this
> detected by normal quality assurance processes prior to deploying the change?

We have a segregated non-production test environment. One of the test cases was to issue a certificate and then reissue it. As the two operations occurred one immediately after the other, the bug went undetected: the time gap was just a few seconds and notAfter is always rounded to the end of the day (23:59:00 GMT), so the notAfter date/time was the same in both certificates.
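
For illustration only, the reissuance bug and the reason a same-day test could not expose it can be sketched as follows. This is not MULTICERT's code: the function names, the 1095-day nominal validity, and the 2017 issuance date are assumptions made for the example. Only the 23:59:00 rounding, the 825-day limit, and the 14/11/2018 reissuance date come from this bug.

```python
from datetime import datetime, timedelta, timezone

MAX_VALIDITY = timedelta(days=825)  # limit discussed in this bug


def end_of_day(moment):
    """Round a timestamp to 23:59:00 on the same day, as described above."""
    return moment.replace(hour=23, minute=59, second=0, microsecond=0)


def reissue_buggy(nominal_days, now):
    """Hypothetical buggy reissuance: restarts the original validity length."""
    return now, end_of_day(now + timedelta(days=nominal_days))


def reissue_fixed(original_not_after, now):
    """Hypothetical fix: keep the original notAfter, only notBefore moves."""
    return now, original_not_after


# Assumed 3-year certificate issued well before the reissuance date.
issued = datetime(2017, 6, 1, 10, 0, tzinfo=timezone.utc)
original_not_after = end_of_day(issued + timedelta(days=1095))

# Test-style reissuance a few seconds after issuance: both variants agree.
seconds_later = issued + timedelta(seconds=5)
same_day_buggy = reissue_buggy(1095, seconds_later)
same_day_fixed = reissue_fixed(original_not_after, seconds_later)

# Production-style reissuance on 14/11/2018: only the buggy variant drifts.
reissue_date = datetime(2018, 11, 14, 10, 0, tzinfo=timezone.utc)
prod_buggy = reissue_buggy(1095, reissue_date)
prod_fixed = reissue_fixed(original_not_after, reissue_date)

print(same_day_buggy[1] == same_day_fixed[1])         # same-day test sees no difference
print(prod_buggy[1] - prod_buggy[0] > MAX_VALIDITY)   # buggy reissue exceeds 825 days
print(prod_fixed[1] - prod_fixed[0] <= MAX_VALIDITY)  # fixed reissue stays within the limit
```

A test that reissues a certificate issued on an earlier day, or that passes the current time in as a parameter as above, would exercise the difference without waiting for the next day.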

> > 7) List of steps your CA is taking to resolve the situation and ensure such
> > issuance will not be repeated in the future, accompanied with a timeline of
> > when your CA expects to accomplish these things:
> > Listed in steps MULTICERT.2.1 to MULTICERT.2.4.
> 
> This answer only describes remediation for the current problem. Please
> describe all the steps that you are taking to ensure that this problem, or
> similar problems will be prevented in the future, and the timeline for
> implementing those steps.

We are integrating a lint tool into our system. (Pre-)certificates will be checked against the lint tool. Any errors will move the certificates into a quarantine process for further investigation. If the error is genuine, the finding will help us improve our systems. If it turns out to be a bug in the lint tool, we will be able to submit a bug report and (hopefully) contribute a patch.

We expect to have the lint tool wrapped in an on-premises service and the quarantine process integrated into our workflows in our sprint release of March 2019.
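
As a rough sketch of the lint-and-quarantine flow described above, assuming a single validity check and a hypothetical `route` step (neither is taken from any real lint tool or from MULTICERT's workflow):

```python
from datetime import datetime, timedelta, timezone

MAX_VALIDITY = timedelta(days=825)  # limit discussed in this bug


def lint_validity(not_before, not_after):
    """Return a list of findings; an empty list means the check passed."""
    findings = []
    if not_after <= not_before:
        findings.append("notAfter is not later than notBefore")
    elif not_after - not_before > MAX_VALIDITY:
        findings.append("validity period exceeds 825 days")
    return findings


def route(not_before, not_after):
    """Hypothetical workflow step: quarantine anything the lint flags."""
    findings = lint_validity(not_before, not_after)
    return ("quarantine", findings) if findings else ("issue", findings)


start = datetime(2019, 1, 10, 9, 30, tzinfo=timezone.utc)
print(route(start, start + timedelta(days=824)))
print(route(start, start + timedelta(days=826)))
```

The point of the design is that the check works on the dates actually placed in the (pre-)certificate, measured in days, independently of how the issuing code computed them.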

Best regards,

Nuno Ponte
UPDATE: a new MULTICERT certificate with a validity period greater than 825 days has been issued.

1) How we became aware of the problem:
Our Quality Team has detected that one of our subCAs (Multicert) issued one certificate with a validity period greater than 825 days.

2) Actions and Timeline to resolve:
CAMERFIRMA.2.1) Send a communication asking for information about the misissued certificates - 18/12/2018

MULTICERT.2.1) Fix the problem in DEV and CERT environment and test it - 18/11/2018
MULTICERT.2.2) Fix the problem in PROD environment and test it - 18/11/2018
MULTICERT.2.3) Replace the certificate affected with the correct notafter date - 18/11/2018
MULTICERT.2.4) Revoke the certificate affected - 18/12/2018

3) Stop issuing certificates: 
From the moment MULTICERT received notification of this problem (18/11/2018), MULTICERT suspended the issuance of certificates until the correction is applied in the PROD environment.

4) Summary of the problematic certificates:
https://crt.sh/?id=1026264398 (pre-certificate)

5) Complete certificate data for the problematic certificates:
https://crt.sh/?id=1026264398 (pre-certificate)

6) Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now:
MULTICERT: We investigated the root cause, because we already have a control that does not allow the issuance of certificates with a validity greater than 27 months (822 days). After verification, we discovered that the API we used to calculate the certificate dates only counts months and does not convert the months into the correct number of days (e.g. if a certificate has a validity from 18 December 2018 to 18 December 2019, this API counts 12 months and not the number of days between these two dates, given that January has 31 days, February has 28 days, and so on).

7) List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things:
Listed in steps MULTICERT.2.1 to MULTICERT.2.2.
Flags: needinfo?(eusebio.herrera)
(In reply to Eusebio Herrera from comment #3)
> UPDATE: new MULTICERT's certificate with a validity period greater than 825
> days issued.
> 
Thank you for reporting this new issue. In the future, please create a new bug when reporting a separate issue.

> 1) How we became aware of the problem:
> Our Quality Team has detected that one of our subCAs (Multicert) issued one
> certificate with a validity period greater than 825 days.
> 
> 2) Actions and Timeline to resolve:
> CAMERFIRMA.2.1) Send a comunication asking for information about the
> misissued certificates - 18/12/2018
> 
> MULTICERT.2.1) Fix the problem in DEV and CERT environment and test it -
> 18/11/2018
> MULTICERT.2.2) Fix the problem in PROD environment and test it - 18/11/2018
> MULTICERT.2.3) Replace the certificate affected with the correct notafter
> date - 18/11/2018

Should the dates in 2.1 - 2.3 be 18/12 instead of 18/11?

> MULTICERT.2.4) Revoke the certificate affected - 18/12/2018
> 
> 3) Stop issuing certificates: 
> From the moment MULTICERT receive notification of this problem (18/11/2018),
> MULTICERT stopped the issuance of certificates until MULTICERT apply the
> correction in PROD environment.
> 
> 4) Summary of the problematic certificates:
> https://crt.sh/?id=1026264398 (pre-certificate)
> 
> 5) Complete certificate data for the problematic certificates:
> https://crt.sh/?id=1026264398 (pre-certificate)
> 
Have you scanned the entire database of certificates to ensure that there are no others with this problem?

> 6) Explanation about how and why the mistakes were made or bugs introduced,
> and how they avoided detection until now:
> MULTICERT: We`ve checked the root cause, because we have already a control
> that does not allow the issuance of certificates with a validity greater
> than 27 months (822 days). After verification, we discovered that the API we
> used to calculate the dates of the certificates only calculate the months
> and does not transcribes the months to the correct number of days (i.e. if a
> certificate have a validity from 18 December 2018 to 18 December 2019, this
> API counts 12 months and not the number of days between this to dates, for
> instance January have  31 days, February have 28 days, and so on).
> 
Please provide a more thorough root cause analysis. For example, why did the API use months for the calculation? What parts of your development and quality processes need to be improved to prevent similar issues in the future?

> 7) List of steps your CA is taking to resolve the situation and ensure such
> issuance will not be repeated in the future, accompanied with a timeline of
> when your CA expects to accomplish these things:
> Listed in steps MULTICERT.2.1 to MULTICERT.2.2.

Those steps fix this issue, but do not prevent similar issues from happening in the future.
Flags: needinfo?(eusebio.herrera)
(In reply to Wayne Thayer [:wayne] from comment #4)
> Please provide a more thorough root cause analysis. For example, why did the
> API use months for the calculation? What parts of your development and
> quality processes need to be improved to prevent similar issues in the
> future?

Please provide an update on this.
(In reply to Wayne Thayer [:wayne] from comment #4)
> (In reply to Eusebio Herrera from comment #3)
> > UPDATE: new MULTICERT's certificate with a validity period greater than 825
> > days issued.
> > 
> Thank you for reporting this new issue. In the future, please create a new
> bug when reporting a separate issue.
> 
> > 1) How we became aware of the problem:
> > Our Quality Team has detected that one of our subCAs (Multicert) issued one
> > certificate with a validity period greater than 825 days.
> > 
> > 2) Actions and Timeline to resolve:
> > CAMERFIRMA.2.1) Send a comunication asking for information about the
> > misissued certificates - 18/12/2018
> > 
> > MULTICERT.2.1) Fix the problem in DEV and CERT environment and test it -
> > 18/11/2018
> > MULTICERT.2.2) Fix the problem in PROD environment and test it - 18/11/2018
> > MULTICERT.2.3) Replace the certificate affected with the correct notafter
> > date - 18/11/2018
> 
> Should the dates in 2.1 - 2.3 be 18/12 instead of 18/11?

18/12 is the correct date.

> 
> > MULTICERT.2.4) Revoke the certificate affected - 18/12/2018
> > 
> > 3) Stop issuing certificates: 
> > From the moment MULTICERT receive notification of this problem (18/11/2018),
> > MULTICERT stopped the issuance of certificates until MULTICERT apply the
> > correction in PROD environment.
> > 
> > 4) Summary of the problematic certificates:
> > https://crt.sh/?id=1026264398 (pre-certificate)
> > 
> > 5) Complete certificate data for the problematic certificates:
> > https://crt.sh/?id=1026264398 (pre-certificate)
> > 
> Have you scanned the entire database of certificates to ensure that there
> are no others with this problem?

Yes. This was the only certificate found with this problem.

> 
> > 6) Explanation about how and why the mistakes were made or bugs introduced,
> > and how they avoided detection until now:
> > MULTICERT: We`ve checked the root cause, because we have already a control
> > that does not allow the issuance of certificates with a validity greater
> > than 27 months (822 days). After verification, we discovered that the API we
> > used to calculate the dates of the certificates only calculate the months
> > and does not transcribes the months to the correct number of days (i.e. if a
> > certificate have a validity from 18 December 2018 to 18 December 2019, this
> > API counts 12 months and not the number of days between this to dates, for
> > instance January have  31 days, February have 28 days, and so on).
> > 
> Please provide a more thorough root cause analysis. For example, why did the
> API use months for the calculation? What parts of your development and
> quality processes need to be improved to prevent similar issues in the
> future?
> 

Commercially, certificate validity is presented in months, which is more meaningful to the average customer (especially when the number of days is not a multiple of 365). We translated the commercial terms directly to the API and calculated in months rather than days. This date arithmetic is implemented using the Java Calendar class.

Furthermore, we also set the notAfter time to 23:59:00, independently of the time of day of the notBefore date. In this particular case, this made the certificate valid for 825.42 days (0.42 days over the limit).
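
To illustrate how a control expressed in months can pass a certificate that a day-based check rejects, here is a minimal sketch. The helper `whole_months_between` is a hypothetical stand-in for month-truncating date arithmetic, not the Java Calendar API, and the dates are invented for the example rather than taken from the affected certificate.

```python
from datetime import datetime, timedelta, timezone

MAX_MONTHS = 27
MAX_VALIDITY = timedelta(days=825)


def whole_months_between(start, end):
    """Count completed calendar months, discarding any leftover days."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    if (end.day, end.time()) < (start.day, start.time()):
        months -= 1
    return months


# Hypothetical certificate: 27 months plus a few extra days, with notAfter
# rounded to 23:59:00 as described above.
not_before = datetime(2019, 6, 15, 12, 0, tzinfo=timezone.utc)
not_after = datetime(2021, 9, 18, 23, 59, tzinfo=timezone.utc)

months = whole_months_between(not_before, not_after)
validity = not_after - not_before

print(months, months <= MAX_MONTHS)      # the month-based control passes
print(validity, validity <= MAX_VALIDITY)  # the day-based control fails
```

Comparing the actual `notAfter - notBefore` duration against a limit in days avoids both the truncation of leftover days and the extra hours introduced by end-of-day rounding.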

> > 7) List of steps your CA is taking to resolve the situation and ensure such
> > issuance will not be repeated in the future, accompanied with a timeline of
> > when your CA expects to accomplish these things:
> > Listed in steps MULTICERT.2.1 to MULTICERT.2.2.
> 
> Those steps fix this issue, but do not prevent similar issues from happening
> in the future.

For now, the validity has been changed to 824 days to stay on the safe side.

Also, as indicated in comment #2 step 7, we are deploying a lint tool to validate all certificates issued and quarantine any if an error is found.
(In reply to ca.forum from comment #6)
> (In reply to Wayne Thayer [:wayne] from comment #4)
> > Please provide a more thorough root cause analysis. For example, why did the
> > API use months for the calculation? What parts of your development and
> > quality processes need to be improved to prevent similar issues in the
> > future?
> > 
> 
> Commercially, the certificates validity is presented in months, which is
> more meaningful to the average customer (especially when the number of days
> is not a multiple of 365). We translated the commercial terms directly to
> the API and calculated in months rather than days. This date arithmetic is
> implemented using Java Calendar class.

I think this is an area where we still need to examine more about the root cause and remediation.

I can understand that you'd want to display validity periods in months, not days.
From a compliance perspective, the longest valid number of days for 27 months is 823 days (366 + 365 + 31 + 31 + 30; that is, a leap year, a 365 day year, and then the longest span of 3 months in the year), so this sounds reasonable.
I understand that, due to a bug in the code you wrote to do this, it resulted in some periods greater than 27 months being 'rounded down' (e.g. how 827 days can be reported as "27 months").
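
The 823-day figure above can be checked by brute force. The sketch below assumes that "27 months" means adding 27 calendar months to the start date and clamping to the last day of a shorter month; `add_months` and `longest_span` are helper names made up for this example.

```python
import calendar
from datetime import date, timedelta


def add_months(day, months):
    """Add calendar months, clamping to the last day of a shorter month."""
    index = day.year * 12 + (day.month - 1) + months
    year, month_index = divmod(index, 12)
    last_day = calendar.monthrange(year, month_index + 1)[1]
    return date(year, month_index + 1, min(day.day, last_day))


def longest_span(months, first, last):
    """Longest span in days of `months` calendar months over a range of start dates."""
    longest = 0
    day = first
    while day <= last:
        longest = max(longest, (add_months(day, months) - day).days)
        day += timedelta(days=1)
    return longest


# Start dates covering two full leap-year cycles.
print(longest_span(27, date(2016, 1, 1), date(2023, 12, 31)))
```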

However, both this, and the previous issue (setting the notAfter based on issuance date, rather than notBefore), seem to suggest a lack of basic testing, as both this and the previous incident seem like issues that would and should have been caught through testing. That's not to say that tests don't exist, but rather, test cases for this weren't developed, and that suggests that there may be other gaps in testing, either edge cases or as a methodology.

As Wayne highlighted, some of the steps proposed "fix" the issue, but don't "prevent" the issue. Part of the root cause analysis is to understand "What could have been done to prevent this from happening", so that such changes are systemically integrated. It sounds like there is a potentially seriously flawed development methodology, so what's changing about how changes are developed and tested? Similarly, what steps have been taken to re-examine other requirements and test cases and make sure they are robust and resistant to edge cases?
Flags: needinfo?(eusebio.herrera) → needinfo?(ca.forum)
(In reply to Ryan Sleevi from comment #7)
> (In reply to ca.forum from comment #6)
> > (In reply to Wayne Thayer [:wayne] from comment #4)
> > > Please provide a more thorough root cause analysis. For example, why did the
> > > API use months for the calculation? What parts of your development and
> > > quality processes need to be improved to prevent similar issues in the
> > > future?
> > > 
> > 
> > Commercially, the certificates validity is presented in months, which is
> > more meaningful to the average customer (especially when the number of days
> > is not a multiple of 365). We translated the commercial terms directly to
> > the API and calculated in months rather than days. This date arithmetic is
> > implemented using Java Calendar class.
> 
> I think this is an area where we still need to examine more about the root
> cause and remediation.
> 
> I can understand that you'd want to display validity periods in months, not
> days.
> From a compliance perspective, the longest valid number of days for 27
> months is 823 days (366 + 365 + 31 + 31 + 30; that is, a leap year, a 365
> day year, and then the longest span of 3 months in the year), so this sounds
> reasonable.
> I understand that, due to a bug in the code you wrote to do this, it
> resulted in some periods greater than 27 months being 'rounded down' (e.g.
> how 827 days can be reported as "27 months").
> 
> However, both this, and the previous issue (setting the notAfter based on
> issuance date, rather than notBefore), seem to suggest a lack of basic
> testing, as both this and the previous incident seem like issues that would
> and should have been caught through testing. That's not to say that tests
> don't exist, but rather, test cases for this weren't developed, and that
> suggests that there may be other gaps in testing, either edge cases or as a
> methodology.
> 
> As Wayne highlighted, some of the steps proposed "fix" the issue, but don't
> "prevent" the issue. Part of the root cause analysis is to understand "What
> could have been done to prevent this from happening", so that such changes
> are systemically integrated. It sounds like there is a potentially seriously
> flawed development methodology, so what's changing about how changes are
> developed and tested? Similarly, what steps have been taken to re-examine
> other requirements and test cases and make sure they are robust and
> resistant to edge cases?

These bugs were present in a new batch job that we have developed for certificate replacement. Previously, this was done by manual processes, but it was obviously not scaling well and was error-prone, so we decided to develop something more automated.

The test cases were in fact not covering enough of the possible situations. I would just like to emphasize that we were trying to quickly roll out this batch job to 1) replace the certificates and minimize violation of BR deadlines; 2) avoid the Christmas freeze window, where a lot of our customers have serious trouble replacing certificates.

Our PKI is in a continuous development process, split into sprints. New test cases have now been integrated to (hopefully) catch more possible errors. And as said before, we are working on the integration of a lint tool as a last safety net, for when everything else fails.
Flags: needinfo?(ca.forum)
Please ignore comment 9 - it was intended for bug 1502957
Whiteboard: [ca-compliance] - Next Update - 07-March 2019 → [ca-compliance]

(In reply to ca.forum from comment #8)

> These bugs were present in a new batch job that we have developed for
> certificate replacement. Previously, this was done by manual processes, but
> it was obviously not scaling well and was error-prone, so we decided to
> develop something more automated.
> 
> The test cases were in fact not covering enough of the possible situations.
> I would just like to emphasize that we were trying to quickly roll out this
> batch job to 1) replace the certificates and minimize violation of BR
> deadlines; 2) avoid the Christmas freeze window, where a lot of our
> customers have serious trouble replacing certificates.
> 
> Our PKI is in a continuous development process, split into sprints. New test
> cases have been now integrated to (hopefully) catch more possible errors.
> And as said before, we are working on the integration of a lint tool as a
> last safe net, when everything else fails.

Thanks for continuing to examine the root causes here, as we try to work out best practices.

Understanding that mistakes happen, it appears we've identified another root cause - which is time pressures lead to compromised code quality. Does that sound accurate? If it does, what steps are being taken to reduce this risk in the future?

Put differently, if there was another event that required a rapid roll-out of certificates, what steps have been taken to help the community feel that it will be done w/o introducing further errors?

(In reply to Ryan Sleevi from comment #11)

> (In reply to ca.forum from comment #8)
> 
> > These bugs were present in a new batch job that we have developed for
> > certificate replacement. Previously, this was done by manual processes, but
> > it was obviously not scaling well and was error-prone, so we decided to
> > develop something more automated.
> > 
> > The test cases were in fact not covering enough of the possible situations.
> > I would just like to emphasize that we were trying to quickly roll out this
> > batch job to 1) replace the certificates and minimize violation of BR
> > deadlines; 2) avoid the Christmas freeze window, where a lot of our
> > customers have serious trouble replacing certificates.
> > 
> > Our PKI is in a continuous development process, split into sprints. New test
> > cases have been now integrated to (hopefully) catch more possible errors.
> > And as said before, we are working on the integration of a lint tool as a
> > last safe net, when everything else fails.
> 
> Thanks for continuing to examine the root causes here, as we try to work out best practices.
> 
> Understanding that mistakes happen, it appears we've identified another root cause - which is time pressures lead to compromised code quality. Does that sound accurate?

Yes.

> If it does, what steps are being taken to reduce this risk in the future?
> 
> Put differently, if there was another event that required a rapid roll-out of certificates, what steps have been taken to help the community feel that it will be done w/o introducing further errors?

Thinking of this particular error: if it were to happen again in the future, the lint tool would catch the misissued certificates, if the test cases had not already caught them sooner.

Trying to summarize things below. Could you please review to ensure that this is correct and complete?

  • Incident #1 (Multicert): Issuing cert with period > 825 days (Comment #0)

    • Timeline:
      • (Sometime between 2018-10-29 and 2018-11-14) Multicert introduces new code to reissue certificates
      • 2018-11-14 - First misissuance: https://crt.sh/?id=945894448
      • 2018-11-16 - Camerfirma detects misissued certificate and communicates with Multicert
      • 2018-11-16 - Multicert implements and deploys fix
      • 2018-11-22 - Multicert completes revocation of existing certificates (Comment #2)
    • Cause: Following Bug #1502957, Multicert introduced code to rapidly reissue certificates. This code improperly issued certificates with the notBefore set to the present date, and then computed the notAfter based on the originally issued validity period (e.g. +3 years). Thus, any certificate issued the following calendar day or later would be misissued (Comment #2)
    • Existing Mitigations:
      • A manual test for reissuance in a test environment existed. However, as this reissuance is performed immediately after the first issuance, it did not exercise this bug.
    • Changes made:
      • Implemented - 2018-11-16 Underlying bug is fixed (Comment #0)
      • Not Implemented - 2019-03 Pre-issuance linting will be implemented (Comment #2)
  • Incident #2 (Multicert): Failure to revoke misissued certificates according to the BRs (Comment #0, Comment #1)

    • Timeline:
      • 2018-11-14 - Multicert misissues certificate: https://crt.sh/?id=945894448
      • 2018-11-16 - Camerfirma notifies Multicert of misissuance
      • 2018-11-22 - Multicert completes revocation of last certificate
    • Cause: Multicert management decided to ignore the Baseline Requirements for this customer. (Comment #2)
    • Changes made:
      • None
  • Incident #3 (Multicert): Issuing cert with period > 825 days (Comment #3)

    • Timeline:
      • 2018-12-13: Multicert misissues a certificate - https://crt.sh/?id=1026264398 (Comment #3)
      • 2018-12-18: Camerfirma detects misissued certificate and notifies Multicert of misissuance (Comment #3)
      • 2018-12-18: Camerfirma corrects the underlying issue
      • 2018-12-18: Camerfirma revokes the misissued certificate
    • Cause:
      • Multicert's implementation of validity period controls was based on the number of months (27 months), rather than the BR-specified number of days. Multicert then used an API which computed fractional months by rounding down/eliminating, allowing for periods greater than 825 days. (Comment #3, Comment #6)
    • Changes Made:
      • Implemented - 2018-12-18 Underlying bug is fixed (Comment #3)
      • Not Implemented - 2019-03 Pre-issuance linting will be implemented (Comment #2)
  • Incident #4 (Camerfirma): Sub-CA with repeated failures to abide by the BRs

    • Incident report/details not provided
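
The cause described for Incident #1 can be illustrated with a minimal sketch. The function names and dates below are hypothetical and are not Multicert's actual code; the sketch only assumes what the summary states, namely that reissuance set notBefore to the present and derived notAfter from the original validity length. The "fixed" variant is one plausible correction (keep the original notAfter and also cap at 825 days), not necessarily the fix that was deployed.

```python
from datetime import datetime, timedelta

# Validity cap discussed in this bug (simplified: plain timedelta comparison).
MAX_VALIDITY = timedelta(days=825)


def buggy_reissue_validity(orig_not_before, orig_not_after, now):
    """Described bug: the original validity *length* is added to the new notBefore."""
    original_length = orig_not_after - orig_not_before
    return now, now + original_length


def fixed_reissue_validity(orig_not_before, orig_not_after, now):
    """Keep the original notAfter date, and additionally cap at MAX_VALIDITY."""
    not_after = min(orig_not_after, now + MAX_VALIDITY)
    return now, not_after


# Hypothetical dates: a 3-year certificate reissued several months later.
orig_nb = datetime(2018, 1, 31)
orig_na = datetime(2021, 1, 31)
reissued_at = datetime(2018, 11, 14)

for name, fn in (("buggy", buggy_reissue_validity), ("fixed", fixed_reissue_validity)):
    nb, na = fn(orig_nb, orig_na, reissued_at)
    print(name, (na - nb).days, "days; within limit:", na - nb <= MAX_VALIDITY)
```

A reissuance test run immediately after first issuance of a short-lived test certificate would not necessarily expose the difference between the two variants, which is consistent with the existing mitigation noted above.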
Flags: needinfo?(wthayer)
Flags: needinfo?(eusebio.herrera)

(In reply to Ryan Sleevi from comment #13)

Many thanks for the summary. Please find a few corrections and additional information below inline.

Trying to summarize things below. Could you please review to ensure that this is correct and complete?

  • Incident #1 (Multicert): Issuing cert with period > 825 days (Comment #0)

    • Timeline:
      • (Sometime between 2018-10-29 and 2018-11-14) Multicert introduces new code to reissue certificates
      • 2018-11-14 - First misissuance: https://crt.sh/?id=945894448
      • 2018-11-16 - Camerfirma detects misissued certificate and communicates with Multicert
      • 2018-11-16 - Multicert implements and deploys fix
      • 2018-11-22 - Multicert completes revocation of existing certificates (Comment #2)
    • Cause: Following Bug #1502957, Multicert introduced code to rapidly reissue certificates. This code improperly issued certificates with the notBefore set to the present date, and then computed the notAfter based on the originally issued validity period (e.g. +3 years). Thus, any certificate issued the following calendar day or later would be misissued (Comment #2)
    • Existing Mitigations:
      • A manual test for reissuance in a test environment existed. However, as this reissuance is performed immediately after the first issuance, it did not exercise this bug.
    • Changes made:
      • Implemented - 2018-11-16 Underlying bug is fixed (Comment #0)
      • Not Implemented - 2019-03 Pre-issuance linting will be implemented (Comment #2)
  • Incident #2 (Multicert): Failure to revoke misissued certificates according to the BRs (Comment #0, Comment #1)

    • Timeline:
      • 2018-11-14 - Multicert misissues certificate: https://crt.sh/?id=945894448
      • 2018-11-16 - Camerfirma notifies Multicert of misissuance
      • 2018-11-22 - Multicert completes revocation of last certificate
    • Cause: Multicert management decided to ignore the Baseline Requirements for this customer. (Comment #2)

A short clarification on why we decided to delay the revocation by one additional day:

That particular certificate is extremely critical to the customer – the national payment gateway system.

  • Changes made:
    • None

As for actions, first and foremost we are educating our customers about the need to be able to update TLS certificates quickly. Almost all of them have been sensible and cooperative during past certificate replacements. However, a few have complex systems that are difficult to update (for example, some have certificate pinning hard-coded in a mobile app that is maintained by a third party). In these cases we are advising the customer about the tight revocation timelines, the benefits of establishing a quick certificate replacement process, and better ways of handling certificate replacement (pinning the SPKI instead of the certificate).

  • Incident #3 (Multicert): Issuing cert with period > 825 days (Comment #3)
    • Timeline:
      • 2018-12-13: Multicert misissues a certificate - https://crt.sh/?id=1026264398 (Comment #3)
      • 2018-12-18: Camerfirma detects misissued certificate and notifies Multicert of misissuance (Comment #3)
      • 2018-12-18: Camerfirma corrects the underlying issue
      • 2018-12-18: Camerfirma revokes the misissued certificate

In the last two timeline entries it was Multicert, not Camerfirma.

In addition, the revocation did not actually take place on 2018-12-18 but on 2019-01-02. This was due to an internal misunderstanding, which was corrected as soon as we double-checked whether all required actions had been completed.

  • Cause:

    • Multicert's implementation of validity period controls was based on the number of months (27 months), rather than the BR-specified number of days. Multicert then used an API which computed fractional months by rounding down/eliminating, allowing for periods greater than 825 days. (Comment #3, Comment #6)
  • Changes Made:

    • Implemented - 2018-12-18 Underlying bug is fixed (Comment #3)
    • Not Implemented - 2019-03 Pre-issuance linting will be implemented (Comment #2)
  • Incident #4 (Camerfirma): Sub-CA with repeated failures to abide by the BRs

    • Incident report/details not provided
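
The cause described for Incident #3 can be sketched as follows. This is an illustration only: the helper `whole_months_between` is a hypothetical stand-in for the unnamed API that discards fractional months, and the dates are made up. It shows how a check expressed as "at most 27 months" can accept a period longer than 825 days, while a check expressed in days rejects it.

```python
from datetime import date

MAX_VALIDITY_DAYS = 825    # limit expressed in days
MAX_VALIDITY_MONTHS = 27   # month-based approximation described in the cause


def whole_months_between(start: date, end: date) -> int:
    """Count complete calendar months, discarding any fractional month."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    if end.day < start.day:
        months -= 1
    return months


def month_based_check(start: date, end: date) -> bool:
    """Flawed control: the trailing partial month is rounded away."""
    return whole_months_between(start, end) <= MAX_VALIDITY_MONTHS


def day_based_check(start: date, end: date) -> bool:
    """Control expressed directly in days."""
    return (end - start).days <= MAX_VALIDITY_DAYS


# Hypothetical validity period: 27 whole months plus 20 extra days.
not_before = date(2018, 12, 13)
not_after = date(2021, 4, 2)

print("months check passes:", month_based_check(not_before, not_after))
print("days check passes:  ", day_based_check(not_before, not_after))
```

Expressing the control in the same unit as the requirement avoids having to reason about month lengths and rounding at all.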

In addition, the revocation did not actually take place on 2018-12-18 but on 2019-01-02. This was due to an internal misunderstanding, which was corrected as soon as we double-checked whether all required actions had been completed.

This amounts to another BR violation. Please describe how this problem occurred. What is being done to prevent this type of "misunderstanding" from happening again?

Ryan: Is incident #4 describing Camerfirma's lack of response to incidents #1-3, or some other issue?

Flags: needinfo?(wthayer) → needinfo?(ryan.sleevi)

Wayne: I was remarking that, to date, this issue has focused on the steps Multicert is doing to deal with their BR issues. However, we haven't seen any incident details or understanding about the steps that Camerfirma is taking to mitigate or prevent these issues, given that they have ultimate responsibility.

Flags: needinfo?(ryan.sleevi)

(In reply to Wayne Thayer [:wayne] from comment #15)

In addition, the revocation did not actually take place on 2018-12-18 but on 2019-01-02. This was due to an internal misunderstanding, which was corrected as soon as we double-checked whether all required actions had been completed.

This amounts to another BR violation. Please describe how this problem occurred. What is being done to prevent this type of "misunderstanding" from happening again?

That certificate issued on 2018-12-13 was itself a replacement of a previous certificate issued to the customer on 2018-01-31. As it was misissued, it was then replaced again by a new certificate issued on 2018-12-18 17:07 https://crt.sh/?id=1039584683.

The replacement process gives the customer a few days to install the new certificate, and our staff usually follow it up with a few rounds of communication with the customer. Revocation of the previous certificate is therefore not immediate and is performed manually by a Registration Officer.

So, at 2018-12-18 17:07 there were effectively 3 active certificates for that domain (2018-01-31, 2018-12-13, 2018-12-18), all related to the same SPKI. The instructions to the Registration Officer for revoking the previous certificates (2018-01-31, 2018-12-13) were not clear and only the certificate from 2018-01-31 was revoked (the certificate from 2018-12-13 was left active).

On returning from the Christmas period, while double-checking whether all actions on this open incident had been completed, we found that the certificate from 2018-12-13 had not yet been revoked. That certificate was revoked on 2019-01-02.

To make the internal revocation instructions clear and unambiguous, at least the certificate's serial number, AKI and CA DN must now be communicated.
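
As an illustration of why these identifiers remove the ambiguity, here is a minimal sketch. The `CertRef` type, the `select_for_revocation` helper and all values are hypothetical and do not describe Multicert's actual tooling. With several active certificates for the same domain and SPKI, an instruction that names each certificate by serial number, AKI and CA DN selects exactly the intended ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CertRef:
    """Hypothetical identifier for one certificate in a revocation instruction."""
    serial_number: str
    authority_key_id: str
    issuer_dn: str


def select_for_revocation(active, targets):
    """Return the active certificates named in targets; fail loudly on unknown ones."""
    wanted = set(targets)
    unknown = wanted - set(active)
    if unknown:
        serials = sorted(ref.serial_number for ref in unknown)
        raise LookupError(f"not among active certificates: {serials}")
    return [cert for cert in active if cert in wanted]


# Placeholder values: three active certificates for the same domain and SPKI.
AKI = "aa:bb:cc"
DN = "CN=Example Issuing CA"
old_cert = CertRef("01", AKI, DN)
misissued_cert = CertRef("02", AKI, DN)
replacement_cert = CertRef("03", AKI, DN)

active_certs = [old_cert, misissued_cert, replacement_cert]
to_revoke = select_for_revocation(active_certs, [old_cert, misissued_cert])
print([cert.serial_number for cert in to_revoke])
```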

Ryan: Is incident #4 describing Camerfirma's lack of response to incidents #1-3, or some other issue?

Setting next update for Multicert to confirm that pre-issuance linting has been implemented for all issuance.

Still awaiting a response from Camerfirma on their oversight of Multicert as described in comment #16.

Whiteboard: [ca-compliance] → [ca-compliance] - Next Update - 01-April 2019

Wayne: I was remarking that, to date, this issue has focused on the steps Multicert is doing to deal with their BR issues. However, we haven't seen any incident details or understanding about the steps that Camerfirma is taking to mitigate or prevent these issues, given that they have ultimate responsibility.

Camerfirma is working together with externally operated intermediate CAs to establish and improve control mechanisms over the certificates issued by them.

Their collaboration in notifying, resolving and establishing measures for remediation and prevention of new issues has been good.

Before 2019-02-14, we will add an obligation for the intermediate CAs that issue TLS/SSL certificates with their own infrastructure under Camerfirma's root: a pre-issuance control requiring them to check their pre-certificates with our lint tool before issuing each certificate.

This requirement will be incorporated into our CPS in a version published before 2019-02-14.

We hope that these clarifications are sufficient to describe the future actions that Camerfirma carries out to prevent these issues.

However, we look forward to any clarification or details you may need.

(In reply to Juan Angel Martin from comment #19)

Juan: Does this mean that all of Camerfirma's 3rd party subordinate CAs will be performing pre-issuance linting by 2019-02-14?

Flags: needinfo?(eusebio.herrera) → needinfo?(martin_ja)
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true

Juan: Does this mean that all of Camerfirma's 3rd party subordinate CAs will be performing pre-issuance linting by 2019-02-14?

I'm sorry, I don't think I expressed myself clearly.

I mean that before 2019-02-14 Camerfirma will deploy our lint tool, and that all of Camerfirma's 3rd party subordinate CAs will be required to perform pre-issuance linting.

Any new Camerfirma 3rd party subordinate CA will have to do so before issuing its first certificate.

For current Camerfirma 3rd party subordinate CAs, we are working on establishing the schedule by which they must have completed the integration of pre-issuance linting with our tool.

We'll disclose this schedule as soon as we have it.

Flags: needinfo?(martin_ja)

Today we've finished the deployment of our lint tool.

We have also required our subCAs to use our lint tool prior to issuing certificates. As soon as we have the schedule for this integration we will post it here.

MULTICERT 2019-03-22 -> Pre-issuance linting has been implemented.

QA Contact: kwilson → wthayer

It appears that remediation has been completed.

Status: ASSIGNED → RESOLVED
Closed: 5 months ago
Resolution: --- → FIXED
Whiteboard: [ca-compliance] - Next Update - 01-April 2019 → [ca-compliance]