Closed Bug 1536213 Opened 7 months ago Closed 3 months ago

ACCV: Insufficient serial number entropy

Categories

(NSS :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jamador, Assigned: jamador)

Details

(Whiteboard: [ca-compliance])

Attachments

(2 files)

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0

Steps to reproduce:

As of 10am on 12/03/2019 GMT, ACCV started researching the 64-bit certificate serial number issue. We have identified a significant quantity of certificates (about 1,800) that, according to recent conversations on the m.d.s.p. list, do not meet the 64-bit serial number requirement. We are still researching, so the certificate quantity could vary.

  1.  How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.
    

ACCV became aware of a possible problem after reading several messages on the mozilla.dev.security.policy list. We use EJBCA software for one of our CAs, and we were convinced that the generation of serial numbers was correct and compliant with the BRs. We began to suspect a problem and started researching at 10am on 12/3/2019 GMT.

  1.  A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.
    
  • 10am on 12/3/2019 GMT: identified a possible problem with serial numbers after following the posts about this issue on the m.d.s.p. list.

  • 16:00 on 12/3/2019 GMT: ACCV confirmed that serial numbers were being generated incorrectly, reviewed the source code and identified the cause.

  • 02am on 13/3/2019 GMT: our development team finished the patch and, after testing it on the test platform, we deployed it to production at 01am on 14/3/2019 GMT.

We are still in the process of identifying all the affected certificates, but we have already proceeded to establish a substitution and revocation mechanism.

  1.  Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem. A statement that you have will be considered a pledge to the community; a statement that you have not requires an explanation.
    

We deployed a fix as soon as possible. We are no longer issuing certificates with serial numbers generated in this way.

  1.  A summary of the problematic certificates. For each problem: number of certs, and the date the first and last certs with that problem were issued.
    

We are still in the process of researching and summarizing the data. We will provide all the information when it is available.
Provisionally:
- approx. 1,600 issued
- approx. 1,500 valid (neither expired nor revoked)

  1.  The complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem.
    

We are still in the process of researching and summarizing the data. We will provide all the information when it is available.

  1.  Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.
    

We were sure that the requirements outlined in the BRs and MP were being met. We were generating 64-bit serial numbers from a 64-bit CSPRNG.
It was only in recent conversations on m.d.s.p. that the mechanism was questioned. Although the certificates have a 64-bit serial number and are generated from 64 bits of CSPRNG output, selecting only the positive values reduces the entropy by 1 bit.
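
The one-bit loss described above can be illustrated with a small calculation. This is only a sketch under an assumption: it models the generator as drawing a uniformly random two's-complement integer and keeping it only when it is strictly positive. The function names are hypothetical and this is not the EJBCA or ACCV code.

```python
import math


def positive_value_count(bits: int) -> int:
    """Distinct values left when a uniformly random two's-complement integer
    of width `bits` is kept only if it is strictly positive (assumed model)."""
    return 2 ** (bits - 1) - 1


def entropy_bits(count: int) -> float:
    """Entropy, in bits, of a uniform choice among `count` values."""
    return math.log2(count)


def brute_force_positive_count(bits: int) -> int:
    """Cross-check the formula by enumerating every bit pattern of a small
    width; `bits` must be a multiple of 8 so it maps to whole bytes."""
    size = bits // 8
    return sum(
        1
        for raw in range(2 ** bits)
        if int.from_bytes(raw.to_bytes(size, "big"), "big", signed=True) > 0
    )


if __name__ == "__main__":
    # The 8-bit analogue is small enough to enumerate exhaustively.
    print(brute_force_positive_count(8), positive_value_count(8))
    # For 64-bit serials the result is just under 63 bits, below the 64 required.
    print(entropy_bits(positive_value_count(64)))
```

Under this model, discarding the negative half of the range removes half of the possible serial numbers, which is exactly one bit of entropy.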

  1.  List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.
    

We have patched the software, and all certificates are now generated with 128-bit serial numbers. We are carefully evaluating scenarios for the replacement and revocation of the affected certificates.
This operation has a very significant impact on our users, because in almost all cases the replacement must be done by hand and by staff with limited technical skills.
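
For illustration only, a 128-bit serial number with ample entropy could be generated along the following lines. This is a minimal sketch and not the patch ACCV deployed: the function name `new_serial_number` is hypothetical, and clearing the top bit (so the value is positive and encodes in 16 octets) is an assumed design choice that leaves 127 random bits, well above the 64 required.

```python
import secrets


def new_serial_number(num_bytes: int = 16) -> int:
    """Return a positive, non-zero serial number drawn from a CSPRNG.

    Hypothetical helper: the most significant bit is cleared so the value
    is a positive integer that fits in `num_bytes` octets, which leaves
    8 * num_bytes - 1 random bits.
    """
    while True:
        raw = bytearray(secrets.token_bytes(num_bytes))
        raw[0] &= 0x7F  # clear the sign bit instead of rejecting negative draws
        serial = int.from_bytes(raw, "big")
        if serial != 0:  # a serial number of zero is not acceptable
            return serial


if __name__ == "__main__":
    print(format(new_serial_number(), "032x"))
```

Generating more bits than the minimum means that fixing the sign bit no longer brings the entropy below the 64-bit threshold.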

We will provide further details in a subsequent update.

Whiteboard: [ca-compliance]

We are updating the information and attaching the list of affected certificates (1,194). We have eliminated the duplicates that appeared in crt.sh two or more times, either due to errors when saving to some CT log or because both the pre-certificate and the leaf certificate were present.
Thanks
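
The de-duplication described above can be sketched as follows. This is an assumption-laden illustration rather than the procedure ACCV actually used: the `CtEntry` record and its fields are hypothetical, and the sketch relies on the fact that a pre-certificate and its final certificate share the same issuer and serial number.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class CtEntry(NamedTuple):
    """Hypothetical record for one crt.sh search result."""
    crtsh_id: int
    issuer: str
    serial_hex: str
    is_precert: bool


def unique_certificates(entries: Iterable[CtEntry]) -> List[CtEntry]:
    """Keep one entry per (issuer, serial number), preferring the final
    certificate over its pre-certificate when both were logged."""
    chosen: Dict[Tuple[str, str], CtEntry] = {}
    for entry in entries:
        key = (entry.issuer, entry.serial_hex.lower())
        current = chosen.get(key)
        if current is None or (current.is_precert and not entry.is_precert):
            chosen[key] = entry
    return list(chosen.values())


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [
        CtEntry(1, "CN=Example CA", "0A1B", True),
        CtEntry(2, "CN=Example CA", "0a1b", False),
        CtEntry(3, "CN=Example CA", "FF02", True),
    ]
    print(unique_certificates(sample))
```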

Assignee: wthayer → jamador
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Summary: ACCV: Error serial number entropy → ACCV: Insufficient serial number entropy

We have spoken with our users, informing them of the situation and of our commitment to replace the affected certificates and revoke the previous ones as soon as possible. We have developed additional actions and tools, and we have already started with the revocations.

It is not possible for us to revoke all certificates in five days, but we know that deadlines are important. We are working to have them all replaced and revoked within a period of no more than one month (we hope to have it done sooner).

We will update the information.

There has been no update for three months.

Please provide a status update, as well as an explanation for the lack of weekly updates, as expected.

Flags: needinfo?(jamador)

Hi,
My apologies for the lack of updates.
We have replaced 1158 certificates, leaving a residual of 30 certificates that we are trying to fix with our users. It is taking longer than we expected, especially due to technical difficulties in replacing this remnant.
We have agreed with them that the final deadline is this July, due to vacations. We are actively working with them to finish it this month.
We will update all the information during the next week.
Sorry
Best regards

Flags: needinfo?(jamador)

Jose: I want to highlight that this is a very troubling reply.

The lack of an update about the progress, along with the apparent violation of the BRs, represents a rather significant event. The process outlined in https://wiki.mozilla.org/CA/Responding_To_An_Incident is intended to provide guidance for CAs on how to handle such significant events and the expectations.

While we can't travel back to the past to fix this, it was certainly clear at the time what the expectations were, in that the above page had been widely circulated after a similar, but different, event. The big concern here is that Comment #2 said "no more than one month", or the end of April, and now it's being proposed to be the end of July in Comment #4.

That said, it is encouraging to hear that 1158 certificates have been replaced. I think a detail about the timeline of when these certificates were replaced would be a useful update to provide - whether they were replaced all at once or slowly progressed, and if the latter, what the frequency was of their revocation and why delays.

With respect to the 30 remaining certificates, that does seem less (1158 + 30 = 1188) than the 1194 certificates mentioned in Comment #1. It's also important to understand what those factors were, as you can see spelled out in https://wiki.mozilla.org/CA/Responding_To_An_Incident#Revocation , and understanding the steps being taken to address those root causes, as discussed in https://wiki.mozilla.org/CA/Responding_To_An_Incident#Incident_Report . Similarly, ensuring that your incident response/mitigation includes steps to ensure https://wiki.mozilla.org/CA/Responding_To_An_Incident#Keeping_Us_Informed is met for future incidents is important.

Flags: needinfo?(jamador)

Hi, Ryan

You're right. We have taken longer than expected because we have had to replace many certificates issued to users who had technical difficulties replacing them. The replacement has been gradual. I will collect the information and update the report. The deadline we have now for the rest is final, because we have already told them that the certificates will be revoked ex officio (there is no other option). We did not want to leave these users without service, taking into account that we caused the problem.
As I said, I'm going to collect the process information and update it shortly, including the timeline you mention, with the times of the replacements.

Thanks for your contribution

Best regards

Flags: needinfo?(jamador)

Resetting the N-I for the future update promised in Comment #6.

Flags: needinfo?(jamador)
  • Update actions done:

  • 1st month

As indicated in the initial communication, ACCV detected a formal problem caused by the loss of one effective bit of entropy in the serial number, going from 64 to 63. Once it was detected, in addition to generating the incident report, we took the steps detailed below.

  • ACCV sent an initial communication on March 22, 2019 notifying subscribers of the 1,194 certificates, indicating the reason for the change and the commitment to do so within one month of the initial notice (that email).

  • ACCV provided additional tools to facilitate replacement, with specific documentation containing instructions for the entire process.

  • To avoid delays, ACCV sent emails at least twice a week for the entire period, reminding subscribers of the certificates still pending replacement and providing support by mail or phone.

  • Emails during this month were sent on the following days:

  • March 27

  • April 1

  • April 4

  • April 8

  • April 10

  • April 15

  • April 18

  • April 22

  • At this point (April 22), 957 certificates (80.1%) had been replaced.

  • This figure was lower than expected. As additional information, during this month the certificate replacements generated 4,875 incidents (web and phone), and ACCV used 250 hours of support staff time.

  • 2nd month

  • ACCV contacted the users of the 237 remaining certificates, once again stressing the urgency of and the need for replacing the certificates, and made telephone calls to insist and to make the instructions and the process as easy as possible.

  • In the interval between April 22 and May 22, 163 certificates were replaced, making a total of 1,120 certificates replaced (93.80%) and leaving 74 certificates to replace.

  • These certificates belong to very small organizations that in many cases do not have personnel with technical skills. In these cases, we have provided customized support, including web and face-to-face assistance.

  • 3rd month

  • During this time, we continued to communicate with users and urge them to replace the certificates, sending two or three emails a week and making intensive telephone calls to the contact numbers.

  • In the interval between May 22 and June 22, 36 certificates were replaced, leaving 38 certificates.

  • 4th month

  • We found that the small number of remaining certificates was taking a long time, so an end date was established (July 22): any certificates not replaced by this date will be revoked ex officio by ACCV.

  • ACCV sends emails every day and makes phone calls, reminding users of the need for and urgency of the replacement and of the deadline imposed.

  • Today (July 10) there are 25 certificates to be replaced. 1,169 (97.9%) have been replaced.

  • We have not revoked unilaterally for the sole purpose of avoiding the loss of service to users, although it is true that the time allowed to them has been excessive.

  • The list of certificates to be replaced is attached, and we will report on the replacements weekly or more often until July 22, at which time ACCV will revoke any remaining certificates ex officio.

  • We regret the delay. ACCV is making an effort in both support and logistics to guarantee service to our users and the replacement of the certificates.
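
The replacement figures reported in the timeline above can be cross-checked with a short tally. This is only a sketch: the checkpoint values are taken from the report (the June 22 count is derived as 1,120 + 36), while the names `TOTAL_AFFECTED`, `CHECKPOINTS` and `progress` are hypothetical.

```python
TOTAL_AFFECTED = 1194  # affected certificates notified on March 22, 2019

# (date, cumulative certificates replaced), as stated or derived from the report
CHECKPOINTS = [
    ("April 22", 957),
    ("May 22", 957 + 163),
    ("June 22", 957 + 163 + 36),
    ("July 10", 1169),
]


def progress(total: int, replaced: int) -> tuple:
    """Return (certificates remaining, percentage replaced)."""
    return total - replaced, 100 * replaced / total


if __name__ == "__main__":
    for date, replaced in CHECKPOINTS:
        remaining, percent = progress(TOTAL_AFFECTED, replaced)
        print(f"{date}: {replaced} replaced, {remaining} remaining ({percent:.2f}%)")
```

The remaining counts this produces (237, 74, 38 and 25) match the numbers given for each month.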

Attached file remnant_10072019.txt

Thanks.

If I understand correctly, ACCV is unilaterally taking the decision to violate the BRs, "for the sole purpose of avoiding the loss of service to users, although it is true that the time allowed to them has been excessive."?

I hope you can understand that such reasoning is not acceptable. CAs making such unilateral decisions jeopardizes the security of all users, and for that reason, such decisions are absolutely not allowed and of serious and great concern. CAs have been removed for such reasons in the past.

I can understand and appreciate the desire to provide a good customer experience. In the past, CAs have sought to provide a good customer experience by issuing 1024-bit certificates, issuing SHA-1 certificates, or issuing MITM certificates. As a result, those CAs were distrusted. All CAs participating in Mozilla's program are expected to abide by the BRs, including the requirements on revocation.

I can understand and appreciate the level of support you've provided to your customers. However, as part of the incident report, it's important to understand what steps you're taking to ensure that you never again miss a revocation requirement. Is my understanding correct, that your plan is to staff your support teams, such that in the future, all certificates will be replaced within the BR-mandated timeframe of 24 hours or 5 days, as appropriate?

I do want to acknowledge some of the good things in your report; for example, providing the timeline of communications helps show the frequency of that communication. After two or three of those messages, no customer can reasonably claim to not know about the issue.

It's important to note that the BRs do not require the customer replace their certificate within that timeframe. However, they do require the CA to revoke. What steps is ACCV taking to ensure it revokes, regardless of replacement, in the future?

Flags: needinfo?(wthayer)

Hi,
Our intention has been to try to carry out the replacement of the certificates in an unfavorable situation, new and unknown to us and our users. The ACCV reported the incident as soon as it was detected and we started working immediately to resolve it. We have made mistakes (without excuses) but the situation has been exceptional and we have really learned a lot from the effort made to avoid this type of logistical and support problems in the future, adjusting to the requirements of the BRs.

For example:

  • We have started to request an additional contact specifically for technical issues (one of the main problems has been the complete lack of technical skill of users).

  • This situation has also made us understand that we should make things easier and our applications even simpler. We are working to simplify user processes, writing documentation and manuals, and adding wizards so that users without technical skills have an easier time.

  • We are working with our support team to improve and extend their capabilities, with better tools and additional training.

Honestly, it has been a logistical and support nightmare, and we hope that a situation like this never happens again. In any case, if we have to face something similar, we will be much better prepared. We understand that it is not an excuse, but we are committed to improve and learn.

We are also adding clauses that clarify information about the revocation of certificates and cover exceptional situations like this one, so that both users and we can take the necessary actions faster, reducing the time as much as possible.

This incident with entropy has been a very unpleasant surprise that has affected many CAs extensively, and I am sure that we and the other CAs have made an effort to find a solution: patching core systems, modifying procedures, and revoking certificates. We therefore appreciate the understanding of this exceptional scenario on the part of all those involved.

Flags: needinfo?(jamador)

Hi,

We confirm that the remaining 25 certificates have been replaced.

Best regards and thanks for everything

It appears that all questions have been answered and remediation is complete.

Status: ASSIGNED → RESOLVED
Closed: 3 months ago
Flags: needinfo?(wthayer)
Resolution: --- → FIXED