Closed Bug 1534295 Opened 5 years ago Closed 5 years ago

Actalis: Insufficient serial number entropy

Categories

(CA Program :: CA Certificate Compliance, task)

3.22
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: adriano.santoni, Assigned: adriano.santoni)

Details

(Whiteboard: [ca-compliance] [dv-misissuance] [ov-misissuance] [ev-misissuance])

Actual results:

On 3 March 2019 Actalis became aware of issues concerning the entropy in the serial numbers of certificates and started investigating the impact.

We determined the root cause of the problem to lie in the unexpected and undocumented behavior of the "EJBCA" software.

On March 4, 2019, we found about 230,000 active certificates whose serial numbers contained only 63 bits of entropy. On March 6, 2019, at 7:30 AM (CET), we implemented a fix to remove the problem. Since then, all of our certificates have been issued with longer serial numbers containing much more than 64 bits of entropy.

We expect to revoke the majority of impacted certificates within approx 30 days, barring unforeseen circumstances.

We will provide further details in a subsequent update.

Adriano,

Can you please provide the details of the "about 230,000 active certificates" that have now violated the BR revocation timeline?

Further, given that this is not (yet) a preliminary incident report as described in https://wiki.mozilla.org/CA/Responding_To_An_Incident , can you please provide one in the next 24 hours?

Similarly, when deferring matters to subsequent updates, can you provide clear timelines about when the next update is expected? It's unclear whether you expect to provide an update one day, one week, or one month from now, and providing clear timelines helps manage expectations about that.

Flags: needinfo?(adriano.santoni)
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Assignee: wthayer → adriano.santoni
Summary: Actalis: serial numbers with 63 bits of entropy → Actalis: Insufficient serial number entropy
Whiteboard: [ca-compliance]

Below we provide a full incident report, which is not yet final.

  1. How your CA first became aware of the problem and the time and date.

On March 3, 2019, upon reviewing a discussion in mozilla.dev.security.policy, we became aware of issues concerning the entropy in the serial numbers of certificates.

  2. A timeline of the actions your CA took in response.

(all times below are CET)

2019-03-03 h18:30, upon reviewing a discussion in m.d.s.p. we learned of a problem concerning serial numbers generated by the EJBCA software, which we also use, and started investigating the impact of that problem on our certificates.
2019-03-04 h10:15, having confirmed the impact and its root cause, we started studying the best course of action to fix it as soon as possible; we tested a change to the EJBCA configuration so as to lengthen serial numbers and verified that this change could be implemented on "live" CAs without unwanted side effects.
2019-03-05 h15:40, having found that the fix was viable and safe, we planned its deployment to our production systems for early the next morning.
2019-03-06 h07:30, we deployed the fix to our production environment, then double-checked that the change was effective and that everything else was still in order by closely examining several newly issued certificates and confirming that none had any problems following the change.
2019-03-07 h08:30, we started investigating the best way to re-issue the impacted certificates so as to minimize disruption to our users. This task is still in progress.

  3. Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem. A statement that you have will be considered a pledge to the community; a statement that you have not requires an explanation.

Confirmed. Since March 6, h07:30 CET (see timeline above), all of our certificates have been issued with longer serial numbers (16 bytes) containing much more than 64 bits of randomness.

  4. A summary of the problematic certificates. For each problem: number of certs, and the date the first and last certs with that problem were issued.

We issued approx. 350,000 certs with the problem, of which approx. 224,000 are still active to date. We are still collecting and checking data. We will provide more precise figures in a subsequent update, as soon as possible.

  5. The complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem.

We will provide a full listing in a subsequent update, as soon as possible.

  6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

The serial number generation logic implemented by EJBCA, combined with EJBCA’s default settings, caused serial numbers to have less entropy than we expected. We relied on EJBCA to generate 64-bit random serial numbers, and since the serial numbers had the expected length and did look random, we did not realize that they contained less than 64 bits of randomness.
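The report does not describe EJBCA's internal logic, so the following is only a minimal Python sketch of one way an 8-byte serial can end up with 63 random bits: X.509 serial numbers must be positive and a DER INTEGER is two's complement, so an 8-byte encoding cannot use its top bit. The function names are hypothetical and this is not EJBCA code.

```python
import secrets


def max_random_bits(serial_bytes: int) -> int:
    """Upper bound on random bits in a positive serial encoded in exactly serial_bytes bytes."""
    # The most significant bit is the sign bit and must stay 0 for a positive value.
    return serial_bytes * 8 - 1


def random_positive_serial(serial_bytes: int) -> int:
    """Illustrative generator (an assumption for this sketch, not EJBCA's algorithm)."""
    raw = bytearray(secrets.token_bytes(serial_bytes))
    raw[0] &= 0x7F  # clear the sign bit so the encoded integer is not negative
    # Zero is possible in principle but astronomically unlikely at these sizes.
    return int.from_bytes(raw, "big")


print(max_random_bits(8))   # 8-byte serials, the length in use before the fix
print(max_random_bits(16))  # 16-byte serials, the length in use after the fix
```

Under these assumptions, a serial that looks like a full 64-bit random value still carries at most 63 bits of randomness, which is consistent with the figure reported in this bug.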

  7. List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.

The defect was fixed on March 6 (see timeline at point #2).

We expect to revoke the majority of impacted certificates within approx. 30 days, barring unforeseen circumstances. We are preparing suitable procedures in order to achieve that with the minimum disruption to our customers. We anticipate that for a fraction of the impacted certificates it will take longer due to the limited agility of certain types of customers in handling certificate re-issuances at unexpected times.

To prevent reoccurrence of the same problem in the future, we adopted the following additional measures:

  • Since industry tools such as CABLint/ZLint do not currently handle this, we modified our own compliance-checking tool so as to block issuance of certificates having serials shorter than 16 bytes. This change is already in production.

  • We held an awareness meeting with all internal stakeholders to describe in detail what happened and what the causes were, and to underline that we must ensure at all times our compliance with the randomness requirement on serial numbers as an integral part of our compliance with the BRs.
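As an illustration of the serial-length check described in the first measure, here is a minimal Python sketch. It is not Actalis's actual tool: the function names and the hex-string input are assumptions, and the length is measured on the unsigned value of the serial rather than on its DER encoding.

```python
MIN_SERIAL_BYTES = 16  # threshold stated in the report


def serial_length_bytes(serial_hex: str) -> int:
    """Number of bytes needed to hold the serial's unsigned value."""
    value = int(serial_hex, 16)
    return max(1, (value.bit_length() + 7) // 8)


def allow_issuance(serial_hex: str) -> bool:
    """Hypothetical pre-issuance gate: reject serials shorter than 16 bytes."""
    return serial_length_bytes(serial_hex) >= MIN_SERIAL_BYTES


# An 8-byte serial of the kind listed in this bug would be blocked.
print(allow_issuance("2CCA8249DD99A3C5"))
# A made-up 16-byte serial would pass.
print(allow_issuance("7A" * 16))
```

A length check of this kind cannot prove that a serial is random. It only catches a configuration that falls back to short serials, and a 16-byte random value whose leading byte happens to be zero would be measured as shorter by this sketch.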

== Update ==

  • Details on impacted certificates

Essentially all certificates we issued between Sep 30, 2016 and March 6, 2019, were impacted.
The first is https://crt.sh/?serial=2CCA8249DD99A3C5, the last is https://crt.sh/?id=1263163337.
That's a total of 411333 certificates, of which 249627 were still active as of yesterday, E&OE.

DV: total 405140, still active 245685, first https://crt.sh/?serial=2CCA8249DD99A3C5, last https://crt.sh/?serial=45f6848eb52e8f62
OV: total 5361, still active 3359, first https://crt.sh/?serial=CF917218215694A, last https://crt.sh/?serial=5F562B4578F87189
EV: total 832, still active 583, first https://crt.sh/?serial=24E2CED08BBE99D1, last https://crt.sh/?serial=40ACA397101850E
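The per-class figures can be checked against the stated totals with a few lines of Python. The numbers are copied from the update above; the variable names are only illustrative.

```python
# (total issued, still active) per certificate class, as reported in the update
counts = {
    "DV": (405140, 245685),
    "OV": (5361, 3359),
    "EV": (832, 583),
}

total_issued = sum(issued for issued, _ in counts.values())
total_active = sum(active for _, active in counts.values())

print(total_issued, total_active)
```

Both sums match the totals given in the update: 411333 issued and 249627 still active.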

  • Progress with revocations

DV - Almost ready to start revocations. We will update soon.
EV - So far we have revoked 233 EV certs.
OV - We will update asap.

Adriano,

The speed at which Actalis is handling this is slightly concerning. It took 9 days from initial report to Comment #2, and nearly two weeks between that and Comment #3, and it's clear from Comment #3 that no clear plan is yet available for how to handle OV and DV certificates.

As part of handling this incident, please

  1. Clearly indicate a date at which you will provide a clear plan and commitment regarding DV and OV certificates.
  2. Clearly indicate what steps are being taken such that, in the future, Actalis is capable of performing timely impact analysis and providing a clear and committed plan for resolving matters of non-compliance.

I will provide an update soon.

Ryan, we've been working hard on this issue since it first came to light. Our impact analysis was in fact performed fairly quickly, although we were slow to report on it here because we were concentrating on the remedial actions. We will endeavor to provide updates more quickly.

Several of our people, at all levels, are constantly busy with the remedial actions and related tasks, and all internal stakeholders are giving the issue maximum attention and commitment. At the start, the various company departments involved were summoned and briefed. Several meetings were then held to share the impact analysis and discuss how best to re-issue, and then revoke, the impacted certificates of the various classes, according to the various types of customers and request channels.

Since March 11 we have also been developing ad hoc software procedures for bulk renewal (where applicable, according to certificate class and request channel) and bulk revocation of the impacted certificates. These procedures had to be tested to ensure that they were accurate and reliable, and that took time. Last week we completed the technical and organizational groundwork needed to start bulk re-issuance of most of the impacted DV certificates. Yesterday, March 27, we started this process and we have already reissued more than 1000 DV certificates. Today, March 28, we started revoking the impacted DVs that have been replaced, and we have revoked more than 1000 of them so far. We will gradually increase the pace of revocations, which will allow us to revoke almost all of the impacted DVs by mid-April, barring unforeseen circumstances.

As to impacted OVs and EVs: we are already revoking certificates of both classes, but as anticipated this task will take longer than with DV certs, for the already mentioned reasons. I will provide an update soon.

Thanks Adriano. Details like those in Comment #6 help provide the community an understanding of what's being undertaken, and more importantly, help understand what's being proposed before it's undertaken, to ensure those are all in line with best practices and community expectations.

In the future, incident reports that share such details are expected as the norm.

I'm going to set this to NextUpdate in mid-April. If the facts or information changes, please ensure to provide prompt updates and explanations. Providing periodic updates that things are progressing as expected is also valuable.

As mentioned in Comment #4, when switching the evaluation of this incident from 'reactive' (that is, addressing this immediate issue) to 'proactive' (examining next steps to prevent or mitigate future issues), please be sure to include consideration for what caused a lack of updates here (e.g. the busyness of staff, it seems), and how prompt and periodic disclosures will be integrated as a mandatory part of incident response.

We'll look to the April update for Actalis' status on the current issue, as well as the plans (or the timeline for plans) to mitigate future issues that this has highlighted, such as BR-compliance review of software, review of m.d.s.p. disclosures, timely incident reporting and disclosure to Mozilla, and timely revocation/replacement of certificates for Subscribers.

Flags: needinfo?(adriano.santoni)
Whiteboard: [ca-compliance] → [ca-compliance] Next Update - 15-April-2019

So far we have revoked:

  • approx 11% of impacted DV certs
  • approx 55% of impacted EV certs
  • approx 3% of impacted OV certs

I will provide another update in the next few days.

To date, we have revoked:

  • approx 54% of impacted DV certs
  • approx 58% of impacted EV certs
  • approx 3% of impacted OV certs

In the following days we expect to provide a timeframe for revocation of a significant fraction of impacted OV and EV certs based on the agreements that we are finalizing with certain large customers. In the meantime we are busy contacting numerous retail customers.

To date, we have revoked:

  • approx 90% of impacted DV certs
  • approx 68% of impacted EV certs
  • approx 4% of impacted OV certs

Overall, we have revoked some 88% of all impacted certificates, and the revocation process is continuing on a 7x24 basis.
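The overall figure can be roughly reproduced by weighting the per-class percentages by the number of certificates in each class. The sketch below assumes the percentages apply to the still-active counts from the earlier update and treats the "approx" values as exact; both are assumptions, not statements from the report.

```python
# Still-active impacted certificates per class, from the earlier update (assumed base)
active = {"DV": 245685, "OV": 3359, "EV": 583}

# Approximate revoked share per class, as reported in this comment
revoked_share = {"DV": 0.90, "EV": 0.68, "OV": 0.04}

revoked = sum(active[cls] * revoked_share[cls] for cls in active)
overall = revoked / sum(active.values())

print(f"{overall:.1%}")
```

Because DV certificates make up almost all of the impacted population, the overall percentage tracks the DV percentage closely, and the result lands between 88% and 89%, in line with the "some 88%" stated above.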

We expect to complete revocation of the remaining impacted DVs in the next few days. As to the remaining impacted OVs and EVs, we expect to revoke an appreciable fraction of them by the end of May, and we aim to revoke most of them by the end of June, barring unforeseen circumstances. OV certs in particular involve several large customers, in both the private and public sectors, each with several subsidiaries, so the reissuance process is rather lengthy because those organizations have complex procedures that are not tailored for fast re-issuances at unexpected times. Furthermore, we have numerous small customers who need to be actively contacted one by one, and this takes time and labour. In view of the particular nature of the non-compliance, which does not entail significant short-term security risks, we deem the expected timeframe for revocation to be reasonable.

To date, we have revoked:

  • approx 97% of impacted DV certs
  • approx 74% of impacted EV certs
  • approx 15% of impacted OV certs
Whiteboard: [ca-compliance] Next Update - 15-April-2019 → [ca-compliance] - Next Update - 01-July 2019

To date, we have revoked:

  • approx 99.9% of impacted DV certs
  • approx 76% of impacted EV certs
  • approx 22% of impacted OV certs

Although certificate reissuance is progressing steadily and is now well advanced, we expect not to complete the OV and EV revocation process by end of June due to the slowness of certain major customers in replacing certificates on their servers, on account of their complex organization. Some weeks' delay is possible.

Adriano: I am concerned about the significantly slow progress, and concerned about the explanations for that progress. As captured in https://wiki.mozilla.org/CA/Responding_To_An_Incident, a response such as "due to the slowness of certain major customers" is not an acceptable explanation for a lack of revocation.

In order to ensure Actalis is appropriately handling things, please provide a fuller explanation, with a per-customer breakdown about the challenges faced, as well as the steps Actalis is taking to ensure that this never happens again. The Baseline Requirements place clear expectations upon Actalis, and the present response gives the impression that Actalis does not intend to abide by them nor does it take them seriously. The way to combat such an interpretation is to be vastly more detailed in your updates, and for each challenge faced, provide clear explanation about these challenges and the steps Actalis is taking to ensure they never repeat. Customer difficulty is not, in and of itself, an acceptable explanation.

Flags: needinfo?(adriano.santoni)

Actalis has never doubted for a moment the need to revoke all impacted certificates as soon as possible and has taken that commitment very seriously. So much so that today we have almost completely remedied the situation, as we have revoked ca. 99% of all impacted certificates. This has been possible thanks to the work of several people who, since the beginning of March, have been continuously carrying out various activities aimed at that goal: sending reminders to customers, holding meetings with colleagues and with main customers to assess the progress of the process, continuous monitoring of reissuances and revocations, analysis of certificates installed on customer websites, production of reports for various stakeholders, etc.

As to the remaining impacted certificates (amounting to 1.2% of the total), the situation is as follows:

  • ca. 1300 certificates belong to several major public entities with highly complex organizational structures (many operating units and/or subsidiaries, dozens of internal interlocutors, management of certificates often carried out by external contractors, complex operating procedures, etc.). Some of these institutions, after our initial communications, initiated incident handling procedures of their own that lasted for a few weeks, and only at the end of those procedures did they start certificate re-issuance, allowing us, in turn, to start revoking impacted certificates. For the past few weeks, those entities have been busy replacing their certificates.

  • ca. 800 certificates belong to two large private companies, also characterized by a complex internal organization and with the same problems mentioned above. Both are busy replacing their certificates.

  • ca. 700 certificates belong to "retail" customers. Also for them, certificate replacement is steadily proceeding. We are continuing to solicit them and we have clearly communicated that the revocation of the impacted certificates cannot be further delayed.

With the actions taken and based on the commitments made by the various customers, in particular by the large organizations with which we have a direct and constant contact, and also considering that all subscribers are busy requesting new certificates and installing them, at a good pace, we expect to essentially complete the revocation of the small amount of remaining impacted certificates by end of July.

This does not meet the level of detail expected, which was spelled out in comment #13. The timeline itself is unsupported by the available data, nor has there been any meaningful demonstration that it is being taken seriously.

At this point, I believe the handling of this incident represents a serious and systemic failure, and should be reflected as such. Furthermore, I do not see the requested information provided in Comment #13 to provide assurance it won’t happen again. While the original issue may have been seen as not severe, the systemic failures to provide the necessary information or appropriate levels of transparency seriously undermine trust.

I do hope steps will be taken to correct this perception.

Adriano: to reiterate what Ryan has stated, Mozilla has a specific set of expectations for CAs when they choose to violate the BR revocation requirements. These expectations are listed at https://wiki.mozilla.org/CA/Responding_To_An_Incident#Revocation Please review this and provide the information, including an explanation for why this situation is exceptional, and the results of Actalis' analysis of how these delays will be prevented in the future.

With reference to https://wiki.mozilla.org/CA/Responding_To_An_Incident#Revocation, it was clear to us from the very beginning that it was an exceptional circumstance, where "revoking misissued certificates within the prescribed deadline may cause significant harm, such as when the certificate is used in critical infrastructure and cannot be safely replaced prior to the revocation deadline, or when the volume of revocations in a short period of time would result in a large cumulative impact to the web." The exceptional nature of the situation depended first of all on the sheer number of impacted certificates (more than 245,000) and secondly on the fact that thousands of certificates were issued to various large organizations that we knew, for various reasons, would not be able to react quickly. On the other hand, a massive unconditional revocation within a few days was not conceivable, because it would have meant a total disruption to literally millions of users. Therefore, before revoking the impacted certificates it was necessary to replace them, as soon as possible, and this required a great effort.

We managed to revoke the great majority (88%) of impacted certificates within approx. 1 month from the preliminary incident report, in line with our initial forecast. This required the modification of our software so as to be able to handle the expected renewal rate of certificates. During the first month after the incident, therefore, we renewed certificates (and revoked the previous ones) at an average rate of over 6000 certificates/day, mainly automatically. We have performed such processing at the fastest speed that we managed to achieve, much greater than usual.

In the meantime, we immediately contacted our large customers in various ways (email, phone calls, meetings) explaining the situation and asking for their rapid reaction to re-issue all the impacted certificates, but with these customers we knew that the process would be more difficult for several reasons, mainly of an organizational nature, in some cases also of a technical nature. We had different types of feedback from them, but in all cases we had to face a wide range of challenges that made it impossible to rapidly revoke the impacted certificates issued to those customers, as it was necessary to avoid disruption of their services which are for the most part quite critical and provided to a very large number of users.

The challenges we faced were not specific to each customer, but were common to almost all of them, albeit to a variable extent:

  1. in almost all cases, the large customer has a very articulated and complex organization that has prevented the replacement of the impacted certificates from being carried out quickly: numerous subsidiaries and/or operating units distributed throughout the territory; dozens of internal interlocutors; certificate management very often carried out by external suppliers; complex and inflexible operating procedures;

  2. in some cases, our initial communications made to the customer have been ignored for several days, sometimes even for a few weeks, e.g. due to the fact that the addressee was absent at that time, or because his/her role had changed, or because he/she wrongly judged (although our communication was very clear on the revocation being mandatory and urgent) that the matter was not really urgent, without bothering to immediately forward the communication to other company departments;

  3. in some cases, the customer assured us that they would replace the impacted certificates by a certain date, but then they carried out such activity very slowly and without giving us any plausible explanations; this forced us to solicit the customer several times, verbally and in writing, which led to misunderstandings and tension;

  4. in some cases, the contacted person decided to forward our communication not to the suitable technical departments but rather to their legal office (to evaluate the question from the point of view of contractual obligations) and/or to their security office (to initiate a procedure of incident handling), which led to a considerable delay in responding to our communication despite several reminders from us;

  5. several impacted certificates pertaining to the customer were for servers that provide web services to client systems that use certificate pinning, therefore replacing the server certificate required the prior update of all clients (including several mobile apps), which could not be done in a short time, for reasons also dependent on the involved app stores and app distribution timing;

  6. in several cases, the technical staff in charge of replacing the certificates could not be activated quickly because they were contract staff, so it was necessary for the customer to issue supplementary purchase orders and/or to consume man-days of consultancy at unexpected times; this caused considerable delay and tension;

  7. in almost all cases, our supply of certificates to these customers is the subject of a call for tenders, therefore our communications have been brought to the attention of the person in charge of the tender at the customer who, before authorizing the certificate replacement, considered it necessary to evaluate the situation in light of the contract conditions and the tender requirements, which sometimes led to long delays in responding to our requests and/or in carrying out the replacement of impacted certificates;

  8. several impacted certificates issued for public bodies cannot be replaced quickly because they are used for machine-to-machine communications within the "Public Connectivity System" (https://www.agid.gov.it/en/infrastructures/public-connectivity-system): a network that connects all Italian public agencies, wherein peculiar technical requirements apply; within the SPC, when a server certificate must be changed, the new certificate must first be distributed to all the counterparts, before being installed, and this takes a long time; unconditional revocation on the other hand would lead to an "interruption of public service".

  9. in some cases, the customer had problems with handling the domain control validation with methods that were new to them; more on this below, with reference to our retail customers.

All these difficulties were finally overcome, but this has required several weeks of exhausting meetings, phone calls, and supplementary written communications, in some cases with excessive formality. To date, all those large customers are busy replacing their impacted certificates and several of them have already completed the process (and the related impacted certificates have been revoked).

This is a breakdown, to date, of the residual impacted certificates pertaining to those customers:

  • entities that are part of the Tuscany Region (e.g. O=Rete Telematica Regionale Toscana, etc.): 206
  • entities that are part of the Piedmont Region (e.g. O=CSI Piemonte, etc.): 268
  • central public government (e.g. O=Bank of Italy, Ministry of Transports, etc.): 171
  • major banks (e.g. O=Unicredit S.p.A., FinecoBank, etc.): 284
  • large private companies (e.g. O=SNAM, Terna, Wind, etc.): 283
  • chambers of commerce: 239

Most of them have assured us they will complete certificate replacement in July.

Another 438 residual impacted certificates were issued to many retail customers of all kinds (also including banks, healthcare, public bodies, etc. providing critical services). All these customers were sent repeated reminders via email.

The biggest challenge we faced with retail customers is that more than half of their certificates were issued before August 1, 2018, when domain control validation was still allowed with BR method #1 (i.e. based on WHOIS record information). Given that today such method can no longer be used, many customers have found themselves forced to demonstrate control of their domains with other methods (e.g. website change, DNS change, constructed email to domain contact) that were completely new to them, and they have struggled to understand and manage them; this single difficulty caused a huge waste of time, despite all our efforts to facilitate those customers, and also affected some enterprise customers whose certificates were all issued before August 2018. With some of our retail customers, however, we also faced some of the same challenges mentioned above for enterprise customers.

On July 1st we started the unconditional revocation of the certificates of retail customers who have not responded to our requests, or have not responded adequately, and we aim to revoke most of them by July 15th.

Considering all types of customers, as of today we have revoked more than 99% of all impacted certificates.

From this exceptional experience we learned that some of the decisions we initially made to deal with the problem have not fully achieved the expected results.

To prevent a similar situation - of which we emphasize the exceptionality - from occurring again we have decided to adopt the following corrective measures:

  • revision of our trust services Security Plan so as to explicitly cover this type of circumstance (crisis), should a similar problem reoccur, including:
    • immediate activation of a telephone task force dedicated to following up on the email messages already sent and soliciting a reaction from all impacted customers;
    • immediate strengthening, more than we did this time, of our SSL delivery team so as to handle all re-issuances at a higher speed
  • internal training aimed at raising the awareness of all stakeholders on this type of crisis and on the actions to be taken should it happen again.

(In reply to ADRIANO SANTONI from comment #17)

With reference to https://wiki.mozilla.org/CA/Responding_To_An_Incident#Revocation, it was clear to us from the very beginning that it was an exceptional circumstance, where "revoking misissued certificates within the prescribed deadline may cause significant harm, such as when the certificate is used in critical infrastructure and cannot be safely replaced prior to the revocation deadline, or when the volume of revocations in a short period of time would result in a large cumulative impact to the web." The exceptional nature of the situation depended first of all on the sheer number of impacted certificates (more than 245,000) and secondly on the fact that thousands of certificates were issued to various large organizations that we knew, for various reasons, would not be able to react quickly. On the other hand, a massive unconditional revocation within a few days was not conceivable, because it would have meant a total disruption to literally millions of users. Therefore, before revoking the impacted certificates it was necessary to replace them, as soon as possible, and this required a great effort.

I'm uncertain why it was suggested this was inconceivable. The industry has repeatedly shown that this is very much an expected part. For example, the results of Heartbleed were similarly large-scale, and a number of CAs responded in a timely fashion. More recently, the discussions in the CA/Browser Forum for nearly two years regarding SC6, or the discussions on mozilla.dev.security.policy regarding underscores, similarly have shown that these are not at all unforeseen events.

Has Actalis been monitoring these CA/Browser Forum and mozilla.dev.security.policy discussions? If so, can you provide more details about why it was not conceivable? If not, why not and what steps are being taken to resolve?

We managed to revoke the great majority (88%) of impacted certificates within approx. 1 month from the preliminary incident report, in line with our initial forecast.

I think it's worth highlighting that a number of CAs achieved much better responsiveness, both in time and percentage, and with a larger volume of certificates. Has Actalis taken steps to examine what they can be doing to improve here?

This required the modification of our software so as to be able to handle the expected renewal rate of certificates. During the first month after the incident, therefore, we renewed certificates (and revoked the previous ones) at an average rate of over 6000 certificates/day, mainly automatically. We have performed such processing at the fastest speed that we managed to achieve, much greater than usual.

This suggests that Actalis was not and has not been capable of meeting the requirements of the Baseline Requirements. Can you describe what modifications were necessary? If a CA has rate limits on the number of revocations it can process, that's very concerning.

In the meantime, we immediately contacted our large customers in various ways (email, phone calls, meetings) explaining the situation and asking for their rapid reaction to re-issue all the impacted certificates, but with these customers we knew that the process would be more difficult for several reasons, mainly of an organizational nature, in some cases also of a technical nature. We had different types of feedback from them, but in all cases we had to face a wide range of challenges that made it impossible to rapidly revoke the impacted certificates issued to those customers, as it was necessary to avoid disruption of their services which are for the most part quite critical and provided to a very large number of users.

I'm not sure I understand this. Revocation does not require Subscriber contact. Indeed, the CA's warranties and obligations to relying parties and browsers, and the Subscriber Agreement/Terms of Use that Subscribers agree to, are required to acknowledge this.

It would be helpful to understand where the requirement for contact comes from, or if it's merely a business decision to violate the BRs.

The challenges we faced were not specific to each customer, but were common to almost all of them, albeit to a variable extent:

As discussed on mozilla.dev.security.policy rather extensively when revising this section, in the context of underscores, this does not meet the requirements. It's important to highlight this, because Actalis has a responsibility to be aware of and follow those discussions, as lengthy as they may have been, and even though Actalis did not themselves issue underscores. Wayne notified CAs of the change and context surrounding it, so that's similarly not a valid reason not to be familiar.

I appreciate the examples provided, and they are closer to the minimum response level expected of responsible CAs, but it's important to note that such generalities are specifically something to be avoided here. CAs are responsible for evaluating these facts on a case-by-case basis. CAs that feel that this may be a significant, perhaps burdensome, amount of work can of course do the thing that is required of them and which they agreed to, which is to revoke these certificates. CAs that attempt to generalize, however, are much more negatively perceived when it comes to their incident response and handling capabilities.

  1. in almost all cases, the large customer has a very articulated and complex organization that has prevented the replacement of the impacted certificates from being carried out quickly: numerous subsidiaries and/or operating units distributed throughout the territory; dozens of internal interlocutors; certificate management very often carried out by external suppliers; complex and inflexible operating procedures;

The Subscriber Agreement, however, is meant to protect against this, by establishing that the Customer agrees to and acknowledges the requirement for such revocation. Can you provide details of Actalis' Subscriber Agreement and how it complies with the Baseline Requirements?

  1. in some cases, our initial communications made to the customer have been ignored for several days, sometimes even for a few weeks, e.g. due to the fact that the addressee was absent at that time, or because his/her role had changed, or because he/she wrongly judged (although our communication was very clear on the revocation being mandatory and urgent) that the matter was not really urgent, without bothering to immediately forward the communication to other company departments;

The Baseline Requirements require that the CA have the ability to revoke certificates as needed, without requiring such communications with customers. Similarly, customers are required to agree and to acknowledge this, by the required terms of the Subscriber Agreement in the Baseline Requirements. Can you provide details about Actalis' CP/CPS which states that it will not abide by the BRs if it cannot contact the customer?

  1. in some cases, the customer assured us that they would replace the impacted certificates by a certain date, but then they carried out such activity very slowly and without giving us any plausible explanations; this forced us to solicit the customer several times, verbally and in writing, which led to misunderstandings and tension;

I'm not sure I understand. PKI works such that the CA initiates revocation, and the BRs require that CAs initiate revocation for a number of reasons, including this. Can you help me understand what part of the CP/CPS is incompatible with this?

  1. in some cases, the contacted person decided to forward our communication not to the suitable technical departments but rather to their legal office (to evaluate the question from the point of view of contractual obligations) and/or to their security office (to initiate a procedure of incident handling), which led to a considerable delay in responding to our communication despite several reminders from us;

See above.

  1. several impacted certificates pertaining to the customer were for servers that provide web services to client systems that use certificate pinning, therefore replacing the server certificate required the prior update of all clients (including several mobile apps), which could not be done in a short time, for reasons also dependent on the involved app stores and app distribution timing;

It seems that the customer made a choice incompatible with the Subscriber Agreement. Can you highlight where in the CP/CPS Actalis specifies this is a viable reason for ignoring the BRs?

  1. in several cases, the technical staff in charge of replacing the certificates could not be activated quickly because they were contract staff, so it was necessary for the customer to issue supplementary purchase orders and/or to consume man-days of consultancy at unexpected times; this caused much delay and tensions;

  2. in almost all cases, our supply of certificates to these customers is the subject of a call for tenders, therefore our communications have been brought to the attention of the customer's person in charge of the tender who, before authorizing the certificate replacement, considered it necessary to evaluate the situation in light of the contract conditions and the tender requirements, which sometimes led to long delays in responding to our requests and/or in carrying out the replacement of impacted certificates;

  3. several impacted certificates issued for public bodies cannot be replaced quickly because they are used for machine-to-machine communications within the "Public Connectivity System" (https://www.agid.gov.it/en/infrastructures/public-connectivity-system): a network that connects all Italian public agencies, wherein peculiar technical requirements apply; within the SPC, when a server certificate must be changed, the new certificate must first be distributed to all the counterparts, before being installed, and this takes a long time; unconditional revocation on the other hand would lead to an "interruption of public service".

  4. in some cases, the customer had problems with handling the domain control validation with methods that were new to them; more on this below, with reference to our retail customers.

All these difficulties were finally overcome, but this required several weeks of exhausting meetings, phone calls, and supplementary written communications, in some cases with excessive formality. To date, all those large customers are busy replacing their impacted certificates and several of them have already completed the process (and the related impacted certificates have been revoked).

This is a breakdown, to date, of the residual impacted certificates pertaining to those customers:

  • entities that are part of the Tuscany Region (e.g. O=Rete Telematica Regionale Toscana, etc.): 206
  • entities that are part of the Piedmont Region (e.g. O=CSI Piemonte, etc.): 268
  • central public government (e.g. O=Bank of Italy, Ministry of Transports, etc.): 171
  • major banks (e.g. O=Unicredit S.p.A., FinecoBank, etc.): 284
  • large private companies (e.g. O=SNAM, Terna, Wind, etc.): 283
  • chambers of commerce: 239

Most of them have assured us they will complete certificate replacement by the end of July.
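The breakdown above can be tallied with a short script. This is only an illustrative sketch: the counts are copied from the list, while the shortened group labels and the computed total are ours and are not figures stated by Actalis.

```python
# Residual impacted certificates per customer group, copied from the list above.
# The labels are shortened for illustration only.
residual = {
    "Tuscany Region entities": 206,
    "Piedmont Region entities": 268,
    "central public government": 171,
    "major banks": 284,
    "large private companies": 283,
    "chambers of commerce": 239,
}

total = sum(residual.values())
largest = max(residual, key=residual.get)

# Print groups from largest to smallest, with each group's share of the total.
for group, count in sorted(residual.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{group:28s} {count:4d}  ({count / total:.1%})")
print(f"{'total':28s} {total:4d}")
```

Summing the six groups gives 1,451 residual certificates for these large customers at the time of that comment, with no single group accounting for more than about a fifth of them.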

This is getting closer to what is required, but I want to highlight and emphasize: This is still below the minimum requirements expected of CAs.

The biggest challenge we faced with retail customers is that more than half of their certificates were issued before August 1, 2018, when domain control validation was still allowed with BR method #1 (i.e. based on WHOIS record information). Given that today such method can no longer be used, many customers have found themselves forced to demonstrate control of their domains with other methods (e.g. website change, DNS change, constructed email to domain contact) that were completely new to them, and they have struggled to understand and manage them; this single difficulty caused a huge waste of time, despite all our efforts to facilitate those customers, and also affected some enterprise customers whose certificates were all issued before August 2018. With some of our retail customers, however, we also faced some of the same challenges mentioned above for enterprise customers.

This also is something the industry has long discussed. You can see this both when trust in CAs is removed (e.g. Symantec), but also when validation methods are removed. It sounds like, despite using now-forbidden methods, Actalis failed to take steps to minimize the risk to their customers. For example, when these methods were removed, Actalis could have begun the process of communicating with their customers to ensure validations were on file, should certificates need to be replaced.

On July 1st we started the unconditional revocation of the certificates of retail customers who have not responded to our requests, or not adequately, and we aim at revoking most of them by July 15th.

This sounds like there is still a set of certificates with which Actalis plans to further violate the BRs, and which are not enumerated in the responses to date. Is this a correct understanding?

To prevent a similar situation - of which we emphasize the exceptionality - from occurring again we have decided to adopt the following corrective measures:

  • revision of our trust services Security Plan so as to explicitly cover this type of circumstance (crisis), should a similar problem reoccur, including:
    • immediate activation of a telephone task force dedicated to soliciting reactions from all impacted customers, following up on the email messages already sent;
    • immediate strengthening, more than we did this time, of our SSL delivery team so as to handle all re-issuances at a higher speed;
  • internal training aimed at raising the awareness of all stakeholders on this type of crisis and on the actions to be taken should it happen again.

Based on the issues presented, I have zero faith that this in any way meaningfully addresses the challenges highlighted. That is, it does not seem Actalis recognizes the multiple business decisions its management and compliance team made to knowingly (further) violate the BRs, nor does it seem to meaningfully address the challenges provided above.

For example, Actalis was already contractually free to revoke these certificates in the BR timeframe. If contracts prohibited such, then Actalis was violating the BRs' requirements on Subscriber Agreements. Further, based on the challenges presented to customers, it does not seem that Actalis is taking steps to address this - such as reminding customers of their existing contracts, discouraging pinning, ensuring their customers are prepared for such quick revocations (such as for tender processes), providing automation to reduce delays or risks, proactively validating domains that may expire or use invalid methods, or any of the other possible solutions to the challenges faced.

While I appreciate the transparency provided here, I want to acknowledge that it is still significantly below the required minimum. Furthermore, we see other CAs, past and present, doing significantly better in volume, percentage, and progress of revocations, doing significantly better to communicate and reduce the risks to their customers, and to enact meaningful changes to ensure that there will not be further delays to revocation under the BR-required timeframe.

The desired outcome of this is that Actalis should feel confident that, should it need to wholly replace 100% of its still-valid certificates, it could reasonably do so within 5 days, and that if there were challenges, none of them would be previously known to the industry or that Actalis would have had comprehensive mitigations in place already. Seeing a plan that can reasonably give confidence that Actalis could support such a result is something that would improve the perception of this bug.
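To put the 5-day target in perspective, the required throughput can be estimated with simple arithmetic. This is a rough sketch under stated assumptions: it uses the "about 230,000" impacted certificates from the initial report as a stand-in for the still-valid population and the "over 6000 certificates/day" average reported above. Both inputs are approximate, and the helper name is ours.

```python
import math


def required_daily_rate(certificates: int, days: int) -> int:
    """Minimum whole number of certificates to replace per day (hypothetical helper)."""
    if certificates < 0 or days <= 0:
        raise ValueError("certificates must be >= 0 and days must be > 0")
    return math.ceil(certificates / days)


IMPACTED = 230_000        # "about 230,000" from the initial report (approximate)
ACHIEVED_PER_DAY = 6_000  # "over 6000 certificates/day" reported average (lower bound)
TARGET_DAYS = 5           # replacement window discussed above

needed = required_daily_rate(IMPACTED, TARGET_DAYS)
speedup = needed / ACHIEVED_PER_DAY

print(f"needed per day: {needed}")
print(f"multiple of the reported average rate: {speedup:.1f}x")
```

Under these assumptions the target works out to 46,000 certificates per day, roughly 7.7 times the reported average, which gives a sense of how much automation would be needed.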

Adriano, it's been nearly two weeks. Do you have an update?

Ryan, I will provide an update as soon as possible.

In the absence of a specific date, I will take Comment #20 to be a commitment to provide a response by the end of this week, 2019-07-19. If that is not correct, please provide clear commitments about the date for when an update will be made available.

Ryan,

As we said in our previous comment, most of our enterprise customers committed to replacing all or a great part of their impacted certificates by the end of July. They are indeed proceeding to do so, and consequently we are revoking many certificates every day. As of today, we are left with some 1300 certificates yet to revoke, belonging to those customers. In any case, we have clear evidence that many new certificates are in the process of being installed, so we expect to reduce that figure considerably by 31 July, as declared.

Some customers asked for an extension for certain particular certificates due to the "material impossibility" of replacing them by 31 July, as a consequence of their being "pinned" in critical applications, especially mobile apps used for critical services to the public (e.g. in healthcare environments) that cannot be updated and redistributed to (e.g.) healthcare professionals' mobile devices quickly enough.

Regarding retail customers, we have revoked most of their impacted certificates (some tens are yet to be revoked). Some retail customers also asked for an extension for particular situations; we requested a plausible justification, and any extension is for just a few days and in any case not beyond the 31st of July.

To date we have revoked 99.5% of all impacted certificates and expect to reach 99.9% by the end of July, barring unforeseen circumstances.

As to other aspects that you commented on:

  • we do not have rate limits on the number of revocations we can process; we can process revocations automatically and very quickly; most of the revocations we did so far and are doing, are processed automatically; the modifications we had to do on our CA software were necessary for supporting higher rates of certificate re-issuances, especially in the first phase of our incident handling activities;

  • we fully acknowledge the necessity of respecting the BRs, and nothing in our CPS nor in our Terms & Conditions may lead customers to think that we will not abide by the BRs in particular situations; at the same time, our business considered the exceptionality of the situation and the consequences of abrupt revocation of certificates used for critical services. With reference to https://wiki.mozilla.org/CA/Responding_To_An_Incident#Revocation, it was clear to us from the very beginning that this was an exceptional circumstance, where "revoking misissued certificates within the prescribed deadline may cause significant harm, such as when the certificate is used in critical infrastructure and cannot be safely replaced prior to the revocation deadline, or when the volume of revocations in a short period of time would result in a large cumulative impact to the web."

  • of course it's not provided in our CPS (nor anywhere else) that we will not abide by the BRs if we cannot contact the customer; in fact, we have already revoked all certificates whose customers could not be contacted or did not actively react to our messages;

  • we confirm that we promptly communicate major changes to the BRs to our enterprise customers, e.g. regarding allowed domain validation methods;

  • we do not have further impacted certificates which are not enumerated in the responses to date; in our Comment 17, the retail certificates were the ones mentioned in the sentence "Another 438 residual impacted...";

  • about your final considerations: we certainly have to improve in some respects, and we are still discussing and analyzing how to achieve that goal, learning from this experience. From the contract point of view, we certainly are in a strong position, but when some CA actions (like revoking certificates before the customer has replaced them) may cause significant harm or big risks to the safety of citizens, having a contractually strong position may not be enough, also from an ethical point of view. Just as an example, we have a situation in which revoking a list of specific certificates would mean that many physicians in one region would not be able to prescribe medicines to patients, and this is just one situation. Organizations that are still in this kind of situation are mainly hospitals and healthcare systems operated by certain Italian regions, ministries, national security, parliament, civil protection, national electric power network and distribution, national gas infrastructure.

We intend to raise the awareness of our customers on the risks deriving from adopting such practices as certificate "pinning" when not really necessary, or otherwise using certificates in contexts that make it difficult to quickly replace certificates when so required by the CA. We will also be coaching our customers on the benefits of more automation in handling TLS certificates. We are already discussing how to organize such activity. This is an open internal discussion that involves our marketing department as well.

We have also planned a task aimed at further improving our certificate-issuance processing speed, and will start working on it once this crisis phase is over.

We will provide an update before the end of next week.

Flags: needinfo?(adriano.santoni)
Whiteboard: [ca-compliance] - Next Update - 01-July 2019 → [ca-compliance] - Next Update - 27-July 2019

Thank you for the additional details.

I do not believe the information provided, nor the response, meets the minimum bar expected of, and as demonstrated by, other CAs. I appreciate that Actalis may feel it does, but a cursory examination of other CA incidents demonstrates better responsiveness, better adherence to the BRs, better communication about the challenges, and better steps taken to resolve this in the future. As stated in Comment #18, I do not have faith that Actalis will be in a place to prevent future incidents on the basis of the response provided, and, when they occur, they will not be able to meet the minimum expectation required of them. I am, and remain, deeply concerned that it takes more than four months to replace some certificates.

Ryan,

You are right in your considerations: we can and must perform better; this is not our standard. We started at a good pace, resolving the issue in our production very quickly and achieving 88% (more than 215,000) revocations in the first month after analysis, but after that we made some mistakes (mainly in communication and at the operational level) that are at the origin of our current difficulties. In fact, we met a lot of resistance and complaints from the enterprise customers, mainly public entities, to which we wrongly responded by giving more time. Four months to replace some certificates is more than enough.

As managing director, I have just decided on some organizational and role changes with immediate effect: I have removed the SSL infrastructure and Operations managers, the latter also being responsible for customer communications, and I have changed the management processes and the escalation procedure for this type of incident. From now on I am directly taking the lead on this incident, supported by the Director of the EIDAS Certification Authorities, Mr. Andrea Sassetti. Formally, we are merging the management organizations of both CAs: SSL and EIDAS. Actalis has been a Certification Authority for more than 20 years, and we have always respected the rules and guidelines of the market, as well as our commitments to our customers.

Regarding this specific incident, I can assure you that all the non-revoked certificates, exactly 1,253 as per our internal report of this morning, will be revoked before the 31st of July.

Regarding the lessons learned and the actions for the future, I'll post a new update before the end of this week, but I can already assure you that a situation like this one will not happen again.

Giorgio

New Update

We are keeping to the plan we have set up, and we can confirm that all the involved certificates will be revoked by the 31st of July.

Giorgio

All the involved certificates, as per our Comment #3, have been revoked or have expired in the meantime.

Giorgio

Thanks for the update.

Comment #2 originally provided an incident report in the context of the original serial issue. Following that, the BR-required revocation timeline was exceeded, rather substantially, which is a separate incident. This incident was somewhat dismissed by Comment #14, but after additional clarifications, the end of Comment #17 provided some steps about changes being made to ensure future revocation requirements are not violated.

Comment #18 highlighted some concerns with those planned remediations, based on past disclosures by other CAs. Comment #22 indicated additional steps that may be taken. Finally, Comment #24 indicated some larger changes, but focused on the specific matter of compliance.

I think the outstanding question is this: Similar to the request in Comment #1, we want to make sure we've got a clear picture about the steps Actalis is taking with respect to the BR revocation timeline. That is, preparing an incident report, in that same form, in the context of revocation delays. The concerns highlighted in comments like Comment #18 and Comment #23 were because that information had not been provided, as expected and documented at https://wiki.mozilla.org/CA/Responding_To_An_Incident

So, can you please provide an incident report, using that template, treating the incident as "Failure to revoke certs within the BR required timeframe", and with a focus on building a comprehensive understanding about the steps taken and being taken, and when they were completed or expected to be completed (date and time stamped for the past, specific dates for the future), to ensure no repeats? I realize Actalis has shared some of these plans, but without dates, and so I suspect the Incident Report template is the clearest way to communicate that going forward.

Flags: needinfo?(Giorgio.girelli)

Dear SLeevi,

OK, following your suggestion, we will close this incident and open a new one focused on the "failure to revoke within BR requirements". I'll do so by tomorrow or early on Wednesday.

Giorgio

Flags: needinfo?(Giorgio.girelli)

Given that a separate bug has been opened for the delayed revocations, I believe that all questions have been answered and remediation of this issue is complete.

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Whiteboard: [ca-compliance] - Next Update - 27-July 2019 → [ca-compliance]
Product: NSS → CA Program
Whiteboard: [ca-compliance] → [ca-compliance] [dv-misissuance] [ov-misissuance] [ev-misissuance]