Closed Bug 1556906 Opened 8 months ago Closed 28 days ago

DigiCert: Apple: Non-compliant Common Name Length

Categories

(NSS :: CA Certificate Compliance, task)

3.2.1
task
Not set

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: certification_authority, Assigned: certification_authority)

Details

(Whiteboard: [ca-compliance])

Attachments

(1 file)

Incident Report

  1. How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.

    On 2019-05-21 14:30 PT, the CA compliance team was notified by an internal developer that during the course of a code review, it was discovered that certificates had been issued with Common Names (CNs) longer than 64 characters. 

  2. A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.

    • 2019-05-15 9:00 PT - Code review identified that the software that checks certificates for Baseline compliance was not enforcing a max length of 64 characters for CNs.

    • 2019-05-15 11:24 PT - The only two impacted certificates that were still valid were revoked by the developer who identified the issue.

    • 2019-05-21 14:30 PT - Compliance team was notified about the issue.

    • 2019-05-21 18:00 PT - Risk assessment was completed.

    • 2019-05-22 10:00 PT - Software fix was deployed to the production environment.

    • 2019-05-23 8:42 PT - Requested meeting with DigiCert (the Root CA) to discuss the incident.

    • 2019-05-24 13:00 PT - Notified Ernst & Young (WebTrust assessors).

  3. Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem. A statement that you have will be considered a pledge to the community; a statement that you have not requires an explanation.

    A software update that prevents issuance of certificates with CNs longer than 64 characters was deployed in production on 2019-05-22 at 10:00 PT.

  4. A summary of the problematic certificates. For each problem: number of certs, and the date the first and last certs with that problem were issued.

    28 certificates were impacted between 2014-11-28 and 2019-03-25.

  5. The complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem.

    A file has been attached with a list of all impacted certificates.

  6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

    The software that checks certificates for Baseline compliance prior to issuance and for quarterly self audits was not enforcing a max length of 64 characters for the CN.

  7. List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.

    1. A software fix that enforces a maximum CN length of 64 characters was deployed in production on 2019-05-22.
    2. The internal notification process will be enhanced by mid-June to minimize the time between identification of a suspected issue and communication to the compliance team.
    3. We plan to implement a second linter (most likely zLint) by end of June, which is based on a separate code base, to strengthen the ability to prevent and detect mis-issuance.
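
As an illustration of the check described in step 1, here is a minimal sketch in Python. It is not the CA's actual software: the function names are hypothetical, and the only value taken from a specification is the 64-character limit (ub-common-name in RFC 5280, Appendix A).

```python
# ub-common-name from RFC 5280, Appendix A: the upper bound on commonName length.
UB_COMMON_NAME = 64


def common_name_within_bound(common_name: str) -> bool:
    """Return True if the CN is non-empty and at most 64 characters long."""
    # len() on a Python str counts characters (code points), not encoded bytes.
    return 0 < len(common_name) <= UB_COMMON_NAME


def check_before_issuance(common_name: str) -> None:
    """Hypothetical pre-issuance gate: raise instead of issuing the certificate."""
    if not common_name_within_bound(common_name):
        raise ValueError(
            f"commonName is {len(common_name)} characters; "
            f"expected 1 to {UB_COMMON_NAME}"
        )
```

The same predicate could be reused for a quarterly self-audit by running it over already-issued certificates, so that the pre-issuance and post-issuance paths cannot drift apart.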
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true

Thanks. I'm a bit surprised to hear this, considering that the length restrictions with respect to commonNames have been something actively discussed in the CA/Browser Forum. In particular, the limits on length were a motivating factor to support subject-less certificates with critical subjectAltNames, so that you could issue certificates for a domain name with greater than 64 characters.

It's unclear whether a check was already existing, and failed, or whether no check was developed. Can you clarify between these interpretations?

Given that the length of the commonName field is part of the ASN.1 module for RFC 5280, has there been a systemic evaluation about compliance with the ASN.1 module? For example, ZLint does not strictly validate against the ASN.1 module, and thus may not be sufficient, but it would also only be a secondary check for whatever primary controls the CA software has. Understanding what the existing controls are is, I think, important to understanding this incident report and next steps, as there are a variety of other ASN.1 constraints expressed within the module that may have been overlooked.
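
To make the "other ASN.1 constraints" concrete, here is a minimal Python sketch of a table-driven length check. The bounds are a partial list copied from the RFC 5280 Appendix A module; the function name and the `(attribute, value)` input shape are assumptions for illustration, not any CA's or linter's real interface.

```python
# Upper bounds taken from the RFC 5280 Appendix A ASN.1 module (a partial list).
UPPER_BOUNDS = {
    "commonName": 64,
    "organizationName": 64,
    "organizationalUnitName": 64,
    "localityName": 128,
    "stateOrProvinceName": 128,
    "serialNumber": 64,
}


def find_length_violations(subject):
    """Return (attribute, length, limit) for every value outside 1..limit.

    `subject` is a hypothetical list of (attribute name, value) pairs.
    """
    violations = []
    for attribute, value in subject:
        limit = UPPER_BOUNDS.get(attribute)
        if limit is None:
            continue  # no bound recorded in this sketch; not proof of compliance
        if not 1 <= len(value) <= limit:
            violations.append((attribute, len(value), limit))
    return violations
```

A table like this only covers size bounds. Type and character-set constraints in the module would need separate checks.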

I appreciate that steps are being taken to improve the timeline from detection to alerting, but are more steps being taken to reduce the possibility of human error? For example, is compliance notified of revoked certificates today, and should they be? What are systemic considerations that can be made so that even if the human processes fail, automated systems may still exist to support them? This may be systems like four-eyes controls for initiating revocation, alerting compliance regularly about revocations, or post-issuance linting - all of which can assist with these sorts of investigations.

There's a gap between the notification of compliance teams, the parent CA, and the auditors - 2019-05-21 through 2019-05-24 - and the disclosure of this incident, 2019-06-04. Can you help explain that gap, as well as what steps are being taken to ensure prompt and timely disclosure going forward?

Assignee: wthayer → certification_authority
Flags: needinfo?(certification_authority)
Whiteboard: [ca-compliance]

Thank you for your questions. Please see our responses below.

It's unclear whether a check was already existing, and failed, or whether no check was developed. Can you clarify between these
interpretations?

While no check was initially developed, a subsequent check was added as a result of this incident.

Given that the length of the commonName field is part of the ASN.1 module for RFC 5280, has there been a systemic evaluation
about compliance with the ASN.1 module?

Upon learning of this incident, we immediately implemented a fix and performed an initial review of the software to identify if any other major gaps existed. We also plan to do a more thorough gap analysis of the ASN.1 module for RFC 5280 against our software.

I appreciate that steps are being taken to improve the timeline from detection to alerting, but are more steps being taken to
reduce the possibility of human error?

To improve the response and alerting time, we will be providing more communication and education to Apple on the notification process. Pre- and post-issuance checks are in place, but this incident revealed an issue with the software logic which, as mentioned, is being reviewed further. Additionally, we expect that implementing a linter based on a separate code base will strengthen the ability to detect and automatically alert on mis-issuance.

There's a gap between the notification of compliance teams, the parent CA, and the auditors - 2019-05-21 through 2019-05-24 - and
the disclosure of this incident, 2019-06-04. Can you help explain that gap, as well as what steps are being taken to ensure prompt
and timely disclosure going forward?

Our internal review process for public postings involves several levels of review and approval taking major events into consideration. Following the Incident Report guidance, we targeted no later than two weeks from the date the compliance team was informed of the issue. With our required review processes, notifications should be expected to take up to two weeks for any future non-compliance events as well.

Flags: needinfo?(certification_authority)

(In reply to certification_authority from comment #2)

Given that the length of the commonName field is part of the ASN.1 module for RFC 5280, has there been a systemic evaluation
about compliance with the ASN.1 module?

Upon learning of this incident, we immediately implemented a fix and performed an initial review of the software to identify if any other major gaps existed. We also plan to do a more thorough gap analysis of the ASN.1 module for RFC 5280 against our software.

What is the timeline for this gap analysis? When can we expect updates?

I appreciate that steps are being taken to improve the timeline from detection to alerting, but are more steps being taken to
reduce the possibility of human error?

To improve the response and alerting time, we will be providing more communication and education to Apple on the notification process. Pre and post-issuance checks are in place, but this incident revealed an issue with the software logic which, as mentioned, is being reviewed further. Additionally, we expect that implementing a linter based on a separate code base will strengthen the ability to detect and automatically alert on mis-issuance.

I'm not sure how this addresses the matter being highlighted - the six-day gap between revocation and the compliance team being notified, as highlighted in the follow-up questions. If the intent is with respect to communication and education, I think more details are needed about what the current communication and education was/is, and how the added communication/education will address this. As discussed with other CAs, "the folks have been retrained" doesn't necessarily provide insight as to how or why the current issue would not simply reappear.

I appreciate the added linting, but understanding how that works through a systemic architecture - who is involved when lints fail, what happens when they do, etc - is equally valuable and important.

Flags: needinfo?(certification_authority)

(In reply to Ryan Sleevi from comment #3)

(In reply to certification_authority from comment #2)

Given that the length of the commonName field is part of the ASN.1 module for RFC 5280, has there been a systemic evaluation
about compliance with the ASN.1 module?

Upon learning of this incident, we immediately implemented a fix and performed an initial review of the software to identify if any other major gaps existed. We also plan to do a more thorough gap analysis of the ASN.1 module for RFC 5280 against our software.

What is the timeline for this gap analysis? When can we expect updates?

The gap analysis has started. We will provide an update as early as July 31.

I appreciate that steps are being taken to improve the timeline from detection to alerting, but are more steps being taken to
reduce the possibility of human error?

To improve the response and alerting time, we will be providing more communication and education to Apple on the notification process. Pre and post-issuance checks are in place, but this incident revealed an issue with the software logic which, as mentioned, is being reviewed further. Additionally, we expect that implementing a linter based on a separate code base will strengthen the ability to detect and automatically alert on mis-issuance.

I'm not sure how this addresses the matter being highlighted - the six-day gap between revocation and the compliance team being notified, as highlighted in the follow-up questions. If the intent is with respect to communication and education, I think more details are needed about what the current communication and education was/is, and how the added communication/education will address this. As discussed with other CAs, "the folks have been retrained" doesn't necessarily provide insight as to how or why the current issue would not simply reappear.

We offer a self-service portal for subscribers to revoke their own certificates, which was used by the developer (who was also the subscriber) to revoke the impacted certificates based on the subscriber's initial suspicion that a problem existed. After further analysis, the subscriber notified the compliance team. Thus, the primary reason for the delay was that the developer/subscriber who reported the issue was performing additional analysis.

The focus of the education was that the compliance team should be notified immediately upon suspicion of an issue and not be delayed until further analysis.

I appreciate the added linting, but understanding how that works through a systemic architecture - who is involved when lints fail, what happens when they do, etc - is equally valuable and important.

Prior to the incident, when any issue was detected by any of our existing verification software, our compliance team was automatically alerted. As previously noted, we are adding an additional linter which will trigger the same automatic alerting of the compliance team.

Flags: needinfo?(certification_authority)

Thanks. I'm still a little confused by the timeline and response, and the roles involved, and I appreciate the continued clarifications.

The initial report stated, in Comment #0

  • 2019-05-15 9:00 PT - Code review identified that the software that checks certificates for Baseline compliance was not enforcing a max length of 64 characters for CNs.
  • 2019-05-15 11:24 PT - The only two impacted certificates that were still valid were revoked by the developer who identified the issue
  • 2019-05-21 14:30 PT - Compliance team was notified about the issue.

The latest response, in Comment #4, states

We offer a self-service portal for subscribers to revoke their own certificates, which was used by the developer (who was also the subscriber) to revoke the impacted certificates based on the subscriber's initial suspicion that a problem existed. After further analysis, the subscriber notified the compliance team.

I'm trying to understand the relationship of the events here. The later response seems to suggest it was a Subscriber Initiated revocation, and that's why compliance was not notified. However, the original response seems to very clearly indicate that there was a specific evaluation of the software that checks certificates for Baseline compliance.

That's why I'm still confused about why the compliance team was not notified during the code review process. Is it that Subscribers at Apple have access to perform code review themselves? That seems to be the only interpretation I can reach by the facts available, and I'm unclear if the Subscriber-who-requested-revocation was part of the CA operations in any capacity.

If the revocation-initiator is part of the CA team, then it suggests a process breakdown in involving the necessary teams when revoking for compliance reasons, and suggests another issue here that needs remediation.
If the revocation-initiator is not part of the CA team (and is truly only a Subscriber), and may merely incidentally have access to read the CA software and perform their analysis, then the main issue is the lack of detection.

To be clear: I'm not trying to blame the revocation-initiator, but I'm trying to understand their role to understand what systemic safeguards could or should have existed. Any time any member of the CA team discovers an issue, the CA should have processes in place to ensure that's alerted to the Compliance team. If a member of the CA is performing revocation - even for their own certificates - that's the sort of thing you want to make sure you have some sort of review, auditing, or assessment on, to make sure that the aforementioned processes aren't failing - a second set of eyes looking into why it was revoked and making sure policies are followed.

Flags: needinfo?(certification_authority)

Here are some additional details that may provide more context. The revocation was initiated by the Subscriber who is also an internal developer on the CA team. The developer was discussing a testing strategy for software unrelated to the baseline checks which leveraged very long Common Names to increase the size of test CSRs. Because the developer was familiar with RFC 5280 requirements, it caught the developer’s attention when the test CSRs were used successfully to issue a test certificate (from a test hierarchy) as the Common Name was greater than 64 characters.

The internal developer then performed an initial review of the code that checks for baseline compliance and determined that it was not enforcing a max length of 64 characters for Common Names; this review was completed on 2019-05-15. To be cautious, the developer immediately revoked two certificates for which he was the Subscriber even though he was not yet certain if there was a compliance issue.

Between 2019-05-15 and 2019-05-21, the developer performed additional testing, a more thorough code review, and investigated RFC 5280 requirements. On 2019-05-21, the developer notified the compliance team. We agree that the compliance team should have been notified immediately upon suspicion of a problem, which is the focus of the communication and education referenced earlier.

Having an alert sent to the compliance team whenever a certificate is revoked by a member of the CA team is a great suggestion. We are updating our processes and adding an additional detective control to account for this.
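
A detective control of this kind could be sketched as follows. This is a minimal illustration only: the `RevocationEvent` record, its fields, and the function name are hypothetical and do not describe the actual portal or alerting system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RevocationEvent:
    """Hypothetical record emitted whenever a certificate is revoked."""
    serial: str
    requested_by: str
    reason: str


def revocations_needing_review(events, ca_team_members):
    """Select revocations requested by CA team members for compliance review."""
    return [event for event in events if event.requested_by in ca_team_members]
```

Routing every match to the compliance team gives a second set of eyes on the revocation even when the person who revoked is still investigating and has not yet raised the issue.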

Flags: needinfo?(certification_authority)
Whiteboard: [ca-compliance] → [ca-compliance] - Next Update - 01-August 2019

Just to make sure I understand the issue appropriately, as the original incident sounded much worse:

  • There is a test infrastructure (from a test root) and a production infrastructure (with the production root)
  • A developer on the CA software was working on the test infrastructure, working on an unrelated change ("test CSRs", Comment #6).
  • 2019-05-15 9:00 PT - On the test infrastructure, they saw the certificate was issued, which surprised them, as they understood from the BRs that the CA software stack should have prohibited this. At this time, this certificate was solely in the test infrastructure (Comment #6)
  • 2019-05-15 9:00 PT - The developer then began examining the code and relevant requirements, to determine why the certificates worked.
  • 2019-05-15 11:24 PT - The developer examines the production CA and sees two certificates issued that violate this requirement. They are the Subscriber for these certificates (i.e. suggesting some internal server/service they own), and initiate revocation, still not sure if there is a compliance issue.
  • 2019-05-21 14:30 PT - Developer notifies the compliance team. Compliance team investigates, and determines that the two (already revoked) certificates were the only unexpired certificates, but that 28 certificates in total had been issued (Comment #0)
  • 2019-05-22 10:00 PT - Fix deployed.
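
The intervals in this timeline can be worked out with a short Python sketch. The timestamps are the ones listed above (all PT, so no time zone handling is needed); the variable names are chosen only for illustration.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"
revoked = datetime.strptime("2019-05-15 11:24", FMT)
notified = datetime.strptime("2019-05-21 14:30", FMT)
fix_deployed = datetime.strptime("2019-05-22 10:00", FMT)

# Time between the developer's revocation and the compliance team being told.
revocation_to_notification = notified - revoked
# Time between the compliance team being told and the fix reaching production.
notification_to_fix = fix_deployed - notified

print(revocation_to_notification)  # 6 days, 3:06:00
print(notification_to_fix)         # 19:30:00
```

The first interval is the roughly six-day gap discussed earlier in this bug. The second shows the fix landing within a day of the compliance team being notified.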

In my original reading of the report, it had sounded like the sequence was that the developer was doing code review and noticed something wrong, revoked them, and did not notify the compliance team, which suggested an attempt to cover up a mistake by the developer. The update from Comment #6, if understood correctly, instead suggests that Apple's developers are very familiar with the requirements, spotted a bug, proactively attempted to resolve it while investigating further, and that it was resolved within hours of the compliance team being notified.

If that's a correct understanding, the only suggested improvement related to that sequence of events would be to involve compliance teams sooner. This is where auditing of revocation, which is presumably not a routine thing, may be useful in ensuring that even when folks forget to notify compliance / attempt to investigate themselves (in order to avoid noise for compliance), there's still a feedback mechanism to catch overlooked issues.

Is my understanding, based on Comment #6, correct?

Flags: needinfo?(certification_authority)

Your understanding, based on Comment #6, is correct.

Flags: needinfo?(certification_authority)

In a previous update, we committed to doing the following:

“We plan to do a more thorough gap analysis of the ASN.1 module for RFC 5280 against our software.”

To ensure this gap analysis would be objective, a member of the team not previously involved in the original design of the linter was assigned to combine the requirements from RFC 5280 (including Appendix A and Appendix B) with the Baseline Requirements and determine which requirements applied to our certificates.

We have broken down the gap analysis into the following milestones:

  • Milestone 1: Document and scope the RFC 5280 and Baseline Requirements.
  • Milestone 2: Perform the gap assessment against in-house linter.
  • Milestone 3: Create plan for remediation.
  • Milestone 4: Remediate any identified issues.

Milestone 1 is complete and Milestone 2 is in progress. The next update on progress will be as early as August 31. We intend to complete all milestones by the end of the calendar year.

Whiteboard: [ca-compliance] - Next Update - 01-August 2019 → [ca-compliance] - Next Update - 01-September 2019

We are making good progress on performing the gap assessment on our software. We continue to work on this assessment and expect to provide an update as early as October 5.

We are continuing to work on this assessment and expect to provide an update as early as December 15.

We have completed the gap assessment. Based on our findings, we determined that through a combination of our software and supporting manual practices, all requirements that apply to our certificates from RFC 5280 (including Appendix A and Appendix B) and the Baseline Requirements are addressed. This completes our gap analysis.

It appears that all questions have been answered and remediation is complete.

Status: ASSIGNED → RESOLVED
Closed: 28 days ago
Resolution: --- → FIXED
Whiteboard: [ca-compliance] - Next Update - 01-September 2019 → [ca-compliance]