Closed Bug 1690807 Opened 3 years ago Closed 3 years ago

GlobalSign: RSA-1024 leaf certificate issued after 2013-12-31

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rob, Assigned: eva.vansteenberge)

Details

(Whiteboard: [ca-compliance] [ov-misissuance])

Attachments

(4 files, 1 obsolete file)

32.04 KB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
96.09 KB, image/jpeg
86.00 KB, image/jpeg
188.14 KB, image/jpeg

https://crt.sh/?id=4028719337&opt=zlint,cablint,x509lint was issued today. It's a leaf certificate with an RSA-1024 key. RSA key sizes of <2048 bits were disallowed by the BRs over 7 years ago!

May I ask what tool(s) GlobalSign are using for pre-issuance linting?

Assignee: bwilson → eva.vansteenberge
Status: NEW → ASSIGNED
Whiteboard: [ca-compliance]

We have revoked the certificate and will provide an incident report by Tuesday, February 9 at the latest.

1. How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.

We were informed via a Bugzilla ticket on 04/02/2021 at 15:17 UTC that a leaf certificate with a 1024-bit RSA key had been issued by GlobalSign.

2. A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.

Time (UTC) - DD/MM/YYYY HH:MM Action
21/05/2010 22:05 Certificate ordered with ID CE201005226919 (hereafter "incident causing order") for www.celticpapaya.com. E-mail challenge was selected as the domain ownership validation method. Domain flagged as high-risk (keyword "pay"), manual domain phishing check by verification operator required prior to issuance.
21/05/2010 22:56 Another certificate ordered with ID CE201005239239 (hereafter "other order for affected domain") for www.celticpapaya.com using e-mail challenge as the domain ownership validation method. Domain flagged as high-risk (keyword "pay"), manual domain phishing check by verification operator required prior to issuance.
24/05/2010 09:08 Domain phishing check completed for incident causing order and other order for affected domain, domain ownership confirmation e-mail sent to customer.
27/05/2010 16:28 Domain ownership confirmation URL invoked for other order for affected domain, certificate (https://crt.sh/?id=1787837) issued.
02/01/2012 01:07 Synchronization job between GCC and RA systems fails, marking incident causing order with status 901 - REGISTER_ERROR.
04/02/2021 07:31 E-mail challenge URL invoked for incident causing order and 1024 bit RSA key certificate issued by GlobalSign for the domain www.celticpapaya.com - https://crt.sh/?id=4028719337.
04/02/2021 15:17 Bugzilla report received by e-mail.
04/02/2021 15:22 Escalation of incident to wider Operational Governance (security & compliance) group. CISO calls for a Priority 1 incident and all relevant teams (Operational Governance, Vetting, Infrastructure, Development) are mobilized.
04/02/2021 15:38 Affected certificate is revoked.
04/02/2021 16:00 Retrieval of details & further investigation of how an order with a 1024-bit RSA key could make it through the different validation layers we have.
04/02/2021 16:45 The incident causing order was placed in 2010. Further focus of investigation on a) any other old hanging orders that could have caused / cause the same issue, b) why the order was processed on February 4 2021, c) why this order was not stopped by front-end & back-end certificate request validators / linters
04/02/2021 16:52 a) As a precaution, since the cause of the incident was still unclear, cancellation and review of similar orders with code 901 - REGISTER_ERROR (5,455 orders) started by the vetting team. b) Request RA (Registration Authority application) logs & infrastructure logs to understand why this order could be and was issued 11 years after the request. c) Summon on-call engineers to go to the data center and check the linting configuration.
04/02/2021 17:13 Infrastructure team confirms linting is not fully enabled for the CA "GlobalSign GCC R3 DV TLS CA 2020", which issued the non-compliant certificate. Compliance team orders linting to be enabled and starts reviewing a configuration dump to identify any other CAs that should have linting enabled but do not.
04/02/2021 17:41 Infrastructure team confirms linting is now enabled for the CA "GlobalSign GCC R3 DV TLS CA 2020", which issued the non-compliant certificate.
04/02/2021 18:01 Compliance team completes the analysis to identify any other CAs that should have linting enabled but do not. One CA is identified: "GlobalSign GCC R3 EV QWAC CA 2020". Compliance team orders linting to be enabled for this CA.
04/02/2021 18:09 Infrastructure team confirms linting has been enabled for "GlobalSign GCC R3 EV QWAC CA 2020".
04/02/2021 18:19 Initiated linting on all certificates issued by "GlobalSign GCC R3 DV TLS CA 2020" and "GlobalSign GCC R3 EV QWAC CA 2020".
04/02/2021 21:53 Linting completed of all certificates issued by "GlobalSign GCC R3 DV TLS CA 2020" and "GlobalSign GCC R3 EV QWAC CA 2020". No other compliance violations detected.
04/02/2021 22:37 Confirmation from the vetting team that all 901 - REGISTER_ERROR orders have been cancelled. We did this as a precautionary measure: during the first hours of investigating the incident we were not yet sure whether the misissuance was triggered by clicking an old, unexpired domain validation link or by some other sort of reprocessing related to orders with this status.

3. Whether your CA has stopped, or has not yet stopped, certificate issuance or the process giving rise to the problem or incident. A statement that you have stopped will be considered a pledge to the community; a statement that you have not stopped requires an explanation.

GlobalSign did not stop issuing certificates, but did take other actions to prevent this incident from repeating.

4. In a case involving certificates, a summary of the problematic certificates. For each problem: the number of certificates, and the date the first and last certificates with that problem were issued. In other incidents that do not involve enumerating the affected certificates (e.g. OCSP failures, audit findings, delayed responses, etc.), please provide other similar statistics, aggregates, and a summary for each type of problem identified. This will help us measure the severity of each problem.

1 certificate, issued on February 4, 2021 at 07:31 UTC: https://crt.sh/?id=4028719337.

5. In a case involving certificates, the complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem. In other cases not involving a review of affected certificates, please provide other similar, relevant specifics, if any.

https://crt.sh/?id=4028719337 (revoked)

6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

The incident occurred in our legacy GCC platform. We are currently aiming to retire this platform for TLS certificate issuance by January 1, 2023, by which time all TLS certificate activities should have been moved to our new Atlas platform.

Old orders & domain approval

Before 01/03/2017, random values used for domain validation did not yet have to expire after a maximum of 30 days. Ballot 169 was implemented such that for any domain ownership confirmation e-mail sent after 01/03/2017 the system would enforce the maximum random value validity of 30 days. However, our platform did not enforce the 30-day maximum for domain ownership confirmation e-mails sent before 01/03/2017, even if the domain ownership confirmation URL embedded in the e-mail was invoked after 01/03/2017. This only affected e-mail validation; for HTTP and DNS validation the maximum random value validity was enforced for orders placed both before and after 01/03/2017.

The above, together with the fact that we did not have a process in place to automatically move old pending orders into a cancelled state after a certain period of time, led to a domain ownership confirmation e-mail sent in 2010 still being usable in 2021, causing this certificate to be issued.
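For illustration, a minimal sketch (in Python, not GlobalSign's actual code) of the corrected behaviour: the 30-day maximum random value validity is enforced based solely on the age of the value at the time the confirmation URL is invoked, with no exception for values created before 01/03/2017. Function and variable names are illustrative only.

from datetime import datetime, timedelta, timezone

MAX_RANDOM_VALUE_AGE = timedelta(days=30)  # 30-day maximum validity introduced by Ballot 169

def random_value_is_valid(created_at, invoked_at):
    """Accept a domain validation random value only if it is at most 30 days old."""
    return invoked_at - created_at <= MAX_RANDOM_VALUE_AGE

# The e-mail from this incident (sent 24/05/2010, URL invoked 04/02/2021) is rejected:
sent = datetime(2010, 5, 24, 9, 8, tzinfo=timezone.utc)
invoked = datetime(2021, 2, 4, 7, 31, tzinfo=timezone.utc)
assert not random_value_is_valid(sent, invoked)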

CA linting

In the GCC platform, certificate validation and linting are enforced at two levels:

  • Validation of the certificate application data and certificate request in GCC and RA
  • Linting in the certificate factory (EJBCA)

Since the incident causing order was placed in 2010, when 1024-bit RSA keys were still allowed, the order passed the front layers of validation of the certificate application data and certificate request in GCC and RA.

The certificate was issued from "GlobalSign GCC R3 DV TLS CA 2020", a CA set up together with 60 other CAs in response to the OCSP EKU incident. The short window of time, lack of proper oversight / quality assurance, and process fragmentation caused linting not to be fully enabled for that CA. Because linting was not fully enabled at the certificate factory level, the certificate was issued.

Configuring a validator in EJBCA requires 1) enabling the CA in the validator-specific configuration and 2) selecting the validator in the CA-specific configuration. Step 1) was not performed for "GlobalSign GCC R3 DV TLS CA 2020".
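To illustrate how such a one-sided configuration can be caught automatically, here is a minimal sketch (Python, not GlobalSign's tooling) that cross-checks an EJBCA configdump: every CA must both select zlint in its own configuration and appear in the validator-side file. The directory layout follows the files named later in this report; the "Validators" and "Certification Authorities" key names are assumptions.

from pathlib import Path
import yaml  # pip install pyyaml

def load_yaml(path):
    return yaml.safe_load(path.read_text()) or {}

def check_linkage(dump_dir):
    """Return a list of CAs whose validator linkage is incomplete on either side."""
    dump = Path(dump_dir)
    problems = []

    # CAs enabled on the validator side (key name assumed).
    zlint_cfg = load_yaml(dump / "validators" / "zlint.yaml")
    cas_on_validator_side = set(zlint_cfg.get("Certification Authorities", []))

    for ca_file in (dump / "certification-authorities").glob("*.yaml"):
        ca_name = ca_file.stem
        validators = set(load_yaml(ca_file).get("Validators", []))

        if "zlint" not in validators:
            problems.append(f"{ca_name}: zlint not selected in the CA-specific configuration")
        if ca_name not in cas_on_validator_side:
            problems.append(f"{ca_name}: CA not enabled in validators/zlint.yaml")

    return problems

if __name__ == "__main__":
    for problem in check_linkage("/path/to/configdump"):
        print("WARNING:", problem)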

7. List of steps your CA is taking to resolve the situation and ensure that such situation or incident will not be repeated in the future, accompanied with a binding timeline of when your CA expects to accomplish each of these remediation steps.

Old orders & domain approval

All other orders in the GCC platform that were older than 1 year and had not been fully processed (940,909 orders) were moved into a cancelled state on 09/02/2021.

We also removed the legacy code (the part handling random values provided before 01/03/2017) from the domain ownership e-mail confirmation URL. This change will be released on 10/02/2021.

A change request is currently being analyzed to automatically cancel all orders older than 1 year at a regular interval. We will share the implementation date as soon as the analysis is complete.

We will also modify the compliance process for monitoring and implementing changes to external requirements, to make sure older orders are purged if their invocation at a later time could violate requirements introduced between the order date and the certificate issuance date.

CA linting

On 16/04/2020 we issued a certificate with an RSASSA-PSS public key (https://bugzilla.mozilla.org/show_bug.cgi?id=1630870). One of the reasons that incident materialized was, likewise, that linting was not fully enabled in EJBCA. The remedial steps taken to resolve the RSASSA-PSS incident appear to be insufficient for addressing a wider systemic issue related to the creation and provisioning of new CAs and the quality assurance thereof. In the previous incident, the CA profile was selected in the linter configuration, but the CA related to that profile did not have zlint selected as a linter. In this incident, the CA profile was not selected in the linter configuration, while the CA related to that profile did have zlint selected as a linter. As a follow-up to the RSASSA-PSS incident we created a monitoring script to check for CAs that did not have zlint selected as a linter, but this script did not raise any warning when the CA profile was not selected in the linter configuration.

GlobalSign has concluded that there are too many procedural and technical gaps in the current process to prevent incorrect CA configuration in the future and has decided to halt the provisioning of new CAs in the GCC platform until 31/03/2021, in order to fully rebuild the CA request, creation and provisioning process. Whereas the previous process could be considered fragmented (with different tickets and information depending on the team that had to execute a specific sub-action), we will build a new flow that consolidates all required information and actions into a single process, with additional compliance oversight steps and a hard compliance approval step prior to putting a new CA into a production / issuing state. We will share the details of the new process on 16/02/2021 and will complete implementation before 31/03/2021. CA provisioning on the GCC platform will only resume after implementation of the new process has been successfully completed.

We will also add an additional linter in "warn" mode at the RA side of the GCC platform, so that even if linting is bypassed at the CA factory side we still have a backup post-issuance linter that notifies us of any non-compliant certificate as soon as it is issued. We will share the implementation date as soon as the analysis is complete.
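A minimal sketch of what such a warn-only post-issuance check could look like, assuming the zlint command-line tool is installed and prints a JSON object mapping lint names to their results for a PEM certificate passed as an argument; notify_operational_governance is a placeholder for the real alerting hook, and nothing here blocks issuance.

import json
import subprocess

def post_issuance_lint(pem_path):
    """Lint an already-issued certificate and return only the non-passing results."""
    completed = subprocess.run(["zlint", pem_path], capture_output=True, text=True, check=True)
    results = json.loads(completed.stdout)
    return {name: res for name, res in results.items()
            if res.get("result") in ("warn", "error", "fatal")}

def notify_operational_governance(pem_path, findings):
    # Placeholder: in production this would open a ticket / page the on-call team.
    print(f"Post-issuance lint findings for {pem_path}: {findings}")

if __name__ == "__main__":
    findings = post_issuance_lint("issued-certificate.pem")
    if findings:  # warn only - issuance has already happened and is never blocked
        notify_operational_governance("issued-certificate.pem", findings)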

Timeline of remedial actions

Date (DD/MM/YYYY) | Status | Action
04/02/2021 | Completed | Review of linting configuration for all CAs
04/02/2021 | Completed | Cancellation and review of all orders with the 901 - REGISTER_ERROR code
04/02/2021 | Completed | Internal decision & announcement of the new-CA freeze for the GCC platform until 31/03/2021
08/02/2021 | Completed | Review of all certificates issued using e-mail based domain ownership confirmation to detect any other certificates ordered prior to 01/03/2017 but approved after 01/03/2017 with a random value exceeding 30 days. 113 certificates identified: 112 expired and 1 unexpired but revoked (the certificate that caused this incident). Please refer to the attached list.
09/02/2021 | Completed | Cancellation of all orders older than 1 year and not fully processed
10/02/2021 | In progress | Disablement of support for values provided before 01/03/2017 in the domain ownership e-mail confirmation functionality
16/02/2021 | In progress | Modify the compliance process for changes to requirements to make sure older orders are purged if their invocation at a later time could violate requirements introduced between the order date and the certificate issuance date
16/02/2021 | In progress | New process for CA creation & provisioning
TBD | Analysis | Implementation of an automatic script that regularly cancels all old orders
TBD | Analysis | Add an additional linter in "warn" mode at the RA side of the GCC platform
31/03/2021 | Planning | Deployment of the new CA creation & provisioning process and resumption of CA provisioning in the GCC platform
Attached file 1690807.xlsx (obsolete) —

Arvid,

I'm glad to see a very detailed incident report, and particularly glad you acknowledged Bug 1630870, as its description immediately came to mind while reading this one.

However, there are some things concerning to me still:

  • It does not seem that there's a clear accounting to know whether any more of these "infinite authorization" certs were issued, and which might otherwise syntactically pass lints, but have otherwise failed to meet the obligations.
    • Can you share what steps you've taken already to rule out this possibility? The response seems to largely focus on "prevention of new/additional", but I'm not seeing a clear answer for the analysis, going back to the problem introduction, of other certificates having been issued. If I missed that part of the response, I'm hoping you can highlight it for me so I can re-read.
  • As noted by Bug 1630870, there was a known failure mode re: linting configuration, due to a past incident. I understand you've covered it as "human error" to introduce the misconfiguration (due to the speed and complexity of the effort), and "human error" that the original tooling wasn't developed to catch this. While your mitigations look at reducing complexity, addressing the former, it's not clear to me how you're addressing the latter, which seems to be in the incident response/remediation. That is, are there any changes planned for how you develop and test changes in response to incidents, to ensure you've covered all the ways in which things can go wrong, and not just the immediate/proximate thing that went wrong?
  • We've seen several other CA incidents reporting similar root causes, and I'm curious to understand more the process of how GlobalSign pays attention to these issues from other CAs, evaluates the risk to GlobalSign, and determines the actions to take.
    • Bug 1654967 was a bug from DigiCert that seems somewhat related, regarding the complexity of ICA creation and the need for automation.
    • Bug 1623384 was a bug from Camerfirma, involving issues with the number of CAs leading to inappropriate configuration and remediation (which seems similar to this)
    • Bug 1622505 was a bug from GlobalSign that got into a little of the detail re: GlobalSign's various platforms (Atlas, GCC, CertSafe), and acknowledged some of the complexities related to the heterogeneous environment GlobalSign was operating.
    • Bug 1654544 was a bug from GlobalSign that was specifically about a reuse of domain validation data beyond the acceptable period (re: CloudSSL)

I highlight these issues because I'm increasingly concerned that GlobalSign's infrastructure complexity may be beyond GlobalSign's ability to effectively secure and comply with things. I realize that a number of these systems are legacy and being phased out, which is encouraging, but I'm not entirely confident that there's a good understanding of where things are moving towards. While I'm encouraged to see GlobalSign is moving towards modernizing systems, and I don't want to discourage that, I'm also trying to understand and appropriately judge risk from the existing systems, and GlobalSign's ability to keep them compliant while migrating.

I'm hoping that, in the spirit of bugs like Bug 1595921 and Bug 1550645, you can help us (the community and browsers alike) build up a better understanding of your systems and the process and flow of information and certificates through that system, ideally capturing/distinguishing distinct flows (e.g. if CloudSSL takes this path, vs GCC, vs Atlas, then clearly distinguishing these paths). Equally, I think understanding a bit with respect to the existing "trusted for TLS" hierarchy and understanding what information flows they're connected to (e.g. Bug 1556948 ), helps us build a better understanding of risk.

Flags: needinfo?(arvid.vermote)
Attached file 1690807.xlsx
Attachment #9202078 - Attachment is obsolete: true

It does not seem that there's a clear accounting to know whether any more of these "infinite authorization" certs were issued, and which might otherwise syntactically pass lints, but have otherwise failed to meet the obligations.

  • Can you share what steps you've taken already to rule out this possibility? The response seems to largely focus on "prevention of new/additional", but I'm not seeing a clear answer for the analysis, going back to the problem introduction, of other certificates having been issued. If I missed that part of the response, I'm hoping you can highlight it for me so I can re-read.

Unless I misunderstand the scope of your question, this is detailed in the "Timeline of remedial actions" table at the end of the incident report. On 08/02/2021 we completed a review of all certificates issued using e-mail based domain ownership confirmation to detect any other certificates ordered prior to 01/03/2017 but approved after 01/03/2017 with a random value exceeding 30 days. 113 certificates were identified: 112 expired and 1 unexpired but revoked (the certificate that caused this incident). Refer to the attached list - please let us know if you require any additional information or insights on the analysis.

We are working on your other questions and will respond tomorrow (11/02/2021).

As noted by Bug 1630870, there was a known failure mode re: linting configuration, due to a past incident. I understand you've covered it as "human error" to introduce the misconfiguration (due to the speed and complexity of the effort), and "human error" that the original tooling wasn't developed to catch this. While your mitigations look at reducing complexity, addressing the former, it's not clear to me how you're addressing the latter, which seems to be in the incident response/remediation. That is, are there any changes planned for how you develop and test changes in response to incidents, to ensure you've covered all the ways in which things can go wrong, and not just the immediate/proximate thing that went wrong?

As a response to Bug 1630870 we created an automated job that dumps the EJBCA configuration (https://doc.primekey.com/ejbca/ejbca-operations/ejbca-operations-guide/configdump-tool) and sends it to the compliance team on a daily basis. The compliance team then performs daily diffs between the received configuration and the "known" configuration and explicitly confirms the "Validators" value in "certification-authorities/%caname%.yaml" (which was the value relevant to Bug 1630870). The missing configuration that partially caused this incident would have had to be discovered in another file, "validators/zlint.yaml". If we were to remediate the current incident in line with the narrow remediation of Bug 1630870, we would have proposed adding a check of "validators/zlint.yaml" to the daily compliance review activity.
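To illustrate the difference in coverage, a minimal sketch (assumed paths and layout, not the actual monitoring job) of a diff that covers every file in the daily configdump against an approved baseline, so that a missing entry in validators/zlint.yaml would surface just as readily as a changed "Validators" value:

import difflib
from pathlib import Path

def diff_configdumps(baseline_dir, todays_dir):
    """Return a unified diff across all YAML files in two EJBCA configdump trees."""
    baseline, today = Path(baseline_dir), Path(todays_dir)
    files = {p.relative_to(baseline) for p in baseline.rglob("*.yaml")}
    files |= {p.relative_to(today) for p in today.rglob("*.yaml")}

    report = []
    for rel in sorted(files, key=str):
        old = (baseline / rel).read_text().splitlines() if (baseline / rel).exists() else []
        new = (today / rel).read_text().splitlines() if (today / rel).exists() else []
        report.extend(difflib.unified_diff(old, new, lineterm="",
                                           fromfile=f"baseline/{rel}", tofile=f"today/{rel}"))
    return report

if __name__ == "__main__":
    changes = diff_configdumps("/path/to/approved-baseline", "/path/to/todays-dump")
    if changes:
        print("\n".join(changes))  # any unexplained change is escalated for compliance review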

Bug 1630870 was treated as a P2 (major) incident, meaning that according to our internal incident classification it could be handled within the relevant teams with senior management reporting but without senior management involvement and without wider systemic cause analysis and remediation. The current incident was flagged as P1 (critical) because of the seriousness of the compliance violation and its recurrence (lint configuration), which meant it received immediate senior management attention, with senior management deciding to invoke the CA setup freeze so that systemic issues could be resolved. Part of the CA setup process rebuild will be to document every parameter in the EJBCA configdump, its expected value and its impact on compliance - which is the type of in-depth systemic remediation we expect from any P1 incident.

If Bug 1630870 had been a P1 it would have received wider internal attention and more scrutiny in terms of remediation activities, addressing not only the occurrence but any wider systemic issue. We will change our incident management process so that every compliance incident is classified as P1 and receives senior management attention and a mandatory detailed systemic cause analysis. This will be completed by 19/02/2021.

We've seen several other CA incidents reporting similar root causes, and I'm curious to understand more the process of how GlobalSign pays attention to these issues from other CAs, evaluates the risk to GlobalSign, and determines the actions to take.

In the first quarter of 2020 GlobalSign integrated its internal event system with Bugzilla. Any CA-Compliance bug creates a ticket in our Security / Compliance Operations Center. All tickets must be handled, analyzed and responded to by the compliance officer responsible for the affected compliance area, who evaluates whether we are affected and/or should implement any additional controls to prevent the issue from happening in our environment (refer to https://bugzilla.mozilla.org/show_bug.cgi?id=1620922#c4).

  • Bug 1654967 was a bug from DigiCert that seems somewhat related, regarding the complexity of ICA creation and the need for automation.

This bug was discussed in the compliance group and we deemed that the process improvements we had already made to our CA generation process (based on risk assessment, continuous improvement and the OCSP EKU incident) were close to identical to the changes DigiCert proposed to their processes and would prevent the issues disclosed in this bug from happening in our environment.

  • Bug 1623384 was a bug from Camerfirma, involving issues with the number of CAs leading to inappropriate configuration and remediation (which seems similar to this)

This bug resulted in an internal examination of our certificate issuance systems and the way AKI fields are programmatically populated.

  • Bug 1622505 was a bug from GlobalSign that got into a little of the detail re: GlobalSign's various platforms (Atlas, GCC, CertSafe), and acknowledged some of the complexities related to the heterogeneous environment GlobalSign was operating.
  • Bug 1654544 was a bug from GlobalSign that was specifically about a reuse of domain validation data beyond the acceptable period (re: CloudSSL)

I highlight these issues because I'm increasingly concerned that GlobalSign's infrastructure complexity may be beyond GlobalSign's ability to effectively secure and comply with things. I realize that a number of these systems are legacy and being phased out, which is encouraging, but I'm not entirely confident that there's a good understanding of where things are moving towards. While I'm encouraged to see GlobalSign is moving towards modernizing systems, and I don't want to discourage that, I'm also trying to understand and appropriately judge risk from the existing systems, and GlobalSign's ability to keep them compliant while migrating.

I'm hoping that, in the spirit of bugs like Bug 1595921 and Bug 1550645, you can help us (the community and browsers alike) build up a better understanding of your systems and the process and flow of information and certificates through that system, ideally capturing/distinguishing distinct flows (e.g. if CloudSSL takes this path, vs GCC, vs Atlas, then clearly distinguishing these paths). Equally, I think understanding a bit with respect to the existing "trusted for TLS" hierarchy and understanding what information flows they're connected to (e.g. Bug 1556948 ), helps us build a better understanding of risk.

GlobalSign currently operates two platforms in parallel: the legacy GCC platform (GlobalSign Certificate Center) and the new Atlas platform.

In 2020 we released MVP1 of the Atlas platform, which focused on making sure the new platform could facilitate, end-to-end, the customer onboarding and consumption of our Digital Signing Service (DSS). As part of MVP1, all the building blocks that would also be required for other products were developed: identity application, identity validation, identity issuance, identity renewal and identity revocation. I use the term identity because the platform will also support identity products that do not have a direct link to a certificate as the identity-providing technology.

Since the MVP1 release we have conducted 10-weekly "Platform Increment" cycles and releases, which during 2020 generally focused on further tweaking the platform and its functionalities (based on feedback from all internal stakeholders), preparing the different platform components for supporting other identity products, increasing platform security maturity, adding the necessary compliance checkpoints and switches, and making sure all our global entities could do business through the new platform (i.e. rendering it compliant with local privacy regulations, JSOX and other financial/tax requirements).

We are now at a stage where all the building blocks of the Atlas platform are in such a shape that we have started rebuilding GCC products. Platform Increment 4 "Camel" will be released to production on May 3, 2021 and will add end-to-end support for the API version of the "IntranetSSL" product (non-public TLS certificates) and for API-based DV S/MIME certificates. The strategy is that once we release a (sub)product on Atlas, any new customer for that product should onboard on Atlas, and new customer sign-up for that product will be disabled in GCC. We have a separate team that focuses on customer migration and works with individual customers to move their existing product subscription from GCC to Atlas as soon as their product is supported in the new platform and their situation allows for migration.

We are currently performing an analysis, "MSSL 80%" (MSSL = Managed SSL), which focuses on mapping and specifying the functionalities that need to be built in the next Atlas Platform Increment cycles in order to support the 100 largest SSL customers and 80% of the MSSL certificates we currently have in the GCC platform. In line with the above, as soon as any type of customer can be supported on Atlas, their onboarding through GCC will be disabled and migration activities will be started for existing customers. We are currently aiming to fully retire GCC for TLS certificate issuance by 2023.

As we focus on the new platform and on phasing out GCC, there is a general stop on adding new functionality to GCC. Generally, the only changes still made to the platform relate to ensuring continued compliance with standards and requirements and to addressing issues of current customers. The same spirit is applied to CA provisioning, where we apply heavy due diligence to new CAs requested for the GCC platform, only allowing new CAs where needed to ensure service and product continuity. The GCC platform uses third-party EJBCA software for certificate issuance, revocation and certificate status mechanisms, whereas in Atlas these functionalities have been developed in-house.

One of the lessons from the GCC world, and one heavily embedded in Atlas, is Compliance Ops. Whereas in the GCC world compliance review and approval happened mostly out-of-band, it is integrated in-band at all layers of the new Atlas system. For example, rather than reviewing a CA configuration dump for correctness, approval of a new CA configuration by Compliance is a required and programmatically enforced step in the creation flow of the new platform. The same applies to leaf certificate profiles etc. and to functionalities underway such as identity and issuance kill switches.

CloudSSL is a GCC API-based TLS product for a select group of high-volume TLS customers, which we are retiring. Most customers have already been moved to Atlas (which already supports API-based TLS certificate provisioning) and we expect the last customer to be migrated by the end of Q2 2021.

Flags: needinfo?(arvid.vermote)

Arvid: I appreciate the detail you provided in response to the questions, but I have to note there are still some bits that stand out as not answered.

It sounds like, from the response, that there is still significant TLS issuance on the GCC platform, and as far as I can tell, there's no defined plan for migration of that (yet). I can appreciate the slow rollout of Atlas, especially when moving to something in-house developed (with all the risks that entails), but 2023 sounds like a significantly long wait for a pattern of issues here.

I'm concerned as well that there doesn't seem to be a clear picture about "what can cause TLS issuance", and issues like this, and Bug 1630870, Bug 1654544, and Bug 1622505 paint a picture of an infrastructure that is growing in complexity, due to trying to keep parallel infrastructure, rather than reducing complexity.

As a concrete example, is it fair to assume that the "new" roots GlobalSign is requesting inclusion for suffer from this same parallel infrastructure support, and thus despite being new, will not actually benefit from improved agility and improved compliance operations? It seems the desired outcome should be to accelerate the transition off of GCC, and for relying parties/browsers to only trust the Atlas infrastructure going forward for new (TLS, S/MIME) certificates.

Perhaps diagrams, as suggested previously, will help demonstrate GlobalSign has a good handle on the complexity, but it's still deeply concerning that such an issue could go undetected until now. The ZLint issue is concerning in itself, but fundamentally, it seems the root cause of this is fundamental complexity in GlobalSign's infrastructure, and it seems essential to address that as quickly as possible, in a way that the community can be confident is meaningfully addressed.

Flags: needinfo?(arvid.vermote)
Attached image Atlas.jpg
Attached image GCC.jpg
Flags: needinfo?(arvid.vermote)

We have discussed this internally and we will not issue any new GCC issuing CA from the new/future generation of TLS & S/MIME roots. We did already issue 4 GCC CAs from these roots (https://crt.sh/?id=3226381778, https://crt.sh/?id=3226381770, https://crt.sh/?id=3226381775 and https://crt.sh/?id=3396870987) but have started the process of generating new instances of them chaining to our old generation of roots (and enrolling them for QWAC), and of ceasing the operations of the original CAs.

Here is an initial diagram of both the GCC and Atlas environments visualizing the different certificate validation layers in place. Please let us know if you require any other visualization or additional information. Also note that the GCC and Atlas platforms and infrastructure are completely segregated and do not share a single function, component or system.

GCC

Refer to attached image GCC.jpg

Activity | Detail
1a - Order | Customer orders a certificate.
1b - Certificate Application Validation | When the order is received by the GCC platform, GCC/RA parse the provided details and CSR and validate them against the requirements.
2a - Subject validation | If manual validation is required, a validation operator picks up the validation case and works with the customer to validate the order.
2b - Domain validation | The Registration Authority system conducts domain validation based on the method selected by the customer.
3 - Approval | The certificate order is approved by a validation operator (in case manual validation is required) or by the RA system (in case of automated-only validation).
4a - Request Issuance | The Registration Authority system requests certificate issuance from the CA Factory.
4b - Pre-lint | The CA Factory (EJBCA) lints the certificate prior to issuance; in case of linting errors, issuance is blocked.
4c - Issuance | In case of successful issuance, the certificate is returned to the RA.
4d - Post-lint 1 | EJBCA also publishes the certificate to our post-linting system, which sends notifications to the operational governance team in case of linting errors.
4e - Post-lint 2 | The RA conducts an additional post-lint, which sends notifications to the operational governance team in case of linting errors. This is the additional post-linter we will build as one of the remedial steps for this incident.
4f - Return leaf | The leaf certificate is returned to the customer through GCC.
5 - Certificate status | Relying parties are offered CRL and OCSP services from GlobalSign data centers.
6 - Revocation | Revocation is requested by the customer through GCC. Validation operators can manually revoke certificates through the Registration Authority system.

Atlas

Refer to attached image Atlas.jpg

Activity | Detail
1 - Order identity and license | The customer orders a license to order certificates from the Atlas platform. Subject identity details are passed during the license ordering process.
2 - Identity validation | If manual validation is required, a validation operator picks up the validation case and works with the customer to validate the order.
3 - Approval | In case manual identity validation is required, identity details are approved by a validation operator in the RA system.
4a - Request Issuance | The customer calls the Atlas API to request issuance of a certificate.
4b - Issuance Request Validation | The issuance request details are parsed and validated against the requirements.
4c - Request issuance | If issuance request validation is successful, the request is sent to the OSS CA Factory.
4d - Domain Validation | The OSS CA Factory conducts ownership validation of the requested domain.
4e - Pre-lint | After successful domain validation, the certificate is pre-linted.
4f - Issue & return certificate | After successful pre-linting, the certificate is issued and retrievable by the customer through the APIs.
4g - Post-lint | Through our monitoring systems the certificate is post-linted, which sends notifications to the operational governance team in case of linting errors.
5 - Certificate status | Relying parties are offered CRL and OCSP services from GlobalSign data centers.
6 - Revocation | Revocation is requested by the customer through the APIs. Validation operators can manually revoke individual certificates and/or identities through the Registration Authority system.

Erratum: In Comment #7 we stated that CloudSSL will be decommissioned by the end of Q2 2021. We would like to clarify that this statement applies to CloudSSL version 1.0, for which the capability has already been rebuilt in Atlas and hence we have started migrating customers and decommissioning the GCC service. We have another service, named CloudSSL 2.0, which will be migrated to Atlas at a later stage of the TLS migration from GCC to Atlas.

Attached image CA Provisioning.jpg

We have finalized the high-level design of our new process for CA creation & provisioning and will now start implementation across the relevant systems and departments. The aim of the new process is to have all tasks and checks related to CA creation & provisioning in a single consistent flow in which the necessary approvals and evidence checkpoints are enforced by the ticket management system. Please see the attached diagram visualizing the high-level process flow.

Thanks Arvid. Your replies continue to be a model for the level of detail and technical content that other CAs should follow.

With respect to your "Create testing CA certificates" step - is it fair to assume that process also involves some form of post-issuance linting (since I assume by "testing" you mean some self-signed || locally-signed thing that does not chain to any hierarchy, making it perfect for testing)?

We've seen various issues come up in these ceremonies in the past, so I'm hoping as much as possible the precise bytes are nailed down ahead of time, and then linted (as a "fake" tbsCertificate, as crt.sh offers APIs to do, or with a signature from a testing/not-trusted key to create a real-but-not-real certificate), such that the ceremony is simply issuing a signature.

Flags: needinfo?(arvid.vermote)

After the key manager creates the testing CA certificates, they are reviewed by the compliance team once the key manager invokes PACOM3 approval. The compliance review includes verifying the CA generation script, comparing the information in the test certificate with the information provided in the CA request, and both a manual compliance review of the test certificate contents and linting of the certificate. We have test hierarchies and environments that fully mirror our production CA hierarchies.

As a general update, we have completed all remedial activities detailed in Comment #2 with the exception of the additional linter in "warn" mode, where we are currently performing the further integration with our CLR and SOC and fine-tuning the operator handling process for this event type. The full integration and fine-tuning is expected to be completed by the end of April.

Flags: needinfo?(arvid.vermote)

The additional linter in "warn" mode has now been fully deployed. This concludes the remedial actions detailed in Comment #3. Please let us know if you require any additional information or whether this bug can be closed - thank you.

Flags: needinfo?(bwilson)

I'll close this on next Wed. 5-May-2021, unless I hear otherwise.

Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Flags: needinfo?(bwilson)
Resolution: --- → FIXED
Product: NSS → CA Program
Whiteboard: [ca-compliance] → [ca-compliance] [ov-misissuance]
