GlobalSign: 4 Misissued certificates with invalid CN

RESOLVED FIXED

People

(Reporter: douglas.beattie, Assigned: douglas.beattie)

Details

(Whiteboard: [ca-compliance])

  1. How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.

The GlobalSign post-issuance compliance checker alerted us to the problem in 4 SSL certificates (attached).

  2. A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.

This happened today, Friday May 17th, all times EST:
13:22: Certificates issued
13:40: Post-issuance compliance checker detected misissued certificates and alerted
14:30: Investigation started
15:10: Certificates revoked
16:00: Customer located and contacted. They were using a deprecated API that had not been formally disabled; this API permitted invalid CN values to be supplied and used without proper validation.

Ongoing: We are collecting logs from the customer end, where the API was called via a GlobalSign-provided application, to determine the misconfiguration and resolve it. We are in the process of shutting down the deprecated API and will report on that as soon as we have a status.

  3. Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem. A statement that you have will be considered a pledge to the community; a statement that you have not requires an explanation.

In coordination with the customer, we are assured that no more non-compliant certificates will be issued. We will monitor issuance and we are working to update the CA system to disable this older API in our JP data center.

  4. A summary of the problematic certificates. For each problem: number of certs, and the date the first and last certs with that problem were issued.

We issued four certificates with a CN of: "madmin's macboo.int.mlsel.com"
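
For illustration only, here is a minimal Python sketch of the kind of syntax check this CN fails. It is not GlobalSign's validation code; the function name `is_valid_dns_name` and the exact rules (ASCII letters, digits and hyphens per label, at least two labels, an optional leading wildcard label) are assumptions chosen to keep the example short.

```python
import re

# One DNS label: ASCII letters, digits and hyphens, 1-63 characters,
# with no leading or trailing hyphen. (Simplified; assumption for this sketch.)
_LABEL = re.compile(r"^(?!-)[A-Za-z0-9-]{1,63}(?<!-)$")


def is_valid_dns_name(name: str) -> bool:
    """Return True if `name` looks like a syntactically valid FQDN.

    A single leading "*" label is tolerated so wildcard names pass.
    """
    if not name or len(name) > 253:
        return False
    labels = name.split(".")
    if labels[0] == "*":
        labels = labels[1:]
    if len(labels) < 2:
        return False
    return all(_LABEL.match(label) is not None for label in labels)


# The CN from this incident is rejected; a well-formed name is accepted.
print(is_valid_dns_name("madmin's macboo.int.mlsel.com"))
print(is_valid_dns_name("host.int.mlsel.com"))
```

The reported CN is rejected because its first label contains an apostrophe and a space, neither of which is allowed in a DNS label.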

  5. The complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem.

The certificates will be uploaded to this ticket.

  6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

Full analysis is ongoing.

  7. List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.

We will conduct a full review of legacy or deprecated APIs to be sure they are fully disabled.

This file contains the 4 misissued certificates in PEM format.

Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Assignee: wthayer → douglas.beattie
Type: defect → task
Whiteboard: [ca-compliance]
Summary: Misissued certificates with invalid CN (4) → GlobalSign: 4 Misissued certificates with invalid CN

Here is an update to this incident:

5/20: After further analysis of the issue, it was determined that the cause was not the V1 API in general, but a CN/SAN validation check that was being skipped in a certain scenario. Specifically, when the "AEG" product code was being used, this check was skipped. Typically the AEG product code is used for non-public SSL certificates, and we found that the conditional CN/SAN check for the publicly trusted thread was not being executed.

5/21: We rolled out updated code that now properly checks the CN and SAN values for the AEG product code. We also rolled back the V1 support to permit continued use of that API. While it's not being used for certificate issuance, it was being used for some other functions that impacted customer operations for the prior few days.

We reviewed all certificates issued via this product code and found that these were the only 4 that didn't comply.

Others have asked if we had skipped any other checks, like CAA, when following this AEG product thread. Over the past few days we've reviewed the code and threads and have determined that no other required checks or validations were skipped. Organization and Domain validation is done via our Enterprise model and these certificate requests all were subject to those constraints.

We're continuing to inspect the AEG thread to double and triple check that no other required validation steps were missed and will report back if we find anything new to report, but at this point I believe that we can close this incident.

Doug: Thanks for the update. This communicates that you believe you found and remediated the specific bug, but I don't think it quite gets to understanding the root cause and what systemic improvements will be made. I appreciate that you clarified you performed an exhaustive review - that is certainly to be expected in such an incident, and so it's good to hear it proactively called out, rather than having to ask about it :)

Questions that come up:

  • How long did this issue exist?
  • How was this issue missed?
  • Have there been any changes to how testing is performed to ensure future changes don't cause this?
  • Has there been any systemic change to make it easier to analyze and review this (since you noted it was a deprecated API)?
  • Have you reviewed the other product codes (not sure what this refers to? API endpoint? specific customer?)?
  • What changes have been made to testing - both the tests you always run and how you test new features or changes?
  • Has there been any timeline to deprecating the AEG interface, as originally implied?
Flags: needinfo?(douglas.beattie)

Ryan, I'm working on responses to your Questions.

Ryan, here are our responses to your questions:

How long did this issue exist?

  • This product code has been in use for approximately 5 years and this missing check has been present from the beginning.

How was this issue missed?

  • The AEG product codes were initially designed for issuing non-public SSL certificates, but over time the need arose to support publicly trusted SSL certificates. When this happened, our testing did not uncover this missing check.
  • Normally, all API requests using this product code originate from GlobalSign on-site application and that application generates valid requests; however, as was evident as part of this incident, there are also situations when different end entity applications submit certificate requests. It's evident that those clients don't always generate requests with valid CN and SAN values, and that identified this deficiency in our CA checks.

Have there been any changes to how testing is performed to ensure future changes don't cause this?

  • We are in the process of adding some unit tests to cover this specific set of cases.
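
As a rough sketch of what covering "this specific set of cases" could look like, the Python below enumerates every combination of product code, trust type and client origin and reports any publicly trusted combination where the CN/SAN check would be skipped. The lists, the labels `on_site_app` and `third_party`, and the functions `cn_san_check_runs` and `untested_gaps` are hypothetical stand-ins, not the actual test suite.

```python
from itertools import product

# Illustrative values only; the real system has more product codes.
PRODUCT_CODES = ["DV", "OV", "EV", "AEG"]
TRUST_TYPES = ["public", "private"]
CLIENT_ORIGINS = ["on_site_app", "third_party"]  # hypothetical labels


def cn_san_check_runs(product_code: str, trust: str, origin: str) -> bool:
    """Toy stand-in for the server-side decision.

    The client origin is deliberately ignored: the server must not
    assume that any particular client sends well-formed names.
    """
    return not (product_code == "AEG" and trust == "private")


def untested_gaps(decide):
    """Return each public-trust combination for which the check is skipped."""
    return [
        combo
        for combo in product(PRODUCT_CODES, TRUST_TYPES, CLIENT_ORIGINS)
        if combo[1] == "public" and not decide(*combo)
    ]


# An empty list means no publicly trusted path skips the check.
print(untested_gaps(cn_san_check_runs))
```

Enumerating the full matrix, rather than only the expected client flows, is the point of the sketch: the combination that caused this incident was one nobody expected to occur.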

Has there been any systemic change to make it easier to analyze and review this (since you noted it was a deprecated API)?

  • No, we haven't made any systemic changes to this point.

Have you reviewed the other product codes (not sure what this refers to? API endpoint? specific customer?)?

  • Yes, we have completed an exhaustive review and have not found any other related deficiencies for the validation of certificate requests

What changes have been made to testing - both the tests you always run and how you test new features or changes?

  • As mentioned above, we're adding some unit tests to cover these and similar cases to be sure the requests result in valid certificates
  • With the addition of zlint preissuance checks (soon), this will also be a fail-safe check to block future misissuances with errors like this.
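
To make the "fail-safe" idea concrete, here is a minimal Python sketch of a pre-issuance lint gate: the to-be-signed certificate is linted first and is only passed to the signer if no error-level finding comes back. Everything here is an assumption for illustration (the `Finding` type, `issue_if_clean`, the `linter` and `sign` callables, and the lint name `example_invalid_cn`); it does not reproduce zlint's interface or GlobalSign's pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Finding:
    lint: str      # name of the lint that fired (hypothetical names here)
    severity: str  # "error" or "warning"


class PreIssuanceLintError(Exception):
    """Raised when linting finds an error, so nothing gets signed."""


def issue_if_clean(
    tbs_der: bytes,
    linter: Callable[[bytes], List[Finding]],
    sign: Callable[[bytes], bytes],
) -> bytes:
    """Lint the to-be-signed certificate and sign it only if no errors fire."""
    errors = [f for f in linter(tbs_der) if f.severity == "error"]
    if errors:
        raise PreIssuanceLintError(", ".join(f.lint for f in errors))
    return sign(tbs_der)


# Usage with stand-in callables: the failing linter blocks signing.
def failing_linter(tbs: bytes) -> List[Finding]:
    return [Finding("example_invalid_cn", "error")]


try:
    issue_if_clean(b"tbs-bytes", failing_linter, lambda tbs: b"signed:" + tbs)
except PreIssuanceLintError as exc:
    print("blocked:", exc)
```

Because the gate sits in front of the signing call, a missed check elsewhere in the request pipeline results in a refused request rather than a misissued certificate.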

Has there been any timeline to deprecating the AEG interface, as originally implied?

  • Yes, the V1 API will be restricted to only a small set of customers and those customers will be moved from this API. We're targeting the end of July for this. The AEG product code will continue to be used via the V2 API, and with the additional unit tests and the use of zlint preissuance checks, we don't expect issues like this to reoccur.

(In reply to douglas.beattie from comment #7)

How was this issue missed?

  • The AEG product codes were initially designed for issuing non-public SSL certificates, but over time the need arose to support publicly trusted SSL certificates. When this happened, our testing did not uncover this missing check.
  • Normally, all API requests using this product code originate from GlobalSign on-site application and that application generates valid requests; however, as was evident as part of this incident, there are also situations when different end entity applications submit certificate requests. It's evident that those clients don't always generate requests with valid CN and SAN values, and that identified this deficiency in our CA checks.

Thanks, Doug. This sounds a bit like saying you relied on client-side checks/validation (via the on-site application), rather than server-side checks. I also totally understand that products evolve over time, and features grow and accumulate. With respect to this incident, and preventing future incidents, what sort of steps have been taken (either in response to this, or in the past, or planned) to help ensure that this API and the AEG product codes cover all the necessary requirements?

I don't think linting is necessarily a sufficient answer here, in as much as there are plenty of procedural opportunities for issues (e.g. the Some-City/Some-State discussions going on). I'm largely unaware of how divergent this API and product code are from the existing GlobalSign issuance pipeline and practices, and so I'm trying to get a picture about what the different systems in play are, where they're similar, and where they're different, so I can better understand how this issue has been both contained and remediated.

In this regard, it strikes me as a bit similar to the guidance given to DigiCert in Bug 1550645, both in the request for information and in understanding the procedural approach being taken. Can you help build a similar detailed understanding for how GlobalSign's systems are architected, how the failure happened, and what sort of steps have been taken?

Have you reviewed the other product codes (not sure what this refers to? API endpoint? specific customer?)?

  • Yes, we have completed an exhaustive review and have not found any other related deficiencies for the validation of certificate requests

Great. Coupled with the above question, it might be useful to highlight or indicate what of those systems in place you reviewed. This is the general desire for incident reports - trying to build an understanding about how the system(s) operate (and all of the system(s) in play), and how the CA is responding or addressing them.

Has there been any timeline to deprecating the AEG interface, as originally implied?

  • Yes, the V1 API will be restricted to only a small set of customers and those customers will be moved from this API. We're targeting the end of July for this. The AEG product code will continue to be used via the V2 API, and with the additional unit tests and the use of zlint preissuance checks, we don't expect issues like this to reoccur.

I'm still a bit confused about the relationship between the V1 API and the AEG product code. I had thought AEG meant the V1 API, but this response makes me think they're different systems. Can you help build an understanding here about how GlobalSign's systems are architected (perhaps this is already documented in your CP/CPS)?

(In reply to Ryan Sleevi from comment #8)

(In reply to douglas.beattie from comment #7)

How was this issue missed?

  • The AEG product codes were initially designed for issuing non-public SSL certificates, but over time the need arose to support publicly trusted SSL certificates. When this happened, our testing did not uncover this missing check.
  • Normally, all API requests using this product code originate from GlobalSign on-site application and that application generates valid requests; however, as was evident as part of this incident, there are also situations when different end entity applications submit certificate requests. It's evident that those clients don't always generate requests with valid CN and SAN values, and that identified this deficiency in our CA checks.

Thanks, Doug. This sounds a bit like saying you relied on client-side checks/validation (via the on-site application), rather than server-side checks.

Actually, no, we didn't rely on client side checks. It turned out that the specific scenario which generated these non-conformant certificates was one that we hadn't specifically tested (clearly). The typical AEG flow originates from the AEG client application and uses the AEG product codes; however, this request originated from a different client and we hadn't tested that specific (unexpected) scenario.

I also totally understand that products evolve over time, and features grow and accumulate. With respect to this incident, and preventing future incidents, what sort of steps have been taken (either in response to this, or in the past, or planned) to help ensure that this API and the AEG product codes cover all the necessary requirements?

Looking back, we've completed an exhaustive review of the processing of AEG certificate requests and we've verified that all of the proper checks are in place. Going forward, we've certainly learned that we need to pay even closer attention to product changes when we add new features. We're also discussing if/how we could move the AEG variant completely out of OV and into its own dedicated product line, so the processing of OV doesn't include any conditional logic for private trust. It's certainly the safest solution, but it also has a much larger impact on our dev team and on customers, who would all need to be upgraded.

I don't think linting is necessarily a sufficient answer here, in as much as there are plenty of procedural opportunities for issues (e.g. the Some-City/Some-State discussions going on).

Yes, agree. We include necessary checks into the CA system to prevent this, and we use linting as a secondary, independent check.

I'm largely unaware of how divergent this API and product code are from the existing GlobalSign issuance pipeline and practices, and so I'm trying to get a picture about what the different systems in play are, where they're similar, and where they're different, so I can better understand how this issue has been both contained and remediated.

It's unclear exactly what procedural concerns are related to this incident. The enterprise system is based on previously verified organizational details and domains; when requests are placed using this data, certificates are issued immediately. There weren't any procedural breakdowns as it relates to this issue. The bug here was that the system didn't include a proper check for the format of the SANs, which we've addressed.

As far as this API vs. others and how they relate, this is THE core GlobalSign MSSL API for the issuance of all SSL certificates from our managed platform, so this isn't divergent, it's our core platform. As discussed, we certainly missed an important check when the AEG product code was used for publicly trusted SSL, and we've resolved that.

In this regard, it strikes me as a bit similar to the guidance given to DigiCert in Bug 1550645, both in the request for information and in understanding the procedural approach being taken. Can you help build a similar detailed understanding for how GlobalSign's systems are architected, how the failure happened, and what sort of steps have been taken?

Have you reviewed the other product codes (not sure what this refers to? API endpoint? specific customer?)?

  • Yes, we have completed an exhaustive review and have not found any other related deficiencies for the validation of certificate requests

Great. Coupled with the above question, it might be useful to highlight or indicate what of those systems in place you reviewed. This is the general desire for incident reports - trying to build an understanding about how the system(s) operate (and all of the system(s) in play), and how the CA is responding or addressing them.

We reviewed our GCC ordering system and especially our RA system which enforces all of the required checks, domain validations, etc. It's separate from the GCC code and databases and is developed by a different set of engineers who focus on compliance enforcement checks. So in summary, we reviewed both GCC and RA systems.

Has there been any timeline to deprecating the AEG interface, as originally implied?

  • Yes, the V1 API will be restricted to only a small set of customers and those customers will be moved from this API. We're targeting the end of July for this. The AEG product code will continue to be used via the V2 API, and with the additional unit tests and the use of zlint preissuance checks, we don't expect issues like this to reoccur.

I'm still a bit confused about the relationship between the V1 API and the AEG product code. I had thought AEG meant the V1 API, but this response makes me think they're different systems. Can you help build an understanding here about how GlobalSign's systems are architected (perhaps this is already documented in your CP/CPS)?

The versioning of APIs isn't documented in the CPS, but let me try to explain it here. Several years ago we rolled out an updated API (V2) with some additional features and fields that would otherwise have broken existing V1 implementations. We gave customers a transition period to move over and update their code, after which we planned to disable the prior API. In this case, V1 remained in use longer than it should have. Ordering via this API will be disabled by the end of June, but we'll permit a limited set of customers to continue using the query features in the API (some still use it).

V2 of the API supports all of our enterprise SSL products as well as the AEG (a variant of our MSSL OV product). As discussed above, we've completed an extensive review of the code and have determined that when any required BR checks are omitted, it's for either privately trusted SSL certificates, or it's for certificates that don't have the serverAuth EKU.
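
Read literally, the rule stated above can be sketched as a single predicate in Python. This is only an illustration of that sentence, not the production logic: the function name `br_checks_required` is made up, and the sketch does not attempt to handle finer points such as a certificate with no EKU extension at all.

```python
from typing import Iterable

# id-kp-serverAuth, the TLS server authentication extended key usage.
SERVER_AUTH_OID = "1.3.6.1.5.5.7.3.1"


def br_checks_required(publicly_trusted: bool, eku_oids: Iterable[str]) -> bool:
    """Return True unless the certificate is privately trusted or lacks serverAuth.

    Mirrors the statement above: BR checks may only be omitted for
    privately trusted certificates or certificates without the serverAuth EKU.
    """
    return publicly_trusted and SERVER_AUTH_OID in set(eku_oids)


# A publicly trusted TLS server certificate always gets the checks.
print(br_checks_required(True, [SERVER_AUTH_OID]))
# A privately trusted certificate may skip them.
print(br_checks_required(False, [SERVER_AUTH_OID]))
```

Note that the product code does not appear in the predicate at all, which is the property the review described above was confirming.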

Flags: needinfo?(douglas.beattie)

I'm still not sure I really understand the relationship between V1, V2, and AEG product codes, so I suspect that's why I'm still struggling with this response. I was hoping for pointers to documentation in a way to build up a model of "This is where requests can come in, these are the checks we do, and this is how the system was changed"

I referenced the DigiCert issue, because descriptions like https://bugzilla.mozilla.org/show_bug.cgi?id=1550645#c9 or https://bugzilla.mozilla.org/attachment.cgi?id=9069867 can help describe the relationship between these systems better, and help understand how they're addressed.

If I understand correctly, you have externally-facing V1 and V2 APIs, and an internally-facing RA API. When using the V1/V2 APIs, customers use different 'product codes' to indicate the type of certificate they want. Originally, AEG meant a privately trusted certificate, which had no checks, but later, support for AEG+publicly trusted certificates was added. When using V2+AEG+publicly trusted, a series of BR-compliance checks were enacted. When using V1+AEG+publicly trusted, the compliance checks on the server were not performed, which is the cause of this issue. This issue only manifested because the assumption was that, when the AEG product code was used, a local on-prem device would have formatted the certificate correctly.

If this isn't correct, having some diagrams to explain the relationship might go a long way to helping understand how this issue manifested, and also how it was mitigated.

Flags: needinfo?(douglas.beattie)

(In reply to Ryan Sleevi from comment #10)

I'm still not sure I really understand the relationship between V1, V2, and AEG product codes, so I suspect that's why I'm still struggling with this response. I was hoping for pointers to documentation in a way to build up a model of "This is where requests can come in, these are the checks we do, and this is how the system was changed"

I referenced the DigiCert issue, because descriptions like https://bugzilla.mozilla.org/show_bug.cgi?id=1550645#c9 or https://bugzilla.mozilla.org/attachment.cgi?id=9069867 can help describe the relationship between these systems better, and help understand how they're addressed.

If I understand correctly, you have externally-facing V1 and V2 APIs, and an internally-facing RA API. When using the V1/V2 APIs, customers use different 'product codes' to indicate the type of certificate they want. Originally, AEG meant a privately trusted certificate, which had no checks, but later, support for AEG+publicly trusted certificates was added. When using V2+AEG+publicly trusted, a series of BR-compliance checks were enacted. When using V1+AEG+publicly trusted, the compliance checks on the server were not performed, which is the cause of this issue. This issue only manifested because the assumption was that, when the AEG product code was used, a local on-prem device would have formatted the certificate correctly.

Yes, that is 90% correct

  • We have externally facing V1 (until July 1) and V2 APIs and an internally facing RA API
  • Customers use different product codes to indicate the type of request (DV, OV, EV, DV wildcard, OV multi-san, etc)
  • Originally we had the OV product code, then we wanted to expand it to support private trust.
  • We added the AEG product code (really a sub-type of OV)
  • When we process the AEG product code requests, we added some conditional statements to OV certificate request processing like "If AEG and Private trust, then skip CN/SAN format check". This is to permit internal server names to be issued.
  • One of the places was updated incorrectly to read: "If AEG, then skip CN/SAN format check". The "private trust" qualifier was not included.
  • Since the discovery of this incident, we did a thorough code review and didn't find any other missing conditional checks
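
The difference between the two conditions quoted above can be shown in a few lines of Python. This is a sketch of the logic as described in the bullets, with made-up function names; it is not the actual source code.

```python
def skip_cn_san_check_buggy(product_code: str, private_trust: bool) -> bool:
    """As incorrectly written: "If AEG, then skip CN/SAN format check"."""
    return product_code == "AEG"


def skip_cn_san_check_fixed(product_code: str, private_trust: bool) -> bool:
    """As intended: "If AEG and Private trust, then skip CN/SAN format check"."""
    return product_code == "AEG" and private_trust


# The only case where the two differ is AEG with public trust,
# which is exactly the path the four certificates took.
for code in ("OV", "AEG"):
    for private in (True, False):
        print(
            code,
            "private" if private else "public",
            skip_cn_san_check_buggy(code, private),
            skip_cn_san_check_fixed(code, private),
        )
```

Requests for privately trusted AEG certificates behave identically under both versions, which is consistent with the defect going unnoticed while only well-formed public-trust requests arrived.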

If this isn't correct, having some diagrams to explain the relationship might go a long way to helping understand how this issue manifested, and also how it was mitigated.

While we don't want to rely completely on zlint checks finding all possible malformed certificates, between the checks we've implemented/updated and the addition of zlint preissuance checks for all publicly trusted CAs, we're comfortable that we've sufficiently mitigated the risk of this happening again. We also have been using certLint for post issuance checking as a final independent check.

Flags: needinfo?(douglas.beattie)

Got it! Thanks, that explanation you provided makes it much clearer about where the issue was and how it manifested, and helps explain the mitigations implemented.

To make sure I understand the cause:

  • There was a conditional "If AEG" that bypassed certain checks (Comment #11)
  • This bug existed for 5 years (Comment #7)

This didn't manifest because the "AEG" code was traditionally used only by GlobalSign-authored code (Comment #7), but this incident revealed that non-GlobalSign-developed code was also submitting requests with this product code.

To make sure I understand the remediation:

  • The check was updated to read "If AEG and private trust" (Comment #11)
  • All other checks that were conditional on AEG were examined to make sure they also had "and private trust" and no other incidents were found (Comment #4, Comment #7, Comment #9, Comment #11)
  • V1 API (which ended up being wholly unrelated) is being retired in July; while not related to this incident, it has the benefit of reducing the number of paths through the system (Comment #7)
  • zlint-based pre-issuance checks are being added "soon" (Comment #7)

If that's correct, then just a few more questions:

  • When you say the bug existed for 5 years, was that the ability to use the AEG code for public trust? I'm just trying to understand if the "5 years" was how long that particular code branch existed, or whether it's how long the combination existed (of being able to get a publicly trusted cert with the AEG code). I wasn't sure if they were perhaps different timelines.
  • What's the timeline for implementing pre-issuance linting?
Flags: needinfo?(douglas.beattie)

Doug: Any updates?

Between business travel and holidays this was missed. I'll gather up the details and post soon and I apologize for the delay.

(In reply to Ryan Sleevi from comment #12)

If that's correct, then just a few more questions:

  • When you say the bug existed for 5 years, was that the ability to use the AEG code for public trust? I'm just trying to understand if the "5 years" was how long that particular code branch existed, or whether it's how long the combination existed (of being able to get a publicly trusted cert with the AEG code). I wasn't sure if they were perhaps different timelines.

Yes, the AEG code could always support private and public trust certificates (same timeline)

  • What's the timeline for implementing pre-issuance linting?

Preissuance Linting was added in early June.

Flags: needinfo?(douglas.beattie)

Wayne: Hopefully the easiest way to view the history of this is Comment #0, Comment #4, Comment #5, Comment #12, Comment #15.

That is, the mitigations involved an audit of the code (based on the description of the scenarios that caused it to manifest), changes in how they develop and support such APIs and code, and the addition and deployment of preissuance linting.

Flags: needinfo?(wthayer)

It appears that all questions have been answered and remediation is complete.

Status: ASSIGNED → RESOLVED
Closed: 11 days ago
Flags: needinfo?(wthayer)
Resolution: --- → FIXED