Closed Bug 1674561 Opened 4 years ago Closed 3 years ago

Microsoft PKI Services: DV certificate issued with OV fields

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: fozzie, Assigned: Dustin.Hollenback)

Details

(Whiteboard: [ca-compliance] [dv-misissuance])

A DV certificate (policy 2.23.140.1.2.1) has been issued with a locality, organization and stateOrProvinceName field:

I originally sent this to DigiCert who said they did not operate the intermediate. I couldn't find any Microsoft contact in their Certificate CPS so I tried to email centralpki@microsoft.com listed in their Corporate CPS which bounced the email. I'll just post it here instead.

Assignee: bwilson → johnmas
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Whiteboard: [ca-compliance]
Assignee: johnmas → Dustin.Hollenback
Summary: Microsoft: DV certificate issued with OV fields → Microsoft DSRE PKI: DV certificate issued with OV fields

Background
This bug filing encompasses CAs managed by DSRE PKI at Microsoft, which includes the following Intermediate CAs:

  • Microsoft IT TLS CA 1
  • Microsoft IT TLS CA 2
  • Microsoft IT TLS CA 4
  • Microsoft IT TLS CA 5
  • Microsoft RSA TLS CA 01
  • Microsoft RSA TLS CA 02

How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.
This was first reported by george@fozzie.dev at 06:19 am Pacific time on Saturday, 2020-10-31 via email to our centralpki@microsoft.com reporting email address. It was then opened as this Bugzilla bug.

A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.
Note: Times are listed in Pacific time zone

  • 31 October 2020 06:19 – Issue reported by george@fozzie.dev as an email
  • 31 October 2020 08:59 – Internal incident created
  • 31 October 2020 09:00 – Internal incident acknowledged and investigation started
  • 31 October 2020 11:15 – Confirmed that this is a failure in the logic of a seldom used workflow
  • 31 October 2020 11:37 – Dev team notified of the workflow failure
  • 31 October 2020 13:00 – Certificate owner notified of the requirement to revoke within 5 days

Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem. A statement that you have will be considered a pledge to the community; a statement that you have not requires an explanation.
Our initial investigation confirmed that this is only an issue with a seldom used workflow that includes a manual approval. Any requests that come in until this is permanently resolved will be denied, preventing further mis-issuance.

A summary of the problematic certificates. For each problem: number of certs, and the date the first and last certs with that problem were issued.
We are aware of 1 certificate and 2 pre-certificates that were already provided in earlier comments. We are still investigating if any other certificates have been impacted by this issue.

The complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem.
The known impacted certificates were already provided as crt.sh links in earlier comments. We are still investigating the complete impact and will add additional details by Monday, 2020-11-02 if we find any.

Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.
This bug was introduced in September 2020 when a change was made to certificate templates. There is a seldom used workflow that was missed during the template change and was pointing to the incorrect certificate template. While we performed the correct Organization Validation steps, the incorrect template mistakenly applied the wrong validation OID. It applied the DV OID to an OV certificate.

List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.
We are still researching if other certificates are impacted and expect to be complete by Monday morning, 2020-11-02. If any additional certificates are found to be mis-issued, we will add them to this bug.

We have already engaged our development team to resolve the workflow issue. This is expected to be resolved by Monday evening (Pacific time), 2020-11-02. In the meantime, any requests that would have a similar issue will be denied.

Hello,
I have a correction on an earlier comment. I previously stated: "This bug was introduced in September 2020...", however this bug was actually introduced in July 2020.

After scanning the database, we confirmed that there were 3 impacted certificates (and their pre-certificates). These were the same 3 that were previously identified.

All 3 impacted certificates were revoked by 2020-11-02 11:52 Pacific time.

Also, we confirmed that we will not issue future certificates under this same condition. We have a manual process in place to re-issue these 3 certificates correctly as well as any future requests. We will eventually want to add automation to remove the human element, but our current controls are adequate for the purpose of closing this bug.

Thank you,

Dustin

This doesn't really say what Microsoft is doing to ensure this won't happen again. Does Microsoft do any pre-issuance linting? Both x509lint and zlint catch this issue.

Hi George,

Yes, Microsoft DSRE PKI performs pre-issuance linting. We currently use certlint/cablint, which does not catch this issue. Since that tool is no longer receiving updates, we plan to eventually migrate to zlint or x509lint.

We also use a separate internally developed compliance checker. We plan to eventually add an OID/attribute check that would catch this if introduced in a future code change.

In the meantime, because of your reporting, we are now aware of edge cases where this issue can occur. We have reviewed our workflows and are confident that we will not re-issue under the same condition. Until we can get an automated check, we have added a manual check to any upcoming changes to ensure this won't be introduced in other workflows. The introduction of this manual check is how we are confident that this won’t happen again. The pre-issuance checks that will be added later will help us remove any manual human processes, which is our ultimate goal.

Thank you,

Dustin

(In reply to Dustin Hollenback from comment #5)

Yes, Microsoft DSRE PKI performs pre-issuance linting. We currently use certlint/cablint, which does not catch this issue. Since that tool is no longer receiving updates, we plan to eventually migrate to zlint or x509lint.

While Amazon's original project is now archived, it continues on as a fork maintained by many of the external contributors, at https://github.com/certlint/certlint

Note that there's benefit to running both/and, not either/or; certlint is particularly useful for calling out ASN.1 encoding issues, while zlint is particularly useful for capturing these logical content issues.

Until we can get an automated check, we have added a manual check to any upcoming changes to ensure this won't be introduced in other workflows.

Can you describe more about the manual check, with a mind to capturing "What sort of checks should a (not-Microsoft) CA do to avoid this issue?"

Flags: needinfo?(Dustin.Hollenback)

Hi Ryan,

Thank you for the extra detail about best practices for linting as well as the reference to the forked version of cablint. The Microsoft DSRE PKI team will investigate if/how we can implement multiple linting tools.

Regarding your question, I do not think there's anything new for other CAs to learn here. We're implementing an additional code review check until we can automate a solution that checks every certificate at cert issuance.

Our approach is to remove manual human steps wherever possible. But, we will implement manual steps to remediate a problem if the automated process cannot be implemented immediately. Below are more details about what has been implemented as well as our longer-term plan.

  • Reviewed all certificate issuance code to identify any potential failure points that would allow the incorrect template to be assigned
  • Identified a single workflow scenario that is currently manually approved by multiple teams. Note that this is an extremely rarely used workflow. The problem condition has only occurred for 3 certificates out of millions issued. All other workflows are not impacted.
  • Implemented a manual process to prevent issuance of the certificates that would fall into this workflow.
  • Implemented an additional temporary manual check during all change reviews to ensure that the correct template is used in the future. This is labor intensive, but remediates the problem of a workflow assigning the incorrect template since we now specifically look for this condition each time we make a code change.
  • Will automate the pre-issuance check for all requests to verify the OID matches the expected attributes. Once this is implemented, we will remove the manual check for this condition during change reviews.
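For readers wondering what an OID-to-attribute check of this kind might look like, here is a minimal sketch. It is an illustration only, not the DSRE PKI implementation: the function name, the input shape (policy OIDs and subject attribute names passed as plain strings), and the exact attribute set are assumptions made for this example. The policy OIDs are the CA/Browser Forum reserved values for DV and OV.

```python
# CA/Browser Forum reserved certificate policy OIDs.
DV_POLICY_OID = "2.23.140.1.2.1"  # domain validated
OV_POLICY_OID = "2.23.140.1.2.2"  # organization validated

# Subject attributes that assert organization identity. A certificate carrying
# the DV policy OID should not contain them. (Illustrative set for this sketch.)
ORG_IDENTITY_ATTRS = {
    "organizationName",
    "localityName",
    "stateOrProvinceName",
    "streetAddress",
    "postalCode",
}


def check_policy_matches_subject(policy_oids, subject_attrs):
    """Return a list of problems; an empty list means the check passed.

    policy_oids:   iterable of dotted policy OID strings from the request
    subject_attrs: iterable of subject attribute names from the request
    (Both parameter shapes are hypothetical, chosen for this example.)
    """
    problems = []
    policies = set(policy_oids)
    attrs = set(subject_attrs)

    # The condition in this bug: DV policy OID combined with OV-style fields.
    if DV_POLICY_OID in policies:
        forbidden = sorted(attrs & ORG_IDENTITY_ATTRS)
        if forbidden:
            problems.append(
                "DV policy OID with organization fields: " + ", ".join(forbidden)
            )

    # The mirror-image condition: OV policy OID without an organization name.
    if OV_POLICY_OID in policies and "organizationName" not in attrs:
        problems.append("OV policy OID without organizationName")

    return problems
```

A pre-issuance hook would reject the request whenever the returned list is non-empty, so a workflow pointing at the wrong template is stopped regardless of which code path selected it.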

Thank you,

Dustin

Flags: needinfo?(Dustin.Hollenback)
Summary: Microsoft DSRE PKI: DV certificate issued with OV fields → Microsoft: DV certificate issued with OV fields

Do you have any updates on the status of linting? Are all proposed mitigations implemented?

Flags: needinfo?(Dustin.Hollenback)

All items from above have been implemented, except for this:

  • Will automate the pre-issuance check for all requests to verify the OID matches the expected attributes. Once this is implemented, we will remove the manual check for this condition during change reviews.

This pre-issuance OID-to-attribute check is currently scheduled to be implemented in June 2021. This automation will be a better long term solution than the manual controls we already implemented to immediately mitigate the issue.

We are planning to use zlint in addition to cablint, which is already in place. The zlint deployment is currently scheduled for January 2022. I'm pushing for an earlier date, but there are multiple extra hurdles to implementing 3rd-party tools in our isolated Microsoft DSRE PKI environment, and I cannot commit to an earlier date at this time.

Flags: needinfo?(Dustin.Hollenback)

Dustin,
Please schedule a "Next Update" on this matter for June 1, 2021. Meanwhile, if any faster progress is made, please let us know.
Thanks,
Ben

Whiteboard: [ca-compliance] → [ca-compliance] Next update 2021-06-01

I have to say this is an incredibly disappointing and disheartening response to hear from any CA, let alone Microsoft. It certainly leaves the impression that Microsoft does not take this issue seriously.

I encourage you to discuss with your colleagues on the Root Program side, because I would say it's fairly shocking to see any CA, let alone a CA that also has a root program, suggest it takes 8 months to make a basic change (pre-issuance OID-to-attribute), and 15 months to implement pre-issuance linting (these dates based on when this issue was opened). This is not the level of agility and responsiveness to be expected or accepted from really any CA, and so it's especially disappointing to see this from Microsoft. Worse, this is beginning to suggest a troubling pattern of compliance issues. Unlike most other subordinate CAs, Microsoft's role as a root program no doubt creates challenges for the issuing CA (i.e. DigiCert) to apply the necessary and appropriate pressure or to hold Microsoft to the standard that they, or other CAs, might otherwise hold their sub-CAs accountable to. For this reason, it's especially important that Microsoft be an industry leader who sets the standard, rather than does less than the bare minimum expected.

I realize you mentioned you're personally pushing for an earlier date, and I hope this comment is useful to highlight to your colleagues that, as a widely trusted CA, this is the sort of response timeline that is deeply troubling, rather than reassuring.

Hi Ryan,

Thank you for the candid feedback. This was extremely helpful in getting my team (DSRE PKI) to understand prioritization.

My original understanding was that there are two buckets of prioritization:

  1. Revoke mis-issued certificates and mitigate the immediate problem (potential issuance where the incorrect validation OID is used) even if using manual processes, and
  2. implement automation to remove any manual processes.

With that understanding, when the implementation team originally asked for prioritization, we focused immediately on mitigating the immediate problem. Then we prioritized the two remaining follow-up items within a larger set of planned work, which resulted in a longer term commitment.

After discussing with others, including DigiCert, I realized that I fundamentally misunderstood the scope of what needs to be mitigated immediately. My new understanding is that stopping the immediate problem is important, but it is also important to quickly implement fixes for any gaps that may also catch future unknown issues. And, we should quickly implement any automation to eliminate temporary manual processes.

Based on the new mindset, I completely agree that these two remaining timelines were ridiculous. I am working with the implementation team to move these to the front of the queue.

  • I expect the pre-issuance OID-to-attribute change to be fully released by the end of this month.
  • We currently have pre-issuance linting, but not ZLint, which, as you previously pointed out, would have caught this. I am still investigating whether we can implement ZLint or whether we need to fully reproduce all of its checks within our internal code. This is harder to set a commitment date for, because it involves either a significant amount of development work or implementing a third-party tool in a highly restricted environment. Within the next two weeks, I should have an updated commitment date to share.

Regards,

Dustin

The DSRE PKI team completed the deployment of a new pre-issuance OID-to-attribute check. This should resolve the root cause of this issue: we now have an automated check in place to ensure that the attributes match the validation OID.

As an additional measure, we are still attempting to deploy another linter. We are working on the security reviews to deploy 3rd party software into our restricted environment. I have a goal of deploying this by the end of April, but cannot firmly commit to that date at this time.

Ben: If you wanted to set a Next-Update to 2021-04-30

Flags: needinfo?(bwilson)
Flags: needinfo?(bwilson)
Whiteboard: [ca-compliance] Next update 2021-06-01 → [ca-compliance] Next update 2021-05-01

Quick update.... the Microsoft DSRE PKI team has gone through all security reviews and been approved internally to deploy zlint for pre-issuance checking. This will be deployed no later than 2021-May-15. This is moving towards the goal of using multiple industry recognized linters and custom internal linters to perform pre-issuance and post-issuance checking against all subscriber certificates.

The Microsoft DSRE PKI team is now using ZLint checking for all new subscriber certificate requests. This adds to the existing linting and one-off checks such as our custom check against the DV/OV OID to attribute issue. Using ZLint adds a redundant check against the specific issue in this bug, but also provides additional checks against other future issues.

This issue (along with others from our partner team in Microsoft) has caused us to re-evaluate the entire verification process, and we will continue to focus in this area with automation, including the use of multiple industry-recognized linters.

With that said, is there anything else needed before closing this bug?

Thanks for the update.

I'm encouraged to see that Microsoft was able to bring in their original 15 month timeline (Comment #9 / Comment #11) into what has ended up being a 6 month timeline, so I want to acknowledge and appreciate Microsoft's reprioritization (Comment #12). While that's longer than ideal for general CA changes, it is understandable that the first step in bringing in third-party software is always the hardest to take, and that this hopefully sets the stage for easier efforts in the future.

I have follow-up questions, but I don't think they necessarily need to block for closing this, so I'm setting N-I for Ben and hoping you'll follow-up as this is closed out.

  • Am I correctly understanding that Microsoft's current pre-issuance linter set is:
  • Can you describe the process Microsoft has in place to keep ZLint and Certlint updated?
    • We've had issues with CAs failing to update linters as new lints are released, so this seems relevant to keeping good practice.
  • Can you describe the process Microsoft has in place to ensure linters are correctly configured and running?
    • We've had issues with CAs having linters fail open (Bug 1635096) or fail to run at all (Bug 1690807), so it would help to understand the process and controls in place that answer the question of who lints the linters.
Flags: needinfo?(bwilson)
Flags: needinfo?(Dustin.Hollenback)

I'll take a look at closing this out after we've received some more feedback from Microsoft on Comment #17.

Hi Ryan,

We use several pre-issuance linters: ZLint (https://github.com/zmap/zlint), CABLint/CertLint (https://github.com/amazon-archives/certlint/blob/master/lib/certlint/cablint.rb), and custom in-house lints. We are in the process of planning to deploy the newer forked CertLint (https://github.com/certlint/certlint) as a replacement for the CABLint/CertLint linter.

We monitor the repos to make sure we know about any new releases. For instance, our monitoring software queries the following endpoint for information about ZLint releases: https://api.github.com/repos/zmap/zlint/releases. Once a new release is detected, an alert is assigned to the team to start the process to implement. We'd welcome any advice on better methods to monitor for updates.
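As a rough sketch of the release check described above (not our actual monitoring code), the logic could look like the following. It assumes the response body from the releases endpoint has already been fetched and is passed in as text, and it relies on the `tag_name`, `draft`, `prerelease`, and `published_at` fields of GitHub release objects. The function names and the idea of comparing against a stored "deployed tag" are assumptions made for this example.

```python
import json


def latest_stable_tag(releases_json):
    """Return the tag of the newest published, non-draft, non-prerelease entry.

    releases_json: text of the JSON array returned by the releases endpoint.
    Returns None when no stable release is listed.
    """
    releases = json.loads(releases_json)
    stable = [
        r
        for r in releases
        if not r.get("draft") and not r.get("prerelease") and r.get("published_at")
    ]
    if not stable:
        return None
    # published_at values are ISO 8601 UTC timestamps in a fixed format,
    # so comparing them as strings orders them chronologically.
    newest = max(stable, key=lambda r: r["published_at"])
    return newest["tag_name"]


def update_needed(deployed_tag, releases_json):
    """True when the newest stable release differs from the deployed one."""
    latest = latest_stable_tag(releases_json)
    return latest is not None and latest != deployed_tag
```

When `update_needed` returns True, the monitoring job would raise the alert that starts the update process. Comparing tags for inequality rather than parsing version numbers keeps the sketch simple, at the cost of also alerting if the deployed tag is somehow newer than the latest listed release.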

To ensure that the linters are running properly, our workflow logic only passes if we get a known positive response. For instance, ZLint provides a well formatted list of JSON objects. If our workflow fails to parse a valid JSON object, we reject the certificate request. For CABLint, if no response or a blank response is returned, it is considered an error and the workflow will reject the certificate request.
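The fail-closed behavior described above can be sketched roughly as follows. This is an illustration under stated assumptions, not our production workflow: it assumes the linter output arrives as text containing a JSON object keyed by lint name, with each value holding a `result` field, and it treats `error` and `fatal` results as blocking. The function name and the choice of blocking results are hypothetical and would need to be adapted to the actual output format and policy.

```python
import json

# Result values treated as blocking in this sketch (an assumption).
BLOCKING_RESULTS = {"error", "fatal"}


def zlint_allows_issuance(raw_output):
    """Return True only when the linter output is well formed and clean.

    Every unexpected condition (no output, unparsable output, wrong shape,
    missing result field) is treated as a rejection, so the check fails closed.
    """
    if not raw_output or not raw_output.strip():
        return False  # the linter did not run or produced nothing
    try:
        report = json.loads(raw_output)
    except json.JSONDecodeError:
        return False  # output is not valid JSON
    if not isinstance(report, dict) or not report:
        return False  # valid JSON, but not the expected non-empty object
    for entry in report.values():
        if not isinstance(entry, dict) or "result" not in entry:
            return False  # a lint entry is malformed
        if entry["result"] in BLOCKING_RESULTS:
            return False  # a lint reported a blocking finding
    return True
```

The key design choice is that the only path returning True is the one where every entry was parsed and inspected; a crashed, misconfigured, or silent linter can never be mistaken for a passing one.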

Thank you,

Dustin

Flags: needinfo?(Dustin.Hollenback)
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Flags: needinfo?(bwilson)
Resolution: --- → FIXED
Summary: Microsoft: DV certificate issued with OV fields → Microsoft PKI Services: DV certificate issued with OV fields
Product: NSS → CA Program
Whiteboard: [ca-compliance] Next update 2021-05-01 → [ca-compliance] [dv-misissuance]