Closed Bug 1398243 Opened 2 years ago Closed 2 years ago

certSIGN: Non-BR-Compliant OCSP Responders

Categories: NSS :: CA Certificate Compliance (task)
Status: RESOLVED FIXED
Reporter: kwilson
Assigned: cristian.garabet
Whiteboard: [ca-compliance]

Problems have been found with OCSP responders for this CA, and reported in the mozilla.dev.security.policy forum here:

https://groups.google.com/d/msg/mozilla.dev.security.policy/o1MX07iWDco/RuM1NK_0AQAJ

As per section 4.9.10 of the BRs, OCSP responders MUST NOT respond with a “good” status for unissued certificates. The effective date for this requirement was 2013-08-01.
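
For illustration only, the decision this requirement implies for a requested serial number can be sketched as below. This is not certSIGN's or any vendor's implementation: `status_for`, `issued`, and `revoked` are hypothetical names. The "revoked with revocationTime 1970-01-01" answer follows the extended revoked model of RFC 6960, and `unknown` is the other answer that avoids "good".

```python
from datetime import datetime, timezone

# Sentinel revocation time used by RFC 6960's "extended revoked" model
# for certificates the CA never issued.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def status_for(serial, issued, revoked, extended_revoked=True):
    """Return (status, revocation_time) for a requested serial number.

    issued:  set of serial numbers the CA actually issued (hypothetical store)
    revoked: dict mapping revoked serial -> revocation time
    """
    if serial in revoked:
        return ("revoked", revoked[serial])
    if serial in issued:
        return ("good", None)
    # Unissued serial: never "good". Answer either "revoked" with the
    # 1970-01-01 sentinel or plain "unknown".
    if extended_revoked:
        return ("revoked", EPOCH)
    return ("unknown", None)
```

The point of the sketch is the last branch: a serial that appears in neither store must not fall through to "good".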

Please provide an incident report in this bug, as described here:
https://wiki.mozilla.org/CA/Responding_To_A_Misissuance#Incident_Report
We acknowledge the issues. We fixed one of them the day after it was reported and will fix the other one by 15.09. On Monday we will come back with the full report as required.
>1. How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, via a discussion in mozilla.dev.security.policy, or via a Bugzilla bug), and the date.

We became aware of the problem via discussion in mozilla.dev.security.policy on 29.08.2017.

>2. A timeline of the actions your CA took in response.

Regarding the OCSP responder for certSIGN Enterprise CA Class 3 G2, the problem was due to a misconfiguration and was fixed on 30.08.2017.
Regarding the OCSP responder for certSIGN ROOT CA, the problem is due to a software limitation and will be fixed by 15.09.2017.

>3. Confirmation that your CA has stopped issuing TLS/SSL certificates with the problem.

Not applicable.

>4. A summary of the problematic certificates. For each problem: number of certs, and the date the first and last certs with that problem were issued.

Not applicable.

>5. The complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem.

Not applicable.

>6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

On 23.03.2017 the certSIGN OCSP service was migrated to a new infrastructure of clustered servers. The extended revocation option for unissued certificates didn't work as expected on one node of the OCSP cluster. The error was not detected because our OCSP monitoring service did not contain tests for unissued certificates.
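
A minimal sketch of the kind of monitoring test that was missing, assuming each cluster node can be queried directly rather than only through a load balancer (a fault on a single node is otherwise easy to miss). `query_status` is a hypothetical callable standing in for a real OCSP client, and the node names are made up.

```python
import secrets


def random_unissued_serial(known_serials, bits=64):
    """Pick a random serial number that is not in the set of issued serials."""
    while True:
        candidate = secrets.randbits(bits)
        if candidate not in known_serials:
            return candidate


def nodes_answering_good(nodes, query_status, known_serials):
    """Probe every node directly and return those that answer "good".

    nodes:        iterable of node identifiers (hypothetical)
    query_status: callable (node, serial) -> "good" | "revoked" | "unknown",
                  assumed to wrap a real OCSP client
    """
    serial = random_unissued_serial(known_serials)
    return [node for node in nodes if query_status(node, serial) == "good"]
```

A non-empty result would be an alert condition: at least one node is answering "good" for a certificate that was never issued.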

>7. List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.

A. The OCSP responder for certSIGN Enterprise CA Class 3 G2 was fixed on 30.08.2017.
B. Update the internal OCSP monitoring service to include tests for unissued certificates - 15.09.2017.
C. Update the OCSP responder for certSIGN Root CA - 15.09.2017.
Regarding the certSIGN Root CA explanation, more information is needed. For example, it was highlighted that this was due to a software limitation, implying certSIGN was aware of the non-compliance. If it was not aware of the non-compliance, a question is why not.

If the answer for "why not" is that "tests for unissued certificates" weren't added, then the continued line of root cause analysis asks "Why weren't such tests added" - which is about exploring what tests (if any) certSIGN develops when the Baseline Requirements change.

Hopefully, by continually asking "Why" (or "Why not"), we can identify the root cause(s) that contributed to this, and how systemically they're being addressed.

It's not enough to add missing tests. A full root cause analysis will attempt to explore why tests were missing, and what systemic steps are being taken.
Flags: needinfo?(cristian.garabet)
We have updated our OCSP responder, and it is now in line with the BRs for the Root CA as well.

certSIGN was not aware that the OCSP responder for the certSIGN Root CA must not respond with a "good" status for unissued certificates. Section 4.9.10 of the BRs contains no explicit requirement regarding the OCSP responder for a Root. Our understanding of the BRs was that only Sub-CAs which are not technically constrained must not respond with a "good" status for unissued certificates. We were biased in this direction by the fact that our entire Root CA system is offline.

After we updated the OCSP responder, we encountered a bug in the software that caused the OCSP responses for some valid certificates to be returned as if the certificates had not been issued (status REVOKED with the revocationTime January 1, 1970).
Our monitoring system kicked in and we identified the issue. Everything is now in order and we are monitoring carefully.
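
The regression described above (valid certificates reported as not issued) has a recognizable signature: status REVOKED with a revocationTime of January 1, 1970. A rough sketch of a check for it follows, assuming the monitoring system already collects responses for certificates known to be issued and not revoked. `falsely_unissued` and the shape of `responses` are hypothetical.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def falsely_unissued(responses):
    """Return serials of known-valid certificates reported as not issued.

    responses: dict mapping serial -> (status, revocation_time) as observed
               for certificates that are known to be issued and not revoked
    """
    return sorted(
        serial
        for serial, (status, revocation_time) in responses.items()
        if status == "revoked" and revocation_time == EPOCH
    )
```

A genuine revocation carries a real revocation time, so only the 1970 sentinel is flagged here.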
Flags: needinfo?(cristian.garabet)
(In reply to Cristian Garabet from comment #4)
> certSIGN was not aware that OCSP responder for certSIGN Root CA must not
> respond with a "good" status for unissued certificates. In section 4.9.10 of
> BR there is no explicit requirement regarding OCSP responder for Root. Our
> understanding of BR was that only Sub-CAs which are not technically
> constrained must not respond with a "good" status for unissued certificate.
> We were biased in this direction by the fact that our entire ROOT CA system
> is offline.

I see. This is a concerning interpretation, given that it's a documented requirement "For the status of Subordinate CA Certificates". While the OCSP service is provided by the Root CA, it is provided for the status of subordinate CA certificates - https://crt.sh/?q=3003bf8853427c7b91023f7539853d987c58dc4e11bbe047d2a9305c01a6152c is such an example.

While Comment #2 and Comment #4 address the immediate resolution of the non-compliance, they do not provide a structural outline for how future non-compliance will be detected or mitigated. For example, if we apply a "why" test to both issues, we can work out something like:

Problem: The OCSP responder for the Enterprise CA did not follow the BRs
Why #1: The OCSP responder had been migrated to a new cluster, and one node was not properly configured for unissued certificates.
  - Why just one node? It's useful to continue that exploration.
Why #2: Tests for this specific case of the BRs were not part of our testing process
  - Why were tests not included? What process does certSIGN have in place for developing and reviewing testing as BR changes are made? It's useful to continue that exploration.

Problem: The OCSP responder for the Root CA did not follow the BRs
Why: certSIGN interpreted Section 4.9.10 as referring to the status _responder_ of subordinate CAs, rather than the status _for_ subordinate CAs.
Why: The lack of an explicit mention of root CAs, combined with the Root CA being offline, left it unclear.

We can then work out proposed mitigations for that problem, such as:
  - certSIGN will develop internal tests for each technical change to the BRs
  - certSIGN will develop a process to review each change to the BRs with [auditors, members of the Forum, Mozilla, etc] to ensure proper interpretation

(As examples)
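
One way to make the first mitigation checkable is to keep a mapping from requirements to the tests that cover them and flag any requirement with no test. The sketch below assumes such a mapping exists; the function name and the requirement labels in the usage note are invented for illustration and are not official BR identifiers.

```python
def uncovered_requirements(requirements, tests):
    """Return requirement IDs that no test claims to cover.

    requirements: iterable of requirement IDs, e.g. labels for BR sections
    tests:        dict mapping test name -> set of requirement IDs it covers
    """
    covered = set().union(*tests.values())
    return sorted(r for r in requirements if r not in covered)
```

For example, with the made-up requirements "4.9.10-unissued-not-good" and "4.9.10-update-interval" and a single test covering only the second, the function returns the first as uncovered.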

You can see how this process of asking why isn't about trying to assign blame - it's not about understanding who misunderstood - but about why it was misunderstood, and what opportunities for improvement there are. For example, one can also see this as a failing of the Baseline Requirements for not providing strong technical checks that can be used by CAs (and auditors) to avoid ambiguous interpretation, and a mitigation can be proposing systemic ways to address that :)
Flags: needinfo?(cristian.garabet)
Problem 1: The OCSP responder for the Root CA did not follow the BRs

When you have doubts it’s easy to ask questions. The problem is what to do when you don’t have any and in the end it turns out that you should have had. We agree with your approach on consulting other external parties more often as a way to expose ourselves to several points of view. We will consult more frequently with our auditors on BR changes just to find out what their opinions are.

Problem 2: The OCSP responder for the Enterprise CA did not follow the BRs

The why method is what we have always used to find the root causes of operational incidents. For this particular incident the method led us to the conclusions already presented. To prevent such incidents from occurring in the future, we have updated our change management process.
Flags: needinfo?(cristian.garabet)
Cristian Garabet: the response to comment #5 is not really complete. Ryan is requesting that you do some further root cause analysis on this issue. Please can you return and give a fuller answer?

Gerv
Flags: needinfo?(cristian.garabet)
Problem 1: The OCSP responder for the Root CA did not follow the BR
The root cause is our misinterpretation of the BR requirements and the fact that we were not aware of it until now. The "MUST NOT" requirement applies only to CAs that are not technically constrained. Since a Root CA is not subject to technical constraints, we believed that the first sentence, which says "SHOULD NOT", applied to us. We have also held a brainstorming session on what we should do to make sure we interpret the BR requirements correctly. Our conclusion is that we should analyze the requirements more often and from different perspectives, together with our auditors and with other experts or providers. This creates opportunities to hear other opinions on the requirements and so minimizes the risk of further misinterpretation. We will do the same when new requirements appear. The bugs filed on this site by other providers are also a good way to check our understanding of the BR requirements, so we have formalized our process of consulting Bugzilla.
 
Problem 2: The OCSP responder for the Enterprise CA did not follow the BRs
 
Regarding the OCSP responder for certSIGN Enterprise CA Class 3 G2, the problem was due to a misconfiguration and was fixed on 30.08.2017. On 23.03.2017 the certSIGN OCSP service was migrated to a new infrastructure of clustered servers. The extended revocation option for unissued certificates didn't work as expected on one node of the OCSP cluster.
> Why just one node?
Database connectivity was not established successfully on that node after the service migration.
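
The thread does not say how the affected node behaved internally. Assuming the failure mode was a node that could not reach the certificate database and still produced a certificate status, a fail-closed wrapper might look like the sketch below. `answer` and `lookup` are hypothetical names; `tryLater` is one of the OCSP response statuses defined in RFC 6960.

```python
def answer(serial, lookup):
    """Return an OCSP-style answer without ever defaulting to "good".

    lookup: callable serial -> "good" | "revoked" | "unknown"; assumed to
            raise ConnectionError when the certificate database is unreachable
    """
    try:
        return lookup(serial)
    except ConnectionError:
        # tryLater means the responder cannot answer right now;
        # no certificate status is asserted at all.
        return "tryLater"
```

The design choice is that a node with no database connection refuses to vouch for any certificate instead of guessing.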
 
The error was not detected because our OCSP monitoring service did not contain tests for unissued certificates. After we updated the OCSP responder, we encountered a bug in the software that caused the OCSP responses for some valid certificates to be returned as if the certificates had not been issued (status REVOKED with the revocationTime January 1, 1970).
 
> Why were tests not included? What process does certSIGN have in place for developing and reviewing testing as BR changes are made? It's useful to continue that exploration.
Lack of communication between people and poor leadership of the people involved in the change management process.
As BR changes are made, we update our test procedures to verify compliance with the BRs. The test procedure didn't include tests for the OCSP monitoring service regarding unissued certificates.
We have implemented a cross-check verification in order to mitigate this issue.
Flags: needinfo?(cristian.garabet)
The issue is fixed and it appears that all the questions have been answered, so I'm resolving this.
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED