Closed Bug 1821508 Opened 2 years ago Closed 2 years ago

eMudhra: CRL occasionally unavailable and returns 404 error

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: agwa-bugs, Assigned: vijay)

Details

(Whiteboard: [ca-compliance] [crl-failure])

The CRL for C=IN, OU=emSign PKI, O=eMudhra Technologies Limited, CN=emSign ECC Root CA - G3, http://crl.emsign.com?RootCAG3.crl occasionally returns a 404 error.

Here is what the response looks like:

HTTP/1.1 404 Not Found
Server: Apache-Coyote/1.1
Set-Cookie: JSESSIONID=35B1740259EB8D9E5E5391575F4924AA; Path=/; HttpOnly
Date: Thu, 09 Mar 2023 22:03:09 GMT
Pragma: no-cache
Cache-Control: no-cache, no-store, must-revalidate
Content-Disposition: attachment; filename="RootCAG3.crl"
Content-Type: text/html;charset=utf-8
Content-Language: en
Content-Length: 329

<!DOCTYPE html><html><head><title>Error report</title></head><body><h1>HTTP Status 404 - java.io.FileNotFoundException: C:\emSignCRL\CRLFiles\RootCAG3.crl (The process cannot access the file because it is being used by another process)</h1></body></html>
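The response above pairs a 404 status and an HTML error page with a Content-Disposition header that names the CRL file. A minimal sketch of how a monitoring client might flag such a response, assuming the CRL is served DER-encoded; the function name and the returned labels are hypothetical and not part of any tool mentioned in this bug:

```python
def classify_crl_response(status: int, content_type: str, body: bytes) -> str:
    """Return "ok" or a short reason why the response is not a usable CRL.

    Hypothetical helper for illustration only.
    """
    if status != 200:
        return f"http-status-{status}"
    # Strip parameters such as ";charset=utf-8" before comparing.
    media_type = content_type.split(";")[0].strip().lower()
    if media_type.startswith("text/"):
        return "unexpected-content-type"
    # A DER-encoded CRL is an ASN.1 SEQUENCE, so its first byte is 0x30.
    if not body or body[0] != 0x30:
        return "not-der"
    return "ok"


if __name__ == "__main__":
    # Status, content type, and body prefix taken from the response quoted above.
    print(classify_crl_response(404, "text/html;charset=utf-8", b"<!DOCTYPE html>"))
```

The first-byte check is only a cheap sanity test; it does not verify the CRL's signature or validity period.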
Assignee: nobody → vijay
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Whiteboard: [ca-compliance] [crl-failure]
Flags: needinfo?(vijay)

We acknowledge this bug and have reviewed it. The issue is intermittent, and we are in the process of zeroing in on the root cause. This is an interim update on the issue. We thank SSLMate CRL Watch for publishing this observation; our teams follow it to monitor CRL (and OCSP) related findings.

1. How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.
Our team observed these CRLs appearing on SSLMate's CRL Watch list in some instances and then disappearing on their own. This was around 06-Mar-2023.

Listed / observed CRLs include:
http://crl.emsign.com?RootCAG3.crl
http://crl.emsign.com?RootCAC1.crl
http://crl.emsign.com/?RootCAC3.crl

2. A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.
(All times are IST)
06th March 2023 10:00: eMudhra personnel discovered a possible problem with some of our CRLs. Our internal team that monitors the SSLMate CRL Watch website notified us of the possible problem. Manual checks could not reproduce the issue.
06th March 2023 16:00: Investigation started, with escalation to the concerned team to specifically monitor for this issue.
07th March 2023 12:00: Additional monitoring was added to debug the issue. (It was later found that this did not produce logs detailed enough to help zero in on the issue.)
08th March 2023 11:00: Subscribed to a third-party uptime-monitoring service and configured these URLs for HTTP monitoring. (The 24-hour report showed 100% uptime with 200 response codes, with checks made at 1-minute intervals.)
09th March 2023 10:00: An additional program was set up to continuously replicate CRL downloads from multiple regions. This is an internal tool to reproduce the behaviour.
10th March 2023 12:00: Observed this Bugzilla bug filed by Andrew. Although the issue had not yet been reproduced, we decided to classify it as an incident impacting CRL delivery. Preparation of a Public Incident Report began, to report/respond in Bugzilla.
10th March 2023 18:00: Based on log observations, the Operations & Engineering team came up with an update to the CRL Delivery Service. This was applied to the release with the necessary tests. It is still under monitoring, since at this stage there is only a suspected cause, which has not yet been confirmed as the root cause.

(We will report subsequent updates as we continue to monitor).

3. Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem. A statement that you have will be considered a pledge to the community; a statement that you have not requires an explanation.
This impacts CRL responses only; there are no issues with certificates.

4. A summary of the problematic certificates. For each problem: number of certs, and the date the first and last certs with that problem were issued.
These CRLs currently list no revoked certificates.

5. The complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem.
There are no impacted certificates since these are CRL response issues only.

6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.
The suspected cause is a CRL delivery service that randomly failed to load the CRLs; the failure was rare, which made it hard to zero in on. Our Engineering team has applied improvements to this service and pushed out the changes. We have also made improvements to our HA cluster to ensure the effective delivery of CRL files. This has also been validated by our senior architect teams. We are continuing to monitor it manually for now.

7. List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.
We will manage changes more effectively and detect occurrences of this kind. Going forward, we will have a separate process to review CRL delivery. As an additional improvement to our monitoring, we will monitor the complete list of CRL distribution endpoints disclosed in CCADB rather than a few samples. This setup will happen over the next couple of weeks, before 31 March 2023.
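Monitoring every disclosed CRL endpoint, as described above, can be sketched as a loop that fetches each URL repeatedly and counts non-200 responses. This is an illustrative sketch only, not eMudhra's monitoring system: the function names, the number of rounds, and the delay are assumptions, and the URL list would come from the CA's own CCADB disclosures.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable


def http_status(url: str, timeout: float = 10.0) -> int:
    """Fetch url and return its HTTP status code (0 on a network failure)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx responses; keep the status code.
        return err.code
    except OSError:
        # Covers URLError, timeouts, and connection errors.
        return 0


def poll_failures(
    urls: Iterable[str],
    rounds: int,
    fetch: Callable[[str], int] = http_status,
    delay: float = 0.0,
) -> Dict[str, int]:
    """Fetch every URL once per round and count non-200 results per URL."""
    failures = {url: 0 for url in urls}
    for i in range(rounds):
        for url in failures:
            if fetch(url) != 200:
                failures[url] += 1
        if delay and i < rounds - 1:
            time.sleep(delay)
    return failures


if __name__ == "__main__":
    # URL copied from the incident report; rounds and delay are arbitrary.
    print(poll_failures(["http://crl.emsign.com?RootCAG3.crl"], rounds=5, delay=1.0))
```

Because the failure reported in this bug was intermittent, a per-URL failure count over many rounds is more informative than a single up/down check.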

Flags: needinfo?(vijay)

This is a follow-up update to this bug. We are continuing to monitor the CRL services, and they are functioning normally without issues. We have also not seen the issue recur in SSLMate CRL Watch. We will update this thread soon, once we complete the action items highlighted in our previous comment.

As a further update, we have added monitoring for our complete list of CRL distribution endpoints disclosed in CCADB rather than a few samples. This is operational and part of our monitoring system. Our observation of the fix described in the comment above has been satisfactory, and the CRLs have been functioning without any failures. Unless there are any other comments, we request that you consider closing this bug.

Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Summary: eMudhra: CRL occasionally returns 404 error → eMudhra: CRL occasionally unavailable and returns 404 error