Closed Bug 1520299 Opened 5 years ago Closed 4 years ago

Hongkong Post / Certizen: Failure to report misissuance

Categories

(CA Program :: CA Certificate Compliance, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: wthayer, Assigned: manho)

Details

(Whiteboard: [ca-compliance] [policy-failure])

In August 2018, Hongkong Post revoked a number of misissued certificates:

https://crt.sh/?cablint=543&iCAID=7319&minNotBefore=2017-01-01
https://crt.sh/?cablint=536&iCAID=7319&minNotBefore=2017-01-01
https://crt.sh/?id=141535828&opt=cablint

An incident report was never filed.

Please provide an incident report [1] covering both the misissuance and the failure to report it.

[1] https://wiki.mozilla.org/CA/Responding_To_An_Incident#Incident_Report

  1. How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.

We became aware of the issue on 2018-07-18, 17:11 HKT (09:11 UTC), after following a discussion thread in mozilla.dev.security.policy.

  2. A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.

2018-07-16, 17:01 HKT (09:01 UTC) During a regular review of discussions in mozilla.dev.security.policy, we became aware that interoperability concerns had been raised about a certificate whose OU was longer than 64 characters: https://crt.sh/?id=336874396 .
2018-07-17, 09:30 HKT (01:30 UTC) We searched our CA database for certificates with similar problems and cross-checked the results against crt.sh (X.509 lint).
2018-07-18, 17:11 HKT (09:11 UTC) A total of 18 problematic certificates belonging to 8 organisations were identified with the same problem. The OU field contains either the organisation name or the organisation branch name, which can be longer than 64 characters.
2018-07-26, 10:40 HKT (02:40 UTC) Each subscriber was notified of the problem and asked to confirm a shortened organisation name or organisation branch name.
2018-08-24, 17:00 HKT (08:00 UTC) We confirmed that all subscribers had replaced the certificates on their servers and that the problematic certificates had been revoked.

  3. Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem. A statement that you have will be considered a pledge to the community; a statement that you have not requires an explanation.

We stopped issuing certificates with the problem immediately, through operational controls, as soon as we became aware of it. Our CA system was then fixed in September 2018 to disallow organisation names or organisation branch names longer than 64 characters.

  4. A summary of the problematic certificates. For each problem: number of certs, and the date the first and last certs with that problem were issued.

A total of 18 problematic certificates belonging to 8 organisations were identified. The first was issued on 13 February 2017 and the last on 19 June 2018.

  5. The complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem.

For details, please refer to:
https://crt.sh/?id=189267560
https://crt.sh/?id=286896863
https://crt.sh/?id=274699659
https://crt.sh/?id=543990722
https://crt.sh/?id=229260528
https://crt.sh/?id=1120754107
https://crt.sh/?id=130831104
https://crt.sh/?id=141535828
https://crt.sh/?id=92871392
https://crt.sh/?id=92871396
https://crt.sh/?id=313710330
https://crt.sh/?id=325949218
https://crt.sh/?id=283500692
https://crt.sh/?id=336757233
https://crt.sh/?id=353042966
https://crt.sh/?id=356465268
https://crt.sh/?id=356465271
https://crt.sh/?id=375074656

  6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

Our CA system uses the subscriber's full organisation name and branch name in the Subject OU field of the SSL certificate ("e-Cert (Server)") to identify the subscriber organisation, and the full organisation name and branch name, as they appear in the government record of the Business Registration certificate, can each be longer than 64 characters. However, the CA system did not validate that the length of the OU field must be less than or equal to 64 characters. Only a small number of certificates had this problem, and we had not received any problem report from any internet user or subscriber, so the certificates were not detected until now.

  7. List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.

The CA system was fixed on 12 September 2018 to disallow organisation names or organisation branch names longer than 64 characters. Moreover, we have adopted an abbreviation rule for long organisation and branch names so that all names remain meaningful, using commonly understood semantics to determine the identity of the subscriber.
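
For illustration only, the following is a minimal sketch of the kind of length check described above. It treats the subject attributes as plain strings and applies the 64-character X.520 upper bounds referenced by RFC 5280; the function, data structure, and sample values are hypothetical and are not Hongkong Post's actual implementation.

    # Hypothetical pre-issuance control: reject subject attribute values that
    # exceed the X.520 upper bounds referenced by RFC 5280
    # (ub-organization-name and ub-organizational-unit-name are both 64).
    UPPER_BOUNDS = {"O": 64, "OU": 64, "CN": 64}

    def check_subject_lengths(subject):
        """subject maps an attribute type (e.g. "OU") to a list of values."""
        violations = []
        for attr, values in subject.items():
            bound = UPPER_BOUNDS.get(attr)
            if bound is None:
                continue
            for value in values:
                if len(value) > bound:
                    violations.append("%s is %d characters, maximum is %d"
                                      % (attr, len(value), bound))
        return violations

    # Example: an OU copied verbatim from a long Business Registration name.
    subject = {"O": ["Example Trading Company Limited"],
               "OU": ["Example Trading Company Limited Kowloon Bay Regional"
                      " Customer Service Centre"]}
    assert check_subject_lengths(subject)  # the request would be rejected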

While this incident was being handled, we had to maintain communication with the affected subscribers and support them in replacing their certificates without interrupting their services. A control has been in place on the system side since September 2018 to safeguard against recurrence, and we keep communicating with our subscribers and potential subscribers to advise them of the requirements and their importance. I wish I had filed this incident report earlier, or at least sought advice from the Mozilla community, but I cannot be excused for my negligence in not filing this report in a timely manner.

I appreciate the filing of this issue, but it appears to be overlooking some valuable opportunities to improve.

There are at least two distinct incidents arising from this:

  • The failure to adhere to RFC 5280
  • The failure to revoke in a timely fashion (from 2018-07-16 17:01 HKT to 2018-08-24 17:00 HKT)

There is also, separately, a lack of a timely incident report, although this is not yet explicitly mandated by https://github.com/mozilla/pkipolicy/blob/master/rootstore/policy.md

The response in Comment #1 speaks to the steps taken to ensure the O/OU fields are individually restricted. However, it does not look at the systemic issues that allowed these fields to lack limits in the first place, what steps have been taken overall to review the entire certificate profile(s) for BR and RFC 5280 adherence, nor what steps have been taken to ensure timely revocation and disclosure going forward.

Can you speak about what systemic improvements have been made or reviews that have been conducted?

Flags: needinfo?(manho)

Thanks for the questions. We use commercial CA software to generate digital certificates based on pre-defined X.509 v3 certificate profiles. For your information, all three of our certificate profiles are described in detail in Appendix B of our CPS; see https://www.ecert.gov.hk/product/cps/ecert/img/server_cps_en3.pdf. Since we became aware of the issue, the three certificate profiles have been reviewed in their entirety, especially the fields that vary per applicant and per certificate application, such as the validity period (notBefore and notAfter fields), the subject name (containing the CN and O/OU attributes), the subject public key, and the SAN. The other fields in the certificate are pre-defined in the certificate profiles or generated by the CA software. In our opinion, no other violations of the BRs or RFC 5280 were found, apart from the length of the O/OU attributes. We subsequently implemented additional logic in our system on 12 September 2018 to disallow organisation names or organisation branch names longer than 64 characters.

Furthermore, we maintain this control by performing a self-audit of adherence to the BRs and RFC 5280 on a monthly basis, more frequently than quarterly. We use the cablint, x509lint, and zlint tools to validate a randomly selected sample of the certificates issued during that month. All samples have passed the checks.

Regarding the problematic certificates, it was not clear to us how the organisation name and/or organisation branch name could be shortened, so it took us some time to communicate with the subscribers. If there are other causes to revoke certificates in the future, we should be able to revoke them in a timely manner and prepare an incident report promptly.

Flags: needinfo?(manho)

I'm a little uncertain how to interpret the last part of the response, namely:

It took us some time to communicate with the subscribers. If there are other causes to revoke certificates in the future, we should be able to revoke them in a timely manner and prepare an incident report promptly.

It sounds like the root cause of the delayed revocation is "We didn't know how to offer a replacement certificate, so we didn't require revoking the non-compliant one". While it sounds like there's now some clarity for future OV certs with long (>64 character) names, it isn't necessarily reassuring to think that other forms of (possible) misissuance won't be as tricky or trickier.

I think a desired outcome would be an improved process that ensures that, regardless of questions about replacement, revocation is performed in a timely manner. This seems complementary to, but separate from, your remarks about regular linting.

One way to address this concern is to provide a better understanding of your processes for going from a potentially problematic certificate (whether self-discovered or reported) to revocation, and whether those processes have changed. For example, a daily review of all problematic certificates (reported, investigated, pending replacement) to ensure appropriate triage. Similarly, because some certificates may require 24-hour revocation windows, processes to ensure timely triage and response so that all issues are categorized and triaged well before that deadline.

Can you provide more detail about:

  1. What happens when a problem report arrives?
  2. What happens when your sampling/linting detects issues?
  3. Why don't you perform automated linting for 100% of your certificates immediately?
  4. Why don't you perform pre-issuance linting?

Thanks!

Flags: needinfo?(manho)

Thank you for your further questions. We have looked seriously into this problem and reviewed our certificate problem reporting process for improvements. To enhance clarity, we have recently updated our CPS to give more details of the revocation process in Section 4.9; see https://www.ecert.gov.hk/product/cps/ecert/img/server_cps_en4.pdf.

What happens when a problem report arrives?

Subscribers, relying parties, application software suppliers, and other third parties may submit Certificate Problem Reports informing HKPost of a reasonable cause to revoke an e-Cert (Server). Certificate Problem Reports must identify the entity requesting revocation and specify the reason for revocation with supporting evidence.

Any revocation request reported to HKPost will be acknowledged and actioned promptly. Under the circumstances stated in Section 4.9.1 of our CPS, (i) HKPost will revoke an e-Cert (Server) within 24 hours (for reasons in accordance with Section 4.9.1.1 of the BRs); or (ii) HKPost may revoke an e-Cert (Server) within 24 hours and will revoke it within 5 days (for certain other reasons warranted under Section 4.9.1.1 of the BRs).
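
As a purely illustrative sketch of tracking those two timelines (the function, parameter names, and example values are hypothetical; only the 24-hour and 5-day windows come from the text above):

    # Hypothetical deadline tracker for Certificate Problem Reports, reflecting
    # the two timelines described above: revocation within 24 hours for the
    # BR 4.9.1.1 "24-hour" reasons, and within 5 days for the remaining reasons.
    from datetime import datetime, timedelta, timezone

    def revocation_deadline(reported_at, within_24_hours):
        """reported_at is a timezone-aware datetime; within_24_hours says
        whether the reported reason falls in the 24-hour category."""
        window = timedelta(hours=24) if within_24_hours else timedelta(days=5)
        return reported_at + window

    # Example: a report received at 2019-01-16 09:00 UTC citing a 5-day reason.
    report = datetime(2019, 1, 16, 9, 0, tzinfo=timezone.utc)
    print(revocation_deadline(report, within_24_hours=False))  # 2019-01-21 09:00 UTC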

What happens when your sampling/linting detects issues?

We manually check a monthly sample of the e-Cert (Server) certificates issued during that month, using crt.sh and the linting it provides, because linting cannot be automated on our side at the moment. When our sampling/linting detects an error in an e-Cert (Server), we treat it as a faulty certificate, revoke it within 24 hours, and notify the subscriber of the revocation.

Why don't you perform automated linting for 100% of your certificates immediately?

We use Verizon UniCERT software to issue e-Cert (Server) certificates. Currently the software supports neither pre-issuance nor post-issuance linting. We plan to implement pre-issuance linting when the vendor of the certificate issuance software releases a new version that supports it. Without that software support, we understand that only post-issuance linting is possible, and we are doing it manually at the moment. Having reviewed the number of e-Cert (Server) certificates actually issued every month, I think we can increase our sample size to cover all of them.
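
As an illustration of what covering every issued certificate post-issuance could look like, the following is a hedged sketch that runs the zlint command-line tool over a directory of newly issued PEM certificates and reports any lint returning an error-level result. The directory layout and the JSON field names are assumptions based on current zlint releases, not a description of the CA's actual tooling.

    # Hypothetical post-issuance sweep: lint every PEM file issued in a month
    # with zlint and list the certificates that produce error/fatal results.
    # Assumes the zlint CLI is on the PATH and emits a JSON map of lint name
    # to {"result": ...}; verify against the zlint version actually in use.
    import json
    import pathlib
    import subprocess

    def lint_certificate(pem_path):
        out = subprocess.run(["zlint", str(pem_path)],
                             capture_output=True, text=True, check=True)
        results = json.loads(out.stdout)
        return [name for name, r in results.items()
                if r.get("result") in ("error", "fatal")]

    for pem in sorted(pathlib.Path("issued/2019-06").glob("*.pem")):
        failures = lint_certificate(pem)
        if failures:
            print("%s: %s" % (pem.name, ", ".join(failures)))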

Why don't you perform pre-issuance linting?

While our certificate issuance software does not currently support pre-issuance linting, we have lodged an urgent request to the vendor for this feature. We are monitoring the availability of the related patch from the vendor.

Flags: needinfo?(manho)

Does Hongkong Post have a date from UniCERT for pre-issuance linting support? When will it be implemented?

Flags: needinfo?(manho)

We received a reply from UniCERT that they will deliver a pre-issuance linting feature within 2 months or so; it should be a roll-up of hotfixes so that the feature can be released as soon as possible. Once we have received the hotfix, I'll update again immediately on when we can implement pre-issuance linting in our system.

Flags: needinfo?(manho)
Whiteboard: [ca-compliance] → [ca-compliance] - Next Update - 27-April 2019

What's the status on this?

Flags: needinfo?(manho)

We received a patch for pre-issuance linting support and some other hotfixes from Verizon UniCERT in March 2019, and arranged to test it immediately. However, we found a serious problem with the hotfix in the patch, so we could not implement the pre-issuance linting feature in our system. In early June 2019 we received a cumulative patch for that problem along with some other hotfixes. Again, we are arranging to test it thoroughly as soon as possible. If the test results are positive, I'll update again on when we can implement pre-issuance linting in our system. I expect to update again no later than the end of July 2019.

Meanwhile, we have been performing post-issuance linting month by month since February. No incident or issue has been found so far.

Flags: needinfo?(manho)

Do you have processes in place now to ensure timely updates?

Per https://wiki.mozilla.org/CA/Responding_To_An_Incident#Keeping_Us_Informed

You should also provide updates at least every week giving your progress, and confirm when the remediation steps have been completed - unless Mozilla representatives agree to a different schedule by setting a “Next Update” date in the “Whiteboard” field of the bug. Such updates should be posted to the m.d.s.p. thread, if there is one, and the Bugzilla bug. The bug will be closed when remediation is completed.

Since it sounds like you're in active testing, let's go with weekly updates, unless Wayne or Kathleen request otherwise.

Whiteboard: [ca-compliance] - Next Update - 27-April 2019 → [ca-compliance]
Flags: needinfo?(manho)

I'd like to report that we tried the cumulative patch from UniCERT on our testing platform two days ago. However, we found another problem in the patch, so pre-issuance linting could not be tested yet. We have escalated the issue to UniCERT for investigation at top priority. I'll update again next week.

Can you share more specific details about the problems faced?

Waiting for an update.

The problem was that, after applying the patch, the system generated CRLs repeatedly rather than in accordance with our pre-set schedule. UniCERT acknowledged the problem, and I was hoping they could provide a fix within the last week. However, that was too optimistic. We are pressing them for the fix now.

Am I correct in understanding Comment #14 that the fix for pre-issuance linting included a regression in how CRLs were generated, and that's why it has not been deployed? And that UniCERT has repeatedly had issues in their fixes that have caused other, unrelated regressions, and that's why linting has not been deployed?

Where are we this week, and when does Hongkong Post expect to have a fix?

I'm greatly concerned about the ongoing reliance on UniCERT, given the issues captured in this bug (now over 6 months old), so I'm trying to understand holistically where we stand.

UniCERT provided a cumulative fix that contains the fix for pre-issuance linting along with other fixes, and those other fixes disrupted the frequency of CRL generation. That's why the cumulative fix has not been deployed, or even tested yet.

After experiencing repeated issues with UniCERT, I'm also uncomfortable with the reliance on UniCERT. Independently of UniCERT's pre-issuance linting feature, we will study linting methods and then apply the relevant checks to the certificate request data in our system. It is not easy for us, but we will stand on our own.

Man: it has been 2 months since you provided an update, despite the fact that clear expectations for weekly updates were set in comment #10. Please explain why the status of this incident has not been updated for 2 months, and how delays from UniCERT will be prevented in the future.

What is the status of implementing pre-issuance linting?

We have had a long chain of email conversations with UniCERT over the past 2 months regarding the problematic cumulative fix. I wish I could have provided a meaningful update earlier, but nothing from UniCERT has been encouraging so far. Only on 29 August did UniCERT confirm that they have identified the issue that the cumulative fix causes when applied in our system. Their development team is working on it. Hence, I am chasing UniCERT week by week.

Independently of UniCERT's pre-issuance linting feature, we have studied the linting methods of certlint, which can check for 161 types of error in a certificate. The checks relevant to the certificates we issue have been replicated in our application system, e.g. requiring SHA-256 RSA certificates with a key length of 2048 bits and validating the field lengths of names. We will keep reviewing any changes to certlint and replicate those changes in our application system.
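
To illustrate what replicating such checks in an application layer could look like, the following is a hedged sketch using the Python "cryptography" package to pre-check a certificate request for a 2048-bit RSA key and for subject name attributes within the 64-character bound. The function name, the choice of checks, and the use of this particular library are illustrative assumptions, not Certizen's actual code and not certlint itself.

    # Hypothetical application-layer pre-check of a certificate request,
    # mirroring two of the certlint-style rules mentioned above: RSA keys of
    # at least 2048 bits, and subject name attributes within 64 characters.
    from cryptography import x509
    from cryptography.hazmat.primitives.asymmetric import rsa
    from cryptography.x509.oid import NameOID

    NAME_BOUNDS = {NameOID.COMMON_NAME: ("CN", 64),
                   NameOID.ORGANIZATION_NAME: ("O", 64),
                   NameOID.ORGANIZATIONAL_UNIT_NAME: ("OU", 64)}

    def precheck_csr(pem_bytes):
        """Return a list of policy violations found in a PEM-encoded CSR."""
        csr = x509.load_pem_x509_csr(pem_bytes)
        problems = []

        key = csr.public_key()
        if not isinstance(key, rsa.RSAPublicKey) or key.key_size < 2048:
            problems.append("public key is not RSA of at least 2048 bits")

        for attr in csr.subject:
            entry = NAME_BOUNDS.get(attr.oid)
            if entry is not None and len(attr.value) > entry[1]:
                problems.append("%s exceeds %d characters" % entry)
        return problems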

Man: thank you for the update. I would like to remind you that Mozilla expects CAs who are members of our program to provide proper oversight of their subordinate CAs. If UniCERT is unwilling or unable to remediate this problem, Mozilla will hold Certizen accountable.

Please post weekly updates on this issue until a timeline is agreed upon for implementing the pre-issuance linting patch.

(In reply to Wayne Thayer [:wayne] from comment #19)
Yes, Hongkong Post and Certizen are aware of our accountability to Mozilla and the user community. That is why we are replicating the must-have linting methods in our application system.

At the moment, UniCERT has formed a task force with specialists from multiple development centres and local support. We are having a meeting with the task force today. I'll update again when we have come up with an action plan and timeline.

We had a meeting with the task force on 27 September. Several ideas were raised about the issue the patch causes in our environment. I'll update our action plan and timeline when one of them works out.

This is to report that, according to UniCERT, the task force has identified the cause of the issue in the patch. They will fix it by the end of this month. I'll then update our action plan and timeline again.

Sorry to have kept you waiting for this update. We successfully applied the patch containing the UniCERT pre-issuance linting feature to our system yesterday. Thank you for your patience.

Flags: needinfo?(manho)

It appears that all questions have been answered and remediation is complete.

Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Product: NSS → CA Program
Whiteboard: [ca-compliance] → [ca-compliance] [policy-failure]