Closed Bug 1550645 Opened 5 years ago Closed 4 years ago

DigiCert: CAA Checking Issue

Categories

(CA Program :: CA Certificate Compliance, task)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: brenda.bernal, Assigned: brenda.bernal)

Details

(Whiteboard: [ca-compliance] [dv-misissuance] [ov-misissuance] [ev-misissuance])

Attachments

(2 files)

Incident Report – Mozilla Policy Violation

1.How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.

On Monday, 29 April 2019, DigiCert QA discovered the issue in the late afternoon while testing a different feature. They found that when the internal CAA record checking service was unreachable (the request times out), the CA could proceed with issuance of the certificate. Jeremy Rowley (Head of Product) was alerted at 5:12 PM MDT on the same day.

2.A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.

The bug was introduced in the initial coding of the CAA checking service. The CAA checking requirement was in effect on September 7, 2017.

Monday, 29 April 2019 - The issue was discovered, and a fix was applied the same day. The fix causes certificate requests to be rejected by the CA if it does not receive a response from the CAA checking service.

Wednesday, 1 May 2019 – A report was generated after the required developer work to query the CA for impacted certificates. The report generated that day indicated that 1053 certificates were issued with the CAA check issue.

3.Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem. A statement that you have will be considered a pledge to the community; a statement that you have not requires an explanation.

The CA has stopped issuing certificates with this problem. Issuance now fails when the internal CAA service is unreachable.

4.A summary of the problematic certificates. For each problem: number of certs, and the date the first and last certs with that problem were issued.

1053 certificates issued without checking CAA records, between 9 Sep 2017 and 16 Apr 2019;

16 of those would fail CAA checks if re-issued today.

5.The complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem.

Please find information attached to this report.

6.Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

An independent check was not performed on the code released for CAA checking. The bug was only identified recently, while testing another feature.

7.List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.

The responsible code was fixed upon discovery. We will ensure a more thorough review process is followed, involving peer reviews and compliance sign-off before a release goes into production, for changes pertaining to ballot and standards requirements.

Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true

1053 certificates issued without checking CAA records, between 9 Sep 2017 and 16 Apr 2019;

How was this number determined? Is it based on logs showing when the service was down?

16 of those would fail CAA checks if re-issued today.

Are these still valid? If so, is anything being done to ensure that they should have been issued?

An independent check was not performed on the code released for CAA checking.

Was a root cause analysis performed? Why was an independent check not performed? The response to question #6 does not give me confidence that DigiCert has a complete understanding of the various factors that contributed to this issue. In turn, that leads me to believe that the remediation steps are insufficient to prevent similar problems in the future.

CAs normally check CAA records twice - once during the request process and again just prior to issuance. If that is the case with DigiCert, I assume that the list of affected certificates failed the BR-mandated "within the TTL of the CAA record, or 8 hours, whichever is greater" check, but did they also fail a check during the request process?

Assignee: wthayer → brenda.bernal
Flags: needinfo?(brenda.bernal)
Whiteboard: [ca-compliance]

Hi Wayne,
How was this number determined? Is it based on logs showing when the service was down?

Yes. We looked through our logs and this is when the internal service could not be reached.

16 of those would fail CAA checks if re-issued today.
Are these still valid? If so, is anything being done to ensure that they should have been issued?

They are still valid. We are wondering what to do here. We're thinking we'll reach out to the customers and confirm by email or phone that they want the certificate. However, that doesn't address the issue that the CAA record did not permit issuance. It could have back when we actually issued the cert, but there is no way to know.

Was a root cause analysis performed? Why was an independent check not performed? The response to question #6 does not give me confidence that DigiCert has a complete understanding of the various factors that contributed to this issue. In turn, that leads me to believe that the remediation steps are insufficient to prevent similar problems in the future.

Yes. The root cause was that the CAA record checking code was written and deployed too fast, prior to when we had better software change management processes in place. The code was written in response to the CAB Forum requirement going live, where there was insufficient time for a thorough code review. That was before we acquired a more sophisticated code review and deployment process.

CAs normally check CAA records twice - once during the request process and again just prior to issuance. If that is the case with DigiCert, I assume that the list of affected certificates failed the BR-mandated "within the TTL of the CAA record, or 8 hours, whichever is greater" check, but did they also fail a check during the request process?

We don't check twice. We only check prior to issuance. If it fails, we recheck (per the requirement) but then issue. Here we saw the failure as an external error (which would normally allow issuance), but in reality it was the CAA checking service failing. We'd like to start checking twice - it's been on our roadmap for a while - but we haven't made it a priority.

My formatting of that was horrible. My responses are the quoted part. Sorry about that.

Jeremy shared some really useful, but terribly brief, details about the technical architecture at https://groups.google.com/d/msg/mozilla.dev.security.policy/4jNkL24Fq6c/atwKWr82AAAJ

In that, we can see that there are multiple systems involved: the CA system, the CAA checking system, and a Splunk monitoring system. Based on the limited description, it sounds like the belief is that the CA system successfully initiated an API call to the CAA checking system, but then something on the CA system side would cause it to time out while waiting for a response. The Splunk monitoring system had some record of these requests (and possibly responses), allowing for some reconstruction.

Jeremy/Brenda: It'd be super useful if you could expand and provide more technical details about how this CA<->CAA checking system is integrated. Understanding both the system components in play and the transitions of the state machines seems like it'd be super useful for Wayne and me (based on the above comments), while also helping us understand what mitigations are in play, as well as possible best practices here.

By comparison, in thinking about, say, a validation method that made use of a phone call, it sounds like there's a difference being highlighted here between "Validation agent didn't make the phone call" and "Validation agent made the phone call, we have a recording of it in our phone system and call logs, but when we tried to update the database record for the validation, we had a bug and it didn't get logged". However, I want to be careful not to assume anything in the design, its implementation, or mitigations, and that's why I'm hoping that with a description about how the system works, it can be made clear both what failed and what mitigations already existed, so we can think about next steps.

Let me grab the engineering team on Monday and put together a document that I can attach to the bug. I'm thinking we'll provide a copy of our engineering process to show how we're doing the reviews and a diagram showing how the relevant systems connect.

This is correct:
In that, we can see that there are multiple systems involved: the CA system, the CAA checking system, and a Splunk monitoring system. Based on the limited description, it sounds like the belief is that the CA system successfully initiated an API call to the CAA checking system, but then something on the CA system side would cause it to time out while waiting for a response. The Splunk monitoring system had some record of these requests (and possibly responses), allowing for some reconstruction.

The failure was on a communication between the CA system and CAA system when the CA system timed out the request to the CAA system.

This is not quite correct:
By comparison, in thinking about, say, a validation method that made use of a phone call, it sounds like there's a difference being highlighted here between "Validation agent didn't make the phone call" and "Validation agent made the phone call, we have a recording of it in our phone system and call logs, but when we tried to update the database record for the validation, we had a bug and it didn't get logged". However, I want to be careful not to assume anything in the design, its implementation, or mitigations, and that's why I'm hoping that with a description about how the system works, it can be made clear both what failed and what mitigations already existed, so we can think about next steps.

It's more like the validation agent made the call, but hung up before the person answered the question. The phone call recording still played so we captured the response, but we don't know what the response was because the validation person hung up the phone. We need to go listen to the recordings to see what they said.

Is it okay if I post the engineering update next week and continue the discussion at that point?

Is it okay if I post the engineering update next week and continue the discussion at that point?

My highest priority is the current discussion on m.d.s.p. around what to do for the certificates that were misissued. To the extent that understanding helps inform the tradeoffs, please prioritize it, but I otherwise view the request as useful to understanding the root cause analysis and future mitigations, and which is less urgent than resolving the existing certificates.

Sounds good. We've already kicked off revocation for the 16. We're just going to kick off revocation of the remaining ~1000 certs as well. 1000 isn't that many certs, and most of them were certs issued for testing.

The compliance team met with the dev team today and reviewed what happened. There are a few inaccuracies in the original post that I think warrant clarification. First, I found out we actually had a code review of the CAA system. However, the developer conducting the code review was unfamiliar with the CAA record checking requirements. This allowed the bug to slip through unnoticed. The person checking the code interpreted the requirements as "Failure permits issuance" instead of "external failures permit issuance if DNSSEC is not enabled." In this case, the failure was internal between the CAA system and CA system, meaning neither the CAA record nor the existence of DNSSEC was appropriately checked. With respect to "how will we prevent this from happening again in the future," all bugs go through a retrospective process that is intended to reduce the likelihood of bugs going forward. We do code reviews by second developers and have QA test functionality. With me coming over to compliance, we now embed compliance into the product dev process. Unfortunately, despite all of this scrutiny, new features (like CAA) may have bugs. We do try to detect bugs early and fix them fast. This one in particular had such a low rate of occurrence that it went undetected for some time.
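To make the misread requirement concrete, the following is a minimal sketch (in Go, with hypothetical type and function names; an illustration of the rule described above, not DigiCert's actual code) of the intended split: a failure to reach the CA's own CAA checking service fails closed, while an external DNS lookup failure permits issuance only when the zone is not DNSSEC-signed.

```go
package caa

import (
	"context"
	"errors"
	"time"
)

// CAAOutcome is the result reported by a (hypothetical) internal CAA checking service.
type CAAOutcome struct {
	Permitted    bool // the CAA record set permits this CA to issue
	LookupFailed bool // the DNS lookup itself failed (external failure)
	DNSSECSigned bool // the zone is DNSSEC-signed
}

// CAAChecker abstracts the internal CAA checking service.
type CAAChecker interface {
	Check(ctx context.Context, domain string) (CAAOutcome, error)
}

var ErrCAABlocksIssuance = errors.New("caa: issuance not permitted")

// MayIssue implements the intended fail-open/fail-closed split:
//   - If the CA cannot reach its own CAA service (timeout, transport error),
//     that is an internal failure and issuance must be blocked (fail closed).
//   - If the CAA service reports that the external DNS lookup failed, issuance
//     is permitted only when the zone is not DNSSEC-signed.
//   - Otherwise issuance is allowed only if the CAA records permit it.
func MayIssue(ctx context.Context, checker CAAChecker, domain string) error {
	ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()

	outcome, err := checker.Check(ctx, domain)
	if err != nil {
		// Internal failure (e.g. the request to the CAA service timed out).
		// The original bug treated this path like an external lookup failure
		// and allowed issuance; the fix is to fail closed here.
		return ErrCAABlocksIssuance
	}
	if outcome.LookupFailed {
		if outcome.DNSSECSigned {
			return ErrCAABlocksIssuance
		}
		return nil // external failure, no DNSSEC: issuance permitted
	}
	if !outcome.Permitted {
		return ErrCAABlocksIssuance
	}
	return nil
}
```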

We've decided to revoke all of the impacted certs except a certain number that are used on critical infrastructure where the domain operator confirmed that they didn't have a CAA record in place at time of issuance and don't currently have a CAA record. In those cases, the verification would complete instantaneously as would issuance. I'll post the exact certs a bit later. Still working out which ones will be revoked tomorrow (but it will be nearly all of them).

This seemed like an easier approach than trying to reconcile the Splunk data. We also found that the Splunk data didn't go back far enough.

Anyway, sorry for any confusion. Hopefully that helps. I've also attached a high level architecture diagram that shows how these parts of our system interoperate. The number is the order the services are called.

Jeremy,

Thanks, this is helpful! I appreciate DigiCert's approach here to focus on strict compliance, especially in light of the past issues with missed revocation deadlines. I would still expect to see these listed on the report - both as a matter of non-compliance with respect to the CAA checking and as a matter of non-compliance with the revocation deadline.

This does prove a useful example of an incident analysis that is indicative of a more complex interaction/bug, rather than "We didn't even bother to check" / "we relied on humans and humans are failure prone" (notwithstanding the earlier remarks about bugs and review). In terms of approaching this systemically, it strikes me that there are opportunities to either document what you are doing or to improve. It seems possible to substantially revisit questions 6 and 7 from Comment #0 and expand on them, based on this improved understanding.

For example, Comment #8 noted:

This one in particular had such a low rate of occurrence that it went undetected for some time.

With the added diagram in Comment #9, we can see questions like:

  1. Have you audited your code for failure cases or edge cases and examined whether or not you're providing signals that can be used to detect issues?
    1. e.g. the number of failures to connect to the CAA service, the number of failures to connect to the zlint service, the CT service, etc.
  2. Have you examined such "failure" cases (even when failures may be permissive or overridden) to make sure you're surfacing signals?
    1. Do you fail closed or fail open if the zlint service fails, times out, or is non-responsive?
    2. Are you recording metrics on which zlint checks are failing as possible signals?

Instrumentation and analytics are understandably hard, as is making them actionable, so I'm not trying to be prescriptive here. However, in an analysis of the incident, it's useful to examine what went right, where you might have got lucky, and what went wrong. Beyond the "If we didn't have the problem, we wouldn't have had the problem", it's useful to understand whether there were any signals or mitigations that could have helped reduce the scope, impact, or duration of the issue.

The more detail that is shared, the easier it becomes for the community to understand and assess how DigiCert handles development and is able to respond to issues, as well as identify best practices for other CAs, which combined are the core goals for these incident reports.

Flags: needinfo?(brenda.bernal)
Flags: needinfo?(jeremy.rowley)

Hey Ryan,

  1. We do audit our code to ensure that when a failure occurs (connection timeout, connection failure, bad response, etc.) we get a signal that can be tracked and acted on. We log these signals to Splunk, which triggers an email alert to the appropriate team. The signal can also raise an Opsgenie ticket, depending on the issue.

  2. We examine each error. Which team does that evaluation depends on the nature of the error. That team determines the root cause of the error and takes any necessary action to remediate it. Anything that is flagged as a potential issue with the BRs or a browser policy is raised immediately to my team to ensure we can track timelines and prepare the necessary incident reports.

For example, if we see a connection error then the SRE team may take the ticket and add more capacity or work with the network team to solve the timeout issues. If the error was from zlint indicating a problem with the certificate then the validation/ca/compliance teams will work together to determine why zlint returned the error and take appropriate action. This may include reporting the false positive to the community or requesting a code contribution from the engineering team to fix the false positive for everyone.

i. We hard-fail certificate issuance on zlint issues (connection errors, timeouts, bad responses, or errors returned). We do retry the request on connection issues, as opposed to actual certificate content concerns.

ii. If the zlint service indicates there is a problem with the certificate, both the zlint service itself and the CA service will log the errors, which triggers an alert for the appropriate teams. Logged issues are archived per our CPS so we (or the auditors) can review them again whenever needed.

What else can I share?
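For illustration, here is a hedged sketch (hypothetical Go, not DigiCert's actual Splunk/Opsgenie pipeline) of the kind of failure-path signal described above: even failures the code tolerates are counted and logged in a structured form so a monitoring system can baseline and alert on them.

```go
package signals

import (
	"log"
	"sync/atomic"
)

// FailureKind names a failure path we want visibility into, even when the
// code treats that failure as non-fatal.
type FailureKind string

const (
	CAATimeout      FailureKind = "caa_timeout"
	CAALookupFailed FailureKind = "caa_lookup_failed"
	LintUnreachable FailureKind = "lint_unreachable"
)

// counters holds per-signal counts that a metrics scraper or log forwarder
// could pick up and baseline.
var counters = map[FailureKind]*int64{
	CAATimeout:      new(int64),
	CAALookupFailed: new(int64),
	LintUnreachable: new(int64),
}

// Record increments the counter for a failure path and emits a structured
// log line. The point is that "allowed but exceptional" paths still leave a
// signal behind, so a change in their rate is visible to the on-call team.
func Record(kind FailureKind, domain string) {
	if c, ok := counters[kind]; ok {
		atomic.AddInt64(c, 1)
	}
	log.Printf("signal=%s domain=%s", kind, domain)
}

// Count returns the number of times a failure path has been taken.
func Count(kind FailureKind) int64 {
	if c, ok := counters[kind]; ok {
		return atomic.LoadInt64(c)
	}
	return 0
}
```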

Flags: needinfo?(jeremy.rowley)

Thanks for the continued detail! This is all helping build a better understanding of recommended best practices.

I suppose I'm hanging a lot on what you mentioned earlier:

This one in particular had such a low rate of occurrence that it went undetected for some time.

When something like this happens, this would normally be an opportunity to take a deep-dive look at things. I'm more interested in understanding the steps you're taking in light of this, rather than the individual results of those steps (unless they're relevant).

Using the info you added in Comment #11, if it were me, I'd be creating follow-up tasks like:

  • Go through every possible place a failure signal can be emitted and
    • Review with our Compliance team to make sure how we handle that error is compliant
    • Review with our SRE team to make sure we had appropriate alerting in place
  • Look through the code for other conditionals that might affect the flow, and:
    • Make sure you have metrics in place to know when you're hitting edge cases
    • Work with SRE and Compliance to figure out whether such reports should be done in, say, an advisory capacity (e.g. weekly review)

In looking at things going well, it's great that you already had metrics in place for what your code had interpreted was an acceptable failure case - that is, rather than just having metrics for the errors that prevent issuance, you also had metrics in place for the places where you handled results and logic.

In looking where things went wrong, as you noted, there was a misunderstanding in the implementation about acceptable failures.

In looking to how to prevent issues like this in the future, the above sort of shakes out as:

  • Have we gone through and checked for other misunderstandings?
  • Does our Compliance team know every single path that can lead to a certificate being issued in our code?
  • If there are "allowed but exceptional" cases, do we have metrics in place for them as well?
  • Do we have alerting, even informationally, in place for when something takes one of these exceptional cases?

Using CAA as the example (although I would suggest examining the whole issuance flow), there are three permitted "failure as permission" cases. A good response would be to examine them and make sure you had metrics in place for each, to measure when these paths are taken. A great response would be to implement controls such as sampling and spot checks for certificates that take these paths, to make sure there aren't issues.
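As a sketch of the "great response" described above (hypothetical Go, assuming a recheck callback; not an actual DigiCert control): sample certificates that were issued via a fail-open CAA path and re-check them post-issuance. A later mismatch does not by itself prove misissuance, since CAA is evaluated at issuance time, but it flags candidates for compliance review.

```go
package audit

import (
	"context"
	"math/rand"
)

// IssuedCert is a minimal stand-in for an issued certificate record that took
// one of the "failure permits issuance" CAA paths.
type IssuedCert struct {
	Serial string
	Domain string
}

// Recheck re-runs the CAA check for a domain; in a real system this would
// call the CAA checking service and compare the result against the issuing CA.
type Recheck func(ctx context.Context, domain string) (permitted bool, err error)

// SampleFailOpenIssuance picks roughly rate*len(certs) certificates that were
// issued via a fail-open CAA path and re-checks them post-issuance. Any cert
// whose domain now blocks issuance (or whose recheck errors) is returned for
// manual compliance review.
func SampleFailOpenIssuance(ctx context.Context, certs []IssuedCert, rate float64, recheck Recheck) []IssuedCert {
	var flagged []IssuedCert
	for _, c := range certs {
		if rand.Float64() >= rate {
			continue // not sampled
		}
		permitted, err := recheck(ctx, c.Domain)
		if err != nil || !permitted {
			flagged = append(flagged, c)
		}
	}
	return flagged
}
```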

This is comparable to CAs that have an "override" button for their employees to override lint WARNINGs. A Basic CA might record when this happens. A Good CA might require a second party to review and override. A Great CA might have their compliance team examine all of these post-issuance as well.

You've shared a lot of great information about the architecture and how things are working - and I think quite a bit is reflective of good/great practices, if this is uniform across all your CA platforms. But something slipped through, and understanding why that happened, and how you're approaching the problem / looking for things to improve, is hugely valuable. That doesn't always mean things have to change - for example, an analysis that shows something was, say, a 1 in a quintillion event, might have the wrong cost/risk benefit. That's up for the CA to establish, through their incident report, and show their work on how they reached that conclusion :)

All of this is what I believe Comment #1 was getting at, but certainly reflected how I felt re: the incident report. You've since shared an incredible amount of information, which goes a great deal to addressing those concerns raised. In light of all of this, I'm hoping you revisit Questions 6 and 7, and might want to revise them such that someone, reading only the incident report (and ignoring these other comments, for now) can get a sense about what went wrong and how it's being addressed.

Flags: needinfo?(jeremy.rowley)

Go through every possible place a failure signal can be emitted and
Review with our Compliance team to make sure how we handle that error is compliant
Review with our SRE team to make sure we had appropriate alerting in place

This is something I've scheduled. We're planning on going through the relevant RFCs line by line to ensure everyone has the same understanding of what is required. Of course, we've done compliance training for all engineers working on the various systems, and I feel comfortable with the code being deployed these days. However, we do need to review the legacy code with the engineers to ensure the RFCs were properly interpreted. We'll do that during the sit-down.

Look through the code for other conditionals that might affect the flow, and:
Make sure you have metrics in place to know when you're hitting edge cases
Work with SRE and Compliance to figure out whether such reports should be done in, say, an advisory capacity (e.g. weekly review)
In looking at things going well, it's great that you already had metrics in place for what your code had interpreted was an acceptable failure case - that is, rather than just having metrics for the errors that prevent issuance, you also had metrics in place for the places where you handled results and logic.

Yes, but this issue existed from the beginning, so there was never an increase in errors relative to the baseline. The baseline metrics for the alerts had the error built in, since CAA was a new system.

Have we gone through and checked for other misunderstandings?

In process.

Does our Compliance team know every single path that can lead to a certificate being issued in our code?

Yes, but there are too many right now. We are working on shutting down the legacy DigiCert and Symantec systems. We finally finished the Verizon migration (a while ago, but I don't think we ever celebrated). A lot of the legacy Symantec migration is tracking for completion next year, as is legacy DigiCert. Once those are migrated to our centralized system, there will be one route to validate and issue certificates.

If there are "allowed but exceptional" cases, do we have metrics in place for them as well?
Do we have alerting, even informationally, in place for when something takes one of these exceptional cases?
Using CAA as the example (although I would suggest examining the whole issuance flow), there are three permitted "failure as permission" cases. A good response would be to examine them and make sure you had metrics in place for each, to measure when these paths are taken. A great response would be to implement controls such as sampling and spot checks for certificates that take these paths, to make sure there aren't issues.

Agreed. And we have spot checking and sampling. However, I see your point: sampling errors before the service call, within the service itself, and after the service call would probably have detected this.

This is comparable to CAs that have an "override" button for their employees to override lint WARNINGs. A Basic CA might record when this happens. A Good CA might require a second party to review and override. A Great CA might have their compliance team examine all of these post-issuance as well.

You've shared a lot of great information about the architecture and how things are working - and I think quite a bit is reflective of good/great practices, if this is uniform across all your CA platforms. But something slipped through, and understanding why that happened, and how you're approaching the problem / looking for things to improve, is hugely valuable. That doesn't always mean things have to change - for example, an analysis that shows something was, say, a 1 in a quintillion event, might have the wrong cost/risk benefit. That's up for the CA to establish, through their incident report, and show their work on how they reached that conclusion :)

Thanks! I think we have good practices that are improving. I feel like we started with a good baseline and are on a trajectory toward even better prevention.

All of this is what I believe Comment #1 was getting at, but certainly reflected how I felt re: the incident report. You've since shared an incredible amount of information, which goes a great deal to addressing those concerns raised. In light of all of this, I'm hoping you revisit Questions 6 and 7, and might want to revise them such that someone, reading only the incident report (and ignoring these other comments, for now) can get a sense about what went wrong and how it's being addressed.

Okay. I'll work on something new for those two questions. We'll post an updated incident report after I talk to the team one more time. I want to get a solid plan on some of the additional alerts we may add so I can share those.

Flags: needinfo?(jeremy.rowley)

As an update, here's what we are doing for additional remediation.

  1. I've scheduled a training for right after CAB Forum in June to go line-by-line through the BRs, 5280, and the CAA RFC. Attendance by devs working on RA or CA systems (or things touching those systems) is mandatory. During that meeting we will assign each group of requirements to a dev as their responsibility. They will confirm that their part of the code is performing compliantly.
  2. Going forward we are adding alerts and better unit tests to detect anomalies. The following are currently being prioritized:
  • Add unit tests for timeout talking to CAA service from the CA
  • Add a log whenever CAA mode is required and caa_status is fail (timeout case) and we are trying to save to the DB. This will detect if there's ever something wrong in the system communication. We'll know if the CAA record checking system ever stops working appropriately.
  • Add additional tests so that, just before signing any public SSL cert, we double-check that the CAA check passed. This will independently confirm the CAA record checking worked.
  • We're adding another check to ensure the CAA record was appropriately pulled. Basically, a second check to ensure the correct CAA record was checked. This wasn't a problem here, but it seems like a good test.
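As a minimal sketch of the kind of unit test the first item describes (written against the hypothetical MayIssue/CAAChecker sketch earlier in this thread, not DigiCert's real test suite), a CAA checker that never answers must cause issuance to be blocked:

```go
package caa

import (
	"context"
	"testing"
	"time"
)

// timeoutChecker simulates a CAA checking service that never answers: it
// blocks until the caller's context deadline expires.
type timeoutChecker struct{}

func (timeoutChecker) Check(ctx context.Context, domain string) (CAAOutcome, error) {
	<-ctx.Done()
	return CAAOutcome{}, ctx.Err()
}

// TestIssuanceBlockedOnCAATimeout asserts the fail-closed behaviour: if the
// call to the CAA service times out, the CA must refuse to issue rather than
// treating the timeout like an external lookup failure.
func TestIssuanceBlockedOnCAATimeout(t *testing.T) {
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	if err := MayIssue(ctx, timeoutChecker{}, "example.com"); err == nil {
		t.Fatal("expected issuance to be blocked when the CAA service times out")
	}
}
```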


Here's the revised Incident report:

1.How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.

On Monday, 29 April 2019, DigiCert QA discovered the issue in the late afternoon while testing a different feature. They found that when the internal CAA record checking service was unreachable (the request times out), the CA could proceed with issuance of the certificate. Jeremy Rowley (Head of Product) was alerted at 5:12 PM MDT on the same day.

2.A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.

The bug was introduced in the initial coding of the CAA checking service. The CAA checking requirement was in effect on September 7, 2017.

Monday, 29 April 2019 - The issue was discovered, and a fix was applied the same day. The fix causes certificate requests to be rejected by the CA if it does not receive a response from the CAA checking service.

Wednesday, 1 May 2019 – A report was generated after the required developer work to query the CA for impacted certificates. The report generated that day indicated that 1053 certificates were issued with the CAA check issue.

3.Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem. A statement that you have will be considered a pledge to the community; a statement that you have not requires an explanation.

The CA has stopped issuing certificates with this problem. Issuance now fails when the internal CAA service is unreachable.

4.A summary of the problematic certificates. For each problem: number of certs, and the date the first and last certs with that problem were issued.

1053 certificates issued without checking CAA records, between 9 Sep 2017 and 16 Apr 2019;

16 of those would fail CAA checks if re-issued today.

5.The complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem.

Please find information attached to this report.

6.Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

We created a bug in the system where a failure in our CAA record check allowed issuance. The root cause was unfamiliarity with some of the intricacies of the RFC. The developer conducting the code review interpreted the requirements as "Failure permits issuance" instead of "external failures permit issuance if DNSSEC is not enabled." In this case, the failure was internal between the CAA system and CA system, meaning neither the CAA record nor the existence of DNSSEC was appropriately checked.

7.List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.

With respect to "how will we prevent this from happening again in the future," all bugs go through a retrospective process that is intended to reduce the likelihood of bugs going forward. We do code reviews by second developers and have QA test functionality. With me coming over to compliance, we now embed compliance into the product dev process. Unfortunately, despite all of this scrutiny, new features (like CAA) may have bugs. We do try to detect bugs early and fix them fast. This one in particular had such a low rate of occurrence that it went undetected for some time.

We've decided to revoke all of the impacted certs except a certain number that are used on critical infrastructure where the domain operator confirmed that they didn't have a CAA record in place at time of issuance and don't currently have a CAA record. In those cases, the verification would complete instantaneously as would issuance. I'll post the exact certs a bit later. Still working out which ones will be revoked tomorrow (but it will be nearly all of them).

I've scheduled a training for right after CAB Forum in June to go line-by-line through the BRs, 5280, and the CAA RFC. Attendance by devs working on RA or CA systems (or things touching those systems) is mandatory. During that meeting we will assign each group of requirements to a dev as their responsibility. They will confirm that their part of the code is performing compliantly.

Going forward we are adding alerts and better unit tests to detect anomalies. The following are currently being prioritized:

  • Add unit tests for timeout talking to CAA service from the CA
  • Add a log whenever CAA mode is required and caa_status is fail (timeout case) and we are trying to save to the DB. This will detect if there's ever something wrong in the system communication. We'll know if the CAA record checking system ever stops working appropriately.
  • Add additional tests so that, just before signing any public SSL cert, we double-check that the CAA check passed. This will independently confirm the CAA record checking worked.
  • We're adding another check to ensure the CAA record was appropriately pulled. Basically, a second check to ensure the correct CAA record was checked. This wasn't a problem here, but it seems like a good test.

Jeremy: Comment #8 mentioned delay in posting the exact certs, and Comment #16 repeated it. Comment #8 was made on 2019-05-13 - have I missed an updated disclosure?

Comment #13 mentioned an in-process analysis, and Bug 1556948 noted another systemic issue. Is that analysis the post-June matter referred to in Comment #16? If so, what's the estimated timeline for completion - I think we only have the timeline for when it should start.

Comment #16 notes an additional CAA check. In light of Bug 1556948, it would be useful to understand if the check being proposed is in line with the suggestions from that bug on reducing incidents, or whether it's a different approach being done?

Flags: needinfo?(jeremy.rowley)

Sorry - yes. We uploaded the CSV file a bit ago but I see we never mentioned in the comments that the file was provided. The CSV has a complete list of the certs.

The two processes mentioned are the same. We have the meeting scheduled for the last week of June. We moved it to that week so we can capture anything that might come out of the face-to-face. During the June meeting, that process will be complete. I should have information that week on whether there are other misunderstandings.

The additional CAA check is one that came from the suggestion on reducing incidents. It'll basically confirm at the CA level that things are syncing appropriately. It monitors the CAA service to ensure it's operational and within the set parameters. We're not sure yet what those parameters should be, as that's an item we have for the last-week-of-June meeting. At a minimum, it'll detect whether the CAA check was successful, whether a record of the CAA check was stored, and, if not, the status of any retries.
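As an illustration only (hypothetical Go types; not code DigiCert has published), the reconciliation described above could look roughly like the following: for each issuance, confirm that a successful CAA check was recorded within the permitted window, and flag anything that was not.

```go
// Package reconcile sketches the CA-level reconciliation described above:
// for every certificate issued, confirm that a successful CAA check record
// exists within the permitted window, and flag anything that does not.
package reconcile

import "time"

// CAACheckRecord is what the (hypothetical) CAA service stores per check.
type CAACheckRecord struct {
	Domain    string
	CheckedAt time.Time
	Passed    bool
	Retries   int
}

// Issuance is a minimal record of an issued certificate.
type Issuance struct {
	Serial   string
	Domain   string
	IssuedAt time.Time
}

// Finding describes an issuance with no matching successful CAA check.
type Finding struct {
	Serial string
	Reason string
}

// Reconcile flags issuances that have no successful CAA check recorded within
// the lookback window before issuance (the BRs allow reuse within the greater
// of the record TTL or 8 hours; 8h is used here as a simplification).
func Reconcile(issued []Issuance, checks map[string][]CAACheckRecord) []Finding {
	const window = 8 * time.Hour
	var findings []Finding
	for _, iss := range issued {
		ok := false
		for _, rec := range checks[iss.Domain] {
			if rec.Passed && !rec.CheckedAt.After(iss.IssuedAt) &&
				iss.IssuedAt.Sub(rec.CheckedAt) <= window {
				ok = true
				break
			}
		}
		if !ok {
			findings = append(findings, Finding{Serial: iss.Serial, Reason: "no successful CAA check recorded within window"})
		}
	}
	return findings
}
```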

Flags: needinfo?(jeremy.rowley)

Similar to Bug 1556948, can you provide an update regarding the June meeting?

A list of actions was provided in Comment #16 - but no concrete timeline for those actions. I believe that's the missing piece right now?

Flags: needinfo?(brenda.bernal)

We did this meeting as well. We actually combined it with the meeting here (https://bugzilla.mozilla.org/show_bug.cgi?id=1556948).

I owe timelines for the following:

  1. Add unit tests for timeout talking to CAA service from the CA
  2. Add a log whenever CAA mode is required and caa_status is fail (timeout case) and we are trying to save to the DB. This will detect if there's ever something wrong in the system communication. We'll know if the CAA record checking system ever stops working appropriately.
  3. Add additional tests so that, just before signing any public SSL cert, we double-check that the CAA check passed. This will independently confirm the CAA record checking worked.
  4. We're adding another check to ensure the CAA record was appropriately pulled. Basically, a second check to ensure the correct CAA record was checked. This wasn't a problem here, but it seems like a good test.

Here's the timeline:

  1. Add unit tests for timeout talking to CAA service from the CA
  • Done
  2. Add a log whenever CAA mode is required and caa_status is fail (timeout case) and we are trying to save to the DB. This will detect if there's ever something wrong in the system communication. We'll know if the CAA record checking system ever stops working appropriately.
  • August
  3. Add additional tests so that, just before signing any public SSL cert, we double-check that the CAA check passed. This will independently confirm the CAA record checking worked.
  • August
  4. We're adding another check to ensure the CAA record was appropriately pulled. Basically, a second check to ensure the correct CAA record was checked. This wasn't a problem here, but it seems like a good test.
  • Sep/Oct

When you say August, are you saying August 1 or August 30? Similar for Sept/Oct.

Flags: needinfo?(jeremy.rowley)

#2 is mid Aug, #3 is end of Aug, #4 is end of Sep/beginning of Oct

Flags: needinfo?(jeremy.rowley)

Update: We are on track with completing the checks described above for #2 and #3.

Flags: needinfo?(brenda.bernal)

Both #2 and #3 are in the current sprint and are pending QA. Assuming nothing weird is found in QA, both parts should deploy Wed. We then have another project that will come before part 4. Part 4 is tracking on time for late Sept still. Do you want me to post every two weeks until we get there or should I wait until we start that part of the project?

Please confirm when #2 and #3 are deployed. Then I will plan to set Next Update to 1-Oct.

#2 and #3 were deployed yesterday.

Whiteboard: [ca-compliance] → [ca-compliance] - Next Update - 01-October 2019

Just an update on this one - the CA completed the code for the service portion of part #4 of this project and is waiting for it to be integrated by the domain consolidation project. This may take longer than Oct 1 (since that project completes at the end of Oct). However, I will still post an update on Oct 1 on how the integration is going.

It is October 1. The redundant checks are in place to ensure CAA is always checked and to detect any system outage. The APIs for this service are done and waiting for integration by the consolidation team for pre-checks during validation. Note that everything is working smoothly; this API integration is an additional pre-check for the customer, not something required under the BRs. All pre-issuance checks and hard blocks for system failures are in place and working as intended.

Jeremy: thanks for the update. I'd like to keep this open and request updates until #4 is completed.

Whiteboard: [ca-compliance] - Next Update - 01-October 2019 → [ca-compliance]

This is being rolled out starting today and continuing until Nov 4. At that time, this will be complete assuming nothing catastrophic happens during roll-out. I'll post on Monday to confirm completion and provide any insights we gain during the roll-out process.

I posted on the other thread that this rolled out. I forgot to update here. I think this bug can be closed.

If there are no further questions, we request that this bug be closed.

Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Product: NSS → CA Program
Summary: Digicert: CAA Checking Issue → DigiCert: CAA Checking Issue
Whiteboard: [ca-compliance] → [ca-compliance] [dv-misissuance] [ov-misissuance] [ev-misissuance]