Open Bug 1355903 Opened 7 years ago Updated 6 months ago

Re-enable Certificate Transparency telemetry collection

Categories

(Core :: Security: PSM, defect, P3)

defect

Tracking

()

People

(Reporter: jcj, Unassigned)

References

(Depends on 2 open bugs, Blocks 1 open bug)

Details

(Whiteboard: [psm-blocked])

Bug 1353216 shut off certificate transparency telemetry collection due to a performance regression. This bug should turn it back on after performance is improved.
Depends on: 1355993
sorry for the drive-by partially informed comment - but given that this is just for reporting is this something that can be done asynchronously queued and processed at 'idle' time? (its been a while, but iirc firefox has an event for that) so the math doesn't impact responsiveness? or is there too much state tied up in the connection to practically copy/addref for queueing?
(In reply to Patrick McManus [:mcmanus] from comment #1)
> sorry for the drive-by partially informed comment - but given that this is
> just for reporting is this something that can be done asynchronously queued
> and processed at 'idle' time? (its been a while, but iirc firefox has an
> event for that) so the math doesn't impact responsiveness? or is there too
> much state tied up in the connection to practically copy/addref for queueing?

It's a good idea, but an eventual goal is to have CT enforcement, too -- which would need to be synchronous to be effective.
Keeler and I discussed more what is needed to re-enable our CT telemetry collection:

== Better Tests ==

Before we get to performance testing, I think we need more significant tests in-tree, as specified in bug 1361177.

== Speedups ==

It seems that most of the performance regression was due to unnecessary cert chain validations, fixed in Bug 1356611. The unnecessary validations were made much worse by the slower-than-desired verification algorithm; Keeler filed Bug 1355993 to address speeding that up. It should be possible to get some performance increase, but given the magnitude of the fix from Bug 1356611, we _might not need_ the algorithm speedup to proceed.

We'd optimally like to see a Talos test for HTTPS that could show a regression during CT (Bug 1356073), but perhaps we can proceed with a plan that checks telemetry instead:

== Speedup Test Plan ==

When we disabled CT, the SSL_SUCCESFUL_CERT_VALIDATION_TIME_MOZILLAPKIX telemetry metric dropped precipitously [1] (the drop corresponds to the disable date) [2]. In-code, this metric is firmly tied to the CT performance.

I think it's OK to move on without the Talos test and verify whether our performance is OK using Nightly telemetry. If slows down more than a little, we'd want to wait on Bug 1355993 to proceed.

== Recommendation Summary ==

1. We need the tests in bug 1361177.
2. Then we can re-enable using this bug and monitor SSL_SUCCESFUL_CERT_VALIDATION_TIME_MOZILLAPKIX.
2.a. If performance slows down by an unacceptable amount, we back-out the enable commit.
2.b. If we back-out, then we block on Bug 1355993 to re-try.

[1] https://i.have.insufficient.coffee/ct-validation-time-fix.png
[2] https://mzl.la/2qKALHN
Depends on: 1361177
After further discussion, I am amending my recommendation that we only re-enable after we also have Talos tests, and marking this bug as being blocked on Bug 1356073:

== Recommendation Summary (Revised) ==

1. We need the tests in bug 1361177.
2. We also need the Talos test in bug 1356073 to monitor for HTTPS performance regressions.
3. Then we can re-enable using this bug and monitor SSL_SUCCESFUL_CERT_VALIDATION_TIME_MOZILLAPKIX in Nightly.
4.a. If performance slows down by an unacceptable amount, we back-out the enable commit.
4.b. If we back-out, then we block on Bug 1355993 before re-trying.
Depends on: 1356073
Priority: -- → P3
Whiteboard: [psm-blocked]
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.