Fenix TLS connection times appear to have increased around March 19, 2024
Categories
(Core :: Networking, defect, P2)
Tracking
()
Performance Impact | high |
Version | Tracking | Status |
---|---|---|
firefox-esr115 | --- | unaffected |
firefox127 | --- | wontfix |
firefox128 | --- | fix-optional |
firefox129 | --- | fix-optional |
People
(Reporter: acreskey, Unassigned)
References
(Blocks 1 open bug, Regression)
Details
(Keywords: regression, Whiteboard: [necko-triaged][necko-priority-review])
Attachments
(2 files)
Visible under the http channel page tls handshake query on this dashboard. (Easier to see if you hide the other percentiles.)
Appears to have landed in Nightly just after 03/18/2024.
It seems to primarily affect the lower percentiles, specifically P01 and P05.
Corresponds to this probe: https://dictionary.telemetry.mozilla.org/apps/fenix/metrics/network_tls_handshake.
I see it in Release after 05/19/2024, I believe Fx 126.
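To make the "lower percentiles only" observation concrete, here is a minimal sketch of how one might compare handshake-time percentiles between two builds. The sample values are made up for illustration, and the nearest-rank percentile is an assumption; the telemetry dashboard may compute percentiles differently.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def compare_percentiles(before, after, points=(1, 5, 50, 95)):
    """Return {percentile: (before_ms, after_ms, relative_change)}."""
    result = {}
    for p in points:
        b = percentile(before, p)
        a = percentile(after, p)
        result[p] = (b, a, (a - b) / b)
    return result

# Hypothetical handshake durations in ms: only the fastest handshakes get slower.
before = list(range(1, 101))                       # 1..100 ms
after = [v + 4 if v <= 10 else v for v in before]  # +4 ms on the 10 fastest samples

for p, (b, a, change) in compare_percentiles(before, after).items():
    print(f"P{p:02d}: {b} ms -> {a} ms ({change:+.0%})")
```

With this made-up data, P01 and P05 move while P50 and P95 stay put, which is the shape of regression described above.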
Comment 1•5 months ago
Very likely due to bug 1356686 (OMT decompression) which landed the 19th, and was in nightly the 20th. Not doing this synchronously could mean that a completion runnable might occur at points where TLS typically didn't have anything contending with it, perhaps.
Is there any similar slowdown on desktop?
Reporter | ||
Comment 2•5 months ago
We ruled out bug 1886042 based on merge dates.
However, the root cause may be difficult to trace by pushlog, since it appears that the Fenix codebase was merged into m-c on the date in question, March 18, 2024.
And there are hundreds of geckoview changes whose origin I'm unsure of:
https://hg.mozilla.org/mozilla-central/pushloghtml?startdate=2024%2F03%2F17&enddate=2024%2F03%2F18
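For anyone wanting to widen or shift that window, a small sketch of how the same pushlog URL can be built for other date ranges. It only reproduces the `startdate`/`enddate` query parameters visible in the link above; the helper name `pushlog_url` is hypothetical, and no other pushlog parameters are assumed.

```python
from urllib.parse import urlencode

PUSHLOG_BASE = "https://hg.mozilla.org/mozilla-central/pushloghtml"

def pushlog_url(start, end):
    """Build a pushlog URL for a date range given as 'YYYY/MM/DD' strings."""
    # urlencode percent-encodes the slashes, matching the link above.
    return PUSHLOG_BASE + "?" + urlencode({"startdate": start, "enddate": end})

url = pushlog_url("2024/03/17", "2024/03/18")
print(url)
```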
Reporter | ||
Comment 3•5 months ago
(In reply to Randell Jesup [:jesup] (needinfo me) from comment #1)
> Very likely due to bug 1356686 (OMT decompression) which landed the 19th, and was in nightly the 20th. Not doing this synchronously could mean that a completion runnable might occur at points where TLS typically didn't have anything contending with it, perhaps.
> Is there any similar slowdown on desktop?
Based on this query (again, hide upper percentiles; screenshot attached), we first saw the spike in TLS connection times in Fenix on March 19th, a day before the OMT change reached Nightly, so I think that rules out OMT.
I don't see this particular regression on desktop (the query is a bit less flexible, but I don't see the spike).
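The date-based exclusion used here can be written down as a tiny check. This is only a sketch: it assumes the date on the telemetry query corresponds to the build date, and the helper name is hypothetical. The two dates are the ones quoted in this bug.

```python
from datetime import date

def ruled_out_by_date(first_nightly, first_seen):
    """A change cannot explain a regression that was seen before it shipped in Nightly."""
    return first_nightly > first_seen

# First spike in the Fenix query, per this comment.
first_seen = date(2024, 3, 19)

# Candidate -> first Nightly containing it, per comment 1.
candidates = {"bug 1356686 (OMT decompression)": date(2024, 3, 20)}

excluded = [name for name, d in candidates.items() if ruled_out_by_date(d, first_seen)]
print(excluded)
```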
Reporter | ||
Comment 4•5 months ago
Reporter | ||
Comment 5•5 months ago
Dennis, we noticed that the lower percentiles of the TLS handshake time, specifically P01 and P05, increased in Fenix.
This is just a guess, but is it possible that something changed with 0-RTT and Fenix around March 19, 2024, in Nightly?
Comment 6•5 months ago
I can't think of anything, especially not if it's Fenix-specific. The relevant NSS release looks fairly minor (notes), and I don't see any interesting-looking commits to security/manager around that time.
Comment 7•5 months ago
The Performance Impact Calculator has determined this bug's performance impact to be high. If you'd like to request re-triage, you can reset the Performance Impact flag to "?" or needinfo the triage sheriff.
Platforms: Android
Page load impact: Some
Websites affected: Major
Reporter | ||
Comment 8•5 months ago
A good next step would be to see if the difference can be reproduced locally.
Comment 9•5 months ago
The severity field for this bug is set to S3. However, the Performance Impact field flags this bug as having a high impact on the performance.
:jesup, could you consider increasing the severity of this performance-impacting bug? Alternatively, if you think the performance impact is lower than previously assessed, could you request a re-triage from the performance team by setting the Performance Impact flag to "?"?
For more information, please visit BugBot documentation.
Comment 10•5 months ago
Adding the priority-review flag, because we still have no idea what caused this change.
We still need more information.
Comment 12•3 months ago
I was looking at Glean to see if the change is also visible on desktop, and found another regression on Nightly:
It went up on the 20th of Feb, possibly due to Kyber or neqo 0.7.1 (regression range).
TLS timings went back down after the NSS upgrade?
As far as I can tell, the P01 and P05 regression is only present on Fenix, not Desktop.
Reporter | ||
Comment 13•3 months ago
(In reply to Valentin Gosu [:valentin] (he/him) from comment #12)
> I was looking at glean to see if change is also visible on desktop, and found another regression on nightly:
> Went up on 20th of feb - possibly due to kyber, or neqo 0.7.1 regression range
> TLS timings went back down after NSS upgrade ?
Agreed. From a discussion with jschanck, the Feb 20th desktop regression in TLS is due to Kyber, and it was fixed for most percentiles by bug 1893029 in the NSS upgrade: https://phabricator.services.mozilla.com/D209058
It's still not clear what could have caused the Fenix-only regression in P01 and P05.
Comment 14•3 months ago
(In reply to Kershaw Chang [:kershaw] from comment #11)
> I'll look at neqo changes.
Unfortunately, this can't be caused by HTTP/3 connections, due to bug 1678312.
In our current code, secureConnectionStart is only set when the status is NS_NET_STATUS_TLS_HANDSHAKE_STARTING in HttpConnectionUDP::SetEvent, which is only called in Http3Session::OnTransportStatus here.
However, HttpConnectionUDP::SetEvent is never called with NS_NET_STATUS_TLS_HANDSHAKE_STARTING, because that condition never seems to be met.
I'll put bug 1678312 in the priority queue and try to fix it soon.
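To illustrate the control flow described above, here is a toy model in Python. It is not the Gecko code: the class, the `Status` enum, and the event sequence are hypothetical stand-ins for `HttpConnectionUDP`, the `NS_NET_STATUS_*` codes, and `Http3Session::OnTransportStatus`. It only shows why the timestamp stays unset when the one status that sets it is never delivered.

```python
from enum import Enum, auto

class Status(Enum):
    # Hypothetical stand-ins for the NS_NET_STATUS_* codes.
    CONNECTING_TO = auto()
    TLS_HANDSHAKE_STARTING = auto()
    CONNECTED_TO = auto()

class ToyUdpConnection:
    """Toy stand-in for HttpConnectionUDP; not the real class."""

    def __init__(self):
        self.secure_connection_start = None

    def set_event(self, status, now_ms):
        # The timestamp is recorded for this one status only.
        if status is Status.TLS_HANDSHAKE_STARTING:
            self.secure_connection_start = now_ms

def forward_statuses(conn, statuses):
    """Toy stand-in for the session forwarding transport statuses to the connection."""
    for now_ms, status in enumerate(statuses):
        conn.set_event(status, now_ms)

# The situation described above: the handshake-starting status never arrives.
conn = ToyUdpConnection()
forward_statuses(conn, [Status.CONNECTING_TO, Status.CONNECTED_TO])
print(conn.secure_connection_start)  # stays unset, so no handshake time can be derived
```

In this model, no HTTP/3 connection would ever contribute a handshake-start timestamp, which is the basis for excluding HTTP/3 as the source of the change.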
Comment 15•3 months ago
We haven't been able to track down a regressing patch or cause. Given that this is a small regression in the 1% and 5% percentiles (P01 and P05) on Android only, it may well be an issue with OS core scheduling or other background processing. Anything across the system can impact core scheduling decisions, especially on modern Android CPUs, and you'd expect that to show up preferentially in the 1%/5% measurements. Given all this, and the fact that we don't think this is going to affect user-level performance or any other important metric, closing as WONTFIX. If further info makes this actionable, great.
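A rough arithmetic sketch of why a small fixed delay is most visible at P01/P05. The baseline values and the 2 ms delay are made up, and the sketch assumes the delay is added uniformly to every handshake, which is a simplification of real scheduling behaviour.

```python
def relative_increase(baseline_ms, added_ms):
    """Fractional slowdown caused by adding a fixed delay to a baseline duration."""
    return added_ms / baseline_ms

# Hypothetical baseline handshake times per percentile, in ms.
baselines = {"P01": 10.0, "P05": 20.0, "P50": 100.0, "P95": 500.0}
added_ms = 2.0  # hypothetical scheduling delay

impact = {name: relative_increase(ms, added_ms) for name, ms in baselines.items()}
for name, frac in impact.items():
    print(f"{name}: +{frac:.1%}")
```

Under these assumptions the same 2 ms is a 20% change at P01 but well under 1% at P95, so it would stand out in the low percentiles and disappear in the noise elsewhere.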