Closed Bug 1901604 Opened 5 months ago Closed 3 months ago

Fenix TLS connection times appear to have increased around March 19, 2024

Categories

(Core :: Networking, defect, P2)

Unspecified
Android
defect

Tracking

()

RESOLVED WONTFIX
Performance Impact high
Tracking Status
firefox-esr115 --- unaffected
firefox127 --- wontfix
firefox128 --- fix-optional
firefox129 --- fix-optional

People

(Reporter: acreskey, Unassigned)

References

(Blocks 1 open bug, Regression)

Details

(Keywords: regression, Whiteboard: [necko-triaged][necko-priority-review])

Attachments

(2 files)

Visible under the http channel page tls handshake query on this dashboard. (It is easier to see if you hide the other percentiles.)

Appears to have landed in Nightly just after 03/18/2024.
It seems to primarily affect the lower percentiles, specifically P01 and P05.

Corresponds to this probe: https://dictionary.telemetry.mozilla.org/apps/fenix/metrics/network_tls_handshake.
I see it in Release after 05/19/2024, which I believe is Fx 126.

Very likely due to bug 1356686 (OMT decompression) which landed the 19th, and was in nightly the 20th. Not doing this synchronously could mean that a completion runnable might occur at points where TLS typically didn't have anything contending with it, perhaps.
Is there any similar slowdown on desktop?

Flags: needinfo?(acreskey)
Keywords: regression
Regressed by: 1356686

We ruled out bug 1886042 based on merge dates.

However the root cause may be difficult to trace by pushlog since it appears that the Fenix codebase was merged into m-c on the date in question, March 18, 2024.
And there are hundreds of geckoview changes whose origin I'm unsure of:
https://hg.mozilla.org/mozilla-central/pushloghtml?startdate=2024%2F03%2F17&enddate=2024%2F03%2F18

(In reply to Randell Jesup [:jesup] (needinfo me) from comment #1)

Very likely due to bug 1356686 (OMT decompression) which landed the 19th, and was in nightly the 20th. Not doing this synchronously could mean that a completion runnable might occur at points where TLS typically didn't have anything contending with it, perhaps.
Is there any similar slowdown on desktop?

Based on this query (again, hide the upper percentiles; screenshot attached), we first saw the spike in TLS connection times in Fenix on March 19th, a day before OMT decompression reached Nightly, so I think that rules out OMT.

I don't see this particular regression on desktop (the query is a bit less flexible, but I don't see the spike).

Flags: needinfo?(acreskey)

Dennis, we noticed that the lower percentiles, specifically P01 and P05, of tls handshake increased in Fenix.
This is just a guess, but is it possible that something changed with 0-rtt and Fenix around March 19, 2024, in nightly?

Flags: needinfo?(djackson)

I can't think of anything, especially not if it's Fenix-specific. The relevant NSS release looks fairly minor (notes), and I don't see any interesting-looking commits to security/manager around that time.

Flags: needinfo?(djackson)

The Performance Impact Calculator has determined this bug's performance impact to be high. If you'd like to request re-triage, you can reset the Performance Impact flag to "?" or needinfo the triage sheriff.

Platforms: Android
Page load impact: Some
Websites affected: Major

Performance Impact: --- → high
Component: Performance → Networking

A good next step would be to see if the difference can be reproduced locally.

Severity: -- → S3
Priority: -- → P2
Whiteboard: [necko-triaged][necko-priority-new]

The severity field for this bug is set to S3. However, the Performance Impact field flags this bug as having a high impact on the performance.
:jesup, could you consider increasing the severity of this performance-impacting bug? Alternatively, if you think the performance impact is lower than previously assessed, could you request a re-triage from the performance team by setting the Performance Impact flag to "?"?

For more information, please visit BugBot documentation.

Flags: needinfo?(rjesup)

Adding the priority-review flag, because we still have no idea what caused this change.
We still need more information.

Flags: needinfo?(rjesup)
Whiteboard: [necko-triaged][necko-priority-new] → [necko-triaged][necko-priority-review]
Flags: needinfo?(smayya)
Flags: needinfo?(valentin.gosu)

I'll look at neqo changes.

Flags: needinfo?(kershaw)

I was looking at glean to see if the change is also visible on desktop, and found another regression on nightly:
Went up on the 20th of Feb, possibly due to kyber or neqo 0.7.1 (regression range)
TLS timings went back down after the NSS upgrade?

As far as I can tell, the P01 and P05 regression is only present on Fenix, not Desktop.

Flags: needinfo?(valentin.gosu)
Flags: needinfo?(smayya)

(In reply to Valentin Gosu [:valentin] (he/him) from comment #12)

I was looking at glean to see if the change is also visible on desktop, and found another regression on nightly:
Went up on the 20th of Feb, possibly due to kyber or neqo 0.7.1 (regression range)
TLS timings went back down after the NSS upgrade?

Agreed. From a discussion with jschanck, the Feb 20th desktop regression in TLS is due to kyber.
It was fixed for most percentiles with bug 1893029 in the NSS upgrade: https://phabricator.services.mozilla.com/D209058

Still not clear what could have caused the Fenix-only regression in P01 and P05.

(In reply to Kershaw Chang [:kershaw] from comment #11)

I'll look at neqo changes.

Unfortunately, this can't be caused by HTTP/3 connections, due to bug 1678312.
In our current code, secureConnectionStart is only set when the status is NS_NET_STATUS_TLS_HANDSHAKE_STARTING in HttpConnectionUDP::SetEvent, which is only called in Http3Session::OnTransportStatus here.
However, HttpConnectionUDP::SetEvent is never called with NS_NET_STATUS_TLS_HANDSHAKE_STARTING, because this condition never seems to be met.
I'll put bug 1678312 in the priority queue and try to fix it soon.
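
To illustrate the reasoning in the comment above, here is a rough toy model in Python. It is not the Gecko code: the class, the status constants, and the assumption that a handshake sample needs both a start and an end timestamp are hypothetical stand-ins. It only shows why a connection that never receives the "handshake starting" status would contribute nothing to a handshake-duration metric.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-ins for the NS_NET_STATUS_* values mentioned above.
TLS_HANDSHAKE_STARTING = "tls_handshake_starting"
TLS_HANDSHAKE_ENDED = "tls_handshake_ended"
CONNECTED_TO = "connected_to"


@dataclass
class ToyUdpConnection:
    """Toy model of a connection that records handshake timestamps."""

    secure_connection_start: Optional[float] = None
    handshake_end: Optional[float] = None

    def set_event(self, status: str, now_ms: float) -> None:
        # The start timestamp is recorded only for the STARTING status.
        if status == TLS_HANDSHAKE_STARTING:
            self.secure_connection_start = now_ms
        elif status == TLS_HANDSHAKE_ENDED:
            self.handshake_end = now_ms

    def handshake_sample(self) -> Optional[float]:
        # Assumption: no sample is reported unless both timestamps exist.
        if self.secure_connection_start is None or self.handshake_end is None:
            return None
        return self.handshake_end - self.secure_connection_start


# A connection that never sees the STARTING status yields no sample.
h3_like = ToyUdpConnection()
h3_like.set_event(CONNECTED_TO, 10.0)
h3_like.set_event(TLS_HANDSHAKE_ENDED, 35.0)

# A connection that sees both statuses yields a duration.
tcp_like = ToyUdpConnection()
tcp_like.set_event(TLS_HANDSHAKE_STARTING, 10.0)
tcp_like.set_event(TLS_HANDSHAKE_ENDED, 35.0)
```

Under these assumptions, `h3_like.handshake_sample()` stays empty while `tcp_like.handshake_sample()` is a duration in milliseconds, which matches the argument that HTTP/3 connections could not have moved the probe.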

Flags: needinfo?(kershaw)

We haven't been able to track down a regressing patch or cause. Given that this is a small regression in the 1st and 5th percentiles on Android only, it may well be an issue with OS core scheduling or other background processing. Anything across the system can impact core scheduling decisions, especially on modern Android CPUs, and you'd expect that to preferentially show up in the 1%/5% measurements. Given all this, and the fact that we don't think this is going to affect user-level performance or any other important metric, closing as WONTFIX. If further info makes this actionable, great.
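
As a rough illustration of why a small delay would stand out in P01 and P05 first, the sketch below adds a fixed overhead to every sample and compares the relative change per percentile. The sample values, the 5 ms overhead, and the helper names are made up for the example; real scheduling delay would not be constant, and this is not how the telemetry pipeline computes its percentiles.

```python
def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, clamped to at least rank 1.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def relative_shift(samples, overhead_ms, p):
    """Fractional change of percentile p after adding overhead_ms to every sample."""
    before = percentile(samples, p)
    after = percentile([s + overhead_ms for s in samples], p)
    return (after - before) / before


# Hypothetical handshake durations in ms: 10, 20, ..., 1000.
samples = [10 * i for i in range(1, 101)]

# Hypothetical constant overhead of 5 ms on every handshake.
shifts = {p: relative_shift(samples, 5, p) for p in (1, 5, 50, 95)}
```

With these made-up numbers the same 5 ms is a 50% increase at P01 and a 10% increase at P05, but only about 0.5% at P95, so a small system-wide delay is far easier to see in the lowest percentiles.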

Status: NEW → RESOLVED
Closed: 3 months ago
Resolution: --- → WONTFIX