Closed Bug 1874608 - Opened 1 year ago, Closed 1 year ago

Possible regression in transaction wait times, HTTP/2, HTTP/3

Categories

(Core :: Networking, defect, P2)

Tracking

RESOLVED WORKSFORME

People

(Reporter: acreskey, Assigned: acreskey)

References

(Blocks 2 open bugs)

Details

(Whiteboard: [necko-triaged][necko-monitor])

Looking at the probes for transaction_wait_time_http3 and transaction_wait_time_spdy, I'm seeing a big jump in delays affecting the 95th percentile and above (and also the 75th percentile for H3).

Starting 2023-12-15 and continuing onwards.

Blocks: necko-perf
Assignee: nobody → acreskey

I can reproduce some pretty long transaction wait times locally, so I'll see if I can spot a difference before and after 2023-12-15.

In the 12/14-12/15 timeframe for m-c, the only patch I can see which could affect networking is this:

changeset: 690348:5c9485e9d232
user: Benjamin Beurdouche <beurdouche@mozilla.com>
date: Fri Dec 15 09:28:06 2023 +0000
files: security/nss/TAG-INFO security/nss/automation/abi-check/previous-nss-release security/nss/automation/taskcluster/graph/src/extend.js sec>
description: Bug 1868774 - land NSS_3_96_RTM, UPGRADE_NSS_RELEASE, r=nkulatova

Flags: needinfo?(bbeurdouche)

From local testing, the regression does appear to be from that patch:

o  changeset:   689867:5c9485e9d232
|  user:        Benjamin Beurdouche <beurdouche@mozilla.com>
|  date:        Fri Dec 15 09:28:06 2023 +0000
|  summary:     Bug 1868774 - land NSS_3_96_RTM, UPGRADE_NSS_RELEASE, r=nkulatova

I'm logging transaction_wait_time_http3 and transaction_wait_time_spdy before and after that commit.

Before Stats:
count    5481.000000
mean        5.028462
std        36.459235
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max       840.000000
Name: Values, dtype: float64

After Stats:
count    5481.000000
mean       19.744390
std       108.046964
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max      1414.000000
Name: Values, dtype: float64

Median Difference: 0.0
Mean Difference: 14.71592775041051
Interquartile Range Difference: 0.0
50th Percentile - Before: 0.0, After: 0.0
80th Percentile - Before: 0.0, After: 0.0
85th Percentile - Before: 0.0, After: 0.0
90th Percentile - Before: 0.0, After: 0.0
95th Percentile - Before: 0.0, After: 109.0
99th Percentile - Before: 129.99999999999727, After: 677.0
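
For reference, here is a minimal sketch of how a before/after comparison like the one above could be computed, assuming the logged wait times were dumped to two hypothetical CSV files with a "Values" column (the file names are illustrative, not from this bug):

# Hypothetical file names; each CSV holds one probe sample per row in "Values".
import pandas as pd

before = pd.read_csv("wait_times_before.csv")["Values"]
after = pd.read_csv("wait_times_after.csv")["Values"]

print("Before Stats:\n", before.describe())
print("After Stats:\n", after.describe())

print("Median Difference:", after.median() - before.median())
print("Mean Difference:", after.mean() - before.mean())

def iqr(s):
    # Interquartile range: 75th percentile minus 25th percentile.
    return s.quantile(0.75) - s.quantile(0.25)

print("Interquartile Range Difference:", iqr(after) - iqr(before))

for q in (0.50, 0.80, 0.85, 0.90, 0.95, 0.99):
    print(f"{int(q * 100)}th Percentile - "
          f"Before: {before.quantile(q)}, After: {after.quantile(q)}")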

I re-ran this test (a new profile, with every item on the new tab page opened in quick succession).

Before Stats:
count    5562.000000
mean       18.748831
std        89.309803
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max      1288.000000
Name: Values, dtype: float64

After Stats:
count    5562.000000
mean       22.008450
std       113.492866
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max      1439.000000
Name: Values, dtype: float64

Median Difference: 0.0
Mean Difference: 3.259618842143114
Interquartile Range Difference: 0.0
50th Percentile - Before: 0.0, After: 0.0
80th Percentile - Before: 0.0, After: 0.0
85th Percentile - Before: 0.0, After: 0.0
90th Percentile - Before: 6.900000000000546, After: 7.0
95th Percentile - Before: 104.0, After: 103.0
99th Percentile - Before: 512.0200000000059, After: 753.3900000000003

The results are less conclusive, but the "After" scenario is a bit slower in the mean and at the 99th percentile.

Here's a profile in which I've added a temporary marker that logs transaction_wait_time_http3 or transaction_wait_time_spdy whenever either exceeds 30ms.
You can see many transactions taking over one second here, e.g. nsHttpConnectionMgr::DispatchTransaction TRANSACTION_WAIT_TIME_SPDY: 1071
https://share.firefox.dev/4aTD9id

Thanks Andrew, I have a slight suspicion about https://hg.mozilla.org/projects/nss/rev/1bda168c0da97e19e5f14bc4227c15c0a9f493bf, which looks like the only suspect among the NSS changes you pointed to.

Hi,
@Andrew, could you apply only the patch https://hg.mozilla.org/projects/nss/rev/1bda168c0da97e19e5f14bc4227c15c0a9f493bf and re-run the tests?

Flags: needinfo?(acreskey)

Will do, Anna. Let me just find a more deterministic way of testing this.

Thank you!

I'm having some problems with the reproducibility of the results, but I'm still on it.

Hmm, while the impact is very clear in telemetry (P75 onwards for H3, P95 onwards for H2), I can't say that I've been able to reliably pinpoint this regression locally.
Any workflow seems to lead to a large number of long wait times, but not consistently for any particular revision.

I can try more tomorrow, but if you have ideas on scenarios where this code might be best stressed, let me know.

Trying to reproduce via a few means:

I've added internal performance metrics (PerfStats) that report the same values as the probes transaction_wait_time_http3 and transaction_wait_time_spdy.
With these in place, I've pushed tests to try to see if we can reliably reproduce the regression.
Taskcluster is extremely slow at the moment, I imagine because of the release cycle, so the tests are still running.
I have variants that run on the recorded sites as well as on live sites.
We will see.

I'm also trying to find the root cause of the long transaction pending times we see locally.

Something unexpected -- whatever change caused this would be in Fx 122, as we can see here.

However, I don't see this in the beta channel:
transaction_wait_time_spdy
transaction_wait_time_http3.

We should be able to see if there is an impact in release shortly.

Unfortunately I have yet to find a reproducible way to capture this regression in CI.

The new performance metrics that measure transaction_wait_time_spdy and transaction_wait_time_http3 haven't picked up a difference in the regression range on recorded sites.

On live sites there is a wide spread in transaction_wait_time for HTTP/2.
Comparing the start of the regression range to the end shows a significant regression, but that is just noise, because we see similar results when comparing the same hg revision against itself (e.g. here).
So there must be a live-network component to transaction wait time, because in CI I'm seeing some transactions come in at 4+ seconds.
(profiles coming)
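
To make that noise check concrete, here is an illustrative sketch (not the actual CI tooling): estimate a noise floor from two runs of the same hg revision, then see whether the cross-revision delta exceeds it. The samples below are simulated purely for illustration.

import numpy as np

def p99_delta(a, b):
    # Difference in the 99th-percentile wait time (ms) between two samples.
    return float(np.percentile(b, 99) - np.percentile(a, 99))

# Simulated wait-time samples; in practice these come from the CI artifacts.
rng = np.random.default_rng(0)
same_rev_run1 = rng.exponential(20, 5000)
same_rev_run2 = rng.exponential(20, 5000)
range_start = rng.exponential(20, 5000)
range_end = rng.exponential(22, 5000)

# If the same-revision delta is as large as the cross-revision delta,
# the apparent regression is indistinguishable from noise.
noise_floor = abs(p99_delta(same_rev_run1, same_rev_run2))
candidate = abs(p99_delta(range_start, range_end))
print("noise floor:", noise_floor, "candidate delta:", candidate)
print("likely real" if candidate > noise_floor else "within noise")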

I have a few more options in CI, but it might help to have a better understanding of what leads to the long transaction_wait_times.
(The wait time is recorded in nsHttpConnectionMgr::DispatchTransaction.)

Here's one profile, taken locally, where we wait about 365ms (see the new marker, TRANSACTION_WAIT_TIME_SPDY):
https://share.firefox.dev/42cXlb9
We can see that there is a bit of NSS work going on via the flamegraph, but most of the time is spent polling.

(In reply to Anna Weine from comment #7)

Hi,
@Andrew, could you apply only the patch https://hg.mozilla.org/projects/nss/rev/1bda168c0da97e19e5f14bc4227c15c0a9f493bf and re-run the tests?

Dennis had the idea of testing how often this code is hit via a hard assert.
Locally I'm not seeing it hit at all, although I've pushed to try to confirm.
But it doesn't look to be the cause.

Ah, this is interesting.
I added callstacks to the new profiler markers to find out where the transactions with longer transaction wait times are coming from.
From a local profile:
https://share.firefox.dev/3SjjsIq
If you expand the Name column you will see how many milliseconds the transaction was waiting for:
e.g. nsHttpConnectionMgr::DispatchTransaction TRANSACTION_WAIT_TIME_SPDY: 1409 - private.canadianshield.cira.ca

The long waits are almost always to the DoH provider, often through the HTTPSRecordResolver::OnLookupComplete codepath.

(root) []
...
nsThread::ProcessNextEvent(bool, bool*) [/Users/acreskey/dev/debug/xpcom/threads/nsThread.cpp]
{virtual override thunk({offset(-32)}, mozilla::net::nsSocketTransportService::Run())} [/Users/acreskey/dev/debug/netwerk/base/nsSocketTransportService2.cpp]
mozilla::net::nsSocketTransportService::Run() [/Users/acreskey/dev/debug/netwerk/base/nsSocketTransportService2.cpp]
NS_ProcessNextEvent(nsIThread*, bool) [/Users/acreskey/dev/debug/xpcom/threads/nsThreadUtils.cpp]
nsThread::ProcessNextEvent(bool, bool*) [/Users/acreskey/dev/debug/xpcom/threads/nsThread.cpp]
mozilla::detail::RunnableFunction<mozilla::net::DNSListenerProxy::OnLookupComplete(nsICancelable*, nsIDNSRecord*, nsresult)::$_0>::Run() [/Users/acreskey/dev/debug/obj-ff/dist/include/nsThreadUtils.h]
mozilla::net::DNSListenerProxy::OnLookupComplete(nsICancelable*, nsIDNSRecord*, nsresult)::$_0::operator()() const [/Users/acreskey/dev/debug/netwerk/dns/DNSListenerProxy.cpp]
mozilla::net::HTTPSRecordResolver::OnLookupComplete(nsICancelable*, nsIDNSRecord*, nsresult) [/Users/acreskey/dev/debug/netwerk/protocol/http/HTTPSRecordResolver.cpp]
mozilla::net::SpeculativeTransaction::OnHTTPSRRAvailable(nsIDNSHTTPSSVCRecord*, nsISVCBRecord*) [/Users/acreskey/dev/debug/netwerk/protocol/http/SpeculativeTransaction.cpp]
mozilla::net::nsHttpConnectionMgr::DoSpeculativeConnection(mozilla::net::SpeculativeTransaction*, bool) [/Users/acreskey/dev/debug/netwerk/protocol/http/nsHttpConnectionMgr.cpp]
mozilla::net::nsHttpConnectionMgr::GetOrCreateConnectionEntry(mozilla::net::nsHttpConnectionInfo*, bool, bool, bool, bool*, bool*) [/Users/acreskey/dev/debug/netwerk/protocol/http/nsHttpConnectionMgr.cpp]
mozilla::net::ConnectionEntry::AvailableForDispatchNow() [/Users/acreskey/dev/debug/netwerk/protocol/http/ConnectionEntry.cpp]
mozilla::net::nsHttpConnectionMgr::GetH2orH3ActiveConn(mozilla::net::ConnectionEntry*, bool, bool) [/Users/acreskey/dev/debug/netwerk/protocol/http/nsHttpConnectionMgr.cpp]
mozilla::net::nsHttpConnectionMgr::FindCoalescableConnection(mozilla::net::ConnectionEntry*, bool, bool, bool) [/Users/acreskey/dev/debug/netwerk/protocol/http/nsHttpConnectionMgr.cpp]
mozilla::net::BuildOriginFrameHashKey(nsTSubstring<char>&, mozilla::net::nsHttpConnectionInfo*, nsTSubstring<char> const&, int) [/Users/acreskey/dev/debug/netwerk/protocol/http/nsHttpConnectionMgr.cpp]
nsTSubstring<char>::AppendLiteral<4>(char const (&) [4]) [/Users/acreskey/dev/debug/obj-ff/dist/include/nsTSubstring.h]
nsTSubstring<char>::Append(char const*, unsigned long) [/Users/acreskey/dev/debug/xpcom/string/nsTSubstring.cpp]

So I think a delay in the TRR response time could cause this; however, I'm not seeing that for
private.canadianshield.cira.ca in this range.
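
As a side note, here is a small sketch of how the marker names could be aggregated by host to back up the "almost always the DoH provider" observation. The marker text format matches the examples in this comment; reading the names from a plain-text export of the profile is my own assumption, as is the HTTP3 variant of the marker name.

import re
from collections import defaultdict

# Matches marker text such as:
# "nsHttpConnectionMgr::DispatchTransaction TRANSACTION_WAIT_TIME_SPDY: 1409 - private.canadianshield.cira.ca"
# (HTTP3 is assumed to follow the same pattern.)
MARKER_RE = re.compile(r"TRANSACTION_WAIT_TIME_(SPDY|HTTP3): (\d+) - (\S+)")

def aggregate_wait_times(lines):
    # Returns {host: [wait_ms, ...]} for any lines matching the marker format.
    waits = defaultdict(list)
    for line in lines:
        m = MARKER_RE.search(line)
        if m:
            waits[m.group(3)].append(int(m.group(2)))
    return waits

example = [
    "nsHttpConnectionMgr::DispatchTransaction "
    "TRANSACTION_WAIT_TIME_SPDY: 1409 - private.canadianshield.cira.ca",
]
for host, times in aggregate_wait_times(example).items():
    print(host, "count:", len(times), "max ms:", max(times))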

Also seeing the spike in transaction_wait_time_https_rr on Dec 15.

(only for h2_no_https_rr and h3_no_https_rr)

No change in http/1 transactions, transaction_wait_time_http.

Ah, this might be it:
Something went wrong with nextdns, looking at TRR lookup time and fraction of all fails
Dashboard

Flags: needinfo?(acreskey)

(In reply to Andrew Creskey [:acreskey] from comment #20)

Ah, this might be it:
Something went wrong with nextdns, looking at TRR lookup time and fraction of all fails
Dashboard

This is probably a red herring; while TRR lookup times did spike for nextdns, there were only 700 requests, and only on Dec 23.

I ran comparisons on a possible regression range, looking at the newly-added PerfStats that measure transaction wait time (see bug 1877000):
windows
linux
macOS

This did not pick up any significant changes.
I pulled the raw results down to compare quantiles to quantiles, but still found no significant differences.

See Also: → 1877000

Interestingly enough, the telemetry reported in nightly seems to be dropping again:

transaction_wait_time_http3

transaction_wait_time_spdy

So it's not at all clear to me what is happening here.
The longer wait times that I observed were DoH related, so that remains a possibility.

Yes, something happened in most countries that increased transaction wait times in nightly.
It appears to have stopped as of a few days ago.
https://sql.telemetry.mozilla.org/queries/97257/source#240131

Valentin noticed that the restored performance coincided with the latest NSS upgrade, to 3.97 (bug https://bugzilla.mozilla.org/show_bug.cgi?id=1870290), which landed Jan 22.

It could be another coincidence, as I'm still not sure that the root cause is in-tree.

But this looks to have a significant impact: up to a 10x increase in delay in China.
Query modified to compare 85th percentile:
https://sql.telemetry.mozilla.org/queries/97257/source#240131

Benjamin, to rule out or confirm NSS as being related, is it feasible to roll back the latest upgrade so that we can observe any possible impact on nightly?

I still don't see any reason to think that NSS 3.96 was responsible for the regression, and backing out NSS 3.97 would negatively impact other high priority work. I think we should just continue to monitor for future regressions with the hope that we can identify a more plausible cause.

(In reply to John Schanck [:jschanck] from comment #26)

I still don't see any reason to think that NSS 3.96 was responsible for the regression, and backing out NSS 3.97 would negatively impact other high priority work. I think we should just continue to monitor for future regressions with the hope that we can identify a more plausible cause.

Understood -- I really don't have evidence that NSS 3.96 introduced the regression.
(And there are some gaps in the hypothesis that it did -- for instance, we never saw a bump in beta transaction wait times.)

But because we saw transaction wait times shift significantly in nightly on both NSS upgrades, and since NSS plays a key role in transaction wait times, it seems like a possibility.

If it were easy and non-disruptive to roll back the last NSS upgrade, then I think it would be worth it.

But it sounds like it would be disruptive. So simply monitoring these probes looks like the best choice to me.

Flags: needinfo?(bbeurdouche)
Severity: -- → S3
Priority: -- → P2
Whiteboard: [necko-triaged][necko-monitor]

Closing this one; will re-open if we see the numbers spike.

Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → WORKSFORME