Possible regression in transaction wait times, HTTP/2, HTTP/3
Categories
(Core :: Networking, defect, P2)
People
(Reporter: acreskey, Assigned: acreskey)
References
(Blocks 2 open bugs)
Details
(Whiteboard: [necko-triaged][necko-monitor])
Looking at the probes for transaction_wait_time_http3 and transaction_wait_time_spdy, I'm seeing a big jump in delays affecting the 95th and higher percentiles (and also the 75th percentile for H3), starting 2023-12-15 and onwards.
Comment 1•1 year ago (Assignee)
I can reproduce some pretty long transaction wait times locally, so I'll see if I can spot a difference before and after 2023-12-15.
Comment 2•1 year ago
In the 12/14-12/15 timeframe for m-c, the only patch I can see which could affect networking is this:
changeset: 690348:5c9485e9d232
user: Benjamin Beurdouche <beurdouche@mozilla.com>
date: Fri Dec 15 09:28:06 2023 +0000
files: security/nss/TAG-INFO security/nss/automation/abi-check/previous-nss-release security/nss/automation/taskcluster/graph/src/extend.js sec>
description: Bug 1868774 - land NSS_3_96_RTM, UPGRADE_NSS_RELEASE, r=nkulatova
Comment 3•1 year ago (Assignee)
From local testing, the regression does appear to be from that patch:
o changeset: 689867:5c9485e9d232
| user: Benjamin Beurdouche <beurdouche@mozilla.com>
| date: Fri Dec 15 09:28:06 2023 +0000
| summary: Bug 1868774 - land NSS_3_96_RTM, UPGRADE_NSS_RELEASE, r=nkulatova
I'm logging transaction_wait_time_http3 and transaction_wait_time_spdy before and after that commit.
Before Stats:
count 5481.000000
mean 5.028462
std 36.459235
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 840.000000
Name: Values, dtype: float64
After Stats:
count 5481.000000
mean 19.744390
std 108.046964
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1414.000000
Name: Values, dtype: float64
Median Difference: 0.0
Mean Difference: 14.71592775041051
Interquartile Range Difference: 0.0
50th Percentile - Before: 0.0, After: 0.0
80th Percentile - Before: 0.0, After: 0.0
85th Percentile - Before: 0.0, After: 0.0
90th Percentile - Before: 0.0, After: 0.0
95th Percentile - Before: 0.0, After: 109.0
99th Percentile - Before: 129.99999999999727, After: 677.0
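(The stats above have the shape of pandas describe()/quantile() output. As a minimal sketch, assuming the raw wait-time samples were exported to CSV files, here is how such a before/after comparison could be produced; the file names are hypothetical, while the "Values" column name matches the Series name in the output.)
```
# Hedged sketch: reproduce the before/after percentile comparison with pandas.
# The CSV file names are hypothetical; "Values" matches the Series name above.
import pandas as pd

before = pd.read_csv("wait_times_before.csv")["Values"]
after = pd.read_csv("wait_times_after.csv")["Values"]

print("Before Stats:\n", before.describe())
print("After Stats:\n", after.describe())

print("Median Difference:", after.median() - before.median())
print("Mean Difference:", after.mean() - before.mean())

def iqr(s):
    # Interquartile range: 75th percentile minus 25th percentile.
    return s.quantile(0.75) - s.quantile(0.25)

print("Interquartile Range Difference:", iqr(after) - iqr(before))

for q in (0.50, 0.80, 0.85, 0.90, 0.95, 0.99):
    print(f"{int(q * 100)}th Percentile - "
          f"Before: {before.quantile(q)}, After: {after.quantile(q)}")
```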
Comment 4•1 year ago (Assignee)
I re-ran this test (where every item on the new tab page is quickly opened in succession, new profile).
Before Stats:
count 5562.000000
mean 18.748831
std 89.309803
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1288.000000
Name: Values, dtype: float64
After Stats:
count 5562.000000
mean 22.008450
std 113.492866
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1439.000000
Name: Values, dtype: float64
Median Difference: 0.0
Mean Difference: 3.259618842143114
Interquartile Range Difference: 0.0
50th Percentile - Before: 0.0, After: 0.0
80th Percentile - Before: 0.0, After: 0.0
85th Percentile - Before: 0.0, After: 0.0
90th Percentile - Before: 6.900000000000546, After: 7.0
95th Percentile - Before: 104.0, After: 103.0
99th Percentile - Before: 512.0200000000059, After: 753.3900000000003
The results are less conclusive, but the "After" scenario is a bit slower in the mean and at the 99th percentile.
Comment 5•1 year ago (Assignee)
Here's a profile in which I've added a temp marker that logs transaction_wait_time_http3 or transaction_wait_time_spdy if either is longer than 30ms.
You can see many transactions taking over one second here, e.g. nsHttpConnectionMgr::DispatchTransaction TRANSACTION_WAIT_TIME_SPDY: 1071
https://share.firefox.dev/4aTD9id
Comment 6•1 year ago
Thanks Andrew, I have a slight suspicion about https://hg.mozilla.org/projects/nss/rev/1bda168c0da97e19e5f14bc4227c15c0a9f493bf, which looks like the only suspect among the NSS changes you pointed to.
Comment 7•1 year ago
Hi,
@Andrew, could you apply only the patch https://hg.mozilla.org/projects/nss/rev/1bda168c0da97e19e5f14bc4227c15c0a9f493bf and re-run the tests again?
Comment 8•1 year ago (Assignee)
Will do, Anna. Let me just find a more deterministic way of testing this.
Comment 9•1 year ago
Thank you!
Comment 10•1 year ago (Assignee)
Having some problems with reproducibility of results, but I'm still on it.
Comment 11•1 year ago (Assignee)
Hmm, while the impact is very clear in telemetry (P75 onwards for H3, P95 onwards for H2), I can’t say that I’ve been able to reliably pinpoint this regression locally.
Any workflow seems to lead to a large number of long wait times, but not consistently so for any revision.
I can try more tomorrow, but if you have ideas on scenarios where this code might be best stressed, let me know.
Comment 12•1 year ago
Comment 13•1 year ago (Assignee)
Trying to reproduce via a few means:
I've added internal performance metrics, PerfStats, that report the same values as the probes transaction_wait_time_http3 and transaction_wait_time_spdy.
With these in place, I've pushed tests to try to see if we can reliably reproduce the regression.
Taskcluster is extremely slow at the moment, I imagine because of the release cycle, so the tests are still running.
I have variants that run on the recorded sites as well as on live sites.
We will see.
I'm also trying to find the root cause of the long transaction pending times we see locally.
Comment 14•1 year ago (Assignee)
Something unexpected -- whatever change caused this would be in Fx 122, as we can see here.
However I don't see this in the beta channel: transaction_wait_time_spdy, transaction_wait_time_http3.
We should be able to see if there is an impact in release shortly.
Comment 15•1 year ago (Assignee)
Unfortunately I have yet to find a reproducible way to capture this regression in CI.
The new performance metrics that measure transaction_wait_time_spdy and transaction_wait_time_http3 haven't picked up a difference in the regression range on recorded sites.
On live sites there is a wide spread in transaction_wait_time for http/2.
Comparing the start of the regression range to the end shows a significant regression, but that is just noise because we can see similar results comparing the same hg revision against itself (e.g. here).
So there must be a live-network component to transaction wait time, because in CI I'm seeing some come in at 4+ seconds.
(profiles coming)
I have a few more options in CI, but it might help to have a better understanding of what leads to the long transaction_wait_times.
(Recorded on nsHttpConnectionMgr::DispatchTransaction).
Here's one profile, taken locally, where we wait about 365ms (see the new marker, TRANSACTION_WAIT_TIME_SPDY):
https://share.firefox.dev/42cXlb9
We can see that there is a bit of NSS work going on via the flamegraph, but most of the time is spent polling.
Comment 16•1 year ago (Assignee)
(In reply to Anna Weine from comment #7)
Hi,
@Andrew, could you apply only the patch https://hg.mozilla.org/projects/nss/rev/1bda168c0da97e19e5f14bc4227c15c0a9f493bf and re-run the tests again?
Dennis had the idea of testing how often this code is hit via a hard assert.
Locally I'm not seeing it hit at all, although I've pushed to try to confirm.
But it doesn't look to be the cause.
Comment 17•1 year ago (Assignee)
Ah, this is interesting.
I added callstacks to the new profiler markers to find out where the transactions with longer transaction wait times are coming from.
From a local profile:
https://share.firefox.dev/3SjjsIq
If you expand the Name column you will see how many milliseconds the transaction was waiting for:
e.g. nsHttpConnectionMgr::DispatchTransaction TRANSACTION_WAIT_TIME_SPDY: 1409 - private.canadianshield.cira.ca
These are almost always to the DoH provider, often through the HTTPSRecordResolver::OnLookupComplete codepath.
(root) []
...
nsThread::ProcessNextEvent(bool, bool*) [/Users/acreskey/dev/debug/xpcom/threads/nsThread.cpp]
{virtual override thunk({offset(-32)}, mozilla::net::nsSocketTransportService::Run())} [/Users/acreskey/dev/debug/netwerk/base/nsSocketTransportService2.cpp]
mozilla::net::nsSocketTransportService::Run() [/Users/acreskey/dev/debug/netwerk/base/nsSocketTransportService2.cpp]
NS_ProcessNextEvent(nsIThread*, bool) [/Users/acreskey/dev/debug/xpcom/threads/nsThreadUtils.cpp]
nsThread::ProcessNextEvent(bool, bool*) [/Users/acreskey/dev/debug/xpcom/threads/nsThread.cpp]
mozilla::detail::RunnableFunction<mozilla::net::DNSListenerProxy::OnLookupComplete(nsICancelable*, nsIDNSRecord*, nsresult)::$_0>::Run() [/Users/acreskey/dev/debug/obj-ff/dist/include/nsThreadUtils.h]
mozilla::net::DNSListenerProxy::OnLookupComplete(nsICancelable*, nsIDNSRecord*, nsresult)::$_0::operator()() const [/Users/acreskey/dev/debug/netwerk/dns/DNSListenerProxy.cpp]
mozilla::net::HTTPSRecordResolver::OnLookupComplete(nsICancelable*, nsIDNSRecord*, nsresult) [/Users/acreskey/dev/debug/netwerk/protocol/http/HTTPSRecordResolver.cpp]
mozilla::net::SpeculativeTransaction::OnHTTPSRRAvailable(nsIDNSHTTPSSVCRecord*, nsISVCBRecord*) [/Users/acreskey/dev/debug/netwerk/protocol/http/SpeculativeTransaction.cpp]
mozilla::net::nsHttpConnectionMgr::DoSpeculativeConnection(mozilla::net::SpeculativeTransaction*, bool) [/Users/acreskey/dev/debug/netwerk/protocol/http/nsHttpConnectionMgr.cpp]
mozilla::net::nsHttpConnectionMgr::GetOrCreateConnectionEntry(mozilla::net::nsHttpConnectionInfo*, bool, bool, bool, bool*, bool*) [/Users/acreskey/dev/debug/netwerk/protocol/http/nsHttpConnectionMgr.cpp]
mozilla::net::ConnectionEntry::AvailableForDispatchNow() [/Users/acreskey/dev/debug/netwerk/protocol/http/ConnectionEntry.cpp]
mozilla::net::nsHttpConnectionMgr::GetH2orH3ActiveConn(mozilla::net::ConnectionEntry*, bool, bool) [/Users/acreskey/dev/debug/netwerk/protocol/http/nsHttpConnectionMgr.cpp]
mozilla::net::nsHttpConnectionMgr::FindCoalescableConnection(mozilla::net::ConnectionEntry*, bool, bool, bool) [/Users/acreskey/dev/debug/netwerk/protocol/http/nsHttpConnectionMgr.cpp]
mozilla::net::BuildOriginFrameHashKey(nsTSubstring<char>&, mozilla::net::nsHttpConnectionInfo*, nsTSubstring<char> const&, int) [/Users/acreskey/dev/debug/netwerk/protocol/http/nsHttpConnectionMgr.cpp]
nsTSubstring<char>::AppendLiteral<4>(char const (&) [4]) [/Users/acreskey/dev/debug/obj-ff/dist/include/nsTSubstring.h]
nsTSubstring<char>::Append(char const*, unsigned long) [/Users/acreskey/dev/debug/xpcom/string/nsTSubstring.cpp]
Comment 18•1 year ago (Assignee)
So I think a delay in the TRR response time could cause this; however, I'm not seeing that for private.canadianshield.cira.ca in this range.
Comment 19•1 year ago (Assignee)
Also seeing the spike in transaction_wait_time_https_rr on Dec 15.
(only for h2_no_https_rr and h3_no_https_rr)
No change in http/1 transactions, transaction_wait_time_http.
Comment 20•1 year ago (Assignee)
Ah, this might be it:
Something went wrong with nextdns, looking at TRR lookup time and fraction of all fails (Dashboard).
Comment 21•1 year ago (Assignee)
(In reply to Andrew Creskey [:acreskey] from comment #20)
Ah, this might be it:
Something went wrong with nextdns, looking at TRR lookup time and fraction of all fails (Dashboard).
This is probably a red herring; while TRR lookup times did spike for nextdns, there are only 700 requests, and only on Dec 23.
Comment 22•1 year ago (Assignee)
I ran comparisons on a possible regression range, looking at the newly-added PerfStats that measure transaction wait time (see bug 1877000):
Windows
Linux
macOS
This did not pick up any significant changes.
I pulled the raw results down to compare quantiles against quantiles, but still saw no significant differences.
Comment 23•1 year ago (Assignee)
Interestingly enough, the telemetry reported in nightly seems to be dropping again:
So it's not at all clear to me what is happening here.
The longer wait times that I observed were DoH related, so that remains a possibility.
Comment 24•1 year ago (Assignee)
Yes, something happened in most countries that increased transaction wait times in nightly.
It appears to have stopped as of a few days ago.
https://sql.telemetry.mozilla.org/queries/97257/source#240131
Comment 25•1 year ago (Assignee)
Valentin noticed that the restored performance coincided with the latest NSS upgrade, to 3.97 (https://bugzilla.mozilla.org/show_bug.cgi?id=1870290), which landed Jan 22.
It could be another coincidence, as I'm still not sure that the root cause is in-tree.
But this looks to have a significant impact, up to a 10x increase in delay in China.
Query modified to compare 85th percentile:
https://sql.telemetry.mozilla.org/queries/97257/source#240131
Benjamin, to rule out or confirm NSS as being related, is it feasible to roll back the latest upgrade so that we can observe any possible impact on nightly?
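(Illustrative only, not the actual Redash SQL behind the query above: a sketch of the kind of per-country 85th-percentile comparison described here, assuming the telemetry results were exported to CSV with hypothetical submission_date, country, and wait_time_ms columns, and using the approximate regression window from these comments.)
```
# Hedged sketch: compare the 85th percentile of transaction wait time by
# country inside vs. outside the suspected regression window.
# File name, column names, and the exact dates are assumptions.
import pandas as pd

df = pd.read_csv("transaction_wait_time_nightly.csv",
                 parse_dates=["submission_date"])

# Approximate window from the comments: Dec 15, 2023 until the NSS 3.97
# upgrade landed on Jan 22, 2024.
in_window = df["submission_date"].between("2023-12-15", "2024-01-21")

p85 = (df.assign(period=in_window.map({True: "regressed", False: "baseline"}))
         .groupby(["country", "period"])["wait_time_ms"]
         .quantile(0.85)
         .unstack("period"))

# A ratio well above 1 means the 85th percentile rose during the window.
p85["ratio"] = p85["regressed"] / p85["baseline"]
print(p85.sort_values("ratio", ascending=False).head(10))
```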
Comment 26•1 year ago
I still don't see any reason to think that NSS 3.96 was responsible for the regression, and backing out NSS 3.97 would negatively impact other high priority work. I think we should just continue to monitor for future regressions with the hope that we can identify a more plausible cause.
Comment 27•1 year ago (Assignee)
(In reply to John Schanck [:jschanck] from comment #26)
I still don't see any reason to think that NSS 3.96 was responsible for the regression, and backing out NSS 3.97 would negatively impact other high priority work. I think we should just continue to monitor for future regressions with the hope that we can identify a more plausible cause.
Understood -- I really don't have evidence that NSS 3.96 introduced the regression.
(And there are some gaps in the hypothesis that it did -- for instance we never saw a bump in beta transaction wait times.)
But because we saw transaction wait times shift significantly in nightly on both NSS upgrades, and since NSS plays a key role in transaction wait times, it seems like a possibility.
If it was easy and not disruptive to roll back the last NSS upgrade, then I think it would be worth it.
But it sounds like it would be disruptive. So simply monitoring these probes looks like the best choice to me.
Comment 28•1 year ago (Assignee)
Closing this one; will re-open if we see the numbers spike.