mozilla::net::nsHttpChannel::Connect() blocks the main thread on nsSiteSecurityService::HostHasHSTSEntry
Categories
(Core :: Security: PSM, enhancement, P1)
Tracking
()
Performance Impact | medium |
People
(Reporter: florian, Unassigned)
References
Details
(Keywords: perf, perf:responsiveness, stale-bug, Whiteboard: [fxperf:p2])
See this startup profile captured on the quantum reference hardware: thttps://perfht.ml/2rgUiwe nsSiteSecurityService::HostHasHSTSEntry blocks the main thread for 37ms on what looks like disk I/O.
Comment 1•7 years ago
|
||
Hi David, can you find an appropriate owner for this bug? Thanks!
In general, I don't think it would be easy to prevent this. Any code wanting to know if a site is HSTS (i.e. only connect securely/upgrade to https) can't continue until that information has been read from the profile. To minimize the chance that code actually waits in practice, we could: * Initialize nsISiteSecurityService as early as possible (once we have a profile) (this does a background-thread read of the data). * Avoid connecting to http URIs early in startup, although I imagine this is driven by data in the user's profile that we don't necessary control. There absolutely should not be URIs we ship with the product that are http (as opposed to https). I suppose we could add an asynchronous API to query this information. I don't know necko well enough to know if this would be a reasonable solution. Patrick, how much work would it be to change nsHttpChannel::Connect to essentially be asynchronous?
Comment 3•7 years ago
|
||
(In reply to David Keeler [:keeler] (use needinfo?) from comment #2) > > I suppose we could add an asynchronous API to query this information. I > don't know necko well enough to know if this would be a reasonable solution. that's the usual way. > Patrick, how much work would it be to change nsHttpChannel::Connect to > essentially be asynchronous? it's a doable patch (not trivial but not huge) - a lot like bug 1362388. A lot has to do with whether it can fit into one of the existing async points.
Updated•7 years ago
|
Comment 4•7 years ago
|
||
Daniel, can you take this? It looks to me like this is very similar to our cookie I/O initialization work, and thus a similar plan: 1) Stick a call to get nsISiteSecurityService (but don't actually use it) somewhere early in startup: from comment 2 it sounds like this will kick off a background thread that does the I/O here. I'm not sure of the right place to put that call--I need to ping some people who've worked on similar bugs. 2) If we're lucky, the I/O is generally fast enough that the info is ready by the time the 1st user appears and wants the data. Add some telemetry to measure that. If we wind up janking, then we need step 3 3) If needed add an async API so we never block the main thread. If #3 is simple enough, it might actually be faster/easier to skip #2 and just proceed right to an async API. (scheduling note: Daniel, this looks way more tractable in the 57 timeframe than un-mmapping JAR channels, so do this first)
Comment 5•7 years ago
|
||
> I'm not sure of the right place to put that call
Me neither, that's magic I haven't been around in before, but it does sound like a good idea that we should do independently of it being async or not since it feels like a better chance to save more total startup time.
This is an assigned P1 bug without activity in two weeks. If you intend to continue working on this bug for the current release/iteration/sprint, remove the 'stale-bug' keyword. Otherwise we'll reset the priority of the bug back to '--' on Monday, August 28th.
Reporter | ||
Comment 7•7 years ago
|
||
Another startup profile where this blocks the UI thread for 192ms: https://perfht.ml/2fhaklI
Comment 8•7 years ago
|
||
This bug won't be address for 57; moving it to QF:P2 for post 57 work.
Comment 9•7 years ago
|
||
:jgong marked this for P2 in the whiteboard but not the prio field, fixing.
Updated•7 years ago
|
Updated•7 years ago
|
Updated•7 years ago
|
Updated•7 years ago
|
Updated•6 years ago
|
Updated•6 years ago
|
Updated•6 years ago
|
Comment 10•6 years ago
|
||
unassigning myself as I've failed to deliver on this and let's not fool anyone it will happen soon
Updated•6 years ago
|
Updated•5 years ago
|
Comment 11•5 years ago
|
||
This actually looks like it's a result of captive portal detection. The fact that this does an http instead of https request kind of makes sense, but we should avoid checking HSTS for it.
We could also just make the captive portal detection wait for the HSTS data to have been loaded.
Going to mark [fxperf:p2] because that consumer, at least, seems like a fairly straightforward one to fix.
In terms of a longer-term fix, if we wanted to be cheeky, presumably we could check if HSTS data is available in HTTPChannel::AsyncOpen for non-https URIs, and delay the Open() if not. That'd avoid the complexity of having to make Connect() async, too. Valentin, would that be doable?
Comment 12•5 years ago
|
||
(In reply to :Gijs (he/him) from comment #11)
This actually looks like it's a result of captive portal detection. The fact that this does an http instead of https request kind of makes sense, but we should avoid checking HSTS for it.
Adding an API to make a channel avoid checking HSTS seems OK. But I'd rather not do it just for this one use case :)
We could also just make the captive portal detection wait for the HSTS data to have been loaded.
Do we have a notification, or other way of determining if HSTS data was loaded?
Going to mark [fxperf:p2] because that consumer, at least, seems like a fairly straightforward one to fix.
In terms of a longer-term fix, if we wanted to be cheeky, presumably we could check if HSTS data is available in HTTPChannel::AsyncOpen for non-https URIs, and delay the Open() if not. That'd avoid the complexity of having to make Connect() async, too. Valentin, would that be doable?
This sounds OK. Shall we file another bug for the necko work, and keep this one for making HostHasHSTSEntry async?
Comment 13•5 years ago
|
||
(In reply to Valentin Gosu [:valentin] from comment #12)
(In reply to :Gijs (he/him) from comment #11)
This actually looks like it's a result of captive portal detection. The fact that this does an http instead of https request kind of makes sense, but we should avoid checking HSTS for it.
Adding an API to make a channel avoid checking HSTS seems OK. But I'd rather not do it just for this one use case :)
We could also just make the captive portal detection wait for the HSTS data to have been loaded.
Do we have a notification, or other way of determining if HSTS data was loaded?
There doesn't seem to be one right now, we'd need to add one. The actual IO is off the main thread already, but if there are any main-thread calls, they block on getting a mutex and the IO being completed, it looks like (from a very quick glance, so I could be wrong). This seems to all be encapsulated in https://searchfox.org/mozilla-central/source/security/manager/ssl/DataStorage.cpp .
Going to mark [fxperf:p2] because that consumer, at least, seems like a fairly straightforward one to fix.
In terms of a longer-term fix, if we wanted to be cheeky, presumably we could check if HSTS data is available in HTTPChannel::AsyncOpen for non-https URIs, and delay the Open() if not. That'd avoid the complexity of having to make Connect() async, too. Valentin, would that be doable?
This sounds OK. Shall we file another bug for the necko work, and keep this one for making HostHasHSTSEntry async?
Sounds good! I filed bug 1521729.
Is there anything we need to do here now that bug 1521729 has landed?
Comment 15•4 years ago
|
||
Sorry for the delay, it's hard for me to tell based on bug 1521729 whether or not this is fully fixed now, so I wanted to test whether we still hit this code on Windows, but to be 100% sure about this (the profiler won't always show code if we don't sample while we're in it) I need to do a full (non-artifact) Windows build and add a marker to the code or something like that, and I haven't had time to do that. Hopefully tomorrow.
Comment 16•4 years ago
|
||
So I added a MOZ_CRASH()
in the lock code for DataStorage
if we end up waiting for the lock on the main thread; that's still getting hit, but with this (tracerefcnt on an opt build, so add appropriate amounts of salt) stack:
#01: mozilla::net::AltSvcCache::GetAltServiceMapping (mozilla-central\netwerk\protocol\http\AlternateServices.cpp:1076)
#02: mozilla::net::nsHttpChannel::BeginConnect (mozilla-central\netwerk\protocol\http\nsHttpChannel.cpp:6774)
#03: mozilla::net::nsHttpChannel::OnProxyAvailable (mozilla-central\netwerk\protocol\http\nsHttpChannel.cpp:7233)
#04: mozilla::net::nsAsyncResolveRequest::DoCallback::<unnamed-tag>::operator() (mozilla-central\netwerk\base\nsProtocolProxyService.cpp:364)
#05: mozilla::net::nsAsyncResolveRequest::AsyncApplyFilters::Finish (mozilla-central\netwerk\base\nsProtocolProxyService.cpp:597)
#06: mozilla::net::nsAsyncResolveRequest::AsyncApplyFilters::ProcessNextFilter (mozilla-central\netwerk\base\nsProtocolProxyService.cpp:507)
#07: mozilla::net::nsAsyncResolveRequest::AsyncApplyFilters::AsyncProcess (mozilla-central\netwerk\base\nsProtocolProxyService.cpp:487)
#08: mozilla::net::nsAsyncResolveRequest::DoCallback (mozilla-central\netwerk\base\nsProtocolProxyService.cpp:373)
#09: mozilla::net::ExecuteCallback::Run (mozilla-central\netwerk\base\nsPACMan.cpp:119)
#10: nsThread::ProcessNextEvent (mozilla-central\xpcom\threads\nsThread.cpp:1222)
#11: NS_ProcessNextEvent (mozilla-central\xpcom\threads\nsThreadUtils.cpp:481)
#12: mozilla::ipc::MessagePump::Run (mozilla-central\ipc\glue\MessagePump.cpp:87)
#13: MessageLoop::RunHandler (mozilla-central\ipc\chromium\src\base\message_loop.cc:309)
#14: MessageLoop::Run (mozilla-central\ipc\chromium\src\base\message_loop.cc:291)
#15: nsBaseAppShell::Run (mozilla-central\widget\nsBaseAppShell.cpp:139)
#16: nsAppShell::Run (mozilla-central\widget\windows\nsAppShell.cpp:406)
#17: nsAppStartup::Run (mozilla-central\toolkit\components\startup\nsAppStartup.cpp:272)
#18: XREMain::XRE_mainRun (mozilla-central\toolkit\xre\nsAppRunner.cpp:4566)
#19: XREMain::XRE_main (mozilla-central\toolkit\xre\nsAppRunner.cpp:4701)
#20: XRE_main (mozilla-central\toolkit\xre\nsAppRunner.cpp:4752)
#21: NS_internal_main (mozilla-central\browser\app\nsBrowserApp.cpp:331)
#22: wmain (mozilla-central\toolkit\xre\nsWindowsWMain.cpp:131)
which I think is different?
I do see this still being hit on the main thread for the HSTS thing when mReady
is already true, which I think is expected because we effectively trigger the HSTS db load away from the main thread and then read things on the main thread later?
So I think this bug as filed is resolved, but I also think we probably want to figure out the AltSvc bits, too. Dana, does my analysis look correct to you?
I agree with your analysis. I'll file a bug for AltSvc.
Updated•2 years ago
|
Description
•