Closed Bug 1372889 Opened 7 years ago Closed 4 years ago

mozilla::net::nsHttpChannel::Connect() blocks the main thread on nsSiteSecurityService::HostHasHSTSEntry

Categories

(Core :: Security: PSM, enhancement, P1)

enhancement

Tracking

()

RESOLVED DUPLICATE of bug 1521729
Performance Impact medium

People

(Reporter: florian, Unassigned)

References

Details

(Keywords: perf, perf:responsiveness, stale-bug, Whiteboard: [fxperf:p2])

See this startup profile captured on the quantum reference hardware: thttps://perfht.ml/2rgUiwe

nsSiteSecurityService::HostHasHSTSEntry blocks the main thread for 37ms on what looks like disk I/O.
Hi David, can you find an appropriate owner for this bug? Thanks!
Component: Networking: HTTP → Security: PSM
Flags: needinfo?(dkeeler)
In general, I don't think it would be easy to prevent this. Any code wanting to know if a site is HSTS (i.e. only connect securely/upgrade to https) can't continue until that information has been read from the profile.

To minimize the chance that code actually waits in practice, we could:
* Initialize nsISiteSecurityService as early as possible (once we have a profile) (this does a background-thread read of the data).
* Avoid connecting to http URIs early in startup, although I imagine this is driven by data in the user's profile that we don't necessary control. There absolutely should not be URIs we ship with the product that are http (as opposed to https).

I suppose we could add an asynchronous API to query this information. I don't know necko well enough to know if this would be a reasonable solution. Patrick, how much work would it be to change nsHttpChannel::Connect to essentially be asynchronous?
Flags: needinfo?(dkeeler) → needinfo?(mcmanus)
(In reply to David Keeler [:keeler] (use needinfo?) from comment #2)
>
> I suppose we could add an asynchronous API to query this information. I
> don't know necko well enough to know if this would be a reasonable solution.

that's the usual way.

> Patrick, how much work would it be to change nsHttpChannel::Connect to
> essentially be asynchronous?

it's a doable patch (not trivial but not huge) - a lot like bug 1362388. A lot has to do with whether it can fit into one of the existing async points.
Flags: needinfo?(mcmanus)
Whiteboard: [qf] → [qf:p1]
Daniel, can you take this?  It looks to me like this is very similar to our cookie I/O initialization work, and thus a similar plan:

1) Stick a call to get nsISiteSecurityService (but don't actually use it) somewhere early in startup: from comment 2 it sounds like this will kick off a background thread that does the I/O here.  I'm not sure of the right place to put that call--I need to ping some people who've worked on similar bugs.

2) If we're lucky, the I/O is generally fast enough that the info is ready by the time the 1st user appears and wants the data.  Add some telemetry to measure that.  If we wind up janking, then we need step 3

3) If needed add an async API so we never block the main thread.

If #3 is simple enough, it might actually be faster/easier to skip #2 and just proceed right to an async API.

(scheduling note: Daniel, this looks way more tractable in the 57 timeframe than un-mmapping JAR channels, so do this first)
Flags: needinfo?(daniel)
> I'm not sure of the right place to put that call

Me neither, that's magic I haven't been around in before, but it does sound like a good idea that we should do independently of it being async or not since it feels like a better chance to save more total startup time.
Assignee: nobody → daniel
Flags: needinfo?(daniel)
Priority: -- → P1
Whiteboard: [qf:p1] → [qf:p1][psm-assigned]
This is an assigned P1 bug without activity in two weeks. 

If you intend to continue working on this bug for the current release/iteration/sprint, remove the 'stale-bug' keyword.

Otherwise we'll reset the priority of the bug back to '--' on Monday, August 28th.
Keywords: stale-bug
Another startup profile where this blocks the UI thread for 192ms: https://perfht.ml/2fhaklI
This bug won't be address for 57; moving it to QF:P2 for post 57 work.
Keywords: stale-bug
Whiteboard: [qf:p1][psm-assigned] → [qf:p2][psm-assigned]
:jgong marked this for P2 in the whiteboard but not the prio field, fixing.
Priority: P1 → P2
Keywords: perf
Whiteboard: [qf:p2][psm-assigned] → [qf:p1][psm-assigned]
Priority: P2 → P1
Whiteboard: [qf:p1][psm-assigned] → [qf:p1][psm-assigned][qf:i60]
Whiteboard: [qf:p1][psm-assigned][qf:i60] → [qf:i60][qf:p1][psm-assigned]
Whiteboard: [qf:i60][qf:p1][psm-assigned] → [qf:f60][qf:p1][psm-assigned]
Whiteboard: [qf:f60][qf:p1][psm-assigned] → [qf:f61][qf:p1][psm-assigned]
Whiteboard: [qf:f61][qf:p1][psm-assigned] → [qf:f64][qf:p1][psm-assigned]
Whiteboard: [qf:f64][qf:p1][psm-assigned] → [qf:p1:f64][psm-assigned]
unassigning myself as I've failed to deliver on this and let's not fool anyone it will happen soon
Assignee: daniel → nobody
Whiteboard: [qf:p1:f64][psm-assigned] → [qf:p1:f64]
Whiteboard: [qf:p1:f64] → [qf:p2:responsiveness][fxperf]

This actually looks like it's a result of captive portal detection. The fact that this does an http instead of https request kind of makes sense, but we should avoid checking HSTS for it.

We could also just make the captive portal detection wait for the HSTS data to have been loaded.

Going to mark [fxperf:p2] because that consumer, at least, seems like a fairly straightforward one to fix.

In terms of a longer-term fix, if we wanted to be cheeky, presumably we could check if HSTS data is available in HTTPChannel::AsyncOpen for non-https URIs, and delay the Open() if not. That'd avoid the complexity of having to make Connect() async, too. Valentin, would that be doable?

Flags: needinfo?(valentin.gosu)
Whiteboard: [qf:p2:responsiveness][fxperf] → [qf:p2:responsiveness][fxperf:p2]

(In reply to :Gijs (he/him) from comment #11)

This actually looks like it's a result of captive portal detection. The fact that this does an http instead of https request kind of makes sense, but we should avoid checking HSTS for it.

Adding an API to make a channel avoid checking HSTS seems OK. But I'd rather not do it just for this one use case :)

We could also just make the captive portal detection wait for the HSTS data to have been loaded.

Do we have a notification, or other way of determining if HSTS data was loaded?

Going to mark [fxperf:p2] because that consumer, at least, seems like a fairly straightforward one to fix.

In terms of a longer-term fix, if we wanted to be cheeky, presumably we could check if HSTS data is available in HTTPChannel::AsyncOpen for non-https URIs, and delay the Open() if not. That'd avoid the complexity of having to make Connect() async, too. Valentin, would that be doable?

This sounds OK. Shall we file another bug for the necko work, and keep this one for making HostHasHSTSEntry async?

Flags: needinfo?(valentin.gosu)
Depends on: 1521729

(In reply to Valentin Gosu [:valentin] from comment #12)

(In reply to :Gijs (he/him) from comment #11)

This actually looks like it's a result of captive portal detection. The fact that this does an http instead of https request kind of makes sense, but we should avoid checking HSTS for it.

Adding an API to make a channel avoid checking HSTS seems OK. But I'd rather not do it just for this one use case :)

We could also just make the captive portal detection wait for the HSTS data to have been loaded.

Do we have a notification, or other way of determining if HSTS data was loaded?

There doesn't seem to be one right now, we'd need to add one. The actual IO is off the main thread already, but if there are any main-thread calls, they block on getting a mutex and the IO being completed, it looks like (from a very quick glance, so I could be wrong). This seems to all be encapsulated in https://searchfox.org/mozilla-central/source/security/manager/ssl/DataStorage.cpp .

Going to mark [fxperf:p2] because that consumer, at least, seems like a fairly straightforward one to fix.

In terms of a longer-term fix, if we wanted to be cheeky, presumably we could check if HSTS data is available in HTTPChannel::AsyncOpen for non-https URIs, and delay the Open() if not. That'd avoid the complexity of having to make Connect() async, too. Valentin, would that be doable?

This sounds OK. Shall we file another bug for the necko work, and keep this one for making HostHasHSTSEntry async?

Sounds good! I filed bug 1521729.

Is there anything we need to do here now that bug 1521729 has landed?

Flags: needinfo?(gijskruitbosch+bugs)

Sorry for the delay, it's hard for me to tell based on bug 1521729 whether or not this is fully fixed now, so I wanted to test whether we still hit this code on Windows, but to be 100% sure about this (the profiler won't always show code if we don't sample while we're in it) I need to do a full (non-artifact) Windows build and add a marker to the code or something like that, and I haven't had time to do that. Hopefully tomorrow.

So I added a MOZ_CRASH() in the lock code for DataStorage if we end up waiting for the lock on the main thread; that's still getting hit, but with this (tracerefcnt on an opt build, so add appropriate amounts of salt) stack:

#01: mozilla::net::AltSvcCache::GetAltServiceMapping (mozilla-central\netwerk\protocol\http\AlternateServices.cpp:1076)
#02: mozilla::net::nsHttpChannel::BeginConnect (mozilla-central\netwerk\protocol\http\nsHttpChannel.cpp:6774)
#03: mozilla::net::nsHttpChannel::OnProxyAvailable (mozilla-central\netwerk\protocol\http\nsHttpChannel.cpp:7233)
#04: mozilla::net::nsAsyncResolveRequest::DoCallback::<unnamed-tag>::operator() (mozilla-central\netwerk\base\nsProtocolProxyService.cpp:364)
#05: mozilla::net::nsAsyncResolveRequest::AsyncApplyFilters::Finish (mozilla-central\netwerk\base\nsProtocolProxyService.cpp:597)
#06: mozilla::net::nsAsyncResolveRequest::AsyncApplyFilters::ProcessNextFilter (mozilla-central\netwerk\base\nsProtocolProxyService.cpp:507)
#07: mozilla::net::nsAsyncResolveRequest::AsyncApplyFilters::AsyncProcess (mozilla-central\netwerk\base\nsProtocolProxyService.cpp:487)
#08: mozilla::net::nsAsyncResolveRequest::DoCallback (mozilla-central\netwerk\base\nsProtocolProxyService.cpp:373)
#09: mozilla::net::ExecuteCallback::Run (mozilla-central\netwerk\base\nsPACMan.cpp:119)
#10: nsThread::ProcessNextEvent (mozilla-central\xpcom\threads\nsThread.cpp:1222)
#11: NS_ProcessNextEvent (mozilla-central\xpcom\threads\nsThreadUtils.cpp:481)
#12: mozilla::ipc::MessagePump::Run (mozilla-central\ipc\glue\MessagePump.cpp:87)
#13: MessageLoop::RunHandler (mozilla-central\ipc\chromium\src\base\message_loop.cc:309)
#14: MessageLoop::Run (mozilla-central\ipc\chromium\src\base\message_loop.cc:291)
#15: nsBaseAppShell::Run (mozilla-central\widget\nsBaseAppShell.cpp:139)
#16: nsAppShell::Run (mozilla-central\widget\windows\nsAppShell.cpp:406)
#17: nsAppStartup::Run (mozilla-central\toolkit\components\startup\nsAppStartup.cpp:272)
#18: XREMain::XRE_mainRun (mozilla-central\toolkit\xre\nsAppRunner.cpp:4566)
#19: XREMain::XRE_main (mozilla-central\toolkit\xre\nsAppRunner.cpp:4701)
#20: XRE_main (mozilla-central\toolkit\xre\nsAppRunner.cpp:4752)
#21: NS_internal_main (mozilla-central\browser\app\nsBrowserApp.cpp:331)
#22: wmain (mozilla-central\toolkit\xre\nsWindowsWMain.cpp:131)

which I think is different?

I do see this still being hit on the main thread for the HSTS thing when mReady is already true, which I think is expected because we effectively trigger the HSTS db load away from the main thread and then read things on the main thread later?

So I think this bug as filed is resolved, but I also think we probably want to figure out the AltSvc bits, too. Dana, does my analysis look correct to you?

Flags: needinfo?(gijskruitbosch+bugs) → needinfo?(dkeeler)

I agree with your analysis. I'll file a bug for AltSvc.

Status: NEW → RESOLVED
Closed: 4 years ago
Flags: needinfo?(dkeeler)
Resolution: --- → DUPLICATE
Performance Impact: --- → P2
Whiteboard: [qf:p2:responsiveness][fxperf:p2] → [fxperf:p2]
You need to log in before you can comment on or make changes to this bug.