1372889 - mozilla::net::nsHttpChannel::Connect() blocks the main thread on nsSiteSecurityService::HostHasHSTSEntry

Reporter

Description

•

7 years ago

See this startup profile captured on the quantum reference hardware: thttps://perfht.ml/2rgUiwe

nsSiteSecurityService::HostHasHSTSEntry blocks the main thread for 37ms on what looks like disk I/O.

Valentin Gosu [:valentin] (he/him)

Comment 1

•

7 years ago

Hi David, can you find an appropriate owner for this bug? Thanks!

Component: Networking: HTTP → Security: PSM

Flags: needinfo?(dkeeler)

Dana Keeler (she/her) (use needinfo) [:keeler] (on leave)

Comment 2

•

7 years ago

In general, I don't think it would be easy to prevent this. Any code wanting to know if a site is HSTS (i.e. only connect securely/upgrade to https) can't continue until that information has been read from the profile.

To minimize the chance that code actually waits in practice, we could:
* Initialize nsISiteSecurityService as early as possible (once we have a profile) (this does a background-thread read of the data).
* Avoid connecting to http URIs early in startup, although I imagine this is driven by data in the user's profile that we don't necessary control. There absolutely should not be URIs we ship with the product that are http (as opposed to https).

I suppose we could add an asynchronous API to query this information. I don't know necko well enough to know if this would be a reasonable solution. Patrick, how much work would it be to change nsHttpChannel::Connect to essentially be asynchronous?

Flags: needinfo?(dkeeler) → needinfo?(mcmanus)

Patrick McManus [:mcmanus]

Comment 3

•

7 years ago

(In reply to David Keeler [:keeler] (use needinfo?) from comment #2)
>
> I suppose we could add an asynchronous API to query this information. I
> don't know necko well enough to know if this would be a reasonable solution.

that's the usual way.

> Patrick, how much work would it be to change nsHttpChannel::Connect to
> essentially be asynchronous?

it's a doable patch (not trivial but not huge) - a lot like bug 1362388. A lot has to do with whether it can fit into one of the existing async points.

Flags: needinfo?(mcmanus)

Naveed Ihsanullah [:naveed]

Updated

•

7 years ago

Whiteboard: [qf] → [qf:p1]

Jason Duell

Comment 4

•

7 years ago

Daniel, can you take this?  It looks to me like this is very similar to our cookie I/O initialization work, and thus a similar plan:

1) Stick a call to get nsISiteSecurityService (but don't actually use it) somewhere early in startup: from comment 2 it sounds like this will kick off a background thread that does the I/O here.  I'm not sure of the right place to put that call--I need to ping some people who've worked on similar bugs.

2) If we're lucky, the I/O is generally fast enough that the info is ready by the time the 1st user appears and wants the data.  Add some telemetry to measure that.  If we wind up janking, then we need step 3

3) If needed add an async API so we never block the main thread.

If #3 is simple enough, it might actually be faster/easier to skip #2 and just proceed right to an async API.

(scheduling note: Daniel, this looks way more tractable in the 57 timeframe than un-mmapping JAR channels, so do this first)

Flags: needinfo?(daniel)

Daniel Stenberg [:bagder]

Comment 5

•

7 years ago

> I'm not sure of the right place to put that call

Me neither, that's magic I haven't been around in before, but it does sound like a good idea that we should do independently of it being async or not since it feels like a better chance to save more total startup time.

Assignee: nobody → daniel

Flags: needinfo?(daniel)

Dana Keeler (she/her) (use needinfo) [:keeler] (on leave)

Updated

•

7 years ago

Priority: -- → P1

Whiteboard: [qf:p1] → [qf:p1][psm-assigned]

Emma Humphries ☕️🎸🧞‍♀️✨ (she/they) [:emceeaich] (Pacific Time) use needinfo

Comment 6

•

7 years ago

This is an assigned P1 bug without activity in two weeks. 

If you intend to continue working on this bug for the current release/iteration/sprint, remove the 'stale-bug' keyword.

Otherwise we'll reset the priority of the bug back to '--' on Monday, August 28th.

Keywords: stale-bug

Florian Quèze [:florian]

Reporter

Comment 7

•

7 years ago

Another startup profile where this blocks the UI thread for 192ms: https://perfht.ml/2fhaklI

Jean Gong :jgong

Comment 8

•

7 years ago

This bug won't be address for 57; moving it to QF:P2 for post 57 work.

Keywords: stale-bug

Whiteboard: [qf:p1][psm-assigned] → [qf:p2][psm-assigned]

Emma Humphries ☕️🎸🧞‍♀️✨ (she/they) [:emceeaich] (Pacific Time) use needinfo

Updated

•

7 years ago

Keywords: stale-bug

J.C. Jones [:jcj] (he/him)

Comment 9

•

7 years ago

:jgong marked this for P2 in the whiteboard but not the prio field, fixing.

Priority: P1 → P2

Firefox Bug Husbandry Bot

Updated

•

7 years ago

Keywords: perf

Mike Conley (:mconley) (:⚙️)

Updated

•

7 years ago

Whiteboard: [qf:p2][psm-assigned] → [qf:p1][psm-assigned]

Naveed Ihsanullah [:naveed]

Updated

•

7 years ago

Priority: P2 → P1

Whiteboard: [qf:p1][psm-assigned] → [qf:p1][psm-assigned][qf:i60]

Jean Gong :jgong

Updated

•

7 years ago

Whiteboard: [qf:p1][psm-assigned][qf:i60] → [qf:i60][qf:p1][psm-assigned]

Jean Gong :jgong

Updated

•

7 years ago

Whiteboard: [qf:i60][qf:p1][psm-assigned] → [qf:f60][qf:p1][psm-assigned]

Jean Gong :jgong

Updated

•

6 years ago

Whiteboard: [qf:f60][qf:p1][psm-assigned] → [qf:f61][qf:p1][psm-assigned]

Jean Gong :jgong

Updated

•

6 years ago

Whiteboard: [qf:f61][qf:p1][psm-assigned] → [qf:f64][qf:p1][psm-assigned]

Jean Gong :jgong

Updated

•

6 years ago

Whiteboard: [qf:f64][qf:p1][psm-assigned] → [qf:p1:f64][psm-assigned]

Daniel Stenberg [:bagder]

Comment 10

•

6 years ago

unassigning myself as I've failed to deliver on this and let's not fool anyone it will happen soon

Assignee: daniel → nobody

Valentin Gosu [:valentin] (he/him)

Updated

•

6 years ago

Whiteboard: [qf:p1:f64][psm-assigned] → [qf:p1:f64]

Mike Conley (:mconley) (:⚙️)

Updated

•

5 years ago

Whiteboard: [qf:p1:f64] → [qf:p2:responsiveness][fxperf]

:Gijs (he/him)

Comment 11

•

5 years ago

This actually looks like it's a result of captive portal detection. The fact that this does an http instead of https request kind of makes sense, but we should avoid checking HSTS for it.

We could also just make the captive portal detection wait for the HSTS data to have been loaded.

Going to mark [fxperf:p2] because that consumer, at least, seems like a fairly straightforward one to fix.

In terms of a longer-term fix, if we wanted to be cheeky, presumably we could check if HSTS data is available in HTTPChannel::AsyncOpen for non-https URIs, and delay the Open() if not. That'd avoid the complexity of having to make Connect() async, too. Valentin, would that be doable?

Flags: needinfo?(valentin.gosu)

Whiteboard: [qf:p2:responsiveness][fxperf] → [qf:p2:responsiveness][fxperf:p2]

Valentin Gosu [:valentin] (he/him)

Comment 12

•

5 years ago

(In reply to :Gijs (he/him) from comment #11)

This actually looks like it's a result of captive portal detection. The fact that this does an http instead of https request kind of makes sense, but we should avoid checking HSTS for it.

Adding an API to make a channel avoid checking HSTS seems OK. But I'd rather not do it just for this one use case :)

We could also just make the captive portal detection wait for the HSTS data to have been loaded.

Do we have a notification, or other way of determining if HSTS data was loaded?

Going to mark [fxperf:p2] because that consumer, at least, seems like a fairly straightforward one to fix.

In terms of a longer-term fix, if we wanted to be cheeky, presumably we could check if HSTS data is available in HTTPChannel::AsyncOpen for non-https URIs, and delay the Open() if not. That'd avoid the complexity of having to make Connect() async, too. Valentin, would that be doable?

This sounds OK. Shall we file another bug for the necko work, and keep this one for making HostHasHSTSEntry async?

Flags: needinfo?(valentin.gosu)

:Gijs (he/him)

Updated

•

5 years ago

Depends on: 1521729

:Gijs (he/him)

Comment 13

•

5 years ago

(In reply to Valentin Gosu [:valentin] from comment #12)

(In reply to :Gijs (he/him) from comment #11)

This actually looks like it's a result of captive portal detection. The fact that this does an http instead of https request kind of makes sense, but we should avoid checking HSTS for it.

Adding an API to make a channel avoid checking HSTS seems OK. But I'd rather not do it just for this one use case :)

We could also just make the captive portal detection wait for the HSTS data to have been loaded.

Do we have a notification, or other way of determining if HSTS data was loaded?

There doesn't seem to be one right now, we'd need to add one. The actual IO is off the main thread already, but if there are any main-thread calls, they block on getting a mutex and the IO being completed, it looks like (from a very quick glance, so I could be wrong). This seems to all be encapsulated in https://searchfox.org/mozilla-central/source/security/manager/ssl/DataStorage.cpp .

Going to mark [fxperf:p2] because that consumer, at least, seems like a fairly straightforward one to fix.

In terms of a longer-term fix, if we wanted to be cheeky, presumably we could check if HSTS data is available in HTTPChannel::AsyncOpen for non-https URIs, and delay the Open() if not. That'd avoid the complexity of having to make Connect() async, too. Valentin, would that be doable?

This sounds OK. Shall we file another bug for the necko work, and keep this one for making HostHasHSTSEntry async?

Sounds good! I filed bug 1521729.

Dana Keeler (she/her) (use needinfo) [:keeler] (on leave)

Comment 14

•

4 years ago

Is there anything we need to do here now that bug 1521729 has landed?

Flags: needinfo?(gijskruitbosch+bugs)

:Gijs (he/him)

Comment 15

•

4 years ago

Sorry for the delay, it's hard for me to tell based on bug 1521729 whether or not this is fully fixed now, so I wanted to test whether we still hit this code on Windows, but to be 100% sure about this (the profiler won't always show code if we don't sample while we're in it) I need to do a full (non-artifact) Windows build and add a marker to the code or something like that, and I haven't had time to do that. Hopefully tomorrow.

:Gijs (he/him)

Comment 16

•

4 years ago

So I added a MOZ_CRASH() in the lock code for DataStorage if we end up waiting for the lock on the main thread; that's still getting hit, but with this (tracerefcnt on an opt build, so add appropriate amounts of salt) stack:

#01: mozilla::net::AltSvcCache::GetAltServiceMapping (mozilla-central\netwerk\protocol\http\AlternateServices.cpp:1076)
#02: mozilla::net::nsHttpChannel::BeginConnect (mozilla-central\netwerk\protocol\http\nsHttpChannel.cpp:6774)
#03: mozilla::net::nsHttpChannel::OnProxyAvailable (mozilla-central\netwerk\protocol\http\nsHttpChannel.cpp:7233)
#04: mozilla::net::nsAsyncResolveRequest::DoCallback::<unnamed-tag>::operator() (mozilla-central\netwerk\base\nsProtocolProxyService.cpp:364)
#05: mozilla::net::nsAsyncResolveRequest::AsyncApplyFilters::Finish (mozilla-central\netwerk\base\nsProtocolProxyService.cpp:597)
#06: mozilla::net::nsAsyncResolveRequest::AsyncApplyFilters::ProcessNextFilter (mozilla-central\netwerk\base\nsProtocolProxyService.cpp:507)
#07: mozilla::net::nsAsyncResolveRequest::AsyncApplyFilters::AsyncProcess (mozilla-central\netwerk\base\nsProtocolProxyService.cpp:487)
#08: mozilla::net::nsAsyncResolveRequest::DoCallback (mozilla-central\netwerk\base\nsProtocolProxyService.cpp:373)
#09: mozilla::net::ExecuteCallback::Run (mozilla-central\netwerk\base\nsPACMan.cpp:119)
#10: nsThread::ProcessNextEvent (mozilla-central\xpcom\threads\nsThread.cpp:1222)
#11: NS_ProcessNextEvent (mozilla-central\xpcom\threads\nsThreadUtils.cpp:481)
#12: mozilla::ipc::MessagePump::Run (mozilla-central\ipc\glue\MessagePump.cpp:87)
#13: MessageLoop::RunHandler (mozilla-central\ipc\chromium\src\base\message_loop.cc:309)
#14: MessageLoop::Run (mozilla-central\ipc\chromium\src\base\message_loop.cc:291)
#15: nsBaseAppShell::Run (mozilla-central\widget\nsBaseAppShell.cpp:139)
#16: nsAppShell::Run (mozilla-central\widget\windows\nsAppShell.cpp:406)
#17: nsAppStartup::Run (mozilla-central\toolkit\components\startup\nsAppStartup.cpp:272)
#18: XREMain::XRE_mainRun (mozilla-central\toolkit\xre\nsAppRunner.cpp:4566)
#19: XREMain::XRE_main (mozilla-central\toolkit\xre\nsAppRunner.cpp:4701)
#20: XRE_main (mozilla-central\toolkit\xre\nsAppRunner.cpp:4752)
#21: NS_internal_main (mozilla-central\browser\app\nsBrowserApp.cpp:331)
#22: wmain (mozilla-central\toolkit\xre\nsWindowsWMain.cpp:131)

which I think is different?

I do see this still being hit on the main thread for the HSTS thing when mReady is already true, which I think is expected because we effectively trigger the HSTS db load away from the main thread and then read things on the main thread later?

So I think this bug as filed is resolved, but I also think we probably want to figure out the AltSvc bits, too. Dana, does my analysis look correct to you?

Flags: needinfo?(gijskruitbosch+bugs) → needinfo?(dkeeler)

Dana Keeler (she/her) (use needinfo) [:keeler] (on leave)

Comment 17

•

4 years ago

I agree with your analysis. I'll file a bug for AltSvc.

Status: NEW → RESOLVED

Closed: 4 years ago

Flags: needinfo?(dkeeler)

Resolution: --- → DUPLICATE

Dana Keeler (she/her) (use needinfo) [:keeler] (on leave)

Updated

•

4 years ago

Updated

•

2 years ago

Performance Impact: --- → P2

Keywords: perf:responsiveness

Whiteboard: [qf:p2:responsiveness][fxperf:p2] → [fxperf:p2]