Closed Bug 1510226 Opened 6 years ago Closed 3 years ago

NS_NewNamedThread blocks the main thread

Categories

(Core :: XPCOM, enhancement, P3)

enhancement

Tracking

()

RESOLVED FIXED
87 Branch
Performance Impact high
Tracking Status
firefox87 --- fixed

People

(Reporter: florian, Assigned: alexical)

Details

(Keywords: perf:pageload, Whiteboard: [bhr:NS_NewNamedThread])

Attachments

(1 file, 1 obsolete file)

See this profile: https://perfht.ml/2Ri632q
This is also something I've been seeing in BHR data.

There's some irony in blocking the main thread to start an off main-thread operation in order to not block the main thread :-(.
Whiteboard: [perf] → [perf][qf]
I'm open to ideas here, but I'm pretty sure we rely on this behavior and there's not really a better way to implement it.  (In general, it's not safe to just fire off the thread without some sort of synchronization when you're passing data into the newly-started thread.)  Inclined to WONTFIX this.

OTOH, the profile says that the thread pool underlying the stream transport service is starting up a new thread.  It's possible we could tune the heuristics for the stream transport service's thread pool so it didn't shut down threads so eagerly.
Priority: -- → P5
Also, IIUC, the profile is showing that we're starting new threads...every 5ms?  Am I reading that correctly?  What is writing so frequently?
Flags: needinfo?(florian)
(In reply to Nathan Froyd [:froydnj] from comment #2)
> Also, IIUC, the profile is showing that we're starting new threads...every
> 5ms?  Am I reading that correctly?  What is writing so frequently?

We are attempting to sample at a 1ms rate but that usually doesn't work on Windows, so we have a sample every 2ms and a gap between samples. My profile link shows a single NS_NewNamedThread call blocking the main thread for 86ms, which is long enough to cause user perceptible jank.

I don't think P5 is the right priority for something that happens so often and is visible in BHR data.
Flags: needinfo?(florian)
(In reply to Nathan Froyd [:froydnj] from comment #1)
> I'm open to ideas here, but I'm pretty sure we rely on this behavior and
> there's not really a better way to implement it.  (In general, it's not safe
> to just fire off the thread without some sort of synchronization when you're
> passing data into the newly-started thread.)

Could the data be stored somehow, and sent to the new thread only once it is done starting?
(In reply to Florian Quèze [:florian] from comment #4)
> (In reply to Nathan Froyd [:froydnj] from comment #1)
> > I'm open to ideas here, but I'm pretty sure we rely on this behavior and
> > there's not really a better way to implement it.  (In general, it's not safe
> > to just fire off the thread without some sort of synchronization when you're
> > passing data into the newly-started thread.)
> 
> Could the data be stored somehow, and sent to the new thread only once it is
> done starting?

That's essentially what the code is already doing: we're forcing NS_NewNamedThread to not return until we know the thread is done starting.  We need some sort of synchronization between the creating thread and the created thread to let us know when we can start running things on the created thread.

One alternative would be to set up a boolean flag that some runnable continually dispatched to the creating thread would check, and then, once that boolean was set, would run the equivalent of:

https://searchfox.org/mozilla-central/source/xpcom/threads/nsThreadPool.cpp#121-137

I'm not sure that's actually better than the current situation.
(In reply to Nathan Froyd [:froydnj] from comment #5)
> (In reply to Florian Quèze [:florian] from comment #4)
> > (In reply to Nathan Froyd [:froydnj] from comment #1)
> we're forcing
> NS_NewNamedThread to not return until we know the thread is done starting. 

That's what we should change in my opinion.


> One alternative would be to set up a boolean flag that some runnable
> continually dispatched to the creating thread would check, and then, once
> that boolean was set, would run the equivalent of:
> 
> https://searchfox.org/mozilla-central/source/xpcom/threads/nsThreadPool.
> cpp#121-137
> 
> I'm not sure that's actually better than the current situation.

Anything that doesn't require janking the main thread sounds better. BHR currently shows about 3500 hang reports involving NS_NewNamedThread http://queze.net/bhr/#filter=NS_NewNamedThread (this is just reports from the Nightly population, for a single nightly build, we only capture a stack once the hang has lasted at least 128ms).

If instead of continually dispatching a runnable to the creating thread we could come up with a way for the new thread to dispatch a runnable to the creating thread once it's ready, that would be even better.
(In reply to Florian Quèze [:florian] from comment #6)
> Anything that doesn't require janking the main thread sounds better. BHR
> currently shows about 3500 hang reports involving NS_NewNamedThread
> http://queze.net/bhr/#filter=NS_NewNamedThread (this is just reports from
> the Nightly population, for a single nightly build, we only capture a stack
> once the hang has lasted at least 128ms).

Thanks for that context.  That page is saying that we have hundreds of hangs of over a minute because of thread creation?

> If instead of continually dispatching a runnable to the creating thread we
> could come up with a way for the new thread to dispatch a runnable to the
> creating thread once it's ready, that would be even better.

That does sound better than the polling scheme.  We could do some sort of promise-ish API.
Priority: P5 → P3
Whiteboard: [perf][qf] → [perf][qf:p2:responsiveness]
(In reply to Nathan Froyd [:froydnj] from comment #7)
> (In reply to Florian Quèze [:florian] from comment #6)
> > Anything that doesn't require janking the main thread sounds better. BHR
> > currently shows about 3500 hang reports involving NS_NewNamedThread
> > http://queze.net/bhr/#filter=NS_NewNamedThread (this is just reports from
> > the Nightly population, for a single nightly build, we only capture a stack
> > once the hang has lasted at least 128ms).
> 
> Thanks for that context.  That page is saying that we have hundreds of hangs
> of over a minute because of thread creation?

That page doesn't provide information about the length of individual hangs.

Today for build 20181118, it shows 3038 hangs, with 108 different stacks. The total time spent in these hangs is 1641s. So the mean hang length when creating a thread is 540ms. Given that we only collect a hang stack when the main thread is unresponsive for 128ms or more, we can expect new thread creation to cause jank way more often.

The thing that makes this bug really bad in my opinion is that typically whenever something is janking the main thread (eg. due to I/O, a call to an operating system function, a long computation, ...) we recommand moving that thing off main-thread. And now we see that creating a thread is also an operation that causes jank :-(.
(In reply to Florian Quèze [:florian] from comment #8)
> (In reply to Nathan Froyd [:froydnj] from comment #7)
> > (In reply to Florian Quèze [:florian] from comment #6)
> > > Anything that doesn't require janking the main thread sounds better. BHR
> > > currently shows about 3500 hang reports involving NS_NewNamedThread
> > > http://queze.net/bhr/#filter=NS_NewNamedThread (this is just reports from
> > > the Nightly population, for a single nightly build, we only capture a stack
> > > once the hang has lasted at least 128ms).
> > 
> > Thanks for that context.  That page is saying that we have hundreds of hangs
> > of over a minute because of thread creation?
> 
> That page doesn't provide information about the length of individual hangs.

I'm confused why the second column says "time (s)" then.  Is that total time for the number of hangs?  If so, maybe it should be labeled as such?

> The thing that makes this bug really bad in my opinion is that typically
> whenever something is janking the main thread (eg. due to I/O, a call to an
> operating system function, a long computation, ...) we recommand moving that
> thing off main-thread. And now we see that creating a thread is also an
> operation that causes jank :-(.

I don't think we expected new thread creation to be particularly expensive.  But it apparently is, and perhaps we should start revising code accordingly.

For build 2018118, it's not even clear to me that doing a promise-based solution to solve the PR_Wait hangs would even solve very much, since the top five, and a good proportion of the top 25 or so, hangs are in PR_CreateThread itself, which must be done on the calling thread.  To fix *that*, you'd need a thread-creating thread or similar, which sounds a bit silly. :)
(In reply to Nathan Froyd [:froydnj] from comment #9)
> (In reply to Florian Quèze [:florian] from comment #8)
> > (In reply to Nathan Froyd [:froydnj] from comment #7)
> > > (In reply to Florian Quèze [:florian] from comment #6)
> > > > Anything that doesn't require janking the main thread sounds better. BHR
> > > > currently shows about 3500 hang reports involving NS_NewNamedThread
> > > > http://queze.net/bhr/#filter=NS_NewNamedThread (this is just reports from
> > > > the Nightly population, for a single nightly build, we only capture a stack
> > > > once the hang has lasted at least 128ms).
> > > 
> > > Thanks for that context.  That page is saying that we have hundreds of hangs
> > > of over a minute because of thread creation?
> > 
> > That page doesn't provide information about the length of individual hangs.
> 
> I'm confused why the second column says "time (s)" then.  Is that total time
> for the number of hangs?  If so, maybe it should be labeled as such?

I changed it to "total time (s)".

> > The thing that makes this bug really bad in my opinion is that typically
> > whenever something is janking the main thread (eg. due to I/O, a call to an
> > operating system function, a long computation, ...) we recommand moving that
> > thing off main-thread. And now we see that creating a thread is also an
> > operation that causes jank :-(.
> 
> I don't think we expected new thread creation to be particularly expensive. 
> But it apparently is, and perhaps we should start revising code accordingly.
> 
> For build 2018118, it's not even clear to me that doing a promise-based
> solution to solve the PR_Wait hangs would even solve very much, since the
> top five, and a good proportion of the top 25 or so, hangs are in
> PR_CreateThread itself, which must be done on the calling thread.

If you reload, you'll see it's actually in beginthreadex, the Windows API to create a thread, so I guess your point still holds.

(My visualisation tool for BHR data is trying to merge similar stacks that are different only in irrelevant code that's not within our control, so I'm keeping only one stack frame outside of our code. "nss3" wasn't marked as being our code, so "PR_CreateThread" was the one 'foreign code' stack frame. I've added nss3 to the list.)

This tool is still very much a draft or WIP, so feedback is welcome.
(In reply to Florian Quèze [:florian] from comment #10)
> (In reply to Nathan Froyd [:froydnj] from comment #9)
> > I'm confused why the second column says "time (s)" then.  Is that total time
> > for the number of hangs?  If so, maybe it should be labeled as such?
> 
> I changed it to "total time (s)".

Thank you!

> > For build 2018118, it's not even clear to me that doing a promise-based
> > solution to solve the PR_Wait hangs would even solve very much, since the
> > top five, and a good proportion of the top 25 or so, hangs are in
> > PR_CreateThread itself, which must be done on the calling thread.
> 
> If you reload, you'll see it's actually in beginthreadex, the Windows API to
> create a thread, so I guess your point still holds.
> 
> (My visualisation tool for BHR data is trying to merge similar stacks that
> are different only in irrelevant code that's not within our control, so I'm
> keeping only one stack frame outside of our code. "nss3" wasn't marked as
> being our code, so "PR_CreateThread" was the one 'foreign code' stack frame.
> I've added nss3 to the list.)

Thanks for the explanation.  I think it's actually better than "nss3" is marked as "foreign" code (even though Mozilla develops it), as Gecko basically treats it like a black box anyway, and seeing:

beginthreadex ucrtbase
PR_MD_CREATE_THREAD nss3
PR_NativeCreateThread nss3
PR_CreateThread nss3
PR_CreateThread nss3
nsThread::Init(nsTSubstring<char> const &) xul
... more frames

is not especially more informative than:

PR_CreateThread nss3
nsThread::Init(nsTSubstring<char> const &) xul
... more frames

The current scheme also means that we have beginthreadex/_beginthreadex confusion, as well as "<unsymbolicated> kernelbase" or "<unsymbolicated> libsystem_kernel.dylib" entries, which aren't obvious what they represent until you look at the tooltip, whereas before, the latter would have just been "PR_Wait" or whatever.  Looking at what Socorro does for crash signatures to eliminate "uninteresting" frames might be useful here.
(In reply to Nathan Froyd [:froydnj] from comment #9)

> I don't think we expected new thread creation to be particularly expensive. 
> But it apparently is, and perhaps we should start revising code accordingly.

This seems to be a Windows-only problem.

> For build 2018118, it's not even clear to me that doing a promise-based
> solution to solve the PR_Wait hangs would even solve very much, since the
> top five, and a good proportion of the top 25 or so, hangs are in
> PR_CreateThread itself, which must be done on the calling thread.  To fix
> *that*, you'd need a thread-creating thread or similar, which sounds a bit
> silly. :)

Yeah, I guess this is what we'll need. Instead of creating a thread directly, we would send a message to a 'thread-creating thread', and later get back a handle to the new thread.
Whiteboard: [perf][qf:p2:responsiveness] → [qf:p2:responsiveness]

Here is another profile where this wastes a lot of main-thread time in both the parent and content processes: https://perfht.ml/2DixnYF

In the "Privileged Content" track, it visibly happens for:
mozilla::net::nsStreamTransportService::Dispatch,
mozilla::net::nsSocketTransportService::Init,
mozilla::image::DecodePoolImpl::CreateThread,
nsHtml5Module::GetStreamParserThread and
mozilla::VideoDecoderManagerChild::InitializeThread.

I think these are things that happen during page load (most of these are before the end of the TFFI marker).

Putting this back into the [qf] triage queue, since it looks like this can also affect page load.

Whiteboard: [qf:p2:responsiveness] → [qf]
Whiteboard: [qf] → [qf:p1:pageload]

https://perfht.ml/2In0Cz1 shows that we encounter this several times when attempting to load about:home after startup.

(Taking a look at this per Florian's nudge)

(In reply to Nathan Froyd [:froydnj] from comment #1)

I'm open to ideas here, but I'm pretty sure we rely on this behavior and
there's not really a better way to implement it. (In general, it's not safe
to just fire off the thread without some sort of synchronization when you're
passing data into the newly-started thread.) Inclined to WONTFIX this.

OTOH, the profile says that the thread pool underlying the stream transport
service is starting up a new thread. It's possible we could tune the
heuristics for the stream transport service's thread pool so it didn't shut
down threads so eagerly.

Nathan, maybe you can short-circuit my digging - can you explain exactly how we rely on this behavior? It's not really clear to me from a theory perspective why we can't queue something up to be run whether the thread has finished starting up or not.

Assignee: nobody → dothayer
Status: NEW → ASSIGNED
Flags: needinfo?(nfroyd)

(In reply to Doug Thayer [:dthayer] from comment #16)

(Taking a look at this per Florian's nudge)

(In reply to Nathan Froyd [:froydnj] from comment #1)

I'm open to ideas here, but I'm pretty sure we rely on this behavior and
there's not really a better way to implement it. (In general, it's not safe
to just fire off the thread without some sort of synchronization when you're
passing data into the newly-started thread.) Inclined to WONTFIX this.

Nathan, maybe you can short-circuit my digging - can you explain exactly how we rely on this behavior? It's not really clear to me from a theory perspective why we can't queue something up to be run whether the thread has finished starting up or not.

We already queue things to run before we're sure whether the thread has started, e.g.:

https://searchfox.org/mozilla-central/source/xpcom/threads/nsThread.cpp#669-675

The safety concern comes from the initialization data we pass in: when we initialize threads, we put ThreadInitData on the stack:

https://searchfox.org/mozilla-central/source/xpcom/threads/nsThread.cpp#659

and pass a pointer to said ThreadInitData into PR_CreateThread:

https://searchfox.org/mozilla-central/source/xpcom/threads/nsThread.cpp#662

and our ThreadFunc accesses that data:

https://searchfox.org/mozilla-central/source/xpcom/threads/nsThread.cpp#411-444

So that ThreadInitData needs to live until we receive some notification that the thread we wanted to start has actually started. If we don't wait, ThreadInitData might be destroyed before the new thread has a chance to start running.

I guess we could allocate ThreadInitData on the heap instead of on the stack, which would eliminate that concern. I would be a little concerned about code that assumes that NS_NewNamedThread is, in effect, a blocking call, though (e.g. the code in nsThreadPool.

Other alternatives are to have some other thread for thread pools that starts threads, or to make the case in PutEvent where we have to start a new thread a runnable that runs in the thread pool itself to start a new thread and then services the original runnable.

Flags: needinfo?(nfroyd)

Apologies, I just noticed the conversation above regarding the BHR data primarily featuring PR_CreateThread rather than the wait. This is interesting. In the profiles Florian links, the time is spend primarily in the wait. In the BHR data, however, PR_CreateThread does indeed take more of the share, but the share of NS_NewNamedThread in general is incredibly small compared to overal BHR share.

Florian, can you ensure that your data lines up with the data as reported by arewesmoothyet? It seems like quite an insignificant portion spend creating threads in that view.

Flags: needinfo?(florian)

(In reply to Doug Thayer [:dthayer] from comment #18)

Apologies, I just noticed the conversation above regarding the BHR data primarily featuring PR_CreateThread rather than the wait.

Sorry for not replying sooner. I just poked at awsy a bit. If I focus on the nsThread::Init function, then 88.8% is PR_Wait and 7.7% is PR_CreateThread. Does this answer your question?

https://arewesmoothyet.com/?category=all&durationSpec=128_512&mode=explore&payloadID=8b3d90fab13f434983feb5f4520ab134&search=nsThread%3A%3AInit&thread=1&transforms=ff-21870

In any case, the profiles I got even on slow machines had wait times that were typically below the BHR 128ms threshold, so I expect this issue to be underreported by BHR.

Flags: needinfo?(florian)

Ah, okay - I was focusing on NS_NewNamedThread, in which >25% is in beginthreadex/_beginthreadex (https://arewesmoothyet.com/?category=all&durationSpec=128_512&invertCallstack&mode=explore&payloadID=8b3d90fab13f434983feb5f4520ab134&search=ns_newnamedth&thread=1&transforms=mf-3626~mf-3627~mf-3628).

I'm trying to assess whether we want to do something goofy like a thread-creating thread, which wouldn't reduce any of the time in beginthreadex, or just allocate the ThreadInitData on the heap, which is substantially simpler and I suspect less likely to have unintended consequences, but only covers 75% of the observed cost of NS_NewNamedThread. In either case NS_NewNamedThread is a fairly tiny portion of the overall BHR data as shown in the awsy graph, though for what could be quite a small change it's likely worth it anyway.

What I'm more curious about is why you're seeing thread creation in profiles at an apparently reliable rate given that it's such a small part of BHR data overall. I'm wondering if there's something in the 2018(?) reference hardware that makes thread creation comparatively more expensive.

(In reply to Doug Thayer [:dthayer] from comment #20)

What I'm more curious about is why you're seeing thread creation in profiles at an apparently reliable rate given that it's such a small part of BHR data overall.

But in my profiles I'm never seeing it be long enough to be reported by BHR.

I'm wondering if there's something in the 2018(?) reference hardware that makes thread creation comparatively more expensive.

Maybe McAfee?

Another recent profile showing how much this impacts startup on the 2018 reference hardware: https://perfht.ml/2RqmbQj

GetCurrentPhysicalThread and GetCurrentVirtualThread are, in practice,
identical, as the TLS override that GetCurrentVirtualThread depends on
is never actually set. This simply removes that and renames some things/
deletes some comments.

To remove the blocking inside nsThread::Init, two things needed
to happen:

  • Switch the ThreadInitData value passed as the argument for
    ThreadFunc to a heap allocation, so that it can outlive the call
    to nsThread::Init.
  • Initialize mThread and mEventTarget->mThread to the return
    value of PR_CreateThread, so that to the callers, checks which
    depend on these values being set can continue to function.

Depends on D41247

(In reply to Nathan Froyd [:froydnj] from comment #17)

I guess we could allocate ThreadInitData on the heap instead of on the stack, which would eliminate that concern. I would be a little concerned about code that assumes that NS_NewNamedThread is, in effect, a blocking call, though (e.g. the code in nsThreadPool.

Can you clarify in what way the code in nsThreadPool depends on this? I could only anywhere dependencies on mThread / mVirtualThread being set, which we can do from nsThread::Init.

Is there anything in particular you would like to see to be convinced that we can safely do this?

This bug seems to have stalled out. Did you have more feedback for dthayer given comment 25, froydnj?

Flags: needinfo?(nfroyd)

(In reply to Mike Conley (:mconley) (:⚙️) from comment #26)

This bug seems to have stalled out. Did you have more feedback for dthayer given comment 25, froydnj?

Doug and I have been chatting in the phab issue (yay separated comments...). Is there anything there that I need to respond to?

Flags: needinfo?(nfroyd) → needinfo?(dothayer)

Bah, how embarrassing. I hadn't checked Phabricator for the active conversation, and just assumed that we'd stalled out because of the lack of activity in the bug. Sorry about the noise - carry on!

Flags: needinfo?(dothayer)

Sorry, I should have been a bit more clear in the Phabricator comments. I was hoping for you to just quickly look over my analysis of the possible side effects of this work and let me know if it all sounds reasonable. To be clear: from what I am seeing the only thing that remains to be addressed is the technical UB around nsThread::InitCommons static variable. The AddRef / Release issue turns out to be covered by an extra AddRef / Release pair in Init / ThreadFunc.

Flags: needinfo?(nfroyd)

(In reply to Doug Thayer [:dthayer] from comment #29)

Sorry, I should have been a bit more clear in the Phabricator comments. I was hoping for you to just quickly look over my analysis of the possible side effects of this work and let me know if it all sounds reasonable. To be clear: from what I am seeing the only thing that remains to be addressed is the technical UB around nsThread::InitCommons static variable. The AddRef / Release issue turns out to be covered by an extra AddRef / Release pair in Init / ThreadFunc.

Commented in phab.

Flags: needinfo?(nfroyd)
Pushed by dothayer@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/b6f670610dc3
Remove vestigial references to cooperative scheduling r=froydnj
https://hg.mozilla.org/integration/autoland/rev/cb739de6606d
Do not block main thread in nsThread::Init r=froydnj

There are some r+ patches which didn't land and no activity in this bug for 2 weeks.
:dthayer, could you have a look please?
For more information, please visit auto_nag documentation.

Flags: needinfo?(dothayer)

I'm paused on this for the time being. The issue that manifested the failures in comment 32 seems, as far as I can tell, to be an existing problem, which is just made significantly more likely due to timing differences with this patch applied. However, I wanted to chat with Nathan about it, and he's out until the 28th.

Flags: needinfo?(dothayer)

Comment on attachment 9084080 [details]
Bug 1510226 - Remove vestigial references to cooperative scheduling r?froydnj

Revision D41247 was moved to bug 1602646. Setting attachment 9084080 [details] to obsolete.

Attachment #9084080 - Attachment is obsolete: true
Flags: needinfo?(dothayer)
Whiteboard: [qf:p1:pageload] → [qf:p1:pageload][bhr:NS_NewNamedThread]
Attachment #9084081 - Attachment description: Bug 1510226 - Do not block main thread in nsThread::Init r?froydnj → Bug 1510226 - Do not block main thread in nsThread::Init r?kriswright

Hey Kris, did you get a chance to take a look at the patch at all? Just pinging you here in case it fell off your radar.

Flags: needinfo?(dothayer) → needinfo?(kwright)

(In reply to Doug Thayer [:dthayer] (he/him) from comment #36)

Hey Kris, did you get a chance to take a look at the patch at all? Just pinging you here in case it fell off your radar.

Sorry about that! I was looking into this then all the holidays happened and it got buried in my queue. About halfway through right now.

Flags: needinfo?(kwright)
Pushed by dothayer@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/bfe84c7b6455
Do not block main thread in nsThread::Init r=froydnj,KrisWright
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 87 Branch
Performance Impact: --- → P1
Keywords: perf:pageload
Whiteboard: [qf:p1:pageload][bhr:NS_NewNamedThread] → [bhr:NS_NewNamedThread]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: