New tab from "Capture Profile" is just blank, with hang in paf_child()'s acquiring of gPSMutex

VERIFIED FIXED in Firefox 55

Status

()

Core
Gecko Profiler
VERIFIED FIXED
5 months ago
5 months ago

People

(Reporter: dholbert, Assigned: njn)

Tracking

(Blocks: 1 bug, {hang, regression})

Trunk
mozilla55
All
Linux
hang, regression
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox52 unaffected, firefox-esr52 unaffected, firefox53 unaffected, firefox54 unaffected, firefox55 fixed)

Details

Attachments

(3 attachments, 1 obsolete attachment)

Comment hidden (empty)
(Reporter)

Comment 1

5 months ago
My STR:
 1. Build mozilla-central with --enable-debug --enable-optimize
 2. ./mach run
 3. Visit http://perf-html.io/ and install the profiler addon
 4. Visit https://twitter.com/RickByers/with_replies
   (Note: I also logged in to Twitter before doing this, though that's probably not necessary.)
 5. Try to capture a profile -- do Ctrl Shift 1 to stop profiling, Ctrl Shift 1 again to start, and Ctrl Shift 2 to capture profile.

ACTUAL RESULTS:
A blank tab is opened, with the URL bar saying https://perf-html.io/from-addon/ . If I attach to the process with gdb (getting the PID by hovering the tab title), I can see that it's hanging while trying to acquire gPSMutex in paf_child().
Blocks: 1342306
Summary: New tab from "Capture Profile" is just blank, with hang in paf_child() → New tab from "Capture Profile" is just blank, with hang in paf_child()'s acquiring of gPSMutex
(Reporter)

Comment 2

5 months ago
Created attachment 8848591 [details]
backtrace of hang in blank tab's content process
It looks like the lock in paf_child needs to go away.

Here are my notes from my conversations with ehsan and jrmuizel about this:

paf_child is an atfork handler that runs after the fork in the new process.
Launching a new process on Linux works like this: You fork and then you exec from the new process. The atfork handler runs between the fork and the exec. Any locks that were held in the original process during the fork will remain held in the forked process, but in the forked process there are no other threads that might release the lock. So it's not a good idea to grab locks in the forked process unless you can be sure that they weren't held in the original pre-fork process.
After the exec, you start with a completely blank slate and can grab any locks you like.
Blocks: 1329181
Keywords: hang, regression
OS: Unspecified → Linux
Hardware: Unspecified → All
(Reporter)

Comment 4

5 months ago
Created attachment 8848595 [details]
(local workaround patch)

(Here's the local workaround patch that I'm using to get the profiler working for me again, FWIW.)
(Assignee)

Comment 5

5 months ago
I'll fix this on Monday. Thanks for identifying the problem, dholbert!
Assignee: nobody → n.nethercote
(Reporter)

Comment 6

5 months ago
I just tripped over it. :D  Credit to mstange & ehsan (& jrmuizel?) for figuring out the problem.
(Assignee)

Comment 7

5 months ago
I have been having trouble capturing profiles in Linux debug builds ever since part 3 of bug 1342306 landed -- the new tab would open and hang without even showing the tab title. Part 1 of that bug added paf_child() and part 3 added gPSMutex. Furthermore, when you do "capture profile" in Nightly that often starts a new content process for the new tab, which explains why fork+exec are happening at that time. So that makes sense.

So I absolutely believe there's a problem here. But I'm puzzled by a few things.

- It makes sense that this is Linux-only (as I've seen). But why would it affect debug builds only?

- I just tried to reproduce this and failed. That's weird, it was pretty reliable before!

- What is the exact sequence that is supposed to cause the problem? Presumably gPSMutex must be locked when fork() is called, and then paf_child() tries to lock it again and that causes the hang? But I don't know how or why gPSMutex would be held when fork() is called. paf_prepare() locks and unlocks gPSMutex just prior to the fork(), and that seems to happen without any problems.

I can easily remove paf_child() -- I only added it for consistency and it's not really necessary. I would like to understand better what's going on, though. mstange, any ideas?
Flags: needinfo?(mstange)
(Assignee)

Comment 8

5 months ago
Also, dholbert, how do you get the PID from hovering over a tab title? That doesn't seem to work for me...
Flags: needinfo?(dholbert)
(Reporter)

Comment 9

5 months ago
(In reply to Nicholas Nethercote [:njn] from comment #8)
> Also, dholbert, how do you get the PID from hovering over a tab title? That
> doesn't seem to work for me...

A tooltip appears when I hover the tab title.  It has the page title, followed by the PID inside some parens. This happens in Nightly builds, too.

(This was news to me last week when mstange told me about it -- I don't know any more about it beyond that.)
Flags: needinfo?(dholbert)
(Assignee)

Comment 10

5 months ago
Ok, with glandium's help I've come up with a plausible sequence of events.

main thread
- calls fork(), which triggers a call to paf_prepare()
  
thread A
- registers or unregisters a thread, locks gPSMutex
            
main thread
- the fork actually happens, with thread A holding gPSMutex locked
- in the child process, the main thread tries to capture the profile, which
  requires locking gPSMutex. But it's already locked, and thread A doesn't
  exist in the child, so gPSMutex is never unlocked and the main thread is
  forever stuck.
- So the browser tab hangs.

It all depends on thread A still holding gPSMutex lock when the actual fork
occurs, and that is somewhat unreliable, though seemingly more common on debug
builds, presumably due to their general slowness.
(Assignee)

Comment 11

5 months ago
Created attachment 8848947 [details] [diff] [review]
Remove paf_child()

<none>
Attachment #8848947 - Flags: review?(jseward)
Comment on attachment 8848947 [details] [diff] [review]
Remove paf_child()

Review of attachment 8848947 [details] [diff] [review]:
-----------------------------------------------------------------

This is good as far as it goes.  But I wonder about a related scenario, in
which "thread A" is actually the sampler thread and it holds gPSMutex at
the point where the main thread calls fork, for a different reason -- because
it's sampling a thread.

For the Linux/Android implementation, I think this can't happen because 
paf_prepare calls SetIsPaused and so presumably the sampler thread is suspended.

But the MacOS implementation doesn't use paf_*.  So, could that still deadlock?
Attachment #8848947 - Flags: review?(jseward) → review+
I don't know why we don't need paf_* on Mac.
Flags: needinfo?(mstange)
(Assignee)

Comment 14

5 months ago
(In reply to Markus Stange [:mstange] from comment #13)
> I don't know why we don't need paf_* on Mac.

jseward added them to Linux in bug 837390. They are necessary there because Linux uses signals to trigger sampling, and the frequent signalling was preventing fork() from running correctly. Pausing the profiler prevents the signals from firing, allowing fork() to succeed. (There's a comment about this above paf_prepare().) Mac doesn't use signals so this problem isn't necessary.


> This is good as far as it goes.  But I wonder about a related scenario, in
> which "thread A" is actually the sampler thread and it holds gPSMutex at
> the point where the main thread calls fork, for a different reason -- because
> it's sampling a thread.
> 
> For the Linux/Android implementation, I think this can't happen because 
> paf_prepare calls SetIsPaused and so presumably the sampler thread is
> suspended.
> 
> But the MacOS implementation doesn't use paf_*.  So, could that still
> deadlock?

It depends what the child process does. If it immediately calls exec() without trying to lock gPSMutex then we're ok. paf_child() was causing problems because it was running between the fork() and the exec().
(Assignee)

Comment 15

5 months ago
Created attachment 8849330 [details] [diff] [review]
Remove paf_child()

It's not necessary and causes hangs.

The patch also inlines setup_atfork() and moves the Linux-only code closer to
the Linux-only PlatformInit(), and tweaks the comments a bit.
Attachment #8849330 - Flags: review?(jseward)
(Assignee)

Updated

5 months ago
Attachment #8848947 - Attachment is obsolete: true
Attachment #8849330 - Flags: review?(jseward) → review+
(Assignee)

Comment 16

5 months ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/3026456dca8ccbb45d1527e21dcff4ed5149630c
Bug 1348374 - Remove paf_child(). r=jseward.

Comment 17

5 months ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/3026456dca8c
Status: NEW → RESOLVED
Last Resolved: 5 months ago
status-firefox55: affected → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
status-firefox52: --- → unaffected
status-firefox53: --- → unaffected
status-firefox54: --- → unaffected
status-firefox-esr52: --- → unaffected
Daniel, have you been able to use the profiler successfully, now that this is fixed?
Flags: needinfo?(dholbert)
(Reporter)

Comment 19

5 months ago
I'll give it a try from the same machine & build config when I'm back in the office tomorrow morning.
(Reporter)

Comment 20

5 months ago
This does indeed seem fixed.  I'm not seeing any issues when profiling a local up-to-date build today (with the same build configuration / hardware as in comment 0).

Thanks!
Status: RESOLVED → VERIFIED
Flags: needinfo?(dholbert)
You need to log in before you can comment on or make changes to this bug.