My STR: 1. Build mozilla-central with --enable-debug --enable-optimize 2. ./mach run 3. Visit http://perf-html.io/ and install the profiler addon 4. Visit https://twitter.com/RickByers/with_replies (Note: I also logged in to Twitter before doing this, though that's probably not necessary.) 5. Try to capture a profile -- do Ctrl Shift 1 to stop profiling, Ctrl Shift 1 again to start, and Ctrl Shift 2 to capture profile. ACTUAL RESULTS: A blank tab is opened, with the URL bar saying https://perf-html.io/from-addon/ . If I attach to the process with gdb (getting the PID by hovering the tab title), I can see that it's hanging while trying to acquire gPSMutex in paf_child().
Summary: New tab from "Capture Profile" is just blank, with hang in paf_child() → New tab from "Capture Profile" is just blank, with hang in paf_child()'s acquiring of gPSMutex
It looks like the lock in paf_child needs to go away. Here are my notes from my conversations with ehsan and jrmuizel about this: paf_child is an atfork handler that runs after the fork in the new process. Launching a new process on Linux works like this: You fork and then you exec from the new process. The atfork handler runs between the fork and the exec. Any locks that were held in the original process during the fork will remain held in the forked process, but in the forked process there are no other threads that might release the lock. So it's not a good idea to grab locks in the forked process unless you can be sure that they weren't held in the original pre-fork process. After the exec, you start with a completely blank slate and can grab any locks you like.
Keywords: hang, regression
OS: Unspecified → Linux
Hardware: Unspecified → All
(Here's the local workaround patch that I'm using to get the profiler working for me again, FWIW.)
I'll fix this on Monday. Thanks for identifying the problem, dholbert!
Assignee: nobody → n.nethercote
I just tripped over it. :D Credit to mstange & ehsan (& jrmuizel?) for figuring out the problem.
I have been having trouble capturing profiles in Linux debug builds ever since part 3 of bug 1342306 landed -- the new tab would open and hang without even showing the tab title. Part 1 of that bug added paf_child() and part 3 added gPSMutex. Furthermore, when you do "capture profile" in Nightly that often starts a new content process for the new tab, which explains why fork+exec are happening at that time. So that makes sense. So I absolutely believe there's a problem here. But I'm puzzled by a few things. - It makes sense that this is Linux-only (as I've seen). But why would it affect debug builds only? - I just tried to reproduce this and failed. That's weird, it was pretty reliable before! - What is the exact sequence that is supposed to cause the problem? Presumably gPSMutex must be locked when fork() is called, and then paf_child() tries to lock it again and that causes the hang? But I don't know how or why gPSMutex would be held when fork() is called. paf_prepare() locks and unlocks gPSMutex just prior to the fork(), and that seems to happen without any problems. I can easily remove paf_child() -- I only added it for consistency and it's not really necessary. I would like to understand better what's going on, though. mstange, any ideas?
Also, dholbert, how do you get the PID from hovering over a tab title? That doesn't seem to work for me...
(In reply to Nicholas Nethercote [:njn] from comment #8) > Also, dholbert, how do you get the PID from hovering over a tab title? That > doesn't seem to work for me... A tooltip appears when I hover the tab title. It has the page title, followed by the PID inside some parens. This happens in Nightly builds, too. (This was news to me last week when mstange told me about it -- I don't know any more about it beyond that.)
Ok, with glandium's help I've come up with a plausible sequence of events. main thread - calls fork(), which triggers a call to paf_prepare() thread A - registers or unregisters a thread, locks gPSMutex main thread - the fork actually happens, with thread A holding gPSMutex locked - in the child process, the main thread tries to capture the profile, which requires locking gPSMutex. But it's already locked, and thread A doesn't exist in the child, so gPSMutex is never unlocked and the main thread is forever stuck. - So the browser tab hangs. It all depends on thread A still holding gPSMutex lock when the actual fork occurs, and that is somewhat unreliable, though seemingly more common on debug builds, presumably due to their general slowness.
Attachment #8848947 - Flags: review?(jseward)
Comment on attachment 8848947 [details] [diff] [review] Remove paf_child() Review of attachment 8848947 [details] [diff] [review]: ----------------------------------------------------------------- This is good as far as it goes. But I wonder about a related scenario, in which "thread A" is actually the sampler thread and it holds gPSMutex at the point where the main thread calls fork, for a different reason -- because it's sampling a thread. For the Linux/Android implementation, I think this can't happen because paf_prepare calls SetIsPaused and so presumably the sampler thread is suspended. But the MacOS implementation doesn't use paf_*. So, could that still deadlock?
Attachment #8848947 - Flags: review?(jseward) → review+
I don't know why we don't need paf_* on Mac.
(In reply to Markus Stange [:mstange] from comment #13) > I don't know why we don't need paf_* on Mac. jseward added them to Linux in bug 837390. They are necessary there because Linux uses signals to trigger sampling, and the frequent signalling was preventing fork() from running correctly. Pausing the profiler prevents the signals from firing, allowing fork() to succeed. (There's a comment about this above paf_prepare().) Mac doesn't use signals so this problem isn't necessary. > This is good as far as it goes. But I wonder about a related scenario, in > which "thread A" is actually the sampler thread and it holds gPSMutex at > the point where the main thread calls fork, for a different reason -- because > it's sampling a thread. > > For the Linux/Android implementation, I think this can't happen because > paf_prepare calls SetIsPaused and so presumably the sampler thread is > suspended. > > But the MacOS implementation doesn't use paf_*. So, could that still > deadlock? It depends what the child process does. If it immediately calls exec() without trying to lock gPSMutex then we're ok. paf_child() was causing problems because it was running between the fork() and the exec().
It's not necessary and causes hangs. The patch also inlines setup_atfork() and moves the Linux-only code closer to the Linux-only PlatformInit(), and tweaks the comments a bit.
Attachment #8849330 - Flags: review?(jseward)
Attachment #8848947 - Attachment is obsolete: true
Attachment #8849330 - Flags: review?(jseward) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/3026456dca8ccbb45d1527e21dcff4ed5149630c Bug 1348374 - Remove paf_child(). r=jseward.
Status: NEW → RESOLVED
Last Resolved: 2 years ago
status-firefox55: affected → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
status-firefox52: --- → unaffected
status-firefox53: --- → unaffected
status-firefox54: --- → unaffected
status-firefox-esr52: --- → unaffected
Daniel, have you been able to use the profiler successfully, now that this is fixed?
I'll give it a try from the same machine & build config when I'm back in the office tomorrow morning.
This does indeed seem fixed. I'm not seeing any issues when profiling a local up-to-date build today (with the same build configuration / hardware as in comment 0). Thanks!
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.