Closed Bug 1348374 Opened 8 years ago Closed 8 years ago

New tab from "Capture Profile" is just blank, with hang in paf_child()'s acquiring of gPSMutex

Tracking

()

Status:

VERIFIED FIXED

Milestone:

mozilla55

Tracking Flags:

Tracking

Status

firefox52

---

unaffected

firefox-esr52

---

unaffected

firefox53

---

unaffected

firefox54

---

unaffected

firefox55

---

fixed

People

(Reporter: dholbert, Assigned: n.nethercote)

References

(Blocks 1 open bug)

Details

(Keywords: hang, regression)

Attachments

(3 files, 1 obsolete file)

backtrace of hang in blank tab's content process 8 years ago Daniel Holbert [:dholbert] 8.21 KB, text/plain		Details
(local workaround patch) 8 years ago Daniel Holbert [:dholbert] 936 bytes, text/plain		Details
Remove paf_child() 8 years ago Nicholas Nethercote [inactive] 1.46 KB, patch	jseward : review+	Details \| Diff \| Splinter Review
Remove paf_child() 8 years ago Nicholas Nethercote [inactive] 5.17 KB, patch	jseward : review+	Details \| Diff \| Splinter Review

Daniel Holbert [:dholbert]

Reporter

Description

•

8 years ago

No description provided.

Daniel Holbert [:dholbert]

Reporter

Comment 1

•

8 years ago

My STR: 1. Build mozilla-central with --enable-debug --enable-optimize 2. ./mach run 3. Visit http://perf-html.io/ and install the profiler addon 4. Visit https://twitter.com/RickByers/with_replies (Note: I also logged in to Twitter before doing this, though that's probably not necessary.) 5. Try to capture a profile -- do Ctrl Shift 1 to stop profiling, Ctrl Shift 1 again to start, and Ctrl Shift 2 to capture profile. ACTUAL RESULTS: A blank tab is opened, with the URL bar saying https://perf-html.io/from-addon/ . If I attach to the process with gdb (getting the PID by hovering the tab title), I can see that it's hanging while trying to acquire gPSMutex in paf_child().

Blocks: 1342306

Summary: New tab from "Capture Profile" is just blank, with hang in paf_child() → New tab from "Capture Profile" is just blank, with hang in paf_child()'s acquiring of gPSMutex

Daniel Holbert [:dholbert]

Reporter

Comment 2

•

8 years ago

Attached file backtrace of hang in blank tab's content process — Details

Markus Stange [:mstange]

Comment 3

•

8 years ago

It looks like the lock in paf_child needs to go away. Here are my notes from my conversations with ehsan and jrmuizel about this: paf_child is an atfork handler that runs after the fork in the new process. Launching a new process on Linux works like this: You fork and then you exec from the new process. The atfork handler runs between the fork and the exec. Any locks that were held in the original process during the fork will remain held in the forked process, but in the forked process there are no other threads that might release the lock. So it's not a good idea to grab locks in the forked process unless you can be sure that they weren't held in the original pre-fork process. After the exec, you start with a completely blank slate and can grab any locks you like.

Markus Stange [:mstange]

Updated

•

8 years ago

Blocks: 1329181

Keywords: hang, regression

OS: Unspecified → Linux

Hardware: Unspecified → All

Daniel Holbert [:dholbert]

Reporter

Comment 4

•

8 years ago

Attached file (local workaround patch) — Details

(Here's the local workaround patch that I'm using to get the profiler working for me again, FWIW.)

Nicholas Nethercote [inactive]

Assignee

Comment 5

•

8 years ago

I'll fix this on Monday. Thanks for identifying the problem, dholbert!

Assignee: nobody → n.nethercote

Daniel Holbert [:dholbert]

Reporter

Comment 6

•

8 years ago

I just tripped over it. :D Credit to mstange & ehsan (& jrmuizel?) for figuring out the problem.

Nicholas Nethercote [inactive]

Assignee

Comment 7

•

8 years ago

I have been having trouble capturing profiles in Linux debug builds ever since part 3 of bug 1342306 landed -- the new tab would open and hang without even showing the tab title. Part 1 of that bug added paf_child() and part 3 added gPSMutex. Furthermore, when you do "capture profile" in Nightly that often starts a new content process for the new tab, which explains why fork+exec are happening at that time. So that makes sense. So I absolutely believe there's a problem here. But I'm puzzled by a few things. - It makes sense that this is Linux-only (as I've seen). But why would it affect debug builds only? - I just tried to reproduce this and failed. That's weird, it was pretty reliable before! - What is the exact sequence that is supposed to cause the problem? Presumably gPSMutex must be locked when fork() is called, and then paf_child() tries to lock it again and that causes the hang? But I don't know how or why gPSMutex would be held when fork() is called. paf_prepare() locks and unlocks gPSMutex just prior to the fork(), and that seems to happen without any problems. I can easily remove paf_child() -- I only added it for consistency and it's not really necessary. I would like to understand better what's going on, though. mstange, any ideas?

Flags: needinfo?(mstange)

Nicholas Nethercote [inactive]

Assignee

Comment 8

•

8 years ago

Also, dholbert, how do you get the PID from hovering over a tab title? That doesn't seem to work for me...

Flags: needinfo?(dholbert)

Daniel Holbert [:dholbert]

Reporter

Comment 9

•

8 years ago

(In reply to Nicholas Nethercote [:njn] from comment #8) > Also, dholbert, how do you get the PID from hovering over a tab title? That > doesn't seem to work for me... A tooltip appears when I hover the tab title. It has the page title, followed by the PID inside some parens. This happens in Nightly builds, too. (This was news to me last week when mstange told me about it -- I don't know any more about it beyond that.)

Flags: needinfo?(dholbert)

Nicholas Nethercote [inactive]

Assignee

Comment 10

•

8 years ago

Ok, with glandium's help I've come up with a plausible sequence of events. main thread - calls fork(), which triggers a call to paf_prepare() thread A - registers or unregisters a thread, locks gPSMutex main thread - the fork actually happens, with thread A holding gPSMutex locked - in the child process, the main thread tries to capture the profile, which requires locking gPSMutex. But it's already locked, and thread A doesn't exist in the child, so gPSMutex is never unlocked and the main thread is forever stuck. - So the browser tab hangs. It all depends on thread A still holding gPSMutex lock when the actual fork occurs, and that is somewhat unreliable, though seemingly more common on debug builds, presumably due to their general slowness.

Nicholas Nethercote [inactive]

Assignee

Comment 11

•

8 years ago

Attached patch Remove paf_child() (obsolete) — Details — Splinter Review

<none>

Attachment #8848947 - Flags: review?(jseward)

Julian Seward [:jseward]

Comment 12

•

8 years ago

Comment on attachment 8848947 [details] [diff] [review] Remove paf_child() Review of attachment 8848947 [details] [diff] [review]: ----------------------------------------------------------------- This is good as far as it goes. But I wonder about a related scenario, in which "thread A" is actually the sampler thread and it holds gPSMutex at the point where the main thread calls fork, for a different reason -- because it's sampling a thread. For the Linux/Android implementation, I think this can't happen because paf_prepare calls SetIsPaused and so presumably the sampler thread is suspended. But the MacOS implementation doesn't use paf_*. So, could that still deadlock?

Attachment #8848947 - Flags: review?(jseward) → review+

Markus Stange [:mstange]

Comment 13

•

8 years ago

I don't know why we don't need paf_* on Mac.

Flags: needinfo?(mstange)

Nicholas Nethercote [inactive]

Assignee

Comment 14

•

8 years ago

(In reply to Markus Stange [:mstange] from comment #13) > I don't know why we don't need paf_* on Mac. jseward added them to Linux in bug 837390. They are necessary there because Linux uses signals to trigger sampling, and the frequent signalling was preventing fork() from running correctly. Pausing the profiler prevents the signals from firing, allowing fork() to succeed. (There's a comment about this above paf_prepare().) Mac doesn't use signals so this problem isn't necessary. > This is good as far as it goes. But I wonder about a related scenario, in > which "thread A" is actually the sampler thread and it holds gPSMutex at > the point where the main thread calls fork, for a different reason -- because > it's sampling a thread. > > For the Linux/Android implementation, I think this can't happen because > paf_prepare calls SetIsPaused and so presumably the sampler thread is > suspended. > > But the MacOS implementation doesn't use paf_*. So, could that still > deadlock? It depends what the child process does. If it immediately calls exec() without trying to lock gPSMutex then we're ok. paf_child() was causing problems because it was running between the fork() and the exec().

Nicholas Nethercote [inactive]

Assignee

Comment 15

•

8 years ago

Attached patch Remove paf_child() — Details — Splinter Review

It's not necessary and causes hangs. The patch also inlines setup_atfork() and moves the Linux-only code closer to the Linux-only PlatformInit(), and tweaks the comments a bit.

Attachment #8849330 - Flags: review?(jseward)

Nicholas Nethercote [inactive]

Assignee

Updated

•

8 years ago

Attachment #8848947 - Attachment is obsolete: true

Julian Seward [:jseward]

Updated

•

8 years ago

Attachment #8849330 - Flags: review?(jseward) → review+

Nicholas Nethercote [inactive]

Assignee

Comment 16

•

8 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/3026456dca8ccbb45d1527e21dcff4ed5149630c Bug 1348374 - Remove paf_child(). r=jseward.

Carsten Book [:Tomcat]

Comment 17

•

8 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/3026456dca8c

Status: NEW → RESOLVED

Closed: 8 years ago

status-firefox55: affected → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla55

Ryan VanderMeulen [:RyanVM]

Updated

•

8 years ago

status-firefox52: --- → unaffected

status-firefox53: --- → unaffected

status-firefox54: --- → unaffected

status-firefox-esr52: --- → unaffected

Markus Stange [:mstange]

Comment 18

•

8 years ago

Daniel, have you been able to use the profiler successfully, now that this is fixed?

Flags: needinfo?(dholbert)

Daniel Holbert [:dholbert]

Reporter

Comment 19

•

8 years ago

I'll give it a try from the same machine & build config when I'm back in the office tomorrow morning.

Daniel Holbert [:dholbert]

Reporter

Comment 20

•

8 years ago

This does indeed seem fixed. I'm not seeing any issues when profiling a local up-to-date build today (with the same build configuration / hardware as in comment 0). Thanks!

Status: RESOLVED → VERIFIED

Flags: needinfo?(dholbert)

Emilio Cobos Álvarez (:emilio)

Updated

•

3 years ago

Updated

•

3 years ago

Blocks: 1750452

Mathew Hodson

Updated

•

3 years ago