Deadlock in the profiler because locked_profiler_stop notifies observers

RESOLVED FIXED in Firefox 55

Status

()

Core
Gecko Profiler
P1
normal
RESOLVED FIXED
2 months ago
a month ago

People

(Reporter: mstange, Assigned: njn)

Tracking

(Blocks: 1 bug, {regression})

Trunk
mozilla55
regression
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox52 unaffected, firefox53 unaffected, firefox54 unaffected, firefox55 fixed, firefox-esr52 unaffected)

Details

Attachments

(4 attachments, 1 obsolete attachment)

(Reporter)

Description

2 months ago
Example call stacks:

> [...]
> NS_InvokeByIndex
> nsProfiler::StopProfiler
> profiler_stop
> locked_profiler_stop <-- !!
> nsObserverService::NotifyObservers
> SharedStub
> PrepareAndDispatch
> nsXPCWrappedJSClass::CallMethod
> JS_CallFunctionValue
> js::Call
> [... js stuff]
> mozilla::dom::Element::SetAttr
> mozilla::dom::Element::SetAttrAndNotify
> nsNodeUtils::AttributeChanged
> mozilla::PresShell::AttributeChanged
> mozilla::GeckoRestyleManager::AttributeChanged
> nsImageBoxFrame::AttributeChanged
> mozilla::PresShell::FrameNeedsReflow
> nsRefreshDriver::AddLayoutFlushObserver
> profiler_get_backtrace
> PR_Lock

> [...]
> mozilla::dom::ContentChild::RecvStopProfiler
> profiler_stop
> locked_profiler_stop <-- !!
> nsObserverService::NotifyObservers
> SharedStub
> PrepareAndDispatch
> nsXPCWrappedJSClass::CallMethod
> JS_CallFunctionValue
> js::Call
> js::InternalCallOrConstruct
> js::RunScript
> [... js stuff]
> js::jit::Bailout
> js::jit::BailoutIonToBaseline
> ProfilerJSEventMarker
> PR_Lock
(Assignee)

Comment 1

2 months ago
Bleh. I'll take a look on Monday.
Assignee: nobody → n.nethercote
Comment hidden (mozreview-request)
(Reporter)

Updated

2 months ago
Blocks: 1273422
Comment hidden (mozreview-request)
(Assignee)

Comment 4

2 months ago
Oh dear. I assigned this to myself and said I would look at it on Monday, but I ended up looking at it yesterday and writing some patches. They are quite similar to yours, but do a bit more refactoring as well, and I fixed a leak I found along the way.

Also, I think yours misses the fact that profiler_init() calls profiler_start().

How about I upload them now for your consideration? They're a little under-tested, but I might as well show you them now given the circumstances.
(Assignee)

Comment 5

2 months ago
Created attachment 8846385 [details] [diff] [review]
(part 1) - Fix SamplerThread leak in profiler_start()

profiler_start() can call locked_profiler_stop(). In that case it must then
call SamplerThread::Join() in order to free the SamplerThread, but it currently
doesn't.

This patch adds such a call. It also marks locked_profiler_start() with
MOZ_MUST_USE to make this mistake less likely in the future.
Attachment #8846385 - Flags: review?(mstange)
(Assignee)

Comment 6

2 months ago
Created attachment 8846386 [details] [diff] [review]
(part 2) - Merge SamplerThread::Join() into ~SamplerThread()

SamplerThread::Join() is always called just before the SamplerThread object is
deleted. Might as well combine them.
Attachment #8846386 - Flags: review?(mstange)
(Assignee)

Comment 7

2 months ago
Created attachment 8846387 [details] [diff] [review]
(part 3) - Do most NotifyObserver() calls when gPSMutex is unlocked

Calling NotifyObserver() with gPSMutex locked is a bad idea; arbitrary code can
run in observers, which can easily include code that calls other profiler
functions that themselves lock gPSMutex, causing deadlock. This has been seen
in practise with locked_profiler_stop().

This patch moves all but one of the NotifyObserver() calls in platform.cpp to
after the sections where gPSMutex is locked. The remaining call (for the
"profiler-subprocess") is harmless, because it just calls a simple callback
implemented within platform.cpp, and hard to move.

In the future we plan to allow profiler_start() and profiler_stop() to be
called from different threads. When that happens, it will be possible for the
"profiler-start" and "profiler-stop" notifications to arrive out of order.
Unfortunately this seems difficult to avoid. (Well, recursive mutexes would
make this problem much easier, but we don't have those...)
Attachment #8846387 - Flags: review?(mstange)
(Assignee)

Comment 8

2 months ago
Created attachment 8846388 [details] [diff] [review]
(part 4) - Clean up notification in the profiler

This patch does the following.

- Introduces NotifyObservers() for the simple notification cases in
  platform.cpp.

- Removes profiler_lock() and profiler_unlock() because they do notifications
  that the profiler add-on no longer listens for.
Attachment #8846388 - Flags: review?(mstange)
(Assignee)

Comment 9

2 months ago
It sounds like you have confirmed that this deadlock is the cause of bug 1273422? Good to know.
(Reporter)

Comment 10

2 months ago
(In reply to Nicholas Nethercote [:njn] from comment #4)

Oh, oops. Well, I didn't spend very much time on it, and I trust your patches more than I trust mine. :)

(In reply to Nicholas Nethercote [:njn] from comment #9)
> It sounds like you have confirmed that this deadlock is the cause of bug
> 1273422? Good to know.

I'm not 100% certain but the log does say "###!!! ERROR: Potential deadlock detected:" so I think it's fairly likely.

Thanks for uploading your patches. On the other hand now I can't promise I'll look at them before Monday (your Tuesday)...
(Assignee)

Comment 11

2 months ago
mozreview-review
Comment on attachment 8846352 [details]
Bug 1346356 - Do not notify observers while gPSMutex is locked.

https://reviewboard.mozilla.org/r/119408/#review121304

I will r- because my changes subsume yours, and ultimately I think leave the code in a better state overall. Sorry for the overlap!
Attachment #8846352 - Flags: review?(n.nethercote) → review-
(Assignee)

Comment 12

2 months ago
> Thanks for uploading your patches. On the other hand now I can't promise
> I'll look at them before Monday (your Tuesday)...

It's a public holiday here on Monday anyway :)
(Reporter)

Comment 13

2 months ago
Comment on attachment 8846385 [details] [diff] [review]
(part 1) - Fix SamplerThread leak in profiler_start()

Review of attachment 8846385 [details] [diff] [review]:
-----------------------------------------------------------------

::: tools/profiler/core/platform.cpp
@@ +2359,4 @@
>    LOG("END   profiler_start");
>  }
>  
> +static MOZ_MUST_USE SamplerThread*

Nice, I was about to suggest adding a MOZ_MUST_USE here.
Attachment #8846385 - Flags: review?(mstange) → review+
(Reporter)

Updated

2 months ago
Attachment #8846386 - Flags: review?(mstange) → review+
(Reporter)

Comment 14

2 months ago
Comment on attachment 8846387 [details] [diff] [review]
(part 3) - Do most NotifyObserver() calls when gPSMutex is unlocked

Review of attachment 8846387 [details] [diff] [review]:
-----------------------------------------------------------------

::: tools/profiler/core/platform.cpp
@@ +2394,1 @@
>    if (samplerThread) {

Using samplerThread is much better than the separate wasActive variable that I was using.
Attachment #8846387 - Flags: review?(mstange) → review+
(Reporter)

Updated

2 months ago
Attachment #8846388 - Flags: review?(mstange) → review+
(Reporter)

Updated

2 months ago
Attachment #8846352 - Attachment is obsolete: true
(Assignee)

Comment 15

2 months ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/3ec4e9dbdcb3f2a47e0618b6c97a3f753f92f1c9
Bug 1346356 (part 1) - Fix SamplerThread leak in profiler_start(). r=mstange.

https://hg.mozilla.org/integration/mozilla-inbound/rev/8b0bfd19d13b861440f99b973249ac70f53cd457
Bug 1346356 (part 2) - Merge SamplerThread::Join() into ~SamplerThread(). r=mstange.

https://hg.mozilla.org/integration/mozilla-inbound/rev/cbfe870385c09f0a63c46b586eeadb113cbde08a
Bug 1346356 (part 3) - Do most NotifyObserver() calls when gPSMutex is unlocked. r=mstange.

https://hg.mozilla.org/integration/mozilla-inbound/rev/e4725643ead636a7cb9b8f2103f68a31fe278dbe
Bug 1346356 (part 4) - Clean up notification in the profiler. r=mstange.
(Assignee)

Updated

2 months ago
No longer blocks: 1273422
(Assignee)

Comment 16

2 months ago
mstange, how did you encounter this? Just running the browser manually?

Comment 17

2 months ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/3ec4e9dbdcb3
https://hg.mozilla.org/mozilla-central/rev/8b0bfd19d13b
https://hg.mozilla.org/mozilla-central/rev/cbfe870385c0
https://hg.mozilla.org/mozilla-central/rev/e4725643ead6
Status: NEW → RESOLVED
Last Resolved: 2 months ago
status-firefox55: affected → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
(Reporter)

Comment 18

2 months ago
(In reply to Nicholas Nethercote [:njn] from comment #16)
> mstange, how did you encounter this? Just running the browser manually?

Yes, and starting and stopping the profiler. I didn't do anything special.
status-firefox52: --- → unaffected
status-firefox53: --- → unaffected
status-firefox54: --- → unaffected
status-firefox-esr52: --- → unaffected

Updated

a month ago
Duplicate of this bug: 1347059
You need to log in before you can comment on or make changes to this bug.