Deadlock in profiler_get_profile_jsobject_async

RESOLVED FIXED in Firefox 55

Status

()

Core
Gecko Profiler
RESOLVED FIXED
3 months ago
2 months ago

People

(Reporter: sfink, Assigned: njn)

Tracking

unspecified
mozilla55
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox55 fixed)

Details

Attachments

(2 attachments)

(Reporter)

Description

3 months ago
Bug 1347044 fixed a deadlock where profiler_get_profile_jsobject grabbed the lock, then could invoke JSON parsing which could try to emit a marker, which grabs the lock again.

I am getting a similar looking deadlock, but with profiler_get_profile_jsobject_async instead.

I have profiler_get_profile_jsobject_async grabbing the lock then calling ProfileGatherer::Start -> Start2 -> Finish -> JS_ParseJSON
(Assignee)

Comment 1

3 months ago
Oh yes, ProfileGatherer::Finish() does way more stuff with gPSMutex locked than it should! I will fix.

sfink, just for my curiosity: what triggered the call to profiler_get_profile_jsobject_async()?
Assignee: nobody → n.nethercote
Flags: needinfo?(sphink)
(Assignee)

Updated

3 months ago
No longer blocks: 1347044
Depends on: 1347044
(Assignee)

Updated

3 months ago
No longer depends on: 1347044
(Assignee)

Updated

3 months ago
Blocks: 1342306
(Assignee)

Comment 2

3 months ago
Hmm, this one is trickier than bug 1347044. ProfileGatherer's operations currently all run with gPSMutex locked. They also all run on the main thread, so this locking is currently not necessary.

I can see one nice solution to this, which is to remove the profiler_get_profile_jsobject_async() and profiler_save_profile_to_file_async() functions from the profiler's core API. That functionality -- async profile gathering -- would move outwards, into nsProfiler. ProfileGatherer would be a member of nsProfile instead of ProfilerState, and async profile gathering wouldn't have to deal directly with gPSMutex at all. It would instead call profiler_get_profile(). (It would also get rid of the horrible hackiness within profiler_will_gather_OOP_profile(), which is nice.)

This assumes that there is only ever one nsProfile, and that all of its operations run on the same thread (presumably the main thread), and that the notifications relating to cross-process profile gathering ("profiler-subprocess", etc.) can run at the nsProfile level. It also assumes that all of that is true even if/when we support off-main-thread operations (bug 1330184).

mstange, does this sound reasonable to you?

(I also wonder if the other notifications -- "profiler-started", "profiler-stopped", etc. -- would work better at the nsProfiler level.)
Flags: needinfo?(mstange)
(Assignee)

Comment 3

3 months ago
Created attachment 8851885 [details] [diff] [review]
(part 1) - Remove profiler_get_profile_jsobject

mstange, here's a precursor patch that gives you a taste of what comment 2 is
getting at. It shrink's the profilr's core API by moving the to-JSObject
conversion of profiles from the ProfilerState level to the nsProfiler level,
and cuts 28 lines of code in the process. Furthermore, it makes the deadlock
that was fixed in bug 1347044 impossible to recreate.
Attachment #8851885 - Flags: review?(mstange)
(Reporter)

Comment 4

3 months ago
(In reply to Nicholas Nethercote [:njn] from comment #1)
> Oh yes, ProfileGatherer::Finish() does way more stuff with gPSMutex locked
> than it should! I will fix.

Thank you!

> sfink, just for my curiosity: what triggered the call to
> profiler_get_profile_jsobject_async()?

Sadly, that debug session mysteriously killed itself. And rerunning that rr recording fails now; I've tried twice. Which is... bizarre.

From the original debug session, I unfortunately only ever went up one more frame, to GetProfileDataAsync. But it looks like that's an IDL method, so I guess that's probably going to be a call in the JS code? But I don't see a call in the tree. Uh...

Rerunning, the C++ stack just tells me it comes from NS_InvokeByIndex, so it does look like it's from JS. The JS stack is:

#0 0x7f6d989cd2e8 i   resource://gre/modules/commonjs/toolkit/loader.js -> resource://jid0-edalmuivkozlouyij0lpdx548bc-at-jetpack/lib/profiler.js:74 (0x7f6d91a3e7b0 @ 109)
#1 0x7f6d989cd250 i   resource://gre/modules/commonjs/toolkit/loader.js -> resource://jid0-edalmuivkozlouyij0lpdx548bc-at-jetpack/index.js:180 (0x7f6d909fa4b8 @ 28)
#2 0x7f6d989cd190 i   self-hosted:922 (0x7f6d758b54b8 @ 346)
#3 0x7f6d989cd0c0 i   resource://gre/modules/commonjs/toolkit/loader.js -> resource://gre/modules/commonjs/sdk/keyboard/hotkeys.js:102 (0x7f6d7649f258 @ 527)
#4 0x7ffe82491100 b   resource://gre/modules/commonjs/toolkit/loader.js -> resource://gre/modules/commonjs/sdk/event/core.js:112 (0x7f6d9097ce38 @ 254)
#5 0x7ffe824911c0 b   resource://gre/modules/commonjs/toolkit/loader.js -> resource://gre/modules/commonjs/sdk/event/core.js:89 (0x7f6d9097cd08 @ 110)
#6 0x7f6d989cd020 i   resource://gre/modules/commonjs/toolkit/loader.js -> resource://gre/modules/commonjs/sdk/keyboard/observer.js:53 (0x7f6d7649b978 @ 57)
(Reporter)

Comment 5

3 months ago
I don't seem to have that file anywhere on my filesystem, except in the XPI for what I assume is the gecko profiler extension. Or an old version of it, rather -- my active profile is shifted by one line. Here's the body of that function:

function getProfile() {
  if (!profiler().IsActive()) {
    return Promise.reject(new Error("The profiler is stopped. " +
      "You need to start the profiler before you can capture a profile."));
  }
  if (profiler().getProfileDataAsync) {
    return profiler().getProfileDataAsync();
  }
  // No luck - synchronous it is.
  return Promise.resolve(profiler().getProfileData());
}

which makes it look like you could just drop that API entirely, at the cost of introducing a pause when the profile is... fetched?
Flags: needinfo?(sphink)
(Assignee)

Comment 6

3 months ago
There are a couple of uses of getProfileDataAsync() from Talos code.
(In reply to Nicholas Nethercote [:njn] from comment #2)
> I can see one nice solution to this, which is to remove the
> profiler_get_profile_jsobject_async() and
> profiler_save_profile_to_file_async() functions from the profiler's core
> API. That functionality -- async profile gathering -- would move outwards,
> into nsProfiler. ProfileGatherer would be a member of nsProfiler instead of
> ProfilerState, and async profile gathering wouldn't have to deal directly
> with gPSMutex at all. It would instead call profiler_get_profile(). (It
> would also get rid of the horrible hackiness within
> profiler_will_gather_OOP_profile(), which is nice.)

This sounds good.

So you're basically moving some of the profiler state into a not-shared-between-threads bag that's called nsProfiler.

> This assumes that there is only ever one nsProfiler, and that all of its
> operations run on the same thread (presumably the main thread),

This is the case. nsProfiler is exposed to JS through Components.classes, which is only available on the main thread. And even in the future, when we want to be able to capture subprocess profiles off the main thread, these profiles won't be exposed to JS until they arrive at the parent process main thread in the promise resolution of the promise returned by getProfileDataAsync.

> and that the
> notifications relating to cross-process profile gathering
> ("profiler-subprocess", etc.) can run at the nsProfiler level.
> (I also wonder if the other notifications -- "profiler-started",
> "profiler-stopped", etc. -- would work better at the nsProfiler level.)

I'm not sure what you mean by "at the nsProfiler level", but I agree that these notifications are main thread only. I dislike the way we use global observer notifications as a communication mechanism for this stuff. I'd much prefer there to be explicit listeners registered at some central profiler interface. That interface might well be nsIProfiler. (In that case, the relavent methods should be C++-only methods of nsIProfiler.)

(In reply to Steve Fink [:sfink] [:s:] from comment #5)
> which makes it look like you could just drop that API entirely, at the cost
> of introducing a pause when the profile is... fetched?

The profile returned by the synchronous getProfileData method does not include content process profiles because the parent process is not allowed to synchronously block on the content process. getProfileDataAsync needs to stay.
Flags: needinfo?(mstange)
Comment on attachment 8851885 [details] [diff] [review]
(part 1) - Remove profiler_get_profile_jsobject

Review of attachment 8851885 [details] [diff] [review]:
-----------------------------------------------------------------

I would r+ this patch twice if I could.
Attachment #8851885 - Flags: review?(mstange) → review+
Attachment #8851885 - Flags: review+
Oh, I can! :D
(Assignee)

Comment 10

3 months ago
> So you're basically moving some of the profiler state into a not-shared-between-threads bag that's called nsProfiler.

Yes! And shrinking the profiler core (and its API) in the process.
(Assignee)

Comment 11

3 months ago
Created attachment 8852320 [details] [diff] [review]
(part 2) - Remove profiler_get_profile_jsobject_async() and profiler_save_profile_to_file_async()

The state management is better done within nsProfiler::GetProfileDataAsync()
and nsProfiler::DumpProfileToFileAsync(). (The latter function is new in this
patch.)

This fixes a deadlock.

Other notes:

- The patch moves ProfileGatherer from ProfilerState to nsProfiler. This is
  nice because the former is shared between threads but the latter is main
  thread only. (This is how the deadlock is avoided.)

- ProfilerStateMutex and PSLockRef are no longer required in platform.h. Those
  types and variables are now only used in platform.cpp and platform-*.cpp.

- ProfilerGatherer now calls profiler_get_profile() instead of ToJSON(). Which
  means that ToJSON() now has a single caller, so the patch inlines it at the
  callsite and removes it.

- profiler_save_profile_to_file_async() dispatched a Runnable to the main
  thread. But this wasn't necessary, because it always ran on the main thread
  itself. So the new function nsProfiler::DumpProfileToFileAsync() doesn't do
  that.

- profiler_will_gather_OOP_profile(), profiler_gathered_OOP_profile(), and
  profiler_OOP_exit_profile() are all moved into nsProfiler as well. This
  removes the need for the horrible fake lock in
  profiler_will_gather_OOP_profile(), hooray!
Attachment #8852320 - Flags: review?(mstange)
(Reporter)

Comment 12

3 months ago
(In reply to Markus Stange [:mstange] from comment #7)
> (In reply to Nicholas Nethercote [:njn] from comment #2)
> > This assumes that there is only ever one nsProfiler, and that all of its
> > operations run on the same thread (presumably the main thread),
> 
> This is the case. nsProfiler is exposed to JS through Components.classes,
> which is only available on the main thread. And even in the future, when we
> want to be able to capture subprocess profiles off the main thread, these
> profiles won't be exposed to JS until they arrive at the parent process main
> thread in the promise resolution of the promise returned by
> getProfileDataAsync.

I'm not familiar with either the old or the new setup, but I just wanted to ask -- what does this mean for the ZoneGroup/TabGroup stuff that seems to be coming up? That's where we have multiple threads running independent JS under a single JSRuntime. (Well, mostly independent -- they share the system zone group, and cooperatively schedule amongst themselves, but do not share any data and the content threads cannot communicate directly with each other.)

Also, what does this mean for workers? Is it just that the main thread is the only one allowed to retrieve profiles, or does this prevent workers from adding markers as well? (And I don't know what I mean by "workers" -- dom workers, web workers, service workers. Whichever runs in a different thread but the same process.)
(Reporter)

Updated

3 months ago
Blocks: 1322560
(Assignee)

Comment 13

3 months ago
> I'm not familiar with either the old or the new setup, but I just wanted to
> ask -- what does this mean for the ZoneGroup/TabGroup stuff that seems to be
> coming up? That's where we have multiple threads running independent JS
> under a single JSRuntime. (Well, mostly independent -- they share the system
> zone group, and cooperatively schedule amongst themselves, but do not share
> any data and the content threads cannot communicate directly with each
> other.)

I'm not sure about that.

> Also, what does this mean for workers? Is it just that the main thread is
> the only one allowed to retrieve profiles, or does this prevent workers from
> adding markers as well? (And I don't know what I mean by "workers" -- dom
> workers, web workers, service workers. Whichever runs in a different thread
> but the same process.)

Workers currently work fine with the profiler. Profiles can currently only be collected on the main thread. We're actually planning to lift that restriction (bug 1330184 and bug 1330185).

Something that may be confusing you is that there are currently two APIs for the profiler.

- The lower-level one in tools/profiler/public/GeckoProfiler.h, which consists of profiler_*() functions and a few PROFILE* macros.

- The higher-level one in tools/profiler/gecko/nsIProfiler.idl. It's mostly a thin layer above the low-level API.

This bug is moving some functionality out of the low-level API into the high-level API. The off-main-thread profile capturing will need to be controlled via the low-level API.
(Assignee)

Updated

3 months ago
Duplicate of this bug: 1351529
Comment on attachment 8852320 [details] [diff] [review]
(part 2) - Remove profiler_get_profile_jsobject_async() and profiler_save_profile_to_file_async()

Review of attachment 8852320 [details] [diff] [review]:
-----------------------------------------------------------------

This is a big improvement.

::: tools/profiler/gecko/nsProfiler.cpp
@@ +103,5 @@
>  
>  NS_IMETHODIMP
>  nsProfiler::StopProfiler()
>  {
> +  // Do this before profiler_start().

profiler_stop
Attachment #8852320 - Flags: review?(mstange) → review+
(Assignee)

Comment 16

3 months ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/4483302f97a757145dca7dd5720a397d4f9b9246
Bug 1350967 (part 1) - Remove profiler_get_profile_jsobject. r=mstange.

https://hg.mozilla.org/integration/mozilla-inbound/rev/a1e5043844f295aa8f9a70517b7b76e34fecc634
Bug 1350967 (part 2) - Remove profiler_get_profile_jsobject_async() and profiler_save_profile_to_file_async(). r=mstange.

Comment 17

3 months ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/4483302f97a7
https://hg.mozilla.org/mozilla-central/rev/a1e5043844f2
Status: NEW → RESOLVED
Last Resolved: 3 months ago
status-firefox55: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Depends on: 1355963
Depends on: 1356694
You need to log in before you can comment on or make changes to this bug.