BHR thread hangs for over a second while holding profiler lock initializing LUL

NEW
Assigned to

Status

()

Core
XPCOM
a month ago
2 days ago

People

(Reporter: mystor, Assigned: mystor)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(4 attachments, 1 obsolete attachment)

(Assignee)

Description

a month ago
When running the xpcshell-test dom/push/test/xpcshell/test_clear_origin_data.js, the BHR thread kicks in and detects a hang. On linux after bug 1357829 this will cause LUL to be initialized. 

Initializing LUL (specifically calling read_procmaps) takes over a second, and causes parts of the test to time out, presumably because the main thread gets stuck trying to grab the profiler lock.

Not sure what the best solution is here. If we figure out the issues with stacks not being landed (bug 1365029), I might turn off BHR on linux temporarily until this is fixed.
(Assignee)

Comment 1

a month ago
Markus suggests you might have an idea as to what to do here
Flags: needinfo?(jseward)
Ping me on irc when you are around.  I don't understand enough about
the context here to comment, at the moment.
Flags: needinfo?(jseward)
(Assignee)

Comment 3

a month ago
Created attachment 8869613 [details] [diff] [review]
Part 1: Move LUL to a separate lock, and initialize it outside of both the profiler and BHR lock on the BHR thread

Currently LUL is a member of CorePS, meaning that it is guarded by the PSMutex.
This mutex is grabbed by the main thread at random points during the execution
of the program. This is unfortunate, as initializing LUL can take a long
time (>1s on my local machine), and we definitely don't want to be blocking the
main thread waiting for it.

In addition, in the BHR case, we used to be grabbing LUL when we got our first
hang, while both the PSMutex and the BHR monitor were being held. This meant
that the main thread could make no progress during LUL initializaion, as the BHR
monitor is grabbed by the main thread on every spin of the event loop.

This patch moves that initialization to be behind a completely separate lock,
and makes BHR initialize it on the background thread before acquiring the BHR
lock, meaning that no locks other than the one guarding LUL should be held
during its initialization.

MozReview-Commit-ID: GwNYQaEAqJ1
Attachment #8869613 - Flags: review?(n.nethercote)
(Assignee)

Comment 4

a month ago
Created attachment 8869614 [details] [diff] [review]
Part 2: Re-enable the Background Hang Reporter on Linux

MozReview-Commit-ID: EjAti9UqW7k
Attachment #8869614 - Flags: review?(nfroyd)
Attachment #8869614 - Flags: review?(nfroyd) → review+
Comment on attachment 8869613 [details] [diff] [review]
Part 1: Move LUL to a separate lock, and initialize it outside of both the profiler and BHR lock on the BHR thread

Review of attachment 8869613 [details] [diff] [review]:
-----------------------------------------------------------------

I more or less see what you are trying to do, but there are a couple of things I'm unclear/unhappy about, so I'm giving f+ for now. Details below.

::: tools/profiler/core/platform-linux-android.cpp
@@ +251,5 @@
>    // mSamplerTid to a dummy value and fill it in for real in
>    // SuspendAndSampleAndResumeThread().
>    , mSamplerTid(-1)
>  {
> +  profiler_initialize_stackwalk();

This LUL initialization is still happening (usually) on the main thread. From the commit log it sounds like you were trying to change this. Have I misunderstood?

::: tools/profiler/core/platform.cpp
@@ +530,5 @@
>  #endif
>  }
>  
> +#if defined(USE_LUL_STACKWALK)
> +// When using LUL, the state is stored behind a separate

Unfinished sentence. Just remove this line?

@@ +542,5 @@
> +public:
> +  static lul::LUL* Get() {
> +    mozilla::StaticMutexAutoLock lock(sLulMutex);
> +    return sLul;
> +  }

This seems bogus. The function briefly locks the mutex, but then unlocks it before returning. The caller then has full access to sLul without the protection of the mutex.

Although... maybe the point is that it's only the creation of sLul that needs the mutex protection, because after creation it's read-only? If that's the case, it's a non-obvious use of a mutex, so please make it clearer in the comments at the top of this class.

@@ +548,5 @@
> +  static lul::LUL* GetOrCreate() {
> +    mozilla::StaticMutexAutoLock lock(sLulMutex);
> +    if (sLul) {
> +      return sLul;
> +    }

Gecko style is to put the method's opening brace on its own line. Same for all the functions in this class.

@@ +1887,5 @@
>          // should poke it to give it a chance to print those statistics.  This
>          // involves doing I/O (fprintf, __android_log_print, etc.) and so
>          // can't safely be done from the critical section inside
>          // SuspendAndSampleAndResumeThread, which is why it is done here.
> +        lul::LUL* lul = LulPS::Get();

Having to lock the mutex a second time isn't good, esp. given that MaybeShowStats() usually doesn't do anything. Please try to rearrange things so it's only locked once.

@@ +3082,2 @@
>  
> +    const CorePS::ThreadVector& liveThreads = CorePS::LiveThreads(lock);

Why introduce this new scope? I don't see any benefit. (Maybe it's a leftover from a reverted change?)
Attachment #8869613 - Flags: review?(n.nethercote) → feedback+
(Assignee)

Comment 6

a month ago
Created attachment 8870494 [details] [diff] [review]
Part 1: Move LUL to a separate lock, and initialize it outside of both the profiler and BHR lock on the BHR thread

(In reply to Nicholas Nethercote [:njn] from comment #5)
> This LUL initialization is still happening (usually) on the main thread.
> From the commit log it sounds like you were trying to change this. Have I
> misunderstood?

Yes, when the Background Hang Reporter initializes I am planning to run LUL initialization off of the main thread. If you look in BackgroundHangMonitor.cpp (BackgroundHangManager::RunMonitorThread()), you can see where I run it during the startup of the background hang monitor thread.

> This seems bogus. The function briefly locks the mutex, but then unlocks it
> before returning. The caller then has full access to sLul without the
> protection of the mutex.
>
> Although... maybe the point is that it's only the creation of sLul that
> needs the mutex protection, because after creation it's read-only? If that's
> the case, it's a non-obvious use of a mutex, so please make it clearer in
> the comments at the top of this class.

sLul has approximately the same lifetime as CorePS once it is initialized (It is not destroyed until CorePS is destroyed). The goal of this mutex was to ensure that if sLul is in the process of being initialized, the Getting thread would be blocked until the initialization was complete.

> @@ +1887,5 @@
> >          // should poke it to give it a chance to print those statistics.  This
> >          // involves doing I/O (fprintf, __android_log_print, etc.) and so
> >          // can't safely be done from the critical section inside
> >          // SuspendAndSampleAndResumeThread, which is why it is done here.
> > +        lul::LUL* lul = LulPS::Get();
>
> Having to lock the mutex a second time isn't good, esp. given that
> MaybeShowStats() usually doesn't do anything. Please try to rearrange things
> so it's only locked once.

As you mention above, it is a bit silly that `Get` is guarded by the mutex. In all places where `Get` is called, previous code should have made sure to call `GetOrCreate()`, which locks the mutex, meaning that it should already have been initialized and not need synchronization.

I will change `Get` to not use synchronization, and assert that `sLul` has been initialized by the time it is called.

MozReview-Commit-ID: GwNYQaEAqJ1
Attachment #8870494 - Flags: review?(n.nethercote)
(Assignee)

Updated

a month ago
Attachment #8869613 - Attachment is obsolete: true
Comment on attachment 8870494 [details] [diff] [review]
Part 1: Move LUL to a separate lock, and initialize it outside of both the profiler and BHR lock on the BHR thread

Review of attachment 8870494 [details] [diff] [review]:
-----------------------------------------------------------------

Looking better. Thanks.

::: tools/profiler/core/platform.cpp
@@ +533,5 @@
> +#if defined(USE_LUL_STACKWALK)
> +// When we are using LUL, we are going to need to perform expensive
> +// initialization before we can start getting backtraces. We don't want to be
> +// holding the profiler mutex while doing that, as the main thread sometimes
> +// wants to grant that and we never want to get in the way of that. Because of

The use of the word "grant" here confuses me a little. Not sure exactly what you mean by that. Reword?

@@ +554,5 @@
> +  //
> +  // This is safe because LulPS::sLul has approximately the same lifetime as
> +  // CorePS. It is initialized after CorePS, but will live until CorePS is
> +  // killed. Once it has been initialized, it is safe to read without
> +  // synchronization.

I'd add that it's safe after initialization because it's read-only at that point.
Attachment #8870494 - Flags: review?(n.nethercote) → review+
(Assignee)

Updated

a month ago
Assignee: nobody → michael
(Assignee)

Comment 8

16 days ago
This seems to crash on try occasionally https://treeherder.mozilla.org/logviewer.html#?job_id=105513398&repo=try&lineNumber=2361

I'm not sure why initializing LUL would cause a crash. Nick, do you have any ideas?
Flags: needinfo?(n.nethercote)
Julian wrote LUL so I will forward to him.
Flags: needinfo?(n.nethercote) → needinfo?(jseward)
(In reply to Michael Layzell [:mystor] from comment #8)
> This seems to crash on try occasionally

That is a 32-bit Linux process, which is something of a rarity these days.
Are the other crashes that you see also only with 32-bit processes?
Flags: needinfo?(jseward) → needinfo?(michael)
(Assignee)

Comment 11

15 days ago
(In reply to Julian Seward [:jseward] from comment #10)
> (In reply to Michael Layzell [:mystor] from comment #8)
> > This seems to crash on try occasionally
> 
> That is a 32-bit Linux process, which is something of a rarity these days.
> Are the other crashes that you see also only with 32-bit processes?

I don't see crashes on non-32-bit linux boxes. Should we be disabling the profiler on those platforms?
Flags: needinfo?(michael) → needinfo?(n.nethercote)
> I don't see crashes on non-32-bit linux boxes. Should we be disabling the
> profiler on those platforms?

You mean, should we disable the profiler on x86/Linux? Seems reasonable given that we don't support x86/Darwin. It would simplify the code a very small amount.

mstange, jseward: what do you think about removing x86/Linux support from the profiler and LUL?
Flags: needinfo?(n.nethercote)
Flags: needinfo?(mstange)
Flags: needinfo?(jseward)
Sorry, I lost track of this.  I would guess this is crashing when LUL tries
to mmap libxul.so to read debuginfo from it.  This has happened to us
before.  It would be good to confirm that, though.

Mmaping libxul.so on any 32 bit platform can fail since it requires finding
a contiguous chunk of address space of more than 1GB, which often isn't
available.  On 64 bit platforms, by contrast, this isn't a problem.

Fixing this properly would involve inserting an abstraction layer under
LUL's ELF and Dwarf parsers that allows reading arbitrary bits of the file
without having to map the whole thing in.  That's a significant amount of
work, though.

Whether anyone will miss x86/Linux support for the profiler, I can't say.
If we want to remove it, that's fine by me.
Flags: needinfo?(jseward)
(In reply to Nicholas Nethercote [:njn] from comment #12)
> mstange, jseward: what do you think about removing x86/Linux support from
> the profiler and LUL?

If it doesn't really work in most cases then it's not really worth having it around.
Just to clarify, this just means that the "stackwalk" feature wouldn't be available, and not that MOZ_GECKO_PROFILER wouldn't be defined, right?
Flags: needinfo?(mstange)
> Just to clarify, this just means that the "stackwalk" feature wouldn't be
> available, and not that MOZ_GECKO_PROFILER wouldn't be defined, right?

Good question. I had assumed it would be fully disabled, but we could just remove 32-bit LUL support and undfine HAVE_NATIVE_UNWIND but the rest would still work. But, how useful is the profiler without stack walking? Is anybody ever going to use the profiler in that configuration?

I also talked with glandium about this. We do still ship 32-bit Linux, at least for Beta and Nightly (e.g. https://www.mozilla.org/en-US/firefox/beta/all/). But shipped Firefox is not quite the same as locally-built Firefox. In particular, libxul.so is a lot bigger when locally built. If the problem really is that the mapping fails, it won't affect shipped builds. Also, if it is the mapping failing, surely LUL should handle that more gracefully than crashing?
All good points.
So why don't we leave everything as it is and just disable BHR stackwalking on Linux 32? Then we'll still be able to get profiles from users on linux32 at least in some cases, and may be able to identify performance problems that they hit.
I just built Fx on 32 bit Linux and it crashed twice in two runs, when using the
profiler.  So it's at least reproducible.  Am investigating.
(Assignee)

Comment 18

8 days ago
This is currently blocked on some EME timeouts. As far as I can tell these timeouts are occurring because when we start up the EME process we also initialize BHR inside of that process. When this is done, we initialize LUL in that process, which involves doing a lot of work. We then, in the tests, quickly shut down the EME process. This would normally occur quickly, however shutting down the process involves shutting down BHR, which involves joining the BHR thread. As LUL initialization is occurring on the BHR thread at the time, this just hangs everything. I think this then causes everything to time out on infra, although I am unsure.

I think the easiest solution here would be to just not run BHR in the EME process. I don't even know if we collect or look at any of the telemetry information coming from it. cpearce, how would you feel about doing that?
Flags: needinfo?(cpearce)
(In reply to Julian Seward [:jseward] from comment #13)
> Sorry, I lost track of this.  I would guess this is crashing when LUL tries
> to mmap libxul.so to read debuginfo from it.

Let me update this.  I guessed wrong.  I tested with both 32- and 64-bit Linux
builds, to find out what happens when LUL fails to mmap libxul.so.  Per my testing
it handles the failure OK -- the system stays alive.  The resulting stack traces
from LUL are of course almost useless because it fails to load unwind data for
libxul, but I didn't see any stability issues pertaining to LUL.

On 32 bit Linux only targets, I do later see a segfault that happens when the
profiler GUI (Cleopatra) tries to symbolicate the traces, due to its use of
Wasm and an incorrect assertion, which I filed as bug 1374218.  But that isn't
relevant here (I assume) because (1) we're only using LUL to get stack traces,
but not running the the profiler GUI, and (2) [speculative, but as-yet unverified]
the bug 1374218 failure only occurs for --disable-jemalloc builds.

All that said, mmaping libxul.so to read symbols from it is still a losing
proposition on any 32 bit target.  So we should still decide either to disable
LUL on 32 bit targets, or to insert an abstraction layer at the bottom of
the Dwarf/ELF parser which makes it possible to read stuff from the .so file
without having to map the whole thing in.  Valgrind now uses that strategy
and it works well.
Another thing that I just observed is that LUL does much more poorly (has
more unwind failures) on 32-bit than 64-bit Linux.  64-bit also used to have
unwind failures, but most of those were resolved at the end of 1Q2017 via a
combination of Dwarf handling fixups and enabling FP unwinding.  So it would
probably be possible to do the same for 32-bit if required.  In the meantime
however be aware that 32-bit will often fail to give good unwinds, even in
the case where it does manage to map libxul.so to read unwind info.
Linux32 seems like a low-value proposition. I don't think adding the abstraction layer or improving the unwinding is worth the effort.
(In reply to Nicholas Nethercote [:njn] from comment #21)
> Linux32 seems like a low-value proposition. I don't think adding the
> abstraction layer or improving the unwinding is worth the effort.

I agree.  I'm not advocating for it -- merely recording the tradeoffs
as I see them.
(In reply to Michael Layzell [:mystor] from comment #18)
> This is currently blocked on some EME timeouts. As far as I can tell these
> timeouts are occurring because when we start up the EME process we also
> initialize BHR inside of that process. When this is done, we initialize LUL
> in that process, which involves doing a lot of work. We then, in the tests,
> quickly shut down the EME process. This would normally occur quickly,
> however shutting down the process involves shutting down BHR, which involves
> joining the BHR thread. As LUL initialization is occurring on the BHR thread
> at the time, this just hangs everything. I think this then causes everything
> to time out on infra, although I am unsure.
> 
> I think the easiest solution here would be to just not run BHR in the EME
> process. I don't even know if we collect or look at any of the telemetry
> information coming from it. cpearce, how would you feel about doing that?


I'm totally fine with not initializing BHR in the GMP process (which is the process in which EME plugins run). I didn't think we collected telemetry in the GMP process anyway, and we don't startup XPCOM either.
Flags: needinfo?(cpearce)
(Assignee)

Comment 24

3 days ago
Created attachment 8880140 [details] [diff] [review]
Part 3: Disable ThreadStackHelper on x86 linux due to a LUL crash

MozReview-Commit-ID: 2kBBFftnJEd
Attachment #8880140 - Flags: review?(nfroyd)
(Assignee)

Comment 25

3 days ago
Created attachment 8880141 [details] [diff] [review]
Part 4: Disable the BackgroundHangMonitor in GMP processes

This involved a change to BackgroundHangMonitor, as it initialized sDisabled
incorrectly to false, instead of true, We need sDisabled initialized to true, as
we cannot assume that it is enabled until BackgroundHangMonitor::Startup() is
called.

MozReview-Commit-ID: 94slLTkNk3C
Attachment #8880141 - Flags: review?(nfroyd)
Attachment #8880140 - Flags: review?(nfroyd) → review+
Attachment #8880141 - Flags: review?(nfroyd) → review+
You need to log in before you can comment on or make changes to this bug.