BHR thread hangs for over a second while holding profiler lock initializing LUL

RESOLVED FIXED in Firefox 57

Status

()

Core
XPCOM
RESOLVED FIXED
3 months ago
7 days ago

People

(Reporter: mystor, Assigned: mystor)

Tracking

unspecified
mozilla57
Points:
---

Firefox Tracking Flags

(firefox57 fixed)

Details

Attachments

(3 attachments, 12 obsolete attachments)

2.35 KB, patch
froydnj
: review+
Details | Diff | Splinter Review
1.43 KB, patch
njn
: review+
Details | Diff | Splinter Review
14.17 KB, patch
Details | Diff | Splinter Review
When running the xpcshell-test dom/push/test/xpcshell/test_clear_origin_data.js, the BHR thread kicks in and detects a hang. On linux after bug 1357829 this will cause LUL to be initialized. 

Initializing LUL (specifically calling read_procmaps) takes over a second, and causes parts of the test to time out, presumably because the main thread gets stuck trying to grab the profiler lock.

Not sure what the best solution is here. If we figure out the issues with stacks not being landed (bug 1365029), I might turn off BHR on linux temporarily until this is fixed.
Markus suggests you might have an idea as to what to do here
Flags: needinfo?(jseward)
Ping me on irc when you are around.  I don't understand enough about
the context here to comment, at the moment.
Flags: needinfo?(jseward)
Created attachment 8869613 [details] [diff] [review]
Part 1: Move LUL to a separate lock, and initialize it outside of both the profiler and BHR lock on the BHR thread

Currently LUL is a member of CorePS, meaning that it is guarded by the PSMutex.
This mutex is grabbed by the main thread at random points during the execution
of the program. This is unfortunate, as initializing LUL can take a long
time (>1s on my local machine), and we definitely don't want to be blocking the
main thread waiting for it.

In addition, in the BHR case, we used to be grabbing LUL when we got our first
hang, while both the PSMutex and the BHR monitor were being held. This meant
that the main thread could make no progress during LUL initializaion, as the BHR
monitor is grabbed by the main thread on every spin of the event loop.

This patch moves that initialization to be behind a completely separate lock,
and makes BHR initialize it on the background thread before acquiring the BHR
lock, meaning that no locks other than the one guarding LUL should be held
during its initialization.

MozReview-Commit-ID: GwNYQaEAqJ1
Attachment #8869613 - Flags: review?(n.nethercote)
Created attachment 8869614 [details] [diff] [review]
Part 2: Re-enable the Background Hang Reporter on Linux

MozReview-Commit-ID: EjAti9UqW7k
Attachment #8869614 - Flags: review?(nfroyd)

Updated

3 months ago
Attachment #8869614 - Flags: review?(nfroyd) → review+
Comment on attachment 8869613 [details] [diff] [review]
Part 1: Move LUL to a separate lock, and initialize it outside of both the profiler and BHR lock on the BHR thread

Review of attachment 8869613 [details] [diff] [review]:
-----------------------------------------------------------------

I more or less see what you are trying to do, but there are a couple of things I'm unclear/unhappy about, so I'm giving f+ for now. Details below.

::: tools/profiler/core/platform-linux-android.cpp
@@ +251,5 @@
>    // mSamplerTid to a dummy value and fill it in for real in
>    // SuspendAndSampleAndResumeThread().
>    , mSamplerTid(-1)
>  {
> +  profiler_initialize_stackwalk();

This LUL initialization is still happening (usually) on the main thread. From the commit log it sounds like you were trying to change this. Have I misunderstood?

::: tools/profiler/core/platform.cpp
@@ +530,5 @@
>  #endif
>  }
>  
> +#if defined(USE_LUL_STACKWALK)
> +// When using LUL, the state is stored behind a separate

Unfinished sentence. Just remove this line?

@@ +542,5 @@
> +public:
> +  static lul::LUL* Get() {
> +    mozilla::StaticMutexAutoLock lock(sLulMutex);
> +    return sLul;
> +  }

This seems bogus. The function briefly locks the mutex, but then unlocks it before returning. The caller then has full access to sLul without the protection of the mutex.

Although... maybe the point is that it's only the creation of sLul that needs the mutex protection, because after creation it's read-only? If that's the case, it's a non-obvious use of a mutex, so please make it clearer in the comments at the top of this class.

@@ +548,5 @@
> +  static lul::LUL* GetOrCreate() {
> +    mozilla::StaticMutexAutoLock lock(sLulMutex);
> +    if (sLul) {
> +      return sLul;
> +    }

Gecko style is to put the method's opening brace on its own line. Same for all the functions in this class.

@@ +1887,5 @@
>          // should poke it to give it a chance to print those statistics.  This
>          // involves doing I/O (fprintf, __android_log_print, etc.) and so
>          // can't safely be done from the critical section inside
>          // SuspendAndSampleAndResumeThread, which is why it is done here.
> +        lul::LUL* lul = LulPS::Get();

Having to lock the mutex a second time isn't good, esp. given that MaybeShowStats() usually doesn't do anything. Please try to rearrange things so it's only locked once.

@@ +3082,2 @@
>  
> +    const CorePS::ThreadVector& liveThreads = CorePS::LiveThreads(lock);

Why introduce this new scope? I don't see any benefit. (Maybe it's a leftover from a reverted change?)
Attachment #8869613 - Flags: review?(n.nethercote) → feedback+
Created attachment 8870494 [details] [diff] [review]
Part 1: Move LUL to a separate lock, and initialize it outside of both the profiler and BHR lock on the BHR thread

(In reply to Nicholas Nethercote [:njn] from comment #5)
> This LUL initialization is still happening (usually) on the main thread.
> From the commit log it sounds like you were trying to change this. Have I
> misunderstood?

Yes, when the Background Hang Reporter initializes I am planning to run LUL initialization off of the main thread. If you look in BackgroundHangMonitor.cpp (BackgroundHangManager::RunMonitorThread()), you can see where I run it during the startup of the background hang monitor thread.

> This seems bogus. The function briefly locks the mutex, but then unlocks it
> before returning. The caller then has full access to sLul without the
> protection of the mutex.
>
> Although... maybe the point is that it's only the creation of sLul that
> needs the mutex protection, because after creation it's read-only? If that's
> the case, it's a non-obvious use of a mutex, so please make it clearer in
> the comments at the top of this class.

sLul has approximately the same lifetime as CorePS once it is initialized (It is not destroyed until CorePS is destroyed). The goal of this mutex was to ensure that if sLul is in the process of being initialized, the Getting thread would be blocked until the initialization was complete.

> @@ +1887,5 @@
> >          // should poke it to give it a chance to print those statistics.  This
> >          // involves doing I/O (fprintf, __android_log_print, etc.) and so
> >          // can't safely be done from the critical section inside
> >          // SuspendAndSampleAndResumeThread, which is why it is done here.
> > +        lul::LUL* lul = LulPS::Get();
>
> Having to lock the mutex a second time isn't good, esp. given that
> MaybeShowStats() usually doesn't do anything. Please try to rearrange things
> so it's only locked once.

As you mention above, it is a bit silly that `Get` is guarded by the mutex. In all places where `Get` is called, previous code should have made sure to call `GetOrCreate()`, which locks the mutex, meaning that it should already have been initialized and not need synchronization.

I will change `Get` to not use synchronization, and assert that `sLul` has been initialized by the time it is called.

MozReview-Commit-ID: GwNYQaEAqJ1
Attachment #8870494 - Flags: review?(n.nethercote)
Attachment #8869613 - Attachment is obsolete: true
Comment on attachment 8870494 [details] [diff] [review]
Part 1: Move LUL to a separate lock, and initialize it outside of both the profiler and BHR lock on the BHR thread

Review of attachment 8870494 [details] [diff] [review]:
-----------------------------------------------------------------

Looking better. Thanks.

::: tools/profiler/core/platform.cpp
@@ +533,5 @@
> +#if defined(USE_LUL_STACKWALK)
> +// When we are using LUL, we are going to need to perform expensive
> +// initialization before we can start getting backtraces. We don't want to be
> +// holding the profiler mutex while doing that, as the main thread sometimes
> +// wants to grant that and we never want to get in the way of that. Because of

The use of the word "grant" here confuses me a little. Not sure exactly what you mean by that. Reword?

@@ +554,5 @@
> +  //
> +  // This is safe because LulPS::sLul has approximately the same lifetime as
> +  // CorePS. It is initialized after CorePS, but will live until CorePS is
> +  // killed. Once it has been initialized, it is safe to read without
> +  // synchronization.

I'd add that it's safe after initialization because it's read-only at that point.
Attachment #8870494 - Flags: review?(n.nethercote) → review+
Assignee: nobody → michael
This seems to crash on try occasionally https://treeherder.mozilla.org/logviewer.html#?job_id=105513398&repo=try&lineNumber=2361

I'm not sure why initializing LUL would cause a crash. Nick, do you have any ideas?
Flags: needinfo?(n.nethercote)
Julian wrote LUL so I will forward to him.
Flags: needinfo?(n.nethercote) → needinfo?(jseward)
(In reply to Michael Layzell [:mystor] from comment #8)
> This seems to crash on try occasionally

That is a 32-bit Linux process, which is something of a rarity these days.
Are the other crashes that you see also only with 32-bit processes?
Flags: needinfo?(jseward) → needinfo?(michael)
(In reply to Julian Seward [:jseward] from comment #10)
> (In reply to Michael Layzell [:mystor] from comment #8)
> > This seems to crash on try occasionally
> 
> That is a 32-bit Linux process, which is something of a rarity these days.
> Are the other crashes that you see also only with 32-bit processes?

I don't see crashes on non-32-bit linux boxes. Should we be disabling the profiler on those platforms?
Flags: needinfo?(michael) → needinfo?(n.nethercote)
> I don't see crashes on non-32-bit linux boxes. Should we be disabling the
> profiler on those platforms?

You mean, should we disable the profiler on x86/Linux? Seems reasonable given that we don't support x86/Darwin. It would simplify the code a very small amount.

mstange, jseward: what do you think about removing x86/Linux support from the profiler and LUL?
Flags: needinfo?(n.nethercote)
Flags: needinfo?(mstange)
Flags: needinfo?(jseward)
Sorry, I lost track of this.  I would guess this is crashing when LUL tries
to mmap libxul.so to read debuginfo from it.  This has happened to us
before.  It would be good to confirm that, though.

Mmaping libxul.so on any 32 bit platform can fail since it requires finding
a contiguous chunk of address space of more than 1GB, which often isn't
available.  On 64 bit platforms, by contrast, this isn't a problem.

Fixing this properly would involve inserting an abstraction layer under
LUL's ELF and Dwarf parsers that allows reading arbitrary bits of the file
without having to map the whole thing in.  That's a significant amount of
work, though.

Whether anyone will miss x86/Linux support for the profiler, I can't say.
If we want to remove it, that's fine by me.
Flags: needinfo?(jseward)
(In reply to Nicholas Nethercote [:njn] from comment #12)
> mstange, jseward: what do you think about removing x86/Linux support from
> the profiler and LUL?

If it doesn't really work in most cases then it's not really worth having it around.
Just to clarify, this just means that the "stackwalk" feature wouldn't be available, and not that MOZ_GECKO_PROFILER wouldn't be defined, right?
Flags: needinfo?(mstange)
> Just to clarify, this just means that the "stackwalk" feature wouldn't be
> available, and not that MOZ_GECKO_PROFILER wouldn't be defined, right?

Good question. I had assumed it would be fully disabled, but we could just remove 32-bit LUL support and undfine HAVE_NATIVE_UNWIND but the rest would still work. But, how useful is the profiler without stack walking? Is anybody ever going to use the profiler in that configuration?

I also talked with glandium about this. We do still ship 32-bit Linux, at least for Beta and Nightly (e.g. https://www.mozilla.org/en-US/firefox/beta/all/). But shipped Firefox is not quite the same as locally-built Firefox. In particular, libxul.so is a lot bigger when locally built. If the problem really is that the mapping fails, it won't affect shipped builds. Also, if it is the mapping failing, surely LUL should handle that more gracefully than crashing?
All good points.
So why don't we leave everything as it is and just disable BHR stackwalking on Linux 32? Then we'll still be able to get profiles from users on linux32 at least in some cases, and may be able to identify performance problems that they hit.
I just built Fx on 32 bit Linux and it crashed twice in two runs, when using the
profiler.  So it's at least reproducible.  Am investigating.
This is currently blocked on some EME timeouts. As far as I can tell these timeouts are occurring because when we start up the EME process we also initialize BHR inside of that process. When this is done, we initialize LUL in that process, which involves doing a lot of work. We then, in the tests, quickly shut down the EME process. This would normally occur quickly, however shutting down the process involves shutting down BHR, which involves joining the BHR thread. As LUL initialization is occurring on the BHR thread at the time, this just hangs everything. I think this then causes everything to time out on infra, although I am unsure.

I think the easiest solution here would be to just not run BHR in the EME process. I don't even know if we collect or look at any of the telemetry information coming from it. cpearce, how would you feel about doing that?
Flags: needinfo?(cpearce)
(In reply to Julian Seward [:jseward] from comment #13)
> Sorry, I lost track of this.  I would guess this is crashing when LUL tries
> to mmap libxul.so to read debuginfo from it.

Let me update this.  I guessed wrong.  I tested with both 32- and 64-bit Linux
builds, to find out what happens when LUL fails to mmap libxul.so.  Per my testing
it handles the failure OK -- the system stays alive.  The resulting stack traces
from LUL are of course almost useless because it fails to load unwind data for
libxul, but I didn't see any stability issues pertaining to LUL.

On 32 bit Linux only targets, I do later see a segfault that happens when the
profiler GUI (Cleopatra) tries to symbolicate the traces, due to its use of
Wasm and an incorrect assertion, which I filed as bug 1374218.  But that isn't
relevant here (I assume) because (1) we're only using LUL to get stack traces,
but not running the the profiler GUI, and (2) [speculative, but as-yet unverified]
the bug 1374218 failure only occurs for --disable-jemalloc builds.

All that said, mmaping libxul.so to read symbols from it is still a losing
proposition on any 32 bit target.  So we should still decide either to disable
LUL on 32 bit targets, or to insert an abstraction layer at the bottom of
the Dwarf/ELF parser which makes it possible to read stuff from the .so file
without having to map the whole thing in.  Valgrind now uses that strategy
and it works well.
Another thing that I just observed is that LUL does much more poorly (has
more unwind failures) on 32-bit than 64-bit Linux.  64-bit also used to have
unwind failures, but most of those were resolved at the end of 1Q2017 via a
combination of Dwarf handling fixups and enabling FP unwinding.  So it would
probably be possible to do the same for 32-bit if required.  In the meantime
however be aware that 32-bit will often fail to give good unwinds, even in
the case where it does manage to map libxul.so to read unwind info.
Linux32 seems like a low-value proposition. I don't think adding the abstraction layer or improving the unwinding is worth the effort.
(In reply to Nicholas Nethercote [:njn] from comment #21)
> Linux32 seems like a low-value proposition. I don't think adding the
> abstraction layer or improving the unwinding is worth the effort.

I agree.  I'm not advocating for it -- merely recording the tradeoffs
as I see them.
(In reply to Michael Layzell [:mystor] from comment #18)
> This is currently blocked on some EME timeouts. As far as I can tell these
> timeouts are occurring because when we start up the EME process we also
> initialize BHR inside of that process. When this is done, we initialize LUL
> in that process, which involves doing a lot of work. We then, in the tests,
> quickly shut down the EME process. This would normally occur quickly,
> however shutting down the process involves shutting down BHR, which involves
> joining the BHR thread. As LUL initialization is occurring on the BHR thread
> at the time, this just hangs everything. I think this then causes everything
> to time out on infra, although I am unsure.
> 
> I think the easiest solution here would be to just not run BHR in the EME
> process. I don't even know if we collect or look at any of the telemetry
> information coming from it. cpearce, how would you feel about doing that?


I'm totally fine with not initializing BHR in the GMP process (which is the process in which EME plugins run). I didn't think we collected telemetry in the GMP process anyway, and we don't startup XPCOM either.
Flags: needinfo?(cpearce)
Created attachment 8880140 [details] [diff] [review]
Part 3: Disable ThreadStackHelper on x86 linux due to a LUL crash

MozReview-Commit-ID: 2kBBFftnJEd
Attachment #8880140 - Flags: review?(nfroyd)
Created attachment 8880141 [details] [diff] [review]
Part 4: Disable the BackgroundHangMonitor in GMP processes

This involved a change to BackgroundHangMonitor, as it initialized sDisabled
incorrectly to false, instead of true, We need sDisabled initialized to true, as
we cannot assume that it is enabled until BackgroundHangMonitor::Startup() is
called.

MozReview-Commit-ID: 94slLTkNk3C
Attachment #8880141 - Flags: review?(nfroyd)

Updated

2 months ago
Attachment #8880140 - Flags: review?(nfroyd) → review+

Updated

2 months ago
Attachment #8880141 - Flags: review?(nfroyd) → review+

Comment 26

2 months ago
Pushed by michael@thelayzells.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/79b4cc56c9f2
Part 1: Move LUL to a separate lock, and initialize it outside of both the profiler and BHR lock on the BHR thread, r=njn
https://hg.mozilla.org/integration/mozilla-inbound/rev/51aac2123868
Part 2: Re-enable the Background Hang Reporter on Linux, r=froydnj
https://hg.mozilla.org/integration/mozilla-inbound/rev/ae5d8df3d304
Part 3: Disable ThreadStackHelper on x86 linux due to a LUL crash, r=froydnj
https://hg.mozilla.org/integration/mozilla-inbound/rev/8ccec3b0d27e
Part 4: Disable the BackgroundHangMonitor in GMP processes, r=froydnj

Comment 27

2 months ago
Backout by michael@thelayzells.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/968f191e70a2
Backout for test failures on a CLOSED TREE
This was causing a bunch of fun test failures - ni?-ing myself to fix them later.
Flags: needinfo?(michael)
I did another try push (https://treeherder.mozilla.org/#/jobs?repo=try&revision=42a5082ca75e697537f200c41192e8d724a90c9a) and it looks pretty green. I think the failures I backed this out for were actually bug 1372433's fault.

I'll try and reland this tomorrow morning so I'm around when the tree inevitably goes up in flames despite my try push :P.

Comment 30

a month ago
Pushed by michael@thelayzells.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/39ed96d7b263
Part 1: Move LUL to a separate lock, and initialize it outside of both the profiler and BHR lock on the BHR thread, r=njn
https://hg.mozilla.org/integration/mozilla-inbound/rev/48f5e0597242
Part 2: Re-enable the Background Hang Reporter on Linux, r=froydnj
https://hg.mozilla.org/integration/mozilla-inbound/rev/56850220bd77
Part 3: Disable ThreadStackHelper on x86 linux due to a LUL crash, r=froydnj
https://hg.mozilla.org/integration/mozilla-inbound/rev/9fc4c6aabd8e
Part 4: Disable the BackgroundHangMonitor in GMP processes, r=froydnj
Flags: needinfo?(michael)
Backed out for xpcshell timeouts on at least ASAN builds:

https://hg.mozilla.org/integration/mozilla-inbound/rev/83c89460055eff4f0f441f467b7e118000461096

https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&fromchange=23814434a6bc690b51bddd1ddd2b86382c32fe47&noautoclassify&filter-searchStr=e71bd12b3099bdeea65d5b1a0b49b18039d00515&filter-resultStatus=success&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=usercancel&filter-resultStatus=running&filter-resultStatus=pending&filter-resultStatus=runnable&tochange=83c89460055eff4f0f441f467b7e118000461096&selectedJob=113766012
Flags: needinfo?(michael)
Similar issues in a different asan xpcshell chunk: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&fromchange=23814434a6bc690b51bddd1ddd2b86382c32fe47&noautoclassify&filter-searchStr=6566c310c9f3f5516235a02d04acbac4b90d0637&tochange=83c89460055eff4f0f441f467b7e118000461096&selectedJob=113765330
These build metrics alerts where observed, when this bug 1st landed:

== Change summary for alert #7878 (as of July 12 2017 17:13 UTC) ==

Regressions:

  4%  compiler warnings summary linux32 pgo      1,421.17 -> 1,472.50

Improvements:

  8%  installer size summary linux32 pgo      61,150,341.67 -> 56,538,162.58
  3%  build times summary linux32 pgo taskcluster-c4.4xlarge3,122.23 -> 3,025.40

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=7878
And these arrived after the backout mentioned in comment 31:

== Change summary for alert #7892 (as of July 12 2017 23:35 UTC) ==

Regressions:

  8%  installer size summary linux32 pgo      56,539,449.92 -> 61,122,888.00

Improvements:

  3%  compiler warnings summary linux32 pgo      1,472.00 -> 1,422.42

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=7892
Created attachment 8886666 [details] [diff] [review]
Part 5: Don't block the BackgroundHangMonitor thread intializing stackwalk

I think that this might help fix the asan hangs which caused this patch to be
backed out most recently. The idea here is to try to get the BHR thread
responsive as quickly as possible. When I managed to reproduce the hangs locally
with an ASAN build it seemed that loading LUL was taking a very long time, so
the BHR thread was not responsive, and I'm guessing that something somehow
needed it to be responsive.

MozReview-Commit-ID: AVA1Nok6zyQ
Attachment #8886666 - Flags: review?(erahm)
Comment on attachment 8886666 [details] [diff] [review]
Part 5: Don't block the BackgroundHangMonitor thread intializing stackwalk

Review of attachment 8886666 [details] [diff] [review]:
-----------------------------------------------------------------

I don't love this, but if it helps get BHR enabled I'm fine with it. We should follow up with jseward to discuss better solutions to lul init being super slow.

Also can you do a test run to verify *why* this fixes is things? Just add some logging, my guess is mSTS is null for those tests and we never initialize lul. It would be nice to confirm that though. Some logging in the runnable would be good too, it's possible mSTS is non-null but it just hasn't processed it yet before shutdown.

::: xpcom/threads/BackgroundHangMonitor.cpp
@@ +301,5 @@
>  {
>    // Make sure to initialize any state required to perform stack walking as soon
>    // as possible, so it does not interfere with us collecting hang stacks. We
> +  // don't want to be on the main thread, and we don't want to block anything
> +  // running, so we do it off-main-thread.

Isn't this already off-main-thread? As in, this is the BH manager thread. I think we need a better explanation here. From our discussion on IRC it sounded like lul initialization could block this thread, which would block shutdown causing tests to fail due to shutdown hangs.

@@ +306,5 @@
> +  //
> +  // sStackWalkInitialized is set to true atomically when stackwalk
> +  // initialization is complete.
> +  static Atomic<bool> sStackWalkInitialized(false);
> +  if (mSTS) {

This seems like a drastic change, if |mSTS| isn't set yet we'll never get BHR stacks? From discussing on IRC it sounds like that's okay b/c without STS nothing is going to work. Can you add some comments to that effect?
Attachment #8886666 - Flags: review?(erahm) → review+
I think I'm still seeing timeouts. I have no idea why they're happening unfortunately :-/.

I'm not sure what the best approach would be here, or why this is happening. Potentially this is due to the large amount of memory we are mapping into our address space messing things up? I'm not sure.
Flags: needinfo?(michael)
(In reply to Michael Layzell [:mystor] from comment #37)
> I think I'm still seeing timeouts. I have no idea why they're happening
> unfortunately :-/.

Can you be more specific about this, please?  You're seeing timeouts in
exactly which configurations/scenarios?
Flags: needinfo?(michael)
Created attachment 8888931 [details] [diff] [review]
Part 6: Don't initialize stackwalk except when requested

I accidentally stumbled over what I did wrong while working on bug 1380081.

Basically, any time that I wanted a stackwalk, even if I wasn't requesting a
native stack, the profiler would run `profiler_initialize_stackwalk` which would
cause the thread to block while we finished loading libxul into memory.

The code was probably hanging more in asan builds because I'm guessing mapping
files into memory is slower in that mode.

This just changes the places where we call profiler_initialize_stackwalk to be
only the places where we actually are going to need LUL initialized.

(note: I haven't tested this yet on try - but I did a bunch of local testing and
it seems to have fixed the problems! - pushing to try now.)

MozReview-Commit-ID: 9Nrraa6Hw7n
Attachment #8888931 - Flags: review?(n.nethercote)
Comment on attachment 8888931 [details] [diff] [review]
Part 6: Don't initialize stackwalk except when requested

Review of attachment 8888931 [details] [diff] [review]:
-----------------------------------------------------------------

This seems ok but is the start of DoNativeBacktrace() a better chokepoint to put this call?
Attachment #8888931 - Flags: review?(n.nethercote) → review+
(In reply to Nicholas Nethercote [:njn] from comment #40)
> Comment on attachment 8888931 [details] [diff] [review]
> Part 6: Don't initialize stackwalk except when requested
> 
> Review of attachment 8888931 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> This seems ok but is the start of DoNativeBacktrace() a better chokepoint to
> put this call?

DoNativeBacktrace would be a bad place to do this call. In DoNativeBacktrace the target thread is already suspended, so it's not safe to do things like allocate. profiler_initialize_stackwalk definitely does things like allocate, so we need to have already done it before DoNativeBacktrace.
Flags: needinfo?(michael)
Created attachment 8889479 [details] [diff] [review]
Part 6: Don't initialize stackwalk except when requested

MozReview-Commit-ID: 9Nrraa6Hw7n
Created attachment 8889484 [details] [diff] [review]
Part 6: Don't initialize stackwalk except when requested

MozReview-Commit-ID: 9Nrraa6Hw7n
Attachment #8888931 - Attachment is obsolete: true
Attachment #8889479 - Attachment is obsolete: true
Created attachment 8893420 [details] [diff] [review]
Part 7: Disable BHR on ASan builds due to timeouts

I've been fighting this on and off for months, and it doesn't seem like anything
I do to try to make initializing LUL _even less_ blocking is helping. It seems
like initializing LUL just has such bad performance in ASan builds, that even
when it is run off main thread it reliably causes timeouts on ASan XPCShell
tests.

This patch just turns off BHR on ASan builds so we can turn it on on linux. BHR
is only enabled on nightly, and the timeout problems appear to just be
extreme ASan-induced performance issues, rather than a symptom of a real
problem.

MozReview-Commit-ID: 7S5WWYu2fzp
Attachment #8893420 - Flags: review?(nfroyd)
Created attachment 8893421 [details] [diff] [review]
Part 8: Disable BHR completely on linux 32-bit

Previously only stackwalking was disabled. This makes it such that we don't run
it at all, which I think is reasonable to avoid useless pings and/or unnecessary
data collection.

This will be extra useful because of bug 1380081, which will increase the amount of data collection.

MozReview-Commit-ID: H3i1c6d6yut
Attachment #8893421 - Flags: review?(nfroyd)
Attachment #8893421 - Flags: review?(nfroyd) → review+
(In reply to Michael Layzell [:mystor] from comment #44)
> I've been fighting this on and off for months, and it doesn't seem like
> anything
> I do to try to make initializing LUL _even less_ blocking is helping. It
> seems
> like initializing LUL just has such bad performance in ASan builds, that even
> when it is run off main thread it reliably causes timeouts on ASan XPCShell
> tests.

Is this specific tests, or can it occur during any tests whatsoever?  If it's the former, could we just disable those tests on ASan or requestLongerTimerout or what have you?  Or have you been trying that, and it's just whack-a-mole?

Would it be possible to runtime-disable BHR instead for those tests or for all of xpcshell?  I guess the effect is the same.

Disabling code entirely under ASan makes me nervous, but as BHR is Nightly-only, the concern is lessened here.
Flags: needinfo?(michael)
(In reply to Nathan Froyd [:froydnj] from comment #46)
> Is this specific tests, or can it occur during any tests whatsoever?  If
> it's the former, could we just disable those tests on ASan or
> requestLongerTimerout or what have you?  Or have you been trying that, and
> it's just whack-a-mole?
> 
> Would it be possible to runtime-disable BHR instead for those tests or for
> all of xpcshell?  I guess the effect is the same.
> 
> Disabling code entirely under ASan makes me nervous, but as BHR is
> Nightly-only, the concern is lessened here.

The particular failures I was fighting were specifically xpcshell failures. As far as I could tell it seemed to be xpcshell test suites which used run_test_in_child a decent amount.

I still don't know exactly what the problem was unfortunately, but it consistently manifested as the test suite timing out, and looking at time stamps it looked like progress was still being made, albeit slowly. I'm not sure.

The other option which I was considering was to, rather than disable it entirely under ASan, just disable stackwalking under ASan. I only need to load LUL (which seems to be the cause of the problem) if I want to stackwalk, and I could just not do that. Would that be preferable?
Flags: needinfo?(michael) → needinfo?(nfroyd)
Approximately what length of timeout was failing?

I'm sorry this has been such a hassle.  I can imagine LUL is slower
to load the libxul.so unwind data under ASan because

* execution is just plain slower with ASan.

* mmap-ing in libxul.so to read unwind info in means ASan has to create
  and later destroy a shadow mapping of 1/8 the size of libxul.so, 
  although I can't imagine that's a big deal.

* libxul.so is bigger when compiled with ASan and so there's more
  unwind info to read.
This is the most recent try push which I made which has this problem present.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=a4b0eb4120742667a8dcc479f4c33c1f13239a02&filter-searchStr=asan

There are some other burnt tree failures in this push (like the test_BHRObserver.js failure - I was testing the patches for bug 1380081 at the time), but the timeouts are the main thing which was fixed by turning things off.

It looks from looking again that we also timeout in other places.
(In reply to Michael Layzell [:mystor] from comment #47)
> The other option which I was considering was to, rather than disable it
> entirely under ASan, just disable stackwalking under ASan. I only need to
> load LUL (which seems to be the cause of the problem) if I want to
> stackwalk, and I could just not do that. Would that be preferable?

I think this would be preferable if possible.
Flags: needinfo?(nfroyd)
Comment on attachment 8893420 [details] [diff] [review]
Part 7: Disable BHR on ASan builds due to timeouts

Canceling review while we investigate other possibilities.
Attachment #8893420 - Flags: review?(nfroyd)
Here's an idea: AFAICT, we don't *have* to use LUL on Linux when MOZ_PROFILING is enabled. We can instead use MozStackWalk or FramePointerStackWalk, like we do on Windows and Mac.

I briefly tried using FramePointerStackWalk on Linux64 yesterday and it appeared to basically work. I compared LUL stacks against FramePointerStackWalk stacks. The LUL stacks were more detailed -- perhaps they have info about inlined frames? -- but the FramePointerStackWalk stacks seemed reasonable. (DMD uses MozStackWalk() on Linux.)

So we could use LUL for the profiler, but use FramePointerStackWalk for profiler_suspend_and_sample_thread(). Or something like that.

mystor, what do you think?
Flags: needinfo?(michael)
Created attachment 8893923 [details] [diff] [review]
Part 1: Use FramePointerStackWalk when stackwalking from BHR

This is a simple implementation of your suggestion nick. I'm doing a try push to see if it fixes the problem while I try to rebase bug 1380081 on top of this instead of the LUL changes.

MozReview-Commit-ID: CsMHW5gVmYP
Attachment #8893923 - Flags: review?(n.nethercote)
Attachment #8869614 - Attachment is obsolete: true
Attachment #8870494 - Attachment is obsolete: true
Attachment #8880140 - Attachment is obsolete: true
Attachment #8880141 - Attachment is obsolete: true
Attachment #8886666 - Attachment is obsolete: true
Attachment #8889484 - Attachment is obsolete: true
Attachment #8893420 - Attachment is obsolete: true
Attachment #8893421 - Attachment is obsolete: true
Created attachment 8893924 [details] [diff] [review]
Part 2: re-enable BHR on linux

Changes to BHR required if we take this approach (before bug 1380081)

MozReview-Commit-ID: BtRZecG4IYA
Attachment #8893924 - Flags: review?(nfroyd)
Comment on attachment 8893923 [details] [diff] [review]
Part 1: Use FramePointerStackWalk when stackwalking from BHR

Review of attachment 8893923 [details] [diff] [review]:
-----------------------------------------------------------------

::: tools/profiler/core/platform.cpp
@@ +113,5 @@
>  # include "lul/platform-linux-lul.h"
> +
> +// Also support using FramePointerStackWalk if MOZ_PROFILING is enabled. The
> +// profiler won't use this type, but profiler_suspend_and_sample_thread we
> +// prefers frame pointer stackwalking to LUL stackwalking when both are present.

I think this would be clearer: "On Linux we use LUL for periodic samples and synchronous samples, but we use FramePointerStackWalk for backtrace samples when MOZ_PROFILING is enabled. (See the comment at the top of the file for the defition of period/synchronous/backtrace.)

@@ +1288,5 @@
> +{
> +  // This method is used to decide on which stackwalking tool to use. We choose
> +  // them with the following priority if more than one stackwalking
> +  // implementation is avaliable:
> +  // LUL -> EHABI -> FramePointer -> MozStackWalk

I find this comment misleading. Linux is the only platform where we are ever making a choice between two implementation, but this is not the place where we are making that choice, because LUL is always available on Linux but FramePointer is only sometimes available.

I suggest changing it to something like this:

"This method determines which stackwalker is used for periodic samples and synchronous samples. (Backtrace samples are treated differently; see profiler_suspend_and_sample_thread() for details.) The only part of the ordering that matters is that LUL must precede FRAME_POINTER because on Linux they can both be present."

@@ +3410,5 @@
>  #if defined(HAVE_NATIVE_UNWIND)
>          if (aSampleNative) {
> +          // Prefer FramePointerStackWalk over LUL stackwalk, to avoid the need
> +          // to initialize LUL to use profiler_suspend_and_sample_thread
> +          // (bug 1365309).

Can you add "On Linux..." at the start of this comment? Ditto for the identical comment below.

@@ +3472,2 @@
>            DoNativeBacktrace(lock, *info, aRegs, nativeStack);
> +# endif

We currently have two versions of profiler_suspend_and_sample_thread(). Can we remove the old one yet?
Attachment #8893923 - Flags: review?(n.nethercote) → review+
mystor: please double-check that FramePointerStackWalk() gives reasonable results on Linux before you land this :)
(In reply to Nicholas Nethercote [:njn] from comment #52)
> The LUL stacks were more detailed -- perhaps they have info about inlined frames?

No, LUL doesn't have support for doing inlined frames right now.
Attachment #8893924 - Flags: review?(nfroyd) → review+
I'm running into a weird bug with this patch. Intermittently the GeckoProfiler.SuspendAndSample test is failing, as we're getting a backtrace which only has 2 entries instead of the usual ~20.

I think that i know why this is happening, but I don't know how to deal with it. The frames which are captured are the pseudostack entry XRE_main, and a single pc frame set in `do_lookup_x()`, which from some quick googling is a frame inside of glibc. 

I think what is happening is that the glibc which we're building against, at least on this computer (also it seems on try), doesn't have frame pointers, meaning that we basically don't get native stacks when stackwalking from a frame located inside of glibc. I haven't been able to reproduce this under rr as it seems very race-dependent, so I can't confirm this of course. 

What do you think I should do?
Flags: needinfo?(michael) → needinfo?(n.nethercote)
(In reply to Nicholas Nethercote [:njn] from comment #55)
> We currently have two versions of profiler_suspend_and_sample_thread(). Can
> we remove the old one yet?

no, we need to land bug 1367406 first which I have local patches for but hasn't landed yet.
(In reply to Michael Layzell [:mystor] from comment #58)
> I'm running into a weird bug with this patch. Intermittently the
> GeckoProfiler.SuspendAndSample test is failing, as we're getting a backtrace
> which only has 2 entries instead of the usual ~20.

This is bug 1384766!
 
> I think that i know why this is happening, but I don't know how to deal with
> it. The frames which are captured are the pseudostack entry XRE_main, and a
> single pc frame set in `do_lookup_x()`, which from some quick googling is a
> frame inside of glibc. 
> 
> I think what is happening is that the glibc which we're building against, at
> least on this computer (also it seems on try), doesn't have frame pointers,
> meaning that we basically don't get native stacks when stackwalking from a
> frame located inside of glibc. I haven't been able to reproduce this under
> rr as it seems very race-dependent, so I can't confirm this of course. 
> 
> What do you think I should do?

I chose the >5 check for the frame count arbitrarily. Let's just change it so it does >=2 instead, and add a comment explaining why. Are you happy to do this?

(And thank you for working this out, I was planning to look at this intermittent failure today but you saved me from having to do so.)
Flags: needinfo?(n.nethercote) → needinfo?(michael)
Created attachment 8895371 [details] [diff] [review]
Part 3: Reduce expected frame count for GeckoProfiler.SuspendAndSample

This is necessary as with FramePointerStackWalk if the current frame is in glibc
when we suspend the target thread we sometimes can't collect more than one
native frame as there are no frame pointers.

We'll still collect at least one entry, so this check relaxes the assertion.

MozReview-Commit-ID: IbatTQwDti1
Attachment #8895371 - Flags: review?(n.nethercote)
Created attachment 8895373 [details] [diff] [review]
Part 1: Use FramePointerStackWalk when stackwalking from BHR

MozReview-Commit-ID: CsMHW5gVmYP
Attachment #8893923 - Attachment is obsolete: true
I think once this gets reviewed I should be able to land the new bhr ping stuff from bug 1380081! \o/
Flags: needinfo?(michael)
Attachment #8895371 - Flags: review?(n.nethercote) → review+

Comment 64

12 days ago
Pushed by michael@thelayzells.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/aac6aee303fb
Part 1: Use FramePointerStackWalk when stackwalking from BHR, r=njn
https://hg.mozilla.org/integration/mozilla-inbound/rev/579fd072422e
Part 2: re-enable BHR on linux, r=froydnj
https://hg.mozilla.org/integration/mozilla-inbound/rev/d461adfd884e
Part 3: Reduce expected frame count for GeckoProfiler.SuspendAndSample, r=njn
Backed out bug 1365309, bug 1380081 and bug 1386369 for almost perma-timeout in many mochitest chunks on Windows 7 opt and pgo:

Bug 1365309
https://hg.mozilla.org/integration/mozilla-inbound/rev/7fefe0f46c4648f92d75e02a5c5bf8b20a6f799d
https://hg.mozilla.org/integration/mozilla-inbound/rev/d5bdc2c7ca34dcb6a51365aee8aef45cd76854c7
https://hg.mozilla.org/integration/mozilla-inbound/rev/d7a7569e436a330711c0cd8b622c399b14fb3190

Bug 1380081
https://hg.mozilla.org/integration/mozilla-inbound/rev/2145dba678d8f69a35149a120908445e55f02587
https://hg.mozilla.org/integration/mozilla-inbound/rev/e6d57bb2b33e6b37585b44e0bfd3f76d45f808d2
https://hg.mozilla.org/integration/mozilla-inbound/rev/f6b47124336fba85e8ba753ae607c2347b5665e6
https://hg.mozilla.org/integration/mozilla-inbound/rev/36c4ddeb51a14c5e4046376dd22501fa22d9889f
https://hg.mozilla.org/integration/mozilla-inbound/rev/718448f617c45649e07e44c9d5d6cd95cf2d72f3
https://hg.mozilla.org/integration/mozilla-inbound/rev/ef7efc0cd92a05231760aff15c546e97cfd0925f
https://hg.mozilla.org/integration/mozilla-inbound/rev/fb0774dc70c54f12cc63cbcc4197d922b5a0b832
https://hg.mozilla.org/integration/mozilla-inbound/rev/6ac0bd6aeda33ab083cca86d36508c8b3d1231b7
https://hg.mozilla.org/integration/mozilla-inbound/rev/f00f5f041f65f3001b6b0ebf23fd7f963baa6fdd
https://hg.mozilla.org/integration/mozilla-inbound/rev/750dc31d7ee77a328e1fbde3b31d262a8ea33a93
https://hg.mozilla.org/integration/mozilla-inbound/rev/dd34bd64de6a02be6860e14c23f90c8613e72c56
https://hg.mozilla.org/integration/mozilla-inbound/rev/fcbcb7b65bf2fb6532d0f1bfc7b421f00239155a
https://hg.mozilla.org/integration/mozilla-inbound/rev/514f6411123722c30a773043ce2bef22486b4982
https://hg.mozilla.org/integration/mozilla-inbound/rev/8c07f2793aa9eb16d6d51db1b0e5a6cb00d54856

Bug 1386369
https://hg.mozilla.org/integration/mozilla-inbound/rev/5cd587ec53be835443367007eeb79bdac0ef4c54
https://hg.mozilla.org/integration/mozilla-inbound/rev/c621494dc3f5f5cfd6e9a8e62f19396dbc9b8990
https://hg.mozilla.org/integration/mozilla-inbound/rev/43cb8075d066fba493779e35922ce83c77bdd70d

Push with failures: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=272d381012677aedc86462ebd8e45b0b30884655&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=runnable
Flags: needinfo?(michael)
I just landed part 3 under the auspices of bug 1384766 in order to fix that intermittent failure.

Comment 67

8 days ago
Pushed by michael@thelayzells.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/4a7bef9816da
Part 1: Use FramePointerStackWalk when stackwalking from BHR, r=njn
https://hg.mozilla.org/integration/mozilla-inbound/rev/cfc37c6651db
Part 2: re-enable BHR on linux, r=froydnj
Flags: needinfo?(michael)

Comment 68

8 days ago
Pushed by michael@thelayzells.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/f31c029bbd16
Part 3: Disable test_ThreadHangStats.js on ASAN, a=bustage

Comment 69

7 days ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/4a7bef9816da
https://hg.mozilla.org/mozilla-central/rev/cfc37c6651db
https://hg.mozilla.org/mozilla-central/rev/f31c029bbd16
Status: NEW → RESOLVED
Last Resolved: 7 days ago
status-firefox57: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla57
You need to log in before you can comment on or make changes to this bug.