Closed Bug 1365309 Opened 9 years ago Closed 8 years ago

BHR thread hangs for over a second while holding profiler lock initializing LUL

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla57

Tracking Flags:

Tracking

Status

firefox57

---

fixed

People

(Reporter: nika, Assigned: nika)

References

Details

Attachments

(3 files, 12 obsolete files)

Part 1: Move LUL to a separate lock, and initialize it outside of both the profiler and BHR lock on the BHR thread 9 years ago Nika Layzell [:nika] (ni? for response) 19.55 KB, patch	n.nethercote : feedback+	Details \| Diff \| Splinter Review
Part 2: Re-enable the Background Hang Reporter on Linux 9 years ago Nika Layzell [:nika] (ni? for response) 1.29 KB, patch	froydnj : review+	Details \| Diff \| Splinter Review
Part 1: Move LUL to a separate lock, and initialize it outside of both the profiler and BHR lock on the BHR thread 9 years ago Nika Layzell [:nika] (ni? for response) 18.31 KB, patch	n.nethercote : review+	Details \| Diff \| Splinter Review
Part 3: Disable ThreadStackHelper on x86 linux due to a LUL crash 8 years ago Nika Layzell [:nika] (ni? for response) 2.29 KB, patch	froydnj : review+	Details \| Diff \| Splinter Review
Part 4: Disable the BackgroundHangMonitor in GMP processes 8 years ago Nika Layzell [:nika] (ni? for response) 2.89 KB, patch	froydnj : review+	Details \| Diff \| Splinter Review
Part 5: Don't block the BackgroundHangMonitor thread intializing stackwalk 8 years ago Nika Layzell [:nika] (ni? for response) 2.66 KB, patch	erahm : review+	Details \| Diff \| Splinter Review
Part 6: Don't initialize stackwalk except when requested 8 years ago Nika Layzell [:nika] (ni? for response) 2.64 KB, patch	n.nethercote : review+	Details \| Diff \| Splinter Review
Part 6: Don't initialize stackwalk except when requested 8 years ago Nika Layzell [:nika] (ni? for response) 4.23 KB, patch		Details \| Diff \| Splinter Review
Part 6: Don't initialize stackwalk except when requested 8 years ago Nika Layzell [:nika] (ni? for response) 4.22 KB, patch		Details \| Diff \| Splinter Review
Part 7: Disable BHR on ASan builds due to timeouts 8 years ago Nika Layzell [:nika] (ni? for response) 1.59 KB, patch		Details \| Diff \| Splinter Review
Part 8: Disable BHR completely on linux 32-bit 8 years ago Nika Layzell [:nika] (ni? for response) 3.13 KB, patch	froydnj : review+	Details \| Diff \| Splinter Review
Part 1: Use FramePointerStackWalk when stackwalking from BHR 8 years ago Nika Layzell [:nika] (ni? for response) 12.99 KB, patch	n.nethercote : review+	Details \| Diff \| Splinter Review
Part 2: re-enable BHR on linux 8 years ago Nika Layzell [:nika] (ni? for response) 2.35 KB, patch	froydnj : review+	Details \| Diff \| Splinter Review
Part 3: Reduce expected frame count for GeckoProfiler.SuspendAndSample 8 years ago Nika Layzell [:nika] (ni? for response) 1.43 KB, patch	n.nethercote : review+	Details \| Diff \| Splinter Review
Part 1: Use FramePointerStackWalk when stackwalking from BHR 8 years ago Nika Layzell [:nika] (ni? for response) 14.17 KB, patch		Details \| Diff \| Splinter Review

Nika Layzell [:nika] (ni? for response)

Assignee

Description

•

9 years ago

When running the xpcshell-test dom/push/test/xpcshell/test_clear_origin_data.js, the BHR thread kicks in and detects a hang. On linux after bug 1357829 this will cause LUL to be initialized. Initializing LUL (specifically calling read_procmaps) takes over a second, and causes parts of the test to time out, presumably because the main thread gets stuck trying to grab the profiler lock. Not sure what the best solution is here. If we figure out the issues with stacks not being landed (bug 1365029), I might turn off BHR on linux temporarily until this is fixed.

Nika Layzell [:nika] (ni? for response)

Assignee

Comment 1

•

9 years ago

Markus suggests you might have an idea as to what to do here

Flags: needinfo?(jseward)

Julian Seward [:jseward]

Comment 2

•

9 years ago

Ping me on irc when you are around. I don't understand enough about the context here to comment, at the moment.

Flags: needinfo?(jseward)

Nika Layzell [:nika] (ni? for response)

Assignee

Comment 3

•

9 years ago

Attached patch Part 1: Move LUL to a separate lock, and initialize it outside of both the profiler and BHR lock on the BHR thread (obsolete) — Details — Splinter Review

Currently LUL is a member of CorePS, meaning that it is guarded by the PSMutex. This mutex is grabbed by the main thread at random points during the execution of the program. This is unfortunate, as initializing LUL can take a long time (>1s on my local machine), and we definitely don't want to be blocking the main thread waiting for it. In addition, in the BHR case, we used to be grabbing LUL when we got our first hang, while both the PSMutex and the BHR monitor were being held. This meant that the main thread could make no progress during LUL initializaion, as the BHR monitor is grabbed by the main thread on every spin of the event loop. This patch moves that initialization to be behind a completely separate lock, and makes BHR initialize it on the background thread before acquiring the BHR lock, meaning that no locks other than the one guarding LUL should be held during its initialization. MozReview-Commit-ID: GwNYQaEAqJ1

Attachment #8869613 - Flags: review?(n.nethercote)

Nika Layzell [:nika] (ni? for response)

Assignee

Comment 4

•

9 years ago

Attached patch Part 2: Re-enable the Background Hang Reporter on Linux (obsolete) — Details — Splinter Review

MozReview-Commit-ID: EjAti9UqW7k

Attachment #8869614 - Flags: review?(nfroyd)

Nathan Froyd [:froydnj]

Updated

•

9 years ago

Attachment #8869614 - Flags: review?(nfroyd) → review+

Nicholas Nethercote [inactive]

Comment 5

•

9 years ago

Comment on attachment 8869613 [details] [diff] [review] Part 1: Move LUL to a separate lock, and initialize it outside of both the profiler and BHR lock on the BHR thread Review of attachment 8869613 [details] [diff] [review]: ----------------------------------------------------------------- I more or less see what you are trying to do, but there are a couple of things I'm unclear/unhappy about, so I'm giving f+ for now. Details below. ::: tools/profiler/core/platform-linux-android.cpp @@ +251,5 @@ > // mSamplerTid to a dummy value and fill it in for real in > // SuspendAndSampleAndResumeThread(). > , mSamplerTid(-1) > { > + profiler_initialize_stackwalk(); This LUL initialization is still happening (usually) on the main thread. From the commit log it sounds like you were trying to change this. Have I misunderstood? ::: tools/profiler/core/platform.cpp @@ +530,5 @@ > #endif > } > > +#if defined(USE_LUL_STACKWALK) > +// When using LUL, the state is stored behind a separate Unfinished sentence. Just remove this line? @@ +542,5 @@ > +public: > + static lul::LUL* Get() { > + mozilla::StaticMutexAutoLock lock(sLulMutex); > + return sLul; > + } This seems bogus. The function briefly locks the mutex, but then unlocks it before returning. The caller then has full access to sLul without the protection of the mutex. Although... maybe the point is that it's only the creation of sLul that needs the mutex protection, because after creation it's read-only? If that's the case, it's a non-obvious use of a mutex, so please make it clearer in the comments at the top of this class. @@ +548,5 @@ > + static lul::LUL* GetOrCreate() { > + mozilla::StaticMutexAutoLock lock(sLulMutex); > + if (sLul) { > + return sLul; > + } Gecko style is to put the method's opening brace on its own line. Same for all the functions in this class. @@ +1887,5 @@ > // should poke it to give it a chance to print those statistics. This > // involves doing I/O (fprintf, __android_log_print, etc.) and so > // can't safely be done from the critical section inside > // SuspendAndSampleAndResumeThread, which is why it is done here. > + lul::LUL* lul = LulPS::Get(); Having to lock the mutex a second time isn't good, esp. given that MaybeShowStats() usually doesn't do anything. Please try to rearrange things so it's only locked once. @@ +3082,2 @@ > > + const CorePS::ThreadVector& liveThreads = CorePS::LiveThreads(lock); Why introduce this new scope? I don't see any benefit. (Maybe it's a leftover from a reverted change?)

Attachment #8869613 - Flags: review?(n.nethercote) → feedback+

Nika Layzell [:nika] (ni? for response)

Assignee

Comment 6

•

9 years ago

Attached patch Part 1: Move LUL to a separate lock, and initialize it outside of both the profiler and BHR lock on the BHR thread (obsolete) — Details — Splinter Review

(In reply to Nicholas Nethercote [:njn] from comment #5) > This LUL initialization is still happening (usually) on the main thread. > From the commit log it sounds like you were trying to change this. Have I > misunderstood? Yes, when the Background Hang Reporter initializes I am planning to run LUL initialization off of the main thread. If you look in BackgroundHangMonitor.cpp (BackgroundHangManager::RunMonitorThread()), you can see where I run it during the startup of the background hang monitor thread. > This seems bogus. The function briefly locks the mutex, but then unlocks it > before returning. The caller then has full access to sLul without the > protection of the mutex. > > Although... maybe the point is that it's only the creation of sLul that > needs the mutex protection, because after creation it's read-only? If that's > the case, it's a non-obvious use of a mutex, so please make it clearer in > the comments at the top of this class. sLul has approximately the same lifetime as CorePS once it is initialized (It is not destroyed until CorePS is destroyed). The goal of this mutex was to ensure that if sLul is in the process of being initialized, the Getting thread would be blocked until the initialization was complete. > @@ +1887,5 @@ > > // should poke it to give it a chance to print those statistics. This > > // involves doing I/O (fprintf, __android_log_print, etc.) and so > > // can't safely be done from the critical section inside > > // SuspendAndSampleAndResumeThread, which is why it is done here. > > + lul::LUL* lul = LulPS::Get(); > > Having to lock the mutex a second time isn't good, esp. given that > MaybeShowStats() usually doesn't do anything. Please try to rearrange things > so it's only locked once. As you mention above, it is a bit silly that `Get` is guarded by the mutex. In all places where `Get` is called, previous code should have made sure to call `GetOrCreate()`, which locks the mutex, meaning that it should already have been initialized and not need synchronization. I will change `Get` to not use synchronization, and assert that `sLul` has been initialized by the time it is called. MozReview-Commit-ID: GwNYQaEAqJ1

Attachment #8870494 - Flags: review?(n.nethercote)

Nika Layzell [:nika] (ni? for response)

Assignee

Updated

•

9 years ago

Attachment #8869613 - Attachment is obsolete: true

Nicholas Nethercote [inactive]

Comment 7

•

9 years ago

Comment on attachment 8870494 [details] [diff] [review] Part 1: Move LUL to a separate lock, and initialize it outside of both the profiler and BHR lock on the BHR thread Review of attachment 8870494 [details] [diff] [review]: ----------------------------------------------------------------- Looking better. Thanks. ::: tools/profiler/core/platform.cpp @@ +533,5 @@ > +#if defined(USE_LUL_STACKWALK) > +// When we are using LUL, we are going to need to perform expensive > +// initialization before we can start getting backtraces. We don't want to be > +// holding the profiler mutex while doing that, as the main thread sometimes > +// wants to grant that and we never want to get in the way of that. Because of The use of the word "grant" here confuses me a little. Not sure exactly what you mean by that. Reword? @@ +554,5 @@ > + // > + // This is safe because LulPS::sLul has approximately the same lifetime as > + // CorePS. It is initialized after CorePS, but will live until CorePS is > + // killed. Once it has been initialized, it is safe to read without > + // synchronization. I'd add that it's safe after initialization because it's read-only at that point.

Attachment #8870494 - Flags: review?(n.nethercote) → review+

Nika Layzell [:nika] (ni? for response)

Assignee

Updated

•

9 years ago

Assignee: nobody → michael

Nika Layzell [:nika] (ni? for response)

Assignee

Comment 8

•

8 years ago

This seems to crash on try occasionally https://treeherder.mozilla.org/logviewer.html#?job_id=105513398&repo=try&lineNumber=2361 I'm not sure why initializing LUL would cause a crash. Nick, do you have any ideas?

Flags: needinfo?(n.nethercote)

Nicholas Nethercote [inactive]

Comment 9

•

8 years ago

Julian wrote LUL so I will forward to him.

Flags: needinfo?(n.nethercote) → needinfo?(jseward)

Julian Seward [:jseward]

Comment 10

•

8 years ago

(In reply to Michael Layzell [:mystor] from comment #8) > This seems to crash on try occasionally That is a 32-bit Linux process, which is something of a rarity these days. Are the other crashes that you see also only with 32-bit processes?

Flags: needinfo?(jseward) → needinfo?(michael)

Nika Layzell [:nika] (ni? for response)

Assignee

Comment 11

•

8 years ago

(In reply to Julian Seward [:jseward] from comment #10) > (In reply to Michael Layzell [:mystor] from comment #8) > > This seems to crash on try occasionally > > That is a 32-bit Linux process, which is something of a rarity these days. > Are the other crashes that you see also only with 32-bit processes? I don't see crashes on non-32-bit linux boxes. Should we be disabling the profiler on those platforms?

Flags: needinfo?(michael) → needinfo?(n.nethercote)

Nicholas Nethercote [inactive]

Comment 12

•

8 years ago

> I don't see crashes on non-32-bit linux boxes. Should we be disabling the > profiler on those platforms? You mean, should we disable the profiler on x86/Linux? Seems reasonable given that we don't support x86/Darwin. It would simplify the code a very small amount. mstange, jseward: what do you think about removing x86/Linux support from the profiler and LUL?

Flags: needinfo?(n.nethercote)

Nicholas Nethercote [inactive]

Updated

•

8 years ago

Flags: needinfo?(mstange)

Flags: needinfo?(jseward)

Julian Seward [:jseward]

Comment 13

•

8 years ago

Sorry, I lost track of this. I would guess this is crashing when LUL tries to mmap libxul.so to read debuginfo from it. This has happened to us before. It would be good to confirm that, though. Mmaping libxul.so on any 32 bit platform can fail since it requires finding a contiguous chunk of address space of more than 1GB, which often isn't available. On 64 bit platforms, by contrast, this isn't a problem. Fixing this properly would involve inserting an abstraction layer under LUL's ELF and Dwarf parsers that allows reading arbitrary bits of the file without having to map the whole thing in. That's a significant amount of work, though. Whether anyone will miss x86/Linux support for the profiler, I can't say. If we want to remove it, that's fine by me.

Flags: needinfo?(jseward)

Markus Stange [:mstange]

Comment 14

•

8 years ago

(In reply to Nicholas Nethercote [:njn] from comment #12) > mstange, jseward: what do you think about removing x86/Linux support from > the profiler and LUL? If it doesn't really work in most cases then it's not really worth having it around. Just to clarify, this just means that the "stackwalk" feature wouldn't be available, and not that MOZ_GECKO_PROFILER wouldn't be defined, right?

Flags: needinfo?(mstange)

Nicholas Nethercote [inactive]

Comment 15

•

8 years ago

> Just to clarify, this just means that the "stackwalk" feature wouldn't be > available, and not that MOZ_GECKO_PROFILER wouldn't be defined, right? Good question. I had assumed it would be fully disabled, but we could just remove 32-bit LUL support and undfine HAVE_NATIVE_UNWIND but the rest would still work. But, how useful is the profiler without stack walking? Is anybody ever going to use the profiler in that configuration? I also talked with glandium about this. We do still ship 32-bit Linux, at least for Beta and Nightly (e.g. https://www.mozilla.org/en-US/firefox/beta/all/). But shipped Firefox is not quite the same as locally-built Firefox. In particular, libxul.so is a lot bigger when locally built. If the problem really is that the mapping fails, it won't affect shipped builds. Also, if it is the mapping failing, surely LUL should handle that more gracefully than crashing?

Markus Stange [:mstange]

Comment 16

•

8 years ago

All good points. So why don't we leave everything as it is and just disable BHR stackwalking on Linux 32? Then we'll still be able to get profiles from users on linux32 at least in some cases, and may be able to identify performance problems that they hit.

Julian Seward [:jseward]

Comment 17

•

8 years ago

I just built Fx on 32 bit Linux and it crashed twice in two runs, when using the profiler. So it's at least reproducible. Am investigating.

Nika Layzell [:nika] (ni? for response)

Assignee

Comment 18

•

8 years ago

This is currently blocked on some EME timeouts. As far as I can tell these timeouts are occurring because when we start up the EME process we also initialize BHR inside of that process. When this is done, we initialize LUL in that process, which involves doing a lot of work. We then, in the tests, quickly shut down the EME process. This would normally occur quickly, however shutting down the process involves shutting down BHR, which involves joining the BHR thread. As LUL initialization is occurring on the BHR thread at the time, this just hangs everything. I think this then causes everything to time out on infra, although I am unsure. I think the easiest solution here would be to just not run BHR in the EME process. I don't even know if we collect or look at any of the telemetry information coming from it. cpearce, how would you feel about doing that?

Flags: needinfo?(cpearce)

Julian Seward [:jseward]

Comment 19

•

8 years ago

(In reply to Julian Seward [:jseward] from comment #13) > Sorry, I lost track of this. I would guess this is crashing when LUL tries > to mmap libxul.so to read debuginfo from it. Let me update this. I guessed wrong. I tested with both 32- and 64-bit Linux builds, to find out what happens when LUL fails to mmap libxul.so. Per my testing it handles the failure OK -- the system stays alive. The resulting stack traces from LUL are of course almost useless because it fails to load unwind data for libxul, but I didn't see any stability issues pertaining to LUL. On 32 bit Linux only targets, I do later see a segfault that happens when the profiler GUI (Cleopatra) tries to symbolicate the traces, due to its use of Wasm and an incorrect assertion, which I filed as bug 1374218. But that isn't relevant here (I assume) because (1) we're only using LUL to get stack traces, but not running the the profiler GUI, and (2) [speculative, but as-yet unverified] the bug 1374218 failure only occurs for --disable-jemalloc builds. All that said, mmaping libxul.so to read symbols from it is still a losing proposition on any 32 bit target. So we should still decide either to disable LUL on 32 bit targets, or to insert an abstraction layer at the bottom of the Dwarf/ELF parser which makes it possible to read stuff from the .so file without having to map the whole thing in. Valgrind now uses that strategy and it works well.

Julian Seward [:jseward]

Comment 20

•

8 years ago

Another thing that I just observed is that LUL does much more poorly (has more unwind failures) on 32-bit than 64-bit Linux. 64-bit also used to have unwind failures, but most of those were resolved at the end of 1Q2017 via a combination of Dwarf handling fixups and enabling FP unwinding. So it would probably be possible to do the same for 32-bit if required. In the meantime however be aware that 32-bit will often fail to give good unwinds, even in the case where it does manage to map libxul.so to read unwind info.

Nicholas Nethercote [inactive]

Comment 21

•

8 years ago

Linux32 seems like a low-value proposition. I don't think adding the abstraction layer or improving the unwinding is worth the effort.

Julian Seward [:jseward]

Comment 22

•

8 years ago

(In reply to Nicholas Nethercote [:njn] from comment #21) > Linux32 seems like a low-value proposition. I don't think adding the > abstraction layer or improving the unwinding is worth the effort. I agree. I'm not advocating for it -- merely recording the tradeoffs as I see them.

Chris Pearce [:cpearce (Not reading bugmail)]

Comment 23

•

8 years ago

(In reply to Michael Layzell [:mystor] from comment #18) > This is currently blocked on some EME timeouts. As far as I can tell these > timeouts are occurring because when we start up the EME process we also > initialize BHR inside of that process. When this is done, we initialize LUL > in that process, which involves doing a lot of work. We then, in the tests, > quickly shut down the EME process. This would normally occur quickly, > however shutting down the process involves shutting down BHR, which involves > joining the BHR thread. As LUL initialization is occurring on the BHR thread > at the time, this just hangs everything. I think this then causes everything > to time out on infra, although I am unsure. > > I think the easiest solution here would be to just not run BHR in the EME > process. I don't even know if we collect or look at any of the telemetry > information coming from it. cpearce, how would you feel about doing that? I'm totally fine with not initializing BHR in the GMP process (which is the process in which EME plugins run). I didn't think we collected telemetry in the GMP process anyway, and we don't startup XPCOM either.

Flags: needinfo?(cpearce)

Nika Layzell [:nika] (ni? for response)

Assignee

Comment 24

•

8 years ago

Attached patch Part 3: Disable ThreadStackHelper on x86 linux due to a LUL crash (obsolete) — Details — Splinter Review

MozReview-Commit-ID: 2kBBFftnJEd

Attachment #8880140 - Flags: review?(nfroyd)

Nika Layzell [:nika] (ni? for response)

Assignee

Comment 25

•

8 years ago

Attached patch Part 4: Disable the BackgroundHangMonitor in GMP processes (obsolete) — Details — Splinter Review

This involved a change to BackgroundHangMonitor, as it initialized sDisabled incorrectly to false, instead of true, We need sDisabled initialized to true, as we cannot assume that it is enabled until BackgroundHangMonitor::Startup() is called. MozReview-Commit-ID: 94slLTkNk3C

Attachment #8880141 - Flags: review?(nfroyd)

Nathan Froyd [:froydnj]

Updated

•

8 years ago

Attachment #8880140 - Flags: review?(nfroyd) → review+

Nathan Froyd [:froydnj]

Updated

•

8 years ago

Attachment #8880141 - Flags: review?(nfroyd) → review+

Pulsebot

Comment 26

•

8 years ago

Pushed by michael@thelayzells.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/79b4cc56c9f2 Part 1: Move LUL to a separate lock, and initialize it outside of both the profiler and BHR lock on the BHR thread, r=njn https://hg.mozilla.org/integration/mozilla-inbound/rev/51aac2123868 Part 2: Re-enable the Background Hang Reporter on Linux, r=froydnj https://hg.mozilla.org/integration/mozilla-inbound/rev/ae5d8df3d304 Part 3: Disable ThreadStackHelper on x86 linux due to a LUL crash, r=froydnj https://hg.mozilla.org/integration/mozilla-inbound/rev/8ccec3b0d27e Part 4: Disable the BackgroundHangMonitor in GMP processes, r=froydnj

Pulsebot

Comment 27

•

8 years ago

Backout by michael@thelayzells.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/968f191e70a2 Backout for test failures on a CLOSED TREE

Nika Layzell [:nika] (ni? for response)

Assignee

Comment 28

•

8 years ago

This was causing a bunch of fun test failures - ni?-ing myself to fix them later.

Flags: needinfo?(michael)

Nika Layzell [:nika] (ni? for response)

Assignee

Comment 29

•

8 years ago

I did another try push (https://treeherder.mozilla.org/#/jobs?repo=try&revision=42a5082ca75e697537f200c41192e8d724a90c9a) and it looks pretty green. I think the failures I backed this out for were actually bug 1372433's fault. I'll try and reland this tomorrow morning so I'm around when the tree inevitably goes up in flames despite my try push :P.

Pulsebot

Comment 30

•

8 years ago

Pushed by michael@thelayzells.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/39ed96d7b263 Part 1: Move LUL to a separate lock, and initialize it outside of both the profiler and BHR lock on the BHR thread, r=njn https://hg.mozilla.org/integration/mozilla-inbound/rev/48f5e0597242 Part 2: Re-enable the Background Hang Reporter on Linux, r=froydnj https://hg.mozilla.org/integration/mozilla-inbound/rev/56850220bd77 Part 3: Disable ThreadStackHelper on x86 linux due to a LUL crash, r=froydnj https://hg.mozilla.org/integration/mozilla-inbound/rev/9fc4c6aabd8e Part 4: Disable the BackgroundHangMonitor in GMP processes, r=froydnj

Nika Layzell [:nika] (ni? for response)

Assignee

Updated

•

8 years ago

Flags: needinfo?(michael)

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 31

•

8 years ago

Flags: needinfo?(michael)

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 32

•

8 years ago

Similar issues in a different asan xpcshell chunk: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&fromchange=23814434a6bc690b51bddd1ddd2b86382c32fe47&noautoclassify&filter-searchStr=6566c310c9f3f5516235a02d04acbac4b90d0637&tochange=83c89460055eff4f0f441f467b7e118000461096&selectedJob=113765330

Ionuț Goldan [:igoldan]

Comment 33

•

8 years ago

These build metrics alerts where observed, when this bug 1st landed: == Change summary for alert #7878 (as of July 12 2017 17:13 UTC) == Regressions: 4% compiler warnings summary linux32 pgo 1,421.17 -> 1,472.50 Improvements: 8% installer size summary linux32 pgo 61,150,341.67 -> 56,538,162.58 3% build times summary linux32 pgo taskcluster-c4.4xlarge3,122.23 -> 3,025.40 For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=7878

Ionuț Goldan [:igoldan]

Comment 34

•

8 years ago

And these arrived after the backout mentioned in comment 31: == Change summary for alert #7892 (as of July 12 2017 23:35 UTC) == Regressions: 8% installer size summary linux32 pgo 56,539,449.92 -> 61,122,888.00 Improvements: 3% compiler warnings summary linux32 pgo 1,472.00 -> 1,422.42 For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=7892

Nika Layzell [:nika] (ni? for response)

Assignee

Comment 35

•

8 years ago

Attached patch Part 5: Don't block the BackgroundHangMonitor thread intializing stackwalk (obsolete) — Details — Splinter Review

I think that this might help fix the asan hangs which caused this patch to be backed out most recently. The idea here is to try to get the BHR thread responsive as quickly as possible. When I managed to reproduce the hangs locally with an ASAN build it seemed that loading LUL was taking a very long time, so the BHR thread was not responsive, and I'm guessing that something somehow needed it to be responsive. MozReview-Commit-ID: AVA1Nok6zyQ

Attachment #8886666 - Flags: review?(erahm)

Eric Rahm [:erahm]

Comment 36

•

8 years ago

Comment on attachment 8886666 [details] [diff] [review] Part 5: Don't block the BackgroundHangMonitor thread intializing stackwalk Review of attachment 8886666 [details] [diff] [review]: ----------------------------------------------------------------- I don't love this, but if it helps get BHR enabled I'm fine with it. We should follow up with jseward to discuss better solutions to lul init being super slow. Also can you do a test run to verify *why* this fixes is things? Just add some logging, my guess is mSTS is null for those tests and we never initialize lul. It would be nice to confirm that though. Some logging in the runnable would be good too, it's possible mSTS is non-null but it just hasn't processed it yet before shutdown. ::: xpcom/threads/BackgroundHangMonitor.cpp @@ +301,5 @@ > { > // Make sure to initialize any state required to perform stack walking as soon > // as possible, so it does not interfere with us collecting hang stacks. We > + // don't want to be on the main thread, and we don't want to block anything > + // running, so we do it off-main-thread. Isn't this already off-main-thread? As in, this is the BH manager thread. I think we need a better explanation here. From our discussion on IRC it sounded like lul initialization could block this thread, which would block shutdown causing tests to fail due to shutdown hangs. @@ +306,5 @@ > + // > + // sStackWalkInitialized is set to true atomically when stackwalk > + // initialization is complete. > + static Atomic<bool> sStackWalkInitialized(false); > + if (mSTS) { This seems like a drastic change, if |mSTS| isn't set yet we'll never get BHR stacks? From discussing on IRC it sounds like that's okay b/c without STS nothing is going to work. Can you add some comments to that effect?

Attachment #8886666 - Flags: review?(erahm) → review+

Nika Layzell [:nika] (ni? for response)

Assignee

Comment 37

•

8 years ago

I think I'm still seeing timeouts. I have no idea why they're happening unfortunately :-/. I'm not sure what the best approach would be here, or why this is happening. Potentially this is due to the large amount of memory we are mapping into our address space messing things up? I'm not sure.

Flags: needinfo?(michael)

Julian Seward [:jseward]

Comment 38

•

8 years ago

(In reply to Michael Layzell [:mystor] from comment #37) > I think I'm still seeing timeouts. I have no idea why they're happening > unfortunately :-/. Can you be more specific about this, please? You're seeing timeouts in exactly which configurations/scenarios?

Flags: needinfo?(michael)

Nika Layzell [:nika] (ni? for response)

Assignee

Comment 39

•

8 years ago

Attached patch Part 6: Don't initialize stackwalk except when requested (obsolete) — Details — Splinter Review

I accidentally stumbled over what I did wrong while working on bug 1380081. Basically, any time that I wanted a stackwalk, even if I wasn't requesting a native stack, the profiler would run `profiler_initialize_stackwalk` which would cause the thread to block while we finished loading libxul into memory. The code was probably hanging more in asan builds because I'm guessing mapping files into memory is slower in that mode. This just changes the places where we call profiler_initialize_stackwalk to be only the places where we actually are going to need LUL initialized. (note: I haven't tested this yet on try - but I did a bunch of local testing and it seems to have fixed the problems! - pushing to try now.) MozReview-Commit-ID: 9Nrraa6Hw7n

Attachment #8888931 - Flags: review?(n.nethercote)

Nicholas Nethercote [inactive]

Comment 40

•

8 years ago

Comment on attachment 8888931 [details] [diff] [review] Part 6: Don't initialize stackwalk except when requested Review of attachment 8888931 [details] [diff] [review]: ----------------------------------------------------------------- This seems ok but is the start of DoNativeBacktrace() a better chokepoint to put this call?

Attachment #8888931 - Flags: review?(n.nethercote) → review+

Nika Layzell [:nika] (ni? for response)

Assignee

Comment 41

•

8 years ago

(In reply to Nicholas Nethercote [:njn] from comment #40) > Comment on attachment 8888931 [details] [diff] [review] > Part 6: Don't initialize stackwalk except when requested > > Review of attachment 8888931 [details] [diff] [review]: > ----------------------------------------------------------------- > > This seems ok but is the start of DoNativeBacktrace() a better chokepoint to > put this call? DoNativeBacktrace would be a bad place to do this call. In DoNativeBacktrace the target thread is already suspended, so it's not safe to do things like allocate. profiler_initialize_stackwalk definitely does things like allocate, so we need to have already done it before DoNativeBacktrace.

Flags: needinfo?(michael)

Nika Layzell [:nika] (ni? for response)

Assignee

Comment 42

•

8 years ago

Attached patch Part 6: Don't initialize stackwalk except when requested (obsolete) — Details — Splinter Review

MozReview-Commit-ID: 9Nrraa6Hw7n

Nika Layzell [:nika] (ni? for response)

Assignee

Comment 43

•

8 years ago

Attached patch Part 6: Don't initialize stackwalk except when requested (obsolete) — Details — Splinter Review

MozReview-Commit-ID: 9Nrraa6Hw7n

Nika Layzell [:nika] (ni? for response)

Assignee

Updated

•

8 years ago

Attachment #8888931 - Attachment is obsolete: true

Attachment #8889479 - Attachment is obsolete: true

Nika Layzell [:nika] (ni? for response)

Assignee

Comment 44

•

8 years ago

Attached patch Part 7: Disable BHR on ASan builds due to timeouts (obsolete) — Details — Splinter Review

I've been fighting this on and off for months, and it doesn't seem like anything I do to try to make initializing LUL _even less_ blocking is helping. It seems like initializing LUL just has such bad performance in ASan builds, that even when it is run off main thread it reliably causes timeouts on ASan XPCShell tests. This patch just turns off BHR on ASan builds so we can turn it on on linux. BHR is only enabled on nightly, and the timeout problems appear to just be extreme ASan-induced performance issues, rather than a symptom of a real problem. MozReview-Commit-ID: 7S5WWYu2fzp

Attachment #8893420 - Flags: review?(nfroyd)

Nika Layzell [:nika] (ni? for response)

Assignee

Comment 45

•

8 years ago

Attached patch Part 8: Disable BHR completely on linux 32-bit (obsolete) — Details — Splinter Review

Previously only stackwalking was disabled. This makes it such that we don't run it at all, which I think is reasonable to avoid useless pings and/or unnecessary data collection. This will be extra useful because of bug 1380081, which will increase the amount of data collection. MozReview-Commit-ID: H3i1c6d6yut

Attachment #8893421 - Flags: review?(nfroyd)

Nathan Froyd [:froydnj]

Updated

•

8 years ago

Attachment #8893421 - Flags: review?(nfroyd) → review+

Nathan Froyd [:froydnj]

Comment 46

•

8 years ago

(In reply to Michael Layzell [:mystor] from comment #44) > I've been fighting this on and off for months, and it doesn't seem like > anything > I do to try to make initializing LUL _even less_ blocking is helping. It > seems > like initializing LUL just has such bad performance in ASan builds, that even > when it is run off main thread it reliably causes timeouts on ASan XPCShell > tests. Is this specific tests, or can it occur during any tests whatsoever? If it's the former, could we just disable those tests on ASan or requestLongerTimerout or what have you? Or have you been trying that, and it's just whack-a-mole? Would it be possible to runtime-disable BHR instead for those tests or for all of xpcshell? I guess the effect is the same. Disabling code entirely under ASan makes me nervous, but as BHR is Nightly-only, the concern is lessened here.

Flags: needinfo?(michael)

Nika Layzell [:nika] (ni? for response)

Assignee

Comment 47

•

8 years ago

(In reply to Nathan Froyd [:froydnj] from comment #46) > Is this specific tests, or can it occur during any tests whatsoever? If > it's the former, could we just disable those tests on ASan or > requestLongerTimerout or what have you? Or have you been trying that, and > it's just whack-a-mole? > > Would it be possible to runtime-disable BHR instead for those tests or for > all of xpcshell? I guess the effect is the same. > > Disabling code entirely under ASan makes me nervous, but as BHR is > Nightly-only, the concern is lessened here. The particular failures I was fighting were specifically xpcshell failures. As far as I could tell it seemed to be xpcshell test suites which used run_test_in_child a decent amount. I still don't know exactly what the problem was unfortunately, but it consistently manifested as the test suite timing out, and looking at time stamps it looked like progress was still being made, albeit slowly. I'm not sure. The other option which I was considering was to, rather than disable it entirely under ASan, just disable stackwalking under ASan. I only need to load LUL (which seems to be the cause of the problem) if I want to stackwalk, and I could just not do that. Would that be preferable?

Flags: needinfo?(michael) → needinfo?(nfroyd)

Julian Seward [:jseward]

Comment 48

•

8 years ago

Approximately what length of timeout was failing? I'm sorry this has been such a hassle. I can imagine LUL is slower to load the libxul.so unwind data under ASan because * execution is just plain slower with ASan. * mmap-ing in libxul.so to read unwind info in means ASan has to create and later destroy a shadow mapping of 1/8 the size of libxul.so, although I can't imagine that's a big deal. * libxul.so is bigger when compiled with ASan and so there's more unwind info to read.

Nika Layzell [:nika] (ni? for response)

Assignee

Comment 49

•

8 years ago

This is the most recent try push which I made which has this problem present. https://treeherder.mozilla.org/#/jobs?repo=try&revision=a4b0eb4120742667a8dcc479f4c33c1f13239a02&filter-searchStr=asan There are some other burnt tree failures in this push (like the test_BHRObserver.js failure - I was testing the patches for bug 1380081 at the time), but the timeouts are the main thing which was fixed by turning things off. It looks from looking again that we also timeout in other places.

Nathan Froyd [:froydnj]

Comment 50

•

8 years ago

(In reply to Michael Layzell [:mystor] from comment #47) > The other option which I was considering was to, rather than disable it > entirely under ASan, just disable stackwalking under ASan. I only need to > load LUL (which seems to be the cause of the problem) if I want to > stackwalk, and I could just not do that. Would that be preferable? I think this would be preferable if possible.

Flags: needinfo?(nfroyd)

Nathan Froyd [:froydnj]

Comment 51

•

8 years ago

Comment on attachment 8893420 [details] [diff] [review] Part 7: Disable BHR on ASan builds due to timeouts Canceling review while we investigate other possibilities.

Attachment #8893420 - Flags: review?(nfroyd)

Nicholas Nethercote [inactive]

Comment 52

•

8 years ago

Here's an idea: AFAICT, we don't *have* to use LUL on Linux when MOZ_PROFILING is enabled. We can instead use MozStackWalk or FramePointerStackWalk, like we do on Windows and Mac. I briefly tried using FramePointerStackWalk on Linux64 yesterday and it appeared to basically work. I compared LUL stacks against FramePointerStackWalk stacks. The LUL stacks were more detailed -- perhaps they have info about inlined frames? -- but the FramePointerStackWalk stacks seemed reasonable. (DMD uses MozStackWalk() on Linux.) So we could use LUL for the profiler, but use FramePointerStackWalk for profiler_suspend_and_sample_thread(). Or something like that. mystor, what do you think?

Flags: needinfo?(michael)

Nika Layzell [:nika] (ni? for response)

Assignee

Comment 53

•

8 years ago

Attached patch Part 1: Use FramePointerStackWalk when stackwalking from BHR (obsolete) — Details — Splinter Review

This is a simple implementation of your suggestion nick. I'm doing a try push to see if it fixes the problem while I try to rebase bug 1380081 on top of this instead of the LUL changes. MozReview-Commit-ID: CsMHW5gVmYP

Attachment #8893923 - Flags: review?(n.nethercote)

Nika Layzell [:nika] (ni? for response)

Assignee

Updated

•

8 years ago

Attachment #8869614 - Attachment is obsolete: true

Attachment #8870494 - Attachment is obsolete: true

Attachment #8880140 - Attachment is obsolete: true

Attachment #8880141 - Attachment is obsolete: true

Attachment #8886666 - Attachment is obsolete: true

Attachment #8889484 - Attachment is obsolete: true

Attachment #8893420 - Attachment is obsolete: true

Attachment #8893421 - Attachment is obsolete: true

Nika Layzell [:nika] (ni? for response)

Assignee

Comment 54

•

8 years ago

Attached patch Part 2: re-enable BHR on linux — Details — Splinter Review

Changes to BHR required if we take this approach (before bug 1380081) MozReview-Commit-ID: BtRZecG4IYA

Attachment #8893924 - Flags: review?(nfroyd)

Nicholas Nethercote [inactive]

Comment 55

•

8 years ago

Comment on attachment 8893923 [details] [diff] [review] Part 1: Use FramePointerStackWalk when stackwalking from BHR Review of attachment 8893923 [details] [diff] [review]: ----------------------------------------------------------------- ::: tools/profiler/core/platform.cpp @@ +113,5 @@ > # include "lul/platform-linux-lul.h" > + > +// Also support using FramePointerStackWalk if MOZ_PROFILING is enabled. The > +// profiler won't use this type, but profiler_suspend_and_sample_thread we > +// prefers frame pointer stackwalking to LUL stackwalking when both are present. I think this would be clearer: "On Linux we use LUL for periodic samples and synchronous samples, but we use FramePointerStackWalk for backtrace samples when MOZ_PROFILING is enabled. (See the comment at the top of the file for the defition of period/synchronous/backtrace.) @@ +1288,5 @@ > +{ > + // This method is used to decide on which stackwalking tool to use. We choose > + // them with the following priority if more than one stackwalking > + // implementation is avaliable: > + // LUL -> EHABI -> FramePointer -> MozStackWalk I find this comment misleading. Linux is the only platform where we are ever making a choice between two implementation, but this is not the place where we are making that choice, because LUL is always available on Linux but FramePointer is only sometimes available. I suggest changing it to something like this: "This method determines which stackwalker is used for periodic samples and synchronous samples. (Backtrace samples are treated differently; see profiler_suspend_and_sample_thread() for details.) The only part of the ordering that matters is that LUL must precede FRAME_POINTER because on Linux they can both be present." @@ +3410,5 @@ > #if defined(HAVE_NATIVE_UNWIND) > if (aSampleNative) { > + // Prefer FramePointerStackWalk over LUL stackwalk, to avoid the need > + // to initialize LUL to use profiler_suspend_and_sample_thread > + // (bug 1365309). Can you add "On Linux..." at the start of this comment? Ditto for the identical comment below. @@ +3472,2 @@ > DoNativeBacktrace(lock, *info, aRegs, nativeStack); > +# endif We currently have two versions of profiler_suspend_and_sample_thread(). Can we remove the old one yet?

Attachment #8893923 - Flags: review?(n.nethercote) → review+

Nicholas Nethercote [inactive]

Comment 56

•

8 years ago

mystor: please double-check that FramePointerStackWalk() gives reasonable results on Linux before you land this :)

Julian Seward [:jseward]

Comment 57

•

8 years ago

(In reply to Nicholas Nethercote [:njn] from comment #52) > The LUL stacks were more detailed -- perhaps they have info about inlined frames? No, LUL doesn't have support for doing inlined frames right now.

Nathan Froyd [:froydnj]

Updated

•

8 years ago

Attachment #8893924 - Flags: review?(nfroyd) → review+

Nika Layzell [:nika] (ni? for response)

Assignee

Comment 58

•

8 years ago

I'm running into a weird bug with this patch. Intermittently the GeckoProfiler.SuspendAndSample test is failing, as we're getting a backtrace which only has 2 entries instead of the usual ~20. I think that i know why this is happening, but I don't know how to deal with it. The frames which are captured are the pseudostack entry XRE_main, and a single pc frame set in `do_lookup_x()`, which from some quick googling is a frame inside of glibc. I think what is happening is that the glibc which we're building against, at least on this computer (also it seems on try), doesn't have frame pointers, meaning that we basically don't get native stacks when stackwalking from a frame located inside of glibc. I haven't been able to reproduce this under rr as it seems very race-dependent, so I can't confirm this of course. What do you think I should do?

Flags: needinfo?(michael) → needinfo?(n.nethercote)

Nika Layzell [:nika] (ni? for response)

Assignee

Comment 59

•

8 years ago

(In reply to Nicholas Nethercote [:njn] from comment #55) > We currently have two versions of profiler_suspend_and_sample_thread(). Can > we remove the old one yet? no, we need to land bug 1367406 first which I have local patches for but hasn't landed yet.

Nicholas Nethercote [inactive]

Comment 60

•

8 years ago

(In reply to Michael Layzell [:mystor] from comment #58) > I'm running into a weird bug with this patch. Intermittently the > GeckoProfiler.SuspendAndSample test is failing, as we're getting a backtrace > which only has 2 entries instead of the usual ~20. This is bug 1384766! > I think that i know why this is happening, but I don't know how to deal with > it. The frames which are captured are the pseudostack entry XRE_main, and a > single pc frame set in `do_lookup_x()`, which from some quick googling is a > frame inside of glibc. > > I think what is happening is that the glibc which we're building against, at > least on this computer (also it seems on try), doesn't have frame pointers, > meaning that we basically don't get native stacks when stackwalking from a > frame located inside of glibc. I haven't been able to reproduce this under > rr as it seems very race-dependent, so I can't confirm this of course. > > What do you think I should do? I chose the >5 check for the frame count arbitrarily. Let's just change it so it does >=2 instead, and add a comment explaining why. Are you happy to do this? (And thank you for working this out, I was planning to look at this intermittent failure today but you saved me from having to do so.)

Flags: needinfo?(n.nethercote) → needinfo?(michael)

Nika Layzell [:nika] (ni? for response)

Assignee

Comment 61

•

8 years ago

Attached patch Part 3: Reduce expected frame count for GeckoProfiler.SuspendAndSample — Details — Splinter Review

This is necessary as with FramePointerStackWalk if the current frame is in glibc when we suspend the target thread we sometimes can't collect more than one native frame as there are no frame pointers. We'll still collect at least one entry, so this check relaxes the assertion. MozReview-Commit-ID: IbatTQwDti1

Attachment #8895371 - Flags: review?(n.nethercote)

Nika Layzell [:nika] (ni? for response)

Assignee

Comment 62

•

8 years ago

Attached patch Part 1: Use FramePointerStackWalk when stackwalking from BHR — Details — Splinter Review

MozReview-Commit-ID: CsMHW5gVmYP

Nika Layzell [:nika] (ni? for response)

Assignee

Updated

•

8 years ago

Attachment #8893923 - Attachment is obsolete: true

Nika Layzell [:nika] (ni? for response)

Assignee

Comment 63

•

8 years ago

I think once this gets reviewed I should be able to land the new bhr ping stuff from bug 1380081! \o/

Flags: needinfo?(michael)

Nicholas Nethercote [inactive]

Updated

•

8 years ago

Attachment #8895371 - Flags: review?(n.nethercote) → review+

Pulsebot

Comment 64

•

8 years ago

Pushed by michael@thelayzells.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/aac6aee303fb Part 1: Use FramePointerStackWalk when stackwalking from BHR, r=njn https://hg.mozilla.org/integration/mozilla-inbound/rev/579fd072422e Part 2: re-enable BHR on linux, r=froydnj https://hg.mozilla.org/integration/mozilla-inbound/rev/d461adfd884e Part 3: Reduce expected frame count for GeckoProfiler.SuspendAndSample, r=njn

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 65

•

8 years ago

Flags: needinfo?(michael)

Nicholas Nethercote [inactive]

Comment 66

•

8 years ago

I just landed part 3 under the auspices of bug 1384766 in order to fix that intermittent failure.

Pulsebot

Comment 67

•

8 years ago

Pushed by michael@thelayzells.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/4a7bef9816da Part 1: Use FramePointerStackWalk when stackwalking from BHR, r=njn https://hg.mozilla.org/integration/mozilla-inbound/rev/cfc37c6651db Part 2: re-enable BHR on linux, r=froydnj

Nika Layzell [:nika] (ni? for response)

Assignee

Updated

•

8 years ago

Flags: needinfo?(michael)

Pulsebot

Comment 68

•

8 years ago

Pushed by michael@thelayzells.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/f31c029bbd16 Part 3: Disable test_ThreadHangStats.js on ASAN, a=bustage

Carsten Book [:Tomcat]

Comment 69

•

8 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/4a7bef9816da https://hg.mozilla.org/mozilla-central/rev/cfc37c6651db https://hg.mozilla.org/mozilla-central/rev/f31c029bbd16

Status: NEW → RESOLVED

Closed: 8 years ago

status-firefox57: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla57

You need to log in before you can comment on or make changes to this bug.