Closed Bug 1302914 Opened 3 years ago Closed 3 years ago

browser_pdfjs_zoom.js,browser_windowless_troubleshoot_crash.js,browser_yahoo_behavior.js,browser_bug1204626.js,browser_dbg_search-symbols.js,browser_cookies_exceptions.js,browser_dbg_cmd-blackbox.js | application crashed [@ js::ProfileEntry::script()]

Categories

(Core :: JavaScript Engine, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla52
Tracking Status
firefox49 --- wontfix
firefox50 --- fixed
firefox51 --- fixed
firefox52 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: jandem)

References

Details

(Keywords: intermittent-failure)

Attachments

(1 file)

See Also: → 1303142
See Also: → 1303144
So, things are better than they were before but we're still not out of the woods it'd appear :(
Flags: needinfo?(jdemooij)
I opened one of these minidumps in Visual Studio - I think we're crashing in ProfileEntry::script doing script->zoneFromAnyThread()->runtimeFromAnyThread(), so the script pointer is probably bogus.

The bug that was fixed was that we could get a thread stack while doing a compacting GC that moved scripts. But here, the main thread (I assume that's the one we're sampling because there are no active worker threads) is doing non-GC, non-SPS stuff when we crash, so I've no idea *why* the profiler stack contains bogus script pointers now. We trace these as roots so the GC should trace and update them just fine.

It's interesting that a lot of these crashes (like the 3 linked here) are Windows 8 x64. Maybe a difference in thread scheduling behavior is provoking the background hang monitor, I don't know.

I tried to reproduce this on Linux with a very low timeout value for the background hang monitor, maybe I can try that on Windows later...
See Also: → 1303416
Summary: Intermittent browser/extensions/pdfjs/test/browser_pdfjs_zoom.js | application crashed [@ js::ProfileEntry::script()] → browser_pdfjs_zoom.js,browser_windowless_troubleshoot_crash.js,browser_yahoo_behavior.js | application crashed [@ js::ProfileEntry::script()]
Duplicate of this bug: 1303416
Duplicate of this bug: 1303142
Summary: browser_pdfjs_zoom.js,browser_windowless_troubleshoot_crash.js,browser_yahoo_behavior.js | application crashed [@ js::ProfileEntry::script()] → browser_pdfjs_zoom.js,browser_windowless_troubleshoot_crash.js,browser_yahoo_behavior.js,browser_bug1204626.js | application crashed [@ js::ProfileEntry::script()]
Summary: browser_pdfjs_zoom.js,browser_windowless_troubleshoot_crash.js,browser_yahoo_behavior.js,browser_bug1204626.js | application crashed [@ js::ProfileEntry::script()] → browser_pdfjs_zoom.js,browser_windowless_troubleshoot_crash.js,browser_yahoo_behavior.js,browser_bug1204626.js,browser_dbg_search-symbols.js | application crashed [@ js::ProfileEntry::script()]
Summary: browser_pdfjs_zoom.js,browser_windowless_troubleshoot_crash.js,browser_yahoo_behavior.js,browser_bug1204626.js,browser_dbg_search-symbols.js | application crashed [@ js::ProfileEntry::script()] → browser_pdfjs_zoom.js,browser_windowless_troubleshoot_crash.js,browser_yahoo_behavior.js,browser_bug1204626.js,browser_dbg_search-symbols.js,browser_cookies_exceptions.js | application crashed [@ js::ProfileEntry::script()]
Blocks: 1304938
Any luck, Jan? This is starting to pile up a lot of dupes similar to the last one :(
(In reply to Ryan VanderMeulen [:RyanVM] from comment #6)
> Any luck, Jan? This is starting to pile up a lot of dupes similar to the
> last one :(

I'll look at this later today, after finishing a few other things.
Reports this week are also all on Win8 x64, AFAICS. I'll try to reproduce this on Windows 10...
Summary: browser_pdfjs_zoom.js,browser_windowless_troubleshoot_crash.js,browser_yahoo_behavior.js,browser_bug1204626.js,browser_dbg_search-symbols.js,browser_cookies_exceptions.js | application crashed [@ js::ProfileEntry::script()] → browser_pdfjs_zoom.js,browser_windowless_troubleshoot_crash.js,browser_yahoo_behavior.js,browser_bug1204626.js,browser_dbg_search-symbols.js,browser_cookies_exceptions.js,browser_dbg_cmd-blackbox.js | application crashed [@ js::ProfileEntry::script()]
Blocks: 1305185
(In reply to Jan de Mooij [:jandem] from comment #8)
> I'll try to reproduce this on Windows 10...

No luck with this. Today's attempt is to do some Try pushes with different background hang monitor timeouts, to see if that makes these more frequent.
(In reply to Jan de Mooij [:jandem] from comment #10)
> Today's attempt is to do some Try pushes with different
> background hang monitor timeouts, to see if that makes these more frequent.

OK, I do get some oranges per push with this, so that's promising. Unfortunately I have Win8 test jobs queued for 2 hours, so this is probably going to take a long time :/ I'll see what happens when I Try these BHM changes on Linux64 or OS X.
(In reply to Jan de Mooij [:jandem] from comment #11)
> OK, I do get some oranges per push with this, so that's promising.
> Unfortunately I have Win8 test jobs queued for 2 hours, so this is probably
> going to take a long time :/ I'll see what happens when I Try these BHM
> changes on Linux64 or OS X.

No oranges on other platforms. I'm now Try-servering a patch that makes SPSProfiler::size_ an Atomic<uint32_t>*, to see if that changes anything.
(In reply to Jan de Mooij [:jandem] from comment #12)
> No oranges on other platforms. I'm now Try-servering a patch that makes
> SPSProfiler::size_ an Atomic<uint32_t>*, to see if that changes anything.

This didn't help. It makes a synchronization issue while pushing/popping stack entries less likely. I'm now trying a patch to enable the BackgroundHangMonitor in debug builds, to see whether that also fails and/or hits assertion failures.
Enabling the BackgroundHangMonitor in debug builds was super useful. I got some of the same crashes, but I also got a few assertion failures. The assertion failures were interesting, because we were failing an assert like this one:

  MOZ_ASSERT(entry->isJs());

But the caller(s) *also* checked/asserted entry->isJs(), so this should be impossible. Except if some other thread is modifying |entry| while we are accessing it. How can that happen if we just suspended the thread we're sampling?

It turns out SuspendThread on Windows doesn't actually suspend the thread *immediately*, it's asynchronous: when SuspendThread returns, the thread can run for a little while longer [0]. That's what's happening here: the background hang monitor thread suspends the main thread, but this thread runs for a while longer, modifying our profiler stack while we're reading from it.

The blog post suggests calling Get­Thread­Context, to ensure the thread is actually suspended. I did that and I can no longer reproduce this on Try with that change \o/.

[0] https://blogs.msdn.microsoft.com/oldnewthing/20150205-00/?p=44743
Attached patch PatchSplinter Review
See comment 14.

I fixed the one in ThreadStackHelper.cpp, and also other SuspendThread calls in HangMonitor.cpp and profiler/core/platform-win32.cc

This also fixes an unrelated bug in BackgroundHangThread::ReportHang that was exposed by enabling this code in debug builds: we called mHangStack.back() on an empty Vector.
Assignee: nobody → jdemooij
Status: NEW → ASSIGNED
Flags: needinfo?(jdemooij)
Attachment #8796095 - Flags: review?(nfroyd)
Comment on attachment 8796095 [details] [diff] [review]
Patch

Review of attachment 8796095 [details] [diff] [review]:
-----------------------------------------------------------------

r=me with the fixes below, or a justification of why CONTEXT_CONTROL is OK to use in the non-platform-win32.cc cases.  Thanks for tracking this one down.

::: xpcom/threads/BackgroundHangMonitor.cpp
@@ +400,5 @@
>    // mManager->mLock IS locked
>  
>    // Remove unwanted "js::RunScript" frame from the stack
> +  if (!mHangStack.empty()) {
> +    for (const char** f = &mHangStack.back(); f >= mHangStack.begin(); f--) {

Nice, thanks for fixing this bug.

Now that I have a second chance to look at this, can you fix this loop to use integer indices like so:

for (size_t i = 0; i < mHangStack.size(); i++) {
  const char** f = mHangStack.begin() + i;
  if (!mHangStack.IsInBuffer(*f) && !strcmp(*f, "js::RunScript")) {
    mHangStack.erase(f);
  }
}

That way, we're not depending on undefined behavior with the loop exit comparison when f moves past mHangStack.begin().

Followup bug is fine, if you prefer to do it that way.

::: xpcom/threads/HangMonitor.cpp
@@ +163,5 @@
> +    // SuspendThread is asynchronous, so the thread may still be running. Use
> +    // GetThreadContext to ensure it's really suspended.
> +    // See https://blogs.msdn.microsoft.com/oldnewthing/20150205-00/?p=44743.
> +    CONTEXT context;
> +    context.ContextFlags = CONTEXT_CONTROL;

Why does using CONTEXT_CONTROL not cause crashes here compared to platform-win32.cc?  I think we want to replicate the logic from there in this code, too, replete with the same comment.

::: xpcom/threads/ThreadStackHelper.cpp
@@ +250,5 @@
> +  // SuspendThread is asynchronous, so the thread may still be running. Use
> +  // GetThreadContext to ensure it's really suspended.
> +  // See https://blogs.msdn.microsoft.com/oldnewthing/20150205-00/?p=44743.
> +  CONTEXT context;
> +  context.ContextFlags = CONTEXT_CONTROL;

Same comment here.
Attachment #8796095 - Flags: review?(nfroyd) → review+
(In reply to Nathan Froyd [:froydnj] from comment #16)
> Why does using CONTEXT_CONTROL not cause crashes here compared to
> platform-win32.cc?  I think we want to replicate the logic from there in
> this code, too, replete with the same comment.

The JS engine uses CONTEXT_CONTROL when interrupting JS code, without any known problems, and the crashes in bug 1120126 seem more related to the stack walking code and/or the data we get out of the CONTEXT than to GetThreadContext itself. Here we don't use the CONTEXT at all.

Mind if I land this as is, and if any crashes show up we switch to CONTEXT_FULL?
Flags: needinfo?(nfroyd)
Also, I think none of this code is used in release builds (BackgroundHangMonitor et al are disabled there), and interrupts are far less frequent than what the profiler does, so we could also just go with CONTEXT_FULL...
(In reply to Jan de Mooij [:jandem] from comment #17)
> (In reply to Nathan Froyd [:froydnj] from comment #16)
> > Why does using CONTEXT_CONTROL not cause crashes here compared to
> > platform-win32.cc?  I think we want to replicate the logic from there in
> > this code, too, replete with the same comment.
> 
> The JS engine uses CONTEXT_CONTROL when interrupting JS code, without any
> known problems, and the crashes in bug 1120126 seem more related to the
> stack walking code and/or the data we get out of the CONTEXT than to
> GetThreadContext itself. Here we don't use the CONTEXT at all.

Ah, I didn't look far enough.  MozStackWalk for this particular case eventually uses CONTEXT_FULL:

http://dxr.mozilla.org/mozilla-central/source/mozglue/misc/StackWalk.cpp#306

and the ThreadStackHelper case weirdly doesn't actually walk the stack.  We should really clean all this up sometime. :(

You're right, we don't need to bother with CONTEXT_FULL in the other two places.  Land away!
Pushed by jandemooij@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/02428eab5922
Use GetThreadContext after calling SuspendThread to ensure threads are really suspended. r=froydnj
(I moved the mHangStack fix from this bug to bug 1306317.)
Blocks: 1273970
https://hg.mozilla.org/mozilla-central/rev/02428eab5922
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla52
Duplicate of this bug: 1265228
Duplicate of this bug: 1140916
Duplicate of this bug: 1245685
Duplicate of this bug: 1242067
Duplicate of this bug: 1264415
Duplicate of this bug: 1267483
Duplicate of this bug: 1268504
Duplicate of this bug: 1273317
Duplicate of this bug: 1274851
Duplicate of this bug: 1279083
Duplicate of this bug: 1281301
Duplicate of this bug: 1287689
Duplicate of this bug: 1305185
Duplicate of this bug: 1304938
Duplicate of this bug: 1303144
Duplicate of this bug: 1273970
Duplicate of this bug: 1290067
Duplicate of this bug: 1290818
Duplicate of this bug: 1294283
Duplicate of this bug: 1294971
Duplicate of this bug: 1264216
Duplicate of this bug: 1268193
Duplicate of this bug: 1277110
Duplicate of this bug: 1285963
Duplicate of this bug: 1290380
Duplicate of this bug: 1290867
Please request Aurora/Beta approval on this when you're reasonably confident that it's baked for long enough.
Flags: needinfo?(nfroyd)
Also, My Hero <3
Flags: needinfo?(jdemooij)
Comment on attachment 8796095 [details] [diff] [review]
Patch

Approval Request Comment
[Feature/regressing bug #]: Old bug.
[User impact if declined]: Crashes when using the browser / profiler. Intermittent oranges.
[Describe test coverage new/current, TreeHerder]: Landed a few days ago. No more of these intermittents since then.
[Risks and why]: Low risk, on m-c without problems.
[String/UUID change made/needed]: None.
Flags: needinfo?(jdemooij)
Attachment #8796095 - Flags: approval-mozilla-beta?
Attachment #8796095 - Flags: approval-mozilla-aurora?
Comment on attachment 8796095 [details] [diff] [review]
Patch

Fixes an intermittent, verified on Nightly52, Aurora51+, Beta50+
Attachment #8796095 - Flags: approval-mozilla-beta?
Attachment #8796095 - Flags: approval-mozilla-beta+
Attachment #8796095 - Flags: approval-mozilla-aurora?
Attachment #8796095 - Flags: approval-mozilla-aurora+
You need to log in before you can comment on or make changes to this bug.