simplify the interrupt/jitStackLimit situation

RESOLVED FIXED in mozilla36

Status

()

Core
JavaScript Engine
RESOLVED FIXED
3 years ago
3 years ago

People

(Reporter: luke, Assigned: luke)

Tracking

unspecified
mozilla36
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment)

(Assignee)

Description

3 years ago
Created attachment 8514643 [details] [diff] [review]
simplify-interrupt

There is currently a somewhat scary comment in JSRuntime::requestInterrupt about a subtle order dependency between setting rt->interrupt and jitStackLimit (with bug 785560 full of gory details).  rt->interrupt is a Relaxed Atomic and jitStackLimit isn't even an Atomic, so I am dubious of this code.  The assignment is also wrapped in the interrupt lock which is also taken at other points when handling interrupts.  I'd like to remove the interrupt lock, though, so it'd be good to understand and fix this situation in a better way.

The danger seems to be having interrupt = false but jitStackLimit = -1 (which leads to an incorrect over-recursed error).  It seems like we could fix this by treating interrupt and jitStackLimit more symmetrically: both are Relaxed Atomics and are set in no particular order; handleInterrupt calls the interrupt callback if either is set and resets both.  With this, the worst that would happen is that we call the interrupt callback twice in a row.

This patch takes all manipulation of interrupt/jitStackLimit out of the scope of the interrupt lock and tries to clean things up a bit (in particular, making these two fields private).
Attachment #8514643 - Flags: review?(bhackett1024)
(Assignee)

Updated

3 years ago
Blocks: 1091912
Comment on attachment 8514643 [details] [diff] [review]
simplify-interrupt

Review of attachment 8514643 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jscntxt.h
@@ +792,1 @@
>  }

Can this function just be removed entirely now?  It seems like the extra check here is kind of redundant given the ones in JSRuntime::handleInterrupt.  If this is an efficiency thing it would be nice to still have a comment explaining why that is.
Attachment #8514643 - Flags: review?(bhackett1024) → review+
(Assignee)

Comment 2

3 years ago
(In reply to Brian Hackett (:bhackett) from comment #1)
> Can this function just be removed entirely now?  It seems like the extra
> check here is kind of redundant given the ones in
> JSRuntime::handleInterrupt.  If this is an efficiency thing it would be nice
> to still have a comment explaining why that is.

I didn't measure, but I was assuming this is actually hot enough to warrant the inline fastpath because bug 1061764 was a speedup (the Atomic<Relaxed> in question was cx->interrupt_).  Not exactly apples to apples, but it does indicate that testing cx->interrupt_ is really hot.  I'll add a comment, though.
(Assignee)

Comment 3

3 years ago
(Try looks green.)
https://hg.mozilla.org/integration/mozilla-inbound/rev/10616214c160
https://hg.mozilla.org/mozilla-central/rev/10616214c160
Status: ASSIGNED → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla36
(In reply to Luke Wagner [:luke] from comment #3)
> (Try looks green.)

Unfortunately, Android x86 S4 says otherwise. Perma-failing from your push on:
https://treeherder.mozilla.org/ui/logviewer.html#?job_id=3481308&repo=mozilla-inbound

Backed out.
https://hg.mozilla.org/mozilla-central/rev/b695d9575654
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: mozilla36 → ---
Confirmed green on the backout.
(Assignee)

Comment 7

3 years ago
Apparently x86 S4 isn't run by '-p all -u all -t all'.
(Assignee)

Comment 8

3 years ago
Looks like Android-x86 automated tests might be turned off (bug 960674 comment 19).
I see a few things going wrong in https://tbpl.mozilla.org/php/getParsedLog.php?id=51563406&tree=Mozilla-Inbound&full=1. We are producing .dmp files for each crash, and minidump_stackwalk is producing crash reports with some symbols, but is missing lots of useful info:

14:26:21     INFO -  TEST-START | xpcshell-child-process.ini:/builds/slave/talos-slave/test/build/tests/xpcshell/tests/dom/indexedDB/test/unit/test_add_put.js
14:26:21     INFO -  TEST-PASS | xpcshell-child-process.ini:/builds/slave/talos-slave/test/build/tests/xpcshell/tests/dom/indexedDB/test/unit/test_add_put.js | took 10776ms
14:26:21     INFO -  mozdevice getting files in '/mnt/sdcard/tests/xpcshell/minidumps'
14:26:21     INFO -  mozcrash Downloading symbols from: https://ftp-ssl.mozilla.org/pub/mozilla.org/mobile/tinderbox-builds/mozilla-inbound-android-x86/1414774198/fennec-36.0a1.en-US.android-i386.crashreporter-symbols.zip
14:26:21     INFO -  mozcrash Saved minidump as /builds/slave/talos-slave/test/build/blobber_upload_dir/7c8e42f5-fc2f-1bfd-39045ded-126aa14c.dmp
14:26:21     INFO -  mozcrash Saved app info as /builds/slave/talos-slave/test/build/blobber_upload_dir/7c8e42f5-fc2f-1bfd-39045ded-126aa14c.extra
14:26:21  WARNING -  PROCESS-CRASH | xpcshell-child-process.ini:/builds/slave/talos-slave/test/build/tests/xpcshell/tests/dom/indexedDB/test/unit/test_add_put.js | application crashed [@ libxul.so (deleted) + 0x1d94208]
14:26:21     INFO -  Crash dump filename: /tmp/tmpAyyro_/7c8e42f5-fc2f-1bfd-39045ded-126aa14c.dmp
14:26:21     INFO -  Operating system: Android
14:26:21     INFO -                    0.0.0 Linux 2.6.29 #1 PREEMPT Thu Nov 7 22:27:50 UTC 2013 i686 Android/full/generic:4.2/JOP40C/eng.ubuntu.20131107.195638:eng/test-keys
14:26:21     INFO -  CPU: x86
14:26:21     INFO -       GenuineIntel family 6 model 3 stepping 3
14:26:21     INFO -       1 CPU
14:26:21     INFO -  
14:26:21     INFO -  Crash reason:  SIGSEGV
14:26:21     INFO -  Crash address: 0x0
14:26:21     INFO -  
14:26:21     INFO -  Thread 0 (crashed)
14:26:21     INFO -   0  libxul.so (deleted) + 0x1d94208
14:26:21     INFO -      eip = 0xafa09208   esp = 0xbfffd900   ebp = 0xbfffd9e8   ebx = 0xb0a00f84
14:26:21     INFO -      esi = 0xbfffd954   edi = 0xbfffd96c   eax = 0x00000000   ecx = 0x00000001
14:26:21     INFO -      edx = 0x00000000   efl = 0x00010246
14:26:21     INFO -      Found by: given as instruction pointer in context
14:26:21     INFO -   1  libxul.so (deleted) + 0x1e8eac2
14:26:21     INFO -      eip = 0xafb03ac3   esp = 0xbfffd9f0   ebp = 0xbfffda28
14:26:21     INFO -      Found by: previous frame's frame pointer
...
14:26:21     INFO -  30  libmozglue.so!ChildProcessInit [APKOpen.cpp:10616214c160 : 453 + 0x8]
14:26:21     INFO -      eip = 0xb7eb7bb9   esp = 0xbffff7e0   ebp = 0xbffff818
14:26:21     INFO -      Found by: previous frame's frame pointer
14:26:21     INFO -  31  libplugin-container.so!main [MozillaRuntimeMainAndroid.cpp:10616214c160 : 34 + 0x5]
14:26:21     INFO -      eip = 0x0804855b   esp = 0xbffff820   ebp = 0xbffff848   ebx = 0x08049fd8
14:26:21     INFO -      esi = 0x00000008   edi = 0xbffff8a4
14:26:21     INFO -      Found by: call frame info

Unfortunately, generating a crash report for almost every xpcshell test is causing us to exceed our maximum threshold for logging:

Output exceeded 52428800 bytes, remaining output has been truncated (output was 52458724 bytes)

and that might be related to the timeout:

command timed out: 14400 seconds elapsed running ['/tools/buildbot/bin/python', 'scripts/scripts/android_emulator_unittest.py', '--cfg', 'android/androidx86.py', '--test-suite', 'robocop-1', '--test-suite', 'robocop-2', '--test-suite', 'robocop-3', '--test-suite', 'xpcshell', '--blob-upload-branch', 'mozilla-inbound', '--download-symbols', 'ondemand'], attempting to kill
process killed by signal 9

which is why we never get around to saving a logcat or the dmp files. I'll try to put together a try run that fixes some of this...
(Assignee)

Comment 10

3 years ago
Thank you!
(In reply to Geoff Brown [:gbrown] from comment #9)
> Unfortunately, generating a crash report for almost every xpcshell test is
> causing us to exceed our maximum threshold for logging:

Actually, now I see that it's not that many xpcshell tests are crashing, but that a few tests are crashing over and over again, as though we are re-trying crashed tests many, many times:

14:26:21     INFO -  mozcrash Saved minidump as /builds/slave/talos-slave/test/build/blobber_upload_dir/0c08b883-e24c-afcd-07cb9161-744d205c.dmp
14:26:21     INFO -  mozcrash Saved app info as /builds/slave/talos-slave/test/build/blobber_upload_dir/0c08b883-e24c-afcd-07cb9161-744d205c.extra
14:26:21  WARNING -  PROCESS-CRASH | xpcshell-child-process.ini:/builds/slave/talos-slave/test/build/tests/xpcshell/tests/dom/indexedDB/test/unit/test_add_put.js | application crashed [@ libxul.so (deleted) + 0x1d94208]
...
14:26:23     INFO -  mozcrash Saved minidump as /builds/slave/talos-slave/test/build/blobber_upload_dir/59175ddc-aed5-3180-51e3ba59-58a78b9c.dmp
14:26:23     INFO -  mozcrash Saved app info as /builds/slave/talos-slave/test/build/blobber_upload_dir/59175ddc-aed5-3180-51e3ba59-58a78b9c.extra
14:26:23  WARNING -  PROCESS-CRASH | xpcshell-child-process.ini:/builds/slave/talos-slave/test/build/tests/xpcshell/tests/dom/indexedDB/test/unit/test_add_put.js | application crashed [@ libxul.so (deleted) + 0x1d94208]
...
14:26:40     INFO -  TEST-START | xpcshell-child-process.ini:/builds/slave/talos-slave/test/build/tests/xpcshell/tests/dom/indexedDB/test/unit/test_add_twice_failure.js
14:26:40     INFO -  TEST-PASS | xpcshell-child-process.ini:/builds/slave/talos-slave/test/build/tests/xpcshell/tests/dom/indexedDB/test/unit/test_add_twice_failure.js | took 11331ms
14:26:40     INFO -  mozdevice getting files in '/mnt/sdcard/tests/xpcshell/minidumps'
14:26:40     INFO -  mozcrash Downloading symbols from: https://ftp-ssl.mozilla.org/pub/mozilla.org/mobile/tinderbox-builds/mozilla-inbound-android-x86/1414774198/fennec-36.0a1.en-US.android-i386.crashreporter-symbols.zip
14:26:40     INFO -  mozcrash Saved minidump as /builds/slave/talos-slave/test/build/blobber_upload_dir/188b6562-4d51-43ce-0c5a3e96-6122d430.dmp
14:26:40     INFO -  mozcrash Saved app info as /builds/slave/talos-slave/test/build/blobber_upload_dir/188b6562-4d51-43ce-0c5a3e96-6122d430.extra
14:26:40  WARNING -  PROCESS-CRASH | xpcshell-child-process.ini:/builds/slave/talos-slave/test/build/tests/xpcshell/tests/dom/indexedDB/test/unit/test_add_twice_failure.js | application crashed [@ libxul.so (deleted) + 0x1d94208]
...
14:26:40     INFO -  mozcrash Saved minidump as /builds/slave/talos-slave/test/build/blobber_upload_dir/10eed7a3-752b-77fe-75ead317-302e1db1.dmp
14:26:40     INFO -  mozcrash Saved app info as /builds/slave/talos-slave/test/build/blobber_upload_dir/10eed7a3-752b-77fe-75ead317-302e1db1.extra
14:26:40  WARNING -  PROCESS-CRASH | xpcshell-child-process.ini:/builds/slave/talos-slave/test/build/tests/xpcshell/tests/dom/indexedDB/test/unit/test_add_twice_failure.js | application crashed [@ libxul.so (deleted) + 0x1d94208]
14:26:40     INFO -  Crash dump filename: /tmp/tmpUtjbgm/10eed7a3-752b-77fe-75ead317-302e1db1.dmp

That should not happen!

Updated

3 years ago
Depends on: 1093328
(In reply to Geoff Brown [:gbrown] from comment #11)
> (In reply to Geoff Brown [:gbrown] from comment #9)
> > Unfortunately, generating a crash report for almost every xpcshell test is
> > causing us to exceed our maximum threshold for logging:
> 
> Actually, now I see that it's not that many xpcshell tests are crashing, but
> that a few tests are crashing over and over again, as though we are 
> re-trying crashed tests many, many times:

*Actually*, we are not re-trying the test, but many minidump files are created by one crashed test. When dom/indexedDB/test/unit/test_add_put.js fails, 1004 minidump files are created, and then the log is flooded with the crash reports for each of those files. In https://tbpl.mozilla.org/?tree=Try&rev=752f64c26d3d, I just dump the first 3 .dmp files and ignore the rest; that allows the test job to complete normally, so the .dmp files and logcat are saved. I'm hoping there's some useful information in those logs.
(Assignee)

Comment 13

3 years ago
Unfortunately, the crash stacks in the try push do not appear to be symbolicated.
(In reply to Luke Wagner [:luke] from comment #13)
> Unfortunately, the crash stacks in the try push do not appear to be symbolicated.

Well, they have some symbols...but not very useful ones. I don't know why. A "normal" crash on Android x86 is "symbolicated".

14:03:17     INFO -  29  libxul.so (deleted) + 0x16a9130
14:03:17     INFO -      eip = 0xaf311131   esp = 0xbfffee20   ebp = 0xbffff7d8
14:03:17     INFO -      Found by: previous frame's frame pointer
14:03:17     INFO -  30  libmozglue.so!ChildProcessInit [APKOpen.cpp:752f64c26d3d : 453 + 0x8]
14:03:17     INFO -      eip = 0xb7eb7bb9   esp = 0xbffff7e0   ebp = 0xbffff818
14:03:17     INFO -      Found by: previous frame's frame pointer
14:03:17     INFO -  31  libplugin-container.so!main [MozillaRuntimeMainAndroid.cpp:752f64c26d3d : 34 + 0x5]
14:03:17     INFO -      eip = 0x0804855b   esp = 0xbffff820   ebp = 0xbffff848   ebx = 0x08049fd8
14:03:17     INFO -      esi = 0x00000008   edi = 0xbffff8a4
14:03:17     INFO -      Found by: call frame info

The patch in bug 1093328 (like my try push in Comment 12) will avoid the catastrophic test failure and let us access .dmp files and logcat. My latest effort is https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=7e4fe39a7ac7.

I built Android x86 locally with your patch applied and ran xpcshell tests in a local x86 emulator, but did not reproduce the crash -- all tests passed! I'll keep trying.

On the positive side, we can easily reproduce the crash on try and examine logs. The crash is consistent in that the same set of tests always fail and always fail the same way. I'll be pushing to try today to try to understand what's happening.

Updated

3 years ago
Depends on: 1093719
Depends on: 1093893
(Assignee)

Updated

3 years ago
No longer depends on: 1093893
All of the Android x86 crashes happened in child processes during dom/indexedDB tests. I don't think we should be running multi-process tests on Android, so am trying to get those disabled in bug 1093719. Once that is resolved, this should be able to land.
(Assignee)

Comment 16

3 years ago
\o/
(Assignee)

Comment 17

3 years ago
With bug 1093719, Android x86 S4 is green on try, so relanding.
  https://hg.mozilla.org/integration/mozilla-inbound/rev/583284c299fb
Thanks again for all the help Geoff!
https://hg.mozilla.org/mozilla-central/rev/583284c299fb
Status: REOPENED → RESOLVED
Last Resolved: 3 years ago3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla36
(In reply to Carsten Book [:Tomcat] from comment #18)
> https://hg.mozilla.org/mozilla-central/rev/583284c299fb

Backed out for causing intermittent Win32 and OSX 10.6 "too much recursion" jit-test failures.
https://hg.mozilla.org/integration/mozilla-inbound/rev/8607a6787a5b

https://treeherder.mozilla.org/ui/logviewer.html#?job_id=1160667&repo=fx-team
https://treeherder.mozilla.org/ui/logviewer.html#?job_id=1161459&repo=fx-team
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: mozilla36 → ---
Also possibly-related:
https://treeherder.mozilla.org/ui/logviewer.html#?job_id=3773699&repo=mozilla-inbound
Merge of backout:
https://hg.mozilla.org/mozilla-central/rev/8607a6787a5b
(Assignee)

Comment 22

3 years ago
The fix was pretty simple once I was able to catch it in gdb: RegExpShared::execute was calling js_ReportOverRecursed if only jitStackLimit had been set to UINT32_MAX but interrupt_ = false (which explains why this was relatively rare to observe).  Previously, the interrupt lock was being taken which ensured that if one was set, both were set.  The fix is to just do what Ion was doing since the situation is analogous, so I just changed RegExpShared::execute to call jit::CheckOverRecursed.

https://hg.mozilla.org/integration/mozilla-inbound/rev/a2551ec63d3c
https://hg.mozilla.org/mozilla-central/rev/a2551ec63d3c
Status: REOPENED → RESOLVED
Last Resolved: 3 years ago3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla36
You need to log in before you can comment on or make changes to this bug.