Bug 1091912 (Closed): Opened 10 years ago, Closed 10 years ago

stop using mprotect to halt Ion/asm.js execution

Categories

(Core :: JavaScript Engine, defect)


Tracking


RESOLVED FIXED
mozilla36

People

(Reporter: luke, Assigned: luke)

References

Details

Attachments

(1 file, 1 obsolete file)

The use of mprotect/VirtualProtect to take away execution privileges currently causes some problems:
 - some OSes don't like taking away *just* PROT_EXEC, so we also have to take away READ, which means there are a lot more places we can fault, not just when executing Ion code
 - gdb/lldb break *before* running the user-defined SIGSEGV handler and offer no way to break *after* a user-defined SIGSEGV handler has passed on the exception, so there's no .gdbinit that ignores the harmless SIGSEGVs while catching the real ones.  Instead we have the JS_NO_SIGNALS hack.

The whole point of taking away execution privileges is to stop the main thread so we can do something to it (patch backedges in Ion, move the pc if in asm.js code).  We can achieve the same effect using SuspendThread on Windows and pthread_kill on other platforms.  (But won't pthread_kill still break by default in gdb/lldb?  Not if we choose a benign signal that isn't already claimed by NSPR, pthreads, or libc.  Is there such a signal?  I think so: SIGVTALRM.)
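For reference, a minimal sketch of the pthread_kill half of this (illustrative only, not the actual SpiderMonkey code; InterruptHandler, TriggerInterrupt, and interruptRequested are made-up names):

  #include <atomic>
  #include <cstring>
  #include <pthread.h>
  #include <signal.h>

  // C++ code polls this flag at safe points; the signal only exists to kick a
  // thread that is spinning in JIT code into the handler below.
  static std::atomic<bool> interruptRequested(false);

  // Runs on the interrupted thread.  In the real scheme, the handler would use
  // the SA_SIGINFO context to inspect the interrupted register state and move
  // the pc if it is in Ion/asm.js code; here it is just a stub.
  static void InterruptHandler(int, siginfo_t *, void *context)
  {
      (void) context;
  }

  static void InstallInterruptHandler()
  {
      struct sigaction act;
      memset(&act, 0, sizeof(act));
      act.sa_flags = SA_SIGINFO;
      act.sa_sigaction = InterruptHandler;
      sigemptyset(&act.sa_mask);
      // SIGVTALRM, assumed unclaimed by NSPR/pthreads/libc as argued above.
      sigaction(SIGVTALRM, &act, nullptr);
  }

  // Called from the requesting thread to interrupt 'target'.
  static void TriggerInterrupt(pthread_t target)
  {
      interruptRequested.store(true);
      pthread_kill(target, SIGVTALRM);
  }

On Windows, the analogous sequence would be SuspendThread, then GetThreadContext to inspect/redirect the pc, then ResumeThread.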

I think this should also allow some simplification of the current interrupt-lock scheme since we have a lot less to protect against.
Depends on: 1091916
Blocks: 1063188
Attached patch rm-exec-fault (obsolete) — Splinter Review
This patch does what comment 0 said and removes about 400 lines of locky code in the process.

Everything looks mostly good except B2G and Android 2.3.  In the B2G case, the logs seem to show that the seccomp filter isn't happy with the pthread_kill; in theory that's not hard to work around.  For the Android 2.3 bustage, given that it doesn't show up on 4.0, I assume it's some bug in the emulator's handling of pthread_kill; I need to run the emulator to reproduce.
Depends on: 1093893
Attached patch rm-exec-faultSplinter Review
This patch is looking green on try.  Bug 1093893 added tkill to the seccomp filter, which fixed the B2G orange.
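For illustration of what "allowing tkill" means, here is roughly the same rule expressed with the libseccomp API; the actual fix in bug 1093893 was to Gecko's own seccomp-bpf whitelist, so treat this purely as a sketch of the idea:

  #include <seccomp.h>

  // Sketch: permit tkill/tgkill so pthread_kill (which is implemented on top
  // of them) is not rejected by the sandbox.  A real filter would of course
  // whitelist many more syscalls before loading.
  int AllowThreadSignals()
  {
      scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL);  // default action for unlisted syscalls
      if (!ctx)
          return -1;
      seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(tkill), 0);
      seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(tgkill), 0);
      int rc = seccomp_load(ctx);
      seccomp_release(ctx);
      return rc;
  }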

For the Android 2.3 orange, I'm pretty darn sure it's a bug in Android 2.3 or at least in the emulator.  I pushed a bunch of variations of the patch to try and found that:
 - the patch makes 1 out of 3 unit test jobs fail on Android 2.3 with a crash, almost always while tearing down a thread (it looks like thread data structure corruption, especially the pthread TLS data)
 - a variant that makes the signal handler a no-op has the same crashes
 - a variant that additionally comments out the pthread_kill call has no crashes
Wikipedia shows <20% of devices on <=2.3 (http://en.wikipedia.org/wiki/Android_version_history), so this patch just disables signal handling on versions <=2.3.  I expect the bug is actually in the emulator; if we ever cared (and had real hardware to test on; tb is all emulators) we could additionally test for that, but it doesn't seem worth it atm.
Attachment #8516467 - Attachment is obsolete: true
Attachment #8518620 - Flags: review?(bhackett1024)
Comment on attachment 8518620 [details] [diff] [review]
rm-exec-fault

Review of attachment 8518620 [details] [diff] [review]:
-----------------------------------------------------------------

Nice!

::: js/src/asmjs/AsmJSSignalHandlers.cpp
@@ +913,5 @@
> +    if (jit::JitRuntime *jitRuntime = rt->jitRuntime()) {
> +        // If we are in the middle of mutating the backedge list, we must be in
> +        // C++ and thus not locked into a JIT iloop. We assume that we'll check
> +        // the interrupt flag at least once before entering JIT code (if not,
> +        // the interrupt will just be triggered again next second).

typo

@@ +1029,5 @@
>  
> +// JSRuntime::requestInterrupt sets interrupt_ (which is checked frequently by
> +// C++ code at every Baseline JIT loop backedge) and jitStackLimit_ (which is
> +// checked at every Baseline and Ion JIT function prologue). That leaves Ion
> +// loop backedges and all asm.js code. These last two cases are handled by this

'That leaves ion loop backedges'?

::: js/src/asmjs/AsmJSSignalHandlers.h
@@ +29,5 @@
>  namespace js {
>  
> +// Set up any signal/exception handlers needed to execute code in the given
> +// runtime. Return whether it is safe to rely on signal handling for asm.js
> +// out-of-bounds checks.

If this returns false it's also unsafe to use signal handling for Ion interrupts, right?
Attachment #8518620 - Flags: review?(bhackett1024) → review+
(In reply to Brian Hackett (:bhackett) from comment #3)
> If this returns false it's also unsafe to use signal handling for Ion
> interrupts, right?

Right, thanks.

https://hg.mozilla.org/integration/mozilla-inbound/rev/51572056e797
Talos dromaeojs also became nearly permafail on OS X 10.6 after this landed.
https://treeherder.mozilla.org/ui/logviewer.html#?job_id=3774284&repo=mozilla-inbound
The OSX+Dromaeo failures seem to have been caused by bug 1091916 (as evidenced by the failures continuing after the backout of this bug and stopping only after the backout of bug 1091916).  The fix and relanding of bug 1091916 appear to have resolved them.

It's weird that the 4.0 crashes didn't show up in previous try pushes, but with enough retriggers they show up now.  The crash signatures are the same as the Android 2.3 crashes above, so I guess it's a general Android issue.

On a whim, I put some logging statements into NSPR thread creation/destruction and found my first actual lead: on jobs that crash, _pt_thread_death_internal is being called while the thread is still running!  This frees the PRThread, which is why we crash when the thread returns and NSPR attempts to access that PRThread.  More logging...
More logging shows that what's happening is that pthread_kill is, in rare cases, causing a pthread_join to return before the joined-on thread actually completes.  This causes the NSPR PRThread to be destroyed while the thread is still running, and then Bad Things happen.  Spurious wakeup + signals sounds like someone isn't checking the return value of pthread_cond_wait for EINTR, and the bionic source for pthread_join shows exactly this bug:
  https://android.googlesource.com/platform/bionic/+/9d23e04/libc/bionic/pthread_join.cpp
Fortunately, someone already reported and fixed this bug:
  https://android-review.googlesource.com/#/c/52333/
in Android 4.4, so it sounds like I can just disable the optimization when ANDROID && __ANDROID_API__ < 19.
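To illustrate the bug class: a wait like this has to re-check its condition and retry when a signal interrupts it, instead of treating the early return as completion.  A minimal sketch of the correct pattern around a raw futex wait (illustrative only, not bionic's actual pthread_join code):

  #include <cerrno>
  #include <linux/futex.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  // Wait until *state changes away from 'expected'.  If a signal (e.g. our
  // SIGVTALRM interrupt) makes the futex wait return with EINTR, loop and
  // re-check instead of returning.  Pre-4.4 bionic's pthread_join effectively
  // skipped this re-check, so a stray pthread_kill could make it return while
  // the joined-on thread was still running.
  static void WaitForChange(volatile int *state, int expected)
  {
      while (*state == expected) {
          long rc = syscall(SYS_futex, state, FUTEX_WAIT, expected, nullptr, nullptr, 0);
          if (rc == -1 && errno != EINTR && errno != EAGAIN)
              break;  // real error: give up rather than spin
          // On EINTR/EAGAIN or a spurious wakeup, fall through and re-check *state.
      }
  }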
So I think I can just update the previous patch to use "19" instead of "10" for the SDK version query (that way the optimization kicks in whenever the phone is running 4.4, rather than only when we eventually update our build dependency).  Re-pushing to try now; I hope to land if it's green.
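A rough sketch of the runtime gate this describes (the property name ro.build.version.sdk and the function EnsureSignalHandlersInstalled are from this bug; the helper below and its exact shape are illustrative):

  #include <cstdlib>
  #if defined(ANDROID)
  # include <sys/system_properties.h>
  #endif

  // Return false on Android devices older than KitKat (SDK 19), where the
  // bionic pthread_join bug makes pthread_kill-based interrupts unsafe;
  // callers then simply leave signal handling disabled.
  static bool SignalHandlersSafeOnThisDevice()
  {
  #if defined(ANDROID)
      char value[PROP_VALUE_MAX];
      if (__system_property_get("ro.build.version.sdk", value) > 0)
          return atoi(value) >= 19;
      return false;  // unknown version: be conservative
  #else
      return true;
  #endif
  }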
https://hg.mozilla.org/mozilla-central/rev/7db30249d1d8
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla36
Blocks: 1099493
Depends on: 1099339
This patch likely caused a ~25% regression on octane-zlib on Firefox OS devices (unagi & flame).

http://arewefastyet.com/#machine=26&view=single&suite=octane&subtest=zlib&start=1415802570&end=1415924708
Flags: needinfo?(luke)
I'm guessing that's the sdk version being less than 19 (tested via ro.build.version.sdk in EnsureSignalHandlersInstalled).  IIUC, new FFOS releases will be based on KK (Android v.4.4, SDK v.19) which will pass this test.  What is the Android version being tested in these devices?
Flags: needinfo?(luke)
(In reply to Luke Wagner [:luke] from comment #13)
> I'm guessing that's the sdk version being less than 19 (tested via
> ro.build.version.sdk in EnsureSignalHandlersInstalled).  IIUC, new FFOS
> releases will be based on KK (Android v.4.4, SDK v.19) which will pass this
> test.  What is the Android version being tested in these devices?

We are seeing the same regression for Fennec on Android.  This is Android 4.3 and api11.  (According to https://developer.android.com/about/dashboards/index.html, only 44% of devices are on KitKat or higher; not sure if we have data about Fennec usage.)  So do we just hope its usage declines quickly, and should I actually look into upgrading the flame to Android 4.4?  Or is there something we can do for 4.3 and earlier?
The old strategy carried a fair amount of complexity and developer frustration (at the Portland work week, something like 3 different people out of the blue complained about "those crappy random JS seg faults"; all were most pleased we stopped doing that), so I'd much prefer we just wait it out for 4.4.
(Oh, but I forgot to answer your direct question: yes, let's update awfy to 4.4 :)
Depends on: 1163891