Closed Bug 861989 Opened 11 years ago Closed 3 years ago

Crash on Heap with asm.js and timeout

Categories

(Core :: JavaScript Engine, defect)

ARM
Linux
defect
Not set
critical

Tracking

()

RESOLVED WORKSFORME

People

(Reporter: decoder, Unassigned)

Details

(Keywords: crash, testcase, Whiteboard: [jsbugmon:ignore])

Attachments

(1 file, 3 obsolete files)

The following testcase crashes on mozilla-central revision 261d6997d1d1 (run with --ion-eager):


const USE_ASM = "'use asm';";
function asmCompile() {
    return Function.apply(null, arguments);
}
function asmLink(f) {
        return f.apply(null, Array.slice(arguments, 1));
}
var g = asmLink(asmCompile(USE_ASM + "function f() {} function g() { while(1) { f() } } return g"));
timeout(1);
g();
Seems to cause an invalid jump:

Program received signal SIGSEGV, Segmentation fault.
0xf5bfd050 in ?? ()
(gdb) x /i $pc
=> 0xf5bfd050:  Cannot access memory at address 0xf5bfd050
Whiteboard: [jsbugmon:ignore]
Flags: needinfo?(nicolas.b.pierron)
One of the issues with qemu-user is that the faulting address passed to the SIGSEGV handler is incorrect.  It seems more appropriate to be checking the program counter anyway, so here is a potential fix and workaround.
Thanks dougc :) I will test the patch asap.
Group: core-security
Flags: needinfo?(nicolas.b.pierron)
The other issue is that qemu does not emulate protected memory well in the 'user' mode.  AsmJS uses mprotect on the code to force a break.  Unfortunately this trips up qemu internally while actually processing a SIGSEGV.  This workaround tries to remove the protection in these paths, and this will probably work out ok since the handler will remove the protection anyway.  This patch is relative to qemu-linaro-1.4.0-2013.03.
(In reply to Douglas Crosher [:dougc] from comment #4)
> Created attachment 737951 [details] [diff] [review]
> Hack patch for qemu to workaround it hitting a SIGSEGV in the signal
> handling paths.

Should I try to forward this patch to the qemu maintainers?
(In reply to Christian Holler (:decoder) from comment #5)
> (In reply to Douglas Crosher [:dougc] from comment #4)
> > Created attachment 737951 [details] [diff] [review]
> > Hack patch for qemu to workaround it hitting a SIGSEGV in the signal
> > handling paths.
> 
> Should I try to forward this patch to the qemu maintainers?

No, I would not both them as it's not a comprehensive fix, and
only works for us because it is known that the signal handler
will remove the protection anyway.  Further the problem has
not been analysed well and there is a concern that in some
cases it might remove the protection too aggressively and we
might miss a signal.

Combined these patches allow the example to run correctly.
They also allow more of the asm.js jit-tests to work.
There is still one asm.js jit-test (testTimeout2.js)
that does not break out of its tight loop, and it appears
that for this test that qemu-user just does not hit the
protected code memory - perhaps an optimization for a
tight loop.

I would note that qemu-user is not a very faithful
emulation and slow and would suggest reconsidering
testing using the qemu system emulator and/or real
hardware.
(In reply to Douglas Crosher [:dougc] from comment #6)
> 
> I would note that qemu-user is not a very faithful
> emulation and slow and would suggest reconsidering
> testing using the qemu system emulator and/or real
> hardware.

Real hardware is in fact highly unreliable when it comes to fuzz testing, much more unreliable than any emulation (and while emulation is slower, it can easily be scaled on our servers). If the qemu-user problems cannot be fixed, then we will have to switch back to qemu-system emulation which I was previously using. The advantage of qemu-user was primarily to be a little faster and save some memory.
(In reply to Douglas Crosher [:dougc] from comment #2)
I don't think this is a valid fix: we already guarded that module.containsPC(pc) a few statements above which means all faults will end up taking the operation callback, even the out-of-bounds loads/stores.  The reason for writing module.containsPC(faultingAddress) is that we are specifically interested in the case where we fault trying to load pc itself (because we mprotected the code in TriggerOperationCallbackForAsmJSCode.
> Real hardware is in fact highly unreliable when it comes to fuzz testing,
> much more unreliable than any emulation (and while emulation is slower, it
> can easily be scaled on our servers). If the qemu-user problems cannot be
> fixed, then we will have to switch back to qemu-system emulation which I was
> previously using. The advantage of qemu-user was primarily to be a little
> faster and save some memory.

While real ARM hardware is slower, jsfunfuzz testing is getting more reliable on a pandaboard as we fix issues - we've had to write something into the Python harness to kill processes starting to use more than ~500MB memory since there was only 1GB on the pands.

Thus, going forward, ARM hardware is going to get faster and we can tweak our harnesses to be more reliable. I'm not sure if qemu will have more of such issues in the future too.

I've also heard from the js team that a Chromebook Series 3 (ARM) with 2GB RAM may be a more powerful option currently.
(In reply to Luke Wagner [:luke] from comment #8)
> (In reply to Douglas Crosher [:dougc] from comment #2)
> I don't think this is a valid fix: we already guarded that
> module.containsPC(pc) a few statements above which means all faults will end
> up taking the operation callback, even the out-of-bounds loads/stores.  The
> reason for writing module.containsPC(faultingAddress) is that we are
> specifically interested in the case where we fault trying to load pc itself
> (because we mprotected the code in TriggerOperationCallbackForAsmJSCode.

Good point, thanks.  Yes, there is already a guard checking the
'pc', and we need to differentiate between a protected code fault
and an errant read or write by code running within the asm.js
module.

An errant read or write should not trigger the callback exit - it
might be able to do so once or twice, but we need to prevent it
looping on the fault.

Could I suggest switching back to using the qemu system emulator.

If you really need qemu-user-arm running then it might still be
possible to address this by accounting for the page protection
within a special build of the Mozilla code, or by doing so
in qemu-user.
This patch attempts to also set the siginfo si_addr to the program counter when the fault is caused by a PROT_EXEC protection failure.  This makes it unnecessary to patch the asm.js signal handler.

Qemu maintains information about the pages in use and their protection state.  This made it easy to check the protection of the page currently being executed when setting up the signal context and to copy the program counter into the fault address if necessary.  The interrupt and thread safety of the change has not been assessed.

I'll try to write a simple example to reproduce the issues and send it off to the qemu forum.
Attachment #737944 - Attachment is obsolete: true
Attachment #737951 - Attachment is obsolete: true
A good number of tests were still causing qemu to fail.
This patch is hopefully an incremental improvement.  It
reworks the management of the page protection in the signal
handler paths.  It looks up the page protection state in
the qemu internal page tables and only removes the read
protection if necessary.  Further it restores the page
protection when done so is a more faithful emulation.
Interrupt and thread safety are still a concern, and
there are still a number of tests for which the page
protection mechanism will not interrupt the code and
these tests do not terminate.
Attachment #738321 - Attachment is obsolete: true
Assignee: general → nobody
Per policy at https://wiki.mozilla.org/Bug_Triage/Projects/Bug_Handling/Bug_Husbandry#Inactive_Bugs. If this bug is not an enhancement request or a bug not present in a supported release of Firefox, then it may be reopened.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → INACTIVE
Status: RESOLVED → REOPENED
Resolution: INACTIVE → ---

Hello! I have tried to reproduce the issue with Firefox 78.7.0esr, 85.0 and 87.0a1 (2021-02-03) but unfortunately I wasn't able to reproduce it.
Christian does this issue still occur? And if so can you please provide a new testcase or some updated steps to reproduce?

Flags: needinfo?(choller)

Likely no longer relevant, closing as WFM.

Status: REOPENED → RESOLVED
Closed: 6 years ago3 years ago
Flags: needinfo?(choller)
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: