Closed Bug 600488 Opened 13 years ago Closed 11 years ago

trace-test -j failures for fennec/android on galaxy s

Categories

(Core :: JavaScript Engine, defect)

ARM
All
defect
Not set
normal

Tracking

()

RESOLVED WONTFIX

People

(Reporter: cjones, Assigned: sayrer)

References

Details

Attachments

(8 files, 9 obsolete files)

348.76 KB, text/plain
Details
126.09 KB, text/plain
Details
45.39 KB, text/plain
Details
2.87 KB, text/plain
Details
934 bytes, text/plain
Details
1.24 KB, text/plain
Details
278 bytes, text/plain
Details
4.30 KB, text/plain
Details
After working around android fail with patches to js.cpp and trace-test.py, writing a wrapper script for trace-test.py (more android fail), and setting up an android python environment, I have some preliminary test results for trace-tests with just tracejit (-j).  Some failures appear to be intermittent, which doesn't seem good.  I ran three trials; the common failures were
 all 3
----------------------------------------
 basic/testComparisons.js
   (SIGSEGV)
 closures/t034.js
   (SIGSEGV)
 sunspider/check-date-format-tofte.js
   (assertion failure)

And these tests failed in two out of three trials

 2/3
----------------------------------------
 sunspider/check-3d-cube.js
   (SIGSEGV)
 sunspider/check-crypto-aes.js
   (SIGSEGV)
 v8-v5/check-crypto.js
   (SIGSEGV)

Will post full results here, setup info somewhere, and maybe -mj results somewhere, tomorrow.
Oh nice, do you have the patch and instructions for running it?
Assignee: general → sayrer
Both coming up; last bit I just finished this morning.  Here's the segfault in basic/testComparisons.js

(gdb) bt
#0  0x40117708 in ?? ()
#1  0x000e3a14 in ExecuteTrace (cx=0x4023e040, f=0x407c0d68, inlineCallCount=<value optimized out>, innermostNestedGuardp=<value optimized out>, lrp=0x0) at /home/cjones/mozilla/mozilla-central/js/src/jstracer.cpp:6544
#2  js::ExecuteTree (cx=0x4023e040, f=0x407c0d68, inlineCallCount=<value optimized out>, innermostNestedGuardp=<value optimized out>, lrp=0x0) at /home/cjones/mozilla/mozilla-central/js/src/jstracer.cpp:6645
#3  0x000e6eea in VisitStackSlots<js::CheckEntryTypeVisitor> (cx=0xbeb188d4, globalObj=0xbeb187ec, f=<value optimized out>, count=<value optimized out>) at /home/cjones/mozilla/mozilla-central/js/src/jstracer.cpp:1928
#4  VisitSlots<js::CheckEntryTypeVisitor> (cx=0xbeb188d4, globalObj=0xbeb187ec, f=<value optimized out>, count=<value optimized out>) at /home/cjones/mozilla/mozilla-central/js/src/jstracer.cpp:1958
#5  VisitSlots<js::CheckEntryTypeVisitor> (cx=0xbeb188d4, globalObj=0xbeb187ec, f=<value optimized out>, count=<value optimized out>) at /home/cjones/mozilla/mozilla-central/js/src/jstracer.cpp:1985
#6  CheckEntryTypes (cx=0xbeb188d4, globalObj=0xbeb187ec, f=<value optimized out>, count=<value optimized out>) at /home/cjones/mozilla/mozilla-central/js/src/jstracer.cpp:6437
#7  js::FindVMCompatiblePeer (cx=0xbeb188d4, globalObj=0xbeb187ec, f=<value optimized out>, count=<value optimized out>) at /home/cjones/mozilla/mozilla-central/js/src/jstracer.cpp:6461
#8  0x000fddbc in js::MonitorLoopEdge (cx=0x4023e040, inlineCallCount=@0xbeb188d4) at /home/cjones/mozilla/mozilla-central/js/src/jstracer.cpp:7149
#9  0x0016d1a6 in js::Interpret (cx=0x4023e040, entryFrame=<value optimized out>, inlineCallCount=Cannot access memory at address 0xff2fc02c
) at /home/cjones/mozilla/mozilla-central/js/src/jsinterp.cpp:3238
#10 0x000568da in RunScript (cx=0x4023e040, chain=0x40802000, script=0x407d0020, prev=<value optimized out>, flags=0, result=0x0) at /home/cjones/mozilla/mozilla-central/js/src/jsinterp.cpp:484
#11 js::Execute (cx=0x4023e040, chain=0x40802000, script=0x407d0020, prev=<value optimized out>, flags=0, result=0x0) at /home/cjones/mozilla/mozilla-central/js/src/jsinterp.cpp:749
#12 0x0001a61c in JS_ExecuteScript (cx=0x4023e040, obj=0x407e38f4, script=0xbeb1892c, rval=<value optimized out>) at /home/cjones/mozilla/mozilla-central/js/src/jsapi.cpp:4744
#13 0x0000efe0 in Process (cx=0x4023e040, obj=<value optimized out>, filename=<value optimized out>, forceTTY=<value optimized out>) at /home/cjones/mozilla/mozilla-central/js/src/shell/js.cpp:444
#14 0x0000f7f8 in ProcessArgs (cx=0x4023e040, argc=8, argv=0xbeb18c88, envp=<value optimized out>) at /home/cjones/mozilla/mozilla-central/js/src/shell/js.cpp:797
#15 shell (cx=0x4023e040, argc=8, argv=0xbeb18c88, envp=<value optimized out>) at /home/cjones/mozilla/mozilla-central/js/src/shell/js.cpp:5258
#16 0x0000fa7a in main (argc=9, argv=0xbeb18c84, envp=0xbeb18cac) at /home/cjones/mozilla/mozilla-central/js/src/shell/js.cpp:5365
Attached patch Support for debugging on android (obsolete) — Splinter Review
Various hacks for running trace-tests.  Patch includes PrintJSStack() from bug 600304 as well.
Lovely. We are dying on trace. Seeing the assertion would help potentially.
Instructions for running trace-tests on device: https://wiki.mozilla.org/Mobile/Fennec/Android#JS_trace-tests.  Please let me know if stuff is missing or broken.
Rob/whoever, I don't know whether fixing one test per bug is preferred (probably?).  I posted the stack in comment 2 to prove completeness of my testing env.
Filed bug 600596 for -mj failures.
Filed bug 600639 for debugging support.
testjs.py isn't working for me. I'm getting an ImportError: no module named trace_test

the module search path looks right though.
it works if I rename trace-test/trace-test.py to trace-test/trace_test.py
with TM tip on my Nexus One, I get 1 failure: date-format-tofte. It is not a segfault. It's just a failing test. Anyone with a Galaxy want to try that?
(In reply to comment #11)
> it works if I rename trace-test/trace-test.py to trace-test/trace_test.py

(In reply to comment #3)
> Created attachment 479419 [details] [diff] [review]
> Support for debugging on android
> 
> Various hacks for running trace-tests.  Patch includes PrintJSStack() from bug
> 600304 as well.
diff --git a/js/src/trace-test/trace-test.py b/js/src/trace-test/trace_test.py
rename from js/src/trace-test/trace-test.py
rename to js/src/trace-test/trace_test.py
--- a/js/src/trace-test/trace-test.py
+++ b/js/src/trace-test/trace_test.py

(In reply to comment #12)
> with TM tip on my Nexus One, I get 1 failure: date-format-tofte. It is not a
> segfault. It's just a failing test. Anyone with a Galaxy want to try that?

I did.  See comment 0.
Attached file Failures from trial 1 (obsolete) —
Forgot to attach the full logs, my bad.
(In reply to comment #13)
> (In reply to comment #11)
> > it works if I rename trace-test/trace-test.py to trace-test/trace_test.py
> 
> (In reply to comment #3)
> > Created attachment 479419 [details] [diff] [review] [details]
> > Support for debugging on android
> > 
> > Various hacks for running trace-tests.  Patch includes PrintJSStack() from bug
> > 600304 as well.
> diff --git a/js/src/trace-test/trace-test.py b/js/src/trace-test/trace_test.py
> rename from js/src/trace-test/trace-test.py
> rename to js/src/trace-test/trace_test.py
> --- a/js/src/trace-test/trace-test.py
> +++ b/js/src/trace-test/trace_test.py

Ah, I just applied the patch the wrong way.

> 
> (In reply to comment #12)
> > with TM tip on my Nexus One, I get 1 failure: date-format-tofte. It is not a
> > segfault. It's just a failing test. Anyone with a Galaxy want to try that?
> 
> I did.  See comment 0.

Which TM cset did you use?
OS: Android → All
I built with

changeset:   54633:94e52cdc2157
tag:         tip
user:        Andreas Gal<gal@uci.edu>
date:        Wed Sep 29 12:28:05 2010 -0700
summary:     Bug 600580 - TM: set right compartment in _newJSDContext r=gregor
Building from m-c

changeset:   54781:23605bdd871a
tag:         qparent
user:        Sid Stamm <sstamm@mozilla.com>
date:        Thu Sep 23 15:38:47 2010 -0700
summary:     Bug 594446 - CSP report-uri should accept relative URIs [r=jst a=blocking2.0: beta7+]

I still get several failures, and unfortunately they're still not consistent.  I did three trials again; below are the tests that failed in all three and the ones that failed in two out of three trials.

  3/3
 -----
basic/testComparisons.js
Exit code: -11

closures/t008.js
Exit code: -11

closures/t034.js
Exit code: -11

sunspider/check-date-format-tofte.js
/data/local/js/trace-test/tests/sunspider/check-date-format-tofte.js:302: Error: Assertion failed
Exit code: 3

v8-v5/check-earley-boyer.js
Exit code: -11


  2/3
 -----
basic/testApplyCall.js
Exit code: -11

basic/testShiftLeft.js
basic/testShiftRightArithmetic.js
/data/local/js/trace-test/tests/basic/testShiftLeft.js:35: Error: Assertion failed
/data/local/js/trace-test/tests/basic/testShiftRightArithmetic.js:41 Error: Assertion failed
Exit code: 3

closures/name3.js
Exit code: -11

sunspider/check-3d-cube.js
Exit code: -11

v8-v5/check-crypto.js
Exit code: -11

v8-v5/check-regexp.js
Exit code: -4

(We now know check-date-format-tofte to be bogus so it's not worth bothering with.)
Attached file Failures from trial 1
Attachment #479563 - Attachment is obsolete: true
Attachment #479564 - Attachment is obsolete: true
Attachment #479566 - Attachment is obsolete: true
Attached file here is my mozconfig (obsolete) —
is yours similar?
Attached file here is mine (obsolete) —
Nearly identical on the important things (--enable-optimize --disable-debug is the default, right?).
And why not, -j passes all tests in a debug build, except for being OOM-killed during the OOM test, and failing the broken date test.
I have a Samsung Vibrant now. I can reproduce this problem using the same build that passes on the Nexus One.
testComparisons will always segfault for me when run alone:

    import trace_test
    trace_test.main([
        JS,
        'basic/testComparisons.js',
        '--show-cmd',
        '--avoid-stdio',
        '--jitflags', 'j',
        '--write-failure-log', failures_file,
        
    ])
BTW, using --show-output is a little nicer than --write-failure-log.

Passes at -O0.
Playing with compiler flags proved not very fruitful.  I've been using basic/testComparisons.js since a segfault is 100% reproducible with -j and --enable-optimize/--disable-debug.  This test also fails pretty reliably with -m, giving "Too much recursion".  Quick summary is

running -j:
  --enable-debug/--disable-optimize: couldn't force a segfault
  everything else, including intermediate -O[k] flags, segfaults intermittently

running -m:
  --enable-optimize="-O0 -fno-omit-frame-pointer": couldn't force "Too much recursion"
  everything else intermittently gives "Too much recursion"

Tried various other little combinations, including --disable/--enable-tracejit/-methodjit and --disable-jemalloc.  No interesting changes.

I think I finally found something interesting in gdb though.
Here are some interesting things at the crash

(gdb) x/i $pc
=> 0xeaf14 <js::ExecuteTree(JSContext*, js::TreeFragment*, uintN&, js::VMSideExit**, js::VMSideExit**)+540>:	ldr	r2, [sp, #12]
(gdb) x/i $lr
   0xeaf15 <js::ExecuteTree(JSContext*, js::TreeFragment*, uintN&, js::VMSideExit**, js::VMSideExit**)+540>:	ldr	r2, [sp, #12]

($lr looks weird ... maybe that's expected?)

Note that we're crashing on |ldr	r2, [sp, #12]|.

(gdb) p $sp
$17 = (void *) 0xbebae6a8
(gdb) p/x *(int*)$sp
$20 = 0x0
(gdb) p/x *(int*)($sp + 12)
$21 = 0x407aa928

By the time gdb is involved, we can load $sp no problem, and load the address we segfaulted on.  Hmmmmm.....

$ cat /proc/18525/maps
cat /proc/18525/maps
00008000-0019e000 r-xp 00000000 b3:02 7606287    /data/local/js/js
0019e000-001ba000 rwxp 00195000 b3:02 7606287    /data/local/js/js
001ba000-001c6000 rwxp 001ba000 00:00 0          [heap]
[snip]
beb9a000-bebaf000 rwxp befeb000 00:00 0          [stack]

So 0xbebaf000 is the top of the stack.  We have 21 pages mapped, 84KiB.  (I double checked and
(gdb) p sysconf(0x27)  // ed: 0x27 = _SC_PAGESIZE
$22 = 4096)

Note that

bebaf000−bebae6a8 = 958
bebaf000−(bebae6a8+12) = 946

We segfaulted trying to read from within the last 1KiB of the last page in the stack mapping.

This seems very suspicious.  If the address we segfaulted on was near $end_of_stack-4096 there would be more to go on.  But as things stand, I don't have any concrete hypotheses.  That methodjit is dying with "too much recursion" on this test also seems suspicious.

Going to keep this in gdb for a bit longer in case anyone wants something, while I do some other work.  Maybe will next try catching and returning from SIGSEGV to see what happens.
(In reply to comment #31)
> ($lr looks weird ... maybe that's expected?)

It implies that the instruction you're seeing is immediately after a function call, and the function call was (probably) a leaf function that didn't touch LR.

Do you know what the prior call was? Was it the code invocation itself?

I've seen the symptom before, but can't quite place it. (A load failing in the code, but working fine in GDB.)
(In reply to comment #32)
> Do you know what the prior call was? Was it the code invocation itself?
> 

I don't know.
If it turns out that there's a system problem causing JIT unfriendliness, we need to blacklist these devices without affecting others.  This patch does that, in a gross way.  Last resort.
Semi-random failures and weirdness w.r.t. the JITs on ARM .. could
this be another manifestation of D-I cache coherence problems proposed
as the cause of bug 596755 ?
(In reply to comment #36)
> Semi-random failures and weirdness w.r.t. the JITs on ARM .. could
> this be another manifestation of D-I cache coherence problems proposed
> as the cause of bug 596755 ?

It's high on my list of suspicions, though I'd be very surprised to see such problems in NanoJIT.

I wouldn't be surprised to see it in JM (or YARR) for example, but it's very young in comparison, and the trace JIT is pretty mature now.
Ah!  That fits what I've seen so far.
The APIs changed, and all the debugging supported has been pushed to tracemonkey.
Attachment #479419 - Attachment is obsolete: true
Attachment #479421 - Attachment is obsolete: true
Attachment #479624 - Attachment is obsolete: true
Attachment #479656 - Attachment is obsolete: true
Attachment #480049 - Attachment is obsolete: true
Attachment #480264 - Attachment mime type: text/x-python → text/plain
(In reply to comment #37)
> I wouldn't be surprised to see it in JM (or YARR) for example, but it's very
> young in comparison, and the trace JIT is pretty mature now.

Well, supposing it's YARR that's failing to do the syncing properly
(although as per https://bugzilla.mozilla.org/show_bug.cgi?id=596755#c14
LukeW reckons YARR _is_ syncing properly) .. "js -j" would still invoke
YARR, wouldn't it?  In which case we'd still see the problem.
There's an easy way to use Valgrind on ARM to check whether there's a
coherence problem.  First, run it "normally" on V, perhaps with
--tool=none to speed things up.  Check that you can get some kind of
failure or weirdness.

Then, once you have failure/weirdness, rerun but add --smc-check=all.
If the failure/weirdness disappears that's a pretty good sign there's
a coherence problem.

Why this works: on ARM-Linux, dynamically generated code must be
sync'd with the icache using the cacheflush system call.  Valgrind
watches these so it knows when it should discard its translations of
the applications's code.  If the application fails to do these when it
should, then it will misbehave on V (as well as on the real hardware)
because V will be running out-of-date translations of parts of the app.

With --smc-check=all, V checks each translation is still valid before
running it; if not it re-makes it.  This is expensive but provides
transparent I-D cache coherence without any cooperation from the app.
Hence it causes the app to behave as if it always correctly notifies
any new code (via sys_cacheflush) before running it.
(In reply to comment #41)
> There's an easy way to use Valgrind on ARM 

I have a dumb question. How do I build Valgrind for ARM/Android? I swear I googled for that this morning.
Julian, I took a crack at building valgrind for android this evening.  The result is at http://people.mozilla.com/~cjones/valgrind-android.tar.gz .  It appears that this was built with -g, so https://wiki.mozilla.org/Mobile/Fennec/Android#Debugging_with_GDB should work.  To try the build on device, run

wget http://people.mozilla.com/~cjones/valgrind-android.tar.gz
tar zxvf valgrind-android.tar.gz
adb push vg /data/local/vg
adb shell

Then from within the |adb shell|, run

/data/local/vg/bin/valgrind /data/local/vg/test

A rooted device isn't needed for this.  vg/ does have to be pushed to exactly /data/local/vg or it won't work.

Valgrind doesn't build out of the box.  The only bug I can ascribe to valgrind itself is not testing for pthread_rwlock.  The rest is just stuff the android folks neglected to copy and paste from glibc.  I'm sick of dealing with their broken libc, so I refuse to post the workaround patch publicly; I'd rather get the android folks to fix bionic (I'd be willing to write a patch once we get all the valgrind issues sorted out).  Would be happy to e-mail the workaround patch though.

On my i9000, valgrind just spins @ 100% CPU running the test program.  Brad also saw this on his N1, so it's likely a problem with android+valgrind and not the i9000.  AFAIK, the obstacles to valgrind-on-android are
 - home-grown libc (bionic), so need a new suppression file and maybe other stuff.  New suppression file should be easy.
 - home-grown loader; not sure what issues this might cause

Let me know if I can help more.
Attached file Test program
Oh right, probably want the source of /data/local/vg/test.
(In reply to comment #41)
> I have a dumb question. How do I build Valgrind for ARM/Android?

Not a dumb question.  I've only tried it on beagleboard (Ubuntu 10.x)
and N900, but not on Android.  I don't have an Android device, else I
would make sure it works.

(In reply to comment #44)

> On my i9000, valgrind just spins @ 100% CPU running the test program.  Brad
> also saw this on his N1, so it's likely a problem with
> android+valgrind

Can you send the results of running with 
"-d -d -v -v --trace-flags=10000000" ?

Overall though this seems like a large diversion from the original
problem.  I'm wondering if I can somehow check for coherence loss
by running the tests on a beagleboard instead.
Attached file valgrind trace output
(In reply to comment #45)
> (In reply to comment #41)
> Can you send the results of running with 
> "-d -d -v -v --trace-flags=10000000" ?

attached here
(In reply to comment #40)
> (In reply to comment #37)
> Well, supposing it's YARR that's failing to do the syncing properly

See bug 596755 comment 16 ,17,18 for some preliminary observations
re MJ PIC/MIC lack-of-syncing.
(In reply to comment #46)
> Created attachment 480411 [details]
> valgrind trace output
> 
> (In reply to comment #45)
> > (In reply to comment #41)
> > Can you send the results of running with 
> > "-d -d -v -v --trace-flags=10000000" ?
> 
> attached here

I can see where it's hanging, during setup of some file space prior to
starting the app.  (in between lines 1791 and 1829 of
coregrind/m_main.c).  No idea why though.
Tracer has been removed.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.