Closed Bug 696380 Opened 13 years ago Closed 13 years ago

shell does not run on nvidia tegra2 devices

Categories

(Tamarin Graveyard :: Virtual Machine, defect)

ARM
Android
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dschaffe, Assigned: pnkfelix)

Details

(Whiteboard: has-patch)

Attachments

(1 file)

If you download a build of the shell from asteam (e.g.):http://asteam.corp.adobe.com/builds/tamarin-redux/6673-2c919a862782/android/avmshell  or build the android shell locally the shell does not run on a galaxy tab or any honeycomb (3.1) builds I am assuming.  

tether to galaxy tab:
$ adb shell mkdir /data/local/tamarin
$ adb push avmshell /data/local/tamarin
$ adb shell
(now you're on the galaxy tab)
$ cd /data/local/tamarin
$ chmod 777 avmshell
$ ./avmshell 
Invalid Instruction

I tried locally building with the latest android ndk (4.4.3) and got the same result.
Flags: flashplayer-qrb?
debug build don't yield any further information, just "Invalid instruction"
Assuming this is a Tab 10.1 and not one of the several other Galaxy Tab models:

What about -Dinterp?

At issue is whether the C++ code is compiled to use illegal (eg NEON) instructions or whether it's the JIT that generates illegal instructions.  (Or somesuch.)  The thing has a Tegra 2, famously NEON-less:

http://www.gsmarena.com/samsung_p7100_galaxy_tab_10_1v-3831.php.

I'm guessing it's not the JIT since you're not even getting the proper "usage" message from the shell.

The Xoom has the same CPU, FWIW.
It happens with -Dinterp as well.

(I'm going to look at this for a little while because I think I'll be better off benchmarking on the tab rather than on my nexus one.)

I've gotten as far as relearning how to install, run, and attach to gdbserver.  Will report more later.
Assignee: nobody → fklockii
Some notes on how to replicate.

1. Set up adb to forward port 5039 so that you can connect with remote gdb (chances are you've done this before):

  % adb forward tcp:5039 tcp:5039

2. Do an android debug build of avmshell.  Copy it onto device:

  % adb push shell/avmshell /data/local/tamarin

3. Copy:
  android-ndk/toolchains/arm-linux-androideabi-4.4.3/prebuilt/gdbserver
from android-ndk to /data/local/ on the device:

  % adb push prebuilt/gdbserver /data/local

4. In one terminal, connect to device and start avmshell under gdbserver:

  % adb shell
  shell@android:/ $ cd /data/local/
  shell@android:/data/local $ ./gdbserver remote:5039 ./tamarin/avmshell
  Process ./tamarin/avmshell created; pid = 22034
  Listening on port 5039


5. In another terminal, fire up the android gdb (from the ndk), load the symbols from the debug build, and connect to the listening shell process:

  % arm-linux-androideabi-gdb
  (gdb) file shell/avmshell
  Reading symbols from /Users/fklockii/DevTR/tamarin-redux/objdir-dbgdroid/shell/avmshell...done.
  (gdb) target remote:5039


6. Continue the paused avmshell.

  (gdb) c

7. Observe the failure, check out the backtrace:

  Program received signal SIGILL, Illegal instruction.
  0x00354390 in _nan () at ../core/MathUtils.cpp:112
  112	        return f;
  (gdb) bt
  #0  0x00354390 in _nan () at ../core/MathUtils.cpp:112
  #1  0x00358704 in __static_initialization_and_destruction_0 (__initialize_p=1, 
      __priority=65535) at ../core/MathUtils.cpp:128
  #2  0x00358798 in global constructors keyed to MathUtils.cpp ()
      at ../core/MathUtils.cpp:1734
  #3  0xb0003802 in ?? ()
  Cannot access memory at address 0x3a200030


8. Disassemble _nan to learn what the instruction in question is:

  (gdb) disassemble _nan
  Dump of assembler code for function _nan:
  0x00354378 <_nan+0>:	str	r11, [sp, #-4]!
  0x0035437c <_nan+4>:	add	r11, sp, #0	; 0x0
  0x00354380 <_nan+8>:	sub	sp, sp, #12	; 0xc
  0x00354384 <_nan+12>:	mvn	r3, #-2147483648	; 0x80000000
  0x00354388 <_nan+16>:	str	r3, [r11, #-8]
  0x0035438c <_nan+20>:	flds	s15, [r11, #-8]
  0x00354390 <_nan+24>:	fcvtds	d16, s15
  0x00354394 <_nan+28>:	vmov	r2, r3, d16
  0x00354398 <_nan+32>:	mov	r0, r2
  0x0035439c <_nan+36>:	mov	r1, r3
  0x003543a0 <_nan+40>:	add	sp, r11, #0	; 0x0
  0x003543a4 <_nan+44>:	ldmia	sp!, {r11}
  0x003543a8 <_nan+48>:	bx	lr
  End of assembler dump.


Seems like the problem is this:
  0x00354390 <_nan+24>:	fcvtds	d16, s15
Wrong representation for the NaN, ie, signaling NaN used where non-signaling needed?
(In reply to Lars T Hansen from comment #5)
> Wrong representation for the NaN, ie, signaling NaN used where non-signaling
> needed?

I see, according to wikipedia the interpretation of the quiet/signaling bit is architecture dependent, due to oversight in the initial IEEE 754-1985.

So I'm guessing that either Tegra 2 flipped the meaning of the bit, or more likely, we have been getting lucky in that the default NaNS exception handler on our other targets is silently replacing the signaling NaN with a quient NaN?  (Or both, I suppose.)

I'll investigate further.

(Still, Illegal Instruction seems like a strange resulting error here...)
(In reply to Lars T Hansen from comment #2)
> Assuming this is a Tab 10.1 and not one of the several other Galaxy Tab
> models:
> 
> What about -Dinterp?
> 
> At issue is whether the C++ code is compiled to use illegal (eg NEON)
> instructions or whether it's the JIT that generates illegal instructions. 
> (Or somesuch.)  The thing has a Tegra 2, famously NEON-less:
> 
> http://www.gsmarena.com/samsung_p7100_galaxy_tab_10_1v-3831.php.
> 

This may or may not matter, but another data point.  I've been working on beagleboard (ARMv7-A,NEON), and my first build of avmshell with -mfpu=neon printed "Illegal Instruction" like in this bug.  -Dinterp worked fine.  Oddly, subsuequent executions of the same binary also ran fine.  YMMV, but it would be worth your while to find out for sure what the compiler settings are.
I am indeed compiling with -mfpu=neon according to the makefile.

From skimming the configure.py script, it looks like the logic is set up so that armv7-a implies neon, and I've been using --arm-arch=armv7-a

dschaffe, two questions:

  1. What configure.py flags is the autobuild currently using?

  2. What configure.py flags *should* the autobuild be using?

:)
For the record, my settings were:
- gcc 4.5.2 (Sourcery G++ Lite 2011.03-41) -march=armv7-a -mthumb -mfloat-abi=softfp [-mfpu=neon]
- beagleboard C3 ubuntu jaunty (2009.something)
(In reply to Felix S Klock II from comment #8)
> dschaffe, two questions:
>   1. What configure.py flags is the autobuild currently using?
>   2. What configure.py flags *should* the autobuild be using?

$ ../configure.py --arm-arch=armv7-a --target=arm-android

Let me know if you think we should be using other flags.  From taking a look at configure.py I see when arm-arch=armv7-a :

BASE_CXX_FLAGS +=  -march=armv7-a -mtune=cortex-a8 -mfloat-abi=softfp -mfpu=neon -mno-thumb -fno-section-anchors -D__ARM_ARCH__=7
(In reply to Dan Schaffer from comment #10)
> (In reply to Felix S Klock II from comment #8)
> > dschaffe, two questions:
> >   1. What configure.py flags is the autobuild currently using?
> >   2. What configure.py flags *should* the autobuild be using?
> 
> $ ../configure.py --arm-arch=armv7-a --target=arm-android
> 
> Let me know if you think we should be using other flags.

Nope, I just wanted to double-check whether we were making use of any of the arm-fpu or arm-neon variations that are implied by the lines

  arm_fpu = o.getBoolArg("arm-fpu",False)
  arm_neon = o.getBoolArg("arm-neon",False)


> From taking a look at configure.py I see when arm-arch=armv7-a :
> 
> BASE_CXX_FLAGS +=  -march=armv7-a -mtune=cortex-a8 -mfloat-abi=softfp
> -mfpu=neon -mno-thumb -fno-section-anchors -D__ARM_ARCH__=7

Right; I think this is probably wrong.  In particular, the -mfpu=neon line is probably incorrect and should be removed.

I made a version of configure.py that took those out, then reconfigured and rebuilt, and my avmshell now appears to run successfully on my galaxy tab.

I'll post the patch here in a sec.
I assume this will cause a performance drop for people who aren't passing --enable-arm-neon (or whatever it is) on the configure.py command line.

(That's the main reason I wanted Ed's stamp on it.)

It does get us to a working shell on the tab.  (A number of acceptance tests related to Date have failed, but its better than crashing on startup.)
Attachment #569691 - Flags: superreview?(edwsmith)
Attachment #569691 - Flags: review?(dschaffe)
Whiteboard: has-patch
(In reply to Edwin Smith from comment #7)
> This may or may not matter, but another data point.  I've been working on
> beagleboard (ARMv7-A,NEON), and my first build of avmshell with -mfpu=neon
> printed "Illegal Instruction" like in this bug.  -Dinterp worked fine. 
> Oddly, subsuequent executions of the same binary also ran fine.  YMMV, but
> it would be worth your while to find out for sure what the compiler settings
> are.

As a followup to this note in particular (and to comment 4, comment 5, and comment 6): disabling neon did not seem to really affect what _nan disassembles to.

So its funny/odd/confusing that the error manifested itself in this way.

(I also realize that I never actually understood Ed's point, since I now notice that he said that the beagleboard *has* NEON support.  So why did -mfpu=neon yeidl a problem?  Is it just a mystery?)
(In reply to Felix S Klock II from comment #13)
> (In reply to Edwin Smith from comment #7)
> > This may or may not matter, but another data point.  I've been working on
> > beagleboard (ARMv7-A,NEON), and my first build of avmshell with -mfpu=neon
> > printed "Illegal Instruction" like in this bug.  -Dinterp worked fine. 
> > Oddly, subsuequent executions of the same binary also ran fine.  YMMV, but
> > it would be worth your while to find out for sure what the compiler settings
> > are.
> 
> As a followup to this note in particular (and to comment 4, comment 5, and
> comment 6): disabling neon did not seem to really affect what _nan
> disassembles to.
> 
> So its funny/odd/confusing that the error manifested itself in this way.

Full disclosure, here's the disassembly from the no-neon build:

(gdb) disassemble _nan
Dump of assembler code for function _nan:
0x003543e0 <_nan+0>:	str	r11, [sp, #-4]!
0x003543e4 <_nan+4>:	add	r11, sp, #0	; 0x0
0x003543e8 <_nan+8>:	sub	sp, sp, #12	; 0xc
0x003543ec <_nan+12>:	mvn	r3, #-2147483648	; 0x80000000
0x003543f0 <_nan+16>:	str	r3, [r11, #-8]
0x003543f4 <_nan+20>:	flds	s15, [r11, #-8]
0x003543f8 <_nan+24>:	fcvtds	d7, s15
0x003543fc <_nan+28>:	vmov	r2, r3, d7
0x00354400 <_nan+32>:	mov	r0, r2
0x00354404 <_nan+36>:	mov	r1, r3
0x00354408 <_nan+40>:	add	sp, r11, #0	; 0x0
0x0035440c <_nan+44>:	ldmia	sp!, {r11}
0x00354410 <_nan+48>:	bx	lr
End of assembler dump.

For all I know at the moment, "fcvtds d7, s15" is valid for all ARM but "fcvtds	d16, s15" is a neon-specific extension?  Did the register set get extended in neon?
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
(oops I don't know how it got marked as resolved, fat fingers.)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
> For all I know at the moment, "fcvtds d7, s15" is valid for all ARM but
> "fcvtds	d16, s15" is a neon-specific extension?  Did the register set get
> extended in neon?

Yes, neon and VFPv3 has d16-31, older variants had d0-15 only; d16 suspiciously straddles a boundary.  Neon also shadows D0-31 with Q0-15.  I don't know whether fcvtds (double->short conversion?) is VFPv3 or NEON specific, but I'd guess VFPv3.

(there's also a VFPv3-16 variant with other v3 extensions, but the original limited register set).

NEON is not implied by CoreTek-A9, and Tegra T20 doesn't have NEON, and I dont know whether it has VFPv3.  (there's a lot of detail to snake out here, all publicly available).  But the signs are pointing to it not having the high-numbered registers, so either VFPv2 or VFPv3-16.
Here's the fact chain I could uncover (-> means implies)

  galaxy tab 10.1 -> tegra 2 series -> Coretex-A9 (no-neon) -> ARMv7-a (no-neon) -> VFPv3-D16.

and also:

  generally: ARMv7-A -> VFPv3
  generally: NEON -> VFPv3-D32
locally I ran tests successfully on a nexxus one (froyo) and on the Galaxy Tab (honeycomb).  all tests passed.  I wanted to check performance to see if we get any degradation with performance.  I did not see any Date failures set to EST.
running v8.5 on a nexxus one was at least 10x slower with neon removed.  this seems like a problem.

v8.5/optimized:
test   ,  operations/sec without neon   ,   operations /sec with neon (current)
  crypto,    1076.0  ,  11399.0
  deltablue,    307.0,  4715.0
  earley-boyer,  97.8,  1483.0
  raytrace,  921.0,  11693.0
  regexp,     6.9, 116.0
  richards,   434.0,  6533.0
  splay,  638.0,  8276.0
(In reply to Dan Schaffer from comment #18)
> all tests passed.  I wanted to check performance to see if
> we get any degradation with performance.  I did not see any Date failures
> set to EST.

Oh of course!  I switched the TZ to Paris over here.  Okay I clearly need to change that back.  Great, thanks.
(In reply to Dan Schaffer from comment #19)
> running v8.5 on a nexxus one was at least 10x slower with neon removed. 
> this seems like a problem.
> 
> v8.5/optimized:
> test   ,  operations/sec without neon   ,   operations /sec with neon
> (current)
>   crypto,    1076.0  ,  11399.0
>   deltablue,    307.0,  4715.0
>   earley-boyer,  97.8,  1483.0
>   raytrace,  921.0,  11693.0
>   regexp,     6.9, 116.0
>   richards,   434.0,  6533.0
>   splay,  638.0,  8276.0

Am I right in assuming that these are results where you did not pass --enable-arm-neon to configure.py ?

(In other words, I think we should be able to work around any performance issues by appropriate changes to either or both of configure.py and the surrounding buildbot architecture.)
I think the bug subject should be "doesn't run on tegra2".  Honeycomb isn't the problem.
Side note, recent tr shells run just fine on my hp touchpad w/ gingerbread, which is a dual core Snapdragon APQ8060.
(In reply to Dan Schaffer from comment #19)
> running v8.5 on a nexxus one was at least 10x slower with neon removed. 
> this seems like a problem.
> 
> v8.5/optimized:
> test   ,  operations/sec without neon   ,   operations /sec with neon
> (current)
>   crypto,    1076.0  ,  11399.0
>   deltablue,    307.0,  4715.0
>   earley-boyer,  97.8,  1483.0
>   raytrace,  921.0,  11693.0
>   regexp,     6.9, 116.0
>   richards,   434.0,  6533.0
>   splay,  638.0,  8276.0

Oops I got that wrong.  I had an $AVM mistake so I was accidentally comparing the native mac shell to the android.  I will rerun the tests.
Summary: shell does not run on android honeycomb (Samsung Galaxy Tab) → shell does not run on nvidia tegra2 devices
With the correct builds on nexxus one I saw < 5% difference.  
v8.5/optimized, ops/sec with neon, ops/sec without neon
  crypto , 1076.0, 1064.0
  deltablue ,  310.0, 308.0
  earley-boyer  ,  97.8, 98.4 
  raytrace  ,  885.0, 919.0
  regexp  ,  6.9, 6.8 
  richards  , 426.3, 424.0
  splay  ,  660.0, 651.0
Attachment #569691 - Flags: review?(dschaffe) → review+
Felix, I think we should push patch since the performance issues turned out not to be significant.
changeset:   6701:f433bab213d8
user:        Felix S Klock II <fklockii@adobe.com>
date:        Wed Oct 26 08:52:00 2011 +0200
summary:     Bug 696380: arm7a does not imply neon, so stop lying to compiler (r=dschaffe, sr pending=edwsmith).

http://hg.mozilla.org/tamarin-redux/rev/f433bab213d8
Status: REOPENED → RESOLVED
Closed: 13 years ago13 years ago
Resolution: --- → FIXED
Attachment #569691 - Flags: superreview?(edwsmith) → superreview+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: