Last Comment Bug 678928 - WebGL JS broken after pushing 649202
: WebGL JS broken after pushing 649202
Status: RESOLVED FIXED
:
Product: Core
Classification: Components
Component: JavaScript Engine (show other bugs)
: Trunk
: ARM Linux
: -- normal (vote)
: ---
Assigned To: Marty Rosenberg [:mjrosenb]
:
Mentors:
http://learningwebgl.com/lessons/less...
Depends on:
Blocks: 649202
  Show dependency treegraph
 
Reported: 2011-08-14 19:35 PDT by Oleg Romashin (:romaxa)
Modified: 2011-10-10 14:11 PDT (History)
10 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
unaffected
+
fixed
+
fixed


Attachments
comments saying "fmem_imm_offset is expecting the offset in words" should be heeded (1.01 KB, patch)
2011-08-18 05:02 PDT, Marty Rosenberg [:mjrosenb]
romaxa: feedback+
Details | Diff | Review
should fix the issue *and* not have 10-bit overflow issues (2.08 KB, patch)
2011-08-18 16:06 PDT, Marty Rosenberg [:mjrosenb]
adrake: review+
bugzilla: approval‑mozilla‑aurora+
Details | Diff | Review

Description Oleg Romashin (:romaxa) 2011-08-14 19:35:44 PDT
Latest trunk show broken WebGL on ARM
Also on 
http://learningwebgl.com/lessons/lesson04/index.html - it crashes
Comment 1 Oleg Romashin (:romaxa) 2011-08-14 19:39:20 PDT
#0  0x3aaf57fc in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:45
#1  0x3bdf2280 in CheckScript (script=<value optimized out>, prev=<value optimized out>)
    at js/src/jsscript.cpp:306
#2  0x3bdf2a98 in js_TraceScript (trc=0xae91571c, script=0x3b712448, owner=0x6) at js/src/jsscript.cpp:1442
#3  0x3bd7bd68 in ScanObject (this=0xae91571c) at js/src/jsgcmark.cpp:636
#4  js::GCMarker::drainMarkStack (this=0xae91571c) at js/src/jsgcmark.cpp:802
#5  0x3b67dd84 in setMarkColor (trc=0xae91571c, data=0x11c7) at js/src/jsgc.h:1314
#6  XPCJSRuntime::TraceJS (trc=0xae91571c, data=0x11c7) at js/src/xpconnect/src/xpcjsruntime.cpp:383
#7  0x3bd71288 in js::MarkRuntime (trc=0xae91571c) at js/src/jsgc.cpp:1846
#8  0x3bd73228 in MarkAndSweep (cx=0x405885c0, comp=<value optimized out>, gckind=GC_SHRINK, gcTimer=<value optimized out>)
    at js/src/jsgc.cpp:2272
#9  GCCycle (cx=0x405885c0, comp=<value optimized out>, gckind=GC_SHRINK, gcTimer=<value optimized out>)
    at js/src/jsgc.cpp:2642
#10 0x3bd73cd4 in js_GC (cx=0x405885c0, comp=<value optimized out>, gckind=GC_SHRINK) at js/src/jsgc.cpp:2728
#11 0x3b328c3c in nsJSContext::ScriptEvaluated (this=0x41031d80, aTerminated=4551)
    at dom/base/nsJSEnvironment.cpp:3167
#12 0x3b3245bc in nsJSContext::ScriptExecuted (this=0x41031d80) at dom/base/nsJSEnvironment.cpp:3239
#13 0x3b68a818 in AutoScriptEvaluate::~AutoScriptEvaluate (this=0xae915ae8, __in_chrg=<value optimized out>)
---Type <return> to continue, or q <return> to quit---
    at js/src/xpconnect/src/xpcwrappedjsclass.cpp:122
#14 0x3b68c9d0 in nsXPCWrappedJSClass::CallMethod (this=0x42150b50, wrapper=<value optimized out>, methodIndex=<value optimized out>, info=0x40468848, 
    nativeParams=0xae915c10) at js/src/xpconnect/src/xpcwrappedjsclass.cpp:1896
#15 0x3b6870e0 in nsXPCWrappedJS::CallMethod (this=0x42198640, methodIndex=3, info=0x40468848, params=0xae915c10)
    at js/src/xpconnect/src/xpcwrappedjs.cpp:585
#16 0x3bacb780 in PrepareAndDispatch (self=<value optimized out>, methodIndex=<value optimized out>, args=0xae915cd4)
    at xpcom/reflect/xptcall/src/md/unix/xptcstubs_arm.cpp:131
#17 0x3bacae00 in SharedStub () from /nfsobj-build/dist/bin/libxul.so
#18 0x3af25740 in nsRefreshDriver::Notify (this=0x3fd8c540, aTimer=<value optimized out>)
    at layout/base/nsRefreshDriver.cpp:361
#19 0x3baba5a4 in nsTimerImpl::Fire (this=0x42198680) at xpcom/threads/nsTimerImpl.cpp:427
#20 0x3baba720 in nsTimerEvent::Run (this=<value optimized out>) at xpcom/threads/nsTimerImpl.cpp:520
#21 0x3bab6a00 in nsThread::ProcessNextEvent (this=0x3fd0cc50, mayWait=<value optimized out>, result=<value optimized out>)
    at xpcom/threads/nsThread.cpp:631
Comment 2 Oleg Romashin (:romaxa) 2011-08-14 19:43:24 PDT
Have this problem on Maemo6 and WebOS.
Comment 3 Andreas Gal :gal 2011-08-14 19:45:31 PDT
GC-related crash. Its unclear to me how the ARM patch caused this. Just in case we unearthed an existing GC problem, I am going to hide the bug until someone gets around to take a look. This doesn't crash x86, at least not on mac or linux, but maybe thats just luck/different memory pressure.
Comment 4 Oleg Romashin (:romaxa) 2011-08-14 19:54:02 PDT
I've bisected it... and before that commit everything works fine... after WebGL broken and some WEbGL examples crashes.
Comment 5 Oleg Romashin (:romaxa) 2011-08-14 19:55:07 PDT
ARM patch might cause some VFP incorrect behavior and memory corruption.
Comment 6 Marty Rosenberg [:mjrosenb] 2011-08-15 03:31:35 PDT
It is entirely possible that I generated an instruction that does not exist on your particular chip.  Do you know what type of chip you have or what instruction sets are supported?
While running JIT'ed code, that is no good reason for the stack to look like anything sane, but that does appear as if you have a rather sane stack that is far from any JIT'ed code.
Comment 7 Oleg Romashin (:romaxa) 2011-08-15 07:16:53 PDT
Here is the CPUINFO on chips where I see problems with WebGL
HP Touchpad:
Processor	: ARMv7 Processor rev 2 (v7l)
processor	: 0
BogoMIPS	: 13.52
Features	: swp half thumb fastmult vfp edsp neon vfpv3 
CPU implementer	: 0x51
CPU architecture: 7
CPU variant	: 0x0
CPU part	: 0x02d
CPU revision	: 2
Hardware	: TENDERLOIN
Revision	: 0000
Serial		: 0000000000000000

N9:
Processor	: ARMv7 Processor rev 2 (v7l)
BogoMIPS	: 996.74
Features	: swp half thumb fastmult vfp edsp neon vfpv3 
CPU implementer	: 0x41
CPU architecture: 7
CPU variant	: 0x3
CPU part	: 0xc08
CPU revision	: 2
Hardware	: Nokia RM-696 board
Revision	: 1504
Serial		: 0000000000000000
Soc Info        : OMAP 3630 ES1.2-hs
        IDCODE  : 2b89102f
        Pr. ID  : 00000000 00000000 000000cc cafeb891
Comment 8 Oleg Romashin (:romaxa) 2011-08-15 07:48:36 PDT
Also on N9 we are using hardfp toolchain if that make any difference
Comment 9 Marty Rosenberg [:mjrosenb] 2011-08-15 08:26:31 PDT
That looks like it supports all of the instructions that I would have used.  I only have an android device for testing.  I'll see if it reproduces there, and otherwise, I'll borrow someone's device that does exhibit the issue.
Comment 10 Oleg Romashin (:romaxa) 2011-08-15 08:45:22 PDT
if you are in MV office, the you probably can take N9 from dougt... For webOS you need additional patches to compile FF... probably JS part is easy to compile there
Comment 11 Oleg Romashin (:romaxa) 2011-08-15 17:11:05 PDT
Tested this also on Galaxy Tab 9.0xx , and it works wrong also there, and crashes too.
http://learningwebgl.com/lessons/lesson05/index.html
http://learningwebgl.com/lessons/lesson04/index.html
Comment 12 Oleg Romashin (:romaxa) 2011-08-17 10:00:28 PDT
any progress here? this seems like on the way to 8.0 release, and presumably breaking WebGL on android (at least galaxy tab similar HW)
Comment 13 Andreas Gal :gal 2011-08-17 10:43:19 PDT
We cannot keep carrying this regression. If we don't have a fix today, please back out the offending patch.
Comment 14 Marty Rosenberg [:mjrosenb] 2011-08-17 11:06:53 PDT
I haven't been able to get a maemo development environment set up yet, and the problem doesn't reproduce on my nexus-s. I'll see if I can borrow someone's galaxy S today in order to get to the point where I can actually start debugging this.
Comment 15 Oleg Romashin (:romaxa) 2011-08-17 11:08:44 PDT
Not sure about galaxy S, but I have Webg broken on galaxy Tab... hope HW is the same for them
Comment 16 Marty Rosenberg [:mjrosenb] 2011-08-17 11:40:37 PDT
Sorry, I meant galaxy tab, not galaxy S (must be too used to typing nexus S).
Comment 17 Marty Rosenberg [:mjrosenb] 2011-08-17 16:23:17 PDT
I still haven't been able to get an environment to reproduce this... 
A simple fix may be to just disable fast typed arrays on arm rather than to back out the whole patch.  It should be as simple as removing the line: ENABLE_POLYIC_TYPED_ARRAY=1
from the arm section of js/src/configure.in
Comment 18 Andreas Gal :gal 2011-08-17 16:39:57 PDT
Is that disabling any optimization we didn't have before the patch? And what is the path to fix that?
Comment 19 Daniel Veditz [:dveditz] 2011-08-17 16:43:28 PDT
Why would the specific ARM chip matter if it's a GC bug?
Comment 20 Andreas Gal :gal 2011-08-17 16:46:01 PDT
Also, what does it mean that you can't get an environment to reproduce this? I have it reproduced on my galaxy s2. If you don't have one you can use mine. I am downstairs on the 2nd floor.
Comment 21 Oleg Romashin (:romaxa) 2011-08-17 16:53:11 PDT
(In reply to Daniel Veditz from comment #19)
> Why would the specific ARM chip matter if it's a GC bug?

I think we have a deal with memory corruption, which endup in GC...
Comment 22 Andreas Gal :gal 2011-08-17 16:54:58 PDT
If we suspect memory corruption, the patch has to come out. In particular if we have no explanation whats going on here exactly. Lets investigate this on a project branch instead of carrying risk on trunk. Thats what project branches are there for.
Comment 23 Marty Rosenberg [:mjrosenb] 2011-08-18 05:02:33 PDT
Created attachment 554036 [details] [diff] [review]
comments saying "fmem_imm_offset is expecting the offset in words" should be heeded

Looks like I got this right in 4/5 of the callsites, and the last callsite is never hit by any of our tests.
In other news, fmem_imm_off is something that shouldn't be called from outside ARMAssembler.*, so I'm about to clean that up.
Comment 24 Marty Rosenberg [:mjrosenb] 2011-08-18 05:25:15 PDT
additional thought: it would appear as if storeFloat does not just store floats.  It takes a double, converts it to a float, then stores that.  This is blatantly obvious because storefloat takes a "floatReg", which is used for both single precision and double precision values.

If storeFloat is actually supposed to convert a double to a float before writing it to memory, then most of the variants of storeFloat are wrong, which means that they are not tested anywhere.  I'm happy to change it, but we should get some tests that make sure this is doing something sane.
Comment 25 Oleg Romashin (:romaxa) 2011-08-18 09:03:05 PDT
Comment on attachment 554036 [details] [diff] [review]
comments saying "fmem_imm_offset is expecting the offset in words" should be heeded

ok, with this patch I see webgl working fine again, and no crash.
Comment 26 Marty Rosenberg [:mjrosenb] 2011-08-18 16:06:16 PDT
Created attachment 554243 [details] [diff] [review]
should fix the issue *and* not have 10-bit overflow issues

as (sort of) stated earlier, fmem_imm_off cannot be called with any offset.  In order to have a sane macro assembler, this has been wraped in floatTransfer, which will ensure that the offset is encoded correctly.
for whatever reason, I'd called fmem_imm_off, rather than floatTransfer.  this has been mitigated.
Comment 27 Marty Rosenberg [:mjrosenb] 2011-08-19 14:54:33 PDT
Comment on attachment 554243 [details] [diff] [review]
should fix the issue *and* not have 10-bit overflow issues

This is a pretty big security risk.  Using a typed array, and crafting your indexes into it, you can write single floating point values up to 3 times the length of the array past the end of the array.  Most likely this would just corrupt the heap but I cannot say for sure.

The fix is plenty simple.  It calling an already existent wrapper function that
will encode the offset, but it will apply the correct shift, and it will also guard against cases where the offset is out of range for the given instruction.
Comment 28 Daniel Holbert [:dholbert] (largely AFK until June 28) 2011-08-19 15:10:26 PDT
Landed on m-c:
 http://hg.mozilla.org/mozilla-central/rev/af2098f08fc7
Comment 29 Johnathan Nightingale [:johnath] 2011-08-23 14:43:36 PDT
Comment on attachment 554243 [details] [diff] [review]
should fix the issue *and* not have 10-bit overflow issues

Discussed and approved in triage - we're assuming this isn't needed on beta? (Please land it soon!)
Comment 30 Oleg Romashin (:romaxa) 2011-08-23 15:32:02 PDT
beta does not have fix for 649202 IIRC
Comment 31 Daniel Holbert [:dholbert] (largely AFK until June 28) 2011-08-23 18:21:44 PDT
Landed on Aurora, for Firefox 8:
http://hg.mozilla.org/releases/mozilla-aurora/rev/9a243f427279

Note You need to log in before you can comment on or make changes to this bug.