755568 - Random degradation of system causes Gaia to not to update the screen as expected on SGS2 ICS

Reporter

Description

•

13 years ago

User Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.168 Safari/535.19

Steps to reproduce:
Navigate through the Gaia UI doing common actions: launching some apps (Browser, Settings, Calc, etc.), swiping between screens, holding down the home button to close backgrounded apps... in short, navigating as a normal user would. I couldn't find a fixed way to reproduce the error. It happens randomly, but I always hit it within 5 minutes of doing what I described above. This has been happening for a while now, since I changed from Gingerbread to ICS. I checked it with the latest versions: Gecko (1ed4f36b1512e0388151d0aafd79f1a5737db0f9), Gaia (847ea7421bc61b2c54a8323daca93f205932f87c). I'm using a Samsung Galaxy S2 (GT-I9100). I found two more people with the same problem.

Actual results:
While navigating through the Gaia interface (mostly when pressing the "back" button while in an app, but not always), the actions normally fired by input events such as swipes, clicks or taps suddenly lose accuracy (e.g. swipes between screens stop in the middle of the transition) or are dispatched late (e.g. tapping an app icon doesn't launch the app immediately; it seems to launch only when some other input action happens). The only way to restore normal behavior is to restart the b2g process.

Expected results:
Normal behavior. Smooth input event actions.

Juan Gomez [:_AtilA_] (CET/CEST)

Reporter

Updated

•

13 years ago

Severity: normal → major

OS: All → Gonk

Hardware: All → Other

Philipp von Weitershausen [:philikon]

Updated

•

13 years ago

Assignee: nobody → atilag

Shian-Yow Wu [:swu]

Comment 1

•

13 years ago

I can always reproduce this issue with the steps below (described at https://github.com/andreasgal/B2G/issues/301):
1. Wait 1 minute after device boot (only reproducible after 1 minute from device boot)
2. Go to the Clock app
3. Click "Start"->"Stop"->"Start"... repeatedly until the issue is reproduced

[:fabrice] Fabrice Desré

Updated

•

13 years ago

Summary: [B2G] Random degradation of input events dispatching causes Gaia to apparently run slowly → Random degradation of input events dispatching causes Gaia to apparently run slowly

Juan Gomez [:_AtilA_] (CET/CEST)

Reporter

Comment 2

•

13 years ago

Yes, I can always make this happen using your procedure. I already checked that Gecko is capturing/dispatching all events correctly. Now I'm checking that the JS engine is receiving them too...

Juan Gomez [:_AtilA_] (CET/CEST)

Reporter

Updated

•

13 years ago

Status: UNCONFIRMED → NEW

Ever confirmed: true

Juan Gomez [:_AtilA_] (CET/CEST)

Reporter

Comment 3

•

13 years ago

OK, there is no problem with the input event system. The JavaScript engine is handling all events as expected. Now I'm debugging the layout system, looking for something unusual...

Juan Gomez [:_AtilA_] (CET/CEST)

Reporter

Updated

•

13 years ago

Summary: Random degradation of input events dispatching causes Gaia to apparently run slowly → Random degradation of system causes Gaia to not to update the screen as expected

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Comment 4

•

13 years ago

Can anyone reproduce this problem with a phone that was first flashed to android-ICS, *then* flashed to b2g-ICS? I think I saw it yesterday on a phone that I directly flashed to b2g-ICS without first flashing to android-ICS (i.e. retained GB modem firmware etc.). Since that's not a supported configuration it's not something worth investigating.

Juan Gomez [:_AtilA_] (CET/CEST)

Reporter

Comment 5

•

13 years ago

Chris, my phone was first flashed to Android-ICS, using my telecom company's firmware (Movistar Spain) via Samsung Kies, and then flashed to b2g-ICS. Everyone I asked with a GT-I9100/B2G-ICS phone has this problem and, unless you tell me otherwise, it seems to happen with all firmwares. As I'm not a Gecko hacker, I'm posting my findings in case someone interested wants to start investigating...

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Comment 6

•

13 years ago

I haven't been able to reproduce this on my ICS-based b2g sgs2. Just seeing if maybe there was a connection.

Shian-Yow Wu [:swu]

Comment 7

•

13 years ago

This issue is reproducible on about 50% of the SGS2 phones in the Taipei office.

Shian-Yow Wu [:swu]

Updated

•

13 years ago

Summary: Random degradation of system causes Gaia to not to update the screen as expected → Random degradation of system causes Gaia to not to update the screen as expected on SGS2 ICS

Shian-Yow Wu [:swu]

Comment 8

•

13 years ago

The issue is only reproducible when the last messages in "dmesg" show cpu1 being turned off:
[ 61.585130] cpu1 turnning off!
[ 61.590619] IRQ112 no longer affine to CPU1
[ 61.590924] CPU1: shutdown
[ 61.591578] cpu1 off!
On non-reproducible phones, the issue doesn't happen even when cpu1 is off. If the system is blocked in a specific B2G thread, we should find a way to know which thread has been blocked.

Juan Gomez [:_AtilA_] (CET/CEST)

Reporter

Comment 9

•

13 years ago

Finally, I know what's happening in Gecko (but not why it's happening). Roughly speaking, TimerThread is used in Gecko to process timing-like events (nsTimerImpl instances like nsRefreshDriver, for example). These timing events are "dispatched" when their timeouts are reached. The mechanism that waits for the timeout is inside the TimerThread::Run() method. It uses monitors (MonitorAutoUnlock) to wait for the timeout to be reached:
<code>
...
mWaiting = true;
mMonitor.Wait(waitFor);
mWaiting = false;
...
</code>
The waitFor variable is calculated from the timeout of the timer event being processed and the current time (TimeStamp::Now()). The problem is that when things go wrong, the results of these timing operations get out of control, with very high values [1] that make mMonitor.Wait() wait "forever", preventing other timer events from being processed. This explains the strange behavior when updating Gaia, because nsRefreshDriver (which is responsible for invalidating the screen and triggering repaints) is never processed again unless some other event unblocks the thread monitor.

As I said, I'm now investigating why these values are getting corrupted. From what I read in the code, the timing stuff is platform dependent, which could explain why this is happening only on some SGS2 devices.

Sadly I don't have much time right now, but I hope to get this issue fixed/located soon.

[1] Logcat showing this behavior: http://pastebin.mozilla.org/1668995
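A rough sketch of the wait pattern described in comment 9, written with standard C++ primitives rather than Gecko's Monitor class. ToyTimerThread, WaitForDeadline and mNotified are hypothetical names used only for illustration; the real TimerThread::Run() is more involved. The point is only that the wait duration is derived from "deadline minus now", so a corrupted result in that calculation directly becomes the time the thread sleeps.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Toy stand-in for the timer thread; not Gecko code.
struct ToyTimerThread {
  std::mutex mMutex;
  std::condition_variable mCond;
  bool mWaiting = false;   // mirrors the flag set around the wait
  bool mNotified = false;  // hypothetical: set by whoever adds a new timer

  // Sleeps until the deadline or until notified.
  // Returns true if the deadline passed without a notification.
  bool WaitForDeadline(std::chrono::steady_clock::time_point deadline) {
    std::unique_lock<std::mutex> lock(mMutex);
    auto now = std::chrono::steady_clock::now();
    // waitFor comes from (deadline - now); a garbage value here means the
    // thread sleeps far too long and later timers are starved.
    auto waitFor = deadline > now ? deadline - now
                                  : std::chrono::steady_clock::duration::zero();
    mWaiting = true;
    bool notified = mCond.wait_for(lock, waitFor, [this] { return mNotified; });
    mWaiting = false;
    return !notified;
  }
};
```

With a sane deadline a few milliseconds ahead, WaitForDeadline returns promptly; with an astronomically large waitFor the thread would only wake up when something else notifies it, which matches the symptom that another input event "unsticks" the UI.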

Dave Hylands [:dhylands]

Comment 10

•

13 years ago

So I took a look at how mMonitor.Wait is implemented. I'm not sure if this is related to the problem, but it certainly could be a contributing factor. If my tracing is correct, it all boils down to pt_TimedWait (https://mxr.mozilla.org/mozilla-central/source/nsprpub/pr/src/pthreads/ptsynch.c#240), which unfortunately calls gettimeofday and waits for an absolute realtime. gettimeofday is based on the kernel's clock_gettime using the CLOCK_REALTIME clock. This is fundamentally flawed, as the absolute time can change (any time the system time changes). It should be based on CLOCK_MONOTONIC. See http://stackoverflow.com/questions/3006259/what-time-function-do-i-need-to-use-with-pthread-cond-timedwait for a more detailed explanation. There is a function called pthread_condattr_setclock which can tell pthreads to use CLOCK_MONOTONIC.
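A minimal sketch of the approach mentioned in comment 10, assuming a POSIX platform that actually provides pthread_condattr_setclock (the following comments note that bionic may not). TimedWaitMonotonic is a hypothetical helper, not NSPR code; it binds the condition variable to CLOCK_MONOTONIC so the absolute deadline is unaffected by changes to the system time.

```cpp
#include <cassert>
#include <cerrno>
#include <pthread.h>
#include <time.h>

// Waits up to timeoutMs on a condition variable that uses CLOCK_MONOTONIC.
// Returns the pthread_cond_timedwait result (ETIMEDOUT if nobody signals).
int TimedWaitMonotonic(long timeoutMs) {
  assert(timeoutMs >= 0);
  pthread_condattr_t attr;
  pthread_cond_t cond;
  pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

  pthread_condattr_init(&attr);
  pthread_condattr_setclock(&attr, CLOCK_MONOTONIC);
  pthread_cond_init(&cond, &attr);
  pthread_condattr_destroy(&attr);

  // The deadline must be computed from the same clock the condvar uses.
  struct timespec deadline;
  clock_gettime(CLOCK_MONOTONIC, &deadline);
  deadline.tv_sec += timeoutMs / 1000;
  deadline.tv_nsec += (timeoutMs % 1000) * 1000000L;
  if (deadline.tv_nsec >= 1000000000L) {
    deadline.tv_sec += 1;
    deadline.tv_nsec -= 1000000000L;
  }

  pthread_mutex_lock(&mutex);
  int rc = 0;
  while (rc == 0) {  // loop so a spurious wakeup does not end the wait early
    rc = pthread_cond_timedwait(&cond, &mutex, &deadline);
  }
  pthread_mutex_unlock(&mutex);

  pthread_cond_destroy(&cond);
  pthread_mutex_destroy(&mutex);
  return rc;
}
```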

Dave Hylands [:dhylands]

Comment 11

•

13 years ago

Sadly, it appears that bionic may not have pthread_condattr_setclock

Shian-Yow Wu [:swu]

Comment 12

•

13 years ago

(In reply to Juan Gomez [:_AtilA_] from comment #9)
> Finally, I know what's happening in Gecko (but not why it's happening).
> Roughly speaking, TimerThread is used in Gecko to process timing-like events
> (nsTimerImpl instances like nsRefreshDriver, for example). These timing
> events are "dispatched" when their timeouts are reached. The mechanism that
> waits for the timeout is inside the TimerThread::Run() method. It uses monitors
> (MonitorAutoUnlock) to wait for the timeout to be reached:
> <code>
> ...
> mWaiting = true;
> mMonitor.Wait(waitFor);
> mWaiting = false;
> ...
> </code>
> The waitFor variable is calculated from the timeout of the timer event
> being processed and the current time (TimeStamp::Now()).
> The problem is that when things go wrong, the results of these timing
> operations get out of control, with very high values [1] that make
> mMonitor.Wait() wait "forever", preventing other timer events from being
> processed.
> This explains the strange behavior when updating Gaia, because nsRefreshDriver
> (which is responsible for invalidating the screen and triggering repaints) is
> never processed again unless some other event unblocks the thread monitor.
>
> As I said, I'm now investigating why these values are getting corrupted. From
> what I read in the code, the timing stuff is platform dependent, which could
> explain why this is happening only on some SGS2 devices.
>
> Sadly I don't have much time right now, but I hope to get this issue
> fixed/located soon.
>
> [1] Logcat showing this behavior: http://pastebin.mozilla.org/1668995

Vincent Chang once encountered a similar endless-loop issue in the timer thread; it happened only on the Samsung SGS2 I-9100G, and the issue disappeared after removing optimization (-Os) on TimerThread.cpp.
With optimization removed, the compiler generates the floating-point operations differently for https://mxr.mozilla.org/mozilla-central/source/xpcom/threads/TimerThread.cpp#328
Not sure if this information helps, but you may try it.

Cervantes Yu [:cyu] [:cervantes]

Comment 13

•

13 years ago

bionic doesn't have pthread_condattr_setclock(), but it has pthread_cond_timedwait_monotonic_np(), which uses CLOCK_MONOTONIC. Though pt_TimedWait() is using gettimeofday() and may contribute to the problem, the bigger problem is still why the calculation in TimerThread::Run() produced such strange results. Given how halfMicrosecondsIntervalResolution is calculated, its value should be constant on Android. Why we get such a strange, huge value is still a mystery.

Cervantes Yu [:cyu] [:cervantes]

Comment 14

•

13 years ago

After further debugging today, the problem turns out to be in floating-point operations. I added some debugging messages and got results similar to the logcat in comment #9. TimeDuration::ToMilliseconds() produces an astronomical number, multiplying it with 1000.0 gets inf (infinity), and PR_MicrosecondsToInterval() gets a large (and always the same) number, 4294967. Alternatively, if I call TimeDuration::ToSeconds(), which only uses floating-point division, the result in seconds is correct. But multiplying it with 1000.0 generates an astronomical number again. I can work around this in TimerThread.cpp by replacing *1000.0 with /0.001 to get sane results, but there are still other quirks. Replacing vfp with neon doesn't resolve the issue. Really strange.

Dave Hylands [:dhylands]

Comment 15

•

13 years ago

(In reply to Cervantes Yu from comment #14) > After further debugging today, the problem turns out to be in floating point > operation. I added some debugging messages and got similar results as in > logcat in comment #9. TimeDuration::ToMilliseconds() produces an > astronomical number, multiplying it with 1000.0 gets inf (infinity), and > PR_MicrosecondsToInterval() gets a large (and always the same) number, > 4294967. 0xFFFFFFFF = 4294967295, and 4294967295 / 1000 = 4294967
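The arithmetic in comment 15 can be checked with a few lines. This is only a worked example: SaturatedInterval and IntervalToMinutes are hypothetical helpers, and the conversion to minutes assumes that one interval tick is one millisecond, which is an assumption consistent with the division by 1000 above but not stated in the thread.

```cpp
#include <cstdint>

// Largest value that fits in an unsigned 32-bit integer.
constexpr uint32_t kMaxUint32 = 0xFFFFFFFFu;  // 4294967295

// Microseconds saturated at 0xFFFFFFFF, divided by 1000 (integer division).
constexpr uint32_t SaturatedInterval() { return kMaxUint32 / 1000u; }

// ASSUMPTION: one interval tick equals one millisecond.
constexpr double IntervalToMinutes(uint32_t intervalMs) {
  return intervalMs / 1000.0 / 60.0;
}

static_assert(SaturatedInterval() == 4294967u, "matches the value seen in the logs");
```

Under that assumption, a wait of 4294967 ticks is a bit more than 71 minutes, which from the user's point of view is indistinguishable from waiting "forever".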

Cervantes Yu [:cyu] [:cervantes]

Comment 16

•

13 years ago

OK, I was wrong. The quirk is not the result of using neon. It may be the result of using the latest gecko. If I use an older version with the matching gaia, it works smoothly. So the problem really is that we are using vfp on exynos 4210, which only supports neon. vfp supports 64bit floating point operations, but neon only supports 32bit ones. It's really interesting that we didn't crash on galaxy s2. I have now added "ac_add_options --with-fpu=neon" to gonk-misc/default-gecko-config and so far it works well. To prevent similar problems in the future, we need the build system to pass different gcc flags to gecko depending on the device config.

Shian-Yow Wu [:swu]

Comment 17

•

13 years ago

(In reply to Cervantes Yu from comment #16) > OK, I was flawed. The quirk is not the result of using neon. It may be the > result of using the latest gecko. If I use an older version with the > matching gaia, it works smoothly. > > So the problem really is we are using vfp on exynos 4210, which only > supports neon. vfp supports 64bit floating point operations, but neon only > supports 32bit ones. It's really interesting we didn't crash on galaxy s2. > > I now add "ac_add_options --with-fpu=neon" to gonk-misc/default-gecko-config > and it works so far so good. To prevent similar problems in the future, we > need the build system to pass on different gcc flags to gecko depending on > the device config. We should get this information from gonk device configuration, maybe from BoardConfig.mk. The SGS2 has related definition, for example https://github.com/mozilla-b2g/android-device-galaxys2/blob/master/BoardConfig.mk#L35 Michael, do you have comments?

Michael Wu [:mwu]

Comment 18

•

13 years ago

You can check ARCH_ARM_HAVE_NEON. See https://github.com/mozilla-b2g/platform_build/blob/master/core/combo/arch/arm/armv7-a-neon.mk#L13

Shian-Yow Wu [:swu]

Comment 19

•

13 years ago

Attached patch Add debug message to check lag issue — Details — Splinter Review

It was reported that the issue still occurred after adding "--with-fpu=neon" in gonk-misc/default-gecko-config (see https://github.com/mozilla-b2g/gaia/issues/1550). Attached is a patch from Cervantes that prints debug messages to check the lag issue. Unfortunately I'm not able to reproduce the issue with this patch.

Zibi Braniecki [:zbraniecki][:gandalf]

Comment 20

•

13 years ago

Adding myself to the list of reproducers. I'm using my custom builds (built on Ubuntu 10.04 x64) and they degrade extremely fast. I'm usually able to launch one or two apps before the phone becomes unusable until rebooted.

Two observations:
- it is a regression. I can't put a timestamp on it, but my first builds from the time when b2gbuilds.org was new (I remember I compiled B2G and a few days later b2gbuilds was launched) worked very well
- the lock screen works well even when nothing on the main screen does

I can record a video of how it looks on my phone if you want, or compile with any patch, since I have a full env set up.

Cervantes Yu [:cyu] [:cervantes]

Comment 21

•

13 years ago

Zbigniew Braniecki, I found a device at hand on which this symptom is very reproducible. If the UI doesn't get stuck, then repeatedly enabling/disabling wifi in Settings will do it.

It was reported that adding "ac_add_options --with-fpu=neon" doesn't fix the issue (gaia issue 1550). After disassembling the object code, I found that asking gcc to use neon or vfp produces the same machine code for the code in question. I made some tests and the results seem to show that the CPU supports VFP. The double-precision floating point operations run slower (about 9 times) when using ac_add_options --with-soft-float=yes.

The code in question is:
double microseconds = (timeout - now).ToMilliseconds()*1000;
in TimerThread::Run(). The compiler inlines TimeDuration::ToMilliseconds() and on SGS2 we can get incorrect results. But if I call TimeDuration::ToSeconds() the result is always correct. If I change it to:
- double microseconds = (timeout - now).ToMilliseconds()*1000;
+ volatile double seconds = (timeout - now).ToSeconds();
+ volatile double microseconds = seconds * 1000 * 1000;
the symptom disappears. Maybe there are some quirks in this CPU or in the way the compiler optimizes and generates code, but I don't know how to prove it.
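For readers who want to see the shape of the workaround outside Gecko, here is a small self-contained sketch. MicrosecondsUntil is a hypothetical function, and its secondsUntilTimeout parameter stands in for (timeout - now).ToSeconds(); nothing here claims to explain why the workaround helps on the affected hardware.

```cpp
// Hypothetical stand-in for the changed lines in TimerThread::Run().
// secondsUntilTimeout plays the role of (timeout - now).ToSeconds().
double MicrosecondsUntil(double secondsUntilTimeout) {
  // volatile forces each value to be stored to and reloaded from memory,
  // so the compiler cannot keep the operands cached in FP registers.
  volatile double seconds = secondsUntilTimeout;
  volatile double microseconds = seconds * 1000 * 1000;
  return microseconds;
}
```

On a correctly working FPU this returns the same value as the original one-line expression; the only difference is the code the compiler is allowed to generate.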

Guillermo López :willyaranda (probably SLOW response)

Comment 22

•

13 years ago

Maybe it could be useful to land that patch (after sending to try) and see if it fixes the problem for everyone.

Juan Gomez [:_AtilA_] (CET/CEST)

Reporter

Comment 23

•

13 years ago

OK, I split the line:

double microseconds = (timeout - now).ToMilliseconds()*1000;

into its basic operations:

1. TimeDuration td = timeout - now;
2. double dSecs = td.ToSeconds();
3. double dMult1 = dSecs * 1000.00f;
4. double microseconds = rest2 * 1000.0;

And the problem is in the third (3) line. That's where we get incorrect values like:
-343037381057174407826757625345329232073105542126671635252048583010639980129525335439392917359819772278033357329198808479082817418668860875001002880002976207935118380540769211158250426172908544466139226876867146242319608664922430685367618032391531107948156121016623431680.000000

I tried to build a simple console program to reproduce this error, but I couldn't. I don't know how to find out the flags/options passed to the toolchain when TimerThread.cpp is being compiled/linked. If someone (mwu? ;)) could tell us what they are, maybe I could reproduce this behavior in my test program and make it simpler to look for a solution.

Juan Gomez [:_AtilA_] (CET/CEST)

Reporter

Comment 24

•

13 years ago

(just a little bug in those 4 lines: #define rest2 dMult1 :P)

(In reply to Juan Gomez [:_AtilA_] from comment #23)
> Ok, I splitted the line:
>
> double microseconds = (timeout - now).ToMilliseconds()*1000;
>
> into it's basics operations:
>
> 1. TimeDuration td = timeout - now;
> 2. double dSecs = td.ToSeconds();
> 3. double dMult1 = dSecs * 1000.00f;
> 4. double microseconds = rest2 * 1000.0;
>
> And the problem is in the third (3) line.
> That's where we get incorrect values like:
> -343037381057174407826757625345329232073105542126671635252048583010639980129525335439392917359819772278033357329198808479082817418668860875001002880002976207935118380540769211158250426172908544466139226876867146242319608664922430685367618032391531107948156121016623431680.000000
> I tried to build a simple console like program to reproduces this error, but
> I couldn't. I don't know how to find out the flags/options passed to the
> toolchain when TimerThread.cpp is compiling/linking.
> If someone (mwu? ;)) could tell us what they are, maybe I could reproduce
> this behavior in my test program and make it simple to look for a solution.

Shian-Yow Wu [:swu]

Comment 25

•

13 years ago

(In reply to Juan Gomez [:_AtilA_] from comment #23) > I tried to build a simple console like program to reproduces this error, but > I couldn't. I don't know how to find out the flags/options passed to the > toolchain when TimerThread.cpp is compiling/linking. You can add some syntax errors in TimerThread.cpp to force a compiling error. Then you can see the flags/options used by the toolchain when the building process stopped.

Juan Gomez [:_AtilA_] (CET/CEST)

Reporter

Comment 26

•

13 years ago

While trying to reproduce the error in an external test program, I can confirm that declaring the variables as volatile makes Gecko work fine. So, as Cervantes Yu said, I think that we have some kind of compiler optimization problem too. These are the compiler flags/options for TimerThread.cpp (I removed the -I and -i flags for brevity):

-o TimerThread.o -c -fvisibility=hidden -DMOZILLA_INTERNAL_API -D_IMPL_NS_COM -DEXPORT_XPT_API -DEXPORT_XPTC_API -D_IMPL_NS_GFX -D_IMPL_NS_WIDGET -DIMPL_XREAPI -DIMPL_NS_NET -DIMPL_THEBES -DSTATIC_EXPORTABLE_JS_API -D_IMPL_NS_COM -fPIC -DANDROID -pedantic -Wall -Wpointer-arith -Woverloaded-virtual -Werror=return-type -Wtype-limits -Wempty-body -Wno-ctor-dtor-privacy -Wno-overlength-strings -Wno-invalid-offsetof -Wno-variadic-macros -Wno-long-long -mandroid -fno-short-enums -fno-exceptions -DMOZ_ENABLE_JS_DUMP -march=armv7-a -mthumb -mfpu=neon -mfloat-abi=softfp -fno-exceptions -fno-strict-aliasing -fno-rtti -ffunction-sections -fdata-sections -fno-exceptions -std=gnu++0x -pipe -DNDEBUG -DTRIMMED -g -Os -freorder-blocks -fno-reorder-functions -fomit-frame-pointer -DANDROID -DMOZILLA_CLIENT -MD -MF .deps/TimerThread.o.pp /home/jgomez/b2g/build/gecko/xpcom/threads/TimerThread.cpp

Cervantes Yu [:cyu] [:cervantes]

Comment 27

•

13 years ago

Here is the disassembled code for SGS2 and otoro. The two look similar except for one additional fsitod instruction that converts a signed integer to a double. They both invoke TimeDuration::ToSeconds() and then perform 2 double multiplications.

On SGS2:
13c: f7ff fffe bl 0 <_ZNK7mozilla12TimeDuration9ToSecondsEv>
140: ec41 0b30 vmov d16, r0, r1
144: ee60 0b88 fmuld d16, d16, d8
148: ee20 7b88 fmuld d7, d16, d8

On otoro:
13c: f7ff fffe bl 0 <_ZNK7mozilla12TimeDuration9ToSecondsEv>
140: eeb8 6bc9 fsitod d6, s18
144: ec41 0b17 vmov d7, r0, r1
148: ee27 7b08 fmuld d7, d7, d8
14c: ee27 7b08 fmuld d7, d7, d8

Cervantes Yu [:cyu] [:cervantes]

Comment 28

•

13 years ago

More tests: I took TimerThread.o from the otoro build, built with it and flashed the result to SGS2. Here is the assembly code excerpt that causes the problem:

14e: f7ff fffe bl 0 <_ZNK7mozilla12TimeDuration9ToSecondsEv>
152: ec41 0b17 vmov d7, r0, r1
156: ee27 7b09 fmuld d7, d7, d9
15a: ee27 8b09 fmuld d8, d7, d9

SGS2 still exhibits this problem, while otoro runs well (or the problem just hasn't shown up?). Since this code also runs in fennec, we need to confirm whether this bug affects fennec.

Cervantes Yu [:cyu] [:cervantes]

Comment 29

•

13 years ago

Unfortunately, this problem can also be reproduced in fennec on SGS2. I built fennec with the log messages in TimerThread::Run() to observe the problem. I still see incorrect double arithmetic results, like -0.00233 * 1000000 => -0.0. The timers are fired too soon and fennec consumes some CPU time even when it is idle. Even if I work around the problem in TimerThread::Run() using the volatile trick in comment #21, I still have CPU spinning related to timers in nsIdleService. The timer is repeatedly fired and then registered again, and eats up almost all CPU time. Maybe this problem was hidden behind the one in TimerThread and was unearthed by the workaround above. Fennec only works correctly with "ac_add_options --with-soft-float=yes" added to mozconfig, which forces the compiler to generate software-based float/double operations. So I think the best solution for B2G on SGS2 is to use soft-float. I have no idea about the solution for fennec.
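For reference, -0.00233 * 1000000 should come out at roughly -2330, not -0.0. A small standalone check along these lines could be used to probe a device; ProductLooksSane is a hypothetical helper, and note that comment 23 reports that a simple console program did not reproduce the fault, so a passing result does not prove the hardware is fine.

```cpp
#include <cassert>
#include <cmath>

// Multiplies a by b at run time and compares the result with the expected
// value. The volatile operands keep the compiler from folding the product
// into a constant at build time.
bool ProductLooksSane(double a, double b, double expected, double relTol = 1e-9) {
  assert(relTol >= 0.0);
  volatile double x = a;
  volatile double y = b;
  double product = x * y;
  return std::isfinite(product) &&
         std::fabs(product - expected) <= relTol * std::fabs(expected);
}
```

A caller would use something like ProductLooksSane(-0.00233, 1000000.0, -2330.0) and log a warning when it returns false.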

Zibi Braniecki [:zbraniecki][:gandalf]

Comment 30

•

13 years ago

should comment 29 elevate the priority of this bug as it now affects our stable product on a popular mobile device?

Cervantes Yu [:cyu] [:cervantes]

Comment 31

•

13 years ago

I think it might be better to open a separate bug for fennec. We might need separate solutions for fennec and for b2g. For b2g, we could use software-based float operations on SGS2 to work around the problem. It will make all float operations slower, both single-precision and double-precision ones. The performance impact is yet to be investigated, but at least the impact is limited to SGS2. For fennec, there are no separate APKs for different devices. We cannot sacrifice performance on all devices because of a bug (possibly in hardware) that is on one specific device only.

Juan Gomez [:_AtilA_] (CET/CEST)

Reporter

Comment 32

•

13 years ago

There's nothing wrong with the code in comment #27. After some debugging I realized that the problem is in the value of the third operand register. This register is loaded with a constant value when things are working as expected:

vldr d11, [pc, #388] ; 1e8 <_ZN11TimerThread3RunEv+0x1e8>

Disassembly of section .text._ZN11TimerThread3RunEv:
...
1e8: 00000000 .word 0x00000000
1ec: 408f4000 .word 0x408f4000

The point is that the value of the register is changed at some point while running and never gets the right value again. I'm still fighting with gdb to see why it never gets loaded with the right value again, but as I cannot make conditional watchpoints on registers work properly in gdb, it's taking me a lot of time to debug.

On the other hand, by using volatile variables we are telling the compiler not to "optimize" these variables. Examining the generated code, we can see that the values used to fill the double-precision registers are loaded from memory (in the stack) every time a fmuld operation is about to be executed (as expected, because there's no such optimization):

158: e9d3 2300 ldrd r2, r3, [r3]
15c: e9cd 0110 strd r0, r1, [sp, #64]
160: e9cd 230e strd r2, r3, [sp, #56]
164: ed9d 6b10 vldr d6, [sp, #64]
168: ed9d 7b0e vldr d7, [sp, #56]
16c: ee26 7b07 fmuld d7, d6, d7

This should work fine because we are loading the operand registers every time, so I think that what Cervantes said in comment #29 about eating CPU must be another issue, not even related to this one.

I'm trying to compile Gecko with another toolchain to see what happens, but in the meantime it would be interesting to see the dump of a TimerThread.o compiled with the old toolchain in the old build system (the one used in GINGERBREAD), because in GB this problem didn't exist.

And of course, I totally disagree with compiling with software-based floating point as a solution :P , we will find another one!
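The two literal-pool words quoted in comment 32 can be decoded to confirm what the register is supposed to hold. This is a small illustration, assuming IEEE 754 doubles and that the word at the lower address (1e8) is the low half of the value; DoubleFromWords is a hypothetical helper.

```cpp
#include <cstdint>
#include <cstring>

// Reassembles a 64-bit IEEE 754 double from two 32-bit words.
// lowWord is the word at the lower address (1e8), highWord the one at 1ec.
double DoubleFromWords(uint32_t lowWord, uint32_t highWord) {
  uint64_t bits = (static_cast<uint64_t>(highWord) << 32) | lowWord;
  double value;
  static_assert(sizeof(value) == sizeof(bits), "expects a 64-bit double");
  std::memcpy(&value, &bits, sizeof(value));  // bit-exact copy, no conversion
  return value;
}
```

DoubleFromWords(0x00000000, 0x408f4000) yields 1000.0, i.e. the multiplier used twice to go from seconds to microseconds. If the register loaded from that constant is later clobbered, both multiplications use the wrong factor.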

Dave Hylands [:dhylands]

Comment 33

•

13 years ago

So we need to check the ABI calling conventions and see exactly which registers are saved/restored, and when. It's also conceivable that something like an IRQ is touching the register (although the kernel doesn't support floating point operations, so this is probably far-fetched). And if you consider stack corruption, anybody could be doing it: the code could save the register, have the stack corrupted, and then restore the corrupted value.

Cervantes Yu [:cyu] [:cervantes]

Comment 34

•

13 years ago

(In reply to Juan Gomez [:_AtilA_] from comment #32)
> This should works fine cause we are loading the operand registers everytime,
> so I think that what Cervantes said in comment #29 about eating CPU must be
> another issue even not related with this one.

I think it has the same root cause, incorrect double arithmetic, because after switching to software floating point, fennec works perfectly on SGS2 ICS firmware.

> I'm trying to compile Gecko with another toolchain and see what happens but
> in the meantime it would be interesting to see the dump of a TimerThread.o
> compiled with the old toolchain in the old build system (the one used in
> GINGERBREAD), cause in GB this problem didn't exists.

Interestingly, I put the same fennec on SGS2 gingerbread firmware, and the problem didn't show up. It is the same binary built with the same toolchain: on stock ICS fw we have incorrect double arithmetic, but not on stock gingerbread fw! It doesn't look like the problem is in the toolchain.

Juan Gomez [:_AtilA_] (CET/CEST)

Reporter

Comment 35

•

13 years ago

I'm trying to compile the whole b2g project with an updated toolchain from NDKr8b: gcc-4.6. Looking at the changes in this new release I saw something very interesting:

* Several enhancements were committed to improve SIMD code generation for NEON by adding support for widening instructions, _misaligned loads and stores_, vector conditionals and support for 64 bit arithmetic.
* GCC for AAPCS configurations now more closely adheres to the AAPCS specification by enabling -fstrict-volatile-bitfields by default.

I just checked that with this new toolchain the code generated for TimerThread.o is slightly different, so I'm very optimistic about closing this bug by migrating the toolchain :) I'm facing some problems with the gecko binary generated by the toolchain, because it throws a SIGILL every time it is launched ( http://pastebin.mozilla.org/1732519 ). I hope to fix it quickly and post my findings.

Juan Gomez [:_AtilA_] (CET/CEST)

Reporter

Comment 36

•

13 years ago

I finally managed to compile the whole project with the gcc-4.6 toolchain (from NDKr8b) but unfortunately, the bug still remains. I even tried to compile with VFPV3-D32 and VFPV3-D16, but nothing worked. I'll keep investigating.

Zibi Braniecki [:zbraniecki][:gandalf]

Comment 37

•

13 years ago

Yup, I can still reproduce this on today's nightly on SGS2.

Juan Gomez [:_AtilA_] (CET/CEST)

Reporter

Updated

•

13 years ago

Depends on: 787564

Juan Gomez [:_AtilA_] (CET/CEST)

Reporter

Updated

•

13 years ago

No longer depends on: 787564

Juan Gomez [:_AtilA_] (CET/CEST)

Reporter

Comment 38

•

13 years ago

Just for the record, I didn't say anything before, but if someone wants to overcome this problem temporarily while we are trying to figure out how to solve it, we can tackle it by making this simple change in xpcom/threads/TimerThread.cpp (line 331):

...
volatile int iOneThousand = 1000;
double microseconds = (timeout - now).ToMilliseconds() * iOneThousand;
...

I don't think that a patch for this is needed (right?) :)

Bad news: I found similar problems in widget/xpwidgets/nsIdleService.cpp, but their consequences only affect the time to wait for the screen to go black (turn off), so not a big deal.

Zibi Braniecki [:zbraniecki][:gandalf]

Comment 39

•

13 years ago

would it make sense to push this temporary fix and mark it as a temporary? That would enable anyone with SGS2 to work with FxOS

Juan Gomez [:_AtilA_] (CET/CEST)

Reporter

Comment 40

•

13 years ago

OK, good news. I have noticed that the latest Gecko versions DO NOT have the problem described here. So I started investigating why, and after looking at the repo history I think I found the change responsible for this magical bug fix. About a month ago, there was a commit named "Bug 579517 - Part 1: Automated conversion of NSPR numeric types to stdint types in Gecko" that affects TimeStamp.h. I checked out the commit just before this one, compiled and executed, and the problem manifested; then I checked out this commit, compiled and executed, and the problem didn't show up. As I messed up my local Gecko repository a little, it would be great if someone else could do the same checks I did and verify that I'm right.

The commits are:
Problem fixed here: 4b4c9daf21a48790aa09b647b6600f1c2aaf0f14 -> Bug 579517 - Part 1: Automated conversion of NSPR numeric types to stdint types in Gecko.
Problem exists here (one commit behind): 11bbccd1d2bcdde05eb0cab1fcc791e19edb0203 -> Bug 781289 - Remove unecessary check that let variable name matches are atoms. r=luke

Michal Purzynski [:michal`] (use NEEDINFO)

Comment 41

•

13 years ago

Hey, I can reproduce this problem on the SGS2 with a build from yesterday. I have a full dev env ready and screaming for yet another build, so I can test anything you want ;) Should I build B2G against a custom version of Gecko (which one, and how can I do it)? Or just try to apply that fix to our current Gecko version?

:_AtilA_ - what have you done to test it?
:gandalf - could you test it too, so we would get more coverage on that?

I agree that even if we don't currently support SGS2 as a Tier 1 device (it's a Tier 3, isn't it?), we should definitely have it fixed. SGS2 is a popular device, especially in the community, for running B2G test builds.

Zibi Braniecki [:zbraniecki][:gandalf]

Comment 42

•

13 years ago

I downloaded the build from the 25th and am using it right now. Definitely something has changed. I can't reproduce the pane scrolling bug where, after a minute of using the phone, scrolling the home screens would become broken. But I still see these issues:
- the app launch and app close animations sometimes stop midway. The app is half zoomed in and I have to tap the screen a few times before it finishes
- there are visual glitches on the screen
- the keyboard does not open in the browser

I guess those are bugs that come from the fact that it's tier 3, so no big deal, except for the first one, which reminds me of this bug a lot. It seems like the animation is being broken halfway, and that was exactly the problem we saw here. So maybe the two bugs (home screen scrolling and app open/close animation) are in fact two separate issues, and what has been fixed only fixes the former one?

Michal Purzynski [:michal`] (use NEEDINFO)

Comment 43

•

13 years ago

I've tried the build from the 24th and made changes like those in comment 23 (actually I've put everything in separate variables, made them volatile and used the ToSeconds function). Without it I had exactly the same issues as :gandalf. After patching, animations are smooth, nothing stops halfway, events are delivered and the screen is refreshed.

What's left:
1. random visual glitches on the screen (some persistent, like in the phonebook, some of them disappear after touching the screen or just waiting some time)
2. https does not work at all (protocol is not recognized)
3. camera does not work (not an issue for me, just a note)
4. marketplace application does not even start (...try reloading button)

Obviously issues 2-4 aren't in the scope of this bug and I can open a separate one if you want. Still, without the hack of making variables volatile we have some problems with events not being delivered, and that does look like a problem with timers. I guess (only guess) that the visual glitches are a separate issue. I can make a quick video if you want to see what we're talking about.

Dave Hylands [:dhylands]

Comment 44

•

13 years ago

The real issue here seems to be the register being changed (comment 32). If the .text constants aren't being overwritten, then this sounds like an interrupt handler is modifying a register which it isn't supposed to. Having the problem "go away" when you do X Y or Z so far just looks to me like sidestepping the real issue.

Juan Gomez [:_AtilA_] (CET/CEST)

Reporter

Updated

•

11 years ago

Assignee: atilag → nobody

BMO Automation

Comment 45

•

7 years ago

Firefox OS is not being worked on

Status: NEW → RESOLVED

Closed: 7 years ago

Resolution: --- → WONTFIX