Closed Bug 597093 Opened 14 years ago Closed 14 years ago

Frontend ilooping on Samsung i9000s, apparently caused by methodjit (bad jump?)

Categories

(Core :: JavaScript Engine, defect)

ARM
Android
defect
Not set
critical

RESOLVED WONTFIX
Tracking Status
fennec 2.0+ ---

People

(Reporter: cjones, Unassigned)

Attachments

(3 files)

Once in a while when I start fennec, the frontend will partially load the first-run page or home screen and hang.  I've also seen the frontend lock up during normal browsing.  Vlad saw the frontend un-wedge and move on one time.

I only started seeing this on builds post-JM merge, but many other changes happened around that time too.

A backtrace from a hung frontend would be useful, but I can't get one right now on my device.
tracking-fennec: --- → ?
tracking-fennec: ? → 2.0b1+
Just saw this hang during the "fennec is loading" spinner during startup.
Seems to be happening more frequently on my device with the latest nightly.  No idea why.
Severity: normal → critical
I'm seeing this pretty consistently on my Epic with the last two nightly builds.
I noticed that if I kill all other running programs on my device, the hangs seem to go away.  I wonder if there's some resource we're trying to allocate that's scarce when a lot of programs are running, and blocks indefinitely.
This is 100% reproducible for me right now.  Will look into getting a stack trace.
Appears to be thumb related. The crashes on startup on sdwilsh's Galaxy S stop when running a non-thumb build.
Also with a build from http://hg.mozilla.org/mozilla-central/rev/901fd772c4da, I can't reproduce the hang during first-run or start page.  I did see intermittent crashes on startup and a hang during the "fennec is loading" spinner (before the frontend is up) which was apparently caused by there being another fennec instance still alive.  Will file the latter.
Should we start considering shipping with thumb off for b1?
Attached file All thread backtrace 1
This is a black-screen-on-startup hang, not hang-on-start-page.  The backtraces look pretty sane to me, except for thread 2

Thread 2 (Thread 3966):
#0  0xfffefb4c in icudt38_dat () from /home/cjones/android/gdb/lib/libm.so
#1  0xffff0006 in icudt38_dat () from /home/cjones/android/gdb/lib/libm.so
#2  0xffff0006 in icudt38_dat () from /home/cjones/android/gdb/lib/libm.so
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

I don't see a thread that's obviously the XPCOM-main thread.  Not sure how that's set up on android.
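The "previous frame identical to this frame" pattern above can be picked out of an all-thread backtrace mechanically. What follows is a minimal sketch, not a tool used in this bug: `suspicious_threads` is a hypothetical helper, and it assumes the `thread apply all bt` text layout shown in this comment, where each frame line carries an explicit `0x...` address.

```python
import re
from collections import defaultdict

# Matches headers such as "Thread 2 (Thread 3966):"
THREAD_RE = re.compile(r"^Thread (\d+) \(Thread \d+\):")
# Matches frames such as "#1  0xffff0006 in icudt38_dat () from ..."
# Frames that gdb prints without an address are ignored by this sketch.
FRAME_RE = re.compile(r"^#\d+\s+(0x[0-9a-fA-F]+) in")


def suspicious_threads(backtrace_text):
    """Return gdb thread numbers whose stack repeats the same pc in adjacent frames."""
    frames = defaultdict(list)
    current = None
    for raw in backtrace_text.splitlines():
        line = raw.strip()
        header = THREAD_RE.match(line)
        if header:
            current = int(header.group(1))
            continue
        frame = FRAME_RE.match(line)
        if frame and current is not None:
            frames[current].append(int(frame.group(1), 16))
    flagged = []
    for thread, pcs in frames.items():
        # Two adjacent frames with the same pc is what makes gdb give up unwinding.
        if any(a == b for a, b in zip(pcs, pcs[1:])):
            flagged.append(thread)
    return sorted(flagged)
```

Fed the Thread 2 excerpt above, this would flag thread 2. A flagged thread only means the unwinder got lost, which is consistent with a jump into unmapped or non-code memory but does not prove it.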
Pretty similar to the first, with a suspicious thread 2 again

Thread 2 (Thread 4051):
#0  0xc03749b0 in ?? ()
#1  0xc03749b0 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
If thread 2 is supposed to be XPCOM-main, then I wonder if this is some bad thumb code jumping out into the weeds somewhere.  Will try --disable-thumb.
With thumb2 disabled, I get the same backtrace as comment 9, essentially.

Thread 2 (Thread 4140):
#0  0xfffefb4c in icudt38_dat () from /home/cjones/android/gdb/lib/libm.so
#1  0xffff0006 in icudt38_dat () from /home/cjones/android/gdb/lib/libm.so
#2  0xffff0006 in icudt38_dat () from /home/cjones/android/gdb/lib/libm.so
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Worth pointing out that GeckoApp still responds to orientation changes, just gecko itself does not.  Also if I continue for a while then break again, I get about the same backtrace, with Thread 2's stack identical.

Going to try disabling JITs, why not.
With JM+TM off, no hangs.  With TM on, no hangs.  Couldn't test JM on its own because it doesn't compile with TM disabled.  Looks like JM is the culprit here.  Will update to after the most recent SM merge, and if the hang persists, we probably need to turn off JM again.

The good news is that the hang appears to be 100% reproducible in gdb, though not outside of gdb.
Sigh.  Still seeing the same iloop in gdb with m-c 8528ce3f97ce, goes away with JM disabled.
Assignee: nobody → jones.chris.g
Attachment #477089 - Flags: review?
Assignee: jones.chris.g → general
Component: General → JavaScript Engine
Product: Fennec → Core
QA Contact: general → general
Summary: Frontend hangs seemingly randomly (on Galaxy S devices?) → Frontend ilooping on Samsung i9000s, apparently caused by methodjit (bad jump?)
So this happens even with chrome methodjit off?
(In reply to comment #14)
> So this happens even with chrome methodjit off?
Is there a way to turn that off at runtime?
(In reply to comment #15)
> (In reply to comment #14)
> > So this happens even with chrome methodjit off?
> Is there a way to turn that off at runtime?

pref("javascript.options.methodjit.chrome", false);

If you can't get to about:config, you can edit prefs.js directly on a rooted phone (in your profile in /data/data/org.mozilla.fennec/mozilla), or edit mobile/app/mobile.js in your srcdir if you are building yourself.
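For the prefs.js route, the edit could be scripted roughly as follows. This is a minimal sketch under two assumptions: that the profile's prefs.js stores the setting as a `user_pref("name", value);` line, and that Fennec is not running while the file is changed (a running instance may rewrite the file). `set_bool_pref` is a hypothetical helper, not part of any Mozilla tool, and it only transforms text; pulling the file off the device and pushing it back is left out.

```python
import re

PREF_NAME = "javascript.options.methodjit.chrome"


def set_bool_pref(prefs_text, name, value):
    """Return prefs.js text with user_pref(name, ...) set to the given boolean.

    An existing line for the pref is replaced; otherwise a new line is appended.
    """
    new_line = 'user_pref("%s", %s);' % (name, "true" if value else "false")
    pattern = re.compile(
        r'^user_pref\("%s",\s*[^)]*\);[ \t]*$' % re.escape(name),
        re.MULTILINE,
    )
    if pattern.search(prefs_text):
        # A callable replacement avoids any escape processing of new_line.
        return pattern.sub(lambda _match: new_line, prefs_text)
    if prefs_text and not prefs_text.endswith("\n"):
        prefs_text += "\n"
    return prefs_text + new_line + "\n"
```

Calling `set_bool_pref(text, PREF_NAME, False)` on the contents of prefs.js yields text with chrome methodjit switched off, which matches what the `pref(...)` line above does for a self-built mobile.js.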
The "javascript.options.methodjit.chrome" preference.  Note that the default value is false (as in off), because it's undertested and had some known issues last someone checked (like not passing tests).  But fennec explicitly sets it to true (see bug 596076).  So I was wondering whether flipping the pref to its default value (as opposed to compiling the code out entirely) also fixes this bug.

Amusingly enough, bug 596076 claims to have been backed out as of Sept 14, but it was repushed on Sept 15 with no corresponding comments in the bug...
Depends on: 596076
Comment on attachment 477089 [details] [diff] [review]
Temporarily work around ilooping apparently caused by JM by disabling it

we can't turn JM off, need to get a fix.
Attachment #477089 - Flags: review? → review-
If you can repro 100%, can you bisect to find a range?  Also, does it happen with just the chrome JM turned off?  (Like bz, I'm not clear what the situation is with bug 596076 or what testing was done before turning it on for Fennec.)
(In reply to comment #19)
> If you can repro 100%, can you bisect to find a range?  Also, does it happen
> with just the chrome JM turned off?  (Like bz, I'm not clear what the situation
> is with bug 596076 or what testing was done before turning it on for Fennec.)
I've gone and flipped the pref on my install.  It's not 100% reproducible, but it happens pretty quickly usually (STR are not clear on my phone at least).

Finding a regression range would likely be painful, but it's probably doable.  So far so good though.  No crashes.
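The regression-range search being discussed is an ordinary binary search over an ordered list of builds. The sketch below is only an illustration of that idea: `first_bad` and the `is_bad` callback are hypothetical names, and it assumes the oldest build in the list is known good, the newest is known bad, and that every build after the first bad one is also bad. Because the hang is intermittent outside gdb, `is_bad` would in practice need to try each build several times before calling it good.

```python
def first_bad(builds, is_bad):
    """Return the index of the first bad build.

    Assumes builds[0] is good, builds[-1] is bad, and badness never
    reverts once it appears. is_bad(build) is called O(log n) times.
    """
    if len(builds) < 2:
        raise ValueError("need at least one known-good and one known-bad build")
    lo, hi = 0, len(builds) - 1  # invariant: builds[lo] is good, builds[hi] is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(builds[mid]):
            hi = mid
        else:
            lo = mid
    return hi
```

The regression range is then the pair `builds[first_bad(...) - 1]` and `builds[first_bad(...)]`. For a month of nightlies this takes about five installs, which is the main argument for bisecting over nightlies instead of individual changesets.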
(In reply to comment #14)
> So this happens even with chrome methodjit off?

This makes the iloop go away in my build in gdb.
Reverted the javascript.options.methodjit.chrome change on mobile-browser:
http://hg.mozilla.org/mobile-browser/rev/cba192dabb64

We will not enable this again in Fennec until the JS team is ready to turn it on by default.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
It'd sure help the JS team do that if you could find out more about this bug (like on which script it's ilooping, and when it started to happen)!
I'd like to help as I can, but
 - an hg bisect is likely to be very difficult given separate m-c/m-b/t-m repos and breaking changes between them and unknown intermediate states of methodjit for ARM on t-m
 - if this problem showed up pre-fennec+layers (if), a bisect over nightlies would be useful, but with 99.999% probability we're going to hit the jaeger-merge nightly
 - I'm remote, have no experience debugging methodjit, and have other blockers I need to work on.  I would need my hand held for maximum efficiency

A possibly fruitful approach that I attempted last week was to get jsshell working on device.  jsshell works, but doesn't have stdout/stderr when run in the android shell.  That's fixable.  There's also no python, so no test harnesses, but that's possibly fixable too.

(In reply to comment #23)
> It'd sure help the JS team do that if you could find out more about this bug
> (like on which script it's ilooping, and when it started to happen)!
Would DumpJSStack() work for that?
Does enabling JM for chrome block b1?  I would lean towards "no".
reopening, since we need a real fix for this, and we don't really track FIXED bugs. :-/
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Ping re: b1-blocker re-assessment for re-opening.
ilooping could have been bug 597871, if we're lucky.
This was set to block Fennec when the Chrome MethodJIT had been enabled.

Why is this blocking now? There are no steps to reproduce here, no debugger stacks attached, and no automated test failures (I can dream). Since we're not going to block on Random Phone X whenever someone happens to buy one, what is the configuration that needs to be tested? Is it tested anywhere?
Not commenting on blocking vs. not-blocking decision, but

 - there are reliable STR for fennec chrome here: open it in gdb with methodjit enabled on a Samsung i9000 device
 - there are less reliable STR for web content in bug 600304, for probably several different bugs
 - comment 9, comment 10, comment 11, bug 600304 comment 0, bug 600304 comment 2, bug 600304 comment 3, and bug 600304 comment 4 have C++ stacks.  The latter two have DumpJSStack() stacks.
 - "Random Phone X ..." that shit really isn't helpful, please don't.  Turns out a few Samsung i9000s are selling to even people other than me, and we'd like to support it.  For unknown reasons, we're seeing crashes on it that we aren't on other android/ARM devices.
 - There are instructions for running gdb on these devices.  See https://wiki.mozilla.org/Mobile/Fennec/Android#Debugging_on_Galaxy_S_devices.
 - We're not running automated js/src tests on this class of devices, or AFAIK any android/ARM devices.  That's a big problem.  This evening I've been working on getting js/src tests running on device.

I understand everyone's busy around freeze time.  But if what's blocking JS folks from working on these bugs is that there aren't enough Galaxy S's around the office, let's please fix that ASAP.
>  - "Random Phone X ..." that shit really isn't helpful, please don't.

It is actually super helpful. I have no idea what combination of ARM and Thumb instruction sets this phone supports. I have no idea if it's tested anywhere.

And since Fennec doesn't actually pass browser regression tests or deliver crashes, no one really has any idea whether the damn thing works
Wikipedia tells me it runs a "ARM Cortex A8 based CPU", which implies thumb2.
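The instruction-set question can also be checked on the device itself by reading /proc/cpuinfo from a shell. The following is a rough sketch, with assumptions stated plainly: the `Features` and `CPU architecture` field names are what ARM Linux kernels commonly report, the helper names are hypothetical, and the "architecture 7 or later implies Thumb-2" rule is a heuristic (it misses ARMv6T2 parts, for example).

```python
import re


def parse_cpuinfo(text):
    """Parse /proc/cpuinfo-style 'key : value' lines into a dict (last value wins)."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def likely_supports_thumb2(info):
    """Heuristic: treat a reported CPU architecture of 7 or later as Thumb-2 capable."""
    match = re.match(r"\d+", info.get("CPU architecture", ""))
    return bool(match) and int(match.group()) >= 7
```

The text to parse would come from the device, for instance by running `cat /proc/cpuinfo` in the Android shell; no output from an actual i9000 is reproduced here.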

"Random Phone X ... whenever someone happens to buy one" isn't helpful because the galaxy s is now the flagship for t-mobile, sprint, and verizon, and second to the iphone for at&t.  We want to support it.  It makes life harder that it's exhibiting bugs other android/ARM devices aren't.

The testing situation sucks, we're on the same page there.  Taking time away from my other work to help get it in better shape for js/src tests.

If a prerequisite for someone to work on these bugs is automated tests of galaxy s devices, that's fine but beyond my power.  I thought you were asking about STR and stacks, which we have, though the stacks aren't generally very helpful.  I hope soon to have js/src test failures.
Well, look. There are some things to like about Fennec. The process separation seems to work very well.

But beyond that, there isn't much to like. The thing doesn't redraw bitmaps correctly, and it looks like an n900 refugee. Besides, I can't imagine someone claiming we should seriously support a Firefox release on a new platform without bringing up testing first.

Anyway, I guess the politically correct thing to do here is give lots of updates until this issue appears to be fixed. We won't actually have any confidence that it is fixed, since there isn't any reporting or debugging information, but we can all be in it together.
Not sure who you're arguing with here about testing, certainly not me.

What's the bitmap-redrawing problem?
(In reply to comment #34)
> Not sure who you're arguing with here about testing, certainly not me.

OK, cool. I'll make sure this bug is updated with progress!

> What's the bitmap-redrawing problem?

When I resize or rotate, it never really works very well. The newest thing I see seems to be a fade from the vertical view drawn into the horizontal view. Last time, it was missing renderings. But off topic for this bug, I guess. Sorry for opening that up.
Yeah, theme and UX issues seem OT for a methodjit bug.  Re: resize/rotate, this is going to sound asshatty in context (sorry!), but have you filed a bug with STR, screenshots, test case ...?  Would be appreciated.
OK, there's a bunch of fixes now landed on mozilla-central. Is this bug still occuring?
(In reply to comment #37)
> OK, there's a bunch of fixes now landed on mozilla-central. Is this bug still
> occuring?

Ugh, occurring.
(In reply to comment #37)
> OK, there's a bunch of fixes now landed on mozilla-central. Is this bug still
> occuring?
When did it land (trying to establish if it made it in the nightlies or not)?
Looks like it made the nightlies.
(In reply to comment #40)
> Looks like it made the nightlies.
That's unfortunate since I appear to be hanging on startup now.  Hope this works better for cjones...
Some pointers to other bugs where we can move some of this discussion; feel free to file more bugs if your concerns aren't covered.

(In reply to comment #33)
> But beyond that, there isn't much to like. The thing doesn't redraw bitmaps
> correctly, and it looks like an n900 refugee.

Yes, we are currently using the Maemo theme because it already exists.  The design for the Android theme was just finalized, and there is now a WIP implementation targeted at Fennec 4.0b2 (bug 575403).

About redrawing on resize, some bugs in this area were recently fixed (bug 597230) and some we are still working on (bug 597580).

[Minefield has had some pretty bad UX and theme bustage on Linux in recent months, but that hasn't stopped most of us from working on Firefox bugs.]

> Besides, I can't imagine someone
> claiming we should seriously support a Firefox release on a new platform
> without bringing up testing first.

Fully agreed.  There are bugs filed and people working on the testing situation (e.g. bug 579563).  If anyone knows how to speed this up, please do.
(In reply to comment #41)
> (In reply to comment #40)
> > Looks like it made the nightlies.
> That's unfortunate since I appear to be hanging on startup now.  Hope this
> works better for cjones...

Hanging for me too, using a Samsung Epic on Sprint. 

I realize that's kind of not helpful, seeing as "hanging for me too" is missing any actual debugging info. 

Would it be more useful if I spent some time tonight getting gdb working on my device, or is that covered? Don't want to just be a "me too", but also don't want to get in the way
(In reply to comment #43)
> Would it be more useful if I spent some time tonight getting gdb working on my
> device, or is that covered? Don't want to just be a "me too", but also don't
> want to get in the way
I know cjones has gdb already set up, but it probably doesn't hurt to have more people with it.  I'm also not sure we can get symbols from nightly builds (I think we have to roll our own), which makes this more difficult.  Someone on the mobile team would know more.
Debugging instructions for Fennec are below. You will need your own build (and either a 2.2 Android phone or a rooted phone).

https://wiki.mozilla.org/Mobile/Fennec/Android#Debugging_with_GDB

None of us have really found good C++ stacktraces for hangs AFAIK, but as you can see above cjones did have some luck with JS stacks (see comment 30). Let us know if you get stuck (we hang out in #mobile on irc.mozilla.org).
Yes, I get a hang on startup now as well, on my N1.

Today, we'll be focusing on bug 600488. That will resolve numerous problems, I hope. These things are just going to keep breaking until we have tests.
(In reply to comment #46)
> Yes, I get a hang on startup now as well, on my N1.

This is now fixed in a nightly that was completed at 4:30pm.
(In reply to comment #47)
> (In reply to comment #46)
> > Yes, I get a hang on startup now as well, on my N1.
> 
> This is now fixed in a nightly that was completed at 4:30pm.

I mean the startup hang is fixed. That turned out to be a separate issue. This bug remains open.
tracking-fennec: 2.0b1+ → 2.0b2+
tracking-fennec: 2.0b2+ → 2.0+
This has been fixed by blacklisting these devices.  I am tempted to mark as wontfix.  sayre, objections?
Can't save Samsung from their crappy kernel, and blacklisting fixed our issue ages ago.
Status: REOPENED → RESOLVED
Closed: 14 years ago
Resolution: --- → WONTFIX