Closed Bug 662936 Opened 13 years ago Closed 13 years ago

fennec silently fails while running talos

Categories

(Firefox for Android Graveyard :: General, defect)

ARM
Android
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jmaher, Assigned: jchen)

References

Details

(Keywords: intermittent-failure, Whiteboard: [mobile_unittests][mobile_dev_needed][android_tier_1])

Crash Data

Attachments

(1 file)

while running talos, we have a series of easily reproducible failures.  They occur across a variety of tests such as tsvg, tdhtml, tpan, tzoom, and tp4m.  Sometimes I can get 4 or 5 runs in a row (locally from my desktop to my tegra) with no failures; other times a run fails partway through the test.

This is not a scenario where Fennec hangs and we time out; this is a scenario where the fennec process terminates and we end the test early and inconclusively because the data is incomplete.

I have run hundreds of test runs, and between watching logcat and instrumenting the test files, I have not seen anything out of the ordinary.
We need a mobile developer to take a look at this.
I have updated my set of android dumps:
http://people.mozilla.org/~jmaher/android_dumps/

These contain information which should be helpful in pinpointing this.

After further debugging today, I see that we add a 'pageshow' event listener in messagemanager and we never receive that event.  Watching this on a screen, I see fennec loading the test url and crashing during the pageload.  I see checkerboard on half of the screen and the webpage on the other half.

This error is by far the most frequent orange that we get for android talos.
this could be the same crash as bug 613954
Depends on: 613954
here are the steps to reproduce this:
1) checkout talos:
  hg clone http://hg.mozilla.org/build/talos talos
2) cd talos/mobile_profile/extensions
3) check out fennecmark (inside mobile_profile):
  hg clone http://hg.mozilla.org/users/tglek_mozilla.com/fennecmark bench@tara.glek
4) setup a webserver (I use apache) to point at the root of your talos checkout
5) get a tegra (or compatible android device) with a SUTAgent installed
6) from the talos directory run:
python remotePerfConfigurator.py -v -e org.mozilla.fennec --activeTests tpan --resultsServer ' ' --resultsLink ' ' --output tpan.yml --remoteDevice <ip of your sutagent> --webServer <ip of your host>

7) from the talos directory run:
python run_tests.py -d -n tpan.yml

8) repeat up to 5 times; I usually get it to fail within 3 or 4 runs.
What kind of failure occurs in step 8 - crash, timeout, OOM, or incorrect result?
when it fails, I see fennec crash due to OOM, and talos reports back 'unexpected output' as it never found the log file with the timing data that it expects.
Ok, thanks.

I don't have a tegra, but I can try to reproduce on linux desktop. Unless there is an obvious reason why that is a waste of time?
I can never get tpan or tzoom working on a linux desktop build of fennec (I could be building it wrong).  Any android device would be ideal.  We don't see this on Maemo, so I suspect the problem is in Android only.  That is just me speculating though.
I can try on an Android phone then.

Searching for info on SUT agent, I found this page

https://wiki.mozilla.org/Mobile/Fennec_Unittests/Remote_Testing

which says it is possible to use adb devicemanager instead of the SUT agent. Would that be simpler if I am using a phone connected by USB? And what would the STR be to run talos that way?
From irc discussion it looks like I need the SUT agent.

How would I do the webserver part on a developer machine? Our firewall will prevent me just running a webserver there that the phone can access. Is there some way in Talos to get around that?
here are some steps to run without a server (i.e. skipping step 4 above).

6) from the talos directory run:
python remotePerfConfigurator.py -v -e org.mozilla.fennec --activeTests tpan --resultsServer ' ' --resultsLink ' ' --output tpan.yml --remoteDevice <ip of your sutagent> --webServer people.mozilla.org/~jmaher/talos

7) from the talos directory run:
python run_tests.py -d -n tpan.yml
ok, I have some new instructions:
1) download: http://people.mozilla.org/~jmaher/talos/talos_ADB.zip
2) unzip in a local talos directory on your machine
3) inside that talos directory, run:
python remotePerfConfigurator.py -v -e org.mozilla.fennec --activeTests tpan --resultsServer ' ' --resultsLink ' ' --output tpan.yml --remoteDevice ADB --webServer people.mozilla.org/~jmaher/talos

4)now run:
python run_tests.py -d -n tpan.yml

This requires that you have ADB access to your device.  This has been poorly tested, but it is a much simpler way to reproduce this bug.
Definitely getting closer, but the blocking factor at this point for me is that a rooted device is apparently needed for this approach, and I don't have one. Without root, copying application.ini is not possible. Trying run-as always results in

run-as: Package 'org.mozilla.fennec' has corrupt installation

Perhaps all of this is not needed though. If it's an OOM, it should be debuggable on desktop as well (it won't OOM, but we can see memory rise and understand why). However, I can't get Talos to run on linux desktop either. I got to the point where it loads a Fennec window, but aside from the window title nothing happens; it just waits forever.
Got a rooted device. Still no luck though, still getting 'restarting as root failed' when running remotePerfConfigurator.py.

Note btw that all the talos infrastructure stuff isn't necessarily needed on device. If I just had a URL of a website to visit that would put the browser through the same scrolling/page loads/etc. as tpan does, that would be enough. All I need is to be able to run the device through that on a profiling build of Fennec (I don't need any of the special control/logging stuff the talos system has).
here is some logcat output.

tpan successful iteration (opt build):
http://people.mozilla.org/~jmaher/android_dumps/successful_opt.txt

tpan failed iteration (opt build):
http://people.mozilla.org/~jmaher/android_dumps/failure_opt.txt

tpan failed iteration (debug build):
http://people.mozilla.org/~jmaher/android_dumps/failure_debug.txt
jmaher, following our IRC discussion, here is a special profiling build:

http://syntensity.com/static/fennec-profiling.apk

The logcat output of that build, when failing silently in tpan, would be very useful.

If you can, please include logs from several runs (useful data may not appear in every run).
Improved build based on initial results:

http://syntensity.com/static/fennec-profiling2.apk
http://people.mozilla.org/~jmaher/android_dumps/

a second crash with the same build:
profiled2-2.txt 

a successful run with the same build (immediately before the profiled2-2.txt crash):
profiled-2-pass.log
The data so far appears to show this is indeed an OOM. Need more data though. Here is a yet more improved build:

http://syntensity.com/static/fennec-profiling3.apk
Further profiling with jmaher indicates this is NOT an OOM. We sometimes get '4GB' as the RSS, but the device has 1GB. It also doesn't always crash when reporting '4GB', and the RSS returns to normal values. And in jmaher's last log,

http://people.mozilla.org/~jmaher/android_dumps/profiled4-2.log

there is a crash but no indication of excessive memory use.

What is worrying is that the logs show different Runnables being run right before the crash, which argues against a single cause here, unless the trigger isn't an event, timer, or IPC message.

At this point, we need to see some stack traces. The profiling builds should have crashreporter enabled, since I didn't disable it in the mozconfig. Hopefully we can get a stack trace that way. Otherwise, we will need to figure out another way.
Depends on: 664510
The stack traces we have from before all show crashes in libc, with just hex offsets, which is not very helpful.

blassey, is there some way to get clearer stack traces on Android?
(In reply to comment #24)
> blassey, is there some way to get clearer stack traces on Android?

Debug symbols for libc would do the trick. If you build your own rom you get those.
couldn't we just look at this from the pageload angle?  It always crashes during a pageload, usually when about 2/3 of the page is loaded and the rest is checkerboard.  

I am game for installing a custom ROM on my tegra board and giving it a go.
Can I suggest that you guys (Alon, Joel, and Bear) pick a board in staging, set up adb over tcp (for debugging purposes), and run the repro steps together? I can provide the couple of commands necessary to start this up, and then you can use everything else just like it happens in try, build, etc...
bmoss, thanks for the suggestion. Right now we have some other stuff to try, but if they don't work out we should do that.

Would we be able to modify the board from staging - install a different libc, install gdb, etc.?
jmaher, can you please reproduce the crash on your device with this debug build,

http://syntensity.com/static/fennec-debug.apk

After looking some more at the previous crashes you reported, it just looks like the stack traces are broken. Hopefully that won't happen with this debug build and we can get proper stack traces in the crash reports.
Alon, you should be able to do anything you would do from adb. The only caution: if you reboot, you will need to re-establish adb over tcp before you can continue. As long as the agent is installed you should be good to go; it always starts up on boot. If you need to completely reimage the board, that will most likely prove very difficult because it would remove the agent. Let me know if I can be of assistance.
It seems that having physical access to a device would be helpful, mainly to allow crashreporter to run and collect data that way.

Is there any way we could get Alon a tegra+screen+input to do this?  Even if it is just for a week or two (however long) until this is behind us?

It would be simple to pull a tegra from the automation pool and access it over the network.  Running the tests would be as simple as downloading the talos.zip and running the commands given above, just replacing the IP address with that of the tegra.  The reason I hesitate to jump on this approach is that we seem to find value in the crashreporter end of things.  We can set up adb over tcp/ip for a tegra from the pool and install the debugging tools on there if we wish.
It does sound like getting a board could help. But I do have some hope that just using the build in comment #29 will get us all the data we need here. So I would suggest trying that out first since it sounds much faster.
Alon, do you have a tegra yet?  if not, we need to bug bmoss or aki to get one.
As stated in IRC, there are tegras on the right hand side of my desk that won't boot without a usb keyboard or mouse attached, which makes them useless for production.  Alon can use one.
I have a tegra board as of yesterday, have just been trying to get gdb running on it with dougt. However we have not been able to do so. Perhaps there is something wrong with that board specifically and I should get another. Aki, I'll be in touch soon unless I figure this out with the current board.

Have also been trying to run gdb on some phones, but without success due to various phone-related issues.
With bmoss' help, I have a tegra board running. However pushing gdb to the device and then just running

adb shell /data/local/gdbserver

segfaults (this worked on the phones).

Currently blocked on this and have no idea how to proceed.
After a lot of work with dougt, the situation is this:

1. We are getting corrupt stack traces apparently all the time, bug 664510, both using crashreporter and manual gdb. We can't really do much until that is fixed. Not sure who needs to look at that.

2. Debug builds now appear to crash on startup. Maybe unrelated, but this also blocks work on this bug (since a debug build might have helped us get stack traces). And since we can't get stack traces, we can't see where this startup crash is.
For some reason, I don't think we are going to fix both of these issues any time soon.  Is there any other method we can use to help narrow this down?
It will be very sad if we do not fix the first issue very soon. We are basically flying blind right now; much of our incoming crash reporter data is meaningless!

Issue 2 might not be crucial for us here.

Aside from this, I have just one more idea at the moment: to automatically rewrite the entire tree to print out all functions as we enter them. Then we could narrow down where we are crashing here. It will be very hard to get useful data from this, but we can try it.
any update on this bug?
I am still working on the automatic rewriting patch. Nothing to update about yet.
Ok, this is probably *not* going to work, but worth a shot. Here is a special build of fennec that will print out each function as we enter it,

www.syntensity.com/static/fennec-functionprint.apk

This is slower than a debug build, almost unusably so. It also generates a ridiculous amount of output, so saving it to disk would probably not be practical. However, if we can run this on talos, reproduce the silent failure, and view the last say 10,000 lines of output, we would have a chance at processing that and figuring out what function is responsible.

jmaher, can you please try to run this build on talos to see if the crash is reproducible with it? If so, the next step would be for us to figure out how to save just the last 10,000 lines of output. (If not, back to the drawing board...)
Another direction we could pursue here, is to bisect on fennec builds on talos to see when the stack reporting breaks. That is, to find the latest version of fennec where we *can* get stack traces for this bug. There is a good chance having that would let us fix this.
Looks like we might have at least a partial fix for the stack traces issue, in bug 644707. I made an android build with it,

www.syntensity.com/static/fennec-aborthandler.apk

jmaher, can you please try to reproduce the talos problem in this bug using that build?
azakai, I have 1 device and all my other android work stops when I do this.  I was under the impression that ctalbert had a tegra you could use.  Please let me know if you have problems running the steps in comment 13 (https://bugzilla.mozilla.org/show_bug.cgi?id=662936#c13) to reproduce this.
Ok. I am currently sick at home, so can't use that tegra, but hopefully in a few days I'll be able to work on this.
I ran this and reproduced it a few times.  I have a log which has two of the crashes in it:
http://people.mozilla.org/~jmaher/android_dumps/aborted.log

I should have some cycles on my tegra today and this weekend, so I can run another build.
There won't be any useful information in the log, but hopefully crashreporter stack traces will now work.

The relevant patch landed on m-c this morning, so the same should be true for tomorrow's nightly.
A promising development is that the just-released Android NDK6 has an ndk-stack tool, which translates the raw dumps from logcat into proper stack traces. This could be very useful in this bug and elsewhere, potentially.

(I also suspect this tool would work even with binaries created with earlier NDK versions, but would need to check that.)
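For reference, a minimal sketch of driving ndk-stack from a script. The `-sym`/`-dump` flags are per the NDK docs for the tool; verify them against your NDK version, and the tool must be on PATH:

```python
import subprocess

def ndk_stack_cmd(symbols_dir, dump_file):
    """Build the ndk-stack command line: -sym points at the directory
    containing unstripped .so files, -dump at a saved logcat crash dump."""
    return ['ndk-stack', '-sym', symbols_dir, '-dump', dump_file]

def symbolicate(symbols_dir, dump_file):
    """Translate raw '#NN  pc XXXXXXXX' logcat frames into a readable
    stack trace.  Assumes ndk-stack (NDK r6+) is on PATH."""
    result = subprocess.run(ndk_stack_cmd(symbols_dir, dump_file),
                            capture_output=True, text=True)
    return result.stdout
```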
(In reply to comment #49)
> A promising development is that the just-released Android NDK6 has an
> ndk-stack tool, which translates the raw dumps from logcat into proper stack
> traces. This could be very useful in this bug and elsewhere, potentially.
> 
> (I also suspect this tool would work even with binaries created with earlier
> NDK versions, but would need to check that.)

So when can we expect to see some movement on this issue?  Do you still need a tegra to debug it?  Can we sit down next week and crank through this issue?
At this point, the next step is to run the latest nightly on talos (and then to inspect the crashreporter stacks). jmaher said he might have time to do that in comment 47. If not, let's do this together next week.
nightly build from yesterday and uploaded crash report:
https://crash-stats.mozilla.com/report/index/0bb6b0d2-a42a-4d9e-8f20-ab0d12110719
Sadly the work we did on better crash reports did not help, that stack trace is still just libc noise.

I will look into getting better stack traces with a custom-compiled libc.
Earlier I took a look at the crash log at http://people.mozilla.org/~jmaher/android_dumps/aborted.log. Combined with the libc.so in that directory and the source code, I translated the stack trace to the following libc functions:

0. abort() [basically anytime there's a SIGSEG with
            address 0xdeadbaad it's abort() causing it]
1. dlfree()
2. free()
3. fclose()
4. _files_getaddrinfo()
5. nsdispatch()
6. explore_fqdn()
7. getaddrinfo()

The source for getaddrinfo() and friends is here http://android.git.kernel.org/?p=platform/bionic.git;a=blob;f=libc/netbsd/net/getaddrinfo.c;hb=HEAD

So far, the stack trace says some function called getaddrinfo(), which eventually calls _files_getaddrinfo() which is responsible for resolving the host name through /etc/hosts. So _files_getaddrinfo() fopens /system/etc/hosts, reads through it, and somehow by the time it fcloses hosts, the FILE structure got corrupted to the point that free() detects it and decides to abort(). I don't know why it's getting corrupted though.

Too bad the log only shows 8 frames, but I'd be glad to help anyone getting more stack frames.
(In reply to Jim Chen [:jchen] (mobile intern :) from comment #54)
> the FILE structure got corrupted 

One correction: it's FILE::_bf getting corrupted. _bf is an internal buffer used by fgetc et al. _files_getaddrinfo only uses fgets, so it should be fgets that's corrupting the buffer. Still strange though.
Whiteboard: [orange][mobile_unittests][mobile_dev_needed] → [orange][mobile_unittests][mobile_dev_needed][android_tier_1]
(In reply to Jim Chen [:jchen] (mobile intern :) from comment #54)
> Earlier I took a look at the crash log at
> http://people.mozilla.org/~jmaher/android_dumps/aborted.log. Combined with
> the libc.so in that directory and the source code, I translated the stack
> trace 

Can you please elaborate on the steps you took to do that?
Attached file stack
This is a stack from the crash in comment 52. It goes off the rails after frame 8, but it looks sensible up to that point.
(In reply to Alon Zakai (:azakai) from comment #56)
> (In reply to Jim Chen [:jchen] (mobile intern :) from comment #54)
> > Earlier I took a look at the crash log at
> > http://people.mozilla.org/~jmaher/android_dumps/aborted.log. Combined with
> > the libc.so in that directory and the source code, I translated the stack
> > trace 
> 
> Can you please elaborate on the steps you took to do that?

I basically took the libc.so and did 'objdump -d', and found the instructions that matched the offset in the logcat stack trace:

> I/DEBUG   (  937):          #00  pc 00011f0c  /system/lib/libc.so
> I/DEBUG   (  937):          #01  pc 0000be08  /system/lib/libc.so
> I/DEBUG   (  937):          #02  pc 0000cd6c  /system/lib/libc.so
> I/DEBUG   (  937):          #03  pc 000147d4  /system/lib/libc.so
> I/DEBUG   (  937):          #04  pc 0002e58e  /system/lib/libc.so
> I/DEBUG   (  937):          #05  pc 0002cd1e  /system/lib/libc.so
> I/DEBUG   (  937):          #06  pc 0002d4ec  /system/lib/libc.so
> I/DEBUG   (  937):          #07  pc 0002d7f8  /system/lib/libc.so

From that I could verify the stack trace is good. objdump also gives you offsets in relation to exported symbols, so it was easy to map the frames to exported functions such as abort(), free(), nsdispatch(), getaddrinfo().

The stack trace now looks like this:
0. abort()
1. dlfree()
2. free()
3. fclose()
4. ???
5. nsdispatch()
6. ???
7. getaddrinfo()

Some functions are not exported, like _files_getaddrinfo() and explore_fqdn(). To get those I looked in the source code I linked and found that,

1) fclose() is called from android_getaddrinfo_proxy() and _endhtent(), and _endhtent() is called by _files_getaddrinfo(). The instructions around the fclose() call fit _files_getaddrinfo(). So you can infer that the missing frame between fclose() and nsdispatch() is _files_getaddrinfo(), and that _endhtent() was inlined into _files_getaddrinfo(). That's why it appears _files_getaddrinfo() calls fclose() directly.

2) getaddrinfo() calls explore_fqdn(), and explore_fqdn() calls nsdispatch(). So the other missing frame is explore_fqdn().

Also, if you disassemble fclose() and look at where free() is called, it corresponds to the line 'free(fp->_bf._base)' in the source code. Combined with the bionic free() behavior that it calls abort() whenever it detects corruption, you can infer that the FILE::_bf buffer was corrupted.
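The manual lookup described above can be mechanized. A rough sketch, assuming GNU binutils `objdump -T` output (address in the first column, symbol name in the last; check your objdump's actual columns):

```python
import bisect

def parse_objdump_T(output):
    """Parse `objdump -T libc.so` output into a sorted list of
    (address, symbol) pairs for .text symbols."""
    syms = []
    for line in output.splitlines():
        parts = line.split()
        if len(parts) >= 4 and '.text' in parts:
            try:
                addr = int(parts[0], 16)
            except ValueError:
                continue  # header lines, etc.
            syms.append((addr, parts[-1]))
    syms.sort()
    return syms

def nearest_symbol(syms, pc):
    """Map a pc offset (as seen in the logcat '#NN  pc XXXXXXXX' lines)
    to the nearest exported symbol at or below it."""
    addrs = [a for a, _ in syms]
    i = bisect.bisect_right(addrs, pc) - 1
    if i < 0:
        return None
    addr, name = syms[i]
    return '%s+0x%x' % (name, pc - addr)
```

As in the analysis above, this only finds exported symbols; static functions like _files_getaddrinfo() still have to be inferred from the disassembly and the source.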
I could not reproduce this on my HTC Vision after running the test about 10 times. Here is my setup:

1) HTC Vision running CyanogenMod 7 with debug libc.so
2) latest nightly and latest SUTAgent from http://ftp.mozilla.org/pub/mozilla.org/mobile/nightly/latest-mozilla-central-android-r7
3) talos from http://people.mozilla.org/~jmaher/talos/talos_ADB.zip
4) xampp as host web server
5) phone networked to host through USB tethering
6) same python commands as in comment #5

The ADB way didn't work for me but SUTAgent did. There were talos warnings about failing to validate output, but no unexpected output errors.

Maybe there is something wrong with my setup? Or has this been reproduced on a non-tegra device before?
I have not tried to reproduce this on a non tegra, I can try on a Nexus S to see if I can make it reproduce.
I have a few roms that I can flash on the tegra 200 to see if it makes any difference.
I was not able to reproduce on my Nexus S with adb or sut, but could easily reproduce using the same build and tools on my tegra.
After more testing we found that the crash _is_ caused by getaddrinfo() in bionic, and more specifically by its handling of the /etc/hosts file. It looks to be a thread-unsafe operation in either stdio (likely) or malloc (unlikely).

One workaround is to put locks around getaddrinfo() in NSPR, i.e. only one thread can call it at a time. This works but is ugly.

A simpler workaround, confirmed by jmaher, is to delete /system/etc/hosts on the tegra. (Since /system is initially read-only, it should first be remounted with 'adb remount'.) This works by skipping etc/hosts handling entirely, which is fine for our tegra environments. Obviously, if it were a bug in Fennec we should fix it in Fennec, but at this point the crash looks to be a bug in bionic and only affects the tegras, so this workaround is the better option.
here is some sample code we can integrate into our setup/config scripts before running tests:

import devicemanager

# connect to the device via its SUT agent (ip, port)
dm = devicemanager.DeviceManager('192.168.1.101', 20701)

if dm.fileExists('/system/etc/hosts'):
  # /system is mounted read-only by default, so remount it read-write first
  dm.sendCMD(['exec mount -o remount,rw -t yaffs2 /dev/block/mtdblock3 /system'])
  dm.sendCMD(['exec rm /system/etc/hosts'])
  if dm.fileExists('/system/etc/hosts'):
    print "failed to remove /system/etc/hosts"
  else:
    print "successfully removed hosts file, we can test!!!"
  dm.reboot()

I have tested this many times on a fresh system with just the gainroot.sh and SUT agent installed.
Depends on: 678992
most of the 'depends on' bugs here are incorrect.  I think the only bug this one now depends on is bug 678992.  When that is fixed, we can mark this bug FIXED.
I ran this on the staging tegras and have restarted them.  The staging test master triggers a set of test runs every 15 minutes, so the overnight should produce some good data.
Assignee: nobody → nchen
(In reply to Doug Turner (:dougt) from comment #65)
> most of the depends on bugs here are incorrect.  I think the only bug that
> this bug now depends on is 678992.  When that is fixed, we can mark this bug
> FIXED

Bug 678992 was fixed on 23 Aug. Anything left to do here?
I vote for resolving this bug.
(In reply to Joel Maher (:jmaher) from comment #68)
> I vote for resolving this bug.

Since I'm assigned to it, I vote for resolving as well :)
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Yeah, I'm seeing this on recent nightlies. I just had two such crashes in a row in a session with 1 wikipedia tab. New bug?
Yeah. File it?
(In reply to Jim Chen [:jchen] from comment #69)
> Since I'm assigned to it, I vote for resolving as well :)
What is the target milestone because it is still #2 top crasher in 7.0 Beta (#1 in 6.0)?
(In reply to Scoobidiver from comment #73)
> (In reply to Jim Chen [:jchen] from comment #69)
> > Since I'm assigned to it, I vote for resolving as well :)
> What is the target milestone because it is still #2 top crasher in 7.0 Beta
> (#1 in 6.0)?

I'm not sure. We still need to decide on a fix. Filed bug 687367 for that.
Whiteboard: [orange][mobile_unittests][mobile_dev_needed][android_tier_1] → [mobile_unittests][mobile_dev_needed][android_tier_1]