Closed Bug 880285 Opened 11 years ago Closed 10 years ago

Intermittent b2g18* crashtest,reftest timeout followed by crash (Fatal signal 11 (SIGSEGV) at 0x42e00000)

Categories: Firefox OS Graveyard :: General (defect)

Platform: ARM / Gonk (Firefox OS)
Type: defect
Priority: Not set
Severity: critical

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: RyanVM, Unassigned)

References

Details

(Keywords: crash, intermittent-failure)

Attachments

(1 file)

Looks similar to bug 818103, but this still occurs on the b2g18 branches even after updating the emulator with the fix for bug 867996. We hit this very frequently (at least once per push), primarily on the crashtests. Under our normal tree rules, the crash rate is high enough to warrant hiding the tests.

https://tbpl.mozilla.org/php/getParsedLog.php?id=23856504&tree=Mozilla-B2g18
Any idea why we're not getting stack dumps on this crash?
Flags: needinfo?(ahalberstadt)
Also, this has been happening for a long time, but we'd been starring the failures as bug 818103. To avoid cluttering this bug up like that one, I will not be copying/pasting every log link into here when I star. Unless you hear otherwise, it's safe to assume this is still happening with high frequency :)
(In reply to Mike Habicher [:mikeh] from comment #1)
> Any idea why we're not getting stack dumps on this crash?

I see:
05:57:02     INFO -  checking for crashes in '/data/local/tests/reftest/profile/minidumps'

Followed by no other output. This usually indicates that there are no minidumps being generated (otherwise there would be a minidump found message or something similar). The mechanism that generates the minidumps is kind of a black box to me. Ted, do you know what might be going on?
Flags: needinfo?(ahalberstadt) → needinfo?(ted)
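[Editorial aside: the "checking for crashes in ..." line comes from the Python test harness, which looks in the profile's minidumps directory after a run and only produces further output if a dump is present. A minimal sketch of that flow, assuming mozcrash (from mozbase) is available; the function and path names are illustrative, not the actual harness code:]

# Illustrative sketch only, not the real reftest harness code.
import glob
import os
import mozcrash

def check_profile_for_crashes(profile_dir, symbols_path, test_name=None):
    dump_dir = os.path.join(profile_dir, "minidumps")
    print("checking for crashes in '%s'" % dump_dir)
    # If the process died before Breakpad was initialized, no .dmp files
    # exist here, so nothing further gets logged, matching the log above.
    if not glob.glob(os.path.join(dump_dir, "*.dmp")):
        return False
    # mozcrash runs minidump_stackwalk against each dump using the symbols
    # and logs the resulting crash signature and stack.
    return mozcrash.check_for_crashes(dump_dir, symbols_path,
                                      test_name=test_name)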
I don't really know how to read the logcat tea leaves, but it strikes me that maybe this crash isn't actually being caught by Breakpad. Compare the end of the logcat from comment 0:
https://tbpl.mozilla.org/php/getParsedLog.php?id=23856504&tree=Mozilla-B2g18
---------------------------------
05:57:03     INFO -  I/Gecko   (  948): REFTEST TEST-START | http://10.0.2.2:8888/tests/layout/generic/crashtests/683702-1.xhtml | 36 / 633 (5%)
05:57:03    ERROR -  F/libc    (  948): Fatal signal 11 (SIGSEGV) at 0x42e00000 (code=2)
05:57:03    ERROR -  This usually indicates the B2G process has crashed
---------------------------------

to the end of the logcat from a crash in bug 818103:
https://tbpl.mozilla.org/php/getParsedLog.php?id=23052012&full=1&branch=birch#error2
---------------------------------
15:10:32  WARNING -  E/GeckoConsole(  766): [JavaScript Error: "The character encoding of the HTML document was not declared. The document will render with garbled text in some browser configurations if the document contains characters from outside the US-ASCII range. The character encoding of the page must be declared in the document or in the transfer protocol." {file: "http://10.0.2.2:8888/tests/layout/reftests/bugs/269908-4-ref.html" line: 0}]
15:10:32     INFO - Return code: 0
---------------------------------
Flags: needinfo?(ted)
I also note that there's a string that shows up in the logcat in both logs:
05:50:53     INFO -  exception: [Exception... "Component returned failure code: 0xc1f30001 (NS_ERROR_NOT_INITIALIZED) [nsICrashReporter.annotateCrashReport]"  nsresult: "0xc1f30001 (NS_ERROR_NOT_INITIALIZED)"  location: "JS frame :: chrome://browser/content/shell.js :: <TOP_LEVEL> :: line 224"  data: no]creating 1!

However, this only shows up twice in the log where we get a minidump, but 4 times in the one where we didn't. Perhaps we're not actually enabling Breakpad properly sometimes?
Out of sight, out of mind? Is there anyone who can look at this please?
How frequently is this crash happening? We are only running reftest-sanity on b2g18, so I wonder if it wouldn't just be more worthwhile to turn them off (though I guess crashtests are a different matter).

I guess it depends on:
a) how much longer b2g18 is going to be around
b) how many commits are going to be pushed there now that the first couple releases are wrapping up

Another option would be to enable the full stack emulator builds like we have on m-c, but this may require a fair amount of work for releng (and wouldn't be a guaranteed fix).
This hits crashtests too, and I think turning off what few reftests we run on b2g18 to fix an intermittent crash will not go over well. Also, b2g18 will be around until at least the end of the year into early next year. It's not going away any time soon.

Alex will have to answer about how much activity we expect on it going forward.
Flags: needinfo?(akeybl)
Oh, and it happens roughly once every 1-2 pushes. It hits in spurts, so we might see multiple crashes on one push, then 2 pushes without, etc. It's intermittent, but quite frequent.
Ok I didn't realize it was going to be around for that long. I don't know how much value reftest sanity is providing, but we are at least running a fair amount of crashtests, so I agree we don't want to turn them off.

Maybe it would be worth getting full stack emulator builds going on b2g18 then.
(In reply to Andrew Halberstadt [:ahal] from comment #8)
> How frequently is this crash happening? We are only running reftest-sanity
> on b2g18, I wonder if it wouldn't just be more worthwhile to turn them off
> (though I guess crashtests are a different matter)..
> 
> I guess it depends on:
> a) how much longer b2g18 is going to be around
> b) how many commits are going to be pushed there now that the first couple
> releases are wrapping up
> 
> Another option would be to enable the full stack emulator builds like we
> have on m-c, but this may require a fair amount of work for releng (and
> wouldn't be a guaranteed fix).

It'll be around till March 2014, and we can expect at least 30-40 landings a cycle until then. Agreed we shouldn't turn these off.
Flags: needinfo?(akeybl)
(In reply to Andrew Halberstadt [:ahal] from comment #11)
> Maybe it would be worth getting full stack emulator builds going on b2g18
> then.

Who needs to make this call?
Flags: needinfo?(ahalberstadt)
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #16)
> (In reply to Andrew Halberstadt [:ahal] from comment #11)
> > Maybe it would be worth getting full stack emulator builds going on b2g18
> > then.
> 
> Who needs to make this call?
The people who make that call are me, Joduinn, and Jgriffin. It represents the same serious amount of work that adding a new platform always creates (you might as well treat it like a new platform): lots of strange intermittents, then a period of fixing them, and then they will be running green. Not a silver bullet.

And if we still crash and we aren't actually setting up breakpad properly like Ted hypothesizes above, then we're back at square 1.

Ahal, would there be a way to run the same set of crashtests on a full stack emulator build using a b2g18 build of gecko/gaia/gonk?  That way we can see how much it buys us.

Ted, is there a way to determine if breakpad isn't being enabled? Perhaps something we can dump out to the log from the build in order to let us know whether that is a red herring or not?
Flags: needinfo?(ted)
(In reply to Clint Talbert ( :ctalbert ) from comment #17)
> (In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #16)
> > (In reply to Andrew Halberstadt [:ahal] from comment #11)
> > > Maybe it would be worth getting full stack emulator builds going on b2g18
> > > then.
> > 
> > Who needs to make this call?
> The people who make that call are me, Joduinn, and Jgriffin. It represents
> the same serious amount of work that adding a new platform always creates
> (you might as well treat it like a new platform. That means lots of strange
> intermittents, then a period of fixing them, and then they will be running
> green.  Not a silver bullet.

FWIW, we are already running the full builds on b2g18; it would just be a matter of changing some config files to enable the tests. Worst case scenario, there will be a bunch of intermittents and we'll have to hide, fix, or disable them again. Though, yes, this worst case could be likely.

> Ahal, would there be a way to run the same set of crashtests on a full stack
> emulator build using a b2g18 build of gecko/gaia/gonk?  That way we can see
> how much it buys us.

I'll look into it when I have some time (might not be for a few days). Though I wouldn't expect to be able to reproduce the problem in question locally. Our success reproducing this bug in the past has been zero.
Flags: needinfo?(ahalberstadt)
We could stick a line in reftest.js to dump the crashreporter status, it's pretty simple. Something like:
var cr = Components.classes["@mozilla.org/toolkit/crash-reporter;1"]
                      .getService(Components.interfaces.nsICrashReporter);
dump("crashreporter enabled: " + cr.enabled + "\n");
Flags: needinfo?(ted)
Clint, ahal says he probably won't have time to look into this any time soon. Is there someone else who might? We still hit it pretty consistently on b2g18*, even with the full-stack builds.
Flags: needinfo?(ctalbert)
Ryan,

We are chasing this issue down at the b2g work week. Right now, Jonas is trying to find out who would be the right person to dig into this for us. So, I'm going to transition my needinfo request to him.
Flags: needinfo?(jonas)
With 1.1 coming out, b2g 18 (1.0.1) is about to be deprecated. If this doesn't show up on other trees, then I think we just keep moving forward and focus our resources on 1.1, 1.2 and 1.3.
Flags: needinfo?(ctalbert)
b2g18 is v1.1
Flags: needinfo?(ctalbert)
Andrew is your guy
Flags: needinfo?(jonas)
So, we need full stack builds on b2g18?
Flags: needinfo?(ryanvm)
That was done in bug 897141.
Flags: needinfo?(ryanvm)
So now this is "just" finding someone to investigate?
Flags: needinfo?(ryanvm)
Correct. Switching to full-stack builds & tests did not make the problem go away, so someone needs to investigate.
Flags: needinfo?(ryanvm)
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #19)
> We could stick a line in reftest.js to dump the crashreporter status, it's
> pretty simple. Something like:
> var cr = Components.classes["@mozilla.org/toolkit/crash-reporter;1"]
>                       .getService(Components.interfaces.nsICrashReporter);
> dump("crashreporter enabled: " + cr.enabled + "\n");

How's the attached (untested)?
Attachment #826007 - Flags: review?(ted)
Comment on attachment 826007 [details] [diff] [review]
testBug880285.patch

Review of attachment 826007 [details] [diff] [review]:
-----------------------------------------------------------------

Plausible.
Attachment #826007 - Flags: review?(ted) → review+
ahal: do you think this is another symptom of the stuff we were discussing in bug 866937?
Flags: needinfo?(ahalberstadt)
Yeah, it could be. This line:

> 05:57:03    ERROR -  F/libc    (  948): Fatal signal 11 (SIGSEGV) at 0x42e00000 (code=2)

looks like it originates from libc, and :jrmuizel says that we are missing libc symbols in bug 866937. If we did get the symbols, I'd probably still need to hunt down and backport some of the check_for_crashes patches, but that should be fairly straightforward. It's getting the symbols that is a mystery to me.
Flags: needinfo?(ahalberstadt)
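[Editorial aside: to make the symbols point concrete, here is a hedged sketch of what symbolicated crash checking looks like from the harness side, assuming mozcrash plus a local minidump_stackwalk binary; the directory, zip, and binary paths are hypothetical:]

# Illustrative only, not the actual b2g18 harness code. Assumes the minidumps
# have already been pulled from the device profile into a local directory.
import mozcrash

crashed = mozcrash.check_for_crashes(
    "pulled-minidumps",                             # hypothetical local dir
    symbols_path="b2g-crashreporter-symbols.zip",   # hypothetical symbols zip
    stackwalk_binary="/usr/local/bin/minidump_stackwalk",
    test_name="683702-1.xhtml",
)
# Without libc symbols in the symbols archive, frames inside libc (like the
# SIGSEGV above) remain unsymbolicated even when a dump is produced.
if crashed:
    print("crash detected; symbolicated stack was written to the log")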
The symbols are irrelevant here; what matters is whether we're catching the crash and producing a minidump or not. Your analysis in the other bug showed that we were crashing before we restarted with Breakpad enabled; I was wondering if this was something similar.
It could be; I can't see the logfile anymore and there don't seem to be any recent failures like this on b2g18, so I'm not exactly sure where the crash is happening.

Just for the record, in the month of October we ran B2G Emulator ICS builds 5 times on b2g18 (and this crash didn't happen in any of them). Ryan, are you still noticing this crash? If so is this still something that should be a high priority to fix? It seems like b2g18 is winding down and the return on fixing this is getting smaller and smaller.
Flags: needinfo?(ryanvm)
The frequency certainly seems to be diminished. Recent retriggers show it ~10% of the time. This branch has approximately 4 months of support left and a diminishing number of checkins to it. I guess I'd be OK with leaving it as-is at this point, but I do get worried about setting an "ignore it long enough and we can eventually WONTFIX it" precedent. This bug has been on file for 5 months now.
Flags: needinfo?(ryanvm)
I'd explicitly like to not consider this as setting any precedent, but rather as allocating resources where they are most useful; we're currently working with developers to identify and fix a number of asserts and crashes on m-c, and that seems likely to have a significant future payback compared to diverting resources to work on this bug.

But, I agree it's definitely sub-optimal.  Ahal, can you bring this up on Friday's B2G engineering meeting?  Let's get jonas et al to make the final call.
Flags: needinfo?(ctalbert)
(In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #37)
> https://tbpl.mozilla.org/php/getParsedLog.php?id=30152672&tree=Mozilla-B2g18
> 
> 12/76 runs = 16%

And thanks for this statistic, btw!
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #34)
> The symbols are irrelevant here, it's whether we're catching the crash and
> producing a minidump or not. Your analysis in the other bug showed that we
> were crashing before we restarted with Breakpad enabled, I was wondering if
> this was something similar.

So, to answer the question: no, it isn't that we're crashing before we restart the b2g process with crash reporting enabled. It could be that we aren't checking for crashes when we should be.
Thanks. Just trying to make sure we're asking the right questions so we know what's actually wrong.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WONTFIX