Closed Bug 965705 Opened 10 years ago Closed 6 years ago

Intermittent B2G mochitest TEST-UNEXPECTED-FAIL | mozrunner-startup | application timed out after 450.0 (or 330.0) seconds with no output

Categories

(Firefox OS Graveyard :: Emulator, defect)

Platform: x86 Linux
Type: defect
Priority: Not set
Severity: normal

Tracking

(tracking-b2g: backlog)

Status: RESOLVED WONTFIX

People

(Reporter: cbook, Unassigned)

Details

(Keywords: intermittent-failure)

b2g_emulator_vm mozilla-inbound debug test mochitest-debug-13 on 2014-01-29 21:57:51 PST for push ad741e8ab1b7

slave: tst-linux64-ec2-313

https://tbpl.mozilla.org/php/getParsedLog.php?id=33788989&tree=Mozilla-Inbound



[Parent 687] WARNING: B2GRunner TEST-UNEXPECTED-FAIL | automation | application timed out after 450.0 seconds with no output
This also has the "marionette.errors.InvalidResponseException: InvalidResponseException()" exception.
Depends on: 965782
This last failure is not related to Gaia.
All of these are related to mochitests, not Gaia; it's probably better to file a separate bug.
Comments 44 onward are mis-stars, a result of bug 886570 (now backed out) breaking the mochitest harness error output for test timeouts. Also, comment 41 is from the try run for that very bug.
The last 3 failures are in mochitests, not in the Gaia unit tests.

Ryan, is there a way to make this more obvious when sheriffing? Like changing this bug's title?
Flags: needinfo?(ryanvm)
We could do that. We could also have less crappy timeout handling in the harness, but I suppose that'd be asking a lot.
Flags: needinfo?(ryanvm)
Bug 1023935 will help with the summaries of things like this.
Now that I've read the failures more carefully, none of them are actually in Gaia; they are all mochitest failures. Which component do you use for this?
Surely they are startup crashes and thus still a Firefox OS::* component issue?
OK, I see what you mean; all the failures happen in the emulator.

I guess we'd need symbols for the crash...
Component: Gaia::TestAgent → Emulator
Summary: TEST-UNEXPECTED-FAIL | automation | application timed out after 450.0 seconds with no output → Intermittent TEST-UNEXPECTED-FAIL | mozrunner-startup | application timed out after 450.0 seconds with no output
Summary: Intermittent TEST-UNEXPECTED-FAIL | mozrunner-startup | application timed out after 450.0 seconds with no output → Intermittent B2G mochitest TEST-UNEXPECTED-FAIL | mozrunner-startup | application timed out after 450.0 (or 330.0) seconds with no output
Hey Fabrice,

do you know what the path forward is here?
Flags: needinfo?(fabrice)
These all look like crashes in the gfx ipc. Nical, any idea?
Flags: needinfo?(fabrice)
Flags: needinfo?(nical.bugzilla)
Unfortunately, nothing comes to mind.
If the harness could give more info (like stack traces of every thread) when timing out, it'd be very helpful.
Flags: needinfo?(nical.bugzilla)
Hey Jonathan, do you know how to fulfill Nicolas's request in comment 345 (!)?
Flags: needinfo?(jgriffin)
Ted, I think we've discussed this before, but any ideas why we're not getting a stack trace in these cases?
Flags: needinfo?(jgriffin) → needinfo?(ted)
There is a stack trace; if you click through to the raw log you can see every thread of the b2g process. Unfortunately it's not very *interesting*. It's just sitting in the event loop. The log summary shows an IPC Abort and corresponding MOZ_CRASH in a content process, but we don't seem to be doing anything with that crash report, and then we kill the b2g process after we time out.

ahal said (on IRC) that he thought this was supposed to do the trick:
http://dxr.mozilla.org/mozilla-central/source/testing/mozbase/mozrunner/mozrunner/base/device.py#26

and it does look like we have code in shell.js to handle that:
http://dxr.mozilla.org/mozilla-central/source/b2g/chrome/content/shell.js#132

...so I'm not sure what's happening here. Maybe someone with a local B2G emulator build could try running tests and kill -ABRT the content process to see what happens?
Flags: needinfo?(ted)
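For anyone wanting to try that experiment locally, a minimal sketch follows. It assumes adb is on PATH, the emulator is the only attached device, and the image allows kill from the shell user (rooted/engineering build); the helper names and the "b2g" process-name filter are made up for illustration, and picking out the specific content process (rather than the parent b2g process) is left to whoever runs it.

# Hypothetical helper (not part of mozrunner): send SIGABRT to running b2g
# processes on a local emulator so the shell.js abort handling can be observed.
import subprocess

def find_b2g_pids(name_filter="b2g"):
    # Android toolbox `ps` columns are assumed to be:
    # USER PID PPID VSIZE RSS WCHAN PC S NAME
    out = subprocess.check_output(["adb", "shell", "ps"]).decode("utf-8", "replace")
    pids = []
    for line in out.splitlines()[1:]:
        cols = line.split()
        if len(cols) >= 9 and name_filter in cols[-1]:
            pids.append(cols[1])
    return pids

def send_sigabrt(pid):
    # kill -6 == SIGABRT
    subprocess.check_call(["adb", "shell", "kill", "-6", str(pid)])

if __name__ == "__main__":
    for pid in find_b2g_pids():
        print("Sending SIGABRT to pid", pid)
        send_sigabrt(pid)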
ahal: any thoughts here (see comment 351)?
Flags: needinfo?(ahalberstadt)
Depends on: 1093296
I was able to reproduce the timeout locally, but now mochitest fails to run a second time, so I'm having trouble making much progress. It looks like something isn't getting cleaned up and the marionette port is still in use.

I think it's been a while since anyone has looked at emulator mochitests, and they're in pretty bad shape. I don't think they even have a definite owner anymore. At any rate, I filed bug 1093296 for it.
Flags: needinfo?(ahalberstadt)
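For what it's worth, a quick way to confirm the stale-port theory before kicking off another run is sketched below. It assumes Marionette's usual default port 2828 forwarded to localhost; this is an illustrative check, not anything that exists in the harness.

# Illustrative check only: see whether the default Marionette port is still
# bound before starting another run. Port 2828 and the localhost forwarding
# are assumptions about the local setup.
import socket

def marionette_port_in_use(host="localhost", port=2828, timeout=1.0):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True   # something is still listening and accepting
    except OSError:
        return False      # nothing there

if __name__ == "__main__":
    if marionette_port_in_use():
        print("Port 2828 is still in use; the previous run probably didn't clean up.")
    else:
        print("Port 2828 is free.")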
Hey Hsin-Yi, is this something you could help with?
Flags: needinfo?(htsai)
Hey Julien,
My team is quite occupied with other emulator and stability issues at the moment. Given the obvious failure frequency, let me put this into the backlog to get it prioritized.
blocking-b2g: --- → backlog
Flags: needinfo?(htsai)
Looks like something is wrong in gecko/ipc/glue/MessageChannel.cpp::OnChannelErrorFromLink.

https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=7235805
See Also: → 1051567
Since last week I've observed MessageChannel crashes several times (at least once a day). This bug now contains more than one symptom: one seems to be a mochitest framework issue, and the other is the MessageChannel crash.

Is it possible to forward the test results coming from the MessageChannel crash to bug 1051567 to get more attention there?

=== Crash symptom ===
03-05 05:05:29.314 I/Gecko   (  773): [Child 773] ###!!! ABORT: Aborting on channel error.: file ../../../gecko/ipc/glue/MessageChannel.cpp, line 1584
03-05 05:05:29.394 E/Gecko   (  773): mozalloc_abort: [Child 773] ###!!! ABORT: Aborting on channel error.: file ../../../gecko/ipc/glue/MessageChannel.cpp, line 1584
03-05 05:05:29.414 F/libc    (  773): Fatal signal 11 (SIGSEGV) at 0x00000000 (code=1)
This usually indicates the B2G process has crashed
Return code: 1
==========
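One possible way to separate the two symptoms when starring is sketched below: scan a pulled log for the channel-error abort marker quoted above and classify the run. The script is hypothetical (not part of the harness); the marker strings are taken from the log excerpt in this comment and from this bug's summary.

# Hypothetical triage helper: classify a pulled emulator log as either the
# MessageChannel channel-error abort or a plain no-output timeout, using the
# marker strings quoted in this bug.
import sys

def classify(log_path):
    with open(log_path, encoding="utf-8", errors="replace") as f:
        text = f.read()
    if "Aborting on channel error" in text and "MessageChannel.cpp" in text:
        return "MessageChannel channel-error abort (candidate for bug 1051567)"
    if "application timed out after" in text and "with no output" in text:
        return "generic mozrunner-startup timeout"
    return "no known signature found"

if __name__ == "__main__":
    print(classify(sys.argv[1]))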
Hello Ryan, are you the one who can help with my comment 496?

(In reply to Hsin-Yi Tsai [:hsinyi] from comment #496)
> Since last week I've observed MessageChannel crashes several times (at
> least once a day). This bug now contains more than one symptom: one seems
> to be a mochitest framework issue, and the other is the MessageChannel
> crash.
> 
> Is it possible to forward the test results coming from the MessageChannel
> crash to bug 1051567 to get more attention there?
> 
> === Crash symptom ===
> 03-05 05:05:29.314 I/Gecko   (  773): [Child 773] ###!!! ABORT: Aborting on channel error.: file ../../../gecko/ipc/glue/MessageChannel.cpp, line 1584
> 03-05 05:05:29.394 E/Gecko   (  773): mozalloc_abort: [Child 773] ###!!! ABORT: Aborting on channel error.: file ../../../gecko/ipc/glue/MessageChannel.cpp, line 1584
> 03-05 05:05:29.414 F/libc    (  773): Fatal signal 11 (SIGSEGV) at 0x00000000 (code=1)
> This usually indicates the B2G process has crashed
> Return code: 1
> ==========
Flags: needinfo?(ryanvm)
blocking-b2g: backlog → ---
Unfortunately, those ABORTs also show up when there's a force-termination after hanging, so they're not overly useful to us from a starring standpoint. If there's some way we can get the OnChannelErrorFromLink output somewhere useful, then we can split it out into another bug.
Flags: needinfo?(ryanvm)
Closing all intermittent test failures for Firefox OS (since we're not focusing on it anymore).

Please reopen if my search included your bug by mistake.
Firefox OS is not being worked on
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX