893154 - Find and fix the underlying cause of B2G mochitest timeouts

Reporter

Description

•

11 years ago

I was recently discussing with Clint and Jonathan the frequency with which we hit timeouts during the B2G mochitest suite and how it has become frequent enough that we often don't even bother filing new bugs for each individual failure anymore. It has gotten to the point where failures are starred as "b;r" and retriggered, similar to what was done in prior times on Android.

However, there are some bugs on file tracking some of the more frequent occurrences of these failures (roughly in order of frequency based on what OrangeFactor thinks):
846105, 824056, 864714, 878916, 877938, 879252, 874606, 891346, 881448, 845621, 863945, 856868, 857325, 891840

I also know that we see them pretty often in the editor imptests, but I don't see a bug readily on file for that.

Also, these are easy to find on TBPL by looking at any of m-c, inbound, birch, or fx-team. Just look for starred B2G mochitests and it'll probably be a 330 second timeout.

Note that this bug only covers the mochitest situation. We should probably have another on file for the number of socket.timeouts we hit in the Marionette test suite.

Jonathan Griffin (:jgriffin)

Comment 1

•

11 years ago

Looking at several of these logs, they all seem to share a common failure mode.  They all affect tests that are normally long running (100s or more) that on some occasions temporarily freeze and cause the 330s timeout to be triggered.  They later unfreeze, but by that point the test run has already been aborted.

I'm guessing the freeze is caused by GC.  The fix is probably to give a much longer timeout for tests on B2G.
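To illustrate the failure mode, here is a minimal sketch of a no-output watchdog. It assumes the 330s limit applies to the gap between consecutive outputs from a test; the function name `first_timeout` and the timestamps are hypothetical and are not taken from the mochitest harness.

```python
from typing import Optional, Sequence


def first_timeout(output_times: Sequence[float], timeout: float,
                  start: float = 0.0) -> Optional[float]:
    """Return the time at which a no-output watchdog would fire, or None.

    output_times: ascending timestamps (seconds) at which the test produced output.
    timeout: longest gap without output that the watchdog tolerates.
    """
    last = start
    for t in output_times:
        if t - last > timeout:
            # The watchdog fires before this late output ever arrives.
            return last + timeout
        last = t
    return None


# Hypothetical long-running test that freezes for 400s mid-run, then resumes.
events = [10.0, 60.0, 110.0, 510.0, 520.0]
print(first_timeout(events, 330.0))  # 440.0: run aborted during the freeze
print(first_timeout(events, 660.0))  # None: a doubled timeout rides out the freeze
```

Under these assumptions, a longer timeout only helps when the freeze is shorter than the new limit; it does nothing about the cause of the freeze itself.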

Jonathan Griffin (:jgriffin)

Comment 2

•

11 years ago

Doubled the timeout in a push to try: https://tbpl.mozilla.org/?tree=Try&rev=2641379f04a0

Jonathan Griffin (:jgriffin)

Comment 3

•

11 years ago

This try job got eaten, so I'll have to submit another.

Jonathan Griffin (:jgriffin)

Comment 4

•

11 years ago

https://tbpl.mozilla.org/?tree=Try&rev=f313e548e8ea

Jonathan Griffin (:jgriffin)

Comment 5

•

11 years ago

No-op try run for side-by-side comparison, per request from RyanVM:  https://tbpl.mozilla.org/?tree=Try&rev=b481fbc9754e

cmtalbert

Comment 6

•

11 years ago

Those both look pretty green. What am I missing?

Jonathan Griffin (:jgriffin)

Comment 7

•

11 years ago

Some recent GC changes in gecko seem to have made this problem (mostly?) go away.

Jonathan Griffin (:jgriffin)

Comment 8

•

11 years ago

(I'm thinking this patch is worth landing anyway...)

cmtalbert

Comment 9

•

11 years ago

Sorry for the spam, I see now.  Looks like we repo'd the timeout on the no-op run and we didn't on the other. While that's promising, I was hoping for a clearer difference. I also saw some work nbp recently did in bug 876029, which may have helped ensure our threads are responding to memory pressure more quickly.  Instances of the "1200s timeout" bug on OrangeFactor have also dropped off since that checkin (876029 landed July 10): http://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=820739&startday=2013-05-11&endday=2013-07-18&tree=trunk

It's hard to read these tea leaves and tell whether a longer timeout is wallpapering over a real issue, or whether it's what we need to do now to get to a more reliable green state while we track memory usage/pressure over time on the endurance tests or areweslimyet.

I'm leaning toward the "patch, and monitor outside of the tests" approach.

Jonathan Griffin (:jgriffin)

Comment 10

•

11 years ago

(In reply to Clint Talbert ( :ctalbert ) from comment #9)
> Sorry for the spam, I see now.  Looks like we repo'd the timeout on the
> no-op run and we didn't on the other. 

It's actually the other way around; we repo'd a timeout problem with the higher timeout, but didn't with the no-op run.  

There are actually two patterns of failures in the older logs:  one in which a test times out, and then continues (after it's timed out), and one in which a test times out, and is killed by the harness, after which the framework continues with the next test.

We don't see any instances of the former in either of these try runs (which may be due to some recently-landed patches); we see one instance of the latter in the run with the higher timeout.  This patch was specifically targeted at the former pattern, so our inability to reproduce it may indicate that this patch isn't needed.
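The two patterns can be told apart mechanically. The sketch below works on a simplified, hypothetical event stream of `(test name, event kind)` pairs rather than on real mochitest log lines; the event kinds and labels are invented for illustration only.

```python
from typing import Iterable, List, Optional, Tuple

# (test name, event kind). The kinds "start", "timeout", "output" and
# "killed" are hypothetical stand-ins for what a real log would show.
Event = Tuple[str, str]


def classify_timeouts(events: Iterable[Event]) -> List[Tuple[str, str]]:
    """Label each timed-out test by what happened right after the timeout."""
    results: List[Tuple[str, str]] = []
    timed_out: Optional[str] = None
    for test, kind in events:
        if kind == "timeout":
            timed_out = test
        elif timed_out is not None and test == timed_out:
            if kind == "output":
                # Former pattern: the test keeps running after timing out.
                results.append((test, "continued-after-timeout"))
                timed_out = None
            elif kind == "killed":
                # Latter pattern: the harness kills the test and moves on.
                results.append((test, "killed-by-harness"))
                timed_out = None
    return results


log = [
    ("test_a", "start"), ("test_a", "timeout"), ("test_a", "output"),
    ("test_b", "start"), ("test_b", "timeout"), ("test_b", "killed"),
    ("test_c", "start"),
]
print(classify_timeouts(log))
```

Counting each label per try run would make side-by-side comparisons like the one above less dependent on reading individual logs by hand.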

Jonathan Griffin (:jgriffin)

Comment 11

•

11 years ago

Hmm, on inbound I continue to see both patterns of failure.  I'll post something to dev.b2g.

Jonathan Griffin (:jgriffin)

Comment 12

•

11 years ago

https://groups.google.com/forum/#!topic/mozilla.dev.b2g/YcyxeW0iuNg

Andrew Overholt [:overholt]

Comment 13

•

11 years ago

Does this need to block bug 884399?

Flags: needinfo?(jsmith)

Jason Smith [:jsmith]

Comment 14

•

11 years ago

Nope.

No longer blocks: b2g-central-dogfood

Flags: needinfo?(jsmith)

Ryan VanderMeulen [:RyanVM]

Reporter

Comment 15

•

9 years ago

I have a hard time believing this bug serves a useful purpose at this point.

Status: NEW → RESOLVED

Closed: 9 years ago

Resolution: --- → INCOMPLETE

Categories

(Firefox OS Graveyard :: General, defect)

Tracking

(Not tracked)

People

(Reporter: RyanVM, Unassigned)

Security

(public)