Closed Bug 893154 Opened 6 years ago Closed 4 years ago

Find and fix the underlying cause of B2G mochitest timeouts


(Firefox OS Graveyard :: General, defect)

Gonk (Firefox OS)
Not set


(Not tracked)



(Reporter: RyanVM, Unassigned)



I was recently discussing with Clint and Jonathan the frequency with which we hit timeouts during the B2G mochitest suite and how it has become frequent enough that we often don't even bother filing new bugs for each individual failure anymore. It has gotten to the point where failures are starred as "b;r" and retriggered, similar to what was done in prior times on Android.

However, there are some bugs on file tracking some of the more frequent occurrences of these failures (roughly in order of frequency based on what OrangeFactor thinks):
846105, 824056, 864714, 878916, 877938, 879252, 874606, 891346, 881448, 845621, 863945, 856868, 857325, 891840

I also know that we see them pretty often in the editor imptests, but I don't see a bug readily on file for that.

Also, these are easy to find on TBPL by looking at any of m-c, inbound, birch, or fx-team. Just look for starred B2G mochitests and it'll probably be a 330 second timeout.

Note that this bug only covers the mochitest situation. We should probably have another on file for the number of socket.timeouts we hit in the Marionette test suite.
Several of these logs seem to share a common failure mode.  They all involve tests that are normally long-running (100s or more) and that occasionally freeze temporarily, which triggers the 330s timeout.  The tests later unfreeze, but by that point the test run has aborted.

I'm guessing the freeze is caused by GC.  The fix is probably to give a much longer timeout for tests on B2G.
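To illustrate the failure mode described above, here is a minimal sketch, assuming the 330s limit behaves as a no-output watchdog (the run is aborted when the gap between consecutive pieces of test output exceeds the limit). The function name and the sample timestamps are hypothetical and are not taken from the harness or from any of the logs.

```python
from typing import Optional, Sequence


def first_inactivity_timeout(
    output_times: Sequence[float], timeout_s: float = 330.0
) -> Optional[float]:
    """Return the time at which a no-output watchdog would fire, or None.

    output_times: ascending timestamps (seconds since test start) at which
    the test produced output. The test is assumed to start at t=0.
    """
    last = 0.0
    for t in output_times:
        # A gap longer than the limit means the watchdog fired before this
        # output arrived, even though the test did eventually resume.
        if t - last > timeout_s:
            return last + timeout_s
        last = t
    return None


# Hypothetical long-running test: output every 60s, then a 400s freeze.
times = [60.0, 120.0, 180.0, 580.0, 640.0]
print(first_inactivity_timeout(times))         # watchdog fires during the freeze
print(first_inactivity_timeout(times, 600.0))  # a longer limit rides it out
```

Under these assumptions, a longer timeout only helps if the freeze is shorter than the new limit; it does nothing about the cause of the freeze itself.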
This try job got eaten, so I'll have to submit another.
No-op try run for side-by-side comparison, per request from RyanVM:
Those both look pretty green. What am I missing?
Some recent GC changes in gecko seem to have made this problem (mostly?) go away.
(I'm thinking this patch is worth landing anyway...)
Sorry for the spam, I see now.  Looks like we repo'd the timeout on the no-op run and we didn't on the other. While that's promising, I was hoping for a clearer difference. I also saw some work nbp recently did in bug 876029, which may have helped ensure our threads are responding to memory pressure more quickly.  Instances of the "1200s timeout" bug on OrangeFactor have also dropped off since that checkin (876029 landed July 10): 

It's hard to read these tea leaves and see if a longer timeout is wallpapering over a real issue, or if it's what we need to do now to get us to a more reliable green state while we track memory usage/pressure over time on the endurance tests or areweslimyet.

I'm leaning toward landing the patch and monitoring outside of the tests.
(In reply to Clint Talbert ( :ctalbert ) from comment #9)
> Sorry for the spam, I see now.  Looks like we repo'd the timeout on the
> no-op run and we didn't on the other. 

It's actually the other way around; we repo'd a timeout problem with the higher timeout, but didn't with the no-op run.  

There are actually two patterns of failures in the older logs:  one in which a test times out and then continues (after it's timed out), and one in which a test times out and is killed by the harness, after which the framework continues with the next test.

We don't see any instances of the former in either of these try runs (which may be due to some recently-landed patches); we see one instance of the latter in the run with the higher timeout.  This patch was specifically targeted at the former pattern, so our inability to reproduce it may indicate that this patch isn't needed.
Hmm, on inbound I continue to see both patterns of failure.  I'll post something to dev.b2g.
Does this need to block bug 884399?
Flags: needinfo?(jsmith)
No longer blocks: b2g-central-dogfood
Flags: needinfo?(jsmith)
I have a hard time believing this bug serves a useful purpose at this point.
Closed: 4 years ago
Resolution: --- → INCOMPLETE