Closed Bug 967816 (opened 10 years ago, closed 10 years ago)

Gaia-ui-tests on Linux nearly perma-fail since Feb 2

Categories

(Firefox OS Graveyard :: Gaia::UI Tests, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jgriffin, Unassigned)

References

Details

Since Feb 2, the gaia-ui-tests on B2G desktop on Linux have been nearly perma-failing; they time out at entirely random places.  The same tests running on OS X are not affected.

Looking at b2g-inbound, it looks like this problem began around https://tbpl.mozilla.org/?tree=Mozilla-Inbound&jobname=gaia-ui&showall=1&rev=83a3ef9b2144, but I'm doing some retriggers before and after to attempt to confirm.
(In reply to Jonathan Griffin (:jgriffin) from comment #0)
 
> Looking at b2g-inbound, it looks like this problem began around
> https://tbpl.mozilla.org/?tree=Mozilla-Inbound&jobname=gaia-ui&showall=1&rev=83a3ef9b2144,
> but I'm doing some retriggers before and after to attempt to confirm.

Er, that's on inbound.  On b2g-inbound, there's enough noise to make the situation a little unclear, but I'm requesting some retriggers there too.
If this is a crash, then bug 949028 will help when/if I can get it working.
Depends on: 949028
I'm not convinced this isn't an infrastructure issue.  Retriggers on inbound from Sunday are showing up as much redder than the original runs, which would indicate an infrastructure change that was made after those initial runs.  See e.g. https://tbpl.mozilla.org/?tree=Mozilla-Inbound&showall=1&jobname=gaia-ui&rev=fac849dd7be9 and earlier pushes, for which I've done a bunch of retriggers.

I know we moved to a different AWS instance type, but I haven't gotten a firm answer as to exactly when that happened.  Catlee, rail, can you tell us?
Flags: needinfo?(rail)
Flags: needinfo?(catlee)
s/which would indicate an infrastructure change/which _could_ indicate an infrastructure change/
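The comparison behind that observation is just the failure rate of the original runs versus later retriggers of the same pushes. A minimal sketch with hypothetical counts (not actual TBPL numbers):

    # Hypothetical counts, for illustration only -- not actual TBPL data.
    original_runs = {"total": 20, "failed": 3}      # runs done around the push date
    retriggered_runs = {"total": 20, "failed": 15}  # retriggers done later (Sunday)

    def failure_rate(runs):
        """Fraction of runs that failed."""
        return runs["failed"] / float(runs["total"])

    print("original runs:    %.0f%% failed" % (100 * failure_rate(original_runs)))
    print("retriggered runs: %.0f%% failed" % (100 * failure_rate(retriggered_runs)))

    # The same revisions failing far more often when re-run later points at a
    # change in the test environment (e.g. the slave pool), not in the tree.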
Migration from m1.medium to m3.medium happened in 2 steps:

1) on-demand slaves (tst-linux*-ec2-xxx) were switched to m3.medium around Jan 28-29
2) spot slaves (tst-linux*-spot-xxx) were switched to m3.medium after Feb 2 (http://hg.mozilla.org/build/cloud-tools/rev/6487dca66616)
Flags: needinfo?(rail)
Flags: needinfo?(catlee)
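One way to double-check which instance type a given slave pool is actually running on is to ask EC2 directly. A minimal sketch using boto 2; the region here is an assumption, and the Name tag patterns simply mirror the slave names above:

    # Sketch: list the EC2 instance types behind the Linux test slave pools.
    # Assumes boto 2.x and AWS credentials in the environment; the region and
    # the exact Name tag patterns are assumptions, not taken from this bug.
    import boto.ec2

    conn = boto.ec2.connect_to_region("us-east-1")

    for pattern in ("tst-linux*-ec2-*", "tst-linux*-spot-*"):
        instances = conn.get_only_instances(filters={"tag:Name": pattern})
        types = sorted(set(inst.instance_type for inst in instances))
        print("%-20s -> %s" % (pattern, ", ".join(types) or "no matching instances"))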
That timeline looks pretty consistent with the pattern of increased (nearly perma-red) failures we're seeing, although from the specs it's hard to see how the new instance type would be causing these problems.

One way to tell would be to switch spot instances back to m1.medium for a few days to see if our failure rate comes back down.

In tandem, Andreas Tolfsen on our team is going to investigate this on one of the on-demand slaves to see if we can get more information about the failures.
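To judge whether the failure rate really comes back down rather than just fluctuating, a rough two-proportion comparison is enough; a sketch with hypothetical counts:

    # Rough two-proportion z-test (hypothetical counts, for illustration only).
    import math

    def failure_z_score(failed_a, total_a, failed_b, total_b):
        """z-score for the difference between two failure rates."""
        p_a = failed_a / float(total_a)
        p_b = failed_b / float(total_b)
        pooled = (failed_a + failed_b) / float(total_a + total_b)
        se = math.sqrt(pooled * (1 - pooled) * (1.0 / total_a + 1.0 / total_b))
        return (p_a - p_b) / se

    # e.g. 15/20 failures on m3.medium vs 3/20 after switching back to m1.medium
    z = failure_z_score(15, 20, 3, 20)
    print("z = %.2f (|z| > 2 suggests a real change rather than noise)" % z)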
Depends on: 968300
Hidden on trunk.
Depends on: 969590
Blocks: 966070
No longer blocks: 966070
Blocks: 945981
These are green again (except for unrelated bug 970166, which I'm landing a fix for today); can we unhide them?
The failure in https://tbpl.mozilla.org/php/getParsedLog.php?id=34567438&tree=Mozilla-Inbound, now that they're back on m1.medium, looks like more than 10% from a quick glance.
Fair enough.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED