Gaia-ui-tests on Linux nearly perma-fail since Feb 2

Status: RESOLVED FIXED (opened 5 years ago; resolved 5 years ago)

People

(Reporter: jgriffin, Unassigned)


(Reporter)

Description

Since Feb 2, the gaia-ui-tests on B2G desktop on Linux have become nearly perma-fail; they time out at entirely random places. The same tests running on OS X are unaffected.

Looking at b2g-inbound, it looks like this problem began around https://tbpl.mozilla.org/?tree=Mozilla-Inbound&jobname=gaia-ui&showall=1&rev=83a3ef9b2144, but I'm doing some retriggers before and after to attempt to confirm.
(Reporter)

Comment 1

(In reply to Jonathan Griffin (:jgriffin) from comment #0)
 
> Looking at b2g-inbound, it looks like this problem began around
> https://tbpl.mozilla.org/?tree=Mozilla-Inbound&jobname=gaia-
> ui&showall=1&rev=83a3ef9b2144, but I'm doing some retriggers before and
> after to attempt to confirm.

Er, that's on inbound.  On b2g-inbound, there's enough noise to make the situation a little unclear, but I'm requesting some retriggers there too.
If this is a crash, then bug 949028 will help when/if I can get it working.
Depends on: 949028
(Reporter)

Comment 3

I'm not convinced this isn't an infrastructure issue. Retriggers on inbound from Sunday are showing up much redder than the original runs, which would indicate an infrastructure change that was made after those initial runs. See e.g. https://tbpl.mozilla.org/?tree=Mozilla-Inbound&showall=1&jobname=gaia-ui&rev=fac849dd7be9 and earlier pushes, for which I've done a bunch of retriggers.

I know we moved to a different AWS node type, but I haven't had a firm answer as to when exactly that happened.  Catlee, rail, can you tell us?
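The "retriggers are redder" argument can be made concrete by comparing failure rates on identical revisions before and after the suspected change: since the tested changesets are the same, a large jump points at the infrastructure rather than the code. A minimal sketch of that comparison, with hypothetical job data and a `failure_rate` helper of my own (not a TBPL API):

```python
def failure_rate(results):
    """Fraction of runs that failed; results is a list of 'green'/'red' outcomes."""
    if not results:
        return 0.0
    return sum(1 for r in results if r == "red") / len(results)

# Hypothetical data: original runs on a revision vs. retriggers of the
# same revision after the suspected infrastructure change.
original = ["green", "green", "red", "green", "green"]
retriggers = ["red", "red", "green", "red", "red"]

# The revision is identical in both sets, so a large positive delta
# suggests the environment changed, not the code under test.
delta = failure_rate(retriggers) - failure_rate(original)
print(f"original: {failure_rate(original):.0%}, "
      f"retriggers: {failure_rate(retriggers):.0%}, delta: {delta:+.0%}")
```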
Flags: needinfo?(rail)
Flags: needinfo?(catlee)
(Reporter)

Comment 4

s/which would indicate an infrastructure change/which _could_ indicate an infrastructure change/

Migration from m1.medium to m3.medium happened in two steps:

1) on-demand slaves (tst-linux*-ec2-xxx) were switched to m3.medium around Jan 28-29
2) spot slaves (tst-linux*-spot-xxx) were switched to m3.medium after Feb 2 (http://hg.mozilla.org/build/cloud-tools/rev/6487dca66616)
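Because the two pools migrated at different times, knowing which pool a failing slave belongs to narrows down which migration step could be responsible. A hypothetical helper using the slave-naming patterns quoted above (`slave_pool` is my own name, not an existing tool):

```python
import fnmatch

def slave_pool(hostname):
    """Classify a test slave by the naming patterns given in the comment above."""
    if fnmatch.fnmatch(hostname, "tst-linux*-spot-*"):
        return "spot"       # switched to m3.medium after Feb 2
    if fnmatch.fnmatch(hostname, "tst-linux*-ec2-*"):
        return "on-demand"  # switched to m3.medium around Jan 28-29
    return "unknown"

print(slave_pool("tst-linux64-spot-123"))  # -> spot
```

If the perma-fails cluster on spot slaves, that lines up with the Feb 2 step rather than the earlier on-demand switch.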
Flags: needinfo?(rail)
Flags: needinfo?(catlee)
(Reporter)

Comment 6

That timeline looks pretty consistent with the pattern of increased (nearly perma-red) failures we're seeing, although from the specs it's hard to see how the new instance type would cause these problems.

One way to tell would be to switch spot instances back to m1.medium for a few days to see if our failure rate comes back down.

In tandem, Andreas Tolfsen on our team is going to investigate this on one of the on-demand slaves to see if we can get more information about the failures.
(Reporter)

Updated

Depends on: 968300
Hidden on trunk.
(Reporter)

Updated

Depends on: 969590
Blocks: 966070
No longer blocks: 966070
Blocks: 945981
(Reporter)

Comment 8

These are green again (except for unrelated bug 970166, which I'm landing a fix for today); can we unhide them?

The https://tbpl.mozilla.org/php/getParsedLog.php?id=34567438&tree=Mozilla-Inbound thing, now that they're back on m1.medium, looks like more than 10% from a quick glance.

Fair enough.
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED