Since Feb 2, the gaia-ui-tests on b2g desktop on linux have become nearly perma-fail; they time out at entirely random places. The same tests running on osx are not affected. Looking at b2g-inbound, it looks like this problem began around https://tbpl.mozilla.org/?tree=Mozilla-Inbound&jobname=gaia-ui&showall=1&rev=83a3ef9b2144, but I'm doing some retriggers before and after to attempt to confirm.
(In reply to Jonathan Griffin (:jgriffin) from comment #0)
> Looking at b2g-inbound, it looks like this problem began around
> https://tbpl.mozilla.org/?tree=Mozilla-Inbound&jobname=gaia-ui&showall=1&rev=83a3ef9b2144,
> but I'm doing some retriggers before and after to attempt to confirm.

Er, that's on inbound. On b2g-inbound, there's enough noise to make the situation a little unclear, but I'm requesting some retriggers there too.
I'm starting to suspect this is an infrastructure issue. Retriggers on inbound from Sunday are showing as much redder than the original runs, which would indicate an infrastructure change that was made after those initial runs. See e.g. https://tbpl.mozilla.org/?tree=Mozilla-Inbound&showall=1&jobname=gaia-ui&rev=fac849dd7be9 and earlier pushes, for which I've done a bunch of retriggers. I know we moved to a different AWS node type, but I haven't had a firm answer as to when exactly that happened. Catlee, rail, can you tell us?
s/which would indicate an infrastructure change/which _could_ indicate an infrastructure change/
Migration from m1.medium to m3.medium happened in 2 steps:

1) on-demand slaves (tst-linux*-ec2-xxx) were switched to m3.medium around Jan 28-29
2) spot slaves (tst-linux*-spot-xxx) were switched to m3.medium after Feb 2 (http://hg.mozilla.org/build/cloud-tools/rev/6487dca66616)
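When cross-referencing failure timestamps against the migration, it helps to know which pool a given slave belongs to. A small hypothetical helper (the hostname patterns are taken from the naming convention in the comment above; this is a sketch, not anything in cloud-tools):

```python
import re

# Hostname conventions from the migration notes above (assumed, for illustration):
#   tst-linux*-ec2-NNN  -> on-demand slaves, moved to m3.medium around Jan 28-29
#   tst-linux*-spot-NNN -> spot slaves, moved to m3.medium after Feb 2
SLAVE_RE = re.compile(r"^tst-linux\w*-(ec2|spot)-\d+$")

def slave_pool(hostname):
    """Return 'on-demand', 'spot', or None for an unrecognized slave name."""
    m = SLAVE_RE.match(hostname)
    if not m:
        return None
    return "on-demand" if m.group(1) == "ec2" else "spot"
```

With that, failures on tst-linux64-ec2-* machines before Feb 2 would already be on m3.medium, while tst-linux64-spot-* failures only start reflecting the new instance type after Feb 2, which lines up with when the perma-fail began.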
That timeline looks pretty consistent with the pattern of increased (nearly permared) failures we're seeing, although from the specs it's hard to see how the new instance type would be causing these problems. One way to tell would be to switch spot instances back to m1.medium for a few days to see if our failure rate comes back down. In tandem, Andreas Tolfsen on our team is going to investigate this on one of the on-demand slaves to see if we can get more information about the failures.
Hidden on trunk.
These are green again (except for unrelated bug 970166, which I'm landing a fix for today); can we unhide them?
Now that they're back on m1.medium, the failure in https://tbpl.mozilla.org/php/getParsedLog.php?id=34567438&tree=Mozilla-Inbound still looks like it occurs more than 10% of the time, from a quick glance.
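For a number that's better than a quick glance, the failure rate can be tallied from retrigger outcomes per push. A minimal sketch (the run outcomes below are made up for illustration, not taken from TBPL):

```python
def failure_rate(results):
    """Fraction of runs that failed; results is a list of 'green'/'red' outcomes."""
    if not results:
        return 0.0
    return results.count("red") / len(results)

# Hypothetical retrigger outcomes for one push (not real TBPL data):
runs = ["green", "red", "green", "green", "green",
        "green", "green", "red", "green", "green"]
rate = failure_rate(runs)  # 2 red out of 10 runs -> 0.2
```

Anything consistently above roughly 10% would argue for keeping the suite hidden until the remaining failure is fixed.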
Status: NEW → RESOLVED
Resolution: --- → FIXED