Closed Bug 969590 Opened 11 years ago Closed 11 years ago

Temporarily revert the change to m3.medium AWS instances to see if they are behind the recent increase in test timeouts

Categories

(Release Engineering :: General, defect)

Platform: x86 Linux
Type: defect
Priority: Not set
Severity: major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: RyanVM, Assigned: rail)

References

Details

Within the last week, we've seen a dramatic increase in test timeouts on Linux64. The timeframe appears to line up with when we switched from m1 instances to m3 instances. To test that theory, we would like to temporarily revert to m1 instances to see whether the rate of timeouts decreases.
Can you look at switching back from m3.medium for a day?
Flags: needinfo?(rail)
Blocks: 777574
Blocks: 926264
The two easier-to-spot failures should be bug 777574, Linux64 ASan webgl timeouts, which went from a couple of times a week to 30 times a day since the afternoon of January 28th, and bug 926264, Linux64 Jetpack shutdown hangs, which were all-platform and once a week until the afternoon of January 28th, when they became 40 or 50 a day.
Another very likely candidate is bug 967816, in which gaia-ui-tests on linux64 became nearly permafail on Feb 3, which is the date that spot instances were converted to m3.medium. The same tests running on osx didn't experience a change in failure rates.
Blocks: 967816
I have a couple of questions here and some input. Converting the instances back to m1.medium is not a big deal; it will just take some time, and we will spend some time with a mixed pool of m1 and m3 instances. The conversion happens whenever we start instances (not on reboot). 1) Do we want to switch all tst-linux64 instances to m1.medium (spot + on-demand)? 2) When do we want to start switching? Are we OK starting this weekend?
Flags: needinfo?(rail)
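To illustrate the mechanics (a minimal sketch, not the actual cloud-tools code, which predates boto3): EC2 only allows changing an instance's type while it is stopped, which is why the conversion can happen whenever an instance is started but not on a plain reboot. The helper name, region and target type below are placeholders.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an assumption

def set_type_before_start(instance_id, target_type="m1.medium"):
    # The type can only be modified while the instance is stopped,
    # so the switch happens just before the instance is started again.
    ec2.modify_instance_attribute(
        InstanceId=instance_id,
        InstanceType={"Value": target_type},
    )
    ec2.start_instances(InstanceIds=[instance_id])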
I think it would make a clearer signal if we made the change on Monday, when there is likely to be heavier commit traffic. We'd also see the clearest signal if we changed both spot and on-demand instances, but if it's easier just to handle spot instances, I imagine we could see enough of a signal to tell whether additional investigation in this direction is warranted. If we go with the spot-only approach, how long would it take before all of the spot instances were running on the old node type? We'd want to wait about a day after we got to that state before attempting to make a call as to whether the node type change is the culprit.
(In reply to Jonathan Griffin (:jgriffin) from comment #6)
> I think it would make a clearer signal if we made the change on Monday, when
> there is likely to be heavier commit traffic.

WFM

> If we go with the spot-only approach, how long would it take before all of
> the spot instances were running on the old node type? We'd want to wait
> about a day after we got to that state before attempting to make a call as
> to whether the node type change is the culprit.

I have no precise figures here, unfortunately... I can start the conversion on Sunday evening when things are quiet, so we get a faster turnaround and probably have everything ready by Monday morning.
Assignee: nobody → rail
Thanks Rail!
To make the plan clear:
* we are going to switch to m1.medium for spot instances only
* the process will be started this Sunday
* we evaluate the results on Tuesday

Anything missing?
That sounds like a great plan to me.
Blocks: 966772
Blocks: 966796
Blocks: 966806
While adding those dependencies, I ran through the intermittent-failure bugs I've filed since January 28th (and especially on February 2nd), and there are lots more one-off failures on Linux64 besides those; I've since stopped filing them, just blowing off 10 or 30 per day. So that's another thing to watch for disappearing from spot but not from on-demand: random timeouts in tests that have never timed out before.
Blocks: 966070
Favoritest one: bug 965534 failed only on on-demand slaves between January 29 and February 3, when it began also failing on spot slaves.
Blocks: 965534
As of now: m3.medium: 8 / m1.medium: 385. I'm going to monitor the remaining 8 VMs and terminate them on reboot.
m3.medium: 0 / m1.medium: 382
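For reference, a breakdown like the counts in these status updates can be pulled from EC2's describe_instances call. A minimal sketch, assuming boto3; the "moz-type: tst-linux64" tag filter and the region are assumptions rather than the actual cloud-tools conventions:

from collections import Counter

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an assumption

def instance_type_breakdown():
    # Count pending/running test instances by type, split into spot and
    # on-demand via the InstanceLifecycle field (absent for on-demand).
    counts = Counter()
    paginator = ec2.get_paginator("describe_instances")
    pages = paginator.paginate(Filters=[
        {"Name": "tag:moz-type", "Values": ["tst-linux64"]},  # hypothetical tag
        {"Name": "instance-state-name", "Values": ["pending", "running"]},
    ])
    for page in pages:
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                lifecycle = instance.get("InstanceLifecycle", "on-demand")
                counts[(instance["InstanceType"], lifecycle)] += 1
    return counts

for (itype, lifecycle), n in sorted(instance_type_breakdown().items()):
    print("%s (%s): %d" % (itype, lifecycle, n))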
(In reply to Jonathan Griffin (:jgriffin) from comment #3)
> Another very likely candidate is bug 967816, in which gaia-ui-tests on
> linux64 became nearly permafail on Feb 3, which is the date that spot
> instances were converted to m3.medium. The same tests running on osx didn't
> experience a change in failure rates.

This looks very promising so far. There have been 0 of these Gu timeouts on b2g-inbound and mozilla-inbound today on spot instances; the only occurrences have been on on-demand instances.
Yeah, the only thing that keeps it from being a no-questions absolutely perfect success is the nasty surprise of bug 970239 being m1.medium only, but that's still a deal I'd take in a heartbeat: throw away two tests in order to get back the entire suites and platforms that we hid over this.
Blocks: 970239
So, I declare this experiment a success (comment #17 notwithstanding). Rail, can we move the on-demand instances back to m1.medium too? Then, we can unhide several test suites that have been hidden since around Feb 3. Since the new node type is more efficient, we should try to acquire some engineering resources to help identify the source of the hangs.
I pushed http://hg.mozilla.org/build/cloud-tools/rev/1e8ba299c4ab#l1.42 to let automation change the instance type of newly started instances. This may take some time. I'll take a look at the instance type breakdown tomorrow.
m3.medium: 88 / m1.medium: 167
No longer blocks: 966070
This is done now. Zarro m3.medium instances.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Depends on: 972073
So, after failing to reproduce bug 926264 on a freshly installed m3.medium with Ubuntu 12.04, Xvfb and Unity, I tried on a loaner, in case my fresh install didn't quite match our actual test environment. Guess what? I've been equally unable to reproduce it. Whatever is happening on m3.mediums that made us back out this change is not triggered by repeatedly running the same test. I'm afraid the only way to find what's wrong here would be to catch those timeouts while they are happening... in production. Could it perhaps be possible to set up duplicate test jobs on a small pool of m3.mediums?
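A loop for hammering a single test on such a loaner might look roughly like the sketch below; the test command, display number, timeout and iteration count are hypothetical placeholders, and subprocess.run is used for brevity rather than whatever harness was actually involved.

import os
import subprocess
import sys

TEST_CMD = ["python", "run_test.py", "--test", "jetpack-shutdown"]  # hypothetical
TIMEOUT_SECONDS = 600  # assumed hang threshold
ITERATIONS = 200       # assumed repeat count

env = dict(os.environ, DISPLAY=":0")  # assumes Xvfb is already running on :0
hangs = 0
for i in range(ITERATIONS):
    try:
        # Run the test once, killing it if it exceeds the timeout.
        subprocess.run(TEST_CMD, env=env, timeout=TIMEOUT_SECONDS, check=False)
    except subprocess.TimeoutExpired:
        hangs += 1
        print("iteration %d timed out" % i, file=sys.stderr)

print("%d/%d iterations hung" % (hangs, ITERATIONS))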
Component: General Automation → General