Closed Bug 1217493 Opened 9 years ago Closed 9 years ago

reimage 30 linux64 hosts as w7 hosts

Categories

(Infrastructure & Operations :: RelOps: General, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jmaher, Assigned: arich)

Attachments

(1 file)

We disabled all the talos hosts for linux32, so we now have 99 machines for linux64; talos does not need 99 machines.

Let's start with 30 machines and allocate them as we see fit between the Windows pools. This should help us reduce the load even more (and help me get Windows talos results faster!)
Blocks: 1217494
Why doesn't it need 99 machines, and how will we tell whether or not 66 is the right number? Looks like yesterday we started 85% of the jobs in less than 15 minutes, what percentage of the jobs do you want to have take up to half an hour before they even start running?
This is more of a juggling act: we have really long backlogs on Windows and almost no wait on linux64. It appears we run graphics as well as talos on there. Let's look at the data and see what the historic wait times are for this hardware pool. Coop was going to look at historical wait times.
Flags: needinfo?(coop)
Amy asked me to add a bit more information about which jobs run on these L64 machines:
* talos [1]
* Android x86 S4 test jobs [2]

[1] http://hg.mozilla.org/build/buildbot-configs/file/default/mozilla-tests/config.py#l141
[2] http://hg.mozilla.org/build/buildbot-configs/file/default/mozilla-tests/mobile_config.py#l123
And also let's split it like this:
* Win7 13 machines
* Win8 17 machines (a bit more backlogged atm)

That would bring us to:
* Win7 -> 202 -> 215 (1 disabled atm)
* Win8 -> 194 -> 211 (7 disabled atm)
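As a quick sanity check on the proposed split, the arithmetic above can be sketched as follows. This only restates the numbers quoted in this comment; the dictionary names are illustrative and not part of any releng tooling.

```python
# Pool sizes before the split, as quoted in the comment above.
current = {"Win7": 202, "Win8": 194}

# Proposed share of the 30 reimaged linux64 hosts.
added = {"Win7": 13, "Win8": 17}

# The two shares must add up to the 30 hosts being reimaged.
assert sum(added.values()) == 30

# Projected pool sizes after the hosts are moved over.
projected = {pool: current[pool] + added[pool] for pool in current}
print(projected)  # {'Win7': 215, 'Win8': 211}
```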
talos-linux64-ix-070 through -099 are available for reimaging.
(In reply to Armen Zambrano Gasparnian [:armenzg] from comment #4)

I think the allocation distribution is still under discussion since we might target one OS (likely w7) to beef up so we can start turning on some e10s tests. jgriffin was going to talk to the e10s team about options.
yeah I'd like to put them all in the win7 pool; post coming shortly to mozilla.tools!
Okay, I'll get started on turning these into w7 machines. I'll need some help from dcops/netops for the VLAN work before I can reinstall them.
Depends on: 1218874
These will be:

t-w732-ix-204.wintest.releng.scl3.mozilla.com
t-w732-ix-205.wintest.releng.scl3.mozilla.com
t-w732-ix-206.wintest.releng.scl3.mozilla.com
t-w732-ix-207.wintest.releng.scl3.mozilla.com
t-w732-ix-208.wintest.releng.scl3.mozilla.com
t-w732-ix-209.wintest.releng.scl3.mozilla.com
t-w732-ix-210.wintest.releng.scl3.mozilla.com
t-w732-ix-211.wintest.releng.scl3.mozilla.com
t-w732-ix-212.wintest.releng.scl3.mozilla.com
t-w732-ix-213.wintest.releng.scl3.mozilla.com
t-w732-ix-214.wintest.releng.scl3.mozilla.com
t-w732-ix-215.wintest.releng.scl3.mozilla.com
t-w732-ix-216.wintest.releng.scl3.mozilla.com
t-w732-ix-217.wintest.releng.scl3.mozilla.com
t-w732-ix-218.wintest.releng.scl3.mozilla.com
t-w732-ix-219.wintest.releng.scl3.mozilla.com
t-w732-ix-220.wintest.releng.scl3.mozilla.com
t-w732-ix-221.wintest.releng.scl3.mozilla.com
t-w732-ix-222.wintest.releng.scl3.mozilla.com
t-w732-ix-223.wintest.releng.scl3.mozilla.com
t-w732-ix-224.wintest.releng.scl3.mozilla.com
t-w732-ix-225.wintest.releng.scl3.mozilla.com
t-w732-ix-226.wintest.releng.scl3.mozilla.com
t-w732-ix-227.wintest.releng.scl3.mozilla.com
t-w732-ix-228.wintest.releng.scl3.mozilla.com
t-w732-ix-229.wintest.releng.scl3.mozilla.com
t-w732-ix-230.wintest.releng.scl3.mozilla.com
t-w732-ix-231.wintest.releng.scl3.mozilla.com
t-w732-ix-232.wintest.releng.scl3.mozilla.com
t-w732-ix-233.wintest.releng.scl3.mozilla.com
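For anyone who needs this host list in a script, here is a minimal sketch that regenerates it. The function name `w7_hostnames` is hypothetical; the numbering range and domain are taken from the list above.

```python
def w7_hostnames(first=204, last=233,
                 domain="wintest.releng.scl3.mozilla.com"):
    """Return the t-w732-ix-NNN hostnames for an inclusive number range."""
    return [f"t-w732-ix-{n}.{domain}" for n in range(first, last + 1)]


hosts = w7_hostnames()
for host in hosts:
    print(host)
```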
Assignee: relops → arich
Summary: reimage 30 linux64 hosts as w7 and w8 hosts → reimage 30 linux64 hosts as w7 hosts
updated in nagios as well.
I got some of these to install, but others don't seem to be making it through the process, and I'm not sure how to debug. Q, could you take a look at the following to make sure that they have everything installed properly:

204
205
208
210
212
213
214
218
219
220
226
227
228
229
231
232
233


And take a look at the following to see why they never complete:

206
207
209
211
215
216
217
221
222
223
224
225
230
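The two lists above should together cover all 30 new hosts (204 through 233) with no overlap. A small sketch to confirm that, using only the host numbers from this comment (the variable names are illustrative):

```python
# Hosts that appeared to install, per the first list above.
installed = {204, 205, 208, 210, 212, 213, 214, 218, 219, 220,
             226, 227, 228, 229, 231, 232, 233}

# Hosts that never complete the install, per the second list above.
stuck = {206, 207, 209, 211, 215, 216, 217, 221, 222, 223,
         224, 225, 230}

all_hosts = set(range(204, 234))  # 204..233 inclusive

# No host should be in both lists, and none should be missing.
assert installed.isdisjoint(stuck)
assert installed | stuck == all_hosts

print(len(installed), "installed;", len(stuck), "stuck")
```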
Flags: needinfo?(q)
Finally found something useful.

These machines are BSODing with a stop error of STOP: 0x00000019 when trying to reboot. The description of this error is "The issue occurs because the Hdaudio.sys driver tries to process the audio information when the uninitialized internal firmware memory within the video card is set in a certain manner."

This error, combined with a vague log entry of "invalid gpu version", seems to be coming from something in the firmware on the video cards.
Flags: needinfo?(q)
Depends on: 1219361
The following should be ready to go into the pool, according to Q:

t-w732-ix-204.wintest.releng.scl3.mozilla.com
t-w732-ix-205.wintest.releng.scl3.mozilla.com
t-w732-ix-208.wintest.releng.scl3.mozilla.com
t-w732-ix-210.wintest.releng.scl3.mozilla.com
t-w732-ix-212.wintest.releng.scl3.mozilla.com
t-w732-ix-213.wintest.releng.scl3.mozilla.com
t-w732-ix-214.wintest.releng.scl3.mozilla.com
t-w732-ix-218.wintest.releng.scl3.mozilla.com
t-w732-ix-219.wintest.releng.scl3.mozilla.com
t-w732-ix-220.wintest.releng.scl3.mozilla.com
t-w732-ix-226.wintest.releng.scl3.mozilla.com
t-w732-ix-227.wintest.releng.scl3.mozilla.com
t-w732-ix-228.wintest.releng.scl3.mozilla.com
t-w732-ix-229.wintest.releng.scl3.mozilla.com
t-w732-ix-231.wintest.releng.scl3.mozilla.com
t-w732-ix-232.wintest.releng.scl3.mozilla.com
t-w732-ix-233.wintest.releng.scl3.mozilla.com
(In reply to Joel Maher (:jmaher) from comment #2)
> this is more of a juggling act- we have really long backlogs on windows, and
> almost no wait on linux64.  It appears we run graphics as well as talos on
> there.  Lets look at data and see what the historic wait times are for this
> hardware pool.  Coop was going to look at historical wait times.

I've done this now. I still have reservations about this approach, but we are really stuck for options here.
My biggest worry is that we're cannibalizing one of the few platforms where we currently *don't* have massive wait times for gfx performance testing. 

Going through the historical data, we've gone over 2000 test jobs for ubuntu64_hw 23 times. Sometimes that causes us to miss our 95% commitment, but mostly it doesn't. It really depends on whether the requests all come in at once or not, i.e. is the load "bursty" or not.

Does this get worse when we take away a third of the pool? Likely, but we have few alternatives.
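To make the wait-time discussion concrete, here is a rough sketch of how the share of jobs starting within a threshold could be computed. The 15-minute threshold comes from the earlier comment in this bug; the function name and the sample wait times are made up for illustration and are not real data from the ubuntu64_hw pool.

```python
def fraction_started_within(wait_minutes, threshold=15):
    """Return the share of jobs whose wait was under `threshold` minutes."""
    if not wait_minutes:
        raise ValueError("need at least one wait time")
    on_time = sum(1 for wait in wait_minutes if wait < threshold)
    return on_time / len(wait_minutes)


# Made-up wait times in minutes, purely for illustration.
sample_waits = [2, 5, 40, 12, 8, 16, 1, 3, 9, 14]

share = fraction_started_within(sample_waits)
print(f"{share:.0%} of jobs started in under 15 minutes")
```

Running the same calculation on historical wait data for both the full pool and a pool reduced by 30 machines would be one way to estimate how much the reduction hurts during bursty load.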
Flags: needinfo?(coop)
Q: can you verify that the following installed correctly? All but 224 and 230 had their graphics cards swapped out (we were using non-pingable as an indicator of BSOD issues with the graphics card, possibly firmware incompatibilities):

t-w732-ix-206.wintest.releng.scl3.mozilla.com
t-w732-ix-207.wintest.releng.scl3.mozilla.com
t-w732-ix-209.wintest.releng.scl3.mozilla.com
t-w732-ix-211.wintest.releng.scl3.mozilla.com
t-w732-ix-215.wintest.releng.scl3.mozilla.com
t-w732-ix-216.wintest.releng.scl3.mozilla.com
t-w732-ix-217.wintest.releng.scl3.mozilla.com
t-w732-ix-221.wintest.releng.scl3.mozilla.com
t-w732-ix-222.wintest.releng.scl3.mozilla.com
t-w732-ix-224.wintest.releng.scl3.mozilla.com
t-w732-ix-230.wintest.releng.scl3.mozilla.com

sal replaced the graphics cards in the following and kicked off a reimage, but they hadn't finished yet. There's a good chance they'll be done by the time you get to this, so please give them a look, too.

t-w732-ix-223.wintest.releng.scl3.mozilla.com
t-w732-ix-225.wintest.releng.scl3.mozilla.com
Flags: needinfo?(q)
Checked:

t-w732-ix-206.wintest.releng.scl3.mozilla.com
t-w732-ix-207.wintest.releng.scl3.mozilla.com
t-w732-ix-209.wintest.releng.scl3.mozilla.com
t-w732-ix-211.wintest.releng.scl3.mozilla.com
t-w732-ix-215.wintest.releng.scl3.mozilla.com
t-w732-ix-216.wintest.releng.scl3.mozilla.com
t-w732-ix-217.wintest.releng.scl3.mozilla.com


All look good. They need to be enabled and rebooted to be added to the pool.
Flags: needinfo?(q)
221, 222, and 225 are all good now.
Q: so there are issues with 223 and 230?
Flags: needinfo?(q)
I've rebooted and added all but 223 and 230 into the pool.
Attached image 223-monitor-moz.jpg
223 says there is a monitor attached. That state has persisted past one reboot. Rebooting again just in case.
230 was offline. Brought back up and checking now. 

In other news, tests are looking green on the boxes that already went into the pool.
Flags: needinfo?(q)
223 is back up and looks good without a reported monitor. I am adding it to the slave pool and rebooting.
All machines are up and in the slave pool.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Thanks Q and Amy!