Closed Bug 624124 Opened 14 years ago Closed 13 years ago

Please re-image talos-r3-w7-048

Categories

(Infrastructure & Operations :: RelOps: General, task)

x86
Windows 7
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: coop, Assigned: zandr)

References

Details

(Whiteboard: [reimage])

bjacob just finished with talos-r3-w7-048 in bug 623278, so we should re-image this slave to get it back to a known state.
Blocks: 623274
Assignee: server-ops → zandr
zandr mind if I hold on to this slave for a little bit?

I promise I will give it back! :)
armen: WFM, assign the bug back to me when you're done.
Assignee: zandr → armenzg
Armen, can we close this? You can reopen it or file a new one when you are done.
Please go ahead and reimage.
I am done with this machine.

Thanks!

PS: Wait times for win7 machines are quite bad because of two things:
* several machines are down to re-image (since we loaned several of them)
* win7 jobs take longer

I will have to check how many are out of action but we will have to see how to improve our re-imaging turn around.
Assignee: armenzg → zandr
Component: Server Operations → Server Operations: RelEng
QA Contact: mrz → zandr
(In reply to comment #4)
 
> I will have to check how many are out of action but we will have to see how to
> improve our re-imaging turn around.

There are exactly two ways to do this:

1) Stop using Minis for OSes other than Mac OS.
2) Hire more minions.

The former scales much better than the latter.
Please ignore the previous.

There are not that many machines waiting for re-imaging (sometimes we in releng take a long time to file the bug after the loan is over). Your turnaround is good. My apologies, zandr/IT.

There were more than 15 w7 slaves out of action.
* 1 caught correctly by nagios
* a few slaves where buildbot was not running, although the machines were still PINGable. I have a solution to tackle this
* a few of them were running buildbot but were "hung", running a single job for days. I have another solution to tackle this as well

My apologies again; it was a mistake to say that. I spoke incorrectly.

See bug 627070 if you are curious about what was going on.
Though, I did miss this bug last night while I was at the colo. :D

I was just working from bug 620948, and missed this and a couple of others. Will make a short stop there again before Monday.
No worries.

Shall we have a reimages bug and add dependencies like this one to it?

I wonder if having a single point would also help us see over time how many machines we reimage. Not sure if it has much value.

It seems that we get more w7 reimages since devs book them more often.

Anyways just thinking out loud.

Have a good weekend,
Armen
Whiteboard: [reimage]
https://spreadsheets.google.com/ccc?key=0AqefQEn4Wp2ydFVjSkMwM1ZlS28xdVRaVDNHUEpLaEE&hl=en is the current best tracker, but it does not have historical information.
I have some partially formed thoughts about leveraging nagios (which already has the alert and ack history) to do all of this tracking.

Depends on some additional work in nagios to make it sane, but I'd be happy to chat about this with anyone about to embark on creating a different system. :D

Otherwise, stay tuned.
nagios scans logfiles for its history, right?  Isn't that why the historical queries are so slow?
Reimaged, needs setup.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations