Closed Bug 991236 Opened 11 years ago Closed 10 years ago

Fix StartTalos.bat and StartBuildbot.bat Scripts and update repos

Categories

(Infrastructure & Operations :: RelOps: General, task)

x86_64
All
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: q, Assigned: q)

References

Details

Attachments

(1 file, 1 obsolete file)

1.25 KB, text/plain
armenzg
: review+
Details
No description provided.
Assignee: relops → q
Blocks: 977341
Depends on: 987152
What is the actual fix? Remove /t 0 ?
Getting these batch scripts fixed will take care of a few outstanding problems:
1) runslave exiting out and no one knowing why
2) cleaning of temp dirs and possibly profiles (full directories cause tests to time out, and polluted profiles may skew results)
3) XP tester slaves failing to exit the batches correctly after a test run.
I will attach a new cross-platform (Win7/8/XP) startTalos.bat script to this bug for review and then move on to startbuildbot.bat.
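A minimal sketch of what the temp/profile cleanup in such a cross-platform script could look like; the directories below are assumed examples, not the contents of the script attached later:

rem --- cleanup sketch (assumed paths) ---
rem Empty the slave's temp directory so full directories don't time tests out.
if exist "%TEMP%" (
    del /q /f "%TEMP%\*.*" 2>nul
    for /d %%D in ("%TEMP%\*") do rmdir /s /q "%%D"
)
rem Remove a stale talos profile so polluted profiles don't skew results.
rem The profile location is an assumed example only.
if exist "C:\slave\talos-data\talos\profile" rmdir /s /q "C:\slave\talos-data\talos\profile"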
Blocks: 918507
Attached file start_talos_new.bat (obsolete) —
New Talos .bat for cleanup and logging, written with input from releng, to replace the platform-specific and outdated .bat script.
Attachment #8405571 - Flags: review?(armenzg)
Comment on attachment 8405571 [details] start_talos_new.bat
My only off-the-cuff comment would be that I fear runslave.log getting too long. Can we do *something* like:
mv runslave.log runslave.log.old
tail -n500 runslave.log.old > runslave.log
at the start, to essentially trim the existing log?
Comment on attachment 8405571 [details] start_talos_new.bat
Q: this makes sense:
mv runslave.log runslave.log.old
tail -n500 runslave.log.old > runslave.log
On another note, could we deploy this change a few machines at a time? I fear we don't know how long the rmdir will take. We should also coordinate with buildduty so they know which machines to ignore if they are not taking jobs for a while. Thanks Q!
Attachment #8405571 - Flags: review?(armenzg) → review+
How about:
mv runslave.log.old runslave.log.old.1
mv runslave.log runslave.log.old
Then we keep two runs and we aren't dependent on the GNU port of tail? We can assign to a few machines at a time. I can also run a background find to clean up files in those directories with an atime older than X days and delete them. That should be fairly low overhead and safe. Q
We have tail on the machines under C:\mozilla-build\msys\bin, IIUC. Whatever you prefer; that second approach would only keep track of the last two runs. I'm worried about background removals, as I don't know how they could affect running jobs or perf jobs (I assume nice -19 would be OK if we had it). If we do batches of machines (5-10 at a time), I don't think we would need to worry much about using background removals. It would be like taking machines down for a bit for maintenance. Does this work for you?
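A sketch of what that background find could look like, using the msys GNU find rather than Windows' own find.exe; the target directory, the 7-day atime window, and the low-priority background start are assumed example values:

rem Background cleanup of files not accessed in the last 7 days (assumed values).
set MSYSBIN=C:\mozilla-build\msys\bin
rem Using the full msys path avoids picking up Windows' unrelated find.exe from the PATH.
start "" /b /low "%MSYSBIN%\find.exe" C:/slave/talos-data -type f -atime +7 -delete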
Attached file startTalos.bat
Added in the tail roll (after much debate, I don't mind being bound to msys tools). Also added a variable block and comments.
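For illustration, a sketch of that layout (variable block up front, then the mv/tail log roll); the variable names, paths, and retained line count are assumed examples rather than the attachment's actual contents:

rem ----- variable block (assumed names and paths) -----
set MSYSBIN=C:\mozilla-build\msys\bin
set SLAVEDIR=C:\slave
set LOGFILE=%SLAVEDIR%\runslave.log
set KEEPLINES=500

rem ----- roll and trim the slave log before starting the run -----
if exist "%LOGFILE%" (
    "%MSYSBIN%\mv.exe" -f "%LOGFILE%" "%LOGFILE%.old"
    "%MSYSBIN%\tail.exe" -n %KEEPLINES% "%LOGFILE%.old" > "%LOGFILE%"
)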
Attachment #8405571 - Attachment is obsolete: true
Attachment #8408409 - Flags: review?(armenzg)
Comment on attachment 8408409 [details] startTalos.bat
It looks good. Could we deploy this change to 30 machines a day? I know it sucks, but it will ensure that we don't cause delays, depending on how long this runs. We might even get a rough idea from the first batch of whether it is that much of an impact.
Attachment #8408409 - Flags: review?(armenzg) → review+
I'm presuming rolling this out will mean doing manual-foo on each box? If so, would it make sense to combine it with the rollout of bug 961075?
No manual intervention. We can select a subset of machines through Windows' GPO.
I will roll this out to the first 10 machines (001 - 010) in each pool (XP, 7, and 8), starting with the next reboot. Does that work for everyone?
WFM. BTW, I meant 30 from each test pool. Let's see how these 10 do and gear up for larger batches on the following sets? Could you please comment in here with the time when this gets deployed to the machines? I would like to look into how long it takes them to come back from their last job. Roughly. Thanks Q!
t-w864-ix-003 is loaned (though probably a no-longer-used loan), t-xp32-ix-008 is disabled, and you have t-w732-ix-003 and t-w732-ix-004, so that'll be fewer than 10 and perhaps a bit of a surprise for the loaner.
How about we start with 10 - 20? Q
Much better looking span, only missing the busted and disabled t-w864-ix-020.
Great, those machines should pick up the changes on the next reboot.
OK I will review them in the next couple of hours.
To clarify, it was easier to pattern match *-IX*-01*, so machines 010 - 019 in each OS pool will get the update.
I won't be able to evaluate how many we can do in every batch, as the machines have not picked up a job since the change got deployed. I hope they will pick up a job sometime later in the day once the cleanup finishes. If anyone wants to figure it out in my absence, this is what I was going to do:
* Load these 3 pages:
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slavetype.html?class=test&type=t-xp32-ix
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slavetype.html?class=test&type=t-w732-ix
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slavetype.html?class=test&type=t-w864-ix
* Sort the slaves by name
* Open each slave in the range indicated in comment 19
** e.g. https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=t-xp32-ix&name=t-xp32-ix-010
* Look at the end time of the last job that finished around 3pm PT
* Figure out how much of a gap there is to the next set of jobs
I hope the gap is not hours. Thanks Q.
Based on IRC conversations I think we are ready to roll this out on all testers. Any objections?
[09:59] <RyanVM> jlund|buildduty: we didn't see any test bustage this time around either
[09:59] <RyanVM> so I'd be OK with a wider rollout
[10:00] <RyanVM> Q: once the Windows slaves are good to go with this cleanup work, I'd love the OSX slaves to get it next
[10:00] <jlund|buildduty> RyanVM: thanks, verifying with you guys should have been my 1st step. :)
Rolling out pool-wide. Testers will get the changes on the next reboot.
Severity: critical → normal
Has this been fixed now? I'm asking because I want to re-enable a test in bug 918507.
This has indeed been fixed.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED