Closed
Bug 991236
Opened 11 years ago
Closed 10 years ago
Fix StartTalos.bat and StartBuildbot.bat Scripts and update repos
Categories
(Infrastructure & Operations :: RelOps: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: q, Assigned: q)
References
Details
Attachments
(1 file, 1 obsolete file)
No description provided.
Comment 1•11 years ago
What is the actual fix?
Remove /t 0?
Getting these batch scripts fixed will take care of a few outstanding problems:
1) Runslave exiting out and no-one knows why
2) Cleaning of temp dirs and possibly profiles (full directories cause tests to time out and polluted profiles may skew results)
3) XP tester slaves failing to exit the batches correctly after a test run.
I will attach a new cross-platform (Win7/8/XP) startTalos.bat script to this bug for review and then move on to startbuildbot.bat.
New Talos .bat for cleanup and logging, with input from RelEng, to replace the platform-specific and outdated .bat script
Attachment #8405571 - Flags: review?(armenzg)
Comment 4•11 years ago
Comment on attachment 8405571 [details]
start_talos_new.bat
My only off-the-cuff comment would be: I fear runslave.log getting too long.
Can we do *something* like:
mv runslave.log runslave.log.old
tail -n500 runslave.log.old > runslave.log
at the start, to essentially trim the existing log?
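A minimal POSIX-shell sketch of that rotate-and-trim (msys ships mv and tail; the 500-line cutoff is the reviewer's suggestion, and the existence check just skips the first run, when no log exists yet):

```shell
# Rotate runslave.log and keep only its last 500 lines.
log=runslave.log
if [ -f "$log" ]; then
  mv "$log" "$log.old"             # stash the full previous log
  tail -n 500 "$log.old" > "$log"  # seed the new log with recent history
fi
```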
Comment 5•11 years ago
Comment on attachment 8405571 [details]
start_talos_new.bat
Q: this makes sense:
mv runslave.log runslave.log.old
tail -n500 runslave.log.old > runslave.log
On another note, could we deploy this change a few machines at a time?
I fear we don't know how long the rmdir will take.
We should also coordinate with buildduty so they know which machines to ignore if they are not taking jobs for a while.
Thanks Q!
Attachment #8405571 - Flags: review?(armenzg) → review+
How about:
mv runslave.log.old runslave.log.old.1
mv runslave.log runslave.log.old
Then we keep two runs and we aren't dependent on the GNU port of tail?
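That tail-free variant could be sketched like this; it keeps the last two runs' logs without depending on anything beyond mv:

```shell
# Two-generation log rotation: the previous run's log becomes .old.1,
# the current log becomes .old, and a fresh runslave.log starts empty.
log=runslave.log
if [ -f "$log.old" ]; then mv "$log.old" "$log.old.1"; fi
if [ -f "$log" ]; then mv "$log" "$log.old"; fi
```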
We can assign to a few machines at a time. I can also run a background find to clean up files in those directories with an atime older than X days and delete them. That should be fairly low overhead and safe.
Q
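A sketch of that background find cleanup; the directory path and the 7-day cutoff are illustrative placeholders, not values from this bug:

```shell
# Delete files whose last access time is older than 7 days under a
# hypothetical temp directory, at reduced priority so running jobs
# are not starved. Path and cutoff are placeholders.
TMPDIR_TO_CLEAN="/c/slave/temp"
nice find "$TMPDIR_TO_CLEAN" -type f -atime +7 -delete &
```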
Comment 7•11 years ago
We have tail on the machines under C:\mozilla-build\msys\bin IIUC
Whatever you prefer. That second approach would only keep track of the last two runs.
I'm worried about background removals, as I don't know how they could affect running jobs or perf jobs (I assume nice -19 would be OK if we had it).
If we do batches of machines (5-10 at a time) I don't think we would need to worry much about using background removals. It would be like taking machines down for a bit for maintenance.
Does this work for you?
Added in the tail roll (after much debate I don't mind being bound to msys tools).
Also added a variable block and added comments.
Attachment #8405571 - Attachment is obsolete: true
Attachment #8408409 - Flags: review?(armenzg)
Comment 9•11 years ago
Comment on attachment 8408409 [details]
startTalos.bat
It looks good.
Could we deploy this change to 30 machines a day?
I know it sucks but it will ensure that we don't cause delays depending on how long this runs.
We might even get a rough idea with the first batch on if it is that much of an impact.
Attachment #8408409 - Flags: review?(armenzg) → review+
Comment 10•11 years ago
I'm presuming rolling this out will mean doing manual-foo on each box? If so, would it make sense to combine it with the rollout of bug 961075?
Comment 11•11 years ago
No manual intervention.
We can select a subset of machines through Windows' GPO.
Assignee
Comment 12•11 years ago
I will roll this out to the first 10 machines (001 - 010) in each pool (XP, 7, and 8), starting with the next reboot. Does that work for everyone?
Comment 13•11 years ago
WFM. BTW, I meant 30 from each test pool. Let's see how these 10 do and gear up for larger batches on the following sets?
Could you please comment in here with the time when this gets deployed to the machines?
I would like to look into how long it takes them to come back from their last job. Roughly.
Thanks Q!
Comment 14•11 years ago
t-w864-ix-003 is loaned (though probably a no-longer-used loan), t-xp32-ix-008 is disabled, and you have t-w732-ix-003 and t-w732-ix-004, so that'll be fewer than 10 and perhaps a bit of a surprise for the loaner.
Assignee
Comment 15•11 years ago
How about we start with 10 - 20?
Q
Comment 16•11 years ago
Much better-looking span; only missing the busted and disabled t-w864-ix-020.
Assignee
Comment 17•11 years ago
Great, those machines should pick up the changes on the next reboot.
Comment 18•11 years ago
OK I will review them in the next couple of hours.
Assignee
Comment 19•11 years ago
To clarify, it was easier to pattern match *-IX*-01*, so machines 010 - 019 in each OS pool will get the update.
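As a rough illustration of what that glob selects (the case-insensitive match mimicking GPO name matching is an assumption, and t-w732-ix-015 is an illustrative in-range name):

```shell
# Check which slave names the pattern *-IX*-01* would select.
for host in t-xp32-ix-009 t-xp32-ix-010 t-w732-ix-015 t-w864-ix-019 t-w864-ix-020; do
  case "$(printf '%s' "$host" | tr '[:lower:]' '[:upper:]')" in
    *-IX*-01*) echo "$host: selected" ;;
    *)         echo "$host: skipped" ;;
  esac
done
```

Only the -010 through -019 names contain "-01", so -009 and -020 fall outside the batch.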
Comment 20•11 years ago
I won't be able to determine how many we can do in every batch, as they have not picked up a job since the change got deployed.
I hope the machines will pick up a job sometime later in the day once the cleaning up finishes up.
If anyone wants to figure out in my absence this is what I was going to do:
* Load these 3 pages
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slavetype.html?class=test&type=t-xp32-ix
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slavetype.html?class=test&type=t-w732-ix
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slavetype.html?class=test&type=t-w864-ix
* Sort the slaves by name
* Open each slave in the range indicated in comment 19
** e.g. https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=t-xp32-ix&name=t-xp32-ix-010
* Look at the end time of the last job that finished around 3pm PT
* Figure out how much of a gap there is to the next set of jobs
I hope the gap is not hours.
Thanks Q.
Assignee
Comment 21•11 years ago
Based on IRC conversations I think we are ready to roll this out on all testers. Any objections?
Assignee
Comment 22•11 years ago
[09:59] <RyanVM> jlund|buildduty: we didn't see any test bustage this time around either
[09:59] <RyanVM> so I'd be OK with a wider rollout
[10:00] <RyanVM> Q: once the Windows slaves are good to with this cleanup work, I'd love the OSX slaves to get it next
[10:00] <jlund|buildduty> RyanVM: thanks, verifying with you guys should have been my 1st step. :)
Assignee
Comment 23•11 years ago
Rolling out pool-wide. Testers will get the changes on the next reboot.
Updated•11 years ago
Severity: critical → normal
Comment 24•11 years ago
Has this been fixed now? I'm asking because I want to re-enable a test in bug 918507.
Assignee
Comment 25•11 years ago
This has indeed been fixed.
Updated•10 years ago
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED