Closed Bug 1192994 Opened 9 years ago Closed 8 years ago

investigate seta scheduling for talos

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: kmoir, Assigned: catlee)

Details

Attachments

(3 files, 2 obsolete files)

requested in irc by RyanVM and jmaher

jmaher	kmoir: are there other things we can do to force coalescing on talos jobs on certain branches, maybe not as dynamic as seta
	jmaher	kmoir: yes, regular intervals
	RyanVM|sheriffduty	oh, no wonder why
	RyanVM|sheriffduty	we have 2x more linux64 iX slaves than linux32
	jmaher	ah, interesting
	RyanVM|sheriffduty	currently 185 pending linux32-ix test jobs
	RyanVM|sheriffduty	jmaher: https://secure.pub.build.mozilla.org/builddata/reports/slave_health/
	jmaher	I would rather treat things as similar as possible- so all talos e10s forced coalescing and it will help a lot
	RyanVM|sheriffduty	45 total slaves, 40 currently in service
	RyanVM|sheriffduty	vs. 99/86 for linux64
	RyanVM|sheriffduty	ouch
	RyanVM|sheriffduty	of course, we run some other jobs on linux64-ix slaves (i.e. Android x86 S4)
	RyanVM|sheriffduty	either way, I wonder if we could maybe rebalance that a bit
	kmoir	jmaher: not without changing the scheduler code.  Also, scheduler coalescing was not implemented for talos iirc
	jmaher	kmoir: ok- is that a hard thing to do?  I have been patient for 6+ months to get talos e10s scheduled and we need it now- so it is enabled everywhere; I am not sure if this is something we can get in the releng work queue?
	kmoir	jmaher: I'll open a bug and see how much work it will be. In theory it shouldn't be that difficult
	RyanVM|sheriffduty	kmoir: our logic for periodic jobs only applies to builds?
	RyanVM|sheriffduty	i.e. we can't make talos-e10s a periodic job?
	jmaher	RyanVM|sheriffduty: I don't know, that would work as long as we can make them tier-1
	kmoir	hmm, let me look.  I was just thinking of it from the test side
	RyanVM|sheriffduty	jmaher: I really do wonder what a little rebalancing would do for us - maybe even reimaging 10 or so
	RyanVM|sheriffduty	also, 4 of the 5 disabled slaves have been offline for more than a month now
	jmaher	that is a lot offline
	RyanVM|sheriffduty	i see armen has one for bug 1141416
	bugbot	Bug https://bugzilla.mozilla.org/show_bug.cgi?id=1141416 normal, --, ---, nobody, NEW , Fix the slaves broken by talos's inability to deploy an update
	RyanVM|sheriffduty	which *may* be the issue with the others as well
	jmaher	I think it is
	RyanVM|sheriffduty	looks like he's actively working on that now
	RyanVM|sheriffduty	so even 5 more would help
	RyanVM|sheriffduty	jmaher, kmoir: rebalancing seems like the most painless option if it can be done
	RyanVM|sheriffduty	maybe even 5 to start to see how linux64-ix wait times get impacted
Assignee: nobody → kmoir
jmaher, RyanVM: is this still important?
I think this could help us reduce our load on win* machines if it was implemented right now.  Ideally we would ensure all tests run every X pushes where X=1,2,3,etc.  Maybe for now X=2 and we could save a lot of resources.
Do we need any change on the SETA side? or just the Buildbot side?
right now SETA has talos data integrated in, it is hardcoded- so we would need to edit the buildbot talos scheduler (which is different than the unittest scheduler)
Yes, just change the talos scheduler. To be honest I have not looked at this because my thinking was that we would be moving to taskcluster and thus not worth the effort changing the buildbot side of things.  But this depends on timelines for migration.
*Very* optimistically, we won't be able to migrate before Q4.

In any case, TaskCluster needs SETA support which we don't yet have.
Hopefully bug 1243123 makes it as a GSoC project and in the blocked bugs we will add in-tree Buildbot scheduling via TC/BBB.
at the current rate of migration we are looking at a full calendar year at least (end of 2016).  We would bump this up if we used taskcluster to schedule and use buildbot bridge to schedule everything.
I took another look at this yesterday as a way to help with some of the HW capacity issues we've been having.

The changes I made result in these tests running all the time on mozilla-inbound (for 32-bit windows):
    'Windows 7 32-bit mozilla-inbound talos g2',
    'Windows 7 32-bit mozilla-inbound talos g1',
    'Windows 7 32-bit mozilla-inbound talos svgr',
    'Windows 7 32-bit mozilla-inbound talos dromaeojs',
    'Windows 7 32-bit mozilla-inbound talos other',
    'Windows 7 32-bit mozilla-inbound talos chromez',
    'Windows 7 32-bit mozilla-inbound talos tp5o',
    'Windows 7 32-bit mozilla-inbound talos xperf',
    'Windows XP 32-bit mozilla-inbound talos g2',
    'Windows XP 32-bit mozilla-inbound talos g1',
    'Windows XP 32-bit mozilla-inbound talos svgr',
    'Windows XP 32-bit mozilla-inbound talos dromaeojs',
    'Windows XP 32-bit mozilla-inbound talos other',
    'Windows XP 32-bit mozilla-inbound talos chromez',
    'Windows XP 32-bit mozilla-inbound talos tp5o',

And these tests running every 14 pushes, or 2 hours:
    'Windows XP 32-bit mozilla-inbound talos chromez-e10s',
    'Windows XP 32-bit mozilla-inbound talos g2-e10s',
    'Windows XP 32-bit mozilla-inbound talos svgr-e10s',
    'Windows XP 32-bit mozilla-inbound talos dromaeojs-e10s',
    'Windows XP 32-bit mozilla-inbound talos other-e10s',
    'Windows XP 32-bit mozilla-inbound talos tp5o-e10s',
    'Windows XP 32-bit mozilla-inbound talos g1-e10s'

And these running every 7 pushes, or 1 hour:
    'Windows 7 32-bit mozilla-inbound talos chromez-e10s',
    'Windows 7 32-bit mozilla-inbound talos g2-e10s',
    'Windows 7 32-bit mozilla-inbound talos svgr-e10s',
    'Windows 7 32-bit mozilla-inbound talos dromaeojs-e10s',
    'Windows 7 32-bit mozilla-inbound talos other-e10s',
    'Windows 7 32-bit mozilla-inbound talos tp5o-e10s',
    'Windows 7 32-bit mozilla-inbound talos xperf-e10s',
    'Windows 7 32-bit mozilla-inbound talos g1-e10s'

Win8 64-bit, and OSX 10.10 are also impacted by this change.

Does this look about right?
Flags: needinfo?(kmoir)
Flags: needinfo?(jmaher)
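The "every N pushes, or T hours" rule described above can be sketched roughly as follows. This is only an illustration: the class name `SkipTracker`, its parameters, and the "whichever comes first" reading of the rule are assumptions, not the actual Buildbot scheduler code.

```python
import time


class SkipTracker:
    """Hypothetical sketch: run a job on every `max_skips`-th push, or once
    `max_seconds` have passed since the last run, whichever comes first
    (that reading of the rule is an assumption)."""

    def __init__(self, max_skips, max_seconds, clock=time.time):
        self.max_skips = max_skips
        self.max_seconds = max_seconds
        self.clock = clock
        self.skipped = 0
        self.last_run = clock()

    def should_run(self):
        """Call once per push; returns True if the job should be scheduled."""
        now = self.clock()
        due_by_count = self.skipped + 1 >= self.max_skips
        due_by_time = now - self.last_run >= self.max_seconds
        if due_by_count or due_by_time:
            self.skipped = 0
            self.last_run = now
            return True
        self.skipped += 1
        return False


# Win7 e10s talos in the list above: every 7 pushes, or 1 hour.
fake_now = [0.0]
win7_e10s = SkipTracker(7, 3600, clock=lambda: fake_now[0])
decisions = [win7_e10s.should_run() for _ in range(14)]
```

With these settings only the 7th and 14th pushes trigger the job; a push arriving more than an hour after the last run would also trigger it.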
Attached patch adding talos support to seta (obsolete) — Splinter Review
Attached patch adding talos support to seta (obsolete) — Splinter Review
The first stanza looks right.

For 
And these tests running every 14 pushes, or 2 hours:
and 
And these running every 7 pushes, or 1 hour:

I don't understand why these are run less often since they are not listed as tests that should run less frequently here
http://alertmanager.allizom.org/data/setadetails/?date=2015-03-03&buildbot=1&branch=mozilla-inbound&inactive=1
Flags: needinfo?(kmoir)
actually ignore my last comment, the seta data link I had was for 2015, not 2016, so the tests you indicate look fine with the correct link :-) moar caffeine
:catlee, thanks for looking into this!  I am excited to see this change, could we add win8 in there as well to be scheduled as we are for win7?

One thought is we could go every other push by default, then do e10s on 7th/14th based on OS.  That would reduce the load and probably end up not requiring any more work to narrow down regressions.  

The biggest concern is that regressions won't show up until much later, especially in the 14rev/2hour window.  That is a full 24 hours.

Will this affect pgo scheduling?  I would like to keep pgo the same as it is now.
Flags: needinfo?(jmaher)
Win8 has similar changes, I just didn't call them out explicitly.

I can look at making it every other push by default. Would that be only on fx-team and inbound, or all branches?

PGO scheduling isn't impacted - they happen as usual.
I think central should stay the same, but for inbound/fx-team we should have this intentional coalescing.

If you want me to define anything via the SETA API, that would be very doable- there is data in there already, but I could add priority fields, etc.
Comment on attachment 8726356 [details] [diff] [review]
Handle Linux64 data from SETA r=kmoir

Joel just added some Linux64 talos data to SETA, and it breaks our current configuration. I think this is because we're filtering out talos jobs, so we end up with no tests inside define_configs(), but we don't skip that platform in that case.
Attachment #8726356 - Flags: review?(kmoir)
Attachment #8726356 - Flags: review?(kmoir) → review+
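A minimal sketch of the failure mode described above, and of the kind of guard that avoids it. The helper `filter_platform_tests` and the sample builder names are hypothetical; only the exclusion pattern is taken from this bug, and the real define_configs() may be structured quite differently.

```python
import re

# Exclusion pattern as quoted in the review comment in this bug.
test_exclusions = re.compile(r'\[funsize\]|\[TC\]|talos')


def filter_platform_tests(tests_by_platform, exclusions=test_exclusions):
    """Hypothetical helper: drop excluded jobs per platform, and skip any
    platform that has no tests left after filtering."""
    result = {}
    for platform, tests in tests_by_platform.items():
        kept = [t for t in tests if not exclusions.search(t)]
        if not kept:
            # A talos-only platform ends up empty here, so skip it
            # instead of emitting a config with no tests.
            continue
        result[platform] = kept
    return result


# Illustrative data only; these builder names are made up for the example.
sample = {
    'linux64-hw': ['Linux64 HW mozilla-inbound talos g1'],
    'win7': ['Windows 7 32-bit mozilla-inbound talos g1',
             'Windows 7 32-bit mozilla-inbound opt test mochitest-1'],
}
filtered = filter_platform_tests(sample)
```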
Attachment #8726174 - Attachment is obsolete: true
Comment on attachment 8726789 [details] [diff] [review]
adding talos support to seta (buildbotcustom)

This patch changes the generateBranchObjects function to look for talos suites in the platform's skipconfig data. We then collect the talos builder names, grouped by their skipconfig. Finally, we create schedulers for each group of talos builders with the same skipconfig.
Attachment #8726789 - Attachment description: adding talos support to seta → adding talos support to seta (buildbotcustom)
Attachment #8726789 - Flags: review?(kmoir)
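The grouping step described in this comment could look roughly like the sketch below. It is not the patch itself: the function names, the `(skipcount, skiptimeout)` tuple shape, and the scheduler dictionaries are assumptions made for illustration, standing in for whatever generateBranchObjects really builds.

```python
from collections import defaultdict


def group_builders_by_skipconfig(skipconfig_by_builder):
    """Map each (skipcount, skiptimeout) pair to the sorted list of talos
    builder names that share it."""
    groups = defaultdict(list)
    for builder in sorted(skipconfig_by_builder):
        groups[skipconfig_by_builder[builder]].append(builder)
    return dict(groups)


def scheduler_specs(branch, groups):
    """Produce one hypothetical scheduler description per skipconfig group."""
    return [
        {
            'name': 'talos-%s-skip-%d-%d' % (branch, skipcount, skiptimeout),
            'skipcount': skipcount,
            'skiptimeout': skiptimeout,
            'builderNames': builders,
        }
        for (skipcount, skiptimeout), builders in sorted(groups.items())
    ]


# Builder names from earlier in this bug; the skip values follow the
# "14 pushes / 2 hours" and "7 pushes / 1 hour" description.
example = {
    'Windows XP 32-bit mozilla-inbound talos g1-e10s': (14, 7200),
    'Windows XP 32-bit mozilla-inbound talos g2-e10s': (14, 7200),
    'Windows 7 32-bit mozilla-inbound talos g1-e10s': (7, 3600),
}
specs = scheduler_specs('mozilla-inbound', group_builders_by_skipconfig(example))
```

Grouping first means builders with identical skip settings share one scheduler rather than each getting their own.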
Attachment #8726194 - Attachment is obsolete: true
Comment on attachment 8726789 [details] [diff] [review]
adding talos support to seta (buildbotcustom)

Looks good

Will have to revise
test_exclusions = re.compile('\[funsize\]|\[TC\]|talos')
in config_seta.py to remove talos from the pattern, so that talos skipconfig definitions are added
Attachment #8726789 - Flags: review?(kmoir) → review+
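To illustrate the effect of that regex change: the first pattern below is the one quoted above, while the second is only a guess at what the revised pattern might look like once talos is dropped. The variable names and the `[TC]` job name are made up for the example.

```python
import re

# Pattern as quoted above: talos jobs are excluded from SETA handling.
old_exclusions = re.compile(r'\[funsize\]|\[TC\]|talos')
# Assumed shape of the revised pattern, with the talos alternative removed.
new_exclusions = re.compile(r'\[funsize\]|\[TC\]')

job = 'Windows 7 32-bit mozilla-inbound talos g1-e10s'

excluded_before = bool(old_exclusions.search(job))
excluded_after = bool(new_exclusions.search(job))
# The other alternatives still match ('[TC] example job' is an invented name).
tc_still_excluded = bool(new_exclusions.search('[TC] example job'))
```

Under the old pattern the talos builder is filtered out, so it never receives a skipconfig; under the assumed new pattern it passes through.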
Comment on attachment 8726844 [details] [diff] [review]
adding talos support to seta (buildbot-configs)

Minor changes required for buildbot-configs. I needed a dummy entry for ubuntu64_hw, since talos is the only thing that runs there.

Most of the changes to config_seta.py are cleanup, with the exception of the regex change where I stop skipping talos jobs.
Attachment #8726844 - Attachment description: adding talos support to seta → adding talos support to seta (buildbot-configs)
Attachment #8726844 - Flags: review?(kmoir)
Attachment #8726844 - Flags: review?(kmoir) → review+
Assignee: kmoir → catlee
Attachment #8726789 - Flags: checked-in+
Attachment #8726844 - Flags: checked-in+
I think this is done.

bug 1255088 tracks cleaning up some of the talos suite configuration on various branches, and will allow us to have more control over talos with seta.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard