Closed
Bug 984923
Opened 11 years ago
Closed 11 years ago
Improve prioritizeBuilders
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: catlee, Assigned: catlee)
References
Details
Attachments
(1 file)
updated prioritizeBuilders: patch, 6.85 KB
Flags: bhearsum: review+, catlee: checked-in+
Our current prioritizeBuilders function has been causing us grief in a few ways:
- It tries to assign work to only the most important builders, usually only one or two at a time. We've had many cases where these builders can't assign work because their nextSlave functions preclude any of the connected slaves from being used. For example, nextAWSSlave functions won't run a build on spot machines if there have been too many retries. When only spot nodes are connected and the most important builder has too many retries, no work gets assigned at all: prioritizeBuilders returns only the most important builder, its nextSlave returns None, and nothing gets done. This can cause massive delays for *other* builders that are also pending.
- Because pending jobs start piling up behind these stuck important builders that can't assign work, aws_watch_pending tries to start more and more machines to service those requests. Eventually we end up starting ALL the machines, including on-demand nodes. In the case described above, this unplugs the stuck builder, and then all the other pending requests can be serviced as well. I'd guess this makes this function one of our most expensive functions in the code base :)
We should change this function to take into account the number of connected slaves and give other builders a chance to run as well.
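For illustration, something along these lines (a minimal sketch of the proposed behaviour, not the actual patch; get_branch_priority is a hypothetical stand-in for the real branch-priority lookup, and the slave counting assumes Buildbot 0.8-style SlaveBuilder objects with isAvailable() and a .slave.slavename attribute):

def get_branch_priority(builder_name):
    # Hypothetical stand-in for the real branch-priority lookup;
    # lower value == more important.
    return 0 if 'release' in builder_name else 1

def prioritizeBuilders(botmaster, builders):
    # Illustrative sketch of the proposed behaviour, not the actual patch.
    # Sort all pending builders, most important branch first.
    builders.sort(key=lambda b: get_branch_priority(b.name))

    # Count distinct slaves currently connected and available to any of
    # these builders (SlaveBuilder objects in Buildbot 0.8 terms).
    available = set()
    for builder in builders:
        for sb in builder.slaves:
            if sb.isAvailable():
                available.add(sb.slave.slavename)

    # Old behaviour: return only the most important builder, which starves
    # everything else whenever its nextSlave rejects every connected slave.
    # Instead, return as many builders as we have slaves, in priority
    # order, so other builders get a chance to take jobs too.
    return builders[:max(1, len(available))]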
Some irc logs for more context:
06:39 <@catlee> bhearsum, rail: any thoughts on https://github.com/catlee/buildbot-configs/compare/master...branch_priorities ?
06:39 <@catlee> the idea is to be able to process as many builders as we have slaves
06:39 <@catlee> while still preserving branch priorities
06:40 < bhearsum> catlee: as opposed to all of the builders, which is probably higher than the number of slaves?
06:42 <@catlee> bhearsum: before it would pick just the "most important" builders
06:42 <@catlee> bhearsum: which could be just 1 or 2 of the set of pending
06:43 <@catlee> bhearsum: the problem we've been seeing is that those 1 or 2 most important builders can't actually run on any of the available slaves
06:43 < bhearsum> catlee: ah
06:53 < bhearsum> sorry, getting distracted
06:54 <@catlee> the problem the old function was trying to avoid was having slaves connect while the master is iterating through the sorted list of builders
06:56 <@catlee> so before we'd return the list of builders in sorted order. If builder 0 had no slaves, we'd move onto builder 1. If a slave connected between, then builder 1 would get run, even though builder 0 is more important
06:56 <@catlee> so then we changed it to return just the "most important" builders
06:57 <@catlee> but returning just the "most important" builders fails when those builders can't assign work due to nextSlave restrictions
06:58 <@catlee> so I'm thinking we should be able to safely consider as many builders as we have connected slaves
07:00 < bhearsum> seems fine, but i feel like it still hasn't clicked in my head yet
07:01 < bhearsum> i'll come back to it after this release stuff calms down
07:01 <@catlee> sure
07:23 < bhearsum> catlee: i guess my only question is: why not return all of the builders instead of just len(connected slaves)?
07:25 <@catlee> bhearsum: in the case we have no connected slaves, buildbot will process builders in order. that processing happens in a bit of an async loop, so slaves can connect in between builders being processed. so the most important builder can be skipped because there are no slaves available at the time, and then a slave connects, and a less important builder gets it
07:27 < bhearsum> catlee: how does that work when we're just returning a subset of all of the builders then? i'm not sure if it was clear, but i was suggesting returning all of the builders, sorted
07:28 <@catlee> bhearsum: right, I'm trying to explain why returning all the builders, sorted, doesn't work
07:29 < bhearsum> catlee: but returning a subset of those builders, sorted, works?
07:29 * bhearsum rereads everything
07:29 <@catlee> t = 0: sort builders into b0, b1, b2, return [b0, b1, b2]. no slaves connected
07:29 <@catlee> t = 1: look at b0. no slaves, move on
07:29 <@catlee> t = 2: slave s1 connects
07:29 <@catlee> t = 3: look at b1, give job to s1
07:30 < bhearsum> right, i understand that bug
07:33 <@catlee> ok
07:33 <@catlee> so that's what happens if you return all the builders, sorted
07:34 < bhearsum> doesn't that happen if you return [b0, b1] in your scenario, too?
07:35 <@catlee> hopefully not
07:35 <@catlee> because I know we have 2 slaves
07:35 <@catlee> so I'll return [b0, b1]
07:35 <@catlee> if we only have 1 slave, I'll return [b0]
07:36 <@catlee> anyway, that's my idea for improving things
07:36 <@catlee> right now we return just [b0] no matter how many slaves we have
07:36 < bhearsum> yeah
07:36 < bhearsum> that's clearly silly
07:36 <@catlee> and sometimes b0 can't run
07:36 < bhearsum> so, what happens if we return [b0, b1], but both slaves that are free are jacuzzi'ed to b2, which also has pending?
07:37 <@catlee> we are taking into account jacuzzis
07:37 <@catlee> I think...
07:37 < bhearsum> oh, right
07:37 < bhearsum> just not in the context of this diff
07:37 <@catlee> right
07:38 <@catlee> we've already filtered our slaves down
07:38 < bhearsum> right, ok
07:38 <@catlee> so it's cases like nextAWSSlave that force things to run on spot/not-spot
07:40 < bhearsum> one more question: you mentioned above that sometimes "builders can't actually run on any of the available slaves". does that mean those builders don't have those slaves in their list, or something else?
07:40 <@catlee> something else
07:40 <@catlee> nextSlave prevents it from running on a connected slave
07:40 < bhearsum> ah
07:40 <@catlee> e.g. we have only spot connected, but the builder has retries, so we want to run on ondemand
07:40 <@catlee> so nextSlave returns none
07:41 <@catlee> this is what's causing some of our spikes I think
07:41 <@catlee> b0 is pending, so we're spinning up machines for it
07:42 <@catlee> but it can't run on any of the machines
07:42 <@catlee> meanwhile b1, b2, ... are piling up
07:42 < bhearsum> so could we still hit this in a case where we return [b0,b1], but the slaves get rejected by nextSlave? (but b2 never gets tried, because prioritizeBuilders won't return it until b0 or b1 starts)
07:42 <@catlee> but they're not getting a chance to take jobs
07:42 < bhearsum> yeah, that makes sense
07:42 < bhearsum> i think i have a full grasp of the problem now
07:42 <@catlee> yeah, we could still hit it
07:43 < bhearsum> but presumably we'll cycle fast enough that it'll be much rarer
07:43 <@catlee> hopefully...
07:43 <@catlee> well, part of this is a failure to spin up the right instances for jobs
07:44 <@catlee> but this change should give other builders a chance
07:44 <@catlee> more of a chance than they have now anyway
07:44 < bhearsum> i suppose there's no way to invalidate and rerun prioritizeBuilders when a slave connects mid-iteration through builders?
07:44 < bhearsum> yeah
07:44 < bhearsum> it definitely sounds like a big improvement
07:44 <@catlee> not really, no
07:45 <@catlee> I was thinking maybe some caching could help
07:45 <@catlee> like if we're looking at the same builders, with same set of slaves, then do something different
07:45 <@catlee> or perhaps calling nextSlave from here too
07:45 <@catlee> but that sounds expensive
07:45 < bhearsum> yeah...
07:45 <@catlee> it's a mismatch between nextSlave, prioritizeBuilders and watch_pending
07:46 < bhearsum> well, i'm fine with landing this. a comment about the situation where you don't start b2 because of nextSlave rejects would be useful though, i think
07:48 < bhearsum> catlee: thanks for all the hand holding here
07:48 <@catlee> bhearsum: thanks for digging into it
07:49 < bhearsum> in the glorious future, it seems like it would be nice to have prioritizeBuilders and nextSlave closer together, or maybe the same code
07:49 <@catlee> it's good to have more people to blame^W^W understand this
07:49 < bhearsum> hey
07:49 < bhearsum> don't you be putting me as the author here!
07:49 <@catlee> qrefresh -u ftw
07:49 < bhearsum> qrefresh -fu
07:49 <@catlee> :P
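To make the failure mode in the log concrete, the nextAWSSlave behaviour described above amounts to something like the following sketch (not the real buildbotcustom code; is_spot and request_has_retries are hypothetical helpers standing in for the actual spot-detection and retry-count logic):

def is_spot(slave_builder):
    # Hypothetical: the real code identifies spot instances differently;
    # Mozilla's spot slave names happen to contain "spot".
    return 'spot' in slave_builder.slave.slavename

def request_has_retries(builder):
    # Hypothetical: the real code inspects the pending build requests for
    # prior retries.
    return False

def nextAWSSlave_sketch(builder, available_slaves):
    spot = [s for s in available_slaves if is_spot(s)]
    ondemand = [s for s in available_slaves if not is_spot(s)]

    if request_has_retries(builder):
        # Too many retries: insist on an on-demand slave. If only spot
        # slaves are connected this returns None, and with the old
        # prioritizeBuilders nothing queued behind this builder got a
        # chance to run either.
        return ondemand[0] if ondemand else None

    # Otherwise prefer the cheaper spot slaves.
    if spot:
        return spot[0]
    return ondemand[0] if ondemand else None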
Updated • 11 years ago
Assignee: nobody → catlee
Comment 1 (Assignee) • 11 years ago
I think this does what we want...
I added a delay to the call to loop.trigger towards the bottom to prevent entering a kind of infinite loop where the builder loop runs over and over with no pause, chewing up master CPU in the meantime.
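Roughly, the delay boils down to something like this (a sketch of the idea, not the patch itself; it assumes the Twisted reactor and a builder loop object with a trigger() method, as in Buildbot 0.8's buildbot.util.loop):

from twisted.internet import reactor

TRIGGER_DELAY = 20  # seconds; illustrative value only

def retrigger_later(loop):
    # Triggering the builder loop again immediately would make it spin
    # continuously whenever the top-priority builder can never assign
    # work, chewing up master CPU. Scheduling the trigger after a short
    # delay breaks that busy loop.
    reactor.callLater(TRIGGER_DELAY, loop.trigger)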
Attachment #8393125 - Flags: review?(bhearsum)
Comment 2 • 11 years ago
Comment on attachment 8393125 [details] [diff] [review]
updated prioritizeBuilders
Review of attachment 8393125 [details] [diff] [review]:
-----------------------------------------------------------------
Much better, thanks!
Attachment #8393125 - Flags: review?(bhearsum) → review+
Comment 3 • 11 years ago
(In reply to Ben Hearsum [:bhearsum] from comment #2)
> Comment on attachment 8393125 [details] [diff] [review]
> updated prioritizeBuilders
>
> Review of attachment 8393125 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> Much better, thanks!
Whoops, this comment was meant for a different bug. Regardless, this patch looks fine.
Updated • 11 years ago
Attachment #8393125 - Flags: checked-in+
Comment 4 • 11 years ago
patch(es) is in production :)
Updated • 11 years ago
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Updated • 7 years ago
Component: General Automation → General