Closed Bug 984923 Opened 11 years ago Closed 11 years ago

Improve prioritizeBuilders

Categories

(Release Engineering :: General, defect)

Hardware: x86_64
OS: Linux
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: catlee, Assigned: catlee)

References

Details

Attachments

(1 file)

Our current prioritizeBuilders function has been causing us grief in a few ways:

- It tries to assign work to only the most important builders, usually only one or two at a time. We've had many cases where these builders can't assign work because their nextSlave functions preclude any of the connected slaves from being used. For example, nextAWSSlave functions won't run a build on spot machines if there have been too many retries. In the case where we have only spot nodes connected and the most important builder has too many retries, no work gets assigned: prioritizeBuilders returns only the most important builder, its nextSlave returns None, and nothing gets done. This can cause massive delays for *other* builders that are also pending.

- Because pending jobs pile up behind these stuck important builders that can't assign work, aws_watch_pending tries to start more and more machines to service those requests. Eventually we end up starting ALL the machines, including on-demand nodes. In the case described above, this unplugs the stuck builder, and then all the other pending requests can be serviced as well. I'd guess this makes this function one of our most expensive functions in the code base :)

We should change this function to take into account the number of connected slaves, and give other builders a chance to run as well.

Some IRC logs for more context:

06:39 <@catlee> bhearsum, rail: any thoughts on https://github.com/catlee/buildbot-configs/compare/master...branch_priorities ?
06:39 <@catlee> the idea is to be able to process as many builders as we have slaves
06:39 <@catlee> while still preserving branch priorities
06:40 < bhearsum> catlee: as opposed to all of the builders, which is probably higher than the number of slaves?
06:42 <@catlee> bhearsum: before it would pick just the "most important" builders
06:42 <@catlee> bhearsum: which could be just 1 or 2 of the set of pending
06:43 <@catlee> bhearsum: the problem we've been seeing is that those 1 or 2 most important builders can't actually run on any of the available slaves
06:43 < bhearsum> catlee: ah
06:53 < bhearsum> sorry, getting distracted
06:54 <@catlee> the problem the old function was trying to avoid was having slaves connect while the master is iterating through the sorted list of builders
06:56 <@catlee> so before we'd return the list of builders in sorted order. If builder 0 had no slaves, we'd move onto builder 1. If a slave connected between, then builder 1 would get run, even though builder 0 is more important
06:56 <@catlee> so then we changed it to return just the "most important" builders
06:57 <@catlee> but returning just the "most important" builders fails when those builders can't assign work due to nextSlave restrictions
06:58 <@catlee> so I'm thinking we should be able to safely consider as many builders as we have connected slaves
07:00 < bhearsum> seems fine, but i feel like it still hasn't clicked in my head yet
07:01 < bhearsum> i'll come back to it after this release stuff calms down
07:01 <@catlee> sure
07:23 < bhearsum> catlee: i guess my only question is: why not return all of the builders instead of just len(connected slaves)?
07:25 <@catlee> bhearsum: in the case we have no connected slaves, buildbot will process builders in order. that processing happens in a bit of an async loop, so slaves can connect in between builders being processed.
07:25 <@catlee> so the most important builder can be skipped because there are no slaves available at the time, and then a slave connects, and a less important builder gets it
07:27 < bhearsum> catlee: how does that work when we're just returning a subset of all of the builders then? i'm not sure if it was clear, but i was suggesting returning all of the builders, sorted
07:28 <@catlee> bhearsum: right, I'm trying to explain why returning all the builders, sorted, doesn't work
07:29 < bhearsum> catlee: but returning a subset of those builders, sorted, works?
07:29 * bhearsum rereads everything
07:29 <@catlee> t = 0: sort builders into b0, b1, b2, return [b0, b1, b2]. no slaves connected
07:29 <@catlee> t = 1: look at b0. no slaves, move on
07:29 <@catlee> t = 2: slave s1 connects
07:29 <@catlee> t = 3: look at b1, give job to s1
07:30 < bhearsum> right, i understand that bug
07:33 <@catlee> ok
07:33 <@catlee> so that's what happens if you return all the builders, sorted
07:34 < bhearsum> doesn't that happen if you return [b0, b1] in your scenario, too?
07:35 <@catlee> hopefully not
07:35 <@catlee> because I know we have 2 slaves
07:35 <@catlee> so I'll return [b0, b1]
07:35 <@catlee> if we only have 1 slave, I'll return [b0]
07:36 <@catlee> anyway, that's my idea for improving things
07:36 <@catlee> right now we return just [b0] no matter how many slaves we have
07:36 < bhearsum> yeah
07:36 < bhearsum> that's clearly silly
07:36 <@catlee> and sometimes b0 can't run
07:36 < bhearsum> so, what happens if we return [b0, b1], but both slaves that are free are jacuzzi'ed to b2, which also has pending?
07:37 <@catlee> we are taking into account jacuzzis
07:37 <@catlee> I think...
07:37 < bhearsum> oh, right
07:37 < bhearsum> just not in the context of this diff
07:37 <@catlee> right
07:38 <@catlee> we've already filtered our slaves down
07:38 < bhearsum> right, ok
07:38 <@catlee> so it's cases like nextAWSSlave that force things to run on spot/not-spot
07:40 < bhearsum> one more question: you mentioned above that sometimes "builders can't actually run on any of the available slaves". does that mean those builders don't have those slaves in their list, or something else?
07:40 <@catlee> something else
07:40 <@catlee> nextSlave prevents it from running on a connected slave
07:40 < bhearsum> ah
07:40 <@catlee> e.g. we have only spot connected, but the builder has retries, so we want to run on ondemand
07:40 <@catlee> so nextSlave returns None
07:41 <@catlee> this is what's causing some of our spikes I think
07:41 <@catlee> b0 is pending, so we're spinning up machines for it
07:42 <@catlee> but it can't run on any of the machines
07:42 <@catlee> meanwhile b1, b2, ... are piling up
07:42 <@catlee> but they're not getting a chance to take jobs
07:42 < bhearsum> so could we still hit this in a case where we return [b0,b1], but the slaves get rejected by nextSlave? (but b2 never gets tried, because prioritizeBuilders won't return it until b0 or b1 starts)
07:42 < bhearsum> yeah, that makes sense
07:42 < bhearsum> i think i have a full grasp of the problem now
07:42 <@catlee> yeah, we could still hit it
07:43 < bhearsum> but presumably we'll cycle fast enough that it'll be much rarer
07:43 <@catlee> hopefully...
07:43 <@catlee> well, part of this is a failure to spin up the right instances for jobs
07:44 <@catlee> but this change should give other builders a chance
07:44 <@catlee> more of a chance than they have now anyway
07:44 < bhearsum> i suppose there's no way to invalidate and rerun prioritizeBuilders when a slave connects mid-iteration through builders?
07:44 < bhearsum> yeah
07:44 < bhearsum> it definitely sounds like a big improvement
07:44 <@catlee> not really, no
07:45 <@catlee> I was thinking maybe some caching could help
07:45 <@catlee> like if we're looking at the same builders, with same set of slaves, then do something different
07:45 <@catlee> or perhaps calling nextSlave from here too
07:45 <@catlee> but that sounds expensive
07:45 < bhearsum> yeah...
07:45 <@catlee> it's a mismatch between nextSlave, prioritizeBuilders and watch_pending
07:46 < bhearsum> well, i'm fine with landing this. a comment about the situation where you don't start b2 because of nextSlave rejects would be useful though, i think
07:48 < bhearsum> catlee: thanks for all the hand holding here
07:48 <@catlee> bhearsum: thanks for digging into it
07:49 < bhearsum> in the glorious future, it seems like it would be nice to have prioritizeBuilders and nextSlave closer together, or maybe the same code
07:49 <@catlee> it's good to have more people to blame^W^W understand this
07:49 < bhearsum> hey
07:49 < bhearsum> don't you be putting me as the author here!
07:49 <@catlee> qrefresh -u ftw
07:49 < bhearsum> qrefresh -fu
07:49 <@catlee> :P
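The discussion above boils down to two interacting pieces: a nextSlave policy that can legitimately return None (so the "most important" builder assigns nothing), and a prioritizeBuilders that only ever hands back that one builder. A minimal, self-contained Python sketch of the proposed change follows; the data shapes (dicts with "priority", "idle_slaves", "is_spot" keys), the helper names, and the retry threshold are illustrative placeholders, not the real buildbot-configs code or the landed patch.

MAX_SPOT_RETRIES = 1  # placeholder threshold, not the real config value

def next_aws_slave_like(available_slaves, retries):
    # Illustrates the stall: once a request has retried too many times we
    # only accept an on-demand slave, so if only spot slaves are connected
    # this returns None and the builder assigns nothing.
    candidates = available_slaves
    if retries > MAX_SPOT_RETRIES:
        candidates = [s for s in available_slaves if not s["is_spot"]]
    return candidates[0] if candidates else None

def prioritize_builders_sketch(builders):
    # Sort pending builders most-important-first (assuming a smaller
    # priority value means more important), then return as many of them as
    # there are connected idle slaves, rather than only the single most
    # important one. A builder whose nextSlave keeps returning None can
    # then no longer starve everything queued behind it.
    ranked = sorted(builders, key=lambda b: b["priority"])
    idle = {name for b in builders for name in b.get("idle_slaves", [])}
    # With no idle slaves, returning the full sorted list risks a less
    # important builder grabbing a slave that connects mid-iteration, so
    # hand back just the top builder.
    return ranked[:max(len(idle), 1)]

Capping the returned list at the number of idle slaves keeps the original ordering guarantee (a slave that connects mid-iteration can only be claimed by a builder we already deemed worth considering) while still letting lower-priority builders drain when the top one is blocked.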
Assignee: nobody → catlee
I think this does what we want. I added a delay to the call to loop.trigger towards the bottom to prevent a kind of infinite loop where the builder loop runs over and over without any delay, chewing up master CPU in the meantime.
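For context, the delayed re-trigger could look roughly like the following under Twisted (which the buildmaster runs on). The wrapper name and the five-second value are assumptions for illustration; only loop.trigger comes from the comment above.

from twisted.internet import reactor

def retrigger_builder_loop(loop, delay=5):
    # Schedule the next pass over the builders a few seconds from now
    # instead of immediately, so the loop can't spin back-to-back and chew
    # up master CPU when no builder can actually assign work.
    reactor.callLater(delay, loop.trigger)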
Attachment #8393125 - Flags: review?(bhearsum)
Comment on attachment 8393125 [details] [diff] [review]
updated prioritizeBuilders

Review of attachment 8393125 [details] [diff] [review]:
-----------------------------------------------------------------

Much better, thanks!
Attachment #8393125 - Flags: review?(bhearsum) → review+
(In reply to Ben Hearsum [:bhearsum] from comment #2)
> Comment on attachment 8393125 [details] [diff] [review]
> updated prioritizeBuilders
>
> Review of attachment 8393125 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> Much better, thanks!

Whoops, this comment was meant for a different bug. Regardless, this patch looks fine.
Attachment #8393125 - Flags: checked-in+
Patch is in production :)
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Component: General Automation → General