Closed Bug 659222 Opened 12 years ago Closed 10 years ago

Branch prioritisation is broken for test jobs

Categories

(Release Engineering :: General, defect, P3)

x86
macOS

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Unassigned)

References

Details

(Keywords: sheriffing-untriaged, Whiteboard: [buildmasters][scheduler][buildfaster:p2])

Attachments

(4 files, 6 obsolete files)

philor asked in #build if there was some way to get around the job prioritization function that runs try jobs only when there is nothing else pending. The code is at
 http://mxr.mozilla.org/build/source/buildbot-configs/mozilla/master_common.py#21
From line 72, I don't think you can use self-serve to do that; you can just make your try job run first amongst all the try jobs. 
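For readers without the mxr link handy, the branch-priority scheme described above can be sketched roughly like this. This is a simplification, not the actual master_common.py code; the branch names and priority values here are illustrative only:

```python
# Rough sketch of branch-based builder prioritization (illustrative,
# not the actual master_common.py code). Lower number = higher
# priority; try gets the lowest priority, so try jobs only run when
# nothing else is pending for the same slaves.
BRANCH_PRIORITIES = {
    "mozilla-central": 1,
    "mozilla-aurora": 2,
    "mozilla-beta": 2,
    "try": 4,
}
DEFAULT_PRIORITY = 3


def branch_priority(builder_name):
    """Map a builder to its branch priority by substring match."""
    for branch, prio in BRANCH_PRIORITIES.items():
        if branch in builder_name:
            return prio
    return DEFAULT_PRIORITY


def prioritize(builders):
    """Return builders sorted so higher-priority branches come first."""
    return sorted(builders, key=branch_priority)
```

With this shape, `prioritize(["try-linux", "mozilla-central-linux", "birch-linux"])` puts the mozilla-central builder first and the try builder last, which is the behaviour philor was asking how to get around.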

However, in this graph there are both try and non-try tests running all day, despite a consistent backlog of non-try test jobs. Something is busted.
Also odd: looking at the number of pending jobs during these last few tree explosions, it sure looks like mozilla-beta is effectively priority 2.5 or so, clearing its backlog slower than everything else that is in theory also a 2.
Whiteboard: [buildduty]
I don't know the code well enough to help debug this, but would it make sense to make mozilla-aurora and mozilla-beta priority 1 like mozilla-central?
Assignee: nobody → catlee
Severity: normal → major
Whiteboard: [buildduty] → [buildmasters][scheduler]
Priority: -- → P3
Attached patch clean up prioritization function (obsolete) — Splinter Review
I don't think the prioritization is busted (maybe fixed by bug 658692?), but this cleans up the logic a bit.
Attachment #539388 - Flags: review?(nrthomas)
Attached file sql data
I think there's still an issue. Yesterday we had bug 663993 blocking buildbot startup on Leopard slaves for a few hours, which was resolved about 2100 PDT (or 0400 UTC on the 14th). See the attachment for an analysis of the jobs that started.
Ok, so I think what is happening here is that slaves are connecting in the middle of iterating through the list of builders. Here's a timeline of important events:

2011-06-13 21:03:42-0700 [-] Sorted 214 builders in 0.03s
2011-06-13 21:03:42-0700 [Broker,69734,10.12.50.25] Got slaveinfo from 'talos-r3-leopard-025'
2011-06-13 21:03:42-0700 [Broker,69732,10.12.50.26] Got slaveinfo from 'talos-r3-leopard-026'
2011-06-13 21:03:42-0700 [Broker,69733,10.12.50.20] Got slaveinfo from 'talos-r3-leopard-020'
2011-06-13 21:03:42-0700 [Broker,69731,10.12.50.48] Got slaveinfo from 'talos-r3-leopard-048'
2011-06-13 21:04:04-0700 [-] starting build <Build Rev3 MacOSX Leopard 10.5.8 mozilla-aurora opt test mochitests-5/5> using slave <SlaveBuilder builder='Rev3 MacOSX Leopard 10.5.8 mozilla-aurora opt test mochitests-5/5' slave='talos-r3-leopard-026'>

So we've prioritized the builders with release, beta, etc. at the beginning of the list, and then start walking through this list trying to assign jobs to slaves. In the beginning we don't have any free slaves, but in the middle of this process new slaves show up, and so builders towards the end of the list can get jobs started.
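A minimal simulation of that race (the names and structure here are hypothetical stand-ins, not the actual buildbot assignment loop): if a slave connects while the walk is already past the high-priority entries in the sorted builder list, a low-priority builder at the end of the list wins it.

```python
# Hypothetical illustration of the race: builders are walked in
# priority order, but a slave that connects mid-walk is only visible
# to the builders not yet visited, so a low-priority builder can win.
def assign_jobs(sorted_builders, free_slaves, connect_during=None):
    """Walk builders in priority order, giving one free slave to each.

    connect_during maps a builder name to a slave that connects just
    before that builder is considered (simulating a mid-loop connect).
    Returns the list of (builder, slave) jobs started.
    """
    connect_during = connect_during or {}
    started = []
    for builder in sorted_builders:
        if builder in connect_during:
            free_slaves.append(connect_during[builder])
        if free_slaves:
            started.append((builder, free_slaves.pop(0)))
    return started


# No slaves are free at the start of the walk; one connects just
# before we reach the low-priority try builder at the end of the list.
started = assign_jobs(
    ["mozilla-beta test", "mozilla-central test", "try test"],
    free_slaves=[],
    connect_during={"try test": "talos-r3-leopard-026"},
)
# The try job starts even though higher-priority jobs were pending.
```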
Attached file sql data, part 2
catlee also points out that the claimed_at data isn't the time a job is first claimed, but the last time the master running the job set a heartbeat timestamp there. Using builds.start_time is a better plan. 

The ordering problem, and nice analysis of the cause, still stands though.
Attachment #539388 - Flags: review?(nrthomas)
Whiteboard: [buildmasters][scheduler] → [buildmasters][scheduler][buildfaster]
Attachment #544247 - Flags: review?(bhearsum)
Attached patch Better prioritization function (obsolete) — Splinter Review
Attachment #544249 - Flags: review?(nrthomas)
Attachment #544249 - Flags: review?(bhearsum)
Comment on attachment 544247 [details] [diff] [review]
move prioritization into buildbotcustom

Review of attachment 544247 [details] [diff] [review]:
-----------------------------------------------------------------
Attachment #544247 - Flags: review?(bhearsum) → review+
Comment on attachment 544249 [details] [diff] [review]
Better prioritization function

Review of attachment 544249 [details] [diff] [review]:
-----------------------------------------------------------------
Attachment #544249 - Flags: review?(bhearsum) → review+
Whiteboard: [buildmasters][scheduler][buildfaster] → [buildmasters][scheduler][buildfaster:p2]
Comment on attachment 544249 [details] [diff] [review]
Better prioritization function

Sorry, it took a long time to wrap my head around this. I'm still a bit concerned that (for a given set of slaves/platform) we're only going to start jobs for the highest branch priority each time around the sorting loop. That will probably lead to delays in starting jobs for lower-priority branches when we have a diverse range of branches with requests, and slaves available to serve them. 

To try to get a handle on whether that's happening, it might be useful to log how many builders are ignored because of priority, e.g. by switching the break at the end of prioritizeBuilders() to a continue, incrementing a counter beforehand, and using the counter in the log.msg.
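The suggested change might look something like this. This is a sketch against a hypothetical loop shape, not the actual prioritizeBuilders() body:

```python
# Sketch of the suggested break -> continue change (hypothetical loop
# shape, not the real prioritizeBuilders code). Instead of stopping at
# the first builder below the top priority, keep walking, count how
# many builders we skip for priority reasons, and report the count.
import logging

log = logging.getLogger(__name__)


def filter_by_priority(prioritized_builders, top_priority):
    """prioritized_builders: list of (priority, builder) tuples.

    Returns (kept_builders, skipped_count).
    """
    kept = []
    skipped = 0
    for priority, builder in prioritized_builders:
        if priority != top_priority:
            skipped += 1
            continue  # was: break
        kept.append(builder)
    log.info("ignored %d builder(s) due to branch priority", skipped)
    return kept, skipped
```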
Attachment #544249 - Flags: review?(nrthomas) → review+
Chatted with Nick on the phone, and he's right. Here's the situation:

1 mozilla-central build pending (high priority)
4 project branch builds pending (low priority)
5 free slaves

This code will remove the pending project branch builds from the list, and assign only the mozilla-central build, leaving 4 free slaves. We'll have to wait until the next time through the loop to assign those lower priority jobs. The situation gets worse as you add more levels of prioritization.

I think the fix for this is to try and detect this situation, and re-trigger the builder loop in that case.
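One way to sketch that detect-and-retrigger idea (hypothetical names and structure; the actual fix later used another call to loop.trigger()):

```python
# Hypothetical sketch of the re-trigger idea: after assigning jobs for
# the highest-priority builders, if lower-priority builders were held
# back AND slaves are still free, run the loop again immediately
# instead of waiting for the next scheduled pass. Builders that miss
# out on a slave in their pass are simply dropped here for brevity; in
# reality they would be retried on the next scheduled pass.
def run_builder_loop(pending, free_slaves):
    """pending: list of (priority, builder); lower number wins.

    Returns the list of (builder, slave) assignments made.
    """
    assignments = []
    while pending and free_slaves:
        top = min(p for p, _ in pending)
        this_round = [b for p, b in pending if p == top]
        held_back = [(p, b) for p, b in pending if p != top]
        for builder in this_round:
            if free_slaves:
                assignments.append((builder, free_slaves.pop(0)))
        # Re-trigger: immediately loop again for lower-priority work.
        pending = held_back
    return assignments
```

In the scenario above (1 mozilla-central build, 4 project branch builds, 5 free slaves) this assigns all 5 jobs in one call instead of leaving 4 slaves idle until the next pass.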
Assignee: catlee → nobody
Blocks: 840997
Is the reason that I can't understand why people act like prioritization has any effect at all that I'm missing the second step in:

1. We have a huge unordered list of every job that needs to run
2. (Something chaotic happens, a lottery, a cake walk, a rousing game of musical chairs)
3. We have tens of masters, each of which has random items from the original huge unordered list, and each one separately runs prioritizeBuilders on its own part of the list, and takes things in branch then +1 then b2g-first then scheduled time then random order, but only for its part of the list

so that the only way it would be possible to see whether it was working would be to know what jobs ran on a particular master, because it's totally expected that across all jobs, a twig job will run before a release job, and a +1 job will only run before a +0 job if they both happen to be on the same master?
Each master has the full set of unordered list of pending jobs. Each then applies the same (hopefully!) logic to sort them. Then, it processes the jobs in the new sorted order, trying to assign each one to a slave.

We determined that it's possible for a slave to become available while processing the list of sorted pending jobs, which does allow lower-priority jobs to start before higher-priority ones. Is that all that's going on here? Or do we suspect that this is happening in more cases than just slaves becoming free in the middle of the pending-job processing loop?
We could add some logging to http://hg.mozilla.org/build/buildbot/file/120afd6509ca/master/buildbot/util/loop.py#l90 to let us know when the processing loop starts/stops, and then we'll know if slaves are connecting in the middle of that.
Yeah, it certainly could be that: if you imagine zero time between the end of a loop and the start of the next, then every slave will connect at some random point in it, and prioritization will be undetectably different from random, like it is (right now, aurora and central have pending b2g/linux32 tests from 4 hours ago, while Cedar has running ones of both from 1 minute, 15 minutes, and 30 minutes after they were scheduled, including a running linux32 job that started when it had 20 pending b2g jobs).

In fact, if you imagine it as a constantly spinning loop which always has just a few high priority jobs and a lot of low priority jobs (pretty much what we have), then you would actually be much less likely to start a high priority job, since a slave would have to connect during the very small window while the loop is just about to get to the spot where all the high priority jobs are.
So this gets worse the more low-priority jobs we have pending. And if that's true, then we should see the number of jobs that run for each branch be proportional to its number of pending jobs when we're at 100% capacity?
https://github.com/catlee/buildbot-configs/compare/master...branch_priorities is an attempt to port the patch above to the current implementation.

I'd love to be able to reproduce the suspected current behaviour in a test environment and then see if this patch actually fixes it.

I think I addressed nthomas' concerns from comment 11 by adding another call to loop.trigger() at the end if we have builders that aren't the highest priority.
(In reply to Chris AtLee [:catlee] from comment #18)
> https://github.com/catlee/buildbot-configs/compare/master...
> branch_priorities is an attempt to port the patch above to the current
> implementation.
> 
> I'd love to be able to reproduce the suspected current behaviour in a test
> environment and then see if this patch actually fixes it.
> 
> I think I addressed nthomas' concerns from comment 11 by adding another call
> to loop.trigger() at the end if we have builders that aren't the highest
> priority.

I like it; my one nitpick is that we should probably do the .random() in the final sort `[b[1] for b in sorted(important_builders)]` rather than in the priority function. Otherwise we'll compute a different random priority for each type of builder repeatedly; we'd benefit from letting that final sort break ties among builders that share the same priority (imho).
Attached patch better prioritization function (obsolete) — Splinter Review
I'm actually hoping the code is well documented enough that I don't need to explain it here :)
Attachment #539388 - Attachment is obsolete: true
Attachment #544247 - Attachment is obsolete: true
Attachment #544249 - Attachment is obsolete: true
Attachment #716074 - Flags: feedback?(rail)
Attachment #716074 - Flags: feedback?(philringnalda)
Attachment #716074 - Flags: feedback?(nthomas)
Attachment #716074 - Flags: feedback?(bhearsum)
Comment on attachment 716074 [details] [diff] [review]
better prioritization function

Review of attachment 716074 [details] [diff] [review]:
-----------------------------------------------------------------

jaredgrubb just landed a change to the BRD that might be of interest to you, as well:
  https://github.com/buildbot/buildbot/pull/615

::: mozilla/master_common.py
@@ +19,5 @@
>  c['schedulers'] = []
>  c['change_source'] = []
>  
> +# Builders from these branches are given custom priority, default is 4 for unlisted branches
> +# Please leave in sorted order

It says the default is 4 here, but 2 in the code below.  This comment should also probably mention that release builds are automatically given priority 0
Comment on attachment 716074 [details] [diff] [review]
better prioritization function

>+# Builders from these branches are given custom priority, default is 4 for unlisted branches

>+    # Default branch priority is 2
>+    branch_priority = 2

Oops.

> # Give the release builders priority over other builders

Pretty sure this is an orphan comment from the very first bit of prioritizing, where the whole thing just did release > everything > try.

My one remaining concern is that we don't know how much we are currently depending on coalescing on inbound to keep try wait times down, so we're at risk of breaking them and then wanting to blame the number of try jobs rather than the number of inbound jobs we actually run instead of coalescing.
Attachment #716074 - Flags: feedback?(philringnalda) → feedback+
Comment on attachment 716074 [details] [diff] [review]
better prioritization function

Review of attachment 716074 [details] [diff] [review]:
-----------------------------------------------------------------

::: mozilla/master_common.py
@@ +72,5 @@
> +    req_priority = request[1]
> +    submitted_at = request[2]
> +
> +    # Default branch priority is 2
> +    branch_priority = 2

Nit: might be good to define default priority up above, alongside the branch priorities.
Attachment #716074 - Flags: feedback?(bhearsum) → feedback+
Attached patch better prioritization function (obsolete) — Splinter Review
I factored out the code that figures out which slaves are available, since we need to do that twice, and it was taking the most time. I also moved the default branch priority into a variable.
Attachment #716074 - Attachment is obsolete: true
Attachment #716074 - Flags: feedback?(rail)
Attachment #716074 - Flags: feedback?(nthomas)
Attachment #716490 - Flags: review?(rail)
Attachment #716490 - Flags: review?(bhearsum)
Attachment #716490 - Flags: review?(bhearsum) → review+
Comment on attachment 716490 [details] [diff] [review]
better prioritization function

lgtm
Attachment #716490 - Flags: review?(rail) → review+
Attached patch flip request priority (obsolete) — Splinter Review
Attachment #716587 - Flags: review?(bhearsum)
Attachment #716587 - Flags: review?(bhearsum) → review+
Attachment #716490 - Flags: checked-in+
Attachment #716587 - Flags: checked-in+
in production
Comment on attachment 716490 [details] [diff] [review]
better prioritization function

http://hg.mozilla.org/build/buildbot-configs/rev/ea56c3600cd4
Attachment #716490 - Flags: checked-in+ → checked-in-
Attachment #716587 - Flags: checked-in+ → checked-in-
Something went wrong and we hit a situation where a lot of slaves were idle while the pending queue was huge.
For posterity's consideration, tbpl's pending/running menu just before the backout reconfig:

ionmonkey               204 / 27
profiling                29 / 28
mozilla-b2g18_v1_0_1     11 / 57
birch                    33 / 0
ash                     236 / 1
mozilla-beta             84 / 44
alder                     3 / 0
mozilla-aurora           64 / 2
try                    2602 / 145
try-comm-central         28 / 0
gaia-central              1 / 0
cedar                   156 / 1
mozilla-central         198 / 43
comm-central              8 / 0
mozilla-inbound         794 / 34
addon-sdk                44 / 0
build-system            163 / 23
fx-team                  17 / 3
mozilla-b2g18             0 / 11
l10n-central              0 / 3
mozilla-esr17             0 / 4
Total                  4675 / 426
The problem with the previous patch is that it was looking at s.slave.name to group together slaves that were available. Unfortunately s.slave.name is None most of the time (I'm not sure what it's used for), and s.slave.slavename is the actual name of the slave.

The effect of this bug is that we'd have at most 1 slave set with an entry [None], with whichever slave was the first that was found to be available.
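The effect is easy to demonstrate in isolation (a minimal reproduction with stand-in objects, not the real SlaveBuilder class):

```python
# Minimal reproduction with stand-in objects: deduplicating slaves by
# an attribute that is None for every slave collapses them all into a
# single entry, so at most one slave ever looks available.
class FakeSlave:
    def __init__(self, slavename):
        self.name = None            # almost always None in practice
        self.slavename = slavename  # the actual name of the slave


slaves = [FakeSlave("talos-r3-leopard-%03d" % i) for i in range(1, 5)]

buggy = {s.name for s in slaves}       # collapses to {None}
fixed = {s.slavename for s in slaves}  # four distinct slaves
```

Keying on `slavename` instead, as in the diff below, keeps each slave distinct.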
Attachment #716490 - Attachment is obsolete: true
Attachment #716587 - Attachment is obsolete: true
Attachment #717048 - Flags: review?(rail)
This is the diff from the patch that landed yesterday

diff --git a/mozilla/master_common.py b/mozilla/master_common.py
index 764a1a7..2f972f9 100644
--- a/mozilla/master_common.py
+++ b/mozilla/master_common.py
@@ -147,14 +147,14 @@ def prioritizeBuilders(buildmaster, builders):
     seen_slaves = set()
     for b in builders:
         for s in b.slaves:
-            if s.slave.name not in seen_slaves:
+            if s.slave.slavename not in seen_slaves:
                 if s.isAvailable():
-                    avail_slaves.add(s.slave.name)
-                seen_slaves.add(s.slave.name)
+                    avail_slaves.add(s.slave.slavename)
+                seen_slaves.add(s.slave.slavename)
     log("found %i available of %i connected slaves", len(avail_slaves), len(seen_slaves))
 
     # Remove builders we have no slaves for
-    builders = filter(lambda builder: [s for s in builder.slaves if s.slave.name in avail_slaves], builders)
+    builders = filter(lambda builder: [s for s in builder.slaves if s.slave.slavename in avail_slaves], builders)
     log("builders with slaves: %i", len(builders))
 
     # Annotate our list of builders with their priority
@@ -166,7 +166,7 @@ def prioritizeBuilders(buildmaster, builders):
     # of slaves
     builders_by_slaves = {}
     for b in builders:
-        slaves = frozenset(s.slave.slavename for s in b[1].slaves if s.slave.name in avail_slaves)
+        slaves = frozenset(s.slave.slavename for s in b[1].slaves if s.slave.slavename in avail_slaves)
         builders_by_slaves.setdefault(slaves, []).append(b)
     log("assigned into %i slave set(s)", len(builders_by_slaves))
Attachment #717048 - Flags: review?(rail) → review+
Attachment #717048 - Flags: checked-in+
In production
I think this is working now
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
You need to log in before you can comment on or make changes to this bug.