While waiting for my Linux32 mochitest-oth to get scheduled, and while being goaded into seeing the effects of increasing priority, I was looking at what pushes had gotten what things running on Linux32, and noticed something bizarre. We have a 6+ hour backlog for Linux32 tests on try, but some jobs have already run despite only waiting for 3 hours or less, and I don't mean "some random jobs." On Linux32 debug, the scheduler prefers mochitest-3, reftest and xpcshell to other suites, and does them before other suites that have been waiting much longer. On Linux32 opt, the scheduler prefers crashtest-ipc, reftest-no-accel, reftest-ipc, xpcshell and talos to other suites. Linux64 opt prefers to run jsreftest and talos, while Linux64 debug prefers to run mochitest-4. I'm sure there must be some rational explanation, some numeric index for test suites that differs between platforms and that's getting into the sort ahead of submission_time somehow, but it sure looks like the scheduler just plays favorites with some test suites.
Because I have this example handy: https://tbpl.mozilla.org/?tree=Try&rev=8d940fe83d31 scheduled Linux32 opt tests at 21:59:06, and with no priority fiddling, it started reftest-no-accel at 22:06:45. https://tbpl.mozilla.org/?tree=Try&rev=8d940fe83d31 scheduled Linux32 opt tests at 15:53:06, and it still has a priority 0 mochitest-1 pending, so other than "more than 7 hours and 45 minutes, and thus more than 60 times how long I had to wait for my reftest-no-accel" I can't even say yet how long it will take.
Maybe a fallout of bug 714313? Could this have been happening from before?
Sounds sort of like bug 659222. I don't have a great explanation for why certain suites seem to be "preferred"... buildbot will go through the builders in the same order each time, so maybe that explains it? I would have expected slaves to become available more or less randomly during that processing, so that no one builder would be preferred... I don't see how bug 714313 could be involved here.
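To make the suspicion concrete, here's a toy model (entirely made up, not actual buildbot code) of why scanning builders in a fixed order would produce "favorites": whichever builders happen to sit earliest in the list soak up every slave that frees, no matter how long later builders have been waiting.

```python
# Toy model of a scheduler that scans builders in a fixed order.
# Builder names are just examples; this is not buildbot internals.

builders = ["mochitest-3", "reftest", "xpcshell", "mochitest-1", "talos"]
pending = {b: 5 for b in builders}  # every suite has 5 requests queued

started = []
for _ in range(5):  # five slaves become free, one at a time
    for b in builders:  # same scan order on every pass
        if pending[b] > 0:
            pending[b] -= 1
            started.append(b)
            break

# Every freed slave goes to mochitest-3; suites later in the list
# starve until the earlier ones drain, regardless of wait time.
```

Under this model the apparent per-platform "favorite suites" would just be whichever builders come first in that platform's iteration order.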
How serious is this?
If you have to get a Linux32 opt mochitest-1 run on the thing you pushed to try at noon in order to land a critical patch before the next nightly, and it's 7pm, and you cannot get it to run because someone who pushed at 6pm is testing for an intermittent xpcshell failure, so every single one of their 20 xpcshell retriggers will be picked up before your mochitest-1 will, it's pretty serious.

If, as I suspect, the same sort of thing is true of scheduling on other trees and is just usually masked by the way coalescing makes things jump around, then when you are waiting for whatever suite is picked last because that's the one you have the tree closed for, it's maybe an extra 10 minutes on a closure, or maybe an extra hour. (I do remember one wait on a Windows suite where the one I needed looked like it was going to be picked up dead last; that one was just under an hour because I finally wound up cancelling all the other tests so I could force it to be next.)

Hmm, coalescing: does this mean that on Linux32 opt on non-try trees that get lots of coalescing (which would be pretty much just inbound), we run crashtest-ipc significantly more often than crashtest, and reftest-no-accel significantly more often than reftest? That looks entirely possible, and rather backward.
We could try adding a random element here:
http://hg.mozilla.org/build/buildbot-configs/file/default/mozilla/master_common.py#l89

  return priority, req_priority, submitted_at, random.random()

so that we process builders in a different order each time.
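A small sketch of what that sort-key change does (the request dicts are made-up stand-ins, not real buildbot request objects): the random element only kicks in as a tie-breaker, so requests that differ in priority or submission time keep their relative order, while requests that tie on all three fields come out shuffled instead of in a fixed builder order.

```python
import random

def request_sort_key(req):
    # The random component only matters when the first three fields tie,
    # so higher-priority / earlier-submitted requests still sort first.
    return (req["priority"], req["req_priority"],
            req["submitted_at"], random.random())

requests = [
    {"name": "mochitest-1", "priority": 4, "req_priority": 0, "submitted_at": 100},
    {"name": "xpcshell",    "priority": 4, "req_priority": 0, "submitted_at": 100},
    {"name": "reftest",     "priority": 2, "req_priority": 0, "submitted_at": 200},
]

ordered = sorted(requests, key=request_sort_key)
# "reftest" (priority 2) always sorts first; the two tied priority-4
# requests come out in a random order from run to run.
```

The design choice here is to keep the sort stable for anything that actually differs in priority or age, and only break the deterministic bias among exact ties.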
Created attachment 673865 [details] [diff] [review]
add entropy

Not 100% sure this will fix things... but it can't hurt, right???
Comment on attachment 673865 [details] [diff] [review]
add entropy

Review of attachment 673865 [details] [diff] [review]:
-----------------------------------------------------------------

I agree, this shouldn't make things any worse... please add a comment explaining why we're introducing randomness, though.
no complaints == FIXED!
I'm actually going to reopen now, since philor reported weird fedora behavior and I can confirm it. All (well, almost all) of our fedora32 jobs are up and taking jobs, but none of them seem to be honoring branch prioritization or time prioritization. For one example of the time problem, on b2g18 a push triggered a set of tests, then about an hour later a pgo set of tests was scheduled. Mochi-1 for both sets ran and passed before mochi-4 for either set started, and mochi-4 for pgo (which was scheduled LATER) ran to completion before mochi-4 of the earlier non-pgo build even started. As for branch priority: with lots of pendings elsewhere, we're still freshly starting fedora jobs on try, despite other branches having a bunch of pending jobs.
I did some digging on this, and wrote a script to help understand what's going on:
https://github.com/nthomas-mozilla/helpers/blob/master/test_slave_usage/dump_fedora_builds.py

To confirm Callek's last point, the script returns:

Branch priorities:   # leaves out actual release builds
 0  comm-release, mozilla-release
 1  comm-esr10, comm-esr17, mozilla-b2g18, mozilla-b2g18_v1_0_0, mozilla-esr10, mozilla-esr17
 2  comm-beta, fx-team, ionmonkey, mozilla-beta, mozilla-inbound, profiling
 3  comm-aurora, comm-central, mozilla-aurora, mozilla-central
 4  try, try-comm-central
 5  alder, ash, birch, cedar, date, elm, fig, gum, holly, jamun, larch, maple, oak, pine

Running builds (as they started, then priority+wait sort)
Pri.  Wait (s)  Run (s)  Branch           Revision      Builder name
2     5447      903      mozilla-inbound  8da4794af394  Rev3 Fedora 12 mozilla-inbound opt test reftest-no-accel
3     1725      930      mozilla-central  677e87c11252  Rev3 Fedora 12 mozilla-central pgo talos other
4     1904      941      try              d9ba5784332b  Rev3 Fedora 12 try opt test crashtest
4     57521     985      try              693faa58eab0  Rev3 Fedora 12 try opt test jsreftest
4     2986      1000     try              d9ba5784332b  b2g_ics_armv7a_gecko_emulator try opt test crashtest-1
4     63823     1022     try              a197b511c2f6  Rev3 Fedora 12 try opt test mochitest-4
2     5266      1024     mozilla-inbound  8da4794af394  Rev3 Fedora 12 mozilla-inbound debug test mochitest-4
4     10905     1079     try              898a1cd52439  b2g_ics_armv7a_gecko_emulator try opt test reftest-9
2     7904      1094     mozilla-inbound  8da4794af394  b2g_ics_armv7a_gecko_emulator mozilla-inbound opt test reftest-5
4     1665      1133     try              d9ba5784332b  Rev3 Fedora 12 try debug test jetpack
4     21645     1149     try              0ec4dc0dbdee  b2g_ics_armv7a_gecko_emulator try opt test reftest-9
...

Pending builds (priority then wait sort)
Pri.  Wait (s)  Run (s)  Branch           Revision      Builder name
2     5447      0        mozilla-inbound  8da4794af394  Rev3 Fedora 12 mozilla-inbound opt test jsreftest
2     3768      0        profiling        50c7b136d935  Rev3 Fedora 12 profiling pgo test mochitest-2
2     3768      0        profiling        50c7b136d935  Rev3 Fedora 12 profiling pgo test mochitest-browser-chrome
2     3768      0        profiling        50c7b136d935  Rev3 Fedora 12 profiling pgo test mochitest-other
2     3768      0        profiling        50c7b136d935  Rev3 Fedora 12 profiling pgo test reftest-no-accel
2     3767      0        profiling        50c7b136d935  Rev3 Fedora 12 profiling pgo talos tpn
2     1065      0        fx-team          677e87c11252  b2g_ics_armv7a_gecko_emulator fx-team opt test mochitest-1
2     1065      0        fx-team          677e87c11252  b2g_ics_armv7a_gecko_emulator fx-team opt test mochitest-2

i.e. lots of try jobs starting while high priority branches have jobs pending for long periods. There is a known issue where slaves connect while the masters are scheduling jobs and have already considered high priority branches, so they end up starting try. Maybe that's an issue if we have busy masters. I was also surprised to see that fx-team, build-system, profiling et al are priority 2, and therefore higher than mozilla-central and aurora.
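For anyone reading the dump, the intended "priority then wait" ordering can be illustrated with a few rows from the tables above (toy code with assumed semantics: lower priority number first, then longer wait first; not the actual script's internals):

```python
# Rows borrowed from the dump above; priorities and waits are as reported.
pending = [
    {"branch": "try",             "priority": 4, "wait_s": 57521},
    {"branch": "mozilla-inbound", "priority": 2, "wait_s": 5447},
    {"branch": "profiling",       "priority": 2, "wait_s": 3768},
    {"branch": "fx-team",         "priority": 2, "wait_s": 1065},
]

# Lower priority number wins; among equals, the longest wait goes first.
pending.sort(key=lambda job: (job["priority"], -job["wait_s"]))
order = [job["branch"] for job in pending]
# Under this ordering, no try job should start while any priority-2
# branch still has pending work - the opposite of what the dump shows.
```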
Both the wrong position of unlisted branches and the starting of lots of low priority jobs were fixed by bug 659222 (which I kept thinking was this bug, and wondering where comments I remembered existing had gone).