Scheduling of pending test jobs chooses some test suites ahead of others that were scheduled earlier

Status

Product: Release Engineering
Component: General Automation
Priority: P3
Severity: normal
Status: RESOLVED FIXED
Reported: 5 years ago
Modified: 4 years ago

People

(Reporter: philor, Assigned: catlee)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [buildbot][schedulers])

Attachments

(1 attachment)

(Reporter)

Description

5 years ago
While waiting for my Linux32 mochitest-oth to get scheduled, and while being goaded into seeing the effects of increasing priority, I was looking at what pushes had gotten what things running on Linux32, and noticed something bizarre.

We have a 6+ hour backlog for Linux32 tests on try, but some jobs have already run despite only waiting for 3 hours or less, and I don't mean "some random jobs." On Linux32 debug, the scheduler prefers mochitest-3, reftest and xpcshell to other suites, and does them before other suites that have been waiting much longer. On Linux32 opt, the scheduler prefers crashtest-ipc, reftest-no-accel, reftest-ipc, xpcshell and talos to other suites. Linux64 opt prefers to run jsreftest and talos, while Linux64 debug prefers to run mochitest-4.

I'm sure there must be some rational explanation, some numeric index for test suites that differs between platforms and that's getting into the sort ahead of submission_time somehow, but it sure looks like the scheduler just plays favorites with some test suites.
(Reporter)

Comment 1

5 years ago
Because I have this example handy: https://tbpl.mozilla.org/?tree=Try&rev=8d940fe83d31 scheduled Linux32 opt tests at 21:59:06, and with no priority fiddling, it started reftest-no-accel at 22:06:45. https://tbpl.mozilla.org/?tree=Try&rev=8d940fe83d31 scheduled Linux32 opt tests at 15:53:06, and it still has a priority 0 mochitest-1 pending, so other than "more than 7 hours and 45 minutes, and thus more than 60 times how long I had to wait for my reftest-no-accel" I can't even say yet how long it will take.

Comment 2

5 years ago
Maybe a fallout of bug 714313?

Could this have been happening from before?
(Assignee)

Comment 3

5 years ago
Sounds sort of like bug 659222. I don't have a great explanation for why certain suites seem to be "preferred". Buildbot goes through the builders in the same order each time, so maybe that explains it? I would have expected slaves to become available more or less randomly during that processing, so that no one builder would be preferred.

I don't see how bug 714313 could be involved here...
(Assignee)

Comment 4

5 years ago
How serious is this?
Priority: -- → P3
Whiteboard: [buildbot][schedulers]
(Reporter)

Comment 5

5 years ago
If you have to get a Linux32 opt mochitest-1 run on the thing you pushed to try at noon in order to land a critical patch before the next nightly, and it's 7pm, and you cannot get it to run because someone who pushed at 6pm is testing for an intermittent xpcshell failure, so every single one of their 20 xpcshell retriggers will be picked up before your mochitest-1 will, pretty serious.

I suspect the same sort of thing is true of scheduling on other trees, and is just usually masked by the way coalescing makes things jump around. If so, then when you have the tree closed while waiting for one particular suite, and that suite happens to be the one picked last, it's maybe an extra 10 minutes on a closure, or maybe an extra hour. (I do remember one wait on a Windows suite where the one I needed looked like it was going to be picked up dead last. That wait was just under an hour only because I finally wound up cancelling all the other tests so I could force that one to be next.)

Hmm, coalescing, does this mean that on Linux32 opt on non-try trees that get lots of coalescing (which would be pretty much just inbound), we run crashtest-ipc significantly more often than we run crashtest, and reftest-no-accel significantly more often than we run reftest? That looks entirely possible, and rather backward.
(Assignee)

Comment 6

5 years ago
We could try adding a random element here:
http://hg.mozilla.org/build/buildbot-configs/file/default/mozilla/master_common.py#l89

    return priority, req_priority, submitted_at, random.random()

so that we process builders in a different order each time.
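
A minimal sketch of what that tuple does when used as a sort key, assuming builders are sorted in ascending order on it. `Builder`, `builder_sort_key`, and the sample values are hypothetical stand-ins for illustration, not the actual code in master_common.py. Because tuples compare element by element, the random value only decides the order among builders whose priority, request priority, and submission time are all equal; an older request still sorts first.

```python
import random
from collections import Counter, namedtuple

# Hypothetical stand-in for a builder with one pending request.
Builder = namedtuple("Builder", "name priority req_priority submitted_at")

def builder_sort_key(builder, rng=random):
    # Tuples compare left to right, so rng.random() only breaks ties
    # between builders whose first three fields are identical.
    return (builder.priority, builder.req_priority,
            builder.submitted_at, rng.random())

# Three suites from the same push (same submitted_at) and one older request.
builders = [
    Builder("xpcshell", 4, 0, 1000),
    Builder("mochitest-1", 4, 0, 1000),
    Builder("reftest", 4, 0, 1000),
    Builder("crashtest", 4, 0, 900),
]

rng = random.Random(0)  # seeded only to make the sketch repeatable
first = Counter()
second = Counter()
for _ in range(300):
    ordered = sorted(builders, key=lambda b: builder_sort_key(b, rng))
    first[ordered[0].name] += 1
    second[ordered[1].name] += 1

print(first)   # the oldest request is always considered first
print(second)  # the tied suites take turns being next
```

Without the random element, the three tied suites would come out in the same order on every pass, which would match the "favorites" behavior described in the report.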
(Assignee)

Updated

5 years ago
Assignee: nobody → catlee
(Assignee)

Comment 7

5 years ago
Created attachment 673865 [details] [diff] [review]
add entropy

Not 100% sure this will fix things, but it can't hurt, right?
Attachment #673865 - Flags: review?(bhearsum)
Comment on attachment 673865 [details] [diff] [review]
add entropy

Review of attachment 673865 [details] [diff] [review]:
-----------------------------------------------------------------

I agree, this shouldn't make things any worse. Please add a comment explaining why we're introducing randomness, though.
Attachment #673865 - Flags: review?(bhearsum) → review+
(Assignee)

Updated

5 years ago
Attachment #673865 - Flags: checked-in+
In production
(Assignee)

Comment 10

5 years ago
no complaints == FIXED!
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
I'm actually going to reopen now, since philor reported weird fedora behavior and I can confirm.

Almost all of our fedora32 slaves are up and taking jobs; however, none of them seem to be honoring branch prioritization or time prioritization.

For one example [time] on B2g18, a push triggered a set of tests, then ~ an hour later a pgo set of tests was scheduled.

Mochi-1 for both sets ran and passed before mochi-4 for either set started.
Mochi-4 for pgo (which was scheduled LATER) ran to completion before mochi-4 of the earlier non-pgo build even started.

With lots of pendings elsewhere, we're still freshly starting fedora jobs on try, despite other branches having a bunch of pending jobs.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I did some digging on this, and wrote a script to help understand what's going on:
 https://github.com/nthomas-mozilla/helpers/blob/master/test_slave_usage/dump_fedora_builds.py

To confirm Callek's last point, the script returns:

Branch priorities:                  # leaves out actual release builds
0 comm-release, mozilla-release
1 comm-esr10, comm-esr17, mozilla-b2g18, mozilla-b2g18_v1_0_0, mozilla-esr10, mozilla-esr17
2 comm-beta, fx-team, ionmonkey, mozilla-beta, mozilla-inbound, profiling
3 comm-aurora, comm-central, mozilla-aurora, mozilla-central
4 try, try-comm-central
5 alder, ash, birch, cedar, date, elm, fig, gum, holly, jamun, larch, maple, oak, pine

Running builds   (as they started, then priority+wait sort)
Pri.  Wait (s)    Run (s)  Branch                Revision      Builder name
2        5447         903  mozilla-inbound       8da4794af394  Rev3 Fedora 12 mozilla-inbound opt test reftest-no-accel
3        1725         930  mozilla-central       677e87c11252  Rev3 Fedora 12 mozilla-central pgo talos other
4        1904         941  try                   d9ba5784332b  Rev3 Fedora 12 try opt test crashtest
4       57521         985  try                   693faa58eab0  Rev3 Fedora 12 try opt test jsreftest
4        2986        1000  try                   d9ba5784332b  b2g_ics_armv7a_gecko_emulator try opt test crashtest-1
4       63823        1022  try                   a197b511c2f6  Rev3 Fedora 12 try opt test mochitest-4
2        5266        1024  mozilla-inbound       8da4794af394  Rev3 Fedora 12 mozilla-inbound debug test mochitest-4
4       10905        1079  try                   898a1cd52439  b2g_ics_armv7a_gecko_emulator try opt test reftest-9
2        7904        1094  mozilla-inbound       8da4794af394  b2g_ics_armv7a_gecko_emulator mozilla-inbound opt test reftest-5
4        1665        1133  try                   d9ba5784332b  Rev3 Fedora 12 try debug test jetpack
4       21645        1149  try                   0ec4dc0dbdee  b2g_ics_armv7a_gecko_emulator try opt test reftest-9
...

Pending builds   (priority then wait sort)
Pri.  Wait (s)    Run (s)  Branch                Revision      Builder name
2        5447           0  mozilla-inbound       8da4794af394  Rev3 Fedora 12 mozilla-inbound opt test jsreftest
2        3768           0  profiling             50c7b136d935  Rev3 Fedora 12 profiling pgo test mochitest-2
2        3768           0  profiling             50c7b136d935  Rev3 Fedora 12 profiling pgo test mochitest-browser-chrome
2        3768           0  profiling             50c7b136d935  Rev3 Fedora 12 profiling pgo test mochitest-other
2        3768           0  profiling             50c7b136d935  Rev3 Fedora 12 profiling pgo test reftest-no-accel
2        3767           0  profiling             50c7b136d935  Rev3 Fedora 12 profiling pgo talos tpn
2        1065           0  fx-team               677e87c11252  b2g_ics_armv7a_gecko_emulator fx-team opt test mochitest-1
2        1065           0  fx-team               677e87c11252  b2g_ics_armv7a_gecko_emulator fx-team opt test mochitest-2

i.e., lots of try jobs are starting while higher-priority branches have jobs pending for long periods. There is a known issue where slaves connect while the masters are scheduling jobs and have already considered the high-priority branches, so those slaves end up starting try jobs. Maybe that's an issue if we have busy masters.
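
The known issue can be illustrated with a toy model, assuming the master makes a single pass over pending work in priority order and hands a job to whatever slave is idle at that moment. `start_jobs_in_one_pass` and its `arrivals` parameter are hypothetical and written only for this sketch; they are not buildbot APIs.

```python
def start_jobs_in_one_pass(pending, idle_slaves, arrivals):
    """Toy model of one scheduling pass.

    pending:     list of (priority, builder_name); lower number = higher priority
    idle_slaves: number of slaves idle when the pass begins
    arrivals:    maps a position in the pass to the number of slaves that
                 connect just before that position is considered
    """
    started = []
    for position, (priority, name) in enumerate(sorted(pending)):
        idle_slaves += arrivals.get(position, 0)
        if idle_slaves > 0:
            idle_slaves -= 1
            started.append(name)
    return started

pending = [
    (4, "try crashtest"),
    (2, "mozilla-inbound jsreftest"),
    (2, "profiling mochitest-2"),
]

# A slave connects after the priority 2 builders were already considered.
late = start_jobs_in_one_pass(pending, 0, {2: 1})
# A slave connects before the pass reaches any builder.
early = start_jobs_in_one_pass(pending, 0, {0: 1})

print(late)   # ['try crashtest']
print(early)  # ['mozilla-inbound jsreftest']
```

In this model the longer a pass takes, the more slaves connect partway through it, which is one way a busy master could end up favoring low-priority branches.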

I was also surprised to see that fx-team, build-system, profiling et al are priority 2, and therefore higher than mozilla-central and aurora.
(Reporter)

Comment 13

5 years ago
Both the wrong position of unlisted branches and the starting of lots of low priority jobs were fixed by bug 659222 (which I kept thinking was this bug, and wondering where comments I remembered existing had gone).
Status: REOPENED → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering