Closed Bug 980890 (opened 11 years ago, closed 11 years ago)

Integration Trees closed, high number/time backlog of pending jobs

Categories
  Product/Component: Infrastructure & Operations Graveyard :: CIDuty
  Hardware: x86
  OS: Linux
  Type: task
  Priority: Not set
  Severity: blocker

Tracking
  (Not tracked)
  Status: RESOLVED FIXED

People
  Reporter: cbook
  Assignee: Unassigned

Details
Here we go again: a 2-hour backlog of Linux jobs in the integration trees, possibly related to problems with the Linux servo slaves disconnecting. Closing the integration trees now.
< pmoore> (bug 980889 is for the servo linux slaves)
Seems it recovered and the trees reopened. Leaving this bug open in case you want to do further investigation on this.
which linux jobs exactly? builds? tests?
(In reply to Chris AtLee [:catlee] from comment #3)
> which linux jobs exactly? builds? tests?

It was a backlog of build jobs.
After midnight we started seeing a growth of linux and linux64 *pending* jobs [1]. This is very unusual. I can also see a huge spike around midnight for the try linux/linux64 jobs [2]. rail, can you think of anything that could be causing these issues on the AWS instances? catlee mentions release jobs, which could explain it if those builds were triggered last evening.

Pending, last 24 hours:
http://people.mozilla.org/~armenzg/sattap/69594def.png [1]
http://people.mozilla.org/~armenzg/sattap/2ae1a8fd.png [2]

Running, last 24 hours:
http://people.mozilla.org/~armenzg/sattap/cd4eb63d.png
http://people.mozilla.org/~armenzg/sattap/6377027b.png
btw, Ryan closed the trees again for a backlog of build jobs.
One of the things we seem to have trouble doing is going from 0 to 60 - I merged around the integration trees just before 6pm Saturday night, when pretty much nothing was building and everything had been idle for a long time, and it took around 45 minutes to get my Linux/Android/b2g builds going. Then the 6pm PGO/non-unified builds piled on top, and they are mostly still pending 75 minutes after they were triggered.
And once I finally got build slaves, I've got linux64 tests (including b2g emulator) that have been pending for up to three hours.
And again the trees are closed for a backlog of build jobs :(
Trees reopened at 5:24 PDT, but we should really fix this :(
Landed http://hg.mozilla.org/build/buildbotcustom/rev/a55559e39a59 to see if this helps.

Here's a log snippet from buildbot-master70:

2014-03-10 06:49:31-0700 [-] prioritizeBuilders: 0.53s important builder Android 2.2 Debug b2g-inbound non-unified (p == (4, 0L, 100, 1394456402L))
2014-03-10 06:49:31-0700 [-] prioritizeBuilders: 0.53s important builder Linux x86-64 b2g-inbound non-unified (p == (4, 0L, 100, 1394456402L))
2014-03-10 06:49:31-0700 [-] prioritizeBuilders: 0.53s important builder Linux b2g-inbound pgo-build (p == (4, 0L, 100, 1394456402L))
2014-03-10 06:49:31-0700 [-] prioritizeBuilders: 0.53s important builder Linux x86-64 b2g-inbound leak test non-unified (p == (4, 0L, 100, 1394456402L))
2014-03-10 06:49:31-0700 [-] prioritizeBuilders: 0.53s important builder Linux x86-64 b2g-inbound pgo-build (p == (4, 0L, 100, 1394456402L))
2014-03-10 06:49:31-0700 [-] prioritizeBuilders: 0.53s unimportant builder Linux x86-64 mozilla-inbound leak test non-unified ((4, 0L, 100, 1394456403L) != (4, 0L, 100, 1394456402L))
... (lots of other unimportant builders) ...
2014-03-10 06:49:31-0700 [-] prioritizeBuilders: 0.55s shuffling important builders
2014-03-10 06:49:31-0700 [-] prioritizeBuilders: 0.55s triggering builder loop again since we've dropped some lower priority builders
2014-03-10 06:49:31-0700 [-] prioritizeBuilders: 0.55s finished prioritization
2014-03-10 07:03:33-0700 [-] nextAWSSlave: 0 retries for Linux x86-64 b2g-inbound leak test non-unified
2014-03-10 07:03:33-0700 [-] nextAWSSlave: No slaves appropriate for skip-spot job - returning None
2014-03-10 07:03:33-0700 [-] <Builder 'Linux x86-64 b2g-inbound leak test non-unified' at 465084288>: want to start build, but we don't have a remote
2014-03-10 07:03:33-0700 [-] <Builder 'Linux x86-64 b2g-inbound leak test non-unified' at 465084288>: got assignments: {}
2014-03-10 07:03:33-0700 [-] nextAWSSlave: 0 retries for Linux x86-64 b2g-inbound pgo-build
2014-03-10 07:03:33-0700 [-] nextAWSSlave: No slaves appropriate for skip-spot job - returning None
...

... and in the end, no jobs get assigned to slaves.

So, we have a ton of pending jobs, including some older non-unified and PGO builds. watch_pending sees all the pending jobs and is starting up machines like crazy trying to service those requests. prioritizeBuilders then sees that the most important builders are the old non-unified and PGO builds, and discards all the other job types for the current run. Unfortunately, neither prioritizeBuilders nor watch_pending knows that non-unified and PGO builds won't run on spot instances. Then prioritizeBuilders runs again, and we do exactly the same thing.
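To make that failure loop concrete, here is a minimal Python sketch of the interaction under a deliberately simplified model of the scheduler. It is not the actual buildbotcustom code: the Builder/Slave structures and the skip_spot/is_spot flags are illustrative stand-ins.

from collections import namedtuple

# Illustrative stand-ins for buildbot's builder and slave objects.
Builder = namedtuple("Builder", "name priority skip_spot")
Slave = namedtuple("Slave", "name is_spot")

def prioritize_builders(pending_builders):
    # Keep only the builders that share the single best (lowest) priority
    # tuple; everything else is "unimportant" and dropped for this pass.
    best = min(b.priority for b in pending_builders)
    return [b for b in pending_builders if b.priority == best]

def next_aws_slave(builder, available_slaves):
    # Skip-spot jobs (non-unified, PGO) may only run on non-spot slaves.
    candidates = [s for s in available_slaves
                  if not (builder.skip_spot and s.is_spot)]
    return candidates[0] if candidates else None

# The loop described above: watch_pending has been bringing up spot instances
# for the aggregate pending count, but the "important" builders are all
# skip-spot, so every one of them gets None and nothing is assigned.
pending = [
    Builder("Linux x86-64 b2g-inbound pgo-build", (4, 0, 100, 1394456402), True),
    Builder("Linux x86-64 b2g-inbound non-unified", (4, 0, 100, 1394456402), True),
    Builder("Linux x86-64 mozilla-inbound leak test", (4, 0, 100, 1394456403), False),
]
slaves = [Slave("bld-linux64-spot-001", True)]
for builder in prioritize_builders(pending):
    print(builder.name, "->", next_aws_slave(builder, slaves))  # always None

In this toy run the spot slave could happily service the leak test build, but that builder was already discarded as unimportant, so the pass ends with nothing assigned and the next pass makes the same choice.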
Urgh. The non-unified parts of http://hg.mozilla.org/build/cloud-tools/rev/481a527a9fad may require revision.
in production
We have a backlog of build jobs again, and Ryan closed the trees for it.
(In reply to Carsten Book [:Tomcat] from comment #14)
> We have a backlog of build jobs again, and Ryan closed the trees for it.

Trees reopened at 5:45am PDT.
Digging around a bit more into this today, I'm pretty sure this is caused by l10n jobs. A few snippets from master logs:

2014-03-11 05:26:42-0700 [-] prioritizeBuilders: 0.34s important builder Firefox mozilla-central linux64 l10n nightly (p == (3, 0L, 150, 1394538810L))
2014-03-11 05:26:42-0700 [-] prioritizeBuilders: 0.34s unimportant builder Firefox mozilla-central linux l10n nightly ((3, 0L, 150, 1394539963L) != (3, 0L, 150, 1394538810L))
2014-03-11 05:26:42-0700 [-] prioritizeBuilders: 0.34s unimportant builder b2g_b2g-inbound_emulator-jb_dep ((4, 0L, 100, 1394536005L) != (3, 0L, 150, 1394538810L))
2014-03-11 05:26:42-0700 [-] prioritizeBuilders: 0.35s unimportant builder b2g_b2g-inbound_linux32_gecko build ((4, 0L, 100, 1394536005L) != (3, 0L, 150, 1394538810L))
2014-03-11 05:26:42-0700 [-] prioritizeBuilders: 0.35s unimportant builder b2g_b2g-inbound_linux64_gecko build ((4, 0L, 100, 1394536005L) != (3, 0L, 150, 1394538810L))
2014-03-11 05:26:42-0700 [-] prioritizeBuilders: 0.35s unimportant builder b2g_b2g-inbound_emulator-jb-debug_dep ((4, 0L, 100, 1394536005L) != (3, 0L, 150, 1394538810L))
...

So the mozilla-central l10n jobs (branch priority = 3) are stealing all the slaves being spun up for the other jobs. Looking at jobs starting between 11:00 and 12:30 UTC seems to confirm this:

mysql> select builders.name, count(*) as c
       from builders, builds, slaves
       where builders.id = builds.builder_id
         and slaves.id = builds.slave_id
         and builds.starttime >= '2014-03-11 11:00'
         and builds.starttime <= '2014-03-11 12:30'
         and (slaves.name like 'bld-linux64-%')
       group by builders.name
       order by c desc;
+-------------------------------------------------+----+
| name                                            | c  |
+-------------------------------------------------+----+
| mozilla-aurora-linux64-l10n-nightly             | 90 |
| mozilla-aurora-linux-l10n-nightly               | 88 |
| comm-central-linux-l10n-nightly                 | 52 |
| comm-central-linux64-l10n-nightly               | 51 |
| mozilla-central-linux64-l10n-nightly            | 40 |
| mozilla-central-linux-l10n-nightly              |  6 |
| b2g_b2g-inbound_emulator_dep                    |  2 |
| try-linux64                                     |  2 |
| b2g_b2g-inbound_emulator-debug_dep              |  2 |
| b2g_mozilla-central_inari_nightly               |  1 |
| mozilla-central-android-l10n_4                  |  1 |
| mozilla-central-linux64_gecko-nightly           |  1 |
| b2g_mozilla-b2g18_hamachi_nightly               |  1 |
| holly-linux64-asan-debug-nightly                |  1 |
| mozilla-central-android-l10n_3                  |  1 |
| b2g_mozilla-central_hamachi_eng_nightly         |  1 |
| b2g_mozilla-b2g18_leo_nightly                   |  1 |
| holly-android-x86-nightly                       |  1 |
| b2g_mozilla-b2g18_v1_1_0_hd_helix_nightly       |  1 |
| mozilla-central-android-l10n_1                  |  1 |
| b2g_mozilla-central_leo_eng_nightly             |  1 |
| b2g_mozilla-b2g18_inari_nightly                 |  1 |
| holly-android-nightly                           |  1 |
| b2g_mozilla-central_helix_eng_nightly           |  1 |
| b2g_mozilla-inbound_emulator_dep                |  1 |
| mozilla-central-linux32_gecko-nightly           |  1 |
| b2g_b2g-inbound_hamachi_eng_dep                 |  1 |
| holly-android-armv6-nightly                     |  1 |
| b2g_mozilla-central_hamachi_nightly             |  1 |
| b2g_mozilla-inbound_hamachi_eng_dep             |  1 |
| mozilla-central-android-l10n_2                  |  1 |
| b2g_mozilla-central_wasabi_nightly              |  1 |
| holly-linux64-nightly                           |  1 |
| b2g_mozilla-central_helix_nightly               |  1 |
| b2g_mozilla-inbound_emulator-debug_dep          |  1 |
| mozilla-central-android-l10n_5                  |  1 |
| mozilla-central-linux64_gecko_localizer-nightly |  1 |
| b2g_mozilla-b2g18_leo_eng_nightly               |  1 |
| holly-linux64-asan-nightly                      |  1 |
+-------------------------------------------------+----+
39 rows in set (0.05 sec)
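As an aside, the priority tuples in those log lines explain the starvation directly. A tiny illustrative snippet using the values above (my reading of the tuple is that branch priority is the first element, as suggested by the "branch priority = 3" note):

# Priority tuples copied from the log (trailing L suffixes dropped).
l10n_nightly  = (3, 0, 150, 1394538810)  # Firefox mozilla-central linux64 l10n nightly
inbound_build = (4, 0, 100, 1394536005)  # b2g_b2g-inbound_linux64_gecko build (submitted earlier!)

# Tuples compare element by element, so the branch priority decides the
# ordering before the submission time is ever considered.
assert l10n_nightly < inbound_build

# prioritizeBuilders keeps only the builders whose tuple equals the minimum,
# so while l10n requests are pending, every bld-linux64 slave that comes
# online goes to an l10n builder and the inbound builds keep waiting.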
We enabled jacuzzis for the l10n builds in https://bugzilla.mozilla.org/show_bug.cgi?id=982634. Hopefully that's the end of this!
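For anyone unfamiliar with jacuzzis: the idea is to pin a builder to a small dedicated pool of slaves so it cannot occupy the whole bld-linux64 pool. A rough sketch of that idea, with a hypothetical allocation table and slave names rather than the actual jacuzzi allocator API:

# Hypothetical allocation table: builder name -> dedicated slave pool.
JACUZZI_ALLOCATIONS = {
    "mozilla-central-linux64-l10n-nightly": {"bld-linux64-spot-301", "bld-linux64-spot-302"},
    "mozilla-aurora-linux-l10n-nightly":    {"bld-linux64-spot-303", "bld-linux64-spot-304"},
}

def allowed_slaves(builder_name, connected_slaves):
    # A builder with a jacuzzi allocation may only use its own pool;
    # other builders keep access to the full set of connected slaves.
    pool = JACUZZI_ALLOCATIONS.get(builder_name)
    if pool is None:
        return list(connected_slaves)
    return [s for s in connected_slaves if s in pool]

# Even if the l10n builders remain the highest-priority ones, they can now
# tie up at most a handful of machines instead of every available slave.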
It seems that we have been doing better after that. Thanks for fixing it! Tomcat: let us know if it happens again.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard