Closed Bug 980890 (opened 11 years ago, closed 11 years ago)

Integration Trees closed, high number/time backlog of pending jobs

Categories
  Product/Component: Infrastructure & Operations Graveyard :: CIDuty
  Hardware: x86
  OS: Linux
  Type: task
  Priority: Not set
  Severity: blocker

Tracking
  (Not tracked)
  Status: RESOLVED FIXED

People
  Reporter: cbook
  Assignee: Unassigned

Details
Here we go again: a 2-hour backlog of Linux jobs in the integration trees, possibly related to problems with the Linux servo slaves disconnecting. Closing the integration trees now.
< pmoore> (bug 980889 is for the servo linux slaves)
Seems it recovered and the trees reopened. Leaving this bug open in case you want to do further investigation on this.
which linux jobs exactly? builds? tests?
(In reply to Chris AtLee [:catlee] from comment #3)
> which linux jobs exactly? builds? tests?

It was a backlog of build jobs.
After midnight we started seeing a growth of linux and linux64 *pending* jobs [1]. This is very unusual. I can also see a huge spike around midnight for the try linux/linux64 jobs [2]. rail, can you think of anything that could be causing these issues on the AWS instances? catlee mentions release jobs, which could explain it if those builds were triggered last evening.

Pending, last 24 hours:
http://people.mozilla.org/~armenzg/sattap/69594def.png [1]
http://people.mozilla.org/~armenzg/sattap/2ae1a8fd.png [2]

Running, last 24 hours:
http://people.mozilla.org/~armenzg/sattap/cd4eb63d.png
http://people.mozilla.org/~armenzg/sattap/6377027b.png
btw, Ryan closed the trees again for a backlog of build jobs.
One of the things we seem to have trouble doing is going from 0 to 60 - I merged around the integration trees just before 6pm Saturday night, when pretty much nothing was building and everything had been idle for a long time, and it took around 45 minutes to get my Linux/Android/b2g builds going. Then the 6pm PGO/non-unified builds piled on top, and they are mostly still pending 75 minutes after they were triggered.
And once I finally got build slaves, I've got linux64 tests (including b2g emulator) that have been pending for up to three hours.
And again the trees are closed for a backlog of build jobs :(
Trees reopened at 5:24 PDT, but we should really fix this :(
Landed http://hg.mozilla.org/build/buildbotcustom/rev/a55559e39a59 to see if this helps.

Here's a log snippet from buildbot-master70:

2014-03-10 06:49:31-0700 [-] prioritizeBuilders: 0.53s important builder Android 2.2 Debug b2g-inbound non-unified (p == (4, 0L, 100, 1394456402L))
2014-03-10 06:49:31-0700 [-] prioritizeBuilders: 0.53s important builder Linux x86-64 b2g-inbound non-unified (p == (4, 0L, 100, 1394456402L))
2014-03-10 06:49:31-0700 [-] prioritizeBuilders: 0.53s important builder Linux b2g-inbound pgo-build (p == (4, 0L, 100, 1394456402L))
2014-03-10 06:49:31-0700 [-] prioritizeBuilders: 0.53s important builder Linux x86-64 b2g-inbound leak test non-unified (p == (4, 0L, 100, 1394456402L))
2014-03-10 06:49:31-0700 [-] prioritizeBuilders: 0.53s important builder Linux x86-64 b2g-inbound pgo-build (p == (4, 0L, 100, 1394456402L))
2014-03-10 06:49:31-0700 [-] prioritizeBuilders: 0.53s unimportant builder Linux x86-64 mozilla-inbound leak test non-unified ((4, 0L, 100, 1394456403L) != (4, 0L, 100, 1394456402L))
... (lots of other unimportant builders) ...
2014-03-10 06:49:31-0700 [-] prioritizeBuilders: 0.55s shuffling important builders
2014-03-10 06:49:31-0700 [-] prioritizeBuilders: 0.55s triggering builder loop again since we've dropped some lower priority builders
2014-03-10 06:49:31-0700 [-] prioritizeBuilders: 0.55s finished prioritization
2014-03-10 07:03:33-0700 [-] nextAWSSlave: 0 retries for Linux x86-64 b2g-inbound leak test non-unified
2014-03-10 07:03:33-0700 [-] nextAWSSlave: No slaves appropriate for skip-spot job - returning None
2014-03-10 07:03:33-0700 [-] <Builder 'Linux x86-64 b2g-inbound leak test non-unified' at 465084288>: want to start build, but we don't have a remote
2014-03-10 07:03:33-0700 [-] <Builder 'Linux x86-64 b2g-inbound leak test non-unified' at 465084288>: got assignments: {}
2014-03-10 07:03:33-0700 [-] nextAWSSlave: 0 retries for Linux x86-64 b2g-inbound pgo-build
2014-03-10 07:03:33-0700 [-] nextAWSSlave: No slaves appropriate for skip-spot job - returning None
...

... and in the end, no jobs get assigned to slaves.

So, we have a ton of pending jobs, including some older non-unified and PGO builds. watch_pending sees all the pending jobs and is starting up machines like crazy trying to service those requests. prioritizeBuilders then sees that the most important builders are the old non-unified and PGO builds, and discards all the other job types for the current run. Unfortunately, neither prioritizeBuilders nor watch_pending knows that non-unified and PGO builds won't run on spot instances. Then prioritizeBuilders runs again, and we do exactly the same thing.
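To make that failure loop concrete, here is a minimal Python sketch of the interaction under a deliberately simplified model of the scheduler. It is not the actual buildbotcustom code: the Builder/Slave structures and the skip_spot/is_spot flags are illustrative stand-ins.

from collections import namedtuple

# Illustrative stand-ins for buildbot's builder and slave objects.
Builder = namedtuple("Builder", "name priority skip_spot")
Slave = namedtuple("Slave", "name is_spot")

def prioritize_builders(pending_builders):
    # Keep only the builders that share the single best (lowest) priority
    # tuple; everything else is "unimportant" and dropped for this pass.
    best = min(b.priority for b in pending_builders)
    return [b for b in pending_builders if b.priority == best]

def next_aws_slave(builder, available_slaves):
    # Skip-spot jobs (non-unified, PGO) may only run on non-spot slaves.
    candidates = [s for s in available_slaves
                  if not (builder.skip_spot and s.is_spot)]
    return candidates[0] if candidates else None

# The loop described above: watch_pending has been bringing up spot instances
# for the aggregate pending count, but the "important" builders are all
# skip-spot, so every one of them gets None and nothing is assigned.
pending = [
    Builder("Linux x86-64 b2g-inbound pgo-build", (4, 0, 100, 1394456402), True),
    Builder("Linux x86-64 b2g-inbound non-unified", (4, 0, 100, 1394456402), True),
    Builder("Linux x86-64 mozilla-inbound leak test", (4, 0, 100, 1394456403), False),
]
slaves = [Slave("bld-linux64-spot-001", True)]
for builder in prioritize_builders(pending):
    print(builder.name, "->", next_aws_slave(builder, slaves))  # always None

In this toy run the spot slave could happily service the leak test build, but that builder was already discarded as unimportant, so the pass ends with nothing assigned and the next pass makes the same choice.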
Urgh. The non-unified parts of http://hg.mozilla.org/build/cloud-tools/rev/481a527a9fad may require revision.
in production
We have a backlog of build jobs again, and Ryan closed the trees for it.
(In reply to Carsten Book [:Tomcat] from comment #14)
> We have a backlog of build jobs again, and Ryan closed the trees for it.

Trees reopened at 5:45am PDT.
Digging around a bit more into this today, I'm pretty sure this is caused by l10n jobs. A few snippets from master logs:

2014-03-11 05:26:42-0700 [-] prioritizeBuilders: 0.34s important builder Firefox mozilla-central linux64 l10n nightly (p == (3, 0L, 150, 1394538810L))
2014-03-11 05:26:42-0700 [-] prioritizeBuilders: 0.34s unimportant builder Firefox mozilla-central linux l10n nightly ((3, 0L, 150, 1394539963L) != (3, 0L, 150, 1394538810L))
2014-03-11 05:26:42-0700 [-] prioritizeBuilders: 0.34s unimportant builder b2g_b2g-inbound_emulator-jb_dep ((4, 0L, 100, 1394536005L) != (3, 0L, 150, 1394538810L))
2014-03-11 05:26:42-0700 [-] prioritizeBuilders: 0.35s unimportant builder b2g_b2g-inbound_linux32_gecko build ((4, 0L, 100, 1394536005L) != (3, 0L, 150, 1394538810L))
2014-03-11 05:26:42-0700 [-] prioritizeBuilders: 0.35s unimportant builder b2g_b2g-inbound_linux64_gecko build ((4, 0L, 100, 1394536005L) != (3, 0L, 150, 1394538810L))
2014-03-11 05:26:42-0700 [-] prioritizeBuilders: 0.35s unimportant builder b2g_b2g-inbound_emulator-jb-debug_dep ((4, 0L, 100, 1394536005L) != (3, 0L, 150, 1394538810L))
...

So the mozilla-central l10n jobs (branch priority = 3) are stealing all the slaves being spun up for the other jobs. Looking at jobs starting between 11:00 and 12:30 UTC seems to confirm this:

mysql> select builders.name, count(*) as c
       from builders, builds, slaves
       where builders.id = builds.builder_id
         and slaves.id = builds.slave_id
         and builds.starttime >= '2014-03-11 11:00'
         and builds.starttime <= '2014-03-11 12:30'
         and (slaves.name like 'bld-linux64-%')
       group by builders.name
       order by c desc;
+-------------------------------------------------+----+
| name                                            | c  |
+-------------------------------------------------+----+
| mozilla-aurora-linux64-l10n-nightly             | 90 |
| mozilla-aurora-linux-l10n-nightly               | 88 |
| comm-central-linux-l10n-nightly                 | 52 |
| comm-central-linux64-l10n-nightly               | 51 |
| mozilla-central-linux64-l10n-nightly            | 40 |
| mozilla-central-linux-l10n-nightly              |  6 |
| b2g_b2g-inbound_emulator_dep                    |  2 |
| try-linux64                                     |  2 |
| b2g_b2g-inbound_emulator-debug_dep              |  2 |
| b2g_mozilla-central_inari_nightly               |  1 |
| mozilla-central-android-l10n_4                  |  1 |
| mozilla-central-linux64_gecko-nightly           |  1 |
| b2g_mozilla-b2g18_hamachi_nightly               |  1 |
| holly-linux64-asan-debug-nightly                |  1 |
| mozilla-central-android-l10n_3                  |  1 |
| b2g_mozilla-central_hamachi_eng_nightly         |  1 |
| b2g_mozilla-b2g18_leo_nightly                   |  1 |
| holly-android-x86-nightly                       |  1 |
| b2g_mozilla-b2g18_v1_1_0_hd_helix_nightly       |  1 |
| mozilla-central-android-l10n_1                  |  1 |
| b2g_mozilla-central_leo_eng_nightly             |  1 |
| b2g_mozilla-b2g18_inari_nightly                 |  1 |
| holly-android-nightly                           |  1 |
| b2g_mozilla-central_helix_eng_nightly           |  1 |
| b2g_mozilla-inbound_emulator_dep                |  1 |
| mozilla-central-linux32_gecko-nightly           |  1 |
| b2g_b2g-inbound_hamachi_eng_dep                 |  1 |
| holly-android-armv6-nightly                     |  1 |
| b2g_mozilla-central_hamachi_nightly             |  1 |
| b2g_mozilla-inbound_hamachi_eng_dep             |  1 |
| mozilla-central-android-l10n_2                  |  1 |
| b2g_mozilla-central_wasabi_nightly              |  1 |
| holly-linux64-nightly                           |  1 |
| b2g_mozilla-central_helix_nightly               |  1 |
| b2g_mozilla-inbound_emulator-debug_dep          |  1 |
| mozilla-central-android-l10n_5                  |  1 |
| mozilla-central-linux64_gecko_localizer-nightly |  1 |
| b2g_mozilla-b2g18_leo_eng_nightly               |  1 |
| holly-linux64-asan-nightly                      |  1 |
+-------------------------------------------------+----+
39 rows in set (0.05 sec)
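As an aside, the priority tuples in those log lines explain the starvation directly. A tiny illustrative snippet using the values above (my reading of the tuple is that branch priority is the first element, as suggested by the "branch priority = 3" note):

# Priority tuples copied from the log (trailing L suffixes dropped).
l10n_nightly  = (3, 0, 150, 1394538810)  # Firefox mozilla-central linux64 l10n nightly
inbound_build = (4, 0, 100, 1394536005)  # b2g_b2g-inbound_linux64_gecko build (submitted earlier!)

# Tuples compare element by element, so the branch priority decides the
# ordering before the submission time is ever considered.
assert l10n_nightly < inbound_build

# prioritizeBuilders keeps only the builders whose tuple equals the minimum,
# so while l10n requests are pending, every bld-linux64 slave that comes
# online goes to an l10n builder and the inbound builds keep waiting.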
We enabled jacuzzis for the l10n builds in https://bugzilla.mozilla.org/show_bug.cgi?id=982634. Hopefully that's the end of this!
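For anyone unfamiliar with jacuzzis: the idea is to pin a builder to a small dedicated pool of slaves so it cannot occupy the whole bld-linux64 pool. A rough sketch of that idea, with a hypothetical allocation table and slave names rather than the actual jacuzzi allocator API:

# Hypothetical allocation table: builder name -> dedicated slave pool.
JACUZZI_ALLOCATIONS = {
    "mozilla-central-linux64-l10n-nightly": {"bld-linux64-spot-301", "bld-linux64-spot-302"},
    "mozilla-aurora-linux-l10n-nightly":    {"bld-linux64-spot-303", "bld-linux64-spot-304"},
}

def allowed_slaves(builder_name, connected_slaves):
    # A builder with a jacuzzi allocation may only use its own pool;
    # other builders keep access to the full set of connected slaves.
    pool = JACUZZI_ALLOCATIONS.get(builder_name)
    if pool is None:
        return list(connected_slaves)
    return [s for s in connected_slaves if s in pool]

# Even if the l10n builders remain the highest-priority ones, they can now
# tie up at most a handful of machines instead of every available slave.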
It seems that we have been doing better after that. Thanks for fixing it! Tomcat: let us know if it happens again.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard