Closed Bug 1020210 Opened 11 years ago Closed 11 years ago

All Trees Closed -> building backlog of linux jobs because of issues with dynamic jacuzzi allocation

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

x86
Linux
task
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: cbook, Unassigned)

Details

building up backlog of linux builds due to train move
03:00 < pmoore|buildduty> Tomcat|sheriffduty: i think i see the problem
03:00 < pmoore|buildduty> https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slavetype.html?class=build&type=bld-linux64-ix
03:00 < pmoore|buildduty> we only have 4 machines here, because of the train move B, we haven't updated buildbot configs yet with the new names
03:00 < pmoore|buildduty> so currently only 4 builders available :(
03:00 < Tomcat|sheriffduty> oh
03:03 < pmoore|buildduty> Tomcat|sheriffduty: also https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slavetype.html?class=test&type=tst-w64-ec2
03:03 < pmoore|buildduty> 2 available slaves, 110 pending jobs
03:00 < pmoore|buildduty> therefore 190 pending jobs for 4 slaves
So my comment above was not accurate: there were 190 pending jobs, but they could run on any linux builder, of which there were several. The real problem was that the jacuzzi allocator had left some jobs with 0 builders available to them.
note as of 4:20 all trees are now closed (not just integration) due to the building backlog
Summary: Integration Trees Closed -> building backlog of linux jobs because of builder shortage due to train move → All Trees Closed -> building backlog of linux jobs because of builder shortage due to train move
Thanks to nthomas for spotting this. So logging onto relengwebadm, we could see several unstaged changes in /data/tmp/jacuzzi-allocator/repo, e.g.:
[root@relengwebadm.private.scl3 repo]# git status
# On branch master
# Changes not staged for commit:
#   (use "git add/rm <file>..." to update what will be committed)
#   (use "git checkout -- <file>..." to discard changes in working directory)
#
#   modified:   v1/allocated/all
#   modified:   v1/builders/Firefox mozilla-aurora linux l10n nightly
#   modified:   v1/builders/Firefox mozilla-aurora linux64 l10n nightly
#   modified:   v1/builders/Firefox mozilla-central linux l10n nightly
#   modified:   v1/builders/Firefox mozilla-central linux64 l10n nightly
#   modified:   v1/builders/Linux birch build
#   modified:   v1/builders/Linux birch leak test build
#   modified:   v1/builders/Linux mozilla-inbound build
#   modified:   v1/builders/Linux mozilla-inbound leak test build
#   modified:   v1/builders/Linux x86-64 birch build
#   modified:   v1/builders/Linux x86-64 birch leak test build
#   modified:   v1/builders/Linux x86-64 mozilla-inbound build
#   modified:   v1/builders/Linux x86-64 mozilla-inbound leak test build
#   modified:   v1/builders/Thunderbird comm-aurora linux l10n nightly
#   modified:   v1/builders/Thunderbird comm-aurora linux64 l10n nightly
#   modified:   v1/builders/Thunderbird comm-central linux l10n nightly
#   modified:   v1/builders/Thunderbird comm-central linux64 l10n nightly
#   modified:   v1/builders/b2g_b2g-inbound_emulator-debug_dep
#   modified:   v1/builders/b2g_b2g-inbound_emulator-jb-debug_dep
#   modified:   v1/builders/b2g_b2g-inbound_emulator-jb_dep
#   modified:   v1/builders/b2g_b2g-inbound_emulator_dep
#   modified:   v1/builders/b2g_b2g-inbound_hamachi_eng_dep
#   modified:   v1/builders/b2g_b2g-inbound_linux32_gecko build
#   modified:   v1/builders/b2g_b2g-inbound_linux64_gecko build
#   modified:   v1/builders/b2g_mozilla-inbound_emulator-debug_dep
#   modified:   v1/builders/b2g_mozilla-inbound_emulator-jb-debug_dep
#   modified:   v1/builders/b2g_mozilla-inbound_emulator-jb_dep
#   modified:   v1/builders/b2g_mozilla-inbound_emulator_dep
#   modified:   v1/builders/b2g_mozilla-inbound_hamachi_eng_dep
#   modified:   v1/builders/b2g_mozilla-inbound_linux64_gecko build
#   deleted:    v1/machines/bld-linux64-spot-010
#   deleted:    v1/machines/bld-linux64-spot-011
#   deleted:    v1/machines/bld-linux64-spot-012
#   deleted:    v1/machines/bld-linux64-spot-013
#   deleted:    v1/machines/bld-linux64-spot-014
#   deleted:    v1/machines/bld-linux64-spot-015
#   deleted:    v1/machines/bld-linux64-spot-016
#   deleted:    v1/machines/bld-linux64-spot-017
#   deleted:    v1/machines/bld-linux64-spot-018
#   deleted:    v1/machines/bld-linux64-spot-019
#   deleted:    v1/machines/bld-linux64-spot-020
#   deleted:    v1/machines/bld-linux64-spot-021
#   deleted:    v1/machines/bld-linux64-spot-022
#   deleted:    v1/machines/bld-linux64-spot-023
#   deleted:    v1/machines/bld-linux64-spot-024
#   deleted:    v1/machines/bld-linux64-spot-025
#   deleted:    v1/machines/bld-linux64-spot-026
#   deleted:    v1/machines/bld-linux64-spot-027
#   deleted:    v1/machines/bld-linux64-spot-028
#   deleted:    v1/machines/bld-linux64-spot-029
#   deleted:    v1/machines/bld-linux64-spot-030
#   deleted:    v1/machines/bld-linux64-spot-031
#   deleted:    v1/machines/bld-linux64-spot-032
#   deleted:    v1/machines/bld-linux64-spot-033
#   deleted:    v1/machines/bld-linux64-spot-034
#   deleted:    v1/machines/bld-linux64-spot-035
#   deleted:    v1/machines/bld-linux64-spot-036
#   deleted:    v1/machines/bld-linux64-spot-037
#   deleted:    v1/machines/bld-linux64-spot-038
#   deleted:    v1/machines/bld-linux64-spot-039
#   deleted:    v1/machines/bld-linux64-spot-055
#   deleted:    v1/machines/bld-linux64-spot-058
#   deleted:    v1/machines/bld-linux64-spot-062
#   deleted:    v1/machines/bld-linux64-spot-066
#   deleted:    v1/machines/bld-linux64-spot-077
#   deleted:    v1/machines/bld-linux64-spot-081
#   deleted:    v1/machines/bld-linux64-spot-082
#   deleted:    v1/machines/bld-linux64-spot-083
#   deleted:    v1/machines/bld-linux64-spot-084
#   deleted:    v1/machines/bld-linux64-spot-086
#   deleted:    v1/machines/bld-linux64-spot-087
#   deleted:    v1/machines/bld-linux64-spot-089
#   deleted:    v1/machines/bld-linux64-spot-090
#   deleted:    v1/machines/bld-linux64-spot-091
#   deleted:    v1/machines/bld-linux64-spot-092
#   deleted:    v1/machines/bld-linux64-spot-093
#   deleted:    v1/machines/bld-linux64-spot-094
#   deleted:    v1/machines/bld-linux64-spot-095
#   deleted:    v1/machines/bld-linux64-spot-096
#   deleted:    v1/machines/bld-linux64-spot-097
#   deleted:    v1/machines/bld-linux64-spot-098
#   deleted:    v1/machines/bld-linux64-spot-099
#   deleted:    v1/machines/bld-linux64-spot-301
#   deleted:    v1/machines/bld-linux64-spot-302
#   deleted:    v1/machines/bld-linux64-spot-303
#   deleted:    v1/machines/bld-linux64-spot-304
#   deleted:    v1/machines/bld-linux64-spot-305
#   deleted:    v1/machines/bld-linux64-spot-306
#   deleted:    v1/machines/bld-linux64-spot-307
#   deleted:    v1/machines/bld-linux64-spot-308
#   deleted:    v1/machines/bld-linux64-spot-309
#   deleted:    v1/machines/bld-linux64-spot-310
#   deleted:    v1/machines/bld-linux64-spot-311
#   deleted:    v1/machines/bld-linux64-spot-312
#   deleted:    v1/machines/bld-linux64-spot-313
#   deleted:    v1/machines/bld-linux64-spot-314
#   deleted:    v1/machines/bld-linux64-spot-315
#   deleted:    v1/machines/bld-linux64-spot-316
#   deleted:    v1/machines/bld-linux64-spot-317
#   deleted:    v1/machines/bld-linux64-spot-318
#   deleted:    v1/machines/bld-linux64-spot-319
#   deleted:    v1/machines/bld-linux64-spot-327
#   deleted:    v1/machines/bld-linux64-spot-328
#   deleted:    v1/machines/bld-linux64-spot-329
#   deleted:    v1/machines/bld-linux64-spot-330
#   deleted:    v1/machines/bld-linux64-spot-331
#   deleted:    v1/machines/bld-linux64-spot-332
#   deleted:    v1/machines/bld-linux64-spot-333
#   deleted:    v1/machines/bld-linux64-spot-334
#   deleted:    v1/machines/bld-linux64-spot-346
#   deleted:    v1/machines/bld-linux64-spot-356
#   deleted:    v1/machines/bld-linux64-spot-357
#   deleted:    v1/machines/bld-linux64-spot-358
#   deleted:    v1/machines/bld-linux64-spot-359
#   deleted:    v1/machines/bld-linux64-spot-362
#   deleted:    v1/machines/bld-linux64-spot-363
#   deleted:    v1/machines/bld-linux64-spot-364
#   deleted:    v1/machines/bld-linux64-spot-365
#   deleted:    v1/machines/bld-linux64-spot-366
#   deleted:    v1/machines/bld-linux64-spot-367
#   deleted:    v1/machines/bld-linux64-spot-368
#   deleted:    v1/machines/bld-linux64-spot-369
#   deleted:    v1/machines/bld-linux64-spot-372
#   deleted:    v1/machines/bld-linux64-spot-373
#   deleted:    v1/machines/bld-linux64-spot-375
#   deleted:    v1/machines/bld-linux64-spot-378
#   deleted:    v1/machines/bld-linux64-spot-380
#   deleted:    v1/machines/bld-linux64-spot-381
#   deleted:    v1/machines/bld-linux64-spot-382
#   deleted:    v1/machines/bld-linux64-spot-383
#   deleted:    v1/machines/bld-linux64-spot-384
#   deleted:    v1/machines/bld-linux64-spot-385
#   deleted:    v1/machines/bld-linux64-spot-386
#   deleted:    v1/machines/bld-linux64-spot-387
#   deleted:    v1/machines/bld-linux64-spot-388
#   deleted:    v1/machines/bld-linux64-spot-389
#   deleted:    v1/machines/bld-linux64-spot-390
#   deleted:    v1/machines/bld-linux64-spot-391
#   deleted:    v1/machines/bld-linux64-spot-392
#   deleted:    v1/machines/bld-linux64-spot-393
#   deleted:    v1/machines/bld-linux64-spot-395
#   deleted:    v1/machines/bld-linux64-spot-396
#   deleted:    v1/machines/bld-linux64-spot-397
#   deleted:    v1/machines/bld-linux64-spot-398
#   deleted:    v1/machines/bld-linux64-spot-399
#
# Untracked files:
#   (use "git add <file>..." to include in what will be committed)
#
#   allocate.log
no changes added to commit (use "git add" and/or "git commit -a")
Summary: All Trees Closed -> building backlog of linux jobs because of builder shortage due to train move → All Trees Closed -> building backlog of linux jobs because of issues with dynamic jacuzzi allocation
/data/tmp/jacuzzi-allocator/repo and /mnt/netapp/relengweb/jacuzzi-allocator were in sync
I planned to roll back the repo to a known good state (one where no job had only 0 or 1 builders), which meant:
1) disabling puppet
2) disabling the cronjob for the dynamic allocator
3) restoring /data/tmp/jacuzzi-allocator/repo to a "known good state"
4) rsyncing /data/tmp/jacuzzi-allocator/repo to /mnt/netapp/relengweb/jacuzzi-allocator
(In reply to Carsten Book [:Tomcat] from comment #2)
> note as of 4:20 all trees are now closed (not just integration) due to the
> building backlog
Per discussion with Nick (thanks!) and the affected list at http://jacuzzi-allocator.pub.build.mozilla.org/v1/builders/ I limited this tree closure to m-i and b2g-i, since all other trees should be fine and are neither suffering from nor contributing to the job backlog.
[root@relengwebadm.private.scl3 repo]# find /data/tmp/jacuzzi-allocator/repo/v1/builders -type f -print0 | xargs -0 wc -l | sort -nr | tail -20
12 /data/tmp/jacuzzi-allocator/repo/v1/builders/b2g_b2g-inbound_hamachi_eng_dep
11 /data/tmp/jacuzzi-allocator/repo/v1/builders/WINNT 5.2 mozilla-inbound leak test build
11 /data/tmp/jacuzzi-allocator/repo/v1/builders/Linux x86-64 mozilla-inbound leak test build
11 /data/tmp/jacuzzi-allocator/repo/v1/builders/b2g_mozilla-inbound_linux64_gecko build
11 /data/tmp/jacuzzi-allocator/repo/v1/builders/b2g_b2g-inbound_linux64_gecko build
11 /data/tmp/jacuzzi-allocator/repo/v1/builders/b2g_b2g-inbound_linux32_gecko build
10 /data/tmp/jacuzzi-allocator/repo/v1/builders/WINNT 5.2 mozilla-inbound build
10 /data/tmp/jacuzzi-allocator/repo/v1/builders/Linux mozilla-inbound build
7 /data/tmp/jacuzzi-allocator/repo/v1/builders/Thunderbird comm-central linux l10n nightly
7 /data/tmp/jacuzzi-allocator/repo/v1/builders/Thunderbird comm-central linux64 l10n nightly
7 /data/tmp/jacuzzi-allocator/repo/v1/builders/Thunderbird comm-aurora linux l10n nightly
7 /data/tmp/jacuzzi-allocator/repo/v1/builders/Thunderbird comm-aurora linux64 l10n nightly
7 /data/tmp/jacuzzi-allocator/repo/v1/builders/Firefox mozilla-central linux l10n nightly
7 /data/tmp/jacuzzi-allocator/repo/v1/builders/Firefox mozilla-central linux64 l10n nightly
7 /data/tmp/jacuzzi-allocator/repo/v1/builders/Firefox mozilla-aurora linux l10n nightly
7 /data/tmp/jacuzzi-allocator/repo/v1/builders/Firefox mozilla-aurora linux64 l10n nightly
4 /data/tmp/jacuzzi-allocator/repo/v1/builders/Linux x86-64 birch leak test build
4 /data/tmp/jacuzzi-allocator/repo/v1/builders/Linux x86-64 birch build
4 /data/tmp/jacuzzi-allocator/repo/v1/builders/Linux birch leak test build
4 /data/tmp/jacuzzi-allocator/repo/v1/builders/Linux birch build
[root@relengwebadm.private.scl3 repo]# git log -1
commit d1547aec4563fb287ca2347cdb9366c4def47d41
Author: allocator <no-reply@mozilla.com>
Date:   Mon Jun 2 15:20:08 2014 -0700

    2014-06-02 15:20:05,590 - Linux x86-64 mozilla-inbound build currently 8819s full and 547765s idle
    2014-06-02 15:20:05,590 - Linux x86-64 mozilla-inbound build 9 (+1 was 8) would result in 1310s full and 547765s idle
[root@relengwebadm.private.scl3 repo]#
[root@relengwebadm.private.scl3 repo]# find /data/tmp/jacuzzi-allocator/repo/v1/builders -type f -print0 | xargs -0 wc -l | sort -nr | while read lines file; do if [ "${lines}" -lt 6 ]; then echo "${file}"; echo "${file//?/=}"; cat "${file}"; fi; done
/data/tmp/jacuzzi-allocator/repo/v1/builders/Linux x86-64 birch leak test build
===============================================================================
{ "machines": [ "bld-linux64-spot-314" ] }/data/tmp/jacuzzi-allocator/repo/v1/builders/Linux x86-64 birch build
=====================================================================
{ "machines": [ "bld-linux64-spot-311" ] }/data/tmp/jacuzzi-allocator/repo/v1/builders/Linux birch leak test build
========================================================================
{ "machines": [ "bld-linux64-spot-311" ] }/data/tmp/jacuzzi-allocator/repo/v1/builders/Linux birch build
==============================================================
{ "machines": [ "bld-linux64-spot-313" ] }[root@relengwebadm.private.scl3 repo]#
These are all birch, so this should be ok. :)
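For anyone repeating this check: the command above infers the machine count from each file's line count, which depends on how the JSON happens to be laid out. A minimal sketch of the same check that parses the JSON instead, assuming every file under v1/builders holds an object with a "machines" list as in the output above. The function name and the min_machines threshold are illustrative and not part of the allocator.

```python
import json
from pathlib import Path


def underallocated_builders(builders_dir, min_machines=2):
    """Return {builder name: machine count} for builders below min_machines.

    Assumes each regular file in builders_dir is JSON shaped like
    {"machines": ["bld-linux64-spot-311", ...]}.
    """
    short = {}
    for path in sorted(Path(builders_dir).iterdir()):
        if not path.is_file():
            continue
        machines = json.loads(path.read_text()).get("machines", [])
        if len(machines) < min_machines:
            short[path.name] = len(machines)
    return short


if __name__ == "__main__":
    # Path taken from the session above; adjust for another checkout.
    repo = "/data/tmp/jacuzzi-allocator/repo/v1/builders"
    if Path(repo).is_dir():
        for name, count in underallocated_builders(repo).items():
            print(f"{count} {name}")
```

With the default threshold this would list builders with 0 or 1 machines, which matches the "good state" definition used for the planned rollback.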
The root cause was that all the spot machines disappeared from the usable slaves report here: https://secure.pub.build.mozilla.org/builddata/reports/reportor/daily/machine_sanity/usable_slaves.json
We've updated the script to handle the new way of managing spot nodes, so once that gets deployed we should be able to re-enable the allocator.
The lack of commits is due to https://github.com/bhearsum/static-jacuzzis/blob/master/Makefile#L24 not producing any output (since the numerical allocations in config.json were fine). https://github.com/bhearsum/static-jacuzzis/blob/master/Makefile#L11 then skips the commit because there's nothing in allocate.log. However, https://github.com/bhearsum/static-jacuzzis/blob/master/Makefile#L22 has already removed all unusable slaves from the allocations by that point, leaving empty jacuzzis.
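To make the failure mode concrete, here is a minimal sketch of it. This is not the actual static-jacuzzis code: the function names, the dict layout, and the guard are all assumptions for illustration. Pruning against a usable-slaves list that is suddenly missing every spot machine silently empties jacuzzis, so one possible safeguard is to refuse a prune that would take a builder from some machines to none.

```python
def prune_unusable(allocations, usable):
    """Drop machines that are not in the usable set from every jacuzzi.

    allocations: {builder name: [machine name, ...]} (hypothetical layout)
    usable: iterable of machine names from the usable slaves report
    """
    usable = set(usable)
    return {
        builder: [m for m in machines if m in usable]
        for builder, machines in allocations.items()
    }


def safe_prune(allocations, usable):
    """Prune, but raise if a previously populated jacuzzi would end up empty."""
    pruned = prune_unusable(allocations, usable)
    emptied = sorted(
        builder
        for builder, machines in pruned.items()
        if allocations[builder] and not machines
    )
    if emptied:
        raise ValueError("prune would empty jacuzzis: " + ", ".join(emptied))
    return pruned
```

Failing loudly here trades availability of the allocator run for visibility: the allocations stay as they were and someone gets an error, instead of the change landing in the working tree without a commit.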
Severity: blocker → major
Re-enabling dynamic jacuzzis now...
Dynamic jacuzzis are live again. Closing bug. If it happens again, feel free to reopen.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard