Closed Bug 1020210 Opened 10 years ago Closed 10 years ago

All Trees Closed -> building backlog of linux jobs because of issues with dynamic jacuzzi allocation

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Platform: x86, Linux
Type: task
Priority: Not set
Severity: major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: cbook, Unassigned)

Details

Building up a backlog of Linux builds due to the train move.

03:00 < pmoore|buildduty> Tomcat|sheriffduty: i think i see the problem
03:00 < pmoore|buildduty> https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slavetype.html?class=build&type=bld-linux64-ix
03:00 < pmoore|buildduty> we only have 4 machines here, because of the train move B, we haven't updated buildbot configs yet with the new names
03:00 < pmoore|buildduty> so currently only 4 builders available :(
03:00 < Tomcat|sheriffduty> oh
03:03 < pmoore|buildduty> Tomcat|sheriffduty: also https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slavetype.html?class=test&type=tst-w64-ec2
03:03 < pmoore|buildduty> 2 available slaves, 110 pending jobs
03:00 < pmoore|buildduty> therefore 190 pending jobs for 4 slaves
So my comment above was not accurate: there were 190 pending jobs, but they were for any Linux builder, of which there were several, not only for those 4 machines.

The real problem was that the jacuzzi allocator had jobs with 0 builders available for them.
note as of 4:20 all trees are now closed (not just integration) due to the building backlog
Summary: Integration Trees Closed -> building backlog of linux jobs because of builder shortage due to train move → All Trees Closed -> building backlog of linux jobs because of builder shortage due to train move
Thanks to nthomas for spotting this.

After logging on to relengwebadm, we could see many unstaged changes in /data/tmp/jacuzzi-allocator/repo, e.g.:

[root@relengwebadm.private.scl3 repo]# git status
# On branch master
# Changes not staged for commit:
#   (use "git add/rm <file>..." to update what will be committed)
#   (use "git checkout -- <file>..." to discard changes in working directory)
#
#	modified:   v1/allocated/all
#	modified:   v1/builders/Firefox mozilla-aurora linux l10n nightly
#	modified:   v1/builders/Firefox mozilla-aurora linux64 l10n nightly
#	modified:   v1/builders/Firefox mozilla-central linux l10n nightly
#	modified:   v1/builders/Firefox mozilla-central linux64 l10n nightly
#	modified:   v1/builders/Linux birch build
#	modified:   v1/builders/Linux birch leak test build
#	modified:   v1/builders/Linux mozilla-inbound build
#	modified:   v1/builders/Linux mozilla-inbound leak test build
#	modified:   v1/builders/Linux x86-64 birch build
#	modified:   v1/builders/Linux x86-64 birch leak test build
#	modified:   v1/builders/Linux x86-64 mozilla-inbound build
#	modified:   v1/builders/Linux x86-64 mozilla-inbound leak test build
#	modified:   v1/builders/Thunderbird comm-aurora linux l10n nightly
#	modified:   v1/builders/Thunderbird comm-aurora linux64 l10n nightly
#	modified:   v1/builders/Thunderbird comm-central linux l10n nightly
#	modified:   v1/builders/Thunderbird comm-central linux64 l10n nightly
#	modified:   v1/builders/b2g_b2g-inbound_emulator-debug_dep
#	modified:   v1/builders/b2g_b2g-inbound_emulator-jb-debug_dep
#	modified:   v1/builders/b2g_b2g-inbound_emulator-jb_dep
#	modified:   v1/builders/b2g_b2g-inbound_emulator_dep
#	modified:   v1/builders/b2g_b2g-inbound_hamachi_eng_dep
#	modified:   v1/builders/b2g_b2g-inbound_linux32_gecko build
#	modified:   v1/builders/b2g_b2g-inbound_linux64_gecko build
#	modified:   v1/builders/b2g_mozilla-inbound_emulator-debug_dep
#	modified:   v1/builders/b2g_mozilla-inbound_emulator-jb-debug_dep
#	modified:   v1/builders/b2g_mozilla-inbound_emulator-jb_dep
#	modified:   v1/builders/b2g_mozilla-inbound_emulator_dep
#	modified:   v1/builders/b2g_mozilla-inbound_hamachi_eng_dep
#	modified:   v1/builders/b2g_mozilla-inbound_linux64_gecko build
#	deleted:    v1/machines/bld-linux64-spot-010
#	deleted:    v1/machines/bld-linux64-spot-011
#	deleted:    v1/machines/bld-linux64-spot-012
#	deleted:    v1/machines/bld-linux64-spot-013
#	deleted:    v1/machines/bld-linux64-spot-014
#	deleted:    v1/machines/bld-linux64-spot-015
#	deleted:    v1/machines/bld-linux64-spot-016
#	deleted:    v1/machines/bld-linux64-spot-017
#	deleted:    v1/machines/bld-linux64-spot-018
#	deleted:    v1/machines/bld-linux64-spot-019
#	deleted:    v1/machines/bld-linux64-spot-020
#	deleted:    v1/machines/bld-linux64-spot-021
#	deleted:    v1/machines/bld-linux64-spot-022
#	deleted:    v1/machines/bld-linux64-spot-023
#	deleted:    v1/machines/bld-linux64-spot-024
#	deleted:    v1/machines/bld-linux64-spot-025
#	deleted:    v1/machines/bld-linux64-spot-026
#	deleted:    v1/machines/bld-linux64-spot-027
#	deleted:    v1/machines/bld-linux64-spot-028
#	deleted:    v1/machines/bld-linux64-spot-029
#	deleted:    v1/machines/bld-linux64-spot-030
#	deleted:    v1/machines/bld-linux64-spot-031
#	deleted:    v1/machines/bld-linux64-spot-032
#	deleted:    v1/machines/bld-linux64-spot-033
#	deleted:    v1/machines/bld-linux64-spot-034
#	deleted:    v1/machines/bld-linux64-spot-035
#	deleted:    v1/machines/bld-linux64-spot-036
#	deleted:    v1/machines/bld-linux64-spot-037
#	deleted:    v1/machines/bld-linux64-spot-038
#	deleted:    v1/machines/bld-linux64-spot-039
#	deleted:    v1/machines/bld-linux64-spot-055
#	deleted:    v1/machines/bld-linux64-spot-058
#	deleted:    v1/machines/bld-linux64-spot-062
#	deleted:    v1/machines/bld-linux64-spot-066
#	deleted:    v1/machines/bld-linux64-spot-077
#	deleted:    v1/machines/bld-linux64-spot-081
#	deleted:    v1/machines/bld-linux64-spot-082
#	deleted:    v1/machines/bld-linux64-spot-083
#	deleted:    v1/machines/bld-linux64-spot-084
#	deleted:    v1/machines/bld-linux64-spot-086
#	deleted:    v1/machines/bld-linux64-spot-087
#	deleted:    v1/machines/bld-linux64-spot-089
#	deleted:    v1/machines/bld-linux64-spot-090
#	deleted:    v1/machines/bld-linux64-spot-091
#	deleted:    v1/machines/bld-linux64-spot-092
#	deleted:    v1/machines/bld-linux64-spot-093
#	deleted:    v1/machines/bld-linux64-spot-094
#	deleted:    v1/machines/bld-linux64-spot-095
#	deleted:    v1/machines/bld-linux64-spot-096
#	deleted:    v1/machines/bld-linux64-spot-097
#	deleted:    v1/machines/bld-linux64-spot-098
#	deleted:    v1/machines/bld-linux64-spot-099
#	deleted:    v1/machines/bld-linux64-spot-301
#	deleted:    v1/machines/bld-linux64-spot-302
#	deleted:    v1/machines/bld-linux64-spot-303
#	deleted:    v1/machines/bld-linux64-spot-304
#	deleted:    v1/machines/bld-linux64-spot-305
#	deleted:    v1/machines/bld-linux64-spot-306
#	deleted:    v1/machines/bld-linux64-spot-307
#	deleted:    v1/machines/bld-linux64-spot-308
#	deleted:    v1/machines/bld-linux64-spot-309
#	deleted:    v1/machines/bld-linux64-spot-310
#	deleted:    v1/machines/bld-linux64-spot-311
#	deleted:    v1/machines/bld-linux64-spot-312
#	deleted:    v1/machines/bld-linux64-spot-313
#	deleted:    v1/machines/bld-linux64-spot-314
#	deleted:    v1/machines/bld-linux64-spot-315
#	deleted:    v1/machines/bld-linux64-spot-316
#	deleted:    v1/machines/bld-linux64-spot-317
#	deleted:    v1/machines/bld-linux64-spot-318
#	deleted:    v1/machines/bld-linux64-spot-319
#	deleted:    v1/machines/bld-linux64-spot-327
#	deleted:    v1/machines/bld-linux64-spot-328
#	deleted:    v1/machines/bld-linux64-spot-329
#	deleted:    v1/machines/bld-linux64-spot-330
#	deleted:    v1/machines/bld-linux64-spot-331
#	deleted:    v1/machines/bld-linux64-spot-332
#	deleted:    v1/machines/bld-linux64-spot-333
#	deleted:    v1/machines/bld-linux64-spot-334
#	deleted:    v1/machines/bld-linux64-spot-346
#	deleted:    v1/machines/bld-linux64-spot-356
#	deleted:    v1/machines/bld-linux64-spot-357
#	deleted:    v1/machines/bld-linux64-spot-358
#	deleted:    v1/machines/bld-linux64-spot-359
#	deleted:    v1/machines/bld-linux64-spot-362
#	deleted:    v1/machines/bld-linux64-spot-363
#	deleted:    v1/machines/bld-linux64-spot-364
#	deleted:    v1/machines/bld-linux64-spot-365
#	deleted:    v1/machines/bld-linux64-spot-366
#	deleted:    v1/machines/bld-linux64-spot-367
#	deleted:    v1/machines/bld-linux64-spot-368
#	deleted:    v1/machines/bld-linux64-spot-369
#	deleted:    v1/machines/bld-linux64-spot-372
#	deleted:    v1/machines/bld-linux64-spot-373
#	deleted:    v1/machines/bld-linux64-spot-375
#	deleted:    v1/machines/bld-linux64-spot-378
#	deleted:    v1/machines/bld-linux64-spot-380
#	deleted:    v1/machines/bld-linux64-spot-381
#	deleted:    v1/machines/bld-linux64-spot-382
#	deleted:    v1/machines/bld-linux64-spot-383
#	deleted:    v1/machines/bld-linux64-spot-384
#	deleted:    v1/machines/bld-linux64-spot-385
#	deleted:    v1/machines/bld-linux64-spot-386
#	deleted:    v1/machines/bld-linux64-spot-387
#	deleted:    v1/machines/bld-linux64-spot-388
#	deleted:    v1/machines/bld-linux64-spot-389
#	deleted:    v1/machines/bld-linux64-spot-390
#	deleted:    v1/machines/bld-linux64-spot-391
#	deleted:    v1/machines/bld-linux64-spot-392
#	deleted:    v1/machines/bld-linux64-spot-393
#	deleted:    v1/machines/bld-linux64-spot-395
#	deleted:    v1/machines/bld-linux64-spot-396
#	deleted:    v1/machines/bld-linux64-spot-397
#	deleted:    v1/machines/bld-linux64-spot-398
#	deleted:    v1/machines/bld-linux64-spot-399
#
# Untracked files:
#   (use "git add <file>..." to include in what will be committed)
#
#	allocate.log
no changes added to commit (use "git add" and/or "git commit -a")
Summary: All Trees Closed -> building backlog of linux jobs because of builder shortage due to train move → All Trees Closed -> building backlog of linux jobs because of issues with dynamic jacuzzi allocation
/data/tmp/jacuzzi-allocator/repo and /mnt/netapp/relengweb/jacuzzi-allocator were in sync.
I planned to roll back the repo to a good state (one where no jobs had 0 or 1 builders), which meant:

1) disabling puppet
2) disabling the cronjob for the dynamic allocator
3) restoring /data/tmp/jacuzzi-allocator/repo to a "known good state"
4) rsyncing /data/tmp/jacuzzi-allocator/repo to /mnt/netapp/relengweb/jacuzzi-allocator
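
The rollback steps above can be sketched as a dry-run plan that only builds the command strings and runs nothing. This is an illustration, not what was actually run: `rollback_plan` and `good_commit` are hypothetical names, restoring the repo with `git reset --hard` is an assumption, and the cronjob step is left as a manual placeholder because the bug does not say how the cronjob is managed.

```python
import shlex


def rollback_plan(good_commit,
                  repo="/data/tmp/jacuzzi-allocator/repo",
                  published="/mnt/netapp/relengweb/jacuzzi-allocator"):
    """Return the rollback steps as shell command strings (nothing is executed)."""
    repo_q = shlex.quote(repo)
    published_q = shlex.quote(published)
    return [
        # 1) Stop puppet from re-applying configuration during the rollback.
        "puppet agent --disable",
        # 2) How the allocator cronjob is disabled is site-specific (assumption:
        #    done by hand), so it is kept as a comment line in the plan.
        "# disable the dynamic allocator cronjob (manual step)",
        # 3) Assumption: a known good commit exists in the repo history.
        "git -C {} reset --hard {}".format(repo_q, shlex.quote(good_commit)),
        # 4) Publish the restored tree; extra rsync flags are the operator's choice.
        "rsync -a {}/ {}/".format(repo_q, published_q),
    ]


if __name__ == "__main__":
    # "GOOD_COMMIT" is a placeholder, not a real commit id from this bug.
    for command in rollback_plan("GOOD_COMMIT"):
        print(command)
```

Printing the plan first lets a second person review the exact commands before anything touches the published allocations.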
(In reply to Carsten Book [:Tomcat] from comment #2)
> note as of 4:20 all trees are now closed (not just integration) due to the
> building backlog

Per discussion with Nick (thanks!) and the list of affected builders at http://jacuzzi-allocator.pub.build.mozilla.org/v1/builders/, I limited this tree closure to m-i and b2g-i, the affected trees, since all other trees should be fine and are neither suffering from nor contributing to the job backlog.
[root@relengwebadm.private.scl3 repo]# find /data/tmp/jacuzzi-allocator/repo/v1/builders -type f -print0 | xargs  -0 wc -l | sort -nr | tail -20
  12 /data/tmp/jacuzzi-allocator/repo/v1/builders/b2g_b2g-inbound_hamachi_eng_dep
  11 /data/tmp/jacuzzi-allocator/repo/v1/builders/WINNT 5.2 mozilla-inbound leak test build
  11 /data/tmp/jacuzzi-allocator/repo/v1/builders/Linux x86-64 mozilla-inbound leak test build
  11 /data/tmp/jacuzzi-allocator/repo/v1/builders/b2g_mozilla-inbound_linux64_gecko build
  11 /data/tmp/jacuzzi-allocator/repo/v1/builders/b2g_b2g-inbound_linux64_gecko build
  11 /data/tmp/jacuzzi-allocator/repo/v1/builders/b2g_b2g-inbound_linux32_gecko build
  10 /data/tmp/jacuzzi-allocator/repo/v1/builders/WINNT 5.2 mozilla-inbound build
  10 /data/tmp/jacuzzi-allocator/repo/v1/builders/Linux mozilla-inbound build
   7 /data/tmp/jacuzzi-allocator/repo/v1/builders/Thunderbird comm-central linux l10n nightly
   7 /data/tmp/jacuzzi-allocator/repo/v1/builders/Thunderbird comm-central linux64 l10n nightly
   7 /data/tmp/jacuzzi-allocator/repo/v1/builders/Thunderbird comm-aurora linux l10n nightly
   7 /data/tmp/jacuzzi-allocator/repo/v1/builders/Thunderbird comm-aurora linux64 l10n nightly
   7 /data/tmp/jacuzzi-allocator/repo/v1/builders/Firefox mozilla-central linux l10n nightly
   7 /data/tmp/jacuzzi-allocator/repo/v1/builders/Firefox mozilla-central linux64 l10n nightly
   7 /data/tmp/jacuzzi-allocator/repo/v1/builders/Firefox mozilla-aurora linux l10n nightly
   7 /data/tmp/jacuzzi-allocator/repo/v1/builders/Firefox mozilla-aurora linux64 l10n nightly
   4 /data/tmp/jacuzzi-allocator/repo/v1/builders/Linux x86-64 birch leak test build
   4 /data/tmp/jacuzzi-allocator/repo/v1/builders/Linux x86-64 birch build
   4 /data/tmp/jacuzzi-allocator/repo/v1/builders/Linux birch leak test build
   4 /data/tmp/jacuzzi-allocator/repo/v1/builders/Linux birch build
[root@relengwebadm.private.scl3 repo]# git log -1
commit d1547aec4563fb287ca2347cdb9366c4def47d41
Author: allocator <no-reply@mozilla.com>
Date:   Mon Jun 2 15:20:08 2014 -0700

    2014-06-02 15:20:05,590 - Linux x86-64 mozilla-inbound build currently 8819s full and 547765s idle
    2014-06-02 15:20:05,590 - Linux x86-64 mozilla-inbound build 9 (+1 was 8) would result in 1310s full and 547765s idle
[root@relengwebadm.private.scl3 repo]#
[root@relengwebadm.private.scl3 repo]# find /data/tmp/jacuzzi-allocator/repo/v1/builders -type f -print0 | xargs  -0 wc -l | sort -nr | while read lines file; do if [ "${lines}" -lt 6 ]; then echo "${file}"; echo "${file//?/=}"; cat "${file}"; fi; done
/data/tmp/jacuzzi-allocator/repo/v1/builders/Linux x86-64 birch leak test build
===============================================================================
{
  "machines": [
    "bld-linux64-spot-314"
  ]
}/data/tmp/jacuzzi-allocator/repo/v1/builders/Linux x86-64 birch build
=====================================================================
{
  "machines": [
    "bld-linux64-spot-311"
  ]
}/data/tmp/jacuzzi-allocator/repo/v1/builders/Linux birch leak test build
========================================================================
{
  "machines": [
    "bld-linux64-spot-311"
  ]
}/data/tmp/jacuzzi-allocator/repo/v1/builders/Linux birch build
==============================================================
{
  "machines": [
    "bld-linux64-spot-313"
  ]
}[root@relengwebadm.private.scl3 repo]#
These are all birch, so this should be ok. :)
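
The check above counts lines in each builder file, which depends on how the JSON happens to be formatted. A minimal sketch of the same check that parses the JSON instead, assuming every file under v1/builders has the `{"machines": [...]}` shape shown in the output above; the function name `small_jacuzzis` and the `min_machines` parameter are hypothetical:

```python
import json
import os


def small_jacuzzis(builders_dir, min_machines=2):
    """Return {builder_name: machine_count} for builders below min_machines."""
    result = {}
    for name in sorted(os.listdir(builders_dir)):
        path = os.path.join(builders_dir, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories, if any
        with open(path) as handle:
            data = json.load(handle)
        count = len(data.get("machines", []))
        if count < min_machines:
            result[name] = count
    return result


if __name__ == "__main__":
    # Path taken from the transcript above; adjust as needed.
    builders = "/data/tmp/jacuzzi-allocator/repo/v1/builders"
    for builder, count in small_jacuzzis(builders).items():
        print("%3d %s" % (count, builder))
```

With the default threshold this reports builders that have 0 or 1 machines, which is the "bad state" condition used elsewhere in this bug.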
The root cause was that all the spot machines disappeared from the usable slaves report here:
https://secure.pub.build.mozilla.org/builddata/reports/reportor/daily/machine_sanity/usable_slaves.json

We've updated the script to handle the new way of managing spot nodes, so once that gets deployed we should be able to re-enable the allocator.

The lack of commits is due to https://github.com/bhearsum/static-jacuzzis/blob/master/Makefile#L24 not producing any output (since the numerical allocations in config.json were fine). https://github.com/bhearsum/static-jacuzzis/blob/master/Makefile#L11 then skips the commit because there's nothing in allocate.log. However, https://github.com/bhearsum/static-jacuzzis/blob/master/Makefile#L22 has removed all unusable slaves from the allocations, leaving empty jacuzzis.
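
The failure mode described above (unusable slaves are stripped from the allocations, nothing is logged, and empty jacuzzis are left behind) could be guarded against with a sanity check before the pruned allocations are written out. A minimal sketch, not the actual static-jacuzzis code; the function name `prune_unusable`, the dict-of-lists input shape and the `min_machines` threshold are assumptions made for illustration:

```python
def prune_unusable(allocations, usable, min_machines=1):
    """Drop machines that are not usable; refuse if a builder would be starved.

    allocations: dict mapping builder name -> list of machine names
    usable: iterable of machine names currently reported as usable
    """
    usable = set(usable)
    pruned = {}
    starved = []
    for builder, machines in allocations.items():
        kept = [m for m in machines if m in usable]
        pruned[builder] = kept
        if len(kept) < min_machines:
            starved.append(builder)
    if starved:
        # Fail loudly instead of silently publishing empty jacuzzis.
        raise ValueError(
            "pruning would leave %d builder(s) below %d machine(s): %s"
            % (len(starved), min_machines, ", ".join(sorted(starved)))
        )
    return pruned


if __name__ == "__main__":
    allocations = {
        "Linux birch build": ["bld-linux64-spot-313"],
        "Linux x86-64 birch build": ["bld-linux64-spot-311"],
    }
    try:
        # An empty usable list mimics the spot machines vanishing from the report.
        prune_unusable(allocations, usable=[])
    except ValueError as error:
        print(error)
```

Raising here means a broken usable-slaves report stops the run with an error, rather than producing allocations that leave builders with no machines.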
Severity: blocker → major
Re-enabling dynamic jacuzzis now...
Dynamic jacuzzis are live again. Closing bug. If it happens again, feel free to reopen.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard