Closed
Bug 1020210
Opened 11 years ago
Closed 11 years ago
All Trees Closed -> building backlog of linux jobs because of issues with dynamic jacuzzi allocation
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: cbook, Unassigned)
Details
building up backlog of linux builds due to train move
03:00 < pmoore|buildduty> Tomcat|sheriffduty: i think i see the problem
03:00 < pmoore|buildduty> https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slavetype.html?class=build&type=bld-linux64-ix
03:00 < pmoore|buildduty> we only have 4 machines here, because of the train move we haven't updated buildbot configs yet with the new names
03:00 < pmoore|buildduty> so currently only 4 builders available :(
03:00 < pmoore|buildduty> therefore 190 pending jobs for 4 slaves
03:00 < Tomcat|sheriffduty> oh
03:03 < pmoore|buildduty> Tomcat|sheriffduty: also https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slavetype.html?class=test&type=tst-w64-ec2
03:03 < pmoore|buildduty> 2 available slaves, 110 pending jobs
Comment 1•11 years ago
So my comment above was not accurate: there were 190 pending jobs, but for any linux builder, of which there were several.
The real problem was that the jacuzzi allocator had jobs with 0 builders available for them.
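The per-builder allocation files shown later in comment 9 are JSON documents of the form `{"machines": [...]}`. A minimal sketch (assuming that file format only; this is not part of any Mozilla tooling) for flagging builders whose jacuzzi has dropped to zero machines:

```python
import json
from pathlib import Path

def empty_jacuzzis(builders_dir):
    """Return builder names whose allocation file lists no machines.

    Assumes each file under builders_dir is JSON shaped like
    {"machines": ["bld-linux64-spot-311", ...]}, as seen in comment 9.
    """
    empty = []
    for path in Path(builders_dir).iterdir():
        if not path.is_file():
            continue
        allocation = json.loads(path.read_text())
        # A missing or empty "machines" list means no slave can take jobs
        # for this builder, so its pending queue can only grow.
        if not allocation.get("machines"):
            empty.append(path.name)
    return sorted(empty)
```

Pointing such a check at v1/builders/ would have surfaced the starved builders before the backlog built up.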
Reporter
Comment 2•11 years ago
note as of 4:20 all trees are now closed (not just integration) due to the building backlog
Summary: Integration Trees Closed -> building backlog of linux jobs because of builder shortage due to train move → All Trees Closed -> building backlog of linux jobs because of builder shortage due to train move
Comment 3•11 years ago
Thanks to nthomas for spotting this.
So logging onto relengwebadm, we could see several unstaged changes in /data/tmp/jacuzzi-allocator/repo, e.g.:
[root@relengwebadm.private.scl3 repo]# git status
# On branch master
# Changes not staged for commit:
# (use "git add/rm <file>..." to update what will be committed)
# (use "git checkout -- <file>..." to discard changes in working directory)
#
# modified: v1/allocated/all
# modified: v1/builders/Firefox mozilla-aurora linux l10n nightly
# modified: v1/builders/Firefox mozilla-aurora linux64 l10n nightly
# modified: v1/builders/Firefox mozilla-central linux l10n nightly
# modified: v1/builders/Firefox mozilla-central linux64 l10n nightly
# modified: v1/builders/Linux birch build
# modified: v1/builders/Linux birch leak test build
# modified: v1/builders/Linux mozilla-inbound build
# modified: v1/builders/Linux mozilla-inbound leak test build
# modified: v1/builders/Linux x86-64 birch build
# modified: v1/builders/Linux x86-64 birch leak test build
# modified: v1/builders/Linux x86-64 mozilla-inbound build
# modified: v1/builders/Linux x86-64 mozilla-inbound leak test build
# modified: v1/builders/Thunderbird comm-aurora linux l10n nightly
# modified: v1/builders/Thunderbird comm-aurora linux64 l10n nightly
# modified: v1/builders/Thunderbird comm-central linux l10n nightly
# modified: v1/builders/Thunderbird comm-central linux64 l10n nightly
# modified: v1/builders/b2g_b2g-inbound_emulator-debug_dep
# modified: v1/builders/b2g_b2g-inbound_emulator-jb-debug_dep
# modified: v1/builders/b2g_b2g-inbound_emulator-jb_dep
# modified: v1/builders/b2g_b2g-inbound_emulator_dep
# modified: v1/builders/b2g_b2g-inbound_hamachi_eng_dep
# modified: v1/builders/b2g_b2g-inbound_linux32_gecko build
# modified: v1/builders/b2g_b2g-inbound_linux64_gecko build
# modified: v1/builders/b2g_mozilla-inbound_emulator-debug_dep
# modified: v1/builders/b2g_mozilla-inbound_emulator-jb-debug_dep
# modified: v1/builders/b2g_mozilla-inbound_emulator-jb_dep
# modified: v1/builders/b2g_mozilla-inbound_emulator_dep
# modified: v1/builders/b2g_mozilla-inbound_hamachi_eng_dep
# modified: v1/builders/b2g_mozilla-inbound_linux64_gecko build
# deleted: v1/machines/bld-linux64-spot-010
# deleted: v1/machines/bld-linux64-spot-011
# deleted: v1/machines/bld-linux64-spot-012
# deleted: v1/machines/bld-linux64-spot-013
# deleted: v1/machines/bld-linux64-spot-014
# deleted: v1/machines/bld-linux64-spot-015
# deleted: v1/machines/bld-linux64-spot-016
# deleted: v1/machines/bld-linux64-spot-017
# deleted: v1/machines/bld-linux64-spot-018
# deleted: v1/machines/bld-linux64-spot-019
# deleted: v1/machines/bld-linux64-spot-020
# deleted: v1/machines/bld-linux64-spot-021
# deleted: v1/machines/bld-linux64-spot-022
# deleted: v1/machines/bld-linux64-spot-023
# deleted: v1/machines/bld-linux64-spot-024
# deleted: v1/machines/bld-linux64-spot-025
# deleted: v1/machines/bld-linux64-spot-026
# deleted: v1/machines/bld-linux64-spot-027
# deleted: v1/machines/bld-linux64-spot-028
# deleted: v1/machines/bld-linux64-spot-029
# deleted: v1/machines/bld-linux64-spot-030
# deleted: v1/machines/bld-linux64-spot-031
# deleted: v1/machines/bld-linux64-spot-032
# deleted: v1/machines/bld-linux64-spot-033
# deleted: v1/machines/bld-linux64-spot-034
# deleted: v1/machines/bld-linux64-spot-035
# deleted: v1/machines/bld-linux64-spot-036
# deleted: v1/machines/bld-linux64-spot-037
# deleted: v1/machines/bld-linux64-spot-038
# deleted: v1/machines/bld-linux64-spot-039
# deleted: v1/machines/bld-linux64-spot-055
# deleted: v1/machines/bld-linux64-spot-058
# deleted: v1/machines/bld-linux64-spot-062
# deleted: v1/machines/bld-linux64-spot-066
# deleted: v1/machines/bld-linux64-spot-077
# deleted: v1/machines/bld-linux64-spot-081
# deleted: v1/machines/bld-linux64-spot-082
# deleted: v1/machines/bld-linux64-spot-083
# deleted: v1/machines/bld-linux64-spot-084
# deleted: v1/machines/bld-linux64-spot-086
# deleted: v1/machines/bld-linux64-spot-087
# deleted: v1/machines/bld-linux64-spot-089
# deleted: v1/machines/bld-linux64-spot-090
# deleted: v1/machines/bld-linux64-spot-091
# deleted: v1/machines/bld-linux64-spot-092
# deleted: v1/machines/bld-linux64-spot-093
# deleted: v1/machines/bld-linux64-spot-094
# deleted: v1/machines/bld-linux64-spot-095
# deleted: v1/machines/bld-linux64-spot-096
# deleted: v1/machines/bld-linux64-spot-097
# deleted: v1/machines/bld-linux64-spot-098
# deleted: v1/machines/bld-linux64-spot-099
# deleted: v1/machines/bld-linux64-spot-301
# deleted: v1/machines/bld-linux64-spot-302
# deleted: v1/machines/bld-linux64-spot-303
# deleted: v1/machines/bld-linux64-spot-304
# deleted: v1/machines/bld-linux64-spot-305
# deleted: v1/machines/bld-linux64-spot-306
# deleted: v1/machines/bld-linux64-spot-307
# deleted: v1/machines/bld-linux64-spot-308
# deleted: v1/machines/bld-linux64-spot-309
# deleted: v1/machines/bld-linux64-spot-310
# deleted: v1/machines/bld-linux64-spot-311
# deleted: v1/machines/bld-linux64-spot-312
# deleted: v1/machines/bld-linux64-spot-313
# deleted: v1/machines/bld-linux64-spot-314
# deleted: v1/machines/bld-linux64-spot-315
# deleted: v1/machines/bld-linux64-spot-316
# deleted: v1/machines/bld-linux64-spot-317
# deleted: v1/machines/bld-linux64-spot-318
# deleted: v1/machines/bld-linux64-spot-319
# deleted: v1/machines/bld-linux64-spot-327
# deleted: v1/machines/bld-linux64-spot-328
# deleted: v1/machines/bld-linux64-spot-329
# deleted: v1/machines/bld-linux64-spot-330
# deleted: v1/machines/bld-linux64-spot-331
# deleted: v1/machines/bld-linux64-spot-332
# deleted: v1/machines/bld-linux64-spot-333
# deleted: v1/machines/bld-linux64-spot-334
# deleted: v1/machines/bld-linux64-spot-346
# deleted: v1/machines/bld-linux64-spot-356
# deleted: v1/machines/bld-linux64-spot-357
# deleted: v1/machines/bld-linux64-spot-358
# deleted: v1/machines/bld-linux64-spot-359
# deleted: v1/machines/bld-linux64-spot-362
# deleted: v1/machines/bld-linux64-spot-363
# deleted: v1/machines/bld-linux64-spot-364
# deleted: v1/machines/bld-linux64-spot-365
# deleted: v1/machines/bld-linux64-spot-366
# deleted: v1/machines/bld-linux64-spot-367
# deleted: v1/machines/bld-linux64-spot-368
# deleted: v1/machines/bld-linux64-spot-369
# deleted: v1/machines/bld-linux64-spot-372
# deleted: v1/machines/bld-linux64-spot-373
# deleted: v1/machines/bld-linux64-spot-375
# deleted: v1/machines/bld-linux64-spot-378
# deleted: v1/machines/bld-linux64-spot-380
# deleted: v1/machines/bld-linux64-spot-381
# deleted: v1/machines/bld-linux64-spot-382
# deleted: v1/machines/bld-linux64-spot-383
# deleted: v1/machines/bld-linux64-spot-384
# deleted: v1/machines/bld-linux64-spot-385
# deleted: v1/machines/bld-linux64-spot-386
# deleted: v1/machines/bld-linux64-spot-387
# deleted: v1/machines/bld-linux64-spot-388
# deleted: v1/machines/bld-linux64-spot-389
# deleted: v1/machines/bld-linux64-spot-390
# deleted: v1/machines/bld-linux64-spot-391
# deleted: v1/machines/bld-linux64-spot-392
# deleted: v1/machines/bld-linux64-spot-393
# deleted: v1/machines/bld-linux64-spot-395
# deleted: v1/machines/bld-linux64-spot-396
# deleted: v1/machines/bld-linux64-spot-397
# deleted: v1/machines/bld-linux64-spot-398
# deleted: v1/machines/bld-linux64-spot-399
#
# Untracked files:
# (use "git add <file>..." to include in what will be committed)
#
# allocate.log
no changes added to commit (use "git add" and/or "git commit -a")
Updated•11 years ago
Summary: All Trees Closed -> building backlog of linux jobs because of builder shortage due to train move → All Trees Closed -> building backlog of linux jobs because of issues with dynamic jacuzzi allocation
Comment 5•11 years ago
/data/tmp/jacuzzi-allocator/repo
and
/mnt/netapp/relengweb/jacuzzi-allocator
were in sync
Comment 6•11 years ago
I planned to roll back the repo to a good state (one where no jobs had 0 or 1 builders), which meant:
1) disabling puppet
2) disabling the cronjob for the dynamic allocator
3) restoring /data/tmp/jacuzzi-allocator/repo to a known good state
4) rsyncing /data/tmp/jacuzzi-allocator/repo to /mnt/netapp/relengweb/jacuzzi-allocator
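The four steps above could be sketched as the following command sequence, built as a dry-run helper rather than executed (hypothetical: the exact puppet/cron disabling mechanism and the known-good revision are assumptions, not recorded in this bug):

```python
def rollback_commands(repo, webroot, good_rev):
    """Build the shell commands for the rollback plan above, without running them.

    repo, webroot and good_rev are placeholders; the real known-good
    revision and the exact puppet/cron toggles are not given in this bug.
    """
    return [
        "touch /etc/puppet/disabled",                    # 1) disable puppet (mechanism assumed)
        "crontab -l | grep -v jacuzzi | crontab -",      # 2) disable the allocator cronjob (assumed)
        "git -C %s reset --hard %s" % (repo, good_rev),  # 3) restore repo to a known good state
        "rsync -a --delete %s/ %s/" % (repo, webroot),   # 4) publish the restored state to the web root
    ]
```

Disabling puppet and cron first matters: otherwise the allocator could re-clobber the restored state between steps 3 and 4.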
Reporter
Comment 7•11 years ago
(In reply to Carsten Book [:Tomcat] from comment #2)
> note as of 4:20 all trees are now closed (not just integration) due to the
> building backlog
Per discussion with Nick (thanks!) and the affected builder list at http://jacuzzi-allocator.pub.build.mozilla.org/v1/builders/, limited this tree closure to m-i and b2g-i as the affected trees, since all other trees should be fine and neither suffering from nor contributing to the job backlog.
Comment 8•11 years ago
|
||
[root@relengwebadm.private.scl3 repo]# find /data/tmp/jacuzzi-allocator/repo/v1/builders -type f -print0 | xargs -0 wc -l | sort -nr | tail -20
12 /data/tmp/jacuzzi-allocator/repo/v1/builders/b2g_b2g-inbound_hamachi_eng_dep
11 /data/tmp/jacuzzi-allocator/repo/v1/builders/WINNT 5.2 mozilla-inbound leak test build
11 /data/tmp/jacuzzi-allocator/repo/v1/builders/Linux x86-64 mozilla-inbound leak test build
11 /data/tmp/jacuzzi-allocator/repo/v1/builders/b2g_mozilla-inbound_linux64_gecko build
11 /data/tmp/jacuzzi-allocator/repo/v1/builders/b2g_b2g-inbound_linux64_gecko build
11 /data/tmp/jacuzzi-allocator/repo/v1/builders/b2g_b2g-inbound_linux32_gecko build
10 /data/tmp/jacuzzi-allocator/repo/v1/builders/WINNT 5.2 mozilla-inbound build
10 /data/tmp/jacuzzi-allocator/repo/v1/builders/Linux mozilla-inbound build
7 /data/tmp/jacuzzi-allocator/repo/v1/builders/Thunderbird comm-central linux l10n nightly
7 /data/tmp/jacuzzi-allocator/repo/v1/builders/Thunderbird comm-central linux64 l10n nightly
7 /data/tmp/jacuzzi-allocator/repo/v1/builders/Thunderbird comm-aurora linux l10n nightly
7 /data/tmp/jacuzzi-allocator/repo/v1/builders/Thunderbird comm-aurora linux64 l10n nightly
7 /data/tmp/jacuzzi-allocator/repo/v1/builders/Firefox mozilla-central linux l10n nightly
7 /data/tmp/jacuzzi-allocator/repo/v1/builders/Firefox mozilla-central linux64 l10n nightly
7 /data/tmp/jacuzzi-allocator/repo/v1/builders/Firefox mozilla-aurora linux l10n nightly
7 /data/tmp/jacuzzi-allocator/repo/v1/builders/Firefox mozilla-aurora linux64 l10n nightly
4 /data/tmp/jacuzzi-allocator/repo/v1/builders/Linux x86-64 birch leak test build
4 /data/tmp/jacuzzi-allocator/repo/v1/builders/Linux x86-64 birch build
4 /data/tmp/jacuzzi-allocator/repo/v1/builders/Linux birch leak test build
4 /data/tmp/jacuzzi-allocator/repo/v1/builders/Linux birch build
[root@relengwebadm.private.scl3 repo]# git log -1
commit d1547aec4563fb287ca2347cdb9366c4def47d41
Author: allocator <no-reply@mozilla.com>
Date: Mon Jun 2 15:20:08 2014 -0700
2014-06-02 15:20:05,590 - Linux x86-64 mozilla-inbound build currently 8819s full and 547765s idle
2014-06-02 15:20:05,590 - Linux x86-64 mozilla-inbound build 9 (+1 was 8) would result in 1310s full and 547765s idle
[root@relengwebadm.private.scl3 repo]#
Comment 9•11 years ago
[root@relengwebadm.private.scl3 repo]# find /data/tmp/jacuzzi-allocator/repo/v1/builders -type f -print0 | xargs -0 wc -l | sort -nr | while read lines file; do if [ "${lines}" -lt 6 ]; then echo "${file}"; echo "${file//?/=}"; cat "${file}"; fi; done
/data/tmp/jacuzzi-allocator/repo/v1/builders/Linux x86-64 birch leak test build
===============================================================================
{
"machines": [
"bld-linux64-spot-314"
]
}/data/tmp/jacuzzi-allocator/repo/v1/builders/Linux x86-64 birch build
=====================================================================
{
"machines": [
"bld-linux64-spot-311"
]
}/data/tmp/jacuzzi-allocator/repo/v1/builders/Linux birch leak test build
========================================================================
{
"machines": [
"bld-linux64-spot-311"
]
}/data/tmp/jacuzzi-allocator/repo/v1/builders/Linux birch build
==============================================================
{
"machines": [
"bld-linux64-spot-313"
]
}[root@relengwebadm.private.scl3 repo]#
Comment 10•11 years ago
These are all birch, so this should be ok. :)
Comment 11•11 years ago
The root cause was that all the spot machines disappeared from the usable slaves report here:
https://secure.pub.build.mozilla.org/builddata/reports/reportor/daily/machine_sanity/usable_slaves.json
We've updated the script to handle the new way of managing spot nodes, so once that gets deployed we should be able to re-enable the allocator.
The lack of commits is due to https://github.com/bhearsum/static-jacuzzis/blob/master/Makefile#L24 not producing any output (since the numerical allocations in config.json were fine). https://github.com/bhearsum/static-jacuzzis/blob/master/Makefile#L11 then skips the commit because there's nothing in allocate.log. However, https://github.com/bhearsum/static-jacuzzis/blob/master/Makefile#L22 has removed all unusable slaves from the allocations, leaving empty jacuzzis.
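A stripped-down Python model of that failure mode (illustrative only; the real logic lives in the Makefile and scripts linked above): the allocator prunes unusable slaves in place, but only commits when the numeric-allocation step produced log output, so a pruning-only run leaves empty jacuzzis on disk with nothing in allocate.log to trigger a commit.

```python
def run_allocator(allocations, usable, log_lines):
    """Model the buggy flow described in comment 11.

    allocations: {builder: [machine, ...]}, mutated in place like the
    real working copy. log_lines stands in for allocate.log, which is
    written only by the separate numeric-allocation step. Returns True
    if a commit would have happened.
    """
    # Step corresponding to Makefile L22: drop slaves missing from the
    # usable-slaves report, even if that empties a jacuzzi.
    for builder, machines in allocations.items():
        allocations[builder] = [m for m in machines if m in usable]
    # Step corresponding to Makefile L11: commit only when allocate.log
    # is non-empty, so a run that merely pruned slaves is never committed.
    return bool(log_lines)
```

With the usable-slaves report empty (as happened here), every jacuzzi is silently drained while the commit guard sees nothing to record: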
Severity: blocker → major
Comment 12•11 years ago
Re-enabling dynamic jacuzzis now...
Comment 13•11 years ago
Dynamic jacuzzis are live again. Closing this bug; if it happens again, feel free to reopen.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Updated•7 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•5 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard