Trees closed, mass command queue dead items

RESOLVED FIXED

Status

Release Engineering
Buildduty
--
blocker
RESOLVED FIXED
a year ago
a year ago

People

(Reporter: philor, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Description

a year ago
Hundreds of Nagios alerts about "Command Queue is CRITICAL: 12 dead items"; all non-try trees are closed.
I ran the ansible script per https://wiki.mozilla.org/ReleaseEngineering/Queue_directories#troubleshooting

Still getting reports about dead items in #buildduty.
Nick broke PTO silence and offered:
[15:25] <@nthomas|pto>| I think it’s upload.ffxbld.productdelivery.prod.mozaws.net not responding to ssh connections, in which case it’s a Cloud Services problem, eg oremj
[15:25] <@nthomas|pto>| that’s what’s happening on bm52 in use1, at least
[15:26] <@nthomas|pto>| opens a connection ok, but never completes all setup work on the session
[15:27] <@nthomas|pto>| aka nc works, ssh -v ends with
[15:27] <@nthomas|pto>| debug1: Entering interactive session.
[15:27] <@nthomas|pto>| debug1: Sending environment.
[15:27] <@nthomas|pto>| debug1: Sending env LANG = en_US.UTF-8
[15:27] <@nthomas|pto>| shell request failed on channel 0
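The "nc works" half of that diagnosis only exercises the TCP layer. A minimal sketch of an equivalent check in Python follows; the function name and timeout are illustrative and not part of any existing buildduty tooling:

```python
import socket


def tcp_port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be established.

    This only shows that the listener accepts connections (what nc showed).
    It says nothing about whether sshd can finish setting up a session.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A True result here together with a failing `ssh -v` matches the symptom described above: the port accepts connections, but session setup never completes.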
UPDATE: I had told philor to close try (a while ago), based on initial symptoms.

A minute or two ago oremj told me try looked like it wasn't exhibiting the errors; I confirmed that on one of our buildbot masters and had try reopened.
The ffxbld host has been fixed. A bunch of defunct sshd processes quickly accumulated, maxing out sshd's process limits. I'll add monitoring for this scenario on Monday.
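A rough sketch of what a check for this scenario could look like, in Python. This is not the monitoring that was actually added; the function names are hypothetical, and it assumes a Linux `ps` that accepts `-o stat= -o comm=` and reports zombies with a state starting with `Z`:

```python
import subprocess


def count_defunct(ps_output, name="sshd"):
    """Count zombie processes called `name` in `ps -e -o stat= -o comm=` output."""
    count = 0
    for line in ps_output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        state, command = fields[0], fields[1]
        # A state beginning with "Z" marks a defunct (zombie) process.
        if state.startswith("Z") and command == name:
            count += 1
    return count


def defunct_sshd_on_this_host():
    """Run ps locally and count defunct sshd processes."""
    out = subprocess.check_output(
        ["ps", "-e", "-o", "stat=", "-o", "comm="], text=True
    )
    return count_defunct(out)


sample = """\
Ss   sshd
Z    sshd <defunct>
Z    sshd <defunct>
S    bash
Z    python <defunct>
"""
print(count_defunct(sample))  # 2
```

An alert threshold on the returned count would have to be chosen for the host; none is given in this bug.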
We had a handful of the following style of error at ~2016-10-02 16:26:08,644 (PT):

Exception: Command ['ssh', '-l', 'ffxbld', '-i', '/home/cltbld/.ssh/ffxbld_rsa', '-p', '22', 'upload.ffxbld.productdelivery.prod.mozaws.net', u'post_upload.py --tinderbox-builds-dir fx-team-linux64 -b fx-team -p firefox -i 20161002114254 --revision 19c9698fe7c3c724485604c9b6ba530ebb539550 --release-to-tinderbox-dated-builds /media/ephemeral0/tmp/tmp.NhjZIKpDLm /media/ephemeral0/tmp/tmp.NhjZIKpDLm/fx-team_ubuntu64_vm_test-reftest-no-accel-e10s-1-bm51-tests1-linux64-build9.txt.gz'] returned non-zero exit code 255:
ssh_exchange_identification: Connection closed by remote host

Another retry of those caused them all to succeed; :philor has reopened trees.
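The two ssh failure messages quoted in this bug point at different stages of the connection. A small illustrative helper in Python for telling them apart when triaging dead items; the function and its labels are hypothetical, and only the two matched strings come from this bug:

```python
def classify_ssh_failure(stderr):
    """Map ssh stderr text to the stage that failed (illustrative labels)."""
    if "ssh_exchange_identification" in stderr:
        # The remote end closed the connection before the version
        # exchange completed, so no authentication was attempted.
        return "banner"
    if "shell request failed" in stderr:
        # The client got as far as an interactive session, but the
        # server could not start a shell on the channel.
        return "session"
    return "unknown"


print(classify_ssh_failure(
    "ssh_exchange_identification: Connection closed by remote host"
))  # banner
print(classify_ssh_failure("shell request failed on channel 0"))  # session
```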
Status: NEW → RESOLVED
Last Resolved: a year ago
Resolution: --- → FIXED