Closed Bug 936827 - Opened 11 years ago, Closed 11 years ago

Number of running panda slaves is near zero

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Hardware: ARM
OS: Android
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 932231

People

(Reporter: mbrubeck, Assigned: Callek)

Details

Aside from two jobs running on mozilla-beta, no Android 4.0 test jobs have started in the past ten hours.  There's a growing pending queue on mozilla-inbound, and we may need to close the trees if this doesn't recover soon.
I'm not with my laptop today, so I don't have VPN access until late tomorrow ET, but I am available to direct via phone call if needed.

A member of RelEng or RelOps can fix this by imaging all pandas with no image to the android...v3 image. The underlying issue is a P1 for me at the moment, so I'll hopefully be able to finish the fix Monday.
Severity: critical → blocker
19-hour backlog? Closed b2g-inbound, fx-team, mozilla-aurora, mozilla-beta, mozilla-central, mozilla-inbound, mozilla-release.
After working with Nthomas (super great, thanks Nick) this Sunday, I reopened the trees at 2:40am Pacific because Nick was able to bring back some pandas and the trees are picking up tests again.

Hopefully the pandas will stay alive and will now take on jobs.
I've been applying the panda-android-4.0.4_v3.1 image to pandas which are marked as production and are in an otherwise OK state. I haven't gotten through the whole pool yet (progress is on https://etherpad.mozilla.org/uqrNsJF4rJ), but the backlog of pending jobs is being cut through quickly.
Severity: blocker → major
The remaining machines have been set to reimage (details of specific panda-#### on the etherpad). Over to Callek for a cross-check.

Btw, I noticed there are ~25 pandas which last did work on 10/25 and probably need a different kind of rescue; they are often in a failed_pxe_booting state.
Assignee: nobody → bugspam.Callek
Severity: major → normal
Follow-up after IRC discussion; Callek asked me to add this for further investigation tomorrow morning.


The panda-not-accepting-jobs woes here may be related to RelEng losing a bunch of infrastructure in MTV1 earlier in the morning; specifically we lost KVMs (kvm, kvm1, kvm2, kvm3), PDUs (pdu1, pdu3, pdu5), buildbot-master{10,19,20,22; bm-remote}, lots of foopies, and bm-remote-talos-webhost-{01,02,03}. Nagios started alerting about all of these systems at 07:39am PDT and claimed they all recovered at 08:45am PDT; mbrubeck noticed the backlog and filed this bug at 13:22 PDT.

The only info I have so far is in Nagios; one example is below. Happy to include (many!) other examples of Nagios alerts if that helps.


-------- Original Message --------
Subject: PROBLEM - Host DOWN alert for kvm1.build.mtv1.mozilla.com!
Date: Sat, 09 Nov 2013 07:39:30 -0800
From: nagios@nagios.releng.scl3.mozilla.com (nagios)
To: release@mozilla.com

***** Nagios  *****

Notification Type: PROBLEM
Host: kvm1.build.mtv1.mozilla.com
State: DOWN
Address: 10.250.49.201
Info: PING CRITICAL - Packet loss = 100%

Date/Time: 11-09-2013 07:39:30
Any reason for me not to dupe this against bug 932231?

Note the main reason for the panda shortage is bug 936615, which Callek is working on, but until fixed requires regular manual re-imaging of pandas via lifeguard.
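
For anyone scripting this workaround while bug 936615 is open, below is a minimal sketch of driving the re-imaging loop against a Mozpool/Lifeguard-style REST API. The host name, endpoint paths, payload fields, and device state names are assumptions for illustration only, not the verified production API; in practice the re-imaging in this bug was done by hand through lifeguard.

    # Minimal sketch: the host name, endpoint paths, payload fields, and
    # device state names here are illustrative assumptions, not the
    # verified Mozpool/Lifeguard API.
    import requests

    MOZPOOL = "http://mobile-imaging-001.p1.releng.scl1.mozilla.com"  # assumed host
    IMAGE = "panda-android-4.0.4_v3.1"

    def list_devices():
        # Assumed endpoint returning {"devices": [{"name": ..., "state": ...}, ...]}
        resp = requests.get(MOZPOOL + "/api/device/list/?details=1", timeout=30)
        resp.raise_for_status()
        return resp.json()["devices"]

    def reimage(name):
        # Assumed lifeguard "please_image" event carrying the target image name
        resp = requests.post(
            MOZPOOL + "/api/device/" + name + "/event/please_image/",
            json={"image": IMAGE},
            timeout=30,
        )
        resp.raise_for_status()

    if __name__ == "__main__":
        for device in list_devices():
            # Only touch pandas that look healthy and idle; anything stuck in a
            # failed_* state (e.g. failed_pxe_booting) needs a different rescue.
            if device["name"].startswith("panda-") and device.get("state") == "ready":
                print("re-imaging", device["name"])
                reimage(device["name"])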
(In reply to John O'Duinn [:joduinn] from comment #6)
> Follow-up after IRC discussion; Callek asked me to add this for further
> investigation tomorrow morning.

Turns out that event had nothing to do with this; I had mistakenly heard "kvms went down and most foopies" as "...in scl1", which wasn't actually present in the statement.

(In reply to Ed Morley [:edmorley UTC+1] from comment #7)
> Any reason for me not to dupe this against bug 932231?

Sounds good to me.

> Note the main reason for the panda shortage is bug 936615, which Callek is
> working on, but until fixed requires regular manual re-imaging of pandas via
> lifeguard.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → DUPLICATE
No longer depends on: 932231
(In reply to Justin Wood (:Callek) from comment #8)
> (In reply to John O'Duinn [:joduinn] from comment #6)
> > Follow-up after IRC discussion; Callek asked me to add this for further
> > investigation tomorrow morning.
> 
> Turns out that event had nothing to do with this,

And we now have bug 937322 to track our investigation into the MTV1 event, for any who wish to follow along there.
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard