Closed Bug 936827 - Opened 11 years ago, Closed 11 years ago

Number of running panda slaves is near zero

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Hardware: ARM
OS: Android
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 932231

People

(Reporter: mbrubeck, Assigned: Callek)

Details

Aside from two jobs running on mozilla-beta, no Android 4.0 test jobs have started in the past ten hours.  There's a growing pending queue on mozilla-inbound, and we may need to close the trees if this doesn't recover soon.
I'm not with my laptop today, so I don't have VPN access until late tomorrow ET, but I am available to direct via phone call if needed.

A member of RelEng or RelOps can fix this by imaging all pandas with no image to the android...v3 image. The underlying issue is a P1 for me at the moment, so I'll hopefully be able to finish the fix Monday.
Severity: critical → blocker
19-hour backlog? Closed b2g-inbound, fx-team, mozilla-aurora, mozilla-beta, mozilla-central, mozilla-inbound, mozilla-release.
After working with Nthomas (super great, thanks Nick) this Sunday, I reopened the trees at 2:40am Pacific because Nick was able to bring back some pandas and the trees are picking up tests again.

Hopefully the pandas will stay alive and will now take on jobs.
I've been applying the panda-android-4.0.4_v3.1 image to pandas which are marked as production and are in an otherwise OK state. I haven't gotten through the whole pool yet (progress is on https://etherpad.mozilla.org/uqrNsJF4rJ), but the backlog of pending jobs is being cut through quickly.
Severity: blocker → major
The remaining machines have been set to reimage (details of specific panda-#### on the etherpad). Over to Callek for a cross-check.

Btw, I noticed there are ~25 pandas which last did work on 10/25 and probably need a different kind of rescue; they are often in a failed_pxe_booting state.
Assignee: nobody → bugspam.Callek
Severity: major → normal
Follow-up after IRC discussion; Callek asked me to add this for further investigation tomorrow morning.


The panda-not-accepting-jobs woes here may be related to RelEng losing a bunch of infrastructure in MTV1 earlier in the morning; specifically we lost KVMs (kvm, kvm1, kvm2, kvm3), PDUs (pdu1, pdu3, pdu5), buildbot-master{10,19,20,22; bm-remote}, lots of foopies, and bm-remote-talos-webhost-{01,02,03}. Nagios started alerting about all of these systems at 07:39am PDT and claimed they all recovered at 08:45am PDT; mbrubeck noticed the backlog and filed this bug at 13:22 PDT.

The only info I have so far is in Nagios; one example is below. Happy to include (many!) other examples of Nagios alerts if that helps.


-------- Original Message --------
Subject: PROBLEM - Host DOWN alert for kvm1.build.mtv1.mozilla.com!
Date: Sat, 09 Nov 2013 07:39:30 -0800
From: nagios@nagios.releng.scl3.mozilla.com (nagios)
To: release@mozilla.com

***** Nagios  *****

Notification Type: PROBLEM
Host: kvm1.build.mtv1.mozilla.com
State: DOWN
Address: 10.250.49.201
Info: PING CRITICAL - Packet loss = 100%

Date/Time: 11-09-2013 07:39:30
Any reason for me not to dupe this against bug 932231?

Note the main reason for the panda shortage is bug 936615, which Callek is working on, but until fixed requires regular manual re-imaging of pandas via lifeguard.
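
For anyone scripting this workaround while bug 936615 is open, below is a minimal sketch of driving the re-imaging loop against a Mozpool/Lifeguard-style REST API. The host name, endpoint paths, payload fields, and device state names are assumptions for illustration only, not the verified production API; in practice the re-imaging in this bug was done by hand through lifeguard.

    # Minimal sketch: the host name, endpoint paths, payload fields, and
    # device state names here are illustrative assumptions, not the
    # verified Mozpool/Lifeguard API.
    import requests

    MOZPOOL = "http://mobile-imaging-001.p1.releng.scl1.mozilla.com"  # assumed host
    IMAGE = "panda-android-4.0.4_v3.1"

    def list_devices():
        # Assumed endpoint returning {"devices": [{"name": ..., "state": ...}, ...]}
        resp = requests.get(MOZPOOL + "/api/device/list/?details=1", timeout=30)
        resp.raise_for_status()
        return resp.json()["devices"]

    def reimage(name):
        # Assumed lifeguard "please_image" event carrying the target image name
        resp = requests.post(
            MOZPOOL + "/api/device/" + name + "/event/please_image/",
            json={"image": IMAGE},
            timeout=30,
        )
        resp.raise_for_status()

    if __name__ == "__main__":
        for device in list_devices():
            # Only touch pandas that look healthy and idle; anything stuck in a
            # failed_* state (e.g. failed_pxe_booting) needs a different rescue.
            if device["name"].startswith("panda-") and device.get("state") == "ready":
                print("re-imaging", device["name"])
                reimage(device["name"])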
(In reply to John O'Duinn [:joduinn] from comment #6)
> Follow-up after IRC discussion; Callek asked me to add this for further
> investigation tomorrow morning.

Turns out that event had nothing to do with this; I had mistakenly heard "kvms went down and most foopies" as "...in scl1", which wasn't actually present in the statement.

(In reply to Ed Morley [:edmorley UTC+1] from comment #7)
> Any reason for me not to dupe this against bug 932231?

Sounds good to me.

> Note the main reason for the panda shortage is bug 936615, which Callek is
> working on, but until fixed requires regular manual re-imaging of pandas via
> lifeguard.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → DUPLICATE
No longer depends on: 932231
(In reply to Justin Wood (:Callek) from comment #8)
> (In reply to John O'Duinn [:joduinn] from comment #6)
> > Follow-up after IRC discussion; Callek asked me to add this for further
> > investigation tomorrow morning.
> 
> Turns out that event had nothing to do with this,

And we now have bug 937322 to track our investigation into the MTV1 event, for any who wish to follow along there.
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard