Closed Bug 817103 (bad-panda-log) Opened 12 years ago Closed 11 years ago

Interim Bad Panda Log

Component: Infrastructure & Operations :: RelOps: General
Type: task
Priority: not set
Severity: normal
Tracking: not tracked
Status: RESOLVED DUPLICATE of bug 902657
Reporter: dustin
Assignee: dividehex
This bug will serve as a log for all failing pandas and their remediation, while we gather data to develop a more robust and distributed process.  See
  https://wiki.mozilla.org/ReleaseEngineering/Mozpool/Handling_Panda_Failures
From Armen in bug 818729:

I'm getting failed_b2g_downloading for two boards:
http://mobile-imaging-001.p1.releng.scl1.mozilla.com/ui/log.html?device=panda-0094
http://mobile-imaging-001.p1.releng.scl1.mozilla.com/ui/log.html?device=panda-0089

This does not happen constantly, but I would like to discover what it is so that we don't have the problem in production.

Here's the return code I get:
5:33:45    ERROR - Bad return status from http://mobile-imaging-001.p1.releng.scl1.mozilla.com/api/device/panda-0094/request/: 502!
those two pandas have this in the logs:

> 2012-12-04 17:10:09 | entering state pxe_power_cycling
> 2012-12-04 17:10:09 | setting PXE config to 'panda-b2g.1'
> 2012-12-04 17:10:09 | initiating power cycle
> 2012-12-04 17:10:12 | entering state pxe_booting
> 2012-12-04 17:11:26 | second stage URL found: http://10.12.128.33/scripts/b2g-second-stage.sh
> 2012-12-04 17:11:26 | wget success: --2025-11-08 18:22:20--  http://10.12.128.33/scripts/b2g-second-stage.sh#012Connecting
> 2012-12-04 17:11:26 | Executing second-stage.sh
> 2012-12-04 17:11:26 | beginning b2g-second-stage.sh
> 2012-12-04 17:11:26 | setting time with ntpdate
> 2012-12-04 17:11:34 | Submitting lifeguard event at http://10.12.128.33/api/device/panda-0094/event/b2g_downloading/
> 2012-12-04 17:11:34 | clearing PXE config
> 2012-12-04 17:11:34 | entering state b2g_downloading
> 2012-12-04 17:11:34 | getting B2G_URL from http://10.12.128.33/api/device/panda-0094/bootconfig/
> 2012-12-04 17:11:35 | B2G URL: https://pvtbuilds.mozilla.org/pub/mozilla.org/b2g/tinderbox-builds/cedar-panda/201212041449
> 2012-12-04 17:11:35 | fetching https://pvtbuilds.mozilla.org/pub/mozilla.org/b2g/tinderbox-builds/cedar-panda/201212041449
> 2012-12-04 17:11:35 | wget failed:
> 2012-12-04 17:14:39 | entering state failed_b2g_downloading
> 2012-12-04 17:14:39 | device has failed: failed_b2g_downloading

so they're being given the wrong URL.  No mozpool problems here.
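
For triage purposes, here's a minimal sketch of sanity-checking the URL a board is handed before it burns a boot cycle. The "b2g_url" field name and the JSON shape of the /bootconfig/ response are assumptions inferred from the log lines above, not a documented API.

# Hedged sketch: fetch a device's bootconfig from mozpool and check that the
# B2G URL in it is actually reachable.
import json
import urllib.error
import urllib.request

MOZPOOL = "http://mobile-imaging-001.p1.releng.scl1.mozilla.com"

def check_b2g_url(device):
    with urllib.request.urlopen("%s/api/device/%s/bootconfig/" % (MOZPOOL, device)) as resp:
        bootconfig = json.loads(resp.read().decode("utf-8"))
    url = bootconfig.get("b2g_url")  # assumed field name, inferred from the logs above
    if not url:
        return "no B2G URL in bootconfig: %r" % bootconfig
    try:
        urllib.request.urlopen(urllib.request.Request(url, method="HEAD"), timeout=30)
        return "ok: %s" % url
    except urllib.error.URLError as exc:
        return "unreachable (%s): %s" % (exc, url)

print(check_b2g_url("panda-0094"))

In the logs above the wget against the pvtbuilds URL is what failed, so a check like this would have flagged the bad URL before the board reached failed_b2g_downloading.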
This is the latest set of pandas to take a look at (taken from bug 826694):

0034 -  failed_pxe_booting
0106 -  pc_pinging
0250 -  failed_pxe_booting
0258-0290 - failed_pxe_booting
0342 -  failed_pxe_booting
0468 -  failed_pxe_booting
0472 -  failed_pxe_booting
0488 -  failed_pxe_booting
0554 -  failed_pxe_booting
0562 -  failed_pxe_booting
0570 -  failed_pxe_booting
0574 -  failed_pxe_booting
0598 -  failed_pxe_booting
0621 -  failed_pxe_booting
0662 -  failed_pxe_booting
0720-0730 - pxe_power_cycling - possible relay failure -- see bug 821239
0741 -  failed_pxe_booting
0781 -  failed_pxe_booting
0793 -  failed_pxe_booting
0796 -  failed_pxe_booting
According to https://wiki.mozilla.org/ReleaseEngineering/Mozpool/Handling_Panda_Failures#Log_Failure_In_Interim_Tracking_Bug

* panda-0091
* failure type: "statemachine ignored event free in state failed_b2g_pinging"
* logs below
* the panda was doing nothing - it got re-imaged on the 11th, but the re-image ended with a failure to reboot and ping. Trying to request the device or reboot it does not succeed.

2013-01-11T14:24:50 syslog Submitting lifeguard event at http://10.12.128.33/api/device/panda-0091/event/b2g_rebooting/
2013-01-11T14:24:50 statemachine entering state b2g_rebooting
2013-01-11T14:24:50 syslog Imaging complete. Rebooting
2013-01-11T14:26:59 statemachine entering state b2g_pinging
2013-01-11T14:27:00 bmm ping of panda-0091.p1.releng.scl1.mozilla.com complete: failed
2013-01-11T14:27:19 statemachine entering state b2g_pinging
2013-01-11T14:27:20 bmm ping of panda-0091.p1.releng.scl1.mozilla.com complete: failed
2013-01-11T14:27:39 statemachine entering state b2g_pinging
2013-01-11T14:27:40 bmm ping of panda-0091.p1.releng.scl1.mozilla.com complete: failed
2013-01-11T14:27:59 statemachine entering state b2g_pinging
2013-01-11T14:28:00 bmm ping of panda-0091.p1.releng.scl1.mozilla.com complete: failed
2013-01-11T14:28:19 statemachine entering state b2g_pinging
2013-01-11T14:28:20 bmm ping of panda-0091.p1.releng.scl1.mozilla.com complete: failed
2013-01-11T14:28:40 statemachine entering state failed_b2g_pinging
2013-01-11T14:28:41 statemachine device has failed: failed_b2g_pinging
2013-01-14T08:51:56 statemachine ignored event free in state failed_b2g_pinging
It sounds like this failed to install B2G, which is usually an image problem, rather than a hardware problem.  The self-test support that's going to land soon will address this (by self-testing).  In the interim, try reimaging it with Android?  If that succeeds, it will go to the 'ready' and then 'free' states and can then be requested.
I didn't do anything to it and it recovered :S

2013-01-14T10:33:40 bmm ping of panda-0091.p1.releng.scl1.mozilla.com complete: failed
2013-01-14T10:33:59 statemachine entering state pc_power_cycling
2013-01-14T10:33:59 bmm clearing PXE config
2013-01-14T10:33:59 bmm initiating power cycle
2013-01-14T10:34:02 statemachine entering state pc_rebooting
2013-01-14T10:36:09 statemachine entering state pc_pinging
2013-01-14T10:36:09 bmm ping of panda-0091.p1.releng.scl1.mozilla.com complete: ok
2013-01-14T10:36:09 statemachine entering state ready
2013-01-14T10:37:20 statemachine in ready state but not assigned to a request; moving to free state
Maybe a squirrely B2G image?
colo-trip: --- → scl1
Update now that nagios is working: the following production pandas are reporting as down:


panda-0106.p1.releng.scl1.mozilla.com CRITICAL: in state failed_pxe_booting
panda-0342.p3.releng.scl1.mozilla.com CRITICAL: in state failed_pxe_booting
panda-0472.p5.releng.scl1.mozilla.com CRITICAL: in state failed_pxe_booting
panda-0488.p5.releng.scl1.mozilla.com CRITICAL: in state failed_pxe_booting
panda-0554.p6.releng.scl1.mozilla.com CRITICAL: in state failed_pxe_booting
panda-0562.p6.releng.scl1.mozilla.com CRITICAL: in state failed_pxe_booting
panda-0574.p6.releng.scl1.mozilla.com CRITICAL: in state failed_pxe_booting
panda-0598.p6.releng.scl1.mozilla.com CRITICAL: in state failed_pxe_booting

This doesn't cover panda-0720 - panda-0730 which are unresponsive because panda-relay-065 is not functioning properly (see bug 821239).
Update:
I went through all the problem boards on 1/29 and was able to fix them or mark them for replacement.

These boards have been fixed:
panda-0342	switch patch cable was plugged into an empty panda port on the rear of the chassis
panda-0472	had a non-PXE boot image on it; reimaged with preseed
panda-0488	MAC address collision with panda-0482; fixed panda-0482's MAC address in inventory
panda-0554	power cables swapped between 0554 and an empty spot
panda-0562	MAC address collision with panda-0561; fixed panda-0562's MAC address in inventory
panda-0598	power cables swapped between 0598 and an empty spot
panda-0621	seemed OK; re-imaged with android
panda-0741	power cables swapped between 0741 and the empty spot
panda-0796	relays didn't match/were swapped: 0795 = bank2 relay7, 0796 = bank2 relay8, empty slot = bank2 relay6. I had to change these in inventory rather than swap cables due to cable lengths
panda-{0258..0290}	these were all in the wrong VLAN scope in inventory

These boards seemed ok:
panda-0034
panda-0106

These boards have failed and need to be replaced (see Bug 836857):
panda-0250	failed on mkfs.ext. couldn't see mmc devices in /dev
panda-0468	failed on mkfs.ext. couldn't see mmc devices in /dev
panda-0570	failed on mkfs.ext. couldn't see mmc devices in /dev
panda-0574	MAC address collision with 0531
panda-0662	lights but no serial output and no LAN traffic
panda-0781	bad ethernet port on panda board
panda-0793	failed on mkfs.ext. couldn't see mmc devices in /dev
I just landed http://hg.mozilla.org/build/tools/rev/6597bd01e9f4 to account for the changes in c#11

I also just deployed that to all foopies.

I also coordinated getting the device directories on these foopies stopped and removed (for the bug 836857 list), so we don't try to re-enable them with the devices missing.
Depends on: panda-0070
No longer depends on: panda-0070
panda-0139 is burning every job it takes and is complaining on #buildduty

https://secure.pub.build.mozilla.org/buildapi/recent/panda-0139
14:32 nagios-releng: Fri 11:32:28 PST [418] panda-0139.p1.releng.scl1.mozilla.com is DOWN :CRITICAL: in state failed_b2g_pinging
Blocks: panda-0139
(In reply to Armen Zambrano G. [:armenzg] from comment #13)
> panda-0139 is burning every job it takes and is complaining on #buildduty
> 
> https://secure.pub.build.mozilla.org/buildapi/recent/panda-0139
> 14:32 nagios-releng: Fri 11:32:28 PST [418]
> panda-0139.p1.releng.scl1.mozilla.com is DOWN :CRITICAL: in state
> failed_b2g_pinging

Once a panda board enters a failed_*_pinging state, it will refuse all please_image requests.  At this time, these can be handled by manually requesting a please_self_test from the lifeguard UI.  If the test passes, the board will be put back into the free state, where it can accept please_image requests.  If the board continues to end up in a failed_*_pinging state, please raise the issue in this bug again.

I've gone ahead and requested a self test on all 4 of the failed_b2g_pinging pandas.  These have been in a failed state for quite some time now.
  
panda-0139
panda-0104
panda-0117
panda-0123

All 4 passed.
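
For the record, the same thing can presumably be done without the lifeguard UI; here's a hedged sketch. The event URL mirrors the "Submitting lifeguard event at .../api/device/<name>/event/<event>/" lines in the logs above, but treating please_self_test as an event accepted on that endpoint, and /state/ as the polling endpoint, are assumptions rather than documented API.

# Hedged sketch: ask lifeguard for a self test and poll until the device
# settles.  Endpoint names beyond what appears in the logs are assumptions.
import json
import time
import urllib.request

MOZPOOL = "http://mobile-imaging-001.p1.releng.scl1.mozilla.com"

def device_state(name):
    with urllib.request.urlopen("%s/api/device/%s/state/" % (MOZPOOL, name)) as resp:
        return json.loads(resp.read().decode("utf-8"))["state"]

def request_self_test(name, timeout=1800, poll=30):
    url = "%s/api/device/%s/event/please_self_test/" % (MOZPOOL, name)
    urllib.request.urlopen(urllib.request.Request(url, data=b"{}"))  # POST the event
    deadline = time.time() + timeout
    state = None
    while time.time() < deadline:
        state = device_state(name)
        if state in ("ready", "free") or state.startswith("failed_"):
            break
        time.sleep(poll)
    return state

for name in ("panda-0139", "panda-0104", "panda-0117", "panda-0123"):
    print(name, request_self_test(name))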
And it's worth noting, the Buildbot/clientproxy code shouldn't be trying to use a panda that's in a failed state.
Once it gets to that point, would it make sense to auto-self-test?
Or, if I request that device, could mozpool try to self test automatically if it's in a known bad state?
Is there a technical reason that we won't try to auto-recover?

We request a device several times and then give up after a maximum number of attempts.

Should we check the status every time we get a denied request?
Which statuses should we request a self test for?
Is there a way through the API that I can request a self test?

If the self test fails, any suggestions on how the automation could stop taking more jobs for that panda?
FTR, the state was showing as "pending"
10:09:26     INFO - Waiting for request 'ready' stage.  Current state: 'pending'

https://tbpl.mozilla.org/php/getParsedLog.php?id=19538828&tree=Firefox&full=1#error0

            response = mph.query_request_status(self.request_url)
            state = response['state']
            if state == 'ready':
                return
            self.info("Waiting for request 'ready' stage.  Current state: '%s'" % state)
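
For context, the loop around those lines keeps polling as long as the state isn't 'ready', so a request that never becomes ready holds the job. A hedged sketch of a more defensive version (same query call, plus a hard deadline) could look like this; only the 'pending' and 'ready' state names are taken from the log above, everything else is treated generically.

# Hedged sketch, not the actual mozharness code: wait for the mozpool request
# to reach 'ready', but give up after a deadline instead of waiting forever.
import time

def wait_for_ready(mph, request_url, timeout=1800, poll=60):
    deadline = time.time() + timeout
    state = None
    while time.time() < deadline:
        state = mph.query_request_status(request_url)['state']
        if state == 'ready':
            return state
        print("Waiting for request 'ready' stage.  Current state: '%s'" % state)
        time.sleep(poll)
    raise Exception("request %s never reached 'ready' (last state: '%s')"
                    % (request_url, state))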
(In reply to Armen Zambrano G. [:armenzg] from comment #16)
> Once it gets to that point, would it make sense to auto-self-test?
> Or, if I request that device, could mozpool try to self test automatically
> if it's in a known bad state?
> Is there a technical reason that we won't try to auto-recover?

We're working on that - bug 834568.  It's a little complicated.

> We request a device several times and then give up after a maximum number of attempts.
> 
> Should we check the status every time we get a denied request?

Ah, I think the piece of information we're not agreeing on is that Buildbot requests the device after the build has started, so the build burns if that request fails.

This isn't really the way the system was designed -- it was designed so that Buildbot requests *a* device, not a specific device -- so we may have to put some workarounds in place here.

> Which statuses should we request a self test for?
> Is there a way through the API that I can request a self test?

Let Mozpool worry about that.

> If the self test fails, any suggestions on how the automation could stop
> taking more jobs for that panda?

What I would suggest is that clientproxy should poll the status of the device, and stop buildslave when the device is failed_*.  It should start buildslave when the device is not failed_*.  This is a pretty ugly workaround, but it will do until we have things reorganized to request devices from mozpool.
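
Roughly like this, as a sketch of the workaround rather than actual clientproxy code; the /state/ endpoint and the start/stop helpers are stand-ins (assumptions) for whatever clientproxy really uses:

# Hedged sketch of the polling workaround described above: keep buildslave
# stopped while the device is in any failed_* state, and start it again once
# the device recovers.  The helpers and the /state/ endpoint are assumptions.
import json
import time
import urllib.request

MOZPOOL = "http://mobile-imaging-001.p1.releng.scl1.mozilla.com"

def device_state(name):
    with urllib.request.urlopen("%s/api/device/%s/state/" % (MOZPOOL, name)) as resp:
        return json.loads(resp.read().decode("utf-8"))["state"]

def stop_buildslave(name):
    print("stopping buildslave for %s" % name)   # hypothetical hook

def start_buildslave(name):
    print("starting buildslave for %s" % name)   # hypothetical hook

def watch(name, poll=60):
    slave_running = True          # assume buildslave starts out running
    while True:
        failed = device_state(name).startswith("failed_")
        if failed and slave_running:
            stop_buildslave(name)
            slave_running = False
        elif not failed and not slave_running:
            start_buildslave(name)
            slave_running = True
        time.sleep(poll)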
That's the request state you're looking at in comment 17, not the device state.
From kim in bug 842597:

In bug 836857, several panda boards were replaced, including panda-0081.  I imaged panda-0081 with the android image today, and the status in lifeguard has been "failed_sut_verifying" for a long time; I'm not sure what to do here.
Also, note that I tried to image panda-0081 twice, both times, same error.
I wonder if these got a current preseed image put on them before deployment.  This is another reason we need a written policy for replacing failed panda boards.
panda-0069 and 0057 seem fine; they were replaced in the same batch.
hmm, panda-0081 is up now and running green tests. Perhaps this is bug 836417, and we just need to wait a long time after a reimage
Blocks: panda-0081
Now that panda-relay-065 is functional, I tried imaging the pandas attached to it.  It looks like the following need some attention:

panda-0720: 
2013-04-11T17:30:29 syslog mkfs.ext4 failed: 
2013-04-11T17:30:29 syslog formatting system partition
2013-04-11T17:33:23 statemachine device has failed: failed_android_downloading
2013-04-11T17:33:23 statemachine entering state failed_android_downloading

Yet it changes to the ready state, so I'm not sure what the heck is going on there.

panda-0723:
2013-04-11T17:34:05 statemachine device failed ping check

panda-0730: 
2013-04-11T17:17:44 statemachine entering state android_extracting
2013-04-11T17:17:44 syslog extracting boot artifact
2013-04-11T17:17:47 syslog extracting system artifact
2013-04-11T17:18:27 syslog tar failed:
Blocks: panda-0371
Blocks: panda-0392
Blocks: panda-0395
Blocks: panda-0831
Blocks: panda-0026
Blocks: panda-0056
Blocks: panda-0581
Blocks: panda-0706
Blocks: panda-0542
Blocks: panda-0631
Blocks: panda-0589
Blocks: panda-0709
Blocks: panda-0694
Blocks: panda-0778
Blocks: panda-0740
Blocks: panda-0789
Blocks: panda-0728
Blocks: panda-0734
Blocks: panda-0840
Blocks: panda-0836
Blocks: panda-0817
Blocks: panda-0296
Blocks: panda-0172
Blocks: panda-0313
Blocks: panda-0548
Blocks: panda-0679
Blocks: panda-0794
Blocks: panda-0800
I have a SCL1 trip scheduled for tomorrow (Thurs, 7/11) to attend to the panda boards currently blocked here.
Update from 7/11 SCL1 visit:

I wasn't able to get to all the panda boards blocked in this bug, but I was able to attend to the ones currently in production for android tests.  The following is the list of pandas that have been recovered.

26, 56, 81, 542, 548, 580, 581, 588, 589, 590, 631, 679, 694, 706, 709, 728, 734, 740, 778, 789, 794, 800, 817, 831, 834, 836, 840, 856

This is the list of panda boards that still need attention.
172, 180, 296, 313, 371, 392, 395

Overall, most pandas that were experiencing 'failed_pxe_booting' in mozpool had the identical symptom of just not booting at all (no serial output and no LED activity). Replacing the SD card with one that had a fresh preseed image recovered all the boards with this common issue.

Another symptom was a panda board that could load the bootloader but couldn't find an ethernet connection while attempting to PXE boot.  These were all recovered by simply re-seating the internal CAT5 cable in the RJ45 coupler.

Other panda boards showed no overt signs of problems at all and were noted in that panda's tracking bug.  These passed a self-test and were at the very least able to be reimaged with the android image via mozpool.

As for the SD cards that were replaced, I've taken them with me so I can try to determine why they failed at the boot level.
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations
I've analyzed all of the SD cards that were removed from service, as noted in the previous comment, and can conclude that 17 of the 22 cards removed have I/O failures at the mmcblk kernel device driver level.  In technical terms, they are completely borked.

The other 5 cards didn't seem to have any issues, but I haven't tested them any further than just checking for basic I/O errors.
Until the new bad-panda handling process is in place, I'm logging these failures here.  This might be a failure state I've seen before, where the mmc hardware never gets detected and/or initialized by the kernel within the mozpool live boot environment (a quick check for that is sketched after the logs below).

panda-0720:
2013-07-18T10:08:57 syslog mkfs.ext4 failed:
2013-07-18T10:11:51 statemachine entering state failed_android_downloading

panda-0696:
2013-07-18T10:11:46 statemachine entering state failed_android_downloading

panda-0674:
2013-07-18T10:11:56 statemachine entering state failed_android_downloading

panda-0664:
2013-07-18T10:14:56 statemachine entering state failed_android_downloading
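
A quick way to confirm the "mmc never detected" theory from inside the live boot environment is just to check whether the kernel enumerated any mmcblk devices at all; nothing mozpool-specific is assumed here beyond standard /dev paths.

# Hedged sketch: the mkfs.ext4 failures above suggest the kernel never
# created any mmcblk block devices, which this checks for directly.
import glob

devs = sorted(glob.glob("/dev/mmcblk*"))
if devs:
    print("mmc block devices present: %s" % ", ".join(devs))
else:
    print("no /dev/mmcblk* devices; the SD card was never detected/initialized by the kernel")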
These 2 pandas still referred back to the old fix-boot-scr image, which means they haven't properly taken a different image since.  Recording them here as bad pandas.
+------+------------+
| id   | name       |
+------+------------+
| 2638 | panda-0445 |
| 2675 | panda-0482 |
+------+------------+
I'm duping this now that we are officially tracking pandas in the panda-recovery bug (bug 902657).
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → DUPLICATE