Closed Bug 817103 (bad-panda-log) Opened 12 years ago Closed 11 years ago

Interim Bad Panda Log

Component: Infrastructure & Operations :: RelOps: General
Type: task
Priority: not set
Severity: normal
Tracking: not tracked
Status: RESOLVED DUPLICATE of bug 902657
Reporter: dustin
Assignee: dividehex
This bug will serve as a log for all failing pandas and their remediation, while we gather data to develop a more robust and distributed process.  See
  https://wiki.mozilla.org/ReleaseEngineering/Mozpool/Handling_Panda_Failures
From Armen in bug 818729:

I'm getting failed_b2g_downloading for two boards:
http://mobile-imaging-001.p1.releng.scl1.mozilla.com/ui/log.html?device=panda-0094
http://mobile-imaging-001.p1.releng.scl1.mozilla.com/ui/log.html?device=panda-0089

This does not happen constantly, but I would like to discover what it is so that we don't have the problem in production.

Here's the return code I get:
5:33:45    ERROR - Bad return status from http://mobile-imaging-001.p1.releng.scl1.mozilla.com/api/device/panda-0094/request/: 502!
those two pandas have this in the logs:

> 2012-12-04 17:10:09 | entering state pxe_power_cycling
> 2012-12-04 17:10:09 | setting PXE config to 'panda-b2g.1'
> 2012-12-04 17:10:09 | initiating power cycle
> 2012-12-04 17:10:12 | entering state pxe_booting
> 2012-12-04 17:11:26 | second stage URL found: http://10.12.128.33/scripts/b2g-second-stage.sh
> 2012-12-04 17:11:26 | wget success: --2025-11-08 18:22:20--  http://10.12.128.33/scripts/b2g-second-stage.sh#012Connecting
> 2012-12-04 17:11:26 | Executing second-stage.sh
> 2012-12-04 17:11:26 | beginning b2g-second-stage.sh
> 2012-12-04 17:11:26 | setting time with ntpdate
> 2012-12-04 17:11:34 | Submitting lifeguard event at http://10.12.128.33/api/device/panda-0094/event/b2g_downloading/
> 2012-12-04 17:11:34 | clearing PXE config
> 2012-12-04 17:11:34 | entering state b2g_downloading
> 2012-12-04 17:11:34 | getting B2G_URL from http://10.12.128.33/api/device/panda-0094/bootconfig/
> 2012-12-04 17:11:35 | B2G URL: https://pvtbuilds.mozilla.org/pub/mozilla.org/b2g/tinderbox-builds/cedar-panda/201212041449
> 2012-12-04 17:11:35 | fetching https://pvtbuilds.mozilla.org/pub/mozilla.org/b2g/tinderbox-builds/cedar-panda/201212041449
> 2012-12-04 17:11:35 | wget failed:
> 2012-12-04 17:14:39 | entering state failed_b2g_downloading
> 2012-12-04 17:14:39 | device has failed: failed_b2g_downloading

so they're being given the wrong URL.  No mozpool problems here.
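
For triage purposes, here's a minimal sketch of sanity-checking the URL a board is handed before it burns a boot cycle. The "b2g_url" field name and the JSON shape of the /bootconfig/ response are assumptions inferred from the log lines above, not a documented API.

# Hedged sketch: fetch a device's bootconfig from mozpool and check that the
# B2G URL in it is actually reachable.
import json
import urllib.error
import urllib.request

MOZPOOL = "http://mobile-imaging-001.p1.releng.scl1.mozilla.com"

def check_b2g_url(device):
    with urllib.request.urlopen("%s/api/device/%s/bootconfig/" % (MOZPOOL, device)) as resp:
        bootconfig = json.loads(resp.read().decode("utf-8"))
    url = bootconfig.get("b2g_url")  # assumed field name, inferred from the logs above
    if not url:
        return "no B2G URL in bootconfig: %r" % bootconfig
    try:
        urllib.request.urlopen(urllib.request.Request(url, method="HEAD"), timeout=30)
        return "ok: %s" % url
    except urllib.error.URLError as exc:
        return "unreachable (%s): %s" % (exc, url)

print(check_b2g_url("panda-0094"))

In the logs above the wget against the pvtbuilds URL is what failed, so a check like this would have flagged the bad URL before the board reached failed_b2g_downloading.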
This is the latest set of pandas to take a look at (taken from bug 826694):

0034 -  failed_pxe_booting
0106 -  pc_pinging
0250 -  failed_pxe_booting
0258-0290 - failed_pxe_booting
0342 -  failed_pxe_booting
0468 -  failed_pxe_booting
0472 -  failed_pxe_booting
0488 -  failed_pxe_booting
0554 -  failed_pxe_booting
0562 -  failed_pxe_booting
0570 -  failed_pxe_booting
0574 -  failed_pxe_booting
0598 -  failed_pxe_booting
0621 -  failed_pxe_booting
0662 -  failed_pxe_booting
0720-0730 - pxe_power_cycling - possible relay failure -- see bug 821239
0741 -  failed_pxe_booting
0781 -  failed_pxe_booting
0793 -  failed_pxe_booting
0796 -  failed_pxe_booting
According to https://wiki.mozilla.org/ReleaseEngineering/Mozpool/Handling_Panda_Failures#Log_Failure_In_Interim_Tracking_Bug

* panda-0091
* failure type: "statemachine ignored event free in state failed_b2g_pinging"
* logs below
* the panda was doing nothing - it got re-imaged on the 11th, but the re-image ended with a failure to reboot and ping. Trying to request the device or reboot it does not succeed.

2013-01-11T14:24:50 syslog Submitting lifeguard event at http://10.12.128.33/api/device/panda-0091/event/b2g_rebooting/
2013-01-11T14:24:50 statemachine entering state b2g_rebooting
2013-01-11T14:24:50 syslog Imaging complete. Rebooting
2013-01-11T14:26:59 statemachine entering state b2g_pinging
2013-01-11T14:27:00 bmm ping of panda-0091.p1.releng.scl1.mozilla.com complete: failed
2013-01-11T14:27:19 statemachine entering state b2g_pinging
2013-01-11T14:27:20 bmm ping of panda-0091.p1.releng.scl1.mozilla.com complete: failed
2013-01-11T14:27:39 statemachine entering state b2g_pinging
2013-01-11T14:27:40 bmm ping of panda-0091.p1.releng.scl1.mozilla.com complete: failed
2013-01-11T14:27:59 statemachine entering state b2g_pinging
2013-01-11T14:28:00 bmm ping of panda-0091.p1.releng.scl1.mozilla.com complete: failed
2013-01-11T14:28:19 statemachine entering state b2g_pinging
2013-01-11T14:28:20 bmm ping of panda-0091.p1.releng.scl1.mozilla.com complete: failed
2013-01-11T14:28:40 statemachine entering state failed_b2g_pinging
2013-01-11T14:28:41 statemachine device has failed: failed_b2g_pinging
2013-01-14T08:51:56 statemachine ignored event free in state failed_b2g_pinging
It sounds like this failed to install B2G, which is usually an image problem, rather than a hardware problem.  The self-test support that's going to land soon will address this (by self-testing).  In the interim, try reimaging it with Android?  If that succeeds, it will go to the 'ready' and then 'free' states and can then be requested.
I didn't do anything to it and it recovered :S

2013-01-14T10:33:40 bmm ping of panda-0091.p1.releng.scl1.mozilla.com complete: failed
2013-01-14T10:33:59 statemachine entering state pc_power_cycling
2013-01-14T10:33:59 bmm clearing PXE config
2013-01-14T10:33:59 bmm initiating power cycle
2013-01-14T10:34:02 statemachine entering state pc_rebooting
2013-01-14T10:36:09 statemachine entering state pc_pinging
2013-01-14T10:36:09 bmm ping of panda-0091.p1.releng.scl1.mozilla.com complete: ok
2013-01-14T10:36:09 statemachine entering state ready
2013-01-14T10:37:20 statemachine in ready state but not assigned to a request; moving to free state
Maybe a squirrely B2G image?
colo-trip: --- → scl1
Update now that nagios is working: the following production pandas are reporting as down:


panda-0106.p1.releng.scl1.mozilla.com CRITICAL: in state failed_pxe_booting
panda-0342.p3.releng.scl1.mozilla.com CRITICAL: in state failed_pxe_booting
panda-0472.p5.releng.scl1.mozilla.com CRITICAL: in state failed_pxe_booting
panda-0488.p5.releng.scl1.mozilla.com CRITICAL: in state failed_pxe_booting
panda-0554.p6.releng.scl1.mozilla.com CRITICAL: in state failed_pxe_booting
panda-0562.p6.releng.scl1.mozilla.com CRITICAL: in state failed_pxe_booting
panda-0574.p6.releng.scl1.mozilla.com CRITICAL: in state failed_pxe_booting
panda-0598.p6.releng.scl1.mozilla.com CRITICAL: in state failed_pxe_booting

This doesn't cover panda-0720 - panda-0730 which are unresponsive because panda-relay-065 is not functioning properly (see bug 821239).
Update:
I went through all the problem boards on 1/29 and was able to fix them or mark them for replacement.

These boards have been fixed:
panda-0342	switch patch cable was plugged into an empty panda port on the rear of the chassis
panda-0472	had a non-PXE boot image on it; reimaged with preseed
panda-0488	MAC address collision with panda-0482; fixed panda-0482's MAC address in inventory
panda-0554	power cables swapped between 0554 and an empty spot
panda-0562	MAC address collision with panda-0561; fixed panda-0562's MAC address in inventory
panda-0598	power cables swapped between 0598 and an empty spot
panda-0621	seemed OK; re-imaged with android
panda-0741	power cables swapped between 0741 and the empty spot
panda-0796	relays didn't match/were swapped: 0795 = bank2 relay7, 0796 = bank2 relay8, empty slot = bank2 relay6. I had to change these in inventory rather than swap cables due to cable lengths
panda-{0258..0290}	these were all in the wrong VLAN scope in inventory

These boards seemed ok:
panda-0034
panda-0106

These boards have failed and need to be replaced (see Bug 836857):
panda-0250	failed on mkfs.ext. couldn't see mmc devices in /dev
panda-0468	failed on mkfs.ext. couldn't see mmc devices in /dev
panda-0570	failed on mkfs.ext. couldn't see mmc devices in /dev
panda-0574	MAC address collision with 0531
panda-0662	lights but no serial output and no LAN traffic
panda-0781	bad ethernet port on panda board
panda-0793	failed on mkfs.ext. couldn't see mmc devices in /dev
I just landed http://hg.mozilla.org/build/tools/rev/6597bd01e9f4 to account for the changes in c#11

I also just deployed that to all foopies.

I also coordinated getting the device directories on these foopies stopped and removed (for the bug 836857 list), so we don't try to re-enable them with the devices missing.
Depends on: panda-0070
No longer depends on: panda-0070
panda-0139 is burning every job it takes and is complaining on #buildduty

https://secure.pub.build.mozilla.org/buildapi/recent/panda-0139
14:32 nagios-releng: Fri 11:32:28 PST [418] panda-0139.p1.releng.scl1.mozilla.com is DOWN :CRITICAL: in state failed_b2g_pinging
Blocks: panda-0139
(In reply to Armen Zambrano G. [:armenzg] from comment #13)
> panda-0139 is burning every job it takes and is complaining on #buildduty
> 
> https://secure.pub.build.mozilla.org/buildapi/recent/panda-0139
> 14:32 nagios-releng: Fri 11:32:28 PST [418]
> panda-0139.p1.releng.scl1.mozilla.com is DOWN :CRITICAL: in state
> failed_b2g_pinging

Once a panda board enters a failed_*_pinging state, it will refuse all please_image requests.  At this time, these can be handled by manually requesting a please_self_test from the lifeguard UI.  If the test passes, the board will be put back into the free state, where it can accept please_image requests.  If the board continues to end up in a failed_*_pinging state, please raise the issue in this bug again.

I've gone ahead and requested a self test on all 4 of the failed_b2g_pinging pandas.  These have been in a failed state for quite some time now.
  
panda-0139
panda-0104
panda-0117
panda-0123

All 4 passed.
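
For the record, the same thing can presumably be done without the lifeguard UI; here's a hedged sketch. The event URL mirrors the "Submitting lifeguard event at .../api/device/<name>/event/<event>/" lines in the logs above, but treating please_self_test as an event accepted on that endpoint, and /state/ as the polling endpoint, are assumptions rather than documented API.

# Hedged sketch: ask lifeguard for a self test and poll until the device
# settles.  Endpoint names beyond what appears in the logs are assumptions.
import json
import time
import urllib.request

MOZPOOL = "http://mobile-imaging-001.p1.releng.scl1.mozilla.com"

def device_state(name):
    with urllib.request.urlopen("%s/api/device/%s/state/" % (MOZPOOL, name)) as resp:
        return json.loads(resp.read().decode("utf-8"))["state"]

def request_self_test(name, timeout=1800, poll=30):
    url = "%s/api/device/%s/event/please_self_test/" % (MOZPOOL, name)
    urllib.request.urlopen(urllib.request.Request(url, data=b"{}"))  # POST the event
    deadline = time.time() + timeout
    state = None
    while time.time() < deadline:
        state = device_state(name)
        if state in ("ready", "free") or state.startswith("failed_"):
            break
        time.sleep(poll)
    return state

for name in ("panda-0139", "panda-0104", "panda-0117", "panda-0123"):
    print(name, request_self_test(name))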
And it's worth noting, the Buildbot/clientproxy code shouldn't be trying to use a panda that's in a failed state.
Once it gets to that point, would it make sense to auto-self-test?
Or, if I request that device, could mozpool try to self test automatically if it's in a known bad state?
Is there a technical reason that we won't try to auto-recover?

We request a device several times and then give up after a maximum number of attempts.

Should we check the status every time we get a denied request?
Which statuses should we request a self test for?
Is there a way through the API that I can request a self test?

If the self test fails, any suggestions on how the automation could stop taking more jobs for that panda?
FTR, the state was showing as "pending"
10:09:26     INFO - Waiting for request 'ready' stage.  Current state: 'pending'

https://tbpl.mozilla.org/php/getParsedLog.php?id=19538828&tree=Firefox&full=1#error0

            response = mph.query_request_status(self.request_url)
            state = response['state']
            if state == 'ready':
                return
            self.info("Waiting for request 'ready' stage.  Current state: '%s'" % state)
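
For context, the loop around those lines keeps polling as long as the state isn't 'ready', so a request that never becomes ready holds the job. A hedged sketch of a more defensive version (same query call, plus a hard deadline) could look like this; only the 'pending' and 'ready' state names are taken from the log above, everything else is treated generically.

# Hedged sketch, not the actual mozharness code: wait for the mozpool request
# to reach 'ready', but give up after a deadline instead of waiting forever.
import time

def wait_for_ready(mph, request_url, timeout=1800, poll=60):
    deadline = time.time() + timeout
    state = None
    while time.time() < deadline:
        state = mph.query_request_status(request_url)['state']
        if state == 'ready':
            return state
        print("Waiting for request 'ready' stage.  Current state: '%s'" % state)
        time.sleep(poll)
    raise Exception("request %s never reached 'ready' (last state: '%s')"
                    % (request_url, state))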
(In reply to Armen Zambrano G. [:armenzg] from comment #16)
> Once it gets to that point, would it make sense to auto-self-test?
> Or, if I request that device, could mozpool try to self test automatically
> if it's in a known bad state?
> Is there a technical reason that we won't try to auto-recover?

We're working on that - bug 834568.  It's a little complicated.

> We request a device several times and then give up after a maximum number of attempts.
> 
> Should we check the status every time we get a denied request?

Ah, I think the piece of information we're not agreeing on is that Buildbot requests the device after the build has started, so the build burns if that request fails.

This isn't really the way the system was designed -- it was designed so that Buildbot requests *a* device, not a specific device -- so we may have to put some workarounds in place here.

> Which statuses should we request a self test for?
> Is there a way through the API that I can request a self test?

Let Mozpool worry about that.

> If the self test fails, any suggestions on how the automation could stop
> taking more jobs for that panda?

What I would suggest is that clientproxy should poll the status of the device, and stop buildslave when the device is failed_*.  It should start buildslave when the device is not failed_*.  This is a pretty ugly workaround, but it will do until we have things reorganized to request devices from mozpool.
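
Roughly like this, as a sketch of the workaround rather than actual clientproxy code; the /state/ endpoint and the start/stop helpers are stand-ins (assumptions) for whatever clientproxy really uses:

# Hedged sketch of the polling workaround described above: keep buildslave
# stopped while the device is in any failed_* state, and start it again once
# the device recovers.  The helpers and the /state/ endpoint are assumptions.
import json
import time
import urllib.request

MOZPOOL = "http://mobile-imaging-001.p1.releng.scl1.mozilla.com"

def device_state(name):
    with urllib.request.urlopen("%s/api/device/%s/state/" % (MOZPOOL, name)) as resp:
        return json.loads(resp.read().decode("utf-8"))["state"]

def stop_buildslave(name):
    print("stopping buildslave for %s" % name)   # hypothetical hook

def start_buildslave(name):
    print("starting buildslave for %s" % name)   # hypothetical hook

def watch(name, poll=60):
    slave_running = True          # assume buildslave starts out running
    while True:
        failed = device_state(name).startswith("failed_")
        if failed and slave_running:
            stop_buildslave(name)
            slave_running = False
        elif not failed and not slave_running:
            start_buildslave(name)
            slave_running = True
        time.sleep(poll)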
That's the request state you're looking at in comment 17, not the device state.
From kim in bug 842597:

In bug 836857, several panda boards were replaced, including panda-0081.  I imaged panda-0081 with the android image today, and the status in lifeguard has been "failed_sut_verifying" for a long time; I'm not sure what to do here.
Also, note that I tried to image panda-0081 twice, both times, same error.
I wonder if these got a current preseed image put on them before deployment.  This is another reason we need a written policy for replacing failed panda boards.
panda-0069 and 0057 seem fine; they were replaced in the same batch.
hmm, panda-0081 is up now and running green tests. Perhaps this is bug 836417, and we just need to wait a long time after a reimage
Blocks: panda-0081
Now that panda-relay-065 is functional, I tried imaging the pandas attached to it.  It looks like the following need some attention:

panda-0720: 
2013-04-11T17:30:29 syslog mkfs.ext4 failed: 
2013-04-11T17:30:29 syslog formatting system partition
2013-04-11T17:33:23 statemachine device has failed: failed_android_downloading
2013-04-11T17:33:23 statemachine entering state failed_android_downloading

Yet it changes to the ready state, so I'm not sure what the heck is going on there.

panda-0723:
2013-04-11T17:34:05 statemachine device failed ping check

panda-0730: 
2013-04-11T17:17:44 statemachine entering state android_extracting
2013-04-11T17:17:44 syslog extracting boot artifact
2013-04-11T17:17:47 syslog extracting system artifact
2013-04-11T17:18:27 syslog tar failed:
Blocks: panda-0371
Blocks: panda-0392
Blocks: panda-0395
Blocks: panda-0831
Blocks: panda-0026
Blocks: panda-0056
Blocks: panda-0581
Blocks: panda-0706
Blocks: panda-0542
Blocks: panda-0631
Blocks: panda-0589
Blocks: panda-0709
Blocks: panda-0694
Blocks: panda-0778
Blocks: panda-0740
Blocks: panda-0789
Blocks: panda-0728
Blocks: panda-0734
Blocks: panda-0840
Blocks: panda-0836
Blocks: panda-0817
Blocks: panda-0296
Blocks: panda-0172
Blocks: panda-0313
Blocks: panda-0548
Blocks: panda-0679
Blocks: panda-0794
Blocks: panda-0800
I have a SCL1 trip scheduled for tomorrow (Thurs, 7/11) to attend to the panda boards currently blocked here.
Update from 7/11 SCL1 visit:

I wasn't able to get to all the panda boards blocked in this bug, but I was able to attend to the ones currently in production for android tests.  The following is the list of pandas that have been recovered.

26, 56, 81, 542, 548, 580, 581, 588, 589, 590, 631, 679, 694, 706, 709, 728, 734, 740, 778, 789, 794, 800, 817, 831, 834, 836, 840, 856

This is the list of panda boards that still need attention.
172, 180, 296, 313, 371, 392, 395

Overall, most pandas that were experiencing 'failed_pxe_booting' in mozpool had the identical symptom of just not booting at all (no serial output and no LED activity). Replacing the SD card with one that had a fresh preseed image recovered all the boards with this common issue.

Another symptom was a panda board that could load the bootloader but couldn't find an ethernet connection while attempting to PXE boot.  These were all recovered by simply re-seating the internal CAT5 cable in the RJ45 coupler.

Other panda boards showed no overt signs of problems at all and were noted in that panda's tracking bug.  These passed a self-test and were at the very least able to be reimaged with the android image via mozpool.

As for the SD cards that were replaced, I've taken them with me so I can try to determine why they failed at the boot level.
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations
I've analyzed all of the SD cards that were removed from service, as noted in the previous comment, and can conclude that 17 of the 22 cards removed have I/O failures at the mmcblk kernel device driver level.  In technical terms, they are completely borked.

The other 5 cards didn't seem to have any issues, but I haven't tested them any further than just checking for basic I/O errors.
Until the new bad-panda handling process is in place, I'm logging these failures here.  This might be a failure state I've seen before, where the mmc hardware never gets detected and/or initialized by the kernel within the mozpool live boot environment (a quick check for that is sketched after the logs below).

panda-0720:
2013-07-18T10:08:57 syslog mkfs.ext4 failed:
2013-07-18T10:11:51 statemachine entering state failed_android_downloading

panda-0696:
2013-07-18T10:11:46 statemachine entering state failed_android_downloading

panda-0674:
2013-07-18T10:11:56 statemachine entering state failed_android_downloading

panda-0664:
2013-07-18T10:14:56 statemachine entering state failed_android_downloading
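
A quick way to confirm the "mmc never detected" theory from inside the live boot environment is just to check whether the kernel enumerated any mmcblk devices at all; nothing mozpool-specific is assumed here beyond standard /dev paths.

# Hedged sketch: the mkfs.ext4 failures above suggest the kernel never
# created any mmcblk block devices, which this checks for directly.
import glob

devs = sorted(glob.glob("/dev/mmcblk*"))
if devs:
    print("mmc block devices present: %s" % ", ".join(devs))
else:
    print("no /dev/mmcblk* devices; the SD card was never detected/initialized by the kernel")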
These 2 pandas still referred back to the old fix-boot-scr image, which means they haven't properly taken a different image since.  Recording them here as bad pandas.
+------+------------+
| id   | name       |
+------+------------+
| 2638 | panda-0445 |
| 2675 | panda-0482 |
+------+------------+
I'm duping this now that we are officially tracking pandas in the panda-recovery bug (bug 902657).
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → DUPLICATE