Closed
Bug 817103
(bad-panda-log)
Opened 12 years ago
Closed 11 years ago
Interim Bad Panda Log
Categories
(Infrastructure & Operations :: RelOps: General, task)
Infrastructure & Operations
RelOps: General
Tracking
(Not tracked)
RESOLVED
DUPLICATE
of bug 902657
People
(Reporter: dustin, Assigned: dividehex)
References
Details
This bug will serve as a log for all failing pandas and their remediation, while we gather data to develop a more robust and distributed process. See https://wiki.mozilla.org/ReleaseEngineering/Mozpool/Handling_Panda_Failures
Reporter
Comment 1•12 years ago
From Armen in bug 818729:
I'm getting failed_b2g_downloading for two boards:
http://mobile-imaging-001.p1.releng.scl1.mozilla.com/ui/log.html?device=panda-0094
http://mobile-imaging-001.p1.releng.scl1.mozilla.com/ui/log.html?device=panda-0089
This does not happen constantly, but I would like to find out what causes it so we don't have the problem in production. Here's the return code I get:
5:33:45 ERROR - Bad return status from http://mobile-imaging-001.p1.releng.scl1.mozilla.com/api/device/panda-0094/request/: 502!
Reporter
Comment 3•12 years ago
Those two pandas have this in their logs:
> 2012-12-04 17:10:09 | entering state pxe_power_cycling
> 2012-12-04 17:10:09 | setting PXE config to 'panda-b2g.1'
> 2012-12-04 17:10:09 | initiating power cycle
> 2012-12-04 17:10:12 | entering state pxe_booting
> 2012-12-04 17:11:26 | second stage URL found: http://10.12.128.33/scripts/b2g-second-stage.sh
> 2012-12-04 17:11:26 | wget success: --2012-12-04 17:11:26-- http://10.12.128.33/scripts/b2g-second-stage.sh#012Connecting
> 2012-12-04 17:11:26 | Executing second-stage.sh
> 2012-12-04 17:11:26 | beginning b2g-second-stage.sh
> 2012-12-04 17:11:26 | setting time with ntpdate
> 2012-12-04 17:11:34 | Submitting lifeguard event at http://10.12.128.33/api/device/panda-0094/event/b2g_downloading/
> 2012-12-04 17:11:34 | clearing PXE config
> 2012-12-04 17:11:34 | entering state b2g_downloading
> 2012-12-04 17:11:34 | getting B2G_URL from http://10.12.128.33/api/device/panda-0094/bootconfig/
> 2012-12-04 17:11:35 | B2G URL: https://pvtbuilds.mozilla.org/pub/mozilla.org/b2g/tinderbox-builds/cedar-panda/201212041449
> 2012-12-04 17:11:35 | fetching https://pvtbuilds.mozilla.org/pub/mozilla.org/b2g/tinderbox-builds/cedar-panda/201212041449
> 2012-12-04 17:11:35 | wget failed:
> 2012-12-04 17:14:39 | entering state failed_b2g_downloading
> 2012-12-04 17:14:39 | device has failed: failed_b2g_downloading
So they're being given the wrong URL. No mozpool problems here.
Assignee
Comment 4•11 years ago
This is the latest set of pandas to take a look at (taken from bug 826694):
0034 - failed_pxe_booting
0106 - pc_pinging
0250 - failed_pxe_booting
0258-0290 - failed_pxe_booting
0342 - failed_pxe_booting
0468 - failed_pxe_booting
0472 - failed_pxe_booting
0488 - failed_pxe_booting
0554 - failed_pxe_booting
0562 - failed_pxe_booting
0570 - failed_pxe_booting
0574 - failed_pxe_booting
0598 - failed_pxe_booting
0621 - failed_pxe_booting
0662 - failed_pxe_booting
0720-0730 - pxe_power_cycling - possible relay failure -- see bug 821239
0741 - failed_pxe_booting
0781 - failed_pxe_booting
0793 - failed_pxe_booting
0796 - failed_pxe_booting
Comment 5•11 years ago
According to https://wiki.mozilla.org/ReleaseEngineering/Mozpool/Handling_Panda_Failures#Log_Failure_In_Interim_Tracking_Bug
* panda-0091
* failure type: "statemachine ignored event free in state failed_b2g_pinging"
* logs below
* the panda was doing nothing - it got re-imaged on the 11th, but it ended on a failure to reboot and ping. Trying to request the device or rebooting does not succeed.
2013-01-11T14:24:50 syslog Submitting lifeguard event at http://10.12.128.33/api/device/panda-0091/event/b2g_rebooting/
2013-01-11T14:24:50 statemachine entering state b2g_rebooting
2013-01-11T14:24:50 syslog Imaging complete. Rebooting
2013-01-11T14:26:59 statemachine entering state b2g_pinging
2013-01-11T14:27:00 bmm ping of panda-0091.p1.releng.scl1.mozilla.com complete: failed
2013-01-11T14:27:19 statemachine entering state b2g_pinging
2013-01-11T14:27:20 bmm ping of panda-0091.p1.releng.scl1.mozilla.com complete: failed
2013-01-11T14:27:39 statemachine entering state b2g_pinging
2013-01-11T14:27:40 bmm ping of panda-0091.p1.releng.scl1.mozilla.com complete: failed
2013-01-11T14:27:59 statemachine entering state b2g_pinging
2013-01-11T14:28:00 bmm ping of panda-0091.p1.releng.scl1.mozilla.com complete: failed
2013-01-11T14:28:19 statemachine entering state b2g_pinging
2013-01-11T14:28:20 bmm ping of panda-0091.p1.releng.scl1.mozilla.com complete: failed
2013-01-11T14:28:40 statemachine entering state failed_b2g_pinging
2013-01-11T14:28:41 statemachine device has failed: failed_b2g_pinging
2013-01-14T08:51:56 statemachine ignored event free in state failed_b2g_pinging
Reporter
Comment 6•11 years ago
It sounds like this failed to install B2G, which is usually an image problem, rather than a hardware problem. The self-test support that's going to land soon will address this (by self-testing). In the interim, try reimaging it with Android? If that succeeds, it will go to the 'ready' and then 'free' states and can then be requested.
Comment 7•11 years ago
I didn't do anything to it and it recovered :S
2013-01-14T10:33:40 bmm ping of panda-0091.p1.releng.scl1.mozilla.com complete: failed
2013-01-14T10:33:59 statemachine entering state pc_power_cycling
2013-01-14T10:33:59 bmm clearing PXE config
2013-01-14T10:33:59 bmm initiating power cycle
2013-01-14T10:34:02 statemachine entering state pc_rebooting
2013-01-14T10:36:09 statemachine entering state pc_pinging
2013-01-14T10:36:09 bmm ping of panda-0091.p1.releng.scl1.mozilla.com complete: ok
2013-01-14T10:36:09 statemachine entering state ready
2013-01-14T10:37:20 statemachine in ready state but not assigned to a request; moving to free state
Reporter
Comment 8•11 years ago
Maybe a squirrely B2G image?
Updated•11 years ago
colo-trip: --- → scl1
Comment 10•11 years ago
Update now that nagios is working. The following production pandas are reporting as down:
panda-0106.p1.releng.scl1.mozilla.com CRITICAL: in state failed_pxe_booting
panda-0342.p3.releng.scl1.mozilla.com CRITICAL: in state failed_pxe_booting
panda-0472.p5.releng.scl1.mozilla.com CRITICAL: in state failed_pxe_booting
panda-0488.p5.releng.scl1.mozilla.com CRITICAL: in state failed_pxe_booting
panda-0554.p6.releng.scl1.mozilla.com CRITICAL: in state failed_pxe_booting
panda-0562.p6.releng.scl1.mozilla.com CRITICAL: in state failed_pxe_booting
panda-0574.p6.releng.scl1.mozilla.com CRITICAL: in state failed_pxe_booting
panda-0598.p6.releng.scl1.mozilla.com CRITICAL: in state failed_pxe_booting
This doesn't cover panda-0720 - panda-0730, which are unresponsive because panda-relay-065 is not functioning properly (see bug 821239).
Assignee
Comment 11•11 years ago
Update: I went through all the problem boards on 1/29 and was able to fix them or mark them for replacement.

These boards have been fixed:
panda-0342 - switch patch cable plugged into empty panda port on rear of chassis
panda-0472 - had a non-PXE boot image on it; reimaged with preseed
panda-0488 - mac address collision with panda-0482; fixed panda-0482's mac address in inventory
panda-0554 - power cables swapped between 0554 and empty spot
panda-0562 - mac address collision with panda-0561; fixed panda-0562's mac address in inventory
panda-0598 - power cables swapped between 0598 and empty spot
panda-0621 - seemed ok; re-imaged with android
panda-0741 - power cables swapped between 0741 and the empty spot
panda-0796 - relays didn't match/swapped: 0795 = bank2 relay7, 0796 = bank2 relay8, empty slot = bank2 relay6. I had to change these in inventory rather than swap cables due to cable lengths
panda-{0258..0290} - these were all in the wrong vlan scope in inventory

These boards seemed ok:
panda-0034
panda-0106

These boards have failed and need to be replaced (see bug 836857):
panda-0250 - failed on mkfs.ext; couldn't see mmc devices in /dev
panda-0468 - failed on mkfs.ext; couldn't see mmc devices in /dev
panda-0570 - failed on mkfs.ext; couldn't see mmc devices in /dev
panda-0574 - mac address collision with 0531
panda-0662 - lights, but no serial output and no LAN traffic
panda-0781 - bad ethernet port on panda board
panda-0793 - failed on mkfs.ext; couldn't see mmc devices in /dev
Comment 12•11 years ago
I just landed http://hg.mozilla.org/build/tools/rev/6597bd01e9f4 to account for the changes in comment 11, and deployed it to all foopies. I also coordinated getting the device directories on those foopies stopped and removed (for the bug 836857 list), so we don't try to re-enable with the devices missing.
Updated•11 years ago
Depends on: panda-0070
Reporter
Updated•11 years ago
No longer depends on: panda-0070
Comment 13•11 years ago
panda-0139 is burning every job it takes, and nagios is complaining in #buildduty:
https://secure.pub.build.mozilla.org/buildapi/recent/panda-0139
14:32 nagios-releng: Fri 11:32:28 PST [418] panda-0139.p1.releng.scl1.mozilla.com is DOWN :CRITICAL: in state failed_b2g_pinging
Blocks: panda-0139
Assignee
Comment 14•11 years ago
(In reply to Armen Zambrano G. [:armenzg] from comment #13)
> panda-0139 is burning every job it takes and is complaining on #buildduty
> https://secure.pub.build.mozilla.org/buildapi/recent/panda-0139
> 14:32 nagios-releng: Fri 11:32:28 PST [418] panda-0139.p1.releng.scl1.mozilla.com is DOWN :CRITICAL: in state failed_b2g_pinging

Once a panda board enters a failed_*_pinging state, it will refuse all please_image requests. At this time, these can be handled by manually requesting a please_self_test from the lifeguard UI. If the test passes, the board will be put back into a free state where it can accept please_image requests. If the board continues to end up in a failed_*_pinging state, please raise the issue in this bug again.

I've gone ahead and requested a self test on all 4 of the failed_b2g_pinging pandas, which have been in a failed state for quite some time now:
panda-0139
panda-0104
panda-0117
panda-0123

All 4 passed.
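Comment 14's manual recovery (request a please_self_test from lifeguard, then re-image once the board is free) can also be driven over HTTP. The sketch below is illustrative only: the event-URL pattern mirrors the lifeguard URLs quoted in the logs earlier in this bug (http://<server>/api/device/<name>/event/<event>/), but the server name and the assumption that the endpoint accepts a bare JSON POST for please_self_test are mine, not confirmed by the source.

```python
# Hedged sketch: trigger a self-test for failed pandas via the lifeguard
# event API. Server name and POST body format are assumptions.
import json
import urllib.request

MOZPOOL = "http://mobile-imaging-001.p1.releng.scl1.mozilla.com"  # assumed server

def self_test_url(device):
    """Build the lifeguard event URL for a please_self_test request."""
    return "%s/api/device/%s/event/please_self_test/" % (MOZPOOL, device)

def request_self_test(device):
    # POST an empty JSON body to the event endpoint (assumption: lifeguard
    # accepts a plain POST here, like the event submissions seen in the logs).
    req = urllib.request.Request(
        self_test_url(device),
        data=json.dumps({}).encode(),
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req)

if __name__ == "__main__":
    # The four boards self-tested in comment 14.
    for device in ("panda-0139", "panda-0104", "panda-0117", "panda-0123"):
        print(self_test_url(device))
```

If the self-test passes, the device should transition back through ready to free, at which point a normal please_image request can be submitted.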
Reporter
Comment 15•11 years ago
And it's worth noting that the Buildbot/clientproxy code shouldn't be trying to use a panda that's in a failed state.
Comment 16•11 years ago
Once it gets to that point, would it make sense to auto-run self-tests? Or, if I request that device, could mozpool try to self test automatically if it's in a known bad state? Is there a technical reason that we won't try to auto-recover?

We request a device several times and then give up after a maximum number of attempts. Should we check the status every time we get a denied request?

Which statuses should require a self test? Is there a way through the API that I can request a self test?

If the self test fails, any suggestions on how the automation could stop that panda from taking more jobs?
Comment 17•11 years ago
FTR, the state was showing as "pending":
10:09:26 INFO - Waiting for request 'ready' stage. Current state: 'pending'
https://tbpl.mozilla.org/php/getParsedLog.php?id=19538828&tree=Firefox&full=1#error0
response = mph.query_request_status(self.request_url)
state = response['state']
if state == 'ready':
    return
self.info("Waiting for request 'ready' stage. Current state: '%s'" % state)
Reporter
Comment 18•11 years ago
(In reply to Armen Zambrano G. [:armenzg] from comment #16)
> Once it gets to that point, would it make sense to auto-self-tests?
> Or, if I request that device, could mozpool try to self test automatically
> if it's in a known bad state?
> Is there a technical reason that we won't try to auto-recover?

We're working on that - bug 834568. It's a little complicated.

> We request a device several times and then give up after a max amount.
> Should we check the status every time we get a denied request?

Ah, I think the piece of information we're not agreeing on is that Buildbot requests the device after the build has started, so the build burns if that request fails. This isn't really the way the system was designed -- it was designed so that Buildbot requests *a* device, not a specific device -- so we may have to put some workarounds in place here.

> Which status should require a self test for?
> Is there a way through the API that I can request a self test?

Let Mozpool worry about that.

> If the self test, any suggestions on how could the automation put a stop to
> take more jobs for that panda?

What I would suggest is that clientproxy should poll the status of the device, and stop buildslave when the device is failed_*. It should start buildslave when the device is not failed_*. This is a pretty ugly workaround, but it will do until we have things reorganized to request devices from mozpool.
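The clientproxy workaround dustin suggests can be sketched as a small polling loop. Everything below is a hedged illustration: the state-query endpoint, the stop/start hooks, and the polling interval are hypothetical stand-ins rather than the real clientproxy implementation; only the failed_* prefix check comes from the comment itself.

```python
# Hedged sketch: stop buildslave while a device is in a failed_* state and
# restart it when the device recovers. Endpoint and hooks are assumptions.
import json
import time
import urllib.request

def device_state(server, device):
    """Fetch the current lifeguard state for a device (assumed endpoint)."""
    url = "%s/api/device/%s/state/" % (server, device)
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["state"]

def is_failed(state):
    # Comment 18 keys the workaround off the failed_* state-name prefix.
    return state.startswith("failed_")

def watch(server, device, stop_slave, start_slave, interval=60):
    """Poll the device state; stop buildslave on failure, restart on recovery."""
    stopped = False
    while True:
        failed = is_failed(device_state(server, device))
        if failed and not stopped:
            stop_slave()   # hypothetical hook: shut down the buildslave
            stopped = True
        elif not failed and stopped:
            start_slave()  # hypothetical hook: bring the buildslave back
            stopped = False
        time.sleep(interval)
```

As dustin notes, this is an ugly stopgap; the cleaner fix is requesting *a* device from mozpool before the build starts, so a failed board is simply never handed out.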
Reporter
Comment 19•11 years ago
That's the request state you're looking at in comment 17, not the device state.
Reporter
Comment 20•11 years ago
From kim in bug 842597: In bug 836857, several panda boards were replaced, including panda-0081. I imaged panda-0081 with the android image today, and its status in lifeguard has been "failed_sut_verifying" for a long time; not sure what to do here.
Comment 22•11 years ago
Also, note that I tried to image panda-0081 twice; both times, same error.
Assignee
Comment 23•11 years ago
I wonder if these got a current preseed image put on them before deployment. This is another reason we need a written policy for replacing failed panda boards.
Comment 24•11 years ago
panda-0069 and 0057 seem fine; they were replaced in the same batch.
Comment 25•11 years ago
Hmm, panda-0081 is up now and running green tests. Perhaps this is bug 836417, and we just need to wait a long time after a reimage.
Updated•11 years ago
Blocks: panda-0081, panda-0590, panda-0856
Depends on: panda-0180
Comment 26•11 years ago
Now that panda-relay-065 is functional, I tried imaging the pandas attached to it. It looks like the following need some attention:
panda-0720:
2013-04-11T17:30:29 syslog mkfs.ext4 failed:
2013-04-11T17:30:29 syslog formatting system partition
2013-04-11T17:33:23 statemachine device has failed: failed_android_downloading
2013-04-11T17:33:23 statemachine entering state failed_android_downloading
Yet it changes to ready state, so I'm not sure what the heck is up there?
panda-0723:
2013-04-11T17:34:05 statemachine device failed ping check
panda-0730:
2013-04-11T17:17:44 statemachine entering state android_extracting
2013-04-11T17:17:44 syslog extracting boot artifact
2013-04-11T17:17:47 syslog extracting system artifact
2013-04-11T17:18:27 syslog tar failed:
Updated•11 years ago
Blocks: panda-0371, panda-0392, panda-0395, panda-0831, panda-0026, panda-0056, panda-0581, panda-0706, panda-0542, panda-0631, panda-0589, panda-0709, panda-0694, panda-0778, panda-0740, panda-0789, panda-0728, panda-0734, panda-0840, panda-0836, panda-0817, panda-0296, panda-0172, panda-0313, panda-0548, panda-0679, panda-0794, panda-0800, panda-0580, panda-0834
Assignee
Comment 27•11 years ago
I have a SCL1 trip scheduled for tomorrow (Thurs, 7/11) to attend to the panda boards currently blocked here.
Assignee
Comment 28•11 years ago
Update from the 7/11 SCL1 visit: I wasn't able to get to all the panda boards blocked in this bug, but I was able to attend to the ones currently in production for android tests.

These pandas have been recovered:
26, 56, 81, 542, 548, 580, 581, 588, 589, 590, 631, 679, 694, 706, 709, 728, 734, 740, 778, 789, 794, 800, 817, 831, 834, 836, 840, 856

These panda boards still need attention:
172, 180, 296, 313, 371, 392, 395

Overall, most pandas that were in 'failed_pxe_booting' in mozpool had the identical symptom of not booting at all (no serial output and no led activity). Replacing the SD card with one that had a fresh preseed image recovered every board with this common issue. Another symptom was a board that could load the boot loader but couldn't find an ethernet connection while attempting to PXE boot; these were all recovered by simply re-seating the internal CAT5 cable into the RJ45 coupler. Other panda boards showed no overt problem signs at all and were noted in that panda's problem tracker bug; these passed a selftest and were at the very least able to be reimaged with the android image via mozpool.

As for the SD cards that were replaced, I've taken them with me to try to determine why they failed at the boot level.
Updated•11 years ago
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations
Assignee
Comment 29•11 years ago
I've analyzed all of the SD cards that were removed from service as noted in the previous comment and can conclude that 17 out of the 22 cards removed have I/O failures at the mmcblk kernel device driver level. In technical terms, they are completely borked. The other 5 cards didn't seem to have any issues, but I haven't tested them any further than checking for basic IO errors.
Assignee
Comment 30•11 years ago
Until the new bad panda handling process is in play, I'm logging these failures here. This might be a fail state I've seen before, where the mmc hardware never gets detected and/or initialized by the kernel within the mozpool live boot environment.
panda-0720:
2013-07-18T10:08:57 syslog mkfs.ext4 failed:
2013-07-18T10:11:51 statemachine entering state failed_android_downloading
panda-0696:
2013-07-18T10:11:46 statemachine entering state failed_android_downloading
panda-0674:
2013-07-18T10:11:56 statemachine entering state failed_android_downloading
panda-0664:
2013-07-18T10:14:56 statemachine entering state failed_android_downloading
Assignee
Comment 31•11 years ago
These 2 pandas still refer back to the old fix-boot-scr image, which means they haven't properly taken a different image since. Recording them here as bad pandas.
+------+------------+
| id   | name       |
+------+------------+
| 2638 | panda-0445 |
| 2675 | panda-0482 |
+------+------------+
Assignee
Comment 32•11 years ago
I'm duping this now that we are officially tracking pandas in the panda-recovery bug (bug 902657).
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → DUPLICATE