Closed Bug 902657 Opened 11 years ago Closed 10 years ago

panda-recovery

Categories

(Infrastructure & Operations :: DCOps, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Callek, Assigned: jpech)

+++ This bug was initially created as a clone of Bug #817103 +++
Blocks: panda-0282
Blocks: panda-0292
Blocks: panda-0295
Blocks: panda-0300
Blocks: panda-0301
Blocks: panda-0305
Blocks: panda-0306
Blocks: panda-0325
Blocks: panda-0340
Blocks: panda-0387
Blocks: panda-0396
Blocks: panda-0479
Blocks: panda-0482
Blocks: panda-0729
Blocks: panda-0737
Blocks: panda-0743
Blocks: panda-0763
Blocks: panda-0769
Blocks: panda-0770
Blocks: panda-0788
Blocks: panda-0172
Blocks: panda-0180
Blocks: panda-0296
Blocks: panda-0313
Blocks: panda-0371
Blocks: panda-0392
Blocks: panda-0395
Blocks: panda-0664
Blocks: panda-0674
Blocks: panda-0696
Blocks: panda-0720
Blocks: panda-0820
Assignee: relops → jwatkins
Blocks: panda-0810
No longer blocks: panda-0479
No longer blocks: panda-0482
Depends on: 909440
Blocks: panda-0739
Assignee: jwatkins → achavez
All had failures, replaced SD cards on 2013-08-26.
The following list of pandas can be put back into service:
panda-0282	ready
panda-0300	ready
panda-0301	ready
panda-0305	ready
panda-0306	ready
panda-0313	ready
panda-0325	ready
panda-0387	ready
panda-0396	ready
panda-0729	ready
panda-0737	ready
panda-0743	ready
panda-0763	ready
panda-0769	ready
panda-0770	ready
panda-0788	ready
panda-0810	ready
panda-0820	ready
The following had SD card failures; the SD cards were replaced and the boards can be put back in production:
panda-0292
panda-0295
panda-0296

Panda board failure, decommissioned and will be replaced with a new panda:
panda-0172
Assignee: achavez → arich
Status: NEW → ASSIGNED
Assignee: arich → achavez
These pandas were incorrectly flagged by the new mozpool selftest.  The selftest has since been corrected and they are now passing the selftest without issue. We can close the tracker bugs and return them to service.

0835
0834
0803
0731
0674
0664
Also, a false positive. Pls return to service.

0819
panda-0036	removed/decommissioned
panda-0044	removed/decommissioned
panda-0172	removed/decommissioned
panda-0180	passed self test/sd card replaced
panda-0259	passed self test/sd card replaced
panda-0262	passed self test/sd card replaced
panda-0265	passed self test/sd card replaced
panda-0340	removed/decommissioned
panda-0371	removed/decommissioned
panda-0392	removed/decommissioned
panda-0395	removed/decommissioned
panda-0696	removed/decommissioned
panda-0720	removed/decommissioned
panda-0739	passed self test/sd card replaced
panda-0784	passed self test/sd card replaced
panda-0795	passed self test/sd card replaced
panda-0801	passed self test/sd card replaced
panda-0816	passed self test/sd card replaced
panda-0864	passed self test/sd card replaced
panda-0870	passed self test/sd card replaced
Before completely removing the decommissioned boards from mozpool/production, we should double check them.  I see some boards that were removed but had previously passing tests.
Component: RelOps → Server Operations: DCOps
Product: Infrastructure & Operations → mozilla.org
QA Contact: arich → dmoore
colo-trip: --- → scl1
Whiteboard: [Will work with Jake on this Thursday]
Whiteboard: [Will work with Jake on this Thursday] → [Will work with Jake on this after summit2013]
 Ashlee Chavez [:Ashlee] 2013-10-02 16:47:22 EDT
Whiteboard: [Will work with Jake on this Thursday] → [Will work with Jake on this after summit2013]

ETA on peeking guys?
Flags: needinfo?(jwatkins)
Flags: needinfo?(achavez)
Blocks: panda-0558
Blocks: panda-0843
(In reply to Justin Wood (:Callek) from comment #9)
>  Ashlee Chavez [:Ashlee] 2013-10-02 16:47:22 EDT
> Whiteboard: [Will work with Jake on this Thursday] → [Will work with Jake on
> this after summit2013]
> 
> ETA on peeking guys?

I've spoken with Jake via IRC; we have not yet settled on when we will be able to tackle this.
Jake, any ideas?
Depends on: panda-0818
Flags: needinfo?(achavez)
Blocks: panda-0818
No longer depends on: panda-0818
Blocks: panda-0479
Blocks: panda-0482
Blocks: panda-0701
Blocks: panda-0357
Blocks: panda-0347
Since Ashlee has moved to another team, I want to volunteer to take over this bug, and I hope to resolve it before my internship ends. (T.T) What resources are there, and who can guide me through the process of recovering the panda boards?
Flags: needinfo?(bugspam.Callek)
Whiteboard: [Will work with Jake on this after summit2013]
Hey John,

Please work with Jake (already needinfo'd) and coord with dmoore as well to properly allocate your resources here. (It would be good imho to have a permanent member of dcops also go through this process with Jake, so we don't lose the mindshare when your internship ends.)
Flags: needinfo?(bugspam.Callek) → needinfo?(dmoore)
(In reply to Justin Wood (:Callek) from comment #12)
> Hey John,
> 
> Please work with Jake (already needinfo'd) and coord with dmoore as well to
> properly allocate your resources here. (It would be good imho, to have a
> permanent member of dcops also go through this process with Jake, so we
> don't lose the mindshare when your internship ends)

Will do. Thanks for the info!
Assignee: achavez → jpech
DCOps got a good 4 hours of training on panda therapy yesterday.  So we should get this bug resolved soon and move on to a weekly "r/f and clone" schedule.
Flags: needinfo?(jwatkins)
The majority of the pandas are in the "ready" state for releng to proceed with testing.  The pandas below are failing and will need further troubleshooting.  If releng can close out the working pandas in the "Blocks" list above, it will help me narrow down exactly which pandas need investigation (similar to tegra bugs).  Thanks!


panda-0172	failed_pxe_booting		vhua-Unable to read "preEnv.txt" from mmc 0:1 **	
panda-0444	failed_pxe_booting		vhua-23.533630] panic occurred, switching back to text console	
panda-0479	failed_pxe_booting		vhua-not in chassis	
panda-0482	failed_pxe_booting		vhua-not in chassis	
panda-0638	failed_pxe_booting	panda-android-4.0.4_v3.1		
panda-0797	failed_pxe_booting	android		
panda-0081	failed_self_test		dividehex-panda-intervention	
panda-0173	failed_self_test			
panda-0280	failed_self_test		vhua-selftest.py[INFO]: test_preseed_file_integrity[FAILED] boot.scr : 	
panda-0720	failed_self_test		vhua-selftest.py[INFO]: test_mmc_blk_dev[FAILED] /dev/mmcblk0 - No such file or directory (tried multiple SD cards)
panda-0678	locked_out	android
(In reply to Vinh Hua [:vinh] from comment #15)
> Majority of the pandas are in the "ready" state for releng to proceed with
> testing.  The below pandas are failing and will need further
> troubleshooting.  If releng can close out the working pandas in the "block"
> list above, then it will help me narrow down exactly which pandas need
> investigation (similar to tegra bugs).  Thanks!
> 
> 
> panda-0172	failed_pxe_booting		vhua-Unable to read "preEnv.txt" from mmc 0:1

"Unable to read "preEnv.txt" from mmc 0:1" is a normal error message.  The U-Boot loader should continue to load/netboot.
Did you try swapping out the SD card on this?  If you did, does it halt at that message or continue booting?


> **	
> panda-0444	failed_pxe_booting		vhua-23.533630] panic occurred, switching
> back to text console

This sounds like a pandaboard hardware issue, and it would be interesting to see an entire serial console capture to better identify the deeper issue here. It might be something we can add to selftest to check for. I would suggest swapping the SD card if you haven't.  If you have, and it still continues, remove the panda board (and order a replacement if we don't have any spare boards).

> panda-0479	failed_pxe_booting		vhua-not in chassis	
> panda-0482	failed_pxe_booting		vhua-not in chassis

These 2 panda boards were removed from service and should be replaced.  I'll remove them from mozpool. See bug 836808.
	
> panda-0638	failed_pxe_booting	panda-android-4.0.4_v3.1
SDcard swap didn't work here?  If so, we can assume the pandaboard is dead and should be replaced.
		
> panda-0797	failed_pxe_booting	android
Same here?
	
> panda-0081	failed_self_test		dividehex-panda-intervention
What is the reason the self_test failed here? (See device log.)
	
> panda-0173	failed_self_test
Same.  Why did it fail? (see device log)
			
> panda-0280	failed_self_test		vhua-selftest.py[INFO]:
> test_preseed_file_integrity[FAILED] boot.scr : 	
A boot.scr integrity check failure indicates an outdated preseed image and should be fixed by:

1.) force state to 'troubleshooting'
2.) please_image -> repair-boot
3.) please_self_test
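
The three steps above are order-dependent: each one only makes sense if the previous one succeeded. A minimal sketch of that sequencing follows. The step names simply mirror the wording in this comment; `run_repair` and the `send` callback are hypothetical helpers for illustration and are not a confirmed mozpool API.

```python
from typing import Callable, List, Tuple

# Ordered steps taken from the comment above. The names mirror its wording;
# how they are actually issued to mozpool is an assumption left to `send`.
REPAIR_BOOT_STEPS: List[Tuple[str, str]] = [
    ("force_state", "troubleshooting"),
    ("please_image", "repair-boot"),
    ("please_self_test", ""),
]


def run_repair(device: str, send: Callable[[str, str, str], bool]) -> List[str]:
    """Run each step in order and stop at the first one that fails.

    `send(device, action, argument)` is a hypothetical callback that issues
    one request and returns True on success. Returns the completed actions.
    """
    completed: List[str] = []
    for action, argument in REPAIR_BOOT_STEPS:
        if not send(device, action, argument):
            break  # later steps depend on this one, so do not continue
        completed.append(action)
    return completed
```

Passing the transport in as a callback keeps the ordering logic testable without touching a real device.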

> panda-0720	failed_self_test		vhua-selftest.py[INFO]:
> test_mmc_blk_dev[FAILED] /dev/mmcblk0 - No such file or directory (tried
> multiple SD cards)
This test indicates a bad pandaboard.  Remove and replace.


> panda-0678	locked_out	android
I have no idea why (or who) locked_out this panda.  Check with #releng or #ateam.  There should always be a bug # in the comment of the panda that is locked_out for obvious reasons. If no one claims they have reserved it, force state to troubleshooting and then run a selftest.
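
The per-board replies above amount to a small triage table keyed on the mozpool state and the note logged against the board. The sketch below restates that guidance as a lookup. The function name `recommend` and its `sd_swapped` flag are hypothetical, and the rules are only those given in this comment, not an official procedure.

```python
def recommend(state: str, note: str = "", sd_swapped: bool = False) -> str:
    """Return the next action suggested in this bug for a failing panda.

    `state` is the mozpool state, `note` is the free-text comment or log
    line, and `sd_swapped` says whether an SD card swap was already tried.
    """
    text = note.lower()
    if "not in chassis" in text:
        return "already removed from service: replace board, remove from mozpool"
    if state == "failed_pxe_booting":
        if sd_swapped:
            return "assume board is dead: remove and replace"
        return "swap SD card and retry"
    if state == "failed_self_test":
        if "test_preseed_file_integrity" in text:
            return "force troubleshooting, please_image repair-boot, please_self_test"
        if "test_mmc_blk_dev" in text:
            return "bad board: remove and replace"
        return "check device log for the failure reason"
    if state == "locked_out":
        return "ask #releng/#ateam; if unclaimed, force troubleshooting and self-test"
    return "no guidance in this bug"
```

For example, `recommend("failed_pxe_booting", sd_swapped=True)` gives the "remove and replace" advice applied to panda-0638 and panda-0797 if the card swap did not help.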

Aside from the pandas listed here, I do think we should close this bug, migrate the list in c15 to a new recovery bug and return the rest of the pandas to production.  We really want to get into the habit of a weekly bug for DCOps to handle.

Callek: is it reasonable to do this sometime this week so new problem pandas don't get cluttered up here?
Flags: needinfo?(bugspam.Callek)
(In reply to Jake Watkins [:dividehex] from comment #16)
> Aside from the pandas listed here,  I do think we should close this bug,
> migrate the list in c15 to a new recovery bug and return the rest of the pandas
> to production.  We really want to get into the habit of a weekly bug for
> DCOPs to handle.
> 
> Callek: is it reasonable to do this sometime this week so new problem pandas
> don't get cluttered up here.

Indeed, I had already planned to do so today, and got caught up with a power outage at home --> doing so now.
Flags: needinfo?(dmoore)
Flags: needinfo?(bugspam.Callek)
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Alias: panda-recovery
Product: mozilla.org → Infrastructure & Operations
No longer blocks: panda-0283