Closed Bug 807163 Opened 12 years ago Closed 9 years ago

logcat chassis 6 panda

Categories

(Testing :: General, defect)

x86
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: van, Unassigned)

Details

(Whiteboard: [reit-panda])

Attachments

(5 files)

Attached file panda-0073
The following pandas in chassis 6 are bad and I've attached the logcat for them.

panda-0073
panda-0075
panda-0076
panda-0080
panda-0081
Attached file panda-0075
Attached file panda-0076
Attached file panda-0080
Attached file panda-0081
Blocks: 799698
Whiteboard: [reit-panda]
Carrying forward:
(In reply to Van Le [:van] from comment #14)
> I'm going to work on one chassis at a time and open bugs for the pandas I
> can't get to come online.
> 
> Chassis 6: I was NOT able to get the following pandas online after multiple
> SD card swaps, reboots, etc... I have opened bug 807163 and attached the
> logcat for the pandas.
> 
> panda-00[73,75,76,80,81]
Can someone attach logs for a couple of "good" pandas, for comparison?
"adb shell dumpsys" (or is it "adb dumpsys"? I forget) from good/bad devices may also provide some clues.
This looks pretty telling:

E/EthernetStateMachine( 1397): DhcpHandler: DHCP request failed: Timed out waiting for dhcpcd to start

We should try to find what command line it's using there and run it ourselves to debug further.
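To check whether every attached logcat shows the same failure, the error line above can be searched for mechanically. The following is a minimal sketch, assuming the attachments use logcat's brief format (`priority/tag( pid): message`) as the quoted line does; the function name `find_dhcp_failures` is illustrative and not from any existing tool.

```python
import re

# Brief logcat format, as in the line quoted above:
#   E/EthernetStateMachine( 1397): DhcpHandler: DHCP request failed: ...
LOGCAT_LINE = re.compile(
    r"^(?P<prio>[VDIWEF])/(?P<tag>[^(]+)\(\s*(?P<pid>\d+)\):\s?(?P<msg>.*)$"
)


def find_dhcp_failures(lines):
    """Return (tag, pid, message) for error-level lines reporting a DHCP failure."""
    hits = []
    for line in lines:
        match = LOGCAT_LINE.match(line.strip())
        if not match:
            continue  # not a brief-format logcat line; skip it
        if match.group("prio") == "E" and "DHCP request failed" in match.group("msg"):
            hits.append(
                (match.group("tag").strip(), int(match.group("pid")), match.group("msg"))
            )
    return hits
```

Running this over each attachment (for example `find_dhcp_failures(open(path, errors="replace"))`) would show which of the five pandas hit the dhcpcd timeout and which fail for some other reason.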
It looks like the daemon is started by setting the 'ctl.start' property to something like 'dhcpcd_eth0:eth0' and then waiting on property 'init.svc.dhcpcd_eth0' to be set. It's timing out while waiting on that property. See dhcp_utils.c in the Linaro Android source.
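The start-and-wait sequence described above could be reproduced by hand over adb to see where it stalls. This is a rough sketch, not the code in dhcp_utils.c: it assumes the device is reachable with `adb -s <serial>`, that the shell is allowed to set `ctl.start`, and that init reports a started service by setting `init.svc.dhcpcd_eth0` to `running`. The 30-second timeout, the one-second poll interval, and the helper names are all placeholders.

```python
import subprocess
import time


def adb_shell(serial, *args):
    """Run a command through 'adb -s <serial> shell' and return stripped stdout."""
    result = subprocess.run(
        ["adb", "-s", serial, "shell", *args],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout.strip()


def start_dhcpcd_and_wait(serial, iface="eth0", timeout=30.0, poll=1.0,
                          shell=adb_shell, sleep=time.sleep, clock=time.monotonic):
    """Ask init to start dhcpcd for iface, then poll its service-state property.

    Returns True once the property reads "running" (assumed success value),
    or False if the timeout passes first.
    """
    service = "dhcpcd_" + iface
    # Same request the comment describes: ctl.start = "dhcpcd_eth0:eth0"
    shell(serial, "setprop", "ctl.start", f"{service}:{iface}")
    deadline = clock() + timeout
    while True:
        state = shell(serial, "getprop", "init.svc." + service)
        if state == "running":
            return True
        if clock() >= deadline:
            return False
        sleep(poll)
```

If this returns False on a bad panda but True on a good one, the problem is likely in the service start itself rather than in the DHCP exchange; the `shell`, `sleep`, and `clock` parameters exist only so the loop can be exercised without a device attached.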
Do these machines fail on a consistent basis or is it just sometimes?
:snorp, the ones in this bug are the ones I was unable to get online after several reboots/SD card swaps/reimages.

Van
Is there any update on the status of these pandas?
Should we move these pandas to rack#10 and see if we can re-image them with mozpool?

Remove block on bug 799698.
No longer blocks: 799698
can we close this bug?  14 months with no traction?
We shouldn't just close the bug -- we need to make a decision on these hardware units. The two choices I see listed are:
 a) :armenzg in comment 13 - physically move units to another location for further investigation. That would be 2 new bugs (1 to move, the other for investigation)
 b) my suggestion to declare them BER (beyond economic repair) and file a bug to decom them

After that decision, and those bugs are filed, we can close this one.

Callek: do you know enough about the current state to make the call?
Flags: needinfo?(bugspam.Callek)
I don't know enough to identify how difficult/costly a repair is. I do know, however, that we are well over our current capacity needs with regard to pandas.

So my recommendation is:

(a) store pandas in a "possibly damaged" box, leaving asset tags in place, without sdcards, and document in inventory this bug number as a reference point in case we need to spend time on recovery in the future
(b) close this bug.

Any human effort to recover from unknown panda device [hardware] conditions is, imo, not worth it at this time.
Flags: needinfo?(bugspam.Callek)
Okay, I like minimizing human effort -- so let's leave them in the chassis if :dividhex agrees that won't hurt. That'd save pulling, boxing, and then moving that box to scl3.

Jake: I think this is our first decom-some-pandas-in-chassis case. Can you advise on:
 - whether it's okay to leave physically in the chassis
 - how inventory should be marked so we won't get confused.
 - any other needed changes to procedure in comment 16
Flags: needinfo?(jwatkins)
(In reply to Hal Wine [:hwine] (use needinfo) from comment #17)
> Okay, I like minimizing human effort -- so let's leave them in the chassis
> if :dividhex agrees that won't hurt. That'd save pulling, boxing, and then
> moving that box to scl3.
> 
> Jake: I think this is our first decom-some-pandas-in-chassis case. Can you
> advise on:
>  - whether it's okay to leave physically in the chassis
>  - how inventory should be marked so we won't get confused.
>  - any other needed changes to procedure in comment 16

It is perfectly fine to leave them in the chassis until they can be properly decommed and removed.
Not sure how to mark them in inventory since they aren't actually decommed (removed from chassis).  Maybe "error/service."  I also think it is a good idea to note the bug# in inventory.  

My bigger question is: are these really "bad" pandas?  I can't find the original reason they were considered bad and needed dcops to look at them to begin with.  This bug is also extremely old and I believe it was filed during the smoke test era.  The environment around it has also changed a lot since then, e.g. mozpool upgrades, psu adjustments, ethernet cable reseating, the chassis being moved to a different pod, etc.  It is possible the original interpretation of them as "bad" may have been from external influence.  They all pass the basic mozpool selftest, so is there any reason not to just add these back to the pool and see if they attract sheriff attention?
Flags: needinfo?(jwatkins)
no tracking for 1.5 years, going to close. let me know if this is still relevant.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED