Closed Bug 1363893 Opened 7 years ago Closed 7 years ago

scl3 proxxy host has failed

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

task
Not set
normal

Tracking

(firefox-esr52 fixed, firefox53 fixed, firefox54 fixed, firefox55 fixed)

RESOLVED WONTFIX

People

(Reporter: dustin, Unassigned)

References

Details

(Whiteboard: [stockwell infra])

Attachments

(1 file)

This host has a read-only filesystem, is dropping HTTP connections without answering, and seems to have a bad disk.
Blocks: 1357753
@Amy: I'm thinking about filing a bug to DCOps for running some disk diagnostics.
Does that sound OK to you?
Flags: needinfo?(arich)
That host is many years out of warranty. When releng asked to set it up on scavenged hardware, it was with the understanding that it wasn't production critical (we could do without it) and that repairing it might be expensive if it ever broke. Have dcops check, but if the disk is bad, it's not a given that we'll want to fix/replace it (check with catlee).
Flags: needinfo?(arich)
Can we try powering it down right now? I think its particular failure mode is causing issues with newer pip versions (luckily, the pre-Cambrian version of pip we use in buildbot configs seems to survive this failure mode, or we'd have an outage).
I took a look at the logs; there are definitely hardware errors, so there's no need to run diags. The logs are full of:

May  2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625696] sd 2:0:0:0: [sda] Unhandled sense code
May  2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625706] sd 2:0:0:0: [sda]  
May  2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625709] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
May  2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625712] sd 2:0:0:0: [sda]  
May  2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625713] Sense Key : Hardware Error [current] 
May  2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625717] sd 2:0:0:0: [sda]  
May  2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625720] Add. Sense: Logical unit failure
May  2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625723] sd 2:0:0:0: [sda] CDB: 
May  2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625724] Write(10): 2a 00 39 c7 08 a0 00 04 00 00
May  2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625777] sd 2:0:0:0: [sda] Unhandled sense code
May  2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625780] sd 2:0:0:0: [sda]  
May  2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625782] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
May  2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625784] sd 2:0:0:0: [sda]  
May  2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625785] Sense Key : Hardware Error [current] 
May  2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625788] sd 2:0:0:0: [sda]  
May  2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625791] Add. Sense: Logical unit failure
May  2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625793] sd 2:0:0:0: [sda] CDB: 
May  2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625795] Write(10): 2a 00 39 c7 04 a0 00 04 00 00
May  2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625850] sd 2:0:0:0: [sda] Unhandled sense code
May  2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625853] sd 2:0:0:0: [sda]  
May  2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625854] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
May  2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625857] sd 2:0:0:0: [sda]  
May  2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625858] Sense Key : Hardware Error [current] 
May  2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625861] sd 2:0:0:0: [sda]  
May  2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625863] Add. Sense: Logical unit failure
May  2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625866] sd 2:0:0:0: [sda] CDB: 
May  2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625867] Write(10): 2a 00 39 c7 10 a0 00 04 00 00
May  2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625913] sd 2:0:0:0: [sda] Unhandled sense code
May  2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625916] sd 2:0:0:0: [sda]  
May  2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625918] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
May  2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625920] sd 2:0:0:0: [sda]  
May  2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625921] Sense Key : Hardware Error [current]
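
For anyone reading the log above later: each failing command is a SCSI Write(10), and the CDB bytes say which blocks the kernel was trying to write. Below is a minimal sketch for decoding them, assuming the standard 10-byte WRITE(10) layout (opcode 0x2a, big-endian 4-byte LBA in bytes 2-5, big-endian 2-byte transfer length in bytes 7-8). The function name `decode_write10` is made up for illustration and is not part of any tool used on this host.

```python
def decode_write10(cdb_hex):
    """Decode a SCSI WRITE(10) CDB given as space-separated hex bytes.

    Returns (lba, blocks): the starting logical block address and the
    number of blocks the command tried to write.
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a WRITE(10) CDB: %r" % cdb_hex)
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length
    return lba, blocks


# The three CDBs that appear in the log excerpt above.
LOGGED_CDBS = [
    "2a 00 39 c7 08 a0 00 04 00 00",
    "2a 00 39 c7 04 a0 00 04 00 00",
    "2a 00 39 c7 10 a0 00 04 00 00",
]

if __name__ == "__main__":
    for cdb_hex in LOGGED_CDBS:
        lba, blocks = decode_write10(cdb_hex)
        print("LBA 0x%08x, %d blocks" % (lba, blocks))
```

Each of the three logged commands is a 0x0400 (1024) block write, and the starting LBAs are all in the 0x39c704a0 to 0x39c710a0 range, so the failures in this excerpt are clustered in one region of sda rather than spread across the disk.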
There's nothing we can do with the running system, so I attempted to reboot it to see if it would fsck on its own. I don't expect this to work well; the disk or the controller is probably just toast.

I've downtimed it for 30d in nagios for the time being.
Even with the host off and not responding, newer pip versions still fail:
  https://public-artifacts.taskcluster.net/GElmN67oRBGDHJUEdzOqcw/0/public/logs/live_backing.log

Should we kill proxxy everywhere, or just in scl3?
Comment on attachment 8866792 [details]
Bug 1363893: remove downed scl3 proxxy instance;

https://reviewboard.mozilla.org/r/138398/#review141644
Attachment #8866792 - Flags: review?(catlee) → review+
Pushed by dmitchell@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/0acdb2ad7e97
remove downed scl3 proxxy instance; r=catlee
Whiteboard: [checkin-needed-beta][checkin-needed-release][checkin-needed-esr52]
Assignee: nobody → dustin
Is this causing failures in Buildbot runs? I thought the only impacted builds were my test runs on the moonshot hardware.
Flags: needinfo?(ryanvm)
catlee claims that the OSX timeouts we're seeing across all branches are the same issue:
https://treeherder.mozilla.org/logviewer.html#?job_id=98422173&repo=mozilla-beta
Flags: needinfo?(ryanvm)
Whiteboard: [stockwell infra]
Status: RESOLVED → REOPENED
Keywords: leave-open
Resolution: FIXED → ---
No longer blocks: 1357753
Do you want to just decommission this hardware?
Assignee: dustin → nobody
Flags: needinfo?(catlee)
(In reply to Dustin J. Mitchell [:dustin] from comment #17)
> Do you want to just decommission this hardware?

Yes, I don't see any hope or point in repairing it.

If we do decide we need a new proxxy instance in SCL3 (or new DCs), then they should be created on fresh VMs or hosts.
Flags: needinfo?(catlee)
Blocks: 1364881
Hardware decom bug opened for dcops.
Status: REOPENED → RESOLVED
Closed: 7 years ago → 7 years ago
Keywords: leave-open
Resolution: --- → WONTFIX
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard