Closed
Bug 1363893
Opened 7 years ago
Closed 7 years ago
scl3 proxxy host has failed
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(firefox-esr52 fixed, firefox53 fixed, firefox54 fixed, firefox55 fixed)
People
(Reporter: dustin, Unassigned)
References
Details
(Whiteboard: [stockwell infra])
Attachments
(1 file)
This host has a read-only filesystem, is dropping HTTP connections without answering, and seems to have a bad disk.
Comment 1•7 years ago
@Amy: I'm thinking about filing a bug to DCOps for running some disk diagnostics. Does that sound OK to you?
Flags: needinfo?(arich)
Comment 2•7 years ago
That host is many years out of warranty. When releng asked to set it up on scavenged hardware, it was with the understanding that it wasn't production critical (we could do without it) and that repairing it might be expensive if it ever broke. Have dcops check, but if the disk is bad, it's not a given that we'll want to fix/replace it (check with catlee).
Flags: needinfo?(arich)
Reporter | ||
Comment 3•7 years ago
Can we try powering it down right now? I think its particular failure mode is causing issues with newer pip versions (luckily, the pre-cambrian version of pip we use in buildbot configs seems to survive this failure mode, or we'd have an outage).
Comment 4•7 years ago
I took a look at the logs; there are definitely hardware errors, so there's no need to run diags. The logs are full of:

May 2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625696] sd 2:0:0:0: [sda] Unhandled sense code
May 2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625706] sd 2:0:0:0: [sda]
May 2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625709] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
May 2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625712] sd 2:0:0:0: [sda]
May 2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625713] Sense Key : Hardware Error [current]
May 2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625717] sd 2:0:0:0: [sda]
May 2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625720] Add. Sense: Logical unit failure
May 2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625723] sd 2:0:0:0: [sda] CDB:
May 2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625724] Write(10): 2a 00 39 c7 08 a0 00 04 00 00
May 2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625777] sd 2:0:0:0: [sda] Unhandled sense code
May 2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625780] sd 2:0:0:0: [sda]
May 2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625782] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
May 2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625784] sd 2:0:0:0: [sda]
May 2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625785] Sense Key : Hardware Error [current]
May 2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625788] sd 2:0:0:0: [sda]
May 2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625791] Add. Sense: Logical unit failure
May 2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625793] sd 2:0:0:0: [sda] CDB:
May 2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625795] Write(10): 2a 00 39 c7 04 a0 00 04 00 00
May 2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625850] sd 2:0:0:0: [sda] Unhandled sense code
May 2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625853] sd 2:0:0:0: [sda]
May 2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625854] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
May 2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625857] sd 2:0:0:0: [sda]
May 2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625858] Sense Key : Hardware Error [current]
May 2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625861] sd 2:0:0:0: [sda]
May 2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625863] Add. Sense: Logical unit failure
May 2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625866] sd 2:0:0:0: [sda] CDB:
May 2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625867] Write(10): 2a 00 39 c7 10 a0 00 04 00 00
May 2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625913] sd 2:0:0:0: [sda] Unhandled sense code
May 2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625916] sd 2:0:0:0: [sda]
May 2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625918] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
May 2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625920] sd 2:0:0:0: [sda]
May 2 11:21:37 proxxy1.srv.releng.scl3.mozilla.com kernel: [37930414.625921] Sense Key : Hardware Error [current]
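For anyone reading these kernel messages later, here is a rough sketch of how such an excerpt could be summarized. It is not a tool anyone in this bug used; the names `decode_write10` and `summarize` are made up for illustration. It assumes only the standard WRITE(10) layout (opcode 0x2a, a big-endian logical block address in bytes 2-5, a big-endian transfer length in bytes 7-8) and the message wording shown in the log above.

```python
import re
from collections import Counter

# Matches the "Sense Key : <text> [current]" messages from the excerpt.
SENSE_KEY_RE = re.compile(r"Sense Key\s*:\s*(.+?)\s*\[current\]")
# Matches the ten hex bytes that follow "Write(10):".
WRITE10_RE = re.compile(r"Write\(10\):\s*((?:[0-9a-f]{2}\s+){9}[0-9a-f]{2})")


def decode_write10(cdb_hex):
    """Decode a WRITE(10) CDB into (starting LBA, number of blocks)."""
    cdb = bytes.fromhex(cdb_hex)  # whitespace between byte pairs is ignored
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length
    return lba, blocks


def summarize(log_text):
    """Count sense keys and decode every failed WRITE(10) in a log excerpt."""
    sense_keys = Counter(m.group(1) for m in SENSE_KEY_RE.finditer(log_text))
    writes = [decode_write10(m.group(1)) for m in WRITE10_RE.finditer(log_text)]
    return sense_keys, writes
```

Applied to the first CDB in the excerpt (`2a 00 39 c7 08 a0 00 04 00 00`), `decode_write10` gives LBA 0x39c708a0 and a transfer length of 1024 blocks. Every decoded entry in the excerpt is a write, which fits the read-only filesystem reported in the description.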
Comment 5•7 years ago
There's nothing we can do with the running system, so I attempted to reboot it to see if it would fsck on its own. I don't expect this to work well; I suspect the disk or the controller is just toast. I've downtimed it for 30d in nagios for the time being.
Reporter | ||
Comment 6•7 years ago
Even with the host off and not responding, newer pip versions still fail: https://public-artifacts.taskcluster.net/GElmN67oRBGDHJUEdzOqcw/0/public/logs/live_backing.log
Should we kill proxxy everywhere, or just in scl3?
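As an aside, the failure mode described here (a proxy that accepts or drops connections without ever answering) is the kind of thing a client can guard against by probing with a short timeout and falling back to the next source. The sketch below is only an illustration of that idea; it is not how pip, buildbot, or the test harness behave, and `http_probe`, `first_responsive`, and any URLs passed to them are hypothetical.

```python
import http.client
import urllib.error
import urllib.request


def http_probe(url, timeout=5.0):
    """Return True only if the URL answers with a 2xx/3xx status in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Covers refused connections, timeouts, HTTP error statuses, and
        # servers that close the connection without sending a response.
        return False


def first_responsive(candidates, probe=http_probe):
    """Return the first candidate URL whose probe succeeds, else None."""
    for url in candidates:
        if probe(url):
            return url
    return None
```

A caller would list the proxy first and the upstream host second, so that a dead proxy costs one timeout rather than a failed job. The `probe` argument is injectable so the selection logic can be exercised without touching the network.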
Comment hidden (mozreview-request) |
Reporter | ||
Comment 8•7 years ago
Testing that patch in https://treeherder.mozilla.org/#/jobs?repo=try&revision=38ab3937d032ea7b8cadf192718ecde4107c7094
Comment 9•7 years ago
mozreview-review |
Comment on attachment 8866792 [details] Bug 1363893: remove downed scl3 proxxy instance; https://reviewboard.mozilla.org/r/138398/#review141644
Attachment #8866792 -
Flags: review?(catlee) → review+
Comment 10•7 years ago
Pushed by dmitchell@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/0acdb2ad7e97 remove downed scl3 proxxy instance; r=catlee
Updated•7 years ago
status-firefox53:
--- → affected
status-firefox54:
--- → affected
status-firefox55:
--- → affected
status-firefox-esr52:
--- → affected
Whiteboard: [checkin-needed-beta][checkin-needed-release][checkin-needed-esr52]
Updated•7 years ago
Assignee: nobody → dustin
Reporter | ||
Comment 11•7 years ago
Is this causing failures in Buildbot runs? I thought the only impacted builds were my test runs on the moonshot hardware.
Flags: needinfo?(ryanvm)
Comment 12•7 years ago
catlee claims that the OSX timeouts we're seeing across all branches are the same issue: https://treeherder.mozilla.org/logviewer.html#?job_id=98422173&repo=mozilla-beta
Flags: needinfo?(ryanvm)
Comment 13•7 years ago
https://hg.mozilla.org/releases/mozilla-beta/rev/8b7e2a303954
https://hg.mozilla.org/releases/mozilla-release/rev/cd237bf1676e
Whiteboard: [checkin-needed-beta][checkin-needed-release][checkin-needed-esr52] → [checkin-needed-esr52]
Comment 14•7 years ago
bugherder uplift |
https://hg.mozilla.org/releases/mozilla-esr52/rev/6fbf8768a030
Whiteboard: [checkin-needed-esr52]
Comment 15•7 years ago
bugherder |
https://hg.mozilla.org/mozilla-central/rev/0acdb2ad7e97
Updated•7 years ago
Whiteboard: [stockwell infra]
Reporter | ||
Comment 17•7 years ago
Do you want to just decommission this hardware?
Assignee: dustin → nobody
Flags: needinfo?(catlee)
Comment 19•7 years ago
(In reply to Dustin J. Mitchell [:dustin] from comment #17)
> Do you want to just decommission this hardware?

Yes, I don't see any hope or point in repairing it. If we do decide we need a new proxxy instance in SCL3 (or new DCs), then they should be created on fresh VMs or hosts.
Flags: needinfo?(catlee)
Comment 20•7 years ago
Hardware decom bug opened for dcops.
Status: REOPENED → RESOLVED
Closed: 7 years ago → 7 years ago
Keywords: leave-open
Resolution: --- → WONTFIX
Comment 21•7 years ago
uplift |
https://hg.mozilla.org/releases/mozilla-esr52/rev/aa411bd76227 (FIREFOX_ESR_52_1_X_RELBRANCH)
Updated•6 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•4 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard