Closed Bug 730545 (talos-r4-lion-058) Opened 12 years ago Closed 11 years ago

talos-r4-lion-058 problem tracking

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P3)

x86_64
macOS

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Unassigned)

Details

(Whiteboard: [decomm])

Attachments

(2 files)

+++ This bug was initially created as a clone of Bug #728535 +++

Same symptom as bug 728535, when talos-r4-snow-007 took to saying that every file in rm -rf tools was an invalid argument, and then dying while downloading builds, saying Cannot write to `firefox-13.0a1.en-US.mac.dmg' (Invalid argument).

Anyway, it's chewing up jobs (what jobs there are to chew right now) like crazy because it only takes a couple of seconds to fail to rm and then fail to save a downloaded build.
Disabled in slavealloc, and did a graceful shutdown on the master. Should be in the naughty corner now.
Priority: -- → P3
Alias: talos-r4-lion-058
Component: Release Engineering → Release Engineering: Machine Management
QA Contact: release → armenzg
Summary: talos-r4-lion-058 is broken → talos-r4-lion-058 problem tracking
Depends on: 734066
Back in the production pool.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Depends on: 734778
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
This mini has been repaired, reimaged and placed back in scl1.  It is ready to be placed in production.
Back in production
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
Last job was 8 days ago.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Power off and on again as it was refusing an ssh connection.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
https://tbpl.mozilla.org/php/getParsedLog.php?id=14658404&tree=Ionmonkey

"inflating: reftest/tests/layout/reftests/fonts/dejavu-sans/DejaVuSans-Oblique.ttf   bad CRC 21b8d878  (should be 0b256820)"

Expect me to be coming around claiming that the disk is bad before too much longer.
https://tbpl.mozilla.org/php/getParsedLog.php?id=14758821&tree=Mozilla-Inbound

Error 1000 (image data corrupted).
calculated CRC32 $C8359846, expected   CRC32 $269CA257
https://tbpl.mozilla.org/php/getParsedLog.php?id=14760663&tree=Ionmonkey

  inflating: reftest/tests/layout/reftests/fonts/dejavu-sans/DejaVuSans.ttf   bad CRC 21e182a9  (should be d8a4d667)

Needs hardware diagnostics, I fear.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Depends on: 786369
comment 3 suggested that this slave was repaired 4 months ago and no one reported any issues. Perhaps it just broke again and that is why it was down for several days?
Added a note to slavealloc and disabled the slave.
Back into production now the RAM has been replaced.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
Made it 6 jobs before https://tbpl.mozilla.org/php/getParsedLog.php?id=15309593&tree=Mozilla-Inbound where it apparently went into a coma that was forcibly ended by a reconfig.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Depends on: 793221
Host had memory issues and the RAM was replaced, but it looks like the issues are persisting. Will need to bring it to the Apple-certified tech in desktop to rerun diagnostics.

Bug 794184 opened to track.
Severity: major → normal
Depends on: 794184
I don't actually want to ever see this slave back again, but I know I will anyway, and I feel duty-bound to comment before acking the nagios alert that's been going off forever now: bug 794184 comment 2 claimed there was nothing wrong with this clearly broken slave, and that it was going back into production, but hasn't been heard from in the 6 weeks since, so somebody probably ought to do something so we can get back to reimaging it every few days.
Did anyone ever actually put this back into production or test it out after the RAM was replaced?
The RAM replacement was 2012-09-14, went back into production 2012-09-17, only did six jobs and then died, no?
philor: Ah, okay, I was misreading the timeline and thought that the RAM had been replaced after comment 14.  I've asked dcops to run some more diags on it to see if they can find something in addition to Apple's own hardware diags (which come up clean).
Diagnostics were run and came up clean. Returning to production. 

If it fails again, it will get a sharper hook.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
coop: We've run extensive diagnostics and they've repeatedly come up clean.  Despite that, philor says it burns jobs almost immediately when put back into production (see comment 15).  I'm not sure where to go from here.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Depends on: 818467
(In reply to Amy Rich [:arich] [:arr] from comment #20)
> coop: We've run extensive diagnostics and they've repeatedly come up clean. 
> Despite that, philor says it burns jobs almost immediately when put back
> into production (see comment 15).  I'm not sure where to go from here.

Filed bug 818467 to get this mini re-imaged. One last try before decommissioning.
Depends on: 818502
Down again. IRC conversation concluded that we should decommission?
nthomas: yeah, hardware diags show nothing wrong, we've reimaged it, and all it does is burn jobs and go down.
Depends on: 820115
Attachment #699975 - Flags: review?(rail) → review+
Attachment #699977 - Flags: review?(rail) → review+
Flags: needinfo?(nobody)
Hey ben, can we get these patches landed please :-)

--> removing from buildduty queue
Flags: needinfo?(bhearsum)
Whiteboard: [badslave?][buildduty] → [badslave?]
Flags: needinfo?(nobody)
(In reply to Justin Wood (:Callek) from comment #26)
> Hey ben, can we get these patches landed please :-)
> 
> --> removing from buildduty queue

You could've just landed these yourself, but OK...
Flags: needinfo?(bhearsum)
Attachment #699975 - Flags: checked-in+
Attachment #699977 - Flags: checked-in+
in production
Decommissioned in bug 820115.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 11 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
Whiteboard: [badslave?] → [decomm]
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard