Closed Bug 842461 (t-snow-r4-0064) Opened 11 years ago Closed 10 years ago

t-snow-r4-0064 problem tracking

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P3)

x86
macOS

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Unassigned)

Details

(Whiteboard: [buildduty][buildslaves][capacity])

https://tbpl.mozilla.org/php/getParsedLog.php?id=19863634&tree=Mozilla-Inbound is a pink pixel of death reftest failure (actually a single pixel that's 255,255,247 instead of 255,255,255, so more of an off-white pixel of death). Once a few more happen, I'll reopen, we'll run diagnostics that won't find anything, rinse, repeat.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
https://tbpl.mozilla.org/php/getParsedLog.php?id=19888311&tree=Fx-Team is exactly the sort of GC crash that started us down the road of (unsuccessfully) blaming slaves with bad RAM for having bad RAM.

Disable, diagnostics that won't show anything, reimage, restart the cycle of blame, please.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Disabled in slavealloc.
Depends on: 854544
back in production
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Bad-RAM-caused GC crash in https://bugzilla.mozilla.org/show_bug.cgi?id=856612#c1
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Disabled in slavealloc due to failing to clone hg repos:

could not lookup DNS configuration info service: (ipc/send) invalid destination port
abort: error: nodename nor servname provided, or not known
Also note that the reboot step fails when in this state because tools aren't cloned:

python: can't open file 'tools/buildfarm/maintenance/count_and_reboot.py': [Errno 2] No such file or directory
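
For anyone hitting the same state, a DNS preflight along these lines might let a job fail fast before the clone step. This is only a rough sketch, not anything from the buildfarm tools: `can_resolve` and `preflight` are hypothetical names, and the injectable `resolver` argument exists purely so the check can be exercised without a network.

```python
import socket


def can_resolve(hostname, resolver=socket.getaddrinfo):
    """Return True if `hostname` resolves, False if the lookup fails.

    `resolver` defaults to socket.getaddrinfo; it is a parameter only so
    that a stand-in can be supplied when testing (an assumption of this
    sketch, not part of any existing tool).
    """
    try:
        # Port is None because only name resolution matters here.
        resolver(hostname, None)
        return True
    except socket.gaierror:
        # Raised for address-related lookup errors, e.g. "nodename nor
        # servname provided, or not known".
        return False


def preflight(hostnames, resolver=socket.getaddrinfo):
    """Return the subset of `hostnames` that fail to resolve."""
    return [h for h in hostnames if not can_resolve(h, resolver)]
```

Calling `preflight([...])` with whichever hosts the job needs to reach returns an empty list when DNS looks okay; a non-empty result would be a reason to skip the clone and flag the machine for a reboot instead.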
After manual reboot (thanks to :jhopkins), DNS looks okay.

re-enabled in slavealloc - will monitor for a clean job or two
two successful jobs with reboots, declaring victory
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
Mildly curious that the same thing would afflict the same slave again, but exactly as with comment 5, busted DNS that will heal once someone's around to reboot it, disabled in slavealloc.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Depends on: 933889
Memtest requested in bug 933889.

I didn't see anyone dealing with comment 4.
2013-03-25 diagnostics requested
2013-04-30 same issues + DNS issues
2013-11-11 memtest does not find memory issues

I'm putting this into production as I need logs to do anything in here.
Assignee: nobody → armenzg
1 job failed in the last 50 jobs.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Score for the last 100 jobs is three expected failures from bad checkins, and three suspicious GC crashes that make me think I'll be coming back to disable it before too long.
And another GC crash.

Disabled in slavealloc.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Depends on: 944520
Memory replacement requested.
Rebooted into production.
It is looking good.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
https://tbpl.mozilla.org/php/getParsedLog.php?id=31631470&tree=Mozilla-Aurora

14:00:40     INFO -  firefox-bin(955,0x7fff70221cc0) malloc: *** error for object 0x120e7b008: incorrect checksum for freed object - object was probably modified after being freed.
14:00:40     INFO -  *** set a breakpoint in malloc_error_break to debug
14:01:15  WARNING -  PROCESS-CRASH | file:///builds/slave/talos-slave/test/build/tests/reftest/tests/content/events/crashtests/recursive-DOMNodeInserted.html | application crashed [@ libSystem.B.dylib + 0x4f0b6]
2013-04 - disk diagnostics fixed some bad sectors - bug 854544
2013-11 - memtest requested - bug 933889
2013-11 - memory replaced - bug 944520

Keeping an eye out for further issue reports from philor.
https://tbpl.mozilla.org/php/getParsedLog.php?id=31995503&tree=Mozilla-Central - "Error in collecting counter: Private Bytes, pid: 958, exception: list index out of range," and then a socket error traceback, rather like Python got loaded into the bad memory this time.
Depends on: 748873
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
No longer depends on: 748873
Status: REOPENED → RESOLVED
Closed: 11 years ago → 10 years ago
Resolution: --- → FIXED
Assignee: armenzg → nobody
QA Contact: armenzg → bugspam.Callek
Alias: talos-r4-snow-066 → t-snow-r4-0064
Summary: talos-r4-snow-066 problem tracking → t-snow-r4-0064 problem tracking
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard