Closed
Bug 724397
(talos-r4-lion-005)
Opened 12 years ago
Closed 11 years ago
talos-r4-lion-005 problem tracking
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task, P3)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: philor, Unassigned)
References
Details
(Whiteboard: [buildduty][pink pixel of death][needs diagnostics bug filed])
If you look at https://build.mozilla.org/buildapi/recent/talos-r4-lion-005?numbuilds=100 it's obvious mostly that being a lion slave is a thankless and extremely orange job, but right now that includes a pair of reds:

https://tbpl.mozilla.org/php/getParsedLog.php?id=9101174&tree=Mozilla-Inbound
Rev4 MacOSX Lion 10.7 mozilla-inbound debug test jsreftest on 2012-02-05 04:52:49 PST for push 9ff7a8136813

Upon execvpe wget ['wget', '--progress=dot:mega', '-N', 'http://stage.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-macosx64-debug/1328444406/firefox-13.0a1.en-US.mac64.tests.zip'] in environment id 140319611977248
:Traceback (most recent call last):
  File "/tools/buildbot-0.8.4-pre-moz2/lib/python2.7/site-packages/twisted/internet/process.py", line 414, in _fork
    executable, args, environment)
  File "/tools/buildbot-0.8.4-pre-moz2/lib/python2.7/site-packages/twisted/internet/process.py", line 460, in _execChild
    os.execvpe(executable, args, environment)
  File "/tools/buildbot-0.8.4-pre-moz2/bin/../lib/python2.7/os.py", line 353, in execvpe
    _execvpe(file, args, env)
  File "/tools/buildbot-0.8.4-pre-moz2/bin/../lib/python2.7/os.py", line 383, in _execvpe
    if (e.errno != errno.ENOENT and e.errno != errno.ENOTDIR
AttributeError: 'module' object has no attribute 'ENOENT'

and

https://tbpl.mozilla.org/php/getParsedLog.php?id=9073161&tree=Firefox
Rev4 MacOSX Lion 10.7 mozilla-central debug test jsreftest on 2012-02-03 13:45:06 PST for push 394c3ef8a0dc

hg clone http://hg.mozilla.org/build/tools tools
requesting all changes
adding changesets
adding manifests
adding file changes
added 2193 changesets with 4722 changes to 949 files
updating working directory
** unknown exception encountered, details follow
** report bug details to http://mercurial.selenic.com/bts/
** or mercurial@selenic.com
** Mercurial Distributed SCM (version 1.3.1)
** Extensions loaded:
Traceback (most recent call last):
  File "/usr/local/bin/hg", line 20, in <module>
    mercurial.dispatch.run()
  File "/usr/local/lib/python2.5/site-packages/mercurial/dispatch.py", line 16, in run
  File "/usr/local/lib/python2.5/site-packages/mercurial/dispatch.py", line 27, in dispatch
  File "/usr/local/lib/python2.5/site-packages/mercurial/dispatch.py", line 43, in _runcatch
  File "/usr/local/lib/python2.5/site-packages/mercurial/dispatch.py", line 449, in _dispatch
  File "/usr/local/lib/python2.5/site-packages/mercurial/dispatch.py", line 317, in runcommand
  File "/usr/local/lib/python2.5/site-packages/mercurial/dispatch.py", line 501, in _runcommand
  File "/usr/local/lib/python2.5/site-packages/mercurial/dispatch.py", line 454, in checkargs
  File "/usr/local/lib/python2.5/site-packages/mercurial/dispatch.py", line 448, in <lambda>
  File "/usr/local/lib/python2.5/site-packages/mercurial/util.py", line 402, in check
  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/commands.py", line 636, in clone
  File "/usr/local/lib/python2.5/site-packages/mercurial/hg.py", line 313, in clone
  File "/usr/local/lib/python2.5/site-packages/mercurial/hg.py", line 331, in update
  File "/usr/local/lib/python2.5/site-packages/mercurial/merge.py", line 468, in update
  File "/usr/local/lib/python2.5/site-packages/mercurial/merge.py", line 305, in applyupdates
  File "/usr/local/lib/python2.5/site-packages/mercurial/context.py", line 303, in data
  File "/usr/local/lib/python2.5/site-packages/mercurial/filelog.py", line 16, in read
  File "/usr/local/lib/python2.5/site-packages/mercurial/revlog.py", line 998, in revision
  File "/usr/local/lib/python2.5/site-packages/mercurial/revlog.py", line 958, in _chunk
  File "/usr/local/lib/python2.5/site-packages/mercurial/revlog.py", line 108, in decompress
zlib.error: Error -3 while decompressing data: incorrect data check

Bad disk? Bad RAM? Acting out because it hates that its job involves mostly expected failure?
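For anyone wondering why "Bad disk? Bad RAM?" is a reasonable guess from the hg log: `incorrect data check` is what zlib reports when the decompressed bytes do not match the Adler-32 checksum stored at the end of the stream. A minimal sketch in Python, assuming nothing about how the bytes were actually damaged on this machine (the payload and the flipped bit are made up for illustration):

```python
import zlib

# Hypothetical payload; any bytes would do.
payload = b"talos-r4-lion-005 " * 64
compressed = bytearray(zlib.compress(payload))

# The last four bytes of a zlib stream hold the Adler-32 checksum of the
# uncompressed data. Flip one bit there to simulate a single-bit error.
compressed[-1] ^= 0x01

try:
    zlib.decompress(bytes(compressed))
    error_message = None
except zlib.error as exc:
    error_message = str(exc)

print(error_message)
```

A single flipped bit anywhere between the disk and the decompressor is enough to produce this class of error, which is why it does not by itself distinguish disk from memory.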
Updated•12 years ago
|
Priority: -- → P3
Updated•12 years ago
|
Assignee: nobody → bhearsum
Comment 1•12 years ago
|
||
Let's try a re-image.
Alias: talos-r4-lion-005
Assignee: bhearsum → nobody
Component: Release Engineering → Release Engineering: Machine Management
QA Contact: release → armenzg
Summary: Something just ain't right about talos-r4-lion-005 → talos-r4-lion-005 problem tracking
Comment 2•12 years ago
|
||
Some helpful stranger put this back into production.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Comment 3•12 years ago
|
||
looks like something's gone wrong with the user account:
talos-r4-lion-005:~ I have no name!$ whoami
501
talos-r4-lion-005:~ I have no name!$ sudo reboot
sudo: unknown uid: 501
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 4•12 years ago
|
||
seems to have fixed itself...
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
Reporter | ||
Comment 5•12 years ago
|
||
The startup crashes in system libraries in https://tbpl.mozilla.org/php/getParsedLog.php?id=16183140&tree=Mozilla-Inbound make me want to see what hardware diagnostics are going to say about this three-time-loser.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 6•12 years ago
|
||
I haven't seen any crashes in a while.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
Reporter | ||
Comment 7•12 years ago
|
||
https://tbpl.mozilla.org/php/getParsedLog.php?id=16981742&tree=Mozilla-Beta is the infamous "pink pixel of death" reftest failure, where a bit of corrupted memory blows several dozen tests.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 8•11 years ago
|
||
(In reply to Phil Ringnalda (:philor) from comment #7)
> https://tbpl.mozilla.org/php/getParsedLog.php?id=16981742&tree=Mozilla-Beta
> is the infamous "pink pixel of death" reftest failure, where a bit of
> corrupted memory blows several dozen tests.

Not sure what happened between this comment and now, but I'm not seeing any failures like this recently. I expect philor will prove me wrong momentarily.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 11 years ago
Resolution: --- → FIXED
Reporter | ||
Comment 9•11 years ago
|
||
https://tbpl.mozilla.org/php/getParsedLog.php?id=18845553&tree=Mozilla-Inbound is a nice start back down the wrong path, crashing in the not-us weeds on startup.
Comment 10•11 years ago
|
||
Put back into production now that hardware diagnostics have been run and the machine has been re-imaged
Comment 11•11 years ago
|
||
green builds so far
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Reporter | ||
Comment 12•11 years ago
|
||
https://tbpl.mozilla.org/php/getParsedLog.php?id=19213957&tree=Mozilla-Inbound 440 reftest failures, every single one of them with a single black pixel far down in the lower right of the canvas.
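As I understand it, a reftest compares the rendering of a test page against the rendering of a reference page, so one stray pixel is enough to fail a test. A minimal sketch of that comparison in Python; `diff_pixels` and the tiny all-white canvas are hypothetical illustrations, not the harness's own code:

```python
def diff_pixels(test, reference, width):
    """Return (x, y) coordinates where two equal-sized pixel lists differ."""
    if len(test) != len(reference):
        raise ValueError("renderings must have the same dimensions")
    return [
        (i % width, i // width)
        for i, (a, b) in enumerate(zip(test, reference))
        if a != b
    ]

# Hypothetical 8x6 canvas, stored row by row as RGB tuples.
WIDTH, HEIGHT = 8, 6
WHITE, BLACK = (255, 255, 255), (0, 0, 0)

reference = [WHITE] * (WIDTH * HEIGHT)
test = list(reference)
# One corrupted pixel near the lower right corner of the canvas.
test[(HEIGHT - 1) * WIDTH + (WIDTH - 2)] = BLACK

mismatches = diff_pixels(test, reference, WIDTH)
print(mismatches)  # [(6, 5)]
```

When the same pixel is wrong in every rendering, every comparison reports a mismatch at the same spot, which fits 440 failures with identical symptoms.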
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 13•11 years ago
|
||
...not sure what the next step here is, coop? Disabled in slavealloc.
Updated•11 years ago
|
Flags: needinfo?(coop)
Comment 14•11 years ago
|
||
Next step is to file a repair-or-decommission bug against IT. Replacing memory is (relatively) cheap. Other options get more interesting.
Flags: needinfo?(coop)
Comment 15•11 years ago
|
||
https://tbpl.mozilla.org/php/getParsedLog.php?id=20169126&tree=Mozilla-Inbound
DIDiskImageNewWithBackingStore: instantiator returned 0
Verifying…
Checksumming Driver Descriptor Map (DDM : 0)…
Driver Descriptor Map (DDM : 0): verified CRC32 $0EBD5181
Checksumming Apple (Apple_partition_map : 1)…
Apple (Apple_partition_map : 1): verified CRC32 $62B9DF30
Checksumming DiscRecording 6.0d1 (Apple_HFS : 2)…
__decompressChunk: returning 1000
CUDIFDiskImage::readSectors: returning 1000
DiscRecording 6.0d1 (Apple_HFS : 2): checksum failed with error 1000.
CUDIFDiskImage: error 1000 calculating checksum
Verification completed…
Error 1000 (image data corrupted).
calculated CRC32 $0187B220, expected CRC32 $2A030FB6
Finishing…
DIHLDiskImageAttach() returned 1000
hdiutil: attach failed - image data corrupted
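The "calculated CRC32 ... expected CRC32 ..." mismatch above is the kind of result a single flipped bit produces. A minimal sketch in Python using the standard `zlib.crc32`; the data, the flipped bit, and the `verify_crc32` helper are made up for illustration, and this is not a claim about how hdiutil computes its checksums internally:

```python
import zlib

def verify_crc32(data: bytes, expected: int) -> bool:
    """Return True if the CRC32 of data matches the expected value."""
    return (zlib.crc32(data) & 0xFFFFFFFF) == expected

# Hypothetical "image" contents and the checksum recorded for them.
image = bytes(range(256)) * 16
expected = zlib.crc32(image) & 0xFFFFFFFF

# Simulate a single-bit error somewhere in the data.
damaged = bytearray(image)
damaged[1000] ^= 0x04

intact_ok = verify_crc32(image, expected)
damaged_ok = verify_crc32(bytes(damaged), expected)
calculated = zlib.crc32(bytes(damaged)) & 0xFFFFFFFF

print(f"calculated CRC32 ${calculated:08X}, expected CRC32 ${expected:08X}")
```

CRC32 detects every single-bit error, so the damaged copy never verifies. That makes a checksum failure on a freshly downloaded image at least consistent with the bad-memory theory, although a corrupted download would look the same.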
Reporter | ||
Comment 16•11 years ago
|
||
Pink pixel of death - https://tbpl.mozilla.org/php/getParsedLog.php?id=20689484&tree=Mozilla-Inbound
Reporter | ||
Comment 17•11 years ago
|
||
Pink pixel of death - https://tbpl.mozilla.org/php/getParsedLog.php?id=20795734&tree=Mozilla-Inbound
Reporter | ||
Comment 18•11 years ago
|
||
Pink pixel of death - https://tbpl.mozilla.org/php/getParsedLog.php?id=21044403&tree=Mozilla-Inbound
Reporter | ||
Updated•11 years ago
|
Keywords: intermittent-failure
Updated•11 years ago
|
Whiteboard: [badslave?][buildduty] → [buildduty][pink pixel of death]
Reporter | ||
Comment 19•11 years ago
|
||
https://tbpl.mozilla.org/php/getParsedLog.php?id=21201719&tree=Mozilla-Esr17 - crashed in the system library weeds.
Reporter | ||
Comment 20•11 years ago
|
||
Pink pixel of death - https://tbpl.mozilla.org/php/getParsedLog.php?id=21233252&tree=Mozilla-Inbound
Comment 21•11 years ago
|
||
So, this machine has been hitting pink pixel issues off and on since November. Coop, you requested that we try some memory replacement in bug 839921, but that was never done. Should we pursue that some more, decomm, or ...?
Flags: needinfo?(coop)
Comment 22•11 years ago
|
||
(In reply to Ben Hearsum [:bhearsum] from comment #21)
> So, this machine has been hitting pink pixel issues off and on since
> November. Coop, you requested that we try some memory replacement in bug
> 839921, but that was never done. Should we pursue that some more, decomm, or
> ...?

Memory is the cheaper option. Let's get new memory installed and re-image this machine.
Flags: needinfo?(coop)
Comment 23•11 years ago
|
||
Disabled in slavealloc for bug 856622.
Comment 24•11 years ago
|
||
re-enabling in slavealloc since the reimage was done. I'm sending this out against the gods to see if it avoids the pink-pixel plague.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Reporter | ||
Comment 25•11 years ago
|
||
Churning through jobs with "hdiutil: attach failed - Device not configured" and "sudo: unknown uid: 501" instead of a reboot; typically fixed by just forcing it to reboot.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 26•11 years ago
|
||
Forced a reboot via ssh + |sudo reboot|.
Comment 27•11 years ago
|
||
Been running jobs for a while.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Updated•11 years ago
|
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee | ||
Updated•11 years ago
|
Product: mozilla.org → Release Engineering
Updated•11 years ago
|
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Updated•11 years ago
|
Keywords: intermittent-failure
Reporter | ||
Comment 28•11 years ago
|
||
Most extreme PPoD I've ever seen in https://tbpl.mozilla.org/php/getParsedLog.php?id=26746862&tree=Mozilla-Inbound Disabled in slavealloc.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Reporter | ||
Updated•11 years ago
|
Whiteboard: [buildduty][pink pixel of death] → [buildduty][pink pixel of death][needs diagnostics bug filed]
Comment 29•11 years ago
|
||
I have no idea what to do with this slave anymore. RAM was replaced but it's still hitting PPOD. Kill it with fire?
Comment 30•11 years ago
|
||
Back in production.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Updated•11 years ago
|
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 31•11 years ago
|
||
Still not working properly. Could be that the corrupt RAM which was replaced caused some corruption on the filesystem. Let's give this machine another chance with a reimage.
Comment 32•11 years ago
|
||
Back in production. Probably not for long.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Updated•6 years ago
|
Product: Release Engineering → Infrastructure & Operations
Updated•4 years ago
|
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard