Closed Bug 724397 (talos-r4-lion-005) Opened 14 years ago Closed 12 years ago

talos-r4-lion-005 problem tracking

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P3)

x86_64
macOS

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Unassigned)

References

Details

(Whiteboard: [buildduty][pink pixel of death][needs diagnostics bug filed])

If you look at https://build.mozilla.org/buildapi/recent/talos-r4-lion-005?numbuilds=100 it's obvious mostly that being a lion slave is a thankless and extremely orange job, but right now that includes a pair of reds:

https://tbpl.mozilla.org/php/getParsedLog.php?id=9101174&tree=Mozilla-Inbound
Rev4 MacOSX Lion 10.7 mozilla-inbound debug test jsreftest on 2012-02-05 04:52:49 PST for push 9ff7a8136813

Upon execvpe wget ['wget', '--progress=dot:mega', '-N', 'http://stage.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-macosx64-debug/1328444406/firefox-13.0a1.en-US.mac64.tests.zip'] in environment id 140319611977248
:Traceback (most recent call last):
  File "/tools/buildbot-0.8.4-pre-moz2/lib/python2.7/site-packages/twisted/internet/process.py", line 414, in _fork
    executable, args, environment)
  File "/tools/buildbot-0.8.4-pre-moz2/lib/python2.7/site-packages/twisted/internet/process.py", line 460, in _execChild
    os.execvpe(executable, args, environment)
  File "/tools/buildbot-0.8.4-pre-moz2/bin/../lib/python2.7/os.py", line 353, in execvpe
    _execvpe(file, args, env)
  File "/tools/buildbot-0.8.4-pre-moz2/bin/../lib/python2.7/os.py", line 383, in _execvpe
    if (e.errno != errno.ENOENT and e.errno != errno.ENOTDIR
AttributeError: 'module' object has no attribute 'ENOENT'

and

https://tbpl.mozilla.org/php/getParsedLog.php?id=9073161&tree=Firefox
Rev4 MacOSX Lion 10.7 mozilla-central debug test jsreftest on 2012-02-03 13:45:06 PST for push 394c3ef8a0dc

hg clone http://hg.mozilla.org/build/tools tools
requesting all changes
adding changesets
adding manifests
adding file changes
added 2193 changesets with 4722 changes to 949 files
updating working directory
** unknown exception encountered, details follow
** report bug details to http://mercurial.selenic.com/bts/
** or mercurial@selenic.com
** Mercurial Distributed SCM (version 1.3.1)
** Extensions loaded:
Traceback (most recent call last):
  File "/usr/local/bin/hg", line 20, in <module>
    mercurial.dispatch.run()
  File "/usr/local/lib/python2.5/site-packages/mercurial/dispatch.py", line 16, in run
  File "/usr/local/lib/python2.5/site-packages/mercurial/dispatch.py", line 27, in dispatch
  File "/usr/local/lib/python2.5/site-packages/mercurial/dispatch.py", line 43, in _runcatch
  File "/usr/local/lib/python2.5/site-packages/mercurial/dispatch.py", line 449, in _dispatch
  File "/usr/local/lib/python2.5/site-packages/mercurial/dispatch.py", line 317, in runcommand
  File "/usr/local/lib/python2.5/site-packages/mercurial/dispatch.py", line 501, in _runcommand
  File "/usr/local/lib/python2.5/site-packages/mercurial/dispatch.py", line 454, in checkargs
  File "/usr/local/lib/python2.5/site-packages/mercurial/dispatch.py", line 448, in <lambda>
  File "/usr/local/lib/python2.5/site-packages/mercurial/util.py", line 402, in check
  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/commands.py", line 636, in clone
  File "/usr/local/lib/python2.5/site-packages/mercurial/hg.py", line 313, in clone
  File "/usr/local/lib/python2.5/site-packages/mercurial/hg.py", line 331, in update
  File "/usr/local/lib/python2.5/site-packages/mercurial/merge.py", line 468, in update
  File "/usr/local/lib/python2.5/site-packages/mercurial/merge.py", line 305, in applyupdates
  File "/usr/local/lib/python2.5/site-packages/mercurial/context.py", line 303, in data
  File "/usr/local/lib/python2.5/site-packages/mercurial/filelog.py", line 16, in read
  File "/usr/local/lib/python2.5/site-packages/mercurial/revlog.py", line 998, in revision
  File "/usr/local/lib/python2.5/site-packages/mercurial/revlog.py", line 958, in _chunk
  File "/usr/local/lib/python2.5/site-packages/mercurial/revlog.py", line 108, in decompress
zlib.error: Error -3 while decompressing data: incorrect data check

Bad disk? Bad RAM? Acting out because it hates that its job involves mostly expected failure?
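For context on the second failure: zlib reports "incorrect data check" when the decompressed bytes do not match the Adler-32 checksum stored at the end of the stream, which is consistent with data being damaged somewhere between disk and memory. The following is a minimal Python 3 sketch, not taken from the failing job; the payload and the helper names `corrupt_trailer` and `try_decompress` are made up for illustration. It reproduces the same message by flipping a single bit in the checksum trailer.

```python
import zlib


def corrupt_trailer(stream: bytes) -> bytes:
    """Flip one bit in the last byte of a zlib stream (part of its Adler-32 trailer)."""
    return stream[:-1] + bytes([stream[-1] ^ 0x01])


def try_decompress(stream: bytes) -> str:
    """Return 'ok' on success, otherwise the zlib error message."""
    try:
        zlib.decompress(stream)
        return "ok"
    except zlib.error as exc:
        return str(exc)


# Illustrative data only; this is not a real Mercurial revlog chunk.
payload = b"hypothetical revlog chunk " * 64
good = zlib.compress(payload)
bad = corrupt_trailer(good)

print(try_decompress(good))
print(try_decompress(bad))
```

A single flipped bit is enough to trigger the error, so this failure alone cannot distinguish bad RAM from a bad disk.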
Priority: -- → P3
Assignee: nobody → bhearsum
Let's try a re-image.
Alias: talos-r4-lion-005
Assignee: bhearsum → nobody
Component: Release Engineering → Release Engineering: Machine Management
QA Contact: release → armenzg
Summary: Something just ain't right about talos-r4-lion-005 → talos-r4-lion-005 problem tracking
Depends on: 733830
Some helpful stranger put this back into production.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
looks like something's gone wrong with the user account:
talos-r4-lion-005:~ I have no name!$ whoami
501
talos-r4-lion-005:~ I have no name!$ sudo reboot
sudo: unknown uid: 501
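The "I have no name!" prompt and "sudo: unknown uid: 501" both say the same thing: uid 501 could not be mapped to an account name. Below is a small Python sketch of that check, assuming a Unix host. The function name `uid_resolves` and its injectable `lookup` parameter are illustrative and not part of any slave tooling.

```python
import os
import pwd


def uid_resolves(uid, lookup=pwd.getpwuid):
    """Return True if the uid maps to an entry in the user database."""
    try:
        lookup(uid)
        return True
    except KeyError:
        # pwd.getpwuid raises KeyError when no entry exists for the uid.
        return False


current = os.getuid()
print(current, "resolves" if uid_resolves(current) else "has no user database entry")
```

Run on a machine in the state shown above, a check like this would be expected to report that the current uid has no entry.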
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
seems to have fixed itself...
Status: REOPENED → RESOLVED
Closed: 13 years ago → 13 years ago
Resolution: --- → FIXED
The startup crashes in system libraries in https://tbpl.mozilla.org/php/getParsedLog.php?id=16183140&tree=Mozilla-Inbound make me want to see what hardware diagnostics are going to say about this three-time-loser.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Depends on: 806987
I haven't seen any crashes in a while.
Status: REOPENED → RESOLVED
Closed: 13 years ago → 13 years ago
Resolution: --- → FIXED
https://tbpl.mozilla.org/php/getParsedLog.php?id=16981742&tree=Mozilla-Beta is the infamous "pink pixel of death" reftest failure, where a bit of corrupted memory blows several dozen tests.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to Phil Ringnalda (:philor) from comment #7)
> https://tbpl.mozilla.org/php/getParsedLog.php?id=16981742&tree=Mozilla-Beta
> is the infamous "pink pixel of death" reftest failure, where a bit of
> corrupted memory blows several dozen tests.
Not sure what happened between this comment and now, but I'm not seeing any failures like this recently. I expect philor will prove me wrong momentarily.
Status: REOPENED → RESOLVED
Closed: 13 years ago → 13 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Depends on: 830154
Resolution: FIXED → ---
https://tbpl.mozilla.org/php/getParsedLog.php?id=18845553&tree=Mozilla-Inbound is a nice start back down the wrong path, crashing in the not-us weeds on startup.
Depends on: 831281
Put back into production now that hardware diagnostics have been run and the machine has been re-imaged
green builds so far
Status: REOPENED → RESOLVED
Closed: 13 years ago → 13 years ago
Resolution: --- → FIXED
https://tbpl.mozilla.org/php/getParsedLog.php?id=19213957&tree=Mozilla-Inbound 440 reftest failures, every single one of them with a single black pixel far down in the lower right of the canvas.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
...not sure what the next step here is, coop? Disabled in slavealloc.
Flags: needinfo?(coop)
Depends on: 839921
Next step is to file a repair-or-decommission bug against IT. Replacing memory is (relatively) cheap. Other options get more interesting.
Flags: needinfo?(coop)
https://tbpl.mozilla.org/php/getParsedLog.php?id=20169126&tree=Mozilla-Inbound
DIDiskImageNewWithBackingStore: instantiator returned 0
Verifying…
Checksumming Driver Descriptor Map (DDM : 0)…
Driver Descriptor Map (DDM : 0): verified CRC32 $0EBD5181
Checksumming Apple (Apple_partition_map : 1)…
Apple (Apple_partition_map : 1): verified CRC32 $62B9DF30
Checksumming DiscRecording 6.0d1 (Apple_HFS : 2)…
__decompressChunk: returning 1000
CUDIFDiskImage::readSectors: returning 1000
DiscRecording 6.0d1 (Apple_HFS : 2): checksum failed with error 1000.
CUDIFDiskImage: error 1000 calculating checksum
Verification completed…
Error 1000 (image data corrupted).
calculated CRC32 $0187B220, expected CRC32 $2A030FB6
Finishing…
DIHLDiskImageAttach() returned 1000
hdiutil: attach failed - image data corrupted
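The hdiutil output above is a checksum comparison: a CRC32 is calculated over the image data and compared with the value recorded in the image. Below is a generic Python sketch of that kind of check. It does not reproduce how hdiutil frames or chunks the data, and the names `crc32_matches` and `format_crc` are illustrative.

```python
import zlib


def format_crc(value: int) -> str:
    """Render a CRC32 in the $XXXXXXXX style used in the log above."""
    return "$%08X" % (value & 0xFFFFFFFF)


def crc32_matches(data: bytes, expected: int) -> bool:
    """Return True if the CRC32 of data equals the expected value."""
    return (zlib.crc32(data) & 0xFFFFFFFF) == expected


# "123456789" is the conventional CRC32 check input; its CRC32 is 0xCBF43926.
data = b"123456789"
expected = 0xCBF43926

print(format_crc(zlib.crc32(data)), crc32_matches(data, expected))

# One flipped bit in the data changes the calculated value, so the check fails.
damaged = bytes([data[0] ^ 0x01]) + data[1:]
print(format_crc(zlib.crc32(damaged)), crc32_matches(damaged, expected))
```

As with the zlib failure earlier in this bug, a mismatch only shows that the bytes changed somewhere between download and verification, not where.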
Whiteboard: [badslave?][buildduty] → [buildduty][pink pixel of death]
So, this machine has been hitting pink pixel issues off and on since November. Coop, you requested that we try some memory replacement in bug 839921, but that was never done. Should we pursue that some more, decomm, or ...?
Flags: needinfo?(coop)
(In reply to Ben Hearsum [:bhearsum] from comment #21)
> So, this machine has been hitting pink pixel issues off and on since
> November. Coop, you requested that we try some memory replacement in bug
> 839921, but that was never done. Should we pursue that some more, decomm, or
> ...?
Memory is the cheaper option. Let's get new memory installed and re-image this machine.
Flags: needinfo?(coop)
Depends on: 856622
Disabled in slavealloc for bug 856622.
Depends on: ppod
re-enabling in slavealloc since the reimage was done. I'm sending this out against the gods to see if it avoids the pink-pixel plague.
Status: REOPENED → RESOLVED
Closed: 13 years ago → 12 years ago
Resolution: --- → FIXED
Churning through jobs with "hdiutil: attach failed - Device not configured" and "sudo: unknown uid: 501" instead of a reboot; typically fixed by just forcing it to reboot.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Forced a reboot via ssh + |sudo reboot|.
Been running jobs for a while.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Depends on: 903462
Product: mozilla.org → Release Engineering
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
Depends on: 902970
Most extreme PPoD I've ever seen in https://tbpl.mozilla.org/php/getParsedLog.php?id=26746862&tree=Mozilla-Inbound
Disabled in slavealloc.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Whiteboard: [buildduty][pink pixel of death] → [buildduty][pink pixel of death][needs diagnostics bug filed]
I have no idea what to do with this slave anymore. RAM was replaced but it's still hitting PPOD. Kill it with fire?
Depends on: 927695
Back in production.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Still not working properly. Could be that the corrupt RAM which was replaced caused some corruption on the filesystem. Let's give this machine another chance with a reimage.
Depends on: 938199
Back in production. Probably not for long.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard