Closed Bug 724397 (talos-r4-lion-005) Opened 12 years ago Closed 11 years ago

talos-r4-lion-005 problem tracking

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P3)

x86_64
macOS

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Unassigned)

Details

(Whiteboard: [buildduty][pink pixel of death][needs diagnostics bug filed])

If you look at https://build.mozilla.org/buildapi/recent/talos-r4-lion-005?numbuilds=100, what's mostly obvious is that being a lion slave is a thankless and extremely orange job, but right now that includes a pair of reds:

https://tbpl.mozilla.org/php/getParsedLog.php?id=9101174&tree=Mozilla-Inbound
Rev4 MacOSX Lion 10.7 mozilla-inbound debug test jsreftest on 2012-02-05 04:52:49 PST for push 9ff7a8136813

Upon execvpe wget ['wget', '--progress=dot:mega', '-N', 'http://stage.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-macosx64-debug/1328444406/firefox-13.0a1.en-US.mac64.tests.zip'] in environment id 140319611977248
:Traceback (most recent call last):
  File "/tools/buildbot-0.8.4-pre-moz2/lib/python2.7/site-packages/twisted/internet/process.py", line 414, in _fork
    executable, args, environment)
  File "/tools/buildbot-0.8.4-pre-moz2/lib/python2.7/site-packages/twisted/internet/process.py", line 460, in _execChild
    os.execvpe(executable, args, environment)
  File "/tools/buildbot-0.8.4-pre-moz2/bin/../lib/python2.7/os.py", line 353, in execvpe
    _execvpe(file, args, env)
  File "/tools/buildbot-0.8.4-pre-moz2/bin/../lib/python2.7/os.py", line 383, in _execvpe
    if (e.errno != errno.ENOENT and e.errno != errno.ENOTDIR
AttributeError: 'module' object has no attribute 'ENOENT'

and

https://tbpl.mozilla.org/php/getParsedLog.php?id=9073161&tree=Firefox
Rev4 MacOSX Lion 10.7 mozilla-central debug test jsreftest on 2012-02-03 13:45:06 PST for push 394c3ef8a0dc

hg clone http://hg.mozilla.org/build/tools tools
requesting all changes
adding changesets
adding manifests
adding file changes
added 2193 changesets with 4722 changes to 949 files
updating working directory
** unknown exception encountered, details follow
** report bug details to http://mercurial.selenic.com/bts/
** or mercurial@selenic.com
** Mercurial Distributed SCM (version 1.3.1)
** Extensions loaded: 
Traceback (most recent call last):
  File "/usr/local/bin/hg", line 20, in <module>
    mercurial.dispatch.run()
  File "/usr/local/lib/python2.5/site-packages/mercurial/dispatch.py", line 16, in run
  File "/usr/local/lib/python2.5/site-packages/mercurial/dispatch.py", line 27, in dispatch
  File "/usr/local/lib/python2.5/site-packages/mercurial/dispatch.py", line 43, in _runcatch
  File "/usr/local/lib/python2.5/site-packages/mercurial/dispatch.py", line 449, in _dispatch
  File "/usr/local/lib/python2.5/site-packages/mercurial/dispatch.py", line 317, in runcommand
  File "/usr/local/lib/python2.5/site-packages/mercurial/dispatch.py", line 501, in _runcommand
  File "/usr/local/lib/python2.5/site-packages/mercurial/dispatch.py", line 454, in checkargs
  File "/usr/local/lib/python2.5/site-packages/mercurial/dispatch.py", line 448, in <lambda>
  File "/usr/local/lib/python2.5/site-packages/mercurial/util.py", line 402, in check
  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/commands.py", line 636, in clone
    
  File "/usr/local/lib/python2.5/site-packages/mercurial/hg.py", line 313, in clone
  File "/usr/local/lib/python2.5/site-packages/mercurial/hg.py", line 331, in update
  File "/usr/local/lib/python2.5/site-packages/mercurial/merge.py", line 468, in update
  File "/usr/local/lib/python2.5/site-packages/mercurial/merge.py", line 305, in applyupdates
  File "/usr/local/lib/python2.5/site-packages/mercurial/context.py", line 303, in data
  File "/usr/local/lib/python2.5/site-packages/mercurial/filelog.py", line 16, in read
  File "/usr/local/lib/python2.5/site-packages/mercurial/revlog.py", line 998, in revision
  File "/usr/local/lib/python2.5/site-packages/mercurial/revlog.py", line 958, in _chunk
  File "/usr/local/lib/python2.5/site-packages/mercurial/revlog.py", line 108, in decompress
zlib.error: Error -3 while decompressing data: incorrect data check

Bad disk? Bad RAM? Acting out because it hates that its job involves mostly expected failure?
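
For context on the second log: zlib reports "incorrect data check" when the Adler-32 checksum stored at the end of a compressed stream does not match the bytes it just decompressed. That is consistent with (though not proof of) data being damaged in RAM or on disk. The sketch below is only an illustration and not Mercurial's code: the payload is made up and `flip_bit` is a hypothetical helper. It reproduces the same message by flipping one bit in the stream's trailer.

```python
import zlib


def flip_bit(data: bytes, byte_index: int, bit: int = 0) -> bytes:
    """Return a copy of data with a single bit inverted (hypothetical helper)."""
    corrupted = bytearray(data)
    corrupted[byte_index] ^= 1 << bit
    return bytes(corrupted)


# Made-up stand-in for a compressed chunk; not real revlog data.
payload = b"hypothetical revlog chunk " * 64
compressed = zlib.compress(payload)

# The undamaged stream round-trips cleanly.
assert zlib.decompress(compressed) == payload

# Flip one bit in the last byte, which is part of the Adler-32 trailer.
damaged = flip_bit(compressed, len(compressed) - 1)

try:
    zlib.decompress(damaged)
    error_message = None
except zlib.error as exc:
    error_message = str(exc)

print(error_message)
```

A flipped bit elsewhere in the stream can produce other zlib errors, so the exact message alone does not say where the damage happened.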
Priority: -- → P3
Assignee: nobody → bhearsum
Let's try a re-image.
Alias: talos-r4-lion-005
Assignee: bhearsum → nobody
Component: Release Engineering → Release Engineering: Machine Management
QA Contact: release → armenzg
Summary: Something just ain't right about talos-r4-lion-005 → talos-r4-lion-005 problem tracking
Depends on: 733830
Some helpful stranger put this back into production.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
looks like something's gone wrong with the user account:

talos-r4-lion-005:~ I have no name!$ whoami
501

talos-r4-lion-005:~ I have no name!$ sudo reboot
sudo: unknown uid: 501
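
Both the "I have no name!" prompt and "sudo: unknown uid: 501" indicate that uid 501 could not be mapped back to an account name. A minimal sketch of the same lookup in Python, assuming a Unix-like host: `username_for_uid` is a hypothetical helper, and the `lookup` parameter exists only so the failure case can be exercised without breaking a real account database.

```python
import os
import pwd


def username_for_uid(uid, lookup=pwd.getpwuid):
    """Return the account name for uid, or None if the uid cannot be resolved."""
    try:
        return lookup(uid).pw_name
    except KeyError:
        # pwd.getpwuid raises KeyError when no entry exists for the uid.
        return None


if __name__ == "__main__":
    uid = os.getuid()
    name = username_for_uid(uid)
    if name is None:
        print(f"uid {uid} has no account entry")
    else:
        print(f"uid {uid} is {name}")
```

If this starts returning None for the account the slave runs as, the problem is in the user database or directory lookup rather than in the job itself.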
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
seems to have fixed itself...
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
The startup crashes in system libraries in https://tbpl.mozilla.org/php/getParsedLog.php?id=16183140&tree=Mozilla-Inbound make me want to see what hardware diagnostics are going to say about this three-time-loser.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Depends on: 806987
I haven't seen any crashes in a while.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
https://tbpl.mozilla.org/php/getParsedLog.php?id=16981742&tree=Mozilla-Beta is the infamous "pink pixel of death" reftest failure, where a bit of corrupted memory blows several dozen tests.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to Phil Ringnalda (:philor) from comment #7)
> https://tbpl.mozilla.org/php/getParsedLog.php?id=16981742&tree=Mozilla-Beta
> is the infamous "pink pixel of death" reftest failure, where a bit of
> corrupted memory blows several dozen tests.

Not sure what happened between this comment and now, but I'm not seeing any failures like this recently. I expect philor will prove me wrong momentarily.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 11 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Depends on: 830154
Resolution: FIXED → ---
https://tbpl.mozilla.org/php/getParsedLog.php?id=18845553&tree=Mozilla-Inbound is a nice start back down the wrong path, crashing in the not-us weeds on startup.
Depends on: 831281
Put back into production now that hardware diagnostics have been run and the machine has been re-imaged.
Green builds so far.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
https://tbpl.mozilla.org/php/getParsedLog.php?id=19213957&tree=Mozilla-Inbound

440 reftest failures, every single one of them with a single black pixel far down in the lower right of the canvas.
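
To make "a single black pixel" concrete, here is a rough sketch of a pixel-by-pixel comparison between a reference rendering and a test rendering. It is not the reftest harness: the 800x1000 canvas size, the pixel coordinates, and the helper names are all assumptions chosen for illustration.

```python
WHITE = (255, 255, 255)
BLACK = (0, 0, 0)


def blank_canvas(width, height, color=WHITE):
    """Build a height x width grid of identical RGB tuples."""
    return [[color] * width for _ in range(height)]


def differing_pixels(expected, actual):
    """Return (x, y) coordinates where the two images disagree."""
    if len(expected) != len(actual):
        raise ValueError("images differ in height")
    diffs = []
    for y, (row_e, row_a) in enumerate(zip(expected, actual)):
        if len(row_e) != len(row_a):
            raise ValueError(f"row {y} differs in width")
        for x, (pixel_e, pixel_a) in enumerate(zip(row_e, row_a)):
            if pixel_e != pixel_a:
                diffs.append((x, y))
    return diffs


# Assumed canvas size; one stray pixel near the lower right corner.
reference = blank_canvas(800, 1000)
rendered = blank_canvas(800, 1000)
rendered[990][790] = BLACK

print(differing_pixels(reference, rendered))
```

One wrong pixel is enough to fail an exact comparison, which is why a single corrupted value can turn hundreds of otherwise unrelated tests orange at once.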
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
...not sure what the next step here is, coop? 

Disabled in slavealloc
Flags: needinfo?(coop)
Depends on: 839921
Next step is to file a repair-or-decommission bug against IT. Replacing memory is (relatively) cheap. Other options get more interesting.
Flags: needinfo?(coop)
https://tbpl.mozilla.org/php/getParsedLog.php?id=20169126&tree=Mozilla-Inbound

DIDiskImageNewWithBackingStore: instantiator returned 0
Verifying…
Checksumming Driver Descriptor Map (DDM : 0)…
     Driver Descriptor Map (DDM : 0): verified   CRC32 $0EBD5181
Checksumming Apple (Apple_partition_map : 1)…
     Apple (Apple_partition_map : 1): verified   CRC32 $62B9DF30
Checksumming DiscRecording 6.0d1 (Apple_HFS : 2)…
__decompressChunk: returning 1000
CUDIFDiskImage::readSectors: returning 1000
 DiscRecording 6.0d1 (Apple_HFS : 2): checksum failed with error 1000.
CUDIFDiskImage: error 1000 calculating checksum
Verification completed…
Error 1000 (image data corrupted).
calculated CRC32 $0187B220, expected   CRC32 $2A030FB6
Finishing…
DIHLDiskImageAttach() returned 1000
hdiutil: attach failed - image data corrupted
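
The "calculated CRC32 ... expected CRC32 ..." line is a checksum mismatch: the image bytes that were read did not hash to the value recorded in the image. As a hedged illustration only, the sketch below uses Python's `zlib.crc32` on made-up data. Whether hdiutil uses the same CRC-32 variant is an assumption, and `format_crc32` and `flip_bit` are hypothetical helpers.

```python
import zlib


def format_crc32(data: bytes) -> str:
    """Format a CRC-32 the way the log above prints it, e.g. $0187B220."""
    return f"${zlib.crc32(data) & 0xFFFFFFFF:08X}"


def flip_bit(data: bytes, byte_index: int, bit: int = 0) -> bytes:
    """Return a copy of data with a single bit inverted (hypothetical helper)."""
    corrupted = bytearray(data)
    corrupted[byte_index] ^= 1 << bit
    return bytes(corrupted)


# Made-up stand-in for a partition's contents; not a real disk image.
image = bytes(range(256)) * 16
damaged = flip_bit(image, 1234, 3)

expected = format_crc32(image)
calculated = format_crc32(damaged)

print(f"calculated CRC32 {calculated}, expected CRC32 {expected}")
print("match" if calculated == expected else "image data corrupted")
```

A CRC-32 always changes when exactly one bit of the input changes, so a mismatch like the one logged here fits a bad download, a bad disk read, or bad memory equally well.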
Whiteboard: [badslave?][buildduty] → [buildduty][pink pixel of death]
So, this machine has been hitting pink pixel issues off and on since November. Coop, you requested that we try some memory replacement in bug 839921, but that was never done. Should we pursue that some more, decomm, or ...?
Flags: needinfo?(coop)
(In reply to Ben Hearsum [:bhearsum] from comment #21)
> So, this machine has been hitting pink pixel issues off and on since
> November. Coop, you requested that we try some memory replacement in bug
> 839921, but that was never done. Should we pursue that some more, decomm, or
> ...?

Memory is the cheaper option. Let's get new memory installed and re-image this machine.
Flags: needinfo?(coop)
Depends on: 856622
Disabled in slavealloc for bug 856622.
Depends on: ppod
re-enabling in slavealloc since the reimage was done.

I'm sending this out against the gods to see if it avoids the pink-pixel plague.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Churning through jobs with "hdiutil: attach failed - Device not configured" and "sudo: unknown uid: 501" instead of rebooting; this is typically fixed by forcing a reboot.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Forced a reboot via ssh + |sudo reboot|.
Been running jobs for a while.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Depends on: 903462
Product: mozilla.org → Release Engineering
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Depends on: 902970
Most extreme PPoD I've ever seen in https://tbpl.mozilla.org/php/getParsedLog.php?id=26746862&tree=Mozilla-Inbound

Disabled in slavealloc.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Whiteboard: [buildduty][pink pixel of death] → [buildduty][pink pixel of death][needs diagnostics bug filed]
I have no idea what to do with this slave anymore. RAM was replaced but it's still hitting PPOD. Kill it with fire?
Depends on: 927695
Back in production.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Still not working properly. The faulty RAM that was replaced may have caused some corruption on the filesystem. Let's give this machine another chance with a reimage.
Depends on: 938199
Back in production. Probably not for long.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard