Fix "Stray process with PGID equal to this dead job" on leopard talos systems

RESOLVED FIXED

Product: Infrastructure & Operations
Component: RelOps
(Reporter: dustin, Assigned: dustin)

(1 attachment)
This host failed to start the buildslave.

2011-06-09 00:22:07-0700 [-] Log opened.
2011-06-09 00:22:07-0700 [-] twistd 10.2.0 (/tools/buildbot-0.8.4-pre-moz1/bin/python 2.5.1) starting up.
2011-06-09 00:22:07-0700 [-] reactor class: twisted.internet.selectreactor.SelectReactor.
talos-r3-leopard-029:~ cltbld$ 

(yep, that's it ... no running twistd either)

system.log has:

Jun  9 00:22:07 talos-r3-leopard-029 com.apple.launchd[106] (org.mozilla.build.buildslave[242]): Stray process with PGID equal to this dead job: PID 244 PPID 1 python

I thought we fixed that??

Dmesg has:

hfs_relocate: diskimages-helper didn't move into MDZ (382 blks)
hfs_relocate: virtual.rb didn't move into MDZ (2 blks)
hfs_relocate: __init__.pyc didn't move into MDZ (2 blks)
hfs_relocate: ic.pyc didn't move into MDZ (8 blks)
hfs_relocate: sRGB Profile.icc didn't move into MDZ (2 blks)
hfs_relocate: grp.so didn't move into MDZ (28 blks)
hfs_relocate: zlib.so didn't move into MDZ (50 blks)
hfs_relocate: syslog.so didn't move into MDZ (28 blks)
hfs_relocate: randbytes.pyc didn't move into MDZ (4 blks)
hfs_relocate: _baseprocess.pyc didn't move into MDZ (2 blks)
hfs_relocate: provider_features.rb didn't move into MDZ (4 blks)
hfs_relocate: GridIcon.icns didn't move into MDZ (2 blks)
hfs_relocate: appdmg.rb didn't move into MDZ (4 blks)
hfs_relocate: copy_reg.pyc didn't move into MDZ (4 blks)
hfs_relocate: log.pyc didn't move into MDZ (2 blks)
hfs_relocate: heapq.pyc didn't move into MDZ (6 blks)
hfs_relocate: stat.pyc didn't move into MDZ (2 blks)
hfs_relocate: opcode.pyc didn't move into MDZ (4 blks)
hfs_relocate: address.pyc didn't move into MDZ (4 blks)
hfs_relocate: opcode.pyc didn't move into MDZ (4 blks)
hfs_relocate: crefutil.pyc didn't move into MDZ (6 blks)
hfs_relocate: termios.so didn't move into MDZ (40 blks)
hfs_relocate: authstore.rb didn't move into MDZ (6 blks)
hfs_relocate: spawn.pyc didn't move into MDZ (4 blks)
hfs_relocate: string_escape.pyc didn't move into MDZ (2 blks)
hfs_relocate: sob.pyc didn't move into MDZ (6 blks)
hfs_relocate: advice.pyc didn't move into MDZ (4 blks)
hfs_relocate: types.pyc didn't move into MDZ (2 blks)
hfs_relocate: dep_util.pyc didn't move into MDZ (2 blks)
hfs_relocate: ToDo_Chbx_Shadow.png didn't move into MDZ (2 blks)
hfs_relocate: styles.pyc didn't move into MDZ (6 blks)
hfs_relocate: ldap.rb didn't move into MDZ (2 blks)
hfs_relocate: gestalt.so didn't move into MDZ (16 blks)
hfs_relocate: __init__.py didn't move into MDZ (2 blks)
hfs_relocate: errors.pyc didn't move into MDZ (4 blks)
hfs_relocate: pbutil.pyc didn't move into MDZ (4 blks)
hfs_relocate: ignore.pyc didn't move into MDZ (2 blks)
hfs_relocate: lockfile.pyc didn't move into MDZ (4 blks)
hfs_relocate: InfoPlist.strings didn't move into MDZ (2 blks)
hfs_relocate: util.pyc didn't move into MDZ (2 blks)
hfs_relocate: apple.convs didn't move into MDZ (2 blks)
hfs_relocate: itertools.so didn't move into MDZ (68 blks)
hfs_relocate: _socket.so didn't move into MDZ (104 blks)
hfs_relocate: sre_constants.pyc didn't move into MDZ (4 blks)
hfs_relocate: mdiff.pyc didn't move into MDZ (6 blks)
hfs_relocate: ToDo_Chbx_Shape.png didn't move into MDZ (2 blks)
hfs_relocate: weakref.pyc didn't move into MDZ (8 blks)
hfs_relocate: fancy_getopt.pyc didn't move into MDZ (8 blks)
hfs_relocate: _twistd_unix.pyc didn't move into MDZ (8 blks)
hfs_relocate: portal.pyc didn't move into MDZ (4 blks)
hfs_relocate: re.pyc didn't move into MDZ (8 blks)
hfs_relocate: compat.pyc didn't move into MDZ (4 blks)
hfs_relocate: deprecate.pyc didn't move into MDZ (8 blks)
hfs_relocate: chkbxShape.png didn't move into MDZ (2 blks)
hfs_relocate: internet.pyc didn't move into MDZ (10 blks)
hfs_relocate: MacOS.so didn't move into MDZ (20 blks)
hfs_relocate: dis.pyc didn't move into MDZ (4 blks)
hfs_relocate: zipstream.pyc didn't move into MDZ (8 blks)
hfs_relocate: zipstream.pyc didn't move into MDZ (8 blks)
hfs_relocate: pipes.py didn't move into MDZ (6 blks)
hfs_relocate: posixpath.pyc didn't move into MDZ (8 blks)
hfs_relocate: posixpath.pyc didn't move into MDZ (8 blks)
hfs_relocate: Localized.rsrc didn't move into MDZ (12 blks)
hfs_relocate: changelog.pyc didn't move into MDZ (6 blks)
hfs_relocate: Info.plist didn't move into MDZ (2 blks)
hfs_relocate: reactors.pyc didn't move into MDZ (2 blks)
hfs_relocate: chkbxShadow.png didn't move into MDZ (2 blks)
hfs_relocate: win32.pyc didn't move into MDZ (4 blks)
hfs_relocate: objects.nib didn't move into MDZ (2 blks)
hfs_relocate: com.apple.TimeMachine.C928F2EC-068D-506C-8562-DF91B27546C8.plist didn't move into MDZ (2 blks)
hfs_relocate: GlobalCount.plist didn't move into MDZ (2 blks)

which makes me wonder if this machine has disk problems?  Google doesn't tell me much about MDZ...
Assignee: server-ops-releng → zandr
Sorry, I meant to assign this to myself when I filed it.
Assignee: zandr → dustin
OK, hardware seems fine, but I'd like to solve this once and for all:

Jun  9 00:22:07 talos-r3-leopard-029 com.apple.launchd[106] (org.mozilla.build.buildslave[242]): Stray process with PGID equal to this dead job: PID 244 PPID 1 python
Summary: talos-r3-leopard-029 looking sick? → Fix "Stray process with PGID equal to this dead job" on leopard talos systems
So for background: pgid is the "process group identifier".  A process group is created when a process sets its pgid to its own pid, making it the process group leader.  Any processes it spawns are then members of that process group (i.e., they share the same pgid) unless they change their pgid.
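To make those mechanics concrete, here's a minimal POSIX sketch (my own illustration, not from any patch here) of a forked child setting its pgid to its own pid and thereby becoming a process-group leader:

```python
import os

def child_becomes_leader():
    """Fork a child that sets its pgid to its own pid (becoming a
    process-group leader) and reports success via its exit status."""
    pid = os.fork()
    if pid == 0:
        # Child: before setpgid, we inherit the parent's process group.
        os.setpgid(0, 0)                   # pgid := own pid -> group leader
        ok = os.getpgid(0) == os.getpid()  # leader iff pgid == pid
        os._exit(0 if ok else 1)
    # Parent: reap the child so it doesn't linger as a zombie.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) == 0
```

Until that `setpgid` (or a `setsid`) call runs, the child still carries the parent's pgid -- which is exactly the window launchd is complaining about.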

A process with a pgid that corresponds to the pid of a process which launchd has just reaped is, arguably, a hanger-on that should be killed.  That's what this page is suggesting to me, at any rate:
  https://discussions.apple.com/thread/1571473?start=0&tstart=0

So, while twistd is daemonizing - after it has forked, but before it has set its pgid - launchd spots it and kills it.  Bad luck.  I'll keep reading, but if this is the root of the problem, then we can probably fix it with a time.sleep(..) in runslave.py.
http://lists.macosforge.org/pipermail/launchd-dev/2009-July/000592.html
  suggests writing a fresh new plist for each launched process - that doesn't seem right!
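As a rough sketch of the time.sleep(..) mitigation mentioned above (the wrapper name and delay value here are hypothetical, not taken from the actual runslave.py patch): delay the slave's startup so launchd finishes its bookkeeping for the dead job before the new twistd forks:

```python
import time

def delayed_start(start_fn, delay_seconds=5):
    """Hypothetical wrapper: wait out launchd's post-reap window before
    starting the buildslave, so the freshly forked (not-yet-daemonized)
    twistd child isn't flagged as a stray member of the dead job's
    process group.  The 5-second default is an assumption."""
    time.sleep(delay_seconds)
    return start_fn()
```

The bet is that the race only bites when the restart lands inside launchd's reaping of the previous job, so any modest delay should sidestep it.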
Created attachment 540638 [details] [diff] [review]
m664310-puppet-manifests-p1-r1.patch

I tried this fix out on talos-r3-leopard-010, and it properly slept for the required duration.  This particular form of error is so rare that I won't be able to verify this as a fix, but hopefully it won't do any harm.

I'll run this in dev/preprod after r+.
Attachment #540638 - Flags: review?(armenzg)

Comment 6

Comment on attachment 540638 [details] [diff] [review]
m664310-puppet-manifests-p1-r1.patch

It should do no harm.
Let's hope it takes care of it! Good finding!
Attachment #540638 - Flags: review?(armenzg) → review+
landed and deployed.
Status: NEW → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → FIXED
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations