Closed
Bug 664310
Opened 14 years ago
Closed 14 years ago
Fix "Stray process with PGID equal to this dead job" on leopard talos systems
Categories
(Infrastructure & Operations :: RelOps: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: dustin, Assigned: dustin)
Details
Attachments (1 file)
patch (1.25 KB) | armenzg: review+
This host failed to start the buildslave.
2011-06-09 00:22:07-0700 [-] Log opened.
2011-06-09 00:22:07-0700 [-] twistd 10.2.0 (/tools/buildbot-0.8.4-pre-moz1/bin/python 2.5.1) starting up.
2011-06-09 00:22:07-0700 [-] reactor class: twisted.internet.selectreactor.SelectReactor.
talos-r3-leopard-029:~ cltbld$
(yep, that's it .. no running twistd either)
system.log has:
Jun 9 00:22:07 talos-r3-leopard-029 com.apple.launchd[106] (org.mozilla.build.buildslave[242]): Stray process with PGID equal to this dead job: PID 244 PPID 1 python
I thought we fixed that??
Dmesg has:
hfs_relocate: diskimages-helper didn't move into MDZ (382 blks)
hfs_relocate: virtual.rb didn't move into MDZ (2 blks)
hfs_relocate: __init__.pyc didn't move into MDZ (2 blks)
hfs_relocate: ic.pyc didn't move into MDZ (8 blks)
hfs_relocate: sRGB Profile.icc didn't move into MDZ (2 blks)
hfs_relocate: grp.so didn't move into MDZ (28 blks)
hfs_relocate: zlib.so didn't move into MDZ (50 blks)
hfs_relocate: syslog.so didn't move into MDZ (28 blks)
hfs_relocate: randbytes.pyc didn't move into MDZ (4 blks)
hfs_relocate: _baseprocess.pyc didn't move into MDZ (2 blks)
hfs_relocate: provider_features.rb didn't move into MDZ (4 blks)
hfs_relocate: GridIcon.icns didn't move into MDZ (2 blks)
hfs_relocate: appdmg.rb didn't move into MDZ (4 blks)
hfs_relocate: copy_reg.pyc didn't move into MDZ (4 blks)
hfs_relocate: log.pyc didn't move into MDZ (2 blks)
hfs_relocate: heapq.pyc didn't move into MDZ (6 blks)
hfs_relocate: stat.pyc didn't move into MDZ (2 blks)
hfs_relocate: opcode.pyc didn't move into MDZ (4 blks)
hfs_relocate: address.pyc didn't move into MDZ (4 blks)
hfs_relocate: opcode.pyc didn't move into MDZ (4 blks)
hfs_relocate: crefutil.pyc didn't move into MDZ (6 blks)
hfs_relocate: termios.so didn't move into MDZ (40 blks)
hfs_relocate: authstore.rb didn't move into MDZ (6 blks)
hfs_relocate: spawn.pyc didn't move into MDZ (4 blks)
hfs_relocate: string_escape.pyc didn't move into MDZ (2 blks)
hfs_relocate: sob.pyc didn't move into MDZ (6 blks)
hfs_relocate: advice.pyc didn't move into MDZ (4 blks)
hfs_relocate: types.pyc didn't move into MDZ (2 blks)
hfs_relocate: dep_util.pyc didn't move into MDZ (2 blks)
hfs_relocate: ToDo_Chbx_Shadow.png didn't move into MDZ (2 blks)
hfs_relocate: styles.pyc didn't move into MDZ (6 blks)
hfs_relocate: ldap.rb didn't move into MDZ (2 blks)
hfs_relocate: gestalt.so didn't move into MDZ (16 blks)
hfs_relocate: __init__.py didn't move into MDZ (2 blks)
hfs_relocate: errors.pyc didn't move into MDZ (4 blks)
hfs_relocate: pbutil.pyc didn't move into MDZ (4 blks)
hfs_relocate: ignore.pyc didn't move into MDZ (2 blks)
hfs_relocate: lockfile.pyc didn't move into MDZ (4 blks)
hfs_relocate: InfoPlist.strings didn't move into MDZ (2 blks)
hfs_relocate: util.pyc didn't move into MDZ (2 blks)
hfs_relocate: apple.convs didn't move into MDZ (2 blks)
hfs_relocate: itertools.so didn't move into MDZ (68 blks)
hfs_relocate: _socket.so didn't move into MDZ (104 blks)
hfs_relocate: sre_constants.pyc didn't move into MDZ (4 blks)
hfs_relocate: mdiff.pyc didn't move into MDZ (6 blks)
hfs_relocate: ToDo_Chbx_Shape.png didn't move into MDZ (2 blks)
hfs_relocate: weakref.pyc didn't move into MDZ (8 blks)
hfs_relocate: fancy_getopt.pyc didn't move into MDZ (8 blks)
hfs_relocate: _twistd_unix.pyc didn't move into MDZ (8 blks)
hfs_relocate: portal.pyc didn't move into MDZ (4 blks)
hfs_relocate: re.pyc didn't move into MDZ (8 blks)
hfs_relocate: compat.pyc didn't move into MDZ (4 blks)
hfs_relocate: deprecate.pyc didn't move into MDZ (8 blks)
hfs_relocate: chkbxShape.png didn't move into MDZ (2 blks)
hfs_relocate: internet.pyc didn't move into MDZ (10 blks)
hfs_relocate: MacOS.so didn't move into MDZ (20 blks)
hfs_relocate: dis.pyc didn't move into MDZ (4 blks)
hfs_relocate: zipstream.pyc didn't move into MDZ (8 blks)
hfs_relocate: zipstream.pyc didn't move into MDZ (8 blks)
hfs_relocate: pipes.py didn't move into MDZ (6 blks)
hfs_relocate: posixpath.pyc didn't move into MDZ (8 blks)
hfs_relocate: posixpath.pyc didn't move into MDZ (8 blks)
hfs_relocate: Localized.rsrc didn't move into MDZ (12 blks)
hfs_relocate: changelog.pyc didn't move into MDZ (6 blks)
hfs_relocate: Info.plist didn't move into MDZ (2 blks)
hfs_relocate: reactors.pyc didn't move into MDZ (2 blks)
hfs_relocate: chkbxShadow.png didn't move into MDZ (2 blks)
hfs_relocate: win32.pyc didn't move into MDZ (4 blks)
hfs_relocate: objects.nib didn't move into MDZ (2 blks)
hfs_relocate: com.apple.TimeMachine.C928F2EC-068D-506C-8562-DF91B27546C8.plist didn't move into MDZ (2 blks)
hfs_relocate: GlobalCount.plist didn't move into MDZ (2 blks)
which makes me wonder whether this machine has disk problems. Google doesn't tell me much about MDZ.
Updated • 14 years ago
Assignee: server-ops-releng → zandr
Comment 1 • 14 years ago (Assignee)
Sorry, I meant to assign this to self when I filed it.
Assignee: zandr → dustin
Comment 2 • 14 years ago (Assignee)
OK, hardware seems fine, but I'd like to solve this once and for all:
Jun 9 00:22:07 talos-r3-leopard-029 com.apple.launchd[106] (org.mozilla.build.buildslave[242]): Stray process with PGID equal to this dead job: PID 244 PPID 1 python
Summary: talos-r3-leopard-029 looking sick? → Fix "Stray process with PGID equal to this dead job" on leopard talos systems
Comment 3 • 14 years ago (Assignee)
For background: pgid is the "process group identifier". A process group is created when a process sets its pgid to its own pid, making it the process group leader. Any processes it spawns are then members of that process group (they inherit the same pgid) unless they change their pgid.
A process with a pgid that corresponds to the pid of a process which launchd has just reaped is, arguably, a hanger-on that should be killed. That's what this page is suggesting to me, at any rate:
https://discussions.apple.com/thread/1571473?start=0&tstart=0
So, while twistd is daemonizing - after it has forked, but before it has set its pgid - launchd spots it and kills it. Bad luck. I'll keep reading, but if this is the root of the problem, then we can probably fix it with a time.sleep(..) in runslave.py.
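To make the window described above concrete, here is a minimal sketch (not taken from twistd or runslave.py) showing that a forked child stays in its parent's process group until it calls setsid. The function name pgid_before_and_after_setsid is hypothetical, and the example assumes a POSIX system and a modern Python rather than the Python 2.5.1 on these hosts.

```python
import os


def pgid_before_and_after_setsid():
    """Fork a child and report its pgid before and after os.setsid().

    Returns (parent_pgid, child_pid, child_pgid_before, child_pgid_after).
    """
    read_fd, write_fd = os.pipe()
    parent_pgid = os.getpgrp()
    pid = os.fork()
    if pid == 0:
        # Child: at this point it is still a member of the parent's
        # process group. This is the window in which a supervisor could
        # treat it as a stray of the parent's job.
        os.close(read_fd)
        before = os.getpgrp()
        os.setsid()  # become leader of a new session and process group
        after = os.getpgrp()
        report = "%d %d %d" % (os.getpid(), before, after)
        os.write(write_fd, report.encode())
        os.close(write_fd)
        os._exit(0)

    # Parent: read the child's report until EOF, then reap the child.
    os.close(write_fd)
    data = b""
    while True:
        chunk = os.read(read_fd, 1024)
        if not chunk:
            break
        data += chunk
    os.close(read_fd)
    os.waitpid(pid, 0)
    child_pid, before, after = (int(x) for x in data.decode().split())
    return parent_pgid, child_pid, before, after
```

Before setsid the child reports the parent's pgid; afterwards its pgid equals its own pid, so it no longer shares a process group with the process that spawned it.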
Comment 4 • 14 years ago (Assignee)
http://lists.macosforge.org/pipermail/launchd-dev/2009-July/000592.html
suggests writing a fresh new plist for each launched process - that doesn't seem right!
Comment 5 • 14 years ago (Assignee)
I tried this fix out on talos-r3-leopard-010, and it properly slept for the required duration. This particular form of error is so rare that I won't be able to verify this as a fix, but hopefully it won't do any harm.
I'll run this in dev/preprod after r+.
Attachment #540638 - Flags: review?(armenzg)
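The attached patch is not reproduced in this bug, so the following is only a sketch of the general idea of a time.sleep(..) in the launcher, under stated assumptions. It assumes the launchd-managed script starts the daemonizing command and then lingers briefly before exiting, so that the daemon has time to leave the job's process group. The name launch_and_linger and the default delay are hypothetical and are not taken from runslave.py.

```python
import subprocess
import time


def launch_and_linger(cmd, linger_seconds=10.0):
    """Run cmd to completion, then sleep before returning its exit status.

    Assumption: cmd daemonizes (forks and returns quickly). Sleeping here
    keeps the calling process, which the supervisor tracks as the job,
    alive while the daemonized child moves into its own process group.
    """
    returncode = subprocess.call(cmd)
    # Hypothetical delay; the duration used in the real patch is not
    # stated in this bug.
    time.sleep(linger_seconds)
    return returncode
```

A launcher would call this with its own command line in place of cmd. A fixed delay narrows the window rather than closing it, which fits the observation above that the fix cannot easily be verified.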
Comment 6 • 14 years ago
Comment on attachment 540638: m664310-puppet-manifests-p1-r1.patch
It should do no harm.
Let's hope it takes care of it! Good finding!
Attachment #540638 - Flags: review?(armenzg) → review+
Comment 7 • 14 years ago (Assignee)
Landed and deployed.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Updated • 12 years ago
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations