Closed
Bug 568035
Opened 14 years ago
Closed 13 years ago
Linux slaves complaining intermittently of "out of pty devices" ("Failure: buildbot.slave.commands.base.AbandonChain: -1" in build log) (/dev/ptmx permissions problem)
Categories
(Release Engineering :: General, defect, P3)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: lsblakk, Unassigned)
Details
(Whiteboard: [buildslaves][badslave?][hardware][buildduty])
Attachments
(2 files)
655 bytes, patch | bhearsum: feedback+
477 bytes, patch | bear: review+, catlee: checked-in+
This happened today to mv-moz2-linux-ix-slave03 where the slave gets stuck on set_basedir and doesn't reboot, but keeps picking up the builds (4x in a row) then failing again on set_basedir with the following on the slave's twistd.log:
2010-05-25 09:32:40-0700 [Broker,client] startCommand:shell [id 786277]
2010-05-25 09:32:40-0700 [Broker,client] ShellCommand._startCommand
2010-05-25 09:32:40-0700 [Broker,client] python tools/buildfarm/maintenance/count_and_reboot.py -f ../reboot_count.txt -n 1 -z
2010-05-25 09:32:40-0700 [Broker,client] in dir /builds/slave/mozilla-central-linux/. (timeout 1200 secs)
2010-05-25 09:32:40-0700 [Broker,client] watching logfiles {}
2010-05-25 09:32:40-0700 [Broker,client] argv: ['python', 'tools/buildfarm/maintenance/count_and_reboot.py', '-f', '../reboot_count.txt', '-n', '1', '-z']
2010-05-25 09:32:40-0700 [Broker,client] environment: {'SSH_ASKPASS': '/usr/libexec/openssh/gnome-ssh-askpass', 'LESSOPEN': '|/usr/bin/lesspipe.sh %s', 'CVS_RSH': 'ssh', 'LOGNAME': 'cltbld', 'USER': 'cltbld', 'INPUTRC': '/etc/inputrc', 'HOME': '/home/cltbld', 'PATH': '/opt/local/bin:/tools/python/bin:/tools/buildbot/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/home/cltbld/bin', 'HISTSIZE': '1000', 'LANG': 'en_US.UTF-8', 'TERM': 'linux', 'SHELL': '/bin/bash', 'SHLVL': '1', 'G_BROKEN_FILENAMES': '1', 'TBOX_CLIENT_CVS_DIR': '/builds/tinderbox/mozilla/tools', 'JAVA_HOME': '/builds/jdk', 'CC': '/tools/gcc/bin/gcc', '_': '/tools/buildbot/bin/buildbot', 'CXX': '/tools/gcc/bin/g++', 'HOSTNAME': 'mv-moz2-linux-ix-slave03.build.mozilla.org', 'PWD': '/builds/slave/mozilla-central-linux', 'MAIL': '/var/spool/mail/cltbld', 'LS_COLORS': 'no=00:fi=00:di=01;34:ln=01;36:pi=40;33:so=01;35:bd=40;33;01:cd=40;33;01:or=01;05;37;41:mi=01;05;37;41:ex=01;32:*.cmd=01;32:*.exe=01;32:*.com=01;32:*.btm=01;32:*.bat=01;32:*.sh=01;32:*.csh=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.gz=01;31:*.bz2=01;31:*.bz=01;31:*.tz=01;31:*.rpm=01;31:*.cpio=01;31:*.jpg=01;35:*.gif=01;35:*.bmp=01;35:*.xbm=01;35:*.xpm=01;35:*.png=01;35:*.tif=01;35:'}
2010-05-25 09:32:40-0700 [Broker,client] closing stdin
2010-05-25 09:32:40-0700 [Broker,client] using PTY: True
2010-05-25 09:32:40-0700 [Broker,client] error in ShellCommand._startCommand
2010-05-25 09:32:40-0700 [Broker,client] Unhandled Error
Traceback (most recent call last):
  File "/tools/buildbot-0.8.0pre/lib/python2.6/site-packages/buildbot-0.8.0rc3-py2.6.egg/buildbot/slave/bot.py", line 172, in remote_startCommand
    d = self.command.doStart()
  File "/tools/buildbot-0.8.0pre/lib/python2.6/site-packages/buildbot-0.8.0rc3-py2.6.egg/buildbot/slave/commands/base.py", line 880, in doStart
    d = defer.maybeDeferred(self.start)
  File "/tools/buildbot-0.8.0pre/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-linux-i686.egg/twisted/internet/defer.py", line 102, in maybeDeferred
    result = f(*args, **kw)
  File "/tools/buildbot-0.8.0pre/lib/python2.6/site-packages/buildbot-0.8.0rc3-py2.6.egg/buildbot/slave/commands/base.py", line 1000, in start
    d = self.command.start()
--- <exception caught here> ---
  File "/tools/buildbot-0.8.0pre/lib/python2.6/site-packages/buildbot-0.8.0rc3-py2.6.egg/buildbot/slave/commands/base.py", line 382, in start
    self._startCommand()
  File "/tools/buildbot-0.8.0pre/lib/python2.6/site-packages/buildbot-0.8.0rc3-py2.6.egg/buildbot/slave/commands/base.py", line 503, in _startCommand
    usePTY=self.usePTY)
  File "/tools/buildbot-0.8.0pre/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-linux-i686.egg/twisted/internet/posixbase.py", line 220, in spawnProcess
    processProtocol, uid, gid, usePTY)
  File "/tools/buildbot-0.8.0pre/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-linux-i686.egg/twisted/internet/process.py", line 809, in __init__
    masterfd, slavefd = pty.openpty()
  File "/tools/python-2.6.5/lib/python2.6/pty.py", line 29, in openpty
    master_fd, slave_name = _open_terminal()
  File "/tools/python-2.6.5/lib/python2.6/pty.py", line 70, in _open_terminal
    raise os.error, 'out of pty devices'
exceptions.OSError: out of pty devices
2010-05-25 09:32:40-0700 [Broker,client] SlaveBuilder.commandFailed <buildbot.slave.commands.base.SlaveShellCommand instance at 0x8e2638c>
2010-05-25 09:32:40-0700 [Broker,client] Unhandled Error
Traceback (most recent call last):
Failure: buildbot.slave.commands.base.AbandonChain: -1
Comment 1•14 years ago
FWIW, this slave had just cleanly rebooted, so this was the first thing it had tried to run via buildbot.
Comment 2•14 years ago
mv-moz2-linux-slave13 too.
Comment 3•14 years ago
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1274834527.1274834530.10528.gz Linux mozilla-central build on 2010/05/25 17:42:07
Comment 4•14 years ago
Just moved mv-moz2-linux-ix-slave08's buildbot.tac to .off.
Updated•14 years ago
Summary: Linux ix slave complaining of "out of pty devices" → Linux ix slave complaining of "out of pty devices" ("Failure: buildbot.slave.commands.base.AbandonChain: -1" in build log)
Comment 5•14 years ago
ls -l /dev/ptmx on various slaves is interesting:
mv-moz2-linux-ix-slave03 crw-rw-rw- 1 root tty 5, 2 May 25 18:46 /dev/ptmx
mv-moz2-linux-ix-slave13 crw------- 1 root root 5, 2 May 25 18:46 /dev/ptmx
mv-moz2-linux-ix-slave08 crw------- 1 root root 5, 2 May 25 18:52 /dev/ptmx
slave03 was failing, then got rebooted iirc, and then has had a few successful builds since. 13 and 08 have been failing and were disconnected. could scratchbox be doing something here?
Comment 6•14 years ago
Oh, and the permissions on /dev/ptmx cause this python code to fail:
>>> import pty
>>> pty.openpty()
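A minimal sketch of how that check could be scripted on a slave, assuming a modern Python 3 rather than the 2.5/2.6 interpreters used in this bug. The helper names `describe_device` and `can_open_pty` are hypothetical and are not part of buildbot or of any patch attached here; only `os.stat`, `stat.filemode` and `pty.openpty` are real standard-library calls.

```python
import os
import pty
import stat


def describe_device(path="/dev/ptmx"):
    """Return (mode_string, world_rw) for path.

    mode_string is the ls-style string, e.g. 'crw-rw-rw-' for a healthy
    /dev/ptmx. world_rw is True only if "other" can both read and write.
    """
    mode = os.stat(path).st_mode
    world_rw = bool(mode & stat.S_IROTH) and bool(mode & stat.S_IWOTH)
    return stat.filemode(mode), world_rw


def can_open_pty():
    """Try to allocate a pty pair, as the snippet above does, then close it."""
    try:
        master_fd, slave_fd = pty.openpty()
    except OSError:
        # An unprivileged user hits this when /dev/ptmx is not accessible.
        return False
    os.close(master_fd)
    os.close(slave_fd)
    return True
```

Run as the build user, `describe_device()` would presumably report `crw-------` with `world_rw` false on an affected slave, and `can_open_pty()` would return False there.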
Comment 7•14 years ago
I noticed this:
[root@staging-puppet dist]# tar tjvf scratchbox-2010-03-30-1129.tar.bz2 | grep ptmx
-rw-r--r-- root/root 3800 2002-10-04 14:18:21 scratchbox/devkits/doctools/share/texmf/tex/latex/psnfss/mathptmx.sty
-rw-r--r-- root/root 4631 2004-09-18 01:59:53 scratchbox/devkits/doctools/share/texmf-dist/tex/latex/psnfss/mathptmx.sty
crw-rw-rw- root/tty 5,2 2010-03-12 14:29:23 scratchbox/dev/ptmx
As far as *running* scratchbox -- are all the ix linux boxes involved all attached to pm02? (No mobile builders on pm01)
Comment 8•14 years ago
No, some are on pm01, 03 and 08 for example.
Comment 9•14 years ago
I am having the same problem with mv-moz2-linux-ix-slave02 which was on pm. I am going to reboot it and put it on staging (bug 571492).
[cltbld@mv-moz2-linux-ix-slave02 slave]$ ls -l /dev/ptmx
crw------- 1 root root 5, 2 Jun 11 07:20 /dev/ptmx
[cltbld@mv-moz2-linux-ix-slave02 slave]$ python
Python 2.5.1 (r251:54863, Jan 14 2010, 12:26:02)
[GCC 4.1.1 20070105 (Red Hat 4.1.1-52)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import pty
>>> pty.openpty()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/tools/python-2.5.1/lib/python2.5/pty.py", line 29, in openpty
    master_fd, slave_name = _open_terminal()
  File "/tools/python-2.5.1/lib/python2.5/pty.py", line 70, in _open_terminal
    raise os.error, 'out of pty devices'
OSError: out of pty devices
After a reboot I got:
[cltbld@mv-moz2-linux-ix-slave02 ~]$ ls -l /dev/ptmx
crw-rw-rw- 1 root tty 5, 2 Jun 11 07:28 /dev/ptmx
[cltbld@mv-moz2-linux-ix-slave02 ~]$ python
Python 2.5.1 (r251:54863, Jan 14 2010, 12:26:02)
[GCC 4.1.1 20070105 (Red Hat 4.1.1-52)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import pty
>>> pty.openpty()
(3, 4)
As we can tell the group is now "tty" instead of "root" which matches what I see on slave 21:
[cltbld@mv-moz2-linux-ix-slave21 ~]$ ls -l /dev/ptmx
crw-rw-rw- 1 root tty 5, 2 Jun 11 07:25 /dev/ptmx
Depends on: 571492
Comment 10•14 years ago
I had the same problem with slave mv-moz2-linux-ix-slave18. 4 builds starting at Jun 11 10:00. /dev/ptmx's group was root. I rebooted it. http://production-master02.build.mozilla.org:8010/buildslaves/mv-moz2-linux-ix-slave18
Comment 11•14 years ago
Hit this on pm03:mv-moz2-linux-ix-slave21 over in bug 579622. I suspect we didn't reboot after a crashtest leaving the machine in a bad state, but don't have data to back that up. I've left it untouched (except for stopping buildbot) for further debugging.
Comment 12•14 years ago
See comment 5, and comment 6. What are the permissions on /dev/ptmx?
Comment 13•14 years ago
[cltbld@mv-moz2-linux-ix-slave21 ~]$ ls -l /dev/ptmx
crw------- 1 root root 5, 2 Jul 17 04:53 /dev/ptmx
The code snippet in comment #6 fails with 'OSError: out of pty devices'.
Comment 14•14 years ago
mv-moz2-linux-ix-slave12 on pm01.
(In reply to comment #5)
> could scratchbox be doing something here?
It started happening after a mobile build:
Jul 26 04:02 exception Linux electrolysis nightly "Exception set_basedir maybe_rebooting"
Jul 26 03:32 success Maemo 5 QT tracemonkey nightly Build successful slave lost
[cltbld@mv-moz2-linux-ix-slave12 ~]$ uptime
07:40:21 up 3:42, 1 user, load average: 0.00, 0.00, 0.00
This means that the last time it rebooted was ~3:58, just after the mobile build. That suggests the permissions on /dev/ptmx were changed during the Maemo build. If this happens again, could we check what the last known good build was? I believe catlee's suspicions are on the right track. Do you think a puppet change ensuring the permissions and the group ownership would fix this?
(In reply to comment #2)
> mv-moz2-linux-slave13 too.
Not only happening on IX machines.
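The reboot time quoted in this comment can be checked with a short calculation. This is only an illustrative sketch: the year 2010 is an assumption (the logs give only "Jul 26"), and `boot_time` is a hypothetical helper name.

```python
from datetime import datetime, timedelta


def boot_time(now, uptime_hours, uptime_minutes):
    """Subtract the uptime reported by `uptime` from the current wall-clock time."""
    return now - timedelta(hours=uptime_hours, minutes=uptime_minutes)


# `uptime` printed "07:40:21 up 3:42"; the year is assumed, not taken from the log.
now = datetime(2010, 7, 26, 7, 40, 21)
booted = boot_time(now, 3, 42)

maemo_build = datetime(2010, 7, 26, 3, 32)    # last successful (Maemo) build
failed_build = datetime(2010, 7, 26, 4, 2)    # first set_basedir exception

print(booted.strftime("%H:%M"))               # 03:58
print(maemo_build < booted < failed_build)    # True
```

The computed boot time falls between the successful Maemo build at 03:32 and the first failing build at 04:02, which is consistent with the timeline described above.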
Summary: Linux ix slave complaining of "out of pty devices" ("Failure: buildbot.slave.commands.base.AbandonChain: -1" in build log) → Linux slaves complaining intermittently of "out of pty devices" ("Failure: buildbot.slave.commands.base.AbandonChain: -1" in build log)
Comment 15•14 years ago
(In reply to comment #14)
> Do you think a puppet change ensuring the permissions and the group ownership
> would fix this?
Yeah, I bet ensuring it via Puppet would fix it, since Puppet is guaranteed to run prior to Buildbot on these machines.
Updated•14 years ago
Whiteboard: [buildslaves][badslave?][hardware] → [buildslaves][badslave?][hardware][buidduty]
Comment 16•14 years ago
mv-moz2-linux-ix-slave02 failed with 'OSError: out of pty devices'. buildbot stopped.
$ ls -l /dev/ptmx
crw------- 1 root root 5, 2 Sep 8 04:24 /dev/ptmx
Comment 17•14 years ago
This has happened for 23 builds on ix on mv-moz2-linux-ix-slave22:
error in ShellCommand._startCommand
Traceback (most recent call last):
  File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/buildbot-0.8.0-py2.6.egg/buildbot/slave/commands/base.py", line 400, in start
    self._startCommand()
  File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/buildbot-0.8.0-py2.6.egg/buildbot/slave/commands/base.py", line 528, in _startCommand
    usePTY=self.usePTY)
  File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-linux-i686.egg/twisted/internet/posixbase.py", line 220, in spawnProcess
    processProtocol, uid, gid, usePTY)
  File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-linux-i686.egg/twisted/internet/process.py", line 809, in __init__
    masterfd, slavefd = pty.openpty()
  File "/tools/python-2.6.5/lib/python2.6/pty.py", line 29, in openpty
    master_fd, slave_name = _open_terminal()
  File "/tools/python-2.6.5/lib/python2.6/pty.py", line 70, in _open_terminal
    raise os.error, 'out of pty devices'
OSError: out of pty devices
Comment 18•14 years ago
This has killed around 30 jobs today alone. With more and more IX boxes coming online, we should figure this out soon.
Severity: normal → major
Priority: P5 → P3
Comment 19•14 years ago
mv-moz2-linux-ix-slave23 started doing this on Sep 11. Rebooted.
Comment 20•14 years ago
First step is to make sure /dev/ptmx is rw-rw-rw- and owned by root:tty. Next step will be to add logging to find out where it gets changed.
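As a rough illustration of those two steps (enforce the mode, and log when it had drifted), here is a minimal Python sketch. It is not the Puppet manifest change attached to this bug, whose contents are not shown here; `ensure_mode` and the logger name `ptmx_check` are hypothetical. Ownership (root:tty) is deliberately left out because changing it requires root.

```python
import logging
import os
import stat

log = logging.getLogger("ptmx_check")  # hypothetical logger name


def ensure_mode(path="/dev/ptmx", wanted=0o666):
    """Reset the permission bits on path if they differ from wanted.

    Returns True if a change was made, so callers can tell when the
    file had drifted. The old mode is logged to help find the culprit.
    """
    current = stat.S_IMODE(os.stat(path).st_mode)
    if current == wanted:
        return False
    log.warning("%s had mode %o, resetting to %o", path, current, wanted)
    os.chmod(path, wanted)
    return True
```

Running something like this as root before buildbot starts would approximate the "ensure it before every build" idea discussed above; the warning's timestamp could then be correlated with the preceding build.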
Attachment #474704 -
Flags: feedback?(bhearsum)
Comment 21•14 years ago
Comment on attachment 474704: adjust centos puppet manifest to ensure /dev/ptmx has proper privs and owner
Looks fine to me.
Attachment #474704 -
Flags: feedback?(bhearsum) → feedback+
Updated•14 years ago
Summary: Linux slaves complaining intermittently of "out of pty devices" ("Failure: buildbot.slave.commands.base.AbandonChain: -1" in build log) → Linux slaves complaining intermittently of "out of pty devices" ("Failure: buildbot.slave.commands.base.AbandonChain: -1" in build log) (/dev/ptmx permissions problem)
Updated•14 years ago
Whiteboard: [buildslaves][badslave?][hardware][buidduty] → [buildslaves][badslave?][hardware][buildduty]
Comment 22•14 years ago
[cltbld@linux-ix-slave37 ~]$ ls -l /dev/ptmx
crw------- 1 root root 5, 2 Oct 15 05:26 /dev/ptmx
I pulled the slave out of production.
Comment 23•14 years ago
Reboot recovered linux-ix-slave37, back to the pool.
Reporter
Comment 24•14 years ago
Pulled linux-ix-slave30 out of production today when it complained of this and burned 2 mozilla-central leak test builds and a Maemo QT tracemonkey build. Removed slave, put buildbot.tac to .off, clobbered all the talos-slave/ build dirs, moved buildbot.tac back, rebooted.
Comment 25•14 years ago
linux-ix-slave07 got rebooted to fix this error.
Updated•14 years ago
Attachment #486274 -
Flags: review?(bear) → review+
Comment 27•14 years ago
Should happen while somebody can pay close attention to the results.
Flags: needs-reconfig?
Comment 28•14 years ago
Comment on attachment 486274: adjust ix puppet manifest to ensure /dev/ptmx has proper privs and owner.
changeset: 239:a5b01d972112
Attachment #486274 -
Flags: checked-in+
Updated•14 years ago
Flags: needs-reconfig? → needs-reconfig+
Comment 30•14 years ago
Seems that this can now be closed?
Comment 31•13 years ago
Fix landed in November; reopen or file new if this occurs again.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Assignee
Updated•11 years ago
Product: mozilla.org → Release Engineering