Closed Bug 568035 Opened 14 years ago Closed 13 years ago

Linux slaves complaining intermittently of "out of pty devices" ("Failure: buildbot.slave.commands.base.AbandonChain: -1" in build log) (/dev/ptmx permissions problem)

Categories

(Release Engineering :: General, defect, P3)

x86
Linux

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: lsblakk, Unassigned)

Details

(Whiteboard: [buildslaves][badslave?][hardware][buildduty])

Attachments

(2 files)

This happened today on mv-moz2-linux-ix-slave03: the slave gets stuck on set_basedir and doesn't reboot, but keeps picking up builds (4 in a row) and fails on set_basedir each time, with the following in the slave's twistd.log:

	
2010-05-25 09:32:40-0700 [Broker,client]  startCommand:shell [id 786277]
2010-05-25 09:32:40-0700 [Broker,client] ShellCommand._startCommand
2010-05-25 09:32:40-0700 [Broker,client]  python tools/buildfarm/maintenance/count_and_reboot.py -f ../reboot_count.txt -n 1 -z
2010-05-25 09:32:40-0700 [Broker,client]   in dir /builds/slave/mozilla-central-linux/. (timeout 1200 secs)
2010-05-25 09:32:40-0700 [Broker,client]   watching logfiles {}
2010-05-25 09:32:40-0700 [Broker,client]   argv: ['python', 'tools/buildfarm/maintenance/count_and_reboot.py', '-f', '../reboot_count.txt', '-n', '1', '-z']
2010-05-25 09:32:40-0700 [Broker,client]  environment: {'SSH_ASKPASS': '/usr/libexec/openssh/gnome-ssh-askpass', 'LESSOPEN': '|/usr/bin/lesspipe.sh %s', 'CVS_RSH': 'ssh', 'LOGNAME': 'cltbld', 'USER': 'cltbld', 'INPUTRC': '/etc/inputrc', 'HOME': '/home/cltbld', 'PATH': '/opt/local/bin:/tools/python/bin:/tools/buildbot/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/home/cltbld/bin', 'HISTSIZE': '1000', 'LANG': 'en_US.UTF-8', 'TERM': 'linux', 'SHELL': '/bin/bash', 'SHLVL': '1', 'G_BROKEN_FILENAMES': '1', 'TBOX_CLIENT_CVS_DIR': '/builds/tinderbox/mozilla/tools', 'JAVA_HOME': '/builds/jdk', 'CC': '/tools/gcc/bin/gcc', '_': '/tools/buildbot/bin/buildbot', 'CXX': '/tools/gcc/bin/g++', 'HOSTNAME': 'mv-moz2-linux-ix-slave03.build.mozilla.org', 'PWD': '/builds/slave/mozilla-central-linux', 'MAIL': '/var/spool/mail/cltbld', 'LS_COLORS': 'no=00:fi=00:di=01;34:ln=01;36:pi=40;33:so=01;35:bd=40;33;01:cd=40;33;01:or=01;05;37;41:mi=01;05;37;41:ex=01;32:*.cmd=01;32:*.exe=01;32:*.com=01;32:*.btm=01;32:*.bat=01;32:*.sh=01;32:*.csh=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.gz=01;31:*.bz2=01;31:*.bz=01;31:*.tz=01;31:*.rpm=01;31:*.cpio=01;31:*.jpg=01;35:*.gif=01;35:*.bmp=01;35:*.xbm=01;35:*.xpm=01;35:*.png=01;35:*.tif=01;35:'}
2010-05-25 09:32:40-0700 [Broker,client]   closing stdin
2010-05-25 09:32:40-0700 [Broker,client]   using PTY: True
2010-05-25 09:32:40-0700 [Broker,client] error in ShellCommand._startCommand
2010-05-25 09:32:40-0700 [Broker,client] Unhandled Error
	Traceback (most recent call last):
	  File "/tools/buildbot-0.8.0pre/lib/python2.6/site-packages/buildbot-0.8.0rc3-py2.6.egg/buildbot/slave/bot.py", line 172, in remote_startCommand
	    d = self.command.doStart()
	  File "/tools/buildbot-0.8.0pre/lib/python2.6/site-packages/buildbot-0.8.0rc3-py2.6.egg/buildbot/slave/commands/base.py", line 880, in doStart
	    d = defer.maybeDeferred(self.start)
	  File "/tools/buildbot-0.8.0pre/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-linux-i686.egg/twisted/internet/defer.py", line 102, in maybeDeferred
	    result = f(*args, **kw)
	  File "/tools/buildbot-0.8.0pre/lib/python2.6/site-packages/buildbot-0.8.0rc3-py2.6.egg/buildbot/slave/commands/base.py", line 1000, in start
	    d = self.command.start()
	--- <exception caught here> ---
	  File "/tools/buildbot-0.8.0pre/lib/python2.6/site-packages/buildbot-0.8.0rc3-py2.6.egg/buildbot/slave/commands/base.py", line 382, in start
	    self._startCommand()
	  File "/tools/buildbot-0.8.0pre/lib/python2.6/site-packages/buildbot-0.8.0rc3-py2.6.egg/buildbot/slave/commands/base.py", line 503, in _startCommand
	    usePTY=self.usePTY)
	  File "/tools/buildbot-0.8.0pre/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-linux-i686.egg/twisted/internet/posixbase.py", line 220, in spawnProcess
	    processProtocol, uid, gid, usePTY)
	  File "/tools/buildbot-0.8.0pre/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-linux-i686.egg/twisted/internet/process.py", line 809, in __init__
	    masterfd, slavefd = pty.openpty()
	  File "/tools/python-2.6.5/lib/python2.6/pty.py", line 29, in openpty
	    master_fd, slave_name = _open_terminal()
	  File "/tools/python-2.6.5/lib/python2.6/pty.py", line 70, in _open_terminal
	    raise os.error, 'out of pty devices'
	exceptions.OSError: out of pty devices
	
2010-05-25 09:32:40-0700 [Broker,client] SlaveBuilder.commandFailed <buildbot.slave.commands.base.SlaveShellCommand instance at 0x8e2638c>
2010-05-25 09:32:40-0700 [Broker,client] Unhandled Error
	Traceback (most recent call last):
	Failure: buildbot.slave.commands.base.AbandonChain: -1
FWIW, this slave had just cleanly rebooted, so this was the first thing it had tried to run via buildbot.
mv-moz2-linux-slave13 too.
Just moved mv-moz2-linux-ix-slave08's buildbot.tac to .off.
Summary: Linux ix slave complaining of "out of pty devices" → Linux ix slave complaining of "out of pty devices" ("Failure: buildbot.slave.commands.base.AbandonChain: -1" in build log)
ls -l /dev/ptmx on various slaves is interesting:

mv-moz2-linux-ix-slave03  crw-rw-rw- 1 root tty  5, 2 May 25 18:46 /dev/ptmx
mv-moz2-linux-ix-slave13  crw------- 1 root root 5, 2 May 25 18:46 /dev/ptmx
mv-moz2-linux-ix-slave08  crw------- 1 root root 5, 2 May 25 18:52 /dev/ptmx

slave03 was failing, then got rebooted iirc, and then has had a few successful builds since. 13 and 08 have been failing and were disconnected.

could scratchbox be doing something here?
Oh, and the permissions on /dev/ptmx cause this python code to fail:

>>> import pty
>>> pty.openpty()
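For anyone triaging a slave in this state, here is a minimal sketch (Python 3, not part of any patch on this bug) that checks the device node directly instead of waiting for pty.openpty() to fail. As far as I can tell from pty.py in these Python 2.x versions, openpty() falls back to scanning the legacy /dev/pty?? devices when os.openpty() fails, which would explain why a permissions problem surfaces as "out of pty devices". The helper names ptmx_problems and check_ptmx are hypothetical; the expected mode 0666 and group "tty" are taken from the listings of healthy slaves in this bug.

```python
import grp
import os
import stat


def ptmx_problems(st_mode, uid, gid, expected_gid):
    """Return a list of problems for a /dev/ptmx-like node (empty if healthy)."""
    problems = []
    if not stat.S_ISCHR(st_mode):
        problems.append("not a character device")
    perms = stat.S_IMODE(st_mode)
    if perms != 0o666:
        problems.append("mode is %04o, expected 0666" % perms)
    if uid != 0:
        problems.append("owner uid is %d, expected 0" % uid)
    if gid != expected_gid:
        problems.append("group id is %d, expected %d" % (gid, expected_gid))
    return problems


def check_ptmx(path="/dev/ptmx", group="tty"):
    """Stat the real device node and compare it against the healthy state."""
    st = os.stat(path)
    expected_gid = grp.getgrnam(group).gr_gid  # assumes a "tty" group exists
    return ptmx_problems(st.st_mode, st.st_uid, st.st_gid, expected_gid)
```

Calling check_ptmx() on a broken slave should return a non-empty list, and an empty list after a reboot restores crw-rw-rw- root tty.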
I noticed this:

[root@staging-puppet dist]# tar tjvf scratchbox-2010-03-30-1129.tar.bz2 | grep ptmx
-rw-r--r-- root/root      3800 2002-10-04 14:18:21 scratchbox/devkits/doctools/share/texmf/tex/latex/psnfss/mathptmx.sty
-rw-r--r-- root/root      4631 2004-09-18 01:59:53 scratchbox/devkits/doctools/share/texmf-dist/tex/latex/psnfss/mathptmx.sty
crw-rw-rw- root/tty        5,2 2010-03-12 14:29:23 scratchbox/dev/ptmx

As far as *running* scratchbox goes: are the ix linux boxes involved all attached to pm02? (No mobile builders on pm01)
No, some are on pm01, 03 and 08 for example.
I am having the same problem with mv-moz2-linux-ix-slave02 which was on pm.
I am going to reboot it and put it on staging (bug 571492).

[cltbld@mv-moz2-linux-ix-slave02 slave]$ ls -l /dev/ptmx
crw------- 1 root root 5, 2 Jun 11 07:20 /dev/ptmx
[cltbld@mv-moz2-linux-ix-slave02 slave]$ python
Python 2.5.1 (r251:54863, Jan 14 2010, 12:26:02) 
[GCC 4.1.1 20070105 (Red Hat 4.1.1-52)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import pty
>>> pty.openpty()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/tools/python-2.5.1/lib/python2.5/pty.py", line 29, in openpty
    master_fd, slave_name = _open_terminal()
  File "/tools/python-2.5.1/lib/python2.5/pty.py", line 70, in _open_terminal
    raise os.error, 'out of pty devices'
OSError: out of pty devices

After a reboot I got:
[cltbld@mv-moz2-linux-ix-slave02 ~]$ ls -l /dev/ptmx
crw-rw-rw- 1 root tty 5, 2 Jun 11 07:28 /dev/ptmx
[cltbld@mv-moz2-linux-ix-slave02 ~]$ python
Python 2.5.1 (r251:54863, Jan 14 2010, 12:26:02) 
[GCC 4.1.1 20070105 (Red Hat 4.1.1-52)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import pty
>>> pty.openpty()
(3, 4)

As we can tell, the group is now "tty" instead of "root" and the mode is crw-rw-rw- instead of crw-------, which matches what I see on slave21:
[cltbld@mv-moz2-linux-ix-slave21 ~]$ ls -l /dev/ptmx
crw-rw-rw- 1 root tty 5, 2 Jun 11 07:25 /dev/ptmx
Depends on: 571492
No longer depends on: 571492
I had the same problem with slave mv-moz2-linux-ix-slave18. 4 builds starting at Jun 11 10:00. /dev/ptmx's group was root. I rebooted it.

http://production-master02.build.mozilla.org:8010/buildslaves/mv-moz2-linux-ix-slave18
Hit this on pm03:mv-moz2-linux-ix-slave21 over in bug 579622. I suspect we didn't reboot after a crashtest, leaving the machine in a bad state, but I don't have data to back that up. I've left it untouched (except for stopping buildbot) for further debugging.
See comment 5, and comment 6.  What are the permissions on /dev/ptmx?
[cltbld@mv-moz2-linux-ix-slave21 ~]$ ls -l /dev/ptmx
crw------- 1 root root 5, 2 Jul 17 04:53 /dev/ptmx

The code snippet in comment #6 fails with 'OSError: out of pty devices'.
mv-moz2-linux-ix-slave12 on pm01.

(In reply to comment #5)
> could scratchbox be doing something here?
It started happening after a mobile build:
Jul 26 04:02 exception Linux electrolysis nightly "Exception set_basedir maybe_rebooting"
Jul 26 03:32 success Maemo 5 QT tracemonkey nightly Build successful slave lost

[cltbld@mv-moz2-linux-ix-slave12 ~]$ uptime
 07:40:21 up  3:42,  1 user,  load average: 0.00, 0.00, 0.00
This means the last reboot was at ~3:58, just after the mobile build, which suggests that /dev/ptmx got its permissions changed during the Maemo build.
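The uptime arithmetic above can be double-checked with a short sketch. The function name boot_time is hypothetical; the inputs are the values from the uptime output and the two build timestamps quoted in this comment. Note that uptime only reports minutes, so the result is approximate.

```python
from datetime import datetime, timedelta


def boot_time(now_hms, up_hours, up_minutes):
    """Subtract the reported uptime from the current wall-clock time."""
    now = datetime.strptime(now_hms, "%H:%M:%S")
    return now - timedelta(hours=up_hours, minutes=up_minutes)


booted = boot_time("07:40:21", 3, 42)
print(booted.strftime("%H:%M"))

# The reboot falls between the Maemo build (03:32) and the failed build (04:02).
maemo = datetime.strptime("03:32", "%H:%M")
failed = datetime.strptime("04:02", "%H:%M")
print(maemo < booted < failed)
```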

If this happens again, could we check what the last known good build was? I believe catlee's suspicions are on the right track.

Do you think a puppet change ensuring the permissions and the group ownership would fix this?

(In reply to comment #2)
> mv-moz2-linux-slave13 too.
Not only happening on IX machines.
Summary: Linux ix slave complaining of "out of pty devices" ("Failure: buildbot.slave.commands.base.AbandonChain: -1" in build log) → Linux slaves complaining intermittently of "out of pty devices" ("Failure: buildbot.slave.commands.base.AbandonChain: -1" in build log)
(In reply to comment #14)
> Do you think a puppet change ensuring the permissions and the group ownership
> would fix this?

Yeah, I bet ensuring it via Puppet would fix it, since Puppet is guaranteed to run prior to Buildbot on these machines.
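To illustrate what "ensuring the permissions and the group ownership" amounts to, here is a rough Python sketch of the same idempotent check-and-fix logic. This is not the Puppet manifest change itself; the function name ensure_perms is hypothetical, and changing ownership of the real /dev/ptmx would require root.

```python
import os
import stat


def ensure_perms(path, mode=0o666, uid=None, gid=None):
    """Bring path to the wanted mode/ownership; return True if anything changed.

    uid/gid of None mean "leave as is".
    """
    st = os.stat(path)
    changed = False
    if stat.S_IMODE(st.st_mode) != mode:
        os.chmod(path, mode)
        changed = True
    want_uid = st.st_uid if uid is None else uid
    want_gid = st.st_gid if gid is None else gid
    if (st.st_uid, st.st_gid) != (want_uid, want_gid):
        os.chown(path, want_uid, want_gid)
        changed = True
    return changed
```

Run as root at boot, something like ensure_perms("/dev/ptmx", 0o666, uid=0, gid=<gid of tty>) would do nothing on a healthy slave and repair a broken one. It would not explain how the node got into the bad state.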
Whiteboard: [buildslaves][badslave?][hardware] → [buildslaves][badslave?][hardware][buidduty]
mv-moz2-linux-ix-slave02 failed with 'OSError: out of pty devices'. buildbot stopped.

$ ls -l /dev/ptmx
crw------- 1 root root 5, 2 Sep  8 04:24 /dev/ptmx
This has happened for 23 builds on mv-moz2-linux-ix-slave22.

error in ShellCommand._startCommand
Traceback (most recent call last):
  File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/buildbot-0.8.0-py2.6.egg/buildbot/slave/commands/base.py", line 400, in start
    self._startCommand()
  File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/buildbot-0.8.0-py2.6.egg/buildbot/slave/commands/base.py", line 528, in _startCommand
    usePTY=self.usePTY)
  File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-linux-i686.egg/twisted/internet/posixbase.py", line 220, in spawnProcess
    processProtocol, uid, gid, usePTY)
  File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-linux-i686.egg/twisted/internet/process.py", line 809, in __init__
    masterfd, slavefd = pty.openpty()
  File "/tools/python-2.6.5/lib/python2.6/pty.py", line 29, in openpty
    master_fd, slave_name = _open_terminal()
  File "/tools/python-2.6.5/lib/python2.6/pty.py", line 70, in _open_terminal
    raise os.error, 'out of pty devices'
OSError: out of pty devices
This has killed around 30 jobs today alone.  With more and more IX boxes coming online, we should figure this out soon.
Severity: normal → major
Priority: P5 → P3
mv-moz2-linux-ix-slave23 started doing this on Sep 11. Rebooted.
First step is to make sure /dev/ptmx is rw-rw-rw- (0666) and owned by root:tty.

Next step will be to add logging to find out where it gets changed.
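One possible shape for that logging, as a sketch only: poll the node and log whenever its mode or ownership differs from the previous poll. The names snapshot, changes and watch are hypothetical, and this is not what was attached to this bug. Polling only narrows down when the change happened, which can then be matched against the build history; it does not identify the process responsible.

```python
import logging
import os
import stat
import time

log = logging.getLogger("ptmx-watch")


def snapshot(path):
    """Record the permission bits and ownership of path."""
    st = os.stat(path)
    return {"mode": stat.S_IMODE(st.st_mode), "uid": st.st_uid, "gid": st.st_gid}


def changes(before, after):
    """Describe every field that differs between two snapshots."""
    out = []
    for key in ("mode", "uid", "gid"):
        if before[key] != after[key]:
            fmt = "%s: %04o -> %04o" if key == "mode" else "%s: %d -> %d"
            out.append(fmt % (key, before[key], after[key]))
    return out


def watch(path="/dev/ptmx", interval=5.0, polls=None):
    """Poll path every interval seconds; polls=None means run until killed."""
    seen = snapshot(path)
    count = 0
    while polls is None or count < polls:
        time.sleep(interval)
        now = snapshot(path)
        for line in changes(seen, now):
            log.warning("%s changed, %s", path, line)
        seen = now
        count += 1
```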
Attachment #474704 - Flags: feedback?(bhearsum)
Comment on attachment 474704 [details] [diff] [review]
adjust centos puppet manifest to ensure /dev/ptmx has proper privs and owner

Looks fine to me.
Attachment #474704 - Flags: feedback?(bhearsum) → feedback+
Summary: Linux slaves complaining intermittently of "out of pty devices" ("Failure: buildbot.slave.commands.base.AbandonChain: -1" in build log) → Linux slaves complaining intermittently of "out of pty devices" ("Failure: buildbot.slave.commands.base.AbandonChain: -1" in build log) (/dev/ptmx permissions problem)
Whiteboard: [buildslaves][badslave?][hardware][buidduty] → [buildslaves][badslave?][hardware][buildduty]
[cltbld@linux-ix-slave37 ~]$ ls -l /dev/ptmx
crw------- 1 root root 5, 2 Oct 15 05:26 /dev/ptmx

I pulled the slave out of production.
Reboot recovered linux-ix-slave37, back to the pool.
Pulled linux-ix-slave30 out of production today when it complained of this and burned 2 mozilla-central leak test builds and a Maemo QT tracemonkey build.

Removed slave, put buildbot.tac to .off, clobbered all the talos-slave/ build dirs, moved buildbot.tac back, rebooted.
linux-ix-slave07 got rebooted to fix this error.
Attachment #486274 - Flags: review?(bear) → review+
Should happen while somebody can pay close attention to the results.
Flags: needs-reconfig?
Comment on attachment 486274 [details] [diff] [review]
adjust ix puppet manifest to ensure /dev/ptmx has proper privs and owner.

changeset:   239:a5b01d972112
Attachment #486274 - Flags: checked-in+
Flags: needs-reconfig? → needs-reconfig+
Leaving open to see if this fixes anything.
Flags: needs-reconfig+
Seems that this can now be closed?
Fix landed in November; reopen or file new if this occurs again.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering