Closed Bug 702482 Opened 13 years ago Closed 11 years ago

ensure Xvfb is running before starting buildbot on linux test slaves

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P3)

x86
Linux

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: mbrubeck, Unassigned)

References

Details

This happened once, but went away on the next push.  I don't know if this is a bad slave, or an intermittent code problem, or what.

https://tbpl.mozilla.org/php/getParsedLog.php?id=7346014&full=1&branch=build-system
Linux build-system leak test build on 2011-11-10 21:14:41 PST for push a7b08c15904b

builder: build-system-linux-debug
slave: mv-moz2-linux-ix-slave12
starttime: 1320988481.91
results: warnings (1)
buildid: 20111110211429
builduid: 89b0dc16c21f43aeba9f757dde037d94
revision: a7b08c15904b

========= Started alive test failed (results: 2, elapsed: 12 secs) ==========
python leaktest.py
 in dir /builds/slave/bld-system-lnx-dbg/build/obj-firefox/_leaktest (timeout 1200 secs)
 watching logfiles {}
 argv: ['python', 'leaktest.py']
 environment:
  CC=/tools/gcc/bin/gcc
  CCACHE_BASEDIR=/builds/slave/bld-system-lnx-dbg
  CCACHE_COMPRESS=1
  CCACHE_DIR=/builds/ccache
  CCACHE_UMASK=002
  CVS_RSH=ssh
  CXX=/tools/gcc/bin/g++
  DISPLAY=:2
  G_BROKEN_FILENAMES=1
  HG_SHARE_BASE_DIR=/builds/hg-shared
  HISTSIZE=1000
  HOME=/home/cltbld
  HOSTNAME=mv-moz2-linux-ix-slave12.build.mozilla.org
  INPUTRC=/etc/inputrc
  JAVA_HOME=/builds/jdk
  LANG=en_US.UTF-8
  LC_ALL=C
  LD_LIBRARY_PATH=/tools/gcc-4.3.3/installed/lib:obj-firefox/dist/bin
  LESSOPEN=|/usr/bin/lesspipe.sh %s
  LOGNAME=cltbld
  LS_COLORS=no=00:fi=00:di=01;34:ln=01;36:pi=40;33:so=01;35:bd=40;33;01:cd=40;33;01:or=01;05;37;41:mi=01;05;37;41:ex=01;32:*.cmd=01;32:*.exe=01;32:*.com=01;32:*.btm=01;32:*.bat=01;32:*.sh=01;32:*.csh=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.gz=01;31:*.bz2=01;31:*.bz=01;31:*.tz=01;31:*.rpm=01;31:*.cpio=01;31:*.jpg=01;35:*.gif=01;35:*.bmp=01;35:*.xbm=01;35:*.xpm=01;35:*.png=01;35:*.tif=01;35:
  MAIL=/var/spool/mail/cltbld
  MINIDUMP_SAVE_PATH=/builds/slave/bld-system-lnx-dbg/minidumps
  MINIDUMP_STACKWALK=/builds/slave/bld-system-lnx-dbg/tools/breakpad/linux/minidump_stackwalk
  MOZ_CRASHREPORTER_NO_REPORT=1
  MOZ_OBJDIR=obj-firefox
  PATH=/opt/local/bin:/tools/python/bin:/tools/buildbot/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/home/cltbld/bin
  PWD=/builds/slave/bld-system-lnx-dbg/build/obj-firefox/_leaktest
  SHELL=/bin/bash
  SHLVL=1
  SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass
  TBOX_CLIENT_CVS_DIR=/builds/tinderbox/mozilla/tools
  TERM=linux
  USER=cltbld
  XPCOM_DEBUG_BREAK=stack-and-abort
  _=/tools/python/bin/python
 using PTY: False
args: ['/builds/slave/bld-system-lnx-dbg/build/obj-firefox/dist/bin/firefox-bin', '-no-remote', '-profile', '/builds/slave/bld-system-lnx-dbg/build/obj-firefox/_leaktest/leakprofile/', 'http://localhost:8888/bloatcycle.html']
INFO | automation.py | Application pid: 17219
args: ['/usr/bin/perl', '/builds/slave/bld-system-lnx-dbg/build/obj-firefox/dist/bin/fix-linux-stack.pl']
nsStringStats
 => mAllocCount:              5
 => mReallocCount:            3
 => mFreeCount:               3  --  LEAKED 2 !!!
 => mShareCount:              1
 => mAdoptCount:              0
 => mAdoptFreeCount:          0
Error: cannot open display: :2
nsStringStats
 => mAllocCount:             41
 => mReallocCount:           16
 => mFreeCount:              26  --  LEAKED 15 !!!
 => mShareCount:             57
 => mAdoptCount:              0
 => mAdoptFreeCount:          0
TEST-UNEXPECTED-FAIL | automation.py | Exited with code 1 during test run
INFO | automation.py | Application ran for: 0:00:12.057919
INFO | automation.py | Reading PID log: /tmp/tmpWs8N6Npidlog
program finished with exit code 1
elapsedTime=12.109331
======== Finished alive test failed (results: 2, elapsed: 12 secs) ========
Pretty sure this was a one-time glitch.
Whiteboard: [orange] → [orange][triagefollowup][buildduty]
Status: NEW → RESOLVED
Closed: 13 years ago
Priority: -- → P3
Resolution: --- → WORKSFORME
Whiteboard: [orange][triagefollowup][buildduty] → [orange][badslave?][buildduty]
https://tbpl.mozilla.org/php/getParsedLog.php?id=7800130&tree=Mozilla-Inbound
linux-ix-slave32
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
This has happened again.  

https://tbpl.mozilla.org/php/getParsedLog.php?id=7800130&full=1&branch=mozilla-inbound

We use Xvfb to allow our machines to run headless.  I think Xvfb is intermittently crashing or failing to initialize, because after the next boot it's there and running fine.

[root@linux-ix-slave32 ~]# tail -3 /var/log/Xorg.0.log
Fatal server error:
xf86OpenConsole: VT_WAITACTIVE failed: Interrupted system call


Not sure that there is anything here to do other than be annoyed at Xvfb failing to initialize.
Should we add a puppet check for this, or at least add verification to however we're launching Xvfb?
(In reply to Chris Cooper [:coop] from comment #4)
> Should we add a puppet check for this, or at least add verification to
> however we're launching Xvfb?

+1

Additionally I found a bug which looks related to this: https://bugzilla.redhat.com/show_bug.cgi?id=323501

We can disable rhgb (Red Hat Graphical Boot, I believe) in grub.conf to avoid conflicts.
https://tbpl.mozilla.org/php/getParsedLog.php?id=9402761&tree=Mozilla-Inbound - linux-ix-slave30
Summary: TEST-UNEXPECTED-FAIL | automation.py | Exited with code 1 during test run: "Error: cannot open display: :2" on mv-moz2-linux-ix-slave12 → TEST-UNEXPECTED-FAIL | automation.py | Exited with code 1 during test run: "Error: cannot open display: :2"
Updating summary to reflect what we want to do here.

(In reply to Chris Cooper [:coop] from comment #4)
> Should we add a puppet check for this, or at least add verification to
> however we're launching Xvfb?

Yeah, I think we could do this. We currently launch Xvfb through an @reboot cronjob, but I don't see why we couldn't launch it through Puppet directly, or in some other way that Puppet can watch (maybe through an init.d service?). If we did this, Buildbot wouldn't launch until Xvfb had launched successfully.
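
For illustration only, here is a rough Python sketch of the "don't start Buildbot until Xvfb is up" idea. It is not what the slaves actually run. It assumes the X server for display :2 listens on the conventional local socket /tmp/.X11-unix/X2, and the function names, timeout, and polling interval are all made up for this example. It needs Python 3.3+ for time.monotonic.

```python
import socket
import time


def display_socket_path(display):
    """Map a local DISPLAY value such as ':2' or ':2.0' to its Unix socket path."""
    if not display.startswith(":"):
        raise ValueError("only local displays like ':2' are handled here")
    number = display[1:].split(".", 1)[0]
    if not number.isdigit():
        raise ValueError("unrecognized DISPLAY value: %r" % display)
    # Assumption: the usual socket location for local X servers on Linux.
    return "/tmp/.X11-unix/X" + number


def display_is_up(display):
    """Return True if something accepts connections on the display's socket."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.connect(display_socket_path(display))
        return True
    except OSError:
        return False
    finally:
        sock.close()


def wait_for_display(display, timeout=60.0, interval=1.0,
                     probe=display_is_up, sleep=time.sleep):
    """Poll until the display answers or the timeout expires.

    Returns True if the display came up, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        if probe(display):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

A startup wrapper could call wait_for_display(":2") and only start Buildbot when it returns True. Accepting a connection on the socket is a weaker check than completing an X handshake, so treat this as a first approximation.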

I think this fits into the new Machine Management category, too.
Component: Release Engineering → Release Engineering: Machine Management
QA Contact: release → armenzg
Summary: TEST-UNEXPECTED-FAIL | automation.py | Exited with code 1 during test run: "Error: cannot open display: :2" → ensure Xvfb is running before starting buildbot on linux test slaves
Component: Release Engineering: Machine Management → Release Engineering: Platform Support
QA Contact: armenzg → coop
Whiteboard: [orange][badslave?][buildduty] → [orange][badslave?]
Xvfb is actually launched through a lovely cronjob since it's known to be crashy:

# Make sure Xvfb is running on :2
@reboot     ps -C Xvfb | grep -q Xvfb || exec Xvfb :2 -screen 0 1280x1024x24 &
*/5 * * * * ps -C Xvfb | grep -q Xvfb || exec Xvfb :2 -screen 0 1280x1024x24 &

# Make sure metacity is running on :2
@reboot     ps -C metacity -f | grep -q :2 || exec metacity --display :2 --replace &
*/5 * * * * ps -C metacity -f | grep -q :2 || exec metacity --display :2 --replace &
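
For readers less used to the shell idiom, the Xvfb lines above amount to "if no process named Xvfb shows up in the ps output, start one". A rough Python equivalent is sketched below, purely as an illustration. The function names and the injectable run/spawn parameters are invented for the example; only the ps and Xvfb command lines come from the crontab. It needs Python 3.5+ for subprocess.run.

```python
import subprocess

# Same command line as in the crontab entry.
XVFB_CMD = ["Xvfb", ":2", "-screen", "0", "1280x1024x24"]


def is_running(name, run=subprocess.run):
    """Mirror `ps -C NAME | grep -q NAME`: look for NAME in the ps output."""
    result = run(["ps", "-C", name],
                 stdout=subprocess.PIPE, universal_newlines=True)
    return name in result.stdout


def ensure_xvfb(check=is_running, spawn=subprocess.Popen):
    """Start Xvfb in the background unless a process with that name exists.

    Returns the spawned process handle, or None if nothing was started.
    """
    if check("Xvfb"):
        return None
    return spawn(XVFB_CMD)
```

Like the cron entry, this only checks that a process called Xvfb exists. It does not check that the server is actually accepting connections on :2, which is the failure seen in the logs above.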
Status: REOPENED → RESOLVED
Closed: 13 years ago → 12 years ago
Resolution: --- → WORKSFORME
Blocks: 797242
No longer blocks: 438871
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Whiteboard: [orange][badslave?]
I don't think we'll be propping up the Fedora test slaves any more.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 11 years ago
Resolution: --- → WONTFIX
Product: mozilla.org → Release Engineering
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard