Closed Bug 1042358 Opened 10 years ago Closed 10 years ago

Make runner responsible for buildbot startup on Ubuntu test

Categories

(Release Engineering :: General, defect)

x86
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ianconnolly, Assigned: bhearsum)

Attachments

(2 files, 1 obsolete file)

      No description provided.
Depends on: 1042340
Depends on: 1042359
Depends on: 1045730
Blocks: 1052581
I still need to test this more, but I _think_ this covers the bases as far as getting runner running at all. I need to make sure the other tasks still work, but this gets as far as running Buildbot and connecting to a master.
Assignee: ian → bhearsum
Status: NEW → ASSIGNED
Attachment #8480813 - Flags: feedback?(dustin)
Comment on attachment 8480813 [details] [diff] [review]
run runner with upstart on ubuntu

Review of attachment 8480813 [details] [diff] [review]:
-----------------------------------------------------------------

::: modules/runner/templates/runner.upstart.conf.erb
@@ +10,5 @@
> +
> +    # We sleep a bit here because even though Xvfb has completed, we want to
> +    # make sure that the DE has launched. Some sort of check of the process
> +    # list would be better, but this is probably good enough.
> +    sleep 10

So, this is a pretty substantial change in buildbot startup: from running in a gnome terminal after DE startup, to running via "su -c cltbld 'python runslave.py'".  It looks like the latter doesn't even take care to set up DISPLAY, actually.  And I know at least __GL_YIELD=NOTHING is required (modules/gui/manifests/init.pp), and possibly others.
Attachment #8480813 - Flags: feedback?(dustin) → feedback+
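
The patch comment quoted above notes that a check of the process list would be better than a fixed "sleep 10". A minimal sketch of that idea in Python follows. It is not part of the attached patch: the helper names (process_running, wait_for) are hypothetical, it assumes pgrep is available on the slave, and which process name actually signals that the DE is up would need to be confirmed on a real machine.

```python
import subprocess
import time


def process_running(name, user):
    """Return True if pgrep finds a process with this exact name owned by `user`.

    Assumes the `pgrep` utility is installed; -u filters by user and
    -x requires an exact match on the process name.
    """
    result = subprocess.run(
        ["pgrep", "-u", user, "-x", name],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def wait_for(check, timeout=60.0, interval=1.0,
             sleep=time.sleep, clock=time.monotonic):
    """Poll check() until it returns True or `timeout` seconds have elapsed.

    Returns True as soon as check() succeeds, False on timeout.
    `sleep` and `clock` are injectable so the loop can be tested quickly.
    """
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A caller could then use something like wait_for(lambda: process_running("gnome-session", "cltbld")) before starting Buildbot, where "gnome-session" is only a guess at the right process to watch. Unlike a fixed sleep, this returns as soon as the process appears and reports failure if it never does.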
(In reply to Dustin J. Mitchell [:dustin] from comment #2)

Yeah, this is something I'm still testing for. DISPLAY is already set by buildbot, but I'm concerned about XDG/GNOME/DBUS stuff (and the __GL_YIELD one you just mentioned). So far, all of the desktop tests appear to pass. I still need to do some checking on other machine types, too.
(In reply to Ben Hearsum [:bhearsum] from comment #3)

Somewhat surprisingly, no tests have failed due to not having these variables. I've grepped over the logs to make sure that tests actually ran, and spot-checked a bunch of logs. If anyone else wants to look, they'll be available here for a while: http://dev-master1.srv.releng.scl3.mozilla.com:8118/one_line_per_build?numbuilds=150

I'm going to ask around to try and get better confirmation about these variables, but unless I find something suggesting they *are* important, I'm planning to proceed here. Catlee suggested doing some sort of staged rollout, and I think that would be prudent here. E.g., 5-10 regular AWS machines, 5-10 large ones (for emulator tests), and a few in-house machines. I still need to figure out how to make this happen in Puppet.
Per IRC, I'd like to roll this out on a few production slaves pointing at my puppet environment. Seems like I should have r+ before doing that, though.
Attachment #8480813 - Attachment is obsolete: true
Attachment #8482741 - Flags: review?(dustin)
I spoke with Rail about how to set aside some AWS machines to do this. It looks like we should be able to just bring up some on-demand machines and pin them to my environment. Emulator test machines don't have any entries in buildbot-configs for on-demand machines yet, so I'm adding some here.

I'll be fiddling with moz-state to make sure that stop idle doesn't shut these down (otherwise it's very unlikely that they'll get picked over spot machines).
Attachment #8482758 - Flags: review?(catlee)
Attachment #8482758 - Flags: review?(catlee) → review+
Attachment #8482758 - Flags: checked-in+
Comment on attachment 8482741 [details] [diff] [review]
fully tested patch to get buildbot started with runner

Review of attachment 8482741 [details] [diff] [review]:
-----------------------------------------------------------------

::: modules/toplevel/manifests/slave/releng/test.pp
@@ +14,5 @@
>      include dirs::builds::hg_shared
>      include dirs::builds::git_shared
>      include dirs::builds::tooltool_cache
>  
> +    case $::operatingsystem {

Can you add a comment here explaining that this conditional is temporary until runner is set up on every platform?
Attachment #8482741 - Flags: review?(dustin) → review+
Merged to production, and deployed.
I pinned talos-linux64-ix-001, 002, 005, and 006 to my user environment. Sheriffs are aware, and I've added a note in Slavealloc. I'll be doing the same for a few slaves from the tst-linux64 and tst-emulator64 aws pools shortly, too.
The ec2 machines are up now too:
tst-linux64-ec2-001, 002, 003, and 004
tst-emulator64-ec2-001 and 002

I've flipped their moz-state tags to testing-bug1042358 to avoid them getting shut down. That should be changed back when testing is done.
So far things are looking mostly fine. One build failed because DISPLAY was not set, and I'm extremely confused as to why.
This is the build: http://buildbot-master103.srv.releng.scl3.mozilla.com:8201/builders/Ubuntu%20HW%2012.04%20x64%20mozilla-central%20pgo%20talos%20other_l64/builds/160

  HOME=/home/cltbld
  LANG=en_US.UTF-8
  LANGUAGE=en_US:en
  LOGNAME=cltbld
  MAIL=/var/mail/cltbld
  NODE_PATH=/usr/lib/nodejs:/usr/lib/node_modules:/usr/share/javascript
  PATH=/usr/local/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
  PROPERTIES_FILE=/builds/slave/talos-slave/test-pgo/buildprops.json
  PWD=/builds/slave/talos-slave/test-pgo
  SHELL=/bin/bash
  SHLVL=1
  TERM=linux
  TMOUT=86400
  USER=cltbld
  XDG_SESSION_COOKIE=dd26bb57dc7379c38bda76df000001a9-1409930523.515999-565090523
  _=/tools/buildbot/bin/python

In addition to not having DISPLAY set, it's also missing other variables defined in the same place (http://mxr.mozilla.org/build-central/source/buildbotcustom/env.py#186). I'm tempted to write this off as a freak occurrence because other jobs that are configured in the exact same way have the right variables set:

  DISPLAY=:0
  HOME=/home/cltbld
  LANG=en_US.UTF-8
  LANGUAGE=en_US:en
  LOGNAME=cltbld
  MAIL=/var/mail/cltbld
  MOZ_CRASHREPORTER_NO_REPORT=1
  MOZ_NO_REMOTE=1
  NODE_PATH=/usr/lib/nodejs:/usr/lib/node_modules:/usr/share/javascript
  NO_EM_RESTART=1
  PATH=/usr/local/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
  PROPERTIES_FILE=/builds/slave/talos-slave/test/buildprops.json
  PWD=/builds/slave/talos-slave/test
  SHELL=/bin/bash
  SHLVL=1
  TERM=linux
  TMOUT=86400
  USER=cltbld
  XDG_SESSION_COOKIE=dd26bb57dc7379c38bda76df000001a9-1409925928.329783-534482962
  XPCOM_DEBUG_BREAK=warn
  _=/tools/buildbot/bin/python
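
For anyone comparing dumps like the two above, a small Python sketch can list which variables are absent from the failing job. The function names (parse_env, missing_keys) are hypothetical helpers, not part of buildbot or buildbotcustom, and the sample strings are abbreviated from the dumps in this comment rather than copied in full.

```python
def parse_env(dump):
    """Parse KEY=VALUE lines, as printed in a build log, into a dict."""
    env = {}
    for line in dump.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank lines and anything that is not KEY=VALUE
        key, _, value = line.partition("=")
        env[key] = value
    return env


def missing_keys(bad_dump, good_dump):
    """Return sorted names that appear in good_dump but not in bad_dump."""
    return sorted(set(parse_env(good_dump)) - set(parse_env(bad_dump)))


# Abbreviated copies of the failing (PGO) and passing (non-PGO) environments.
failing = """
  HOME=/home/cltbld
  LANG=en_US.UTF-8
  TERM=linux
  USER=cltbld
"""

passing = """
  DISPLAY=:0
  HOME=/home/cltbld
  LANG=en_US.UTF-8
  MOZ_CRASHREPORTER_NO_REPORT=1
  MOZ_NO_REMOTE=1
  NO_EM_RESTART=1
  TERM=linux
  USER=cltbld
  XPCOM_DEBUG_BREAK=warn
"""

print(missing_keys(failing, passing))
```

Only variable names are compared, so values that legitimately differ between jobs (PWD, PROPERTIES_FILE, XDG_SESSION_COOKIE) do not show up as noise.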


Still, going to look into this more, but I'm not going to disable anything -- I'd like them to run over the weekend.
Depends on: 1063739
Turns out that we don't set the env in buildbot for PGO talos jobs, but we do for non-PGO talos jobs. I'm fixing this in bug 1063739. I'm not going to disable the 4 slaves locked to my puppet env because there's only a small set of jobs that will fail because of this, and there shouldn't be more than a few that happen over the weekend.
These jobs have looked fine on the pinned machines for a while. I plan to check in the puppet change to production tonight, so that the spot AMIs will pick up the changes tomorrow morning. In-house Ubuntu machines (such as talos-linux64-ix) will pick up the changes tonight - I'll hang around to watch them in case of bustage.
I've moved all machines back to the production environment, and reset moz-state on the ec2 machines. Aka, they're back to how they were before I started testing this. I'll land the puppet patch later this evening.
Comment on attachment 8482741 [details] [diff] [review]
fully tested patch to get buildbot started with runner

Landed on default+production.
Attachment #8482741 - Flags: checked-in+
I forgot to add a new file when I first landed. This worked fine after I fixed that, though.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Component: General Automation → General