Make runner responsible for buildbot startup on Ubuntu test

RESOLVED FIXED

Status

Product: Release Engineering
Component: General Automation
Status: RESOLVED FIXED
Reported: 3 years ago
Modified: 3 years ago

People

(Reporter: ianconnolly, Assigned: bhearsum)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(2 attachments, 1 obsolete attachment)

(Reporter)

Updated

3 years ago
Depends on: 1042340
(Reporter)

Updated

3 years ago
Depends on: 1042359
(Reporter)

Updated

3 years ago
Depends on: 1045730
(Reporter)

Updated

3 years ago
Blocks: 1052581
(Assignee)

Comment 1

3 years ago
Created attachment 8480813 [details] [diff] [review]
run runner with upstart on ubuntu

I still need to test this more, but I _think_ this has the bases covered as far as getting runner running at all. I need to make sure the other tasks still work, but this gets as far as running Buildbot and connecting to a master.
Assignee: ian → bhearsum
Status: NEW → ASSIGNED
Attachment #8480813 - Flags: feedback?(dustin)
Comment on attachment 8480813 [details] [diff] [review]
run runner with upstart on ubuntu

Review of attachment 8480813 [details] [diff] [review]:
-----------------------------------------------------------------

::: modules/runner/templates/runner.upstart.conf.erb
@@ +10,5 @@
> +
> +    # We sleep a bit here because even though Xvfb has completed, we want to
> +    # make sure that the DE has launched. Some sort of check of the process
> +    # list would be better, but this is probably good enough.
> +    sleep 10

So, this is a pretty substantial change in buildbot startup: from running in a gnome terminal after DE startup, to running via "su -c cltbld 'python runslave.py'".  It looks like the latter doesn't even take care to set up DISPLAY, actually.  And I know at least __GL_YIELD=NOTHING is required (modules/gui/manifests/init.pp), and possibly others.
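
To illustrate the concern: one way to avoid depending on whatever environment the upstart job happens to inherit is to fill in the GUI-related variables explicitly before launching the slave. The following is only a sketch under assumptions. The helper name build_slave_env and the default display ":0" are hypothetical and not part of the attached patch; __GL_YIELD=NOTHING is the value referred to above from modules/gui/manifests/init.pp.

```python
import os


def build_slave_env(base_env=None, display=":0"):
    """Return a copy of base_env with GUI-related variables filled in.

    Hypothetical helper: existing values are kept, so anything already
    set by the caller takes precedence over the defaults used here.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Assumed default display; only applied when DISPLAY is not already set.
    env.setdefault("DISPLAY", display)
    # Value mentioned in the review comment above.
    env.setdefault("__GL_YIELD", "NOTHING")
    return env
```

The result could be passed as the env argument of subprocess.Popen when starting runslave.py, so the variables do not depend on the desktop session having exported them.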
Attachment #8480813 - Flags: feedback?(dustin) → feedback+
(Assignee)

Comment 3

3 years ago
(In reply to Dustin J. Mitchell [:dustin] from comment #2)
> Comment on attachment 8480813 [details] [diff] [review]
> run runner with upstart on ubuntu
> 
> Review of attachment 8480813 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: modules/runner/templates/runner.upstart.conf.erb
> @@ +10,5 @@
> > +
> > +    # We sleep a bit here because even though Xvfb has completed, we want to
> > +    # make sure that the DE has launched. Some sort of check of the process
> > +    # list would be better, but this is probably good enough.
> > +    sleep 10
> 
> So, this is a pretty substantial change in buildbot startup: from running in
> a gnome terminal after DE startup, to running via "su -c cltbld 'python
> runslave.py'".  It looks like the latter doesn't even take care to set up
> DISPLAY, actually.  And I know at least __GL_YIELD=NOTHING is required
> (modules/gui/manifests/init.pp), and possibly others.

Yeah, this is something I'm still testing for. DISPLAY is already set by buildbot, but I'm concerned about XDG/GNOME/DBUS stuff (and the __GL_YIELD one you just mentioned). So far, all of the desktop tests appear to pass. I still need to do some checking on other machine types, too.
(Assignee)

Comment 4

3 years ago
(In reply to Ben Hearsum [:bhearsum] from comment #3)
> (In reply to Dustin J. Mitchell [:dustin] from comment #2)
> > Comment on attachment 8480813 [details] [diff] [review]
> > run runner with upstart on ubuntu
> > 
> > Review of attachment 8480813 [details] [diff] [review]:
> > -----------------------------------------------------------------
> > 
> > ::: modules/runner/templates/runner.upstart.conf.erb
> > @@ +10,5 @@
> > > +
> > > +    # We sleep a bit here because even though Xvfb has completed, we want to
> > > +    # make sure that the DE has launched. Some sort of check of the process
> > > +    # list would be better, but this is probably good enough.
> > > +    sleep 10
> > 
> > So, this is a pretty substantial change in buildbot startup: from running in
> > a gnome terminal after DE startup, to running via "su -c cltbld 'python
> > runslave.py'".  It looks like the latter doesn't even take care to set up
> > DISPLAY, actually.  And I know at least __GL_YIELD=NOTHING is required
> > (modules/gui/manifests/init.pp), and possibly others.
> 
> Yeah, this is something I'm still testing for. DISPLAY is already set by
> buildbot, but I'm concerned about XDG/GNOME/DBUS stuff (and the __GL_YIELD
> one you just mentioned). So far, all of the desktop tests appear to pass. I
> still need to do some checking on other machine types, too.

Somewhat surprisingly, no tests have failed due to not having these variables. I've grepped over the logs to make sure that tests actually ran, and spot checked a bunch of logs. If anyone else wants to look, they'll be available here for a while: http://dev-master1.srv.releng.scl3.mozilla.com:8118/one_line_per_build?numbuilds=150

I'm going to ask around to try to get better confirmation about these variables, but unless I find something suggesting they *are* important, I'm planning to proceed here. Catlee suggested doing some sort of staged rollout, and I think that would be prudent here. E.g., 5-10 regular AWS machines, 5-10 large ones (for emulator tests), and a few in-house machines. I still need to figure out how to make this happen in Puppet.
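
As a rough illustration of the staged rollout idea (this is not how it would be expressed in Puppet), the selection could be sketched in Python as below. The function pick_canaries, the pool sizes, and the per-pool counts are all hypothetical; only the pool name prefixes come from machines mentioned in this bug.

```python
def pick_canaries(pools, counts):
    """Pick the first N hosts (by name) from each pool for a staged rollout.

    pools:  dict mapping pool name -> list of host names
    counts: dict mapping pool name -> number of hosts to select
    """
    chosen = {}
    for pool, hosts in pools.items():
        n = counts.get(pool, 0)
        # Sorting keeps the selection deterministic and easy to review.
        chosen[pool] = sorted(hosts)[:n]
    return chosen


# Hypothetical pool contents; real pools would come from an inventory.
pools = {
    "tst-linux64": ["tst-linux64-ec2-%03d" % i for i in range(1, 21)],
    "tst-emulator64": ["tst-emulator64-ec2-%03d" % i for i in range(1, 21)],
    "talos-linux64-ix": ["talos-linux64-ix-%03d" % i for i in range(1, 11)],
}
# Illustrative counts in the spirit of "5-10 regular, 5-10 large, a few in-house".
counts = {"tst-linux64": 5, "tst-emulator64": 5, "talos-linux64-ix": 3}

canaries = pick_canaries(pools, counts)
```

Each selected host would then be pinned to the test environment while the rest of the pool stays on production.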
(Assignee)

Comment 5

3 years ago
Created attachment 8482741 [details] [diff] [review]
fully tested patch to get buildbot started with runner

Per IRC, I'd like to roll this out on a few production slaves pointing at my puppet environment. Seems like I should have r+ before doing that, though.
Attachment #8480813 - Attachment is obsolete: true
Attachment #8482741 - Flags: review?(dustin)
(Assignee)

Comment 6

3 years ago
Created attachment 8482758 [details] [diff] [review]
add entries for ondemand emulator64 machines

I spoke with Rail about how to set aside some AWS machines to do this. It looks like we should be able to just bring up some on demand machines and pin them to my environment. Emulator test machines don't have any entries in buildbot-configs for ondemand machines yet, so I'm adding some here.

I'll be fiddling with moz-state to make sure that stop idle doesn't shut these down (otherwise it's very unlikely that they'll get picked over spot machines).
Attachment #8482758 - Flags: review?(catlee)

Updated

3 years ago
Attachment #8482758 - Flags: review?(catlee) → review+
(Assignee)

Updated

3 years ago
Attachment #8482758 - Flags: checked-in+
Comment on attachment 8482741 [details] [diff] [review]
fully tested patch to get buildbot started with runner

Review of attachment 8482741 [details] [diff] [review]:
-----------------------------------------------------------------

::: modules/toplevel/manifests/slave/releng/test.pp
@@ +14,5 @@
>      include dirs::builds::hg_shared
>      include dirs::builds::git_shared
>      include dirs::builds::tooltool_cache
>  
> +    case $::operatingsystem {

Can you add a comment here explaining that this conditional is temporary until runner is set up on every platform?
Attachment #8482741 - Flags: review?(dustin) → review+

Comment 8

3 years ago
Merged to production, and deployed.
(Assignee)

Comment 9

3 years ago
I pinned talos-linux64-ix-001, 002, 005, and 006 to my user environment. Sheriffs are aware, and I've added a note in Slavealloc. I'll be doing the same for a few slaves from the tst-linux64 and tst-emulator64 AWS pools shortly, too.
(Assignee)

Comment 10

3 years ago
The ec2 machines are up now too:
tst-linux64-ec2-001, 002, 003, and 004
tst-emulator64-ec2-001 and 002

I've flipped their moz-state tags to testing-bug1042358 to avoid them getting shut down. That should be changed back when testing is done.
(Assignee)

Comment 11

3 years ago
So far things are looking mostly fine. One build failed with DISPLAY not being set, but I'm extremely confused as to why. This is the build: http://buildbot-master103.srv.releng.scl3.mozilla.com:8201/builders/Ubuntu%20HW%2012.04%20x64%20mozilla-central%20pgo%20talos%20other_l64/builds/160

  HOME=/home/cltbld
  LANG=en_US.UTF-8
  LANGUAGE=en_US:en
  LOGNAME=cltbld
  MAIL=/var/mail/cltbld
  NODE_PATH=/usr/lib/nodejs:/usr/lib/node_modules:/usr/share/javascript
  PATH=/usr/local/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
  PROPERTIES_FILE=/builds/slave/talos-slave/test-pgo/buildprops.json
  PWD=/builds/slave/talos-slave/test-pgo
  SHELL=/bin/bash
  SHLVL=1
  TERM=linux
  TMOUT=86400
  USER=cltbld
  XDG_SESSION_COOKIE=dd26bb57dc7379c38bda76df000001a9-1409930523.515999-565090523
  _=/tools/buildbot/bin/python

In addition to not having DISPLAY set, it's also missing other variables defined in the same place (http://mxr.mozilla.org/build-central/source/buildbotcustom/env.py#186). I'm tempted to write this off as a freak occurrence because other jobs that are configured in the exact same way have the right variables set:

  DISPLAY=:0
  HOME=/home/cltbld
  LANG=en_US.UTF-8
  LANGUAGE=en_US:en
  LOGNAME=cltbld
  MAIL=/var/mail/cltbld
  MOZ_CRASHREPORTER_NO_REPORT=1
  MOZ_NO_REMOTE=1
  NODE_PATH=/usr/lib/nodejs:/usr/lib/node_modules:/usr/share/javascript
  NO_EM_RESTART=1
  PATH=/usr/local/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
  PROPERTIES_FILE=/builds/slave/talos-slave/test/buildprops.json
  PWD=/builds/slave/talos-slave/test
  SHELL=/bin/bash
  SHLVL=1
  TERM=linux
  TMOUT=86400
  USER=cltbld
  XDG_SESSION_COOKIE=dd26bb57dc7379c38bda76df000001a9-1409925928.329783-534482962
  XPCOM_DEBUG_BREAK=warn
  _=/tools/buildbot/bin/python


Still, I'm going to look into this more, but I'm not going to disable anything -- I'd like them to run over the weekend.
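
For anyone who wants to compare environment dumps like the two above without eyeballing them, here is a small Python sketch. The helper names parse_env_dump and missing_vars are hypothetical, and the dumps are trimmed to a subset of the variables shown above.

```python
def parse_env_dump(text):
    """Parse NAME=value lines (as printed in a build log) into a dict."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        # Split on the first "=" only, since values may contain "=".
        name, _, value = line.partition("=")
        env[name] = value
    return env


def missing_vars(broken, working):
    """Return the names present in the working env but absent from the broken one."""
    return sorted(set(working) - set(broken))


# Trimmed copies of the two dumps above.
broken_dump = """
  HOME=/home/cltbld
  PWD=/builds/slave/talos-slave/test-pgo
  USER=cltbld
"""
working_dump = """
  DISPLAY=:0
  HOME=/home/cltbld
  MOZ_CRASHREPORTER_NO_REPORT=1
  MOZ_NO_REMOTE=1
  NO_EM_RESTART=1
  PWD=/builds/slave/talos-slave/test
  USER=cltbld
  XPCOM_DEBUG_BREAK=warn
"""

missing = missing_vars(parse_env_dump(broken_dump), parse_env_dump(working_dump))
```

With these inputs, missing holds exactly the variables that are absent from the failing job: DISPLAY, MOZ_CRASHREPORTER_NO_REPORT, MOZ_NO_REMOTE, NO_EM_RESTART and XPCOM_DEBUG_BREAK.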
(Assignee)

Updated

3 years ago
Depends on: 1063739
(Assignee)

Comment 12

3 years ago
Turns out that we don't set the env in buildbot for PGO talos jobs, but we do for non-PGO talos jobs. I'm fixing this in bug 1063739. I'm not going to disable the 4 slaves locked to my puppet env because there's only a small set of jobs that will fail because of this, and there shouldn't be more than a few that happen over the weekend.
(Assignee)

Comment 13

3 years ago
These jobs have looked fine on the pinned machines for a while. I plan to check in the puppet change to production tonight, so that the spot AMIs will pick up the changes tomorrow morning. In-house Ubuntu machines (such as talos-linux64-ix) will pick up the changes tonight - I'll hang around to watch them in case of bustage.
(Assignee)

Comment 14

3 years ago
I've moved all machines back to the production environment, and reset moz-state on the ec2 machines. In other words, they're back to how they were before I started testing this. I'll land the puppet patch later this evening.
(Assignee)

Comment 15

3 years ago
Comment on attachment 8482741 [details] [diff] [review]
fully tested patch to get buildbot started with runner

Landed on default+production.
Attachment #8482741 - Flags: checked-in+
(Assignee)

Comment 16

3 years ago
I forgot to add a new file when I first landed. This worked fine after I fixed that, though.
Status: ASSIGNED → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED