Closed Bug 731763 Opened 12 years ago Closed 12 years ago

Set up production autoland instance

Categories

(mozilla.org Graveyard :: Server Operations: Projects, task)

Platform: x86 macOS
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mjessome, Assigned: dustin)

References

Details

(Whiteboard: [tools])

Attachments

(1 file)

We can puppetize on releng puppet servers.
I think we had said:
1 VM to run our bugzilla scraping tool / producer, autoland_queue.
1 VM for running our worker instances, hgpusher.

We currently have 2 staging servers, autoland-staging{01,02}. We can drop one of those once this is set up, but we will still need a staging server.

We also need access to a rabbitmq server, schedulerDb, and the releng MySQL server.
Whiteboard: [autoland][tools]
Whiteboard: [autoland][tools] → [tools]
Docs
    https://wiki.mozilla.org/BugzillaAutoLanding

From talking to Amy, I think it will be best to set this up *outside* of the build network.  Nothing here (including schedulerdb) is within the build network, so there's no access loss involved.

Per our discussions yesterday, production will wait for scl3, which means sometime in the next few weeks, as the remaining requirements appear in that datacenter.  Staging's already running, so let's keep rolling with that.  Those are in scl1, so we can re-create staging in scl3 at our leisure.

The pieces I see here are:
 - VMs - waiting on scl3 ESX
 - schedulerdb access - just ACL changes
 - "releng MySQL server" I'm reading AutolandDB, a different DB on the same cluster, so this is free
 - bugzilla API - unrestricted, no issues
 - rabbitmq - see below
 - I'll get a page up in the websites space documenting this, pointing to the wiki page linked above.
 - monitoring - we'll need some details on how best to monitor this, what to do when it breaks


Open questions:

Brandon: this won't really be a web app -- it just quietly does its thing -- but it needs rabbitmq.  Can it use a vhost on the generic cluster's staging/prod rabbitmq instances?  If it's any consolation, in a later phase it may have at least a small dashboard/management UI that would probably fit well in the generic cluster.

Marc: can you verify that your AMQP client library can connect to a list of servers sequentially, and that it can set the necessary flags for HA messages?  I know Catlee's had trouble with this.

I expect these will go on the DMZ VLAN.  Does anyone know differently?
Assignee: server-ops-releng → dustin
AIUI the VM infra in SCL3 is up now. What's the ETA for these VMs?
Still waiting for feedback from Brandon and Marc.
It isn't a problem to connect to a list of servers in sequence.

Looking into the documentation for our library (pika), it seems like it does support the declaration of HA/mirrored queues. I've tested out declaring a queue in such a manner, and it works; however, I do not have a cluster to properly try it on.
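
For reference, a minimal sketch of both pieces against pika 0.9.5 (hostnames are placeholders, and the x-ha-policy queue argument is how RabbitMQ of this era requested mirrored queues):

import pika

def connect_first_available(hosts):
    # Try each broker in order; return the first connection that succeeds.
    for host in hosts:
        try:
            return pika.BlockingConnection(pika.ConnectionParameters(host=host))
        except pika.exceptions.AMQPConnectionError:
            continue
    raise RuntimeError('no rabbitmq host reachable')

connection = connect_first_available(['rabbit1.example.com', 'rabbit2.example.com'])
channel = connection.channel()
# Declare a durable queue mirrored across the cluster; pre-3.0 RabbitMQ
# takes the mirroring request as a queue argument at declaration time.
channel.queue_declare(queue='autoland', durable=True,
                      arguments={'x-ha-policy': 'all'})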
You can try it against rabbit-[a..h].build.mozilla.org.  It'd be good if you can also try it against rabbit1-dev -- we've been stalled on catlee for a few weeks waiting for him to test that.
Dustin, I missed that you were waiting on input from me :/

I'll ping you on irc today.
From IRC, using the dev/staging/prod generic clusters' rabbitmq instances are fine.

Brandon, can you set up a single vhost, named say /autoland, with a single user ("autoland" has a nice ring to it) and a password to be communicated to Marc?

I'll open a bug with Dan to get the VMs set up.
So we normally make a vhost with the env name in it, and a matching username.

I've set up vhosts for -dev and -stage, info below

-dev

Host: generic-celery1.dev.seamicro.phx1.mozilla.com
Username: autoland_dev

-stage
Host: generic-celery1.stage.seamicro.phx1.mozilla.com 
Username: autoland_stage

Passwords will be relayed via IRC
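
For reference, a minimal sketch of connecting with the new credentials (assuming the vhost names match the usernames, per the convention above; the password is a placeholder):

import pika

creds = pika.PlainCredentials('autoland_dev', 'password-from-irc')
params = pika.ConnectionParameters(
    host='generic-celery1.dev.seamicro.phx1.mozilla.com',
    virtual_host='autoland_dev',
    credentials=creds)
connection = pika.BlockingConnection(params)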
(In reply to Dustin J. Mitchell [:dustin] from comment #7)
> I'll open a bug with Dan to get the VMs set up.

That's bug 736650.  Marc, please have a look and make sure the specs are OK.
(In reply to Dustin J. Mitchell [:dustin] from comment #1)
> Docs
>     https://wiki.mozilla.org/BugzillaAutoLanding
> 
> From talking to Amy, I think it will be best to set this up *outside* of the
> build network.  Nothing here (including schedulerdb) is within the build
> network, so there's no access loss involved.
> 

These will need to be in the build network since the machines will have a user account on them that has permissions to push to scm_level3 hg repos as well as to make changes to bugzilla and eventually buildapi.
None of that requires being in the build network.
From an IRC conversation, there's been a secreview based on information different from what I was given two weeks ago - namely, that the hosts would be in the build network.  Putting the hosts there means:

 * stopping bug 736650 until I'm back online
 * using limited build network VM resources
 * placing the app further from the view of the SREs

Lukas told me this is a Q1 goal and that getting the hosts up quickly is critical, which I take to mean that she will re-visit the secreview to see if the network isolation was required, and that we will continue deploying these hosts as planned.  I haven't heard back in IRC to verify either of those.
(In reply to Dustin J. Mitchell [:dustin] from comment #1)
> I expect these will go on the DMZ VLAN.  Does anyone know differently?

I was wrong here -- DMZ's not the right place for boxes with sensitive info like Lukas mentioned in comment 10.

I *think* this means the private VLAN, but I'd like verification -- fox2mike? justdave?
(In reply to Dustin J. Mitchell [:dustin] from comment #12)
> From an IRC conversation, there's been a secreview based on information
> different from what I was given two weeks ago - namely, that the hosts would
> be in the build network.  Putting the hosts there means:
> 
>  * stopping bug 736650 until I'm back online
>  * using limited build network VM resources
>  * placing the app further from the view of the SREs
> 
> Lukas told me this is a Q1 goal and that getting the hosts up quickly is
> critical, which I take to mean that she will re-visit the secreview to see
> if the network isolation was required, and that we will continue deploying
> these hosts as planned.  I haven't heard back in IRC to verify either of
> those.

After discussing with Amy, this is not necessary - please continue as planned with getting these VMs online.  This is indeed a Q1 goal, and not only do we need these VMs up, but we will also need a point person to work with Marc to get our puppet manifests working with the IT puppet setup.  Also, does IT have a secure password storage location that can store the user info for autoland?
I'd be the point person if I was around.  Someone else may be able to substitute, or this may need to wait until I'm available.

Yes, we have password storage (via GPG).
Not sure why this was referred to me for questions (other than just tribal knowledge or something) but here's my take on it.

This is based on the understanding that:
1) This system contains user accounts that shouldn't be accessible to the Internet, which means it shouldn't live in the DMZ, and won't have inbound Internet access except via VPN
2) This system should also not be accessible to community build helpers who access via the build VPN.

This sounds like it should either be in vlan72 (corporate resources) or we need a new VPN for non-community administrative resources for build.
or vlan275?  I'm told that might be separate from build-vpn as well.
I'd prefer vlan72 (corp.scl3).  I'll cross-post that to bug 736650.
Got into a discussion today about inodes/disk space, and I'm starting to second-guess the 20G of disk space on these for production. I'll explain the situation:

On our worker VM, we keep a clean clone of each supported branch to lower external resource usage, and increase our efficiency. At this moment, we can foresee supporting mozilla-central, mozilla-inbound, mozilla-beta, mozilla-aurora, and mozilla-esr.
In addition to these clean clones, for each instance of our worker tool (we hope to run 2), we keep a working copy of the repository that is being worked on.
This means that we are looking at needing to store at least 7 clones at any point in time.

A clone & checkout of mozilla-central itself takes ~1.4G and has somewhere around 160K files, so 7 clones mean roughly 10G of repository data and over 1.1M inodes. This means inodes _will_ run out:
-----
[root@autoland-staging02 clean]# du -hs mozilla-central
1.4G	mozilla-central
[root@autoland-staging02 clean]# find mozilla-central/ | wc -l
160228
-----

Note that this is only for one of the VMs, since we have 1 worker VM which needs these repositories, and another which does not store any repositories.
So how much disk space do you need?
I would say 20G will be good on one, and 50G for the other. I think that should allow for enough overhead as to not run into trouble. Thanks.
Depends on: 738436
(In reply to Marc Jessome[:mjessome] from comment #21)
> I would say 20G will be good on one, and 50G for the other. I think that
> should allow for enough overhead as to not run into trouble. Thanks.

If it's not too much space hogging, I'd like if we created both with 50GB. It would be great to have a consistent ref image for these vms as we are going to add more builders to be hg_pushers over time.
(In reply to Lukas Blakk [:lsblakk] from comment #22)
> If it's not too much space hogging, I'd like if we created both with 50GB.
> It would be great to have a consistent ref image for these vms as we are
> going to add more builders to be hg_pushers over time.

We don't use refimages.  This will be done in puppet.

Speaking of which, if you can point me to the necessary pieces in build/puppet-manifests, I can start working on that in sysadmins puppet.
The puppet manifests were never landed on build/puppet-manifests since we realized that we would be using the sysadmins puppet.

I have it in my user repo at:
https://hg.mozilla.org/users/mjessome_mozilla.com/puppet-manifests/

Files of note are:
    buildmaster-production.pp - Where the machine instances are used.
    modules/autoland/*  - instance & dep information, config files & templates.
    secrets.pp.template - template for password storage.

bug 723998 has the patch if you'd want to see that.

With regards to the secrets.pp.template, we will need to talk about how passwords should be stored, and how I can get those to you.
The VMs have been created. Let me know if there's anything else I need to do.

autoland1.corpdmz.scl3.mozilla.com
autoland2.corpdmz.scl3.mozilla.com
No, sir.  I'll get them set up with puppet!
Marc --

I had a look at the puppet manifests.  The normal practice in infra is to use RPMs to install python libraries to site-packages, rather than virtualenvs, so I'll cook that up.  

As for password storage, sysadmins puppet doesn't have the equivalent to secrets.pp.  I think that the best solution will be to add a local.ini with passwords in it on each machine, by hand.  We'll keep a backup copy of the necessary passwords in our password store for easy re-creation in a disaster.

I assume I can find those values on autoland-staging02 now, so that should be sufficient.
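
A minimal sketch of how the app could read the hand-installed file with the stdlib ConfigParser (the path is an assumption; the section/option names follow secrets.ini.template):

from ConfigParser import SafeConfigParser

secrets = SafeConfigParser()
secrets.read('/data/autoland/secrets.ini')  # hand-installed, not puppetized
mq_password = secrets.get('mq', 'password')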

On a side note, I notice that you're using 'source ...bin/activate' in a lot of places -- keep in mind that for virtualenvs, activate is just a convenient way to edit PATH in your interactive shell, and isn't useful for automated stuff.  Instead of

command=/bin/sh -c 'source <%=basedir%>/bin/activate && cd <%=script_dir%> && python <%=script_dir%>/hgpusher.py'

you should use

command=/bin/sh -c '<%= basedir %>/bin/python <%=script_dir%>/hgpusher.py'
https://hg.mozilla.org/users/mjessome_mozilla.com/puppet-manifests/file/a2bc3ef7f1c5/modules/autoland/manifests/instance.pp#l47 lists:

 - SQLAlchemy==0.7.2
 - argparse==1.2.1
 - mercurial==1.9.1
 - pysqlite==2.6.3
 - pika==0.9.5
 - simplejson==2.1.6
 - urllib3==1.0.2
 - python-ldap==2.3.13
 - Twisted==11.0.0
 - MySQL-python==1.2.3
 - requests==0.10.8

all for Python-2.6.  I need an assist building RPMs for these.  Jeremy, I hear you have an egg-to-rpm tool?
(In reply to Dustin J. Mitchell [:dustin] from comment #29)
>  - SQLAlchemy==0.7.2
needed
>  - argparse==1.2.1
/data/mrepo/www/6-x86_64/RPMS.epel/python-argparse-1.2.1-2.el6.noarch.rpm
>  - mercurial==1.9.1
needed
>  - pysqlite==2.6.3
not sure why you need this with Python-2.6, but OK :)
>  - pika==0.9.5
/data/mrepo/www/6-x86_64/RPMS.epel/python-pika-0.9.5-2.el6.noarch.rpm
>  - simplejson==2.1.6
/data/mrepo/www/6-x86_64/RPMS.mozilla/simplejson-2.3.2-1.x86_64.rpm
  (if the different version is OK?  It's simplejson, after all!)
>  - urllib3==1.0.2
needed
>  - python-ldap==2.3.13
needed
>  - Twisted==11.0.0
needed
>  - MySQL-python==1.2.3
/data/mrepo/www/6-x86_64/RPMS.updates/MySQL-python-1.2.3-0.3.c1.1.el6.x86_64.rpm
>  - requests==0.10.8
needed
After some discussion, this will be more flexible -- and more similar to other webops deployments -- if the python libraries that aren't compiled are in a vendor directory, similar to https://github.com/mozilla/balrog/tree/master/vendor

Of what remains, pysqlite isn't required on 2.6 (it's built in).  So that leaves MySQL-python, in the RPMS.updates repo, and Twisted, for which I'll build an RPM and add it to mrepo.
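
A minimal sketch of how a checked-in vendor directory typically gets wired in (the layout is an assumption, loosely modeled on balrog's):

import os
import site

# Add the checked-in vendor/ directory (and any .pth files in it) to
# sys.path before importing third-party libraries.
here = os.path.dirname(os.path.abspath(__file__))
site.addsitedir(os.path.join(here, 'vendor'))

import pika  # now importable from vendor/ without a virtualenv
import sqlalchemy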
The Twisted package is created and in mrepo now, and the spec file is checked into subversion.

Marc's working on a vendor lib for the others.  I'll work on puppet manifests.
Rather than a local 'autoland' user, I'm going to use the 'autolanduser' user already in LDAP.

Also, your manifests list rabbitmq-server, but from the above you don't need a rabbitmq server.  That brings lots of deps with it, so I'd rather leave it out.  Is that OK?

I haven't converted autoland::instance yet, so I may have more questions tomorrow.
That isn't a problem leaving out rabbitmq-server, since we'll be using what is provided. The switch to "autolanduser" is not a problem.
Assignee: dustin → nobody
Component: Server Operations: RelEng → Server Operations: Projects
QA Contact: arich → mrz
I've finished the vendor libs. This also makes me think that we can get rid of the virtualenv that was being used in my puppet setup.
Looking into it, it seems like the python-ldap module is also compiled, so we will require an RPM for that as well.
Assignee: nobody → dustin
Summary: Set up VMs for autoland → Set production autoland instance
Summary: Set production autoland instance → Set up production autoland instance
(In reply to Marc Jessome[:mjessome] from comment #34)
> That isn't a problem leaving out rabbitmq-server, since we'll be using what
> is provided. The switch to "autolanduser" is not a problem.

Awesome x 2

(In reply to Marc Jessome[:mjessome] from comment #35)
> I've finished the vendor libs. This also makes me think that we can get rid
> of virtualenv that was being used in my puppet setup.

Indeed - I'll work that into the autoland::instance changes.

> Looking into it, it seems like the python-ldap module is also compiled, so
> we will require an RPM for that as well.

Good point - I'll get that.
Here is a list of the differences between the two VMs that we'll have:

The bugzilla poller: (runs autoland_queue and schedulerDbPoller)
- This is the 20G VM
- Requires the cron job that runs schedulerDbPoller
- Requires supervisord.conf.erb to contain [program:autoland_queue] section
- Requires netflow to tm-b01-slave01 (for access to the scheduler database)

The hg facing tools: (runs hgpushers)
- This is the 50G VM
- Requires supervisord.conf.erb to contain [program:hgpusher] section
- requires ssh keys
- requires hgrc

As for configuration, it can be kept the same on both unless there are issues with not wanting unnecessary secrets on a second machine. In that case, secrets could be flagged in the same way as will be done in supervisord.conf.erb.
Attached file secrets.ini.template
The outline for what is required in the secrets.ini file.
I haven't included variable names since I'm not sure what will be used in the end.
OK, I think we're at a good stopping point for the day.  The puppet manifests are in and working, so here's what's left:

 * get secrets.ini and the ssh key on the machine - I can do this when Marc's around tomorrow

 * land the code in hg.m.o/build/tools

 * figure out the right instance parameters.  I have
    autoland::instance {
        "/data/autoland/autoland-env":
            code_tag       => "staging",
            user           => "autolanduser",
            attachment_url => "https://landfill.bugzilla.org/bzapi_sandbox/attachment.cgi?id=",
            api_url        => "https://api-dev.bugzilla.mozilla.org/test/latest/",
            bz_url         => "https://landfill.bugzilla.org/show_bug.cgi?id=",
            config_flags   => "staging=1",
            poll_schedulerdb => false;
    }

 * fire it up and see what happens :)

we'll get back to it tomorrow.
I should add to that list:
 * flows for MySQL
 * use rabbitmq as set up by brandon
bug 740006 and bug 740007 for the flows
OK, changes made today:
 * sorted out secrets and ssh keys, using autoland-staging02 as a source
 * code changes to use secrets.ini
 * only start hgpusher if poll_schedulerdb is false, and only start autoland_queue if it's true
 * set up both instances, using instance config from mjessome
 * used code from github, which is not actually accessible from corpdmz, so this will need to change (to hg) before this goes into production.
 * code changes to specify rabbitmq host in config

And then we discovered that the rabbitmq hosts are in phx1, and the VMs are in scl3.  Which isn't going to work.  The easiest fix is to move the hosts to phx1, I think.  I'll open a bug for that, and update the flow and ACL bugs.
bug 740198 for the move
I also added
  try_syntax=-b do -p all -u none -t none
to config.ini (TryChooser syntax: opt and debug builds on all platforms, no unit tests, no talos), and built an SQLAlchemy RPM.
Would it be possible to see a generated version of the config at this point? I would like to make sure that there isn't anything else missing from there.
(done in irc)
Depends on: 740198
Added the dependency for the move to phx1.

Next steps are:
- find a permanent home for staging repository
- land production code to hg.m.o/build/tools
- test the puppet deployment
bug 741851 for the celery flows
Depends on: 741851
Depends on: 741975, 740006
Marc, can you get the production and staging code landed today?  Once the flows are in, we should be ready to fire this up and see how it works.
Dustin:
I'm currently working on that and hopefully will have it soon.

Since nothing in hg.m.o/build/tools uses autoland code, we're going to land to a separate repository, build/autoland. This means that we'll be using a "production" tag which should be checked out on the production master, and tip on staging.

One thing that comes up with this is that autoland depends on build/tools, so we need that checked out and made available. Would cloning hg.m.o/build/tools and then using the setup.py provided there to install the libraries be a good way of doing this?

Some things that need to be added to config.ini:

[bz]
webui_url=https://bugzilla-stage-tip.mozilla.org/jsonrpc.cgi
webui_login=autoland-try@mozilla.bugs

[ldap]
branch_api=https://hg.mozilla.org/repo-group?repo=/

secrets.ini:
[bz]
webui_password=password
I added bug 742495 to get mrepo to work, as it's currently breaking puppet.
Config changes:

commit 2c2c48913a9c5a6630fcbc093844cf1a4cded4c4
Author: Dustin J. Mitchell <dustin@mozilla.com>
Date:   Wed Apr 4 14:55:56 2012 -0500

    Bug 731763: more configuration for autoland

diff --git a/modules/autoland/templates/config.ini.erb b/modules/autoland/templates/config.ini.erb
index b115cce..a4eb21a 100644
--- a/modules/autoland/templates/config.ini.erb
+++ b/modules/autoland/templates/config.ini.erb
@@ -9,6 +9,7 @@ base_url=ssh://hg.mozilla.org/
 username=autolanduser@mozilla.com
 # this ssh key is not handled by puppet and must be installed manually
 ssh_key=/home/autoland/.ssh/id_rsa
+branch_api=https://hg.mozilla.org/repo-group?repo=/
 
 [mq]
 host=<%= mq_host %>
@@ -27,6 +28,8 @@ api_url=<%=api_url%>
 url=<%=bz_url%>
 # poll frequency in seconds
 poll_frequency=180
+webui_url=https://bugzilla-stage-tip.mozilla.org/jsonrpc.cgi
+webui_login=autoland-try@mozilla.bugs
 
 [ldap]
 bind_dn=autolanduser,ou=logins,dc=mozilla
diff --git a/modules/autoland/templates/secrets.ini.template b/modules/autoland/templates/secrets.ini.template
index 6353eaa..bfddd28 100644
--- a/modules/autoland/templates/secrets.ini.template
+++ b/modules/autoland/templates/secrets.ini.template
@@ -9,3 +9,6 @@ scheduler_db_url=
 
 [mq]
 password=
+
+[bz]
+webui_password=password

----

(and I'll get the appropriate secret installed locally)
I also changed the manifests to check out $basedir/tools (-r default) and $basedir/autoland (-r $autoland_tag).  The tools dir is provided in the config file in

 [defaults]
 tools=...

I can't test this until puppet starts running again, but it's most likely ready to roll.
OK, I made some significant changes today on :solarce's advice.  This now looks *very* much like a webapp, which will make webops and the SREs happy.  At least, that's the idea.

That means that autoland can be deployed using this process:

  https://mana.mozilla.org/wiki/display/websites/Home

and in particular that we can have dev boxes updating automatically every 15 minutes, with staging and prod updated on request (usually with bugs, but we can work that out).  The mechanics of the automatic updates aren't quite worked out yet, but it's on my list.

At this point, the schedulerDbPoller.py seems to work fine (no errors).

autoland_queue fails with

Traceback (most recent call last):
  File "autoland_queue.py", line 31, in <module>
    db = DBHandler(config['databases_autoland_db_url'])
  File "/data/www/autoland-service/autoland/utils/db_handler.py", line 18, in __init__
    self.scheduler_db_meta.reflect(bind=self.engine)
  File "/usr/lib64/python2.6/site-packages/sqlalchemy/schema.py", line 2355, in reflect
    conn = bind.contextual_connect()
  File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py", line 2328, in contextual_connect
    self.pool.connect(), 
  File "/usr/lib64/python2.6/site-packages/sqlalchemy/pool.py", line 209, in connect
    return _ConnectionFairy(self).checkout()
  File "/usr/lib64/python2.6/site-packages/sqlalchemy/pool.py", line 370, in __init__
    rec = self._connection_record = pool._do_get()
  File "/usr/lib64/python2.6/site-packages/sqlalchemy/pool.py", line 757, in _do_get
    return self._create_connection()
  File "/usr/lib64/python2.6/site-packages/sqlalchemy/pool.py", line 174, in _create_connection
    return _ConnectionRecord(self)
  File "/usr/lib64/python2.6/site-packages/sqlalchemy/pool.py", line 255, in __init__
    self.connection = self.__connect()
  File "/usr/lib64/python2.6/site-packages/sqlalchemy/pool.py", line 315, in __connect
    connection = self.__pool._creator()
  File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/strategies.py", line 80, in connect
    return dialect.connect(*cargs, **cparams)
  File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/default.py", line 275, in connect
    return self.dbapi.connect(*cargs, **cparams)
sqlalchemy.exc.OperationalError: (OperationalError) unable to open database file None None

which I assume is because the configured URL is sqlite:///data/autoland_live.sqlite - should that be mysql instead?  Or somewhere else?
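
(For reference, a minimal sketch of SQLAlchemy's sqlite URL rules, which would explain the error: three slashes make the path relative to the process's working directory, four make it absolute.)

from sqlalchemy import create_engine

# sqlite:///<path> -- relative: this resolves to ./data/autoland_live.sqlite
# under whatever directory supervisord started the process in.
rel_engine = create_engine('sqlite:///data/autoland_live.sqlite')

# sqlite:////<path> -- absolute: this resolves to /data/autoland_live.sqlite.
abs_engine = create_engine('sqlite:////data/autoland_live.sqlite')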

All I have in the hgpusher log is

2012-04-05 16:02:19,022 hgpusher        main    An error occurred: [Errno 13] Permission denied: 'build'

Other than fixing the above, my list is:
 * ganglia
 * nagios
 * system to automatically restart supervisord services after deploy
 * automatic updates for dev environments (waiting until we *have* a dev env)
 * system docs (I really should have done this months ago)
Oh, we should put the autoland DB on mysql too, right?
I just saw this now for some reason so sorry for the late reply.

(In reply to Dustin J. Mitchell [:dustin] from comment #54)
> ... ... ...
> 
> At this point, the schedulerDbPoller.py seems to work fine (no errors).
> 
> autoland_queue fails with
> 
> ... ... ...
>
> sqlalchemy.exc.OperationalError: (OperationalError) unable to open database
> file None None
> 
> which I assume is because the configured URL is
> sqlite:///data/autoland_live.sqlite - should that be mysql instead?  Or
> somewhere else?
The database is a sqlite db, and I would be wary of switching over to mysql without a bit of testing and tweaking, since a few things may need to be changed. The file autoland/data/autoland_sqlite.sql needs to be used to generate a new db, `sqlite3 data/autoland_live.sqlite < autoland_sqlite.sql`.

> 
> All I have in the hgpusher log is
> 
> 2012-04-05 16:02:19,022 hgpusher        main    An error occurred: [Errno
> 13] Permission denied: 'build'
In the config there is [default] work_dir=build
which points to where the working directory should be (contains checked out repos, patch files, etc). This will need to be somewhere writable by the user.
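
A minimal sketch of a startup check for this (the [default]/work_dir names follow the comment above; the config filename is an assumption):

import os
from ConfigParser import SafeConfigParser

config = SafeConfigParser()
config.read('config.ini')
work_dir = config.get('default', 'work_dir')  # e.g. /data/workdir/hgpusher
# Fail early with a clear message instead of 'Permission denied' mid-run.
if not (os.path.isdir(work_dir) and os.access(work_dir, os.W_OK)):
    raise RuntimeError('work_dir %r is missing or not writable' % work_dir)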
OK, I added a workdir (/data/workdir) to all three services, created by puppet.

autoland_queue
 - updated in config.ini:
   autoland_db_url=sqlite:////data/workdir/autoland_queue/autoland_live.sqlite
 - db generated as per above
with the result that autoland_queue runs and connects to rabbitmq, but
2012-04-09 15:37:25,159 bz_utils        request REQUEST ERROR: <urlopen error [Errno 110] Connection timed out>: https://api-dev.bugzilla.mozilla.org/test/latest/bug/?whiteboard=\[autoland.*\]&whiteboard_type=regex&include_fields=id,whiteboard&username=release@mozilla.com&password=5EgfR97CVW


hgpusher
 - updated in config.ini:
   work_dir=/data/workdir/hgpusher
with the result that it holds steady at:
2012-04-09 15:38:40,541 hgpusher        main    Working directory: hgpusher.0


schedulerdbpoller
 - added --cache-dir=$workdir/schedulerdbpoller/cache to the invocation per bug 743001.  I can add extra arguments as you add them to the source.
with the (expected for now) result
IOError: [Errno 13] Permission denied: '/data/www/autoland-service/autoland/schedulerDBpoller.log'


Do we need more flows for the Bugzilla API?  Can you make a comprehensive flow list, 'cuz I think netops is going to start black-holing my requests soon :)
marking as moco confidential because of password leakage
Group: mozilla-corporation-confidential
(In reply to Dustin J. Mitchell [:dustin] from comment #58)
> OK, I added a workdir (/data/workdir) to all three services, created by
> puppet.
> 
> autoland_queue
>  - updated in config.ini:
>   
> autoland_db_url=sqlite:////data/workdir/autoland_queue/autoland_live.sqlite
>  - db generated as per above
> with the result that autoland_queue runs and connects to rabbitmq, but
> 2012-04-09 15:37:25,159 bz_utils        request REQUEST ERROR: <urlopen
> error [Errno 110] Connection timed out>:
> https://api-dev.bugzilla.mozilla.org/test/latest/bug/?whiteboard=\[autoland.
> *\]&whiteboard_type=regex&include_fields=id,
> whiteboard&username=release@mozilla.com&password=5EgfR97CVW
> 
I've got a new password which I'll have to get to you for the config file. This is an issue on the BzApi end, and has affected the current live setup as well.

> 
> hgpusher
>  - updated in config.ini:
>    work_dir=/data/workdir/hgpusher
> with the result that it holds steady at:
> 2012-04-09 15:38:40,541 hgpusher        main    Working directory: hgpusher.0
> 
> 
> schedulerdbpoller
>  - added --cache-dir=$workdir/schedulerdbpoller/cache to the invocation per
> bug 743001.  I can add extra arguments as you add them to the source.
> with the (expected for now) result
> IOError: [Errno 13] Permission denied:
> '/data/www/autoland-service/autoland/schedulerDBpoller.log'
> 
> Do we need more flows for the Bugzilla API?  Can you make a comprehensive
> flow list, 'cuz I think netops is going to start black-holing my requests
> soon :)
I'm pretty sure all we need is flows to the SchedulerDb and the rabbitmq hosts. I'm not sure how locked down these machines are, but we also need to be able to access hg, ldap, bugzilla.
password changed, removing confidential flag
Group: mozilla-corporation-confidential
Ugh, sorry, and thanks.  Passwords in URLs :'-(

I'll get the new pw from marc and update autoland1/2

(In reply to Marc Jessome[:mjessome] from comment #60)
> I'm pretty sure all we need is flows to the SchedulerDb and the rabbitmq
> hosts. I'm not sure how locked down these machines are, but we also need to
> be able to access hg, ldap, bugzilla.

They're very locked-down.  Anything not explicitly allowed is forbidden, basically.

So, that will be hg.m.o for both ssh and http; the LDAP server in the config; the various bugzilla VIPs in the config (including backup sites), tbpl, and self-serve.  I'll see if I can suss out exactly what those flows should be, and get a bug filed.
With the updated password, and flows in, I'm still getting a 400 in the autoland poller logs.  The same URL in my browser says "message: Invalid username or password".  I verified this is using the new password you supplied.  Is this related to the api-dev move?
https://api-dev.bugzilla.mozilla.org/test/latest/
should be
https://api-dev.bugzilla.mozilla.org/tip/

I also made sure passwords were changed on /tip/ .
(In reply to Marc Jessome[:mjessome] from comment #64)
> https://api-dev.bugzilla.mozilla.org/test/latest/
> should be
> https://api-dev.bugzilla.mozilla.org/tip/
> 
> I also made sure passwords were changed on /tip/ .

Is this for access to the production bmo Bugzilla instead? If so, you probably want https://api-dev.bugzilla.mozilla.org/latest/ instead. If you're testing, /tip/ is fine, but that's probably not the best use of a production service. :)
Are the source code changes ready to roll?  And, please clarify which api-dev URL I should use - I'm not clear from the last few comments.
With the latest code, and /tip/, I get

2012-04-12 18:22:26,128 autoland_queue  bz_search_handler       Flagged for landing on branches: [u'users/mjessome_mozilla.com/mozilla-central', u'try']
2012-04-12 18:22:26,129 autoland_queue  bz_search_handler       Branch users/mjessome_mozilla.com/mozilla-central does not exist.
2012-04-12 18:22:26,130 autoland_queue  bz_search_handler       Branch try does not exist.

hgpusher seems happy as before.

schedulerdbpoller gives:

ConfigParser.NoOptionError: No option 'posted_bugs' in section: 'log'

which seems like a new bug!  I'd like to get this code into production as-is (with the necessary modifications to support that) before continuing to make changes.  Let me know what's next.
posted_bugs will need to be in the [log] section of the config, as it is in the new config.ini-dist in the hg repository. It will need to be somewhere writable by the user running schedulerDBpoller.
Just to clarify, it should be a writable file path, e.g. /data/postedbugs.log.
schedulerdbpoller looks good

-sh-4.1$ cd /data/www/autoland-service/autoland && python schedulerDBpoller.py -b try -c config.ini -u None -p None --verbose --cache-dir=/data/workdir/schedulerbpoller/cache/
[RabbitMQ] Established connection to generic-celery1.stage.seamicro.phx1.mozilla.com.
-sh-4.1$

I think we're in good shape?  What's next?
A bit more work in IRC brings us to hgpusher trying to clone from https://hg.mozilla.org, based on an entry I had added to the DB on request:

10:25 < mjessome> oh, and we need to enable try branch in the database. If you `sqlite3 /path/to/autoland_live.sqlite`, and "INSERT INTO branches VALUES (1, 'try', 'https://hg.mozilla.org/try', 80, 'enabled', 0, 0);"

I just now changed that to http://...  I'll file a bug to get the https flow, since that's probably smarter all around.
(In reply to Dustin J. Mitchell [:dustin] from comment #71)
> A bit more work in IRC brings us to hgpusher trying to clone from
> https://hg.mozilla.org, based on an entry I had added to the DB on request:
> 
> 10:25 < mjessome> oh, and we need to enable try branch in the database. If
> you `sqlite3 /path/to/autoland_live.sqlite`, and "INSERT INTO branches
> VALUES (1, 'try', 'https://hg.mozilla.org/try', 80, 'enabled', 0, 0);"
> 
> I just now changed that to http://...  I'll file a bug to get the https
> flow, since that's probably smarter all around.
Will this also give ssh:// flows to hg.mozilla.org? We can't push over https, only pull -- so ssh:// is required for pushing.
SSH (and http) is already set up to hg.m.o.
Any news on those https flows? Could you add it as a blocker?
Once that is ready, I hope we can do a test try-landing. To make those test runs a bit easier, is there some way that I can access the logs without having to bug you to pastebin them?
Thanks
The https flows were fixed last week.

We can probably set something up with screen to look at the logs - look me up in IRC.
We made some small fixes (regarding expansion of ~ in pathnames) and got hgpusher pushing to hg.  I'm going to add a bug to get SSH flows into the systems from MPT-VPN and allow mjessome to log in.
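
(For reference, a sketch of the class of fix; the key name follows config.ini.erb above. open() and subprocess don't expand '~' the way an interactive shell does, so it has to be done explicitly:)

import os

# ssh_key may arrive from config.ini as '~/.ssh/id_rsa'; expand it before use.
ssh_key = os.path.expanduser('~/.ssh/id_rsa')  # -> /home/autoland/.ssh/id_rsa
assert os.path.isabs(ssh_key)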
mjessome and lsblakk have sudo access on the host now (so just the flow remains - I'll copy you both on that bug).
/home/mjessome/production on autoland1 contains the autoland_live.sqlite and postedbugs.log that can be dropped into "/data/workdir/autoland_queue/autoland_live.sqlite" and "/data/workdir/schedulerdbpoller/postedbugs.log" respectively.

production config is located at /home/mjessome/production/config.ini-production, and still contains passwords; I wasn't sure if you'd want to drop the config in or puppetize, so I left them there.

I'm not sure how the ssh key is being handled, so just a reminder about it!
:solarce, can you set up a production celery account for this?  comment 8 only lists dev and stage.
This should be ready to use now

[root@node339.seamicro.phx1 ~]# rabbitmqctl list_vhosts | grep autoland
autoland_prod

The password follows the same convention as dev and stage
I re-enabled schedulerdbpoller, and moved the hosts to a new prod cluster.  I'll need to do some re-jiggering for the cluster change.
OK, this is now in place, and according to the logfiles, it's running fine.  Can you verify and close?

Things to be done on other bugs:
 - refactor manifests to sit inside modules/webapp
 - update docs to talk about puppet configs
 - include the revision in the update script
 - dev/staging VMs (probably in scl3)
 - automatically restart daemons in update-www.sh
 - nagios and ganglia
So, from autoland2:

[root@autoland2.shared.phx1 ~]# wget https://bugzilla.mozilla.org/attachment.cgi?id=618738
--2012-04-26 11:50:59--  https://bugzilla.mozilla.org/attachment.cgi?id=618738
Resolving bugzilla.mozilla.org... 63.245.217.60
Connecting to bugzilla.mozilla.org|63.245.217.60|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://bug749284.bugzilla.mozilla.org/attachment.cgi?id=618738 [following]
--2012-04-26 11:50:59--  https://bug749284.bugzilla.mozilla.org/attachment.cgi?id=618738
Resolving bug749284.bugzilla.mozilla.org... 63.245.217.61
Connecting to bug749284.bugzilla.mozilla.org|63.245.217.61|:443... ^C

which, I suspect, means we need moar flows.  I'd like to get justdave's input here to try to cast a future-proof net of bugzilla IPs (including the impending scl3 cluster).
Won't know 'em till we get 'em.  But yeah, the attachments are on a different IP because not enough people support SNI yet and it has a different SSL cert.
17:31 < mjessome> dustin: all is working :D

I'll get new bugs open for the follow-on stuff.  Less than 100 comments, woo!
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Bugs filed for the follow-on:
 * bug 749469 (puppet implementation fixes)
 * bug 749470 (better deployment automation)
 * bug 749471 (dev/staging)
I'll be talking to the dev services group about how autoland works, so there's even more shared knowledge.
Product: mozilla.org → mozilla.org Graveyard