Closed Bug 712951 Opened 13 years ago Closed 11 years ago

Production server for new PTO app

Categories

(mozilla.org Graveyard :: Server Operations, task)

All
Other
task
Not set
normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: peterbe, Assigned: cturra)

Details

(Whiteboard: [pending testing completion])

The new PTO app is practically finished. It needs some final cosmetic touches, and it needs to be dogfooded and signed off on a dev server (I plan to use Khan for this) by the ladies on the fourth floor.

The code to be installed is this: https://github.com/peterbe/pto (I'll move this to the "mozilla" account when going live)

As you can see from the requirements (https://github.com/peterbe/pto/blob/develop/requirements/compiled.txt), it's the standard Playdoh stuff plus a specific version of python-ldap.

I am going to need SSH access to run certain django commands when we migrate the old database.
What's the timeframe for this needing to go live? I'm currently working on a new cluster for these sorts of apps, but I'm not sure there'll be anything concrete before the new year. If there's a bigger hurry, we can probably shoehorn it into the aging existing intranet server and migrate later, but it would be nice to start fresh on the new cluster if it can wait a few weeks.
Assignee: server-ops → jdow
Just clarifying a few things:

* It needs a MySQL database, either local or over the network on a cluster.
* It does NOT need LDAP authentication via Apache. Internally, it does need to connect to our LDAP server (whatever that means in terms of location and VPN).
* I don't really need SSH access, but I don't have the database dump and instructions for importing/migrating it just yet. I can document this later.
I suggested Peter set up a mana page with the details.

Jabba: a few weeks would be fine, so let's go with the new cluster.
Wondering if we need a dev/stage setup for this too.
(In reply to Shyam Mani [:fox2mike] from comment #4)
> Wondering if we need a dev/stage setup for this too.

Not going to do it now but if I can set it up on Khan I don't think we need it. Otherwise, if it's easy, a stage server would be nice.
dev and stage servers are already in the works for this cluster, so we'll have that available as well.
Any update here?
I've got Zeus set up for pto.mozilla.org, as well as pto-dev.allizom.org and pto.allizom.org. I have mysql flows and database in place for pto.mozilla.org, but need to get some network flows added for -dev and stage for the mysql access.

It would be good if :peterbe could write a mana page in the Websites space with any specific instructions on how to set this up: which settings are needed, whether memcached is required, what manage.py commands are needed, and all that stuff.

What I've got so far is:
git clone https://github.com/peterbe/pto.git pto
cd pto
git submodule update --init

and deploy to webheads. I'm guessing at the Apache configuration as well, so any pointers there would speed things up, like whether there is an alias for /media or anything like that.
Also, I have python-ldap 2.3.10. Is this sufficient?
(In reply to Justin Dow [:jabba] from comment #9)
> also, I have python-ldap-2.3.10 . Is this sufficient?

I hope so. I was using 2.3.13 locally, but we're really just scratching the surface of what that lib can do, so we should hopefully be fine.
When we install the compiled Python stuff (`pip install -r requirements/compiled.txt`) it might pick up 2.3.13, but it was good that you installed the OS-level package, since that'll take care of installing the dev libs needed by the Python package.
For the record, the Mana documentation is here: https://mana.mozilla.org/wiki/display/websites/pto.mozilla.org
We don't use pip in production, so I'll need to build RPMs for any other requirements (like hmac and hashlib).
(In reply to Justin Dow [:jabba] from comment #12)
> we don't use pip in production, so I'll need to build RPMs for any other
> requirements (like hmac and hashlib)
That's fine. There are also specific versions of MySQL-python and Jinja2. I hope I can get those exact versions for the sake of predictable development.
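Since production installs from RPMs rather than pip, it may help to diff the exact pins in `requirements/compiled.txt` against what is actually packaged. A minimal sketch, assuming the file uses plain `name==version` lines; `parse_pins`, `find_mismatches` and every version number except the two python-ldap ones mentioned in this bug are hypothetical.

```python
def parse_pins(text):
    """Return {lowercased package name: version} for every 'name==version' line."""
    pins = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if "==" not in line:
            continue  # skip blank lines and anything that isn't an exact pin
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def find_mismatches(pins, packaged):
    """List (name, pinned, packaged-or-None) for pins the packages don't match exactly."""
    mismatches = []
    for name, wanted in sorted(pins.items()):
        have = packaged.get(name)
        if have != wanted:
            mismatches.append((name, wanted, have))
    return mismatches


# Hypothetical pins; only the python-ldap versions come from this bug.
compiled = """
MySQL-python==1.2.3   # made-up version
Jinja2==2.5.5         # made-up version
python-ldap==2.3.13
"""
rpms = {"mysql-python": "1.2.3", "jinja2": "2.5.5", "python-ldap": "2.3.10"}
print(find_mismatches(parse_pins(compiled), rpms))
# [('python-ldap', '2.3.13', '2.3.10')]
```

A mismatch is not necessarily fatal (2.3.10 was judged probably fine above); the point is to see the differences before deciding which ones matter.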

Also, apparently Commander is recommended for updates. 
https://github.com/mozilla/playdoh/issues/96
I'm not familiar with it at all, but maybe we should use it.
I've got Zeus VIPs set up and flows in place for MySQL, etc. The -dev, stage and prod clusters are ready to go. I'm handing this off to jakem, who will better be able to figure out how to get the app running.
Assignee: jdow → nmaul
Yay Jake! 

Let's ignore the thoughts on a new script for deployment for now. 

Have you been able to set up all the external requirements? I.e. the compiled ones. 
Hit me up on #webdev if you have any questions
Jake, do you know when you might get to this?
Any progress on this?
Ping?
Sorry I haven't gotten back to you sooner on this... it's marked as minor prio, and we (webops) have been busy the last couple weeks.

As far as I can tell, the external dependencies should be present. However, I'm getting a 500 ISE error after logging in to LDAP here: https://pto.mozilla.org/. I'm trying to track this down now, but given that it's a Django app, you should be able to get emails.

I see a few odd things in settings/local.py that we should correct:

1) "DEV = True" on production. This should be false... in fact, I'm not sure that infrasec allows this to be set even on 'dev' instances without extra precautions. In this case the LDAP wall that's currently in place might be sufficient.

2) No ADMINS = () set... this sends the traceback emails to someone. Who should I set this to?

3) DEBUG = TEMPLATE_DEBUG = DEBUG_PROPAGATE_EXCEPTIONS = True ... should probably be False on production... same concern as #1 above. Offhand I'm not sure which it was that infrasec was concerned about... DEV or DEBUG.

4) I'm interested in this:
PROD_DETAILS_DIR='/mnt/netapp/pto.mozilla.org/product_details'
Specifically, I don't think this is how we typically manage this, but honestly it's been a while since I set up a new site that used product_details, and it'd be easy to consider a lot of our existing ones "legacy" or "custom" in this regard, so maybe that's fine.
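The first three points lend themselves to an automated pre-launch check. A minimal sketch, assuming the settings are available as a plain mapping (for example, copied out of the settings module); `audit_settings` and `MUST_BE_FALSE` are hypothetical names, not part of Playdoh or Django.

```python
# Settings that should be False on every server (dev, stage and prod alike).
MUST_BE_FALSE = ("DEV", "DEBUG", "TEMPLATE_DEBUG", "DEBUG_PROPAGATE_EXCEPTIONS")


def audit_settings(settings):
    """Return human-readable problems found in a mapping of setting names to values."""
    problems = [f"{name} is enabled" for name in MUST_BE_FALSE if settings.get(name, False)]
    if not settings.get("ADMINS"):
        problems.append("ADMINS is empty, so nobody receives traceback emails")
    return problems


# Values as described above for settings/local.py on the production node.
found = {
    "DEV": True,
    "DEBUG": True,
    "TEMPLATE_DEBUG": True,
    "DEBUG_PROPAGATE_EXCEPTIONS": True,
    "ADMINS": (),
}
for problem in audit_settings(found):
    print(problem)
```

Point 4 (PROD_DETAILS_DIR) is left out on purpose, since whether that path is acceptable is a judgment call rather than a yes/no check.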
On the 500 ISE error, I see these in the error_log on one of the servers:

[Mon Feb 13 11:38:58 2012] [error] [client 10.8.33.240] File does not exist: /data/www/pto.mozilla.org/pto/static

[Mon Feb 13 11:45:20 2012] [error] /data/www/pto.mozilla.org/pto/vendor/lib/python/nose/plugins/manager.py:405: UserWarning: Module memcache was already imported from /data/www/pto.mozilla.org/pto/vendor-local/lib/python/memcache.py, but /usr/lib/python2.6/site-packages is being added to sys.path

[Mon Feb 13 11:45:20 2012] [error]   import pkg_resources
[Mon Feb 13 11:45:21 2012] [error] [client 10.8.33.240] mod_wsgi (pid=21537): Exception occurred processing WSGI script '/data/www/pto.mozilla.org/pto/wsgi/playdoh.wsgi'.
.... (cut many traceback lines)
[Mon Feb 13 11:45:21 2012] [error] [client 10.8.33.240] DatabaseError: (1146, "Table 'pto_app.django_session' doesn't exist")


I am able to access the MySQL DB from this node, but there are no tables in it. Is there a schema somewhere that needs to be run in a migration, maybe?
(In reply to Jake Maul [:jakem] from comment #19)
> 1) "DEV = True" on production. This should be false... in fact, I'm not sure
> that infrasec allows this to be set even on 'dev' instances without extra
> precautions. In this case the LDAP wall that's currently in place might be
> sufficient.

DEV isn't a built-in Django setting, so there might be issues but it's app-dependent. (Unless this is part of Playdoh?)

> 3) DEBUG = TEMPLATE_DEBUG = DEBUG_PROPAGATE_EXCEPTIONS = True ... should
> probably be False on production... same concern as #1 above. Offhand I'm not
> sure which it was that infrasec was concerned about... DEV or DEBUG.

It's DEBUG, that and TEMPLATE_DEBUG should definitely be False.
DEV=True should never be set except on your laptop.  It's a playdoh l10n thing.

My rule is: if you have to resort to DEV|DEBUG|TEMPLATE_DEBUG=True, you probably need more log.debug('foo').
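Following that rule, a minimal sketch of tracing an LDAP login with Python's standard `logging` module instead of turning on DEBUG. The logger name `pto.auth`, the `authenticate` function and the dictionary standing in for the LDAP directory are hypothetical; the PTO app's real authentication code is not shown in this bug.

```python
import logging

log = logging.getLogger("pto.auth")  # hypothetical logger name


def authenticate(username, lookup):
    """Look up a user via `lookup`, logging each step rather than relying on DEBUG pages."""
    log.debug("LDAP lookup for %s", username)
    entry = lookup(username)
    if entry is None:
        log.warning("no LDAP entry for %s", username)
        return None
    log.debug("LDAP entry found for %s", username)
    return entry


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    directory = {"peterbe@mozilla.com": {"cn": "Peter Bengtsson"}}  # fake directory
    authenticate("peterbe@mozilla.com", directory.get)
    authenticate("nobody@mozilla.com", directory.get)
```

The log level can then be raised or lowered in the logging configuration per environment, so these messages can be turned on without exposing tracebacks to users.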
(In reply to Jake Maul [:jakem] from comment #19)
> Sorry I haven't gotten back to you sooner on this... it's marked as minor
> prio, and we (webops) have been busy the last couple weeks.
> 
> As far as I can tell the external dependencies should be be present.
> However, I'm getting a 500 ISE error after logging in to LDAP here:
> https://pto.mozilla.org/. Trying to track this down now, but given that it's
> a django app you should be able to get emails.
> 
Can you confirm, without revealing too much, what the LDAP settings are? I.e.
>>> from django.conf import settings
>>> settings.AUTHENTICATION_BACKENDS
>>> settings.AUTH_LDAP_SERVER_URI
>>> settings.AUTH_LDAP_BIND_DN
>>> settings.AUTH_LDAP_BIND_PASSWORD  # be careful what you paste here
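Along the same lines, a small hypothetical helper for pasting LDAP settings into a bug without leaking the bind password. `redact`, `describe`, `SECRET_MARKERS` and the placeholder values are illustrative only; in a `./manage.py shell` session the mapping could be built with `{n: getattr(settings, n, None) for n in names}`.

```python
# Substrings that mark a setting as secret (an assumption; extend as needed).
SECRET_MARKERS = ("PASSWORD", "SECRET", "KEY")


def redact(name, value):
    """Mask non-empty values whose setting name looks secret."""
    if value and any(marker in name.upper() for marker in SECRET_MARKERS):
        return "<redacted>"
    return value


def describe(settings, names):
    """Format the named settings one per line, with secrets masked."""
    return "\n".join(f"{name} = {redact(name, settings.get(name))!r}" for name in names)


example = {  # placeholder values, not the real configuration
    "AUTH_LDAP_SERVER_URI": "ldap://ldap.example.com/",
    "AUTH_LDAP_BIND_PASSWORD": "not-the-real-one",
}
print(describe(example, ["AUTH_LDAP_SERVER_URI", "AUTH_LDAP_BIND_PASSWORD"]))
# AUTH_LDAP_SERVER_URI = 'ldap://ldap.example.com/'
# AUTH_LDAP_BIND_PASSWORD = '<redacted>'
```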

> I see a few odd things in settings/local.py that we should correct:
> 
> 1) "DEV = True" on production. This should be false... in fact, I'm not sure
> that infrasec allows this to be set even on 'dev' instances without extra
> precautions. In this case the LDAP wall that's currently in place might be
> sufficient.
> 
Let's focus on getting it up and running first. The local.py-dist is often geared too much towards the Nth developer starting out. Perhaps local.py-dist should instead be written specifically for IT.

Once we know it's running, let's switch this to DEBUG=False.

> 2) No ADMINS = () set... this sends the traceback emails to someone. Who
> should I set this to?
> 
Again, I didn't want to be too deployment-specific. This can be filled in as:

ADMINS = (
    ('Peter Bengtsson', 'peterbe@mozilla.com'),
)

> 3) DEBUG = TEMPLATE_DEBUG = DEBUG_PROPAGATE_EXCEPTIONS = True ... should
> probably be False on production... same concern as #1 above. Offhand I'm not
> sure which it was that infrasec was concerned about... DEV or DEBUG.
> 
Yeah, switch this to False when you get it up and running. No production, staging or dev system should ever run in DEBUG=True. 
The settings/local.py is entirely your file to own. 

> 4) I'm interested in this:
> PROD_DETAILS_DIR='/mnt/netapp/pto.mozilla.org/product_details'
> Specifically, I don't think this is how we typically manage this, but
> honestly it's been a while since I set up a new site that used
> product_details, and it'd be easy to consider a lot of our existing ones
> "legacy" or "custom" in this regard, so maybe that's fine.

I can't help there. I don't think this app uses any of that anyway. It certainly doesn't matter when I do local development. We can probably ignore this.
(In reply to Jake Maul [:jakem] from comment #20)
> On the 500 ISE error, I see these in the error_log on one of the servers:
> 
> [Mon Feb 13 11:38:58 2012] [error] [client 10.8.33.240] File does not exist:
> /data/www/pto.mozilla.org/pto/static
> 
> [Mon Feb 13 11:45:20 2012] [error]
> /data/www/pto.mozilla.org/pto/vendor/lib/python/nose/plugins/manager.py:405:
> UserWarning: Module memcache was already imported from
> /data/www/pto.mozilla.org/pto/vendor-local/lib/python/memcache.py, but
> /usr/lib/python2.6/site-packages is being added to sys.path
> 
> [Mon Feb 13 11:45:20 2012] [error]   import pkg_resources
> [Mon Feb 13 11:45:21 2012] [error] [client 10.8.33.240] mod_wsgi
> (pid=21537): Exception occurred processing WSGI script
> '/data/www/pto.mozilla.org/pto/wsgi/playdoh.wsgi'.
> .... (cut many traceback lines)
> [Mon Feb 13 11:45:21 2012] [error] [client 10.8.33.240] DatabaseError:
> (1146, "Table 'pto_app.django_session' doesn't exist")
> 
> 
> I am able to access the MySQL DB from this node, but there are no tables in
> it. Is there a schema somewhere that needs to be run in a migration, maybe?

When setting it up for the first time, run `./manage.py syncdb`. That's also going to ask if you want to create an admin user. Say no to that. Later we'll need to make me an admin so that I can make other admins.

I thought I had this documented in the mana page. If not, my bad.
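The earlier `DatabaseError: (1146, "Table 'pto_app.django_session' doesn't exist")` is what an empty schema looks like at request time. A sketch of checking for that before serving traffic, using SQLite from the standard library as a stand-in for MySQL (on MySQL the equivalent query would go against `information_schema.tables`). `missing_tables` is a hypothetical helper, the table list is a small illustrative subset, and the `CREATE TABLE` statements are simplified, not what syncdb actually emits.

```python
import sqlite3


def missing_tables(conn, required):
    """Return the required table names that don't exist yet (SQLite stand-in)."""
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    existing = {name for (name,) in rows}
    return sorted(set(required) - existing)


REQUIRED = ["auth_user", "django_session"]  # illustrative subset

conn = sqlite3.connect(":memory:")  # empty database, like pto_app before syncdb
print(missing_tables(conn, REQUIRED))  # ['auth_user', 'django_session']

# Simplified stand-in for what syncdb does: create the tables for installed apps.
conn.execute("CREATE TABLE django_session (session_key TEXT PRIMARY KEY, session_data TEXT, expire_date TEXT)")
conn.execute("CREATE TABLE auth_user (id INTEGER PRIMARY KEY, username TEXT)")
print(missing_tables(conn, REQUIRED))  # []
```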
Excellent, the "manage.py syncdb" took care of making all the tables. The app seems to be working now:

https://pto.mozilla.org/


DEV, DEBUG, and ADMINS all set as discussed above. The former two are "False", and ADMINS is set to just you (peterbe@mozilla.com).


The login on that page doesn't work, but that makes good sense to me... the LDAP settings appear to be wrong.

LDAP settings:

AUTH_LDAP_SERVER_URI = 'ldap://pm-ns.mozilla.org/'
AUTH_LDAP_BIND_DN = 'uid=bindweb2,ou=logins,dc=mozilla'
AUTH_LDAP_BIND_PASSWORD = '<snip>'

This is almost certainly not going to work, as this cluster is in PHX1 and IIRC pm-ns.mozilla.org is in SJC1. Jabba, can you comment on this?
pm-ns.mozilla.org is correct. That hostname resolves to a local cluster in each data center.
(In reply to Jake Maul [:jakem] from comment #25)
> Excellent, the "manage.py syncdb" took care of making all the tables. The
> app seems to be working now:
> 
> https://pto.mozilla.org/
> 
More importantly, can you set up pto.allizom.org too? That's what we'll use to stage and sign off before going live.

> 
> DEV, DEBUG, and ADMINS all set as discussed above. The former two are
> "False", and ADMINS is set to just you (peterbe@mozilla.com).
> 
> 
> The login on that page doesn't work, but that makes good sense to me... the
> LDAP settings appear to be wrong.
>
I can't access it to test because it pops up with a Basic Authentication pop-up.

 
> LDAP settings:
> 
> AUTH_LDAP_SERVER_URI = 'ldap://pm-ns.mozilla.org/'
> AUTH_LDAP_BIND_DN = 'uid=bindweb2,ou=logins,dc=mozilla'
> AUTH_LDAP_BIND_PASSWORD = '<snip>'
> 
> This is almost certainly not going to work, as this cluster is in PHX1 and
> IIRC pm-ns.mozilla.org is in SJC1. Jabba, can you comment on this?

I'll let you and :jabba figure that one out :)
Jabba pointed me at the magic setting:

#AUTH_LDAP_START_TLS = False

This needs to be uncommented. I suspect using the SSL version of the hostname (ldaps://pm-ns.mozilla.org) *might* work, but would involve extra ACLs to be opened... probably the main port simply doesn't support TLS.

Anyway, this is functional now. I am able to log in to it.
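The trade-off described here is between StartTLS on the ordinary LDAP port and LDAP over TLS on its own port. A hypothetical helper that spells out which transport a given `AUTH_LDAP_SERVER_URI` and StartTLS flag imply; 389 and 636 are the standard ports for `ldap://` and `ldaps://`, and the different port is why `ldaps://` would need extra ACLs. The function name and return format are illustrative.

```python
from urllib.parse import urlsplit

# Standard ports: plain LDAP (optionally upgraded with StartTLS) vs. LDAP over TLS.
DEFAULT_PORTS = {"ldap": 389, "ldaps": 636}


def ldap_transport(uri, start_tls):
    """Return (host, port, description) for an LDAP URI and StartTLS flag."""
    parts = urlsplit(uri)
    if parts.scheme not in DEFAULT_PORTS:
        raise ValueError(f"not an LDAP URI: {uri!r}")
    port = parts.port or DEFAULT_PORTS[parts.scheme]
    if parts.scheme == "ldaps":
        mode = "TLS from the first byte"  # StartTLS is irrelevant here
    elif start_tls:
        mode = "plaintext, upgraded with StartTLS"  # server must support StartTLS
    else:
        mode = "plaintext"
    return parts.hostname, port, mode


print(ldap_transport("ldap://pm-ns.mozilla.org/", start_tls=False))
# ('pm-ns.mozilla.org', 389, 'plaintext')
print(ldap_transport("ldaps://pm-ns.mozilla.org", start_tls=False))
# ('pm-ns.mozilla.org', 636, 'TLS from the first byte')
```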


(In reply to Peter Bengtsson [:peterbe] from comment #27)
> More importantly, can you set up pto.allizom.org too. That's what we'll use
> to stage and sign-off before going live. 

Hah. Given the bug summary I just skipped dev and stage, thinking they were already functional and/or not needed. Silly me. I will investigate them.

> I can't access it to test because it pops up with a Basic Authentication
> pop-up.

Should just be your LDAP credentials (full email for username).
Status: NEW → ASSIGNED
(In reply to Jake Maul [:jakem] from comment #28)
> Jabba pointed me at the magic setting:
> 
> #AUTH_LDAP_START_TLS = False
> 
> This needs to be uncommented. I suspect using the SSL version of the
> hostname (ldaps://pm-ns.mozilla.org) *might* work, but would involve extra
> ACLs to be opened... probably the main port simply doesn't support TLS.
> 
> Anyway, this is functional now. I am able to log in to it.
> 
> 
> (In reply to Peter Bengtsson [:peterbe] from comment #27)
> > More importantly, can you set up pto.allizom.org too. That's what we'll use
> > to stage and sign-off before going live. 
> 
> Hah. Given the bug summary I just skipped dev and stage, thinking they were
> already functional and/or not needed. Silly me. I will investigate them.
> 
I definitely care about stage. Dev on the other hand I don't think I will ever use.

> > I can't access it to test because it pops up with a Basic Authentication
> > pop-up.
> 
> Should just be your LDAP credentials (full email for username).

Can you disable the htaccess stuff? The site is already 100% authentication protected. 

Also, do note that we intend to use the "master" branch for production and the "develop" branch for stage. Is this set up accordingly?

Lastly, have you got a plan for how I can upgrade stage and production? If it matters, I'm happy to upgrade stage every time jenkins passes on the build for that repo.
I think you want -dev for that. Generally what we do is set up the dev instance to update automatically (via a cron job, Jenkins, GitHub, or whatever), tracking the development branch. Stage should be an exact copy of production. So what you describe above should be "dev", and what we refer to as stage is more like "this is to test the change from one prod release to the next", whereas dev often gets lots of incremental releases, so what happens on dev might not always be what happens on prod.
:jakem, any progress on this? 

For the record, if it wasn't clear: 

* I need a -dev instance which runs from the "develop" branch. I'll use this for "staging" and get sign-off on the functionality

* I don't need a -stage server. If you're going to set one up, I won't mind but I won't use it either. 

* Can we remove the .htaccess Basic Authentication that pops up? 

* How do I deploy to -dev when I have new code ready?

* How do I deploy to -live when I have new code ready?
It looks like all 3 (dev/stage/prod) are currently using this origin:
https://github.com/peterbe/pto.git
On the branch "develop".

I presume dev needs to be on the 'develop' branch, and stage/prod should be on the 'master' branch.


Removing the Basic Auth is probably something that will need an infrasec sign-off. However, since the whole thing is behind LDAP anyway, it's probably not a big issue.


Dev is typically auto-deployed on a schedule. Every 10 minutes is pretty common. Every 5 minutes is sometimes possible, but Django apps tend to take longer to restart, so this can really mess with automated testing sometimes.


Stage is typically deployed identically to prod.

One of the primary purposes of stage is to test a deployment from current-prod to new-prod. Dev changes all the time, which makes it a poor test of an update from what's currently in prod to what's about to go live. Typical problems are settings that get tweaked in dev over time, but not migrated to prod right away... things like that have a tendency to "fall off" when the final update comes. For an app like this however, I don't imagine it's going to be a very significant issue. It really depends on the strictness of uptime requirements and the velocity of changes.

Put another way: stage is not just a place to test the code, but also to test the deployment of the code.


Deploying to prod is typically done by filing a bug and organizing a time when it should be pushed live. For smaller, less-frequently-changed apps, it's also common to file a bug to do the update whenever is convenient.

A 3rd mechanism, widely preferred but harder to set up, is some form of developer-pushed mechanism. AMO uses one called 'chief'... it works well. The problem with systems like this (apart from initial setup) is that if IT isn't involved in a code push, then IT won't be immediately on hand to respond to problems. AMO uses chief but still schedules pushes with IT, so we know about them and are available, should we be needed for something. This is also necessary for things like Apache/config changes that a dev may not have access to make on their own.
(In reply to Jake Maul [:jakem] from comment #32)
> A 3rd mechanism, widely preferred but harder to set up, is to go with some
> form of developer-pushed mechanism. AMO uses one called 'chief'... works
> well. The problem with systems like this (apart from initial setup) is that
> if IT isn't involved in a code push, then IT won't be immediately on-hand to
> respond to problems. AMO use chief but still schedules the push with IT, so
> we know about it and are available, should we be needed for something. This
> is also necessary for things like Apache / config changes that a dev may not
> have access to change on their own.

I'd encourage setting up chief even if the pushes are always scheduled with IT. It's automated and much faster, and lets you move toward not requiring IT in the future. (Which IT should consider a good thing, given how stretched y'all are ;)

I'd also encourage getting freddo to do GitHub-push-based deploys to -dev rather than time-based ones.
(In reply to Jake Maul [:jakem] from comment #32)
> It looks like all 3 (dev/stage/prod) are currently using this origin:
> https://github.com/peterbe/pto.git
> On the branch "develop".
> 
> I presume dev needs to be on the 'develop' branch, and stage/prod should be
> on the 'master' branch.
>
Yes please. 
 
> 
> Removing the Basic Auth is probably something that will need an infrasec
> sign-off. However, since the whole thing is behind LDAP anyway, it's
> probably not a big issue.
> 
Can you fix that? 
We had that discussion almost a year ago. We'll do the authentication in the app instead. There is nothing available on the site without logging in and doing it with .htaccess is unacceptable. 

> 
> Dev is typically auto-deployed on a schedule. Every 10 minutes is pretty
> common. Every 5 minutes is sometimes possible, but Django apps tend to take
> longer to restart, so this can really mess with automated testing sometimes.
> 
What does that mean? It pulls the git repo every 10 minutes and if something has changed it reloads apache?

> 
> Stage is typically deployed identically to prod.
> 
> One of the primary purposes of stage is to test a deployment from
> current-prod to new-prod. Dev changes all the time, which makes it a poor
> test of an update from what's currently in prod to what's about to go live.
> Typical problems are settings that get tweaked in dev over time, but not
> migrated to prod right away... things like that have a tendency to "fall
> off" when the final update comes. For an app like this however, I don't
> imagine it's going to be a very significant issue. It really depends on the
> strictness of uptime requirements and the velocity of changes.
> 
> Put another way: stage is not just a place to test the code, but also to
> test the deployment of the code.
> 
I hope we can just ignore stage. Maybe in a year or three we might care about staging code and deployment. 

> 
> Deploying to prod is typically done by filing a bug and organizing a time
> when it should be pushed live. For smaller, less-frequently-changed apps,
> it's also common to file a bug to do the update whenever is convenient.
> 
> A 3rd mechanism, widely preferred but harder to set up, is to go with some
> form of developer-pushed mechanism. AMO uses one called 'chief'... works
> well. The problem with systems like this (apart from initial setup) is that
> if IT isn't involved in a code push, then IT won't be immediately on-hand to
> respond to problems. AMO use chief but still schedules the push with IT, so
> we know about it and are available, should we be needed for something. This
> is also necessary for things like Apache / config changes that a dev may not
> have access to change on their own.

Having to file a bug every time sucks. I'd rather not have to do that. This site is internal, and although we're a rapidly growing company, there are really only a couple of hundred users of this site.

Is it still out of the question to just give me SSH access? That would save both you and me time and the company would be able to get a better PTO app to work for them quicker.
(In reply to Peter Bengtsson [:peterbe] from comment #34)
> (In reply to Jake Maul [:jakem] from comment #32)
> > I presume dev needs to be on the 'develop' branch, and stage/prod should be
> > on the 'master' branch.
> >
> Yes please. 

Done!


> > Removing the Basic Auth is probably something that will need an infrasec
> > sign-off.
> > 
> Can you fix that? <snip>

I have spoken to infrasec on this: I cannot remove the basic auth protection until the app has passed a security review. If you have a bug number where that already happened, please let me know... and I can remove it easily. Otherwise, there are instructions on how to file one properly here: https://wiki.mozilla.org/WebAppSec/Security_Review_Request. Sorry I can't be of more help on this.


> What does that mean? It pulls the git repo every 10 minutes and if something
> has changed it reloads apache?

Essentially yes. There is an admin node that does a git pull (and generally other things relating to locales, submodules, compress_assets, etc), then pushes out the completed code tree to the web nodes. Typically it will also 'touch' the .wsgi file, which is sufficient to cause Apache/mod_wsgi to reload the app.
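The update loop described here could look roughly like the following. This is a sketch, not the actual webops tooling: `update_and_reload`, the branch default and the file paths are assumptions, and the step that pushes the tree out to the web nodes is omitted. The `run` parameter exists only so the git calls can be swapped out in tests.

```python
import os
import subprocess
import time


def update_and_reload(repo_dir, wsgi_file, branch="develop", run=subprocess.check_output):
    """Pull `branch`; if HEAD moved, touch the .wsgi file so mod_wsgi reloads the app."""
    def head():
        return run(["git", "rev-parse", "HEAD"], cwd=repo_dir).strip()

    before = head()
    run(["git", "pull", "--ff-only", "origin", branch], cwd=repo_dir)
    run(["git", "submodule", "update", "--init", "--recursive"], cwd=repo_dir)
    after = head()
    if before == after:
        return False  # nothing new; leave the running app alone
    # (The real admin node would also push the tree to the web nodes here.)
    now = time.time()
    os.utime(wsgi_file, (now, now))  # equivalent of `touch`
    return True
```

On the admin node something like this would be run every 10 minutes by cron (a `*/10 * * * *` entry), matching the schedule mentioned earlier.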


> > Put another way: stage is not just a place to test the code, but also to
> > test the deployment of the code.
> > 
> I hope we can just ignore stage. Maybe in a year or three we might care
> about staging code and deployment. 

From our perspective, skipping a staging environment is tantamount to accepting downtime due to a failed deployment. That said, we are willing to do this. It simply means that if a deployment breaks something, it's your fault. :)

Is this okay with you?


> Having to file a bug every time sucks. I'd rather not have to do that. This
> site is internal and although we're a rapidly growing company there's only
> really a couple of hundreds users of this site. 

I agree, it's not ideal... but it works, and it works across a wide range of sites. Virtually all of our sites are deployed like this. Over the last year it has resulted in very few deployment-related outages, and it has even let us stop announcing maintenance windows for new code deployments. Only a very few sites at the top aren't done this way... namely, AMO and SUMO. Almost everything else is deployed by filing an IT bug.

As I said though, we do have some tools that can be configured to make this self-manageable. It's quite strange for them to be used on a site as comparatively small as this one, especially at first... but certainly doable.

There is one important point to note on this: if you (web dev) manage deployments for a site, then you are responsible for it. This is no small detail: you won't have root access to the servers running the code, only the ability to deploy new versions of your code. If a problem arises, we can't necessarily drop whatever we're doing and come fix it... it will likely be broken for a little while. [1]


> Is it still out of the question to just give me SSH access? That would save
> both you and me time and the company would be able to get a better PTO app
> to work for them quicker.

Unfortunately, I can't do that. For one thing the servers in question run more than just the PTO app.

We can give you SSH access to the development server (useful for checking logs and the like), but not to prod. Even in dev you would not necessarily have access to *everything* (ex: the SVN repo where puppet manages the Apache config), but there is lots of precedent and even a framework for setting this up.



[1] In the future we plan to have some sort of environment where devs can run their own apps, with as little IT involvement as possible. The trade-off here is that there would also be minimal IT support for apps in this type of environment: the developer would be the primary support contact for outages and problem resolution. It sounds like this is what you'd want to have in this case... we're just not there yet.
(In reply to Jake Maul [:jakem] from comment #35)
> (In reply to Peter Bengtsson [:peterbe] from comment #34)
> > (In reply to Jake Maul [:jakem] from comment #32)
> > > I presume dev needs to be on the 'develop' branch, and stage/prod should be
> > > on the 'master' branch.
> > >
> > Yes please. 
> 
> Done!
> 
Awesome!

> 
> > > Removing the Basic Auth is probably something that will need an infrasec
> > > sign-off.
> > > 
> > Can you fix that? <snip>
> 
> I have spoken to infrasec on this: I cannot remove the basic auth protection
> until the app has passed a security review. If you have a bug number where
> that already happened, please let me know... and I can remove it easily.
> Otherwise, there are instructions on how to file one properly here:
> https://wiki.mozilla.org/WebAppSec/Security_Review_Request. Sorry I can't be
> of more help on this.
> 
Laura led that discussion. I don't have anything formal to prove that we were given the go-ahead. Laura is currently on holiday. 
I guess I'll have to start the security review process :(

> 
> > What does that mean? It pulls the git repo every 10 minutes and if something
> > has changed it reloads apache?
> 
> Essentially yes. There is an admin node that does a git pull (and generally
> other things relating to locales, submodules, compress_assets, etc), then
> pushes out the completed code tree to the web nodes. Typically it will also
> 'touch' the .wsgi file, which is sufficient to cause Apache/mod_wsgi to
> reload the app.
>
Fine. Cool. 
 
> 
> > > Put another way: stage is not just a place to test the code, but also to
> > > test the deployment of the code.
> > > 
> > I hope we can just ignore stage. Maybe in a year or three we might care
> > about staging code and deployment. 
> 
> From our perspective, skipping a staging environment is tantamount to
> accepting downtime due to a failed deployment. That said, we are willing to
> do this. It simply means that if a deployment breaks something, it's your
> fault. :)
>
Not funny. 
> Is this okay with you?
> 
I've been talking to :jsocol about this, and he explained staging as a dev environment for *deployment*, so to speak. So the stakeholders and I manually QA features and stuff on -dev (the develop branch), you QA on stage, and then you deploy to prod.
Keep prod but I don't think we (myself, Jill van de Ven and others in admin) will actually do the testing on that URL. 
 
> 
> > Having to file a bug every time sucks. I'd rather not have to do that. This
> > site is internal and although we're a rapidly growing company there's only
> > really a couple of hundreds users of this site. 
> 
> I agree, it's not ideal... but it works, and it works across a wide range of
> sites. Virtually all of our sites are deployed like this. Over the last year
> it has resulted in very few deployment-related outages, and even the
> complete banishment of announcing maintenance windows for new code
> deployments. It's only a very few at the top that aren't done this way...
> namely, AMO and SUMO. Almost everything else is deployed by filing an IT bug.
> 
Alright, let's go this way for now, and hopefully we can upgrade soon once it's launched...
...assuming we see lots of new features needing to be deployed on a regular basis. 

> As I said though, we do have some tools that can be configured to make this
> self-manageable. It's quite strange for them to be used on a site as
> comparatively small as this one, especially at first... but certainly doable.
> 
> There is one important point to note on this: if you (web dev) manages
> deployments for a site, then you are responsible for it. This is no small
> detail: you won't have root access to the servers running the code, only the
> ability to deploy new versions of your code. If a problem arises, we can't
> necessarily drop whatever we're doing and come fix it.. it will likely be
> broken for a little while. [1]
> 
> 
> > Is it still out of the question to just give me SSH access? That would save
> > both you and me time and the company would be able to get a better PTO app
> > to work for them quicker.
> 
> Unfortunately, I can't do that. For one thing the servers in question run
> more than just the PTO app.
> 
> We can give you SSH access to the development server (useful for checking
> logs and the like), but not to prod. Even in dev you would not necessarily
> have access to *everything* (ex: the SVN repo where puppet manages the
> Apache config), but there is lots of precedence and even a framework for
> setting this up.
> 
> 
> 
> [1] In the future we plan to have some sort of environment where devs can
> run their own apps, with as little IT involvement as possible. The trade-off
> here is that there would also be minimal IT support for apps in this type of
> environment: the developer would be the primary support contact for outages
> and problem resolution. It sounds like this is what you'd want to have in
> this case... we're just not there yet.

Project Petri? Yeah, I'm following that. Genuinely looking forward to that.
Peter asked me to take a look at this bug since I've recently been through this process with Dragnet. Obviously, the Apache Auth is sub-optimal in terms of a user experience (e.g. having to log in twice), but I agree that a secreview is probably a good idea before we make this site live to the outside. However, we'll need at least a stage environment to have a secreview, which is part of why this may not have been done yet.

Is there a way that we can make stage and dev available only to those who are on the VPN? This would probably eliminate the need for the Apache Auth and more accurately emulate the final release environment for testing and secreview purposes. 

Jake, please let me know what I can do to help get this released.
> 
> I have spoken to infrasec on this: I cannot remove the basic auth protection
> until the app has passed a security review. If you have a bug number where
> that already happened, please let me know... and I can remove it easily.
> Otherwise, there are instructions on how to file one properly here:
> https://wiki.mozilla.org/WebAppSec/Security_Review_Request. Sorry I can't be
> of more help on this.
> 
https://bugzilla.mozilla.org/show_bug.cgi?id=733406
Jake (and likely Corey as well, I think)

The secreview stuff won't be able to be done until sometime in Q2.  Having said that, I want to move ahead with this sooner (as it's a Q1 goal), so let's do it either behind Basic auth or VPN-only, whichever is easier for IT.
http://pto-dev.allizom.org/ is mostly functional. There is currently a CSRF error of some kind being reported after logging in. I don't know how to go about fixing this.


Note that per infrasec/webdev policy, I need to set DEBUG = False for all sites, dev/stage/prod. I turned it on to get this output, but it's back off now.


---------

Reason given for failure:

    CSRF token missing or incorrect.
    

In general, this can occur when there is a genuine Cross Site Request Forgery, or when Django's CSRF mechanism has not been used correctly. For POST forms, you need to ensure:

    The view function uses RequestContext for the template, instead of Context.
    In the template, there is a {% csrf_token %} template tag inside each POST form that targets an internal URL.
    If you are not using CsrfViewMiddleware, then you must use csrf_protect on any views that use the csrf_token template tag, as well as those that accept the POST data.

You're seeing the help section of this page because you have DEBUG = True in your Django settings file. Change that to False, and only the initial error message will be displayed.

You can customize this page using the CSRF_FAILURE_VIEW setting.

---------
This error is resolved, and the dev site works: https://pto-dev.allizom.org/

I can log in to HTTP Basic Auth, and into the app itself, and I get a pretty calendar. The dev database has no data from prod, so it's bereft of pre-existing information.


In talking with peterbe about this, it seems the next steps, which he will initiate, are QA on the app at pto-dev.allizom.org and introducing it to HR. Once there's something ready to go, we'll be able to deploy it.

We also need to set up a staging instance at some point, but as the new app lives at a new location (pto.mozilla.org rather than intranet.mozilla.org/pto), the new production instance *is* essentially a staging instance for this first rollout. So there's not a huge priority on that IMO.
Jake, should infrasec be looking at pto-dev.a.o or at pto.m.c?  (FWIW I can't log into the latter with my LDAP credentials - just keep getting a basic auth prompt)
I mentioned this in another bug (appsec's bug, I guess), but I would recommend looking at pto.mozilla.org, unless :peterbe believes that code will change significantly between now and release.
(In reply to Jake Maul [:jakem] from comment #43)
> I mentioned this in another bug (appsec's bug, I guess), but I would
> recommend looking at pto.mozilla.org, unless :peterbe believes that code
> will change significantly between now and release.

The code difference between pto-dev.allizom.org and pto.mozilla.org is very small. Especially in terms of security. 

I fear the pto-dev.allizom.org server runs on Atom-based SeaMicro hardware, so it'll be horrible to test against.
I've lost state on this bug. The last few emails indicate this is pending AppSec sign-off on pto-dev.allizom.org? If so, this bug should probably be resolved in favor of the review bug.

This was moved to the Generic cluster recently, I believe (bug 766190 and others), so performance on -dev should not be a concern anymore. I'm not sure whether the prod instance on generic has been configured yet, but if it's pending AppSec approval anyway, that seems unlikely to be a blocker.

In any case, :cturra is working on PTO lately, so I'll assign this to him... potentially there's nothing here for IT/WebOps to do and this can be closed out, or potentially he can use this bug to set up the new prod instance on the generic cluster, if it's not already done.
Assignee: nmaul → cturra
Severity: minor → normal
just an update. all dev, stage and prod environments are up and ready. :peterbe and i are waiting on some testing, etc which is to be conducted until jul 27th. a go-live will be discussed for after that date.
Whiteboard: [pending testing completion]
Any problem with making this bug public?
Can some member of the infra group please remove this bug from that group? Most of Mozilla (including MoCo) can't see what's going on at the moment.

It would also be good to get a status update from the infra, webdev or people side.

Gerv
:laura :gerv - i have opened this bug up per your requests. we're just waiting for signoff from payroll at this point. everything else has been ready to go for a couple months now.
Group: infra
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
cturra: is there a bug or other tracking mechanism for that signoff from payroll?

Gerv
:gerv this has all been back channel conversations to my understanding. in fact, i haven't really been part of those discussions either. 

i just realized i marked this as r/fixed in error. reopening.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
pto-dev.allizom.org is currently barfing
http://cl.ly/K2bS

:cturra, is that something you know how to fix?
:peterbe - looks like the passwd was updated... i had it reset for you.
sounds like this won't actually be making its way out. closing as r/incomplete for now.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 11 years ago
Resolution: --- → INCOMPLETE
cturra: can you provide any more information about why this rollout has been cancelled?

The current PTO system is not very usable, and someone has written a better one. Why would we not deploy it? Did it fail security review in spectacular fashion? Was it badly specced so it didn't actually solve the problem? Something else?

Gerv
(In reply to Gervase Markham [:gerv] from comment #55)
> cturra: can you provide any more information about why this rollout has been
> cancelled?
> 
Can we move that discussion to Yammer or something? It's "political" rather than technical, I think.
The good news is that supposedly they've found another solution that is, implicitly, better.
(In reply to Gervase Markham [:gerv] from comment #55)
> cturra: can you provide any more information about why this rollout has been
> cancelled?

about all the 'official' information i have here is from Laura's blog post the other day:

  http://www.laurathomson.com/2013/03/webtools-in-2012-part-2/
Product: mozilla.org → mozilla.org Graveyard