Closed Bug 969533 Opened 10 years ago Closed 9 years ago

Please install Python 2.7 on the machines in the bedrock cluster

Categories

(Infrastructure & Operations Graveyard :: WebOps: Product Delivery, task)

Type: task
Priority: Not set
Severity: minor

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: pmac, Assigned: nmaul)

References

Details

(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/36] [change - configuration])

User Story

We'd like to run bedrock on Python 2.7 for several reasons:

1. Python 2.6 has already reached end-of-life
2. Python libraries we'd like to use require a minimum of 2.7
3. It's the best jumping-off point for an eventual move to Python 3.x, and at least 2.7 will be required for us to move to Django 1.7 (which drops Python 2.6 support).

This will require at least (as far as I know):

1. Installing Python 2.7 RPM on webheads and admin node.
2. Installing Python 2.7 compatible version of mod_wsgi on webheads and admin node.
3. Configuring mod_wsgi to use the Python 2.7 executable.
4. Installing bedrock's compiled dependencies in the new Python 2.7 environment (e.g. MySQL-Python, Jinja2, lxml, etc.)

I've heard that this is now doable, and it would be very nice to have for bedrock. This would include the dev/demo server, stage, prod, and the admin node.
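
For illustration, here is a minimal sketch of what steps 3 and 4 might look like from the application side, assuming the usual mod_wsgi-plus-virtualenv pattern; every path and module name below is hypothetical, not our actual layout:

    # Hypothetical WSGI entry point; all paths and the settings module are assumptions.
    import os

    # Side-installed Python 2.7 virtualenv (made-up location).
    VENV = "/data/www/bedrock/venv-2.7"

    # Activate the virtualenv inside the embedded interpreter so its
    # site-packages (the 2.7 builds of MySQL-Python, lxml, etc.) are used
    # instead of the system 2.6 packages.
    activate_this = os.path.join(VENV, "bin", "activate_this.py")
    execfile(activate_this, dict(__file__=activate_this))

    import sys
    if sys.version_info[:2] != (2, 7):
        # Fail loudly if Apache loaded a mod_wsgi built against the old 2.6.
        raise RuntimeError("Expected Python 2.7, got %s" % sys.version)

    os.environ.setdefault("DJANGO_SETTINGS_MODULE", "bedrock.settings")

    from django.core.wsgi import get_wsgi_application
    application = get_wsgi_application()

The mod_wsgi module itself would still need to be built against 2.7, which is the part that needs Ops.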

There's no rush, but it will be quite nice to have.

Thanks for looking into this.
Whiteboard: [change - configuration]
Where did you hear this is now doable? I don't think it is, at least on our normal clusters. It might be possible to side-install 2.7, like we did with 2.6 back when our OSes had 2.4 as the default... but that would mean recompiling and side-installing a lot of other packages as well. Maybe someone has already done this... not sure.

PaaS can use 2.7; maybe that's where this came from?

My info could be out of date... ;)
(In reply to Jake Maul [:jakem] from comment #2)
> Where did you hear this is now doable? I don't think it is, at least on our
> normal clusters.

A side-install would be just fine, and is really what I should have said. "Upgrade" is not what I meant; "make available" is a better way to put it. I heard this based on other webdev teams saying they currently have 2.7 in production, and hearing that Sumo and Input were requesting 2.7 right now. I was told that getting it onto the generic cluster is likely a non-starter, but that it might be doable for hardware dedicated to a single site (like bedrock). I'm quite happy to update our references to use "python2.7" in our deployment scripts. All that would need doing by Ops would be installing 2.7 and configuring apache to use the 2.7 mod_wsgi (I think).
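
For what it's worth, the deploy-script side of that change would be small. A rough sketch, assuming a virtualenv-based layout (the paths and file names here are hypothetical, not our actual scripts):

    # Hypothetical fragment of a deploy script; the paths, file names, and the
    # shape of bedrock's real scripts are assumptions made for illustration.
    import subprocess

    PYTHON = "/usr/bin/python2.7"        # side-installed interpreter (assumed path)
    VENV = "/data/www/bedrock/venv"      # hypothetical virtualenv location

    def rebuild_virtualenv():
        """Recreate the virtualenv against the side-installed 2.7."""
        subprocess.check_call(["virtualenv", "-p", PYTHON, VENV])
        # Compiled deps (MySQL-Python, lxml, ...) get rebuilt against the 2.7
        # headers instead of the system 2.6 ones.
        subprocess.check_call([VENV + "/bin/pip", "install",
                               "-r", "requirements/compiled.txt"])

    if __name__ == "__main__":
        rebuild_virtualenv()

Everything else (the 2.7 packages themselves and the matching mod_wsgi build) would be on the Ops side.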
Blocks: 1005361
According to https://bugzilla.mozilla.org/show_bug.cgi?id=785137#c16, we have had the ability to use python 2.7 since at least 2013-10-17. Further comments on that bug indicate that solitude and webpay are already on python 2.7.
And as bug 1005361 says, Python 2.6 has been unsupported (no security patches) for 6 months now. I know we have to keep 2.6 around because of RHEL, but for public-facing stuff we should be running a more recent, supported version.
Whiteboard: [change - configuration] → [kanban:https://kanbanize.com/ctrl_board/4/182] [change - configuration]
Summary: Please upgrade the machines in the bedrock cluster to Python 2.7 → Please install Python 2.7 on the machines in the bedrock cluster
I've updated the title and added the full description to the "user story" field. Any update on this one would be appreciated, as we're increasingly hitting a wall on feature development due to being unable to use 2.7-dependent libraries.
User Story: (updated)
@pmac: the bedrock/requirements/compiled.txt file does not appear to include peep hashes. Care to add them or shall I proceed without?
Flags: needinfo?(pmac)
This is for setting up the new virtualenvs? It's probably safe to proceed without them for now, but I can add them tomorrow. We'll be moving toward peep and away from vendor soon, but not quite yet.
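
In case it helps when we do add them, here's a rough sketch of how hash lines could be generated for each downloaded archive; this assumes peep's sha256/urlsafe-base64 comment format and is only illustrative:

    # Sketch of generating peep-style hash lines for the archives that
    # compiled.txt pulls in. Assumes peep's "# sha256:" comment format
    # (urlsafe base64 of the sha256 digest, '=' padding stripped); worth
    # double-checking against peep's own docs / `peep hash` output.
    import base64
    import hashlib
    import sys

    def peep_hash(path):
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).digest()
        return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")

    if __name__ == "__main__":
        for archive in sys.argv[1:]:
            print("# sha256: %s" % peep_hash(archive))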
Flags: needinfo?(pmac)
Depends on: 1118786
Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/182] [change - configuration] → [kanban:https://webops.kanbanize.com/ctrl_board/2/36] [change - configuration]
We have completed moving over the www-dev and demoX environments!

So far all seems well. Let's aim for doing stage and prod next week. We should be able to do one datacenter at a time and have zero downtime.
Assignee: server-ops-webops → nmaul
Stage was completed without significant issue. (writing this after the fact, so I may be remembering optimistically)

Prod had a few snags, but nothing really horrible. SCL3 upgrade is still in progress, but should be completed momentarily.


1) When doing PHX1 (first), WebQA folks were getting 500 Internal Server Error responses served from PHX1. The errors themselves were expected (those nodes were mid-upgrade), but the requests should have been routed to SCL3 (which was not broken), not PHX1. We believe this may be some sort of DNS caching or authoritativeness issue affecting the Mozilla offices; it was not reproducible from outside.

2) PHX1 took longer than expected due to an odd quirk in the bedrock puppet module that I'd forgotten about. Namely, PHX1 has a different apache config file than SCL3. IIRC, this was a quick-and-dirty hack put in place a while back because they needed slightly different settings, and can likely go away. Once I made the same change in the PHX1 configs, those nodes came right up fine.

3) SCL3 went down unexpectedly while we were letting PHX1 "bake". This is entirely my fault: I disabled puppet on these nodes, but not for long enough. They re-enabled themselves, updated the apache config, and broke (because they didn't yet have the proper virtualenv to run with python 2.7). Luckily, it appears that Dynect caught this (as it should) and stopped sending traffic to SCL3, so there was only a short incident as measured by New Relic (about 2 minutes). During that time, approximately 33% of total django traffic would have failed: of 12 nodes, 6 were working in PHX1, 2 in SCL3 had not yet broken, and 4 in SCL3 had failed.
Another minor snag: this change also involved an upgrade to a newer version of the New Relic python library, which needs a slightly newer config file. I forgot to do this beforehand, and we missed it during dev and stage as well, so it was done *after* the update was pushed live.

Not a big issue, except it does result in an *apparent* drop in throughput to bedrock during those two time windows (first PHX1, then SCL3). Of course no real drop occurred; those nodes just weren't reporting their data properly.
All done! No ill effects have surfaced as far as we can tell.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard