Closed Bug 1133837 Opened 9 years ago Closed 9 years ago

Deploy the Django 1.7 migration, API refactor and more

Categories: Tree Management :: Treeherder: Infrastructure (defect, P1)
Tracking: (Not tracked)
Status: RESOLVED FIXED
People: (Reporter: emorley, Assigned: emorley)
Creating a bug for this deployment, since we've now had (I think) three attempts at it, so we could do with somewhere to track them.
Problems we hit today:

1) We had to manually pip2.7 install Django 1.7 to prod, before deploying master, since the code on master only works with Django 1.7. However this:
  (a) broke prod, in the few minutes between updating Django and pushing the new code (the changes to Django take effect immediately; I would have thought it would be in-memory by that point for the currently running celery processes, but oh well). This is unavoidable unless we have a deploy script that takes each node offline, pip installs, updates the source and then brings it back (a rough sketch follows below); but it's worth remembering that we need to warn people in #developers before running the pip install, and to deploy the code as soon as possible afterwards.
  (b) got clobbered by puppet, which started rolling things back to 1.5. We need to remember to pause puppet when deploying package changes.
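One option would be a rolling-deploy helper along these lines. This is only a rough sketch in Python; the node names and the "lb-ctl"/"update-code" commands are hypothetical, not real Treeherder infrastructure:

# Hypothetical rolling-deploy sketch: upgrade one node at a time so that
# no live node serves traffic while its Django version and code disagree.
# The node names, "lb-ctl" and "update-code" commands are all made up.
import subprocess

NODES = ["treeherder1.webapp", "treeherder2.webapp"]  # hypothetical names

def run(*cmd):
    # Fail loudly if any step errors, rather than carrying on half-deployed.
    subprocess.check_call(list(cmd))

for node in NODES:
    run("lb-ctl", "remove", node)                   # out of rotation (hypothetical)
    run("ssh", node, "pip2.7", "install", "-I", "Django==1.7.4")
    run("ssh", node, "update-code")                 # push new source (hypothetical)
    run("lb-ctl", "add", node)                      # back into rotation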

2) The deploy failed, since update.py had been updated, but the script runs from the previous on-disk version, not the new one. This resulted in an obscure error message (the deploy's manage.py commands used python2.6 instead of python2.7), so it wasn't immediately obvious what was wrong. i.e. we need to remember that if we make changes to update.py, we have to deploy twice to actually run the new deploy code.
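One way around the stale-script trap would be for update.py to pull the new source and then re-exec its own fresh copy, so the rest of the deploy runs the new logic. A sketch only; update.py does not currently do this:

# Sketch: have update.py fetch the new code first and then re-exec its
# own fresh copy, so the remainder of the deploy runs the new logic.
import os
import subprocess
import sys

if "--re-execed" not in sys.argv:
    subprocess.check_call(["git", "pull"])  # fetch the new deploy code
    # Replace this process with the freshly pulled version of the script.
    os.execv(sys.executable, [sys.executable] + sys.argv + ["--re-execed"])

# ...the rest of the deploy (manage.py commands etc.) runs from here,
# now guaranteed to be using the newly deployed update.py...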

3) The log viewer "click a line number to load part of the log" feature stopped working. Issues found:
  (a) this didn't result in any New Relic exceptions, due to a generic try-except which turned any exception into an HTTP 404.

  (b) we hit permissions errors in the log_cache used by the logslice endpoint. Changing ownership of the src directories on the admin node to the 'treeherder' user rather than 'root' fixed that, but we then hit a different exception (can't pickle BytesIO objects). It's not clear why the permissions worked before but don't now.
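A sketch covering both points; fetch_log_range and the cache key format are hypothetical, but the two ideas stand: catch only the exception that genuinely means "not found" (so real bugs still reach New Relic), and cache raw bytes rather than the BytesIO wrapper (Django's cache pickles its values, and file-like objects can't be pickled):

# Sketch of both fixes; fetch_log_range and the cache key are hypothetical.
import io

from django.core.cache import cache
from django.http import Http404

def load_log_slice(log_url, start, end):
    cache_key = "logslice:%s:%s-%s" % (log_url, start, end)
    data = cache.get(cache_key)
    if data is None:
        try:
            data = fetch_log_range(log_url, start, end)  # hypothetical helper
        except IOError:
            # Only a genuinely missing log becomes a 404; anything else
            # propagates and gets reported.
            raise Http404("Log not found: %s" % log_url)
        cache.set(cache_key, data)  # bytes pickle fine; a BytesIO would not
    return io.BytesIO(data)  # rebuild the file-like object per request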

4) Ryan noticed that even after clearing the cache and reloading the page, existing jobs that were in the running state were not updating automatically as they completed, even though opening an identical page in a new tab showed the job as completed.

5) It was decided to roll back due to #3 and #4; however, this then required rolling back the Django package updates. We had to:
  (a) pause puppet on prod (which had been updated to enforce Django 1.7 in the time since step #1), so it didn't clobber everything.
  (b) cherry-pick the update.py commit onto the "known good" 'production' branch, since otherwise we'd try running python2.6 manage.py commands again (https://github.com/mozilla/treeherder-service/commit/eabe3a47de94d5debb009bfd5973b4779e1bbbb0).
  (c) pip2.7 install Django 1.5.11 onto the prod nodes
  (d) Deploy & hope for the best
  (e) Update puppet to enforce Django 1.5.11 instead of 1.7, and unpause puppet.

6) Unfortunately at #5(d) we hit exceptions during the deploy, since the admin node was still on Django 1.7 and, as part of the deploy, commands are run on the admin node before things are rsynced to the other nodes. We therefore had to roll back Django on the admin node as well, before redeploying again. The same admin node Python environment is used to deploy to both prod and stage, since virtualenvs are not used.
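Per-target virtualenvs on the admin node would avoid this class of problem; a minimal sketch, with made-up paths and pins:

# Sketch: give each deploy target its own virtualenv on the admin node,
# so that rolling prod back to Django 1.5.11 can't break stage deploys.
# The paths (and the pinned versions) are hypothetical.
import subprocess

VENVS = {
    "prod": ("/data/venvs/treeherder-prod", "Django==1.5.11"),
    "stage": ("/data/venvs/treeherder-stage", "Django==1.7.4"),
}

for env, (path, django) in VENVS.items():
    subprocess.check_call(["virtualenv", "-p", "python2.7", path])
    subprocess.check_call([path + "/bin/pip", "install", "-I", django])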

Things the treeherder devs can do to reduce the disruption for sheriffs/other users on future deploys:
* Notify #developers/sheriffs when installing new packages on prod, since it's going to be bumpy due to the difficulty of coordinating the pip install & code deploy (we did this to a certain extent today, but perhaps we could emphasise the bumpiness next time).
* When using fubar to pip install packages, double check puppet has been paused.
* Remember we need to deploy twice if updating update.py.
* Add treeherder-ui tests (!).
* Ask sheriffs to test risky changes on stage, in addition to the sanity-testing we've done.
* Avoid the use of generic try-excepts (https://github.com/mozilla/treeherder-service/blob/f5c0b53e0ce6b527c5eb2d861adeb72e1e5859ea/treeherder/webapp/api/logslice.py#L81).
* When updating versions of dependencies, do not include any other changes in the same deploy, since rollbacks for regressions in the other changes are painful.
* Remember that we need to double check the correct versions of packages are installed on the admin node, since it's used to deploy, particularly when rolling back versions (a fail-fast sketch follows this list).
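For that last point, a fail-fast check at the top of the deploy script would catch a mismatched admin node before any deploy commands run. A sketch using pkg_resources, with the expected version hard-coded purely for illustration:

# Sketch: abort the deploy early if the admin node's installed Django
# doesn't match what the code being deployed expects.
import sys

import pkg_resources

EXPECTED_DJANGO = "1.7.4"  # whatever this particular deploy requires

installed = pkg_resources.get_distribution("Django").version
if installed != EXPECTED_DJANGO:
    sys.exit("Admin node has Django %s but this deploy expects %s; "
             "fix the package before deploying." % (installed, EXPECTED_DJANGO))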

Next steps for this deploy:
* See if we can repro on stage.
* Fix or revert the API refactor
* Get sheriffs to test stage before the next deploy
* Deploy! And eat lots of cake! :-)
Priority: -- → P1
Depends on: 1133870
Depends on: 1133910
No longer depends on: 1133870
Depends on: 1097090
Depends on: 1134140
No longer depends on: 1134140
The issue with the log viewer is fixed, as is the problem with jobs not appearing in the UI. However bug 1134916 has just been filed, which IMO blocks the deploy.
Given the problems for the last deploy (comment 1), I've created a checklist...

Done:
* Check the admin node has Django 1.7.4 installed.
* Perform a grunt build if required.
* Update stage to master, so we can watch for New Relic failures for things already on master.

To do:
1) Land bug 1134916 once reviewed.
2) Update stage to master to pick up bug 1134916.
3) Both treeherder devs and sheriffs to test stage - specifically job updating, log viewer, classifying failures, filtering.
4) Announce in #developers that treeherder may experience some disruption for a few minutes, whilst the packages are updated.
5) Ask fubar to pause puppet on prod.
6) Ask fubar to |pip2.7 install -I Django==1.7.4| on the production nodes.
7) As soon as that is complete (and as soon as possible), hit the Chief red button to deploy master to prod.
8) Check Treeherder working via treeherder.m.o and New Relic.
9) Comment in #treeherder telling people to *force* refresh the page to pick up the new UI assets that are compatible with the revised API.

Assuming all good (we'll wait for 30 mins to ensure no further issues found by sheriffs/devs):
* Ask fubar to update puppet for the package change & unpause puppet.

In the case of problems:
1) Ask fubar to roll back to Django 1.5.11 on the prod nodes *and* the admin node.
2) Once done, Chief deploy the 'production' branch.

The cherry-pick for the python2.7 change in update.py is already done, so we won't have to worry about that on either the deploy or a potential rollback (ie: we won't have to deploy twice again).
This is now complete - fingers crossed it will stick!

We hit two issues:
1) The command to pause puppet was typoed, so puppet stomped over our changes again. Once it was paused for real & the package changes reinstalled, this was fine :-)
2) We discovered bug 1135798, which is a regression from bug 1113160. The tl;dr is that we need an additional |git pull| after switching branches, otherwise we don't have the latest code. A second deploy using Chief resolved this.
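The shape of that fix, sketched with subprocess; this is illustrative, not the actual patch from bug 1135798:

# Illustrative sketch of the fix direction from bug 1135798, not the
# actual patch: after switching branches, pull explicitly so the local
# branch actually has the latest commits.
import subprocess

def checkout_and_update(repo_dir, branch):
    subprocess.check_call(["git", "checkout", branch], cwd=repo_dir)
    # Switching branches alone leaves the local branch wherever it was;
    # the extra pull brings in the commits pushed since the last deploy.
    subprocess.check_call(["git", "pull", "origin", branch], cwd=repo_dir)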
Assignee: nobody → emorley
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED