Closed Bug 1169263 Opened 9 years ago Closed 5 years ago

Switch local development to use Docker

Categories

(Tree Management :: Treeherder, enhancement, P2)

enhancement

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: armenzg)

References

(Blocks 1 open bug)

Details

Attachments

(2 files)

Docker is also now (slowly) getting less painful on Windows - eg the docker client now supports it (you couldn't run docker commands outside the boot2docker VM previously):
http://azure.microsoft.com/blog/2014/11/18/docker-cli-for-windows-clients/

Also boot2docker-cli is being replaced by Docker Machine (https://docs.docker.com/machine/) so the Windows story is looking hopeful. Microsoft are contributing quite a bit towards Windows compatibility.

As such, my concerns over using docker (from the painfulness of both using it personally and for contributors on Windows) are diminishing by the day :-)
Assignee: nobody → emorley
Depends on: 1363444
See Also: → 1365567
Dustin couldn't get the current Vagrant environment working. Between that and a few other issues people have had, it would be good to make a Docker solution available. Docker seems to have finally overtaken Vagrant in terms of project activity/reliability/...
Priority: P3 → P1
Priority: P1 → P2
Blocks: 1416266
Another instance where a Docker/Docker Compose based solution would be preferable:

<igoldan> I think there's a bug on the celery workers now
<igoldan> https://bugzilla.mozilla.org/show_bug.cgi?id=1395356#c54
<igoldan> I checked out an old revision and didn't reproduced it; so it's not my local env
<emorley> you need to run provision to install the new dependencies
<igoldan> for vagrant, you mean?
<emorley> Yeah - from the host run `vagrant provision`
<emorley> the list of python and JS packages is regularly changing, so it's good to run provision semi-regularly (or at least whenever something isn't working) :-)
<emorley> If we were to switch to a Docker (and Docker Compose) based workflow, this would happen automatically
<Aryx> has anybody recently set up vagrant? was the download of the vbox from vagrant slow (20-60k/s)?
<emorley> I seem to remember it not maxing out my connection, but not quite that slow (more like 200-500kb/s)
<emorley> A docker image download from hub.docker.com would no doubt be faster hehe :-)
Blocks: 1451483
See Also: → 1506909

I'm unable to get Vagrant working on my new laptop (it hangs during provision; I'm presuming Virtualbox 6's new "work with Hyper-V" feature is not as complete as implied), so until we do this bug I don't have a way of working on anything backend related.

Status: NEW → ASSIGNED
Priority: P2 → P1
No longer blocks: 1451483
Assignee: emorley → armenzg
Priority: P1 → P2

I would like to have Docker instead of Vagrant: The latter requires bios changes to enable some virtualbox feature, and requires a physical network cable because of reasons. https://github.com/klahnakoski/treeherder-for-windows

Docker on Windows will get around the problems of Vagrant. The Docker script is effectively a set of instructions on how to set up Treeherder on a new machine; important instructions that I would have preferred over Vagrant. Finally, with Docker, we can overlay other images for development: specifically, PyCharm's remote debugger, so debugging is easier.

Some thoughts/ideas from when I looked into this previously:

  • Docker and docker-compose are much more reliable (especially on Windows) than they were a few years ago, whereas the number of issues seen with Vagrant seems to have increased. Between that, the fact that Taskcluster/other Mozilla projects extensively use Docker, and that it's generally more popular in the open source community - switching definitely makes sense from an ease of contribution point of view.

  • Since Treeherder uses multiple external services (MySQL, Redis, RabbitMQ), a docker-compose based development environment that uses those projects' native Docker Hub images seems best.

  • It would be great to use the same docker-compose project on Travis as works locally, since it would (a) avoid having to keep two environments in sync, (b) mean that there is now actually test coverage for the development environment (currently the Vagrant scripts aren't tested).

  • Heroku could eventually use a Docker (rather than buildpack) based solution too (bug 1506909), however the production and development Dockerfiles would likely need to be quite different (additional Python packages for the latter, as well as things like installing Firefox for Selenium), and unfortunately Docker doesn't yet support "including" one Dockerfile from another. As such, it's probably best to ignore the production case and write the Dockerfile with only development in mind for now (eg only COPY the bare minimum source into the image and .dockerignore everything else to avoid cache churn, and rely on docker-compose source directory bind mounts).

  • To reduce the time taken for the initial Docker build, as well as how often the cache gets busted, careful thought will need to be given to how the main Dockerfile is written, particularly since Treeherder uses both Python and Node.js. Possible approaches might be:

    1. Use Docker's new multi-stage builds feature. ie have an initial stage that uses FROM node:..., followed by the FROM python:... stage that copies across the Node binaries.
    2. (Probably preferred) Separate out the Python and Node.js parts into separate docker-compose services. eg: have the main service be the Python app, then another Node.js image-using service that just runs the yarn start.
  • An additional complication (particularly if the separate Python and Node.js services approach is taken above) is Treeherder's Selenium tests. These require: Firefox, Geckodriver, the Python dev environment (since they are run using pytest), and a built UI (which requires Node for the yarn build). To avoid adding too much complexity to the main Python docker image, it might be worth seeing if https://github.com/SeleniumHQ/docker-selenium can be used alongside it. However those images are pretty heavyweight, so if used are probably best kept out of the main docker-compose.yml (running Selenium tests locally is not something we do as often) - perhaps by having a separate docker compose file that extends the main one.

  • As part of initial setup (and in fact any time new migrations are added), Django migrations need to be run (amongst other things). As such it will likely be necessary to have the main Python image use a custom entrypoint script that runs these prior to exec-ing the passed command (see official docker hub images for examples of this pattern), to replace what happens at the end of vagrant/setup.sh.

  • Django's development server, runserver, currently doesn't gracefully handle the DB not being available at startup (see https://groups.google.com/forum/#!topic/django-developers/gNRC4IzInms). This means that if the MySQL docker container isn't ready (which is particularly likely at first run, since the mysql docker image performs a few additional tasks such as creating the empty database), then runserver will error. To avoid this, the custom entrypoint script mentioned above should also check that MySQL is ready, for example by using ! nc -z in a while loop or similar.

  • The Vagrant environment currently uses an iptables hack to mean that incoming requests get sent to runserver/webpack-dev-server even if they aren't listening specifically on 0.0.0.0 (this was performed this way to save people from having to manually pass CLI arguments to runserver every time). This can perhaps be replaced by having the Python app container's CMD be runserver with the appropriate network adapter binding arguments.

  • Our Travis setup script currently moves the MySQL data directory onto a ramdisk (tmpfs) since it reduces the pytest suite runtime by 30%. As part of running tests in CI using the new docker environment, it might be worth experimenting with the docker tmpfs storage volume type to see if it similarly improves test runtime (given data loss isn't an issue in CI). Incidentally, I think the reason we see such a speedup is that we use Django's transactional_db fixture in places where we should really be using db (see bug 1348947), which iirc means Django doesn't use a simpler ~faked in-memory approach.

  • It appears that Django's runserver doesn't listen to the default of SIGTERM, so docker-compose must send SIGINT instead to avoid waiting 10 seconds for the time out. This can be achieved by using stop_signal: SIGINT in the compose config for the Python image.

  • One way to speed up Travis (which if we're not careful will be slower than at present, since it has no native docker caching support) might be to have Docker Hub (now called Docker Cloud) automatically build the image from master, and then reference that image via the Docker cache_from feature, which means if there have been no docker-related changes, existing image layers can be re-used. This would also help speed up the time it takes for initial setup locally too. The downside appears to be that it then causes cache misses in other cases, unless workarounds are applied (see https://github.com/moby/moby/issues/32612), plus doesn't yet play nicely with multi-stage builds (see https://github.com/moby/moby/issues/34715).
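
A minimal sketch of the custom entrypoint idea from the list above (wait for MySQL using ! nc -z in a while loop, run the Django migrations, then exec the passed command). The DATABASE_HOST and DATABASE_PORT variables, their defaults, and the function names are assumptions made for illustration, not something taken from the Treeherder repo:

```shell
#!/bin/sh
# Hypothetical entrypoint sketch. DATABASE_HOST, DATABASE_PORT and the
# function names are illustrative assumptions.

# Succeeds once something accepts TCP connections on host:port.
port_is_open() {
    nc -z "$1" "$2" >/dev/null 2>&1
}

# Poll until the port opens, giving up after max_tries attempts.
wait_for_port() {
    host=$1 port=$2 max_tries=${3:-60} delay=${4:-1}
    tries=0
    while ! port_is_open "$host" "$port"; do
        tries=$((tries + 1))
        if [ "$tries" -ge "$max_tries" ]; then
            echo "Gave up waiting for $host:$port" >&2
            return 1
        fi
        sleep "$delay"
    done
}

# Wait for MySQL, apply migrations, then replace the shell with the command
# passed to the container (e.g. runserver) so that signals reach it directly.
main() {
    wait_for_port "${DATABASE_HOST:-mysql}" "${DATABASE_PORT:-3306}" || exit 1
    ./manage.py migrate --noinput
    exec "$@"
}

# Only act when a command is supplied, as it would be by the image's CMD.
if [ "$#" -gt 0 ]; then
    main "$@"
fi
```

Using exec at the end matters for the stop signal point above: the Django process replaces the shell, so whatever signal docker-compose sends is delivered to it rather than to a wrapper script.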

I'll also see if I can tidy up my local experiment branch to push up to the Treeherder repo as a starting point.

Meant to say that a few recent changes have handily reduced the complexity for this bug:

  1. Treeherder's recent switch from Karma to Jest now means that a browser environment is no longer needed to run the JS tests, so the stock node image can be used for them (more commonly run locally than the selenium tests)
  2. The vagrant/Travis elasticsearch pieces were recently removed, since we decided to disable that feature locally/in CI, for improved prod parity (bug 1527868). As such it's no longer necessary to set up an elasticsearch compose service here, which sidesteps a few annoyances that I'd seen using their (bloated) docker images.
Attached file Ed's WIP PR
Attachment #9049975 - Attachment description: Link to GitHub pull-request: https://github.com/mozilla/treeherder/pull/4775 → Ed's WIP PR

As part of moving from Vagrant to Docker I'm looking to run a couple of test groups within Docker and the other two outside of it. Running inside Docker can add 2-4 minutes to each run. I'm considering leaving js-test and python-linters outside Docker to get those results back faster while running python-tests-main and python-tests-selenium inside of it.

Here's the increase for python-tests-main [1][2] 4 min 24 sec --> 7 min 25 sec.
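
For reference, the python-tests-main figures quoted above work out as follows. This is just the arithmetic on the two quoted durations; the variable and function names are made up for the example:

```shell
# Convert minutes and seconds to total seconds.
to_seconds() {
    echo $(( $1 * 60 + $2 ))
}

before=$(to_seconds 4 24)   # first quoted python-tests-main duration
after=$(to_seconds 7 25)    # second quoted python-tests-main duration
delta=$(( after - before ))
percent=$(( delta * 100 / before ))  # integer division, rounds down

echo "before=${before}s after=${after}s"
echo "added=$(( delta / 60 ))m$(( delta % 60 ))s (+${percent}%)"
```

That is 264 seconds versus 445 seconds, so about 3 minutes (roughly 68%) extra for that job.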

In the future we should be able to speed up the Docker/Travis set up:
http://atodorov.org/blog/2017/08/07/faster-travis-ci-tests-with-docker-cache/

In the future we will also need to publish the built images.

You can already test the Docker work by following these steps:

git fetch origin
git checkout -b docker origin/docker
git merge master
docker-compose up

[1] https://travis-ci.org/mozilla/treeherder/builds/522371686
[2] https://travis-ci.org/mozilla/treeherder/builds/523251453

Summary: Switch the local development environment to something docker based → Switch local development to use Docker
Type: defect → enhancement

When running the Python tests for slow tests or Selenium, do we want the Django migrations to be applied or not?

sclements: igoldan: This is ready to be tested.

If any of you could try celery tasks it would be great as I don't know what is supposed to happen with them.
I also don't know what the Debug toolbar in the settings is about:
https://github.com/mozilla/treeherder/pull/4901/files#diff-26d4413823b415e0e1902bf845ff7067L351

I've also asked for review in here:
https://github.com/mozilla/treeherder/pull/4901

Steps for testing (you should also see these in the changes to the docs):

  • Install Docker Desktop if not installed (https://www.docker.com/products/docker-desktop)
  • Checkout the branch from the treeherder repo (not my user repo) [1]
  • docker-compose build
    • The first time this will take a long time since it will download a lot of images and build the Django app
    • In the future we can speed this up
  • docker-compose up
    This will start up all services (redis, mysql & rabbitmq), the frontend and the Django app. You can make code changes and it will reload the UI or the backend. You can load http://localhost:8000/docs/ for the backend and http://localhost:5000 for the UI. The backend will automatically apply the migrations upon first run. When you're done you can press Ctrl + C.
  • docker-compose run -p 8000:8000 backend bash
    This will start all services except the frontend. Your prompt will be within the Django app container. No migrations will be applied by default. You can apply them with ./initialize_data.sh. You can start the backend like so ./manage.py runserver 0.0.0.0:8000.
  • If you need to start the backend with an environment variable preset you can do so with -e (e.g. docker-compose run -e KEY='hi!' backend bash)
  • docker-compose run -p 5000:5000 frontend sh -c "yarn && yarn start -env.BACKEND=https://treeherder.allizom.org --host 0.0.0.0"
    This will only start the frontend. You can load it and I think it is pulling data from staging.

NOTES:

At any moment when you're done and want to shut containers down you can do so like this:

  • Use docker-compose stop which will stop the containers w/o deleting them
  • Use docker-compose down which will destroy the containers and the created network (add -v to also remove the volumes).

Starting the backend and the frontend with 0.0.0.0 is important for Docker to route things properly. Check in the commands above to see where it's used.

Troubleshoot:
If you can't load http://localhost:8000 then check the port mapping for the backend command. Run docker container ls and
check that the container treeherder-backend has ports 0.0.0.0:8000->8000/tcp rather than 8000/tcp.
Try again with -p 8000:8000. If that doesn't work and you're on Mac try downgrading Docker to https://download.docker.com/mac/stable/26764/Docker.dmg and see https://github.com/docker/for-mac/issues/3350#issuecomment-472141881 for more details.
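
A rough sketch of automating that check. The helper names are hypothetical; the docker container ls --filter and --format options are standard Docker CLI, and the name filter matches on a substring of the container name:

```shell
# Succeeds if a "Ports" string shows the given container port published on the host.
port_is_published() {
    ports=$1 port=$2
    case "$ports" in
        *":${port}->${port}/tcp"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Print name and port mapping for each running container whose name matches.
list_container_ports() {
    docker container ls --filter "name=$1" --format '{{.Names}} {{.Ports}}'
}

# Example usage (needs a running Docker daemon):
#   list_container_ports backend | while read -r name ports; do
#       if port_is_published "$ports" 8000; then
#           echo "$name: port 8000 is published"
#       else
#           echo "$name: retry with -p 8000:8000"
#       fi
#   done
```

A mapping shown as 0.0.0.0:8000->8000/tcp passes the check, while a bare 8000/tcp (exposed but not published) does not.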

[1]
git fetch origin
git checkout -b docker origin/docker
git merge master

Flags: needinfo?(sclements)
Flags: needinfo?(igoldan)

Tasks post this bug:

Flags: needinfo?(sclements)

The Django debug toolbar provides an overlay when accessing APIs on localhost. It shows the SQL statement and how long the query took to execute, among other details. It's quite useful. The celery tasks are executed by Heroku via the Procfile so I'm not sure if they need to be tested. I can however test out one of the Django commands for the intermittents commenter (in test mode) - it's a celery task but also has a separate Django command so it can be manually run.

Edit: I just remembered the Commenter isn't a celery task anymore. It was originally, but we changed it so it's executed by the Heroku Scheduler instead. But, it'll still be worth making sure running Django commands work as expected. I'll test this out later today.

camd: See comment 16

Flags: needinfo?(cdawson)

Assuming you're all happy about the PR, do you want it to land tomorrow (I'm PTO Friday) and back out if there are issues?
Or wait until I come back on May 20th to land it?

It should not bitrot easily while I'm away.

For anyone else who tries running the backend with an environment variable preset the command is: docker-compose run -p 8000:8000 -e DATABASE_URL=<read-only-replica> backend bash

A note from IRC: for development, we can make the backend image also have node and yarn, so that there is only a single dev environment. We could name it the dev image instead of the backend image.

On production we will use two different images since we want very slim images.

I've been experimenting with this. It seems to work pretty well. I've hit a couple hiccups, but I think they're pilot error on my part. I'll play with this more Monday.

Flags: needinfo?(cdawson)

This landed a few weeks ago.

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Flags: needinfo?(igoldan)
Component: Treeherder: Docs & Development → TreeHerder