Closed Bug 1245385 Opened 10 years ago Closed 10 years ago

Deploy Dockerized Tokenserver

Categories

(Cloud Services Graveyard :: Server: Token, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mostlygeek, Assigned: mostlygeek)

References

Details

Attachments

(2 files)

Task:
- Convert tokenserver to deploy as a Docker container
- QA / load test it in stage
- Create a new Jenkins deployment pipeline
- Deploy it to production
- Run it for a week beside the current RPM based stack as 1 box
- Scale the RPM stack down to 1 box, enable auto-scaling for the docker stack
- Remove the RPM stack
Assignee: nobody → bwong
QA Contact: kthiessen
woot
Opened Bug 1246008 to add a secondary load balancer specific health check.
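For context, a load-balancer-specific health check is typically a shallow endpoint kept separate from the deeper application heartbeat, so the LB can poll cheaply without exercising backend dependencies. A minimal stdlib-only WSGI sketch of the idea (the paths and behavior here are illustrative assumptions, not the actual tokenserver code):

```python
# Hypothetical sketch of a secondary, LB-specific health check (Bug 1246008).
# Route names follow a common Mozilla convention but are assumptions here.
def health_app(environ, start_response):
    path = environ.get("PATH_INFO", "")
    if path == "/__lbheartbeat__":
        # Shallow check for the load balancer: the process is up and serving.
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"OK"]
    if path == "/__heartbeat__":
        # Deep check: a real service would verify DB connectivity etc. here.
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"OK"]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found"]
```

The point of the split is that the LB check stays green during a transient backend hiccup, while the deep heartbeat is what monitoring alerts on.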
Will be getting logs of the testing in stage in tonight as attachments to this bug; they need a little cleanup first.
These are not exactly the most readable, but they capture all the tester-side data from the testing yesterday. Narrative comments are introduced by ##.
Deployed a canary box to production today. Noticed that it used about 50% more CPU than the other boxes. Looking at the build logs [1] for the container (mozilla/browserid-verifier:0.3.0), it seems that node 4 has issues building the bigint library. Switched the base container back to node 0.10.41 [2]; the bigint library appears to build correctly with it. We want to schedule another load test in stage with the same number of servers and the same RDS instance:
- 9 x c3.large web servers
- 1 x m3.large RDS w/ 1TB gp2
to see how it compares.

[1] https://circleci.com/gh/mozilla/browserid-verifier/6
[2] https://github.com/mozilla/browserid-verifier/pull/77
Found a few issues with the production canary today.

Issue #1 - Disk full. By default the docker daemon logs to a JSON file on disk that never rotates. The fix: in /etc/sysconfig/docker, add `--log-driver=none`:
- OPTIONS='--selinux-enabled --log-driver=none'

Issue #2 - CPU use of the verifier:
- deployed the mozilla/browserid-verifier:0.3.2 container
- CPU usage is now in line with the RPM based servers
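For reference, a sketch of the resulting daemon config. The `--selinux-enabled` value comes from the line above; anything else in the file is an assumption:

```
# /etc/sysconfig/docker
# Disable the default json-file log driver so container stdout/stderr is not
# also written to an unrotated JSON file on disk; systemd still captures the
# same output in the journal when containers run as systemd units.
OPTIONS='--selinux-enabled --log-driver=none'
```

Note this is a daemon-wide default; the same flag can also be passed per-container to `docker run`.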
Flags: needinfo?(ckolos)
:ckolos could you create a new docker AMI with `--log-driver=none`? Since we run docker containers as systemd units, all the logs to stdout/stderr are already captured in journalctl.
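A sketch of the systemd-unit setup being described; the unit name, image, and port here are hypothetical, not the actual tokenserver deployment:

```
# /etc/systemd/system/tokenserver.service (hypothetical unit name and image)
# Running the container in the foreground under systemd means its
# stdout/stderr flows into the journal, making docker's own json-file
# logging redundant (hence --log-driver=none).
[Unit]
Description=Dockerized Tokenserver
After=docker.service
Requires=docker.service

[Service]
ExecStartPre=-/usr/bin/docker rm -f tokenserver
ExecStart=/usr/bin/docker run --name tokenserver --log-driver=none \
    -p 8000:8000 mozilla/tokenserver:latest
ExecStop=/usr/bin/docker stop tokenserver
Restart=always

[Install]
WantedBy=multi-user.target
```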
I agree that the log output should be handled in some way, as the current behavior is definitely not good. However, I disagree that we should set log-driver=none; I'd leave log-driver at its default instead. I suggest setting --log-opt max-size=<some size> or adding LOGROTATE=true to the end of /etc/sysconfig/docker.
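The suggested alternative would look something like the following; the sizes and the max-file count are illustrative, and `--selinux-enabled` is carried over from the existing config:

```
# /etc/sysconfig/docker -- alternative: keep the default json-file log
# driver but cap and rotate it (values are illustrative)
OPTIONS='--selinux-enabled --log-opt max-size=50m --log-opt max-file=3'
```

The trade-off versus --log-driver=none is that `docker logs` keeps working on these containers, at the cost of writing each log line twice (journal + json-file).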
Flags: needinfo?(ckolos)
Update: We finally have a good version of the dockerized Tokenserver in prod as a single-server canary. I will be running this for a couple of weeks to see how it compares to the current generation of the service.

A few notes on fixes:
- Run docker containers with --log-driver=none. This will likely be permanent: even though the AMI will rotate docker logs by default, I'd rather the logs weren't written twice and instead went straight to journald via systemd.
- Deployed browserid-verifier:0.3.2, which moves the base container back to node v0.10.41 from node 4, since bigint doesn't compile correctly with node 4 (for now).
- CPU usage on the canary is the same as the previous generation.
- Disk writes (bytes and iops) are much lower. Not sure if this is due to journald logging more efficiently than circus.
Attached a screenshot of the current vs. new dockerized tokenserver. The orange line is the docker based servers.
- CPU usage is about the same. This is good.
- Disk and network usage is much lower. This is interesting; I suspect it is due to writing logs directly to journald instead of having circus write logs to disk. Logs are being shipped correctly, and comparing new/old servers, they have the same amount of logs.
Update to the plan:
- Increase the number of docker based servers in the cluster to 2, 4, and 6 over the next couple of weeks
- When we reach 6, the docker servers should make up most of the cluster, with the RPM based servers coming/going from auto-scaling
- Feb 29th: remove the old servers and run everything from dockerized servers
I decided to accelerate the migration plan because the new docker servers showed very stable and positive results. As of tonight, the dockerized tokenserver instances are serving 100% of production.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Blocks: 1247736
Product: Cloud Services → Cloud Services Graveyard