Closed
Bug 1245385
Opened 10 years ago
Closed 10 years ago
Deploy Dockerized Tokenserver
Categories
(Cloud Services Graveyard :: Server: Token, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: mostlygeek, Assigned: mostlygeek)
Attachments
(2 files)
Task:
- Convert tokenserver to deploy as a Docker container
- QA / Load test it in stage
- Create a new Jenkins deployment pipeline
- Deploy it to production
- Run it as 1 box for a week beside the current RPM based stack
- Scale RPM stack to 1 box, enable auto-scaling for docker stack
- Remove RPM stack
| Assignee | ||
Updated•10 years ago
Assignee: nobody → bwong
QA Contact: kthiessen
Comment 1•10 years ago
woot
Comment 2•10 years ago
Opened Bug 1246008 to add a secondary load balancer specific health check.
Comment 3•10 years ago
Will be attaching logs from the stage testing to this bug tonight; they need a little cleanup first.
Comment 4•10 years ago
These are not exactly the most readable, but they capture all the tester-side data from the testing yesterday. Narrative comments are introduced by ##.
| Assignee | ||
Comment 5•10 years ago
Deployed a canary box to production today. Noticed that it used about 50% more CPU than the other boxes. Looking at the build logs [1] for the container (mozilla/browserid-verifier:0.3.0), it seems that node 4 has issues building the bigint library.
Switched the base container back to node 0.10.41 [2]. The bigint library appears to build correctly. We want to schedule another load test in stage with the same number of servers and RDS instance:
- 9 x c3.large web servers
- 1 x m3.large RDS w/ 1TB gp2
to see how it compares.
[1] https://circleci.com/gh/mozilla/browserid-verifier/6
[2] https://github.com/mozilla/browserid-verifier/pull/77
| Assignee | ||
Comment 6•10 years ago
Found a few issues with the production canary today.
Issue #1 - Disk full
The docker daemon by default logs to a JSON file on disk that never rotates. The solution is:
- in /etc/sysconfig/docker, add `--log-driver=none` to OPTIONS
- OPTIONS='--selinux-enabled --log-driver=none'
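The config change above could be scripted roughly as follows. This is a minimal sketch, not the deployed tooling: the function name `add_log_driver_none` is made up for illustration, it assumes GNU `sed` and a single-quoted `OPTIONS='...'` line, and it runs against a scratch copy rather than the real /etc/sysconfig/docker.

```shell
#!/bin/sh
# Sketch: append --log-driver=none to the OPTIONS line of a sysconfig-style
# docker file. The function name and the scratch-file demo are illustrative.
add_log_driver_none() {
    conf="$1"
    # Leave the file alone if a log driver is already configured.
    if grep -q -- '--log-driver=' "$conf"; then
        return 0
    fi
    # Insert the flag just before the closing quote of OPTIONS='...'.
    sed -i "s/^OPTIONS='\(.*\)'\$/OPTIONS='\1 --log-driver=none'/" "$conf"
}

# Demo against a scratch copy rather than the real /etc/sysconfig/docker.
demo_conf="$(mktemp)"
printf "%s\n" "OPTIONS='--selinux-enabled'" > "$demo_conf"
add_log_driver_none "$demo_conf"
cat "$demo_conf"
```

On a real host the daemon would presumably need a restart (for example `systemctl restart docker`) before the flag takes effect. With the default json-file driver, the per-container log files normally live under /var/lib/docker/containers/, which is where to look when confirming what filled the disk.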
Issue #2 - CPU use of Verifier
- deployed mozilla/browserid-verifier:0.3.2 container
- now CPU usage is in line with RPM based servers
| Assignee | ||
Updated•10 years ago
Flags: needinfo?(ckolos)
| Assignee | ||
Comment 7•10 years ago
:ckolos could you create a new docker AMI with `--log-driver=none`?
Since we run docker containers as systemd units, everything written to stdout/stderr is already captured in the journal and readable with journalctl.
I agree that the log output should be handled in some way, as that's definitely not good. However, I disagree that we should set log-driver=none. Instead, I suggest leaving log-driver as default and setting --log-opt max-size=<some size> or adding LOGROTATE=true to the end of /etc/sysconfig/docker.
Flags: needinfo?(ckolos)
| Assignee | ||
Comment 10•10 years ago
Update:
Finally have a good version of dockerized Tokenserver in prod as a single server canary. I will be running this for a couple of weeks to see how it compares to the current generation of the service. A few notes on fixes:
- run docker containers with --log-driver=none. This will likely be permanent as the AMI will rotate docker logs by default. I'd rather the logs weren't written twice and instead went straight to journald via systemd.
- deployed browserid-verifier:0.3.2, which bumps the base container back to node v0.10.41 from node4 since bigint didn't compile correctly with node4 (for now)
- CPU usage on the canary is the same as the previous generation
- disk writes (bytes and IOPS) are much lower. Not sure if this is due to journald logging more efficiently than circus.
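The "straight to journald via systemd" setup described in these notes might look something like the sketch below. It only renders a unit file into a scratch directory. The unit name `verifier`, the /usr/bin/docker path, the restart policy and the helper `render_unit` are all assumptions for illustration; only the `--log-driver=none` flag and the image tag come from this bug.

```shell
#!/bin/sh
# Sketch: render a systemd unit that runs a container in the foreground so
# systemd captures stdout/stderr into the journal. Unit name, binary path and
# restart policy are illustrative, not taken from the real deployment.
render_unit() {
    name="$1"    # hypothetical service/container name
    image="$2"   # image reference, e.g. mozilla/browserid-verifier:0.3.2
    cat <<EOF
[Unit]
Description=$name container
After=docker.service
Requires=docker.service

[Service]
# Foreground docker run: output goes to systemd, not to a json log file.
ExecStartPre=-/usr/bin/docker rm -f $name
ExecStart=/usr/bin/docker run --rm --name $name --log-driver=none $image
ExecStop=/usr/bin/docker stop $name
Restart=always

[Install]
WantedBy=multi-user.target
EOF
}

# Demo: write the unit into a scratch directory instead of /etc/systemd/system.
unit_dir="$(mktemp -d)"
render_unit verifier mozilla/browserid-verifier:0.3.2 > "$unit_dir/verifier.service"
grep '^ExecStart=' "$unit_dir/verifier.service"
```

If a unit like this were installed and started, `journalctl -u verifier.service -f` would be the place to follow the container's output, since nothing is written to a docker-managed log file.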
| Assignee | ||
Comment 11•10 years ago
Attached a screenshot of current vs new dockerized tokenserver. The orange line is the docker based servers.
- CPU usage is about the same. This is good.
- Disk and network usage is much lower. This is interesting, and I suspect it is due to writing logs directly to journald instead of having circus write logs to disk. Logs are being shipped correctly, and the new and old servers produce the same volume of logs.
| Assignee | ||
Comment 12•10 years ago
Update to the plan:
- Increase the number of docker based servers in the cluster to 2, 4 and 6 over the next couple of weeks
- When we reach 6, they should make up most of the cluster, with the RPM based servers coming and going through auto-scaling
- Feb 29th, remove old servers and run everything from dockerized servers
| Assignee | ||
Comment 13•10 years ago
I decided to accelerate the migration plan because the new docker servers were showing very stable and positive results. As of tonight, the dockerized tokenserver instances are serving 100% of production.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Updated•10 years ago
QA Contact: kthiessen
Updated•3 years ago
Product: Cloud Services → Cloud Services Graveyard