Closed Bug 996763 Opened 11 years ago Closed 11 years ago

Deploy browserid-verifier tag 0.2.1 to stage

Categories

(Cloud Services :: Operations: Deployment Requests - DEPRECATED, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: rfkelly, Assigned: mostlygeek)

References

Details

(Whiteboard: [qa+])

The latest browserid-verifier includes logging and conformance improvements, as well as a bugfix that will let it verify assertions from persona.org. Please deploy to stage for loadtesting and then let's get it into prod.

The following configuration changes will be necessary; I will try to make a puppet-config PR, but am recording them here for now:

* Add a .json config file with the following contents, which enables the persona.org fix and creates a new "request summary" logfile:

    {
      "allowURLOmission": true,
      "logging": {
        "handlers": {
          "console": {
            "class": "intel/handlers/console",
            "formatter": "pretty"
          },
          "summary": {
            "class": "intel/handlers/file",
            "file": "/media/ephemeral0/<WHATEVER>/verifier/summary.log",
            "formatter": "json"
          }
        }
      }
    }

* Update circus to include CONFIG_FILES=/path/to/the/above/file.json in the env vars for the verifier (see the sketch at the end of this comment).
* Configure heka to slurp in the summary.log file and pass it to elasticsearch.

A stand-alone verifier loadtest should be enough to QA this thing. We should additionally check for the ability to verify persona.org assertions, and check that the summary logs flow correctly into heka.

/cc Katie because logging stuff is going live with this deploy.
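As a rough sketch of the circus piece, assuming an ini-style circus config: the watcher name and both paths below are placeholders, not the real puppet-config values, so adjust to whatever the deployed layout actually uses.

    [watcher:browserid-verifier]
    cmd = /usr/bin/node server.js
    working_dir = /data/browserid-verifier

    ; Placeholder section and path; the point is only that CONFIG_FILES lands in the
    ; verifier's environment so it loads the JSON file described above.
    [env:browserid-verifier]
    CONFIG_FILES = /etc/browserid-verifier/production.json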
OS: Windows 7 → All
Hardware: x86_64 → All
Mostlygeek: it's probably obvious, but just to be clear: the tokenserver deployment stuff you're working on should definitely take priority over this.
I took a quick stab at the puppet-config update in https://github.com/mozilla-services/puppet-config/pull/357 but it is incomplete.
trink & whd: app-level logging from the verifier is landing soon
:mostlygeek what's the plan for this deployment to Stage and Prod? Do we need these deploys this week, along with the changes to TS and FxA?
I'll work on this to get it into stage.
Assignee: nobody → bwong
Summary: Deploy browserid-verifier tag rpm-0.1.0-1 → Deploy browserid-verifier tag 0.1.0
BTW I changed the tagging scheme to be in line with the other repos, just X.Y.Z
I need some input on how we can quickly verify the following:
1. the ability to verify persona.org assertions
2. the summary logs flow correctly into heka
Summary: Deploy browserid-verifier tag 0.1.0 → Deploy browserid-verifier tag 0.1.1
This turned into an opportunity to clean up a few things:
- puppet-config changes: https://github.com/mozilla-services/puppet-config/pull/384
- RPM building changes: https://github.com/mozilla-services/svcops/pull/106

In the process of cleaning up the RPM to use `npm --production`, I found a bug in browserid-verifier, which resulted in a new tag, v0.1.1; that tag has been created. I'll be able to wrap this up as soon as I have the heka changes in the puppet-config PR.
> I need some input on how we can quickly verify the following:
> 1. the ability to verify persona.org assertions

You'll have to generate a persona.org-backed assertion, perhaps by logging in on http://myfavoritebeer.org/ and capturing the assertion from the web debug console. Then POST it to the verifier /v2 endpoint like this:

    curl https://verifier.stage.mozaws.net/v2 -X POST \
      -H "Content-Type: application/json" \
      -d '{"assertion": "ASSERTION DATA GOES HERE", "audience": "http://myfavoritebeer.org/"}'

It should report success in a JSON response.
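A successful response should be a JSON object roughly along these lines (the values here are made up, and the exact field set is whatever the verifier actually returns, so treat this as an illustration of the shape rather than a spec):

    {
      "status": "okay",
      "email": "someone@example.com",
      "audience": "http://myfavoritebeer.org/",
      "expires": 1398000000000,
      "issuer": "login.persona.org"
    }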
Summary: Deploy browserid-verifier tag 0.1.1 → Deploy browserid-verifier tag 0.1.2
So a few things I'm not sure about:
- Can multiple nodejs processes write to the same summary log? It seems to work, but I'm not sure if this is safe. Is nodejs asynchronously opening the file, writing, and then closing it?
- What about log rotation for summary.log?
- Can we have summary.log simply go to stdout instead of nodejs writing to a file itself? Then we can use circus to auto-rotate the file.
Also we're going to need a heka filter for the summary format. Is that documented anywhere?
:mostlygeek you can probably get help from :mtrinkala or :rmiller on that heka filter bit...
> - can multiple nodejs write to the same summary log? (it seems to work), though not sure
> if this is safe. Is nodejs asynchronously opening the file, writing and then closing it?

I wouldn't count on it.

> - can we have summary.log simply go to stdout instead of nodejs writing to a file itself?
> Then we can use circus to auto-rotate the file

What's currently going to stdout? IIRC it's human-readable log lines. We can probably make these json and merge them with the summary info into a single stream if that's easier.
Based on the above and on https://github.com/mozilla-services/puppet-config/pull/384#issuecomment-41078194, I get the feeling this "summary log line" is not ready for production. Do we need to make some fixes in code before we can sensibly deploy this?
In addition, is this a "blocker" for Fx29 or just an "asap"?
Blocks: 1000574
From my perspective, the summary log line is not a blocker for Fx29; the nginx logging is enough to spot high-level problems.
I agree, not a blocker for FF29. Let's wait and do it right; I'll get onto it when I return next week.
I assume this open deploy ticket can pick up all the latest changes to browserid-verifier from this past weekend?
Summary: Deploy browserid-verifier tag 0.1.2 → Deploy browserid-verifier tag 0.1.3
I tagged 0.1.3 with some updates to the summary log line, as described in https://github.com/mozilla/browserid-verifier/issues/39. This also changes the summary logging so that it goes directly to stdout, per https://github.com/mozilla/browserid-verifier/pull/43.

We need to update the JSON config file to contain the following logging configuration:

    {
      "logging": {
        "handlers": {
          "console": {
            "class": "intel/handlers/console",
            "formatter": "json"
          }
        },
        "loggers": {
          "bid.summary": {
            "propagate": true
          }
        }
      }
    }

This will cause stdout to contain json-formatted log lines that can be slurped into heka using the parser from https://github.com/mozilla-services/puppet-config/pull/422.
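As a quick sanity check once that is deployed, each stdout line should parse as JSON. The log path below is a placeholder for wherever circus redirects the verifier's stdout:

    # Placeholder path; point this at the circus-managed stdout stream for the verifier.
    tail -n 5 /media/ephemeral0/circus/browserid-verifier.stdout.log | while read -r line; do
      echo "$line" | python -m json.tool > /dev/null && echo OK || echo "NOT JSON"
    done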
:rkelly :mostlygeek Can we schedule the Stage deployment for this week? I would like to get some testing done before QA gets busy with new projects...
I'd like to know if implementing the compute-cluster code would be doable. See: https://github.com/mozilla/browserid-verifier/issues/44

This will greatly simplify:
- deployment: fewer circus watchers, individual log files, config files, etc.
- heka integration: only 1 log file to use
- puppet-config: it will look like all the other fxa apps

If we can, let's land that, and then QA it for prod readiness.
Version 0.2.0 has been cut with the compute-cluster code added. No extra settings or config needed; it should just do The Right Thing: spin up a single nodejs webhead process, which spawns N background processes based on the number of CPUs in the system and farms work out to them.

We'll need a thorough loadtest for this, since it adds another way in which 503s can be generated. I'll be interested to see if there are different failure behaviors under load.
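For anyone unfamiliar with the pattern, here is a minimal Python sketch of the idea; this is an illustration only, not the actual compute-cluster module (which is nodejs), and the names and numbers are invented. A frontend farms CPU-heavy jobs out to a pool of workers sized by CPU count and refuses new work once a bounded backlog fills up, which is the condition that surfaces as a 503 from the webhead.

    # Illustration of the master/worker + bounded-backlog pattern, not the real module.
    import os
    from concurrent.futures import ProcessPoolExecutor

    MAX_PROCESSES = os.cpu_count() or 1
    MAX_BACKLOG = 10 * MAX_PROCESSES  # mirrors the 10 * num_workers default discussed later in this bug

    def verify(assertion):
        # Stand-in for the CPU-bound crypto work a real worker process would do.
        return {"status": "okay", "assertion_length": len(assertion)}

    class Cluster:
        def __init__(self):
            self.pool = ProcessPoolExecutor(max_workers=MAX_PROCESSES)
            self.pending = 0  # not thread-safe; fine for a sketch

        def enqueue(self, assertion):
            if self.pending >= MAX_BACKLOG:
                # The "cannot enqueue work: maximum backlog exceeded" case,
                # i.e. what the webhead turns into a 503.
                raise RuntimeError("maximum backlog exceeded")
            self.pending += 1
            future = self.pool.submit(verify, assertion)
            future.add_done_callback(lambda _f: self._done())
            return future

        def _done(self):
            self.pending -= 1

    if __name__ == "__main__":
        cluster = Cluster()
        futures = [cluster.enqueue("fake-assertion-%d" % i) for i in range(5)]
        print([f.result() for f in futures])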
Summary: Deploy browserid-verifier tag 0.1.3 → Deploy browserid-verifier tag 0.2.0
For load testing we can make use of the X:Y:Z user parameter to "scale" up the load as we go. Perhaps 30-60min per cycle, for example...
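For instance, a run using that syntax might look something like this; the numbers are only an illustration, with actual values chosen per run:

    users = 4:10:20
    duration = 1800
    agents = 2
    include_file = ./loadtest.py
    python_dep = PyBrowserID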
I'm going to be rewriting puppet-config for the verifier:
- single nodejs process now that we have compute-cluster *woohoo*
- make it easier to reuse puppet-config to deploy the app locally in tokenserver (woo hoo)
- other small clean-ups
This is deployed to stage now. It is a single node, c3.large, running at https://verifier.stage.mozaws.net.

Suggest we hit it with very light load, 4 concurrent users, to make sure:
- nodejs is spawning workers correctly
- it is logging correctly (heka)
- a 10m load test passes

Then I'll bump up the cluster to 1x c3.2xlarge (8 CPUs) and make sure it's scaling up worker processes correctly. Then I'll bring up 3x c3.2xlarge and we'll try to crush it :)

In prod we'll run 3x c3.large, which is our default.
OK. Verified that Stage is now deployed as one c3.large instance in US East.

Checking processes, I see one new one corresponding to the compute-cluster enhancements:

    app /usr/bin/node /data/fxa-browserid-verifier/lib/ccverifier/worker.js

Of course, the main browserid-verifier process is still there:

    app /usr/bin/node server.js

First round quick test:

    users = 2
    duration = 600
    agents = 2
Moving to this for the single c3.2xlarge:

    users = 10
    duration = 600
    agents = 2
Moving to more load:

    users = 20
    duration = 600
    include_file = ./loadtest.py
    python_dep = PyBrowserID
    agents = 10
Turning the dial to 11:

    users = 40
    duration = 600
    agents = 20
All load tests look good. We are back to a single c3.large instance in Stage. Let me know if we are ready to move this to Production. If so, please mark this bug Resolved and I will open a Prod deploy ticket...
Sorry, that was unfortunately premature. Load tests on the one c3.large are showing 503s. This needs to be investigated...
We will need to roll this config file update to prod tokenserver before it can talk to the new verifier: https://github.com/mozilla-services/puppet-config/pull/462
The loadtest is not sufficient to max out the CPU, but I see a burst of "compute cluster error: cannot enqueue work: maximum backlog exceeded (30)" at the start of the loadtest that then subsides. First guess: the worker processes are dying when they don't have anything to do, and it takes a little while to spin them back up when the loadtest starts again. Will try to confirm.
Just FYI to myself. A short Tokenserver load test in Stage shows the following:
- The nginx logs show a small percentage of 503s
- The token.log file shows a matching percentage of "name": "token.assertion.connection_error"
A note for jbonacci: apparently our load test is *enough* to take out a c3.large but not big enough to take out a c3.2xlarge (~8x the capacity :)
PR to allow setting the backlog via config file: https://github.com/mozilla/browserid-verifier/pull/47
Summary: Deploy browserid-verifier tag 0.2.0 → Deploy browserid-verifier tag 0.2.1
OK, version 0.2.1 gives you some config knobs to tune the compute-cluster settings. In the config file you can set one or both of:

    computecluster: {
      maxProcesses: 2,
      maxBacklog: 60
    }

The default maxBacklog is 10*num_workers, which is why it was 30 on these c3.large instances. We should try going to 60 and see if that makes a difference; since the errors were quite bursty, it may be enough to smooth them out and get a pass from the loadtest.
It seems the config is not getting picked up when it's passed through to the compute cluster. The app prints out the right configuration and data when it boots, but the values don't seem to be reaching the compute cluster:
- setting maxProcesses to 2 and maxBacklog to 40, then load testing it, shows that it still created 3 workers and a backlog of 30
- it is failing somewhere...
The JSON config has its "computecluster" key at the wrong level of nesting; see https://github.com/mozilla-services/puppet-config/pull/469
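For reference, a sketch of the intended shape, assuming (per the knobs described above) that "computecluster" sits at the top level of the JSON config alongside "logging"; the existing logging section is omitted here, and the actual fix is in the PR above:

    {
      "computecluster": {
        "maxProcesses": 2,
        "maxBacklog": 40
      },
      "logging": { }
    }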
Oh duh. :mostlygeek is not a good JSON parser :b
OK, ran a light load test on it. Looks like it is respecting the values now.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Summary: Deploy browserid-verifier tag 0.2.1 → Deploy browserid-verifier tag 0.2.1 to stage
Blocks: 1009096
Re-opening to track experimentation with the max-processes/max-backlog values. We managed to get 503s out of the verifier with only 40% CPU utilization, so we need to experiment to find the sweet spot that:
* gets the CPU utilization under load up above, say, 80%
* without backing up enough to give a significant level of 503s
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Setting maxBacklog=1000 improves CPU utilization and drops the error rate to zero in our verifier-only loadtest. I think this is the way to go, as we can fill this queue very quickly on the nodejs side, but also empty it pretty quickly as the workers burn through it. A longish queue seems fine to me.
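In config-file terms that is something like the following, reusing the same top-level knob described earlier; the exact file layout lives in puppet-config:

    {
      "computecluster": {
        "maxBacklog": 1000
      }
    }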
Status: REOPENED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
This is out in Prod.
Status: RESOLVED → VERIFIED