Closed Bug 996763
Opened 11 years ago, Closed 11 years ago
Deploy browserid-verifier tag 0.2.1 to stage
Categories: Cloud Services :: Operations: Deployment Requests - DEPRECATED (task)
Tracking: Not tracked
Status: VERIFIED FIXED
People: Reporter: rfkelly; Assigned: mostlygeek
Whiteboard: [qa+]
The latest browserid-verifier includes logging and conformance improvements, as well as a bugfix that will let it verify assertions from persona.org. Please deploy to stage for loadtesting and then let's get it into prod.
The following configuration changes will be necessary; I will try to make a puppet-config PR, but am recording them here for now.
* Add a .json config file with the following contents, which enables the persona.org fix and creates a new "request summary" logfile:
{
  "allowURLOmission": true,
  "logging": {
    "handlers": {
      "console": {
        "class": "intel/handlers/console",
        "formatter": "pretty"
      },
      "summary": {
        "class": "intel/handlers/file",
        "file": "/media/ephemeral0/<WHATEVER>/verifier/summary.log",
        "formatter": "json"
      }
    }
  }
}
* Update circus to include CONFIG_FILES=/path/to/the/above/file.json in the env vars for the verifier (sketched below)
* Configure heka to slurp in the summary.log file and pass it to elasticsearch (also sketched below).
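Rough sketches of what those two items might look like; the watcher name, paths, and decoder name are illustrative assumptions, and the authoritative versions belong in the puppet-config PR. Circus can inject per-watcher environment variables via an [env:<watcher>] section:

  # hypothetical circus.ini fragment; watcher name and paths are placeholders
  [watcher:browserid-verifier]
  cmd = node server.js
  working_dir = /data/browserid-verifier
  numprocesses = 1

  [env:browserid-verifier]
  CONFIG_FILES = /path/to/the/above/file.json

And heka could pick up the summary log with a LogstreamerInput along these lines (the decoder definition and the elasticsearch output are omitted here):

  # hypothetical hekad TOML stanza; section and decoder names are placeholders
  [verifier_summary_input]
  type = "LogstreamerInput"
  log_directory = "/media/ephemeral0/<WHATEVER>/verifier"
  file_match = 'summary\.log'
  decoder = "verifier_summary_decoder"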
A stand-alone verifier loadtest should be enough to QA this thing. We should additionally check for the ability to verify persona.org assertions, and check that the summary logs flow correctly into heka.
/cc Katie because logging stuff is going live with this deploy.
Reporter
Updated•11 years ago
OS: Windows 7 → All
Hardware: x86_64 → All
Reporter
Comment 1•11 years ago
Mostlygeek: it's probably obvious, but just to be clear: the tokenserver deployment stuff you're working on should definitely take priority over this.
Comment 2•11 years ago
Listing some (possibly) related github issues:
https://github.com/mozilla/browserid-verifier/issues/19
https://github.com/mozilla/browserid-verifier/issues/24
https://github.com/mozilla/browserid-verifier/issues/35
Status: NEW → ASSIGNED
Whiteboard: [qa+]
Reporter
Comment 3•11 years ago
I took a quick stab at the puppet-config update in https://github.com/mozilla-services/puppet-config/pull/357 but it is incomplete.
Comment 4•11 years ago
trink & whd: app level logging from the verifier is landing soon
Comment 5•11 years ago
:mostlygeek, what's the plan for this deployment to Stage and Prod?
Do we need these deploys this week, along with the changes to TS and FxA?
Assignee
Comment 6•11 years ago
I'll work on this to get it into stage.
Assignee
Updated•11 years ago
Assignee: nobody → bwong
Reporter
Updated•11 years ago
Summary: Deploy browserid-verifier tag rpm-0.1.0-1 → Deploy browserid-verifier tag 0.1.0
Reporter
Comment 7•11 years ago
BTW I changed the tagging scheme to be in line with the other repos, just X.Y.Z
Comment 8•11 years ago
I need some input on how we can quickly verify the following:
1. the ability to verify persona.org assertions
2. the summary logs flow correctly into heka
Assignee
Updated•11 years ago
Summary: Deploy browserid-verifier tag 0.1.0 → Deploy browserid-verifier tag 0.1.1
Assignee
Comment 9•11 years ago
This turned into an opportunity to clean up a few things:
- puppet-config changes: https://github.com/mozilla-services/puppet-config/pull/384
- RPM building changes: https://github.com/mozilla-services/svcops/pull/106
In the process of cleaning up the RPM to use `npm --production`, I found a bug in browserid-verifier, which led to tagging v0.1.1.
I'll be able to wrap this up as soon as I have the heka changes in the puppet-config PR.
Reporter
Comment 10•11 years ago
> I need some input on how we can quickly verify the following:
> 1. the ability to verify persona.org assertions
You'll have to generate a persona.org-backed assertion, perhaps by logging in on http://myfavoritebeer.org/ and capturing the assertion from the web debug console. Then POST it to the verifier /v2 endpoint like this:
curl https://verifier.stage.mozaws.net/v2 -X POST -H "Content-Type: application/json" -d '{"assertion": "ASSERTION DATA GOES HERE", "audience": "http://myfavoritebeer.org/" }'
It should report success in a JSON response.
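For reference, a successful verification should come back looking roughly like the classic BrowserID verifier response (this is my recollection of the format; the exact /v2 field set may differ slightly):

  {"status": "okay", "email": "someone@example.com", "audience": "http://myfavoritebeer.org/", "expires": 1398700000000, "issuer": "login.persona.org"}

while a failure is reported as {"status": "failure", "reason": "..."}.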
Assignee
Updated•11 years ago
Summary: Deploy browserid-verifier tag 0.1.1 → Deploy browserid-verifier tag 0.1.2
Assignee
Comment 11•11 years ago
So a few things I'm not sure about:
- can multiple nodejs processes write to the same summary log? (it seems to work), though not sure if this is safe. Is nodejs asynchronously opening the file, writing and then closing it?
- log rotation for summary.log?
- can we have summary.log simply go to stdout instead of nodejs writing to a file itself? Then we can use circus to auto-rotate the file
Assignee
Comment 12•11 years ago
Also we're going to need a heka filter for the summary format. Is that documented anywhere?
Comment 13•11 years ago
:mostlygeek you can probably get help from :mtrinkala or :rmiller on that heka filter bit...
Reporter
Comment 14•11 years ago
> - can multiple nodejs processes write to the same summary log? (it seems to work), though not sure
> if this is safe. Is nodejs asynchronously opening the file, writing and then closing it?
I wouldn't count on it.
> - can we have summary.log simply go to stdout instead of nodejs writing to a file itself?
> Then we can use circus to auto-rotate the file
What's currently going to stdout? IIRC it's human-readable log lines. We can probably make these json and merge them with the summary info into a single stream if that's easier.
Reporter
Comment 15•11 years ago
Based on the above and on https://github.com/mozilla-services/puppet-config/pull/384#issuecomment-41078194 I get the feeling this "summary log line" is not ready for production. Do we need to do some fixes in code before we can sensibly deploy this?
Comment 16•11 years ago
In addition, is this a "blocker" for Fx29 or just an "asap"?
Comment 17•11 years ago
From my perspective, the summary log line is not a blocker for Fx29; the nginx logging is enough to spot high-level problems.
Reporter
Comment 18•11 years ago
I agree, not a blocker for FF29. Let's wait and do it right; I'll get onto it when I return next week.
Comment 19•11 years ago
I assume this open deploy ticket can pick up all the latest changes to browserid-verifier from this past weekend?
Reporter
Updated•11 years ago
Summary: Deploy browserid-verifier tag 0.1.2 → Deploy browserid-verifier tag 0.1.3
Reporter
Comment 20•11 years ago
I tagged 0.1.3 with some updates to the summary log line as described in https://github.com/mozilla/browserid-verifier/issues/39
This also changes the summary logging so that it goes directly to stdout, per https://github.com/mozilla/browserid-verifier/pull/43. We need to update the JSON config file to contain the following logging configuration:
{
  "logging": {
    "handlers": {
      "console": {
        "class": "intel/handlers/console",
        "formatter": "json"
      }
    },
    "loggers": {
      "bid.summary": {
        "propagate": true
      }
    }
  }
}
This will cause stdout to contain json-formatted log lines that can be slurped into heka using the parser from https://github.com/mozilla-services/puppet-config/pull/422
Comment 21•11 years ago
:rkelly
:mostlygeek
Can we schedule the Stage deployment for this week?
I would like to get some testing done before QA gets busy with new projects...
Assignee
Comment 22•11 years ago
I'd like to know if implementing the compute-cluster code would be doable. See: https://github.com/mozilla/browserid-verifier/issues/44
This will greatly simplify:
- deployment: fewer circus watchers, individual log files, config files, etc.
- heka integration: only one log file to consume
- puppet-config: it will look like all the other FxA apps
If we can, let's land that, and then QA it for prod readiness.
Reporter
Comment 23•11 years ago
Version 0.2.0 has been cut with the compute-cluster code added. No extra settings or config; it should just do The Right Thing: spin up a single nodejs webhead process, which spawns N background processes based on the number of CPUs in the system and farms work out to them.
We'll need a thorough loadtest for this, since it adds another way in which 503s can be generated. I'll be interested to see if there are different failure behaviors under load.
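For anyone not familiar with it, this is roughly the pattern the compute-cluster library provides (a minimal sketch; the option values and the worker module path are illustrative, not the verifier's actual wiring):

  // parent (webhead) process: farm CPU-heavy verification work out to child processes
  var computecluster = require('compute-cluster');

  var cc = new computecluster({
    module: './worker.js',  // illustrative path; the verifier ships its own worker module
    max_processes: 2,       // defaults to the number of CPUs when not set
    max_backlog: 60         // work queued beyond this is rejected with an error
  });

  cc.enqueue({ assertion: '...', audience: '...' }, function (err, result) {
    // err is set when the backlog is exceeded or a worker fails
  });

  // worker.js: runs in a child process and receives jobs over IPC
  process.on('message', function (job) {
    // do the expensive crypto here, then hand the result back to the parent
    process.send({ ok: true });
  });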
Summary: Deploy browserid-verifier tag 0.1.3 → Deploy browserid-verifier tag 0.2.0
Comment 24•11 years ago
For load testing we can make use of the X:Y:Z user parameter to "scale" up the load as we go. Perhaps 30-60min per cycle, for example...
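For example, assuming the same loads config format used in the later comments here, a ramping run could look something like this (values purely illustrative):

  users = 10:20:40
  duration = 1800
  include_file = ./loadtest.py
  python_dep = PyBrowserID
  agents = 10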
Assignee
Comment 25•11 years ago
I'm going to be rewriting puppet-config for the verifier:
- single nodejs process now that we have compute-cluster *woohoo*
- make it easier to reuse puppet-config to deploy the app locally in tokenserver (woo hoo)
- other small clean-ups
Comment 26•11 years ago
Forget Comment 2! Here is a more complete list of issues and PRs that I *think* are tagging along with this release.
Please confirm!
https://github.com/mozilla-services/puppet-config/pull/357
https://github.com/mozilla-services/puppet-config/pull/384
https://github.com/mozilla-services/puppet-config/pull/453
https://github.com/mozilla-services/puppet-config/issues/443
https://github.com/mozilla-services/puppet-config/issues/343
https://github.com/mozilla-services/puppet-config/pull/422
and
https://github.com/mozilla/browserid-verifier/issues/19
https://github.com/mozilla/browserid-verifier/issues/24
https://github.com/mozilla/browserid-verifier/pull/35
https://github.com/mozilla/browserid-verifier/pull/36
https://github.com/mozilla/browserid-verifier/issues/37
https://github.com/mozilla/browserid-verifier/pull/38
https://github.com/mozilla/browserid-verifier/pull/40
https://github.com/mozilla/browserid-verifier/pull/45
https://github.com/mozilla/browserid-verifier/issues/39
https://github.com/mozilla/browserid-verifier/pull/43
https://github.com/mozilla/browserid-verifier/issues/44
Assignee
Comment 27•11 years ago
This is deployed to stage now. It is a single node, c3.large, running at https://verifier.stage.mozaws.net. Suggest we hit it with a very light load, 4 concurrent users, to make sure:
- nodejs is spawning workers correctly
- it is logging correctly (heka)
- a 10m load test passes
Then I'll bump up the cluster to 1x c3.2xlarge (8 CPUs) and make sure it's scaling up worker processes correctly.
Then I'll bring up 3x c3.2xlarge and we'll try to crush it :)
In prod we'll run 3x c3.large, which is our default.
Comment 28•11 years ago
OK. Verified that Stage is now deployed as one c3.large instance in US East.
Checking processes, I see one new one corresponding to the compute-cluster enhancements:
app /usr/bin/node /data/fxa-browserid-verifier/lib/ccverifier/worker.js
Of course, the main browserid-verifier process is still there:
app /usr/bin/node server.js
First round, quick test:
users = 2
duration = 600
agents = 2
Comment 29•11 years ago
Moving to this for the single c3.2xlarge:
users = 10
duration = 600
agents = 2
Comment 30•11 years ago
Moving to more load:
users = 20
duration = 600
include_file = ./loadtest.py
python_dep = PyBrowserID
agents = 10
Comment 31•11 years ago
Turning the dial to 11:
users = 40
duration = 600
agents = 20
Comment 32•11 years ago
All load tests look good.
We are back to a single c3.large instance in Stage.
Let me know if we are ready to move this to Production.
If so, please mark this bug Resolved and I will open a Prod deploy ticket...
Comment 33•11 years ago
Sorry, that was unfortunately premature.
Load tests on the one c3.large are showing 503s.
This needs to be investigated...
Reporter
Comment 34•11 years ago
We will need to roll this config file update to prod tokenserver before it can talk to the new verifier:
https://github.com/mozilla-services/puppet-config/pull/462
Reporter
Comment 35•11 years ago
The loadtest is not sufficient to max out the CPU, but I see a burst of "compute cluster error: cannot enqueue work: maximum backlog exceeded (30)" errors at the start of the loadtest that then subsides.
First guess: the worker processes are dying when they don't have anything to do, and it takes a little while to spin them back up when the loadtest starts again. Will try to confirm.
Comment 36•11 years ago
Just FYI to myself. A short Tokenserver load test in Stage shows the following:
The nginx logs show a small percentage of 503s
The token.log file shows a matching percentage of "name": "token.assertion.connection_error"
Assignee
Comment 37•11 years ago
A note for jbonacci: apparently our load test is *enough* to take out a c3.large but not big enough to take out a c3.2xlarge (~8x the capacity :)
Reporter
Comment 38•11 years ago
PR to allow setting the backlog via config file: https://github.com/mozilla/browserid-verifier/pull/47
Reporter
Updated•11 years ago
Summary: Deploy browserid-verifier tag 0.2.0 → Deploy browserid-verifier tag 0.2.1
Reporter
Comment 39•11 years ago
OK, version 0.2.1 gives you some config knobs to tune the compute-cluster settings. In the config file you can set one or both of:
"computecluster": {
  "maxProcesses": 2,
  "maxBacklog": 60
}
The default is 10*num_workers, which is why it was 30 on these c3.large. We should try going to 60 and see if that makes a difference - since the errors were quite bursty, it may be enough to smooth it out and get a pass from the loadtest.
Assignee
Comment 40•11 years ago
It seems the config is not being picked up when it gets passed through to the compute cluster. When the app boots it prints out the right configuration, but the settings don't seem to reach the compute cluster:
- Setting maxProcesses to 2, and maxBacklog to 40
- load testing shows that it still created 3 workers and a backlog of 30
- it is failing somewhere...
Reporter
Comment 41•11 years ago
The JSON config has its "computecluster" key at the wrong level of nesting; see https://github.com/mozilla-services/puppet-config/pull/469
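In other words (this is my reading of comment 39, not the actual diff in that PR), the block needs to sit at the top level of the config file as a sibling of "logging", not nested inside another section:

  {
    "computecluster": {
      "maxProcesses": 2,
      "maxBacklog": 40
    },
    "logging": { ... }
  }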
Assignee
Comment 42•11 years ago
Oh duh. :mostlygeek is not a good JSON parser :b
Assignee
Comment 43•11 years ago
OK ran a light load test on it. Looks like it is respecting the values now.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Summary: Deploy browserid-verifier tag 0.2.1 → Deploy browserid-verifier tag 0.2.1 to stage
Reporter
Comment 44•11 years ago
Re-opening to track experimentation with the max-processes/max-backlog values.
We managed to get 503s out of the verifier at only 40% CPU utilization, so we need to experiment to find the sweet spot that:
* gets the CPU utilization under load up above, say, 80%
* without backing up enough to give a significant level of 503s
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Reporter
Comment 45•11 years ago
Setting maxBacklog=1000 improves CPU utilization and drops the error rate to zero in our verifier-only loadtest. I think this is the way to go, as we can fill this queue very quickly on the nodejs side, but also empty it pretty quickly as the workers burn through it. A longish queue seems fine to me.
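In config terms that presumably means something along these lines (my sketch; the authoritative change is the puppet-config PR in the next comment):

  "computecluster": {
    "maxBacklog": 1000
  }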
Reporter
Comment 46•11 years ago
Official config PR here: https://github.com/mozilla-services/puppet-config/pull/477
Assignee
Updated•11 years ago
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED