Closed Bug 996763 Opened 11 years ago Closed 11 years ago

Deploy browserid-verifier tag 0.2.1 to stage

Categories

(Cloud Services :: Operations: Deployment Requests - DEPRECATED, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: rfkelly, Assigned: mostlygeek)

References

Details

(Whiteboard: [qa+])

The latest browserid-verifier includes logging and conformance improvements, as well as a bugfix that will let it verify assertions from persona.org. Please deploy to stage for loadtesting and then let's get it into prod.

The following configuration changes will be necessary; I will try to make a puppet-config PR, but am recording them here for now:

* Add a .json config file with the following contents, which enables the persona.org fix and creates a new "request summary" logfile:

    {
      "allowURLOmission": true,
      "logging": {
        "handlers": {
          "console": {
            "class": "intel/handlers/console",
            "formatter": "pretty"
          },
          "summary": {
            "class": "intel/handlers/file",
            "file": "/media/ephemeral0/<WHATEVER>/verifier/summary.log",
            "formatter": "json"
          }
        }
      }
    }

* Update circus to include CONFIG_FILES=/path/to/the/above/file.json in the env vars for the verifier (see the sketch at the end of this comment).
* Configure heka to slurp in the summary.log file and pass it to elasticsearch.

A stand-alone verifier loadtest should be enough to QA this thing. We should additionally check for the ability to verify persona.org assertions, and check that the summary logs flow correctly into heka.

/cc Katie because logging stuff is going live with this deploy.
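As a rough sketch of the circus piece, assuming an ini-style circus config: the watcher name and both paths below are placeholders, not the real puppet-config values, so adjust to whatever the deployed layout actually uses.

    [watcher:browserid-verifier]
    cmd = /usr/bin/node server.js
    working_dir = /data/browserid-verifier

    ; Placeholder section and path; the point is only that CONFIG_FILES lands in the
    ; verifier's environment so it loads the JSON file described above.
    [env:browserid-verifier]
    CONFIG_FILES = /etc/browserid-verifier/production.json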
OS: Windows 7 → All
Hardware: x86_64 → All
Mostlygeek: it's probably obvious, but just to be clear: the tokenserver deployment stuff you're working on should definitely take priority over this.
I took a quick stab at the puppet-config update in https://github.com/mozilla-services/puppet-config/pull/357 but it is incomplete.
trink & whd: app-level logging from the verifier is landing soon
:mostlygeek what's the plan for this deployment to Stage and Prod? Do we need these deploys this week, along with the changes to TS and FxA?
I'll work on this to get it into stage.
Assignee: nobody → bwong
Summary: Deploy browserid-verifier tag rpm-0.1.0-1 → Deploy browserid-verifier tag 0.1.0
BTW I changed the tagging scheme to be in line with the other repos, just X.Y.Z
I need some input on how we can quickly verify the following:
1. the ability to verify persona.org assertions
2. the summary logs flow correctly into heka
Summary: Deploy browserid-verifier tag 0.1.0 → Deploy browserid-verifier tag 0.1.1
This turned into an opportunity to clean up a few things:
- puppet-config changes: https://github.com/mozilla-services/puppet-config/pull/384
- RPM building changes: https://github.com/mozilla-services/svcops/pull/106

In the process of cleaning up the RPM to use `npm --production`, I found a bug in browserid-verifier, which resulted in a new tag, v0.1.1; that tag has been created. I'll be able to wrap this up as soon as I have the heka changes in the puppet-config PR.
> I need some input on how we can quickly verify the following:
> 1. the ability to verify persona.org assertions

You'll have to generate a persona.org-backed assertion, perhaps by logging in on http://myfavoritebeer.org/ and capturing the assertion from the web debug console. Then POST it to the verifier /v2 endpoint like this:

    curl https://verifier.stage.mozaws.net/v2 -X POST \
      -H "Content-Type: application/json" \
      -d '{"assertion": "ASSERTION DATA GOES HERE", "audience": "http://myfavoritebeer.org/"}'

It should report success in a JSON response.
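A successful response should be a JSON object roughly along these lines (the values here are made up, and the exact field set is whatever the verifier actually returns, so treat this as an illustration of the shape rather than a spec):

    {
      "status": "okay",
      "email": "someone@example.com",
      "audience": "http://myfavoritebeer.org/",
      "expires": 1398000000000,
      "issuer": "login.persona.org"
    }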
Summary: Deploy browserid-verifier tag 0.1.1 → Deploy browserid-verifier tag 0.1.2
So a few things I'm not sure about:
- Can multiple nodejs processes write to the same summary log? It seems to work, but I'm not sure if this is safe. Is nodejs asynchronously opening the file, writing, and then closing it?
- What about log rotation for summary.log?
- Can we have summary.log simply go to stdout instead of nodejs writing to a file itself? Then we can use circus to auto-rotate the file.
Also we're going to need a heka filter for the summary format. Is that documented anywhere?
:mostlygeek you can probably get help from :mtrinkala or :rmiller on that heka filter bit...
> - can multiple nodejs write to the same summary log? (it seems to work), though not sure
> if this is safe. Is nodejs asynchronously opening the file, writing and then closing it?

I wouldn't count on it.

> - can we have summary.log simply go to stdout instead of nodejs writing to a file itself?
> Then we can use circus to auto-rotate the file

What's currently going to stdout? IIRC it's human-readable log lines. We can probably make these json and merge them with the summary info into a single stream if that's easier.
Based on the above and on https://github.com/mozilla-services/puppet-config/pull/384#issuecomment-41078194, I get the feeling this "summary log line" is not ready for production. Do we need to make some fixes in code before we can sensibly deploy this?
In addition, is this a "blocker" for Fx29 or just an "asap"?
Blocks: 1000574
From my perspective, the summary log line is not a blocker for Fx29; the nginx logging is enough to spot high-level problems.
I agree, not a blocker for FF29. Let's wait and do it right; I'll get onto it when I return next week.
I assume this open deploy ticket can pick up all the latest changes to browserid-verifier from this past weekend?
Summary: Deploy browserid-verifier tag 0.1.2 → Deploy browserid-verifier tag 0.1.3
I tagged 0.1.3 with some updates to the summary log line, as described in https://github.com/mozilla/browserid-verifier/issues/39. This also changes the summary logging so that it goes directly to stdout, per https://github.com/mozilla/browserid-verifier/pull/43.

We need to update the JSON config file to contain the following logging configuration:

    {
      "logging": {
        "handlers": {
          "console": {
            "class": "intel/handlers/console",
            "formatter": "json"
          }
        },
        "loggers": {
          "bid.summary": {
            "propagate": true
          }
        }
      }
    }

This will cause stdout to contain json-formatted log lines that can be slurped into heka using the parser from https://github.com/mozilla-services/puppet-config/pull/422.
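As a quick sanity check once that is deployed, each stdout line should parse as JSON. The log path below is a placeholder for wherever circus redirects the verifier's stdout:

    # Placeholder path; point this at the circus-managed stdout stream for the verifier.
    tail -n 5 /media/ephemeral0/circus/browserid-verifier.stdout.log | while read -r line; do
      echo "$line" | python -m json.tool > /dev/null && echo OK || echo "NOT JSON"
    done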
:rkelly :mostlygeek Can we schedule the Stage deployment for this week? I would like to get some testing done before QA gets busy with new projects...
I'd like to know if implementing the compute-cluster code would be doable. See: https://github.com/mozilla/browserid-verifier/issues/44

This will greatly simplify:
- deployment: fewer circus watchers, individual log files, config files, etc.
- heka integration: only 1 log file to use
- puppet-config: it will look like all the other fxa apps

If we can, let's land that, and then QA it for prod readiness.
Version 0.2.0 has been cut with the compute-cluster code added. No extra settings or config needed; it should just do The Right Thing: spin up a single nodejs webhead process, which spawns N background processes based on the number of CPUs in the system and farms work out to them.

We'll need a thorough loadtest for this, since it adds another way in which 503s can be generated. I'll be interested to see if there are different failure behaviors under load.
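For anyone unfamiliar with the pattern, here is a minimal Python sketch of the idea; this is an illustration only, not the actual compute-cluster module (which is nodejs), and the names and numbers are invented. A frontend farms CPU-heavy jobs out to a pool of workers sized by CPU count and refuses new work once a bounded backlog fills up, which is the condition that surfaces as a 503 from the webhead.

    # Illustration of the master/worker + bounded-backlog pattern, not the real module.
    import os
    from concurrent.futures import ProcessPoolExecutor

    MAX_PROCESSES = os.cpu_count() or 1
    MAX_BACKLOG = 10 * MAX_PROCESSES  # mirrors the 10 * num_workers default discussed later in this bug

    def verify(assertion):
        # Stand-in for the CPU-bound crypto work a real worker process would do.
        return {"status": "okay", "assertion_length": len(assertion)}

    class Cluster:
        def __init__(self):
            self.pool = ProcessPoolExecutor(max_workers=MAX_PROCESSES)
            self.pending = 0  # not thread-safe; fine for a sketch

        def enqueue(self, assertion):
            if self.pending >= MAX_BACKLOG:
                # The "cannot enqueue work: maximum backlog exceeded" case,
                # i.e. what the webhead turns into a 503.
                raise RuntimeError("maximum backlog exceeded")
            self.pending += 1
            future = self.pool.submit(verify, assertion)
            future.add_done_callback(lambda _f: self._done())
            return future

        def _done(self):
            self.pending -= 1

    if __name__ == "__main__":
        cluster = Cluster()
        futures = [cluster.enqueue("fake-assertion-%d" % i) for i in range(5)]
        print([f.result() for f in futures])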
Summary: Deploy browserid-verifier tag 0.1.3 → Deploy browserid-verifier tag 0.2.0
For load testing we can make use of the X:Y:Z user parameter to "scale" up the load as we go. Perhaps 30-60min per cycle, for example...
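For instance, a run using that syntax might look something like this; the numbers are only an illustration, with actual values chosen per run:

    users = 4:10:20
    duration = 1800
    agents = 2
    include_file = ./loadtest.py
    python_dep = PyBrowserID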
I'm going to be rewriting puppet-config for the verifier:
- single nodejs process now that we have compute-cluster *woohoo*
- make it easier to reuse puppet-config to deploy the app locally in tokenserver (woo hoo)
- other small clean-ups
This is deployed to stage now. It is a single node, c3.large, running at https://verifier.stage.mozaws.net.

Suggest we hit it with very light load, 4 concurrent users, to make sure:
- nodejs is spawning workers correctly
- it is logging correctly (heka)
- a 10m load test passes

Then I'll bump up the cluster to 1x c3.2xlarge (8 CPUs) and make sure it's scaling up worker processes correctly. Then I'll bring up 3x c3.2xlarge and we'll try to crush it :)

In prod we'll run 3x c3.large, which is our default.
OK. Verified that Stage is now deployed as one c3.large instance in US East.

Checking processes, I see one new one corresponding to the compute-cluster enhancements:

    app /usr/bin/node /data/fxa-browserid-verifier/lib/ccverifier/worker.js

Of course, the main browserid-verifier process is still there:

    app /usr/bin/node server.js

First round quick test:

    users = 2
    duration = 600
    agents = 2
Moving to this for the single c3.2xlarge:

    users = 10
    duration = 600
    agents = 2
Moving to more load:

    users = 20
    duration = 600
    include_file = ./loadtest.py
    python_dep = PyBrowserID
    agents = 10
Turning the dial to 11:

    users = 40
    duration = 600
    agents = 20
All load tests look good. We are back to a single c3.large instance in Stage. Let me know if we are ready to move this to Production. If so, please mark this bug Resolved and I will open a Prod deploy ticket...
Sorry, that was unfortunately premature. Load tests on the one c3.large are showing 503s. This needs to be investigated...
We will need to roll this config file update to prod tokenserver before it can talk to the new verifier: https://github.com/mozilla-services/puppet-config/pull/462
The loadtest is not sufficient to max out the CPU, but I see a burst of "compute cluster error: cannot enqueue work: maximum backlog exceeded (30)" at the start of the loadtest that then subsides. First guess: the worker processes are dying when they don't have anything to do, and it takes a little while to spin them back up when the loadtest starts again. Will try to confirm.
Just FYI to myself. A short Tokenserver load test in Stage shows the following:
- The nginx logs show a small percentage of 503s
- The token.log file shows a matching percentage of "name": "token.assertion.connection_error"
A note for jbonacci: apparently our load test is *enough* to take out a c3.large but not big enough to take out a c3.2xlarge (~8x the capacity :)
PR to allow setting the backlog via config file: https://github.com/mozilla/browserid-verifier/pull/47
Summary: Deploy browserid-verifier tag 0.2.0 → Deploy browserid-verifier tag 0.2.1
OK, version 0.2.1 gives you some config knobs to tune the compute-cluster settings. In the config file you can set one or both of:

    computecluster: {
      maxProcesses: 2,
      maxBacklog: 60
    }

The default maxBacklog is 10*num_workers, which is why it was 30 on these c3.large instances. We should try going to 60 and see if that makes a difference; since the errors were quite bursty, it may be enough to smooth them out and get a pass from the loadtest.
It seems the config is not getting picked up when it's passed through to the compute cluster. The app prints out the right configuration and data when it boots, but the values don't seem to be reaching the compute cluster:
- setting maxProcesses to 2 and maxBacklog to 40, then load testing it, shows that it still created 3 workers and a backlog of 30
- it is failing somewhere...
The JSON config has its "computecluster" key at the wrong level of nesting; see https://github.com/mozilla-services/puppet-config/pull/469
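For reference, a sketch of the intended shape, assuming (per the knobs described above) that "computecluster" sits at the top level of the JSON config alongside "logging"; the existing logging section is omitted here, and the actual fix is in the PR above:

    {
      "computecluster": {
        "maxProcesses": 2,
        "maxBacklog": 40
      },
      "logging": { }
    }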
Oh duh. :mostlygeek is not a good JSON parser :b
OK, ran a light load test on it. Looks like it is respecting the values now.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Summary: Deploy browserid-verifier tag 0.2.1 → Deploy browserid-verifier tag 0.2.1 to stage
Blocks: 1009096
Re-opening to track experimentation with the max-processes/max-backlog values. We managed to get 503s out of the verifier with only 40% CPU utilization, so we need to experiment to find the sweet spot that:
* gets the CPU utilization under load up above, say, 80%
* without backing up enough to give a significant level of 503s
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Setting maxBacklog=1000 improves CPU utilization and drops the error rate to zero in our verifier-only loadtest. I think this is the way to go, as we can fill this queue very quickly on the nodejs side, but also empty it pretty quickly as the workers burn through it. A longish queue seems fine to me.
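In config-file terms that is something like the following, reusing the same top-level knob described earlier; the exact file layout lives in puppet-config:

    {
      "computecluster": {
        "maxBacklog": 1000
      }
    }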
Status: REOPENED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
This is out in Prod.
Status: RESOLVED → VERIFIED