Closed
Bug 1014496
Opened 11 years ago
Closed 11 years ago
Please deploy tokenserver 1.2.9 to stage
Categories
(Cloud Services :: Operations: Deployment Requests - DEPRECATED, task)
Tracking
(Not tracked)
Status: VERIFIED
Resolution: FIXED
People
(Reporter: rfkelly, Assigned: mostlygeek)
References
Details
(Whiteboard: [qa+])
(/cc Katie because this affects metrics logging, and Bob because there'll be an identical Sync bug coming down the pipe once this goes out.)
This version of tokenserver includes some database tweaks and new JSON-format logging output that should be compatible with the existing heka decoders from fxa-*. Specifically:
Bug 1002898 - Normalize storage of node names in tokenserver db
Bug 1012509 - mozsvc: metlog out, heka in
Bug 969184 - Update tokenserver to use heka instead of metlog
Please deploy to stage, do the db migration, then run the usual loadtest suite:
* Build the rpm and deploy to all webheads
* From one of the webheads, run the migration tool (a Python-API equivalent is sketched below):
./bin/alembic upgrade head
It should work without any config changes required.
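For reference, a minimal sketch of driving the same migration through Alembic's Python API. The alembic.ini path here is an assumption; the shell command above is the canonical invocation:

    from alembic.config import Config
    from alembic import command

    # Assumes an alembic.ini alongside the tokenserver checkout.
    cfg = Config("alembic.ini")
    command.upgrade(cfg, "head")  # same effect as `./bin/alembic upgrade head`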
To get the new logging output, you'll need to tweak the log config in the token-prod.ini file and tell gunicorn to load it, like this:
https://github.com/mozilla-services/puppet-config/pull/502
This will log JSON lines to stderr, where circus can dump them and heka can slurp them per standard procedure.
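For illustration only (the real config is in the PR above), a rough Python sketch of the shape of that output - a formatter that writes one JSON object per line to stderr, with field names borrowed from the log samples later in this bug:

    import json
    import logging
    import sys
    from datetime import datetime

    class JsonLinesFormatter(logging.Formatter):
        # One JSON object per line, so heka can decode records individually.
        def format(self, record):
            return json.dumps({
                "time": datetime.utcfromtimestamp(record.created).isoformat() + "Z",
                "name": record.name,
                "op": record.name,
                "message": record.getMessage(),
                "v": 1,
            })

    handler = logging.StreamHandler(sys.stderr)  # circus captures stderr
    handler.setFormatter(JsonLinesFormatter())
    logging.getLogger().addHandler(handler)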
Comment 1•11 years ago
I would prefer we start this on Tuesday next week (after the long weekend). If time is of the essence, we can do it today (Thu). I am off Fri through Monday...
Status: NEW → ASSIGNED
QA Contact: jbonacci
Comment 2•11 years ago (Assignee)
I get this error message (after merging the PR):
{"error": "No option 'logger_hook' in section: 'global'", "traceback": "Uncaught exception:\n File \"/data/tokenserver/lib/python2.6/site-packages/gunicorn/arbiter.py\", line 495, in spawn_worker\n worker.init_process()\n File \"/data/tokenserver/lib/python2.6/site-packages/gunicorn/workers/base.py\", line 106, in init_process\n self.wsgi = self.app.wsgi()\n File \"/data/tokenserver/lib/python2.6/site-packages/gunicorn/app/base.py\", line 114, in wsgi\n self.callable = self.load()\n File \"/data/tokenserver/lib/python2.6/site-packages/gunicorn/app/wsgiapp.py\", line 60, in load\n return self.load_pasteapp()\n File \"/data/tokenserver/lib/python2.6/site-packages/gunicorn/app/wsgiapp.py\", line 56, in load_pasteapp\n return load_pasteapp(self.cfgurl, self.relpath, global_conf=None)\n File \"/data/tokenserver/lib/python2.6/site-packages/gunicorn/app/pasterapp.py\", line 58, in load_pasteapp\n global_conf=global_conf)\n File \"/data/tokenserver/lib/python2.6/site-packages/paste/deploy/loadwsgi.py\", line 247, in loadapp\n return loadobj(APP, uri, name=name, **kw)\n File \"/data/tokenserver/lib/python2.6/site-packages/paste/deploy/loadwsgi.py\", line 272, in loadobj\n return context.create()\n File \"/data/tokenserver/lib/python2.6/site-packages/paste/deploy/loadwsgi.py\", line 710, in create\n return self.object_type.invoke(self)\n File \"/data/tokenserver/lib/python2.6/site-packages/paste/deploy/loadwsgi.py\", line 207, in invoke\n app = filter(app)\n File \"/data/tokenserver/lib/python2.6/site-packages/paste/deploy/loadwsgi.py\", line 167, in filter_wrapper\n **context.local_conf)\n File \"/data/tokenserver/lib/python2.6/site-packages/paste/deploy/util.py\", line 55, in fix_call\n val = callable(*args, **kw)\n File \"/data/tokenserver/lib/python2.6/site-packages/mozsvc/middlewares.py\", line 120, in make_err_mdw\n return CatchErrorMiddleware(app, config)\n File \"/data/tokenserver/lib/python2.6/site-packages/mozsvc/middlewares.py\", line 86, in __init__\n hook = config.get('global', 'logger_hook')\n File \"/data/tokenserver/lib/python2.6/site-packages/configparser.py\", line 808, in get\n raise NoOptionError(option, section)\n<class 'configparser.NoOptionError'>\nNo option 'logger_hook' in section: 'global'\n", "time": "2014-05-22T18:30:56.795953Z", "v": 1, "message": "Exception in worker process:\nTraceback (most recent call last):\n File \"/data/tokenserver/lib/python2.6/site-packages/gunicorn/arbiter.py\", line 495, in spawn_worker\n worker.init_process()\n File \"/data/tokenserver/lib/python2.6/site-packages/gunicorn/workers/base.py\", line 106, in init_process\n self.wsgi = self.app.wsgi()\n File \"/data/tokenserver/lib/python2.6/site-packages/gunicorn/app/base.py\", line 114, in wsgi\n self.callable = self.load()\n File \"/data/tokenserver/lib/python2.6/site-packages/gunicorn/app/wsgiapp.py\", line 60, in load\n return self.load_pasteapp()\n File \"/data/tokenserver/lib/python2.6/site-packages/gunicorn/app/wsgiapp.py\", line 56, in load_pasteapp\n return load_pasteapp(self.cfgurl, self.relpath, global_conf=None)\n File \"/data/tokenserver/lib/python2.6/site-packages/gunicorn/app/pasterapp.py\", line 58, in load_pasteapp\n global_conf=global_conf)\n File \"/data/tokenserver/lib/python2.6/site-packages/paste/deploy/loadwsgi.py\", line 247, in loadapp\n return loadobj(APP, uri, name=name, **kw)\n File \"/data/tokenserver/lib/python2.6/site-packages/paste/deploy/loadwsgi.py\", line 272, in loadobj\n return context.create()\n File 
\"/data/tokenserver/lib/python2.6/site-packages/paste/deploy/loadwsgi.py\", line 710, in create\n return self.object_type.invoke(self)\n File \"/data/tokenserver/lib/python2.6/site-packages/paste/deploy/loadwsgi.py\", line 207, in invoke\n app = filter(app)\n File \"/data/tokenserver/lib/python2.6/site-packages/paste/deploy/loadwsgi.py\", line 167, in filter_wrapper\n **context.local_conf)\n File \"/data/tokenserver/lib/python2.6/site-packages/paste/deploy/util.py\", line 55, in fix_call\n val = callable(*args, **kw)\n File \"/data/tokenserver/lib/python2.6/site-packages/mozsvc/middlewares.py\", line 120, in make_err_mdw\n return CatchErrorMiddleware(app, config)\n File \"/data/tokenserver/lib/python2.6/site-packages/mozsvc/middlewares.py\", line 86, in __init__\n hook = config.get('global', 'logger_hook')\n File \"/data/tokenserver/lib/python2.6/site-packages/configparser.py\", line 808, in get\n raise NoOptionError(option, section)\nNoOptionError: No option 'logger_hook' in section: 'global'\n", "hostname": "ip-10-203-128-210", "pid": 3287, "op": "gunicorn.error", "name": "gunicorn.error"}
Comment 3•11 years ago (Reporter)
> I would prefer we start this on Tuesday next week (after the long weekend).
Fine by me :-)
Comment 4•11 years ago (Reporter)
> I get this error message
Whelp, at least the json-format error logging is working fine...
Comment 5•11 years ago (Reporter)
This error is *supposed* to be caught and ignored, but it looks like gunicorn is somehow using an updated version of the configparser library, and the exception class is different. I've added a compatibility check into mozsvc here:
https://github.com/mozilla-services/mozservices/commit/5fabece891bbd3bd2c9528cb3bf0562b3efb4af1
And tagged tokenserver 1.2.7 with the update. Please try again.
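For context, a rough sketch of the kind of compatibility check involved (the actual change is in the commit above): catch NoOptionError from whichever ConfigParser module the process happens to be using.

    try:
        from ConfigParser import NoOptionError as _Py2NoOptionError  # python 2 stdlib
    except ImportError:
        _Py2NoOptionError = None
    try:
        from configparser import NoOptionError as _BackportNoOptionError  # backport / py3
    except ImportError:
        _BackportNoOptionError = None

    _NO_OPTION_ERRORS = tuple(e for e in (_Py2NoOptionError, _BackportNoOptionError) if e)

    def get_logger_hook(config):
        # 'logger_hook' is optional; its absence should be ignored, not fatal.
        try:
            return config.get('global', 'logger_hook')
        except _NO_OPTION_ERRORS:
            return None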
Summary: Please deploy tokenserver 1.2.6 to stage → Please deploy tokenserver 1.2.7 to stage
Updated•11 years ago (Assignee)
Assignee: nobody → bwong
Updated•11 years ago (Assignee)
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Comment 6•11 years ago (Assignee)
Deploying the heka integration with this release, so reopening the bug so those changes can also be tested.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 7•11 years ago (Assignee)
Updated the new tokenserver stack in stage (tokenserver-stage) with the latest puppet-config, including this PR:
https://github.com/mozilla-services/puppet-config/pull/546
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Comment 8•11 years ago
OK. I will get to this tomorrow!
Comment 9•11 years ago (Assignee)
Sorry, spoke too soon. Looks like the last PR introduced some bugs. Let me fix those first.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 10•11 years ago (Assignee)
Bugs fixed. Ready for QA.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Comment 11•11 years ago
Found some errors during load testing - I want to make sure they are OK before I Verify this ticket to move to Production.
Summary:
* 60 Min load test had 13 "addFailure" count (no other details from dashboard)
* Errors found here related to verifier:
/media/ephemeral0/logs/tokenserver/token.error.log
Examples:
"HttpConnectionPool is full, discarding connection: verifier.stage.mozaws.net"
"Resetting dropped connection: verifier.stage.mozaws.net"
"Starting new HTTPS connection (179): verifier.stage.mozaws.net"
I am going to assume, as Benson stated, that this is due to 3 m3.medium TS instances hitting one c3.large verifier instance (a connection-pool sketch follows at the end of this comment). But I would like confirmation that these are "acceptable" errors under high load.
* /media/ephemeral0/logs/nginx/access.log
Shows the expected % of 200s and 401s
There were a handful of 503s per instance. I think this is ok.
* /var/log/hekad/tokenserver.stderr.log
2014/06/12 16:03:10 Decoder 'TokenServer-TokenServerDecoder' error: Failed parsing: Exception KeyError: KeyError(39996304,) in <module 'threading' from '/usr/lib64/python2.6/threading.pyc'> ignored
2014/06/12 23:03:28 Plugin 'AggregatorOutput' error: writing to heka.shared.us-east-1.stage.mozaws.net:5565: write tcp 10.182.140.200:5565: broken pipe
2014/06/12 23:03:29 Plugin 'AggregatorOutput' error: dial tcp 10.182.140.200:5565: connection refused
I think this needs to be addressed.
Over on the Verifier:
I filed two issues here not directly related to the load testing on TokenServer:
https://github.com/mozilla-services/puppet-config/issues/599
https://github.com/mozilla-services/puppet-config/issues/600
/media/ephemeral0/fxa-browserid-verifier/verifier_out.log
I see these, which I assume are ok for tokenserver load testing:
{"op":"bid.v2","name":"bid.v2","time":"2014-06-12T23:20:06.177Z","pid":31393,"v":1,"hostname":"ip-10-187-17-59","message":"assertion_failure"}
{"op":"bid.v2","name":"bid.v2","time":"2014-06-12T23:20:06.178Z","pid":31393,"v":1,"hostname":"ip-10-187-17-59","message":"verify { result: 'failure',\n reason: 'untrusted issuer, expected \\'login.mozilla.org\\', got \\'mockmyid.s3-us-west-2.amazonaws.com\\'',\n rp: 'https://token.stage.mozaws.net' }"}
Comment 12•11 years ago
(In reply to James Bonacci [:jbonacci] from comment #11)
>
> * /var/log/hekad/tokenserver.stderr.log
> 2014/06/12 16:03:10 Decoder 'TokenServer-TokenServerDecoder' error: Failed
> parsing: Exception KeyError: KeyError(39996304,) in <module 'threading' from
> '/usr/lib64/python2.6/threading.pyc'> ignored
> 2014/06/12 23:03:28 Plugin 'AggregatorOutput' error: writing to
> heka.shared.us-east-1.stage.mozaws.net:5565: write tcp 10.182.140.200:5565:
> broken pipe
> 2014/06/12 23:03:29 Plugin 'AggregatorOutput' error: dial tcp
> 10.182.140.200:5565: connection refused
>
> I think this needs to be addressed.
>
The broken pipe / connection refused errors are not a problem. https://github.com/mozilla-services/puppet-config/pull/601 probably fixes the decoding error.
Comment 13•11 years ago (Reporter)
> Summary:
> * 60 Min load test had 13 "addFailure" count (no other details from dashboard)
> ...
> There were a handful of 503s per instance.
Were there 13 503s to match the 13 addFailure events? In general, any amount of addFailure in the loads output is bad news. If they were 503s, we should be able to find corresponding errors in the verifier logs.
I'm not *too* worried about a handful of 503s as long as they're failing cleanly; it may just mean we need to up the scale of the verifier. But we should make sure we can track them all the way to the source.
> I am going to assume, as Benson stated, that this is due to the fact that 3
> m3.medium TS instances are hitting one c3.large verifier instance.
How does this compare to the instances in prod?
Comment 14•11 years ago (Reporter)
Have the instances for this been torn down? I see a "tokenserver-stage" stack in CFN but there are no instances in it.
I can't find compute-cluster failures in the verifier logs. If we've lost the TS instances, we may have to run this test again to sanity-check things.
Flags: needinfo?(bwong)
Comment 15•11 years ago (Assignee)
Yes! That was totally my bad. Drive-by ops. I will spin them back up.
Flags: needinfo?(bwong)
Comment 16•11 years ago
The 3 instances are back up:
i-546e4004
i-1feca04c
i-ef2287bd
I will see about a 30-60min load test this afternoon...
Comment 17•11 years ago
Started with a short 15min Verifier load test to see what shows up in the logs.
Now going with a 30min TS load test.
Comment 18•11 years ago
Did a 30min load test: https://loads.services.mozilla.com/run/087de103-1a1e-444f-9acb-cde2aafbac17
Results there are similar:
Results
Tests over 419844
Successes 419825
Failures 0
Errors 0
TCP Hits 428108
Opened web sockets 0
Total web sockets 0
Bytes/websockets 0
Requests / second (RPS) 228
Custom metrics
addFailure 19
Checking the 3 instances - /media/ephemeral0/logs/nginx/access.log:
i-546e4004 - ec2-107-20-66-198 had no 503s, just 200s, 401s, 499s
i-1feca04c - ec2-54-89-96-167 had no 503s, just 200s and 401s
i-ef2287bd - ec2-54-89-19-190 had no 503s, just 200s, 401s, 499s
So, I can't find the 503s this time around.
Leaving this as is for :rfkelly to dig in on his Monday...
Comment 19•11 years ago (Reporter)
Ouch, 499s, that ain't good. I'll dig into this today.
Comment 20•11 years ago (Reporter)
The 499s are nginx timing us out after 60s. AFAICT this is caused by the verifier taking a very long time to respond, e.g. I found some log lines like the following:
{"code": 200, "v": 1, "pid": 3642, "agent": "python-requests/2.2.1 CPython/2.7.3 Linux/3.5.0-23-generic", "token.assertion.verify_failure": 1, "path": "https://token.stage.mozaws.net/1.0/sync/1.5", "remoteAddressChain": ["54.218.210.15", "10.116.221.171", "10.116.221.171"], "request_time": 63.027626991271973, "hostname": "ip-10-231-59-21", "name": "mozsvc.metrics", "token.assertion.connection_error": 1, "tokenserver.assertion.verify": 63.027315855026245, "time": "2014-06-13T22:46:17.237487Z", "method": "GET", "op": "mozsvc.metrics"}
Note the "tokenserver.assertion.verify" timer telling us that it took 60 seconds for the verifier to return.
Indeed, on i-546e4004 I see 13 499 responses, and 13 log lines with a "tokenserver.assertion.verify" timer greater than 60s.
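A small sketch of that cross-check, assuming one JSON record per line in token.log:

    import json

    def count_slow_verifies(path, threshold=60.0):
        count = 0
        with open(path) as f:
            for line in f:
                try:
                    rec = json.loads(line)
                except ValueError:
                    continue  # skip any non-JSON noise
                if rec.get("tokenserver.assertion.verify", 0) > threshold:
                    count += 1
        return count

    # count_slow_verifies("/media/ephemeral0/logs/tokenserver/token.log")  # -> 13 here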
Two things to do coming out of this:
* dig into the verifier logs and see if there's anything in there about the slowdown
* put a hard timeout on these requests from the tokenserver side, so that nginx doesn't have to kill us (a sketch follows below)
Neither of these need to block this from rolling to production. The verifier got a bit overloaded, but we coped with it fairly gracefully.
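For the second item, a minimal sketch assuming the verifier client uses the requests library; the URL is taken from the logs above, and the 30s figure is just an example rather than the value that would actually ship:

    import json
    import requests

    VERIFIER_URL = "https://verifier.stage.mozaws.net/v2"

    def verify_assertion(assertion, audience):
        payload = json.dumps({"assertion": assertion, "audience": audience})
        # A hard cap well inside nginx's 60s window, so we can fail cleanly
        # (e.g. with a 503) instead of being killed mid-request.
        resp = requests.post(VERIFIER_URL, data=payload,
                             headers={"Content-Type": "application/json"},
                             timeout=30)
        return resp.json()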
Comment 21•11 years ago (Reporter)
Hmm, actually that log line has `"token.assertion.connection_error": 1` so this did actually time out - it just timed out *after* nginx had given up and reported an error to the client.
Assuming these metrics are going into heka, we should consider plotting/analysing the "tokenserver.assertion.verify" timer for reporting purposes.
Comment 22•11 years ago (Reporter)
Interestingly, the verifier nginx logs don't show any requests taking more than ~1 second to complete. I wonder if this was caused by network problems of some kind?
Comment 23•11 years ago
OK - based on :rfkelly's comments above, I am not sure if we are ready to go to Prod.
https://github.com/mozilla-services/puppet-config/issues/599 has merged
https://github.com/mozilla-services/puppet-config/issues/600 is now closed, but it sounds like we need an update to verifier stage (rfk's/whd's comments) before we can complete the work here.
So: we deploy to Verifier Stage, I test/pass that, then I retest this release in Tokenserver Stage, and then we go to Prod for both?
:rfkelly comments?
:mostlygeek comments?
Comment 24•11 years ago
Is there a new deploy of Tokenserver needed due to https://github.com/mozilla-services/puppet-config/pull/601?
Sure feels like it.
Comment 25•11 years ago (Assignee)
So the plan is:
- new verifier version + puppet to stage
- new tokenserver puppet-config version to stage
We'll deploy both new versions and load test them together. If there are no issues, we'll push both to production, likely starting with the verifier and then tokenserver.
Comment 26•11 years ago (Assignee)
Reopening this bug for the updated stage deploy. Some changes:
- using 3x c3.large instances instead of m3.medium. We use a lot more CPU than we thought
- in prod, 6x m3.medium peaks at about 35% CPU
- we see latency spikes due to AWS CPU limiting (in CPU steal stats)
- trying out bigger instances
- are the load tests updated to match what we see in prod, i.e. lots more reads than writes?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 27•11 years ago
I think the load test tuning is defined here:
https://github.com/mozilla-services/tokenserver/blob/master/loadtest/loadtest.py#L101
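A hypothetical sketch of what read-heavy weighting could look like; the ratio and names are illustrative, not taken from loadtest.py:

    import random

    READ_RATIO = 0.9  # assumption: prod traffic is mostly reads

    def pick_operation():
        # Weighted coin-flip between read-style and write-style requests.
        return "read" if random.random() < READ_RATIO else "write"

    if __name__ == "__main__":
        ops = [pick_operation() for _ in range(10000)]
        print("%d reads, %d writes" % (ops.count("read"), ops.count("write")))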
Comment 28•11 years ago (Reporter)
I've tagged 1.2.8 with this change included:
Bug 1025767 - Add explicit timeout to remote verifier requests
> are the load tests updated to match what we see in prod, ie: Lots more reads than writes
I'll start looking into this, but I don't think we should block the deploy while waiting for it.
Updated•11 years ago (Reporter)
Summary: Please deploy tokenserver 1.2.7 to stage → Please deploy tokenserver 1.2.8 to stage
Comment 29•11 years ago (Assignee)
Updating tokenserver stage stack. Notable changes:
- using c3.large instances instead of m3.medium.
- using tokenserver 1.2.8
- using puppet-config version: 20140618181804
Updated•11 years ago (Assignee)
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Comment 30•11 years ago (Assignee)
Hmm... getting this error when circus tries to start token server:
Error: No module named paste.deploy
I guess something changed in the dependencies?
Ryan, please look at: ec2-54-80-30-203.compute-1.amazonaws.com
Status: RESOLVED → REOPENED
Flags: needinfo?(rfkelly)
Resolution: FIXED → ---
Comment 31•11 years ago (Assignee)
It looks like the RPM build script (svcops/services/token-server/mock-create.sh) needs to be updated since the Makefile changed a bit. I'll dig deeper into this.
Comment 32•11 years ago (Assignee)
- had to create a 1.2.8-2 RPM w/ a fixed rpm build script (https://github.com/mozilla-services/svcops/pull/169)
- had to update nginx proxy config and circus configs (https://github.com/mozilla-services/puppet-config/pull/610)
- updated the stack
- looks like it is all working now
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Comment 33•11 years ago
Clearing the info field.
Verified 3 instances running "tokenserver-svcops 1.2.8-2 x86_64 73529457"
Starting a 10min load test now.
Will switch to 60min after that.
Related/relevant issues to track/review/verify:
bug 1025767
bug 997344
https://github.com/mozilla-services/tokenserver/issues/65
https://github.com/mozilla-services/svcops/pull/169
https://github.com/mozilla-services/puppet-config/issues/599
https://github.com/mozilla-services/puppet-config/issues/600
https://github.com/mozilla-services/puppet-config/pull/601
https://github.com/mozilla-services/puppet-config/pull/610
Flags: needinfo?(rfkelly)
Comment 34•11 years ago
10min test looked fine.
Going to a 60min test.
I will look over the test results, the logs, and review the above list of bugs/issues tomorrow morning...
Comment 35•11 years ago
Here are the results of the 60min test:
Test was launched by jbonacci
Run Id be7fc3fb-5fd7-4d70-af10-4f07b3773cd9
Duration 1 h and 22 sec.
Started 2014-06-18 23:51:57 UTC
Ended 2014-06-19 00:52:19 UTC
State Ended
Users [20]
Hits None
Agents 5
Duration 3600
Server URL https://token.stage.mozaws.net
Tests over 866281
Successes 866271
Failures 0
Errors 0
TCP Hits 883617
Opened web sockets 0
Total web sockets 0
Bytes/websockets 0
Requests / second (RPS) 243
addFailure 10
REF: https://loads.services.mozilla.com/run/be7fc3fb-5fd7-4d70-af10-4f07b3773cd9
I will look at those addFailures tomorrow along with the usual review/investigations...
Comment 36•11 years ago (Reporter)
Picking a random TS webhead, I see 4 503s corresponding to 4 connection_error log lines like:
{"code": 200, "v": 1, "pid": 3435, "agent": "python-requests/2.2.1 CPython/2.7.3 Linux/3.5.0-23-generic", "token.assertion.verify_failure": 1, "path": "https://token.stage.mozaws.net/1.0/sync/1.5", "remoteAddressChain": ["54.245.44.231", "10.198.137.229", "127.0.0.1"], "request_time": 0.10312891006469727, "hostname": "ip-10-97-158-44", "name": "mozsvc.metrics", "token.assertion.connection_error": 1, "tokenserver.assertion.verify": 0.1025238037109375, "time": "2014-06-19T00:48:01.585699Z", "method": "GET", "op": "mozsvc.metrics"}
Tangential bug: this should really be saying code=503, not code=200; Bug 1027444.
Interestingly, I can't see any new errors in the verifier logs.
Comment 37•11 years ago (Reporter)
We probably need Bug 988095 to get any more details of the cause of these connection errors.
Comment 38•11 years ago (Reporter)
I tagged 1.2.9 with the improved logging from Bug 988095, let's try again with that in place.
Summary: Please deploy tokenserver 1.2.8 to stage → Please deploy tokenserver 1.2.9 to stage
Comment 39•11 years ago
OK. I will coordinate with :mostlygeek. Not sure of the priority of this one compared to Loop and FMD.
Updated•11 years ago
Comment 40•11 years ago (Assignee)
Done.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Comment 41•11 years ago
Awesome. Will start some load tests this afternoon and into tomorrow...
Comment 42•11 years ago
Verified 3 c3.large instances running - tokenserver-svcops 1.2.9-1 x86_64 73530502
Starting a 10min load test now...
Comment 43•11 years ago
Make that a 30min load test...
Comment 44•11 years ago
So, here are the results: https://loads.services.mozilla.com/run/8b7cb33d-1752-46b3-be5d-6475a4bb3488
Run Id 8b7cb33d-1752-46b3-be5d-6475a4bb3488
Duration 30 min and 29 sec.
Started 2014-06-19 20:52:33 UTC
Ended 2014-06-19 21:23:02 UTC
State Ended
Users [20]
Hits None
Agents 5
Duration 1800
Server URL https://token.stage.mozaws.net
Results
Tests over 428408
Successes 428024
Failures 0
Errors 0
TCP Hits 437191
Opened web sockets 0
Total web sockets 0
Bytes/websockets 0
Requests / second (RPS) 239
addFailure 384
I will look into the failures...
Comment 45•11 years ago
Here is a quick summary:
/media/ephemeral0/logs/nginx/access.log
503s:
54.245.44.231 2014-06-19T21:20:43+00:00 "GET /1.0/sync/1.5 HTTP/1.1" 503 1922 320 "python-requests/2.2.1 CPython/2.7.3 Linux/3.5.0-23-generic" 0.008
(we saw this last time)
/media/ephemeral0/logs/tokenserver/token.log
(I was filtering out all '"code": 200' lines - see bug 1027444)
{"time": "2014-06-19T21:22:02.008053Z", "v": 1, "message": "Resetting dropped connection: verifier.stage.mozaws.net", "hostname": "ip-10-69-11-50", "pid": 3557, "op": "requests.packages.urllib3.connectionpool", "name": "requests.packages.urllib3.connectionpool"}
{"time": "2014-06-19T21:09:09.697471Z", "v": 1, "message": "Connection pool is full, discarding connection: verifier.stage.mozaws.net", "hostname": "ip-10-69-11-50", "pid": 3558, "op": "requests.packages.urllib3.connectionpool", "name": "requests.packages.urllib3.connectionpool"}
"token.assertion.invalid_signature_error" (that look like 200s)
"tokenserver.assertion.verify" (that look like 200s)
"token.assertion.connection_error" (that look like 200s)
We saw all those last time
This is new:
/media/ephemeral0/logs/tokenserver/token.log
{"error": "ConnectionError('server returned invalid response code',)", "traceback": "Uncaught exception:\n File \"/data/tokenserver/tokenserver/views.py\", line 73, in valid_assertion\n assertion = verifier.verify(assertion)\n File \"/data/tokenserver/tokenserver/verifiers.py\", line 111, in verify\n raise ConnectionError('server returned invalid response code')\n<class 'browserid.errors.ConnectionError'>\nConnectionError('server returned invalid response code',)\n", "time": "2014-06-19T21:20:41.903379Z", "v": 1, "message": "Unexpected verification error", "hostname": "ip-10-69-11-50", "pid": 3565, "op": "tokenserver", "name": "tokenserver"}
Comment 46•11 years ago (Reporter)
Great, this is the new logging doing its thing. "server returned invalid response code" means we *should* be able to find some non-200 responses on the verifier. Either that or our requests lib is synthesizing an error in response to some network silliness.
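Per the traceback in comment 45, the client maps any non-200 verifier response to a connection error. A stand-in sketch of that mapping (exception class stubbed locally rather than imported from browserid):

    class ConnectionError(Exception):
        """Stand-in for browserid.errors.ConnectionError."""

    def check_verifier_response(response):
        # Any non-200 (e.g. the verifier's "too busy" replies) surfaces as
        # the "server returned invalid response code" error quoted above.
        if response.status_code != 200:
            raise ConnectionError('server returned invalid response code')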
Comment 47•11 years ago (Reporter)
Yep, I see a bunch of "too busy" messages on ec2-54-81-41-31.compute-1.amazonaws.com
Comment 48•11 years ago (Reporter)
374 of them to be precise - which leaves 10 addFailures on the TS unaccounted for...
No errors reported on the other verifier webhead, ec2-54-81-65-105.compute-1.amazonaws.com
Comment 49•11 years ago
Yeah, see:
https://bugzilla.mozilla.org/show_bug.cgi?id=1026644#c5
All my original errors for verifier testing showed up on 41-31, not 65-105.
Comment 50•11 years ago (Reporter)
On the tokenserver webheads, I see a total of 384 503s in the nginx logs, and a total of 384 log entries with "traceback" in them. So the new logging appears to be working correctly.
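A sketch of that tally; the '" 503 ' pattern matches the access-log format shown in comment 45:

    def count_matches(path, needle):
        with open(path) as f:
            return sum(1 for line in f if needle in line)

    # count_matches("/media/ephemeral0/logs/nginx/access.log", '" 503 ')          # -> 384 total
    # count_matches("/media/ephemeral0/logs/tokenserver/token.log", "traceback")  # -> 384 total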
Comment 51•11 years ago (Reporter)
Aha, the other 10 are all instances of this, a failure generated locally by the requests library:
{"error": "ConnectionError(u\"Failed to POST https://verifier.stage.mozaws.net/v2. Reason: HTTPSConnectionPool(host='verifier.stage.mozaws.net', port=443): Max retries exceeded with url: /v2 (Caused by <class 'httplib.BadStatusLine'>: )\",)", "traceback": "Uncaught exception:\n File \"/data/tokenserver/tokenserver/views.py\", line 73, in valid_assertion\n assertion = verifier.verify(assertion)\n File \"/data/tokenserver/tokenserver/verifiers.py\", line 108, in verify\n raise ConnectionError(msg)\n<class 'browserid.errors.ConnectionError'>\nConnectionError(u\"Failed to POST https://verifier.stage.mozaws.net/v2. Reason: HTTPSConnectionPool(host='verifier.stage.mozaws.net', port=443): Max retries exceeded with url: /v2 (Caused by <class 'httplib.BadStatusLine'>: )\",)\n", "time": "2014-06-19T21:09:02.561113Z", "v": 1, "message": "Unexpected verification error", "hostname": "ip-10-150-39-209", "pid": 3555, "op": "tokenserver", "name": "tokenserver"}
Comment 52•11 years ago (Reporter)
IIRC "BadStatusLine" is usually python throwing an error due to premature connection close or some other network-level rubbish.
Comment 53•11 years ago
OK. Talked to :rfkelly on IRC - we don't see anything here specific to this deployment, so let's move forward with Production...
See bug 1027899
Status: RESOLVED → VERIFIED