Bug 974194 (Closed): Tokenserver now giving 401s under load
Opened 10 years ago • Closed 10 years ago
Categories: Cloud Services Graveyard :: Server: Token (defect, P1)
Tracking: (Not tracked)
Status: VERIFIED FIXED
People: (Reporter: jbonacci, Unassigned)
Whiteboard: [qa+]
Attachments: (1 file) patch, 845 bytes | telliott: review+ | Details | Diff | Splinter Review
Description:

This seems like something new, based on the current TS Stage environment. I ran a 30-minute load test against the TS Stage environment, making use of the new "megabench" command in the TS loadtest directory and the Loads cluster.

There are a significant number of 401s showing in /media/ephemeral0/logs/nginx/access.log on each TS instance. Trawling /media/ephemeral0/logs/tokenserver/token.log, you can see a significant number of JSON events with:

"name": "token.assertion.connection_error"
"name": "token.assertion.expired_signature_error"

This needs to be investigated...
Reporter
Updated • 10 years ago
Whiteboard: [qa+]
Comment 1 • 10 years ago
I assume the proximate cause here is that the loads slave machines don't have ntp installed, so their clocks are all wonky. I've been resisting fixing this because it's a good opportunity to debug our clockskew logic.
Comment 2 • 10 years ago
*slaps forehead* So the "fix" in this commit doesn't work: https://github.com/mozilla-services/tokenserver/commit/d5e7c1e24a4d97be45adf8dbb3f224c94215d1c7

It has the logic to adjust for timeskew, but doesn't actually record the timeskew if we get a 401. Bleh.
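A minimal sketch of the kind of skew handling being discussed here (the class and method names are assumptions for illustration, not the actual loadtest code): on a 401, record the offset between the server's Date header and the local clock, then fold that offset into subsequently generated timestamps.

```python
import time
import email.utils

class SkewAwareClient:
    """Illustrative only: resync from a 401 response's Date header
    and apply the recorded skew to later timestamps."""

    def __init__(self):
        self.skew = 0  # seconds to add to the local clock

    def timestamp(self):
        # Timestamp to embed in the next signed request.
        return int(time.time()) + self.skew

    def handle_response(self, status, date_header):
        # On a 401, assume our clock is off and record the offset
        # between the server's reported time and ours.
        if status == 401 and date_header:
            server_time = email.utils.mktime_tz(
                email.utils.parsedate_tz(date_header))
            self.skew = int(server_time - time.time())
```

The failure mode described in comment 2 corresponds to applying `self.skew` in `timestamp()` without ever setting it in `handle_response()`.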
Comment 3 • 10 years ago
Patch to detect and adjust for clockskew in the loadtest.
Attachment #8377963 - Flags: review?(telliott)
Updated • 10 years ago
Attachment #8377963 - Flags: review?(telliott) → review+
Comment 4 • 10 years ago
This seems to do the trick for me; far fewer failures reported from the run: https://github.com/mozilla-services/tokenserver/commit/03318a5cb981841d564cbe07f1896527ae80b959
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Reporter
Comment 5 • 10 years ago
Do we need to push the new code to the TS Stage instances?
Comment 6 • 10 years ago
> Do we need to push the new code to the TS Stage instances?
No, the fix is in the loadtest client. As long as you run loadtests with an up-to-date checkout, you should be good.
Reporter
Comment 7 • 10 years ago
I git cloned and built the Tokenserver repo (which would be master, which I assume is what we want). I am still getting 401s... I am using the very latest setup of TS and Sync and Verifier (all now in US East). I am just running the basic test against TS Stage:

$ make test SERVER_URL=https://token.stage.mozaws.net

I see 401s in the nginx access logs:

[21/Feb/2014:14:03:12 -0500] "GET /1.0/sync/1.5 HTTP/1.1" 401 110...
[21/Feb/2014:14:03:37 -0500] "GET /1.0/sync/1.5 HTTP/1.1" 401 96...

I see errors in the tokenserver token.log file:

"name": "token.assertion.verify_failure"
"token.assertion.audience_mismatch_error"

(always paired) These are consistent across all 3 instances of TS in Stage.
Comment 8 • 10 years ago
> make test SERVER_URL=https://token.stage.mozaws.net

I'm not familiar with this stage URL. Does it work if you use the "token-stage3" url that we used to have?

> "token.assertion.audience_mismatch_error"

This indicates that the "audiences" setting on the stage servers is wrong. It must always match the expected public URL of the service. So I suspect that the stage servers are using "token-stage3.stage.mozaws.net" as their expected audience string, and it fails when you access them via a different URL.

Benson, can you confirm this in the config?
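Illustrative only (the function name and config shape below are assumptions, not the actual tokenserver code): the verifier rejects an assertion whose audience is not among the service's configured audiences, which is why requests through the new hostname 401 against a config still listing the old one.

```python
# Hypothetical sketch of an audience check; the real tokenserver
# config handling and verification code will differ.
def audience_allowed(assertion_audience, configured_audiences):
    # The assertion's audience must exactly match one of the
    # service's expected public URLs.
    return assertion_audience in configured_audiences

# A config still pointing at the old hostname rejects the new one:
old_config = ["https://token-stage3.stage.mozaws.net"]
print(audience_allowed("https://token.stage.mozaws.net", old_config))   # False

# After updating the "audiences" setting, the same request passes:
new_config = ["https://token.stage.mozaws.net"]
print(audience_allowed("https://token.stage.mozaws.net", new_config))   # True
```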
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 9 • 10 years ago
Right, I see this comment explaining the change: https://bugzilla.mozilla.org/show_bug.cgi?id=969209#c10

So we need to update the "audiences" setting to match.
Reporter
Comment 10 • 10 years ago
OK. Good to know. And I "think" that token.stage.mozaws.net will be the final name.
Reporter
Comment 11 • 10 years ago
:bwong to make the appropriate change on the 3 TS instances. Consider this a blocker.
Comment 12 • 10 years ago
:jbonacci: OK, all staging token servers have been upgraded. So going forward, when we put a new staging tokenserver stack into production, it'll be token.stage.mozaws.net.
Reporter
Comment 13 • 10 years ago
Had to restart the app/hosts with this change... :-)
Status: REOPENED → RESOLVED
Closed: 10 years ago → 10 years ago
Resolution: --- → FIXED
Reporter
Comment 14 • 10 years ago
Sent out email on this ticket. We may want to bounce it once we talk about the issues I found in the Friday night load tests.
Reporter
Comment 15 • 10 years ago
See email thread. So the trick here will be to determine whether the % of 401s I see matches (roughly) what is expected in the load test, by definition.
Reporter
Updated • 10 years ago
Priority: -- → P1
Reporter
Comment 16 • 10 years ago
Re-opening rather than creating a brand new ticket. I am testing TS+Verifier Stage today, and I am seeing nearly 100% failures, so we must have some config wrong somewhere...

On the 3 TS instances:

The /media/ephemeral0/logs/tokenserver/token.log files are filled with these:

"name": "token.assertion.invalid_signature_error"
"name": "token.assertion.verify_failure"

The /media/ephemeral0/logs/nginx/access.log files are filled with these:

10.28.71.244 - - [07/Mar/2014:16:45:50 -0500] "GET /1.0/sync/1.5 HTTP/1.1" 401 110 "-" "python-requests/2.2.1 CPython/2.7.3 Linux/3.5.0-23-generic" XFF="54.245.44.231" TIME=0.301

On the Verifier instances:

The /media/ephemeral0/fxa-browserid-verifier/fxa-browserid-verifier-800{0,1}-out.log is showing the following:

INFO: verify { result: 'failure', reason: 'mockmyid.com is not a browserid primary - non-200 response code to /.well-known/browserid', rp: 'token.stage.mozaws.net' }
INFO: assertion_verification_time 243
INFO: assertion_failure

That can't be good!

The /media/ephemeral0/nginx/logs/access.log shows nothing but 200s (on POSTs).
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Reporter
Comment 17 • 10 years ago
Should be self-evident, but consider this a blocker for TS and combined load tests.
Comment 18 • 10 years ago
Visiting "https://mockmyid.com/.well-known/browserid" shows it working fine now. Possible transient issue with the mockmyid server? Or a serious network-access problem in the verifier...
Comment 19 • 10 years ago
The loadtests seem to be working for me now; my guess would be some sort of transient error on the mockmyid.com server.
Reporter
Comment 20 • 10 years ago
So very odd. OK. I was sure we had a config mismatch, given all the changes we have been making to Stage. Will restart this tomorrow.
Reporter
Comment 21 • 10 years ago
See the follow-up bugs from :rfkelly and from Dan C:

https://bugzilla.mozilla.org/show_bug.cgi?id=977922
https://bugzilla.mozilla.org/show_bug.cgi?id=981877

Leaving this open until those get addressed...
Reporter
Comment 22 • 10 years ago
OK, so we have moved to our own IdP in Stage. The tests are looking much, much better. Some 401s still to contend with, but more manageable... Running an 8-hour overnight. If the count of 401s is low, or they all come at the beginning, etc., I will consider this bug fixed.
Reporter
Comment 23 • 10 years ago
So, I still see some 401s in the TS nginx logs:

200s: 101612
401s: 2995

This is a better percentage than before. The combined count of these:

"name": "token.assertion.verify_failure"
"name": "token.assertion.expired_signature_error"

is about 2640. That seems to be about a 2% failure rate per TS instance/node. Let me know if that is expected (I thought it was 1%):
https://github.com/mozilla-services/tokenserver/blob/master/loadtest/loadtest.py#L58-L60
Comment 24 • 10 years ago
Digging into the code: the loadtest calls _test_bad_assertion() 1% of the time, but that method actually sends three separate requests: one with an expired assertion, one with the wrong issuer, and one with an authority mis-match. Given the vagaries of randomness, I think anything in the low-single-digit percentage of 401s is acceptable. Most importantly, they should not show up as errors in the loadtest report.
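A quick back-of-envelope check of the figures from comment 23, assuming (as an approximation, not something stated in the loadtest code) roughly one token request per normal iteration:

```python
# 401 and 200 counts reported in comment 23.
ok, bad = 101612, 2995
observed = bad / (ok + bad)

# _test_bad_assertion() runs on ~1% of iterations and sends three
# 401-producing requests; assume one good request per other iteration.
expected = (0.01 * 3) / (0.99 * 1 + 0.01 * 3)

print(f"observed {observed:.1%}, expected ~{expected:.1%}")
# Both land just under 3%, consistent with "low single digits".
```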
Reporter
Comment 25 • 10 years ago
Awesome.
Status: REOPENED → RESOLVED
Closed: 10 years ago → 10 years ago
Resolution: --- → FIXED
Reporter
Updated • 10 years ago
Status: RESOLVED → VERIFIED
Updated • 1 year ago
Product: Cloud Services → Cloud Services Graveyard