Closed
Bug 974194
Opened 11 years ago
Closed 11 years ago
Tokenserver now giving 401s under load
Categories
(Cloud Services Graveyard :: Server: Token, defect, P1)
Cloud Services Graveyard
Server: Token
Tracking
(Not tracked)
VERIFIED
FIXED
People
(Reporter: jbonacci, Unassigned)
References
Details
(Whiteboard: [qa+])
Attachments
(1 file)
845 bytes, patch (telliott: review+)
This seems like something new, based on the current TS Stage environment.
I ran a 30 minute load test against the TS Stage environment, making use of the new "megabench" command in the TS loadtest directory and the Loads cluster.
There are a significant number of 401s showing in the /media/ephemeral0/logs/nginx/access.log on each TS instance.
Trawling through /media/ephemeral0/logs/tokenserver/token.log, you can see a significant number of JSON events with
"name": "token.assertion.connection_error"
"name": "token.assertion.expired_signature_error"
This needs to be investigated...
Reporter
Updated•11 years ago
Whiteboard: [qa+]
Comment 1•11 years ago
I assume the proximate cause here is that the loads slave machines don't have ntp installed, so their clocks are all wonky. I've been resisting fixing this because it's a good opportunity to debug our clockskew logic.
Comment 2•11 years ago
*slaps forehead*
So the "fix" in this commit doesn't work: https://github.com/mozilla-services/tokenserver/commit/d5e7c1e24a4d97be45adf8dbb3f224c94215d1c7
It has the logic to adjust for timeskew, but doesn't actually record timeskew if we get a 401. Bleh.
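The missing piece described above can be sketched in a few lines. This is illustrative only, not the actual tokenserver loadtest code, and `X-Timestamp` is a hypothetical header name standing in for whatever the server actually returns:

```python
import time

class SkewTrackingClient:
    """Minimal sketch of the fix: remember the client/server clock
    offset when a request is rejected, so later assertions are built
    against the server's notion of "now" rather than the local clock."""

    SERVER_TIME_HEADER = "X-Timestamp"  # hypothetical header name

    def __init__(self):
        self.clockskew = 0.0  # server_time - local_time, in seconds

    def server_time(self):
        # Best current guess at the server's clock.
        return time.time() + self.clockskew

    def record_skew(self, status_code, headers):
        # This is the step the earlier commit missed: actually record
        # the skew when the server replies 401 with its own timestamp.
        server_ts = headers.get(self.SERVER_TIME_HEADER)
        if status_code == 401 and server_ts is not None:
            self.clockskew = float(server_ts) - time.time()
```

With this in place, the client would compute assertion expiry times from `server_time()` instead of `time.time()`, so a skewed slave clock no longer produces `expired_signature_error` responses.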
Comment 3•11 years ago
Patch to detect and adjust for clockskew in the loadtest.
Attachment #8377963 -
Flags: review?(telliott)
Updated•11 years ago
Attachment #8377963 -
Flags: review?(telliott) → review+
Comment 4•11 years ago
This seems to do the trick for me; far fewer failures reported from the run: https://github.com/mozilla-services/tokenserver/commit/03318a5cb981841d564cbe07f1896527ae80b959
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Reporter
Comment 5•11 years ago
Do we need to push the new code to the TS Stage instances?
Comment 6•11 years ago
> Do we need to push the new code to the TS Stage instances?
No, the fix is in the loadtest client. As long as you run the loadtests from an up-to-date checkout, you should be good.
Reporter
Comment 7•11 years ago
I git cloned and built the Tokenserver repo (which would be master, which I assume is what we want).
I am still getting 401s...
I am using the very latest setup of TS and Sync and Verifier (all now in US East).
I am just running the basic test against TS Stage:
$ make test SERVER_URL=https://token.stage.mozaws.net
I see 401s in the nginx access logs.
[21/Feb/2014:14:03:12 -0500] "GET /1.0/sync/1.5 HTTP/1.1" 401 110...
[21/Feb/2014:14:03:37 -0500] "GET /1.0/sync/1.5 HTTP/1.1" 401 96...
I see errors in the tokenserver token.log file:
"name": "token.assertion.verify_failure"
"token.assertion.audience_mismatch_error"
(always paired)
These are consistent across all 3 instances of TS in Stage.
Comment 8•11 years ago
> make test SERVER_URL=https://token.stage.mozaws.net
I'm not familiar with this stage URL. Does it work if you use the "token-stage3" url that we used to have?
> "token.assertion.audience_mismatch_error"
This indicates that the "audiences" setting on the stage servers is wrong. It must always match the expected public URL of the service. So I suspect that the stage servers are using "token-stage3.stage.mozaws.net" as their expected audience string, and it fails when you access them via a different URL.
Benson, can you confirm this in the config?
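The audience check described above can be illustrated with a small sketch (illustrative only, not the actual tokenserver implementation; the hostnames are the ones from this bug):

```python
def check_audience(assertion_audience, configured_audiences):
    # The audience baked into the BrowserID assertion by the client must
    # match one of the public URLs the server is configured to expect;
    # otherwise the server rejects the token request with a 401.
    if isinstance(configured_audiences, str):
        configured_audiences = [configured_audiences]
    return assertion_audience in configured_audiences

# A client signing for token.stage.mozaws.net is rejected by a server
# still configured for the old token-stage3 hostname:
check_audience("https://token.stage.mozaws.net",
               ["https://token-stage3.stage.mozaws.net"])  # False
```

This is why the fix here is a config change on the servers rather than a code change: the `audiences` setting simply has to track the public URL.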
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 9•11 years ago
Right, I see this comment explaining the change: https://bugzilla.mozilla.org/show_bug.cgi?id=969209#c10
So we need to update the "audiences" setting to match.
Reporter
Comment 10•11 years ago
OK. Good to know.
And I "think" that token.stage.mozaws.net will be the final name.
Reporter
Comment 11•11 years ago
:bwong to make the appropriate change on the 3 TS instances.
Consider this a blocker.
Comment 12•11 years ago
:jbonacci OK, all staging token servers have been upgraded. Going forward, when we put a new staging tokenserver stack into production, it'll be token.stage.mozaws.net.
Reporter
Comment 13•11 years ago
Had to restart the app/hosts with this change...
:-)
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Reporter
Comment 14•11 years ago
Sent out email on this ticket.
We may want to bounce it once we talk about the issues I found in the Friday night load tests.
Reporter
Comment 15•11 years ago
See the email thread. The trick here will be to determine whether the percentage of 401s I see matches (roughly) what the load test is expected to produce by design.
Reporter
Updated•11 years ago
Priority: -- → P1
Reporter
Comment 16•11 years ago
Re-opening rather than creating a brand new ticket.
I am testing TS+Verifier Stage today, and I am seeing nearly 100% failures, so we must have some config wrong somewhere...
On the 3 TS instances:
The /media/ephemeral0/logs/tokenserver/token.log are filled with these:
"name": "token.assertion.invalid_signature_error"
"name": "token.assertion.verify_failure"
The /media/ephemeral0/logs/nginx/access.log are filled with these:
10.28.71.244 - - [07/Mar/2014:16:45:50 -0500] "GET /1.0/sync/1.5 HTTP/1.1" 401 110 "-" "python-requests/2.2.1 CPython/2.7.3 Linux/3.5.0-23-generic" XFF="54.245.44.231" TIME=0.301
On the Verifier instances:
The /media/ephemeral0/fxa-browserid-verifier/fxa-browserid-verifier-800{0,1}-out.log is showing the following:
INFO: verify { result: 'failure',
reason: 'mockmyid.com is not a browserid primary - non-200 response code to /.well-known/browserid',
rp: 'token.stage.mozaws.net' }
INFO: assertion_verification_time 243
INFO: assertion_failure
That can't be good!
The /media/ephemeral0/nginx/logs/access.log shows nothing but 200s (on POSTs)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Reporter
Comment 17•11 years ago
Should be self-evident, but consider this a blocker for TS and combined load tests.
Comment 18•11 years ago
Visiting "https://mockmyid.com/.well-known/browserid" shows it working fine now. Possible transient issue with mockmyid server? Or, serious network-access problem in the verifier...
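A quick way to reproduce that manual check is to probe the IdP's support document the same way the verifier does. This is a diagnostic sketch, not the verifier's code; the `fetch` parameter is an invented hook so the logic can be exercised without network access:

```python
import json
import urllib.request

def probe_browserid_support(domain, fetch=None):
    """Return (ok, reason) for an IdP's BrowserID support document.

    Mirrors the failure seen in the verifier logs: a non-200 response
    to /.well-known/browserid means the domain is not treated as a
    BrowserID primary.
    """
    url = "https://%s/.well-known/browserid" % domain
    if fetch is None:
        def fetch(u):
            with urllib.request.urlopen(u, timeout=10) as resp:
                return resp.status, resp.read()
    status, body = fetch(url)
    if status != 200:
        return False, "non-200 response code to /.well-known/browserid"
    try:
        json.loads(body)
    except ValueError:
        return False, "support document is not valid JSON"
    return True, "ok"
```

Running `probe_browserid_support("mockmyid.com")` during an incident window would distinguish a transient mockmyid outage from a network-access problem on the verifier hosts.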
Comment 19•11 years ago
The loadtests seem to be working now for me - my guess would be some sort of transient error on the mockmyid.com server.
Reporter
Comment 20•11 years ago
So very odd. Ok.
I was sure we had a config mismatch, given all the changes we have been making to Stage.
Will restart this tomorrow.
Reporter
Comment 21•11 years ago
See the follow-up bugs from :rfkelly and from Dan C:
https://bugzilla.mozilla.org/show_bug.cgi?id=977922
https://bugzilla.mozilla.org/show_bug.cgi?id=981877
Leaving this as open until those get addressed...
Reporter
Comment 22•11 years ago
OK, so we have moved to our own IdP in Stage.
The tests are looking much much better.
Some 401s still to contend with, but more manageable...
Running an 8-hour overnight.
If the count of 401s is low, or they are all concentrated at the beginning, I will consider this bug fixed.
Reporter
Comment 23•11 years ago
So, I still see some 401s in the TS nginx logs:
200s: 101612
401s: 2995
This is a better percentage than before.
The count of these
"name": "token.assertion.verify_failure"
"name": "token.assertion.expired_signature_error"
is about 2640
That works out to roughly a 2% failure rate per TS instance/node.
Let me know if that is expected (I thought it was 1%):
https://github.com/mozilla-services/tokenserver/blob/master/loadtest/loadtest.py#L58-L60
Comment 24•11 years ago
Digging into code, the loadtest calls _test_bad_assertion() 1% of the time, but that method actually sends three separate requests: one with an expired assertion, one with the wrong issuer, and one with an authority mis-match. Given the vagaries of randomness, I think anything in the low-single-digit percentage of 401s is acceptable.
Most importantly, they should not show up as errors in the loadtest report.
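The numbers in the two comments above can be sanity-checked with a back-of-the-envelope model (assuming, as a simplification, that every ordinary loadtest iteration issues exactly one request):

```python
# Observed counts from the nginx logs in comment 23.
ok_responses = 101612
unauthorized = 2995
observed_rate = unauthorized / (ok_responses + unauthorized)

# Expected: 1% of iterations take the bad-assertion path, and that path
# issues three 401-producing requests (expired signature, wrong issuer,
# audience mismatch); the other 99% issue one ordinary request.
p_bad = 0.01
bad_requests = 3
expected_rate = (p_bad * bad_requests) / ((1 - p_bad) + p_bad * bad_requests)

print("observed: %.2f%%, expected: %.2f%%" %
      (100 * observed_rate, 100 * expected_rate))
# prints: observed: 2.86%, expected: 2.94%
```

The observed rate lands right where the model predicts, which supports reading the remaining 401s as intentional loadtest traffic rather than a server fault.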
Reporter
Comment 25•11 years ago
Awesome.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Reporter
Updated•11 years ago
Status: RESOLVED → VERIFIED
Updated•2 years ago
Product: Cloud Services → Cloud Services Graveyard