Closed Bug 974194 Opened 10 years ago Closed 10 years ago

Tokenserver now giving 401s under load

Categories

(Cloud Services Graveyard :: Server: Token, defect, P1)

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: jbonacci, Unassigned)

References

Details

(Whiteboard: [qa+])

Attachments

(1 file)

This seems like something new, based on the current TS Stage environment.
I ran a 30 minute load test against the TS Stage environment, making use of the new "megabench" command in the TS loadtest directory and the Loads cluster.

There are a significant number of 401s showing in the /media/ephemeral0/logs/nginx/access.log on each TS instance.

Trawling through the /media/ephemeral0/logs/tokenserver/token.log, you can see a significant number of JSON events with:
"name": "token.assertion.connection_error"
"name": "token.assertion.expired_signature_error"

This needs to be investigated...
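For reference, a quick way to tally those events (a minimal sketch, assuming token.log is one JSON object per line with a top-level "name" field):

    # Count token.assertion.* events in token.log.  Assumes one JSON
    # object per line; adjust the path if the log location differs.
    import json
    from collections import Counter

    counts = Counter()
    with open("/media/ephemeral0/logs/tokenserver/token.log") as f:
        for line in f:
            try:
                event = json.loads(line)
            except ValueError:
                continue  # skip any non-JSON lines
            name = event.get("name", "")
            if name.startswith("token.assertion."):
                counts[name] += 1

    for name, count in counts.most_common():
        print(count, name)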
Whiteboard: [qa+]
Blocks: 956217
I assume the proximate cause here is that the loads slave machines don't have ntp installed, so their clocks are all wonky.  I've been resisting fixing this because it's a good opportunity to debug our clockskew logic.
*slaps forehead*

So the "fix" in this commit doesn't work: https://github.com/mozilla-services/tokenserver/commit/d5e7c1e24a4d97be45adf8dbb3f224c94215d1c7

It has the logic to adjust for timeskew, but doesn't actually record timeskew if we get a 401.  Bleh.
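For reference, the intended behaviour looks roughly like this (an illustrative sketch only, not the actual patch; the names are made up, and it assumes a requests-style response object with a Date header):

    # On a 401, record the offset between the server's Date header and
    # the local clock, then use the adjusted time when generating the
    # timestamps for subsequent assertions.
    import email.utils
    import time

    clock_skew = 0  # seconds to add to the local clock

    def adjusted_time():
        return time.time() + clock_skew

    def record_skew_from_response(response):
        # "response" is assumed to expose HTTP headers via .headers.
        global clock_skew
        date_header = response.headers.get("Date")
        if date_header:
            server_time = email.utils.mktime_tz(
                email.utils.parsedate_tz(date_header))
            clock_skew = server_time - time.time()

If a token request comes back 401, record the skew from that response and retry with an assertion whose expiry is based on adjusted_time().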
Patch to detect and adjust for clockskew in the loadtest.
Attachment #8377963 - Flags: review?(telliott)
Attachment #8377963 - Flags: review?(telliott) → review+
This seems to do the trick for me; far fewer failures reported from the run: https://github.com/mozilla-services/tokenserver/commit/03318a5cb981841d564cbe07f1896527ae80b959
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Do we need to push the new code to the TS Stage instances?
> Do we need to push the new code to the TS Stage instances?

No, the fix is in the loadtest client; as long as you run the loadtests from an up-to-date checkout, you should be good.
I git cloned and built the Tokenserver repo (which would be master, which I assume is what we want).
I am still getting 401s...


I am using the very latest setup of TS and Sync and Verifier (all now in US East).
I am just running the basic test against TS Stage:
$ make test SERVER_URL=https://token.stage.mozaws.net

I see 401s in the nginx access logs.
[21/Feb/2014:14:03:12 -0500] "GET /1.0/sync/1.5 HTTP/1.1" 401 110...
[21/Feb/2014:14:03:37 -0500] "GET /1.0/sync/1.5 HTTP/1.1" 401 96...

I see errors in the tokenserver token.log file:
"name": "token.assertion.verify_failure"
"token.assertion.audience_mismatch_error"
(always paired)

These are consistent across all 3 instances of TS in Stage.
>  make test SERVER_URL=https://token.stage.mozaws.net

I'm not familiar with this stage URL.  Does it work if you use the "token-stage3" URL that we used to have?

> "token.assertion.audience_mismatch_error"

This indicates that the "audiences" setting on the stage servers is wrong.  It must always match the expected public URL of the service.  So I suspect that the stage servers are using "token-stage3.stage.mozaws.net" as their expected audience string, and it fails when you access them via a different URL.

Benson, can you confirm this in the config?
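For illustration, the check that produces audience_mismatch_error boils down to something like this (a sketch only, not the actual tokenserver/verifier code; the hostnames are just examples):

    # The audience baked into the client's assertion must match one of
    # the audiences the server is configured to accept.
    ALLOWED_AUDIENCES = ["https://token-stage3.stage.mozaws.net"]  # hypothetical stale config

    def check_audience(assertion_audience):
        return assertion_audience in ALLOWED_AUDIENCES

    # A client hitting the new hostname bakes that hostname into its
    # assertion, so the check fails until the config is updated:
    print(check_audience("https://token.stage.mozaws.net"))         # False with the stale config
    print(check_audience("https://token-stage3.stage.mozaws.net"))  # True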
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Right, I see this comment explaining the change: https://bugzilla.mozilla.org/show_bug.cgi?id=969209#c10

So we need to update the "audiences" setting to match.
OK. Good to know.
And I "think" that token.stage.mozaws.net will be the final name.
:bwong to make the appropriate change on the 3 TS instances.
Consider this a blocker.
:jbonacci OK, all staging token servers have been upgraded. So going forward, when we put a new staging tokenserver stack into production, it'll be token.stage.mozaws.net
Had to restart the app/hosts with this change...
:-)
Status: REOPENED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Sent out email on this ticket.
We may want to reopen it once we talk about the issues I found in the Friday night load tests.
See the email thread - the trick here will be to determine whether the percentage of 401s I see roughly matches what the load test is expected to produce by design.
Priority: -- → P1
Re-opening rather than creating a brand new ticket.
I am testing TS+Verifier Stage today, and I am seeing nearly 100% failures, so we must have some config wrong somewhere...

On the 3 TS instances:
The /media/ephemeral0/logs/tokenserver/token.log is filled with these:
"name": "token.assertion.invalid_signature_error"
"name": "token.assertion.verify_failure"

The /media/ephemeral0/logs/nginx/access.log is filled with these:
10.28.71.244 - - [07/Mar/2014:16:45:50 -0500] "GET /1.0/sync/1.5 HTTP/1.1" 401 110 "-" "python-requests/2.2.1 CPython/2.7.3 Linux/3.5.0-23-generic" XFF="54.245.44.231" TIME=0.301


On the Verifier instances:
The /media/ephemeral0/fxa-browserid-verifier/fxa-browserid-verifier-800{0,1}-out.log is showing the following:
INFO: verify { result: 'failure',
  reason: 'mockmyid.com is not a browserid primary - non-200 response code to /.well-known/browserid',
  rp: 'token.stage.mozaws.net' }
INFO: assertion_verification_time 243
INFO: assertion_failure

That can't be good!

The /media/ephemeral0/nginx/logs/access.log shows nothing but 200s (on POSTs)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Should be self-evident, but consider this a blocker for TS and combined load tests.
Visiting "https://mockmyid.com/.well-known/browserid" shows it working fine now.  Possible transient issue with mockmyid server?  Or, serious network-access problem in the verifier...
The loadtests seem to be working now for me - my guess would be some sort of transient error on the mockmyid.com server.
So very odd. OK.
I was sure we had a config mismatch, given all the changes we have been making to Stage.
Will restart this tomorrow.
See the follow-up bugs from :rfkelly and from Dan C:
https://bugzilla.mozilla.org/show_bug.cgi?id=977922
https://bugzilla.mozilla.org/show_bug.cgi?id=981877

Leaving this open until those get addressed...
OK, so we have moved to our own IdP in Stage.
The tests are looking much much better.
Some 401s still to contend with, but more manageable...
Running an 8-hour test overnight.
If the count of 401s is low, or they are all at the beginning, I will consider this bug fixed.
Depends on: 982176
So, I still see some 401s in the TS nginx logs:
200s: 101612
401s: 2995

This is a better percentage than before.

The count of these
"name": "token.assertion.verify_failure"
"name": "token.assertion.expired_signature_error"
is about 2640

That seems to be about a 2% failure rate per TS instance/node.
Let me know if that is expected (I thought it was 1%):
https://github.com/mozilla-services/tokenserver/blob/master/loadtest/loadtest.py#L58-L60
Digging into the code: the loadtest calls _test_bad_assertion() 1% of the time, but that method actually sends three separate requests: one with an expired assertion, one with the wrong issuer, and one with an authority mismatch.  Given the vagaries of randomness, I think anything in the low single-digit percentages of 401s is acceptable.

Most importantly, they should not show up as errors in the loadtest report.
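For what it's worth, a rough back-of-the-envelope (assuming one successful request per normal iteration and three 401s per bad-assertion iteration, which may not match the loadtest exactly) lands right around the observed rate:

    # Expected fraction of 401s under the assumptions above.
    bad_fraction = 0.01            # _test_bad_assertion() runs ~1% of the time
    per_good, per_bad = 1, 3       # requests per iteration (assumed)

    expected = (bad_fraction * per_bad) / (
        (1 - bad_fraction) * per_good + bad_fraction * per_bad)
    print("%.1f%%" % (expected * 100))        # ~2.9%

    # Observed in the run above:
    observed = 2995 / float(101612 + 2995)
    print("%.1f%%" % (observed * 100))        # ~2.9%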
Awesome.
Status: REOPENED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Status: RESOLVED → VERIFIED
Product: Cloud Services → Cloud Services Graveyard