Closed Bug 974194 Opened 10 years ago Closed 10 years ago

Tokenserver now giving 401s under load

Categories

(Cloud Services Graveyard :: Server: Token, defect, P1)

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: jbonacci, Unassigned)

References

Details

(Whiteboard: [qa+])

Attachments

(1 file)

This seems like something new, based on the current TS Stage environment.
I ran a 30 minute load test against the TS Stage environment, making use of the new "megabench" command in the TS loadtest directory and the Loads cluster.

There are a significant number of 401s showing in the /media/ephemeral0/logs/nginx/access.log on each TS instance.

Trawling through the /media/ephemeral0/logs/tokenserver/token.log, you can see a significant number of JSON events with:
"name": "token.assertion.connection_error"
"name": "token.assertion.expired_signature_error"

This needs to be investigated...
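For reference, a quick way to tally those events (a minimal sketch, assuming token.log is one JSON object per line with a top-level "name" field):

    # Count token.assertion.* events in token.log.  Assumes one JSON
    # object per line; adjust the path if the log location differs.
    import json
    from collections import Counter

    counts = Counter()
    with open("/media/ephemeral0/logs/tokenserver/token.log") as f:
        for line in f:
            try:
                event = json.loads(line)
            except ValueError:
                continue  # skip any non-JSON lines
            name = event.get("name", "")
            if name.startswith("token.assertion."):
                counts[name] += 1

    for name, count in counts.most_common():
        print(count, name)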
Whiteboard: [qa+]
Blocks: 956217
I assume the proximate cause here is that the loads slave machines don't have ntp installed, so their clocks are all wonky.  I've been resisting fixing this because it's a good opportunity to debug our clockskew logic.
*slaps forehead*

So the "fix" in this commit doesn't work: https://github.com/mozilla-services/tokenserver/commit/d5e7c1e24a4d97be45adf8dbb3f224c94215d1c7

It has the logic to adjust for timeskew, but doesn't actually record timeskew if we get a 401.  Bleh.
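For reference, the intended behaviour looks roughly like this (an illustrative sketch only, not the actual patch; the names are made up, and it assumes a requests-style response object with a Date header):

    # On a 401, record the offset between the server's Date header and
    # the local clock, then use the adjusted time when generating the
    # timestamps for subsequent assertions.
    import email.utils
    import time

    clock_skew = 0  # seconds to add to the local clock

    def adjusted_time():
        return time.time() + clock_skew

    def record_skew_from_response(response):
        # "response" is assumed to expose HTTP headers via .headers.
        global clock_skew
        date_header = response.headers.get("Date")
        if date_header:
            server_time = email.utils.mktime_tz(
                email.utils.parsedate_tz(date_header))
            clock_skew = server_time - time.time()

If a token request comes back 401, record the skew from that response and retry with an assertion whose expiry is based on adjusted_time().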
Patch to detect and adjust for clockskew in the loadtest.
Attachment #8377963 - Flags: review?(telliott)
Attachment #8377963 - Flags: review?(telliott) → review+
This seems to do the trick for me; far fewer failures reported from the run: https://github.com/mozilla-services/tokenserver/commit/03318a5cb981841d564cbe07f1896527ae80b959
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Do we need to push the new code to the TS Stage instances?
> Do we need to push the new code to the TS Stage instances?

No, the fix is in the loadtest client; as long as you run the loadtests from an up-to-date checkout, you should be good.
I git cloned and built the Tokenserver repo (which would be master, which I assume is what we want).
I am still getting 401s...


I am using the very latest setup of TS and Sync and Verifier (all now in US East).
I am just running the basic test against TS Stage:
$ make test SERVER_URL=https://token.stage.mozaws.net

I see 401s in the nginx access logs.
[21/Feb/2014:14:03:12 -0500] "GET /1.0/sync/1.5 HTTP/1.1" 401 110...
[21/Feb/2014:14:03:37 -0500] "GET /1.0/sync/1.5 HTTP/1.1" 401 96...

I see errors in the tokenserver token.log file:
"name": "token.assertion.verify_failure"
"token.assertion.audience_mismatch_error"
(always paired)

These are consistent across all 3 instances of TS in Stage.
>  make test SERVER_URL=https://token.stage.mozaws.net

I'm not familiar with this stage URL.  Does it work if you use the "token-stage3" URL that we used to have?

> "token.assertion.audience_mismatch_error"

This indicates that the "audiences" setting on the stage servers is wrong.  It must always match the expected public URL of the service.  So I suspect that the stage servers are using "token-stage3.stage.mozaws.net" as their expected audience string, and it fails when you access them via a different URL.

Benson, can you confirm this in the config?
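For illustration, the check that produces audience_mismatch_error boils down to something like this (a sketch only, not the actual tokenserver/verifier code; the hostnames are just examples):

    # The audience baked into the client's assertion must match one of
    # the audiences the server is configured to accept.
    ALLOWED_AUDIENCES = ["https://token-stage3.stage.mozaws.net"]  # hypothetical stale config

    def check_audience(assertion_audience):
        return assertion_audience in ALLOWED_AUDIENCES

    # A client hitting the new hostname bakes that hostname into its
    # assertion, so the check fails until the config is updated:
    print(check_audience("https://token.stage.mozaws.net"))         # False with the stale config
    print(check_audience("https://token-stage3.stage.mozaws.net"))  # True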
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Right, I see this comment explaining the change: https://bugzilla.mozilla.org/show_bug.cgi?id=969209#c10

So we need to update the "audiences" setting to match.
OK. Good to know.
And I "think" that token.stage.mozaws.net will be the final name.
:bwong to make the appropriate change on the 3 TS instances.
Consider this a blocker.
:jbonacci OK, all staging token servers have been upgraded. So going forward, when we put a new staging tokenserver stack into production, it'll be token.stage.mozaws.net
Had to restart the app/hosts with this change...
:-)
Status: REOPENED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Sent out email on this ticket.
We may want to reopen it once we talk about the issues I found in the Friday night load tests.
See the email thread - the trick here will be to determine whether the percentage of 401s I see roughly matches what the load test is expected to produce by design.
Priority: -- → P1
Re-opening rather than creating a brand new ticket.
I am testing TS+Verifier Stage today, and I am seeing nearly 100% failures, so we must have some config wrong somewhere...

On the 3 TS instances:
The /media/ephemeral0/logs/tokenserver/token.log is filled with these:
"name": "token.assertion.invalid_signature_error"
"name": "token.assertion.verify_failure"

The /media/ephemeral0/logs/nginx/access.log is filled with these:
10.28.71.244 - - [07/Mar/2014:16:45:50 -0500] "GET /1.0/sync/1.5 HTTP/1.1" 401 110 "-" "python-requests/2.2.1 CPython/2.7.3 Linux/3.5.0-23-generic" XFF="54.245.44.231" TIME=0.301


On the Verifier instances:
The /media/ephemeral0/fxa-browserid-verifier/fxa-browserid-verifier-800{0,1}-out.log is showing the following:
INFO: verify { result: 'failure',
  reason: 'mockmyid.com is not a browserid primary - non-200 response code to /.well-known/browserid',
  rp: 'token.stage.mozaws.net' }
INFO: assertion_verification_time 243
INFO: assertion_failure

That can't be good!

The /media/ephemeral0/nginx/logs/access.log shows nothing but 200s (on POSTs)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Should be self-evident, but consider this a blocker for TS and combined load tests.
Visiting "https://mockmyid.com/.well-known/browserid" shows it working fine now.  Possible transient issue with mockmyid server?  Or, serious network-access problem in the verifier...
The loadtests seem to be working now for me - my guess would be some sort of transient error on the mockmyid.com server.
So very odd. OK.
I was sure we had a config mismatch, given all the changes we have been making to Stage.
Will restart this tomorrow.
See the follow-up bugs from :rfkelly and from Dan C:
https://bugzilla.mozilla.org/show_bug.cgi?id=977922
https://bugzilla.mozilla.org/show_bug.cgi?id=981877

Leaving this open until those get addressed...
OK, so we have moved to our own IdP in Stage.
The tests are looking much much better.
Some 401s still to contend with, but more manageable...
Running an 8-hour test overnight.
If the count of 401s is low, or they are all at the beginning, I will consider this bug fixed.
Depends on: 982176
So, I still see some 401s in the TS nginx logs:
200s: 101612
401s: 2995

This is a better percentage than before.

The count of these
"name": "token.assertion.verify_failure"
"name": "token.assertion.expired_signature_error"
is about 2640

That seems to be about a 2% failure rate per TS instance/node.
Let me know if that is expected (I thought it was 1%):
https://github.com/mozilla-services/tokenserver/blob/master/loadtest/loadtest.py#L58-L60
Digging into the code: the loadtest calls _test_bad_assertion() 1% of the time, but that method actually sends three separate requests: one with an expired assertion, one with the wrong issuer, and one with an authority mismatch.  Given the vagaries of randomness, I think anything in the low single-digit percentages of 401s is acceptable.

Most importantly, they should not show up as errors in the loadtest report.
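For what it's worth, a rough back-of-the-envelope (assuming one successful request per normal iteration and three 401s per bad-assertion iteration, which may not match the loadtest exactly) lands right around the observed rate:

    # Expected fraction of 401s under the assumptions above.
    bad_fraction = 0.01            # _test_bad_assertion() runs ~1% of the time
    per_good, per_bad = 1, 3       # requests per iteration (assumed)

    expected = (bad_fraction * per_bad) / (
        (1 - bad_fraction) * per_good + bad_fraction * per_bad)
    print("%.1f%%" % (expected * 100))        # ~2.9%

    # Observed in the run above:
    observed = 2995 / float(101612 + 2995)
    print("%.1f%%" % (observed * 100))        # ~2.9%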
Awesome.
Status: REOPENED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Status: RESOLVED → VERIFIED
Product: Cloud Services → Cloud Services Graveyard