Closed
Bug 997344
Opened 12 years ago
Closed 11 years ago
Investigate unusually write-heavy db load on stage tokenserver
Categories
(Cloud Services Graveyard :: Server: Token, defect)
Tracking
(Not tracked)
VERIFIED
FIXED
People
(Reporter: rfkelly, Unassigned)
References
Details
(Whiteboard: [qa+])
Spinning this off from Bug 985651 Comment 9.
The tokenserver loadtests seem to be unusually write-heavy for the expected load, which should skew towards existing accounts. We need to debug this and figure out if it's a code problem, a loadtest problem, a repoting problem, or some exciting combination of the three.
Comment 1•12 years ago
|
||
We need to schedule this testing ASAP.
(which means whenever :rfkelly gets back)
;-)
Status: NEW → ASSIGNED
Whiteboard: [qa+]
| Reporter | ||
Comment 2•11 years ago
|
||
The tokenserver logs some metrics as it does various backend operations. Here are some counts from one of the stage webheads after recent loadtesting:
$ grep "tokenserver.backend.get_user" token.error.log | wc -l
141637
$ grep "tokenserver.backend.allocate_user" token.error.log | wc -l
10384
$ grep "tokenserver.backend.update_user" token.error.log | wc -l
1
The "get_user" operation is done on every request and simply tries to read their existing data out of the db. The "allocate_user" operation is done for users who don't have an existing record, and the "update_user" operation is done when a user e.g. changes their generation-number of x-client-state.
This suggests that around 7% of the stage tokenserver requests are for new users, 93% are for existing users. The lone update_user call is probably someone doing manual testing against stage, since the loadtest doesn't send any changed data.
The ratio on other stage webheads looks similar. Benson, it would be interesting to do a similar comparison on prod (might have to wait for the 1.2.8 release to roll out, I forget the details of how we used to log these metrics)
According to the loadtest, we do 95% requests for existing users, 4% requests for new users, and 1% of "bad assertion" requests that will not trigger and db operations. The difference from 4% to 7% may be explained by the set of new users not being completely populated in the db. At least it's in the same ballpark.
Flags: needinfo?(bwong)
| Reporter | ||
Comment 3•11 years ago
|
||
Comparing some stackdriver metrics in prod versus stage-under-load:
RPS: state=230/s, prod=120/s
DB Read IOPS: stage=5/s, prod=1/s
DB Write IOPS: stage=200/s, prod=15/s (recently peaking at 60/s)
So we certainly see some unusually high write IOPS on stage compared to prod. Comparing the ratio of "get_user" to "allocate_user" against prod may help to make more sense of this.
| Reporter | ||
Comment 4•11 years ago
|
||
(I'll also note that the "get_user" operation can, in extremely rare circumstances, actually do a write. These circumstances don't appear to hold for any but one of the ~1million accounts in the stage db.)
| Reporter | ||
Comment 5•11 years ago
|
||
> The difference from 4% to 7% may be explained by the set of new users
> not being completely populated in the db.
Checking the db, only 972539 our of a possible 1000000 of the "pre-existing" accounts have been populated by loadtest script runs. This puts the expected number of write requests for subsequent loadtests at around 6.5% and dropping, which is roughly in line with the metrics counts from above.
So whatever is going on here, I don't think it's due to any funny business on the part of the loadtests. Perhaps the targeted 4% rate for new-user requests is just much higher than we're actually seeing in prod?
Comment 6•11 years ago
|
||
OK let's roll out 1.2.8 and I'll update this ticket once we have a couple of days of logs.
Flags: needinfo?(bwong)
Comment 7•11 years ago
|
||
We actually rolled 1.2.9 to Stage.
Monday 6/23, we are rolling 1.2.9 to Production.
:rfkelly what's the next step?
Comment 8•11 years ago
|
||
We rolled TS Stage and Production:
Stage Deploy: bug 1014496
Prod Deploy: bug 1027899
| Reporter | ||
Comment 9•11 years ago
|
||
Re-tagging Benson to duplicate the op breaking from Comment 2 in prod, now that this has been running for a few days.
Flags: needinfo?(bwong)
Comment 10•11 years ago
|
||
grep tokenserver.backend.get_user token.log* | wc -l
975723
grep "tokenserver.backend.allocate_user"
2045
grep "tokenserver.backend.update_user" token.log* | wc -l
70
Flags: needinfo?(bwong)
| Reporter | ||
Comment 11•11 years ago
|
||
Uh-huh. So, like, an order-of-magnitude fewer writes than what stage gets during the loadtest then :-)
| Reporter | ||
Comment 12•11 years ago
|
||
I tweaked the loadtest to do 0.3% deletes rather than 4%, which seems to more closely match the numbers we're seeing here. I also took the opportunity to clean up some dead code while I was in there: https://github.com/mozilla-services/tokenserver/commit/ca487e27b35ae1f8fca2d99c779bfce3cb80486a
Let's try this out for the next TS deploy.
Comment 13•11 years ago
|
||
OK. Dropping this in the bucket with bug 1009361.
Comment 14•11 years ago
|
||
:rfkelly
:mostlygeek
we are already past 1.2.9 now working on the combined TS/Verifier stack.
Do we want to address this now or later?
Comment 15•11 years ago
|
||
:jbonacci let's run the load test today hitting about 300rps. It should be similar to prod averaging about 20 to 30 write IOPS / sec. We might just need to tweak the balance of read/write requests the load test does.
Ben.
Comment 16•11 years ago
|
||
oh I just read comment #12. If :rfkelly's tweaks match prod I think we can close this bug.
| Reporter | ||
Comment 17•11 years ago
|
||
For the next loadtest, let's just do a sanity-check per Comment 15 and then close this out.
Updated•11 years ago
|
Comment 18•11 years ago
|
||
Got through some load testing with minor issues. I will consider this fixed.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Updated•11 years ago
|
Status: RESOLVED → VERIFIED
Updated•3 years ago
|
Product: Cloud Services → Cloud Services Graveyard
You need to log in
before you can comment on or make changes to this bug.
Description
•