Closed Bug 997344 Opened 12 years ago Closed 11 years ago

Investigate unusually write-heavy db load on stage tokenserver

Categories

(Cloud Services Graveyard :: Server: Token, defect)

x86_64
Windows 7
defect
Not set
normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: rfkelly, Unassigned)

References

Details

(Whiteboard: [qa+])

Spinning this off from Bug 985651 Comment 9. The tokenserver loadtests seem to be unusually write-heavy for the expected load, which should skew towards existing accounts. We need to debug this and figure out if it's a code problem, a loadtest problem, a repoting problem, or some exciting combination of the three.
We need to schedule this testing ASAP. (which means whenever :rfkelly gets back) ;-)
Status: NEW → ASSIGNED
Whiteboard: [qa+]
The tokenserver logs some metrics as it does various backend operations. Here are some counts from one of the stage webheads after recent loadtesting: $ grep "tokenserver.backend.get_user" token.error.log | wc -l 141637 $ grep "tokenserver.backend.allocate_user" token.error.log | wc -l 10384 $ grep "tokenserver.backend.update_user" token.error.log | wc -l 1 The "get_user" operation is done on every request and simply tries to read their existing data out of the db. The "allocate_user" operation is done for users who don't have an existing record, and the "update_user" operation is done when a user e.g. changes their generation-number of x-client-state. This suggests that around 7% of the stage tokenserver requests are for new users, 93% are for existing users. The lone update_user call is probably someone doing manual testing against stage, since the loadtest doesn't send any changed data. The ratio on other stage webheads looks similar. Benson, it would be interesting to do a similar comparison on prod (might have to wait for the 1.2.8 release to roll out, I forget the details of how we used to log these metrics) According to the loadtest, we do 95% requests for existing users, 4% requests for new users, and 1% of "bad assertion" requests that will not trigger and db operations. The difference from 4% to 7% may be explained by the set of new users not being completely populated in the db. At least it's in the same ballpark.
Flags: needinfo?(bwong)
Comparing some stackdriver metrics in prod versus stage-under-load: RPS: state=230/s, prod=120/s DB Read IOPS: stage=5/s, prod=1/s DB Write IOPS: stage=200/s, prod=15/s (recently peaking at 60/s) So we certainly see some unusually high write IOPS on stage compared to prod. Comparing the ratio of "get_user" to "allocate_user" against prod may help to make more sense of this.
(I'll also note that the "get_user" operation can, in extremely rare circumstances, actually do a write. These circumstances don't appear to hold for any but one of the ~1million accounts in the stage db.)
> The difference from 4% to 7% may be explained by the set of new users > not being completely populated in the db. Checking the db, only 972539 our of a possible 1000000 of the "pre-existing" accounts have been populated by loadtest script runs. This puts the expected number of write requests for subsequent loadtests at around 6.5% and dropping, which is roughly in line with the metrics counts from above. So whatever is going on here, I don't think it's due to any funny business on the part of the loadtests. Perhaps the targeted 4% rate for new-user requests is just much higher than we're actually seeing in prod?
OK let's roll out 1.2.8 and I'll update this ticket once we have a couple of days of logs.
Flags: needinfo?(bwong)
Depends on: 1014496
We actually rolled 1.2.9 to Stage. Monday 6/23, we are rolling 1.2.9 to Production. :rfkelly what's the next step?
We rolled TS Stage and Production: Stage Deploy: bug 1014496 Prod Deploy: bug 1027899
Re-tagging Benson to duplicate the op breaking from Comment 2 in prod, now that this has been running for a few days.
Flags: needinfo?(bwong)
grep tokenserver.backend.get_user token.log* | wc -l 975723 grep "tokenserver.backend.allocate_user" 2045 grep "tokenserver.backend.update_user" token.log* | wc -l 70
Flags: needinfo?(bwong)
Uh-huh. So, like, an order-of-magnitude fewer writes than what stage gets during the loadtest then :-)
I tweaked the loadtest to do 0.3% deletes rather than 4%, which seems to more closely match the numbers we're seeing here. I also took the opportunity to clean up some dead code while I was in there: https://github.com/mozilla-services/tokenserver/commit/ca487e27b35ae1f8fca2d99c779bfce3cb80486a Let's try this out for the next TS deploy.
OK. Dropping this in the bucket with bug 1009361.
:rfkelly :mostlygeek we are already past 1.2.9 now working on the combined TS/Verifier stack. Do we want to address this now or later?
:jbonacci let's run the load test today hitting about 300rps. It should be similar to prod averaging about 20 to 30 write IOPS / sec. We might just need to tweak the balance of read/write requests the load test does. Ben.
oh I just read comment #12. If :rfkelly's tweaks match prod I think we can close this bug.
For the next loadtest, let's just do a sanity-check per Comment 15 and then close this out.
Depends on: 1046464
No longer depends on: 1014496
Got through some load testing with minor issues. I will consider this fixed.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Status: RESOLVED → VERIFIED
Product: Cloud Services → Cloud Services Graveyard
You need to log in before you can comment on or make changes to this bug.