Closed
Bug 979014
Opened 10 years ago
Closed 10 years ago
Investigate unexpected write-heavy workload on prod tokenserver DB
Categories
(Cloud Services Graveyard :: Server: Token, defect)
Cloud Services Graveyard
Server: Token
Tracking
(Not tracked)
VERIFIED
FIXED
People
(Reporter: rfkelly, Assigned: rfkelly)
References
Details
(Whiteboard: [qa+])
Our production tokenserver DB is currently showing around 2:1 ratio of writes to reads. I find this quite unexpected - most users should not have to make any writes to the DB once they have their initial node allocation. It may be an atifact of the initial ramp-up phase (lots of new users trying it out, not a lot of users actually sticking around to sync with an existing account?) but it may also indicate some sort of bug in our logic.
Assignee | ||
Comment 1•10 years ago
|
||
Benson, can you please do the following (or similar) as an initial sanity-check on prod DB? SELECT COUNT(*) FROM users WHERE replaced_at IS NULL; SELECT COUNT(*) FROM users WHERE replaced_at IS NOT NULL; This will tell us the overall ratio of obsolete-vs-active node-assignment records. An obsolete record should only get created from a migration or a client-state change, so if we see heaps of them in the db then something janky is going on.
Flags: needinfo?(bwong)
Updated•10 years ago
|
Whiteboard: [qa+]
Comment 2•10 years ago
|
||
:rfkelly mysql> SELECT COUNT(*) FROM users WHERE replaced_at IS NULL; +----------+ | COUNT(*) | +----------+ | 68986 | +----------+ 1 row in set (0.02 sec) mysql> SELECT COUNT(*) FROM users WHERE replaced_at IS NOT NULL; +----------+ | COUNT(*) | +----------+ | 753 | +----------+ 1 row in set (0.03 sec)
Assignee | ||
Comment 3•10 years ago
|
||
OK, that seems like a plausibly small number of obsolete assignment records. I grabbed the following from IRC: <mostlygeek> so w/ the 500IOPs we were doing in the load test ... that was about 8MB/second ... <mostlygeek> being written to the tokenserver DB . <mostlygeek> holy. For clarity, is this suggesting that we sustained 8MB/second writes to the stage TS db for the duration of a loadtest run? This seems ludicrously high. There are currently 776664 "users" records in the stage db, all of which look pretty much exactly like this: +-------+---------+-------------------------+-------------------------------------------+------------+--------------+---------------+-------------+ | uid | service | email | node | generation | client_state | created_at | replaced_at | +-------+---------+-------------------------+-------------------------------------------+------------+--------------+---------------+-------------+ | 41412 | 1 | user410418@mockmyid.com | https://sync-1-us-east-1.stage.mozaws.net | 0 | | 1391736408936 | NULL | | 41413 | 1 | user192675@mockmyid.com | https://sync-2-us-east-1.stage.mozaws.net | 0 | | 1391736442904 | NULL | | 41414 | 1 | user827152@mockmyid.com | https://sync-2-us-east-1.stage.mozaws.net | 0 | | 1391736442999 | NULL | | 41415 | 1 | user313751@mockmyid.com | https://sync-3-us-east-1.stage.mozaws.net | 0 | | 1391736443828 | NULL | | 41416 | 1 | user713918@mockmyid.com | https://sync-1-us-east-1.stage.mozaws.net | 0 | | 1391736444008 | NULL | +-------+---------+-------------------------+-------------------------------------------+------------+--------------+---------------+-------------+ Even if stored in the above form, that's about 100MB of row data in the "users" table. Which doesn't seem like that much.
Assignee | ||
Comment 4•10 years ago
|
||
Grepping the logfiles on one of the stage webheads reveals the following ration between get, create, and update operations: [rfkelly]$ grep "get_user" token.* | wc -l 256149 [rfkelly]$ grep "create_user" token.* | wc -l 128440 [rfkelly]$ grep "update_user" token.* | wc -l 0 Which seems like a plausible mix given the makeup of the loadtests. Perhaps a similar poke at the prod logs will reveal something interesting?
Comment 5•10 years ago
|
||
From one of the servers in prod: get_user: 946379 create_user: 38812 update_user: 2359
Flags: needinfo?(bwong)
Comment 6•10 years ago
|
||
Hmm... so in prod, create/update user makes up only 4% of the requests above. This is plausible too even with the 2x write to read IOPS stats on the server because the DB is so small right now reading might be from RAM cache than from actual disk.
Assignee | ||
Comment 7•10 years ago
|
||
Yep, that get-to-create ratio seems fine to me. Unless get_user is accidentally writing to the db (hey, anything is possible!) then maybe we're seeing some sort of cache-related effect as you describe above.
Comment 8•10 years ago
|
||
Here's actual mysql server stats: | Com_insert | 79705 | | Com_insert_select | 0 | | Com_select | 6368515 | | Com_update | 76164 | | Com_update_multi | 0 | ~97 to 98% so ratio of DQL/DML matches. :)
Assignee | ||
Comment 9•10 years ago
|
||
OK awesome, glad we dug into this. Happy to close it out and we'll just keep an eye on things as the dataset gets bigger?
Comment 10•10 years ago
|
||
sounds good to me.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Updated•1 year ago
|
Product: Cloud Services → Cloud Services Graveyard
You need to log in
before you can comment on or make changes to this bug.
Description
•