Closed Bug 979014 Opened 10 years ago Closed 10 years ago

Investigate unexpected write-heavy workload on prod tokenserver DB

Categories

(Cloud Services Graveyard :: Server: Token, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: rfkelly, Assigned: rfkelly)

References

Details

(Whiteboard: [qa+])

Our production tokenserver DB is currently showing around a 2:1 ratio of writes to reads.  I find this quite unexpected - most users should not have to make any writes to the DB once they have their initial node allocation.  It may be an artifact of the initial ramp-up phase (lots of new users trying it out, not a lot of users actually sticking around to sync with an existing account?) but it may also indicate some sort of bug in our logic.
Blocks: 956217
Benson, can you please do the following (or similar) as an initial sanity-check on prod DB?

   SELECT COUNT(*) FROM users WHERE replaced_at IS NULL;
   SELECT COUNT(*) FROM users WHERE replaced_at IS NOT NULL;

This will tell us the overall ratio of obsolete-vs-active node-assignment records.  An obsolete record should only get created from a migration or a client-state change, so if we see heaps of them in the db then something janky is going on.
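(Editorial aside: the two counts above can also be fetched in one round-trip with conditional aggregation.  A minimal sketch, using Python's stdlib sqlite3 as a stand-in for the prod MySQL DB - table and column names are taken from the queries above, the row data is invented:)

```python
import sqlite3

# In-memory stand-in for the prod tokenserver DB, reduced to the one
# column the sanity check touches.  Row values are invented.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (uid INTEGER, replaced_at INTEGER)")
db.executemany("INSERT INTO users VALUES (?, ?)",
               [(1, None), (2, None), (3, None), (4, 1391736443828)])

# One query instead of two: a boolean comparison evaluates to 1/0,
# so SUM() counts the rows matching each condition.
active, obsolete = db.execute(
    "SELECT SUM(replaced_at IS NULL), SUM(replaced_at IS NOT NULL)"
    " FROM users"
).fetchone()
print(active, obsolete)  # 3 1
```

The same conditional-SUM trick works verbatim on MySQL.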
Flags: needinfo?(bwong)
Whiteboard: [qa+]
:rfkelly

mysql> SELECT COUNT(*) FROM users WHERE replaced_at IS NULL;
+----------+
| COUNT(*) |
+----------+
|    68986 |
+----------+
1 row in set (0.02 sec)

mysql> SELECT COUNT(*) FROM users WHERE replaced_at IS NOT NULL;
+----------+
| COUNT(*) |
+----------+
|      753 |
+----------+
1 row in set (0.03 sec)
OK, that seems like a plausibly small number of obsolete assignment records.

I grabbed the following from IRC:

<mostlygeek> so w/ the 500IOPs we were doing in the load test ... that was about 8MB/second ... 
<mostlygeek> being written to the tokenserver DB .
<mostlygeek> holy.

For clarity, is this suggesting that we sustained 8MB/second writes to the stage TS db for the duration of a loadtest run?  This seems ludicrously high.

There are currently 776664 "users" records in the stage db, all of which look pretty much exactly like this:

+-------+---------+-------------------------+-------------------------------------------+------------+--------------+---------------+-------------+
| uid   | service | email                   | node                                      | generation | client_state | created_at    | replaced_at |
+-------+---------+-------------------------+-------------------------------------------+------------+--------------+---------------+-------------+
| 41412 |       1 | user410418@mockmyid.com | https://sync-1-us-east-1.stage.mozaws.net |          0 |              | 1391736408936 |        NULL |
| 41413 |       1 | user192675@mockmyid.com | https://sync-2-us-east-1.stage.mozaws.net |          0 |              | 1391736442904 |        NULL |
| 41414 |       1 | user827152@mockmyid.com | https://sync-2-us-east-1.stage.mozaws.net |          0 |              | 1391736442999 |        NULL |
| 41415 |       1 | user313751@mockmyid.com | https://sync-3-us-east-1.stage.mozaws.net |          0 |              | 1391736443828 |        NULL |
| 41416 |       1 | user713918@mockmyid.com | https://sync-1-us-east-1.stage.mozaws.net |          0 |              | 1391736444008 |        NULL |
+-------+---------+-------------------------+-------------------------------------------+------------+--------------+---------------+-------------+

Even if stored in the above form, that's about 100MB of row data in the "users" table.  Which doesn't seem like that much.
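(The arithmetic behind that ~100MB figure, sketched out - the per-row width is a guess from the sample rows above, two ints plus four bigint-ish columns plus a ~24-char email and a ~41-char node URL plus some per-row overhead, not an InnoDB measurement:)

```python
# Back-of-the-envelope check of the "about 100MB of row data" estimate.
rows = 776664
# uid + service + email + node + generation + client_state +
# created_at + replaced_at + assumed per-row overhead (all guesses).
approx_row_bytes = 4 + 4 + 24 + 41 + 8 + 8 + 8 + 8 + 30
total_mb = rows * approx_row_bytes / 1e6
print(round(total_mb), "MB")  # ~105 MB, i.e. roughly 100MB
```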
Grepping the logfiles on one of the stage webheads reveals the following ratio between get, create, and update operations:

[rfkelly]$ grep "get_user" token.* | wc -l
256149
[rfkelly]$ grep "create_user" token.* | wc -l
128440
[rfkelly]$ grep "update_user" token.* | wc -l
0

Which seems like a plausible mix given the makeup of the loadtests.  Perhaps a similar poke at the prod logs will reveal something interesting?
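(The three greps above can be done in one pass over the logs; a minimal sketch - the sample log lines here are invented stand-ins, the real token.* lines just need to contain the operation name somewhere:)

```python
from collections import Counter

OPS = ("get_user", "create_user", "update_user")

def count_ops(lines):
    """Tally tokenserver DB operations in a single pass over log lines."""
    counts = Counter({op: 0 for op in OPS})
    for line in lines:
        for op in OPS:
            if op in line:
                counts[op] += 1
    return counts

# Invented sample lines standing in for the token.* logfiles.
sample = [
    "token INFO get_user uid=41412",
    "token INFO get_user uid=41413",
    "token INFO create_user uid=41414",
]
print(count_ops(sample))  # get_user: 2, create_user: 1, update_user: 0
```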
From one of the servers in prod:

get_user: 946379
create_user: 38812
update_user: 2359
Flags: needinfo?(bwong)
Hmm... so in prod, create/update user makes up only 4% of the requests above. 

This is plausible even with the 2:1 write-to-read IOPS stats on the server, because the DB is so small right now that reads are probably served from RAM cache rather than from actual disk.
Yep, that get-to-create ratio seems fine to me.  Unless get_user is accidentally writing to the db (hey, anything is possible!) then maybe we're seeing some sort of cache-related effect as you describe above.
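(Checking the "only 4%" figure from the prod per-operation counts above, assuming writes = create_user + update_user:)

```python
# Per-operation counts from one prod server, copied from above.
get_user, create_user, update_user = 946379, 38812, 2359

writes = create_user + update_user
write_pct = 100.0 * writes / (get_user + writes)
print(round(write_pct, 1))  # 4.2 - i.e. ~4% of DB operations are writes
```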
Here's actual mysql server stats: 

| Com_insert                                    | 79705       |
| Com_insert_select                             | 0           |
| Com_select                                    | 6368515     |
| Com_update                                    | 76164       |
| Com_update_multi                              | 0           |

~97 to 98% of statements are selects, so the DQL/DML ratio matches. :)
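(The percentage from those Com_* counters, assuming DML here is Com_insert + Com_update:)

```python
# MySQL statement counters copied from the server stats above.
com_select = 6368515
com_insert = 79705
com_update = 76164

select_pct = 100.0 * com_select / (com_select + com_insert + com_update)
print(round(select_pct, 1))  # 97.6 - within the quoted ~97-98% range
```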
OK awesome, glad we dug into this.  Happy to close it out and we'll just keep an eye on things as the dataset gets bigger?
sounds good to me.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Fine. Fine.
Status: RESOLVED → VERIFIED
Product: Cloud Services → Cloud Services Graveyard