Closed Bug 1007987 - Internal pagination for large fetches
Opened 10 years ago • Closed 10 years ago
Categories
(Cloud Services Graveyard :: Server: Sync, defect)
Tracking
(Not tracked)
Status: VERIFIED FIXED
People
(Reporter: rfkelly, Assigned: rfkelly)
References
Details
(Whiteboard: [qa+])
Attachments
(1 file)
9.62 KB, patch, telliott: feedback+
Sync1.5 has a much nicer pagination API than sync1.1, including transparent offset tokens so that we can do indexed pagination rather than "SELECT * ... LIMIT M OFFSET N" nonsense. Sadly, the clients don't use it yet; they just ask for all the data in a single query. If there's a lot of data, this can take a long time and tie up server resources.

We should consider doing internal pagination between the client and the db. The server would detect incoming "GET /storage/collection" requests that do not have a limit parameter, and internally translate them into a series of limit-bearing requests, following this recipe:

https://docs.services.mozilla.com/storage/apis-1.5.html#example-paging-through-a-large-set-of-items

This would result in more queries for that client (potentially some extra queries to check for concurrent writes etc) but would help to smooth the overall db load and make it less likely to lock up the connection pool for long periods of time. Worth experimenting with anyway.
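To make this concrete, here is a minimal sketch of what the view-level loop could look like, following the recipe above. The storage interface (get_items, get_collection_timestamp, the next_offset token) and the function name are illustrative assumptions, not the actual server-syncstorage API:

# Hypothetical sketch of internally paginating a limit-less
# "GET /storage/<collection>" request at the view level.
# The storage methods and token names below are assumptions.

PAGINATION_BATCH_SIZE = 5000  # would come from storage.pagination_batch_size


class ConflictError(Exception):
    """Raised if the collection changes while we are paging through it."""


def get_collection_items(storage, userid, collection, params):
    """Serve the collection, paginating internally if the client did not."""
    if "limit" in params:
        # The client is paginating for itself; pass the request straight through.
        return storage.get_items(userid, collection, **params)

    # Remember the collection timestamp so we can detect concurrent writes
    # between batches (the "extra queries" mentioned above).
    start_ts = storage.get_collection_timestamp(userid, collection)

    items = []
    offset = None
    while True:
        batch = storage.get_items(userid, collection,
                                  limit=PAGINATION_BATCH_SIZE,
                                  offset=offset, **params)
        items.extend(batch["items"])
        if storage.get_collection_timestamp(userid, collection) != start_ts:
            raise ConflictError("collection modified during internal pagination")
        offset = batch.get("next_offset")
        if offset is None:
            return {"items": items}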
Comment 1•10 years ago
Insert caveats about leapfrogging data loss, client timestamping, etc. here. (I'm tired, so I'll let Future Richard deal with the pain!)
Assignee
Comment 2•10 years ago
That's what those "some extra queries" were all about: sanity-checking that we're not stuffing up the client.
Updated•10 years ago
Whiteboard: [qa+]
Assignee
Comment 3•10 years ago
Here's a more concrete take on what I want to do here. The incoming requests are checked for a "limit" parameter, and if there isn't one then we do the pagination ourselves internally. We don't have to commit to doing it this way long-term, but we could try this approach and see if it helps tame some of the humongous wait-times we see in production.

(Toby correctly points out that ideally, this pagination would be done by the SQL backend internally, since that's the only backend where it really makes a difference. But it's *easier* to do at the view level initially.)
Assignee: nobody → rfkelly
Attachment #8438895 - Flags: feedback?(telliott)
Assignee
Comment 4•10 years ago
We will also need the more-efficient "offset token" implementation discussed in Bug 735102.
Depends on: 735102
Comment 5•10 years ago
Comment on attachment 8438895 [details] [diff] [review]
sync15-internal-pagination.diff

Worth trying on a node to see if it improves things.
Attachment #8438895 - Flags: feedback?(telliott) → feedback+
Assignee
Comment 6•10 years ago
I went ahead and committed this, with one change, in https://github.com/mozilla-services/server-syncstorage/commit/8bded2b793ee57920830298dda35d255f58fe52e

The change is that internal pagination is disabled by default, and only activates when you set the storage.pagination_batch_size config option. This will let us easily enable/disable it for experimenting on a node. Let's keep the bug open to track whether this works out OK in practice.
Assignee
Comment 7•10 years ago
This code is now live in prod, so we should try switching on the pagination and see if it helps things. Here's what I'd like to see:

* Find a node or two that seems to have high db contention, e.g. lots of time waiting for connections and locks.
* Add "pagination_batch_size = 5000" in the [storage] config section and restart the app (a config sketch follows this list).
* Monitor performance and see if it improves overall response time and/or throughput:
  * I would expect the max response time to increase slightly, due to additional overhead on large queries.
  * But I would expect the average response time to decrease, since small requests don't have to wait as long for large ones to complete.

But it's all hypothetical at this point. Bob and James, thoughts?
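For reference, the second step amounts to a one-line config change. A minimal sketch of the relevant ini excerpt (only the [storage] section name and the pagination_batch_size option come from this bug; any other settings in a node's real config are untouched):

[storage]
# ...existing storage settings for the node stay as-is...

# Enable internal pagination of limit-less GETs, 5000 items per batch.
# Omitting this option (the default) leaves internal pagination disabled.
pagination_batch_size = 5000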
Assignee
Updated•10 years ago
Flags: needinfo?(bobm)
Comment 8•10 years ago
This has been enabled on the sync-45 and sync-46 production nodes. Here are some basic stats for the servers from before it was enabled. Note the Nginx query times are longer because of network traffic. Upstream traffic is always cut off at the 575 second timeout. MySQL queries may run longer than the 575 second timeout because sometimes they are from the purge_ttl script and do not have a timeout, and because MySQL queries do not automatically die.

Sync-45: 8/27 - 9/2
Nginx: Avg Service Time: 4.516, Stdev: 7.419, Min: 0.000, Max: 698.241
Upstream: 4.209, Stdev: 7.088, Min: 0.000, Max: 575.017

48hrs of Slow Query Logs:
Query                  min  max      mean  std_deviation
ALL                    1s   296781s  12s   321s
upsert_bso             1s   945s     19s   55s
find_items             1s   621s     26s   72s
collection_timestamp   1s   222s     9s    11s
lock_collection_write  1s   222s     8s    9s
lock_collection_read   1s   221s     9s    10s

Sync-46: 8/27 - 9/2
Nginx: Avg Service Time: 3.359, Stdev: 5.169, Min: 0.000, Max: 822.944
Upstream: 3.064, Stdev: 4.895, Min: 0.000, Max: 575.007

48hrs of Slow Query Logs:
Query                  min  max    mean  std_deviation
ALL                    1s   4851s  12s   46s
upsert_bso             1s   1693s  15s   72s
find_items             1s   634s   35s   80s
collection_timestamp   1s   295s   10s   14s
lock_collection_write  1s   300s   9s    10s
lock_collection_read   1s   297s   10s   14s
Flags: needinfo?(bobm)
Comment 9•10 years ago
(In reply to Bob Micheletto [:bobm] from comment #8)

Below are some mixed and slightly overlapping results. It might be best to broaden the test, and wait a few more days.

Sync-45: 9/3 - 9/4
Nginx: Avg Service Time: 3.141, Stdev: 4.898, Min: 0.000, Max: 617.339
Upstream: 2.812, Stdev: 4.591, Min: 0.000, Max: 575.007

48hrs of Slow Query Logs:
Query                  min  max      mean  std_deviation
ALL                    1s   296781s  17s   592s
upsert_bso             1s   430s     26s   54s
find_items             1s   670s     13s   50s
collection_timestamp   1s   101s     9s    9s
lock_collection_write  1s   102s     7s    7s
lock_collection_read   1s   101s     6s    7s

Sync-46: 9/3 - 9/4
Nginx: Avg Service Time: 12.797, Stdev: 19.383, Min: 0.000, Max: 681.281
Upstream: 11.928, Stdev: 18.412, Min: 0.000, Max: 590.004

48hrs of Slow Query Logs:
Query                  min  max      mean  std_deviation
All                    1s   401249s  31s   521s
upsert_bso             1s   1110s    42s   78s
find_items             1s   1529s    62s   121s
lock_collection_write  1s   64s      6s    8s
collection_timestamp   1s   41s      6s    6s
lock_collection_read   1s   61s      5s    6s
Assignee
Comment 10•10 years ago
> max: 401249s
What on earth is this query taking over four days to complete?
Assignee
Comment 11•10 years ago
> find_items 1s 1529s
Am I reading this correctly, that a single find_items query took 25 minutes to complete? I would expect the request to time out, and the corresponding query to be killed by our automatic kill-db-connection-on-timeout logic...so perhaps that hackery is no longer working in the new system?
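(For context, the "kill-db-connection-on-timeout" hack is, roughly, a watchdog that issues KILL QUERY from a second connection once a request passes its deadline. The following is a purely hypothetical illustration of that general idea, not the actual server-syncstorage code; the connection parameters and function names are placeholders.)

# Hypothetical illustration of killing a long-running MySQL query when the
# request exceeds its deadline.  Not the actual server-syncstorage code;
# connection parameters and names are placeholders.
import threading

import MySQLdb

REQUEST_TIMEOUT = 575  # seconds, the upstream timeout mentioned in comment 8


def execute_with_deadline(connect_args, sql, args=None, timeout=REQUEST_TIMEOUT):
    """Run a query, issuing KILL QUERY from a second connection on timeout."""
    conn = MySQLdb.connect(**connect_args)
    thread_id = conn.thread_id()  # id of the connection running the query

    def _kill():
        # Open a separate connection just to kill the stuck query.
        killer = MySQLdb.connect(**connect_args)
        try:
            killer.cursor().execute("KILL QUERY %d" % thread_id)
        finally:
            killer.close()

    watchdog = threading.Timer(timeout, _kill)
    watchdog.start()
    try:
        cursor = conn.cursor()
        cursor.execute(sql, args)
        return cursor.fetchall()
    finally:
        watchdog.cancel()
        conn.close()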
Comment 12•10 years ago
(In reply to Ryan Kelly [:rfkelly] from comment #11)

The offending query:

SELECT bso.userid, bso.collection, bso.id, bso.sortindex, bso.modified,
       bso.payload, bso.payload_size, bso.ttl
FROM bso
WHERE bso.userid = XXXXXXX
  AND bso.collection = 4
  AND bso.modified > 1409686992970
  AND bso.ttl > 1409721284
ORDER BY bso.modified DESC
LIMIT 5001;

The explanation:

+------+-------------+-------+-------+-----------------------------------------+---------------------+---------+------+------+-------------+
| id   | select_type | table | type  | possible_keys                           | key                 | key_len | ref  | rows | Extra       |
+------+-------------+-------+-------+-----------------------------------------+---------------------+---------+------+------+-------------+
|    1 | SIMPLE      | bso   | range | PRIMARY,bso_ttl_idx,bso_usr_col_mod_idx | bso_usr_col_mod_idx | 16      | NULL | 1572 | Using where |
+------+-------------+-------+-------+-----------------------------------------+---------------------+---------+------+------+-------------+
Assignee
Comment 13•10 years ago
The conclusion from trying this out in prod was that it's not worth it: no significant wins seen, and sometimes worse performance. Particularly given that we've had other db improvements that helped a lot with our operational issues, let's switch this off in the production config and close out the bug.

(RESOLVED/FIXED because we still have the code for it, just not enabled in prod by default.)
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Updated•1 year ago
Product: Cloud Services → Cloud Services Graveyard