Closed Bug 1007987 Opened 10 years ago Closed 10 years ago

Internal pagination for large fetches

Categories

(Cloud Services Graveyard :: Server: Sync, defect)

defect
Not set
normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: rfkelly, Assigned: rfkelly)

References

Details

(Whiteboard: [qa+])

Attachments

(1 file)

Sync1.5 has a much nicer pagination API than sync1.1, including transparent offset tokens so that we can do indexed pagination rather than "SELECT * OFFSET N LIMIT M" nonsense.

Sadly, the clients don't use it yet.  They just ask for all the data in a single query.  If there's a lot of data, this can take a long time and tie up server resources.

We should consider doing internal pagination between the client and the db.  So the server would detect incoming "GET /storage/collection" requests that do not have a limit parameter, and internally translate them into a series of limit-bearing requests, following this recipe:

  https://docs.services.mozilla.com/storage/apis-1.5.html#example-paging-through-a-large-set-of-items

This would result in more queries for that client (potentially some extra queries to check for concurrent writes, etc.) but would help to smooth the overall db load and make it less likely to lock up the connection pool for long periods of time.
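
Roughly, the recipe linked above amounts to the following client-side loop.  This is a sketch only: parameter and header names are as I read the 1.5 docs, auth and error handling are omitted, and "session" is assumed to be a requests.Session already set up with Hawk auth for the node.

import requests

def fetch_all_items(session, base_url, collection, batch_size=1000):
    # Sketch of the paging recipe: fetch the collection in batches and
    # bail out if a concurrent write lands part-way through.
    params = {"full": "1", "limit": str(batch_size)}
    headers = {}
    items = []
    while True:
        resp = session.get("%s/storage/%s" % (base_url, collection),
                           params=params, headers=headers)
        resp.raise_for_status()  # a 412 here means the collection changed under us
        items.extend(resp.json())
        if "X-If-Unmodified-Since" not in headers:
            # Pin later pages to the timestamp of the first one, so the
            # server rejects the fetch if someone writes concurrently.
            headers["X-If-Unmodified-Since"] = resp.headers["X-Last-Modified"]
        next_offset = resp.headers.get("X-Weave-Next-Offset")
        if next_offset is None:
            return items
        params["offset"] = next_offset

Doing the same thing server-side means the client still sees a single response, but the db only ever sees limit-sized queries.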

Worth experimenting with anyway.
Insert caveats about leapfrogging data loss, client timestamping, etc. here.

(I'm tired, so I'll let Future Richard deal with the pain!)
That's what those "some extra queries" were all about: sanity-checking that we're not stuffing up the client.
Whiteboard: [qa+]
Blocks: 996819
Here's a more concrete take on what I want to do here.  The incoming requests are checked for a "limit" parameter, and if there isn't one then we do the pagination ourselves internally.

We don't have to commit to doing it this way long-term, but we could try this approach and see if it helps tame some of the humongous wait times we see in production.

(Toby correctly points out that ideally, this pagination would be done by the SQL backend internally, since that's the only backend where it really makes a difference.  But it's *easier* to do at the view level initially.)
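
A minimal sketch of that view-level approach, assuming hypothetical backend helpers (get_items returning a dict with "items" and "next_offset", plus get_collection_timestamp) rather than the real server-syncstorage interfaces:

class ConflictError(Exception):
    """Raised if the collection changes under a paginated read."""

def get_collection_internally_paged(storage, userid, collection, batch_size):
    # Serve an unlimited GET by issuing a series of limit-bearing backend
    # queries, following the backend's offset tokens.
    ts = storage.get_collection_timestamp(userid, collection)
    items = []
    offset = None
    while True:
        res = storage.get_items(userid, collection,
                                limit=batch_size, offset=offset)
        items.extend(res["items"])
        # These are the "extra queries" mentioned earlier: make sure no
        # concurrent write landed mid-fetch and left an inconsistent view.
        if storage.get_collection_timestamp(userid, collection) != ts:
            raise ConflictError("collection modified during paginated read")
        offset = res.get("next_offset")
        if offset is None:
            return items

In the actual patch this would be driven from the view, only kicking in when the client didn't supply its own limit.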
Assignee: nobody → rfkelly
Attachment #8438895 - Flags: feedback?(telliott)
We will also need the more-efficient "offset token" implementation discussed in Bug 735102.
Depends on: 735102
Comment on attachment 8438895 [details] [diff] [review]
sync15-internal-pagination.diff

worth trying on a node to see if it improves things.
Attachment #8438895 - Flags: feedback?(telliott) → feedback+
Depends on: 1025735
I went ahead and committed this with one change in https://github.com/mozilla-services/server-syncstorage/commit/8bded2b793ee57920830298dda35d255f58fe52e

The change is that internal pagination is disabled by default, and only activates when you set the storage.pagination_batch_size config option.  This will let us easily enable or disable it for experimenting on a node.
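
For the record, turning it on is just a config change (section and option name as committed; the batch size value is only an example, echoing the suggestion below):

[storage]
# Internal pagination stays off unless this option is set.
pagination_batch_size = 5000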

Let's keep the bug open to track whether this works out OK in practice.
This code is now live in prod, so we should try switching on the pagination and see if it helps things.  Here's what I'd like to see:


* Find a node or two that seems to have high db contention, e.g. lots of time waiting for connections and locks.
* Add "pagination_batch_size = 5000" in the [storage] config section and restart the app
* Monitor performance and see if it improves overall response time and/or throughput:
   * I would expect the max response time to increase slightly due to additional overhead on large queries.
   * But I would expect the average response time to decrease, since small requests don't have to wait as long for large ones to complete.

But it's all hypothetical at this point.  Bob and James, thoughts?
Flags: needinfo?(bobm)
This has been enabled on the sync-45 and sync-46 production nodes.  Here are some basic stats for the servers from before it was enabled.  Note that the Nginx query times are longer because of network traffic.  Upstream traffic is always cut off at the 575-second timeout.  MySQL queries may run longer than the 575-second timeout because sometimes they are from the purge_ttl script and do not have a timeout, and because MySQL queries do not automatically die.

Sync-45: 8/27 - 9/2
Nginx: Avg Service Time: 4.516, Stdev: 7.419, Min: 0.000, Max: 698.241
Upstream: 4.209 Stdev: 7.088, Min: 0.000, Max: 575.017

48hrs of Slow Query Logs:
Query                 min max      mean std_deviation
ALL                   1s  296781s  12s  321s
upsert_bso            1s  945s     19s  55s
find_items            1s  621s     26s  72s
collection_timestamp  1s  222s     9s   11s
lock_collection_write 1s  222s     8s   9s
lock_collection_read  1s  221s     9s   10s

Sync-46: 8/27 - 9/2
Nginx: Avg Service Time: 3.359, Stdev: 5.169, Min: 0.000, Max: 822.944
Upstream: 3.064 Stdev: 4.895, Min: 0.000, Max: 575.007

48hrs of Slow Query Logs:
Query                 min max   mean std_deviation
ALL                   1s  4851s 12s  46s
upsert_bso            1s  1693s 15s  72s
find_items            1s  634s  35s  80s
collection_timestamp  1s  295s  10s  14s
lock_collection_write 1s  300s  9s   10s
lock_collection_read  1s  297s  10s  14s
Flags: needinfo?(bobm)
(In reply to Bob Micheletto [:bobm] from comment #8)

Below are some mixed and slightly overlapping results.  It might be best to broaden the test, and wait a few more days.

Sync-45: 9/3 - 9/4
Nginx: Avg Service Time: 3.141, Stdev: 4.898, Min: 0.000, Max: 617.339
Upstream: 2.812 Stdev: 4.591, Min: 0.000, Max: 575.007

48hrs of Slow Query Logs:
Query                 min max     mean std_deviation
ALL                   1s  296781s 17s  592s
upsert_bso            1s  430s    26s  54s
find_items            1s  670s    13s  50s
collection_timestamp  1s  101s    9s   9s
lock_collection_write 1s  102s    7s   7s
lock_collection_read  1s  101s    6s   7s

Sync-46: 9/3 - 9/4
Nginx: Avg Service Time: 12.797, Stdev: 19.383, Min: 0.000, Max: 681.281
Upstream: 11.928 Stdev: 18.412, Min: 0.000, Max: 590.004

48hrs of Slow Query Logs:
Query                 min max     mean std_deviation
ALL                   1s  401249s 31s  521s
upsert_bso            1s  1110s   42s  78s
find_items            1s  1529s   62s  121s
lock_collection_write 1s  64s     6s   8s
collection_timestamp  1s  41s     6s   6s
lock_collection_read  1s  61s     5s   6s
> max: 401249s

What on earth is this query taking over four days to complete?
> find_items 1s  1529s

Am I reading this correctly, that a single find_items query took 25 minutes to complete?  I would expect the request to time out, and the corresponding query to be killed by our automatic kill-db-connection-on-timeout logic...so perhaps that hackery is no longer working in the new system?
(In reply to Ryan Kelly [:rfkelly] from comment #11)
The offending query:

SELECT bso.userid, bso.collection, bso.id, bso.sortindex, bso.modified, bso.payload, bso.payload_size, bso.ttl 
FROM bso 
WHERE bso.userid = XXXXXXX AND bso.collection = 4 AND bso.modified > 1409686992970 AND bso.ttl > 1409721284 ORDER BY bso.modified DESC 
LIMIT 5001;

The EXPLAIN output:
+------+-------------+-------+-------+-----------------------------------------+---------------------+---------+------+------+-------------+
| id   | select_type | table | type  | possible_keys                           | key                 | key_len | ref  | rows | Extra       |
+------+-------------+-------+-------+-----------------------------------------+---------------------+---------+------+------+-------------+
|    1 | SIMPLE      | bso   | range | PRIMARY,bso_ttl_idx,bso_usr_col_mod_idx | bso_usr_col_mod_idx | 16      | NULL | 1572 | Using where |
+------+-------------+-------+-------+-----------------------------------------+---------------------+---------+------+------+-------------+
The conclusion from trying this out in prod was that it's not worth it - no significant wins were seen, and sometimes worse performance.  Particularly given that we've had other db improvements that helped a lot with our operational issues, let's switch this off in the production config and close out the bug.

(RESOLVED/FIXED because we still have the code for it, just not enabled in prod by default)
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
OK. Will Verify for now.
Status: RESOLVED → VERIFIED
Product: Cloud Services → Cloud Services Graveyard