Closed Bug 1007987 Opened 10 years ago Closed 10 years ago

Internal pagination for large fetches

Categories

(Cloud Services Graveyard :: Server: Sync, defect)

defect
Not set
normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: rfkelly, Assigned: rfkelly)

References

Details

(Whiteboard: [qa+])

Attachments

(1 file)

Sync1.5 has a much nicer pagination API than sync1.1, including transparent offset tokens so that we can do indexed pagination rather than "SELECT * OFFSET N LIMIT M" nonsense.

Sadly, the clients don't use it yet.  They just ask for all the data in a single query.  If there's a lot of data, this can take a long time and tie up server resources.

We should consider doing internal pagination between the client and the db.  So the server would detect incoming "GET /storage/collection" requests that do not have a limit parameter, and internally translate them into a series of limit-bearing requests, following this recipe:

  https://docs.services.mozilla.com/storage/apis-1.5.html#example-paging-through-a-large-set-of-items

This would result in more queries for that client (potentially some extra queries to check for concurrent writes, etc.) but would help to smooth the overall db load and make it less likely to lock up the connection pool for long periods of time.
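
Roughly, the recipe linked above amounts to the following client-side loop.  This is a sketch only: parameter and header names are as I read the 1.5 docs, auth and error handling are omitted, and "session" is assumed to be a requests.Session already set up with Hawk auth for the node.

import requests

def fetch_all_items(session, base_url, collection, batch_size=1000):
    # Sketch of the paging recipe: fetch the collection in batches and
    # bail out if a concurrent write lands part-way through.
    params = {"full": "1", "limit": str(batch_size)}
    headers = {}
    items = []
    while True:
        resp = session.get("%s/storage/%s" % (base_url, collection),
                           params=params, headers=headers)
        resp.raise_for_status()  # a 412 here means the collection changed under us
        items.extend(resp.json())
        if "X-If-Unmodified-Since" not in headers:
            # Pin later pages to the timestamp of the first one, so the
            # server rejects the fetch if someone writes concurrently.
            headers["X-If-Unmodified-Since"] = resp.headers["X-Last-Modified"]
        next_offset = resp.headers.get("X-Weave-Next-Offset")
        if next_offset is None:
            return items
        params["offset"] = next_offset

Doing the same thing server-side means the client still sees a single response, but the db only ever sees limit-sized queries.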

Worth experimenting with anyway.
Insert caveats about leapfrogging data loss, client timestamping, etc. here.

(I'm tired, so I'll let Future Richard deal with the pain!)
That's what those "some extra queries" were all about: sanity-checking that we're not stuffing up the client.
Whiteboard: [qa+]
Blocks: 996819
Here's a more concrete take on what I want to do here.  The incoming requests are checked for a "limit" parameter, and if there isn't one then we do the pagination ourselves internally.

We don't have to commit to doing it this way long-term, but we could try this approach and see if it helps tame some of the humongous wait times we see in production.

(Toby correctly points out that ideally, this pagination would be done by the SQL backend internally, since that's the only backend where it really makes a difference.  But it's *easier* to do at the view level initially.)
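
A minimal sketch of that view-level approach, assuming hypothetical backend helpers (get_items returning a dict with "items" and "next_offset", plus get_collection_timestamp) rather than the real server-syncstorage interfaces:

class ConflictError(Exception):
    """Raised if the collection changes under a paginated read."""

def get_collection_internally_paged(storage, userid, collection, batch_size):
    # Serve an unlimited GET by issuing a series of limit-bearing backend
    # queries, following the backend's offset tokens.
    ts = storage.get_collection_timestamp(userid, collection)
    items = []
    offset = None
    while True:
        res = storage.get_items(userid, collection,
                                limit=batch_size, offset=offset)
        items.extend(res["items"])
        # These are the "extra queries" mentioned earlier: make sure no
        # concurrent write landed mid-fetch and left an inconsistent view.
        if storage.get_collection_timestamp(userid, collection) != ts:
            raise ConflictError("collection modified during paginated read")
        offset = res.get("next_offset")
        if offset is None:
            return items

In the actual patch this would be driven from the view, only kicking in when the client didn't supply its own limit.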
Assignee: nobody → rfkelly
Attachment #8438895 - Flags: feedback?(telliott)
We will also need the more-efficient "offset token" implementation discussed in Bug 735102.
Depends on: 735102
Comment on attachment 8438895 [details] [diff] [review]
sync15-internal-pagination.diff

worth trying on a node to see if it improves things.
Attachment #8438895 - Flags: feedback?(telliott) → feedback+
Depends on: 1025735
I went ahead and committed this with one change in https://github.com/mozilla-services/server-syncstorage/commit/8bded2b793ee57920830298dda35d255f58fe52e

The change is that internal pagination is disabled by default, and only activates when you set the storage.pagination_batch_size config option.  This will let us easily enable or disable it for experimenting on a node.
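
For the record, turning it on is just a config change (section and option name as committed; the batch size value is only an example, echoing the suggestion below):

[storage]
# Internal pagination stays off unless this option is set.
pagination_batch_size = 5000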

Let's keep the bug open to track whether this works out OK in practice.
This code is now live in prod, so we should try switching on the pagination and see if it helps things.  Here's what I'd like to see:


* Find a node or two that seems to have high db contention, e.g. lots of time waiting for connections and locks.
* Add "pagination_batch_size = 5000" in the [storage] config section and restart the app
* Monitor performance and see if it improves overall response time and/or throughput:
   * I would expect the max response time to increase slightly due to additional overhead on large queries.
   * But I would expect the average response time to decrease, since small requests don't have to wait as long for large ones to complete.

But it's all hypothetical at this point.  Bob and James, thoughts?
Flags: needinfo?(bobm)
This has been enabled on the sync-45 and sync-46 production nodes.  Here are some basic stats for the servers from before it was enabled.  Note that the Nginx query times are longer because of network traffic.  Upstream traffic is always cut off at the 575-second timeout.  MySQL queries may run longer than the 575-second timeout because sometimes they are from the purge_ttl script and do not have a timeout, and because MySQL queries do not automatically die.

Sync-45: 8/27 - 9/2
Nginx: Avg Service Time: 4.516, Stdev: 7.419, Min: 0.000, Max: 698.241
Upstream: 4.209 Stdev: 7.088, Min: 0.000, Max: 575.017

48hrs of Slow Query Logs:
Query                 min max      mean std_deviation
ALL                   1s  296781s  12s  321s
upsert_bso            1s  945s     19s  55s
find_items            1s  621s     26s  72s
collection_timestamp  1s  222s     9s   11s
lock_collection_write 1s  222s     8s   9s
lock_collection_read  1s  221s     9s   10s

Sync-46: 8/27 - 9/2
Nginx: Avg Service Time: 3.359, Stdev: 5.169, Min: 0.000, Max: 822.944
Upstream: 3.064 Stdev: 4.895, Min: 0.000, Max: 575.007

48hrs of Slow Query Logs:
Query                 min max   mean std_deviation
ALL                   1s  4851s 12s  46s
upsert_bso            1s  1693s 15s  72s
find_items            1s  634s  35s  80s
collection_timestamp  1s  295s  10s  14s
lock_collection_write 1s  300s  9s   10s
lock_collection_read  1s  297s  10s  14s
Flags: needinfo?(bobm)
(In reply to Bob Micheletto [:bobm] from comment #8)

Below are some mixed and slightly overlapping results.  It might be best to broaden the test, and wait a few more days.

Sync-45: 9/3 - 9/4
Nginx: Avg Service Time: 3.141, Stdev: 4.898, Min: 0.000, Max: 617.339
Upstream: 2.812 Stdev: 4.591, Min: 0.000, Max: 575.007

48hrs of Slow Query Logs:
Query                 min max     mean std_deviation
ALL                   1s  296781s 17s  592s
upsert_bso            1s  430s    26s  54s
find_items            1s  670s    13s  50s
collection_timestamp  1s  101s    9s   9s
lock_collection_write 1s  102s    7s   7s
lock_collection_read  1s  101s    6s   7s

Sync-46: 9/3 - 9/4
Nginx: Avg Service Time: 12.797, Stdev: 19.383, Min: 0.000, Max: 681.281
Upstream: 11.928 Stdev: 18.412, Min: 0.000, Max: 590.004

48hrs of Slow Query Logs:
Query                 min max     mean std_deviation
ALL                   1s  401249s 31s  521s
upsert_bso            1s  1110s   42s  78s
find_items            1s  1529s   62s  121s
lock_collection_write 1s  64s     6s   8s
collection_timestamp  1s  41s     6s   6s
lock_collection_read  1s  61s     5s   6s
> max: 401249s

What on earth is this query taking over four days to complete?
> find_items 1s  1529s

Am I reading this correctly, that a single find_items query took 25 minutes to complete?  I would expect the request to time out, and the corresponding query to be killed by our automatic kill-db-connection-on-timeout logic...so perhaps that hackery is no longer working in the new system?
(In reply to Ryan Kelly [:rfkelly] from comment #11)
The offending query:

SELECT bso.userid, bso.collection, bso.id, bso.sortindex, bso.modified, bso.payload, bso.payload_size, bso.ttl 
FROM bso 
WHERE bso.userid = XXXXXXX AND bso.collection = 4 AND bso.modified > 1409686992970 AND bso.ttl > 1409721284 ORDER BY bso.modified DESC 
LIMIT 5001;

The EXPLAIN output:
+------+-------------+-------+-------+-----------------------------------------+---------------------+---------+------+------+-------------+
| id   | select_type | table | type  | possible_keys                           | key                 | key_len | ref  | rows | Extra       |
+------+-------------+-------+-------+-----------------------------------------+---------------------+---------+------+------+-------------+
|    1 | SIMPLE      | bso   | range | PRIMARY,bso_ttl_idx,bso_usr_col_mod_idx | bso_usr_col_mod_idx | 16      | NULL | 1572 | Using where |
+------+-------------+-------+-------+-----------------------------------------+---------------------+---------+------+------+-------------+
The conclusion from trying this out in prod was that it's not worth it - no significant wins were seen, and sometimes worse performance.  Particularly given that we've had other db improvements that helped a lot with our operational issues, let's switch this off in the production config and close out the bug.

(RESOLVED/FIXED because we still have the code for it, just not enabled in prod by default)
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
OK. Will Verify for now.
Status: RESOLVED → VERIFIED
Product: Cloud Services → Cloud Services Graveyard