Closed Bug 1332987 Opened 3 years ago Closed 3 years ago

Batch upload queries are very slow

Categories

(Cloud Services :: Server: Sync, defect)

defect
Not set

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: rfkelly, Unassigned)

References

Details

(Whiteboard: [sync-atomic])

Attachments

(1 file)

As noted in https://bugzilla.mozilla.org/show_bug.cgi?id=1319872#c9, the production python servers are currently taking several seconds to process batch upload queries.  Here's the latency comparison between python and go servers:

----

Go                             Python

=============                  =============
Batch Uploads (202)            Batch Uploads (202)
=============                  =============
Total rows: 146                Total rows: 96
   bucket    count pct            bucket    count pct
<     5ms        0 0.00%       <     5ms        0 0.00%
<    10ms       35 23.97%      <    10ms        0 0.00%
<    25ms      100 68.49%      <    25ms        0 0.00%
<    50ms        8 5.48%       <    50ms        0 0.00%
<   100ms        1 0.68%       <   100ms        0 0.00%
<   500ms        2 1.37%       <   500ms        0 0.00%
<  1000ms        0 0.00%       <  1000ms        0 0.00%
<  2000ms        0 0.00%       <  2000ms        0 0.00%
<  4000ms        0 0.00%       <  4000ms       96 100.00
< 10000ms        0 0.00%       < 10000ms        0 0.00%
< 60000ms        0 0.00%       < 60000ms        0 0.00%
Blocks: 1250189
Some ideas about what may be going on here:

* The batch_upload_items tables are currently plain InnoDB rather than TokuDB tables.  We should try switching to TokuDB to see if that makes a difference.

* It may be that there's a lot of lock contention when inserting into the batch_upload_items tables, because batch-ids are timestamps, and all concurrent users will be simultaneously trying to insert at one side of the primary key index.  Bob, does MySQL give us statistics on lock contention or similar that we could compare between batch-enabled and batch-disabled hosts?
Flags: needinfo?(bobm)
The upsert-into-batch-upload-items queries are heavily represented in the slow query log, and you can see them in this dashboard:

  https://kibana-shared-us-west-2.prod.mozaws.net/index.html#dashboard/temp/AVnJ_d2qPqd6ozB414u2

Interestingly, it says the "lock_time" is very small but the "query_time" is regularly 1 - 3 seconds.  I'm not 100% sure what "lock_time" means though...
The queries in the kibana dashboard are truncated; Bob, I wonder if you could capture me a couple of the full query texts for these UPSERT_batch_upload_items queries, to check for any obvious silliness therein?
QA Contact: kthiessen
(In reply to Ryan Kelly [:rfkelly] from comment #1)
> * The batch_upload_items tables are currently plain InnoDB rather than
> TokuDB tables.  We should try switching to TokuDB to see if that makes a
> difference.
One of the very next tests destined for stage.

> * It may be that there's a lot of lock contention when inserting into the
> batch_upload_items tables, because batch-ids are timestamps, and all
> concurrent users will be simultaneously trying to insert at one side of the
> primary key index.  Bob, does MySQL give us statistics on lock contention or
> similar that we could compare between batch-enabled and batch-disabled hosts?

There are statistics, but we can't really compare them very well because of the difference in storage engine type.  I'll grab some stats and some full UPSERT_batch_upload_items queries.
Flags: needinfo?(bobm)
(In reply to Ryan Kelly [:rfkelly] from comment #3)
> The queries in the kibana dashboard are truncated; Bob, I wonder if you
> could capture me a couple of the full query texts for these
> UPSERT_batch_upload_items queries, to check for any obvious silliness
> therein?

Here's one where ... is 52826 bytes: 

/* [queryName=UPSERT_batch_upload_items] */ INSERT INTO batch_upload_items11 (sortindex,ttl_offset,payload_size,batch,id,payload) VALUES ... ON DUPLICATE KEY UPDATE sortindex = VALUES(sortindex),ttl_offset = VALUES(ttl_offset),payload_size = VALUES(payload_size),batch = VALUES(batch),id = VALUES(id),payload = VALUES(payload);
Whiteboard: [sync-atomic]
(In reply to Ryan Kelly [:rfkelly] from comment #1)
> * The batch_upload_items tables are currently plain InnoDB rather than
> TokuDB tables.  We should try switching to TokuDB to see if that makes a
> difference.

The stage 3 node batch_upload* tables have been altered to TokuDB.  Initial API verification was successful.  :kthiessen please run a standard single node load test, I'm curious to see the difference in memory consumption.  

The tables were altered with the following command:
ALTER TABLE batch_uploadsX ENGINE=TokuDB;

They completed quickly, most likely because there is no present load nor many rows in the tables.  I would expect it to take longer in production.  I'm not convinced that's a good idea to do in production, due to how the clients might handle it.  We should probably instead build new nodes with the new configuration and migrate the users.  However, we could consider something like the following:

1. Kill the python processes, effectively sending the nodes into backoff. (Nginx will send 503 + the backoff header)
2. Alter the tables, or drop and recreate.
3. Restart the python processes.
Flags: needinfo?(kthiessen)
Bob, which node would you like the load test run against?
Flags: needinfo?(kthiessen)
Running 90-minute medium-intensity tests against Stage Sync nodes 1 (InnoDB) and 3 (TokuDB).
Attached image SlowQueries.png
Slow query log graph from InnoDB vs TokuDB engine backed batch_uploads tables.
(In reply to Bob Micheletto [:bobm] from comment #9)
> Created attachment 8830995 [details]
> SlowQueries.png
> 
> Slow query log graph from InnoDB vs TokuDB engine backed batch_uploads
> tables.

The yellow bars are from Sync Node 1 with InnoDB and the tiny bit of red is Sync node 3 with TokuDB.
So...it was magic?
(In reply to Ryan Kelly [:rfkelly] from comment #11)
> So...it was magic?

In the Arthur C. Clarke sense.  Any thoughts on the migration options from comment 6?
> We should probably instead build new nodes with the new configuration and migrate the users.

Would it work if we did something like:

* Pref the batch-upload API off on the server
* Wait for in-flight clients to complete existing batches
* Migrate the batch-upload-related tables
* Pref the batch-upload API back on

I don't recall how aggressively the client caches results from /info/configuration, and hence how long we'd have to wait for batch-upload traffic to cease.  Thom, thoughts?
Flags: needinfo?(tchiovoloni)
The client will only make batch upload requests if the first request returns a 202, which I think would mean you don't have to wait very long.

But if you do need to wait for the configuration to update, AFAICT we'll only request it once, although it's stored in memory. So you would have to wait for the clients to restart, which might be a while. (But again, I think you shouldn't have to wait for this, unless I'm misunderstanding).
Flags: needinfo?(tchiovoloni)
(In reply to Thom Chiovoloni [:tcsc] from comment #14)
> The client will only make batch upload requests if the first request returns
> a 202, which I think would mean you don't have to wait very long.
> 
> But if you do need to wait for the configuration to update, AFAICT we'll
> only request it once, although it's stored in memory. So you would have to
> wait for the clients to restart, which might be a while. (But again, I think
> you shouldn't have to wait for this, unless I'm misunderstanding).

Is it possible that we could be putting clients into a bad state if they are interrupted and resume old batches?  Should we remove all inflight batches?  Will they be unable to sync until batch uploads are re-enabled?  How do you think we should test this?
Flags: needinfo?(tchiovoloni)
> The client will only make batch upload requests if the first request returns a 202

Good call, I had forgotten about this behaviour.  I guess the only interesting window of behaviour is then what happens to in-flight batches, where the client got a 202 on the first upload, but then we switch of batch uploads before it commits the batch.

Maybe we can backoff the node for a short period in order to flush existing batches or cause clients to abort them?
(In reply to Ryan Kelly [:rfkelly] from comment #16)

> Maybe we can backoff the node for a short period in order to flush existing
> batches or cause clients to abort them?

What's the maximum a client will wait for a batch to complete?  If we go this route, it seems like it be easier to just backoff the node through the whole procedure and not bother with flipping the batch upload support.
(In reply to Bob Micheletto [:bobm] from comment #15)
> Is it possible that we could be putting clients into a bad state if they are
> interrupted and resume old batches?  Should we remove all inflight batches? 
> Will they be unable to sync until batch uploads are re-enabled?

That should be fine - the client should recover fine (more below)

> How do you think we should test this?

I think we'd have to test manually, ensuring "success" logs are enabled, then manually verifying the logs show the expected behaviour. Thom or I would be able to look at any such logs for you.

(In reply to Ryan Kelly [:rfkelly] from comment #16)
> Good call, I had forgotten about this behaviour.  I guess the only
> interesting window of behaviour is then what happens to in-flight batches,
> where the client got a 202 on the first upload, but then we switch of batch
> uploads before it commits the batch.

We will throw in that case (ie, when a batch is in progress and subsequent responses also aren't 202), which should cause us to retry next sync (at which time the initial 200 will avoid batch semantics) - so this seems safe.

(In reply to Bob Micheletto [:bobm] from comment #17)
> What's the maximum a client will wait for a batch to complete?

I'm not quite sure what's being asked here.

> If we go
> this route, it seems like it be easier to just backoff the node through the
> whole procedure and not bother with flipping the batch upload support.

From the client's POV, both this and "return 200 instead of 202" are roughly the same.

I'm not sure how practical or worthwhile this is, but ISTM the least disruptive way of doing this would be something like:
* if "batch=true", return 200 (ie, this is the first POST in a batch, but the server responds that batch semantics aren't supported)
* if "batch=anything-else", return 202 (ie, a batch is underway, let it complete)

then at some point there will be few batches in-flight, so you'd return 200 for *all* batch requests (meaning any very-long-lived batches end up throwing and retrying the entire batch) - eg, I imagine that if a laptop goes to sleep while posting a batch, there's a chance it will try to resume that same batch at some time in the future, so there's probably no time for which we can be sure a batch isn't in-flight.
Flags: needinfo?(tchiovoloni)
> I'm not sure how practical or worthwhile this is, but ISTM the least disruptive way
> of doing this would be something like...

This sounds ideal from the POV of the client, but would require new server-side code to be deployed and QA'd, which probably tips the balance towards just 503'ing the node for a while.

> > what happens to in-flight batches [...] where the client got a 202 on the first upload,
> > but then we switch of batch uploads before it commits the batch.
>
> We will throw in that case (ie, when a batch is in progress and subsequent
> responses also aren't 202), which should cause us to retry next sync

An additional risk may be if we give the client 202s all the way up to its final post, but then switch of batches before it does the final "POST ?commit=true".  IIRC the client would accept the commit request, return a "200 OK", but would ignore the rest of the batch because batch uploads have been disabled.

In retrospect, it may be saner for the server to explicitly reject "?commit=true" when batch uploads are disabled, because that's not going to end well for the client.
(In reply to Ryan Kelly [:rfkelly] from comment #19)
> This sounds ideal from the POV of the client, but would require new
> server-side code to be deployed and QA'd, which probably tips the balance
> towards just 503'ing the node for a while.

Yeah, I guessed as much :)

> In retrospect, it may be saner for the server to explicitly reject
> "?commit=true" when batch uploads are disabled, because that's not going to
> end well for the client.

Although "batch=true&commit=true" might be seen in single-post batches IIRC.
(In reply to Mark Hammond [:markh] from comment #18)
> > What's the maximum a client will wait for a batch to complete?
> 
> I'm not quite sure what's being asked here.
I misunderstood the reason for waiting in Thom's comment 14.  I was wondering if there was some reason to wait for them to abandon a given batch before re-enabling the feature.  Sounds like we don't need to.
(In reply to Mark Hammond [:markh] from comment #18)
> > How do you think we should test this?
> 
> I think we'd have to test manually, ensuring "success" logs are enabled,
> then manually verifying the logs show the expected behaviour. Thom or I
> would be able to look at any such logs for you.

:kthiessen we can test this on the same node in stage, but we'll need to coordinate flipping the support back and forth.  I think starting and stopping a sync mid-batch (unless we don't need to do that) might be tricky but we can probably figure something out.  Should we schedule some time?
Flags: needinfo?(kthiessen)
Yes, I think so.  Drop a meeting request, 90 minutes or so, in my calendar and I'll make time for this.
Flags: needinfo?(kthiessen)
IIUC, per Comment 19, just disabling the batch upload feature will risk data loss if users happen to be making the final ?commit=true request of a batch.  They'll think they successfully uploaded the entire batch, but we will have only committed the contents of that one final request.

Based on above discussion, I lean towards just downing the nodes for the duration of the migration rather than trying to out-clever the edge-cases here (as much as I enjoy out-clevering things in general).
(In reply to Bob Micheletto [:bobm] from comment #22)
> (In reply to Mark Hammond [:markh] from comment #18)
> > > How do you think we should test this?
> > 
> > I think we'd have to test manually, ensuring "success" logs are enabled...

If the plan is just to backoff the node for a while, TBH I'm not sure we need to test this. I guess this will be done node-by-node, so it would be ideal if we could verify logs from a client on an early node, but otherwise assume we can handle this somewhat routine state.
What's the worst-case failure here?
Flags: needinfo?(markh)
(In reply to Karl Thiessen [:kthiessen] from comment #26)
> What's the worst-case failure here?

The worst case is really the same worst case whenever a node is in backoff - which *should* simply be that sync fails for a period, but fully recovers later.
Flags: needinfo?(markh)
Than I'd say we can test this at a best-effort level -- if I have some spare time, I'll see if I can find a good test scenario, but I won't make it a top priority.

Thanks, Mark.
(In reply to Karl Thiessen [:kthiessen] from comment #28)
> Than I'd say we can test this at a best-effort level -- if I have some spare
> time, I'll see if I can find a good test scenario, but I won't make it a top
> priority.

It would be perfect if testing was limited to something like:
* Log in to sync in a profile, ensuring success logs are enabled.
* Ask someone to perform some magic so your test user is on a node which will be backed off.
* Back that node off, perform the migration, reinstate the node.
* During that time, mash the "sync now" button.
* Hopefully observe nothing bad happing during this process
* Send us the logs for that period.
* We declare the logs look good, then migrate other nodes at your convenience :)
We have altered the tables of the batch enabled Python sync servers in production.  Will need to let it run for awhile to verify this as fixed, but from the temporary dashboard in Comment 2 it looks promising.  Tentatively marking this as resolved.
Status: NEW → RESOLVED
Closed: 3 years ago
Flags: needinfo?(kthiessen)
Resolution: --- → FIXED
Thom took a look at my logs from a user on a node with the TokuDB batch upload tables, and found nothing amiss.  So far, this looks good to me.
Flags: needinfo?(kthiessen)
Marking this as verified, as no problems have come up in the interim.
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.