Closed
Bug 1273102
Opened 9 years ago
Closed 8 years ago
Deploy and load-test stage storage node running the new "batch upload" API
Categories
(Cloud Services :: Operations: Deployment Requests - DEPRECATED, task)
Cloud Services
Operations: Deployment Requests - DEPRECATED
Tracking
(Not tracked)
VERIFIED
FIXED
People
(Reporter: rfkelly, Assigned: bobm)
References
Details
(Whiteboard: [sync-atomic])
Before we commit to landing the new batch upload API, we want to get it running in stage and convince ourselves that it's going to have acceptable performance characteristics. This bug tracks the work of deploying the code, running the tests, and summarizing the results.
The actual running of the test is blocked by Bug 1271466, but we can work on the deployment in parallel. :rtilder, do we have a commit ready for :bobm to deploy? I've lost track a little bit of where the code's at, between the in-progress PR at [1] and the not-passing-yet testcases at [2].
[1] https://github.com/mozilla-services/server-syncstorage/pull/32/
[2] https://github.com/rtilder/server-syncstorage/pull/1
Flags: needinfo?(rtilder)
Comment 1•9 years ago
|
||
I'll be glad to help Bob test this code once it's ready.
QA Contact: kthiessen
Assignee | ||
Updated•9 years ago
|
Assignee: nobody → bobm
Status: NEW → ASSIGNED
Assignee | ||
Comment 2•9 years ago
|
||
Please provide a tag in the https://github.com/mozilla-services/server-syncstorage repository for operations to deploy from. If that isn't an option presently, a branch or any old gitref will do.
Updated•9 years ago
|
Whiteboard: [sync-data-integrity]
Assignee | ||
Comment 3•9 years ago
|
||
Sync stage deployed with https://github.com/mozilla-services/server-syncstorage/tree/batch-uploads branch. Not sure if this needs any .ini settings, or if we have to apply new schema. If so, this isn't ready.
:kthiessen can you verify that the batch API is working on these nodes, and if so run some load tests?
Flags: needinfo?(kthiessen)
Updated•9 years ago
|
Whiteboard: [sync-data-integrity] → [data-integrity]
Comment 4•9 years ago
|
||
Do we have a straightforward test scenario specifically for the batch API?
Flags: needinfo?(kthiessen)
Comment 5•9 years ago
|
||
Also, :rfkelly, have the changes in your 'rfkelly-batch-upload-tweaks' branch been incorporated into the 'batch-upload' branch?
Flags: needinfo?(rfkelly)
Reporter | ||
Comment 6•9 years ago
|
||
Yes, but I have an additional PR that appears to be outstanding:
https://github.com/mozilla-services/server-syncstorage/pull/37
switching ni? to :rtilder to see if we want to get that one in as well.
Flags: needinfo?(rtilder)
Flags: needinfo?(rfkelly)
Reporter | ||
Updated•9 years ago
|
Flags: needinfo?(rtilder)
Assignee | ||
Comment 8•9 years ago
|
||
(In reply to Ryan Tilder [:rtilder] from comment #7)
> Merged PR 37 into the batch-uploads branch.
Commit 25556c07aaa5eefe307bdad1ac5b785ce99b9c56 deployed to Sync stage. :kthiessen can we try the same load test that you ran last?
Flags: needinfo?(kthiessen)
Updated•8 years ago
|
Whiteboard: [data-integrity] → [sync-atomic]
Assignee | ||
Comment 10•8 years ago
|
||
One of the purge statements is getting stuck on these systems and consuming ~90% of the available read operations to the database storage volume. Interestingly, it's for the batch_upload_items0 in all cases, and not another table.
SQL statement: /* [queryName=PURGE_BATCH_CONTENTS] */ DELETE FROM batch_upload_items0 WHERE batch < (UNIX_TIMESTAMP() - 10800) * 1000 ORDER BY ttl LIMIT 1000
The MySQL Slow Query Log dashboard reflects that all these long running queries are to batch_upload_items0:
Field Action Value
EnvVersion
Hostname ip-172-31-26-71
Lock_time 0.000084
Logger Sync-1_5-SlowQuery
Payload DELETE FROM batch_upload_items0 WHERE batch < (UNIX_TIMESTAMP() ...
Pid 0
QC_hit No
Query_name PURGE_BATCH_CONTENTS
Query_time 206.422672
Rows_examined 5436051
Rows_sent 0
Schema weave0
Severity 7
Thread_id 23650
Timestamp 2016-08-03T22:59:04
Type mysql.slow-query
Flags: needinfo?(rtilder)
Reporter | ||
Comment 11•8 years ago
|
||
It could be that the purge script it not correctly handling the table sharding, and directing its attention to *only* batch_upload_items0.
It also seems odd that this should be consuming read ops, maybe because of the `ORDER BY ttl`?
> Rows_examined 5436051
Ouch.
Reporter | ||
Comment 12•8 years ago
|
||
> SQL statement: /* [queryName=PURGE_BATCH_CONTENTS] */ DELETE FROM batch_upload_items0
> WHERE batch < (UNIX_TIMESTAMP() - 10800) * 1000 ORDER BY ttl LIMIT 1000
Aha, batch_upload_items doesn't have an index on `ttl`, so this query is a full table scan.
I don't think the `ORDER BY ttl` is necessary here since the batch id itself serves as a timestamp, and we should be able to just `ORDER BY batch` instead.
Assignee | ||
Comment 13•8 years ago
|
||
Here are my thoughts on perhaps why the load tests aren't pushing these new staging servers to 503 responses, as discussed in the Monday stand-up.
Typically 503s are served during the load test for two reasons:
1. Database locking issues.
2. Exhausted connection pools.
Locks are usually encountered because of long running purge_ttl queries. It appears that the current purge_ttl is only running against the batch_uploads0 table, when that is fixed locking issues may reappear.
If I recall correctly, the Sync SQL Alchemy connection pools are created per by table. If that's true; we are now distributing some requests across two pools per "shard": some percentage to the INNODB batch_uploads table, and another percentage over to the TokuDB BSO table. I suspect that if we configure the batch_uploads percentage to zero that we will once again exhaust the connection pools.
Comment 14•8 years ago
|
||
Turns out the batch responses were slightly out of spec. I've pushed a small change to base64 encode the batch identifier when the client starts a new batch.
https://github.com/mozilla-services/server-syncstorage/commit/3694262132ec47f60e0ce9c9e4645f23969ece13
Assignee | ||
Comment 15•8 years ago
|
||
(In reply to Ryan Tilder [:rtilder] from comment #14)
This has been deployed to stage.
:kthiessen please test.
Flags: needinfo?(kthiessen)
Comment 16•8 years ago
|
||
Bob, I've dropped two ninety-minute slots in your calendar for testing this, with one of them timed so we can get input from :markh (or :rfkelly) as necessary.
Updated•8 years ago
|
Flags: needinfo?(kthiessen)
Comment 17•8 years ago
|
||
Bob's concern here is that we're not testing the sharding completely, only 2 of the shards are getting traffic. We need to consult with :rtilder and possibly :rfkelly to see what changes are needed to the load tests.
Reporter | ||
Comment 18•8 years ago
|
||
> Bob's concern here is that we're not testing the sharding completely,
> only 2 of the shards are getting traffic
Here's a theory as to why that might be happening.
The batch-upload-items tables are sharded by batchid, by doing:
return get_batch_item_table(batchid % self.shardsize)
The batchids are bigint timestamps, based on the session timestamp as an integer number of milliseconds:
def ts2bigint(timestamp):
return int(timestamp * 1000)
batchid = ts2bigint(session.timestamp)
But the timestamps in sync are actually in seconds, and for hysterical raisins they are quantized to two decimal places:
def get_timestamp(value=None):
if value is None:
value = time.time()
value = decimal.Decimal(str(value))
return value.quantize(TWO_DECIMAL_PLACES)
Which I think means that the final digit of the batchid will always be zero, and we only access every tenth shard.
We could do a gross hack and shard based on (batchid / 10), but it may be more reliable to shard based on userid in the same way as the main bso table
Comment 19•8 years ago
|
||
We talked about this at last week's meetings, but did not agree on who would take the action for fixing the tests to do sharding differently. :rtilder, :rfkelly, do either of you have time to do this? I could try to hack it together, but you might not like the results. ;)
Currently batch-upload load testing is blocking on this issue.
Reporter | ||
Comment 20•8 years ago
|
||
:rtilder has a patch in progress for the sharding fixes
Flags: needinfo?(rtilder)
Comment 21•8 years ago
|
||
(In reply to Ryan Kelly [:rfkelly] from comment #18)
> We could do a gross hack and shard based on (batchid / 10), but it may be
> more reliable to shard based on userid in the same way as the main bso table
Sadly, using the userid was problematic which is why I went with sharding on batchid.
The userid isn't part of the batch_upload_items table and the sharding occurs automagically in the DB connector. The DBConnection class's query() method calls DBConnector's get_query() which is where the magic occurs. Sadly, query() call fails with a user ID since it isn't part of the batch_upload_items schema.
Reporter | ||
Comment 22•8 years ago
|
||
I posted a proposed fix for the sharding issue here:
https://github.com/mozilla-services/server-syncstorage/pull/43
Pending review, but :bobm :kthiessen if you want to try it out in stage, the commit is 742ca6b5f2807fb252b2a99e7c9f8162ef7334e0
Reporter | ||
Comment 23•8 years ago
|
||
The sharding fix is now merged into the batch-uploads branch:
https://github.com/mozilla-services/server-syncstorage/tree/batch-uploads
ni? :bobm for deployment from commit 393a747f23869e0a5ba3914c0dcf04b94f91f590
Flags: needinfo?(bobm)
Assignee | ||
Comment 24•8 years ago
|
||
(In reply to Ryan Kelly [:rfkelly] from comment #23)
This commit has been pushed to stage. :kthiessen please do the usual battery of testing.
Flags: needinfo?(bobm) → needinfo?(kthiessen)
Assignee | ||
Comment 25•8 years ago
|
||
(In reply to Ryan Kelly [:rfkelly] from comment #23)
Also, I needed to add pep8 to the requirements.txt to build successfully.
Comment 26•8 years ago
|
||
Got it. Testing on this will start either tonight or tomorrow morning, PDT.
Flags: needinfo?(kthiessen)
Comment 27•8 years ago
|
||
90-minute load test started. Let's make sure this one lights up all the shard tables.
Assignee | ||
Comment 28•8 years ago
|
||
(In reply to Karl Thiessen [:kthiessen] from comment #27)
Judging by the slow query dashboard, and the IOPs metrics the results of the load test look good to me. The sharding across the batch_upload_items{0..19} tables appears to be working, and no prolonged read IOPs spikes happened.
Slow queries across one stage server (note these were on average 2s, most requests don't end up creating an entry):
950 batch_upload_items0
981 batch_upload_items1
911 batch_upload_items2
942 batch_upload_items3
797 batch_upload_items4
925 batch_upload_items5
756 batch_upload_items6
940 batch_upload_items7
903 batch_upload_items8
902 batch_upload_items9
829 batch_upload_items10
913 batch_upload_items11
870 batch_upload_items12
916 batch_upload_items13
802 batch_upload_items14
920 batch_upload_items15
858 batch_upload_items16
949 batch_upload_items17
876 batch_upload_items18
923 batch_upload_items19
Assignee | ||
Comment 29•8 years ago
|
||
:rfkelly at this point I don't think we want to block on any additional activity, but I thought I'd point out we talked about bundling changes from Bug 1306129 into this release. Other than that, I think the only thing we are looking for is the feature flag for batch upload API support.
Flags: needinfo?(rfkelly)
Reporter | ||
Comment 30•8 years ago
|
||
The latest commit includes a slightly-adjusted DB query that gives proper semantics for partial updates, and the new feature-flag from Bug 1304627.
Commit hash: a0c1d99934c95e746c584b161e536b7157eb10e9
:bobm, also, I don't seem to have access to Bug 1306129
Flags: needinfo?(rfkelly) → needinfo?(bobm)
Reporter | ||
Comment 31•8 years ago
|
||
I have a handful of smaller batch-related PRs pending:
https://github.com/mozilla-services/server-syncstorage/pulls
But I think this is ready for another loadtest without them. Given we have the feature flag, we can then merge the batch-uploads branch to master and continue work on it from there, rather than maintaining it in a long-lived branch.
Comment 32•8 years ago
|
||
I will load test this commit as soon as I hear from Bob that it has been deployed.
Reporter | ||
Comment 33•8 years ago
|
||
Actually even more of the PRs have landed since then, probably worth just deploying the head of the "batch-uploads" branch, whichever commit happens to be there at the time.
Assignee | ||
Comment 34•8 years ago
|
||
(In reply to Ryan Kelly [:rfkelly] from comment #30)
> The latest commit includes a slightly-adjusted DB query that gives proper
> semantics for partial updates, and the new feature-flag from Bug 1304627.
>
> Commit hash: a0c1d99934c95e746c584b161e536b7157eb10e9
I'll build this and push to Stage, as long as :kthiessen agrees.
> :bobm, also, I don't seem to have access to Bug 1306129
That was a cut-and-paste bait-and-switch, I had meant this Bug 1300451
Flags: needinfo?(bobm) → needinfo?(kthiessen)
Assignee | ||
Comment 35•8 years ago
|
||
(In reply to Ryan Kelly [:rfkelly] from comment #33)
> Actually even more of the PRs have landed since then, probably worth just
> deploying the head of the "batch-uploads" branch, whichever commit happens
> to be there at the time.
I mean this.
Comment 36•8 years ago
|
||
OK. Bob, let me know which commit hash you wind up deploying, and we'll be off to the races.
Flags: needinfo?(kthiessen)
Assignee | ||
Comment 37•8 years ago
|
||
(In reply to Karl Thiessen [:kthiessen] from comment #36)
> OK. Bob, let me know which commit hash you wind up deploying, and we'll be
> off to the races.
:kthiessen Git hash: 85871a7896593df71e5faa81b3a336b10d8b00f5 deployed to stage.
Flags: needinfo?(kthiessen)
Comment 38•8 years ago
|
||
Six-hour heavy one-node load test launched against current stage. Results should come in around 20:00 PDT today.
Flags: needinfo?(kthiessen)
Reporter | ||
Comment 39•8 years ago
|
||
Re: Bug 1300451, the logging code from that bug is included in the upcoming release, since the batch-uploads work was merged back to master.
Comment 40•8 years ago
|
||
Bob, are we still lighting up all of the shard tables with the last couple of load tests?
Flags: needinfo?(bobm)
Assignee | ||
Comment 41•8 years ago
|
||
(In reply to Karl Thiessen [:kthiessen] from comment #40)
> Bob, are we still lighting up all of the shard tables with the last couple
> of load tests?
Yes.
Flags: needinfo?(bobm)
Comment 42•8 years ago
|
||
Do we need a separate bug for the deployment of Sync 1.6.0 to Stage, or are we wrapping it up in this bug?
Comment 43•8 years ago
|
||
This is done -- I am going to take two steps to mark this RESOLVED ...
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•