Closed Bug 1273102 Opened 8 years ago Closed 8 years ago

Deploy and load-test stage storage node running the new "batch upload" API

Categories

(Cloud Services :: Operations: Deployment Requests - DEPRECATED, task)

task
Not set
normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: rfkelly, Assigned: bobm)

References

Details

(Whiteboard: [sync-atomic])

Before we commit to landing the new batch upload API, we want to get it running in stage and convince ourselves that it's going to have acceptable performance characteristics.  This bug tracks the work of deploying the code, running the tests, and summarizing the results.

The actual running of the test is blocked by Bug 1271466, but we can work on the deployment in parallel.  :rtilder, do we have a commit ready for :bobm to deploy?  I've lost track a little bit of where the code's at, between the in-progress PR at [1] and the not-passing-yet testcases at [2].

[1] https://github.com/mozilla-services/server-syncstorage/pull/32/
[2] https://github.com/rtilder/server-syncstorage/pull/1
Flags: needinfo?(rtilder)
Depends on: 1271466
I'll be glad to help Bob test this code once it's ready.
QA Contact: kthiessen
Assignee: nobody → bobm
Status: NEW → ASSIGNED
Please provide a tag in the https://github.com/mozilla-services/server-syncstorage repository for operations to deploy from.  If that isn't an option presently, a branch or any old gitref will do.
Whiteboard: [sync-data-integrity]
Sync stage deployed with the https://github.com/mozilla-services/server-syncstorage/tree/batch-uploads branch.  Not sure if this needs any .ini settings, or if we have to apply a new schema.  If so, this isn't ready.

:kthiessen can you verify that the batch API is working on these nodes, and if so run some load tests?
Flags: needinfo?(kthiessen)
Whiteboard: [sync-data-integrity] → [data-integrity]
Do we have a straightforward test scenario specifically for the batch API?
Flags: needinfo?(kthiessen)
Also, :rfkelly, have the changes in your 'rfkelly-batch-upload-tweaks' branch been incorporated into the 'batch-uploads' branch?
Flags: needinfo?(rfkelly)
Yes, but I have an additional PR that appears to be outstanding:

  https://github.com/mozilla-services/server-syncstorage/pull/37

switching ni? to :rtilder to see if we want to get that one in as well.
Flags: needinfo?(rtilder)
Flags: needinfo?(rfkelly)
Flags: needinfo?(rtilder)
Merged PR 37 into the batch-uploads branch.
Flags: needinfo?(rtilder)
(In reply to Ryan Tilder [:rtilder] from comment #7)
> Merged PR 37 into the batch-uploads branch.

Commit 25556c07aaa5eefe307bdad1ac5b785ce99b9c56 deployed to Sync stage.  :kthiessen can we try the same load test that you ran last?
Flags: needinfo?(kthiessen)
Absolutely.  Let me fire it up.
Flags: needinfo?(kthiessen)
Whiteboard: [data-integrity] → [sync-atomic]
One of the purge statements is getting stuck on these systems and consuming ~90% of the available read operations to the database storage volume.  Interestingly, it's the one for the batch_upload_items0 table in every case, and not any of the other shards.

SQL statement: /* [queryName=PURGE_BATCH_CONTENTS] */ DELETE FROM batch_upload_items0 WHERE batch < (UNIX_TIMESTAMP() - 10800) * 1000 ORDER BY ttl LIMIT 1000

The MySQL Slow Query Log dashboard shows that all of these long-running queries are against batch_upload_items0:
Field            Value
EnvVersion
Hostname         ip-172-31-26-71
Lock_time        0.000084
Logger           Sync-1_5-SlowQuery
Payload          DELETE FROM batch_upload_items0 WHERE batch < (UNIX_TIMESTAMP() ...
Pid              0
QC_hit           No
Query_name       PURGE_BATCH_CONTENTS
Query_time       206.422672
Rows_examined    5436051
Rows_sent        0
Schema           weave0
Severity         7
Thread_id        23650
Timestamp        2016-08-03T22:59:04
Type             mysql.slow-query
Flags: needinfo?(rtilder)
It could be that the purge script is not correctly handling the table sharding, and directing its attention to *only* batch_upload_items0.

It also seems odd that this should be consuming read ops, maybe because of the `ORDER BY ttl`?

> Rows_examined		5436051

Ouch.
> SQL statement: /* [queryName=PURGE_BATCH_CONTENTS] */ DELETE FROM batch_upload_items0
> WHERE batch < (UNIX_TIMESTAMP() - 10800) * 1000 ORDER BY ttl LIMIT 1000

Aha, batch_upload_items doesn't have an index on `ttl`, so this query is a full table scan.

I don't think the `ORDER BY ttl` is necessary here since the batch id itself serves as a timestamp, and we should be able to just `ORDER BY batch` instead.
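
For concreteness, a sketch of what the adjusted statement could look like (assuming `batch` is indexed on these tables, otherwise the same scan problem remains; this is an illustration, not a committed fix):

  DELETE FROM batch_upload_items0 WHERE batch < (UNIX_TIMESTAMP() - 10800) * 1000 ORDER BY batch LIMIT 1000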
Here are my thoughts on why the load tests might not be pushing these new staging servers to 503 responses, as discussed in the Monday stand-up.

Typically 503s are served during the load test for two reasons:
1. Database locking issues.
2. Exhausted connection pools.

Locks are usually encountered because of long-running purge_ttl queries.  It appears that the current purge_ttl is only running against the batch_upload_items0 table; when that is fixed, locking issues may reappear.

If I recall correctly, the Sync SQLAlchemy connection pools are created per table.  If that's true, we are now distributing some requests across two pools per "shard": some percentage to the InnoDB batch_uploads table, and another percentage over to the TokuDB BSO table.  I suspect that if we configure the batch_uploads percentage to zero we will once again exhaust the connection pools.
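
To make the pool hypothesis concrete, here's a rough back-of-the-envelope sketch (the numbers and the per-table pool layout are assumptions for illustration, not the actual stage settings): splitting writes across two independently pooled paths means each pool only sees a fraction of the concurrency, so neither saturates as quickly as a single pool taking all of the traffic.

    # Illustrative sketch only -- pool size and concurrency are made-up numbers.
    POOL_SIZE = 25          # connections available per pool (assumed)
    CONCURRENT_WRITES = 40  # simultaneous write requests during the load test

    def pools_exhausted(batch_fraction):
        # Return True if either pool would run out of connections.
        batch_writes = CONCURRENT_WRITES * batch_fraction      # -> batch_uploads pool
        bso_writes = CONCURRENT_WRITES * (1 - batch_fraction)  # -> bso pool
        return batch_writes > POOL_SIZE or bso_writes > POOL_SIZE

    pools_exhausted(0.5)  # False: load is split, neither pool saturates
    pools_exhausted(0.0)  # True: everything hits the bso pool, as before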
Turns out the batch responses were slightly out of spec.  I've pushed a small change to base64 encode the batch identifier when the client starts a new batch.

https://github.com/mozilla-services/server-syncstorage/commit/3694262132ec47f60e0ce9c9e4645f23969ece13
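
The gist of the change (a sketch of the idea only, not the code from that commit): the server hands the client a base64-encoded batch identifier rather than the raw integer, presumably so clients can treat it as an opaque token.

    import base64

    def encode_batch_id(batchid):
        # Sketch: return the integer batch id as a base64 string.
        return base64.b64encode(str(batchid).encode("ascii")).decode("ascii")

    encode_batch_id(1472071477120)  # -> 'MTQ3MjA3MTQ3NzEyMA=='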
(In reply to Ryan Tilder [:rtilder] from comment #14)
This has been deployed to stage.

:kthiessen please test.
Flags: needinfo?(kthiessen)
Bob, I've dropped two ninety-minute slots in your calendar for testing this, with one of them timed so we can get input from :markh (or :rfkelly) as necessary.
Flags: needinfo?(kthiessen)
Bob's concern here is that we're not testing the sharding completely, only 2 of the shards are getting traffic.  We need to consult with :rtilder and possibly :rfkelly to see what changes are needed to the load tests.
> Bob's concern here is that we're not testing the sharding completely,
> only 2 of the shards are getting traffic

Here's a theory as to why that might be happening.

The batch_upload_items tables are sharded by batchid, by doing:

   return get_batch_item_table(batchid % self.shardsize)

The batchids are bigint timestamps, based on the session timestamp as an integer number of milliseconds:

    def ts2bigint(timestamp):
        return int(timestamp * 1000)

    batchid = ts2bigint(session.timestamp)

But the timestamps in sync are actually in seconds, and for hysterical raisins they are quantized to two decimal places:

    def get_timestamp(value=None):
        if value is None:
            value = time.time()
        value = decimal.Decimal(str(value))
        return value.quantize(TWO_DECIMAL_PLACES)


Which I think means that the final digit of the batchid will always be zero, and we only access every tenth shard.

We could do a gross hack and shard based on (batchid / 10), but it may be more reliable to shard based on userid in the same way as the main bso table.
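
A quick worked example of the effect (a shard count of 20 is an assumption here, consistent with only two shards seeing traffic; the two helpers are copied from the snippets above):

    import decimal
    import time

    TWO_DECIMAL_PLACES = decimal.Decimal("1.00")

    def get_timestamp(value=None):
        if value is None:
            value = time.time()
        value = decimal.Decimal(str(value))
        return value.quantize(TWO_DECIMAL_PLACES)

    def ts2bigint(timestamp):
        return int(timestamp * 1000)

    # Two-decimal-place timestamps make every batchid a multiple of 10,
    # so batchid % 20 can only ever be 0 or 10 -- two shards out of twenty.
    SHARDSIZE = 20  # assumed shard count
    shards = {ts2bigint(get_timestamp(time.time() + i * 0.37)) % SHARDSIZE
              for i in range(1000)}
    print(sorted(shards))  # [0, 10]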
We talked about this at last week's meetings, but did not agree on who would take the action item for fixing the sharding.  :rtilder, :rfkelly, do either of you have time to do this?  I could try to hack it together, but you might not like the results.  ;)

Currently batch-upload load testing is blocking on this issue.
:rtilder has a patch in progress for the sharding fixes
Flags: needinfo?(rtilder)
(In reply to Ryan Kelly [:rfkelly] from comment #18)
> We could do a gross hack and shard based on (batchid / 10), but it may be
> more reliable to shard based on userid in the same way as the main bso table

Sadly, using the userid was problematic, which is why I went with sharding on batchid.

The userid isn't part of the batch_upload_items table, and the sharding occurs automagically in the DB connector.  The DBConnection class's query() method calls DBConnector's get_query(), which is where the magic occurs.  Sadly, the query() call fails with a user ID since it isn't part of the batch_upload_items schema.
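
To spell out the constraint (a simplified sketch with made-up names; the real logic lives in DBConnector.get_query()): the shard suffix has to be derived from a value that's actually present in the query parameters, and for batch_upload_items queries that value is the batch id, not the userid.

    # Simplified illustration, not the actual connector code.
    def get_batch_item_table_name(params, shardsize=20):
        batchid = params["batch"]  # present on every batch_upload_items query
        # There is no params["userid"] here, so sharding by userid would mean
        # plumbing the userid through every batch_upload_items query first.
        return "batch_upload_items%d" % (batchid % shardsize)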
I posted a proposed fix for the sharding issue here:

  https://github.com/mozilla-services/server-syncstorage/pull/43

Pending review, but :bobm :kthiessen if you want to try it out in stage, the commit is 742ca6b5f2807fb252b2a99e7c9f8162ef7334e0
The sharding fix is now merged into the batch-uploads branch:

  https://github.com/mozilla-services/server-syncstorage/tree/batch-uploads

ni? :bobm for deployment from commit 393a747f23869e0a5ba3914c0dcf04b94f91f590
Flags: needinfo?(bobm)
(In reply to Ryan Kelly [:rfkelly] from comment #23)
This commit has been pushed to stage.  :kthiessen please do the usual battery of testing.
Flags: needinfo?(bobm) → needinfo?(kthiessen)
(In reply to Ryan Kelly [:rfkelly] from comment #23)
Also, I needed to add pep8 to the requirements.txt to build successfully.
Got it.  Testing on this will start either tonight or tomorrow morning, PDT.
Flags: needinfo?(kthiessen)
90-minute load test started.  Let's make sure this one lights up all the shard tables.
(In reply to Karl Thiessen [:kthiessen] from comment #27)
Judging by the slow query dashboard and the IOPS metrics, the results of the load test look good to me.  The sharding across the batch_upload_items{0..19} tables appears to be working, and there were no prolonged read IOPS spikes.

Slow queries across one stage server (note these averaged around 2s; most requests don't end up creating an entry):
    950 batch_upload_items0
    981 batch_upload_items1
    911 batch_upload_items2
    942 batch_upload_items3
    797 batch_upload_items4
    925 batch_upload_items5
    756 batch_upload_items6
    940 batch_upload_items7
    903 batch_upload_items8
    902 batch_upload_items9
    829 batch_upload_items10
    913 batch_upload_items11
    870 batch_upload_items12
    916 batch_upload_items13
    802 batch_upload_items14
    920 batch_upload_items15
    858 batch_upload_items16
    949 batch_upload_items17
    876 batch_upload_items18
    923 batch_upload_items19
:rfkelly at this point I don't think we want to block on any additional activity, but I thought I'd point out we talked about bundling changes from Bug 1306129 into this release.  Other than that, I think the only thing we are looking for is the feature flag for batch upload API support.
Flags: needinfo?(rfkelly)
No longer depends on: 1271466
The latest commit includes a slightly-adjusted DB query that gives proper semantics for partial updates, and the new feature-flag from Bug 1304627.

Commit hash:  a0c1d99934c95e746c584b161e536b7157eb10e9

:bobm, also, I don't seem to have access to Bug 1306129
Flags: needinfo?(rfkelly) → needinfo?(bobm)
I have a handful of smaller batch-related PRs pending:

  https://github.com/mozilla-services/server-syncstorage/pulls

But I think this is ready for another loadtest without them.  Given we have the feature flag, we can then merge the batch-uploads branch to master and continue work on it from there, rather than maintaining it in a long-lived branch.
I will load test this commit as soon as I hear from Bob that it has been deployed.
Actually even more of the PRs have landed since then, probably worth just deploying the head of the "batch-uploads" branch, whichever commit happens to be there at the time.
(In reply to Ryan Kelly [:rfkelly] from comment #30)
> The latest commit includes a slightly-adjusted DB query that gives proper
> semantics for partial updates, and the new feature-flag from Bug 1304627.
> 
> Commit hash:  a0c1d99934c95e746c584b161e536b7157eb10e9
I'll build this and push to Stage, as long as :kthiessen agrees.

> :bobm, also, I don't seem to have access to Bug 1306129
That was a cut-and-paste bait-and-switch; I meant Bug 1300451.
Flags: needinfo?(bobm) → needinfo?(kthiessen)
(In reply to Ryan Kelly [:rfkelly] from comment #33)
> Actually even more of the PRs have landed since then, probably worth just
> deploying the head of the "batch-uploads" branch, whichever commit happens
> to be there at the time.

I mean this.
OK.  Bob, let me know which commit hash you wind up deploying, and we'll be off to the races.
Flags: needinfo?(kthiessen)
(In reply to Karl Thiessen [:kthiessen] from comment #36)
> OK.  Bob, let me know which commit hash you wind up deploying, and we'll be
> off to the races.

:kthiessen Git hash: 85871a7896593df71e5faa81b3a336b10d8b00f5 deployed to stage.
Flags: needinfo?(kthiessen)
Six-hour heavy one-node load test launched against current stage.  Results should come in around 20:00 PDT today.
Flags: needinfo?(kthiessen)
Re: Bug 1300451, the logging code from that bug is included in the upcoming release, since the batch-uploads work was merged back to master.
Bob, are we still lighting up all of the shard tables with the last couple of load tests?
Flags: needinfo?(bobm)
(In reply to Karl Thiessen [:kthiessen] from comment #40)
> Bob, are we still lighting up all of the shard tables with the last couple
> of load tests?

Yes.
Flags: needinfo?(bobm)
Do we need a separate bug for the deployment of Sync 1.6.0 to Stage, or are we wrapping it up in this bug?
This is done -- I am going to take two steps to mark this RESOLVED ...
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
... and VERIFIED.  (Apologies for the bug-spam.)
Status: RESOLVED → VERIFIED