Closed Bug 1273102 Opened 8 years ago Closed 8 years ago

Deploy and load-test stage storage node running the new "batch upload" API

Categories

(Cloud Services :: Operations: Deployment Requests - DEPRECATED, task)

task
Not set
normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: rfkelly, Assigned: bobm)

References

Details

(Whiteboard: [sync-atomic])

Before we commit to landing the new batch upload API, we want to get it running in stage and convince ourselves that it's going to have acceptable performance characteristics.  This bug tracks the work of deploying the code, running the tests, and summarizing the results.

The actual running of the test is blocked by Bug 1271466, but we can work on the deployment in parallel.  :rtilder, do we have a commit ready for :bobm to deploy?  I've lost track a little bit of where the code's at, between the in-progress PR at [1] and the not-passing-yet testcases at [2].

[1] https://github.com/mozilla-services/server-syncstorage/pull/32/
[2] https://github.com/rtilder/server-syncstorage/pull/1
Flags: needinfo?(rtilder)
Depends on: 1271466
I'll be glad to help Bob test this code once it's ready.
QA Contact: kthiessen
Assignee: nobody → bobm
Status: NEW → ASSIGNED
Please provide a tag in the https://github.com/mozilla-services/server-syncstorage repository for operations to deploy from.  If that isn't an option presently, a branch or any old gitref will do.
Whiteboard: [sync-data-integrity]
Sync stage deployed with the https://github.com/mozilla-services/server-syncstorage/tree/batch-uploads branch.  Not sure if this needs any .ini settings, or if we have to apply a new schema.  If so, this isn't ready.

:kthiessen can you verify that the batch API is working on these nodes, and if so run some load tests?
Flags: needinfo?(kthiessen)
Whiteboard: [sync-data-integrity] → [data-integrity]
Do we have a straightforward test scenario specifically for the batch API?
Flags: needinfo?(kthiessen)
Also, :rfkelly, have the changes in your 'rfkelly-batch-upload-tweaks' branch been incorporated into the 'batch-uploads' branch?
Flags: needinfo?(rfkelly)
Yes, but I have an additional PR that appears to be outstanding:

  https://github.com/mozilla-services/server-syncstorage/pull/37

switching ni? to :rtilder to see if we want to get that one in as well.
Flags: needinfo?(rtilder)
Flags: needinfo?(rfkelly)
Flags: needinfo?(rtilder)
Merged PR 37 into the batch-uploads branch.
Flags: needinfo?(rtilder)
(In reply to Ryan Tilder [:rtilder] from comment #7)
> Merged PR 37 into the batch-uploads branch.

Commit 25556c07aaa5eefe307bdad1ac5b785ce99b9c56 deployed to Sync stage.  :kthiessen can we try the same load test that you ran last?
Flags: needinfo?(kthiessen)
Absolutely.  Let me fire it up.
Flags: needinfo?(kthiessen)
Whiteboard: [data-integrity] → [sync-atomic]
One of the purge statements is getting stuck on these systems and consuming ~90% of the available read operations to the database storage volume.  Interestingly, it's the one for the batch_upload_items0 table in every case, and not any of the other shards.

SQL statement: /* [queryName=PURGE_BATCH_CONTENTS] */ DELETE FROM batch_upload_items0 WHERE batch < (UNIX_TIMESTAMP() - 10800) * 1000 ORDER BY ttl LIMIT 1000

The MySQL Slow Query Log dashboard shows that all of these long-running queries are against batch_upload_items0:
Field            Value
EnvVersion
Hostname         ip-172-31-26-71
Lock_time        0.000084
Logger           Sync-1_5-SlowQuery
Payload          DELETE FROM batch_upload_items0 WHERE batch < (UNIX_TIMESTAMP() ...
Pid              0
QC_hit           No
Query_name       PURGE_BATCH_CONTENTS
Query_time       206.422672
Rows_examined    5436051
Rows_sent        0
Schema           weave0
Severity         7
Thread_id        23650
Timestamp        2016-08-03T22:59:04
Type             mysql.slow-query
Flags: needinfo?(rtilder)
It could be that the purge script is not correctly handling the table sharding, and directing its attention to *only* batch_upload_items0.

It also seems odd that this should be consuming read ops, maybe because of the `ORDER BY ttl`?

> Rows_examined		5436051

Ouch.
> SQL statement: /* [queryName=PURGE_BATCH_CONTENTS] */ DELETE FROM batch_upload_items0
> WHERE batch < (UNIX_TIMESTAMP() - 10800) * 1000 ORDER BY ttl LIMIT 1000

Aha, batch_upload_items doesn't have an index on `ttl`, so this query is a full table scan.

I don't think the `ORDER BY ttl` is necessary here since the batch id itself serves as a timestamp, and we should be able to just `ORDER BY batch` instead.
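
For concreteness, a sketch of what the adjusted statement could look like (assuming `batch` is indexed on these tables, otherwise the same scan problem remains; this is an illustration, not a committed fix):

  DELETE FROM batch_upload_items0 WHERE batch < (UNIX_TIMESTAMP() - 10800) * 1000 ORDER BY batch LIMIT 1000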
Here are my thoughts on why the load tests might not be pushing these new staging servers to 503 responses, as discussed in the Monday stand-up.

Typically 503s are served during the load test for two reasons:
1. Database locking issues.
2. Exhausted connection pools.

Locks are usually encountered because of long-running purge_ttl queries.  It appears that the current purge_ttl is only running against the batch_upload_items0 table; when that is fixed, locking issues may reappear.

If I recall correctly, the Sync SQLAlchemy connection pools are created per table.  If that's true, we are now distributing some requests across two pools per "shard": some percentage to the InnoDB batch_uploads table, and another percentage over to the TokuDB BSO table.  I suspect that if we configure the batch_uploads percentage to zero we will once again exhaust the connection pools.
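
To make the pool hypothesis concrete, here's a rough back-of-the-envelope sketch (the numbers and the per-table pool layout are assumptions for illustration, not the actual stage settings): splitting writes across two independently pooled paths means each pool only sees a fraction of the concurrency, so neither saturates as quickly as a single pool taking all of the traffic.

    # Illustrative sketch only -- pool size and concurrency are made-up numbers.
    POOL_SIZE = 25          # connections available per pool (assumed)
    CONCURRENT_WRITES = 40  # simultaneous write requests during the load test

    def pools_exhausted(batch_fraction):
        # Return True if either pool would run out of connections.
        batch_writes = CONCURRENT_WRITES * batch_fraction      # -> batch_uploads pool
        bso_writes = CONCURRENT_WRITES * (1 - batch_fraction)  # -> bso pool
        return batch_writes > POOL_SIZE or bso_writes > POOL_SIZE

    pools_exhausted(0.5)  # False: load is split, neither pool saturates
    pools_exhausted(0.0)  # True: everything hits the bso pool, as before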
Turns out the batch responses were slightly out of spec.  I've pushed a small change to base64 encode the batch identifier when the client starts a new batch.

https://github.com/mozilla-services/server-syncstorage/commit/3694262132ec47f60e0ce9c9e4645f23969ece13
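
The gist of the change (a sketch of the idea only, not the code from that commit): the server hands the client a base64-encoded batch identifier rather than the raw integer, presumably so clients can treat it as an opaque token.

    import base64

    def encode_batch_id(batchid):
        # Sketch: return the integer batch id as a base64 string.
        return base64.b64encode(str(batchid).encode("ascii")).decode("ascii")

    encode_batch_id(1472071477120)  # -> 'MTQ3MjA3MTQ3NzEyMA=='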
(In reply to Ryan Tilder [:rtilder] from comment #14)
This has been deployed to stage.

:kthiessen please test.
Flags: needinfo?(kthiessen)
Bob, I've dropped two ninety-minute slots in your calendar for testing this, with one of them timed so we can get input from :markh (or :rfkelly) as necessary.
Flags: needinfo?(kthiessen)
Bob's concern here is that we're not testing the sharding completely, only 2 of the shards are getting traffic.  We need to consult with :rtilder and possibly :rfkelly to see what changes are needed to the load tests.
> Bob's concern here is that we're not testing the sharding completely,
> only 2 of the shards are getting traffic

Here's a theory as to why that might be happening.

The batch_upload_items tables are sharded by batchid, by doing:

   return get_batch_item_table(batchid % self.shardsize)

The batchids are bigint timestamps, based on the session timestamp as an integer number of milliseconds:

    def ts2bigint(timestamp):
        return int(timestamp * 1000)

    batchid = ts2bigint(session.timestamp)

But the timestamps in sync are actually in seconds, and for hysterical raisins they are quantized to two decimal places:

    def get_timestamp(value=None):
        if value is None:
            value = time.time()
        value = decimal.Decimal(str(value))
        return value.quantize(TWO_DECIMAL_PLACES)


Which I think means that the final digit of the batchid will always be zero, and we only access every tenth shard.

We could do a gross hack and shard based on (batchid / 10), but it may be more reliable to shard based on userid in the same way as the main bso table.
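
A quick worked example of the effect (a shard count of 20 is an assumption here, consistent with only two shards seeing traffic; the two helpers are copied from the snippets above):

    import decimal
    import time

    TWO_DECIMAL_PLACES = decimal.Decimal("1.00")

    def get_timestamp(value=None):
        if value is None:
            value = time.time()
        value = decimal.Decimal(str(value))
        return value.quantize(TWO_DECIMAL_PLACES)

    def ts2bigint(timestamp):
        return int(timestamp * 1000)

    # Two-decimal-place timestamps make every batchid a multiple of 10,
    # so batchid % 20 can only ever be 0 or 10 -- two shards out of twenty.
    SHARDSIZE = 20  # assumed shard count
    shards = {ts2bigint(get_timestamp(time.time() + i * 0.37)) % SHARDSIZE
              for i in range(1000)}
    print(sorted(shards))  # [0, 10]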
We talked about this at last week's meetings, but did not agree on who would take the action item for fixing the sharding.  :rtilder, :rfkelly, do either of you have time to do this?  I could try to hack it together, but you might not like the results.  ;)

Currently batch-upload load testing is blocking on this issue.
:rtilder has a patch in progress for the sharding fixes
Flags: needinfo?(rtilder)
(In reply to Ryan Kelly [:rfkelly] from comment #18)
> We could do a gross hack and shard based on (batchid / 10), but it may be
> more reliable to shard based on userid in the same way as the main bso table

Sadly, using the userid was problematic, which is why I went with sharding on batchid.

The userid isn't part of the batch_upload_items table, and the sharding occurs automagically in the DB connector.  The DBConnection class's query() method calls DBConnector's get_query(), which is where the magic occurs.  Sadly, the query() call fails with a user ID since it isn't part of the batch_upload_items schema.
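
To spell out the constraint (a simplified sketch with made-up names; the real logic lives in DBConnector.get_query()): the shard suffix has to be derived from a value that's actually present in the query parameters, and for batch_upload_items queries that value is the batch id, not the userid.

    # Simplified illustration, not the actual connector code.
    def get_batch_item_table_name(params, shardsize=20):
        batchid = params["batch"]  # present on every batch_upload_items query
        # There is no params["userid"] here, so sharding by userid would mean
        # plumbing the userid through every batch_upload_items query first.
        return "batch_upload_items%d" % (batchid % shardsize)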
I posted a proposed fix for the sharding issue here:

  https://github.com/mozilla-services/server-syncstorage/pull/43

Pending review, but :bobm :kthiessen if you want to try it out in stage, the commit is 742ca6b5f2807fb252b2a99e7c9f8162ef7334e0
The sharding fix is now merged into the batch-uploads branch:

  https://github.com/mozilla-services/server-syncstorage/tree/batch-uploads

ni? :bobm for deployment from commit 393a747f23869e0a5ba3914c0dcf04b94f91f590
Flags: needinfo?(bobm)
(In reply to Ryan Kelly [:rfkelly] from comment #23)
This commit has been pushed to stage.  :kthiessen please do the usual battery of testing.
Flags: needinfo?(bobm) → needinfo?(kthiessen)
(In reply to Ryan Kelly [:rfkelly] from comment #23)
Also, I needed to add pep8 to the requirements.txt to build successfully.
Got it.  Testing on this will start either tonight or tomorrow morning, PDT.
Flags: needinfo?(kthiessen)
90-minute load test started.  Let's make sure this one lights up all the shard tables.
(In reply to Karl Thiessen [:kthiessen] from comment #27)
Judging by the slow query dashboard and the IOPS metrics, the results of the load test look good to me.  The sharding across the batch_upload_items{0..19} tables appears to be working, and there were no prolonged read IOPS spikes.

Slow queries across one stage server (note these averaged around 2s; most requests don't end up creating an entry):
    950 batch_upload_items0
    981 batch_upload_items1
    911 batch_upload_items2
    942 batch_upload_items3
    797 batch_upload_items4
    925 batch_upload_items5
    756 batch_upload_items6
    940 batch_upload_items7
    903 batch_upload_items8
    902 batch_upload_items9
    829 batch_upload_items10
    913 batch_upload_items11
    870 batch_upload_items12
    916 batch_upload_items13
    802 batch_upload_items14
    920 batch_upload_items15
    858 batch_upload_items16
    949 batch_upload_items17
    876 batch_upload_items18
    923 batch_upload_items19
:rfkelly at this point I don't think we want to block on any additional activity, but I thought I'd point out we talked about bundling changes from Bug 1306129 into this release.  Other than that, I think the only thing we are looking for is the feature flag for batch upload API support.
Flags: needinfo?(rfkelly)
No longer depends on: 1271466
The latest commit includes a slightly-adjusted DB query that gives proper semantics for partial updates, and the new feature-flag from Bug 1304627.

Commit hash:  a0c1d99934c95e746c584b161e536b7157eb10e9

:bobm, also, I don't seem to have access to Bug 1306129
Flags: needinfo?(rfkelly) → needinfo?(bobm)
I have a handful of smaller batch-related PRs pending:

  https://github.com/mozilla-services/server-syncstorage/pulls

But I think this is ready for another loadtest without them.  Given we have the feature flag, we can then merge the batch-uploads branch to master and continue work on it from there, rather than maintaining it in a long-lived branch.
I will load test this commit as soon as I hear from Bob that it has been deployed.
Actually even more of the PRs have landed since then, probably worth just deploying the head of the "batch-uploads" branch, whichever commit happens to be there at the time.
(In reply to Ryan Kelly [:rfkelly] from comment #30)
> The latest commit includes a slightly-adjusted DB query that gives proper
> semantics for partial updates, and the new feature-flag from Bug 1304627.
> 
> Commit hash:  a0c1d99934c95e746c584b161e536b7157eb10e9
I'll build this and push to Stage, as long as :kthiessen agrees.

> :bobm, also, I don't seem to have access to Bug 1306129
That was a cut-and-paste bait-and-switch; I meant Bug 1300451.
Flags: needinfo?(bobm) → needinfo?(kthiessen)
(In reply to Ryan Kelly [:rfkelly] from comment #33)
> Actually even more of the PRs have landed since then, probably worth just
> deploying the head of the "batch-uploads" branch, whichever commit happens
> to be there at the time.

I mean this.
OK.  Bob, let me know which commit hash you wind up deploying, and we'll be off to the races.
Flags: needinfo?(kthiessen)
(In reply to Karl Thiessen [:kthiessen] from comment #36)
> OK.  Bob, let me know which commit hash you wind up deploying, and we'll be
> off to the races.

:kthiessen Git hash: 85871a7896593df71e5faa81b3a336b10d8b00f5 deployed to stage.
Flags: needinfo?(kthiessen)
Six-hour heavy one-node load test launched against current stage.  Results should come in around 20:00 PDT today.
Flags: needinfo?(kthiessen)
Re: Bug 1300451, the logging code from that bug is included in the upcoming release, since the batch-uploads work was merged back to master.
Bob, are we still lighting up all of the shard tables with the last couple of load tests?
Flags: needinfo?(bobm)
(In reply to Karl Thiessen [:kthiessen] from comment #40)
> Bob, are we still lighting up all of the shard tables with the last couple
> of load tests?

Yes.
Flags: needinfo?(bobm)
Do we need a separate bug for the deployment of Sync 1.6.0 to Stage, or are we wrapping it up in this bug?
This is done -- I am going to take two steps to mark this RESOLVED ...
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
... and VERIFIED.  (Apologies for the bug-spam.)
Status: RESOLVED → VERIFIED