Batch uploads can get infected with zombie items from a previous batch, if we re-use batch ids

Status: RESOLVED FIXED
Component: Cloud Services :: Server: Sync
Opened: 3 months ago
Last modified: 2 months ago
Reporter: rfkelly
Assignee: Unassigned
Firefox Tracking Flags: (Not tracked)

(Reporter)

Description

3 months ago
From the I-really-wish-we-hadn't-used-timestamps-for-batch-ids department...

If a user happens to create a batch, quickly commit it, and then quickly create another batch, the python sync server can re-use the same batch id for both uploads.  This would be fine, except that we do lazy garbage-collection of items from committed batches, so the second batch can resurrect items from the first, like so (a minimal runnable sketch of this sequence follows the steps below):

* Start a batch upload to "history" collection
    * get assigned a timestamp batchid, say batch=123456

* Upload a single item "x.com"
    * this gets written in the "batch_upload_items" table

* Commit the batch
    * this deletes the main "batch_uploads" db record for batch=123456
    * but it leaves the "x.com" item orphaned in the "batch_upload_items" table,
      with the intention of deleting it via background job

* Start a batch upload to "clients" collection
    * get assigned the same timestamp batchid, batch=123456
    * because no such record exists in "batch_uploads" table, this succeeds

* Upload a new client item "eoger's iphone"
    * this gets written in the "batch_upload_items" table, alongside
      the no-longer-orphaned item for "x.com"

* Commit the batch
    * the commit step looks in "batch_upload_items", sees two items
      with that batch id, and adds them both to the "clients" collection!
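
To make the failure mode concrete, here is a minimal, self-contained sketch using Python's sqlite3 module.  It is not the real server schema or code (only the table names come from the description above; everything else is simplified), but it shows how lazy cleanup plus a re-used batchid resurrects the old item:

  import sqlite3

  db = sqlite3.connect(":memory:")
  db.executescript("""
      CREATE TABLE batch_uploads (batch INTEGER PRIMARY KEY, collection TEXT);
      CREATE TABLE batch_upload_items (batch INTEGER, id TEXT);
  """)

  def create_batch(batch, collection):
      # The server only checks that no "batch_uploads" row exists for this id,
      # so a re-used timestamp id succeeds once the old row has been deleted.
      db.execute("INSERT INTO batch_uploads VALUES (?, ?)", (batch, collection))

  def append_item(batch, item_id):
      db.execute("INSERT INTO batch_upload_items VALUES (?, ?)", (batch, item_id))

  def commit_batch(batch):
      # Lazy cleanup: drop the batch record but leave its items behind for a
      # background job to collect later.
      db.execute("DELETE FROM batch_uploads WHERE batch = ?", (batch,))
      return [row[0] for row in db.execute(
          "SELECT id FROM batch_upload_items WHERE batch = ?", (batch,))]

  create_batch(123456, "history")
  append_item(123456, "x.com")
  print(commit_batch(123456))            # ['x.com']

  # The timestamp id is handed out again for a different collection...
  create_batch(123456, "clients")
  append_item(123456, "eoger's iphone")
  print(commit_batch(123456))            # ['x.com', "eoger's iphone"]  <- zombie!
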
(Reporter)

Comment 1

3 months ago
From syncing up with Mark and Bob, the short-term plan is:

* Pref the batch-upload feature off in production again
* See if we can identify users who might have been impacted by the issue, for potential node-reassignment

We've tested the feature-disablement procedure in stage and it appears to work correctly.
(Reporter)

Comment 2

3 months ago
Failing test that highlights the issue: https://github.com/mozilla-services/server-syncstorage/pull/63
(Reporter)

Comment 3

3 months ago
The simplest fix for this is a little hacky, but I'm putting it up here for reference:

  https://github.com/mozilla-services/server-syncstorage/pull/63/files#diff-a5228165092b370879bcc6a8c678b9c0

When a batch gets committed, instead of deleting it from the "batch_uploads" table and allowing its batchid to be re-used, we can mark it as committed but leave it in the database.  This will prevent its batchid from being re-used and triggering this bug.

The hacky part is that, rather than adding a new "committed" column to the "batch_uploads" table, it just sets the existing "collection" column to zero.  That's an invalid collection id, so it prevents any further use of the batch.

I'm not terribly fond of this, though; I think adding an explicit "committed" column would be better.  But we don't have any db migration infrastructure in the syncstorage project, so I'll need to figure out how to add some in a way that works both for production (where Bob will run them by hand) and for self-hosters.
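
For reference, here is roughly what that variant looks like against the toy sqlite schema sketched under the description.  It is illustrative only; the actual change in the PR is against the production SQL backend:

  def commit_batch_tombstone(db, batch):
      # Keep the batch_uploads row around as a "tombstone" instead of deleting
      # it, with collection forced to 0 (an invalid collection id).  The row's
      # continued existence stops the same timestamp batchid from being handed
      # out again, and collection=0 blocks any further use of the batch.
      db.execute("UPDATE batch_uploads SET collection = 0 WHERE batch = ?", (batch,))
      return [row[0] for row in db.execute(
          "SELECT id FROM batch_upload_items WHERE batch = ?", (batch,))]

In the toy schema, a later create_batch(123456, "clients") now fails on the primary-key constraint instead of silently resurrecting the orphaned items.
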
(Reporter)

Comment 4

3 months ago
An alternative approach would be to get rid of this "lazy deletion" feature - when the user commits a batch, delete the record from batch_uploads and delete all the associated items from batch_upload_items.

The reason we currently do lazy deletion is that we've historically had performance problems when attempting to delete large numbers of rows from the db.  :bobm, is that still a concern with the new TokuDB stuff?
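
A sketch of that alternative, again using the toy schema above rather than the real backend code (the fix that eventually landed differs in detail):

  def commit_batch_eager(db, batch):
      items = [row[0] for row in db.execute(
          "SELECT id FROM batch_upload_items WHERE batch = ?", (batch,))]
      # ...apply `items` to the target collection here...
      # Then delete the batch record *and* its items in the same step, so
      # nothing is left behind for a re-used batchid to pick up.
      db.execute("DELETE FROM batch_upload_items WHERE batch = ?", (batch,))
      db.execute("DELETE FROM batch_uploads WHERE batch = ?", (batch,))
      return items
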
Flags: needinfo?(bobm)
Comment 5

3 months ago

(In reply to Ryan Kelly [:rfkelly] from comment #4)

> The reason we currently do lazy deletion, is that we've historically had
> performance problems when attempting to delete large numbers of rows from
> the db.  :bobm, is that still a concern with the new TokuDB stuff?

DELETEs are still the longest running queries.  However, we aren't presently running into locking issues.  I suspect these DELETEs would run quite a bit faster than the range queries used to expire items.  Also, the batch_upload_items tables should have a relatively static and much smaller size than the BSO tables.  If we interrupt a batch with a 503 due to locking, a client should be able to resume the batch if we don't send a back-off header that's longer than the cleanup period, correct?

If we re-enable the batch API, as we did last time with a staggered roll-out, we can watch for an increasing DELETE trend and determine if it will be a problem.
Flags: needinfo?(bobm)
Comment 6

3 months ago

Other upsides of DELETE after the batches are committed:

- fixes this bug
- should be fairly fast, since the affected records are easy/fast to find
- we can see how long COMMIT+DELETE takes in web log latency, which is presumably a good signal for whether we need to optimize
- frees up DB pages so they can be reused immediately, which should cap off DB growth a bit
- it's what gosync does :)
(Reporter)

Comment 7

3 months ago
OK, let's do the deletion thing then.  I'll feel safer with that as a fix than with doubling down on the lazy cleanup.
(Reporter)

Comment 8

3 months ago
Deletion-based fix landed in https://github.com/mozilla-services/server-syncstorage/commit/f3cadd8d335a8becf01099b0a7b93223c5b4b974

Next steps are to consider:

* What analysis and cleanups do we want to attempt before deploying this?
* How do we want to manage the new attempt at rollout?
Comment 9

(In reply to Ryan Kelly [:rfkelly] from comment #8)

I built and deployed git hash 7c0331a2209ac44cd3fb6f11fa9683f0647c286b to a stage node.
I ran into the following error while verifying the node:

s...s.s.........................s..................E
======================================================================
ERROR: test_that_we_dont_resurrect_committed_batches (__main__.LiveTestCases)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/server-syncstorage/build/lib/syncstorage/tests/functional/test_storage.py", line 1828, in test_that_we_dont_resurrect_committed_batches
    req = '/storage/col1?batch={}&commit=true'.format(batch1)
ValueError: zero length field name in format

----------------------------------------------------------------------
Ran 52 tests in 35.442s

FAILED (errors=1, skipped=4)
Flags: needinfo?(rfkelly)
See Also: → bug 1378567
(Reporter)

Comment 10

2 months ago
What version of python are you using to run the tests?  IIUC this error can happen on python2.6 when you try to use the "{}".format(foo) syntax, which is only supported in python2.7 or later, but I thought we were finally on python2.7 everywhere.
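
For reference, the failing line uses an auto-numbered "{}" field, which str.format only supports from Python 2.7 onwards.  One 2.6-compatible spelling (not necessarily the exact change made in the PR) is an explicit field index:

  batch1 = 1234567890  # placeholder value for illustration

  # Python 2.7+ only: auto-numbered "{}" fields raise ValueError on 2.6.
  req = '/storage/col1?batch={}&commit=true'.format(batch1)

  # Works on Python 2.6 as well: explicitly numbered fields.
  req = '/storage/col1?batch={0}&commit=true'.format(batch1)
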
Flags: needinfo?(rfkelly) → needinfo?(bobm)
Comment 11

2 months ago

We are, unfortunately, on Python2.6 on the Admin servers.
Flags: needinfo?(bobm)
(Reporter)

Comment 12

2 months ago
Ah, ok.  I've fixed the use of constructs not supported on py2.6; can you please try running from this branch and see if it works?

  https://github.com/mozilla-services/server-syncstorage/pull/67
Flags: needinfo?(bobm)
Comment 13

2 months ago

(In reply to Ryan Kelly [:rfkelly] from comment #12)
> Ah, ok.  I've fixed the use of constructs not supported on py2.6, can you
> please try running from this branch and see if it works?
> 
>   https://github.com/mozilla-services/server-syncstorage/pull/67

r+

Also, deployed to stage.
Flags: needinfo?(bobm)
(Reporter)

Comment 14

2 months ago
I'm going to RESOLVED/FIXED this bug; we can follow up on the deployment aspects over in Bug 1378567.
Status: NEW → RESOLVED
Last Resolved: 2 months ago
Resolution: --- → FIXED