Closed Bug 1336001 Opened 4 years ago Closed 4 years ago

Batching uploader behaves erratically when dealing with multiple batches

Categories

(Firefox for Android Graveyard :: Android Sync, defect, P1)

defect

Tracking

(firefox52 wontfix, firefox53 affected, firefox54 fixed)

RESOLVED FIXED
Firefox 54
Tracking Status
firefox52 --- wontfix
firefox53 --- affected
firefox54 --- fixed

People

(Reporter: Grisha, Assigned: Grisha)

References

Details

(Whiteboard: [MobileAS])

Attachments

(3 files)

Attached file bad-uploader
When there are enough records so that multiple batches need to be uploaded and committed, the uploader might behave very poorly when committing.

When uploader decides to commit a batch, it schedules execution of a runnable (which will construct a payload and POST it), and immediately it clears batch's meta information which runnable uses to construct a payload. So depending on execution timing, request might be constructed out of entirely or partially incorrect batch meta information. Particularly, token and/or XIUS header might be wrong.

- Token might be missing (resulting in "batch=true") or present.
- XIUS header might be collection's older L-M instead of L-M value returned after a previous batch commit.

Attached are some client request logs that illustrate one of the possible scenarios. Briefly, they might look something like this:

POST batch=true, xius=1
POST batch=1, xius=1
POST batch=1, xius=1
POST batch=true&commit=true, xius=1. Got L-M=2. NB: this is the partial data now live on the server.
POST batch=true, xius=2
POST batch=3, xius=2
POST batch=true&commit=true, xius=1 ->> 412!

--> incomplete state is now on the server.

A client fix for this needs to ensure that at the time of request construction, request _will_ have all of the necessary (and correct) information available (timestamp for xius, batch token, isCommit).

Batching uploader landed in 51, and there are currently a few nodes that support batching uploads. We might want to consider bumping maxTotalRecords and maxTotalBytes limits on the servers involved to make this problem (much) less likely to occur until a fix made its way through the uplifts.

Somewhat offtopic for this client bug (need file separate ones for the affected servers):

Currently, on the gosync server, I'm seeing the following limits for batches and their payloads:
maxPostRecords=100
maxPostBytes=2097152
maxTotalRecords = 1000
maxTotalBytes = 20971520

Such low total record limits means that we'll have a lot of separate batches to commit for large collections, which partially defeats the purpose of a batching uploader. Even with a correctly functioning client, this is an unsafely low per-batch record limit! We essentially have the same problem we had before batching uploads, just somewhat less likely to occur. If client's connection drops after they've committed one or two batches out of, say, five, we get an incomplete state on the server.
Good spot, Grisha.

I remember discussions with rfk and others: our per-batch limit should be on the order of 100,000 records, not 1000. The idea was that the split-batch mode should be a last resort (some users will still try to upload more than 100,000 records in a single sync, and we shouldn't stop them) not common!

Similarly, maxPostRecords should also be raised to 500 or more. The higher that number, the fewer requests (and thus the quicker uploads will complete), and the less likely that an ongoing batch will be created at all. That'll make bobm happy.

(At 2MB per request, 500 records get 4KB each, which is plenty. We could likely go even higher, if our goal is to get the top of the bell curve of POST body sizes to butt up against the limit, while keeping an eye on the rate of connection timeouts.)

Benson, could you file appropriate bugs to change the configuration for the Go server?
Flags: needinfo?(bwong)
I increased it to match the batch enabled python server's configuration:

- maxTotalRecords = 10000
- maxTotalBytes   = 100MB
- maxPostBytes    = 2097152
- maxPostRecords  = 100

:bobm said he left them as the default values[1]. 

[1] https://github.com/mozilla-services/server-syncstorage/blob/master/syncstorage/views/util.py#L91-L96
Flags: needinfo?(bwong)
We want to change that 100 in Bug 1250747.
:bob

Can you confirm what the MaxPostRecords is on the prod python node with batch enabled?
I'd like gosync and pysync to match in configuration limits so we can get accurate performance comparisons.
Flags: needinfo?(bobm)
Iteration: --- → 1.14
Whiteboard: [MobileAS]
Assignee: nobody → gkruglov
Status: NEW → ASSIGNED
(In reply to Benson Wong [:mostlygeek] from comment #4)
> Can you confirm what the MaxPostRecords is on the prod python node with
> batch enabled?

They are still set to the default.
Flags: needinfo?(bobm)
Iteration: 1.14 → 1.15
Comment on attachment 8835673 [details]
Bug 1336001 - Refactor BatchingUploader's state-holder objects to fix threading problems

https://reviewboard.mozilla.org/r/111340/#review113060

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/repositories/uploaders/BatchingUploader.java:318
(Diff revision 3)
> +        // below opens the gate. This avoids a race between the worker and the reset call.
> +        // This sort of a simple gated access to batchMeta allows for a theoretical possibility
> +        // of a race between network request/PayloadUploadDelegate and the reset call below. However,
> +        // we're going to assume that network request delegate running on the worker thread will be
> +        // executing significantly later than the reset call running in the current thread.
> +        // In other words, we assume that there's a natural, stable ordering of these events.

That there's a wall-clock ordering of events doesn't necessarily mean that the Java memory model will behave consistently. After all, many race conditions are only visible under load…

Also, consider what happens if a network request fails instantly (e.g., the network connection is temporarily offline). Is it possible that the upload delegate is called immediately?

Unless you spend a year rewriting this code in Coq, I'm not sure you can prove that it's safe, so I'd err on the side of making this code thread-safe.

Put another way: fix all the threading issues you're aware of, because it's bad enough dealing with the issues we aren't aware of!
Attachment #8835673 - Flags: review?(rnewman)
Comment on attachment 8835673 [details]
Bug 1336001 - Refactor BatchingUploader's state-holder objects to fix threading problems

https://reviewboard.mozilla.org/r/111340/#review113060

> That there's a wall-clock ordering of events doesn't necessarily mean that the Java memory model will behave consistently. After all, many race conditions are only visible under load…
> 
> Also, consider what happens if a network request fails instantly (e.g., the network connection is temporarily offline). Is it possible that the upload delegate is called immediately?
> 
> Unless you spend a year rewriting this code in Coq, I'm not sure you can prove that it's safe, so I'd err on the side of making this code thread-safe.
> 
> Put another way: fix all the threading issues you're aware of, because it's bad enough dealing with the issues we aren't aware of!

Added another gate to ensure strict ordering of updates, reads and resets of batchMeta by two threads. Which gets us to two different latches and two threads having to wait on each other in cases of commits, just skirting around a deadlock, and leaves me to wonder if this could all be arranged in some other fashion (how to nicely work around interdependency on a shared object in a producer/consumer pattern?).

But as far as I can tell, this approach should work.
(In reply to :Grisha Kruglov from comment #12)

> Added another gate to ensure strict ordering of updates, reads and resets of
> batchMeta by two threads. Which gets us to two different latches and two
> threads having to wait on each other in cases of commits, just skirting
> around a deadlock, and leaves me to wonder if this could all be arranged in
> some other fashion (how to nicely work around interdependency on a shared
> object in a producer/consumer pattern?).

You mean like RecordsChannel?

It's always going to be a pain (in Java, at least).

There are ways to make this kind of thing simpler if you change the model a bit:

- You can have very coarse mutual exclusion. This trades theoretical throughput for simplicity.
- If you have one primary reader/writer and another actor who occasionally takes some action which must be processed at some sensible point, you can flip a 'signal' flag for the primary owner to do the work. This is sometimes enough to make most of the system single-threaded.
- You can use 'send' primitives instead of 'sync' primitives (to borrow Rust's terminology). That is: pass ownership of things, rather than coordinating concurrent access.
Thanks, Richard - I'd like to explore these kinds of improvements separately from this bug, as a follow-up to Bug 1291821.
Comment on attachment 8835673 [details]
Bug 1336001 - Refactor BatchingUploader's state-holder objects to fix threading problems

https://reviewboard.mozilla.org/r/111340/#review114280

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/repositories/uploaders/BatchingUploader.java:381
(Diff revisions 2 - 4)
> -            batchMetaGate.await();
> +            batchResetGate.await();
>          } catch (InterruptedException e) {
>              Thread.currentThread().interrupt();
>          }
>  
>          batchMeta.reset();

We only ever want to reset the batch if one of two things is true:

- The batch (not the payload) is full.
- There are no more records coming.

If you reset here, you will be racing with retrieving the successful GUIDs in the payload callback:

- Queue a runnable.
- When it finally runs, it grabs the token and LM, and counts down the gate.
- Here we reset the batch.
- The upload completes, and the success GUIDs have been thrown away apart from the last payload.

Is my analysis correct?

I believe you need to queue a runnable that resets the batchmeta between batches in the work queue.
Attachment #8835673 - Flags: review?(rnewman)
Comment on attachment 8835673 [details]
Bug 1336001 - Refactor BatchingUploader's state-holder objects to fix threading problems

https://reviewboard.mozilla.org/r/111340/#review114280

> We only ever want to reset the batch if one of two things is true:
> 
> - The batch (not the payload) is full.
> - There are no more records coming.
> 
> If you reset here, you will be racing with retrieving the successful GUIDs in the payload callback:
> 
> - Queue a runnable.
> - When it finally runs, it grabs the token and LM, and counts down the gate.
> - Here we reset the batch.
> - The upload completes, and the success GUIDs have been thrown away apart from the last payload.
> 
> Is my analysis correct?
> 
> I believe you need to queue a runnable that resets the batchmeta between batches in the work queue.

Inspired by our IRC discussion, I've attempting to address underlying data structure issues which are causing this mess. I've split up BatchMeta into UploaderMeta (whatever's necessary for processing records) & BatchMeta (a shared whiteboard for the network runnables), and created a dispatcher which takes care of scheduling runnables and processing their side-effects.

The end result is that very little is now shared between record consumer thread pool and the network threads, there's a much clearer separation of concerns, and it's explicitely stated who owns/accesses what data.
Comment on attachment 8835673 [details]
Bug 1336001 - Refactor BatchingUploader's state-holder objects to fix threading problems

https://reviewboard.mozilla.org/r/111340/#review115658

Need to test this:

- against an empty server with local records to upload
- uploading 0 records to an empty server
- uploading some records to a non-empty server

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/repositories/uploaders/BatchMeta.java:29
(Diff revision 5)
>      private static final String LOG_TAG = "BatchMeta";
>  
> -    // Will be set once first payload upload succeeds. We don't expect this to change until we
> -    // commit the batch, and which point it must change.
> -    /* @GuardedBy("this") */ private Long lastModified;
> +    private volatile Boolean inBatchingMode;
> +    private volatile Long lastModified;
> +    private volatile String token;
> +    private final List<String> successRecordGuids = Collections.synchronizedList(new ArrayList<String>());

If you don't expect concurrent access to the list, the only role of synchronization here is to provide the equivalent of `volatile` — that is, writes done in thread A are visible to thread B after they are complete.

Consider instead using a `ConcurrentLinkedQueue`, with `getSuccessRecordGuids` simply using `queue.toArray()`.

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/repositories/uploaders/BatchingUploader.java:73
(Diff revision 5)
> -    private final Payload payload;
> +            throw new IllegalStateException("Separator and start tokens must be of the same length");
> +        }
> +    }
>  
> -    // Accessed by synchronously running threads, OK to not synchronize and just make it volatile.
> -    private volatile Boolean inBatchingMode;
> +    // Accessed by the record consumer thread pool.
> +    // Will be re-create, so mark it as volatile.

re-created

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/repositories/uploaders/BatchingUploader.java:82
(Diff revision 5)
> +    /* package-local */ final Uri collectionUri;
> +    /* package-local */ final long collectionLastModified;
> +    /* package-local */ final RepositorySessionStoreDelegate sessionStoreDelegate;
> +    /* package-local */ @VisibleForTesting final PayloadDispatcher payloadDispatcher;
> +    private final Server11RepositorySession repositorySession;
> +    // Will be re-create, so mark it as volatile.

re-created

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/repositories/uploaders/BatchingUploader.java:98
(Diff revision 5)
>          this.sessionStoreDelegate = sessionStoreDelegate;
>          this.collectionUri = Uri.parse(repositorySession.getServerRepository().collectionURI().toString());
>  
> +        Long lastModified = repositorySession.getServerRepository().getCollectionLastModified();
> +        if (lastModified == null) {
> +            throw new IllegalStateException("Collection's last-modified timestamp must be available.");

Easy tiger! This will be `null` on a first sync, and that's totally fine.

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/repositories/uploaders/BatchingUploader.java:125
(Diff revision 5)
>              sessionStoreDelegate.onRecordStoreFailed(new RecordTooLargeToUpload(), guid);
>              return;
>          }
>  
>          synchronized (payloadLock) {
> -            final boolean canFitRecordIntoBatch = batchMeta.canFit(recordDeltaByteCount);
> +            final boolean canFitRecordIntoBatch = uploaderMeta.canFit(recordDeltaByteCount);

It might help clarify if you had `UploadMeta` for the entire sync, `BatchMeta` for the entire batch, and `PayloadMeta` (if necessary) for the individual `POST` payload. I think `Payload` serves as the last one…

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/repositories/uploaders/BatchingUploader.java:218
(Diff revision 5)
>          synchronized (payloadLock) {
>              outgoing = payload.getRecordsBuffer();
>              outgoingGuids = payload.getRecordGuidsBuffer();
>              byteCount = payload.getByteCount();
>          }
> +        payload = new Payload(payload);

This might be clearer if you define `payload.nextPayload()`.

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/repositories/uploaders/BatchingUploader.java:223
(Diff revision 5)
> -
> -        payload.reset();
> -    }
>  
> -    private class BatchingAtomicUploaderMayUploadProvider implements MayUploadProvider {
> -        public boolean mayUpload() {
> +        if (isCommit && !isLastPayload) {
> +            uploaderMeta = new UploaderMeta(uploaderMeta);

Similarly, `uploaderMeta.nextUpload()`.

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/repositories/uploaders/PayloadDispatcher.java:22
(Diff revision 5)
> +
> +/**
> + * Dispatches payload runnables and handles their results.
> + *
> + * All of the methods, except for `queue` and `finalizeQueue`, will be called from the thread(s)
> + * running sequentially on the `executor`.

Hold up. Your synchronization needs vary depending on whether there's more than one thread. Why not fix this to one?

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/repositories/uploaders/PayloadDispatcher.java:50
(Diff revision 5)
> +            final ArrayList<byte[]> outgoing,
> +            final ArrayList<String> outgoingGuids,
> +            final long byteCount,
> +            final boolean isCommit, final boolean isLastPayload) {
> +
> +        // Note that workQueue is expected to be a SingleThreadExecutor.

Oh hey! It is. Fix this comment above, rename this comment to `executor` instead of `workQueue`?

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/repositories/uploaders/PayloadDispatcher.java:56
(Diff revision 5)
> +        executor.execute(new BatchContextRunnable(isCommit) {
> +            @Override
> +            public void run() {
> +                final long lastModified;
> +                if (batchWhiteboard.getLastModified() == null) {
> +                    lastModified = uploader.collectionLastModified;

This can still be `null`, so you can't use a `long` on 54.
Attachment #8835673 - Flags: review?(rnewman)
Comment on attachment 8835673 [details]
Bug 1336001 - Refactor BatchingUploader's state-holder objects to fix threading problems

https://reviewboard.mozilla.org/r/111340/#review115658

> It might help clarify if you had `UploadMeta` for the entire sync, `BatchMeta` for the entire batch, and `PayloadMeta` (if necessary) for the individual `POST` payload. I think `Payload` serves as the last one…

Currently it's
- UploadMeta for the duration of a batch.
- BatchMeta "whiteboard" for the duration of a whole upload phase of a collection sync, but gets cleared after batch is committed. L-M carries over, but nothing else.
- PayloadMeta for each POST uploader makes.

BatchMeta could technically be thrown away in between batches like UploadMeta and re-created with a new L-M value, but I feel that re-using the same object and explicitely "preparing it" for the next batch makes it a little more obvious what's going on. "erase everything on the whiteboard except for L-M value", as opposed to "remember L-M value, throw away a whiteboard, get a new one, and write down L-M value".

I'll add clarifying comments.

> This might be clearer if you define `payload.nextPayload()`.

Sure, that works as well.

> Hold up. Your synchronization needs vary depending on whether there's more than one thread. Why not fix this to one?

I'll clarify the comment. It's a "single thread" executor, but iiuc it's guaranteed that we'll have the _same_ single thread at all times.
Comment on attachment 8835673 [details]
Bug 1336001 - Refactor BatchingUploader's state-holder objects to fix threading problems

https://reviewboard.mozilla.org/r/111340/#review115658

> Easy tiger! This will be `null` on a first sync, and that's totally fine.

Ah, of course! Added tests to catch this & modified things a bit. BatchMeta "whiteboard" now always has a timestamp (potentially null), and starts off as a collection timestamp.

> I'll clarify the comment. It's a "single thread" executor, but iiuc it's guaranteed that we'll have the _same_ single thread at all times.

Err, "not guaranteed".
(In reply to Richard Newman [:rnewman] from comment #18)
> Need to test this:
> 
> - against an empty server with local records to upload
> - uploading 0 records to an empty server
> - uploading some records to a non-empty server

After fixing the wrong assumption around initial sync (timestamp=null is valid), above scenarios appear to work well. I've tried it against batching and non-batching, empty/non-empty servers, with and without local records.
Comment on attachment 8835673 [details]
Bug 1336001 - Refactor BatchingUploader's state-holder objects to fix threading problems

https://reviewboard.mozilla.org/r/111340/#review116044

If this works for single batches and multi-POST uploads, then I'm happy.

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/repositories/uploaders/BatchingUploader.java:51
(Diff revisions 5 - 6)
>   *
>   * Last-Modified header returned with the first batch payload POST success is maintained for a batch,
>   * to guard against concurrent-modification errors (different uploader commits before we're done).
>   *
> + * Implementation notes:
> + * - UploaderMeta is used to track batch-level information necessary for processing incoming records

Careful about using "incoming" here. These are really *outgoing* records being delivered by a record channel.
Attachment #8835673 - Flags: review?(rnewman) → review+
Pushed by gkruglov@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/85f2279e848f
Refactor BatchingUploader's state-holder objects to fix threading problems r=rnewman
https://hg.mozilla.org/mozilla-central/rev/85f2279e848f
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 54
Iteration: 1.15 → 1.16
NI myself to uplift this.
Flags: needinfo?(gkruglov)
[Feature/Bug causing the regression]: Bug 1253111

[User impact if declined]: Bug causes potential data corruption while syncing against servers which support batching uploads. Specifically, bug is triggered when amount of records being uploaded is greater than # of records allowed in a single batch. Both go and python servers configure this value at 10000; so to be affected, an android client needs to be trying to upload more than 10000 records for a single collection. Bug makes it possible for a partial subset of the records to be committed on a server. Most negatively impacted collections (bookmarks & form history) won't sync if there are more than 5000 records, so they're not affected. Other collections are affected, but impact is much less severe since they're unstructured (e.g. partial data commit will simply mean that user lost some of their history, as opposed to severely corrupted their bookmarks tree). Batching upload support is currently being rolled out to more and more servers, with a plan to have 100% support soon.

[Is this code covered by automated tests?]: Yes; there is a variety of unit tests which cover this functionality, and some new tests were added.

[Has the fix been verified in Nightly?]: Yes.

[Needs manual test from QE? If yes, steps to reproduce]: This has been tested against described problem cases. However, if desired, one simple way is to add HTTP request logging in [0] (or in some other way monitor requests being made by sync), and ensure that as sync client is uploading records, it does so in accordance to protocol described in [1]. Test profile must have enough records to upload (more than 10k) to observe buggy behaviour if being tested against production servers. Alternatively, max_total_records might be hard-coded to a very small value (e.g. 10) in BatchingUploader's constructor, to make the problem obvious with any profile.

[List of other uplifts needed for the feature/fix]: N/A

[Is the change risky?]: Somewhat.

[Why is the change risky/not risky?]: It's a somewhat significant refactoring of a batching uploader. Core functionality is not affected for the most part; refactoring is mostly concerned with fixing some underlying data structures and changing flow of events in order to alleviate threading problems. Problem and its fix are also well understood, and fixed builds (current nightly) behave well.

[String changes made/needed]: N/A

[0] https://dxr.mozilla.org/mozilla-central/source/mobile/android/services/src/main/java/org/mozilla/gecko/sync/net/BaseResource.java?q=BaseResource.java&redirect_type=direct#309
[1] https://docs.services.mozilla.com/storage/apis-1.5.html#example-uploading-a-large-batch-of-items
Flags: needinfo?(gkruglov)
Attachment #8841861 - Flags: approval-mozilla-beta?
Attachment #8841861 - Flags: approval-mozilla-aurora?
Comment on attachment 8841861 [details] [diff] [review]
batching-refactor-beta.patch

Seems too late for 52.
Attachment #8841861 - Flags: approval-mozilla-beta? → approval-mozilla-beta-
Comment on attachment 8841861 [details] [diff] [review]
batching-refactor-beta.patch

These patches are huge and it seems risky to me. We only have less than 1 week to be Beta53. Aurora53- and let this ride the train on 54.
Attachment #8841861 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora-
In fairness, this is mostly a large patch because it's Java. The conceptual changes are relatively small. So it goes.

If this isn't uplifted, Bug 1250747 needs to be addressed quickly in prod.
(In reply to Richard Newman [:rnewman] from comment #31)
> If this isn't uplifted, Bug 1250747 needs to be addressed quickly in prod.

Bug 1250747 will help mostly _if_ we increase batch limits further (currently they're at 10k). Increasing payload limits will help for other reasons, but that's not where the big impact will come from. As I've noted in Comment 28, at least this doesn't affect our structured collections (we won't even sync them beyond 5k records until 54). So we're mostly talking about screwing up history here.
Product: Android Background Services → Firefox for Android
Product: Firefox for Android → Firefox for Android Graveyard
You need to log in before you can comment on or make changes to this bug.