Closed Bug 1291821 Opened 8 years ago Closed 7 years ago

Incremental server mirrorer on Android

Categories

(Firefox for Android Graveyard :: Android Sync, defect, P1)

All
Android
defect

Tracking

(relnote-firefox 54+, firefox54 fixed)

RESOLVED FIXED
Firefox 54

People

(Reporter: dlim, Assigned: Grisha, Mentored)

References

Details

(Whiteboard: [MobileAS])

Attachments

(20 files, 1 obsolete file)

58 bytes, text/x-review-board-request
Details
58 bytes, text/x-review-board-request
rnewman
: review+
Details
      No description provided.
Assignee: nobody → dlim
Blocks: 814801
Whiteboard: [MobileAS s1.1]
Depends on: 730142
I recommend starting with Bug 730142. The general approach would be to:

* Ideal: figure out how long we have left in a sync. Use info/collection_counts and network status to decide if we should try downloading at all.
* Create a batcher (Bug 730142) and point it at the bookmarks collection. Pass in a "green light" function that will encapsulate whether we're near the Android 5-minute limit for syncing and stop batching safely. The batcher wraps up the continuation token and internal watermarking to allow for safe and efficient batching.
* When the batcher reports that we're done and complete, decide if we have time to continue. Figure out how long it takes to apply a record on the average device.
* If so, proceed to the next phase (Bug 1291822 or just blindly applying records). Record application can take care of clearing the buffer.
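The "green light" function from the steps above could be sketched roughly as follows. This is a minimal illustration assuming a simple deadline-based policy; the `GreenLight` and `DeadlineGreenLight` names are hypothetical, and the Android code would use a monotonic clock (`SystemClock.elapsedRealtime()`) where plain `System.currentTimeMillis()` stands in here:

```java
// Illustrative sketch: a "green light" callback the batcher consults
// before starting each batch. Names are hypothetical, not from any patch.
interface GreenLight {
    boolean mayProceed();
}

class DeadlineGreenLight implements GreenLight {
    private final long deadlineMillis;      // absolute deadline for this sync
    private final long safetyMarginMillis;  // time we expect the next batch to need

    DeadlineGreenLight(long deadlineMillis, long safetyMarginMillis) {
        this.deadlineMillis = deadlineMillis;
        this.safetyMarginMillis = safetyMarginMillis;
    }

    @Override
    public boolean mayProceed() {
        // Proceed only if we can likely finish another batch before the deadline.
        return System.currentTimeMillis() + safetyMarginMillis < deadlineMillis;
    }
}
```

The batcher would call `mayProceed()` between batches and stop cleanly (preserving its continuation token and watermark) when it returns false.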
Mentor: gkruglov, rnewman
Status: NEW → ASSIGNED
Whiteboard: [MobileAS s1.1] → [MobileAS s1.1]
Whiteboard: [MobileAS s1.1] → [MobileAS s1.2]
Assignee: daniellimpersonal → nobody
Status: ASSIGNED → NEW
Whiteboard: [MobileAS s1.2] → [MobileAS backlog]
Whiteboard: [MobileAS backlog] → [MobileAS]
Priority: -- → P3
Priority: P3 → P1
Assignee: nobody → gkruglov
Comment on attachment 8795966 [details]
Bug 1291821 - Pre: remove unused SerialRecordConsumer

I'm starting to work on a concrete implementation of this.

This quick sketch is to illustrate my general thinking on how to approach buffering. The current code implements a BufferingMiddleware which wraps a (local) RepositorySession and can be configured with a buffer store and a consistency checker.

As an example, I've currently wrapped HistoryRepositorySession in a buffer configured with an in-memory store and a no-op checker.

I'll continue fleshing out this general idea, adding a transactional database-backed buffer store, adding a green-light function into the mix, etc.

This push is just to get some early feedback on the general direction.
Attachment #8795966 - Flags: feedback?(rnewman)
Status: NEW → ASSIGNED
Comment on attachment 8795966 [details]
Bug 1291821 - Pre: remove unused SerialRecordConsumer

https://reviewboard.mozilla.org/r/81948/#review80778

f+. Next step is to figure out what the client does when it sees a partial write -- it probably shouldn't just fail hard and die. But this is good, and pretty much what we have right now, so just think about that for the future!

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/middleware/BufferingMiddlewareRepositorySession.java:77
(Diff revision 1)
> +        // Could this be a pipeline? E.g.:
> +        // records -> consistency check -> fix -> store
> +        // But: At this point we still don't have a "common ancestor" and can't do a three way merge.
> +        final List<Record> records = bufferStore.all();
> +        if (!consistencyChecker.isValid(records)) {
> +            abort();

You might need richer signaling here, just as iOS has: <https://github.com/mozilla-mobile/firefox-ios/blob/master/Sync/Synchronizers/Bookmarks/BookmarksDownloader.swift#L126>.

It's not quite enough to say "what we downloaded isn't consistent, throw it away" -- in most cases you're not going to do the same thing again and get different results.

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/middleware/BufferingMiddlewareRepositorySession.java:99
(Diff revision 1)
> +        super.storeDone(end);
> +    }
> +
> +    @Override
> +    public void abort() {
> +        bufferStore.clear();

You might want to rethink this. A persistent buffer is likely to be filled across syncs…

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/middleware/MemoryBufferStore.java:1
(Diff revision 1)
> +package org.mozilla.gecko.sync.middleware;

License blocks.

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/middleware/checkers/ConsistencyChecker.java:11
(Diff revision 1)
> +
> +/**
> + * @author grisha
> + */
> +public interface ConsistencyChecker {
> +    boolean isValid(List<Record> records);

You probably want validation to be a method on the buffer class -- it's likely to be more tightly coupled to a specific Record type and a storage backend than you can manage with just a `List<Record>` type.
Attachment #8795966 - Flags: review+
Comment on attachment 8795966 [details]
Bug 1291821 - Pre: remove unused SerialRecordConsumer

Another WIP push.

Primarily added richer signaling for complete/fail events, and rebased (with some cleanup) on top of the download batching work that landed.
Attachment #8795966 - Flags: feedback?(rnewman)
Comment on attachment 8797320 [details]
Bug 1291821 - WIP first pass on 'green light' checkpoints

WIP push, first stab at "green light" checkpoints.

Currently I've added them in between batches during the download part of a sync stage, and right before the buffer is processed and "applied".
Attachment #8797320 - Flags: feedback?(rnewman)
Some thoughts on the implications of "green light" checkpoints, and the incremental aspects of syncing.

A platform limitation underlying this thinking process is that our sync may be killed at any point. Documentation mentions a hard limit of 30 minutes if the sync wasn't initiated by the user (this might be much less on older Android versions). Furthermore, a sync will be killed if the OS determines it's not making progress by examining the process's network activity [0].

The pertinent question we'd like to ask at various stages of a sync is whether or not we should proceed further. There are multiple places where this question may be asked; the guiding principle for selecting them is that the operation in question might take a while:
- in-between sync stages - should we attempt next stage?
- during a batching download, should we attempt a next batch?
- once stage buffer has been filled up successfully (all batches finished downloading), should we attempt to analyze it and persist it?

The reason we have to ask this "to be or not to be" question primarily revolves around the fact that when we're actually persisting data, we're doing so non-transactionally. From a data integrity point of view, we can eliminate this question entirely if we're able to ensure that we may be safely killed whenever OS desires to kill us.

What this means, roughly speaking:
1) all database writes must be transactional. Once a sync collection has been mirrored into a local buffer and we'd like to persist it, we must first perform any reconciliation, figure out what we want to INSERT/UPDATE, and then do so atomically in one transaction
2) writes into the mirror buffer must be transactional, if they are to persist across multiple syncs. Transactions will encompass downloaded batches. One transaction per batch seems reasonable, given that network latency is likely to be greater than local storage latency.

(2) is part of the current buffering work anyway, but (1) will require additional effort: as we're reconciling records, we already form small "batches" which we then transactionally persist via BrowserProvider's batch operations. We could expand the size of these batches to encompass all of the records that will be stored, which would achieve the goal at the expense of using more memory (as much as 2x per stage), which might not be feasible for large collections or low-end devices.
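Point (2), one transaction per downloaded batch, can be sketched as follows. The `Database` interface here is a hypothetical stand-in for the relevant slice of SQLiteDatabase's beginTransaction/setTransactionSuccessful/endTransaction pattern; names are illustrative:

```java
import java.util.List;

// Hypothetical stand-in for the relevant slice of SQLiteDatabase.
interface Database {
    void beginTransaction();
    void setTransactionSuccessful();
    void endTransaction();
    void insert(String record);
}

class BatchWriter {
    // Persist one downloaded batch atomically: either every record in the
    // batch lands, or (if anything throws before success is marked) the
    // whole batch rolls back and the buffer stays consistent.
    static void writeBatch(Database db, List<String> batch) {
        db.beginTransaction();
        try {
            for (String record : batch) {
                db.insert(record);
            }
            db.setTransactionSuccessful();
        } finally {
            db.endTransaction();
        }
    }
}
```

If we're killed mid-batch, only the current batch is lost; earlier committed batches survive for the next sync to resume from.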

If we can't accomplish safety (1)+(2), then we must be able to safely "stop" or "move on" at any of the points mentioned above, without losing/corrupting data either locally or remotely, and without unnecessarily discarding the progress we have made up to that point. Additionally, in order to minimize the chances of another client writing to the collection while we're in this incomplete state, we want our total sync time window for any given collection (which might span multiple syncs) to be as small as possible.

Introduction of a buffer both helps us and complicates our sync lifecycle. A key question we must be able to answer is whether or not our current buffer for a given stage is valid. A buffer might have three states:
- empty buffer, trivial case, it is always valid.
- partially-complete buffer:
-- If we're still filling it up during a batching download, valid as long as we're not hitting 412 failures, indicating XIUS pre-condition failure - a concurrent modification.
-- if we had to stop batching, and are now starting to batch again in a different sync, valid as long as we're not hitting 412 failures.
- complete buffer:
-- always valid, and safe to apply, if we are handling last-synced timestamps per-stage and XIUS logic correctly.
-- however, we might know that another client wrote to the collection for which we have a complete (not-yet-applied) mirror.

If the buffer is incomplete and is no longer valid (the collection has been modified), then it makes sense to discard it and start from scratch.
If the buffer is complete and is no longer valid, the current approach (buffering as middleware wrapping a repository) guides us towards discarding it and trying again from scratch. However, it might be worth the effort, and certainly more efficient, to apply it first and then download/apply any new changes, which is likely going to be a smaller data set.
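To make the 412/validity relationship concrete: the client sends the collection's last-known modified time in the X-If-Unmodified-Since header, and the server answers 412 Precondition Failed if another client wrote to the collection in the meantime. The decision itself is trivial (class name here is illustrative):

```java
class XiusCheck {
    // HTTP 412 Precondition Failed means the X-If-Unmodified-Since
    // precondition did not hold, i.e. the collection changed under us,
    // so a partially filled buffer may no longer be valid.
    static boolean bufferStillValid(int responseCode) {
        return responseCode != 412;
    }
}
```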

Current "green light" approach is as follows:
- sync deadline is generated when we start to sync. Currently it is set to 30 minutes in the future, and we might adjust this based on OS version, or other heuristics.
- deadline is passed down into each sync stage when it is executed, and the stage is free to use it or discard it.
- deadline is passed down to Server11Repository, in order to checkpoint batching downloads, and to BufferingMiddleware, in order to checkpoint buffer application.
- Batching downloader keeps track of average time it takes to download and process a batch, and prior to starting another batch it determines if it has enough time left.
- BufferingMiddleware makes a "green light" decision based on number of records in the complete buffer, possibly type of record, and time left in a sync.
- GlobalSession currently does nothing with this deadline. It does not skip stages.

A TODO list is upcoming to move the current implementation toward addressing the things mentioned in this brain-dump. Some things are currently done incorrectly.

[0] https://developer.android.com/reference/android/content/AbstractThreadedSyncAdapter.html - 2nd paragraph
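The deadline generation described above might look like the following sketch. On Android the real code would use `SystemClock.elapsedRealtime()`; `System.nanoTime()` is the closest monotonic clock in plain Java and stands in for it here, and all names are illustrative:

```java
import java.util.concurrent.TimeUnit;

class SyncDeadline {
    // Monotonic "now" in milliseconds (stand-in for SystemClock.elapsedRealtime()).
    static long monotonicNowMillis() {
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime());
    }

    // Generate an absolute deadline the given number of minutes from now,
    // computed once when the sync starts and passed down to each stage.
    static long deadlineFromNow(long minutes) {
        return monotonicNowMillis() + TimeUnit.MINUTES.toMillis(minutes);
    }

    // A stage asks this before starting work expected to take requiredMillis.
    static boolean hasTimeLeft(long deadlineMillis, long requiredMillis) {
        return deadlineMillis - monotonicNowMillis() > requiredMillis;
    }
}
```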
(In reply to :Grisha Kruglov from comment #10)
> Some thoughts on the implications of "green light" checkpoints, and the
> incremental aspects of syncing.
> 
> A platform limitation that is underlying this thinking process is that our
> sync may be killed at any point. 
> …
> The reason we have to ask this "to be or not to be" question primarily
> revolves around the fact that when we're actually persisting data, we're
> doing so non-transactionally. From a data integrity point of view, we can
> eliminate this question entirely if we're able to ensure that we may be
> safely killed whenever OS desires to kill us.

That's not the only reason for a green-light. Others are:

* We sync stages in sequence. It's probably better to quickly skip through stages than to get bogged down in one. (iOS doesn't do this.)
* Rate-limiting limits the damage we can do to your data plan or your battery.
* We often, but not always, run a sync during your initial interactions with the browser, or with the first page you load. That's a good time to make sure everything's up to date, but a bad time to be sitting there chewing up resources for several minutes.
(In reply to :Grisha Kruglov from comment #10)

> - during a batching download, should we attempt a next batch?

With batch tokens, and no high-watermark, this is almost always a yes. High-watermark timestamp tracking makes this less risky.

> - once stage buffer has been filled up successfully (all batches finished
> downloading), should we attempt to analyze it and persist it?

I'd say "merge" instead of "persist" here. We'll end up with persistent buffers, so it's really the merging from the buffer into canonical storage that we care about. And you're right: we shouldn't start that unless we're confident that we'll succeed.
(In reply to :Grisha Kruglov from comment #10)

> - Batching downloader keeps track of average time it takes to download and
> process a batch, and prior to starting another batch it determines if it has
> enough time left.
> - BufferingMiddleware makes a "green light" decision based on number of
> records in the complete buffer, possibly type of record, and time left in a
> sync.

Careful about over-engineering this -- particularly stuff like average download time. It's _probably_ sufficient to use very simple heuristics, which could be informed by measurement on some devices. Remember that on Android we can finish a sync and tell the OS that we still have more work to do, and it'll call us again very soon -- can we use that to simplify this logic?

It will almost never be the case that we take 30 minutes to sync, so this all becomes much more interesting when the time is one minute or five minutes.
> That's not the only reason for a green-light. Others are:
> 
> * We sync stages in sequence. It's probably better to quickly skip through
> stages than to get bogged down in one. (iOS doesn't do this.)

We could also re-order more expensive stages (likely bookmarks & history) to be last; currently they're in the middle. However, these are the stages the user probably cares about the most...

> * Rate-limiting limits the damage we can do to your data plan or your
> battery.

The battery impact of sync must correlate pretty well with the device's connectivity state (cell vs wifi), how frequently we sync, and how long each sync takes. I think a worthy goal is to minimize the total time we spend in the "syncing" state. This will have two nice results:
- we'll use less resources. Radio is on for less time, etc.
- we have a smaller chance of hitting 412, thus having to re-download data, which will greatly prolong our total sync time, which will increase our resource/battery consumption.

So it's really just one result (minimizing resource consumption), given that we handle data consistently and don't screw up.

OS will hopefully take care of running sync when we're in an "optimal" state to do so - wifi is available, other applications want to use radio, etc. I don't think we should worry about this too much beyond not triggering "immediate" syncs too often.

Actions such as "don't download the next batch" will have a negative impact on our total sync time, potentially a very high impact if another client updated relevant collections. So I think we should only do this if we're confident we won't succeed, and always proceed otherwise.

Green-lighting buffer merges is important, since we're at risk of data corruption at this point (the merge is not atomic right now).

As far as data plans go... The user did ask us to keep their data in sync. Other than trying to do our best to sync while on wifi, I'm not sure what else we can/should do? Sync less of their data?

> Careful about over-engineering this -- particularly stuff like average download time.
> It's _probably_ sufficient to use very simple heuristics, which could be informed by
> measurement on some devices. Remember that on Android we can finish a sync and tell
> the OS that we still have more work to do, and it'll call us again very soon --
> can we use that to simplify this logic?

"How long it takes to process a batch" is a fairly simple heuristic, I think. It's simple enough to keep track of (assuming that batch processing times don't differ too much from each other), has low overhead, and flexible enough to encapsulate environmental conditions (slow vs fast network, slow vs fast storage, congested device vs nothing else is running).

My assumption here is that I don't want to restart sync at a later point, I want to finish it ASAP. So this is really meant to ensure that we'll proceed unless we're confident we'll fail, at which point we decide to not risk wasting any more resources, stop, and request a (hopefully) prompt restart.
Another thing to note are the implications of specifying order="oldest". Does a "total" limit still make sense?
For example, if there are 1000 records, our total limit is 500, and we're asking for oldest first on the first sync, we will only receive the first half of the records, and not anything recently modified...
(In reply to :Grisha Kruglov from comment #15)
> Another thing to note are implications of specifying order="oldest". Does a
> "total" limit still make sense?

The two aren't really compatible, no. See Bug 730142 Comment 36.

IIRC we only limit on form history (potentially big and full of cruft, but it expires anyway…), history (old history isn't always that useful, and we'll keep up as we go), and bookmarks.

The limit on bookmarks is historical and nonsensical, and can go away as soon as we have batched downloads.

Indeed, download batching and modern device limitations do away with most of the need for total limits, in which case oldest-first is useful because it allows for timestamp high-watermark resuming. Otherwise fetches should be newest-first or sortindex-based. I'd rather see good local expiration than hard fetch limits.
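The timestamp high-watermark resuming mentioned here could be sketched like this (hypothetical class; in the real code the tracking would live alongside the buffer): with oldest-first fetches, after each committed batch, remember the newest `modified` timestamp seen, and an interrupted sync restarts from there rather than from zero.

```java
class HighWatermark {
    private long watermarkMillis = 0;

    // Advance only after the batch has been durably committed to the buffer,
    // so a crash never moves the watermark past persisted data.
    void onBatchCommitted(long newestModifiedInBatch) {
        watermarkMillis = Math.max(watermarkMillis, newestModifiedInBatch);
    }

    // Value to use as the ?newer= parameter on the next fetch.
    long resumeFrom() {
        return watermarkMillis;
    }
}
```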
(In reply to :Grisha Kruglov from comment #14)

> We could also re-order more expensive stages (likely bookmarks & history) to
> be last, currently they're in the middle. However, these are the stages that
> user probably cares the most about...

Yup. We prioritized the tiny/necessary (clients, tabs), then stuff for the awesomebar, then the rest. It's hard to make a case for putting history and bookmarks last, but I remember noting in a bug somewhere that we might consider fetching your most recent 100 history items on the very first sync, and coming back later to fill in the rest.

 
> > * Rate-limiting limits the damage we can do to your data plan or your
> > battery.
> 
> Battery impact of sync must correlate pretty well with device's connectivity
> state (cell vs wifi), how frequently we sync and how long each sync takes. I
> think a worthy goal is to minimize total time we will spend in "syncing"
> state.

You're absolutely right that one big cookie is more efficient than lots of small cookies… but remember that some failure modes for Sync are "do everything again from the beginning". Rate-limiting in both frequency and per-run activity is sometimes a worthwhile mitigation against broken states, not an optimization for the best-case scenario.


> As far as data plans go... User did ask us to keep their data in sync sync.
> Other than trying to do our best to sync while on wifi, I'm not sure what
> else we can/should do? Sync less of their data?

Persistent buffering is a _huge_ win in this regard. Remember that right now if we screw something up, we don't just have to try the merge again, we also download all the records again. With a persistent buffer, if we can't do something, our next sync probably starts with downloading 0-5 records, rather than all of them.
 

> "How long it takes to process a batch" is a fairly simple heuristic, I
> think.

If your clock stays consistent, and Android gives you a consistent slice of a core and doesn't put you to sleep, and you don't get hit by a network switch, and… :P

My point is essentially: think carefully about the edge cases that could turn this into a death-spiral of fixes and confusing code, compared to the potential value of a calculated heuristic over something simple like "do we have more than one minute left?".
Attachment #8797320 - Attachment is obsolete: true
Attachment #8797320 - Flags: feedback?(rnewman)
Comment on attachment 8795966 [details]
Bug 1291821 - Pre: remove unused SerialRecordConsumer

Here's another WIP push. What's currently present:
- buffering middleware currently backed by memory-backed buffer store
- buffering middleware only clears its buffer once sync for a given stage is complete. Otherwise incoming records are appended or replace existing ones.
- split history sync into two stages - recent + full
-- recentHistory stage runs only if the full stage hasn't completed yet
- no more total limits for sync. Collections are being synced in full
- switched most stages to use sort=oldest, except for recent history, which is sort=newest
- added "single batch mode" to batching downloader, in support of recentHistory sync. Only one batch will be fetched (newest-first), regardless of how much data is present.
- all browser data stages are buffered
- "green light" mechanism is present, currently using a simple heuristic as you suggested, "do I have two minutes left in a sync?", applied in-between batches and before buffer is merged

My next steps at this point are:
- add database-backed buffer store (I didn't quite get to this yet...)
-- I need to decide if using the main browser.db and following a ContentProvider is the right approach. I currently don't see big roadblocks here, but I haven't quite thought through implications of buffer analysis work and how it might relate to storage, etc.
- test buffering middleware & buffer stores
- split this up into a patch series
- investigate improving transactionality of how we store records during buffer merge
- somehow land this sanely before moving on to "analyze buffer consistency" and friends
Attachment #8795966 - Flags: feedback?(rnewman)
Comment on attachment 8800048 [details]
Bug 1291821 - Add storeIncomplete to RepositorySession interface

https://reviewboard.mozilla.org/r/85056/#review84724

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/middleware/BufferingMiddlewareRepositorySession.java:105
(Diff revision 2)
> +    @VisibleForTesting
> +    public void doStoreDone(final long end) {
> +        final Collection<Record> unprocessedBuffer = bufferStorage.all();
> +
> +        // Trivial case of an empty buffer.
> +        if (unprocessedBuffer.size() == 0) {

Prefer `.isEmpty()`.

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/middleware/BufferingMiddlewareRepositorySession.java:111
(Diff revision 2)
> +            super.storeDone(end);
> +            return;
> +        }
> +
> +        // Process our buffer into a set of records we're happy to merge.
> +        final Collection<Record> processedBuffer = consistencyChecker.processRecords(unprocessedBuffer);

As noted below, you need a better way of expressing invalidity.

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/middleware/BufferingMiddlewareRepositorySession.java:117
(Diff revision 2)
> +
> +        // Flush our buffer to the wrapped local repository. Data goes live!
> +        for (Record record : processedBuffer) {
> +            try {
> +                this.inner.store(record);
> +            } catch (NoStoreDelegateException e) {

If this can't happen, you should wrap this outside the loop.

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/middleware/BufferingMiddlewareRepositorySession.java:176
(Diff revision 2)
> +
> +    private boolean mayProceedToMergeBuffer() {
> +        // While actual runtime of a merge operation is a function of record type, buffer size, etc.,
> +        // let's do a simple thing for now and say that we may proceed if we have couple of minutes
> +        // of runtime left. That surely is enough, right?
> +        final long timeLeft = syncDeadline - SystemClock.elapsedRealtime();

`timeLeftMillis` (and `syncDeadlineElapsedRealtime`, too).

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/middleware/BufferingMiddlewareRepositorySession.java:177
(Diff revision 2)
> +    private boolean mayProceedToMergeBuffer() {
> +        // While actual runtime of a merge operation is a function of record type, buffer size, etc.,
> +        // let's do a simple thing for now and say that we may proceed if we have couple of minutes
> +        // of runtime left. That surely is enough, right?
> +        final long timeLeft = syncDeadline - SystemClock.elapsedRealtime();
> +        return timeLeft > TimeUnit.MINUTES.toMillis(2);

Just hard-code `2 * 60 * 1000` and let the compiler optimize. The arithmetic is clear enough.

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/middleware/checkers/ConsistencyChecker.java:15
(Diff revision 2)
> +
> +/**
> + * @author grisha
> + */
> +public interface ConsistencyChecker {
> +    Collection<Record> processRecords(Collection<Record> records);

Describe what this is supposed to do. Does it throw a runtime exception if invalid? 'cos the way you wrote the use of this, the buffer is cleared of _all_ records after processing, so this can't return an empty collection…

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/middleware/storage/BufferStorage.java:31
(Diff revision 2)
> +    void flush();
> +
> +    void clear();
> +    boolean isEmpty();
> +
> +    // For buffers that are filled up newest-first, this is a high water mark, which enabled resuming

You mean oldest-first?

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/middleware/storage/MemoryBufferStorage.java:26
(Diff revision 2)
> +public class MemoryBufferStorage implements BufferStorage {
> +    private final Map<String, Record> recordBuffer = Collections.synchronizedMap(new HashMap<String, Record>());
> +
> +    @Override
> +    public Collection<Record> all() {
> +        return Collections.unmodifiableCollection(recordBuffer.values());

I'm not sure this does what you expect.

`Collections.synchronizedMap` returns a wrapper around a map that synchronizes each operation.

`Map.values` returns a view onto the value set of the map. As the map changes, so do the values.

`Collections.unmodifiableCollection` returns a read-only wrapper.

So what you have returned here is a read-only read-through wrapper onto the values of the collection. Notably, though, it's not safe to iterate over — if the collection changes (either because another thread added something to the buffer, or because this thread did so), then you'll get undefined behavior or a `ConcurrentModificationException`:

> If the map is modified while an iteration over the collection is in progress (except through the iterator's own remove operation), the results of the iteration are undefined. 

In order to safely iterate, you'd need to synchronize on `recordBuffer`.

You might want to choose a different way to express this.
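One possible way to express this, as a sketch (hypothetical class, not the actual patch): take a defensive copy while holding the map's lock, so callers get a stable collection they can iterate freely regardless of concurrent writes.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

class SnapshotBuffer {
    private final Map<String, String> recordBuffer =
            Collections.synchronizedMap(new HashMap<String, String>());

    void put(String guid, String record) {
        recordBuffer.put(guid, record);
    }

    // Copy inside a synchronized block: ArrayList's copy constructor
    // iterates the values view, which is only safe while holding the
    // map's lock. The returned snapshot is then immune to later writes.
    Collection<String> all() {
        synchronized (recordBuffer) {
            return new ArrayList<>(recordBuffer.values());
        }
    }
}
```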

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/middleware/storage/MemoryBufferStorage.java:46
(Diff revision 2)
> +        recordBuffer.clear();
> +    }
> +
> +    @Override
> +    public boolean isEmpty() {
> +        return recordBuffer.size() == 0;

`return recordBuffer.isEmpty();`

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/middleware/storage/MemoryBufferStorage.java:50
(Diff revision 2)
> +    public boolean isEmpty() {
> +        return recordBuffer.size() == 0;
> +    }
> +
> +    @Override
> +    public long lastSyncedTimestamp() {

I'd call this `latestModifiedTimestamp`.
Attachment #8800048 - Flags: review?(rnewman) → review-
Comment on attachment 8800049 [details]
Bug 1291821 - Switch stage duration interval counting to use elapsedRealtime

https://reviewboard.mozilla.org/r/85058/#review84730
Attachment #8800049 - Flags: review?(rnewman) → review+
Comment on attachment 8800050 [details]
Bug 1291821 - Rename RepositorySession's delegate to storeDelegate, for clarity

https://reviewboard.mozilla.org/r/85060/#review84732
Attachment #8800050 - Flags: review?(rnewman) → review+
Comment on attachment 8800051 [details]
Bug 1291821 - Add onBatchComplete to a FetchRecordsDelegate

https://reviewboard.mozilla.org/r/85062/#review84734

This seems harmless, but I don't yet know what it's for…
Attachment #8800051 - Flags: review?(rnewman) → review+
Comment on attachment 8800052 [details]
Bug 1291821 - Simplify onFetchFailed, clean up some exception code

https://reviewboard.mozilla.org/r/85064/#review84736
Attachment #8800052 - Flags: review?(rnewman) → review+
Iteration: --- → 1.7
Comment on attachment 8800053 [details]
Bug 1291821 - Decouple BatchingUploader from Server11Repository

https://reviewboard.mozilla.org/r/85066/#review86692
Attachment #8800053 - Flags: review?(rnewman) → review+
Comment on attachment 8800054 [details]
Bug 1291821 - Buffering repository middleware

https://reviewboard.mozilla.org/r/85068/#review86698

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/repositories/Server11RepositorySession.java:111
(Diff revision 4)
>    @Override
>    public boolean dataAvailable() {
>      return serverRepository.updateNeeded(getLastSyncTimestamp());
>    }
> +
> +  protected static BatchingDownloader initializeDownloader(final Server11RepositorySession serverRepositorySession) {

I'd probably name this `downloaderForSession`.

Then I'd see that it's static but only accesses the fields of its argument, so I'd probably make it a non-static method of `Server11RepositorySession`.

Then I'd notice that this *is* `Server11RepositorySession`, so it should basically be part of `fetchSince`, no?

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/repositories/downloaders/BatchingDownloader.java:297
(Diff revision 4)
> -                        null);
>              }
>          });
>      }
> +    @VisibleForTesting
> +    public static URI buildCollectionURI(Uri baseCollectionUri, boolean full, long newer, long limit, String sort, String ids, String offset) throws URISyntaxException {

Maybe `escapedOffset`?
Attachment #8800054 - Flags: review?(rnewman) → review+
Comment on attachment 8800055 [details]
Bug 1291821 - Keep track of sync deadline

https://reviewboard.mozilla.org/r/85070/#review86704

::: mobile/android/services/src/main/java/org/mozilla/gecko/fxa/sync/FxAccountSyncAdapter.java:393
(Diff revision 4)
>      final AndroidFxAccount fxAccount = new AndroidFxAccount(context, account);
>  
> +    // Sync can't take longer than 30 minutes.
> +    // NB: we use elapsedRealtime which is time since boot, to ensure our clock is monotonic and isn't
> +    // paused while CPU is in the power-saving mode.
> +    final long syncDeadline = SystemClock.elapsedRealtime() + TimeUnit.MINUTES.toMillis(30);

IMO 30m is too long.
Attachment #8800055 - Flags: review?(rnewman) → review+
Comment on attachment 8800056 [details]
Bug 1291821 - Use sync deadline to decide if batching downloader should proceed

https://reviewboard.mozilla.org/r/85072/#review86708

Persuade me otherwise!

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/repositories/downloaders/BatchingDownloader.java:225
(Diff revision 4)
>              }
>          });
>  
> +        // Should we proceed, however? Do we have enough time?
> +        if (!mayProceedWithBatching(fetchDeadline)) {
> +            this.abort(fetchRecordsDelegate, new Exception("Not enough time to complete next batch"));

I don't think that `abort` is the right flow for this.

`abort` isn't particularly graceful.

What you mean here is a measured retreat: we don't have time to wait for another long download, so we should make sure we update local metadata and record a successful sync, but not continue with more collections or with merging this one.
Attachment #8800056 - Flags: review?(rnewman) → review-
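The "do we have enough time for another batch" check discussed above can be sketched as a pure function. This is a minimal sketch, not the actual patch: the class name, method name, and the idea of passing in an estimated batch duration are all assumptions; the timestamps are assumed to come from a monotonic clock such as `SystemClock.elapsedRealtime()`.

```java
// Hypothetical sketch of the deadline check; names and the batch-time
// estimate parameter are assumptions, not the code under review.
public class BatchingDeadline {
    /**
     * Decide whether there is enough time left before the sync deadline to
     * attempt another batch. Both arguments are monotonic-clock millis
     * (e.g. from SystemClock.elapsedRealtime() on Android).
     */
    public static boolean mayProceedWithBatching(long nowMillis,
                                                 long deadlineMillis,
                                                 long estimatedBatchMillis) {
        return (deadlineMillis - nowMillis) >= estimatedBatchMillis;
    }
}
```

Keeping the current time as a parameter (rather than reading the clock inside) makes the decision trivially testable.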
Comment on attachment 8800056 [details]
Bug 1291821 - Use sync deadline to decide if batching downloader should proceed

https://reviewboard.mozilla.org/r/85072/#review86708

> I don't think that `abort` is the right flow for this.
> 
> `abort` isn't particularly graceful.
> 
> What you mean here is a measured retreat: we don't have time to wait for another long download, but we should make sure we update local metadata, record a successful sync, but don't continue with more collections or with merging this one.

From the perspective of one collection, I think calling RepositorySessionFetchRecordsDelegate's onFetchFailed with an appropriate exception will work here. RecordsChannel will inform the sink that the source failed; we won't update last-synced for the collection, but the buffer will be preserved, so on the next sync we'll resume fetching from the buffer's high water mark.

From the perspective of sync as a process composed of synchronizing multiple collections, the current approach doesn't even attempt to solve the problem. However, a first stab at this is probably not difficult at all. Off the top of my head, GlobalSession is a good place to make these decisions: it has the deadline information, and might decide to skip stages in its `advance` method.
Comment on attachment 8800055 [details]
Bug 1291821 - Keep track of sync deadline

https://reviewboard.mozilla.org/r/85070/#review86704

> IMO 30m is too long.

I've got to pick a number though, so what do you have in mind? Android docs mention a 30m deadline for a non-user-initiated sync, and I believe no limit for a sync that the user initiated themselves.

What might be the harm in setting our deadline at the maximum allowable time? The longest stage at present - full history sync - is the very last one (after the split into recent & full history). So at worst, we'll take forever churning through batches of history records, filling up the buffer, and hopefully not hitting the limit. However, while we're churning, the user already has the recent history data and the bookmarks (which might also take a while to sync), so as far as they're concerned, life is generally good.

Other than us burning through their battery in the background with the constant radio load, that is... We could pick a closer deadline, but since the user's data is still there and we're still aiming to sync all of it, we'd likely be increasing the total time to complete a sync. But I feel like we've had this discussion before :-)
(In reply to :Grisha Kruglov from comment #98)

> From the perspective of sync as a process combined of synchronizing multiple
> collections, current approach doesn't even attempt to solve the problem.
> However, a first stab at this is probably not difficult at all. If I say
> this quickly, GlobalSession is a good place to make these decisions - it has
> the deadline information, and might decide to skip stages in its `advance`
> method.

Phrased as "this only happens if an enclosing caller was very confident that we'd succeed, but we took too long anyway", I could see something like `abort` as acceptable.

But ideally we'd have the sync engine report to the stage that it succeeded, not that it failed -- after all, if the stage runner doesn't proceed because the time limit has been hit, that's the ideal way to terminate.

What I want to avoid is the harsh result of a failure bubbling up and returning a failure flag in the SyncAdapter.onPerformSync method.

We succeeded, even if we only made incremental progress, and the SyncResult should reflect that.

https://developer.android.com/reference/android/content/SyncResult.html

Instead of failure, we should be setting `moreRecordsToGet`, perhaps `fullSyncRequested` (if our rate limiting isn't a problem).

See what I mean?
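The success-versus-flags mapping described above can be summarized as a pure helper. This is a hedged sketch: the enum and method names are hypothetical; only the flag names (`moreRecordsToGet`, `fullSyncRequested`) come from the Android `SyncResult` documentation.

```java
// Hypothetical helper mapping how a stage ended onto what the sync adapter
// should report via android.content.SyncResult. Names are illustrative.
public class SyncOutcome {
    public enum End { COMPLETED, HIT_DEADLINE, HARD_FAILURE }

    /** Should the adapter report success to the scheduler? */
    public static boolean reportSuccess(End end) {
        // Running out of time is incremental progress, not a failure.
        return end == End.COMPLETED || end == End.HIT_DEADLINE;
    }

    /** Should SyncResult.moreRecordsToGet be set? */
    public static boolean moreRecordsToGet(End end) {
        return end == End.HIT_DEADLINE;
    }
}
```

The point being made in the comment: hitting the time limit maps to success plus `moreRecordsToGet`, never to the failure path.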
(In reply to :Grisha Kruglov from comment #99)

> I've got to pick a number though, so what do you have in mind? Android docs
> mention 30m deadline for a non-user initiated sync, and I believe no limit
> for a sync that was initiated by the user themselves.

At one point it was five minutes, I think, but that might have been a purely empirical observation.


> Other than us burning through their battery in the background with the
> constant radio load, that is...

Yeah, that's the trick.

For the average user, the limit doesn't matter. They'll be done in 30 seconds.

For users in a bad state, or with lots of data, the time limit is a rate/load limiter.

See, if we take up to 30 minutes per sync, and we're called every hour, we'll sync for no more than 50% of the wake time of the device.

If we take up to 5 minutes per sync, and we're called every hour, we'll sync for no more than 8% of the wake time of the device.

That decreases the battery and bandwidth load for heavy users, and it increases the chances that we'll finish the sync on fast wifi.

Taking a short time limit is a way of delegating to the OS scheduler the load we'll apply to the device.

One correct way to do this is to use the 'full sync requested' flag, and adjust the time limit accordingly -- say, 1 minute or 5 minutes without, and 20m with?
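The duty-cycle arithmetic above can be spot-checked with a trivial helper. This assumes an idealized fixed hourly schedule (real scheduling is up to the OS), and the helper itself is hypothetical:

```java
// Worst-case fraction of device wake time spent syncing, assuming we are
// invoked once per period and always run up to the time limit.
public class DutyCycle {
    public static double worstCase(long syncLimitMinutes, long periodMinutes) {
        return (double) syncLimitMinutes / periodMinutes;
    }
}
```

With an hourly period: a 30-minute limit gives a 50% worst-case duty cycle, while a 5-minute limit gives roughly 8%.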
(In reply to Richard Newman [:rnewman] from comment #100)
> (In reply to :Grisha Kruglov from comment #98)
> 
> > From the perspective of sync as a process combined of synchronizing multiple
> > collections, current approach doesn't even attempt to solve the problem.
> > However, a first stab at this is probably not difficult at all. If I say
> > this quickly, GlobalSession is a good place to make these decisions - it has
> > the deadline information, and might decide to skip stages in its `advance`
> > method.
> 
> Phrased as "this only happens if an enclosing caller was very confident that
> we'd succeed, but we took too long anyway", I could see something like
> `abort` as acceptable.
> 
> But ideally we'd have the sync engine report to the stage that it succeeded,
> not that it failed -- after all, if the stage runner doesn't proceed because
> the time limit has been hit, that's the ideal way to terminate.
> 
> What I want to avoid is the harsh result of a failure bubbling up and
> returning a failure flag in the SyncAdapter.onPerformSync method.
> 
> We succeeded, even if we only made incremental progress, and the SyncResult
> should reflect that.
> 
> https://developer.android.com/reference/android/content/SyncResult.html
> 
> Instead of failure, we should be setting `moreRecordsToGet`, perhaps
> `fullSyncRequested` (if our rate limiting isn't a problem).
> 
> See what I mean?

Looking through the code a bit closer and tracking up and down through the delegate chain, it's actually doing what you want it to do - almost.

That `abort` function in BatchingDownloader is somewhat poorly named. In the context of the batcher, it is an "abort" operation. But it calls `onFetchFailed` on its delegate. So in essence, any fetch failure is treated the same, be it the result of a 412 or of a timeout, with the exception of failures where we receive a back-off interval.

So, on a timeout or a 412, we do indeed report "success" via SyncResult, and we move on to another session. In the case of a timeout, the next session will also fail, and advance, and so on, until we get to the end. Sync timestamps aren't persisted, so we'll just "try again" on the next sync, resuming from a high-water mark when there's a persistent buffer available.

The one thing that IS missing in the picture is setting `fullSyncRequested` on SyncResult in case of a failure like this. On a 412 it's a clear-cut situation: we need to sync again ASAP in order to catch up with other clients. In the case of hitting a sync deadline, we should tread more carefully, and probably keep a persistent counter of how many times we've hit the deadline; if it's happening consistently, we'd want to stop and back off. However, we do have a back-off mechanism in place anyway which will ensure we're not syncing too frequently - so this might just work out as is!
Comment on attachment 8800057 [details]
Bug 1291821 - Remove total sync limits, refactor batching downloader

https://reviewboard.mozilla.org/r/85074/#review87156

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/stage/AndroidBrowserBookmarksServerSyncStage.java:46
(Diff revision 4)
>  
>    @Override
>    protected Repository getRemoteRepository() throws URISyntaxException {
>      return new ConstrainedServer11Repository(
>              getCollection(),
> +            session.getSyncDeadline(),

I should perhaps have commented on a different patch in the series, but here I am:

This doesn't follow the spirit of the design.

The deadline is very time-sensitive, by definition. That means it should be in the `RepositorySession` (the class that represents a single sync of the repository), not the repository itself.

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/stage/PasswordsServerSyncStage.java:34
(Diff revision 4)
>    }
>  
>    @Override
>    protected Repository getLocalRepository() {
> -    return new PasswordsRepositorySession.PasswordsRepository();
> +    return new BufferingMiddlewareRepository(
> +            session.getSyncDeadline(),

And here you can see the smell…

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/synchronizer/RecordsChannel.java:179
(Diff revision 4)
> +    // Fetch all records that were modified since our previous flow. If our previous flow succeeded,
> +    // we will use source's last-sync timestamp. If our previous flow didn't complete, resume it.
> +    // If there was no previous flow (first sync, or data was cleared...), fetch everything.
> +    // Resuming a flow is supported for buffered RepositorySessions. We degrade gracefully otherwise.
> +    final long highWaterMark = sink.getHighWaterMarkTimestamp();
> +    final long lastSync = sink.getLastSyncTimestamp();

Why not push this logic into the sink -- `sink.getFetchSinceTimestamp()`?
Attachment #8800057 - Flags: review?(rnewman) → review-
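The suggested `sink.getFetchSinceTimestamp()` would fold the resume logic into one place. A minimal sketch, assuming the method name from the comment above; the behavior mirrors the `Math.max` logic already in the patch:

```java
// Sketch of the suggested sink-side method: one place that decides which
// timestamp to fetch from, instead of RecordsChannel juggling two values.
public class FetchSince {
    /**
     * If a previous flow was interrupted, resume from the buffer's high
     * water mark; if it completed, fetch from the last-sync timestamp; if
     * neither exists (first sync), both are 0 and we fetch everything.
     */
    public static long getFetchSinceTimestamp(long highWaterMark, long lastSync) {
        return Math.max(highWaterMark, lastSync);
    }
}
```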
Comment on attachment 8800057 [details]
Bug 1291821 - Remove total sync limits, refactor batching downloader

https://reviewboard.mozilla.org/r/85074/#review87156

> I should perhaps have commented on a different patch in the series, but here I am:
> 
> This doesn't follow the spirit of the design.
> 
> The deadline is very time-sensitive, by definition. That means it should be in the `RepositorySession` (the class that represents a single sync of the repository), not the repository itself.

Could one argue that it's a matter of squinting at the abstraction just right? In some way this deadline isn't terribly different from changing-through-time `InfoCollection` and maybe even `InfoConfiguration` objects...

I've tried the `RepositorySession` approach, adjusting the constructor, Repository's `createSession` method, and passing in the deadline via `Synchronizer`/`SynchronizerSession`... While that works and feels like it _is_ in the spirit of the design, the end result is that this approach touches pretty much everything. The deadline is then all over the place, being passed through over and over again, sometimes with inconsequential values just to satisfy the interfaces in cases when sessions are created for "one-off" tasks like wiping. For something that is important only at a couple of crucial moments, this also doesn't feel right.

It's probably possible to do this a bit more cleanly (have `RepositorySession` and/or `createSession` assume a default deadline if one isn't passed in, etc.), but the core of my complaint would still stand.
Comment on attachment 8800056 [details]
Bug 1291821 - Use sync deadline to decide if batching downloader should proceed

https://reviewboard.mozilla.org/r/85072/#review88894

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/repositories/Server11RepositorySession.java:115
(Diff revision 5)
>  
>    protected static BatchingDownloader initializeDownloader(final Server11RepositorySession serverRepositorySession) {
>      return new BatchingDownloader(
>              serverRepositorySession.serverRepository.authHeaderProvider,
>              Uri.parse(serverRepositorySession.serverRepository.collectionURI().toString()),
> +            serverRepositorySession.serverRepository.syncDeadline,

Let's wrap this in `getSyncDeadline`, at least, so we have a place to put a docstring.
Attachment #8800056 - Flags: review?(rnewman) → review+
Comment on attachment 8800057 [details]
Bug 1291821 - Remove total sync limits, refactor batching downloader

https://reviewboard.mozilla.org/r/85074/#review88896
Attachment #8800057 - Flags: review?(rnewman) → review+
Comment on attachment 8805182 [details]
Bug 1291821 - Wrap local repositories in buffering middleware

https://reviewboard.mozilla.org/r/88978/#review88900

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/synchronizer/RecordsChannel.java:181
(Diff revision 1)
> +    // starting from sink's high water mark timestamp.
> +    // If there was no previous flow (first sync, or data was cleared...), fetch everything.
> +    // Resuming a flow is supported for buffered RepositorySessions. We degrade gracefully otherwise.
> +    final long highWaterMark = sink.getHighWaterMarkTimestamp();
> +    final long lastSync = source.getLastSyncTimestamp();
> +    final long sinceTimestamp = Math.max(highWaterMark, lastSync);

Blegh, why does the source own one timestamp and the sink own another?!

(Because we only advance the high water mark when we apply a record, but we advance the timestamp when we tell the source that we're done applying a bunch. Errr…)
Attachment #8805182 - Flags: review?(rnewman) → review+
Comment on attachment 8800058 [details]
Bug 1291821 - Split history stage into recent and full history stages

https://reviewboard.mozilla.org/r/85076/#review88904

This needs a little thought.

As implemented, this fetches the most recent 500 history records twice on the first sync.

Sync 1.5 no longer has an 'older' parameter, so the obvious way to improve this isn't available to us. (Though I'm asking about that, 'cos I see it: https://github.com/mozilla-services/server-syncstorage/blob/master/syncstorage/storage/sql/queries_generic.py#L118)

What we'd like to do is fetch the 500 most recent, sort=newest, newer=lastsync (defaulting to zero), then fetch all, sort=oldest, older=oldest-new-record+1, newer=lastsync.

We can indeed generalize that to run on each sync, and remove the logic for doing this only once.

Very often we'd find that newer=older, and we'd have no work to do in the second stage (when fewer than 500 items changed).

If we're able to specify both newer and older, I'd like to do this: it'll improve responsiveness in the general case, always fetching most recent history, and it'll avoid potentially pulling down hundreds of KB of redundant records on a first sync. (The most recently visited history is likely to also be your most visited, so each will have 20 visits…)

Please file a follow-up for that.

In the short term, though, we can mitigate the impact: download 50 or 100 records instead of 500. We only need enough to show Top Sites and recent history.
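The two-phase fetch described above could build its query strings roughly as follows. This is a hypothetical sketch: the helper names are invented, and note that the `older` parameter is, per the comment, not currently exposed by Sync 1.5, so the second phase is contingent on that server question.

```java
// Hypothetical query builders for the two-phase history fetch.
// Phase one: the N most recent records. Phase two: backfill everything
// older than the oldest record phase one returned.
public class TwoPhaseHistory {
    public static String recentQuery(long lastSync, int limit) {
        return "full=1&limit=" + limit + "&sort=newest&newer=" + lastSync;
    }

    public static String backfillQuery(long lastSync, long oldestNewRecord) {
        // older=oldestNewRecord+1 so we don't re-fetch phase one's records,
        // following the parameterization suggested in the comment.
        return "full=1&sort=oldest&newer=" + lastSync
                + "&older=" + (oldestNewRecord + 1);
    }
}
```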

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/GlobalSession.java:192
(Diff revision 5)
>  
>      stages.put(Stage.syncClientsEngine,       new SyncClientsEngineStage());
>  
>      stages.put(Stage.syncTabs,                new FennecTabsServerSyncStage());
>      stages.put(Stage.syncPasswords,           new PasswordsServerSyncStage());
> +    stages.put(Stage.syncRecentHistory,       new AndroidBrowserRecentHistoryServerSyncStage());

This deserves a comment, both here and in the commit message.

::: mobile/android/tests/background/junit4/src/org/mozilla/android/sync/test/TestResetCommands.java:120
(Diff revision 5)
>          no.called = true;
>        }
>      };
>  
>      stagesToRun.put(Stage.syncBookmarks, stageGetsReset);
> -    stagesToRun.put(Stage.syncHistory,   stageNotReset);
> +    stagesToRun.put(Stage.syncFullHistory,   stageNotReset);

Whitespace.
Attachment #8800058 - Flags: review?(rnewman) → review+
Comment on attachment 8800059 [details]
Bug 1291821 - Ensure sink repository is aware of new events coming from the source

https://reviewboard.mozilla.org/r/85078/#review88928
Attachment #8800059 - Flags: review?(rnewman) → review+
Comment on attachment 8800060 [details]
Bug 1291821 - Get tests to work after sync changes

https://reviewboard.mozilla.org/r/85080/#review88930
Attachment #8800060 - Flags: review?(rnewman) → review+
Comment on attachment 8800061 [details]
Bug 1291821 - Rename repositories/sessions

https://reviewboard.mozilla.org/r/85082/#review88934
Attachment #8800061 - Flags: review?(rnewman) → review+
Comment on attachment 8800062 [details]
Bug 1291821 - Move bulk insert logic for new history to BrowserProvider

https://reviewboard.mozilla.org/r/85084/#review88938

Rather than effectively making a fake transaction by disabling notifications for each write, adding a bunch of complexity, and spreading logic across the CP boundary, I'd much rather you push the visit-updating logic into the provider and tickle it once from Sync.

You've only band-aided part of the problem (notifications): you're still calling bulkInsert N+2 times for N records, creating N+2 transactions and N+2 ContentProvider dances, when you can do all this quickly and easily in a single transaction with a single notification and no excess effort, just by changing where you draw the line.

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/repositories/android/AndroidBrowserHistoryDataAccessor.java:142
(Diff revision 5)
> +    // prevents us from "batching" all of the operations together, since `updates` are conditional on
> +    // the result of `bulkInsert` of visits.
> +
> +    // Alternative approach would be to run some task which will recalculate remote/local counts based on
> +    // the data in VISITS table for every history record after syncing history.
> +    // This would be a costly operation, but doable in the background.

Another approach is to define a ContentProvider URI for visits -- oh hey, we have one! -- and define its insert logic to be a non-duplicating insert that doesn't notify if nothing changed.

That way you get to loop this insertion/change count check/update logic within a transaction block, queuing notifications until the transaction is committed.

In other words: push this logic into BrowserProvider.

See BOOKMARKS_POSITIONS/updateBookmarkPositions for an example of doing more sophisticated work within a CP.
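The "non-duplicating insert that doesn't notify if nothing changed" idea can be modeled in miniature. This is a toy in-memory sketch, not the real BrowserProvider/SQLite logic; it only illustrates the contract: report whether anything was written so the caller can decide whether to notify.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of a non-duplicating visit insert. In the real provider this
// would be an INSERT guarded by a uniqueness constraint inside a transaction.
public class VisitStore {
    // Keyed by a hypothetical (history GUID, visit date) pair.
    private final Map<String, Long> visits = new HashMap<>();

    /** @return true if the visit was new, i.e. a notification is warranted. */
    public boolean insertNonDuplicating(String historyGuid, long visitDateMicros) {
        final String key = historyGuid + ":" + visitDateMicros;
        if (visits.containsKey(key)) {
            return false; // Duplicate: no write, no notification.
        }
        visits.put(key, visitDateMicros);
        return true;
    }
}
```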

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/repositories/android/AndroidBrowserHistoryDataAccessor.java:186
(Diff revision 5)
> +              bulkIncrementRemoteCounts(skipNotifyChange(getUri())),
> +              remoteCountIncrementValues.toArray(new ContentValues[remoteCountIncrementValues.size()])
> +      );
> +    }
> +
> +    context.getContentResolver().notifyChange(BrowserContract.History.CONTENT_URI, null, false);

You should only do this if we added data.
Attachment #8800062 - Flags: review?(rnewman) → review-
Comment on attachment 8800497 [details]
Bug 1291821 - Allow BatchingDownloader to resume downloads using offset

https://reviewboard.mozilla.org/r/85408/#review88970

::: mobile/android/base/java/org/mozilla/gecko/db/BrowserDatabaseHelper.java:181
(Diff revision 4)
> +    private void createHistoryBufferTable(SQLiteDatabase db) {
> +        debug("Creating " + HistoryBuffer.TABLE_NAME + " table");
> +        db.execSQL("CREATE TABLE " + HistoryBuffer.TABLE_NAME + "(" +
> +                History._ID + " INTEGER PRIMARY KEY AUTOINCREMENT," +
> +                History.TITLE + " TEXT," +
> +                History.URL + " TEXT," +

`NOT NULL`, surely?

::: mobile/android/base/java/org/mozilla/gecko/db/BrowserDatabaseHelper.java:187
(Diff revision 4)
> +                History.GUID + " TEXT NOT NULL," +
> +
> +                // JSON blob of incoming visits
> +                History.VISITS + " TEXT," +
> +
> +                History.DATE_MODIFIED + " INTEGER," +

`NOT NULL`

::: mobile/android/base/java/org/mozilla/gecko/db/BrowserDatabaseHelper.java:191
(Diff revision 4)
> +
> +                History.DATE_MODIFIED + " INTEGER," +
> +                History.IS_DELETED + " INTEGER NOT NULL DEFAULT 0" +
> +                ");");
> +
> +        db.execSQL("CREATE INDEX history_buffer_modified_index ON " + HistoryBuffer.TABLE_NAME + "("

Why do you need an index?

Presumably you insert+overwrite records by GUID, and when you apply records you query for all of them, without caring about order.
Depends on: 1314171
Iteration: 1.7 → 1.8
Comment on attachment 8800497 [details]
Bug 1291821 - Allow BatchingDownloader to resume downloads using offset

https://reviewboard.mozilla.org/r/85408/#review88970

> `NOT NULL`, surely?

I have a feeling I saw a NULL URL at some point in sync data... But I can't quite remember if it actually happened or I just dreamt it up in some sync-induced nightmare. Let's try NOT NULL and see if it breaks.
Comment on attachment 8800048 [details]
Bug 1291821 - Add storeIncomplete to RepositorySession interface

https://reviewboard.mozilla.org/r/85056/#review89770

This commit seems to be a couple of negation fixes and then a no-op… could you add a commit message that explains what 'clean' is for? Or roll this into the commit that uses it?

But I have to point you to Martin Fowler on this…

http://martinfowler.com/bliki/FlagArgument.html
Attachment #8800048 - Flags: review?(rnewman)
See Also: → 1316110
Comment on attachment 8800048 [details]
Bug 1291821 - Add storeIncomplete to RepositorySession interface

https://reviewboard.mozilla.org/r/85056/#review92408
Attachment #8800048 - Flags: review?(rnewman) → review+
Comment on attachment 8800497 [details]
Bug 1291821 - Allow BatchingDownloader to resume downloads using offset

https://reviewboard.mozilla.org/r/85408/#review92410

::: mobile/android/base/java/org/mozilla/gecko/db/BrowserDatabaseHelper.java:2004
(Diff revision 6)
>      private void upgradeDatabaseFrom35to36(final SQLiteDatabase db) {
>          createPageMetadataTable(db);
>      }
>  
> +    private void upgradeDatabaseFrom36to37(final SQLiteDatabase db) {
> +    	createHistoryBufferTable(db);

Are these… tabs?!

::: mobile/android/base/java/org/mozilla/gecko/db/SyncProvider.java:1
(Diff revision 6)
> +package org.mozilla.gecko.db;

License block.

::: mobile/android/base/java/org/mozilla/gecko/db/SyncProvider.java:40
(Diff revision 6)
> +                    if (insertedId != -1) {
> +                        inserted++;
> +                    }
> +                }
> +                db.setTransactionSuccessful();
> +                db.endTransaction();

You should always call `endTransaction` in a `finally` block.
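The pattern being asked for is the standard SQLite transaction idiom: `endTransaction` must run even when the body throws, which is what the `finally` block guarantees. A sketch against a fake recorder object (so it's self-contained; the real code uses `SQLiteDatabase`):

```java
// Demonstrates the begin/try/setTransactionSuccessful/finally-endTransaction
// idiom with a fake database that just records state.
public class TxPattern {
    public static class FakeDb {
        public boolean inTransaction = false;
        public boolean successful = false;
        public void beginTransaction() { inTransaction = true; }
        public void setTransactionSuccessful() { successful = true; }
        public void endTransaction() { inTransaction = false; }
    }

    public static void bulkWrite(FakeDb db, Runnable body) {
        db.beginTransaction();
        try {
            body.run();
            db.setTransactionSuccessful();
        } finally {
            // Always runs, so a thrown exception can't leave the
            // transaction open.
            db.endTransaction();
        }
    }
}
```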

::: mobile/android/base/java/org/mozilla/gecko/db/SyncProvider.java:83
(Diff revision 6)
> +        }
> +    }
> +
> +    @Override
> +    protected Uri insertInTransaction(Uri uri, ContentValues values) {
> +        return null;

Make these unimplemented methods throw.

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/middleware/storage/BufferStorage.java:36
(Diff revision 6)
>  
>      // For buffers that are filled up oldest-first this is a high water mark, which enables resuming
>      // a sync.
>      long latestModifiedTimestamp();
> +
> +    void setContext(Context context);

This is a huge wart. Whether this needs a context or not is an implementation detail, and you can see that with the `null` you have to pass around in tests.

Make the creator of the `BufferStorage` hand it the context as a constructor argument, and leave it out of the interface and method signatures.

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/middleware/storage/HistoryDatabaseBufferStorage.java:44
(Diff revision 6)
> +        cv.put(HistoryBuffer.GUID, historyRecord.guid);
> +        cv.put(HistoryBuffer.TITLE, historyRecord.title);
> +        cv.put(HistoryBuffer.URL, historyRecord.histURI);
> +        cv.put(HistoryBuffer.IS_DELETED, historyRecord.deleted);
> +        // NB: We don't convert from microseconds to milliseconds as we would for "last visited".
> +        // This is field is stored purely for sync-related operations, in the buffer table.

`lastModified` isn't a microsecond timestamp anyway. Clarify the comment?

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/middleware/storage/HistoryDatabaseBufferStorage.java:114
(Diff revision 6)
> +                    record.lastModified = dateModified;
> +                    record.title = title;
> +                    record.deleted = (isDeleted == 1);
> +                    record.visits = RepoUtils.getJSONArrayFromCursor(cursor, HistoryBuffer.VISITS);
> +
> +                    records.add(record);

Consider using `ArrayList.ensureCapacity` rather than relying on it growing as you add one at a time.

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/middleware/storage/MemoryBufferStorage.java:69
(Diff revision 6)
>  
>          return lastModified;
>      }
> +
> +    @Override
> +    public void setContext(Context context) {}

See!

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/stage/AndroidBrowserRecentHistoryServerSyncStage.java:59
(Diff revision 6)
>      }
>  
>      /**
> +     * Return a memory-buffered local repository. We don't want to use the same database-backed buffer
> +     * as does the main history repository, because we're fetching newest-first, and the main repository
> +     * is fetching oldest-first. This will lead to a wrong "high water-mark" timestamp.

Also we're unlikely to be batching downloads, so there's no point in writing intermediate state to the DB.
Comment on attachment 8808780 [details]
Bug 1291821 - Track incomplete stages and re-sync them

https://reviewboard.mozilla.org/r/91520/#review92412

::: mobile/android/services/src/main/java/org/mozilla/gecko/fxa/sync/FxAccountSyncDelegate.java:52
(Diff revision 2)
>      syncResult.stats.numUpdates += 1;
>      syncResult.stats.numIoExceptions += 1;
>    }
>  
> +  public void requestFullSync() {
> +    syncResult.fullSyncRequested = true;

Have you tested this? IIRC Android immediately requests another sync, and our de-bouncer ignores it.

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/CollectionModifiedException.java:13
(Diff revision 2)
> + * Thrown when a collection has been modified by another client while we were either
> + * downloading from it or uploading to it.
> + *
> + * @author grisha
> + */
> +public class CollectionModifiedException extends Exception {

`CollectionConcurrentModificationException`.

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/ReflowIsNecessaryException.java:8
(Diff revision 2)
> + * file, You can obtain one at http://mozilla.org/MPL/2.0/. */
> +
> +package org.mozilla.gecko.sync;
> +
> +/**
> + * Used by SynchronizerSession to indicate that reflow of the stage is necessary.

What does reflow mean? Do you mean a full re-sync from 0, a re-download of batched records from 0, a re-download from the previous persisted timestamp, or something else?

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/SyncDeadlineException.java:12
(Diff revision 2)
> +/**
> + * Thrown when we've hit a self-imposed sync deadline, and decided not to proceed.
> + *
> + * @author grisha
> + */
> +public class SyncDeadlineException extends Exception {

`SyncDeadlineReachedException`

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/repositories/uploaders/PayloadUploadDelegate.java:171
(Diff revision 2)
>      }
>  
>      @Override
>      public void handleRequestFailure(final SyncStorageResponse response) {
> +        if (response.getStatusCode() == 412) {
> +            uploader.collectionModified();

`concurrentModificationDetected`?

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/stage/ServerSyncStage.java:626
(Diff revision 2)
>        } else {
>          session.interpretHTTPFailure(response.httpResponse()); // Does not call session.abort().
>        }
>      }
>  
> +    // If we've hit a 412, or we ran out of time during this stage, request a full sync.

If we hit a 412, we simply discard our current batch continuation offset token and try again, right? That's not a full sync, which means downloading all records on the server.

Similarly, if we ran out of time, we just pick up where we left off next time. That's also not a full sync.
Iteration: 1.8 → 1.9
Comment on attachment 8800062 [details]
Bug 1291821 - Move bulk insert logic for new history to BrowserProvider

Clearing flag while waiting for changes.
Attachment #8800062 - Flags: review?(rnewman)
Attachment #8800497 - Flags: review?(rnewman)
Whiteboard: [MobileAS] → [MobileAS][data-integrity]
Whiteboard: [MobileAS][data-integrity] → [MobileAS]
Comment on attachment 8808780 [details]
Bug 1291821 - Track incomplete stages and re-sync them

https://reviewboard.mozilla.org/r/91520/#review92412

Addressed your feedback, cleaned up some things, and changed how re-syncing is done. Something isn't working quite right with the sync mechanics after these changes, and I haven't yet tracked the problem down, so there will be a follow-up.

> Have you tested this? IIRC Android immediately requests another sync, and our de-bouncer ignores it.

Yup, this didn't work. I've switched to tracking incomplete stages and requesting an immediate sync for those stages after the current sync is complete.

> If we hit a 412, we simply discard our current batch continuation offset token and try again, right? That's not a full sync, which means to download all records on the server.
> 
> Similarly, if we ran out of time, we just pick up where we left off next time. That's also not a full sync.

That's correct, we simply pick up where we left off in either case during a re-sync. I've changed wording throughout and added comments.
Comment on attachment 8800055 [details]
Bug 1291821 - Keep track of sync deadline

https://reviewboard.mozilla.org/r/85070/#review86704

> I've got to pick a number though, so what do you have in mind? Android docs mention 30m deadline for a non-user initiated sync, and I believe no limit for a sync that was initiated by the user themselves.
> 
> What might be the harm in setting our deadline at the max of allowable time? What's currently the longest stage - full history sync - is the very last one (after the split into recent & full history). So at worst, we'll take forever churning through batches of history records, filling up the buffer, and hopefully not hit the limit. However, while we're churning, user already has the recent history data, and the bookmarks (which might also take a while to sync), so as far as they're concerned, life is generally good.
> 
> Other than us burning through their battery in the background with the constant radio load, that is... We could pick a closer deadline, but since user data is still there, and we're still aiming to sync all of it, we'll likely be increasing total time to complete a sync. But, I feel like we've had this discussion before :-)

Switched to 10m as a quick solution for now. Everything's in place to customize this value in case we're trying to re-sync an incomplete stage (if we ran out of time, for example).
Comment on attachment 8800062 [details]
Bug 1291821 - Move bulk insert logic for new history to BrowserProvider

https://reviewboard.mozilla.org/r/85084/#review88938

> Another approach is to define a ContentProvider URI for visits -- oh hey, we have one! -- and define its insert logic to be a non-duplicating insert that doesn't notify if nothing changed.
> 
> That way you get to loop this insertion/change count check/update logic within a transaction block, queuing notifications until the transaction is committed.
> 
> In other words: push this logic into BrowserProvider.
> 
> See BOOKMARKS_POSITIONS/updateBookmarkPositions for an example of doing more sophisticated work within a CP.

I've changed two things.
1) moved all of the insertion logic down to BrowserProvider; the whole bulkInsert now happens in a transaction at that level.
2) realized that these are supposed to be new records that we're adding, so the dance with tracking visit inserts and updating remote visit counts is not necessary.

As a result the whole thing is much simpler now.

However, I've had to add some defensive checks to protect against duplicated visits in history records. Apparently that's a thing. Looking at the actual failing records, their timestamp resolution is in milliseconds, padded with a few zeros to get to microseconds - Fennec does this for locally generated visits.

My guess is that this loss of resolution and some close-in-time visits are causing these conflicts. Conflicts were pretty rare for me in local testing (something like 5 duplicates over a set of around 150-200k visits), and all had millisecond resolution.
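For illustration, the resolution loss described above can be sketched in plain Java (the timestamps are made up; Fennec's actual padding happens when locally generated visits are stored):

```java
public class VisitResolution {
    // Fennec-style padding: a millisecond timestamp is multiplied by 1000
    // to produce a "microsecond" value whose last three digits are zeros.
    static long padToMicros(long millis) {
        return millis * 1000;
    }

    public static void main(String[] args) {
        // Two genuinely distinct visits, 300 microseconds apart...
        long visitA = 1470000000123456L;
        long visitB = 1470000000123756L;

        // ...collapse to the same value once their millisecond parts are
        // truncated and padded back out, which is how duplicate
        // (guid, date_visited) pairs can arise.
        long paddedA = padToMicros(visitA / 1000);
        long paddedB = padToMicros(visitB / 1000);
        System.out.println(paddedA == paddedB); // prints "true"
    }
}
```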
Comment on attachment 8800062 [details]
Bug 1291821 - Move bulk insert logic for new history to BrowserProvider

https://reviewboard.mozilla.org/r/85084/#review93962

::: mobile/android/base/java/org/mozilla/gecko/db/BrowserProvider.java:2324
(Diff revision 8)
> +        }
> +    }
> +
> +    private int bulkInsertHistory(final SQLiteDatabase db, ContentValues[] values) {
> +        int inserted = 0;
> +        final String fullInsertSqlStatement = "INSERT INTO " + History.TABLE_NAME + " (" +

Decide whether you want `INSERT OR REPLACE`, `INSERT OR IGNORE`, or for this entire batch to fail in the case of a uniqueness constraint violation on `GUID`.

::: mobile/android/base/java/org/mozilla/gecko/db/BrowserProvider.java:2372
(Diff revision 8)
> +                statementToExec.bindString(3, url);
> +                if (dateLastVisited != null) {
> +                    statementToExec.bindLong(4, dateLastVisited);
> +                    statementToExec.bindLong(5, remoteDateLastVisited);
> +                    statementToExec.bindLong(6, visits);
> +                    statementToExec.bindLong(7, visits);

Add a comment that this is because there are three visit count columns, and in this case two of them will be the same.

::: mobile/android/base/java/org/mozilla/gecko/db/BrowserProvider.java:2387
(Diff revision 8)
> +        }
> +        return inserted;
> +    }
> +
> +    private int bulkInsertVisits(SQLiteDatabase db, ContentValues[][] valueSets) {
> +        final String insertSqlStatement = "INSERT INTO " + Visits.TABLE_NAME + " (" +

I think you probably want `INSERT OR IGNORE` here. There's a unique index on (guid, date_visited), so if you end up inserting a duplicate somehow, this will raise an error.

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/repositories/android/AndroidBrowserHistoryDataAccessor.java:134
(Diff revision 8)
> -                BrowserContract.Visits.CONTENT_URI,
> -                VisitsHelper.getVisitsContentValues(rec.guid, rec.visits)
> -        );
>  
> -        // If we just inserted any visits, update remote visit aggregate values.
> -        // While inserting visits, we might not insert all of rec.visits - if we already have a local
> +    // Let our ContentProvider handle insertion of everything.
> +    final Bundle result = context.getContentResolver().call(

Added in API 11, wanted for years and years. Hooray!

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/repositories/android/AndroidBrowserHistoryRepositorySession.java:31
(Diff revision 8)
>    public static final String LOG_TAG = "ABHistoryRepoSess";
>  
>    /**
>     * The number of records to queue for insertion before writing to databases.
>     */
> -  public static final int INSERT_RECORD_THRESHOLD = 50;
> +  public static final int INSERT_RECORD_THRESHOLD = 5000;

Are you sure? That's a lot of buffering. Consider also that you end up with ContentValues, bundles, and then they get expanded again into arrays and bindings… what's the peak memory usage here for 5000 items x 20 visits?
Attachment #8800062 - Flags: review?(rnewman) → review+
Comment on attachment 8800062 [details]
Bug 1291821 - Move bulk insert logic for new history to BrowserProvider

https://reviewboard.mozilla.org/r/85084/#review93962

> Decide whether you want `INSERT OR REPLACE`, `INSERT OR IGNORE`, or for this entire batch to fail in the case of a uniqueness constraint violation on `GUID`.

What I do now is catch SQL exceptions arising from constraint violations and log them. So it's the same behaviour as INSERT OR IGNORE, except this gives us insight into when the failures are happening. This event would be a good candidate for telemetry. We shouldn't be trying to insert a supposedly new record with a conflicting GUID, so it seems that we shouldn't just ignore situations where this happens.

> I think you probably want `INSERT OR IGNORE` here. There's a unique index on (guid, date_visited), so if you end up inserting a duplicate somehow, this will raise an error.

Same comment as above, with a clarification: these conflicts do happen, they're rare, and could be a result of us converting milliseconds to microseconds when we're tracking visits in Fennec.

> Are you sure? That's a lot of buffering. Consider also that you end up with ContentValues, bundles, and then they get expanded again into arrays and bindings… what's the peak memory usage here for 5000 items x 20 visits?

Some rough calculations: a record with a title on the longer side and 20 visits is around 1 KB. With an overhead of perhaps 2x (not sure about this number...) for all of the data structures, we're probably looking at about 10 MB of memory per 5000 records at peak. Nightly is using 172 MB of RAM with one tab open right now, so this doesn't seem too outrageous?
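Writing that back-of-the-envelope estimate out explicitly (the 1 KB record size and 2x overhead factor are the assumptions from the comment above, not measured values):

```java
public class BufferMemoryEstimate {
    // Assumptions (not measured): ~1 KB per history record with ~20 visits,
    // and a ~2x overhead factor for the intermediate ContentValues, bundles,
    // and statement bindings the record passes through.
    static long peakBytes(long records, long bytesPerRecord, long overheadFactor) {
        return records * bytesPerRecord * overheadFactor;
    }

    public static void main(String[] args) {
        long peak = peakBytes(5000, 1024, 2);
        System.out.println(peak); // prints "10240000", i.e. roughly 10 MB at peak
    }
}
```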
Blocks: 1318515
Blocks: 1316110
See Also: 1316110
Comment on attachment 8808780 [details]
Bug 1291821 - Track incomplete stages and re-sync them

https://reviewboard.mozilla.org/r/91520/#review94208

Is there a good reason why this doesn't have tests? At the very least I expect you want to simulate XIUS failure on upload, 412 during batching, and deadline hit, and ensure that we request a follow-up sync, with the right timestamps and batching state…

::: mobile/android/services/src/main/java/org/mozilla/gecko/fxa/sync/FxAccountSyncAdapter.java:597
(Diff revision 5)
>      }
>  
> +    // If there have been any incomplete stages, request a follow-up sync. Otherwise, we're done.
> +    // Incomplete stage is:
> +    // - one that hit a 412 error during either upload or download of data, indicating that
> +    //   its collection have been modified remotely, or

s/have/has

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/ReflowIsNecessaryException.java:9
(Diff revision 5)
> +
> +package org.mozilla.gecko.sync;
> +
> +/**
> + * Used by SynchronizerSession to indicate that reflow of a stage is necessary.
> + * To reflow a stage is to request that it is synced using the same (or earlier) timestamps.

You've noted that there are three ways this can happen:

- Deadline hit. In this case we want to continue where we left off, not using the same timestamp per se.
- 412 during batching. In this case we want to reflow with the old timestamp unless we're trying a high-water-mark, in which case we can pick up where we left off.
- 412 during upload. In this case we simply abort, and pick up next time at the end of our completed download -- we don't try the previous download again, because there's no need. Any new records will have later timestamps.
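A hypothetical sketch of that decision table (the enum and method names are illustrative, not the patch's actual API, and the offset-resume detail of the deadline case is collapsed into the high-water-mark path):

```java
public class ReflowPolicy {
    enum Failure { DEADLINE_HIT, MODIFIED_DURING_DOWNLOAD, MODIFIED_DURING_UPLOAD }

    // Returns the timestamp the next flow of this stage should start from.
    static long nextFlowTimestamp(Failure failure, long lastSynced,
                                  long highWaterMark, boolean hwmEnabled) {
        switch (failure) {
            case DEADLINE_HIT:
                // Continue where we left off rather than repeating the download.
                return hwmEnabled ? highWaterMark : lastSynced;
            case MODIFIED_DURING_DOWNLOAD:
                // Reflow with the old timestamp, unless a high water mark
                // lets us pick up where we left off.
                return hwmEnabled ? highWaterMark : lastSynced;
            case MODIFIED_DURING_UPLOAD:
                // Download already completed; next sync starts at its end.
                return lastSynced;
            default:
                throw new IllegalStateException("unknown failure");
        }
    }

    public static void main(String[] args) {
        System.out.println(nextFlowTimestamp(Failure.DEADLINE_HIT, 100L, 250L, true)); // prints "250"
        System.out.println(nextFlowTimestamp(Failure.MODIFIED_DURING_UPLOAD, 100L, 250L, true)); // prints "100"
    }
}
```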

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/middleware/BufferingMiddlewareRepositorySession.java:83
(Diff revision 5)
> +     * When source fails to provide all records, we need to decide what to do with the buffer.
> +     * We might fail because of a network partition, or because of a concurrent modification of a
> +     * collection, or because we ran out of time fetching records, or some other reason.
> +     *
> +     * Either way we do not clear the buffer in any error scenario, but rather
> +     * allow it to be re-filled, replacing existing records with their newer version of necessary.

s/version of/versions if

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/synchronizer/RecordsChannel.java:320
(Diff revision 5)
> +    // we don't need them (case 1).
> +    waitingForQueueDone = false;
> +
> +    // If consumer is still going at it, tell it to stop.
> +    this.consumer.halt();
> +    

Nit: whitespace.

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/synchronizer/RecordsChannel.java:370
(Diff revision 5)
> +  @Nullable
> +  /* package-local */ synchronized Exception getReflowException() {
> +    return reflowException;
> +  }
> +
> +  private synchronized void setReflowException(@NonNull Exception e) {

Why isn't this more tightly typed, with casts in the call sites alongside their `instanceof` checks?

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/synchronizer/ServerLocalSynchronizerSession.java:33
(Diff revision 5)
>    }
>  
>    @Override
>    public void onFirstFlowCompleted(RecordsChannel recordsChannel, long fetchEnd, long storeEnd) {
> +    // If a "reflow exception" was thrown, consider this synchronization failed.
> +    final Exception reflowException = recordsChannel.getReflowException();

You realize we now have flow control through exceptions, delegate methods, callbacks, channels, and now a side-channel conditional by poking internal state, right? :P
Iteration: 1.9 → 1.10
Comment on attachment 8800497 [details]
Bug 1291821 - Allow BatchingDownloader to resume downloads using offset

https://reviewboard.mozilla.org/r/85408/#review96708

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/stage/ServerSyncStage.java:175
(Diff revision 10)
>  
>    protected String bundlePrefix() {
>      return this.getCollection() + ".";
>    }
>  
> +  // TODO is this the right place for this??

WIP, need to reset persistent state when appropriate. Currently I'm thinking that `wipeLocal` and `resetLocal` are the two appropriate places.
Pushed up a patch which introduces resuming of batching downloads using offset and high water mark as appropriate, as well as a bunch of fixes and tests to ensure that RecordsChannel is doing the right thing when we hit a "reflow" exception (412 or a sync deadline).

Still WIP:
- tests to verify re-sync behaviour in case of a reflow exception
- resetting of repository state (used by the downloader) when appropriate (see Comment 234).
Iteration: 1.10 → 1.11
Iteration: 1.11 → 1.12
Iteration: 1.12 → 1.13
This still needs more tests, but:
- download resume logic is sane now, following the discussion in Bug 1330839. Download state ("offset context") is encapsulated in RepositoryStateProvider
- RepositoryStateProvider reset logic is in place, ensuring it's nulled out whenever a stage is reset or wiped
- History stage is no longer wrapped in any buffers. Records go straight to live storage, high water mark is maintained and used across syncs
- HistorySession's internal recordsBuffer is now flushed on "storeIncomplete" events; otherwise, if we hit a 412 or a deadline while the internal buffer isn't empty, our h.w.m. will get ahead of what's actually in the local store

Richard, would you mind giving this a review pass?
Flags: needinfo?(rnewman)
(In reply to :Grisha Kruglov from comment #269)
> Richard, would you mind giving this a review pass?

Hmm, not sure I worded this correctly. Not necessarily an r+, of course :-)
Will get to this when the US work week begins! :D
(In reply to Richard Newman [:rnewman] from comment #271)
> Will get to this when the US work week begins! :D

Thanks :-) I just noticed that there is some weirdness in how recent and regular history stages interact. I'll push an update soon.
Comment on attachment 8808780 [details]
Bug 1291821 - Track incomplete stages and re-sync them

https://reviewboard.mozilla.org/r/91520/#review106110

Make sure we have hooks to clear buffers in all the expected places, noted inline.

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/ReflowIsNecessaryException.java:16
(Diff revision 8)
> + * Stages which complete only partially due to hitting a concurrent collection modification error or
> + * hitting a sync deadline should be re-flowed as soon as possible.
> + *
> + * @author grisha
> + */
> +public class ReflowIsNecessaryException extends Exception {

Name the stage, if only for debugging's sake?

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/delegates/GlobalSessionCallback.java:41
(Diff revision 8)
>  
>    void handleAborted(GlobalSession globalSession, String reason);
>    void handleError(GlobalSession globalSession, Exception ex);
>    void handleSuccess(GlobalSession globalSession);
>    void handleStageCompleted(Stage currentState, GlobalSession globalSession);
> -
> +  void handleIncompleteStage(Stage currentState, GlobalSession globalSession);

Newline after.

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/middleware/BufferingMiddlewareRepositorySession.java:83
(Diff revision 8)
> +     * When source fails to provide all records, we need to decide what to do with the buffer.
> +     * We might fail because of a network partition, or because of a concurrent modification of a
> +     * collection, or because we ran out of time fetching records, or some other reason.
> +     *
> +     * Either way we do not clear the buffer in any error scenario, but rather
> +     * allow it to be re-filled, replacing existing records with their newer versions if necessary.

We do, however, clear the buffer if:

* The user signs out.
* The syncID for the collection or the account itself changes.
* Our server URL changes.
* Our crypto keys change.
* A resetClient command is processed (similar flow to the above).
* A wipeClient command is processed.
* The engine is disabled.

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/middleware/BufferingMiddlewareRepositorySession.java:133
(Diff revision 8)
> -            for (Record record : buffer) {
> +            for (Record record : bufferData) {
>                  this.inner.store(record);
>              }

Future, to think about: can we add safety here by adding begin/end pairs around this? If a repository knows it'll only ever be called in a batch, then it can reasonably begin and end a transaction, ensuring that the merge happens transactionally.

In this case we'd also need a kind of local concurrent write exception, where the transaction transiently failed to apply -- don't clear the buffer, just come back shortly.
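That future shape might look roughly like this (a sketch only: the `Store` interface is a stand-in defined here, not the repository API, and a real implementation would use SQLiteDatabase's beginTransaction/setTransactionSuccessful/endTransaction):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TransactionalFlush {
    // Minimal stand-in for a sink that supports transactional batches.
    interface Store {
        void begin();
        void store(String record);
        void commit();
        void rollback();
    }

    // Flush the whole buffer atomically: either every record lands and the
    // buffer is cleared, or nothing lands and the buffer is kept so we can
    // come back shortly (e.g. after a transient local-concurrent-write failure).
    static boolean flush(List<String> buffer, Store inner) {
        inner.begin();
        try {
            for (String record : buffer) {
                inner.store(record);
            }
            inner.commit();
            buffer.clear();
            return true;
        } catch (RuntimeException e) {
            inner.rollback();
            return false;
        }
    }

    public static void main(String[] args) {
        final List<String> stored = new ArrayList<>();
        Store memory = new Store() {
            public void begin() {}
            public void store(String r) { stored.add(r); }
            public void commit() {}
            public void rollback() { stored.clear(); }
        };
        List<String> buffer = new ArrayList<>(Arrays.asList("a", "b"));
        System.out.println(flush(buffer, memory) && buffer.isEmpty()); // prints "true"
    }
}
```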
Comment on attachment 8800497 [details]
Bug 1291821 - Allow BatchingDownloader to resume downloads using offset

https://reviewboard.mozilla.org/r/85408/#review106122

See previous note about resetting in the right spots.

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/repositories/ConfigurableServer15Repository.java:60
(Diff revision 14)
> +    this.highWaterMark = highWaterMark;
> +
> +    // Sanity check: let's ensure we're configured correctly. At this point in time, it doesn't make
> +    // sense to use H.W.M. with a non-persistent state provider. This might change if we start retrying
> +    // during a download in case of 412s.
> +    if (stateProvider.getClass().equals(NonPersistentRepositoryStateProvider.class)

I think you want `stateProvider.isPersistent()`.

(And even if you were to persist with this approach, it should be `instanceof`, not using `getClass`.)

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/repositories/PersistentRepositoryStateProvider.java:33
(Diff revision 14)
> +        this.editor = prefs.edit();
> +    }
> +
> +    @Override
> +    public void commit() {
> +        this.editor.apply();

`apply` doesn't tell you if the write to disk fails, and your code will continue merrily along before the write even begins.

For safety's sake, I think you want `commit` here instead of `apply`.
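The difference matters because `SharedPreferences.Editor.commit()` returns a boolean while `apply()` returns void. A checked write then looks roughly like this (the `Editor` interface is stubbed out here so the sketch is self-contained; it mirrors only the relevant slice of the Android API):

```java
public class CheckedCommit {
    // Minimal stand-in for the relevant slice of SharedPreferences.Editor.
    interface Editor {
        Editor putString(String key, String value);
        boolean commit(); // synchronous; returns false if the disk write failed
    }

    // Persist the batching offset and tell the caller whether it actually stuck.
    // A failed commit means we must not advance in-memory state past what's on disk.
    static boolean persistOffset(Editor editor, String offset) {
        return editor.putString("offset", offset).commit();
    }

    public static void main(String[] args) {
        Editor alwaysFails = new Editor() {
            public Editor putString(String key, String value) { return this; }
            public boolean commit() { return false; }
        };
        System.out.println(persistOffset(alwaysFails, "abc")); // prints "false"
    }
}
```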

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/repositories/RepositoryStateProvider.java:41
(Diff revision 14)
> +    public abstract @Nullable String getString(String key);
> +
> +    public abstract void setLong(String key, Long value);
> +    public abstract @Nullable Long getLong(String key);
> +
> +    public void reset() {

I can easily imagine that implementations might want their own behavior of `reset` — e.g., to remove multiple keys at once.

At that point this becomes an interface instead of an abstract class, which I think makes more sense.

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/repositories/Server15RepositorySession.java:122
(Diff revision 14)
> +   * and repository is set to fetch oldest-first.
> +   * Otherwise, returns repository's last-synced timestamp.
> +   */
>    @Override
> -  public boolean dataAvailable() {
> -    return serverRepository.updateNeeded(getLastSyncTimestamp());
> +  public long getLastSyncTimestamp() {
> +    if (!serverRepository.getAllowHighWaterMark() || !serverRepository.getSortOrder().equals("oldest")) {

Note that in principle you could use a low water mark and sort=newest since Bug 1314171.

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/repositories/downloaders/BatchingDownloaderController.java:41
(Diff revision 14)
> +    }
> +
> +    private static ResumeContext getResumeContext(RepositoryStateProvider stateProvider, Long since, String order) {
> +        final String offset = stateProvider.getString(RepositoryStateProvider.KEY_OFFSET);
> +
> +        if (offset != null) {

Invert conditional and early return.
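That is, roughly (a hypothetical simplification of the quoted method, not the actual patch):

```java
public class ResumeDescription {
    // Early-return style: dispose of the "no saved offset" case first, so the
    // resume path isn't nested inside a conditional.
    static String describe(String offset, long since) {
        if (offset == null) {
            return "fresh flow since " + since;
        }
        return "resume from offset " + offset;
    }

    public static void main(String[] args) {
        System.out.println(describe(null, 100L));  // prints "fresh flow since 100"
        System.out.println(describe("abc", 100L)); // prints "resume from offset abc"
    }
}
```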
Attachment #8800497 - Flags: review?(rnewman) → review+
Comment on attachment 8815607 [details]
Bug 1291821 - Post: remove unused files

https://reviewboard.mozilla.org/r/96476/#review106132
Attachment #8815607 - Flags: review?(rnewman) → review+
Blocks: 1332094
Comment on attachment 8808780 [details]
Bug 1291821 - Track incomplete stages and re-sync them

https://reviewboard.mozilla.org/r/91520/#review106110

> Name the stage, if only for debugging's sake?

In the context of how this exception is used, this doesn't feel necessary (it's also annoying to do low enough in the call stack to be useful if they're initiated within a local session). These reflow exceptions bubble their way up to ServerSyncStage, at which point they're mapped to a GlobalSession callback call to handle an incomplete stage, explicitly naming the current collection so that it may be tracked for a re-sync.
While you're debugging a flow, you're explicitly aware of which stage you're in at the moment... and once you move beyond a stage, it doesn't matter.

> We do, however, clear the buffer if:
> 
> * The user signs out.
> * The syncID for the collection or the account itself changes.
> * Our server URL changes.
> * Our crypto keys change.
> * A resetClient command is processed (similar flow to the above).
> * A wipeClient command is processed.
> * The engine is disabled.

I'm considering addressing this in a follow-up, probably as part of Bug 1318515 or similar. Given the lifetime and temporal nature of the currently-used memory buffers, explicitly clearing them at the noted points is neither possible nor necessary.

> Future, to think about: can we add safety here by adding begin/end pairs around this? If a repository knows it'll only ever be called in a batch, then it can reasonably begin and end a transaction, ensuring that the merge happens transactionally.
> 
> In this case we'd also need a kind of local concurrent write exception, where the transaction transiently failed to apply -- don't clear the buffer, just come back shortly.

Good call, I filed Bug 1332094 to tackle transactional merging.
Blocks: 1332431
Comment on attachment 8800497 [details]
Bug 1291821 - Allow BatchingDownloader to resume downloads using offset

https://reviewboard.mozilla.org/r/85408/#review106122

If I understand the flow of things correctly, resetting in ServerSyncStage's resetLocal should be enough. I'll investigate more to see if I'm missing anything.

> `apply` doesn't tell you if the write to disk fails, and your code will continue merrily along before the write even begins.
> 
> For safety's sake, I think you want `commit` here instead of `apply`.

Switched to commit, annotated methods using it with @CheckResult, and documented in-line the consequences of a failing commit (for setting and updating context, committing the h.w.m., etc.) and how the system will behave & recover. Also filed Bug 1332431 to limit the impact of SharedPreferences.Editor's commit failures in various cases, beyond just repository state.

> Note that in principle you could use a low water mark and sort=newest since Bug 1314171.

Going to look at these improvements in Bug 1316110.

> Invert conditional and early return.

I've tightened up setting/updating of resume context, splitting it into "set initial" and "update", with assertions in place to enforce correct ordering of operations.
Comment on attachment 8808780 [details]
Bug 1291821 - Track incomplete stages and re-sync them

https://reviewboard.mozilla.org/r/91520/#review94208

> You've noted that there are three ways this can happen:
> 
> - Deadline hit. In this case we want to continue where we left off, not using the same timestamp per se.
> - 412 during batching. In this case we want to reflow with the old timestamp unless we're trying a high-water-mark, in which case we can pick up where we left off.
> - 412 during upload. In this case we simply abort, and pick up next time at the end of our completed download -- we don't try the previous download again, because there's no need. Any new records will have later timestamps.

I've fixed up the class comment. A "re-flow", or a "re-sync" of a stage is no different from a regular sync of the same stage. Depending on the resume context, presence/absence of high-water-mark, sort order, last-synced timestamps, etc. we might do any of the following:
- sync a stage using an available high-water-mark
- resume a download using persisted offset & "offset context"
- sync "regularly" from collection's last-synced timestamp

Currently we have no way to skip some of the flow (e.g. don't download, and just upload). We always flow bi-directionally.
Latest patches address review comments, and make some other changes:
- While buffering a stage and dealing with a non-persistent buffer, ignore the sync deadline. If we hit a deadline and re-sync, a non-persistent buffer means that we'll have to re-download a bunch of records we just had.
- When considering whether we should flow a stage from a last-synced timestamp or from a high-water-mark, in the presence of a h.w.m. pick whichever timestamp is greater. After a successful sync, it's expected that our h.w.m. will be behind our last-synced timestamp. This change is not strictly necessary; we should behave correctly in either case. But it makes the flow of records easier to reason about. Note that this doesn't affect our X-I-U-S headers at all (for upload it's always the collection's last-synced, and for batching downloads it's whatever the server returns to us as last-modified on the first request).
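The timestamp selection in the second point is just a max over the two candidates. A sketch (illustrative names, not the patch's actual API):

```java
public class FlowTimestamp {
    // In the presence of a high water mark, flow from whichever point is
    // further along. After a fully successful sync the h.w.m. trails the
    // last-synced timestamp, so this normally degrades to plain last-synced.
    static long flowFrom(long lastSynced, Long highWaterMark) {
        if (highWaterMark == null) {
            return lastSynced;
        }
        return Math.max(lastSynced, highWaterMark);
    }

    public static void main(String[] args) {
        System.out.println(flowFrom(100L, 250L)); // mid-stage, h.w.m. ahead: prints "250"
        System.out.println(flowFrom(300L, 250L)); // after a clean sync: prints "300"
        System.out.println(flowFrom(300L, null)); // no h.w.m. at all: prints "300"
    }
}
```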
Iteration: 1.13 → 1.14
Comment on attachment 8808780 [details]
Bug 1291821 - Track incomplete stages and re-sync them

https://reviewboard.mozilla.org/r/91520/#review109678
Attachment #8808780 - Flags: review?(rnewman) → review+
Done.

Please test this thoroughly before landing. Extra points if you pick someone, block off an hour, and explain the entire diff to them. They won't necessarily come out with a good understanding, but I bet you spot one of your own mistakes.
Flags: needinfo?(rnewman)
Depends on: 1336001
Iteration: 1.14 → 1.15
Rebased on top of Bug 1336001, improved some tests, and added some "bail out early on errors" logic to BatchingUploader. I intend to land this once Bug 1336001 lands.
Iteration: 1.15 → 1.16
Re-based and re-tested this on top of Bug 1336001.
Pushed by gkruglov@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/f3e6086e7297
Pre: remove unused SerialRecordConsumer r=rnewman
https://hg.mozilla.org/integration/autoland/rev/ff7ff02d0edf
Add storeIncomplete to RepositorySession interface r=rnewman
https://hg.mozilla.org/integration/autoland/rev/86bc523c8c60
Buffering repository middleware r=rnewman
https://hg.mozilla.org/integration/autoland/rev/e41e4bbf72ab
Switch stage duration interval counting to use elapsedRealtime r=rnewman
https://hg.mozilla.org/integration/autoland/rev/07c5a4e93738
Rename RepositorySession's delegate to storeDelegate, for clarity r=rnewman
https://hg.mozilla.org/integration/autoland/rev/3d8a46a90f23
Add onBatchComplete to a FetchRecordsDelegate r=rnewman
https://hg.mozilla.org/integration/autoland/rev/75035bf6dd1e
Simplify onFetchFailed, clean up some exception code r=rnewman
https://hg.mozilla.org/integration/autoland/rev/4ecaf3862bc7
Decouple BatchingUploader from Server11Repository r=rnewman
https://hg.mozilla.org/integration/autoland/rev/63957df45bac
Remove total sync limits, refactor batching downloader r=rnewman
https://hg.mozilla.org/integration/autoland/rev/50608e055005
Keep track of sync deadline r=rnewman
https://hg.mozilla.org/integration/autoland/rev/3a43ecf3c625
Use sync deadline to decide of batching downloader should proceed r=rnewman
https://hg.mozilla.org/integration/autoland/rev/d61efb5192df
Wrap local repositories in buffering middleware r=rnewman
https://hg.mozilla.org/integration/autoland/rev/7e35c924287a
Split history stage into recent and full history stages r=rnewman
https://hg.mozilla.org/integration/autoland/rev/c027b2bb0734
Ensure sink repository is aware of new events coming from the source r=rnewman
https://hg.mozilla.org/integration/autoland/rev/09ae692fe42c
Get tests to work after sync changes r=rnewman
https://hg.mozilla.org/integration/autoland/rev/c1069ad96647
Rename repositories/sessions r=rnewman
https://hg.mozilla.org/integration/autoland/rev/28164480660d
Move bulk insert logic for new history to BrowserProvider r=rnewman
https://hg.mozilla.org/integration/autoland/rev/3c89bba23c2d
Track incomplete stages and re-sync them r=rnewman
https://hg.mozilla.org/integration/autoland/rev/5e0b1684de40
Allow BatchingDownloader to resume downloads using offset r=rnewman
https://hg.mozilla.org/integration/autoland/rev/bd232d46a396
Post: remove unused files r=rnewman
Blocks: 1291822
See Also: → 1343726
Added to Fx54 (Beta) release notes.
Product: Android Background Services → Firefox for Android
Product: Firefox for Android → Firefox for Android Graveyard
You need to log in before you can comment on or make changes to this bug.