Closed Bug 1392505 Opened 7 years ago Closed 7 years ago

Comparing local device vs server clocks in order to determine if collection has new data available

Categories

(Firefox for Android Graveyard :: Android Sync, enhancement, P1)


Tracking

(firefox57 verified)

RESOLVED FIXED
Firefox 57

People

(Reporter: Grisha, Assigned: Grisha)

Details

Attachments

(2 files)

While syncing a collection, Android Sync sets up a local and a remote "repository" and a "session"; the repository has persisted state. For the purposes of this bug, we only care about the "timestamp" portion of that state.

A sync is "first, flow records from remote session to local session. then, flow records the other way."

First flow is "download stuff from the server", let's call it flowAtoB.
Second flow is "upload stuff to the server", let's call it flowBtoA.

We don't want to download if there's nothing new, and so before we try to sync any collections, we first fetch info/collections, which has last-modified timestamps for each collection.

We then perform flowAtoB.

The first thing it does is check if there's data available [0], and short-circuit to flowBtoA if there isn't.

To check if there's data available for flowAtoB, we compare the timestamp for a collection, which we downloaded from the info/collections endpoint, to a timestamp we have persisted for repositoryA.
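That check can be sketched as follows. The class and method names here are invented for illustration, not the actual identifiers in RecordsChannel; the point is that both operands must originate from the same clock (the server's) for the comparison to mean anything.

```java
// Hypothetical sketch of the data-availability check; real logic is in
// RecordsChannel [0]. Names are illustrative.
public final class DataAvailableCheck {
  // True when the server says the collection changed after the timestamp
  // we persisted for repositoryA. Only meaningful if both values came
  // from the server's clock.
  public static boolean isDataAvailable(long collectionLastModified,
                                        long persistedTimestamp) {
    return collectionLastModified > persistedTimestamp;
  }
}
```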

What is that timestamp? It could be a high-watermark in the case of batched downloads, to support more efficient downloading of records.

Or it could be a max(fetchEnd, storeEnd) - [1].

These two timestamps are obtained in different ways, depending on what happens during the sync.

Looking at onFlowCompleted calls in RecordsChannel [2] is telling: those timestamps are sometimes X-Last-Modified header values from the downloader (good!), and sometimes they are System.currentTimeMillis() (bad!).

In a common scenario - download some records, apply them and proceed - we end up calling onFlowCompleted(Header: X-Last-Modified, Device: local time when last record's store finished).

Depending on how the local device is configured, its local time could be pretty much anything. If it's 10 minutes ahead of server time, that means in the best case our syncs will lag by 10 minutes - unless the device also performs some local modifications to the data.
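The 10-minute example can be made concrete with a toy model (all names and values here are invented; timestamps are in milliseconds):

```java
// Toy model of the skew problem: if we persist a *local* timestamp while
// the device clock runs 10 minutes ahead, new server-side writes stay
// invisible until server time catches up to our local time.
public final class SkewDemo {
  public static boolean seesNewData(long serverLastModified, long persisted) {
    return serverLastModified > persisted;
  }

  public static void main(String[] args) {
    long serverNow = 1_000_000_000L;
    long skew = 10 * 60 * 1000;          // device is 10 minutes ahead
    long persisted = serverNow + skew;   // we stored System.currentTimeMillis()

    // Another client writes 5 minutes later; we still see "nothing new".
    long remoteWrite = serverNow + 5 * 60 * 1000;
    System.out.println(seesNewData(remoteWrite, persisted)); // prints "false"
  }
}
```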

There are various other scenarios under which we might end up comparing local and remote clocks, all with non-deterministic results. They can be found by tracing through [2].

I'm not sure why we do this, but it does seem to explain a lot of subtle bugs around delays in tabs or clients appearing, for example.

It seems that we should always be comparing the timestamp we obtain from info/collections to the X-Last-Modified header value from the latest GET or POST request we made against a given collection.
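Sync storage servers express X-Last-Modified as decimal seconds (e.g. "1504000000.12"). A sketch of deriving the persisted value purely from that header, never from the device clock, might look like this (the helper and its class are invented names):

```java
// Sketch: persist only server-origin time. Names are illustrative.
public final class ServerTimestamp {
  // Convert the server's decimal-seconds X-Last-Modified value into
  // milliseconds for local bookkeeping.
  public static long parseLastModifiedMillis(String headerValue) {
    return Math.round(Double.parseDouble(headerValue) * 1000.0);
  }
}
```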

[0] https://dxr.mozilla.org/mozilla-central/source/mobile/android/services/src/main/java/org/mozilla/gecko/sync/synchronizer/RecordsChannel.java#178
[1] https://dxr.mozilla.org/mozilla-central/source/mobile/android/services/src/main/java/org/mozilla/gecko/sync/synchronizer/SynchronizerSession.java#434
[2] https://dxr.mozilla.org/mozilla-central/search?q=onFlowCompleted+path%3ARecordsChannel.java&redirect=false
> in the best case our syncs will be lagging by 10 minutes - unless device also performs some local modifications to data

Actually, it's more subtle - and worse. After uploading, we'd still take the maximum of fetchEnd and storeEnd, where fetchEnd is going to be a local timestamp recorded while we get records from the database. If it's ahead of the server, we'd use it, and miss any new modifications from other clients until server time catches up to our local time.

I'll double check that this is indeed what's happening tomorrow with a fresh head.
Yikes. Yeah, we shouldn't be doing that. The remote to local fetch should always be advanced by download or by upload.
Assignee: nobody → gkruglov
Status: NEW → ASSIGNED
Priority: -- → P1
Inspecting this code closer, it does appear that I've misread it originally, and the problem is somewhat narrower in scope.

In practice, the one situation I can clearly see in which we'll end up comparing clocks of different origin is when the remote repository, acting as source (we're downloading), declares that it has no data available, and we don't have anything to upload. In that case, we're going to set the remote repository's lastModified timestamp to the current local time.

Unfortunately, that's likely the most common sync scenario, and so when we're checking if remote has any new data, very frequently we'll be comparing device's local time at some point vs server's time at some point.

Sidenote: our timestamp management logic lives in multiple layers of this onion, which makes it a bit hard to follow.

Sessions generate some "end" timestamps, sometimes real (X-Last-Modified values from the server), sometimes synthetic: 0 for "nothing was done" in the case of the uploader, or "now" after storing some records in the database instead of the records' timestamps, thus allowing for small races. RecordsChannel flows these timestamps upwards and makes up its own in various scenarios, in case of errors or optimizations (when skipping a flow). Its delegate SynchronizerSession uses the received timestamps to find the "latest" for each side of the flow (at this point we might have had errors, and the timestamps could be misleading or meaningless: we could be comparing a local clock vs a remote one, or a real timestamp vs 0L, etc.). It also decides whether synchronization is a success or a failure by inspecting store/fetch error counts. And finally, SyncStage at the very top decides whether timestamps are to be persisted, depending on the synchronization state (pass/fail). It is that last step that acts as a safeguard, preventing us from screwing up in a lot of cases.

And so, it's hard to make clear assertions about what should happen in case of partial downloads or uploads, since at every level we either have too little information, or we're working with wrong objects.

We also now have an awkward situation: sessions "own" the concept of "last modified", but all timestamp-based sessions make their decisions based on what the layers above decide.

Bookmarks (version-based) is the non-awkward one of the bunch, since in combination with the underlying data layer it owns "what should be uploaded" concept entirely on its own.
(In reply to :Grisha Kruglov from comment #3)

> And so, it's hard to make clear assertions about what should happen in case
> of partial downloads or uploads, since at every level we either have too
> little information, or we're working with wrong objects.

Yes. The repository abstraction is not good.

On the other hand, we do always have a server timestamp: info/collections (which has the current server time) and the collection modified time inside it. So it should never be the case that we need to make up a modified timestamp for the server.

There are actually two things that we use timestamps for:

1. Deciding whether there are remote changes. (We do the same thing locally, too.)
2. Deciding whether we've waited long enough and should sync again.

I suspect those two are conflated. The latter can use local time; the former cannot.

Oh, for a stronger type system.
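The split between the two uses can be made explicit in code. A sketch, with invented names, of keeping the two clock domains apart:

```java
// Sketch of the distinction above: remote-change detection must compare
// two server-origin timestamps, while "should we sync again yet" may use
// the local clock, since skew cancels out when both operands are local.
public final class SyncDecisions {
  public static boolean remoteHasChanges(long serverCollectionModified,
                                         long lastSeenServerModified) {
    // Both operands originate from server headers / info/collections.
    return serverCollectionModified > lastSeenServerModified;
  }

  public static boolean dueForSync(long nowLocalMillis,
                                   long lastSyncAttemptLocalMillis,
                                   long minIntervalMillis) {
    // Both operands originate from the device clock.
    return nowLocalMillis - lastSyncAttemptLocalMillis >= minIntervalMillis;
  }
}
```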
Comment on attachment 8901402 [details]
Bug 1392505 - Pre: remove unused delegate interface

https://reviewboard.mozilla.org/r/172872/#review178690

It worries me that we don't use this interface! :D
Attachment #8901402 - Flags: review?(rnewman) → review+
Comment on attachment 8901403 [details]
Bug 1392505 - Let RepositorySessions track their own lastFetch and lastStore timestamps

https://reviewboard.mozilla.org/r/172874/#review178702

I'm not yet convinced by this patch.

In particular, I'm not convinced that the right approach is to pass less information around, and if we do pass less information around, I think we should do so via explicitly simpler APIs -- e.g., `onStoreSkipped()`, or something like that. That would make the situations in which we choose to do this explicit, and would also make the alternative logic flow more obvious (e.g., reusing the last sync's timestamp).

::: commit-message-2b9b0:7
(Diff revision 1)
> +
> +It might be necessary for RecordsChannel to complete a flow while also declaring that
> +none of the timestamps have meaningfully advanced.
> +
> +In practice, the main problems this solves is advancing a remote timestamp when there
> +was both nothing to upload and nothing to download. Since we didn't interract with

interact

::: commit-message-2b9b0:14
(Diff revision 1)
> +We'd then end up comparing remote server time to a locally generated one, which
> +is quite bad and leads to inconsistently buggy behavior.
> +
> +This patch elects not to change interface between RecordsChannel and its delegate
> +too much, choosing to pass around nulls to indicate that "nothing have changed"
> +out of simplicity.

Other options:

- Get a better timestamp (e.g., from `info/collections` or local prefs, depending on direction).
- Change the interface.

Why did you not choose either of those? The former seems to be even less invasive (and `null` is evil!); the latter is more explicit.

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/Utils.java:564
(Diff revision 1)
>      };
>    }
> +
> +  @Nullable
> +  public static Long getMaxFromOptionals(@Nullable Long first, @Nullable Long second) {
> +    if (first == null && second == null) {

This clause is redundant.

Also this function is horrible.
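For reference, a version without the redundant clause might look like this (a sketch only; annotations omitted and the class name invented):

```java
public final class MaxOptionals {
  // Null-tolerant max. The (null, null) case falls out of the first
  // branch - it returns the null second argument - so no explicit
  // double-null check is needed.
  public static Long getMaxFromOptionals(Long first, Long second) {
    if (first == null) {
      return second;
    }
    if (second == null) {
      return first;
    }
    return Math.max(first, second);
  }
}
```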

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/repositories/uploaders/PayloadDispatcher.java:119
(Diff revision 1)
>                  if (needToCommit && Boolean.TRUE.equals(batchWhiteboard.getInBatchingMode())) {
>                      finalRunnable.run();
>  
>                      // Otherwise, we're done.
>                  } else {
> +                    // NB: if we had nothing to upload, this timestamp would be 0.

"will be" or "is"

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/synchronizer/RecordsChannel.java:295
(Diff revision 1)
>        this.sink.storeIncomplete();
>  
> -      delegate.onFlowCompleted(this, fetchEnd, System.currentTimeMillis());
> +      // "Partially done" means that we failed to fetch all of the records. In practice, this won't
> +      // happen for local repositories - it won't be reported via this flow - but might happen
> +      // (interrupted downloads) for remote.
> +      delegate.onFlowCompleted(this, null, null);

This breaks something.

Read this comment again.

```
  // Bug 726054: just like desktop, we track our last interaction with the server,
  // not the last record timestamp that we fetched. This ensures that we don't re-
  // download the records we just uploaded, at the cost of skipping any records
  // that a concurrently syncing client has uploaded.
  private long pendingATimestamp = -1;
  private long pendingBTimestamp = -1;
  private long storeEndATimestamp = -1;
  private long storeEndBTimestamp = -1;
  private boolean flowAToBCompleted = false;
  private boolean flowBToACompleted = false;
```

I'm not convinced that you're preserving advancement everywhere you need to.

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/synchronizer/Synchronizer.java:28
(Diff revision 1)
>   * `onSynchronizeFailed` callback methods. In addition, I call
>   * `onSynchronizeAborted` before `onSynchronizeFailed` when I encounter a fetch,
>   * store, or session error while synchronizing.
>   *
>   * After synchronizing, call `save` to get back a SynchronizerConfiguration with
> - * updated bundle information.
> + * updated bundle information. 

Nit: whitespace.

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/synchronizer/SynchronizerSession.java:60
(Diff revision 1)
> + * Assumed usage is between a remote server and a local repository.
> + * <p>
> + * Handles failure cases as follows (in the order they will occur during a sync):
> + * <ul>
> + * <li>Remote fetch failures abort.</li>
> + * <li>Local store failures are ignored.</li>

Is this a change of behavior?
(In reply to Richard Newman [:rnewman] from comment #9)
> ::: commit-message-2b9b0:14
> (Diff revision 1)
> Other options:
> 
> - Get a better timestamp (e.g., from `info/collections` or local prefs,
> depending on direction).
> - Change the interface.
> 
> Why did you not choose either of those? The former seems to be even less
> invasive (and `null` is evil!); the latter is more explicit.

I've thought of just changing the interface, but elected not to, largely to avoid the extra work. So, no particularly good reason.

An interface to indicate that a flow is done but timestamps shouldn't be advanced would be a good fit here.

I'm not a fan of the "pass around better timestamps" approach. It might fix this bug, but won't help make the codebase easier to reason about. That sort of approach seems to create problems that "hide in the shadows".

> :::
> mobile/android/services/src/main/java/org/mozilla/gecko/sync/synchronizer/
> RecordsChannel.java:295
> (Diff revision 1)
> >        this.sink.storeIncomplete();
> >  
> > -      delegate.onFlowCompleted(this, fetchEnd, System.currentTimeMillis());
> > +      // "Partially done" means that we failed to fetch all of the records. In practice, this won't
> > +      // happen for local repositories - it won't be reported via this flow - but might happen
> > +      // (interrupted downloads) for remote.
> > +      delegate.onFlowCompleted(this, null, null);
> 
> This breaks something.
> 
> Read this comment again.
> 
> ```
>   // Bug 726054: just like desktop, we track our last interaction with the
> server,
>   // not the last record timestamp that we fetched. This ensures that we
> don't re-
>   // download the records we just uploaded, at the cost of skipping any
> records
>  [...]
> I'm not convinced that you're preserving advancement everywhere you need to.

This is interesting. We have several scenarios in which this might happen:
- During remote -> local flow, this means we failed to fetch some of the records.
If the local session is wrapped in a buffer, none of the fetched records have been applied, and so timestamps shouldn't move forward.
For non-wrapped sessions (history is the only one), fetched records have been applied locally, and so we do need to move the local timestamp forward to avoid re-uploading records. The history session tracks a "high watermark" and a batching offset, and so it already supports an efficient resume mechanism for downloads regardless, in a sense going around the session's timestamp. Sidenote: this approach now feels a bit hacky, like an inconsistent delegation of responsibilities across components.

Either way, for remote->local flow, it's fine to _not_ increment timestamps.

For the local->remote flow, this situation won't happen under normal circumstances - local sessions only fire onFetchFailed in exceptional circumstances. Regardless, if we know we've uploaded some records already, but our record source suddenly failed, it seems unwise to commit what we've uploaded to the server if we have batching API support. Without the batching API in place, our partial upload is auto-committed. If our partial upload is committed, we could move the server timestamp forward, and avoid downloading records we just uploaded on the next sync. If partial uploads are discarded by the server, we don't need to move the timestamp. Another sidenote: we're missing a storeIncomplete implementation on the Server15RepositorySession side, which should be in place to support this.

RecordsChannel knows none of these details. It doesn't know that local sessions generally won't fire onFetchFailed; it doesn't know if local `sink` session will buffer incoming records before applying them; it doesn't know that history session is "special"; it doesn't know if a remote `sink` session will discard partial uploads (via batching api).

So it seems like the sink session should decide if it's safe to move timestamps in cases of partial application.
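One hypothetical shape for that: the sink session, which knows its own buffering and batching semantics, exposes the decision explicitly instead of RecordsChannel guessing. All names here are invented to illustrate the idea, not a proposed API.

```java
// Hypothetical: let the sink decide whether a partial flow may advance
// timestamps, since only it knows whether partial work was committed.
public interface SinkSessionPolicy {
  // True if records stored so far are durably committed: e.g. a
  // non-buffered local session, or a server that auto-commits uploads
  // because the batching API isn't in play.
  boolean partialStoreIsCommitted();

  default boolean mayAdvanceTimestampsAfterPartialFlow() {
    return partialStoreIsCommitted();
  }
}
```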

> :::
> mobile/android/services/src/main/java/org/mozilla/gecko/sync/synchronizer/
> SynchronizerSession.java:60
> (Diff revision 1)
> > + * Assumed usage is between a remote server and a local repository.
> > + * <p>
> > + * Handles failure cases as follows (in the order they will occur during a sync):
> > + * <ul>
> > + * <li>Remote fetch failures abort.</li>
> > + * <li>Local store failures are ignored.</li>
> 
> Is this a change of behavior?

No, same as ever.
We could also step back a little, and notice that sessions have changed quite significantly since this code was written - both sides of the flow might have an equivalent of a buffer, and sessions now own the "modified" concept, suggesting that they're also well suited to know if the current flow modified their data.

Buffering is also not a given for either side of the flow. We might have a local session not wrapped in buffering middleware (history). We might have a server with the batching API turned off, or with the batching API turned on but per-batch record limits lower than what we're trying to upload - and we might fail to upload the second batch after committing the first.

Prior behaviour: both sides apply records to live storage as the flow proceeds.

The problem can be restated somewhat vaguely as "SynchronizerSession makes incorrect assumptions about session behaviour".

And so, we could do some refactoring to untangle and simplify things: remove timestamps from SynchronizerSession's onFlowComplete interface - once a flow is finished, ask sessions if they've been modified, and what the timestamps are for each side.
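As a rough sketch of that direction (all names invented), each session would answer for itself after the flow:

```java
// Hypothetical interface: after a flow, SynchronizerSession queries each
// session instead of receiving timestamps through onFlowCompleted.
public interface FlowAwareSession {
  // Did this flow actually modify my data?
  boolean dataWasModified();

  // Server header time for remote sessions; local clock for local ones.
  // Callers must never compare values across the two clock domains.
  long lastModifiedTimestamp();
}
```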
(In reply to :Grisha Kruglov from comment #11)

> And so, we could do some refactoring to untangle and simplify things: remove
> timestamps from SynchronizerSession's onFlowComplete interface - once flow
> is finished, ask sessions if they've been modified, and what the timestamps
> are for each side.

Cleanup is good, but make sure you ride the line between reducing existing problems and introducing new ones…
Comment on attachment 8901403 [details]
Bug 1392505 - Let RepositorySessions track their own lastFetch and lastStore timestamps

https://reviewboard.mozilla.org/r/172874/#review179736

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/repositories/RepositorySession.java:179
(Diff revision 3)
> -    final long end = now();
> +    Logger.debug(LOG_TAG, "Scheduling onStoreCompleted for after storing is done");
> -    Logger.debug(LOG_TAG, "Scheduling onStoreCompleted for after storing is done: " + end);
>      Runnable command = new Runnable() {
>        @Override
>        public void run() {
> -        storeDelegate.onStoreCompleted(end);
> +        setLastStoreTimestamp(now());

Nah, capture it outside the runnable -- it'll be earlier.

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/synchronizer/SynchronizerSession.java:72
(Diff revision 3)
> -  // Bug 726054: just like desktop, we track our last interaction with the server,
> -  // not the last record timestamp that we fetched. This ensures that we don't re-
> -  // download the records we just uploaded, at the cost of skipping any records
> -  // that a concurrently syncing client has uploaded.
> +  // Bug 1392505: for each "side" of the channel, we keep track of lastFetch and lastStore timestamps.
> +  // For local repositories these timestamps represent our last interactions with local data.
> +  // For the remote repository these timestamps represent server collection's last-modified
> +  // timestamp after a corresponding operation (GET or POST) finished. We obtain these from server's
> +  // response headers.
> +  // It's important that we never compare timestamps which originated from different clocks.

I think it's worth keeping part of the old comment: it's still useful to know that we might skip.
Attachment #8901403 - Flags: review?(rnewman) → review+
Comment on attachment 8901403 [details]
Bug 1392505 - Let RepositorySessions track their own lastFetch and lastStore timestamps

https://reviewboard.mozilla.org/r/172874/#review179736

> I think it's worth keeping part of the old comment: it's still useful to know that we might skip.

I don't think we can skip records from other clients. XIUS (X-If-Unmodified-Since) helps here a lot, and we fail a sync (without advancing timestamps) if we either failed to fetch or to upload (e.g. on a 412). Our remote timestamp is always derived from headers in the server's response, indicating last-modified for the collection we're working with at the end of the operation we just performed. We either succeed in uploading/downloading (and the timestamp will reflect the new server state), or a concurrent client touches the collection in a way that interleaves with us, triggering a 412 and failing our sync (and our remote timestamp will remain the same as before, meaning we'd catch up on the next sync).
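The 412 mechanism can be illustrated with a toy model. The header and status code are real Sync-server protocol; the code itself is invented for illustration.

```java
// Toy model of X-If-Unmodified-Since: the server rejects a write with
// 412 Precondition Failed if the collection changed after the timestamp
// the client last observed.
public final class XiusGuard {
  public static int simulateUpload(long xIfUnmodifiedSince,
                                   long serverCollectionModified) {
    return serverCollectionModified > xIfUnmodifiedSince ? 412 : 200;
  }
}
```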
Seems like some junit3 tests are failing intermittently, but it doesn't look like it's from this work; updated junit4 tests pass. Landing this.
Pushed by gkruglov@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/63093a78771f
Pre: remove unused delegate interface r=rnewman
https://hg.mozilla.org/integration/autoland/rev/5f2aa1dd303e
Let RepositorySessions track their own lastFetch and lastStore timestamps r=rnewman
https://hg.mozilla.org/mozilla-central/rev/63093a78771f
https://hg.mozilla.org/mozilla-central/rev/5f2aa1dd303e
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 57
Product: Android Background Services → Firefox for Android
Product: Firefox for Android → Firefox for Android Graveyard