Closed Bug 1294599 Opened 3 years ago Closed 3 years ago

Desktop should download and apply Sync records in batches

Categories

(Firefox :: Sync, defect, P1)

defect

Tracking

()

RESOLVED FIXED
Firefox 52
Tracking Status
firefox52 --- fixed

People

(Reporter: markh, Assigned: tcsc)

References

(Blocks 1 open bug)

Details

(Whiteboard: [sync-atomic])

Attachments

(1 file)

Via bug 1294270 we observe the following behaviour:

* desktop makes a request to read an entire collection (ie, full=1) without a limit param.
* As the data streams in we apply the records one at a time.
* After 10 minutes or so (eg, a few thousand passwords being applied) the server closes the connection while the client is still happily applying them and reading the next chunk.
* Client sees NS_ERROR_NET_PARTIAL_TRANSFER, doesn't advance lastSync.
* Next sync does exactly the same thing, repeat ad-nauseum.

Bug 1294595 exists to increase the timeout on the server, and while this will probably unstick many users, there will always be pathological outliers.

It seems a simple (hah!) mitigation here is for Sync to use a hard-coded limit and X-Weave-Next-Offset header to split the fetch into multiple requests. I'm not sure if this will need to be in combination with x-if-unmodified-since (forcing us to restart if another client wrote), and the implementation will be slightly complicated by the fact that we support some collections (eg, history) imposing a limit simply because it doesn't want every record.

I can't find a bug on file for this. Richard, any thoughts?
Flags: needinfo?(rnewman)
Android: Bug 730142. Bug 730142 Comment 7 has a pointer to iOS's implementation. The use of the server offset header plus record-level watermarking yields a safe and resumable implementation. 

I don't think there's an earlier desktop bug to dupe to, but looking at Bug 814801 might link you to some prior notes.
Flags: needinfo?(rnewman)
Grisha is working on this for Android right now.
See Also: → 730142
Priority: -- → P2
Blocks: 1294270
P3 ATM, waiting to deploy a server workaround before we'll measure this.
Priority: P2 → P3
Whiteboard: [atomic-io]
Whiteboard: [atomic-io] → [sync-atomic]
Priority: P3 → P2
Assignee: nobody → tchiovoloni
Comment on attachment 8798615 [details]
Bug 1294599 - Implement batched downloads of sync collections on desktop

Note that this is intended to be a drop-in replacement for get() in most cases.

It might be a little cludgey to put so much state inside the getBatched funciton, but it seemed like the simplest way to implement this given how the rest of the desktop code is structured.
Attachment #8798615 - Flags: feedback?(rnewman)
Attachment #8798615 - Flags: feedback?(markh)
Status: NEW → ASSIGNED
Comment on attachment 8798615 [details]
Bug 1294599 - Implement batched downloads of sync collections on desktop

https://reviewboard.mozilla.org/r/84026/#review83084

I think this is looking good, although I don't think this should do any buffering while keeping the same basic semantics which exist today.

::: services/sync/modules/constants.js:81
(Diff revision 1)
>  ADDONS_STORE_BATCH_SIZE:               1000000, // process all addons at once
>  APPS_STORE_BATCH_SIZE:                 50,      // same as MOBILE_BATCH_SIZE
>  
> +// Default batch size for download batching
> +// (how many records are fetched at a time from the server when batching is used).
> +DEFAULT_DOWNLOAD_BATCH_SIZE:           100,

While it's impossible to specify a number that will always work, this seems fairly conservative. ie, if users with > 100 bookmarks (or any other type of record) saw this I think it would be reported far more often than it currently is.

OTOH though, I see this is the same figure we upload by default in the absence of explicit batch semantics, so maybe this is OK? But something is nagging me to suggest 500 or so might be better and high enough that the vast majority of users will never see multiple requests and the increased latency that comes with it. Thoughts?

::: services/sync/modules/record.js:538
(Diff revision 1)
>    this._data = [];
> -  // optional members used by batch operations.
> +  // optional members used by batch upload operations.
>    this._batch = null;
>    this._commit = false;
> +  // Used for batch download operations -- note that this is explicitly an
> +  // opaque value and not (necessarially) a number.

typo in necessarially

::: services/sync/modules/record.js:649
(Diff revision 1)
> +  // records (or if a network error occurs).
> +  getBatched(batchSize = DEFAULT_DOWNLOAD_BATCH_SIZE) {
> +    let totalLimit = Number(this.limit) || Infinity;
> +    if (batchSize <= 0 || batchSize >= totalLimit) {
> +      // Invalid batch sizes should arguably be an error, but they're easy to
> +      // handle. Also just fall back to using the normal get() in cases where

comment seems to be missing something at the end?

::: services/sync/modules/record.js:653
(Diff revision 1)
> +      // Invalid batch sizes should arguably be an error, but they're easy to
> +      // handle. Also just fall back to using the normal get() in cases where
> +      return this.get();
> +    }
> +
> +    if (!this.full) {

I'd suggest we throw here, seeing as the callsite in question here either wants GUIDs or it doesn't, and that isn't going to change at runtime.

IOW, I believe this would log.warn every single time for a given caller, so throwing would help identify and unconditionally fix such a caller.

(OTOH though, why this limitation? Is it just to keep this patch sane? That's a reasonable justification, but it doesn't seem inherently necessary, so maybe a comment indicating why we haven't implemented it would make sense?)

::: services/sync/modules/record.js:659
(Diff revision 1)
> +      this._log.warn("getBatched is unimplemented for guid-only GETs, falling back to get()");
> +      return this.get();
> +    }
> +
> +    // _onComplete and _onProgress are reset after each `get` by AsyncResource.
> +    // We overwrite _onRecord to something that stores the data in an array

I don't think it is necessary to buffer here, as existing callers with a .recordHandler already buffer.

IOW, I think we can still pipe the records to the caller one at a time, even if it now takes multiple requests for all records to be consumed. In both the existing world and the new world, we can still end up in a situation where we fail 1/2 way through and the next request ends up restarting.

Actually, I think as written this will have slightly worse characteristics:

* memory will end up being larger - as we get to the last records, they will be buffered here *and* by the default recordHandler's batching.
* There will likely be more jank - the existing `_onRecord` handler calls sleep() after records to ensure we don't starve the event loop. This code doesn't do that, so a fast network connection may prevent other code running while we are buffering.

::: services/sync/modules/record.js:688
(Diff revision 1)
> +        let lastModified = resp.headers["x-last-modified"];
> +        if (!lastModifiedTime) {
> +          lastModifiedTime = lastModified;
> +          this.setHeader("X-If-Unmodified-Since", lastModified);
> +        } else if (lastModified != lastModifiedTime) {
> +          throw new Error("X-Last-Modified changed in the middle of a download batch! " +

IIUC, this should be "impossible" as we expect a 412 in this case? If so, maybe add a comment to this effect.

::: services/sync/modules/record.js:701
(Diff revision 1)
> +      // Ensure we undo any temporary state so that subsequent calls to get()
> +      // or getBatched() work properly. We do this before calling the record
> +      // handler so that we can more convincingly pretend to be a normal get()
> +      // call.
> +      this._onRecord = _onRecord;
> +      this._limit = totalLimit;

ISTM that some of these can be removed:

* `_onRecord` can probably die completely - see above.
* I don't see why `_limit` needs to be reset - in the same way as, say, `._sort` is handled, this should be under the control of the caller (ie, they probably expect a limit they set to be used for all requests and not reset after a get)
* `_offset` and x-if-unmodified-since seems like they should be reset at the start of this method.
(In reply to Mark Hammond [:markh] from comment #6)

> While it's impossible to specify a number that will always work, this seems
> fairly conservative. ie, if users with > 100 bookmarks (or any other type of
> record) saw this I think it would be reported far more often than it
> currently is.
> 
> OTOH though, I see this is the same figure we upload by default in the
> absence of explicit batch semantics, so maybe this is OK? But something is
> nagging me to suggest 500 or so might be better and high enough that the
> vast majority of users will never see multiple requests and the increased
> latency that comes with it. Thoughts?

Grisha and I were discussing this as he's working on Bug 1291821. (Go read that bug, and the batching downloader bug it depends on! It's a good read.)


There are only four reasons to use a number smaller than 99999 here:

1. To reduce the amount of re-downloading that's necessary after an interruption… *if this client supports high-watermark persistent batch tracking*, tracks at the end of a successful chunk, and is fetching oldest-first.

If you haven't already built that kind of batch tracking, or you're fetching anything but oldest-first, you get no benefit from smaller checkpoints: you'll always have to start again from scratch to make sure you got everything. (Think: another client might have uploaded a record with a really large sortindex at position 0!)

You should ideally support high-watermark tracking, and almost always fetch oldest-first (perhaps after an initial mini-most-recent sync to get a good first experience).


2. To allow the client to decide to gracefully abort (e.g., because you're on an OS that limits how long your syncs can be). That doesn't really affect desktop.


3. To minimize peak memory usage.


4. To minimize the possibility of NS_ERROR_NET_RESET and other such interruptions. We've been fetching 5000 records in a single HTTP request on Android for years, so this doesn't seem to be a big concern in practice.

I'd say use 1000 instead of 100, and you should add basic telemetry to tune if needed.
(In reply to Mark Hammond [:markh] from comment #0)
> Via bug 1294270 we observe the following behaviour:
> 
> * desktop makes a request to read an entire collection (ie, full=1) without
> a limit param.
> * As the data streams in we apply the records one at a time.
> * After 10 minutes or so (eg, a few thousand passwords being applied) the
> server closes the connection while the client is still happily applying them
> and reading the next chunk.
> * Client sees NS_ERROR_NET_PARTIAL_TRANSFER, doesn't advance lastSync.
> * Next sync does exactly the same thing, repeat ad-nauseum.

I thought it was worth explicitly (re?)-addressing this for the benefit of people reading Comment 0.

One of the best ways to avoid this kind of network interruption is to buffer records locally rather than applying them. The reason we're holding a socket open for ten minutes is because we're reading and writing local data over and over before allowing the idle connection to be closed.

(Our original 5000-bookmark limitation on Android was similarly due to the amount of time spent on N SQL inserts.)

When adjusted to slurp records off the wire into memory, flush them to a holding area, close the connection, *then* do the expensive work, our wire limits can be much higher.
Comment on attachment 8798615 [details]
Bug 1294599 - Implement batched downloads of sync collections on desktop

https://reviewboard.mozilla.org/r/84026/#review83084

> While it's impossible to specify a number that will always work, this seems fairly conservative. ie, if users with > 100 bookmarks (or any other type of record) saw this I think it would be reported far more often than it currently is.
> 
> OTOH though, I see this is the same figure we upload by default in the absence of explicit batch semantics, so maybe this is OK? But something is nagging me to suggest 500 or so might be better and high enough that the vast majority of users will never see multiple requests and the increased latency that comes with it. Thoughts?

I picked 100 because I had to pick something, mainly. Richard's suggestion of 1000 seems fine to me (as does 500), and I don't think we'd need extra telemetry here (anything I can think that we'd want to record -- total download count, error rate, etc -- is probably already part of the sync ping)

> I'd suggest we throw here, seeing as the callsite in question here either wants GUIDs or it doesn't, and that isn't going to change at runtime.
> 
> IOW, I believe this would log.warn every single time for a given caller, so throwing would help identify and unconditionally fix such a caller.
> 
> (OTOH though, why this limitation? Is it just to keep this patch sane? That's a reasonable justification, but it doesn't seem inherently necessary, so maybe a comment indicating why we haven't implemented it would make sense?)

Throwing sounds like a good idea.

It's to keep the patch sane and because code downloading GUIDs already does some batching, but using a different strategy -- doing two levels of batching would be unnecessary, and making this code use that strategy would complicate it too much (IMO).

> I don't think it is necessary to buffer here, as existing callers with a .recordHandler already buffer.
> 
> IOW, I think we can still pipe the records to the caller one at a time, even if it now takes multiple requests for all records to be consumed. In both the existing world and the new world, we can still end up in a situation where we fail 1/2 way through and the next request ends up restarting.
> 
> Actually, I think as written this will have slightly worse characteristics:
> 
> * memory will end up being larger - as we get to the last records, they will be buffered here *and* by the default recordHandler's batching.
> * There will likely be more jank - the existing `_onRecord` handler calls sleep() after records to ensure we don't starve the event loop. This code doesn't do that, so a fast network connection may prevent other code running while we are buffering.

They do buffer, but AFAICT not the whole batch -- e.g. the buffer size is 1 for most engines, and 50 for the history, forms, and password engine. (I've provided a few links below\*, it's possible I'm not following the code properly so LMK if this is the case)

So I think we still need to buffer here to properly handle 412s. It is true that we *will* have higher memory usage this way, but I see no way around this. FWIW Android and iOS appear to buffer everything.

Addressing the jank by adding some sleeping here doesn't sound unreasonable to me.

\* See:
- https://dxr.mozilla.org/mozilla-central/source/services/sync/modules/constants.js#72
- https://dxr.mozilla.org/mozilla-central/source/services/sync/modules/engines.js#766
- https://dxr.mozilla.org/mozilla-central/source/services/sync/modules/engines.js#1153

> IIUC, this should be "impossible" as we expect a 412 in this case? If so, maybe add a comment to this effect.

Yeah, that should be impossible.

> ISTM that some of these can be removed:
> 
> * `_onRecord` can probably die completely - see above.
> * I don't see why `_limit` needs to be reset - in the same way as, say, `._sort` is handled, this should be under the control of the caller (ie, they probably expect a limit they set to be used for all requests and not reset after a get)
> * `_offset` and x-if-unmodified-since seems like they should be reset at the start of this method.

We change `_limit` at the start of the method to implement the batching, and reset it to what it was before the getBatched was called.  We change it *back* to what the caller thinks it should be, not to something else.

If it's an engine like the history engine, it had an existing limit.  In order to make this as much of a drop-in replacement for `get` as possible, we don't want the caller to have to worry about the batch limit in addition to any limits it has.

(At the risk of being overly clear, keep in mind that we *must* pass a limit as part of the batching API, and that's not the same limit we pass as part of the non-batched API. Since `_sort` doesn't have that kind of dual usage, it needs to be handled differently)

Offset needs to be reset so that we don't end up using this offset later on for normal `get` calls. Arguably we should reset the offset at the start of `get`, but IMO that would be unintuitive behavior.

`_onRecord` should stay, IMO, for the reason I mentioned above (We only appear to batch in the caller, but the batchSize is usually just 1.
Priority: P2 → P1
Comment on attachment 8798615 [details]
Bug 1294599 - Implement batched downloads of sync collections on desktop

https://reviewboard.mozilla.org/r/84026/#review83690

LGTM, thanks! Let's wait for Richard's review though...
Attachment #8798615 - Flags: review?(markh) → review+
> It's to keep the patch sane and because code downloading GUIDs already does some batching…

There's existing code in engines.js to do chunked batching. It's all conditionalized on `isMobile`, which should now always be false. I encourage you to prune dead code there before layering complexity on top.
Comment on attachment 8798615 [details]
Bug 1294599 - Implement batched downloads of sync collections on desktop

https://reviewboard.mozilla.org/r/84026/#review84684

::: services/sync/modules/record.js:653
(Diff revision 2)
> +      // Invalid batch sizes should arguably be an error, but they're easy to handle
> +      return this.get();
> +    }
> +
> +    if (!this.full) {
> +      throw new Error("getBatched is unimplemented for guid-only GETs, falling back to get()")

Error message is wrong.

::: services/sync/modules/record.js:663
(Diff revision 2)
> +    // until the end.
> +    let { _onComplete, _onProgress, _onRecord } = this;
> +    let recordBuffer = [];
> +    let resp;
> +    try {
> +      this._onRecord = r => recordBuffer.push(r);

Oh god, that's horrible. Is there really no better way to do this? Can you at least refactor `this.get` to take progress/complete/record callbacks as arguments?

::: services/sync/modules/record.js:673
(Diff revision 2)
> +        this._onProgress = _onProgress;
> +        this._onComplete = _onComplete;
> +        if (batchSize + recordBuffer.length > totalLimit) {
> +          this.limit = totalLimit - recordBuffer.length;
> +        }
> +        this._log.trace(`Peforming batched GET`, { limit: this.limit, offset: this.offset });

Performing

::: services/sync/modules/record.js:677
(Diff revision 2)
> +        }
> +        this._log.trace(`Peforming batched GET`, { limit: this.limit, offset: this.offset });
> +        // Actually perform the request
> +        resp = this.get();
> +        if (!resp.success) {
> +          break;

Shouldn't this throw away the buffer, set ENGINE_DOWNLOAD_FAIL, and throw the failure response?

::: services/sync/modules/record.js:708
(Diff revision 2)
> +      delete this._headers["x-if-unmodified-since"];
> +      this._rebuildURL();
> +    }
> +    if (resp.success) {
> +      // call the original _onRecord (e.g. the user supplied record handler)
> +      // for each record we've stored

Before we start doing this, perhaps call Async.checkAppReady? No sense tempting fate if we happen to be shutting down.
Attachment #8798615 - Flags: review?(rnewman)
Comment on attachment 8798615 [details]
Bug 1294599 - Implement batched downloads of sync collections on desktop

https://reviewboard.mozilla.org/r/84026/#review84684

> Oh god, that's horrible. Is there really no better way to do this? Can you at least refactor `this.get` to take progress/complete/record callbacks as arguments?

Part of the goal with this patch (and the reason this is a separate method from `get`) is to minimize the number of changes needed to callers, most of whom don't care about batching (keep in mind that `get` is implemented on Resource, and some callers of `Collection.prototype.get` want the non-batching behavior -- e.g. if `full` is true).

Not to say I think this code is ideal, but I don't really see how adding callbacks would make it much cleaner...

> Shouldn't this throw away the buffer, set ENGINE_DOWNLOAD_FAIL, and throw the failure response?

The callers already do this if it's a good idea, and it isn't always (for example, it's probably not a good idea for the validators to set ENGINE_DOWNLOAD_FAIL).
(In reply to Mark Hammond [:markh] from comment #11)

> LGTM, thanks! Let's wait for Richard's review though...

Already done :D
Comment on attachment 8798615 [details]
Bug 1294599 - Implement batched downloads of sync collections on desktop

https://reviewboard.mozilla.org/r/84026/#review86680
Attachment #8798615 - Flags: review?(rnewman) → review+
Pushed by tchiovoloni@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/e4ef00583a1a
Implement batched downloads of sync collections on desktop r=markh,rnewman
https://hg.mozilla.org/mozilla-central/rev/e4ef00583a1a
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 52
Duplicate of this bug: 1220469
Duplicate of this bug: 1200955
Duplicate of this bug: 1015016
See Also: → 1434055
You need to log in before you can comment on or make changes to this bug.