Desktop client: upload records atomically

RESOLVED FIXED

Status

()

defect
P1
normal
RESOLVED FIXED
3 years ago
10 months ago

People

(Reporter: rnewman, Assigned: tcsc)

Tracking

(Blocks 1 bug)

unspecified
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(relnote-firefox 51+, firefox51 fixed)

Details

(Whiteboard: [sync-atomic])

Attachments

(1 attachment, 1 obsolete attachment)

Bug 1250189 adds backward-compatible atomic write options to the Sync 1.5 server.

When that's deployed, we should attempt to use them when uploading records for some or all engines -- bookmarks at a minimum, and perhaps also passwords and form history.

Note that self-hosted servers might not support atomic writes; we'll detect this because the appropriate header won't be present in the response to the first request, and in this case we must fall back to the current (unsafe) behavior.

We should also continue to do single uploads if the upload will fit in a single batch.
See Also: → 1253111
See Also: → 1253112
Blocks: 1196238
No longer blocks: 1196238
Blocks: 1257961
Summary: Upload records atomically → Desktop client: upload records atomically
Blocks: 1250189
No longer depends on: 1250189
No longer blocks: 1257961
Priority: -- → P1
Whiteboard: [sync-data-integrity]
Can this be closed? The linked PR already is.

And while I'm here, any chance of getting guidance in the docs for when the new /info/configuration endpoint should be called (ie, how long, if at all, the client should cache it for)?  I'm *guessing* it's reasonable to cache in memory for the session (but still after a node reassignment) etc, or should I call it once per-sync?
markh: I think that comment was intended for another bug?
Flags: needinfo?(markh)
(In reply to Richard Newman [:rnewman] from comment #2)
> markh: I think that comment was intended for another bug?

It was, thanks - it was intended for bug 1250189.
Flags: needinfo?(markh)
Whiteboard: [sync-data-integrity] → [data-integrity]
Priority: P1 → P2
Comment on attachment 8774637 [details]
Bug 1253051 - wip.

This is very much still a WIP. The core of the batch upload semantics should be working, but there will need to be changes to sync's engines.js to account for the fact that (IIUC?) success/failed/timestamp indicators come back each POST, but sync shouldn't persist the outcome of those indicators until a batch is complete.

I'm not sure how to elegantly tell Sync about that - I'm thinking the PostQueue will pass the x-last-modified header at the end of each batch as a flag to tell engines.js move its lastSync forward and change the item states in the tracker (the worst-case for this stuff is that a single queue of records will generate many batches, each with many posts). Or maybe just another "on batch end" callback in addition to the existing "on response" one. Dunno.

The patch also needs to sanely handle a single record being too large to upload - this patch will still try and upload it, but an error response will presumably kill the batch. We could consider not even trying the upload if the server has given us /configuration while continuing the try-anyway thing for legacy servers.

The tests need better error handling tests and better "record count exceeded" tests - currently they are mainly "byte count exceeded".

But with all that in mind, feedback welcome from all and sundry (but particularly Richard) ;)
Attachment #8774637 - Flags: feedback?(rnewman)
Assignee: nobody → markh
Comment on attachment 8774637 [details]
Bug 1253051 - wip.

https://reviewboard.mozilla.org/r/67092/#review64052

<p>Looks great.</p>
<p>I'd consider splitting out something of an internal POSTQueue state machine -- it starts off either as a non-batching POSTQueue if the configuration says no batching, or as a NeedsBatchIDPOSTQueue... right now you use <code>nil</code>/<code>undefined</code>/<code>String</code> to model that state machine.</p>

::: services/sync/modules/engines.js:1464
(Diff revision 1)
>        let up = new Collection(this.engineURL, null, this.service);
>        let handleResponse = resp => {
> +        // XXX - TODO - this needs to be aware of batch semantics somehow:
> +        // * We don't want to update this.lastSync until the batch is complete.
> +        // * We still need to remember the success/failure indicators, but
> +        //   not actually save them to this._modified etc until a batch

My guess is that making PostQueue batch-aware (perhaps by having a BatchPostQueue and a NonBatchPostQueue, or by introspection as you've done) is sufficient.

PostQueue can then invoke its callback -- this function -- either on failure in any `flush`, or after success in `commit`.

::: services/sync/modules/record.js:561
(Diff revision 1)
>      if (this.ids != null)
>        args.push("ids=" + this.ids);
>      if (this.limit > 0 && this.limit != Infinity)
>        args.push("limit=" + this.limit);
> +    if (this._batch)
> +      args.push("batch=" + this._batch);

Does this need to be encoded?

::: services/sync/modules/record.js:724
(Diff revision 1)
> +  this.bytesAlreadyBatched = 0;
> +
> +  // The ID of our current batch. Can be undefined (meaning we are yet to make
> +  // the first post of a patch, so don't know if we have a batch), null (meaning
> +  // we've made the first post but the server response indicated no batching
> +  // semantics, otherwise we have made the first post and it holds the batch ID

Missing closing paren.

::: services/sync/modules/record.js:752
(Diff revision 1)
>  
> -    // Do a flush if we can't add this record without exceeding our limits.
> +    // Do a flush if we can't add this record without exceeding our single-request
> +    // limits, or without exceeding the total limit for a single batch.
>      let newLength = this.queued.length + bytes.length + 1; // extra 1 for trailing "]"
> -    if (this.numQueued >= MAX_UPLOAD_RECORDS || newLength >= MAX_UPLOAD_BYTES) {
> -      this.log.trace("PostQueue flushing"); // flush logs more info...
> +    let postSizeExceeded = this.numQueued >= this.config.max_post_records ||
> +                           newLength >= this.config.max_post_bytes;

I think you still need a `+ 1` in here somewhere for the last record -- if we're adding the last record, but with the closing `]` it'll be longer than max_batch_bytes, then we need to make *two* requests rather than just queueing this one up.

::: services/sync/modules/record.js:771
(Diff revision 1)
>    },
>  
> -  flush() {
> +  flush(finalBatchPost = true) {
>      if (!this.queued) {
> -      // nothing queued.
> +      // nothing queued - we can't be in a batch, and something has gone very
> +      // bad if we think we are.

In particular, you will hit this immediately if the first record in a non-first batch is larger than `this.config.max_batch_bytes`.

I think you should check that up on line 760 -- don't call `flush` if there's nothing in the batch yet, and maybe don't even call `flush` if the record we're trying to queue up is too big to send, because we're going to abort in that case anyway.

::: services/sync/modules/record.js:840
(Diff revision 1)
> +      this.log.trace("Server error response during a batch", response);
> +      // not clear what we should do here - we expect the consumer of this to
> +      // abort by throwing in the postCallback below.
> +      // We should probably reset the batch indicators just in case...
> +    }
> +    this.postCallback(response);

I'd rather see this pile of nested `if` terms unraveled a bit:

```
if (finalBatchPost) {
  if (response.success) {
    this.batchID = undefined;
    return this.postCallback(response);
  }
  // TOOD: error.
}

// Not the last batch.
if (!response.success) {
  // TODO: error.
  this.postCallback(...);
  return;
}

if (response.status == 202) {
  ...
  this.postCallback(response);
  return;
}
```
Attachment #8774637 - Flags: review+
Comment on attachment 8774637 [details]
Bug 1253051 - wip.

Goddamn MozReview.
Attachment #8774637 - Flags: review+
Attachment #8774637 - Flags: feedback?(rnewman)
Attachment #8774637 - Flags: feedback+
Priority: P2 → P1
Whiteboard: [data-integrity] → [sync-atomic]
Assignee: markh → tchiovoloni
rnewman: Should we be using the X-If-Unmodified-Since header for these collection updates? Currently the desktop client isn't, but https://docs.services.mozilla.com/storage/apis-1.5.html appears to say that we should be if we want to be safe.

Since this is already a somewhat involved change in the same code that is likely to get a good amount of testing, assuming there isn't a reason not to do so, it seems like now would be a reasonable time to add that as well. One possible concern is this bug: https://bugzilla.mozilla.org/show_bug.cgi?id=692700, or more generally https://dxr.mozilla.org/mozilla-central/source/services/sync/modules/service.js#1328-1331. (Although this doesn't appear to apply to the engine collections).
Flags: needinfo?(rnewman)
(In reply to Thom Chiovoloni [:tcsc] from comment #9)
> rnewman: Should we be using the X-If-Unmodified-Since header for these
> collection updates?

Yes. Without X-I-U-S, this isn't atomic upload!

You should include the previous GET collection modified time in the subsequent POST. If that POST begins a batch, you should supply the same timestamp for each subsequent POST in the batch. On ending a batch, chain the timestamp back out. If you need to start another batch (what Steph calls a "bundle"), chain the timestamp back in. At any point, be prepared to receive a 412, which means you were saved from disaster.


> One possible concern is this bug…

That's not really concerning. X-I-U-S applies to collection timestamps with POST and to record timestamps with PUT, so we have everything we need. (That bug probably discussed Sync 2.0, which never shipped.)

Coincidentally, I just explained the use of X-I-U-S in a review for sleroux:

https://github.com/mozilla/firefox-ios/pull/2014#issuecomment-238641161
Status: NEW → ASSIGNED
Flags: needinfo?(rnewman)
Attachment #8780266 - Flags: feedback?(rnewman)
Attachment #8780266 - Flags: feedback?(markh)
Comment on attachment 8774637 [details]
Bug 1253051 - wip.

https://reviewboard.mozilla.org/r/67092/#review64052

I feel like tackling the state machine with typeis a bit weird here -- I tried something like this before uploading the patch, where each enqueue returned a different PostQueue type but it didn't end up simplifying it much at all -- just moved the complexity around, while adding its own. (Although I will admit that the null/undefined/batchid thing is opaque and potentially confusing).

And of course we can't just create a Batch/NonBatch PostQueue initially since we don't know the type until the first request gets back.
Comment on attachment 8780266 [details]
Bug 1253051 - Implement atomic uploads for sync collection data

https://reviewboard.mozilla.org/r/70966/#review69022

I think this is looking good - quite a few comments but almost all are nits. Please make the requested changes and flag Richard for review (he can send it back to me if he likes!)

::: services/sync/modules/engines.js:1475
(Diff revision 1)
>            resp.failureCode = ENGINE_UPLOAD_FAIL;
>            throw resp;
>          }
>  
>          // Update server timestamp from the upload.
>          let modified = resp.headers["x-weave-timestamp"];

See bug 1250189 comment 73 and later - I'm fairly sure we only want to look at this header when !batchOngoing. It's probably OK as written (as in, this.lastSync will end up being the correct value), but I think it's better to be clear (ie, only look at the header when !batchOngoing)

::: services/sync/modules/engines.js:1480
(Diff revision 1)
>          let modified = resp.headers["x-weave-timestamp"];
> -        if (modified > this.lastSync)
> -          this.lastSync = modified;
> -
> -        let failed_ids = Object.keys(resp.obj.failed);
> -        counts.failed += failed_ids.length;
> +        if (modified > latest) {
> +          latest = modified;
> +        }
> +        failed.push.apply(failed, Object.keys(resp.obj.failed));
> +        successful.push.apply(successful, Object.values(resp.obj.success));

using apply here seems odd?

::: services/sync/modules/engines.js:1482
(Diff revision 1)
> -          this.lastSync = modified;
> -
> -        let failed_ids = Object.keys(resp.obj.failed);
> -        counts.failed += failed_ids.length;
> -        if (failed_ids.length)
> +          latest = modified;
> +        }
> +        failed.push.apply(failed, Object.keys(resp.obj.failed));
> +        successful.push.apply(successful, Object.values(resp.obj.success));
> +        if (!batchOngoing) {
> +          if (latest > this.lastSync) {

I think a comment here would be nice to make it easier for future readers of the code - eg, something like "// we just committed a batch, so we advance ..."

::: services/sync/modules/record.js:670
(Diff revision 1)
> +             defaultVal;
> +    }
> +    let config = {
> +      max_post_bytes: getConfig("max_post_bytes", MAX_UPLOAD_BYTES),
> +      max_post_records: getConfig("max_post_records", MAX_UPLOAD_RECORDS),
> +      // XXX - docs for this feature suggest "max_batch_bytes" etc, but in

I believe the docs now match the impl, so this comment should be adjusted (however, I think using "batch" in the local names is better than using the raw config names, but YMMV)

::: services/sync/modules/record.js:688
(Diff revision 1)
> +   This supports the concept of a server-side "batch". The general idea is:
> +   * We queue as many records as allowed in memory, then make a single POST.
> +   * This first POST (optionally) gives us a batch ID, which we use for
> +     all subsequent posts, until...
> +   * At some point we hit a batch-maximum, and jump through a few hoops to
> +   * commit the current batch (ie, all previous POSTs) and start a new one.

I think the leading "*" should be removed from this line (even though I probably added it initially ;)

::: services/sync/modules/record.js:732
(Diff revision 1)
> +  // semantics, otherwise we have made the first post and it holds the batch ID
> +  // returned from the server).
> +  this.batchID = undefined;
> +
> +  // Time used for X-If-Unmodified-Since -- should be the timestamp from the last GET.
> +  this.expectedLastModified = timestamp;

I think the name of this variable could be improved - it reads like "a last modified value we expect to get back" - I think dropping "expected" is OK.

::: services/sync/modules/record.js:748
(Diff revision 1)
>      }
>      let bytes = JSON.stringify(jsonRepr);
> -    // Note that we purposely don't check if a single record would exceed our
> -    // limit - we still attempt the post and if it sees a 413 like we think it
> -    // will, we just let that do whatever it does (which is probably cause
> -    // ongoing sync failures for that engine - bug 1241356 exists to fix this)
> +
> +    // Do a flush if we can't add this record without exceeding our single-request
> +    // limits, or without exceeding the total limit for a single batch.
> +    let newLength = this.queued.length + bytes.length + 2; // extra 1 for leading "[" / "," and trailing "]"

this comment should be updated (to either say "extra 2", or "extra 1" twice, or maybe even just "extras")

::: services/sync/modules/record.js:755
(Diff revision 1)
> -    // not happen here but on the next .enqueue/.flush.)
> -
> -    // Do a flush if we can't add this record without exceeding our limits.
> -    let newLength = this.queued.length + bytes.length + 1; // extra 1 for trailing "]"
> -    if (this.numQueued >= MAX_UPLOAD_RECORDS || newLength >= MAX_UPLOAD_BYTES) {
> -      this.log.trace("PostQueue flushing"); // flush logs more info...
> +                           newLength >= this.config.max_post_bytes;
> +    let batchSizeExceeded = this.numQueued + this.numAlreadyBatched >= this.config.max_batch_records ||
> +                            newLength + this.bytesAlreadyBatched >= this.config.max_batch_bytes;
> +
> +    if (postSizeExceeded || batchSizeExceeded) {
> +      let singleRecordTooBig = bytes.length + 2 > this.config.max_batch_bytes ||

I think you may as well hoist this up to where |bytes| is defined, just to help reduce the complexity of this block. I also don't think it's worthwhile checking max_batch_bytes - we can safely assume max_batch_bytes > max_post_bytes (or if you think that sounds risky, maybe have the code above that creates the config object to a max or something)

::: services/sync/modules/record.js:779
(Diff revision 1)
>      this.queued += this.numQueued ? "," : "[";
>      this.queued += bytes;
>      this.numQueued++;
>    },
>  
> -  flush() {
> +  flush(finalBatchPost = true) {

All consumers of this now specify the param, so I'm not sure a default is worthwhile (or, alternatively, have the only external caller of this (ie, engines.js) not specify it.

::: services/sync/modules/record.js:792
(Diff revision 1)
>      }
> -    this.log.info(`Posting ${this.numQueued} records of ${this.queued.length+1} bytes`);
> +    // the batch query-param and headers we'll send.
> +    let batch;
> +    let headers = [];
> +    if (this.batchID === undefined) {
> +      batch = "true";

I think a comment matching the ones below makes sense here - ie, "this will be the first post in a (possible) batch" or similar.

::: services/sync/modules/record.js:800
(Diff revision 1)
> +      batch = this.batchID;
> +    } else {
> +      // Not the first post and we know we have no batch semantics.
> +      batch = null;
> +    }
> +    

trailing whitespace

::: services/sync/modules/record.js:801
(Diff revision 1)
> +    } else {
> +      // Not the first post and we know we have no batch semantics.
> +      batch = null;
> +    }
> +    
> +    if (this.expectedLastModified) {

Is it possible we'd ever have lastModified of zero here? ISTM that this should never happen as we will always have read a collection before posting to it, and if we haven't, we probably need to arrange to get the lastModified some other way so we don't accidentally end up bypassing this safety feature.

::: services/sync/modules/record.js:827
(Diff revision 1)
> +
> +    if (!response.success) {
> +      this.log.trace("Server error response during a batch", response);
> +      // not clear what we should do here - we expect the consumer of this to
> +      // abort by throwing in the postCallback below. Reset the batch indicators just in case...
> +      this.batchID = null;

I think this originated with me, but another option that's possibly safer and clearer is to keep the comment about expecting the postCallback to throw, but then throw regardless.

Also, if we think that resetting .batchID is worthwhile (which I guess it is), it seems odd we don't do that in the 412 case. Indeed, I'm not sure why have have that 412 case anyway, other than logging - ISTM that if we expect postCallback to throw, then it must also throw in the 412 case (and there's no harm in always doing a .warn in the !success case.

::: services/sync/modules/record.js:852
(Diff revision 1)
> +    // this response is saying the server has batch semantics - we should
> +    // always have a batch ID in the response.
> +    let responseBatchID = response.obj.batch;
> +    this.log.trace("Server responsed 202 with batch", responseBatchID);
> +    if (!responseBatchID) {
> +      throw new Error("Invalid server response: 202 without a batch ID", response);

I don't think the 2nd arg here is correct - I think doing a log.error before the throw with the response object as the 2nd arg to that is the right thing to do.

::: services/sync/modules/record.js:858
(Diff revision 1)
> +    }
> +
> +    if (this.batchID === undefined) {
> +      this.batchID = responseBatchID;
> +      if (!this.expectedLastModified) {
> +        this.expectedLastModified = response.headers["x-last-modified"];

Even though we expect x-last-modified and x-weave-timestamp to be identical in this case, I think we should be consistent with that we use (the former is used here and hte latter is used above)

::: services/sync/tests/unit/test_postqueue.js:44
(Diff revision 1)
> +  }
> +
> +  const time = 11111111;
> +
> +  function* responseGenerator() {
> +    yield { success: true, status: 200, headers: { 'x-weave-timestamp': time+100 } };

spaces around operators ;) (here and below)

::: services/sync/tests/unit/test_postqueue.js:153
(Diff revision 1)
> +    max_batch_records: 200,
> +  }
> +  const time = 11111111;
> +  function* responseGenerator() {
> +    yield { success: true, status: 202, obj: { batch: 1234 },
> +            headers: { 'x-last-modified': time, 'x-weave-timestamp': time+100 },

this looks odd - shouldn't these values be identical? (here and below)
Comment on attachment 8780266 [details]
Bug 1253051 - Implement atomic uploads for sync collection data

https://reviewboard.mozilla.org/r/70966/#review69022

> using apply here seems odd?

It's equivalent to doing a concat in-place, but I'll use a concat since I don't think it will matter and it probably is clearer that way.

> I believe the docs now match the impl, so this comment should be adjusted (however, I think using "batch" in the local names is better than using the raw config names, but YMMV)

I'm not sure I follow. I removed the comment, but it seems reasonable to comment on the fact that we're using `batch` in the names when the server sends `total`, unless that's not what you mean.

> I think the name of this variable could be improved - it reads like "a last modified value we expect to get back" - I think dropping "expected" is OK.

I wasn't happy with that name, but nothing better jumped out at me. lastModified is probably an improvement.

> I think you may as well hoist this up to where |bytes| is defined, just to help reduce the complexity of this block. I also don't think it's worthwhile checking max_batch_bytes - we can safely assume max_batch_bytes > max_post_bytes (or if you think that sounds risky, maybe have the code above that creates the config object to a max or something)

No, I hadn't considered that max_batch_bytes should always be greater than max_post_bytes, so checking both is redundant.

> I think this originated with me, but another option that's possibly safer and clearer is to keep the comment about expecting the postCallback to throw, but then throw regardless.
> 
> Also, if we think that resetting .batchID is worthwhile (which I guess it is), it seems odd we don't do that in the 412 case. Indeed, I'm not sure why have have that 412 case anyway, other than logging - ISTM that if we expect postCallback to throw, then it must also throw in the 412 case (and there's no harm in always doing a .warn in the !success case.

Thanks, the extra complexity with 412 is mainly a holdover from when I was doing something different there. I suspect it would be clear from logging anyway, so that case isn't worth keeping.

I'm tempted to agree and say we should throw and not call postCallback, but I think postCallback should be checking anyway, and having both check+throw seems unnecessary.

> I don't think the 2nd arg here is correct - I think doing a log.error before the throw with the response object as the 2nd arg to that is the right thing to do.

Missed this one, thanks.

> this looks odd - shouldn't these values be identical? (here and below)

Ah, yes, in a few cases this was using the wrong values.
Comment on attachment 8780266 [details]
Bug 1253051 - Implement atomic uploads for sync collection data

https://reviewboard.mozilla.org/r/70966/#review69166

::: services/sync/modules/engines.js:1470
(Diff revision 2)
> +        // the batch is complete, however we want to remember success/failure
> +        // indicators for when that happens.
>          if (!resp.success) {
>            this._log.debug("Uploading records failed: " + resp);
>            resp.failureCode = ENGINE_UPLOAD_FAIL;
>            throw resp;

If we fail due to, e.g., an interrupted network connection, we will abort here and orphan the batch.

If this is a common occurrence, we will cause operational problems -- those batches will stick around for a while.

File a follow-up for server and client support for proactively cleaning up a batch, and also for telemetry to figure out if it's worthwhile.

::: services/sync/modules/engines.js:1475
(Diff revision 2)
>            throw resp;
>          }
>  
>          // Update server timestamp from the upload.
> +        failed = failed.concat(Object.keys(resp.obj.failed))
> +        successful = successful.concat(Object.values(resp.obj.success))

This doesn't seem quite right.

```
{
 "batch": "OPAQUEBATCHID",
 "success": ["GXS58IDC_12", "GXS58IDC_13", "GXS58IDC_15",
             "GXS58IDC_16", "GXS58IDC_18", "GXS58IDC_19"],
 "failed": {"GXS58IDC_11": "invalid ttl"],
            "GXS58IDC_14": "invalid sortindex"}
}
```

Shouldn't this be

```
successful = successful.concat(resp.obj.success)
```
?

You're relying on object-array equivalence in JS — that `["a", "b"]` is `{0: "a", 1: "b"}`, and so `Object.values(arr) == arr`!

::: services/sync/modules/engines.js:1476
(Diff revision 2)
>          }
>  
>          // Update server timestamp from the upload.
> +        failed = failed.concat(Object.keys(resp.obj.failed))
> +        successful = successful.concat(Object.values(resp.obj.success))
> +        if (!batchOngoing) {

Prefer early return:

```
if (batchOngoing) {
  // Nothing to do yet.
  return;
}
```

::: services/sync/modules/engines.js:1483
(Diff revision 2)
> -          this.lastSync = modified;
> +            this.lastSync = modified;
> -
> -        let failed_ids = Object.keys(resp.obj.failed);
> +          }
> +          if (failed.length) {
> -        counts.failed += failed_ids.length;
> -        if (failed_ids.length)
> -          this._log.debug("Records that will be uploaded again because "
> +            this._log.debug("Records that will be uploaded again because "

While we're here, add a logging check to avoid joining a potentially huge array of failed IDs only to not actually write them to a log.

You basically want to extract this logic:

https://dxr.mozilla.org/mozilla-central/source/toolkit/modules/Log.jsm#375

into a `logsAtLevel(level)` check.

::: services/sync/modules/record.js:614
(Diff revision 2)
>      this._sort = value;
>      this._rebuildURL();
>    },
>  
> +  // Set information about the batch for this request.
> +  get batch() { throw new Error("Can't get the batch ID via a getter")},

Why not? This is JS.

::: services/sync/modules/record.js:621
(Diff revision 2)
> +    this._batch = value;
> +    this._rebuildURL();
> +  },
> +
> +  get commit() { throw new Error("Can't get the commit flag via a getter")},
> +  set commit(value) {

Consider making this explicit:

```
  this._commit = value && true;
```

::: services/sync/modules/record.js:665
(Diff revision 2)
>        return Resource.prototype.post.call(this, data);
>      }
> -    return new PostQueue(poster, log, postCallback);
> +    let getConfig = (name, defaultVal) => {
> +      return this._service.serverConfiguration && this._service.serverConfiguration[name] ?
> +             this._service.serverConfiguration[name] :
> +             defaultVal;

Be careful here! If a server config value is falsy -- including 0 -- this will use the default value.

```
> let config = {"foobar": 0};
undefined
> (config && config["foobar"] ? config["foobar"] : 99)
99
```

I'd do this differently: mandate that `_service` have a `serverConfiguration`, and either explicitly put a default configuration in that slot with fields for each configuration property, or use `hasOwnProperty`.

::: services/sync/modules/record.js:697
(Diff revision 2)
> -function PostQueue(poster, log, postCallback) {
> +function PostQueue(poster, timestamp, config, log, postCallback) {
>    // The "post" function we should use when it comes time to do the post.
>    this.poster = poster;
>    this.log = log;
>  
> +  // The config we use

// Must have fields "max_total_bytes", ...

::: services/sync/modules/record.js:726
(Diff revision 2)
> +
> +  // The ID of our current batch. Can be undefined (meaning we are yet to make
> +  // the first post of a patch, so don't know if we have a batch), null (meaning
> +  // we've made the first post but the server response indicated no batching
> +  // semantics, otherwise we have made the first post and it holds the batch ID
> +  // returned from the server).

Your closing paren is in the wrong place.

::: services/sync/modules/record.js:752
(Diff revision 2)
> -    // (Note that counter-intuitively, the post of the oversized record will
> -    // not happen here but on the next .enqueue/.flush.)
> -
> -    // Do a flush if we can't add this record without exceeding our limits.
> -    let newLength = this.queued.length + bytes.length + 1; // extra 1 for trailing "]"
> -    if (this.numQueued >= MAX_UPLOAD_RECORDS || newLength >= MAX_UPLOAD_BYTES) {
> +
> +    let postSizeExceeded = this.numQueued >= this.config.max_post_records ||
> +                           newLength >= this.config.max_post_bytes;
> +
> +    let batchSizeExceeded = this.numQueued + this.numAlreadyBatched >= this.config.max_batch_records ||
> +                            newLength + this.bytesAlreadyBatched >= this.config.max_batch_bytes;

I can never remember the operator precedence table. Consider ()ing this.

::: services/sync/modules/record.js:754
(Diff revision 2)
> -
> -    // Do a flush if we can't add this record without exceeding our limits.
> -    let newLength = this.queued.length + bytes.length + 1; // extra 1 for trailing "]"
> -    if (this.numQueued >= MAX_UPLOAD_RECORDS || newLength >= MAX_UPLOAD_BYTES) {
> -      this.log.trace("PostQueue flushing"); // flush logs more info...
> -      // We need to write the queue out before handling this one.
> +                           newLength >= this.config.max_post_bytes;
> +
> +    let batchSizeExceeded = this.numQueued + this.numAlreadyBatched >= this.config.max_batch_records ||
> +                            newLength + this.bytesAlreadyBatched >= this.config.max_batch_bytes;
> +
> +    let singleRecordTooBig = bytes.length + 2 > this.config.max_post_bytes;

This should never happen: the POST limit should always be bigger than the Sync record size limit, which is not defined in info/configuration. Otherwise there will be valid Sync records that you cannot upload!

::: services/sync/modules/record.js:758
(Diff revision 2)
> -      this.log.trace("PostQueue flushing"); // flush logs more info...
> -      // We need to write the queue out before handling this one.
> -      this.flush();
> +
> +    let singleRecordTooBig = bytes.length + 2 > this.config.max_post_bytes;
> +
> +    if (postSizeExceeded || batchSizeExceeded) {
> +      this.log.trace(`PostQueue flushing due to postSizeExceeded=${postSizeExceeded}, batchSizeExceeded=${batchSizeExceeded
> +                     }, max_batch_bytes: ${this.config.max_batch_bytes}, max_post_bytes: ${this.config.max_post_bytes}`);

Try not to split lines inside `${}`; it makes reviewboard unhappy.

::: services/sync/modules/record.js:821
(Diff revision 2)
> +    if (!response.success) {
> +      this.log.trace("Server error response during a batch", response);
> +      // not clear what we should do here - we expect the consumer of this to
> +      // abort by throwing in the postCallback below. Reset the batch indicators just in case...
> +      this.batchID = null;
> +      return this.postCallback(response);

I think we still care about whether we thought we were in a batch or not. Send the second argument and don't clear the batchID!

::: services/sync/modules/record.js:837
(Diff revision 2)
> +      if (this.batchID) {
> +        throw new Error("Server responded non-202 success code while a batch was in progress");
> +      }
> +      this.batchID = null; // no batch semantics are in place.
> +      this.lastModified = response.headers["x-last-modified"];
> +      return this.postCallback(response, false);

This doesn't make sense, I think. If this is the second POST in a batch, and we get a random 200, we might get unpredictable results -- some records won't be marked as uploaded/deleted (those in the previous POST), for example. Consider failing hard instead, and consider some way of finding out if we screw up like this.

(I worry about captive portals for this kind of thing -- Bug 1290256.)

::: services/sync/modules/service.js:1060
(Diff revision 2)
>        return this.errorHandler.errorStr(ex.body);
>      }
>    },
>  
> +  _fetchServerConfiguration() {
> +    this.serverConfiguration = null;

I see no reason to do this. Either it'll be set during this function (or will be 304ed), or this function will abort the sync.

::: services/sync/modules/service.js:1067
(Diff revision 2)
> +
> +    let infoURL = this.userBaseURL + "info/configuration";
> +    this._log.debug("Fetching server configuration", infoURL);
> +    let configResponse;
> +    // At this stage we ignore almost all errors fetching the config as it
> +    // is an optional facility.

I think that's a terrible idea.

There are three acceptable paths out of this function.

1. We got a 200 and a valid configuration.
2. We got a 404, with headers that indicate that the Sync server responded (not a captive portal), and we use the default configuration.
3. We got _anything else_ -- network error, DNS lookup failure, invalid response, a 302 to Starbucks' sign-in page -- and we abort the sync without altering our state.
Attachment #8780266 - Flags: review?(rnewman) → review-
Comment on attachment 8780266 [details]
Bug 1253051 - Implement atomic uploads for sync collection data

https://reviewboard.mozilla.org/r/70966/#review69166

> If we fail due to, e.g., an interrupted network connection, we will abort here and orphan the batch.
> 
> If this is a common occurrence, we will cause operational problems -- those batches will stick around for a while.
> 
> File a follow-up for server and client support for proactively cleaning up a batch, and also for telemetry to figure out if it's worthwhile.

Filed bug 1295303, and regarding telemetry, if we throw here it should be caught by the sync ping (perhaps it should get a different error code, but that could easily be done in this bug), unless you wanted server telemetry or telemetry for other clients.

> This doesn't seem quite right.
> 
> ```
> {
>  "batch": "OPAQUEBATCHID",
>  "success": ["GXS58IDC_12", "GXS58IDC_13", "GXS58IDC_15",
>              "GXS58IDC_16", "GXS58IDC_18", "GXS58IDC_19"],
>  "failed": {"GXS58IDC_11": "invalid ttl"],
>             "GXS58IDC_14": "invalid sortindex"}
> }
> ```
> 
> Shouldn't this be
> 
> ```
> successful = successful.concat(resp.obj.success)
> ```
> ?
> 
> You're relying on object-array equivalence in JS — that `["a", "b"]` is `{0: "a", 1: "b"}`, and so `Object.values(arr) == arr`!

Yep, nice catch, not sure what I was thinking.

> Why not? This is JS.

A reasonable explanation could be that the null/undefined/batchid behavior is fairly opaque (Although markh would be able to say concretely, since this is part of his original patch). It's not clear to me this should be exposed at all, but I guess it should be consistent, which is an easy fix.

> Be careful here! If a server config value is falsy -- including 0 -- this will use the default value.
> 
> ```
> > let config = {"foobar": 0};
> undefined
> > (config && config["foobar"] ? config["foobar"] : 99)
> 99
> ```
> 
> I'd do this differently: mandate that `_service` have a `serverConfiguration`, and either explicitly put a default configuration in that slot with fields for each configuration property, or use `hasOwnProperty`.

Are we sure we want to allow zero values here? I guess I could see the argument that it's more future proof if it works properly and that's something that is handled by the caller, but a 0 value for byte or record counts sounds like a bug.

> This doesn't make sense, I think. If this is the second POST in a batch, and we get a random 200, we might get unpredictable results -- some records won't be marked as uploaded/deleted (those in the previous POST), for example. Consider failing hard instead, and consider some way of finding out if we screw up like this.
> 
> (I worry about captive portals for this kind of thing -- Bug 1290256.)

If we failed hard and threw a recognizable error, we could detect it in the telemetry ping.

> I think that's a terrible idea.
> 
> There are three acceptable paths out of this function.
> 
> 1. We got a 200 and a valid configuration.
> 2. We got a 404, with headers that indicate that the Sync server responded (not a captive portal), and we use the default configuration.
> 3. We got _anything else_ -- network error, DNS lookup failure, invalid response, a 302 to Starbucks' sign-in page -- and we abort the sync without altering our state.

This makes sense and is an easy fix.
(In reply to Thom Chiovoloni [:tcsc] from comment #17)
> > Why not? This is JS.
> 
> A reasonable explanation could be that the null/undefined/batchid behavior
> is fairly opaque (Although markh would be able to say concretely, since this
> is part of his original patch). It's not clear to me this should be exposed
> at all, but I guess it should be consistent, which is an easy fix.

Yeah - I was thinking that if something thinks they want to get these values, they are confused. I prefer throwing in cases like this rather than allowing something obviously wrong to be allowed just for a sense of symmetry.

If the choice is between allowing these to be got via a getter or making them "internal" while allowing the batcher to poke those internal references, I'd prefer the latter. But TBH I feel the code as written is best - a public setter and no functioning getter best reflects the reality of the implementation without offering foot-guns.

> > Be careful here! If a server config value is falsy -- including 0 -- this will use the default value.

Surely that's better than actually using zero though? Zero would mean that nothing could be uploaded, and it seems unlikely that the server would choose this as a signal to say "don't upload anything" (and if they really did mean that, we'd just fail on the first post anyway)
Comment on attachment 8780266 [details]
Bug 1253051 - Implement atomic uploads for sync collection data

I think this is heading in the right direction and rnewman's review is all you care about :) IOW, feel free to request feedback from me if you feel it would be helpful, but otherwise I'm happy to leave you and rnewman to negotiate the final patch.
Attachment #8780266 - Flags: review?(markh)
(In reply to Mark Hammond [:markh] from comment #18)

> Yeah - I was thinking that if something thinks they want to get these
> values, they are confused. I prefer throwing in cases like this rather than
> allowing something obviously wrong to be allowed just for a sense of
> symmetry.

Yeah, WFM; just curious about the reasoning.

> > > Be careful here! If a server config value is falsy -- including 0 -- this will use the default value.
> 
> Surely that's better than actually using zero though?

For current config parameters, yes. But if we add max_allowed_retries, or something like that, this code will be quietly and perhaps dangerously wrong. The original verbose Map-based code was correct, of course :D
Comment on attachment 8780266 [details]
Bug 1253051 - Implement atomic uploads for sync collection data

https://reviewboard.mozilla.org/r/70966/#review69166

> This should never happen: the POST limit should always be bigger than the Sync record size limit, which is not defined in info/configuration. Otherwise there will be valid Sync records that you cannot upload!

I'm not sure I follow you here. Are you suggesting that it should instead ensure that the post size limit is bigger than the sync record size limit?

> I see no reason to do this. Either it'll be set during this function (or will be 304ed), or this function will abort the sync.

Not sure I follow here either. Do you mean that you think the assignment should be moved to the 404 check?
Comment on attachment 8780266 [details]
Bug 1253051 - Implement atomic uploads for sync collection data

https://reviewboard.mozilla.org/r/70966/#review69166

> I'm not sure I follow you here. Are you suggesting that it should instead ensure that the post size limit is bigger than the sync record size limit?

You're checking whether the size of this single record is larger than max_post_bytes. Sync specifies that no record can be larger than 256KB. max_post_bytes defaults to 1MB. I think it's a fair assumption that max_post_bytes > (max_record_size + post_overhead), and indeed if max_post_bytes is set to 256KB then there will be perfectly valid Sync records that you can't send to this server!

I think you should either ensure that this constraint is documented for the server ("don't make this number less than 260KB"), or you should impose it as a sanity check when parsing info/configuration. If a server says that max_post_bytes is 200KB, we should either ignore it, refuse to sync, or we should define a max_record_size and adjust that.

Any of those options will make this code either simpler or more generic.

> Not sure I follow here either. Do you mean that you think the assignment should be moved to the 404 check?

I mean:

* `_fetchServerConfiguration` should succeed or abort the sync every time it is called.
* It will be called on first sync, and perhaps subsequently, as frequently as once per sync.
* If it doesn't succeed, you can't continue to sync.
* Ergo, `this.serverConfiguration` should only be null before the first sync begins, and it will never be null inside a sync.

Nulling it out is thus unnecessary.

(Future: it would be nice if the server could return a 304 for info/configuration, a la Bug 736374.)

If you want to assign it anywhere -- either to a default configuration object as we do on other platforms, or to null if you don't mind the confusion between "not set" and "default" -- then do so in the 404 check.
Comment on attachment 8780266 [details]
Bug 1253051 - Implement atomic uploads for sync collection data

https://reviewboard.mozilla.org/r/70966/#review69166

> Filed bug 1295303, and regarding telemetry, if we throw here it should be caught by the sync ping (perhaps it should get a different error code, but that could easily be done in this bug), unless you wanted server telemetry or telemetry for other clients.

That was WONTFIXed so I'm dropping this issue (the telemetry should still let us know whether or not this happens much in practice).
Comment on attachment 8780266 [details]
Bug 1253051 - Implement atomic uploads for sync collection data

https://reviewboard.mozilla.org/r/70966/#review69166

> You're checking whether the size of this single record is larger than max_post_bytes. Sync specifies that no record can be larger than 256KB. max_post_bytes defaults to 1MB. I think it's a fair assumption that max_post_bytes > (max_record_size + post_overhead), and indeed if max_post_bytes is set to 256KB then there will be perfectly valid Sync records that you can't send to this server!
> 
> I think you should either ensure that this constraint is documented for the server ("don't make this number less than 260KB"), or you should impose it as a sanity check when parsing info/configuration. If a server says that max_post_bytes is 200KB, we should either ignore it, refuse to sync, or we should define a max_record_size and adjust that.
> 
> Any of those options will make this code either simpler or more generic.

Bah, mozreview problems. 

We discussed this some on IRC, and the client doesn't check anywhere that 256k is respected, so I don't feel great removing the check -- at least not for this reason. Admittedly, it is some extra complexity, and the error *would* be caught further down the line, so if you think it's better to just do that way (and hope for the best when it comes recognizing the error), I'll update it to do so.
I don't think you should remove checking the single record size; I think you should explicitly add a check for a 256K payload (per [1]), which will fix Bug 697482.

Note that checking record size against the max POST size is not a substitute for that -- a 700KB record won't fail that check -- but the reverse is true if the max record size is smaller than the max POST size, which should always be the case.

If you serialize a payload and find that it's more than 256K[2], fail with a descriptive reason.

If you want to be really careful about it, use Math.min(max_post_bytes, max_specified_record_size), which will work correctly when max_post_bytes is set to less than 256K.


You are already in the territory which requires you to make a decision or two here.

Currently a 1.7MB form history record will be uploaded and fail. Other records in the upload will be sent.

With your change, a 1.7MB form history record will cause the sync to abort. A 900KB form history record will be sent and the server will quietly include it in 'failed'. (This is why you need to check![3])

Failing a sync entirely when one record is unserializable is desirable for bookmarks (no corruption), and perhaps not desirable for history or form history.

So you need to consider what the behavior should be when a record is found to be un-uploadable. I contend that it's engine-specific, defaulting to 'throw', with some engines choosing 'skip'.


[1] <http://docs.services.mozilla.com/storage/apis-1.5.html#syncstorage-bso>
[2] Technically the limitation is on payload size, not total record size, but it's pretty close.
[3] As rfkelly noted: "if you're producing valid records we'd have no reason to accept some items but not others out of the request" -- we should not deliberately produce invalid records!
Depends on: 1297561
Thom: is this ready for re-review yet?
Flags: needinfo?(tchiovoloni)
Yes, very sorry.
Flags: needinfo?(tchiovoloni)
Comment on attachment 8780266 [details]
Bug 1253051 - Implement atomic uploads for sync collection data

https://reviewboard.mozilla.org/r/70966/#review75696

Two follow-ups, one nit, one coordination. :thumbsup:

::: services/sync/modules/engines.js:1498
(Diff revision 7)
> -        let failed_ids = Object.keys(resp.obj.failed);
> +        if (failed.length && this._log.level <= Log.Level.Debug) {
> -        counts.failed += failed_ids.length;
> -        if (failed_ids.length)
>            this._log.debug("Records that will be uploaded again because "
>                            + "the server couldn't store them: "
> -                          + failed_ids.join(", "));
> +                          + failed.join(", "));

Future: take a subsequence here. There might be hundreds of IDs, and we'll generate lots of logs.

::: services/sync/modules/engines.js:1504
(Diff revision 7)
> +        counts.failed += failed.length;
>  
> -        // Clear successfully uploaded objects.
> +        for (let id of successful) {
> -        const succeeded_ids = Object.values(resp.obj.success);
> -        for (let id of succeeded_ids) {
>            delete this._modified[id];

Make sure Kit is happy with this -- I think everyone is touching this bit at the same time!

::: services/sync/modules/engines/history.js:45
(Diff revision 7)
>    _recordObj: HistoryRec,
>    _storeObj: HistoryStore,
>    _trackerObj: HistoryTracker,
>    downloadLimit: MAX_HISTORY_DOWNLOAD,
>    applyIncomingBatchSize: HISTORY_STORE_BATCH_SIZE,
> +  allowSkippedRecord: true,

Future: this is really a special case of "what do we do when one record fails?"

Possible answers are:

* Fail the entire sync.
* Record that record as failed and retry it later.
* Drop that record on the floor.

And that's perhaps best answered by a function:

```
  allowSkippedRecord(guid, failureCode)
```

::: services/sync/modules/record.js:561
(Diff revision 7)
>      if (this.ids != null)
>        args.push("ids=" + this.ids);
>      if (this.limit > 0 && this.limit != Infinity)
>        args.push("limit=" + this.limit);
> +    if (this._batch)
> +      args.push("batch=" + this._batch);

`encodeURIComponent(this._batch);`.
Attachment #8780266 - Flags: review?(rnewman) → review+
Kit: Does _uploadOutgoing (and anything else we touched at the same time) look right to you (e.g. it hasn't gotten merged together, err, wrong?)
Flags: needinfo?(kcambridge)
Blocks: 1301394
Comment on attachment 8780266 [details]
Bug 1253051 - Implement atomic uploads for sync collection data

https://reviewboard.mozilla.org/r/70966/#review75696

> Future: take a subsequence here. There might be hundreds of IDs, and we'll generate lots of logs.

Filed bug 1301394

> Future: this is really a special case of "what do we do when one record fails?"
> 
> Possible answers are:
> 
> * Fail the entire sync.
> * Record that record as failed and retry it later.
> * Drop that record on the floor.
> 
> And that's perhaps best answered by a function:
> 
> ```
>   allowSkippedRecord(guid, failureCode)
> ```

Should this be a separate bug from 1300652?
Yes, LGTM!
Flags: needinfo?(kcambridge)
Attachment #8774637 - Attachment is obsolete: true
Please mark the pending issues in MozReview as resolved so that this can be autolanded.
Keywords: checkin-needed
Ack, my bad, I didn't realized those blocked it from landing. Sorry!
Keywords: checkin-needed
Pushed by kwierso@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/494bab58654b
Implement atomic uploads for sync collection data r=rnewman
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/494bab58654b
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Depends on: 1022215
Depends on: 1308994
See Also: → 1324600
In the release notes of 51 with "Improved reliability of browser data syncing with atomic uploads" as wording.
Component: Firefox Sync: Backend → Sync
Product: Cloud Services → Firefox
You need to log in before you can comment on or make changes to this bug.