Bug 1300652 (Open) - Opened 8 years ago, Updated 2 years ago

More considered handling of "failed" records during record upload.

Component: Firefox :: Sync
Type: defect
Priority: P3
Reporter: markh
Assignee: Unassigned
Whiteboard: [sync-quality]

I believe there are a number of users who repeatedly see "failed" items after attempting an upload. Currently, these failed items will be re-uploaded on every sync, forever, which causes unnecessary load on both the server and the client, and a perception of poor performance on the client.

In most cases the records are rejected for reasons that mean they will be rejected "forever" - at least until the server configuration changes - eg, the record exceeds the maximum size the server will accept. Note that I'm only talking about records reported as "failed" in an otherwise successful response, or records we don't even attempt to upload because they exceed various size restrictions. This is *not* about items that fail to upload due to a failed request/response.
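For context, the kind of response I mean looks roughly like this (shape of a collection POST response; the error strings below are just illustrative, not actual server output):

    // Rough TypeScript shape of a 200 response to a collection POST that still
    // reports per-record failures. Error strings are made up for illustration.
    interface CollectionPostResponse {
      modified: number;               // new collection last-modified timestamp
      success: string[];              // GUIDs that were stored
      failed: Record<string, string>; // GUID -> reason, eg "payload too large"
    }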

We should consider handling these failed records in a better way. For some engines (eg, bookmarks) we probably need to abort the current sync and future syncs. For others, eg, history, we can probably just discard the entry and never attempt that upload again.

One complication, as mentioned above: say we reject records for hitting the 256k max record size, but at some point in the future the server raises that limit - ideally we'd be smart enough to detect the change and attempt re-upload only at that point. However, given such changes are rare, I don't think it's sane to retry every record on every sync in the hope the server config has changed.

I've a couple of poorly formed thoughts on how we might approach this, but thought I'd solicit general feedback first.

Thoughts?
Something I've thought about in the past -- assuming we know *why* a record was rejected (Bug 1252997), almost all reasons should put the failing GUID in a bucket that is only revisited if:

* The server configuration changes. (Per-record-size, probably.)
  - This is also a proxy for server code changing…
* The protocol version changes. (Perhaps Sync 1.6 will allow base64 sortindex? :P)
* The client version changes. (We had a bug!)
* The record changes, locally or remotely.

Some engines (bookmarks!) will refuse to sync if this bucket is non-empty.
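To make that concrete, here's a rough sketch of the bucket (all names are made up, not a proposed API; the reason string assumes Bug 1252997 gives us structured failure reasons):

    // Sketch only. One entry per permanently-failed record.
    interface FailedUpload {
      guid: string;
      reason: string;            // eg "record-too-large", from Bug 1252997's reporting
      clientVersion: string;     // app version when we gave up
      configFingerprint: string; // hash of info/configuration (and protocol version) when we gave up
      localLastModified: number; // the record's local modification time
    }

    // Revisit an entry only when one of the invalidating conditions holds.
    function shouldRetry(entry: FailedUpload, current: {
      clientVersion: string;
      configFingerprint: string;
      localLastModified: number;
    }): boolean {
      return entry.clientVersion !== current.clientVersion          // client upgraded (we had a bug!)
          || entry.configFingerprint !== current.configFingerprint  // server limits / protocol changed
          || entry.localLastModified !== current.localLastModified; // the record itself changed
    }

    // Engines that can't tolerate partial uploads (bookmarks) would refuse to
    // sync at all while any entries remain in the bucket.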

Perhaps when the track-failed approach was built into Weave we expected one of those four things, or server fixes, to be changing frequently enough that "try again in five minutes" was acceptable. I doubt much thought was given to records that permanently exceed documented limits.


In the bad-bucket scheme we'd upload a 2MB forms record once per Firefox upgrade, which isn't so bad… and if the server changed its BSO limit, we'd upload it right away.
Oh, and I agree entirely that this is worth fixing.

Considering bookmarks alone, this is one more situation where we get stuffed up -- one huge URL, or one 30,000-item folder, and we'll both corrupt the server *and* keep churning that upload on every sync.
Priority: -- → P3
>   - This is also a proxy for server code changing…

You can check for this explicitly by requesting the /__version__ route, and/or we could include the server's version string as part of /info/configuration.
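eg, something as simple as this on the client (a sketch only; the exact body of /__version__ doesn't matter here, we'd just compare it against the last value we saw):

    // Sketch: detect a server deploy by fetching the unauthenticated /__version__
    // route and comparing the raw body against what we cached last time.
    async function serverVersionChanged(endpointBase: string, lastSeen: string | null): Promise<boolean> {
      const resp = await fetch(endpointBase + "/__version__");
      if (!resp.ok) {
        return false; // can't tell, so don't treat it as a change
      }
      const current = await resp.text();
      return lastSeen !== null && current !== lastSeen;
    }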
(In reply to Ryan Kelly [:rfkelly] from comment #3)
> >   - This is also a proxy for server code changing…
> 
> You can check for this explicitly by requesting the /__version__ route,
> and/or we could include the server's version string as part of
> /info/configuration.

It seems this might be too coarse - if a version bump happened for other operational reasons and didn't actually change anything relevant, we'd end up retrying more often than we need to.

If we did want to handle this, I'd think that locally caching the most recent info/configuration response and retrying on change would be better. However, given Richard's idea of auto-retrying on a client version bump, I'd be inclined to say we just allow that to eventually unstick clients, even if it means some clients are "stuck" 6 weeks longer than they otherwise would be. At least for a first cut.
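ie, roughly this (a sketch; the helpers are placeholders for whatever prefs/storage we'd actually use, not existing code):

    // Sketch: clear the failed-record bucket only when info/configuration changes.
    async function maybeRetryOnConfigChange(
      fetchInfoConfiguration: () => Promise<object>, // GET .../info/configuration
      getCachedConfig: () => string | null,          // eg a pref holding the last-seen response
      setCachedConfig: (value: string) => void,
      clearFailedBucket: () => void,                 // makes failed GUIDs eligible for upload again
    ): Promise<void> {
      const current = JSON.stringify(await fetchInfoConfiguration());
      const previous = getCachedConfig();
      if (previous !== null && previous !== current) {
        // Server limits changed (eg max record size raised); records we gave up
        // on might now be accepted, so retry them once.
        clearFailedBucket();
      }
      setCachedConfig(current);
    }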

(However, the "on new client" strategy might want a little thought for Nightly and Aurora, where updates are far more frequent. OTOH, maybe it's fine if those channels end up retrying regularly.)
Severity: normal → S3