Open Bug 732614 Opened 12 years ago Updated 2 years ago

Failed records persist indefinitely

Categories

(Firefox :: Sync, defect)

People

(Reporter: gps, Unassigned)

Currently in sync, if a record fails to apply, it will be cached on disk and we will attempt to apply it again on the next sync.

Unfortunately, there is no standard way to break this cycle, so a failed record can be retried indefinitely.

This is a very real bug in the case of add-on sync. Records will fail to apply if the AMO search doesn't find an add-on or if the found add-on doesn't have an install URI. Either of these could be a transient error, which is why the add-ons engine marks the record as failed and hopes it will apply in the future. Logs on my machine, however, indicate that the errors are effectively permanent: the failed record never goes away.

I don't think individual engines should be burdened with tracking the state of failed records and deciding when to give up. Instead, I'd like to see some kind of mechanism on the base SyncEngine type to manage this. We can have a default policy, and individual engines can tweak it as they need.

There are a number of ways we could go with a default policy:

* Delete record after N failed application attempts
* Delete record after T time has passed
* Combination of above
* Other
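
As a rough illustration of the "combination" option, here is a minimal Python sketch of a policy object hanging off a base engine class. It is not the actual Sync code: `RetryPolicy`, `FailedRecord`, `note_failure`, `note_success`, and the default limits (5 attempts, 14 days) are all hypothetical names and values chosen for the example.

```python
import time
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class FailedRecord:
    """Bookkeeping for one GUID that failed to apply (hypothetical shape)."""
    first_failed: float
    attempts: int = 0


@dataclass
class RetryPolicy:
    """Hypothetical default policy; setting a limit to None disables it."""
    max_attempts: Optional[int] = 5                    # the "N" in the list above
    max_age_seconds: Optional[float] = 14 * 24 * 3600  # the "T" in the list above

    def should_give_up(self, entry: FailedRecord, now: float) -> bool:
        if self.max_attempts is not None and entry.attempts >= self.max_attempts:
            return True
        if (self.max_age_seconds is not None
                and now - entry.first_failed >= self.max_age_seconds):
            return True
        return False


class SyncEngine:
    """Stand-in for the base engine type; only the failure tracking is shown."""
    retry_policy = RetryPolicy()

    def __init__(self) -> None:
        self.failed: Dict[str, FailedRecord] = {}

    def note_failure(self, guid: str, now: Optional[float] = None) -> bool:
        """Count a failed application. Returns True if the GUID stays queued for retry."""
        now = time.time() if now is None else now
        entry = self.failed.setdefault(guid, FailedRecord(first_failed=now))
        entry.attempts += 1
        if self.retry_policy.should_give_up(entry, now):
            del self.failed[guid]  # break the retry cycle
            return False
        return True

    def note_success(self, guid: str) -> None:
        self.failed.pop(guid, None)


class AddonsEngine(SyncEngine):
    # An engine tweaks the default: more attempts, no time limit.
    retry_policy = RetryPolicy(max_attempts=10, max_age_seconds=None)
```

Keeping the limits on a class attribute means an engine that is happy with the default writes no extra code, while one like add-ons overrides a single line.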

Considerations:

* If we break sync in a release and we delete records after T time, we would have to push a fixed release before T or else data will be permanently lost.
* Should persistently failed records trigger some kind of "fresh sync" scenario in hopes of reconciling things?
* It would be great to have Syncorro to measure failed records in the wild!

Strictly speaking, we persist the GUIDs, not the records, and refetch on each sync.

The purpose of Bug 622762 was to deal with code problems that caused a record to fail to apply. Unexpected record ordering problems, plain ol' bugs, etc.

When we switched to continuing a sync after a bad record, rather than aborting and retrying everything later, we needed a way to avoid fast-forwarding over a record that we failed to handle.
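
To make the mechanism described above concrete, here is a small Python sketch of a sync pass that continues past a bad record, remembers only the failed GUIDs, and refetches those GUIDs on the next pass so that advancing the timestamp does not skip them. Everything in it is an assumption for illustration: the `sync_once` function, the in-memory `server` dict standing in for the storage server, and the `id`/`modified` record fields are not taken from the real implementation.

```python
from typing import Callable, Dict, Set, Tuple

Record = Dict[str, object]  # hypothetical shape: {"id": guid, "modified": timestamp, ...}


def sync_once(
    server: Dict[str, Record],
    apply_record: Callable[[Record], None],
    last_sync: float,
    previous_failed: Set[str],
) -> Tuple[float, Set[str]]:
    """Run one sync pass; return (new last-sync timestamp, GUIDs that failed)."""
    failed: Set[str] = set()
    newest = last_sync

    # Records modified since the last sync.
    new_records = [r for r in server.values() if r["modified"] > last_sync]
    new_ids = {r["id"] for r in new_records}

    # Previously failed GUIDs are refetched, unless a newer version is already incoming.
    retry_ids = previous_failed - new_ids
    retries = [server[g] for g in sorted(retry_ids) if g in server]

    for record in new_records + retries:
        try:
            apply_record(record)
        except Exception:
            # Keep going instead of aborting the whole sync; remember only the GUID.
            failed.add(record["id"])
        newest = max(newest, record["modified"])

    return newest, failed
```

Note that nothing in this loop ever drops a GUID from the failed set except a successful application, which is exactly how a record that can never apply ends up being retried forever.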

My contention is that this (now?) does more harm than good: I doubt that we (still?) frequently fix code problems of this kind. Most likely we simply retry and retry until another client uploads a correct version of the record (which of course will have a later timestamp).

There might be some value to some kind of tracking which can be consumed by engines — when a code change happens, we could look at the last bucket and do something new. But that should probably be opt-in, not default.

Add-ons is unusual in that record application is fraught with peril. Probably the right solution for that is to make record application safe — an installation queue? — unless we can find other engines that would benefit.

In the future this behavior might be opt-in through a repository middleware implementation.

Android Sync doesn't persist failed records right now (Bug 709371).

Depends on: 622762
Component: Firefox Sync: Backend → Sync
Product: Cloud Services → Firefox
Severity: normal → S3