Some engines disabled during syncing with intermittent server failures

NEW
Unassigned

Status

Cloud Services
Firefox Sync: Backend
P3
major
6 years ago
6 months ago

People

(Reporter: rnewman, Unassigned)

Tracking

({testcase-wanted})

Firefox Tracking Flags

(Not tracked)

Details

(Reporter)

Description

6 years ago
A user on Twitter, several others, and I have observed during the recent outage that only the bookmarks engine is checked.

I believe this only occurs for a fresh-start sync.

This reminds me of Bug 615926.

Hypothesis: we delete everything on the server, start uploading, manage clients and bookmarks, then abort the sync. The other client sees clients (so the logic in Bug 615926 doesn't come into play), but meta/global and the content of the server imply that the failed engines are disabled.
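
To make that sequence concrete, here is a rough model of it in Python. This is only an illustration of the hypothesis, not Sync client code: the engine list, `first_sync`, `naive_enabled_engines`, and the shape of the `meta/global` record are all simplified assumptions, and the server is just a dict.

```python
# Hypothetical model of the suspected sequence; not actual Sync client code.
ALL_ENGINES = ["clients", "bookmarks", "history", "passwords", "tabs"]


class UploadError(Exception):
    """Stands in for a 503, timeout, or dropped connection."""


def first_sync(server, fail_at):
    """Client A: wipe the server, then upload engines until one fails."""
    server.clear()
    server["meta/global"] = {"engines": {}}
    for name in ALL_ENGINES:
        if name == fail_at:
            raise UploadError(name)  # the sync aborts here
        server[name] = ["...records..."]
        server["meta/global"]["engines"][name] = {"version": 1}


def naive_enabled_engines(server):
    """Client B: treat 'no entry in meta/global' as 'engine disabled'."""
    listed = server["meta/global"]["engines"]
    return [name for name in ALL_ENGINES if name in listed]


server = {}
try:
    first_sync(server, fail_at="history")
except UploadError:
    pass  # client A gives up; the server is left half-populated

print(naive_enabled_engines(server))  # ['clients', 'bookmarks']
```

In this model the second client ends up with only clients and bookmarks enabled, which matches the symptom if the real client makes a similar inference from a partially written server.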

This might be a tricky one. We perhaps need to do some more integrity checking on meta/global.
zandr was seeing this too. CCing him so maybe we can get his error logs.
I think I had one other engine checked (History?).

Between what Fx has and Time Machine, I should be able to gather a pretty good swath of logs. What am I looking for?
(Reporter)

Comment 3

3 years ago
A user just reported this on Nightly -- nothing seems to be syncing, lots of 'expected' FxA error logs, then everything is unchecked except for add-ons in Preferences.

This will probably be an even more frequent problem with FxA sync, because we're hitting error conditions as demonstrated by Bug 971194.

I think my hypothesis in Comment 0 still stands.

Bug 969672 will probably fix this, in that it won't assume that no meta/global entry = disabled.
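
A minimal sketch of what "not assuming" could look like, in Python for illustration only. I'm assuming here that meta/global gains some explicit way to say an engine was turned off; the `declined` list, `EngineState`, and both function names are hypothetical and not taken from the actual patch.

```python
from enum import Enum


class EngineState(Enum):
    ENABLED = "enabled"
    DISABLED = "disabled"
    UNKNOWN = "unknown"


def remote_engine_state(meta_global, name):
    """Classify an engine from meta/global without guessing on absence."""
    engines = meta_global.get("engines", {})
    declined = meta_global.get("declined", [])  # hypothetical explicit list
    if name in engines:
        return EngineState.ENABLED
    if name in declined:
        return EngineState.DISABLED
    return EngineState.UNKNOWN  # missing entry: could be a partial upload


def apply_remote_state(local_enabled, meta_global):
    """Return updated local prefs; UNKNOWN engines keep their local value."""
    result = dict(local_enabled)
    for name in local_enabled:
        state = remote_engine_state(meta_global, name)
        if state is EngineState.ENABLED:
            result[name] = True
        elif state is EngineState.DISABLED:
            result[name] = False
    return result


local = {"bookmarks": True, "history": True, "addons": True}
remote = {"engines": {"bookmarks": {"version": 1}}, "declined": ["addons"]}
print(apply_remote_state(local, remote))
```

With this shape, an engine that simply never made it to the server (history above) stays in whatever state the local client had, and only an explicit signal unchecks it.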
Blocks: 969593
Depends on: 969672
Priority: -- → P2
(In reply to Richard Newman [:rnewman] from comment #3)
> This will probably be an even more frequent problem with FxA sync, because
> we're hitting error conditions as demonstrated by Bug 971194.

> Bug 969672 will probably fix this, in that it won't assume that no
> meta/global entry = disabled.

Bug 969669 comment 20 suggests we may not get that fix. Do we have other ways to mitigate? Do we really expect this to be more common in 29 once we actually ship? Which "error conditions" are required exactly?
(Reporter)

Comment 5

3 years ago
We don't fully understand the problem, and have never reproduced it in a debugging context, otherwise we'd have fixed it two and a half years ago :)

My intuition is that (hard?) failures between the bookended meta/global uploads on a first sync are the problem, but I don't have a detailed event sequence arranged in my head. It certainly seems to be related to upload (and maybe download?) errors, typically around a server maintenance (and thus a first sync on node reassignment).

If we ship with the dramatically higher incidence of 'ignorable' errors that we see in FxA Sync, there's a chance that this will spike accordingly.

(I hope we don't, and I hope it doesn't.)

Some parts of Bug 978876 (which is aimed at 29) might improve matters here, but that's a very speculative statement.
> If we ship with the dramatically higher incidence of 'ignorable' errors that we see in FxA Sync, there's a chance that this will spike accordingly.

Can you elaborate more on what you mean by 'ignorable' errors in FxA Sync?
(Reporter)

Comment 7

3 years ago
Pretty much all things that aren't genuine auth errors -- anything that doesn't raise a yellow bar.

For example, in the past few weeks we've had clock skew errors, 404s in custom engines, some new 503s and 500s, bookmarks appearing with no encrypted contents (that's new!), all of the setup timing bugs that Mark's been working on, and the fun smorgasbord in Bug 971194.

These are on top of the baseline set of DNS errors, cert errors, dropped connections, timeouts, etc. that also affect Sync 1.1.

I expect some but not all of the new bugs to have been caught and fixed before we go to Beta, but I'd be really surprised if FxA Sync has finished converging on Sync 1.1's asymptotic level of bugginess, and thus we're strictly more likely to hit "higher-order" bugs like this one.
No longer blocks: 969593

Updated

9 months ago
Keywords: testcase-wanted
Priority: P2 → --

Updated

6 months ago
Priority: -- → P3