Closed Bug 615926 Opened 9 years ago Closed 9 years ago

All engines are disabled after fresh start (e.g., upgrade to simplified crypto)

Categories

(Firefox :: Sync, defect)

defect
Not set

Tracking

()

RESOLVED FIXED

People

(Reporter: philikon, Assigned: rnewman)

References

Details

Attachments

(3 files)

Either https://bugzilla.mozilla.org/show_bug.cgi?id=603489#c61 wasn't addressed or a new bug crept in.
Re the previous bug, I thought this was a sufficient fix:


commit 9e9c732d2e2e02c22122a93317e6921916db460f
Author: Richard Newman <rnewman@twinql.com>
Date:   Mon Nov 22 17:45:23 2010 -0800

    Avoid trampling existing meta on login.

diff --git a/services/sync/modules/service.js b/services/sync/modules/service.js
index 15c6da1..fafc465 100644
--- a/services/sync/modules/service.js
+++ b/services/sync/modules/service.js
@@ -1401,8 +1401,8 @@ WeaveSvc.prototype = {
       engine.lastModified = info.obj[engine.name] || 0;
 
     // If the modified time of the meta record ever changes, clear the cache.
-    // I've seen this come back undefined, so let's be careful.
-    if (info.obj && info.obj.meta && (info.obj.meta != this.metaModified)) {
+    // ... unless meta is marked as new.
+    if ((info.obj.meta != this.metaModified) && !Records.get(this.metaURL).isNew) {
       this._log.debug("Clearing cached meta record. metaModified is " +
           JSON.stringify(this.metaModified) + ", setting to " +
           JSON.stringify(info.obj.meta));


and this made it into fx-sync:

$ ack --js "unless meta"
services/sync/modules/service.js
1412:    // ... unless meta is marked as new.


Sounds like this might be a new issue, or that wasn't a sufficient fix. Looking now.
(Will be keeping notes in this bug.)

Using stable 3.6.10, two instances with two profiles:

* Install Sync 1.5 from Addons
* Sync
* Install XPI built from hg/services/fx-sync
* Sync

This works fine, with no engines disabled.

Downloading Fx4 nightlies now to see what happens there; that's where Philipp sees the issue.
Status: NEW → ASSIGNED
I just tried it again, again with a new account, and I'm not seeing it this time. Heisenbug?
Couldn't repro with

* 2010-11-30 => 2010-12-01 nightly
* 4.0b7 => 2010-12-01 nightly

either.

Logs show everything proceeding as expected, and prefs stay checked and engines report syncing.

Definitely a Heisenbug.

Philipp and I discussed addressing this issue without having a repro:

11:39:44 <@philiKON> rnewman: if we can't track this down then maybe we should add a band-aid
11:40:09 <@philiKON> rnewman: along the lines of: don't disable all engines, that clearly can't be the intent
11:40:46 <@philiKON> sure i believe that
11:40:47 <@rnewman> philiKON: that might be a reasonable solution
11:40:55 <@philiKON> i bet it's a race condition somewhere with the meta record
11:40:56 <@rnewman> but I'd probably want more consensus than just us
11:41:05 <@philiKON> agreed
11:41:16 <@rnewman> and worth me spending some time reading that code in case a race jumps out
11:41:26 <@rnewman> pity it's hard to repro!

I'm going to take a look at the meta manipulation code, see if it's something obvious like racing to read prefs before they've been set.
Philipp and I spent a little while discussing how this problem might occur. It doesn't seem to be limited to the recent crypto changes (it's been seen before, though New Crypto might make it more likely), and it's not readily reproducible.

Philipp suspects the following might occur, with the asynchronous operations causing a race:

1. _freshStart() creates a new meta record
2. Record is uploaded, _freshStart() is paused until request returns.
A. In the meantime sync() is started
B. sync() gets new meta record. Records will fetch it, function is paused.
3. _freshStart() continues, sets meta record to Records
C. sync() continues, Records overwrote the meta record in 3. with the one fetched in A.

The meta record is thus mangled (not marked with .isNew) when we come to assess which engines are active.

The attached patch adds locking (not strictly atomic, but sufficient for single-threaded-JS-with-async-operations) throughout the Records object, and an atomic setAndPut() method to upload a record and store it locally.

Without the ability to reproduce the error (I've tried!) I can't verify that this change fixes it... but the test suite passes, and a Minefield build with the change successfully upgrades and syncs an old account.

Thoughts, Philipp (or anyone)?

-- not setting a review flag, because I'm in no rush to commit this.

I am somewhat of the opinion that a thorough review of the codebase for interruption-safety would be worthwhile... though I hesitate to volunteer for it!
Summary: All engines are disabled after upgrade to simplified crypto → All engines are disabled after fresh start (e.g., upgrade to simplified crypto)
http://brasstacks.mozilla.com/resultserv/crossweave/logdata/6509163/ has revealed a cases of this with excellent logs. The main clues are this:

2010-12-06 12:19:13	Service.Main         DEBUG	Clearing cached meta record. metaModified is undefined, setting to 0.01
2010-12-06 12:19:13	Service.Main         DEBUG	Fetching global metadata record

The problem is that we hit the Records.del(this.metaURL) code path in Service.sync() without having gone through freshStart() first. That codepath should probably be executed before the _remoteSetup() call, or maybe even incorporated into remoteSetup().
(In reply to comment #6)
> That codepath
> should probably be executed before the _remoteSetup() call, or maybe even
> incorporated into remoteSetup().

I think you mean "after", no? It currently happens before.
(In reply to comment #7)
> (In reply to comment #6)
> > That codepath
> > should probably be executed before the _remoteSetup() call, or maybe even
> > incorporated into remoteSetup().
> 
> I think you mean "after", no? It currently happens before.

yes, sorry
Per Philipp's comment: this change relocates the check, refetches meta, and attempts to preserve meta flags across delete/refetch.

Tests pass, but I don't know if this is the correct fix. Opinions, please...
Attachment #495614 - Flags: review?(mconnor)
Comment on attachment 495614 [details] [diff] [review]
Move meta modified check inside remoteSetup, v1

I think this is at least more correct.  It's a hard to repro bug, really, but this should make things more robust.  Land it, see if we stop getting intermittent fails on crossweave.  (Though the exacerbating bug will go away soon.)
Attachment #495614 - Flags: review?(mconnor) → review+
(In reply to comment #10)
> Land it, see if we stop getting intermittent fails on crossweave.

OK, dropped: 

http://hg.mozilla.org/services/fx-sync/rev/d27f69801cc1

*fingers crossed*
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Stupid typo fix pushed:

http://hg.mozilla.org/services/fx-sync/rev/0db344491f29

Thanks philiKON.
Blocks: 617173
Patch to skip engine disabling if numClients is less than two. Should belt-and-braces protect against spurious engine disabling bugs.

Verified by hand that two clients can disable all sync engines for each other, and tests and Crossweave pass.

No unit tests OK for now per IRC chat with mconnor.
Attachment #495687 - Flags: review?(mconnor)
Attachment #495687 - Flags: review?(mconnor) → review+
Duplicate of this bug: 617104
Root issue still seems to persist: Bug 692620.
Component: Firefox Sync: Backend → Sync
Product: Cloud Services → Firefox
You need to log in before you can comment on or make changes to this bug.