Last Comment Bug 671066 - Unknown error on conclusion of easy setup of second client.
: Unknown error on conclusion of easy setup of second client.
Status: VERIFIED FIXED
[verified in services]
: regression, verified-aurora
Product: Cloud Services
Classification: Client Software
Component: Firefox Sync: Backend (show other bugs)
: unspecified
: All All
: -- normal (vote)
: mozilla8
Assigned To: Philipp von Weitershausen [:philikon]
:
Mentors:
Depends on:
Blocks: 664792
  Show dependency treegraph
 
Reported: 2011-07-12 13:54 PDT by Tracy Walker [:tracy]
Modified: 2011-08-03 09:21 PDT (History)
2 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
fixed


Attachments
v1 (2.49 KB, patch)
2011-07-12 15:40 PDT, Philipp von Weitershausen [:philikon]
no flags Details | Diff | Splinter Review
v1.1 (2.47 KB, patch)
2011-07-12 15:44 PDT, Philipp von Weitershausen [:philikon]
no flags Details | Diff | Splinter Review
v1.2 (2.60 KB, patch)
2011-07-12 16:10 PDT, Philipp von Weitershausen [:philikon]
rnewman: review+
Details | Diff | Splinter Review
part 2 (v1) (20.86 KB, patch)
2011-07-13 14:13 PDT, Philipp von Weitershausen [:philikon]
rnewman: review+
Details | Diff | Splinter Review
combined patch for aurora (20.70 KB, patch)
2011-07-14 08:16 PDT, Philipp von Weitershausen [:philikon]
blizzard: approval‑mozilla‑aurora+
Details | Diff | Splinter Review

Description Tracy Walker [:tracy] 2011-07-12 13:54:49 PDT
Discovered with s-c nightly builds but confirmed it also exists on m-c nightly as well. (I can't reproduce on aurora)

STR:

1) On client A, start it with an existing profile with a sync account already active.
2) On client B, go to Tools > Set Up Sync. Click "Connect" to be presented with 12 digit code.
3) On Client A, go to Prefs (options) > Sync > Add a Device. 
4) In the resulting dialog, type in the 12 digit code given by client B. Then click Continue.
5) Wait several seconds for the process to complete.  DO NOT click Done or close the setup dialog.
6) Go to client B. DO NOT click Finish or close the setup dialog.

tested results:
client B gets the message "Sync encountered an error while syncing: Unknown error. Sync will automatically retry this action. [Sync Now]"  

expected results: sync should not throw an error message. Until one or the other clients dismisses the setup dialog, nothing should happen. Suggest the successful setup dialog auto-dismiss itself after a certain period of time.

note: this could explain some people getting stuck on the error message. if they don't dismiss the easy setup dialogs, sync won't work (confirmed by clicking Sync Now button in the error message).  Then if you restart the second client, it reverts back to not being setup to sync.

also note: if you DO hit Done and Finish right away.  The error message doesn't appear in the add-ons bar.

In any case, once you do dismiss the setup dialog on client B, in any manner, a sync is initiated and is successful.  I don't think this should block the current train, but should be fixed before the next m-c merge to aurora.
Comment 1 Philipp von Weitershausen [:philikon] 2011-07-12 14:19:02 PDT
(In reply to comment #0)
> Discovered with s-c nightly builds but confirmed it also exists on m-c
> nightly as well. (I can't reproduce on aurora)

That doesn't make sense. This is the first train since the last Aurora branching point. Can you please double check?

> 1) On client A, start it with an existing profile with a sync account
> already active.
> 2) On client B, go to Tools > Set Up Sync. Click "Connect" to be presented
> with 12 digit code.
> 3) On Client A, go to Prefs (options) > Sync > Add a Device. 
> 4) In the resulting dialog, type in the 12 digit code given by client B.
> Then click Continue.
> 5) Wait several seconds for the process to complete.  DO NOT click Done or
> close the setup dialog.
> 6) Go to client B. DO NOT click Finish or close the setup dialog.
> 
> tested results:
> client B gets the message "Sync encountered an error while syncing: Unknown
> error. Sync will automatically retry this action. [Sync Now]"  

I can't reproduce this using the above STRs on OSX, only on Windows. Weird.

> note: this could explain some people getting stuck on the error message. if
> they don't dismiss the easy setup dialogs, sync won't work (confirmed by
> clicking Sync Now button in the error message).  Then if you restart the
> second client, it reverts back to not being setup to sync.

Where are those reports?
Comment 2 Tracy Walker [:tracy] 2011-07-12 14:29:42 PDT
I didn't see this failure on aurora with builds from 20110708, but do see it with 20110712 builds.

> Where are those reports?
SUMO is getting reports of this error message.
Comment 3 Philipp von Weitershausen [:philikon] 2011-07-12 15:40:28 PDT
Created attachment 545508 [details] [diff] [review]
v1

Further testing showed that the problem only occurs if you just started Firefox. If you had Sync set up earlier, then unlinked it and set it up again, it worked. This led me to believe the problem is related to Service initialization.

And sure enough, the regression originates in bug 664792 part 3 when we made autoconnect not only log in but sync. For some reason _autoConnect called _checkSync() with the kFirstChoiceNotMade constant. This didn't even make sense when autoconnect was just doing login, and it definitely doesn't make sense when it triggers a sync. (Because it causes this error.)
Comment 4 Philipp von Weitershausen [:philikon] 2011-07-12 15:44:47 PDT
Created attachment 545510 [details] [diff] [review]
v1.1

Patch without silly debug output.
Comment 5 Philipp von Weitershausen [:philikon] 2011-07-12 16:10:02 PDT
Created attachment 545514 [details] [diff] [review]
v1.2

Marina pointed out that waiting for five event loop ticks is a bit arbitrary and just as brittle as waiting for a specific delay. After consulting the nsITimer source [1], I think we should first wait for 100ms to ensure the timer in autoconnectDelay has fired (timers are fired in 100ms intervals) and then two event loop ticks to ensure the Utils.nextTick in _autoconnect has fired.

Pretty trick to test that an async process *wasn't* executed :)

[1] https://mxr.mozilla.org/mozilla-central/source/xpcom/threads/TimerThread.cpp#265
Comment 6 Richard Newman [:rnewman] 2011-07-12 16:23:55 PDT
Comment on attachment 545514 [details] [diff] [review]
v1.2

Review of attachment 545514 [details] [diff] [review]:
-----------------------------------------------------------------

Nice hack.

::: services/sync/tests/unit/test_syncscheduler.js
@@ +372,5 @@
> +  Svc.Obs.add("weave:service:sync:start", onSyncStart);
> +
> +  // First wait 100ms (nsITimers can take up to that much time to fire, so
> +  // we can account for the timer in delayedAutoconnect) and then two event
> +  // loop ticks (to account for the Utils.nextTick() in _autoConnect).

OK, this terrifies me a little bit, but there isn't really much of an alternative!

I'd probably let my physical engineering background show through here and say 150msec and three ticks, but I don't have a very good handle on how confident we can be that 100/2 are 100% adequate.
Comment 7 Philipp von Weitershausen [:philikon] 2011-07-12 16:30:50 PDT
(In reply to comment #6)
> I'd probably let my physical engineering background show through here and
> say 150msec and three ticks, but I don't have a very good handle on how
> confident we can be that 100/2 are 100% adequate.

Me neither. I can make it 150/3 if it puts your mind at ease.
Comment 8 Philipp von Weitershausen [:philikon] 2011-07-12 16:38:34 PDT
Landed with 150/2 because there's really no point in waiting another event loop tick.

http://hg.mozilla.org/services/services-central/rev/8f022250cf92
Comment 9 Philipp von Weitershausen [:philikon] 2011-07-12 16:43:04 PDT
Comment on attachment 545514 [details] [diff] [review]
v1.2

Requesting approval for Aurora because bug 664792 landed in Firefox 7 (now in Aurora) and hence introduced the regression there. The fix is low risk and well understood (it basically removes a small WTF from the code.)
Comment 10 Philipp von Weitershausen [:philikon] 2011-07-12 16:52:30 PDT
(In reply to comment #1)
> I can't reproduce this using the above STRs on OSX, only on Windows. Weird.

I'm still not sure as to why that is, but I suspect it has to do with the fact that dialogs such as the Sync setup wizard on Windows are modal and thus block the event loop of other windows.
Comment 11 Richard Newman [:rnewman] 2011-07-12 21:21:29 PDT
Re approval-mozilla-aurora?: apparently tracy can repro this even with the fix on WinXP and Win7. Needs further investigation. No approval for Aurora for this fix until we figure that out.
Comment 12 Tracy Walker [:tracy] 2011-07-12 21:25:17 PDT
With a perfect guess by Philipp, I was able to quickly nail the regression range down to m-c builds of 20110630 works and 20110701 has this bug.
Comment 13 Tracy Walker [:tracy] 2011-07-13 08:31:38 PDT
Philipp, had some time to do some cross checking this morning.  The bug happens on all platforms; Windows, Mac and Linux.
Comment 14 Tracy Walker [:tracy] 2011-07-13 08:34:06 PDT
Also, I think the expected result (besides no error message) is a sync should initiated on the newly added client, whether you dismiss the setup dialog or not.
Comment 15 Philipp von Weitershausen [:philikon] 2011-07-13 08:36:08 PDT
(In reply to comment #14)
> Also, I think the expected result (besides no error message) is a sync
> should initiated on the newly added client, whether you dismiss the setup
> dialog or not.

Nope. We must not initiate a sync until the dialog has been dismissed, because the user must be able to go to Sync Options and change the 'firstSync' pref there (merge, wipe local, wipe server) before we attempt to sync.
Comment 16 Philipp von Weitershausen [:philikon] 2011-07-13 14:13:56 PDT
Created attachment 545743 [details] [diff] [review]
part 2 (v1)

Alright, it took some poking around, but I think I've got it now... The source of the problem was still Service._autoconnect(), trying to trigger a sync as service.js is being loaded by the setup wizard.

Even though part 1 changed it to not ignore the "notReady" portion, it would still ignore it. Why? Because Service._checkSync() uses 'else if' and hit kNotLoggedIn first (which *is* ignored). Epic fail. Since kNotLoggedIn is a useless sentinel value at this point, the solution was to simply rip it out.

Why did the test not catch this? Because the tests didn't clean up properly. Fixed that as well! (Also fixed test_interval_triggers.js and test_service_login.js -- boy that last one makes me sad...)
Comment 17 Richard Newman [:rnewman] 2011-07-13 14:25:15 PDT
Comment on attachment 545743 [details] [diff] [review]
part 2 (v1)

Review of attachment 545743 [details] [diff] [review]:
-----------------------------------------------------------------

This looks good to me. As mentioned on IRC, please check that our desired MP handling -- prompt on first automated sync, don't reprompt on subsequent if the user canceled, sync if the user OKs -- still occurs. Better to catch any problems now than in QA!

Good find, btw.
Comment 18 Philipp von Weitershausen [:philikon] 2011-07-13 14:45:42 PDT
(In reply to comment #17)
> This looks good to me. As mentioned on IRC, please check that our desired MP
> handling -- prompt on first automated sync, don't reprompt on subsequent if
> the user canceled, sync if the user OKs -- still occurs. Better to catch any
> problems now than in QA!

Yep. Sure enough I did, but as agree on IRC, this is a separate regression. Filed as bug 671422.
Comment 19 Philipp von Weitershausen [:philikon] 2011-07-13 15:57:25 PDT
Landed part 2: http://hg.mozilla.org/services/services-central/rev/96db7dbde55e. Fingers crossed.
Comment 20 Tracy Walker [:tracy] 2011-07-14 07:22:59 PDT
this is fixed in builds made last night.
Comment 21 Philipp von Weitershausen [:philikon] 2011-07-14 08:16:40 PDT
Created attachment 545913 [details] [diff] [review]
combined patch for aurora

Requesting approval for Aurora because
* Bug 664792 which caused the regression is in Aurora.
* While technically harmless, the regression does produce an error message during Sync signup that is confusing and bad messaging.
* The fix has been tested and verified by Services QA.
Comment 23 Christopher Blizzard (:blizzard) 2011-07-14 14:23:44 PDT
Comment on attachment 545913 [details] [diff] [review]
combined patch for aurora

Approved for Aurora.  Please check it in.
Comment 24 Philipp von Weitershausen [:philikon] 2011-07-14 14:35:35 PDT
http://hg.mozilla.org/releases/mozilla-aurora/rev/703694a623fa

Note You need to log in before you can comment on or make changes to this bug.