Closed Bug 659067 Opened 13 years ago Closed 13 years ago

Only show network errors for user triggered syncs, not for scheduled ones

Categories

(Firefox :: Sync, defect)

defect
Not set
normal

Tracking

()

VERIFIED FIXED
mozilla9

People

(Reporter: emtwo, Assigned: emtwo)

References

()

Details

(Whiteboard: [verified in services])

Attachments

(3 files, 10 obsolete files)

39.71 KB, patch
philikon
: review+
Details | Diff | Splinter Review
27.18 KB, patch
Details | Diff | Splinter Review
52.25 KB, patch
Details | Diff | Splinter Review
* persist guids of failed items (passed as notification on weave:engine:sync:apply-failed to errorpolicy)
** save in data structure while browser is open
** place as serialized json in database upon browser close
** load deserialized json from database upon browser open
* pull out error handling functionality from service.js and place it into errorpolicy.js, some of which include:
** Service._checkServerError
** Service._handleSyncError
** Service.shouldIgnoreError
* no need for service.js to listen to weave:engine:sync:apply-failed or keep track of syncError in Service.lockedSync, instead Service.lockedSync calls ErrorPolicy._checkFailedInterval function which reports first time failures and long term failures to be propagate to the user
** error policy observes weave:engine:sync:apply-failed instead and receives failed items from _processIncoming
* separate failed guids & toFetch guids in _processIncoming and notify errorpolicy with failed list for appropriate persistence & handling
* appropriately propagate errors to the users through the use of a throwable Error object whose prototype in errorpolicy.js accepts an error cause and description as constructor parameters
** start by throwing Error objects from Service.lockedSync instead of strings
Heres what I've roughly written so far (but still needs testing & cleanup):
* save in data structure while browser is open
* no need for service.js to listen to weave:engine:sync:apply-failed or keep track of syncError in Service.lockedSync, instead Service.lockedSync calls ErrorPolicy._checkFailedInterval function which reports first time failures and long term failures to be propagate to the user
** error policy observes weave:engine:sync:apply-failed instead and receives failed items from _processIncoming
* separate failed guids & toFetch guids in _processIncoming and notify errorpolicy with failed list for appropriate persistence & handling
(In reply to comment #0)
> * persist guids of failed items (passed as notification on
> weave:engine:sync:apply-failed to errorpolicy)
> ** save in data structure while browser is open
> ** place as serialized json in database upon browser close
> ** load deserialized json from database upon browser open

As discussed in person just now, it's better for SyncEngine.prototype._processIncoming to handle this because failed items should still be refetched. See bug 659107.

> * pull out error handling functionality from service.js and place it into
> errorpolicy.js, some of which include:
> ** Service._checkServerError
> ** Service._handleSyncError
> ** Service.shouldIgnoreError

Yes!

> * no need for service.js to listen to weave:engine:sync:apply-failed or keep
> track of syncError in Service.lockedSync, instead Service.lockedSync calls
> ErrorPolicy._checkFailedInterval function which reports first time failures
> and long term failures to be propagate to the user

Or the error policy just observes weave:service:sync:finish (or whatever it's called).

> ** error policy observes weave:engine:sync:apply-failed instead and receives
> failed items from _processIncoming

Yes, with the work in bug 659107, this will only happen if there are newly failed items.

> * separate failed guids & toFetch guids in _processIncoming and notify
> errorpolicy with failed list for appropriate persistence & handling

-> Bug 659107

> * appropriately propagate errors to the users through the use of a throwable
> Error object whose prototype in errorpolicy.js accepts an error cause and
> description as constructor parameters
> ** start by throwing Error objects from Service.lockedSync instead of strings

Sounds good. The prime reason for error objects instead of strings would be that we can add useful information to them later, e.g. which bookmark failed to apply, etc.
Sounds like client work, changing the component
Component: Server: Sync → Firefox Sync: Backend
QA Contact: sync-server → sync-backend
OS: Mac OS X → All
Hardware: x86 → All
Attached patch Part 1, WIP - Error Policy (obsolete) — Splinter Review
Passes TPS & xpcshell tests.

So far:
* Some error handling is pulled out of service & into ErrorHandler
* Network errors are never reported now
* UI no longer observes errors but ErrorHandler drives UI 

Next:
* Tests, open to suggestions here
* Report to user when sync hasn't been successful for a certain amount of time
* Pass more details about the cause of a sync error to the UI

Weirdness: 
* Haven't fully tested this, but calls to methods in gSyncUI don't seem to be causing any errors but I didn't import anything to be able to access that object :S
Attachment #547832 - Flags: feedback?(philipp)
Comment on attachment 547832 [details] [diff] [review]
Part 1, WIP - Error Policy

Great start. Summarizing face to face discussion:

* Because logging is pretty much tied to error handling now, it makes sense to also move all the logging stuff to the Error Handling Policy.

* The _syncError flag which gets set by either an exception thrown by engine.sync() or by newFailed items in weave:engine:sync:applied could also move with all of its logic attached. This means that the Error Handling Policy would have to update the UI in those cases (rather than going through weave:service:sync:error), and it would also have to ensure we continue to log in those cases.

* A Sync triggered by the SyncScheduler should never show network errors and the likes, only after a predefined interval. To ensure that the UX isn't broken for people with broken network setups, syncs explicitly triggered by the UI should always show network errors.

* Let's not refactor the way the UI is driven for now because Syncorro will likely change it anyways, and we can do some more platform work to get more details on errors in the meantime.
Attachment #547832 - Flags: feedback?(philipp)
Summary: Error Handling Policy → Only show network errors for user triggered syncs, not for scheduled ones
Attached patch Part 1, V1: Create ErrorHandler (obsolete) — Splinter Review
Attachment #547832 - Attachment is obsolete: true
Attachment #548630 - Flags: feedback?(philipp)
Comment on attachment 548630 [details] [diff] [review]
Part 1, V1: Create ErrorHandler

Nice work.

> const Cu = Components.utils;
>+const Cr = Components.results;

In all new files I just put this at the top because it's shorter and looks totally l33t:

  const {classes: Cc, interfaces: Ci, utils: Cu, results: Cr} = Components;

>+
>+//Cu.import("resource://gre/modules/XPCOMUtils.jsm");

Spurious left-over comment? (No need to import XPCOMUtils explicitly if you're already importing util.js.)

>+  resetFileLog: function resetFileLog(flushToFile, filenamePrefix) {
...
>+  errorStr: function errorStr(code) {

For extra bonus points, you could add JavaDoc style comments for those methods. Every new hire after you will love you forever. (For examples look for /** */ comments in service.js and util.js.)

>+  /**
>+   * Handle HTTP response results or exceptions and set the appropriate
>+   * Status.* bits.
>+   */
>+  _checkServerError: function _checkServerError(resp) {

Could drop the underscore here, since Service calls it.

>+  // Make sync fail due to changed credentials.
>+  Utils._isPassphrase = Utils.isPassphrase;
>+  Utils.isPassphrase = function() {
>+    return false;
>+  };

If you already have a whole Sync server setup going on, I would prefer if the server would send a 401 to indicate that credentials have changed, rather than monkey patching code. (That way, if we refactor the relevant client code, the test will still pass.)

I really wish we had a better test server implementation that would make writing such tests easier. For now, to keep things simple, you could just register additional handlers for another user and then swap the username, e.g. in sync_httpd_setup():

  let handler_401 = httpd_handler(401, "Unauthorized");
  return httpd_setup({
    "/1.1/johndoe/storage/meta/global": upd("meta", global.handler()),
...
    "/1.1/janedoe/storage/meta/global": handler_401,
    "/1.1/janedoe/info/collections": handler_401,
  });

and then in the test just do Service.username = "janedoe".

>+  // Error ignored since master password locked.
>+  let shouldIgnore = ErrorHandler.shouldIgnoreError();
>+  do_check_true(shouldIgnore);

  do_check_true(ErrorHandler.shouldIgnoreError());

seems shorter... but whatever :)

>diff --git a/services/sync/tests/unit/test_service_filelog.js b/services/sync/tests/unit/test_service_filelog.js
>--- a/services/sync/tests/unit/test_service_filelog.js
>+++ b/services/sync/tests/unit/test_service_filelog.js

I can haz rename to test_errorhandler_filelog.js?

>+add_test(function test_login_error_logOnError_false() {
...
>+add_test(function test_login_error_logOnError_true() {

<333

>diff --git a/services/sync/tests/unit/test_service_sync_checkServerError.js b/services/sync/tests/unit/test_service_sync_checkServerError.js
>--- a/services/sync/tests/unit/test_service_sync_checkServerError.js
>+++ b/services/sync/tests/unit/test_service_sync_checkServerError.js

Ditto: rename to test_errorhandler_checkServerError, please. (And while you're touching the file, can you update the comment just before test_engine_applyFailed to reflect the truth? The 'weave:engine:sync:apply-failed' observer was replaced by the more generic 'weave:engine:sync:applied' one.
Attachment #548630 - Flags: feedback?(philipp) → feedback+
(In reply to comment #7)
> >+  // Make sync fail due to changed credentials.
> >+  Utils._isPassphrase = Utils.isPassphrase;
> >+  Utils.isPassphrase = function() {
> >+    return false;
> >+  };
> 
> If you already have a whole Sync server setup going on, I would prefer if
> the server would send a 401 to indicate that credentials have changed,
> rather than monkey patching code. (That way, if we refactor the relevant
> client code, the test will still pass.)
> 
> I really wish we had a better test server implementation that would make
> writing such tests easier. For now, to keep things simple, you could just
> register additional handlers for another user and then swap the username,
> e.g. in sync_httpd_setup():
> 
>   let handler_401 = httpd_handler(401, "Unauthorized");
>   return httpd_setup({
>     "/1.1/johndoe/storage/meta/global": upd("meta", global.handler()),
> ...
>     "/1.1/janedoe/storage/meta/global": handler_401,
>     "/1.1/janedoe/info/collections": handler_401,
>   });
> 
> and then in the test just do Service.username = "janedoe".

This doesn't actually test the lines of code that I wanted to test (Even though it's still valid and I could have another test for it). I was trying to test the CREDENTIALS_CHANGED case in sync:error results in logging out not 401 error causing logout (which executes another bit of code). Any other ideas for executing that bit of code?
(In reply to comment #8)
> This doesn't actually test the lines of code that I wanted to test (Even
> though it's still valid and I could have another test for it).

(agreed, please add one :))

> I was trying
> to test the CREDENTIALS_CHANGED case in sync:error results in logging out
> not 401 error causing logout (which executes another bit of code). Any other
> ideas for executing that bit of code?

verifyAndFetchSymmetricKeys checks whether your Sync Key is correct by decrypting it. If that fails, we catch the HMAC mismatch exception and set CREDENTIALS_CHANGED:

            // One kind of exception: HMAC failure.
            if (Utils.isHMACMismatch(ex)) {
              Status.login = LOGIN_FAILED_INVALID_PASSPHRASE;
              Status.sync = CREDENTIALS_CHANGED;
            }

So you could prepare some key data on the server that doesn't match the user's keys. E.g. (untested!)

add_test(function test_credentials_changed_logout() {
  let server = sync_httpd_setup();
  setUp();

  // By calling sync, we ensure we're logged in.
  Service.sync();
  do_check_eq(Status.sync, SYNC_SUCCEEDED);
  do_check_true(Service.isLoggedIn);

  // Make sync fail due to changed credentials. We simply re-encrypt
  // the keys with a different Sync Key, without changing the local one.
  let newSyncKeyBundle = new SyncKeyBundle(PWDMGR_PASSPHRASE_REALM, Service.username);
  newSyncKeyBundle.keyStr = "12345123451234512345123451";
  let keys = CollectionKeys.asWBO();
  keys.encrypt(newSyncKeyBundle);
  keys.upload(Service.cryptoKeysURL);

  Service.sync();
  do_check_eq(Status.sync, CREDENTIALS_CHANGED);
  do_check_false(Service.isLoggedIn);
  
  // Clean up.
  Service.startOver();
  server.stop(run_next_test);
});
* addressed comment 7
* removing unused status 401 observer
Attachment #548630 - Attachment is obsolete: true
Attachment #549428 - Flags: review?(philipp)
Comment on attachment 549428 [details] [diff] [review]
Part 1, V2: Create Error Handler

\o/
Attachment #549428 - Flags: review?(philipp) → review+
I would like to point out with delight that, after this patch, service.js will have dropped by about 400 lines — more than 20% — since current mozilla-beta. Good job, ladies and gentlemen.
(In reply to comment #12)
> I would like to point out with delight that, after this patch, service.js
> will have dropped by about 400 lines — more than 20% — since current
> mozilla-beta. Good job, ladies and gentlemen.

Entirely due to the lady here! Also, we have lots of test coverage. It already saved my bacon yesterday...
This is the best I came up with for now (still needs tests), here's what I considered:

* It wasn't easy to move logic over from the _syncEngine catch{} clause since the function returns a boolean and I didn't think there was a pretty way of pulling it out.

* I didn't remove the _syncError boolean because:
** there wouldn't be a way to keep track of if there was an engine failure and would notify sync:finish instead of sync:error which isn't the behaviour we want.
** ultimately I don't think we'd want every engine's error to clear out the previous engine's error in the UI, but instead to store all errors until sync is complete then pass the information to the UI (syncorro style)

* hopefully the comments in registerSyncEngineError and onSyncEnd are clear enough for what I had in mind, if not I can try to clarify.
Attachment #549959 - Flags: feedback?(philipp)
Comment on attachment 549959 [details] [diff] [review]
Part 2, WIP: move _syncError to ErrorHandler & refactor

(In reply to comment #14)
> * I didn't remove the _syncError boolean because:
> ** there wouldn't be a way to keep track of if there was an engine failure

What about Status.engine and Status.sync? If we set Status.engine to anything other than ENGINE_SUCCEEDED right now, Status.service becomes some weird value which causes Status.sync to be SYNC_FAILED. That seems like the right kind of logic to me.

> and would notify sync:finish instead of sync:error which isn't the behaviour
> we want.

Why not? We can redefine what weave:service:sync:error means. We can redefine to mean: something at the service level broke. Yes, this would have to mean that the UI would now also have to listen for engine sync errors and display those. I think that's fine, we'll be cleaning that up later on (see next paragraph).

> ** ultimately I don't think we'd want every engine's error to clear out the
> previous engine's error in the UI, but instead to store all errors until
> sync is complete then pass the information to the UI (syncorro style)

Yup, indeed. But the ErrorHandler should do that, not the Service, and getting to a place where Service doesn't deal with errors anymore is half of that work! I'd like to keep the Service as dumb as possible, and loosely coupled to ErrorHandler. (By loosely coupled or decoupled I mean communicating via observers, not reaching into each other's code.) Ideally, Service would not know about ErrorHandler at all.

>+  registerSyncEngineError: function registerSyncEngineError(engineName, engineFailCode, msg, error) {
>+    // TODO: create Error object given parameters and store
>+    // it for future error propagation to UI on onSyncEnd.
>+    if (error) {
>+      this.checkServerError(error);
>+    }
>+    this._log.debug(engineName + " failed: " + msg);
>+    Status.engines = [engineName, engineFailCode];
>+    this._syncError = true;
>+  },
...
>-      ErrorHandler.checkServerError(e);
>-
>-      Status.engines = [engine.name, e.failureCode || ENGINE_UNKNOWN_FAIL];
>-
>-      this._syncError = true;
>-      this._log.debug(engine.name + " failed: " + Utils.exceptionStr(e));
>+      let msg = engine.name + " failed: " + Utils.exceptionStr(e);
>+      ErrorHandler.registerSyncEngineError(engine.name,
>+                                           e.failureCode || ENGINE_UNKNOWN_FAIL,
>+                                           msg, e);

You can decouple these with an weave:engine:sync:error observer. I recently added the possibility to intercept the exception in an error observer (see bug 675125). So instead of calling ErrorHandler.registerSyncEngineError(...) from Service, you can just have ErrorHandler listen to weave:engine:sync:error. 'data' will be the engine name, 'subject' will be the exception object. That's pretty much all you need.

>-      if (this._syncError) {
>-        throw "Some engines did not sync correctly";
>-      } else {
>-        Svc.Prefs.set("lastSync", new Date().toString());
>-        Status.sync = SYNC_SUCCEEDED;
>-        let syncTime = ((Date.now() - syncStartTime) / 1000).toFixed(2);
>-        let dateStr = new Date().toLocaleFormat(LOG_DATE_FORMAT);
>-        this._log.info("Sync completed successfully at " + dateStr
>-                       + " after " + syncTime + " secs.");
>-      }
>+      ErrorHandler.onSyncEnd();
>+
>     } finally {
>-      this._syncError = false;
>+      ErrorHandler._syncError = false;
>       Svc.Prefs.reset("firstSync");
>     }
>   }))(),
...
>+  onSyncEnd: function onSyncEnd(syncStartTime) {
>+    // TODO: use previously stored sync engine Error objects to
>+    // throw the details of which engines failed and what
>+    // to propagate to the UI.
>+    if (this._syncError) {
>+      throw "Some engines did not sync correctly";
>+    } else {
>+      Svc.Prefs.set("lastSync", new Date().toString());
>+      Status.sync = SYNC_SUCCEEDED;
>+      let syncTime = ((Date.now() - this.syncStartTime) / 1000).toFixed(2);
>+      let dateStr = new Date().toLocaleFormat(LOG_DATE_FORMAT);
>+      this._log.info("Sync completed successfully at " + dateStr
>+                     + " after " + syncTime + " secs.");
>+    }
>+  },

Once again we can decouple Service and ErrorHandler here via observers. I would argue that much of onSyncEnd() is not necessary at all (apart from the syncTime stuff; see below.) ErrorHandler can keep track of the rest of the stuff via observers. It already tracks failures from individual engines, so if it looked at Status.sync at the end of a sync, it can nicely determine whether the sync was successful or not. It should just reset the flag at the beginning of a sync.

>+      case "weave:service:sync:start":
>+        this.syncStartTime = Date.now();
>+        break;
...
>-    let syncStartTime = Date.now();
>-
...
>+      let syncTime = ((Date.now() - this.syncStartTime) / 1000).toFixed(2);
>+      let dateStr = new Date().toLocaleFormat(LOG_DATE_FORMAT);
>+      this._log.info("Sync completed successfully at " + dateStr
>+                     + " after " + syncTime + " secs.");

I don't think this should move to the ErrorHandler at all since it doesn't really have anything to do with errors ;) We should just always log this line at the end of a sync, no matter what.
Attachment #549959 - Flags: feedback?(philipp)
Here's a patch based on what I gathered from the last feedback comment. xpcshell tests pass after a bit of modification. Also, some changes to note that this patch implies:

* a sync engine failure no longer results in logging as a sync error - its log won't have error- in the log name
* sync:error isn't notified so handleSyncError is not called so a sync engine failure will not contribute to triggering a backoff interval 

Are these 2 points things that shouldn't be changed? Or are they ok this way?
Attachment #549959 - Attachment is obsolete: true
Attachment #550566 - Flags: feedback?(philipp)
Attachment #550566 - Attachment is patch: true
Comment on attachment 550566 [details] [diff] [review]
Part 2, WIP: remove _syncError & refactor in ErrorHandler

(In reply to comment #16)
> Here's a patch based on what I gathered from the last feedback comment.
> xpcshell tests pass after a bit of modification. Also, some changes to note
> that this patch implies:
> 
> * a sync engine failure no longer results in logging as a sync error - its
> log won't have error- in the log name
> * sync:error isn't notified so handleSyncError is not called so a sync
> engine failure will not contribute to triggering a backoff interval

Good observations! Clearly I hadn't thought this through a 100%, though I came close in my seocnd bullet point in comment 5 ;). Anyway, while it's a bit of an annoyance, these aren't insurmountable problems:

The ErrorHandler can easily keep track of errors that occurred during a sync, and then make sure that logging is invoked appropriately. That one's easy since it's internal to ErrorHandler.

Now, about SyncScheduler.handleSyncError... that's an odd one. I'm pretty sure its behaviour is slightly broken now. It should only really kick in when there's an error communicating with the server, as opposed to *all* errors. At least that's what the comment in that method indicates, and that's also the only thing that makes sense for a backoff behavour. So SyncScheduler might want to start listening in on engine sync errors too and adjust the backoff accordingly. But of course only when the problem was actually a server error due to 5xx responses (check for exception.status).

Sorry to be piling a bunch of seemingly unrelated changes onto this bug, but as you can probably tell, this is a very useful exercise to even just understand, document, and test the current behaviour. And then improve it, of course. BECAUSE LABS CODE.


I've skimmed the patch, it looks good. Only one thing caught my eye:

>+      case "weave:engine:sync:error":
>+        // Maybe a 401, cluster update perhaps needed?
>+        if (subject.status == 401) {
>+          // Log out and clear the cluster URL pref. That will make us perform
>+          // cluster detection and password check on next sync, which handles
>+          // both causes of 401s; in either case, we won't proceed with this
>+          // sync, but kick off a sync for next time.
>+          Weave.Service.logout();
>+          Svc.Prefs.reset("clusterURL");
>+          Utils.nextTick(Weave.Service.sync, Weave.Service);
>+        }

Sorry, I should've been a bit clearer. I think that part of the error handling should remain in Service, because it's an implementation detail to resuming a sync that was aborted due to a node reassignment.
Attachment #550566 - Flags: feedback?(philipp)
* addressed comment 17 
* added tests
Attachment #551166 - Flags: feedback?(philipp)
Attachment #550566 - Attachment is obsolete: true
kind of rough but gives an idea of what I had in mind
Comment on attachment 551166 [details] [diff] [review]
Part 2, v1: remove _syncError & refactor in ErrorHandler

This patch was so pleasant to read, I wish I had read it sooner!

Just some nits:

>+      case "weave:engine:sync:error":
>+        if (subject.status >= 500 && subject.status <= 504) {
>+          this.requiresBackoff = true;
>+        }
>+        break;

Can you document that 'subject' will be the exception object? You could even write verbose code and do

  let exception = subject;
  if (exception.status ...

I'm mostly concerned about readability for other developers.

>   observe: function observe(subject, topic, data) {
>     switch(topic) {
>+      case "weave:engine:sync:applied":
>+        if (subject.newFailed) {
>+          // An engine isn't able to apply one or more incoming records.
>+          // We don't fail hard on this, but it usually indicates a bug,
>+          // so for now treat it as sync error (c.f. Service._syncEngine())
>+          Status.engines = [data, ENGINE_APPLY_FAIL];

Can you also please document that 'data' will be the engine's name, or even do

  let engine_name = data;
  Status.engines = [engine_name, ...

>+  Service.clusterURL = "";
>+  Service.setCluster = Service._setCluster;
>+  Service._setCluster = function() {
>+    return false;
>+  };

Tsk tsk, I thought I'd taught you better than that by now :p Instead of the monkey-patching, can you please set Service.serverURL to "http://localhost:8080" and then implement the 'node/weave' call from the user API in the local test server to return 'null'? That should make the Service think there's no node available for now. Something like this should work:

  "/user/1.0/johndoe/node/weave": httpd_handler(200, "OK", "null")

>+add_test(function test_sync_failed_partial_500s() {
>+add_test(function test_sync_failed_partial_400s() {
...
>+  try {
...
>+  } finally {
>+    Status.resetSync();
>+    Service.startOver();
>+  }

No need for the try/finally clauses. They're old relics.

r=me with that.
Attachment #551166 - Flags: feedback?(philipp) → review+
Comment on attachment 551926 [details] [diff] [review]
Part 3, WIP: ignore network errors except on "Sync Now" or a prolonged failure period

Great work!

>   // Commands
>   doSync: function SUI_doSync() {
>-    setTimeout(function() Weave.Service.sync(), 0);
>+    setTimeout(function() Weave.Service.sync(true), 0);
>   },
...
>-  sync: function sync() {
>+  sync: function sync(syncNowClicked) {
>+    ErrorHandler.syncNowClicked = syncNowClicked;

I would prefer if the UI invoked the sync through the ErrorHandler. That way Service won't have to be aware of ErrorHandler and "syncNowClicked" remains an implementation detail. (Also, the name of that property is modeled on a UI that will not exist for much longer. Let's name it after what it *does*, e.g. "dontIgnoreErrors" or something like that.)

Also, I don't see you resetting syncNowClicked anywhere. It should be set to false once the manually forced sync has completed.

>diff --git a/services/sync/services-sync.js b/services/sync/services-sync.js
>--- a/services/sync/services-sync.js
>+++ b/services/sync/services-sync.js
>@@ -10,16 +10,18 @@ pref("services.sync.lastversion", "first
> pref("services.sync.sendVersionInfo", true);
> 
> pref("services.sync.scheduler.singleDeviceInterval", 86400); // 1 day
> pref("services.sync.scheduler.idleInterval",         3600);  // 1 hour
> pref("services.sync.scheduler.activeInterval",       300);   // 5 minutes
> pref("services.sync.scheduler.immediateInterval",    60);    // 1 minute
> pref("services.sync.scheduler.idleTime",             300);   // 5 minutes
> 
>+pref("services.sync.errorhandler.errorReportingFrequency", 604800); // 1 week

s/errorReportingFrequency/networkFailureReportTimeout/

>diff --git a/services/sync/locales/en-US/sync.properties b/services/sync/locales/en-US/sync.properties
>--- a/services/sync/locales/en-US/sync.properties
>+++ b/services/sync/locales/en-US/sync.properties
>@@ -15,16 +15,17 @@ error.login.title = Error While Signing 
> error.login.description = Sync encountered an error while connecting: %1$S.  Please try again.
> error.login.prefs.label = Preferencesâ¦
> error.login.prefs.accesskey = P
> # should decide if we're going to show this
> error.logout.title = Error While Signing Out
> error.logout.description = Sync encountered an error while connecting.  It's probably ok, and you don't have to do anything about it.
> error.sync.title = Error While Syncing
> error.sync.description = Sync encountered an error while syncing: %1$S.  Sync will automatically retry this action.
>+error.sync.prolonged_failure = Sync has not been successful for %1$S days or more.

We can be a bit more specific and helpful here. E.g.:

  Sync has not been able to complete during the last %1$S days. Please check your network settings.

>+  setDefaults: function setDefaults() {
>+    this._log.trace("Setting ErrorHandler policy values to defaults.");
>+
>+    this.errorReportingFrequency =
>+      Svc.Prefs.get("errorhandler.errorReportingFrequency") * 1000;

This seems unnecessary. There's little cost to just getting the pref each time you need it, but there's a lot of cost making sure you reset this property. Also, it means changing the value requires a restart. (Which isn't so bad, but if it can be avoided, why not do it.)

>       case "weave:service:login:error":
>-        if (Status.login == LOGIN_FAILED_NETWORK_ERROR &&
>-            !Services.io.offline) {
>-          this._ignorableErrorCount += 1;
>-        } else {
>-          this.resetFileLog(Svc.Prefs.get("log.appender.file.logOnError"),
>-                            LOG_PREFIX_ERROR);
>-        }
>+        this.resetFileLog(Svc.Prefs.get("log.appender.file.logOnError"),
>+                          LOG_PREFIX_ERROR);

Hmm, this means we now log on every network failure. Not sure we want to do that.

>-  _ignorableErrorCount: 0,
>   shouldIgnoreError: function shouldIgnoreError() {
>+    let prolongedFailure = (Date.now() - parseFloat(Svc.Prefs.get("lastSync"))) >
>+                           this.errorReportingFrequency;
>+    let networkError =
>+      [Status.login, Status.sync].indexOf(LOGIN_FAILED_NETWORK_ERROR) != -1;
>+
>+    if (prolongedFailure) {
>+      Status.sync = PROLONGED_SYNC_FAILURE;
>+    }

I'm not sure how well I like a method that you ask whether you want to ignore an error or not to modify global state. Then again, shouldIgnoreError is one of those pieces that will die once ErrorHandler drives the UI and not the other way around.
* addressing comment 20
* passes xpcshell & tps tests
Attachment #551166 - Attachment is obsolete: true
(In reply to Philipp von Weitershausen [:philikon] from comment #21)
> >-  _ignorableErrorCount: 0,
> >   shouldIgnoreError: function shouldIgnoreError() {
> >+    let prolongedFailure = (Date.now() - parseFloat(Svc.Prefs.get("lastSync"))) >
> >+                           this.errorReportingFrequency;
> >+    let networkError =
> >+      [Status.login, Status.sync].indexOf(LOGIN_FAILED_NETWORK_ERROR) != -1;
> >+
> >+    if (prolongedFailure) {
> >+      Status.sync = PROLONGED_SYNC_FAILURE;
> >+    }
> 
> I'm not sure how well I like a method that you ask whether you want to
> ignore an error or not to modify global state. Then again, shouldIgnoreError
> is one of those pieces that will die once ErrorHandler drives the UI and not
> the other way around.

Agreed. This would be more appropriate to remove in another patch later on which makes ErrorHandler drive the UI.
But I suppose for now there may be a way to avoid changing global state in the method
* addressing comment 21
* modifying tests
Attachment #551926 - Attachment is obsolete: true
Attachment #552555 - Flags: feedback?(philipp)
Comment on attachment 552555 [details] [diff] [review]
Part 3, v1: ignore network errors except on "Sync Now" or a prolonged failure period

Awesome stuff, especially the tests. Just a bunch of nits:

>   /**
>    * Process the locally stored clients list to figure out what mode to be in
>    */
>   updateClientMode: function updateClientMode() {
>     // Nothing to do if it's the same amount
>-    let {numClients} = Clients.stats;	
>+    let {numClients} = Clients.stats;

let numClients = Clients.stats.numClients;

>       case "weave:service:sync:error":
>+        let prolongedFailure =
>+          (Date.now() - parseFloat(Svc.Prefs.get("lastSync"))) >
>+           Svc.Prefs.get("errorhandler.networkFailureReportTimeout") * 1000;

Nit: that last line should align with the `(`, not the `D`.

>-  _ignorableErrorCount: 0,
>   shouldIgnoreError: function shouldIgnoreError() {
>+    let returnValue =
>+      (Status.login == MASTER_PASSWORD_LOCKED) ||
>+      (!this.dontIgnoreErrors &&
>+       Status.sync != PROLONGED_SYNC_FAILURE &&
>+       [Status.login, Status.sync].indexOf(LOGIN_FAILED_NETWORK_ERROR) != -1);
>+
>+    this.dontIgnoreErrors = false;

This means that `dontIgnoreErrors` will only be reset to `false` if the UI-triggered sync actually produced an error and the UI code called shouldIgnoreError. It should reset no matter what, so I think we want it in weave:service:sync:finished.

>     // Never show an error bar for a locked master password.
>-    return (Status.login == MASTER_PASSWORD_LOCKED) ||
>-           ([Status.login, Status.sync].indexOf(LOGIN_FAILED_NETWORK_ERROR) != -1
>-            && this._ignorableErrorCount < MAX_IGNORE_ERROR_COUNT);
>+    return returnValue;
>   },

I don't see why `returnValue` would have to be computed before the `return` statement.

>+add_test(function test_shouldIgnoreError_master_password() {
>+  _("Test error ignored due to locked master password");
>+  let server = sync_httpd_setup();
>+  setUp();
> 
>-  // Error ignored since master password locked.
>+  Service._verifyLogin = Service.verifyLogin;
>+  Service.verifyLogin = function () {
>+    Status.login = MASTER_PASSWORD_LOCKED;
>+    return false;
>+  };

Sigh. I wish I knew a way to avoid that monkey-patching, but I don't. Can you add a comment here about why we're monkey patching?

>+add_test(function test_prolonged_network_error() {
>+  // Test prolonged, network errors are reported
>+  Services.io.offline = true;

Is this bit really necessary?

>+add_test(function test_network_error() {
>+  // Test network errors are reported when.

When what? :) Shouldn't it be "Test network errors are *not* reported"?
Attachment #552555 - Flags: feedback?(philipp)
(In reply to Philipp von Weitershausen [:philikon] from comment #26)

> >-  _ignorableErrorCount: 0,
> >   shouldIgnoreError: function shouldIgnoreError() {
> >+    let returnValue =
> >+      (Status.login == MASTER_PASSWORD_LOCKED) ||
> >+      (!this.dontIgnoreErrors &&
> >+       Status.sync != PROLONGED_SYNC_FAILURE &&
> >+       [Status.login, Status.sync].indexOf(LOGIN_FAILED_NETWORK_ERROR) != -1);
> >+
> >+    this.dontIgnoreErrors = false;
> 
> This means that `dontIgnoreErrors` will only be reset to `false` if the
> UI-triggered sync actually produced an error and the UI code called
> shouldIgnoreError. It should reset no matter what, so I think we want it in
> weave:service:sync:finished.

If we want it to be set no matter what, I think we need it in both sync:finished and sync:error. However, there's a bit of an issue with that if you read my next comment.  (Also, you're right, I didn't realize this doesn't set dontIgnoreError to false on sync:finish)

> >     // Never show an error bar for a locked master password.
> >-    return (Status.login == MASTER_PASSWORD_LOCKED) ||
> >-           ([Status.login, Status.sync].indexOf(LOGIN_FAILED_NETWORK_ERROR) != -1
> >-            && this._ignorableErrorCount < MAX_IGNORE_ERROR_COUNT);
> >+    return returnValue;
> >   },
> 
> I don't see why `returnValue` would have to be computed before the `return`
> statement.

The reason I'm doing that is because dontIgnoreErrors needs to be used to calculate returnValue before it is set to false. Also if dontIgnoreErrors is reset in sync:error to false then by the time shouldIgnoreError tries to access its value it will always be false. An option to avoiding setting dontIgnoreErrors = false in shouldIgnoreError() is to maybe have a shouldIgnoreError boolean that is set in sync:error prior to setting dontIgnoreErrors to false (essentially moving shouldIgnoreError() logic to sync:error) then shouldIgnoreError() would no longer be necessary and the UI would just access the shouldIgnoreError boolean. Does that sound ok?

> >+add_test(function test_prolonged_network_error() {
> >+  // Test prolonged, network errors are reported
> >+  Services.io.offline = true;
> 
> Is this bit really necessary?

If I remove this and only do setUp() but don't start the sync server in the test, I get an unexpected fail with the error NS_ERROR_CONNECTION_REFUSED. Is there something else I would need to add to the test to get LOGIN_FAILED_NETWORK_ERROR?
(In reply to Marina Samuel [:marina] from comment #27)
> The reason I'm doing that is because dontIgnoreErrors needs to be used to
> calculate returnValue before it is set to false. Also if dontIgnoreErrors is
> reset in sync:error to false then by the time shouldIgnoreError tries to
> access its value it will always be false. An option to avoiding setting
> dontIgnoreErrors = false in shouldIgnoreError() is to maybe have a
> shouldIgnoreError boolean that is set in sync:error prior to setting
> dontIgnoreErrors to false (essentially moving shouldIgnoreError() logic to
> sync:error) then shouldIgnoreError() would no longer be necessary and the UI
> would just access the shouldIgnoreError boolean. Does that sound ok?

Yeah. Pretty clunky, but that should work. It's becoming pretty apparent that the ErrorHandler should drive the UI, not the other way around.

> > >+add_test(function test_prolonged_network_error() {
> > >+  // Test prolonged, network errors are reported
> > >+  Services.io.offline = true;
> > 
> > Is this bit really necessary?
> 
> If I remove this and only do setUp() but don't start the sync server in the
> test, I get an unexpected fail with the error NS_ERROR_CONNECTION_REFUSED.

What's the stacktrace for that? Is it failing in the login? Try catching the error somewhere and print Utils.exceptionStr(exception). Alternatively, you can also just print Error().stack somewhere in the code to find out how you got there.

> Is there something else I would need to add to the test to get
> LOGIN_FAILED_NETWORK_ERROR?

checkServerError recognizes, among others, an NS_ERROR_CONNECTION_REFUSED exception and sets that status. So somehow it seems you're not going down that path. Looking at the stacktrace should help.
* sync:error and login:error are no longer observed in the UI
* adjusted sync:error tests accordingly and added login:error tests
Attachment #552555 - Attachment is obsolete: true
Attachment #554647 - Flags: feedback?(philipp)
Comment on attachment 554647 [details] [diff] [review]
Part 3, v2: ignore network errors except on "Sync Now" or a prolonged failure period

>+        if (Status.login != LOGIN_FAILED_NETWORK_ERROR ||
>+            Services.io.offline) {

I realize you're just inverting the logic here, but really, we shouldn't be checking for offline mode (Services.io.offline) at all anymore since Firefox no longer adjusts that flag automatically.

>+        if (this.shouldReportError()) {
>+          Svc.Obs.notify("weave:ui:login:error");
>+        } else {
>+          Svc.Obs.notify("weave:ui:update");
>+        }

Why only call (indirectly) updateUI() here? If we have no error report, we should also clear previous error notifications, which is what happens in onSyncFinish() in the UI. So I think we want to make that "weave:ui:sync:finished" and pipe that through to onSyncFinish() in the successful sync case. In the successful login case, we don't have to bother since a sync will follow immediately.

In the long run (where we have the ErrorHandler drive the UI), we probably want to have notifications for "show error" and "clear error" and "a sync just happened, so update the last synced time and all that stuff".

>+  shouldReportError: function shouldReportError() {
>+    if ((Date.now() - Date.parse(Services.prefs.getCharPref("services.sync.lastSync"))) >
>+      Svc.Prefs.get("errorhandler.networkFailureReportTimeout") * 1000) {
>+      Status.sync = PROLONGED_SYNC_FAILURE;
>+    }
>+
>+    return (Status.login != MASTER_PASSWORD_LOCKED) &&
>+            (this.dontIgnoreErrors ||
>+            Status.sync == PROLONGED_SYNC_FAILURE ||
>+            [Status.login, Status.sync].indexOf(LOGIN_FAILED_NETWORK_ERROR) == -1);
>   },

Isn't the PROLONGED_SYNC_FAILURE case already covered in the last line?

Looks good otherwise!
Attachment #554647 - Flags: feedback?(philipp)
(In reply to Philipp von Weitershausen [:philikon] from comment #30)

> >+  shouldReportError: function shouldReportError() {
> >+    if ((Date.now() - Date.parse(Services.prefs.getCharPref("services.sync.lastSync"))) >
> >+      Svc.Prefs.get("errorhandler.networkFailureReportTimeout") * 1000) {
> >+      Status.sync = PROLONGED_SYNC_FAILURE;
> >+    }
> >+
> >+    return (Status.login != MASTER_PASSWORD_LOCKED) &&
> >+            (this.dontIgnoreErrors ||
> >+            Status.sync == PROLONGED_SYNC_FAILURE ||
> >+            [Status.login, Status.sync].indexOf(LOGIN_FAILED_NETWORK_ERROR) == -1);
> >   },
> 
> Isn't the PROLONGED_SYNC_FAILURE case already covered in the last line?

I'm confused. Do you mean the line that checks for LOGIN_FAILED_NETWORK_ERROR? If so, PROLONGED_SYNC_FAILURE is treated differently than LOGIN_FAILED_NETWORK_ERROR providing a different error message and shown for different scenarios.
(In reply to Marina Samuel [:marina] from comment #31)
> (In reply to Philipp von Weitershausen [:philikon] from comment #30)
> 
> > >+  shouldReportError: function shouldReportError() {
> > >+    if ((Date.now() - Date.parse(Services.prefs.getCharPref("services.sync.lastSync"))) >
> > >+      Svc.Prefs.get("errorhandler.networkFailureReportTimeout") * 1000) {
> > >+      Status.sync = PROLONGED_SYNC_FAILURE;
> > >+    }
> > >+
> > >+    return (Status.login != MASTER_PASSWORD_LOCKED) &&
> > >+            (this.dontIgnoreErrors ||
> > >+            Status.sync == PROLONGED_SYNC_FAILURE ||
> > >+            [Status.login, Status.sync].indexOf(LOGIN_FAILED_NETWORK_ERROR) == -1);
> > >   },
> > 
> > Isn't the PROLONGED_SYNC_FAILURE case already covered in the last line?
> 
> I'm confused. Do you mean the line that checks for
> LOGIN_FAILED_NETWORK_ERROR?

Yes. [Status.login, Status.sync].indexOf(LOGIN_FAILED_NETWORK_ERROR) == -1 will hold true for Status.sync == PROLONGED_SYNC_FAILURE, right?
(In reply to Philipp von Weitershausen [:philikon] from comment #32)
> Yes. [Status.login, Status.sync].indexOf(LOGIN_FAILED_NETWORK_ERROR) == -1
> will hold true for Status.sync == PROLONGED_SYNC_FAILURE, right?

Hmm. I guess not always, if Status.login happens to be LOGIN_FAILED_NETWORK_ERROR. So, yeah, ok. Never mind.

But, see bug 578195 comment 20 about splitting this expression into multiple parts. For instance, we could simply return true after setting Status.sync to PROLONGED_SYNC_FAILURE.
Attachment #554647 - Attachment is obsolete: true
Comment on attachment 552176 [details] [diff] [review]
Part 2, v2: remove _syncError & refactor in ErrorHandler

While reviewing Part 3, I just noticed one problem that was sort of introduced in Part 2:

The UI now listens to 'weave:engine:sync:error' because 'weave:service:sync:error' is no longer necessarily notified if just one engine fails. The problem is that the UI depends on stuff that gets set by the ErrorHandler's observer for that very same topic:

> // gSyncUI handles updating the tools menu
> let gSyncUI = {
...
>+      case "weave:engine:sync:error":
>       case "weave:service:sync:error":


> let ErrorHandler = {
...
>+      case "weave:engine:sync:error":
>+        let exception = subject;  // exception thrown by engine's sync() method
>+        let engine_name = data;   // engine name that threw the exception
>+
>+        this.checkServerError(exception);
>+
>+        Status.engines = [engine_name, exception.failureCode || ENGINE_UNKNOWN_FAIL];
>+        this._log.debug(engine_name + " failed: " + Utils.exceptionStr(exception));
>+        break;

So these could race each other and the UI could win. What we really want to do is in 'weave:service:sync:finish', where we look at the Status.service/Status.sync flag, notify 'weave:ui:sync:error' to update the UI. Basically, I propose that the UI doesn't listen to 'weave:service:sync:error', 'weave:service:sync:error', 'weave:service:sync:finished' at all. That makes it easier to understand when the UI gets invoked.
Comment on attachment 555561 [details] [diff] [review]
Part 3, v3: ignore network errors except on "Sync Now" or a prolonged failure period

Please see my previous comment 35 about the UI's observers. In the interest of progress, I think it's fine to just apply those changes in Part 3 (even though they should really belong in Part 2, but oh well).

Also, I think it'd be good to have some more unit-y tests for ErrorHandler.shouldReportError(), IOW set various inputs (on Status) and then check its output.

Apart from that I just found one minor nit:

>+  Svc.Prefs.set("lastSync", (new Date(Date.now() - 691200000)).toString());

Please const magic numbers to meaningful names. In this case it looks like you're trying to create a value that's bigger than the 'errorhandler.networkFailureReportTimeout' pref, so I would make it a function of that, e.g.:

  const FORCE_PROLONGED_ERROR_DURATION = (Svc.Prefs.get('errorhandler.networkFailureReportTimeout') + 604800) * 1000;

(feel free to come up with a better name :))
Attachment #555561 - Flags: feedback?(philipp)
* addressing comment 35 and comment 36
Attachment #555561 - Attachment is obsolete: true
Attachment #555809 - Flags: feedback?(philipp)
Attachment #555809 - Attachment is patch: true
Comment on attachment 555809 [details] [diff] [review]
Part 3, v4: ignore network errors except on "Sync Now" or a prolonged failure period

>+  Svc.Prefs.set("lastSync", (new Date(Date.now() -
>+    NON_PROLONGED_ERROR_DURATION)).toString());

Date.now() returns a number, so you can just do

  Svc.Prefs.set("lastSync", Date.now() - NON_PROLONGED_ERROR_DURATION);

which incidentally should also fit on one line :)

>diff --git a/services/sync/tests/unit/test_errorhandler_filelog.js b/services/sync/tests/unit/test_errorhandler_filelog.js
>--- a/services/sync/tests/unit/test_errorhandler_filelog.js
>+++ b/services/sync/tests/unit/test_errorhandler_filelog.js
...
>-  // Fake an unsuccessful login.
>+  // Fake an unsuccessful login due to prolonged failure.
>+  Svc.Prefs.set("lastSync", (new Date(Date.now() - 691200000)).toString());

Same treatment here, please (simplify + const the magic number)


Looks good otherwise. I will do a final review in a bit. Given that Part 2 and Part 3 are kind of connected now, I think I'll just apply the patches locally and read the code as a whole.
Attachment #555809 - Flags: feedback?(philipp)
(In reply to Philipp von Weitershausen [:philikon] from comment #38)
> Comment on attachment 555809 [details] [diff] [review]
> Part 3, v4: ignore network errors except on "Sync Now" or a prolonged
> failure period
> 
> >+  Svc.Prefs.set("lastSync", (new Date(Date.now() -
> >+    NON_PROLONGED_ERROR_DURATION)).toString());
> 
> Date.now() returns a number, so you can just do
> 
>   Svc.Prefs.set("lastSync", Date.now() - NON_PROLONGED_ERROR_DURATION);
> 
> which incidentally should also fit on one line :)

I know the way I had it looks weird but the reason for that is the lastSync pref is expected to be a string not a number and it's expected by the UI to be in the format given by new Date()
(In reply to Marina Samuel [:marina] from comment #39)
> I know the way I had it looks weird but the reason for that is the lastSync
> pref is expected to be a string not a number and it's expected by the UI to
> be in the format given by new Date()

Woah, you're right. services.sync.lastSync is a totally different format than services.sync.<engine>.lastSync! In that case, can you just define a helper at the top of the test file that returns that rather ungainly value?
* addressing comment 38 & comment 40
* added a little check to take care of the edge case where there has never been a successful sync so the lastSync pref isn't set
Attachment #555809 - Attachment is obsolete: true
Needed a small follow-up for Part 1 because I didn't pay enough attention in the review. The skip-if directives weren't moved accordingly as some of the test files were renamed.

http://hg.mozilla.org/services/services-central/rev/82446dae2f5a
STRs for QA:

* Sync should never complain about a network problem anymore unless you the Sync Now menu item/toolbar button is clicked.
* If Sync encounters an error and finds that it has been unable to successfully sync for over a week, it will display an error message to that effect. The duration value is tunable via the services.sync.errorhandler.networkFailureReportTimeout preference (it's in seconds).
with network disconnected I get 
1314726658891   Sync.Service       DEBUG  Exception: NS_ERROR_UNKNOWN_HOST JS Stack trace: Res_get()@resource.js:473 < _fetchInfo(...)@service.js:580 < ()@service.js:1313 < WrappedNotify()@util.js:148 < WrappedLock()@util.js:103 < _lockedSync()@service.js:1277 < ()@service.js:1268 < WrappedCatch()@util.js:77 < sync()@service.js:1256

No notification in UI unless manual sync is fired.
(In reply to Tracy Walker from comment #45)
> with network disconnected I get 
> 1314726658891   Sync.Service       DEBUG  Exception: NS_ERROR_UNKNOWN_HOST

This is expected.
On restart with no network, the Sync icon spins endlessly.  It should stop spinning once sync realizes it has no network, yes?

It works fine with sc build from last week.  Do you want a follow up bug for the regression?
(In reply to Tracy Walker from comment #47)
> On restart with no network, the Sync icon spins endlessly.  It should stop
> spinning once sync realizes it has no network, yes?

Can you file a follow-up blocking this bug please? Thanks!
Depends on: 683254
verified in m-c
Status: RESOLVED → VERIFIED
Blocks: 684160
error.sync.prolonged_failure = Sync has not been able to complete during the last %1$S days. Please check your network settings.

Strings like these should use localization-friendly plural forms.
See https://developer.mozilla.org/en/Localization_and_Plurals for details.

CCing Pike.
(In reply to Marek Stępień :marcoos from comment #53)
Agreed. Follow up here or in a new bug?
(In reply to Axel Hecht [:Pike] from comment #54)
> Agreed. Follow up here or in a new bug?

Follow-up please.

(I should note that the value is set to 7 days. The only reason it's a pref in the first place is so we can modify for QA and testing purposes...)
Whiteboard: [fixed in services] → [verified in services]
Depends on: 691705
Component: Firefox Sync: Backend → Sync
Product: Cloud Services → Firefox
You need to log in before you can comment on or make changes to this bug.