Closed Bug 622762 Opened 13 years ago Closed 13 years ago

Add batch API for incoming records, deal with exceptions in SyncEngine's recordHandler

Categories: Firefox :: Sync, defect
Priority: Not set
Severity: normal

Status: RESOLVED FIXED
Tracking: blocking2.0 --- betaN+

People: (Reporter: philikon, Assigned: philikon)

References: (Blocks 1 open bug)

Whiteboard: [softblocker][has patch][has review rnewman][needs review mconnor]

Attachments: (1 file, 6 obsolete files)

Bookmark, history and password sync seem to love to throw exceptions for certain bogus records, or even because there's something wrong with our code somewhere. Right now we fail on the first bogus record for that engine and then stop. The error usually occurs somewhere in store.update() or store.create(), called by store.applyIncoming(), called by SyncEngine's collection recordHandler.

I think we should not stop processing incoming records just because one record failed to apply. So I think we should catch exceptions thrown by the recordHandler, make sure we log the hell out of them AND (this is the important bit) the service and engine know about the failure so we can potentially notify the user in the UI.

The goals here are to
a) get better logging on those exceptions, because the Res_get() stack we see right now is shite
b) continue processing records even if one record produced an error.
This should block. I believe that solving this would alleviate some of the pain people have with history, password and bookmark sync. While they still wouldn't get that one dodgy record synced, at least the other records would be processed (which is quite helpful in the case of bookmarks when you haven't synced the folders yet, for instance).
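
A minimal sketch of that catch-and-continue idea, roughly in the shape of SyncEngine's recordHandler (the failedIDs list is an illustrative name, not an actual patch):

    function recordHandler(item) {
      try {
        item.decrypt();
        if (this._reconcile(item))
          this._store.applyIncoming(item);
      } catch (ex) {
        // Log loudly and remember the failure instead of aborting the whole engine sync.
        this._log.warn("Failed to apply record " + item.id + ": " +
                       Utils.exceptionStr(ex));
        this.failedIDs.push(item.id);
      }
    }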
blocking2.0: --- → ?
OS: Mac OS X → All
Hardware: x86 → All
Assignee: nobody → philipp
blocking2.0: ? → betaN+
(In reply to comment #0)
> I think we should not stop processing incoming records just because one record
> failed to apply. So I think we should catch exceptions thrown by the
> recordHandler, make sure we log the hell out of them AND (this is the important
> bit) the service and engine know about the failure so we can potentially notify
> the user in the UI.

One problem with this is that we'll move lastSync forward, so we will never sync that broken record down in the future. To solve that problem we should remember the failed GUID(s) and refetch them on next syncs.
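
Something like this, sketched against the Collection parameters that show up later in this bug (where the remembered GUIDs live is an assumption):

    // On the next sync, explicitly re-request the GUIDs that failed last time.
    let retries = this.previouslyFailedGUIDs;
    if (retries.length) {
      newitems.ids   = retries;   // fetch exactly these records...
      newitems.newer = 0;         // ...regardless of the lastSync timestamp
      let resp = newitems.get();
      // handle resp.success / failures as usual
    }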
Assignee: philipp → rnewman
Attached patch Early version. v1 (obsolete) — Splinter Review
Here's a first draft.

It reuses the history engine tests, because that's a great environment in which to inject bad records.

This patch includes the following changes:

* Comments.
* Removes keyForCollection's noisy trace logging. It's overdue.
* Adds THRESHOLD_FAILURE_COUNT (how many failures to see before we start checking the failure rate) and THRESHOLD_FAILURE_PROPORTION. Once we've passed 5 failures, if more than 30% of the total seen records have been failures we'll throw instead of tracking.
* Store failed IDs in Prefs: <engine>.failedDown.
* Wipe them in resetLastSync, because this is where .lastSync is changed.
* Track failed counts and failed IDs inside recordHandler. No decision is made based on exception content; we just catch, log, and track.
* Fixes unrelated possible bug: if the store threw, we wouldn't set the tracker back to non-ignore mode!
* Tests: deliberately screw up two history items, and verify that they appear in the failed list in Prefs. Fix one and verify that it disappears but the other stays, and that the fixed item is now in the DB.

Comments most welcome.
Attachment #503023 - Flags: review?(philipp)
Attached patch Early version. v2 (obsolete) — Splinter Review
This version adds a flag to adjust handling behavior. Tab engine doesn't bother tracking; one bookmark test sets it to 'throw' to allow the test to pass.
Attachment #503023 - Attachment is obsolete: true
Attachment #503038 - Flags: review?(philipp)
Attachment #503023 - Flags: review?(philipp)
Whiteboard: [softblocker]
Whiteboard: [softblocker] → [softblocker][has-patch]
Whiteboard: [softblocker][has-patch] → [softblocker][has patch]
Whiteboard: [softblocker][has patch] → [softblocker][has patch][needs review philipp]
Comment on attachment 503038 [details] [diff] [review]
Early version. v2

>+// Constants for failure behavior.
>+const FAILURES_THROW  = 0;
>+const FAILURES_TRACK  = 1;
>+const FAILURES_IGNORE = 2;

...

>+  
>+  _trackFailures: FAILURES_TRACK,    // Also: FAILURES_IGNORE, FAILURES_THROW.
>+  
>+  get trackFailures() this._trackFailures,
>+  set trackFailures(value) {
>+    let f = ["throw", "track", "ignore"].indexOf(value);
>+    if (f == -1)
>+      throw "Invalid input to trackFailures: " + f;
>+    this._trackFailures = f;
>+  },

You could just make those consts the strings, then you won't have to translate the strings into numbers.
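
For instance (just spelling out the suggestion; the setter keeps its validation):

    const FAILURES_THROW  = "throw";
    const FAILURES_TRACK  = "track";
    const FAILURES_IGNORE = "ignore";

    set trackFailures(value) {
      if ([FAILURES_THROW, FAILURES_TRACK, FAILURES_IGNORE].indexOf(value) == -1)
        throw "Invalid value for trackFailures: " + value;
      this._trackFailures = value;
    },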

>-        } catch (ex) {
>-          if (Utils.isHMACMismatch(ex) &&
>-              this.handleHMACMismatch()) {
>-            // Let's try handling it.
>-            // If the callback returns true, try decrypting again, because
>-            // we've got new keys.
>-            this._log.info("Trying decrypt again...");
>-            item.decrypt();
>-          }
>-          else {
>-            throw ex;
>-          }
>+        } catch (ex if (Utils.isHMACMismatch(ex) &&
>+                        this.handleHMACMismatch())) {
>+          // Let's try handling it.
>+          // If the callback returns true, try decrypting again, because
>+          // we've got new keys.
>+          this._log.info("Trying decrypt again...");
>+          item.decrypt();
>         }

I feel like this change already landed as bug 623813...

>        
>         if (this._reconcile(item)) {
>+          this._tracker.ignoreAll = true;
>+          try {
>+            this._store.applyIncoming(item);
>+          } finally {
>+            this._tracker.ignoreAll = false;
>+          }

Yes!!! Need tests for this, though!


>+      // Other exceptions are plain ol' failures, unless we've got too many.
>+      catch (ex if this.trackFailures) {

If you use strings for the consts, then this would become an explicit value comparison (if this.trackFailures != FAILURES_THROW), which makes things a bit more explicit.


>+// Error handling behavior in record handlers.
>+THRESHOLD_FAILURE_COUNT:               5,      // Wait until we've seen enough.
>+THRESHOLD_FAILURE_PROPORTION:          0.3     // Above 30% failure = immediate failure.

...

>+        // Don't stop the failure if we've seen too many.
>+        let cFailed = count.failed;
>+        let cTotal  = cFailed + count.applied + count.reconciled;
>+        if (cFailed > THRESHOLD_FAILURE_COUNT &&
>+            cFailed / cTotal > THRESHOLD_FAILURE_PROPORTION) {
>+          this._log.debug("Rethrowing exception because error count (" +
>+                          cFailed + ") exceeded threshold. Total records: " +
>+                          cTotal + ".");
>+          throw ex;
>+        }
>+      }

I'm not a big fan of these heuristics. IMHO we should save the (last) exception and throw it at the end of sync(). In other words, if there was a failure with one or more records, I think we should always throw to the service. It or even a higher power (e.g. the UI layer) can decide what to do with these exceptions, but I don't think we should be making this kind of policy decision so deep down in the code.
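
Roughly what I mean (where the saved exception lives is just a sketch; this._store.applyIncoming stands in for the whole per-record work):

    // Remember the last failure, keep applying the rest, rethrow once at the end.
    let lastException = null;
    records.forEach(function(item) {
      try {
        this._store.applyIncoming(item);
      } catch (ex) {
        this._log.warn("Failed to apply " + item.id + ": " + Utils.exceptionStr(ex));
        lastException = ex;          // keep only the most recent exception
      }
    }, this);

    // ...later, once _processIncoming and _uploadOutgoing have both run:
    if (lastException)
      throw lastException;           // let the service (or the UI layer) decide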

>+    // Everyone: make another stab at fetching previously failed records.
>+    let retries = this.savedDownloadFailures;
>+    this._log.info("Retrying " + (retries.length || 0) + " old failures.");
>+    try {
>+      if (retries.length) {
>+        newitems.ids   = retries;
>+        newitems.limit = 0;
>+        newitems.newer = 0;
>+        let resp = newitems.get();
>+        if (!resp.success) {
>+          resp.failureCode = ENGINE_DOWNLOAD_FAIL;
>+          throw resp;
>+        }
>+      }
>+    } catch (ex) {
>+      this._log.debug("Got exception during failed item retry: " +
>+                      Utils.exceptionStr(ex));
>+    }

Nit: put the try/catch block around the newitems.get() and following lines.

> function TabEngine() {
>   SyncEngine.call(this, "Tabs");
>+  this.trackFailures = "ignore";    // No point retrying individual records.

That's great that we can do this, but I don't actually see where you implemented the "ignore" case. Maybe I overlooked something, but I can only see code for the "track" and "throw" cases.

>diff --git a/services/sync/tests/unit/test_history_engine.js b/services/sync/tests/unit/test_history_engine.js
>--- a/services/sync/tests/unit/test_history_engine.js
>+++ b/services/sync/tests/unit/test_history_engine.js

I would prefer if we had generic tests for this in test_syncengine_sync.js. Relying on the particulars of the history engine isn't exactly unit testing.


r- for
* error handling heuristics that I don't think should live at this level
* tests depending on specific engine implementation rather than writing/extending a dummy engine
* lack of tests for the ignoreAll case
Attachment #503038 - Flags: review?(philipp) → review-
Whiteboard: [softblocker][has patch][needs review philipp] → [softblocker][needs revised patch]
Two things that I forgot to mention last night:

* When we encounter a failed record, I think we also want to add it to the tracker. That way we either replace the bogus one with a proper one (if the GUID exists locally) or we create a delete record and get rid of the bogus one everywhere else (if the GUID doesn't exist locally).

* If the length of the 'failed' array gets close to the same order of magnitude as the initial download, we probably don't want to bother with saving it to engine.savedDownloadFailures (prefs are read synchronously on startup, so they're only good for small values). Instead we can just decide not to fast-forward engine.lastSync to the latest timestamp.
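
Rough sketches of both suggestions (addChangedID and SOME_LIMIT are placeholders here, not decided API):

    // 1) Mark the bogus GUID as locally changed so the next upload pushes a
    //    replacement (or a delete record) back to the server.
    this._tracker.addChangedID(item.id);

    // 2) Only persist the failed list if it stays small; otherwise just don't
    //    fast-forward lastSync, so the same window is fetched again next time.
    if (failed.length < SOME_LIMIT) {
      this.savedDownloadFailures = failed;
      this.lastSync = lastModified;
    }
    // else: leave this.lastSync where it was.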
(In reply to comment #5)

> I feel like this change already landed as bug 623813...

Probably did. This patch is at least a week old :D


> I'm not a big fan of these heuristics. IMHO we should save the (last) exception
> and throw it at the end of sync(). In other words, if there was a failure with
> one or more records, I think we should always throw to the service. It or even
> a higher power (e.g. the UI layer) can decide what to do with these exceptions,
> but I don't think we should be making this kind of policy decision so deep down
> in the code.

Interesting -- so rather than muffling a small number of errors, complete the whole sync if there's a small number, then throw at the end? Makes a certain amount of sense, and the number of yellow bars would be no worse than we have now.

I think we still need heuristics at this level to decide when to _abandon_, unless we use notifications to just put the logic in service.js... right?


> That's great that we can do this, but I don't actually see where you
> implemented the "ignore" case. Maybe I oversaw something, but I can only see
> code for the "track" and "throw" case.

"Track" and "ignore" are both covered by 

> >+      catch (ex if this.trackFailures) {

because 'throw' is 0 and thus false. The difference between 'track' and 'ignore' is a little further down:

+        if (this.trackFailures == FAILURES_TRACK)


> I would prefer if we had generic tests for this in test_syncengine_sync.js.
> Relying on the particulars of the history engine isn't exactly unit testing.

Fair enough.
(In reply to comment #7)
> > I'm not a big fan of these heuristics. IMHO we should save the (last) exception
> > and throw it at the end of sync(). In other words, if there was a failure with
> > one or more records, I think we should always throw to the service. It or even
> > a higher power (e.g. the UI layer) can decide what to do with these exceptions,
> > but I don't think we should be making this kind of policy decision so deep down
> > in the code.
> 
> Interesting -- so rather than muffling a small number of errors, complete the
> whole sync if there's a small number, then throw at the end?

Yes. In particular, we want to complete _processIncoming *and* _uploadOutgoing. Right now we throw somewhere in _processIncoming and then never get to the upload part.

> I think we still need heuristics at this level to decide when to _abandon_,

I guess the question is: do we ever want to abandon? If we do what I described in comment 6 (marking a bogus record as changed locally so we replace/delete it on the server), then we want to power through all the broken records and make sure we get to _uploadOutgoing, right?

> unless we use notifications to just put the logic in service.js... right?

I would prefer if service.js or engines.js wouldn't use notifications.js. Right now it almost feels misplaced in fx-sync since it is purely a UI helper. I think the current approach of using observer notifications to tell the UI about problems works well; it keeps service.js and engines.js free of UI-aware code.

If anything, we might want to make notifications.js aware of observer notifications from service.js and engines.js to avoid having the observers in Firefox UI code. Then we can have the error handling and escalating policy in one single place.

> "Track" and "ignore" are both covered by 
> 
> > >+      catch (ex if this.trackFailures) {
> 
> because 'throw' is 0 and thus false.

I get that, but it kind of defeats the point of defining those explicit constants. It's hard to tell from knowing the values of the constants when this if will be truthy.

> The difference between 'track' and 'ignore' is a little further down:
> 
> +        if (this.trackFailures == FAILURES_TRACK)

Oh rite. Yay! :)
Just a thought, while it's in my head: if we record failed records such that we
would retry, then we shouldn't repeatedly throw errors if we fail the same
record twice. Perhaps obvious, but worth noting for the forgetful (me).


Now a point for you, Mr PhiliKON:

(In reply to comment #6)

> * When we encounter a failed record, I think we also want to add it to the
> tracker. That way we either replace the bogus one with a proper one (if the
> GUID exists locally) or we create a delete record and get rid of the bogus one
> everywhere else (if the GUID doesn't exist locally).

I'm not 100% convinced about tracking and deleting or re-uploading, or I'm misunderstanding your suggestion. We could do this for certain classes of error (e.g., crypto), but even so there are potential bad consequences:

   * Client A uploads record
   * Client B fails to download it, or has a bug in decryption, or ...
   * Client B tracks the failure
   * Client B has no local copy to upload, so asks other clients to delete it.
   * !

Could you clarify your first point from Comment 6? Do you perhaps mean "delete from the server" instead of "get rid of ... everywhere else" (assuming that another client will re-upload after the deletion)?

This is an area of Sync with which I haven't yet familiarized myself, so forgive me if the question is dumb.
(In reply to comment #9)
> Just a thought, while it's in my head: if we record failed records such that we
> would retry, then we shouldn't repeatedly throw errors if we fail the same
> record twice. Perhaps obvious, but worth noting for the forgetful (me).

Yeah, we can do that, but it's more of a UX issue. I suggest filing a follow-up for this. The goal of this bug is to not fail hard when encountering an error syncing down.

> (In reply to comment #6)
> > * When we encounter a failed record, I think we also want to add it to the
> > tracker. That way we either replace the bogus one with a proper one (if the
> > GUID exists locally) or we create a delete record and get rid of the bogus one
> > everywhere else (if the GUID doesn't exist locally).
> 
> I'm not 100% convinced about tracking and deleting or re-uploading, or I'm
> misunderstanding your suggestion. We could do this for certain classes of error
> (e.g., crypto), but even so there are potential bad consequences:
> 
>    * Client A uploads record
>    * Client B fails to download it, or has a bug in decryption, or ...
>    * Client B tracks the failure
>    * Client B has no local copy to upload, so asks other clients to delete it.
>    * !
> 
> Could you clarify your first point from Comment 6? Do you perhaps mean "delete
> from the server" instead of "get rid of ... everywhere else" (assuming that
> another client will re-upload after the deletion)?

Yes, you're right, we should be conservative about issuing delete records (records that will cause all other clients to delete the corresponding item locally).

> This is an area of Sync with which I haven't yet familiarized myself, so
> forgive me if the question is dumb.

It's not a dumb question at all. We definitely want to make sure we don't cause data loss. First we should probably distinguish between

a) failures at the WBO level: invalid JSON, invalid object structure, HMAC verification failure, decryption failure, etc.
b) failures in store.applyIncoming()

Case a) would suggest some sort of data corruption (e.g. bug 617650). Here we can see if we have a local record for the corresponding GUID. If we do, we can mark it as changed so that we replace it with a good record later. If we don't, we could delete it from the server. The other client won't notice this, however, so it won't upload a new and hopefully good record the next time it syncs. But at least we're not going to delete their record.

Case b) is "just" an engine bug. Records written by another client can't be written by this client. Perhaps it's not even the record's fault (e.g. corrupt places.sqlite). I don't think we want to do anything here. Best thing we can do is ignore/track it and hope for a bugfix ;)
For bug 606353 I'm going to need to add a batch-apply method to store objects which will also fix the b) case I outlined in comment 10. Here's my proposal for this:

* store.applyIncomingBatch([records]) -> [failed GUIDs] will be added (sketched below). It is not allowed to throw; instead it should return a list of GUIDs that failed to apply.

* By default it will just call store.applyIncoming() for each record; applyIncoming() will from now on be considered private.

* By default we're still going to do batches of 1 (basically process records as they're coming in).

* Engines can opt into having batches bigger than 1 fed into store.applyIncomingBatch(). They can also override this method to call a batch store mechanism (mozIAsyncHistory::updatePlaces for history).

For this bug then we still need to handle case a) as well as take the list of failed guids from applyIncomingBatch(), remember it and retry it on the next sync. Bug 615284 might help with the remembering bit, btw, if we decide to bring back the toFetch file (which I'm not a big fan of, but it might be a necessary evil.)
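
A minimal sketch of the proposed default implementation (the final patch may differ in details):

    // Default: apply one record at a time, collect the GUIDs that fail, never throw.
    applyIncomingBatch: function applyIncomingBatch(records) {
      let failed = [];
      for (let i = 0; i < records.length; i++) {
        let record = records[i];
        try {
          this.applyIncoming(record);   // per-record helper, now considered private
        } catch (ex) {
          this._log.warn("Failed to apply incoming record " + record.id + ": " +
                         Utils.exceptionStr(ex));
          failed.push(record.id);
        }
      }
      return failed;   // GUIDs that could not be applied
    },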
Depends on: 606353
Blocks: 625622
Actually, I'm going to implement the batching along with the exception handling stuff in this bug. Yoinking this from rnewman :)

In order to store the failed items I'm going to bring back the toFetch stuff which is bug 615284.
Assignee: rnewman → philipp
Blocks: 606353
Depends on: 615284
No longer depends on: 606353
Morphing this bug according to comment 11 and comment 12.
Status: NEW → ASSIGNED
Summary: Figure out how to deal with exceptions in SyncEngine's recordHandler → Add batch API for incoming records, deal with exceptions in SyncEngine's recordHandler
Attached patch wip v3 (obsolete) — Splinter Review
Here's a wip for the revamp:

* Introduces store.applyIncomingBatch

* Tracks records that failed to apply locally

* Records that failed to decrypt are replaced if the record exists locally. If not, they're tracked like records that failed to apply (still need to write the test for this)

Note: This is on top of bug 615284 which brings back the persistent engine.toFetch.
Attachment #503038 - Attachment is obsolete: true
Whiteboard: [softblocker][needs revised patch] → [softblocker][has wip]
Attached patch v3 (obsolete) — Splinter Review
This adds some more tests to exercise the catching of exceptions thrown by item.decrypt(). It also updates the test in test_bookmark_engine.js that relies on engine.sync() throwing. In the constructed case the engine no longer throws so we now construct a different case.
Attachment #507181 - Attachment is obsolete: true
Attachment #507313 - Flags: review?(rnewman)
Whiteboard: [softblocker][has wip] → [softblocker][has patch][needs review rnewman]
Comment on attachment 507313 [details] [diff] [review]
v3

>+    //TODO need to figure out how to ensure the batching doesn't refetch failed items

Forgot to remove this, not an actual TODO item. ;)
Attached patch v4 (obsolete) — Splinter Review
v3 was only applying records in full batches of applyIncomingBatchSize, so if fewer records were synced, they weren't applied *cough*. The 'failed' list was also not persisted if the initial GET request was ultimately unsuccessful (e.g. network timeout).

Fixed and tested now!
Attachment #507313 - Attachment is obsolete: true
Attachment #507518 - Flags: review?(rnewman)
Attachment #507313 - Flags: review?(rnewman)
(In reply to comment #17)

Preliminary comments:

From a first read through, the patch seems to "do what it does". I'm not sure
if that's everything we want it to do.

* This patch appears to significantly change how Sync handles errors. It
  changes from a default of throwing (excepting HMAC errors, which are
  specially handled, and problems which engines catch themselves) to a default
  of never throwing. That's probably a good step, but some of your thinking
  earlier in this bug isn't reflected in this patch. For example:

> the service and engine know about the failure so we can potentially notify
> the user in the UI.

  and

> IMHO we should save the (last) exception
> and throw it at the end of sync(). In other words, if there was a failure with
> one or more records, I think we should always throw to the service. It or even
> a higher power (e.g. the UI layer) can decide what to do with these exceptions,
> but I don't think we should be making this kind of policy decision so deep down
> in the code.

  and

> If the length of the 'failed' array gets close to the order of magnitude as
> the initial download, we probably don't want to bother with saving it to
> engine.savedDownloadFailures (prefs are read synchronously on startup, good for
> small values). Instead we can just decide not to fast-forward engine.lastSync
> to the latest timestamp.


  What gives here? I'd appreciate knowing why you decided to never error out
  (even at the end of the sync, with the last exception), check the number of
  failures, or send notifications.


* I believe the decryption error detection logic might have become too broad.
  Namely, this:

       try {
         try {
           item.decrypt();
         } catch (ex if (Utils.isHMACMismatch(ex) &&
                         this.handleHMACMismatch())) {
           // Let's try handling it.
           // If the callback returns true, try decrypting again, because
           // we've got new keys.
           this._log.info("Trying decrypt again...");
           item.decrypt();
+        }       
+      } catch (ex) {
+        this._log.warn("Error decrypting record: " + Utils.exceptionStr(ex));
+
+        // Upload a new record to replace the bad one if we have it locally.
+        // Otherwise we'll just try, try again.
+        if (this._store.itemExists(item.id)) {
+          this._modified[item.id] = 0;
+        } else {
+          failed.push(item.id);
         }

  There are several reasons why item.decrypt() might fail (even twice!) --
  e.g., the crypto system isn't working correctly (and thus uploading will
  fail, too), or our own keys are wrong! (More likely when we cache
  CollectionKeys locally, I grant.)

  I'd be inclined to whitelist such, rather than assume that all of these
  exceptions should be handled by uploading our own record. I can see the bug
  reports now... someone still running an older, buggy release, and it keeps
  resetting their Gmail password every time it syncs!

* It would be nice to log the count of failed records here:

     this._log.info(["Records:", count.applied, "applied,", count.reconciled,
       "reconciled."].join(" "));

* What happens if _reconcile throws? That case -- assuming you want to collect
  every record that fails somehow -- doesn't seem to be handled. (_reconcile
  ends up calling _createRecord and itemExists, so it's quite possible that it
  could throw.)

* Minor, performance: it seems like these lines:

+        this.toFetch = failed.concat(Utils.arraySub(this.toFetch, failed));
+        failed = [];

  could get called a lot, especially if the default batch size is 1, and there
  are a lot of failures... and we don't throw if we see a lot of failures.

  We already spend way too much time scrabbling in JS allocations; perhaps
  switch to a different data structure for |toFetch| (e.g., just collect the
  arrays with push, then walk "i and j-wise") to avoid all the consing?

  It's added complication, so I quite understand when you say "not worth it",
  but thought I'd note it down regardless.

I'll spend a little more time reading through before lunch.
(In reply to comment #18)
>   What gives here? I'd appreciate knowing why you decided to never error out
>   (even at the end of the sync, with the last exception), check the number of
>   failures, or send notifications.

You are quite right to call this out. I should have elaborated here. My implementation was made with the batching API in mind which made me reconsider a few of the points you (quite correctly) bring up. At this point I don't think it's very helpful to throw anything. Even now we have to consult the (Trace!) logs to find out which record breaks. This reminds me that I should add some lines to log failed items to help analysis.

It is quite right that I did not include any engine-to-service notifications when failed records are encountered. I think here I'm going to agree with my older self and say that we should notify the service of this. But not via an exception but an observer notification. That way engine.sync() can continue all the way through. The service (or UI) can then decide to surface an error notification (which for now it should continue to do, if we want to change this, we should do so in a follow-up bug).

With respect to the heuristics of continuing vs. throwing when the list of failed gets long, I think we're in better shape now that we can write the list of failed items asynchronously to disk (bug 615284) rather than (ab)using prefs. That's why I kept things simple and just accumulated failed records in toFetch. This also means we get to upload things even if all download fails which I think is good too.

>   There are several reasons why item.decrypt() might fail (even twice!) --
>   e.g., the crypto system isn't working correctly (and thus uploading will
>   fail, too), or our own keys are wrong! (More likely when we cache
>   CollectionKeys locally, I grant.)

That our own keys might be wrong is a good point, although it would mean we have a bug elsewhere. A bug that begets other bugs is not too uncommon... But I see your point. I'm not sure how feasible or useful whitelisting is, so I'd like to punt on this for now and deal with it in a follow-up. Let's not replace any server records for now but just ignore them (and refetch them later), ok?

> * It would be nice to log the count of failed records here:
> 
>      this._log.info(["Records:", count.applied, "applied,", count.reconciled,
>        "reconciled."].join(" "));

Agreed.

> 
> * What happens if _reconcile throws? That case -- assuming you want to collect
>   every record that fails somehow -- doesn't seem to be handled. (_reconcile
>   ends up calling _createRecord and itemExists, so it's quite possible that it
>   could throw.)

Good point!

> * Minor, performance: it seems like these lines:
> 
> +        this.toFetch = failed.concat(Utils.arraySub(this.toFetch, failed));
> +        failed = [];
> 
>   could get called a lot, especially if the default batch size is 1, and there
>   are a lot of failures... and we don't throw if we see a lot of failures.

Unless I'm mistaken somehow, doApplyBatchAndPersistFailed() only gets called *after* the GET request has been completed, not on every batch.

>   We already spend way too much time scrabbling in JS allocations; perhaps
>   switch to a different data structure for |toFetch| (e.g., just collect the
>   arrays with push, then walk "i and j-wise") to avoid all the consing?

I think the data structure we're looking for for toFetch is a set. A regular JS object would fulfill this task since we're just dealing with strings here (attributes of object = members of the set). Since toFetch gets written to disk as JSON, we would have to migrate old toFetch files from arrays to objects. We would also have to convert between the object and array in a bunch of places. Not sure it's worth the effort, I'd like to get some numbers first before I rip something out that works. Follow-up?
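
Purely illustrative, roughly:

    // A plain object used as a set of GUIDs.
    let toFetch = {};
    failed.forEach(function(guid) {
      toFetch[guid] = true;    // the value slot could later hold e.g. a failure count
    });

    // Membership test: (guid in toFetch)
    // Wherever an array (or the existing JSON file format) is still expected:
    let asArray = Object.keys(toFetch);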


To summarize, I propose to address your comments so far as follows:

* log failed items and failed item count
* notify service of failed items, have the service listen for this notification, set this._syncError = true which means it'll throw and notify the UI
* not replace bogus records with local ones
* ensure that we catch exceptions thrown by _reconcile
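
Roughly, for the notification bit (the topic string is a placeholder, and whether it goes through Svc.Obs or the Observers module directly is a detail):

    // Engine side, once a fetched batch has been processed:
    if (count.failed)
      Svc.Obs.notify("weave:engine:sync:apply-failed", count, this.name);

    // Service side:
    Svc.Obs.add("weave:engine:sync:apply-failed", function (subject, data) {
      // Remember that something went wrong so sync() still throws / notifies
      // the UI at the end, after all engines have run.
      this._syncError = true;
    }, this);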
(In reply to comment #19)
> This reminds me that I should add some
> lines to log failed items to help analysis.

+1.

> It is quite right that I did not include any engine-to-service notifications
> when failed records are encountered. I think here I'm going to agree with my
> older self and say that we should notify the service of this. But not via an
> exception but an observer notification. That way engine.sync() can continue all
> the way through. The service (or UI) can then decide to surface an error
> notification (which for now it should continue to do, if we want to change
> this, we should do so in a follow-up bug).

+1.

> That's why I kept things simple and just accumulated failed records in
> toFetch. This also means we get to upload things even if all download fails
> which I think is good too.

Agreed. Good justifications.

> I'm not sure how feasible or useful whitelisting is so I'd like to punt
> on this for now and deal with this in a follow-up. Let's not replace any server
> records for now but just ignore them (and refetch them later), ok?

That works for me. There might be alternatives to whitelisting which we can explore now we have persistent stored failures -- e.g., if an item fails across two Firefox launches (heuristic, of course), or send a command to another client to get a second opinion.

Filed Bug 629463.

 
> I think the data structure we're looking for for toFetch is a set. A regular JS
> object would fulfill this task since we're just dealing with strings here
> (attributes of object = members of the set).

Moreover, this gives us a value field to use for storage -- e.g., to track number of times we've failed that object.

> Since toFetch gets written to disk
> as JSON, we would have to migrate old toFetch files from arrays to objects. We
> would also have to convert between the object and array in a bunch of places.
> Not sure it's worth the effort, I'd like to get some numbers first before I rip
> something out that works. Follow-up?

Bug 629467.


> To summarize, I propose to address your comments so far as follows:

Sounds good!
Blocks: 629463, 629467
Blocks: 620901
Attached patch v5 (obsolete) — Splinter Review
Addresses the points outlined in comment 19.
Attachment #507518 - Attachment is obsolete: true
Attachment #507679 - Flags: review?(rnewman)
Attachment #507518 - Flags: review?(rnewman)
Attachment #507679 - Flags: review?(mconnor)
Comment on attachment 507679 [details] [diff] [review]
v5

Looking good.

Read-through review notes:

* You're not incrementing count.failed in the reconcile case. Intentional?

* This pattern comes up several times (688, 776, 777, 806...):

    baz = foo.concat(Utils.arraySub(bar, foo));

  Could you please define:

    function union(foo, bar) {
      return foo.concat(Utils.arraySub(bar, foo));
    }

  to make this clearer, and easier to switch out for a better data structure
  later?

* Proposal: track the last exception somewhere, and add it to the apply-failed
  notification. Counts aren't likely to be sufficient to determine a course of
  action, and I think tracking *all* exceptions would be expensive. (Or maybe
  track nsresult codes?)

* Optional: in the case of an HMAC mismatch we're pretty certain it's a bad
  server record, so one could restore some lines like:

    } catch (ex if (Utils.isHMACMismatch(ex))) {
      this._log.warn("Error decrypting record: " + Utils.exceptionStr(ex));
      if (this._store.itemExists(item.id)) {
        this._modified[item.id] = 0;
      } else {
        failed.push(item.id);
      }
      return;
    }

  before your modified line 717... or leave it to the follow-on bug. Up to you.

Will head on to test analysis next (and one more code point that I want to explore myself).
(In reply to comment #22)
> * You're not incrementing count.failed in the reconcile case. Intentional?

I increment count.failed when I persist and reset 'failed'. This should also include the reconcile case. In fact, I have tests that prove it does! From test_syncengine_sync.js, test_processIncoming_failed_records(): 

    // Ensure the observer was notified
    do_check_eq(observerData, engine.name);
    do_check_eq(observerSubject.failed, BOGUS_RECORDS.length);

> * This pattern comes up several times (688, 776, 777, 806...):
> 
>     baz = foo.concat(Utils.arraySub(bar, foo));
> 
>   Could you please define:
> 
>     function union(foo, bar) {
>       return foo.concat(Utils.arraySub(bar, foo));
>     }
> 
>   to make this clearer, and easier to switch out for a better data structure
>   later?

Good idea!

> * Proposal: track the last exception somewhere, and add it to the apply-failed
>   notification. Counts aren't likely to be sufficient to determine a course of
>   action, and I think tracking *all* exceptions would be expensive. (Or maybe
>   track nsresult codes?)

There are three places where we *might* get an exception:

* item.decrypt()
* engine._reconcile()
* somewhere internal to store.applyIncomingBatch()

As more engines will provide their own applyIncomingBatch() method, we might not even get exceptions there anymore, and in any case applyIncomingBatch() is not supposed to throw anymore. Engine-specific exceptions from _reconcile and applyIncoming are pretty useless to the Service anyway, I think. What, apart from logging (which is already taken care of), could it do? And nsresult codes are typically pretty generic (NS_ERROR_NOT_AVAILABLE or NS_ERROR_XPC_JS_THREW_STRING, anyone?).

This only leaves exceptions from item.decrypt(). I think these are worth exploring in the follow-up bug 629463. At that point we could also always change what the notification transmits if we find it would be useful to the service. Does that sound good to you?

> * Optional: in the case of an HMAC mismatch we're pretty certain it's a bad
>   server record, so one could restore some lines like:
> 
>     } catch (ex if (Utils.isHMACMismatch(ex))) {
>       this._log.warn("Error decrypting record: " + Utils.exceptionStr(ex));
>       if (this._store.itemExists(item.id)) {
>         this._modified[item.id] = 0;
>       } else {
>         failed.push(item.id);
>       }
>       return;
>     }
> 
>   before your modified line 717... or leave it to the follow-on bug. Up to you.

I'd prefer to punt this to the follow-up bug 629463.
(In reply to comment #23)

> This only leaves exceptions from item.decrypt(). I think these are worth
> exploring in the follow-up bug 629463. At that point we could also always
> change what the notification transmits if we find it would be useful to the
> service. Does that sound good to you?

Yup; just raising it for thoroughness.

> I'd prefer to punt this to the follow-up bug 629463.

WFM.

r+ from me with the discussed change. Now to wait for mconnor...
Attachment #507679 - Flags: review?(rnewman) → review+
Whiteboard: [softblocker][has patch][needs review rnewman] → [softblocker][has patch][has review rnewman][needs review mconnor]
Attached patch v5.1Splinter Review
This addresses rnewman's review comment: introduced Utils.arrayUnion()
Attachment #507679 - Attachment is obsolete: true
Attachment #507759 - Flags: review?(mconnor)
Attachment #507679 - Flags: review?(mconnor)
Comment on attachment 507759 [details] [diff] [review]
v5.1

At some point we'll have to profile a bunch of this stuff... feels like the Utils.foo(Utils.bar(Utils.baz(foopy))) pattern is going to suck a lot.  But... maybe JS saves us here?
Attachment #507759 - Flags: review?(mconnor) → review+
https://hg.mozilla.org/services/fx-sync/rev/11c34dd2683b
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Blocks: 629603
Blocks: 732614
Component: Firefox Sync: Backend → Sync
Product: Cloud Services → Firefox