Last Comment Bug 691612 - Backoff handling is broken
: Backoff handling is broken
Status: VERIFIED FIXED
[verified in services][qa!:8][qa!:9]
: verified-aurora, verified-beta
Product: Cloud Services
Classification: Client Software
Component: Firefox Sync: Backend (show other bugs)
: unspecified
: All All
: -- critical (vote)
: mozilla10
Assigned To: Philipp von Weitershausen [:philikon]
:
:
Mentors:
Depends on: 691630
Blocks: 691663 692006
  Show dependency treegraph
 
Reported: 2011-10-03 17:39 PDT by Philipp von Weitershausen [:philikon]
Modified: 2011-12-16 08:16 PST (History)
8 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
affected
fixed
fixed
fixed


Attachments
v1 (6.55 KB, patch)
2011-10-03 21:32 PDT, Philipp von Weitershausen [:philikon]
rnewman: review+
asa: approval‑mozilla‑aurora+
Details | Diff | Splinter Review
patch for beta (6.59 KB, patch)
2011-10-04 11:47 PDT, Philipp von Weitershausen [:philikon]
asa: approval‑mozilla‑beta+
Details | Diff | Splinter Review
patch for release (6.59 KB, patch)
2011-10-04 11:49 PDT, Philipp von Weitershausen [:philikon]
no flags Details | Diff | Splinter Review

Description Philipp von Weitershausen [:philikon] 2011-10-03 17:39:15 PDT
The "weave:service:backoff:interval" observer computes nonsense [1] which means neither Status.backoffInterval nor Status.minimumNextSync are actually useful values.

[1] https://mxr.mozilla.org/services-central/source/fx-sync/services/sync/modules/service.js#573
Comment 1 Richard Newman [:rnewman] 2011-10-03 17:45:29 PDT
I am here and waiting to review. Alternatively, I'm happy to write a patch and test.
Comment 2 Richard Newman [:rnewman] 2011-10-03 17:46:37 PDT
I'd probably call this one critical, mm?
Comment 3 Mike Connor [:mconnor] 2011-10-03 18:03:25 PDT
http://mxr.mozilla.org/mozilla-central/source/services/sync/modules/policies.js#186 really just needs to use the values from subject, not data.

It also needs to multiply the value by 1000 for the minimumNextSync.

Also, omg this needs tests.
Comment 4 Philipp von Weitershausen [:philikon] 2011-10-03 18:07:36 PDT
Yes to all three points.

On a further note, in a follow-up, I would like to untangle all of the backoff code. It's spread over various functions and methods. I'm not even confident that by fixing those two bugs backoff will work. But the tests I am writing *right now* will tell us more.
Comment 5 Philipp von Weitershausen [:philikon] 2011-10-03 20:57:15 PDT
(In reply to Philipp von Weitershausen [:philikon] from comment #4)
> On a further note, in a follow-up, I would like to untangle all of the
> backoff code. It's spread over various functions and methods. I'm not even
> confident that by fixing those two bugs backoff will work. But the tests I
> am writing *right now* will tell us more.

Looks my suspicions were somewhat right: yes, after fixing those two bugs we will now observe Status.backoffInterval, but a forced sync will proceed because we reset the Status object at the beginning of the sync [1]. If I understood mconnor correctly earlier today, then this is not the intended behaviour.

However, if we forced the sync to fail, then we're lacking some serious UX here as it would just simply result in an Unknown Error. We could of course ignore Sync errors in backoff mode. This would require two slightly different implementations in Firefox 7/8 and 9/10, but it could be done.

[1] https://mxr.mozilla.org/services-central/source/fx-sync/services/sync/modules/service.js#1701

(I'm deliberately quoting the old extension code in this bug to demonstrate that this behaviour hasn't changed at all for at least a year.)
Comment 6 Philipp von Weitershausen [:philikon] 2011-10-03 21:05:24 PDT
Also, it turns out not all of the sync scheduling code actually adheres to the backoff interval, or it uses different backoff metrics (some of them seem bogus to me). I filed bug 691663.
Comment 7 Philipp von Weitershausen [:philikon] 2011-10-03 21:32:28 PDT
Created attachment 564455 [details] [diff] [review]
v1
Comment 8 Richard Newman [:rnewman] 2011-10-03 22:00:19 PDT
Comment on attachment 564455 [details] [diff] [review]
v1

Review of attachment 564455 [details] [diff] [review]:
-----------------------------------------------------------------

This looks good to me.

::: services/sync/tests/unit/test_syncscheduler.js
@@ +586,5 @@
> +  // Use an odd value on purpose so that it doesn't happen to coincide with one
> +  // of the sync intervals.
> +  const BACKOFF = 7337;
> +
> +  // Make extend info/collections so that we can put it into server

%s/Make extend/Extend

(Also in test below.)

@@ +590,5 @@
> +  // Make extend info/collections so that we can put it into server
> +  // maintenance mode.
> +  const INFO_COLLECTIONS = "/1.1/johndoe/info/collections";
> +  let infoColl = server._handler._overridePaths[INFO_COLLECTIONS];
> +  let server_backoff = false;

serverBackoff?
Comment 9 Philipp von Weitershausen [:philikon] 2011-10-03 23:00:22 PDT
https://hg.mozilla.org/services/services-central/rev/3af678a82e64
Comment 10 Richard Newman [:rnewman] 2011-10-04 09:05:31 PDT
Greener than Eden.

https://tbpl.mozilla.org/?tree=Services-Central&rev=3af678a82e64
Comment 11 Philipp von Weitershausen [:philikon] 2011-10-04 11:45:40 PDT
Comment on attachment 564455 [details] [diff] [review]
v1

This is a critical bug in Sync's backoff handling, which means it doesn't actually obey the server's request to back off in case of an infrastructure meltdown.
Comment 12 Philipp von Weitershausen [:philikon] 2011-10-04 11:47:25 PDT
Created attachment 564623 [details] [diff] [review]
patch for beta

Please see comment 11 for mozilla-beta justification. Try build is spinning and will report back to this bug: https://tbpl.mozilla.org/?tree=Try&usebuildbot=1&rev=c4bb15672702
Comment 13 Philipp von Weitershausen [:philikon] 2011-10-04 11:49:29 PDT
Created attachment 564626 [details] [diff] [review]
patch for release

Requesting release approval in case there's a 7.0.x chemspill that we can piggyback on. Please see comment 11 for justification. Try build is spinning and will report back here: https://tbpl.mozilla.org/?tree=Try&usebuildbot=1&rev=d36116c32ba4
Comment 14 Tracy Walker [:tracy] 2011-10-04 11:59:16 PDT
verified against stage with s-c build built this morning.
Comment 15 Philipp von Weitershausen [:philikon] 2011-10-04 12:58:19 PDT
https://hg.mozilla.org/mozilla-central/rev/3af678a82e64
Comment 16 christian 2011-10-04 15:06:02 PDT
Is this a regression? If so, what caused it? Are the servers having issues or have you taken mitigation issues on the server or is this a "just in case"/"nice to have" fix?
Comment 17 christian 2011-10-04 15:07:21 PDT
s/mitigation issues/mitigation steps/
Comment 18 Mozilla RelEng Bot 2011-10-04 15:21:31 PDT
Try run for d36116c32ba4 is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=d36116c32ba4
Results (out of 30 total builds):
    exception: 1
    success: 25
    warnings: 2
    failure: 2
Builds available at http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/pweitershausen@mozilla.com-d36116c32ba4
Comment 19 Mozilla RelEng Bot 2011-10-04 15:51:09 PDT
Try run for c4bb15672702 is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=c4bb15672702
Results (out of 30 total builds):
    success: 29
    failure: 1
Builds available at http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/pweitershausen@mozilla.com-c4bb15672702
Comment 20 Richard Newman [:rnewman] 2011-10-04 15:57:07 PDT
(In reply to Christian Legnitto [:LegNeato] from comment #16)
> Is this a regression? If so, what caused it? Are the servers having issues
> or have you taken mitigation issues on the server or is this a "just in
> case"/"nice to have" fix?

It's not a regression; this has been broken for perhaps a year now. We just haven't needed it, so we haven't noticed that it was buggy.

The primary load-restriction mechanism built into the Sync service is an X-Weave-Backoff header. When we're under massive load, as we are right now (see Bug 690470 for one of the effects!), we set this header to tell clients to go away for a duration.

We found, in the course of investigating this storm of unavailable Sync nodes, that the client code simply doesn't work.

Services Operations has spent several days now working around the clock attempting to get us back on our feet. Most recovery approaches make the problem worse, and without this client handling we have no way to get back to a good state without throwing error bars on users' machines.

tl;dr: Sync is throwing errors left right and center for 90% of our user base, and a client update would help immensely in allowing operations to get things back in the green.
Comment 21 Philipp von Weitershausen [:philikon] 2011-10-04 15:59:15 PDT
(In reply to Christian Legnitto [:LegNeato] from comment #16)
> Is this a regression?

No, it has been broken for at least a year and a half. The brokenness dates back to a time when Sync was called Weave and just an add-on in Labs.

> Are the servers having issues
> or have you taken mitigation issues on the server or is this a "just in
> case"/"nice to have" fix?

The servers are having heavy issues from the unexpectedly high load caused by the Instant Sync feature, landed in Firefox 7. We found that the backoff header that the server is sending, which was designed as a safety valve for those cases, isn't actually honoured by any client.

Just FYI: we're also looking at what causes the unexpectedly high syncing load in Firefox 7 and later to make sure we're not continuing to melt the servers for all eternity. These will be filed as separate bugs and we'll ask approval for those separately.
Comment 22 Richard Newman [:rnewman] 2011-10-04 16:00:45 PDT
(In reply to Richard Newman [:rnewman] from comment #20)
> (In reply to Christian Legnitto [:LegNeato] from comment #16)
> Most recovery approaches make the problem worse…

Expanding on this a little bit: database load appears to be the largest culprit here. Ops is taking various steps to improve the situation (off-hand, I can think of: DB upgrades; pruning records; adding capacity), but in order to effect these steps they need to stop clients making requests to the affected nodes.

(atoll, feel free to clarify if I mis-stated.)
Comment 23 Philipp von Weitershausen [:philikon] 2011-10-06 16:46:06 PDT
https://hg.mozilla.org/releases/mozilla-aurora/rev/dcb1ab75cf63
Comment 24 Philipp von Weitershausen [:philikon] 2011-10-06 16:55:48 PDT
https://hg.mozilla.org/releases/mozilla-beta/rev/19cf03b8ed32
Comment 25 Anthony Hughes (:ashughes) [GFX][QA][Mentor] 2011-10-13 11:25:54 PDT
qa- as there does not appear to be much QA can do to verify this fix. Please set to qa+ if this is not the case.
Comment 26 Philipp von Weitershausen [:philikon] 2011-10-13 12:04:05 PDT
(In reply to Anthony Hughes, Mozilla QA (irc: ashughes) from comment #25)
> qa- as there does not appear to be much QA can do to verify this fix. Please
> set to qa+ if this is not the case.

That's not true. Services Ops can easily configure the staging servers to send backoff notices. Perhaps the folks from Services QA can chime in here and help out with verifying.

This fix is absolutely vital for the health of our Sync server infrastructure. We need to make sure it works in the released versions of Firefox.
Comment 27 Anthony Hughes (:ashughes) [GFX][QA][Mentor] 2011-10-13 12:06:27 PDT
Thanks for clarifying, setting to qa+
Comment 28 James Bonacci [:jbonacci] 2011-10-13 12:22:26 PDT
Yea, Services QA will pick this up.
Looks like Tracy covered the client side already (Comment 14). He and I can work with petef to get this verified from the server side.
Comment 29 Philipp von Weitershausen [:philikon] 2011-10-13 12:33:22 PDT
(In reply to James Bonacci [:jbonacci] from comment #28)
> Yea, Services QA will pick this up.
> Looks like Tracy covered the client side already (Comment 14). He and I can
> work with petef to get this verified from the server side.

James, this is a client bug. There's nothing in the server code that needs verifying. Tracy verified the fix in s-c before the merge to m-c. But we also need to verify this in Aurora and Beta. I believe this is what :ashughes was talking about in comment 25 and later.
Comment 30 Paul Silaghi, QA [:pauly] 2011-11-28 05:07:06 PST
Could you please tell me how to test this ?
Comment 31 Tracy Walker [:tracy] 2011-11-28 10:47:53 PST
Paul, this has to be coordinated with Services OPs so they can send appropriate back-off messages from the stage server. See https://bugzilla.mozilla.org/show_bug.cgi?id=684798#c30 for some examples.
Comment 32 Anthony Hughes (:ashughes) [GFX][QA][Mentor] 2011-12-13 13:42:12 PST
(In reply to Tracy Walker [:tracy] from comment #31)
> Paul, this has to be coordinated with Services OPs so they can send
> appropriate back-off messages from the stage server. See
> https://bugzilla.mozilla.org/show_bug.cgi?id=684798#c30 for some examples.

Tracy, given Paul's timezone and that he is on sick PTO, could you coordinate testing for verification in Firefox 9.0b6?
Comment 33 Tracy Walker [:tracy] 2011-12-15 15:27:35 PST
verified on Fx9b6

Note You need to log in before you can comment on or make changes to this bug.