Closed Bug 691612 Opened 13 years ago Closed 13 years ago

Backoff handling is broken

Categories

(Firefox :: Sync, defect)

defect
Not set
critical

Tracking

()

VERIFIED FIXED
mozilla10
Tracking Status
firefox7 --- affected
firefox8 --- fixed
firefox9 --- fixed
firefox10 --- fixed

People

(Reporter: philikon, Assigned: philikon)

References

Details

(Keywords: verified-aurora, verified-beta, Whiteboard: [verified in services][qa!:8][qa!:9])

Attachments

(3 files)

The "weave:service:backoff:interval" observer computes nonsense [1] which means neither Status.backoffInterval nor Status.minimumNextSync are actually useful values.

[1] https://mxr.mozilla.org/services-central/source/fx-sync/services/sync/modules/service.js#573
I am here and waiting to review. Alternatively, I'm happy to write a patch and test.
I'd probably call this one critical, mm?
Severity: normal → critical
Assignee: nobody → philipp
http://mxr.mozilla.org/mozilla-central/source/services/sync/modules/policies.js#186 really just needs to use the values from subject, not data.

It also needs to multiply the value by 1000 for the minimumNextSync.

Also, omg this needs tests.
Yes to all three points.

On a further note, in a follow-up, I would like to untangle all of the backoff code. It's spread over various functions and methods. I'm not even confident that by fixing those two bugs backoff will work. But the tests I am writing *right now* will tell us more.
Blocks: 691630
(In reply to Philipp von Weitershausen [:philikon] from comment #4)
> On a further note, in a follow-up, I would like to untangle all of the
> backoff code. It's spread over various functions and methods. I'm not even
> confident that by fixing those two bugs backoff will work. But the tests I
> am writing *right now* will tell us more.

Looks my suspicions were somewhat right: yes, after fixing those two bugs we will now observe Status.backoffInterval, but a forced sync will proceed because we reset the Status object at the beginning of the sync [1]. If I understood mconnor correctly earlier today, then this is not the intended behaviour.

However, if we forced the sync to fail, then we're lacking some serious UX here as it would just simply result in an Unknown Error. We could of course ignore Sync errors in backoff mode. This would require two slightly different implementations in Firefox 7/8 and 9/10, but it could be done.

[1] https://mxr.mozilla.org/services-central/source/fx-sync/services/sync/modules/service.js#1701

(I'm deliberately quoting the old extension code in this bug to demonstrate that this behaviour hasn't changed at all for at least a year.)
Depends on: 691663
Also, it turns out not all of the sync scheduling code actually adheres to the backoff interval, or it uses different backoff metrics (some of them seem bogus to me). I filed bug 691663.
Attached patch v1Splinter Review
Attachment #564455 - Flags: review?(rnewman)
No longer blocks: 691630
Depends on: 691630
Comment on attachment 564455 [details] [diff] [review]
v1

Review of attachment 564455 [details] [diff] [review]:
-----------------------------------------------------------------

This looks good to me.

::: services/sync/tests/unit/test_syncscheduler.js
@@ +586,5 @@
> +  // Use an odd value on purpose so that it doesn't happen to coincide with one
> +  // of the sync intervals.
> +  const BACKOFF = 7337;
> +
> +  // Make extend info/collections so that we can put it into server

%s/Make extend/Extend

(Also in test below.)

@@ +590,5 @@
> +  // Make extend info/collections so that we can put it into server
> +  // maintenance mode.
> +  const INFO_COLLECTIONS = "/1.1/johndoe/info/collections";
> +  let infoColl = server._handler._overridePaths[INFO_COLLECTIONS];
> +  let server_backoff = false;

serverBackoff?
Attachment #564455 - Flags: review?(rnewman) → review+
https://hg.mozilla.org/services/services-central/rev/3af678a82e64
Status: NEW → ASSIGNED
Whiteboard: [fixed in services]
Comment on attachment 564455 [details] [diff] [review]
v1

This is a critical bug in Sync's backoff handling, which means it doesn't actually obey the server's request to back off in case of an infrastructure meltdown.
Attachment #564455 - Flags: approval-mozilla-aurora?
Attached patch patch for betaSplinter Review
Please see comment 11 for mozilla-beta justification. Try build is spinning and will report back to this bug: https://tbpl.mozilla.org/?tree=Try&usebuildbot=1&rev=c4bb15672702
Attachment #564623 - Flags: approval-mozilla-beta?
Requesting release approval in case there's a 7.0.x chemspill that we can piggyback on. Please see comment 11 for justification. Try build is spinning and will report back here: https://tbpl.mozilla.org/?tree=Try&usebuildbot=1&rev=d36116c32ba4
Attachment #564626 - Flags: approval-mozilla-release?
verified against stage with s-c build built this morning.
Whiteboard: [fixed in services] → [verified in services]
https://hg.mozilla.org/mozilla-central/rev/3af678a82e64
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla10
Is this a regression? If so, what caused it? Are the servers having issues or have you taken mitigation issues on the server or is this a "just in case"/"nice to have" fix?
s/mitigation issues/mitigation steps/
Try run for d36116c32ba4 is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=d36116c32ba4
Results (out of 30 total builds):
    exception: 1
    success: 25
    warnings: 2
    failure: 2
Builds available at http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/pweitershausen@mozilla.com-d36116c32ba4
Try run for c4bb15672702 is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=c4bb15672702
Results (out of 30 total builds):
    success: 29
    failure: 1
Builds available at http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/pweitershausen@mozilla.com-c4bb15672702
(In reply to Christian Legnitto [:LegNeato] from comment #16)
> Is this a regression? If so, what caused it? Are the servers having issues
> or have you taken mitigation issues on the server or is this a "just in
> case"/"nice to have" fix?

It's not a regression; this has been broken for perhaps a year now. We just haven't needed it, so we haven't noticed that it was buggy.

The primary load-restriction mechanism built into the Sync service is an X-Weave-Backoff header. When we're under massive load, as we are right now (see Bug 690470 for one of the effects!), we set this header to tell clients to go away for a duration.

We found, in the course of investigating this storm of unavailable Sync nodes, that the client code simply doesn't work.

Services Operations has spent several days now working around the clock attempting to get us back on our feet. Most recovery approaches make the problem worse, and without this client handling we have no way to get back to a good state without throwing error bars on users' machines.

tl;dr: Sync is throwing errors left right and center for 90% of our user base, and a client update would help immensely in allowing operations to get things back in the green.
(In reply to Christian Legnitto [:LegNeato] from comment #16)
> Is this a regression?

No, it has been broken for at least a year and a half. The brokenness dates back to a time when Sync was called Weave and just an add-on in Labs.

> Are the servers having issues
> or have you taken mitigation issues on the server or is this a "just in
> case"/"nice to have" fix?

The servers are having heavy issues from the unexpectedly high load caused by the Instant Sync feature, landed in Firefox 7. We found that the backoff header that the server is sending, which was designed as a safety valve for those cases, isn't actually honoured by any client.

Just FYI: we're also looking at what causes the unexpectedly high syncing load in Firefox 7 and later to make sure we're not continuing to melt the servers for all eternity. These will be filed as separate bugs and we'll ask approval for those separately.
(In reply to Richard Newman [:rnewman] from comment #20)
> (In reply to Christian Legnitto [:LegNeato] from comment #16)
> Most recovery approaches make the problem worse…

Expanding on this a little bit: database load appears to be the largest culprit here. Ops is taking various steps to improve the situation (off-hand, I can think of: DB upgrades; pruning records; adding capacity), but in order to effect these steps they need to stop clients making requests to the affected nodes.

(atoll, feel free to clarify if I mis-stated.)
Blocks: 692006
Attachment #564455 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #564623 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Blocks: 691663
No longer depends on: 691663
qa- as there does not appear to be much QA can do to verify this fix. Please set to qa+ if this is not the case.
Whiteboard: [verified in services] → [verified in services][qa-]
(In reply to Anthony Hughes, Mozilla QA (irc: ashughes) from comment #25)
> qa- as there does not appear to be much QA can do to verify this fix. Please
> set to qa+ if this is not the case.

That's not true. Services Ops can easily configure the staging servers to send backoff notices. Perhaps the folks from Services QA can chime in here and help out with verifying.

This fix is absolutely vital for the health of our Sync server infrastructure. We need to make sure it works in the released versions of Firefox.
Thanks for clarifying, setting to qa+
Whiteboard: [verified in services][qa-] → [verified in services][qa+]
Yea, Services QA will pick this up.
Looks like Tracy covered the client side already (Comment 14). He and I can work with petef to get this verified from the server side.
(In reply to James Bonacci [:jbonacci] from comment #28)
> Yea, Services QA will pick this up.
> Looks like Tracy covered the client side already (Comment 14). He and I can
> work with petef to get this verified from the server side.

James, this is a client bug. There's nothing in the server code that needs verifying. Tracy verified the fix in s-c before the merge to m-c. But we also need to verify this in Aurora and Beta. I believe this is what :ashughes was talking about in comment 25 and later.
Attachment #564626 - Flags: approval-mozilla-release?
Keywords: verified-beta
Could you please tell me how to test this ?
Paul, this has to be coordinated with Services OPs so they can send appropriate back-off messages from the stage server. See https://bugzilla.mozilla.org/show_bug.cgi?id=684798#c30 for some examples.
(In reply to Tracy Walker [:tracy] from comment #31)
> Paul, this has to be coordinated with Services OPs so they can send
> appropriate back-off messages from the stage server. See
> https://bugzilla.mozilla.org/show_bug.cgi?id=684798#c30 for some examples.

Tracy, given Paul's timezone and that he is on sick PTO, could you coordinate testing for verification in Firefox 9.0b6?
verified on Fx9b6
Status: RESOLVED → VERIFIED
Keywords: verified-aurora
Whiteboard: [verified in services][qa+] → [verified in services]
Whiteboard: [verified in services] → [verified in services][qa+][qa!:9]
Whiteboard: [verified in services][qa+][qa!:9] → [verified in services][qa!:8][qa!:9]
Component: Firefox Sync: Backend → Sync
Product: Cloud Services → Firefox
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: