Closed
Bug 691612
Opened 12 years ago
Closed 12 years ago
Backoff handling is broken
Categories
(Firefox :: Sync, defect)
Firefox
Sync
Tracking
()
VERIFIED
FIXED
mozilla10
People
(Reporter: philikon, Assigned: philikon)
References
Details
(Keywords: verified-aurora, verified-beta, Whiteboard: [verified in services][qa!:8][qa!:9])
Attachments
(3 files)
6.55 KB,
patch
|
rnewman
:
review+
asa
:
approval-mozilla-aurora+
|
Details | Diff | Splinter Review |
6.59 KB,
patch
|
asa
:
approval-mozilla-beta+
|
Details | Diff | Splinter Review |
6.59 KB,
patch
|
Details | Diff | Splinter Review |
The "weave:service:backoff:interval" observer computes nonsense [1] which means neither Status.backoffInterval nor Status.minimumNextSync are actually useful values. [1] https://mxr.mozilla.org/services-central/source/fx-sync/services/sync/modules/service.js#573
Comment 1•12 years ago
|
||
I am here and waiting to review. Alternatively, I'm happy to write a patch and test.
Assignee | ||
Updated•12 years ago
|
Assignee: nobody → philipp
status-firefox10:
--- → affected
status-firefox7:
--- → affected
status-firefox8:
--- → affected
status-firefox9:
--- → affected
Comment 3•12 years ago
|
||
http://mxr.mozilla.org/mozilla-central/source/services/sync/modules/policies.js#186 really just needs to use the values from subject, not data. It also needs to multiply the value by 1000 for the minimumNextSync. Also, omg this needs tests.
Assignee | ||
Comment 4•12 years ago
|
||
Yes to all three points. On a further note, in a follow-up, I would like to untangle all of the backoff code. It's spread over various functions and methods. I'm not even confident that by fixing those two bugs backoff will work. But the tests I am writing *right now* will tell us more.
Assignee | ||
Comment 5•12 years ago
|
||
(In reply to Philipp von Weitershausen [:philikon] from comment #4) > On a further note, in a follow-up, I would like to untangle all of the > backoff code. It's spread over various functions and methods. I'm not even > confident that by fixing those two bugs backoff will work. But the tests I > am writing *right now* will tell us more. Looks my suspicions were somewhat right: yes, after fixing those two bugs we will now observe Status.backoffInterval, but a forced sync will proceed because we reset the Status object at the beginning of the sync [1]. If I understood mconnor correctly earlier today, then this is not the intended behaviour. However, if we forced the sync to fail, then we're lacking some serious UX here as it would just simply result in an Unknown Error. We could of course ignore Sync errors in backoff mode. This would require two slightly different implementations in Firefox 7/8 and 9/10, but it could be done. [1] https://mxr.mozilla.org/services-central/source/fx-sync/services/sync/modules/service.js#1701 (I'm deliberately quoting the old extension code in this bug to demonstrate that this behaviour hasn't changed at all for at least a year.)
Assignee | ||
Comment 6•12 years ago
|
||
Also, it turns out not all of the sync scheduling code actually adheres to the backoff interval, or it uses different backoff metrics (some of them seem bogus to me). I filed bug 691663.
Assignee | ||
Comment 7•12 years ago
|
||
Attachment #564455 -
Flags: review?(rnewman)
Assignee | ||
Updated•12 years ago
|
Comment 8•12 years ago
|
||
Comment on attachment 564455 [details] [diff] [review] v1 Review of attachment 564455 [details] [diff] [review]: ----------------------------------------------------------------- This looks good to me. ::: services/sync/tests/unit/test_syncscheduler.js @@ +586,5 @@ > + // Use an odd value on purpose so that it doesn't happen to coincide with one > + // of the sync intervals. > + const BACKOFF = 7337; > + > + // Make extend info/collections so that we can put it into server %s/Make extend/Extend (Also in test below.) @@ +590,5 @@ > + // Make extend info/collections so that we can put it into server > + // maintenance mode. > + const INFO_COLLECTIONS = "/1.1/johndoe/info/collections"; > + let infoColl = server._handler._overridePaths[INFO_COLLECTIONS]; > + let server_backoff = false; serverBackoff?
Attachment #564455 -
Flags: review?(rnewman) → review+
Assignee | ||
Comment 9•12 years ago
|
||
https://hg.mozilla.org/services/services-central/rev/3af678a82e64
Status: NEW → ASSIGNED
Whiteboard: [fixed in services]
Comment 10•12 years ago
|
||
Greener than Eden. https://tbpl.mozilla.org/?tree=Services-Central&rev=3af678a82e64
Assignee | ||
Comment 11•12 years ago
|
||
Comment on attachment 564455 [details] [diff] [review] v1 This is a critical bug in Sync's backoff handling, which means it doesn't actually obey the server's request to back off in case of an infrastructure meltdown.
Attachment #564455 -
Flags: approval-mozilla-aurora?
Assignee | ||
Comment 12•12 years ago
|
||
Please see comment 11 for mozilla-beta justification. Try build is spinning and will report back to this bug: https://tbpl.mozilla.org/?tree=Try&usebuildbot=1&rev=c4bb15672702
Attachment #564623 -
Flags: approval-mozilla-beta?
Assignee | ||
Comment 13•12 years ago
|
||
Requesting release approval in case there's a 7.0.x chemspill that we can piggyback on. Please see comment 11 for justification. Try build is spinning and will report back here: https://tbpl.mozilla.org/?tree=Try&usebuildbot=1&rev=d36116c32ba4
Attachment #564626 -
Flags: approval-mozilla-release?
Comment 14•12 years ago
|
||
verified against stage with s-c build built this morning.
Whiteboard: [fixed in services] → [verified in services]
Assignee | ||
Comment 15•12 years ago
|
||
https://hg.mozilla.org/mozilla-central/rev/3af678a82e64
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla10
Comment 16•12 years ago
|
||
Is this a regression? If so, what caused it? Are the servers having issues or have you taken mitigation issues on the server or is this a "just in case"/"nice to have" fix?
Comment 17•12 years ago
|
||
s/mitigation issues/mitigation steps/
Comment 18•12 years ago
|
||
Try run for d36116c32ba4 is complete. Detailed breakdown of the results available here: https://tbpl.mozilla.org/?tree=Try&rev=d36116c32ba4 Results (out of 30 total builds): exception: 1 success: 25 warnings: 2 failure: 2 Builds available at http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/pweitershausen@mozilla.com-d36116c32ba4
Comment 19•12 years ago
|
||
Try run for c4bb15672702 is complete. Detailed breakdown of the results available here: https://tbpl.mozilla.org/?tree=Try&rev=c4bb15672702 Results (out of 30 total builds): success: 29 failure: 1 Builds available at http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/pweitershausen@mozilla.com-c4bb15672702
Comment 20•12 years ago
|
||
(In reply to Christian Legnitto [:LegNeato] from comment #16) > Is this a regression? If so, what caused it? Are the servers having issues > or have you taken mitigation issues on the server or is this a "just in > case"/"nice to have" fix? It's not a regression; this has been broken for perhaps a year now. We just haven't needed it, so we haven't noticed that it was buggy. The primary load-restriction mechanism built into the Sync service is an X-Weave-Backoff header. When we're under massive load, as we are right now (see Bug 690470 for one of the effects!), we set this header to tell clients to go away for a duration. We found, in the course of investigating this storm of unavailable Sync nodes, that the client code simply doesn't work. Services Operations has spent several days now working around the clock attempting to get us back on our feet. Most recovery approaches make the problem worse, and without this client handling we have no way to get back to a good state without throwing error bars on users' machines. tl;dr: Sync is throwing errors left right and center for 90% of our user base, and a client update would help immensely in allowing operations to get things back in the green.
Assignee | ||
Comment 21•12 years ago
|
||
(In reply to Christian Legnitto [:LegNeato] from comment #16) > Is this a regression? No, it has been broken for at least a year and a half. The brokenness dates back to a time when Sync was called Weave and just an add-on in Labs. > Are the servers having issues > or have you taken mitigation issues on the server or is this a "just in > case"/"nice to have" fix? The servers are having heavy issues from the unexpectedly high load caused by the Instant Sync feature, landed in Firefox 7. We found that the backoff header that the server is sending, which was designed as a safety valve for those cases, isn't actually honoured by any client. Just FYI: we're also looking at what causes the unexpectedly high syncing load in Firefox 7 and later to make sure we're not continuing to melt the servers for all eternity. These will be filed as separate bugs and we'll ask approval for those separately.
Comment 22•12 years ago
|
||
(In reply to Richard Newman [:rnewman] from comment #20) > (In reply to Christian Legnitto [:LegNeato] from comment #16) > Most recovery approaches make the problem worse… Expanding on this a little bit: database load appears to be the largest culprit here. Ops is taking various steps to improve the situation (off-hand, I can think of: DB upgrades; pruning records; adding capacity), but in order to effect these steps they need to stop clients making requests to the affected nodes. (atoll, feel free to clarify if I mis-stated.)
Assignee | ||
Updated•12 years ago
|
Updated•12 years ago
|
Attachment #564455 -
Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Updated•12 years ago
|
Attachment #564623 -
Flags: approval-mozilla-beta? → approval-mozilla-beta+
Assignee | ||
Comment 23•12 years ago
|
||
https://hg.mozilla.org/releases/mozilla-aurora/rev/dcb1ab75cf63
Assignee | ||
Comment 24•12 years ago
|
||
https://hg.mozilla.org/releases/mozilla-beta/rev/19cf03b8ed32
Assignee | ||
Updated•12 years ago
|
Comment 25•12 years ago
|
||
qa- as there does not appear to be much QA can do to verify this fix. Please set to qa+ if this is not the case.
Whiteboard: [verified in services] → [verified in services][qa-]
Assignee | ||
Comment 26•12 years ago
|
||
(In reply to Anthony Hughes, Mozilla QA (irc: ashughes) from comment #25) > qa- as there does not appear to be much QA can do to verify this fix. Please > set to qa+ if this is not the case. That's not true. Services Ops can easily configure the staging servers to send backoff notices. Perhaps the folks from Services QA can chime in here and help out with verifying. This fix is absolutely vital for the health of our Sync server infrastructure. We need to make sure it works in the released versions of Firefox.
Comment 27•12 years ago
|
||
Thanks for clarifying, setting to qa+
Whiteboard: [verified in services][qa-] → [verified in services][qa+]
Comment 28•12 years ago
|
||
Yea, Services QA will pick this up. Looks like Tracy covered the client side already (Comment 14). He and I can work with petef to get this verified from the server side.
Assignee | ||
Comment 29•12 years ago
|
||
(In reply to James Bonacci [:jbonacci] from comment #28) > Yea, Services QA will pick this up. > Looks like Tracy covered the client side already (Comment 14). He and I can > work with petef to get this verified from the server side. James, this is a client bug. There's nothing in the server code that needs verifying. Tracy verified the fix in s-c before the merge to m-c. But we also need to verify this in Aurora and Beta. I believe this is what :ashughes was talking about in comment 25 and later.
Assignee | ||
Updated•12 years ago
|
Attachment #564626 -
Flags: approval-mozilla-release?
Updated•12 years ago
|
Keywords: verified-beta
Comment 30•12 years ago
|
||
Could you please tell me how to test this ?
Comment 31•12 years ago
|
||
Paul, this has to be coordinated with Services OPs so they can send appropriate back-off messages from the stage server. See https://bugzilla.mozilla.org/show_bug.cgi?id=684798#c30 for some examples.
Comment 32•11 years ago
|
||
(In reply to Tracy Walker [:tracy] from comment #31) > Paul, this has to be coordinated with Services OPs so they can send > appropriate back-off messages from the stage server. See > https://bugzilla.mozilla.org/show_bug.cgi?id=684798#c30 for some examples. Tracy, given Paul's timezone and that he is on sick PTO, could you coordinate testing for verification in Firefox 9.0b6?
Comment 33•11 years ago
|
||
verified on Fx9b6
Status: RESOLVED → VERIFIED
Keywords: verified-aurora
Whiteboard: [verified in services][qa+] → [verified in services]
Whiteboard: [verified in services] → [verified in services][qa+][qa!:9]
Updated•11 years ago
|
Whiteboard: [verified in services][qa+][qa!:9] → [verified in services][qa!:8][qa!:9]
Updated•5 years ago
|
Component: Firefox Sync: Backend → Sync
Product: Cloud Services → Firefox
You need to log in
before you can comment on or make changes to this bug.
Description
•