If you think a bug might affect users in the 57 release, please set the correct tracking and status flags for Release Management.

Catch all cases of server maintenance at login. All of them.

VERIFIED FIXED in mozilla10

Status

Cloud Services
Firefox Sync: Backend
VERIFIED FIXED
6 years ago
6 years ago

People

(Reporter: emtwo, Assigned: philikon)

Tracking

unspecified
mozilla10
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox9 affected)

Details

(Whiteboard: [fixed in services])

Attachments

(8 attachments, 9 obsolete attachments)

11.40 KB, patch
rnewman
: review+
Details | Diff | Splinter Review
41.65 KB, patch
Details | Diff | Splinter Review
6.28 KB, patch
rnewman
: review+
Details | Diff | Splinter Review
16.37 KB, text/html
Details
5.38 KB, patch
rnewman
: review+
Details | Diff | Splinter Review
34.50 KB, text/plain
Details
9.82 KB, patch
Details | Diff | Splinter Review
3.29 KB, patch
rnewman
: review+
Details | Diff | Splinter Review
(Reporter)

Description

6 years ago
We need to add tests for the "new/empty account" login path where we upload keys and meta/global. So we might get a successful 404 on the GET for those, but the PUT might return a 503 + Retry-After.
(Reporter)

Updated

6 years ago
Depends on: 683396
Since I've started writing these tests (will upload WIP), they've already uncovered some areas of the code where we need additional checks. Bug 688573 has now uncovered more (_remoteSetup calls _freshStart which calls wipeServer which may fail at any random URL). Morphing bug accordingly.
Summary: Add tests for identifying server maintenance at login → Catch all cases of server maintenance at login. All of them.
(Assignee)

Updated

6 years ago
Blocks: 688573
Created attachment 561915 [details] [diff] [review]
WIP v1
Assignee: nobody → philipp
Status: NEW → ASSIGNED
(Assignee)

Updated

6 years ago
Blocks: 692556
Created attachment 565438 [details] [diff] [review]
Part 1 (v1): catch errors for crypto/keys
Attachment #561915 - Attachment is obsolete: true
Attachment #565438 - Flags: review?(rnewman)
Marking Firefox 9 and 10 as affected because there the expectation that we've set in the Death To Unknown Error feature (specifically bug 683396) was that we don't display the stupid error bar for maintenance or load problems ever. Of course Firefox 8 and prior are affected, too, but they didn't include Death To Unknown Error.
Blocks: 683396
status-firefox10: --- → affected
status-firefox9: --- → affected
No longer depends on: 683396
Attachment #565438 - Flags: review?(rnewman) → review+
Created attachment 566019 [details] [diff] [review]
Part 2 (WIP 1): sanitize wipeServer, catch server errors
(In reply to Philipp von Weitershausen [:philikon] from comment #5)
> Created attachment 566019 [details] [diff] [review] [diff] [details] [review]
> Part 2 (WIP 1): sanitize wipeServer, catch server errors

There's one typo in service.js. Fix:

-          response = resource.delete();
+          response = res.delete();
Created attachment 566102 [details] [diff] [review]
Part 2 (WIP 2): sanitize wipeServer, catch server errors, fix tests

Fixed tests and typos.
Attachment #566019 - Attachment is obsolete: true
(In reply to Richard Newman [:rnewman] from comment #7)
> Created attachment 566102 [details] [diff] [review] [diff] [details] [review]
> Part 2 (WIP 2): sanitize wipeServer, catch server errors, fix tests
> 
> Fixed tests and typos.

Thank you, Richard.
(Assignee)

Updated

6 years ago
Depends on: 693893
Created attachment 566416 [details] [diff] [review]
Part 2 (v1): sanitize wipeServer, catch server errors

Lots moar tests compared to the WIP.

The X-If-Unmodified-Since header seems to break the prod and stage servers (bug 693893), though a local server instance doesn't fail. We could easily take it out, though, and we'd just defer that bit to bug 692700. If we leave it in, I should probably amend the tests to check for it, though.
Attachment #566102 - Attachment is obsolete: true
Attachment #566416 - Flags: review?(rnewman)
Bug 693896 requests that Grinder loadtests reflect the new DELETE /storage pattern, so we can verify that things continue to work okay.
Depends on: 693896
Bug 693864 covers implementation of DELETE /storage in the JS Sync server. I'm working on that now.

Philipp, regarding

> The X-If-Unmodified-Since header seems to break the prod and stage servers
> (bug 693893), though a local server instance doesn't fail. We could easily
> take it out, though, and we'd just defer that bit to bug 692700. If we leave
> it in, I should probably amend the tests to check for it, though.

It might make sense to land a commit without X-If-Unmodified-Since, and a commit with it immediately afterwards. Would make a backout much easier. If breaking the patch in two is easy, I would suggest that.
OS: Mac OS X → All
Hardware: x86 → All
(In reply to Richard Newman [:rnewman] from comment #11)
> It might make sense to land a commit without X-If-Unmodified-Since, and a
> commit with it immediately afterwards. Would make a backout much easier. If
> breaking the patch in two is easy, I would suggest that.

f+. Seems like XIUS is a new feature, rather than part of "catch all cases of 503" bug.
(In reply to Richard Newman [:rnewman] from comment #11)
> It might make sense to land a commit without X-If-Unmodified-Since, and a
> commit with it immediately afterwards. Would make a backout much easier. If
> breaking the patch in two is easy, I would suggest that.

Yeah that may make sense, except I would actually really like to write tests for the X-I-U-S stuff, too. And if we're backing it out immediately, that's just wasted time right now. Let's do this properly later.
(In reply to Richard Soderberg [:atoll] from comment #12)
> f+. Seems like XIUS is a new feature, rather than part of "catch all cases
> of 503" bug.

True. That's why I'd like to defer it altogether to bug 692700.
Created attachment 566439 [details] [diff] [review]
Part 2 (v1.1): sanitize wipeServer, catch server errors

Same as v1 except it doesn't use X-I-U-S.
Attachment #566416 - Attachment is obsolete: true
Attachment #566416 - Flags: review?(rnewman)
Attachment #566439 - Flags: review?(rnewman)
Comment on attachment 566439 [details] [diff] [review]
Part 2 (v1.1): sanitize wipeServer, catch server errors

Review of attachment 566439 [details] [diff] [review]:
-----------------------------------------------------------------

::: services/sync/modules/service.js
@@ +1609,2 @@
>        if (!collections) {
> +        let res = new Resource(this.storageURL.slice(0, -1));

Please add a brief comment here that you're dropping the trailing slash to conform to the API.
Attachment #566439 - Flags: review?(rnewman) → review+
Created attachment 566444 [details] [diff] [review]
Part 2 (v1.2): sanitize wipeServer, catch server errors

While going over test_service_sync_updateEnabledEngines.js, I found that rnewman's helpful treatment to make it pass with my changes was actually covering up a deficiency in the test itself. Here's the same patch as before but with a fix for the test that actually tests what we want to test (wiping individual engines, rather than wiping all data due to lack of keys.)
Attachment #566439 - Attachment is obsolete: true
Attachment #566444 - Flags: review?(rnewman)
Comment on attachment 566444 [details] [diff] [review]
Part 2 (v1.2): sanitize wipeServer, catch server errors

Review of attachment 566444 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good.
Attachment #566444 - Flags: review?(rnewman) → review+
Created attachment 566446 [details] [diff] [review]
Part 2 (v1.3): sanitize wipeServer, catch server errors

Addressed review comment.
Attachment #566444 - Attachment is obsolete: true
Created attachment 566453 [details] [diff] [review]
Part 3 (v1): catch errors when wiping/disabling engines

This should conclude this bug. At least I haven't found anything that else that makes requests to the Sync storage server during a sync and isn't yet covered by the ErrorHandler.
Attachment #566453 - Flags: review?(rnewman)
Comment on attachment 566453 [details] [diff] [review]
Part 3 (v1): catch errors when wiping/disabling engines

Review of attachment 566453 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good, and my SyncServer and 401 patches apply. 108 passing tests.
Attachment #566453 - Flags: review?(rnewman) → review+
https://hg.mozilla.org/services/services-central/rev/f15a17ef38dd
https://hg.mozilla.org/services/services-central/rev/c7a5c385d01a
https://hg.mozilla.org/services/services-central/rev/f06dfa65c0be
Whiteboard: [fixed in services]
(Assignee)

Updated

6 years ago
Blocks: 692714
Comment on attachment 565438 [details] [diff] [review]
Part 1 (v1): catch errors for crypto/keys

Requesting approval for Aurora. While this bug by itself is not a regression, it provides ground work for the fix of another regression (bug 692714). It also muffles more annoying error bars in case of server overload or maintenance (the cases that weren't caught by bug 659067 which is in Aurora).
Attachment #565438 - Flags: approval-mozilla-aurora?
Comment on attachment 566446 [details] [diff] [review]
Part 2 (v1.3): sanitize wipeServer, catch server errors

Requesting approval for Aurora. Please see comment 23 for justification.
Attachment #566446 - Flags: approval-mozilla-aurora?
Comment on attachment 566453 [details] [diff] [review]
Part 3 (v1): catch errors when wiping/disabling engines

Requesting approval for Aurora. Please see comment 23 for justification.
Attachment #566453 - Flags: approval-mozilla-aurora?
STRs for QA:

* Connect to a server environment that returns 503 + Retry-After on the following URL calls randomly (best with a high probability, so that it's easier to reproduce these):
  - GET .../info/collections
  - GET + PUT .../storage/meta/global
  - GET + PUT .../storage/crypto/keys
  - DELETE .../storage
  - DELETE .../storage/*
* These server errors should never bubble up in the UI, particularly at login (first sync after browser restart), first sync to a new account, and the sync following a Reset Sync operation. Please verify cases where Sync fails (but doesn't show an error)
  - at info/collections
  - GETing or PUTing meta/global, after a successful fetch of info/collections
  - GETing or PUTing crypto/keys
Created attachment 567883 [details]
error log of instant sync

with stage set to randomly error. I got error bar in UI when doing a bookmark initiated Instant Sync.
I can't verify this as I am seeing error bar on instant syncs triggered by Bookmarks.  and just hit it on idleInterval set next sync.
(In reply to Tracy Walker [:tracy] from comment #28)
> I can't verify this as I am seeing error bar on instant syncs triggered by
> Bookmarks.  and just hit it on idleInterval set next sync.

Tracy, can you verify that this does not introduce any regressions? We may not declare this fixed, but can the train go ahead nonetheless? Or do we need to back out?
(In reply to Philipp von Weitershausen [:philikon] from comment #26)
> STRs for QA:
> 
> * Connect to a server environment that returns 503 + Retry-After on the
> following URL calls randomly (best with a high probability, so that it's
> easier to reproduce these):
>   - GET .../info/collections
>   - GET + PUT .../storage/meta/global
>   - GET + PUT .../storage/crypto/keys
>   - DELETE .../storage
>   - DELETE .../storage/*

I realize my suggestion to do random 503 + Retry-After responses was probably not very helpful. It's probably easier to work closely with Ops while verifying this bug and configure staging to return 503 + Retry-After on each one of those 100% of the time, so that each one of those can be verified reliably.
Created attachment 568249 [details] [diff] [review]
Part 4 (v1): Schedule syncs on temporary/recoverable login errors

Since we no longer shove the login error bar with a "Sync Now" button in people's faces, we should really make sure that we always schedule another sync when login fails due to a recoverable error. Whitelisting fatal errors in SyncScheduler.
Attachment #568249 - Flags: review?(rnewman)
Comment on attachment 568249 [details] [diff] [review]
Part 4 (v1): Schedule syncs on temporary/recoverable login errors

Review of attachment 568249 [details] [diff] [review]:
-----------------------------------------------------------------

Lots of QA to do here!
Attachment #568249 - Flags: review?(rnewman) → review+
Part 4: https://hg.mozilla.org/services/services-central/rev/439b5c6088c1
(In reply to Richard Newman [:rnewman] from comment #32)

> Lots of QA to do here!

Yes, please provide STR's for affected items.
also, do these builds contain the fix?

1319073531/	19-Oct-2011 22:31
(In reply to Tracy Walker [:tracy] from comment #35)
> also, do these builds contain the fix?
> 
> 1319073531/	19-Oct-2011 22:31

Yes.
(In reply to Tracy Walker [:tracy] from comment #34)
> (In reply to Richard Newman [:rnewman] from comment #32)
> 
> > Lots of QA to do here!
> 
> Yes, please provide STR's for affected items.

While testing builds with part 1-3 yesterday, QA found that while errors during download or upload of meta/global, crypto/keys, or server wiping are muffled at first sync ("login"), a new sync isn't scheduled. With part 4:

* We now always schedule another sync at whatever sync interval is appropriate (this includes backoff calculation, of course) when the error was due to a recoverable error (network problem, server maintenance, etc.).

* If the error is due to incorrect password or Sync Key, we show the error bar and do NOT schedule naother sync.
Created attachment 568505 [details]
DELETE fail 505 but successful log

attached log from reset sync (replace all with this device)

Comment 39

6 years ago
Phillipp, sounds like there might be problems with these patches (comment 37)? Also, the dependency part (bug 692714) doesn't seem ready yet so what's interesting in getting this into Aurora is the silencing of error bars only?
(In reply to Asa Dotzler [:asa] from comment #39)
> Phillipp, sounds like there might be problems with these patches (comment
> 37)?

Yes, my approval requests were probably premature. Our excellent QA team has poked some holes into this patch. We're going to take it back to the drawing board.

> Also, the dependency part (bug 692714) doesn't seem ready yet

It's "equally" ready, in that it landed on s-c but here too QA has found some edge case problems that we need to resolve.

Both fixes would be good to have in a release soon because it gives Ops more headroom, as the client is more tolerant of server load issues and load rebalancing. But, the fixes need to be ready, of course.

I will withdraw Aurora approval request for now and cycle back when QA has verified that all outstanding issues have been fixed. Thanks, and sorry for the premature request.
(Assignee)

Updated

6 years ago
Attachment #565438 - Flags: approval-mozilla-aurora?
(Assignee)

Updated

6 years ago
Attachment #566446 - Flags: approval-mozilla-aurora?
(Assignee)

Updated

6 years ago
Attachment #566453 - Flags: approval-mozilla-aurora?
Backed out on s-c: https://hg.mozilla.org/services/services-central/rev/353f54133958
Whiteboard: [fixed in services]
(In reply to Tracy Walker [:tracy] from comment #38)
> Created attachment 568505 [details]
> DELETE fail 505 but successful log
> 
> attached log from reset sync (replace all with this device)

So it turns out this is not a regression at all. Service.wipe* muffle all exceptions. That's still wrong, of course, and we should fix that. Will work on a patch.
(In reply to Philipp von Weitershausen [:philikon] from comment #42)

> So it turns out this is not a regression at all. Service.wipe* muffle all
> exceptions. That's still wrong, of course, and we should fix that. Will work
> on a patch.

Any update on this, philikon? Got a few more days until the next train, but I am interested to see this and Bug 692714 re-land.
Created attachment 570179 [details] [diff] [review]
Part 5 (v1): Don't ignore errors in wipeRemote et.al.
Attachment #570179 - Flags: review?(rnewman)
Created attachment 570185 [details] [diff] [review]
Part 5 (v2): Don't ignore errors in wipeRemote, wipeClient
Attachment #570179 - Attachment is obsolete: true
Attachment #570179 - Flags: review?(rnewman)
Attachment #570185 - Flags: review?(rnewman)
Comment on attachment 570185 [details] [diff] [review]
Part 5 (v2): Don't ignore errors in wipeRemote, wipeClient

Review of attachment 570185 [details] [diff] [review]:
-----------------------------------------------------------------

Just comment rephrasing for clarity.

Don't forget to change

  # User Richard Newman <rnewman@mozilla.com>

in the patch header!

::: services/sync/tests/unit/test_errorhandler.js
@@ +1047,5 @@
>    Service.sync();
>  });
>  
>  add_test(function test_wipeServer_login_prolonged_server_maintenance_error(){
> +  // Test prolonged server maintenance errors while wiping the server are reported.

Test that we report prolonged server maintenance errors that occur whilst wiping the server.

@@ +1081,5 @@
>    Service.sync();
>  });
>  
> +add_test(function test_wipeRemote_prolonged_server_maintenance_error(){
> +  // Test prolonged server maintenance errors while wiping all remote devices are reported.

Test that we report prolonged server maintenance errors that occur whilst wiping all remote devices.

@@ +1319,5 @@
>    ErrorHandler.syncAndReportErrors();
>  });
>  
> +add_test(function test_wipeRemote_syncAndReportErrors_server_maintenance_error(){
> +  // Test prolonged server maintenance errors while wiping all remote devices are reported.

Test that we report prolonged server maintenance errors that occur whilst wiping all remote devices.
Attachment #570185 - Flags: review?(rnewman) → review+
Created attachment 570188 [details] [diff] [review]
Part 5 (v3): Don't ignore errors in wipeRemote, wipeClient
Attachment #570185 - Attachment is obsolete: true
https://hg.mozilla.org/services/services-central/rev/85ae672d8a0c
https://hg.mozilla.org/services/services-central/rev/2e8913411943
https://hg.mozilla.org/services/services-central/rev/56c7e424d06c
https://hg.mozilla.org/services/services-central/rev/d81e478453ba
https://hg.mozilla.org/services/services-central/rev/369146ce350c
Whiteboard: [fixed in services]
Did I not hit 'submit' for that? *sigh*
As discovered by TPS tests, we need one more follow-up here: 

As of bug 692249, nextSync no longer gets reset at the beginning of the sync but at the end. If the nextSync is in the past and we get a login:error, we never bump nextSync. And thanks to part 4 of this bug, we will always keep scheduling syncs for immediately and hence introduce a loop. It seems to me that clearSyncTriggers should also reset nextSync so that a login:error will always create a clean slate for the code that schedules syncs after it.

Part of the problem is also that verifyLogin identifies *all* exceptions that occur as a network error, so e.g. quitting the application during the login (which will create an exception) will trigger a sync loop and keep Firefox alive.

I will round up a patch to fix these two issues.
Created attachment 571072 [details] [diff] [review]
Part 6 (v1): Avoid scheduling loops

Sigh. I hope this is it.
Attachment #571072 - Flags: review?(rnewman)
Created attachment 571082 [details] [diff] [review]
Part 6 (v2): Avoid scheduling loops

Fix test clean up.
Attachment #571072 - Attachment is obsolete: true
Attachment #571072 - Flags: review?(rnewman)
Attachment #571082 - Flags: review?(rnewman)
Attachment #571082 - Flags: review?(rnewman) → review+
Landed Part 6: https://hg.mozilla.org/services/services-central/rev/6071624954fa
(In reply to Philipp von Weitershausen [:philikon] from comment #30)
> (In reply to Philipp von Weitershausen [:philikon] from comment #26)
> > STRs for QA:
> > 
> > * Connect to a server environment that returns 503 + Retry-After on the
> > following URL calls randomly (best with a high probability, so that it's
> > easier to reproduce these):
> >   - GET .../info/collections
> >   - GET + PUT .../storage/meta/global
> >   - GET + PUT .../storage/crypto/keys
> >   - DELETE .../storage
> >   - DELETE .../storage/*
> 
> I realize my suggestion to do random 503 + Retry-After responses was
> probably not very helpful. It's probably easier to work closely with Ops
> while verifying this bug and configure staging to return 503 + Retry-After
> on each one of those 100% of the time, so that each one of those can be
> verified reliably.

Ops is using this set of trafficscript rules to implement the above tests, hand-uncommented w/ coordination on irc:

# What response code?
$code = 503;
# What % of the time should the uncommented if line trigger?
$pct_of_time = 100;
# Five if lines, five steps in the bug.  One must be uncommented.
if (math.random(100) < $pct_of_time && http.getMethod() == "GET" && string.contains(http.getPath(), "/info/collections")) {
#if (math.random(100) < $pct_of_time && string.contains(http.getPath(), "/storage/meta/global")) {
#if (math.random(100) < $pct_of_time && string.contains(http.getPath(), "/storage/crypto/keys")) {
#if (math.random(100) < $pct_of_time && http.getMethod() == "DELETE" && string.endsWith(http.getPath(), "/storage")) {
#if (math.random(100) < $pct_of_time && http.getMethod() == "DELETE" && string.contains(http.getPath(), "/storage/")) {
#if (math.random(100) < $pct_of_time && string.contains(http.getPath(), "/storage")) {
   http.sendResponse("503 Random Error", "application/json", '"server error: random intentional error testing"', "Retry-After: 120");
}
​
https://hg.mozilla.org/mozilla-central/rev/85ae672d8a0c
https://hg.mozilla.org/mozilla-central/rev/2e8913411943
https://hg.mozilla.org/mozilla-central/rev/56c7e424d06c
https://hg.mozilla.org/mozilla-central/rev/d81e478453ba
https://hg.mozilla.org/mozilla-central/rev/369146ce350c
https://hg.mozilla.org/mozilla-central/rev/6071624954fa
Status: ASSIGNED → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla10
Blocks: 698803

Updated

6 years ago
Status: RESOLVED → VERIFIED

Updated

6 years ago
status-firefox10: affected → ---
You need to log in before you can comment on or make changes to this bug.