Correctly identify server maintenance at login, too

RESOLVED FIXED in mozilla9

Status

Cloud Services
Firefox Sync: Backend
RESOLVED FIXED
6 years ago
6 years ago

People

(Reporter: philikon, Assigned: emtwo)

Tracking

unspecified
mozilla9
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [fixed in services])

Attachments

(2 attachments, 4 obsolete attachments)

Bug 578195 makes us identify and display or muffle server maintenance errors, but only for syncs. We don't actually identify server errors correctly on login which means right after startup, you'd see "Server incorrectly configured".
I *think* the solution is to make sure that we call ErrorHandler.checkServerError() for every non-200 response. For machines that start up during maintenance, they will obviously fail on the very first fetch of info/collection in verifyLogin(), but really, it could happen at any point along the login() codepath where we fetch a resource.

In a lot of those places we already call ErrorHandler.checkServerError() and then set some Status.sync/Status.login value. That's not a good order to do things in since ErrorHandler.checkServerError()'s whole point is to set those flags. So at a minimum, we should reverse those lines.
(In reply to Philipp von Weitershausen [:philikon] from comment #1)
> I *think* the solution is to make sure that we call
> ErrorHandler.checkServerError() for every non-200 response.

... and every exception we catch while fetching a request (which is what happens for network errors).
(Assignee)

Updated

6 years ago
Assignee: nobody → msamuel
(Assignee)

Comment 3

6 years ago
Created attachment 557370 [details] [diff] [review]
WIP: correctly identify server maintenance during login

Still needs tests for other login scenarios but is mostly done
Attachment #557370 - Flags: feedback?(philipp)
Comment on attachment 557370 [details] [diff] [review]
WIP: correctly identify server maintenance during login

This patch makes me cry. Not sure whether it's for joy or sorrow, though. Just one nit and a note about test coverage:

>@@ -75,16 +82,18 @@ function sync_httpd_setup() {
>     "/1.1/johndoe/storage/meta/global": upd("meta", global.handler()),
>     "/1.1/johndoe/info/collections": collectionsHelper.handler,
>     "/1.1/johndoe/storage/crypto/keys":
>       upd("crypto", (new ServerWBO("keys")).handler()),
>     "/1.1/johndoe/storage/clients": upd("clients", clientsColl.handler()),
> 
>     "/1.1/janedoe/storage/meta/global": handler_401,
>     "/1.1/janedoe/info/collections": handler_401,
>+
>+    "/api/1.1/johnsmith/info/collections": service_unavailable,

I suggest s/api/maintenance/ or something more descriptive.

It would be good if we could also get tests for info/collections working fine, but crypto/keys and meta/global returning maintenance errors, both in the empty account (first sync, so we're uploading) and the existing account case. Given your time constraints, we could defer those to a follow-up, though, and queue that bug at the end of your todo list :)

One thing you definitely want to do in this patch is make sure the Status.login = SERVER_MAINTENANCE case that was added to ErrorHandler.shouldReportError() is tested in test_shouldReportError().
Attachment #557370 - Flags: feedback?(philipp) → feedback+
(Assignee)

Comment 5

6 years ago
(In reply to Philipp von Weitershausen [:philikon] from comment #4)
> Comment on attachment 557370 [details] [diff] [review]
> WIP: correctly identify server maintenance during login
> 
> This patch makes me cry. Not sure whether it's for joy or sorrow

aww, what could possibly be so sorrowful about the patch? :P

> It would be good if we could also get tests for info/collections working
> fine, but crypto/keys and meta/global returning maintenance errors, both in
> the empty account (first sync, so we're uploading) and the existing account
> case. Given your time constraints, we could defer those to a follow-up,
> though, and queue that bug at the end of your todo list :)

Will try.

> One thing you definitely want to do in this patch is make sure the
> Status.login = SERVER_MAINTENANCE case that was added to
> ErrorHandler.shouldReportError() is tested in test_shouldReportError().

Yes. You did find something I almost missed! :O
(Assignee)

Comment 6

6 years ago
Created attachment 557645 [details] [diff] [review]
V1: correctly identify server maintenance during login

added more tests
Attachment #557370 - Attachment is obsolete: true
Attachment #557645 - Flags: review?(philipp)
(Assignee)

Updated

6 years ago
Attachment #557645 - Attachment is patch: true
Comment on attachment 557645 [details] [diff] [review]
V1: correctly identify server maintenance during login

>+
>+    "/maintenance/1.1/johnsmith/info/collections": service_unavailable,
>+
>+    "/maintenance/1.1/janesmith/storage/meta/global": service_unavailable,
>+    "/maintenance/1.1/janesmith/info/collections": collectionsHelper.handler,
>+
>+    "/maintenance/1.1/foo/storage/meta/global": upd("meta", global.handler()),
>+    "/maintenance/1.1/foo/info/collections": collectionsHelper.handler,
>+    "/maintenance/1.1/foo/storage/crypto/keys": service_unavailable,
>   });

This is awesome, thanks for writing tests for these. These cover at least the "existing account" login path which will cover us 99% of the time. We should still have tests for the "new/empty account" login path where we upload keys and meta/global. So we might get a successful 404 on the GET for those, but the PUT might return a 503 + Retry-After. Edge case, I know, but we've had other key-related edge cases before where maintenance lead to inconsistencies across clients.


>+add_test(function testA_login_server_maintenance_error() {
>+add_test(function testB_login_server_maintenance_error() {
>+add_test(function testC_login_server_maintenance_error() {
etc...

Please give these more descriptive names. Best is to mention which resource produced the maintenance failure (IOW, one of info_collections, meta_global, crypto_keys).

r=me with that!
Attachment #557645 - Flags: review?(philipp) → review+
(In reply to Philipp von Weitershausen [:philikon] from comment #7)
> We should still have tests for the "new/empty account" login path where we
> upload keys and meta/global. So we might get a successful 404 on the GET for
> those, but the PUT might return a 503 + Retry-After. Edge case, I know, but
> we've had other key-related edge cases before where maintenance lead to
> inconsistencies across clients.

I meant to add: it's not really worth it holding this bug for those tests, so please just file a follow-up bug for writing them.
(Assignee)

Updated

6 years ago
Blocks: 684798
(Assignee)

Comment 9

6 years ago
Created attachment 558406 [details] [diff] [review]
V2: correctly identify server maintenance during login

* addressing comment 7
* comment 8 addresssed: filed bug 684798
(Assignee)

Updated

6 years ago
Attachment #557645 - Attachment is obsolete: true
http://hg.mozilla.org/services/services-central/rev/29bba7fb6ddc
Status: NEW → ASSIGNED
Whiteboard: [fixed in services]
(Assignee)

Comment 11

6 years ago
STR:
1) When a server node goes down for maintenance (sends 503 + Retry-After) prior to being logged in, Sync should *not* display an error message on a scheduled sync.
2) You will get a nice "server is down for maintenance" message if a sync is invoked by clicking the Sync Now button. (It works exactly like the muffling/displaying of network errors in bug 659067)
(In reply to Marina Samuel [:marina] from comment #11)
> STR:
> 1) When a server node goes down for maintenance (sends 503 + Retry-After)
> prior to being logged in, Sync should *not* display an error message on a
> scheduled sync.
> 2) You will get a nice "server is down for maintenance" message if a sync is
> invoked by clicking the Sync Now button. (It works exactly like the
> muffling/displaying of network errors in bug 659067)

To clarify, "prior to being logged in" means first starting the browser.
This change introduced a random orange:
http://tinderbox.mozilla.org/showlog.cgi?log=Services-Central/1315330787.1315332010.24825.gz

and possibly also
http://tinderbox.mozilla.org/showlog.cgi?log=Services-Central/1315330573.1315332801.26586.gz

We need to fix those before we can merge to m-c.
(Assignee)

Comment 14

6 years ago
Created attachment 558678 [details] [diff] [review]
Proposed test failure fix

Maybe this will help?
Attachment #558678 - Flags: feedback?(philipp)
Comment on attachment 558678 [details] [diff] [review]
Proposed test failure fix

This is a good idea, but I don't see how this would help unless nsIObserverService actually unwinds nested notifications somehow or we deferred weave:ui:sync:* ourselves. I actually think that's a good idea anyway, so let's drive the UI on the next tick in addition to this patch.
Attachment #558678 - Flags: feedback?(philipp)
(Assignee)

Comment 16

6 years ago
Created attachment 558690 [details] [diff] [review]
V2: Proposed test failure fix

* notifying the UI on nextTick
Attachment #558678 - Attachment is obsolete: true
Attachment #558690 - Flags: feedback?(philipp)
Comment on attachment 558690 [details] [diff] [review]
V2: Proposed test failure fix

>+  notifyOnNextTick: function notifyOnNextTick(observer) {
>+    Utils.nextTick(function() Svc.Obs.notify(observer), Svc.Obs);
>+  },

s/observer/topic/ please and the 2nd argument to nextTick isn't necessary here, you can just omit it.
Attachment #558690 - Flags: feedback?(philipp) → review+
Comment on attachment 558690 [details] [diff] [review]
V2: Proposed test failure fix

Please refactor 

+    Service.startOver();
+    Status.resetSync();
+    Status.resetBackoff();
+
+    server.stop(run_next_test);

into something analogous to test_syncengine_sync.js's

  function cleanAndGo(server) {
    Svc.Prefs.resetBranch("");
    Records.clearCache();
    server.stop(run_next_test);
  }

Avoids having to correct multiple locations.
(Assignee)

Comment 19

6 years ago
Created attachment 558696 [details] [diff] [review]
V3: Proposed test failure fix

addressing comment 17 & comment 18

Found a tiny problem (& fixed) in test_login_syncAndReportErrors_prolonged_network_error after doing comment 18 :O yay
Attachment #558690 - Attachment is obsolete: true
(Assignee)

Updated

6 years ago
Attachment #558696 - Attachment is patch: true
Attachment #558696 - Flags: feedback?(philipp)
(In reply to Marina Samuel [:marina] from comment #19)

> Found a tiny problem (& fixed) in
> test_login_syncAndReportErrors_prolonged_network_error after doing comment
> 18 :O yay

*makes zen monk face*

;)
(Reporter)

Updated

6 years ago
Attachment #558696 - Flags: feedback?(philipp) → review+
Pushed testfix follow-up: http://hg.mozilla.org/services/services-central/rev/607edee2e848
Yeah I totally botched up the rebase of that patch... Landed fix: http://hg.mozilla.org/services/services-central/rev/474cf6d733df
Tests were hanging on opt builds, it turned out to be a test fixture problem that was simply not showing up on debug builds (presumably because they're a bit slower so the test fixtures weren't racing the actual code):

http://hg.mozilla.org/services/services-central/rev/752ab20f2ff8
http://hg.mozilla.org/mozilla-central/rev/29bba7fb6ddc
http://hg.mozilla.org/mozilla-central/rev/607edee2e848
http://hg.mozilla.org/mozilla-central/rev/474cf6d733df
http://hg.mozilla.org/mozilla-central/rev/752ab20f2ff8
Status: ASSIGNED → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla9
(Reporter)

Updated

6 years ago
No longer blocks: 684798
Depends on: 684798
You need to log in before you can comment on or make changes to this bug.