Closed Bug 1014411 Opened 10 years ago Closed 9 years ago

Create infrastructure for developing/testing sync migration

Categories

(Firefox :: Sync, defect)

RESOLVED FIXED

People

(Reporter: markh, Assigned: rfkelly)

References

Details

(Whiteboard: [qa+])

Attachments

(3 files, 1 obsolete file)

I speculate that we will need some kind of infrastructure for the development and testing of the sync migration process.  Specifically, some way to simulate the various "end of life" notifications we come up with.

Vague for now - hopefully will become clearer as bug 1014406 and deps are nailed down.
Flags: firefox-backlog?
Depends on: 908467
For client-side testing purposes, I will set up some machines in EC2 that can send the various headers.  This will allow us to test without disrupting the prod sync environment.

Concurrently, :bobm will consider the details of rolling out both soft-EOL and hard-EOL messages in production.
Assignee: nobody → rfkelly
Blocks: 908467
No longer depends on: 908467
I have a preliminary machine stood up at http://sync-eol.dev.mozaws.net/

It doesn't yet serve the EOL headers; I will work on that tomorrow, most likely via simple nginx rules like the ones we might use in prod.

Bob, do we have a wildcard SSL I can put on this box to make it that little bit more prod-like?
Flags: needinfo?(bobm)
Mark, can I easily configure firefox to sync to this new server, or do I have to do the whole "install FF28, set up sync, upgrade to nightly" dance to try it out?
Flags: needinfo?(mhammond)
(In reply to Ryan Kelly [:rfkelly] from comment #3)
> Mark, can I easily configure firefox to sync to this new server, or do I
> have to do the whole "install FF28, set up sync, upgrade to nightly" dance
> to try it out?

I'm afraid that's what I did - I told it not to update (but I think that's a property of the profile, not the install itself?) so I can start from scratch at any time, and just swap that profile between 28 and nightly as appropriate.
Flags: needinfo?(mhammond)
Whiteboard: [qa+]
:bobm sorted me out with some ssl in IRC, clearing the flag
Flags: needinfo?(bobm)
I've set up a single server instance with three public URLs:

  https://sync-eol.stage.mozaws.net:   normal sync server, no extra headers
  https://sync-soft-eol.stage.mozaws.net:   normal sync server with soft-eol X-Weave-Alert headers
  https://sync-hard-eol.stage.mozaws.net:   everything gets a "513" with X-Weave-Alert

Here's what I did to try it out, other folks may be able to suggest a slightly simpler procedure:

  * Downgrade to FF28 to get the old-sync setup UI back
  * "Set Up Sync" using https://sync-eol.stage.mozaws.net as the server url
  * Use about:config to change "services.sync.clusterURL" to "https://sync-soft-eol.stage.mozaws.net"
  * Restart firefox to make this change stick
  * "Sync Now" and watch for errors etc

I can tweak the JSON in the X-Weave-Alert header as appropriate based on initial testing here.  The soft-eol seems to trigger nicely for me, but the hard-eol just shows up as a generic "server encountered an error" response.
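For anyone scripting checks against these hosts, here is a minimal Python sketch of decoding an X-Weave-Alert value. It assumes the JSON payload carries "code", "message" and "url" keys; those field names are an assumption and are not spelled out in this bug. The function name parse_weave_alert is hypothetical, and non-JSON values are simply treated as plain text.

```python
import json

# EOL codes mentioned in this bug.
EOL_CODES = {"soft-eol", "hard-eol"}


def parse_weave_alert(value):
    """Decode an X-Weave-Alert header value into a small dict.

    Assumption: structured alerts are JSON objects with "code",
    "message" and "url" keys. Anything else is treated as plain text.
    """
    text = value.strip()
    if text.startswith("{"):
        try:
            data = json.loads(text)
        except ValueError:
            data = None  # malformed JSON falls through to plain text
        if isinstance(data, dict):
            code = data.get("code")
            return {
                "code": code,
                "message": data.get("message"),
                "url": data.get("url"),
                "is_eol": isinstance(code, str) and code in EOL_CODES,
            }
    return {"code": None, "message": text, "url": None, "is_eol": False}
```

Feeding it the header value captured from a response should be enough to tell whether a given host is sending soft-eol, hard-eol, or nothing structured at all.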
Attached patch server-full.diff
For completeness, here is the patch I'm using on server-full to produce the hard-eol error messages.
Attached file nginx.conf
And here's the nginx config that works the hostname magic.
One thing to note with the nginx "add_header" approach - it will only add the header on success responses, i.e. 2XX and 3XX but not on e.g. 401 or 404.  I think this should be OK but worth pointing out.  If that's not sufficient, we'll have to find something else for prod deployment.
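To make the limitation concrete, here is a small Python model of which responses would carry the header. The status list is taken from the current nginx documentation for add_header (older nginx versions cover slightly fewer codes), and the "always" flag mirrors the parameter that nginx 1.7.5 and later accept. The function name is hypothetical and this only models the rule; it is not how nginx is configured.

```python
# Status codes for which nginx's add_header emits the header when the
# "always" parameter is not given (per the nginx headers module docs).
ADD_HEADER_STATUSES = frozenset(
    {200, 201, 204, 206, 301, 302, 303, 304, 307, 308}
)


def header_would_be_added(status, always=False):
    """Return True if add_header would attach the header to this status."""
    return always or status in ADD_HEADER_STATUSES
```

Under this rule a 401, 404 or 513 response gets no header from a plain add_header directive, which matches the caveat above.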
I get 401s trying to hit any of these.
Followed your steps above, but did not get past Set Up Sync with the Yellow bar and errors on login and/or password.
Sync-log shows this:
GET fail 401 https://sync-eol.stage.mozaws.net/1.1/BLAH/info/collections
1401387139193	Sync.Service	WARN	401: login failed.
Although, hmmmmm hitting it directly gets me the appropriate values:
0 for https://sync-eol.stage.mozaws.net/user/1.0/a
and a 404 for curl https://sync-eol.stage.mozaws.net/

Here is what I see in about:config (ignoring the yellow bar that keeps coming up)
services.sync.clusterURL     https://sync-soft-eol.stage.mozaws.net/
services.sync.serverURL     https://sync-eol.stage.mozaws.net/

Maybe this is correct behavior for the above configs?
1401387969455	Sync.Resource	DEBUG	mesg: GET fail 401 https://sync-soft-eol.stage.mozaws.net/1.1/BLAH/storage/meta/global
1401387969455	Sync.Resource	DEBUG	GET fail 401 https://sync-soft-eol.stage.mozaws.net/1.1/BLAH/storage/meta/global
1401387969455	Sync.Service	DEBUG	Weave Version: 1.30.0 Local Storage: 5 Remote Storage: 
1401387969455	Sync.Service	INFO	One of: no meta, no meta storageVersion, or no meta syncID. Fresh start needed.
1401387969455	Sync.Status	DEBUG	Status.sync: success.sync => error.sync.reason.metarecord_download_fail
1401387969455	Sync.Status	DEBUG	Status.service: success.status_ok => error.sync.failed
FYI - this errors
* Use about:config to change "services.sync.clusterURL" to "https://sync-soft-eol.stage.mozaws.net"
unless you put the trailing slash on it:
* Use about:config to change "services.sync.clusterURL" to "https://sync-soft-eol.stage.mozaws.net/"
(similar for https://sync-hard-eol.stage.mozaws.net/)

OK, this works:
services.sync.serverURL     https://sync-eol.stage.mozaws.net/
services.sync.clusterURL     https://sync-hard-eol.stage.mozaws.net/

I see
1401391221303	Sync.Resource	DEBUG	mesg: GET fail 513 https://sync-hard-eol.stage.mozaws.net/1.1/BLAH/info/collections
1401391221303	Sync.Resource	DEBUG	GET fail 513 https://sync-hard-eol.stage.mozaws.net/1.1/BLAH/info/collections
1401391221303	Sync.Status	DEBUG	Status.login: success.login => error.login.reason.server
1401391221303	Sync.Status	DEBUG	Status.service: success.status_ok => error.login.failed
1401391221303	Sync.ErrorHandler	ERROR	X-Weave-Alert: hard-eol: SYNC HAS SUNK
(ha ha nice one)

Going back to this:
services.sync.serverURL     https://sync-eol.stage.mozaws.net/
services.sync.clusterURL     https://sync-soft-eol.stage.mozaws.net/

I see an initial successful sync
1401391442868	Sync.Collection	DEBUG	mesg: GET success 200 https://sync-soft-eol.stage.mozaws.net/1.1/BLAH/storage/tabs?full=1
1401391442868	Sync.Collection	DEBUG	GET success 200 https://sync-soft-eol.stage.mozaws.net/1.1/BLAH/storage/tabs?full=1

and all good syncs after that...
Not sure why all 401s before and good results now.
No specific X-Weave-Alerts
(In reply to James Bonacci [:jbonacci] from comment #12)
> and all good syncs after that...
> Not sure why all 401s before and good results now.
> No specific X-Weave-Alerts

Note that the client only writes the log about X-Weave-Alert when the previous alert was some time previously - ie, it only shows it when it would display the infobar, not every time it is received.  There is a pref (the name of which escapes me now, but can be found easily enough under services.sync) that records the most recent EOL type seen - removing this pref will cause the message to be seen again on the next request.
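The behaviour described above can be sketched roughly as follows. This is a simplified Python model and not the client's actual code: the AlertTracker class, the one-day default interval, and the exact comparison are all assumptions made for illustration. The last_mode attribute stands in for the pref that records the most recent EOL type.

```python
import time


class AlertTracker:
    """Toy model of 'only surface the alert when the infobar would show'."""

    def __init__(self, interval_seconds=86400.0, clock=time.time):
        self.last_mode = ""        # stands in for the recorded EOL type pref
        self.last_shown = None     # time the alert was last surfaced
        self.interval = interval_seconds  # assumed value, not from the client
        self.clock = clock

    def should_show(self, mode):
        """Return True if an alert of this mode would be shown and logged."""
        now = self.clock()
        changed = mode != self.last_mode
        stale = self.last_shown is None or now - self.last_shown >= self.interval
        if changed or stale:
            self.last_mode = mode
            self.last_shown = now
            return True
        return False

    def reset(self):
        """Equivalent of removing the pref: the next alert shows again."""
        self.last_mode = ""
        self.last_shown = None
```

In this model, repeated soft-eol responses after the first one stay silent until the interval passes or the state is reset, which would explain seeing good syncs with no X-Weave-Alert lines in the log.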
>  * Downgrade to FF28 to get the old-sync setup UI back

According to the intrepid folks over at ownCloud [1], you can get the old sync setup UI back by creating a dummy "serivces.sync.username" property in about:config.  This could make testing quite a bit simpler.


[1]  https://github.com/owncloud/mozilla_sync/issues/33
(In reply to Ryan Kelly [:rfkelly] from comment #9)
> One thing to note with the nginx "add_header" approach - it will only add
> the header on success responses, i.e. 2XX and 3XX but not on e.g. 401 or
> 404.  I think this should be OK but worth pointing out.  If that's not
> sufficient, we'll have to find something else for prod deployment.

There are a couple of Nginx modules that would work: http://wiki.nginx.org/HttpHeadersMoreModule

However using them requires a custom Nginx build, and a one-off RPM for this service.  Not necessarily a big deal.  

What is the expected client behavior with an EOL header set?  If we stand up an EOLinator service that sends 200s with the Hard EOL responses to all traffic, what percentage of older clients will ignore the EOL header and blissfully keep syncing?
(In reply to Bob Micheletto [:bobm] from comment #15)

> What is the expected client behavior with an EOL header set?

Until last week, our expectations were different than reality :(  See bug 1017443

> If we stand up
> an EOLinator service that sends 200s with the Hard EOL responses to all
> traffic, what percentage of older clients will ignore the EOL header and
> blissfully keep syncing?

I believe all versions will continue to sync in that case, but the message displayed will just be "more error-like".

Given bug 1017443, the best option for hard-eol would be to return 200s with real (but read-only) data for meta/global and info/collections, but failing responses with the hard-eol notification for all storage requests.  I'm not sure if the fact the first 2 can be read-only helps at all.

Sadly this is somewhat theoretical, so if doing this sounds appealing we should verify it actually works from the client's POV.
I've set up a third option to experiment with this:

   https://sync-hardish-eol.stage.mozaws.net

This is a simple memcached-backed server that will accept writes to /meta/global and /crypto/keys, but rejects other API accesses with the 513 Hard EOL error.  This seems to allow the client to get through the "panic, wipe the server and sync from scratch" stage without error, and causes it to show the appropriate Hard-EOL message from the subsequent failed sync attempt.
Attached file sync11-eol-server.tar.gz (obsolete)
Uploading my little "sync11eol" server for completeness.  Bob, what do you think of this approach?  If the client behaviour is client enough then we could probably run this server as a kind of "EOLinator" pretty cheaply - just memcached and a few webheads.
Attachment #8432351 - Flags: feedback?(bobm)
(In reply to Ryan Kelly [:rfkelly] from comment #18)

> Uploading my little "sync11eol" server for completeness.  Bob, what do you
> think of this approach?  If the client behaviour is client enough then we
> could probably run this server as a kind of "EOLinator" pretty cheaply -
> just memcached and a few webheads.

Seems like a good approach.  Would it be okay to establish a relatively short TTL on those records?
Attachment #8432351 - Flags: feedback?(bobm) → feedback+
> Would it be okay to establish a relatively short TTL on those records?

Yes, we could expire them super-aggressively because they only have to survive the duration of a single sync.  I hard-coded it to 60s in the current version, which seems reasonable to me.
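For readers without the attachment handy, the routing and expiry idea can be sketched in a few lines of Python. This is an illustration only and does not reproduce the attached server: it uses an in-memory dict instead of memcached, the EolStore class is hypothetical, the path suffixes are assumed from the URL layout seen in the sync logs, and the return values are simplified to a (status, body) pair.

```python
import time

# Only these records are accepted; suffixes assumed from the 1.1 URL layout.
WRITABLE_SUFFIXES = ("/storage/meta/global", "/storage/crypto/keys")
TTL_SECONDS = 60          # records only need to survive a single sync
HARD_EOL_STATUS = 513     # everything else is rejected with this status


class EolStore:
    """In-memory stand-in for the memcached-backed EOL server."""

    def __init__(self, clock=time.time):
        self._items = {}      # path -> (expiry time, body)
        self._clock = clock

    def handle(self, method, path, body=None):
        """Return a (status, body) pair for a request."""
        if not path.endswith(WRITABLE_SUFFIXES):
            return HARD_EOL_STATUS, None
        now = self._clock()
        if method == "PUT":
            self._items[path] = (now + TTL_SECONDS, body)
            return 200, None
        if method == "GET":
            entry = self._items.get(path)
            if entry is None or entry[0] <= now:
                self._items.pop(path, None)   # drop expired records
                return 404, None
            return 200, entry[1]
        return HARD_EOL_STATUS, None
```

The short TTL keeps the memory footprint tiny, since nothing stored here needs to outlive the sync attempt that wrote it.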
(In reply to Ryan Kelly [:rfkelly] from comment #6)
> Here's what I did to try it out, other folks may be able to suggest a
> slightly simpler procedure:
> 
>   * Downgrade to FF28 to get the old-sync setup UI back

See bug 1020112 - this process no longer seems to work if the profile has already been used with Nightly - when Nightly is restarted with that profile it has lost the Sync credentials created in 28.
> you can get the old sync setup UI back by creating a dummy "serivces.sync.username" property

*cough*  "services.sync.username", thanks Mark ;-)
Uploading a slightly cleaned-up version of my sync11eol-inator.  Toby, please give this a sanity-check, and we'll keep it in our pocket for the eventual hard-eol.
Attachment #8432351 - Attachment is obsolete: true
Attachment #8433911 - Flags: review?(telliott)
Flags: needinfo?(telliott)
Flags: firefox-backlog? → firefox-backlog-
The server is fine, but I don't think it helps.

The vast, vast majority of people who will be running 1.1 at this point aren't going to be on a server that can do anything with this information. I think we're best off sending the informational header (to show up in the status bar) while still fulfilling the 1.1 requests for as long as possible, then just shutting it off.
Flags: needinfo?(telliott)
Attachment #8433911 - Flags: review?(telliott) → review+
We should test how the client copes with a "hard-eol" message during an otherwise successful sync.  Although I guess we could simply not bother with the "hard-eol", and just transition from "the service is going to shut down soon" straight to "it's dead jim"
(In reply to Ryan Kelly [:rfkelly] from comment #25)
> We should test how the client copes with a "hard-eol" message during an
> otherwise successful sync.

hard-eol is only sent with error responses, and should end the sync immediately. Send it for every response; there's no such thing as an otherwise-successful sync in this context.

> Although I guess we could simply not bother with
> the "hard-eol", and just transition from "the service is going to shut down
> soon" straight to "it's dead jim"

At best that'll leave clients constantly getting "Unknown error", maybe after a few weeks of being unable to sync.
I have stood up a new, tweaked instance of the server infrastructure described in Comment 6.  There are now five different options for your testing convenience:


    https://sync-eol-fine.dev.mozaws.net:   a normal, full-functioning sync1.1 server
    https://sync-eol-soft.dev.mozaws.net:   sends soft-eol header but otherwise syncs normally
    https://sync-eol-hard.dev.mozaws.net:   sends hard-eol header but otherwise syncs normally
    https://sync-eol-done.dev.mozaws.net:   sends hard eol header and doesn't store data, per Comment 17
    https://sync-eol-gone.dev.mozaws.net:   not actually a server; simulates switching the service off


Here's how to test behaviour with one of these servers:

  * Use a new profile, or disconnect from sync before starting
  * Re-enable old sync:
        * Use about:config to create a new pref called "services.sync.username" with value "dummy"
        * Restart firefox so this will take effect
  * Select "Set Up Sync", "Create a New Account", and use https://sync-eol-fine.dev.mozaws.net as custom server URL
  * This should sync fine; allow it to do so
  * Switch to one of the testing servers:
        * Use about:config to change "services.sync.clusterURL" to one of the above testing URLs
        * Use about:config to reset "services.sync.errorhandler.alert.mode" to an empty string
        * Restart firefox to make these changes stick
  * Select "Sync Now" and watch for error bars, failures, explosions, etc
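The "switch to one of the testing servers" step can be summarised as a small Python helper that produces the two pref values to enter in about:config. The TEST_SERVERS mapping and prefs_for name are hypothetical conveniences; the pref names and URLs are the ones listed above. The helper appends a trailing slash to clusterURL, since an earlier comment in this bug found that the setting errors without one.

```python
# Test hosts from the list above, keyed by a short stage name.
TEST_SERVERS = {
    "fine": "https://sync-eol-fine.dev.mozaws.net",
    "soft": "https://sync-eol-soft.dev.mozaws.net",
    "hard": "https://sync-eol-hard.dev.mozaws.net",
    "done": "https://sync-eol-done.dev.mozaws.net",
    "gone": "https://sync-eol-gone.dev.mozaws.net",
}


def prefs_for(stage):
    """Return the about:config values to set for the given test stage."""
    url = TEST_SERVERS[stage]
    if not url.endswith("/"):
        url += "/"  # clusterURL needs the trailing slash
    return {
        "services.sync.clusterURL": url,
        # Reset so the client shows the EOL message again.
        "services.sync.errorhandler.alert.mode": "",
    }
```

For example, prefs_for("soft") gives the values to enter before restarting and choosing "Sync Now" against the soft-eol server.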


With current nightly I see the following behaviours:

  * sync-eol-fine:  no error bars or strange behaviour, obviously
  * sync-eol-soft:  grey error bar saying "Your Firefox Sync service is shutting down soon. Upgrade Nightly to keep syncing. [Learn More]"
  * sync-eol-hard:  blue error bar saying "Your Firefox Sync service is no longer available. You need to upgrade Nightly to keep syncing. [Learn More]"
  * sync-eol-done:  blue error bar as above, plus a grey "unknown error" bar stacked on top of it
  * sync-eol-gone:  blue error bar saying "Sync encountered an error while syncing: Failed to connect to the server."

In all cases the [Learn More] button opens a tab to a server-provided URL, which I assume we will point at a SUMO page or similar once this goes live.


Based on previous discussions, the above five are the stages we could potentially move through, in order.  We may skip one or more of them if they don't seem to provide much value to the user experience.  Per Toby's Comment 24, at least one of eol-hard and eol-done is likely superfluous.


Open questions from my end:

Mark, do you want the server to send some sort of timestamp field indicating *when* the service will be shutting down?  We have discussed this in the past but IIRC no firm conclusion has been reached.

Nick, is this sufficient/helpful for testing flows on Android?  Anything else I can provide at this stage?
Flags: needinfo?(nalexander)
Flags: needinfo?(mhammond)
According to IRL discussion, we're not going ahead with the server-sent-timestamps thing; clearing needinfo?
Flags: needinfo?(mhammond)
> Nick, is this sufficient/helpful for testing flows on Android?  Anything
> else I can provide at this stage?

At this stage, no.  I will be rehabilitating some old code of rnewman's and will test against this in the next week or so.  Thanks!
Flags: needinfo?(nalexander)
As a heads-up, I believe that the clusterURL setting must end in a trailing slash for things to work correctly.  So using e.g. "https://sync-eol-soft.dev.mozaws.net" may give you an error about being unable to connect to the server, while using "https://sync-eol-soft.dev.mozaws.net/" should work fine.
Calling this done
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
See Also: → 1207867
Component: Firefox Sync: Backend → Sync
Product: Cloud Services → Firefox