Closed Bug 1110013 Opened 10 years ago Closed 9 years ago

Implement throttled sending of EOL headers for sync1.1 migration

Categories

(Cloud Services Graveyard :: Server: Sync, defect)

defect
Not set
normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: rfkelly, Assigned: rfkelly)

References

Details

Attachments

(2 files, 1 obsolete file)

We'll need a little bit of server-side logic to make for a smooth rollout of the sync1.1 migration EOL headers.  Details copied from an email thread:
----------------

> * On our request, Services will start sending EOL headers to a subset of 
> users (ie, throttled) who both (a) report as being 37+ and (b) only have 
> a single device linked to their account.

I propose to implement this in the server-side application code for sync1.1, using a combination of two mechanisms:

1)  On each incoming request, check the User-Agent header to see if it's 37+. If so then check the DB to see if they have a single client record.  If both of these checks are true, we can safely send EOL headers.

2)  On each incoming request, calculate `x = userid % 100` to get a number 0 <= x < 100.  Only send EOL headers if x is below some throttling threshold.

We'd start with both (1) and (2) in place and the throttling threshold set to, say, 10 out of 100.

This won't be pretty, but it seems to strike a nice balance of caution and pragmatism.
 
> * If this seems to be working, we consider sending EOL headers to *all* 
> users who meet this criteria (ie, same criteria but unthrottled).

We'd achieve this by stepping up the throttling threshold to 100.
 
> * At some point after 37 hits release, we move back to aggressive 
> throttling (ie, only for a fraction of users) but drop the criteria - we 
> offer it to users regardless of the version they are using and 
> regardless of the number of devices they have.

This corresponds to dropping check (1) and resetting the throttling threshold for (2) back to 10.  In theory, a simple config change on the server.

> * Based on telemetry and server-side stats, we slowly drop the throttling.
> 
> * Profit!

And we can switch the machines off.  And there will be much rejoicing.
Mark, I don't recall the precise format of the sync User-Agent header.  Can I reliably use it to detect newer clients with a simple version check?
Flags: needinfo?(mhammond)
(In reply to Ryan Kelly [:rfkelly] from comment #1)
> Mark, I don't recall the precise format of the sync User-Agent header.  Can
> I reliably use it to detect newer clients with a simple version check?

On my dev box the header sent looks like:

User-Agent: Firefox/37.0a1 FxSync/1.39.0.20141211102843.desktop

so yes, this sounds fine
Flags: needinfo?(mhammond)
QA Contact: kthiessen
N.B., 

  37.0a1 = Nightly
  37.0a2 = Aurora
  37.0b1 = Beta
  37.0, 37.0.2, ... = Release

".desktop" on the end identifies desktop uniquely, but Android will also report itself very differently, something like:

  Firefox-Android-FxAccounts/37.0a1 (Nightly)


So I recommend a regex like:

  Firefox/3[78]\..*\.desktop

which will match 37 and 38, covering Nightly until mid-Feb.

If you want to stick with Nightly only:

  Firefox/3[78]\.0a1.*\.desktop
Blocks: 1008066
Attached patch sync11-staged-eol-rollout.diff (obsolete) — Splinter Review
Asking r? from Tarek for old times' sake :-)

Here's a basic implementation of the functionality I described, providing some config knobs for the server that control when and what we send for the EOL notification header.

Configuration will be via the app .ini file, like this:

[storage]
...other existing config here...

eol_general_release = false    # <-- whether to gate on user-agent string
eol_rollout_percent = 42       # <-- percent of users that should see EOL
eol_header_code = soft-eol     # <-- customize the fields in the EOL header
eol_header_message = cya-l8r
eol_header_url = http://blah.com


I think this will achieve what we want, with one exception.  When we switch from eol_general_release=False to eol_general_release=True, we will drop the eol_rollout_percentage to a lower number, and some users that were previously seeing the notifications will stop receiving them.  This doesn't seem like too much of a problem to me, but worth pointing out explicitly.  How does it sound to you Mark?
Attachment #8538296 - Flags: review?(tarek)
Attachment #8538296 - Flags: feedback?(mhammond)
Attachment #8538296 - Flags: feedback?(bobm)
Whoops, one minor update to the patch, adding a test for the "restrict headers when there's more than one client record" case.
Attachment #8538296 - Attachment is obsolete: true
Attachment #8538296 - Flags: review?(tarek)
Attachment #8538296 - Flags: feedback?(mhammond)
Attachment #8538296 - Flags: feedback?(bobm)
Attachment #8538303 - Flags: review?(tarek)
Attachment #8538303 - Flags: feedback?(mhammond)
Attachment #8538303 - Flags: feedback?(bobm)
Comment on attachment 8538303 [details] [diff] [review]
sync11-staged-eol-rollout.diff

Review of attachment 8538303 [details] [diff] [review]:
-----------------------------------------------------------------

I (obviously ;) didn't look too closely at the patch, but LGTM!

> and some users that were previously
> seeing the notifications will stop receiving them.  This doesn't seem like too
> much of a problem to me, but worth pointing out explicitly.  How does it sound
> to you Mark?

That sounds fine in this specific scenario as we previously would only have offered to single-device users.  In the more general case I think it *would* be a problem as a user may have migrated some of their devices and not others.  IOW, this approach means that if we found we needed to drop the percentage after we've dropped the version/devices check some devices will end up "stranded" as their subsequent devices don't see the migration trigger.

However, in practice I guess this is unlikely to happen - if the poo hits the fan we'd probably drop it to zero, and I guess we'd be careful about increasing the percentage in small enough increments that we're confident the load isn't going to bite us.  But that's worth keeping in mind :)
Attachment #8538303 - Flags: feedback?(mhammond) → feedback+
> In the more general case I think it *would* be a problem as a user may have migrated some of their 
> devices and not others.  IOW, this approach means that if we found we needed to drop the percentage
> after we've dropped the version/devices check some devices will end up "stranded" as their subsequent
> devices don't see the migration trigger.

One option might be to guard against this server-side.  We could check for the presence of the migration sentinel and if there's one there, always show the EOL headers for that user.

It's an extra check to add on each request, so we probably don't want to do it unless we have to...but might be a nice option to have if it really does hit the fan.
(In reply to Ryan Kelly [:rfkelly] from comment #7)
> It's an extra check to add on each request, so we probably don't want to do
> it unless we have to...but might be a nice option to have if it really does
> hit the fan.

There's no "really good" reason the existence of the migration sentinel can't also trigger subsequent clients.  That's what drives android migration.

The only reason we do is that it gives us a lever to *choose* to strand those subsequent clients for a short time in a worst-case scenario.  So it's a balance between 2 risks; I'm happy for the client to do that if you think we should swing that way.
Comment on attachment 8538303 [details] [diff] [review]
sync11-staged-eol-rollout.diff

small improvement:

+        should_see_eol_headers = False
+        if userid % 100 < self.eol_rollout_percent:
+            should_see_eol_headers = True

should_see_eol_headers = userid % 100 < self.eol_rollout_percent
Attachment #8538303 - Flags: review?(tarek) → review+
> There's no "really good" reason the existence of the migration sentinel can't also trigger
> subsequent clients.

Apart from "we probably can't get that shipped in time" I guess :-)

So, I think I'd like to implement this option of checking for the sentinel server-side, behind a config option and off by default.  It will be an extra level of re-assurance if we have this in our pocket.

Mark, what's the collection/key I should be checking for to detect this, and is its mere presence in the db sufficient to trigger EOL headers for all clients on that account?
Flags: needinfo?(mhammond)
Committed in http://hg.mozilla.org/services/server-storage/rev/91d037fd360f but I'll probably add to it with the above check-for-the-sentinel thing, and also if :bobm has any feedback on the config options.
(In reply to Ryan Kelly [:rfkelly] from comment #4)
> [storage]
> ...other existing config here...
> 
> eol_general_release = false    # <-- whether to gate on user-agent string
This will default to false, right?  I can't tell for certain.

> eol_rollout_percent = 42       # <-- percent of users that should see EOL
The value format doesn't bother me, but it does conflict with selective backoff value where this would be .42
Attachment #8538303 - Flags: feedback?(bobm) → feedback+
(In reply to Ryan Kelly [:rfkelly] from comment #10)
> Mark, what's the collection/key I should be checking for to detect this

meta/fxa_credentials

> is its mere presence in the db sufficient to trigger EOL headers for all
> clients on that account?

Yep.  If it is any help, the account skippy.hammond+migrate-eol-soft@gmail.com on https://sync-eol-soft.dev.mozaws.net/ will have this.

Thanks!
Flags: needinfo?(mhammond)
> > eol_general_release = false    # <-- whether to gate on user-agent string
> This will default to false, right?  I can't tell for certain.

Yes.
Here's the extra logic to implement what I suggested in Comment 7.  This adds a high-water-mark setting for the eol_rollout_percent, which we can lock in if we ever need to drop the rollout percent in production.

For users who are numbered above the eol_rollout_percent (and hence will not see the EOL headers automatically) but below the eol_rollout_percent_hwm (and hence may have seen the EOL headers in the past) we do an extra check in the db to see if they've started a migration, and continue sending them the headers if so.

I think this is a nice compromise solution.  It defaults off unless we decide we need it, and if we do, we only have to do the extra DB check for a small percentage of users.

Configuration in the event of dropping the rollout_percent will look like:


[storage]
...other existing config here...

eol_general_release = true    # <-- should only need this during general release
eol_rollout_percent_hwm = 40  # <-- we went as high as 40%, but got overwhelmed
eol_rollout_percent = 30      # <-- so we reduced it to only 30% for a while
Attachment #8539968 - Flags: review?(tarek)
Attachment #8539968 - Flags: feedback?(mhammond)
Attachment #8539968 - Flags: review?(tarek) → review+
With that, I think we should be covered on the server side.  The next step is for Bob, Karl and I to get this rollout out to a sync1.1 node and test it out, likely sometime early next year.
Comment on attachment 8539968 [details] [diff] [review]
sync11-eol-rollout-percent-hwm.diff

Review of attachment 8539968 [details] [diff] [review]:
-----------------------------------------------------------------

A bit late, but f+ FTR...
Attachment #8539968 - Flags: feedback?(mhammond) → feedback+
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Testing has been underway this week, and will continue the week of 2015-01-24 -- I will mark this bug VERIFIED when everyone is happy with the results.
Headers have been seen in the wild, throttles are slowly moving upward, and people are gradually migrating.  

Well done, all.
Status: RESOLVED → VERIFIED
Product: Cloud Services → Cloud Services Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: