Closed Bug 1122443 Opened 9 years ago Closed 7 years ago

Sign up verification poll causes strain on servers

Categories

(Firefox :: Firefox Accounts, defect, P2)

defect

Tracking

()

RESOLVED FIXED
Firefox 56
Tracking Status
firefox56 --- fixed

People

(Reporter: stomlinson, Assigned: eoger)

References

()

Details

Attachments

(1 file)

Originally filed as https://github.com/mozilla/fxa-auth-server/issues/849 and https://github.com/mozilla/fxa-content-server/issues/68.

When a user signs up for FxA via Sync, the browser polls every few seconds indefinitely until the user confirms. This causes a heavy strain on the servers.

Instead of polling indefinitely at the same rate, a couple of strategies could be implemented:
* Exponential backoff after a short initial time has elapsed. Poll every 5 seconds for 5 minutes, then 10 seconds, then 20....
* If the user has not verified when the browser re-starts, perform an initial check. If the user still has not verified, then check every 10 minutes. Perhaps stop polling and message the user after a set time period has elapsed.
OS: Mac OS X → All
Hardware: x86 → All
See Also: → 1018022
Currently we don't poll indefinitely, but poll each 3 seconds for 5 minutes (but do start again on browser restart if necessary). In bug 1018022, we'd like to extend the period to > 5 minutes, but do understand the impact on both the client and server.

So each 3 seconds for 5 minutes implies 100 polls per client - it would be nice to know what is considered reasonable by the services team.

Drew and I just chatted and are thinking along the lines of max((2^n)+1, 60), where n is the number of minutes we've been polling for - so 3 seconds for the first minute, 5s for the second, 9s for the 3rd, and by the time we get to the 6th minute we are polling once per minute.  If we kept this up for 20 minutes, a rough calculation gives us ~57 polls - better than 100, but is that good enough?

We could make the max 120 (or even higher) but obviously most of the count occurs in the first few minutes - although that would allow us to increase the total poll time to an hour or few without increasing the total too much.

Shane, can you give us an idea of what would be "acceptable" from your POV?  Would a reduction from the current 100 to (say) 60 give you the head room we are after, or should we be targetting (say) 30 total?  However, keep in mind that the greater the poll period the worse the UX is for the account creation process, so making it poll too infrequently isn't really viable either.
Flags: needinfo?(stomlinson)
(In reply to Mark Hammond [:markh] from comment #2)
> max((2^n)+1, 60)

Doh - make that min()
Also, see bug 945449 - in theory we support backing off if the server so requests, which could also be used to help manage the server load.
Thanks, Mark. This is helpful. I added a review comment in Bug 1018022. 

Short polling is generally not a great solution, but it's the one we got. Longer term we need to figure that out (e.g., Push or long poll). And yes, we always have the server backoff mechanism, but we'd rather not make the exceptional case routine. :)
See Also: → 1124941
Thanks for all the info :mhammond, plenty to learn from! We realized we were needlessly polling on the content server for users that sign up for Sync (see [1] and [2]), even though the content server relies on the browser to update the UI when the verification is complete. If a user is signing up via Sync, we have eliminated the poll from the content server. This hopefully gives us more wiggle room.

> Shane, can you give us an idea of what would be "acceptable" from your POV?

I'm going to redirect this question to :ckarlof and Benson Wong.

------

[1] - https://github.com/mozilla/fxa-content-server/issues/2038 
[2] - https://github.com/mozilla/fxa-content-server/pull/2039
Flags: needinfo?(stomlinson)
Flags: needinfo?(ckarlof)
Flags: needinfo?(bwong)
> So each 3 seconds for 5 minutes implies 100 polls per client - it would be nice to know what is considered reasonable by the services team.

> Drew and I just chatted and are thinking along the lines of max((2^n)+1, 60), where n is the number of minutes we've been polling for - so 3 seconds for the first minute, 5s for the second, 9s for the 3rd, and by the time we get to the 6th minute we are polling once per minute.  If we kept this up for 20 minutes, a rough calculation gives us ~57 polls - better than 100, but is that good enough?

I usually like to solve problems with money instead of code, but it's getting a bit silly on the back end at this point. I'm looking for a simple solution that will drop the number of polling requests by a significant amount without seriously impacting UX.

If the exponential backoff approach you discuss is a simple enough solution that results in a ~50% reduction without serious UX impact, it sounds like a no-brainer to me and will be a good start.
Flags: needinfo?(ckarlof)
I talked to :ckolos about this yesterday to see where this issue/bug came from. We just put in some better logging and it seems the problem is more interesting than just too much polling. 

In summary of what we found: 

- the problem, spike in 502's, always happens around the same time 16:00 UTC. 
- the problem is usually about 3 hours
- a single server in the cluster generates a bulk of the 502s
- much of the 502's is caused by /v1/recovery_email/status, maybe because this is also most of the requests
  - the 502's might be just happening
- we're still investigating the root cause of the issue
Flags: needinfo?(bwong)
Shane, push seems to be working well these days, so if this is still a problem we could drop the poll interval to (say) once per minute or similar. Would this help and is it still worthwhile?
Flags: needinfo?(stomlinson)
(In reply to Mark Hammond [:markh] from comment #9)
> Shane, push seems to be working well these days, so if this is still a
> problem we could drop the poll interval to (say) once per minute or similar.
> Would this help and is it still worthwhile?


Is a push notification sent to the browser as soon as verification is complete? If so, lowering the polling frequency would definitely help reduce the load from future connecting clients. Polling is still the #1 auth server load.
Flags: needinfo?(stomlinson)
Sorry, I should have thought of this previously. Some time ago, we added a ?reason= param when hitting the /recovery_email/status endpoint we hit to check if the user is verified. While we still expect some ?reason=timer calls here while waiting for the user to actually verify, we would expect the vast majority of calls made when the user *is* verified to have ?reason=push. If we can't dig in to what the response was for these calls, an alternative might be to try and see if the *last* such call made for users is typically ?reason=push.

Shane, are you able to help us to get some stats on this so we can reduce the polling with confidence it isn't hurting our users?
Flags: needinfo?(stomlinson)
:markh - I'm trying to get a realistic number out of Kibana and DataDog right now, the stats are vastly different between the two, and we aren't reporting the necessary info to pull out of re:dash. I'm trying to figure out which numbers I should believe, if any.

DataDog reports between 30k and 80k calls to /recovery_email/status?reason=push every two hours, in a repeating 24 hour cycle. :vladikoff informs me he doesn't trust our DataDog auth-server sourced metrics, and the sampling function uses Math.random() which isn't totally random. These numbers seem high to me because we'd end up having more calls to /recovery_email/status?reason=push than we have logins per day, unless this endpoint is called on every push notification?

Kibana has a gotcha where the auth-servers used exclusively for polling do not send metrics to Kibana. :vladikoff showed me how I could query using a URL, but I don't trust the numbers I see. Over the past 2 days, Kibana reports 18M calls to /recovery_email/status?reason=timer and only 8k to /recovery_email/status?reason=push. These numbers may or may not be sampled at 10%, we are trying to find out. If 80k calls are made over two days to /recovery_email/status?reason=push, that seems too low.
Flags: needinfo?(stomlinson)
Priority: -- → P3
The telemetry from bug 1353645 is telling us that *at least* 94% of verifications are being noticed via push rather than polling. IMO, this means we can reduce the polling frequency fairly aggressively, then continue to monitor this probe - at which time I expect the rate to increase above 94% as the polling will have less chance of winning the race against the push notification.
Priority: P3 → --
Assignee: nobody → eoger
Priority: -- → P2
How about we start the polling after 1 min, every 1 min and up to 20 minutes?
From my 1:1 with Mark:

On sign-up, every minute for 5 minutes, then every 5 minutes up to 20 minutes.
On restart: every 5 min up to 20 minutes.

We should also think about disconnecting users after X days/weeks in the "please verify" state, we believe that they are the ones DDoSing the servers.
Shane, does comment 15 sound good to you?

(In reply to Edouard Oger [:eoger] from comment #15)
> We should also think about disconnecting users after X days/weeks in the
> "please verify" state, we believe that they are the ones DDoSing the servers.

Also, are you able to verify this is true and offer some numbers here? It seems bad to keep users in an unverified state forever - does that seem to be somewhat common?
Flags: needinfo?(stomlinson)
I took the opportunity to refactor that code.

http://searchfox.org/mozilla-central/source/services/fxaccounts/FxAccounts.jsm#1166-1187
I moved this into startPollEmailStatus (atm the the big "why" conditions are basically checking if why != "timer", so we don't end up doing that "initialization" code twice).
Comment on attachment 8888041 [details]
Bug 1122443 - Decrease FxA email status polling frequency.

https://reviewboard.mozilla.org/r/158908/#review164402

Thanks for cleaning some of this up, but I think we should also try and improve the tests a little. Maybe a hookable `this._startNewTimer` or something might help (eg, ensure that we start a new timer with the expected intervals, do not start a new timer when reason=="push", etc)?

::: services/fxaccounts/FxAccounts.jsm:1167
(Diff revision 1)
> -        log.debug("pollEmailStatus starting while existing timer is running");
> +      log.debug("startPollEmailStatus starting while existing timer is running");
> -        clearTimeout(this.currentTimer);
> +      clearTimeout(this.currentTimer);
> -        this.currentTimer = null;
> +      this.currentTimer = null;
> -      }
> +    }
>  
> -      // If we were already polling, stop and start again.  This could happen
> +    // If we were already polling, stop and start again.  This could happen

It looks like this comment should be a few lines up, above the block checking this.currentTimer (your patch didn't introduce this, but it makes sense to move it IMO)

::: services/fxaccounts/FxAccounts.jsm:1187
(Diff revision 1)
> +  },
>  
> -    // We return a promise for testing only. Other callers can ignore this,
> +  // We return a promise for testing only. Other callers can ignore this,
> -    // since verification polling continues in the background.
> +  // since verification polling continues in the background.
> -    return this.checkEmailStatus(sessionToken, { reason: why })
> -      .then((response) => {
> +  async pollEmailStatus(currentState, sessionToken, why) {
> +    log.debug("entering pollEmailStatus: " + why);

ISTM that this will now never be "timer" (ie, is most likely to continiously be "start" or "browser-start") - maybe we could capture that in, say, this.currentTimerReason, set in startPollEmailStatus? Or maybe mover this log into startPoll, and have the timer function both log and clear this.currentTime?

::: services/fxaccounts/FxAccounts.jsm:1199
(Diff revision 1)
> -              }
> +      }
> -              // Tell FxAccountsManager to clear its cache
> -              this.notifyObservers(ON_FXA_UPDATE_NOTIFICATION, ONVERIFIED_NOTIFICATION);
> -              // Record how we determined the account was verified
> -              Services.telemetry.scalarSet("services.sync.fxa_verification_method",
> -                                           why == "push" ? "push" : "poll");
> +    } catch (error) {
> +      if (error && error.code && error.code == 401) {
> +        let error = new Error("Verification status check failed");
> +        this._rejectWhenVerified(currentState, error);
> +        return null;

This function seems to always return null - can we just plain |return| in all cases to make it clear there is no return value?

::: services/fxaccounts/FxAccounts.jsm:1219
(Diff revision 1)
> -      return;
> +      return null;
>      }
> -    if (timeoutMs === undefined) {
> -      let currentMinute = Math.ceil(ageMs / 60000);
> -      timeoutMs = currentMinute <= 2 ? this.VERIFICATION_POLL_TIMEOUT_INITIAL
> -                                     : this.VERIFICATION_POLL_TIMEOUT_SUBSEQUENT;
> +    // Poll email status again after a short delay.
> +    if (nextPollMs === undefined) {
> +      let currentMinute = Math.ceil(pollDuration / 60000);
> +      nextPollMs = (why == "start" && currentMinute < 5) ? 60000 : 5 * 60000;

It appears that even in the old code, we started a timer if why == "push" and we weren't verified, which I don't think was the intent (I guess that should be generally impossible, so doesn't matter much in reality, but it still seems bad)
Attachment #8888041 - Flags: review?(markh)
(In reply to Mark Hammond [:markh] from comment #16)
> Shane, does comment 15 sound good to you?
> 
> (In reply to Edouard Oger [:eoger] from comment #15)
> > We should also think about disconnecting users after X days/weeks in the
> > "please verify" state, we believe that they are the ones DDoSing the servers.
> 
> Also, are you able to verify this is true and offer some numbers here? It
> seems bad to keep users in an unverified state forever - does that seem to
> be somewhat common?

I have asked :jrgm to help dig in here since he is in the best position to
find out how many folks poll forever. When he and I spoke on Thursday he
commented that polling is not as big of a problem as it used to be, though
he needed to do some log spelunking to get more info.

Adding jrgm as an ni, I'll also follow up with him today to find out whether 
he was able to find out more.
Flags: needinfo?(stomlinson) → needinfo?(jrgm)
Comments addressed and added some tests :)
Comment on attachment 8888041 [details]
Bug 1122443 - Decrease FxA email status polling frequency.

https://reviewboard.mozilla.org/r/158908/#review165986

awesome, thanks!
Attachment #8888041 - Flags: review?(markh) → review+
Pushed by eoger@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/30b65b87bd9a
Decrease FxA email status polling frequency. r=markh
https://hg.mozilla.org/mozilla-central/rev/30b65b87bd9a
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla56
I realized I wasn't sure what I was looking for in raw logs. Nonetheless,
here are the counts for the different query strings seen in a live sample of
1 million requests for `/v1/recovery_email/status` today:

  querystring              count     %
----------------------------------------
  reason=timer:	          897205  89.7%
  no query string:         78909   7.9%
  reason=start:            23666   2.4%
  reason=push:               219  <0.1%
  reason=browser-startup:      1  <0.1%

Let me know what other information I should be providing, if not the above.
Flags: needinfo?(jrgm) → needinfo?(stomlinson)
(In reply to John Morrison [:jrgm] from comment #26)
> I realized I wasn't sure what I was looking for in raw logs. Nonetheless,
> here are the counts for the different query strings seen in a live sample of
> 1 million requests for `/v1/recovery_email/status` today:
> 
>   querystring              count     %
> ----------------------------------------
>   reason=timer:	          897205  89.7%

That's interesting - it would be interesting to see what these figures start to look like for Firefox 56 now that this patch has landed - I'd expect to see this drop significantly - although it's probably way too early to analyze that because it's only in Nightly, and I expect that Nightly and Release users have significantly different patterns even without this change.

A broader question I'd really like the answer to is "how many *active* users never verify?" - or restated, "how many users poll forever?" - I'm not sure how that could be answered though.
> A broader question I'd really like the answer to is "how many *active* users never verify?" - or restated,
> "how many users poll forever?" - I'm not sure how that could be answered though.

We're adding some additional logging on the server side to try to answer this - basically, count the number of requests to /recovery_email/status for unverified accounts that were created more than X days ago.
(In reply to John Morrison [:jrgm] from comment #26)
> 
> Let me know what other information I should be providing, if not the above.

Thanks :jrgm, :markh does this give you the info you need?
Flags: needinfo?(stomlinson) → needinfo?(markh)
(In reply to Shane Tomlinson [:stomlinson] from comment #29)
> Thanks :jrgm, :markh does this give you the info you need?

Yeah, thanks.
Flags: needinfo?(markh)
Product: Core → Firefox
Target Milestone: mozilla56 → Firefox 56
You need to log in before you can comment on or make changes to this bug.