Closed Bug 1597868 Opened 5 years ago Closed 4 years ago

Avoid fetching the device list on every sync, to reduce FxA server load

Categories

(Firefox :: Sync, enhancement)

enhancement
Not set
normal

Tracking

()

RESOLVED FIXED
Firefox 73
Tracking Status
firefox73 --- fixed

People

(Reporter: rfkelly, Assigned: markh)

References

Details

Attachments

(1 file)

In a recent summary of FxA server traffic that :jrgm shared with me, calls to /v1/account/devices account for more than 80% of all requests made to the FxA servers. That's a lot!

It's not causing operational problems right now, but it is certainly consuming a lot of operational headroom that could be made available for other things (such as coping with sudden spikes in traffic). Let's figure out how to do better.

If I understand correctly, we currently fetch the device list on every sync, debounced to 1-minute granularity because there might be multiple bits of browser UI that want to show the device list at the same time. I don't recall exactly why we refresh it on every sync though.

Mark and Ed, I feel like you guys probably have the most context on this part of Desktop's behavior, could you comment more on why the current behaviour is the way it is? (If you have suggestions for how to refresh it less often that's also great, but I mostly want to ensure this bug has full context on the current behaviour before we dig into suggestions on how to change it.).

Flags: needinfo?(markh)
Flags: needinfo?(eoger)

(Also, just to be clear, I don't believe this is a new behaviour introduced by e.g. the sync decoupling work, server logs indicate that this traffic ratio has been that way for a few releases at least)

There's no good-good reason and TBH I've been kinda waiting for this bug to be filed :/ The less-good reason is that it has allowed us to avoid thorny questions about how fresh is fresh enough.

We fetch every sync primarily so we can use this list to help us mark devices as "stale" - eg, duplicate devices which are probably just the same device but recently reconnected. This is probably much less relevant now that we tend to build the "send tab" list directly from FxA itself - but it's probably still partially relevant for the "synced tabs" device list. Ed did most of this "stale" support but it was years ago, so it's probably as stale in his memory as it is mine :)

Lina recently improved the caching a little (so I cc'd her for her thoughts) - there's now a recentDeviceList getter - but the intersection of this getter and the cache functionality isn't great. Specifically, while * calling refreshDeviceList() imposes a 1 minute throttle, exclusively calling recentDeviceList never says "hey, this list is getting stale, I should update it". So currently, if we moved Sync to using recentDeviceList, a user that didn't send-tab might not update the list until the next browser start (as we don't cache the list to disk).

I don't think anything bad would happen if sync used the recent device list "most" of the time - not clear how far to take that though. We sync every 10 minutes, so just refreshing once an hour would possibly be a big enough win. It would probably be trivial, but not ideal, to have sync itself track this time. It would probably be better to see if we can largely deprecate the refresh function entirely such that in (almost) all cases the recent device list will be known to be fresh enough (ie, have the the fxa devices code decide when to refresh in most cases)

Flags: needinfo?(markh)

Awesome, thanks for a detailed response! :-)

IIUC we also expect that the majority of the time, we'll receive a push notification to inform us of changes to the device list, which might allow us to poll even less frequently.

Another option would be to cache the results server-side and serve an eTag as I believe the /devices endpoint does expensive db calls.

Flags: needinfo?(eoger)

Having slept on it, I personally think we should try to address this with some urgency. It's not a firedrill by any means. But I feel like the FxA server system has been stressed into unusual behaviour a fair bit recently, it's only going to become more loaded and more critical to the success of more products over time, and this seems like relatively low-hanging fruit to buy us some headroom there.

I'm adding some FxA folks for input though.

Another option would be to cache the results server-side and serve an eTag as I believe the /devices endpoint does expensive db calls.

Indeed! This would have the advantage of not having to ride the Firefox trains, but would still involve some amount of db activity on every sync (if only to check the sessionToken). It's something the FxA team can pursue in the short-term to mitigate client behaviour, but if we can do better on the client, I think we should try.

IIUC we also expect that the majority of the time, we'll receive a push notification to inform us of changes to the device list,
which might allow us to poll even less frequently.

It looks like we already have mechanisms in place to refresh the device list in response to a push notification about devices coming or going:

Do we have any sense of how reliable that mechanism is in practice?

So currently, if we moved Sync to using recentDeviceList, a user that didn't send-tab might not update the list until
the next browser start (as we don't cache the list to disk).

So I feel like this should be handled by the push messages most of the time. But I also feel like we may have some doubts about how that works out in practice, and hence we should poll as a fallback.

exclusively calling recentDeviceList never says "hey, this list is getting stale, I should update it". So currently, if we moved Sync to using
recentDeviceList, a user that didn't send-tab might not update the list until the next browser start (as we don't cache the list to disk).

Could the getter here check the freshness of the list and kick off a background promise to update it? Would anything need to listen for the completion of that promise and e.g. update UI state to match?

(In reply to Ryan Kelly [:rfkelly] from comment #5)

Another option would be to cache the results server-side and serve an eTag as I believe the /devices endpoint does expensive db calls.

Indeed! This would have the advantage of not having to ride the Firefox trains,

I believe it would - Firefox would need to remember the etag and send the if-none-match header, and we don't do that. We should consider doing this anyway in anticipation of the server supporting it.

It looks like we already have mechanisms in place to refresh the device list in response to a push notification about devices coming or going:

Do we have any sense of how reliable that mechanism is in practice?

No :( We do record a scalar "identity.fxaccounts.missed_commands_fetched" for this purpose though. The intent is that this should be very close to zero because push messages should typically deliver the comments. I believe no one has actually looked at it though (and there's a slight risk that it will be hit when sending a tab to a device that's not running?)

So currently, if we moved Sync to using recentDeviceList, a user that didn't send-tab might not update the list until
the next browser start (as we don't cache the list to disk).

So I feel like this should be handled by the push messages most of the time. But I also feel like we may have some doubts about how that works out in practice, and hence we should poll as a fallback.

sgtm. However, I think it's probably fine and less risky to always refresh as the user selects "send tab" - that's really the only place the list being wrong will upset them and the load caused by that should be small. Even "synced tabs" can handle a reasonably stale list and requires a sync record anyway.

So forced refresh on send-tab and a very low frequency poll sounds fine to me, particularly if we add a bit more telemetry - we should find that the forced refresh and timed polling almost never gets an updated list, so we can then make things even more aggressive.

exclusively calling recentDeviceList never says "hey, this list is getting stale, I should update it". So currently, if we moved Sync to using
recentDeviceList, a user that didn't send-tab might not update the list until the next browser start (as we don't cache the list to disk).

Could the getter here check the freshness of the list and kick off a background promise to update it? Would anything need to listen for the completion of that promise and e.g. update UI state to match?

Assuming we keep the existing force-refresh for send-tab, I can't think of anything else. Even synced tabs is a bit of a special case because it requires the sync client record and refreshing fxa devices isn't going to make that appear. I think a worst-case scenario there is stale/duplicate devices being shown, but that's just an annoyance rather than a major issue.

I'd be inclined to consider doing this in 2 parts - the "easy" option without telemetry and a more considered approach that includes telemetry, and both could possibly end up on different trains (eg, we could consider uplifting the easy option)

To be more concrete, the simplest and most impactful fix might simply be a trivial Date.now() - lastRefreshed > AN_HOUR_OR_2 style check at https://searchfox.org/mozilla-central/source/services/sync/modules/engines/clients.js#399

(In reply to Mark Hammond [:markh] from comment #6)

No :( We do record a scalar "identity.fxaccounts.missed_commands_fetched" for this purpose though. The intent is that this should be very close to zero because push messages should typically deliver the comments. I believe no one has actually looked at it though (and there's a slight risk that it will be hit when sending a tab to a device that's not running?)

It looks pretty reliable, if this query is right! 😅 Only a handful of users missed any at all.

To be more concrete, the simplest and most impactful fix might simply be a trivial Date.now() - lastRefreshed > AN_HOUR_OR_2 style check at

+1 FWIW :-)

Assignee: nobody → markh
Pushed by mhammond@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/519f481d939a
refresh the fxa device list each 2 hours instead of each sync. r=eoger
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 73

ni? Ryan to see if he can determine the impact on the fxa server before we request uplift.

Flags: needinfo?(rfkelly)

Request volume is down a little compared to when 72 was in nightly, but not by as much as I would have hoped. I think we're still in the initial ramp-up phase though, and need to wait about a week to get a good idea of what the change will be.

For my future reference this is the bigquery query I was using to count request volume by firefox version:

SELECT
  EXTRACT(DAYOFYEAR FROM timestamp) AS day,
  COUNT(CASE WHEN jsonPayload.fields.agent LIKE "%Firefox/70%" THEN TRUE ELSE NULL END) as fx70,
  COUNT(CASE WHEN jsonPayload.fields.agent LIKE "%Firefox/71%" THEN TRUE ELSE NULL END) as fx71,
  COUNT(CASE WHEN jsonPayload.fields.agent LIKE "%Firefox/72%" THEN TRUE ELSE NULL END) as fx72,
  COUNT(CASE WHEN jsonPayload.fields.agent LIKE "%Firefox/73%" THEN TRUE ELSE NULL END) as fx73
FROM `moz-fx-fxa-prod-0712.fxa_prod_logs.docker_fxa_auth_20191*`
WHERE
  jsonPayload.fields.agent LIKE '%Firefox/7%'
GROUP BY 1
ORDER BY 1

Leaving ni? myself to revisit this after some more time has passed.

Disappointingly, this doesn't seem to have made much of a difference in practice. The volume of requests from 73 while it's in nightly looks a little smaller than from previous versions of Firefox while they were in nightly, but not by much. We could try to uplift for completeness, but I have a hard time telling a good story about the impact of doing so. I lean towards just letting this ride the trains.

Perhaps sync is not the main driver of these requests to /account/devices after all?

Flags: needinfo?(rfkelly)
See Also: → 1604699
Flags: qe-verify+

Clearing the QE+ flag.
Upon reviewing the discussion it appears that most(if not all) of the changes have the impact server side.
The automated tests cover well what's needed for verification, as confirmed with Mark(thanks again for the input!).

Flags: qe-verify+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: