Closed Bug 1416320 Opened 3 years ago Closed 3 years ago

Do a quick sync before going to sleep

Categories

(Firefox :: Sync, enhancement, P1)

enhancement

Tracking

()

RESOLVED FIXED
Firefox 59
Tracking Status
firefox59 --- fixed

People

(Reporter: eoger, Assigned: eoger)

References

Details

Attachments

(1 file, 1 obsolete file)

We should do a quick sync of the tabs, and maybe the history collection before a computer is going to sleep.
This will help with the case of a user closing their laptop lid, then switching to their phone and not finding their laptop's tabs.
Assignee: nobody → eoger
Priority: -- → P1
What testing of this have you done?
In IRC you mentioned the cutoff was 2s on windows.

When I run this script [0] (which looks at the data about:telemetry uses), it tells me that 25% of my syncs would be too slow to make the cutoff here (75% would make it, though). The direct equivalent of that query is harder to write in redash, but looking at the median is easy enough and says on average it seems to be over 50% that take over 2s. [1] (Those are the durations for the median sync which syncs at least one outgoing tab).

So, I think that in practice this might fail on most users computers given how long sync takes to currently run. Not saying it's not worth doing despite that, it's just worth keeping in mind. It might be worth considering how to mitigate how much this will skew the telemetry error numbers if we land it (without bug 1304898 it's hard to filter -- maybe we could do just the minimal amount of work to record "why" the sync happened in this case?).

Perhaps we don't care that much, and admittedly only syncing the tabs engine will help (possibly a lot? Not clear how much), but just I think it would be a shame to make our metrics much more confusing for a feature that is still unreliable.

[0]: https://gist.github.com/thomcc/700f17ba4431a965e25d5463c3383619
[1]: https://sql.telemetry.mozilla.org/queries/48953/source#131878
> What testing of this have you done?

I have manually tested that patch with my 2 laptops, both on Windows and macOS. So far it's been working pretty reliably, but we're talking about fast computers with a fast internet connection... not very representative.

> When I run this script [0] ...

From reading the script, I see that we are taking in account the syncs that _contain_ the "tabs" col, not the ones that _do only_ the "tabs" col. I feel like it is a bit hard to draw conclusion from a full sync. Keep also in mind that a sync before sleep may be shorter than your average sync, compared to a "first sync after opening Firefox".

I wonder if we could A/B test that patch with x% of users, and see if our failure numbers go up substantially.
Thanks for that script Thom - my own Syncs seem to show better behavior - but I thought I'd fire up spark to see what it tells me.

I made a script [1] that (I hope :) looks at a sample of pings, and looks for sync that only did clients and tabs, neither engine failed, neither engine processed incoming, and clients had no outgoing while tabs did. I *think* this constitutes the "best case" and is roughly what kind of sync happens when "synced tabs" is opened.

The histogram of time, in seconds, for these Syncs are:

 	time to sync tabs
count 	34523.000000
mean 	35.084392
std 	3174.552835
min 	0.116000
25% 	1.234000
50% 	1.719000
75% 	2.510000
max 	422144.649000

meaning that somewhere between 50% and 75% of syncs were done in our 2s budget. It might even be better than that - note the max, mean and stddev figures - there are lots of syncs there which have insane times. If we somehow decided to look only at "sane" times, I expect we'd see even better results.

It still seems questionable to me, although I guess I'm leaning towards the fact that it may well add enough value to enough users to make it worthwhile - but I'm still wavering.

Thom - do you mind checking that gist is measuring what I think it is measuring, and whether that thing is a sane thing?

> and see if our failure numbers go up substantially.

TBH, I'm not bothered if the failure numbers to go up by (say) ~1/3 of the new syncs this causes to happen. What I am a little worried about is failure rates (or failures to start syncing) indirectly caused by this. For example, we now schedule syncs 5 seconds after waking. There seems a risk that, say, the new sync started just before going to sleep may still be in the process of failing 5 seconds after waking (eg, network strangeness), meaning that sync after wake doesn't actually start due to failure to acquire the lock, etc. I'm really not sure how to measure that (other than making a "reason" for each sync be meaningful, as Thom mentioned in some other bug recently), and looking at analysis specific to each "reason" value.

[1] https://gist.github.com/mhammond/bc8bd3a7fbadd2c7eedbaa9a4f0666b6
> Thom - do you mind checking that gist is measuring what I think it is measuring, and whether that thing is a sane thing?

It seems right to me. 

> It might even be better than that - note the max, mean and stddev figures - there are lots of syncs there which have insane times. If we somehow decided to look only at "sane" times, I expect we'd see even better results.

This is why the percentile times are more valuable, the 50% time is (AIUI) the median, which should be more representative of a 'typical' sync time, due to the average being skewed by outliers.

> TBH, I'm not bothered if the failure numbers to go up by (say) ~1/3 of the new syncs this causes to happen. What I am a little worried about is failure rates (or failures to start syncing) indirectly caused by this

My concern is mostly that it will call failure numbers to go up after this patch, which will make analysis like "this failure has an increase in version 59" very hard to do -- we'd expect many (but not necessarily all) failures to spike.

> other than making a "reason" for each sync be meaningful, as Thom mentioned in some other bug recently

Bug 1304898 (I mentioned it in this one). I don't think this should be that hard to do for cases like this. Cases where we cause the next sync by scheduling it to occur soon are hard though. I think we should consider doing that bug anyway though, now that we actually have data, since it would be interesting to know e.g. how many syncs are caused by resyncs, or how many syncs that are triggered after wake fail, etc.
Some more thoughts before I forget:

- We should try to sync only the tabs collection, without the mandatory clients one.
- We could also skip querying info/collections.
- We could even skip the XUIS for tabs since we only upload our record (might be dangerous).
Attachment #8929168 - Attachment is obsolete: true
Comment on attachment 8927417 [details]
Bug 1416320 - Do a quick sync before going to sleep.

https://reviewboard.mozilla.org/r/198744/#review206650

::: services/sync/modules/policies.js:522
(Diff revision 3)
>    },
>  
>    /**
>     * Set a timer for the next sync
>     */
> -  scheduleNextSync: function scheduleNextSync(interval) {
> +  scheduleNextSync: function scheduleNextSync(interval, engines = null, why = null) {

You might as well upgrade this to a plain |scheduleNextSync(interval, engines = null, why = null)| decl while touching it
Attachment #8927417 - Flags: review?(markh) → review+
Pushed by eoger@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/23fbfd83e957
Do a quick sync before going to sleep. r=markh
Thanks!
Flags: needinfo?(eoger)
Pushed by eoger@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/8b6a35a99b40
Do a quick sync before going to sleep. r=markh
https://hg.mozilla.org/mozilla-central/rev/8b6a35a99b40
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 59
You need to log in before you can comment on or make changes to this bug.