Closed Bug 1287473 Opened 4 years ago Closed 4 years ago

Unexpected high number of sync pings

Categories

(Firefox :: Sync, defect, P1)

defect

Tracking

()

RESOLVED INCOMPLETE

People

(Reporter: gfritzsche, Assigned: tcsc)

References

Details

(Whiteboard: [data-integrity] [sync-metrics])

Attachments

(7 files)

We see a high jump in the number of pings in the "other" bucket from July 13 (for nightly ping counts per day):
https://metrics.services.mozilla.com/telemetry-budget-dashboard/

We see similar jumps in internal Telemetry client metrics, which together suggest a high number of small pings being submitted.

The clear suspect in the regression range is the new sync ping from bug 1267919:
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=aac8ff1024c553d9c92b85b8b6ba90f65de2ed08&tochange=04821a70c739a00d12e12df651c0989441e22728
It looks like the "empty" pings we decided to keep cause this. According to this: https://gist.github.com/thomcc/707ae437e034f319ae68c561704a5a76 (sorry, I really have no idea what I'm doing there!) it looks like only ~4% of the pings we get actually have meaningful information in them aside from timing numbers... which certainly aren't useless, so if we can keep them, I'd prefer to.


Comment 67 on bug 1267919 (mine) and comment 48 on bug 1267919 (markh's) are basically why we have them still, and while I stand by my comment (the information is still useful), it's certainly *less* useful than the information submitted in the other pings...

That said, the numbers I'm seeing are 1M-2M/day for nightly users. Abusing https://telemetry.mozilla.org/new-pipeline/dist.html#!measure=WEAVE_DEVICE_COUNT_DESKTOP (which is opt-out) a bit, it looks like the ratio for sync desktop users is (assuming  something like 15/(15 + 12 + 25 + 80) = ~11%. Which would mean that the expected total number of pings -- if it went to all channels as is -- would be closer to 20M, which is well below the cited value of 100M.

dcoates can probably weigh in as to whether or not 11% is a reasonable amount for the ratio of nightly sync users to total sync users -- It sounds a bit high to me.
Flags: needinfo?(dcoates)
The 2M is consistent with Kibana on /info/collections for user_agent_version 50, but it does seem high relative to v47 of ~90M. The daily count of devices by version on https://sql.telemetry.mozilla.org/queries/97#152 don't match this ratio at all. I'm checking the /storage data in redshift to see if it matches Kibana. From there I should be able to tell if that rate is expected or elevated.
Flags: needinfo?(dcoates)
I do see elevated activity for v50 relative to the number of sessions as compared to other versions, which is probably the cause of the high ping count. It appears to have started around July 5th. Here's a graph of the request count by version number around that time https://sql.telemetry.mozilla.org/queries/698#1178 meanwhile the device count by version shows v50 to be quite small https://sql.telemetry.mozilla.org/queries/97#152
Sync did indeed introduce a regression causing it to Sync far more often than it should in some cases. I filed bug 1287687. Metrics FTW? :)

However, I'm a little unclear what the status of this bug is - even with that bug fixed, we will expect a significant increase in the number of pings made by clients (eg, ~ one every 10 minutes for many clients) - just not as significant as we actually caused (every 90 seconds for some clients). The intention in bug 1267919 was always to make one ping per Sync.
One every 10 minutes sounds like a lot, what specific value do you get out of that?
Can we get a volume estimate from there or should we wait for Nightly data to see about it?

E.g. if you want to track sync frequency per client or time between syncs, this could be more efficiently done with histograms.
If you want to submit more event-driven pings, there should at least be a throttling mechanism to avoid scenarios like this.
Maybe this is the right bug that I am experiencing on nightlys since 1-2 weeks (July 13 mentioned above could be right):
A certain amount of time (1-2 h?) after starting nightly, Sync or Telemetry? goes crazy and constantly causes 8-12 kb/s download and 6 kb/s upload, also noticable CPU-usage.
AFAIK it runs forever, longest "run" was about 2h before I closed nightly (after restart, it stays calm for 1-2h?)

Browser-Console is totally spammed with:

GET 
https://sync-298-us-west-2.sync.services.mozilla.com/1.5/37719633/storage/clients [HTTP/1.1 200 OK 206ms]
13:37:28.119 POST 
XHR 
https://incoming.telemetry.mozilla.org/submit/telemetry/661e7b21-dc4d-4a8c-b1dc-01b2842dde87/sync/Firefox/50.0a1/nightly/20160721030216 [HTTP/1.1 200 OK 194ms]
13:37:28.221 GET 
https://sync-298-us-west-2.sync.services.mozilla.com/1.5/37719633/info/collections [HTTP/1.1 200 OK 215ms]
13:37:28.442 GET 
https://sync-298-us-west-2.sync.services.mozilla.com/1.5/37719633/storage/clients [HTTP/1.1 200 OK 206ms]
13:37:28.723 POST 
XHR 
https://incoming.telemetry.mozilla.org/submit/telemetry/f99ac6e8-b59a-46ae-9089-68a4503b1ebc/sync/Firefox/50.0a1/nightly/20160721030216 [HTTP/1.1 200 OK 200ms]
13:37:28.825 GET 
https://sync-298-us-west-2.sync.services.mozilla.com/1.5/37719633/info/collections [HTTP/1.1 200 OK 195ms]
13:37:29.027 GET 
https://sync-298-us-west-2.sync.services.mozilla.com/1.5/37719633/storage/clients [HTTP/1.1 200 OK 213ms]
13:37:29.294 POST 
XHR 
https://incoming.telemetry.mozilla.org/submit/telemetry/453bf1b2-4352-4525-9f63-528d55ed1db1/sync/Firefox/50.0a1/nightly/20160721030216 [HTTP/1.1 200 OK 194ms]
13:37:29.418 GET 
https://sync-298-us-west-2.sync.services.mozilla.com/1.5/37719633/info/collections [HTTP/1.1 200 OK 193ms]
13:37:29.621 GET 
https://sync-298-us-west-2.sync.services.mozilla.com/1.5/37719633/storage/clients [HTTP/1.1 200 OK 195ms]
13:37:29.892 POST 
XHR 
https://incoming.telemetry.mozilla.org/submit/telemetry/3bd9b74e-24a3-457f-b265-169f97bf494f/sync/Firefox/50.0a1/nightly/20160721030216
...

User Agent 	Mozilla/5.0 (X11; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0
OS 	Linux 4.4.15-3-MANJARO
Build ID 	20160721030216
Flags: needinfo?(tchiovoloni)
Flags: needinfo?(markh)
Stebs: Yes, this is the right bug! 

Could you attach a sync log from when this is happening? There are some instructions on this page for getting the log https://wiki.mozilla.org/CloudServices/Sync/File_a_desktop_bug (although obviously ignore the bits about filing a new bug on bugzilla, just add the log as an attachment to this one).

Also it would help if you could tell us if you have any addons installed could be effecting sync (especially if they've updated recently), and/or if you have changed any about:config prefs starting with services.sync.
Flags: needinfo?(tchiovoloni) → needinfo?(stebs)
Attached file about:support
Attaching about:support and several error logs (bugzilla really can only attach one file at a time?)
I have only those 5 extensions active (as reported in the error logs), afair no special setting concerning sync. Accelerated layers are forced (but I doubt, it has to do anything with this bug).
You can see that at some point or another, I closed Firefox while sync was running (since it runs forever).
BTW, tried to disable Telemetry, still happened, so I reactivated it.
Flags: needinfo?(stebs)
services.sync.declinedEngines: prefs,adblockplus
Not related probably, but this adblockplus entry is extremely sticky: I replaced adblockplus with uBlock origin on all clients since long ago, I even manually deleted adblockplus from the setting above but it magically reappears.
Didn't change manually any other services.sync setting, I could still attach all settings if needed.
(Are any of those privacy sensitive appart name, username and account?)
Stebs: This is great, thanks a ton! And the about:support file lists all the relevant changed settings, thanks. I (or possibly someone else) will let you know if we have any other questions.

And yes, disabling telemetry wouldn't fix the issue -- sending too many sync pings (despite being the name of this bug) is only a symptom of the underlying "syncing too much for some users" issue.
I'm able to query the server metrics for this account. Here's the read activity on the clients bucket since June per device. I'm happy to dig deeper if it helps.

    day     | dev_id | r_clients
------------+--------+-----------
 2016-06-02 | 619... |        82
 2016-06-02 | 7b3... |        25
 2016-06-03 | 619... |        65
 2016-06-03 | 7b3... |        35
 2016-06-04 | 619... |        31
 2016-06-04 | 7b3... |        97
 2016-06-04 | a55... |         2
 2016-06-06 | 619... |        36
 2016-06-06 | 7b3... |        95
 2016-06-06 | a55... |         1
 2016-06-07 | 619... |        75
 2016-06-07 | 7b3... |        82
 2016-06-08 | 619... |        48
 2016-06-08 | 7b3... |        45
 2016-06-09 | 619... |        79
 2016-06-09 | 7b3... |        45
 2016-06-09 | a55... |         2
 2016-06-10 | 619... |        62
 2016-06-10 | 6af... |         1
 2016-06-11 | 619... |        34
 2016-06-11 | 7b3... |        49
 2016-06-11 | e0b... |         1
 2016-06-12 | 619... |        66
 2016-06-12 | 6af... |         1
 2016-06-12 | 7b3... |        69
 2016-06-13 | 619... |        94
 2016-06-13 | 7b3... |        46
 2016-06-14 | 7b3... |        31
 2016-06-14 | e0b... |         2
 2016-06-15 | 619... |       125
 2016-06-15 | 7b3... |        83
 2016-06-16 | 619... |        18
 2016-06-17 | 619... |        66
 2016-06-17 | 7b3... |        77
 2016-06-18 | 619... |        23
 2016-06-18 | 6af... |         1
 2016-06-18 | 7b3... |        49
 2016-06-18 | e0b... |         1
 2016-06-19 | 7b3... |        69
 2016-06-20 | 619... |        67
 2016-06-20 | 7b3... |        53
 2016-06-20 | e0b... |         1
 2016-06-21 | 619... |       129
 2016-06-21 | 7b3... |        54
 2016-06-21 | e0b... |         1
 2016-06-22 | 619... |        65
 2016-06-22 | 7b3... |       102
 2016-06-23 | 619... |         8
 2016-06-23 | 7b3... |        13
 2016-06-25 | 7b3... |        68
 2016-06-27 | 6af... |         1
 2016-06-27 | 7b3... |        42
 2016-06-29 | 619... |        51
 2016-06-29 | 7b3... |        48
 2016-06-29 | e0b... |         1
 2016-06-30 | 619... |        61
 2016-06-30 | 7b3... |        44
 2016-07-01 | 619... |        34
 2016-07-01 | 7b3... |       149
 2016-07-01 | a55... |         2
 2016-07-03 | 7b3... |       139
 2016-07-04 | 619... |        89
 2016-07-04 | e0b... |         1
 2016-07-05 | 619... |        31
 2016-07-05 | 7b3... |        52
 2016-07-05 | a94... |         8
 2016-07-05 | e0b... |         1
 2016-07-06 | 7b3... |        50
 2016-07-06 | a55... |         2
 2016-07-06 | a94... |      7329
 2016-07-07 | 7b3... |        34
 2016-07-07 | a94... |        87
 2016-07-07 | e0b... |         1
 2016-07-08 | 7b3... |        76
 2016-07-08 | a94... |        26
 2016-07-09 | a94... |         2
 2016-07-10 | 7b3... |        56
 2016-07-10 | a94... |     12016
 2016-07-11 | 7b3... |        69
 2016-07-11 | a94... |      5713
 2016-07-11 | e0b... |         5
 2016-07-12 | 7b3... |        31
 2016-07-12 | a94... |       103
 2016-07-12 | e0b... |         2
 2016-07-13 | 7b3... |        50
 2016-07-13 | a55... |         1
 2016-07-13 | a94... |       496
 2016-07-13 | e0b... |         3
 2016-07-14 | 7b3... |        75
 2016-07-15 | a94... |      6821
 2016-07-16 | 7b3... |        80
 2016-07-16 | a55... |         2
 2016-07-16 | a94... |        32
 2016-07-16 | e0b... |         6
 2016-07-18 | 7b3... |        57
 2016-07-18 | a94... |      7870
 2016-07-19 | 6a2... |      2242
 2016-07-19 | 7b3... |        42
 2016-07-19 | a94... |       940
 2016-07-20 | 7b3... |        61
One of the logs has:
1468517776205	browserwindow.syncui	DEBUG	observed: weave:ui:sync:finish
1468517776293	Sync.SyncScheduler	DEBUG	Global Score threshold hit, triggering sync.

which is a Sync starting again immediately after it finished due to the score being updated. I suspect this is related to those additional engines, but I could be wrong.

Stebs, would it be possible for you to set the prefs:

services.sync.log.logger.service.main = "Trace"
services.sync.log.appender.file.logOnSuccess = true

and perform a few Syncs. about:sync-log should then list a number of "success" logs - please upload a few sequential ones here (ideally after reproducing this problem.)

Thanks!
Flags: needinfo?(markh) → needinfo?(stebs)
It seems likely this problem is bug 1288877, which should land on Nightly in the next day or 2. Stebs, you might like to wait a couple of days and confirm your problem is resolved by that patch before bothering with what I asked for in comment 17.
Prefs set, will report if I think I did catch anything.
Didn't mention btw. that sync works normally and quick as usual - until the continuous sync happens - which has to be reset with a restart.
I keep my eyes open after checkin of bug 1288877, sounds like it might be the right one.
Flags: needinfo?(stebs)
Catched.
Starts around 17:22
(In reply to Stebs from comment #20)
> Created attachment 8774061 [details]
> archived sync logs with tracing and logOnSuccess enabled

Thanks - the following log entries:
> 1469287719609	browserwindow.syncui	DEBUG	observed: weave:ui:sync:finish
> ...
> 1469287719708	Sync.SyncScheduler	TRACE	bookmarks: score: 602
> ...
> 1469287719709	Sync.SyncScheduler	TRACE	Global score updated: 602
> 1469287719709	Sync.SyncScheduler	TRACE	_checkSync returned "".
> 1469287719709	Sync.SyncScheduler	DEBUG	Global Score threshold hit, triggering sync.

makes me believe this is indeed bug 1288877
Depends on: 1288877
Assignee: nobody → tchiovoloni
We've been tracking the pings since bug 1288877 landed. For 7 days with that bug in-place (build_id=["20160715000000", "20160721999999"]) we were seeing ~2.2M pings per day. For 7 days after that bug landed (["20160725000000", "20160731999999"]) we are seeing ~400k per day. These figures correspond with the number of Syncs recorded by the server, which offers some confidence that we can extrapolate the server-counted Syncs to guess the volume once this hits release.

Currently server stats tell us release is seeing ~110M pings per day, so the estimate from bug 1267919 of ~112M pings seems on track.

Given the recent reduction and the fact we now appear to be on track to those original predictions, is there anything else we need to do in this bug?
Flags: needinfo?(gfritzsche)
I talked this over with mreid.

~110M pings per day can be ok, if we get an actual use out of the bigger volume.
However, this is still a lot of pings (>25% of the volume of main pings we get from the whole release population) and might increase with better sync adoption.

The planned sync health metrics don't seem to require any near-realtime latency:
https://gist.github.com/mhammond/e51a494cb04dc0acd44acf2c1589e7c5

Is it not possible to send, say, a daily ping with overall activity summary (sync use counts per engine etc., failures etc.)?

Is the sync scheduler that generates those pings synchronized across clients?
I.e. would we expect most or many clients to submit at the same time? Is submission clock based or time-since-last-ping based?

Are we now protected against submitting as many pings per client as before? Do we have some way of sanity-throttling?

Cost-wise this is acceptable if we can hold the assumption on ping volume.
A ballpark storage cost estimate here (leaving room for meta-data and common ping data):
  100M pings per day at 1kb per ping, 180 days = ~$500/mo
Flags: needinfo?(gfritzsche) → needinfo?(markh)
(In reply to Georg Fritzsche [:gfritzsche] from comment #23)
> Is it not possible to send, say, a daily ping with overall activity summary
> (sync use counts per engine etc., failures etc.)?

That would be possible - but would have been much easier if done that way initially. In bug 1267919 comment 2 I mentioned:

---8<---
I initially assumed we'd want aggregated data, but recently concluded that individual records with aggregation done by the pipeline made more sense. This was mainly due to discussions with :rvitillo and Ilana Segal about the UITelemetry payload - Robert said he was looking to move away from aggregated data about UI clicks to a format like |{"Toolbar button X”: [ts1, ts2, ts3],| and Ilana mentioned she preferred this format as it allows deeper analysis, such as "did the user click button 1 followed by button 2", which is impossible when the payload just records the count of clicks for each button.

Similarly for Sync, it would allow analysis of things like "were we more likely to see sync errors if the previous sync was interrupted by app shutdown?" or "are we more likely to see validation errors after applying incoming records?", which doesn't seem possible with client-side aggregation.

Obviously storage and processing costs are critical, but I don't know how to make the correct tradeoffs here.
---8<---

and added a needinfo on rvitillo, who responded in comment 4 with:

---8<---
If the estimated amount of data generated in a day/week/month isn't too high then we could choose to go with the schema that gives us the greatest amount of flexibility.
---8<---

and comment 8:
---8<---
I am OK using non-aggregated data given the estimated volume mentioned in Comment 6.
---8<---

We landed the pings based on these conversations.

> Is the sync scheduler that generates those pings synchronized across clients?
> I.e. would we expect most or many clients to submit at the same time? Is
> submission clock based or time-since-last-ping based?

Each client Syncs on its own schedule without any coordination. When each client Syncs is based on a simple timer and also immediately once certain local changes are noticed.

> Are we now protected against submitting as many pings per client as before?
> Do we have some way of sanity-throttling?

We don't. There's always a chance we will introduce a similar bug that causes a spike in the number of Syncs, but once we have dashboards in place we should notice this bug much faster than we did this time (it was precisely the fact we started collecting pings that we noticed that bug in the first place)

> Cost-wise this is acceptable if we can hold the assumption on ping volume.
> A ballpark storage cost estimate here (leaving room for meta-data and common
> ping data):
>   100M pings per day at 1kb per ping, 180 days = ~$500/mo

Given the non-payload portions of the ping are themselves quite large, another alternative would be to arrange for telemetry to "buffer" the pings and once per day send an array of the already collected pings.
Flags: needinfo?(markh)
(In reply to Mark Hammond [:markh] from comment #24)
> (In reply to Georg Fritzsche [:gfritzsche] from comment #23)
> > Is it not possible to send, say, a daily ping with overall activity summary
> > (sync use counts per engine etc., failures etc.)?
> 
> That would be possible - but would have been much easier if done that way
> initially.
...

Thanks for the run-down. Non-aggregated data could also be submitted with fewer pings (i.e. submit non-aggregated event-like data with daily pings etc.).
The more pings we submit, the more costly analysis becomes too.

Note that we are fine to go from the storage costs (if the volume prediction holds), but i'm curious about the wins.
I think for future efforts we should probably be more diligent about the benefits we have from this kind of ping volume.

> > Are we now protected against submitting as many pings per client as before?
> > Do we have some way of sanity-throttling?
> 
> We don't. There's always a chance we will introduce a similar bug that
> causes a spike in the number of Syncs, but once we have dashboards in place
> we should notice this bug much faster than we did this time (it was
> precisely the fact we started collecting pings that we noticed that bug in
> the first place)

We definitely need some throttling mechanism here before this can go to release.
There are several potential impacts here otherwise:
* sync pings drowning out other important pings
* too sudden spikes in submission volume can overload server-load infrastructure
* increased storage
* analysis jobs getting expensive
* ...

> > Cost-wise this is acceptable if we can hold the assumption on ping volume.
> > A ballpark storage cost estimate here (leaving room for meta-data and common
> > ping data):
> >   100M pings per day at 1kb per ping, 180 days = ~$500/mo
> 
> Given the non-payload portions of the ping are themselves quite large,
> another alternative would be to arrange for telemetry to "buffer" the pings
> and once per day send an array of the already collected pings.

Or for sync to submit a ping of interesting events ~daily and on shutdown.

If we are going forward with the current approach, we will need:
* throttling limits in place to prevent us from higher ping volumes
* to be sure the estimations will roughly hold up including sync growth, etc.
Flags: needinfo?(markh)
All identified future work on this will be done in bug 1295058, so closing this.
Status: NEW → RESOLVED
Closed: 4 years ago
Flags: needinfo?(markh)
Resolution: --- → INCOMPLETE
Just for the record: my problems mentioned in Comment 6 were fixed for good with fix for Bug 1288877
You need to log in before you can comment on or make changes to this bug.