Closed Bug 1189729 Opened 9 years ago Closed 9 years ago

Push API abusing data

Categories

(Core :: DOM: Push Subscriptions, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
FxOS-S4 (07Aug)

People

(Reporter: Nukeador, Assigned: gerard-majax)

References

Details

(Keywords: foxfood, regression, Whiteboard: [bzlite][dogfood-blocker])

Attachments

(7 files, 1 obsolete file)

User-Agent: Mozilla/5.0 (Mobile; rv:42.0) Gecko/42.0 Firefox/42.0

Around 1mb per minute. I see the data spinner moving all the time on the top bar.

Data usage is listed as Others
Anoter instance of bug 1152264 ?
> adb root

then

> adb shell

then

> tcpdump -i any -s 65536 -w /storage/sdcard0/dump.pcap

This will show us where the communications are going.
Can you run tcpdump as documented and provide me the dump.pcap file?

Let it run for some minutes doing nothing else on the device, then CTRL+C it and adb pull /storage/sdcard0/dump.pcap
Flags: needinfo?(nukeador)
root@aries:/ # tcpdump -i any -s 65536 -w /storage/sdcard0/dump.pcap
tcpdump: invalid snaplen 65536
Flags: needinfo?(nukeador)
Flags: needinfo?(lissyx+mozillians)
rha, 65535
Flags: needinfo?(nukeador)
I've sent you an email with the dump file.
Flags: needinfo?(nukeador)
Thanks. So I see a lot of HTTPS/TLS activity on Amazon IP range. There's also much DNS resolution of ... push.services.mozilla.com to those IPs.
Flags: needinfo?(lissyx+mozillians)
Gecko profile prefs.js

user_pref("services.push.adaptive.lastGoodPingInterval.mobile", 0);
user_pref("services.push.adaptive.lastGoodPingInterval.wifi", 0);
user_pref("services.push.adaptive.mobile", "mobile-214-04");
user_pref("services.push.pingInterval", 0);
user_pref("services.push.pingInterval.mobile", 0);
user_pref("services.push.pingInterval.wifi", 0);
Either bug 1152264 was not properly fixed or it regressed.
Component: Gaia::Feedback → DOM: Push Notifications
Flags: needinfo?(frsela)
Keywords: regression
Product: Firefox OS → Core
Summary: Something is using data all the time → Push API abusing data
Target Milestone: --- → FxOS-S4 (07Aug)
Version: unspecified → Trunk
Flags: needinfo?(kcambridge)
Flags: needinfo?(willyaranda)
right, those 0's are not good, those means continously checking.

or the bug you refer is not working properly or the default prefs.js from his build is set to innitially to 0.
Flags: needinfo?(frsela)
Adding my pref.js to show other status (Last update done):

user_pref("services.push.adaptive.lastGoodPingInterval.mobile", 111082);
user_pref("services.push.adaptive.lastGoodPingInterval.wifi", 760079);
user_pref("services.push.adaptive.mobile", "mobile-722-34");
user_pref("services.push.pingInterval", 1140118);
user_pref("services.push.pingInterval.mobile", 111082);
user_pref("services.push.pingInterval.wifi", 1140118);
Guillermo and I are using the latest Z3 OTA update, but it seems preferences are completely different.
Is it safe if I change them manually in that file to avoid this checking (so I can use the phone property) or do you need me to keep them like this for further testing?
(In reply to Rubén Martín [:Nukeador] from comment #14)
> Is it safe if I change them manually in that file to avoid this checking (so
> I can use the phone property) or do you need me to keep them like this for
> further testing?

Yes, completly safe. If the values return to 0 again, is a clear bug.
Thank you
(In reply to Fernando R. Sela (no CC, needinfo please) [:frsela] from comment #11)
> right, those 0's are not good, those means continously checking.
> 
> or the bug you refer is not working properly or the default prefs.js from
> his build is set to innitially to 0.

I doubt those prefs have been set to 0 explicitely. And even if it was the case, I think this should not be allowed anyway.
For now, removing the prefs the data usage has stopped.
The problem is back again even with the prefs removed.

I've sent :gerard-majax a new dump.
I rebooted the device and guess what's again in prefs.js

user_pref("services.push.adaptive.lastGoodPingInterval.mobile", 0);
user_pref("services.push.adaptive.lastGoodPingInterval.wifi", 0);
user_pref("services.push.adaptive.mobile", "mobile-214-04");
user_pref("services.push.pingInterval", 0);
user_pref("services.push.pingInterval.mobile", 0);
user_pref("services.push.pingInterval.wifi", 0);


:P
Either the recalculation is still tending to zero, but I thought frsela added fixes for that, or the prefs are incorrectly set so that reading them is leading to zero being read and then zero being written when we make changes. What build is this based on? The default prefs in b2g/app/b2g.js set it to 3 minute intervals.

Could you also enable services.push.debug and capture adb logcat output (bonus if you could filter the file so only output from PushService.jsm is shown)?
Is there a way to avoid the pref-to-0 coming back again and again even if I delete them?

I really want to test the phone under 3-4G, but it kills my data plan.
set pref services.push.adaptive.enabled = false.
I have got several reports from people on Flame v2.2 (build from today) hitting this bug also. They resetted the prefs, tried the value in comment 23: nothing. A tcpdump capture shows it is push.services.mozilla.com that is active.
I encounter the same issue on my flame on 2.2

version	2.2.0.0-prerelease
appbuildid	20150808032505
platformbuildid	20150808032505
platformversion	37.0
geckobuildid	20150808032505
geckoversion	37.0

Data usage has increased since June (I can't be more precise than that, sorry). It's around 100 MB per day. Almost all taken by 'Other'. I only use data when I'm not home (otherwise it's wifi) and I checked the data usage on several days.
Typical test day: 10 hours on data plan (outside home), checking mail, browsing web pages. 90% of data usage is used by 'other'.

I set pref services.push.adaptive.enabled = false -> no change

The services.push prefs are NOT set to 0 (values at 180000 or 60000, even 1 800 000 for pingInterval)
A pcap run during 10 minutes on the phone doing absolutely nothing (except maybe automatic email check) returned 2107 requests mainly to ec2-52-25-153-44.us-west-2.compute.amazonaws.com

This is a huge issue for me. The flame is my regular phone. I have a quite comfortable data plan (3GB per month), but I'm worrying on what will happen during my 2 weeks vacation without wifi... If I dare to extrapolate, 10 MB / hour * 24h * 14 days, I may have enough data for the complete vacation. Except if I check email or browse the internet.

It also appears that the battery is draining very quickly because of that... About 6/7 hours of just lying on my desktop and it's dead. My boyfriend has a flame on 2.1 (without this issue, of course) that can last 2 days although he uses it.
If I set services.push.connection.enabled = false, it stops. No more requests to push.services.mozilla.org

And if I set it to true, requests are sent again and again.

What is this pref, is it safe to disable it? Will there be side effects (other than saving my data plan)?
(In reply to Flore Allemandou [:flore] from comment #26)
> If I set services.push.connection.enabled = false, it stops. No more
> requests to push.services.mozilla.org
> 
> And if I set it to true, requests are sent again and again.
> 
> What is this pref, is it safe to disable it? Will there be side effects
> (other than saving my data plan)?

More than 36 hours after I set services.push.connection.enabled = false, (I used my phone to check emails, browse a few webpages  and a few texts), the battery is still 68% (not recharged since my last comment).
I still get phone calls and text messages notifications (I checked). My emails are automatically checked on a random schedule (although I configured every 5 minutes in the mail app) from a few minutes to a few hours. But that's the only downside, I noticed at the moment. I may have to check for daily updates manually, but not a big problem either.

I prefer that to battery draining and abusive data usage.
Since nobody cares, I'm disabling adaptive push.
Assignee: nobody → lissyx+mozillians
Attached patch Disable adaptive Push (obsolete) — Splinter Review
Attachment #8645640 - Flags: review+
Attachment #8645640 - Attachment is obsolete: true
Attachment #8645661 - Flags: review?(anygregor)
Comment on attachment 8645661 [details] [diff] [review]
Disable adaptive Push

Review of attachment 8645661 [details] [diff] [review]:
-----------------------------------------------------------------

Disabling because of https://bugzilla.mozilla.org/show_bug.cgi?id=1189729#c0
Attachment #8645661 - Flags: review?(anygregor) → review+
Going on comment #25, is this patch going to be enough?
(In reply to Alexandre LISSY :gerard-majax from comment #32)
> Created attachment 8645661 [details] [diff] [review]
> Disable adaptive Push

I noticed that disabling the adaptive push did not solve the issue for me. I had to disable services.push.connection.enabled (different pref) to get rid of unwanting data connections (and battery draining and overheating phone, also)
(In reply to Flore Allemandou [:flore] from comment #35)
> (In reply to Alexandre LISSY :gerard-majax from comment #32)
> > Created attachment 8645661 [details] [diff] [review]
> > Disable adaptive Push
> 
> I noticed that disabling the adaptive push did not solve the issue for me. I
> had to disable services.push.connection.enabled (different pref) to get rid
> of unwanting data connections (and battery draining and overheating phone,
> also)

Are you 100% sure ? Because then it's even worse, we cannot disable this API like this without breaking other usecase of dogfooders.
Depends on: 1152264
(In reply to Alexandre LISSY :gerard-majax from comment #36)
> (In reply to Flore Allemandou [:flore] from comment #35)
> > (In reply to Alexandre LISSY :gerard-majax from comment #32)
> > > Created attachment 8645661 [details] [diff] [review]
> > > Disable adaptive Push
> > 
> > I noticed that disabling the adaptive push did not solve the issue for me. I
> > had to disable services.push.connection.enabled (different pref) to get rid
> > of unwanting data connections (and battery draining and overheating phone,
> > also)
> 
> Are you 100% sure ? Because then it's even worse, we cannot disable this API
> like this without breaking other usecase of dogfooders.

Yes I'm sure. But maybe, it's only for me. And disabling adaptive push may work for other people.
Attached image push api data abuse.png
Screenshot of web IDE showing lot of connections to push.services.mozilla.com
services.push.adaptive.enabled = false
services.push.connection.enabled = true
(In reply to Fernando R. Sela (no CC, needinfo please) [:frsela] from comment #37)
> Should we rollback [1] and [2]?
> 
> [1] https://bugzilla.mozilla.org/show_bug.cgi?id=1152264
> [2] https://bugzilla.mozilla.org/show_bug.cgi?id=1100863?

That would be ideal I think. Since you are the owner of those bugs, will you do this? Then we can do a quick release to foxfooders to fix their battery/data usage.
Flags: needinfo?(frsela)
Flags: needinfo?(kcambridge)
(In reply to Flore Allemandou [:flore] from comment #39)
> Screenshot of web IDE showing lot of connections to push.services.mozilla.com
> services.push.adaptive.enabled = false
> services.push.connection.enabled = true

Thanks, Flore. Could you please enable `services.push.debug` and paste a logcat from your phone? I'd like to see what's causing it to ping so frequently, especially since your `pingInterval` is 30 minutes.
Flags: needinfo?(Flore)
Just to note, in order to backout, the patch directory reference has to be renamed to simplepush instead of push.
I am not so sure we want to backout of bug 1152264...
Attached file dev-log-main.log
Logs illustrating constant requests to push.services.mozilla.com
If you need a log on a longer timespan or another logfile, just let me know.
Flags: needinfo?(Flore)
(In reply to Flore Allemandou [:flore] from comment #44)
> Logs illustrating constant requests to push.services.mozilla.com

That's very helpful; thanks. Here's the interesting part—and it looks like all reconnects follow this pattern:

> 08-10 21:27:10.637   210   210 I Gecko   : -*- PushService.jsm: Sending message: {"messageType":"hello" ...}
> 08-10 21:27:10.837   210   210 I Gecko   : -*- PushService.jsm: Socket error 2152398868
> 08-10 21:27:10.837   210   210 I Gecko   : -*- PushService.jsm: reconnectAfterBackoff()
> 08-10 21:27:10.837   210   210 I Gecko   : -*- PushService.jsm: _calculateAdaptivePing()

2152398868 is `NS_ERROR_NET_RESET`...so I guess it's severing the TCP connection when we send the opening handshake. (We're also resetting the failure count before we send the handshake, which might explain why it's not backing off on reconnect). We may also want to hold the wake lock until we finish the handshake, instead of releasing it before.

At first glance, this doesn't look related to adaptive pings...but the effect is the same.
correction for the comment 47, url got chopped off:
B2G-gecko-nightly-20150810214003
Revert patch 1152264
Based on branch b2g37_v2_2
Flags: needinfo?(frsela)
Revert 1100863
Based on branch b2g37_v2_2
(In reply to Michael Henretty [:mhenretty] from comment #40)
> (In reply to Fernando R. Sela (no CC, needinfo please) [:frsela] from
> comment #37)
> > Should we rollback [1] and [2]?
> > 
> > [1] https://bugzilla.mozilla.org/show_bug.cgi?id=1152264
> > [2] https://bugzilla.mozilla.org/show_bug.cgi?id=1100863?
> 
> That would be ideal I think. Since you are the owner of those bugs, will you
> do this? Then we can do a quick release to foxfooders to fix their
> battery/data usage.

Done
Fixed by backing out bug 1100863. Let's keep an eye on this though, and make sure this back out did the job.

https://bugzilla.mozilla.org/show_bug.cgi?id=1100863#c20
Status: NEW → RESOLVED
Closed: 9 years ago
Flags: needinfo?(willyaranda)
Resolution: --- → FIXED
Matheiu Agopian reported that the problem persists on his device even after the OTA update
I sent a dump to :gerard-majax and he told me I too suffer from the TCP RST flood.

Is there a fix or workaround for that?
I also updated from dogfood channel, and I don't see the Tcp flooding (and actually, I didn't notice before loading this OTA either), but I do see something odd.  Each time an app is opened or closed, there is a network activity, even when 'send mozilla feedback' option is turned off.  Each time it does,it spends about 2~3K, so it is not that much, but still odd to see the network activity when I open, let's say, a music app.

When I don't do anything, there is no network activity.   PBylenga is checking out the dogfood-latest to see whether there is any difference.  ni? him to provide feedback.
Flags: needinfo?(pbylenga)
Is it possible it's the telemetry ?
(In reply to Alexandre LISSY :gerard-majax from comment #56)
> Is it possible it's the telemetry ?

Hmm probably that might be the case, then it might be unrelated to the issue you are seeing.
on dogfood-test, I saw the data icon flashing, especially when switching apps.  However after 20 minutes it used 6kb of data and during the next 20 minutes no data was used.

I didn't have the pref:
user_pref("toolkit.telemetry.enabled", true);

I full flashed RC4 and OTA'd

Device: Aries 2.5
Build ID: 20150810214003
Gaia: 497fe3f938722b0aa49c93f975fad5d9ed3b0a82
Gecko: 08c257046b51447c4bb144ded0567ed3da74b9d7
Gonk: 2916e2368074b5383c80bf5a0fba3fc83ba310bd
Version: 42.0a1 (2.5)
Firmware Version: D5803_23.1.A.1.28_NCB.ftf
User Agent: Mozilla/5.0 (Mobile; rv:42.0) Gecko/42.0 Firefox/42.0


on dogfood-latest, I saw the data icon flashing when returning to homescreen but didn't observe any additional data used.

Environmental Variables:
Device: Aries 2.5
Build ID: 20150812135535
Gaia: 6fef72357971934c8774578044ea7a442be3a75d
Gecko: b7efed3a2640fba681d48846c3b2ab80474c6f77
Gonk: 2916e2368074b5383c80bf5a0fba3fc83ba310bd
Version: 43.0a1 (2.5)
Firmware Version: D5803_23.1.A.1.28_NCB.ftf
User Agent: Mozilla/5.0 (Mobile; rv:43.0) Gecko/43.0 Firefox/43.0
Flags: needinfo?(pbylenga)
Whiteboard: [bzlite] → [bzlite][dogfood-blocker]
I have the latest OTA on the Z3C and I'm still having this behavior. Is it normal or should I wait for the next OTA?
Yes, as far as I know, a release with the bug-fix has not happened yet. We are deploying a server-update next week which should help with this issue, at the cost of reducing how often push notifications are checked.
So what is the status on those Push failures because of TCP RST ? The API used to work well on one of my device I use for developpement purpose. After hacking to understand bug 1164713 from data connection, I started to get the API failing with TCP RST symptom. This morning in the office, switching to WiFi for this hacking. API works. After a few hours of debug, TCP RST also on the Mozilla Guest network.

Now I'm stuck.
Flags: needinfo?(kcambridge)
Flags: needinfo?(bbangert)
For what it's worth a TCP RST could come from anything. It may be that a router or hub is killing the connection because it's too old or because it's resource starved. Long lived connections tend to get killed or recycled for lots of reasons. 

The servers are running with lots of capacity at the moment, also, since the connections live for as long as possible, pretty much every one would be abruptly terminated from the server's point of view. The device could have lost power, gone into a tunnel, been hit with a hammer, etc. I'm not sure how the server could identify a connection that was terminated by the client or some device in between.
:gerard-majax, Are you able to reconnect to the server if you turn the network off for a few minutes? We're trying to find a solution that may work for you. 

(One theory is that you may be triggering some traffic shaping method on a router between you and the AWS servers. By having the device off, you may decrease whatever is being used to flag you as "abusive". It may also be useful to know how savvy you are with WireShark. It might be possible to spot the RST packet, find the MAC address of the origin and trace it back to the source, but that's not a simple task.)
(In reply to JR Conlin [:jrconlin,:jconlin] from comment #63)
> :gerard-majax, Are you able to reconnect to the server if you turn the
> network off for a few minutes? We're trying to find a solution that may work
> for you. 
> 
> (One theory is that you may be triggering some traffic shaping method on a
> router between you and the AWS servers. By having the device off, you may
> decrease whatever is being used to flag you as "abusive". It may also be
> useful to know how savvy you are with WireShark. It might be possible to
> spot the RST packet, find the MAC address of the origin and trace it back to
> the source, but that's not a simple task.)

No. My device is off-network during hours/days and this is still happening as soon as I reconnect. Whatever the network I connect to.
After getting a wireshark log dump, we were able to confirm that the problem seems to be in communications between the AWS server and the client. After a bit of debugging, we determined that it's a read buffer size issue on the production server. If the incoming "Hello" message exceeds 2048 characters, the server will drop the connection as invalid. The problem is that a client might have more than 50 registered ChannelIDs causing the "Hello" message to exceed 2048 characters (fun fact, during my tests, a Hello with 50 ChannelIDs produced a message exactly 2048 characters). 

Our server currently ignores the client ChannelID list, so it was recommended that the client stop sending ChannelIDs as part of the "Hello" message. This should greatly reduce the data burden for clients, as well as speed up connections. The only problem is that devices using the TEF server may experience problems since that server will drop connections not listed in the ChannelID listing. TEF is 
shutting down their push server once we have UDP support in place and can transition their load. 

And all of this becomes a non-issue once we switch over to WebPush.

https://github.com/mozilla-services/autopush/pull/152
Looks like your theory holds: I had a look at the Push implementation and noticed the Channel IDs values are coming from the PushDB. So I deleted the indexedDB from my profile:
> adb shell rm -r /data/local/storage/permanent/chrome/idb/846562544phus.*

And then restarted my device. That fixed the issue for me.
Flags: needinfo?(nukeador)
Flags: needinfo?(jrconlin)
Flags: needinfo?(Flore)
OK I did what(In reply to Alexandre LISSY :gerard-majax from comment #66)
> Looks like your theory holds: I had a look at the Push implementation and
> noticed the Channel IDs values are coming from the PushDB. So I deleted the
> indexedDB from my profile:
> > adb shell rm -r /data/local/storage/permanent/chrome/idb/846562544phus.*
> 
> And then restarted my device. That fixed the issue for me.

I did that and it works (no need to restart the device in my case).
- I removed the files
- Re-enabled the services.push.connection.enabled pref in webIDE
- there was an initial request to push.services.mozilla.com (as expected)
- I monitored the network traffic for 15 minutes and everything works okay: no abusive data usage
Flags: needinfo?(Flore)
Same for me! Thanks Alexandre for the workaround ;)
Cool, glad that works for the short term. Slightly longer term is that the server now has adjustable sized buffers we can tweak if this happens again, or if we see users with a large number of natural channels.
Flags: needinfo?(kcambridge)
Flags: needinfo?(jrconlin)
Flags: needinfo?(bbangert)
Have those buffers been adjusted? If not, is there a bug tracking making the required change to the client or the server?

Pretty sure 'run this adb command to delete an indexeddb' isn't an acceptable work-around for foxfooders.
Flags: needinfo?(jrconlin)
:cwiiis, Yes. See comment #65. 

There's still an outstanding issue with the client not pruning old or unused channels (I'd suppose that for devs who might be wiping devices, they probably should be removing old channels as well, but I don't know the specifics that cause 50+ channels to be generated.)
Flags: needinfo?(jrconlin)
(In reply to JR Conlin [:jrconlin,:jconlin] from comment #71)
> :cwiiis, Yes. See comment #65. 
> 
> There's still an outstanding issue with the client not pruning old or unused
> channels (I'd suppose that for devs who might be wiping devices, they
> probably should be removing old channels as well, but I don't know the
> specifics that cause 50+ channels to be generated.)

Going on that pull request, it just sounds like it will delay the problem until the client IDs grows to an even larger size? What's the bug number for fixing the client? (sorry if I've misunderstood)
Flags: needinfo?(jrconlin)
I don't know if there is a bug filed for the client fix, since there are a number of external contingencies that impact this condition. For accounting purposes, I've drafted: https://bugzilla.mozilla.org/show_bug.cgi?id=1201128.
Flags: needinfo?(jrconlin)
Thanks JR - just want to make sure we keep note of this, it'd be terrible if we forgot, encounter it again in a couple of months and someone loses a few days tracking it down (again).
Bad news for me, following https://bugzilla.mozilla.org/show_bug.cgi?id=1189729#c68 the problem is back again.

Also, just for the sake of completeness, here's what I needed to do (because it seems "adb shell rm..." tells me I don't have the correct permissions):
1/ adb shell
2/ su -
3/ rm -r /data/local/storage/permanent/chrome/idb/846562544phus.*
4/ restart the phone

doing that again seems to have fixed the issue with the "flashing data icon in the statusbar and rapidly draining battery" again, but for how long?
Depends on: 1201138
(In reply to Mathieu Agopian [:magopian] from comment #75)
> Bad news for me, following
> https://bugzilla.mozilla.org/show_bug.cgi?id=1189729#c68 the problem is back
> again.
> 
> Also, just for the sake of completeness, here's what I needed to do (because
> it seems "adb shell rm..." tells me I don't have the correct permissions):
> 1/ adb shell
> 2/ su -
> 3/ rm -r /data/local/storage/permanent/chrome/idb/846562544phus.*
> 4/ restart the phone
> 
> doing that again seems to have fixed the issue with the "flashing data icon
> in the statusbar and rapidly draining battery" again, but for how long?

Mathieu, I believe it means we have something not cleaning up those channels IDs. Filed bug 1201138 for this.
It seems it's not happening anymore here.
Flags: needinfo?(nukeador)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: