Closed Bug 888268 Opened 11 years ago Closed 11 years ago

WIFI Network Latency rounded up to next 100ms by 802.11 PSP

Categories

(Firefox for Android Graveyard :: General, defect)

x86_64
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED
Firefox 25

People

(Reporter: mcmanus, Assigned: mcmanus, NeedInfo)

References

Details

Attachments

(3 files, 3 obsolete files)

http://www.ducksong.com/zap runs a pretty simple test that underperforms on fennec. It does 40 serial tiny XHRs to the same server, logging the round trip time back to JS of each one. The first one has some complications with setting up DNS and TCP, but after that you see reused persistent connections and basically what you are measuring is 1 round trip time from javascript->gecko->necko -> internet -> server and back again. Bandwidth is irrelevant, this is all about latency.

If you run the test from a desktop you basically see the same response time as you get from ping www.ducksong.com with a couple milliseconds of overhead.. with an occasional outlier. For me, right now, that's about 60ms.

but if you run it on fennec WIFI (using the same wifi as the desktop), you see wild variability with LOTS of outliers and even the median data is well over 150ms. I've tested this with my nexus 4 and a nexus S before that. I believe they are both samsung manufactured devices.

You can totally reproduce this with chrome on android too - horrible numbers.

The effect does not seem to appear with cell networks, just wifi - though it of course its harder to say for sure because you don't have the "control" of being able to run a desktop version on the same network. (or at least I don't have that control).

After much digging, and some help from some android googlers, this has been attributed to the ethernet driver using 802.11 Power Save Polling mode. This isn't the same thing as a power save sleep state - those are transitioned to on the order of whole seconds and aren't coming into play here. When a device is associated with an Access Point and is in PSP mode the AP doesn't deliver data to it right away.. instead it queues it at the AP and the next time it sends the WIFI beacon (generally every 100ms though this is technically configurable on the AP) the beacon includes a "youve got mail" flag and the device actually polls the AP for the data.

This might be ok for general batching of content during large transfers, or even for being able to ignore the general chitter chatter of a busy lan but during the critical early moments of an HTTP transaction it has the effect of inflating RTT massively.. and RTT drives everything. We need one for DNS, one for TCP, 2 more for SSL, one more for HTTP and for anything of decent size 2 more for growing the congestion window large enough to actually use the available bandwidth. Those are all serialized and the PSP algorithm has just added ~50ms to each one. ouch!

I've largely confirmed PSP is the culprit by reconfiguring the beacon interval on 2 different access points. If I turned it down to 20ms the "zap" test did much better, if I turned it up to 1000 the zap test was pretty much unusable.

There is an ugly ugly workaround: when we send data it triggers a poll of the access point very quickly. I'm not sure if the driver is just piggybacking that logic onto the transmit, or it the "you've got mail" flag is being piggybacked by the AP onto the CTS message that lets us transmit - but the effect is the same. You can see this in action by opening up an adbshell on the device and while zap is running in fennec (or chrome) run a quick ping -i 0.016 -c 300 at the same time. While the ping is running your zap performance gets better.. and the ping isn't even part of the browser. It works with both chrome and firefox.

I've coddled together a patch that basically does the same thing automatically inside fennec.

If you're on wifi and an IPv4 DHCP network we will send 0 length UDP packets at port 4886 of your gateway at the default rate of 60hz for 400ms from the start of the transaction in an attempt to improve RTT during the critical early phases. I call this "tickle time".

That's not as much data as it sounds like. Effectively its 40 bytes (IP + UDP headers) repeated 24 times and only ever at LAN speeds. That's 19.2 kilobit (not byte) per second for a short burst. That would suck on an old modem, but this code only engages on high bandwidth wifi. That's a small tax compared to the real data you are also moving around - remember this never engages unless its expecting real data too. I think its reasonable.

The gateway address is used because its was the best thing I could find that wasn't routable onto the internet and is also off the android host (therefore triggering the ethernet drivers). There are multicast addresses that wouldn't be routed, but they would consume a lot more bandwidth because a lot of devices just treat multicast as broadcast.

Obviously a page is made up of more than one transaction, but to the extent they don't overlap that means the tickle time is not extended. (i.e. 2 transaction in parallel still have 400ms of tickle time.. and if one had a 50ms head start the tickle time would be 450ms.) This shouldn't be a big power drain - chips don't go into sleep mode anywhere near this quickly anyhow and 960 bytes of overhead is not consequential to the power it takes to move the correlated HTTP transaction.

One possible refinement I don't have the hardware for is to figure out if this is just a hardware quirk.. though it was originally brought up on a tablet and I produced it on the two android phones I have.. it seems ubiquitous at least on modern samsung chips and galaxy like devices. But we could easily use the zap url to build a whitelist or blacklist.

Finally, it has its own pref: network.tickle-wifi.enabled

It was a hard decision for me on whether or not we want a hack like this, but in the end I think its impact is strongly net positive and we want it.
Assignee: nobody → mcmanus
worth noting - the WIFI_MODE_FULL_HIGH_PERF wakelock only seems to work (and is only documented to work) in conjunction with screen off events and doesn't impact this use case. I tried it.
Attachment #768935 - Attachment is obsolete: true
Comment on attachment 768953 [details] [diff] [review]
wifi tickler for mitigating 802.11 psp mode on android

do you have time to look at this and see if its ready for prime time?
Attachment #768953 - Flags: review?(doug.turner)
Comment on attachment 768953 [details] [diff] [review]
wifi tickler for mitigating 802.11 psp mode on android

Review of attachment 768953 [details] [diff] [review]:
-----------------------------------------------------------------

This is really cool stuff! I took a look a the patch and have some drive-by comments.

::: mobile/android/base/GeckoNetworkManager.java
@@ +155,5 @@
> +        int gateway = 0;
> +        if (mNetworkType == NetworkType.NETWORK_WIFI) {
> +            WifiManager mgr = (WifiManager)sInstance.mApplicationContext.getSystemService(Context.WIFI_SERVICE);
> +            DhcpInfo d = mgr.getDhcpInfo();
> +            gateway = d.gateway;

Although the Android docs aren't clear I suspect getDhcpInfo() could return null here so it's probably wise to add a guard here.

::: widget/android/AndroidBridge.cpp
@@ +1834,5 @@
>  
>      AutoLocalJNIFrame jniFrame(env);
>  
>      // To prevent calling too many methods through JNI, the Java method returns
> +    // an array of double even if we actually want a double two booleans and an integer.

comma after "double"

@@ +1849,5 @@
>  
>      aNetworkInfo->bandwidth() = info[0];
>      aNetworkInfo->canBeMetered() = info[1] == 1.0f;
> +    aNetworkInfo->isWifi() = info[2] == 1.0f;
> +    aNetworkInfo->dhcpGateway() = info[3];

This patch is missing the corresponding change to GeckoNetworkManager.getCurrentInformation() that makes this bit work.
BTW this sort of problem will also affect Firefox OS as we have PS mode turned on by default. It would be great if we had some common heuristics in the networking code that could turn PS mode off as necessary during latency sensitive periods.
(In reply to Michael Wu [:mwu] from comment #6)
> Have you tried
> http://developer.android.com/reference/android/net/wifi/WifiManager.
> html#WIFI_MODE_FULL_HIGH_PERF ?

yes, comment 2.. no luck with it.

(In reply to Michael Wu [:mwu] from comment #7)
> BTW this sort of problem will also affect Firefox OS as we have PS mode
> turned on by default. It would be great if we had some common heuristics in
> the networking code that could turn PS mode off as necessary during latency
> sensitive periods.

b2g is being tracked under bug 888269 because I suspected this - but I haven't had the hardware to test it. Jason has agreed to coordinate a test there.. it would be great if you could comment over there and help him out. I'd be happy to work to find a more elegant solution with gonk.
(In reply to Patrick McManus [:mcmanus] from comment #8)
> (In reply to Michael Wu [:mwu] from comment #6)
> > Have you tried
> > http://developer.android.com/reference/android/net/wifi/WifiManager.
> > html#WIFI_MODE_FULL_HIGH_PERF ?
> 
> yes, comment 2.. no luck with it.
> 

Hmm yeah, I just checked the actual implementation and it doesn't touch the ps setting at all. I don't see anything which exposes control to the PS setting. Too bad.

> (In reply to Michael Wu [:mwu] from comment #7)
> > BTW this sort of problem will also affect Firefox OS as we have PS mode
> > turned on by default. It would be great if we had some common heuristics in
> > the networking code that could turn PS mode off as necessary during latency
> > sensitive periods.
> 
> b2g is being tracked under bug 888269 because I suspected this - but I
> haven't had the hardware to test it. Jason has agreed to coordinate a test
> there.. it would be great if you could comment over there and help him out.
> I'd be happy to work to find a more elegant solution with gonk.

Will do, thanks for the pointer.
Comment on attachment 768953 [details] [diff] [review]
wifi tickler for mitigating 802.11 psp mode on android

Review of attachment 768953 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/network/interfaces/nsIDOMConnection.idl
@@ +16,2 @@
>  
>    [implicit_jscontext] attribute jsval onchange;

Mounir needs to review these two new attributes.  It is part of a w3c spec (and yes, i already complained loudly about this API).

::: dom/network/src/Connection.h
@@ +56,5 @@
>     */
>    double mBandwidth;
>  
> +  /**
> +   * If the connection is specifically WIFI

s/specifically/

@@ +61,5 @@
> +   */
> +  bool mIsWifi;
> +
> +  /**
> +   * If DHCP Gateway information for IPV4, in network

s/If /

::: netwerk/base/src/Tickler.cpp
@@ +18,5 @@
> +namespace net {
> +
> +NS_IMPL_THREADSAFE_ISUPPORTS1(Tickler, nsISupportsWeakReference)
> +
> +Tickler::Tickler() :

style nit: usually the colon goes on the next line.

@@ +24,5 @@
> +    , mActive(false)
> +    , mCanceled(false)
> +    , mEnabled(false)
> +    , mDelay(16)
> +    , mDuration(TimeDuration::FromMilliseconds(256))

magic numbers.

Probably better to just handle the error case on line 136 and 144.

@@ +36,5 @@
> +  // non main thread uses of the tickler should hold weak
> +  // references to it if they must hold a reference at all
> +  MOZ_ASSERT(NS_IsMainThread());
> +
> +  if (mThread)

style nit: Not sure what the coding style guide networking uses... so ignore if you like

other parts of the code (DOM) uses:

if () {
  stmt;
}

even for one liners.

@@ +54,5 @@
> +  MOZ_ASSERT(!mThread);
> +  MOZ_ASSERT(!mFD);
> +
> +  if (!AndroidBridge::Bridge())
> +    return NS_ERROR_NOT_IMPLEMENTED;

Brad says that you don't need to check for null bridge in the chrome process.

@@ +57,5 @@
> +  if (!AndroidBridge::Bridge())
> +    return NS_ERROR_NOT_IMPLEMENTED;
> +
> +  AndroidBridge::Bridge()->EnableNetworkNotifications();
> + 

ws

@@ +78,5 @@
> +
> +  mTimer.swap(tmpTimer);
> +
> +  mAddr.inet.family = PR_AF_INET;
> +  mAddr.inet.port = PR_htons (4886);

why this port?  What if it is busy?

@@ +130,5 @@
> +    if (NS_SUCCEEDED(mPrefs->GetIntPref("network.tickle-wifi.duration", &val))) {
> +      if (val < 1)
> +        val = 1;
> +      if (val > 100000)
> +        val = 100000;

You are too nice.

::: netwerk/base/src/Tickler.h
@@ +34,5 @@
> +#include "nsWeakReference.h"
> +
> +class nsIPrefBranch;
> +
> +namespace mozilla { namespace net {

two lines

::: netwerk/protocol/http/nsHttpHandler.cpp
@@ +1896,5 @@
> +    // on B2G, contains the necessary information on wifi and gateway
> +
> +    nsCOMPtr<nsIDOMWindow> domWindow;
> +    cb->GetInterface(NS_GET_IID(nsIDOMWindow), getter_AddRefs(domWindow));
> +    if (!domWindow)

I love how the networking layer has to reach into the dom code to find a window and from that find the navigator object to figure out if the networking layer is connected over wifi or not.  For fucks sake.
Comment on attachment 768953 [details] [diff] [review]
wifi tickler for mitigating 802.11 psp mode on android

Review of attachment 768953 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/network/interfaces/nsIDOMConnection.idl
@@ +16,2 @@
>  
>    [implicit_jscontext] attribute jsval onchange;

I didn't realize that. I'll change it to something new internal to mozilla and just qi to it.

::: netwerk/base/src/Tickler.cpp
@@ +24,5 @@
> +    , mActive(false)
> +    , mCanceled(false)
> +    , mEnabled(false)
> +    , mDelay(16)
> +    , mDuration(TimeDuration::FromMilliseconds(256))

they need defaults for other applications that run gecko without our pref files. They got out of sync with my final pref values tho - I'll fix.

@@ +78,5 @@
> +
> +  mTimer.swap(tmpTimer);
> +
> +  mAddr.inet.family = PR_AF_INET;
> +  mAddr.inet.port = PR_htons (4886);

its basically arbitrary, but it meets a few interesting criteria. It's only known assignment is for some defunct messaging protocol that afaict never ran on UDP and is incredibly unlikely to be embedded into a WIFI router. Various "data by port" metrics at places like speedguide.net show very little data for, probably just from port scanners. We'll never know about mismatches anyhow, because its fire and forget by nature... and UDP receiver has to screen the incoming traffic for protocol compliance and can only drop stuff that doesn't match.. and our 0 length messages will pretty much flunk any test.
Good trick I've used in the past for keeping NAT bindings alive:

Send a packet to <wherever> (gateway is fine, as is google, or the NSA) with TTL=1 or 2.  Since we're only concerned with the interface directly attached to us, i think TTL 1 would be right.

(With NAT bindings we'd use more like 5 or 7 - enough to exit the house without getting to the destination and using bandwidth there).

As for 4886: any port *could* be randomly used, especially in the user range.  But 0-length UDPs are relatively innocuous.
thanks kats, dougt, jesup
Attachment #770753 - Flags: review?(doug.turner)
Attachment #768953 - Attachment is obsolete: true
Attachment #768953 - Flags: review?(doug.turner)
Comment on attachment 770753 [details] [diff] [review]
wifi tickler for mitigating 802.11 psp mode on android

Review of attachment 770753 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm.  please let me know about the questions

::: mobile/android/app/mobile.js
@@ +653,5 @@
>  pref("accessibility.accessfu.skip_empty_images", true);
>  
> +// Transmit UDP busy-work to the LAN when anticipating low latency
> +// network reads and on wifi to mitigate 802.11 Power Save Polling delays
> +pref("network.tickle-wifi.enabled", true);

ship it!

::: mobile/android/base/GeckoNetworkManager.java
@@ +147,5 @@
> +    private int wifiDhcpGatewayAddress() {
> +        if (mNetworkType != NetworkType.NETWORK_WIFI) {
> +            return 0;
> +        }
> +        

nit: tab

@@ +154,5 @@
> +        if (d == null) {
> +            return 0;
> +        }
> +
> +        return d.gateway;

In Java, does returning a null get cast to an int?

@@ -166,2 @@
>          updateNetworkType();
> -        mShouldNotify = true;

non-nit: What was this change about?

::: netwerk/base/public/nsINetworkProperties.idl
@@ +3,5 @@
> + * file, You can obtain one at http://mozilla.org/MPL/2.0/. */
> +
> +#include "nsISupports.idl"
> +
> +/* This is not a web facing interface. It provides supplemental information

Nit: you don't need to declare this as "not a web facing interface" I don't think any in netwerk/base/public are web facing.

::: netwerk/base/src/Tickler.cpp
@@ +53,5 @@
> +  MOZ_ASSERT(!mActive);
> +  MOZ_ASSERT(!mThread);
> +  MOZ_ASSERT(!mFD);
> +
> +  AndroidBridge::Bridge()->EnableNetworkNotifications();

Test for AndroidBridge::Bridge() being null?

::: netwerk/base/src/Tickler.h
@@ +21,5 @@
> +// and the TCP slow start phase. The transaction is given up to 400 miliseconds by
> +// default to get through those phases before the tickler is disabled.
> +//
> +// The tickler only applies to wifi on mobile right now. Hopefully it can also
> +// be restricted to particular handset models in the future.

nit: reformat these blocks.

@@ +58,5 @@
> +  void SetIPV4Port(uint16_t port);
> +
> +  // Touch the tickler to (re-)start the activity.
> +  // May call from any thread
> +  void Touch();

if you have a Tickler class, the method name here has to be Tickle().

::: netwerk/protocol/http/nsHttpHandler.cpp
@@ +348,5 @@
>  
>      MakeNewRequestTokenBucket();
> +    mWifiTickler = new Tickler();
> +    if (NS_FAILED(mWifiTickler->Init()))
> +        mWifiTickler = nullptr;

Probably nicer to hide this in a static:

static Ticker* Tickler::GetTickleMonster() {
 ...
}

::: netwerk/protocol/http/nsHttpTransaction.h
@@ -24,5 @@
>  #include "TimingStruct.h"
>  
>  //-----------------------------------------------------------------------------
>  
> -class nsHttpTransaction;

unrelated, but if things continue to compile...
Attachment #770753 - Flags: review?(doug.turner) → review+
> 
> ::: mobile/android/base/GeckoNetworkManager.java
>
> @@ +154,5 @@
> > +        if (d == null) {
> > +            return 0;
> > +        }
> > +
> > +        return d.gateway;
> 
> In Java, does returning a null get cast to an int?
>

I'm not positive - which is part of why I made sure to return an int :)

 
> @@ -166,2 @@
> >          updateNetworkType();
> > -        mShouldNotify = true;
> 
> non-nit: What was this change about?
> 

This is a good catch; thanks I reverted is as un-necessary. It had been added to work around a bug in my patch that kats identified in comment 5. With that fixed (as it already was in my last patch) it isn't necessary.


> ::: netwerk/base/src/Tickler.cpp
> @@ +53,5 @@
> > +  MOZ_ASSERT(!mActive);
> > +  MOZ_ASSERT(!mThread);
> > +  MOZ_ASSERT(!mFD);
> > +
> > +  AndroidBridge::Bridge()->EnableNetworkNotifications();
> 
> Test for AndroidBridge::Bridge() being null?

in comment 10 you said that brad said the null test was un-necessary, implicitly asking me to remove it :) So I've left it out
Attachment #774010 - Flags: review+
Attachment #770753 - Attachment is obsolete: true
https://hg.mozilla.org/mozilla-central/rev/e6257c74be3b
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 25
So I've tried the zap url on a LG G2x (tegra 2), Nexus 7 (tegra 3), and Galaxy Nexus, and I get similar numbers to my dekstop on each. There is an initial high value followed by pretty uniformly low numbers. I'll attach some screenshots.
The tickle pref does not appear in about:config, so I don't think I have this patch either.
I did get some ridiculously high numbers on the Nexus 4, however, which is qualcomm. Patrick, which devices did you test on?
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #23)
> I did get some ridiculously high numbers on the Nexus 4, however, which is
> qualcomm. Patrick, which devices did you test on?

a nexus 4, a nexus S, and someone else reported a "galaxy tab"
Depends on: 900322
I have a Samsung Galaxy S5 with Firefox Beta v.50.0

Unless I'm misinterpreting the data, my firewall is reporting 1342 packets of UDP Port 4886 to my default gateway over .  

This post indicates the following:
"If you're on wifi and an IPv4 DHCP network we will send 0 length UDP packets at port 4886 of your gateway at the default rate of 60hz for 400ms from the start of the transaction in an attempt to improve RTT during the critical early phases. I call this "tickle time"."

First Packet 08:20:43
Last Packet  08:24:18
That's 2 min 35 sec  

This is a long distance from 400ms.  It's 215000ms or 537.5 times the advertised 400ms.

This appears to be a bug.  Could you review?  I have firewall statistics for months on this device if you need them.

Thanks,
Doug
Flags: needinfo?(mcmanus)
its within 400ms of channel activity time - so its not a one time event
Flags: needinfo?(mcmanus)
Can this be readdressed at some point?  I've been using egress filtering on my network firewall and seeing this little tickler think be the only thing spamming my firewall is a bit alarming.
Flags: needinfo?(jduell.mcbugs)
Pat: we landed this hack 5 years ago--do we still need it with current phones?
Flags: needinfo?(jduell.mcbugs) → needinfo?(mcmanus)

On my current OnePlus3T with Lineage 17.1 (Android 10) and Firefox 68.8.0 disabling this feature doesn't make a difference. Not sure what kind of test infrastructure Mozilla has but maybe it would be worth running a test. You might be able to remove this feature if it turns there isn't a benefit anymore.

Product: Firefox for Android → Firefox for Android Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: