Closed Bug 1201037 Opened 9 years ago Closed 9 years ago

Repeated network-change events lead to multiple http2 pings and ultimately connection tear-down

Categories

(Core :: Networking: HTTP, defect)

Unspecified
All
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla44
Tracking Status
firefox43 --- affected
firefox44 --- fixed

People

(Reporter: bagder, Assigned: bagder)

References

Details

Attachments

(3 files, 6 obsolete files)

5.48 KB, patch
bagder
: review+
Details | Diff | Splinter Review
8.69 KB, patch
bagder
: review+
Details | Diff | Splinter Review
1.57 KB, patch
bagder
: review+
Details | Diff | Splinter Review
(This is forked off from bug 1191253)

When Firefox detects network changes and sends such an event internally, it will lead to pings getting sent on the existing SPDY and HTTP/2 connections to verify that they are still functional.

If the network is changed rapidly (when the connections are still alive), several pings can end up being sent over those connections and some (google) servers consider that bad and as a result close down the connections.

What to do:

1. Coalesce events. When the first change is detected, wait N milliseconds before triggering the event so that subsequent really quickly following ones can be absorbed as well. N could probably be in the 200 ms range.

2. Avoid sending further pings over an h2/spdy connection that hasn't transferred any data since the previous ping. If such a situation is detected, switch over to the passive HTTP/1.1 method where it waits to see if there's any traffic during a period and closes it if not.
I know we haven't seen this (at least in the absence of 1191253), but I worry about the exposure here so I would make this a high priority.. I would set N pretty conservative - maybe 1000ms
First stab at coalescing events during 1000ms to generate only a single event (on Linux). 

For some reason this doesn't work as intended as the timer never seems to call the callback so the event is never actually sent! Anything obvious or stupid I've missed?
Different approach. Since the thread is doing its own loop with a poll, I changed the coalescing timing logic to not use the timer API but just a timestamp and a check for duration, for simplicity.

Verified to work on Linux. On my laptop however, the 1000ms coalescing period is not enough to avoid it from issuing two network events when I switch from Wifi => ethernet. The time from the first to the last change is then about 1.6 seconds. For some reason it is better when I switch the other way around as then the 1000ms period is good enough to just generate a single event.
Attachment #8657037 - Attachment is obsolete: true
Attachment #8657791 - Flags: review?(mcmanus)
(In reply to Daniel Stenberg [:bagder] from comment #3)

> Verified to work on Linux. On my laptop however, the 1000ms coalescing
> period is not enough to avoid it from issuing two network events when I
> switch from Wifi => ethernet.

that doesn't sound like its verified :) .. what to do?
Sorry, I didn't express myself very good. Verified in the sense that it doesn't send any additional change-events within the coalescing period, I can see underlying triggers being coalesced within that period.

I'm not sure we need to care about this extra event. I would say that it is likely that only one of the events actually signaled the "true" change that caused us to close/restart any connections and the other one should be fairly innocent with such an interval. Extending the period to something like 2000ms will squash the second event too in my use case, but I can also see how 1000 is enough for when I switch TO wifi so then it'll delay the sending quite a lot. Also, we don't do any coalescing at all on Windows yet and yet it seems to do pretty OK.
Comment on attachment 8657791 [details] [diff] [review]
v4-0001-bug-1201037-Linux-squash-network-change-events-du.patch

Review of attachment 8657791 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/system/linux/nsNotifyAddrListener_Linux.cpp
@@ +36,5 @@
>  /* a shorter name that better explains what it does */
>  #define EINTR_RETRY(x) MOZ_TEMP_FAILURE_RETRY(x)
>  
> +// number of milliseconds to absorb subsequent events
> +#define NETWORK_CHANGE_COALESCING_PERIOD 1000

static const PRIntervalTime kNetworkChangeColaescingPeriod  = PR_MillisecondsToInterval(1000);

@@ +298,5 @@
>          }
> +        if (mCoalescingActive) {
> +            // check if coalescing period should continue
> +            double period = (TimeStamp::Now() - mChangeTime).ToMilliseconds();
> +            if (period >= NETWORK_CHANGE_COALESCING_PERIOD) {

if you keep the constant as an interval you can compare the difference directly to it without going through tomilliseconds() and using the float.

@@ +304,5 @@
> +                mCoalescingActive = false;
> +                pollWait = pollTimeout; // restore to default
> +            } else {
> +                // wait no longer than to the end of the period
> +                pollWait = static_cast<int>

you're definitely going to want to bounds check that.
Attachment #8657791 - Flags: review?(mcmanus)
(In reply to Patrick McManus [:mcmanus] from comment #6)

> > +#define NETWORK_CHANGE_COALESCING_PERIOD 1000
> 
> static const PRIntervalTime kNetworkChangeColaescingPeriod  =

Gah, my spine is made of plain C so darnit, I do these same nits every time! :-)

> if you keep the constant as an interval you can compare the difference
> directly to it without going through tomilliseconds() and using the float.

Perhaps I can, but I don't see how I gain a lot with that since I need the remaining time in milliseconds anyway so that I know how long to wait until I switch off the coalescing mode stored in pollWait, as poll() wants its argument in milliseconds. Or did you have any elegant way of writing that in mind?

> > +                // wait no longer than to the end of the period
> > +                pollWait = static_cast<int>
> 
> you're definitely going to want to bounds check that.

Which bounds check is missing?

In the if () just above I check for the condition when the period has been going on long enough, so when this else clause evaluates we know the right hand value 'period' is never larger than the left side value. And from what I understand, TimeStamp::Now() is monotonically increasing so 'period' shouldn't be possible to get negative. Or am I wrong?
(In reply to Daniel Stenberg [:bagder] from comment #7)

> Perhaps I can, but I don't see how I gain a lot with that since I need the

you could save one on the terminal condition. Not a big deal, I just don't like to see intervals converted to ms willy nilly.

> to bounds check that.
> 
> Which bounds check is missing?
> 
> In the if () just above I check for the condition when the period has been
>

you're right - I missed the conditional.
Updated after review
Attachment #8657791 - Attachment is obsolete: true
Attachment #8660586 - Flags: review+
Here's the Windows version of "coalesce network-changed events" with similarly applied logic as used in the Linux specific code. The try run looks like this: https://treeherder.mozilla.org/#/jobs?repo=try&revision=1be348c137d7

I'm struggling with my own local windows build at the moment (unrelated to this patch) so I've now yet taken this change on a drive myself. I'll make sure to do that before I try to land this.
Attachment #8660588 - Flags: review?(mcmanus)
Comment on attachment 8660588 [details] [diff] [review]
0001-bug-1201037-coalesce-network-events-on-Windows-r-mcm.patch

Review of attachment 8660588 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/system/win32/nsNotifyAddrListener.cpp
@@ +165,5 @@
> +        mCoalescingActive = false;
> +        return INFINITE; // return default
> +    } else {
> +        // wait no longer than to the end of the period
> +        return static_cast<int>

cast to DWORD?
Attachment #8660588 - Flags: review?(mcmanus) → review+
(In reply to Daniel Stenberg [:bagder] from comment #10)
> Created attachment 8660588 [details] [diff] [review]
> 0001-bug-1201037-coalesce-network-events-on-Windows-r-mcm.patch
> 
> Here's the Windows version of "coalesce network-changed events" with
> similarly applied logic as used in the Linux specific code. The try run
> looks like this:
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=1be348c137d7
> 
> I'm struggling with my own local windows build at the moment (unrelated to
> this patch) so I've now yet taken this change on a drive myself. I'll make
> sure to do that before I try to land this.

1] you've got a recurring orange on one of the windows run there. Bears figuring out.

2] you can of course run the windows build for testing yourself - that's pretty much how I make 99% of my windows builds
Fixed the typecast. Running another try run that show a different set of oranges: https://treeherder.mozilla.org/#/jobs?repo=try&revision=d18c6cbc8429
Attachment #8660588 - Attachment is obsolete: true
Attachment #8661691 - Flags: review+
I've not seen the same bursts of events on Mac at all after some local testing (and we have not seen any reports about it either). The network changes seem to reliably appear as a single event both on removal and insertion of the cable on Mac so I'd like to leave that code without coalescing logic.

Regarding the point (2) of my initial action points I'd like to postpone that as well until we see that it is actually still necessary. Here's my reasoning:

The Windows and Linux backends could generate quite a burst of events before these patches . Several events within a very short time period lead to several pings almost instantly and disconnects etc. Now (with the two patches in this bug applied) we're slowing down the ping rate quite drastically to at most one ping per second. Even if we still would at times ping a server three times in a row with a second interval and that connection would get closed, it is now much less of a problem because that would also imply that no other traffic has been going on over that connection during those three seconds.

By ducking for this, we also avoid complicating that code area even further.

What do you think, mcmanus?
Flags: needinfo?(mcmanus)
I think the youtube thing is a problem (long download, nothing upstream - could be interrupted by this) even though the coalesce should make it harder to hit.

perhaps ping could just return with a nop in case no control data had been written out since the last ping (flush-output-queue would be an easy way to track that state)
Flags: needinfo?(mcmanus)
(In reply to Patrick McManus [:mcmanus] from comment #15)

> perhaps ping could just return with a nop in case no control data had been
> written out since the last ping (flush-output-queue would be an easy way to
> track that state)

I'm not sure that's helpful.

1 - If data has been sent since last ping, there's no harm in sending another ping.

2 - If no data has been sent, that's exactly the kind of connection we need to check with another ping. The network changed, our idle connection is still there but does it work?
(In reply to Daniel Stenberg [:bagder] from comment #16)
> (In reply to Patrick McManus [:mcmanus] from comment #15)
> 
> > perhaps ping could just return with a nop in case no control data had been
> > written out since the last ping (flush-output-queue would be an easy way to
> > track that state)
> 
> I'm not sure that's helpful.
> 
> 1 - If data has been sent since last ping, there's no harm in sending
> another ping.
> 
> 2 - If no data has been sent, that's exactly the kind of connection we need
> to check with another ping. The network changed, our idle connection is
> still there but does it work?

well, if there are 0 streams outstanding in that case then a couple things can be done
 1] just close the session
 2] ping with abandon and risk goawaya

if there is >=1 stream outstanding then use data moving on that stream as a liveliness indication.

the problem with the youtube case is that there is data flowing downstream but nothing flowing upstream.. and then all we put on the upstream path are pings and that results in the server closing the connection and interrupting the video
What about trying to skip pinging completely for the cases where there is still (incoming) traffic? This patch skips the spdy/h2 ping if there was incoming traffic less than 500 milliseconds ago. This ought to prevent us from sending pings on connections that are still used, like the youtube case mentioned above.

Since the pending network-changed event coalescing is on a 1000 millisecond period it will still make every such event cause a ping if there's no (other) traffic on the connection.

The patch also adds a few log lines to make it possible to easier track behavior.
Attachment #8664690 - Flags: review?(mcmanus)
Comment on attachment 8664690 [details] [diff] [review]
0001-bug-1201037-only-send-HTTP-pings-on-seemingly-idle-c.patch

Review of attachment 8664690 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/protocol/http/nsHttpConnection.cpp
@@ +2118,2 @@
>          if (mSpdySession) {
> +            if (PR_IntervalToMilliseconds(IdleTime()) >= 500) {

please comment that this 500 number only makes sense because of the 1000ms colaesce feature.. better yet, use constants related to each other
Attachment #8664690 - Flags: review?(mcmanus) → review+
Rebased
Attachment #8660586 - Attachment is obsolete: true
Attachment #8671302 - Flags: review+
Rebased
Attachment #8661691 - Attachment is obsolete: true
Attachment #8671303 - Flags: review+
Added comment.
Attachment #8664690 - Attachment is obsolete: true
Attachment #8671309 - Flags: review+
Keywords: checkin-needed
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: