Closed Bug 728113 Opened 8 years ago Closed 8 years ago

use spdy ping to avoid/detect nat timeout

Categories

(Core :: Networking: HTTP, defect)

13 Branch
x86_64
Linux
defect
Not set

Tracking

()

RESOLVED FIXED
mozilla13

People

(Reporter: mcmanus, Assigned: mcmanus)

References

Details

(Whiteboard: [spdy])

Attachments

(2 files, 2 obsolete files)

Some aggressive NATs timeout idle persistent connections. The most aggressive ones at about 1 minute.

If they do so with a RST, we restart things ok unless its a post or something like that.

But sometimes they just hang. In which case people hit reload and life goes on with http/1. This works in HTTP/1 because it doesn't have a cancel semantic, so we react to the reload by dropping the connection and making a fresh one.

in spdy, the impact is a little worse. SPDY has a cancel semantic (which is very useful in general!), so we use that and don't tear down the connection. This means the restarted attempt is put onto the same connection - with the same bad results.

however, spdy has an advantage over http/1 - a ping frame that we can use to check the health of the server if we're ever in doubt. This also has the lovely side effect of keeping a NAT mapping from timing out when it sees the ping exchange.

The attached patch does the ping on a timer after 45 seconds of network inactivity on that spdysession.. if it doesn't get a reply it tears down the session proactively. This will not make your session stay alive for any longer than it would have before (basically the session is torn down 3 minutes after the last good data moved over it - control stuff like the ping doesn't count) so I'm not worried about battery impact in the way I would be with websockets and its potentially very long idle times. It adds up to very little activity.

the timer tick is taken from the stalled-read pipeline patch, so they can share that code.

Chrome does this a little differently as I understand it and I intentionally want to try a different approach. When using a previously idle connection chrome sends both the SYN and a PING and a short timer (long enough for the ping, but shorter than you would demand of a server to send a SYN_REPLY). If that timer fails, they open a new session and try again. The tradeoff here is badish latency when the problem is detected (perhaps worse than http/1) - but the upside is less idle pinging and battery pressure. Chrome doesn't timeout spdy sessions so they would have gigantic exposure to battery loss where we don't have that problem .
Assignee: nobody → mcmanus
Status: NEW → ASSIGNED
Attachment #598107 - Flags: review?(honzab.moz)
Attached patch part 2 spdy ping v0 (obsolete) — Splinter Review
Attachment #598109 - Flags: review?(honzab.moz)
Comment on attachment 598107 [details] [diff] [review]
part 1 timer tick v0

Review of attachment 598107 [details] [diff] [review]:
-----------------------------------------------------------------

r=honzab, makes patch from bug 603514 obsolete...
Attachment #598107 - Flags: review?(honzab.moz) → review+
Comment on attachment 598109 [details] [diff] [review]
part 2 spdy ping v0

Review of attachment 598109 [details] [diff] [review]:
-----------------------------------------------------------------

Dropping the flag because of the last comment.

::: netwerk/protocol/http/SpdySession.cpp
@@ +92,5 @@
>      mOutputQueueUsed(0),
> +    mOutputQueueSent(0),
> +    mLastReadEpoch(PR_IntervalNow()),
> +    mPingSentEpoch(0),
> +    mPingCtr(1)

What Ctr stands for?  Counter?  Please don't use abbreviations where not really necessary.  I'd prefere mNextPingID.

@@ +242,5 @@
> +    }
> +
> +    if (mPingSentEpoch) {
> +      LOG(("SpdySession::ReadTimeoutTick %p handle outstanding ping\n"));
> +      if ((now - mPingSentEpoch) >= gHttpHandler->SpdyPingThreshold()) {

s/SpdyPingThreshold/SpdyPingTimeout/ ?

Any tests for this functionality?  AFAIK we should have a framework for SPDY tests, time to start writing some.

@@ +259,5 @@
> +           this));
> +      return;
> +    }
> +
> +    mPingSentEpoch = PR_IntervalNow();

It's a very low probability, but this can return 0.  Maybe just + 1 on that case or all the time | 1 is a solution here.

@@ +1533,5 @@
>        LOG3(("Start Processing Data Frame. "
>              "Session=%p Stream ID 0x%x Stream Ptr %p Fin=%d Len=%d",
>              this, streamID, mInputFrameDataStream, mInputFrameDataLast,
>              mInputFrameDataSize));
> +      mLastDataReadEpoch = mLastReadEpoch;

Maybe I'm just too tired at the moment, but I don't see the place you update mLastDataReadEpoch to now/mLastReadEpoch when reading the actual data.

Shouldn't be in if (mDownstreamState == PROCESSING_DATA_FRAME || mDownstreamState == PROCESSING_CONTROL_SYN_REPLY) section?

::: netwerk/protocol/http/SpdySession.h
@@ +81,5 @@
>    // When the connection is active this is called every 15 seconds
> +  void ReadTimeoutTick(PRIntervalTime now);
> +  
> +  // Idle time represents time since "goodput".. e.g. a data or header frame
> +  PRUint32 IdleTime();

What resolution this is?

::: netwerk/protocol/http/nsHttpConnection.cpp
@@ +566,5 @@
> +PRUint32
> +nsHttpConnection::IdleTime()
> +{
> +    return mSpdySession ?
> +        mSpdySession->IdleTime() : (NowInSeconds() - mLastReadTime);

This starts to mess.  Can you please change this to the interval logic?  Actually, the result would be to remove NowInSeconds method and store only to PRIntervalTime.  Best in a separate patch/bug.

I can do that if you are out time.
Attachment #598109 - Flags: review?(honzab.moz)
> s/SpdyPingThreshold/SpdyPingTimeout/ ?

I was looking for another word than timeout.. there are two time values here - one threshold where we send the ping and the other is the timeout value of the ping where we expect the ping-response.

> 
> Any tests for this functionality?  AFAIK we should have a framework for SPDY
> tests, time to start writing some.

cc nick here.. anyway to determine in node that a ping frame has been received?

> 
> @@ +1533,5 @@
> >        LOG3(("Start Processing Data Frame. "
> >              "Session=%p Stream ID 0x%x Stream Ptr %p Fin=%d Len=%d",
> >              this, streamID, mInputFrameDataStream, mInputFrameDataLast,
> >              mInputFrameDataSize));
> > +      mLastDataReadEpoch = mLastReadEpoch;
> 
> Maybe I'm just too tired at the moment, but I don't see the place you update
> mLastDataReadEpoch to now/mLastReadEpoch when reading the actual data.

mLastReadEpoch is updated to now each time real network read activity takes place. That value is compared with the ping threshold to determine if the network has been quiet too long and we should send the keep-alive ping.

mLastDataReadEpoch is updated only on receipt of frames that gateway into http data (data frames, and frames that contain http headers). it just copies in mlastreadepoch (which was just updated when it came in on the network) rather than calling pr_intervalnow() again. mLastDataReadEpoch is used for calculating HTTP idle time for the purpose of closing the idle "persistent connection" when we don't need it anymore. That's why we don't count pings, and settings, and what not - we close it based on the fact that it hasn't moved useful data in 3 minutes.. but that's different than the keep-alive ping timeout of a connection we've identified as broken.

> 
> Shouldn't be in if (mDownstreamState == PROCESSING_DATA_FRAME ||
> mDownstreamState == PROCESSING_CONTROL_SYN_REPLY) section?

I avoided this on purpose, as that really just captures multiple network reads of the same data frame - which implies an active connection and the idle timeout is really only interested in sessions with 0 active streams, but now that you mention it the possibility of a really large frame size could screw up that logic. (though really large frames make no sense in spdy as they prohibit multiplexing, they are legal). so thanks!

> 
> ::: netwerk/protocol/http/SpdySession.h

> > +  // Idle time represents time since "goodput".. e.g. a data or header frame
> > +  PRUint32 IdleTime();
> 
> What resolution this is?

its in seconds..

> This starts to mess.  Can you please change this to the interval logic? 
> Actually, the result would be to remove NowInSeconds method and store only
> to PRIntervalTime.  Best in a separate patch/bug.
> 

the interval conversion for timestamps in nshttpconnection is already done in the pipeline patch series.. so lets just do it there. I thought the IdleTime() method was an improvement over the status quo.
Attached patch part 2 spdy ping v1 (obsolete) — Splinter Review
Attachment #598109 - Attachment is obsolete: true
Attachment #599454 - Flags: review?(honzab.moz)
(In reply to Patrick McManus [:mcmanus] from comment #5)
> > s/SpdyPingThreshold/SpdyPingTimeout/ ?
> 
> I was looking for another word than timeout.. there are two time values here
> - one threshold where we send the ping and the other is the timeout value of
> the ping where we expect the ping-response.

I know and I don't object to it at all.  You are just using the wrong function here ;)

> > This starts to mess.  Can you please change this to the interval logic? 
> > Actually, the result would be to remove NowInSeconds method and store only
> > to PRIntervalTime.  Best in a separate patch/bug.
> > 
> 
> the interval conversion for timestamps in nshttpconnection is already done
> in the pipeline patch series.. so lets just do it there. I thought the
> IdleTime() method was an improvement over the status quo.

It definitely is!  But it starts to be hard to track what is in what units and NowInSeconds should never be introduced in the past in the first place.  And also I don't well enough remember what has been fixed as part of the work on pipelines.  This seems to land sooner, so it's worth to carry some pieces forward, as you do with the first part patch here.

Thanks.
Comment on attachment 599454 [details] [diff] [review]
part 2 spdy ping v1

Review of attachment 599454 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/protocol/http/SpdySession.cpp
@@ +242,5 @@
> +    }
> +
> +    if (mPingSentEpoch) {
> +      LOG(("SpdySession::ReadTimeoutTick %p handle outstanding ping\n"));
> +      if ((now - mPingSentEpoch) >= gHttpHandler->SpdyPingThreshold()) {

This should call to SpdyPingTimeout().

::: netwerk/protocol/http/SpdySession.h
@@ +81,5 @@
>    // When the connection is active this is called every 15 seconds
> +  void ReadTimeoutTick(PRIntervalTime now);
> +  
> +  // Idle time represents seconds since "goodput".. e.g. a data or header frame
> +  PRUint32 IdleTime();

So, I didn't understand whether you wanted to change this API to intervals or leave it as seconds.

I'd prefer to do it before this patch lands.  Pipelining may land much later and we will have this mess-up left in the tree.
(In reply to Honza Bambas (:mayhemer) from comment #7)
> (In reply to Patrick McManus [:mcmanus] from comment #5)

> I know and I don't object to it at all.  You are just using the wrong
> function here ;)

ah yes - thank you. I tested that code by just supressing the ping response and looking for the timeout message - I didn't notice that it used the wrong timeout value.
Fix the wrong timeout val from the review.

Also convert all the uses of nowinseconds() in nshttpconnection to be printerval based - which lets the new idletime() method of spdysession be interval based. It had to stop somewhere (at least for this spdy patch), so there are still members in handler and connectionmgr that are unconverted time_t seconds based.
Attachment #599454 - Attachment is obsolete: true
Attachment #599681 - Flags: review?(honzab.moz)
Attachment #599454 - Flags: review?(honzab.moz)
(In reply to Patrick McManus [:mcmanus] from comment #5)
> > Any tests for this functionality?  AFAIK we should have a framework for SPDY
> > tests, time to start writing some.
> 
> cc nick here.. anyway to determine in node that a ping frame has been
> received?

Nothing that gets exposed to the app layer (which is where all the code for our tests is, at the moment). I would imagine, though, that we could poke at the internals of node-spdy and have this info available (perhaps send info about control frames we've received in a header on replies?) No guarantees, though, I'll have to dig into node-spdy's code a bit more to be sure.
Blocks: 729736
Comment on attachment 599681 [details] [diff] [review]
part 2 spdy ping v2

Review of attachment 599681 [details] [diff] [review]:
-----------------------------------------------------------------

Nice, thanks.  I'll report a followup to remove NowInSeconds completely and have all what makes sense in PRIntervalTime.

r=honzab with few nits

::: netwerk/protocol/http/nsHttpConnection.cpp
@@ +576,5 @@
> +nsHttpConnection::TimeToLive()
> +{
> +    if (IdleTime() >= mIdleTimeout)
> +        return 0;
> +    PRUint32 timeToLive = PR_IntervalToSeconds(mIdleTimeout - IdleTime());

Save IdleTime().  If PR_IntervalNow() jumps, we may overflow.

@@ +1372,5 @@
>      if (mCallbacks)
>          return mCallbacks->GetInterface(iid, result);
>      return NS_ERROR_NO_INTERFACE;
>  }
> +

No new blank line here.

::: netwerk/protocol/http/nsHttpConnection.h
@@ +209,5 @@
>      nsCOMPtr<nsIEventTarget>        mCallbackTarget;
>  
>      nsRefPtr<nsHttpConnectionInfo> mConnInfo;
>  
>      PRUint32                        mLastReadTime;

Also to PRIntervalTime.

::: netwerk/protocol/http/nsHttpConnectionMgr.cpp
@@ +2179,5 @@
>          // sockets established themselves.
>  
>          // We need to establish a small non-zero idle timeout so the connection
>          // mgr perceives this socket as suitable for persistent connection reuse
> +        const PRIntervalTime k5Sec = PR_SecondsToInterval(5);

static const?
Attachment #599681 - Flags: review?(honzab.moz) → review+
https://hg.mozilla.org/mozilla-central/rev/7dd51d8f8641
https://hg.mozilla.org/mozilla-central/rev/00ace68863af
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla13
You need to log in before you can comment on or make changes to this bug.