Bug 603514 - restart pipelined transaction on read stall more aggressively
Status: RESOLVED FIXED
Product: Core
Classification: Components
Component: Networking: HTTP
Version: Trunk
Hardware: x86_64 Linux
Importance: -- enhancement (2 votes)
Target Milestone: mozilla14
Assigned To: Patrick McManus [:mcmanus]
Depends on: 603512
Blocks: 603503 603508 pipelining-review
Reported: 2010-10-11 17:19 PDT by Patrick McManus [:mcmanus]
Modified: 2012-03-23 05:57 PDT
CC: 9 users


Attachments
pipeline stalls v1 (14.48 KB, patch)
2010-10-28 12:15 PDT, Patrick McManus [:mcmanus] - no flags
pipeline stalls v2 (14.96 KB, patch)
2010-12-03 15:30 PST, Patrick McManus [:mcmanus] - no flags
pipeline stalls v4 (13.65 KB, patch)
2011-01-12 17:14 PST, Patrick McManus [:mcmanus] - no flags
pipeline stalls v4 (verified) (13.59 KB, patch)
2011-02-18 18:43 PST, Patrick McManus [:mcmanus] - no flags
pipeline stalls v 5 (12.66 KB, patch)
2011-06-26 22:13 PDT, Patrick McManus [:mcmanus] - no flags
Patch taken from Larch (by Patrick McManus) (12.82 KB, patch)
2011-07-22 12:23 PDT, Honza Bambas (:mayhemer) - honzab.moz: review-
v6 (19.88 KB, patch)
2012-01-04 06:49 PST, Patrick McManus [:mcmanus] - no flags
v7 (20.52 KB, patch)
2012-01-09 13:25 PST, Patrick McManus [:mcmanus] - honzab.moz: review+
v8 (24.76 KB, patch)
2012-02-13 14:49 PST, Patrick McManus [:mcmanus] - honzab.moz: review+
v9 (24.53 KB, patch)
2012-02-15 12:22 PST, Patrick McManus [:mcmanus] - mcmanus: review+
telemetry v0 (4.57 KB, patch)
2012-02-15 13:23 PST, Patrick McManus [:mcmanus] - no flags

Description Patrick McManus [:mcmanus] 2010-10-11 17:19:14 PDT
A read timeout in the middle of a pipelined transaction is a symptom of a document cross-match (i.e. the document you are reading was given the length of a longer document it was crossed with). When a pipelined retrieval stalls after the first byte, consider restarting it on a different connection much more aggressively than normal.
Comment 1 Patrick McManus [:mcmanus] 2010-10-28 12:15:39 PDT
Created attachment 486694 [details] [diff] [review]
pipeline stalls v1

When a connection carrying a pipelined transaction has been stalled (i.e. idle with an open transaction, not an idle persistent connection) for almost a second, move any transactions pipelined after the current one onto different connections (and implicitly close this connection when done with the current transaction).

When it has been stalled for 4 seconds (pref-configurable), cancel the current transaction too; depending on its state, it can hopefully be restarted on a clean connection.
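The two-stage policy described above (reschedule followers after roughly a second, cancel the active transaction after the longer configurable timeout) can be sketched as a plain decision function. This is an editorial illustration only: the enum, constants, and function name are invented, and the real logic lives in nsHttpConnection in the attached patches.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the two thresholds; not Mozilla API.
enum class StallAction { None, RescheduleFollowers, CancelTransaction };

constexpr uint32_t kRescheduleMs = 900;   // "almost a second"
constexpr uint32_t kCancelMs     = 4000;  // pref-configurable in the patch

StallAction ClassifyStall(uint32_t idleMs, uint32_t pipelineDepth,
                          uint32_t pipelinePosition)
{
    // Only pipelined traffic is eligible for either action.
    if (pipelineDepth <= 1 && pipelinePosition == 0)
        return StallAction::None;

    if (idleMs >= kCancelMs)
        return StallAction::CancelTransaction;   // also closes the connection

    if (idleMs >= kRescheduleMs)
        return StallAction::RescheduleFollowers; // move queued transactions

    return StallAction::None;
}
```

Note how a non-pipelined transaction is never touched, matching the description: this is pipeline-specific recovery, not general read-timeout handling.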
Comment 2 Patrick McManus [:mcmanus] 2010-12-03 15:30:09 PST
Created attachment 495145 [details] [diff] [review]
pipeline stalls v2

Update for bitrot, conform better to the style guide, updates based on experience (i.e. bugs and tweaks), etc.
Comment 3 Patrick McManus [:mcmanus] 2011-01-12 17:14:50 PST
Created attachment 503355 [details] [diff] [review]
pipeline stalls v4

Updated patch - broke out the dead-code-removal patch and filed it separately as bug 624376. It wasn't really related to pipelining specifically.
Comment 4 Patrick McManus [:mcmanus] 2011-02-18 18:43:01 PST
Created attachment 513675 [details] [diff] [review]
pipeline stalls v4 (verified)
Comment 5 Patrick McManus [:mcmanus] 2011-06-26 22:13:29 PDT
Created attachment 542086 [details] [diff] [review]
pipeline stalls v 5

larch bitrot
Comment 6 Honza Bambas (:mayhemer) 2011-07-22 12:23:16 PDT
Created attachment 547769 [details] [diff] [review]
Patch taken from Larch (by Patrick McManus)

Actual patch for review.
Comment 7 Honza Bambas (:mayhemer) 2011-07-22 12:44:19 PDT
Comment on attachment 547769 [details] [diff] [review]
Patch taken from Larch (by Patrick McManus)

r+ was set too early; I will post comments soon.
Comment 8 Honza Bambas (:mayhemer) 2011-07-22 13:57:23 PDT
Comment on attachment 547769 [details] [diff] [review]
Patch taken from Larch (by Patrick McManus)

Review of attachment 547769 [details] [diff] [review]:
-----------------------------------------------------------------

At first look this seemed like an r+, but I found a glitch.

Two details to discuss:

- I'm thinking of having just one central timer (timers are expensive) that iterates over all active connections (and must be canceled when there are none).  The tick should be some fraction of the timeout preference (one half, for instance); it is not critical to be precise here.

- Let's talk about the default timeout value (IMO way too short)

This is also a candidate for a test.  Probably the patch for large connection length (bug 603512) as well.
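Honza's central-timer suggestion amounts to one repeating timer in the connection manager that walks the active-connection list and is disarmed when the list is empty. A minimal stand-alone model of that shape (all class and member names here are hypothetical, not the Mozilla API; the eventual patch drives this from nsHttpConnectionMgr):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One shared tick instead of a timer per connection. Names invented.
struct Connection {
    uint32_t idleMs = 0;
    bool timedOut = false;
    void ReadTimeoutTick(uint32_t tickMs, uint32_t timeoutMs) {
        idleMs += tickMs;              // stand-in for "no data read since last tick"
        if (idleMs >= timeoutMs)
            timedOut = true;
    }
};

class ConnectionMgr {
public:
    explicit ConnectionMgr(uint32_t timeoutMs)
        : mTimeoutMs(timeoutMs), mTickMs(timeoutMs / 2) {} // half the pref

    void AddActive(Connection* c) { mActive.push_back(c); }

    // Called from the single repeating timer.
    void Tick() {
        for (Connection* c : mActive)
            c->ReadTimeoutTick(mTickMs, mTimeoutMs);
    }

    // The real code would cancel the timer when this is false.
    bool TimerNeeded() const { return !mActive.empty(); }

private:
    std::vector<Connection*> mActive;
    uint32_t mTimeoutMs, mTickMs;
};
```

The imprecision Honza mentions is visible here: with a tick of timeout/2, a stall is detected somewhere between 1x and 1.5x the nominal timeout, which is acceptable for this purpose.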

::: modules/libpref/src/init/all.js
@@ +766,5 @@
>  pref("network.http.pipelining.max-optimistic-requests" , 4);
>  
>  pref("network.http.pipelining.aggressive", false);
>  pref("network.http.pipelining.maxsize" , 200000);
> +pref("network.http.pipelining.read-timeout", 4000);

Add also to mobile prefs please.

I'm pretty worried this timeout may be too short.  On slower mobile and wifi links we may cancel transactions too often.  I think it should also be influenced by the RTT measured on the connection.

::: netwerk/protocol/http/nsHttpConnection.cpp
@@ +198,5 @@
> +    if (mReadTimer)
> +        mReadTimer->InitWithFuncCallback(ReadTimeoutTick,
> +                                         this,
> +                                         1500,
> +                                         nsITimer::TYPE_REPEATING_SLACK);

Please use InitWithCallback (the method taking nsITimerCallback); but this depends on whether we agree on the central timer, and it is also disputable because not holding a reference with InitWithFuncCallback might have been an advantage here.

@@ +611,5 @@
> +    //
> +    // Right now we only take action if pipelining is involved, but this would
> +    // be the place to add general read timeout handling if it is desired.
> +
> +    if (delta >= mozilla::TimeDuration::FromMilliseconds(900)) {

Maybe use early returns instead?  Not sure in this method.

@@ +612,5 @@
> +    // Right now we only take action if pipelining is involved, but this would
> +    // be the place to add general read timeout handling if it is desired.
> +
> +    if (delta >= mozilla::TimeDuration::FromMilliseconds(900)) {
> +        nsAHttpConnection *pipeline = self->mTransaction->Connection();

And here is the glitch:

Here I really don't understand from just looking at the code what object you are trying to reach.

- self is nsHttpConnection
- self->mTransaction is nsHttpPipeline
- self->mTransaction->Connection() is nsConnectionHandler that points to this connection (to |self|)

The reason for this is described below.

This seems to need a query method, as I suggested earlier.  If you need the pipeline, then just get it from the transaction.

@@ +615,5 @@
> +    if (delta >= mozilla::TimeDuration::FromMilliseconds(900)) {
> +        nsAHttpConnection *pipeline = self->mTransaction->Connection();
> +        PRUint32 pipelineDepth = self->mTransaction->PipelineDepth();
> +        
> +        LOG(("cancelling pipeline due to a %lfms stall - depth %d\n",

Please "%1.0f ms"

@@ +619,5 @@
> +        LOG(("cancelling pipeline due to a %lfms stall - depth %d\n",
> +             delta.ToMilliseconds(), pipelineDepth));
> +        
> +        if (pipelineDepth > 1)
> +            pipeline->CancelPipeline();

Seems like here we go through mPipelineData, right?  Pretty zig-zag.

This looks like a very weird, indirect way to cast mTransaction from nsAHttpTransaction to nsAHttpConnection, i.e. to get a pipeline from it...

We can do it better.  No: we have to do it better.

@@ +626,5 @@
> +            mozilla::TimeDuration::FromMilliseconds(
> +                gHttpHandler->GetPipelineTimeout()) &&
> +            (pipelineDepth > 1 || self->mTransaction->PipelinePosition())) {
> +            // nothing has transpired on this pipelined socket for ~4
> +            // seconds. Call that a total stall and close the transaction.

close the transaction or the pipeline?

@@ +627,5 @@
> +                gHttpHandler->GetPipelineTimeout()) &&
> +            (pipelineDepth > 1 || self->mTransaction->PipelinePosition())) {
> +            // nothing has transpired on this pipelined socket for ~4
> +            // seconds. Call that a total stall and close the transaction.
> +            // There is a chance the transaction will be restarted again

here the same controversy

@@ +636,5 @@
> +                 "of depth %d and scheduled originally at pos %d\n",
> +                 delta.ToMilliseconds(),
> +                 pipelineDepth, self->mTransaction->PipelinePosition()));
> +
> +            self->CloseTransaction(self->mTransaction, NS_ERROR_NET_TIMEOUT);

You should comment this also closes this connection.

@@ +933,5 @@
>          delta = mozilla::TimeDuration::FromSeconds(0);
>      
> +    if (delta > mozilla::TimeDuration::FromMilliseconds(1200)) {
> +        LOG(("Read delta ms of %d causing slow read major "
> +             "event and pipeline cancellation"));

Missing format arguments!

@@ +939,5 @@
>              mConnInfo, nsHttpConnectionMgr::BadSlowReadMajor, this, 0);
>  
> +        nsAHttpConnection *pipeline = mTransaction->Connection();
> +        if (mTransaction->PipelineDepth() > 1)
> +            pipeline->CancelPipeline();

Again, the glitch.

::: netwerk/protocol/http/nsHttpHandler.cpp
@@ +1002,5 @@
>              mMaxPipelineObjectSize = PRInt64 (NS_CLAMP(val, 1000, 100000000));
>          }
>      }
>  
> +    if (PREF_CHANGED(HTTP_PREF("pipelining.readtimeout"))) {

Check name of the pref.
Comment 9 Patrick McManus [:mcmanus] 2012-01-04 06:48:46 PST
(In reply to Honza Bambas (:mayhemer) from comment #8)
> Comment on attachment 547769 [details] [diff] [review]
> Patch taken from Larch (by Patrick McManus)
>
> 
> - I'm thinking of having just a central timer (timers are expensive) that

I've updated to this by driving it out of the connection manager. It is a repeating slack timer on a 1-second interval that is cancelled, but not destroyed, when no connections are active. It runs very little.

This makes the "reschedule pipeline, but don't cancel the active transaction" timer at 1 second somewhat variable in precision in practice, but I think that's OK.

> 
> - Let's talk about the default timeout value (IMO way too short)

I presume you're talking about the 4000 ms point where the pipeline is presumed to have caused the stall and things are restarted. I agree it is pretty drastic and should only happen after the user feels like the UI has totally frozen. I've upped it to 10 seconds; let me know what you think of that. Remember that it isn't a 10-second transaction: it is a transaction that has not received any data in 10 seconds despite being the active one on the pipeline (and, of course, only if it was pipelined). This isn't really either latency- or bandwidth-based, in the sense that it dwarfs both of those numbers, so I don't see a reason to scale it that way (i.e. 1 RTT is indeed a factor, but that's such a small part of several thousand milliseconds that it isn't very important).

This is a drastic save-the-day measure to recover from really broken services, not a head-of-line-blocking thing. This is one failure mode I have seen.

> 
> This is also a candidate for a test.  Probably also the patch for large
> connection-length (bug 603512) as well.
> 

when we have a pipeline test server - sure.

> Add also to mobile prefs please.

mobile prefs should inherit.


> close the transaction or the pipeline?
> 

it really is the transaction, because we know the rest of the pipeline is clear: the rest of the pipeline was already canceled and rescheduled after head-of-line blocking was detected at the much smaller (900 ms) timeout.

> Check name of the pref.

thanks!
Comment 10 Patrick McManus [:mcmanus] 2012-01-04 06:49:27 PST
Created attachment 585740 [details] [diff] [review]
v6
Comment 11 Patrick McManus [:mcmanus] 2012-01-05 05:28:48 PST
Comment on attachment 585740 [details] [diff] [review]
v6

try claims a leak with this patch
Comment 12 Patrick McManus [:mcmanus] 2012-01-09 13:25:22 PST
Created attachment 587103 [details] [diff] [review]
v7

the leak was a trivial shutdown thing; the fix didn't affect the approach
Comment 13 Honza Bambas (:mayhemer) 2012-02-08 14:28:13 PST
Comment on attachment 587103 [details] [diff] [review]
v7

Review of attachment 587103 [details] [diff] [review]:
-----------------------------------------------------------------

r=honzab with a few comments.

::: netwerk/protocol/http/nsHttpConnection.cpp
@@ +918,5 @@
> +    // using the SPDY ping frame.
> +    if (mUsingSpdy)
> +        return;
> +
> +    mozilla::TimeDuration delta = mozilla::TimeStamp::Now() - mLastReadTime;

If we won't agree to switch to PR_IntervalNow() here, then please pass mozilla::TimeStamp::Now() as an argument to this method.  Calling it, especially in loops, is a bad performance hit on Windows.
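The point being made here is to sample the clock once per tick and pass the value down, instead of calling Now() once per connection. A small stand-alone sketch using std::chrono in place of the Mozilla time APIs (type and function names are illustrative only):

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <vector>

using Clock = std::chrono::steady_clock;
using Ms = std::chrono::milliseconds;

struct Conn {
    Clock::time_point lastRead;
    // The caller supplies |now|, so a tick over N connections costs one clock read.
    Ms IdleFor(Clock::time_point now) const {
        return std::chrono::duration_cast<Ms>(now - lastRead);
    }
};

Ms MaxIdle(const std::vector<Conn>& conns) {
    auto now = Clock::now();               // one sample for the whole loop
    Ms worst{0};
    for (const auto& c : conns)
        worst = std::max(worst, c.IdleFor(now));
    return worst;
}
```

The same hoisting applies whether the time source is TimeStamp::Now() or PR_IntervalNow(): the loop body only does subtraction, never a system call.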

@@ +930,5 @@
> +    // Right now we only take action if pipelining is involved, but this would
> +    // be the place to add general read timeout handling if it is desired.
> +
> +    if (delta < mozilla::TimeDuration::FromMilliseconds(1000))
> +        return;

This definitely needs to be a pref, also to be able to switch this off completely.

With such a short timeout we may just end up with more congested upstream from resending deep pipelines.  But we will see.

Please also add some telemetry; I would be interested in how this is going to work (and also how it will affect load times).

@@ +942,5 @@
> +        nsHttpPipeline *pipeline = mTransaction->QueryPipeline();
> +        NS_ABORT_IF_FALSE(pipeline, "pipelinedepth > 1 without pipeline");
> +        // code this defensively for the moment and check for null in opt build
> +        if (pipeline)
> +            pipeline->CancelPipeline(NS_ERROR_NET_TIMEOUT);

The name "CancelPipeline" confuses me here again; it sounds like we want to cancel the whole object...

Please add a log that we just did this.

@@ +1295,5 @@
> +            nsHttpPipeline *pipeline = mTransaction->QueryPipeline();
> +            NS_ABORT_IF_FALSE(pipeline, "pipelinedepth > 1 without pipeline");
> +            // code this defensively for the moment and check for null
> +            if (pipeline)
> +                pipeline->CancelPipeline(NS_ERROR_NET_TIMEOUT);

Please add a log that we just did this.

::: netwerk/protocol/http/nsHttpConnectionMgr.cpp
@@ +269,4 @@
>          nsCOMPtr<nsITimer> timer = do_QueryInterface(subject);
> +        if (timer == mTimer) {
> +            PruneDeadConnections();
> +        } else {

} else if (timer == mReadTimeoutTick) {
...
} else {
  NS_ERROR("unexpected timer-callback");
}

?

@@ +271,5 @@
> +            PruneDeadConnections();
> +        } else {
> +            NS_ABORT_IF_FALSE(timer == mReadTimeoutTick,
> +                              "unexpected timer-callback");
> +            PostEvent(&nsHttpConnectionMgr::OnMsgPostReadTimeoutTick);

ActivateTimeoutTick() gets called on the socket thread.  Timers ensure the callback is called on the same thread the timer was activated on.  You probably don't need to post here.

@@ +1912,5 @@
> +         "this=%p mReadTimeoutTick=%p\n"));
> +
> +    // The timer tick should be enabled if it is not already pending.
> +    // Upon running the tick will rearm itself if there are active http/1
> +    // connections available.

Could we allow the pref value be 0 to turn this feature off?

@@ +1945,5 @@
> +        mReadTimeoutTickArmed = false;
> +        return;
> +    }
> +
> +    mCT.Enumerate(ReadTimeoutTickCB, this);

The closure here can be ref to TimeStamp.

::: netwerk/protocol/http/nsHttpConnectionMgr.h
@@ +552,5 @@
> +    // A 1s tick to call nsHttpConnection::ReadTimeoutTick on
> +    // active http/1 connections. Disabled when there are no
> +    // active connections.
> +    nsCOMPtr<nsITimer>              mReadTimeoutTick;
> +    bool                            mReadTimeoutTickArmed;

Maybe just a single space for consistency?
Comment 14 Patrick McManus [:mcmanus] 2012-02-13 14:48:38 PST
(In reply to Honza Bambas (:mayhemer) from comment #13)

> If we won't agree to turn here to PR_IntervalNow(), then please pass

I changed these to PR_Interval, as well as most of the ones in the type-and-state patch. I really dislike the interval API, so please help me out by double-checking the conversions; I reset r? for that reason. It seems like we need a cheap, rough, monotonic version of the timestamp API.

> This definitely needs to be a pref, also to be able to switch this
> completely off.

added
pref("network.http.pipelining.reschedule-on-timeout", true);
pref("network.http.pipelining.reschedule-timeout", 1500);

I didn't overload it into one value because, even if reschedule-on-timeout is set to false, I still apply negative pipeline feedback when the timeout is exceeded.

> Could we allow the pref value be 0 to turn this feature off?

I have disabled it if reschedule-on-timeout, or just plain any pipelining, is pref'd off. I suspect we'll add other things to this tick in the future, though; maybe I'm wrong.

> 
> 
> The closure here can be ref to TimeStamp.
> 

I like the OO nature of passing the instantiating class whenever possible.
Comment 15 Patrick McManus [:mcmanus] 2012-02-13 14:49:13 PST
Created attachment 596798 [details] [diff] [review]
v8
Comment 16 Honza Bambas (:mayhemer) 2012-02-14 06:34:42 PST
Comment on attachment 596798 [details] [diff] [review]
v8

Review of attachment 596798 [details] [diff] [review]:
-----------------------------------------------------------------

Please file a followup bug for adding telemetry.  It's important to monitor how often this gets triggered and how it affects load times.

r=honzab.

::: netwerk/protocol/http/nsHttpConnection.cpp
@@ +979,5 @@
> +        }
> +    }
> +    
> +    if (delta < gHttpHandler->GetPipelineReadTimeout())
> +        return;

We should document (or change this code?) that if the read timeout is shorter than the reschedule timeout, we don't cancel the pipeline until the actual timeout is larger than the reschedule timeout.

@@ +996,5 @@
> +         PR_IntervalToMilliseconds(delta),
> +         pipelineDepth, mTransaction->PipelinePosition()));
> +
> +    // This will also close the connection
> +    CloseTransaction(mTransaction, NS_ERROR_NET_TIMEOUT);

Just a question - can we in case of this complete cancellation use the restart-in-progress logic?

@@ +1316,5 @@
>  
> +    if (delta >= (mRtt + gHttpHandler->GetPipelineRescheduleTimeout())) {
> +        LOG(("Read delta ms of %1.0f causing slow read major "
> +             "event and pipeline cancellation",
> +             PR_IntervalToMilliseconds(delta)));

%dms instead of %1.0f

::: netwerk/protocol/http/nsHttpConnectionMgr.cpp
@@ +1974,5 @@
> +        mReadTimeoutTickArmed = false;
> +        return;
> +    }
> +
> +    mCT.Enumerate(ReadTimeoutTickCB, this);

I was suggesting passing |now| here to save many calls to TimeStamp::Now().

But PR_IntervalNow() is also not free, so please pass the value to the enumerator and then to mActiveConns[index]->ReadTimeoutTick().  Up to you how.

::: netwerk/protocol/http/nsHttpHandler.cpp
@@ +218,5 @@
>  
>      NS_ASSERTION(!gHttpHandler, "HTTP handler already created!");
>      gHttpHandler = this;
> +    mPipelineRescheduleTimeout = PR_MillisecondsToInterval(1500);
> +    mPipelineReadTimeout = PR_SecondsToInterval(10);

Maybe mPipelineReadTimeout = PR_MillisecondsToInterval(10000);  I've seen oversights about which conversion had been used, and it's better to put here the same number that is present in the all.js file and that users see in about:config.

I think these can be set in initializers.

I like having these values as intervals.  A little problem is that some APIs of the handler return seconds and not intervals.  We may want a followup to sync the APIs, if possible.

::: netwerk/protocol/http/nsHttpHandler.h
@@ +237,5 @@
> +    }
> +
> +    bool GetPipelineRescheduleOnTimeout()
> +    {
> +        return  mPipelineRescheduleOnTimeout;

One space only
Comment 17 Patrick McManus [:mcmanus] 2012-02-15 12:21:49 PST
> > +    if (delta < gHttpHandler->GetPipelineReadTimeout())
> > +        return;
> 

I changed it so the two conditions can execute independently here.
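A minimal sketch of what "execute independently" means here: neither threshold check early-returns past the other, so a read timeout shorter than the reschedule timeout still fires on its own. All names below are invented for illustration; this is not the actual patch code.

```cpp
#include <cassert>
#include <cstdint>

// Two independent threshold checks instead of the first gating the second.
struct TickDecision {
    bool rescheduleFollowers; // cancel queued pipeline members, keep active one
    bool closeTransaction;    // close the active transaction (and connection)
};

TickDecision EvaluateTick(uint32_t idleMs,
                          uint32_t rescheduleTimeoutMs,
                          uint32_t readTimeoutMs)
{
    TickDecision d{};
    d.rescheduleFollowers = idleMs >= rescheduleTimeoutMs;
    d.closeTransaction    = idleMs >= readTimeoutMs;
    return d;
}
```

With this shape, the documentation concern from the review goes away: even a misconfigured read timeout below the reschedule timeout behaves sensibly.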

> > +    // This will also close the connection
> > +    CloseTransaction(mTransaction, NS_ERROR_NET_TIMEOUT);
> 
> Just a question - can we in case of this complete cancellation use the
> restart-in-progress logic?

yes.
Comment 18 Patrick McManus [:mcmanus] 2012-02-15 12:22:26 PST
Created attachment 597518 [details] [diff] [review]
v9
Comment 19 Patrick McManus [:mcmanus] 2012-02-15 13:01:18 PST
(In reply to Honza Bambas (:mayhemer) from comment #16)
> Comment on attachment 596798 [details] [diff] [review]
> v8
> 
> Review of attachment 596798 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Please have a followup bug for adding a telemetry.  It's important to
> monitor how often this gets triggered and how it affects load times.
> 

I'm going to add this as a second patch to this bug
Comment 20 Patrick McManus [:mcmanus] 2012-02-15 13:23:51 PST
Created attachment 597538 [details] [diff] [review]
telemetry v0
Comment 21 Honza Bambas (:mayhemer) 2012-02-16 15:49:08 PST
Comment on attachment 597538 [details] [diff] [review]
telemetry v0

Review of attachment 597538 [details] [diff] [review]:
-----------------------------------------------------------------

We should be a bit smarter here.  I think the interesting information would be to track the number of transactions that had been processed before either of the two events was detected.  Something like "reschedule pipeline after N successful transactions" and "timeout/cancel transaction after N successful transactions"; report the N here.

Then have a separate graph counting the number of restarts and cancellations on a connection.  This may report a lot of zeros, though.  I'm just looking for some record of the ratio, i.e. how often this happens.

Bug 707320 introduced a way to report some session-collected telemetry right before shutdown; you may want to use it.
Comment 22 Honza Bambas (:mayhemer) 2012-02-22 04:31:10 PST
Comment on attachment 597538 [details] [diff] [review]
telemetry v0

Dropping the review flag; please see the previous comment.  This telemetry is IMO too weak.
Comment 23 Honza Bambas (:mayhemer) 2012-02-22 06:39:14 PST
Comment on attachment 597538 [details] [diff] [review]
telemetry v0

I didn't want to give an "r-" by dropping the flag here, just cleaning my review queue while this is left open for discussion.
Comment 24 Patrick McManus [:mcmanus] 2012-02-22 06:43:41 PST
Comment on attachment 597538 [details] [diff] [review]
telemetry v0

Dropping the flag on this so the review-queue queries work is fine. I'll reset it if I want you to reconsider when I submit the argument.

I had objected to the flag change privately only because of a mistake I made interpreting some email. That's all my fault.
Comment 25 Patrick McManus [:mcmanus] 2012-03-20 10:37:53 PDT
https://hg.mozilla.org/integration/mozilla-inbound/rev/57c69b1e9dcf
Comment 26 Matt Brubeck (:mbrubeck) 2012-03-20 12:59:32 PDT
Sorry, I backed this out on inbound because something in the push appeared to cause a leak in mochitest-browser-chrome:
https://hg.mozilla.org/integration/mozilla-inbound/rev/577da08878cb
Comment 27 Patrick McManus [:mcmanus] 2012-03-22 18:02:59 PDT
https://hg.mozilla.org/integration/mozilla-inbound/rev/52efc30fbfec
Comment 28 Marco Bonardo [::mak] (Away 6-20 Aug) 2012-03-23 05:57:45 PDT
https://hg.mozilla.org/mozilla-central/rev/52efc30fbfec
