If you think a bug might affect users in the 57 release, please set the correct tracking and status flags for Release Management.

h2: REFUSED_STREAMs should be retried on the same connection

RESOLVED FIXED in mozilla36

Status

()

Core
Networking: HTTP
RESOLVED FIXED
3 years ago
3 years ago

People

(Reporter: nwgh, Assigned: nwgh)

Tracking

unspecified
mozilla36
Points:
---
Bug Flags:
in-testsuite +

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [spdy])

Attachments

(1 attachment, 1 obsolete attachment)

While working on h2-14 interop with jetty with push disabled, I noticed we were getting a bunch of RST_STREAM/REFUSED_STREAM from https://webtide.com. We appropriately retry the transaction, but in doing so, we start over with a brand new connection, as the transaction calls DontReuse on the connection (in this case, the h2 session). Given that RST_STREAM/REFUSED_STREAM isn't a session-terminal error, we should really be retrying on the same connection. (Note: this would have not fixed the interop issue with jetty, as they had a bug on their end, but it would have ended up with fewer handshakes and less resource usage.)
Created attachment 8512207 [details] [diff] [review]
patch

So here's a patch that causes us to retry REFUSED_STREAM on the same connection up to 3 times. After that, we'll end up trying on a new connection (with 3 tries on that connection), etc until we reach the global retry limit (which is 10, I believe).

Totally not critical (as the bug description indicates), but kind of nice and slightly better behaved than our current behavior.
Attachment #8512207 - Flags: review?(mcmanus)
Comment on attachment 8512207 [details] [diff] [review]
patch

Review of attachment 8512207 [details] [diff] [review]:
-----------------------------------------------------------------

hmm. I don't love the counter. Its a fixed number against an unbounded lifetime of a session.. I'm not sure its useful as a metric.. is 3 lifetime moving violations the sign of a reckless driver? It probably is in an 18 year old, and it probably is not in a 80 year old.

I think the answer here is that REFUSED_STREAM_ERROR simply should not trigger DontReuse() - I think the fact that it does so right now is mostly an artifact of the way CANCEL is used.  Is there an argument against?

Is the DontReuse() that is being triggered nsHttpConnection::CloseTransaction() ? It seems like we could call CleanupStream with something less generic than CANCEL (perhaps something new) and have nsHttpConnection::CloseTransaction() filter the dontreuse() on that in the spdy path. That would be nicer than the bool-on-the-stack logic you've got going.. but if that's what it takes I'm ok with that too.

::: netwerk/protocol/http/Http2Session.cpp
@@ +57,5 @@
>    0x54, 0x50, 0x2f, 0x32, 0x2e, 0x30, 0x0d, 0x0a,
>    0x0d, 0x0a, 0x53, 0x4d, 0x0d, 0x0a, 0x0d, 0x0a
>  };
>  
> +const uint8_t Http2Session::kRetryCountMax = 3;

nit - better name.. retry what?

@@ +578,5 @@
> +  if (mDoingRstStream && mDownstreamRstReason == REFUSED_STREAM_ERROR &&
> +      mRetryCount < kRetryCountMax) {
> +    ++mRetryCount;
> +    LOG3(("Http2Session::DontReuse - ignoring because we're resetting a stream "
> +          "with REFUSED_STREAM"));

nit put "this" in log
Attachment #8512207 - Flags: review?(mcmanus)
(In reply to Patrick McManus [:mcmanus] from comment #2)
> Is the DontReuse() that is being triggered
> nsHttpConnection::CloseTransaction() ? It seems like we could call
> CleanupStream with something less generic than CANCEL (perhaps something
> new) and have nsHttpConnection::CloseTransaction() filter the dontreuse() on
> that in the spdy path. That would be nicer than the bool-on-the-stack logic
> you've got going.. but if that's what it takes I'm ok with that too.

Nope (although we may end up with that, too, come to think of it). The DontReuse I've been seeing is coming from nsHttpTransaction::Restart, which doesn't get a reason argument at all. Not that we couldn't change that, but the bool-on-the-stack logic seemed less intrusive. We might end up wanting to change that for HTTP_1_1_REQUIRED anyway, though (I still have to think that one through), so maybe it's a good idea to change it now? I'll circle back around on that once I've thought the HTTP_1_1_REQUIRED stuff through (which is now bug 1091263, by the way).

In regards to everything else - I'm totally fine doing away with the counter (like I said on vidyo, my original version didn't have that). If we wanted a counter, it would probably make more sense to have it on the transaction instead of the session anyway - in the case that brought this up, it was one particular stream that was misbehaving, others were fine. It does make me wonder, though, if we should have some sort of metric for "well, things on this connection have really been misbehaving a lot lately, perhaps we should dial back"? That's fodder for a separate bug, though.
New patch incoming using the same mechanism as bug 1091263 to prevent connection closing (so this now depends on that bug)
Depends on: 1091263
Created attachment 8518272 [details] [diff] [review]
patch v2
Attachment #8512207 - Attachment is obsolete: true
Attachment #8518272 - Flags: review?(mcmanus)
Will push to try once trees re-open, but for reference, tests pass locally.
Attachment #8518272 - Flags: review?(mcmanus) → review+
try finally didn't timeout on my push: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=06f7e84396dc
Keywords: checkin-needed
https://hg.mozilla.org/integration/mozilla-inbound/rev/12564626faca
Flags: in-testsuite+
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/12564626faca
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla36
You need to log in before you can comment on or make changes to this bug.