Note: There are a few cases of duplicates in user autocompletion which are being worked on.

Investigate possible loop in http2

RESOLVED FIXED in Firefox 45

Status

()

Core
Networking: HTTP
RESOLVED FIXED
2 years ago
a year ago

People

(Reporter: dragana, Assigned: mcmanus)

Tracking

(Blocks: 1 bug)

unspecified
mozilla47
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox44 wontfix, firefox45 fixed, firefox46 fixed, firefox47 fixed)

Details

(Whiteboard: [spdy][necko-active])

Attachments

(1 attachment, 2 obsolete attachments)

(Reporter)

Description

2 years ago
I looked through some comments from shutdown hang:

https://crash-stats.mozilla.com/report/list?product=Firefox&signature=shutdownhang+|+WaitForSingleObjectEx+|+WaitForSingleObject+|+PR_WaitCondVar+|+nsThread%3A%3AProcessNextEvent+|+mozilla%3A%3Anet%3A%3AnsHttpConnectionMgr%3A%3AShutdown#tab-comments

A lot of them mention facebook, which just recently switched to h2.

More important a lot of comment mention that firefox was behaving as if there is no connectivity and I am guessing the users decided to restart firefox when this crash happened.
(Reporter)

Comment 1

2 years ago
I have one theory (I do not know h2 code that good):

We call mInputFrameDataStream->WriteSegments at:
http://hg.mozilla.org/releases/mozilla-release/annotate/9b79adf82b13/netwerk/protocol/http/Http2Session.cpp#l2704

this returns SoftStreamError(rv) or NS_BASE_STREAM_WOULD_BLOCK without going to the network. If we go to the socket mSocketInCondition will be set to the real socket error and nsHttpConnection::OnSocketReadable would not loop. Looking into the code it is possible that we return from mInputFrameDataStream->WriteSegments without going to the network.

If this above it is true we will loop.

I think we can easily test this. There are some crashes on nightly with this stack. We can add some checks to confirm of not whether we are looping in this code.
(Reporter)

Updated

2 years ago
Blocks: 1158189
(Assignee)

Comment 2

2 years ago
(In reply to Dragana Damjanovic [:dragana] from comment #1)
> I have one theory (I do not know h2 code that good):
> 
> We call mInputFrameDataStream->WriteSegments at:
> http://hg.mozilla.org/releases/mozilla-release/annotate/9b79adf82b13/netwerk/
> protocol/http/Http2Session.cpp#l2704
> 
> this returns SoftStreamError(rv) or NS_BASE_STREAM_WOULD_BLOCK without going
> to the network. If we go to the socket mSocketInCondition will be set to the
> real socket error and nsHttpConnection::OnSocketReadable would not loop.
> Looking into the code it is possible that we return from
> mInputFrameDataStream->WriteSegments without going to the network.
> 
> If this above it is true we will loop.
> 
> I think we can easily test this. There are some crashes on nightly with this
> stack. We can add some checks to confirm of not whether we are looping in
> this code.

It seems this is generally probably right, but I think some subset of that looping is necessary to advance the state machine.. so I think we probably have to find a specific set of causes where the loop gets stuck forever.
(Reporter)

Comment 3

2 years ago
(In reply to Patrick McManus [:mcmanus] from comment #2)
> 
> It seems this is generally probably right, but I think some subset of that
> looping is necessary to advance the state machine.. so I think we probably
> have to find a specific set of causes where the loop gets stuck forever.

That is why I though of using this crash to narrow down this behavior to get some clues.
Ok I will make patch.


The other option is to contact some of the reporters :)
(Reporter)

Comment 4

2 years ago
Maybe related bug 1245085
(Reporter)

Comment 5

2 years ago
Is it possible that mInputBufferIn/Out pipe is full:

http://mxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/Http2Stream.cpp#268
(Reporter)

Comment 6

2 years ago
Created attachment 8718816 [details] [diff] [review]
bug_1247205.patch

but this is not going to happen that often :)
Attachment #8718816 - Flags: review?(mcmanus)
(Assignee)

Comment 7

2 years ago
Comment on attachment 8718816 [details] [diff] [review]
bug_1247205.patch

Review of attachment 8718816 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/protocol/http/Http2Stream.cpp
@@ +284,5 @@
>    if (NS_SUCCEEDED(rv)) {
>      uint32_t buffered;
>      rv = mInputBufferOut->Write(buf, *countWritten, &buffered);
> +    if ((NS_SUCCEEDED(rv) && (buffered != *countWritten)) ||
> +        (rv == NS_BASE_STREAM_WOULD_BLOCK)) {

can you explain why you think this is happening?

That pipe is just being used as a buffer.. its prime use case is when the channel pipe is not accepting data (maybe due to the scheduler, but more likely because the channel is suspended).. in h1 we would just push back on the network and let TCP flow control kick in, but h2 can't really do that without blocking the other streams, so instead it has this buffer to stuff things into until the h2 per stream flow control window runs out (and then the backpressure will finally hit the sender).. so this buffer is sized 'infinite' to be bigger than the stream flow control and we should never get WOULD_BLOCK (though OOM is a possibility).

that's the theory anyhow.. The patch might be harmless in that light, but I'm trying to figure out why you suspect it.. cause maybe something is wrong with the buffering we could fix.
Attachment #8718816 - Flags: review?(mcmanus)
(Reporter)

Comment 8

2 years ago
I missed one thing, sorry I looked at too many stack

Some stacks:
https://crash-stats.mozilla.com/report/index/550901b5-3442-4ea9-8533-8d0602160206#allthreads

0 	nss3.dll 	md_UnlockAndPostNotifies 	nsprpub/pr/src/md/windows/w95cv.c
1 	nss3.dll 	PR_ExitMonitor 	nsprpub/pr/src/threads/prmon.c
2 	xul.dll 	nsPipe::OnPipeException(nsresult, bool) 	xpcom/io/nsPipe3.cpp
3 	xul.dll 	nsPipeOutputStream::WriteSegments(nsresult (*)(nsIOutputStream*, void*, char*, unsigned int, unsigned int, unsigned int*), void*, unsigned int, unsigned int*) 	xpcom/io/nsPipe3.cpp
4 	xul.dll 	mozilla::net::nsHttpTransaction::WriteSegments(mozilla::net::nsAHttpSegmentWriter*, unsigned int, unsigned int*) 	netwerk/protocol/http/nsHttpTransaction.cpp
5 	xul.dll 	nsSocketOutputStream::nsSocketOutputStream(nsSocketTransport*) 	netwerk/base/nsSocketTransport2.cpp
6 	xul.dll 	mozilla::net::Http2Stream::WriteSegments(mozilla::net::nsAHttpSegmentWriter*, unsigned int, unsigned int*) 	netwerk/protocol/http/Http2Stream.cpp
7 	xul.dll 	mozilla::net::Http2Session::WriteSegments(mozilla::net::nsAHttpSegmentWriter*, unsigned int, unsigned int*) 	netwerk/protocol/http/Http2Session.cpp
8 	xul.dll 	mozilla::net::nsHttpConnection::OnSocketReadable() 	netwerk/protocol/http/nsHttpConnection.cpp
9 	xul.dll 	mozilla::net::nsHttpConnection::OnInputStreamReady(nsIAsyncInputStream*) 	netwerk/protocol/http/nsHttpConnection.cpp
10 	xul.dll 	nsSocketInputStream::OnSocketReady(nsresult)

this one can only happen if the pipe's state is an error:
http://mxr.mozilla.org/mozilla-central/source/xpcom/io/nsPipe3.cpp#804

this can loop only if the error is SoftError:
http://hg.mozilla.org/releases/mozilla-release/annotate/9b79adf82b13/netwerk/protocol/http/Http2Session.cpp#l2709

I am not sure how pipe got broken.


Second stack:
https://crash-stats.mozilla.com/report/index/467e8bfc-0949-46dd-9f72-91dba2160206#allthreads

0 	ntdll.dll 	RtlEnterCriticalSection 	
1 	nss3.dll 	PR_ExitMonitor 	nsprpub/pr/src/threads/prmon.c
2 	xul.dll 	nsPipeInputStream::Available(unsigned __int64*) 	xpcom/io/nsPipe3.cpp
3 	xul.dll 	mozilla::net::IsDataAvailable 	netwerk/protocol/http/Http2Stream.cpp
4 	xul.dll 	mozilla::net::Http2Stream::DeferCleanup(nsresult) 	netwerk/protocol/http/Http2Stream.cpp
5 	xul.dll 	mozilla::net::Http2Session::CleanupStream(mozilla::net::Http2Stream*, nsresult, mozilla::net::Http2Session::errorType) 	netwerk/protocol/http/Http2Session.cpp
6 	xul.dll 	mozilla::net::Http2Session::CleanupStream(unsigned int, nsresult, mozilla::net::Http2Session::errorType) 	netwerk/protocol/http/Http2Session.cpp
7 	xul.dll 	mozilla::net::Http2Session::WriteSegments(mozilla::net::nsAHttpSegmentWriter*, unsigned int, unsigned int*) 	netwerk/protocol/http/Http2Session.cpp
8 	xul.dll 	mozilla::net::nsHttpConnection::OnSocketReadable() 	netwerk/protocol/http/nsHttpConnection.cpp
9 	xul.dll 	mozilla::net::nsHttpConnection::OnInputStreamReady(nsIAsyncInputStream*) 	netwerk/protocol/http/nsHttpConnection.cpp
10 	xul.dll 	mozilla::net::HttpConnectionForceIO::Run() 	netwerk/protocol/http/nsHttpConnection.cpp
11 	xul.dll 	nsThread::ProcessNextEvent(bool, bool*) 	xpcom/threads/nsThread.cpp


It is probably the same scenario as above:
so at:
http://hg.mozilla.org/releases/mozilla-release/annotate/9b79adf82b13/netwerk/protocol/http/Http2Session.cpp#l2724
mNeedsCleanup is nullptr, there is no other way, otherwise mDownstreamState would not stay PROCESSING_DATA_FRAME
and mInputBuffer pipe is not empty.

So pipe is in error state and we do not get out of this loop. Why the pipe is broken, I still do not know.

I will make a patch get out of this loop and I will continue looking for reason why the loop is in the error state.
(Reporter)

Updated

2 years ago
Assignee: nobody → dd.mozilla
Status: NEW → ASSIGNED
(Assignee)

Comment 9

a year ago
perhaps the pipe is more machinery than we really want here. I still don't quite get what is wrong, but it certainly seems possible. In particular it contains locks we don't need.

what do think of just replacing it with something simpler.. a linked list of buffers that implements just read, write, and avaialable.. the operations would return OK or OOM. That's the behavior we're trying to get from pipe afterall.
(Assignee)

Updated

a year ago
Whiteboard: [spdy]
(Assignee)

Comment 10

a year ago
https://treeherder.mozilla.org/#/jobs?repo=try&revision=33bdc91b1419
(Assignee)

Comment 11

a year ago
Created attachment 8719184 [details] [diff] [review]
streamline h2 stream flow control buffer
Attachment #8719184 - Flags: review?(dd.mozilla)
(Assignee)

Updated

a year ago
Assignee: dd.mozilla → mcmanus
(Assignee)

Comment 12

a year ago
(In reply to Patrick McManus [:mcmanus] from comment #11)
> Created attachment 8719184 [details] [diff] [review]
> streamline h2 stream flow control buffer

I actually like this patch quite a bit.. while it adds a new class, it removes a couple nscomptr's and generally simplifies the h2 code.

it also adds OUT_OF_MEMORY to the list of hard errors - which is obviously true. (hard errors kill sessions, soft errors kill streams). Its possible that's all this bug really needs, but the "stuck on lock" stack traces make me want to do the whole thing.
(Assignee)

Updated

a year ago
Attachment #8719184 - Flags: feedback?(hurley)
(Reporter)

Comment 13

a year ago
Sorry I have not look at your patch, but I have a theory what is happening: 
Sorry it is Saturday, so maybe I am wrong, but I am writing it here, anyway :)

When we cancel a channel we cancel pipe on main thread and post an even to the socket thread actually to cancel nsTransaction:
http://mxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/nsHttpChannel.cpp#4906
so this function cancels input stream and it sets pipe to error state, but transaction is still valid. The nsHttp transaction will be canceled on the socket thread which will happen later (in this case never happens while we are looping). So if we enter Http2Session::WriteSegment and we are in that state, mDownstreamState == PROCESSING_DATA_FRAME, we are going to loop.
(Reporter)

Comment 14

a year ago
You patch is good to have, but it will not resolve this loop.
(Assignee)

Comment 15

a year ago
I know its saturday. I'm snowed in. feel free to ignore me :)

ok - I think I get it now. I was focusing on stack #2 in comment 8.. stack #1 is more interesting, and now that I kinda grok what you're saying I can also see stack #2 in that light.

I believe you are suggesting

1] nshttpconnection::onsocketreadable() [h2]
2] main thread - nshttpchannel::cancel (closes pipe shared by channel and nshttptransaction)
3] http2session::writesegments
4] http2stream::writesegments
5] nshttptransaction::writesegments
6] pipeoutputstream::writesegments

this would all be normal, and the next thing we would expect would be nshttptransaction::writepipesegment.. which would go back through all the stack callers eventually ending up in nshttpconnection::onwritesegment which would set msocketincondition

but you're saying, based on stack 1 in comment 8, that pipeoutputstream::writesegments is returning an error without going deeper in the stack.. specifically something that would be a soft error.. so it gets passed back up to Http2Session which maps it to NS_OK (soft error).. which means nsHttpConnection::OnSocketReadable loops back into the same circumstance again.

I like your theory (if that's it :))

I don't know what the pipe error code is, but its probably the same thing the channel was canceled with.. which in turn is probably something that is intentionally flagged as a soft-error like binding-aborted (because its going to be the same code that the transaction is flagged with when it is canceled)

Its tempting to just change the mSocketInCondition initializer to WOULD_BLOCK, but I think that would break other cases.

we might just need something ugly here that allows readsegments/writesegments to indicate they are masking an error.

you're right about the other patch.. I still like it but I'll move it to another bug as time allows.
(Reporter)

Comment 16

a year ago
Sorry for not explaining this better the first time.(probably because I was looking at this code for sometime so I miss explaining it in details for someone who did not stared at it)

Yes that was my theory :)

Yes both stacks are important in order to understand what is happening: the first one to see what is the cause of the error and the second why we are looping.

I think we should somehow mark that the error is coming from pipe being canceled so we are never going to call function that is going to the socket or consume that data in the pipe (http://mxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/Http2Stream.cpp#1411).
 
Http2Stream::DeferCleanup checks pipe and the pipe, even if it is in error state, is going to return amount of data available and NS_OK.

Maybe we can check if Http2Stream::OnWriteSegment was call - whether we are moving forward or not (I need to go through it once again but if I think that is enough) If it is not called and we are in Soft-error means the pipe has returned an error state.

(I was not complaining about Saturday, just I am not really working so I was writing comment 13 without really proving it completely :) (I am mo guaranty that there are no mistakes :) But I think comment 13 is correct.
(Assignee)

Comment 17

a year ago
when we hit the softstream error we do:
      CleanupStream(streamID, NS_OK, CANCEL_ERROR);

That second argument, is OK because normally this is because the transaction has signaled something harmless like the parser EOF for the stream.. cleanupstream does not happen if stream->DeferCleanup (NS_OK) returns true.. which is defined as

  return (NS_SUCCEEDED(status) && IsDataAvailable(mInputBufferIn));

so that doesn't happen on the softstream error path when there is data in the flow control buffer. (that's not the pipe that is in an error state - it is not shared with the channel.. there might be some confusion on that in the previous comment but I don't think it matters).

interestingly before firefox 44 (bug 1204614) defercleanup always returned false. So this is probably a regression

however if the event queue on the socket thread should continue to run (i.e. not be stopped by the again loop), Http2Session::CloseTransaction would call   CleanupStream(stream, aResult, CANCEL_ERROR) - where aResult is definitely not OK.

In that case DeferCleanup would return false and CleanupStream would call CloseStream which would put the state machine in a mode to ignore the rest of the data frame (bypassing the broken transaction), etc..
Blocks: 1204614
(Assignee)

Comment 18

a year ago
https://treeherder.mozilla.org/#/jobs?repo=try&revision=5643f0137bef
(Assignee)

Comment 19

a year ago
[Tracking Requested - why for this release]:

[Tracking Requested - why for this release]:

(In reply to Patrick McManus [:mcmanus] from comment #18)
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=5643f0137bef

basically, if we don't loop on softerror things ought to work out... and that's a safe change to make. It seems plausible there are other scenarios where softerror is masking things to NS_OK that also would be fixed by this.

the other approach I considered, which is admittedly a smaller patch, is to chnage nsHttpConnection so that instead of looping synchronously it just posts a manual "fake" callback to OnSocketReady to the socket thread event queue every time again would be true and just returns instead... effectively making sure other stuff gets to run.. but a] that's a lot of machinery to just get a callback that probably immediately returns EWOULDBLOCK and b] that's going to change some timing pretty seriously and I'd like to live with it for a while before seeing it on release, and this patch feels like its going to be uplifted.
status-firefox44: --- → affected
status-firefox45: --- → affected
status-firefox46: --- → affected
status-firefox47: --- → affected
(Assignee)

Comment 20

a year ago
Created attachment 8719224 [details] [diff] [review]
dont loop on http2 softerror
Attachment #8719224 - Flags: review?(dd.mozilla)
(Assignee)

Updated

a year ago
Attachment #8718816 - Attachment is obsolete: true
(Assignee)

Updated

a year ago
Attachment #8719184 - Attachment is obsolete: true
Attachment #8719184 - Flags: review?(dd.mozilla)
Attachment #8719184 - Flags: feedback?(hurley)
(Assignee)

Updated

a year ago
Whiteboard: [spdy] → [spdy][necko-active]
(Assignee)

Comment 21

a year ago
https://treeherder.mozilla.org/#/jobs?repo=try&revision=abb6d808cb19
(Reporter)

Comment 22

a year ago
Comment on attachment 8719224 [details] [diff] [review]
dont loop on http2 softerror

Review of attachment 8719224 [details] [diff] [review]:
-----------------------------------------------------------------

This should solve the problem.

And it is easier to uplift. We should uplift this.
Attachment #8719224 - Flags: review?(dd.mozilla) → review+
(Assignee)

Comment 23

a year ago
https://treeherder.mozilla.org/#/jobs?repo=try&revision=4f48d8cf8e82
(Assignee)

Comment 24

a year ago
(In reply to Patrick McManus [:mcmanus] from comment #21)
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=abb6d808cb19

the different platform compilers are fighting me a little bit on the combination of virtual/pure-virtual/polymorphic there.. but otherwise the try looks solid. I'll just rename a symbol to make this go easier.
(Reporter)

Comment 25

a year ago
We need to back out bug 1246778, to see if this has solved the problem.
(Assignee)

Comment 26

a year ago
https://treeherder.mozilla.org/#/jobs?repo=try&revision=e0b1c5a92634
(Assignee)

Comment 27

a year ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/9b6d687214543f0c21035b3d7f32520419314c2b
Bug 1247205 - dont loop on http2 softerror r=dragana
https://hg.mozilla.org/mozilla-central/rev/9b6d68721454
Status: ASSIGNED → RESOLVED
Last Resolved: a year ago
status-firefox47: affected → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla47
(Assignee)

Comment 29

a year ago
Comment on attachment 8719224 [details] [diff] [review]
dont loop on http2 softerror

Approval Request Comment
[Feature/regressing bug #]: This patch fixes what is believed to be a significant subset of the hangs and shutdown crashes in bug 1158189. Feature in bug 1204614 (gecko 44) certainly made this worse, but the liability has been there for a couple years.
[User impact if declined]: The fact that facebook transitioned to http/2 in mid january has also made this hang happen more often - facebook video would create a possible trigger for hanging of the network stack. I think the potential with FB and browser hangs is large enough that this should go onto beta.
[Describe test coverage new/current, TreeHerder]: This fix covers a very hard to produce race condition. I have done it once by hand and only then my changing a few internals. However the code paths involved are heavily tested with xpcshell.
[Risks and why]: Low risk This is a pretty minimal fix - other more invasive approaches were rejected (or reserved only for mozilla-central) with the objective of choosing the safest path. Only spdy and h2 are impacted and even then they are only impacted on error paths.
[String/UUID change made/needed]:
Attachment #8719224 - Flags: approval-mozilla-beta?
Attachment #8719224 - Flags: approval-mozilla-aurora?
status-firefox44: affected → wontfix
Comment on attachment 8719224 [details] [diff] [review]
dont loop on http2 softerror

Fix an important issue, taking it.
Should be in 45 beta 7
Attachment #8719224 - Flags: approval-mozilla-beta?
Attachment #8719224 - Flags: approval-mozilla-beta+
Attachment #8719224 - Flags: approval-mozilla-aurora?
Attachment #8719224 - Flags: approval-mozilla-aurora+

Comment 31

a year ago
bugherderuplift
https://hg.mozilla.org/releases/mozilla-beta/rev/5368758b59e9
status-firefox45: affected → fixed

Comment 32

a year ago
bugherderuplift
https://hg.mozilla.org/releases/mozilla-aurora/rev/f44573d0678b
status-firefox46: affected → fixed
(Reporter)

Comment 33

a year ago
This patch seems to fix the problem:

https://bugzilla.mozilla.org/show_bug.cgi?id=1174899#c69
(Assignee)

Updated

a year ago
Duplicate of this bug: 1237331
(Assignee)

Updated

a year ago
Duplicate of this bug: 1215243
(Assignee)

Updated

a year ago
Duplicate of this bug: 1211289
(Assignee)

Updated

a year ago
Duplicate of this bug: 1219009
(Assignee)

Updated

a year ago
Duplicate of this bug: 1253592
You need to log in before you can comment on or make changes to this bug.