Closed
Bug 1247205
Opened 9 years ago
Closed 9 years ago
Investigate possible loop in http2
Categories
(Core :: Networking: HTTP, defect)
Core
Networking: HTTP
Tracking
()
RESOLVED
FIXED
mozilla47
People
(Reporter: dragana, Assigned: mcmanus)
References
Details
(Whiteboard: [spdy][necko-active])
Attachments
(1 file, 2 obsolete files)
15.21 KB,
patch
|
dragana
:
review+
Sylvestre
:
approval-mozilla-aurora+
Sylvestre
:
approval-mozilla-beta+
|
Details | Diff | Splinter Review |
I looked through some comments from shutdown hang:
https://crash-stats.mozilla.com/report/list?product=Firefox&signature=shutdownhang+|+WaitForSingleObjectEx+|+WaitForSingleObject+|+PR_WaitCondVar+|+nsThread%3A%3AProcessNextEvent+|+mozilla%3A%3Anet%3A%3AnsHttpConnectionMgr%3A%3AShutdown#tab-comments
A lot of them mention facebook, which just recently switched to h2.
More important a lot of comment mention that firefox was behaving as if there is no connectivity and I am guessing the users decided to restart firefox when this crash happened.
Reporter | ||
Comment 1•9 years ago
|
||
I have one theory (I do not know h2 code that good):
We call mInputFrameDataStream->WriteSegments at:
http://hg.mozilla.org/releases/mozilla-release/annotate/9b79adf82b13/netwerk/protocol/http/Http2Session.cpp#l2704
this returns SoftStreamError(rv) or NS_BASE_STREAM_WOULD_BLOCK without going to the network. If we go to the socket mSocketInCondition will be set to the real socket error and nsHttpConnection::OnSocketReadable would not loop. Looking into the code it is possible that we return from mInputFrameDataStream->WriteSegments without going to the network.
If this above it is true we will loop.
I think we can easily test this. There are some crashes on nightly with this stack. We can add some checks to confirm of not whether we are looping in this code.
Assignee | ||
Comment 2•9 years ago
|
||
(In reply to Dragana Damjanovic [:dragana] from comment #1)
> I have one theory (I do not know h2 code that good):
>
> We call mInputFrameDataStream->WriteSegments at:
> http://hg.mozilla.org/releases/mozilla-release/annotate/9b79adf82b13/netwerk/
> protocol/http/Http2Session.cpp#l2704
>
> this returns SoftStreamError(rv) or NS_BASE_STREAM_WOULD_BLOCK without going
> to the network. If we go to the socket mSocketInCondition will be set to the
> real socket error and nsHttpConnection::OnSocketReadable would not loop.
> Looking into the code it is possible that we return from
> mInputFrameDataStream->WriteSegments without going to the network.
>
> If this above it is true we will loop.
>
> I think we can easily test this. There are some crashes on nightly with this
> stack. We can add some checks to confirm of not whether we are looping in
> this code.
It seems this is generally probably right, but I think some subset of that looping is necessary to advance the state machine.. so I think we probably have to find a specific set of causes where the loop gets stuck forever.
Reporter | ||
Comment 3•9 years ago
|
||
(In reply to Patrick McManus [:mcmanus] from comment #2)
>
> It seems this is generally probably right, but I think some subset of that
> looping is necessary to advance the state machine.. so I think we probably
> have to find a specific set of causes where the loop gets stuck forever.
That is why I though of using this crash to narrow down this behavior to get some clues.
Ok I will make patch.
The other option is to contact some of the reporters :)
Reporter | ||
Comment 4•9 years ago
|
||
Maybe related bug 1245085
Reporter | ||
Comment 5•9 years ago
|
||
Is it possible that mInputBufferIn/Out pipe is full:
http://mxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/Http2Stream.cpp#268
Reporter | ||
Comment 6•9 years ago
|
||
but this is not going to happen that often :)
Attachment #8718816 -
Flags: review?(mcmanus)
Assignee | ||
Comment 7•9 years ago
|
||
Comment on attachment 8718816 [details] [diff] [review]
bug_1247205.patch
Review of attachment 8718816 [details] [diff] [review]:
-----------------------------------------------------------------
::: netwerk/protocol/http/Http2Stream.cpp
@@ +284,5 @@
> if (NS_SUCCEEDED(rv)) {
> uint32_t buffered;
> rv = mInputBufferOut->Write(buf, *countWritten, &buffered);
> + if ((NS_SUCCEEDED(rv) && (buffered != *countWritten)) ||
> + (rv == NS_BASE_STREAM_WOULD_BLOCK)) {
can you explain why you think this is happening?
That pipe is just being used as a buffer.. its prime use case is when the channel pipe is not accepting data (maybe due to the scheduler, but more likely because the channel is suspended).. in h1 we would just push back on the network and let TCP flow control kick in, but h2 can't really do that without blocking the other streams, so instead it has this buffer to stuff things into until the h2 per stream flow control window runs out (and then the backpressure will finally hit the sender).. so this buffer is sized 'infinite' to be bigger than the stream flow control and we should never get WOULD_BLOCK (though OOM is a possibility).
that's the theory anyhow.. The patch might be harmless in that light, but I'm trying to figure out why you suspect it.. cause maybe something is wrong with the buffering we could fix.
Attachment #8718816 -
Flags: review?(mcmanus)
Reporter | ||
Comment 8•9 years ago
|
||
I missed one thing, sorry I looked at too many stack
Some stacks:
https://crash-stats.mozilla.com/report/index/550901b5-3442-4ea9-8533-8d0602160206#allthreads
0 nss3.dll md_UnlockAndPostNotifies nsprpub/pr/src/md/windows/w95cv.c
1 nss3.dll PR_ExitMonitor nsprpub/pr/src/threads/prmon.c
2 xul.dll nsPipe::OnPipeException(nsresult, bool) xpcom/io/nsPipe3.cpp
3 xul.dll nsPipeOutputStream::WriteSegments(nsresult (*)(nsIOutputStream*, void*, char*, unsigned int, unsigned int, unsigned int*), void*, unsigned int, unsigned int*) xpcom/io/nsPipe3.cpp
4 xul.dll mozilla::net::nsHttpTransaction::WriteSegments(mozilla::net::nsAHttpSegmentWriter*, unsigned int, unsigned int*) netwerk/protocol/http/nsHttpTransaction.cpp
5 xul.dll nsSocketOutputStream::nsSocketOutputStream(nsSocketTransport*) netwerk/base/nsSocketTransport2.cpp
6 xul.dll mozilla::net::Http2Stream::WriteSegments(mozilla::net::nsAHttpSegmentWriter*, unsigned int, unsigned int*) netwerk/protocol/http/Http2Stream.cpp
7 xul.dll mozilla::net::Http2Session::WriteSegments(mozilla::net::nsAHttpSegmentWriter*, unsigned int, unsigned int*) netwerk/protocol/http/Http2Session.cpp
8 xul.dll mozilla::net::nsHttpConnection::OnSocketReadable() netwerk/protocol/http/nsHttpConnection.cpp
9 xul.dll mozilla::net::nsHttpConnection::OnInputStreamReady(nsIAsyncInputStream*) netwerk/protocol/http/nsHttpConnection.cpp
10 xul.dll nsSocketInputStream::OnSocketReady(nsresult)
this one can only happen if the pipe's state is an error:
http://mxr.mozilla.org/mozilla-central/source/xpcom/io/nsPipe3.cpp#804
this can loop only if the error is SoftError:
http://hg.mozilla.org/releases/mozilla-release/annotate/9b79adf82b13/netwerk/protocol/http/Http2Session.cpp#l2709
I am not sure how pipe got broken.
Second stack:
https://crash-stats.mozilla.com/report/index/467e8bfc-0949-46dd-9f72-91dba2160206#allthreads
0 ntdll.dll RtlEnterCriticalSection
1 nss3.dll PR_ExitMonitor nsprpub/pr/src/threads/prmon.c
2 xul.dll nsPipeInputStream::Available(unsigned __int64*) xpcom/io/nsPipe3.cpp
3 xul.dll mozilla::net::IsDataAvailable netwerk/protocol/http/Http2Stream.cpp
4 xul.dll mozilla::net::Http2Stream::DeferCleanup(nsresult) netwerk/protocol/http/Http2Stream.cpp
5 xul.dll mozilla::net::Http2Session::CleanupStream(mozilla::net::Http2Stream*, nsresult, mozilla::net::Http2Session::errorType) netwerk/protocol/http/Http2Session.cpp
6 xul.dll mozilla::net::Http2Session::CleanupStream(unsigned int, nsresult, mozilla::net::Http2Session::errorType) netwerk/protocol/http/Http2Session.cpp
7 xul.dll mozilla::net::Http2Session::WriteSegments(mozilla::net::nsAHttpSegmentWriter*, unsigned int, unsigned int*) netwerk/protocol/http/Http2Session.cpp
8 xul.dll mozilla::net::nsHttpConnection::OnSocketReadable() netwerk/protocol/http/nsHttpConnection.cpp
9 xul.dll mozilla::net::nsHttpConnection::OnInputStreamReady(nsIAsyncInputStream*) netwerk/protocol/http/nsHttpConnection.cpp
10 xul.dll mozilla::net::HttpConnectionForceIO::Run() netwerk/protocol/http/nsHttpConnection.cpp
11 xul.dll nsThread::ProcessNextEvent(bool, bool*) xpcom/threads/nsThread.cpp
It is probably the same scenario as above:
so at:
http://hg.mozilla.org/releases/mozilla-release/annotate/9b79adf82b13/netwerk/protocol/http/Http2Session.cpp#l2724
mNeedsCleanup is nullptr, there is no other way, otherwise mDownstreamState would not stay PROCESSING_DATA_FRAME
and mInputBuffer pipe is not empty.
So pipe is in error state and we do not get out of this loop. Why the pipe is broken, I still do not know.
I will make a patch get out of this loop and I will continue looking for reason why the loop is in the error state.
Reporter | ||
Updated•9 years ago
|
Assignee: nobody → dd.mozilla
Status: NEW → ASSIGNED
Assignee | ||
Comment 9•9 years ago
|
||
perhaps the pipe is more machinery than we really want here. I still don't quite get what is wrong, but it certainly seems possible. In particular it contains locks we don't need.
what do think of just replacing it with something simpler.. a linked list of buffers that implements just read, write, and avaialable.. the operations would return OK or OOM. That's the behavior we're trying to get from pipe afterall.
Assignee | ||
Updated•9 years ago
|
Whiteboard: [spdy]
Assignee | ||
Comment 10•9 years ago
|
||
Assignee | ||
Comment 11•9 years ago
|
||
Attachment #8719184 -
Flags: review?(dd.mozilla)
Assignee | ||
Updated•9 years ago
|
Assignee: dd.mozilla → mcmanus
Assignee | ||
Comment 12•9 years ago
|
||
(In reply to Patrick McManus [:mcmanus] from comment #11)
> Created attachment 8719184 [details] [diff] [review]
> streamline h2 stream flow control buffer
I actually like this patch quite a bit.. while it adds a new class, it removes a couple nscomptr's and generally simplifies the h2 code.
it also adds OUT_OF_MEMORY to the list of hard errors - which is obviously true. (hard errors kill sessions, soft errors kill streams). Its possible that's all this bug really needs, but the "stuck on lock" stack traces make me want to do the whole thing.
Assignee | ||
Updated•9 years ago
|
Attachment #8719184 -
Flags: feedback?(hurley)
Reporter | ||
Comment 13•9 years ago
|
||
Sorry I have not look at your patch, but I have a theory what is happening:
Sorry it is Saturday, so maybe I am wrong, but I am writing it here, anyway :)
When we cancel a channel we cancel pipe on main thread and post an even to the socket thread actually to cancel nsTransaction:
http://mxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/nsHttpChannel.cpp#4906
so this function cancels input stream and it sets pipe to error state, but transaction is still valid. The nsHttp transaction will be canceled on the socket thread which will happen later (in this case never happens while we are looping). So if we enter Http2Session::WriteSegment and we are in that state, mDownstreamState == PROCESSING_DATA_FRAME, we are going to loop.
Reporter | ||
Comment 14•9 years ago
|
||
You patch is good to have, but it will not resolve this loop.
Assignee | ||
Comment 15•9 years ago
|
||
I know its saturday. I'm snowed in. feel free to ignore me :)
ok - I think I get it now. I was focusing on stack #2 in comment 8.. stack #1 is more interesting, and now that I kinda grok what you're saying I can also see stack #2 in that light.
I believe you are suggesting
1] nshttpconnection::onsocketreadable() [h2]
2] main thread - nshttpchannel::cancel (closes pipe shared by channel and nshttptransaction)
3] http2session::writesegments
4] http2stream::writesegments
5] nshttptransaction::writesegments
6] pipeoutputstream::writesegments
this would all be normal, and the next thing we would expect would be nshttptransaction::writepipesegment.. which would go back through all the stack callers eventually ending up in nshttpconnection::onwritesegment which would set msocketincondition
but you're saying, based on stack 1 in comment 8, that pipeoutputstream::writesegments is returning an error without going deeper in the stack.. specifically something that would be a soft error.. so it gets passed back up to Http2Session which maps it to NS_OK (soft error).. which means nsHttpConnection::OnSocketReadable loops back into the same circumstance again.
I like your theory (if that's it :))
I don't know what the pipe error code is, but its probably the same thing the channel was canceled with.. which in turn is probably something that is intentionally flagged as a soft-error like binding-aborted (because its going to be the same code that the transaction is flagged with when it is canceled)
Its tempting to just change the mSocketInCondition initializer to WOULD_BLOCK, but I think that would break other cases.
we might just need something ugly here that allows readsegments/writesegments to indicate they are masking an error.
you're right about the other patch.. I still like it but I'll move it to another bug as time allows.
Reporter | ||
Comment 16•9 years ago
|
||
Sorry for not explaining this better the first time.(probably because I was looking at this code for sometime so I miss explaining it in details for someone who did not stared at it)
Yes that was my theory :)
Yes both stacks are important in order to understand what is happening: the first one to see what is the cause of the error and the second why we are looping.
I think we should somehow mark that the error is coming from pipe being canceled so we are never going to call function that is going to the socket or consume that data in the pipe (http://mxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/Http2Stream.cpp#1411).
Http2Stream::DeferCleanup checks pipe and the pipe, even if it is in error state, is going to return amount of data available and NS_OK.
Maybe we can check if Http2Stream::OnWriteSegment was call - whether we are moving forward or not (I need to go through it once again but if I think that is enough) If it is not called and we are in Soft-error means the pipe has returned an error state.
(I was not complaining about Saturday, just I am not really working so I was writing comment 13 without really proving it completely :) (I am mo guaranty that there are no mistakes :) But I think comment 13 is correct.
Assignee | ||
Comment 17•9 years ago
|
||
when we hit the softstream error we do:
CleanupStream(streamID, NS_OK, CANCEL_ERROR);
That second argument, is OK because normally this is because the transaction has signaled something harmless like the parser EOF for the stream.. cleanupstream does not happen if stream->DeferCleanup (NS_OK) returns true.. which is defined as
return (NS_SUCCEEDED(status) && IsDataAvailable(mInputBufferIn));
so that doesn't happen on the softstream error path when there is data in the flow control buffer. (that's not the pipe that is in an error state - it is not shared with the channel.. there might be some confusion on that in the previous comment but I don't think it matters).
interestingly before firefox 44 (bug 1204614) defercleanup always returned false. So this is probably a regression
however if the event queue on the socket thread should continue to run (i.e. not be stopped by the again loop), Http2Session::CloseTransaction would call CleanupStream(stream, aResult, CANCEL_ERROR) - where aResult is definitely not OK.
In that case DeferCleanup would return false and CleanupStream would call CloseStream which would put the state machine in a mode to ignore the rest of the data frame (bypassing the broken transaction), etc..
Blocks: 1204614
Assignee | ||
Comment 18•9 years ago
|
||
Assignee | ||
Comment 19•9 years ago
|
||
[Tracking Requested - why for this release]:
[Tracking Requested - why for this release]:
(In reply to Patrick McManus [:mcmanus] from comment #18)
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=5643f0137bef
basically, if we don't loop on softerror things ought to work out... and that's a safe change to make. It seems plausible there are other scenarios where softerror is masking things to NS_OK that also would be fixed by this.
the other approach I considered, which is admittedly a smaller patch, is to chnage nsHttpConnection so that instead of looping synchronously it just posts a manual "fake" callback to OnSocketReady to the socket thread event queue every time again would be true and just returns instead... effectively making sure other stuff gets to run.. but a] that's a lot of machinery to just get a callback that probably immediately returns EWOULDBLOCK and b] that's going to change some timing pretty seriously and I'd like to live with it for a while before seeing it on release, and this patch feels like its going to be uplifted.
status-firefox44:
--- → affected
status-firefox45:
--- → affected
status-firefox46:
--- → affected
status-firefox47:
--- → affected
Assignee | ||
Comment 20•9 years ago
|
||
Attachment #8719224 -
Flags: review?(dd.mozilla)
Assignee | ||
Updated•9 years ago
|
Attachment #8718816 -
Attachment is obsolete: true
Assignee | ||
Updated•9 years ago
|
Attachment #8719184 -
Attachment is obsolete: true
Attachment #8719184 -
Flags: review?(dd.mozilla)
Attachment #8719184 -
Flags: feedback?(hurley)
Assignee | ||
Updated•9 years ago
|
Whiteboard: [spdy] → [spdy][necko-active]
Assignee | ||
Comment 21•9 years ago
|
||
Reporter | ||
Comment 22•9 years ago
|
||
Comment on attachment 8719224 [details] [diff] [review]
dont loop on http2 softerror
Review of attachment 8719224 [details] [diff] [review]:
-----------------------------------------------------------------
This should solve the problem.
And it is easier to uplift. We should uplift this.
Attachment #8719224 -
Flags: review?(dd.mozilla) → review+
Assignee | ||
Comment 23•9 years ago
|
||
Assignee | ||
Comment 24•9 years ago
|
||
(In reply to Patrick McManus [:mcmanus] from comment #21)
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=abb6d808cb19
the different platform compilers are fighting me a little bit on the combination of virtual/pure-virtual/polymorphic there.. but otherwise the try looks solid. I'll just rename a symbol to make this go easier.
Reporter | ||
Comment 25•9 years ago
|
||
We need to back out bug 1246778, to see if this has solved the problem.
Assignee | ||
Comment 26•9 years ago
|
||
Assignee | ||
Comment 27•9 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/9b6d687214543f0c21035b3d7f32520419314c2b
Bug 1247205 - dont loop on http2 softerror r=dragana
Comment 28•9 years ago
|
||
bugherder |
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla47
Assignee | ||
Comment 29•9 years ago
|
||
Comment on attachment 8719224 [details] [diff] [review]
dont loop on http2 softerror
Approval Request Comment
[Feature/regressing bug #]: This patch fixes what is believed to be a significant subset of the hangs and shutdown crashes in bug 1158189. Feature in bug 1204614 (gecko 44) certainly made this worse, but the liability has been there for a couple years.
[User impact if declined]: The fact that facebook transitioned to http/2 in mid january has also made this hang happen more often - facebook video would create a possible trigger for hanging of the network stack. I think the potential with FB and browser hangs is large enough that this should go onto beta.
[Describe test coverage new/current, TreeHerder]: This fix covers a very hard to produce race condition. I have done it once by hand and only then my changing a few internals. However the code paths involved are heavily tested with xpcshell.
[Risks and why]: Low risk This is a pretty minimal fix - other more invasive approaches were rejected (or reserved only for mozilla-central) with the objective of choosing the safest path. Only spdy and h2 are impacted and even then they are only impacted on error paths.
[String/UUID change made/needed]:
Attachment #8719224 -
Flags: approval-mozilla-beta?
Attachment #8719224 -
Flags: approval-mozilla-aurora?
Updated•9 years ago
|
Comment 30•9 years ago
|
||
Comment on attachment 8719224 [details] [diff] [review]
dont loop on http2 softerror
Fix an important issue, taking it.
Should be in 45 beta 7
Attachment #8719224 -
Flags: approval-mozilla-beta?
Attachment #8719224 -
Flags: approval-mozilla-beta+
Attachment #8719224 -
Flags: approval-mozilla-aurora?
Attachment #8719224 -
Flags: approval-mozilla-aurora+
Comment 31•9 years ago
|
||
bugherder uplift |
Comment 32•9 years ago
|
||
bugherder uplift |
Reporter | ||
Comment 33•9 years ago
|
||
This patch seems to fix the problem:
https://bugzilla.mozilla.org/show_bug.cgi?id=1174899#c69
You need to log in
before you can comment on or make changes to this bug.
Description
•