OnStop not called on websocket listener after successful asyncOpen
Categories
(Core :: Networking: WebSockets, defect)
Tracking
()
People
(Reporter: decoder, Assigned: decoder)
References
Details
(Whiteboard: [fuzzblocker])
Attachments
(3 files, 1 obsolete file)
The asyncOpen
method guarantees that OnStop
is called eventually if the initial call succeeded. With the attached testcase, the asyncOpen
succeeds, but OnStop
is never called afterwards. This is causing the fuzzer to timeout as it synchronizes on several events, including the OnStop
event.
In order to reproduce, you need to:
- Apply the patch from bug 1528951 to get the NetworkWebsocket fuzzing target.
- Build with --enable-fuzzing (requires Clang and ASan, also build gtests using
./mach gtest dontruntests
). - Run
MOZ_RUN_GTEST=1 LIBFUZZER=1 FUZZER=NetworkWebsocket objdir/dist/bin/firefox test.bin
I will also attach a full log after filing.
Assignee | ||
Comment 1•5 years ago
|
||
Assignee | ||
Comment 2•5 years ago
|
||
The old log had the flags set wrong, here is an updated one. So far I have found out that WebSocketChannel::OnStart/OnStop
is called in the log, but it is not propagated to my WebSocketListener
anymore (which works in the other cases).
Assignee | ||
Comment 3•5 years ago
|
||
Debugging this further, I think a call to AbortSession
might be missing somewhere or something is not propagating its failure correctly to WebSocketChannel::OnStopRequest
.
I see that https://searchfox.org/mozilla-central/rev/c8dba8c886609d0fdb8fae68bab865bf03f1165c/netwerk/protocol/websocket/WebSocketChannel.cpp#3849 is called with a aStatusCode = 0
:
[(null) 30363: Main Thread]: D/nsWebSocket WebSocketChannel::OnStopRequest() 0x61d00003d480 [0x61d0000730c0 0x61d0000730c0 0]
But at the same time, OnTransportAvailable
is never called on the channel either. Judging from the comment in OnStopRequest
I assume that the invariant here is "either OnStopRequest
is called with a status code indicating failure or eventually OnTransportAvailable
is called". Neither happens and no stop is emitted.
Will debug this further and see if I the caller maybe drops an error or something.
Assignee | ||
Comment 4•5 years ago
|
||
I think I found the place, where the error is dropped:
[(null) 30363: Socket Thread]: V/nsHttp nsHttpConnection::CloseTransaction[this=0x61600005fd80 trans=0x61b00001ff80 reason=80470002]
The reason=80470002
here is NS_BASE_STREAM_CLOSED
.
And here [1] we have this code:
void nsHttpConnection::CloseTransaction(nsAHttpTransaction* trans,
nsresult reason, bool aIsShutdown) {
...
// mask this error code because its not a real error.
if (reason == NS_BASE_STREAM_CLOSED) reason = NS_OK;
So we are masking the error (which is most likely due to the buffer being empty, which is the same as the connection just terminating). I don't see a good reason why this is not a "real error" because this seems equivalent to the connection terminating unexpectedly. If this happens mid-establishing a WebSocket channel like here, it won't be shutdown properly.
Michal, can you confirm this analysis? And should we remove this error masking, or will this cause other problems? Alternatively we could rewrite the websockets code to check in OnStopRequest
if AbortSession
was already called instead of relying on the status code to tell us that.
Assignee | ||
Comment 5•5 years ago
|
||
Unfortunately not masking the error still doesn't solve the problem. The error is again dropped shortly after (Http2Session::Close
is not seeing it anymore). I can't judge if this error is supposed to be propagated or if the WebSockets code should ensure on its own that it is properly teared down in all cases.
Assignee | ||
Comment 6•5 years ago
|
||
Okay... so the error is not propagated because the Http2Session::Close
occurs here [1] in Http2Session::DontReuse
which like all the DontReuse
functions has no parameter for passing a reason. I wonder if this in itself is already a problem because this function is called in error conditions (like this one, the call originates from here [2]) and at the same time, it calls Close
, essentially masking any potential error condition.
There are two ways to fix this:
- Change
DontReuse
to propagate an optionalnsresult aReason = NS_OK
(seems complicated)
or 2) Change WebSocketChannel::OnStopRequest
and assume that !mRecvdHttpUpgradeTransport && !aStatusCode
means that the stream just aborted and call AbortSession(NS_BASE_STREAM_CLOSED)
. This sounds like a hack to me.
This patch "fixes" the problem:
@@ -3856,16 +3859,18 @@ WebSocketChannel::OnStopRequest(nsIReque
LOG(("WebSocketChannel::OnStopRequest() %p [%p %p %" PRIx32 "]\n", this,
aRequest, mHttpChannel.get(), static_cast<uint32_t>(aStatusCode)));
MOZ_ASSERT(NS_IsMainThread(), "not main thread");
// OnTransportAvailable won't be called if the request is stopped with
// an error. Abort the session now instead of waiting for timeout.
if (NS_FAILED(aStatusCode) && !mRecvdHttpUpgradeTransport) {
AbortSession(aStatusCode);
+ } else if (!mRecvdHttpUpgradeTransport) {
+ AbortSession(NS_BASE_STREAM_CLOSED);
}
ReportConnectionTelemetry(aStatusCode);
[1] https://searchfox.org/mozilla-central/rev/c8dba8c886609d0fdb8fae68bab865bf03f1165c/netwerk/protocol/http/Http2Session.cpp#710
[2] https://searchfox.org/mozilla-central/rev/c8dba8c886609d0fdb8fae68bab865bf03f1165c/netwerk/protocol/http/nsHttpConnection.cpp#1813
Comment 7•5 years ago
|
||
(In reply to Christian Holler (:decoder) from comment #6)
@@ -3856,16 +3859,18 @@ WebSocketChannel::OnStopRequest(nsIReque LOG(("WebSocketChannel::OnStopRequest() %p [%p %p %" PRIx32 "]\n", this, aRequest, mHttpChannel.get(), static_cast<uint32_t>(aStatusCode))); MOZ_ASSERT(NS_IsMainThread(), "not main thread"); // OnTransportAvailable won't be called if the request is stopped with // an error. Abort the session now instead of waiting for timeout. if (NS_FAILED(aStatusCode) && !mRecvdHttpUpgradeTransport) { AbortSession(aStatusCode); + } else if (!mRecvdHttpUpgradeTransport) { + AbortSession(NS_BASE_STREAM_CLOSED); }
OnStopRequest() is called on main thread whereas OnTransportAvailable() is called on socket thread and is redispatched to main thread. I can imagine that OnTransportAvailable() will be called after OnStopRequest() is called and this change would drop any received data. Did you try to find out why OnTransportAvailable() isn't called.
Assignee | ||
Comment 8•5 years ago
|
||
(In reply to Michal Novotny [:michal] from comment #7)
OnStopRequest() is called on main thread whereas OnTransportAvailable() is called on socket thread and is redispatched to main thread. I can imagine that OnTransportAvailable() will be called after OnStopRequest() is called and this change would drop any received data. Did you try to find out why OnTransportAvailable() isn't called.
As discussed on IRC: OnTransportAvailable
is not called here because the connection is closed before upgrade. The invariant violated here is "either OnStopRequest
is called with an error or OnTransportAvailable
is called eventually". OnStopRequest
is called without an error because the error is not propagated.
If OnTransportAvailable
can be called at the same time as OnStopRequest
or even after, then my fix won't work.
In this case, the proper fix could be to propagate the NS_BASE_STREAM_CLOSED
all the way down so it can be handled as an error. I don't know what effects it would have on other implementation if we decide to not special case NS_BASE_STREAM_CLOSED
anymore, but it sounds to me that just terminating the connection should be an error in many cases. I would even say it has a lot of potential to just hang/leak things if at the right time the peer terminates the connection and we don't run e.g. error cleanup code. And those bugs will be hard to reproduce.
Assignee | ||
Comment 9•5 years ago
|
||
NI for :michal who wants to talk to :dragana to figure out a way forward here.
Comment 10•5 years ago
|
||
(In reply to Michal Novotny [:michal] from comment #7)
OnStopRequest() is called on main thread whereas OnTransportAvailable() is called on socket thread and is redispatched to main thread. I can imagine that OnTransportAvailable() will be called after OnStopRequest() is called and this change would drop any received data. Did you try to find out why OnTransportAvailable() isn't called.
I checked logs of few websocket sessions and OnTransportAvailable() is normally called after OnStopRequest(), so you patch breaks websocket completely. IMO the solution here is to make sure that when HTTPUpgrade() is called on the HTTP channel then either OnTransportAvailable() is called or OnStopRequest() is called with an error. The current situation when OnStopRequest() is called with a success code but OnTransportAvailable() is never called cannot be easily handled. Dragana, do you agree?
Assignee | ||
Comment 11•5 years ago
|
||
Assignee | ||
Comment 12•5 years ago
|
||
With the help of :michal I was able to figure out that inside nsHttpConnectionMgr::OnMsgCompleteUpgrade
the call to conn->TakeTransport
is failing. As OnStopRequest
might already have been called at this point by the main thread, we cannot use it to propagate the failure. Instead, :michal suggested that we add something to nsIHttpUpgradeListener
to allow error notification.
The attached patch adds OnUpgradeFailed
to the interface and adds the necessary implementations of that method. With this patch, the log looks like this now
[(null) 4599: Socket Thread]: V/nsHttp nsHttpConnectionMgr::OnMsgCompleteUpgrade conn=0x61b000020690 listener=0x61d000096b28 wrapped=0
[(null) 4599: Socket Thread]: V/nsHttp conn->TakeTransport failed with 8000ffff
[(null) 4599: Socket Thread]: D/nsWebSocket WebSocketChannel::OnUpgradeFailed() 0x61d000096a80 [aErrorCode 8000ffff]
[(null) 4599: Socket Thread]: D/nsWebSocket WebSocketChannel::AbortSession() 0x61d000096a80 [reason 8000ffff] stopped = 0
[(null) 4599: Socket Thread]: D/nsWebSocket WebSocketChannel::DoStopSession() 0x61d000096a80 [8000ffff]
[(null) 4599: Socket Thread]: D/nsWebSocket WebSocketChannel::CleanupConnection() 0x61d000096a80
[(null) 4599: Socket Thread]: D/nsWebSocket Websocket: OnStopSession: [this=0x61d000096a80, reason=0x8000ffff]
[(null) 4599: Socket Thread]: D/nsWebSocket WebSocket: connection to 127.0.0.1, 443 failed: [this=0x61d000096a80]
[(null) 4599: Socket Thread]: D/nsWebSocket Websocket: RemoveFromQueue: [this=0x61d000096a80]
[(null) 4599: Socket Thread]: D/nsWebSocket Websocket: removing conn 0x603000138550 from the queue
[(null) 4599: Socket Thread]: D/nsWebSocket Websocket: changing state to NOT_CONNECTING
[(null) 4599: Main Thread]: D/nsWebSocket WebSocketChannel::OnStartRequest(): 0x61d000096a80 [0x61d0000974c0 0x61d0000974c0] recvdhttpupgrade=0
[(null) 4599: Main Thread]: D/nsWebSocket WebSocketChannel::OnStartRequest: Channel Already Done
[(null) 4599: Main Thread]: D/nsWebSocket WebSocketChannel::AbortSession() 0x61d000096a80 [reason 804b000d] stopped = 1
[(null) 4599: Main Thread]: D/nsWebSocket WebSocketChannel::OnStopRequest() 0x61d000096a80 [0x61d0000974c0 0x61d0000974c0 804b000d]
[(null) 4599: Main Thread]: D/nsWebSocket WebSocketChannel::AbortSession() 0x61d000096a80 [reason 804b000d] stopped = 1
[(null) 4599: Main Thread]: D/nsWebSocket WebSocketChannel::ReportConnectionTelemetry() 0x61d000096a80 6
and the timeout in fuzzing gone.
Assignee | ||
Updated•5 years ago
|
Comment 13•5 years ago
|
||
Comment 14•5 years ago
|
||
bugherder |
Assignee | ||
Comment 16•5 years ago
|
||
From a fuzzing perspective, no. I think we can let this ride the trains.
Updated•5 years ago
|
Updated•5 years ago
|
Description
•