Closed Bug 1577571 Opened 5 years ago Closed 5 years ago

OnStop not called on websocket listener after successful asyncOpen

Categories

(Core :: Networking: WebSockets, defect)

x86_64
Linux
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla71
Tracking Status
firefox70 --- wontfix
firefox71 --- fixed

People

(Reporter: decoder, Assigned: decoder)

References

Details

(Whiteboard: [fuzzblocker])

Attachments

(3 files, 1 obsolete file)

The asyncOpen method guarantees that OnStop is called eventually if the initial call succeeded. With the attached testcase, the asyncOpen succeeds, but OnStop is never called afterwards. This is causing the fuzzer to timeout as it synchronizes on several events, including the OnStop event.

In order to reproduce, you need to:

  1. Apply the patch from bug 1528951 to get the NetworkWebsocket fuzzing target.
  2. Build with --enable-fuzzing (requires Clang and ASan, also build gtests using ./mach gtest dontruntests).
  3. Run MOZ_RUN_GTEST=1 LIBFUZZER=1 FUZZER=NetworkWebsocket objdir/dist/bin/firefox test.bin

I will also attach a full log after filing.

The old log had the flags set wrong, here is an updated one. So far I have found out that WebSocketChannel::OnStart/OnStop is called in the log, but it is not propagated to my WebSocketListener anymore (which works in the other cases).

Attachment #9089135 - Attachment is obsolete: true

Debugging this further, I think a call to AbortSession might be missing somewhere or something is not propagating its failure correctly to WebSocketChannel::OnStopRequest.

I see that https://searchfox.org/mozilla-central/rev/c8dba8c886609d0fdb8fae68bab865bf03f1165c/netwerk/protocol/websocket/WebSocketChannel.cpp#3849 is called with a aStatusCode = 0:

[(null) 30363: Main Thread]: D/nsWebSocket WebSocketChannel::OnStopRequest() 0x61d00003d480 [0x61d0000730c0 0x61d0000730c0 0]

But at the same time, OnTransportAvailable is never called on the channel either. Judging from the comment in OnStopRequest I assume that the invariant here is "either OnStopRequest is called with a status code indicating failure or eventually OnTransportAvailable is called". Neither happens and no stop is emitted.

Will debug this further and see if I the caller maybe drops an error or something.

I think I found the place, where the error is dropped:

[(null) 30363: Socket Thread]: V/nsHttp nsHttpConnection::CloseTransaction[this=0x61600005fd80 trans=0x61b00001ff80 reason=80470002]

The reason=80470002 here is NS_BASE_STREAM_CLOSED.

And here [1] we have this code:

void nsHttpConnection::CloseTransaction(nsAHttpTransaction* trans,
                                        nsresult reason, bool aIsShutdown) {
  ...

  // mask this error code because its not a real error.
  if (reason == NS_BASE_STREAM_CLOSED) reason = NS_OK;

So we are masking the error (which is most likely due to the buffer being empty, which is the same as the connection just terminating). I don't see a good reason why this is not a "real error" because this seems equivalent to the connection terminating unexpectedly. If this happens mid-establishing a WebSocket channel like here, it won't be shutdown properly.

Michal, can you confirm this analysis? And should we remove this error masking, or will this cause other problems? Alternatively we could rewrite the websockets code to check in OnStopRequest if AbortSession was already called instead of relying on the status code to tell us that.

[1] https://searchfox.org/mozilla-central/rev/c8dba8c886609d0fdb8fae68bab865bf03f1165c/netwerk/protocol/http/nsHttpConnection.cpp#1810

Flags: needinfo?(michal.novotny)

Unfortunately not masking the error still doesn't solve the problem. The error is again dropped shortly after (Http2Session::Close is not seeing it anymore). I can't judge if this error is supposed to be propagated or if the WebSockets code should ensure on its own that it is properly teared down in all cases.

Okay... so the error is not propagated because the Http2Session::Close occurs here [1] in Http2Session::DontReuse which like all the DontReuse functions has no parameter for passing a reason. I wonder if this in itself is already a problem because this function is called in error conditions (like this one, the call originates from here [2]) and at the same time, it calls Close, essentially masking any potential error condition.

There are two ways to fix this:

  1. Change DontReuse to propagate an optional nsresult aReason = NS_OK (seems complicated)

or 2) Change WebSocketChannel::OnStopRequest and assume that !mRecvdHttpUpgradeTransport && !aStatusCode means that the stream just aborted and call AbortSession(NS_BASE_STREAM_CLOSED). This sounds like a hack to me.

This patch "fixes" the problem:

@@ -3856,16 +3859,18 @@ WebSocketChannel::OnStopRequest(nsIReque
   LOG(("WebSocketChannel::OnStopRequest() %p [%p %p %" PRIx32 "]\n", this,
        aRequest, mHttpChannel.get(), static_cast<uint32_t>(aStatusCode)));
   MOZ_ASSERT(NS_IsMainThread(), "not main thread");
 
   // OnTransportAvailable won't be called if the request is stopped with
   // an error. Abort the session now instead of waiting for timeout.
   if (NS_FAILED(aStatusCode) && !mRecvdHttpUpgradeTransport) {
     AbortSession(aStatusCode);
+  } else if (!mRecvdHttpUpgradeTransport) {
+    AbortSession(NS_BASE_STREAM_CLOSED);
   }
 
   ReportConnectionTelemetry(aStatusCode);

[1] https://searchfox.org/mozilla-central/rev/c8dba8c886609d0fdb8fae68bab865bf03f1165c/netwerk/protocol/http/Http2Session.cpp#710
[2] https://searchfox.org/mozilla-central/rev/c8dba8c886609d0fdb8fae68bab865bf03f1165c/netwerk/protocol/http/nsHttpConnection.cpp#1813

(In reply to Christian Holler (:decoder) from comment #6)

@@ -3856,16 +3859,18 @@ WebSocketChannel::OnStopRequest(nsIReque
   LOG(("WebSocketChannel::OnStopRequest() %p [%p %p %" PRIx32 "]\n", this,
        aRequest, mHttpChannel.get(), static_cast<uint32_t>(aStatusCode)));
   MOZ_ASSERT(NS_IsMainThread(), "not main thread");
 
   // OnTransportAvailable won't be called if the request is stopped with
   // an error. Abort the session now instead of waiting for timeout.
   if (NS_FAILED(aStatusCode) && !mRecvdHttpUpgradeTransport) {
     AbortSession(aStatusCode);
+  } else if (!mRecvdHttpUpgradeTransport) {
+    AbortSession(NS_BASE_STREAM_CLOSED);
   }

OnStopRequest() is called on main thread whereas OnTransportAvailable() is called on socket thread and is redispatched to main thread. I can imagine that OnTransportAvailable() will be called after OnStopRequest() is called and this change would drop any received data. Did you try to find out why OnTransportAvailable() isn't called.

Flags: needinfo?(michal.novotny)

(In reply to Michal Novotny [:michal] from comment #7)

OnStopRequest() is called on main thread whereas OnTransportAvailable() is called on socket thread and is redispatched to main thread. I can imagine that OnTransportAvailable() will be called after OnStopRequest() is called and this change would drop any received data. Did you try to find out why OnTransportAvailable() isn't called.

As discussed on IRC: OnTransportAvailable is not called here because the connection is closed before upgrade. The invariant violated here is "either OnStopRequest is called with an error or OnTransportAvailable is called eventually". OnStopRequest is called without an error because the error is not propagated.

If OnTransportAvailable can be called at the same time as OnStopRequest or even after, then my fix won't work.

In this case, the proper fix could be to propagate the NS_BASE_STREAM_CLOSED all the way down so it can be handled as an error. I don't know what effects it would have on other implementation if we decide to not special case NS_BASE_STREAM_CLOSED anymore, but it sounds to me that just terminating the connection should be an error in many cases. I would even say it has a lot of potential to just hang/leak things if at the right time the peer terminates the connection and we don't run e.g. error cleanup code. And those bugs will be hard to reproduce.

NI for :michal who wants to talk to :dragana to figure out a way forward here.

Flags: needinfo?(michal.novotny)

(In reply to Michal Novotny [:michal] from comment #7)

OnStopRequest() is called on main thread whereas OnTransportAvailable() is called on socket thread and is redispatched to main thread. I can imagine that OnTransportAvailable() will be called after OnStopRequest() is called and this change would drop any received data. Did you try to find out why OnTransportAvailable() isn't called.

I checked logs of few websocket sessions and OnTransportAvailable() is normally called after OnStopRequest(), so you patch breaks websocket completely. IMO the solution here is to make sure that when HTTPUpgrade() is called on the HTTP channel then either OnTransportAvailable() is called or OnStopRequest() is called with an error. The current situation when OnStopRequest() is called with a success code but OnTransportAvailable() is never called cannot be easily handled. Dragana, do you agree?

Flags: needinfo?(michal.novotny) → needinfo?(dd.mozilla)

With the help of :michal I was able to figure out that inside nsHttpConnectionMgr::OnMsgCompleteUpgrade the call to conn->TakeTransport is failing. As OnStopRequest might already have been called at this point by the main thread, we cannot use it to propagate the failure. Instead, :michal suggested that we add something to nsIHttpUpgradeListener to allow error notification.

The attached patch adds OnUpgradeFailed to the interface and adds the necessary implementations of that method. With this patch, the log looks like this now

[(null) 4599: Socket Thread]: V/nsHttp nsHttpConnectionMgr::OnMsgCompleteUpgrade conn=0x61b000020690 listener=0x61d000096b28 wrapped=0
[(null) 4599: Socket Thread]: V/nsHttp   conn->TakeTransport failed with 8000ffff
[(null) 4599: Socket Thread]: D/nsWebSocket WebSocketChannel::OnUpgradeFailed() 0x61d000096a80 [aErrorCode 8000ffff]
[(null) 4599: Socket Thread]: D/nsWebSocket WebSocketChannel::AbortSession() 0x61d000096a80 [reason 8000ffff] stopped = 0
[(null) 4599: Socket Thread]: D/nsWebSocket WebSocketChannel::DoStopSession() 0x61d000096a80 [8000ffff]
[(null) 4599: Socket Thread]: D/nsWebSocket WebSocketChannel::CleanupConnection() 0x61d000096a80
[(null) 4599: Socket Thread]: D/nsWebSocket Websocket: OnStopSession: [this=0x61d000096a80, reason=0x8000ffff]
[(null) 4599: Socket Thread]: D/nsWebSocket WebSocket: connection to 127.0.0.1, 443 failed: [this=0x61d000096a80]
[(null) 4599: Socket Thread]: D/nsWebSocket Websocket: RemoveFromQueue: [this=0x61d000096a80]
[(null) 4599: Socket Thread]: D/nsWebSocket Websocket: removing conn 0x603000138550 from the queue
[(null) 4599: Socket Thread]: D/nsWebSocket Websocket: changing state to NOT_CONNECTING
[(null) 4599: Main Thread]: D/nsWebSocket WebSocketChannel::OnStartRequest(): 0x61d000096a80 [0x61d0000974c0 0x61d0000974c0] recvdhttpupgrade=0
[(null) 4599: Main Thread]: D/nsWebSocket WebSocketChannel::OnStartRequest: Channel Already Done
[(null) 4599: Main Thread]: D/nsWebSocket WebSocketChannel::AbortSession() 0x61d000096a80 [reason 804b000d] stopped = 1
[(null) 4599: Main Thread]: D/nsWebSocket WebSocketChannel::OnStopRequest() 0x61d000096a80 [0x61d0000974c0 0x61d0000974c0 804b000d]
[(null) 4599: Main Thread]: D/nsWebSocket WebSocketChannel::AbortSession() 0x61d000096a80 [reason 804b000d] stopped = 1
[(null) 4599: Main Thread]: D/nsWebSocket WebSocketChannel::ReportConnectionTelemetry() 0x61d000096a80 6

and the timeout in fuzzing gone.

Assignee: nobody → choller
Status: NEW → ASSIGNED
Summary: OnStop not called on listener after successful asyncOpen → OnStop not called on websocket listener after successful asyncOpen
Whiteboard: [fuzzblocker]
Pushed by choller@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/cf7a95ce72ec Add OnUpgradeFailed to nsIHttpUpgradeListener. r=michal
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla71

Is this needed in beta70?

Flags: needinfo?(choller)

From a fuzzing perspective, no. I think we can let this ride the trains.

Flags: needinfo?(choller)
Flags: needinfo?(dd.mozilla)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: