Intermittent test_websocket.html | Exited with code -1073741819 from a crash [@ nsHttpPipeline::IsDone()]

RESOLVED FIXED in Firefox 15

Status

()

Core
Networking: WebSockets
RESOLVED FIXED
6 years ago
5 years ago

People

(Reporter: mak, Assigned: mcmanus)

Tracking

({intermittent-failure})

Trunk
mozilla15
x86_64
Windows 7
intermittent-failure
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox13 affected, firefox14 affected, firefox15 fixed)

Details

Attachments

(1 attachment)

could be related to recent changes to http pipelining

https://tbpl.mozilla.org/php/getParsedLog.php?id=10344326&tree=Mozilla-Inbound

TEST-UNEXPECTED-FAIL | /tests/content/base/test/test_websocket.html | Exited with code -1073741819 during test run
INFO | automation.py | Application ran for: 0:06:54.891000
INFO | automation.py | Reading PID log: c:\users\cltbld\appdata\local\temp\tmphuue0opidlog
==> process 2936 launched child process 3804
INFO | automation.py | Checking for orphan process with PID: 3804
Downloading symbols from: http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-win32-debug/1332579810/firefox-14.0a1.en-US.win32.crashreporter-symbols.zip
PROCESS-CRASH | /tests/content/base/test/test_websocket.html | application crashed (minidump found)
Crash dump filename: c:\users\cltbld\appdata\local\temp\tmpizwgjc\minidumps\cbc89f78-dbcc-4ea2-895c-2e6e7569b1be.dmp
Operating system: Windows NT
                  6.1.7600 
CPU: x86
     GenuineIntel family 6 model 23 stepping 10
     2 CPUs

Crash reason:  EXCEPTION_ACCESS_VIOLATION_READ
Crash address: 0xffffffffddddddf9

Thread 0 (crashed)
 0  xul.dll!nsHttpPipeline::IsDone() [nsHttpPipeline.cpp:9094735bcb46 : 634 + 0xd]
    eip = 0x6941f9c7   esp = 0x002bd6a0   ebp = 0x002bd6c0   ebx = 0x00000001
    esi = 0x14828804   edi = 0x00000000   eax = 0x1916d6d0   ecx = 0x1916d6d0
    edx = 0xdddddddd   efl = 0x00210297
    Found by: given as instruction pointer in context
 1  xul.dll!nsHttpConnection::TakeTransport(nsISocketTransport * *,nsIAsyncInputStream * *,nsIAsyncOutputStream * *) [nsHttpConnection.cpp:9094735bcb46 : 929 + 0x12]
    eip = 0x693fb0b9   esp = 0x002bd6b0   ebp = 0x002bd6c0
    Found by: stack scanning
 2  xul.dll!nsHttpChannel::OnStopRequest(nsIRequest *,nsISupports *,unsigned int) [nsHttpChannel.cpp:9094735bcb46 : 4488 + 0x52]
    eip = 0x6941cfc0   esp = 0x002bd6c8   ebp = 0x002bd704
    Found by: call frame info
 3  xul.dll!nsInputStreamPump::OnStateStop() [nsInputStreamPump.cpp:9094735bcb46 : 583 + 0x18]
    eip = 0x6938b36a   esp = 0x002bd70c   ebp = 0x002bd72c
    Found by: call frame info
 4  xul.dll!nsInputStreamPump::OnInputStreamReady(nsIAsyncInputStream *) [nsInputStreamPump.cpp:9094735bcb46 : 405 + 0x7]
    eip = 0x6938b414   esp = 0x002bd734   ebp = 0x002bd740
    Found by: call frame info
 5  xul.dll!nsInputStreamReadyEvent::Run() [nsStreamUtils.cpp:9094735bcb46 : 114 + 0x10]
    eip = 0x6a03a08d   esp = 0x002bd748   ebp = 0x002bd758
    Found by: call frame info
 6  xul.dll!nsThread::ProcessNextEvent(bool,bool *) [nsThread.cpp:9094735bcb46 : 656 + 0xd]
    eip = 0x6a04eb78   esp = 0x002bd760   ebp = 0x002bd788
    Found by: call frame info
 7  xul.dll!NS_ProcessNextEvent_P(nsIThread *,bool) [nsThreadUtils.cpp:9094735bcb46 : 245 + 0xc]
    eip = 0x6a0116a7   esp = 0x002bd790   ebp = 0x002bd79c
    Found by: call frame info
 8  xul.dll!mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate *) [MessagePump.cpp:9094735bcb46 : 134 + 0x9]
    eip = 0x69f305b8   esp = 0x002bd7a4   ebp = 0x002bd7c8
I'm out of the office this week so don't panic if you don't see an immediate patch - but this is firmly on my radar.
Assignee: nobody → mcmanus
Blocks: 739762
Comment hidden (Treeherder Robot)
https://tbpl.mozilla.org/php/getParsedLog.php?id=10455529&tree=Mozilla-Inbound
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Thread 0 (crashed)
 0  xul.dll!nsHttpPipeline::IsDone() [nsHttpPipeline.cpp:9094735bcb46 : 634 + 0xd]
    eip = 0x6941f9c7   esp = 0x002bd6a0   ebp = 0x002bd6c0   ebx = 0x00000001
    esi = 0x14828804   edi = 0x00000000   eax = 0x1916d6d0   ecx = 0x1916d6d0
    edx = 0xdddddddd   efl = 0x00210297
    Found by: given as instruction pointer in context
 1  xul.dll!nsHttpConnection::TakeTransport(nsISocketTransport * *,nsIAsyncInputStream * *,nsIAsyncOutputStream * *) [nsHttpConnection.cpp:9094735bcb46 : 929 + 0x12]
    eip = 0x693fb0b9   esp = 0x002bd6b0   ebp = 0x002bd6c0
    Found by: stack scanning
 2  xul.dll!nsHttpChannel::OnStopRequest(nsIRequest *,nsISupports *,unsigned int) [nsHttpChannel.cpp:9094735bcb46 : 4488 + 0x52]
    eip = 0x6941cfc0   esp = 0x002bd6c8   ebp = 0x002bd704
    Found by: call frame info

Thread 0 is the main thread. That obviously is going to have some problems with the nshttppipeline implementation of nsAHttpTransaction. Its not clear to me that this is safe even with the normal mTransaction && mTransaciton->IsDone() check nsHttpConnection does.. that could be racy if an EOF or something arrived on the network thread.

To get this stable this code should ping pong onto the network thread for the taketransport() check and then back to the main thread for OnTransportAvailable().. that's sub-optimal but this is in the high-latency handshake portion of the program anyhow - it won't impact actual message latency in the normal states. We can get it stable and then later figure out how to streamline the handshale.

using the pipeline datastructure makes this more obvious, but imo this is a long standing bug that may need porting.
Created attachment 613839 [details] [diff] [review]
patch 0

move nsHttpConnection::TakeTransport() to the socket thread, while leaving OnTransportAvailable on the main thread.

Also, don't call content/nsWebSocket.cpp::OnStart() until the IO handlers are fully installed (we need both OnStartRequest and OnTransportAvailable to have been executed, not just OSR).. not doing this means that when firefox writes a message before the IO handlers are installed it might never go out - this was really obviously triggerable with the threading changes mentioned above, but it was a pre-existing problem.
Attachment #613839 - Flags: review?(jduell.mcbugs)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment on attachment 613839 [details] [diff] [review]
patch 0

Review of attachment 613839 [details] [diff] [review]:
-----------------------------------------------------------------

So the reason IsDone was segfaulting was re-entrant usage of mResponseQ/mRequestQ from both main/socket threads, correct?

Is there a reason we also need to move the call to mListener->OnStart() from WebSocketChannel::OnStartRequest to WebSocketChannel::StartWebsocketData?  The move seems ok, and it's intuitively sensible to wait for both OnTransportAvailable and OnStartRequest to be called before calling OnStart.  The only issue I could think of is error handling:  if for some reason TakeTransport fails (is it possible in practice in this control path?) you don't wind up calling OnTransportAvailable, so you'd now never call OnStart.  Not that this appears to make much difference: we don't seem to have error handling for failed TakeTransport either before or after this patch.  I'm ok with fixing that in another bug if it's indeed an issue.

All the code to move TakeTransport to the socket thread looks solid.

::: netwerk/protocol/http/nsHttpConnectionMgr.cpp
@@ +387,5 @@
> +public:
> +nsCompleteUpgradeData(nsAHttpConnection *aConn,
> +                      nsIHttpUpgradeListener *aListener)
> +    : mConn(aConn)
> +        , mUpgradeListener(aListener) {}

weird indenting.  Just put both inits on same line? Or at least same indent level.
Attachment #613839 - Flags: review?(jduell.mcbugs) → review+
Comment hidden (Treeherder Robot)
(In reply to Jason Duell (:jduell) from comment #27)
> Comment on attachment 613839 [details] [diff] [review]
> patch 0
> 
> Review of attachment 613839 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> So the reason IsDone was segfaulting was re-entrant usage of
> mResponseQ/mRequestQ from both main/socket threads, correct?
> 

yes.

but it really exposed a fundamental problem that we aren't seeing reports of - the transaction could just as easily be closed on the socket thread due to an EOF while IsDone() is running on the main thread.. so this isn't so much a regression in 14 as much as 14 bringing that issue to the surface.

> Is there a reason we also need to move the call to mListener->OnStart() from
> WebSocketChannel::OnStartRequest to WebSocketChannel::StartWebsocketData? 

without that change the content code can submit a message before the http upgrade is complete (i.e. before the transport object can do real websocket I/O) and OnOutputStreamReady() doesn't send it because mSocketOut is still not setup.. the change I made seemed as you say to be the intuitive way to fix that - an alternative would have been to manually call OnOutputStreamReady out of StartWebsocketData to clear the queue, but in general pushing the buffering back towards the source-generator is good policy.

> The move seems ok, and it's intuitively sensible to wait for both
> OnTransportAvailable and OnStartRequest to be called before calling OnStart.
> The only issue I could think of is error handling:  if for some reason
> TakeTransport fails (is it possible in practice in this control path?) you
> don't wind up calling OnTransportAvailable, so you'd now never call OnStart.
> Not that this appears to make much difference: we don't seem to have error
> handling for failed TakeTransport either before or after this patch.  I'm ok
> with fixing that in another bug if it's indeed an issue.
>

yes, that's orthogonal if its a problem at all. I don't think you have to call OTA (though I'd have to think about it), but I'm not sure the right start/stop gets sent to the content class.
status-firefox14: --- → affected
status-firefox15: --- → affected
Target Milestone: --- → mozilla15
ff15
https://hg.mozilla.org/integration/mozilla-inbound/rev/92d22954cbd1
https://hg.mozilla.org/mozilla-central/rev/92d22954cbd1
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
status-firefox15: affected → fixed
Comment on attachment 613839 [details] [diff] [review]
patch 0

[Approval Request Comment]
Regression caused by (bug #): 447866
User impact if declined: crash risk
Testing completed (on m-c, etc.): baked on m-c, has cured an orange there
Risk to taking this patch (and alternatives if risky):
String changes made by this patch: none
Attachment #613839 - Flags: approval-mozilla-aurora?
(In reply to Patrick McManus [:mcmanus] from comment #32)
> Risk to taking this patch (and alternatives if risky):

What risk does this patch to nsHttpChannel.cpp carry? I'd only fix an intermittent orange on Aurora if it had very low risk.
(In reply to Alex Keybl [:akeybl] from comment #33)
> (In reply to Patrick McManus [:mcmanus] from comment #32)
> > Risk to taking this patch (and alternatives if risky):
> 
> What risk does this patch to nsHttpChannel.cpp carry? I'd only fix an
> intermittent orange on Aurora if it had very low risk.

I never know how to answer this in a quantifiable way. So the question is always easier to ask than answer.

It's a moderate amount of platfrom C++ code. Mostly a rearrangement onto other execution threads of existing code though. It has test coverage and performs better with that same coverage on m-c than on aurora. The code path is pretty central to websockets, so we can be sure it is being exercised outside of tests too.

The aurora code can crash due to a race condition. This is a long standing issue, but a non websockets checkin on 14 made it worse by adding to the list of things that could race dangerously, so the fix for 13 and 14 would be the same - but the risk of triggering the bug is larger on 14. That's why I am not proposing beta?
(In reply to Patrick McManus [:mcmanus] from comment #34)
> The aurora code can crash due to a race condition. This is a long standing
> issue, but a non websockets checkin on 14 made it worse by adding to the
> list of things that could race dangerously, so the fix for 13 and 14 would
> be the same - but the risk of triggering the bug is larger on 14. That's why
> I am not proposing beta?

I'm trying to avoid taking moderate amounts of code change on any branch without data that suggests this will be a common/significant problem for our users. Without this fix, would we expect the crash signature [@ nsHttpPipeline::IsDone()] to be apparent in our crash data, or is that just one way the bug is apparent?
(In reply to Alex Keybl [:akeybl] from comment #35)
> 
> I'm trying to avoid taking moderate amounts of code change on any branch
> without data that suggests this will be a common/significant problem for our
> users. Without this fix, would we expect the crash signature [@
> nsHttpPipeline::IsDone()] to be apparent in our crash data, or is that just
> one way the bug is apparent?

nsHttpPipeline::IsDone() at the top will likely be the most common - they should all contain nsHttpConnection::TakeTransport() somewhere in the stack.
From looking at crash-stats, I don't believe this is prevalent enough to take any  code change to resolve for FF14. If related crashes do rise, we'll reconsider taking the change.
status-firefox13: --- → affected

Updated

5 years ago
Attachment #613839 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora-
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Patrick--any idea of recent changes that might be causing this to fail again/still?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to Jason Duell (:jduell) from comment #40)
> Patrick--any idea of recent changes that might be causing this to fail
> again/still?

what are you referring to?

This is wontfix for FF13 and FF14 (see alex comment 37). The crash in 38 is FF14 and comment 39 is either ff 13 or 14 depending on the build time.
Ah, right.  Resume your normal programming...
Status: REOPENED → RESOLVED
Last Resolved: 5 years ago5 years ago
Resolution: --- → FIXED
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Keywords: intermittent-failure
Whiteboard: [orange]
You need to log in before you can comment on or make changes to this bug.