Closed Bug 1170645 Opened 9 years ago Closed 9 years ago

Intermittent crash in mozilla::net::nsWSAdmissionManager::RemoveFromQueue(mozilla::net::WebSocketChannel*) on web-platform-tests /websockets/constructor/009.html in debug builds

Categories

(Core :: Networking: WebSockets, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla41
Tracking Status
firefox40 --- fixed
firefox41 --- fixed

People

(Reporter: jgraham, Assigned: michal)

References

Details

Attachments

(1 file, 2 obsolete files)

Some of the stack:

09:34:35 INFO - PROCESS-CRASH | /websockets/constructor/009.html | application crashed [@ mozilla::net::nsWSAdmissionManager::RemoveFromQueue(mozilla::net::WebSocketChannel*)]
09:34:35 INFO - Crash dump filename: /tmp/tmpN2NZCG.mozrunner/minidumps/4765d043-a7bb-3943-13740333-61dfc886.dmp
09:34:35 INFO - Operating system: Linux
09:34:35 INFO - 0.0.0 Linux 3.2.0-76-generic #111-Ubuntu SMP Tue Jan 13 22:16:09 UTC 2015 x86_64
09:34:35 INFO - CPU: amd64
09:34:35 INFO - family 6 model 62 stepping 4
09:34:35 INFO - 1 CPU
09:34:35 INFO -
09:34:35 INFO - Crash reason: SIGSEGV
09:34:35 INFO - Crash address: 0x0
09:34:35 INFO -
09:34:35 INFO - Thread 0 (crashed)
09:34:35 INFO - 0 libxul.so!mozilla::net::nsWSAdmissionManager::RemoveFromQueue(mozilla::net::WebSocketChannel*) [WebSocketChannel.cpp:4275bf6a4834 : 507 + 0x1c]
09:34:35 INFO - rbx = 0x0000000000000000 r12 = 0x00007fb0a3c74ec0
09:34:35 INFO - r13 = 0x00007fb0a3c74ec8 r14 = 0x00007fb0a6b8c000
09:34:35 INFO - r15 = 0x0000000000000000 rip = 0x00007fb0cb691bf4
09:34:35 INFO - rsp = 0x00007fff1f9905a0 rbp = 0x00007fff1f9905c0
09:34:35 INFO - Found by: given as instruction pointer in context
09:34:35 INFO - 1 libxul.so!mozilla::net::nsWSAdmissionManager::OnStopSession(mozilla::net::WebSocketChannel*, nsresult) [WebSocketChannel.cpp:4275bf6a4834 : 429 + 0xe]
09:34:35 INFO - rbx = 0x00007fb0a6b8c000 r12 = 0x00007fb0af9c2c58
09:34:35 INFO - r13 = 0x00007fff1f9906ff r14 = 0x0000000000000000
09:34:35 INFO - r15 = 0x0000000000000000 rip = 0x00007fb0cb6933e0
09:34:35 INFO - rsp = 0x00007fff1f9905d0 rbp = 0x00007fff1f990620
09:34:35 INFO - Found by: call frame info
09:34:35 INFO - 2 libxul.so!mozilla::net::CallOnStop::Run() [WebSocketChannel.cpp:4275bf6a4834 : 617 + 0xb]
09:34:35 INFO - rbx = 0x00007fb0af9c2c40 r12 = 0x00007fb0af9c2c58
09:34:35 INFO - r13 = 0x00007fff1f9906ff r14 = 0x0000000000000000
09:34:35 INFO - r15 = 0x0000000000000001 rip = 0x00007fb0cb693463
09:34:35 INFO - rsp = 0x00007fff1f990630 rbp = 0x00007fff1f990660
09:34:35 INFO - Found by: call frame info
09:34:35 INFO - 3 libxul.so!nsThread::ProcessNextEvent(bool, bool*) [nsThread.cpp:4275bf6a4834 : 846 + 0x11]
09:34:35 INFO - rbx = 0x00007fb0d1755b80 r12 = 0x0000000000000000
09:34:35 INFO - r13 = 0x00007fff1f9906ff r14 = 0x0000000000000000
09:34:35 INFO - r15 = 0x0000000000000001 rip = 0x00007fb0cb4aa327
09:34:35 INFO - rsp = 0x00007fff1f990670 rbp = 0x00007fff1f9906e0
09:34:35 INFO - Found by: call frame info
09:34:35 INFO - 4 libxul.so!NS_ProcessNextEvent(nsIThread*, bool) [nsThreadUtils.cpp:4275bf6a4834 : 265 + 0xc]
09:34:35 INFO - rbx = 0x00007fb0c2a6de00 r12 = 0x00007fb0c2a6e360
09:34:35 INFO - r13 = 0x00007fb0c2a6dea8 r14 = 0x0000000000000001
09:34:35 INFO - r15 = 0x00007fb0c2a6dea0 rip = 0x00007fb0cb4d0de0
09:34:35 INFO - rsp = 0x00007fff1f9906f0 rbp = 0x00007fff1f990710
09:34:35 INFO - Found by: call frame info
09:34:35 INFO - 5 libxul.so!mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) [MessagePump.cpp:4275bf6a4834 : 95 + 0xa]
09:34:35 INFO - rbx = 0x00007fb0c2a6de80 r12 = 0x00007fb0c2a6e360
09:34:35 INFO - r13 = 0x00007fb0c2a6dea8 r14 = 0x0000000000000001
09:34:35 INFO - r15 = 0x00007fb0c2a6dea0 rip = 0x00007fb0cb72ae6e
09:34:35 INFO - rsp = 0x00007fff1f990720 rbp = 0x00007fff1f990770
09:34:35 INFO - Found by: call frame info
09:34:35 INFO - 6 libxul.so!MessageLoop::RunInternal() [message_loop.cc:4275bf6a4834 : 233 + 0x16]
09:34:35 INFO - rbx = 0x00007fb0c2a6e360 r12 = 0x00007fb0d1755b80
09:34:35 INFO - r13 = 0x00007fb0cb4da0c6 r14 = 0x0000000000000000
09:34:35 INFO - r15 = 0x00007fff1f990a38 rip = 0x00007fb0cb70cfd1
09:34:35 INFO - rsp = 0x00007fff1f990780 rbp = 0x00007fff1f9907b0
09:34:35 INFO - Found by: call frame info
09:34:35 INFO - 7 libxul.so!MessageLoop::Run() [message_loop.cc:4275bf6a4834 : 226 + 0x7]
09:34:35 INFO - rbx = 0x00007fb0c2a6e360 r12 = 0x00007fb0d1755b80
09:34:35 INFO - r13 = 0x00007fb0cb4da0c6 r14 = 0x0000000000000000
09:34:35 INFO - r15 = 0x00007fff1f990a38 rip = 0x00007fb0cb70cff8
09:34:35 INFO - rsp = 0x00007fff1f9907c0 rbp = 0x00007fff1f9907f0
09:34:35 INFO - Found by: call frame info
09:34:35 INFO - 8 libxul.so!nsBaseAppShell::Run() [nsBaseAppShell.cpp:4275bf6a4834 : 165 + 0xc]
09:34:35 INFO - rbx = 0x00007fb0baa6c180 r12 = 0x00007fb0d1755b80
09:34:35 INFO - r13 = 0x00007fb0cb4da0c6 r14 = 0x0000000000000000
09:34:35 INFO - r15 = 0x00007fff1f990a38 rip = 0x00007fb0ccaa6cff
09:34:35 INFO - rsp = 0x00007fff1f990800 rbp = 0x00007fff1f990810
09:34:35 INFO - Found by: call frame info
09:34:35 INFO - 9 libxul.so!nsAppStartup::Run() [nsAppStartup.cpp:4275bf6a4834 : 280 + 0x5]
09:34:35 INFO - rbx = 0x00007fb0baa18060 r12 = 0x0000000000000000
09:34:35 INFO - r13 = 0x00007fb0cb4da0c6 r14 = 0x0000000000000000
09:34:35 INFO - r15 = 0x00007fff1f990a38 rip = 0x00007fb0cd1dd69c
09:34:35 INFO - rsp = 0x00007fff1f990820 rbp = 0x00007fff1f990830
09:34:35 INFO - Found by: call frame info
09:34:35 INFO - 10 libxul.so!XREMain::XRE_mainRun() [nsAppRunner.cpp:4275bf6a4834 : 4252 + 0x14]
09:34:35 INFO - rbx = 0x00007fff1f990a08 r12 = 0x0000000000000000
09:34:35 INFO - r13 = 0x00007fb0cb4da0c6 r14 = 0x0000000000000000
09:34:35 INFO - r15 = 0x00007fff1f990a38 rip = 0x00007fb0cd22e57a
09:34:35 INFO - rsp = 0x00007fff1f990840 rbp = 0x00007fff1f990970
09:34:35 INFO - Found by: call frame info
09:34:35 INFO - 11 libxul.so!XREMain::XRE_main(int, char**, nsXREAppData const*) [nsAppRunner.cpp:4275bf6a4834 : 4336 + 0x4]
09:34:35 INFO - rbx = 0x00007fff1f990a08 r12 = 0x00007fff1f990a48
09:34:35 INFO - r13 = 0x0000000000000000 r14 = 0x0000000000000000
09:34:35 INFO - r15 = 0x00007fff1f990a38 rip = 0x00007fb0cd231066
09:34:35 INFO - rsp = 0x00007fff1f990980 rbp = 0x00007fff1f9909e0
09:34:35 INFO - Found by: call frame info
09:34:35 INFO - 12 libxul.so!XRE_main [nsAppRunner.cpp:4275bf6a4834 : 4425 + 0x4]
09:34:35 INFO - rbx = 0x0000000000000005 r12 = 0x00007fff1f991e78
09:34:35 INFO - r13 = 0x00007fff1f990bf0 r14 = 0x00007fb0d1737540
09:34:35 INFO - r15 = 0x00007fb0cb47fbe0 rip = 0x00007fb0cd2312e8
09:34:35 INFO - rsp = 0x00007fff1f9909f0 rbp = 0x00007fff1f990ba0
09:34:35 INFO - Found by: call frame info
09:34:35 INFO - 13 firefox!do_main [nsBrowserApp.cpp:4275bf6a4834 : 214 + 0x5]
09:34:35 INFO - rbx = 0x00007fff1f991e78 r12 = 0x0000000000000005
09:34:35 INFO - r13 = 0x00007fb0d1737900 r14 = 0x00007fb0d1737540
09:34:35 INFO - r15 = 0x00007fb0cb47fbe0 rip = 0x0000000000406894
09:34:35 INFO - rsp = 0x00007fff1f990bb0 rbp = 0x00007fff1f991c20
09:34:35 INFO - Found by: call frame info
09:34:35 INFO - 14 firefox!main [nsBrowserApp.cpp:4275bf6a4834 : 478 + 0x11]
09:34:35 INFO - rbx = 0x000000396e28592e r12 = 0x00007fff1f991e78
09:34:35 INFO - r13 = 0x0000000000000005 r14 = 0x0000000000000000
09:34:35 INFO - r15 = 0x0000000000000000 rip = 0x0000000000405dd3
09:34:35 INFO - rsp = 0x00007fff1f991c30 rbp = 0x00007fff1f991d90
09:34:35 INFO - Found by: call frame info
09:34:35 INFO - 15 libc-2.15.so + 0x2176c
09:34:35 INFO - rbx = 0x0000000000000000 r12 = 0x0000000000405e7c
09:34:35 INFO - r13 = 0x00007fff1f991e70 r14 = 0x0000000000000000
09:34:35 INFO - r15 = 0x0000000000000000 rip = 0x00007fb0d194b76d
09:34:35 INFO - rsp = 0x00007fff1f991da0 rbp = 0x0000000000000000
09:34:35 INFO - Found by: call frame info
09:34:35 INFO - 16 firefox!_init + 0x78f
09:34:35 INFO - rip = 0x0000000000405d10 rsp = 0x00007fff1f991dc0
09:34:35 INFO - Found by: stack scanning
09:34:35 INFO - 17 firefox + 0x5e7b
09:34:35 INFO - rip = 0x0000000000405e7c rsp = 0x00007fff1f991dd8
09:34:35 INFO - Found by: call frame info
09:34:35 INFO - 18 firefox!__libc_csu_fini + 0xf
09:34:35 INFO - rip = 0x000000000041f170 rsp = 0x00007fff1f991e20
09:34:35 INFO - Found by: stack scanning
09:34:35 INFO - 19 firefox + 0x5e7b
09:34:35 INFO - rip = 0x0000000000405e7c rsp = 0x00007fff1f991e48
09:34:35 INFO - Found by: stack scanning
09:34:35 INFO - 20 firefox!_start + 0x28
09:34:35 INFO - rip = 0x0000000000405ea5 rsp = 0x00007fff1f991e60
09:34:35 INFO - Found by: stack scanning
09:34:35 INFO -
09:34:35 INFO - Thread 1
09:34:35 INFO - 0 libc-2.15.so + 0xe8933
09:34:35 INFO - rbx = 0x00007fb0d17a4950 r12 = 0x00007fb0c8f7bf00
09:34:35 INFO - r13 = 0x00000000ffffffff r14 = 0x0000000000000001
09:34:35 INFO - r15 = 0x0000000000000001 rip = 0x00007fb0d1a12933
09:34:35 INFO - rsp = 0x00007fb0c41dbdf0 rbp = 0x00007fb0d1778390
09:34:35 INFO - Found by: given as instruction pointer in context
09:34:35 INFO - 1 libglib-2.0.so.0.3200.1 + 0x47ff5
09:34:35 INFO - rip = 0x00007fb0c8f6eff6 rsp = 0x00007fb0c41dbe20
09:34:35 INFO - rbp = 0x00007fb0d1778390
09:34:35 INFO - Found by: stack scanning
09:34:35 INFO - 2 libglib-2.0.so.0.3200.1 + 0x48459
09:34:35 INFO - rip = 0x00007fb0c8f6f45a rsp = 0x00007fb0c41dbe70
09:34:35 INFO - rbp = 0x00007fb0d1778390
09:34:35 INFO - Found by: stack scanning
09:34:35 INFO - 3 libdconfsettings.so + 0x498a
09:34:35 INFO - rip = 0x00007fb0c2b8798b rsp = 0x00007fb0c41dbe90
09:34:35 INFO - rbp = 0x00007fb0d1778390
09:34:35 INFO - Found by: stack scanning
09:34:35 INFO - 4 libglib-2.0.so.0.3200.1 + 0x699a4
09:34:35 INFO - rip = 0x00007fb0c8f909a5 rsp = 0x00007fb0c41dbea0
09:34:35 INFO - rbp = 0x00007fb0d1778390
09:34:35 INFO - Found by: stack scanning
09:34:35 INFO - 5 libpthread-2.15.so + 0x7e99
09:34:35 INFO - rip = 0x00007fb0d290ee9a rsp = 0x00007fb0c41dbeb0
09:34:35 INFO - rbp = 0x00007fb0d1778390
09:34:35 INFO - Found by: stack scanning 

[...]

https://treeherder.mozilla.org/logviewer.html#?job_id=148752&repo=cedar

This may be a regression since 12th May.
Assignee: nobody → michal.novotny
So RemoveFromQueue has this bit:

506     int32_t index = IndexOf(aChannel);
507     MOZ_ASSERT(index >= 0, "connection to remove not in queue");
508     if (index >= 0) {

and it looks like that assert is failing.

This code hasn't changed since February, so if there is a regression it's due to a change elsewhere.

In any case, if you look at the code in OnStopSession (the caller of RemoveFromQueue), it tests that aChannel->mConnecting is nonzero (so not NOT_CONNECTING), then calls RemoveFromQueue, then does:

431       bool wasNotQueued = (aChannel->mConnecting != CONNECTING_QUEUED);

In other words, there is no particular reason when we hit this code to expect that the mConnecting state is in fact CONNECTING_QUEUED, and if it's not then RemoveFromQueue would in fact assert as observed.

Seems like either the RemoveFromQueue call needs to be conditional on the mConnecting state, or the assert should be removed.

Jason, you seem to have blame for this code...  Could you take a look?
Assignee: michal.novotny → nobody
Flags: needinfo?(jduell.mcbugs)
Assignee: nobody → michal.novotny
Blocks: 1076787
It looks like this is the main blocker to enabling web-platform-tests on debug builds; I either need to disable a swathe of websockets tests or get a fix here. Any idea what the timescale for getting a fix is?
Attached patch Let's just try this (obsolete) — Splinter Review
James, give this a shot on try?
Attachment #8616063 - Flags: review?(jduell.mcbugs)
Assignee: michal.novotny → bzbarsky
Status: NEW → ASSIGNED
Comment on attachment 8616063 [details] [diff] [review]
Let's just try this

Review of attachment 8616063 [details] [diff] [review]:
-----------------------------------------------------------------

I'll steal this.. thanks. I agree that the primary function should still assert in this case.
Attachment #8616063 - Flags: review?(jduell.mcbugs) → review+
I'm running this now on cedar to test against the wpt debug builds. Thanks for looking at it!
OK, so this patch fails try.  It fails because we get into the code I just changed with aChannel->mConnecting set to CONNECTING_IN_PROGRESS but the channel is in the queue.

So clearly CONNECTING_QUEUED has nothing to do with being in the queue.  

Looking at the code more closely, things are added to the queue in ConditionallyConnect.  Before that point the mConnecting is always NOT_CONNECTING, I hope.  Once added, things can be CONNECTING_QUEUED or not.  Things can be removed from the queue via OnConnected (which sets the mConnecting to NOT_CONNECTING) or the OnStopSession code we're looking at.  But that also sets mConnecting to NOT_CONNECTING.

So given that aChannel->mConnecting, how can we ever end up NOT in the queue?  We'd have to have something that sets mConnecting to something other than NOT_CONNECTING but is not ConditionallyConnect or called from ConditionallyConnect.  There are a few places that set mConnecting:

1) DelayOrBegin.  This is called from ConditionallyConnect and from ConnectNext, but in the latter case only if the channel is already in the queue.

2) BeginOpen.  This sets mConnecting to CONNECTING_IN_PROGRESS.  But this is only called from mReconectDelayTimer firing, and this is only set up from DelayOrBegin.

That seems to be it.

So I'm a bit lost as to what's going wrong here in the original failure, at first glance.  :(  Seems like any time mConnecting is not NOT_CONNECTING we should in fact be in the queue.
I am seeing a bunch of websockets timeouts on cedar with this applied

https://treeherder.mozilla.org/#/jobs?repo=cedar&revision=daa688d50b09&filter-searchStr=web-platform-tests
Unless maybe BeginOpen can be called after OnStopSession?  But StopSession cancels mReconnectDelayTimer....
I haven't come up with anything yet, but I'm looking at this.
Assignee: bzbarsky → jduell.mcbugs
Flags: needinfo?(jduell.mcbugs)
I think this is a regression caused by bug 1111971. The problem is that nsWSAdmissionManager::OnStopSession() is called from CallOnStop::Run() as well as from WebSocketChannel::StopSession().
Attached patch fix (obsolete) — Splinter Review
Attachment #8616670 - Flags: review?(jduell.mcbugs)
Assignee: jduell.mcbugs → michal.novotny
Comment on attachment 8616670 [details] [diff] [review]
fix

Review of attachment 8616670 [details] [diff] [review]:
-----------------------------------------------------------------

Failed on try, so we've got more to do here.
Attachment #8616670 - Flags: review?(jduell.mcbugs) → review-
Is there any estimate for this bug being looked at? it currently blocks enabling the web-platform-tests for debug builds, which I would like to move forward so I'm trying to decide whether it makes more sense to wait for a fix or to temporarily disable all the affected tests until one is available.
I hope I'll have a working patch today or tomorrow.
Brilliant, thanks for the update?
s/?/!/ :)
Attached patch patch v2Splinter Review
(In reply to James Graham [:jgraham] from comment #13)
> This doesn't look great:
> https://treeherder.mozilla.org/logviewer.html#?job_id=151153&repo=cedar

There are 2 failures with the previous patch:

1) WebSocketChannel::Close() cancels the channel but nsWSAdmissionManager::OnStopSession() is called later so the channel is still queued. If nsWSAdmissionManager::OnStopSession() for some other channel is called in the meantime it calls nsWSAdmissionManager::ConnectNext() which finds the queued channel and schedules the delay timer. nsWSAdmissionManager::OnStopSession() is then called for the delayed channel and changes the status to NOT_CONNECTING. Once the delay timer is fired WebSocketChannel::Notify() asserts due to wrong state.

2) Patch in bug 1152047 caused that the status of the channel is not changed immediately when WebSocketChannel::BeginOpen() is called in FailDelayManager::DelayOrBegin(). So we have a queued channel with BeginOpen event pending. If WebSocketChannel::Close() is called on this channel in the meantime, then the channel is canceled and removed from the queue. WebSocketChannel::BeginOpen() is later called, it changes the status to CONNECTING_IN_PROGRESS but fails immediately. Destructor then asserts due to wrong connecting state.

With this patch, there is no possible race between nsWSAdmissionManager::OnStopSession() and nsWSAdmissionManager::ConnectNext() since both are called on the main thread. nsWSAdmissionManager::OnStopSession() can be still called on other thread but only for channels with mConnecting == NOT_CONNECTING. I've also added some useful LOG().
Attachment #8616063 - Attachment is obsolete: true
Attachment #8616670 - Attachment is obsolete: true
Attachment #8623045 - Flags: review?(mcmanus)
is this going to need backporting?
Flags: needinfo?(michal.novotny)
Comment on attachment 8623045 [details] [diff] [review]
patch v2

Review of attachment 8623045 [details] [diff] [review]:
-----------------------------------------------------------------

certainly seems cleaner. Thanks.
Attachment #8623045 - Flags: review?(mcmanus) → review+
https://hg.mozilla.org/mozilla-central/rev/190f21cd3194
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla41
(In reply to Patrick McManus [:mcmanus] from comment #21)
> is this going to need backporting?

Bug 1152047 landed on 41 and bug 1111971 on 38. OTOH second call to RemoveFromQueue() is no-op in release build so I'm not sure if we need to uplift.
Flags: needinfo?(michal.novotny)
Comment on attachment 8623045 [details] [diff] [review]
patch v2

[Triage Comment]
Taking it because it is necessary to bug 1152047
Attachment #8623045 - Flags: approval-mozilla-aurora+
Depends on: 1186160
You need to log in before you can comment on or make changes to this bug.