Closed
Bug 1170645
Opened 9 years ago
Closed 9 years ago
Intermittent crash in mozilla::net::nsWSAdmissionManager::RemoveFromQueue(mozilla::net::WebSocketChannel*) on web-platform-tests /websockets/constructor/009.html in debug builds
Categories
(Core :: Networking: WebSockets, defect)
Core
Networking: WebSockets
Tracking
()
RESOLVED
FIXED
mozilla41
People
(Reporter: jgraham, Assigned: michal)
References
Details
Attachments
(1 file, 2 obsolete files)
15.21 KB,
patch
|
mcmanus
:
review+
Sylvestre
:
approval-mozilla-aurora+
|
Details | Diff | Splinter Review |
Some of the stack: 09:34:35 INFO - PROCESS-CRASH | /websockets/constructor/009.html | application crashed [@ mozilla::net::nsWSAdmissionManager::RemoveFromQueue(mozilla::net::WebSocketChannel*)] 09:34:35 INFO - Crash dump filename: /tmp/tmpN2NZCG.mozrunner/minidumps/4765d043-a7bb-3943-13740333-61dfc886.dmp 09:34:35 INFO - Operating system: Linux 09:34:35 INFO - 0.0.0 Linux 3.2.0-76-generic #111-Ubuntu SMP Tue Jan 13 22:16:09 UTC 2015 x86_64 09:34:35 INFO - CPU: amd64 09:34:35 INFO - family 6 model 62 stepping 4 09:34:35 INFO - 1 CPU 09:34:35 INFO - 09:34:35 INFO - Crash reason: SIGSEGV 09:34:35 INFO - Crash address: 0x0 09:34:35 INFO - 09:34:35 INFO - Thread 0 (crashed) 09:34:35 INFO - 0 libxul.so!mozilla::net::nsWSAdmissionManager::RemoveFromQueue(mozilla::net::WebSocketChannel*) [WebSocketChannel.cpp:4275bf6a4834 : 507 + 0x1c] 09:34:35 INFO - rbx = 0x0000000000000000 r12 = 0x00007fb0a3c74ec0 09:34:35 INFO - r13 = 0x00007fb0a3c74ec8 r14 = 0x00007fb0a6b8c000 09:34:35 INFO - r15 = 0x0000000000000000 rip = 0x00007fb0cb691bf4 09:34:35 INFO - rsp = 0x00007fff1f9905a0 rbp = 0x00007fff1f9905c0 09:34:35 INFO - Found by: given as instruction pointer in context 09:34:35 INFO - 1 libxul.so!mozilla::net::nsWSAdmissionManager::OnStopSession(mozilla::net::WebSocketChannel*, nsresult) [WebSocketChannel.cpp:4275bf6a4834 : 429 + 0xe] 09:34:35 INFO - rbx = 0x00007fb0a6b8c000 r12 = 0x00007fb0af9c2c58 09:34:35 INFO - r13 = 0x00007fff1f9906ff r14 = 0x0000000000000000 09:34:35 INFO - r15 = 0x0000000000000000 rip = 0x00007fb0cb6933e0 09:34:35 INFO - rsp = 0x00007fff1f9905d0 rbp = 0x00007fff1f990620 09:34:35 INFO - Found by: call frame info 09:34:35 INFO - 2 libxul.so!mozilla::net::CallOnStop::Run() [WebSocketChannel.cpp:4275bf6a4834 : 617 + 0xb] 09:34:35 INFO - rbx = 0x00007fb0af9c2c40 r12 = 0x00007fb0af9c2c58 09:34:35 INFO - r13 = 0x00007fff1f9906ff r14 = 0x0000000000000000 09:34:35 INFO - r15 = 0x0000000000000001 rip = 0x00007fb0cb693463 09:34:35 INFO - rsp = 0x00007fff1f990630 rbp = 0x00007fff1f990660 09:34:35 INFO - Found by: call frame info 09:34:35 INFO - 3 libxul.so!nsThread::ProcessNextEvent(bool, bool*) [nsThread.cpp:4275bf6a4834 : 846 + 0x11] 09:34:35 INFO - rbx = 0x00007fb0d1755b80 r12 = 0x0000000000000000 09:34:35 INFO - r13 = 0x00007fff1f9906ff r14 = 0x0000000000000000 09:34:35 INFO - r15 = 0x0000000000000001 rip = 0x00007fb0cb4aa327 09:34:35 INFO - rsp = 0x00007fff1f990670 rbp = 0x00007fff1f9906e0 09:34:35 INFO - Found by: call frame info 09:34:35 INFO - 4 libxul.so!NS_ProcessNextEvent(nsIThread*, bool) [nsThreadUtils.cpp:4275bf6a4834 : 265 + 0xc] 09:34:35 INFO - rbx = 0x00007fb0c2a6de00 r12 = 0x00007fb0c2a6e360 09:34:35 INFO - r13 = 0x00007fb0c2a6dea8 r14 = 0x0000000000000001 09:34:35 INFO - r15 = 0x00007fb0c2a6dea0 rip = 0x00007fb0cb4d0de0 09:34:35 INFO - rsp = 0x00007fff1f9906f0 rbp = 0x00007fff1f990710 09:34:35 INFO - Found by: call frame info 09:34:35 INFO - 5 libxul.so!mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) [MessagePump.cpp:4275bf6a4834 : 95 + 0xa] 09:34:35 INFO - rbx = 0x00007fb0c2a6de80 r12 = 0x00007fb0c2a6e360 09:34:35 INFO - r13 = 0x00007fb0c2a6dea8 r14 = 0x0000000000000001 09:34:35 INFO - r15 = 0x00007fb0c2a6dea0 rip = 0x00007fb0cb72ae6e 09:34:35 INFO - rsp = 0x00007fff1f990720 rbp = 0x00007fff1f990770 09:34:35 INFO - Found by: call frame info 09:34:35 INFO - 6 libxul.so!MessageLoop::RunInternal() [message_loop.cc:4275bf6a4834 : 233 + 0x16] 09:34:35 INFO - rbx = 0x00007fb0c2a6e360 r12 = 0x00007fb0d1755b80 09:34:35 INFO - r13 = 0x00007fb0cb4da0c6 r14 = 0x0000000000000000 09:34:35 INFO - r15 = 0x00007fff1f990a38 rip = 0x00007fb0cb70cfd1 09:34:35 INFO - rsp = 0x00007fff1f990780 rbp = 0x00007fff1f9907b0 09:34:35 INFO - Found by: call frame info 09:34:35 INFO - 7 libxul.so!MessageLoop::Run() [message_loop.cc:4275bf6a4834 : 226 + 0x7] 09:34:35 INFO - rbx = 0x00007fb0c2a6e360 r12 = 0x00007fb0d1755b80 09:34:35 INFO - r13 = 0x00007fb0cb4da0c6 r14 = 0x0000000000000000 09:34:35 INFO - r15 = 0x00007fff1f990a38 rip = 0x00007fb0cb70cff8 09:34:35 INFO - rsp = 0x00007fff1f9907c0 rbp = 0x00007fff1f9907f0 09:34:35 INFO - Found by: call frame info 09:34:35 INFO - 8 libxul.so!nsBaseAppShell::Run() [nsBaseAppShell.cpp:4275bf6a4834 : 165 + 0xc] 09:34:35 INFO - rbx = 0x00007fb0baa6c180 r12 = 0x00007fb0d1755b80 09:34:35 INFO - r13 = 0x00007fb0cb4da0c6 r14 = 0x0000000000000000 09:34:35 INFO - r15 = 0x00007fff1f990a38 rip = 0x00007fb0ccaa6cff 09:34:35 INFO - rsp = 0x00007fff1f990800 rbp = 0x00007fff1f990810 09:34:35 INFO - Found by: call frame info 09:34:35 INFO - 9 libxul.so!nsAppStartup::Run() [nsAppStartup.cpp:4275bf6a4834 : 280 + 0x5] 09:34:35 INFO - rbx = 0x00007fb0baa18060 r12 = 0x0000000000000000 09:34:35 INFO - r13 = 0x00007fb0cb4da0c6 r14 = 0x0000000000000000 09:34:35 INFO - r15 = 0x00007fff1f990a38 rip = 0x00007fb0cd1dd69c 09:34:35 INFO - rsp = 0x00007fff1f990820 rbp = 0x00007fff1f990830 09:34:35 INFO - Found by: call frame info 09:34:35 INFO - 10 libxul.so!XREMain::XRE_mainRun() [nsAppRunner.cpp:4275bf6a4834 : 4252 + 0x14] 09:34:35 INFO - rbx = 0x00007fff1f990a08 r12 = 0x0000000000000000 09:34:35 INFO - r13 = 0x00007fb0cb4da0c6 r14 = 0x0000000000000000 09:34:35 INFO - r15 = 0x00007fff1f990a38 rip = 0x00007fb0cd22e57a 09:34:35 INFO - rsp = 0x00007fff1f990840 rbp = 0x00007fff1f990970 09:34:35 INFO - Found by: call frame info 09:34:35 INFO - 11 libxul.so!XREMain::XRE_main(int, char**, nsXREAppData const*) [nsAppRunner.cpp:4275bf6a4834 : 4336 + 0x4] 09:34:35 INFO - rbx = 0x00007fff1f990a08 r12 = 0x00007fff1f990a48 09:34:35 INFO - r13 = 0x0000000000000000 r14 = 0x0000000000000000 09:34:35 INFO - r15 = 0x00007fff1f990a38 rip = 0x00007fb0cd231066 09:34:35 INFO - rsp = 0x00007fff1f990980 rbp = 0x00007fff1f9909e0 09:34:35 INFO - Found by: call frame info 09:34:35 INFO - 12 libxul.so!XRE_main [nsAppRunner.cpp:4275bf6a4834 : 4425 + 0x4] 09:34:35 INFO - rbx = 0x0000000000000005 r12 = 0x00007fff1f991e78 09:34:35 INFO - r13 = 0x00007fff1f990bf0 r14 = 0x00007fb0d1737540 09:34:35 INFO - r15 = 0x00007fb0cb47fbe0 rip = 0x00007fb0cd2312e8 09:34:35 INFO - rsp = 0x00007fff1f9909f0 rbp = 0x00007fff1f990ba0 09:34:35 INFO - Found by: call frame info 09:34:35 INFO - 13 firefox!do_main [nsBrowserApp.cpp:4275bf6a4834 : 214 + 0x5] 09:34:35 INFO - rbx = 0x00007fff1f991e78 r12 = 0x0000000000000005 09:34:35 INFO - r13 = 0x00007fb0d1737900 r14 = 0x00007fb0d1737540 09:34:35 INFO - r15 = 0x00007fb0cb47fbe0 rip = 0x0000000000406894 09:34:35 INFO - rsp = 0x00007fff1f990bb0 rbp = 0x00007fff1f991c20 09:34:35 INFO - Found by: call frame info 09:34:35 INFO - 14 firefox!main [nsBrowserApp.cpp:4275bf6a4834 : 478 + 0x11] 09:34:35 INFO - rbx = 0x000000396e28592e r12 = 0x00007fff1f991e78 09:34:35 INFO - r13 = 0x0000000000000005 r14 = 0x0000000000000000 09:34:35 INFO - r15 = 0x0000000000000000 rip = 0x0000000000405dd3 09:34:35 INFO - rsp = 0x00007fff1f991c30 rbp = 0x00007fff1f991d90 09:34:35 INFO - Found by: call frame info 09:34:35 INFO - 15 libc-2.15.so + 0x2176c 09:34:35 INFO - rbx = 0x0000000000000000 r12 = 0x0000000000405e7c 09:34:35 INFO - r13 = 0x00007fff1f991e70 r14 = 0x0000000000000000 09:34:35 INFO - r15 = 0x0000000000000000 rip = 0x00007fb0d194b76d 09:34:35 INFO - rsp = 0x00007fff1f991da0 rbp = 0x0000000000000000 09:34:35 INFO - Found by: call frame info 09:34:35 INFO - 16 firefox!_init + 0x78f 09:34:35 INFO - rip = 0x0000000000405d10 rsp = 0x00007fff1f991dc0 09:34:35 INFO - Found by: stack scanning 09:34:35 INFO - 17 firefox + 0x5e7b 09:34:35 INFO - rip = 0x0000000000405e7c rsp = 0x00007fff1f991dd8 09:34:35 INFO - Found by: call frame info 09:34:35 INFO - 18 firefox!__libc_csu_fini + 0xf 09:34:35 INFO - rip = 0x000000000041f170 rsp = 0x00007fff1f991e20 09:34:35 INFO - Found by: stack scanning 09:34:35 INFO - 19 firefox + 0x5e7b 09:34:35 INFO - rip = 0x0000000000405e7c rsp = 0x00007fff1f991e48 09:34:35 INFO - Found by: stack scanning 09:34:35 INFO - 20 firefox!_start + 0x28 09:34:35 INFO - rip = 0x0000000000405ea5 rsp = 0x00007fff1f991e60 09:34:35 INFO - Found by: stack scanning 09:34:35 INFO - 09:34:35 INFO - Thread 1 09:34:35 INFO - 0 libc-2.15.so + 0xe8933 09:34:35 INFO - rbx = 0x00007fb0d17a4950 r12 = 0x00007fb0c8f7bf00 09:34:35 INFO - r13 = 0x00000000ffffffff r14 = 0x0000000000000001 09:34:35 INFO - r15 = 0x0000000000000001 rip = 0x00007fb0d1a12933 09:34:35 INFO - rsp = 0x00007fb0c41dbdf0 rbp = 0x00007fb0d1778390 09:34:35 INFO - Found by: given as instruction pointer in context 09:34:35 INFO - 1 libglib-2.0.so.0.3200.1 + 0x47ff5 09:34:35 INFO - rip = 0x00007fb0c8f6eff6 rsp = 0x00007fb0c41dbe20 09:34:35 INFO - rbp = 0x00007fb0d1778390 09:34:35 INFO - Found by: stack scanning 09:34:35 INFO - 2 libglib-2.0.so.0.3200.1 + 0x48459 09:34:35 INFO - rip = 0x00007fb0c8f6f45a rsp = 0x00007fb0c41dbe70 09:34:35 INFO - rbp = 0x00007fb0d1778390 09:34:35 INFO - Found by: stack scanning 09:34:35 INFO - 3 libdconfsettings.so + 0x498a 09:34:35 INFO - rip = 0x00007fb0c2b8798b rsp = 0x00007fb0c41dbe90 09:34:35 INFO - rbp = 0x00007fb0d1778390 09:34:35 INFO - Found by: stack scanning 09:34:35 INFO - 4 libglib-2.0.so.0.3200.1 + 0x699a4 09:34:35 INFO - rip = 0x00007fb0c8f909a5 rsp = 0x00007fb0c41dbea0 09:34:35 INFO - rbp = 0x00007fb0d1778390 09:34:35 INFO - Found by: stack scanning 09:34:35 INFO - 5 libpthread-2.15.so + 0x7e99 09:34:35 INFO - rip = 0x00007fb0d290ee9a rsp = 0x00007fb0c41dbeb0 09:34:35 INFO - rbp = 0x00007fb0d1778390 09:34:35 INFO - Found by: stack scanning [...] https://treeherder.mozilla.org/logviewer.html#?job_id=148752&repo=cedar This may be a regression since 12th May.
Updated•9 years ago
|
Assignee: nobody → michal.novotny
Comment 1•9 years ago
|
||
So RemoveFromQueue has this bit: 506 int32_t index = IndexOf(aChannel); 507 MOZ_ASSERT(index >= 0, "connection to remove not in queue"); 508 if (index >= 0) { and it looks like that assert is failing. This code hasn't changed since February, so if there is a regression it's due to a change elsewhere. In any case, if you look at the code in OnStopSession (the caller of RemoveFromQueue), it tests that aChannel->mConnecting is nonzero (so not NOT_CONNECTING), then calls RemoveFromQueue, then does: 431 bool wasNotQueued = (aChannel->mConnecting != CONNECTING_QUEUED); In other words, there is no particular reason when we hit this code to expect that the mConnecting state is in fact CONNECTING_QUEUED, and if it's not then RemoveFromQueue would in fact assert as observed. Seems like either the RemoveFromQueue call needs to be conditional on the mConnecting state, or the assert should be removed. Jason, you seem to have blame for this code... Could you take a look?
Assignee: michal.novotny → nobody
Flags: needinfo?(jduell.mcbugs)
Updated•9 years ago
|
Assignee: nobody → michal.novotny
Reporter | ||
Comment 2•9 years ago
|
||
It looks like this is the main blocker to enabling web-platform-tests on debug builds; I either need to disable a swathe of websockets tests or get a fix here. Any idea what the timescale for getting a fix is?
Comment 3•9 years ago
|
||
James, give this a shot on try?
Attachment #8616063 -
Flags: review?(jduell.mcbugs)
Updated•9 years ago
|
Assignee: michal.novotny → bzbarsky
Status: NEW → ASSIGNED
Comment 4•9 years ago
|
||
Comment on attachment 8616063 [details] [diff] [review] Let's just try this Review of attachment 8616063 [details] [diff] [review]: ----------------------------------------------------------------- I'll steal this.. thanks. I agree that the primary function should still assert in this case.
Attachment #8616063 -
Flags: review?(jduell.mcbugs) → review+
Reporter | ||
Comment 5•9 years ago
|
||
I'm running this now on cedar to test against the wpt debug builds. Thanks for looking at it!
Comment 6•9 years ago
|
||
OK, so this patch fails try. It fails because we get into the code I just changed with aChannel->mConnecting set to CONNECTING_IN_PROGRESS but the channel is in the queue. So clearly CONNECTING_QUEUED has nothing to do with being in the queue. Looking at the code more closely, things are added to the queue in ConditionallyConnect. Before that point the mConnecting is always NOT_CONNECTING, I hope. Once added, things can be CONNECTING_QUEUED or not. Things can be removed from the queue via OnConnected (which sets the mConnecting to NOT_CONNECTING) or the OnStopSession code we're looking at. But that also sets mConnecting to NOT_CONNECTING. So given that aChannel->mConnecting, how can we ever end up NOT in the queue? We'd have to have something that sets mConnecting to something other than NOT_CONNECTING but is not ConditionallyConnect or called from ConditionallyConnect. There are a few places that set mConnecting: 1) DelayOrBegin. This is called from ConditionallyConnect and from ConnectNext, but in the latter case only if the channel is already in the queue. 2) BeginOpen. This sets mConnecting to CONNECTING_IN_PROGRESS. But this is only called from mReconectDelayTimer firing, and this is only set up from DelayOrBegin. That seems to be it. So I'm a bit lost as to what's going wrong here in the original failure, at first glance. :( Seems like any time mConnecting is not NOT_CONNECTING we should in fact be in the queue.
Reporter | ||
Comment 7•9 years ago
|
||
I am seeing a bunch of websockets timeouts on cedar with this applied https://treeherder.mozilla.org/#/jobs?repo=cedar&revision=daa688d50b09&filter-searchStr=web-platform-tests
Comment 8•9 years ago
|
||
Unless maybe BeginOpen can be called after OnStopSession? But StopSession cancels mReconnectDelayTimer....
Comment 9•9 years ago
|
||
I haven't come up with anything yet, but I'm looking at this.
Assignee: bzbarsky → jduell.mcbugs
Flags: needinfo?(jduell.mcbugs)
Assignee | ||
Comment 10•9 years ago
|
||
I think this is a regression caused by bug 1111971. The problem is that nsWSAdmissionManager::OnStopSession() is called from CallOnStop::Run() as well as from WebSocketChannel::StopSession().
Assignee | ||
Comment 11•9 years ago
|
||
Attachment #8616670 -
Flags: review?(jduell.mcbugs)
Reporter | ||
Comment 12•9 years ago
|
||
Tying this on cedar: https://treeherder.mozilla.org/#/jobs?repo=cedar&revision=8be32c64b0c6
Reporter | ||
Comment 13•9 years ago
|
||
This doesn't look great: https://treeherder.mozilla.org/logviewer.html#?job_id=151153&repo=cedar
Updated•9 years ago
|
Assignee: jduell.mcbugs → michal.novotny
Comment 14•9 years ago
|
||
Comment on attachment 8616670 [details] [diff] [review] fix Review of attachment 8616670 [details] [diff] [review]: ----------------------------------------------------------------- Failed on try, so we've got more to do here.
Attachment #8616670 -
Flags: review?(jduell.mcbugs) → review-
Reporter | ||
Comment 15•9 years ago
|
||
Is there any estimate for this bug being looked at? it currently blocks enabling the web-platform-tests for debug builds, which I would like to move forward so I'm trying to decide whether it makes more sense to wait for a fix or to temporarily disable all the affected tests until one is available.
Assignee | ||
Comment 16•9 years ago
|
||
I hope I'll have a working patch today or tomorrow.
Reporter | ||
Comment 17•9 years ago
|
||
Brilliant, thanks for the update?
Reporter | ||
Comment 18•9 years ago
|
||
s/?/!/ :)
Assignee | ||
Comment 19•9 years ago
|
||
(In reply to James Graham [:jgraham] from comment #13) > This doesn't look great: > https://treeherder.mozilla.org/logviewer.html#?job_id=151153&repo=cedar There are 2 failures with the previous patch: 1) WebSocketChannel::Close() cancels the channel but nsWSAdmissionManager::OnStopSession() is called later so the channel is still queued. If nsWSAdmissionManager::OnStopSession() for some other channel is called in the meantime it calls nsWSAdmissionManager::ConnectNext() which finds the queued channel and schedules the delay timer. nsWSAdmissionManager::OnStopSession() is then called for the delayed channel and changes the status to NOT_CONNECTING. Once the delay timer is fired WebSocketChannel::Notify() asserts due to wrong state. 2) Patch in bug 1152047 caused that the status of the channel is not changed immediately when WebSocketChannel::BeginOpen() is called in FailDelayManager::DelayOrBegin(). So we have a queued channel with BeginOpen event pending. If WebSocketChannel::Close() is called on this channel in the meantime, then the channel is canceled and removed from the queue. WebSocketChannel::BeginOpen() is later called, it changes the status to CONNECTING_IN_PROGRESS but fails immediately. Destructor then asserts due to wrong connecting state. With this patch, there is no possible race between nsWSAdmissionManager::OnStopSession() and nsWSAdmissionManager::ConnectNext() since both are called on the main thread. nsWSAdmissionManager::OnStopSession() can be still called on other thread but only for channels with mConnecting == NOT_CONNECTING. I've also added some useful LOG().
Attachment #8616063 -
Attachment is obsolete: true
Attachment #8616670 -
Attachment is obsolete: true
Assignee | ||
Updated•9 years ago
|
Attachment #8623045 -
Flags: review?(mcmanus)
Reporter | ||
Comment 20•9 years ago
|
||
Looking good! https://treeherder.mozilla.org/#/jobs?repo=cedar&revision=508624b2e1b5&filter-searchStr=web-platform%20debug
Comment 22•9 years ago
|
||
Comment on attachment 8623045 [details] [diff] [review] patch v2 Review of attachment 8623045 [details] [diff] [review]: ----------------------------------------------------------------- certainly seems cleaner. Thanks.
Attachment #8623045 -
Flags: review?(mcmanus) → review+
Comment 24•9 years ago
|
||
https://hg.mozilla.org/mozilla-central/rev/190f21cd3194
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
status-firefox41:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla41
Assignee | ||
Comment 25•9 years ago
|
||
(In reply to Patrick McManus [:mcmanus] from comment #21) > is this going to need backporting? Bug 1152047 landed on 41 and bug 1111971 on 38. OTOH second call to RemoveFromQueue() is no-op in release build so I'm not sure if we need to uplift.
Flags: needinfo?(michal.novotny)
Comment 26•9 years ago
|
||
Comment on attachment 8623045 [details] [diff] [review] patch v2 [Triage Comment] Taking it because it is necessary to bug 1152047
Attachment #8623045 -
Flags: approval-mozilla-aurora+
Comment 27•9 years ago
|
||
https://hg.mozilla.org/releases/mozilla-aurora/rev/691fa8d34fba
status-firefox40:
--- → fixed
You need to log in
before you can comment on or make changes to this bug.
Description
•