Closed
Bug 823056
Opened 12 years ago
Closed 12 years ago
Intermittent hang in test_peerConnection_basicAudio.html | Failure: 'application timed out after 330 seconds with no output'
Categories
(Core :: WebRTC, defect, P1)
RESOLVED
FIXED
mozilla21
People
(Reporter: whimboo, Assigned: abr)
Details
(Keywords: crash, hang, intermittent-failure, Whiteboard: [WebRTC][blocking-webrtc+][qa-])
Attachments
(1 file, 2 obsolete files)
10.80 KB, patch (abr: review+, jesup: checkin+)
With the last merge from m-c to alder, the peer connection tests have started to hang and crash on Linux-based platforms:
https://tbpl.mozilla.org/php/getParsedLog.php?id=18062650&tree=Alder#error0
52 INFO TEST-START | /tests/dom/media/tests/mochitest/test_peerConnection_basicAudio.html
++DOMWINDOW == 14 (0x4008a18) [serial = 19] [outer = 0x534b5d0]
WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80004003: file ../../../../intl/uconv/src/nsCharsetConverterManager.cpp, line 301
WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80004003: file ../../../../intl/uconv/src/nsCharsetConverterManager.cpp, line 301
TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/mochitest/test_peerConnection_basicAudio.html | application timed out after 330 seconds with no output
args: ['/home/cltbld/talos-slave/test/build/bin/screentopng']
INFO | automation.py | Application ran for: 0:05:41.625900
INFO | automation.py | Reading PID log: /tmp/tmp0Lr3kgpidlog
PROCESS-CRASH | /tests/dom/media/tests/mochitest/test_peerConnection_basicAudio.html | application crashed [@ libpthread-2.11.so + 0xdc44]
Crash dump filename: /tmp/tmpNqLUxy/minidumps/7f3fa034-3d37-b72b-7dc340b5-0e6b817d.dmp
Operating system: Linux
0.0.0 Linux 2.6.31.5-127.fc12.x86_64 #1 SMP Sat Nov 7 21:11:14 EST 2009 x86_64
CPU: amd64
family 6 model 23 stepping 10
2 CPUs
Crash reason: SIGABRT
Crash address: 0x1f400000880
Thread 0 (crashed)
0 libpthread-2.11.so + 0xdc44
rbx = 0x00000000036d5d00 r12 = 0x00007fffbbf42740
r13 = 0x0000000000000000 r14 = 0x00007fffbbf429e0
r15 = 0x00007f2fc2ec4b26 rip = 0x00000034d360dc44
rsp = 0x00007fffbbf42690 rbp = 0x00007fffbbf42760
Found by: given as instruction pointer in context
1 libpthread-2.11.so + 0x8f4a
rip = 0x00000034d3608f4b rsp = 0x00007fffbbf426a8
rbp = 0x00007fffbbf42760
Found by: stack scanning
2 libxul.so!mozilla::BlockingResourceBase::CheckAcquire(mozilla::CallStack const&) [BlockingResourceBase.cpp : 107 + 0x8]
rip = 0x00007f2fc2072f46 rsp = 0x00007fffbbf426b0
rbp = 0x00007fffbbf42760
Found by: stack scanning
3 libxul.so!CSF::CC_SIPCCService::notifyDeviceEventObservers(ccapi_device_event_e, linked_ptr<CSF::CC_Device>, linked_ptr<CSF::CC_DeviceInfo>) [CC_SIPCCService.cpp : 717 + 0x4]
rbx = 0x0000000002a4e060 r12 = 0x00007fffbbf42790
rip = 0x00007f2fc250f089 rsp = 0x00007fffbbf42770
rbp = 0x00007fffbbf42800
Found by: call frame info
4 libxul.so!CSF::CC_SIPCCService::onDeviceEvent(ccapi_device_event_e, unsigned int, cc_device_info_t_*) [CC_SIPCCService.cpp : 607 + 0x22]
rbx = 0x00007fffbbf42980 r12 = 0x00007fffbbf42990
r13 = 0x0000000000000000 r14 = 0x00007fffbbf429e0
r15 = 0x00007f2fc2ec4b26 rip = 0x00007f2fc250f4b5
rsp = 0x00007fffbbf42810 rbp = 0x00007fffbbf42a30
Found by: call frame info
5 libxul.so!ccsnap_gen_deviceEvent [ccapi_snapshot.c : 414 + 0xd]
rbx = 0x0000000004763050 r12 = 0x0000000000000000
r13 = 0x0000000000000000 r14 = 0x00000034e08fe158
r15 = 0x00007fffbbf431a0 rip = 0x00007f2fc25241b9
rsp = 0x00007fffbbf42a40 rbp = 0x00007fffbbf42a60
Found by: call frame info
6 libxul.so!parse_setup_properties [ccapi_config.c : 59 + 0x8]
rbx = 0x00000000041f76e8 r12 = 0x00000034e08fe158
r13 = 0x00000000036d5dc8 r14 = 0x00000034e08fe158
r15 = 0x00007fffbbf431a0 rip = 0x00007f2fc252175c
rsp = 0x00007fffbbf42a70 rbp = 0x00007fffbbf42a90
Found by: call frame info
7 libxul.so!CCAPI_Start_response [ccapi_config.c : 41 + 0x10]
rbx = 0x00000000041f76e8 r12 = 0x0000000000000000
Reporter
Comment 1•12 years ago
Sounds critical, so we had better make it s-s.
Group: core-security
Comment 2•12 years ago
Nothing s-s about this; it's just a deadlock that is then being aborted by the test harness (SIGABRT).
Group: core-security
Reporter
Comment 3•12 years ago
Looks like it's only happening on 64-bit Linux and with a debug build.
Hardware: All → x86_64
Reporter
Updated•12 years ago
Summary: Intermittent hang and crash in test_peerConnection_basicAudio.html | Failure: 'application timed out after 330 seconds with no output' and 'application crashed' [@ libpthread-2.11.so + 0xdc44] → [Linux64 debug] Intermittent hang in test_peerConnection_basicAudio.html | Failure: 'application timed out after 330 seconds with no output'
Updated•12 years ago
Assignee: nobody → adam
Priority: -- → P1
Whiteboard: [WebRTC][automation-blocked] → [WebRTC][automation-blocked][blocking-webrtc+]
Updated•12 years ago
Crash Signature: [@ libpthread-2.11.so + 0xdc44] → [@ libpthread-2.11.so@0xdc44]
Reporter
Updated•12 years ago
Crash Signature: [@ libpthread-2.11.so@0xdc44]
Updated•12 years ago
Whiteboard: [WebRTC][automation-blocked][blocking-webrtc+] → [WebRTC][blocking-webrtc+][automation-blocked][blocking-webrtc+]
Updated•12 years ago
Whiteboard: [WebRTC][blocking-webrtc+][automation-blocked][blocking-webrtc+] → [WebRTC][automation-blocked][blocking-webrtc+]
Reporter
Comment 4•12 years ago
Also happens for x86 builds now:
https://tbpl.mozilla.org/php/getParsedLog.php?id=18177793&tree=Alder
Can we get this prioritized and fixed, because it aborts our test run early?
Hardware: x86_64 → All
Summary: [Linux64 debug] Intermittent hang in test_peerConnection_basicAudio.html | Failure: 'application timed out after 330 seconds with no output' → [Linux debug] Intermittent hang in test_peerConnection_basicAudio.html | Failure: 'application timed out after 330 seconds with no output'
Comment hidden (Legacy TBPL/Treeherder Robot)
Comment hidden (Legacy TBPL/Treeherder Robot)
Reporter
Updated•12 years ago
Summary: [Linux debug] Intermittent hang in test_peerConnection_basicAudio.html | Failure: 'application timed out after 330 seconds with no output' → Intermittent hang in test_peerConnection_basicAudio.html | Failure: 'application timed out after 330 seconds with no output'
Comment 7•12 years ago
These TBPL logs date from midday 12/21. Bug 820102 was merged to m-c on 12/24. Please verify that this is still an issue.
Reporter
Comment 8•12 years ago
Randell, we would have to do a merge from m-c to alder.
Flags: needinfo?(rjesup)
Comment 9•12 years ago
Merge done.
For reference:
"cd m-c; hg pull -u; cd ../alder; hg pull -u; hg pull ../m-c; hg merge"
Deal with any conflicts during the merge; generally there are none.
"hg commit -m 'Merge m-c to alder'; hg push alder"
In my .hgrc, I have:
[merge-tools]
emacs.args = -q --eval "(ediff-merge-with-ancestor \"$local\" \"$other\" \"$base\" nil \"$output\")"
But that's not necessary.
Flags: needinfo?(rjesup)
Comment 10•12 years ago
OK, these are in fact still failing. However, since they work on my machine, it is not possible for me to diagnose locally. Please turn up the logging (e.g., NSPR_LOG_MODULES=mediapipeline:5,ikran:5) and re-run.
Comment 11•12 years ago
Actually, wait: these errors are all TEST_UNEXPECTED_PASS. The problem would appear to be that we are missing the latest updates from m-i.
Comment hidden (Legacy TBPL/Treeherder Robot)
Reporter
Comment 13•12 years ago
I'm not able to get this test hanging on my local machine. I'm running a recent debug build from m-c on 10.7, so it matches the platform as given in the last comment. But it looks like something is still different.
Comment 14•12 years ago
What event is it waiting for?
Reporter
Comment 15•12 years ago
Comment 12 is most likely bug 824851, so it doesn't belong to this bug. That said, no platform currently shows the hang for the last merge from Randell. We should keep this bug open and watch whether it reproduces with further merges.
Reporter
Comment 16•12 years ago
I have seen this hang this morning when I was not connected to a known network. I filed the issue as bug 824919, and a possible fix there might also fix our problem here.
Depends on: 824919
Reporter
Comment 17•12 years ago
With my patch on bug 824923 merged to m-c and then to alder, we should get better information about what's going on here.
Depends on: 824923
Comment 18•12 years ago
(In reply to Henrik Skupin (:whimboo) [away 12/21 - 01/01] from comment #15)
> Comment 12 is most likely bug 824851. So it doesn't belong to this bug. That
> said no platform shows the hang currently for the last merge from Randell.
> We should keep this bug open and watch further merges if it reproduces or
> not.
Removing this from the blocking list, but keeping the bug open to catch regressions.
Henrik -- Once you test with your patch from bug 824923, if you find that this is happening frequently or blocking automation, we can promote this back to blocking.
Severity: critical → normal
Priority: P1 → P3
Whiteboard: [WebRTC][automation-blocked][blocking-webrtc+] → [WebRTC][blocking-webrtc-]
Comment 19•12 years ago
If this is "we kill it after 330 sec", there's one bug (failure to shut down SRTCP) which I've seen today when testing the peerconnection leak patch. This bug can cause it to fail to exit.
Comment hidden (Legacy TBPL/Treeherder Robot)
Comment hidden (Legacy TBPL/Treeherder Robot)
Reporter
Comment 22•12 years ago
Randell, the last two tinderbox failures are bug 824851 and not this one.
Comment hidden (Legacy TBPL/Treeherder Robot)
Comment hidden (Legacy TBPL/Treeherder Robot)
Comment hidden (Legacy TBPL/Treeherder Robot)
Comment 26•12 years ago
Bumping this back to blocking - this is an intermittent failure on m-c now. Let's clean this up.
Priority: P3 → P1
Whiteboard: [WebRTC][blocking-webrtc-] → [WebRTC][blocking-webrtc+]
Comment hidden (Legacy TBPL/Treeherder Robot)
Assignee
Comment 28•12 years ago
Just some musings on the crashes, looking at the three starred failures... all three are Linux PGO builds; one is 32-bit, the other two are 64-bit. The problem has not yet surfaced on other platforms or in opt or debug builds.
I'm going to see if I can get it to show up on normal opt -- or, even better, debug -- builds. Here's a try from m-i tip that I plan to run mochi-3 on multiple times to see whether this problem can be provoked; this does not have PGO turned on:
https://tbpl.mozilla.org/?tree=Try&rev=4448d6d03a94
Assignee
Comment 29•12 years ago
And to provide a comparison, here's a PGO try:
https://tbpl.mozilla.org/?tree=Try&rev=620fd740b4b2
Comment hidden (Legacy TBPL/Treeherder Robot)
Assignee
Comment 31•12 years ago
Just some notes about the process state on failure (using https://tbpl.mozilla.org/php/getParsedLog.php?id=19255531&full=1&branch=services-central#error1 as my basis)...
Looking at thread 0 (main), we have to be at CC_SIPCCService.cpp:598:
CC_SIPCCDeviceInfoPtr infoPtr = CC_SIPCCDeviceInfo::wrap(info);
(Note that the execution of this operation has been delayed by the optimizer to line 607, hence the stack reporting line 607+0x1c.)
Where the "wrap" method is provided by Wrapper.h:96. Part of this operation is putting the cc_device_info_t_ into the HandleMap. This is protected by the handleMapMutex. Given aggressive optimization, it seems highly likely that the lock operation for that mutex would be subsumed into the map insert() operation (frame 2), which would explain the calls down into libpthread (frames 0 and 1).
Thread 25 (CCApp) is of particular interest, for two reasons: it's the only thread that's actually running, *and* it is down in the same set of code as thread 0: it is in the process of handling CC_SIPCCService::onDeviceEvent, and is, in fact, in the process of handling notifyDeviceEventObservers (called on the aforementioned line 607).
If you go all the way to the bottom of the stack (depart), we're in depart() from linked_ptr.h. In fact, we're on the line where we're traversing the circular (!!!) list of objects to try to find this one:
67 while (p->next_ != this) p = p->next_;
My guess is that this thread is spinning furiously, traversing a very small circular list over and over again, all the while holding on to the lock that thread 0 really needs to finish its map operation.
So the real question is how we ended up with an instance of a linked_ptr whose next_ pointer indicates a circle that it's not a member of. Reading the code in linked_ptr, the only way this could plausibly happen is if two threads were manipulating linked_ptrs to the same object simultaneously. So I think what we need to do here is track down where such a race arises for one of linked_ptr<CSF::CC_Device> or linked_ptr<CSF::CC_DeviceInfo>.
Assignee
Comment 32•12 years ago
FWIW, there are a number of traces in this bug, corresponding to four distinct problems.
The logs in comment 4 and comment 5 are a crash in SendAudioFrame that hasn't apparently been seen since before Christmas.
The log in comment 6 is a crash in sipcc::PeerConnectionMedia::IceCompleted that doesn't even appear to have caused a 330-second timeout. The only thing it has to do with this bug is that it was provoked by the same test case.
The logs in comments 12, 20, and 21 appear to correspond to bug 824851, as Henrik has pointed out.
The logs in comments 23, 24, 25, 27, and 30 all exhibit the behavior I describe in comment 31, above. They appear to be pouring in at a cadence of one every day or two, starting on January 25th. This points to a change being made somewhere in the January 23 - January 25 timeframe that either introduced or aggravated this bug. I'm going to look through hg logs to see if anything looks likely.
Assignee
Comment 33•12 years ago
Following up on my previous comment, the patch for Bug 806825 is right in the timeframe I'd expect, and touches exactly this code (PeerConnectionCtx::onDeviceEvent). I don't think it's the root of this bug, but it could certainly change the timing of the onDeviceEvent processing loop to make this race more likely.
In any case, attachment 676566 [details] from that bug may well shed some light on the locks and races involved.
On additional reflection, any attempts to call into CC_SIPCCService::onDeviceEvent from multiple threads can cause this problem if they're talking about the same device or same deviceptr, since both threads will be attempting to work with linked_ptr instances for the same objects. Any such creation or destruction can lead to corruption of the circular list.
I think this means that CC_SIPCCService needs to synchronize any calls into onDeviceEvent, but I'm going to have to poke around a bit more to determine whether that's a sensible thing to do.
Comment hidden (Legacy TBPL/Treeherder Robot)
Assignee
Comment 35•12 years ago
After reading the code some more, I think the problem might actually be more fundamental than just onDeviceEvent handling. Basically, CallControlManagerImpl::startSDPMode creates a new CC_SIPCCService, called "phone". It then calls init() on this phone, which starts up the CCApp_task thread. This thread starts processing events on its queue, which will make calls into CC_SIPCCService::on*Event (e.g., onDeviceEvent).
Almost immediately after starting this thread, startSDPMode calls phone->startService(), which makes some calls into CC_SIPCCService that ultimately call through to CC_SIPCCService::onDeviceEvent. There's no synchronization involved.
The reason it's kind of difficult to reproduce the bug is that the race that has caused the logs above is exquisitely narrow, and literally involves the two threads (main and CCApp) running linked_ptr_internal::join at exactly the same time. This function, mind you, comprises exactly two assignment operations. For this condition to arise, CCApp must perform the first assignment, then main must perform the first assignment, then CCApp must perform the second assignment, then main must perform the second assignment, thereby overwriting CCApp's previous write. This leaves the linked_ptr in CCApp pointing to a circular linked list of which it is not a member, which means it will spin when it attempts to remove itself later. The PCs for these threads must literally be within a byte or two of each other for this to happen.
I suspect that there are other races possible in here as well. The solution, I believe, is to ensure that all the operations inside phone->startService() are dispatched to the CCApp thread rather than being run from main (as I believe that all other methods in CC_SIPCCService are run from the ccapp_thread). Ideally, key methods within CC_SIPCCService would then check the current thread when they are invoked to ensure that they are being called from the ccapp_thread, and assert if not.
Assignee
Comment 36•12 years ago
Assignee
Comment 37•12 years ago
Assignee
Updated•12 years ago
Attachment #709089 - Attachment is obsolete: true
Assignee
Updated•12 years ago
Attachment #709169 - Flags: review?(ethanhugg)
Comment 38•12 years ago
Comment on attachment 709169 [details] [diff] [review]
Move locks to protect linked_ptr<> instances
Review of attachment 709169 [details] [diff] [review]:
-----------------------------------------------------------------
::: media/webrtc/signaling/src/softphonewrapper/CC_SIPCCService.cpp
@@ +475,5 @@
> }
>
> +// !!! Note that accessing *Ptr instances from multiple threads can
> +// lead to deadlocks, crashes, and spinning threads. Calls to this
> +// method are not safe except from ccapp_thread.
Adam, wouldn't it be better to have asserts rather than comments for these?
Assignee
Comment 39•12 years ago
(In reply to Eric Rescorla (:ekr) from comment #38)
> Comment on attachment 709169 [details] [diff] [review]
> Move locks to protect linked_ptr<> instances
>
> Review of attachment 709169 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> ::: media/webrtc/signaling/src/softphonewrapper/CC_SIPCCService.cpp
> @@ +475,5 @@
> > }
> >
> > +// !!! Note that accessing *Ptr instances from multiple threads can
> > +// lead to deadlocks, crashes, and spinning threads. Calls to this
> > +// method are not safe except from ccapp_thread.
>
> Adam, wouldn't it be better to have asserts rather than comments for these?
It would probably be better to have both. I don't think this constraint is currently honored, so the asserts would simply fire. I'm trying to come up with a broader solution to that problem.
The nature of this patch is known to be wallpaperish, as that's what jesup has indicated he would prefer at the moment. I'm putting the comments here to demystify any future problems.
Comment 40•12 years ago
Comment on attachment 709169 [details] [diff] [review]
Move locks to protect linked_ptr<> instances
Review of attachment 709169 [details] [diff] [review]:
-----------------------------------------------------------------
Looks good to me. Built/Tested on Linux64
::: media/webrtc/signaling/src/softphonewrapper/CC_SIPCCService.cpp
@@ +756,5 @@
> }
>
> void CC_SIPCCService::notifyLineEventObservers (ccapi_line_event_e eventType, CC_LinePtr linePtr, CC_LineInfoPtr info)
> {
> + // m_lock must be held by the function that called us */
This line ends in a */
Attachment #709169 - Flags: review?(ethanhugg) → review+
Assignee
Comment 41•12 years ago
||
Assignee
Updated•12 years ago
Attachment #709169 - Attachment is obsolete: true
Assignee
Comment 42•12 years ago
Comment on attachment 709250 [details] [diff] [review]
Move locks to protect linked_ptr<> instances
Carrying forward r+ from ehugg
Attachment #709250 - Flags: review+
Attachment #709250 - Flags: checkin?(rjesup)
Comment hidden (Legacy TBPL/Treeherder Robot)
Comment 44•12 years ago
Target Milestone: --- → mozilla21
Updated•12 years ago
Attachment #709250 - Flags: checkin?(rjesup) → checkin+
Comment 45•12 years ago
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Updated•12 years ago
Flags: in-testsuite+
Whiteboard: [WebRTC][blocking-webrtc+] → [WebRTC][blocking-webrtc+][qa-]