Closed Bug 823056 Opened 12 years ago Closed 12 years ago

Intermittent hang in test_peerConnection_basicAudio.html | Failure: 'application timed out after 330 seconds with no output'

Categories

(Core :: WebRTC, defect, P1)

All
Linux
defect

Tracking


RESOLVED FIXED
mozilla21

People

(Reporter: whimboo, Assigned: abr)

References


Details

(Keywords: crash, hang, intermittent-failure, Whiteboard: [WebRTC][blocking-webrtc+][qa-])

Attachments

(1 file, 2 obsolete files)

With the last merge from m-c to alder, the peer connection tests have started to hang and crash on Linux-based platforms:
https://tbpl.mozilla.org/php/getParsedLog.php?id=18062650&tree=Alder#error0

52 INFO TEST-START | /tests/dom/media/tests/mochitest/test_peerConnection_basicAudio.html
++DOMWINDOW == 14 (0x4008a18) [serial = 19] [outer = 0x534b5d0]
WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80004003: file ../../../../intl/uconv/src/nsCharsetConverterManager.cpp, line 301
WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80004003: file ../../../../intl/uconv/src/nsCharsetConverterManager.cpp, line 301
TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/mochitest/test_peerConnection_basicAudio.html | application timed out after 330 seconds with no output
args: ['/home/cltbld/talos-slave/test/build/bin/screentopng']
INFO | automation.py | Application ran for: 0:05:41.625900
INFO | automation.py | Reading PID log: /tmp/tmp0Lr3kgpidlog
PROCESS-CRASH | /tests/dom/media/tests/mochitest/test_peerConnection_basicAudio.html | application crashed [@ libpthread-2.11.so + 0xdc44]
Crash dump filename: /tmp/tmpNqLUxy/minidumps/7f3fa034-3d37-b72b-7dc340b5-0e6b817d.dmp
Operating system: Linux 0.0.0 Linux 2.6.31.5-127.fc12.x86_64 #1 SMP Sat Nov 7 21:11:14 EST 2009 x86_64
CPU: amd64 family 6 model 23 stepping 10, 2 CPUs
Crash reason: SIGABRT
Crash address: 0x1f400000880

Thread 0 (crashed)
 0  libpthread-2.11.so + 0xdc44
    rbx = 0x00000000036d5d00  r12 = 0x00007fffbbf42740  r13 = 0x0000000000000000
    r14 = 0x00007fffbbf429e0  r15 = 0x00007f2fc2ec4b26  rip = 0x00000034d360dc44
    rsp = 0x00007fffbbf42690  rbp = 0x00007fffbbf42760
    Found by: given as instruction pointer in context
 1  libpthread-2.11.so + 0x8f4a
    rip = 0x00000034d3608f4b  rsp = 0x00007fffbbf426a8  rbp = 0x00007fffbbf42760
    Found by: stack scanning
 2  libxul.so!mozilla::BlockingResourceBase::CheckAcquire(mozilla::CallStack const&) [BlockingResourceBase.cpp : 107 + 0x8]
    rip = 0x00007f2fc2072f46  rsp = 0x00007fffbbf426b0  rbp = 0x00007fffbbf42760
    Found by: stack scanning
 3  libxul.so!CSF::CC_SIPCCService::notifyDeviceEventObservers(ccapi_device_event_e, linked_ptr<CSF::CC_Device>, linked_ptr<CSF::CC_DeviceInfo>) [CC_SIPCCService.cpp : 717 + 0x4]
    rbx = 0x0000000002a4e060  r12 = 0x00007fffbbf42790  rip = 0x00007f2fc250f089
    rsp = 0x00007fffbbf42770  rbp = 0x00007fffbbf42800
    Found by: call frame info
 4  libxul.so!CSF::CC_SIPCCService::onDeviceEvent(ccapi_device_event_e, unsigned int, cc_device_info_t_*) [CC_SIPCCService.cpp : 607 + 0x22]
    rbx = 0x00007fffbbf42980  r12 = 0x00007fffbbf42990  r13 = 0x0000000000000000
    r14 = 0x00007fffbbf429e0  r15 = 0x00007f2fc2ec4b26  rip = 0x00007f2fc250f4b5
    rsp = 0x00007fffbbf42810  rbp = 0x00007fffbbf42a30
    Found by: call frame info
 5  libxul.so!ccsnap_gen_deviceEvent [ccapi_snapshot.c : 414 + 0xd]
    rbx = 0x0000000004763050  r12 = 0x0000000000000000  r13 = 0x0000000000000000
    r14 = 0x00000034e08fe158  r15 = 0x00007fffbbf431a0  rip = 0x00007f2fc25241b9
    rsp = 0x00007fffbbf42a40  rbp = 0x00007fffbbf42a60
    Found by: call frame info
 6  libxul.so!parse_setup_properties [ccapi_config.c : 59 + 0x8]
    rbx = 0x00000000041f76e8  r12 = 0x00000034e08fe158  r13 = 0x00000000036d5dc8
    r14 = 0x00000034e08fe158  r15 = 0x00007fffbbf431a0  rip = 0x00007f2fc252175c
    rsp = 0x00007fffbbf42a70  rbp = 0x00007fffbbf42a90
    Found by: call frame info
 7  libxul.so!CCAPI_Start_response [ccapi_config.c : 41 + 0x10]
    rbx = 0x00000000041f76e8  r12 = 0x0000000000000000
This sounds critical, so we should probably mark it security-sensitive (s-s).
Group: core-security
There's nothing s-s about this; it's just a deadlock that the test harness then aborts (SIGABRT).
Group: core-security
It looks like this only happens on 64-bit Linux, and only with debug builds.
Hardware: All → x86_64
Summary: Intermittent hang and crash in test_peerConnection_basicAudio.html | Failure: 'application timed out after 330 seconds with no output' and 'application crashed' [@ libpthread-2.11.so + 0xdc44] → [Linux64 debug] Intermittent hang in test_peerConnection_basicAudio.html | Failure: 'application timed out after 330 seconds with no output'
Assignee: nobody → adam
Priority: -- → P1
Whiteboard: [WebRTC][automation-blocked] → [WebRTC][automation-blocked][blocking-webrtc+]
Crash Signature: [@ libpthread-2.11.so + 0xdc44] → [@ libpthread-2.11.so@0xdc44]
Crash Signature: [@ libpthread-2.11.so@0xdc44]
Whiteboard: [WebRTC][automation-blocked][blocking-webrtc+] → [WebRTC][blocking-webrtc+][automation-blocked][blocking-webrtc+]
Whiteboard: [WebRTC][blocking-webrtc+][automation-blocked][blocking-webrtc+] → [WebRTC][automation-blocked][blocking-webrtc+]
This now also happens on x86 builds: https://tbpl.mozilla.org/php/getParsedLog.php?id=18177793&tree=Alder
Can we get this prioritized for a fix? It aborts our test run early.
Hardware: x86_64 → All
Summary: [Linux64 debug] Intermittent hang in test_peerConnection_basicAudio.html | Failure: 'application timed out after 330 seconds with no output' → [Linux debug] Intermittent hang in test_peerConnection_basicAudio.html | Failure: 'application timed out after 330 seconds with no output'
Summary: [Linux debug] Intermittent hang in test_peerConnection_basicAudio.html | Failure: 'application timed out after 330 seconds with no output' → Intermittent hang in test_peerConnection_basicAudio.html | Failure: 'application timed out after 330 seconds with no output'
These TBPL logs date from midday 12/21. Bug 820102 was merged to m-c on 12/24. Please verify that this is still an issue.
Randell, we would have to do a merge from m-c to alder for that.
Flags: needinfo?(rjesup)
Merge done. For reference:
  cd m-c; hg pull -u
  cd ../alder; hg pull -u
  hg pull ../m-c; hg merge
Deal with any conflicts during the merge; generally there are none. Then:
  hg commit -m 'Merge m-c to alder'; hg push alder
In my .hgrc, I have:
  [merge-tools]
  emacs.args = -q --eval "(ediff-merge-with-ancestor \"$local\" \"$other\" \"$base\" nil \"$output\")"
But that's not necessary.
Flags: needinfo?(rjesup)
OK, these are in fact still failing. However, since they work on my machine, I can't diagnose them locally. Please turn up the logging (e.g., NSPR_LOG_MODULES=mediapipeline:5,ikran:5) and re-run.
Actually, wait: these errors are all TEST_UNEXPECTED_PASS. The problem would appear to be that we are missing the latest updates from m-i.
I'm not able to get this test to hang on my local machine. I'm running a recent debug build from m-c on 10.7, so it matches the platform given in the last comment. But it looks like something is still different.
What is it even waiting for?
Comment 12 is most likely bug 824851, so it doesn't belong to this bug. That said, no platform currently shows the hang for the last merge from Randell. We should keep this bug open and watch whether it reproduces after further merges.
I saw this hang this morning when I was not connected to a known network. I filed that issue as bug 824919; a possible fix there might also fix our problem here.
Depends on: 824919
With my patch on bug 824923 merged to m-c and then to alder, we should get better information about what's going on here.
Depends on: 824923
(In reply to Henrik Skupin (:whimboo) [away 12/21 - 01/01] from comment #15)
> Comment 12 is most likely bug 824851. So it doesn't belong to this bug. That
> said no platform shows the hang currently for the last merge from Randell.
> We should keep this bug open and watch further merges if it reproduces or
> not.

Removing this from the blocking list, but keeping the bug open to catch regressions. Henrik -- once you test with your patch from bug 824923, if you find that this is happening frequently or blocking automation, we can promote this back to blocking.
Severity: critical → normal
Priority: P1 → P3
Whiteboard: [WebRTC][automation-blocked][blocking-webrtc+] → [WebRTC][blocking-webrtc-]
If this is a "we kill it after 330 sec" case, there's one bug (failure to shut down SRTCP) that I saw today while testing the peerconnection leak patch. That bug can cause the application to fail to exit.
Randell, the last two tinderbox failures are bug 824851 and not this one.
Bumping this back to blocking; this is an intermittent failure on m-c now. Let's clean this up.
Priority: P3 → P1
Whiteboard: [WebRTC][blocking-webrtc-] → [WebRTC][blocking-webrtc+]
Just some musings on the crashes, looking at the three starred failures... all three are Linux PGO builds; one is 32-bit, the other two are 64-bit. The problem has not yet surfaced on other platforms or in opt or debug builds. I'm going to see if I can get it to show up on normal opt -- or, even better, debug -- builds. Here's a try from m-i tip that I plan to run mochi-3 on multiple times to see whether this problem can be provoked; this does not have PGO turned on: https://tbpl.mozilla.org/?tree=Try&rev=4448d6d03a94
And to provide a comparison, here's a PGO try: https://tbpl.mozilla.org/?tree=Try&rev=620fd740b4b2
Just some notes about the process state on failure (using https://tbpl.mozilla.org/php/getParsedLog.php?id=19255531&full=1&branch=services-central#error1 as my basis)...

Looking at thread 0 (main), we have to be at CC_SIPCCService.cpp:598:

  CC_SIPCCDeviceInfoPtr infoPtr = CC_SIPCCDeviceInfo::wrap(info);

(Note that the execution of this operation has been delayed by the optimizer to line 607, hence the stack reporting line 607+0x1c.) The "wrap" method is provided by Wrapper.h:96. Part of this operation is putting the cc_device_info_t_ into the HandleMap, which is protected by the handleMapMutex. Given aggressive optimization, it seems highly likely that the lock operation for that mutex would be subsumed into the map insert() operation (frame 2), which would explain the calls down into libpthread (frames 0 and 1).

Thread 25 (CCApp) is of particular interest for two reasons: it's the only thread that's actually running, *and* it is down in the same set of code as thread 0: it is in the process of handling CC_SIPCCService::onDeviceEvent, and is, in fact, in the process of handling notifyDeviceEventObservers (called on the aforementioned line 607). If you go all the way to the bottom of the stack (depart), we're in depart() from linked_ptr.h. In fact, we're on the line where we traverse the circular (!!!) list of objects to try to find this one:

  67    while (p->next_ != this) p = p->next_;

My guess is that this thread is spinning furiously, traversing a very small circular list over and over again, all the while holding the lock that thread 0 really needs to finish its map operation.

So the real question is how we ended up with an instance of a linked_ptr whose next_ pointer indicates a circle that it's not a member of. Reading the code in linked_ptr, the only way this could plausibly happen is if two threads were manipulating linked_ptrs to the same object simultaneously. So I think what we need to do here is track down where such a race arises for one of linked_ptr<CSF::CC_Device> or linked_ptr<CSF::CC_DeviceInfo>.
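[Editorial note] To make the ring structure and the spin concrete, here is a minimal, standalone C++ sketch of the kind of circular ownership list linked_ptr maintains instead of a reference count, and of how depart() can loop forever once a node's next_ points into a ring it is not actually a member of. The names, layout, and simplifications are illustrative only; this is not the actual linked_ptr.h source.

// Simplified sketch (not the actual linked_ptr.h code) of the circular
// "ownership ring" that linked_ptr keeps instead of a shared count.
#include <cassert>

struct RingNode {
  RingNode* next_;

  // A brand-new owner forms a ring containing only itself.
  void join_new() { next_ = this; }

  // Joining an existing owner splices this node into that owner's ring.
  // These two stores are the unsynchronized pair discussed later in this
  // bug: two threads interleaving here can leave the ring inconsistent.
  void join(RingNode* peer) {
    next_ = peer->next_;
    peer->next_ = this;
  }

  // Leaving the ring: walk around until we find our predecessor.
  // If next_ points into a ring that does not actually contain this node,
  // this loop never terminates -- the "spinning furiously" behavior seen on
  // the CCApp thread, which keeps holding the lock that thread 0 needs.
  bool depart() {
    if (next_ == this) return true;         // sole owner; caller may delete
    RingNode* p = next_;
    while (p->next_ != this) p = p->next_;  // corresponds to linked_ptr.h line 67
    p->next_ = next_;
    return false;
  }
};

int main() {
  RingNode a, b;
  a.join_new();
  b.join(&a);           // ring is now a -> b -> a
  assert(!b.depart());  // b leaves; a becomes the sole owner again
  assert(a.depart());   // a was the last owner
}

The key property is that every copy and destruction of a linked_ptr rewrites this ring in place with plain pointer stores, so any two threads touching linked_ptrs to the same object can corrupt it.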
FWIW, there are a number of traces in this bug, corresponding to four distinct problems.

The logs in comment 4 and comment 5 are a crash in SendAudioFrame that apparently hasn't been seen since before Christmas.

The log in comment 6 is a crash in sipcc::PeerConnectionMedia::IceCompleted that doesn't even appear to have caused a 330-second timeout. The only thing it has to do with this bug is that it was provoked by the same test case.

The logs in comments 12, 20, and 21 appear to correspond to bug 824851, as Henrik has pointed out.

The logs in comments 23, 24, 25, 27, and 30 all exhibit the behavior I describe in comment 31, above. They appear to be pouring in at a cadence of one every day or two, starting on January 25th. This points to a change being made somewhere in the January 23 - January 25 timeframe that either introduced or aggravated this bug. I'm going to look through hg logs to see if anything looks likely.
Following up on my previous comment, the patch for bug 806825 is right in the timeframe I'd expect, and touches exactly this code (PeerConnectionCtx::onDeviceEvent). I don't think it's the root of this bug, but it could certainly change the timing of the onDeviceEvent processing loop enough to make this race more likely. In any case, attachment 676566 [details] from that bug may well shed some light on the locks and races involved.

On additional reflection, any attempt to call into CC_SIPCCService::onDeviceEvent from multiple threads can cause this problem if the calls concern the same device or the same device pointer, since both threads will be working with linked_ptr instances for the same objects. Any such creation or destruction can corrupt the circular list. I think this means that CC_SIPCCService needs to synchronize any calls into onDeviceEvent, but I'm going to have to poke around a bit more to determine whether that's a sensible thing to do.
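[Editorial note] A minimal sketch of what serializing these calls could look like, assuming a single service-wide mutex. The class, member, and method names below are simplified stand-ins invented for illustration; they are not the actual CC_SIPCCService code or the patch attached to this bug.

// Sketch: serialize event delivery with one service-wide mutex so that the
// smart-pointer bookkeeping is never performed concurrently.
#include <mutex>
#include <vector>

struct Device {};  // stands in for the device/device-info payloads

struct Observer {
  virtual void onDeviceEvent(const Device& dev) = 0;
  virtual ~Observer() = default;
};

class SipccServiceSketch {
 public:
  void addObserver(Observer* obs) {
    std::lock_guard<std::mutex> guard(mLock);
    mObservers.push_back(obs);
  }

  // Whichever thread delivers the event (main during startService(), or the
  // CCApp event thread), the observer list and the per-event object wrapping
  // only happen while mLock is held, so they cannot interleave.
  void onDeviceEvent(const Device& dev) {
    std::lock_guard<std::mutex> guard(mLock);
    for (Observer* obs : mObservers) {
      obs->onDeviceEvent(dev);
    }
  }

 private:
  std::mutex mLock;
  std::vector<Observer*> mObservers;
};

Note that a lock inside the service only helps if every path that copies or destroys the shared smart pointers takes it; the next comment argues the cleaner fix is to keep all of this work on a single thread.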
After reading the code some more, I think the problem might actually be more fundamental than just onDeviceEvent handling.

Basically, CallControlManagerImpl::startSDPMode creates a new CC_SIPCCService, called "phone". It then calls init() on this phone, which starts up the CCApp_task thread. This thread starts processing events on its queue, which will make calls into CC_SIPCCService::on*Event (e.g., onDeviceEvent). Almost immediately after starting this thread, startSDPMode calls phone->startService(), which makes some calls into CC_SIPCCService that ultimately call through to CC_SIPCCService::onDeviceEvent. There's no synchronization involved.

The reason the bug is hard to reproduce is that the race that caused the logs above is exquisitely narrow, and literally requires the two threads (main and CCApp) to run linked_ptr_internal::join at exactly the same time. This function, mind you, comprises exactly two assignment operations. For this condition to arise, CCApp must perform the first assignment, then main must perform the first assignment, then CCApp must perform the second assignment, then main must perform the second assignment, thereby overwriting CCApp's previous write. This leaves the linked_ptr in CCApp pointing to a circular linked list of which it is not a member, which means it will spin when it attempts to remove itself later. The PCs for these threads must literally be within a byte or two of each other for this to happen. I suspect that there are other races possible in here as well.

The solution, I believe, is to ensure that all the operations inside phone->startService() are dispatched to the CCApp thread rather than being run from main (as I believe all other methods in CC_SIPCCService are run from the ccapp_thread). Ideally, key methods within CC_SIPCCService would then check the current thread when they are invoked to ensure that they are being called from the ccapp_thread, and assert if not.
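[Editorial note] For illustration, here is a deterministic, single-threaded simulation of that interleaving. The two "threads" exist only in the comments, and the types are simplified stand-ins rather than the real linked_ptr_internal code; this only demonstrates the pointer-store ordering described above.

// Replay of the racy interleaving: both joins read the old ring head before
// either has published its write, so the loser ends up outside the ring.
#include <iostream>

struct Node { Node* next_; };

int main() {
  Node existing;               // the linked_ptr that already owns the object
  existing.next_ = &existing;  // a one-element ring

  Node ccapp_copy, main_copy;  // two new linked_ptrs being created concurrently

  // join() is just:  next_ = peer->next_;  peer->next_ = this;
  ccapp_copy.next_ = existing.next_;  // CCApp: first assignment
  main_copy.next_  = existing.next_;  // main:  first assignment
  existing.next_   = &ccapp_copy;     // CCApp: second assignment
  existing.next_   = &main_copy;      // main:  second assignment (clobbers CCApp's)

  // Resulting ring: existing -> main_copy -> existing. ccapp_copy points at
  // existing but is not a member of that ring, so a later depart() on
  // ccapp_copy would walk the ring forever looking for its predecessor.
  Node* start = ccapp_copy.next_;
  bool member = false;
  for (Node* q = start; ; q = q->next_) {
    if (q == &ccapp_copy) { member = true; break; }
    if (q->next_ == start) break;  // completed one full lap of the ring
  }
  std::cout << "ccapp_copy in ring? " << (member ? "yes" : "no") << "\n";  // prints "no"
}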
Blocks: 835476
Blocks: 837026
Attachment #709089 - Attachment is obsolete: true
Attachment #709169 - Flags: review?(ethanhugg)
Comment on attachment 709169 [details] [diff] [review]
Move locks to protect linked_ptr<> instances

Review of attachment 709169 [details] [diff] [review]:
-----------------------------------------------------------------

::: media/webrtc/signaling/src/softphonewrapper/CC_SIPCCService.cpp
@@ +475,5 @@
> }
>
> +// !!! Note that accessing *Ptr instances from multiple threads can
> +// lead to deadlocks, crashes, and spinning threads. Calls to this
> +// method are not safe except from ccapp_thread.

Adam, wouldn't it be better to have asserts rather than comments for these?
(In reply to Eric Rescorla (:ekr) from comment #38)
> ::: media/webrtc/signaling/src/softphonewrapper/CC_SIPCCService.cpp
> @@ +475,5 @@
> > +// !!! Note that accessing *Ptr instances from multiple threads can
> > +// lead to deadlocks, crashes, and spinning threads. Calls to this
> > +// method are not safe except from ccapp_thread.
>
> Adam, wouldn't it be better to have asserts rather than comments for these?

It would probably be better to have both. I don't think this constraint is currently honored, so asserts would simply fire. I'm trying to come up with a broader solution to that problem. The nature of this patch is known to be wallpaperish, as that's what jesup has indicated he would prefer at the moment. I'm putting the comments here to demystify any future problems.
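[Editorial note] For what a comment-plus-assert approach could look like, here is a hypothetical sketch. The class, member, and helper names (EventService, mEventThreadId, recordEventThread) are invented for illustration and are not from the actual patch; the real tree would presumably use its existing assertion and thread-identity facilities rather than the standard library, but the shape is the same.

// Hypothetical sketch: keep the warning comment, and also assert that the
// caller is on the event-processing (CCApp) thread.
#include <cassert>
#include <thread>

class EventService {
 public:
  // Called once from the CCApp thread when it starts processing events.
  void recordEventThread() { mEventThreadId = std::this_thread::get_id(); }

  // !!! Accessing the linked_ptr-based *Ptr instances from multiple threads
  // can lead to deadlocks, crashes, and spinning threads. Calls to this
  // method are only safe from the event-processing (CCApp) thread.
  void onDeviceEvent(/* event arguments elided */) {
    assert(std::this_thread::get_id() == mEventThreadId &&
           "onDeviceEvent must be called from the CCApp thread");
    // ... manipulate the shared smart pointers, now known to be single-threaded ...
  }

 private:
  std::thread::id mEventThreadId;
};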
Comment on attachment 709169 [details] [diff] [review]
Move locks to protect linked_ptr<> instances

Review of attachment 709169 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good to me. Built/tested on Linux64.

::: media/webrtc/signaling/src/softphonewrapper/CC_SIPCCService.cpp
@@ +756,5 @@
> }
>
> void CC_SIPCCService::notifyLineEventObservers (ccapi_line_event_e eventType, CC_LinePtr linePtr, CC_LineInfoPtr info)
> {
> +  // m_lock must be held by the function that called us */

This line ends in a */
Attachment #709169 - Flags: review?(ethanhugg) → review+
Attachment #709169 - Attachment is obsolete: true
Comment on attachment 709250 [details] [diff] [review]
Move locks to protect linked_ptr<> instances

Carrying forward r+ from ehugg.
Attachment #709250 - Flags: review+
Attachment #709250 - Flags: checkin?(rjesup)
Attachment #709250 - Flags: checkin?(rjesup) → checkin+
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Flags: in-testsuite+
Whiteboard: [WebRTC][blocking-webrtc+] → [WebRTC][blocking-webrtc+][qa-]