Closed Bug 1153094 Opened 9 years ago Closed 9 years ago

Crash on WebRTC mochitest shutdown

Categories

(Core :: Memory Allocator, defect)

40 Branch
x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla40
Tracking Status
firefox40 --- fixed

People

(Reporter: drno, Assigned: jya)

References

Details

Attachments

(1 file)

With a self compiled Nightly from today (238479:fec90cbfbaad) on Mac OSX 10.10.3 when executing this command:

./mach mochitest --debugger lldb dom/media/tests/mochitest/test_peerConnection_basicAudio.html

I get the following stack trace:

Hit MOZ_CRASH() at /Users/nohlmeier/src/mozilla-central/memory/mozjemalloc/jemalloc.c:1610
Process 3588 stopped
* thread #8: tid = 0x16521e, 0x000000010001f6ea libmozglue.dylib`jemalloc_crash + 58 at jemalloc.c:1610, name = 'Socket Thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x000000010001f6ea libmozglue.dylib`jemalloc_crash + 58 at jemalloc.c:1610
   1607	MOZ_NEVER_INLINE static void
   1608	jemalloc_crash()
   1609	{
-> 1610		MOZ_CRASH();
   1611	}
   1612	
   1613	#if defined(MOZ_JEMALLOC_HARD_ASSERTS)
(lldb) bt
* thread #8: tid = 0x16521e, 0x000000010001f6ea libmozglue.dylib`jemalloc_crash + 58 at jemalloc.c:1610, name = 'Socket Thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
  * frame #0: 0x000000010001f6ea libmozglue.dylib`jemalloc_crash + 58 at jemalloc.c:1610
    frame #1: 0x000000010001ccd7 libmozglue.dylib`arena_dalloc + 113 at jemalloc.c:4651
    frame #2: 0x000000010001cc66 libmozglue.dylib`arena_dalloc(ptr=<unavailable>, offset=<unavailable>) + 886 at jemalloc.c:4719
    frame #3: 0x000000010001a400 libmozglue.dylib`free_impl(ptr=0x000000011bc8b918) + 96 at replace_malloc.c:201
    frame #4: 0x000000010001ab82 libmozglue.dylib`zone_free_definite_size(zone=0x000000010004d4b8, ptr=0x000000011bc8b918, size=16) + 146 at replace_malloc.c:359
    frame #5: 0x0000000101db6cd5 XUL`nsTArrayInfallibleAllocator::Free(aPtr=0x000000011bc8b918) + 21 at nsTArray.h:188
    frame #6: 0x0000000101db88fd XUL`nsTArray_base<nsTArrayInfallibleAllocator, nsTArray_CopyWithMemutils>::~nsTArray_base(this=0x000000011bc8b910) + 77 at nsTArray-inl.h:22
    frame #7: 0x0000000101db9ae2 XUL`nsTArray_Impl<unsigned char, nsTArrayInfallibleAllocator>::~nsTArray_Impl(this=0x000000011bc8b910) + 34 at nsTArray.h:790
    frame #8: 0x0000000101dbd9b5 XUL`nsTArray<unsigned char>::~nsTArray(this=0x000000011bc8b910) + 21 at nsTArray.h:1801
    frame #9: 0x0000000101dbd995 XUL`mozilla::DataBuffer::~DataBuffer(this=0x000000011bc8b910) + 21 at MediaData.h:425
    frame #10: 0x0000000101dbd975 XUL`mozilla::DataBuffer::~DataBuffer(this=0x000000011bc8b910) + 21 at MediaData.h:425
    frame #11: 0x0000000102f522fd XUL`nsAutoPtr<mozilla::DataBuffer>::~nsAutoPtr(this=0x0000000116fcf670) + 45 at nsAutoPtr.h:74
    frame #12: 0x0000000102f4f605 XUL`nsAutoPtr<mozilla::DataBuffer>::~nsAutoPtr(this=0x0000000116fcf670) + 21 at nsAutoPtr.h:73
    frame #13: 0x0000000102f52148 XUL`mozilla::runnable_args_m_1<mozilla::RefPtr<mozilla::MediaPipeline::PipelineTransport>, nsresult (mozilla::MediaPipeline::PipelineTransport::*)(nsAutoPtr<mozilla::DataBuffer>), nsAutoPtr<mozilla::DataBuffer> >::Run(this=0x000000011bd75900) + 184 at runnable_utils_generated.h:122
    frame #14: 0x0000000101f02a7a XUL`nsThread::ProcessNextEvent(this=0x00000001003319a0, aMayWait=true, aResult=0x0000000116fcf88e) + 2042 at nsThread.cpp:866
    frame #15: 0x0000000101f5e1d7 XUL`NS_ProcessNextEvent(aThread=0x00000001003319a0, aMayWait=true) + 151 at nsThreadUtils.cpp:265
    frame #16: 0x00000001020a6ca6 XUL`nsSocketTransportService::Run(this=0x0000000100347bf0) + 1014 at nsSocketTransportService2.cpp:851
    frame #17: 0x00000001020a794c XUL`non-virtual thunk to nsSocketTransportService::Run(this=0x0000000100347c08) + 28 at nsSocketTransportService2.cpp:927
    frame #18: 0x0000000101f02a7a XUL`nsThread::ProcessNextEvent(this=0x00000001003319a0, aMayWait=false, aResult=0x0000000116fcfc5e) + 2042 at nsThread.cpp:866
    frame #19: 0x0000000101f5e1d7 XUL`NS_ProcessNextEvent(aThread=0x00000001003319a0, aMayWait=false) + 151 at nsThreadUtils.cpp:265
    frame #20: 0x0000000102591090 XUL`mozilla::ipc::MessagePumpForNonMainThreads::Run(this=0x0000000117581780, aDelegate=0x0000000100326690) + 656 at MessagePump.cpp:339
    frame #21: 0x0000000102504875 XUL`MessageLoop::RunInternal(this=0x0000000100326690) + 117 at message_loop.cc:233
    frame #22: 0x0000000102504785 XUL`MessageLoop::RunHandler(this=0x0000000100326690) + 21 at message_loop.cc:226
    frame #23: 0x000000010250472d XUL`MessageLoop::Run(this=0x0000000100326690) + 45 at message_loop.cc:200
    frame #24: 0x0000000101f00fe9 XUL`nsThread::ThreadFunc(aArg=0x00000001003319a0) + 329 at nsThread.cpp:364
    frame #25: 0x0000000101b6b2b1 libnss3.dylib`_pt_root(arg=0x0000000100398530) + 449 at ptthread.c:212
    frame #26: 0x00007fff84106268 libsystem_pthread.dylib`_pthread_body + 131
    frame #27: 0x00007fff841061e5 libsystem_pthread.dylib`_pthread_start + 176
    frame #28: 0x00007fff8410441d libsystem_pthread.dylib`thread_start + 13
This assertion is likely just checking for memory corruption.
See Also: → 1153227
Primary suspicion would be on bug 1101651, which landed Thursday morning on inbound, due to the timeframe.  Of course that may have exposed an existing bug.  Regression testing seems feasible given high crash frequency for jib (and nils?)

There are some changes to non-standalone; I note that some of the "image_" fiddling is no longer in MOZILLA_INTERNAL_API; though that shouldn't matter to mochitests.  However, I've only looked at MediaPipeline.cpp/h thus far.

The crash is almost certainly in the cleanup for SendRtpPacket_s() or SendRtcpPacket_s()
Blocks: 1101651
Likely this isn't an allocator bug - but you never know, and IIRC jemalloc3 landed recently
(In reply to Randell Jesup [:jesup] from comment #3)
> Likely this isn't an allocator bug - but you never know, and IIRC jemalloc3
> landed recently

Agreed.I overlooked the MediaPipeline in the stack trace and initially thought this is JS related.

Another observation: if I add a 10s sleep to my tests (for other purposes) this crash happens in the middle of the call.

Yes I'll try regression or ASAN build.
Revision 238432 right before bug 1101651 seems broken.
Revision 238000 seems to be fine.
r238216 appears to be working
The range 238216:238432 includes these in media/webrtc:

Bug 1151139: Simplify how we choose which streams to gather stats from.

and in dom/media (mostly MSE-related)

Bug 1148571 - Dispatch AudioSink notifications asynchronously. r=jww
Bug 1150539: log getUserMedia constraints used in MediaManager:5. r=jesup
Bug 1150539 - fix getUserMedia constraint default to aPrefs.mFPS, not aPrefs.mMinFPS.
Bug 1149494 - Part 1. Add a listener directly to the unblocked input stream that reports the s
Bug 1149494 - Part 2. Add mochitest. r=jesup
Bug 1151656 - (bunch of patches)
Bug 1152236: OMX codec should use AnnexB as input format.
Bug 1094764 - Implement AudioContext.suspend and friends. (backed out after getting to m-c for mulet failures)
Bug 1152359 - Fix doppler shift calculation formula
Bug 1150853: Part1. Create MediaRawData object type.
(2 more parts)
Bug 1149616 - Fix the calculation of slots count in Read/WriteBuffer

Most interesting would be bug 1149494 (pehrsons), though none of these stand out.

Could be a chunking/timing change exposed something, and RTP data is hanging around to be processed "too long" in shutdown.
Also in mtransport:

Bug 1150966: Check whether |streams_| is null on stats methods in NrIceMediaStream.
Bug 935838 - Add per app network traffic statistics to the UDP socket. r=sicking, r=mayhemer
Please do not refer to revision numbers, they only apply to local clones, and are not guaranteed to mean the same thing for everyone. For instance, 238216:238432 contains none of the bugs you refer in comment 7 in my local clone.
I'm guessing/hope that the full revison plus hash string is better then just the revision...

238216:27d9481e341b = works
238432:6f7d0ef4912a = broken

BTW Jesup we are not talking about a shutdown problem as I tried to indicate in c4. If a test runs long enough it kills the browser mid call.
(In reply to Nils Ohlmeier [:drno] from comment #10)
> I'm guessing/hope that the full revison plus hash string is better then just
> the revision...
> 
> 238216:27d9481e341b = works
> 238432:6f7d0ef4912a = broken

Those match my clone of mozilla-central (used for the hg log entries above:
hg log -r 27d9481e341b:6f7d0ef4912a media/mtransport (or dom/media, etc)
The mtransport ones seem very unlikely to directly cause this:

Bug 1150966 Bug 935838 

Overall, I'm suspecting chunking/order/timing changes triggering a latent bug.  Further regression runs that isolate the changeset where things changed likely will make this clear, though may not help track down the failure directly.

mt's ASAN results would also point in this direction.  (Unfortunately, as ASAN was a good hope -- but ASAN is slow enough to throw timings off a lot.)
238216:27d9481e341b = works
238324:5b615a235096 = broken
As I noted in the other bug, bug 1150853 is the culprit.
Blocks: 1150853
DataBuffer (as defined in MediaData.h) is a ref counted object, can't assign it to a nsAutoPtr.

AFAIK, that object shouldn't be used outside the mp4 demuxer.

Could be that DataBuffer introduce a naming conflict and that name was used elsewhere (did check though)
backtraceframe #10: 0x0000000101dbd975 XUL`mozilla::DataBuffer::~DataBuffer(this=0x000000011bc8b910) + 21 at MediaData.h:425

indicates that it's using the wrong DataBuffer type (the one define in MediaData.h) and not the one define in media/mtransport/databuffer.h
Assignee: nobody → jyavenard
I'm guessing the compiler never sees both classes together, so it doesn't complain and generates calls to a "mozilla::DataBuffer::~DataBuffer" (that exists in mtransport).
But then the linker probably just drops one of them, so you end up calling the wrong one.

I could reproduce the same issue with a toy program:

data1.h:
void data1();

data1.cpp:
#include "data1.h"
struct Data
{
public:
    Data() { printf("Data1()\n"); };
    ~Data() { printf("~Data1()\n"); };
};
void data1()
{
    Data data;
}

data2.h:
void data2();

data2.cpp:
#include "data2.h"
struct Data
{
public:
    Data() { printf("Data2()\n"); };
    ~Data() { printf("~Data2()\n"); };
};
void data1()
{
    Data data;
}

main.cpp:
#include "data1.h"
#include "data2.h"
int main() {
    data1();
    data2();
    return 0;
}

When run on my machine, I get:
Data2()
~Data2()
Data2()
~Data2()

So the Data struct in data2.cpp takes over the world!
You'll need to rename things I'm afraid...
Typo above: data2.cpp defines "void data2()" of course.
rename the classes LargeDataBuffer into MediaLargeByteBuffer and DataBuffer into MediaByteBuffer
Attachment #8592584 - Flags: review?(cpearce)
Comment on attachment 8592584 [details] [diff] [review]
Rename LargeDataBuffer and DataBuffer object

Review of attachment 8592584 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/media/mediasource/SourceBuffer.h
@@ +162,1 @@
>                                                  uint32_t aLength,

Can you fix the hanging indent here?
Attachment #8592584 - Flags: review?(cpearce) → review+
https://hg.mozilla.org/mozilla-central/rev/cd563aff22c5
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla40
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: