Open Bug 1345652 Opened 3 years ago Updated 2 years ago

Crash in abort | rtc::FatalMessage::~FatalMessage called from rtc::PlatformThread::Start() -- OOM in ::CreateThread()

Categories

(Core :: WebRTC, defect, P2, critical)

53 Branch
All
Windows 7
defect

Tracking

()

Tracking Status
firefox52 --- unaffected
firefox-esr52 --- unaffected
firefox53 - wontfix
firefox54 --- wontfix
firefox55 --- wontfix
firefox56 --- wontfix
firefox57 --- fix-optional

People

(Reporter: philipp, Assigned: jesup)

References

Details

(Keywords: crash, regression, regressionwindow-wanted)

Crash Data

Attachments

(2 files)

75.41 KB, application/x-gzip
Details
106.17 KB, application/x-gzip
Details
This bug was filed from the Socorro interface and is 
report bp-d6d2b0a6-dff2-4b30-ae8f-433192170308.
=============================================================
Crashing Thread (0)
Frame 	Module 	Signature 	Source
0 	ucrtbase.dll 	abort 	
1 	xul.dll 	rtc::FatalMessage::~FatalMessage() 	media/webrtc/trunk/webrtc/base/checks.cc:104
2 	xul.dll 	rtc::PlatformThread::Start() 	media/webrtc/trunk/webrtc/base/platform_thread.cc:212
3 	xul.dll 	webrtc::ProcessThreadImpl::Start() 	media/webrtc/trunk/webrtc/modules/utility/source/process_thread_impl.cc:80
4 	xul.dll 	webrtc::CongestionController::CongestionController(webrtc::ProcessThread*, webrtc::CallStats*, webrtc::BitrateObserver*) 	media/webrtc/trunk/webrtc/call/congestion_controller.cc:175
5 	xul.dll 	webrtc::internal::Call::Call(webrtc::Call::Config const&) 	media/webrtc/trunk/webrtc/call/call.cc:201
6 	xul.dll 	mozilla::WebRtcCallWrapper::Create(webrtc::Call::Config const&) 	media/webrtc/signaling/src/media-conduit/MediaConduitInterface.h:41
7 	xul.dll 	mozilla::PeerConnectionMedia::Init(std::vector<mozilla::NrIceStunServer, std::allocator<mozilla::NrIceStunServer> > const&, std::vector<mozilla::NrIceTurnServer, std::allocator<mozilla::NrIceTurnServer> > const&, mozilla::NrIceCtx::Policy) 	media/webrtc/signaling/src/peerconnection/PeerConnectionMedia.cpp:418
8 	xul.dll 	mozilla::PeerConnectionImpl::Initialize(mozilla::dom::PeerConnectionObserver&, nsGlobalWindow*, mozilla::PeerConnectionConfiguration const&, nsISupports*) 	media/webrtc/signaling/src/peerconnection/PeerConnectionImpl.cpp:713
9 	xul.dll 	mozilla::PeerConnectionImpl::Initialize(mozilla::dom::PeerConnectionObserver&, nsGlobalWindow&, mozilla::dom::RTCConfiguration const&, nsISupports*, mozilla::ErrorResult&) 	media/webrtc/signaling/src/peerconnection/PeerConnectionImpl.cpp:793
10 	xul.dll 	mozilla::dom::PeerConnectionImplBinding::initialize 	obj-firefox/dom/bindings/PeerConnectionImplBinding.cpp:81
11 	xul.dll 	mozilla::dom::GenericBindingMethod(JSContext*, unsigned int, JS::Value*) 	dom/bindings/BindingUtils.cpp:2914
12 		@0x1ec49c47 	
13 		@0x55d45f0f 	
14 		@0x1e7b0942 	
15 	xul.dll 	EnterBaseline 	js/src/jit/BaselineJIT.cpp:155
16 	xul.dll 	UpdateShapeTypeAndValue 	js/src/vm/NativeObject.cpp:1117
17 	xul.dll 	js::jit::EnterBaselineMethod(JSContext*, js::RunState&) 	js/src/jit/BaselineJIT.cpp:195
18 	xul.dll 	js::RunScript(JSContext*, js::RunState&) 	js/src/vm/Interpreter.cpp:396
19 	xul.dll 	js::InternalCallOrConstruct(JSContext*, JS::CallArgs const&, js::MaybeConstruct) 	js/src/vm/Interpreter.cpp:478
20 	xul.dll 	js::Call(JSContext*, JS::Handle<JS::Value>, JS::Handle<JS::Value>, js::AnyInvokeArgs const&, JS::MutableHandle<JS::Value>) 	js/src/vm/Interpreter.cpp:524
21 	xul.dll 	JS::Call(JSContext*, JS::Handle<JS::Value>, JS::Handle<JS::Value>, JS::HandleValueArray const&, JS::MutableHandle<JS::Value>) 	js/src/jsapi.cpp:2847
22 	xul.dll 	JS::StructGCPolicy<JS::GCVector<jsid, 8, js::TempAllocPolicy> >::trace(JSTracer*, JS::GCVector<jsid, 8, js::TempAllocPolicy>*, char const*) 	obj-firefox/dist/include/js/GCPolicyAPI.h:87
23 	xul.dll 	mozilla::dom::RTCConfiguration::ToObjectInternal(JSContext*, JS::MutableHandle<JS::Value>) 	obj-firefox/dom/bindings/RTCConfigurationBinding.cpp:715
24 	xul.dll 	JS::StructGCPolicy<JS::GCVector<jsid, 8, js::TempAllocPolicy> >::trace(JSTracer*, JS::GCVector<jsid, 8, js::TempAllocPolicy>*, char const*) 	obj-firefox/dist/include/js/GCPolicyAPI.h:87
25 	xul.dll 	JS_DefineProperty(JSContext*, JS::Handle<JSObject*>, char const*, JS::Handle<JSObject*>, unsigned int, bool (*)(JSContext*, unsigned int, JS::Value*), bool (*)(JSContext*, unsigned int, JS::Value*)) 	js/src/jsapi.cpp:2263
26 	xul.dll 	js::NativeLookupOwnProperty<1>(js::ExclusiveContext*, JS::Handle<js::NativeObject*>, JS::Handle<jsid>, JS::MutableHandle<JS::PropertyResult>) 	js/src/vm/NativeObject.cpp:1053

judging by early crash data from 53.0b1 this signature seems to regress in volume in this cycle (related to bug 1250356?)
Rank: 10
Priority: -- → P1
A search with "proto_signature" "does not contain" "rtc::PlatformThread::Start" finds 2 in the last week; vs 26 with rtc::PlatformThread::Start.  The other 2 are 51 crashes, in the older version of the same function (webrtc::ProcessThreadImpl::Start).

So it's not per-se new with 53 and the webrtc.org 49 update, but may be more common now.
Assignee: nobody → rjesup
Summary: Crash in abort | rtc::FatalMessage::~FatalMessage → Crash in abort | rtc::FatalMessage::~FatalMessage called from rtc::PlatformThread::Start()
Note that this says that ::CreateThread() (Windows API call) is failing....  which supposedly only is on OOM.  We do specify 1M stack, but that's also the default in windows.  This may require 1M of contiguous virtual memory, not sure
At this point I think this is most likely some form of OOM; system call failure
No crashes in 54 or 55, so I think there must be some change not in 53 that avoids/fixes this; a regression-window would help identify an uplift
the last crash report was from 54.0a2 build 20170310004008 (but on aurora it happened rather infrequently).
these were the patches landing in the next 10 days after that: https://hg.mozilla.org/releases/mozilla-aurora/pushloghtml?fromchange=a7744be94bea4581220cfce7c48e22fb96ac2bf3&enddate=2017-03-20
(In reply to [:philipp] from comment #5)
> the last crash report was from 54.0a2 build 20170310004008 (but on aurora it
> happened rather infrequently).
> these were the patches landing in the next 10 days after that:
> https://hg.mozilla.org/releases/mozilla-aurora/pushloghtml?fromchange=a7744be94bea4581220cfce7c48e22fb96ac2bf3&enddate=2017-03-20

Thanks - if one of those fixed this, most likely would be bug 1339246 -- but that has uplifted to aurora and beta/53 on 3/10 (central/55 on 3/8).  And we did have a few crashes in 51
Flags: needinfo?(rjesup)
[Tracking Requested - why for this release]:
Moving back to tracking? for 53 - I've verified this *is* an OOM (typically heavily fragmented).  All examples have 150-700MB free (mostly 150-300), and none have more than 1MB of contiguous address space available, and thread creation requires reserving 1MB of space.

As to *why* memory is so fragmented... it appears these are due to IP-discovery/tracking scripts, creating lots of PeerConnections that are never GC'd and never connected.  Crashes typically have >150 threads active.  That's part of the problem but not all of it -- there aren't enough treads to fragment all of memory that I can see, and many of these have crashed within 100ish seconds of browser-start; quite fast to fragment all memory!

Also, a bunch of these are playing videos, so that may play into the fragmentation aspect.

Moving 54 and 55 back to ?, since the volume hitting the evil sites on aurora/nightly may be nil.

I tried this on Linux, and I don't see bad fragmentation (but I was running 64-bit)
Rank: 10 → 15
Summary: Crash in abort | rtc::FatalMessage::~FatalMessage called from rtc::PlatformThread::Start() → Crash in abort | rtc::FatalMessage::~FatalMessage called from rtc::PlatformThread::Start() -- OOM in ::CreateThread()
We used to get crashes like this in media until we refactored our architecture to instead of having N threads per media element, each media element has N TaskQueues, which map onto a SharedThreadPool. TaskQueues are an event target that guarantee that tasks dispatched to them will run in serial one after the other, and that there will be a memory barrier in between each task running (to ensure thread safety), but the task could run on any thread in the SharedThreadPool. This means you end up with a fixed maximum number of threads, yet you still have an interface that looks like it has N independent threads.

This model also works very nicely with MozPromise, which is a great construct to automate a lot of the boilerplate code needed to easily write thread-safe async code.

You probably want whatever is creating your thread calling rtc::Event::Wait(int) to do the same, but I expect that won't be a quick fix.
(In reply to Chris Pearce (:cpearce) from comment #8)
> We used to get crashes like this in media until we refactored our
> architecture to instead of having N threads per media element, each media
> element has N TaskQueues, which map onto a SharedThreadPool. TaskQueues are
> an event target that guarantee that tasks dispatched to them will run in
> serial one after the other, and that there will be a memory barrier in
> between each task running (to ensure thread safety), but the task could run
> on any thread in the SharedThreadPool. This means you end up with a fixed
> maximum number of threads, yet you still have an interface that looks like
> it has N independent threads.

This would be a major refactor of upstream code we import.  We can discuss this with the webrtc.org people, but doing it in our branch alone would likely be very painful to maintain.  (and it's unclear if it would really solve the problem).  Deferring parts of initialization on unconnected PeerConnections might be more faster, almost as effective and easier to maintain/uplift.

The upstream code runs their threads on top of the platform-independent interface (rtc::PlatformThread), which is the code in question here.  They also have tons of thread-safety checks that verify that the same thread is always used to call certain code (to validate that they don't need locks/semaphores/etc).

> This model also works very nicely with MozPromise, which is a great
> construct to automate a lot of the boilerplate code needed to easily write
> thread-safe async code.

I'm quite certain that the upstream code isn't going to move to MozPromise soon. :-)
No need to track for 53 based on wontfix status.
This crash doesn't look  like OOM: it has a max-contiguous of 13MB; default stacksize is 1MB

https://crash-stats.mozilla.com/report/index/ad8b7e67-1ddb-43a2-96be-8fde90170427

dmajor: any thoughts?
Flags: needinfo?(dmajor)
The memory report collection runs on a timer so the values may be up to 3 minutes old: https://dxr.mozilla.org/mozilla-central/rev/0b77ed3f26c5335503bc16e85b8c067382e7bb1e/xpcom/threads/nsThread.cpp#570-573

"Available Virtual Memory" on the other hand is a live value, and it says 201.91 MB, which is definitely in OOM-danger territory. I bet memory usage started increasing, we took a report, memory usage kept increasing, and eventually OOM.
Flags: needinfo?(dmajor)
Looks like there are some crashes in beta 54. About 20 per week per beta release.
this is now spiking with users on pt-BR builds in the last couple of days: https://crash-stats.mozilla.com/signature/?useragent_locale=pt-BR&signature=abort%20%7C%20rtc%3A%3AFatalMessage%3A%3A~FatalMessage&date=%3E%3D2017-06-01T00%3A00%3A00.000Z#graphs
many user comments refer to a sports live streaming sites, so maybe it's possible from url correlations to see where this is going on and reproduce it...
this signature was rising in the past weeks, mainly for pt-BR & ru users. url correlations apparently show some clear patterns on which sites this may be happening, but i wasn't able to immediately reproduce a crash there. 
maybe it's more obvious for you what those sites may be doing that's tickling the crashes...
Flags: needinfo?(rjesup)
Thoughts?
Flags: needinfo?(rjesup) → needinfo?(dmajor)
Flags: needinfo?(rjesup)
(In reply to Randell Jesup [:jesup] from comment #16)
> Thoughts?

I still think this is an address-space OOM.

It's interesting that all of these crashes are on 32-bit builds, on Win7 (with or without SP1), e10s disabled. Philipp do you have access to such a configuration to try those URLs on? Even if you don't crash, seeing high memory usage would be a pointer in the right direction.
Flags: needinfo?(dmajor) → needinfo?(madperson)
Attached file snapshot mid video
when i tried it on kinogo-2016 [dot] net/4457-12.html the memory use seemed to continually creep up the longer the video was running. the process went from 500MB RAM at the start to over 1000MB when it finished. i couldn't get firefox to crash though. i took 2 about:memory snapshots, one mid-video and one near the end in case it helps...

on vidik.my-cinema [dot] net/tviglemy & www.rtp.pt/play/direto/rtp1 i didn't notice any unusual symptoms (didn't want to test some of the other sites requiring flash)
Flags: needinfo?(madperson)
Attached file snapshot at the end
Mass change P1->P2 to align with new Mozilla triage process
Priority: P1 → P2
Duplicate of this bug: 1401747
Crash Signature: [@ abort | rtc::FatalMessage::~FatalMessage] → [@ abort | rtc::FatalMessage::~FatalMessage] [@ abort | rtc::PlatformThread::Start ]
(In reply to [:philipp] from comment #18)
> Created attachment 8891385 [details]
> snapshot mid video
> 
> when i tried it on kinogo-2016 [dot] net/4457-12.html the memory use seemed
> to continually creep up the longer the video was running. the process went
> from 500MB RAM at the start to over 1000MB when it finished. i couldn't get
> firefox to crash though. i took 2 about:memory snapshots, one mid-video and
> one near the end in case it helps...
> 
> on vidik.my-cinema [dot] net/tviglemy & www.rtp.pt/play/direto/rtp1 i didn't
> notice any unusual symptoms (didn't want to test some of the other sites
> requiring flash)

jya: any thoughts?
Flags: needinfo?(rjesup) → needinfo?(jyavenard)
Flags: needinfo?(jyavenard)
re-NI
Flags: needinfo?(jyavenard)
I have no thought on the matter, nor is this in my area of expertise... 

Note however that I plan to change the way the webrtc stack is making use of threads changing for task queues instead as :cpearce mentioned in comment 8 (on our side of code of course)
Flags: needinfo?(jyavenard)
Duplicate of this bug: 1423230
You need to log in before you can comment on or make changes to this bug.