Closed Bug 1187308 Opened 4 years ago Closed 3 years ago

crash in OOM | large | mozalloc_abort(char const* const) | mozalloc_handle_oom(unsigned int) | moz_xmalloc | webrtc::ViEExternalRendererImpl::RenderFrame(unsigned int, webrtc::I420VideoFrame&)

Categories

(Core :: WebRTC: Audio/Video, defect, P2, critical)

40 Branch
x86
Windows 10
defect

Tracking


RESOLVED INCOMPLETE
Tracking Status
firefox40 + wontfix
firefox41 - wontfix
firefox42 --- affected
Blocking Flags:

People

(Reporter: adalucinet, Assigned: jesup)

Details

(Keywords: crash)

Crash Data

Attachments

(1 file)

[Tracking Requested - why for this release]:

This bug was filed from the Socorro interface and is report bp-59193172-793a-4862-9757-63ca22150724.
=============================================================
Encountered 3 out of 6 times on 40.0b7 (en-US and ar builds) while performing Hello calls on Windows 10 32-bit:
bp-d8a6e8a8-1e08-41ce-9bed-2fc372150724
bp-f9543731-2d92-443d-bd9f-a27932150724

Couldn't reproduce under Windows 7, Ubuntu, or Mac OS X.

More reports:
https://crash-stats.mozilla.com/report/list?product=Firefox&signature=OOM+|+large+|+mozalloc_abort%28char+const*+const%29+|+mozalloc_handle_oom%28unsigned+int%29+|+moz_xmalloc+|+webrtc%3A%3AViEExternalRendererImpl%3A%3ARenderFrame%28unsigned+int%2C+webrtc%3A%3AI420VideoFrame%26%29#tab-reports
Can you watch memory use (for this (the plugin-container) process) while in a Hello call, in a talky.io call, and in https://mozilla.github.io/webrtc-landing/gum_test.html capturing video?  Does it steadily climb in any of them?


The crash report implies there's lots of VM space and RAM left, so it shouldn't OOM. That and the Windows 10 dependency imply there's a VM issue with Firefox on Win10; likely this will need to go to whoever is responsible for Win10 integration and the people handling jemalloc.

CC kairo for his knowledge
backlog: --- → webRTC+
Rank: 15
Flags: needinfo?(kairo)
Flags: needinfo?(alexandra.lucinet)
Priority: -- → P1
Forwarding to dmajor, who knows even more about OOM than I do.
Flags: needinfo?(kairo) → needinfo?(dmajor)
I'm not so sure about the Win10 connection. In the wild, this signature has a very normal breakdown of Windows versions. Perhaps there are other differences in the test machines (I notice that it's a 32-bit OS, which really constrains our address space).

https://crash-stats.mozilla.com/search/?signature=~ViEExternalRendererImpl&_facets=signature&_facets=build_id&_facets=version&_facets=release_channel&_facets=platform_version&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-platform_version

Here are the three crashes from comment 0:
Available Virtual Memory 198262784
Available Virtual Memory 213676032
Available Virtual Memory 196493312

That doesn't account for fragmentation. In practice, at 200M it's quite likely that we'll fail to find 460800 contiguous bytes. This is just a garden-variety large-OOM. I would just make the allocation fallible and move on.
Flags: needinfo?(dmajor)
(And/or figure out what's eating 1.8G of address space and reduce it)
Attached file memory-report.json
(In reply to Randell Jesup [:jesup] from comment #1)
> Can you watch memory use (for this (the plugin-container) process) while in
> a Hello call, in a talky.io call, and in
> https://mozilla.github.io/webrtc-landing/gum_test.html capturing video? 
> Does it steadily climb in any of them?
 
Unable to reproduce the crash with talky.io (~11% CPU and 200 MB memory consumption) and https://mozilla.github.io/webrtc-landing/gum_test.html (~3% CPU and 125 MB memory consumption).

Using Hello, with 40.0b7 the memory increased to 1.2 GB and then browser crashed: bp-568a2162-c22b-489c-8680-223062150727
Attached is the about:memory file.
Flags: needinfo?(alexandra.lucinet)
(In reply to Alexandra Lucinet, QA Mentor [:adalucinet] from comment #5)
> Created attachment 8639262 [details]
> memory-report.json
> 
> (In reply to Randell Jesup [:jesup] from comment #1)
> > Can you watch memory use (for this (the plugin-container) process) while in
> > a Hello call, in a talky.io call, and in
> > https://mozilla.github.io/webrtc-landing/gum_test.html capturing video? 
> > Does it steadily climb in any of them?
>  
> Unable to reproduce the crash with talky.io (~11% CPU and 200 MB memory
> consumption) and https://mozilla.github.io/webrtc-landing/gum_test.html (~3%
> CPU and 125 MB memory consumption).
> 
> Using Hello, with 40.0b7 the memory increased to 1.2 GB and then browser
> crashed: bp-568a2162-c22b-489c-8680-223062150727
> Attached is the about:memory file.

Thank you, Alexandra.  This is super useful information.  I talked to Randell, and he's going to investigate why we see this with Hello and not talky.io.
Assignee: nobody → rjesup
Rank: 15 → 7
Randell, let me know if you need additional memory reports. Crash-stats has a few dozen of them recorded by the "automatically save about:memory when close to OOM" feature.
(In reply to David Major [:dmajor] from comment #7)
> Randell, let me know if you need additional memory reports. Crash-stats has
> a few dozen of them recorded by the "automatically save about:memory when
> close to OOM" feature.

Yes, please!  That can quickly confirm one main possibility (or at least raise a lot of smoke).
Flags: needinfo?(dmajor)
Sent by email.
Flags: needinfo?(dmajor)
(In reply to Alexandra Lucinet, QA Mentor [:adalucinet] from comment #5)
> Created attachment 8639262 [details]
> memory-report.json
> 
> (In reply to Randell Jesup [:jesup] from comment #1)
> > Can you watch memory use (for this (the plugin-container) process) while in
> > a Hello call, in a talky.io call, and in
> > https://mozilla.github.io/webrtc-landing/gum_test.html capturing video? 
> > Does it steadily climb in any of them?
>  
> Unable to reproduce the crash with talky.io (~11% CPU and 200 MB memory
> consumption) and https://mozilla.github.io/webrtc-landing/gum_test.html (~3%
> CPU and 125 MB memory consumption).
> 
> Using Hello, with 40.0b7 the memory increased to 1.2 GB and then browser
> crashed: bp-568a2162-c22b-489c-8680-223062150727
> Attached is the about:memory file.

Alexandra: That only includes the Master (Main) process, not the content process (plugin-container), so it doesn't show much that's useful.  And the resident size is only 473MB, vsize 962MB, well under the numbers you mention.

Some important points:  What are the STR (steps to reproduce)?  Especially, which side of the call is the crashing machine?  Link creator (Hello runs in Master process), or link clicker (Hello runs in Content process)?  Anything else I should know?

Since you can repro a fair bit of the time: please also try to make this happen while capturing logs: NSPR_LOG_MODULES=getusermedia:4,signaling:5,mediamanager:4,timestamp NSPR_LOG_FILE=whatever.  Note: they'll likely be large; if too large to attach as a compressed file, please put them in Dropbox/etc. and send me a link.

Also, how *fast* does memory usage rise?  Rough measurement of MB/minute perhaps?  Does the camera/video ever stop working?  If so, self image, remote image, or remote image on the other machine?

Note that Process Explorer can be very good for this sort of thing (double-click on the process and it will give you graphs/etc for that one process)

dmajor: the memory reports were somewhat surprising, but 3 of the 5 clearly showed a large amount of uncategorized memory, which would match a WebRTC leak.  And almost all field reports show 200MB +- 50MB of Available Virtual Memory, which implies a real OOM with fragmentation.

Thanks
Flags: needinfo?(dmajor)
Flags: needinfo?(alexandra.lucinet)
What's the needinfo?
Flags: needinfo?(dmajor)
Tracked for 40 as this is a critical crash on Win10.
dmajor: Sorry.  More memory reports would probably be useful.

Also, if anyone can repro this it'd be a huge win.  I tried a bunch of ways on Nightly/Win7 (e10s on/off, link-clicker and not), and memory use was totally stable each time.
Flags: needinfo?(dmajor)
(In reply to Randell Jesup [:jesup] from comment #10)
> Alexandra: That only includes the Master (Main) process, not the content
> process (plugin-container), and so it doesn't show much useful.  And the
> resident size is only 473MB, vsize 962MB, well under the numbers you mention.

I am sorry, I did not know that. However, no plugin-container process is displayed for Firefox in Task Manager or Process Explorer.

> Some important points:  What are the STR?  Especially, what side of the call
> is the crashing machine?  Link creator (Hello runs in Master process), or
> link clicker (Hello runs in Content process)?  Anything else I should know?

The link creator side was crashing. When the link clicker joined, the remote image was all black and a spinner was displayed until the crash happened, in ~30 seconds.

> Since you can repro a fair bit of the time: please also try to make this
> happen while capturing logs:
> NSPR_LOG_MODULES=getusermedia:4,signaling:5,mediamanager:4,timestamp
> NSPR_LOG_FILE=whatever.  Note: they'll likely be large; if too large to add
> as a compressed file, please put in dropbox/etc and send me a link.
I couldn't manage to reproduce this crash under the same OS (Windows 10 32-bit) with 40.0b7 and 40.0b9 (en-US and ar).

Let me know if I can help more.
Flags: needinfo?(alexandra.lucinet)
Flags: needinfo?(dmajor)
Too late for 40 but tracking for 41 as 41.0b1 is affected too.
dmajor: could you pull some more about:memory dumps?  I'm still hoping to get some sort of handle on this.

It's not super-frequent: fewer than 100 crashes/week on beta, only a handful on Aurora in the last month, and none on Nightly (though perhaps those don't get logged?).

We're not fixing this for 41.
Flags: needinfo?(dmajor)
ritu: we should remove from 41 tracking
Flags: needinfo?(rkothari)
Untracked.
Flags: needinfo?(rkothari)
Jesup: Given your frequent bug investigations, I think it would be reasonable to request crash access.
Flags: needinfo?(dmajor)
Crash Signature: [@ OOM | large | mozalloc_abort(char const* const) | mozalloc_handle_oom(unsigned int) | moz_xmalloc | webrtc::ViEExternalRendererImpl::RenderFrame(unsigned int, webrtc::I420VideoFrame&)] → [@ OOM | large | mozalloc_abort(char const* const) | mozalloc_handle_oom(unsigned int) | moz_xmalloc | webrtc::ViEExternalRendererImpl::RenderFrame(unsigned int, webrtc::I420VideoFrame&)] [@ OOM | large | mozalloc_abort | mozalloc_handle_oom | moz_xmallo…
Tentatively downgrading, as I see no crashes from 43/44/45 - however, *every* crash (59) in a simple search of crashstats is from a beta build (39, 40, 42) - no release crashes, no Auroras, no nightly.  So it will be interesting to see if 43bN shows crashes here.
Rank: 7 → 22
Priority: P1 → P2
No hits on this since at least June 1, 2016.
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → INCOMPLETE