Closed Bug 1259512 (e10s-oom)
Opened 9 years ago, closed 9 years ago
[e10s] significantly higher rates of OOM crashes in the content process of Firefox with e10s than in the main process of non-e10s
Categories: Core :: DOM: Content Processes, defect
Status: RESOLVED FIXED
Target Milestone: mozilla48
Tracking: e10s +
People: (Reporter: benjamin, Assigned: mccr8)
References: (Depends on 1 open bug)
Whiteboard: [MemShrink:meta] btpp-active
Attachments: (1 file) 2.28 KB, text/plain
Description
Based on the recent e10s experiments on beta, there is a much higher incidence of OOM crashes with e10s enabled than with e10s disabled. It's silly to file a separate bug for each OOM crash signature, so instead I'm filing this single bug.
Our stability problems are primarily in the content process, not in the chrome process.
Analysis of e10s crash rates:
https://github.com/vitillo/e10s_analyses/blob/master/beta46-noapz/e10s-stability-analysis.ipynb
Breakdown of signatures that affect e10s:
https://gist.github.com/bsmedberg/f23e84ae4021a1cc3bcf
There are many things I don't know yet about this problem:
* Platforms affected
* Are we running out of real memory (physical+swap) or VM? Or just fragmenting our address space to death?
* Does this primarily affect certain subgroups, such as people with acceleration on/off or certain graphics drivers?
Whether we're running out of real memory could hopefully be figured out with about:memory.
Running out of VM might be caused by a bug in mapping shared memory sections across the IPC boundary, or other bugs (graphics drivers have caused problems like this in the past).
Related bugs:
bug 1250672 - OSX does not reclaim memory properly
bug 1257486 - add more memory annotations to content process crash reports
bug 1236108 and followup bug 1256541 - make OOMAllocationSize work in content process crash reports
bug 1259358 - nsITimer sometimes doesn't work with e10s (could be causing GC/CC scheduling problems?)
Please DO dup any small-OOM crash signature bugs which are specific to e10s here.
Please DO NOT dup large-OOM crash signatures to this bug.
Updated by Reporter • 9 years ago
Alias: e10s-oom
Updated by Assignee • 9 years ago
Whiteboard: [MemShrink]
Comment 3 • 9 years ago (Assignee)
Thanks for filing this. I noticed yesterday that something might be awry here myself. For instance, in the control group there are 124 crashes in mozilla::CycleCollectedJSRuntime::CycleCollectedJSRuntime, but with e10s there are 192, which seems like an alarming difference.
The only bad e10s-specific leak (rather than just bloat) I'm aware of is bug 1252677, which looks like some kind of Windows-specific SharedMemory/PTextureChild leak. However, Bas said the leaking tests are related to "drawing video to a Canvas", which doesn't sound like something you'd see very often in regular web browsing.
Comment 4 • 9 years ago (Reporter)
Yeah, it's worse than 124/192, because content process crashes have a 10% submission rate while chrome process crashes have a 50% submission rate.
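(As a rough back-of-the-envelope illustration, assuming those submission rates apply uniformly across signatures: 124 submitted chrome-process crashes at a 50% submission rate suggests roughly 124 / 0.5 ≈ 250 actual crashes, while 192 submitted content-process crashes at a 10% submission rate suggests roughly 192 / 0.1 ≈ 1900 actual crashes, so the underlying gap may be closer to 8x than the 1.5x the raw counts imply.)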
Comment 7 • 9 years ago
(In reply to Benjamin Smedberg [:bsmedberg] from comment #4)
> Yeah, it's worse than 124/192, because content process crashes have a 10%
> submission rate while chrome process crashes have a 50% submission rate.
Without knowing how many crashes go unreported at shutdown, it is hard to gauge whether there is such a big change. To me the reported signatures seem about the same (but that's just from ad hoc browsing; startup crashes get reported more often).
js::AutoEnterOOMUnsafeRegion::crash seems to be the big increase. (The comparison is messy due to a signature change.)
mozilla::CycleCollectedJSRuntime::CycleCollectedJSRuntime is only at 46 (with any luck the fix is bug 1247122).
45 vs. 46 e10s comparison:
https://crash-analysis.mozilla.com/rkaiser/datil/searchcompare/?common=process_type%3Dbrowser%26process_type%3Dcontent%26ActiveExperimentBranch%3D%253Dexperiment-no-addons&p1=ActiveExperiment%3D%253De10s-beta45-withoutaddons%2540experiments.mozilla.org%26date%3D%3E%253D2016-02-12%26date%3D%3C2016-02-25&p2=ActiveExperiment%3D%253De10s-beta46-noapz%2540experiments.mozilla.org%26date%3D%3E%253D2016-03-11%26date%3D%3C2016-03-25
Updated • 9 years ago
tracking-e10s: --- → +
Priority: -- → P1
Comment 8 • 9 years ago
Could it be that IPC makes heap fragmentation more serious, so that things like 1) what :billm is doing in bug 1235633 comment 12 to avoid a copy, or 2) reusing message buffers would help?
(In reply to Benjamin Smedberg [:bsmedberg] from comment #0)
> Based on the recent e10s experiments on beta, there is a much higher
> incidence of OOM crashes with e10s enabled than with e10s disabled.
Is this true? With e10s disabled, "OOM | small" accounts for about 7% of all crashes in beta 46. With e10s enabled, it's 0.55%. Granted, there are some specific OOM signatures in e10s that are significant. But it still seems like e10s has fewer OOMs overall. Am I misreading the data? I know that the way the crash annotations work is really complex.
Comment 10 • 9 years ago (Assignee)
(In reply to Bill McCloskey (:billm) from comment #9)
> Am I misreading the data?
Bug 1256541 only landed on beta yesterday, so we do not have crash signatures with proper OOM annotations for content processes from beta yet. To get an approximation of what OOM | small would be you have to add up the various OOM | unknown signatures, like those in the two duped bugs.
Comment 11 • 9 years ago (Assignee)
(In reply to Ting-Yu Chou [:ting] from comment #8)
> Could it be that IPC makes heap fragmentation more serious,
Yes, that is the best guess I have so far. However, telemetry suggests that VSIZE_MAX_CONTIGUOUS is maybe a little better with e10s enabled, which contradicts that theory[1]. VSIZE_MAX_CONTIGUOUS is the largest contiguous amount of address space in the process, and tends to get really low when there is severe heap fragmentation. It does seem like it is always fairly low for some people, around 800kb in the chart, so maybe many users are just in a precarious state that results in a crash when IPC ends up allocating a large block of memory.
[1] https://github.com/vitillo/e10s_analyses/blob/master/beta45-withoutaddons/e10s_experiment.ipynb
I think we can analyze crash minidumps to figure out what exactly the address space looks like. There was some previous work along these lines for desktop Firefox in bug 1001760, and hopefully we can reuse those analyses.
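For anyone unfamiliar with what VSIZE_MAX_CONTIGUOUS measures, here is a minimal sketch of the kind of probe it corresponds to on Windows: walk the address space with VirtualQuery and take the largest free region. This is only an illustration, not the actual telemetry reporter.

// Sketch: report the largest contiguous free block of address space.
// Illustrative only; not the real VSIZE_MAX_CONTIGUOUS implementation.
#include <windows.h>
#include <cstdio>

int main() {
  SYSTEM_INFO si;
  GetSystemInfo(&si);
  SIZE_T largestFree = 0;
  char* addr = static_cast<char*>(si.lpMinimumApplicationAddress);
  char* end = static_cast<char*>(si.lpMaximumApplicationAddress);
  while (addr < end) {
    MEMORY_BASIC_INFORMATION mbi;
    if (VirtualQuery(addr, &mbi, sizeof(mbi)) != sizeof(mbi)) {
      break;  // query failed; stop walking
    }
    if (mbi.State == MEM_FREE && mbi.RegionSize > largestFree) {
      largestFree = mbi.RegionSize;
    }
    addr = static_cast<char*>(mbi.BaseAddress) + mbi.RegionSize;
  }
  printf("largest contiguous free block: %zu bytes\n", largestFree);
  return 0;
}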
> so that things like 1) what :billm is doing in bug 1235633 comment 12 to avoid a
> copy, or 2) reusing message buffers would help?
Yes, I think those are worth trying. Also, the Pickle::Resize() method currently uses a doubling strategy, which could be bad if the size is nearing that of our largest contiguous block. There's a bug on file for it but I don't remember which.
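To make the concern concrete, here is a hypothetical capped-growth policy of the kind being contrasted with doubling. The function name and thresholds are made up for illustration; this is not the actual Pickle code.

// Hypothetical sketch: grow geometrically only up to a threshold, then grow in
// fixed increments. This keeps the largest single allocation request closer to
// what is actually needed when contiguous address space is scarce.
#include <cstddef>
#include <algorithm>

size_t NextCapacity(size_t current, size_t needed) {
  const size_t kDoublingLimit = 4 * 1024 * 1024;  // made-up threshold
  const size_t kLinearStep    = 2 * 1024 * 1024;  // made-up increment
  size_t proposed = current < kDoublingLimit ? current * 2
                                             : current + kLinearStep;
  return std::max(proposed, needed);
}

Whether capped growth actually helps depends on how close the Pickle buffer sizes get to the size of the largest free block.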
Comment 12 • 9 years ago
> Also, the Pickle::Resize() method
> currently uses a doubling strategy, which could be bad if the size is
> nearing that of our largest contiguous block. There's a bug on file for it
> but I don't remember which.
It's bug 1253131. I said I would take a look but I haven't got around to it yet. I'll probably get to it, maybe early next week, though I'd be happy if someone else took a look in the meantime.
Comment 13 • 9 years ago
I was wondering whether there is any information jemalloc can show us, but then realized that features like profiling [1] and measuring external fragmentation (stats.active) [2] do not exist in mozjemalloc.
The best I can get is:
___ Begin malloc statistics ___
Assertions disabled
Boolean MALLOC_OPTIONS: aCjPz
Max arenas: 1
Pointer size: 8
Quantum size: 16
Max small size: 512
Max dirty pages per arena: 256
Chunk size: 1048576 (2^20)
Allocated: 47656384, mapped: 110100480
huge: nmalloc ndalloc allocated
27 26 2097152
arenas[0]:
dirty: 136 pages dirty, 204 sweeps, 4659 madvises, 38082 pages purged
allocated nmalloc ndalloc
small: 31968704 1411295 1132137
large: 13590528 33356 31853
total: 45559232 1444651 1163990
mapped: 106954752
bins: bin size regs pgs requests newruns reruns maxruns curruns
0 T 8 500 1 76882 43 422 21 21
1 Q 16 252 1 222439 543 4460 276 143
2 Q 32 126 1 320891 1608 10905 985 766
3 Q 48 84 1 145888 1020 7771 652 568
Some notes:
- it seems no one is actively working on bug 762449
- [3] mentions "virtual memory fragmentation", which sounds like the VSIZE_MAX_CONTIGUOUS :mccr8 mentioned in comment 11
[1] https://github.com/jemalloc/jemalloc/wiki/Use-Case:-Heap-Profiling
[2] http://blog.gmane.org/gmane.comp.lib.jemalloc/month=20140501
[3] http://www.canonware.com/pipermail/jemalloc-discuss/2013-April/000572.html
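As a very rough read of the stats above: allocated / mapped = 47656384 / 110100480 ≈ 0.43, so only about 43% of the memory jemalloc has mapped is currently handed out to live allocations. The remainder is not all fragmentation (it includes dirty pages that have not yet been purged and chunk metadata), but it gives a ballpark upper bound on how much slack the allocator is holding.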
Comment 14 • 9 years ago
(In reply to Andrew McCreight [:mccr8] from comment #11)
> I think we can analyze crash minidumps to figure out what exactly the
> address space looks like. There was some previous work along these lines for
> desktop Firefox in bug 1001760, and hopefully we can reuse those analyses.
I tried the tool minidump-memorylist with the latest google breakpad, and it crashed. Somehow GetMemoryInfoList() [1] returns null. :(
[1] https://github.com/bsmedberg/minidump-memorylist/blob/master/minidump-memorylist.cc#L66
Comment 15 • 9 years ago
> profiling [1]
You can use DMD's "live" mode to do generic heap profiling. See the docs at https://developer.mozilla.org/en-US/docs/Mozilla/Performance/DMD
Depends on: 1260908
Comment 16 • 9 years ago
(In reply to Ting-Yu Chou [:ting] from comment #14)
> (In reply to Andrew McCreight [:mccr8] from comment #11)
> > I think we can analyze crash minidumps to figure out what exactly the
> > address space looks like. There was some previous work along these lines for
> > desktop Firefox in bug 1001760, and hopefully we can reuse those analyses.
>
> I tried the tool minidump-memorylist with the latest google breakpad, and it
> crashed. Somehow GetMemoryInfoList() [1] returns null. :(
>
> [1] https://github.com/bsmedberg/minidump-memorylist/blob/master/minidump-memorylist.cc#L66
This only works on dumps produced on Windows, FYI. We could make it work for Linux as well (Linux minidumps include /proc/self/maps). Mac minidumps do not include memory mapping info, IIRC.
Updated by Reporter • 9 years ago
Assignee: nobody → continuation
Updated by Assignee • 9 years ago
Whiteboard: [MemShrink] → [MemShrink:meta]
Updated • 9 years ago
Whiteboard: [MemShrink:meta] → [MemShrink:meta] btpp-active
Comment 17 • 9 years ago
:mccr8, is there anything I can help with? I am not sure which bugs you are not working on and which have higher priority.
Flags: needinfo?(continuation)
Comment 18 • 9 years ago (Assignee)
(In reply to Ting-Yu Chou [:ting] from comment #17)
> :mccr8, is there anything I can help with? I am not sure which bugs you are not
> working on and which have higher priority.
I'm only working on bug 1253131 right now (bug 1263235 is just waiting for a review). I'm not really sure what the priority should be for the various bugs. We still don't have a great idea of what is causing OOMs, aside from large messages likely causing problems with contiguous address space. Bug 1262671 would be good to have, but it may take a little while, so something shorter term might be better right now. I'm not sure.
Flags: needinfo?(continuation)
Comment 19 • 9 years ago
(In reply to Andrew McCreight [:mccr8] from comment #18)
> large messages likely causing problems with contiguous address space. Bug
> 1262671 would be good to have, but it may take a little while, so something
> shorter term might be better right now. I'm not sure.
Then I'll see if I can produce the kind of data that :dmajor got in bug 1001760.
Comment 20 • 9 years ago
minidump-memorylist crashed because GetMemoryInfoList() returns null. The message:
2016-04-13 16:58:58: minidump.cc:4765: INFO: GetStream: type 16 not present
means it can not find the list of information about mapped memory regions for a process from the dump file.
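For reference, a stripped-down sketch of the failing path, using only the breakpad Minidump calls that minidump-memorylist itself relies on (error handling and output here are illustrative):

// Open the dump, read it, and ask for the MemoryInfoList stream (stream type
// 16). On a dump written without memory info that stream is missing, so
// GetMemoryInfoList() returns null, which is the failure described above.
#include <cstdio>
#include "google_breakpad/processor/minidump.h"

int main(int argc, char** argv) {
  if (argc < 2) {
    fprintf(stderr, "usage: %s <minidump>\n", argv[0]);
    return 1;
  }
  google_breakpad::Minidump dump(argv[1]);
  if (!dump.Read()) {
    fprintf(stderr, "failed to read %s\n", argv[1]);
    return 1;
  }
  google_breakpad::MinidumpMemoryInfoList* list = dump.GetMemoryInfoList();
  if (!list) {
    // The "GetStream: type 16 not present" case.
    fprintf(stderr, "no memory info list in this dump\n");
    return 1;
  }
  printf("memory info list present\n");
  return 0;
}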
Comment 21 • 9 years ago
(In reply to Ting-Yu Chou [:ting] from comment #20)
> minidump-memorylist crashed because GetMemoryInfoList() returns null. The
> message:
>
> 2016-04-13 16:58:58: minidump.cc:4765: INFO: GetStream: type 16 not present
>
> means it can not find the list of information about mapped memory regions
> for a process from the dump file.
Was this dump from a Windows system? If not, see comment 16.
Comment 22 • 9 years ago
Oh, crud. Apparently we're not writing memory info to minidumps from child processes:
https://dxr.mozilla.org/mozilla-central/rev/21bf1af375c1fa8565ae3bb2e89bd1a0809363d4/toolkit/crashreporter/nsExceptionHandler.cpp#3485
vs. the in-process case:
https://dxr.mozilla.org/mozilla-central/rev/21bf1af375c1fa8565ae3bb2e89bd1a0809363d4/toolkit/crashreporter/nsExceptionHandler.cpp#1545
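For context, the difference comes down to which MINIDUMP_TYPE flags the dump is written with. A minimal sketch of the distinction at the raw dbghelp level (this only illustrates the flags; it is not the actual nsExceptionHandler change):

// MiniDumpNormal omits the MemoryInfoList stream; adding
// MiniDumpWithFullMemoryInfo includes the per-region VirtualQuery data that
// minidump-memorylist needs.
#include <windows.h>
#include <dbghelp.h>

const MINIDUMP_TYPE kWithoutMemoryInfo = MiniDumpNormal;
const MINIDUMP_TYPE kWithMemoryInfo =
    static_cast<MINIDUMP_TYPE>(MiniDumpNormal | MiniDumpWithFullMemoryInfo);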
Comment 23 • 9 years ago
Great, I was looking around for what MiniDumpNormal dumps. Would you fix it?
Comment 24 • 9 years ago
I'll fix this, it should be a straightforward patch.
Comment 25 • 9 years ago
Is it bug 1263774?
Comment 26 • 9 years ago
No, that's for "memory reports", which are the content of about:memory. The "memory info stream" is bug 1264242. (These are confusingly similar, aren't they?)
Comment 27 • 9 years ago
Created attachment 8749567 [details]
oomdata.txt
I analysed the minidumps from 47.0b1 that have the crash signature [@ OOM | small] in the content process. See bug 1001760 comment 4 for the meaning of each field.
Basically it suggests bug 1005844 would be helpful.
Comment 28 • 9 years ago (Assignee)
jimm noticed that a lot of the OOM small crashes are happening in IPC code: http://tinyurl.com/hxkcs79
(If you add "proto signature" as a facet to your super search then it shows the actual signatures.)
It might be that this is because memory usage is higher while we are in the middle of dealing with IPC, due to all of the allocations needed to serialize and deserialize. If we could reduce that memory spike, we might be able to reduce the amount of OOM crashes.
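For example, serializing a large payload means the source data and the growing Pickle buffer are both live in the sending process, and on the receiving side the incoming message buffer and the deserialized objects overlap similarly, so the transient footprint around a send or receive can be a small multiple of the payload size.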
Comment 29 • 9 years ago
(In reply to Ting-Yu Chou [:ting] from comment #27)
> Created attachment 8749567 [details]
> oomdata.txt
I was wondering why there are so many tiny (<1M) blocks. Using minidump-memorylist, I saw that many regions have memory protection PAGE_WRITECOMBINE and type MD_MEMORY_TYPE_MAPPED; I'm not sure what they are for:
BaseAddress AllocationBase AllocationProtect RegionSize State Protect Type
7fd60000 7fd60000 404 8000 1000 404 40000
7fd68000 0 0 8000 10000 1 0
7fd70000 7fd70000 404 8000 1000 404 40000
7fd78000 0 0 8000 10000 1 0
7fd80000 7fd80000 404 1c000 1000 404 40000
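Decoding those hex fields with the standard Windows constants: Protect 404 is PAGE_READWRITE (0x4) | PAGE_WRITECOMBINE (0x400), State 1000 is MEM_COMMIT and 10000 is MEM_FREE, and Type 40000 is MEM_MAPPED, so these look like 32 KB (0x8000) committed write-combined section mappings alternating with 32 KB free holes. Write-combined mappings are typically created for things like GPU upload buffers, which would fit the graphics-driver guess below.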
Comment 30 • 9 years ago
Will we switch to 64-bit Firefox whenever a user runs a 64-bit OS, or can we make that happen?
Comment 31 • 9 years ago
(In reply to Ting-Yu Chou [:ting] from comment #29)
> I was wondering why there are so many tiny (<1M) blocks. Using
> minidump-memorylist, I saw that many regions have memory protection
> PAGE_WRITECOMBINE and type MD_MEMORY_TYPE_MAPPED; I'm not sure what they are for:
I placed breakpoints on VirtualAlloc* to see if they are hit with PAGE_WRITECOMBINE set, but no luck. I don't know; the regions could be allocated by drivers or something else. Another thing I noticed that could cause fragmentation is the random address allocation in js::jit::ExecutableAllocator::systemAlloc() [1], which is there for security; see bug 677272.
[1] https://dxr.mozilla.org/mozilla-central/rev/c4449eab07d39e20ea315603f1b1863eeed7dcfe/js/src/jit/ExecutableAllocatorWin.cpp#226
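(As a rough illustration of why that hurts: VirtualAlloc reservations land on 64 KB allocation-granularity boundaries, so scattering smallish executable chunks at random addresses across the 2 GB, or 4 GB large-address-aware, 32-bit address space tends to carve the remaining free space into many separate runs, which directly lowers the largest contiguous free block discussed in comment 11.)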
Comment 32 • 9 years ago (Assignee)
Removing some fairly generic OOM signatures from blocking this bug.
Updated • 9 years ago
Priority: P1 → --
Comment 33 • 9 years ago (Assignee)
I think we can call this fixed, though of course there are still remaining memory improvements that could be made.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla48