Closed Bug 939036 Opened 6 years ago Closed 3 years ago
ASan browser-chrome very close to OOM all the time
Currently, the ASan browser-chrome test avoids OOM only by virtue of the forced full debug mode GC, which is to be removed by bug 933882 for performance reasons. One of the reasons I can't land that patch is that if I remove that GC, ASan BC OOMs.

Empirically, the OOM line for ASan BC on our slaves is about 3000000000 bytes (~2.8 GiB). See the links at the end of this comment and all the other retriggers in the same push. Grep for "ABOUT:MEMORY | resident:".

I pushed an instrumented build to see how close we come to it right now on the slaves, with my patches backed out. Peak RSS is 2811572224 bytes. With current mid-memory options, this gives us very little breathing room. Any new test can push us over the OOM line.

I also pushed an instrumented build forcing the low-memory option. Peak RSS is 2647334912 bytes. This gives us a bit more breathing room, but not that much.

Forcing a GC for every debugger test makes us pass, but *barely*, with the low-memory option.

What should we do about ASan? Forcing low-memory ASAN_OPTIONS is a start, but it won't tide us over for long.

https://tbpl.mozilla.org/php/getParsedLog.php?id=30578096&tree=Try&full=1
https://tbpl.mozilla.org/php/getParsedLog.php?id=30587477&tree=Try&full=1
https://tbpl.mozilla.org/php/getParsedLog.php?id=30587389&tree=Try&full=1
https://tbpl.mozilla.org/?tree=Try&rev=15f51106f108
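To make "very little breathing room" concrete, here is a minimal sketch that computes the headroom between each reported peak RSS and the empirical OOM line. The byte counts are the ones quoted in the comment above; the function name and the labels are only illustrative.

```python
# Figures quoted in the comment above. The OOM line is an empirical estimate.
OOM_LINE_BYTES = 3_000_000_000
PEAK_RSS_BYTES = {
    "mid-memory options, patches backed out": 2_811_572_224,
    "low-memory option forced": 2_647_334_912,
}

MIB = 1024 * 1024


def headroom(peak_bytes, limit_bytes=OOM_LINE_BYTES):
    """Return (bytes left before the limit, fraction of the limit left)."""
    left = limit_bytes - peak_bytes
    return left, left / limit_bytes


for label, peak in PEAK_RSS_BYTES.items():
    left, fraction = headroom(peak)
    print(f"{label}: {left / MIB:.0f} MiB left ({fraction:.1%} of the OOM line)")
```

With these inputs the margin is roughly 180 MiB for the mid-memory run and roughly 336 MiB for the low-memory run.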
Summary: ASan very close to OOM all the time → ASan browser-chrome very close to OOM all the time
Here's a try build (still pending at the time of this writing) that sets redzone=16, which according to the address-sanitizer Flags wiki page (second link below) is the default. Maybe it'll decrease memory use some more.

https://tbpl.mozilla.org/?tree=Try&rev=d0640f704735
http://code.google.com/p/address-sanitizer/wiki/Flags
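For readers unfamiliar with how such a setting reaches the runtime: ASan reads its flags from the ASAN_OPTIONS environment variable as key=value pairs separated by colons. The sketch below builds that string for a child process environment. Only redzone=16 comes from this bug; the helper name is hypothetical and this is not how the test harness actually sets it.

```python
import os


def format_asan_options(options):
    """Join key/value pairs into the colon-separated form used by ASAN_OPTIONS."""
    return ":".join(f"{key}={value}" for key, value in options.items())


# Only redzone=16 is taken from the comment above.
options = {"redzone": 16}

# Copy the current environment and add the setting for a child process.
env = dict(os.environ)
env["ASAN_OPTIONS"] = format_asan_options(options)
print(env["ASAN_OPTIONS"])
```

The resulting env mapping could then be passed to whatever launches the browser under test, for example through the env argument of subprocess.run.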
Due to bug 934641, we'll probably make the low-memory configuration used on the current try slaves by default. We should also see what happens if we entirely remove the redzone option, and what happens with your redzone=16 approach. When we started doing ASan tests, it all worked well with the low-memory configuration on 2 GB test slaves. So something must have changed that it *barely* works now on 3-4 GB? Are we possibly leaking memory we shouldn't, during mochitests? We recently had such a bug already that led to tree closure on normal builds even.
Oops, I messed up the redzone=16 patch, repushing now.

(In reply to Christian Holler (:decoder) from comment #2)
> Due to bug 934641, we'll probably make the low-memory configuration used on
> the current try slaves by default. We should also see what happens if we
> entirely remove the redzone option, and what happens with your redzone=16
> approach.
>
> When we started doing ASan tests, it all worked well with the low-memory
> configuration on 2 GB test slaves. So something must have changed that it
> *barely* works now on 3-4 GB? Are we possibly leaking memory we shouldn't,
> during mochitests? We recently had such a bug already that led to tree
> closure on normal builds even.

I measured the RSS of normal browser-chrome on Linux64 debug. We peak at roughly 1.1 to 1.2 GB.

When was ASan turned on? For instance, the shadereditor devtools tests, which have pretty high RSS use, landed back in... September, I think.

Part of the problem is that we don't have any policing of memory usage in tests. People add more and more tests, and BC just keeps using more and more memory.
redzone=16: https://tbpl.mozilla.org/?tree=Try&rev=a6b983fcbd63
redzone option removed: https://tbpl.mozilla.org/?tree=Try&rev=a298fb8bb5b2
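low-memory redzone=16 try has peak RSS of 2516951040 bytes.
low-memory default redzone try has peak RSS of 2548412416 bytes.

The two runs are close to each other, which seems to confirm that the default redzone is 16, and it buys us roughly another 100 MB compared with the earlier low-memory run (2647334912 bytes).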
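As a quick check of the arithmetic, the sketch below compares the three low-memory peak RSS figures reported in this bug. The labels and the helper name are illustrative, and treating the earlier 2647334912-byte run as the baseline is an assumption about what "another 100 MB" refers to.

```python
# Peak RSS figures reported in this bug, in bytes.
EARLIER_LOW_MEMORY = 2_647_334_912   # earlier low-memory run (assumed baseline)
DEFAULT_REDZONE = 2_548_412_416      # redzone option removed
REDZONE_16 = 2_516_951_040           # redzone=16 set explicitly

MB = 1_000_000


def saving_mb(before_bytes, after_bytes):
    """Return how many megabytes of peak RSS were saved going from before to after."""
    return (before_bytes - after_bytes) / MB


print(f"default redzone vs earlier run: {saving_mb(EARLIER_LOW_MEMORY, DEFAULT_REDZONE):.1f} MB")
print(f"redzone=16 vs earlier run:      {saving_mb(EARLIER_LOW_MEMORY, REDZONE_16):.1f} MB")
print(f"redzone=16 vs default redzone:  {saving_mb(DEFAULT_REDZONE, REDZONE_16):.1f} MB")
```

The default-redzone and redzone=16 runs differ by about 31 MB, while both sit roughly 100 to 130 MB below the earlier run. These are single runs, so some of the gap may be run-to-run noise.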
These numbers say to me that the viable long-term fixes are either to chunk BC (we should probably do this with or without ASan) or to get slaves with more memory. Otherwise, it'll only be a matter of time, and probably not even that long, before BC OOMs again.

An independent question is "what is a reasonable peak RSS for BC to use?" To be safe, the ASan slaves should have 3.5x that memory.
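A small sketch of what the 3.5x rule would imply. The 3.5 factor is from the comment above; the 1.1 to 1.2 GB baseline is the non-ASan Linux64 debug peak measured earlier in this bug. Using that measurement as the "reasonable peak RSS" is an assumption made only for illustration.

```python
ASAN_FACTOR = 3.5        # safety factor proposed in the comment above
GB = 1_000_000_000


def required_slave_memory(baseline_peak_bytes, factor=ASAN_FACTOR):
    """Return the memory an ASan slave should have for a given non-ASan peak RSS."""
    return baseline_peak_bytes * factor


# Non-ASan browser-chrome peak RSS range measured earlier in this bug.
for baseline_bytes in (1_100_000_000, 1_200_000_000):
    needed = required_slave_memory(baseline_bytes)
    print(f"baseline {baseline_bytes / GB:.1f} GB -> ASan slave needs about {needed / GB:.2f} GB")
```

Under that assumption the rule asks for roughly 3.85 to 4.2 GB per slave, which is above the 3-4 GB mentioned earlier in this bug.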
Shu generated some great graphs showing the memory use over time of M-bc.

With ASan: http://rfrn.org/~shu/mvv/viewer.html
Without ASan: http://rfrn.org/~shu/mvv/win7bc1.html

This looks like the test is behaving fundamentally differently with ASan. The graph keeps climbing, which suggests that ASan does something funky, like ignoring requests to decommit memory, or fragmenting, or keeping around its own guard pages permanently or something.

Assuming this is true, that seems to me to mean that this test is using ASan for something it is not intended for. We could paper over the problem for a while by adding more memory, but really the memory usage with ASan is unrelated to the memory usage without it, so we should do anything necessary to avoid failing tests due purely to excessive memory usage under ASan. That could be any of:

(1) don't run M-bc under ASan,
(2) restart the process every so often when running under ASan,
(3) check the memory after every test and just end early if it gets too high, or
(4) find ASan tuning knobs that perhaps lose some precision in exchange for tolerable memory usage.

Shu already said the same thing -- chunk BC or get slaves with more memory.
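Option (3) could look something like the sketch below: run the tests in order, read the process RSS after each one, and stop once it reaches a threshold. Everything here is hypothetical (the function name, the test names, the 2800000000-byte threshold, and the fake RSS readings); it is not the mochitest harness API. The RSS reader is injected as a callable so the logic does not depend on any particular platform.

```python
def run_until_memory_limit(tests, read_rss_bytes, limit_bytes):
    """Run (name, callable) tests in order and stop early once RSS reaches the limit.

    Returns (completed_names, skipped_names).
    """
    tests = list(tests)
    for done, (name, run) in enumerate(tests, start=1):
        run()
        # Check memory after every test, as option (3) suggests.
        if read_rss_bytes() >= limit_bytes:
            completed = [n for n, _ in tests[:done]]
            skipped = [n for n, _ in tests[done:]]
            return completed, skipped
    return [n for n, _ in tests], []


# Hypothetical demo: five no-op tests and made-up RSS readings in bytes.
fake_rss = iter([2_400_000_000, 2_650_000_000, 2_900_000_000])
demo_tests = [(f"test_{i}", lambda: None) for i in range(5)]
completed, skipped = run_until_memory_limit(
    demo_tests, lambda: next(fake_rss), limit_bytes=2_800_000_000
)
print("completed:", completed)
print("skipped:", skipped)
```

The skipped list is also a natural hand-off point for option (2): a fresh process could pick up the remaining tests instead of ending the run.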
I think (1) is not acceptable. I think we should go for chunking BC (and I think there's a bug for that), and until we have that, we'll do (4), which I'm already testing due to bug 934641.

My testing did not include shu's patch that disables full GCs when switching debug mode, though. So I retriggered M-bc a few times on his try push without the redzone parameter, and we'll see if the intermittents are gone with that.
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2979] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2984]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2984] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2989]
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → INCOMPLETE
Component: General Automation → General