Closed Bug 1019634 Opened 6 years ago Closed 6 years ago
Enabling DMD on flame puts phone in reboot loop
I tried enabling DMD on Flame (doing a clean gecko build with export MOZ_DMD=1 in my .userconfig) on recent trunk code, and when I flashed it onto the device it just continuously rebooted. The reboot happens so early in startup that I couldn't get a debugger attached. Logcat snippet attached.
I've reproduced this with the following backtrace:

#0  0xb6e82ace in jemalloc_crash () at ../../../gecko/memory/mozjemalloc/jemalloc.c:1574
#1  0xb6e82eda in arena_bin_malloc_easy (bin=<optimized out>, run=<optimized out>, arena=Unhandled dwarf expression opcode 0xfa) at ../../../gecko/memory/mozjemalloc/jemalloc.c:3870
#2  0xb6e8497a in arena_bin_malloc_hard (bin=<optimized out>, arena=<optimized out>) at ../../../gecko/memory/mozjemalloc/jemalloc.c:3891
#3  arena_malloc_small (zero=false, size=112, arena=0xb6ba3040) at ../../../gecko/memory/mozjemalloc/jemalloc.c:4076
#4  arena_malloc (arena=0xb6ba3040, size=<optimized out>, zero=<optimized out>) at ../../../gecko/memory/mozjemalloc/jemalloc.c:4150
#5  0xb6e84efe in imalloc (size=100) at ../../../gecko/memory/mozjemalloc/jemalloc.c:4162
#6  imalloc (size=<optimized out>) at ../../../gecko/memory/mozjemalloc/jemalloc.c:6192
#7  je_malloc (size=<optimized out>) at ../../../gecko/memory/mozjemalloc/jemalloc.c:6216
#8  0xb6f2618c in mozilla::dmd::InfallibleAllocPolicy::malloc_ (aSize=<optimized out>) at ../../../../gecko/memory/replace/dmd/DMD.cpp:95
#9  0xb6f27144 in new_<mozilla::dmd::StackTrace, mozilla::dmd::StackTrace> (p1=...) at ../../../../gecko/memory/replace/dmd/DMD.cpp:148
#10 mozilla::dmd::StackTrace::Get (aT=<optimized out>) at ../../../../gecko/memory/replace/dmd/DMD.cpp:903
#11 0xb6f27c56 in AllocCallback (aT=0xb6a02090, aReqSize=96, aPtr=0xb6ae2820) at ../../../../gecko/memory/replace/dmd/DMD.cpp:1178
#12 mozilla::dmd::AllocCallback (aPtr=0xb6ae2820, aReqSize=96, aT=0xb6a02090) at ../../../../gecko/memory/replace/dmd/DMD.cpp:1155
#13 0xb6f2901a in replace_realloc (aOldPtr=0xb6ac75f0, aSize=96) at ../../../../gecko/memory/replace/dmd/DMD.cpp:1299
#14 0xb6cde70c in android::Parcel::continueWrite (this=0xbedf38d0, desired=96) at frameworks/native/libs/binder/Parcel.cpp:1529
#15 0xb6cde7e6 in android::Parcel::writeInplace (this=0xbedf38d0, len=54) at frameworks/native/libs/binder/Parcel.cpp:610
#16 0xb6cdf08c in android::Parcel::writeString16 (this=0xbedf38d0, str=0xb6ad8110 u"android.os.IServiceManager", len=52) at frameworks/native/libs/binder/Parcel.cpp:685
#17 0xb6cdc328 in android::BpServiceManager::checkService (this=0xb6a04fa0, name=...) at frameworks/native/libs/binder/IServiceManager.cpp:148
#18 0xb6cdc7fc in android::BpServiceManager::getService (this=0xb6a04fa0, name=...) at frameworks/native/libs/binder/IServiceManager.cpp:137
#19 0xb4394b96 in ?? ()
#20 0xb4394b96 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
blocking-b2g: --- → 2.0?
It would appear a run header is being overwritten. The asserting line is jemalloc.c:3870:

3870  RELEASE_ASSERT(run->magic == ARENA_RUN_MAGIC);
I've attempted a FORTIFY build (it caught some unrelated compile-time issues; I'll file bugs for those) and a stack-protector build, neither of which caught the issue.

After testing on a debug build, I've narrowed this down to always happening when we're taking a stack trace for an allocation in DMD. It's not the first measurement; it seems to happen after the 5th or so. Basically this part:

#9  0xb6f27144 in new_<mozilla::dmd::StackTrace, mozilla::dmd::StackTrace> (p1=...) at ../../../../gecko/memory/replace/dmd/DMD.cpp:148
#10 mozilla::dmd::StackTrace::Get (aT=<optimized out>) at ../../../../gecko/memory/replace/dmd/DMD.cpp:903

So it's possible that when we grab a stack trace there's some sort of memory corruption, and then the next allocation blows up. The other possibility is that there's memory corruption earlier on that somehow consistently lands in the run just before the sizeof(dmd::StackTrace) run (I think it ends up being 112 bytes).

For background: jemalloc carves up large chunks of memory into page-sized runs, and each run holds items all of the same size.

Next steps:
- Disable all stack walking and see if it still reproduces.
- Figure out what size bin the previous run serves, and see if it's consistently the same each time we try to reproduce. If so, set a breakpoint for allocations of that size, check the stack for allocations towards the end of the run, and trace down the memory.
- If we can get jemalloc to play nice with valgrind, that might help us trace things down (bug 977067, comment 29).

I'm open to other suggestions, of course!
I disabled the NS_StackWalking call in DMD and am still seeing crashes, so that's probably not it. I disabled the AllocCallback and no longer saw crashes, so it's not inherently DMD injecting itself; somewhere in the stack-tracking portion we're doing something wrong. Of further interest would be the tables we're using to store blocks, and really any allocs or deallocs within DMD. There's also an area where we "GC" the stack traces, so it's possible that's doing something bad.
Disabling stack trace GC had no effect. Allowing stack trace measurement, allocation, and insertion into the stack trace table, but disabling insertion into the block table, does not crash. So it looks like something bad is happening with the block table.
blocking-b2g: 2.0? → 2.0+
It looks like the crash goes away if I disable shrinking in JS::HashTable (although b2g ends up hosed in some other way).
I've tracked down the real issue: qcom's hwcomposer is stomping memory in some debug-config code.
Bug 1034146 provides further details and a patch that fixes the issue. |git apply| the patch to |hardware/qcom/display|, run |./build.sh && ./flash.sh| and you should be good to go.
Whiteboard: [MemShrink] → [MemShrink] [CR 1019634]
Whiteboard: [MemShrink] [CR 1019634] → [MemShrink] [CR 689431]
Whiteboard: [MemShrink] [CR 689431] → [caf priority: p2][MemShrink] [CR 689431]
Tapas, can you confirm comment #9 is working for you?
(In reply to bhavana bajaj [:bajaj] [NOT reading Bugmail, needInfo please] from comment #10)
> Tapas, can you confirm comment #9 is working for you?

This patch works fine and we are seeing the DMD report now.
What else needs to be done to get this fix live?
Whiteboard: [caf priority: p2][MemShrink] [CR 689431] → [caf priority: p2][MemShrink:P1] [CR 689431]
(In reply to Eric Rahm [:erahm] from comment #12)
> What else needs to be done to get this fix live?

It has already landed. See bug 1034146 comment 3.
If you're testing on 2.0, we'll need to update our manifest to pick up the updates.
We need this for 1.4 and 2.0 to help with memory regression analysis.
Oh, actually that won't work: the fix from bug 1034146 comment 3 only landed on the KK branch. Our Flames are still on JB, so it won't help us until we get KK images for our Flames.
Tapas, is there a way to get this landed on the JB branch?
Eric -- Sushil is working on getting the fix available on JB. Sushil -- please update here once we have the fix on codeaurora.org.
Flags: needinfo?(tkundu) → needinfo?(sushilchauhan)
Any updates here?
The fix has landed on CAF. Here is the link:
https://www.codeaurora.org/cgit/quic/la/platform/hardware/qcom/display/commit/?h=b2g_jb_3.2&id=3f499aa8c4af9ae053dbce91eec60b37d0d6a26d
Now that this has landed on the JB branch, what's needed to get the manifests for 1.4, 2.0, m-c updated?
I can update the 2.0 manifest, since this is 2.0 blocking.
FWIW, bug 1034146 (the bug that blocks this) is 1.4+, which is what this issue fixes.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 1034146