Closed Bug 1302462 Opened 9 years ago Closed 8 years ago

High number of jemalloc crashes in webgl (analysis)

Categories

(Core :: Graphics: CanvasWebGL, defect, P1)


Tracking


RESOLVED INCOMPLETE
Tracking Status
firefox48 --- wontfix
firefox49 --- wontfix
firefox-esr45 - wontfix
firefox50 + wontfix
firefox51 + wontfix
firefox52 + wontfix
firefox53 --- wontfix

People

(Reporter: jesup, Assigned: jrmuizel)

Details

(Keywords: crash, csectype-bounds, sec-critical, Whiteboard: [gfx-noted][test disabled])

Attachments

(2 files)

[Tracking Requested - why for this release]: dangerous crash in release

+++ This bug was initially created as a clone of Bug #1285531 +++

We see a high number of jemalloc crashes on the trees; this bug is to track the (hopefully forthcoming) fixes for these crashes.

Spinning off a sec bug, since this clearly points to memory/allocator-structure trashing (in particular, it looks like something wrote past the end of a previous buffer). Most traffic for bug 1285531 should occur here, along with all further analysis.
Attached file dump of trashed arena —
Dump of the arena as hex values; clearly repeating patterns (gfx-ish?). Also, the data clearly escaped an earlier allocation; I dumped memory starting at a lower address.
And so as not to tip our hand, please don't mention this bug in the comments of bug 1285531. We can just let the original bug go quiet.
14:04 jesup: STR - My main 2000-tab profile... just starting it up :-( Probably one of the ~28 tabs that load at startup, or before that even. I can run under ASAN; I'll try that now.
14:05 jesup: Trying the older ASAN build I have ready to go; if that doesn't crash I'll update to the same changeset or close.
14:13 jesup: Not sure this is the same place, but:
14:13 jesup: ==14158==ERROR: AddressSanitizer: attempting free on address which was not malloc()-ed: 0x7ffff3edd4a0 in thread T38 (Compositor)
14:16 jesup: Could be follow-on...: Crash Annotation GraphicsCriticalError: |[C0][GFX1-]: Receive IPC close with reason=AbnormalShutdown (t=189.858) #01: ???[/home/jesup/src/mozilla/asan-inbound/objdir-asan/dist/bin/libxul.so +0x40947ec]
Assignee: nobody → jmuizelaar
Attached file asan.out —
ASAN hit using a slightly older m-i ASAN tree, starting the same profile. Free of non-malloc'ed memory from libGL.so.1. May be from getenv(); not sure I trust that (probably not).
free()ing a ptr that was not malloc()'d might be an explanation for the arena magic corruption. If we pass some valid pointer that jemalloc never allocated to jemalloc's free(), it will treat the initial bytes of the enclosing 1-MB chunk as a (corrupted) arena_chunk_t, and then get the arena_t pointer from it, which also points to a corrupted arena_t. It then aborts because, for this corrupted arena, arena->magic != ARENA_MAGIC, which is the first assertion encountered in free_impl(). Randell, would you please check whether arena == arenas[0] when you reproduce this crash with jemalloc?
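A minimal sketch of the path described above, simplified and using stub types rather than mozjemalloc's real ones (the constant values are illustrative, not quoted from the tree):

#include <assert.h>
#include <stdint.h>

/* Illustrative stand-ins; the real definitions live in memory/mozjemalloc/jemalloc.c. */
#define ARENA_MAGIC  0x947d3d24u
#define CHUNK_SIZE   (1u << 20)   /* 1 MiB chunks */
#define CHUNK_ADDR2BASE(p) ((void *)((uintptr_t)(p) & ~(uintptr_t)(CHUNK_SIZE - 1)))

typedef struct { uint32_t magic; /* ... */ } arena_t;
typedef struct { arena_t *arena; /* ... */ } arena_chunk_t;  /* arena ptr is the chunk's first word */

/* free(ptr) masks the pointer down to its 1 MiB chunk, reads the arena
 * pointer stored at the start of that chunk, and checks its magic.  If
 * ptr was never handed out by jemalloc, chunk->arena is whatever bytes
 * happen to sit at that chunk boundary, and the magic check aborts. */
static void sketch_free(void *ptr)
{
    arena_chunk_t *chunk = CHUNK_ADDR2BASE(ptr);
    arena_t *arena = chunk->arena;
    assert(arena->magic == ARENA_MAGIC);
    /* ... the normal small/large deallocation path would continue here ... */
}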
Flags: needinfo?(rjesup)
... which is not *at all* the same thing as bug 1285531
(In reply to Mike Hommey [:glandium] from comment #7)
> ... which is not *at all* the same thing as bug 1285531

Quite possibly not. It is definitely a webgl+jemalloc 'magic' crash. And cyu's comment about jemalloc's response to a bad ptr being freed may be more generally useful in attacking bug 1285531, if it leads to any sort of tripwire we could create to try to catch this live.
I declared a global raw pointer to the start address of the arena magic, and dump that data for debugging when the crash happens. [1] is the try result; I hope the dump of the arena onto the stack is useful information.

[1]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=fa9f9e87d676&exclusion_profile=false&selectedJob=27358972
[2]: https://public-artifacts.taskcluster.net/TsIk4NnuQMmQ1lDMKvBWRw/0/public/test_info//logcat-emulator-5554.log

I have more questions:

1. Looking at one of the logcat outputs [2], I saw arenas[0] being created twice, in different processes. The first one points to 0x95001000 and the second one to 0x95101000. One was created in main and the other was not. Both of them have the same arena id 0. Is that normal?

2. When the crash happens, the dumped data looks normal (the arena magic number and the paddings). Going back to bug 1285531 comment 14, I couldn't print out arena->magic at the crash. I guess the arena itself was really corrupted when the crash happened. What do you think?
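A sketch of the kind of instrumentation described above; the names here (gArenaMagicProbe, remember_arena_magic, dump_arena_magic_on_crash) are made up for illustration:

#include <stdint.h>
#include <string.h>

/* Remember a raw pointer to arenas[0]->magic when the arena is created, so
 * the crash path can copy the bytes around it onto the stack (and hence into
 * the minidump) even when the arena_t recovered from the chunk header is bogus. */
static const uint32_t *gArenaMagicProbe;

static void remember_arena_magic(const uint32_t *magic_addr)
{
    gArenaMagicProbe = magic_addr;   /* call once, right after arenas[0] is set up */
}

static void dump_arena_magic_on_crash(void)
{
    volatile uint8_t snapshot[32];
    if (gArenaMagicProbe)
        memcpy((void *)snapshot, gArenaMagicProbe, sizeof(snapshot));
    /* the snapshot now lives on the crashing thread's stack */
    (void)snapshot;
}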
Flags: needinfo?(mh+mozilla)
(In reply to Vincent Liu[:vliu] from comment #9)
> 1. Looking at one of the logcat outputs [2], I saw arenas[0] being created
> twice, in different processes. The first one points to 0x95001000 and the
> second one to 0x95101000. One was created in main and the other was not.
> Both of them have the same arena id 0. Is that normal?

It's perfectly normal that jemalloc is initialized in each process. One concern is that it is initialized at the same address in different processes, which makes it an easy target for attacks.

> 2. When the crash happens, the dumped data looks normal (the arena magic
> number and the paddings). Going back to bug 1285531 comment 14, I couldn't
> print out arena->magic at the crash.

Given that both the arena magic and the paddings, which are all mprotect()'d in your patch, are intact at the time of the crash, and that you still crash asserting on the arena magic, it is quite possible that bug 1285531 is also crashing while freeing memory that was not malloc'd. Would you add an assertion that arena == arenas[0] to test the above hypothesis?
Flags: needinfo?(vliu)
In my crash under ASAN, there's nothing trashed in arenas[0] or what it points to. And arenas[0] != the bad arena at the crash (and of course we're hitting the NotMalloced ASAN error here). I don't know if this applies to the Android failure, but certainly freeing non-malloc'ed memory can lead to arena failures.
Flags: needinfo?(rjesup)
I've pushed a few things to try on Android, with all the RELEASE_ASSERT(arena->magic == ARENA_MAGIC) lines replaced with calls to the following function:

static inline void
check_arena(arena_t *arena)
{
    if (arena->magic != ARENA_MAGIC) {
        unsigned i;
        for (i = 0; i < narenas; i++) {
            if (arena == arenas[i]) {
                break;
            }
        }
        RELEASE_ASSERT(i < narenas);
    }
    RELEASE_ASSERT(arena->magic == ARENA_MAGIC);
}

On the resulting builds, mochitest-gl1 consistently crashes on the first RELEASE_ASSERT in this function, meaning free() was called with a non-jemalloc-handled pointer (arena != arenas[0])... but now that I write this, I realize there's a chance the optimizer reduced it all to one call to jemalloc_crash, in which case the stack trace can't distinguish between the two... and looking at the disassembly and the debugging symbols, that's exactly what happened: there's only one call to jemalloc_crash :( (And no, it's not related to the static inline; the function wasn't actually inlined. I'll have to find something else.)

mochitest-gl5, however, crashes on either
https://dxr.mozilla.org/mozilla-central/rev/8a494adbc5cced90a4edf0c98cffde906bf7f3ae/memory/mozjemalloc/jemalloc.c#4229
or
https://dxr.mozilla.org/mozilla-central/rev/8a494adbc5cced90a4edf0c98cffde906bf7f3ae/memory/mozjemalloc/jemalloc.c#4741

The latter would be caused by a double-free. The former is... different and needs further investigation.
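One hypothetical way (not the patch that was actually pushed) to keep the two assertion sites distinguishable even if the compiler merges the crash calls: funnel them through a never-inlined helper that records which check fired before aborting. RELEASE_ASSERT_AT and jemalloc_crash_at are made-up names, and this assumes mfbt's MOZ_* macros are usable from this code:

#include "mozilla/Assertions.h"
#include "mozilla/Attributes.h"

MOZ_NEVER_INLINE MOZ_NORETURN static void
jemalloc_crash_at(int line)
{
    volatile int crash_line = line;  /* keep the failing line number alive in the minidump */
    (void)crash_line;
    MOZ_CRASH("mozjemalloc assertion failure");
}

#define RELEASE_ASSERT_AT(cond)              \
    do {                                     \
        if (!(cond))                         \
            jemalloc_crash_at(__LINE__);     \
    } while (0)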
(In reply to Mike Hommey [:glandium] from comment #12)
> ... meaning free() was called with a non-jemalloc-handled pointer
> (arena != arenas[0])... but now that I write this, I realize there's a
> chance the optimizer reduced it all to one call to jemalloc_crash

Is it possible to avoid that optimization by declaring jemalloc_crash() MOZ_ALWAYS_INLINE?
Flags: needinfo?(vliu)
(In reply to Vincent Liu[:vliu] from comment #13)
> Is it possible to avoid that optimization by declaring jemalloc_crash()
> MOZ_ALWAYS_INLINE?

I'd expect that to make things worse, not better. That being said, I did further try builds that do have the expected separation of the jemalloc_crash calls, and they still failed on the first one, meaning those crashes are related to a non-jemalloc-handled pointer. I then tried to modify breakpad to get full memory dumps for those crashes, but so far without success: it seems like breakpad is OOMing while dumping memory.
(In reply to Cervantes Yu [:cyu] [:cervantes] from comment #10)
> it is quite possible that bug 1285531 is also crashing while freeing
> memory that was not malloc'd. Would you add an assertion that
> arena == arenas[0] to test the above hypothesis?

I did another try push asserting arena == arenas[0] in [1].

[1]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=2dfd1d1e8637&exclusion_profile=false&selectedJob=27698468

One of the gl1 jobs [2] hit this check, which also means the arena was a non-jemalloc-handled pointer.

[2]: https://public-artifacts.taskcluster.net/dPTMjNrRSkSJCjz2blERhw/0/public/logs/live_backing.log
Okay, I finally got a semi-full core dump of the GL1 crashes that failed on the arena magic check, and here's an interesting fact: it's not at all caused by free being called on a non-jemalloc allocation. The pointer *is* in a jemalloc chunk, but the *first byte* of the chunk is overwritten. The first 4 bytes of the chunk are normally... the arena pointer, so of course the arena magic doesn't match; we're not reading the arena at the right location.

For reference, in that crash, this is what the beginning of the chunk looks like:

0x67000000: 0x00 0x00 0x90 0x52 0x9c 0x00 0x90 0x52
0x67000008: 0x9d 0x00 0x90 0x52 0x00 0x00 0x00 0x00
0x67000010: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00

This is what the beginning of another chunk looks like:

0x66c00000: 0x40 0x00 0x90 0x52 0x9c 0x00 0x90 0x52
0x66c00008: 0x9d 0x00 0x90 0x52 0x00 0x00 0x00 0x00
0x66c00010: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00

The preceding chunk ends with:

0x66ffffe0: 0x00 0x00 0xff 0xff 0x00 0x00 0xff 0xff
0x66ffffe8: 0x00 0x00 0xff 0xff 0x00 0x00 0xff 0xff
0x66fffff0: 0x00 0x00 0xff 0xff 0x00 0x00 0xff 0xff
0x66fffff8: 0xff 0x00 0x00 0xff 0xe4 0xe4 0xe4 0xe4

... and it's not a normal chunk: it's likely a huge allocation, but I don't have enough information yet to dig into that (missing debug symbols).

If this is caused by an off-by-one write in a huge buffer, we can detect that by adding a guard page at the end of huge allocations. But I'd like to have more insight into that huge allocation first.

There were two other kinds of crashes:
- one on RELEASE_ASSERT((mapelm->bits & CHUNK_MAP_ALLOCATED) != 0) during free();
- another on RELEASE_ASSERT(size == bin->reg_size) during malloc().

I looked at one of each, in the same try run, so still without debug symbols, and could determine the following.

For the former, it looks completely weird. The code that does the assert looks like this:

1a186: 69b3       ldr r3, [r6, #24]  ; read the word at $r6+24 and put it in $r3
1a188: 07da       lsls r2, r3, #31   ; $r2 = $r3 << 31, setting the N and Z flags according to the result
1a18a: d401       bmi.n 1a190        ; if the N flag is set, go to that address (skip the crash)
1a18c: f7ff f8ea  bl 19364           ; otherwise go to jemalloc_crash <- this is where we go

r6's value is 0x66e00b40, which corresponds to the address I'm expecting for mapelm->bits - 24, given that the pointer being freed was at 0x66ef0c80. Now, the memory dump reveals this: mapelm->bits is 0x66eef001, and CHUNK_MAP_ALLOCATED is 1, so according to the memory dump, that memory *is* allocated, and the corresponding run is at the address 0x66eef000, which is normal given the pointer address at 0x66e00b40.

For the latter, the code that does the assert looks like this:

1b8f6: 6963       ldr r3, [r4, #20]  ; read the word at $r4+20 and put it in $r3
1b8f8: 429e       cmp r6, r3         ; compare with $r6
1b8fa: d001       beq.n 1b900        ; if $r3 == $r6 then go to that address
1b8fc: f7fd fd32  bl 19364           ; otherwise go to jemalloc_crash <- this is where we go

r4 is 0x52900aa8, r6 is 0x00000800. 20 is the offset of reg_size in the arena_bin_s struct, so r4 is the raw pointer to the bin. Now, the memory dump reveals that the value of bin->reg_size *is* 0x800.

Sadly, the first thing the machine code for jemalloc_crash does is trash r3, so there's no direct way to get it. I'll do another try run adding a no_caller_saved_registers attribute to the jemalloc_crash function.
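A sketch of the guard-page idea mentioned above (nothing that has landed; huge_alloc_with_guard is a made-up name): reserve one extra page past the huge allocation and make it inaccessible, so a write that runs past the end faults immediately instead of silently overwriting the next chunk's header.

#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

static void *huge_alloc_with_guard(size_t size)
{
    size_t pagesize = (size_t)sysconf(_SC_PAGESIZE);
    size_t rounded = (size + pagesize - 1) & ~(pagesize - 1);

    /* Map the rounded-up allocation plus one trailing guard page. */
    void *base = mmap(NULL, rounded + pagesize, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return NULL;

    /* Any write that runs past the buffer now takes a SIGSEGV here. */
    if (mprotect((char *)base + rounded, pagesize, PROT_NONE) != 0) {
        munmap(base, rounded + pagesize);
        return NULL;
    }
    return base;
}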
With all the above and the weird things from bug 1285531 comment 14, I'm tempted to say we have 2 issues here, one that could be a buffer overflow, and one that looks more and more like bit flipping (but that's an awful lot of it).
Flags: needinfo?(mh+mozilla)
Damn GCC: warning: 'no_caller_saved_registers' attribute directive ignored
I checked all 12 crashes I got on try with that same try build. 2 of them were on the arena->magic check, and they both show the "arena pointer got its LSB byte overwritten" behavior. The preceding bytes have significantly different contents in the two cases, *but* looking at the data at the beginning of the preceding chunk, they do seem similar, with patterns of 0x00s and 0xffs. The end of the chunk, though, looks like it's not fully allocated in one of the crashes, which might indicate something other than a buffer overflow going on.

All the others were on the other cases (I haven't counted exactly, but it's roughly half and half), and they *all* show the same behavior where the memory dump says the jemalloc data is right and the crash shouldn't have happened in the first place. Hopefully, my subsequent try push will tell a bit more.
Aha, additional information about the arena->magic check crashes, now that I have debugging symbols: arenas[0] *also* points to the (wrong) arena that the chunk points to.

Now, here's something interesting: there's *also* a behavior similar to the other crashes happening in there. The code that crashes is something that I added:

MOZ_NEVER_INLINE static void
check_existing_arena(arena_t *arena)
{
    unsigned i;
    for (i = 0; i < narenas; i++) {
        if (arena == arenas[i]) {
            break;
        }
    }
    RELEASE_ASSERT(i < narenas);
}

And the crash is triggered on that RELEASE_ASSERT. We get into this function because arena->magic != ARENA_MAGIC in the first place, so the weirdness is not there. Where the weirdness happens is that, according to the register values I dumped on the stack, the break should have been hit. The full disassembly for this function looks like this:

0x5280506c <+0>:  ldr    r3, [pc, #44]   ; (0x5280509c <check_existing_arena+48>)
0x5280506e <+2>:  push   {r4, lr}
0x52805070 <+4>:  add    r3, pc
0x52805072 <+6>:  ldr    r2, [r3, #0]
0x52805074 <+8>:  ldr    r3, [pc, #40]   ; (0x528050a0 <check_existing_arena+52>)
0x52805076 <+10>: add    r3, pc
0x52805078 <+12>: ldr    r1, [r3, #0]
0x5280507a <+14>: movs   r3, #0
0x5280507c <+16>: cmp    r3, r2
0x5280507e <+18>: beq.n  0x52805092 <check_existing_arena+38>
0x52805080 <+20>: ldr.w  r4, [r1, r3, lsl #2]
0x52805084 <+24>: cmp    r0, r4
0x52805086 <+26>: bne.n  0x5280508e <check_existing_arena+34>
0x52805088 <+28>: cmp    r3, r2
0x5280508a <+30>: bcc.n  0x5280509a <check_existing_arena+46>
0x5280508c <+32>: b.n    0x52805092 <check_existing_arena+38>
0x5280508e <+34>: adds   r3, #1
0x52805090 <+36>: b.n    0x5280507c <check_existing_arena+16>
0x52805092 <+38>: ldmia.w sp!, {r4, lr}
0x52805096 <+42>: b.w    0x5280335c <jemalloc_crash>
0x5280509a <+46>: pop    {r4, pc}

By 0x5280507a, we enter the loop. r3 is i, r0 is arena, r1 is arenas, r2 is narenas. At 0x52805080, we load arenas[0] into r4, and the next instruction compares it to r0. The memory dump says both these values are equal, so what should have happened is: not take the branch at 0x52805086, compare r3 to r2, find that r3 < r2, and jump to the pop at 0x5280509a. But that's not what happened, because according to the dump, r3 and r2 are both equal to 1.

Two bugs for the price of one! And once again, the least significant bit of r3 is involved. I'm starting to wonder what's going on with that register.

I also got, on that new try with debug symbols, a new crash on assert(run->magic == ARENA_RUN_MAGIC) in arena_dalloc_small, with /some/ similarity with the ARENA_MAGIC crash: what happens for this one is that the magic number itself has its LSB byte overwritten to 0. Contrary to the chunk one, the previous page is allocated to a different run, for small allocations, not a huge allocation.

Strangely, that new signature happened a lot on that try run, along with another similar one: RELEASE_ASSERT(bin->runcur->magic == ARENA_RUN_MAGIC). Interestingly, the one-bit-wrong crashes have *not* happened on that try... The only difference with the previous try was that I rewrote jemalloc_crash in asm to manually push r0, r1, r2 and r3 on the stack. Come to think of it, I forgot to put a noreturn attribute in there.

... So with a noreturn on that jemalloc_crash, I got the RELEASE_ASSERT((mapelm->bits & CHUNK_MAP_ALLOCATED) != 0) and RELEASE_ASSERT(size == bin->reg_size) ones again, and the stack confirms that the value in r3 *is* broken (always 0), while the value at the location it's supposed to be loaded from is correct.
Interestingly, those I looked at had the corresponding run magic number corrupted as per above, so if the RELEASE_ASSERT with an impossible value in r3 hadn't hit, the ARENA_RUN_MAGIC one would have.
Group: core-security → media-core-security
Is there anything more you can do here to investigate this? Thanks.
Flags: needinfo?(mh+mozilla)
This is a sec-critical bug that comes up as stale during triage. Is there anything we can do here to move resolution along?
Flags: needinfo?(rjesup)
This bug is pretty vague and not really going anywhere. I'm going to close it as incomplete. Feel free to reopen it if you want to investigate.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → INCOMPLETE
Flags: needinfo?(rjesup)
Flags: needinfo?(mh+mozilla)
Group: media-core-security