Closed
Bug 1139386
Opened 9 years ago
Closed 9 years ago
jemalloc crashes during profiling on Linux32 EC2 slaves
Categories
(DevTools :: Performance Tools (Profiler/Timeline), defect)
DevTools
Performance Tools (Profiler/Timeline)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: vporof, Unassigned)
References
Details
Attachments
(3 files)
1.83 KB,
patch
|
Details | Diff | Splinter Review | |
428 bytes,
application/javascript
|
Details | |
2.54 KB,
text/x-csrc
|
Details |
Reporter | ||
Updated•9 years ago
|
Blocks: perf-tool-crashes
Comment 1•9 years ago
|
||
The crash is inside jemalloc. Is this user error or an a jemalloc problem?
Flags: needinfo?(mh+mozilla)
Comment 2•9 years ago
|
||
21:07:19 INFO - Crash reason: SIGILL 21:07:19 INFO - Crash address: 0x805e292 21:07:19 INFO - Thread 0 (crashed) 21:07:19 INFO - 0 firefox!arena_run_reg_alloc [bitmap.h:04a2d0398e39 : 147 + 0xa] 21:07:19 INFO - eip = 0x0805e292 esp = 0xbf973a90 ebp = 0xbf973ab8 ebx = 0x0808e238 21:07:19 INFO - esi = 0x94902df4 edi = 0x0808eda4 eax = 0x00000000 ecx = 0x00000001 21:07:19 INFO - edx = 0x00000008 efl = 0x00210286 21:07:19 INFO - Found by: given as instruction pointer in context 21:07:19 INFO - 1 firefox!je_arena_malloc_small [arena.c:04a2d0398e39 : 1592 + 0xf] 21:07:19 INFO - eip = 0x0805fc43 esp = 0xbf973ac0 ebp = 0xbf973ae8 ebx = 0x0808e238 21:07:19 INFO - esi = 0xb73000c0 edi = 0xb73008bc 21:07:19 INFO - Found by: call frame info $ wget http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/gum-linux-debug/1425434990/firefox-39.0a1.en-US.linux-i686.tar.bz2 $ tar -jxf firefox-39.0a1.en-US.linux-i686.tar.bz2 $ objdump -d --start-address 0x0805e28d --stop-address 0x0805e296 firefox/firefox # Thanks to the lack of ASLR for the executable firefox/firefox: file format elf32-i386 Disassembly of section .text: 0805e28d <jemalloc_free_dirty_pages+0x11d8>: 805e28d: 8b 4d d8 mov -0x28(%ebp),%ecx 805e290: 33 44 8e 08 xor 0x8(%esi,%ecx,4),%eax 805e294: 85 c0 test %eax,%eax -> The crash address (0x0805e292) is in the middle of an instruction The next frame is plausible: $ objdump -d --start-address 0x0805fc3e --stop-address 0x0805fc43 firefox/firefox firefox/firefox: file format elf32-i386 Disassembly of section .text: 0805fc3e <jemalloc_free_dirty_pages+0x2b89>: 805fc3e: e8 cb e5 ff ff call 805e20e <jemalloc_free_dirty_pages+0x1159> So it is plausible that it came to that crash address via that call and executing code starting at that address. This is the complete function where the crash is happening: $ objdump -d --start-address 0x805e20e --stop-address 0x805e320 firefox/firefox firefox/firefox: file format elf32-i386 Disassembly of section .text: 0805e20e <jemalloc_free_dirty_pages+0x1159>: 805e20e: 55 push %ebp 805e20f: b9 ff ff ff ff mov $0xffffffff,%ecx 805e214: 89 e5 mov %esp,%ebp 805e216: 57 push %edi 805e217: 89 d7 mov %edx,%edi 805e219: 56 push %esi 805e21a: 89 c6 mov %eax,%esi 805e21c: 53 push %ebx 805e21d: 83 ec 1c sub $0x1c,%esp 805e220: 8d 42 14 lea 0x14(%edx),%eax 805e223: 89 45 e8 mov %eax,-0x18(%ebp) 805e226: 8b 47 18 mov 0x18(%edi),%eax 805e229: 8d 56 08 lea 0x8(%esi),%edx 805e22c: 89 55 e4 mov %edx,-0x1c(%ebp) 805e22f: e8 0a f9 fe ff call 804db3e <_start+0xa96> 805e234: 81 c3 04 00 03 00 add $0x30004,%ebx 805e23a: 48 dec %eax 805e23b: 8b 54 87 1c mov 0x1c(%edi,%eax,4),%edx 805e23f: 0f bc 54 96 08 bsf 0x8(%esi,%edx,4),%edx 805e244: 0f 44 d1 cmove %ecx,%edx 805e247: 89 55 f0 mov %edx,-0x10(%ebp) 805e24a: eb 26 jmp 805e272 <jemalloc_free_dirty_pages+0x11bd> 805e24c: 8b 4d e8 mov -0x18(%ebp),%ecx 805e24f: 48 dec %eax 805e250: 8b 55 f0 mov -0x10(%ebp),%edx 805e253: 03 54 81 08 add 0x8(%ecx,%eax,4),%edx 805e257: 8b 4d e4 mov -0x1c(%ebp),%ecx 805e25a: 0f bc 14 91 bsf (%ecx,%edx,4),%edx 805e25e: b9 ff ff ff ff mov $0xffffffff,%ecx 805e263: 0f 44 d1 cmove %ecx,%edx 805e266: 8b 4d f0 mov -0x10(%ebp),%ecx 805e269: c1 e1 05 shl $0x5,%ecx 805e26c: 8d 0c 0a lea (%edx,%ecx,1),%ecx 805e26f: 89 4d f0 mov %ecx,-0x10(%ebp) 805e272: 85 c0 test %eax,%eax 805e274: 75 d6 jne 805e24c <jemalloc_free_dirty_pages+0x1197> 805e276: 8b 4d f0 mov -0x10(%ebp),%ecx 805e279: b0 01 mov $0x1,%al 805e27b: c7 45 ec 01 00 00 00 movl $0x1,-0x14(%ebp) 805e282: c1 e9 05 shr $0x5,%ecx 805e285: 89 4d d8 mov %ecx,-0x28(%ebp) 805e288: 8b 4d f0 mov -0x10(%ebp),%ecx 805e28b: d3 e0 shl %cl,%eax 805e28d: 8b 4d d8 mov -0x28(%ebp),%ecx 805e290: 33 44 8e 08 xor 0x8(%esi,%ecx,4),%eax 805e294: 85 c0 test %eax,%eax 805e296: 89 44 8e 08 mov %eax,0x8(%esi,%ecx,4) 805e29a: 74 70 je 805e30c <jemalloc_free_dirty_pages+0x1257> 805e29c: 8b 8b 7c 11 00 00 mov 0x117c(%ebx),%ecx 805e2a2: 8d 46 f8 lea -0x8(%esi),%eax 805e2a5: 8b 57 34 mov 0x34(%edi),%edx 805e2a8: f7 d1 not %ecx 805e2aa: 21 c1 and %eax,%ecx 805e2ac: 2b 83 20 11 00 00 sub 0x1120(%ebx),%eax 805e2b2: 01 ca add %ecx,%edx 805e2b4: 89 55 ec mov %edx,-0x14(%ebp) 805e2b7: 31 d2 xor %edx,%edx 805e2b9: 29 c8 sub %ecx,%eax 805e2bb: b9 54 00 00 00 mov $0x54,%ecx 805e2c0: f7 f1 div %ecx 805e2c2: 8b 55 f0 mov -0x10(%ebp),%edx 805e2c5: 0f af 57 08 imul 0x8(%edi),%edx 805e2c9: ff 4e 04 decl 0x4(%esi) 805e2cc: 03 83 2c 11 00 00 add 0x112c(%ebx),%eax 805e2d2: c1 e0 0c shl $0xc,%eax 805e2d5: 03 45 ec add -0x14(%ebp),%eax 805e2d8: 01 d0 add %edx,%eax 805e2da: eb 3d jmp 805e319 <jemalloc_free_dirty_pages+0x1264> 805e2dc: 8b 4d d8 mov -0x28(%ebp),%ecx 805e2df: 8b 45 ec mov -0x14(%ebp),%eax 805e2e2: 8b 55 e8 mov -0x18(%ebp),%edx 805e2e5: c1 e9 05 shr $0x5,%ecx 805e2e8: 89 4d e0 mov %ecx,-0x20(%ebp) 805e2eb: 03 4c 82 08 add 0x8(%edx,%eax,4),%ecx 805e2ef: ba 01 00 00 00 mov $0x1,%edx 805e2f4: 8b 45 e4 mov -0x1c(%ebp),%eax 805e2f7: 8d 04 88 lea (%eax,%ecx,4),%eax 805e2fa: 8b 4d d8 mov -0x28(%ebp),%ecx 805e2fd: d3 e2 shl %cl,%edx 805e2ff: 31 10 xor %edx,(%eax) 805e301: 75 99 jne 805e29c <jemalloc_free_dirty_pages+0x11e7> 805e303: 8b 45 e0 mov -0x20(%ebp),%eax 805e306: ff 45 ec incl -0x14(%ebp) 805e309: 89 45 d8 mov %eax,-0x28(%ebp) 805e30c: 8b 55 e8 mov -0x18(%ebp),%edx 805e30f: 8b 45 ec mov -0x14(%ebp),%eax 805e312: 3b 42 04 cmp 0x4(%edx),%eax 805e315: 72 c5 jb 805e2dc <jemalloc_free_dirty_pages+0x1227> 805e317: eb 83 jmp 805e29c <jemalloc_free_dirty_pages+0x11e7> 805e319: 83 c4 1c add $0x1c,%esp 805e31c: 5b pop %ebx 805e31d: 5e pop %esi 805e31e: 5f pop %edi 805e31f: 5d pop %ebp It only contains one call, to this function: $ objdump -d --start-address 0x804db3e --stop-address 0x804db42 firefox/firefox firefox/firefox: file format elf32-i386 Disassembly of section .text: 0804db3e <_start+0xa96>: 804db3e: 8b 1c 24 mov (%esp),%ebx 804db41: c3 ret That's a pretty typical __i686.get_pc_thunk.bx.
Flags: needinfo?(mh+mozilla)
Reporter | ||
Comment 3•9 years ago
|
||
https://treeherder.mozilla.org/logviewer.html#?job_id=51888&repo=gum
Summary: CRASH: browser/devtools/performance/test/browser_perf-details-04.js | application crashed [@ arena_run_reg_alloc] → CRASH: browser/devtools/performance/test/browser_perf-details-04.js, browser_perf-states.js | application crashed [@ arena_run_reg_alloc]
Comment 4•9 years ago
|
||
Update: I can only reproduce crashes on a loaned linux32 slave, but I still have no clue how to debug this. At some point the instruction pointer gets corrupted. Most of the time, the crash is SIGILL in jemalloc per above. Rarely I've also gotten SIGSEGV in code that shouldn't be called (like vp9 decoder code). Examining %eip revealed what looked like bogus code (a series of several stos). The best theory so far is maybe the profiler's signal handler in rare cases corrupts the stack.
Comment 5•9 years ago
|
||
I'm no longer sure if this has anything to do with the profiler at all. There are other SIGILL crashes on Linux 32: https://bugzilla.mozilla.org/show_bug.cgi?id=1113543 Something the profiler does is exacerbating this problem, however.
Comment 6•9 years ago
|
||
Things that have been tested to not be the cause: The profiling signal handler, when turned into basically a nop, still triggers the crash: https://treeherder.mozilla.org/#/jobs?repo=try&revision=551e8b9f6775 Turning off Ion still triggers the crash (the SM failures are due to forcefully turning off Ion): https://treeherder.mozilla.org/#/jobs?repo=try&revision=b1dd781f092c
Comment 7•9 years ago
|
||
Turning off Baseline still triggers the crash: https://treeherder.mozilla.org/#/jobs?repo=try&revision=3414bee3f023
Comment 8•9 years ago
|
||
Ah ha, turning off jemalloc (plus dmd and replace-malloc, since they depend on jemalloc) make the crashes go away: https://treeherder.mozilla.org/#/jobs?repo=try&revision=c7b15b4a118d Perhaps this really is jemalloc bug, like not playing nice with our signal handlers on linux32?
Comment 9•9 years ago
|
||
Julian, I think you've wrestled with things like this in the past. Any insights about jemalloc + the profiler's signal handlers causing crashes on Linux 32bit only?
Flags: needinfo?(jseward)
Updated•9 years ago
|
Severity: normal → critical
Comment 10•9 years ago
|
||
I've come up with an xpcshell test case that reliably crashes. It stresses jemalloc from a custom xpcshell function while also running the sampler. The interesting thing is that it *only* crashes on our Linux 32 EC2 instances. It crashes there fairly quickly after startup. I've run the same test on a physical Linux 32 machine with the same kernel version (3.2.0) and the same glibc version (2.15) for hours, with no crashes. Here's a non-stripped --enable-debug --disable-optimize build that crashes on the loan slave I have: http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/shu@rfrn.org-85014fd8d865/try-linux-debug/
Comment 11•9 years ago
|
||
The custom xpcshell function that stress tests malloc.
Comment 12•9 years ago
|
||
xpcshell test that crashes on our linux32 slaves.
Comment 13•9 years ago
|
||
Even easier STR: First, get a linux32 EC2 tst slave. The revision below is what's currently in m-c, see memory/jemalloc/upstream.info 1) Build jemalloc: > git clone https://github.com/jemalloc/jemalloc jemalloc-b4acf > cd jemalloc-b4acf > git checkout b4acf7300a4ca3423ca36fe227e9bc2e23f25b9f > autoconf > mkdir build > cd build > ../configure --enable-debug --disable-valgrind --enable-stats > make 2) Download this test as sampler-test.c. It spins up a profiler thread that sends SIGPROF to the main thread. The main thread loops 1024*1024*factor times, calling malloc. 3) Compile the test in the same build directory as in step 1 above. > export JEMALLOC_PATH=`pwd` > cc -std=gnu99 -pthread -g3 sampler-test.c -o sampler-test -L$JEMALLOC_PATH/lib -Wl,-rpath,$JEMALLOC_PATH/lib -ljemalloc > ./sampler-test 30 # or whatever factor The SIGPROF handler dumps some register info, and eventually it crashes. Here's an example crash: (gdb) run 30 <lots of spew> ... eax: 0x00000000 ecx: 0x00000064 edx: 0x00000000 eip: 0xb7fa2bd4 Program received signal SIGSEGV, Segmentation fault. (gdb) x $eip 0xb7fa2bd5 <je_bitmap_sfu+184>: 0xec458bf4 (gdb) x/2i $eip => 0xb7fa2bd5 <je_bitmap_sfu+184>: hlt 0xb7fa2bd6 <je_bitmap_sfu+185>: mov -0x14(%ebp),%eax (gdb) x/3i je_bitmap_sfu+180 0xb7fa2bd1 <je_bitmap_sfu+180>: mov (%eax),%eax 0xb7fa2bd3 <je_bitmap_sfu+182>: mov %eax,-0xc(%ebp) 0xb7fa2bd6 <je_bitmap_sfu+185>: mov -0x14(%ebp),%eax WTF is going on? The SIGPROF handler reports, on entry, that the saved EIP is 0xb7fa2bd5. Which is bogus, and is in between the valid instructions 0xb7fa2bd3 and 0xb7fa2bd6.
Updated•9 years ago
|
Summary: CRASH: browser/devtools/performance/test/browser_perf-details-04.js, browser_perf-states.js | application crashed [@ arena_run_reg_alloc] → jemalloc crashes during profiling on Linux32 EC2 slaves
Comment 14•9 years ago
|
||
If I turn off SA_RESTART in the test case, it instead reliably asserts: <jemalloc>: src/include/jemalloc/internal/bitmap.h:142: Failed assertion: "!bitmap_get(bitmap, binfo, bit)"
Comment 16•9 years ago
|
||
The crash bisects to https://github.com/jemalloc/jemalloc/commit/84c8eefeffa246607790ad12e28b0f6a24ecc59d Which is, I suppose, is unsurprising, since most of the crashes happen around bitmaps.
Comment 17•9 years ago
|
||
We narrowed it down on irc to a likely kernel bug fixed by upgrading to the latest from ubuntu 12.04.
Flags: needinfo?(mh+mozilla)
Comment 18•9 years ago
|
||
And this is the kernel bugfix: http://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=6488ee494d5fbac63fb7c8e2fc3400c3dd53972f
Updated•9 years ago
|
Flags: needinfo?(jseward)
Reporter | ||
Comment 21•9 years ago
|
||
FIXED!
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Updated•6 years ago
|
Product: Firefox → DevTools
You need to log in
before you can comment on or make changes to this bug.
Description
•