Closed
Bug 1139386
Opened 11 years ago
Closed 11 years ago
jemalloc crashes during profiling on Linux32 EC2 slaves
Categories
(DevTools :: Performance Tools (Profiler/Timeline), defect)
DevTools
Performance Tools (Profiler/Timeline)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: vporof, Unassigned)
References
Details
Attachments
(3 files)
|
1.83 KB,
patch
|
Details | Diff | Splinter Review | |
|
428 bytes,
application/javascript
|
Details | |
|
2.54 KB,
text/x-csrc
|
Details |
| Reporter | ||
Updated•11 years ago
|
Blocks: perf-tool-crashes
Comment 1•11 years ago
|
||
The crash is inside jemalloc. Is this user error or an a jemalloc problem?
Flags: needinfo?(mh+mozilla)
Comment 2•11 years ago
|
||
21:07:19 INFO - Crash reason: SIGILL
21:07:19 INFO - Crash address: 0x805e292
21:07:19 INFO - Thread 0 (crashed)
21:07:19 INFO - 0 firefox!arena_run_reg_alloc [bitmap.h:04a2d0398e39 : 147 + 0xa]
21:07:19 INFO - eip = 0x0805e292 esp = 0xbf973a90 ebp = 0xbf973ab8 ebx = 0x0808e238
21:07:19 INFO - esi = 0x94902df4 edi = 0x0808eda4 eax = 0x00000000 ecx = 0x00000001
21:07:19 INFO - edx = 0x00000008 efl = 0x00210286
21:07:19 INFO - Found by: given as instruction pointer in context
21:07:19 INFO - 1 firefox!je_arena_malloc_small [arena.c:04a2d0398e39 : 1592 + 0xf]
21:07:19 INFO - eip = 0x0805fc43 esp = 0xbf973ac0 ebp = 0xbf973ae8 ebx = 0x0808e238
21:07:19 INFO - esi = 0xb73000c0 edi = 0xb73008bc
21:07:19 INFO - Found by: call frame info
$ wget http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/gum-linux-debug/1425434990/firefox-39.0a1.en-US.linux-i686.tar.bz2
$ tar -jxf firefox-39.0a1.en-US.linux-i686.tar.bz2
$ objdump -d --start-address 0x0805e28d --stop-address 0x0805e296 firefox/firefox # Thanks to the lack of ASLR for the executable
firefox/firefox: file format elf32-i386
Disassembly of section .text:
0805e28d <jemalloc_free_dirty_pages+0x11d8>:
805e28d: 8b 4d d8 mov -0x28(%ebp),%ecx
805e290: 33 44 8e 08 xor 0x8(%esi,%ecx,4),%eax
805e294: 85 c0 test %eax,%eax
-> The crash address (0x0805e292) is in the middle of an instruction
The next frame is plausible:
$ objdump -d --start-address 0x0805fc3e --stop-address 0x0805fc43 firefox/firefox
firefox/firefox: file format elf32-i386
Disassembly of section .text:
0805fc3e <jemalloc_free_dirty_pages+0x2b89>:
805fc3e: e8 cb e5 ff ff call 805e20e <jemalloc_free_dirty_pages+0x1159>
So it is plausible that it came to that crash address via that call and executing code starting at that address.
This is the complete function where the crash is happening:
$ objdump -d --start-address 0x805e20e --stop-address 0x805e320 firefox/firefox
firefox/firefox: file format elf32-i386
Disassembly of section .text:
0805e20e <jemalloc_free_dirty_pages+0x1159>:
805e20e: 55 push %ebp
805e20f: b9 ff ff ff ff mov $0xffffffff,%ecx
805e214: 89 e5 mov %esp,%ebp
805e216: 57 push %edi
805e217: 89 d7 mov %edx,%edi
805e219: 56 push %esi
805e21a: 89 c6 mov %eax,%esi
805e21c: 53 push %ebx
805e21d: 83 ec 1c sub $0x1c,%esp
805e220: 8d 42 14 lea 0x14(%edx),%eax
805e223: 89 45 e8 mov %eax,-0x18(%ebp)
805e226: 8b 47 18 mov 0x18(%edi),%eax
805e229: 8d 56 08 lea 0x8(%esi),%edx
805e22c: 89 55 e4 mov %edx,-0x1c(%ebp)
805e22f: e8 0a f9 fe ff call 804db3e <_start+0xa96>
805e234: 81 c3 04 00 03 00 add $0x30004,%ebx
805e23a: 48 dec %eax
805e23b: 8b 54 87 1c mov 0x1c(%edi,%eax,4),%edx
805e23f: 0f bc 54 96 08 bsf 0x8(%esi,%edx,4),%edx
805e244: 0f 44 d1 cmove %ecx,%edx
805e247: 89 55 f0 mov %edx,-0x10(%ebp)
805e24a: eb 26 jmp 805e272 <jemalloc_free_dirty_pages+0x11bd>
805e24c: 8b 4d e8 mov -0x18(%ebp),%ecx
805e24f: 48 dec %eax
805e250: 8b 55 f0 mov -0x10(%ebp),%edx
805e253: 03 54 81 08 add 0x8(%ecx,%eax,4),%edx
805e257: 8b 4d e4 mov -0x1c(%ebp),%ecx
805e25a: 0f bc 14 91 bsf (%ecx,%edx,4),%edx
805e25e: b9 ff ff ff ff mov $0xffffffff,%ecx
805e263: 0f 44 d1 cmove %ecx,%edx
805e266: 8b 4d f0 mov -0x10(%ebp),%ecx
805e269: c1 e1 05 shl $0x5,%ecx
805e26c: 8d 0c 0a lea (%edx,%ecx,1),%ecx
805e26f: 89 4d f0 mov %ecx,-0x10(%ebp)
805e272: 85 c0 test %eax,%eax
805e274: 75 d6 jne 805e24c <jemalloc_free_dirty_pages+0x1197>
805e276: 8b 4d f0 mov -0x10(%ebp),%ecx
805e279: b0 01 mov $0x1,%al
805e27b: c7 45 ec 01 00 00 00 movl $0x1,-0x14(%ebp)
805e282: c1 e9 05 shr $0x5,%ecx
805e285: 89 4d d8 mov %ecx,-0x28(%ebp)
805e288: 8b 4d f0 mov -0x10(%ebp),%ecx
805e28b: d3 e0 shl %cl,%eax
805e28d: 8b 4d d8 mov -0x28(%ebp),%ecx
805e290: 33 44 8e 08 xor 0x8(%esi,%ecx,4),%eax
805e294: 85 c0 test %eax,%eax
805e296: 89 44 8e 08 mov %eax,0x8(%esi,%ecx,4)
805e29a: 74 70 je 805e30c <jemalloc_free_dirty_pages+0x1257>
805e29c: 8b 8b 7c 11 00 00 mov 0x117c(%ebx),%ecx
805e2a2: 8d 46 f8 lea -0x8(%esi),%eax
805e2a5: 8b 57 34 mov 0x34(%edi),%edx
805e2a8: f7 d1 not %ecx
805e2aa: 21 c1 and %eax,%ecx
805e2ac: 2b 83 20 11 00 00 sub 0x1120(%ebx),%eax
805e2b2: 01 ca add %ecx,%edx
805e2b4: 89 55 ec mov %edx,-0x14(%ebp)
805e2b7: 31 d2 xor %edx,%edx
805e2b9: 29 c8 sub %ecx,%eax
805e2bb: b9 54 00 00 00 mov $0x54,%ecx
805e2c0: f7 f1 div %ecx
805e2c2: 8b 55 f0 mov -0x10(%ebp),%edx
805e2c5: 0f af 57 08 imul 0x8(%edi),%edx
805e2c9: ff 4e 04 decl 0x4(%esi)
805e2cc: 03 83 2c 11 00 00 add 0x112c(%ebx),%eax
805e2d2: c1 e0 0c shl $0xc,%eax
805e2d5: 03 45 ec add -0x14(%ebp),%eax
805e2d8: 01 d0 add %edx,%eax
805e2da: eb 3d jmp 805e319 <jemalloc_free_dirty_pages+0x1264>
805e2dc: 8b 4d d8 mov -0x28(%ebp),%ecx
805e2df: 8b 45 ec mov -0x14(%ebp),%eax
805e2e2: 8b 55 e8 mov -0x18(%ebp),%edx
805e2e5: c1 e9 05 shr $0x5,%ecx
805e2e8: 89 4d e0 mov %ecx,-0x20(%ebp)
805e2eb: 03 4c 82 08 add 0x8(%edx,%eax,4),%ecx
805e2ef: ba 01 00 00 00 mov $0x1,%edx
805e2f4: 8b 45 e4 mov -0x1c(%ebp),%eax
805e2f7: 8d 04 88 lea (%eax,%ecx,4),%eax
805e2fa: 8b 4d d8 mov -0x28(%ebp),%ecx
805e2fd: d3 e2 shl %cl,%edx
805e2ff: 31 10 xor %edx,(%eax)
805e301: 75 99 jne 805e29c <jemalloc_free_dirty_pages+0x11e7>
805e303: 8b 45 e0 mov -0x20(%ebp),%eax
805e306: ff 45 ec incl -0x14(%ebp)
805e309: 89 45 d8 mov %eax,-0x28(%ebp)
805e30c: 8b 55 e8 mov -0x18(%ebp),%edx
805e30f: 8b 45 ec mov -0x14(%ebp),%eax
805e312: 3b 42 04 cmp 0x4(%edx),%eax
805e315: 72 c5 jb 805e2dc <jemalloc_free_dirty_pages+0x1227>
805e317: eb 83 jmp 805e29c <jemalloc_free_dirty_pages+0x11e7>
805e319: 83 c4 1c add $0x1c,%esp
805e31c: 5b pop %ebx
805e31d: 5e pop %esi
805e31e: 5f pop %edi
805e31f: 5d pop %ebp
It only contains one call, to this function:
$ objdump -d --start-address 0x804db3e --stop-address 0x804db42 firefox/firefox
firefox/firefox: file format elf32-i386
Disassembly of section .text:
0804db3e <_start+0xa96>:
804db3e: 8b 1c 24 mov (%esp),%ebx
804db41: c3 ret
That's a pretty typical __i686.get_pc_thunk.bx.
Flags: needinfo?(mh+mozilla)
| Reporter | ||
Comment 3•11 years ago
|
||
Summary: CRASH: browser/devtools/performance/test/browser_perf-details-04.js | application crashed [@ arena_run_reg_alloc] → CRASH: browser/devtools/performance/test/browser_perf-details-04.js, browser_perf-states.js | application crashed [@ arena_run_reg_alloc]
Comment 4•11 years ago
|
||
Update:
I can only reproduce crashes on a loaned linux32 slave, but I still have no clue how to debug this. At some point the instruction pointer gets corrupted. Most of the time, the crash is SIGILL in jemalloc per above. Rarely I've also gotten SIGSEGV in code that shouldn't be called (like vp9 decoder code). Examining %eip revealed what looked like bogus code (a series of several stos).
The best theory so far is maybe the profiler's signal handler in rare cases corrupts the stack.
Comment 5•11 years ago
|
||
I'm no longer sure if this has anything to do with the profiler at all. There are other SIGILL crashes on Linux 32: https://bugzilla.mozilla.org/show_bug.cgi?id=1113543
Something the profiler does is exacerbating this problem, however.
Comment 6•11 years ago
|
||
Things that have been tested to not be the cause:
The profiling signal handler, when turned into basically a nop, still triggers the crash: https://treeherder.mozilla.org/#/jobs?repo=try&revision=551e8b9f6775
Turning off Ion still triggers the crash (the SM failures are due to forcefully turning off Ion): https://treeherder.mozilla.org/#/jobs?repo=try&revision=b1dd781f092c
Comment 7•11 years ago
|
||
Turning off Baseline still triggers the crash: https://treeherder.mozilla.org/#/jobs?repo=try&revision=3414bee3f023
Comment 8•11 years ago
|
||
Ah ha, turning off jemalloc (plus dmd and replace-malloc, since they depend on jemalloc) make the crashes go away: https://treeherder.mozilla.org/#/jobs?repo=try&revision=c7b15b4a118d
Perhaps this really is jemalloc bug, like not playing nice with our signal handlers on linux32?
Comment 9•11 years ago
|
||
Julian, I think you've wrestled with things like this in the past. Any insights about jemalloc + the profiler's signal handlers causing crashes on Linux 32bit only?
Flags: needinfo?(jseward)
Updated•11 years ago
|
Severity: normal → critical
Comment 10•11 years ago
|
||
I've come up with an xpcshell test case that reliably crashes. It stresses jemalloc from a custom xpcshell function while also running the sampler.
The interesting thing is that it *only* crashes on our Linux 32 EC2 instances. It crashes there fairly quickly after startup. I've run the same test on a physical Linux 32 machine with the same kernel version (3.2.0) and the same glibc version (2.15) for hours, with no crashes.
Here's a non-stripped --enable-debug --disable-optimize build that crashes on the loan slave I have: http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/shu@rfrn.org-85014fd8d865/try-linux-debug/
Comment 11•11 years ago
|
||
The custom xpcshell function that stress tests malloc.
Comment 12•11 years ago
|
||
xpcshell test that crashes on our linux32 slaves.
Comment 13•11 years ago
|
||
Even easier STR:
First, get a linux32 EC2 tst slave. The revision below is what's currently in m-c, see memory/jemalloc/upstream.info
1) Build jemalloc:
> git clone https://github.com/jemalloc/jemalloc jemalloc-b4acf
> cd jemalloc-b4acf
> git checkout b4acf7300a4ca3423ca36fe227e9bc2e23f25b9f
> autoconf
> mkdir build
> cd build
> ../configure --enable-debug --disable-valgrind --enable-stats
> make
2) Download this test as sampler-test.c. It spins up a profiler thread that sends SIGPROF to the main thread. The main thread loops 1024*1024*factor times, calling malloc.
3) Compile the test in the same build directory as in step 1 above.
> export JEMALLOC_PATH=`pwd`
> cc -std=gnu99 -pthread -g3 sampler-test.c -o sampler-test -L$JEMALLOC_PATH/lib -Wl,-rpath,$JEMALLOC_PATH/lib -ljemalloc
> ./sampler-test 30 # or whatever factor
The SIGPROF handler dumps some register info, and eventually it crashes.
Here's an example crash:
(gdb) run 30
<lots of spew>
...
eax: 0x00000000 ecx: 0x00000064 edx: 0x00000000 eip: 0xb7fa2bd4
Program received signal SIGSEGV, Segmentation fault.
(gdb) x $eip
0xb7fa2bd5 <je_bitmap_sfu+184>: 0xec458bf4
(gdb) x/2i $eip
=> 0xb7fa2bd5 <je_bitmap_sfu+184>: hlt
0xb7fa2bd6 <je_bitmap_sfu+185>: mov -0x14(%ebp),%eax
(gdb) x/3i je_bitmap_sfu+180
0xb7fa2bd1 <je_bitmap_sfu+180>: mov (%eax),%eax
0xb7fa2bd3 <je_bitmap_sfu+182>: mov %eax,-0xc(%ebp)
0xb7fa2bd6 <je_bitmap_sfu+185>: mov -0x14(%ebp),%eax
WTF is going on? The SIGPROF handler reports, on entry, that the saved EIP is 0xb7fa2bd5. Which is bogus, and is in between the valid instructions 0xb7fa2bd3 and 0xb7fa2bd6.
Updated•11 years ago
|
Summary: CRASH: browser/devtools/performance/test/browser_perf-details-04.js, browser_perf-states.js | application crashed [@ arena_run_reg_alloc] → jemalloc crashes during profiling on Linux32 EC2 slaves
Comment 14•11 years ago
|
||
If I turn off SA_RESTART in the test case, it instead reliably asserts:
<jemalloc>: src/include/jemalloc/internal/bitmap.h:142: Failed assertion: "!bitmap_get(bitmap, binfo, bit)"
Comment 15•11 years ago
|
||
NI-ing glandium again given the new debugging.
Flags: needinfo?(mh+mozilla)
Comment 16•11 years ago
|
||
The crash bisects to https://github.com/jemalloc/jemalloc/commit/84c8eefeffa246607790ad12e28b0f6a24ecc59d
Which is, I suppose, is unsurprising, since most of the crashes happen around bitmaps.
Comment 17•11 years ago
|
||
We narrowed it down on irc to a likely kernel bug fixed by upgrading to the latest from ubuntu 12.04.
Flags: needinfo?(mh+mozilla)
Comment 18•11 years ago
|
||
And this is the kernel bugfix:
http://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=6488ee494d5fbac63fb7c8e2fc3400c3dd53972f
Updated•11 years ago
|
Flags: needinfo?(jseward)
| Reporter | ||
Comment 21•11 years ago
|
||
FIXED!
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Updated•7 years ago
|
Product: Firefox → DevTools
You need to log in
before you can comment on or make changes to this bug.
Description
•