Closed Bug 1139386 Opened 9 years ago Closed 9 years ago

jemalloc crashes during profiling on Linux32 EC2 slaves

Categories

(DevTools :: Performance Tools (Profiler/Timeline), defect)

defect
Not set
critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: vporof, Unassigned)

References

Details

Attachments

(3 files)

The crash is inside jemalloc. Is this user error or an a jemalloc problem?
Flags: needinfo?(mh+mozilla)
21:07:19 INFO - Crash reason: SIGILL
21:07:19 INFO - Crash address: 0x805e292
21:07:19 INFO - Thread 0 (crashed)
21:07:19 INFO - 0 firefox!arena_run_reg_alloc [bitmap.h:04a2d0398e39 : 147 + 0xa]
21:07:19 INFO - eip = 0x0805e292 esp = 0xbf973a90 ebp = 0xbf973ab8 ebx = 0x0808e238
21:07:19 INFO - esi = 0x94902df4 edi = 0x0808eda4 eax = 0x00000000 ecx = 0x00000001
21:07:19 INFO - edx = 0x00000008 efl = 0x00210286
21:07:19 INFO - Found by: given as instruction pointer in context
21:07:19 INFO - 1 firefox!je_arena_malloc_small [arena.c:04a2d0398e39 : 1592 + 0xf]
21:07:19 INFO - eip = 0x0805fc43 esp = 0xbf973ac0 ebp = 0xbf973ae8 ebx = 0x0808e238
21:07:19 INFO - esi = 0xb73000c0 edi = 0xb73008bc
21:07:19 INFO - Found by: call frame info 

$ wget http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/gum-linux-debug/1425434990/firefox-39.0a1.en-US.linux-i686.tar.bz2
$ tar -jxf firefox-39.0a1.en-US.linux-i686.tar.bz2
$ objdump -d --start-address 0x0805e28d --stop-address 0x0805e296 firefox/firefox # Thanks to the lack of ASLR for the executable
firefox/firefox:     file format elf32-i386


Disassembly of section .text:

0805e28d <jemalloc_free_dirty_pages+0x11d8>:
 805e28d:	8b 4d d8             	mov    -0x28(%ebp),%ecx
 805e290:	33 44 8e 08          	xor    0x8(%esi,%ecx,4),%eax
 805e294:	85 c0                	test   %eax,%eax

-> The crash address (0x0805e292) is in the middle of an instruction

The next frame is plausible:
$ objdump -d --start-address 0x0805fc3e --stop-address 0x0805fc43 firefox/firefox

firefox/firefox:     file format elf32-i386


Disassembly of section .text:

0805fc3e <jemalloc_free_dirty_pages+0x2b89>:
 805fc3e:	e8 cb e5 ff ff       	call   805e20e <jemalloc_free_dirty_pages+0x1159>

So it is plausible that it came to that crash address via that call and executing code starting at that address.

This is the complete function where the crash is happening:

$ objdump -d --start-address 0x805e20e --stop-address 0x805e320 firefox/firefox

firefox/firefox:     file format elf32-i386


Disassembly of section .text:

0805e20e <jemalloc_free_dirty_pages+0x1159>:
 805e20e:	55                   	push   %ebp
 805e20f:	b9 ff ff ff ff       	mov    $0xffffffff,%ecx
 805e214:	89 e5                	mov    %esp,%ebp
 805e216:	57                   	push   %edi
 805e217:	89 d7                	mov    %edx,%edi
 805e219:	56                   	push   %esi
 805e21a:	89 c6                	mov    %eax,%esi
 805e21c:	53                   	push   %ebx
 805e21d:	83 ec 1c             	sub    $0x1c,%esp
 805e220:	8d 42 14             	lea    0x14(%edx),%eax
 805e223:	89 45 e8             	mov    %eax,-0x18(%ebp)
 805e226:	8b 47 18             	mov    0x18(%edi),%eax
 805e229:	8d 56 08             	lea    0x8(%esi),%edx
 805e22c:	89 55 e4             	mov    %edx,-0x1c(%ebp)
 805e22f:	e8 0a f9 fe ff       	call   804db3e <_start+0xa96>
 805e234:	81 c3 04 00 03 00    	add    $0x30004,%ebx
 805e23a:	48                   	dec    %eax
 805e23b:	8b 54 87 1c          	mov    0x1c(%edi,%eax,4),%edx
 805e23f:	0f bc 54 96 08       	bsf    0x8(%esi,%edx,4),%edx
 805e244:	0f 44 d1             	cmove  %ecx,%edx
 805e247:	89 55 f0             	mov    %edx,-0x10(%ebp)
 805e24a:	eb 26                	jmp    805e272 <jemalloc_free_dirty_pages+0x11bd>
 805e24c:	8b 4d e8             	mov    -0x18(%ebp),%ecx
 805e24f:	48                   	dec    %eax
 805e250:	8b 55 f0             	mov    -0x10(%ebp),%edx
 805e253:	03 54 81 08          	add    0x8(%ecx,%eax,4),%edx
 805e257:	8b 4d e4             	mov    -0x1c(%ebp),%ecx
 805e25a:	0f bc 14 91          	bsf    (%ecx,%edx,4),%edx
 805e25e:	b9 ff ff ff ff       	mov    $0xffffffff,%ecx
 805e263:	0f 44 d1             	cmove  %ecx,%edx
 805e266:	8b 4d f0             	mov    -0x10(%ebp),%ecx
 805e269:	c1 e1 05             	shl    $0x5,%ecx
 805e26c:	8d 0c 0a             	lea    (%edx,%ecx,1),%ecx
 805e26f:	89 4d f0             	mov    %ecx,-0x10(%ebp)
 805e272:	85 c0                	test   %eax,%eax
 805e274:	75 d6                	jne    805e24c <jemalloc_free_dirty_pages+0x1197>
 805e276:	8b 4d f0             	mov    -0x10(%ebp),%ecx
 805e279:	b0 01                	mov    $0x1,%al
 805e27b:	c7 45 ec 01 00 00 00 	movl   $0x1,-0x14(%ebp)
 805e282:	c1 e9 05             	shr    $0x5,%ecx
 805e285:	89 4d d8             	mov    %ecx,-0x28(%ebp)
 805e288:	8b 4d f0             	mov    -0x10(%ebp),%ecx
 805e28b:	d3 e0                	shl    %cl,%eax
 805e28d:	8b 4d d8             	mov    -0x28(%ebp),%ecx
 805e290:	33 44 8e 08          	xor    0x8(%esi,%ecx,4),%eax
 805e294:	85 c0                	test   %eax,%eax
 805e296:	89 44 8e 08          	mov    %eax,0x8(%esi,%ecx,4)
 805e29a:	74 70                	je     805e30c <jemalloc_free_dirty_pages+0x1257>
 805e29c:	8b 8b 7c 11 00 00    	mov    0x117c(%ebx),%ecx
 805e2a2:	8d 46 f8             	lea    -0x8(%esi),%eax
 805e2a5:	8b 57 34             	mov    0x34(%edi),%edx
 805e2a8:	f7 d1                	not    %ecx
 805e2aa:	21 c1                	and    %eax,%ecx
 805e2ac:	2b 83 20 11 00 00    	sub    0x1120(%ebx),%eax
 805e2b2:	01 ca                	add    %ecx,%edx
 805e2b4:	89 55 ec             	mov    %edx,-0x14(%ebp)
 805e2b7:	31 d2                	xor    %edx,%edx
 805e2b9:	29 c8                	sub    %ecx,%eax
 805e2bb:	b9 54 00 00 00       	mov    $0x54,%ecx
 805e2c0:	f7 f1                	div    %ecx
 805e2c2:	8b 55 f0             	mov    -0x10(%ebp),%edx
 805e2c5:	0f af 57 08          	imul   0x8(%edi),%edx
 805e2c9:	ff 4e 04             	decl   0x4(%esi)
 805e2cc:	03 83 2c 11 00 00    	add    0x112c(%ebx),%eax
 805e2d2:	c1 e0 0c             	shl    $0xc,%eax
 805e2d5:	03 45 ec             	add    -0x14(%ebp),%eax
 805e2d8:	01 d0                	add    %edx,%eax
 805e2da:	eb 3d                	jmp    805e319 <jemalloc_free_dirty_pages+0x1264>
 805e2dc:	8b 4d d8             	mov    -0x28(%ebp),%ecx
 805e2df:	8b 45 ec             	mov    -0x14(%ebp),%eax
 805e2e2:	8b 55 e8             	mov    -0x18(%ebp),%edx
 805e2e5:	c1 e9 05             	shr    $0x5,%ecx
 805e2e8:	89 4d e0             	mov    %ecx,-0x20(%ebp)
 805e2eb:	03 4c 82 08          	add    0x8(%edx,%eax,4),%ecx
 805e2ef:	ba 01 00 00 00       	mov    $0x1,%edx
 805e2f4:	8b 45 e4             	mov    -0x1c(%ebp),%eax
 805e2f7:	8d 04 88             	lea    (%eax,%ecx,4),%eax
 805e2fa:	8b 4d d8             	mov    -0x28(%ebp),%ecx
 805e2fd:	d3 e2                	shl    %cl,%edx
 805e2ff:	31 10                	xor    %edx,(%eax)
 805e301:	75 99                	jne    805e29c <jemalloc_free_dirty_pages+0x11e7>
 805e303:	8b 45 e0             	mov    -0x20(%ebp),%eax
 805e306:	ff 45 ec             	incl   -0x14(%ebp)
 805e309:	89 45 d8             	mov    %eax,-0x28(%ebp)
 805e30c:	8b 55 e8             	mov    -0x18(%ebp),%edx
 805e30f:	8b 45 ec             	mov    -0x14(%ebp),%eax
 805e312:	3b 42 04             	cmp    0x4(%edx),%eax
 805e315:	72 c5                	jb     805e2dc <jemalloc_free_dirty_pages+0x1227>
 805e317:	eb 83                	jmp    805e29c <jemalloc_free_dirty_pages+0x11e7>
 805e319:	83 c4 1c             	add    $0x1c,%esp
 805e31c:	5b                   	pop    %ebx
 805e31d:	5e                   	pop    %esi
 805e31e:	5f                   	pop    %edi
 805e31f:	5d                   	pop    %ebp

It only contains one call, to this function:

$ objdump -d --start-address 0x804db3e --stop-address 0x804db42 firefox/firefox

firefox/firefox:     file format elf32-i386


Disassembly of section .text:

0804db3e <_start+0xa96>:
 804db3e:	8b 1c 24             	mov    (%esp),%ebx
 804db41:	c3                   	ret    


That's a pretty typical __i686.get_pc_thunk.bx.
Flags: needinfo?(mh+mozilla)
https://treeherder.mozilla.org/logviewer.html#?job_id=51888&repo=gum
Summary: CRASH: browser/devtools/performance/test/browser_perf-details-04.js | application crashed [@ arena_run_reg_alloc] → CRASH: browser/devtools/performance/test/browser_perf-details-04.js, browser_perf-states.js | application crashed [@ arena_run_reg_alloc]
Update:

I can only reproduce crashes on a loaned linux32 slave, but I still have no clue how to debug this. At some point the instruction pointer gets corrupted. Most of the time, the crash is SIGILL in jemalloc per above. Rarely I've also gotten SIGSEGV in code that shouldn't be called (like vp9 decoder code). Examining %eip revealed what looked like bogus code (a series of several stos).

The best theory so far is maybe the profiler's signal handler in rare cases corrupts the stack.
I'm no longer sure if this has anything to do with the profiler at all. There are other SIGILL crashes on Linux 32: https://bugzilla.mozilla.org/show_bug.cgi?id=1113543

Something the profiler does is exacerbating this problem, however.
Things that have been tested to not be the cause:

The profiling signal handler, when turned into basically a nop, still triggers the crash: https://treeherder.mozilla.org/#/jobs?repo=try&revision=551e8b9f6775

Turning off Ion still triggers the crash (the SM failures are due to forcefully turning off Ion): https://treeherder.mozilla.org/#/jobs?repo=try&revision=b1dd781f092c
Turning off Baseline still triggers the crash: https://treeherder.mozilla.org/#/jobs?repo=try&revision=3414bee3f023
Ah ha, turning off jemalloc (plus dmd and replace-malloc, since they depend on jemalloc) make the crashes go away: https://treeherder.mozilla.org/#/jobs?repo=try&revision=c7b15b4a118d

Perhaps this really is jemalloc bug, like not playing nice with our signal handlers on linux32?
Julian, I think you've wrestled with things like this in the past. Any insights about jemalloc + the profiler's signal handlers causing crashes on Linux 32bit only?
Flags: needinfo?(jseward)
Severity: normal → critical
I've come up with an xpcshell test case that reliably crashes. It stresses jemalloc from a custom xpcshell function while also running the sampler.

The interesting thing is that it *only* crashes on our Linux 32 EC2 instances. It crashes there fairly quickly after startup. I've run the same test on a physical Linux 32 machine with the same kernel version (3.2.0) and the same glibc version (2.15) for hours, with no crashes.

Here's a non-stripped --enable-debug --disable-optimize build that crashes on the loan slave I have: http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/shu@rfrn.org-85014fd8d865/try-linux-debug/
The custom xpcshell function that stress tests malloc.
Attached file stress-malloc.js
xpcshell test that crashes on our linux32 slaves.
Attached file sampler-test.c
Even easier STR:

First, get a linux32 EC2 tst slave. The revision below is what's currently in m-c, see memory/jemalloc/upstream.info

1) Build jemalloc:

> git clone https://github.com/jemalloc/jemalloc jemalloc-b4acf
> cd jemalloc-b4acf
> git checkout b4acf7300a4ca3423ca36fe227e9bc2e23f25b9f
> autoconf
> mkdir build
> cd build
> ../configure --enable-debug --disable-valgrind --enable-stats
> make

2) Download this test as sampler-test.c. It spins up a profiler thread that sends SIGPROF to the main thread. The main thread loops 1024*1024*factor times, calling malloc.

3) Compile the test in the same build directory as in step 1 above.

> export JEMALLOC_PATH=`pwd`
> cc -std=gnu99 -pthread -g3 sampler-test.c -o sampler-test -L$JEMALLOC_PATH/lib -Wl,-rpath,$JEMALLOC_PATH/lib -ljemalloc
> ./sampler-test 30 # or whatever factor

The SIGPROF handler dumps some register info, and eventually it crashes.

Here's an example crash:

(gdb) run 30
<lots of spew>
...
eax: 0x00000000 ecx: 0x00000064 edx: 0x00000000 eip: 0xb7fa2bd4

Program received signal SIGSEGV, Segmentation fault.

(gdb) x $eip
0xb7fa2bd5 <je_bitmap_sfu+184>: 0xec458bf4

(gdb) x/2i $eip
=> 0xb7fa2bd5 <je_bitmap_sfu+184>:      hlt
   0xb7fa2bd6 <je_bitmap_sfu+185>:      mov    -0x14(%ebp),%eax

(gdb) x/3i je_bitmap_sfu+180
   0xb7fa2bd1 <je_bitmap_sfu+180>:      mov    (%eax),%eax
   0xb7fa2bd3 <je_bitmap_sfu+182>:      mov    %eax,-0xc(%ebp)
   0xb7fa2bd6 <je_bitmap_sfu+185>:      mov    -0x14(%ebp),%eax

WTF is going on? The SIGPROF handler reports, on entry, that the saved EIP is 0xb7fa2bd5. Which is bogus, and is in between the valid instructions 0xb7fa2bd3 and 0xb7fa2bd6.
Summary: CRASH: browser/devtools/performance/test/browser_perf-details-04.js, browser_perf-states.js | application crashed [@ arena_run_reg_alloc] → jemalloc crashes during profiling on Linux32 EC2 slaves
If I turn off SA_RESTART in the test case, it instead reliably asserts:

<jemalloc>: src/include/jemalloc/internal/bitmap.h:142: Failed assertion: "!bitmap_get(bitmap, binfo, bit)"
NI-ing glandium again given the new debugging.
Flags: needinfo?(mh+mozilla)
The crash bisects to https://github.com/jemalloc/jemalloc/commit/84c8eefeffa246607790ad12e28b0f6a24ecc59d

Which is, I suppose, is unsurprising, since most of the crashes happen around bitmaps.
We narrowed it down on irc to a likely kernel bug fixed by upgrading to the latest from ubuntu 12.04.
Flags: needinfo?(mh+mozilla)
Depends on: 1141339
Flags: needinfo?(jseward)
FIXED!
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Firefox → DevTools
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: