Closed Bug 471585 Opened 11 years ago Closed 11 years ago

JIT causes maemo hang/crash


(Core :: JavaScript Engine, defect, critical)

Not set





(Reporter: aki, Assigned: vlad)




(Keywords: mobile)


(1 file)

I've been isolating this and tp is consistently dying on on the n810's.  Most of the time it's during cycle 1; occasionally it's in cycle 2, but I've seen over 5 crashes or hangs/disconnects while Fennec is hitting this page.

I've got plenty of swap (>200mb), disabled the ptmalloc thresholds, and made sure I have enough disk on / .  I was able to ssh in to an n810 while it was hanging on and verify that plenty of swap was available via the 'free' command; uptime showed load slightly above 3.

My browser_wait is 90 and my pagetimeout is 700.
After commenting this page out of the manifest.txt, the tests started hanging on other pages. was the culprit this time; since it's right before the commented-out, I'm guessing it might have something to do with the *next* test,

Still investigating on this end.
I believe I've commented out and it crashed elsewhere.  I've also tried adding dougt's patch from bug 454731, and I'm still crashing.  Since it looks like a hang when I observe this on-screen, I'm wondering if it's actually a timeout.

Also, with the amount of swap I have I'd be surprised if this is oom.  If we (or maemo) rely on having free physical memory, however, that's certainly possibly the culprit.
On a device that's hung,

maemo-n810-07:~# free
              total         used         free       shared      buffers
  Mem:       126796       103304        23492            0           12
 Swap:       260020       162512        97508
Total:       386816       265816       121000

But in top, fennec appears to be gobbling up more and more memory:

 1594 root     RW      374M  1593  4.9301.5 fennec

I saw it go up to 440M.  Load went from ~3 to over 10.

Possibly running this against a debug build would help.  In the meantime, I'm going to try disabling JIT in all.js.
maemo-n810-07:~# free
              total         used         free       shared      buffers
  Mem:       126796       125228         1568            0           12
 Swap:       260020       260020            0
Total:       386816       385248         1568

oom certainly is a possibility, but I'm not sure why it's using all that memory.
Changing the javascript.options.jit.content pref to false in all.js fixes the crash.

Changing the summary and downgrading to major since we have a workaround.
Severity: critical → major
Summary: maemo talos tp crasher on → JIT causes maemo talos tp crash
Assignee: nobody → general
Component: General → JavaScript Engine
Product: Fennec → Core
QA Contact: general → general
Target Milestone: --- → mozilla1.9.1
(In reply to comment #5)
> Changing the javascript.options.jit.content pref to false in all.js fixes the
> crash.
> Changing the summary and downgrading to major since we have a workaround.

I understand your reasoning here, but I would argue this is still critical because we are trying to measure end-user performance, and end-users will be running with default prefs (jit on).

A set of perf numbers with jit off would be interesting, to measure its impact (once this is fixed), but in the end-game we must be able to measure this with jit on, so I think it should stay critical.  

My 2 cents...
Severity: major → critical says crashers have the "critical" severity.

As a practical matter, tho, if the existence of a workaround meant severity could always be downgraded, pretty much every crasher (app-internal crashes excepted) would only be major, because you can just not visit the site that caused the crash.  Workarounds generally don't affect severity.
We need stacks for crashes, and minimized test cases (down to a single page, then cut out script until it stops crashing, etc.), in separate bugs.  This isn't really actionable for people who can't access the still-not-public pageset, and we don't have bandwidth in the JS team to be setting up Tp on devices to get stacks and so forth.
I've narrowed this down to the page.  Still working on narrowing down to a set of scripts and getting a stack trace.
Looks like hitting with Fennec 1.0b1 pretty much hangs the browser; no need to set up Tp.
Adding Vlad in case he gets some time to look.
I'll check if I have time, but this probably wants latest tracemonkey tree arm fixes.
Summary: JIT causes maemo talos tp crash → JIT causes maemo hang/crash
No longer blocks: 455755
I had this in gdb. Here is "info all" from the crash:

(gdb) inf all
r0             0x0	0
r1             0xb83	2947
r2             0x6	6
r3             0x4000c750	1073792848
r4             0xb83	2947
r5             0x6	6
r6             0x41136c38	1091791928
r7             0x10c	268
r8             0xc34	3124
r9             0xc38	3128
r10            0x41136000	1091788800
r11            0xbe8dfe6c	3196976748
r12            0xbe8dfd50	3196976464
sp             0xbe8dfd40	0xbe8dfd40
lr             0x41050e40	1090850368
pc             0x41050e74	0x41050e74
f0             0	(raw 0x000189880000000000000000)
f1             0	(raw 0x000189880000000000000000)
f2             0	(raw 0x000189880000000000000000)
f3             0	(raw 0x000189880000000000000000)
f4             0	(raw 0x000189880000000000000000)
f5             0	(raw 0x000189880000000000000000)
f6             0	(raw 0x000189880000000000000000)
f7             0	(raw 0x000189880000000000000000)
fps            0x0	0
cpsr           0x20000010	536870928
Oh, this build is using tracemonkey from 6/30/2009
Here is a disas $pc-128 $pc+128:

Dump of assembler code from 0x41050df4 to 0x41050ef4:
0x41050df4:	moveq	r5, #268435456	; 0x10000000
0x41050df8:	movne	r5, #0	; 0x0
0x41050dfc:	orr	r3, r3, r1
0x41050e00:	mov	r0, r6
0x41050e04:	add	r1, sp, #144	; 0x90
0x41050e08:	mov	r2, sp
0x41050e0c:	str	r3, [r4, r12, lsl #2]
0x41050e10:	str	r5, [sp, #276]
0x41050e14:	bl	0x41051088
0x41050e18:	cmp	r0, #0	; 0x0
0x41050e1c:	ldrge	r1, [sp]
0x41050e20:	mvnlt	r1, #0	; 0x0
0x41050e24:	b	0x41050dac
0x41050e28:	andeq	r5, lr, r8, lsl #5
0x41050e2c:	andeq	r5, lr, r12, ror #6
0x41050e30:	andeq	r2, r0, r4, lsr #29
0x41050e34:	stmdb	sp!, {r4, r5, r7, lr}
0x41050e38:	mov	r5, r0
0x41050e3c:	bl	0x4103c570
0x41050e40:	ldr	r4, [r0, #-1112]
0x41050e44:	mov	r3, r0
0x41050e48:	cmp	r4, #0	; 0x0
0x41050e4c:	ldr	r0, [r0, #-1108]
0x41050e50:	bne	0x41050e90
0x41050e54:	mov	r7, #224	; 0xe0
0x41050e58:	svc	0x00000000
0x41050e5c:	mov	r4, r0
0x41050e60:	str	r4, [r3, #-1112]
0x41050e64:	mov	r2, r5
0x41050e68:	mov	r1, r4
0x41050e6c:	mov	r7, #268	; 0x10c
0x41050e70:	svc	0x00000000
0x41050e74:	cmn	r0, #4096	; 0x1000
0x41050e78:	mov	r1, r0
0x41050e7c:	bhi	0x41050eac
0x41050e80:	cmn	r0, #1	; 0x1
0x41050e84:	beq	0x41050ec4
0x41050e88:	mov	r0, r1
0x41050e8c:	ldmia	sp!, {r4, r5, r7, pc}
0x41050e90:	cmp	r0, #0	; 0x0
0x41050e94:	bgt	0x41050e64
0x41050e98:	bic	r3, r0, #-2147483648	; 0x80000000
0x41050e9c:	cmp	r3, #0	; 0x0
0x41050ea0:	rsbne	r0, r0, #0	; 0x0
0x41050ea4:	moveq	r0, r4
0x41050ea8:	b	0x41050e64
0x41050eac:	ldr	r3, [pc, #104]	; 0x41050f1c
0x41050eb0:	rsb	r2, r1, #0	; 0x0
0x41050eb4:	bl	0x4103c570
0x41050eb8:	ldr	r3, [pc, r3]
0x41050ebc:	mvn	r1, #0	; 0x0
0x41050ec0:	str	r2, [r0, r3]
0x41050ec4:	ldr	r3, [pc, #84]	; 0x41050f20
0x41050ec8:	bl	0x4103c570
0x41050ecc:	ldr	r3, [pc, r3]
0x41050ed0:	ldr	r2, [r0, r3]
0x41050ed4:	cmp	r2, #38	; 0x26
0x41050ed8:	bne	0x41050e88
0x41050edc:	mov	r1, r5
0x41050ee0:	mov	r0, r4
0x41050ee4:	mov	r7, #238	; 0xee
0x41050ee8:	svc	0x00000000
0x41050eec:	cmn	r0, #4096	; 0x1000
0x41050ef0:	mov	r2, r0
End of assembler dump.
running: x/4x 0x41050e6c

0x41050e6c:	0xe3a07f43	0xef000000	0xe3700a01	0xe1a01000
$pc is 0x41050e74, so:

0x41050e68:    mov    r1, r4
0x41050e6c:    mov    r7, #268    ; 0x10c
0x41050e70:    svc    0x00000000
0x41050e74:    cmn    r0, #4096    ; 0x1000

0x41050e6c: 0xe3a07f43 0xef000000 0xe3700a01 0xe1a01000

That svc 0x0 (0xef000000) looks extremely suspicious.  I'm not sure what gdb's $pc means, whether it's real pc + 8, as with the arm assembly view, or whether it's the next insn to be executed.  If the latter then the crash is on that svc instruction, no idea where it's coming from; we insert some breakpoint instructions in a few places as filler, but they always follow branches which isn't the case here.  And that's not a breakpoint instruction.

There are a bunch of those that appear in the above disassembly which is also strange...
Indeed, we never emit SVC for anything, so these are very suspicious!

There are some clues. The mystery SVCs usually (but not always) preceed a CMN instruction. A quick grep of the source indicates that CMN is only ever emitted by asm_cmpi. Tracing up the possible call stacks, we find that CMN can only be emitted by asm_branch, asm_cond or asm_cmov.

Because each cmn in your disassembly is proceeded by a MOV, asm_cmov is probably a good place to start. Can we have a dump of the LIR stream that caused the instructions to be generated please? I'm not sure how you do that in the browser, but it would certainly be useful. asm_cmov can come from either LIR_qcmov or LIR_cmov.
#19: browser works the same way, via the environment

export TMFLAGS=full
Hrm.  In the future, please do a debug build and try there -- the ARM backend is really good at having lots of assertions, and I'll look into figuring out how to enable them unconditionally for a while.

This problem is:

Assertion failed: ((-(1<<12)) <= (dr) && (dr) < (1<<12)) (c:/proj/firefox-wince-dbg/js/src/../../../mozilla-central/js/src/nanojit/NativeARM.cpp:658)

specifically, calling STR(ra, rb, dr) with a large dr.  Patch coming up.
Attached patch fixSplinter Review
Fix -- if dr won't fit in S12, just load the full address into IP first before calling STR.
Assignee: general → vladimir
Attachment #389213 - Flags: superreview?(Jacob.Bramley)
Attachment #389213 - Flags: review?(gal)
Attachment #389213 - Flags: superreview?(Jacob.Bramley)
Attachment #389213 - Flags: review?(gal)
Attachment #389213 - Flags: review?(Jacob.Bramley)
Attachment #389213 - Flags: review+
Closed: 11 years ago
Resolution: --- → FIXED
Comment on attachment 389213 [details] [diff] [review]

Yep, looks good.
Attachment #389213 - Flags: review?(Jacob.Bramley) → review+
I can confirm that loads on the n810 with JIT enabled. Thanks Vlad.
You need to log in before you can comment on or make changes to this bug.