top crash [@ nanojit::LIns::isTramp()]

RESOLVED FIXED

Status

()

Core
JavaScript Engine
P2
critical
RESOLVED FIXED
9 years ago
3 years ago

People

(Reporter: Samuel Sidler (old account; do not CC), Assigned: graydon)

Tracking

({crash, fixed1.9.1, topcrash})

1.9.1 Branch
crash, fixed1.9.1, topcrash
Points:
---
Dependency tree / graph
Bug Flags:
blocking1.9.1 +
wanted1.9.0.x -
in-testsuite -

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [sg:critical?], crash signature, URL)

Attachments

(1 attachment)

(I didn't see this filed already, but feel free to dupe it if so.)

The number 1 topcrash for Firefox 3.5b4 (and #3 in 1.9.1 branch nightly builds) happens at nanojit::LIns::isTramp(). I *don't* see this crash on trunk, but it's clear that it's still happening in (at least) yesterday's Shiretoko build.

From bp-b22c114a-0ae3-4684-b3b6-c0bbb2090505:

Crashing Thread
Frame 	Module 	Signature 	Source
0 	js3250.dll 	nanojit::LIns::isTramp() 	js/src/nanojit/LIR.h:347
1 	js3250.dll 	nanojit::LInsHashSet::hashcode(nanojit::LIns*) 	js/src/nanojit/LIR.cpp:1318
2 	js3250.dll 	nanojit::LInsHashSet::grow() 	js/src/nanojit/LIR.cpp:1380
3 	js3250.dll 	nanojit::CseFilter::ins2(nanojit::LOpcode,nanojit::LIns*,nanojit::LIns*) 	js/src/nanojit/LIR.cpp:2028
4 	js3250.dll 	nanojit::ExprFilter::ins2(nanojit::LOpcode,nanojit::LIns*,nanojit::LIns*) 	js/src/nanojit/LIR.cpp:984
5 	js3250.dll 	FuncFilter::ins2(nanojit::LOpcode,nanojit::LIns*,nanojit::LIns*) 	js/src/jstracer.cpp:937
6 	js3250.dll 	nanojit::LirWriter::ins2i(nanojit::LOpcode,nanojit::LIns*,int) 	js/src/nanojit/LIR.cpp:747
7 	js3250.dll 	TraceRecorder::record_FastNativeCallComplete() 	js/src/jstracer.cpp:8282
8 	js3250.dll 	js_Interpret 	js/src/jsinterp.cpp:5127
9 	js3250.dll 	js_Invoke 	js/src/jsinterp.cpp:1373
10 	js3250.dll 	js_fun_call 	js/src/jsfun.cpp:1688
11 	js3250.dll 	js_Interpret 	js/src/jsinterp.cpp:5100
12 	js3250.dll 	js_Invoke 	js/src/jsinterp.cpp:1373
13 	js3250.dll 	js_fun_call 	js/src/jsfun.cpp:1688
14 	js3250.dll 	js_Interpret 	js/src/jsinterp.cpp:5100
15 	js3250.dll 	js_Invoke 	js/src/jsinterp.cpp:1373
16 	js3250.dll 	js_fun_call 	js/src/jsfun.cpp:1688
17 	js3250.dll 	js_Interpret 	js/src/jsinterp.cpp:5100
18 	js3250.dll 	js_Invoke 	js/src/jsinterp.cpp:1373
19 	xul.dll 	nsXPCWrappedJSClass::CallMethod(nsXPCWrappedJS*,unsigned short,XPTMethodDescriptor const*,nsXPTCMiniVariant*) 	js/src/xpconnect/src/xpcwrappedjsclass.cpp:1614
20 	xul.dll 	nsXPCWrappedJS::CallMethod(unsigned short,XPTMethodDescriptor const*,nsXPTCMiniVariant*) 	js/src/xpconnect/src/xpcwrappedjs.cpp:561
21 	xul.dll 	PrepareAndDispatch 	xpcom/reflect/xptcall/src/md/win32/xptcstubs.cpp:114
22 	xul.dll 	SharedStub 	xpcom/reflect/xptcall/src/md/win32/xptcstubs.cpp:141
23 	xul.dll 	nsEventListenerManager::HandleEventSubType(nsListenerStruct*,nsIDOMEventListener*,nsIDOMEvent*,nsPIDOMEventTarget*,unsigned int) 	content/events/src/nsEventListenerManager.cpp:1101
Flags: blocking1.9.1?

Updated

9 years ago
Flags: blocking1.9.1? → blocking1.9.1+
(Assignee)

Updated

9 years ago
Assignee: general → graydon

Updated

9 years ago
Priority: -- → P2

Comment 1

9 years ago
From looking at the surrounding code this looks like access to unmapped memory (random pointer), not a NULL pointer deref. The random pointer is coming from a table in memory (wild pointer?). Anyway. Until we know a bit more, I will make this invisible.
Group: core-security
Whiteboard: [sg:critical?]
(Assignee)

Comment 2

9 years ago
Yeah, badtimes. If you look through the topcrashes, there are all sorts of random callers that funnel into isTramp() and fall over. Clearly there's nothing wrong with isTramp -- it does almost nothing -- but a lot of people are getting bogus LIns* values. That's bad. Some reproducing-URLs would be wonderful, sigh.

If anyone wants another excuse to attempt landing bug 488775, I nominate this one :(
(In reply to comment #2)
> Yeah, badtimes. If you look through the topcrashes, there are all sorts of
> random callers that funnel into isTramp() and fall over. Clearly there's
> nothing wrong with isTramp -- it does almost nothing -- but a lot of people are
> getting bogus LIns* values. That's bad. Some reproducing-URLs would be
> wonderful, sigh.

I will check my topsite test Logs and do some extra topsite test runs to find a url and hopefully a Testcase !
I filed bug 491606 (security sensitive) to get URLs for this bug.
(Assignee)

Comment 5

9 years ago
An update: my suspicion is the trampoline-forming code is subtly wrong, but it's hard to eyeball where. So I cooked up a synthetic load on the LIR assembler that just makes a lot of random-sized skips and loads from way back in the stream, to hit lots of random possible trampoline sizes.

Result is rapid and reliable crashing of nanojit. So I'll work backwards now and try to reduce one of those cases.

(We need to start fuzzing via the LIR assembler, in case there is any doubt or lack of clarity here. It's a much faster way of hitting the worst cases TM is going to throw at it.)
(Reporter)

Updated

9 years ago
Depends on: 491606
In case you still want URLs, attachment 375902 [details] has them.
(In reply to comment #6)
> In case you still want URLs, attachment 375902 [details] has them.

thx ! i will run this urls through my Topsite automation !
(Assignee)

Comment 8

9 years ago
Heh. False belief on my part, figures. With a cocky comment like that, I was asking for it. Bug in assembler. Still attacking, can't reproduce yet :(
(Assignee)

Comment 9

9 years ago
Next exciting discovery: we cap our skip() values at 4094 bytes, but attempting to assemble anything more than 4084 bytes will trip a samepage() assert in a debug build. And unfortunately, everything between that gap does "something not equivalent to safely stopping" in an opt build.

Back at it tomorrow.
(Assignee)

Comment 10

9 years ago
Created attachment 376070 [details] [diff] [review]
First correction

I am not sure if this is actually causing the reported topcrash, but it was
the first seriously-incorrect thing I stumbled into while looking at code
paths related to the bug, so I figured I'd fix it. I'd suggest landing it
anyways and measuring the field response of the crash in nightlies, though
of course in the meantime I'll carry on looking for a way to trigger the
real case.

The problem here is that our skip() cap is wrong. The first obviously-wrong
thing about it is that it subtracts a "slot" (i.e. pointer-size) count from
a byte count. That doesn't work, duh. Graydon fails pointer arithmetic
101. The next less-obviously-wrong thing about it relates to the way the
code in LIR.cpp actually treats page boundary conditions.

Pages of LIR are 4096 bytes long. When you try to make an allocation that
goes off a page, the LIR buffer writer advances to the next page and puts a
far tramp (8 bytes) in the new page, pointing back to the position in the
old. So one might be forgiven for thinking that 4096-8=4088 bytes is the
largest skip() allocation you can do. Unfortunately if we set the limit
there we are still wrong: it decodes whether or not to advance from one
page to the next by taking the current write pointer, _unused, adding a
proposed distance to it, and checking to see if it's still "on the same
page" by masking to page boundaries and comparing. If you advance by 4088
bytes, you get a new page with an 8 byte far tramp at the beginning, then
4088 bytes of allocated space, and your write pointer _unused is sitting at
byte 8+4088=4096. I.e. "the first byte of the *next* page". So subsequent
writes to _unused+N appear very much to be on "the same page" as _unused,
and succeed without allocating fresh pages. Thereby scribbling into
random memory.

(As it turns out, calling skip() explicitly does a write of another
trampoline immediately after the skip quantity, because skipping and
page-boundary-transitions are ignorant of one another; so the simple
assembly program "skip 4088" is enough to get valgrind to note a UMR).

Anyway, net result is you have to cap "a little less than 4088". 4087 would
be ok but for the sake of preserving alignment in the buffer (other bits of
code seem to like keeping it) I decided to pull it back by a full
pointer-width, to 4084.

For the sake of trapping this earlier if it happens again, I also added an
assert to nanojit itself in the page-boundary-crossing path (ensureRoom())
to check that it's not exceeding 4084 bytes as well. Though that assert,
like all good sanity checking, is turned off in production builds :(
Attachment #376070 - Flags: review?(gal)

Comment 11

9 years ago
Comment on attachment 376070 [details] [diff] [review]
First correction

This might very well explain the crash, along with the occasional tinderbox crashes we saw in nanojit. Here is hopeing.
Attachment #376070 - Flags: review?(gal) → review+
(Assignee)

Comment 12

9 years ago
http://hg.mozilla.org/tracemonkey/rev/37d6f4a3c43f on that one. Keep the bug open for now, we'll have to see if crashes drop when it makes it to nightlies.

Am I correct in thinking that the nightly stream to watch at this point is 3.5b5pre? And/or 3.6a1?

Comment 13

9 years ago
>
> Am I correct in thinking that the nightly stream to watch at this point is
> 3.5b5pre? And/or 3.6a1?

3.5b5pre is mozilla-1.9.1 nightly. 3.6a1 is mozilla-central nightly.
(Assignee)

Comment 14

9 years ago
njn correctly points out that sizeof(LIns) is much more appropriate than sizeof(LIns*), though they are both 4 bytes on x86. They're not elsewhere. gal: r+ for me to correct that?

Comment 15

9 years ago
r+
(Assignee)

Comment 16

9 years ago
http://hg.mozilla.org/tracemonkey/rev/fef1f50d6750

Hopefully only second time's the charm.

Updated

9 years ago
Blocks: 490041

Comment 17

9 years ago
http://hg.mozilla.org/mozilla-central/rev/37d6f4a3c43f
Status: NEW → RESOLVED
Last Resolved: 9 years ago
Resolution: --- → FIXED
(Assignee)

Comment 19

9 years ago
Not sure it's ideal to mark this fixed. It's fixed-if-the-bug-goes-away. We're not sure the patch actually relates. It just looks "likely".

Comment 20

9 years ago
(In reply to comment #19)
> Not sure it's ideal to mark this fixed. It's fixed-if-the-bug-goes-away. We're
> not sure the patch actually relates. It just looks "likely".

If it's not fixed, we'll retitle this bug and open a new one, to keep the patches separate.
(In reply to comment #19)
> Not sure it's ideal to mark this fixed. It's fixed-if-the-bug-goes-away. We're
> not sure the patch actually relates. It just looks "likely".

So far i was not able to reproduce the crash with one of the example urls, but i will continue the tests and file a new bug if i find something
(Assignee)

Comment 22

9 years ago
Hmm. So this landed on mozilla-central on the 7th; all topcrashes associated with LIR reading appear to continue through to buildid 20090509, then stop. As though the build 20090510 carried the fix. That would be a 2 day lag: 20090508 should have carried it.

Three possibilities: 
  - there was some kind of lag, but this patch did fix it
  - a different patch fixed it (eg. bug 488775, landed on trunk on the 10th)
  - LIR-reading bugs remain unfixed; we're just seeing sampling error and
    the gap since the 9th is just transient/random (this is plausible as there
    are few-day gaps in the existing signal and the count is low)

Will have to keep watching.
My inexpert guess would be that #488775 fixed it:  it was crashing in isTramp(), the code handling trampolines was very convoluted, and #488775's patch removed it all.

Comment 24

9 years ago
488775 is not a blocker. If we think 488775 fixed it and not graydon's patch, we have to consider taking 488775 for 3.5, which scares me a bit without additional work to make LIR non-persistent.
(In reply to comment #24)
> 488775 is not a blocker. If we think 488775 fixed it and not graydon's patch,
> we have to consider taking 488775 for 3.5, which scares me a bit without
> additional work to make LIR non-persistent.

Or develop, test, and commit a small, targeted fix to the trampoline code.

/be

Comment 26

9 years ago
We dont' understand the bug. Its sporadic.
It looks like talos recently crashed in isTramp() on 1.9.1:

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox3.5/1242823438.1242831846.14936.gz


0 js3250.dll!nanojit::LIns::isTramp() [LIR.h:acd2d4638228 : 347 + 0x0]
eip = 0x002e2b90 esp = 0x0012f054 ebp = 0x008b1000 ebx = 0x00000003 esi = 0x00000001 edi = 0x00cbd1d4 eax = 0x00cbd1d4 ecx = 0x00000002 edx = 0x00000010 efl = 0x00010206 1 js3250.dll!nanojit::Assembler::asm_call(nanojit::LIns *) [Nativei386.cpp:acd2d4638228 : 251 + 0x19]
eip = 0x002e242f esp = 0x0012f058 ebp = 0x008b1000 2 js3250.dll!nanojit::Assembler::prepResultReg(nanojit::LIns *,int) [Assembler.cpp:acd2d4638228 : 635 + 0x9]
eip = 0x002e391b esp = 0x0012f0b4 ebp = 0x0001c0c8 3 js3250.dll!nanojit::Assembler::gen(nanojit::LirFilter *,avmplus::List &) [Assembler.cpp:acd2d4638228 : 1477 + 0x6]
eip = 0x002e32d4 esp = 0x0012f0c0 ebp = 0x02cbd1e4 4 js3250.dll!nanojit::Assembler::assemble(nanojit::Fragment *,avmplus::List &) [Assembler.cpp:acd2d4638228 : 850 + 0x33]
eip = 0x00338da5 esp = 0x0012f0dc ebp = 0x008b1000 5 js3250.dll!nanojit::compile(nanojit::Assembler *,nanojit::Fragment *) [LIR.cpp:acd2d4638228 : 2150 + 0xd]
eip = 0x0033a147 esp = 0x0012f174 ebp = 0x0012f11c 6 js3250.dll!TraceRecorder::compile(JSTraceMonitor *) [jstracer.cpp:acd2d4638228 : 2809 + 0xc]
eip = 0x0033a276 esp = 0x0012f1c0 ebp = 0x0079503c 7 js3250.dll!TraceRecorder::closeLoop(JSTraceMonitor *,bool &) [jstracer.cpp:acd2d4638228 : 2943 + 0xc]
eip = 0x00335cad esp = 0x0012f1d0 ebp = 0x0079503c 8 js3250.dll!TraceRecorder::checkTraceEnd(unsigned char *) [jstracer.cpp:acd2d4638228 : 3242 + 0x12]
eip = 0x0033760d esp = 0x0012f1e8 ebp = 0x0012f248 9 js3250.dll!TraceRecorder::relational(nanojit::LOpcode,bool) [jstracer.cpp:acd2d4638228 : 6135 + 0x5]
eip = 0x003965bd esp = 0x0012f1fc ebp = 0x0012f248 10 js3250.dll!TraceRecorder::record_JSOP_LT() [jstracer.cpp:acd2d4638228 : 6977 + 0xa]
eip = 0x00396c5e esp = 0x0012f250 ebp = 0x023c3c00 11 js3250.dll!js_Interpret [jsinterp.cpp:acd2d4638228 : 3038 + 0x7]
eip = 0x00318afb esp = 0x0012f274 ebp = 0x023c3c00
Graydon: Can you CC me on the new bug you file for this (per comment 20)?

Updated

9 years ago
Flags: in-testsuite?
Keywords: testcase-wanted

Comment 31

9 years ago
The continuation is (was) bug 493991.
Group: core-security
Flags: wanted1.9.0.x-
Crash Signature: [@ nanojit::LIns::isTramp()]
Filter on qa-project-auto-change:

Bug in removed tracer code, setting in-testsuite- flag.
Flags: in-testsuite? → in-testsuite-
Keywords: testcase-wanted
You need to log in before you can comment on or make changes to this bug.