(I didn't see this filed already, but feel free to dupe it if so.) The number 1 topcrash for Firefox 3.5b4 (and #3 in 1.9.1 branch nightly builds) happens at nanojit::LIns::isTramp(). I *don't* see this crash on trunk, but it's clear that it's still happening in (at least) yesterday's Shiretoko build. From bp-b22c114a-0ae3-4684-b3b6-c0bbb2090505: Crashing Thread Frame Module Signature Source 0 js3250.dll nanojit::LIns::isTramp() js/src/nanojit/LIR.h:347 1 js3250.dll nanojit::LInsHashSet::hashcode(nanojit::LIns*) js/src/nanojit/LIR.cpp:1318 2 js3250.dll nanojit::LInsHashSet::grow() js/src/nanojit/LIR.cpp:1380 3 js3250.dll nanojit::CseFilter::ins2(nanojit::LOpcode,nanojit::LIns*,nanojit::LIns*) js/src/nanojit/LIR.cpp:2028 4 js3250.dll nanojit::ExprFilter::ins2(nanojit::LOpcode,nanojit::LIns*,nanojit::LIns*) js/src/nanojit/LIR.cpp:984 5 js3250.dll FuncFilter::ins2(nanojit::LOpcode,nanojit::LIns*,nanojit::LIns*) js/src/jstracer.cpp:937 6 js3250.dll nanojit::LirWriter::ins2i(nanojit::LOpcode,nanojit::LIns*,int) js/src/nanojit/LIR.cpp:747 7 js3250.dll TraceRecorder::record_FastNativeCallComplete() js/src/jstracer.cpp:8282 8 js3250.dll js_Interpret js/src/jsinterp.cpp:5127 9 js3250.dll js_Invoke js/src/jsinterp.cpp:1373 10 js3250.dll js_fun_call js/src/jsfun.cpp:1688 11 js3250.dll js_Interpret js/src/jsinterp.cpp:5100 12 js3250.dll js_Invoke js/src/jsinterp.cpp:1373 13 js3250.dll js_fun_call js/src/jsfun.cpp:1688 14 js3250.dll js_Interpret js/src/jsinterp.cpp:5100 15 js3250.dll js_Invoke js/src/jsinterp.cpp:1373 16 js3250.dll js_fun_call js/src/jsfun.cpp:1688 17 js3250.dll js_Interpret js/src/jsinterp.cpp:5100 18 js3250.dll js_Invoke js/src/jsinterp.cpp:1373 19 xul.dll nsXPCWrappedJSClass::CallMethod(nsXPCWrappedJS*,unsigned short,XPTMethodDescriptor const*,nsXPTCMiniVariant*) js/src/xpconnect/src/xpcwrappedjsclass.cpp:1614 20 xul.dll nsXPCWrappedJS::CallMethod(unsigned short,XPTMethodDescriptor const*,nsXPTCMiniVariant*) js/src/xpconnect/src/xpcwrappedjs.cpp:561 21 xul.dll PrepareAndDispatch xpcom/reflect/xptcall/src/md/win32/xptcstubs.cpp:114 22 xul.dll SharedStub xpcom/reflect/xptcall/src/md/win32/xptcstubs.cpp:141 23 xul.dll nsEventListenerManager::HandleEventSubType(nsListenerStruct*,nsIDOMEventListener*,nsIDOMEvent*,nsPIDOMEventTarget*,unsigned int) content/events/src/nsEventListenerManager.cpp:1101
From looking at the surrounding code this looks like access to unmapped memory (random pointer), not a NULL pointer deref. The random pointer is coming from a table in memory (wild pointer?). Anyway. Until we know a bit more, I will make this invisible.
Yeah, badtimes. If you look through the topcrashes, there are all sorts of random callers that funnel into isTramp() and fall over. Clearly there's nothing wrong with isTramp -- it does almost nothing -- but a lot of people are getting bogus LIns* values. That's bad. Some reproducing-URLs would be wonderful, sigh. If anyone wants another excuse to attempt landing bug 488775, I nominate this one :(
(In reply to comment #2) > Yeah, badtimes. If you look through the topcrashes, there are all sorts of > random callers that funnel into isTramp() and fall over. Clearly there's > nothing wrong with isTramp -- it does almost nothing -- but a lot of people are > getting bogus LIns* values. That's bad. Some reproducing-URLs would be > wonderful, sigh. I will check my topsite test Logs and do some extra topsite test runs to find a url and hopefully a Testcase !
I filed bug 491606 (security sensitive) to get URLs for this bug.
An update: my suspicion is the trampoline-forming code is subtly wrong, but it's hard to eyeball where. So I cooked up a synthetic load on the LIR assembler that just makes a lot of random-sized skips and loads from way back in the stream, to hit lots of random possible trampoline sizes. Result is rapid and reliable crashing of nanojit. So I'll work backwards now and try to reduce one of those cases. (We need to start fuzzing via the LIR assembler, in case there is any doubt or lack of clarity here. It's a much faster way of hitting the worst cases TM is going to throw at it.)
(In reply to comment #6) > In case you still want URLs, attachment 375902 [details] has them. thx ! i will run this urls through my Topsite automation !
Heh. False belief on my part, figures. With a cocky comment like that, I was asking for it. Bug in assembler. Still attacking, can't reproduce yet :(
Next exciting discovery: we cap our skip() values at 4094 bytes, but attempting to assemble anything more than 4084 bytes will trip a samepage() assert in a debug build. And unfortunately, everything between that gap does "something not equivalent to safely stopping" in an opt build. Back at it tomorrow.
Created attachment 376070 [details] [diff] [review] First correction I am not sure if this is actually causing the reported topcrash, but it was the first seriously-incorrect thing I stumbled into while looking at code paths related to the bug, so I figured I'd fix it. I'd suggest landing it anyways and measuring the field response of the crash in nightlies, though of course in the meantime I'll carry on looking for a way to trigger the real case. The problem here is that our skip() cap is wrong. The first obviously-wrong thing about it is that it subtracts a "slot" (i.e. pointer-size) count from a byte count. That doesn't work, duh. Graydon fails pointer arithmetic 101. The next less-obviously-wrong thing about it relates to the way the code in LIR.cpp actually treats page boundary conditions. Pages of LIR are 4096 bytes long. When you try to make an allocation that goes off a page, the LIR buffer writer advances to the next page and puts a far tramp (8 bytes) in the new page, pointing back to the position in the old. So one might be forgiven for thinking that 4096-8=4088 bytes is the largest skip() allocation you can do. Unfortunately if we set the limit there we are still wrong: it decodes whether or not to advance from one page to the next by taking the current write pointer, _unused, adding a proposed distance to it, and checking to see if it's still "on the same page" by masking to page boundaries and comparing. If you advance by 4088 bytes, you get a new page with an 8 byte far tramp at the beginning, then 4088 bytes of allocated space, and your write pointer _unused is sitting at byte 8+4088=4096. I.e. "the first byte of the *next* page". So subsequent writes to _unused+N appear very much to be on "the same page" as _unused, and succeed without allocating fresh pages. Thereby scribbling into random memory. (As it turns out, calling skip() explicitly does a write of another trampoline immediately after the skip quantity, because skipping and page-boundary-transitions are ignorant of one another; so the simple assembly program "skip 4088" is enough to get valgrind to note a UMR). Anyway, net result is you have to cap "a little less than 4088". 4087 would be ok but for the sake of preserving alignment in the buffer (other bits of code seem to like keeping it) I decided to pull it back by a full pointer-width, to 4084. For the sake of trapping this earlier if it happens again, I also added an assert to nanojit itself in the page-boundary-crossing path (ensureRoom()) to check that it's not exceeding 4084 bytes as well. Though that assert, like all good sanity checking, is turned off in production builds :(
Attachment #376070 - Flags: review?(gal)
Comment on attachment 376070 [details] [diff] [review] First correction This might very well explain the crash, along with the occasional tinderbox crashes we saw in nanojit. Here is hopeing.
Attachment #376070 - Flags: review?(gal) → review+
http://hg.mozilla.org/tracemonkey/rev/37d6f4a3c43f on that one. Keep the bug open for now, we'll have to see if crashes drop when it makes it to nightlies. Am I correct in thinking that the nightly stream to watch at this point is 3.5b5pre? And/or 3.6a1?
> > Am I correct in thinking that the nightly stream to watch at this point is > 3.5b5pre? And/or 3.6a1? 3.5b5pre is mozilla-1.9.1 nightly. 3.6a1 is mozilla-central nightly.
njn correctly points out that sizeof(LIns) is much more appropriate than sizeof(LIns*), though they are both 4 bytes on x86. They're not elsewhere. gal: r+ for me to correct that?
http://hg.mozilla.org/tracemonkey/rev/fef1f50d6750 Hopefully only second time's the charm.
Status: NEW → RESOLVED
Last Resolved: 9 years ago
Resolution: --- → FIXED
Not sure it's ideal to mark this fixed. It's fixed-if-the-bug-goes-away. We're not sure the patch actually relates. It just looks "likely".
(In reply to comment #19) > Not sure it's ideal to mark this fixed. It's fixed-if-the-bug-goes-away. We're > not sure the patch actually relates. It just looks "likely". If it's not fixed, we'll retitle this bug and open a new one, to keep the patches separate.
(In reply to comment #19) > Not sure it's ideal to mark this fixed. It's fixed-if-the-bug-goes-away. We're > not sure the patch actually relates. It just looks "likely". So far i was not able to reproduce the crash with one of the example urls, but i will continue the tests and file a new bug if i find something
Hmm. So this landed on mozilla-central on the 7th; all topcrashes associated with LIR reading appear to continue through to buildid 20090509, then stop. As though the build 20090510 carried the fix. That would be a 2 day lag: 20090508 should have carried it. Three possibilities: - there was some kind of lag, but this patch did fix it - a different patch fixed it (eg. bug 488775, landed on trunk on the 10th) - LIR-reading bugs remain unfixed; we're just seeing sampling error and the gap since the 9th is just transient/random (this is plausible as there are few-day gaps in the existing signal and the count is low) Will have to keep watching.
My inexpert guess would be that #488775 fixed it: it was crashing in isTramp(), the code handling trampolines was very convoluted, and #488775's patch removed it all.
488775 is not a blocker. If we think 488775 fixed it and not graydon's patch, we have to consider taking 488775 for 3.5, which scares me a bit without additional work to make LIR non-persistent.
(In reply to comment #24) > 488775 is not a blocker. If we think 488775 fixed it and not graydon's patch, > we have to consider taking 488775 for 3.5, which scares me a bit without > additional work to make LIR non-persistent. Or develop, test, and commit a small, targeted fix to the trampoline code. /be
We dont' understand the bug. Its sporadic.
It looks like talos recently crashed in isTramp() on 1.9.1: http://tinderbox.mozilla.org/showlog.cgi?log=Firefox3.5/1242823438.1242831846.14936.gz 0 js3250.dll!nanojit::LIns::isTramp() [LIR.h:acd2d4638228 : 347 + 0x0] eip = 0x002e2b90 esp = 0x0012f054 ebp = 0x008b1000 ebx = 0x00000003 esi = 0x00000001 edi = 0x00cbd1d4 eax = 0x00cbd1d4 ecx = 0x00000002 edx = 0x00000010 efl = 0x00010206 1 js3250.dll!nanojit::Assembler::asm_call(nanojit::LIns *) [Nativei386.cpp:acd2d4638228 : 251 + 0x19] eip = 0x002e242f esp = 0x0012f058 ebp = 0x008b1000 2 js3250.dll!nanojit::Assembler::prepResultReg(nanojit::LIns *,int) [Assembler.cpp:acd2d4638228 : 635 + 0x9] eip = 0x002e391b esp = 0x0012f0b4 ebp = 0x0001c0c8 3 js3250.dll!nanojit::Assembler::gen(nanojit::LirFilter *,avmplus::List &) [Assembler.cpp:acd2d4638228 : 1477 + 0x6] eip = 0x002e32d4 esp = 0x0012f0c0 ebp = 0x02cbd1e4 4 js3250.dll!nanojit::Assembler::assemble(nanojit::Fragment *,avmplus::List &) [Assembler.cpp:acd2d4638228 : 850 + 0x33] eip = 0x00338da5 esp = 0x0012f0dc ebp = 0x008b1000 5 js3250.dll!nanojit::compile(nanojit::Assembler *,nanojit::Fragment *) [LIR.cpp:acd2d4638228 : 2150 + 0xd] eip = 0x0033a147 esp = 0x0012f174 ebp = 0x0012f11c 6 js3250.dll!TraceRecorder::compile(JSTraceMonitor *) [jstracer.cpp:acd2d4638228 : 2809 + 0xc] eip = 0x0033a276 esp = 0x0012f1c0 ebp = 0x0079503c 7 js3250.dll!TraceRecorder::closeLoop(JSTraceMonitor *,bool &) [jstracer.cpp:acd2d4638228 : 2943 + 0xc] eip = 0x00335cad esp = 0x0012f1d0 ebp = 0x0079503c 8 js3250.dll!TraceRecorder::checkTraceEnd(unsigned char *) [jstracer.cpp:acd2d4638228 : 3242 + 0x12] eip = 0x0033760d esp = 0x0012f1e8 ebp = 0x0012f248 9 js3250.dll!TraceRecorder::relational(nanojit::LOpcode,bool) [jstracer.cpp:acd2d4638228 : 6135 + 0x5] eip = 0x003965bd esp = 0x0012f1fc ebp = 0x0012f248 10 js3250.dll!TraceRecorder::record_JSOP_LT() [jstracer.cpp:acd2d4638228 : 6977 + 0xa] eip = 0x00396c5e esp = 0x0012f250 ebp = 0x023c3c00 11 js3250.dll!js_Interpret [jsinterp.cpp:acd2d4638228 : 3038 + 0x7] eip = 0x00318afb esp = 0x0012f274 ebp = 0x023c3c00
Graydon: Can you CC me on the new bug you file for this (per comment 20)?
The continuation is (was) bug 493991.
Crash Signature: [@ nanojit::LIns::isTramp()]
Filter on qa-project-auto-change: Bug in removed tracer code, setting in-testsuite- flag.
Flags: in-testsuite? → in-testsuite-
You need to log in before you can comment on or make changes to this bug.