Closed Bug 493991: TM/NJ: top crash [@ nanojit::LIns::isTramp()] -- continued
Opened 16 years ago; Closed 16 years ago
Categories: Core :: JavaScript Engine, defect
Status: RESOLVED FIXED
People: Reporter: graydon; Assigned: graydon
Keywords: fixed1.9.1
Attachments (3 files)
- patch, 2.82 KB (vlad: review+)
- text/plain, 5.17 KB
- patch, 777 bytes (gal: review+)
This is a continuation of bug 491514. I have copied the CC list forward from that bug. The initial patch in that bug missed the underlying issue; we're still seeing this topcrash signature. Examples:
http://crash-stats.mozilla.com/report/list?product=Firefox&version=Firefox%3A3.5b5pre&platform=windows&query_search=signature&query_type=exact&query=&date=&range_value=1&range_unit=weeks&do_query=1&signature=nanojit%3A%3ALIns%3A%3AisTramp%28%29
Summary of the issue:
- It looks like a wild-pointer write. General memory bug in nanojit. Assume exploitable.
- We have no testcase that reproduces it. It's spotty and random, less than 10 a day, but regular enough and serious enough that it should probably block.
- We have a patch that *might* fix it (bug 488775 -- widen-LIR) that has already landed on trunk/TM, and we *think* this has possibly brought this family of crashes under control there. But it "fixes" the problem by replacing a large amount of code entirely. It is not certain to really fix the matter; we just don't see this exact signature with it anymore. And the widen-LIR patch is intrusive, so we're nervous about landing it on 1.9.1 this close to the finish line.
I am going to return to the approach I was taking in the previous bug, which is "feeding random input to the LIR assembler under valgrind". Hopefully this will turn up ... something. Further suggestions welcome!
Flags: blocking1.9.1?
Updated•16 years ago
Flags: blocking1.9.1? → blocking1.9.1+

Comment 1•16 years ago
Bug 492866 is a follow-on from bug 488775 which makes the LIR less wide again, reducing the complaints about the memory usage of bug 488775. But it's yet more new code...
Assignee
Comment 2•16 years ago
First things first: I revived the random tester thingy, and it shows that the widen-LIR patch on trunk/TM is incorrect in two ways.
First, it contains the same fault we were seeing back in bug 491514: skipping "1 page minus 1 LIns" means skipping 4096-16=4080 bytes, which lands you exactly on a page boundary once you insert the page-crossing skip instruction itself. And sadly, as before, that makes subsequent writes think they're in valid memory and scribble over unallocated space.
Second, there's a unit mismatch in the assertion in ensureRoom: it's comparing a LIns count to a byte count. That doesn't entirely matter since the bound is set wrong anyway (see previous point), but both need to be fixed.
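To make the arithmetic concrete, here's a minimal C++ sketch of the bad bound (assuming NJ_PAGE_SIZE = 4096 and the widened sizeof(LIns) = 16; the names mirror the discussion above, not the literal nanojit source):

#include <cstddef>

const size_t NJ_PAGE_SIZE = 4096;
const size_t LINS_SIZE    = 16;                       // widened LIns
const size_t BAD_MAX_SKIP = NJ_PAGE_SIZE - LINS_SIZE; // 4096-16 = 4080

// After reserving BAD_MAX_SKIP bytes, emitting the page-crossing skip
// instruction itself (one more LIns) advances the cursor to offset
// 4080 + 16 = 4096 -- exactly the page boundary -- so the next write
// lands in unallocated memory.
static_assert(BAD_MAX_SKIP + LINS_SIZE == NJ_PAGE_SIZE,
              "skip + skip-instruction reaches the page boundary");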
All that aside, the random tester thingy isn't finding anything *else* wrong at the moment, so I'm going to port it over to 1.9.1 and check there. I'm assuming whatever else was busted in the pre-widen-LIR trampoline code was removed when widen-LIR landed, so I'm not going to find it hunting further on TM tip.
I'll spin the former two out into a separate bug presently. Also security-sensitive, unfortunately; arbitrary writes :(
Assignee
Comment 3•16 years ago
Incorrect max skip size on TM/NJ split off as bug 494084.
Assignee
Comment 4•16 years ago
I ported the fuzz mode of the assembler to 1.9.1 and turned it on. It immediately coughed up bug 494110, which would definitely cause us some grief, though it's not clear whether that's the root cause here; I think the symptoms would look different (runtime random writes, not compile-time ones). It should block anyway.
Assignee
Comment 5•16 years ago
Ever more excitement.
It turns out that the fix landed in bug 491514 was close, but not *quite* right. While digging through the bad size in bug 494084, njn pointed out that we're not accounting for the page *header* size either. Pages start with a header (a single pointer), so the usable space per page is less than NJ_PAGE_SIZE.
This means that the skip size we reserve is still too small on 1.9.1. We're occasionally going to go past the end of a page still.
One might ask: why is this not triggering all the time? Why only 10 crashes a day? And why does fuzzing it / hitting the max limit not immediately catch it? These are good questions, but there is a curious and awful answer.
We are currently capping skips at NJ_PAGE_SIZE - (LIR_FAR_SLOTS + 1) * sizeof(LIns). This is because we may write either a near tramp (24-bit) or a far tramp (32-bit). Far tramps take 2 instructions, near tramps take only 1. So LIR_FAR_SLOTS is 2, and we're reserving a total of 2+1=3 words, or 12 bytes on x86, to cover fartramp+skip. It turns out the page header itself is only 1 pointer, 4 bytes, so reserving 12 bytes of each page also covers all cases of header+neartramp+skip. The only case that will fail is header+fartramp+skip, which is 16 bytes.
The solution, just like in 491514, is to lower the skip limit 1 word further.
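A minimal sketch of the reservation math, under the assumptions stated above (4-byte LIns on 1.9.1, a one-pointer page header, LIR_FAR_SLOTS = 2; the names are illustrative, not the literal nanojit constants):

#include <cstddef>

const size_t NJ_PAGE_SIZE  = 4096;
const size_t LINS_SIZE     = 4;  // pre-widen-LIR instruction size
const size_t HEADER_SIZE   = 4;  // one pointer at the start of each page
const size_t LIR_FAR_SLOTS = 2;  // a far tramp occupies two instructions

// Old cap: reserves 12 bytes, enough for fartramp+skip (8+4) and, by luck,
// header+neartramp+skip (4+4+4), but NOT header+fartramp+skip (4+8+4 = 16).
const size_t OLD_MAX_SKIP = NJ_PAGE_SIZE - (LIR_FAR_SLOTS + 1) * LINS_SIZE;

// Fixed cap: lowering the limit one more word covers the worst case.
const size_t NEW_MAX_SKIP = OLD_MAX_SKIP - HEADER_SIZE;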
I'm tempted for the sake of paranoia to pull it down another few words "just in case", as we're so close to shipping and we keep hitting these. Thoughts?
Also as in 491514, I won't be able to conclusively call this bug "closed" until we see the topcrash reports vanish, as we have no precise case triggering it. But this explanation at least fits the symptoms.
Assignee
Comment 6•16 years ago
This lowers the limit by the size of the header plus 3 more LIns slots for paranoia's sake. I'm sorry if that seems silly, but this bug should have died months ago.
Also fixes the unit mismatch in ensureRoom()'s assert that njn (a.k.a. "captain laser-eyes") caught.
Attachment #378780 - Flags: review?(gal)
Comment 7•16 years ago
You've defined MAX_SKIP_BYTES twice...
Assignee
Comment 8•16 years ago
(In reply to comment #7)
> You've defined MAX_SKIP_BYTES twice...
Guh, yes, of course: I copied it over to LIR.h so it could be used in both places but neglected to delete the place I copied from. Long day.
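For clarity, a sketch of the intended end state, with the expression reconstructed from comments 5 and 6 (the header is one pointer; the +3 slots are the paranoia margin) and the exact file placement illustrative:

// LIR.h -- the single definition, visible everywhere it's needed:
#define MAX_SKIP_BYTES \
    (NJ_PAGE_SIZE - sizeof(void*) - (LIR_FAR_SLOTS + 1 + 3) * sizeof(LIns))

// The copy left behind in the file it was hoisted from is deleted;
// that file now just includes LIR.h.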
Attachment #378780 - Flags: review?(gal) → review+

Comment 9•16 years ago
Comment on attachment 378780 [details] [diff] [review]
A further lowering of the skip limit
Looks fine to me, but with one MAX_SKIP_BYTES definition.
Comment 10•16 years ago
Graydon, the MAX_SKIP_BYTES could explain the crash in bug 494148 too, right?
Assignee
Comment 11•16 years ago
(In reply to comment #10)
> Graydon, the MAX_SKIP_BYTES could explain the crash in bug 494148 too, right?
Potentially. As with all these cases, we have no reproduction, so the best strategy I can think of is to keep fuzzing and keep fixing, and hope we find something.
Comment 12•16 years ago
I can see us creating a bad LIR link and then walking that, so I am somewhat comfortable duping the other bug against this until your fix is in and we see it again.
Assignee
Comment 14•16 years ago
http://hg.mozilla.org/releases/mozilla-1.9.1/rev/bad80f86bf07
Marking fixed for now. If we don't see topcrashes of this signature fall off in the next 2 days, reopen.
Assignee
Comment 15•16 years ago
bz notes that we're seeing the same-ish signature on an OSX talos crash after this landed. So unless the buildid / rev number is misreported, this is not fixed. Will return to fuzzing, I guess.
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox3.5/1242943769.1242946409.14828.gz
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Updated•16 years ago
Keywords: fixed1.9.1
Assignee
Comment 16•16 years ago
No sign of this in crash-stats since the 20th nightly. Odd.
Comment 17•16 years ago
Well, don't get too excited: I don't see _any_ crashes on crash-stats for 3.5pre, and I think we switched to that numbering on or around the 20th. I'll ask the web folks for hints.
Comment 18•16 years ago
http://bit.ly/ztGdI is the LIns::isTramp crashes on 3.5pre, the version we've had since the 21st.
I see 7, with 3 of them (unlikely to all be the same machine) keying off the address "0x194b3c".
Comment 19•16 years ago
The following query[1] shows some in the nightly from the 24th, but it seems like we might be missing symbols for some builds[2].
[1] http://crash-stats.mozilla.com/report/list?product=Firefox&version=Firefox%3A3.5pre&query_search=signature&query_type=exact&query=&date=&range_value=1&range_unit=weeks&do_query=1&signature=nanojit%3A%3ALIns%3A%3AisTramp%28%29
[2] http://crash-stats.mozilla.com/query/query?do_query=1&product=Firefox&version=Firefox%3A3.5pre&date=&range_value=1&range_unit=weeks&query_search=signature&query_type=exact&query=
Comment 20•16 years ago
Sayre, I take it we've had no luck repro'ing this on your local debug talos loops?
Assignee
Comment 21•16 years ago
We found a possible reproduction in the April 2 - April 4 range of talos runs on win32. I'm attempting to get a win32 build from that date with debug symbols crashing repeatably here. It's a little bit of work; it's required some fussing with a VM. I'll update this bug as soon as / if I have anything like a reproduction.
Assignee
Comment 22•16 years ago
No reproductions after ~8 hours of talos page-load cycling on win32 builds, both 1.9.1 tip and builds from the proposed period in April. Likewise for Linux builds of the same.
I don't know what else to do.
Comment 23•16 years ago
Graydon, how about we get releng to trigger the debugger popup on a crash for all vista talos on 1.9.1, and then you VNC into each of them to catch it when it happens?
Assignee
Comment 24•16 years ago
If they can do so, I'm happy to try.
Comment 25•16 years ago
I've filed bug 495434 to see if there are common URLs from the crash reports which correlate with this crash.
Depends on: 495434
Assignee
Comment 26•16 years ago
Thanks to the build folks we got a couple of isolated talos crashes caught in the act; I'm picking through the minidumps presently. They are seriously memory-corrupt -- hard to tell exactly what the code is trying to do when things go south -- but they're all faulting in the same code. So that's progress. It's a chunk of Omniture (again!) code visible on lots of pages in the talos set.
3fcb60ade2cd34ca447d6714e7ebcd58035f697e ./pages/www.msn.co.uk/stj.msn.com/br/om/js/1/s_code.js
3fcb60ade2cd34ca447d6714e7ebcd58035f697e ./pages/www.msn.com.cn/stjcn.msn.com/br/om/js/1/s_code.js
3fcb60ade2cd34ca447d6714e7ebcd58035f697e ./pages/www.msn.com/stj.msn.com/br/om/js/1/s_code.js
3fcb60ade2cd34ca447d6714e7ebcd58035f697e ./pages/www.msn.com.tw/stjtw.msn.com/br/om/js/1/s_code.js
3fcb60ade2cd34ca447d6714e7ebcd58035f697e ./pages/www.msn.es/stj.msn.com/br/om/js/1/s_code.js
3fcb60ade2cd34ca447d6714e7ebcd58035f697e ./pages/www.msn.com.hk/stj.msn.com/br/om/js/1/s_code.js
3fcb60ade2cd34ca447d6714e7ebcd58035f697e ./pages/www.msn.ca/stj.msn.com/br/om/js/1/s_code.js
3fcb60ade2cd34ca447d6714e7ebcd58035f697e ./pages/www.msn.com.br/stj.msn.com/br/om/js/1/s_code.js
Will followup as details emerge.
Assignee
Comment 27•16 years ago
Memory is quite corrupted in both samples but I've worked out a rough picture of the crashing context. Report attached.
Short summary: I still don't know what's so bad about this code. More eyes from more experts appreciated. Happy to answer questions / dig at more random corners of the crashed process if you can think of anything.
Assignee
Comment 28•16 years ago
This will break your heart, but it actually was in the path I initially suspected. It was just not something my fuzzer or exhaustive tester hit, because it comes down to an argument mismatch in the code-generating predicate, and I didn't exercise that path quite the same way the LIR tramp-forming code does.
Turns out the LIR tramp-forming code checks whether "from-to" fits in a signed 24-bit value, but then inserts the value "to-from". This is fine for most values, but not in the case where "to" = "from + 0x800000", which, strangely enough, we hit in the field.
Fix is just to make the predicate check what the inserter inserts. There's only one user of the predicate so *hopefully* this one-liner is safe. I've tested it a bit here and it seems benign.
(The story of actually finding this involves several more hours of crawling through LIR buffers and x86 buffers, manually disassembling things to find the operand that generated the faulting address; the target code really *was* trying to follow a tramp coding for 0x800000 words of displacement, which gets implicitly mis-encoded and thereby decoded as the much more unsightly 0xff800000, points the wrong direction from the base pointer, faults the process, etc. etc. Blame the Windows memory randomization algorithm?)
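To illustrate the asymmetry, a minimal self-contained C++ sketch (isS24 and the operand names are illustrative stand-ins, not the literal nanojit source):

#include <cstdint>
#include <cstdio>

// True iff v fits in a signed 24-bit field.
static bool isS24(int32_t v) {
    return v >= -0x800000 && v <= 0x7fffff;
}

int main() {
    int32_t from = 0x100000;
    int32_t to   = from + 0x800000;

    // The buggy path: check one direction, encode the other.
    bool checkPasses = isS24(from - to); // -0x800000 fits, so check passes
    bool encodeFits  = isS24(to - from); // +0x800000 does NOT fit

    // Truncating to-from to 24 bits and sign-extending yields 0xff800000:
    // the decoded displacement points the wrong way from the base pointer.
    uint32_t raw     = (uint32_t)(to - from) & 0xffffff;
    int32_t  decoded = (int32_t)(raw << 8) >> 8;

    printf("check=%d encode-fits=%d decoded=0x%08x\n",
           checkPasses, encodeFits, (unsigned)decoded);
    return 0;
}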
Attachment #381187 - Flags: review?(gal)

Updated•16 years ago
Attachment #381187 - Flags: review?(gal) → review+
Comment 29•16 years ago
Graydon is Da Man
Assignee
Comment 30•16 years ago
Comment 31•16 years ago
Marking this fixed for accounting purposes, but we'll of course be watching crash-stats.mozilla.com.
Status: REOPENED → RESOLVED
Closed: 16 years ago → 16 years ago
Resolution: --- → FIXED
Updated•16 years ago
Keywords: fixed1.9.1
Assignee
Comment 32•16 years ago
BuildID 20090602044233 is the last build in crash-stats showing any of the culprit signatures. I think that got it.
Comment 33•16 years ago
Awesome, awesome work, everyone.
Updated•16 years ago
Group: core-security
Flags: wanted1.9.0.x-