Closed Bug 521668 Opened 15 years ago Closed 15 years ago

content sink re-entry causes crash [@ xul.dll@0x3ec28f ] and [@ SinkContext::DidAddContent(nsIContent*) ]

Categories

(Core :: DOM: HTML Parser, defect, P2)

All
Windows 7
defect

Tracking


RESOLVED FIXED
Tracking Status
status1.9.2 --- beta3-fixed

People

(Reporter: tdowner, Assigned: sicking)

References


Details

(Keywords: crash, dogfood, regression, Whiteboard: [#1 trunk (3.7a1) topcrash][crashkill])

Crash Data

Attachments

(3 files)

Kinda weird, but when trying to go to this page, I get these crashes (did it 3 times; two were [@ SinkContext::DidAddContent(nsIContent*) ], the other was the @xul one).
bp-032188c2-20c6-4d3f-b6b1-9fece2091011
bp-d81fe7bb-56b9-46eb-9359-217fb2091011
bp-2a44d905-cb6c-4fcb-b74c-b5a942091011
Did not crash in safe mode.
Bug 521164 may be related/the same.
This started with the Oct 13 nightly -- http://bit.ly/OSOPj for the crash stats list.  We need a regression range; it makes the nightly highly unstable and generally unusable.  (Note that I can reproducibly crash by loading bug 521164 while logged in, though with a nonstandard bugzilla skin.)
Ignore Oct 13 -- sorting by date doesn't actually seem to work in crash stats.
Just looking at the incoming daily reports, the trunk crashes look like they ramped up around Oct 5th; more reports were seen when that day's crashes went into the report produced on the 6th.

report for the stack signature SinkContext::DidAddContent 
4   total crashes for SinkContext::DidAddContent on 20091004-crashdata.csv
3   total crashes for SinkContext::DidAddContent on 20091005-crashdata.csv
1   total crashes for SinkContext::DidAddContent on 20091006-crashdata.csv
15   total crashes for SinkContext::DidAddContent on 20091007-crashdata.csv
18   total crashes for SinkContext::DidAddContent on 20091008-crashdata.csv
17   total crashes for SinkContext::DidAddContent on 20091009-crashdata.csv
17   total crashes for SinkContext::DidAddContent on 20091010-crashdata.csv
19   total crashes for SinkContext::DidAddContent on 20091011-crashdata.csv
20   total crashes for SinkContext::DidAddContent on 20091012-crashdata.csv
report for the stack signature xul.dll@0x3ec28f 
0   total crashes for xul.dll@0x3ec28f on 20091004-crashdata.csv
0   total crashes for xul.dll@0x3ec28f on 20091005-crashdata.csv
0   total crashes for xul.dll@0x3ec28f on 20091006-crashdata.csv
0   total crashes for xul.dll@0x3ec28f on 20091007-crashdata.csv
0   total crashes for xul.dll@0x3ec28f on 20091008-crashdata.csv
0   total crashes for xul.dll@0x3ec28f on 20091009-crashdata.csv
3   total crashes for xul.dll@0x3ec28f on 20091010-crashdata.csv
5   total crashes for xul.dll@0x3ec28f on 20091011-crashdata.csv
1   total crashes for xul.dll@0x3ec28f on 20091012-crashdata.csv
is bug 521164 also related?
Using the latest nightly, I can reproduce this crash 99% of the time when I visit a crash report link such as http://crash-stats.mozilla.com/report/index/032188c2-20c6-4d3f-b6b1-9fece2091011 and click on one of the related bug links. I went back to both the 10/5 and 10/6 nightly trunk builds and repeated the same STR, but wasn't able to get it to crash.

I also tested the Facebook search link, but I wasn't able to crash on any of the three builds I tested.
Not sure if it's relevant or not, but the one change in the date range that dbaron pointed at which is Windows 7 specific is the fix for bug 501490.
I have this in the debugger.  The first time I hit it, I was at SinkContext::DidAddContent+0xbignum and the rest of the stack was bogus.  But most of the time I hit it with exactly the most common signature in the crash reports.

Note that we have crashes on multiple line numbers, in order of frequency: 706, 729, 707 (based on a very rough sampling of the crash reports).

For the crash in line 706, I dug into this.  mCurrentContext is valid in ProcessLINKTag and is passed in as %esi into AddLeaf.  In AddLeaf, it's moved to %ecx before calling DidAddContent.  It's bogus in DidAddContent, and it's the same |this| as the current value in %esi in AddLeaf.  AddLeaf is simple, except for a call to Node::Add; %esi is supposed to be callee-saved, so my guess is that something inside Node::Add is clobbering %esi and/or not restoring it correctly.
This bug seems to have more activity than the preceding bug 521164, so posting here.

I got this in the debugger as well, second time it has happened for me, both times using Bugzilla. Stack: http://pastebin.mozilla.org/676493

Saved a full process dump in case anyone wants to take a look.
vlad: your analysis sounds distressingly similar to my analysis in bug 475178 comment 20. There, the compiler wasn't saving %eax prior to a function call, despite using it to store a local variable. In that case I was able to convince the compiler to produce correct code by using the return value of the function instead of ignoring it.
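For illustration, the workaround pattern referenced there is roughly the following sketch (DoWork and its use are hypothetical, not the actual bug 475178 code):

  // Before: DoWork(); with the result ignored -- the compiler kept a
  // live local in %eax across the call and miscompiled.
  // After: actually consuming the return value changed the register
  // allocation enough to produce correct code.
  nsresult rv = DoWork();
  NS_ENSURE_SUCCESS(rv, rv);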
... and I just hit this on the 1.9.2 branch.  Same symptoms, same stack, same analysis.
Flags: blocking-firefox3.6?
This is a chrome jit bug; it does not appear if chrome jit is disabled.  This is why safe mode fixes it -- safe mode forcibly disables the jit.  Reproduced with all of my extensions disabled, on both 1.9.2 and trunk.
Assignee: nobody → general
Component: General → JavaScript Engine
Flags: blocking-firefox3.6?
Product: Firefox → Core
QA Contact: general → general
Hardware: x86_64 → All
Summary: Crash [@xul.dll@0x3ec28f ] and [@SinkContext::DidAddContent(nsIContent*) ] → Chrome jit crash [@xul.dll@0x3ec28f ] and [@SinkContext::DidAddContent(nsIContent*) ]
Marking blocking 1.9.2 due to dogfood, as bugzilla very frequently triggers this.
Flags: blocking1.9.2+
Keywords: dogfood
(In reply to comment #14)
> This is a chrome jit bug; it does not appear if chrome jit is disabled.  This
> is why safe mode fixes it -- safe mode forcibly disables the jit.  Reproduced
> with all of my extensions disabled, on both 1.9.2 and trunk.

how can this be a chrome jit bug? there's no js on the stack, is there?
Disabling the JIT might move something around in memory. With no jit code on the stack at all, it's really unlikely we caused this (not impossible, just unlikely). There is a deep call stack of jit-unrelated functions. Someone in there is going to use/clobber EAX, so even if the jit messed up earlier, I don't think the damage would reach this far, and this consistently.
FYI, I manually disabled jit.chrome and went to the test URL, and after several clicks on different tabs from the menu in the upper left corner of the page I got today's trunk to crash. 

unfortunately, the crash report looks like it may be useless:

http://crash-stats.mozilla.com/report/index/fe03e30f-e051-43dd-90f7-745b52091014?p=1
I can now reproduce this with the jit entirely disabled, so the jit's off the hook I think.  But in doing so I've seen at least 5 different crash signatures/stacks, all with the same symptoms -- clobbered/incorrect registers.  I can reproduce this at will, though I haven't been able to set up a similar repro environment on the purify box.

It gets stranger, though; I changed the code in SinkContext::AddLeaf to this:

  PR_LogPrint("AddLeaf this: %p ", this);
  nsIContent *rval = mStack[mStackPos - 1].Add(aContent);
  PR_LogPrint("-> %p\n", this);

  DidAddContent(rval);

in the log file, at the end right before the crash I see:

....
0[83a140]: AddLeaf this 0: 17c618e0
0[83a140]: -> 17c618e0
0[83a140]: AddLeaf this 0: 17c618e0
0[83a140]: -> 17c618e0
0[83a140]: AddLeaf this 0: 17c618e0
0[83a140]: -> 17c618e0
0[83a140]: -> 17d66220

I'm terrified and confused.  How did we hit that second PR_LogPrint twice, without hitting the first one?  That second value is clearly bogus, and the crash happens right after when DidAddContent gets called.

This is also in a local opt build, with no PGO, and with VC9, so PGO is not to blame.
Assignee: general → nobody
Component: JavaScript Engine → General
QA Contact: general → general
Summary: Chrome jit crash [@xul.dll@0x3ec28f ] and [@SinkContext::DidAddContent(nsIContent*) ] → Crash [@xul.dll@0x3ec28f ] and [@SinkContext::DidAddContent(nsIContent*) ]
We're recursing into AddLeaf; is this valid?  Here's the stack showing how it happens.  Sometimes we crash, other times we don't after we return from the outer AddLeaf.  We often have a whole bunch of inner AddLeafs for one Add() call.
The code executing at the first js_Invoke (cx->fp->down->script) is

WindowsPreviewPerTab.jsm:108
WindowsPreviewPerTab.jsm:122
WindowsPreviewPerTab.jsm:127

which is _getFaviconAsImage, which fits, since this is adding a <link> element.  The code there does a synchronous HTTP request to get the image, so we spin up a nested event loop, which processes more data that we got on the parser listener channel before we're actually done processing the last chunk, and then very bad things happen.

I'm guessing that this bug isn't tab preview specific, but that it was very rare to hit this situation before.
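To make the hazard concrete, here is a minimal C++ sketch of the re-entrancy pattern (the real code is JS in WindowsPreviewPerTab.jsm; this helper and its name are hypothetical). A synchronous Open() on an HTTP channel blocks by processing events on the main thread, so more parser data can be delivered while the sink is still mid-AddLeaf:

  #include "nsNetUtil.h"

  // Hypothetical sketch, not the actual taskbar-preview code.
  nsresult FetchFaviconSynchronously(nsIURI* aFaviconURI)
  {
    nsCOMPtr<nsIChannel> channel;
    nsresult rv = NS_NewChannel(getter_AddRefs(channel), aFaviconURI);
    NS_ENSURE_SUCCESS(rv, rv);

    // Synchronous open: spins a nested event loop until data arrives,
    // which can deliver more OnDataAvailable calls to a parser whose
    // SinkContext::AddLeaf for this <link> is still on the stack.
    nsCOMPtr<nsIInputStream> stream;
    return channel->Open(getter_AddRefs(stream));
  }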
Component: General → HTML: Parser
QA Contact: general → parser
Group: core-security
vlad's filed bug 522416 on the tab preview thing doing Open() on an HTTP channel.  But web content could trigger this too, using sync XHR from a DOMLinkAdded handler, as vlad noted.  Or anything else that happened to run off a script runner and trigger content script.  So just making DOMLinkAdded truly async is not a real solution.

The sink clearly can't deal with reentry here, and I guess we can fix that by putting a script blocker higher up on the stack in the sink.  Will that be good enough?  Can the parser deal with reentry in this situation?

Henri, can the HTML5 parser deal with script running under its AppendChildTo/InsertChildAt operations, or whenever the script blockers it puts around those operations come off the stack?
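For reference, a script blocker here means the stack-based nsAutoScriptBlocker helper. A minimal sketch of the idea, assuming it wraps the sink's add-leaf path (this is not the patch that eventually landed):

  // Sketch only: defer script runners -- and thus DOMLinkAdded handlers --
  // until the blocker leaves scope, so nothing can re-enter the sink
  // between Add() and DidAddContent().
  nsresult SinkContext::AddLeaf(nsIContent* aContent)
  {
    nsAutoScriptBlocker scriptBlocker;

    nsIContent* rval = mStack[mStackPos - 1].Add(aContent);
    DidAddContent(rval);
    return NS_OK;
  }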
Do we feel that this issue is a P1 blocker for 3.6, or just the fact that it's being triggered by Aero Peek in bug 522416? My instinct is to move this to a P2.
I think that's correct, yeah.
Priority: P1 → P2
Summary: Crash [@xul.dll@0x3ec28f ] and [@SinkContext::DidAddContent(nsIContent*) ] → content sink re-entry causes crash [@xul.dll@0x3ec28f ] and [@SinkContext::DidAddContent(nsIContent*) ]
I think Blake/Olli/Jonas should comment on whether they think we can fix this in a way they're comfortable shipping without beta baking....  But yes, generally I think this should be OK as P2.
I'll take this.

Recursing into AddLeaf isn't really unexpected, but obviously we're not dealing with it properly.
Assignee: nobody → jonas
Even with taskbar previews disabled, I still see frequent crashes (though less frequent), with this stack:

http://crash-stats.mozilla.com/report/index/1c15c9d1-5f5c-4b8d-8101-e19ef2091016

This is I think the same as in bug 521164, and is probably directly related to this.
Whiteboard: [#1 trunk (3.7a1) topcrash]
Whiteboard: [#1 trunk (3.7a1) topcrash] → [#1 trunk (3.7a1) topcrash][#1 3.6b1 topcrash]
Whiteboard: [#1 trunk (3.7a1) topcrash][#1 3.6b1 topcrash] → [#1 trunk (3.7a1) topcrash]
still early but this is near the top of the 3.6b1 topcrash list.

25 total crashes for SinkContext::DidAddContent on 20091102-crashdata.csv
4 startup crashes within 3 minutes

os breakdown
  22 SinkContext::DidAddContent(nsIContent*) Windows NT 6.1.7600
   3 SinkContext::DidAddContent(nsIContent*) Windows NT 6.1.7100

distribution of all versions where the SinkContext::DidAddContent crash was found on 20091102-crashdata.csv
  25 Firefox 3.6b1

domains of sites
   7 http://www.facebook.com
   3 http://www.youtube.com
   2 http://vkontakte.ru
   1 http://xhamster.com
   1 http://www.teamsolutions.fr
   1 http://www.soundpharmaceuticals.com
Whiteboard: [#1 trunk (3.7a1) topcrash] → [#1 trunk (3.7a1) topcrash][crashkill]
possible dupe at bug 522286, which is not security-closed
The OS breakdown still indicates only windows 7, which is weird, since I thought we disabled all the win7-specific stuff.  I guess we still have the tab list, whatever that's called? (peek?)
Jonas, why aren't we seeing progress here?  Is there anything you need?  You mentioned issues with recursing into AddLeaf above, but have you investigated what's going on?
Summary: content sink re-entry causes crash [@xul.dll@0x3ec28f ] and [@SinkContext::DidAddContent(nsIContent*) ] → content sink re-entry causes crash [ @xul.dll@0x3ec28f ] and [@ SinkContext::DidAddContent(nsIContent*) ]
I know what's going on here and hope to have a patch today.
I'm fairly certain this isn't a regression. What's happened is that some extensions (possibly Weave) have started doing evil things in their DOMLinkAdded listener.

If someone has evidence that this is actually a regression, feel free to re-add the regression keyword.
thunder, any ideas on comment 34 ?
OK, if I look back at crash data from August, I can see traces of this signature on older builds:

grep SinkContext::DidAddContent 200908* | awk -F\t '{print $1,$8}' | sort | uniq -c
   2 20090807-crashdata.csv:SinkContext::DidAddContent(nsIContent*) 3.5.1
   2 20090812-crashdata.csv:SinkContext::DidAddContent(nsIContent*) 3.5
   1 20090814-crashdata.csv:SinkContext::DidAddContent(nsIContent*) 3.0.1
   1 20090815-crashdata.csv:SinkContext::DidAddContent(nsIContent*) 3.5.2
   1 20090821-crashdata.csv:SinkContext::DidAddContent(nsIContent*) 3.5.2
   1 20090825-crashdata.csv:SinkContext::DidAddContent(nsIContent*) 3.5.2
   1 20090830-crashdata.csv:SinkContext::DidAddContent(nsIContent*) 3.0.10

It also looks like 3.5.3, 3.5.4, and 3.0.x are in the 0-2 crashes per day range on this signature, while trunk and 3.6b1 are around 24 to 25 crashes per day.
Can you group by OS?  Would be interesting to know what proportion of the trunk and 3.6b1 crashes are on Win7, since we know that tickled the bug aggressively.
(In reply to comment #35)
> thunder, any ideas on comment 34 ?
From the looks of the stack, I'd say it's because Weave likes to spin the event loop.  We have a nested event loop in that stack.  This is assuming that the JS on that stack is Weave.
There seem to be several issues here. It's unclear to me whether there are just several ways to trigger the same bug, or if it's in fact several different bugs.

I'm currently writing up a testcase which reproduces the spinning of the event loop, à la the stack in attachment 406382 [details], and then a fix for that crash. Then I'll look at whether the fix might cover other ways to crash at the same signature.
(In reply to comment #37)
> Can you group by OS?  Would be interesting to know what proportion of the trunk
> and 3.6b1 crashes are on Win7, since we know that tickled the bug aggressively.

Yeah, it looks like it's pretty uniformly divided.  The 3.6/3.7 crashes are on NT 6.1, and the 3.0.x and 3.5.x crashes are on older versions of Windows.  Here is the data for Oct/Nov:

crash counts - sig - firefox version - os version 

 326 SinkContext::DidAddContent(nsIContent*) 3.6b1 Windows NT 6.1.7600
 167 SinkContext::DidAddContent(nsIContent*) 3.7a1pre Windows NT 6.1.7600
 122 SinkContext::DidAddContent(nsIContent*) 3.6b1 Windows NT 6.1.7100
  83 SinkContext::DidAddContent(nsIContent*) 3.7a1pre Windows NT 6.1.7100
  11 SinkContext::DidAddContent(nsIContent*) 3.6b2pre Windows NT 6.1.7100
  11 SinkContext::DidAddContent(nsIContent*) 3.6b1pre Windows NT 6.1.7100
   6 SinkContext::DidAddContent(nsIContent*) 3.7a1pre Windows NT 6.1.7232
   5 SinkContext::DidAddContent(nsIContent*) 3.6b2pre Windows NT 6.1.7600
   4 SinkContext::DidAddContent(nsIContent*) 3.6b1pre Windows NT 6.1.7600
   3 SinkContext::DidAddContent(nsIContent*) 3.6b2pre Windows NT 6.1.7229 Service Pack 2
   3 SinkContext::DidAddContent(nsIContent*) 3.6b1 Windows NT 6.1.7137
   2 SinkContext::DidAddContent(nsIContent*) 3.7a1pre Windows NT 6.1.7077
   2 SinkContext::DidAddContent(nsIContent*) 3.6b1 Windows NT 6.1.7260
   1 SinkContext::DidAddContent(nsIContent*) 3.7a1pre Windows NT 6.1.7600 Service Pack 2
   1 SinkContext::DidAddContent(nsIContent*) 3.7a1pre Windows NT 6.1.7229
   1 SinkContext::DidAddContent(nsIContent*) 3.7a1pre Windows NT 6.1.7201
   1 SinkContext::DidAddContent(nsIContent*) 3.6b2pre Windows NT 6.1.7057
   1 SinkContext::DidAddContent(nsIContent*) 3.6b1 Windows NT 6.1.7600 Service Pack 2
   1 SinkContext::DidAddContent(nsIContent*) 3.6b1 Windows NT 6.1.7264
   1 SinkContext::DidAddContent(nsIContent*) 3.6b1 Windows NT 6.1.7229
   1 SinkContext::DidAddContent(nsIContent*) 3.6b1 Windows NT 6.1.7201
   1 SinkContext::DidAddContent(nsIContent*) 3.6b1 Windows NT 6.1.7068
   1 SinkContext::DidAddContent(nsIContent*) 3.6b1 Windows NT 6.1.7057
 


  23 SinkContext::DidAddContent(nsIContent*) 3.0.14 Windows NT 5.1.2600 Service Pack 2
  22 SinkContext::DidAddContent(nsIContent*) 3.0.14 Windows NT 5.1.2600 Service Pack 3
   8 SinkContext::DidAddContent(nsIContent*) 3.0.14 Windows NT 5.1.2600 Dodatek Service Pack 3
   4 SinkContext::DidAddContent(nsIContent*) 3.5.3 Windows NT 5.1.2600 Service Pack 2
   1 SinkContext::DidAddContent(nsIContent*) 3.5.4 Windows NT 6.0.6002 Service Pack 2
   1 SinkContext::DidAddContent(nsIContent*) 3.5.4 Windows NT 5.1.2600 Service Pack 3
   1 SinkContext::DidAddContent(nsIContent*) 3.5.3 Windows NT 5.1.2600 Service Pack 3
   1 SinkContext::DidAddContent(nsIContent*) 3.0.14 Windows NT 5.1.2600 Dodatek Service Pack 2
So it appears things are more complicated than I had hoped.

I have a testcase which recurses into AddLeaf, and the code is dealing with it just fine. So maybe it's only failing under certain other conditions, or the recursion was a red herring.

Will have to continue investigation tomorrow.
Oh, and in the meantime, if someone is able to reliably reproduce, do let me know.
I don't think all recursion is broken, just that sometimes we end up in very much the wrong state.  I would suggest going back to one of the previous builds, before we fixed the sync http stuff, and running on Windows 7 -- it's usually trivially reproducible (especially when opening lots of bugzilla pages), and can then be reproduced on demand with session restore.
Actually, I got another idea on the drive home. If the context is going away during the recursion, that should definitely kill us. Should be easy to try using a misplaced-table-content context. Will try this tomorrow.

I'll also try with a build before Shawn fixed the tab preview code.
I still don't understand how this can possibly handle reentry correctly in general.  If we reenter under Add, won't we have mStackPos and such totally unrelated to aContent by the time we get into DidAddContent for the first Add?  In particular, won't the "immediate notification" case be completely screwed?
Jonas: I can reproduce this reliably using the QA x64 Win 7 lab machine (using an older build, Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.3a1pre) Gecko/20091010 Minefield/3.7a1pre GTB6). I also installed Weave on that build.
Marcia, Awesome!

Ah, indeed, we risk notifying every which way. Though the interesting thing is that we only notify if we're in a table-misplaced-content-context. Or rather, we only notify in a way that would break in that case.

Hmm.. In general it'll be hard to write testcases that cover all possibilities here. We really need to fuzz this. Though that turns out to be hard because DOMLinkAdded is only fired on chrome these days. So it really needs to be chrome that is misbehaving. Unless there are other elements that also fire events in the same way, but I can't find any.

I'll chat with Jesse to see if we can figure out a way to fuzz here. In the meantime I'll write up a patch and see if it helps with Marcia's steps to reproduce.

Adding back the regression keyword since it really seems like it's the Windows toolbar code that's breaking us, though the bug is very arguably still in the content sink.
Keywords: regression
(In reply to comment #47)
> Adding back regression keyword since it really seems like it's the windows
> toolbar code that's breaking us, though the bug is very arguably in the
> contentsink still.

Then we should do a regression test now that we can reproduce it reliably.
Wow, this code is mind-boggling; the notification stuff *might* actually be working as it is. It may be that FlushTags will notify on the node properly if script inside the DOMLinkAdded handler does something to the DOM. In that case mNumFlushed will be set to the child count, and DidAddContent won't do anything.

We've known for a long time that this code is too complex, and that this batching crap needs to go, which is happening in the HTML5 parser code.

More investigating coming up...
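To illustrate the invariant described above, here is a heavily simplified sketch of the bookkeeping (member names are from the real sink, but the real DidAddContent has more cases than this):

  // If a re-entrant FlushTags() already notified for this parent,
  // mNumFlushed has caught up with the child count and there is
  // nothing left for DidAddContent to do.
  void SinkContext::DidAddContentSketch(nsIContent* aContent)
  {
    Node& node = mStack[mStackPos - 1];
    PRUint32 childCount = node.mContent->GetChildCount();
    if (node.mNumFlushed < childCount) {
      mSink->NotifyAppend(node.mContent, node.mNumFlushed);
      node.mNumFlushed = childCount;
    }
  }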
This is not a full fix for the bug, but it's something I think we should do. We already prevent the parser from pulling in more network data while we're executing a <script> element; however, that doesn't protect us if we're executing scripts for other reasons. This patch should take care of that.

Blake, since this is parser land it would be great to get your input on whether you think this is useful and safe enough to go on the branch. I have a plan for fixing this bug in other ways, so it might not be strictly needed.
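A minimal sketch of the guard being described, assuming it's an early return in the parser's data pump (the real change is attachment 410728; ScriptIsExecuting() is a hypothetical predicate):

  // Sketch only: refuse to consume more network data whenever script is
  // on the stack, mirroring the existing protection for <script>.
  nsresult MaybeBlockParsing()
  {
    if (ScriptIsExecuting()) {
      // A DOMLinkAdded handler (or any other script runner) is running;
      // feeding the sink more data now would re-enter it mid-operation.
      // NS_ERROR_HTMLPARSER_BLOCK is the parser's existing "resume me
      // later" code.
      return NS_ERROR_HTMLPARSER_BLOCK;
    }
    return NS_OK;
  }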
Attachment #410728 - Flags: review?(mrbkap)
Attached patch: Should fix it (Splinter Review)
This patch should be enough to fix things. I've pushed a patch to the tryserver so I can try a build on the boxes where I am able to reproduce.

I'm a little uneasy about not knowing exactly why recursing into the sink again fails. I'd imagine it's the notification stuff, but I'd like to be sure. In any event, this patch should make it so that the only place where we recurse back into the sink is from <script> elements, which has received significant testing over the years.

Will ask for review once I'm more sure that this patch fixes the bug.

Talked to Jesse about fuzzing the sink (it would be great for the HTML5 sink), but it's non-trivial. Something we should do for sure, though.
Attachment #410728 - Flags: review?(mrbkap) → review+
Jonas, would that patch happen to fix bug 467005 on branches too?  Or is that happening from under a <script> element so still allowed?
Attachment #410732 - Flags: superreview?(jst)
Attachment #410732 - Flags: review?(bnewman)
Comment on attachment 410732 [details] [diff] [review]
Should fix it

Tried this on the machine and profile where we could reliably reproduce this before, and it's no longer crashing with this patch. So asking for reviews.
(In reply to comment #53)
> Jonas, would that patch happen to fix bug 467005 on branches too?  Or is that
> happening from under a <script> element so still allowed?

Hmm.. I'll try, but I don't think so. The document.write there shouldn't be affected. I think. (can't wait to have code we actually fully understand)
Comment on attachment 410732 [details] [diff] [review]
Should fix it

Excellent! sr=jst
Attachment #410732 - Flags: superreview?(jst) → superreview+
Comment on attachment 410732 [details] [diff] [review]
Should fix it

Ben Turner said he might be able to review this today
Attachment #410732 - Flags: review?(bnewman) → review?(bent.mozilla)
Comment on attachment 410732 [details] [diff] [review]
Should fix it

r=me. After talking with sicking more, this seems like the best approach given the fragility of the sink and the impending HTML5 parser landing.
Attachment #410732 - Flags: review?(bent.mozilla) → review+
http://hg.mozilla.org/mozilla-central/rev/6da836add4b9
http://hg.mozilla.org/mozilla-central/rev/8abdde9f628c

Pushed to trunk. I'd lean towards just taking attachment 410732 [details] [diff] [review] on the branch. That one seems safer and should be enough to fix the problem here.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Summary: content sink re-entry causes crash [ @xul.dll@0x3ec28f ] and [@ SinkContext::DidAddContent(nsIContent*) ] → content sink re-entry causes crash [@ xul.dll@0x3ec28f ] and [@ SinkContext::DidAddContent(nsIContent*) ]
Crash Signature: [@ xul.dll@0x3ec28f ] [@ SinkContext::DidAddContent(nsIContent*) ]
Group: core-security