Closed Bug 549888 Opened 14 years ago Closed 14 years ago

Crash [@ nsDisplayList::PaintThebesLayers ] and [@ nsDisplayList::PaintThebesLayers(nsDisplayListBuilder*, nsTArray<nsDisplayList::LayerItems> const&) ]

Categories

(Core Graveyard :: Plug-ins, defect)

x86
Windows 7
defect
Not set
normal

Tracking

(blocking2.0 beta1+, status1.9.2 .4-fixed)

RESOLVED FIXED
Tracking Status
blocking2.0 --- beta1+
status1.9.2 --- .4-fixed

People

(Reporter: bas.schouten, Assigned: cjones)

References

Details

(Whiteboard: [fixed-lorentz])

Attachments

(7 files)

It appears on latest trunk crashes in PaintThebesLayers are quite common:

One for example: http://crash-stats.mozilla.com/report/index/8e881458-b888-4579-8322-643be2100303

0  	xul.dll  	nsDisplayList::PaintThebesLayers  	 layout/base/nsDisplayList.cpp:832
1 	xul.dll 	nsDisplayList::Paint 	layout/base/nsDisplayList.cpp:771
2 	xul.dll 	nsLayoutUtils::PaintFrame 	layout/base/nsLayoutUtils.cpp:1195
3 	xul.dll 	PresShell::Paint 	layout/base/nsPresShell.cpp:5764
4 	xul.dll 	nsViewManager::Refresh 	view/src/nsViewManager.cpp:426
5 	xul.dll 	nsViewManager::DispatchEvent 	view/src/nsViewManager.cpp:877
6 	xul.dll 	HandleEvent 	view/src/nsView.cpp:160
7 	xul.dll 	nsWindow::DispatchEvent 	widget/src/windows/nsWindow.cpp:3061
8 	xul.dll 	nsWindow::DispatchWindowEvent 	widget/src/windows/nsWindow.cpp:3089
9 	xul.dll 	nsWindow::OnPaint 	widget/src/windows/nsWindowGfx.cpp:542
10 	xul.dll 	nsWindow::ProcessMessage 	widget/src/windows/nsWindow.cpp:4085
11 	xul.dll 	nsWindow::WindowProc 	widget/src/windows/nsWindow.cpp:3777
12 	user32.dll 	InternalCallWinProc 	
13 	user32.dll 	UserCallWinProcCheckWow 	
14 	user32.dll 	DispatchClientMessage 	
15 	user32.dll 	__fnDWORD 	
16 	ntdll.dll 	KiUserCallbackDispatcher 	
17 	xul.dll 	NS_GetXPTCallStub_P 	
18 	user32.dll 	DispatchMessageW 	
19 	xul.dll 	nsBaseAppShell::OnProcessNextEvent 	widget/src/xpwidgets/nsBaseAppShell.cpp:293
20 	xul.dll 	nsTArray_base::ShiftData 	obj-firefox/xpcom/build/nsTArray.cpp:162
21 	nspr4.dll 	_PR_MD_UNLOCK 	nsprpub/pr/src/md/windows/w95cv.c:344
22 	xul.dll 	MessageLoop::RunInternal 	ipc/chromium/src/base/message_loop.cc:216
23 	xul.dll 	MessageLoop::RunHandler 	ipc/chromium/src/base/message_loop.cc:199
24 	xul.dll 	NS_GetXPTCallStub_P
Version: 1.9.2 Branch → Trunk
I actually tried to reproduce this crash yesterday, but have not yet been able to.
Bas: I can reproduce a crash that is similar 100% - but my stack is slightly different -> http://crash-stats.mozilla.com/report/index/bp-1cfb7a8e-e1bc-4013-9d6a-5a1f42100303 is my report. I am running  Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.3a3pre) Gecko/20100303 Minefield/3.7a3pre (.NET CLR 3.5.30729).

The site requires a login and I can give you exact STR with that site.
This is my bug. Can you give me exact STR? Thanks!!!
Assignee: nobody → roc
blocking2.0: --- → beta1
Thanks Marcia, but I don't get a crash with those steps :-(
Marcia, are you using any extensions?
Roc; here are my extensions:

Microsoft .NET Framework Assistant1.1true{20a82645-c095-46ed-80e3-08825760534b}
RealPlayer Browser Record Plugin1.0true{ABDE892B-13A8-4d1b-88E6-365A6E755758}
Bing Bar5.0truemsntoolbar@msn.com
Search Helper Extension1.0true{27182e60-b5f3-411c-b545-b44205977502}
Google Gears0.5.33.0false{000a9d1c-beef-4f90-9363-039d445309b8}
Adblock Plus1.1.3true{d10d0bf8-f5b5-c8b4-a8b2-2b9879e08c5d}
Nightly Tester Tools2.0.3true{8620c15f-30dc-4dba-a131-7c5d20cf4a29}
ChatZilla0.9.86false{59c81df5-4b7a-477b-912d-4e0fdf64e5f2}

I am running on a Windows 7 HP Touchsmart tx2. I can try to see if I can repro on another Windows machine.
Will try with a fresh profile as well.
Happens for me with a fresh profile. Would it be useful to use the vm that jst uses to troubleshoot?
Summary: Crash [@ nsDisplayList::PaintThebesLayers ] → Crash [@ nsDisplayList::PaintThebesLayers ] and [@ nsDisplayList::PaintThebesLayers(nsDisplayListBuilder*, nsTArray<nsDisplayList::LayerItems> const&) ]
Yes! If you could reproduce it while recording in that VM, I'd definitely be able to solve it one way or another, perhaps by using RDesktop to log into the machine. jst or bent can help you if you haven't used recording before.
roc: jst advises the vm is in use. As soon as it is available I will try to get a recording for you.

I also tried reproducing it in the QA lab on the Windows 7 box. I did not crash, but I did get a hang when I moused over the pictures as well as some odd black boxes that were painted.

(In reply to comment #11)
> Yes! If you could reproduce it while recording in that VM, I'd definitely be
> able to solve it one way or another, perhaps by using RDesktop to log into the
> machine. jst or bent can help you if you haven't used recording before.
roc: I thing martijn and I noticed is that you have to scroll the picture list quickly in order to trigger the crash. Moving your mouse from left to right quickly seems to trigger the crash more easily.

Also using a trunk build of Mac I do crash when I right click the area but I got a different stack. I can post the stack if you think it would be useful.
roc: wasn't able to reproduce this in Replay VM, but we used WinDbg on a Vista machine and this is what we got:

First attempt:

ntdll!NtTerminateProcess+0x12
ntdll!RtlExitUserProcess+0x7a
kernel32!ExitProcess+0x12
MOZCRT19!__crtExitProcess+0x2e [e:\builds\moz2_slave\mozilla-central-win32-nightly\build\obj-firefox\memory\jemalloc\crtsrc\crt0dat.c @ 683]
MOZCRT19!doexit+0x116 [e:\builds\moz2_slave\mozilla-central-win32-nightly\build\obj-firefox\memory\jemalloc\crtsrc\crt0dat.c @ 596]
MOZCRT19!exit+0xe [e:\builds\moz2_slave\mozilla-central-win32-nightly\build\obj-firefox\memory\jemalloc\crtsrc\crt0dat.c @ 398]
firefox!__tmainCRTStartup+0x169 [e:\builds\moz2_slave\mozilla-central-win32-nightly\build\obj-firefox\memory\jemalloc\crtsrc\crtexe.c @ 605]
kernel32!BaseThreadInitThunk+0xe
ntdll!__RtlUserThreadStart+0x23
ntdll!_RtlUserThreadStart+0x1b

Second Attempt:
00 0043fd2c 777a4d03 ffffffff 00000000 0043fd50 ntdll!NtTerminateProcess+0x12 (FPO: [2,0,0])
01 0043fd3c 75c54273 00000000 77e8f3b0 ffffffff ntdll!RtlExitUserProcess+0x7a (FPO: [1,0,0])
02 0043fd50 71e2179e 00000000 00a1d07c 71e21b67 kernel32!ExitProcess+0x12 (FPO: [1,0,0])
03 0043fd5c 71e21b66 00000000 26c3c9ad 75c40e2c MOZCRT19!__crtExitProcess(int status = 1910642590)+0x2e (FPO: [1,0,4]) (CONV: cdecl) [e:\builds\moz2_slave\mozilla-central-win32-nightly\build\obj-firefox\memory\jemalloc\crtsrc\crt0dat.c @ 683]
04 0043fd94 71e21bee 00000000 00000000 00000000 MOZCRT19!doexit(int code = 0, int quick = 10604668, int retcaller = 1910643559)+0x116 (FPO: [Non-Fpo]) (CONV: cdecl) [e:\builds\moz2_slave\mozilla-central-win32-nightly\build\obj-firefox\memory\jemalloc\crtsrc\crt0dat.c @ 596]
05 0043fda4 008616f9 00000000 26cd90f0 00000000 MOZCRT19!exit(int code = 1976298699)+0xe (FPO: [1,0,0]) (CONV: cdecl) [e:\builds\moz2_slave\mozilla-central-win32-nightly\build\obj-firefox\memory\jemalloc\crtsrc\crt0dat.c @ 398]
06 0043fddc 75cbeccb 7efde000 0043fe28 777dd24d firefox!__tmainCRTStartup(void)+0x169 (FPO: [Non-Fpo]) (CONV: cdecl) [e:\builds\moz2_slave\mozilla-central-win32-nightly\build\obj-firefox\memory\jemalloc\crtsrc\crtexe.c @ 605]
07 0043fde8 777dd24d 7efde000 43c4fc97 00000000 kernel32!BaseThreadInitThunk+0xe (FPO: [1,0,0])
08 0043fe28 777dd45f 00861860 7efde000 00000000 ntdll!__RtlUserThreadStart+0x23 (FPO: [SEH])
09 0043fe40 00000000 00861860 7efde000 00000000 ntdll!_RtlUserThreadStart+0x1b (FPO: [2,2,0])
Those stacks don't look like crashes. They look like exits.

How do you "scroll" the picture list? IIRC when I moused over it it just highlighted one of the photos.
Attached file stack from debug build
This is a stack that I get from a debug build.
I finally managed to reproduce this. This stack shows the real problem...
This is quite disturbing. Basically, we call into the plugin to ask it to paint (it's windowless), and during that call it asks us to run some script, which does a SetInnerHTML, which causes frame reconstruction. Painting depends on the frame tree not changing during paint...

The script we're evaluating is

try { __flash__toXML(PhotoTip.changeAnketaInfo("с 8 марта, девушки!","سعيد","23","F","Россия, Москва","null","http://mamba.ru/anketa.phtml?oid=306869119&s=S9llD09hch.ydn2mrQnqiPZCvV24h2LR",0,271)) ; } catch (e) { "<undefined/>"; }

I've heard rumours of evil Flash code like this before, but never seen it. This could explain some of the mysterious crash-on-corrupt-frames-during painting bugs that we've seen in the past.

I see two options:
1) don't call into plugin code during painting
2) don't allow NPN_Evaluate to run during painting

One thing is that we have nsContentUtils scriptblocking in place during painting, and I think it's pretty clear that mozilla::plugins::parent::_evaluate should not be running scripts while there's a scriptblocker. So I propose that if it's not currently safe to run script, _evaluate should use a scriptrunner to delay script execution until it's safe to run script. I guess in that case _evaluate will have to return void or something like that. That could break content, but that's better than breaking the browser. That would take care of #2.

Should nsJSContext::EvaluateStringWithValue assert when scripts are blocked?

We might also want to work towards #1 as well. Asynchronous painting for out-of-process plugins would do it. That would basically mean that any content that's broken by #2 would be fixed.
Karl pointed out on IRC that it's possible the plugin isn't actually servicing its paint event. The NPN_Evaluate could be coming across asynchronously while the plugin services a timer or input event, for example --- the messages just crossed in flight. So we can't do what I suggested, because that would cause NPN_Evaluate that expects results to randomly fail.
Yeah, this is "the bug" we talked about in the security review. It's important to solve the case where the plugin is accidentally triggering this behavior by changing the RPC race resolution for painting. In general we should also disallow any npruntime calls during painting, but it turns out that plugins actually rely on these currently (e.g. silverlight calls pluginelement.clientWidth every time it paints).

-> cjones for now to add a special case to the race resolution
Assignee: roc → jones.chris.g
Blocks: LorentzBeta1
I tried reproducing the bug with dom.ipc.plugins.enabled false, and I couldn't,
although that might not mean much since I can't reproduce this very well.

Asynchronous plugin painting would still be a valid fix here.

Can you explain how this RPC race resolution will work?
Currently when there is a race between two RPC messages, we have a "child always wins" policy, so e.g. if the following messages race:

parent->child: paint
child->parent: evaluate

We will call the evaluate within the paint, because that's the child message.

We can override this behavior for specific messages so that paint messages always win the race, and we deliver the paint within the child's evaluate call. The child's evaluate call won't be delivered until the stack unwinds normally.
FWIW, I can pretty easily write a testcase where this fails for non-IPC silverlight, but that's probably not what *usually* happens here.
Having the plugin process 'paint' inside its 'evaluate' sounds risky to me. Assuming we can handle an 'evaluate' inside a 'paint' on the browser side (possibly by restarting painting), wouldn't it be better for us to do that?
paint-inside-evaluate is perfectly normal, we do it all the time if the script causes synchronous painting, which at least happens if you set .scrollTop from within the script. We even have tests for that case.

If we can arrange for asynchronous painting a-la chromium, that's a good plan: is it a short-term fix, though? I think we can be good-enough to switching the race resolution and perhaps disallowing any npruntime calls during painting.
OK, I believe you :-) ... go ahead and do that then.
> e.g. silverlight calls pluginelement.clientWidth every time it paints

Hmm...  It may well be possible to create a testcase where that clientWidth call tears down the entire frametree the plugin is in.  Depends on exactly which presshells viewmanager flushes layout for before painting....
r? to jimm and karlt for the platform-specific sending of Paint().
Attachment #431205 - Flags: review?(karlt)
Attachment #431205 - Flags: review?(jmathies)
Attachment #431205 - Flags: review?(bent.mozilla)
Attachment #431205 - Flags: review?(karlt) → review+
Attachment #431203 - Flags: review?(bent.mozilla) → review+
Attachment #431204 - Flags: review?(bent.mozilla) → review+
Attachment #431205 - Flags: review?(bent.mozilla) → review+
Attachment #431205 - Flags: review?(jmathies) → review+
Either I've lost the knack for reproducing this bug, or it's no longer reproducible for me.

Marcia, how do you scroll that row of photos? whatever I do now with the mouse, I can't get them to scroll.
The row of photos scrolls by itself, occasionally. You don't have to hover over for that. But I can't reproduce the crash anymore on that site, so it seems like the patch has fixed this particular crash.

Not sure if this is the same crash, but:
I also managed once to crash with a different stack on this site: http://www.vanderaa.net/afterlife-video-trailer
http://crash-stats.mozilla.com/report/index/bp-19419c7f-f8ea-42d8-a7ac-b653a2100314
I think (tentative) steps to reproduce: right-click on the main video, then
minimize window, then close tab or something.
I think that I got a ghost flash context menu, while doing this. I get that a lot with oopp plugins enabled.
Alright. I filed bug 552512 on finding a way to stop plugins from intentionally changing our DOM during painting. I think this bug, as filed, is fixed.
Oh and Martijn, if you can confirm steps to reproduce for that vanderaa site and file a new bug, that would be great!
I actually still can repro this crash using today's trunk - here is my report: http://crash-stats.mozilla.com/report/index/e194791b-c101-4bfb-9930-0e4772100315

As far as STR, on this Win machine I am not using the mouse to scroll the list when I crash. I basically using the touchpad to move the cursor across the pictures quickly and it is fairly easy to get it to crash.
Can you reproduce it if you disable out-of-process plugins?
If I disable OOPP I cannot reproduce the crash.

(In reply to comment #41)
> Can you reproduce it if you disable out-of-process plugins?
OK, I'm going to reopen this bug then since it sounds like the problem is still OOPP-focused.
Blocks: OOPP
Status: RESOLVED → REOPENED
Component: Layout → Plug-ins
QA Contact: layout → plugins
Resolution: FIXED → ---
I have this in recording...
We're calling NPP_SetWindow in this stack frame while we're painting, which races against NPN_Evaluate.

>	mozilla::ipc::RPCChannel::Call(msg=0x0030c940, reply=0x07265a1c) Line 158	C++
 	mozilla::plugins::PPluginInstanceParent::CallNPP_SetWindow(window={...}) Line 179	C++
 	mozilla::plugins::PluginInstanceParent::NPP_SetWindow(aWindow=0x07296ec4) Line 433	C++
 	mozilla::plugins::PluginModuleParent::NPP_SetWindow(instance=0x07265a1c, window=0x07296ec4) Line 411	C++
 	nsNPAPIPluginInstance::SetWindow(window=0x07296ec4) Line 1259	C++
 	nsObjectFrame::PaintPlugin(aRenderingContext={...}, aDirtyRect={...}, aPluginRect={...}) Line 1754	C++
 	nsDisplayPlugin::Paint(aBuilder=0x0030ce50, aCtx=0x049f0880) Line 1152	C++
 	nsDisplayList::PaintThebesLayers(aBuilder=0x0030ce50, aLayers={...}) Line 843	C++
 	nsDisplayList::Paint(aBuilder=0x0030ce50, aCtx=0x00000000, aFlags=0x00000001) Line 772	C++
 	nsLayoutUtils::PaintFrame(aRenderingContext=0x00000000, aFrame=0x06e342a8, aDirtyRegion={...}, aBackstop=0xffffffff, aFlags=0x00000004) Line 1197	C++
 	PresShell::Paint(aDisplayRoot=, aViewToPaint=, aWidgetToPaint=, aDirtyRegion=, aPaintDefaultBackground=) Line 5790	C++
 	_NtGdiGetAppClipBox@8()	
 	_cairo_win32_save_initial_clip(hdc=0x0030d188, surface=0x00000000) Line 2281	C
 	ConvertDeviceRegionToAppRegion(aIn={...}, aContext=0x00000000) Line 367	C++
 	nsViewManager::DispatchEvent(aEvent=, aView=, aStatus=) Line 877	C++
 	_cairo_clip_init_copy(clip=0x6bbda2e8, other=0x6b5ef6a8) Line 79	C
 	_cairo_gstate_init_copy(gstate=0x04e67178, other=0x04e67218) Line 185	C
 	_cairo_gstate_redirect_target(gstate=0x00000001, child=0x04e3a420) Line 319	C
 	nsWindow::DispatchEvent(event=0x0030d338, aStatus=nsEventStatus_eIgnore) Line 3058	C++
 	nsWindow::DispatchWindowEvent(event=0x0030d338, aStatus=nsEventStatus_eIgnore) Line 3087	C++
 	nsWindow::OnPaint(aDC=0x00000000) Line 535	C++

The script being evaluated is """
try { __flash__toXML(PhotoTip.changeAnketaInfo("null","Инна","45","F","РоÑÑиÑ, МоÑква","null","<private URL removed for privacy>",0,45)) ; } catch (e) { "<undefined/>"; }
"""

Reading nsObjectFrame::PaintPlugin there are three different plugin method calls which can be made there:
HandleEvent(WM_WINDOWPOSCHANGED)
SetWindow()
Paint()

Ideally I think we should coalesce these into one RPC method call, especially because I think the OOPP plugin host does special trickery with WM_WINDOWPOSCHANGED anyway. It would certainly help with the number of RPC round-trips per-paint.
(In reply to comment #44)
> I have this in recording...

Do you have STR I can try?

(In reply to comment #45)
> Reading nsObjectFrame::PaintPlugin there are three different plugin method
> calls which can be made there:
> HandleEvent(WM_WINDOWPOSCHANGED)
> SetWindow()
> Paint()
> 
> Ideally I think we should coalesce these into one RPC method call, especially
> because I think the OOPP plugin host does special trickery with
> WM_WINDOWPOSCHANGED anyway. It would certainly help with the number of RPC
> round-trips per-paint.

I like that idea, that might also address bug 552062.
To be more specific about the issue:

nsObjectFrame::PaintPlugin makes several calls into nsIPluginInstance as part of painting. The proposed solutions from today's meeting:
 
* make all the resulting RPC calls win the race. Hard, because the OOPP code doesn't always know which calls are a result of painting
* unify those function calls into a single paint RPC method (which end up calling several functions on the plugin side)
** either by changing nsIPluginInstance and nsObjectFrame to call a single all-in-one-function
** or changing the OOPP code to magically detect the setwindow/windowposchanged/paint sequence and send it as a single call. Hard, because setwindow and windowposchanged may also be fired as independent events not associated with painting.
> ** either by changing nsIPluginInstance and nsObjectFrame to call a single
> all-in-one-function

I think this is clearly the best approach.
Note that SetWindow is called in reflow as well as painting and so should always win the race.

http://hg.mozilla.org/mozilla-central/annotate/bf58973256df/layout/generic/nsObjectFrame.cpp#l1124

The only other message that I'm aware of is WM_WINDOWPOSCHANGED, which is always followed by a SetWindow and Paint, and so can be "magically detected" easily.

http://hg.mozilla.org/mozilla-central/annotate/bf58973256df/layout/generic/nsObjectFrame.cpp#l1751

So making "all the resulting RPC calls win the race" is not actually hard but involves an extra round-trip compared with unifying into a single paint rpc method.
Don't we need to do something extra to ensure that an NPN_Evaluate doesn't come in between SetWindow and Paint?
IIUC ipc messages from the plugin will be processed only either in an event loop or when given priority in a race.  I don't know that code though so I'll ask cjones to correct me if I'm wrong.
... and also when in response to an rpc message from the browser.
I assume these nest to behave like function calls so that the rpc response to a race-winning message from the browser would be processed in priority to any NPN_Evaluate previously received.
Blocks: 550034
Race resolution is done after the racy RPC "stack frame" is popped.  So if we call RPC A() that races with NPN_Evaluate(), then A() is popped, we'll process NPN_Evaluate.  So we need to make sequences of calls A()+B()+C() that need to be atomic a single RPC.
Looked over the code again: if A() is the only message on the RPC stack, then we do indeed process it on a later iteration of the event loop.  So if we don't (reflow/paint?) in response to NPAPI calls, I think that's good enough for this case.
I think that if NPN_InvalidateRect (which is asynchronous) can trigger reflow/repaint, then we still have a problem.  Otherwise, we're OK.
Discussed this with karlt on IRC, and this seems to be the simplest solution.  We can coalesce these messages later if perf proves an issue.
Attachment #433511 - Flags: review?(karlt)
Attachment #433511 - Flags: review?(jmathies)
(In reply to comment #57)
> Discussed this with karlt on IRC, and this seems to be the simplest solution. 
> We can coalesce these messages later if perf proves an issue.

And I should add that since this patch makes WM_WINDOWPOSCHANGED asynchronous, it'll perform strictly better than what we have now.
Comment on attachment 433511 [details] [diff] [review]
Have the browser win NPP_SetWindow races, and add a special-cased NPP_HandleEvent async WindowPosChanged message for windows

r=me.   A couple of nits:

>+    int16_t dontcare = 0;
>+    return AnswerNPP_HandleEvent(event, &dontcare);

I wouldn't bother with the initialization of dontcare.

>+  int16 type = parent.type();
>   // our code relies on the frame list not changing during paints
>-  bool isPaint = (PPluginInstance::Msg_Paint__ID == parent.type());
>-
>+  bool isPaint = (PPluginInstance::Msg_Paint__ID == type ||
>+                  PPluginInstance::Msg_NPP_SetWindow__ID == type);
>   return isPaint ? RPCChannel::RRPParentWins : RPCChannel::RRPChildWins;

This should probably say "during paints and reflows" because windowed plugins
don't (well except for Silverlight) get paint messages, but even for them
Msg_NPP_SetWindow__ID is delivered during reflow.  The name "isPaint" isn't
quite accurate now.  "parentWins" maybe, or perhaps switch(parent.type()) with
two return statements, or ...
Attachment #433511 - Flags: review?(karlt) → review+
Attachment #433511 - Flags: review?(jmathies) → review+
local internet might be out until tomorrow, stuck on n900.  would appreciate it if someone could push this for me
Whiteboard: [checkin-needed]
http://hg.mozilla.org/mozilla-central/rev/b4d603fb501d
Status: REOPENED → RESOLVED
Closed: 14 years ago14 years ago
Resolution: --- → FIXED
Whiteboard: [checkin-needed]
Blanket approval for Lorentz merge to mozilla-1.9.2
a=beltzner for 1.9.2.4 - please make sure to mark status1.9.2:.4-fixed
Product: Core → Core Graveyard
You need to log in before you can comment on or make changes to this bug.