Closed Bug 531341 Opened 10 years ago Closed 10 years ago

Stack overflow crash related to scrolling with trackpad and plugins [@ nsWindow::HandleScrollingPlugins] [@ _SEH_prolog4 ][@ nsWindow::ProcessMessage(unsigned int, unsigned int&, long&, long*)][@ FindNCHit ]

Categories

(Core :: Widget: Win32, defect, critical)

1.9.2 Branch
x86
Windows Vista
defect
Not set
critical

Tracking

()

VERIFIED FIXED
mozilla1.9.3a1
Tracking Status
blocking1.9.2 --- needed
status1.9.2 --- .2-fixed
status1.9.1 --- unaffected

People

(Reporter: divergentdave, Assigned: masayuki)

References

Details

(4 keywords, Whiteboard: [crashkill])

Crash Data

Attachments

(2 files, 1 obsolete file)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.2b4) Gecko/20091124 Firefox/3.6b4
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.2b4) Gecko/20091124 Firefox/3.6b4

I have experienced three crashes (see below, two on 3.6b3, one on 3.6b4) which all seem to stem from the same recursive calls. The crashes occurred while I was opening pages with flash, (last.fm, youtube, and a google search result) scrolling with my trackpad nub, (Lenovo) and possibly switching tabs. I have tried to replicate this, but thus far I haven't been able to.

Reproducible: Couldn't Reproduce

Steps to Reproduce:
Unable to reproduce, but I will keep trying
Actual Results:  
Slight pause, (0.5-1s) then Firefox crashed, and the crash reporter opened up

Expected Results:  
Not crashing

bp-69b5d4c8-995e-42bb-9965-df4ec2091126 (under 3.6b4)
bp-c287da93-c960-48b4-a03f-551002091123 (under 3.6b3)
bp-11b1e782-eb83-497c-b560-4e9ea2091120 (under 3.6b3)

The infinite loop is between the following three functions:
nsWindow::HandleScrollingPlugins
nsWindow::OnScroll
xul.dll@0x42e0e9

The line referenced in nsWindow::OnScroll is within an if statement mentioning "Scroll message generated by Thinkpad Trackpoint Driver or similar"

My computer is a Lenovo ThinkPad R61
Version: unspecified → 3.6 Branch
6253  	xul.dll  	nsWindow::HandleScrollingPlugins  	 widget/src/windows/nsWindow.cpp:5944
6254 	xul.dll 	nsWindow::OnScroll 	widget/src/windows/nsWindow.cpp:5965
6255 	xul.dll 	xul.dll@0x42cc36 	
6256 	xul.dll 	nsWindow::HandleScrollingPlugins 	widget/src/windows/nsWindow.cpp:5944
6257 	xul.dll 	nsWindow::OnScroll 	widget/src/windows/nsWindow.cpp:5965
6258 	xul.dll 	xul.dll@0x42cc36 	
6259 	xul.dll 	nsWindow::WindowProc 	widget/src/windows/nsWindow.cpp:3501
6260 	user32.dll 	InternalCallWinProc
Severity: normal → critical
Component: General → Widget: Win32
Keywords: crash
Product: Firefox → Core
QA Contact: general → win32
Summary: Stack overflow crash related to scrolling with trackpad and plugins → Stack overflow crash related to scrolling with trackpad and plugins [@ nsWindow::HandleScrollingPlugins]
Version: 3.6 Branch → 1.9.2 Branch
I think I have figured out how to reproduce this bug:

1: Open Firefox, and open two empty tabs
2: Hold the scroll button down and hold the TrackPoint nub in any direction (This requires a ThinkPad, was not able to reproduce by scrolling with a normal mouse)
3: Press Ctrl-Tab, Firefox will crash
I bisected through the nightlies, and in 2009-11-14-05-mozilla-1.9.2 and earlier, the crash doesn't occur, and moreover, scrolling with the TrackPoint doesn't work at all. In 2009-11-15-04-mozilla-1.9.2 and later, scrolling with the TrackPoint works, and the crash occurs.
Topcrash #115 for 3.6b4
nsWindow::ProcessMessage(unsigned int, unsigned int&, long&, long*)
bp-2751f520-e4f6-46a1-a2aa-dbc352091205

Since this is a stack overflow, it probably appears with multiple signatures.
Keywords: topcrash
Indeed, bug 531856 would appear to be the same.
Assignee: nobody → me
Status: UNCONFIRMED → NEW
Ever confirmed: true
From bug 531856
> Stack overflow. Possibly nsWindow::HandleScrollingPlugins round-trips to
> itself?
> 
> Note stack contains "xul.dll@0x42cc36" -- any idea what that call is?
> 
> Example Beta4 stack:
> http://crash-stats.mozilla.com/report/index/7047c8c5-1a19-408a-bb8c-20f522091130
> 
> Example: Trunk stack:
> http://crash-stats.mozilla.com/report/index/22b61f9a-55a5-47a2-9a82-01b802091125
I can't reproduce this, but judging by the fact that there aren't many ways to trigger a stack overflow I would imagine that the problem is that at http://hg.mozilla.org/mozilla-central/annotate/ae44b06ea876/widget/src/windows/nsWindow.cpp#l6013

> <snip>
> if (destWnd != mWnd) {
>   if (destWindow) {
>     aHandled = destWindow->ProcessMessage(aMsg, aWParam, aLParam, aRetValue);
> </snip>

either destWindow is the window currently processing the message or slightly more interestingly, destWindow and the window currently processing the message are sending it back and forth.  The fact that this seems to happen when switching tabs (per comment 2 ) makes me think the second one is probable.  The first should be impossible (I don't believe there's a situation in which GetNSWindowPtr can return the same nsWindow if the HWNDs differ)

I also don't know what to make of the random offset in xul.dll.  If somebody more experienced than me with crashing ( :-D ) has ideas, I would love to hear them.
Ted, can you help with the "random offset in xul.dll" part?
This crash is new in 3.6 Beta 3.  This is almost certainly related to turning on the Trackpoint scrolling for Alps drivers as that was the only change in this code between Beta 2 and Beta 3.  The R61 has an Alps Touchpad.

David: What Thinkpad model do you have?
Blocks: 528267
(In reply to comment #10)
> This crash is new in 3.6 Beta 3.  This is almost certainly related to turning
> on the Trackpoint scrolling for Alps drivers as that was the only change in
> this code between Beta 2 and Beta 3.  The R61 has an Alps Touchpad.

Scratch that, it does show up earlier.
(In reply to comment #9)
> Ted, can you help with the "random offset in xul.dll" part?

Probably a symptom of bug 526484, if I had to guess. (That patch is nominated for branch approval.) If you look at the frame above it:
20	xul.dll	nsWindow::OnScroll	 widget/src/windows/nsWindow.cpp:5965
21	xul.dll	xul.dll@0x42cc36	
22	xul.dll	nsWindow::HandleScrollingPlugins	widget/src/windows/nsWindow.cpp:5944 

It's clearly calling nsWindow::ProcessMessage:
http://hg.mozilla.org/releases/mozilla-1.9.2/annotate/4c488520d1bf/widget/src/windows/nsWindow.cpp#l5944

And the only place that calls OnScroll is here:
http://mxr.mozilla.org/mozilla-central/source/widget/src/windows/nsWindow.cpp#4166

So that's probably your missing frame.
I do indeed have an R61.
I've found a narrower window for the cause of this crash.

Changeset ff084019260e was the first changeset in which Trackpoint scrolling didn't work. Prior to this changeset, scrolling worked fine, and the crash described above did not occur.

I then cherrypicked the six changesets which make up the "Trackpoint hack" onto this revision, (hg update -C -r ff084019260e followed by hg transplant 40cc465fb9be e264b5f6b874 5f2a6c970a96 03b2895110e0 55e73a35126e) and, with this patched version, scrolling worked once again, and I was able to reproduce the above crash. So, the regression is somewhere in these seven changesets.

Hope this helps, let me know if there's anything else I can do.
I'm still experiencing the crash in 3.6rc1.
http://crash-stats.mozilla.com/report/index/3738bdfa-997c-464e-88bd-95ff62100109
I just got an update for my Alps driver via Windows Update, and before the update, 3.6rc2 still crashed under the above circumstances, while after installing the update, the crash was fixed.

Is this good enough to close the bug?
that's great to hear, the only thing i'd want to know is which files it updated because if it's a file that is loaded into our process, and the updated version doesn't crash, we'd like to consider blocking the old version. I looked through your report from comment 15 and couldn't find something that I'd recognize as the driver :(.
I don't think this has been fixed in general though, it appears to have just moved.
Summary: Stack overflow crash related to scrolling with trackpad and plugins [@ nsWindow::HandleScrollingPlugins] → Stack overflow crash related to scrolling with trackpad and plugins [@ nsWindow::HandleScrollingPlugins] [@ _SEH_prolog4 ]
Masayuki, you said in Bug 483136 that it was possible for HandleScrollingPlugins to enter an infinite loop.  That's what's happening here, but I don't see how it's happening.  Any help would be appreciated :-)
Here's a "fake" crash report with my current set of modules. I didn't see any suspicious dlls either.

http://crash-stats.mozilla.com/report/index/9e4f350e-4f97-4375-a31b-409942100119
(In reply to comment #19)
> Masayuki, you said in Bug 483136 that it was possible for
> HandleScrollingPlugins to enter an infinite loop.  That's what's happening
> here, but I don't see how it's happening.  Any help would be appreciated :-)

The reason is that mInScrollProcessing is an instance member. And the caller's one is set to TRUE. So, the called instance's one is FALSE...

But the code is going to be removed completely by next patch of bug 483136. So, I hope that this should be fixed only on branch.
(In reply to comment #21)
> So, the called instance's one is FALSE...

Er, the other instances' are FALSE.
Attached patch Patch v1.0 (obsolete) — Splinter Review
I think that this patch fixes this bug. Can somebody test this if I prepare a test build?
ssssmemyself: Can you test the tryserver build Masayuki posted with the old drivers?
I might be able to; I'll look around for the old driver today.
I downgraded to the factory version of the driver, and the above tryserver build crashes.
(In reply to comment #27)
> I downgraded to the factory version of the driver, and the above tryserver
> build crashes.

Thank you, I'll be looking for another possibility.
And I hope that you test this test build but this is trunk build, be careful.
https://build.mozilla.org/tryserver-builds/masayuki@d-toybox.com-bug483136_v4.3/
If you sent the crash report with my test build, would you tell me the report ID?
When I crashed your build, I just got the standard Windows dialog box, not the Mozilla crash reporter.

With the trunk build and the old drivers, I am unable to reproduce the crash.
O.K. Thank you.
The v1.1 build crashes too, see attached WinDbg log.
Attachment #423245 - Attachment mime type: application/octet-stream → text/plain
Attached patch Patch v2.0Splinter Review
Building test builds for this patch...
Attachment #422756 - Attachment is obsolete: true
I can't reproduce the crash with this latest build. Hooray!
(In reply to comment #37)
> I can't reproduce the crash with this latest build. Hooray!

Thank you very much!
Comment on attachment 423306 [details] [diff] [review]
Patch v2.0

The mInScrollProcessing of the *caller* is set to TRUE. So, it's strange. We can move it to static variable and all instances should refer it.

When it's processing, we don't need to retargetting processing window, so, we can just return from the method at that time.
Attachment #423306 - Flags: review?(VYV03354)
Assignee: me → masayuki
Status: NEW → ASSIGNED
This is the kind of thing that could really use a testcase. Can we synthesize native scrolling events to test this the way we synthesize mousemove events?
Flags: in-testsuite?
Asking for blocking per http://groups.google.com/group/mozilla.dev.planning/browse_thread/thread/71972f1a7c5eb6f1 since Masayuki has a patch in hand.
blocking1.9.2: --- → ?
status1.9.2: --- → ?
Whiteboard: [crashkill]
Attachment #423306 - Flags: review?(VYV03354) → review+
http://hg.mozilla.org/mozilla-central/rev/57f9a74ab7d3

O.K. I pushed to trunk, we can watch the regression if there is.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
(In reply to comment #40)
> This is the kind of thing that could really use a testcase. Can we synthesize
> native scrolling events to test this the way we synthesize mousemove events?

Currently no. Roc hates that we add APIs only for testing, see bug 528396.
I hate adding large APIs that get shipped in our product but are only useful for testing.
Target Milestone: --- → mozilla1.9.3a1
I think the Alps driver sends scrolling events from the apoint.exe process. Would it be possible to write a unit test that compiles an external program, runs it alongside Firefox, and sends the scrolling events to it? This still sounds like a lot of code, but this way it wouldn't need to ship.
Roc:

I cannot read the discussion of all on Google (http://groups.google.ca/group/mozilla.dev.platform/browse_frm/thread/f6e625cdf06246ed#). Can I read the other emails somewhere else? I think that they are same things that we add a few APIs each new testcase and many APIs only for one testcase. I hope we implement new APIs which are not compiled on release build.

ssssmemyself:

We should be able to do because we can send the message ourselves. However, I don't have the idea how to create the message loop actually.
This is currently the #51 and #56 topcrashes for 3.6 accounting for 0.42% of all crashes.  I haven't seen a crash on trunk since Masayuki's patch landed but there's a very low volume there (only 12 or so over the last two weeks) so it's hard to say for sure that we've squashed all of them.  This is definitely worth taking in 3.6.2 or whatever the first point release is going to be.
Comment on attachment 423306 [details] [diff] [review]
Patch v2.0

We don't have any regression reports. This should be safe.

This fixes a crash and a regression of bug 487245 which causes bug 531341 and a part of bug 483136.
Attachment #423306 - Flags: approval1.9.2.1?
(In reply to comment #48)
> This fixes a crash and a regression of bug 487245 which causes bug 531341 and a
> part of bug 483136.

s/bug 531341/bug 541309
And this is now the #49, #55 and #92 windows topcrashes which puts it at #11 if all the signatures are combined.
Summary: Stack overflow crash related to scrolling with trackpad and plugins [@ nsWindow::HandleScrollingPlugins] [@ _SEH_prolog4 ] → Stack overflow crash related to scrolling with trackpad and plugins [@ nsWindow::HandleScrollingPlugins] [@ _SEH_prolog4 ][@ nsWindow::ProcessMessage(unsigned int, unsigned int&, long&, long*)][@ FindNCHit ]
Duplicate of this bug: 544931
blocking1.9.2: ? → needed
Duplicate of this bug: 545917
Duplicate of this bug: 546624
Duplicate of this bug: 547194
Considering the number of dupes, shouldn't this be pushed to the 1.9.2 branch too? Is it risky?
Keywords: regression
Comment on attachment 423306 [details] [diff] [review]
Patch v2.0

a1922=beltzner
Attachment #423306 - Flags: approval1.9.2.2? → approval1.9.2.2+
I take it that we still have no way to reproduce this issue for testing?
(In reply to comment #59)
> mine is reproducible https://bugzilla.mozilla.org/show_bug.cgi?id=546624

I don't have access to a Lenovo laptop. Can you try with a nightly 3.6 build from http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/latest-mozilla-1.9.2/ and see if it fixes the problem for you?
yep seems fine for me, no crash at all and working fine
Build 20100304 3.6.2pre from latest-mozilla-1.9.2 is working well for me too. No crashes yet, and they are easy to reproduce with 3.6.1 on my Thinkpad T40p w/ a Synaptics touchpad.
Verifying per comment 62
Status: RESOLVED → VERIFIED
Thank you for the verifying!
Blocks: 638756
Crash Signature: [@ nsWindow::HandleScrollingPlugins] [@ _SEH_prolog4 ] [@ nsWindow::ProcessMessage(unsigned int, unsigned int&, long&, long*)] [@ FindNCHit ]
You need to log in before you can comment on or make changes to this bug.