Closed Bug 347662 Opened 18 years ago Closed 18 years ago

[FIX]Windows Media Player Dynamic Plugin Crash Regression

Categories

(Core Graveyard :: Plug-ins, defect)

1.8 Branch
x86
Windows XP
defect
Not set
critical

Tracking

(Not tracked)

RESOLVED FIXED
mozilla1.8.1

People

(Reporter: bugzilla, Assigned: bzbarsky)

References

()

Details

(6 keywords, Whiteboard: [reflow-refactor])

Attachments

(5 files)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1b1) Gecko/20060806 BonEcho/2.0b1
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1b1) Gecko/20060806 BonEcho/2.0b1

Browsing embedded media services with Media Player currently causes FireFox to hang after the media filled tab is terminated.

Reproducible: Always

Steps to Reproduce:
x. open a few tabs
x. watch the video link I provided.
x. wait for the media to begin or get a bit into it
x. close media tab
Actual Results:  
FireFox complains that Media Player's Dynamic Plugin has caused an error and must be terminated.

Expected Results:  
It should close.

Unfortunately after five minutes of searching boogzilla I can't find the landing that resolved this issue originally or I would have marked it as blocking that.
Keywords: hang
Attached file about:plugins output
I made a small recording of the bug happening for everyone to see.  It's too large for the site so here's a link: http://www.yousendit.com/transfer.php?action=download&ufid=B3CE5EFE255A7BEC
I can reproduce this on current branch builds.
You specifically need to have the page with the video opened in the second tab, and the tab bar needs to disappear when you close the tab.
This regressed between 2006-05-25 and 2006-05-26.
In my debug build I get this assertion, and afterwards, I crash:
###!!! ASSERTION: You can't dereference a NULL nsCOMPtr with operator->().: 'mRa
wPtr != 0', file ../../dist/include/xpcom/nsCOMPtr.h, line 849

After backing out the patch from bug 344587, I don't crash anymore. However, that code shouldn't make a difference whether or not this crash happens.
It's not the only way to make this crash happen, see bug 308799.
Blocks: 344587
Severity: major → critical
Status: UNCONFIRMED → NEW
Ever confirmed: true
Flags: blocking1.8.1?
Keywords: crash, regression
So... we really shouldn't be ending up with a pending reflow event if we have a null viewmanager...

Martijn, when we hit that assertion, what is the value of presShell->mIsDestroying?

Does setting the "sessionstore.max_tabs_undo" pref to 0 help by some long chance?

I assume this is not reproducible on trunk?
Correct.  I just tried it, it does not occur on trunk.

> I assume this is not reproducible on trunk?


We need some more analysis on this, but we don't want to break WMP ... uh .. again.
Flags: blocking1.8.1? → blocking1.8.1+
So actually, Martijn and I debugged this a little bit.  What he's seeing is that in destroying the frame manager (and hence the frame tree) ends up processing events somehow (!).  I'm really not sure how.

So we end up processing the reflow event while the presshell is in the middle of destroying, after it's nulled out mViewManager.  Hence the crash.

As a band-aid we could try moving the event revokation and reflow command clearing earlier in Destroy(), but I'd really like to know why we end up processing the main event loop under frame manager destruction.  In general, doing that is not safe in that fun "crash if it happens, deleted objects" kinda way.
This is still happening as of 20060808_BRANCH but the error is no longer coming up when the bug rears it's head, echo just stop responding after the tab is destroyed.  Thoughts?
Keywords: dataloss
Some more observations:

1) This is causing a loss of ALL session-cookies being used by the browser at the time it hangs or otherwise crashes out.

2) It is causing some random loss of history here on my end.  I will dig into this a little bit later, it's 4am here :-)

3) Is is breaking the session restore manager here as of 20060808's build.
Closing a tab while media is playing sounds like a pretty common user action, so I'm somewhat surprised that this isn't getting listed as a topcrasher; or is Talkback not getting opened?
Target Milestone: --- → mozilla1.8.1
(In reply to comment #11)
> Closing a tab while media is playing sounds like a pretty common user action,
> so I'm somewhat surprised that this isn't getting listed as a topcrasher; or is
> Talkback not getting opened?
> 

TalkBack is not producing any results because of the nature in which it happens.
Attached patch Per comment 8Splinter Review
Someone who can reproduce this should test this patch...  Note that this crash may be dependent on the exact Media Player version; some people apparently cannot reproduce this.  What versions are the people who can see this bug using?

Also, could someone who sees this please answer the second question in comment 5?
> What versions are the people who can see this bug using?

Windows Media Player 10
(In reply to comment #5)
> Does setting the "sessionstore.max_tabs_undo" pref to 0 help by some long
> chance?

No, that doesn't help.

The patch "per comment 8" seems to fix the crash. I'm using windows media player 10.

Bz is there any chance you can drive this into trunk and 1.8 branch?
Assignee: nobody → bzbarsky
Target Milestone: mozilla1.8.1 → mozilla1.8.1beta2
I can if I find a Windows buddy to debug or time to set up a Windows build environment...  The patch is basically a hack to work around this specific testcase; there are other ways to crash if event processing is happening here, so we need to figure out why event processing is happening.
There are vnc-drivable Windows build setups at Seneca, copying bc here in case he can spare his, otherwise I'll connect you with Ben Hearsum (bhearsum on IRC) to get you access to his buildbot slave.  I'm sure he'd love to help!
Version: Trunk → 1.8 Branch
(In reply to comment #18)
> There are vnc-drivable Windows build setups at Seneca, copying bc here in case

I have linux boxes (4,5,6) but no windows boxes at seneca. I can't find the email where you divied them up, but I remember some people (vlad?) did have windows.
Whiteboard: [at risk]
(In reply to comment #5)
> Martijn, when we hit that assertion, what is the value of
> presShell->mIsDestroying?

I have that assertion in a debugger, presShell->mIsDestroying is PR_TRUE.
Whiteboard: [at risk]
Whiteboard: [at risk]
Gavin and I debugged this a bit today.  As far as we can tell, we get into nsObjectFrame::Destroy, get the nsIPluginInstance (which should be an ns4xPluginInstance, right?) then as soon as we call Destroy() on that (which we couldn't step into!) we somehow end up in the main event loop, and thence reentering presshell destruction.

Oh, and if my assumption that the nsIPluginInstance is an ns4xPluginInstance is correct, it looked something like this right after we got it:

      -  {,,gkplugin.dll}(ns4xPluginInstance*){*}inst.mRawPtr  0x06ae3fc0 {mRefCnt={mValue=112088312 } _mOwningThread={mThread=0x0353a108 } mPeer={mRawPtr=0xabababab } ...}  ns4xPluginInstance * const
        +  nsIPluginInstance  {...}  nsIPluginInstance
        +  nsIScriptablePlugin  {...}  nsIScriptablePlugin
        +  nsIPluginInstanceInternal  {...}  nsIPluginInstanceInternal
        +  mRefCnt  {mValue=112088312 }  nsAutoRefCnt
        +  _mOwningThread  {mThread=0x0353a108 }  nsAutoOwningThread
        +  mPeer  {mRawPtr=0xabababab }  nsCOMPtr<nsIPluginInstancePeer>
        +  fCallbacks  0xabababab {size=??? version=??? newp=??? ...}  _NPPluginFuncs *
        +  fNPP  {pdata=0xfeeefeee ndata=0x00000000 }  _NPP
          mWindowless  0  unsigned char
          mTransparent  0  unsigned char
          mStarted  0  unsigned char
          mCached  0  unsigned char
          fLibrary  0x0006001b  PRLibrary *
        +  mStreams  0x0218079e {mNext=0x8b55cccc {mNext=??? mPluginStreamListener=??? } mPluginStreamListener=0x10458bec {mRefCnt={mValue=??? } _mOwningThread={mThread=??? } mNotifyData=??? ...} }  nsInstanceStream *
        +  mPopupStates  {mImpl=0xbaadf00d {mBits=??? mCount=??? mArray=0xbaadf015 } }  nsVoidArray 

Note in particular:

  mRefCnt.mValue == 112088312
  mPeer.mRawPtr == 0xabababab 
  fCallbacks == 0xabababab
  fNPP.pdata == 0xfeeefeee

Given that, and the fact that the debugger refused to identify the concrete class from the vtable to start with, I claim that this object is dead.  I have no idea how the addref on it (NS_IF_ADDREF while getting) succeeded...

It'd be great if someone who can reproduce this would breakpoint in PresShell::Destroy, from there set a breakpoint in ~ns4xPluginInstance, and see what the stack to that destructor is.
0xabababab is Memory following a block allocated by LocalAlloc()
0xbaadf00d is winnt: nt internal "not yours" filler
0xfeeefeee is probably probably memory that has been dedicated to a heap but not yet allocated by HeapAlloc() or LocalAlloc(), or memory that has
just been freed by HeapFree()

This means that the pointer class you picked is wrong. you should be looking for an object that has a vptr and 2 32bit long data fields
> This means that the pointer class you picked is wrong.

OK.  Is it possible that the plug-in itself is implementing nsIPluginInstance?  I believe Gavin was testing with Java, because he couldn't reproduce this with WMP...
Reseting TM as we probably won't hold b2 for this - but would really like a fix before final.
Target Milestone: mozilla1.8.1beta2 → mozilla1.8.1
"mRawPtr" 0x06ae3fc0 "firstPointer" 0x06ae54f8 "something" 0x0353a108

Whatever kind of object has firstPointer and something, it isn't storing a refcount near it. 

I can't offer much beyond that. java could very well have its own layout for objects.

the easiest way (short of asking someone from sun) to figure that out would be to walk into a QI of a normal java nsIPluginInstance object while gecko is happy and watch the memory to see what changes (QI to nsISupports and leak the result a few times).
Blocks: 347743
Attached file complete stack
This stack shows where we call into the plugin during frame destruction.  (The assertion at the top is for dereferencing a null nsCOMPtr.)
dbaron/bz: layout knows that it isn't reentrant, would it be hard to add a global variable to layout so that 
 	gklayout.dll!ReflowEvent::HandleEvent()
would either reschedule or simply return when the variable indicating reflow is inprocess is set?

we get similar crashes when debug asserts happen because those allow reflow to happen and in turn land in js which was in the middle of asserting and then it kills itself.
> when the variable indicating reflow is inprocess is set?

It's not set when this crashes.  In any case, that solution is the same wallpaper as the one I already posted.  It doesn't solve the real problem.

The real issue, per dbaron's stack, is that the plugin's destruction somehow manages to spin the Gecko event queue, causing event processing in the middle of PresShell::Destroy.  We're not in a position to handle that.  More below.

I guess I have the following questions:

1)  Does anyone know what the WMP plugin is doing to cause event processing and 
    how we could prevent it?  (I'm guessing "no" here.)
2)  When did this crash go away on trunk?  Was it when bug 1156 landed, or when 
    bug 326273 landed?  I realize that the patch for bug 344587 needs to be
    applied to reproduce easily with the current testcase, but that can
    probably be done just by editing the chrome in a downloaded nightly...
3)  Can we put Stop() off on an event?  I'm guessing "no", given the comments
    in nsObjectFrame::Destroy on trunk.

As I see it, the situation is basically as follows.  We can't affect what the WMP plugin is doing, nor can we change when we call Stop() on it (esp. on branches).  We might be able to change how we respond to the plugin's spinning the event queue (for example, by making it a no-op during plugin destruction).  Not sure how easy that is to do or how it would affect actual behavior... Darin, do you happen to know?

We can also take the layout bandaid here and hope, but then it just becomes a matter of triggering other events (timer events for JS setTimeout come to mind here).  So while we wouldn't crash _every_ time we close such a tab, we'd crash sometimes, and possibly in much worse ways than a null pointer dereference... :(

I'm not really in a reasonable position to work on this bug at this point unless we do want the band-aid (in which case we should probably spin it off into a separate bug).  I do have access to a Windows setup, sort of, but I can't install WMP 10 there without an OS upgrade, and I don't even see the bug with WMP 6.4 (which I did manage to install).
Assignee: bzbarsky → nobody
Blocks: 347742
Comment on attachment 233485 [details] [diff] [review]
Per comment 8

On second thought, we've disconnected from the document by this point, so perhaps this is safe enough...
Attachment #233485 - Flags: superreview?(roc)
Attachment #233485 - Flags: review?(roc)
I don't want to break the conversation flow here, but I don't seem to be having this issue as of last nights build.  Was whatever caused this backed out?
--> DBaron temporarily to take a look
Assignee: nobody → dbaron
One comment about the patch is that it might be possible for some of these things to be posted during the code that you're moving the revocation across -- so you might want to copy rather than move.  Although darin points out that PostReflowEvent checks mIsDestroying, so that's not an issue.
(In the long term it seems like maybe we should make our own frame tree destruction asynchronous, so that we can destroy the plugins in a first pass right off the event, and then destroy the frame tree, maybe even off a second event.)
We could make frame tree destruction async if we're sure frames can deal with elements that have no GetCurrentDoc()...
Yeah, unfortunately we have to assume that calling into plug-in code may result in events being pumped.  We should really only call into plug-in code from the event queue, but I'm sure that's hard to ensure in most cases.  Too bad plug-ins don't run in a separate thread! ;-)
> unfortunately we have to assume that calling into plug-in code may result
> in events being pumped.

Unless we document otherwise for a particular call.... but given the state of NPAPI, that's not too bloody likely.  :(

It still feel that it's a bug in the plugin that it pumps events from Stop(), fwiw.  Is there a way we can report the bug?
--> ------ Comment #30 From David Miller
--> I don't want to break the conversation flow here, but I don't seem to be having
--> this issue as of last nights build.  Was whatever caused this backed out?

Ditto.
I am using SeaMonkey, & for the last number of nights, I have not seen WMP plugin related errors.  (Assuming what I am seeing, or not seeing, is the same as is being reported in this bug, & similar to many others). 

Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1b2) Gecko/20060821 SeaMonkey/1.1a

"Illegal operation in Windows Media Player plugin"
http://forums.mozillazine.org/viewtopic.php?t=454548
darin: there's a half implemented pluginwrapper for windows which would run plugins in a different thread. but note that it will probably not work if the plugin in question decides to call any method other than npapi (java, not sure what else). and it's half implemented in that functions are only proxied one way between the threads, they need to be proxied both ways.

i was reading that plugin and thinking about this bug or vice versa earlier.
timeless: different thread, or different process? who's doing that?
Oh. What we really need is to run plugins in their own *processes*, not just threads.
i know, there are bugs for that already and i'm not signing up to fix them. it's a hard task especially when something might want to do just about anything.

i suspect that the thread thing might actually work well enough for this crash.
to be clear, the thread would have a SEH at the start of its stack so that it would catch the plugin misbehaving. but you'd also need to install an app SEH handler to catch a plugin that decided to make its own thread and crash there (thankfully so far wmp is just using our thread and helping us crash ourselves).
I am now seeing this on Windows Media Player 9 & 10 respectively.

User-Agent String: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1b2) Gecko/20060829 BonEcho/2.0b2

test-case for 9/10: http://thatvideosite.com/view/1132.html
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1b2) Gecko/20060829 BonEcho/2.0b2 - Build ID: 2006082904

I'm running Windows XP Professional SP2.

This is totally _not reproducable_ for me using Windows Media Player 9.00.00.3349.

My version of npdsplay.dll is 3.0.2.629.  I recommend everyone who is having this problem to right click on C:\Program Files\Windows Media Player\npdsplay.dll and verify that they have this latest version of the dll.  (Right click on it, Properties -> File Version)

If you have an older version of the plugin then you'll need to install this Microsoft Security Update: http://www.microsoft.com/technet/security/Bulletin/MS06-006.mspx
Comment on attachment 233485 [details] [diff] [review]
Per comment 8

This looks good if we still need it.
Attachment #233485 - Flags: superreview?(roc)
Attachment #233485 - Flags: superreview+
Attachment #233485 - Flags: review?(roc)
Attachment #233485 - Flags: review+
Comment on attachment 233485 [details] [diff] [review]
Per comment 8

Yeah, I think we should take this on both branches.  Not sure about trunk; the event stuff is pretty different there.  I'm not sure why this issue isn't showing up on trunk, but for trunk we'd need a totally different patch anyway, even if we needed it.
Attachment #233485 - Flags: approval1.8.1?
Attachment #233485 - Flags: approval1.8.0.8?
happening for me here on WMP 9.00.00.3349.  I have the right version of npdsplay.dll as well.


> My version of npdsplay.dll is 3.0.2.629.  I recommend everyone who is having
> this problem to right click on C:\Program Files\Windows Media
> Player\npdsplay.dll and verify that they have this latest version of the dll. 
> (Right click on it, Properties -> File Version)
Comment on attachment 233485 [details] [diff] [review]
Per comment 8

a=schrep for drivers.
Attachment #233485 - Flags: approval1.8.1? → approval1.8.1+
Whiteboard: [at risk] → [checkin needed (1.8 branch)]
Fixed on 1.8.1 branch.
Flags: blocking1.9?
Keywords: fixed1.8.1
Whiteboard: [checkin needed (1.8 branch)]
*** Bug 335553 has been marked as a duplicate of this bug. ***
verified fixed with Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1b2) Gecko/20060911 BonEcho/2.0b2
Note that bug 346521 added a null-check on trunk that would hide this issue.  Not sure whether we want that or whether we should revoke the event earlier on trunk as on branches.
This is happening again:

BuildID: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1) Gecko/20061001 BonEcho/2.0

happens on this link: http://gorillamask.net/fgfbomb.shtml

*link contains vulgar language ..don't open if you're squeamish*
(In reply to comment #54)
> happens on this link: http://gorillamask.net/fgfbomb.shtml

bz asked me to test it, and it doesn't crash here, Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1) Gecko/20061001 BonEcho/2.0 (same as justdave's); WMP 11 (relatively old beta), plugin version 3.0.2.629, WinXP SP2 with all updates; almost clean profile.
Agreed, there appears to be a regression.

Not going to be too much help, but,

working fine here:
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1) Gecko/20060916 SeaMonkey/1.1b

broken here:
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1) Gecko/20061001 SeaMonkey/1.1b

test links - at least sometimes, generates, "Illegal operation in Windows Media Player WMP plugin":
http://www.moronland.com/moronia/moron/917/
http://www.moronland.com/moronia/moron/920/
There is a dearth of SeaMonkey 1.1b builds for Windows over the time span, but ...

Works:
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1) Gecko/20060916 SeaMonkey/1.1b
...missing...
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1) Gecko/20060919 SeaMonkey/1.1b

...MISSING...

Crash and/or WMP Plug-in Dynamic Link Library illegal operation message:
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1) Gecko/20060930 SeaMonkey/1.1b
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1) Gecko/20061002 SeaMonkey/1.1b
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1) Gecko/20061003 SeaMonkey/1.1b
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1) Gecko/20061004 SeaMonkey/1.1b
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1) Gecko/20061005 SeaMonkey/1.1b
Interestingly, when run under Windows Vista RC1, all of the SeaMonkey 1.1b builds I mentioned above, up through & including the latest, worked correctly.

Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.8.1) Gecko/20061005
SeaMonkey/1.1b

(Note the change from Windows NT 5.1 to Windows NT 6.0).

(Also note, that I manually copied the WinXP versions of the WMP "plug-ins" into the /plugins/ directory on Vista; npdsplay.dll, npdrmv2.dll, & npwmsdrm.dll.  Don't know if the two drm dll's will work under Vista?  For non-drm media, only npdsplay.dll is needed.)
More information won't hurt ...

On Windows XP SP2:
WMP 10.00.00.4036

On Windows Vista RC1:
WMP 11.0.5600.6276

Plugins (same used for both XP & Vista):
npdsplay.dll 3.0.2.629 (I believe this is the most recent)
npdrmv2.dll 9.0.0.3287
npwmsdrm.dll 9.0.0.3287
To add to the Vista/WMP 11 angle.

As noted in #58, I had been running Vista RC1 (in addition to XP).

Today, I installed Vista RC2.

Under RC2, I am now getting the 'plug-in performed an illegal operation' error.
I did not have this problem with Vista RC1.

Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.8.1) Gecko/20061005
SeaMonkey/1.1b
Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.8.1) Gecko/20061031
SeaMonkey/1.1b

npdsplay.dll 3.0.2.629
WMP 11.0.5744.6324 (different version from that found in RC1)

test link: http://gorillamask.net/fgfbomb.shtml

I'd also like to emphasize that in my case, I've been running these tests with JavaScript disabled.
Either through SeaMonkey itself, or by virtue of using the NoScript extension.

I do not see that fact specifically mentioned in this thread, but it is mentioned in the mozillazine link, #37 above.
Boris, what needs to happen here on the trunk?

I'm probably not the best owner for this bug.
I guess we need to decide whether to move the event revocation earlier on trunk, and if so, do it...
This is basically the branch patch merged to trunk... or rather to reflow branch, since I see no reason to cause extra merge conflicts for that and this code has changed a tad there.

Note that this backs out the null-check added in bug 346521.  I think that's safe to do.
Assignee: dbaron → bzbarsky
Status: NEW → ASSIGNED
Attachment #245312 - Flags: superreview?(roc)
Attachment #245312 - Flags: review?(roc)
I guess we should file followups on better ways of handling this or something...
Summary: Windows Media Player Dynamic Plugin Crash Regression → [FIX]Windows Media Player Dynamic Plugin Crash Regression
Whiteboard: [reflow-refactor]
Attachment #245312 - Flags: superreview?(roc)
Attachment #245312 - Flags: superreview+
Attachment #245312 - Flags: review?(roc)
Attachment #245312 - Flags: review+
Comment on attachment 233485 [details] [diff] [review]
Per comment 8

approved for 1.8.0 branch, a=dveditz for drivers
Attachment #233485 - Flags: approval1.8.0.9? → approval1.8.0.9+
Fixed for 1.8.0.9
Keywords: fixed1.8.0.9
Verified for 1.8.0.9 on Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.8.0.9pre) Gecko/20061201 Firefox/1.5.0.9pre and Vista - no crash
Checked in the trunk fix.
Status: ASSIGNED → RESOLVED
Closed: 18 years ago
Resolution: --- → FIXED
Flags: blocking1.9? → in-testsuite?
Product: Core → Core Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: