Closed Bug 410946 Opened 17 years ago Closed 16 years ago

Crash in JS engine aborting applet making Java/JS calls

Categories

(Core Graveyard :: Plug-ins, defect, P1)

x86
Windows XP

Tracking

(Not tracked)

VERIFIED FIXED
mozilla1.9beta4

People

(Reporter: kbrussel, Assigned: jst)

References

Details

Attachments

(10 files, 1 obsolete file)

Attached file Test case
The attached stress test case makes repeated Java-to-JavaScript and JavaScript-to-Java calls. If the applet is aborted by reloading it or switching to a different web page, a browser crash is provoked in js3250.dll. The crash occurs with the current Firefox 3 build and both the old OJI-based and the new NPRuntime-based Java Plug-In. The crash has only been positively identified as being in js3250.dll with the new NPRuntime-based Java Plug-In, however. A representative crash log from the HotSpot JVM embedded in the web browser is attached.

Note that the error may well be in the Java Plug-In, but at least in the NPRuntime-based Java Plug-In, I believe we have all of the safeguards in place to avoid making calls into the JavaScript engine against JavaScript object references which are no longer valid. Also, a similar crash is not seen with the new Java Plug-In for Internet Explorer, and the structure of the IE and Firefox plug-ins is very similar (actually, shared code) during the termination sequence. During NPP_Destroy we release all NPObjects acquired from the browser and prevent further calls against them, so that by the time we return from NPP_Destroy it will be impossible for us to make further Java-to-JavaScript calls.

Note also that while the Java Plug-in's MozillaPlugin.drainRunnableQueue() is on the stack in the attached crash log, we tried putting in additional safeguards to ensure that callbacks enqueued using the new NPN_PluginThreadAsyncCall could not be executed after a certain point in NPP_Destroy. While that removed MozillaPlugin.drainRunnableQueue() from the stack, the crash still occurred in js3250.dll.

To reproduce the crash:
1. Download and install the current early access build of 6uN (09 or later) from http://jdk6.dev.java.net/6uNea.html (the Windows Offline Installer).
2. Open the Java Control Panel, Advanced tab, Java Plug-in node, and check the checkbox to enable the next-generation Java Plug-In. (Note that in a future build this checkbox will be enabled by default, so in this case make sure it's checked.)
3. For debugging purposes it is also useful to enable the Java Console in the same Advanced tab (Java console node, Show console button).
4. Launch Firefox 3 and navigate to a random web page.
5. Navigate to HashTableTest.html from the attached test case.
6. While the test is running, press the back button.

It may be necessary to navigate forward to the test case and back from it a few times in order to reproduce the crash.

This isn't a critical bug as it's a stress test and we haven't seen it with any real-world applications. If this does appear to be a bug in the Java Plug-In, we would appreciate any indications you can give of things we might be doing wrong in our termination sequence.
Attachment #295530 - Attachment mime type: application/octet-stream → text/plain
The associated Sun bug ID is 6618975.
I got a crash using Beta2 and oji plugin (Java SE6 u3) and the testcase and doing reload/backward forward

http://crash-stats.mozilla.com/report/index/65056a40-bbec-11dc-bed2-001a4bd46e84?date=2008-01-06-00
Kenneth, this seems really odd. I'm able to reproduce something that looks like it might be a crash here, but it doesn't seem to be a crash that's catchable by the debuggers. It almost seems like some code running in the Firefox process is calling something like exit(). For fun, I set a breakpoint in exit() and sure enough, I got a call to it, with no symbols on the stack (except _except_handler() a few frames from the top) and this output on the console:

#
# An unexpected error has been detected by Java Runtime Environment:
#
#  EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x7c911f6c, pid=3616, tid=1676
#
# Java VM: Java HotSpot(TM) Client VM (11.0-b09 mixed mode, sharing windows-x86)

# Problematic frame:
# C  [ntdll.dll+0x11f6c]
#
# An error report file with more information is saved as:
# E:\decom\fb-dbg\dist\bin\hs_err_pid3616.log
#
# If you would like to submit a bug report, please visit:
#   http://java.sun.com/webapps/bugreport/crash.jsp
#

Not sure what's going on here, but I have yet to see a crash in the JS engine with this testcase.

Kenneth, any thoughts on this one?
This is likely happening because of the JVM's built-in handler for access violation exceptions (used to handle Java-level null pointer checks, among other things). To catch the exception in the debugger (using the MS Visual Studio 2003 IDE, at least), try going to the Debug menu, Exceptions selection, and in the dialog box, go to Win32 Exceptions, c0000005 (Access violation), and select "When the exception is thrown:" -> "Break into the debugger". That should catch the fault before the JVM sees it and give you the correct stack trace. Please update the bug if this technique doesn't work and we can see whether there are other alternatives.
Doing that let's me catch the exception, but that unfortunately doesn't really get me any farther :(

What I see is a non-main thread throwing that exceptions, with only stack frames w/o symbols on the stack, which leads me to believe that it's all Java plugin stuff.

Kenneth, any chance that you could attempt to send a crash report using the mozilla crash reporter if you're able to trigger this again, showing where in the JS engine this might be crashing?
Right, sorry. That exception is likely to occur at least a few times before the real crash in the JS engine happens, because it happens each time a NullPointerException is raised in Java code. Unfortunately there are several NPEs raised at least during JVM startup if not in other places that are considered "normal". Please try continuing past the exception in the debugger to the point where you can run the test case in the browser.

I think it will be difficult to send a crash report using the Mozilla Crash Reporter because the JVM is swallowing the exception and tearing down the process before the Crash Reporter gets a chance to run.
Ok, I got further here and now I'm seeing a crash in the npruntime layer in Firefox. I really don't know how to explain what could have caused this, as it's not obvious in the debugger any more (I lost that state already), but here's what I've got:

I'm looking at this stack:

>	0xblah
	NPObjWrapper_GetProperty()
 	js_NativeGet()
 	js_GetProperty()
 	js_Interpret()
 	js_Invoke()
 	js_InternalInvoke()
 	JS_CallFunctionValue()
 	doInvoke()
 	nsJSObjWrapper::NP_Invoke()
 	_invoke()

The NPObjects in NPObjWrapper_GetProperty() and nsJSObjWrapper::NP_Invoke() are different (as they should be), the former being an object form the Java plugin, the latter being one of our internal wrappers. The odd thing is that the code that's crashing is this:

  NPObject *npobj = GetNPObject(cx, obj);

  if (!npobj || !npobj->_class || !npobj->_class->hasProperty ||
      !npobj->_class->hasMethod || !npobj->_class->getProperty) {
    ThrowJSException(cx, "Bad NPObject as private data!");

    return JS_FALSE;
  }

  PRBool hasProperty = npobj->_class->hasProperty(npobj, (NPIdentifier)id);
  PRBool hasMethod = npobj->_class->hasMethod(npobj, (NPIdentifier)id);

and what I'm seeing in the debugger is that we're crashing on the line that invokes npobj->_class->hasMethod() (though it could be the line above, not sure). But when we're here it looks like (and re-stepping through code confirms this) npobj->_class->getProperty is null, which means that we should've never gotten here as the large if check above would've made us return early if that's the case.

So it looks like the npobj, or at least its class, it changed (torn down?) during the call to npobj->_class->hasProperty(), and things go south from there, or some other thread is messing with the NPObject in question.

I'll poke around more to see if I can find out more on what's happening here.

Marking P1 blocker as this is looking like an access to deleted memory crash. 
Assignee: nobody → jst
Flags: blocking1.9+
Priority: -- → P1
Target Milestone: --- → mozilla1.9 M11
Ok, this is looking more and more like a bug in the Java plugin now that I've got more data. I'm consistently seeing that this crash is happening due to the Java plugin running code past it being torn down. Plugins are torn down by us calling into into the destroy() hook in the plugin, and then we go through and invalidate all remaining NPObjects. This crash always happens after this has all been done, when the plugin has already been destroyed.

Kenneth, I don't think that's something that I can do much about here. I'm seeing calls come in with references to NPObjects that have already been deleted (0xdddddddd member values etc in debug builds). I don't know if this is due to async calls being completed late (though it's not through the plugin thread async callback code, if the stacks I'm seeing are to be trusted), or something else.

The last stack I got was this:

> 	067001bd	
 	js_IsObjLocked()
 	js_NativeGet()
 	js_GetProperty()
 	js_Interpret()
 	js_Invoke()
	js_InternalInvoke()
 	JS_CallFunctionValue()
 	doInvoke()
 	nsJSObjWrapper::NP_Invoke()
 	_invoke()
 	6db22502	

and there the NPObject passed to _invoke() by the Java plugin is a deleted object, the object was deleted when the plugin was destroyed.

I'm un-marking this as a blocker, but if this does indeed turn out to be a bug on our end, please renominate.
Flags: blocking1.9+
Priority: P1 → --
Thanks for the investigation. In investigating this bug on our side I am pretty sure I added guards to guarantee that we would not attempt to make JavaScript calls into the browser after our plugin instance was deleted, even if a NPN_PluginThreadAsyncCall callback was executed after we were destroyed, and that I still saw the browser crash in this case. However, the current code you are testing with does not have these safeguards.

I have raised the Sun bug 6618975 back up to medium priority, and will continue investigating it. I would appreciate it if we could leave this bug open for the time being so we can potentially work on this more in the future.
Attached patch cache _class (obsolete) — Splinter Review
i agree w/ jst that this bug doesn't seem like our fault, but i wonder if we could be nicer to compilers (and readers) by caching _class
Attachment #298703 - Flags: review?(jst)
I wouldn't ask for hacks to be done to the browser code since this seems to be a bug squarely in the Java Plug-In.

I have not had much success in fixing the bug on our side yet. I have added guards to make sure that if we are about to process a callback registered via NPN_PluginThreadAsyncCall, that we first check to see whether the plugin instance that registered the callback has been destroyed, and if it has, drop the callback on the floor rather than execute it. This seems to improve the reliability, but browser crashes are still seen. Some of them appear to be triggered by Java and some do not. Unfortunately at this point the above test case is not crashing reliably enough for me even without this fix to conclusively determine whether any fix I attempt has improved the situation, and poor error reporting from the HotSpot JVM is making it difficult to pinpoint exactly what code in the new Java Plug-In provoked the crash.

Will continue to investigate the bug from our side.
Comment on attachment 298703 [details] [diff] [review]
cache _class

Timeless, please file a new bug on the last attachment, let's keep this one about the Java specific crash.
Attachment #298703 - Attachment is obsolete: true
Attachment #298703 - Flags: review?(jst)
I have added the necessary safeguards to the Java Plug-In to guarantee that we will not execute a callback registered via NPN_PluginThreadAsyncCall if the plugin instance which registered that callback has been destroyed. I have also obtained a more recent build of the HotSpot JVM with better error reporting.

Even without the above fix, the crashes are very infrequent at this point with no other changes to the Java Plug-In. I suspect that recent browser bug fixes (perhaps around the teardown sequence for web pages?) have improved the overall stability.

With the bug fix in place, the crash frequency is unfortunately about the same as without it. With the test case in attachment 295529 [details], about 50% of the crashes appear to be initiated by the Java Plug-In as attachment 299013 [details] shows. About 50% of the crashes appear to occur in the browser with no code from the Java Plug-In involved; these crashes appear as a Windows dialog box indicating that an unexpected error has occurred in firefox.exe, with options to debug or close.

I believe that preexisting safeguards in the new Java Plug-In should have prevented a plugin instance from referencing an NPObject after the plugin instance had been destroyed, which is why the above safeguards did not appear to improve the situation.

At this point my best guess is that NPObjects are being destroyed out from under the Java Plug-In, and/or that there are other race conditions in the browser during page teardown that are causing the crashes.

Attachment 299014 [details] contains another test case from a customer which stresses rapid page reloads as well as NPRuntime calls (through Java->JavaScript calls). This test case crashes the browser 100% of the time, usually within about a minute. So far, all of the crashes I have seen from this test case appear to be purely within the browser, with no code from the Java Plug-In on the stack at the time of the crash.

I have sent jst a copy of the latest Java Plug-In containing the above safeguards for callback execution. I would appreciate it if Mozilla could look into this further with first the more recent test case (attachment 299014 [details]) and second with the original one (attachment 295529 [details]).
Good news, I think. It took me most of yesterday to find out what's going on here, but I did find a real problem that certainly could explain some really strange things, which is what I was seeing in the debugger all the time when looking at this. We're basically re-entering plugin initialization and once we unwind to the outer instantiating code we stomp all over deleted (and possibly re-allocated) memory, and that causes crashes at random spots later on, if not right then and there. And this appears to have messed up the debugger too as I've seen this happen numerous times, but never been able to get a stack out of the debugger that actually showed what was going on here. Purify wasn't of much help either :(

I have yet to see this crash in the NPN_HasProperty() and related code though, all the crashes I've seen have been in the object frame and related areas, but that's not to say this can't mess up memory that's being used for the NPObject methods etc too.

Kenneth, I'm spinning a build with a fix for this, and it also has a ton of printf()'s sprinkled throughout this code. Once it's done, could you test it out and see if you're able to reproduce a crash any more (and if you can, please attach output to this bug, i.e. run from the command line so you can capture it in a file)? The build should be available in about 2 hours. If that tests out good (either way, really, as it's a real problem whether it fixes this crash or not), I'll clean up the patch and try to get it in for beta3.

Marking this as a P1 blocker again as this *is* at least partly (if not completely) a problem on our end, and it's even a potentially bad one (and hard to find if nothing else).
Status: NEW → ASSIGNED
Flags: blocking1.9+
Priority: -- → P1
This is what'll be included in the build's that are queued on the try server for building.
Hmm, out infrastructure for doing test builds sort of fell over today, so the builds never made it (mac and linux did, but not windows). I'll be trying, and I'll post in the bug when they're available.
Kenneth, there's now a test build for Win32 available here:

https://build.mozilla.org/tryserver-builds/2008-01-25_17:46-jst@mozilla.com-java-crashfix/
Good news and bad news. Our colleague from NeuroDNA (who is on the CC: list for this bug) has confirmed that the build above fixes the problems they had been seeing. Unfortunately, even with the test build above, I can still crash the browser in the same way with a variant of the original test case attached to this bug. I will attach both the new test case as well as an hs_err log from my build.
Kenneth, we have tested your revised test case. The browser didnt crash but here is what we have experienced.

1. After test completed, Firefox memory usage was showing 255 MB of memory on Process Explorer. I do not think that is from Java heap usage. Java console says

Memory: 5,056K  Free: 2,436K  (48%) ... completed.

2. Because of this anourmous memory usage, there were a lot of context switches pushing the CPU usage over 90% all the time during the test.
3. I didnt rerun the test but I am pretty sure having 500 MB allocated memory will bring if not Firefox but the operating system on its knees. A possible crash is more than likely on Firefox side.
4. With that amount of memory and CPU usage, this test will crash on all machines out there if the memory modules and main board of the machine has disagreements at some point.

Also, on javascript strict mode, here is what we got on Error console for each call test() method.

Warning: deprecated arguments usage
Source File: file:///C:/Documents%20and%20Settings/Bora%20Ertung.NEURODNA-DMI4QU/Desktop/HashTable/6618975/HashTableTest.html
Line: 12
This clearly fixes one set of problems here, and would be worth getting in for beta3 IMO (dangling pointer dereferences etc). Boris, let me know if you can't r+sr this today and I'll find someone else.
Attachment #300170 - Flags: superreview?(bzbarsky)
Attachment #300170 - Flags: review?(bzbarsky)
jst, why is the EnsureVisible() call in content code needed?  Is the issue that frame code fails to deal with doing something that destroys itself?  If so, shouldn't that be handled in the frame code?

Doing that EnsureVisible() there might have some performance implications which I'd like to avoid in the non-plugin cases....
Also, which parts of that patch are really the fix, and which are just cleanup?
Boris, the problem was the EnsureVisible() call in nsPluginInstanceOwner::Init(), so the fix is moving that out to the content code. That call ends up handling XUL commands, which end up firing DOM events etc etc, so we end up tearing down a frame whose code is on the stack. I never saw full stacks in the debugger (looked corrupted), so it's hard to say exactly what the sequence of calls caused us to tear down stuff that was on the stack.

I honestly don't know what that call is needed for. But I do know it must not be called with frame code on the stack.
Target Milestone: mozilla1.9beta3 → mozilla1.9beta4
Any update on the availability of this fix or a test build containing it?
Comment on attachment 300170 [details] [diff] [review]
This clearly fixes at least one class of problems here.

Minusing per our discussion on irc the other day...  This should be doable without the content change with some nsWeakFrame love, I believe...
Attachment #300170 - Flags: superreview?(bzbarsky)
Attachment #300170 - Flags: superreview-
Attachment #300170 - Flags: review?(bzbarsky)
Attachment #300170 - Flags: review-
I've been busily working away at this problem, and I now have a fix in my tree that's able to run the attached testcases over night w/o crashing. The basic problem here, which isn't necessarily specific to the Java plugin, is that it's possible for a plugin to call out to the browser and trigger code that tears down the plugin. This leaves code from a destroyed plugin on the stack, and once we unwind, all bets are off. We've seen this over and over with other plugin in some situations as well, but this case is much harder to handle. The new Java plugin is particularly likely to trigger this due to its out of process JVM, because of this JS to Java and Java to JS calls end up spinning the event loop while waiting for the other process to do the work and give the result back, and that opens up more possibilities for the plugin being destroyed while we have it running code on the stack.

The way I fixed that problem was to introduce a new stack guard that makes sure we don't tear down a plugin while we're either calling into it, or called from it. Once we end up destroying a plugin, we check whether one of these guards are currently guarding it, and if so, we just set a bit in the outer-most guard and don't actually tear down the plugin. We do however null out its window. On Windows this will behave a bit differently than it will on other platforms, not due to this patch, but due to the fact that on Windows we already do the plugin destruction off of an event, and we reparent the window etc so that it's safe to do that there. But on other platforms there is no solution for that problem. So with this patch, a likely (though not common) plugin destruction could look like this on Windows:

  1:  JS calls into Java
  2:  spin event loop
  3:  handle network request
  4:  tear down document
  5:  reparent plugin window
  6:  dispatch destroy plugin runnable
  7:  unwind to the JS engine
  8:  JS calls into Java again
  9:  spin event loop
  10: run destroy plugin runnable
  11: call SetWindow(NULL)
  12: call destroy plugin code
  13: notice plugin destroy guard on the stack
  14: delay plugin destroy
  15: unwind to outermost plugin destroy guard
  16: dispatch runnable to actually destroy plugin
  17: unwind out of java back to the main message loop
  18: run runnable and actually destroy plugin

How we get into Java depends on what started this whole thing, we either get there by the browser calling into Java (as in the above case), or we get there by the out of process JVM calling into us. I don't know how we could catch the latter case in our code so that we could stick a plugin destruction guard on the stack there too, so thus the runnable to actually destroy the plugin once we unwind to the outermost guard, and it's probably still not impossible to run into trouble in this case. In the former case we can generally guard around calls into the plugins, assuming we find all the places we call into it.

I still have some cleaning up to do in the patch that does all this, but I fired off builds on our try servers with the patch I currently have. The builds are available here. Kenneth, could you give these a try?

https://build.mozilla.org/tryserver-builds/2008-02-22_20:39-jst@mozilla.com-plugindestroyguard/

Boris, any thoughts or concerns about this approach so far? I also rewrote the original patch in this bug to use nsWeakFrame reference and leave the call to EnsureVisible() where it was, and that's working fine as far as I can tell.
Attached patch Fix.Splinter Review
See previous comment for description of what this does.
Attachment #305554 - Flags: superreview?(bzbarsky)
Attachment #305554 - Flags: review?(bzbarsky)
Comment on attachment 305554 [details] [diff] [review]
Fix.

Could you add documentation to nsIObjectFrame which says which of the methods might destroy the frame?  And perhaps similar to any nsObjectLoadingContent methods that might do it.

>+++ b/modules/plugin/base/src/nsPluginHostImpl.cpp

>+  NS_IMETHOD Run()
>+    if (PluginDestructionGuard::DelayDestroy(mInstance)) {
>+      // It's still not safe to destroy the plugin, re-dispatch this
>+      // runnable.

Why re-dispatch?  Why not wait for that outermost guard to go off the stack and dispatch the runnable?  Re-dispatching this one just means that either we'll end up with two separate runnables pending or we'll end up firing this one before the guard has gone off the stack, and thus not destroying again.

Speaking of which, is this code safe when there _are_ two separate runnables pending for the same instance?  That can happen, I think....

I didn't double-check that you added this in all the places it's needed; I'm assuming you did.

r+sr=bzbarsky with the above fixed.  I really like this approach!  Much less fragile.
Attachment #305554 - Flags: superreview?(bzbarsky)
Attachment #305554 - Flags: superreview+
Attachment #305554 - Flags: review?(bzbarsky)
Attachment #305554 - Flags: review+
Oh, and perhaps add asserts that all the plugin guard action is happening on the main thread?
Flags: tracking1.9+ → blocking1.9+
Updated fix, fixes all of the above. Boris, the only real difference here is how nsPluginDestroyRunnable works, it's now a PRCList with a static list head so that when we run one of these we can walk the list of other scheduled runnables and see if there's another one that will destroy this plugin for us, and also if there's another guard on the stack for the instance in question the runnable doesn't tear down the plugin but makes the guard on the stack do that.

I also realized that I lost all of the guards I put in ns4xPluginInternal.cpp at some point, but they were not critical for the java issue in this bug. Either way, they're now back in the change...

Drivers, this neads a beta for sure, so it'd be awesome to get this in for beta4. It's essentially not a change in behavior unless we end up in a case where we tear down a plugin where it's not safe to do so w/o this patch. IOW, this should only change behavior when the current code crashes us.
Attachment #306220 - Flags: superreview+
Attachment #306220 - Flags: review+
Attachment #306220 - Flags: approval1.9b4?
Comment on attachment 306220 [details] [diff] [review]
Updated diff, ready to land.

a1.9b4=beltzner
Attachment #306220 - Flags: approval1.9b4? → approval1.9b4+
s/while we're instantiating the frame/while we're instantiating the plug-in/ in the comments?
Fix checked in! Kenneth, builds with a date of 2/29 or newer will contain this fix, and that's including beta4! Please test as much as possible. Marking FIXED.
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Depends on: 420884
Fix has been verified both with the above test case as well as by our SQE department. Marking verified.
Status: RESOLVED → VERIFIED
Product: Core → Core Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: