Closed Bug 984447 Opened 10 years ago Closed 10 years ago

Hangs in fork() with Java plugin which spawn hungry zombie processes, caused by bug in jemalloc

Categories

(Core Graveyard :: Plug-ins, defect)

All
macOS
defect
Not set
critical

Tracking

(firefox28 wontfix, firefox29+ verified, firefox30+ verified, firefox31+ verified)

VERIFIED FIXED
mozilla31
Tracking Status
firefox28 --- wontfix
firefox29 + verified
firefox30 + verified
firefox31 + verified

People

(Reporter: smichaud, Assigned: glandium)

References

()

Details

(Keywords: reproducible, topcrash-mac, Whiteboard: [STR in comment #42][analysis in comment #34])

Crash Data

Attachments

(3 files, 1 obsolete file)

Atos translation of top few lines of crashes at libsystem_kernel.dylib@0x15716 (on OS X 10.9.X):

__psynch_cvwait (in libsystem_kernel.dylib) + 10
_pthread_cond_wait (in libsystem_pthread.dylib) + 727
_pthread_cond_cleanup (in libsystem_pthread.dylib) + 0
os::PlatformEvent::park(long long) (in libjvm.dylib) + 385
ObjectMonitor::EnterI(Thread*) (in libjvm.dylib) + 460
JavaThreadBlockedOnMonitorEnterState::contended_enter_begin(JavaThread*) (in libjvm.dylib) + 29
ObjectMonitor::enter(Thread*) (in libjvm.dylib) + 441
InterpreterRuntime::monitorenter(JavaThread*, BasicObjectLock*) (in libjvm.dylib) + 146
...
This signature seems to be a different bug, and was also reported at bug 888732.
Crash Signature: [@ hang | libsystem_kernel.dylib@0x15716 ] [@ hang | libsystem_kernel.dylib@0x16bca ] [@ hang | libsystem_kernel.dylib@0x120fa ] → [@ hang | libsystem_kernel.dylib@0x15716 ] [@ hang | libsystem_kernel.dylib@0x16bca ]
> This signature

[@ hang | libsystem_kernel.dylib@0x120fa ]
Atos translation of top few lines of crashes at libsystem_kernel.dylib@0x16bca (on OS X 10.7.X):

__psynch_cvwait (in libsystem_kernel.dylib) + 10
_pthread_cond_wait (in libsystem_c.dylib) + 840
cond_cleanup (in libsystem_c.dylib) + 0
pthread_mutex_lock (in libsystem_c.dylib) + 480
os::PlatformEvent::park() (in libjvm.dylib) + 173
ObjectMonitor::wait(long long, bool, Thread*) (in libjvm.dylib) + 757
BiasedLocking::revoke_and_rebias(Handle, bool, Thread*) (in libjvm.dylib) + 370
ObjectSynchronizer::wait(Handle, long long, Thread*) (in libjvm.dylib) + 203
JVM_MonitorWait (in libjvm.dylib) + 156
...
Summary: Crashes in Java plugin with Java 7 Update 45 and up on OS X → Crashes at __psynch_cvwait | os::PlatformEvent::park() in Java plugin with Java 7 Update 45 and up on OS X
The kinds of searches I reported in comment #0 (by plugin version string) only started being supported fairly recently.  So their failure to return any results on "Java 7 Update 40" and earlier may be an artifact -- these crashes may in fact have happened in earlier releases of Java 7.
So, if I look for all Mac crashes and hangs (plugin + browser) on 27 and 28, to give it perspective, I see the following:

https://crash-stats.mozilla.com/topcrasher/products/Firefox/versions/28.0b/date_range_type/report/crash_type/all/os_name/Mac%20OS%20X/result_count/50?days=7

#6 (0.2%) - hang | libsystem_kernel.dylib@0x15716
#36 (0.04%) - hang | libsystem_kernel.dylib@0x16bca
2 more hang | libsystem_kernel.dylib@0x1… in the top 50 but <10 crashes in that week.

https://crash-stats.mozilla.com/topcrasher/products/Firefox/versions/27.0.1/date_range_type/report/crash_type/all/os_name/Mac%20OS%20X/result_count/50?days=7

#13 (0.2%) - hang | libsystem_kernel.dylib@0x15716
#36 (0.08%) - hang | libsystem_kernel.dylib@0x120fa
None more with such a signature in the top 50.


If I look at Mac plugin crashes or hangs, then "hang | libsystem_kernel.dylib@0x15716" is the #1 or within a crash of the #1 in both versions, competing with or outpacing the "hang | mach_msg_trap" signature that is not filed in Bugzilla but seems to be a Flash issue.


So if @0x15716 is not this bug, then I do not see this as a topcrash issue on Mac for sure.
> So if @0x15716 is not this bug

In comment #2 and comment #3 I said that @0x120fa isn't this bug.
If we have contacts at Oracle, we should probably CC them on this bug.  It's almost certainly theirs.
Note that the searches from comment #0 include some crashes that aren't this bug (or at least whose signatures are significantly different).
(Following up comment #2 and comment #3)

Changed my mind again.  Crashes [@ hang | libsystem_kernel.dylib@0x120fa ] *are* this bug.  So is bug 888732.
Crash Signature: [@ hang | libsystem_kernel.dylib@0x15716 ] [@ hang | libsystem_kernel.dylib@0x16bca ] → [@ hang | libsystem_kernel.dylib@0x15716 ] [@ hang | libsystem_kernel.dylib@0x16bca ] [@ hang | libsystem_kernel.dylib@0x120fa ]
STR from bug 888732:

1) In the Java system control panel, under the Security tab, add http://people.cs.uchicago.edu to the Exception Site List (which means that you trust Java applets from that site).
2) Run Firefox, and in Tools : Add-ons : Plugins, set the Java plugin to always activate.
3) In Preferences : Advanced : Network, click the Clear Now button to clear cached web content.
4) Visit http://people.cs.uchicago.edu/~stefanko/Teaching/CS102-Sum2001/Applets/Applets.html.

Normally you'll hang for about 30 seconds, then be notified that the Java plugin has crashed.
Keywords: reproducible
Whiteboard: [STR in comment #13]
(In reply to Steven Michaud from comment #14)
> Normally you'll hang for about 30 seconds, then be notified that the Java
> plugin has crashed.

Actually, I think it's 45 seconds, as that's the default value for the hang timer. And as the signature says, this is a hang, i.e. Java does not answer pings from the Firefox process before that timer runs out, and then we kill it intentionally and report the "crash".
Mostly these crashes happen as the Java plugin is starting up, and look like the atos translations in comment #1 and comment #4.

But they can also happen later, when they tend to look like this:

bp-cabe2034-4531-41f9-9f5e-f952a2140317
> Actually, I think it's 45 seconds

Makes sense.  I didn't time the "hang" too carefully :-)
It's possible these only happen on pages with multiple Java applets.
Another thing I see:

If you do the STR from comment #14, and/or if you try reloading the page before it's finished loading, you can end up with several plugin-container processes, some of which stay alive after you've quit Firefox.

(Testing with FF 27.0.1 on OS X 10.7.5.)
> If you do the STR from comment #14

If you do the STR from comment #14 several times in a row.
Actually comment #22 isn't always true.

But when you see a hang and a crash (more likely if you reload http://people.cs.uchicago.edu/~stefanko/Teaching/CS102-Sum2001/Applets/Applets.html multiple times), then quit Firefox, you'll often end up with at least one "surviving" plugin-container process.

This often (always?) consumes 100% of one cpu.  Here's an all-thread stack trace:

(gdb) thread apply all bt

Thread 1 (process 586):
#0  0x00007fff99ead385 in spin_lock$VARIANT$mp ()
#1  0x000000010775f49e in arena_malloc ()
#2  0x0000000107758ebf in je_malloc ()
#3  0x00007fff99eab3c8 in malloc_zone_malloc ()
#4  0x00007fff99eac1a4 in malloc ()
#5  0x00007fff99e596ae in _pthread_work_internal_init ()
#6  0x00007fff99e5a01a in pthread_workqueue_atfork_child ()
#7  0x00007fff99e1104c in _cthread_fork_child_postinit ()
#8  0x00007fff99e345a1 in fork ()
#9  0x000000010ae3db5a in Java_java_lang_UNIXProcess_forkAndExec ()
#10 0x000000010ceae698 in ?? ()
#11 0x000000010cea21d4 in ?? ()
#12 0x000000010cea2058 in ?? ()
#13 0x000000010cea2233 in ?? ()
#14 0x000000010cea2233 in ?? ()
#15 0x000000010cea2706 in ?? ()
#16 0x000000010cea2058 in ?? ()
#17 0x000000010cea2058 in ?? ()
#18 0x000000010cea2233 in ?? ()
#19 0x000000010cea2233 in ?? ()
#20 0x000000010cea2233 in ?? ()
#21 0x000000010ce9c4e7 in ?? ()
#22 0x000000010c71fbb0 in JavaCalls::call_helper ()
#23 0x000000010c7200c7 in JavaCalls::call_virtual ()
#24 0x000000010c720204 in JavaCalls::call_virtual ()
#25 0x000000010c76f1ea in thread_entry ()
#26 0x000000010c938b57 in JavaThread::thread_main_inner ()
#27 0x000000010c93a25f in JavaThread::run ()
#28 0x000000010c8641d6 in java_start ()
#29 0x00007fff99e598bf in _pthread_start ()
#30 0x00007fff99e5cb75 in thread_start ()
(gdb) call (void) pss()
(gdb) call (void) ps()
(gdb) 

Note the there's only one thread (the main one), and that the Java-specific pss() and ps() methods return nothing.
My guess is that we shouldn't be running Java from the plugin-container process, but rather from the main process.  I suspect recent versions of the Java plugin expect that.  The Java plugin already supports a multi-process model (and has done so for a long time).
(Following up comment #23)

I suspect the "main thread" in that stack started out as a secondary thread, and that all the other threads have been killed as part of the FF shutdown.
(Following up comment #25)

That's not true.  The behavior (with this bug's testcase) is even worse if you set dom.ipc.plugins.enabled.x86_64.javaappletplugin.plugin to false (in about:config) -- which causes FF to run the Java plugin in the main process.  You end up with two FF processes, one of which has the same stack trace as in comment #23, and consumes 100% of one cpu!!
We really need someone from Oracle here.
(Following up comment #25)

> I suspect the "main thread" in that stack started out as a secondary
> thread, and that all the other threads have been killed as part of
> the FF shutdown.

I confirmed this using an interpose library.
So here's what seems to be going on here:

Java plugin code (running in the plugin-container process or the main process) calls fork() on a secondary thread, and sometimes hangs there.  When this happens in the main process, FF just hangs.  When this happens in the plugin-container process, that process hangs and the FF main process starts a timer for killing the plugin-container process (assuming that it might have hung).  Until that timer expires, the FF main process appears itself to be hung.  But afterwards the FF UI is usable once again.

However FF didn't actually succeed in killing the plugin-container process, which may continue using 100% of a cpu in its undead state.
Whiteboard: [STR in comment #13] → [STR in comment #13][analysis in comment #29]
Taking another look at the stack in comment #23, this bug may be a bad interaction between jemalloc and the Java plugin.

Just now I tried testing with jemalloc off (try setting the NO_MAC_JEMALLOC environment variable to something like "1").  I saw some beachballing, lasting up to about 15 seconds, but no crashes and no hangs.

I'd appreciate it if others also tested with jemalloc off.
Summary: Crashes at __psynch_cvwait | os::PlatformEvent::park() in Java plugin with Java 7 Update 45 and up on OS X → Crashes at __psynch_cvwait | os::PlatformEvent::park() in Java plugin with Java 7 Update 45 and up on OS X, possible bad interaction with jemalloc
CCing some jemalloc people.

At least I hope you are -- you reviewed some recent jemalloc patches :-)
You know, if this *is* a bad interaction between jemalloc and the Java plugin, the easiest way to deal with it would be to add "NO_MAC_JEMALLOC=1" to the plugin-container process's environment when we use it to instantiate the Java plugin.

Tomorrow I'll post a patch that does this.
The fact that we have two processes with the same name suggests that fork() is hanging in the child process -- not the parent process.  We really shouldn't be using jemalloc in the child process before exec() has had a chance to run (and replace the child process executable image with something else).
Revised description of what's going on in this bug:

Java plugin code (running in the plugin-container process or the main process) calls fork() on a secondary thread.  Sometimes the call to fork() hangs, both in the child process and the parent process (as per the stack in comment #23).  The child process hangs in jemalloc code.  The parent process hangs waiting for the call to fork() to return.  Though the child process is hung, the code where the hang occurs runs in such a tight loop that it can max out the cpu dedicated to running that process (can run that cpu at 100%).

When the parent process of the Java plugin's call to fork() is the main process, FF just hangs.  When it's the plugin-container process, that process hangs and the FF main process starts a timer for killing the plugin-container process (assuming that it might have hung).  Until that timer expires, the FF main process appears itself to be hung.  But afterwards the FF UI is usable once again.

FF does kill the plugin-container process.  But the child process started by fork() is still running the plugin-container process's image (since it hasn't yet called exec()).  So it looks like the plugin-container process wasn't killed.
Whiteboard: [STR in comment #13][analysis in comment #29] → [STR in comment #13][analysis in comment #34]
Attached file Fork hang parent side
Here's what the parent side of the fork hang looks like (in the plugin-container process).  (For reference, a stack of the child side of the fork hang is in comment #23.)

Here's a stack of the secondary thread on which the Java plugin called fork():

Thread 30 (process 376):
#0  0x00007fff8f566af2 in read ()
#1  0x000000010ae4afd4 in Java_java_lang_UNIXProcess_forkAndExec ()
#2  0x000000010cfae698 in ?? ()
#3  0x000000010cfa21d4 in ?? ()
#4  0x000000010cfa2058 in ?? ()
#5  0x000000010cfa2233 in ?? ()
#6  0x000000010cfa2233 in ?? ()
#7  0x000000010cfa2706 in ?? ()
#8  0x000000010cfa2058 in ?? ()
#9  0x000000010cfa2058 in ?? ()
#10 0x000000010cfa2233 in ?? ()
#11 0x000000010cfa2233 in ?? ()
#12 0x000000010cfa2233 in ?? ()
#13 0x000000010cf9c4e7 in ?? ()
#14 0x000000010c81fbb0 in JavaCalls::call_helper ()
#15 0x000000010c8200c7 in JavaCalls::call_virtual ()
#16 0x000000010c820204 in JavaCalls::call_virtual ()
#17 0x000000010c86f1ea in thread_entry ()
#18 0x000000010ca38b57 in JavaThread::thread_main_inner ()
#19 0x000000010ca3a25f in JavaThread::run ()
#20 0x000000010c9641d6 in java_start ()
#21 0x00007fff9096e8bf in _pthread_start ()
#22 0x00007fff90971b75 in thread_start ()

And here's a stack of the main thread:

Thread 1 (process 376):
#0  0x00007fff8f565bca in __psynch_cvwait ()
#1  0x00007fff90972274 in _pthread_cond_wait ()
#2  0x000000010c95fe39 in os::PlatformEvent::park ()
#3  0x000000010c957e94 in ObjectMonitor::EnterI ()
#4  0x000000010c958a21 in ObjectMonitor::enter ()
#5  0x000000010c81b35c in InterpreterRuntime::monitorenter ()
#6  0x000000010cfac8a7 in ?? ()
#7  0x000000010cfa2233 in ?? ()
#8  0x000000010cfa2058 in ?? ()
#9  0x000000010cf9c4e7 in ?? ()
#10 0x000000010c81fbb0 in JavaCalls::call_helper ()
#11 0x000000010c81f980 in JavaCalls::call ()
#12 0x000000010c853425 in jni_invoke_nonstatic ()
#13 0x000000010c847096 in jni_CallVoidMethodV ()
#14 0x000000010aae8512 in JNFCallVoidMethod ()
#15 0x00000001097dae61 in MacNPAPIJavaPlugin::handleWindowFocusEvent ()
#16 0x00000001097db70a in MacNPAPIJavaPlugin::handleEvent ()
#17 0x00000001097dbd8c in NPP_HandleEvent ()
#18 0x0000000100e2daf6 in mozilla::plugins::PluginInstanceChild::AnswerNPP_HandleEvent ()
#19 0x00000001004da72d in mozilla::plugins::PPluginInstanceChild::OnCallReceived ()
#20 0x00000001004ec755 in mozilla::plugins::PPluginModuleChild::OnCallReceived ()
#21 0x0000000100321035 in mozilla::ipc::MessageChannel::DispatchInterruptMessage ()
#22 0x000000010031e2cf in mozilla::ipc::MessageChannel::OnMaybeDequeueOne ()
#23 0x00000001002f0294 in MessageLoop::DeferOrRunPendingTask ()
#24 0x00000001002f05aa in MessageLoop::DoWork ()
#25 0x000000010031294a in base::MessagePumpCFRunLoopBase::RunWorkSource ()
#26 0x00007fff913834f1 in __CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE0_PERFORM_FUNCTION__ ()
#27 0x00007fff91382d5d in __CFRunLoopDoSources0 ()
#28 0x00007fff913a9b49 in __CFRunLoopRun ()
#29 0x00007fff913a9486 in CFRunLoopRunSpecific ()
#30 0x00007fff9164e2bf in RunCurrentEventLoopInMode ()
#31 0x00007fff9165556d in ReceiveNextEventCommon ()
#32 0x00007fff916553fa in BlockUntilNextEventMatchingListInMode ()
#33 0x00007fff8de73779 in _DPSNextEvent ()
#34 0x00007fff8de7307d in -[NSApplication nextEventMatchingMask:untilDate:inMode:dequeue:] ()
#35 0x00007fff8de6f9b9 in -[NSApplication run] ()
#36 0x0000000100313923 in base::MessagePumpNSApplication::DoRun ()
#37 0x0000000100312eaa in base::MessagePumpCFRunLoopBase::Run ()
#38 0x00000001002efb8e in MessageLoop::Run ()
#39 0x0000000101c5fefa in XRE_InitChildProcess ()
#40 0x0000000100000f1d in main ()
Thanks for the analysis Steven.
Do you still think disabling jemalloc for java plugin containers is the best option for a work-around?
Is this actionable for Oracle engineers too?
Should we maybe fork this in a work-around & Java bug?
I'm still working on this, and thinking about it.  So I'd prefer to keep quiet until I can come to some more definite conclusions, if possible.  My natural tendency is to logorreah ... but I'm trying to fight that :-)

> Do you still think disabling jemalloc for java plugin containers is the best option for a work-around?

No.  Though this might be a good fallback.  I now think this is probably a jemalloc bug, and therefor our bug (not Oracle's).  Jemalloc should be able to deal with fork() being called from a secondary thread ... but apparently for now it can't.

> Is this actionable for Oracle engineers too?

If I'm right that this is a jemalloc bug, then no.

> Should we maybe fork this in a work-around & Java bug?

Only if we can't find a better fix.

I'll be working on it.  And once I have something more definite I'll ask the jemalloc people about it.
For various reasons, I wasn't able to fix this bug in jemalloc itself -- at least not quickly.  Later I'll open one or more bugs about this.

So I'm falling back to disabling jemalloc for the Java plugin run in a separate plugin-container process.

I've started a tryserver build, which should eventually be available here:
https://tbpl.mozilla.org/?tree=Try&rev=e7e7678dff14

My patch gets rid of this bug in my tests, but we'll need others to test it to be quite sure about that.
Assignee: nobody → smichaud
Status: NEW → ASSIGNED
Since this is really our bug (as best I can tell), I'm recategorizing it.
Component: Java (Oracle) → Plug-ins
Product: Plugins → Core
Version: 7.x → unspecified
I've opened bug 985759 to deal with the problems I had trying to "fix" jemalloc.
(Following up comment #38)

The tryserver build is ready *much* earlier than I expected:
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/smichaud@pobox.com-e7e7678dff14/try-macosx64/firefox-31.0a1.en-US.mac.dmg

Please test with this and let us know your results.  I'd especially like to hear from eyalhrs@gmail.com, who reported bug 888732.
Revised STR:

1) In the Java system control panel, under the Security tab, add http://people.cs.uchicago.edu to the Exception Site List (which means that you trust Java applets from that site).
2) Run Firefox, and in Tools : Add-ons : Plugins, set the Java plugin to always activate.
3) In Preferences : Advanced : Network, click the Clear Now button to clear cached web content.
4) Visit http://people.cs.uchicago.edu/~stefanko/Teaching/CS102-Sum2001/Applets/Applets.html.

Normally you'll hang for about 30 seconds, then be notified that the Java plugin has crashed.

If you don't hang in step 4, try force-reloading the page (cmd-shift-r), then click repeatedly on the page, or scroll it, while the applets are reloading.  Or close the tab containing the applets (if they're in a separate tab) and click on or scroll the tab to which the focus changes.
Whiteboard: [STR in comment #13][analysis in comment #34] → [STR in comment #42][analysis in comment #34]
There are also a lot of these on OS X 10.6.8, with Apple's Java distro.  So it isn't just Oracle's distro that's effected.

https://crash-stats.mozilla.com/search/?signature=~hang+|+mach_msg_trap&plugin_name=Java+Applet+Plug-in&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=platform_version
Crash Signature: [@ hang | libsystem_kernel.dylib@0x15716 ] [@ hang | libsystem_kernel.dylib@0x16bca ] [@ hang | libsystem_kernel.dylib@0x120fa ] → [@ hang | libsystem_kernel.dylib@0x15716 ] [@ hang | libsystem_kernel.dylib@0x16bca ] [@ hang | libsystem_kernel.dylib@0x120fa ] [@ hang | mach_msg_trap ]
Summary: Crashes at __psynch_cvwait | os::PlatformEvent::park() in Java plugin with Java 7 Update 45 and up on OS X, possible bad interaction with jemalloc → Hangs in fork() with Java plugin which spawn hungry zombie processes, caused by bug in jemalloc
Comment on attachment 8393837 [details] [diff] [review]
Workaround (load Java plugin with NO_MAC_JEMALLOC=1)

John, are you willing to review this?  If not, please find someone else to take the review.

I'd hoped to get testing from others before a review.  Easier said than done.  But I've done a fair amount of testing myself, and I've *never* hung or crashed using the Java plugin with this patch.

Ideally we'd want to fix the bug in jemalloc.  But the hang is deep in system code (called from fork()), so it will be difficult (and time consuming) just to track it down.  And the problem may turn out to be a design flaw in jemalloc (which seems to use locks a lot more than the memory allocation code it replaces) -- in which case we'd just end up having to settle for some other kind of workaround ... if we can find one.

This bug's trigger seems to be a call to fork() on a secondary thread.  We never use fork() ourselves, at least on the Mac (instead we use posix_spawn()).  And calling fork() from a secondary thread is probably very unusual behavior, so there's a good chance that only the Java plugin does it.  So my workaround is likely to completely eliminate the effects of this bug.
Attachment #8393837 - Flags: review?(jschoenick)
Comment on attachment 8393837 [details] [diff] [review]
Workaround (load Java plugin with NO_MAC_JEMALLOC=1)

Review of attachment 8393837 [details] [diff] [review]:
-----------------------------------------------------------------

Impressive debugging! Is malloc() supposed to be safe between fork() and execve()? If so we should probably file a jemalloc bug as well.

This LGTM with the nit below, but I think bsmedberg should also the review changes to the IPC code.

::: dom/plugins/base/nsNPAPIPlugin.cpp
@@ +384,5 @@
>      return nullptr;
>    }
>  
>    if (nsNPAPIPlugin::RunPluginOOP(aPluginTag)) {
> +    return PluginModuleParent::LoadModule(aPluginTag->mFullPath.get(), aPluginTag);

I'd prefer if we moved the mIsJava check here and just pass ChildRestrictions or a PluginQuirks enum of some kind as the second param -- The pluginTag is refcounted and might go away when we do re-entrant IPC things, so having it available for the scope of that function might be a foot gun later.
Attachment #8393837 - Flags: review?(jschoenick)
Attachment #8393837 - Flags: review?(benjamin)
Attachment #8393837 - Flags: review+
(In reply to John Schoenick [:johns] from comment #45) 
> Impressive debugging! Is malloc() supposed to be safe between fork() and
> execve()? If so we should probably file a jemalloc bug as well.

@glandium ^
Flags: needinfo?(mh+mozilla)
In fact, i'd rather fix jemalloc than work around the issue.
Presumably, jemalloc3 doesn't have the problem. Can you check whether you reproduce the bug if you build with MOZ_JEMALLOC3=1 in your mozconfig?
Flags: needinfo?(mh+mozilla)
(In reply to comment #45)

>>    if (nsNPAPIPlugin::RunPluginOOP(aPluginTag)) {
>> +    return PluginModuleParent::LoadModule(aPluginTag->mFullPath.get(), aPluginTag);
>
> I'd prefer if we moved the mIsJava check here and just pass
> ChildRestrictions or a PluginQuirks enum of some kind as the second
> param -- The pluginTag is refcounted and might go away when we do
> re-entrant IPC things, so having it available for the scope of that
> function might be a foot gun later.

Sounds reasonable to me.  Tomorrow I'll post a patch that does this.
(In reply to comment #47)

Tomorrow I'll try jemalloc3.

But frankly I'm not willing to go beyond that, for now.  This is a topcrasher, which needs to be fixed or worked around quickly.  My workaround does that, and there's good reason to think that only the Java plugin triggers the jemalloc bug.

Once that's done, I'll open a bug on the "real" problem -- so far as I can tell what it is.  But I'm unlikely to be able to work on it anytime soon, so it will fall into your lap (or that of someone responsible for dealing with jemalloc bugs).
Given this is a (or *the*) Mac topcrash, and we have a lot of info about it right now, I think it should be tracked for upcoming releases, even though it's not new (actually seems to go quite a bit back in terms of releases).
I'm also marking newer versions as affected even though I haven't seen Aurora or Nightly in crash stats (but every versions start showing it when going to beta and a wider audience).
Attachment #8393837 - Flags: review?(benjamin) → review+
Benjamin, is this change alright with you?
Attachment #8393837 - Attachment is obsolete: true
Attachment #8398021 - Flags: review?(benjamin)
I just tested with jemalloc3, and found that it does seem to fix this bug.

I also found the following in the jemalloc 3.0.0 changelog, which seems to address the problem directly:

- Fix fork-related bugs that could cause deadlock in children between fork and exec.
https://hg.mozilla.org/mozilla-central/annotate/bb4dd9872236/memory/jemalloc/src/ChangeLog#l125

But I assume jemalloc3 isn't yet ready to go into releases.  So we should still proceed with my workaround patch.
Oops, wrong bug :-(
(In reply to Steven Michaud from comment #52)
> I just tested with jemalloc3, and found that it does seem to fix this bug.
> 
> I also found the following in the jemalloc 3.0.0 changelog, which seems to
> address the problem directly:
> 
> - Fix fork-related bugs that could cause deadlock in children between fork
> and exec.
> https://hg.mozilla.org/mozilla-central/annotate/bb4dd9872236/memory/jemalloc/
> src/ChangeLog#l125
> 
> But I assume jemalloc3 isn't yet ready to go into releases.  So we should
> still proceed with my workaround patch.

But that means there is a jemalloc that we can probably backport. I'll look at this today.
Turns out bug 694896 broke prefork/postfork handling when essentially merging the jemalloc zone and the default zone. The comment in the ozone_force_* functions because wrong, nothing is dealing with those (and pthread_atfork is not used on osx).

Flagging you because you're listed as co-author of the patch in mercurial history, and jlebar is not here anymore. Feel free to redirect to someone else if you don't feel comfortable. This is essentially what jemalloc3 does.
Attachment #8398431 - Flags: review?(smichaud)
Assignee: smichaud → mh+mozilla
Comment on attachment 8398431 [details] [diff] [review]
Properly handle forks in jemalloc after bug 694896

I'm not very familiar with jemalloc code, so I'm not really appropriate as a reviewer.  What I did at bug 694896 was find and correct some problems with jlebar's patch.

I will test this, though, and get back with my results.
Attachment #8398431 - Flags: review?(smichaud)
Comment on attachment 8398431 [details] [diff] [review]
Properly handle forks in jemalloc after bug 694896

This does fix the crashes in my tests.

It'd really help, you know, if we had a link to the jemalloc 3.0.0 revision referred to in the change log.

I just spent half an hour looking for it, without any luck.  But I did find the following:

http://www.canonware.com/pipermail/jemalloc-discuss/2012-March/000143.html

This is you commenting that the revision broke something.

I truly don't know who to suggest for the review.  Nathan Froyd?
Attachment #8398431 - Flags: review?(n.nethercote)
Nicholas, see comment 56.
(In reply to Steven Michaud from comment #58)
> Comment on attachment 8398431 [details] [diff] [review]
> Properly handle forks in jemalloc after bug 694896
> 
> This does fix the crashes in my tests.
> 
> It'd really help, you know, if we had a link to the jemalloc 3.0.0 revision
> referred to in the change log.

It's easily found with git log --grep= in the jemalloc upstream source repository. But the current code in memory/jemalloc/src/zone.c is more interesting than its history.
> in the jemalloc upstream source repository

And where is that?
(In reply to Steven Michaud from comment #61)
> > in the jemalloc upstream source repository
> 
> And where is that?

https://github.com/jemalloc/jemalloc
(In reply to Steven Michaud from comment #63)
> Thanks.  Here it is:
> https://github.com/jemalloc/jemalloc/commit/4e2e3dd

Except that's not relevant to this bug. What's relevant to this bug is that bug 694896 *removed* calls to jemalloc pre/postfork functions.
And as a matter of fact, jemalloc3 doesn't use postfork_child on osx, so that commit is irrelevant on mac anyways.
> - Fix fork-related bugs that could cause deadlock in children between fork and exec.
> https://hg.mozilla.org/mozilla-central/annotate/bb4dd9872236/memory/jemalloc/src/ChangeLog#l125

Are you saying that this is irrelevant to this bug?
yes
I'm not convinced that the patch for bug 694896 actually caused this bug.  But jemalloc in our tree was in such bad shape before that patch that it's not feasible to test with earlier builds.  So the question is moot.

I did try using pthread_atfork() just as your patch for this bug uses ozone_force_lock() and ozone_force_unlock() (and indirectly malloc_introspection_t.force_lock() and malloc_introspection_t.force_unlock()).  It didn't work -- it made the Java plugin *always* hang.

I didn't realize that we need to use malloc_introspection_t.force_lock() and malloc_introspection_t.force_unlock() instead -- or that these have anything to do with forking.  But this is what your patch does, and that's apparently the key to this problem.

Call that a review if you want.  But I don't know how malloc_introspection_t.force_lock() and malloc_introspection_t.force_unlock() are used elsewise, and so I don't know the possible other consequences of changing them.
malloc_introspection_t.force_lock and malloc_introspection_t.force_unlock are called for all zones by osx's libc at fork time. See functions _malloc_fork_prepare, _malloc_fork_parent and _malloc_fork_child in http://www.opensource.apple.com/source/Libc/Libc-594.1.4/gen/malloc.c

Before bug 694896, the jemalloc zone *did* have force_lock and force_unlock calling jemalloc pre/postfork functions. Bug 694896 *removed* that.
It depends on how you interpret what create_zone() and szone2ozone() did before the patch for bug 694896.

create_zone() (which was removed) did have the following code, which did (indirectly) call _malloc_prefork() and _malloc_postfork().

	zone_introspect->force_lock = (void *)zone_force_lock;
	zone_introspect->force_unlock = (void *)zone_force_unlock;

But both before and after the patch szone2ozone() had this, which doesn't:

  	ozone_introspect->force_lock = (void *)ozone_force_lock;
  	ozone_introspect->force_unlock = (void *)ozone_force_unlock;
> malloc_introspection_t.force_lock and
> malloc_introspection_t.force_unlock are called for all zones by
> osx's libc at fork time. See functions _malloc_fork_prepare,
> _malloc_fork_parent and _malloc_fork_child in
> http://www.opensource.apple.com/source/Libc/Libc-594.1.4/gen/malloc.c

Thanks for the reference.  But what *else* are
malloc_introspection_t.force_lock and
malloc_introspection_t.force_unlock used for?
the zone that create_zone created was registered to the libc, and is a separate zone from the one szone2ozone edited. The libc runs force_lock/unlock of *all* zones. So jemalloc's prefork/postfork functions *were* called. And bug 694896 removed that.
(In reply to Steven Michaud from comment #71)
> Thanks for the reference.  But what *else* are
> malloc_introspection_t.force_lock and
> malloc_introspection_t.force_unlock used for?

Nothing in Libc.
Thanks.  Things are now much clearer than after what you said in comment #57, which wasn't even entirely grammatical.  For what it's worth, I now know enough to r+ your patch.

I think we should land it on trunk and see what happens over the next week.  If there are no problems we should uplift it to aurora and beta.  But if there are problems we should fall back on my disable-jemalloc-in-the-Java-plugin workaround.
(In reply to Steven Michaud from comment #74)
> Thanks.  Things are now much clearer than after what you said in comment
> #57, which wasn't even entirely grammatical.

s/because/became/ is what i intended to write then.

> For what it's worth, I now know enough to r+ your patch.

Feel free to steel the review, then.
Attachment #8398431 - Flags: review?(n.nethercote) → review+
https://hg.mozilla.org/mozilla-central/rev/b8a7881a83fa
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla31
> I think we should land it on trunk and see what happens over the next week.

There haven't been any of this bug's crash stacks on the 31 branch at all -- before or after glandium's patch landed.  And there weren't any on the 30 branch for its entire history, as far as I can tell.

I don't know if this is a flaw in the data, or (perhaps) in how I searched it.  But I suspect it's just that we have next to no users of the Java plugin on trunk or alpha.  So we should probably uplift the patch to alpha and beta soon.

The 29 branch *does* have this bug's crashes, and we probably want to get the fix into the 29 release.  And it looks like the fix won't get any real testing until it gets onto the 29 branch.

I think the risk of uplifting this patch is "moderate".  The change does seem well isolated (unlikely to effect anything other than forking) -- which would tend to make it low risk.  But the jemalloc code is "difficult", which I think bumps the risk up to moderate.  However, this should motivate us to uplift it to beta (and alpha) more quickly, so that it gets adequate testing for longer before a release.

What do you think, glandium?  And do you want to do the honors (to request uplifting to alpha and beta)?
Flags: needinfo?(mh+mozilla)
Flags: needinfo?(mh+mozilla)
Comment on attachment 8398431 [details] [diff] [review]
Properly handle forks in jemalloc after bug 694896

[Approval Request Comment]
Bug caused by (feature/regressing bug #): regression from bug 694896 (Which dates back FF10)
User impact if declined: Java plugin hangs on mac.
Testing completed (on m-c, etc.): This has been on m-c for a few days. Smichaud verified he couldn't reproduce anymore with the patch.
Risk to taking this patch (and alternatives if risky): I kind of agree with smichaud's comment 78, although I'd put this somewhere between low and moderate. This is also mac-only. The code the patch modifies is only used on mac, and in practice, restores what we were doing before FF10. The patch is straightforward, jemalloc is not modified a lot these days, so if it needs to be backed out, it can be backed out quite easily.
String or IDL/UUID changes made by this patch: None
Attachment #8398431 - Flags: approval-mozilla-beta?
Attachment #8398431 - Flags: approval-mozilla-aurora?
Attachment #8398431 - Flags: approval-mozilla-beta?
Attachment #8398431 - Flags: approval-mozilla-beta+
Attachment #8398431 - Flags: approval-mozilla-aurora?
Attachment #8398431 - Flags: approval-mozilla-aurora+
Attachment #8398021 - Flags: review?(benjamin)
Florin, can you please make sure this gets verified fixed in the next Nightly, Aurora, and Beta builds?
Flags: needinfo?(florin.mezei)
Reproduced with Nightly from 2014-03-17 with STR from comment 42.
Verified as fixed with Firefox 29 beta 5 (Build ID: 20140403132807), latest Nightly (Build ID: 20140403030201) and latest Aurora (Build ID: 20140404004001) on Mac OS X 10.9.2 - no Java plugin crash encountered.

Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:29.0) Gecko/20100101 Firefox/29.0
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:30.0) Gecko/20100101 Firefox/30.0
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:31.0) Gecko/20100101 Firefox/31.0
Status: RESOLVED → VERIFIED
Flags: needinfo?(florin.mezei)
Keywords: verifyme
Do we have indications that it's not happening on beta anymore?
Flags: needinfo?(smichaud)
Here are reports for the top two signatures on the 29 branch over the last week:

https://crash-stats.mozilla.com/report/list?signature=hang+%7C+libsystem_kernel.dylib%400x15716&product=Firefox&query_type=contains&range_unit=weeks&process_type=any&platform=mac&version=Firefox%3A29.0b&version=Firefox%3A29.0b6&version=Firefox%3A29.0b5&version=Firefox%3A29.0b4&version=Firefox%3A29.0b3&version=Firefox%3A29.0b2&version=Firefox%3A29.0b1&hang_type=any&date=2014-04-09+14%3A00%3A00&range_value=1#reports
https://crash-stats.mozilla.com/report/list?signature=hang+%7C+mach_msg_trap&product=Firefox&query_type=contains&range_unit=weeks&process_type=any&platform=mac&version=Firefox%3A29.0b&version=Firefox%3A29.0b6&version=Firefox%3A29.0b5&version=Firefox%3A29.0b4&version=Firefox%3A29.0b3&version=Firefox%3A29.0b2&version=Firefox%3A29.0b1&hang_type=any&date=2014-04-09+14%3A00%3A00&range_value=1#reports

The results are mixed:  This bug's patch landed as of FF29b5, but there are a small number of these signatures for 29b5, and even two for 29b6 (which was released yesterday).

But we need to keep in mind that these signature are very generic -- they might be recorded for *any* plugin crash, including non-related Java crashes and those in other plugins.  Indeed some of them are for the Flash plugin.

I'm confident glandium's patch fixed the jemalloc bug, and I'm reasonably confident that most of the pre-patch crashes were caused by the jemalloc bug.  But crashes with these signatures will never disappear completely.

We need to keep an eye on their numbers.  But if some of them don't go down far enough it's much more likely that we're seeing some *other*, unrelated bug whose traces (in the crash stats) were masked by this one.
Flags: needinfo?(smichaud)
In 29.0b5 and higher, we are definitely far away from what I saw in comment #7, and when I look at the Product split in the Signature Summary of the two top signatures from this bug, https://crash-stats.mozilla.com/report/list?signature=hang+|+libsystem_kernel.dylib%400x15716 and https://crash-stats.mozilla.com/report/list?signature=hang+|+mach_msg_trap it's pretty easy to see that of the crashes happening within the last week, we are still dominated by versions older than 29.0b4 (actually, b4 ranks way higher than b5 and b6 even though way more users are on the newer betas), so I think the patch here is a success and the remaining crashes are probably from other cases as mentioned in comment #84.

Thanks for your work here!
Product: Core → Core Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: