Closed Bug 869167 Opened 11 years ago Closed 7 years ago

Mac: Hang when ASan-built plugin is used in non-ASan build

Categories

(Core :: IPC, defect, P3)

x86
macOS
defect

Tracking

()

RESOLVED INCOMPLETE

People

(Reporter: mwobensmith, Unassigned, Mentored)

Details

(Whiteboard: [lang=c++][good next bug])

Attachments

(4 files, 3 obsolete files)

Attached file Empty embed
1. Install Test.plugin from attachment in Mac plugins folder:
~/Library/Internet Plugins/

2. Open bug file, which contains only an empty embed.

Result:
Hangs the browser

Stack:
https://crash-stats.mozilla.com/report/index/257aadcc-6adc-4881-a698-5e1f02130506
Attached file ASan-built plugin, Mac (obsolete) —
Obviously a super corner case, but talking to John Schoenick, he suggested I file it. We shouldn't be hanging here.
Indeed, worth looking into why we're hanging here - especially if it's an IPC fail relating to the plugin child failing to spawn properly (which could happen in other cases)
Also, the crash only happens when we force quit. This is a sample of the stack taken during the hang itself.
Could you also attach an all-thread gdb stack trace of the hang?
Steven, I can't seem to get a stack trace via gdb while it's running. Perhaps give it a go on your end and see if you can repro?
Comment on attachment 746044 [details]
ASan-built plugin, Mac

This is corrupt.  All of its files are soft links to files that don't exist (except on the system where you build it).

One way around this is to do a universal build and then look in [objdir]...universal/test-package-stage/bin/plugins.
Attached file Mac ASAN nptest plugin
There are some regular asan builds maintained at https://people.mozilla.com/~choller/firefox/asan/

I grabbed the Test.plugin from one of these (its in the -tests.zip) and attached

I tried this same scenario in linux -- there we crash in ScanPluginsDirectory, due to a segfault trying to dlopen() the asan plugin. This is less of an issue -- I'm more concerned that plugins doing something bad on startup can cause firefox to hand indefinitely waiting for a fubar plugin-container
Attachment #746044 - Attachment is obsolete: true
This is only an indeterminate hang with debug builds, due to processLaunchTimeoutSecs being 0 there.

We could possibly try to monitor the startup better though instead of having to time out on a crash on loading the module.
Priority: -- → P3
FWIW, this is hanging with the expected stack for plugin-container launch timeouts:

[...]
mozilla::ipc::GeckoChildProcessHost::SyncLaunch(this=0x000000011fc83f00, aExtraOpts=0x00007fff5fbfc6b8, aTimeoutMs=0, arch=PROCESS_ARCH_X86_64) + 339 at GeckoChildProcessHost.cpp:304
mozilla::plugins::PluginProcessParent::Launch(this=0x000000011fc83f00, timeoutMs=0) + 438 at PluginProcessParent.cpp:78
mozilla::plugins::PluginModuleParent::LoadModule(aFilePath=0x00000001127f15c8) + 163 at PluginModuleParent.cpp:90
GetNewPluginLibrary(aPluginTag=0x0000000112493e80) + 87 at nsNPAPIPlugin.cpp:395
nsNPAPIPlugin::CreatePlugin(aPluginTag=0x0000000112493e80, aResult=0x00007fff5fbfc9f0) + 152 at nsNPAPIPlugin.cpp:416
[...]
I did a quick check - we only create the crash reporter for the process after the child launched and loaded the plugin module:
http://hg.mozilla.org/mozilla-central/annotate/b980d32c366f/dom/plugins/ipc/PluginModuleParent.cpp#l85
... so we can't catch crashes during plugin module loading and instead time out on them.

I wonder if we could just switch to starting the process, creating the crash reporter, then loading the plugin module.
At least the Flash process injector shouldn't be affected by such a change.
Whether or not we have the crashreporter hooked up, SyncLaunch should immediately return if the launched process exits prematurely or crashes. I do support moving the DLL load after the synclaunch, but I don't think it's necessary nor the best way to fix this bug.

From the sample, it appears that GeckoChildProcessHost::PerformAsyncLaunchInternal is not on the stack. This means that somehow it completed without notifying mMonitor at http://hg.mozilla.org/mozilla-central/annotate/e19d0885977c/ipc/glue/GeckoChildProcessHost.cpp#l758. And indeed, there are a bunch of early-returns starting at http://hg.mozilla.org/mozilla-central/annotate/e19d0885977c/ipc/glue/GeckoChildProcessHost.cpp#l666 which skip the monitor *and* don't set mProcessState = PROCESS_ERROR.
Component: Plug-ins → IPC
Keywords: helpwanted
Whiteboard: [mentor=bsmedberg][lang=c++][good next bug]
Hi I am interested in working on this bug,but it's my first time to work on with debug,can anybody guide me on how to get started with it?Thanks a lot.
Do you have a working debug build already? 
There are build instructions here:
https://developer.mozilla.org/en-US/docs/Developer_Guide/Build_Instructions

You could also get help with initial problems on the #introduction or #developer IRC channels:
https://wiki.mozilla.org/IRC
I have a working debug build and would like to work on this bug. This is my first time, however. Could you please help me?
Great, have you been able to reproduce the hang with the test plugin from comment 8?
Yes, it hangs. I get this in my Terminal window:
[...]
++DOMWINDOW == 16 (0x10cff0cb0) [serial = 20] [outer = 0x111fea8b0]
WARNING: NS_ENSURE_TRUE(aURI) failed: file /users/andrey/FF/mozilla-central/caps/src/nsScriptSecurityManager.cpp, line 1913
WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80070057: file /users/andrey/FF/mozilla-central/extensions/cookie/nsPermissionManager.cpp, line 979
WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80070057: file /users/andrey/FF/mozilla-central/extensions/permissions/nsContentBlocker.cpp, line 250
WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80070057: file /users/andrey/FF/mozilla-central/extensions/permissions/nsContentBlocker.cpp, line 212
WARNING: NS_ENSURE_TRUE(currentURI) failed: file /users/andrey/FF/mozilla-central/content/base/src/ThirdPartyUtil.cpp, line 96
For application/x-test found plugin Test.plugin
WARNING: XPCOM objects created/destroyed from static ctor/dtor: file /users/andrey/FF/mozilla-central/xpcom/base/nsTraceRefcntImpl.cpp, line 141
WARNING: XPCOM objects created/destroyed from static ctor/dtor: file /users/andrey/FF/mozilla-central/xpcom/base/nsTraceRefcntImpl.cpp, line 141
dyld: lazy symbol binding failed: Symbol not found: ___asan_init
  Referenced from: /Library/Internet Plug-Ins/Test.plugin/Contents/MacOS/libnptest.dylib
  Expected in: flat namespace

dyld: Symbol not found: ___asan_init
  Referenced from: /Library/Internet Plug-Ins/Test.plugin/Contents/MacOS/libnptest.dylib
  Expected in: flat namespace
[...]
Next you could investigate what is happening here, e.g. by setting up debugging [1], breaking in GeckoChildProcessHost::PerformAsyncLaunchInternal() and see if it fails in there and exits that function before setting an error [2].

Also, feel free to ping us on IRC.

[1] https://developer.mozilla.org/en-US/docs/Debugging_on_Mac_OS_X
[2] http://hg.mozilla.org/mozilla-central/annotate/e19d0885977c/ipc/glue/GeckoChildProcessHost.cpp#l758
Thank you! Is it ok that I'm writing it here, or should I switch to IRC?
I've done that. The function returns false value here [1], before the line 758. It waits for the message for 10 seconds, does not get it, and returns false.

[1] http://hg.mozilla.org/mozilla-central/annotate/e19d0885977c/ipc/glue/GeckoChildProcessHost.cpp#l669
No need to switch to IRC, just another option :)

So - per comment 12, those places (i.e. the early return you hit and the ones after it) also need the error handling that we have in [1]. If you add that, the permanent hang should go away.

I'd personally prefer to move the part where this happens into a separate function so that the error handling doesn't need to be written 5 times.


http://hg.mozilla.org/mozilla-central/annotate/e19d0885977c/ipc/glue/GeckoChildProcessHost.cpp#l757
Attached patch A patch (obsolete) — Splinter Review
Here is a patch. Could you please review it?
Attachment #759395 - Flags: review?(georg.fritzsche)
Comment on attachment 759395 [details] [diff] [review]
A patch

Review of attachment 759395 [details] [diff] [review]:
-----------------------------------------------------------------

Maybe it would be better to move the code in the |#ifdef MOZ_WIDGET_COCOA| into a separate function, something like MachSetupPorts()?
Then we could just handle it returning false in one place.

bsmedberg will most likely review this, but i'm happy to give feedback :)

::: ipc/glue/GeckoChildProcessHost.cpp
@@ -770,5 @@
>    OpenPrivilegedHandle(base::GetProcId(process));
>    {
> -    MonitorAutoLock lock(mMonitor);
> -    mProcessState = PROCESS_CREATED;
> -    lock.Notify();

Note that this is using PROCESS_CREATED, not PROCESS_ERROR like the rest.
Attachment #759395 - Flags: review?(georg.fritzsche)
Attached patch A new patch (obsolete) — Splinter Review
Thank you for your feedback! I have fixed my patch.
Attachment #759395 - Attachment is obsolete: true
Attachment #759467 - Flags: review?
Attachment #759467 - Flags: review? → review?(benjamin)
Assignee: nobody → andzaytsev
I have pulled the latest version of code from the repository, and the patch seems to stop working. Could anybody please confirm this?
Comment on attachment 759467 [details] [diff] [review]
A new patch

Review of attachment 759467 [details] [diff] [review]:
-----------------------------------------------------------------

A bit of quick feedback on this from my end.

::: ipc/glue/GeckoChildProcessHost.cpp
@@ +474,5 @@
> +GeckoChildProcessHost::MachSetupPorts(std::string mach_connection_name)
> +{
> +  // Wait for the child process to send us its 'task_t' data.
> +  const int kTimeoutMs = 10000;
> +  

You have a few whitespaces left in the patch, easily spotted if you follow the "review" link for the patch.

@@ +483,5 @@
> +  if (err != KERN_SUCCESS) {
> +    std::string errString = StringPrintf("0x%x %s", err, mach_error_string(err));
> +    LOG(ERROR) << "parent WaitForMessage() failed: " << errString;
> +    NotifyMonitor();
> +    return false;

Note that the function returns task_t, so you would want to return TASK_NULL or a nullptr here.

@@ +713,5 @@
>  #endif
>                    false, &process, arch);
>  
>  #ifdef MOZ_WIDGET_COCOA
> +  task_t child_task = MachSetupPorts(mach_connection_name);

Sorry, i realize now that my name suggestion wasn't so great - maybe MachGetChildTask() would be better?

@@ +719,1 @@
>      return false;

I think you can just do the monitor notify and setting mProcessState here directly instead of multiple times in MachSetupPorts().
(In reply to Andrey Zaytsev from comment #24)
> I have pulled the latest version of code from the repository, and the patch
> seems to stop working. Could anybody please confirm this?

Does the patch still apply or were any sections rejected? Maybe something changed in the parts your patch addresses?
Comment on attachment 759467 [details] [diff] [review]
A new patch

Review of attachment 759467 [details] [diff] [review]:
-----------------------------------------------------------------

::: ipc/glue/GeckoChildProcessHost.cpp
@@ +470,5 @@
> +    lock.Notify();
> +}
> +
> +task_t
> +GeckoChildProcessHost::MachSetupPorts(std::string mach_connection_name)

Also, this function should be in a |#ifdef MOZ_WIDGET_COCOA| as well - this will (and should) not compile for e.g. Windows.
Comment on attachment 759467 [details] [diff] [review]
A new patch

I don't think that "NotifyMonitor" is a good method name for this. Because this is specifically about an error, it should be "ErrorAndNotifyMonitor" or something more descriptive.

Also, I wonder if it wouldn't make more sense for the caller of MachSetupPorts to call Notify, instead of doing it several times within that method?
Attachment #759467 - Flags: review?(benjamin) → review-
Notifying monitor before returning false does not fix the hang anymore:

MonitorAutoLock lock(mMonitor);
mProcessState = PROCESS_ERROR;
lock.Notify();

I have cloned the last revision from the repository and compiled it, adding those three lines before return false. Browser still hangs when I open the bug file.
Have you tried to debug it again? Does it still behave the same in PerformAsyncLaunchInternal?
And where does it hang now? And with what callstack (e.g. enter "bt" in the Xcode output window)?
It is very strange: when I debug it under Xcode, it does not hang. It executes those 3 lines of code, exits the function and continue running.
However, when I run it from my terminal it hangs.
Have you tried verifying that it runs through that code by adding some logging (e.g. just some printf()s to see what happens)?
When you attach with Xcode after it hangs and pause it, does it still hang with the same stack as in comment 10?
I've added two printf()s: before [1], and inside 'if', before 'return false'. In Xcode both of them are executed, however when I run the browser with: [./mach run -P dbg], only the first one is executed. It seems to run differently without Xcode. Is it ok?

[1] https://hg.mozilla.org/mozilla-central/annotate/e19d0885977c/ipc/glue/GeckoChildProcessHost.cpp#l663
It's possible that it runs differently without a debugger attached, but it would be interesting to find where it hangs in this case.
Maybe you can find out where it exits that function? And if you attach with Xcode once it hangs, pause it and look at the callstack... does it still look like the hang in comment 10?
Attached patch PatchSplinter Review
The problem was in a different function. There was an infinite loop.

Should I still add NotifyMonitor to those ifs?
Attachment #759467 - Attachment is obsolete: true
Attachment #763855 - Flags: review?(benjamin)
(In reply to Andrey Zaytsev from comment #35)
> The problem was in a different function. There was an infinite loop.

Ah, that is expected because the pref dom.ipc.plugins.processLaunchTimeoutSecs is set to 0 in debug builds.

However, we shouldn't hang indefinitely there in debug builds - at some point mProcessState should be set to PROCESS_ERROR or PROCESS_CONNECTED so that the while condition - (mProcessState < PROCESS_CONNECTED) - should evaluate to false.

Have you found out where it leaves PerformAsyncLaunchInternal() in that scenario? Is mProcessState set properly there and the monitor notified?
Attachment #763855 - Flags: review?(benjamin) → review+
So given a quick reading of this code, I don't understand why this patch is necessary. In the case of an infinite timeout, the monitor should only notify when the process has either been launched successfully or died, in which case mProcessState should be set.

What is actually going on that causes this patch to fix something important?
The loop is infinite when:
timeoutTicks == PR_INTERVAL_NO_TIMEOUT
It leaves PerformAsyncLaunchInternal() on the last line "return true".
So everything works correctly there.
I don't understand why the loop is infinite. Are you saying that mProcessState is still < PROCESS_CONNECTED but the lock isn't being notified? Why isn't the lock being notified?
@bsmedberg could this be a dupe of bug 811636 ?
It looks like everything works fine in PerformAsyncLaunchInternal(). The loop should be fixed anyway, shouldn't it?
And Windows.
So yeah, this turned out to be debug-only bustage on all platforms.
This doesn't seem to be actively worked on.
Assignee: andzaytsev → nobody
Mentor: benjamin
Whiteboard: [mentor=bsmedberg][lang=c++][good next bug] → [lang=c++][good next bug]
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → INCOMPLETE
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: