Closed Bug 1109768 Opened 10 years ago Closed 10 years ago

Browser becomes unresponsive

Categories

(Core :: General, defect)

x86
All
defect
Not set
normal

Tracking

()

RESOLVED DUPLICATE of bug 1091705
Tracking Status
e10s ? ---

People

(Reporter: zbraniecki, Unassigned)

Details

Attachments

(7 files)

Attached file Sample of Nightly.txt
Apologies for a vague title, but I'm not sure how to narrow it down yet.

I managed to reproduce one of the two types of hangs that I frequently see while using Nightly+e10s.

The behavior goes like this:
 - I open a page or switch to an already open tab
 - the tab becomes unresponsive
 - all other tabs are unresponsive as well
 - I can't reload the page, open new window etc.

ps -A has:

 9324 ??        36:10.86 /Applications/FirefoxNightly.app/Contents/MacOS/firefox -foreground -foreground
 9503 ??        62:12.32 /Applications/FirefoxNightly.app/Contents/MacOS/plugin-container.app/Contents/MacOS/plugin-container -greomni /Applications/FirefoxNightly.app/Contents/Resources/omni.ja -appomni /Applications/FirefoxNightly.app/Contents/Resources/browser/omni.ja -appdir /Applications/FirefoxNightly.app/Contents/Resources/browser 9324 gecko-crash-server-pipe.9324 org.mozilla.machname.692582760 tab
23819 ??         0:00.20 /Applications/FirefoxNightly.app/Contents/MacOS/plugin-container.app/Contents/MacOS/plugin-container /Library/Internet Plug-Ins/Flash Player.plugin -greomni /Applications/FirefoxNightly.app/Contents/Resources/omni.ja -appomni /Applications/FirefoxNightly.app/Contents/Resources/browser/omni.ja -appdir /Applications/FirefoxNightly.app/Contents/Resources/browser 9324 gecko-crash-server-pipe.9324 org.mozilla.machname.1162858929 plugin
23884 ttys001    0:00.00 grep --color=auto Firefox

I used Activity Monitor to collect a sample from Nightly process. Can't collect a sample from plugin-container because it disappears/reappears every second and Acitivity Monitor doesn't allow me to select it for long enough to collect a sample.
Oh, I have one addon (ADB helper) and Flash 15 plugin.
Hey gandalf - you say that the plugin-container disappears / reappears... can you try ordering the items in Activity Monitor by process name? It's possible the disappearing / reappearing is because of quick re-ordering. If you order by name, that should calm it down.

The plugin-container sample would, indeed, be the most interesting thing we could get here, if you can.
Flags: needinfo?(gandalf)
Flags: needinfo?(gandalf)
So, those two (marked as #1) are sample processes from the current hang. It looked a bit different in activity monitor, because it was using around 65% of CPU, but the chrome was responsive, neither of the process was marked as "unresponsive" which is the characteristic of this bug (compared to the other one where chrome is unresponsive - which is bug 1091705).

Hope that helps!
Steven, any ideas here?
Flags: needinfo?(smichaud)
As best I can tell, the chrome/main process samples don't contain any meaningful information (in particular, they don't show any hangs).  The chrome process sample does show a hang (or something like that), though, in Thread_970027.

The hang appears to be in _sigtramp() -- an undocumented method that's used by the OS to deliver a signal to a process.

For the record, here's a declaration for _sigtramp().  (I figured it out a while ago, when I had to hook it in an interpose library to diagnose a bug.)

extern "C" void _sigtramp(void *handler, void *arg2, int32_t sig,
                          siginfo_t *info, ucontext_t *uap);

It'd be nice to see an all-thread stack trace of the content process while it's "hung", and also of the chrome/main process, in case I missed something in the samples.  (You'd get these using gdb or lldb after attaching to the running process.)
Flags: needinfo?(smichaud)
I'm not familiar with gdb at all. I can reproduce this almost every once in an hour so if you can provide me STRs, I can follow them.

I'm also working from Mozilla's SF office, so if someone with sufficient knowledge who is also working from that office can just take my laptop when it happens and extract the necessary data, that would work for me.
> _sigtramp() -- an undocumented method that's used by the OS to
> deliver a signal to a process.

Actually it's used by the kernel.
> I'm also working from Mozilla's SF office

I'm sure you can find a willing volunteer :-)

I don't know who to suggest, though, since I don't work in the SF office (or any office).
It'd be interesting to know what signal _sigtramp() is being called with.  This interpose library should provide the information.

Here's how to use it:

1) Untar it and run "make".  This will only work if you've installed the XCode commandline tools to your computer.  If you can build Firefox there, you already have them.

2) From a Terminal prompt run the following:

DYLD_INSERT_LIBRARIES=/full/path/to/interpose.dylib path/to/Contents/MacOS/firefox

All calls to _sigtramp() should get logged, together with their stack traces.  You may also see other interesting error messages.
Attached file log1.txt
Sweet! I was able to follow your instructions and build this log.
Video screencast of the behavior: https://www.youtube.com/watch?v=bjjPwLLehe0
Over to smichaud now that we have the log / screencast.
Flags: needinfo?(smichaud)
This is also bug 1059775, I see the 99% cpu rating for plugin-container. I would love to see a stack for that!
Jim: I used to see the scenario from 1059775 - where after I closed Firefox one process remained and kept spinning CPU. I don't remember seeing it anymore for about a month now.
A while ago we had bug 984447, where we spawned zombie plugin-container processes, triggered by broken calls to fork() from the Java plugin.  The calls to fork() crapped out before the child processes (of plugin-container, created by fork()) could get renamed from plugin-container to something else.  The zombie processes ate 100% CPU.
(Following up comment #16)

And no, it wasn't possible to get stack traces of those hungry zombie processes.
Thanks, Gandalf, for your log!

Turns out we're receiving lots of SIGVTALRM (== 26) signals.  I just grepped the tree, and discovered (to my surprise) that we (at least potentially) do use this signal in a couple of different places:

https://hg.mozilla.org/mozilla-central/annotate/249dbdbe09e0/js/src/asmjs/AsmJSSignalHandlers.cpp#l957
https://hg.mozilla.org/mozilla-central/annotate/249dbdbe09e0/nsprpub/pr/src/pthreads/ptthread.c#l1236

So the call to _sigtramp() may be "normal", and have nothing to do with this bug.

Just to be sure, though, I'm going to add code to my interpose library to hook as many calls as possible that might be used to send a SIGVTALRM signal, and possibly also calls to fork() and friends.
Flags: needinfo?(smichaud)
Is anyone able to give reasonably good STR for this bug?  I'd *really* like to be able to reproduce it.
(In reply to comment #0)

> The behavior goes like this:
>  - I open a page or switch to an already open tab
>  - the tab becomes unresponsive
>  - all other tabs are unresponsive as well
>  - I can't reload the page, open new window etc.

I'd guess that, when you see this bug, you always have at least one tab open that contains one or more Flash plugins.  Do you see the bug disproportionately switching too or away from those tabs?

What pages do you normally have open that contain Flash plugins?  Could you give some examples?

Do you still see this bug with the current version of the Flash plugin (16.0.0.235)?  You should upgrade in any case, since older versions contain security bug(s) with exploits in the wild.
Gandalf:  I see logs very much like yours (of calls to _sigtramp() with the SIGVTALRM signal) when I run today's nightly with no plugins at all (with the interpose library that I attached above).  So, though your logs also contain a few things I *don't* see, I'm going to set aside the whole signal business for the time being.

What's most singular in your report is how the Flash plugin-container process (or *some* plugin oriented plugin-container process) keeps getting killed (or stopped) and restarted.  I'm going to write up an interpose library that hooks the sort of calls that might be involved in that -- exit(), abort() and friends.  For good measure I'll also hook fork() in friends.
> ADB helper

What is this?  And where can you get it?  (I can't find it.)
(In reply to Steven Michaud from comment #22)
> > ADB helper
> 
> What is this?  And where can you get it?  (I can't find it.)

https://developer.mozilla.org/en-US/Firefox_OS/Using_the_App_Manager#Adb_Helper_Add-on

Steven:
 - I do have latest flash and I have it disabled by default (ask to activate)
 - I do see it disproportionally often when switching between tabs
 - I suspect that I usually/always have at least one tab with flash when it happens because I usually just have many tabs open and at least one of them is facebook, gmail, engadget etc.
 - I don't have a very reliable STR but I noticed that I can usually reproduce it by going to phoronix.com and opening at least 7-8 articles in new tabs and maybe some comments and switching between the tabs, scrolling down those tabs, switching again. In about a minute I get the hang. Usually. :) Sometimes it takes longer.
Finally, Gandalf, one really stupid question:

Does rebooting your computer make this bug go away?

I ask because this often makes the strangest bugs disappear on the Mac, at least for a while.  If rebooting does make this bug go away, and if it doesn't come back quickly, we should probably just drop this bug -- it won't be worth the trouble to diagnose it.
What version of OS X are you using, Gandalf?  I don't see it listed above.
that may be hard to answer. I never reboot my laptop. I'll do this in an hour or so when I'm done with work and will report back.

One thing I noticed is that sometimes I believe that I putting laptop to sleep/wakeing up (with Firefox open) makes it easier to reproduce.

Mac OS 10.10.1
The other thing I know is that I can equally easily reproduce it on two different macbooks (air and pro), so I doubt it is a very unique problem.
Here's another interpose library for you to try, Gandalf.
Another question for you, Gandalf.

You say you've set Flash "ask to activate".  When I set this and visit phoronix.com (for instance), I don't get any Flash plugin process (unless I explicitly click on a Flash object).  You *do* have a Flash plugin process -- but how do you get it?  Do you click on Flash objects and allow them to run?  Do you choose "Allow now", or is it "Allow and Remember"?
(In reply to Steven Michaud from comment #29)
> Another question for you, Gandalf.
> 
> You say you've set Flash "ask to activate".  When I set this and visit
> phoronix.com (for instance), I don't get any Flash plugin process (unless I
> explicitly click on a Flash object).  You *do* have a Flash plugin process
> -- but how do you get it?  Do you click on Flash objects and allow them to
> run?  Do you choose "Allow now", or is it "Allow and Remember"?

No. I launch a browser with no tabs open (first tab with about:home or about:blank) and that's my process list:

zbraniecki@zbmba:~/Downloads/symbols/interpose$ ps -A|grep Firefox
93460 ??         0:01.44 /Applications/FirefoxNightly.app/Contents/MacOS/firefox -foreground -foreground
93461 ??         0:00.41 /Applications/FirefoxNightly.app/Contents/MacOS/plugin-container.app/Contents/MacOS/plugin-container -greomni /Applications/FirefoxNightly.app/Contents/Resources/omni.ja -appomni /Applications/FirefoxNightly.app/Contents/Resources/browser/omni.ja -appdir /Applications/FirefoxNightly.app/Contents/Resources/browser 93460 gecko-crash-server-pipe.93460 org.mozilla.machname.666582781 tab
93463 ttys004    0:00.00 grep --color=auto Firefox

So I guess it starts on its own. I even removed ADB Helper to make sure that it's not impacting that.
(In reply to Steven Michaud from comment #28)
> Created attachment 8537427 [details]
> Interpose library for logging calls to exit(), fork() and friends
> 
> Here's another interpose library for you to try, Gandalf.

This only produced that: http://pastebin.mozilla.org/8037669

My STR:
 - Open phoronix.com
 - Open many (like 10-15 tabs)
 - Open arstechnica, engadget, slashdot
 - Switch between tabs, open/close/scroll while they're loading
 - In about a minute I get to the hang
(In reply to comment #30)

Actually the only plugin-container process you list is the one for the content process.
> http://pastebin.mozilla.org/8037669

But here you're spawning *two* plugin-container processes -- pid 93504 and pid 93595.

Presumably one of them is for the Flash plugin.
And the second plugin container process is getting spawned about 4 minutes after the first.

Timestamp for posix_spawn() call to launch content process plugin-container: (Tue Dec 16 13:57:38 2014)
Timestamp for posix_spawn() call to launch "second" plugin-container process: (Tue Dec 16 14:01:36 2014)
So, my guess is that the browser hangs exactly when it happens. Yet I did not enable any plugin. I'll disable flash completely and try to reproduce it again.

Maybe there's some weird mix of "ask to activate" and "tabs with flash" and "e10s" that is causing this.
Does the Flash plugin-container process still seem to stop and restart?

If so, do you see more calls to posix_spawn()?
(In reply to Steven Michaud from comment #36)
> Does the Flash plugin-container process still seem to stop and restart?

I don't know how to check that.

> If so, do you see more calls to posix_spawn()?

Umm, where? :)

I did disable flash completely and was still able to reproduce this bug within 2-3 minutes of spamming open tabs from phoronix and closing some, and opening others.
Here are instructions to get an all-thread stack trace from the "ghost" Flash plugin container process after you start seeing the hang:

From the commandline in a Terminal window:

1) lldb
2) attach -p [pid-of-process]
3) bt all
So, it will be 93461 from comment 30, right?
No.  Let's move this to IRC :-)
I don't see any "gandalf" connected to Mozilla's IRC server (/whois gandalf doesn't return anything).

I'm "smichaud" and currently connected, and in the #macdev room.
I'm connected to irc.mozilla.org.
Attached file lldb log
This is an lldb log from when the hang was already in place. I attached to both processes and "bt all" for them.
Attachment #8537471 - Attachment mime type: text/x-log → text/plain
I think the Flash plugin-container process was probably a red herring.  As best I can tell the hangs happen without it.

Bug 1091705 has more information, and among other things a Gecko Profiler profile that Gandalf created under Markus Stange's guidance.  See bug 1091705 comment #19.
(Following up comment #43)

The main process main thread stack trace in this log is atypical.  Usually it contains no interesting information.  The following (from another of Gandalf's logs) is much more typical:

* thread #1: tid = 0x14f943, 0x00007fff8d23b52e libsystem_kernel.dylib`mach_msg_trap + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00007fff8d23b52e libsystem_kernel.dylib`mach_msg_trap + 10
    frame #1: 0x00007fff8d23a69f libsystem_kernel.dylib`mach_msg + 55
    frame #2: 0x00007fff858a9b14 CoreFoundation`__CFRunLoopServiceMachPort + 212
    frame #3: 0x00007fff858a8fdb CoreFoundation`__CFRunLoopRun + 1371
    frame #4: 0x00007fff858a8838 CoreFoundation`CFRunLoopRunSpecific + 296
    frame #5: 0x00007fff8f47f43f HIToolbox`RunCurrentEventLoopInMode + 235
    frame #6: 0x00007fff8f47f1ba HIToolbox`ReceiveNextEventCommon + 431
    frame #7: 0x00007fff8f47effb HIToolbox`_BlockUntilNextEventMatchingListInModeWithFilter + 71
    frame #8: 0x00007fff8e44c6d1 AppKit`_DPSNextEvent + 964
    frame #9: 0x00007fff8e44be80 AppKit`-[NSApplication nextEventMatchingMask:untilDate:inMode:dequeue:] + 194
    frame #10: 0x00000001046253b6 XUL`-[GeckoNSApplication nextEventMatchingMask:untilDate:inMode:dequeue:] + 86
    frame #11: 0x00007fff8e43fe23 AppKit`-[NSApplication run] + 594
    frame #12: 0x000000010462658d XUL`nsAppShell::Run() + 253
    frame #13: 0x0000000104e17459 XUL`nsAppStartup::Run() + 41
    frame #14: 0x0000000104e60069 XUL`XREMain::XRE_mainRun() + 3337
    frame #15: 0x0000000104e6031f XUL`XREMain::XRE_main(int, char**, nsXREAppData const*) + 495
    frame #16: 0x0000000104e60684 XUL`XRE_main + 228
    frame #17: 0x0000000100001c7d firefox`main + 1757
    frame #18: 0x0000000100001244 firefox`start + 52
From the evidence here and in the Gecko Profiler profile at bug 1091705 comment #19, it looks like this bug is very similar to bug 1091705, or a duplicate.  It also looks like the problem happens entirely in cross-platform code (and specifically that platform-specific event loop code isn't involved).  So I'm going to stop working on this bug, at least for the time being.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → DUPLICATE
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: