Closed Bug 805529 Opened 12 years ago Closed 12 years ago

main thread never exit nsBaseAppShell::OnProcessNextEvent when mayWait param is true

Categories

(Core :: Widget: Gtk, defect)

16 Branch
x86
Linux
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla19

People

(Reporter: olivier.pis.langlois, Assigned: olivier.pis.langlois)

Details

Attachments

(1 file, 2 obsolete files)

User Agent: Mozilla/5.0 (Windows NT 5.1; rv:16.0) Gecko/20100101 Firefox/16.0
Build ID: 20121010144125

Steps to reproduce:

1. I loaded a html document from a file which spawned a Thread pool
2. I loaded a flash file issuing a web service call every 2 seconds


Actual results:

After about 60 seconds, the flash animation just froze.

-1230996640[8175698]: nsStandardURL::GetFile [this=85f3af8 spec=file:///opt/400T/ramdisk/media/blank.html resulting_path=/opt/400T/ramdisk/media/blank.html]
-1230996640[8175698]: nsComponentManager: CreateInstanceByContractID(@mozilla.org/network/file-input-stream;1) succeeded
-1230996640[8175698]: nsComponentManager: CreateInstanceByContractID(@mozilla.org/thread-pool;1) succeeded
-1230996640[8175698]: nsObserverService::AddObserver(8391040: xpcom-shutdown-threads)
-1230996640[8175698]: nsComponentManager: CreateInstance({0885d4f8-f7b8-4cda-902e-94ba38bc256e}) succeeded
-1230996640[8175698]: THRD-P(86913f8) dispatch [8691760 0]
-1230996640[8175698]: THRD-P(86913f8) put [0 0 4]
-1230996640[8175698]: EVENTQ(869140c): notify
-1230996640[8175698]: THRD-P(86913f8) put [spawn=1]
-1355306096[868f1f8]: EVENTQ(868d5c8): wait begin
-1230996640[8175698]: EVENTQ(868d5c8): notify
-1355306096[868f1f8]: EVENTQ(868d5c8): wait end
-1355306096[868f1f8]: THRD(868d590) ProcessNextEvent [1 0]
-1355306096[868f1f8]: EVENTQ(868d5c8): wait begin
-1230996640[8175698]: THRD-P(86913f8) put [868d590 kill=0]
-1230996640[8175698]: THRD(868d590) Dispatch [86913fc 0]
-1230996640[8175698]: EVENTQ(868d5c8): notify
-1355306096[868f1f8]: EVENTQ(868d5c8): wait end
-1355306096[868f1f8]: THRD(868d590) running [86913fc]
-1355306096[868f1f8]: THRD-P(86913f8) enter
-1355306096[868f1f8]: THRD-P(86913f8) running [8691760]
-1230996640[8175698]: III AsyncWait [this=86915e4]
-1230996640[8175698]: nsInputStreamPump::Suspend [this=8691360]

-1355306096[868f1f8]: THRD-P(86913f8) shutdown async [868d590]
-1355306096[868f1f8]: THRD(817df40) Dispatch [86b6680 0]
-1355306096[868f1f8]: EVENTQ(817df78): notify
-1355306096[868f1f8]: THRD-P(86913f8) leave
-1355306096[868f1f8]: THRD(868d590) ProcessNextEvent [1 0]
-1355306096[868f1f8]: EVENTQ(868d5c8): wait begin

-1230996640[8175698]: THRD(817df40) ProcessNextEvent [0 0]
-1230996640[8175698]: THRD(817df40) running [86b6680]
-1230996640[8175698]: THRD(868d590) shutdown
-1230996640[8175698]: EVENTQ(868d5c8): notify
-1230996640[8175698]: THRD(817df40) ProcessNextEvent [1 1]

-1355306096[868f1f8]: EVENTQ(868d5c8): wait end
-1355306096[868f1f8]: THRD(868d590) running [83edff0]
-1355306096[868f1f8]: THRD(817df40) Dispatch [83edf08 0]
-1355306096[868f1f8]: EVENTQ(817df78): notify

-1259202544[8052998]: NPError mozilla::plugins::child::_posturlnotify(NPP_t*, const char*, const char*, uint32_t, const char*, NPBool, void*)  [time:1387967764026950][7180][PPluginInstanceChild] Sending Msg_PStreamNotifyConstructor([TODO])
-1267217520[80a7398]: /root/mozilla-release/ipc/chromium/src/chrome/common/ipc_channel_posix.cc:689: sending message @80ce6b8 on channel @8061f98 with type 2293822 (0 in queue)
-1267217520[80a7398]: /root/mozilla-release/ipc/chromium/src/chrome/common/ipc_channel_posix.cc:677: sent message @80ce6b8 on channel @8061f98 with type 2293822
-1269396592[8568f78]: /root/mozilla-release/ipc/chromium/src/chrome/common/ipc_channel_posix.cc:557: received message on channel @83e4238 with type 2293822
-1269396592[8568f78]: THRD(817df40) Dispatch [817bf20 0]
-1269396592[8568f78]: EVENTQ(817df78): notify

nothing else happens from there...

Here is the call stack of the main thread:

#0  0xb7fa8410 in __kernel_vsyscall ()
#1  0xb7218fbc in poll () from /lib/libc.so.6
#2  0xb7406f6d in g_poll () from /usr/lib/libglib-2.0.so.0
#3  0xb7ab0917 in wxapp_poll_func () from /opt/400T/ramdisk/lib/libwx_gtk2_core-2.8.so.0
#4  0xb587cf0a in ?? () from /usr/lib/firefox-16.0.1/libxul.so 0x011E3F0A (PollWrapper(_GPollFD*, unsigned int, int))
#5  0xb73fdc68 in ?? () from /usr/lib/libglib-2.0.so.0
#6  0xb73fdebe in g_main_context_iteration () from /usr/lib/libglib-2.0.so.0
#7  0xb587cc97 in ?? () from /usr/lib/firefox-16.0.1/libxul.so 0x011E3C97 (nsAppShell::ProcessNextNativeEvent)
#8  0xb5895360 in ?? () from /usr/lib/firefox-16.0.1/libxul.so 0x011FC360 (nsBaseAppShell::DoProcessNextNativeEvent)
#9  0xb589547a in ?? () from /usr/lib/firefox-16.0.1/libxul.so 0x011FC47A (nsBaseAppShell::OnProcessNextEvent)
#10 0xb5b4bdf0 in ?? () from /usr/lib/firefox-16.0.1/libxul.so 0x014B2DF0 (nsThread::ProcessNextEvent)
#11 0xb5afdae8 in ?? () from /usr/lib/firefox-16.0.1/libxul.so (NS_ProcessNextEvent_P)
#12 0xb5b4b752 in ?? () from /usr/lib/firefox-16.0.1/libxul.so 0x014B2752 (nsThread::Shutdown)
#13 0xb5b4d648 in ?? () from /usr/lib/firefox-16.0.1/libxul.so 0x014B4648 (nsRunnableMethodImpl<unsigned int (nsIThread::*)(), true>::Run)
#14 0xb5b4bf72 in ?? () from /usr/lib/firefox-16.0.1/libxul.so 0x014B2F72 (nsThread::ProcessNextEvent)
#15 0xb5afdbe0 in ?? () from /usr/lib/firefox-16.0.1/libxul.so 0x01464BE0 (NS_ProcessPendingEvents_P)
#16 0xb5894c13 in ?? () from /usr/lib/firefox-16.0.1/libxul.so 0x011FBC13 (nsBaseAppShell::NativeEventCallback)
#17 0xb587cfff in ?? () from /usr/lib/firefox-16.0.1/libxul.so 0x011E3FFF (nsAppShell::EventProcessorCallback)
#18 0xb742ddd4 in ?? () from /usr/lib/libglib-2.0.so.0
#19 0xb73fba29 in g_main_context_dispatch () from /usr/lib/libglib-2.0.so.0
#20 0xb73fdcc1 in ?? () from /usr/lib/libglib-2.0.so.0
#21 0xb73fde4d in g_main_loop_run () from /usr/lib/libglib-2.0.so.0
#22 0xb6ebb7bd in gtk_main () from /usr/lib/libgtk-x11-2.0.so.0
#23 0xb7acab05 in wxEventLoop::Run () from /opt/400T/ramdisk/lib/libwx_gtk2_core-2.8.so.0
#24 0xb7b4ff6e in wxAppBase::MainLoop () from /opt/400T/ramdisk/lib/libwx_gtk2_core-2.8.so.0
#25 0xb7b4fb41 in wxAppBase::OnRun () from /opt/400T/ramdisk/lib/libwx_gtk2_core-2.8.so.0
#26 0x0808e161 in LayoutEngineClass::OnRun ()
#27 0xb7dc032a in wxEntry () from /opt/400T/ramdisk/lib/libwx_base-2.8.so.0
#28 0x0808e973 in main ()

when the main thread may wait on future events such as when it is notifying another thread to shutdown, the code in nsBaseAppShell::OnProcessNextEvent enters the glib loop assuming that native events will occur frequently allowing it exit eventually when the thread event q receive a new event. In some cases, (mine) no native events ever occur (it could be because no keyboard or mouse is connected).

  while (!NS_HasPendingEvents(thr) && !mProcessedGeckoEvents) {
    // If we have been asked to exit from Run, then we should not wait for
    // events to process.  Note that an inner nested event loop causes
    // 'mayWait' to become false too, through 'mBlockedWait'.
    if (mExiting)
      mayWait = false;

    mLastNativeEventTime = PR_IntervalNow();
    if (!DoProcessNextNativeEvent(mayWait, recursionDepth) || !mayWait)
      break;
  }

An easy way to fix it would be to do that:

DoProcessNextNativeEvent(false, recursionDepth);
if(!mayWait)
  break;

The best way IMO, would be to specify a small to value (ie 20ms) but it doesn't seems to be trivial to do with glib. (you have to add some kind of g_source to the default g_context) and this is beyond my field of expertise....



Expected results:

The main thread should have been responsive.
OS: Windows XP → Linux
CCing some people who might be able to evaluate this.
Maybe not fully complying to your coding standard, Maybe not the "right way"tm to deal with glib

but it is simple and it works and it solve my problem!

feel free to adapt to your tastes
Attachment #675302 - Flags: feedback?(karlt)
Comment on attachment 675302 [details] [diff] [review]
Protect against no events situation

just nitpicking. I would change this:

+  CapPollTo(20);
+
   while (!NS_HasPendingEvents(thr) && !mProcessedGeckoEvents) {
     // If we have been asked to exit from Run, then we should not wait for
     // events to process.  Note that an inner nested event loop causes
@@ -295,10 +297,13 @@
       mayWait = false;
 
     mLastNativeEventTime = PR_IntervalNow();
+
     if (!DoProcessNextNativeEvent(mayWait, recursionDepth) || !mayWait)
       break;
   }
 
+  CapPollTo(-1);

to:

+  CapPollTo(20);
+
   while (!NS_HasPendingEvents(thr) && !mProcessedGeckoEvents) {
     // If we have been asked to exit from Run, then we should not wait for
     // events to process.  Note that an inner nested event loop causes
@@ -295,10 +297,13 @@
       mayWait = false;
 
     mLastNativeEventTime = PR_IntervalNow();

     DoProcessNextNativeEvent(mayWait, recursionDepth)

     if (!mayWait)
       break;
   }
 
+  CapPollTo(-1);

in order to minimize changes in code behavior as now DoProcessNextNativeEvent() may timeout and return false with the poll to capping.
Thanks for the analysis here.

The design is not meant to require that external native events occur because a native event should be generated here, if the poll should be interrupted:
http://hg.mozilla.org/mozilla-central/annotate/3b506ada9ef9/widget/gtk2/nsAppShell.cpp#l122

That should be called from OnDispatchedEvent.
http://hg.mozilla.org/mozilla-central/annotate/3b506ada9ef9/widget/xpwidgets/nsBaseAppShell.cpp#l225
Can you add logging or otherwise determine whether that is happening, and if not why not?
Comment on attachment 675302 [details] [diff] [review]
Protect against no events situation

It should be possible to fix this without waking frequently.
Attachment #675302 - Flags: feedback?(karlt) → feedback-
(In reply to Karl Tomlinson (:karlt) from comment #4)
> Thanks for the analysis here.
> 
> The design is not meant to require that external native events occur because
> a native event should be generated here, if the poll should be interrupted:
> http://hg.mozilla.org/mozilla-central/annotate/3b506ada9ef9/widget/gtk2/
> nsAppShell.cpp#l122
> 
> That should be called from OnDispatchedEvent.
> http://hg.mozilla.org/mozilla-central/annotate/3b506ada9ef9/widget/xpwidgets/
> nsBaseAppShell.cpp#l225
> Can you add logging or otherwise determine whether that is happening, and if
> not why not?

Karl,

Ok, I'll look into that and I'm guessing that OnDispatchedEvent should be called from nsThread::Dispatch() or somewhere close to that, right?

Should be able to come back to you on that by next Monday.

Greetings,
Component: XUL → General
Is this some kind of Wx embedding issue? The stack of the main run loop certainly doesn't look like a normal XUL app runloop.
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #7)
> Is this some kind of Wx embedding issue? The stack of the main run loop
> certainly doesn't look like a normal XUL app runloop.

I do not think it is related to Wx. I have run the software on 2 different system.
 
1. The real one which has no mouse, no keyboard. This is where I experience the problem
2. A VM running on my PC that emulates the real thing +- some details (including the presence of mouse/keyboard). On this setup, everything runs smoothly even before the proposed patch.
This is obviously not a normal XUL app, and so it's likely that this is at least partly caused by the embedding layer itself. Is this built with some sort of Wx embedding layer? Do you experience the same problem running Firefox or another XUL application on your mouse/keyboardless system?

OnDispatchedEvent should be waking up the main-thread event loop as called from here: http://hg.mozilla.org/mozilla-central/annotate/3a7226d6429b/xpcom/threads/nsThread.cpp#l370

In normal eventloop operation that should end up here:
http://hg.mozilla.org/mozilla-central/annotate/3a7226d6429b/widget/xpwidgets/nsBaseAppShell.cpp#l235

And then here (for GTK2):
http://hg.mozilla.org/mozilla-central/annotate/3a7226d6429b/widget/gtk2/nsAppShell.cpp#l122

And that pipe write should wake up the GTK main event loop because of the g_io_add_watch_full call here: http://hg.mozilla.org/mozilla-central/annotate/3a7226d6429b/widget/gtk2/nsAppShell.cpp#l109

You need to figure out in your embedding case where that series of calls/assumptions is being broken.
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #9)
> This is obviously not a normal XUL app, and so it's likely that this is at
> least partly caused by the embedding layer itself. Is this built with some
> sort of Wx embedding layer? Do you experience the same problem running
> Firefox or another XUL application on your mouse/keyboardless system?
> 
> OnDispatchedEvent should be waking up the main-thread event loop as called
> from here:
> http://hg.mozilla.org/mozilla-central/annotate/3a7226d6429b/xpcom/threads/
> nsThread.cpp#l370
> 
> In normal eventloop operation that should end up here:
> http://hg.mozilla.org/mozilla-central/annotate/3a7226d6429b/widget/xpwidgets/
> nsBaseAppShell.cpp#l235
> 
> And then here (for GTK2):
> http://hg.mozilla.org/mozilla-central/annotate/3a7226d6429b/widget/gtk2/
> nsAppShell.cpp#l122
> 
> And that pipe write should wake up the GTK main event loop because of the
> g_io_add_watch_full call here:
> http://hg.mozilla.org/mozilla-central/annotate/3a7226d6429b/widget/gtk2/
> nsAppShell.cpp#l109
> 
> You need to figure out in your embedding case where that series of
> calls/assumptions is being broken.

I totally agree with your analysis. The thread should not block because of the pipe write.

As I said to Karl, based on this new information, I'll add more traces into the lib to understand what happens in my setup.

In meantime, I have observed that the write call return value in nsAppShell::ScheduleNativeEventCallback() isn't validated. In case the write() fails, the main thread will block forever.
Here are new added traces.

Note that if PollWrapper is called recursively inside EventProcessorCallback(), fd 17 (pipe read side fd) is not included in the poll fd set.

g_source_set_can_recurse () must called on the create g_source.

PollWrapper()
poll for fd 3 for events IN|
poll for fd 8 for events IN|
poll for fd 17 for events IN|
ScheduleNativeEventCallback() write res: 1 on fd 18. fd 17 will be read ready on next poll, errno: 0 - Success
EventProcessorCallback(): IN|
PollWrapper()
poll for fd 3 for events IN|
poll for fd 8 for events IN|
PollWrapper()
poll for fd 3 for events IN|
poll for fd 8 for events IN|
Before DoProcessNextNativeEvent() mayWait: 1
PollWrapper()
poll for fd 3 for events IN|
poll for fd 8 for events IN|
ScheduleNativeEventCallback() write res: 1 on fd 18. fd 17 will be read ready on next poll, errno: 0 - Success
posturlnotify: http://10.0.16.198:19675/DictionaryWebService/
expect a new proposed patch soon for this bug.
Status: UNCONFIRMED → NEW
Component: General → Widget: Gtk
Ever confirmed: true
Comment on attachment 676384 [details] [diff] [review]
Make the nsAppShell interrupt pipe io watch source recursive

Thanks very much for diagnosing and reporting this problem!

The licensing of GLib is a little different to Gecko, so I don't think we can just copy and modify a chunk of code from GLib.

Would you be willing to write the necessary code inline in nsAppShell::Init?
This code is used only once and so need not be a separate function and can be simplified a little for the requirements here.  |g_source_set_priority (source, priority)| won't be required.

Gecko is C++ and so style is usually to declare at initialization, despite the existing |ioc| declaration.  Most Gecko files have no space between function name and '('.  i.e.

GSource *source = g_io_create_watch(ioc, G_IO_IN);
sure. I'll rework the proposed patch tomorrow!
Note that it was not possible to initialize local variables at the same time that they are declared because of the gotos.
Attachment #675302 - Attachment is obsolete: true
Attachment #676384 - Attachment is obsolete: true
Comment on attachment 676674 [details] [diff] [review]
Make the nsAppShell interrupt pipe io watch source recursive Gecko style

Thanks very much.

nsnull has been replaced with nullptr, so this will need s/nsnull/nullptr/ before it lands.
Attachment #676674 - Flags: review+
https://tbpl.mozilla.org/?tree=Try&rev=4d8a3f84dd20

I wonder why this bug didn't cause hangs on our test machines, because there are no user input events there.
If I may attempt to partly attempt to answer this question, I must add that when Gecko was in that state, it wasn't frozen or deadlocked. As soon as I was sending new URL to load, the browser was responding. (by sending an event to my Wx window object which was probably waking up the main thread).

The following conditions must be met to notice the problem:

1. You must make the mayWait param set to true for the main thread. My personal experience is that it not something that happens that often.
2. You need a separate thread (only plugin?) that display dynamic content and that will make blocking request to its parent.
https://hg.mozilla.org/mozilla-central/rev/0f45d415a221
Assignee: nobody → olivier.pis.langlois
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla19
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: