Last Comment Bug 805529 - main thread never exit nsBaseAppShell::OnProcessNextEvent when mayWait param is true
: main thread never exit nsBaseAppShell::OnProcessNextEvent when mayWait param ...
Status: RESOLVED FIXED
:
Product: Core
Classification: Components
Component: Widget: Gtk (show other bugs)
: 16 Branch
: x86 Linux
: -- normal (vote)
: mozilla19
Assigned To: Olivier Langlois
:
Mentors:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2012-10-25 11:14 PDT by Olivier Langlois
Modified: 2012-11-06 18:19 PST (History)
6 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
Protect against no events situation (2.09 KB, patch)
2012-10-25 14:35 PDT, Olivier Langlois
karlt: feedback-
Details | Diff | Splinter Review
Make the nsAppShell interrupt pipe io watch source recursive (1.21 KB, patch)
2012-10-29 16:11 PDT, Olivier Langlois
no flags Details | Diff | Splinter Review
Make the nsAppShell interrupt pipe io watch source recursive Gecko style (1.05 KB, patch)
2012-10-30 10:51 PDT, Olivier Langlois
karlt: review+
Details | Diff | Splinter Review

Description Olivier Langlois 2012-10-25 11:14:08 PDT
User Agent: Mozilla/5.0 (Windows NT 5.1; rv:16.0) Gecko/20100101 Firefox/16.0
Build ID: 20121010144125

Steps to reproduce:

1. I loaded a html document from a file which spawned a Thread pool
2. I loaded a flash file issuing a web service call every 2 seconds


Actual results:

After about 60 seconds, the flash animation just froze.

-1230996640[8175698]: nsStandardURL::GetFile [this=85f3af8 spec=file:///opt/400T/ramdisk/media/blank.html resulting_path=/opt/400T/ramdisk/media/blank.html]
-1230996640[8175698]: nsComponentManager: CreateInstanceByContractID(@mozilla.org/network/file-input-stream;1) succeeded
-1230996640[8175698]: nsComponentManager: CreateInstanceByContractID(@mozilla.org/thread-pool;1) succeeded
-1230996640[8175698]: nsObserverService::AddObserver(8391040: xpcom-shutdown-threads)
-1230996640[8175698]: nsComponentManager: CreateInstance({0885d4f8-f7b8-4cda-902e-94ba38bc256e}) succeeded
-1230996640[8175698]: THRD-P(86913f8) dispatch [8691760 0]
-1230996640[8175698]: THRD-P(86913f8) put [0 0 4]
-1230996640[8175698]: EVENTQ(869140c): notify
-1230996640[8175698]: THRD-P(86913f8) put [spawn=1]
-1355306096[868f1f8]: EVENTQ(868d5c8): wait begin
-1230996640[8175698]: EVENTQ(868d5c8): notify
-1355306096[868f1f8]: EVENTQ(868d5c8): wait end
-1355306096[868f1f8]: THRD(868d590) ProcessNextEvent [1 0]
-1355306096[868f1f8]: EVENTQ(868d5c8): wait begin
-1230996640[8175698]: THRD-P(86913f8) put [868d590 kill=0]
-1230996640[8175698]: THRD(868d590) Dispatch [86913fc 0]
-1230996640[8175698]: EVENTQ(868d5c8): notify
-1355306096[868f1f8]: EVENTQ(868d5c8): wait end
-1355306096[868f1f8]: THRD(868d590) running [86913fc]
-1355306096[868f1f8]: THRD-P(86913f8) enter
-1355306096[868f1f8]: THRD-P(86913f8) running [8691760]
-1230996640[8175698]: III AsyncWait [this=86915e4]
-1230996640[8175698]: nsInputStreamPump::Suspend [this=8691360]

-1355306096[868f1f8]: THRD-P(86913f8) shutdown async [868d590]
-1355306096[868f1f8]: THRD(817df40) Dispatch [86b6680 0]
-1355306096[868f1f8]: EVENTQ(817df78): notify
-1355306096[868f1f8]: THRD-P(86913f8) leave
-1355306096[868f1f8]: THRD(868d590) ProcessNextEvent [1 0]
-1355306096[868f1f8]: EVENTQ(868d5c8): wait begin

-1230996640[8175698]: THRD(817df40) ProcessNextEvent [0 0]
-1230996640[8175698]: THRD(817df40) running [86b6680]
-1230996640[8175698]: THRD(868d590) shutdown
-1230996640[8175698]: EVENTQ(868d5c8): notify
-1230996640[8175698]: THRD(817df40) ProcessNextEvent [1 1]

-1355306096[868f1f8]: EVENTQ(868d5c8): wait end
-1355306096[868f1f8]: THRD(868d590) running [83edff0]
-1355306096[868f1f8]: THRD(817df40) Dispatch [83edf08 0]
-1355306096[868f1f8]: EVENTQ(817df78): notify

-1259202544[8052998]: NPError mozilla::plugins::child::_posturlnotify(NPP_t*, const char*, const char*, uint32_t, const char*, NPBool, void*)  [time:1387967764026950][7180][PPluginInstanceChild] Sending Msg_PStreamNotifyConstructor([TODO])
-1267217520[80a7398]: /root/mozilla-release/ipc/chromium/src/chrome/common/ipc_channel_posix.cc:689: sending message @80ce6b8 on channel @8061f98 with type 2293822 (0 in queue)
-1267217520[80a7398]: /root/mozilla-release/ipc/chromium/src/chrome/common/ipc_channel_posix.cc:677: sent message @80ce6b8 on channel @8061f98 with type 2293822
-1269396592[8568f78]: /root/mozilla-release/ipc/chromium/src/chrome/common/ipc_channel_posix.cc:557: received message on channel @83e4238 with type 2293822
-1269396592[8568f78]: THRD(817df40) Dispatch [817bf20 0]
-1269396592[8568f78]: EVENTQ(817df78): notify

nothing else happens from there...

Here is the call stack of the main thread:

#0  0xb7fa8410 in __kernel_vsyscall ()
#1  0xb7218fbc in poll () from /lib/libc.so.6
#2  0xb7406f6d in g_poll () from /usr/lib/libglib-2.0.so.0
#3  0xb7ab0917 in wxapp_poll_func () from /opt/400T/ramdisk/lib/libwx_gtk2_core-2.8.so.0
#4  0xb587cf0a in ?? () from /usr/lib/firefox-16.0.1/libxul.so 0x011E3F0A (PollWrapper(_GPollFD*, unsigned int, int))
#5  0xb73fdc68 in ?? () from /usr/lib/libglib-2.0.so.0
#6  0xb73fdebe in g_main_context_iteration () from /usr/lib/libglib-2.0.so.0
#7  0xb587cc97 in ?? () from /usr/lib/firefox-16.0.1/libxul.so 0x011E3C97 (nsAppShell::ProcessNextNativeEvent)
#8  0xb5895360 in ?? () from /usr/lib/firefox-16.0.1/libxul.so 0x011FC360 (nsBaseAppShell::DoProcessNextNativeEvent)
#9  0xb589547a in ?? () from /usr/lib/firefox-16.0.1/libxul.so 0x011FC47A (nsBaseAppShell::OnProcessNextEvent)
#10 0xb5b4bdf0 in ?? () from /usr/lib/firefox-16.0.1/libxul.so 0x014B2DF0 (nsThread::ProcessNextEvent)
#11 0xb5afdae8 in ?? () from /usr/lib/firefox-16.0.1/libxul.so (NS_ProcessNextEvent_P)
#12 0xb5b4b752 in ?? () from /usr/lib/firefox-16.0.1/libxul.so 0x014B2752 (nsThread::Shutdown)
#13 0xb5b4d648 in ?? () from /usr/lib/firefox-16.0.1/libxul.so 0x014B4648 (nsRunnableMethodImpl<unsigned int (nsIThread::*)(), true>::Run)
#14 0xb5b4bf72 in ?? () from /usr/lib/firefox-16.0.1/libxul.so 0x014B2F72 (nsThread::ProcessNextEvent)
#15 0xb5afdbe0 in ?? () from /usr/lib/firefox-16.0.1/libxul.so 0x01464BE0 (NS_ProcessPendingEvents_P)
#16 0xb5894c13 in ?? () from /usr/lib/firefox-16.0.1/libxul.so 0x011FBC13 (nsBaseAppShell::NativeEventCallback)
#17 0xb587cfff in ?? () from /usr/lib/firefox-16.0.1/libxul.so 0x011E3FFF (nsAppShell::EventProcessorCallback)
#18 0xb742ddd4 in ?? () from /usr/lib/libglib-2.0.so.0
#19 0xb73fba29 in g_main_context_dispatch () from /usr/lib/libglib-2.0.so.0
#20 0xb73fdcc1 in ?? () from /usr/lib/libglib-2.0.so.0
#21 0xb73fde4d in g_main_loop_run () from /usr/lib/libglib-2.0.so.0
#22 0xb6ebb7bd in gtk_main () from /usr/lib/libgtk-x11-2.0.so.0
#23 0xb7acab05 in wxEventLoop::Run () from /opt/400T/ramdisk/lib/libwx_gtk2_core-2.8.so.0
#24 0xb7b4ff6e in wxAppBase::MainLoop () from /opt/400T/ramdisk/lib/libwx_gtk2_core-2.8.so.0
#25 0xb7b4fb41 in wxAppBase::OnRun () from /opt/400T/ramdisk/lib/libwx_gtk2_core-2.8.so.0
#26 0x0808e161 in LayoutEngineClass::OnRun ()
#27 0xb7dc032a in wxEntry () from /opt/400T/ramdisk/lib/libwx_base-2.8.so.0
#28 0x0808e973 in main ()

when the main thread may wait on future events such as when it is notifying another thread to shutdown, the code in nsBaseAppShell::OnProcessNextEvent enters the glib loop assuming that native events will occur frequently allowing it exit eventually when the thread event q receive a new event. In some cases, (mine) no native events ever occur (it could be because no keyboard or mouse is connected).

  while (!NS_HasPendingEvents(thr) && !mProcessedGeckoEvents) {
    // If we have been asked to exit from Run, then we should not wait for
    // events to process.  Note that an inner nested event loop causes
    // 'mayWait' to become false too, through 'mBlockedWait'.
    if (mExiting)
      mayWait = false;

    mLastNativeEventTime = PR_IntervalNow();
    if (!DoProcessNextNativeEvent(mayWait, recursionDepth) || !mayWait)
      break;
  }

An easy way to fix it would be to do that:

DoProcessNextNativeEvent(false, recursionDepth);
if(!mayWait)
  break;

The best way IMO, would be to specify a small to value (ie 20ms) but it doesn't seems to be trivial to do with glib. (you have to add some kind of g_source to the default g_context) and this is beyond my field of expertise....



Expected results:

The main thread should have been responsive.
Comment 1 Josh Matthews [:jdm] 2012-10-25 11:22:54 PDT
CCing some people who might be able to evaluate this.
Comment 2 Olivier Langlois 2012-10-25 14:35:47 PDT
Created attachment 675302 [details] [diff] [review]
Protect against no events situation

Maybe not fully complying to your coding standard, Maybe not the "right way"tm to deal with glib

but it is simple and it works and it solve my problem!

feel free to adapt to your tastes
Comment 3 Olivier Langlois 2012-10-25 16:43:58 PDT
Comment on attachment 675302 [details] [diff] [review]
Protect against no events situation

just nitpicking. I would change this:

+  CapPollTo(20);
+
   while (!NS_HasPendingEvents(thr) && !mProcessedGeckoEvents) {
     // If we have been asked to exit from Run, then we should not wait for
     // events to process.  Note that an inner nested event loop causes
@@ -295,10 +297,13 @@
       mayWait = false;
 
     mLastNativeEventTime = PR_IntervalNow();
+
     if (!DoProcessNextNativeEvent(mayWait, recursionDepth) || !mayWait)
       break;
   }
 
+  CapPollTo(-1);

to:

+  CapPollTo(20);
+
   while (!NS_HasPendingEvents(thr) && !mProcessedGeckoEvents) {
     // If we have been asked to exit from Run, then we should not wait for
     // events to process.  Note that an inner nested event loop causes
@@ -295,10 +297,13 @@
       mayWait = false;
 
     mLastNativeEventTime = PR_IntervalNow();

     DoProcessNextNativeEvent(mayWait, recursionDepth)

     if (!mayWait)
       break;
   }
 
+  CapPollTo(-1);

in order to minimize changes in code behavior as now DoProcessNextNativeEvent() may timeout and return false with the poll to capping.
Comment 4 Karl Tomlinson (:karlt) 2012-10-25 18:04:50 PDT
Thanks for the analysis here.

The design is not meant to require that external native events occur because a native event should be generated here, if the poll should be interrupted:
http://hg.mozilla.org/mozilla-central/annotate/3b506ada9ef9/widget/gtk2/nsAppShell.cpp#l122

That should be called from OnDispatchedEvent.
http://hg.mozilla.org/mozilla-central/annotate/3b506ada9ef9/widget/xpwidgets/nsBaseAppShell.cpp#l225
Can you add logging or otherwise determine whether that is happening, and if not why not?
Comment 5 Karl Tomlinson (:karlt) 2012-10-25 18:05:51 PDT
Comment on attachment 675302 [details] [diff] [review]
Protect against no events situation

It should be possible to fix this without waking frequently.
Comment 6 Olivier Langlois 2012-10-25 20:37:17 PDT
(In reply to Karl Tomlinson (:karlt) from comment #4)
> Thanks for the analysis here.
> 
> The design is not meant to require that external native events occur because
> a native event should be generated here, if the poll should be interrupted:
> http://hg.mozilla.org/mozilla-central/annotate/3b506ada9ef9/widget/gtk2/
> nsAppShell.cpp#l122
> 
> That should be called from OnDispatchedEvent.
> http://hg.mozilla.org/mozilla-central/annotate/3b506ada9ef9/widget/xpwidgets/
> nsBaseAppShell.cpp#l225
> Can you add logging or otherwise determine whether that is happening, and if
> not why not?

Karl,

Ok, I'll look into that and I'm guessing that OnDispatchedEvent should be called from nsThread::Dispatch() or somewhere close to that, right?

Should be able to come back to you on that by next Monday.

Greetings,
Comment 7 Benjamin Smedberg AWAY UNTIL 2-AUG-2016 [:bsmedberg] 2012-10-26 12:26:01 PDT
Is this some kind of Wx embedding issue? The stack of the main run loop certainly doesn't look like a normal XUL app runloop.
Comment 8 Olivier Langlois 2012-10-29 08:01:38 PDT
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #7)
> Is this some kind of Wx embedding issue? The stack of the main run loop
> certainly doesn't look like a normal XUL app runloop.

I do not think it is related to Wx. I have run the software on 2 different system.
 
1. The real one which has no mouse, no keyboard. This is where I experience the problem
2. A VM running on my PC that emulates the real thing +- some details (including the presence of mouse/keyboard). On this setup, everything runs smoothly even before the proposed patch.
Comment 9 Benjamin Smedberg AWAY UNTIL 2-AUG-2016 [:bsmedberg] 2012-10-29 08:15:39 PDT
This is obviously not a normal XUL app, and so it's likely that this is at least partly caused by the embedding layer itself. Is this built with some sort of Wx embedding layer? Do you experience the same problem running Firefox or another XUL application on your mouse/keyboardless system?

OnDispatchedEvent should be waking up the main-thread event loop as called from here: http://hg.mozilla.org/mozilla-central/annotate/3a7226d6429b/xpcom/threads/nsThread.cpp#l370

In normal eventloop operation that should end up here:
http://hg.mozilla.org/mozilla-central/annotate/3a7226d6429b/widget/xpwidgets/nsBaseAppShell.cpp#l235

And then here (for GTK2):
http://hg.mozilla.org/mozilla-central/annotate/3a7226d6429b/widget/gtk2/nsAppShell.cpp#l122

And that pipe write should wake up the GTK main event loop because of the g_io_add_watch_full call here: http://hg.mozilla.org/mozilla-central/annotate/3a7226d6429b/widget/gtk2/nsAppShell.cpp#l109

You need to figure out in your embedding case where that series of calls/assumptions is being broken.
Comment 10 Olivier Langlois 2012-10-29 11:52:52 PDT
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #9)
> This is obviously not a normal XUL app, and so it's likely that this is at
> least partly caused by the embedding layer itself. Is this built with some
> sort of Wx embedding layer? Do you experience the same problem running
> Firefox or another XUL application on your mouse/keyboardless system?
> 
> OnDispatchedEvent should be waking up the main-thread event loop as called
> from here:
> http://hg.mozilla.org/mozilla-central/annotate/3a7226d6429b/xpcom/threads/
> nsThread.cpp#l370
> 
> In normal eventloop operation that should end up here:
> http://hg.mozilla.org/mozilla-central/annotate/3a7226d6429b/widget/xpwidgets/
> nsBaseAppShell.cpp#l235
> 
> And then here (for GTK2):
> http://hg.mozilla.org/mozilla-central/annotate/3a7226d6429b/widget/gtk2/
> nsAppShell.cpp#l122
> 
> And that pipe write should wake up the GTK main event loop because of the
> g_io_add_watch_full call here:
> http://hg.mozilla.org/mozilla-central/annotate/3a7226d6429b/widget/gtk2/
> nsAppShell.cpp#l109
> 
> You need to figure out in your embedding case where that series of
> calls/assumptions is being broken.

I totally agree with your analysis. The thread should not block because of the pipe write.

As I said to Karl, based on this new information, I'll add more traces into the lib to understand what happens in my setup.

In meantime, I have observed that the write call return value in nsAppShell::ScheduleNativeEventCallback() isn't validated. In case the write() fails, the main thread will block forever.
Comment 11 Olivier Langlois 2012-10-29 15:44:58 PDT
Here are new added traces.

Note that if PollWrapper is called recursively inside EventProcessorCallback(), fd 17 (pipe read side fd) is not included in the poll fd set.

g_source_set_can_recurse () must called on the create g_source.

PollWrapper()
poll for fd 3 for events IN|
poll for fd 8 for events IN|
poll for fd 17 for events IN|
ScheduleNativeEventCallback() write res: 1 on fd 18. fd 17 will be read ready on next poll, errno: 0 - Success
EventProcessorCallback(): IN|
PollWrapper()
poll for fd 3 for events IN|
poll for fd 8 for events IN|
PollWrapper()
poll for fd 3 for events IN|
poll for fd 8 for events IN|
Before DoProcessNextNativeEvent() mayWait: 1
PollWrapper()
poll for fd 3 for events IN|
poll for fd 8 for events IN|
ScheduleNativeEventCallback() write res: 1 on fd 18. fd 17 will be read ready on next poll, errno: 0 - Success
posturlnotify: http://10.0.16.198:19675/DictionaryWebService/
Comment 12 Olivier Langlois 2012-10-29 15:45:41 PDT
expect a new proposed patch soon for this bug.
Comment 13 Olivier Langlois 2012-10-29 16:11:19 PDT
Created attachment 676384 [details] [diff] [review]
Make the nsAppShell interrupt pipe io watch source recursive
Comment 14 Karl Tomlinson (:karlt) 2012-10-29 16:45:50 PDT
Comment on attachment 676384 [details] [diff] [review]
Make the nsAppShell interrupt pipe io watch source recursive

Thanks very much for diagnosing and reporting this problem!

The licensing of GLib is a little different to Gecko, so I don't think we can just copy and modify a chunk of code from GLib.

Would you be willing to write the necessary code inline in nsAppShell::Init?
This code is used only once and so need not be a separate function and can be simplified a little for the requirements here.  |g_source_set_priority (source, priority)| won't be required.

Gecko is C++ and so style is usually to declare at initialization, despite the existing |ioc| declaration.  Most Gecko files have no space between function name and '('.  i.e.

GSource *source = g_io_create_watch(ioc, G_IO_IN);
Comment 15 Olivier Langlois 2012-10-29 19:14:44 PDT
sure. I'll rework the proposed patch tomorrow!
Comment 16 Olivier Langlois 2012-10-30 10:51:36 PDT
Created attachment 676674 [details] [diff] [review]
Make the nsAppShell interrupt pipe io watch source recursive Gecko style

Note that it was not possible to initialize local variables at the same time that they are declared because of the gotos.
Comment 17 Karl Tomlinson (:karlt) 2012-10-30 15:10:20 PDT
Comment on attachment 676674 [details] [diff] [review]
Make the nsAppShell interrupt pipe io watch source recursive Gecko style

Thanks very much.

nsnull has been replaced with nullptr, so this will need s/nsnull/nullptr/ before it lands.
Comment 18 Karl Tomlinson (:karlt) 2012-10-30 20:46:06 PDT
https://tbpl.mozilla.org/?tree=Try&rev=4d8a3f84dd20

I wonder why this bug didn't cause hangs on our test machines, because there are no user input events there.
Comment 19 Olivier Langlois 2012-10-31 08:51:03 PDT
If I may attempt to partly attempt to answer this question, I must add that when Gecko was in that state, it wasn't frozen or deadlocked. As soon as I was sending new URL to load, the browser was responding. (by sending an event to my Wx window object which was probably waking up the main thread).

The following conditions must be met to notice the problem:

1. You must make the mayWait param set to true for the main thread. My personal experience is that it not something that happens that often.
2. You need a separate thread (only plugin?) that display dynamic content and that will make blocking request to its parent.
Comment 21 Ryan VanderMeulen [:RyanVM] 2012-11-06 18:19:23 PST
https://hg.mozilla.org/mozilla-central/rev/0f45d415a221

Note You need to log in before you can comment on or make changes to this bug.