Closed Bug 970307 Opened 6 years ago Closed 5 years ago

Preload in Nuwa process

Categories

(Firefox OS Graveyard :: General, defect)

x86
macOS
defect
Not set

Tracking

(Not tracked)

RESOLVED FIXED
2.2 S2 (19dec)

People

(Reporter: kk1fff, Assigned: kk1fff)

References

Details

Attachments

(5 files, 28 obsolete files)

1.39 KB, patch
froydnj
: review+
Details | Diff | Splinter Review
33.06 KB, patch
kk1fff
: review+
Details | Diff | Splinter Review
15.81 KB, patch
kk1fff
: review+
Details | Diff | Splinter Review
9.28 KB, patch
Details | Diff | Splinter Review
14.42 KB, patch
froydnj
: review+
Details | Diff | Splinter Review
Receiving for some IPC message after Nuwa is ready could make Nuwa frozen totally, we will need a mechanism to determine when to start to freeze Nuwa, so we can preload whole preload.js in Nuwa process.
The basic idea of this patch is to find a point that is safe for Nuwa process to freeze. If there are IPC messages that cause Nuwa's main thread try to dispatch event to some frozen thread, it may acquire a lock that will never be released, and then run into a deadlock.

If we can find a point after when there's no more IPC messages delivered to Nuwa, we can make Nuwa freeze after that point. This patch tests all event loops in chrome process, try to find a point when event loops of all threads are empty. This means that all tasks that started by the message that Nuwa was sent previously are done, and since Nuwa is started before we have network access, we can ignore the possibility of receiving message from network in Nuwa.
Depends on: 999323
Attached patch WIP (obsolete) — Splinter Review
Attachment #8403892 - Attachment is obsolete: true
Test with current WIP patch:

Preload in Nuwa process:
                            |     megabytes     |
           NAME   PID  PPID CPU(s) NICE  USS  PSS  RSS VSIZE OOM_ADJ USER
            b2g 28250     1   25.9    0 52.1 58.6 71.2 146.3       0 root
         (Nuwa) 28302 28250    1.6    0  3.8 10.5 25.4  55.7       0 root
     Homescreen 28444 28302    6.0    1 11.6 16.6 30.2  81.0       2 app_28444
Built-in Keyboa 28514 28302    0.7   18  5.8  9.5 21.8  61.8      10 app_28514
(Preallocated a 28536 28302    0.1   18  3.2  5.7 14.1  58.7      10 root

Preload in preallocated process:
                            |     megabytes     |
           NAME   PID  PPID CPU(s) NICE  USS  PSS  RSS VSIZE OOM_ADJ USER
            b2g 28994     1   27.0    0 52.3 57.9 70.2 146.7       0 root
         (Nuwa) 29034 28994    1.1    0  3.9  8.9 20.9  52.5       0 root
     Homescreen 29047 29034    4.4    1 12.5 17.0 30.4  82.7       2 app_29047
Built-in Keyboa 29121 29034    1.5   18  7.3 11.1 23.7  70.3      10 app_29121
(Preallocated a 29136 29034    0.9   18  5.5  8.6 19.9  58.5      10 root

It saves about 2MB in preallocated process and 0.9 ~ 1.5MB in homescreen and keyboard.
Awesome Patrick! Let's get that reviewed and landed on m-c as soon as possible to be sure we are not regressing.
(In reply to Patrick Wang (Chih-Kai Wang) (:kk1fff) from comment #3)
> It saves about 2MB in preallocated process and 0.9 ~ 1.5MB in homescreen and
> keyboard.

The savings in the homescreen app seems genuine but the keyboard has a different RSS between the tests so that might account for the difference you're seeing (it's my understanding that this patch should affect the USS values only, not the RSS ones).

In my previous testing I often used ./tools/get_about_memory.py --minimize before looking at USS values to make sure that the RSS were as close as possible between the different runs.
No longer depends on: 941466
The WIP has issues with timer thread, so it will slow down boot process. Still fixing.
(In reply to Gabriele Svelto [:gsvelto] from comment #5)
> In my previous testing I often used ./tools/get_about_memory.py --minimize
> before looking at USS values to make sure that the RSS were as close as
> possible between the different runs.

Good tips, thanks for sharing.
Attached patch WIP (obsolete) — Splinter Review
Check tasks in ThreadPools.
Attachment #8410453 - Attachment is obsolete: true
Test is failure on emulator, still need to deal with worker thread.
Depends on: 1006336
Attached patch WIP (obsolete) — Splinter Review
Handle worker threads.
The test cases start to be able to run, but there are still test failures. I think observing thread task queue with thread observer would be more clear in this case, I am going to change the approach of this patch.
Attachment #8413602 - Attachment is obsolete: true
See Also: → 988689
If permission manager is initialized in Nuwa, mSendPermissionUpdates will be set but the value will not be copied to the new ContentParent when Nuwa forks a new process. Thus the new child will not reveice the permission update from chrome.
Still some issues due to initialize modules before fork need to be addressed case by case.
Depends on: 1032125
Some modules register callbacks to parent when child initialize. If a module's child part is initialized in Nuwa, and it registers callback in parent side, the callbacks won't be clone in parent. The child side need to re-register the callbacks after fork.
nsThreadStatusManager as a singleton object that each thread report its status (idle/busy) to. It notifies listeners when all of the watched threads become idle.
Attachment #8438988 - Attachment is obsolete: true
Attachment #8455158 - Flags: feedback?(bent.mozilla)
Attachment #8438989 - Attachment is obsolete: true
Attachment #8455159 - Flags: feedback?(bent.mozilla)
Attachment #8455158 - Attachment is obsolete: true
Attachment #8455158 - Flags: feedback?(bent.mozilla)
Attachment #8463923 - Flags: feedback?(bent.mozilla)
Attachment #8455159 - Attachment is obsolete: true
Attachment #8455159 - Flags: feedback?(bent.mozilla)
Attachment #8463924 - Flags: feedback?(bent.mozilla)
Attachment #8463923 - Attachment is obsolete: true
Attachment #8463923 - Flags: feedback?(bent.mozilla)
Attachment #8478208 - Flags: review?(bent.mozilla)
Attachment #8463924 - Attachment is obsolete: true
Attachment #8463924 - Flags: feedback?(bent.mozilla)
Attachment #8478211 - Flags: review?(bent.mozilla)
Attachment #8438991 - Attachment is obsolete: true
Attachment #8453668 - Attachment is obsolete: true
Attachment #8478212 - Flags: review?(khuey)
Attachment #8446304 - Attachment is obsolete: true
Attachment #8478213 - Flags: review?(fabrice)
Comment on attachment 8478212 [details] [diff] [review]
Part 3: make nuwa wait for CanFreeze message and preload in nuwa.

stop preloading protocol-proxy-service since dispatches runnables inside nuwa, which can sometimes frozen Nuwa.
Comment on attachment 8478213 [details] [diff] [review]
Part 4: reinitialize AppServiceChild after a new process forked.

Review of attachment 8478213 [details] [diff] [review]:
-----------------------------------------------------------------

Mostly nits, but I'd really like us to use better names. That helps understanding what's happening!

::: dom/ipc/jar.mn
@@ +8,5 @@
>          content/global/BrowserElementChild.js (../browser-element/BrowserElementChild.js)
>          content/global/BrowserElementChildPreload.js (../browser-element/BrowserElementChildPreload.js)
>  *       content/global/BrowserElementPanning.js (../browser-element/BrowserElementPanning.js)
>          content/global/preload.js (preload.js)
> +        content/global/preload2.js (preload2.js)

I'm not a fan on the preload2.js name. Could we use a more explicit one. If we need to rename preload.js too that's fine.

::: dom/ipc/preload2.js
@@ +11,5 @@
> +  "use strict";
> +
> +  let Cu = Components.utils;
> +  let Cc = Components.classes;
> +  let Ci = Components.interfaces;

you don't need Cc and Ci. Actually, you're using Cu only once so you can just do Components.utils.import(...)

@@ +14,5 @@
> +  let Cc = Components.classes;
> +  let Ci = Components.interfaces;
> +
> +  Cu.import("resource://gre/modules/AppsServiceChild.jsm");
> +  dump("TEST: preload2.js");

remove before landing.
Attachment #8478213 - Flags: review?(fabrice)
Duplicate of this bug: 941466
bent, are you going to review these anytime soon?  Can we redirect this to froydnj?
Flags: needinfo?(bent.mozilla)
Comment on attachment 8478212 [details] [diff] [review]
Part 3: make nuwa wait for CanFreeze message and preload in nuwa.

Cervantes, can you take this review?
Attachment #8478212 - Flags: review?(khuey) → review?(cyu)
Comment on attachment 8478208 [details] [diff] [review]
Part 1: Collect thread state in nsThreadStatusManager

Nathan, it was suggested to me that you could perhaps take this review.
Attachment #8478208 - Flags: review?(bent.mozilla) → review?(nfroyd)
Comment on attachment 8478211 [details] [diff] [review]
Part 2: Report threads' status to nsThreadStatusManager

Nathan, it was suggested to me that you could perhaps take this review.
Attachment #8478211 - Flags: review?(bent.mozilla) → review?(nfroyd)
Comment on attachment 8478208 [details] [diff] [review]
Part 1: Collect thread state in nsThreadStatusManager

Review of attachment 8478208 [details] [diff] [review]:
-----------------------------------------------------------------

The locking problems and the separate class for managing thread statuses are the big issues in the comments below.

I'm also not a fan of having to have this functionality on all the time, for all platforms, when we only need it for Nuwa.  We're also going to have this idleness stuff running all the time even after we freeze, which is pure overhead.  Is there any way we can make this monitoring either apply only for Nuwa purposes and/or turn itself off once we freeze?  (I guess the turning off is what NotifyIfAllThreadsIdle is trying to do, but we still have the overhead of looking up threads and setting them idle or not.)

::: xpcom/threads/nsIThreadStatusListener.idl
@@ +2,5 @@
> +
> +[scriptable, uuid(b9d16aee-065e-4d69-8c5b-de5c4e60210d)]
> +interface nsIThreadStatusListener : nsISupports
> +{
> +  void onAllThreadIdle();

This needs some comments.  It also needs a license header, modelines, and a big warning that this notification really only means that all "normal" threads--and it would be great to have a concrete definition for what you want to call "normal"--were idle at some point in the past, not that they are all idle now.

Actually, why does this even need to be scriptable?  And if it doesn't need to be scriptable, why do we even need IDL for it?  Can't it just be an abstract class in a header somewhere?

::: xpcom/threads/nsThreadStatusManager.cpp
@@ +1,1 @@
> +/* -*- Mode: C++; tab-width: 8; indent-tabs-mode: nil; c-basic-offset: 4 -*- */

Please use two-space indentation as per Gecko style.

@@ +18,5 @@
> +{
> +public:
> +    NotifyListenersRunnable() { }
> +
> +    virtual ~NotifyListenersRunnable() { }

Protected, please.

@@ +34,5 @@
> +} // anonymous namespace
> +
> +struct ThreadInfoHolder
> +{
> +    ThreadInfoHolder(nsThread* thread)

Arguments are prefixed with an 'a', so |aThread|.  Please fix this and other arguments throughout the patch.

@@ +84,5 @@
> +        }
> +    }
> +
> +    if (idle) {
> +        NS_DispatchToMainThread(new NotifyListenersRunnable(), NS_DISPATCH_NORMAL);

Please hold an nsRefPtr to the runnable over this call.

@@ +143,5 @@
> +nsThreadStatusManager::AddListener(nsIThreadStatusListener* listener)
> +{
> +    NS_ASSERTION(NS_IsMainThread(), "Need to be run on main thread");
> +    if (!mListeners.Contains(listener)) {
> +        mListeners.AppendElement(listener);

Not locking prior to modification means that you can have multiple threads race to add listeners, but not all of them succeed.  Not good.  Please fix.

@@ +157,5 @@
> +    NS_ASSERTION(NS_IsMainThread(), "Need to be run on main thread");
> +    mListeners.RemoveElement(listener);
> +    if (mListeners.Length() == 0) {
> +        MutexAutoLock lock(mLock);
> +        mWatching = false;

It's not obvious to me what the rules for examining or modifying |mListeners| under the lock here are.  Some places you do, some places you don't.  And then you have this |mWatching| variable which is supposed to represent |mListeners.Length() == 0|.  But duplicating information like this, and examining some parts under a lock and some parts not, is a recipe for races:

1. AddListener on some thread, prepare to AppendElement, context switch to a different thread;
2. RemoveListener checks mListeners.Length() == 0, prepare to lock |mLock|, context switch.
3. AddListener appends, locks |mLock|, sets |mWatching| to true, context switch.
4. Remove Listener locks |mLock|, sets |mWatching| to false.

Now you have a situation where |mWatching| is false|, but you have listeners in |mListeners|.  Not good.

Please eliminate mWatching and clean up the locking.

::: xpcom/threads/nsThreadStatusManager.h
@@ +16,5 @@
> +class nsThread;
> +class nsThreadPool;
> +
> +class ThreadInfoHolder;
> +class nsThreadStatusManager

We should not have another class that does similar things to the existing nsThreadManager.  All the functionality exposed here should be merged into nsThreadManager.

@@ +21,5 @@
> +{
> +public:
> +  ~nsThreadStatusManager();
> +  enum ThreadType {
> +    ThreadTypeNormal,

Some documentation on what exactly these different thread types are and when they're meant to be used would be good.

@@ +22,5 @@
> +public:
> +  ~nsThreadStatusManager();
> +  enum ThreadType {
> +    ThreadTypeNormal,
> +    ThreadTypeThreadpool,

I don't see this used in the places I'd expect it to be used in the second part of this series.  Is this even needed anymore?

@@ +30,5 @@
> +    static nsThreadStatusManager tsm;
> +    return &tsm;
> +  }
> +
> +  void SetThreadType(ThreadType type);

Might be a little nicer on the API client side to say:

SetThreadTypeIgnored() { SetThreadType(ThreadTypeIgnored); }

and so on for each of the types, and then SetThreadType becomes an internal function, and the enum becomes private.  That way, you don't have to qualify the ThreadType enums at the callsites.  (Also prevents clients from accidentally passing bogus data.)

@@ +32,5 @@
> +  }
> +
> +  void SetThreadType(ThreadType type);
> +  void SetThreadIdleStatus(bool isIdle);
> +  void SetThreadIdleStatus(nsIThread *thread, bool isIdle);

This API should not take a boolean argument, it should take an enum instead.  It may also be worthwhile to have it be instead:

SetThreadStatusIdle()
SetThreadStatusWorking()

or similar, so you don't have to qualify the enum name at the callsite.

@@ +39,5 @@
> +  void OnThreadExit(nsThread* thread);
> +
> +  void AddListener(nsIThreadStatusListener* listener);
> +  void RemoveListener(nsIThreadStatusListener* listener);
> +  void GetListener(nsTArray<nsCOMPtr<nsIThreadStatusListener> > *listeners);

Nit: this should be |GetListeners|.

Also nit: We can use >> to close off templates now.  Please do that, here and throughout the patch.

Also, given the recent move constructor support to nsTArray, you could just return an nsTArray here and it will be just as efficient, along with being easier to read.

@@ +44,5 @@
> +
> +private:
> +  nsThreadStatusManager();
> +
> +  void NotifyIfAllThreadIdle();

Nit: this should be |NotifyIfAllThreadsIdle|.
Attachment #8478208 - Flags: review?(nfroyd) → feedback+
Comment on attachment 8478211 [details] [diff] [review]
Part 2: Report threads' status to nsThreadStatusManager

Review of attachment 8478211 [details] [diff] [review]:
-----------------------------------------------------------------

I'd like to see this again once some of the naming changes suggested in part 1 make it in.  I'll have a much higher confidence that all the calls are right then.

::: content/media/MediaStreamGraph.cpp
@@ +1345,5 @@
>      MonitorAutoLock lock(mMonitor);
>      messageQueue.SwapElements(mMessageQueue);
>    }
> +
> +  nsThreadStatusManager::get()->SetThreadType(nsThreadStatusManager::ThreadTypeIgnored);

Nit: indentation is incorrect with respect to the surrounding code.

::: netwerk/base/src/nsSocketTransportService2.cpp
@@ +736,5 @@
>  
>          do {
> +          // Mark current thread empty when there's no pending events
> +          // and we are going to poll.
> +          if (!pendingEvents) {

Nit: indentation of this block is wrong with respect to the code below.

::: widget/gonk/GonkMemoryPressureMonitoring.cpp
@@ +127,5 @@
>        NuwaMarkCurrentThread(nullptr, nullptr);
>      }
>  #endif
>  
> +    nsThreadStatusManager::get()->SetThreadType(nsThreadStatusManager::ThreadTypeIgnored);

Nit: indentation.

::: xpcom/threads/TimerThread.cpp
@@ +187,5 @@
>      NuwaMarkCurrentThread(nullptr, nullptr);
>    }
>  #endif
>  
> +  nsThreadStatusManager::get()->SetThreadType(nsThreadStatusManager::ThreadTypeIgnored);

Nit: indentation.
Attachment #8478211 - Flags: review?(nfroyd)
Comment on attachment 8478212 [details] [diff] [review]
Part 3: make nuwa wait for CanFreeze message and preload in nuwa.

Review of attachment 8478212 [details] [diff] [review]:
-----------------------------------------------------------------

I'd like to see the next revision with the issues addressed.

::: dom/ipc/ContentChild.cpp
@@ +1740,5 @@
>      }
>  
> +    if (IsNuwaProcess()) {
> +        SendNuwaWaiting();
> +    }

This has to be #ifdef'd, or it won't build on non-Nuwa platforms.

::: dom/ipc/ContentParent.cpp
@@ +1321,5 @@
>  
>  NS_IMPL_ISUPPORTS(SystemMessageHandledListener,
>                    nsITimerCallback)
>  
> +class NuwaListener : public nsIThreadStatusListener

Nit: NuwaListener doesn't sound like a good name to me. The listener listen to "Nuwa", what does that mean? The reader has to trace the code to see what it does. I would suggest giving it a more self-descriptive name.

@@ +1324,5 @@
>  
> +class NuwaListener : public nsIThreadStatusListener
> +{
> +public:
> +

Nit: no blank line after the public section, please.

@@ +1332,5 @@
> +        mParent(parent) {
> +    }
> +
> +    virtual ~NuwaListener() {
> +    }

This cannot be public, or it will fail the static assertion MOZ_ASSERT_TYPE_OK_FOR_REFCOUNTING(X).

@@ +1341,5 @@
> +        nsThreadStatusManager::get()->RemoveListener(this);
> +        return NS_OK;
> +    }
> +private:
> +    ContentParent *mParent;

I think this has to be a refptr. We register the listener to the singleton of nsThreadStatusManager, which can outlive the ContentParent instance. If the instance mParent points to should ever be destroyed for any reason, we can use a stale pointer and crash.

::: dom/ipc/PContent.ipdl
@@ +447,5 @@
>       * Notify idle observers in the child
>       */
>      NotifyIdleObserver(uint64_t observerId, nsCString topic, nsString str);
> +
> +    NuwaCanFreeze();

Nit: one blank line before starting a new section, please. Also the same in other parts.

@@ +619,5 @@
>      // Notify the parent that the child has finished handling a system message.
>      async SystemMessageHandled();
>  
>      NuwaReady();
> +    NuwaWaiting();

NuwaWaiting() isn't descriptive enough. Please comment it or use a more descriptive name.

::: dom/ipc/preload.js
@@ +53,5 @@
>    Cc["@mozilla.org/network/effective-tld-service;1"].getService(Ci["nsIEffectiveTLDService"]);
>    Cc["@mozilla.org/network/idn-service;1"].getService(Ci["nsIIDNService"]);
>    Cc["@mozilla.org/network/io-service;1"].getService(Ci["nsIIOService2"]);
>    Cc["@mozilla.org/network/mime-hdrparam;1"].getService(Ci["nsIMIMEHeaderParam"]);
> +  // Cc["@mozilla.org/network/protocol-proxy-service;1"].getService(Ci["nsIProtocolProxyService"]);

Why is nsProtocolProxyService removed from preload? It's not loaded in preload2.js after fork. If you are to comment out preload of nsIProtocolProxyService please leave a comment here why you remove disable the preload of it.

::: xpcom/threads/nsTimerImpl.cpp
@@ +558,5 @@
>  
> +  if (IsNuwaProcess() && IsNuwaReady()) {
> +    return;
> +  }
> +

This also has to be #ifdef'd.
Attachment #8478212 - Flags: review?(cyu) → review-
BTW, Send/RecvNuwaForkDone() seems not necessary to me. Is there any reason we need to run preload2.js after parent has received AddNewProcess() message? If you only want to run preload2.js after fork, you may consider using ContentChild.cpp::InitOnContentProcessCreated().
Flags: needinfo?(bent.mozilla)
Attachment #8478208 - Attachment is obsolete: true
Attachment #8478211 - Attachment is obsolete: true
Attachment #8478213 - Attachment is obsolete: true
Attachment #8513341 - Attachment description: Bug-970307-Part-1-Report-status-of-each-thread-to-fi.patch → Part 1: report status of each thread.
Attachment #8513341 - Flags: review?(nfroyd)
Attachment #8513342 - Flags: review?(cyu)
Attachment #8513345 - Flags: review?(fabrice)
Comment on attachment 8513341 [details] [diff] [review]
Part 1: report status of each thread.

Review of attachment 8513341 [details] [diff] [review]:
-----------------------------------------------------------------

Thank you for adding the NS_* wrappers in this patch.  They make things much more readable.

Only f+'ing because I'm unclear on some of the comments and code below.  I think the whole idleness checking stuff is inherently racy, but I am willing to believe that in the controlled Nuwa environment, especially if we don't have any network connections, that when we send the idle notification, it's unlikely lots of threads are going to become un-idle.

::: xpcom/glue/nsThreadUtils.h
@@ +568,5 @@
> +/**
> + * Helpers for thread to report their status when compiled with Nuwa.
> + */
> +#ifdef MOZ_NUWA_PROCESS
> +#ifdef MOZILLA_INTERNAL_API

I realize that you probably did this for consistency with other #ifdef MOZ_NUWA_PROCESS top-level blocks, but I think we'd be much better off with the conditionals turned inside-out:

#ifdef MOZILLA_INTERNAL_API
#ifdef MOZ_NUWA_PROCESS
// declarations
#else
// inline definitions
#endif // MOZ_NUWA_PROCESS
#endif // MOZILLA_INTERNAL_API

::: xpcom/threads/nsThread.cpp
@@ +827,5 @@
>        if (MAIN_THREAD == mIsMainThread) {
>          HangMonitor::NotifyActivity();
>        }
> +#ifdef MOZ_NUWA_PROCESS
> +      nsThreadManager::get()->SetThreadWorking();

NS_SetCurrentThreadWorking?  Then you don't need the #ifdef.

@@ +841,5 @@
>    --mRunningEvent;
>  
> +#ifdef MOZ_NUWA_PROCESS
> +  if ((!mEvents->GetEvent(false, nullptr)) && (mRunningEvent == 0)) {
> +    nsThreadManager::get()->SetThreadIdle();

Likewise, though I think the #ifdef is still valuable for avoiding work when we're not using Nuwa.

::: xpcom/threads/nsThreadManager.cpp
@@ +48,5 @@
>  
>  typedef nsTArray<nsRefPtr<nsThread>> nsThreadArray;
>  
> +#ifdef MOZ_NUWA_PROCESS
> +class NotifyAllThreadsHasIdled: public nsRunnable

I think |NotifyAllThreadsWereIdle| is a better name for this class.

@@ +444,5 @@
> +  SetThreadStatus(true);
> +}
> +
> +void
> +nsThreadManager::AddAllThreadHadIdledListener(AllThreadHadIdledListener *listener)

Nit: aListener (and similarly below, please).

@@ +446,5 @@
> +
> +void
> +nsThreadManager::AddAllThreadHadIdledListener(AllThreadHadIdledListener *listener)
> +{
> +  mThreadsIdledListener.AppendElement(listener);

Please add a MOZ_ASSERT(GetCurrentThreadStatusInfo()->isWorking); here.

@@ +466,5 @@
> +    // after holding a lock.
> +    // There may be threads that has checked |mThreadsIdledListener| before a
> +    // listener is put into it and then is context switched. In this case, we
> +    // might get the value before it has been updated, and the thread won't check
> +    // again after update to value.

What is "the value" here?  Is it whether mThreadsIdledListener has any elements in it?  Or is it mWorking?

@@ +467,5 @@
> +    // There may be threads that has checked |mThreadsIdledListener| before a
> +    // listener is put into it and then is context switched. In this case, we
> +    // might get the value before it has been updated, and the thread won't check
> +    // again after update to value.
> +    // a. If it goes from idle to working, we'll get incorrect idle status since

What is "it" here (and below, in point b)?  Is it the current thread, or is it some other thread that we might be racing with?

@@ +470,5 @@
> +    // again after update to value.
> +    // a. If it goes from idle to working, we'll get incorrect idle status since
> +    //    there's actually something in its queue. However, having not finished
> +    //    the function means the thread that dispatches this task is still
> +    //    marked as working, and that will prevent us from firing an event.

"the thread that dispatches this task"...isn't that this thread?  Or is it the main thread, since we're going to dispatch the notifier there?  I don't see how either of those choices is still marked as working: the current thread can be marked as non-working, by virtue of this function, and the main thread doesn't necessarily have to be working while we're going through the loop below.

::: xpcom/threads/nsThreadManager.h
@@ +22,5 @@
>  {
>  public:
> +#ifdef MOZ_NUWA_PROCESS
> +  struct ThreadStatusInfo;
> +  class AllThreadHadIdledListener {

I think |AllThreadsWereIdleListener| is a better name for this class.

@@ +25,5 @@
> +  struct ThreadStatusInfo;
> +  class AllThreadHadIdledListener {
> +  public:
> +    NS_INLINE_DECL_REFCOUNTING(AllThreadHadIdledListener);
> +    virtual void OnAllThreadHadIdled() = 0;

Similarly, OnAllThreadsWereIdle.

@@ +76,5 @@
> +  void SetThreadIdle();
> +  void SetThreadWorking();
> +
> +  void AddAllThreadHadIdledListener(AllThreadHadIdledListener *listener);
> +  void RemoveAllThreadHadIdledListener(AllThreadHadIdledListener *listener);

Similarly, {Add,Remove}AllThreadsWereIdleListener

@@ +112,5 @@
> +  ThreadStatusInfo* GetCurrentThreadStatusInfo();
> +  void SetThreadStatus(bool isWorking);
> +
> +  unsigned            mThreadStatusInfoIndex;
> +  nsTArray<nsRefPtr<AllThreadHadIdledListener>> mThreadsIdledListener;

Since this is a collection of listeners, it should be named as such: mThreadsIdleListeners.

@@ +114,5 @@
> +
> +  unsigned            mThreadStatusInfoIndex;
> +  nsTArray<nsRefPtr<AllThreadHadIdledListener>> mThreadsIdledListener;
> +  nsTArray<ThreadStatusInfo*> mThreadStatusInfos;
> +  nsAutoPtr<mozilla::ReentrantMonitor> mMonitor;

mozilla::UniquePtr, please.  (Actually, why is this not a plain member?)

::: xpcom/threads/nsThreadPool.cpp
@@ +207,5 @@
>          } else {
>            PRIntervalTime delta = timeout - (now - idleSince);
>            LOG(("THRD-P(%p) waiting [%d]\n", this, delta));
> +#ifdef MOZ_NUWA_PROCESS
> +          nsThreadManager::get()->SetThreadIdle();

NS_SetCurrentThreadIdle?  Then you don't need the #ifdef.

@@ +219,5 @@
>      }
>      if (event) {
>        LOG(("THRD-P(%p) running [%p]\n", this, event.get()));
> +#ifdef MOZ_NUWA_PROCESS
> +      nsThreadManager::get()->SetThreadWorking();

Likewise for this.

::: xpcom/threads/nsTimerImpl.cpp
@@ +557,5 @@
>    }
>  
> +#ifdef MOZ_NUWA_PROCESS
> +  if (IsNuwaProcess() && IsNuwaReady()) {
> +    return;

This condition looks a little odd, can you elaborate on why this is here?
Attachment #8513341 - Flags: review?(nfroyd) → feedback+
(In reply to Nathan Froyd (:froydnj) from comment #39)
> 
> @@ +466,5 @@
> > +    // after holding a lock.
> > +    // There may be threads that has checked |mThreadsIdledListener| before a
> > +    // listener is put into it and then is context switched. In this case, we
> > +    // might get the value before it has been updated, and the thread won't check
> > +    // again after update to value.
> 
> What is "the value" here?  Is it whether mThreadsIdledListener has any
> elements in it?  Or is it mWorking?

It's |mWorking|.

> 
> @@ +467,5 @@
> > +    // There may be threads that has checked |mThreadsIdledListener| before a
> > +    // listener is put into it and then is context switched. In this case, we
> > +    // might get the value before it has been updated, and the thread won't check
> > +    // again after update to value.
> > +    // a. If it goes from idle to working, we'll get incorrect idle status since
> 
> What is "it" here (and below, in point b)?  Is it the current thread, or is
> it some other thread that we might be racing with?

It's some other thread that we might be racing with.

> 
> @@ +470,5 @@
> > +    // again after update to value.
> > +    // a. If it goes from idle to working, we'll get incorrect idle status since
> > +    //    there's actually something in its queue. However, having not finished
> > +    //    the function means the thread that dispatches this task is still
> > +    //    marked as working, and that will prevent us from firing an event.
> 
> "the thread that dispatches this task"...isn't that this thread?  Or is it
> the main thread, since we're going to dispatch the notifier there?  I don't
> see how either of those choices is still marked as working: the current
> thread can be marked as non-working, by virtue of this function, and the
> main thread doesn't necessarily have to be working while we're going through
> the loop below.

The first case I try to describe in comment is that a if there's a thread A changing its |mWorking| from false to true while we are checking status of all threads, we could fire a bogus event:
(1) Thread B (working) dispatches a task to Thread A (idle)
(2) A entered |SetThreadStatus| and find |mThreadsIdledListener| is empty (so A won't acquire lock)
(3) something is inserted into |mThreadsIdledListener|
(4) Thread C enters |SetThreadStatus|, find there is something in |mThreadsIdledListener|, acquire lock and check status of other threads.
(5) C checks A, finds it's idle
(6) A updates it status to working
(7) B updates it status to idle
(8) C checks B, finds it's idle.

I was trying to explain that the bogus event won't actually be fired because it's not possible that B set it's |mWorking| to false during our loop of checking |mWorking|s, since B will be blocked by the monitor as it entered this function after we have entered the loop. (But when I was writing this bugzilla comment, I realized that current implementation may not work properly, I have to make sure that B won't start to set itself to idle before A had become working. I will address this in next version.)

> @@ +114,5 @@
> > +
> > +  unsigned            mThreadStatusInfoIndex;
> > +  nsTArray<nsRefPtr<AllThreadHadIdledListener>> mThreadsIdledListener;
> > +  nsTArray<ThreadStatusInfo*> mThreadStatusInfos;
> > +  nsAutoPtr<mozilla::ReentrantMonitor> mMonitor;
> 
> mozilla::UniquePtr, please.  (Actually, why is this not a plain member?)

Actually, I copied the style from the Mutex declared above. Another reason is that the header of ReentrantMonitor doesn't need to be included in this header file.

> ::: xpcom/threads/nsTimerImpl.cpp
> @@ +557,5 @@
> >    }
> >  
> > +#ifdef MOZ_NUWA_PROCESS
> > +  if (IsNuwaProcess() && IsNuwaReady()) {
> > +    return;
> 
> This condition looks a little odd, can you elaborate on why this is here?

It should be put in another part actually. It's because when loading preload in Nuwa, a timer event which fires after Nuwa frozen can freeze main thread as well. Thus I cancelled the timer events which fires after Nuwa frozen.
Attachment #8513345 - Flags: review?(fabrice) → review+
Update in this version is that a thread needs to enter a per-thread monitor or mutex of target thread before it can dispatch tasks to the target thread. This would make sure that once a thread has a task in its queue, its status is working. Except thread pool, in which the target thread is not decided when dispatching tasks to the queue. For thread pool, once a thread got a task from thead pool's shared queue, it marks itself working.
Attachment #8513341 - Attachment is obsolete: true
Attachment #8517393 - Flags: review?(nfroyd)
Changed as name change in part 1. Also changed the way nuwa process waits for PBackground form a nested event loop in which Nuwa can be blocked to a continuously dispatching a task to check whether PBackground is ready.
Attachment #8513342 - Attachment is obsolete: true
Attachment #8513342 - Flags: review?(cyu)
Attachment #8517397 - Flags: review?(cyu)
Comment on attachment 8517393 [details] [diff] [review]
Part 1: report status of each thread.

Review of attachment 8517393 [details] [diff] [review]:
-----------------------------------------------------------------

The locking in nsThread is a good idea; it makes a number of things easier to reason about.

I apologize for continued comments on the comment block in nsThreadManager.cpp, but I'm having a hard time convincing myself it's describing the situation well.  And since it's the heart of the patch, well...

::: dom/indexedDB/TransactionThreadPool.cpp
@@ +864,5 @@
>    nsRefPtr<FinishCallback> finishCallback;
>    bool shouldFinish = false;
> +#ifdef MOZ_NUWA_PROCESS
> +  mThread = static_cast<nsThread*>(NS_GetCurrentThread());
> +  // Set ourselve as working thread. We can reset later if we found

Nit: "Set ourself"

::: xpcom/threads/nsThreadManager.cpp
@@ +75,5 @@
> +};
> +
> +struct nsThreadManager::ThreadStatusInfo {
> +  bool mWorking;
> +  bool mWillBeWorking;

I would feel much better if these were Atomic<bool>.  I realize that accesses to Atomic variables are not cheap, but the correctness of the notification mechanism depends on seeing cross-thread updates to these variables, and we're definitely not guaranteed that if these aren't Atomic.

@@ +450,5 @@
> +  aInfo->mWillBeWorking = aIsWorking;
> +  if (mThreadsIdledListeners.Length() > 0) {
> +    // If there's a listener, we update our status and check statuses of threads
> +    // after holding a lock.
> +    // There may be threads that has checked |mThreadsIdledListeners| before a

Thank you for the small changes you made to this, and for the additional example you provided.

I tried reading through this again, but got confused (still!).  Below is my attempt to rewrite the whole comment, with more clarity about what "it" is referring to in many places.  Inline comments between [], also:

-----
There may be a thread X that has checked whether there are listeners prior to a listener being placed into |mThreadsIdledListeners|, but is suspended prior to setting its |mWorking|.  In that case, we might get the value of |mWorking| before it has been updated, and thread X won't check whether there are listeners after updating |mWorking|.

There are two transitions to consider for thread X.

a) If the status of thread X is changed from idle to working while we are walking through |mThreadStatusInfos|, we will not send a bogus event.  The thread that is walking through |mThreadStatusInfos| is obviously working, and the thread that we dispatch...

[Please see review comments elsewhere, I don't think I can adequately rewrite the description of this case to make sense.]

b) If the status of thread X is changed from working to idle, we could miss a chance to notify listeners.  This case is addressed by adding the additional status variable |mWillBeWorking|.  If thread X's |mWillBeWorking| is false while |mWorking| is true, then X is becoming idle, and we can legitimately treat the thread as an idle thread.  There are two  cases to consider here:

  1) The thread X is blocked or about to be blocked by the monitor.  If all the threads are idle (or becoming so), then we will dispatch a task to the main thread.  The main thread will then be marked as working, and when thread X finally enters the monitor, it will see that the main thread is working and not send another event.

[Why does thread X necessarily get scheduled prior to the main thread running its event?  It seems perfectly reasonable that the main thread could completely process its event, go back to idling, and then thread X would be scheduled, only to notice that all the threads are idled again, and fire another event.  Are we simply assuming there won't be any more listeners when that second event gets fired?]

  2) The thread X is not entering the monitor at all, i.e. it's entering the |else| branch of the |if| here.  We don't have to worry about a second event being sent here.
-----

Does all that seem correct?

@@ +456,5 @@
> +    // might get the value before it has been updated, and the thread won't check
> +    // again after update to value.
> +    // a. If status of a thread is changed idle to working during we are
> +    //    checking info array, we won't send bogus event since the thread that
> +    //    is dispatch a task to the idle thread is working. It is not possible

This sentence is unclear: "the thread that is dispatch(ing?) a task to the idle thread is working"...isn't that us, running the checking loop below?  If it is us, how do we know that we are marked as working?  What if we're making the transition to an idle thread?

We only ever dispatch tasks to the main thread, how do we know that it's idle?

If you need to talk about multiple threads, then it would be extremely helpful if you gave them names.  The names given in comment 40 are a good start.

@@ +460,5 @@
> +    //    is dispatch a task to the idle thread is working. It is not possible
> +    //    for the thread that is dispatch a task to change status of itself
> +    //    after it dispatched the task, since it will enter this function after
> +    //    us, so it will find there's something in |mThreadsIdledListeners|,
> +    //    and then try to arquire the monitor.

I tried to rewrite this last sentence, but completely failed, probably because "the thread that is dispatch a task to change..." is unclear: are we talking about the thread that is receiving the task, or the thread that is doing the dispatching (i.e. the one running this loop here)?

::: xpcom/threads/nsThreadManager.h
@@ +22,5 @@
>  {
>  public:
> +#ifdef MOZ_NUWA_PROCESS
> +  struct ThreadStatusInfo;
> +  class AllThreadWereIdleListener {

Nit: AllThreadsWereIdleListener.

@@ +25,5 @@
> +  struct ThreadStatusInfo;
> +  class AllThreadWereIdleListener {
> +  public:
> +    NS_INLINE_DECL_REFCOUNTING(AllThreadWereIdleListener);
> +    virtual void OnAllThreadWereIdle() = 0;

Likewise here.

::: xpcom/threads/nsTimerImpl.cpp
@@ +556,5 @@
>      return;
>    }
>  
> +#ifdef MOZ_NUWA_PROCESS
> +  if (IsNuwaProcess() && IsNuwaReady()) {

Please add a comment describing why we are doing this.
Attachment #8517393 - Flags: review?(nfroyd) → feedback+
(In reply to Nathan Froyd (:froydnj) from comment #43)
> Comment on attachment 8517393 [details] [diff] [review]
> Part 1: report status of each thread.
> 
> Review of attachment 8517393 [details] [diff] [review]:

Thanks for reviewing this patch, Nathan. I will try to describe each step of how the possible racing occurs and the way I proposed to address them.

The race problem occurs since we don't want threads to try to enter the monitor (nsThreadManager::mMonitor) when no one cares about their status. And thus the race can happen when we put the first listener into |mThreadsIdledListeners|:

(1) Thread A wants to dispatch a task to Thread B.
(2) Thread A checks |mThreadsIdledListeners|, and nothing is in the list. So Thread A decides not to
    enter |mMonitor| when updating B's status.
(3) Thread A is suspended just before it changed status of B.
(4) A listener is added to |mThreadsIdledListeners|
(5) Now is Thread C's turn to run. Thread C finds there's something in |mThreadsIdledListeners|, so it
    enters |mMonitor| and check all thread info structs in |mThreadStatusInfos| while A is in the
    middle of changing B's status.

Then C might find Thread B is an idle thread (which is not correct, because A attempted to change B's status prior to C's checking starts), but the fact that thread A is working (thread A hasn't finished dispatching a task to thread B) can prevent thread C from firing a bogus notification.

It's also possible for A to change status of itself to idle after C has checked B's status (and found that B is idle), , since when A's attempt to change status of itself must be started after thread C entered the loop, and then A will need to enter monitor this time.


If the state transition that happens outside the monitor is the other direction, it will be:

(1) Thread D has just finished its jobs and wants to set its status to idle.
(2) Thread D checks |mThreadsIdledListeners|, and nothing is in the list. So Thread D decides not
    to enter |mMonitor|
(3) Thread D is suspend before it update status of itself.
(4) A listener is put into |mThreadsIdledListeners|.
(5) Thread C wants to changes status of itself. It checks |mThreadsIdledListeners| and finds something
    inside the list, thread C then enters |mMonitor|, updates its status and checks thread info in
    |mThreadStatusInfos| while D is changing status of itself out of monitor.

Thread C will find that thread D is working (D is actually idle and just hasn't update its status), then C returns without firing any notification. Finding that thread D is working can make our checking mechinasm miss a chance to fire a notification: because thread D thought there's nothing in |mThreadsIdledListeners| and thus won't check the |mThreadStatusInfos| after changing the status of itself.

|mWillBeWorking| can be used to address this problem. Say we require each thread to put the value that is going to be set to |mWorking| to |mWillBeWorking| before the thread decide whether it should enter |mMonitor| to change status or not. Thus C finds that D is working while D's |mWillBeWorking| is false, and C realizes that D is just updating and can treat D as an idle thread.

It doesn't matter whether D will check thread status after changing its own status or not. If D checks, which means D will enter the monitor before updating status, thus D must be blocked until C has finished dispatching the notification task to main thread, and D will find that main thread is working and will not fire an additional event. On the other hand, if D doesn't check |mThreadStatusInfos|, it's still ok, because C has treated D as an idle thread already.
I rewrote the comments by giving each participating thread. Hoping this makes it a bit more clear.
Attachment #8517393 - Attachment is obsolete: true
Attachment #8519673 - Flags: review?(nfroyd)
Attachment #8517397 - Attachment is obsolete: true
Attachment #8517397 - Flags: review?(cyu)
Attachment #8519674 - Flags: review?(cyu)
Attachment #8519673 - Attachment is obsolete: true
Attachment #8519673 - Flags: review?(nfroyd)
Attachment #8519707 - Flags: review?(nfroyd)
Comment on attachment 8519707 [details] [diff] [review]
Part 1: report status of each thread.

Review of attachment 8519707 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for working through this with me.  The detailed comment is much better!

Have you done any performance testing with these patches applied?

::: xpcom/threads/nsThreadManager.cpp
@@ +477,5 @@
> +    //
> +    // (1) Thread D has just finished its jobs and wants to set its status to idle.
> +    // (2) Thread D checks |mThreadsIdledListeners|, and nothing is in the list.
> +    //     So Thread D decides not to enter |mMonitor|.
> +    // (3) Thread D is suspend before it update status of itself.

Nit: "is suspended before it updates its own status."

@@ +484,5 @@
> +    //     |mThreadsIdledListeners| and finds something inside the list. Thread C
> +    //     then enters |mMonitor|, updates its status and checks thread info in
> +    //     |mThreadStatusInfos| while D is changing status of itself out of monitor.
> +    //
> +    // Thread C will find that thread D is working (D actaully wants to change its

Nit: "actually"

@@ +487,5 @@
> +    //
> +    // Thread C will find that thread D is working (D actaully wants to change its
> +    // status to idle before C starting to check), then C returns without firing
> +    // any notification. Finding that thread D is working can make our checking
> +    // mechinasm miss a chance to fire a notification: because thread D thought

Nit: "mechanism"

@@ +491,5 @@
> +    // mechinasm miss a chance to fire a notification: because thread D thought
> +    // there's nothing in |mThreadsIdledListeners| and thus won't check the
> +    // |mThreadStatusInfos| after changing the status of itself.
> +    //
> +    // |mWillBeWorking| can be used to address this problem. We require each

Now that |mWorking| is Atomic, I wonder if we can just structure things:

  aInfo->mWorking = aIsWorking;
  if (mThreadStatusInfos.Length() > 0) {
    bool hasWorkingThread = false;
    ReentrantMonitorAutoEnter mon(*mMonitor);
    for (size_t i = 0; i < mThreadStatusInfos.Length(); ++i) {
      ...
    }
  }

Which means that we're doing fewer Atomic accesses, which ought to be a good thing.  I *think* the arguments that you used here apply equally for the above algorithm.
Attachment #8519707 - Flags: review?(nfroyd) → feedback+
(In reply to Nathan Froyd (:froydnj) from comment #48)

> Have you done any performance testing with these patches applied?

I've done testing of app startup time. Since the additional synchronization mechanism could affect performance at every where, I think that a trivial testing can reflex the effect. It doesn't seem much difference between startup time of clock with these patches applied and that without patches applied. I've run 5 time for both cases, the results are almost the same.

I've also measured the time spent from b2g startup to the first preallocated being ready on flame. If these patches are not applied, it, from b2g startup, takes about 15.5 secs to Nuwa starting to freeze and takes about 18 secs to the first preallocated process becoming ready. With these patches applied, it takes about 18 secs to Nuwa starting to freeze and takes about 18.5 sec to the first preallocated process becoming ready. With these patches, however, it sometimes costs more time (about 25 sec) from b2g startup to Nuwa starting to freeze. I think this case is due to events from other source are dispatched to chrome process and make threads take longer to consume all of the events.
(In reply to Nathan Froyd (:froydnj) from comment #48)
> Now that |mWorking| is Atomic, I wonder if we can just structure things:
> 
>   aInfo->mWorking = aIsWorking;
>   if (mThreadStatusInfos.Length() > 0) {
>     bool hasWorkingThread = false;
>     ReentrantMonitorAutoEnter mon(*mMonitor);
>     for (size_t i = 0; i < mThreadStatusInfos.Length(); ++i) {
>       ...
>     }
>   }
> 
> Which means that we're doing fewer Atomic accesses, which ought to be a good
> thing.  I *think* the arguments that you used here apply equally for the
> above algorithm.

I don't think |mWorking| can be taken out of the monitor. Since this will make |mWorking|s be able to be changed when a thread is walking through the |mThreadStatusInfos|.
just fixing nits...
Attachment #8519707 - Attachment is obsolete: true
Attachment #8521340 - Flags: review?(nfroyd)
Comment on attachment 8521340 [details] [diff] [review]
Part 1: report status of each thread.

Review of attachment 8521340 [details] [diff] [review]:
-----------------------------------------------------------------

Oh, yes, we still need things under the monitor.  Sigh.

Thanks for working through everything here.
Attachment #8521340 - Flags: review?(nfroyd) → review+
Thank you for the review, Nathan!
https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=e4b2f842b6d1

Nuwa creation test failure (I guess it's because it takes more time then before, since some of m5 ended successfully), and try finds that there are leakages.
Comment on attachment 8519674 [details] [diff] [review]
Part 2: let nuwa wait for all tasks of preload to be done.

Review of attachment 8519674 [details] [diff] [review]:
-----------------------------------------------------------------

f+ this patch because the patch looks correct to me. My comments are mostly on code clarity and I'd like to see them addressed.

::: dom/ipc/ContentChild.cpp
@@ +2089,5 @@
>          PreloadSlowThings();
>      }
>  
>  #ifdef MOZ_NUWA_PROCESS
>      if (IsNuwaProcess()) {

Please comment the intention for this message sent and what it is waiting for.

::: dom/ipc/ContentChild.h
@@ +452,5 @@
>      static ContentChild* sSingleton;
>  
>      DISALLOW_EVIL_CONSTRUCTORS(ContentChild);
> +#ifdef MOZ_NUWA_PROCESS
> +    bool mDontFlushMemory;

This variable isn't necessary. It is initialized in the double negation form: false to mDontFlushMemory means true to flush memory. It's true in the Nuwa process and the preallocated process.

if (mDontFlushMemory) { ... }
is equivalent to
if (IsNuwaProcess() || ManagedPBrowserChild().Length() == 0) { ... }

I think using this predicate makes it clearer. Don't forget to comment why you don't want to flush memory in the preallocated process.

::: dom/ipc/ContentParent.cpp
@@ +1358,5 @@
>                    nsITimerCallback)
>  
> +#ifdef MOZ_NUWA_PROCESS
> +class NuwaCanFreezeListener : public nsThreadManager::AllThreadsWereIdleListener
> +{

You need to rename this listener if you renamed NuwaCanFreeze() message.

::: dom/ipc/PContent.ipdl
@@ +505,5 @@
>      /**
>       * Notify windows in the child to apply a new app style.
>       */
>      OnAppThemeChanged();
> +    NuwaCanFreeze();

This name confuses me and I don't think it a good name. At first glance I can't see the difference and temporal relationships between NuwaCanFreeze() and NuwaReadyForFreeze(). We need to express the intention of this message more directly. NuwaFreeze() sounds a better choice.

@@ +718,5 @@
>  
>      NuwaReady();
> +    // Sent when nuwa finished its initialization process and is waiting for
> +    // parent's signal to make it freeze.
> +    NuwaReadyForFreeze();

NuwaReadyForFreeze() isn't the correct name to me because it's not really "ready" for freeze. If we ask the Nuwa process to freeze on receiving this message, we could make it deadlock because the preload.js hasn't actually finished. We need a name like NuwaWaitForFreeze() to show that it waits for the async events in preload.js to be process before freeze can actually happen.

::: dom/ipc/tests/test_NuwaProcessCreation.html
@@ +75,5 @@
>    };
>    let timeout = setTimeout(function() {
>      ok(false, "Nuwa process is not launched");
>      testEnd();
> +  }, 120000);

Why do you double the timeout?
Attachment #8519674 - Flags: review?(cyu) → feedback+
(In reply to Cervantes Yu from comment #55)
> ::: dom/ipc/tests/test_NuwaProcessCreation.html
> @@ +75,5 @@
> >    };
> >    let timeout = setTimeout(function() {
> >      ok(false, "Nuwa process is not launched");
> >      testEnd();
> > +  }, 120000);
> 
> Why do you double the timeout?

Waiting for preloading increases the time that Nuwa consumes to become ready. It's not quite obvious at boot time since the task queue contains relatively fewer tasks then, but when we test restarting Nuwa after many event sources are activated, the increase can be obvious. And running in emulator basically amplifies the increase. On try, increasing the timeout can increase the possibility of passing this test. Actually, I am testing with longer timeout to see if this can be passed on try.
With part 1, nsThread's size is increased by 28 byte (a void* + a reentrant monitor), and there have been 2 nsThreads leaked before applying this patch. And there are 3 more reentrant monitors counted by reference count tracer after part 1 applied: 2 are with nsThread, another 1 is with nsThreadManager. Thus I'm increasing the threshold by 128 (28*2+24*3) bytes.
Attachment #8525267 - Flags: review?(nfroyd)
Attachment #8525267 - Flags: review?(nfroyd) → review+
Fix with previous review. I also increased timeout of nuwa's deadlock test case. Timeout of both deadlock test and creation test are 4 minutes, so we still have 1 minute to reset environment if the test is failed.
Attachment #8519674 - Attachment is obsolete: true
Attachment #8525754 - Flags: review?(cyu)
Fixing test failure on try: Moving the routine of adding a |ThreadStatusInfo| into |mThreadStatusInfos| and the routine of removing a |ThreadStatusInfo| from |mThreadStatusInfos| to constructor and destructor of |ThreadStatusInfo|, respectively, to prevent |ThreadStatusInfo| from being left in the array if a thread exits without calling unregister. Carrying r+ as this doesn't change the logic of the patch.
Attachment #8521340 - Attachment is obsolete: true
Attachment #8527420 - Flags: review+
Comment on attachment 8525754 [details] [diff] [review]
Part 2: let nuwa wait for all tasks of preload to be done.

Review of attachment 8525754 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/ipc/ContentChild.cpp
@@ +1988,4 @@
>          // Don't flush memory in the nuwa process: the GC thread could be frozen.
> +        // If there's no PBrowser child, don't flush memory, either. GC writes
> +        // to copy-on-write pages and makes preallocated process take more memory
> +        // before it actually become an app.

s/become/becomes

::: dom/ipc/PContent.ipdl
@@ +507,5 @@
>      /**
>       * Notify windows in the child to apply a new app style.
>       */
>      OnAppThemeChanged();
> +    NuwaFreeze();

nit: please add an empty line above and below this NuwaFreeze(). It's not supposed to be clustered with OnAppThemeChanged(), and there is originally an empty line before starting the parent: section.
Attachment #8525754 - Flags: review?(cyu) → review+
Attachment #8527420 - Attachment description: Part 1: report status of each thread. → Part 1: report status of each thread. (r=nfroyd)
updated according to previous review.
Attachment #8525754 - Attachment is obsolete: true
Attachment #8527432 - Flags: review+
Fix failure on try by not having process crash (that is, removing the assertion and making it simply return) when calling post fork preload before calling preload. It happens when Nuwa is created since post fork preload is invoked init() of content child. Carrying r+.
Attachment #8513345 - Attachment is obsolete: true
https://hg.mozilla.org/mozilla-central/rev/32933803bc74
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → 2.2 S1 (5dec)
Reopening as the patches are backed out.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Attached patch Part 1b: fix dead lock. (obsolete) — Splinter Review
Nathan, could you review this update patch to part 1? Thanks.

This updated patch fixes the deadlock by preventing a thread from trying to hold locks needed by dispatching task to main thread in the |nsThreadManager.mMonitor| and |mainthread.mThreadStatusMonitor|.


The cause of mochitest 11 failure is that a deadlock occurs when main thread is checking the |mThreadStatusInfos| list in |nsThread.mThreadStatusMonitor| and |nsThreadManager.mMonitor| monitors, and another thread (Thread A) whose status we don't care is trying to dispatch a task to main thread. So A attempts to change main thread's status to working, it holds |mainthread.mLock| and is blocked by |mainthread.mThreadStatusMonitor| which is containing main thread who enters the monitor to set status of itself to idle.

Then main thread finds all threads are idle (actually, A is not idle, but we ignore it), then it tries to dispatch a task to main thread, which is itself, so it try to lock |mainthread.mLock|, and then it will be blocked by A, while A is blocked by the monitor which mainthread is in.
Attachment #8528789 - Flags: review?(nfroyd)
The milestone was set when first landing and I suppose it should be reset when they were backed out.
No longer blocks: CAF-v2.2-metabug
Target Milestone: 2.2 S1 (5dec) → ---
I removed the dependency of 1063044 because I though the dependency was based on the milestone setting which I think was not accurate. But I should have asked Michael whether or not this bug should still be set to be a blocker of 2.2 meta bug.
Flags: needinfo?(mvines)
(np, sounds like this'll save ~1-2 MB in the preallocated process which is something we'd very much like in v2.2)
Flags: needinfo?(mvines)
Comment on attachment 8528789 [details] [diff] [review]
Part 1b: fix dead lock.

Review of attachment 8528789 [details] [diff] [review]:
-----------------------------------------------------------------

I apologize for taking so long to review this.  PTO + Portland did not help.

::: xpcom/threads/nsThread.cpp
@@ +868,5 @@
>      }
>    }
> +  if (notifyAllIdleRunnable) {
> +    // If we are on main thread, we need to dispatch the task after we left
> +    // the monitors.

I think this is worth describing in some detail *why* we need to do that.

@@ +1088,5 @@
>  nsThread::SetWorking()
>  {
>    nsThreadManager::get()->SetThreadIsWorking(
> +    static_cast<nsThreadManager::ThreadStatusInfo*>(mThreadStatusInfo),
> +    true, nullptr);

Why not |nsThreadManager::get()->SetThreadWorking()|?  Is it slightly less efficient, but I think it's worth it.  Then we don't need to store mThreadStatusInfo in nsThread, which reduces coupling between it and nsThreadManager.

@@ +1096,5 @@
>  nsThread::SetIdle()
>  {
>    nsThreadManager::get()->SetThreadIsWorking(
> +    static_cast<nsThreadManager::ThreadStatusInfo*>(mThreadStatusInfo),
> +    false, nullptr);

Same question here, but with |->SetThreadIdle()|?

::: xpcom/threads/nsThreadManager.cpp
@@ +88,5 @@
>      ReentrantMonitorAutoEnter mon(*(mMgr->mMonitor));
>      mMgr->mThreadStatusInfos.AppendElement(this);
> +    if (NS_IsMainThread()) {
> +      mMgr->mMainThreadStatusInfo = this;
> +    }

This code should move to GetCurrentThreadStatusInfo, there's no reason that ThreadStatusInfo should bear this responsibility.

@@ +96,5 @@
>      ReentrantMonitorAutoEnter mon(*(mMgr->mMonitor));
>      mMgr->mThreadStatusInfos.RemoveElement(this);
> +    if (NS_IsMainThread()) {
> +      mMgr->mMainThreadStatusInfo = nullptr;
> +    }

Likewise for this, except that it needs to move to DeleteThreadStatusInfo.  We can get the required handle on nsThreadManager there by calling ::get().

Once those two things are moved, we don't need the |mMgr| member either.

@@ +548,5 @@
> +        // If we want to dispatch a notification task, we need to dispatch
> +        // after left mMonitor. But we should set main thread to working
> +        // before we leave.
> +        mMainThreadStatusInfo->mWorking = true;
> +        mMainThreadStatusInfo->mWillBeWorking = true;

This makes me nervous.  Why can't something like the following happen:

0. Main thread is idle, enters this function.
1. Main thread sets itself as |mWillBeWorking = false|;
2. Context switch someplace else.
3. Thread X enters this loop, discovers all threads idle.
4. Thread X sets these two members to |true|.
5. Event is dispatched to the main thread.
6. Main thread resumes, enters the above loop, and sets itself as |mWorking = false|.

Now these two members have gotten out of sync.

I think we must have locking to protect us from doing this sort of thing, but I'm having trouble seeing it at the moment.

@@ +556,5 @@
> +
> +    if (runnable) {
> +      if (NS_IsMainThread()) {
> +        // If we are on main thread, we can't dispatch to ourself while in
> +        // status change monitor.

This comment doesn't sound right to me--even if we're not on the main thread here, we're not dispatching within the status change monitor, right?  (The monitor was released above when we exited the block.)

::: xpcom/threads/nsThreadManager.h
@@ +76,5 @@
>    void SetThreadIdle();
>    void SetThreadWorking();
> +  void SetThreadIsWorking(ThreadStatusInfo* aInfo,
> +                          bool aIsWorking,
> +                          nsIRunnable **aReturnRunnable);

I think this should be instead:

void SetThreadIdle(nsIRunnable** aReturnRunnable);

It needs a comment describing why we have this method and |SetThreadIdle|.
Attachment #8528789 - Flags: review?(nfroyd) → feedback+
(In reply to Nathan Froyd (:froydnj) from comment #73)
> @@ +1088,5 @@
> >  nsThread::SetWorking()
> >  {
> >    nsThreadManager::get()->SetThreadIsWorking(
> > +    static_cast<nsThreadManager::ThreadStatusInfo*>(mThreadStatusInfo),
> > +    true, nullptr);
> 
> Why not |nsThreadManager::get()->SetThreadWorking()|?  Is it slightly less
> efficient, but I think it's worth it.  Then we don't need to store
> mThreadStatusInfo in nsThread, which reduces coupling between it and
> nsThreadManager.
> 
This function is used to set the thread which is represented by the nsThread. nsThreadManager::SetThreadWorking can only set status of currently running thread since it gets the ThreadStatusInfo from TLS.

> @@ +548,5 @@
> > +        // If we want to dispatch a notification task, we need to dispatch
> > +        // after left mMonitor. But we should set main thread to working
> > +        // before we leave.
> > +        mMainThreadStatusInfo->mWorking = true;
> > +        mMainThreadStatusInfo->mWillBeWorking = true;
> 
> This makes me nervous.  Why can't something like the following happen:
> 
> 0. Main thread is idle, enters this function.
> 1. Main thread sets itself as |mWillBeWorking = false|;
> 2. Context switch someplace else.
> 3. Thread X enters this loop, discovers all threads idle.
> 4. Thread X sets these two members to |true|.
> 5. Event is dispatched to the main thread.
> 6. Main thread resumes, enters the above loop, and sets itself as |mWorking
> = false|.
> 
> Now these two members have gotten out of sync.
> 
Yes, it can happen. Thank you for catching this. 

> I think we must have locking to protect us from doing this sort of thing,
> but I'm having trouble seeing it at the moment.
> 
> @@ +556,5 @@
> > +
> > +    if (runnable) {
> > +      if (NS_IsMainThread()) {
> > +        // If we are on main thread, we can't dispatch to ourself while in
> > +        // status change monitor.
> 
> This comment doesn't sound right to me--even if we're not on the main thread
> here, we're not dispatching within the status change monitor, right?  (The
> monitor was released above when we exited the block.)
> 
It means |mThreadStatusMonitor|, that we have hold before entering this function to make thread status consistent to the actual thread status. I will make this comment more precise here.
Updated patch.

I added a flag |nsThreadManager::mDispatchingToMainThread|, it is set when we found threads are all idle and before the task of notifying listeners has been dispatched. The flag changed and read under |nsThreadManager::mMonitor| is to prevent another notification event from being fired before the task made main thread no longer idle. Thus we don't need to change main thread's status in another thread's turn of checking threads' status, and will prevent status of main thread from being out of sync.

Another change is to turn |DeleteThreadStatusInfo| into a static member of |nsThreadManager|, to access |mThreadStatusInfos| and |mMonitor| of |nsThreadManager| when cleaning up |ThreadStatusInfo|.

And improve comments.
Attachment #8528789 - Attachment is obsolete: true
Attachment #8536302 - Flags: review?(nfroyd)
Comment on attachment 8536302 [details] [diff] [review]
Part 1b: fix dead lock.

Review of attachment 8536302 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/threads/nsThreadManager.cpp
@@ +554,5 @@
> +    if (runnable) {
> +      if (NS_IsMainThread()) {
> +        // We are in main thread's |nsThread::mThreadStatusMonitor|. Dispatching
> +        // a task to ourself put us in dangerous of forming deadlock. Returning
> +        // the task and let our caller to dispatch for us.

"We are holding the main thread's |nsThread::mThreadStatusMonitor|.  If we dispatch a task to ourself, then we are in danger of causing deadlock.  Instead, return the task, and let the caller dispatch it for us."

::: xpcom/threads/nsThreadManager.h
@@ +76,5 @@
> +
> +  // |SetThreadWorking| and |SetThreadIdle| set status of thread that is
> +  // currently running. They get thread status information from TLS and pass
> +  // the information to |SetThreadIsWorking|.
> +  void SetThreadIdle(nsIRunnable **aReturnRunnable);

Nit: |nsIRunnable** aReturnRunnable|.

@@ +87,5 @@
> +  // |ResetIsDispatchingToMainThread| should be invoked after caller on main
> +  // thread dispatched the task to main thread's queue.
> +  void SetThreadIsWorking(ThreadStatusInfo* aInfo,
> +                          bool aIsWorking,
> +                          nsIRunnable **aReturnRunnable);

Nit: |nsIRunnable** aReturnRunnable|
Attachment #8536302 - Flags: review?(nfroyd) → review+
Depends on: 1121364
No longer blocks: CAF-v3.0-FL-metabug
We need to back the patches out on central and find another solution that is safer and not as fragile. But 1151672 contains the discussion for this decision. While this bug is backed out, bug 1138620 also needs to be backed out.
(In reply to Pulsebot from comment #80)
> Backout:
> https://hg.mozilla.org/integration/b2g-inbound/rev/78b19f9f8da8
> https://hg.mozilla.org/integration/b2g-inbound/rev/4d2839eea957

(In reply to Cervantes Yu from comment #79)
> We need to back the patches out on central and find another solution that is
> safer and not as fragile. But 1151672 contains the discussion for this
> decision. While this bug is backed out, bug 1138620 also needs to be backed
> out.

sorry had to back this out since this broke ics debug tests like https://treeherder.mozilla.org/logviewer.html#?job_id=1958360&repo=b2g-inbound
Flags: needinfo?(cyu)
Canceling NI flag. The bustage is fixed and the change is repushed.
Flags: needinfo?(cyu)
You need to log in before you can comment on or make changes to this bug.