Closed Bug 1155547 Opened 9 years ago Closed 9 years ago

Wait Nuwa forking a process if preallocated process isn't present

Categories

(Core :: IPC, defect)

ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

()

RESOLVED FIXED
2.2 S14 (12june)
blocking-b2g 2.5?
Tracking Status
firefox42 --- fixed
b2g-v2.2 --- affected

People

(Reporter: kk1fff, Assigned: cyu)

References

(Blocks 1 open bug)

Details

(Whiteboard: [caf priority: p3][CR 786372])

Attachments

(6 files, 13 obsolete files)

140.36 KB, text/plain
Details
538.71 KB, text/plain
Details
145.56 KB, text/plain
Details
535.12 KB, text/plain
Details
3.64 KB, text/plain
cyu
: review+
Details
50.81 KB, patch
cyu
: review+
Details | Diff | Splinter Review
Attached patch proposed patch (obsolete) — Splinter Review
+++ This bug was initially created as a clone of Bug #1053634 +++

I discussed with Kanru. As the change is really large to make loading an frame asynchronous, and the complexity doesn't seen converge now, the risk of taking patches of 1053634 to 2.2 has increased. It have been already quite hard to just rebase the patches onto 2.2, and it could incur unexpected regression.

I've implemented short version that synchronously waits for nuwa forking a process when a process request comes after nuwa ready. I think we would need more time and complex work to make loading a frame become a real asynchronous task.
Attachment #8593798 - Flags: review?(khuey)
Attached patch Patch (obsolete) — Splinter Review
with 8 lines of context.
Attachment #8593798 - Attachment is obsolete: true
Attachment #8593798 - Flags: review?(khuey)
Attachment #8593800 - Flags: review?(khuey)
Flags: needinfo?(mlee)
Summary: Make all app processes forked from Nuwa → Wait Nuwa forking a process if preallocated process isn't present
According to comment 93 of bug 1053634, this bug is the simpler fix of a 2.2+ bug
blocking-b2g: --- → 2.2+
Attached patch Patch (obsolete) — Splinter Review
Frameloader can try several times before it actually get the process, adding a waiting list to make sure only one process per manifest url.
Attachment #8593800 - Attachment is obsolete: true
Attachment #8593800 - Flags: review?(khuey)
Attachment #8594714 - Flags: review?(khuey)
Attached patch Patch (obsolete) — Splinter Review
Attachment #8594714 - Attachment is obsolete: true
Attachment #8594714 - Flags: review?(khuey)
Attachment #8594717 - Flags: review?(khuey)
Assignee: nobody → kk1fff
Hi Patrick,

I noticed your needinfo request to me but don't see a question indicating what info do you need from me. Please clarify.

Thanks,
Mike
Flags: needinfo?(mlee) → needinfo?(kk1fff)
(In reply to Mike Lee [:mlee] from comment #5)
> I noticed your needinfo request to me but don't see a question indicating
> what info do you need from me. Please clarify.

I just like to know if you have any thought about this as a quick fix of bug 1053634, since we had some email contact about that bug. :)
Flags: needinfo?(kk1fff)
(In reply to Patrick Wang (Chih-Kai Wang) (:kk1fff) from comment #6)
> I just like to know if you have any thought about this as a quick fix of bug
> 1053634, since we had some email contact about that bug. :)
                    ^ I meant.. had some conversations on mail
Low(er) risk fix for v2.2 sounds great to me.  How will this compare to the previous approach in terms of memory and performance?   

Once this lands will be plugin-container process be used for anything in b2g?
(In reply to Michael Vines [:m1] [:evilmachines] from comment #8)
> Low(er) risk fix for v2.2 sounds great to me.  How will this compare to the
> previous approach in terms of memory and performance?   

The patch simply require all process being forked from Nuwa. If this landed, I'd see original one as an architectural refactor for making loading frame asynchronous.

Both original one and this don't wait for Nuwa ready. As bug 1053634 comment 52 states, index db becomes busy when an x-heavy workload pushed, and Nuwa needs several minutes to becoming ready, waiting for Nuwa doesn't seem feasible. Thus, before Nuwa ready, we just fork from B2G. But after Nuwa ready, both approaches make sure that all processes are forked from Nuwa. The difference is that the previous one do it asynchronously while this one just enter nested loop and wait. I think the memory consumption should be close, and the loading time increases only when no preallocated process present, which is within 5 seconds after a process launched.
(In reply to Patrick Wang (Chih-Kai Wang) (:kk1fff) from comment #7)
> (In reply to Patrick Wang (Chih-Kai Wang) (:kk1fff) from comment #6)
> > I just like to know if you have any thought about this as a quick fix of bug
> > 1053634, since we had some email contact about that bug. :)
>                     ^ I meant.. had some conversations on mail

I agree with mvines comment 8 re: this lower risk patch.
Comment on attachment 8594717 [details] [diff] [review]
Patch

Review of attachment 8594717 [details] [diff] [review]:
-----------------------------------------------------------------

I'm afraid of this change, but I'm willing to try it.

::: dom/ipc/PreallocatedProcessManager.h
@@ +85,5 @@
>    static void MaybeForgetSpare(ContentParent* aContent);
>    static bool IsNuwaReady();
>    static void OnNuwaReady();
>    static bool PreallocatedProcessReady();
> +  static already_AddRefed<ContentParent> WaitNewProcess(const nsAString& aManifestURL);

This is nit-picking, but I would prefer to call this BlockForNewProcess.

Regardless, it should definitely have a "For" in there.
Attachment #8594717 - Flags: review?(khuey) → review+
Attached patch Patch (obsolete) — Splinter Review
Thanks, Kyle!

Fixing nit.
Attachment #8594717 - Attachment is obsolete: true
Attachment #8598383 - Flags: review+
Patrick, can you provide a status update here on the outlook for getting this patch re-landed?
I am going to move a patch that disables nuwa on xpcshell to here. B2G on try seems broken yesterday, thus I can't run the test cases. I'll try today.
Flags: needinfo?(kk1fff)
Looks like disabling Nuwa on xpcshell doesn't help. oauth test timed out. I would like to see if Cervantes could help take a look..
Flags: needinfo?(cyu)
Hey Patrick, can you please attach a v2.2 patch for this bug?  We'll pull this in to our build in parallel to the xpchell test debug.
Flags: needinfo?(kk1fff)
Attached patch Patch on 2.2 (obsolete) — Splinter Review
Flags: needinfo?(kk1fff)
I'll continue working on this.
Assignee: kk1fff → cyu
Flags: needinfo?(cyu)
We applied the patch but it is causing major instability.  cafbot will attach minidump and log info in a few minutes.  Please do not merge the change as-is, thanks!
Greg, is there any STR for the above crashes? Thanks.
Flags: needinfo?(ggrisco)
(In reply to Cervantes Yu from comment #26)
> Greg, is there any STR for the above crashes? Thanks.

We found this in stability test, so no clear steps are available, although it is happening quite often.  We've seen it around 40 times in a relatively small number of test hours.
Flags: needinfo?(ggrisco)
I also reproduced the crash. It turns out that TabParent::mFrameElement is destroyed and then is later used in referenced in TabParent::IsVisible() or TabParent::GetFrameLoader(). I guess the nested loop used in this patch reenters some code that is not reentrant. Need to investigate more.
Flags: needinfo?(ggrisco)
Flags: needinfo?(ggrisco)

(In reply to Cervantes Yu from comment #28)
> I also reproduced the crash. It turns out that TabParent::mFrameElement is
> destroyed and then is later used in referenced in TabParent::IsVisible() or
> TabParent::GetFrameLoader(). I guess the nested loop used in this patch
> reenters some code that is not reentrant. Need to investigate more.
Cervantes, What is your status? Are you still investigating or are you working on a fix?
See Also: → 1053634
Flags: needinfo?(cyu)
I am still trying to fix the crash or prove that the approach of using a nested loop is a dead end. I plan to have a conclusion by the end of this week.
Flags: needinfo?(cyu)
It turns out that the assertion placed in nsFrameLoader::TryRemoteBrowser() is violated. 2 TabParents are created for the same nsFrameLoader. So the 2nd TabParent can reference the dangling pointer to the nsGenericHTMLFrameElement if the frame element is destroyed.
Update to the previous patch: make it safe to reenter at places that may enter the nested loop.

A hard lesson for using the nested loop is that we must assume that the method invocation chain can repeat itself with the use of a nested loop. The crash happens because the nested loop is called inside a constructor-like method: ContentParent::CreateBrowserOrApp(). When the constructor-like method repeats itself, a frame loader creates 2 TabParent instances. When the frame element is destroyed, one of the 2 TabParent instances isn't unaware about it and holds a dangling weak reference. Later the TabParent instance crashes when dereferencing it.

The crash is fixed by making the call that might enter a nested loop explicit as ContentParent::BlockForNewProcess(). nsFrameLoader::TryRemoteBrowser() and ContentParent::RecvCreateChildProcess(). The frame loader also checks whether the remote browser is created deep inside the nested loop and skip the creation of another one if the nested loop already created it.
Attachment #8598383 - Attachment is obsolete: true
Attachment #8606234 - Flags: review?(khuey)
Attached file Patch on 2.2 (obsolete) —
Greg, would you initiate a round of stability test with this patch to see if there are still other crashes or quirks? Thanks.
Attachment #8599939 - Attachment is obsolete: true
Attachment #8606253 - Flags: feedback?(ggrisco)
>
> Greg, would you initiate a round of stability test with this patch to see if
> there are still other crashes or quirks? Thanks.

Hi Cervantes, we won't be able to bring this in now since we are trying to stabilize for major milestone.  Once you have stabilized this on master and our milestone has passed, we can bring it in.  Thanks.
Attachment #8606253 - Flags: feedback?(ggrisco)
Whiteboard: [caf priority: p1][CR 786372] → [caf priority: p3][CR 786372]
Kyle, have you had a chance to start looking at this review yet?
Flags: needinfo?(khuey)
Comment on attachment 8606234 [details] [diff] [review]
Use a nested loop to wait until Nuwa completes forking the process.

Review of attachment 8606234 [details] [diff] [review]:
-----------------------------------------------------------------

I was going to hold off on r+ing this until it got some testing, but whatever.  I still don't think we should check this in to branches without extensive testing.
Attachment #8606234 - Flags: review?(khuey) → review+
From the push log, some tests seem to be permfail. I think there are still other issues with the use of nested loop. We need to consider another alternative.
Hi Cervantes,
Do you have any update about the failed test for push?
Thanks!
Flags: needinfo?(cyu)
(In reply to Josh Cheng [:josh] from comment #39)
> Hi Cervantes,
> Do you have any update about the failed test for push?
> Thanks!

I give up the approach in the original patch because the approach is too risky. I am working on another approach to fix this bug.
Flags: needinfo?(cyu)
My plan B is moving the Nuwa operations to the background thread. When the frameloader requests a process but there is no spare one, the request will be processed using the background thread while the frameloader gets blocked (hopefully a short time). This is supposed to be much safer than using a nested loop.

Kyle, do you see any issue with this approach?
Flags: needinfo?(khuey)
Target Milestone: --- → 2.2 S14 (12june)
That sounds like a good approach to try.

This should be dropped from the 2.2 scope though.  There's nowhere near enough time to be confident in the stability of a change like that for 2.2.
Flags: needinfo?(khuey)
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #42)
> That sounds like a good approach to try.
> 
> This should be dropped from the 2.2 scope though.  There's nowhere near
> enough time to be confident in the stability of a change like that for 2.2.

Agree with Kyle. And according to whiteboard, this is a bug with caf priority 3 for partner. It seems not a
very critical issue. Can we discuss with partner to remove it from CAF Blocker list?
Flags: needinfo?(jocheng)
Flags: needinfo?(doliver)
Confirmed that we will move this off the 2.2 list. Bumping to 3.0? to retain visibility.
blocking-b2g: 2.2+ → 3.0?
Flags: needinfo?(jocheng)
Flags: needinfo?(doliver)
Removing dependency on CAF 2.2, per agreement with partner.
No longer blocks: CAF-v2.2-metabug
Attached patch Patch (WIP) (obsolete) — Splinter Review
This is the rollup patch for the current WIP.

TODO:
* All fork requests are sync now. The sync fork request takes 60 to 90 ms on flame. We should only block when we need one ContentParent and there is no prellocated process. The request to fork a preallocated process should run async.
Attachment #8606234 - Attachment is obsolete: true
Attachment #8606253 - Attachment is obsolete: true
No longer blocks: CAF-v2.2-metabug
Attachment #8617276 - Attachment is obsolete: true
Attachment #8617875 - Flags: review?(khuey)
Attachment #8617876 - Flags: review?(khuey)
Changes: fix PreallocatedProcessManagerImpl::GetSpareProcess() and missing #include in NuwaChild.cpp.
Attachment #8617876 - Attachment is obsolete: true
Attachment #8617876 - Flags: review?(khuey)
Attachment #8623560 - Flags: review?(khuey)
Am I expecting a new patch here?  Or is this the latest version?
Flags: needinfo?(cyu)
This is the latest version, but I just found a conflict in updating m-c. I will rebase and fix a build break on Windows.
Flags: needinfo?(cyu)
Fix build breaks on Windows and Linux.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=be14c1ef2408
Attachment #8623560 - Attachment is obsolete: true
Attachment #8623560 - Flags: review?(khuey)
Attachment #8628126 - Flags: review?(khuey)
Comment on attachment 8628126 [details] [diff] [review]
Part 2: PNuwa protocol implementation

Review of attachment 8628126 [details] [diff] [review]:
-----------------------------------------------------------------

r=me, looks good with comments addressed.

::: dom/ipc/ContentChild.cpp
@@ +532,5 @@
>      if (IsNuwaProcess()) {
>          return;
>      }
>  
> +

nit: extra /n

::: dom/ipc/ContentParent.h
@@ +938,5 @@
>  #endif
>  
>      PProcessHangMonitorParent* mHangMonitorActor;
> +
> +    // NuwaParent and ContentParent holds strong references to each other. The

hold

::: dom/ipc/NuwaChild.cpp
@@ +34,5 @@
> +#ifdef MOZ_NUWA_PROCESS
> +
> +namespace {
> +
> +// FIXME indentation: 2

I think you fixed this.

::: dom/ipc/NuwaParent.cpp
@@ +46,5 @@
> +  , mClonedActor(nullptr)
> +  , mWorkerThread(do_GetCurrentThread())
> +  , mNewProcessPid(0)
> +{
> +  MOZ_ASSERT(!NS_IsMainThread());

AssertIsOnBackgroundThread();

@@ +103,5 @@
> +  MOZ_ASSERT(runnable);
> +  MOZ_ALWAYS_TRUE(NS_SUCCEEDED(mWorkerThread->Dispatch(runnable,
> +                                                       NS_DISPATCH_NORMAL)));
> +
> +  lock.Wait();

wait in a loop

@@ +114,5 @@
> +  // We can't call ActorConstructed() right after Alloc() in the above runnable.
> +  // To be safe we dispatch a runnable to the current thread to do it.
> +  runnable = NS_NewRunnableFunction([actor] () -> void
> +  {
> +    // On the main thread.

Add a MOZ_ASSERT(NS_IsMainThread()) in this one.

@@ +117,5 @@
> +  {
> +    // On the main thread.
> +    nsCOMPtr<nsIRunnable> nested = NS_NewRunnableFunction([actor] () -> void
> +    {
> +      // Call NuwaParent::ActorConstructed() on the worker thread.

And an AssertIsOnBackgroundThread() in this one.

@@ +150,5 @@
> +    contentParent->SetNuwaParent(nullptr);
> +    // Need to clear the ref to ContentParent on the main thread.
> +    self->mContentParent = nullptr;
> +
> +  });

nit: extra \n

@@ +166,5 @@
> +  }
> +
> +  nsCOMPtr<nsIRunnable> runnable =
> +    NS_NewNonOwningRunnableMethod(mContentParent.get(),
> +                                  &ContentParent::OnNuwaReady);

Add a comment about how NonOwning is safe because destroying the mContentParent ref would have to go to the main thread anyways.

@@ +236,5 @@
> +    return false;
> +  }
> +
> +  MonitorAutoLock lock(mMonitor);
> +  mBlocked =  true;

nit: only one space here

@@ +237,5 @@
> +  }
> +
> +  MonitorAutoLock lock(mMonitor);
> +  mBlocked =  true;
> +  lock.Wait();

wait on the condvar in a loop

::: dom/ipc/NuwaParent.h
@@ +17,5 @@
> +
> +class ContentParent;
> +
> +class NuwaParent: public mozilla::dom::PNuwaParent
> +{

nit: "NuwaParent :"
Attachment #8628126 - Flags: review?(khuey) → review+
Rebased to latest and added/removed #include's accordingly.
Attachment #8617875 - Attachment is obsolete: true
Attachment #8640980 - Flags: review+
Rebased and addressed review comments.
Attachment #8628126 - Attachment is obsolete: true
Attachment #8640981 - Flags: review+
url:        https://hg.mozilla.org/integration/mozilla-inbound/rev/2bdcd74e503d782614c12a9fb013ad80e1784de2
changeset:  2bdcd74e503d782614c12a9fb013ad80e1784de2
user:       Cervantes Yu <cyu@mozilla.com>
date:       Fri Jul 31 15:25:14 2015 +0800
description:
Bug 1155547, Part 1: Fix unified build breakage in adding new sources under dom/ipc/. r=khuey

url:        https://hg.mozilla.org/integration/mozilla-inbound/rev/89d7e4e93bec246c56efe4225f5e816763e4ba5d
changeset:  89d7e4e93bec246c56efe4225f5e816763e4ba5d
user:       Cervantes Yu <cyu@mozilla.com>
date:       Fri Jul 31 15:25:27 2015 +0800
description:
Bug 1155547, Part 2: Create PNuwa protocol (managed by PBackground) for forking content processes. r=khuey

This allows us to send a sync fork request to the Nuwa process when we need one but there is no
spare process available. After an app is launched, the request to fork a spare process is still
handled asynchronously.
Depends on: 1194180
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: