Closed Bug 1151672 Opened 4 years ago Closed 4 years ago

Preallocated process approach not working on 2.2 and 3.0

Categories

(Firefox OS Graveyard :: Performance, defect, P1)

ARM
Gonk (Firefox OS)
defect

Tracking

(firefox41 fixed, b2g-v2.2 unaffected, b2g-master fixed)

RESOLVED FIXED
2.2 S13 (29may)
Tracking Status
firefox41 --- fixed
b2g-v2.2 --- unaffected
b2g-master --- fixed

People

(Reporter: bzumwalt, Assigned: cyu)

References

Details

(Keywords: perf, regression, Whiteboard: [3.0-Daily-Testing][MemShrink:P1])

Attachments

(9 files)

Using a Flame-KK device running 3.0 Master nightly build on 319mb memory, several apps show performance regressions. This issue can be seen in Raptor graphs (http://raptor-ui.divshot.io/#/dashboard/script/mark.js?device=flame-kk&branch=master&memory=319&suite=coldlaunch)

According to Raptor the issue occurred between 9:51 and 1:41 on April 1st

Gecko Pushlog:
http://hg.mozilla.org/mozilla-central/pushloghtml?startdate=2015-04-01+09%3A51%3A55&enddate=2015-04-01+13%3A41%3A49

Gaia Pushlog:
https://github.com/mozilla-b2g/gaia/compare/ece334a1a93af6c7...9f04770b8a435911
Issue does not occur in Flame 2.2. Attached screenshot of Raptor graphs

Device: Flame 2.2
Build ID: 20150406002503
Gaia: a6351e1197d54f8624523c2db9ba1418f2aa046f
Gecko: c3335a5d3063
Gonk: ebad7da532429a6f5efadc00bf6ad8a41288a429
Version: 37.0 (2.2)
Firmware Version: v18D-1
User Agent: Mozilla/5.0 (Mobile; rv:37.0) Gecko/37.0 Firefox/37.0
Video comparison of app cold launch times from Flame 2.2 (env vars from comment 1) and Flame 3.0 (env vars below:

http://youtu.be/kkOzIiKwCRg

Device: Flame 3.0
Build ID: 20150331010203
Gaia: a249df8f4c84fe0a139741f05a534d36996ea7b8
Gecko: 8af276ab8636
Gonk: b83fc73de7b64594cd74b33e498bf08332b5d87b
Version: 40.0a1 (3.0)
Firmware Version: v18D-1
User Agent: Mozilla/5.0 (Mobile; rv:40.0) Gecko/40.0 Firefox/40.0
QA Whiteboard: [QAnalyst-Triage?]
Flags: needinfo?(ktucker)
Keywords: regression
[Blocking Requested - why for this release]:

I am nominating this 3.0? since performance has regressed significantly from 2.2. See the video in Comment 2.
blocking-b2g: --- → 3.0?
QA Whiteboard: [QAnalyst-Triage?] → [QAnalyst-Triage+]
Flags: needinfo?(ktucker)
I've checked this locally, but the numbers raptor reported did not match to what I saw from logcat.

Metric                            Mean      Median    Min       Max       StdDev  p95     
--------------------------------  --------  --------  --------  --------  ------  --------
coldlaunch.visuallyLoaded         2576.600  2544.000  2467.000  2763.000  88.455  2763.000

I/PerformanceTiming(13776): Performance Entry: communications.gaiamobile.org|mark|visuallyLoaded|1060.493854|0.000000|1428512653006
I/PerformanceTiming(13846): Performance Entry: communications.gaiamobile.org|mark|visuallyLoaded|1084.776822|0.000000|1428512662879
I/PerformanceTiming(13927): Performance Entry: communications.gaiamobile.org|mark|visuallyLoaded|1090.661041|0.000000|1428512672757
I/PerformanceTiming(13991): Performance Entry: communications.gaiamobile.org|mark|visuallyLoaded|1097.663177|0.000000|1428512682752
I/PerformanceTiming(14066): Performance Entry: communications.gaiamobile.org|mark|visuallyLoaded|1061.067083|0.000000|1428512692611
I/PerformanceTiming(14136): Performance Entry: communications.gaiamobile.org|mark|visuallyLoaded|1110.115105|0.000000|1428512702508
I/PerformanceTiming(14207): Performance Entry: communications.gaiamobile.org|mark|visuallyLoaded|1184.060416|0.000000|1428512712615
I/PerformanceTiming(14291): Performance Entry: communications.gaiamobile.org|mark|visuallyLoaded|1101.221198|0.000000|1428512722422
I/PerformanceTiming(14358): Performance Entry: communications.gaiamobile.org|mark|visuallyLoaded|1119.130990|0.000000|1428512732421
I/PerformanceTiming(14435): Performance Entry: communications.gaiamobile.org|mark|visuallyLoaded|1085.446666|0.000000|1428512742417
(In reply to Ting-Yu Chou [:ting] from comment #4)
> I've checked this locally, but the numbers raptor reported did not match to
> what I saw from logcat.

Seems I misunderstood the performance entry, I thought it is the column right after visuallyLoaded (startTime), but Raptor calculates visuallyLoaded by comparing the last column (epoch) with appLaunch.
I noticed the preallocated process does not exist, and seems is the cause of regression. Am bisecting now.
This is the result of bisect, note 46c549 ~ 165cbd all failed building, I can only build 165cbd with fix bfa009 [1] which is bad. The regression could be from one or all of them.

First bad:
  165cbd99c2ce
  2015-02-16 09:48 -0800 Ben Turner - Bug 1131776 - Use WITHOUT ROWID tables for IndexedDB, r=janv.
  b0c5f21c7811
  2015-01-25 22:30 -0800 Ben Turner - Bug 1112702 - Make IndexedDB transactions non-durable by default, r=khuey.
  d0566885a653
  2014-12-02 12:01 -0800 Ben Turner - Bug 1131766 - Use a simpler thread model for IndexedDB transactions, r=khuey.
  46c549bb6343
  2015-01-24 08:16 -0800 Ben Turner - Bug 866846 - Use WAL journal mode for IndexedDB databases, r=janv.
Last good:
  0253a852074f 
  2015-03-31 15:51 -0700 Ben Turner - Revert unintentional change to crash reporting infra in changeset ce2692d64bcf DONTBUILD

[1] bfa0009deb1e
    2015-03-31 16:27 -0700 Ben Turner - Bustage fix for bug 1131776, s/moz_malloc/malloc/ on CLOSED TREE
Flags: needinfo?(bent.mozilla)
Hm, the new code should make all databases have a one-time upgrade from 2.2->3.0 that could take a little while, but thereafter opening a database should be just as fast as 2.2.

Is your test comparing app startup after an upgrade?

I guess we may also be pre-populating fresh 3.0 builds with databases in the old 2.2 format that would cause us to always do an upgrade step on first launch...
Flags: needinfo?(bent.mozilla)
Oh, these changes should have no effect on the preallocated process. If that is broken on 3.0 then obviously we would have perf problems.
(In reply to Ben Turner [:bent] (use the needinfo flag!) from comment #8)
> Is your test comparing app startup after an upgrade?

No. Check the bug description, the regression was noticed from Raptor.

(In reply to Ben Turner [:bent] (use the needinfo flag!) from comment #9)
> Oh, these changes should have no effect on the preallocated process. If that
> is broken on 3.0 then obviously we would have perf problems.

Not sure how, but it is broken because of the changes at comment 7.
(In reply to Ben Turner [:bent] (use the needinfo flag!) from comment #8)
> Is your test comparing app startup after an upgrade?

Raptor's cold launch test primes applications before gathering performance numbers, so as long as all the upgrade work is completed in the first launch, all subsequent launches should be normal and therefore not exhibit a regression.
(In reply to Ting-Yu Chou [:ting] from comment #10)
> Not sure how, but it is broken because of the changes at comment 7.

Hrm, you mean that if you make a build with just those patches removed you suddenly see the preallocated process working as expected?
(In reply to :Eli Perelman from comment #11)
> Raptor's cold launch test primes applications before gathering performance
> numbers, so as long as all the upgrade work is completed in the first
> launch, all subsequent launches should be normal and therefore not exhibit a
> regression.

Cool, thanks. I've never heard of raptor before so info like this is good to have.
No problem, it was only released officially last week. Here's some more info:

https://developer.mozilla.org/en-US/Firefox_OS/Automated_testing/Raptor
(In reply to Ben Turner [:bent] (use the needinfo flag!) from comment #12)
> (In reply to Ting-Yu Chou [:ting] from comment #10)
> > Not sure how, but it is broken because of the changes at comment 7.
> 
> Hrm, you mean that if you make a build with just those patches removed you
> suddenly see the preallocated process working as expected?

Yes, bisect shows the last good is:

  0253a852074f 
  2015-03-31 15:51 -0700 Ben Turner - Revert unintentional change to crash reporting infra in changeset ce2692d64bcf DONTBUILD

Preallocated process does show up and no app startup regression.
(In reply to Ting-Yu Chou [:ting] from comment #15)
> Yes, bisect shows the last good is: 0253a852074f 

That's not exactly what I'm asking though... Presumably bisect has removed all patches from 0253a852074f to tip, right? What if you build tip with just those patches listed in comment 7 taken out?
(In reply to Ben Turner [:bent] (use the needinfo flag!) from comment #16)
> (In reply to Ting-Yu Chou [:ting] from comment #15)
> > Yes, bisect shows the last good is: 0253a852074f 
> 
> That's not exactly what I'm asking though... Presumably bisect has removed
> all patches from 0253a852074f to tip, right? What if you build tip with just
> those patches listed in comment 7 taken out?

I am confused, why would you want to try that? I built 165cbd99c2ce with bfa0009deb1e included and it showed the regression. Between 0253a852074f and 165cbd99c2ce are only the patches in comment 7.
(In reply to Ting-Yu Chou [:ting] from comment #17)

The reason I'm curious is that the code changes in those patches don't go anywhere near the code that affects the preallocated process. It's like saying that what I ate for lunch today changed the weather in Montreal :) The most likely explanation I can think of is that we the timing has changed just enough to expose a bug that already existed.

To be clear, when you run 'b2g-ps' on the phone, is the preallocated process just not there? Or how are you deducing that it is a problem with the preallocated process?
(In reply to Ben Turner [:bent] (use the needinfo flag!) from comment #18)
> (In reply to Ting-Yu Chou [:ting] from comment #17)
> 
> The reason I'm curious is that the code changes in those patches don't go
> anywhere near the code that affects the preallocated process. It's like
> saying that what I ate for lunch today changed the weather in Montreal :)
> The most likely explanation I can think of is that we the timing has changed
> just enough to expose a bug that already existed.

I can relate. But it is based on the test results, not my guess... I have double confirmed that before leaving comment 7.

> To be clear, when you run 'b2g-ps' on the phone, is the preallocated process
> just not there? Or how are you deducing that it is a problem with the
> preallocated process?

It's just not there.
Summary: Performance regression on Flame 3.0 Master Nightly builds across multiple apps → Preallocated process not running on Flame 3.0 Master Nightly builds
Thinker, I think we're going to need help from someone on your team. Gregor and I sat down with this and found that the preallocated process fails to start reliably on trunk, both with the patches in comment 7 applied and with them backed out. It looks like there's some kind of serious race in the preallocated process machinery with Nuwa.

So far all I have been able to figure out is that we boot and end up with three processes after everything stabilizes: b2g, nuwa, and homescreen. (There is a brief 'smart collections process' but it eventually disappears.) The 'preallocated' process is never shown by 'b2g-ps'. Nothing changes, no matter how long we wait.

Then, if we launch another app (sms, let's say), the process launches and we end up with: b2g, nuwa, homescreen, sms. We should only have to wait about 5 seconds for the 'preallocated' process to launch, but it never does.

We attached gdb during the launch of the sms app and saw that the b2g process is receiving the FirstIdle message here: http://hg.mozilla.org/mozilla-central/file/dd32e3ff3717/dom/ipc/ContentParent.cpp#l2667

We're then calling PreallocatedProcessManager::AllocateAfterDelay, which in MOZ_NUWA_PROCESS builds calls PreallocatedProcessManagerImpl::ScheduleDelayedNuwaFork. We're then creating a task to fork nuwa here: http://hg.mozilla.org/mozilla-central/file/dd32e3ff3717/dom/ipc/PreallocatedProcessManager.cpp#l238

After the timeout PreallocatedProcessManagerImpl::DelayedNuwaFork is called and we end up here: http://hg.mozilla.org/mozilla-central/file/dd32e3ff3717/dom/ipc/PreallocatedProcessManager.cpp#l268

Here's where the problem is, somehow: |mIsNuwaReady| is false but |mPreallocatedAppProcess| is non-null. So we don't do anything, and no additional process is launched.

We can repeat this series of steps multiple times and we never see any 'preallocated' process launch.

Any idea what's going wrong here?
Flags: needinfo?(tlee)
Leave the phone for a while (several minutes), Nuwa gets ready and preallocated process shows up. So this could be some threads do not enter idle state and are not skipped.
Aha! I think I know what's going on.

It looks to me like bug 970307 calculates whether or not a thread is "idle" incorrectly: it requires mRunningEvent to be 0 before it gets marked (http://hg.mozilla.org/mozilla-central/annotate/fec90cbfbaad/xpcom/threads/nsThread.cpp#l880).

Any thread with a nested event loop (like these new IndexedDB threads) will *never* get marked as idle until they have unwound completely (e.g. http://hg.mozilla.org/mozilla-central/file/078128c2600a/dom/indexedDB/ActorsParent.cpp#l10593).

I'll have a patch for this tomorrow.
Blocks: 970307, 1131766
Flags: needinfo?(tlee)
Hrm, except I can't test any patch here because bug 1050122 has Nuwa disabled on DEBUG builds!
The pre-allocated process is broken and all tests are green? The pre-allocated process is disabled on debug builds by default? What can go wrong. Can we fix this right now please?
Flags: needinfo?(tlee)
Flags: needinfo?(kchang)
Attached patch Minimal fixSplinter Review
Patrick, what do you think about this?
Assignee: nobody → bent.mozilla
Status: NEW → ASSIGNED
Attachment #8591106 - Flags: review?(kk1fff)
Comment on attachment 8591106 [details] [diff] [review]
Minimal fix

Review of attachment 8591106 [details] [diff] [review]:
-----------------------------------------------------------------

I think we should make the IndexDB thread report its status in its own way, using |nsThreadManager::get()->SetThreadIdle(nullptr)| and |nsThreadManager::get()->SetThreadWorking()|. This is the case that nsThread doesn't actually know whether a thread is working or not, thus the thread needs to report the status by itself. In a thread that uses event queue normally, having a nested loop should be treated as there is still a task running (the one that runs nested loop), and thus we prevent the "all threads are idle" event from firing.
Attachment #8591106 - Flags: review?(kk1fff)
Ben is working on this bug. Thanks, Ben.
Flags: needinfo?(kchang)
(In reply to Ken Chang[:ken] from comment #27)
> Ben is working on this bug. Thanks, Ben.

Ben fixes this specific issue which should have been caught by basic tests. We are wasting developers time here.
The much bigger problem is that we neither have tests nor debug-enabled gecko support for the pre-allocated process or nuwa. We need someone else to fix this.
Flags: needinfo?(kchang)
(In reply to Gregor Wagner [:gwagner] from comment #28)
> The much bigger problem is that we neither have tests nor debug-enabled
> gecko support for the pre-allocated process or nuwa. We need someone else to
> fix this.

I am working on bug 1050122 which is for enabling Nuwa on debug build, the pathes are there but haven't been landed since some media tests take longer to finish and cause timeout easily.

I will fix it.
I can fix the IndexedDB threads specifically, yes. But eventually someone else will make a new thread that doesn't behave exactly like our Nuwa code currently expects... And then someone else will have to debug and go through this all over again. Doesn't it make more sense to fix Nuwa's idle detection?
Actually there's already a test for checking Nuwa's readiness (dom/ipc/test/test_NuwaProcessCreation.html), but it does not include a timeout.

Cervantes, will there be any concerns if we add a timeout to make sure the messages of nuwa-ready and nuwa-add-new-process arrive "in time"?
Flags: needinfo?(cyu)
(In reply to Ting-Yu Chou [:ting] from comment #31)
> Actually there's already a test for checking Nuwa's readiness
> (dom/ipc/test/test_NuwaProcessCreation.html), but it does not include a
> timeout.
> 
> Cervantes, will there be any concerns if we add a timeout to make sure the
> messages of nuwa-ready and nuwa-add-new-process arrive "in time"?

Adding a timeout manually in the test cases is not a good idea. There is already one imposed by mochitest itself (450 sec for debug build). If the test cases cannot pass, it will eventually fail and the chunk will be orange. Also the tests are run in emulator, which is much slower than in a real device. This makes it difficult to figure out a reasonable timeout for the test cases, and we are more likely to see the chunk getting random orange, which is pretty costly to debug.
Flags: needinfo?(cyu)
(In reply to Ting-Yu Chou [:ting] from comment #31)
> Actually there's already a test for checking Nuwa's readiness
> (dom/ipc/test/test_NuwaProcessCreation.html), but it does not include a
> timeout.

Yeah, that test is trying to do the right thing, but clearly it's not sufficient. Something needs to be changed to make it report failure in our current state.
So after discussing this predicament with gwagner and khuey I think we may need to make some serious changes to our idle detection mechanism. We identified a few problems that I don't see any obvious fix for:

1. We ignore the timer thread entirely (http://hg.mozilla.org/mozilla-central/file/2c9708e6b54d/xpcom/threads/TimerThread.cpp#l208). How is this safe?  It seems to me that if any of the preloaded startup code uses a nsITimer (or, in JS, setTimeout) in the b2g process then we may freeze Nuwa too early. The b2g process is not really idle if there are pending timeouts. I don't see how the Nuwa freeze procedure will ever be 100% reliable if we don't take the timer thread into consideration. Further complicating the matter is the fact that some long-delayed timers (e.g., gecko update timer, etc.) have nothing to do with the Nuwa process and so we shouldn't really care about those. We probably need to just define some threshold timer value where we don't count the timeout for Nuwa freeze purposes (e.g. 3 seconds or so?). Now, I guess it's possible that someone has audited all of the code in 'preload.js' and verified that there are not uses of nsITimer in there? If that is not the case then we need to have that audit done immediately. Otherwise, we need some mechanism in place to prevent future developers from walking into a trap here.

2. "Locks". This is a little more subtle, but there are several kinds of "lock-like" APIs that the child process could use during 'preload.js' execution. Examples include a settings locks, wake locks, cpu locks, maybe others? In these cases the child will send some kind of message to the parent requesting a lock. The parent will grant the lock and respond to the child and then go idle. If we decide that that moment is the right moment to freeze Nuwa then we will never release those locks. Has anyone audited the startup code for this kind of thing? It's hard to know if we would ever see failures here because they could look very different (e.g. cpu lock causes battery to run down, settings lock could cause other apps to behave weirdly, etc.). I don't know how we can fix this with mere idle detection.

3. Open IndexedDB transactions are similar to the "locks" above. In the 2.2 branch each transaction gets its own thread but that thread can be idle multiple times before the transaction completes. This means that if the Nuwa process opens an indexedDB transaction, b2g does some work and then goes idle, we may decide that Nuwa should freeze while this transaction is open. In 3.0 we have the same basic problem but the nested event loop makes Nuwa think that the threads are never idle.

4. Other stuff I don't know about?

I don't know how to fix this really.  Using thread-based idle detection seems to miss lots of edge cases that could cause very unpredictable failures. I think we need a serious audit of the startup code to see exactly what kinds of APIs we're using (a dump of all the IPC messages generated at startup is probably the easiest way to get started). And it's not enough to simply satisfy ourselves that our current code doesn't use any of these dangerous APIs; we must also make sure we have code in place to prevent future changes from hitting these cases.
Hm, it also looks like the idle thread detection does not take chromium threads into account... We have several that might be important: http://mxr.mozilla.org/mozilla-central/search?string=new%20base%3A%3AThread
blocking-b2g: 3.0? → 3.0+
I don't think I'm the right assignee for this bug.
Assignee: bent.mozilla → nobody
Status: ASSIGNED → NEW
Hi Ben,

I think you are right, current idle detection is not perfect.  We will follow up bugs and improve it continuously.  But, for the short.  I would like to have a quick patch for IndexedDB.

You said, "we need some mechanism in place to prevent future developers from walking into a trap here."  For short term, I would purpose a way to explode this type of traps if there any patch hits it.  It could avoid people to spend a lot of time to find out root causes.
Flags: needinfo?(tlee) → needinfo?(bent.mozilla)
Ok, I filed bug 1156611 to mark IndexedDB threads as idle in the short term.

But we really need someone to take a hard look at the code we load at startup to see if anything else could be broken intermittently. Unfortunately I don't really have a good idea for how to make this safer or assert anything because detecting idle threads doesn't seem sufficient.
Flags: needinfo?(bent.mozilla)
(In reply to Thinker Li [:sinker] from comment #37)
> Hi Ben,
> 
> I think you are right, current idle detection is not perfect.  We will
> follow up bugs and improve it continuously.  But, for the short.  I would
> like to have a quick patch for IndexedDB.
> 
> You said, "we need some mechanism in place to prevent future developers from
> walking into a trap here."  For short term, I would purpose a way to explode
> this type of traps if there any patch hits it.  It could avoid people to
> spend a lot of time to find out root causes.

We will fix indexeddb in bug 1156611 but what about all the other issues that are listed in comment 34? Whats the plan for them in 2.2? We know the current implementation is not sound even after fixing the IDB issue. The only alternative I see right now is to back out 970307.
Depends on: 1156611
Priority: -- → P1
Things getting loaded in preload.js definitely needs review. We can take an approach to let preload.js explicitly notify that it has done its work and is not expecting anything else. This may safe us from monitoring thread status as a hint of readiness of freezing. The downside of it is that preload.js itself and the scripts loaded directly or indirectly from it may subject to change and it can be broken if the scripts are changed.

There are still calls that look harmless, but actually work like a cross-process observer. One example is cpmm.addMessageListener() and the message will be sent under some rare condition. We need to have a thorough review for such cases.

Another idea is to make Nuwa work like a normal process that it may freeze and unfreeze at will. But the synchronization with the chrome process is a problem. The cross-process race, such as Nuwa fork()s and chrome process sending a message to it, is hard to tackle. Maybe we should give up such process and then retry the fork() after it is stable. I am not sure if this is going to work though.
This is a 3.0 blocker, but I think we need to do something for 2.2 (like comment 39 says). Maybe backout, maybe just a thorough audit of the preloading code? But something has to happen for 2.2.
blocking-b2g: 3.0+ → 2.2?
blocking-b2g: 2.2? → 2.2+
Summary: Preallocated process not running on Flame 3.0 Master Nightly builds → Preallocated process approach not working on 2.2 and 3.0
Blocks: 1124554
Ken, can we find an assignee here?
ni per comment 42
Flags: needinfo?(tlee)
I'll take this.
Assignee: nobody → cyu
Flags: needinfo?(tlee)
Flags: needinfo?(kchang)
The problem that some thread created in the chrome process keeps the Nuwa process from becoming ready doesn't happen on 2.2. Do we really need to block 2.2 with this bug?

Here are the IPC messages the Nuwa process send/receive when mozilla::dom::TabChild::PreloadSlowThings():

[time:1430301688964854][14458->14445][PContentChild] Sending Msg_GetLookAndFeelCache([TODO])
[time:1430301688965744][14445<-14458][PContentParent] Received Msg_GetLookAndFeelCache([TODO])
[time:1430301688965876][14445->14458][PContentParent] Sending reply Reply_GetLookAndFeelCache([TODO])
[time:1430301688966583][14458<-14445][PContentChild] Received reply Reply_GetLookAndFeelCache([TODO])

[time:1430301689025722][14458->14445][PContentChild] Sending Msg_AsyncMessage([TODO])
[time:1430301689026694][14445<-14458][PContentParent] Received Msg_AsyncMessage([TODO])
Sends "Webapps:RegisterForMessages" (AppsServiceChild.jsm)

[time:1430301689028619][14458->14445][PContentChild] Sending Msg_SyncMessage([TODO])
[time:1430301689029977][14445<-14458][PContentParent] Received Msg_SyncMessage([TODO])
[time:1430301689054319][14445->14458][PContentParent] Sending reply Reply_SyncMessage([TODO])
[time:1430301689061526][14458<-14445][PContentChild] Received reply Reply_SyncMessage([TODO])
Sends "Webapps:GetList" (AppsServiceChild.jsm)

[time:1430301689152136][14458->14445][PContentChild] Sending Msg_AsyncMessage([TODO])
[time:1430301689153078][14445<-14458][PContentParent] Received Msg_AsyncMessage([TODO])
Sends "UserCustomizations:List" (UserCustomization.jsm)

[time:1430301689153457][14445->14458][PContentParent] Sending Msg_AsyncMessage([TODO])
[time:1430301689162250][14458->14445][PNeckoChild] Sending Msg_PCookieServiceConstructor([TODO])
[time:1430301689163116][14445<-14458][PNeckoParent] Received Msg_PCookieServiceConstructor([TODO])

[time:1430301689176952][14458->14445][PContentChild] Sending Msg_ReadPermissions([TODO])
[time:1430301689177573][14445<-14458][PContentParent] Received Msg_ReadPermissions([TODO])
[time:1430301689181982][14445->14458][PContentParent] Sending reply Reply_ReadPermissions([TODO])
[time:1430301689184502][14458<-14445][PContentChild] Received reply Reply_ReadPermissions([TODO])

[time:1430301689311904][14458->14445][PContentChild] Sending Msg_PStorageConstructor([TODO])
[time:1430301689312566][14445<-14458][PContentParent] Received Msg_PStorageConstructor([TODO])
[time:1430301689312768][14445->14458][PStorageParent] Sending Msg_ScopesHavingData([TODO])

[time:1430301689383578][14458->14445][PContentChild] Sending Msg_NuwaWaitForFreeze([TODO])
[time:1430301689384313][14445<-14458][PContentParent] Received Msg_NuwaWaitForFreeze([TODO])
(mozilla::dom::TabChild::PreloadSlowThings() finishes)

[time:1430301689384210][14458<-14445][PContentChild] Received Msg_LoadAndRegisterSheet([TODO])
[time:1430301689389423][14458<-14445][PContentChild] Received Msg_LoadAndRegisterSheet([TODO])
[time:1430301689390444][14458<-14445][PContentChild] Received Msg_LoadAndRegisterSheet([TODO])

[time:1430301689391440][14458<-14445][PContentChild] Received Msg_AsyncMessage([TODO])
Receives "UserCustomizations:Add"

[time:1430301689392857][14458<-14445][PContentChild] Received Msg_AsyncMessage([TODO])
Receives "UserCustomizations:UpdateWindows"

[time:1430301689393580][14458<-14445][PContentChild] Received Msg_AsyncMessage([TODO])
Receives "UserCustomizations:Add"

[time:1430301689394205][14458<-14445][PContentChild] Received Msg_AsyncMessage([TODO])
Receives "UserCustomizations:UpdateWindows"

[time:1430301689394666][14458<-14445][PContentChild] Received Msg_PreferenceUpdate([TODO])
Preference: "datareporting.sessions.current.totalTime"

[time:1430301689394919][14458<-14445][PContentChild] Received Msg_PreferenceUpdate([TODO])
Preference: "ril.debugging.enabled"

[time:1430301689395351][14458<-14445][PContentChild] Received Msg_AsyncMessage([TODO])
Receives "UserCustomizations:Add"

[time:1430301689396047][14458<-14445][PStorageChild] Received Msg_ScopesHavingData([TODO])

It doesn't appear that something dangerous is sent to the Nuwa process. Did I miss anything?

For the issues in comment #34,

1. Time thread is completely ignored. We need to review the timers that might broadcast messages to the child process. Given that it's really CPU intensive during b2g startup and the threads/processes all compete for the CPU time, if there is correctness that depend on a timer, we need to fix that. That sounds like an inherent source of timing problems.

2. It doesn't appear that there is lock-like API invoked in preload.js from the above log.

3. The Nuwa process doesn't create indexedDB actor, or we will fail very loudly by aborting the chrome process (note that the parent actors don't override CloneProtocol()). This won't pass the safeguard of mochitest.

If you still think we need to backout for the potential, yet to happen, breakage of 2.2 because the design is not robust enough w.r.t. further changes elsewhere, I am OK with that. I will search for a less intrusive mechanism for preloading in the Nuwa process.
(In reply to Cervantes Yu from comment #45)
> The problem that some thread created in the chrome process keeps the Nuwa
> process from becoming ready doesn't happen on 2.2. Do we really need to
> block 2.2 with this bug?

Not if we can convince ourselves that it's safe, of course, but we should block 2.2 on making sure we do that.

> Here are the IPC messages the Nuwa process send/receive when
> mozilla::dom::TabChild::PreloadSlowThings():

Thanks for collecting this data! However:

> [time:1430301688964854][14458->14445][PContentChild] Sending
> Msg_GetLookAndFeelCache([TODO])

MXR says that this doesn't exist on 2.2... Are these the messages sent with trunk instead?

And how did you pick the start and end times for recording the log?

> 1. Time thread is completely ignored. We need to review the timers

Definitely.

> 2. It doesn't appear that there is lock-like API invoked in preload.js from
> the above log.

I'm not sure... There are so many APIs that I think it's hard to say without really digging. I have only thought of a few, but there could be others? There could also be simple message manager messages that imply a lock but aren't necessarily easy to spot.

> 3. The Nuwa process doesn't create indexedDB actor, or we will fail very
> loudly by aborting the chrome process (note that the parent actors don't
> override CloneProtocol()). This won't pass the safeguard of mochitest.

This isn't enough to be safe though. Settings, for instance, does all its IndexedDB in the parent process and uses the message manager. So the absence of any cloned IndexedDB actors for the Nuwa process doesn't mean that there are no IndexedDB actors controlled by the Nuwa process.

Also, relying on cloned actors to assert is a racy solution. The Nuwa process could, in theory, create its own IndexedDB actor, use it, and then kill it quickly. If we don't freeze Nuwa at a consistent and deterministic point then we might assert sometimes and not others.

> If you still think we need to backout

I don't want us to back out if we can convince ourselves that what we have currently for 2.2 is safe. Given the GetLookAndFeelCache message here I think we need to see another log. And then we have to audit the timers.

> the design is not robust enough w.r.t. further changes elsewhere

Since we're supposed to be locking down 2.2 I think if we can convince ourselves that what we have is safe that should be enough.

But we can't use this approach for trunk. Anyone that changes something in the startup path could cause dangerous messages to be sent and using thread idle detection isn't sufficient to catch that.

> I will search for a less intrusive mechanism for preloading in the Nuwa process.

For trunk, yes please! We may end up just having to whitelist messages that we know don't do lock-like things, and then we need to have a defined "done" state for each item we preload.
(In reply to Ben Turner [:bent] (use the needinfo flag!) from comment #46)
> (In reply to Cervantes Yu from comment #45)
> > The problem that some thread created in the chrome process keeps the Nuwa
> > process from becoming ready doesn't happen on 2.2. Do we really need to
> > block 2.2 with this bug?
> 
> Not if we can convince ourselves that it's safe, of course, but we should
> block 2.2 on making sure we do that.
> 
> > Here are the IPC messages the Nuwa process send/receive when
> > mozilla::dom::TabChild::PreloadSlowThings():
> 
> Thanks for collecting this data! However:
> 
> > [time:1430301688964854][14458->14445][PContentChild] Sending
> > Msg_GetLookAndFeelCache([TODO])
> 
> MXR says that this doesn't exist on 2.2... Are these the messages sent with
> trunk instead?
> 
Yes. I checked 2.2 and found that the messages sent/received are fewer than the list here. The actions performed are much the same.

Here are the messages captured on 2.2, starting from preloading:

I/GeckoIPC( 7030): [time:1430386183955691][7030->7015][PContentChild] Sending Msg_AsyncMessage([TODO])
I/GeckoIPC( 7015): [time:1430386183959885][7015<-7030][PContentParent] Received Msg_AsyncMessage([TODO])
Sends "Webapps:RegisterForMessages" (AppsServiceChild.jsm)

I/GeckoIPC( 7030): [time:1430386183967708][7030->7015][PContentChild] Sending Msg_SyncMessage([TODO])
I/GeckoIPC( 7015): [time:1430386183973189][7015<-7030][PContentParent] Received Msg_SyncMessage([TODO])
I/GeckoIPC( 7015): [time:1430386184023369][7015->7030][PContentParent] Sending reply Reply_SyncMessage([TODO])
I/GeckoIPC( 7030): [time:1430386184034874][7030<-7015][PContentChild] Received reply Reply_SyncMessage([TODO])
Sends "Webapps:GetList" (AppsServiceChild.jsm)

I/GeckoIPC( 7030): [time:1430386184119542][7030->7015][PNeckoChild] Sending Msg_PCookieServiceConstructor([TODO])
I/GeckoIPC( 7015): [time:1430386184120167][7015<-7030][PNeckoParent] Received Msg_PCookieServiceConstructor([TODO])
I/GeckoIPC( 7030): [time:1430386184130360][7030->7015][PContentChild] Sending Msg_ReadPermissions([TODO])
I/GeckoIPC( 7015): [time:1430386184131204][7015<-7030][PContentParent] Received Msg_ReadPermissions([TODO])
I/GeckoIPC( 7015): [time:1430386184135786][7015->7030][PContentParent] Sending reply Reply_ReadPermissions([TODO])
I/GeckoIPC( 7030): [time:1430386184138976][7030<-7015][PContentChild] Received reply Reply_ReadPermissions([TODO])
I/Gecko   ( 7030): ###################################### forms.js loaded
I/Gecko   ( 7030): ############################### browserElementPanning.js loaded
I/Gecko   ( 7030): ######################## BrowserElementChildPreload.js loaded
I/GeckoIPC( 7030): [time:1430386184274439][7030->7015][PContentChild] Sending Msg_PStorageConstructor([TODO])
I/GeckoIPC( 7015): [time:1430386184275355][7015<-7030][PContentParent] Received Msg_PStorageConstructor([TODO])
(TabChild::PreloadSlowThings() finishes here)

I/GeckoIPC( 7030): [time:1430386184343009][7030->7015][PContentChild] Sending Msg_NuwaWaitForFreeze([TODO])
I/GeckoIPC( 7015): [time:1430386184343518][7015<-7030][PContentParent] Received Msg_NuwaWaitForFreeze([TODO])
I/GeckoIPC( 7015): [time:1430386184343738][7015->7030][PContentParent] Sending Msg_NuwaFreeze([TODO])
I/GeckoIPC( 7030): [time:1430386184344087][7030<-7015][PContentChild] Received Msg_LoadAndRegisterSheet([TODO])
I/GeckoIPC( 7030): [time:1430386184348687][7030<-7015][PContentChild] Received Msg_LoadAndRegisterSheet([TODO])
I/GeckoIPC( 7030): [time:1430386184349805][7030<-7015][PContentChild] Received Msg_LoadAndRegisterSheet([TODO])
I/GeckoIPC( 7030): [time:1430386184350816][7030<-7015][PContentChild] Received Msg_PreferenceUpdate([TODO])
I/GeckoIPC( 7030): [time:1430386184351156][7030<-7015][PContentChild] Received Msg_PreferenceUpdate([TODO])
The preference update is not necessarily sent.

I/GeckoIPC( 7030): [time:1430386184351521][7030<-7015][PContentChild] Received Msg_NuwaFreeze([TODO])
From here on it's just Nuwa-related messages: NuwaReady, NuwaFork, AddNewProcesses.

> And how did you pick the start and end times for recording the log?
> 
Just pause the Nuwa process in the beginning of TabChild::PreloadSlowThings() with a debugger, start recording the log, and then continue. Then filter and annotate the log manually.
But this is might subject to timing problem and the sample is relatively small. I'll need to dig further to have confidence that it's safe or conclude that we need to back out.
ni myself for getting update.
Flags: needinfo?(kchang)
Blocks: 1161189
Without the pending issues on m-c, 2.2 doesn't look to be unsafe. I don't think we need to block 2.2 with this bug unless proven otherwise.

On m-c, given that it's causing memory regression. We need to consider landing a minimal fix or back out while working on a safer approach.
(In reply to Cervantes Yu from comment #49)
> Without the pending issues on m-c, 2.2 doesn't look to be unsafe. I don't
> think we need to block 2.2 with this bug unless proven otherwise.
> 

That's not how it works with race-conditions. We have to assume its not working until proven otherwise. Can you prove that we have a sound solution for all the issues raised in comment 34? Did anyone do the work and inspect all the preload code manually?

We still don't have Nuwa enabled on debug builds. Bug 1050122 is moving very slowly. Maybe we are seeing real issues?

I don't feel comfortable shipping nuwa in the current form in 2.2 with potential races, limited automated testing and no debug build support.
Depends on: 1161190
Whiteboard: [3.0-Daily-Testing] → [3.0-Daily-Testing][MemShrink]
Gregor and I just tracked down another problem with Nuwa's idle thread detection on master. It looks like there are no thread annotations for the VibratorRunnable, and it can prevent Nuwa from freezing forever: http://mxr.mozilla.org/mozilla-central/source/hal/gonk/GonkHal.cpp#348
(In reply to Ben Turner [:bent] (use the needinfo flag!) from comment #51)
> on master

I don't think this is a problem for master only though. That code is old and exists on 2.2 also (http://mxr.mozilla.org/mozilla-b2g37_v2_2/source/hal/gonk/GonkHal.cpp#317). Is it just a race that makes this work sometimes? I don't know how Patrick didn't see that one during his testing.
(In reply to Ben Turner [:bent] (use the needinfo flag!) from comment #52)

Filed bug 1162812.
(In reply to Gregor Wagner [:gwagner] from comment #50)
> (In reply to Cervantes Yu from comment #49)
> > Without the pending issues on m-c, 2.2 doesn't look to be unsafe. I don't
> > think we need to block 2.2 with this bug unless proven otherwise.
> > 
> 
> That's not how it works with race-conditions. We have to assume its not
> working until proven otherwise. Can you prove that we have a sound solution
> for all the issues raised in comment 34? Did anyone do the work and inspect
> all the preload code manually?
> 
No, I can't prove that. On the contrary, I can prove that even the the issues raised in comment #34 are addressed, there still can be unexpected cases to break a mechanism like this. Even we include MessageLoop, TimerThread and other entities that we can dispatch tasks to, the mechanism is still not sound enough. A thread created using pthread_create() can still use its own way to implement the runloop, which is totally unknown to thread status monitoring like this. The same reasoning applies to things like hardware events or other low-level events.

If we want to take a generic approach to preload in the Nuwa process, we will always encounter a problem like this. To have a generic solution that is guaranteed to be correct we will need to have code that understands what other code is doing, which is impossible. The best we can have is only an approximation, and monitoring nsIThread is an approximation we have. From the current open issues this mechanism can be easily broken and I am working on a more robust approach to it.

I did look at preload.js and the js loaded by it, but only walked through it roughly. Comment #45 and $47 contains log of IPC messages sent/received during preload. But even we really take a full, thorough inspection of what is done in preload.js and make sure it is safe, a mechanism depending on it is still not robust. Any change in preload can easily break it by sending an async message back to the chrome process that results in another async response that locks up the Nuwa process. And such change can reside in some corner in the code base (e.g. some native code indirectly called by the XPCOM implementation during its instantiation).

> We still don't have Nuwa enabled on debug builds. Bug 1050122 is moving very
> slowly. Maybe we are seeing real issues?
> 
Ting is having a hard time investigating mochitest-12 timeout.

> I don't feel comfortable shipping nuwa in the current form in 2.2 with
> potential races, limited automated testing and no debug build support.

OK. Let's back out thread monitoring and preload in central and 2.2 given the current issues and concerns. What do you guys think?
These are just my opinions:

On 2.2 I think we should audit the existing code to make sure we don't do anything suspicious with timeouts. Then we should whitelist the messages that are currently being sent during preload somehow (making sure they don't include "lock-like" messages). Then we should crash if we ever see a message during preload that is not in the whitelist. If that doesn't cause frequent stability problems then we can leave the preloading active. Otherwise we should disable it.

On trunk I think we definitely need to back out the thread idle detection and scrap this whole approach because it is very very fragile. We should investigate some other mechanism, maybe making any script loaded in preload.js have to call an explicit "I'm done" function before we can freeze.
(In reply to Ben Turner [:bent] (use the needinfo flag!) from comment #55)
> These are just my opinions:
> 
> On 2.2 I think we should audit the existing code to make sure we don't do
> anything suspicious with timeouts. Then we should whitelist the messages
> that are currently being sent during preload somehow (making sure they don't
> include "lock-like" messages). Then we should crash if we ever see a message
> during preload that is not in the whitelist. If that doesn't cause frequent
> stability problems then we can leave the preloading active. Otherwise we
> should disable it.
> 
Thinker, Bobby and I had an offline discussion and concluded that backout on 2.2 is the safest and most feasible solution. We already have a white list, but it's not complete:

http://hg.mozilla.org/mozilla-central/file/3716351194b0/ipc/glue/MessageLink.cpp#l176

It's hard to tell how detailed the audit need to be. It's also hard to come up a whitelist that we are sure to be complete so that it doesn't trigger the crash we place.
See Also: → 1162812
Whiteboard: [3.0-Daily-Testing][MemShrink] → [3.0-Daily-Testing][MemShrink:P1]
I'll complete the backout on m-c once the review is done.
Comment on attachment 8606990 [details] [diff] [review]
Bug 1151672: Remove the calls to nsThreadManager::SetThreadWorking() and nsThreadManager::SetThreadIdle() due to backout of bug 970307.

Review of attachment 8606990 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks goodness.

For future reference in case we decide to try something like this again: beyond all the other objections above, it was seriously wrong that we marked a thread as working when it popped a job off the work queue. That means that a race condition existed where we have jobs enqueued, but because no thread had woken up to dequeue one of them yet, the system considered itself to be idle.

Instead, all threads in the thread pool should've been marked as working whenever something was pushed *onto* the work queue, and they should all have been marked as idle when the work queue became empty again.
Attachment #8606990 - Flags: review?(seth) → review+
Blocks: 1166207
sorry had to back this out in https://treeherder.mozilla.org/#/jobs?repo=b2g-inbound&revision=614d52bca5ac since one of this changes broke ics debug tests with failures like :

https://treeherder.mozilla.org/logviewer.html#?job_id=1958358&repo=b2g-inbound
Flags: needinfo?(cyu)
(In reply to Carsten Book [:Tomcat] from comment #67)
> sorry had to back this out in
> https://treeherder.mozilla.org/#/jobs?repo=b2g-inbound&revision=614d52bca5ac
> since one of this changes broke ics debug tests with failures like :
> 
> https://treeherder.mozilla.org/logviewer.html#?job_id=1958358&repo=b2g-
> inbound

Oops, sorry for the bustage. I didn't expect the backout to break things. I'll verify before proceeding next time.
Flags: needinfo?(cyu)
This should fix the mochitest bustage after backing out bug 970307.
Attachment #8608055 - Flags: review?(khuey)
Comment on attachment 8608055 [details] [diff] [review]
Black list PHal::Msg_NotifyNetworkChange__ID after Nuwa is ready

Nit: This is actually "whitelisting" that message I believe?
(In reply to Ben Turner [:bent] (use the needinfo flag!) from comment #71)
> Comment on attachment 8608055 [details] [diff] [review]
> Black list PHal::Msg_NotifyNetworkChange__ID after Nuwa is ready
> 
> Nit: This is actually "whitelisting" that message I believe?

It's not actually the message resulting from preloading, or we will see the same bustage on 2.2. We need this patch because a recent m-c change that broadcasts a new messages to all processes on startup.
As comment 56 and back-out(s), agreed not to resolve this in v2.2.
blocking-b2g: 2.2+ → ---
We probably need the patch from comment 69 if we want a working phone. Up to cervantes if he wants to file a followup or keep it in this bug and mark it as 2.2+ again.
(In reply to Gregor Wagner [:gwagner] from comment #74)
> We probably need the patch from comment 69 if we want a working phone. Up to
> cervantes if he wants to file a followup or keep it in this bug and mark it
> as 2.2+ again.

m-c needs this patch because Nuwa is enabled on debug build recently (bug 1050122). 2.2 doesn't have this problem unless we want to uplift 1050122 to 2.2. Do we still need to uplift the patch to block PHal::Msg_NotifyNetworkChange__ID?
https://hg.mozilla.org/mozilla-central/rev/2529aa82e200
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 2.2 S13 (29may)
Flags: needinfo?(kchang)
You need to log in before you can comment on or make changes to this bug.