Closed Bug 950266 Opened 11 years ago Closed 10 years ago

Re-enable the Nuwa process on B2G by default

Categories

(Core :: IPC, defect, P1)

ARM
Gonk (Firefox OS)
defect

Tracking

()

RESOLVED FIXED
mozilla30
blocking-b2g 1.3+
Tracking Status
firefox27 --- wontfix
firefox28 --- wontfix
firefox29 --- wontfix
firefox30 --- fixed
b2g-v1.3 --- fixed
b2g-v1.3T --- fixed
b2g-v1.4 --- fixed

People

(Reporter: n.nethercote, Assigned: fabrice)

References

Details

(Keywords: perf, Whiteboard: [c=memory p= s= u=1.3+tarako] [MemShrink:P1][CR 599688])

Attachments

(3 files, 4 obsolete files)

Nuwa was enabled in bug 930282, but then disabled in bug 948829 due to problems.  This bug is about re-enabling it.  Please add any blockers I have missed.
Should bug 948021 block this rather than depend on this?
Summary: Re-nable the Nuwa process on B2G by default → Re-enable the Nuwa process on B2G by default
blocking-b2g: --- → 1.3?
Andrew,

Please reassign appropriately.
Flags: needinfo?(overholt)
I'll assign to to Patrick, but it's really just waiting on the blocking bug (which I own).
Assignee: nobody → pwang
Flags: needinfo?(overholt)
After bisecting a datazilla regression, it appears NUWA slows down the cold launch time for some of our apps.  In particular, clock seem effected.  This is all in bug 948977.  It would be nice to understand what is going on there before re-enabling NUWA.
triage: nuwa is intended for 1.3. 1.3+ to benefit tarako
blocking-b2g: 1.3? → 1.3+
hi Patrick, do you mind giving an update on this bug after you come back from the holidays? Thanks
Flags: needinfo?(pwang)
Hi Joe, the regression bugs that are filed after last time we enable Nuwa are almost fixed. I think we can reenable nuwa after bug 948774 has been fixed.
Flags: needinfo?(pwang)
Hi Patrick,

What is the ETA for fixing bug 948774?
Flags: needinfo?(pwang)
I am not pretty sure about the ETA of the bug. But you might ask Kyle, who works on the bug.
Flags: needinfo?(pwang)
See the bug.
Flags: needinfo?(khuey)
Whiteboard: [MemShrink:P1][tarako] → [MemShrink:P1][tarako][CR 599688]
Assignee: pwang → fabrice
arg... Patrik, do you have time to take a look?
Flags: needinfo?(pwang)
The assertion is just bogus.  I'll fix that.
I pushed https://hg.mozilla.org/integration/b2g-inbound/rev/d1d1f50c63f9 to fix the assertion.  Patrick, can you investigate the webapi test failure?
sure.
Flags: needinfo?(pwang)
The error occurs when app_window.js try to use browser API on FTU's iframe. If Nuwa preallocate process is on, loading of remote frame will end at [1], and browser parent script won't be loaded, then calling to browser API will result in a javascript exception.

Since loading a remote frame is now an asynchronous task, browser API won't be available right after attaching an iframe, embedder should test if browser API is usable before using it. Another option is that we provide a callback to let embedder know that we are ready.

[1] http://mxr.mozilla.org/mozilla-central/source/content/base/src/nsFrameLoader.cpp?rev=4f7f249f6d18#463
Gecko change to fire an event when browser element parent is ready.
Attachment #8360552 - Flags: feedback?(fabrice)
Comment on attachment 8360552 [details] [diff] [review]
WIP: Notify embedder when BEP is ready.

Review of attachment 8360552 [details] [diff] [review]:
-----------------------------------------------------------------

Patrick, I had a simpler patch in my queue that just does:

let evt = this._createEvent('ready', null, /* cancelable = */ false);
this._frameElement.dispatchEvent(evt);

at the end of the BrowserElementParent() constructor and it worked just fine with the right gaia patch. Do you feel we really need your pending frame mechanism?

::: dom/browser-element/BrowserElementParent.js
@@ +81,5 @@
>    _observeInProcessBrowserFrameShown: function(frameLoader) {
>      debug("In-process browser frame shown " + frameLoader);
> +    this._createBrowserElementParent(frameLoader,
> +                                     /* hasRemoteFrame = */ false,
> +                                    /* pending frame */ false);

nit: align comments

::: dom/browser-element/BrowserElementParent.jsm
@@ +707,5 @@
>    observe: function(subject, topic, data) {
>      switch(topic) {
> +    case 'remote-browser-frame-shown':
> +      if (!this._initialized && (this._frameLoader == subject)) {
> +        Services.obs.removeObserver(this, 'remote-Browser-frame-shown');

that should be remote-browser-frame-shown
Attachment #8360552 - Flags: feedback?(fabrice) → feedback+
Attached patch mozbrowser-ready.patch (obsolete) — Splinter Review
Gaia patch that waits for the new mozbrowserready event if the paint listener methods are not accessible.
Attachment #8360641 - Flags: review?(alive)
According to comment 18, I wonder this doesn't only affect addNextPaintListener but also all browser API: getScreenshot/goBack/goForward/canGoBack/canGoForward/setVisible/getVisible/purgeHistory. Is that true?
(In reply to Alive Kuo [:alive][NEEDINFO!] from comment #22)
> According to comment 18, I wonder this doesn't only affect
> addNextPaintListener but also all browser API:
> getScreenshot/goBack/goForward/canGoBack/canGoForward/setVisible/getVisible/
> purgeHistory. Is that true?

Yes, all browser APIs are not available before the new process launch.
(In reply to Patrick Wang (ChihKai) (:kk1fff) from comment #23)
> (In reply to Alive Kuo [:alive][NEEDINFO!] from comment #22)
> > According to comment 18, I wonder this doesn't only affect
> > addNextPaintListener but also all browser API:
> > getScreenshot/goBack/goForward/canGoBack/canGoForward/setVisible/getVisible/
> > purgeHistory. Is that true?
> 
> Yes, all browser APIs are not available before the new process launch.

How about events? I wonder we may lose mozbrowserloadstart event before mozbrowserready event?
(In reply to Fabrice Desré [:fabrice] from comment #20)
> Comment on attachment 8360552 [details] [diff] [review]
> WIP: Notify embedder when BEP is ready.
> 
> Review of attachment 8360552 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Patrick, I had a simpler patch in my queue that just does:
> 
> let evt = this._createEvent('ready', null, /* cancelable = */ false);
> this._frameElement.dispatchEvent(evt);
> 
> at the end of the BrowserElementParent() constructor and it worked just fine
> with the right gaia patch. Do you feel we really need your pending frame
> mechanism?

I don't think we really need pending, simpler mechanism would be better. :)
(In reply to Alive Kuo [:alive][NEEDINFO!] from comment #24)
> (In reply to Patrick Wang (ChihKai) (:kk1fff) from comment #23)
> > Yes, all browser APIs are not available before the new process launch.
> 
> How about events? I wonder we may lose mozbrowserloadstart event before
> mozbrowserready event?

No, there's no event before mozbrowserready, since browser element is not working then.
Comment on attachment 8360641 [details] [diff] [review]
mozbrowser-ready.patch

Review of attachment 8360641 [details] [diff] [review]:
-----------------------------------------------------------------

According to comment 18, comment 23, we need a new mechanism for all browser API but not only addNextPaintListener.
Fabrice, if you don't mind I would take this.
Attachment #8360641 - Flags: review?(alive)
Nuwa Gaia ready adapter
Attachment #8361019 - Flags: review?(timdream)
I kept pending state since this rather understandable to me.

I ran webapi test locally with Alive's patch on Gaia, I still got 3 failures, but it's better than before, when webapi test can't even start.
Attachment #8360552 - Attachment is obsolete: true
Depends on: 960641
Let's move the 'mozbrowserready' patches to bug 960641 and keep this one just to turn on nuwa.
slight update to get things building.
Attachment #8360641 - Attachment is obsolete: true
Attachment #8361019 - Attachment is obsolete: true
Attachment #8361026 - Attachment is obsolete: true
Attachment #8361019 - Flags: review?(timdream)
I am a little bit worried about the regression that could happen if we change the behavior of mozbrowser. Seeking if we can solve this without touching the API..
(In reply to Preeti Raghunath(:Preeti) from comment #33)
> Patrick,
> 
> What is the solution moving forward?

The patch for bug 960641 was landed.
Flags: needinfo?(pwang)
We need to fix bug 961959 first to not break all profiling.
Depends on: 961959
Flags: needinfo?(fabrice)
Fabrice, if that's the case can you add the 1.3 CS blocker status for the dependent bug so it gets tagged please?
Blocks: 961591
Blocks: 942267
Patrick and Cervantes are working through the current tryserver failures.
Flags: needinfo?(fabrice)
Thanks Kyle
Can we really enable Nuwa at this point for the v1.3 release? It seems hugely risky. Nuwa is a new feature and feature freeze was a long time ago. The fact that the feature hasn't been stabilized yet means that it's not ready for v1.3.
We really need this to be enabled for 1.3 as we are targeting devices with 256 MB configurations where the memory savings provided by Nuwa could play a signficant role.
We bisect locally and found that bug 961047 breaks the test case when Nuwa is enabled.
Try result:
https://tbpl.mozilla.org/?tree=Try&rev=4f902b351695
https://tbpl.mozilla.org/?tree=Try&rev=50c61ddd206d

It looks the failure occurs after default enabling APZC. Vivien, Shih-Chiang, would you help take a look?
Status: NEW → ASSIGNED
Keywords: perf
Whiteboard: [MemShrink:P1][tarako][CR 599688] → [c=memory p= s= u=tarako] [MemShrink:P1][tarako][CR 599688]
I could not reproduce the failures on a local emulator build on current b2g-inbound. Try run to confirm:

https://tbpl.mozilla.org/?tree=Try&rev=8919c94aaa4d
(In reply to Fabrice Desré [:fabrice] from comment #46)
> I could not reproduce the failures on a local emulator build on current
> b2g-inbound. Try run to confirm:
> 
> https://tbpl.mozilla.org/?tree=Try&rev=8919c94aaa4d

The failure case on my local build is not the same as on try server. There's probably a timing issue.
Yep, my local (green) build was a debug build - we fail only on opt builds. Sounds like a race :(
We think that turning on Nuwa in the emulator tests might have revealed some potential timing issues with APZ:

- (group 3) ERROR TEST-UNEXPECTED-FAIL | /tests/docshell/test/navigation/test_bug430723.html | Test timed out.

We found that the onscroll callback in the test case can only be fired when ScrollFrameHelper::AsyncScroll::WillRefresh() is called, that is, it is triggered by the refresh driver. If in ScrollFrameHelper::ScrollToWithOrigin(), mAsyncScroll haven't received callback from nsRefreshDriver, then the onscroll callback will never be fired. The test case will never finish until marionette times out.

We have a workaround of this problem that we call ScrollFrameHelper::AsyncScroll::WillRefresh() when it still waits for the refresh driver. Then this failure is gone and we will encounter the unexpected scroll offset like

- ERROR TEST-UNEXPECTED-FAIL | /tests/dom/events/test/test_bug946632.html | <input> horizontal line scroll container.scrollTop - got 20, expected 0

There are other unexpected failures like:

- ERROR TEST-UNEXPECTED-FAIL | /tests/dom/events/test/test_bug946632.html | <input> horizontal line scroll targets <input> - got [object HTMLDivElement], expected [object HTMLInputElement]

Fabrice, we need help from someone familiar with APZ to dive deeper into the failures. Would you help coordinate this?
And if we debug b2g running in the emulator, which makes it even slower, the unexpected scroll offset failures are gone. This indicates that the failure is likely another racing condition.
Myth busted: APZC is not enabled on try runs even with the gaia part in bug 961047!!

We try to log AsyncPanZoomController ctor but didn't see it created. We print the preference "dom.browser_frames.useAsyncPanZoom" and it is false when Nuwa is not enabled and tests passed. I think Nuwa makes the preference in gaia take effect in time and the bugs with APZ are revealed on the try server.
Depends on: 965351
(In reply to Cervantes Yu from comment #52)
> We try to log AsyncPanZoomController ctor but didn't see it created. We
> print the preference "dom.browser_frames.useAsyncPanZoom" and it is false
> when Nuwa is not enabled and tests passed. I think Nuwa makes the preference
> in gaia take effect in time and the bugs with APZ are revealed on the try
> server.

Why would Nuwa change the behaviour of the preference? Note that we explicitly leave dom.browser_frames.useAsyncPanZoom as false in the b2g.js prefs file, and then there is a separate gaia pref that sets it to true when gaia starts up. Since the try server doesn't run gaia, that code doesn't run and dom.browser_frames.useAsyncPanZoom remains false for try testing. From my limited understanding of Nuwa I don't see why this would change if that were enabled.
Priority: -- → P1
Nuwa doesn't change the behavior of the preference, but the timing between the frame loader and turning on the pref to true in b2g/chrome/content/settings.js :

With nuwa enabled:
I/GeckoDump(  289): XXX Settings: dom.browser_frames.useAsyncPanZoom set to true
E/GeckoConsole(  327): Could not read chrome manifest 'file:///system/b2g/chrome.manifest'.
I/Gecko   (  289): XXX frame loader async=true


With nuwa Disabled:
I/Gecko   (  294): XXX frame loader async=false
I/GeckoDump(  294): XXX Settings: dom.browser_frames.useAsyncPanZoom set to true

So with nuwa enabled, we run the tests with apzc enabled, and fail because of the race.
Huh. So you're saying that even without the Nuwa patches, tryserver run is turning on the pref at some point? That was not what I expected based on https://bugzilla.mozilla.org/show_bug.cgi?id=961047#c36 (specifically the middle paragraph that says the test runner doesn't use the Gaia settings).
Yes, but it turns it on after the frame creation, so this has no effect on the tests. That's really bad I think.
If I can successfully repro the test failures locally I can dig into it. I would prefer to fix the test failures properly than back out bug 961047 or land more hacks. Right now I have a b2g-emulator build that appears to work but am running into errors when trying to run the mochitests on it; I'll continue on that tomorrow.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #57)
> If I can successfully repro the test failures locally I can dig into it. I
> would prefer to fix the test failures properly than back out bug 961047 or
> land more hacks. Right now I have a b2g-emulator build that appears to work
> but am running into errors when trying to run the mochitests on it; I'll
> continue on that tomorrow.

FYI, the gecko tree I use with the emulator has a workaround for bug 933336 (attachment 826741 [details] [diff] [review]) and the not-yet-landed patch for bug 936312.
(In reply to Jed Davis [:jld] from comment #58)
 
> FYI, the gecko tree I use with the emulator has a workaround for bug 933336
> (attachment 826741 [details] [diff] [review]) and the not-yet-landed patch
> for bug 936312.

I don't think these bugs/patches are relevant for these test failures.
Are we confident that nuwa actually works and merely our tests fail? In that case disable test, land the patch, and lets keep digging. There is no need to hold up CS as long we think this is actually working.
Hopefully last try build with the 3 tests disabled:
https://tbpl.mozilla.org/?tree=Try&rev=01dc40bbda45
Cervantes had ignored the preference value and always enable APZ with a hard code. The result is really terrible.  The testcases even never start, they are blocked at somewhere before marionette starting the testcases.

What I had been learnt last night is that APZ would not work properly with AsyncScroll in concurrent. The scrollframes post scrolling events only for the changes of the scroll position.  AsyncScroll would change scroll position periodically and post scrolling events.  But, existing AsyncScroll would be stopped by another ScrollBy()/ScrollTo() with INSTANT mode.  APZ would cause a ScrollBy(), and it is related to mDestination; the destination of the AsyncScroll.  So, if a layer transaction for AsyncScroll's scrolling is sent after APZC's RequestContentRepaint() that cause a ScrollBy() at the content process in turn, and ScrollBy() is dispatched at the content process after the transaction, then it could scroll to current position; no position change.  In another word, existing AsyncScroll would be stopped without any scrolling event in advance.

Consider following scenario at a content process,
 1. Current is at A position
 2. call ScrollTo(B position); mDestination == B
 3. post a scrolling event @ C position (between A, B); but AsyncScroll is still running.
 4. call ScrollTo(D position) and expect to receive one or more scrolling events; AsyncScroll is reused and keep running.
 5. ScrollBy(delta, INSTANT) from APZC at the parent process.  C == mDestination + delta.  Current AsyncScroll is destroyed.
 6. The expectation of scrolling is failed now.
Thinker, none of this is nuwa specific, right?
(In reply to Fabrice Desré [:fabrice] from comment #63)
> Thinker, none of this is nuwa specific, right?

If we're talking about our scrolling tests needing to be redone for APZC, then I believe you're right - that was the case before nuwa.  We're tracking similar issue in bug 958036, I guess we need something for these tests as well.
(In reply to Thinker Li [:sinker] from comment #62)
> What I had been learnt last night is that APZ would not work properly with
> AsyncScroll in concurrent. The scrollframes post scrolling events only for
> the changes of the scroll position.  AsyncScroll would change scroll
> position periodically and post scrolling events.  But, existing AsyncScroll
> would be stopped by another ScrollBy()/ScrollTo() with INSTANT mode.  APZ
> would cause a ScrollBy(), and it is related to mDestination; the destination

APZ does a ScrollTo, but otherwise you're exactly right. Also this should have been fixed very recently in bug 961280. Do you know if the problem you describe persists after that fix is applied?
(In reply to Milan Sreckovic [:milan] from comment #64)
> We're tracking
> similar issue in bug 958036, I guess we need something for these tests as
> well.

The "figuring out what's wrong in these tests" is mostly covered by bug 963278.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #65)
> (In reply to Thinker Li [:sinker] from comment #62)
> > What I had been learnt last night is that APZ would not work properly with
> > AsyncScroll in concurrent. The scrollframes post scrolling events only for
> > the changes of the scroll position.  AsyncScroll would change scroll
> > position periodically and post scrolling events.  But, existing AsyncScroll
> > would be stopped by another ScrollBy()/ScrollTo() with INSTANT mode.  APZ
> > would cause a ScrollBy(), and it is related to mDestination; the destination
> 
> APZ does a ScrollTo, but otherwise you're exactly right. Also this should
> have been fixed very recently in bug 961280. Do you know if the problem you
> describe persists after that fix is applied?

bug 961280 was merged to b2g-inbound and didn't fix the test failures for me.
(In reply to Fabrice Desré [:fabrice] from comment #63)
> Thinker, none of this is nuwa specific, right?

Cervantes' test is with Nuwa turned off and APZ turned on. Also we did a test that blocks all RecvUpdateFrame messages in TabChild with Nuwa on, in this case, and we can pass the test case.
(In reply to Fabrice Desré [:fabrice] from comment #63)
> Thinker, none of this is nuwa specific, right?

Right, we don't find any nuwa specific issue for APZ yet.
(In reply to Fabrice Desré [:fabrice] from comment #67)
> bug 961280 was merged to b2g-inbound and didn't fix the test failures for me.

Ok, but can you tell me why it didn't? i.e. why didn't it hit the codepath I added and skip the ScrollTo call? Is my patch wrong, or is it failing for some other reason now?
I'm able to reproduce the failures locally and will debug them. You can follow along in bug 963278 if interested.
Is this landed yet? We are past our CS deadline.
Backed out in https://hg.mozilla.org/integration/b2g-inbound/rev/666164e66db2 for making test_webapps_actor.html permaorange and mochitest-9's log permaoverflowing
(In reply to Wes Kocher (:KWierso) from comment #74)
> Backed out in
> https://hg.mozilla.org/integration/b2g-inbound/rev/666164e66db2 for making
> test_webapps_actor.html permaorange and mochitest-9's log permaoverflowing

Is test_webapps_actor.html right?  If I am right, mm.sendAsyncMessage("install", ...) is before all installing of message listeners, and app installing is executed async in another process.  It don't promise the order of installing apps and registrations of messages.
Hi Fabrice,
I suggest to move mm.sendAsyncMessage("install", ...) to here https://hg.mozilla.org/try/rev/96e3b4c77097#l5.31 since navigator.mozApps.mgmt is created lazily, and it registers for a few messages.
Fabrice, can you please provide v1.3 patch also.
Flags: needinfo?(fabrice)
Inder, I'll land on 1.3 once we're on m-c
Flags: needinfo?(fabrice)
Hey Fabrice, we'd like to enable internally on v1.3 in parallel.  Waiting for it to land on m-c and then for the uplift will take quite a while.
Attached patch nuwa-1_3.patchSplinter Review
Inder, here's a rebased version for 1.3/aurora.
https://hg.mozilla.org/mozilla-central/rev/ca2ab9f4ac9c
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla29
Whiteboard: [c=memory p= s= u=tarako] [MemShrink:P1][tarako][CR 599688] → [c=memory p= s=2014.01.31 u=tarako] [MemShrink:P1][tarako][CR 599688]
Depends on: 966582
Backed out from both branches for frequent B2G mochitest-4 crashes as seen in bug 966582. On Aurora, it was basically perma-crash. On trunk, it wasn't that bad, but still frequent.

https://hg.mozilla.org/releases/mozilla-aurora/rev/3a4d6b21c0ea
https://hg.mozilla.org/mozilla-central/rev/c5b7d3d1d5ee
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: mozilla29 → ---
I can reproduce this crash locally, and I got backtrace on content process:

Program received signal SIGSEGV, Segmentation fault.
_pr_poll_with_poll (pds=0x402c6020, npds=1, timeout=<value optimized out>)
    at /home/patrick/w/hgpool/emu1/mcgit/nsprpub/pr/src/pthreads/ptio.c:3814
3814	                    in_flags_read = (pds[index].fd->methods->poll)(
#0  _pr_poll_with_poll (pds=0x402c6020, npds=1, timeout=<value optimized out>)
    at /home/patrick/w/hgpool/emu1/mcgit/nsprpub/pr/src/pthreads/ptio.c:3814
#1  0x41f4d62e in PR_Poll (pds=0x1f, npds=1076706396, timeout=65535)
    at /home/patrick/w/hgpool/emu1/mcgit/nsprpub/pr/src/pthreads/ptio.c:4326
#2  0x405f829e in nsSocketTransportService::Poll (this=<value optimized out>, wait=<value optimized out>, 
    interval=0x402d3cb4) at /home/patrick/w/hgpool/emu1/mcgit/netwerk/base/src/nsSocketTransportService2.cpp:393
#3  0x405f85da in nsSocketTransportService::DoPollIteration (this=0x42de1a00, wait=true)
    at /home/patrick/w/hgpool/emu1/mcgit/netwerk/base/src/nsSocketTransportService2.cpp:820
#4  0x405f87d6 in nsSocketTransportService::Run (this=0x42de1a00)
    at /home/patrick/w/hgpool/emu1/mcgit/netwerk/base/src/nsSocketTransportService2.cpp:684
#5  0x405d34fe in nsThread::ProcessNextEvent (this=0x40202a20, mayWait=<value optimized out>, result=0x402d3d47)
    at /home/patrick/w/hgpool/emu1/mcgit/xpcom/threads/nsThread.cpp:637
#6  0x405a3be8 in NS_ProcessNextEvent (thread=0x40202a20, mayWait=false)
    at /home/patrick/w/hgpool/emu1/mcgit/xpcom/glue/nsThreadUtils.cpp:263
#7  0x40709f84 in mozilla::ipc::MessagePumpForNonMainThreads::Run (this=0x42e20f10, aDelegate=0x402eb1a0)
    at /home/patrick/w/hgpool/emu1/mcgit/ipc/glue/MessagePump.cpp:303
#8  0x406ffd3c in MessageLoop::RunInternal (this=0x1000000)
    at /home/patrick/w/hgpool/emu1/mcgit/ipc/chromium/src/base/message_loop.cc:226
#9  0x406ffdba in MessageLoop::RunHandler (this=0x402eb1a0)
    at /home/patrick/w/hgpool/emu1/mcgit/ipc/chromium/src/base/message_loop.cc:219
#10 MessageLoop::Run (this=0x402eb1a0) at /home/patrick/w/hgpool/emu1/mcgit/ipc/chromium/src/base/message_loop.cc:193
#11 0x405d3a9c in nsThread::ThreadFunc (arg=<value optimized out>)
    at /home/patrick/w/hgpool/emu1/mcgit/xpcom/threads/nsThread.cpp:258
#12 0x41f50348 in _pt_root (arg=<value optimized out>)
    at /home/patrick/w/hgpool/emu1/mcgit/nsprpub/pr/src/pthreads/ptthread.c:205
#13 0x40017358 in _thread_create_startup (arg=0x402abc00)
    at /home/patrick/w/hgpool/emu1/mcgit/mozglue/build/Nuwa.cpp:543
#14 thread_create_startup (arg=0x402abc00) at /home/patrick/w/hgpool/emu1/mcgit/mozglue/build/Nuwa.cpp:574
#15 0x40060e4c in __thread_entry (func=0x402d3f00, arg=0x402abc00, tls=<value optimized out>)
    at bionic/libc/bionic/pthread.c:217
#16 0x4006099c in pthread_create (thread_out=<value optimized out>, attr=0x402abc14, 
    start_routine=0x40017319 <thread_create_startup>, arg=0x402abc00) at bionic/libc/bionic/pthread.c:357
#17 0x00000000 in ?? ()

The pds[index].fd has type (PRFileDesc *) and its value is 0x1f, which is not accessible.
Depends on: 966802
Whiteboard: [c=memory p= s=2014.01.31 u=tarako] [MemShrink:P1][tarako][CR 599688] → [c=memory p= s= u=1.3+tarako] [MemShrink:P1][tarako][CR 599688]
Fabrice,

Is this on your radar?
Flags: needinfo?(fabrice)
(In reply to Preeti Raghunath(:Preeti) from comment #87)
> Fabrice,
> 
> Is this on your radar?

I feel it's my whole life these last couple of weeks. I think I'm gonna make a fridge magnet saying "Re-enable Nuwa!"
Flags: needinfo?(fabrice)
https://hg.mozilla.org/mozilla-central/rev/322238bbb29b

Fabrice and I agree that we should way a day or two before uplifting this so we can better see any issues that pop up from the landing.
Status: REOPENED → RESOLVED
Closed: 10 years ago10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla30
Depends on: 968250
I'm adding NO_UPLIFT here to indicate to not uplift this until we figure out what happened in bug 968250. If nuwa caused bug 968250, then we'll need to back out.
Whiteboard: [c=memory p= s= u=1.3+tarako] [MemShrink:P1][tarako][CR 599688] → [c=memory p= s= u=1.3+tarako] [MemShrink:P1][tarako][CR 599688][NO_UPLIFT]
I've just run into another issue that seems to have been caused by this patch and which is a *significant* regression: the process priorities are all over the place because the process priority manager seems to be convinced that there's always a high-priority process running (like a process holding the CPU wakelock).

I have not yet managed to establish what is the root cause but it seems that there's a leftover entry in the |mHighPriorityChildIDs| child array ( http://hg.mozilla.org/mozilla-central/file/579cf46bc21e/dom/ipc/ProcessPriorityManager.cpp#l176 ) that confuses the manager and causes it to miscompute the priorities of the other processes.

Disabling Nuwa by unsetting MOZ_NUWA_PROCESS during the build process makes the issue go away.
Gabriele, file a follow up please.
Removing NO_UPLIFT as the root cause of the problem has been found for why the keon was failing to boot. See bug 967967. We'll want to uplift bug 967967 with this patch if we are to ensure that the keon device remains functional on 1.3.
Whiteboard: [c=memory p= s= u=1.3+tarako] [MemShrink:P1][tarako][CR 599688][NO_UPLIFT] → [c=memory p= s= u=1.3+tarako] [MemShrink:P1][tarako][CR 599688]
We should still sit on this for a couple days.
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #94)
> We should still sit on this for a couple days.

Agreed - the bake time is definitely important here to ensure that we're not hitting basic bustage with the pref on.

Can we plan for the uplift on Monday? Or do we want to earlier? I want to make sure that we do get this in our next full functional test run, so that we can find nuwa crashes during the testing.
That sounds reasonable.  I can uplift it Monday morning Taipei time.
We made a commitment to partners that we would uplift this by Friday.  We have to do it before Monday.  Thanks.
(In reply to Faramarz (:faramarz) from comment #97)
> We made a commitment to partners that we would uplift this by Friday.  We
> have to do it before Monday.  Thanks.

Alright, I guess we're doing this Friday then.

Also - there's a smoketest run that has been executed now with nuwa enabled. Only new issue found with nuwa enabled was bug 968501. I'm currently looking someone to test to see if that happens with nuwa disabled.
Depends on: 968501
Yes, 2/7 was the committed date.
The UI automation report came out clean this morning with no regressions caused by nuwa, so were confirmed to be safe here. We're also planning on kicking off a full functional run next Monday.

Fabrice - Can you land this on b2g28?
Flags: needinfo?(fabrice)
Follow-up to disable the flaky WebRTC tests on v1.3 as well since the patch here wasn't ever updated.
https://hg.mozilla.org/releases/mozilla-b2g28_v1_3/rev/c0ea6c9350d6

And re-enable the WebRTC tests on !B2G on trunk because I'm somehow doubting that was the intended goal.
https://hg.mozilla.org/integration/mozilla-inbound/rev/9add9f9ed1fd
Thanks Ryan, we owe you one!
(In reply to Wes Kocher (:KWierso) from comment #104)
> (In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #102)
> > And re-enable the WebRTC tests on !B2G on trunk because I'm somehow doubting
> > that was the intended goal.
> > https://hg.mozilla.org/integration/mozilla-inbound/rev/9add9f9ed1fd
> 
> Well, I had to back that out in
> http://hg.mozilla.org/integration/mozilla-inbound/rev/d91a9bc9e84c for
> various test failures like
> https://tbpl.mozilla.org/php/getParsedLog.php?id=34317544&tree=Mozilla-
> Inbound and
> https://tbpl.mozilla.org/php/getParsedLog.php?id=34316799&tree=Mozilla-
> Inbound

ekr - is this the failure you were talking about over email that occurs with the nuwa webrtc crash fixed?
Flags: needinfo?(ekr)
Impressive that we managed to break them that fast. Spun off to bug 969718.
Yes, this looks like a similar problem.
Flags: needinfo?(ekr)
status-b2g-v1.3T: fixed in tarako branch, remove [tarako] in whiteboard
Whiteboard: [c=memory p= s= u=1.3+tarako] [MemShrink:P1][tarako][CR 599688] → [c=memory p= s= u=1.3+tarako] [MemShrink:P1][CR 599688]
Depends on: 965912
Depends on: 974807
re-enabled tests after a green try (and b2g 1.3 has them enabled)
https://hg.mozilla.org/integration/mozilla-inbound/rev/752a28fd6b73
No longer depends on: 974807
Depends on: 1024928
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: