Serious performance regression in FxOS

RESOLVED FIXED in Firefox 44

Status

defect
P1
blocker
RESOLVED FIXED
4 years ago
4 years ago

People

(Reporter: mlien, Assigned: cyu)

Tracking

({dev-doc-complete, perf, regression})

unspecified
FxOS-S9 (16Oct)
ARM
Gonk (Firefox OS)
Dependency tree / graph

Firefox Tracking Flags

(blocking-b2g:2.5+, firefox44 fixed)

Details

(Whiteboard: [MemShrink:P1])

Attachments

(2 attachments, 2 obsolete attachments)

Reporter

Description

4 years ago
On Tue Sep. 15 2015, Raptor detected serious regressions in FxOS. Every application's cold launch time is regressed by more than 1.5 seconds and memory usage grows more than 7MB. This could point to an issue in the System or the platform.

---

Device: Aries-kk
Memory: 2048
Branch: master

---

Gaia revision: 02a15ec81e0bece9
Previous Gaia revision: 47f0ac6cf4c4525c

Gecko revision: 13d2e06d95f07770
Previous Gecko revision: 0baf29d2d17a19aa

---

Performance numbers [1][2] on Raptor.

[1] http://raptor.mozilla.org/#/dashboard/script/apps.js?device=aries&branch=master&memory=2048&series=coldlaunch.visuallyLoaded
[2] http://raptor.mozilla.org/#/dashboard/script/apps-memory.js?device=aries&branch=master&memory=2048

Comment 1

4 years ago
This is blocker. 

Ting, could you help to test this on regression bot? Thanks.
blocking-b2g: 2.5? → 2.5+
Flags: needinfo?(janus926)
Flags: needinfo?(anygregor)
Is bisecting now, will update later. Kepp NI.
It's bug 1152046. It's supposedly backed out, but the backout was backed out so the patch is still on central.
Clear NI as Cervantes has spot the culprit.
Flags: needinfo?(janus926)
We need to get bug 1166923 fixed.
Assignee: nobody → cyu
Attachment #8661735 - Flags: review?(khuey)
Assignee

Updated

4 years ago
Depends on: 1152046
Does that also explain the memory usage regression?
would the proposed patch (probably?) make the backout of 1152046 successful on b2g? Any idea why this dependence formed? - 1152046 was only on m-c for a short period.
Flags: needinfo?(cyu)
Note: I backed out bug 1152046 (again) locally, since I was insta-crashing with it in.  That lets me run, but testing https://mozilla.github.com/webrtc-landing/pc_test.html crashes in the same place (jemalloc/internal/bitmap.h) when in e10s - but not in a non-e10s profile.  So it's very touchy and machine/perf/allocation/? dependent.
(In reply to [:fabrice] Fabrice Desré from comment #7)
> Does that also explain the memory usage regression?

Yes, because content processes are forked from b2g.
Flags: needinfo?(cyu)
(In reply to Patrick McManus [:mcmanus] from comment #8)
> would the proposed patch (probably?) make the backout of 1152046 successful
> on b2g? Any idea why this dependence formed? - 1152046 was only on m-c for a
> short period.

This patch doesn't make the backout successful. It's a forward fix for the performance regression of bug 1152046 on b2g. 1152046 is still on m-c because the backout itself is backed out.
Attachment #8661735 - Attachment is patch: true
url:        https://hg.mozilla.org/integration/b2g-inbound/rev/99c11d836978425de6256aa5baf355843f39f400
changeset:  99c11d836978425de6256aa5baf355843f39f400
user:       Cervantes Yu <cyu@mozilla.com>
date:       Thu Sep 17 14:36:06 2015 +0800
description:
Bug 1204837: Make ClosingService thread Nuwa-friedly. r=khuey

Updated

4 years ago
Severity: major → blocker
Priority: -- → P1
When we see mochitest shutdown timeout, the chrome process has:

We have the main thread waiting for the IO thread:
> Thread 1 (Thread 927.927):
> #0  __futex_syscall3 () at bionic/libc/arch-arm/bionic/atomics_arm.S:183
> #1  0x4008e56c in __pthread_cond_timedwait_relative (cond=0x2f2d4, mutex=0x400c433c <gThreadListLock>, reltime=0x0) at bionic/libc/bionic/pthread.c:1477
> #2  0x4008e620 in __pthread_cond_timedwait (cond=0x2f2d4, mutex=0x400c433c <gThreadListLock>, abstime=0x0, clock=0) at bionic/libc/bionic/pthread.c:1500
> #3  0x4008e748 in pthread_join (thid=193200, ret_val=0x0) at bionic/libc/bionic/pthread.c:654
> #4  0x40f8d6f2 in base::Thread::Stop (this=this@entry=0x40441268) at ipc/chromium/src/base/thread.cc:122
> #5  0x40f9b794 in mozilla::ipc::BrowserProcessSubThread::~BrowserProcessSubThread (this=0x40441268, __in_chrg=<optimized out>) at ipc/glue/BrowserProcessSubThread.cpp:57
> #6  0x40f9b7e0 in mozilla::ipc::BrowserProcessSubThread::~BrowserProcessSubThread (this=0x40441268, __in_chrg=<optimized out>) at ipc/glue/BrowserProcessSubThread.cpp:63
> #7  0x40dca4d8 in mozilla::ShutdownXPCOM (aServMgr=<optimized out>) at xpcom/build/XPCOMInit.cpp:1027
> #8  0x423326fa in ScopedXPCOMStartup::~ScopedXPCOMStartup (this=0x4042f338, __in_chrg=<optimized out>) at toolkit/xre/nsAppRunner.cpp:1499
> #9  0x4233272a in mozilla::DefaultDelete<ScopedXPCOMStartup>::operator() (aPtr=0x4042f338, this=<optimized out>) at ../../dist/include/mozilla/UniquePtr.h:482
> #10 0x423377ca in reset (aPtr=0x0, this=0xbea4e7ec) at ../../dist/include/mozilla/UniquePtr.h:309
> #11 operator= (this=0xbea4e7ec) at ../../dist/include/mozilla/UniquePtr.h:279
> #12 XREMain::XRE_main (this=this@entry=0xbea4e7d8, argc=argc@entry=1, argv=argv@entry=0x4042f288, aAppData=aAppData@entry=0x2da88 <_ZL8sAppData>) at toolkit/xre/nsAppRunner.cpp:4410
> #13 0x42337962 in XRE_main (argc=1, argv=0x4042f288, aAppData=0x2da88 <_ZL8sAppData>, aFlags=<optimized out>) at toolkit/xre/nsAppRunner.cpp:4486
> #14 0x00010e26 in do_main (argc=argc@entry=1, argv=argv@entry=0x4042f288) at b2g/app/nsBrowserApp.cpp:167
> #15 0x00010f98 in b2g_main (argc=1, argv=<optimized out>) at b2g/app/nsBrowserApp.cpp:299
> #16 0x00010ca8 in RunProcesses (aReservedFds=..., argv=0xbea4fac4, argc=1) at b2g/app/B2GLoader.cpp:232
> #17 main (argc=1, argv=0xbea4fac4) at b2g/app/B2GLoader.cpp:297
> (gdb) fr 7
> #7  0x40dca4d8 in mozilla::ShutdownXPCOM (aServMgr=<optimized out>) at xpcom/build/XPCOMInit.cpp:1027
> 1027      delete sIOThread;

Where the IO thread notifies that the thread that ImageBrigeParent runs on has channel error, but the thread was already shut down:
> 
> Thread 4 (Thread 927.942):
> #0  __futex_syscall3 () at bionic/libc/arch-arm/bionic/atomics_arm.S:183
> #1  0x4008e264 in _normal_lock (mutex=<optimized out>) at bionic/libc/bionic/pthread.c:951
> #2  pthread_mutex_lock (mutex=0x483ffeb8) at bionic/libc/bionic/pthread.c:1041
> #3  0x4002bbd0 in __wrap_pthread_mutex_lock (mtx=0x483ffeb8) at mozglue/build/Nuwa.cpp:1308
> #4  0x40f85ea6 in LockImpl::Lock (this=this@entry=0x483ffeb8) at ipc/chromium/src/base/lock_impl_posix.cc:41
> #5  0x40f8a348 in Acquire (this=0x483ffeb8) at ipc/chromium/src/base/lock.h:16
> #6  AutoLock (lock=..., this=0x44b9cc48) at ipc/chromium/src/base/lock.h:43
> #7  MessageLoop::PostTask_Helper (this=0x483ffdf0, from_here=..., task=<optimized out>, delay_ms=delay_ms@entry=0, nestable=nestable@entry=true) at ipc/chromium/src/base/message_loop.cc:326
> #8  0x40f8a38c in MessageLoop::PostTask (this=<optimized out>, from_here=..., task=<optimized out>) at ipc/chromium/src/base/message_loop.cc:267
> #9  0x40f9beb2 in mozilla::ipc::MessageChannel::PostErrorNotifyTask (this=0x48bf8b38) at ipc/glue/MessageChannel.cpp:1836
> #10 0x40f9eda8 in mozilla::ipc::ProcessLink::OnChannelError (this=0x47aee9e0) at ipc/glue/MessageLink.cpp:429
> #11 0x40f9261e in IPC::Channel::ChannelImpl::OnFileCanReadWithoutBlocking (this=0x4a1e3800, fd=78) at ipc/chromium/src/chrome/common/ipc_channel_posix.cc:879
> #12 0x40f852ae in base::MessagePumpLibevent::OnLibeventNotification (fd=78, flags=<optimized out>, context=0x4a1e3800) at ipc/chromium/src/base/message_pump_libevent.cc:231
> #13 0x40f831e2 in event_process_active_single_queue (activeq=0x4042f2a8, base=0x40467180) at ipc/chromium/src/third_party/libevent/event.c:1350
> #14 event_process_active (base=<optimized out>) at ipc/chromium/src/third_party/libevent/event.c:1420
> #15 event_base_loop (base=0x40467180, flags=1) at ipc/chromium/src/third_party/libevent/event.c:1621
> #16 0x40f86b26 in base::MessagePumpLibevent::Run (this=0x4040c6e0, delegate=0x44b9cdf0) at ipc/chromium/src/base/message_pump_libevent.cc:357
> #17 0x40f87a78 in MessageLoop::RunInternal (this=this@entry=0x44b9cdf0) at ipc/chromium/src/base/message_loop.cc:234
> #18 0x40f87a92 in RunHandler (this=0x44b9cdf0) at ipc/chromium/src/base/message_loop.cc:227
> #19 MessageLoop::Run (this=this@entry=0x44b9cdf0) at ipc/chromium/src/base/message_loop.cc:201
> #20 0x40f8e2fc in base::Thread::ThreadMain (this=0x40441268) at ipc/chromium/src/base/thread.cc:170
> #21 0x40f8c2a4 in ThreadFunc (closure=<optimized out>) at ipc/chromium/src/base/platform_thread_posix.cc:39
> #22 0x4008ee2c in __thread_entry (func=0x40f8c29d <ThreadFunc(void*)>, arg=0x40441268, tls=<optimized out>) at bionic/libc/bionic/pthread.c:217
> #23 0x4008e97c in pthread_create (thread_out=<optimized out>, attr=0xbea4e640, start_routine=0x40f8c29d <ThreadFunc(void*)>, arg=0x40441268) at bionic/libc/bionic/pthread.c:357
> #24 0x00000000 in ?? ()
> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> 

The channel listener is ImageBridgeParent:
> #9  0x40f9beb2 in mozilla::ipc::MessageChannel::PostErrorNotifyTask (this=0x48bf8b38) at ipc/glue/MessageChannel.cpp:1836
> 1836        mWorkerLoop->PostTask(FROM_HERE, mChannelErrorTask);
> (gdb) p *mListener.mRef.mPtr.mPtr
> $3 = (mozilla::layers::ImageBridgeParent) {
>   <mozilla::layers::PImageBridgeParent> = {
>     <mozilla::ipc::IProtocol> = {
> ... (omitted for brevity)
The ImageBridgeParent instance belongs to the preallocated process. It might be that the preallocated process was started too late so that its shutdown deadlocks the chrome process. I think we need to disable the preallocated process to stabilize the tests.
Assignee

Updated

4 years ago
See Also: → 1168337
Let's see if bug 1168337 fixes this shutdown deadlock.
Status: NEW → ASSIGNED
(In reply to Cervantes Yu [:cyu] [:cervantes] from comment #17)
> Let's see if bug 1168337 fixes this shutdown deadlock.

Still having mochitest shutdown hang. There must be another bug that prevents the shutdown process from completion.
Further effect of this bug appears to be that the preallocated process does not exist, which makes profiling app startup difficult.
Whiteboard: [MemShrink]
We need to get this fixed asap. It has been 3 weeks now.
Whats the exact status here? I see backed out patches, dependent bugs and 'see also' bugs.
What needs to be done to fix this?
Flags: needinfo?(anygregor) → needinfo?(cyu)
Whiteboard: [MemShrink] → [MemShrink:P1]
Bug 1152046 cannot be backed out because of other dependencies that lands after it. It has a forward fix to disable the feature on b2g and fix this regression, but landing it reveals another regression that snicked in after the main patch of 1152046 lands and breaks Nuwa (comment #14). I am working on a forward fix of comment #14 so we can land the 2 forward fixes (one for comment #14 of this bug and the other in comment #86 of bug 1152046).
Flags: needinfo?(cyu)
(In reply to Cervantes Yu [:cyu] [:cervantes] from comment #23)
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=7d7d4dc6c4d3

I think bug 1127270 caused this regression in comment 14: the cloned ImageBridgeParent and CrossProcessCompositorParent actors doesn't have their OnChannelConnected() called and miss the references to the compositor thread. Call OnChannelConnected() when cloning the actors. This should fix the problem.
Added comments and assertions.
Attachment #8667870 - Attachment is obsolete: true
Attachment #8667870 - Flags: review?(khuey)
Attachment #8668229 - Flags: review?(khuey)
Assignee

Updated

4 years ago
Attachment #8668229 - Flags: review?(nical.bugzilla)
Remove the irrelevant hunk.
Attachment #8668229 - Attachment is obsolete: true
Attachment #8668229 - Flags: review?(nical.bugzilla)
Attachment #8668229 - Flags: review?(khuey)
Assignee

Updated

4 years ago
Attachment #8668233 - Flags: review?(khuey)
Assignee

Updated

4 years ago
Attachment #8668233 - Flags: review?(nical.bugzilla)
Attachment #8668233 - Flags: review?(nical.bugzilla) → review+
https://hg.mozilla.org/integration/b2g-inbound/rev/90c9af6a9708445d2e96746c34ab419376e9bf69
Bug 1204837, Part 1: Make ClosingService thread Nuwa-friedly. r=khuey

https://hg.mozilla.org/integration/b2g-inbound/rev/2d4f2af678ee9149baee5f3542950d8f38f66994
Bug 1204837, Part 2: Cloned ImageBridgeParent and CompositorParent should hold a reference to the compositor thread. r=nical

Updated

4 years ago
See Also: → 1210451
https://hg.mozilla.org/mozilla-central/rev/90c9af6a9708
https://hg.mozilla.org/mozilla-central/rev/2d4f2af678ee
Status: ASSIGNED → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → FxOS-S9 (16Oct)
Blocks: 1144499
You need to log in before you can comment on or make changes to this bug.