The default bug view has changed. See this FAQ.

[Leo] b2g takes high cpu resource (>90%) and system too slow, due to long CC pauses

RESOLVED FIXED in 1.1 QE6

Status

Firefox OS
General
P1
critical
RESOLVED FIXED
4 years ago
4 years ago

People

(Reporter: leo.bugzilla.gecko, Unassigned)

Tracking

({meta, perf})

unspecified
1.1 QE6
ARM
Gonk (Firefox OS)
meta, perf
Dependency tree / graph

Firefox Tracking Flags

(blocking-b2g:leo+, b2g18 fixed)

Details

(Whiteboard: [MemShrink:P1] [c=memory u=1.1])

Attachments

(7 attachments, 1 obsolete attachment)

(Reporter)

Description

4 years ago
STR>
1) run app from first icon of homescreen.
2) load wait 10 seconds.
3) go homescreen by home key
4) do test 1)~3) by end icon of homescreen 
(leo has a 23 app on homescreen including operator apps)
5) go to step 1) 

and never kill app by user. if it has no memory, LMK will kill any apps to run new one.
the purpose of this test is checking stability of target.

after doing test for 3~4 days. target is very slow.
delayed launch application and delayed response as my action.
Created attachment 782463 [details]
after testing for 4 days, take a memory-report / dmd / gc / cc

basically, this version included the patch of Bug 893012.
i am testing to investigation the patch of Bug 893012.
Created attachment 782464 [details]
log of top 10 process

$> adb shell top -m 10 > log_top.txt

periodically, b2g takes high cpu resource almost 92%~95%.
Created attachment 782498 [details]
instead Comment #1, do use this memory-report/dmd/gc/cc log.

instead Attachment 782463 [details], use this.
Created attachment 782499 [details]
b2g-ps.txt

when i was long-pressed home key, there are no running app. 
but, b2g-ps displays 

APPLICATION      USER     PID   PPID  VSIZE  RSS     WCHAN    PC         NAME
Browser          app_8025  8025  140   71960  21572 ffffffff 40091430 S /system/b2g/plugin-container
after do this.
$> kill -9 8025

no chage anything. remains problem.

Comment 6

4 years ago
What is "periodically"? Also, do we have profiler working for b2g?  Would be good to see where
the cpu time is spent.

Comment 7

4 years ago
Oh, this is about https://bugzilla.mozilla.org/show_bug.cgi?id=893012#c65

Comment 8

4 years ago
The root process seems to have nearly 50000 event listeners. Clearly not right.
Something is adding event listener all the time but not remove after use?
Or one should probably add just one event listener, but because of some bug in gaia ends up
adding listener all the time.

Comment 9

4 years ago
Looks like the listeners are added to
0x4c25c1c0 [rc=7] FragmentOrElement (xhtml) iframe class='' app://system.gaiamobile.org/index.html

So, some iframe element in app://system.gaiamobile.org/index.html document.
There is also some JS array with over 20000 items in it, and each item is an Object ...
Looks like property called momentums keeps that array alive.

Updated

4 years ago
Component: General → Gaia
Looks like "momentums" is something from BrowserElementPanning:
  http://mxr.mozilla.org/mozilla-central/ident?i=momentums&filter=
Hmm, the momemtum thing might not be a gaia bug, but seems to come from http://mxr.mozilla.org/mozilla-central/source/dom/browser-element/BrowserElementPanning.js#701
Er, wait, wrong number for event listeners. Not 50000, but "only" about 15000
Comment on attachment 782463 [details]
after testing for 4 days, take a memory-report / dmd / gc / cc

I'm marking this attachment obsolete, per comment 3.
Attachment #782463 - Attachment is obsolete: true
Well, the event listeners aren't a mystery; BrowserElementPanning calls addEventListener but never calls removeEventListener.  :-/
(In reply to leo.bugzilla.gecko from comment #0)
> STR>
> 1) run app from first icon of homescreen.
> 2) load wait 10 seconds.
> 3) go homescreen by home key
> 4) do test 1)~3) by end icon of homescreen 
> (leo has a 23 app on homescreen including operator apps)
> 5) go to step 1) 
> 
> and never kill app by user. if it has no memory, LMK will kill any apps to
> run new one.
> the purpose of this test is checking stability of target.
> 
> after doing test for 3~4 days. target is very slow.
> delayed launch application and delayed response as my action.

Can you tell which Gecko version you are using? Is it mozillab2g-18?

In this case smaug mentionned that it could be useful to port the small patch of bug 899089 to this branch in order to have more useful debug informations.
> Can you tell which Gecko version you are using? Is it mozillab2g-18?

Almost certainly; it's Leo.

> In this case smaug mentionned that it could be useful to port the small patch of bug 
> 899089 to this branch in order to have more useful debug informations.

Keep in mind that it takes them many days to run this test.

> Well, the event listeners aren't a mystery; BrowserElementPanning calls addEventListener 
> but never calls removeEventListener.  :-/

I guess this isn't necessarily the problem, although it's certainly /a/ problem.
Oh, I'm wrong; BrowserElementPanning does (try to) remove its event listener.
over 7000 JS Object (Function - apploaded)
http://mxr.mozilla.org/gaia/source/apps/system/js/window_manager.js#800 ?

And ~2400 JS Object (Function - visibilityChangeHandler)
http://mxr.mozilla.org/mozilla-central/source/dom/browser-element/BrowserElementParent.jsm#186 ?
And about 2400 JS Object (Function - ignoreClearCacheOn) but I don't see that in mxr (gaia or mozilla-central)
I wrote a little script to analyze the GC log for the names of things rooted by things with names starting with "nsXPCWrappedJS[nsIDOMEventListener" and the large ones are these:
 7133 Function apploaded
 2391 Function visibilityChangeHandler
 2391 Function ignoreClearCacheOn
 2391 Function ignoreClearCacheOff
Which is basically what you came up with, with the addition of ignoreClearCacheOff. :)
ignoreClearCacheOn and ignoreClearCacheOff are in the bindingArray for the script at resource://gre/modules/BrowserElementParent.jsm:1, along with visibilityChangeHandler, which seems to mean that they are local variables or arguments for the script. Unfortunately the line number appears bogus, but that does suggest that it is something in BEP related to visibilityChangeHandler, which makes sense given that they appear as event listeners the exact same number of times.  There's also a function BrowserElementParent.prototype._ownerIgnoreClearCacheOff.
Based on cursory examination, I think all of the visibilityChangeHandler and ignoreClear* are being held alive by this nsGlobalWindow #3 from this path:
    [nsXPCWrappedJS (nsIDOMEventListener)] <--[mListeners[i] mListener]-- [nsEventListenerManager] <--[mListenerManager]-- [nsGlobalWindow #3]

and I'm guessing that all of the apploaded event listeners are being kept alive by this path:
  [nsXPCWrappedJS (nsIDOMEventListener)] <--[mListeners[i] mListener]-- [nsEventListenerManager] <--[[via hash] mListenerManager]-- [FragmentOrElement (xhtml) iframe class='' app://system.gaiamobile.org/index.html]

eg via the iframe that Olli mentioned.

But I think that's expected, so that's not very useful.
> There's also a function BrowserElementParent.prototype._ownerIgnoreClearCacheOff.

Do we think this is a vendor customization?
Blocks: 899328
Leo, have you modified BrowserElementParent to add _ownerIgnoreClearCacheOff?  We can't find "_ownerIgnoreClearCacheOff" in any of our code.
Flags: needinfo?(leo.bugzilla.gecko)
blocking-b2g: --- → leo?
Component: Gaia → General
Summary: [Leo] b2g takes high cpu resource (>90%) and system too slow. → [Leo] b2g takes high cpu resource (>90%) and system too slow, due to long CC pauses
Leo, can you also please comment on whether a fix is desired for B2G 1.1?

Moz triage: This bug will cause our battery life to gradually worsen and will cause the phone to gradually become jankier as the user uses the phone for a while (a few weeks?) without rebooting.
Keywords: meta
I'm going to call this [MemShrink] even though the primary effect here is on snappiness and battery life because we believe the root cause is a leak (actually, a few separate leaks).
Whiteboard: [MemShrink]
> And ~2400 JS Object (Function - visibilityChangeHandler)
> http://mxr.mozilla.org/mozilla-central/source/dom/browser-element/BrowserElementParent.jsm#186 ?

If the relevant BEPs have been collected, these should go away when you turn the screen on/off.

I'm not sure how to do better here, but ideas are welcome.
Depends on: 899354
No longer blocks: 899328
Depends on: 899328
Created attachment 782872 [details] [diff] [review]
Diff_BrowserElementParent.jsm.patch

sorry for confusing of ignoreClearCache*
ignoreClearCacheOn/ignoreClearCacheOff is custimization by leo.
it is applied to fix screen is bliking as changing screen on/off.

and also, it was changed following as Bug 894147 which it use weakBEP instead of directly bind() on this.

i am attaching Diff_BrowserElementParent.jsm.patch between LEO and gecko-18
Flags: needinfo?(leo.bugzilla.gecko)
Does Leo have plans to make this entire repository available to Mozilla and the world, as required by MPL?
At the very least, can we please see the changes to BrowserElementChild?
> Leo, can you also please comment on whether a fix is desired for B2G 1.1?
> 
> Moz triage: This bug will cause our battery life to gradually worsen and
> will cause the phone to gradually become jankier as the user uses the phone
> for a while (a few weeks?) without rebooting.

leo will be tested to check stability for 5days from next monday.
although leo has no crash by caused leaked memory. if it has other problems like this, the stability test will be also failed.

IMO, it should be fixed for B2G 1.1
(In reply to Justin Lebar [:jlebar] from comment #32)
> At the very least, can we please see the changes to BrowserElementChild?

there are no diff between BrowserElementChild.js and gecko-18's one
(In reply to Justin Lebar [:jlebar] from comment #26)
> Leo, have you modified BrowserElementParent to add
> _ownerIgnoreClearCacheOff?  We can't find "_ownerIgnoreClearCacheOff" in any
> of our code.

Hi Justin, 
I added the IgnoreClearCacheOn/Off listeners in BEP for bug 871919/871413/862913/825174 at few month ago. 

> And ~2400 JS Object (Function - visibilityChangeHandler)
> And about 2400 JS Object (Function - ignoreClearCacheOn)

And it has exactly same path with visibilityChangeHandler. So, if adding visibilityChangeHandler has the problem (piled up listeners in memory), it also have the same problem obviously..
> if it has other problems like this, the stability test will be also failed.

I think the visibilitychanged handler is only a problem if you launch thousands of apps without ever turning the screen on or off, so I think this issue would likely not be a problem in the field.

The other listeners that Leo added may or may not be safe in the same way; I'd have to look more closely to say for sure (and I'd need the full tree, not just the changes to this one file).  But I'm not clear on Mozilla's policy on reviewing code in repositories like this, so I'm going to hold off until I hear more from management.
leo+ per comment 33.
blocking-b2g: leo? → leo+
Depends on: 899801
Leo should change the patch here to match the idiom introduced in bug 899354.

The fix in bug 899354 won't have a big impact on users -- it fixes a "leak" which only occurs if the user launches hundreds of apps without turning off the screen, which is unlikely to happen in the real world.  But the fix may be more important for the listeners added in the vendor customization here.  I can't tell for sure how important this fix is, because I don't know when these events are fired; I would need to see the complete source code.

Note that matching the idiom from bug 899354 will not fix the fact that the patch here adds a new web-facing API on <iframe mozbrowser>.  I really hope that we will have an opportunity to revisit this before the Leo device ships.

Updated

4 years ago
Keywords: perf
Whiteboard: [MemShrink] → [MemShrink] [c=memory u=1.1]
(In reply to Justin Lebar [:jlebar] from comment #38)
> Leo should change the patch here to match the idiom introduced in bug 899354.

we make a decision to remove leo cusmization on BEP. and it already applied on our version(v08o)
Never mind those thing. 
thanks :jlebar for your concern.
This may have been fixed.  Can you test again?  Thanks.
Flags: needinfo?(jeremy.kim.leo)
Whiteboard: [MemShrink] [c=memory u=1.1] → [MemShrink:P1] [c=memory u=1.1]
Created attachment 792536 [details]
memory-report-898990.tgz

sorry for delay because of my vacation on last week.

i was tested the latest leo with those patch (bug 899328, bug 899354, bug 899801)
(for refrerence, those patches not yet applied on leo's main stream.)

there are some strange problems. leo was freezed after 12 hours. 
the freeze means never response for user input(touch, home-key, power-key)

STR> 
infinitly, run all apps one by one.
because of limitation of memory, if it closes limitation, 
LMK will be activated and some process will be killed to run new app.

when i was checked the information of processes.
$adb shell> ps
app_961   961   141   0      0     c00bc32c 00000000 Z Facebook
app_839   839   141   0      0     c00bc32c 00000000 Z Communications
app_4403  4403  141   0      0     c00bc32c 00000000 Z Wikipedia
root      4465  141   0      0     c00bc32c 00000000 Z Gecko_IOThread
i guess some child process of b2g didn't rightly killed by kernel or somthing.

i can't make sure this problems caused by those patched 
so, i will try one more time to compare those patches 
target 1 - applied patches which is bug 899328, bug 899354, bug 899801                                            
target 2 - not applied anything.
Created attachment 793215 [details]
memory-report-before-898990-15hours.tgz

this is "target 2 - not applied anything." and report after 15 hours test.

it shows on gc log of b2g process.
3838 Function apploaded
971  Function visibilityChangeHandler
Created attachment 793230 [details]
memory-report-after-898990-22hours.tgz

this is "target 1 - applied patches which is bug 899328, bug 899354, bug 899801" after 22 hours test.

it looks very good compare with before(Attachment 793215 [details]).
in result, it was verfied me.
Flags: needinfo?(jeremy.kim.leo)
Let's call this fixed, then!  Thanks, Jeremy.
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
status-b2g18: --- → fixed
FYI, here are the top objects in the CC logs, before and after the fix.

== CC log ==

Items marked with * either disappeared after the fix, or otherwise decreased
significantly.

Before:

> $ grep '^[^>]' cc-edges.log  | cut -f '2-10' -d ' ' | sort | uniq -c | sort -rg | head -n20
>    5677 [gc.marked] JS Script
> *  5186 [rc=2] nsXPCWrappedJS (nsIDOMEventListener)
>    5114 [gc.marked] JS Object (Object)
>    5041 [gc.marked] JS Object (Function)
> *  4073 [gc] JS Object (Call)
> *  3836 [gc] JS Object (Function - apploaded)
>    3327 [gc] JS Object (Object)
> *  1282 [rc=1] XPCWrappedNative
>    1241 [gc.marked] JS Object (Block)
> *  1053 [gc] JS Object (XPCWrappedNative_NoHelper)
> *   967 [gc] JS Object (Function - visibilityChangeHandler)
> *   967 [gc] JS Object (Function - get)
>     882 [rc=1] nsGenericDOMDataNode
>     767 [gc.marked] JS Object (Array)
>     617 [gc] JS Object (Array)
>     573 [gc.marked] JS Object (Call)
>     556 [known=1]
>     512 [gc] JS Object (Function)
>     283 [gc] JS Script
>     251 [rc=1] XPCWrappedNative (nsJSIID)

After:

> $ grep '^[^>]' cc-edges.log | cut -f '2-10' -d ' ' | sort | uniq -c | sort -rg | head -n20
>    5490 [gc.marked] JS Script
>    5155 [gc.marked] JS Object (Object)
>    4997 [gc.marked] JS Object (Function)
>    4432 [gc] JS Object (Object)
>    1205 [gc.marked] JS Object (Block)
>     980 [rc=1] nsGenericDOMDataNode
>     798 [gc.marked] JS Object (Array)
>     742 [gc] JS Object (Array)
>     691 [known=1]
>     637 [gc] JS Object (Function)
>     537 [gc.marked] JS Object (Call)
>     487 [gc] JS Script
> *   418 [rc=2] nsXPCWrappedJS (nsIDOMEventListener)
>     345 [known=2]
> *   324 [rc=1] XPCWrappedNative
> *   304 [gc] JS Object (Call)
>     297 [rc=2] nsGenericDOMDataNode
>     291 [gc] JS Object (Proxy)
>     258 [rc=1] XPCWrappedNative (nsJSIID)
>     258 [gc.marked] JS Object (nsJSIID)
Also, anything with [gc.marked] in it shouldn't contribute to CC times.
You need to log in before you can comment on or make changes to this bug.