Closed Bug 1062119 Opened 11 years ago Closed 11 years ago

RestyleManager::RestyleElement() being called continuously in the system app causes CPU usage when idle

Categories

(Firefox OS Graveyard :: Gaia::System, defect)

ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

(blocking-b2g:2.1+, b2g-v1.4 unaffected, b2g-v2.0 unaffected, b2g-v2.1 verified, b2g-v2.2 verified)

VERIFIED FIXED
2.1 S4 (12sep)
blocking-b2g 2.1+
Tracking Status
b2g-v1.4 --- unaffected
b2g-v2.0 --- unaffected
b2g-v2.1 --- verified
b2g-v2.2 --- verified

People

(Reporter: ting, Assigned: kgrandon)

References

Details

(Keywords: regression, Whiteboard: [systemsfe])

Attachments

(7 files, 2 obsolete files)

Just sync and rebuilt from trunk today, on my Flame I see sluggish scrollbar while scrolling in Homescreen, and Settings. 35.0a1 gecko: 6be703108c55455d320fccab7dc52f0089710907 gaia: f5e84652c3356e61395043b4228ca0dde75aa7f8
Can you check if you're seeing high CPU usage at idle? A lot of people have been reporting that on IRC and I'm bisecting the issue right now.
vmstat when idle: r b free mapped anon slab in cs flt us ni sy id wa ir 0 0 23324 51260 62240 20700 312 675 0 19 0 8 99 0 0 1 0 23288 51288 62228 20700 311 688 0 24 0 10 99 0 0 1 0 23260 51260 62364 20700 345 578 0 24 0 8 99 0 0 0 0 23344 51288 62372 20700 319 816 0 15 0 3 99 0 0 0 0 23408 51260 62316 20700 374 703 3 25 0 12 99 0 0 0 0 24176 50488 62316 20684 377 736 0 26 0 7 99 0 0 0 0 24668 50488 61708 20684 317 670 0 22 0 9 99 0 0
Same problem here, CC'ing some people that have worked on this today. We're trying to pinpoint what introduced this issue.
Changing the title to better describe what we found out until now. So apparently epoll_wait() is returning continuously in the main process' main thread and this is causing ridiculously high CPU usage on single-core devices (with associated sluggishness). The effect is a little less pronounced on dual-core devices (à la Flame) but still very visible. Running 'top' or 'vmstat' can be used to check if you're affected.
Summary: Scrollbar is sluggish while scrolling → Continuous polling causes high CPU usage when the phone is idle
Is this related to (or the same bug as) bug 1050510?
Can we get someone from QA to bisect this immediately?
Flags: needinfo?(jsmith)
'watch adb shell b2g-info' when the phone is idle ... if this issue is happening you'll see substantial CPU use in the b2g process.
FWIW I have this on Gecko ef1c1cf1c9/Gaia 52670853c1 so this goes back a bit.
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #6) > Can we get someone from QA to bisect this immediately? Yeah sure.
Flags: needinfo?(jsmith)
Switching to qawanted only first to branch check, then add back the window keyword to start doing the bisection.
If what I'm seeing is correct here /system/b2g/b2g is running really high in Flame 2.2, Flame 2.1 and OpenC 2.2 being the highest of them all when idling. I included the branches running a high of 6-7% b2g process as affected as well because from what I heard idling should be super low 2% or so. (could be wrong though) STR to get these results: 1. Flash device to desired branch. 2. Set Adb and Devtools in settings. 3. Kill all running apps. 4. Run adb shell top -m 10 (adb shell b2g-info was giving me an error) 5. Check the highest cpu % This bug repro's on: Flame 2.2, Flame 2.1, Flame 2.0, Flame 1.4, OpenC 2.1 Actual Results: High CPU usage for /system/b2g/b2g process when device is idle with no running apps. Repro Rate: 5/5 Environmental Variables: (29-35% /system/b2g/b2g) Device: Flame Master BuildID: 20140903062451 Gaia: 52670853c17fc0d3d33065c667c0ce124c93b98f Gecko: 5e9826980be5 Version: 35.0a1 (Master) Firmware Version: v123 ------------------------------------------------ Environmental Variables: (16-22% /system/b2g/b2g) Device: Flame 2.1 BuildID: 20140903085050 Gaia: fbb297c39aab5f17b179533d2a9a6c5166b2c197 Gecko: 31dad821234e Version: 34.0a2 Firmware Version: v123 ------------------------------------------------ Environmental Variables: (6-7% /system/b2g/b2g) Device: Flame 2.0 BuildID: 20140903075252 Gaia: d056733f8a7a1a152f5458b323f092c47dbffa48 Gecko: 742cb642750f Version: 32.0 (2.0) Firmware Version: v123 ------------------------------------------------ Environmental Variables: (6-7% /system/b2g/b2g) Device: Flame 1.4 BuildID: 20140903033752 Gaia: 2ee5b00bfbb8a67a967094804390b4afce8ecf54 Gecko: f5a75c0dd74e Version: 30.0 (1.4) Firmware Version: v123 ------------------------------------------------ Environmental Variables: (37-44% /system/b2g/b2g) Device: Open_C Master BuildID: 20140903133923 Gaia: af04d8bc2111d4ea146239a89ff602206b85eaf5 Gecko: acbdce59da2f Version: 35.0a1 (Master) Firmware Version: P821A10V1.0.0B06_LOG_DL
QA Whiteboard: [QAnalyst-Triage?]
Flags: needinfo?(jmitchell)
Keywords: qaurgent, qawanted
QA Contact: croesch
Feel free to flip the tracking flags back if 6-7% is deemed acceptable and you consider those branches not affected.
if 6-7% is deemed a repro then this is not a regression.
QA Whiteboard: [QAnalyst-Triage?] → [QAnalyst-Triage+]
Flags: needinfo?(jmitchell)
QA results all matches our current knowledges. According to contributors who first reported this to me (:mac- and :kamil), it would be at some point around the end of August, 27th or 29th that the issue started. Can we get regression window?
(In reply to Alexandre LISSY :gerard-majax from comment #15) > Can we get regression window? I've started bisecting this yesterday so I'll pick up from where I left, I'm taking this.
Assignee: nobody → gsvelto
Status: NEW → ASSIGNED
Just tried a nightly build from the 25th of August and I can see around ~12% CPU usage when the phone is idle. This is not as bad as recent builds but shouldn't be happening anyway.
The first good version I could find is bbfc46e60d79. The results I get are somewhat mixed as far as CPU usage goes. Sometimes it's ~6-7%, sometimes much more; on good commits however the b2g usage when the phone is sitting idle in the homescreen is practicaly 0 so anything above that is a bug.
OK, this is taking forever because a lot of the intermediate changes in bug 848954 are broken and so I've been 'hg bisect --skip'ing a lot of changesets but I'm still not done :-/
Here's a quick udpate: I could confirm that this is _not_ bug 1061012 but since that also creates a CPU load at idle it makes bisecting even more difficult :-/ If anybody is also investigating this make sure you have the fix for bug 1061012 in gaia.
OK, here's an additional hint. I see activity in the Timer thread too so that's probably where the continuous events that are waking the main thread are coming.
I went as far back as commit 4f7230c80511 on mozilla-central using the current gaia and I can still see the issue happening. I'm starting to think this is a gaia issue or was somehow triggered by gaia.
With the display is switched on, I see a constant stream of the following messages in the logcat. The messages come in groups of 5 or 6, once per second. It seems it's constantly redrawing, although there's no animation on the screen. Don't know if that's related. I/Gecko ( 841): [Child 841] WARNING: Transparent content with displayports can be expensive.: file ../../../../mozilla-central/layout/base/nsDisplayList.cpp, line 1317 I/Gecko ( 841): [Child 841] WARNING: Transparent content with displayports can be expensive.: file ../../../../mozilla-central/layout/base/nsDisplayList.cpp, line 1317 I/Gecko ( 841): [Child 841] WARNING: Transparent content with displayports can be expensive.: file ../../../../mozilla-central/layout/base/nsDisplayList.cpp, line 1317 I/Gecko ( 841): [Child 841] WARNING: Transparent content with displayports can be expensive.: file ../../../../mozilla-central/layout/base/nsDisplayList.cpp, line 1317 I/Gecko ( 841): [Child 841] WARNING: Transparent content with displayports can be expensive.: file ../../../../mozilla-central/layout/base/nsDisplayList.cpp, line 1317 I/Gecko ( 178): [Parent 178] ###!!! ASSERTION: old rule tree still referenced: 'Not Reached', file ../../../../mozilla-central/layout/style/nsStyleSet.cpp, line 1969 I/Gecko ( 841): [Child 841] WARNING: Transparent content with displayports can be expensive.: file ../../../../mozilla-central/layout/base/nsDisplayList.cpp, line 1317 I/Gecko ( 841): [Child 841] WARNING: Transparent content with displayports can be expensive.: file ../../../../mozilla-central/layout/base/nsDisplayList.cpp, line 1317 I/Gecko ( 841): [Child 841] WARNING: Transparent content with displayports can be expensive.: file ../../../../mozilla-central/layout/base/nsDisplayList.cpp, line 1317 I/Gecko ( 841): [Child 841] WARNING: Transparent content with displayports can be expensive.: file ../../../../mozilla-central/layout/base/nsDisplayList.cpp, line 1317 I/Gecko ( 841): [Child 841] WARNING: Transparent content with displayports can be expensive.: file ../../../../mozilla-central/layout/base/nsDisplayList.cpp, line 1317 I/Gecko ( 178): [Parent 178] ###!!! ASSERTION: old rule tree still referenced: 'Not Reached', file ../../../../mozilla-central/layout/style/nsStyleSet.cpp, line 1969 I/Gecko ( 841): [Child 841] WARNING: Transparent content with displayports can be expensive.: file ../../../../mozilla-central/layout/base/nsDisplayList.cpp, line 1317
FWIW, I saw RestyleManager::RestyleElement keeps being called with a div which has id "progress" when idle in homescreen.
(In reply to Mason Chang [:mchang] from comment #26) > Here's a quick profile - > https://people.mozilla.org/~bgirard/cleopatra/ > #report=c0e8602943c6c072e0a37b76fd6e804edbae8551 > > Lots of time spent painting something. OK, I'm utterly confused. That looks a lot different than the profile I had yesterday. The <div> element :ting has mentioned is probably from the gaia_progress shared element. That's used only by the system app ATM. It might be worth trying disabling it and seeing if the problem goes away. Anyway I basically wasted my entire day on this without going anywhere.
Here's another profile - https://people.mozilla.org/~bgirard/cleopatra/#report=9682bdc7b54b11535ae4c0e4d6ee5668c508d417 Painting all the time but no composites.. weird.
BTW from yesterday's build, I'm only getting 6% cpu usage on a flame.
Gecko 203128:5e9826980be5, Gaia dd47e0df260ac41010c623604e0c67921d3a8fb4
Here's a profile with the screen off - https://people.mozilla.org/~bgirard/cleopatra/#report=a15eb56477e3bdcacf95006c872b98665f82b3bc - still painting. Have to figure out why nsRefreshDriver is ticking.
We keep scheduling a paint even though the screen is idle. Updated to m-c- 203527:da04cf3f8a06, gaia dd47e0d and we have 0 % cpu usage while the screen is off now.
Actually the painting while the screen is off is intermittent. Turning the screen on / off a couple of times can get the device in a state where a refresh driver keeps being added / remove and we paint but don't composite. If anyone can take a look at this displaylist and see something, that'd be great.
We're not actually invalidating anything. The only hint I can think of is that the ClientColorLayer has canOptimizeAwayThebes == 1, which consistently schedules a paint - http://dxr.mozilla.org/mozilla-central/source/layout/base/FrameLayerBuilder.cpp#3436. However, I'm not sure what's actually supposed to be happening.
Attachment #8484460 - Attachment description: Display List → Display List plus Invalidation while screen is off.
At least with the display off, disabling this schedule paint seems to have stopped the nsRefreshDriver from ticking all the time - http://dxr.mozilla.org/mozilla-central/source/layout/base/nsCaret.cpp?from=nsCaret.cpp&case=true#411. The CPU usage on a flame for the b2g process goes down to 0% with the screen off. This commit was added here: changeset: 199956:8d8e4df32ffc user: Robert O'Callahan <robert@ocallahan.org> date: Wed Aug 06 17:19:27 2014 +1200 summary: Bug 1048752. Part 18: Add nsCaret::SchedulePaint. r=tn Roc, any idea why this would continuously paint with the screen off? I don't know the code well enough to know which early return we should be taking. Thanks! I still have to see why we're always painting with the screen on now.
Flags: needinfo?(roc)
(In reply to Alexandre LISSY :gerard-majax from comment #15) > QA results all matches our current knowledges. According to contributors who > first reported this to me (:mac- and :kamil), it would be at some point > around the end of August, 27th or 29th that the issue started. > > Can we get regression window? Alexandre - We typically only do regression windows on blocking issues, since windows are expensive to conduct. Do you think this bug should be nominated to block?
Flags: needinfo?(lissyx+mozillians)
(In reply to Mason Chang [:mchang] from comment #36) > At least with the display off, disabling this schedule paint seems to have > stopped the nsRefreshDriver from ticking all the time - > http://dxr.mozilla.org/mozilla-central/source/layout/base/nsCaret. > cpp?from=nsCaret.cpp&case=true#411. > > Roc, any idea why this would continuously paint with the screen off? I don't > know the code well enough to know which early return we should be taking. > Thanks! What's the stack trace for this SchedulePaint? My guess is that it's just the caret blinking. It makes sense that every time the caret blinks on or off we'd call SchedulePaint. It seems to me that when the screen is off, we shouldn't run the refresh driver at all. Imagine the app has some animation running --- we should be doing minimal processing. In that case, SchedulePaint would still be called periodically but the refresh driver would not run while the screen is off. Avoiding those caret timer wakeups would be even better, but that's more work and those wakeups are not a regression. Probably the best way to fix that would be to implement bug 866731 --- off-main-thread caret drawing. With that implemented, caret blinking would be done entirely in the compositor, and if the compositor isn't running, no work is done.
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #39) > (In reply to Mason Chang [:mchang] from comment #36) > > At least with the display off, disabling this schedule paint seems to have > > stopped the nsRefreshDriver from ticking all the time - > > http://dxr.mozilla.org/mozilla-central/source/layout/base/nsCaret. > > cpp?from=nsCaret.cpp&case=true#411. > > > > Roc, any idea why this would continuously paint with the screen off? I don't > > know the code well enough to know which early return we should be taking. > > Thanks! > > What's the stack trace for this SchedulePaint? Attached
BTW, what's the gaia progress shared element and how would I disable it?
Flags: needinfo?(gsvelto)
So the sluggish scrollbar of this bug originally seems not relate to the high cpu usage, it occurs between these two pvt builds and I don't see noticeable cpu usage difference when idle in homescreen: OK: Gaia 44bf2e3bc5ddea9db9a8c851bd353cb234aa883c Gecko https://hg.mozilla.org/mozilla-central/rev/c360f3d1c00d BuildID 20140902040205 Version 34.0a1 NG: Gaia 7e469783859785a8bd4bf02a802f32561c84be7b Gecko https://hg.mozilla.org/mozilla-central/rev/372ce1f36116 BuildID 20140902121319 Version 35.0a1 I'll file another bug as this one is now targeting high cpu usage.
(In reply to Ting-Yu Chou [:ting] from comment #43) > I'll file another bug as this one is now targeting high cpu usage. Created bug 1063333.
(In reply to Ting-Yu Chou [:ting] from comment #44) > Created bug 1063333. Sorry for derailing your original report, since verything had become sluggish on my phone I thought this was related. (In reply to Jason Smith [:jsmith] from comment #38) > Alexandre - We typically only do regression windows on blocking issues, > since windows are expensive to conduct. Do you think this bug should be > nominated to block? This is definitely a blocker as it causes the battery to drain in a matter of hours. That being said I'd warn anybody against trying to get a regression window: I tried thrice and failed as there's multiple unrelated issues in the last few days worth of builds so unless we find a more specific way to pinpoint this problem (besides CPU usage) it's a waste of time.
Flags: needinfo?(lissyx+mozillians)
(In reply to Gabriele Svelto [:gsvelto] from comment #45) > Sorry for derailing your original report, since verything had become > sluggish on my phone I thought this was related. Never mind Gabriele, it just wasn't clear what happened :)
(In reply to Mason Chang [:mchang] from comment #42) > BTW, what's the gaia progress shared element and how would I disable it? So Ting-Yu hint in comment 25 was the correct lead. The gaia_progress shared element in the system app (which I assume is the new progress bar) is causing this issue. I think this is related to CSS animation because we had something similar in bug 993259 and I feel silly for not having thought about that before. What made this bug more complicated is that the behavior of the issue changed over time as we had other bugs consuming CPU time at idle. Also it seems that bug 1052503 made the CPU usage drop which is consistent with the change there since it reduced the number of steps in the CSS animation. Kevin, we need some help here: first of all I think this is the same issue of bug 993259, namely the CSS restyling code being invoked over and over for an animated object which is not visible. We haven't yet found a solution for the Gecko problem (bug 962594) causing that behavior so we need a workaround.
Flags: needinfo?(gsvelto) → needinfo?(kgrandon)
Changing the title to reflect the issue better and requesting blocking status [Blocking Requested - why for this release]: This causes continuous CPU usage and can drain a phone's battery in a matter of hours.
Assignee: gsvelto → nobody
blocking-b2g: --- → 2.1?
Summary: Continuous polling causes high CPU usage when the phone is idle → RestyleManager::RestyleElement() being called continuously in the system app causes CPU usage when idle
Gabriele, with your hack I see a huge improvement, but I still have some bad cases: - pulling down the notification trays is very slow - ~10-20% CPU usage while idling on homescreen
Flags: needinfo?(gsvelto)
So this patch works around the issue by null'ing the animation property when the element is hidden. However it's ugly as sin and if there's a better, less intrusive ways of doing this I'd take it in a heartbeat.
Attachment #8484971 - Flags: feedback?(kgrandon)
Flags: needinfo?(gsvelto)
(In reply to Alexandre LISSY :gerard-majax from comment #49) > Gabriele, with your hack I see a huge improvement, but I still have some bad > cases: > - pulling down the notification trays is very slow > - ~10-20% CPU usage while idling on homescreen Sorry, I've accidentally cleared your needinfo. Can you try this patch? It seems to work fine on my Flame.
BTW I've tried using a CSS selector in the system app CSS code to override the animation frame based on the parent visibility (.chrome gaia-progress and .chrome gaia-progress.visible selectors in wrapper.css) but it didn't work. That would have been a lot less clunky.
Oh wow, I didn't about that progress bar work, sorry guys! I suppose we should also add a dependency on bug 962594.
Depends on: 962594
Flags: needinfo?(kgrandon)
Comment on attachment 8484971 [details] [diff] [review] [PATCH] Ugly-looking, foul-smelling, clunky workaround for bug 962594 Review of attachment 8484971 [details] [diff] [review]: ----------------------------------------------------------------- So this doesn't work as-is. I think we should either surface some API for this, or consider using the disabled attribute or similar. I'm looking into this and will follow-up on IRC or steal if appropriate. ::: shared/elements/gaia_progress/style.css @@ +16,5 @@ > } > + > +/* XXX: Workaround for bug 962594, we use this class to explicitly stop the > + * CSS animation when the element is hidden. */ > +#progress.noanim { So I don't think this will work because #progress sits behind the shadow dom. We might want to surface an API for this instead.
Attachment #8484971 - Flags: feedback?(kgrandon) → feedback-
(In reply to Kevin Grandon :kgrandon from comment #53) > Oh wow, I didn't about that progress bar work, sorry guys! I suppose we > should also add a dependency on bug 962594. +1 (In reply to Kevin Grandon :kgrandon from comment #54) > So I don't think this will work because #progress sits behind the shadow > dom. We might want to surface an API for this instead. Are you sure? It's working in my testing since I'm manipulating the relevant element directly.
(In reply to Gabriele Svelto [:gsvelto] from comment #55) > Are you sure? It's working in my testing since I'm manipulating the relevant > element directly. When I loaded the patch on a device I saw that it had the stopped default state, but never animated.
Gabriele - I agree that we should have some workaround until we get a gecko patch, and this is more inline with what I was thinking. Let me know what you think, thanks!
Attachment #8485034 - Flags: review?(gsvelto)
Component: General → Gaia::System
Comment on attachment 8485034 [details] [review] Pull request - do not animate gaia-progress by default This is working very well on my device solving the CPU usage problem but it needs a couple of unit-test related changes. These two tests are failing because the progress' element start() and stop() methods are not stubbed yet: https://github.com/mozilla-b2g/gaia/blob/26b433e4f6e10916b7b974e58218b79678fd887c/apps/system/test/unit/app_chrome_test.js#L87 https://github.com/mozilla-b2g/gaia/blob/26b433e4f6e10916b7b974e58218b79678fd887c/apps/system/test/unit/app_chrome_test.js#L95 You should also add a couple of checks to ensure that start()/stop() calls are properly done in both scenarios.
Attachment #8485034 - Flags: review?(gsvelto) → review+
Attachment #8484818 - Attachment is obsolete: true
Attachment #8484971 - Attachment is obsolete: true
Flags: needinfo?(roc)
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Comment on attachment 8485034 [details] [review] Pull request - do not animate gaia-progress by default [Approval Request Comment] [Bug caused by] (feature/regressing bug #): Rocketbar feature work. [User impact] if declined: We will ship with the worst power performance of any smart phone. [Testing completed]: Manual and unit testing. [Risk to taking this patch] (and alternatives if risky): Low risk as it only impacts the loading indicator. [String changes made]: No.
Attachment #8485034 - Flags: approval-gaia-v2.1?(bbajaj)
Assignee: nobody → kgrandon
Whiteboard: [systemsfe]
Target Milestone: --- → 2.1 S4 (12sep)
blocking-b2g: 2.1? → 2.1+
Attachment #8485034 - Flags: approval-gaia-v2.1?(bbajaj) → approval-gaia-v2.1+
Marking as testsuite+ as we have a unit test which covers this, and this is only a workaround until bug 962594 lands.
Flags: in-testsuite+
This issue has been verified successfully on flame 2.1,2.2 Repro rate:0/5 STR to get these results: 1. Update device to 20150121001510(2.1),20150121002607(2.2). 2. Kill all running apps. 3. Run adb shell top -m 10 4. Check the highest cpu % --After screen time out,the cpu useage keeps in 2%-3%.so I think it's fixed. Flame 2.1 build: Gaia-Rev 77c57eb8a985d5cbd34a597fb1b978ba6e205af6 Gecko-Rev https://hg.mozilla.org/releases/mozilla-b2g34_v2_1/rev/4c28bb3be0c6 Build-ID 20150121001510 Version 34.0 Device-Name flame FW-Release 4.4.2 FW-Incremental eng.cltbld.20150121.034530 FW-Date Wed Jan 21 03:45:41 EST 2015 Bootloader L1TC000118D0 Flame 2.2 Gaia-Rev e4f9b5da3751798f9cc5d95f302c30722cc11fca Gecko-Rev https://hg.mozilla.org/releases/mozilla-b2g37_v2_2/rev/75a462a58d7a Build-ID 20150121002607 Version 37.0a2 Device-Name flame FW-Release 4.4.2 FW-Incremental eng.cltbld.20150121.040751 FW-Date Wed Jan 21 04:08:02 EST 2015 Bootloader L1TC000118D0
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: