Closed Bug 1091240 Opened 5 years ago Closed 5 years ago

network-activity statusbar widget uses 20-33% CPU in WebRTC calls, adds huge networking delay

Categories

(Firefox OS Graveyard :: Gaia::System, defect, P1, critical)

ARM
Gonk (Firefox OS)
defect

Tracking

(blocking-b2g:2.1+)

RESOLVED DUPLICATE of bug 1091775
2.1 S8 (7Nov)
blocking-b2g 2.1+

People

(Reporter: jesup, Assigned: daleharvey)

References

(Depends on 1 open bug)

Details

(Whiteboard: [systemsfe])

Attachments

(5 files)

[Blocking Requested - why for this release]:
Makes webrtc calls almost unusable, especially for more than 10 minutes
We may well want this on 2.0 as well if that's possible.

+++ This bug was initially created as a clone of Bug #1087605 +++

Using a 2.1 build with H.264 enabled:
   user_pref("media.peerconnection.video.h264_enabled", true);

Go to http://mozilla.github.io/werbrtc-landing/pc_test.html
Select "one-way call" and "Require h.264"
Select "Start", select front (or rear) camera

Use "adb shell top -m 10 -d 2 -t" to watch thread CPU use
Watch how b2g/b2g uses 22-33% of the CPU (for what?)

Logs with NSPR_LOG_MODULES=signaling:5,timestamp will show that the rtt will be 300-600ms to start, and goes up from there. (3 seconds + after 20 minutes)

With the attached patch, b2g uses <3% cpu, rtt drops to <20ms typically (often <10), delay doesn't build up.

Previous profiling has shown that huge amounts of CPU are being caused by this, and updateIconVisibility() is causing reflows/relayout N times per second, which is where a lot of the CPU is going (or so it appears).  The patch verifies this is the cause.
blocking-b2g: 2.1? → 2.1+
I thought we fixed this issue already?
Flags: needinfo?(mhenretty)
Flags: needinfo?(bgirard)
(In reply to Gregor Wagner [:gwagner] from comment #1)
> I thought we fixed this issue already?

I hope so (and if so great!) I did see significant CPU use for b2g/b2g in reports from sotaro as well.  I'm updating all my trees and rebuilding (due to some annoying ISP arguments with netflix, I'm limited to a few K/s to certain sites including most of what we use for repo sync).

Sotaro's report from B2G 2.1 in bug 1087605:
https://bug1087605.bugzilla.mozilla.org/attachment.cgi?id=8512231
He shows ~12% CPU in b2g/b2g, which is still way too high, though better than the 25% I was seeing.
Also, the reason I knew it was the network-activity status was that a previous profile from Tef or from QC had pointed at it.
I made a slightly different change on system (just commented out the 
      //      window.addEventListener('moznetworkupload', this);
      //      window.addEventListener('moznetworkdownload', this);
lines in statusbar.js. Uploaded the new system to a couple of phones (Flame and FireE) and then placed a Loop call between them using the Hello app.
.

There's a *huge* improvement both in the observable video quality, and on the initial setup time of the calls (time between the call is 'connected' and the video appears.
I tried Firefox Hello's SimpleWebRTC use case on latest v2.1 flame device, by disabling "network-activity" indicator, total cpu usage was decreased from 98% to 81%.

 - total cpu usage 98% -> 81%
 - b2g main thread 12% -> 4%
 - b2g Compositor thread 12%->6%
(In reply to Sotaro Ikeda [:sotaro] from comment #5)
> I tried Firefox Hello's SimpleWebRTC use case on latest v2.1 flame device,

Firefox Hello app is the following.
 https://github.com/borjasalguero/firefox-loop-reference
Got top log on Firefox Hello's SimpleWebRTC use case with "adb shell top -m 20 -t -d 3" on latest v2.1 flame with Bug 1087605 fix.
Got top log on Firefox Hello's SimpleWebRTC use case by disabling "network activitiy" indicator and with "adb shell top -m 20 -t -d 3" on latest v2.1 flame with Bug 1087605 fix.

attachment 8513827 [details] [diff] [review] is used to disable "network activitiy" indicator.
(In reply to Sotaro Ikeda [:sotaro] from comment #6)
> (In reply to Sotaro Ikeda [:sotaro] from comment #5)
> > I tried Firefox Hello's SimpleWebRTC use case on latest v2.1 flame device,
> 
> Firefox Hello app is the following.
>  https://github.com/borjasalguero/firefox-loop-reference

You might want to try with the Firefox Hello app at https://github.com/mozilla-b2g/firefoxos-loop-client. Thanks!
Thanks for the correction :-)
Attached file Top log with 2.0
This is the top log taken from a FireE running 2.0, and the latest version of the loop app (using a Flame as the other device). There are 4 snapshots:

* One disabling completely the widget (removed the listeners).
* One setting network.activity.blipIntervalMilliseconds" to 2000 ms;
* One setting network.activity.blipIntervalMilliseconds" to 1000 ms;
* One setting network.activity.blipIntervalMilliseconds" to 250 ms (default);

If the widget is disabled, the compositor is at 11%, and b2g on the parent is at only 2%. All the excess CPU is taken by MediaStreamGrph and AudioGUM

If the widget is enabled, at the default rate, the compositor is at 18%, and b2g is at 11%. Changing the time makes a small difference in both b2g and the compositor (goes down to 16%/8%). Probably using even bigger intervals would make b2g go down further, but I don't know about the compositor (it doesn't seem to like rendering the little squares from the widget).
Also: When looking at different solutions for the network activity indicator
enable signaling:5 logs and note the rtt
Keeping the mainthread of the parent busy causes problems for IPC networking traffic

This would be very bad: (before nuking it after 20 min)
I/PRLog   ( 1364): 9381056[b3a03600]: [Socket Thread|WebrtcOMXH264VideoCodec] WebrtcOMXH264VideoCodec.cpp:1042: WebrtcOMXH264VideoEncoder:0xb2b6d5c0 set channel packet loss:0, rtt:3620

This (with the fix) is very good: (after nuking it after 2 hours)
I/PRLog   ( 1408): 2014-10-29 21:04:51.656182 UTC - 32555200[b3391700]: [Socket Thread|WebrtcOMXH264VideoCodec] WebrtcOMXH264VideoCodec.cpp:1042: WebrtcOMXH264VideoEncoder:0xb2bacef0 set channel packet loss:0, rtt:8
Severity: normal → critical
(In reply to Antonio Manuel Amaya Calvo (:amac) from comment #11)

> This is the top log taken from a FireE running 2.0, and the latest version
> of the loop app (using a Flame as the other device). There are 4 snapshots:

Is this with the status visible or invisible?   Previous comments were that the problem was be there even with the statusbar hidden.
Flags: needinfo?(amac.bug)
That was with status bar visible (all of them).
Flags: needinfo?(amac.bug)
We improved the situation in 2.1 by uplifting bug 1074683. But there are more things we can do if network activity icon changes are still causing poor webRTC quality.

2.0, the statusbar code is significantly different. I don't think the network activity icon is causing as many problems there, but can someone confirm?
Flags: needinfo?(mhenretty)
Let's have Alberto investigate some ways for improving webRTC performance with relation to the network activity statusbar icon.
Assignee: anygregor → apastor
(In reply to Michael Henretty [:mhenretty] from comment #16)
> We improved the situation in 2.1 by uplifting bug 1074683. But there are
> more things we can do if network activity icon changes are still causing
> poor webRTC quality.
> 
> 2.0, the statusbar code is significantly different. I don't think the
> network activity icon is causing as many problems there, but can someone
> confirm?

The measurements :amac provided in comment 11 are based on 2.0, so it seems it's also causing a lot of problems in 2.0.
performance of animated PNG seems to have a problem. BenWa and jrmuizel are looking into it. I am looking CairoImage, that is allocating gralloc buffer for every rendering.
I confirmed that during Animated PNG rendering, new CairoImage is allocated, copy image data to it and destroyed for every rendering.
(In reply to Michael Henretty [:mhenretty] from comment #17)
> Let's have Alberto investigate some ways for improving webRTC performance
> with relation to the network activity statusbar icon.

The problem doesn't seem to be  webRTC related. I tried just downloading some data on the background (using here maps). I launched the download and then pressed the home button so the only thing that was being animated on the homescreen was the network indicator. And I got a 12% (varies from 10% to 15%) of CPU for the b2g thread on the main process, and 8-10% on the compositor thread. When there's no network indicator, that goes to around 0% for each of them (I'm guessing since they don't even appear on the top).
Oh, I forgot. As :dcoloma said, all the tests I've done are on 2.0 (some on a FireE, some on a Flame, as indicated).
(In reply to Antonio Manuel Amaya Calvo (:amac) from comment #21)

> The problem doesn't seem to be  webRTC related. I tried just downloading
> some data on the background (using here maps).

This doesn't surprise me at all, and on larger pages, pdf's, etc I can easily believe the CPU use (and possible latency increases) might negatively impact loadtimes/performance and IPC traffic to content.

NI bent for his knowledge after discussing this on IRC.
Flags: needinfo?(bent.mozilla)
Depends on: 1091763
Flags: needinfo?(bgirard)
Depends on: 1091775
Depends on: 1091777
Depends on: 1091782
Whiteboard: [systemsfe]
(In reply to Daniel Coloma:dcoloma from comment #18)
> (In reply to Michael Henretty [:mhenretty] from comment #16)
> > We improved the situation in 2.1 by uplifting bug 1074683. But there are
> > more things we can do if network activity icon changes are still causing
> > poor webRTC quality.
> > 
> > 2.0, the statusbar code is significantly different. I don't think the
> > network activity icon is causing as many problems there, but can someone
> > confirm?
> 
> The measurements :amac provided in comment 11 are based on 2.0, so it seems
> it's also causing a lot of problems in 2.0.

I see. I was responding to Sotaro's findings in comment 5. In 2.0, we weren't doing nearly as much work in JS on network activity updates. Basically we would just display an animated png for half a second [1]. The problems you are seeing in 2.0 is probably related to Sotaro's comment 20, bug 1091777.

1.) https://github.com/mozilla-b2g/gaia/blob/5a3bc7c4d873147ea660c5e986ae7df08bb3521d/apps/system/js/statusbar.js#L729
Not sure what is being requested of me here...
Flags: needinfo?(bent.mozilla)
Comment 11 mentions profiling this on 2.0, are we considering blocking on 2.0 here?
Flags: needinfo?(bbajaj)
Assigning over to :daleharvey to profile with css transitions rather than animated pngs, since Alberto has a couple of blockers already.
Assignee: apastor → dale
Target Milestone: --- → 2.1 S8 (7Nov)
Making the CSS sprite isnt particularly easy, but I should get it done soon, there is an r+'d patch @ https://bugzilla.mozilla.org/show_bug.cgi?id=1091775 which drops the CPU usage by a lot, given the comments I expect that will be good enough (using css wont be free either) but I will see exactly what the numbers are before I close this
So as far I can tell using a css transition actually takes up more cpu than the patched lower framerate apng, the patch is here, https://github.com/daleharvey/gaia/commit/4edc14abfa7ecb2d36b33f7cf5aced89a9935914.patch, will test again in the morning to make sure, but I think this will be closed dupe
The animation you have is running at 60 FPS which would consume more resources.

You need to make the animation 5 FPS just like the gif. You need to force discrete steps that happen every 200ms using keyframes. The FPS counter (from the developer menu) should be displaying 5 FPS if the right affect is achieve.

In this case we should be able to implement the exact same effect as the gif but bypass the main thread thus saving CPU cycles.
Ah, I hoped steps would do that, will try again
Ohh, I've never seen that before. Maybe it does. Check with the FPS counters. If nothing else is changing the counter on the left should be your OMTA target rate where the middle counter should be much lower (about 1 FPS).
ok, confirmed using steps() does run the animation at 5fps (updated the patch with the missing </div> sorry)

CSS transitions - https://gist.github.com/daleharvey/09181f47af368bae05c4
Current master - https://gist.github.com/daleharvey/c9eed5f59c98b0309ee4

The CSS transitions are settling on a lower bound of 6% CPU for the b2g process with current master settling on around 4%, the Compositor stays between 0-1% almost entirely for both but on average lower for master. So closing this as a fixed duplicate.

I did test against master with the previous apng and while its awesome that we got such an improvement, its also hugely worrying that an indicator that is very often visible could be using 25 - 30% of CPU (during page loads which arent cheap themselves) and by the looks of it this bug has been around for a long time without being caught until someone happened to be testing webrtc

While every functional team needs to make the best effort to writing code that efficiently uses resources, it seems like this is the type of issue that is going to slip though and needs a systematic process to ensure it doesnt happen again, I believe there is no longer a "performance team" anymore, how are these problems going to be addressed?
Status: NEW → RESOLVED
Closed: 5 years ago
Flags: needinfo?(bbajaj) → needinfo?(anygregor)
Resolution: --- → DUPLICATE
Duplicate of bug: 1091775
(In reply to Dale Harvey (:daleharvey) from comment #33)
> While every functional team needs to make the best effort to writing code
> that efficiently uses resources, it seems like this is the type of issue
> that is going to slip though and needs a systematic process to ensure it
> doesnt happen again, I believe there is no longer a "performance team"
> anymore, how are these problems going to be addressed?
> 

Good question but I don't have a good answer here. We usually find the biggest regressions when the phone starts heating up in your pocket or the battery drains faster than you can charge it. But we don't have a good way to test for it now.
I hope with our battery initiative we get this test coverage basically for free.
Flags: needinfo?(anygregor)
You need to log in before you can comment on or make changes to this bug.