Closed Bug 1061969 Opened 10 years ago Closed 8 years ago

[Camera] Pulling down the status bar while in the camera app is very sluggish

Categories

(Firefox OS Graveyard :: Gaia::System, defect)

ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

(blocking-b2g:-, tracking-b2g:-, b2g-v1.4 unaffected, b2g-v2.0 unaffected, b2g-v2.1 affected, b2g-master unaffected)

RESOLVED WORKSFORME
blocking-b2g -
tracking-b2g -
Tracking Status
b2g-v1.4 --- unaffected
b2g-v2.0 --- unaffected
b2g-v2.1 --- affected
b2g-master --- unaffected

People

(Reporter: KTucker, Assigned: vingtetun)

References

()

Details

(Keywords: perf, regression, Whiteboard: [2.1-exploratory])

Attachments

(9 files, 7 obsolete files)

487.59 KB, text/plain
Details
2.34 KB, patch
vingtetun
: review+
Details | Diff | Splinter Review
14.43 KB, patch
vingtetun
: review+
Details | Diff | Splinter Review
1.82 KB, patch
fabrice
: review+
Details | Diff | Splinter Review
2.38 KB, patch
alive
: review+
Details | Diff | Splinter Review
7.04 KB, patch
gsvelto
: feedback-
Details | Diff | Splinter Review
9.56 KB, patch
etienne
: review+
Details | Diff | Splinter Review
1.63 KB, patch
gsvelto
: review+
Details | Diff | Splinter Review
6.43 KB, patch
gsvelto
: review-
Details | Diff | Splinter Review
Description:
The user will notice that pulling down the status bar while in the camera app is sluggish. Sometimes, the status bar will stop half way down the screen and stay there. 

Repro Steps:
1)  Updated Flame to Build ID: 20140902040205
2)  Open the camera app. 
3)  Pull down the status bar repeatedly.

Actual:
The status bar pulls down sluggishly and sometimes stops half way down the screen.

Expected:
The status bar pulls down smoothly without issue. 

Flame 2.1

Environmental Variables:
Device: Flame Master
BuildID: 20140902040205
Gaia: 44bf2e3bc5ddea9db9a8c851bd353cb234aa883c
Gecko: c360f3d1c00d
Version: 34.0a1 (Master) 
Firmware Version: v123
User Agent: Mozilla/5.0 (Mobile; rv:34.0) Gecko/34.0 Firefox/34.0

Notes:
Repro frequency: 100%
See attached: screenshot, logcat
This issue also occurs on Flame 2.1(512mb). 

The status bar pulls down sluggishly while in the camera app. 

Flame 2.1(512mb)

Environmental Variables:
Device: Flame Master
BuildID: 20140902040205
Gaia: 44bf2e3bc5ddea9db9a8c851bd353cb234aa883c
Gecko: c360f3d1c00d
Version: 34.0a1 (Master) 
Firmware Version: v123
User Agent: Mozilla/5.0 (Mobile; rv:34.0) Gecko/34.0 Firefox/34.0

---------------------------------------------------------------------------------

This issue does not occur on the Open C 2.1, Flame 2.0(319mb), Open C 2.0, Flame 1.4(319mb) or the Open C 1.4.

The status bar pulls down relatively smoothly while in the camera app. 

Open C 2.1

Environmental Variables:
Device: Open_C 2.1 Master
BuildID: 20140902040205
Gaia: 44bf2e3bc5ddea9db9a8c851bd353cb234aa883c
Gecko: c360f3d1c00d
Version: 34.0a1 (2.1 Master)
Firmware: P821A10v1.0.0B06_LOG_DL
User Agent: Mozilla/5.0 (Mobile; rv:33.0) Gecko/33.0 Firefox/33.0

Flame 2.0(319mb)

Environmental Variables:
Device: Flame 2.0 
BuildID: 20140902000202
Gaia: 449d8db9b3ea1f9262db822c37ef2143477172b7
Gecko: 40d74e0bbcf5
Version: 32.0 (2.0)
Firmware Version: v123
User Agent: Mozilla/5.0 (Mobile; rv:34.0) Gecko/34.0 Firefox/34.0

Open C 2.0

Environmental Variables:
Device: Open C 2.0
BuildID: 20140902000202
Gaia: 449d8db9b3ea1f9262db822c37ef2143477172b7
Gecko: 40d74e0bbcf5
Version: 32.0 (2.0)
Firmware Version: P821A10v1.0.0B06_LOG_DL
User Agent: Mozilla/5.0 (Mobile; rv:34.0) Gecko/34.0 Firefox/34.0

Flame 1.4(319mb)

Environmental Variables:
Device: Flame 1.4 
BuildID: 20140902000203
Gaia: 2ee5b00bfbb8a67a967094804390b4afce8ecf54
Gecko: 6021b50bbed0
Version: 30.0 (1.4)
Firmware: V123
User Agent: Mozilla/5.0 (Mobile; rv:30.0) Gecko/30.0 Firefox/30.0

Open C 1.4

Environmental Variables:
Device: Open_C 1.4
BuildID: 20140902000203
Gaia: 2ee5b00bfbb8a67a967094804390b4afce8ecf54
Gecko: 6021b50bbed0
Version: 30.0 (1.4)
Firmware: P821A10v1.0.0B06_LOG_DL
User Agent: Mozilla/5.0 (Mobile; rv:30.0) Gecko/30.0 Firefox/30.0


This issue does not occur on the v123 base because the user cannot pull down the status bar in the camera app.

V123 base 

Environmental Variables:
Device: Flame 1.3
Build ID: 20140627162151
Gaia: 5c43d012a9351e8aaeacda0b87b0886b7f67d33d
Gecko: e181a36ebafaa24e5390db9f597313406edfc794
Version: 28.0 (1.3)
Firmware Version: v123
User Agent: Mozilla/5.0 (Mobile; rv:28.0) Gecko/28.0 Firefox/28.0
QA Whiteboard: [QAnalyst-Triage?]
Flags: needinfo?(dharris)
Keywords: regression
Please attach a logcat and specify the memory you were using in your original comment
Flags: needinfo?(dharris) → needinfo?(ktucker)
QA Whiteboard: [QAnalyst-Triage?] → [QAnalyst-Triage-]
Attached file CameraStatusBar.txt
Flags: needinfo?(ktucker)
This issue was found on Flame 2.1(319mb) and logcat has been attached.
QA Whiteboard: [QAnalyst-Triage-] → [QAnalyst-Triage?]
Flags: needinfo?(dharris)
[Blocking Requested - why for this release]:

The notification page can get stuck on the screen from being sluggish, and make the screen unresponsive until the user moves the page. This is a bad user experience, so I am nominating it to block 2.1? since it is also a regression from 2.0
blocking-b2g: --- → 2.1?
Flags: needinfo?(dharris)
Note: this is not related to bug 1038723; the video shows the bug repro on old style.

Correcting component.
Component: Gaia::System::Window Mgmt → Gaia::System
Derek - Can you add the window wanted keyword and fix the QAnalyst-Triage flag here?
Flags: needinfo?(dharris)
QA Whiteboard: [QAnalyst-Triage?]
Flags: needinfo?(dharris)
blocking-b2g: 2.1? → 2.1+
I have a proposal and some local patches. Let's see if they make sense or not.
Attached patch bug1061969.noclip.patch (obsolete) — Splinter Review
Do not use clipping and minimize repaints. Keep the feature of the fixed notifications.
Attachment #8487253 - Flags: review?(chrislord.net)
Attached patch freezer.patch (obsolete) — Splinter Review
Add a freeze/unfreeze method on mozbrowser iframe to prevent the refresh of frame/canvas/video element while used.

Roc, you may dislike it. The main reason why I did that is because the child process is repainting/doing layer transactions/or using directly the compositor extensively while the user try to reveal the statusbar at the top of the phone.

Then the child process updates compete with the animation of the statusbar. It is not an async animation as it follows the finger - (and there is still no apz in the main process just in case you suggest to scroll it :)).

The proposed patch just expose a method to 'pause' the rendering while such an action is performed. Looking at the fps counter is seems to work well for the canvas use case. For the video use case I can still see a small drop of FPS will makes me thing only a subset of the work is skipped here, and maybe there is a better place to do the check ?
Attachment #8487260 - Flags: review?(roc)
Attachment #8487260 - Flags: review?(fabrice)
Use the freeze/unfreeze methods in the system app. This still needs some tests.
Attachment #8487263 - Flags: review?(alive)
Do not forward any events to the app while the utility-tray is opened. This probably need a test.
Attachment #8487264 - Flags: review?(etienne)
For the parts that needs test I'm mostly waiting to see if Roc is OK with the approach or not.
Has a patch, so removing window wanted
Comment on attachment 8487253 [details] [diff] [review]
bug1061969.noclip.patch

Review of attachment 8487253 [details] [diff] [review]:
-----------------------------------------------------------------

This is a novel way of doing it, nice :) I assume that the async animations don't get out of sync when you let go of the gripper (at least, not regularly)?

I also assume that with this method, the entire notifications tray is pre-rendered instead of repainted as it's animating? Of course it must be, or there would be less point to this change, but I have to do due diligence and ask :)

::: apps/system/index.html
@@ +542,4 @@
>        <!-- Utility Tray -->
>        <div id="utility-tray" data-z-index-level="utility-tray">
>          <!-- Placeholder for notifications -->
> +        <div id="notifications-placeholder">

Is there any need for this placeholder anymore? If not, please remove it and the associated comment - otherwise, I think the id and comment need updating to make sense.

::: apps/system/js/utility_tray.js
@@ +299,5 @@
>      dy = Math.min(screenHeight, dy);
>  
>      var style = this.overlay.style;
> +    style.transition = '';
> +    style.transform = 'translateY(' + dy + 'px)';

I don't mind this change being done here, but it would be nice to separate the s/MozTransform/transform/ change into a separate patch to make this easier to read.

@@ +338,4 @@
>      this.validateCachedSizes();
>      var alreadyHidden = !this.shown;
>      var style = this.overlay.style;
> +    style.transition = instant ? '' : 'transform 0.2s linear';

and same here with s/-moz-transform/transform/
Attachment #8487253 - Flags: review?(chrislord.net) → review+
(In reply to Chris Lord [:cwiiis] from comment #16)
> Comment on attachment 8487253 [details] [diff] [review]
> bug1061969.noclip.patch
> 
> Review of attachment 8487253 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> This is a novel way of doing it, nice :) I assume that the async animations
> don't get out of sync when you let go of the gripper (at least, not
> regularly)?
> 

Correct.

> I also assume that with this method, the entire notifications tray is
> pre-rendered instead of repainted as it's animating? Of course it must be,
> or there would be less point to this change, but I have to do due diligence
> and ask :)
> 

Correct too :)

> ::: apps/system/index.html
> @@ +542,4 @@
> >        <!-- Utility Tray -->
> >        <div id="utility-tray" data-z-index-level="utility-tray">
> >          <!-- Placeholder for notifications -->
> > +        <div id="notifications-placeholder">
> 
> Is there any need for this placeholder anymore? If not, please remove it and
> the associated comment - otherwise, I think the id and comment need updating
> to make sense.
>

Will do. And you're right about the 'Moz'/'-moz' removals. I have been a bit overzealous here.
Assignee: nobody → 21
Comment on attachment 8487260 [details] [diff] [review]
freezer.patch

Review of attachment 8487260 [details] [diff] [review]:
-----------------------------------------------------------------

Instead of doing this, can you temporarily tell the app that it's in the background?

::: content/media/VideoFrameContainer.cpp
@@ +38,5 @@
> +  nsIFrame* frame = mElement->GetPrimaryFrame();
> +  if (frame) {
> +    nsPresContext* presContext = frame->PresContext();
> +    nsIPresShell *presShell = presContext->PresShell();
> +    if (presShell->IsFrozen()) {

This code is running off the main thread in general. You can't touch this main-thread-only stuff here.
Attachment #8487260 - Flags: review?(roc) → review-
Basically I think an API for something like this makes sense, but it should be just like putting the app in the background, except that it's still visible.

However, APZ in the main process is probably the best way to fix this. What's stopping that?
Flags: needinfo?(bugmail.mozilla)
Comment on attachment 8487263 [details] [diff] [review]
bug1061969.windowmanager.freeze.needtests.patch

Review of attachment 8487263 [details] [diff] [review]:
-----------------------------------------------------------------

some nits

::: apps/system/js/app_window.js
@@ +290,5 @@
> +  _frozen: false,
> +  AppWindow.prototype.freeze = function aw__freeze() {
> +    if (this.iframe && this.iframe.freeze && !this._frozen) {
> +      this._frozen = true;
> +      this.iframe.freeze();

this.browser.element.freeze();

@@ +297,5 @@
> +
> +  AppWindow.prototype.unfreeze = function aw__unfreeze() {
> +    if (this.iframe && this.iframe.unfreeze && this._frozen) {
> +      this._frozen = false;
> +      this.iframe.unfreeze();

this.browser.element.unfreeze();

::: apps/system/js/app_window_manager.js
@@ +461,5 @@
>            break;
>  
>          case 'hidewindowforscreenreader':
>            activeApp.setVisibleForScreenReader(false);
> +          activeApp.freeze();

Hmm...can't us do this in setVisibleForScreenReader automatically? like what you do in show/hideFrame

@@ +466,5 @@
>            break;
>  
>          case 'showwindowforscreenreader':
>            activeApp.setVisibleForScreenReader(true);
> +          activeApp.unfreeze();

the same
Attachment #8487263 - Flags: review?(alive) → review+
Comment on attachment 8487263 [details] [diff] [review]
bug1061969.windowmanager.freeze.needtests.patch

Review of attachment 8487263 [details] [diff] [review]:
-----------------------------------------------------------------

::: apps/system/js/app_window.js
@@ +287,5 @@
>      }
>    };
>  
> +  _frozen: false,
> +  AppWindow.prototype.freeze = function aw__freeze() {

BTW I prefer to put this in browser_mixin

@@ +294,5 @@
> +      this.iframe.freeze();
> +    }
> +  };
> +
> +  AppWindow.prototype.unfreeze = function aw__unfreeze() {

as well as this.
Attachment #8487260 - Flags: review?(fabrice)
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #18)
> Comment on attachment 8487260 [details] [diff] [review]
> freezer.patch
> 
> Review of attachment 8487260 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Instead of doing this, can you temporarily tell the app that it's in the
> background?
> 

Sadly no. We still have not resolve bug 1002430 and similar which will keep the layer tree around when the app is in background.

Also, in a more API related level, the current way of telling the app that it is in background is to call iframe.setVisible(false). This has a lot of side effects, like adjusting the process priority - and I don't think we want to do that for this use case.

> ::: content/media/VideoFrameContainer.cpp
> @@ +38,5 @@
> > +  nsIFrame* frame = mElement->GetPrimaryFrame();
> > +  if (frame) {
> > +    nsPresContext* presContext = frame->PresContext();
> > +    nsIPresShell *presShell = presContext->PresShell();
> > +    if (presShell->IsFrozen()) {
> 
> This code is running off the main thread in general. You can't touch this
> main-thread-only stuff here.
Comment on attachment 8487264 [details] [diff] [review]
bug1061969.dont.forward.if.utilitytray.need.tests.patch

Review of attachment 8487264 [details] [diff] [review]:
-----------------------------------------------------------------

r=me with a small unit test added
the plumbing is already there, and we should also make sure that the touch forwarder is never called.

::: apps/system/js/statusbar.js
@@ +537,5 @@
>      if (FtuLauncher.isFtuRunning()) {
>        return;
>      }
>  
> +    // Do not forward events is utility-tray is active

`if the utility-tray is active`
Attachment #8487264 - Flags: review?(etienne) → review+
We have bug 950934 for enabling APZ in the root process. The most immediate thing blocking that is having proper hit testing in the APZ code using event regions. tn is working on that bit.
Flags: needinfo?(bugmail.mozilla)
Attached patch freezer.patch (obsolete) — Splinter Review
Since APZ is out of scope for 2.1, let's try a new version of the new patch.
As soon as APZ will be enabled in the main process you can count on me to try to use it - this is so much better/easier than anything else we can do on the Gaia side :)

I changed a bit the code, so the presShell ask HTMLMediaElement(s) to pause/unpause while it is frozen. For regular <video>s this will pause them.
For the camera, it will just pause the rendering and the camera will continue to record in the background.
Attachment #8487260 - Attachment is obsolete: true
Attachment #8487904 - Flags: review?(roc)
Attachment #8487904 - Flags: review?(fabrice)
Comment on attachment 8487904 [details] [diff] [review]
freezer.patch

Review of attachment 8487904 [details] [diff] [review]:
-----------------------------------------------------------------

The rest of the patch looks OK but it's a bit of a scary hack. Are we going to just take it on the 2.1 branch? I'm not sure if that would make it more scary or less.

There could be interaction between scripted freeze/thaw and native freeze/thaw. That worries me. We could have consecutive Freeze() operations with no Thaw() in between, for example, when script starts a Freeze() and then a navigation causes another Freeze().

A much less risky hack would be to broadcast a specific notification to tell apps to "slow down" and have the camera app respond by pausing its video element. Would that be adequate?

::: content/html/content/src/HTMLMediaElement.cpp
@@ +3500,5 @@
> +    suspendRefresh = frame->PresContext()->PresShell()->IsFrozen();
> +  }
> +
> +  bool pauseElement = suspendEvents || (mMuted & MUTED_BY_AUDIO_CHANNEL) ||
> +                      (IsVideo() || suspendRefresh);

I think you mean && on this line?
Attachment #8487904 - Flags: review?(roc) → review-
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #26)
> Comment on attachment 8487904 [details] [diff] [review]
> freezer.patch
> 
> Review of attachment 8487904 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> The rest of the patch looks OK but it's a bit of a scary hack. Are we going
> to just take it on the 2.1 branch? I'm not sure if that would make it more
> scary or less.
> 
> There could be interaction between scripted freeze/thaw and native
> freeze/thaw. That worries me. We could have consecutive Freeze() operations
> with no Thaw() in between, for example, when script starts a Freeze() and
> then a navigation causes another Freeze().
> 
> A much less risky hack would be to broadcast a specific notification to tell
> apps to "slow down" and have the camera app respond by pausing its video
> element. Would that be adequate?
> 
> ::: content/html/content/src/HTMLMediaElement.cpp
> @@ +3500,5 @@
> > +    suspendRefresh = frame->PresContext()->PresShell()->IsFrozen();
> > +  }
> > +
> > +  bool pauseElement = suspendEvents || (mMuted & MUTED_BY_AUDIO_CHANNEL) ||
> > +                      (IsVideo() || suspendRefresh);
> 
> I think you mean && on this line?

I digged a bit more into the issue and while I can see it for the camera app in recording mode, as well as astroalpaca, I can also see it while loading a big web page like planet.mozilla.org.

I played a bit with priorities and a simple |renice -r 1 b2g_pid| solve the first 2 issues and resolve a big part of the last one. I suspect the last part of the issue for planet.mozilla.org is related to some threads beeing created once a page load and inheriting the main process, main thread priority.

So I would like to explore playing with priorities in order to see if part 2 and 3 are really needed or if we can go with a different approach. Whatever happens part 1 and part 4 are needed so I will land them while exploring the other part of the issue.
Attachment #8487904 - Flags: review?(fabrice)
Attached patch bug1061969.part1.patch (obsolete) — Splinter Review
The patch that changes how we open/close the statusbar with comments addressed. Carrying r+ form Chris.
Attachment #8487253 - Attachment is obsolete: true
Attachment #8487263 - Attachment is obsolete: true
Attachment #8487904 - Attachment is obsolete: true
Attachment #8490706 - Flags: review+
The patch that prevents touch events to be forwarded to the app while the user is interacting with the utility-tray. Version already reviewed by Etienne + tests.
Attachment #8490706 - Attachment is obsolete: true
Attachment #8490708 - Flags: review+
This small gecko changes expose a setActive API on top of mozbrowser iframe. When such a method is called the process priority of the app is adjusted.

I didn't renamed all the internal gecko code that says 'Visible' instead of 'Active' but I can do it if needed. 
There is already a setVisible API for this one is different from the process priority stuff, while it update it as well. Both API don't need to be in sync.
Attachment #8490712 - Flags: review?(fabrice)
This part just call iframe.setActive when the app is declared invisible for the screen reader as well as very early when the app is going to not be visible.
Attachment #8490713 - Flags: review?(alive)
Attached patch bug1061969.part5.patch (obsolete) — Splinter Review
This part use the new setActive API in a few different places, like when the statusbar is going to be shown. It also 'pause' the updates of the statusbar while utilitytray panning or edge gestures is happening.
Attachment #8490715 - Flags: review?(etienne)
This part adjust the threads priority so the main thread of the main process has a better priority than the other threads.

I'm asking feedback before trying to do a cleaner patch just to make sure this is fine.

The rationale behing this value is:
 - For the astroalpaca use case the processor is busy doing compositing for the app. This part may not be needed because the other parts already adjust the priority of the child process when the statusbar is opened. This result into less work on the child side and more time for the main process/main thread to do some work.

 - For the camera use case, the processor is busy doing the same kind of thing than astroalpaca but it there is also some other camera threads running on the system with a nice value of 0. So the main process does not have a lot of time to do something.
 As a result I choose to move the main process priority to -4 in order to give it bigger timeslice. I also moved the compositor thread to -4 because it seems odd to have it running with a worse priority than the main thread.

 - For the browser use case the processor is busy doing network things. While all the network request are started by the child, they end up in the parent process as this is where lives the network stack. As a side effect the DNS Resolver threads, Indexed DB threads, etc... are competing with the main thread and the compositor.
 In order to make the situation a bit better here, the patch ensure that the main thread of the main process and the compositor thread have a better nice than any other threads of the main process (and so any other threads of any processes as well).


The question raised by this patch are:
 - Does moving the main thread of the main process and the compositor thread priority to -4 may affects any sound stuff ? (I tend to think it won't since -4 is relatively far from -16 and Android seems to run its compositor thread a -8, while the main thread is running at -4).

 - Also, the way of making a difference between the main thread and the other threads of the main process, is not very clean. I'm not sure what is the best way to implement it. Help will be appreciated :)
Attachment #8490721 - Flags: feedback?(gsvelto)
(In reply to Vivien Nicolas (:vingtetun) (:21) - (NOT reading bugmails, needinfo? please) from comment #31)
> Created attachment 8490712 [details] [diff] [review]
> bug1061969.part3.patch
> 
> This small gecko changes expose a setActive API on top of mozbrowser iframe.
> When such a method is called the process priority of the app is adjusted.
> 
> I didn't renamed all the internal gecko code that says 'Visible' instead of
> 'Active' but I can do it if needed. 
> There is already a setVisible API for this one is different from the process
> priority stuff, while it update it as well. Both API don't need to be in
> sync.

I have a couple of questions:
1) Do we care about having several mozbrowser active at the same time?
2) Are we fine with a on/off flag, or should we have something more like an active level?
In other words, should:
 e.setActive(true); e.setActive(true); e.setActive(false); deactivate?
(In reply to Fabrice Desré [:fabrice] from comment #35)
> (In reply to Vivien Nicolas (:vingtetun) (:21) - (NOT reading bugmails,
> needinfo? please) from comment #31)
> > Created attachment 8490712 [details] [diff] [review]
> > bug1061969.part3.patch
> > 
> > This small gecko changes expose a setActive API on top of mozbrowser iframe.
> > When such a method is called the process priority of the app is adjusted.
> > 
> > I didn't renamed all the internal gecko code that says 'Visible' instead of
> > 'Active' but I can do it if needed. 
> > There is already a setVisible API for this one is different from the process
> > priority stuff, while it update it as well. Both API don't need to be in
> > sync.
> 
> I have a couple of questions:
> 1) Do we care about having several mozbrowser active at the same time?

It depends on the meaning of your question.
A basic answer is yes, we care because it is going to change the timeslice allocated on the processors for those apps. The current patch only update the 'active' state of the currently displayed app so it should be safe and does not allowed a background app to be 'active'.

Also the iframe.setVisible call is unchanged, so an app going to the background will automatically be updated as well.

As of today there is sometimes multiple apps beeing 'active' at the same time, for example when you open the utility tray / statusbar, the cost control widget is 'active' while the current displayed app beneath the utility tray is active as well.

> 2) Are we fine with a on/off flag, or should we have something more like an
> active level?
> In other words, should:
>  e.setActive(true); e.setActive(true); e.setActive(false); deactivate?

I kind of like counter flags, but I'm pretty sure the current codebase sometimes does iframe.setVisible(true); iframe.setVisible(true); iframe.setVisible(false); and the app is set to not visible and I will be a bit afraid to change such behavior.

As an example the part5 patch of this bug use a counter flag to pause the update of the statusbar. I have an updated local version of it because yesterday I realized the events sent by the utility tray when it is shown/hidden are not symmetric and so the counter goes into an wrong state if someone:
 - open the statusbar by end
 - release it so it is visible
 - Turn off the phone (this one bypass some of the statusbar code and we end up with the statusbar beeing out of sync).

So for now I would prefer to keep a simple flag and if we need some guards, we can implement them in the system app.
Etienne, here is the updated part5 with symmetric events.
Attachment #8490715 - Attachment is obsolete: true
Attachment #8490715 - Flags: review?(etienne)
Attachment #8491404 - Flags: review?(etienne)
Comment on attachment 8490713 [details] [diff] [review]
bug1061969.part4.patch

Review of attachment 8490713 [details] [diff] [review]:
-----------------------------------------------------------------

I dislike the naming though...
Attachment #8490713 - Flags: review?(alive) → review+
Comment on attachment 8490721 [details] [diff] [review]
adjust-threads-priority.patch

I've thought long and hard about this since you flagged me for review (I basically stopped doing everything else) because this is a tough nut to crack and we've been there already without a fully satisfying solution yet.

In principle I find the idea of giving the main process' main thread (MPMT for now on for brevity) a nice value lower than zero - and lower than other threads in general - is a tempting idea especially since it can become a bottleneck in some scenarios. It's something I'm tempted to plus.

However as you've seen it's extremely icky since every other thread spawned from the main process gets the same nice value and we need to adjust it. But then it's racy so we can't be sure that always works and if it doesn't "It's a bad thing"™. And then there's the problem that this makes the MPMT *very* special and we've already had starvation cases were the main process sucks so much CPU that the phone becomes unresponsive. This could make it worse making such a change regression-prone change.

Then we've got other components in the main process that use the NSPR thread management functions to deliberately run threads at higher priorities than the main thread, sometimes for brief periods of time. As you've noted the NSPR function to adjustments relative to the main thread priority so the change you made will affect all of them. It will basically turn that behavior around ensuring they actually run at a lower priority than the main thread.

Plus modifying the PR_* functions to behave differently in just this specific case is really hacky and I'm not sure if the NSPR maintainers would like it.

All in all I'd say that we might follow this path but only under the following conditions:

- We can prove this gives some very measurable benefits in relevant scenarios while at the same time not regressing others. The results in comment 27 satisfies the first part of this but not the second

- We've really got no other way of solving those scenarios

So my question is the following: when we encounter a scenario were the main thread is appreciably slowed down whose causing it? Is it slowed down by other threads in the main process or by other threads in a child process (or both)?

If it's the former case then we should lower the priority of those threads instead. We already do that for DOM Workers and other CPU-intensive threads so we might as well do for other potential CPU hogs. If it's the latter then I'd rather see us boost the priority of the whole main process WRT child ones (or lower child processes' priorities, it works both ways). Both things can be done in the existing code without risky/hacky changes. I'd opt for a change like the one proposed here only when we've exhausted other less risky options.

BTW, you can count of me for the profiling work to identify the culprits here and shoot down their priorities :)
Attachment #8490721 - Flags: feedback?(gsvelto) → feedback-
Comment on attachment 8491404 [details] [diff] [review]
bug1061969.part5.patch

Review of attachment 8491404 [details] [diff] [review]:
-----------------------------------------------------------------

::: apps/system/js/utility_tray.js
@@ +279,5 @@
> +    if (this.shown) {
> +      window.dispatchEvent(new CustomEvent('utilitytraywillhide'));
> +    } else {
> +      window.dispatchEvent(new CustomEvent('utilitytraywillshow'));
> +    }

let's add tests for the basic event dispatch.

::: apps/system/js/visibility_manager.js
@@ +136,4 @@
>        case 'utility-tray-overlayopened':
>        case 'cardviewshown':
>        case 'system-dialog-show':
>          this.publish('hidewindowforscreenreader');

let's discuss this with Alive but I'm tempted to rename those events and handler now to something like |app-user-interaction-disabled| |app-user-interaction-enabled| and |AppWindow#setUserInteractionEnabled|

::: apps/system/test/unit/statusbar_test.js
@@ +2174,5 @@
> +    });
> +
> +    test('cardviewclosed', function() {
> +      testEventThatResume.bind(this)('cardviewclosed');
> +    });

nice!

Can you add a test covering the counter part, like cardviewshow/utilitytraywillshow/cardviewclosed -> still paused
Attachment #8491404 - Flags: review?(etienne) → review+
(In reply to Etienne Segonzac (:etienne) from comment #40)
> Comment on attachment 8491404 [details] [diff] [review]
> bug1061969.part5.patch
> 
> Review of attachment 8491404 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: apps/system/js/utility_tray.js
> @@ +279,5 @@
> > +    if (this.shown) {
> > +      window.dispatchEvent(new CustomEvent('utilitytraywillhide'));
> > +    } else {
> > +      window.dispatchEvent(new CustomEvent('utilitytraywillshow'));
> > +    }
> 
> let's add tests for the basic event dispatch.
> 
> ::: apps/system/js/visibility_manager.js
> @@ +136,4 @@
> >        case 'utility-tray-overlayopened':
> >        case 'cardviewshown':
> >        case 'system-dialog-show':
> >          this.publish('hidewindowforscreenreader');
> 
> let's discuss this with Alive but I'm tempted to rename those events and
> handler now to something like |app-user-interaction-disabled|
> |app-user-interaction-enabled| and |AppWindow#setUserInteractionEnabled|

I agree this is insufficient but I don't have a good naming.
I will say AppWindow#setAccessibility :)

> 
> ::: apps/system/test/unit/statusbar_test.js
> @@ +2174,5 @@
> > +    });
> > +
> > +    test('cardviewclosed', function() {
> > +      testEventThatResume.bind(this)('cardviewclosed');
> > +    });
> 
> nice!
> 
> Can you add a test covering the counter part, like
> cardviewshow/utilitytraywillshow/cardviewclosed -> still paused
See Also: → 1071156
Adjusting the priorities of threads sounds better than messing with setActive/setVisibility. Mason has already got a patch to boost the compositor thread priority to -4. If we also ensure the MPMT has higher priority than any thread in the content process, do we still need to also make Gecko changes?
Attachment #8490712 - Flags: review?(fabrice) → review+
leave-open until the Gaia bits land.

The Gecko part + the Gaia bits makes the situation a bit better and ensure the statusbar is not stuck. It may still be a bit sluggish without the main thread adjustement and playing more with priorities but since the notifications tray is not stuck anymore I don't think this is a real blocker after that.
Keywords: leave-open
(In reply to Vivien Nicolas (:vingtetun) (:21) - (NOT reading bugmails, needinfo? please) from comment #43)
> setActive part for Gecko (just adjust the priority):
> https://hg.mozilla.org/integration/mozilla-inbound/rev/49e90971088e

This landed with the wrong bug number, could you backout and reland in one push, with DONTBUILD in the topmost commit message, to correct the bug number for hg annotate? :-)
This is a side patch that comes with the setActive API. I would like this API to ONLY change the process priority but it seems like some code force flush the memory in the ProcessPriorityManager.

I moved the code so it only happens when the app is not visible. This should also helps some cases in the system app where we want to adjust the priority of the app as soon it is launched while not updating the visibility right away (like the homescreen for example).
Attachment #8501013 - Flags: review?(gsvelto)
In order to see what happens and which threads consume CPUs in a simple way, I wrote a small wrapper around ftrace.

To invoke it just plug the device, run |make scheduler_trace| from the Gaia directory, do your action on the device, and wait for the ouput. (it capture during 5 seconds by default).

Gabriele, I wrote that while working on this bug, trying to understand what's going on exactly. I figured out that it may be nice to have it somewhere where others can use it.
Attachment #8501014 - Flags: review?(gsvelto)
For what it worth, the Gaia bits that are independent of any CPU changes are waiting for a green run at https://github.com/mozilla-b2g/gaia/pull/24881.
(In reply to Gabriele Svelto [:gsvelto] from comment #39)
> Comment on attachment 8490721 [details] [diff] [review]
> adjust-threads-priority.patch
> 
> I've thought long and hard about this since you flagged me for review (I
> basically stopped doing everything else) because this is a tough nut to
> crack and we've been there already without a fully satisfying solution yet.
> 
> In principle I find the idea of giving the main process' main thread (MPMT
> for now on for brevity) a nice value lower than zero - and lower than other
> threads in general - is a tempting idea especially since it can become a
> bottleneck in some scenarios. It's something I'm tempted to plus.
> 
> However as you've seen it's extremely icky since every other thread spawned
> from the main process gets the same nice value and we need to adjust it. But
> then it's racy so we can't be sure that always works and if it doesn't "It's
> a bad thing"™. And then there's the problem that this makes the MPMT *very*
> special and we've already had starvation cases were the main process sucks
> so much CPU that the phone becomes unresponsive. This could make it worse
> making such a change regression-prone change.
> 
> Then we've got other components in the main process that use the NSPR thread
> management functions to deliberately run threads at higher priorities than
> the main thread, sometimes for brief periods of time. As you've noted the
> NSPR function to adjustments relative to the main thread priority so the
> change you made will affect all of them. It will basically turn that
> behavior around ensuring they actually run at a lower priority than the main
> thread.

> Plus modifying the PR_* functions to behave differently in just this
> specific case is really hacky and I'm not sure if the NSPR maintainers would
> like it.

I mostly make the patch like this just to get the feedback on the idea :) I convinced myself that fixing all threads creating point in Gecko before asking for a feedback would have been a lot of (possibly) useless work.

> 
> All in all I'd say that we might follow this path but only under the
> following conditions:
> 
> - We can prove this gives some very measurable benefits in relevant
> scenarios while at the same time not regressing others. The results in
> comment 27 satisfies the first part of this but not the second
> 

That's mostly why I wrote the small ftrace tools. Numbers will come.

> - We've really got no other way of solving those scenarios
> 

There are other ways, but it really seems a scheduling issue to me, and since I found 3 places of similar behavior I think this is really how we want to solve the issue.

> So my question is the following: when we encounter a scenario were the main
> thread is appreciably slowed down whose causing it? Is it slowed down by
> other threads in the main process or by other threads in a child process (or
> both)?
> 

By both, but also by threads in the system (like the camera threads which runs with a nice of 0, so the same value than of MPMT :/).

> If it's the former case then we should lower the priority of those threads
> instead. We already do that for DOM Workers and other CPU-intensive threads
> so we might as well do for other potential CPU hogs. If it's the latter then
> I'd rather see us boost the priority of the whole main process WRT child
> ones (or lower child processes' priorities, it works both ways). Both things
> can be done in the existing code without risky/hacky changes. I'd opt for a
> change like the one proposed here only when we've exhausted other less risky
> options.
> 

Sadly just changing the priorities of the childs is not enough (this is the setActive part of this bug, which makes the app priority like a background app).

Making MPMT with a value of -4 is a way of making it more prioritary than the other threads of the MP, while also having a higher priority than a bunch of OS threads.

> BTW, you can count of me for the profiling work to identify the culprits
> here and shoot down their priorities :)

Any help will be appreciated, since event with -4 I'm not 100% happy of the result :)
> 
> Any help will be appreciated, since event with -4 I'm not 100% happy of the
> result :)

-8 (for the compositor thread and the main thread) makes me much more happy for example for the camera use case.
Can we land the Gaia parts of this patch? We need to get these in sooner rather than later, in case there are any regressions - they also solve bug 1074521 properly and should generally provide a better experience, even without the Gecko-side changes.
Flags: needinfo?(21)
(In reply to Chris Lord [:cwiiis] from comment #51)
> Can we land the Gaia parts of this patch? We need to get these in sooner
> rather than later, in case there are any regressions - they also solve bug
> 1074521 properly and should generally provide a better experience, even
> without the Gecko-side changes.

To be more specific, I really just mean part 1.
(In reply to Chris Lord [:cwiiis] from comment #51)
> Can we land the Gaia parts of this patch? We need to get these in sooner
> rather than later, in case there are any regressions - they also solve bug
> 1074521 properly and should generally provide a better experience, even
> without the Gecko-side changes.

I still have a Gaia unit test failures with it that I need to solve. Does not happens locally :(
Flags: needinfo?(21)
(In reply to Vivien Nicolas (:vingtetun) (:21) - (NOT reading bugmails, needinfo? please) from comment #53)
> (In reply to Chris Lord [:cwiiis] from comment #51)
> > Can we land the Gaia parts of this patch? We need to get these in sooner
> > rather than later, in case there are any regressions - they also solve bug
> > 1074521 properly and should generally provide a better experience, even
> > without the Gecko-side changes.
> 
> I still have a Gaia unit test failures with it that I need to solve. Does
> not happens locally :(

Have you got a link to the failure?
(In reply to Chris Lord [:cwiiis] from comment #54)
> (In reply to Vivien Nicolas (:vingtetun) (:21) - (NOT reading bugmails,
> needinfo? please) from comment #53)
> > (In reply to Chris Lord [:cwiiis] from comment #51)
> > > Can we land the Gaia parts of this patch? We need to get these in sooner
> > > rather than later, in case there are any regressions - they also solve bug
> > > 1074521 properly and should generally provide a better experience, even
> > > without the Gecko-side changes.
> > 
> > I still have a Gaia unit test failures with it that I need to solve. Does
> > not happens locally :(
> 
> Have you got a link to the failure?

https://tbpl.mozilla.org/?rev=11b356eff5d6feb63adee241ea26e4242d90dfac&tree=Gaia-Try
Comment on attachment 8501013 [details] [diff] [review]
bug1061969.part6.patch

Looks good to me and possibly this might also improve the scenario where we turn off the screen w/o the lockscreen enabled. If I understand the rest of the patch-set correctly in that particular scenario we should not send a memory-pressure event anymore which is a nice improvement on a long-standing issue.
Attachment #8501013 - Flags: review?(gsvelto) → review+
Comment on attachment 8501014 [details] [diff] [review]
bug1061969.part7.patch

Review of attachment 8501014 [details] [diff] [review]:
-----------------------------------------------------------------

First of all let me say that while I r-'d this I *love* what you've done here. Amongst the many ways we had to monitor tasks I had completely forgot ftrace while it's definitely the best possible method to figure out who's hogging the CPU given a reproduceale test.

That being said I fear we're not measuring the right stuff here and since I'm not a ftrace expert I might be wrong regarding what we should use instead so I'll needinfo :dhylands who has far more kernel experience than I do and can probably tell us if I'm making any sense here.

::: build/scheduler_trace.js
@@ +75,5 @@
> +      rv.count = 0;
> +    }
> +
> +    if (previousTimestamp && cpu == previousCPU) {
> +      rv.duration += (previousTimestamp - timestamp);

The way you're calculating CPU usage here is incorrect. AFAIK the timestamps in the trace are just that: timestamps of a specific event. To know how much time a task has spent on a CPU you should only look at the sched_stat_runtime events and extract the runtime parameter from them. For example I've taken this from an actual trace:

b2g-205   [001] d.h2 18468.120471: sched_stat_runtime: comm=b2g pid=205 runtime=9960885 [ns] vruntime=162681673790 [ns]

This indicates that the main process' main thread has run for ~10ms (an entire scheduler timeslice) on the second CPU.

::: tools/capture_cpu_scheduling.sh
@@ +6,5 @@
> +
> +# Configure which events that are of interests for us
> +echo 1 >/sys/kernel/debug/tracing/events/sched/sched_switch/enable
> +echo 1 >/sys/kernel/debug/tracing/events/sched/sched_wakeup/enable
> +echo 1 >/sys/kernel/debug/tracing/events/sched/enable

This enables *all* events for the sched subsystem so the two lines before it are redundant. This generates a *lot* of data most of which I think we don't care about. What we should care is sched_stat_runtime events which should tell us for how long a task has run on a CPU. The relevant data for each entry will be called runtime parameter, be in ns and describe for how long a task has run on a specific CPU. So this should be:

echo 1 >/sys/kernel/debug/tracing/events/sched/sched_stat_runtime/enable

@@ +11,5 @@
> +
> +# Turn on tracing
> +echo 1 >/sys/kernel/debug/tracing/tracing_enabled
> +
> +sleep 5

It would be nice if this would be a parameter.

@@ +17,5 @@
> +# Turn off tracing and clean up our events or interest
> +echo 0 >/sys/kernel/debug/tracing/tracing_enabled
> +echo 0 >/sys/kernel/debug/tracing/events/sched/sched_switch/enable
> +echo 0 >/sys/kernel/debug/tracing/events/sched/sched_wakeup/enable
> +echo 0 >/sys/kernel/debug/tracing/events/sched/enable

Similarly as above we should only disable 'sched_stat_runtime' events here, but disabling all events like you're doing in the last line is fine too.
Attachment #8501014 - Flags: review?(gsvelto) → review-
The patch in attachment 8501014 [details] [diff] [review] is trying to extract the amount of time a thread has run on a CPU by using the kernel function tracing capabilities. IIRC the correct way of doing so is to listen for sched_stat_runtime events which report the amount of time in ns a thread has spent on a certain CPU as reported by the scheduler. Dave can you confirm if this is correct? I remember this off the top of my head but couldn't find any kernel documentation stating this explicitly.
Flags: needinfo?(dhylands)
(In reply to Vivien Nicolas (:vingtetun) (:21) - (NOT reading bugmails, needinfo? please) from comment #49)
> I mostly make the patch like this just to get the feedback on the idea :) I
> convinced myself that fixing all threads creating point in Gecko before
> asking for a feedback would have been a lot of (possibly) useless work.

Yes indeed :) You've got me thinking about how to do this and I've come up with a couple of ideas which might help us achieve that result without having to resort to such drastic (and fragile) measures. More on this later.

> By both, but also by threads in the system (like the camera threads which
> runs with a nice of 0, so the same value than of MPMT :/).

Good point, there's a lot of stuff outside of our control which generally runs at nice == 0; besides even if we address that more stuff (daemons, services) might be added later bringing us back to square one.

> Sadly just changing the priorities of the childs is not enough (this is the
> setActive part of this bug, which makes the app priority like a background
> app).
> 
> Making MPMT with a value of -4 is a way of making it more prioritary than
> the other threads of the MP, while also having a higher priority than a
> bunch of OS threads.
> 
> [...]
> 
> Any help will be appreciated, since event with -4 I'm not 100% happy of the
> result :)

So I've thought of coupe of options we might have. One that addresses the problem of raising the MPMT priority reliably while keeping *all* other threads in check without having to fiddle around with thread-creation points and one that might help us in the long run by making scheduling more predictable but which requires more changes.

- The first idea is to change the way the low-level SetProcessPriority() function works. Currently this function has a number of problems among which it's racy, it has a significant cost due to the high number of system calls and can't react to changes in the threads after it's being called (new threads will get the priority of the main thread as per standard Linux behavior). One possible ways to address this is to make SetProcessPriority() stateful by having a daemon watching the /proc/*/task folder of the specified process and react to new threads being created there by automatically adjusting their priority. I think it's possible to do so on recent kernels using inotify but it's something I have to verify (inotify has certain limitations when dealing with /proc). This would allow us to reliably bump the MPMT priority (even make it realtime if need be!) while ensuring that its child threads and processes are reliably adjusted.

- The second idea is to resuscitate the idea of using control groups (cgroups) instead of using nice values. We've experimented with this in the early days but cgroups support was horribly broken on ICS kernels. We might be luckier with KK and if we are we could use appropriate cgroups to ensure the MPMT always gets a fair share of the CPU and cannot be starved by other processes including the ones we're not creating ourselves. I had an old patch to add cgroups support to the process priority manager and I might refresh it and see if it helps.

I'm going to open a couple of bugs in the HAL component for evaluating both approaches and moving this forward.
FYI there's another couple of events that might be worth checking: sched_wakeup and sched_switch. Measuring the delay between the two tells you the time elapsed between a task being eligible for running and actually being scheduled. In the case of the MPMT this would be useful information to detect cases were it would want to run but was delayed because of CPU contention.
(In reply to Gabriele Svelto [:gsvelto] from comment #60)
> So I've thought of coupe of options we might have. One that addresses the
> problem of raising the MPMT priority reliably while keeping *all* other
> threads in check without having to fiddle around with thread-creation points
> and one that might help us in the long run by making scheduling more
> predictable but which requires more changes.
>

Thanks for those thinking :)
 
> - The first idea is to change the way the low-level SetProcessPriority()
> function works. Currently this function has a number of problems among which
> it's racy, it has a significant cost due to the high number of system calls
> and can't react to changes in the threads after it's being called (new
> threads will get the priority of the main thread as per standard Linux
> behavior). One possible ways to address this is to make SetProcessPriority()
> stateful by having a daemon watching the /proc/*/task folder of the
> specified process and react to new threads being created there by
> automatically adjusting their priority. I think it's possible to do so on
> recent kernels using inotify but it's something I have to verify (inotify
> has certain limitations when dealing with /proc). This would allow us to
> reliably bump the MPMT priority (even make it realtime if need be!) while
> ensuring that its child threads and processes are reliably adjusted.
>

I'm only half-convinced by this on the long run. It may be a good solution for 2.1 though.
Why am I only half convinced is because I'm pretty sure Firefox e10s will need something like this too if we want the UI to be responsive all the time. So adjusting threads priorities directly at the source sounds more correct (but it seems a bit painful to do that for 2.1 while making everybody agrees for all the threads on the all the platforms).

Or did I misunderstood your proposal and it will works on any platform ?

> - The second idea is to resuscitate the idea of using control groups
> (cgroups) instead of using nice values. We've experimented with this in the
> early days but cgroups support was horribly broken on ICS kernels. We might
> be luckier with KK and if we are we could use appropriate cgroups to ensure
> the MPMT always gets a fair share of the CPU and cannot be starved by other
> processes including the ones we're not creating ourselves. I had an old
> patch to add cgroups support to the process priority manager and I might
> refresh it and see if it helps.
>

Whatever we do with nice priorities, will it be enough in the long run with all the possible use case ?
I think it won't and cgroup will offers us much more flexibilities looking at https://www.kernel.org/doc/Documentation/cgroups/cgroups.txt.
I would love to be able to limit the ability of cpu used by network threads or database threads in the main process, and ensuring background processes and their threads can not affect the device too much.

But for 2.1 it seems a little too big to implement, and I would say this is a good solution to complete our nice priorities heuristics and would promote it for 2.2.
What do you think ?
(In reply to Gabriele Svelto [:gsvelto] from comment #58)
> Comment on attachment 8501014 [details] [diff] [review]
> bug1061969.part7.patch
> 
> Review of attachment 8501014 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> That being said I fear we're not measuring the right stuff here and since
> I'm not a ftrace expert I might be wrong regarding what we should use
> instead so I'll needinfo :dhylands who has far more kernel experience than I
> do and can probably tell us if I'm making any sense here.
> 

I'm far from beeing a ftrace expert as well :) I mostly used it a first to see which tasks is consuming the cpus and from there I tried to extract some datas from the big logs I got. I would be happy to fix the patch to get it correct!
(In reply to Gabriele Svelto [:gsvelto] from comment #57)
> Comment on attachment 8501013 [details] [diff] [review]
> bug1061969.part6.patch
> 
> Looks good to me and possibly this might also improve the scenario where we
> turn off the screen w/o the lockscreen enabled. If I understand the rest of
> the patch-set correctly in that particular scenario we should not send a
> memory-pressure event anymore which is a nice improvement on a long-standing
> issue.

From what I remember when the screen is turned off, all the apps have their visibilities set to false, so there will still be the memory-pressure event.

Just to make sure we agreed on the patch I changed the low-memory notification to a heap-minimize (it is not obvious in the patch since I have other local patches in my patch queue).
(In reply to Vivien Nicolas (:vingtetun) (:21) - (NOT reading bugmails, needinfo? please) from comment #62)
> I'm only half-convinced by this on the long run. It may be a good solution
> for 2.1 though.
> Why am I only half convinced is because I'm pretty sure Firefox e10s will
> need something like this too if we want the UI to be responsive all the
> time. So adjusting threads priorities directly at the source sounds more
> correct (but it seems a bit painful to do that for 2.1 while making
> everybody agrees for all the threads on the all the platforms).

I'll be doing it only for FxOS. On desktop Firefox we don't adjust the thread priorities besides setting them at creation time and even then we can only lower a thread priority, not raise it. You need to be root to put anything to a nice value below 0. Let's say this is the safest solution for solving the problem at hand here and would improve other aspects of priority management too (avoid race conditions, make switching priorities faster).

> Whatever we do with nice priorities, will it be enough in the long run with
> all the possible use case ?
> I think it won't and cgroup will offers us much more flexibilities looking
> at https://www.kernel.org/doc/Documentation/cgroups/cgroups.txt.
> I would love to be able to limit the ability of cpu used by network threads
> or database threads in the main process, and ensuring background processes
> and their threads can not affect the device too much.

Yes, cgroups are a lot more powerful because they provided more direct control (e.g. make sure that a certain task will *always* get 50% of the CPU if it needs it) and they can also be used to manage priorities of other resources such as I/O. Proper use of them might help us avoid nasty scenarios such as an IndexedDB thread hammering the flash so bad that it starves every other thread that needs to do I/O. The reason we evalauted them but didn't use them in 1.0 was that on the ICS kernels we had at the time they were supported but hilariously broken.

> But for 2.1 it seems a little too big to implement, and I would say this is
> a good solution to complete our nice priorities heuristics and would promote
> it for 2.2.
> What do you think ?

Agreed.

(In reply to Vivien Nicolas (:vingtetun) (:21) - (NOT reading bugmails, needinfo? please) from comment #63)
> I'm far from beeing a ftrace expert as well :) I mostly used it a first to
> see which tasks is consuming the cpus and from there I tried to extract some
> datas from the big logs I got. I would be happy to fix the patch to get it
> correct!

What do you think about moving this part to a separate bug? It would help reduce the scope here since it's not directly related.

(In reply to Vivien Nicolas (:vingtetun) (:21) - (NOT reading bugmails, needinfo? please) from comment #64)
> From what I remember when the screen is turned off, all the apps have their
> visibilities set to false, so there will still be the memory-pressure event.

Yes, meh :-/ I hoped this would have helped in that scenario.

> Just to make sure we agreed on the patch I changed the low-memory
> notification to a heap-minimize (it is not obvious in the patch since I have
> other local patches in my patch queue).

That's OK. At some stage we changed from heap-minimize to low-memory and that proved to be a bad idea because low-memory drops WebGL contexts while heap-minimize doesn't. While that helped us unearth a few hidden issues where we weren't handling WebGL context rebuild correctly I think nobody will be sad to see that particular behavior go.
(In reply to Gabriele Svelto [:gsvelto] from comment #65)
> (In reply to Vivien Nicolas (:vingtetun) (:21) - (NOT reading bugmails,
> needinfo? please) from comment #62)
> > I'm only half-convinced by this on the long run. It may be a good solution
> > for 2.1 though.
> > Why am I only half convinced is because I'm pretty sure Firefox e10s will
> > need something like this too if we want the UI to be responsive all the
> > time. So adjusting threads priorities directly at the source sounds more
> > correct (but it seems a bit painful to do that for 2.1 while making
> > everybody agrees for all the threads on the all the platforms).
> 
> I'll be doing it only for FxOS. On desktop Firefox we don't adjust the
> thread priorities besides setting them at creation time and even then we can
> only lower a thread priority, not raise it. You need to be root to put
> anything to a nice value below 0. Let's say this is the safest solution for
> solving the problem at hand here and would improve other aspects of priority
> management too (avoid race conditions, make switching priorities faster).

Can I help in any ways ?
 
> (In reply to Vivien Nicolas (:vingtetun) (:21) - (NOT reading bugmails,
> needinfo? please) from comment #63)
> > I'm far from beeing a ftrace expert as well :) I mostly used it a first to
> > see which tasks is consuming the cpus and from there I tried to extract some
> > datas from the big logs I got. I would be happy to fix the patch to get it
> > correct!
> 
> What do you think about moving this part to a separate bug? It would help
> reduce the scope here since it's not directly related.

Sounds good to me. I mostly put it in here to not forgot this part on my hard drive. I will open a followup.
The Gecko part that separate the flush memory command from the setActive call: https://hg.mozilla.org/integration/mozilla-inbound/rev/b4dd10a940db

The gaia bits are waiting on try, they should be green soon.
(In reply to Carsten Book [:Tomcat] from comment #68)
> sorry had to back this out for test failures like
> https://treeherder.mozilla.org/ui/logviewer.
> html#?job_id=2957141&repo=mozilla-inbound

Thanks for the backout I will investigate and push to try before next time. I was pretty confident this was a trivial change. Sounds like I was wrong.
Now that the Gaia bits are in, I can see the async transitions don't stay perfectly in sync when closing the tray, but it's actually quite a nice-looking effect :) Also, performance is superior to using cliprect, unsurprisingly I suppose (until bug 1038785 gets fixed).
This commit causes issues with the software home button. Here is a quick fix:
https://github.com/gmarty/gaia/commit/668b3d30ebd64f755fc1d0726be18f51190daf93

Do you want me to open a bug for that?
Flags: needinfo?(21)
(In reply to Guillaume Marty [:gmarty] from comment #72)
> This commit causes issues with the software home button. Here is a quick fix:
> https://github.com/gmarty/gaia/commit/
> 668b3d30ebd64f755fc1d0726be18f51190daf93
> 
> Do you want me to open a bug for that?

Please.
Flags: needinfo?(21)
Depends on: 1082650
I haven't played with the sched tracing before. But do we really need that level of detail?

What are we getting that we wouldn't get by looking at /proc/PID/stat?

I guess it depends on how "micro" you want your analysis to be, and how many threads you're looking at.

Generating schedule events will cause alot more kernel/user transition, where if you can use the information from /proc/PID/stat then the kernel maintains the stats internally (top uses /proc/PID/stat).
Flags: needinfo?(dhylands)
[Blocking Requested - why for this release]: This is risky and we already need followups. Lets not block on this any more and only uplift if we are really sure its rock solid on master.
blocking-b2g: 2.1+ → -
(In reply to Dave Hylands [:dhylands] from comment #74)
> I haven't played with the sched tracing before. But do we really need that
> level of detail?
> 
> What are we getting that we wouldn't get by looking at /proc/PID/stat?

I guess with /proc/pid/stat you need to know which PID you are looking for, while the sched tracing gives you the list of everything that happens, even if you were not suspecting PID/TID x to be running.

Or did I misunderstood something and there is a magic way to get the list of all runnings pids/tids and know how much time they have consume during a specificied timeslice ?

> 
> I guess it depends on how "micro" you want your analysis to be, and how many
> threads you're looking at.
> 

All possible threads that may runs. It mostly because of the camera use case where some threads are running which are not part of Gecko.

> Generating schedule events will cause alot more kernel/user transition,
> where if you can use the information from /proc/PID/stat then the kernel
> maintains the stats internally (top uses /proc/PID/stat).

I think this is OK for the purpose of what we are looking for as far as I understand user transition. It basically means that we are going to switch between contexts and spend more time switching / doing profiling work. Is that correct ?
Blocks: 1089621
No longer blocks: 1089621
Notification page is good on master/foxfooding. 

---

Build ID               20150707033152
Gaia Revision          e6506d68e01489b57616f8b74e8773f938ce62b3
Gaia Date              2015-07-06 18:14:41
Gecko Revision         https://hg.mozilla.org/mozilla-central/rev/e271ef4c49ae
Gecko Version          42.0a1
Device Name            aries
Firmware(Release)      4.4.2
Firmware(Incremental)  eng.worker.20150619.224059
Firmware Date          Fri Jun 19 22:41:08 UTC 2015
Bootloader             s1
This doesn't seem to be a problem anymore. Vivien you have a large number of r+'d patches still attached here, are they still relevant? If some are then let's rebase them and land them before we close this bug as WFM.
Flags: needinfo?(21)
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: needinfo?(21)
Resolution: --- → WORKSFORME
Removing leave-open keyword from resolved bugs, per :sylvestre.
Keywords: leave-open
You need to log in before you can comment on or make changes to this bug.