Closed Bug 860526 Opened 11 years ago Closed 11 years ago

The B2G preallocated process gets niceness 0 (should be 19)

Categories: Core :: IPC, defect

Tracking: RESOLVED DUPLICATE of bug 861434
blocking-b2g: -

People: Reporter: justin.lebar+bug; Assigned: gsvelto

Attachments: (1 file, 1 obsolete file)

STR:

  1. Update your tree with repo sync to get the latest gonk-misc.
  2. $ adb reboot
  3. Wait until a few seconds after the phone shows the lock screen.
  4. $ adb shell b2g-procrank --nice

The only line which should have nice 0 is the main "b2g" process.  But on my device, I see

> APPLICATION      NICE   PID      Vss      Rss      Pss      Uss  cmdline
> (Preallocated a  0      452   20388K   20320K   11613K    9060K  /system/b2g/plugin-container

That is, the preallocated process has nice 0, when it should be getting niceness 19.  Indeed, if we look at its oom_adj, that is set properly:

> APPLICATION      NICE  OOM_ADJ  OOM_SCORE  OOM_SCORE_ADJ   PID      Vss      Rss      Pss      Uss  cmdline
> (Preallocated a  0        6        478         400         415   20392K   20324K   11613K    9060K  /system/b2g/plugin-container

The OOM_ADJ is 6, which indicates that the process has BACKGROUND priority.

So why doesn't the niceness match the oom_adj?  That is the question.

The fact that the preallocated process has the wrong niceness is bad, but what's Very Bad is that with my patches for bug 844323, suddenly other processes start getting niceness 0 too.  This completely defeats the purpose of the patches there.

My guess is that whatever is causing the preallocated process to have the wrong niceness may also be causing my troubles in bug 844323.

Gecko (b2g18 branch): https://github.com/mozilla/mozilla-central/commit/685eb1c52ce63a0f4c649b059de4bdb7927d7d8f

Gaia (likely irrelevant): https://github.com/mozilla-b2g/gaia/commit/304104d959f588ef3f732a9bf8271bcce8b74750

Device: Otoro
blocking-b2g: --- → tef?
blocking-b2g: tef? → tef+
Assignee: nobody → justin.lebar+bug
Can we reproduce this on an emulator target?  That may give us more options for finding out what's happening to the niceness.
(In reply to Jed Davis [:jld] from comment #1)
> Can we reproduce this on an emulator target?  That may give us more options
> for finding out what's happening to the niceness.

Is it easier to debug the kernel on the emulator?  I've never messed around with it before.  Ideally we'd print a backtrace when the kernel renices a process; I suspect that would lead very quickly to resolution of this bug.

My current best guess here is that a process other than the b2g master is setting the child's niceness to 0 after the master process renices it.  I think the most likely candidate for this renicer process is the child itself.

My reason for believing this is that under the current trunk, we see only the preloaded process getting nice 0, but with bug 844323, I see other processes with nice 0.  When I stepped through each setpriority() call in the master process with bug 844323 applied, I didn't see anything setting the niceness to 0, but also, after doing this, I didn't see any processes with niceness 0!  What seems to have happened is that pausing execution in gdb changed the ordering of the race, so the child reniced itself to 0 /before/ the parent reniced it.

Additionally, with unpatched b2g18, I observed a few seconds where all browser processes had niceness 0.  This shouldn't happen.  The master process renices the child to non-zero on the main thread immediately after it forks the child, so at most we should have one child process with niceness 0.  With current trunk (but not with bug 844323) the child process eventually renices itself again once it starts up; that may be why these 0 nicenesses don't stick around for anything other than the preallocated process.

This should be simple, right?  Just find the call to setpriority() or nice() that's causing the problem.  It doesn't seem to be in the parent process, so it's probably in the child.  Except I've stepped through in gdb, and I didn't see the call in the child.  I hope I simply missed it...
I think I know what's going on here: I'm fairly sure this is a regression introduced when I landed bug 841651. In that bug I implemented PR_SetThreadPriority() on Android (and thus FxOS) to use per-thread nice values assigned via setpriority(). Now what seems to be happening is the following:

- the parent process spawns the child then tries to call setpriority() on it with nice = 20

- the child process invokes _PR_InitThreads() which ultimately calls PR_SetThreadPriority() on the main thread with nice = 0

- since the main thread's TID is the same as the process PID, the setpriority() call re-nices either the whole process or just the main thread to 0; either way this is not what we want

This can obviously race with _PR_InitThreads() arriving first so the final nice value will depend on timing.

I don't have a quick fix for this but it shouldn't be too hard to find one; Justin, if you don't mind, I can take this bug if you are busy with other stuff.
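
For illustration, here is a minimal sketch of the suspected race; the child's setpriority() call stands in for what _PR_InitThreads() effectively does as described above, and the names and timings are assumptions rather than the actual Gecko code:

/* Minimal sketch of the suspected race. Run as root so the child may
 * lower its own nice value (as the B2G children effectively can at
 * startup). */
#include <stdio.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();
    if (pid == 0) {
        /* Child: renice the main thread (TID == PID) to 0, mirroring
         * the nice = 0 that _PR_InitThreads() sets. */
        setpriority(PRIO_PROCESS, 0, 0);
        sleep(2);
        printf("child nice = %d\n", getpriority(PRIO_PROCESS, 0));
        return 0;
    }
    /* Parent: renice the child to background priority. If the child's
     * own call lands after this one, the 19 is silently undone. */
    setpriority(PRIO_PROCESS, pid, 19);
    waitpid(pid, NULL, 0);
    return 0;
}

Whichever setpriority() call lands last wins, so the child can end up at nice 0 or 19 depending on scheduling.
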
Depends on: 841651
> Except I've stepped through in gdb, and I didn't see the call in the child.

Just to close the loop here, the reason I didn't see this was likely that I didn't attach the debugger soon enough.  Our "CHILDCHILDCHILD" message prints way after _PR_InitThreads().
This patch is the quick fix, I'll check the implications on scheduling and polish it next.
Assignee: justin.lebar+bug → gsvelto
Status: NEW → ASSIGNED
I did some further testing on this and I've got bad news: calling setpriority() from outside a process apparently affects only the priority of the main thread, not the whole process. I tested this on my machine by pinning a multi-threaded program to one core and then playing with the various threads' priorities as well as the process's.

I found this WONTFIX bug in the kernel bug tracker:

https://bugzilla.kernel.org/show_bug.cgi?id=6258

And a discussion of a patch to fix the problem, which ended with the patch being shot down:

https://lkml.org/lkml/2008/9/10/122

I've also checked the relevant code in the kernel and it does seem to change only the nice value of the thread whose TID was given to setpriority(), leaving other threads unaffected:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/kernel/sys.c?id=refs/tags/v3.9-rc6#n195

Back to this bug, the WIP fix I've attached only fixes the main-thread-got-the-wrong-nice-value issue, but it's just a band-aid if all the other threads are unaffected by external calls to setpriority().
I thought we said on IRC that one has to look at how the scheduler uses thread + process priorities in order to understand what's going on.  As far as I can tell, none of the links above suggest that the scheduler does or doesn't compose the thread priority with the process priority.

Looking at the code for setpriority (sys.c), I'd be pretty surprised if there was no composition, because setpriority(PRIO_PGRP) is exactly the same as setpriority(PRIO_PROCESS) for each of the processes in the PGRP.  So if there is no composition of process and thread priority, then process groups are essentially useless for setting priority (because they would never change the priority of existing threads).

But I'm still working through the code.
(In reply to Justin Lebar [:jlebar] from comment #7)
> I thought we said on IRC that one has to look at how the scheduler uses
> thread + process priorities in order to understand what's going on.  As far
> as I can tell, none of the links above suggest that the scheduler does or
> doesn't compose the thread priority with the process priority.

Yes, but before looking at the scheduler code I tried experimenting with this (ugly) small test program:

http://pastebin.com/KsuWmuYF

The program forks into two processes. The child process sets the nice value for the main thread to 20 and creates another thread whose priority is set to 10. Both threads then start crunching and periodically print out a counter. After three seconds the parent process calls setpriority() on the child PID with a nice value of 0. The expected outcome would be that the relative priority of the two threads in the child process does not change; what happens instead is that the main thread now gets scheduled far more often than the other one. You can see this effect by running the compiled program on a single core:

taskset 1 sudo ./a.out & sleep 10; sudo killall a.out

The output looks like this on my Linux box:

bg_worker 1
bg_worker 2
bg_worker 3
bg_worker 4
bg_worker 5
bg_worker 6
bg_worker 7
main 1
bg_worker 8
bg_worker 9
bg_worker 10
bg_worker 11
bg_worker 12
bg_worker 13
bg_worker 14
main 2
bg_worker 15
bg_worker 16
bg_worker 17
bg_worker 18
bg_worker 19
bg_worker 20
bg_worker 21
bg_worker 22
main 3
bg_worker 23
bg_worker 24
bg_worker 25
bg_worker 26
bg_worker 27
bg_worker 28
bg_worker 29
main 4
bg_worker 30
bg_worker 31
bg_worker 32
bg_worker 33
bg_worker 34
bg_worker 35
bg_worker 36
main 5
setpriority called! rv = 0
bg_worker 37
main 6
main 7
main 8
main 9
main 10
main 11
main 12
bg_worker 38
main 13
main 14
main 15
main 16
main 17
main 18
main 19
main 20
main 21
bg_worker 39
...
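
For reference, here is a hypothetical reconstruction of that test program in case the pastebin link goes away; the structure and constants are inferred from the description above, not copied from the original:

/* Hypothetical reconstruction of the pastebin test program. Build with
 * gcc -O2 -pthread; run as root (see the sudo above): lowering a nice
 * value below the inherited one needs CAP_SYS_NICE. */
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <sys/resource.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Burn CPU, printing a counter every ~10^8 iterations. */
static void crunch(const char *name)
{
    unsigned long long i = 0, n = 0;
    for (;;)
        if (++i % 100000000ULL == 0)
            printf("%s %llu\n", name, ++n);
}

static void *bg_worker(void *arg)
{
    (void)arg;
    /* PRIO_PROCESS with a TID affects only this thread on Linux. */
    setpriority(PRIO_PROCESS, (id_t)syscall(SYS_gettid), 10);
    crunch("bg_worker");
    return NULL;
}

int main(void)
{
    pid_t child = fork();
    if (child == 0) {
        pthread_t t;
        /* The description says nice 20; Linux clamps that to 19. */
        setpriority(PRIO_PROCESS, (id_t)syscall(SYS_gettid), 19);
        pthread_create(&t, NULL, bg_worker, NULL);
        crunch("main");
    }
    sleep(3);
    /* Renice the child "process": only its main thread actually moves. */
    printf("setpriority called! rv = %d\n",
           setpriority(PRIO_PROCESS, (id_t)child, 0));
    pause(); /* keep the parent alive so the child keeps running */
    return 0;
}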

Also, if you look at the proposed fix for this issue, you can see that the author was explicitly iterating over the thread TIDs:

https://lkml.org/lkml/2008/9/1/157

> Looking at the code for setpriority (sys.c), I'd be pretty surprised if
> there was no composition, because setpriority(PRIO_PGRP) is exactly the same
> as setpriority(PRIO_PROC) for each of the processes in the PGRP. So if
> there is no composition of process and thread priority, then process groups
> are essentially useless for setting priority (because they would never
> change the priority existing threads).

I'm not sure what the do_each_pid_thread() iterator returns in that case but it might be iterating over all the threads of the process group (including non-main threads):

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/include/linux/pid.h#n177

I don't know enough about the pid structure and what's stored in the tasks[] lists to be able to tell.
> but it might be iterating over all the threads of the process group

Ah, indeed it does, if you look at include/linux/pid.h.  But then process groups nuke any custom thread priorities?  That is really bizarre.

Based on your test, niceness looks totally broken for what we want.  Android has an additional scheduling module; perhaps we should look at that.  See bug 808975.
As expected, the patch here fixes this bug.
Attachment #736343 - Flags: feedback+
> Android has an additional scheduling module; perhaps we should look at that.  See bug 808975.

This doesn't look any better than nice; we still have to set the priority for each thread individually.
I guess one good thing is that it appears you can renice other processes' threads if you're root.

The patch here plus my patches for bug 847592 don't fix the problem.  My whole-system profiler shows a lot of time spent in the reniced processes.  But there's a good chance that we're spending time in the threads that we're not renicing.
(In reply to Justin Lebar [:jlebar] from comment #12)
> I guess one good thing is that it appears you can renice other processes'
> threads if you're root.

Yes, we could probably iterate over the dirs in /proc/<pid>/task/* and adjust the threads one by one, but this wouldn't fix the potential races caused by the threads adjusting their own priority. Alternatively I could check if setpriority() is a little better behaved with regard to process groups; if it is, we might consider creating a process group for each new content process and using that to re-nice the whole thing.
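
A sketch of that /proc iteration approach (error handling elided; as noted, it is inherently racy against newly created threads):

/* Renice every thread of |pid| by walking /proc/<pid>/task/. Threads
 * created after the readdir() pass are missed. Needs root (or
 * CAP_SYS_NICE) to lower another process's nice values. */
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/types.h>

static void renice_all_threads(pid_t pid, int nice_val)
{
    char path[64];
    DIR *dir;
    struct dirent *ent;

    snprintf(path, sizeof(path), "/proc/%d/task", (int)pid);
    if (!(dir = opendir(path)))
        return;
    while ((ent = readdir(dir)) != NULL) {
        if (ent->d_name[0] == '.')
            continue;  /* skip "." and ".." */
        /* Each entry is a TID; PRIO_PROCESS + TID acts per-thread. */
        setpriority(PRIO_PROCESS, (id_t)atoi(ent->d_name), nice_val);
    }
    closedir(dir);
}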

> The patch here plus my patches for bug 847592 don't fix the problem.  My
> whole-system profiler shows a lot of time spent in the reniced processes. 
> But there's a good chance that we're spending time in the threads that we're
> not renicing.

Now that I know more about this issue I think that my patch wouldn't be accepted as-is; I'll have to rethink it a little bit. The issue is that the way I implemented PR_SetThreadPriority() works only as long as nobody changes the priorities from outside the process. If someone does, the PRThread priority and the kernel thread priority will diverge, and that's misleading at best.
> Yes, we could probably iterate over the dirs in /proc/<pid>/task/* and adjust the threads one by 
> one but this wouldn't fix the potential races caused by the threads adjusting their own priority. 

I don't think we can get around this without being the kernel, unfortunately.  There are also races around threads starting up after you read the /task directory.  (Maybe reading it twice or something would help.)

> Alternatively I could check if setpriority() is a little better behaved in regards to process 
> groups; 

I thought we looked at process groups in the kernel and established that the kernel is calling setpriority on each of the threads in the process group:

> 		case PRIO_PROCESS:
> 			if (who)
> 				p = find_task_by_vpid(who);
> 			else
> 				p = current;
> 			if (p)
> 				error = set_one_prio(p, niceval, error);
> 			break;
> 		case PRIO_PGRP:
> 			if (who)
> 				pgrp = find_vpid(who);
> 			else
> 				pgrp = task_pgrp(current);
> 			do_each_pid_thread(pgrp, PIDTYPE_PGID, p) {
> 				error = set_one_prio(p, niceval, error);
> 			} while_each_pid_thread(pgrp, PIDTYPE_PGID, p);
> 			break;

> The issue is that the way I implemented PR_SetThreadPriority() works only as long as nobody 
> changes the priorities from outside of the process. If someone does then the PRThread priority and 
> the kernel thread priority will be different and that's misleading at best.

I think it probably makes sense for thread priorities to be relative to the main thread's priority.

So when you set the priority of a thread, we read the main thread's priority and then add or subtract to get the thread's new priority.  When you get the priority of a thread, we read both threads' priorities and subtract.

Glandium suggested on my blog that we might want to use cgroups for our priorities.  That might work well for B2G.  I'll need to look into how that interacts with nice and so on.
(In reply to Justin Lebar [:jlebar] from comment #14)
> I don't think we can get around this without being the kernel,
> unfortunately.  There are also races around threads starting up after you
> read the /task directory.  (Maybe reading it twice or something would help.)

I agree; we can probably make the race small enough not to be a problem most of the time, but we can't fix it.

> I thought we looked at process groups in the kernel and established that the
> kernel is calling setpriority on each of the threads in the process group:
> [...]

Yes, but I wanted to confirm it, and I did just that today: calling setpriority() on a process group sets all the threads' nice values to what you provided to the call, so this doesn't work for us either.

> I think it probably makes sense for thread priorities to be relative to the
> main thread's priority.
> 
> So when you set the priority of a thread, we read the thread's priority and
> then add or subtract to get the thread's new priority.  When you get the
> priority of a thread, we read both threads' priorities and subtract.

Yes, I thought about that so what you're proposing is to do the following:

- Implement PR_SetThreadPriority() to set nice values for threads as an offset of the main-thread nice value (e.g. if, when creating a DOM worker, the main thread has nice = 19, PR_SetThreadPriority() will set nice = 20 for the new thread)

- Change SetProcessPriority() to iterate over all the threads of a process and adjust their priority relatively

Obviously this won't prevent races from happening but should make them unlikely enough not to be a problem until we find a better solution.
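
A sketch of the offset logic, using a hypothetical helper rather than the actual NSPR entry points; note the clamping to Linux's valid nice range:

/* Hypothetical helper: set a thread's nice value as an offset from the
 * main thread's current value (main thread TID == PID). Error handling
 * around getpriority() (which can legitimately return -1) elided. */
#include <sys/resource.h>

static int set_thread_nice_relative(id_t tid, id_t main_tid, int offset)
{
    int nice_val = getpriority(PRIO_PROCESS, main_tid) + offset;

    if (nice_val > 19)   /* there is no nice 20 on Linux */
        nice_val = 19;
    if (nice_val < -20)
        nice_val = -20;
    return setpriority(PRIO_PROCESS, tid, nice_val);
}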

> Glandium suggested on my blog that we might want to use cgroups for our
> priorities.  That might work well for B2G.  I'll need to look into how that
> interacts with nice and so on.

I haven't experimented with cgroups yet but I can give them a spin.
> - Implement PR_SetThreadPriority() to set nice values for threads as an offset of the main-thread 
> nice value (e.g. if when creating a DOM worker the main thread has nice = 19 PR_SetThreadPriority() 
> will set nice = 20 for the new thread)

Except note that there's no nice 20 on Linux, so it will get truncated to 19.
(In reply to Justin Lebar [:jlebar] from comment #16)
> Except note that there's no nice 20 on Linux, so it will get truncated to 19.

Right, I have used nice values ranging from -1 to 2 to map the PR_* priorities, so it would be a matter of having SetProcessPriority() never adjust them by more than 17, so that the lowest-priority threads end up at nice = 19.

I had a quick look at cgroups and they seem to do exactly what we want; on top of that, Android already offers a cgroup for background processes called bg_non_interactive (see https://github.com/keesj/gomo/wiki/AndroidScheduling). I think it should be fairly easy to use it for our purposes and it would probably be far nicer than fiddling with the priorities as I suggested above. I'll try and see if I can make a quick patch for that.
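
A sketch of what that could look like, assuming the stock Android mount point for the cpu cgroup (/dev/cpuctl; paths vary per device):

/* Sketch: move a whole process into Android's bg_non_interactive cpu
 * cgroup. Writing a PID to cgroup.procs moves every thread of the
 * process (on kernels that support it); the "tasks" file takes
 * individual TIDs instead. */
#include <stdio.h>
#include <sys/types.h>

static int move_to_background_cgroup(pid_t pid)
{
    FILE *f = fopen("/dev/cpuctl/bg_non_interactive/cgroup.procs", "w");
    if (!f)
        return -1;
    fprintf(f, "%d\n", (int)pid);
    return fclose(f);
}
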
> I'll try and see if I can make a quick patch for that.

I'm not sure exactly what you're going to try, but note that we have two layers of priorities: Process priorities (bg, fg, etc.) and thread priorities.  We may want to do thread priorities using nice and process priorities using cgroups.
(In reply to Justin Lebar [:jlebar] from comment #18)
> I'm not sure exactly what you're going to try, but note that we have two
> layers of priorities: Process priorities (bg, fg, etc.) and thread
> priorities.  We may want to do thread priorities using nice and process
> priorities using cgroups.

Yes, that was my idea, sorry for not elaborating on it.

As a first experiment I wanted to leave thread scheduling as-is and try to lump background processes into the "bg_non_interactive" cgroup. This should cap all threads from all processes in that group to a maximum of 5% CPU usage when there's a foreground process running at full throttle. This is basically the Android model from what I could gather.

Following this a proper patch would do this:

- Create a cgroup for each value of mozilla::hal::ProcessPriority with appropriate CPU caps

- Reimplement SetProcessPriority() to use those cgroups instead of altering a process' nice value

This should be fairly straightforward except for choosing the values to assign to the cgroup parameters. What we can control there is the maximum amount of CPU given to the processes within a cgroup relative to other cgroups:

http://oakbytes.wordpress.com/2012/09/02/cgroup-cpu-allocation-cpu-shares-examples/

Since this behavior is different from using priorities, it might be tricky to find the optimal values for mapping each priority level onto a cgroup.
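
To make the difference concrete: cpu.shares is a relative weight rather than a hard cap. A hypothetical helper for tuning it (the path and the example weight are assumptions; check the device's init.rc for the real values):

/* cpu.shares sets a relative weight. With the default weight of 1024,
 * a group with cpu.shares = 52 gets roughly 52 / (52 + 1024), i.e.
 * about 5% of the CPU when both groups are busy, and as much as it
 * wants when the other group is idle. */
#include <stdio.h>

static void set_cpu_shares(const char *group, unsigned shares)
{
    char path[128];
    FILE *f;

    snprintf(path, sizeof(path), "/dev/cpuctl/%s/cpu.shares", group);
    if ((f = fopen(path, "w"))) {
        fprintf(f, "%u\n", shares);
        fclose(f);
    }
}
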
I've got a WIP patch using the existing cgroups almost ready; I'll attach it as soon as I verified it works.
Comment on attachment 736343 [details] [diff] [review]
[PATCH WIP] Never change the nice value of the main thread on Linux

Actually, this is still wrong in its current form.

When a thread starts up in a setuid(0) process that's nice'd, we shouldn't nice that thread to 0.
Attachment #736343 - Flags: feedback+ → feedback-
This patch re-implements SetProcessPriority by leveraging the default cgroups available on Android instead of setting nice values. The Gonk priority values are mapped like this:

PROCESS_PRIORITY_BACKGROUND             -> bg_non_interactive
PROCESS_PRIORITY_BACKGROUND_HOMESCREEN  -> bg_non_interactive
PROCESS_PRIORITY_BACKGROUND_PERCEIVABLE -> fg_boost
PROCESS_PRIORITY_FOREGROUND             -> fg_boost
PROCESS_PRIORITY_FOREGROUND_HIGH        -> (no cgroup)
PROCESS_PRIORITY_MASTER                 -> (no cgroup)

The behavior for these groups is the following:

- All processes in the |bg_non_interactive| cgroup can take no more than 5% CPU time if there are other processes requiring CPU time; they can take as much CPU time as they need if there's idle time to spare

- All processes in the |fg_boost| cgroup can take no more than 95% CPU time if there are other processes requiring CPU time; they can take as much CPU time as they need if there's idle time to spare

- Processes which don't have a cgroup assigned are uncapped and are free to take up to 100% CPU time irrespective of other processes in the system

The patch applies to mozilla-b2g18 and I tested it lightly, making sure the processes ended up in the right groups. I didn't have enough time to test the actual effect on the scheduling of the various processes but I will try it tomorrow.
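
The actual patch is in the attachment; as a rough sketch of the mapping described above (the enum mirrors mozilla::hal::ProcessPriority, redeclared here to keep the sketch self-contained):

/* Rough sketch of the priority -> cgroup mapping, not the patch code.
 * "" means the root cgroup, i.e. no CPU cap. */
typedef enum {
    PROCESS_PRIORITY_BACKGROUND,
    PROCESS_PRIORITY_BACKGROUND_HOMESCREEN,
    PROCESS_PRIORITY_BACKGROUND_PERCEIVABLE,
    PROCESS_PRIORITY_FOREGROUND,
    PROCESS_PRIORITY_FOREGROUND_HIGH,
    PROCESS_PRIORITY_MASTER
} ProcessPriority;

static const char *CgroupForPriority(ProcessPriority priority)
{
    switch (priority) {
    case PROCESS_PRIORITY_BACKGROUND:
    case PROCESS_PRIORITY_BACKGROUND_HOMESCREEN:
        return "bg_non_interactive";
    case PROCESS_PRIORITY_BACKGROUND_PERCEIVABLE:
    case PROCESS_PRIORITY_FOREGROUND:
        return "fg_boost";
    default:  /* FOREGROUND_HIGH and MASTER stay uncapped */
        return "";
    }
}
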
Attachment #736343 - Attachment is obsolete: true
Attachment #736993 - Flags: feedback?(justin.lebar+bug)
I tested renicing all threads, and I still see incoming calls taking ~4s longer under process-buster load (http://bit.ly/membuster).  So if cgroups can beat that, that's much better.

If we do go with cgroups, though, we're going to have to configure the cgroups on the device just like we configure the low-memory killer.  The cgroups here aren't going to work for us because they give FOREGROUND_HIGH and MASTER only a 5% boost over vanilla FOREGROUND processes.  But a large part of the problem under process-buster stress, and the entirety of the problem in other testcases (e.g. someone panning and zooming a page) is CPU load caused by a foreground app.
It seems to me that certain activities should have permanently high priorities.

This shouldn't be a problem for well-behaved apps (like, say, the call-answerer), which should be idle pretty much all of the time that they're not dealing with, say, incoming calls.

Maybe we should consider using real-time priorities for certain things? Then the scheduler is based purely on priority, and the lowest realtime priority trumps any of the non-realtime priorities.
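
A sketch of what that would look like; note that sched_setscheduler() is per-thread on Linux, so the same all-threads caveat applies:

/* Sketch: give a thread the lowest realtime priority. Needs root or
 * CAP_SYS_NICE, and a busy SCHED_FIFO thread that never blocks can
 * starve every SCHED_OTHER (nice-based) thread in the system. */
#include <sched.h>
#include <sys/types.h>

static int make_realtime(pid_t tid)
{
    struct sched_param param = { .sched_priority = 1 };  /* lowest RT */
    return sched_setscheduler(tid, SCHED_FIFO, &param);
}
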
> This shouldn't be a problem for well behaved apps (like say the call-answerer), which should be idle 
> pretty much all of the time that they're not dealing with say incoming calls.

The communications app, which is responsible for ringing the phone when a call is incoming, is the same app which is responsible for showing the dialer screen.  There may be other processes involved with answering a call; I don't know.

We have mechanisms in place to boost the priority of this app when appropriate.  I don't think our problem is that we're boosting the priority at the wrong time; it's the mechanism we use for changing the priority that's busted at the moment.  Perhaps realtime priorities are the answer, although it's not clear to me that we would not have the same problem with threads as we do now.  If cgroups can change the priority of a whole process rather than a single thread, that seems like a big advantage in terms of avoiding races.
I'm going to forward-dupe this to bug 861434 (back out the NSPR change that caused this issue).  If we decide not to back out the change from NSPR for some reason, we can undupe.

Let's worry about doing the B2G prioritization correctly (by renicing all threads, using cgroups, using realtime priorities, or whatever) in a separate bug blocking bug 847592.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → DUPLICATE
blocking-b2g: tef+ → -
Attachment #736993 - Flags: feedback?(justin.lebar+bug)