Closed Bug 1002917 Opened 10 years ago Closed 10 years ago

b2g process is using RT scheduling and running long time

Categories

(Firefox OS Graveyard :: GonkIntegration, defect, P1)

ARM
Gonk (Firefox OS)
defect

Tracking

(blocking-b2g:1.4+)

RESOLVED INCOMPLETE
blocking-b2g 1.4+

People

(Reporter: tkundu, Assigned: tkundu)

References

Details

(Keywords: crash, regression, Whiteboard: [caf priority: p1][CR 656053][b2g-crash])

Attachments

(1 file)

Attached file logs.zip
STR:

1.Made an MO call.
2.BT ON and paired with other device.
3.Transfered Vedio file via BT
4.Launch Browser and surf for sometime.
5.Downloaded games from MarketPlace.
6.sent multiple MO SMS.
7.MAde an MT call device got crashed.

It is difficult to reproduce as it happens randomly.

It seems to us RT scheduling enabled AudioTrack playback, introduced in https://bugzilla.mozilla.org/show_bug.cgi?id=962932

Can anyone why b2g needs to RT task for a long time ? RT task should run <1sec and yield cpu to other task. Otherwise it will cause CPU hogging and result will be crash.

Please suggest .
blocking-b2g: --- → 1.4?
Component: General → GonkIntegration
> 
> Can anyone why b2g needs to RT task for a long time ? RT task should run
> <1sec and yield cpu to other task. Otherwise it will cause CPU hogging and
> result will be crash.

I think that b2g basically should not use RT. Sorry, I do not know the reason. Some people seems to be tempted to use RT priority just to fix some use case.
(In reply to Tapas Kumar Kundu from comment #0)
> Can anyone why b2g needs to RT task for a long time ? RT task should run
> <1sec and yield cpu to other task. Otherwise it will cause CPU hogging and
> result will be crash.

How is RT scheduling related to the crash here? And also which device did you experience this on? IIRC bug 962932 only affects Jelly Bean and KitKat devices. I'm CC'ing the people who worked on that bug just in case.
If an RT thread were to consume 100% of the CPU, then no other thread would every get scheduled, ever, until the RT thread stops using 100% of the CPU.
We are getting a watchdog bark because a kernel kworker thread can't pet the dog due enough due to the RT scheduling in the b2g process.
RT scheduling should not be used unless we are absolutely sure the process only eats a certain percentage of CPU. Unfortunately it looks like low latency AudioTracks rely on RT scheduling. I think we can disable RT scheduling in gonksched without regressing bug 962932.
(In reply to Michael Wu [:mwu] from comment #5)
> RT scheduling should not be used unless we are absolutely sure the process
> only eats a certain percentage of CPU. Unfortunately it looks like low
> latency AudioTracks rely on RT scheduling. I think we can disable RT
> scheduling in gonksched without regressing bug 962932.

Shouldn't this be ok as this is how android JB works?  I thought enabling the RT scheduling for these low latency AudioTracks was the whole point of your fix for bug 962932.
Bug 962932 was mostly fixing lag due to a timeout. Audio playback was slow because the audio thread gets permission to use low latency, and then waits a very long time for the priority boost which never happens. That wait should go away if we just deny permission.
Do we know why these AudioTracks are running wild or using too much CPU here vs on android?  My understanding was that this is all in the omx/android code, right?  Just turning off the RT priority here seems like papering over the problem.

Michael, your build does not have FAST_TRACKS_AT_NON_NATIVE_SAMPLE_RATE enabled, does it?
Flags: needinfo?(mvines)
(In reply to Ben Kelly [:bkelly] from comment #8)
> Do we know why these AudioTracks are running wild or using too much CPU here
> vs on android?  My understanding was that this is all in the omx/android
> code, right?  Just turning off the RT priority here seems like papering over
> the problem.
> 
> Michael, your build does not have FAST_TRACKS_AT_NON_NATIVE_SAMPLE_RATE
> enabled, does it?

No its not enabled [1] 

[1] https://www.codeaurora.org/cgit/quic/la/platform/frameworks/av/tree/services/audioflinger/Configuration.h?h=b2g_kk_3.5#n36
Flags: needinfo?(mvines)
(In reply to Ben Kelly [:bkelly] from comment #8)
> Do we know why these AudioTracks are running wild or using too much CPU here
> vs on android?  My understanding was that this is all in the omx/android
> code, right?  Just turning off the RT priority here seems like papering over
> the problem.
> 

I don't think it's papering over the problem. We just can't use RT priority unless we can be sure that the audio callbacks finish within a bounded timeframe.

> Michael, your build does not have FAST_TRACKS_AT_NON_NATIVE_SAMPLE_RATE
> enabled, does it?

I don't think that matters. The opensl backend determines what the native sample rate is likely to be and tries to use that.
Can we get a crash stack?
Keywords: crash
Whiteboard: [CR 656053] → [CR 656053][b2g-crash]
Severity: normal → critical
Keywords: regression
(In reply to Michael Wu [:mwu] from comment #10)
> I don't think it's papering over the problem. We just can't use RT priority
> unless we can be sure that the audio callbacks finish within a bounded
> timeframe.

Right, but what is different for us vs android?  These callbacks were designed to use this scheme on android with RT priority.

Also, looking at the steps in comment 0, I don't see a ton of audio being performed that would use AudioTracks.  Don't we only trigger this code for mp3 playback?

Do we have a process listing or something that shows the audio tracks using all the CPU or is this still just a guess?

Does this occur without bluetooth in use?
Is this related to the watchdog or unrelated?

  04-23 14:33:51.033 E/GKI_LINUX(  207): ##### ERROR : timer_thread: tick delayed > 5 slots (-71920729,-50000000) -- cpu overload ? #####

I see lots of those pretty much from the start of the log.

I also see this:

  04-23 14:31:52.873 E/GeckoConsole(  207): [JavaScript Error: "IndexedDB UnknownErr: IDBTransaction.cpp:863"]

The dmesg is empty due to a permissions error, by the way.
(In reply to Ben Kelly [:bkelly] from comment #12)

> Do we have a process listing or something that shows the audio tracks using
> all the CPU or is this still just a guess?
> 
Should we put extra logging in gecko or gonksched to know who is giving RT to a thread for more than 1 second or so ? If you can provide us WIP patch which will log these info from gecko/gonksched then we will try to reproduce this issue with your patch and update new log here. 

>> Does this occur without bluetooth in use?
We didn't check it by disabling Bluetooth so far.
Flags: needinfo?(bkelly)
(In reply to Tapas Kumar Kundu from comment #14)
> Should we put extra logging in gecko or gonksched to know who is giving RT
> to a thread for more than 1 second or so ? If you can provide us WIP patch
> which will log these info from gecko/gonksched then we will try to reproduce
> this issue with your patch and update new log here. 

Do you have something that indicates its a thread belonging to the b2g process?  Or is that an educated guess at this point?  I see b2g is using 30% cpu in top, but that may or may not be relevant.

As far as I know the only thing that can lead to RT priority in b2g user space is the gonksched granting permission to AudioTask.  You could log a message here to see if that is being triggered:

  https://github.com/mozilla-b2g/gonk-misc/blob/master/gonksched.cpp#L88

You can print out the tid and then we can line it up with the threads owned by b2g.  I think you can see the b2g threads by running b2g-info -t or looking in /proc/<b2g pid>/tasks.

We could also modify b2g-info -t to display the cpu time for each thread.  This might help us see which thread under b2g is using that 30% CPU and then correlate it to the audio task.
Flags: needinfo?(bkelly)
(In reply to Ben Kelly [:bkelly] from comment #12)
> (In reply to Michael Wu [:mwu] from comment #10)
> > I don't think it's papering over the problem. We just can't use RT priority
> > unless we can be sure that the audio callbacks finish within a bounded
> > timeframe.
> 
> Right, but what is different for us vs android?  These callbacks were
> designed to use this scheme on android with RT priority.
> 

Yeah, but only for users who opt into it and understand that their code must complete within a bounded amount of time. I'm not sure we can say that for the users of the cubeb backend.

> Also, looking at the steps in comment 0, I don't see a ton of audio being
> performed that would use AudioTracks.  Don't we only trigger this code for
> mp3 playback?
> 

All audio playback involving pcm samples use AudioTracks.
(In reply to Michael Wu [:mwu] from comment #16)
> All audio playback involving pcm samples use AudioTracks.

I guess the other piece I don't understand is the UID check here:

  https://github.com/mozilla-b2g/gonk-misc/blob/master/gonksched.cpp#L79

It was my impression that individual threads could not have a different UID from the owning process.  Do our client processes run with the IAD_MEDIA UID?  Or is this running in the mediaserver process separate from the b2g process.
I believe that the media server is the only process that runs with userid AID_MEDIA

My understanding is that gonksched is a standalone daemon that runs as root.
(In reply to Ben Kelly [:bkelly] from comment #15)
 
> Do you have something that indicates its a thread belonging to the b2g
> process?  Or is that an educated guess at this point?  I see b2g is using
> 30% cpu in top, but that may or may not be relevant.

This is confirmed by kernel debugging. Our observation:
 1-core 0 is only alive core
 2- other cores are in power collapse
 3- on core 0 b2g is running and hogging cpu
 4- b2g is running as rt (real time process) so it should not do cpu intensive work. (concluded by looking at run queue data structure )
 5- while kthread worker which is supposed to pet watchdog is runnable and follows cfs class schduling
 6- since it cannot get cpu unless rt class process leave cpu so results in watchdog bark
(In reply to Tapas Kumar Kundu from comment #19)
> (In reply to Ben Kelly [:bkelly] from comment #15)
>  
> > Do you have something that indicates its a thread belonging to the b2g
> > process?  Or is that an educated guess at this point?  I see b2g is using
> > 30% cpu in top, but that may or may not be relevant.
> 
> This is confirmed by kernel debugging. Our observation:
>  1-core 0 is only alive core
>  2- other cores are in power collapse
>  3- on core 0 b2g is running and hogging cpu
>  4- b2g is running as rt (real time process) so it should not do cpu
> intensive work. (concluded by looking at run queue data structure )
>  5- while kthread worker which is supposed to pet watchdog is runnable and
> follows cfs class schduling
>  6- since it cannot get cpu unless rt class process leave cpu so results in
> watchdog bark

Well, based on this:

(In reply to Dave Hylands [:dhylands] from comment #18)
> I believe that the media server is the only process that runs with userid
> AID_MEDIA

It seems a bit of a mystery how we are getting into RT priority.  That UID check means gonksched will only elevate the priority of the mediaserver process.  I don't know of anywhere else that we set RT priority currently.

Tapas, can you see which thread is running at that priority?  Or just that its some thread within b2g process?
Well, bluedroid sets its timer thread to RT priority 32!

  http://code.metager.de/source/xref/android/4.3/external/bluetooth/bluedroid/gki/ulinux/gki_ulinux.c#742
  http://code.metager.de/source/xref/android/4.3/external/bluetooth/bluedroid/gki/ulinux/gki_int.h#46
  http://code.metager.de/source/xref/android/4.3/external/bluetooth/bluedroid/gki/ulinux/gki_int.h#53

Which combined with these error messages:

    04-23 14:33:51.033 E/GKI_LINUX(  207): ##### ERROR : timer_thread: tick delayed > 5 slots (-71920729,-50000000) -- cpu overload ? #####

From your log process 207 is the same as the parent gecko process:

04-23 14:24:13.523 I/Gonk    (  207): Setting nice for pid 207 to 0
04-23 14:24:13.523 I/Gonk    (  207): Changed nice for pid 207 from 0 to 0.

Might explain some of the problem...
ni'ing in the BT team.
Flags: needinfo?(tzimmermann)
Flags: needinfo?(shuang)
Flags: needinfo?(echou)
(In reply to Ben Kelly [:bkelly] from comment #21)
>     04-23 14:33:51.033 E/GKI_LINUX(  207): ##### ERROR : timer_thread: tick
> delayed > 5 slots (-71920729,-50000000) -- cpu overload ? #####
> 
> From your log process 207 is the same as the parent gecko process:
> 
> 04-23 14:24:13.523 I/Gonk    (  207): Setting nice for pid 207 to 0
> 04-23 14:24:13.523 I/Gonk    (  207): Changed nice for pid 207 from 0 to 0.
> Might explain some of the problem...
Sorry, I opened bug 933589 when I ported Gecko to support bluedroid for JellyBean on Nexus 4, and I never know this is related to RT priority. I also check this error message (ERROR : timer_thread: tick delayed) never shows on AOSP Android devices which use the same library. And bluedroid library uses real time priority for GKI_timer_update() thread. 
http://androidxref.com/4.4.2_r2/xref/external/bluetooth/bluedroid/gki/ulinux/gki_int.h#49
Flags: needinfo?(shuang)
Tapas,

Can we please get a stack and signature for the crash?
blocking-b2g: 1.4? → 1.4+
Flags: needinfo?(tkundu)
(In reply to Shawn Huang [:shuang] [:shawnjohnjr] from comment #23)
> (In reply to Ben Kelly [:bkelly] from comment #21)
> >     04-23 14:33:51.033 E/GKI_LINUX(  207): ##### ERROR : timer_thread: tick
> > delayed > 5 slots (-71920729,-50000000) -- cpu overload ? #####
> > 
> > From your log process 207 is the same as the parent gecko process:
> > 
> > 04-23 14:24:13.523 I/Gonk    (  207): Setting nice for pid 207 to 0
> > 04-23 14:24:13.523 I/Gonk    (  207): Changed nice for pid 207 from 0 to 0.
> > Might explain some of the problem...
> Sorry, I opened bug 933589 when I ported Gecko to support bluedroid for
> JellyBean on Nexus 4, and I never know this is related to RT priority. I
> also check this error message (ERROR : timer_thread: tick delayed) never
> shows on AOSP Android devices which use the same library. And bluedroid
> library uses real time priority for GKI_timer_update() thread. 
> http://androidxref.com/4.4.2_r2/xref/external/bluetooth/bluedroid/gki/ulinux/
> gki_int.h#49

Shawn

Are we able to reproduce the crash? Do we need anything from partner here?
(In reply to Preeti Raghunath(:Preeti) from comment #24)
> Tapas,
> 
> Can we please get a stack and signature for the crash?

This is a watchdog bark and not a Gecko crash so there's no stack trace that would really be useful here.

I think next steps here are for us to dig into how the Bluedroid RT task usage in Android differs from us.  The gonksched RT usage looks like a dead end (we did have that code on v1.3 after all and didn't run into this, but BT moved from BlueZ to Bluedroid in in v1.4)
clearing NI as m1 has already answered :)
Flags: needinfo?(tkundu)
Tim/ Eric,

Please review next steps per m1's comments on 26
Flags: needinfo?(timdream)
I can't help Gonk/Gecko crash issue.
Flags: needinfo?(timdream)
Hi Shawn,

What if we patch Bluedroid to set GKI_LINUX_TIMER_POLICY to SCHED_OTHER (i.e. the non-RT scheduler)? Will Bluedroid still work in this case?
Flags: needinfo?(tzimmermann)
It seems to me we need to track down the issue with bug 933589.  If the timer is running out of control, we should fix that instead of lowering its priority.
Depends on: 933589
(In reply to Ben Kelly [:bkelly] from comment #31)
> It seems to me we need to track down the issue with bug 933589.  If the
> timer is running out of control, we should fix that instead of lowering its
> priority.

I just verified that Android Kitkat is getting EPERM error in [1] and Android kitkat bluetooth works fine with that.

[1] https://www.codeaurora.org/cgit/external/gigabyte/platform/external/bluetooth/bluedroid/tree/gki/ulinux/gki_ulinux.c?h=caf/b2g_kk_3.5#n751

Currently we are calling [1] from b2g main process and we are changing priority successfully. 

This means that if we simply disable this portion of code[1] then bluedroid stack should work fine for us (like Android).

I can upload a simple patch for that. 

Do you still want to switch from GKI_LINUX_TIMER_POLICY to SCHED_OTHER ?
Flags: needinfo?(bkelly)
(In reply to Thomas Zimmermann [:tzimmermann] [:tdz] from comment #30)
> Hi Shawn,
> 
> What if we patch Bluedroid to set GKI_LINUX_TIMER_POLICY to SCHED_OTHER
> (i.e. the non-RT scheduler)? Will Bluedroid still work in this case?
It seems on Android KK, bluedroid library runs on Bluetooth java application, with only bluetooth permission, which causes scheduler based on RT policy fail to acquire. This seems can be fixed from bluedroid based on Comment 32.
Yes, the change in comment 32 can be made outside Gecko for v1.4 but it is scary that we're running Bluedroid at significantly higher privilege than our Android friends.  The widening of the BT attack surface in B2G probably deserves a new bug for v2.0 as well.
Assignee: nobody → tkundu
I opened bug 1005934 to address the issues with running bluedroid from the b2g process.  Sounds like we have a plan for 1.4.
Flags: needinfo?(bkelly)
See Also: → 1005934
Closing this bug out as the patch to avoid the RT scheduler in this case isn't for Gecko.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → INCOMPLETE
Whiteboard: [CR 656053][b2g-crash] → [caf priority: p1][CR 656053][b2g-crash]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: