Closed Bug 847592 Opened 11 years ago Closed 11 years ago

Incoming calls take a long time in loaded scenario, sometimes they are not even displayed and ends up as unanswered call.

Categories

(Firefox OS Graveyard :: General, defect, P1)

ARM
Gonk (Firefox OS)
defect

Tracking

(blocking-b2g:tef+, b2g18 fixed, b2g18-v1.0.1 fixed)

RESOLVED FIXED
1.0.1 Cert2 (21may)
blocking-b2g tef+
Tracking Status
b2g18 --- fixed
b2g18-v1.0.1 --- fixed

People

(Reporter: godavari, Assigned: justin.lebar+bug)

References

Details

(Keywords: perf, Whiteboard: [target: likely done, but waiting on QA results and investigation thereof] [c= p= s=2013.05.17 u=], IOT, Chile, Ikura, khepara_43442 [madrid], ux-tracking)

Attachments

(9 files, 6 obsolete files)

886.88 KB, text/plain
Details
52.62 KB, text/plain
Details
504.97 KB, text/plain
Details
358.91 KB, application/x-xz
Details
206.61 KB, application/x-xz
Details
1.53 MB, application/x-gzip
Details
9.29 KB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
Details
496 bytes, patch
Details | Diff | Splinter Review
946.59 KB, text/plain
Details
Similar to Bug 812059. 

But in a loaded scenario the incoming notification doesn't even showup in some cases. Steps to reproduce the issue. 

1. Make sure a valid SIM is inserted and able to receive calls in the device.
2. Open the browser and visit a page like nytimes.
3. Use another mobile to make initiate a call to the target we want to test. 
4. Mean while pinch and zoom\ pan continuously in the web page we already  opened. 
5. It takes really long time to get the incoming call (more than 2 secs) and sometimes we donot even get the call. 

Please let me know if you need any more details.
Summary: Incoming calls take a long time in loaded scenario, sometime they are not displayed ends up as unanswered. → Incoming calls take a long time in loaded scenario, sometimes they are not even displayed and ends up as unanswered call.
Whiteboard: [FFOS_perf]
Anshul is looking into the later part of the issue where the UI doesn't come up some times.
> 5. It takes really long time to get the incoming call (more than 2 secs) and sometimes we donot even 
> get the call. 

> Anshul is looking into the later part of the issue where the UI doesn't come
> up some times.

I think these are essentially the same issue.  It takes a lot of CPU in order to get the UI to show up, and if by then the call is over, the UI will only show a missed call.
Severity: critical → blocker
Priority: P2 → P1
I don't quite know how QC is using priority flags -- is this a tef+ bug?
The only way I can think of to address this bug is, right when the main process receives the first inkling that there's an incoming call, we immediately drop the CPU priority of all other processes.  This will allow the dialer to launch quickly.

But I need to understand how serious we consider this bug to be before I invest in that.
In Mobile world, not able to receive a call is a Highest priority issue so must be fixed for any commercial product.

(In reply to Justin Lebar [:jlebar] from comment #4)
> The only way I can think of to address this bug is, right when the main
> process receives the first inkling that there's an incoming call, we
> immediately drop the CPU priority of all other processes.  This will allow
> the dialer to launch quickly.
> 
> But I need to understand how serious we consider this bug to be before I
> invest in that.
blocking-b2g: --- → tef?
Assignee: nobody → justin.lebar+bug
Depends on: 844323
I did some more analysis on the issue where incoming call is not getting displayed sometimes when user is busy zooming in/out and panning on a browser page.

The issue is that phone runs out of memory and kernel kills the communications app right after telephony sends the telephony-new-call message and before communications app gets a chance to launch.
> The issue is that phone runs out of memory and kernel kills the communications app right after 
> telephony sends the telephony-new-call message and before communications app gets a chance to launch.

Interesting.  That...shouldn't happen after bug 836199 was fixed.  Maybe we regressed something.

I'll dig into this, but if you're interested in providing more data, the first thing I'm going to try to collect is the result of modifying ProcessPriorityManager.cpp so it prints to logcat (near the top of the file; comment out an "&& 0") and then recording

 $ adb logcat -v time

and, after you've placed the call and it's died, running

 $ adb shell dmesg.
Justin, I was able to reproduce the issue after 7th try. The android logs is from the boot up so it would have all 7 incoming call related logs. Please look at the end of the log for the 7th call for which the communications app didn't show the incoming call UI.
Thanks for the info, Anshul.

This is all good initially.  But then we're releasing the CPU wake-lock, which should not happen until the call ends:

> 03-06 10:50:08.549 I/Gecko:ProcessPriorityManager(  700): Got wake lock changed event. Now mHoldsCPUWakeLock=0, mHoldsHighPriorityWakeLock=0

We then decide that the communications app is in the background (because it is):

> 03-06 10:50:09.579 I/Gecko:ProcessPriorityManager(  700): Reset priority timer callback; about to ResetPriorityNow.
> 03-06 10:50:09.579 I/Gecko:ProcessPriorityManager(  700): Setting priority to BACKGROUND

We die soon thereafter.

We shouldn't be releasing this wake lock.  But also, holding the lock isn't going to guarantee that the process will run quickly enough to receive the call.
Justin the message from ProcessPriorityManager that there is no active cpu lock is after the call was disconnected and so isn't that expected. Also, please look at the last call in the logs, which is one that really has the issue; around the log statement below.

> 03-06 10:50:56.829 I/Gecko   (  127): -*- QCContentHelper_QC_B2G: Notify system message manager of telephony-new-call
Okay...

> the phone runs out of memory and the kernel kills the communications app right after 
> telephony sends the telephony-new-call message and before communications app gets a 
> chance to launch.

Which telephony PID do you think is getting killed here?

I wish we could match up the times in dmesg and logcat!
blocking-b2g: tef? → tef+
Justin, 753 (Communications) app is getting killed.
Whiteboard: [FFOS_perf] → [FFOS_perf], u=user c=dialer s=ux-most-wanted
Do we have any update for this case? Thanks!
Justin, may I know if you're working on this? Or, should we find someone to help this? Just want to know if there has anything I can help here. Thanks!
Flags: needinfo?(justin.lebar+bug)
I'm working on bug 844323, which is a prerequisite to fixing this bug, as I currently understand it.
Flags: needinfo?(justin.lebar+bug)
This appears to be a important bug to fix especially if we're missing calls.  Sounds like jlebar is making good progress on bug 844323 (given the risk), but wanted to reiterate the importance.   Thanks.
As an update on my progress here: I have a working draft of bug 844323, and I've implemented code to immediately renice all processes to 20 when we receive an incoming call.

We renice all the processes within 400ms or so from when the RIL notices that we have a call.  But the phone still doesn't ring more than two or three times when under process stress ("bust processes" at http://bit.ly/membuster).  In other words, this change does not have the expected effect of fixing this bug.

There are a number of possible explanations for this.  Perhaps we're so overloaded that RIL doesn't receive the notification until later than it should.  Perhaps the main process is busy doing work for the reniced subprocesses, and this delays the communication app's loading.  Perhaps despite the fact that they've been reniced, the kernel schedules the processes too aggressively.  Maybe it's some combination of these factors, or maybe it's something else.

More investigation to come as I learn more.
I will try to profile Monday.
Depends on: 860097
I have and am exercising tef+ powers for bugs this depends on.  Just ask.
Depends on: 860273
With my whole-system profiler from bug 860273, I see the child processes taking up a lot of CPU time before they get reniced.

I'm running with my patch which gives nice 19 to all child processes, but I see samples like this

  pid=544 Browser nice=19 utime=0.010 stime=0.000
  pid=567 system-profiler nice=0 utime=0.000 stime=0.020
  pid=582 Browser nice=0 utime=0.070 stime=0.030

This sample reflects 100ms of time, and pid 582 got 100ms of CPU time in this sample (utime + stime = 0.1).

I'll see if I can figure out why these processes aren't getting reniced as quickly as I'd expect.
Depends on: 861441
Whiteboard: [FFOS_perf], u=user c=dialer s=ux-most-wanted → [FFOS_perf], u=user c=dialer s=ux-most-wanted, IOT, Chile, Ikura, khepara_43442
Whiteboard: [FFOS_perf], u=user c=dialer s=ux-most-wanted, IOT, Chile, Ikura, khepara_43442 → [FFOS_perf], u=user c=dialer s=ux-most-wanted, IOT, Chile, Ikura, khepara_43442 [madrid]
Target Milestone: --- → Madrid (19apr)
Whiteboard: [FFOS_perf], u=user c=dialer s=ux-most-wanted, IOT, Chile, Ikura, khepara_43442 [madrid] → [FFOS_perf], u=user c=dialer s=ux-most-wanted, IOT, Chile, Ikura, khepara_43442 [madrid][status: needs more investigation]
I did some quite extensive testing of this scenario today as part of my investigation of bug 861441. From what I could gather there's no easy solution to fix it least landing both my patch in bug 861441 and Justin's patch-set for bug 844323.

When I was working on this however I came up with a possible, quicker (and dirtier) alternative which might allow us to fix this specific problem while dealing with the more general issue later. With my patch applied I tried making the Communications process put himself in the highest priority control group as soon as its created and have it stick there (i.e. never lower its priority to background). Since this completely bypasses the ProcessPriorityManager it ensures that when a call arrives Communications has enough CPU time to respond fast. With this trick I was able to consistently receive calls within a 2s time-frame under process-buster stress.

I know it's a really bad solution and it has its drawbacks (mainly related to OOM behavior since our priorities affect that too) but if we need a quick'n'dirty fix it works.
Whiteboard: [FFOS_perf], u=user c=dialer s=ux-most-wanted, IOT, Chile, Ikura, khepara_43442 [madrid][status: needs more investigation] → [FFOS_perf], u=user c=dialer s=ux-most-wanted, IOT, Chile, Ikura, khepara_43442 [madrid]
What you're talking about here is similar to what I have in bug 847592.  But note that the communications app is also the dialer app, so we can't lock it at the highest priority class.

Anyway I'm done doing quick-and-dirty fixes here.  We have time to do this right.  But your investigation is very helpful because we have a lower bound on what we're trying to achieve here.
(In reply to Justin Lebar [:jlebar] from comment #25)
> What you're talking about here is similar to what I have in bug 847592.  But
> note that the communications app is also the dialer app, so we can't lock it
> at the highest priority class.

Yes, it would be really ugly, especially from a memory POV.

> Anyway I'm done doing quick-and-dirty fixes here.  We have time to do this
> right.  But your investigation is very helpful because we have a lower bound
> on what we're trying to achieve here.

I agree, if we can fix this properly by having a proper method to quickly giving most of the CPU to a single task it will also greatly improve the overall responsiveness of the system.

As an extra data-point I've tried to measure how long it takes for a call to show up under perfect conditions (Communications already open, system essentially idle). With my stopwatch I could measure times a bit below one second; say 0.7-0.8s though the exact measure is not really relevant because it was well within the error margin of my reflexes :) So if we can get it to show up within a 2s timeframe under process-buster stress it would be a very good result IMO.
(In reply to Justin Lebar [:jlebar] from comment #25)
> What you're talking about here is similar to what I have in bug 847592.  But
> note that the communications app is also the dialer app, so we can't lock it
> at the highest priority class.

Does it mean you're proposing another different solution here? Thanks! 

> 
> Anyway I'm done doing quick-and-dirty fixes here.

Thanks, Justin. Because this bug is marked as a certification blocker and it needs to be fixed before April 24th. Do you think it's possible to close this case by that day? 

  We have time to do this
> right.  But your investigation is very helpful because we have a lower bound
> on what we're trying to achieve here.
> Does it mean you're proposing another different solution here? Thanks! 

I'm proposing that we do what I have in bug 847592, instead of something that gsvelto and I think will regress other features of the phone.

> Do you think it's possible to close this case by that day? 

If it needs to be done by April 24, it doesn't really matter whether I think it's possible, right?  I'll do my best.
Reviewed on April 23th: Agreed to have this fixed before the 3rd certification.
I'm not sure we need to take these patches -- I'd love nothing more than to wontfix this bug -- but I want to save what I have in case we do need it.
Sorry, I should move these patches to a separate bug, since this is a metabug...
Target Milestone: 1.0.1 Madrid (19apr) → 1.0.1 Cert2 (28may)
The OEM will need around 5 days to prepare the build and do the verification. It means, we need to fix it by May 21th. Justin, do you think it's possible? Thanks.
At this point we don't know how to solve this bug with the testcase we've been using.  My best guess is that the main process is busy doing work for the low-priority background processes, and that's what delays the communications app from loading.  If that's happening, I doubt we can fix it by May 21.

But the testcase we've been using is highly unrealistic; it launches something like twelve processes all at once.  I think we need to try with a more realistic testcase.
(In reply to Justin Lebar [:jlebar] from comment #35)
> But the testcase we've been using is highly unrealistic; it launches
> something like twelve processes all at once.  I think we need to try with a
> more realistic testcase.

Can we propose a more realistic testcase to partners?
> Can we propose a more realistic testcase to partners?

Absolutely; as I said, the next step is to try a more realistic testcase with the patches we have and see what happens.
This issue reproduces on the Inari device. 

Build ID: 20130426070207
Environmental  Variables:
Kernel Date: Feb 21
Gecko: http://hg.mozilla.org/releases/mozilla-b2g18_v1_0_1/rev/54285d67454b
Gaia: c47ef39de04e634d847cc86b6acc616fabce69eb

Incoming calls take a long time display, sometimes they do not show up at all.
This issue does not repro on the Ikura device with 4/26 partner build:

Gaia   a3176dbb57dd73fe99397392bf1f8a98a6bf66dd
BuildID 20130426165758
Version 18.0
With all of the patches I've landed in dependent bugs, I haven't been able to cause my Hamachi device running b2g18 (not the 1.0.1 branch) to miss a call.  Calls do take a few extra seconds to get to the phone, but I always get at least two rings (if I'm not panning and zooming at nytimes, I get three rings).

There are three things we need to figure out here.

1. Whether we can make the CPU so busy, we miss a call.
2. If we can't receive a call, exactly what is the delay we can cause in receiving a call?
3. Is that delay acceptable?  If it's not acceptable, we need to decide which releases, if any, the delay blocks.

It would be helpful if QA could test b2g18_v1.0.1 with a rev that includes all the relevant patches to help determine answers to (1) and (2).  I just landed everything on hg: 

https://hg.mozilla.org/releases/mozilla-b2g18_v1_0_1/rev/41a0854fe988

This bug blocks the next round of certification, which is imminent.  So please consider this testing request to be a high priority.
Keywords: qawanted
Hey Justin - I pulled the latest 1.0.1 build from the directory today and it shows:

Gecko  http://hg.mozilla.org/releases/mozilla-b2g18_v1_0_1/rev/5c1d67e0c242
Gaia   11477c127ae9be5051e4cfbcbf3da1d4150f9967
BuildID 20130502070201
Version 18.0

Does this version have the relevant patches that you note in Comment 40?
The hg debugancestor command is the best way I know of to find out.

If you want to know whether a tip commit T contains a commit Q, then you can run

  $ hg debugancestor T Q   (or Q T; it doesn't matter)

If the result is Q, then the tip contains Q.  If it's anything else, then it doesn't.

> $ hg debugancestor 5c1d67e0c 41a0854fe
> 118820:41a0854fe9884eb6544578b7cdc9442844496611

So, yes.
I performed some testing with the Inari device and the build referenced in Comment 41, which is a Commercial RIL build with AT&T SIM card.

I had the following apps open on the phone while I was testing:

*Music Player - playing one track
*Browser with one tab open, nytimes.com

I basically called the device while I was aggressively pinching and zooming on the nytimes.com site. I got inconsistent results along the way, with it taking sometimes as much as 8 seconds for the call to ring through. Sometimes it took under 2 seconds.

The log I am attaching has the 8 second delay. I also have another log where the dialer screen simply froze.

I was not able to generate a consistent scenario where the calls were actually missed, but the delay still seems to be an issue. 

We are going to test again on the partner build just to see if this happens on that device.
I got the same result as Marcia using the same build in comment 41 on the Inari.
I tested scenarios 1, and 2 a few times.
All of the received calls were within an inconsistent time frame on the Inari device.
Keywords: qawanted
> 2. If we can't receive a call, **exactly what is the delay** we can cause in receiving a call?

> All of the received calls were within an inconsistent time frame on the Inari device.

In the future, please try to be explicit about time frames, if you can.  That will help release drivers make an informed decision here.  Thanks.
Tested on an Inari using partner build with the following build information :

Gaia   a3176dbb57dd73fe99397392bf1f8a98a6bf66dd
BuildID 20130426165758
Version 18.0

STR:

1.Did an incoming call with memory intensive apps running… was using a browser, pinching and zooming the page in the browser.
2.Camera, dialer apps and a game were running in the background.
3.An incoming call to this phone resulted in a blank screen for few seconds (though the call was then displayd on the phone screen)

The time frame was inconsistent as in other comments. I would give the time range as 3 to 5 seconds (the screen being completely blank before the call and caller info could be displayed on the screen)


The logcat shows something in RED like this :

05-02 12:33:40.804: E/GeckoConsole(1994): [JavaScript Warning: "HTTP "Content-Type" of "text/html" is not supported. Load of media resource app://communications.gaiamobile.org/dialer/oncall.html failed." {file: "app://communications.gaiamobile.org/dialer/oncall.html#unlocked" line: 0}]
Depends on: 868521
In my latest tests, I see that merely having the browser open to nytimes.com adds ~1s to the call-received time.  In contrast, having example.com open does not add anything to the call-received time.
Depends on: 868533
Just reviewing UX tags and what UX help, if any, may still be needed here (per the tags). Please let me know.
Recording some of our progress from today:

bz created a whole-system profile which we've been looking at.  (I'll attach the profile in a moment.)  He started profiling when the phone he was calling from started to ring.

What we see in this profile is that there are two long extents of time while the communications app gets under 10% cpu.  During those periods of time, the browser process gets roughly 60% of the CPU, and the B2G main process gets roughly 25% of the CPU.

The browser process spends all of its time reflowing, and the main process spends half of its time compositing and half of its time sending touch events to the browser.

We'd previously thought that what might be happening is that the browser app causes the main process to do a lot of work on its behalf, and indeed the compositing that the main process is doing is a result of the reflows in the browser.  But the compositing happens on its own thread in the main process, so this should not be getting in the way of any events coming from the communications app.  So it seems that if the main process is not responding quickly enough to requests from the child, this is likely a result of touch events.  But since I seem to get calls pretty quickly when I'm touching the phone a lot, I think touch events are unlikely what's causing our problem here.

The outstanding question is: Why is the communications app getting so little CPU?  We want to distinguish between three cases:

1) The communications app is blocked (on I/O, on a sync call to the parent, on page faults, or something else)

2) The communications app is runnable, but is not being scheduled

3) The communications app doesn't /need/ a lot of CPU to start up, but whatever it needs to do to start up isn't happening quickly enough.

Jed is working on getting data out of perf which will take a stack every time a context switch happens.  Hopefully we'll be able to look at when the communications app is de-scheduled and distinguish between (1) and (2).  To understand (3), it would be helpful to have a profile of receiving a call while not under CPU load.
xz -d this file, then open it in http://people.mozilla.com/~bgirard/cleopatra/#.
We've determined that the kernel we're using doesn't give a call stack or a PC for PERF_COUNT_SW_CONTEXT_SWITCHES.  (Note that it does work as expected for PERF_COUNT_SW_CPU_CLOCK.)  A more recent kernel on x86 doesn't have this problem, but we don't yet know if the issue is machine-independent (in which case there is presumably a fix) or machine-dependent (in which case it may still be broken upstream).

Note that PERF_COUNT_SW_PAGE_FAULTS does appear to work correctly, so we could try taking a time profile and a page fault profile and merging them.

Something else we could try: using TableTicker with the patch in bug 810526 (and my modified toolchain).  We should either see a sample every N milliseconds, in which case we'll know what the thread of interest is doing, or we don't, in which case we suspect that it's runnable but the kernel chooses not to run it (which would be supported by also seeing an absence of page faults in that period).
> Something else we could try: using TableTicker with the patch in bug 810526 (and my modified 
> toolchain).  We should either see a sample every N milliseconds, in which case we'll know what the 
> thread of interest is doing, or we don't, in which case we suspect that it's runnable but the kernel 
> chooses not to run it (which would be supported by also seeing an absence of page faults in that 
> period).

This samples by sending signals to each thread, right?  I'd expect us to get samples from both runnable and non-runnable threads, except that the blocked threads would sample at a blocking function, and the unblocked threads wouldn't.  Am I missing something?

Anyway, that would be interesting either way...
> 3) The communications app doesn't /need/ a lot of CPU to start up, but whatever it needs to do to 
> start up isn't happening quickly enough.

I guess another way of putting this is, the process of loading the communications app may not be CPU bound.  If this was the case, I guess I'd expect to see a lot of idle CPU time during the app's startup.  But that would be pretty weird...
From my reading of the kernel sources I don't think we're ever going to get stacks from PERF_COUNT_SW_CONTEXT_SWITCHES (latest version of kernel on ARM still uses empty registers as far as I can tell, http://lxr.free-electrons.com/source/include/linux/perf_event.h#L681 )
If we stop sending and receiving messages to non-critical processes as soon as we get an incoming call, it does seem to work to fix this bug.  It's a crappy user experience, but it's something.  I'll post patches and we can take the discussion into bug 868533.
(In reply to ben turner [:bent] from comment #54)
> From my reading of the kernel sources I don't think we're ever going to get
> stacks from PERF_COUNT_SW_CONTEXT_SWITCHES (latest version of kernel on ARM
> still uses empty registers as far as I can tell,
> http://lxr.free-electrons.com/source/include/linux/perf_event.h#L681 )

I think the problem might just be that ARM doesn't define perf_arch_fetch_caller_regs: http://lxr.free-electrons.com/ident?i=perf_arch_fetch_caller_regs
As before, xz -d and load in http://people.mozilla.com/~bgirard/cleopatra/.

What I find immediately interesting when comparing this profile to the other one is that here, HTML5 parsing takes place in one short burst, whereas in the loaded profile, HTML5 parsing is spread over almost half of the sample!  The communications app doesn't do anything else until it's done parsing HTML.

I don't immediately see why HTML5 parsing should be special.  But perhaps reading the HTML files out of omni.ja is particularly affected by load for some reason.
(In reply to Justin Lebar [:jlebar] from comment #58)
> I don't immediately see why HTML5 parsing should be special.  But perhaps
> reading the HTML files out of omni.ja is particularly affected by load for
> some reason.

It seems to me, from the first (bad) profile, that the HTML5 parser is running script, and that's where we are in at least some of the cases where the Communications app main thread is absent from the CPU for unusually long periods (up to 450ms in one case, unless I'm misreading).  I wonder if there's something the script is doing there that we want to know about.

Also, the first sample after the 450ms gap is constructing an XPCAutoRequest and the last one before it is in nsGlobalWindow::AddEventListener.  I don't know if that's of interest.
I have a STR that never shows the incoming call on b2g18:
Go to marketplace and install the timeout app: TO Barcelona
During the installation call the phone.
I tried it 4 times and it showed the missing call about 5 sec after the network ended the call.
Here's a patch which makes the Mozilla RIL print something when it notices that we receive a call.  I /think/ this is pretty early in the lifecycle.

diff --git a/dom/system/gonk/ril_worker.js b/dom/system/gonk/ril_worker.js
index 0a1d395..b60537c 100644
--- a/dom/system/gonk/ril_worker.js
+++ b/dom/system/gonk/ril_worker.js
@@ -4351,16 +4351,17 @@ let RIL = {
         currentCall.started = new Date().getTime();
       }
       currentCall.state = newCall.state;
       this._handleChangedCallState(currentCall);
     }
 
     // Go through any remaining calls that are new to us.
     for each (let newCall in newCalls) {
+      dump("XXX ril_worker::_process_calls() got new call.\n");
       if (newCall.isVoice) {
         // Format international numbers appropriately.
         if (newCall.number &&
             newCall.toa == TOA_INTERNATIONAL &&
             newCall.number[0] != "+") {
           newCall.number = "+" + newCall.number;
         }
         if (newCall.state == CALL_STATE_INCOMING) {
And here's a page which reflows a lot, in an attempt to automatically replicate the nytimes testcase.  http://bit.ly/jlflow
(In reply to Gregor Wagner [:gwagner] from comment #60)
> I have a STR that never shows the incoming call on b2g18:
> Go to marketplace and install the timeout app: TO Barcelona
> During the installation call the phone.
> I tried it 4 times and it showed the missing call about 5 sec after the
> network ended the call.

FWIW I can't reproduce this locally on my hamachi.  I'll have a look at the profile and see if it's informative.
Gregor, when did you start recording this profile?

One thing I see is us spending a ton of CPU doing some animations in the homescreen process.  Was the homescreen even showing while you recorded this profile?
(In reply to Justin Lebar [:jlebar] from comment #64)
> Gregor, when did you start recording this profile?
> 
> One thing I see is us spending a ton of CPU doing some animations in the
> homescreen process.  Was the homescreen even showing while you recorded this
> profile?

As said on IRC, we animate the TO Barcelona icon during installation.
Comment on attachment 740801 [details] [diff] [review]
Part 1: Add an additional lowCPUPriority argument to hal::SetProcessPriority.

I'm going to move these patches into a separate bug.
Attachment #740801 - Attachment is obsolete: true
Attachment #740802 - Attachment is obsolete: true
Attachment #740803 - Attachment is obsolete: true
Depends on: 870181
Whiteboard: [FFOS_perf], u=user c=dialer s=ux-most-wanted, IOT, Chile, Ikura, khepara_43442 [madrid] → [status: still investigating, needs work and deps to land] [FFOS_perf], u=user c=dialer s=ux-most-wanted, IOT, Chile, Ikura, khepara_43442 [madrid]
This patch also makes it an error to call hal::SetProcessPriority from a child process.
Attachment #747237 - Flags: review?(bent.mozilla)
Comment on attachment 747237 [details] [diff] [review]
Part 1: Add an additional LowCPUPriority argument to hal::SetProcessPriority.

Sorry, wrong bug.
Attachment #747237 - Attachment is obsolete: true
Attachment #747237 - Flags: review?(bent.mozilla)
Attachment #747238 - Attachment is obsolete: true
Attachment #747238 - Flags: review?(bent.mozilla)
Attachment #747239 - Attachment is obsolete: true
Attachment #747239 - Flags: review?(bent.mozilla)
Depends on: 870203
Depends on: 870598
I think we may have fixed this bug with all of the code that we've landed in the past few days in blocking bugs.

I'd love some QA help here verifying this fix, particularly if you can be creative with your testcases.

In particular, today is the tef+ target milestone, so if you guys are able to test /today/, that gives me a much better (that is, nonzero) chance of addressing before the deadline any new issues you discover.

RyanVM is going to do a birch --> b2g18 merge soon, and I'll give you an hg changeset that you can test once he does that.  But you will probably need to spin your own build.
Depends on: 870480
(In reply to Justin Lebar [:jlebar] from comment #71)
> I think we may have fixed this bug with all of the code that we've landed in
> the past few days in blocking bugs.
> 
> I'd love some QA help here verifying this fix, particularly if you can be
> creative with your testcases.
> 
> In particular, today is the tef+ target milestone, so if you guys are able
> to test /today/, 

Tony or Al, is this something you can help? Thanks.

that gives me a much better (that is, nonzero) chance of
> addressing before the deadline any new issues you discover.
> 
> RyanVM is going to do a birch --> b2g18 merge soon, and I'll give you an hg
> changeset that you can test once he does that.  But you will probably need
> to spin your own build.
Flags: needinfo?(tchung)
Flags: needinfo?(atsai)
Let's avoid using needinfo here - use qawanted for QA support.
Flags: needinfo?(tchung)
Flags: needinfo?(atsai)
Keywords: qawanted
Status: UNCONFIRMED → NEW
Ever confirmed: true
Tested 10 incoming calls to Inari while zooming in and out on nytimes.com. 

RSSI bars was only 1.

Average time it took to receive incoming call on Inari from the time the originating phone started dialing out was 4.3 seconds.

Inari device build info:
Commercial RIL
Inari Build ID: 20130513070209
Gecko: http://hg.mozilla.org/releases/mozilla-b2g18_v1_0_1/rev/6e1dbf036d34
Gaia: 54dfb62735c81152eef74eb1768e2bf621c0a258

Removing QAWanted keyword.
Keywords: qawanted
Thanks for testing and reporting a number, mlevin.  But how long does it take to receive a call when we're not under load?  It's very hard for me to understand what 4.3s means without this information.

We also need a lot more QA than this before we can call this bug fixed.  This bug calls for creative QA, as Marcia started doing in comment 43 (by testing with music playing in the background).

Please come up with creative ways to stress the phone and try to break this.
Keywords: qawanted
> RSSI bars was only 1.

Also, what does this mean?
(In reply to Justin Lebar [:jlebar] from comment #76)
> > RSSI bars was only 1.
> 
> Also, what does this mean?

RSSI = Received Signal Strength Indicator. It is the cellular bars of reception which appear on the status bar of cell phones. It can range from 0 to 5 usually.
Whiteboard: [status: still investigating, needs work and deps to land] [FFOS_perf], u=user c=dialer s=ux-most-wanted, IOT, Chile, Ikura, khepara_43442 [madrid] → [status: still investigating, needs work and deps to land] [FFOS_perf], IOT, Chile, Ikura, khepara_43442 [madrid], ux-tracking
QA has started looking at these builds and should have some data shortly. I filed Bug 872171 as a separate bug, mainly because I can reproduce it consistently. We will make sure to note what apps are running, timing, and logcat information when we report back.
(In reply to Justin Lebar [:jlebar] from comment #75)
> Thanks for testing and reporting a number, mlevin.  But how long does it
> take to receive a call when we're not under load?  It's very hard for me to
> understand what 4.3s means without this information.
> 
> We also need a lot more QA than this before we can call this bug fixed. 
> This bug calls for creative QA, as Marcia started doing in comment 43 (by
> testing with music playing in the background).
> 
> Please come up with creative ways to stress the phone and try to break this.

Re-tested this on Inari with the following software:

Commercial RIL
Inari Build ID: 20130514070202
Gecko: http://hg.mozilla.org/releases/mozilla-b2g18_v1_0_1/rev/08196b630a4a
Gaia: 57188a05fe5b669fc98bfcc18a49c9b3effdac06

Please see attached Excel document titled: "Received Call Times to Inari.xlsx"
(See comment 79)

Average time to receive incoming call on non-stressed Inari: 8.28 seconds
Average time to receive incoming call on stressed Inari: 12.65 seconds

Please refer to the aforementioned document for the complete test details.
(See commment 79)
Whiteboard: [status: still investigating, needs work and deps to land] [FFOS_perf], IOT, Chile, Ikura, khepara_43442 [madrid], ux-tracking → [target:resolved already?] [FFOS_perf], IOT, Chile, Ikura, khepara_43442 [madrid], ux-tracking
Whiteboard: [target:resolved already?] [FFOS_perf], IOT, Chile, Ikura, khepara_43442 [madrid], ux-tracking → [target: likely done, but waiting on QA results and investigation thereof] [FFOS_perf], IOT, Chile, Ikura, khepara_43442 [madrid], ux-tracking
> Stressed (wifi on, wifi hotspot on, cellular data & roaming on, gps & Bluetooth on, active Internet 
> page zooming on nytimes.com, Gallery, Marketplace, Here Maps, Settings, Calendar, Clock, Music, 
> Video, E-mail, Contacts, SMS apps all running in the background, music file playing).

The Inari has 256mb of RAM.  On my Hamachi (also 256mb), I can have about two apps running in the background, plus the music app playing music.  My phone definitely cannot run all of these apps at the same time, and I strongly suspect yours can't either.

Reproducing this may depend on exactly which apps are still alive at the point when you make the call, so that information would be helpful.  Long-press the home button to see, or run adb shell b2g-ps.
fwiw I'm having trouble testing with audio playing in the bg due to bug 872655.
(In reply to Justin Lebar [:jlebar] from comment #81)
> > Stressed (wifi on, wifi hotspot on, cellular data & roaming on, gps & Bluetooth on, active Internet 
> > page zooming on nytimes.com, Gallery, Marketplace, Here Maps, Settings, Calendar, Clock, Music, 
> > Video, E-mail, Contacts, SMS apps all running in the background, music file playing).
> 
> The Inari has 256mb of RAM.  On my Hamachi (also 256mb), I can have about
> two apps running in the background, plus the music app playing music.  My
> phone definitely cannot run all of these apps at the same time, and I
> strongly suspect yours can't either.
> 
> Reproducing this may depend on exactly which apps are still alive at the
> point when you make the call, so that information would be helpful. 
> Long-press the home button to see, or run adb shell b2g-ps.

Re-tested today on same Inari build as comment 80.
Commercial RIL
Inari Build ID: 20130514070202
Gecko: http://hg.mozilla.org/releases/mozilla-b2g18_v1_0_1/rev/08196b630a4a
Gaia: 57188a05fe5b669fc98bfcc18a49c9b3effdac06

Unstressed (wifi off, celluar data roaming & connection off, gps & bluetooth off, no music streaming, display screen set to "never").
10 calls.
Average time for call to be received by Inari: 8.5 seconds.

Stressed (wifi on, cellular data & roaming on, gps & Bluetooth on, active Internet page zooming on nytimes.com, background music file playing).
10 Calls.
Average time for call to be received by Inari: 10.9

RSSI Inari: 2 bars
RSSI Unagi: 4 bars
In the excel spreadsheet from comment 79, the first call when we were under load took 18.6s, while the other calls took 10-12s each.

Is the lack of that outlier why the average is lower here?
(In reply to Justin Lebar [:jlebar] from comment #84)
> In the excel spreadsheet from comment 79, the first call when we were under
> load took 18.6s, while the other calls took 10-12s each.
> 
> Is the lack of that outlier why the average is lower here?

It could very well be. 

Here is the data for the stressed calls on the 2nd run (seconds):
10.9
11.2
10.5
10.9
10.7
10.4
11.6
11.0
10.7
11.4
I think we need to be focusing on the 18-second outlier.  If we can reproduce that somewhat consistently, that would be a big deal.  A three-second delay is much harder to worry about.
bz and I spent a few hours trying to reproduce this long delay under various load conditions, and we couldn't get a delay more than two or three seconds longer than what we see when the phone is not loaded.

That's consistent with all but one of the data points in mlevin's testing.  It's possible that the outlier is significant -- perhaps it corresponds to the phone starting up the communications app -- and perhaps it is insignificant -- perhaps it corresponds to a momentary blip in the network.  It's hard to say without more testing.
Sreenidhi and I have been continuing to test this and are tracking our results here: https://docs.google.com/spreadsheet/ccc?key=0Al4xZaJkos_adHdVLXg1bjRMTS01MzZkSzhiM2NVUHc#gid=0. We are using the build from today which seemed to me to bit a better than the one I was testing with yesterday.

During my testing today, the only time I hit a real show stopper is when I was loading nfl.com, aggressively panning and zooming while an inbound call was coming in -  I got a slow script warning which paralyzed the phone, causing an entire manual phone reboot. I also did see at least one case where the call crapped out while I was aggressively panning zooming in the browser.

One thing I consistently noticed today is that while I have the headphones in my ear and the music is playing, there seems to be a bit of delay between the time I stop hearing the audio play in the music player and the time I actually hear the phone start ringing. In other words, it is far from being an immediate transition.
Marcia -- these numbers are meaningless, aren't they?  A lot of the time that you are recording is time spent in the cell network making the actual phone call.  This 'cell connection time' is variable and isn't something we control.

I suppose that you don't have a better way of measuring this time, do you?  Basically what I want to know is how long does it take between the time the modem knows that there is a phone call and the time that it takes for us to show the user the phone UI.

bz, jlebar, do we have that sort of instrumentation?
> Marcia -- these numbers are meaningless, aren't they?

They're not meaningless; we just need a baseline to compare them to.

> bz, jlebar, do we have that sort of instrumentation?

No.  I've asked, and nobody has known how to do this properly.

But what you're asking for also assumes that nothing B2G does affects how long it takes before the modem knows that there's an incoming call.  I don't know if that's true.

Anyway I don't think we can instrument the modem itself.  We can instrument B2G, but certainly what we do in B2G can affect how long it takes before B2G notices that there's an incoming call.
> bz, jlebar, do we have that sort of instrumentation?

This patch prints to the logcat log when the RIL sees an incoming call, or so we think.

In my measurements, this print happens at exactly the same time (within my perception limits) as the first "brrrring" is heard back from the network on the originating phone, so I believe it fairly accurately reflects the "RIL has a call coming in" event...  It's worth double checking whether the two correlate that well for others.

Basically, what I've been doing is running:

  adb logcat | grep 'ril_worker::_process_calls'

then doing whatever on the phone while at the same time dialing it from another phone, and paying attention to three times:

1)  When the phone that originates the call gets a "brrring" from the network.
2)  When the log line process_calls line appears in logcat.
3)  When the target phone rings.

For me, #1 and #2 always happen at the same time, as I said.  I've been focused on finding and debugging situations in which there is a delay between #2 and #3.
> This patch prints to the logcat log when the RIL sees an incoming call, or so we think.

Doesn't this printf happen /after/ the process priority manager sets the communications app to high priority?  I can't actually see what this patch is, because bugzilla is acting up in a strange way.
> /after/ the process priority manager sets the communications app to high priority?

Hmmm.  Yes, I think we had that going on.  Again, the key part for me was the correlation between this printf and the caller getting the "phone is rining" signal.
Boris: When I try to grep for 'ril_worker::_process_calls', I am not getting that generated in the logs on my builds - we are testing Commercial RIL builds from the pvt directory - is there something else I need to do on my end to see this? Or will your patch address that (maybe we can get a tryserver build with that patch?)


(In reply to Boris Zbarsky (:bz) from comment #91)
[snip]
If we're going to do a special build here, we want that patch, the patch in bug 870203, and we also want ENABLE_LOGGING uncommented in ProcessPriorityManager.cpp.

bz's patch won't help with the commercial RIL, although the commercial RIL has its own log messages that give some indication of when it receives a call.  Please remember to use logcat -v time, so we can see when things happen.
If you are using commercial RIL, you can collect radio logs using the following command.

 adb logcat -b radio -v time 

 The radio logs should show the following two messages to specify that there is an incoming call.
  
 UNSOL_CALL_RING <- When network tells the phone to start ringing
 GET_CURRENT_CALLS (with state as INCOMING) <- When network sends an incoming call
These two events usually are within msecs apart.
I'll add another data-point to the tests you've been doing in the hope it sheds some light on the very long outlier you're seeing from time to time. When testing under process buster stress load I used to keep 'top' open just to get the feeling of what was going on. From time to time I would see a much longer delay than usual and for some reason the 'mmcqd' process would show up in top consuming ~20-25% CPU for a few seconds and then quieting down again.

I don't know if this is the cause for the outliers and I found it a bit strange too because AFAIR it's a kernel-thread that deals with the SD card but I remember seeing it more than once in my testing (always under process-buster stress) which got me suspicious about it.
> During my testing today, the only time I hit a real show stopper is when I was loading nfl.com, 
> aggressively panning and zooming while an inbound call was coming in -  I got a slow script warning 
> which paralyzed the phone, causing an entire manual phone reboot.

This seems like the worst outstanding issue at the moment, so I'm going to see if I can reproduce this, unless someone thinks I should be looking into something else.

The fact that we get a slow-script warning doesn't sound like a bug, however; that's exactly what should happen if we need to give CPU to another process.

The fact that the slow-script warning freezes the phone may not be a Gecko bug.  It certainly doesn't sound like a process-priority bug.  Anyway, we'll see if I can see.
I managed to get the main process to segfault just by panning and zooming on nfl.com and then placing a call.  Unfortunately, this is all I got:

05-16 17:38:44.339 F/libc    (  144): Fatal signal 11 (SIGSEGV) at 0x0000003c (code=1)

I'm starting to believe that this bug is fixed as well as we're going to fix it, relative to the overall quality level of our product at the moment.
Depends on: 873284
Hey Anshul, 

I am testing on an Inari, with Commercial RIL (RIL 107) for the bug in the subject. It is using the following build info:

Gecko  http://hg.mozilla.org/releases/mozilla-b2g18_v1_0_1/rev/0b6bcb1f4175
Gaia   9648799c2e45917ff150fa9eef8aeac79a9ac008
BuildID 20130515070203
Version 18.0


Comment 96 advises to use adb -b radio -v time to know about the time when the network tells the phone to start ringing and the time when the network sends an incoming call (UNSOL_CALL_RING and GET_CURRENT_CALLS). 

When I looked at the log, I was surprised to find that there were multiple entries for the GET_CURRENT_CALLS message, say : 

15:27:22.421
15:54:35.114
12:29:13.882
12:32:42.206 

and these timestamps are repetitive all over the log in a cycle. The same is the case for UNSOL_CALL_RING. So, to compute the difference between the UNSOL_CALL_RING and GET_CURRENT_CALLS, which timestamp should be considered?
Flags: needinfo?(anshulj)
Attached the corresponding logcat, by using the adb -b radio -v time command
(In reply to Justin Lebar [:jlebar] from comment #99)
> I managed to get the main process to segfault just by panning and zooming on
> nfl.com and then placing a call.  Unfortunately, this is all I got:
> 
> 05-16 17:38:44.339 F/libc    (  144): Fatal signal 11 (SIGSEGV) at
> 0x0000003c (code=1)
> 
> I'm starting to believe that this bug is fixed as well as we're going to fix
> it, relative to the overall quality level of our product at the moment.

We (tef) believe that what you have achieved is enough for 1.0.1. Can we close this one and open a new one for follow-up bugs? Alternatively we could move this to leo+.
Flags: needinfo?(justin.lebar+bug)
Given the testing thus far at https://mozqa.etherpad.mozilla.org/Testing-Bug847592, we haven't run into any critical issues in the given user scenarios. Let's reopen when a new blocking scenario is found (if any).
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Keywords: qawanted
(In reply to Alex Keybl [:akeybl] from comment #103)
> Given the testing thus far at
> https://mozqa.etherpad.mozilla.org/Testing-Bug847592, we haven't run into
> any critical issues in the given user scenarios. Let's reopen when a new
> blocking scenario is found (if any).

Given the large array of comments, I'd instead suggest that if we find a new blocking scenario, then let's file new bugs. That way, we don't turn this bug into an endless bug that never ends.
(In reply to Jason Smith [:jsmith] from comment #104)
> Given the large array of comments, I'd instead suggest that if we find a new
> blocking scenario, then let's file new bugs. That way, we don't turn this
> bug into an endless bug that never ends.

Yes, please.

Thank you so much for your hard work and patience here, everyone.
Flags: needinfo?(justin.lebar+bug)
Flags: needinfo?(anshulj)
Keywords: perf
Whiteboard: [target: likely done, but waiting on QA results and investigation thereof] [FFOS_perf], IOT, Chile, Ikura, khepara_43442 [madrid], ux-tracking → [target: likely done, but waiting on QA results and investigation thereof] [c= p= s=2013.05.17 u=], IOT, Chile, Ikura, khepara_43442 [madrid], ux-tracking
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: