Closed Bug 880390 Opened 11 years ago Closed 11 years ago

Never-ending 'callschanged' events while SIM card not present

Categories

(Core :: DOM: Device Interfaces, defect, P1)

ARM
Gonk (Firefox OS)
defect

Tracking


RESOLVED FIXED
blocking-b2g leo+
Tracking Status
b2g18 --- fixed

People

(Reporter: bkelly, Assigned: bkelly)

References

Details

(Keywords: perf, regression, Whiteboard: c= p=3 s=2013.06.14 ,)

Attachments

(1 file, 1 obsolete file)

Looking at the following datazilla page, there appears to have been a major load time regression on or around June 4, 2013:

  http://mzl.la/15OPjCH

All apps except the browser increased from around 1 second up to around 3 or 4 seconds.

We need to investigate to see if this is a real regression or a change in the test, etc.
ouch,
Severity: normal → critical
Keywords: perf
Priority: -- → P1
Whiteboard: c= ,
Initial feedback is that there is an issue with the marionette automation infrastructure.  From what I understand Dave Hunt is actively working on it.  Dave, assigning to you if you don't mind.
Assignee: nobody → dave.hunt
Status: NEW → ASSIGNED
Summary: [gaia] general app load time regression → [gaia] marionette automation showing general app load time regression
My guess was http://qa-selenium.mv.mozilla.com:8080/view/B2G%20Perf/job/b2g.unagi.gaia.master.perf.fps/ -- you'll note that we started failing that suite here:

Failed > Console Output  #630 	Jun 4, 2013 3:08:12 AM	 628KB
Success > Console Output  #629 	Jun 4, 2013 12:49:08 AM	 1MB

Of course, I really shouldn't weigh in without Dave; I know that he's working on fixing a regression due to "a) media-file paths (in master/v1-train) and b) databases changing, both of which affected b2gpopulate"
I don't think the issues are related here. The run immediately before this jump is [1] and the first run showing the increase is [2]

[1] http://qa-selenium.mv.mozilla.com:8080/view/B2G%20Perf/job/b2g.unagi.gaia.master.perf/1241/
[2] http://qa-selenium.mv.mozilla.com:8080/view/B2G%20Perf/job/b2g.unagi.gaia.master.perf/1247/

I see 8 gaia changes between those runs, and three gecko changes. I can rerun those locally to see if I can reproduce the spike.

For information, the mentioned infrastructure issues are related to recent changes in the DeviceStorage API and the databases used for reference workloads. These issues have pretty much now been resolved, so if it was affecting the performance I would expect it to have dropped back down.
(In reply to Stephen Donner [:stephend] from comment #3)
> My guess was
> http://qa-selenium.mv.mozilla.com:8080/view/B2G%20Perf/job/b2g.unagi.gaia.
> master.perf.fps/ -- you'll note that we started failing that suite here:

This is the FPS testrun, and not related to the app load time tests.
I'm having issues with my unagi, so I've triggered two builds on the CI. They will take a few hours to run, and won't report to DataZilla but the regression should be obvious in the console output.

Last good: http://qa-selenium.mv.mozilla.com:8080/view/B2G%20Perf/job/b2g.unagi.gaia.master.perf.good/

First bad: http://qa-selenium.mv.mozilla.com:8080/view/B2G%20Perf/job/b2g.unagi.gaia.master.perf.good/
This does not appear to be related to the automation framework. Here are the average load times (in ms) from two tests using the build immediately before the increase and the first build demonstrating the increase. These were on the same device, and using the same versions of b2gpopulate and b2gperf:

Good (gecko:3a432e057b4a0173ad83046cae7668da6c62303b, gaia:4de4354e3a99f151a834743c7b03a041ac8db12f):
Phone: 965
Contacts: 871
Messages: 809
Settings: 1105
Gallery: 662
Video: 707
Music: 759
Camera: 904
Email: 1283
Calendar: 973
Browser: 851
Clock: 1132
FM Radio: 669
Usage: 1277

Bad (gecko:83a3e05401785a7c51b2bae088826a60f01cdbad, gaia:5534304aee934055f08f21ce93261ba2a596516a):
Phone: 3433
Contacts: 3256
Messages: 3317
Settings: 3772
Gallery: 2828
Video: 3009
Music: 3082
Camera: 2919
Email: 3906
Calendar: 3565
Browser: 881
Clock: 3872
FM Radio: 2900
Usage: 5541

Pushlogs: 
Gecko: http://hg.mozilla.org/releases/mozilla-b2g18/pushloghtml?fromchange=4785b1353fd7&tochange=997cdbf5d012
Gaia: https://github.com/mozilla-b2g/gaia/compare/4de4354e3a99f151a834743c7b03a041ac8db12f...5534304aee934055f08f21ce93261ba2a596516a

I can try to reduce this regression range, as there were a couple of nightly builds where we had no test results.
This regression is also present in the Gaia performance results. These only use b2gperf to submit results to DataZilla.

http://mzl.la/ZUJi8r
Dave, thank you!  I'm going to start bisecting on my device here.  If you have a convenient way to test those intermediate nightly builds, that would help as well.
The following is an earlier bad build:

Gecko: 80fd2ae052f7a1ff7d92ff17d06c20b19223017b
Gaia: 0a1cf9f90d33092a845c9c791bee4f7455b602b1

https://hg.mozilla.org/releases/mozilla-b2g18/rev/b6f9eab044dd

Revised pushlogs:
Gecko: http://hg.mozilla.org/releases/mozilla-b2g18/pushloghtml?fromchange=4785b1353fd7&tochange=b6f9eab044dd
Gaia: https://github.com/mozilla-b2g/gaia/compare/4de4354e3a99f151a834743c7b03a041ac8db12f...0a1cf9f90d33092a845c9c791bee4f7455b602b1

I have one more build to run, and will update when I have the results.
I can reproduce the delay locally on my Buri.  It appears to be related to the gecko revision.

With mozilla-b2g18 at 119502:4785b1353fd7 and HEAD of gaia/v1-train the slowdown does not seem to occur.
Note: while DataZilla shows the slowdown for "master", Dave indicates that the configuration for that set is really b2g18 + gaia/master.  So that is consistent with the results we are seeing.
Bad:  mozilla-b2g18   119509:efc1e28607d5
jhylands has identified that the slowdown does not occur in v1-train if a SIM card is present.  This is regardless of the Wi-Fi setting, so does not appear related to having a data connection.

Bisecting suggests the problem is somewhere in this 5 part change:

  http://hg.mozilla.org/releases/mozilla-b2g18/rev/4b236b7d48a7
  http://hg.mozilla.org/releases/mozilla-b2g18/rev/dd63ed7329f6
  http://hg.mozilla.org/releases/mozilla-b2g18/rev/c958c525d075
  http://hg.mozilla.org/releases/mozilla-b2g18/rev/d25bd7f89629
  http://hg.mozilla.org/releases/mozilla-b2g18/rev/efc1e28607d5
Running top -t in the adb shell shows the b2g process is spinning:

  PID   TID PR CPU% S     VSS     RSS PCY UID      Thread          Proc
 1712  1712  0  97% R 171816K  65732K  fg root     b2g             /system/b2g/b2g
Adding some debug in, I see this:

  ### ### Telephony::EnumerateCallStateComplete()

And then it spins on this:

  ### ### CallEvent::Create(null)
  ### ### Telephony::NotifyCallsChanged(null)
  ### ### Telephony::DispatchEvent()
  ### ### Telephony::EnumerationAck::Run()
  ### ### CallEvent::Create(null)
  ### ### Telephony::NotifyCallsChanged(null)
  ### ### Telephony::DispatchEvent()
  ### ### Telephony::EnumerationAck::Run()
Component: Gaia → DOM: Device Interfaces
Product: Boot2Gecko → Core
:htsai, :bent, do you have any ideas on this?  We'll keep investigating this afternoon, but thought I would ask in case it was immediately obvious to you.
Flags: needinfo?(htsai)
Flags: needinfo?(bent.mozilla)
Hm... Multiple enumeration requests maybe? If it's the same request that triggers multiple events then we have a problem...
Flags: needinfo?(bent.mozilla)
Ok, thanks to jhylands for debugging here. This was caused (kinda) by bug 823958.

The problem is that system/js/statusbar.js is adding a 'callschanged' listener inside the 'handleEvent' method when it receives a 'callschanged' event. That used to be a no-op, but after bug 823958 this causes a never-ending loop on b2g18.

This should be fixed on trunk because bug 875545 changed the way all this works.

We can fix the platform on b2g18 so that this works but it's a little bit complicated since bug 875545 can't be backported to b2g18. I'd much prefer to fix statusbar.js instead to avoid this problem and then move on.

Any objections to that plan?
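To make the failure mode concrete, here is a minimal, self-contained sketch of the loop described above. All names here (ToyTelephony, buggyStatusBar, drain) are hypothetical stand-ins; the real code lives in Gaia's system/js/statusbar.js and Gecko's Telephony implementation, and the real loop runs through EnumerationAck on the event queue.

```javascript
// Toy model of the b2g18 behavior after bug 823958: every
// addEventListener('callschanged', ...) schedules another enumeration
// ack, which re-fires a 'callschanged' event with a null call.
class ToyTelephony {
  constructor() {
    this._listeners = new Set();
    this._pendingAcks = [];
  }
  addEventListener(type, handler) {
    this._listeners.add(handler); // duplicates dedupe, like a real EventTarget
    // After bug 823958, each addEventListener call schedules another
    // enumeration ack, which dispatches 'callschanged' again.
    this._pendingAcks.push(() =>
      this._dispatch({ type: 'callschanged', call: null }));
  }
  _dispatch(event) {
    for (const h of [...this._listeners]) {
      h.handleEvent(event);
    }
  }
  // Drain pending acks; cap the iterations so this demo halts instead
  // of spinning at 97% CPU like the b2g process in comment 15.
  drain(maxIterations) {
    let n = 0;
    while (this._pendingAcks.length > 0 && n < maxIterations) {
      this._pendingAcks.shift()();
      n += 1;
    }
    return n;
  }
}

const telephony = new ToyTelephony();

const buggyStatusBar = {
  handleEvent(evt) {
    // Buggy pattern: re-adding the listener from inside handleEvent.
    // This was a harmless no-op before bug 823958; afterwards each
    // re-add schedules yet another 'callschanged', so the loop never ends.
    telephony.addEventListener('callschanged', buggyStatusBar);
  }
};

telephony.addEventListener('callschanged', buggyStatusBar);
console.log(telephony.drain(100)); // hits the cap at 100: a never-ending loop
```

Dropping the re-registration from handleEvent (or registering once at init) breaks the cycle, which is essentially what the attached Gaia patch does.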
Blocks: 823958
Summary: [gaia] marionette automation showing general app load time regression → Never-ending 'callschanged' events while SIM card not present
Pointer to Github pull-request
Attachment #759938 - Flags: review?(etienne)
Assignee: dave.hunt → bkelly
I went ahead and made the patch recommended in comment 21 since it was only a couple lines.  This does seem to fix the problem on my device.
Thanks, Ben, for comment 21. Sounds reasonable to me.
Flags: needinfo?(htsai)
blocking-b2g: --- → leo?
Comment on attachment 759938 [details]
Pointer to Github pull request: https://github.com/mozilla-b2g/gaia/pull/10270

Comment on github.

We probably need to add a test for this, but good thing is most of the boilerplate is already there :)
Attachment #759938 - Flags: review?(etienne)
Sure.  Would you prefer a separate test case checking state or would it be adequate to have the mock throw an exception if a duplicate handler is added?

Also, I have a very busy weekend, so I may not get to this before Monday which I think might be the cutoff for leo.
(In reply to Ben Kelly [:bkelly] from comment #26)
> Sure.  Would you prefer a separate test case checking state or would it be
> adequate to have the mock throw an exception if a duplicate handler is added?

I'd say separate test case.
But the quicker the better :)
Comment on attachment 759938 [details]
Pointer to Github pull request: https://github.com/mozilla-b2g/gaia/pull/10270

This pull request got into a bad state when I rebased my new changes in.  Closed it and opened a new PR.
Attachment #759938 - Attachment is obsolete: true
Comment on attachment 760388 [details]
Pointer to Github pull request: https://github.com/mozilla-b2g/gaia/pull/10294

New pull request including unit test.
Attachment #760388 - Flags: review?(etienne)
Attachment #760388 - Flags: feedback?(felash)
Comment on attachment 760388 [details]
Pointer to Github pull request: https://github.com/mozilla-b2g/gaia/pull/10294

Thanks! Ready to go!
Attachment #760388 - Flags: review?(etienne) → review+
Merged to gaia master:  063570622c45d79ba40f8148bb7e89e8c1ff8a46
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
tracking-b2g18: --- → ?
blocking-b2g: leo? → leo+
tracking-b2g18: ? → ---
Keywords: regression
John, can you please assist with the uplift? :)
Flags: needinfo?(jhford)
v1-train: 43e1b73371c7e7e6b9fde90ea2d0d5e7745b9705
Flags: needinfo?(jhford)
Flags: in-moztrap-
Attachment #760388 - Flags: feedback?(felash)
Whiteboard: c= , → c= p=3 s=2013.06.14 ,
Just a late thought: why couldn't we just add the listener at init and never remove it?
(In reply to Julien Wajsberg [:julienw] from comment #35)
> Just a late thought: why couldn't we just add the listener at init and
> never remove it?

Seems possible, but it's unclear to me.  I assumed the code had a good reason to avoid the callschanged events when not in emergency-only mode.  Unfortunately, I took the shortest path to fix the performance regression given the looming leo deadlines.