Closed Bug 915220 Opened 11 years ago Closed 11 years ago

FxOS: Clock 3s datazilla cold launch regression on v1-train

Categories

(Firefox OS Graveyard :: Gaia::Clock, defect, P1)

ARM
Gonk (Firefox OS)

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: mlee, Unassigned)

References

Details

(Keywords: perf, regression, Whiteboard: [c=automation p=5 s=2013.10.04 u=][b2gperf_regression])

Attachments

(3 files)

Check out the cool upward staircase pattern in the replicate times.  It seems like each replicate run is leaking some resource that slows down the next run.
Kevin, I need you to take this one. What's causing this?
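For illustration, a minimal sketch (in JavaScript, with made-up numbers and a hypothetical slope helper; nothing here is from the harness) of what the staircase looks like if you check the replicate launch times for an upward trend:

  // Hypothetical helper: least-squares slope of launch time vs. run index.
  function slope(samples) {
    const n = samples.length;
    const xMean = (n - 1) / 2;
    const yMean = samples.reduce((a, b) => a + b, 0) / n;
    let num = 0;
    let den = 0;
    samples.forEach((y, x) => {
      num += (x - xMean) * (y - yMean);
      den += (x - xMean) * (x - xMean);
    });
    return num / den;
  }

  // Made-up replicate times (ms) shaped like the reported staircase.
  const replicates = [1100, 1350, 1620, 1890, 2150];
  console.log(slope(replicates) > 0 ? 'ramping up' : 'stable');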
Assignee: nobody → kgrandon
Status: NEW → ASSIGNED
Component: General → Gaia::Clock
OS: Mac OS X → Gonk (Firefox OS)
Product: Testing → Boot2Gecko
Hardware: x86 → ARM
Blocks: 915068
Reproduced on my phone.  Bisecting gecko now.
Assignee: kgrandon → bkelly
Does that happen only for the clock app?
I wonder if purging the expired weak references on message dispatch is the problem here.
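To make that suspicion concrete, here is a JS sketch of the pattern being described (not Gecko's actual C++ in nsFrameMessageManager; the weakRef handle and receiveMessage shape are assumed for illustration):

  // Sketch: on every dispatch, sweep out listeners whose weak reference
  // has expired. Dispatch cost grows with the number of stale entries
  // accumulated since the last sweep.
  function dispatch(listeners, msg) {
    for (let i = listeners.length - 1; i >= 0; i--) {
      const target = listeners[i].weakRef.deref(); // resolve the weak handle
      if (!target) {
        listeners.splice(i, 1);                    // purge the expired entry
        continue;
      }
      target.receiveMessage(msg);
    }
  }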
(In reply to Fabrice Desré [:fabrice] from comment #6)
> Does that happen only for the clock app?

Datazilla suggests that it is only affecting the clock app, but that could be deceptive.  The automation behind datazilla effectively listens for the mozbrowserloadend event and then kills the app.  The conditions triggering the leak occur during this time window for only the clock app.  It's still possible that other apps hit this condition later in their processing.
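Roughly what that automation amounts to, as a JS sketch (the real harness drives this through Marionette/b2gperf; appURL, report, and closeApp are placeholders):

  // Launch the app in a mozbrowser iframe, time it up to mozbrowserloadend,
  // then kill it immediately -- anything the app does after that window is
  // invisible to the measurement.
  const start = Date.now();
  const frame = document.createElement('iframe');
  frame.setAttribute('mozbrowser', 'true');
  frame.src = appURL;  // placeholder for the app's launch URL
  document.body.appendChild(frame);

  frame.addEventListener('mozbrowserloadend', function onLoadEnd() {
    frame.removeEventListener('mozbrowserloadend', onLoadEnd);
    report(Date.now() - start);  // record the replicate (placeholder)
    closeApp(frame);             // kill the app (placeholder)
  });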
You keep referring to a leak ... what are we leaking?
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #9)
> You keep referring to a leak ... what are we leaking?

Good point.  I don't really know that it's a leak.  I was jumping to that conclusion since the failure mode is that the launch time gets progressively slower on each run.
Can you get this in a profiler?  http://mxr.mozilla.org/mozilla-central/source/content/base/src/nsFrameMessageManager.cpp#764 is the most suspicious bit of that set of patches, IMO.  Alternatively, you could just remove line 764 and see what happens.
Update:

Removing the line suggested in comment 11 unfortunately did not help.

Also, when I tried to run the profiler I ran into some other problems.

First, it looks like this is another case where the regression only occurs when run from b2gperf.  If I manually launch and kill Clock repeatedly, I do not see a ramp-up in launch times.

While on the face of it this seems similar to bug 892965, it feels a bit different.  There we had a semi-constant time penalty due to conflicting with the launch of the pre-allocated process.  Here we have a pretty clear ramp-up in time.  So I think this is probably a separate issue.

The other fun part is that I can't seem to get b2gperf to run with the profiler.  I'm not sure what the issue is here.  I tried pausing b2gperf immediately after it relaunches b2g to enable profiling, but it seems like Marionette fails.  Maybe I can hack b2gperf to enable profiling when it does the relaunch.
There is a nice large CPU spike in this profile.  Unfortunately the symbols don't seem to make a lot of sense to me.  This is for mozilla-b2g18 at e85db8ff0a7c.
Oh, and the way to profile with b2gperf is to use the --no-restart flag.  So:

  b2gperf --delay=5 --no-restart Clock
blocking-b2g: --- → koi?
Regarding the koi? nom, I was unsure about this myself since so far we only see the problem on b2g18.  It has not manifested on mozilla-central.  Hopefully, if we can track down the underlying cause, we can determine whether we are at risk there as well.
Okay, I'll unnom until we confirm if it's going to reproduce on central as well.
blocking-b2g: koi? → ---
This profile shows a fast Clock launch using rev 119884:e85db8ff0a7c.

From talking with BenWa, it sounds like the large "spike" I referred to previously is not necessarily a problem.  It may be perfectly normal for there to be a deep call stack there.

Comparing this profile to attachment 804638 [details], however, does show that some additional top-level work in XPCOMUtils_QueryInterface() is performed in the slow case.

It's unclear to me why this is showing up at the top level of the call stack at this point.  Is it getting run from a finalizer?
Also, in terms of why this might be showing up on b2g18, but not mozilla-central, it appears that more code was landed on m-c.  In particular, bug 889984 attachment 772300 [details] [diff] [review] has some additions to DOMRequestHelper.jsm that are not present in b2g18.

The comments from jlebar suggest that these changes are no longer needed, but perhaps they are masking a problem with the addWeakMessageListener() approach.  I may try backing them out on m-c on the off chance I can reproduce this there.  If I can, the improved profiler may help track down the issue.
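For reference, the consumer side of the addWeakMessageListener() approach looks roughly like this (a simplified sketch of the DOMRequestHelper-style pattern; the message name and Helper type are illustrative, not the module's actual code):

  Components.utils.import("resource://gre/modules/XPCOMUtils.jsm");

  const cpmm = Components.classes["@mozilla.org/childprocessmessagemanager;1"]
                 .getService(Components.interfaces.nsIMessageListenerManager);

  function Helper() {
    // The message manager holds only a weak reference; the listener must
    // QI to nsISupportsWeakReference or registration fails.
    cpmm.addWeakMessageListener("DOMRequest:Response", this);  // name is illustrative
  }

  Helper.prototype = {
    QueryInterface: XPCOMUtils.generateQI([
      Components.interfaces.nsIMessageListener,
      Components.interfaces.nsISupportsWeakReference
    ]),

    receiveMessage: function(msg) {
      // Once nothing else keeps this object alive, GC can reclaim it,
      // leaving the manager an expired entry to purge on a later dispatch.
    }
  };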
Whiteboard: [c= p= s= u=] → [c=automation p= s= u=]
Nom'ing for leo to determine if there is anything we can do at this point.  Should we continue investigating this?

My opinion is that bug 900221 should be backed out on mozilla-b2g18 if we can.  It is clearly causing a regression here, even if we don't understand exactly what it is.  Also, the vendor reported problems in bug 900221 comment 22 where the device screen sometimes would not turn back on from sleep mode.  I believe I have seen this after the clock slowdown as well.

If the mozilla-b2g18 ship has sailed, should this be marked as WONTFIX?

Kyle, what do you think?
blocking-b2g: --- → leo?
Flags: needinfo?(khuey)
Whiteboard: [c=automation p= s= u=] → [c=automation p=5 s= u=]
Unassigning myself until release management can make a call about how to proceed.
Assignee: bkelly → nobody
Status: ASSIGNED → NEW
Preet, if no OEMs are requiring this, and Leo is already shipped, what options do we have here?
Flags: needinfo?(praghunath)
Well, if leo has sailed and this doesn't appear on trunk, then this is WONTFIX.

If leo hasn't sailed, then backing out bug 900221 is an option.  It would be unfortunate to do that without understanding the regression, though: bug 900221 (and bug 889984) fixed some pretty serious memory leaks, so we would be trading one performance problem for another.
Flags: needinfo?(khuey)
Status: NEW → RESOLVED
Closed: 11 years ago
Flags: needinfo?(praghunath)
Resolution: --- → WONTFIX
Whiteboard: [c=automation p=5 s= u=] → [c=automation p=5 s=2013.10.04 u=]
Clearing nom because this has been WONTFIXed.
blocking-b2g: leo? → ---
Whiteboard: [c=automation p=5 s=2013.10.04 u=] → [c=automation p=5 s=2013.10.04 u=][b2gperf_regression]