Closed Bug 924565 Opened 8 years ago Closed 7 years ago

strong observer ref prevents GC of DOMRequestHelper objects

Categories

(Core :: DOM: Core & HTML, defect, P1)

Tracking

RESOLVED FIXED
mozilla28
blocking-b2g koi+
Tracking Status
firefox26 --- wontfix
firefox27 --- wontfix
firefox28 --- fixed
b2g-v1.2 --- fixed

People

(Reporter: bkelly, Assigned: bkelly)

References

Details

(Keywords: perf, regression, Whiteboard: [c=progress p=4 s= u=1.2])

Attachments

(8 files, 5 obsolete files)

58.29 KB, image/png
Details
62.38 KB, image/png
Details
10.22 KB, patch
Details | Diff | Splinter Review
46 bytes, text/x-github-pull-request
davehunt
: review+
Details | Review
13.04 KB, patch
bkelly
: review+
Details | Diff | Splinter Review
1.76 KB, patch
bkelly
: review+
Details | Diff | Splinter Review
6.92 KB, patch
bkelly
: review+
Details | Diff | Splinter Review
13.22 KB, patch
Details | Diff | Splinter Review
Recent results on datazilla show a sequentially increasing launch time for the browser app.  See this link and click on one of the datapoints to view per launch times below:

  https://datazilla.mozilla.org/b2g/?branch=master&device=unagi&range=7&test=cold_load_time&app_list=browser&app=browser&gaia_rev=2ced1b05bec1f554&gecko_rev=fb530eed0484e067

I am able to reproduce this locally on my buri and did some preliminary bisecting.  This is not caused by a change in gaia revision, but does seem to change with gecko revision.

To quantify the problem I took the results from running:

  b2gperf --delay=10 --no-restart Browser

I put them in a spreadsheet and then used LINEST() to perform a linear regression.  Random results should show a line coefficient of around 0, while increasing results should show a positive coefficient.  The results I got:

  Gecko hg rev fe7961bba227:  -0.01 line coefficient (GOOD)
  Gecko hg rev 64b497e6f593:   7.68 line coefficient (BAD)

Further bisecting is needed to narrow the regression range.
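The LINEST() check above amounts to fitting a least-squares line to the per-launch times.  A minimal sketch of that computation (the sample data here is invented for illustration; the real input would be the b2gperf per-launch times):

```javascript
// Least-squares slope over successive launch times, indexed 0..n-1.
function slope(samples) {
  const n = samples.length;
  const meanX = (n - 1) / 2;
  const meanY = samples.reduce((a, b) => a + b, 0) / n;
  let num = 0, den = 0;
  for (let i = 0; i < n; i++) {
    num += (i - meanX) * (samples[i] - meanY);
    den += (i - meanX) * (i - meanX);
  }
  return num / den;
}

// A flat series gives a slope near 0 (GOOD); a steadily increasing
// series gives a clearly positive slope (BAD).
console.log(slope([1000, 1002, 998, 1001, 999]));   // → -0.3
console.log(slope([1000, 1010, 1020, 1030, 1040])); // → 10
```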

This is a core problem, but putting in gaia:browser for now until we identify the underlying cause.
Screenshot of increasing launch times in datazilla since we can't link to specific results.
Here is the spreadsheet I used to calculate the results in comment 0 with the full b2gperf data.

https://docs.google.com/spreadsheet/ccc?key=0Ai4XIgrcTvUQdFZ3STlTeEt5czlpMUtKM2o0TjYyMGc&usp=sharing
Whiteboard: [c=progress p= s= u=]
Note, this appears to affect v1.2 as well:

  https://datazilla.mozilla.org/b2g/?branch=v1.2&device=hamachi&range=7&test=cold_load_time&app_list=browser&app=browser&gaia_rev=85c4af3f48a91878&gecko_rev=72011db807e6a5e4

The attachment is a screenshot.

So whatever caused this must have been uplifted from m-c to m-a.
blocking-b2g: --- → koi?
moving to 1.3? per triage with team.
blocking-b2g: koi? → 1.3?
Ben - Is this a regression from 1.1? If it is, feel free to renom.
Jason, I believe it is a regression, but I was part of the triage that koi-'d this.  The impact seems small in these tests and it's unclear that it will be visible to the user.  I guess I wasn't sure if it was worth blocking on given the other blockers we have.

What do you think?
Flags: needinfo?(jsmith)
How much slower is the launch time over time? That can help us assess whether we need to consider blocking on this or not.
Flags: needinfo?(jsmith)
For Browser it's about 100ms slower at the end of the test.  The big unknown in my mind, though, is what happens if we kept launching Browser.  Would it continue to worsen or stabilize?  I don't know the answer to that at the moment.
100ms is noticeable enough that it's worth finding the regressing patch and potentially backing it out or fixing it. Let me get a second opinion from Dietrich on this, as he's worked a lot on triaging performance regressions as a developer.

Dietrich - What's your take on this? Should we live with this or push for a back out on the regressing patch?
Flags: needinfo?(dietrich)
Ben, can you do a build and test with a backout of bug 915598?
Sure, will do.  Ni? to myself to remember.
Flags: needinfo?(bkelly)
This is definitely something we can't live with. 100ms is huge. Not really a question that a perf regression should be backed out, IMO.
Flags: needinfo?(dietrich)
Putting back to koi? per the discussion above - we need to get the regressing patch backed out that caused this regression.
blocking-b2g: 1.3? → koi?
Initial test with bug 915598 backed out looks good.  I added the data to the spreadsheet in comment 2.

Regression coefficients are:

  Gecko hg 558cf02e6b9b:                   4.55
  Gecko hg 558cf02e6b9b + 915598 revert:  -1.35

I'll see if I can do some more runs to make sure the results are stable.  Just takes a long time, unfortunately.
Flags: needinfo?(bkelly)
Yeah, I'd say this is caused by bug 915598.  I did two more runs with and without the revert.  So, including the run from comment 14, the linear regression coefficients were:

  Gecko hg 558cf02e6b9b:                   4.55, 8.29, 4.85
  Gecko hg 558cf02e6b9b + 915598 revert:  -1.35, 0.35, 0.27

Again, values close to 0 are good and values greater than zero indicate the launch is getting slower with progressive launches.

I got profiles of the b2g process when the problem is occurring.  Profile from early in the run before things get slow:

  http://people.mozilla.org/~bgirard/cleopatra/#report=b84b76f76d3c2c8a348d5221c55e914ea619fba0

Profile from late in the run after things get slow:

  http://people.mozilla.org/~bgirard/cleopatra/#report=8c6d32d41fe6134bd22588075438183e304f81f7

Looks like a lot more cycle collection is going on in the second case.

Unfortunately I didn't have any luck profiling the Browser app itself.  Since b2gperf starts and stops the app so quickly I didn't have time to signal the process and collect the data.  This might be possible to do by changing b2gperf to wait a while before killing the app, but I don't know if that would interfere with the conditions that trigger the problem.

Fabrice, does this help at all?
Blocks: 915598
Flags: needinfo?(fabrice)
Hmm, but if I'm reading the profile right most of that cycle collection is happening after the app is killed.  So that extra work should not be directly contributing to the launch time metric.
Ben, the browser app runs in process for now. So unless you need a tab to load content to expose the issue, profiling the b2g process gives you browser app profiling.

Do we have the same regression in load times with other apps?

I'd like to give us a chance to understand and fix that before backing out bug 915598 though. Fernando, it may be that we don't remove message listeners and then we end up on a slow path in the frame message manager (I tried to fix some of these in bug 920804). If you don't have time I'll investigate that tomorrow.
Flags: needinfo?(fabrice) → needinfo?(ferjmoreno)
(In reply to Fabrice Desré [:fabrice] from comment #17)
> Do we have the same regression in load times with other apps?

From looking at datazilla I have suspected that we might, but it's not as pronounced as for Browser.  That could just be because there is more going on in most apps and it's obscured by the noise.

Looking at the recently added Template app, though, I do see an increasing launch time:

  http://mzl.la/17p5WE3

So yes, I do think this probably is affecting all apps.
Target Milestone: --- → 1.2 C3(Oct25)
(In reply to Fabrice Desré [:fabrice] from comment #17) 
> I'd like to let us a chance to understand and fix that before backing out
> bug 915598 though. Fernando, it may be that we don't remove message
> listeners and then we end up in a slow path in the frame message manager (I
> try to fix some of these in bug 920804). If you don't have time I'll
> investigate that tomorrow.

I did some investigation on this and it seems that for the Browser app (in-process) we are not removing the SystemMessageManager listeners added at [1]:

== 1st run ==
I/GeckoMM (24187): RemoveWeakMessageListener = Listeners count 344 
I/GeckoMM (24187): RemoveWeakMessageListener = Listeners count 343
I/GeckoMM (24083): RemoveMessageListener = Listeners count 971 

== 2nd run ==
I/GeckoMM (24187): RemoveWeakMessageListener = Listeners count 344 
I/GeckoMM (24187): RemoveWeakMessageListener = Listeners count 343 
I/GeckoMM (24083): RemoveMessageListener = Listeners count 973

== 3rd run ==
I/GeckoMM (24187): RemoveWeakMessageListener = Listeners count 344 
I/GeckoMM (24187): RemoveWeakMessageListener = Listeners count 343 
I/GeckoMM (24083): RemoveMessageListener = Listeners count 975

== 4th run ==
I/GeckoMM (24187): RemoveWeakMessageListener = Listeners count 344 
I/GeckoMM (24187): RemoveWeakMessageListener = Listeners count 343 
I/GeckoMM (24083): RemoveMessageListener = Listeners count 977

(24083 is the b2g process and 24187 the Homescreen app)

Which means that we will have a bigger message listener list on the parent side to iterate on each run.
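The cost implied by that growing list can be sketched with a toy message manager (the class, names, and per-run leak counts here are illustrative, not the real nsFrameMessageManager):

```javascript
// Toy message manager: delivery walks the whole listener list, so a
// list that grows by a few leaked entries per app launch makes every
// later delivery a little slower.
class ToyMessageManager {
  constructor() { this.listeners = []; }
  addMessageListener(name, cb) { this.listeners.push({ name, cb }); }
  receiveMessage(name, data) {
    let delivered = 0;
    for (const l of this.listeners) {   // linear scan on every message
      if (l.name === name) { l.cb(data); delivered++; }
    }
    return delivered;
  }
}

const mm = new ToyMessageManager();
// Simulate four launches that each leak two listeners:
for (let run = 0; run < 4; run++) {
  mm.addMessageListener("SystemMessageManager:GetPendingMessages:Return", () => {});
  mm.addMessageListener("Webapps:Launch:Return:OK", () => {});
}
console.log(mm.listeners.length); // → 8 entries to scan on each delivery
```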

For other apps (OOP) we are not explicitly removing six message listeners:

I/GeckoMM ( 2861): RemoveMessageListener = Listeners count 7 
I/GeckoMM ( 2861): Listener -> Activities:RegisterContentTypes
I/GeckoMM ( 2861): Listener -> Activities:UnregisterContentTypes
I/GeckoMM ( 2861): Listener -> Webapps:AddApp
I/GeckoMM ( 2861): Listener -> Webapps:RemoveApp
I/GeckoMM ( 2861): Listener -> Settings:Change:Return:OK
I/GeckoMM ( 2861): RemoveMessageListener Settings:Change:Return:OK 
I/GeckoMM ( 2861): Listener -> SystemMessageManager:GetPendingMessages:Return
I/GeckoMM ( 2861): Listener -> SystemMessageManager:GetPendingMessages:Return

But I guess that's actually not a problem, because the process is gone and the listeners die with it.

We definitely need bug 920804 and bug 918479.

[1] https://mxr.mozilla.org/mozilla-central/source/dom/messages/SystemMessageManager.js#251
Flags: needinfo?(ferjmoreno)
We should remove the SystemMessageManager messages though, no? Why are we keeping them around?
That's what I am trying to figure out.
Fernando,

Assigning this to you since you appear to be working on this. If you're not and can't commit to this for 1.2 please comment. Our option at that point would be to back out bug 915598.

FxOS Perf Triage
Assignee: nobody → ferjmoreno
Status: NEW → ASSIGNED
blocking-b2g: koi? → koi+
Priority: -- → P1
Whiteboard: [c=progress p= s= u=] → [c=progress p= s= u=1.2]
It seems like this code lives in DOM, is that right?
Component: Gaia::Browser → DOM
Product: Boot2Gecko → Core
(In reply to Fabrice Desré [:fabrice] from comment #20)
> We should remove the SystemMessageManager messages though no? Why are we
> keeping them around?

After bug 924971 we remove these messages, but it doesn't help with the perf regression.
Flags: needinfo?(gsvelto)
Fernando, what's your question for Gabriele?

Also, at what point should we back out bug 915598? We can't ship 1.2 with this regression, so we should consider what an alternative to a fix looks like.
(In reply to Dietrich Ayala (:dietrich) from comment #25)
> Fernando, what's your question for Gabriele?
> 

I've been talking to him via IRC. He is going to help me analyze the differences between the profiles in comment 15.

> Also, at what point should we back out bug bug 915598? We can't ship 1.2
> with this regression, so we should consider what an alternative to a fix
> looks like.

I would like to have Gabriele's feedback before backing out anything, if that is possible.
I did a bunch of tests on an actual device and I've taken profiles of the 1st, 10th, 20th and 30th run:

http://people.mozilla.org/~bgirard/cleopatra/#report=22d99a36addf2064e46688fa12b7d91b604ded2e
http://people.mozilla.org/~bgirard/cleopatra/#report=108d98f9c3ad39c447ea7bcf2de7415372e43d9c
http://people.mozilla.org/~bgirard/cleopatra/#report=185351d0d36c689459ea94c6379ecc0f222776d5
http://people.mozilla.org/~bgirard/cleopatra/#report=fb22773d329aa796896bea6d918257d8eca7eaf5

The 10th run appears longer than the first, though the profile is a bit murky, but the 20th and 30th runs are identical to the first. I've tried this a few times and I always end up with the same result, so I get the feeling that the slow-down might be caused by the test harness.
Flags: needinfo?(gsvelto)
Gabriele, were you profiling the browser app itself or the b2g parent process?  I did see some differences in the parent process in the profiles in comment 15.  In particular a lot more cycle collection.

Also, it seems like you're only catching the end of the browser launch and missing the beginning?  I understand this, since it's hard to catch the beginning with a repeated test like b2gperf.  Just wondering if we are missing something there.
Flags: needinfo?(gsvelto)
Thanks Gabriele!

I added a few logs to get the number of listeners that we add to the nsFrameMessageManager and the time to deliver IPC messages and I found out that the number of listeners keeps growing and growing when running b2gperf.

I/GeckoMM (  108): RemoveMessageListener = Listeners count 2570
I/GeckoMM (  108): RemoveMessageListener = Listeners count 2569
I/GeckoMM (  108): RemoveMessageListener = Listeners count 2571
I/GeckoMM (  108): RemoveMessageListener = Listeners count 2570
I/GeckoMM (  108): RemoveMessageListener = Listeners count 2569
I/GeckoMM (  108): RemoveMessageListener = Listeners count 2842
I/GeckoMM (  108): RemoveMessageListener = Listeners count 2841
I/GeckoMM (  108): RemoveMessageListener = Listeners count 2843
I/GeckoMM (  108): RemoveMessageListener = Listeners count 2842
I/GeckoMM (  108): RemoveMessageListener = Listeners count 2841
I/GeckoMM (  108): RemoveMessageListener = Listeners count 3114
I/GeckoMM (  108): RemoveMessageListener = Listeners count 3113
I/GeckoMM (  108): RemoveMessageListener = Listeners count 3115
I/GeckoMM (  108): RemoveMessageListener = Listeners count 3114
I/GeckoMM (  108): RemoveMessageListener = Listeners count 3113
I/GeckoMM (  108): RemoveMessageListener = Listeners count 3386
I/GeckoMM (  108): RemoveMessageListener = Listeners count 3385

But fortunately it stays the same when running manual tests on the device:

I/GeckoMM (  108): RemoveMessageListener = Listeners count 882 
I/GeckoMM (  108): RemoveMessageListener = Listeners count 881 
I/GeckoMM (  108): RemoveMessageListener = Listeners count 880 
I/GeckoMM (  108): RemoveMessageListener = Listeners count 882 
I/GeckoMM (  108): RemoveMessageListener = Listeners count 881 
I/GeckoMM (  108): RemoveMessageListener = Listeners count 880 
I/GeckoMM (  108): RemoveMessageListener = Listeners count 882 
I/GeckoMM (  108): RemoveMessageListener = Listeners count 881 
I/GeckoMM (  108): RemoveMessageListener = Listeners count 880 
I/GeckoMM (  108): RemoveMessageListener = Listeners count 882 
I/GeckoMM (  108): RemoveMessageListener = Listeners count 881 
I/GeckoMM (  108): RemoveMessageListener = Listeners count 880 
I/GeckoMM (  108): RemoveMessageListener = Listeners count 882 
I/GeckoMM (  108): RemoveMessageListener = Listeners count 881 
I/GeckoMM (  108): RemoveMessageListener = Listeners count 880 
I/GeckoMM (  108): RemoveMessageListener = Listeners count 882 
I/GeckoMM (  108): RemoveMessageListener = Listeners count 881 
I/GeckoMM (  108): RemoveMessageListener = Listeners count 880 


I also noticed that Webapps:Launch:Return:OK takes longer with each run via b2gperf:

I/GeckoMM (  108): AsyncMessageToSameProcessParent Webapps:Launch took 98358
I/GeckoMM (  108): AsyncMessageToSameProcessChild Webapps:Launch:Return:OK took 30609
I/GeckoMM (  108): AsyncMessageToSameProcessParent Webapps:Launch took 104278
I/GeckoMM (  108): AsyncMessageToSameProcessChild Webapps:Launch:Return:OK took 32044
I/GeckoMM (  108): AsyncMessageToSameProcessParent Webapps:Launch took 96436
I/GeckoMM (  108): AsyncMessageToSameProcessChild Webapps:Launch:Return:OK took 33509
I/GeckoMM (  108): AsyncMessageToSameProcessParent Webapps:Launch took 97839
I/GeckoMM (  108): AsyncMessageToSameProcessChild Webapps:Launch:Return:OK took 34668
I/GeckoMM (  108): AsyncMessageToSameProcessParent Webapps:Launch took 95764
I/GeckoMM (  108): AsyncMessageToSameProcessChild Webapps:Launch:Return:OK took 35553
I/GeckoMM (  108): AsyncMessageToSameProcessParent Webapps:Launch took 102935
I/GeckoMM (  108): AsyncMessageToSameProcessChild Webapps:Launch:Return:OK took 36895
I/GeckoMM (  108): AsyncMessageToSameProcessParent Webapps:Launch took 95795
I/GeckoMM (  108): AsyncMessageToSameProcessChild Webapps:Launch:Return:OK took 37751
I/GeckoMM (  108): AsyncMessageToSameProcessParent Webapps:Launch took 102784
I/GeckoMM (  108): AsyncMessageToSameProcessChild Webapps:Launch:Return:OK took 39001
I/GeckoMM (  108): AsyncMessageToSameProcessParent Webapps:Launch took 95947
I/GeckoMM (  108): AsyncMessageToSameProcessChild Webapps:Launch:Return:OK took 40008
This is the patch that I used to get the log messages
Attachment #822397 - Attachment description: domrequesthelper → Debug nsFrameMessageManager and DOMRequestHelper
(In reply to Ben Kelly [:bkelly] from comment #28)
> Gabriele, were you profiling the browser app itself or the b2g parent
> process?  I did see some differences in the parent process in the profiles
> in comment 15.  In particular a lot more cycle collection.

I tested only the browser app. Looking at what :ferjm found in comment 29, the extra garbage collections in the main process must have been triggered by the leaked listeners.

> Also, it seems like you're only catching the end of the browser launch and
> missing the beginning?

It looks this way because on the devices the browser app does not start from the pre-launch process but is instead started from scratch.
Flags: needinfo?(gsvelto)
Regarding the listeners in comment 29, does that mean this is an instance of bug 915220 which appeared to have been caused/related to bug 900221?
IMHO this should not be a blocker anymore since it only affects the tests.

(In reply to Ben Kelly [:bkelly] (PTO, Back on Tuesday Oct 29) from comment #32)
> Regarding the listeners in comment 29, does that mean this is an instance of
> bug 915220 which appeared to have been caused/related to bug 900221?

Might be. We'll need to check what's going on with b2gperf.
blocking-b2g: koi+ → koi?
Summary: b2g browser launch times worsen over time → b2g browser launch times worsen over time when running b2gperf tests
(In reply to Gabriele Svelto [:gsvelto] from comment #27)
> The 10th run appears longer than the first though the profile is a bit murky
> but the 20th and 30th run are identical to the first. I've actually tried
> this a few times and I always end up with the same result so I get the
> feeling that the slow-down might be caused by the test-harness.

Gabriele, did you maintain the timing during the test or did you pause things to take measurements?

The profiles I did in comment 15 were with b2gperf running at its normal timing.
Flags: needinfo?(gsvelto)
Fernando, do you have a plausible explanation for why b2gperf would run without causing this issue prior to your commit, but start causing the problem after it?  I'm trying to understand how we can say it's a test harness issue when the test harness didn't change.
Flags: needinfo?(ferjmoreno)
(In reply to Ben Kelly [:bkelly] from comment #34)
> Gabriele, did you maintain the timing during the test or did you pause
> things to take measurements?

I'm not sure I understand your question but I'll describe what I did. For each run I would wait for the phone to quiet down (~10s) then open the browser, wait a few seconds, take the profile, then kill the browser and start again. I did multiple tries of 30 runs each, rebooting the phone in-between. Once I tried 60 runs but the results did not change either.
Flags: needinfo?(gsvelto)
(In reply to Fernando Jiménez Moreno [:ferjm] (FxAccounts Work Week until Friday) from comment #33)
> IMHO this should not be a blocker anymore since it only affects the tests.

Okay.  Feel free to re-nom if you disagree.
blocking-b2g: koi? → ---
Priority: P1 → --
Whiteboard: [c=progress p= s= u=1.2] → [c=progress p= s= u=]
Target Milestone: 1.2 C3(Oct25) → ---
With Fernando's help I think I understand what's happening here now.

It appears that b2gperf calls mozApps.mgmt.getAll() every time it launches an app.  This creates a new WebappsApplication object for each installed app, which in turn registers a bunch of listeners as a DOMRequestHelper.

Previously these listeners were getting GC'd as weak references.

As of bug 915598 the listeners stopped getting GC'd, because each listener object was registered as an observer using a strong reference.

Changing this line to pass true as the third argument fixed the slow down for me:

  http://mxr.mozilla.org/mozilla-central/source/dom/base/DOMRequestHelper.jsm#160
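The strong-vs-weak distinction can be modeled with a toy observer service (ToyObserverService and its collect() step are invented for illustration; the real API is nsIObserverService.addObserver(observer, topic, ownsWeak)):

```javascript
// Toy model of strong vs. weak observer registrations.
class ToyObserverService {
  constructor() { this.observers = []; }
  addObserver(obs, weak) { this.observers.push({ obs, weak }); }
  // Simulated GC pass: weak entries whose observer was released go
  // away; strong entries keep their observer alive regardless.
  collect(released) {
    this.observers = this.observers.filter(
      e => !(e.weak && released.has(e.obs)));
  }
}

const svc = new ToyObserverService();
const helperA = { name: "strong helper" };
const helperB = { name: "weak helper" };
svc.addObserver(helperA, /* weak */ false);
svc.addObserver(helperB, /* weak */ true);

// The page that owned both helpers goes away...
svc.collect(new Set([helperA, helperB]));

// ...but only the weak registration is gone; the strong one leaks.
console.log(svc.observers.map(e => e.obs.name)); // → [ 'strong helper' ]
```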
Flags: needinfo?(ferjmoreno)
Attached patch domrequesthelper-lazy-obs.patch (obsolete) — Splinter Review
This patch changes the Observer reference to the DOMRequestHelper to be weak instead of strong.  This seems to avoid the progressive slowdown issue for me.
Attachment #825458 - Flags: review?(fabrice)
This seems like a legitimate platform issue, so re-nom'ing for koi.
blocking-b2g: --- → koi?
Hmm, dom/apps tests failing:

0:06.94 191 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/apps/tests/test_packaged_app_update.html | Check manifestURL - got "http://test/tests/dom/apps/tests/file_packaged_app.sjs?getManifest=true&appToUpdate", expected "http://test/tests/dom/apps/tests/file_packaged_app.sjs?getManifest=true"
 0:06.94 200 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/apps/tests/test_packaged_app_update.html | Shouldn't get here!
 0:06.94 217 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/apps/tests/test_packaged_app_update.html | Download error NO_DOWNLOAD_AVAILABLE
 0:06.94 218 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/apps/tests/test_packaged_app_update.html | [SimpleTest.finish()] this test already called finish!
 0:06.94 222 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/apps/tests/test_packaged_app_update.html | XHR error loading http://test/tests/dom/apps/tests/file_packaged_app.sjs?setVersion=3: 0 -
 0:06.94 223 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/apps/tests/test_packaged_app_update.html | called finish() multiple times
(In reply to Ben Kelly [:bkelly] from comment #42)
> 0:06.94 191 ERROR TEST-UNEXPECTED-FAIL |
> /tests/dom/apps/tests/test_packaged_app_update.html | Check manifestURL -
> got
> "http://test/tests/dom/apps/tests/file_packaged_app.
> sjs?getManifest=true&appToUpdate", expected
> "http://test/tests/dom/apps/tests/file_packaged_app.sjs?getManifest=true"
>  0:06.94 200 ERROR TEST-UNEXPECTED-FAIL |
> /tests/dom/apps/tests/test_packaged_app_update.html | Shouldn't get here!
>  0:06.94 217 ERROR TEST-UNEXPECTED-FAIL |
> /tests/dom/apps/tests/test_packaged_app_update.html | Download error
> NO_DOWNLOAD_AVAILABLE
>  0:06.94 218 ERROR TEST-UNEXPECTED-FAIL |
> /tests/dom/apps/tests/test_packaged_app_update.html | [SimpleTest.finish()]
> this test already called finish!
>  0:06.94 222 ERROR TEST-UNEXPECTED-FAIL |
> /tests/dom/apps/tests/test_packaged_app_update.html | XHR error loading
> http://test/tests/dom/apps/tests/file_packaged_app.sjs?setVersion=3: 0 -
>  0:06.94 223 ERROR TEST-UNEXPECTED-FAIL |
> /tests/dom/apps/tests/test_packaged_app_update.html | called finish()
> multiple times

It seems the mozApps.mgmt.oninstall handler set at the end of test_packaged_app_update.html is running during the next test suite.  Since the next test suite overwrites the oninstall event target, this implies to me that we are getting a separate management object.  Does this make sense?
It seems WebappsRegistry is getting GC'd now while the underlying WebappsApplicationMgmt persists due to its strong ref listeners.

I'll keep investigating tomorrow.  Please let me know if I am on the wrong track.  Thanks!
Wow! Thanks Ben! :) I totally missed the Observer thing :(

(In reply to Ben Kelly [:bkelly] from comment #44)
> It seems WebappsRegistry is getting GC'd now while the underlying
> WebappsApplicationMgmt persists due to its strong ref listeners.
> 
> I'll keep investigating tomorrow.  Please let me know if I am on the wrong
> track.  Thanks!

It makes sense to add the WebappsRegistry "Webapps:Install:Return:OK" listener as a strong one. We will be defaulting to strong listeners in bug 928389 anyway, as the WebApps object leaks (bug 889984) that caused all this weak-listener work will likely be solved by bug 903291.
Taking this while I track down the test failure.
Assignee: ferjmoreno → bkelly
Whiteboard: [c=progress p= s= u=] → [c=progress p=3 s= u=]
So it turns out the test failures had a different cause.

When we use a strong observer ref, the WebappsApplicationMgmt object gets notified about the inner-window-destroyed event and cleans itself up between the test cases.

With a weak observer ref this notification never occurs.  It turns out this is due to most of the nsDOMRequestHelper child classes overriding QueryInterface but not specifying nsIObserver.  The nsIObserver QI is necessary, or this line will always fail to resolve the weak ref:

  http://mxr.mozilla.org/mozilla-central/source/xpcom/ds/nsObserverList.cpp#72
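A toy model of why the QI list matters (makeQI is invented for illustration; the real code declares interfaces via XPCOMUtils, and nsObserverList QIs the weak reference to nsIObserver before notifying):

```javascript
// Model a QueryInterface that only answers for declared interfaces.
function makeQI(declared) {
  return iface => declared.includes(iface) ? "ok" : null;
}

// nsObserverList resolves the weak ref by QI'ing to nsIObserver; if a
// helper subclass forgot to declare it, the resolution always fails
// and the notification is silently dropped.
const brokenQI = makeQI(["nsISupportsWeakReference"]);
const fixedQI  = makeQI(["nsISupportsWeakReference", "nsIObserver"]);

console.log(brokenQI("nsIObserver")); // → null: observe() never fires
console.log(fixedQI("nsIObserver"));  // → ok: weak observer works
```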

I have some patches that seem to work.  In try now:

  https://tbpl.mozilla.org/?tree=Try&rev=74726cf35102

I also want to write some test cases around this since it seems easy to regress here.  I probably won't have tests ready until next week.
Part 1: Add nsIObserver to all DOMRequestHelper QI statements
Work in progress patch.

Note, I think we should explicitly QueryInterface() to nsIObserver and nsISupportsWeakReference in initDOMRequestHelper() to force a fast fail if they are missing.  The behavior is subtle and non-obvious when they are missing.
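A sketch of that fast-fail idea, as a toy rather than the actual initDOMRequestHelper() change (REQUIRED and the helper objects are invented):

```javascript
// Verify up front that a helper declares the interfaces the
// weak-observer path needs, instead of failing silently later.
const REQUIRED = ["nsIObserver", "nsISupportsWeakReference"];

function initDOMRequestHelper(helper) {
  for (const iface of REQUIRED) {
    if (!helper.interfaces.includes(iface)) {
      throw new Error("helper must QI to " + iface);
    }
  }
}

const goodHelper = { interfaces: ["nsIObserver", "nsISupportsWeakReference"] };
const badHelper  = { interfaces: ["nsISupportsWeakReference"] };

initDOMRequestHelper(goodHelper); // passes
try {
  initDOMRequestHelper(badHelper);
} catch (e) {
  console.log(e.message); // → helper must QI to nsIObserver
}
```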
Attachment #825458 - Attachment is obsolete: true
Attachment #825458 - Flags: review?(fabrice)
Attachment #826097 - Attachment description: domrequesthelper-qi-observer.patch → Part 1: Add nsIObserver to all DOMRequestHelper QI statements
Missed a DOMRequestHelper in the first patch.  The fast fail logic helped me catch this when running b2gperf.
Attachment #826097 - Attachment is obsolete: true
Unfortunately it appears that adding nsIObserver to the QI for WebappsApplication re-introduces the progressive slowdown.

It seems the previous code depended on the observer weak refs getting immediately flushed due to the missing QI to avoid the impact from b2gperf creating so many WebappsApplication objects.

This also explains why bug 915220 only appeared on b2g18 and not m-c.  The patch for b2g18 in bug 900221 only added calls to addWeakMessageListener(), but never enabled the weak observer reference.  Therefore b2g18 suffered from the growing WebappsApplication observer problem caused by calling mozApps.mgmt.getAll() repeatedly.  In contrast, the fix in bug 889984, which landed on m-c, enabled the observer weak refs with the mismatched nsIObserver QI.  So m-c avoided the growing observer problem.

What I don't understand is why we only saw the problem for Clock on v1.1 and now mainly for Browser on v1.2/m-c.  Also, why wasn't this an issue prior to bug 900221 and bug 889984?

Anyway, the question now is what to do at this point?  It seems to me:

1) We need to enable the observer weak ref flag to avoid leaking DOMRequestHelper objects.
2) We need to add the nsIObserver QI to all DOMRequestHelper objects so that the weak observer ref works correctly.
3) We need to change b2gperf to not call mozApps.mgmt.getAll() repeatedly or change mozApps.mgmt.getAll() so that it does not create new objects over and over for the same application.
4) We might want to investigate optimizing the observer notification path.  It seems to copy the array of observer refs twice for every notification.

My current patches handle (1) and (2).  Should I write a new bug for this since it doesn't actually fix the progressive slowdown?  I can talk to Dave Hunt about (3).  I'm not sure if (4) is worth pursuing.

Fabrice, Fernando, what do you think?  Sorry if this bug has gotten a bit messy.
Flags: needinfo?(ferjmoreno)
Flags: needinfo?(fabrice)
We can probably ignore (4) from comment 52.  Looking back at the profile from comment 15:

  http://people.mozilla.org/~bgirard/cleopatra/#report=8c6d32d41fe6134bd22588075438183e304f81f7&search=Observer

We can see that we are indeed spending a ton of time in the observer notification for the inner-window-destroyed event.  The extra time doesn't appear to be from copying the reference arrays around, though.  It's mostly in the execution of nsXPCWrappedJSClass::CallMethod() calling into the DOMRequestHelper.prototype.observe() function.

Interestingly, this profile might also explain why we mainly see the progressive slowdown in Browser, but not other apps.  You can see in the profile that we actually get two window-destroyed events for Browser: one during launch and one later when the app is closed.  The launch event takes about 200ms, while the app-closing event takes about 750ms.  (The latter takes longer because it triggers a GC.  This makes sense given that an app was just closed.)

From previous debugging it seems we only create and destroy this window during launch for Browser.  I assume this is due to the fact that it runs in the parent process instead of OOP.  Note, I also previously commented out the logic that creates a new tab, and this window is still created.

Given that b2gperf takes its launch time measurement before closing the app, it seems likely that the window closing during launch and its observer notifications (the 200ms in this profile) are what's slowing down Browser launch.  So:

(5) We might want to open a bug to investigate why an extra window gets created/destroyed when launching Browser.

It seems most of those 200ms are spent in:

  nsXPCWrappedJSClass::CallMethod
  GetContextFromObjectOrDefault
  ExposeGCThingToActiveJS
Debugging suggests that the additional 'inner-window-destroyed' event fired during launch is due to the 'about:blank' URI getting overwritten with the app URI.  I suppose this is expected, but it would be nice if we could avoid the extra work related to about:blank during launch.
(In reply to Ben Kelly [:bkelly] from comment #52)
> Unfortunately it appears that adding nsIObserver to the QI for
> WebappsApplication re-introduces the progressive slowdown.

So I did some additional testing this morning and it appears this statement is only partially true.  While the progressive slowdown occurs again, the effect is much reduced.

Here are the results of running three "b2gperf --delay=10 Browser" runs before and after the attached patches.  The number is the coefficient of a linear regression.  Lower numbers are better.

  Before patch:  5.9, 7.8, 6.4
  After patch:   4.0, 0.9, 2.9

Also, from profiling I don't see any other obvious issues at work here.  I think we should move forward with this fix, change b2gperf to avoid calling mozApps.mgmt.getAll() repeatedly, and mark this bug resolved.
Attachment #826122 - Flags: review?(fabrice)
Comment on attachment 826102 [details] [diff] [review]
Part 2: Enable weak observer ref and enforce expected QI interfaces

Try from comment 51 looks pretty green.  Re-requesting review on these patches.
Attachment #826102 - Flags: review?(fabrice)
Attachment #826122 - Flags: review?(fabrice)
Comment on attachment 826102 [details] [diff] [review]
Part 2: Enable weak observer ref and enforce expected QI interfaces

Dropping review until I can write some tests.
Attachment #826102 - Flags: review?(fabrice)
The perf impact here seems real (if I'm reading the comments correctly :) so koi+ again.
blocking-b2g: koi? → koi+
This patch adds test cases verifying that the DOMRequestHelper uninit() function is called as expected in these cases:

  - no message listeners, no GC
  - weak message listeners, no GC
  - strong message listeners, no GC
  - strong message listeners, forced GC

And is not called in these cases where we expect it to be garbage collected:

  - no message listeners, forced GC
  - weak message listeners, forced GC

Updated try with all three patches:

  https://tbpl.mozilla.org/?tree=Try&rev=bae1c08bdcc4
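The behavior the test matrix above verifies can be illustrated with a small mock (the helper and notification plumbing below are stand-ins, not the real mochitest or DOMRequestHelper): uninit() should fire when the helper is still reachable at "inner-window-destroyed" time, and a helper held only weakly should already be gone after GC, so uninit is never observed.

```javascript
// Mock helper tracking whether uninit() ran.
let uninitCalled = false;
const helper = {
  innerWindowID: 42,
  uninit() { uninitCalled = true; }
};

// Mirrors the idea of DOMRequestHelper's observe(): only react to the
// destruction of our own inner window, and only if the helper still exists.
function notifyInnerWindowDestroyed(target, destroyedID) {
  if (target && destroyedID === target.innerWindowID) {
    target.uninit();
  }
}

// Strong ref, no GC: the helper is reachable, so uninit fires.
notifyInnerWindowDestroyed(helper, 42);
console.log(uninitCalled); // true

// Weak ref after GC: deref() would yield undefined, so there is nothing to
// uninit (modeled here by passing undefined as the target).
notifyInnerWindowDestroyed(undefined, 42);
```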
Attachment #827492 - Flags: review?(fabrice)
Flags: needinfo?(ferjmoreno)
Flags: needinfo?(fabrice)
Comment on attachment 826122 [details] [diff] [review]
Part 1: Declare nsIObserver QI in DOMRequestHelper child classes.

Add back review flag now that we have tests.
Attachment #826122 - Flags: review?(fabrice)
Comment on attachment 826102 [details] [diff] [review]
Part 2: Enable weak observer ref and enforce expected QI interfaces

Add back review flag now that we have unit tests.
Attachment #826102 - Flags: review?(fabrice)
Priority: -- → P1
Whiteboard: [c=progress p=3 s= u=] → [c=progress p=3 s= u=1.2]
This patch to gaiatest avoids repeated calls to mozApps.mgmt.getAll().  This is desirable because getAll() allocates new Application objects on each call.

Many thanks to :ferjm for identifying the problem and essentially writing this patch!

Dave, should I also make this change in gaia/tests/atoms/gaia_test.js?  That file is slightly different.  I don't know how to test it since I'm not sure what uses it.
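The caching idea can be sketched as follows. This is a hypothetical illustration, not the gaiatest patch itself; the `mgmt` object below is a mock standing in for navigator.mozApps.mgmt, but the point is the same: call getAll() once and reuse the result instead of allocating new Application objects on every call.

```javascript
// Mock of navigator.mozApps.mgmt that counts how many times getAll() runs.
let allocations = 0;
const mgmt = {
  getAll(onsuccess) {
    allocations++;
    // Each real call would allocate fresh Application objects.
    onsuccess({ target: { result: ["app1", "app2"] } });
  }
};

// Cache the first getAll() result and serve it on subsequent requests.
let cachedApps = null;
function getAllApps(callback) {
  if (cachedApps) {
    callback(cachedApps);
    return;
  }
  mgmt.getAll(evt => {
    cachedApps = evt.target.result;
    callback(cachedApps);
  });
}

getAllApps(apps => console.log(apps.length)); // 2, from getAll()
getAllApps(apps => console.log(apps.length)); // 2, served from the cache
console.log(allocations); // 1
```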
Attachment #827591 - Flags: review?(dave.hunt)
Whiteboard: [c=progress p=3 s= u=1.2] → [c=progress p=4 s= u=1.2]
Comment on attachment 827591 [details] [review]
Pull request at https://github.com/mozilla-b2g/gaia/pull/13397

(In reply to Ben Kelly [:bkelly] from comment #62)
> Created attachment 827591 [details] [review]
> Pull request at https://github.com/mozilla-b2g/gaia/pull/13397
> 
> This patch to gaiatest avoids repeated calls to mozApps.mgmt.getAll().  This
> is desirable because getAll() allocates new Application objects on each call.
> 
> Many thanks to :ferjm for identifying the problem and essentially writing
> this patch!
> 
> Dave, should I also make this change in gaia/tests/atoms/gaia_test.js?  That
> file is slightly different.  I don't know how to test it since I'm not sure
> what uses it.

That file is left over from the previous JS integration tests. I don't believe it's used any more and will be merged/removed by bug 909839.

Landed as:
https://github.com/mozilla-b2g/gaia/commit/c1204e00cff95820b902c1b41dee229402c6a156
Attachment #827591 - Flags: review?(dave.hunt) → review+
(In reply to Dave Hunt (:davehunt) from comment #63)
> ...
> 
> Landed as:
> https://github.com/mozilla-b2g/gaia/commit/
> c1204e00cff95820b902c1b41dee229402c6a156

Ben should this bug be RESOLVED FIXED?
Flags: needinfo?(bkelly)
(In reply to Mike Lee [:mlee] from comment #64)
> (In reply to Dave Hunt (:davehunt) from comment #63)
> > ...
> > 
> > Landed as:
> > https://github.com/mozilla-b2g/gaia/commit/
> > c1204e00cff95820b902c1b41dee229402c6a156
> 
> Ben should this bug be RESOLVED FIXED?

No, this needs to stay open until the platform changes get review and land.  The b2gperf change was needed, but masks the very real platform problem.  Without the platform fix we cannot free DOMRequestHelper objects under memory pressure.
Flags: needinfo?(bkelly)
Updating the summary to reflect the current issue that needs to be fixed.
Summary: b2g browser launch times worsen over time when running b2gperf tests → strong observer ref prevents GC of DOMRequestHelper objects
Blocks: 915138
It would be interesting, btw, to know if this could be causing bug 935273 to reproduce.
How does this one relate to Bug 928389?
I don't know about causing bug 935273, but fixing the strong ref here so we can free the DOMRequestHelpers under memory pressure might help that bug.
Attachment #826102 - Flags: feedback?(gene.lian)
Attachment #826102 - Flags: feedback?(ferjmoreno)
Attachment #826122 - Flags: feedback?(gene.lian)
Attachment #826122 - Flags: feedback?(ferjmoreno)
Attachment #827492 - Flags: feedback?(gene.lian)
Attachment #827492 - Flags: feedback?(ferjmoreno)
Attachment #826122 - Flags: feedback?(gene.lian) → feedback+
Comment on attachment 826102 [details] [diff] [review]
Part 2: Enable weak observer ref and enforce expected QI interfaces

Review of attachment 826102 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/base/DOMRequestHelper.jsm
@@ +141,5 @@
>     *    will be added as a weak referred one by default.
>     */
>    initDOMRequestHelper: function(aWindow, aMessages) {
> +    // Query our required interfaces to force a fast fail if they are not
> +    // provided.  These calls will throw if the interface is not available.

Nice! This will force the callers to use it correctly. :)

@@ +161,5 @@
>      }
>  
>      this._destroyed = false;
>  
> +    Services.obs.addObserver(this, "inner-window-destroyed", true);

So you are forcing the "inner-window-destroyed" observer to be added as a strong reference, which means the object won't be GC'ed unless the window is closed.

However, in addMessageListeners(), users can choose how to associate the listeners with the window (strongly or weakly). If the listeners are added weakly, that is in vain because you always call addObserver strongly.

Please correct me if I misunderstood. Thanks!
Attachment #826102 - Flags: feedback?(gene.lian)
Comment on attachment 827492 [details] [diff] [review]
Part 3: Add tests verifying DOMRequestHelper cleanup and weak ref behavior.

Review of attachment 827492 [details] [diff] [review]:
-----------------------------------------------------------------

Temporarily clearing my feedback? until my concern about the part-2 patch is resolved.
Attachment #827492 - Flags: feedback?(gene.lian)
Comment on attachment 826102 [details] [diff] [review]
Part 2: Enable weak observer ref and enforce expected QI interfaces

Review of attachment 826102 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/base/DOMRequestHelper.jsm
@@ +161,5 @@
>      }
>  
>      this._destroyed = false;
>  
> +    Services.obs.addObserver(this, "inner-window-destroyed", true);

Sorry, I think I misunderstood the use of addObserver(...) the other way around; "true" means the observer is added as a weak reference.

So addMessageListeners() can still associate the listeners with the window via a strong reference, even if the "inner-window-destroyed" observer is added as a weak reference (with at least one strong reference, the object is effectively strongly referenced). Adding the listeners as weak references also works, since then all the listeners/observers are weakly referenced.
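The weak-vs-strong semantics discussed here can be illustrated with a minimal mock (this is not the real nsIObserverService; WeakRef stands in for XPCOM's nsIWeakReference): a strong registration keeps the observer alive, while a weak one, the `true` third argument to addObserver, stores only a weak handle and lets GC reclaim the observer once nothing else holds it.

```javascript
// Minimal mock of an observer service distinguishing strong and weak
// registrations. Weak entries hold only a WeakRef, so a collected observer
// is silently skipped at notification time.
class MockObserverService {
  constructor() { this.observers = []; }
  addObserver(observer, topic, ownsWeak = false) {
    this.observers.push({
      topic,
      ref: ownsWeak ? new WeakRef(observer) : observer,
      weak: ownsWeak
    });
  }
  notify(topic) {
    for (const entry of this.observers) {
      if (entry.topic !== topic) continue;
      const obs = entry.weak ? entry.ref.deref() : entry.ref;
      if (obs) obs.observe(topic); // weak entries may already be gone
    }
  }
}

const svc = new MockObserverService();
const seen = [];
svc.addObserver({ observe: t => seen.push("strong:" + t) },
                "inner-window-destroyed", false);
svc.addObserver({ observe: t => seen.push("weak:" + t) },
                "inner-window-destroyed", /* ownsWeak */ true);
svc.notify("inner-window-destroyed");
console.log(seen); // both fire here, since neither observer has been GC'ed yet
```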
Attachment #826102 - Flags: feedback+
(In reply to Gene Lian [:gene] (needinfo? encouraged) from comment #72)
> > +    Services.obs.addObserver(this, "inner-window-destroyed", true);
> 
> Sorry I think I misunderstand the use of addObserver(...) the other way
> around, where "true" means adds the observer as weak reference.

Correct.  I think previous incarnations of this code used |/* weak-ref */ true| for that argument.  I should probably add that back since it's easy to miss this.
Comment on attachment 827492 [details] [diff] [review]
Part 3: Add tests verifying DOMRequestHelper cleanup and weak ref behavior.

Thanks for the feedback on the first two patches!

Adding back the f? on the tests now that the observer weak ref question was answered.  Just wanted to make sure this didn't fall through the cracks.
Attachment #827492 - Flags: feedback?(gene.lian)
Duplicate of this bug: 940198
Attachment #826122 - Flags: review?(fabrice) → review+
Comment on attachment 826102 [details] [diff] [review]
Part 2: Enable weak observer ref and enforce expected QI interfaces

Review of attachment 826102 [details] [diff] [review]:
-----------------------------------------------------------------

that looks ok to me.

::: dom/base/DOMRequestHelper.jsm
@@ +141,5 @@
>     *    will be added as a weak referred one by default.
>     */
>    initDOMRequestHelper: function(aWindow, aMessages) {
> +    // Query our required interfaces to force a fast fail if they are not
> +    // provided.  These calls will throw if the interface is not available.

nit: I think there's an extra space before 'These'.
Attachment #826102 - Flags: review?(fabrice) → review+
Comment on attachment 827492 [details] [diff] [review]
Part 3: Add tests verifying DOMRequestHelper cleanup and weak ref behavior.

Review of attachment 827492 [details] [diff] [review]:
-----------------------------------------------------------------

looks good to me, thanks!

::: dom/base/test/test_domrequesthelper.xul
@@ +107,5 @@
> +     *                           function(uninitCalled) {
> +     *      // expect uninitCalled === false since GC with only weak refs
> +     *    });
> +     */
> +    const CWD_TOPIC = 'inner-window-destroyed'

nit: missing ';'
Also, use either single or double quotes consistently throughout the file (it seems double quotes were used previously in this file).

Also, what does this 'CWD' prefix mean?
Attachment #827492 - Flags: review?(fabrice) → review+
Update part 1 to indicate r=fabrice.  Carry r+ forward.
Attachment #826122 - Attachment is obsolete: true
Attachment #826122 - Flags: feedback?(ferjmoreno)
Attachment #8335020 - Flags: review+
Update part 2 with following changes carrying r+ forward:

1) Add comment indicating boolean third arg to addObserver() means to use a weak-ref.
2) Remove space from comment.
3) Update subject to indicate reviewer.
Attachment #8335023 - Flags: review+
Attachment #826102 - Attachment is obsolete: true
Attachment #826102 - Flags: feedback?(ferjmoreno)
Update part 3 with following changes carrying forward r+:

1) Use double quotes instead of single quotes.
2) Remove the "cwd" prefix on variables except for the dummy object.  We need it there to avoid shadowing a previous variable.
3) Add comment indicating |cwdDummy| is prefixed to be specific to |checkWindowDestruction()|.
4) Update commit message to include reviewer.
Attachment #827492 - Attachment is obsolete: true
Attachment #827492 - Flags: feedback?(gene.lian)
Attachment #827492 - Flags: feedback?(ferjmoreno)
Attachment #8335026 - Flags: review+
Add checkin-needed for following patches:

  attachment 8335020 [details] [diff] [review]
  attachment 8335023 [details] [diff] [review]
  attachment 8335026 [details] [diff] [review]

Safe to resolve after these are committed.
Keywords: checkin-needed
Comment on attachment 8335023 [details] [diff] [review]
Part 2: Use weak observer reference for DOMRequestHelper.

Review of attachment 8335023 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for the amazing work Ben! And sorry for the feedback delay.

Just one quick comment

::: dom/base/DOMRequestHelper.jsm
@@ +142,5 @@
>     */
>    initDOMRequestHelper: function(aWindow, aMessages) {
> +    // Query our required interfaces to force a fast fail if they are not
> +    // provided. These calls will throw if the interface is not available.
> +    this.QueryInterface(Ci.nsISupportsWeakReference);

It's ok to do this check for now, but we'll need to remove it once we default to strong refs, as weak refs will be rarely used then.
(In reply to Dave Hunt (:davehunt) from comment #63)
> Landed as:
> https://github.com/mozilla-b2g/gaia/commit/
> c1204e00cff95820b902c1b41dee229402c6a156

Uplifted to v1.2:
https://github.com/mozilla-b2g/gaia/commit/fe96c5969c9feb0fee9524492a1aad7929aa716e
(In reply to Fernando Jiménez Moreno [:ferjm] (use needinfo instead of CC, please) from comment #83)
> ::: dom/base/DOMRequestHelper.jsm
> @@ +142,5 @@
> >     */
> >    initDOMRequestHelper: function(aWindow, aMessages) {
> > +    // Query our required interfaces to force a fast fail if they are not
> > +    // provided. These calls will throw if the interface is not available.
> > +    this.QueryInterface(Ci.nsISupportsWeakReference);
> 
> It's ok to do this check for now, but we'll need to remove it once we
> default to strong refs, as weak refs will be rarely used then.

Hmm, I didn't realize we had plans to default to strong references in the future.  I guess, though, we could perform the check only when the client enables weak references.  How does that sound?
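The "fast fail" idea being discussed can be sketched like this. It is a hypothetical illustration, not the real XPCOM QueryInterface: interface names are plain strings here, and `makeHelper` is a mock subclass factory. The point is that a subclass missing the required interface declarations throws up front at init time rather than failing mysteriously later when the weak observer reference is needed.

```javascript
// Mocked init: check up front that the helper supports what weak observer
// registration requires. Each QueryInterface call throws if the interface
// was not declared, forcing a fast fail.
function initDOMRequestHelper(helper) {
  helper.QueryInterface("nsISupportsWeakReference");
  helper.QueryInterface("nsIObserver");
  // ...proceed with weak observer registration...
}

// Mock helper whose QueryInterface only accepts the declared interfaces.
function makeHelper(interfaces) {
  return {
    QueryInterface(iid) {
      if (!interfaces.includes(iid)) {
        throw new Error("NS_NOINTERFACE: " + iid);
      }
      return this;
    }
  };
}

// Declares both interfaces: init succeeds.
initDOMRequestHelper(makeHelper(["nsISupportsWeakReference", "nsIObserver"]));

// Forgot nsISupportsWeakReference: init throws immediately.
try {
  initDOMRequestHelper(makeHelper(["nsIObserver"]));
} catch (e) {
  console.log("fast fail:", e.message);
}
```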
https://hg.mozilla.org/mozilla-central/rev/05595c39906f
https://hg.mozilla.org/mozilla-central/rev/26ba59b5e2bd
https://hg.mozilla.org/mozilla-central/rev/18de3f6395e5
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla28
Needs branch-specific patches for b2g26.
Flags: needinfo?(bkelly)
Updated part 1 patch for b2g26_v1_2 branch.  Patches for part 2 and part 3 apply cleanly for me.

Mochitest try build to make sure I didn't break anything.  Since we throw if the interface is missing, this should catch any problems with the fix to part 1:

  https://tbpl.mozilla.org/?tree=Try&rev=a8ecb415ed31

Do I add checkin-needed keyword again if this ends up green?
Flags: needinfo?(bkelly)
Hmm... that try seems to have picked up extra revisions.  And further tries only show the empty try revision and not my patches.  I must be doing something wrong.  Building locally and will test in the morning.
Try shows all new commits you push, and no others.  So the first time, you were the first person to push some of those other random commits.  Then the second time, you'd already pushed those revisions, so it didn't show them.  So the behavior you saw was perfectly normal, if also confusing.
No need, I've got bug queries for it :)
No longer blocks: 915138
Depends on: 944311
I wanted to follow up on this from a data perspective. I spoke with some of you about the test data from before the fix, which we found showed increasing startup times across test run replicates, along with some other patterns that made the replicates look "not independent". Looking at some data from after the changes landed confirms that the patches seem to have resolved the problem of the increasing startup times - the strong increasing trends are mostly gone.

There are some other issues that have not been affected by the changes: some test runs still show weak increasing trends; in some cases the first few replicates are higher than the others and seem to "burn in" to a constant level; and in some cases there is a spike at the same place (eg. replicate 14) for several test runs. All of these issues are undesirable from the point of view of comparing test outcomes, as we'd want the replicates to be as similar and pattern-less as possible. However, these are not necessarily bugs, and could be expected behaviour, in which case we just need to be aware that they occur, and handle them appropriately. 

I'm linking a couple of plots to illustrate the results. The top row has test runs from before the patches landed (up until late October), and the bottom row has runs from afterwards (beginning of November onward), grouped by app. Each test run across 30 replicates is plotted as a black line, several of which are shown on each plot. A red fitted trend line is superimposed on top. Ideally, the red line would be horizontal, meaning no trend or pattern as the test moves through the replicates. As we can see, the red lines in the bottom row are much closer to horizontal than in the top row. 

https://drive.google.com/a/mozilla.com/file/d/0BzxrdTWOmycCVENfSzhPMmwwalU/edit?usp=sharing
https://drive.google.com/a/mozilla.com/file/d/0BzxrdTWOmycCX3JzaVdoX1V3Xzg/edit?usp=sharing
No longer depends on: 944311
Component: DOM → DOM: Core & HTML
You need to log in before you can comment on or make changes to this bug.