Closed Bug 1245092 Opened 8 years ago Closed 8 years ago

Install reftest and specialpowers extensions at runtime via AddonManager.loadTemporaryAddon()

Categories

(Testing :: Reftest, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED FIXED
Target Milestone: mozilla48

Tracking Status
firefox46 --- fixed
firefox47 --- fixed
firefox48 --- fixed

People

(Reporter: ahal, Assigned: ahal)

References

Details

Attachments

(2 files)

We'll need to install these at runtime as temporary addons so they don't need to get signed. Specialpowers can't be signed, and reftest is modified enough that signing it would be a huge pain.
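For context, loading an unpacked extension without a signature boils down to a single AddonManager call. Here is a minimal sketch of what harness-side code could look like; the module imports are real, but the helper name and paths are hypothetical and the actual reftest wiring differs:

```js
// Minimal sketch: load an unpacked, unsigned test extension at runtime.
// installTemporaryAddon skips signature checks and the add-on is discarded on restart.
const { AddonManager } = Components.utils.import(
  "resource://gre/modules/AddonManager.jsm", {});
const { FileUtils } = Components.utils.import(
  "resource://gre/modules/FileUtils.jsm", {});

// Hypothetical helper; `path` points at the unpacked extension directory (or an XPI).
function installTestExtension(path) {
  let file = new FileUtils.File(path);
  // Returns a promise that resolves to the installed Addon object.
  return AddonManager.installTemporaryAddon(file);
}

// Usage (path is illustrative only):
// installTestExtension("/builds/tests/reftest/specialpowers").then(addon => {
//   dump("installed " + addon.id + "\n");
// });
```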
Depends on: 1247054
Hi Andrew, is this blocking for Fx46 going live with signing, or can it land after?
Flags: needinfo?(ahalberstadt)
It's blocking. Though I'm almost done, I just haven't been keeping the bug up to date. Here's the latest try run:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=23c5c5d833af
Flags: needinfo?(ahalberstadt)
Depends on: 1249634
This patch also pretty much totally busts b2g reftests, which apparently I don't care about. If no one is going to commit to fixing it, we should turn them off. I'll file a follow-up after this lands.
Good news: I fixed some test failures that this patch caused.

Bad news: there are some memory leaks that I didn't notice until now (they were masked by the test failures). There is an intermittent leak on linux debug e10s crashtest, and a permanent leak on win 7 e10s crashtest. This could potentially take me a while to figure out.
Jgriffin suggested I try pushing only a specific directory of tests to try, to see if the leaks have something to do with a particular test. He may have been on to something, as running only dom crashtests came back green. Here are some more try runs that run about 5 root directories each:
https://treeherder.mozilla.org/#/jobs?repo=try&author=ahalberstadt@mozilla.com&fromchange=dc63b6583aec&tochange=012a5eddeaaa

Unfortunately, try is taking a long time to start them.
Here's the latest bisection range:
https://treeherder.mozilla.org/#/jobs?repo=try&author=ahalberstadt@mozilla.com&fromchange=dc63b6583aec&tochange=52ab2496e8e2

It looks like tests in /widget could be causing this, but looking at the first 3 pushes, there are either more problematic directories, or this is a function of the number of tests. My money would be on the latter. I'll do some more pushes in the meantime though.
Hey Andrew, I heard you may know a thing or two about memory leaks and e10s. Basically my patch here is hitting some leaks on shutdown, and I'm not really sure why. The important bits are probably in bootstrap.js. See the above try runs.

Do you see anything I'm obviously doing wrong? Or do you have tips on how I can debug this? I'm quite far outside my area of expertise here. Thanks!
Flags: needinfo?(continuation)
Hm, in this try run the windows debug e10s crashtests are actually green:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=fdcc93b965a9

I wonder if this got fixed by bug 1251692. Pushing a fresh try run.
That new try run is extremely orange.

Generally these sorts of leaks are caused by failing to remove event listeners or observers or some other thing where you register yourself with a central service. It looks like you are leaking whole pages. Another kind of chrome JS leak I've seen before is when a timer fires during shutdown that makes you do something, after the cleanup for that something has already happened, so you end up leaking.

Let me know if the leak persists and you have a test case you can run and I can try to help you figure out what is causing it.
Flags: needinfo?(continuation)
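To make the failure mode described above concrete, here is a hypothetical bootstrap.js fragment showing the kind of registration that has to be undone at shutdown; the names and observer topic are illustrative, not taken from the actual patch:

```js
// Hypothetical pattern: anything registered with a central service at startup
// must be unregistered at shutdown, or it (and whatever it closes over,
// e.g. a window) leaks.
const { Services } = Components.utils.import(
  "resource://gre/modules/Services.jsm", {});

let observer = {
  observe(subject, topic, data) {
    // react to the notification
  },
};

function startup(data, reason) {
  Services.obs.addObserver(observer, "reftest-example-topic", false);
}

function shutdown(data, reason) {
  // Forgetting this call is exactly the sort of thing that produces the
  // shutdown leaks described above.
  Services.obs.removeObserver(observer, "reftest-example-topic");
}
```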
Thanks. You can ignore the orangeness of that try run; it was meant to help us figure out what still needs to be fixed when addon signing is enforced. Though judging by this even more recent one, it seems it was just luck that nothing leaked:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=bdb168296e90

I wonder if this has something to do with the fact that reftest.js is no longer a command line component. Maybe this wasn't introduced by my patch, but was latent in reftest.js and not exposed until now.
mccr8, if you have any time to help here this week, it would really be appreciated. If this isn't addressed by then, it may block the addon signing project.
Flags: needinfo?(continuation)
How do I reproduce this issue locally?
I haven't been able to reproduce locally, though I only tried running the full suite a couple of times (which is partly why I'm banging my head on the wall). This seems to happen only on linux or windows e10s. Here are the theoretical steps to reproduce (STR):

1. Apply the patch in this bug
2. Do a debug build
3. ./mach crashtest --e10s

It's even intermittent on try, so it's some kind of timing thing. My machine might be too fast to reproduce it.
For the record, those "command timed out after 1800 second" errors in the previous try run are unrelated to this patch, as they time out in mozharness before they even make it to the reftest harness. Though the intermittent leak does still show up, on both linux and windows.
The approach here, if we could reproduce it locally, would be to get a shutdown CC and GC log, and then I could try to analyze that to figure out what is wrong. That is done by setting MOZ_CC_LOG_SHUTDOWN=1 and other variables described at the start of xpcom/base/nsCycleCollector.cpp. In order to get that to work on try, you have to set it up to upload an artifact containing the log files. I'm not sure how to do that.
Flags: needinfo?(continuation)
I can do that. All we have to do is copy the logs to MOZ_UPLOAD_DIR and mozharness will take care of the rest. Do you know where the log files get saved? I could hack up a patch and push it for you.
Ah, MOZ_CC_LOG_DIRECTORY. I'll do a quick push now.
I *think* this should do the trick:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=646cd901bc5d

Once the job is finished you should be able to click it and see a link to the logs in the "Job Details" pane at the bottom.
https://reviewboard.mozilla.org/r/35635/#review34799

::: layout/tools/reftest/bootstrap.js:14
(Diff revision 1)
> +    win = win.QueryInterface(Ci.nsIInterfaceRequestor).getInterface(Ci.nsIDOMWindow);

Doesn't this create a global var called 'win' whose lifetime is bound by the life of the component? Which means we'll leak everything in some cases?
https://reviewboard.mozilla.org/r/35635/#review34799

> Doesn't this create a global var called 'win' whose lifetime is bound by the life of the component? Which means we'll leak everything in some cases?

Does it? It's a function argument. At least redeclaring it is an error.
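For what it's worth, here is a small illustration of the distinction being argued (hypothetical code, not the patch itself): reassigning a function parameter only rebinds the local slot, whereas assigning to an undeclared name would create a long-lived binding on the component's global scope.

```js
const Ci = Components.interfaces;

function handleWindow(win) {
  // Reassigning the parameter rebinds the local argument only; nothing escapes
  // the function, so this line by itself can't pin the window for the lifetime
  // of the component.
  win = win.QueryInterface(Ci.nsIInterfaceRequestor)
           .getInterface(Ci.nsIDOMWindow);
}

function handleWindowLeaky(w) {
  // By contrast, assigning to an undeclared name creates a property on the
  // scope's global in sloppy mode (and throws a ReferenceError in strict
  // mode), which would keep the window alive as long as the scope is loaded.
  win = w;
}
```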
(In reply to Andrew Halberstadt [:ahal] from comment #19)
> I *think* this should do the trick:
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=646cd901bc5d

Oops, I think I forgot quotes around the "1"; here's a new one:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=d143f8520145
Hey Andrew, the previous try run has the CC and GC logs attached. Click one of the orange jobs, then click "Job Details" on the bottom pane and you'll see them. They're gzipped, so run them through gunzip if you download them locally.

Is this what you need? Is there any other information that would be helpful?
Flags: needinfo?(continuation)
Thanks, I'll take a look at the logs.
Flags: needinfo?(continuation)
In one log I looked at, we're leaking an inner and outer nsGlobalWindow. That window is being held alive by a nsXPCWrappedJS (nsIBrowserDOMWindow) with a missing reference:

1B275D88 [nsXPCWrappedJS (nsIBrowserDOMWindow)]
    --[mJSObj]--> 17625CA0 [JS Object (Object)]
    --[global]--> 1716A040 [JS Object (Window)]
    --[UnwrapDOMObject(obj)]--> 11F5B800 [nsGlobalWindow #6 inner chrome://browser/content/browser.xul]

nsGlobalWindow has a field with that type, but it is reported to the CC, so it would have shown up in the log. The only other field I see with that type is TabParent::mBrowserDOMWindow. TabParent isn't cycle collected, so the references it holds don't appear in the CC graph, and we are leaking a TabParent.

In IRC, Olli said we might be able to clear this field in TabParent::Destroy(), but we're probably leaking a TabParent anyways. I don't know how helpful this is to fixing the leak.

I'll try looking at the patch some more.
I noticed that shutdown() in bootstrap.js isn't doing anything on non-Android platforms any more. Is the listener removal etc. being done somewhere else?
Prior to this patch, non-Android platforms didn't use bootstrap.js to begin with. The listener you're talking about is set up behind a similar if (OS == "android") condition. I *think* all the listeners I've added/modified should be cleaned up properly.

I'm wondering if there's something wrong in reftest.js or reftest-content.js that wasn't modified by this patch, but only starts failing because it is now being loaded differently.
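For clarity, the pairing being discussed looks roughly like this; a sketch only, since the real bootstrap.js is structured differently and the listener body here is a placeholder:

```js
// Sketch of the Android-only listener registration and its matching cleanup.
const { Services } = Components.utils.import(
  "resource://gre/modules/Services.jsm", {});

const windowListener = {
  onOpenWindow(win) { /* hook up the harness once the first window appears */ },
  onCloseWindow(win) {},
  onWindowTitleChange(win, title) {},
};

function startup(data, reason) {
  if (Services.appinfo.OS == "Android") {
    // Android-only path: wait for windows via the window mediator.
    Services.wm.addListener(windowListener);
  }
}

function shutdown(data, reason) {
  if (Services.appinfo.OS == "Android") {
    // Matching removal; on other platforms nothing is registered here, which
    // is why shutdown() appears to do nothing for them.
    Services.wm.removeListener(windowListener);
  }
}
```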
Here are the changes to just bootstrap.js, with the Android- and B2G-specific code removed and the style fixes reverted.
The leaked window is browser.xul if that helps. Probably not.
Actually, it does give me an idea. The original window that's being closed is a browser.xul. And this patch also enables marionette, so maybe marionette is hanging on to a reference to that window.

Here's a try run that simply leaves that initial window open for the duration of the test run:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b60245728f82
The networking leak might be different. That one at least makes a little more sense, as the IO service involves networking. I'm doing a try push to see if we are maybe poking at it later in shutdown.
That last patch seems to have fixed the linux debug leaks at least; I guess we'll wait for windows. This change means there will be an extra window hanging around for the duration of the test run, but as long as it doesn't impact the test results, it isn't really a big deal.
That last try run seems to have fixed the leak. And while it isn't perfect (as there's now a random window being left open during the test run), it seems to work and doesn't seem to impact the tests:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=4903fb40eb21

So I'll get what I have reviewed and landed ASAP, and file a follow-up bug to look into why there are memory leaks when we close that window and whether or not it has something to do with marionette.

Thanks for helping dig into this Andrew.
So after fixing the memory leak, I was about to upload the patch and realized it was disabling the non-local network requirement (something I had done a long time ago in order to make some quick progress and then forgot about). Of course, re-enabling that turned all the jobs red :).

But I figured out what prefs needed to be set and we seem to be good to go once more:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=48e61b59a9ad

Patch shortly.
Comment on attachment 8721320 [details]
MozReview Request: Bug 1245092 - Install reftest and specialpowers extensions at runtime via AddonManager.installTemporaryAddon, r?jgriffin

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/35635/diff/1-2/
Attachment #8721320 - Attachment description: MozReview Request: Bug 1245092 - Install reftest and specialpowers extensions at runtime via AddonManager.installTemporaryAddon → MozReview Request: Bug 1245092 - Install reftest and specialpowers extensions at runtime via AddonManager.installTemporaryAddon, r?jgriffin
Attachment #8721320 - Flags: review?(jgriffin)
Blocks: 1255565
Comment on attachment 8721320 [details]
MozReview Request: Bug 1245092 - Install reftest and specialpowers extensions at runtime via AddonManager.installTemporaryAddon, r?jgriffin

https://reviewboard.mozilla.org/r/35635/#review36029

::: testing/marionette/driver/marionette_driver/marionette.py:1150
(Diff revision 1)
> +        self.wait_for_port(timeout=timeout)

I'm not sure what the implications are of removing this wait_for_port; are we sure it's harmless?
Attachment #8721320 - Flags: review?(jgriffin) → review+
https://reviewboard.mozilla.org/r/35635/#review36029

> I'm not sure what the implications are of removing this wait_for_port; are we sure it's harmless?

I think you're looking at the interdiff rather than the full diff. This change is actually already in-tree, it landed as part of the mochitest patch. Also it was moving it up a line rather than removing it.
(In reply to Andrew Halberstadt [:ahal] from comment #37)
> https://reviewboard.mozilla.org/r/35635/#review36029
> 
> > I'm not sure what the implications are of removing this wait_for_port; are we sure it's harmless?
> 
> I think you're looking at the interdiff rather than the full diff. This
> change is actually already in-tree, it landed as part of the mochitest
> patch. Also it was moving it up a line rather than removing it.

Yes, you're right, I was; thanks for the clarification.
https://hg.mozilla.org/mozilla-central/rev/06baa0fa499b
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla48
Depends on: 1034290
There was some non-local network bustage on the beta patch that didn't happen on my beta-based try push. I believe this is the same problem I hit with the mochitest patch, where there was a pref that was only being configured on beta/release, but nowhere else. Here is my attempt to fix it:
https://hg.mozilla.org/releases/mozilla-beta/rev/f902121e4d1b

Note: I'm not sure how to test this, seeing as I can't reproduce locally or on try, so this is a bit of a shot in the dark.
Depends on: 1256650
Depends on: 1258931
I forgot to uplift that non-local network fix to aurora/central. Here is the aurora change:
https://hg.mozilla.org/releases/mozilla-aurora/rev/faa65506b112a63784802378f8fa5610605c96b0
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Thanks, the Try simulations are much happier now :)
https://hg.mozilla.org/mozilla-central/rev/a57dc2f4a6f6
Status: REOPENED → RESOLVED
Closed: 8 years ago8 years ago
Resolution: --- → FIXED
No longer blocks: 1255565
Depends on: 1255565
Depends on: 1267836
No longer depends on: 1267836
Depends on: 1265124
Depends on: 1274557