Intermittent docshell/test/browser/browser_bug670318.js | Test timed out -

RESOLVED FIXED in Firefox 50

Status

()

defect
P3
normal
RESOLVED FIXED
3 years ago
3 years ago

People

(Reporter: intermittent-bug-filer, Unassigned)

Tracking

({intermittent-failure})

unspecified
mozilla52
Points:
---

Firefox Tracking Flags

(firefox49 unaffected, firefox50 fixed, firefox51 fixed, firefox52 fixed)

Details

Attachments

(1 attachment)

Bulk assigning P3 to all open intermittent bugs without a priority set in Firefox components per bug 1298978.
Priority: -- → P3
Mike, this looks like it might be related to bug 1261842, which landed in about the timeframe where this started, and this intermittent is pretty frequent. Any chance you can take a look?
Flags: needinfo?(mconley)
Looking. Got a try push with some logging cooking here: https://treeherder.mozilla.org/#/jobs?repo=try&revision=df99e2d16f64
Huh.

12:43:16     INFO -  48 INFO TEST-START | docshell/test/browser/browser_bug670318.js
12:44:46     INFO -  TEST-INFO | started process screentopng
12:44:48     INFO -  TEST-INFO | screentopng: exit 0
12:44:48     INFO -  49 INFO checking window state
12:44:48     INFO -  50 INFO Entering test bound test
12:44:48     INFO -  51 INFO Calling ContentTask.spawn...
12:44:48     INFO -  52 INFO Inside ContentTask.spawn
12:44:48     INFO -  53 INFO Adding SHistory listener
12:44:48     INFO -  54 INFO Setting location to http://mochi.test:8888/browser/docshell/test/browser/file_bug670318.html
12:44:48     INFO -  55 INFO Saw HistoryReplaceEntry - count is now: 1
12:44:48     INFO -  56 INFO Saw HistoryNewEntry to spec: http://mochi.test:8888/browser/docshell/test/browser/file_bug670318.html - count was: 1
12:44:48     INFO -  57 INFO Saw HistoryNewEntry to spec: http://mochi.test:8888/browser/docshell/test/browser/file_bug670318.html - count was: 2
12:44:48     INFO -  58 INFO Console message: [JavaScript Error: "1473191031632	Toolkit.GMP	ERROR	GMPInstallManager.simpleCheckAndInstall Could not check for addons: Error: got node name: html, expected: updates (resource://gre/modules/addons/ProductAddonChecker.jsm:153:11) JS Stack trace: parseXML@ProductAddonChecker.jsm:153:11 < Async*ProductAddonChecker.getProductAddonList@ProductAddonChecker.jsm:320:12 < GMPInstallManager.prototype.checkForAddons@GMPInstallManager.jsm:107:5 < GMPInstallManager.prototype.simpleCheckAndInstall<@GMPInstallManager.jsm:204:29 < gBrowserInit._delayedStartup/<@browser.js:1315:7 < setTimeout handler*gBrowserInit._delayedStartup@browser.js:1311:5 < EventListener.handleEvent*gBrowserInit.onLoad@browser.js:1001:5 < onload@browser.xul:1:1" {file: "resource://gre/modules/Log.jsm" line: 753}]
12:44:48     INFO -  App_append@resource://gre/modules/Log.jsm:753:9
12:44:48     INFO -  Logger.prototype.log@resource://gre/modules/Log.jsm:389:7
12:44:48     INFO -  LoggerRepository.prototype.getLoggerWithMessagePrefix/proxy.log@resource://gre/modules/Log.jsm:504:44
12:44:48     INFO -  Logger.prototype.error@resource://gre/modules/Log.jsm:397:5
12:44:48     INFO -  GMPInstallManager.prototype.simpleCheckAndInstall<@resource://gre/modules/GMPInstallManager.jsm:273:7
12:44:48     INFO -  TaskImpl_run@resource://gre/modules/Task.jsm:320:40
12:44:48     INFO -  Handler.prototype.process@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:940:21
12:44:48     INFO -  this.PromiseWalker.walkerLoop@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:816:7
12:44:48     INFO -  Promise*this.PromiseWalker.scheduleWalkerLoop@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:747:11
12:44:48     INFO -  this.PromiseWalker.schedulePromise@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:779:7
12:44:48     INFO -  this.PromiseWalker.completePromise@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:714:7
12:44:48     INFO -  GMPInstallManager.prototype.checkForAddons/<@resource://gre/modules/GMPInstallManager.jsm:116:7
12:44:48     INFO -  Async*GMPInstallManager.prototype.checkForAddons@resource://gre/modules/GMPInstallManager.jsm:107:5
12:44:48     INFO -  GMPInstallManager.prototype.simpleCheckAndInstall<@resource://gre/modules/GMPInstallManager.jsm:204:29
12:44:48     INFO -  TaskImpl_run@resource://gre/modules/Task.jsm:319:40
12:44:48     INFO -  TaskImpl@resource://gre/modules/Task.jsm:280:3
12:44:48     INFO -  createAsyncFunction/asyncFunction@resource://gre/modules/Task.jsm:254:14
12:44:48     INFO -  gBrowserInit._delayedStartup/<@chrome://browser/content/browser.js:1315:7
12:44:48     INFO -  setTimeout handler*gBrowserInit._delayedStartup@chrome://browser/content/browser.js:1311:5
12:44:48     INFO -  EventListener.handleEvent*gBrowserInit.onLoad@chrome://browser/content/browser.js:1001:5
12:44:48     INFO -  onload@chrome://browser/content/browser.xul:1:1
12:44:48     INFO -  59 INFO TEST-UNEXPECTED-FAIL | docshell/test/browser/browser_bug670318.js | Test timed out -
12:44:48     INFO -  MEMORY STAT | vsize 1169MB | residentFast 241MB | heapAllocated 96MB
12:44:48     INFO -  60 INFO TEST-OK | docshell/test/browser/browser_bug670318.js | took 90148ms
12:44:48     INFO -  Not taking screenshot here: see the one that was previously logged
12:44:48     INFO -  61 INFO TEST-UNEXPECTED-FAIL | docshell/test/browser/browser_bug670318.js | Found a tab after previous test timed out: http://mochi.test:8888/browser/docshell/test/browser/file_bug670318.html -

Looks like we're not entering the OnHistoryNewEntry callback enough times, so we never enter this block: http://searchfox.org/mozilla-central/rev/95b0ecf4b59a01e0524ca02f6c96ecaabe38f4d5/docshell/test/browser/browser_bug670318.js#26

On reflection, I can't see how the "make initial browser be remote" stuff might have introduced this. The first recorded instance of this intermittent was on July 21st, 2016. My patches in bug 1261842 landed and stuck on July 25th.

I'm going to plead innocent on this one.
Flags: needinfo?(mconley)
So I tried to bisect this a little... I *think* it started with this merge:

https://hg.mozilla.org/mozilla-central/pushloghtml?changeset=6b180266ac16e3226be33319ff710ddfa85f5836

given:

https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=6b180266ac16e3226be33319ff710ddfa85f5836&filter-searchStr=opt%20osx%20mochitest-browser-chrome-3&selectedJob=4887722

and 

https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=e28e856b987380f55d699092f11f6997378f79a6&filter-searchStr=osx%20opt%20mochitest-browser-chrome-4&group_state=expanded&selectedJob=4887761

Note that the former has a lot of perma-orange from an unrelated bug so it's a bit hard to read - I starred all the unrelated permaorange and so the unstarred items are the ones that are relevant to this bug.

The problem is, I don't see anything in that merge that looks related. :-\
(In reply to :Gijs Kruitbosch from comment #31)
> So I tried to bisect this a little... I *think* it started with this merge:
> 
> https://hg.mozilla.org/mozilla-central/
> pushloghtml?changeset=6b180266ac16e3226be33319ff710ddfa85f5836
> 
> given:
> 
> https://treeherder.mozilla.org/#/jobs?repo=mozilla-
> central&revision=6b180266ac16e3226be33319ff710ddfa85f5836&filter-
> searchStr=opt%20osx%20mochitest-browser-chrome-3&selectedJob=4887722
> 
> and 
> 
> https://treeherder.mozilla.org/#/jobs?repo=mozilla-
> central&revision=e28e856b987380f55d699092f11f6997378f79a6&filter-
> searchStr=osx%20opt%20mochitest-browser-chrome-
> 4&group_state=expanded&selectedJob=4887761
> 
> Note that the former has a lot of perma-orange from an unrelated bug so it's
> a bit hard to read - I starred all the unrelated permaorange and so the
> unstarred items are the ones that are relevant to this bug.
> 
> The problem is, I don't see anything in that merge that looks related. :-\

Wild shot in the dark: bug 911216. Till, any chance the difference in promise implementation could explain this + comment #28 ? Here's the test: http://searchfox.org/mozilla-central/source/docshell/test/browser/browser_bug670318.js

Alternatively, Mike, do you see anything else in that merge that looks even vaguely related? Considering how high up this is in OF it feels like we should be working this out... :-(
Flags: needinfo?(till)
Flags: needinfo?(mconley)
This does point a finger at bug 911216, yes :(

We had other cases where the new Promise implementation introduced subtle timing changes which made race conditions in other code more likely, so that's my best guess at what's going on here. See bug 1261784 for an example.

Unfortunately after staring at this for quite some time I keep being at a loss regarding how to usefully debug this. I guess an rr recording with chaos mode would have the best chance at making this reproducible.
Flags: needinfo?(till)
I was able to get an rr recording of this failure. The recording was kinda hard to navigate because of how many times the test ran, but I've added some instrumentation to find the failing run, and I'm currently waiting for a reproduction.

Leaving the needinfo for now.
So here's what I've been able to piece together from my rr reproduction.

The test is waiting for a OnHistoryNewEntry on a nsISHistoryListener for the test tab to be hit 5 times before it continues the test.

That OnHistoryNewEntry is fired here: http://searchfox.org/mozilla-central/rev/515bca4fca5e2e8233ed30e6b6d48fcab8497dbf/docshell/shistory/nsSHistory.cpp#405

And we eventually get to AddEntry by way of this call:

http://searchfox.org/mozilla-central/rev/515bca4fca5e2e8233ed30e6b6d48fcab8497dbf/docshell/base/nsDocShell.cpp#12312

But we're never entering that conditional in the bad case, which is why we never call AddEntry, so the OnHistoryNewEntry never gets fired, so the test never finishes and times out.

So now the question is - why aren't we entering that block? Well, according to rr, it's because we're failing this conditional:

!LOAD_TYPE_HAS_FLAGS(mLoadType, LOAD_FLAGS_REPLACE_HISTORY)

(both in the good case and the bad case, mOSHE exists, but in the good case, mLoadType is 1. In the bad case, mLoadType is 8388609).

So working backwards, the reason why mLoadType is 8388609 is because of this:

http://searchfox.org/mozilla-central/rev/515bca4fca5e2e8233ed30e6b6d48fcab8497dbf/docshell/base/nsDocShell.cpp#1413

We enter this condition because it looks like both the parent and subframe are still "busy". We don't do that in the good case.

It looks like the busy flags are set in nsDocShell::OnStateChange, so if I had to make a hypothesis about what's going on, it's that we're causing the src change in file_bug670318.html when the document is still busy (despite having fired the load event).

So the question is:

1) Is it expected that the nsDocShell will still be "busy" when the load event had been fired?
2) If so, is it sufficient to wait for the readyState to be complete or something instead?

Tentatively that first question is pointed at bz - but if he's too busy, I can try to ask other folk.
Flags: needinfo?(mconley) → needinfo?(bzbarsky)
This test is racy!

When the load() function runs it does two things:

1)  It creates an iframe whose load listener is "setTimeout(next)" and adds it to the DOM.
2)  It does setTimeout(next).

Now appending the frame to the DOM will start an about:blank load in that subframe.  Now there are two possibilities for what can happen:

i) The setTimeout(next) fires before the about:blank load completes, the first src set will be a replace load (because the about:blank) is still loading, and then we will keep doing next() off the iframe's load handler.
ii) If the about:blank load completes first, it will run the load handler and now we have _two_ timers waiting to call next().  It's quite possible for one of them to run, start a load, then the other run, start another load, which will then confuse how many history entries we can end up with, right?  I expect this is the failure case.

So here's a question.  Does just taking out that direct (not from the iframe's load listener) setTimeout(next, 0) call make the test consistently consistently fail?  If not, then it should make it consistently pass.  If it makes it consistently fail, we probably want to start off the iframe with one of the test srcs _before_ we ever append it to the document, so we're never dealing with the about:blank load at all.
Flags: needinfo?(bzbarsky)
(In reply to Boris Zbarsky [:bz] (busy, pun intended) from comment #42)
> So here's a question.  Does just taking out that direct (not from the
> iframe's load listener) setTimeout(next, 0) call make the test consistently
> consistently fail?  If not, then it should make it consistently pass.  If it
> makes it consistently fail, we probably want to start off the iframe with
> one of the test srcs _before_ we ever append it to the document, so we're
> never dealing with the about:blank load at all.

Thanks bz. At least locally, removing that initial setTimeout(next, 0); doesn't cause the test to fail. Let's see if we can get it to fail on automation.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=bb157438a4ae
Flags: needinfo?(mconley)
Unfortunately, even with that change, the orange appears. :/
Hrm.  That's really odd.

Given that we know why it was failing before (in terms of the docshell code and busy states), could we add logging to see whether that's why it still fails even with that change?  (Or catch the modified thing in rr, I guess, but that might be harder than logging plus try push.)
Okay, I _think_ I've figured this one out.

I added some logging to the nsISHistoryListener, as well as the nsSHistory::AddEntry method that calls the OnHistoryNewEntry callback that the test depends on.

Here's what it looks like towards the end of a run where the test completes properly (I removed the onStateChange and onSecurityChange logging I added to the listener, and am only showing from the start of loading iframe 3):

file_bug670318.html loads iframe 3...

[task 2016-10-09T04:54:22.710835Z] 04:54:22     INFO -  mconley: Now loading data:text/html;charset=utf-8,iframe 3 in iframe
[task 2016-10-09T04:54:22.793370Z] 04:54:22     INFO -  mconley: OnHistoryNewEntry called with newURI: http://mochi.test:8888/browser/docshell/test/browser/file_bug670318.html
[task 2016-10-09T04:54:22.852367Z] 04:54:22     INFO -  mconley: Dumping LocationChange
[task 2016-10-09T04:54:22.852661Z] 04:54:22     INFO -    mconley: Dumping Web Progress:
[task 2016-10-09T04:54:22.853807Z] 04:54:22     INFO -      DOMWindowID: 217, isTopLevel: false, isLoadingDocument: true, loadType: 1
[task 2016-10-09T04:54:22.854064Z] 04:54:22     INFO -    mconley: Done Dumping Web Progress
[task 2016-10-09T04:54:22.854238Z] 04:54:22     INFO -    mconley: Dumping Request:
[task 2016-10-09T04:54:22.855597Z] 04:54:22     INFO -      name: data:text/html;charset=utf-8,iframe%203, isPending(): true, status: 0, loadFlags: 7405568
[task 2016-10-09T04:54:22.863060Z] 04:54:22     INFO -    mconley: Done Dumping Request:
[task 2016-10-09T04:54:22.863334Z] 04:54:22     INFO -    location.spec: data:text/html;charset=utf-8,iframe%203, flags: 0
[task 2016-10-09T04:54:22.863532Z] 04:54:22     INFO -  mconley: Done Dumping LocationChange

file_bug670318.html loads iframe 4...

[task 2016-10-09T04:54:23.060309Z] 04:54:23     INFO -  mconley: Now loading data:text/html;charset=utf-8,iframe 4 in iframe
[task 2016-10-09T04:54:23.099660Z] 04:54:23     INFO -  mconley: OnHistoryNewEntry called with newURI: http://mochi.test:8888/browser/docshell/test/browser/file_bug670318.html
[task 2016-10-09T04:54:23.100984Z] 04:54:23     INFO -  mconley: Saw the 5 count at http://mochi.test:8888/browser/docshell/test/browser/file_bug670318.html
[task 2016-10-09T04:54:24.453505Z] 04:54:24     INFO -  MEMORY STAT | vsize 1128MB | residentFast 284MB | heapAllocated 124MB
[task 2016-10-09T04:54:24.461515Z] 04:54:24     INFO -  49 INFO TEST-OK | docshell/test/browser/browser_bug670318.js | took 3389ms


and here's what it looks like during the orange:

file_bug670318.html loads iframe 3...

[task 2016-10-09T04:54:16.910966Z] 04:54:16     INFO -  mconley: Now loading data:text/html;charset=utf-8,iframe 3 in iframe
[task 2016-10-09T04:54:16.986001Z] 04:54:16     INFO -  mconley: OnHistoryNewEntry called with newURI: http://mochi.test:8888/browser/docshell/test/browser/file_bug670318.html
[task 2016-10-09T04:54:17.036242Z] 04:54:17     INFO -  mconley: Dumping LocationChange
[task 2016-10-09T04:54:17.036441Z] 04:54:17     INFO -    mconley: Dumping Web Progress:
[task 2016-10-09T04:54:17.036712Z] 04:54:17     INFO -      DOMWindowID: 217, isTopLevel: false, isLoadingDocument: true, loadType: 1
[task 2016-10-09T04:54:17.036904Z] 04:54:17     INFO -    mconley: Done Dumping Web Progress
[task 2016-10-09T04:54:17.037077Z] 04:54:17     INFO -    mconley: Dumping Request:
[task 2016-10-09T04:54:17.038310Z] 04:54:17     INFO -      name: data:text/html;charset=utf-8,iframe%203, isPending(): true, status: 0, loadFlags: 7405568
[task 2016-10-09T04:54:17.038514Z] 04:54:17     INFO -    mconley: Done Dumping Request:
[task 2016-10-09T04:54:17.038743Z] 04:54:17     INFO -    location.spec: data:text/html;charset=utf-8,iframe%203, flags: 0
[task 2016-10-09T04:54:17.038927Z] 04:54:17     INFO -  mconley: Done Dumping LocationChange

file_bug670318.html loads iframe 4...

[task 2016-10-09T04:54:17.475891Z] 04:54:17     INFO -  mconley: Now loading data:text/html;charset=utf-8,iframe 4 in iframe
[task 2016-10-09T04:54:17.593155Z] 04:54:17     INFO -  mconley: iframe saw load event. Queuing next.

file_bug670318.html loads iframe 5 (which it shouldn't be able to get to)...

[task 2016-10-09T04:54:17.596261Z] 04:54:17     INFO -  mconley: Now loading data:text/html;charset=utf-8,iframe 5 in iframe
[task 2016-10-09T04:54:17.628921Z] 04:54:17     INFO -  mconley: iframe saw load event. Queuing next.
[task 2016-10-09T04:55:00.245226Z] 04:55:00     INFO -  TEST-INFO | started process screentopng
[task 2016-10-09T04:55:01.543767Z] 04:55:01     INFO -  TEST-INFO | screentopng: exit 0
[task 2016-10-09T04:55:01.546739Z] 04:55:01     INFO -  49 INFO checking window state
[task 2016-10-09T04:55:01.548767Z] 04:55:01     INFO -  50 INFO Entering test bound test
[task 2016-10-09T04:55:01.551233Z] 04:55:01     INFO -  51 INFO TEST-UNEXPECTED-FAIL | docshell/test/browser/browser_bug670318.js | Test timed out -

So it looks like all of a sudden, my logging in the nsISHistoryListener disappears. When I looked at the debug log, I noticed a bunch of nsDocShell destruction occurred just before the logging stopped, which suggested to me that a GC had occurred.

So what I think is happening is that the nsISHistoryListener, which implements nsISupportsWeakReference, isn't being properly kept alive. The nsDocShell is only holding on to it weakly, and the ContentTask scope that it runs within doesn't hold a reference to it (the only thing that refers to the listener after construction is itself), so it gets cleared during the next GC. At least, I'm pretty certain that's what's happening.

As a solution, I've set the listener as a property on the content window to keep it alive during the duration of the test. Here's a try run: https://treeherder.mozilla.org/#/jobs?repo=try&revision=ca071c7956efed1b7bd3e67862b493699a7ff2a0

Seems to be agreeable.
Flags: needinfo?(mconley)
Attachment #8799757 - Flags: review?(gijskruitbosch+bugs) → review?(jaws)
Comment on attachment 8799757 [details]
Bug 1288606 - Fix intermittent orange by making sure a weakly held nsISHistoryListener doesn't get GC'd before the browser_bug670318.js completes.

https://reviewboard.mozilla.org/r/84880/#review83442
Attachment #8799757 - Flags: review?(jaws) → review+
Pushed by mconley@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/c008511ec36e
Fix intermittent orange by making sure a weakly held nsISHistoryListener doesn't get GC'd before the browser_bug670318.js completes. r=jaws
https://hg.mozilla.org/mozilla-central/rev/c008511ec36e
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla52
Whiteboard: [checkin-needed-aurora][checkin-needed-beta]
https://hg.mozilla.org/releases/mozilla-aurora/rev/606b0224fb75
Whiteboard: [checkin-needed-aurora][checkin-needed-beta] → [checkin-needed-beta]
You need to log in before you can comment on or make changes to this bug.