Open Bug 1252376 Opened 4 years ago Updated 2 years ago

Intermittent browser_privatetabs.js | Test timed out

Categories

(Firefox :: Session Restore, defect, P3)

defect

Tracking

()

REOPENED
Firefox 53
Tracking Status
firefox48 --- wontfix
firefox49 --- wontfix
firefox50 --- wontfix
firefox51 --- wontfix
firefox52 --- wontfix
firefox53 --- affected

People

(Reporter: philor, Assigned: mikedeboer)

References

Details

(Keywords: intermittent-failure)

Attachments

(1 file)

Bulk assigning P3 to all open intermittent bugs without a priority set in Firefox components per bug 1298978.
Priority: -- → P3
This frequent intermittent occurs almost always on Debug and e10s; usually on Linux, sometimes on Linux64.

The timeout occurs after 45 seconds. When the test succeeds on Linux Debug, it usually runs in 20 to 30 seconds, so increasing the timeout may not help... the test may simply be hanging.

:mikedeboer -- Can you find someone to have a look at this?
Flags: needinfo?(mdeboer)
Assignee: nobody → mdeboer
Status: NEW → ASSIGNED
Flags: needinfo?(mdeboer)
Comment on attachment 8809796 [details]
Bug 1252376 - refactor browser_privatetabs.js to use BrowserTestUtils instead of using less deterministic methods and make sure to garbage collect, forcibly, after each sessionstore mochitest ended.

https://reviewboard.mozilla.org/r/92314/#review92340

The logs I'm looking at show something like this:


https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=6419986#L19392

16:43:36     INFO - TEST-START | browser/components/sessionstore/test/browser_privatetabs.js

...

16:44:24     INFO - Buffered messages logged at 16:43:53
16:44:24     INFO - TEST-PASS | browser/components/sessionstore/test/browser_privatetabs.js | tab considered private - 
16:44:24     INFO - Buffered messages logged at 16:43:55
16:44:24     INFO - TEST-PASS | browser/components/sessionstore/test/browser_privatetabs.js | no windows to restore - 
16:44:24     INFO - Leaving test bound 
16:44:24     INFO - Entering test bound 
16:44:24     INFO - Buffered messages finished

and then the test times out at 16:44:24

This means that after this bit of the test was entered:

https://dxr.mozilla.org/mozilla-central/rev/d38d06f85ef59c5dbb5d4a1a8d895957a78714de/browser/components/sessionstore/test/browser_privatetabs.js#104

we do nothing for about 30 seconds.

That doesn't strike me as an issue with the timeout being too short. I think there's an issue with the test, where we somehow never reach `ok(state.isPrivate, "tab considered private");`

the screenshot shows a private window, so clearly we managed to load the private window and about:privatebrowsing, and it also shows a background tab with about:mozilla - but perhaps there's a race that causes us not to proceed past either the load of that browser or the tab state flush? Adding more logging to find out which it is would narrow down what's going wrong more clearly, and would also prove more conclusively whether we're just a few seconds too slow (which right now seems unlikely, considering the two previous bound tests complete in a measly 15-20 seconds).
Attachment #8809796 - Flags: review?(gijskruitbosch+bugs) → review-
The fact that https://treeherder.mozilla.org/logviewer.html#?job_id=31117989&repo=try#L19416 doesn't show any test-related output until almost 60s after the test started (36:59 until 37:53) is highly suspicious. It may be getting trapped in garbage collection, given the huge number of ---DOMWINDOW and ---DOCSHELL lines preceding it after the start of the test.
(In reply to Josh Matthews [:jdm] from comment #47)
> The fact that
> https://treeherder.mozilla.org/logviewer.
> html#?job_id=31117989&repo=try#L19416 doesn't show any test-related output
> until almost 60s after the test started (36:59 until 37:53) is highly
> suspicious. It may be getting trapped in garbage collection, given the huge
> number of ---DOMWINDOW and ---DOCSHELL lines preceding it after the start of
> the test.

The log timestamps are a lie: the log messages are normally buffered (unless you request SimpleTest.requestCompleteLog) and emitted in batches, so the leading log timestamps are wrong.

The actual timestamps are the ones in lines like:

 04:37:53     INFO - Buffered messages logged at 04:36:59
 04:37:53     INFO - Entering test bound cleanup
 04:37:53     INFO - Forgetting closed tabs
 04:37:53     INFO - Leaving test bound cleanup
 04:37:53     INFO - Entering test bound 
 04:37:53     INFO - Setting up public tab
 04:37:53     INFO - Buffered messages logged at 04:37:01
etc.

which show that it has definitely been doing stuff in the meantime.
I can confirm with another sighting in https://treeherder.mozilla.org/#/jobs?repo=try&revision=59fefe7b32de&selectedJob=31118005

12:39:36     INFO - TEST-START | browser/components/sessionstore/test/browser_privatetabs.js
... one minute until the very first task called 'cleanup' is run ...
12:40:32     INFO - Entering test bound cleanup

During this minute there are a lot of messages, just like Josh describes.

Is there some way to exclude time spent GC'ing from the time charged to the test? Basically, also to have the GC behavior logged separately so that it can be analyzed. I'm n-i'ing you, Josh, but please feel free to defer...
(In reply to :Gijs Kruitbosch from comment #48)
> The log timestamps are a lie, because the logs are normally omitted (unless
> you request SimpleTest.requestCompleteLog) and therefore batched and
> therefore the leading log timestamps are wrong.

But why is

12:40:30     INFO - TEST-INFO | started process screentopng

also only started after 1 minute? It is apparently not part of the buffered log messages...
(In reply to Mike de Boer [:mikedeboer] from comment #50)
> (In reply to :Gijs Kruitbosch from comment #48)
> > The log timestamps are a lie, because the logs are normally omitted (unless
> > you request SimpleTest.requestCompleteLog) and therefore batched and
> > therefore the leading log timestamps are wrong.
> 
> But why is
> 
> 12:40:30     INFO - TEST-INFO | started process screentopng
> 
> also only started after 1 minute? It is apparently not part of the buffered
> log messages...

Because that's not the test, that's the framework going "huh, your test died, lemme make a screenshot for you". That only happens when the test is dead, i.e. after it's run. The fact that it shows up *before* it regurgitates the buffered test messages is dumb, but just an artifact of the test runner, AIUI.
Regardless, after I added more logging, I found this to be more interesting:

[task 2016-11-15T12:40:32.879612Z] 12:40:32     INFO - Buffered messages logged at 12:39:48
[task 2016-11-15T12:40:32.879920Z] 12:40:32     INFO - TEST-PASS | browser/components/sessionstore/test/browser_privatetabs.js | no windows to restore - 
[task 2016-11-15T12:40:32.880076Z] 12:40:32     INFO - Leaving test bound 
[task 2016-11-15T12:40:32.880226Z] 12:40:32     INFO - Entering test bound 
[task 2016-11-15T12:40:32.880395Z] 12:40:32     INFO - forgetting closed windows...
[task 2016-11-15T12:40:32.880555Z] 12:40:32     INFO - opening new window...
[task 2016-11-15T12:40:32.880736Z] 12:40:32     INFO - Buffered messages logged at 12:39:57
[task 2016-11-15T12:40:32.880885Z] 12:40:32     INFO - adding new tab...
[task 2016-11-15T12:40:32.881064Z] 12:40:32     INFO - Buffered messages logged at 12:39:58
[task 2016-11-15T12:40:32.881230Z] 12:40:32     INFO - waiting for browser load...

and then it times out.

Which seems to tell me it took 9 seconds to open a new private window. Adding a tab is the next item on the agenda and is part of the next batch of messages.
The last info line is 'waiting for browser load...', which is what it's stuck on for 32 seconds, apparently.
The times in square brackets come from run-task. Those times are obtained whenever the process that is running emits a line of output. run-task uses unbuffered I/O.

The timestamps that follow are from mozharness. I eventually want to remove those because they are redundant with run-task.

When mozharness logs process output, the time is obtained when it sees a new line. Unfortunately, there are still a number of places in mozharness where buffered I/O is being used on child processes. So timestamps may not be accurate. I suspect that's what's happening here. Python will spawn child processes with a buffered pipe unless you tell it not to. On top of that, Python processes will apply output buffering when writing to stdout unless you tell it otherwise. The bad timings are almost certainly related to output buffering between mozharness and whatever it is running.
I should mention fully unbuffered I/O is typically bad because it incurs too much overhead. We prefer line buffering for the pipe on the parent process and unbuffered stdout/stderr on the child process.
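To illustrate the setup described above, here is a minimal, self-contained Python sketch (a hypothetical child script, not the actual mozharness code): the parent reads the child's pipe with line buffering while the child's stdout is unbuffered, so each line becomes readable (and can be timestamped accurately) as soon as the child emits it, instead of arriving in one late burst.

```python
import subprocess
import sys

# Hypothetical child standing in for the harness's child process; without
# the flush (or running it with -u) its output would sit in a block buffer.
child_code = r"""
import sys
for i in range(3):
    print("tick", i)
    sys.stdout.flush()
"""

# bufsize=1 with text mode gives the parent a line-buffered pipe, so each
# line is delivered to the reading loop as soon as the child writes it.
proc = subprocess.Popen(
    [sys.executable, "-u", "-c", child_code],  # -u: unbuffered child stdout
    stdout=subprocess.PIPE,
    bufsize=1,
    text=True,
)
lines = [line.rstrip() for line in proc.stdout]
proc.wait()
print(lines)
```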
(In reply to Mike de Boer [:mikedeboer] from comment #52)
> Regardless, after I added more logging, I found this to be more interesting:
> 
> The last info line is 'waiting for browser load...' which is what it's stuck
> on for 32 seconds, apparently.

It sounds like either the browser doesn't load, it loads content in the wrong browser, or the browser is loaded by the time you invoke browserLoaded() and so no load event fires and we wait forever despite the content being already loaded.
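A plain-JS sketch of that third failure mode, using a mock browser object rather than the real BrowserTestUtils API (the helper names here are made up for illustration): a helper that only waits for a future "load" event hangs forever when the load has already finished, while a helper that checks the ready state first resolves immediately.

```javascript
// Mock stand-in for a <browser> element; not the real XUL/DOM object.
function makeMockBrowser() {
  return {
    readyState: "loading",
    listeners: [],
    addEventListener(type, cb) { this.listeners.push(cb); },
    fireLoad() {
      this.readyState = "complete";
      this.listeners.forEach(cb => cb());
    },
  };
}

// Racy helper: resolves only on a *future* "load" event.
function browserLoadedRacy(browser) {
  return new Promise(resolve => browser.addEventListener("load", resolve));
}

// Safer helper: resolves immediately if the load already completed.
function browserLoadedSafe(browser) {
  if (browser.readyState === "complete") {
    return Promise.resolve("already-loaded");
  }
  return new Promise(resolve =>
    browser.addEventListener("load", () => resolve("load-event")));
}

const browser = makeMockBrowser();
browser.fireLoad(); // the load finishes *before* the test starts waiting

// browserLoadedRacy(browser) would never resolve here -> test timeout.
browserLoadedSafe(browser).then(result => console.log(result));
```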
(In reply to :Gijs Kruitbosch from comment #56)
> It sounds like either the browser doesn't load, it loads content in the
> wrong browser, or the browser is loaded by the time you invoke
> browserLoaded() and so no load event fires and we wait forever despite the
> content being already loaded.

Yeah, that's why I ported the test over to BrowserTestUtils.jsm wholesale to see if the mechanisms used in there are more stable.
The funny thing is, all the screenshots in that trypush show different things going on, so it's really not clear what is happening. It's possible there's some odd combination of it hitting the timeouts and something else failing (but I've also not checked all the logs in that try push to see if the logs indicate different states have been reached).
I was thinking about why on earth this is so frequent on e10s and not non-e10s, which resulted in this mind-fart:
I'm assuming that the mochitest timeout timer runs in the chrome/parent process, using the built-in JS timer implementation. Now the content process is blocked whilst doing I/O-heavy stuff/garbage collection (still a suspect in this case), which doesn't affect the timer in the parent process, so it reports a timeout. But if the test had been running in the same process as the timer, the blocking would have stretched the timer's _real_ duration too, and the test would have finished within the set maximum.
Wouldn't this be a good explanation for why we're seeing more and more timeouts for tests that open & close many windows/docShells?
Flags: needinfo?(josh)
Flags: needinfo?(jmaher)
(In reply to Mike de Boer [:mikedeboer] from comment #60)
> Wouldn't this be a good reason why we're seeing more and more timeouts
> occurring for tests that open & close many windows/ docShells?

Or better: batches of tests/suites, because I'm also seeing that various tests run in succession have lingering effects on each other in the same browser instance.
on linux64 debug [1] I see this test passing in ~20 seconds (some cases are 17/18 seconds, but overall we finish in <22 seconds after looking at a few logs):
[task 2016-11-16T11:46:21.094510Z] 11:46:21     INFO - TEST-OK | browser/components/sessionstore/test/browser_privatetabs.js | took 20083ms

We are timing out at 45 seconds; to me that seems like something beyond being off by a few seconds due to the timer running in chrome vs. content. I suspect we are depending on something in the browser, or we are in a bad state from a previous test.

here is where we timeout in browser-chrome tests:
https://dxr.mozilla.org/mozilla-central/source/testing/mochitest/browser-test.js#3

and here is the settimeout that forces us to timeout after 45 seconds:
https://dxr.mozilla.org/mozilla-central/source/testing/mochitest/browser-test.js#787

I am not sure if browser-test.js is chrome or content; I suspect it runs in the chrome process, as it is defined in the .xul overlay:
https://dxr.mozilla.org/mozilla-central/source/testing/mochitest/browser-test-overlay.xul#11


looking at the tests in sessionstore, I see these tests have trouble running solo and multiple times in the same browser session (a hint that they are leaking stuff or are poorly written):
browser/components/sessionstore/test/browser_crashedTabs.js (run prior to browser_privatetabs.js)
browser/components/sessionstore/test/browser_sessionStoreContainer.js
browser/components/sessionstore/test/browser_662812.js (run prior to browser_privatetabs.js)

there are a lot of tests in the browser.ini file:
https://dxr.mozilla.org/mozilla-central/source/browser/components/sessionstore/test/browser.ini#81

specifically the tests that are e10s-only; one option to consider is that these e10s-only tests are causing problems with the browser, which would explain the lack of failures on non-e10s.


[1] https://public-artifacts.taskcluster.net/ZeNhy56hQt66wox2kQZLew/0/public/logs/live_backing.log
Flags: needinfo?(jmaher)
All that may very well be true, but it strikes me as odd that extending the timeout makes the intermittent go away (https://treeherder.mozilla.org/#/jobs?repo=try&revision=cb67c950aace&selectedJob=31009938 -- just retriggered several more times to be absolutely sure that statement is correct).
I'll look at the tests you mentioned to see if anything stands out to me.
ok, maybe I overlooked something in the original statement: maybe the ~20 seconds runtime is calculated in the content process vs. the chrome process, and maybe it really takes more time to run in the normal case. I could see how we could have a runtime of ~30 seconds normally, but I still don't think the timer would be that far off between the content and chrome processes.

I see a few other changes in that patch outside of extending the timeout, and looking at previous try pushes you did, they didn't help. I also see a few instances of browser_privatetabs.js failing on the retriggers you did, although I will admit they seem less frequent than without the extended timeout.

I looked at many of the logs; I see some of the tests taking 27-30 seconds to complete. I cannot find a pattern here :(
https://public-artifacts.taskcluster.net/cza14ay1Siu9b68UJpAykA/0/public/logs/live_backing.log (from try push in comment 63) has a run that passes after >45 seconds:

TEST-OK | browser/components/sessionstore/test/browser_privatetabs.js | took 53589ms

Very strange...but it looks like increasing the timeout might be effective.
/me puzzled just as much.

I was wondering whether it'd be a good idea to add a `window.QueryInterface(Ci.nsIInterfaceRequestor).getInterface(Ci.nsIDOMWindowUtils).garbageCollect();` call at the end of each docShell-heavy test, for good measure?
Nothing to add here.
Flags: needinfo?(josh)
any further ideas to explore here?  This is still an issue (currently the #2 intermittent orange if you ignore fixed issues).

is it possible that we are having issues with previous tests running in the session?  Maybe if we run this test first in the directory it might yield different results?
The test run pointed to in comment #65 has:

[task 2016-11-14T12:00:08.836465Z] 12:00:08     INFO - [Parent 1247] WARNING: No inner window available!: file /home/worker/workspace/build/src/dom/base/nsGlobalWindow.cpp, line 9889

a gazillion times (for about 10 seconds straight). I suspect comment #66 has a point and there's a GC pause or whatever. It looks like this spiked upwards 2 weeks ago (previously ~45-55 failures a week, now 140 a week). It would be useful to understand if a specific cset triggered that spike. Joel, is there any chance you can bisect that spike?

(In reply to Mike de Boer [:mikedeboer] from comment #63)
> All that may very well be true, but it strikes me as odd that extending the
> timeout makes the intermittent go away
> (https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=cb67c950aace&selectedJob=31009938 -- just retriggered
> several more times to be absolutely sure that statement is correct).
> I'll look at the tests you mentioned to see if anything stands out to me.

That is indeed odd. If this is the #2 intermittent then I guess we should go ahead and extend the timeout, uncomfortable as I am with doing that (so yeah, r=me on the patch after all, I guess).

(In reply to Joel Maher ( :jmaher) from comment #64)
> I also see a
> few instances on the retriggers you did of browser_privatetabs.js failing-
> although I will admit they seem to be less frequent than without the
> extended timeout.

I don't see these. Can you point out what you mean?
Flags: needinfo?(jmaher)
Comment on attachment 8809796 [details]
Bug 1252376 - refactor browser_privatetabs.js to use BrowserTestUtils instead of using less deterministic methods and make sure to garbage collect, forcibly, after each sessionstore mochitest ended.

https://reviewboard.mozilla.org/r/92314/#review94694

r+ per bug discussion
Attachment #8809796 - Flags: review- → review+
My comment in 64 was in relation to the current state of a try push.
Flags: needinfo?(jmaher)
I'm repeating my comment from bug 1285475 here:

I'm a bit wary about this... I can up the timeout on each and every sessionstore test, but I'd also like to point out that this is something that started occurring more frequently not too long ago.
Sessionstore mochitests inherently put heavy traffic on the platform side of things, by which I mean DOM/XPCOM: a very large number of windows, tabs, and browser elements instantiating multiple docShells, plus hard disk I/O when the sessionstore files are flushed to disk. Josh mentioned seeing a large number of ++/-- GC messages when studying the logs of intermittent bug 1252376: did things change recently on the platform side that may affect the timing of the GC cycles? Is there a project running at the moment that intends to optimize memory usage, has tweaked the GC behavior somewhat (or papercut by papercut), and might result in an increase of intermittents here?
Who would know?
I did try to narrow this down by retriggers and was not able to find a cause, I looked from Nov 4-9th with no luck, retriggers kept showing a high failure rate.
See Also: → 1253156
Comment on attachment 8809796 [details]
Bug 1252376 - refactor browser_privatetabs.js to use BrowserTestUtils instead of using less deterministic methods and make sure to garbage collect, forcibly, after each sessionstore mochitest ended.

See comment 79 for the Try link.

So I pushed this patch to try and re-triggered it 20 times to have it fail. It didn't, so I'm quite confident that adding the `garbageCollect()` call is a winner. Of course, I'll retrigger all the things another gazillion times, but it looks good so far.

Now, I also added a fix for a JS error I was seeing in the logs quite frequently, coming from PrivateBrowsingUtils, and rewrote the browser_privatetabs.js test to use BTU instead of the homegrown stuff. My longer-term goal is to move all sessionstore tests to use BTU and strip head.js of its duplicate mechanisms.
Attachment #8809796 - Flags: review+ → review?(gijskruitbosch+bugs)
Comment on attachment 8809796 [details]
Bug 1252376 - refactor browser_privatetabs.js to use BrowserTestUtils instead of using less deterministic methods and make sure to garbage collect, forcibly,  after each sessionstore mochitest ended.

https://reviewboard.mozilla.org/r/92314/#review95914

r=me for this version as well. Note two points below.

::: browser/components/sessionstore/test/browser_privatetabs.js:107
(Diff revision 3)
>  
>    // Create a new window to attach our frame script to.
> -  let win = yield promiseNewWindowLoaded({private: true});
> +  let win = yield BrowserTestUtils.openNewBrowserWindow({private: true});
>  
>    // Create a new tab in the new window that will load the frame script.
> -  let tab = win.gBrowser.addTab("about:mozilla");
> +  let tab = yield BrowserTestUtils.openNewForegroundTab(win.gBrowser, "about:mozilla");

This makes the tab selected whereas the previous code didn't. Not sure if it matters, but something to keep in mind? Similar things happen in the previous replacements in this file (I reviewed from the bottom to the top in this file).

::: toolkit/modules/PrivateBrowsingUtils.jsm:96
(Diff revision 3)
>      let win = Services.appShell.hiddenPrivateDOMWindow;
>      function isNotLoaded() {
>        return ["complete", "interactive"].indexOf(win.document.readyState) == -1;
>      }
>      if (isNotLoaded()) {
>        setTimeout(function poll() {

Ew. File a followup to swap this to using readystatechange events?
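A hedged sketch of what that follow-up could look like, using a mock document rather than the real hidden private DOM window (`whenLoaded` and the mock are illustrative names, not the actual PrivateBrowsingUtils code): resolve immediately if the ready state is already past "loading", otherwise listen for readystatechange instead of polling with setTimeout.

```javascript
// Mock stand-in for the hidden window's document; not the real DOM object.
function makeMockDocument() {
  return {
    readyState: "loading",
    listeners: [],
    addEventListener(type, cb) { this.listeners.push(cb); },
    removeEventListener(type, cb) {
      this.listeners = this.listeners.filter(l => l !== cb);
    },
    setReadyState(state) {
      this.readyState = state;
      this.listeners.forEach(cb => cb()); // fires "readystatechange"
    },
  };
}

// Event-driven replacement for the setTimeout poll.
function whenLoaded(doc) {
  if (doc.readyState === "complete" || doc.readyState === "interactive") {
    return Promise.resolve(doc.readyState);
  }
  return new Promise(resolve => {
    function onChange() {
      if (doc.readyState === "loading") {
        return;
      }
      doc.removeEventListener("readystatechange", onChange);
      resolve(doc.readyState);
    }
    doc.addEventListener("readystatechange", onChange);
  });
}

const doc = makeMockDocument();
whenLoaded(doc).then(state => console.log("ready:", state));
doc.setReadyState("interactive"); // event fires, promise resolves
```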
Attachment #8809796 - Flags: review?(gijskruitbosch+bugs) → review+
Blocks: 1320663
(In reply to :Gijs Kruitbosch from comment #83)
> This makes the tab selected whereas the previous code didn't. Not sure if it
> matters, but something to keep in mind? Similar things happen in the
> previous replacements in this file (I reviewed from the bottom to the top in
> this file).

That doesn't seem to matter in this case, indeed. I'll keep it in mind for future cases.
Pushed by mdeboer@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/ba2692e1cbcf
refactor browser_privatetabs.js to use BrowserTestUtils instead of using less deterministic methods and make sure to garbage collect, forcibly, after each sessionstore mochitest ended. r=Gijs
https://hg.mozilla.org/mozilla-central/rev/ba2692e1cbcf
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 53
Hmm, so this was merged to mozilla-central this morning, and something that was merged to mozilla-central this morning is causing frequent intermittent leaks of about:privatebrowsing (bug 1321016, bug 1321017, bug 1321055, bug 1321057, bug 1321058, bug 1321188), and now aurora is intermittently leaking it too after this was merged there, and beta leaked it like crazy until this was backed out?

Quite a coincidence!
Mike, should we back this out considering the leaks? :-(
Flags: needinfo?(mdeboer)
Looking at the patch, I have no idea how this would trigger leaks. :-\
What I can think of is that
1. Having JS errors in PrivateBrowsingUtils.jsm is expected behavior?
2. Doing a GC after each test is leaky? I know it _sounds_ crazy, but...

But yeah, it's better to back this out and figure out what's leaking for sure.
Flags: needinfo?(mdeboer)
Blocks: 1321016
Mass wontfix for bugs affecting firefox 52.