Open Bug 1252376 Opened 4 years ago Updated 2 years ago

Intermittent browser_privatetabs.js | Test timed out

Categories

(Firefox :: Session Restore, defect, P3)

defect

Tracking

()

REOPENED
Firefox 53
Tracking Status
firefox48 --- wontfix
firefox49 --- wontfix
firefox50 --- wontfix
firefox51 --- wontfix
firefox52 --- wontfix
firefox53 --- affected

People

(Reporter: philor, Assigned: mikedeboer)

References

Details

(Keywords: intermittent-failure)

Attachments

(1 file)

Bulk assigning P3 to all open intermittent bugs without a priority set in Firefox components per bug 1298978.
Priority: -- → P3
This frequent intermittent occurs almost always on Debug and e10s; usually on Linux, sometimes on Linux64.

The timeout occurs after 45 seconds. When the test succeeds on Linux Debug, it usually runs in 20 to 30 seconds, so increasing the timeout may not help... the test may simply be hanging.

:mikedeboer -- Can you find someone to have a look at this?
Flags: needinfo?(mdeboer)
Assignee: nobody → mdeboer
Status: NEW → ASSIGNED
Flags: needinfo?(mdeboer)
Comment on attachment 8809796 [details]
Bug 1252376 - refactor browser_privatetabs.js to use BrowserTestUtils instead of using less deterministic methods and make sure to garbage collect, forcibly, after each sessionstore mochitest ended.

https://reviewboard.mozilla.org/r/92314/#review92340

The logs I'm looking at show something like this:


https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=6419986#L19392

16:43:36     INFO - TEST-START | browser/components/sessionstore/test/browser_privatetabs.js

...

16:44:24     INFO - Buffered messages logged at 16:43:53
16:44:24     INFO - TEST-PASS | browser/components/sessionstore/test/browser_privatetabs.js | tab considered private - 
16:44:24     INFO - Buffered messages logged at 16:43:55
16:44:24     INFO - TEST-PASS | browser/components/sessionstore/test/browser_privatetabs.js | no windows to restore - 
16:44:24     INFO - Leaving test bound 
16:44:24     INFO - Entering test bound 
16:44:24     INFO - Buffered messages finished

and then the test times out at 16:44:24

This means that after this bit of the test was entered:

https://dxr.mozilla.org/mozilla-central/rev/d38d06f85ef59c5dbb5d4a1a8d895957a78714de/browser/components/sessionstore/test/browser_privatetabs.js#104

we do nothing for about 30 seconds.

That doesn't strike me as an issue with the timeout being too short. I think there's an issue with the test, where we somehow never reach `ok(state.isPrivate, "tab considered private");`

the screenshot shows a private window, so clearly we managed to load the private window and about:privatebrowsing, and it also shows a background tab with about:mozilla - but perhaps there's a race that causes us not to proceed past either the load of that browser or the tab state flush? Adding more logging to find out which it is would narrow down what's going wrong more clearly, and would also prove more conclusively whether we're just a few seconds too slow (which right now seems unlikely, considering the two previous bound tests complete in a measly 15-20 seconds).
Attachment #8809796 - Flags: review?(gijskruitbosch+bugs) → review-
The fact that https://treeherder.mozilla.org/logviewer.html#?job_id=31117989&repo=try#L19416 doesn't show any test-related output until almost 60s after the test started (36:59 until 37:53) is highly suspicious. It may be getting trapped in garbage collection, given the huge number of ---DOMWINDOW and ---DOCSHELL lines preceding it after the start of the test.
(In reply to Josh Matthews [:jdm] from comment #47)
> The fact that
> https://treeherder.mozilla.org/logviewer.
> html#?job_id=31117989&repo=try#L19416 doesn't show any test-related output
> until almost 60s after the test started (36:59 until 37:53) is highly
> suspicious. It may be getting trapped in garbage collection, given the huge
> number of ---DOMWINDOW and ---DOCSHELL lines preceding it after the start of
> the test.

The log timestamps are a lie: the log messages are normally buffered (unless you request SimpleTest.requestCompleteLog) and emitted in batches, so the leading log timestamps are wrong.

The actual timestamps are the ones in lines like:

 04:37:53     INFO - Buffered messages logged at 04:36:59
 04:37:53     INFO - Entering test bound cleanup
 04:37:53     INFO - Forgetting closed tabs
 04:37:53     INFO - Leaving test bound cleanup
 04:37:53     INFO - Entering test bound 
 04:37:53     INFO - Setting up public tab
 04:37:53     INFO - Buffered messages logged at 04:37:01
etc.

which show that it has definitely been doing stuff in the meantime.
I can confirm with another sighting in https://treeherder.mozilla.org/#/jobs?repo=try&revision=59fefe7b32de&selectedJob=31118005

12:39:36     INFO - TEST-START | browser/components/sessionstore/test/browser_privatetabs.js
... one minute until the very first task called 'cleanup' is run ...
12:40:32     INFO - Entering test bound cleanup

During this minute there are a lot of messages, just like Josh describes.

Is there some way to exclude time spent GC'ing from the time charged to the test? Basically, also to have the GC behavior logged separately so that it can be analyzed. I'm n-i'ing you, Josh, but please feel free to defer...
(In reply to :Gijs Kruitbosch from comment #48)
> The log timestamps are a lie, because the logs are normally omitted (unless
> you request SimpleTest.requestCompleteLog) and therefore batched and
> therefore the leading log timestamps are wrong.

But why is

12:40:30     INFO - TEST-INFO | started process screentopng

also only started after 1 minute? It is apparently not part of the buffered log messages...
(In reply to Mike de Boer [:mikedeboer] from comment #50)
> (In reply to :Gijs Kruitbosch from comment #48)
> > The log timestamps are a lie, because the logs are normally omitted (unless
> > you request SimpleTest.requestCompleteLog) and therefore batched and
> > therefore the leading log timestamps are wrong.
> 
> But why is
> 
> 12:40:30     INFO - TEST-INFO | started process screentopng
> 
> also only started after 1 minute? It is apparently not part of the buffered
> log messages...

Because that's not the test, that's the framework going "huh, your test died, lemme make a screenshot for you". That only happens when the test is dead, i.e. after it's run. The fact that it shows up *before* it regurgitates the buffered test messages is dumb, but just an artifact of the test runner, AIUI.
Regardless, after I added more logging, I found this to be more interesting:

[task 2016-11-15T12:40:32.879612Z] 12:40:32     INFO - Buffered messages logged at 12:39:48
[task 2016-11-15T12:40:32.879920Z] 12:40:32     INFO - TEST-PASS | browser/components/sessionstore/test/browser_privatetabs.js | no windows to restore - 
[task 2016-11-15T12:40:32.880076Z] 12:40:32     INFO - Leaving test bound 
[task 2016-11-15T12:40:32.880226Z] 12:40:32     INFO - Entering test bound 
[task 2016-11-15T12:40:32.880395Z] 12:40:32     INFO - forgetting closed windows...
[task 2016-11-15T12:40:32.880555Z] 12:40:32     INFO - opening new window...
[task 2016-11-15T12:40:32.880736Z] 12:40:32     INFO - Buffered messages logged at 12:39:57
[task 2016-11-15T12:40:32.880885Z] 12:40:32     INFO - adding new tab...
[task 2016-11-15T12:40:32.881064Z] 12:40:32     INFO - Buffered messages logged at 12:39:58
[task 2016-11-15T12:40:32.881230Z] 12:40:32     INFO - waiting for browser load...

and then it times out.

Which seems to tell me it took 9 seconds to open a new private window. Adding a tab is the next item on the agenda and is part of the next batch of messages.
The last info line is 'waiting for browser load...', which is what it's stuck on for 32 seconds, apparently.
The times in square brackets come from run-task. Those times are obtained whenever the process that is running emits a line of output. run-task uses unbuffered I/O.

The timestamps that follow are from mozharness. I eventually want to remove those because they are redundant with run-task.

When mozharness logs process output, the time is obtained when it sees a new line. Unfortunately, there are still a number of places in mozharness where buffered I/O is being used on child processes. So timestamps may not be accurate. I suspect that's what's happening here. Python will spawn child processes with a buffered pipe unless you tell it not to. On top of that, Python processes will apply output buffering when writing to stdout unless you tell it otherwise. The bad timings are almost certainly related to output buffering between mozharness and whatever it is running.
I should mention fully unbuffered I/O is typically bad because it incurs too much overhead. We prefer line buffering for the pipe on the parent process and unbuffered stdout/stderr on the child process.
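To illustrate the setup described above, here is a minimal, self-contained Python sketch (a hypothetical child script, not the actual mozharness code): the parent reads the child's pipe with line buffering while the child's stdout is unbuffered, so each line becomes readable (and can be timestamped accurately) as soon as the child emits it, instead of arriving in one late burst.

```python
import subprocess
import sys

# Hypothetical child standing in for the harness's child process; without
# the flush (or running it with -u) its output would sit in a block buffer.
child_code = r"""
import sys
for i in range(3):
    print("tick", i)
    sys.stdout.flush()
"""

# bufsize=1 with text mode gives the parent a line-buffered pipe, so each
# line is delivered to the reading loop as soon as the child writes it.
proc = subprocess.Popen(
    [sys.executable, "-u", "-c", child_code],  # -u: unbuffered child stdout
    stdout=subprocess.PIPE,
    bufsize=1,
    text=True,
)
lines = [line.rstrip() for line in proc.stdout]
proc.wait()
print(lines)
```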
(In reply to Mike de Boer [:mikedeboer] from comment #52)
> Regardless, after I added more logging, I found this to be more interesting:
> 
> The last info line is 'waiting for browser load...' which is what it's stuck
> on for 32 seconds, apparently.

It sounds like either the browser doesn't load, it loads content in the wrong browser, or the browser is loaded by the time you invoke browserLoaded() and so no load event fires and we wait forever despite the content being already loaded.
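A plain-JS sketch of that third failure mode, using a mock browser object rather than the real BrowserTestUtils API (the helper names here are made up for illustration): a helper that only waits for a future "load" event hangs forever when the load has already finished, while a helper that checks the ready state first resolves immediately.

```javascript
// Mock stand-in for a <browser> element; not the real XUL/DOM object.
function makeMockBrowser() {
  return {
    readyState: "loading",
    listeners: [],
    addEventListener(type, cb) { this.listeners.push(cb); },
    fireLoad() {
      this.readyState = "complete";
      this.listeners.forEach(cb => cb());
    },
  };
}

// Racy helper: resolves only on a *future* "load" event.
function browserLoadedRacy(browser) {
  return new Promise(resolve => browser.addEventListener("load", resolve));
}

// Safer helper: resolves immediately if the load already completed.
function browserLoadedSafe(browser) {
  if (browser.readyState === "complete") {
    return Promise.resolve("already-loaded");
  }
  return new Promise(resolve =>
    browser.addEventListener("load", () => resolve("load-event")));
}

const browser = makeMockBrowser();
browser.fireLoad(); // the load finishes *before* the test starts waiting

// browserLoadedRacy(browser) would never resolve here -> test timeout.
browserLoadedSafe(browser).then(result => console.log(result));
```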
(In reply to :Gijs Kruitbosch from comment #56)
> It sounds like either the browser doesn't load, it loads content in the
> wrong browser, or the browser is loaded by the time you invoke
> browserLoaded() and so no load event fires and we wait forever despite the
> content being already loaded.

Yeah, that's why I ported the test over to BrowserTestUtils.jsm wholesale to see if the mechanisms used in there are more stable.
The funny thing is, all the screenshots in that trypush show different things going on, so it's really not clear what is happening. It's possible there's some odd combination of it hitting the timeouts and something else failing (but I've also not checked all the logs in that try push to see if the logs indicate different states have been reached).
I was thinking about why on earth this is so frequent on e10s and not non-e10s, which resulted in this mind-fart:
I'm assuming that the mochitest timeout timer runs in the chrome/parent process, using the built-in JS timer implementation. Now the content process is blocked whilst doing I/O-heavy stuff/garbage collection (still a suspect in this case), which doesn't affect the timer in the parent process, so it reports a timeout. But if the test had been running in the same process as the timer, the blocking would have stretched the timer's _real_ duration too, and the test would have finished within the set maximum.
Wouldn't this be a good explanation for why we're seeing more and more timeouts for tests that open & close many windows/docShells?
Flags: needinfo?(josh)
Flags: needinfo?(jmaher)
(In reply to Mike de Boer [:mikedeboer] from comment #60)
> Wouldn't this be a good reason why we're seeing more and more timeouts
> occurring for tests that open & close many windows/ docShells?

Or better: batches of tests/suites, because I'm also seeing that various tests run in succession have lingering effects on each other in the same browser instance.
on linux64 debug [1] I see this test passing in ~20 seconds (some cases are 17/18 seconds, but overall we finish in <22 seconds after looking at a few logs):
[task 2016-11-16T11:46:21.094510Z] 11:46:21     INFO - TEST-OK | browser/components/sessionstore/test/browser_privatetabs.js | took 20083ms

We are timing out at 45 seconds; to me that seems like something beyond being off by a few seconds due to the timer running in chrome vs. content. I suspect we are depending on something in the browser, or we are in a bad state from a previous test.

here is where we timeout in browser-chrome tests:
https://dxr.mozilla.org/mozilla-central/source/testing/mochitest/browser-test.js#3

and here is the settimeout that forces us to timeout after 45 seconds:
https://dxr.mozilla.org/mozilla-central/source/testing/mochitest/browser-test.js#787

I am not sure if browser-test.js is chrome or content; I suspect it runs in the chrome process, as it is defined in the .xul overlay:
https://dxr.mozilla.org/mozilla-central/source/testing/mochitest/browser-test-overlay.xul#11


looking at the tests in sessionstore, I see these tests have trouble running solo and multiple times in the same browser session (a hint that they are leaking stuff or are poorly written):
browser/components/sessionstore/test/browser_crashedTabs.js (run prior to browser_privatetabs.js)
browser/components/sessionstore/test/browser_sessionStoreContainer.js
browser/components/sessionstore/test/browser_662812.js (run prior to browser_privatetabs.js)

there are a lot of tests in the browser.ini file:
https://dxr.mozilla.org/mozilla-central/source/browser/components/sessionstore/test/browser.ini#81

specifically the tests that are e10s-only; one option to consider is that these e10s-only tests are causing problems with the browser, which would explain the lack of failures on non-e10s.


[1] https://public-artifacts.taskcluster.net/ZeNhy56hQt66wox2kQZLew/0/public/logs/live_backing.log
Flags: needinfo?(jmaher)
All that may very well be true, but it strikes me as odd that extending the timeout makes the intermittent go away (https://treeherder.mozilla.org/#/jobs?repo=try&revision=cb67c950aace&selectedJob=31009938 -- just retriggered several more times to be absolutely sure that statement is correct).
I'll look at the tests you mentioned to see if anything stands out to me.
ok, maybe I overlooked something in the original statement: maybe the ~20 seconds runtime is calculated in the content process vs. the chrome process, and maybe it really takes more time to run in the normal case. I could see how we could have a runtime of ~30 seconds normally, but I still don't think the timer would be that far off between the content and chrome processes.

I see a few other changes in that patch outside of extending the timeout, and looking at previous try pushes you did, they didn't help. I also see a few instances of browser_privatetabs.js failing on the retriggers you did, although I will admit they seem less frequent than without the extended timeout.

I looked at many of the logs; I see some of the tests taking 27-30 seconds to complete. I cannot find a pattern here :(
https://public-artifacts.taskcluster.net/cza14ay1Siu9b68UJpAykA/0/public/logs/live_backing.log (from try push in comment 63) has a run that passes after >45 seconds:

TEST-OK | browser/components/sessionstore/test/browser_privatetabs.js | took 53589ms

Very strange...but it looks like increasing the timeout might be effective.
/me puzzled just as much.

I was wondering whether it'd be a good idea to add a `window.QueryInterface(Ci.nsIInterfaceRequestor).getInterface(Ci.nsIDOMWindowUtils).garbageCollect();` call at the end of each docShell-heavy test, for good measure?
Nothing to add here.
Flags: needinfo?(josh)
any further ideas to explore here?  This is still an issue (currently the #2 intermittent orange if you ignore fixed issues).

is it possible that we are having issues with previous tests running in the session?  Maybe if we run this test first in the directory it might yield different results?
The test run pointed to in comment #65 has:

[task 2016-11-14T12:00:08.836465Z] 12:00:08     INFO - [Parent 1247] WARNING: No inner window available!: file /home/worker/workspace/build/src/dom/base/nsGlobalWindow.cpp, line 9889

a gazillion times (for about 10 seconds straight). I suspect comment #66 has a point and there's a GC pause or whatever. It looks like this spiked upwards 2 weeks ago (previously ~45-55 failures a week, now 140 a week). It would be useful to understand if a specific cset triggered that spike. Joel, is there any chance you can bisect that spike?

(In reply to Mike de Boer [:mikedeboer] from comment #63)
> All that may very well be true, but it strikes me as odd that extending the
> timeout makes the intermittent go away
> (https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=cb67c950aace&selectedJob=31009938 -- just retriggered
> several more times to be absolutely sure that statement is correct).
> I'll look at the tests you mentioned to see if anything stands out to me.

That is indeed odd. If this is the #2 intermittent then I guess we should go ahead and extend the timeout, uncomfortable as I am with doing that (so yeah, r=me on the patch after all, I guess).

(In reply to Joel Maher ( :jmaher) from comment #64)
> I also see a
> few instances on the retriggers you did of browser_privatetabs.js failing-
> although I will admit they seem to be less frequent than without the
> extended timeout.

I don't see these. Can you point out what you mean?
Flags: needinfo?(jmaher)
Comment on attachment 8809796 [details]
Bug 1252376 - refactor browser_privatetabs.js to use BrowserTestUtils instead of using less deterministic methods and make sure to garbage collect, forcibly, after each sessionstore mochitest ended.

https://reviewboard.mozilla.org/r/92314/#review94694

r+ per bug discussion
Attachment #8809796 - Flags: review- → review+
My comment in 64 was in relation to the current state of a try push.
Flags: needinfo?(jmaher)
I'm repeating my comment from bug 1285475 here:

I'm a bit wary about this... I can up the timeout on each and every sessionstore test, but I'd also like to point out that this is something that started occurring more frequently not too long ago.
Sessionstore mochitests inherently put heavy traffic on the platform side of things, by which I mean DOM/XPCOM: a very large number of windows, tabs, and browser elements instantiating multiple docShells, plus hard disk I/O when the sessionstore files are flushed to disk. Josh mentioned seeing a large number of ++/-- GC messages when studying the logs of intermittent bug 1252376: did things change recently on the platform side that may affect the timing of the GC cycles? Is there a project running at the moment that intends to optimize memory usage, has tweaked the GC behavior somewhat (or papercut by papercut), and might result in an increase of intermittents here?
Who would know?
I did try to narrow this down by retriggers and was not able to find a cause, I looked from Nov 4-9th with no luck, retriggers kept showing a high failure rate.
See Also: → 1253156
Comment on attachment 8809796 [details]
Bug 1252376 - refactor browser_privatetabs.js to use BrowserTestUtils instead of using less deterministic methods and make sure to garbage collect, forcibly, after each sessionstore mochitest ended.

See comment 79 for the Try link.

So I pushed this patch to try and re-triggered it 20 times to have it fail. It didn't, so I'm quite confident that adding the `garbageCollect()` call is a winner. Of course, I'll retrigger all the things another gazillion times, but it looks good so far.

Now, I also added a fix for a JS error I was seeing in the logs quite frequently, coming from PrivateBrowsingUtils, and rewrote the browser_privatetabs.js test to use BTU instead of the homegrown stuff. My longer-term goal is to move all sessionstore tests to use BTU and strip head.js of its duplicate mechanisms.
Attachment #8809796 - Flags: review+ → review?(gijskruitbosch+bugs)
Comment on attachment 8809796 [details]
Bug 1252376 - refactor browser_privatetabs.js to use BrowserTestUtils instead of using less deterministic methods and make sure to garbage collect, forcibly,  after each sessionstore mochitest ended.

https://reviewboard.mozilla.org/r/92314/#review95914

r=me for this version as well. Note two points below.

::: browser/components/sessionstore/test/browser_privatetabs.js:107
(Diff revision 3)
>  
>    // Create a new window to attach our frame script to.
> -  let win = yield promiseNewWindowLoaded({private: true});
> +  let win = yield BrowserTestUtils.openNewBrowserWindow({private: true});
>  
>    // Create a new tab in the new window that will load the frame script.
> -  let tab = win.gBrowser.addTab("about:mozilla");
> +  let tab = yield BrowserTestUtils.openNewForegroundTab(win.gBrowser, "about:mozilla");

This makes the tab selected whereas the previous code didn't. Not sure if it matters, but something to keep in mind? Similar things happen in the previous replacements in this file (I reviewed from the bottom to the top in this file).

::: toolkit/modules/PrivateBrowsingUtils.jsm:96
(Diff revision 3)
>      let win = Services.appShell.hiddenPrivateDOMWindow;
>      function isNotLoaded() {
>        return ["complete", "interactive"].indexOf(win.document.readyState) == -1;
>      }
>      if (isNotLoaded()) {
>        setTimeout(function poll() {

Ew. File a followup to swap this to using readystatechange events?
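A hedged sketch of what that follow-up could look like, using a mock document rather than the real hidden private DOM window (`whenLoaded` and the mock are illustrative names, not the actual PrivateBrowsingUtils code): resolve immediately if the ready state is already past "loading", otherwise listen for readystatechange instead of polling with setTimeout.

```javascript
// Mock stand-in for the hidden window's document; not the real DOM object.
function makeMockDocument() {
  return {
    readyState: "loading",
    listeners: [],
    addEventListener(type, cb) { this.listeners.push(cb); },
    removeEventListener(type, cb) {
      this.listeners = this.listeners.filter(l => l !== cb);
    },
    setReadyState(state) {
      this.readyState = state;
      this.listeners.forEach(cb => cb()); // fires "readystatechange"
    },
  };
}

// Event-driven replacement for the setTimeout poll.
function whenLoaded(doc) {
  if (doc.readyState === "complete" || doc.readyState === "interactive") {
    return Promise.resolve(doc.readyState);
  }
  return new Promise(resolve => {
    function onChange() {
      if (doc.readyState === "loading") {
        return;
      }
      doc.removeEventListener("readystatechange", onChange);
      resolve(doc.readyState);
    }
    doc.addEventListener("readystatechange", onChange);
  });
}

const doc = makeMockDocument();
whenLoaded(doc).then(state => console.log("ready:", state));
doc.setReadyState("interactive"); // event fires, promise resolves
```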
Attachment #8809796 - Flags: review?(gijskruitbosch+bugs) → review+
Blocks: 1320663
(In reply to :Gijs Kruitbosch from comment #83)
> This makes the tab selected whereas the previous code didn't. Not sure if it
> matters, but something to keep in mind? Similar things happen in the
> previous replacements in this file (I reviewed from the bottom to the top in
> this file).

That doesn't seem to matter in this case, indeed. I'll keep it in mind for future cases.
Pushed by mdeboer@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/ba2692e1cbcf
refactor browser_privatetabs.js to use BrowserTestUtils instead of using less deterministic methods and make sure to garbage collect, forcibly, after each sessionstore mochitest ended. r=Gijs
https://hg.mozilla.org/mozilla-central/rev/ba2692e1cbcf
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 53
Hmm, so this was merged to mozilla-central this morning, and something that was merged to mozilla-central this morning is causing frequent intermittent leaks of about:privatebrowsing (bug 1321016, bug 1321017, bug 1321055, bug 1321057, bug 1321058, bug 1321188), and now aurora is intermittently leaking it too after this was merged there, and beta leaked it like crazy until this was backed out?

Quite a coincidence!
Mike, should we back this out considering the leaks? :-(
Flags: needinfo?(mdeboer)
Looking at the patch, I have no idea how this would trigger leaks. :-\
What I can think of is that
1. Having JS errors in PrivateBrowsingUtils.jsm is expected behavior?
2. Doing a GC after each test is leaky? I know it _sounds_ crazy, but...

But yeah, it's better to back this out and figure out what's leaking for sure.
Flags: needinfo?(mdeboer)
Blocks: 1321016
Mass wontfix for bugs affecting firefox 52.