Filed by: wkocher [at] mozilla.com

https://treeherder.mozilla.org/logviewer.html#?job_id=115426926&repo=mozilla-central
https://queue.taskcluster.net/v1/task/FIoHcixNT-yQhI6EGrnP6g/runs/0/artifacts/public/logs/live_backing.log

This started up at some point and is happening pretty frequently now. Backfilling in progress to see where it started:

https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&fromchange=efc0b1525edbd357818dc7195537364e76f709e7&noautoclassify&filter-searchStr=a57eabd38acb3e07f628ebdcedcd59e913363281&group_state=expanded&selectedJob=115277350
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&fromchange=878fe0c8aa814859489c89b284e4f13c64801831&noautoclassify&selectedJob=115287957&filter-searchStr=c43d92eb742998d0e72c85aa629ea07307380d90&group_state=expanded
https://treeherder.mozilla.org/#/jobs?repo=autoland&fromchange=800cbfc0af587ea8a05c86205435ff6c7926952a&noautoclassify&filter-searchStr=2e94b91381024cba5f30c3dc19dfb9f75c1aef2a&group_state=expanded&selectedJob=115331531
I believe this leak shifted when bug 1381929 tried to land.

LOG ERROR | TEST-UNEXPECTED-FAIL | dom/xhr/tests/browser_blobFromFile.js | leaked 2 window(s) until shutdown [url = about:newtab]
https://treeherder.mozilla.org/logviewer.html#?job_id=115524847&repo=autoland

Backing out the move of about:newtab into the child process (previously attempted in bug 1365643) appears to avoid the leak:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=fa911848b26d1d5fe0d05797d889328eb1616c4f&group_state=expanded&filter-searchStr=mochitest%20e10s%20browser
In order to back out bug 1365643, the related bug 1373271 needs to be reverted as well.
Comment on attachment 8887826 [details]
Bug 1382079 part 2 - Backed out changeset c17e14e79d1a (bug 1365643) to fix intermittent leaking about:newtab and bug 1381804 awsy regression.

These reviews back out the changes so that about:newtab runs in the main process again, avoiding the intermittent about:newtab leaks and the awsy regressions. If good, these can be autolanded. And k88hudson, bug 1381929 can be pushed to autoland after these.
Sorry for the multi-review requests, but whoever can get to it first, please. Also sorry for the last-minute change in plans, but the sooner we fix/avoid this family of about:newtab leaks, the less we need to worry about activity-stream updates like bug 1381929 being backed out. Additionally, as erahm pointed out in bug 1376895 comment 1, the large awsy regression tracked in bug 1381804 should be fixed sooner to avoid missing other awsy-regressing changes, and it seems unlikely that bug 1376894 (avoiding creating additional content processes) will be implemented sooner than just taking this pair of backouts.
Comment on attachment 8887825 [details]
Bug 1382079 - Backed out changeset 5fc778386eb1 (Bug 1379762 part 3).

https://reviewboard.mozilla.org/r/158734/#review164364

Clearing until we decide on how we want to approach preloading in the content process.
Comment on attachment 8887826 [details]
Bug 1382079 part 2 - Backed out changeset c17e14e79d1a (bug 1365643) to fix intermittent leaking about:newtab and bug 1381804 awsy regression.

https://reviewboard.mozilla.org/r/158736/#review164366
Any suggestions? I've spent the last couple days trying to track this down; it's quite odd in that it only appeared after inbound merged to central/autoland. I've been unable to reproduce this in my linux64 VM.

The leak isn't specifically triggered by activity stream: even if I remove all the logic for the add-on / bootstrap.js, the leak still happens. So far I've tried:

- cleaning up various window references from the onboarding add-on in case those are creating cycles
- turning off preloading as well as aggressively removing the preloaded browser
- clearing the console in case there were object references there

Now I'm in the process of try-bisecting inbound. ursula and others have also been checking out the cycle collector and other memory stuff but haven't found much useful yet.
bz/Fischer: Any help in figuring out this high-failure test leak?

Going through inbound, it looks like the bug that triggers the leak is bug 1379762.
https://hg.mozilla.org/mozilla-central/pushloghtml?changeset=56c6c81993e6b0f0e094e0c88b42ce3ec566ead9

(in reverse commit order)

> eef6321e071e Gabriel Luong — Bug 1381692 - Reorder and rename the Grid display settings checkboxes. r=micah
fail: https://treeherder.mozilla.org/#/jobs?repo=try&revision=da8c5b7068cdd18c73ce377c1c10a1ea6efad8a9

> 5fc778386eb1 Boris Zbarsky — Bug 1379762 part 3. Don't mess about with load blockers if our document is already in the COMPLETE readyState. r=smaug
fail: https://treeherder.mozilla.org/#/jobs?repo=try&revision=4ce52c6f5d0722e34d24a310dd1e269ed628f3ef

> aa65fd52fbcc Boris Zbarsky — Bug 1379762 part 2. Use a more reliable test to figure out when we can skip firing onload in nsDocumentViewer::LoadComplete. r=smaug
good: https://treeherder.mozilla.org/#/jobs?repo=try&revision=6ce987845eadf42a3f12e2348d466a316d547240

> 68dccd8c1468 Boris Zbarsky — Bug 1379762 part 1. Don't call MediaFeaturesChanged if our override device pixel ratio is set to its current value. r=dbaron
good: https://treeherder.mozilla.org/#/jobs?repo=try&revision=116481f1afa2042ddc3c87b2f64dbe7eea35d45f

That particular commit https://hg.mozilla.org/mozilla-central/rev/5fc778386eb1 involves "load", which maybe suspiciously points to onboarding's "load" event listener:
https://searchfox.org/mozilla-central/source/browser/extensions/onboarding/content/onboarding.js#778

Also potentially suspicious: `new Onboarding(window)`, where window [url = about:newtab] could be the source of the leak?
https://searchfox.org/mozilla-central/source/browser/extensions/onboarding/content/onboarding.js#788

(For my own notes:
> git sh aa083a9c0311 | git apply -C 0 && git cia -C head && git ci -m 'try: -b d -p linux64 -u mochitest-browser-chrome-e10s-1 --try-test-paths browser-chrome:browser/extensions/onboarding --artifact --rebuild 5' --allow-empty && git push try && git rh head~3
)
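The suspicion above (a "load" listener created via `new Onboarding(window)` keeping the about:newtab window alive) can be sketched in miniature. This is a simplified, hypothetical model using a plain EventTarget, not the actual onboarding.js code: if `destroy()` never runs, whatever retains the handler also keeps the window reachable.

```javascript
// Simplified, hypothetical model of the suspected leak pattern: an object
// registers a "load" listener on a window-like EventTarget and keeps a
// reference to the window. Until removeEventListener runs, the closure
// (and thus the window) stays reachable from the event target.
class Onboarding {
  constructor(win) {
    this.win = win;                   // retained reference to the window
    this.initialized = false;
    this.onLoad = () => this.init();  // closure also captures `this`
    win.addEventListener("load", this.onLoad);
  }
  init() {
    this.initialized = true;
  }
  // Cleanup that must run at teardown, or the window stays reachable.
  destroy() {
    this.win.removeEventListener("load", this.onLoad);
    this.win = null;
  }
}

const fakeWindow = new EventTarget(); // stand-in for the about:newtab window
const onboarding = new Onboarding(fakeWindow);
fakeWindow.dispatchEvent(new Event("load"));
onboarding.destroy(); // without this, fakeWindow is pinned by the listener
```

The timing angle in this bug would then be whether the "load" event even fires (and thus whether the cleanup path is reached) once part 3 skips the load-blocker work for documents already in the COMPLETE readyState.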
Sorry, some more context. Onboarding test is leaking about:newtab. Bug 1381569 landed on autoland turning on activity-stream for about:newtab where activity-stream runs in the content process (vs before tiles about:newtab ran in main). Bug 1379762 landed on inbound changing onload behavior. Both bugs independently did not cause the leak, but when both autoland and inbound merged to central, this leak appeared.
(In reply to Ed Lee :Mardak from comment #13)
> Sorry, some more context. Onboarding test is leaking about:newtab. Bug
> 1381569 landed on autoland turning on activity-stream for about:newtab where
> activity-stream runs in the content process (vs before tiles about:newtab
> ran in main). Bug 1379762 landed on inbound changing onload behavior. Both
> bugs independently did not cause the leak, but when both autoland and
> inbound merged to central, this leak appeared.

Hi Mardak,

Thanks for the background, looking into this.
I obsoleted attachment #8887826 [details] (https://reviewboard.mozilla.org/r/158736/diff/1#index_header), which would have put activity-stream about:newtab back in the main process and would have made the leak go away. I.e., this leak happens specifically because of some content-process interaction. After discussions with mconley and others, we don't want to put activity-stream about:newtab back in the main process, as there are user-experience and performance benefits to it being in the content process.
I'm not sure this leaked window issue is caused by Bug 1380923 or not? If so, maybe we could try to close the Firefox window after all scripts compiled to avoid this issue.
(In reply to Evan Tseng [:evanxd] from comment #16)
> I'm not sure this leaked window issue is caused by Bug 1380923 or not? If
> so, maybe we could try to close the Firefox window after all scripts
> compiled to avoid this issue.

Thanks Evan. Trying different approaches (including your suggestion) on try now. Will update once I get the results.
Fischer's changeset on try seems pretty promising:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=5228e382d3cdf5c7fb28838ccd3b264241afb3de
https://hg.mozilla.org/try/rev/5228e382d3cdf5c7fb28838ccd3b264241afb3de

I don't quite understand how it fixes the leak though…
Another of Fischer's changes that seems to fix the leak is just waiting some time before closing the tabs:
https://hg.mozilla.org/try/rev/a1fc13b39a6d851b39c5a36674a201f4f2f4197d

So the changes in comment 18 that await a ContentTask might just happen to take long enough for something to clean up, and doing nothing but waiting in the meantime is just as good.
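A rough sketch of that "just wait" workaround, with illustrative helper names rather than the actual test code: the test does nothing but await a timer before tearing the tab down, giving pending cleanup (cycle collection, preloaded-browser teardown) a chance to run.

```javascript
// Hypothetical sketch of the workaround: awaiting a plain timer before
// closing the tab. `closeTabAfterSettling` and its default delay are
// illustrative, not names from the real test.
function delay(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

async function closeTabAfterSettling(closeTab, settleMs = 500) {
  // Waiting alone was apparently as effective as awaiting a ContentTask.
  await delay(settleMs);
  closeTab();
}
```

The catch, of course, is that this treats a symptom: the delay only makes the race less likely rather than removing whatever holds the window.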
(In reply to Ed Lee :Mardak from comment #18)
> Fischer's changeset on try seems pretty promising:
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=5228e382d3cdf5c7fb28838ccd3b264241afb3de
> https://hg.mozilla.org/try/rev/5228e382d3cdf5c7fb28838ccd3b264241afb3de
>
> I don't quite understand how it fixes the leak though…

Here are my try results (please ignore the lint complaints about the quick patch):
https://treeherder.mozilla.org/#/jobs?repo=try&revision=2e743989d1b8d6983c8b7dd72d9290fbfea04f71
https://treeherder.mozilla.org/#/jobs?repo=try&revision=322b55455067ee4d97dc4aa5fb17190689394467
https://treeherder.mozilla.org/#/jobs?repo=try&revision=0f6eedeedf8a8020d69ed6f7543782ff049d5a31

I am going to do some local tests and come up with a patch.
Hmm. The patches in bug 1379762 should have had no observable behavior difference outside of history navigations. If they do, that's a problem.

I can do some try pushes to see which parts of bug 1379762 matter here (the three changesets in there are independent of each other). But it's not clear to me what I should be using as a base for my pushes, given the various backouts and whatnot listed above. What's a revision I should be using for that?
Flags: needinfo?(bzbarsky) → needinfo?(edilee)
bz, the only change you need is setting pref("browser.newtabpage.activity-stream.enabled", true); (replacing the false in firefox.js).

But I'm pretty sure it's already narrowed down to "Bug 1379762 part 3. Don't mess about with load blockers if our document is already in the COMPLETE readyState. r=smaug":
https://treeherder.mozilla.org/#/jobs?repo=try&revision=4ce52c6f5d0722e34d24a310dd1e269ed628f3ef

That push has 3 relevant changesets:
- 4ce52c6f5d07 try: running onboarding tests
- 1e7100fcecde is actually turning on the activity-stream.enabled pref (but copies the commit message of its parent)
- 5fc778386eb1 is your Bug 1379762 part 3 that checks GetReadyStateEnum() != READYSTATE_COMPLETE

Just to confirm, part 3's "parent," i.e., part 2 with the same push pattern, does not leak:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=6ce987845eadf42a3f12e2348d466a316d547240

See comment 13 for the additional context around this leak.
OK. It sounds like we should back out rev 5fc778386eb1: it should have absolutely no observable behavior impact and is just supposed to be an optimization. If it's causing observable behavior differences, there is clearly something wrong with it that I should investigate.
Comment on attachment 8887825 [details]
Bug 1382079 - Backed out changeset 5fc778386eb1 (Bug 1379762 part 3).

https://reviewboard.mozilla.org/r/158734/#review165310
Attachment #8887825 - Flags: review?(bzbarsky) → review+
Fischer, not sure if these are helpful, but if onboarding_tourset only has its first test task run, it fails in 0% of runs. If it runs the first two tests, it fails in about 33% of runs. And when all 3 tests run, it fails in 80% of runs, which matches comment 10's push-failure rate.

As a sanity check that it wasn't some --artifact / --no-artifact difference, I repushed to try:

part 3, 70% failure: https://treeherder.mozilla.org/#/jobs?repo=try&revision=8cb1e8c19ea8a2f25f751cef03a72b2ffbf06d48
part 2, 0% failure: https://treeherder.mozilla.org/#/jobs?repo=try&revision=981a6a3b0e808d90a002e2514f086d08c9a2e19d

Looks pretty conclusive that it is indeed part 3 (5fc778386eb1) that triggered these leaks.
Pushed by email@example.com: https://hg.mozilla.org/integration/autoland/rev/b19a95c5c5b7 Backed out changeset 5fc778386eb1 (Bug 1379762 part 3). r=bz
I noticed some oddness on autoland and inbound in that this leak seemed to go away before our backout here. I've tracked it down to bug 1159003, where instead of onboarding_tourset leaking, some onboarding_* test is frequently timing out (tracked in bug 1383070).

Here's the child changeset that seems to have made the leak go away:
Bug 1159003 - Remove Performance::GetAsISupports(), r=bz
https://hg.mozilla.org/mozilla-central/rev/f4378a5e6c7f
https://treeherder.mozilla.org/#/jobs?repo=try&revision=3cb3b32208e0b501c94c0a7cf4f42b36e18c4eba

Here's the parent changeset that still has the leak:
Bug 1382339 - Improve SpiderMonkey hashing functions by using MFBT's HashGeneric more; r=jandem
https://hg.mozilla.org/mozilla-central/rev/a7a9d51edeeb
https://treeherder.mozilla.org/#/jobs?repo=try&revision=4f2674b55fd668d6fbcb1a4d2bd4f7cb074273e5

I just pushed a new try build that hopefully allows the test to run without timing out, by increasing the number of condition retries, to confirm that the leak is gone:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=fe2247835db989fda11fc9383e504bc377949584
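The "condition retries" knob above presumably refers to a poll loop along these lines. This is a hedged sketch: the defaults mirror the 100 retries of 100ms each mentioned in this bug, but the function name and option names are illustrative, not the real harness API.

```javascript
// Sketch of a waitForCondition-style poll loop. Raising `retries` gives a
// slow load (as with bug 1159003's change) time to complete instead of
// turning into a test timeout. Names and defaults are illustrative.
async function waitForCondition(condition, { retries = 100, intervalMs = 100 } = {}) {
  for (let i = 0; i < retries; i++) {
    if (await condition()) {
      return true; // condition met before exhausting the retries
    }
    await new Promise(resolve => setTimeout(resolve, intervalMs));
  }
  throw new Error(`condition not met after ${retries} retries`);
}
```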
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=fe2247835db989fda11fc9383e504bc377949584

There seem to be enough linux64 runs that are not timing out (after 100 retries of 100ms each) to be clear that the successful runs are not leaking. Although, from Fischer's previous try attempts to avoid the leak, making the test take longer also seems to reduce the likelihood of a leak. So bug 1159003's change might not have directly fixed the leak but just happened to make the onboarding tests take *much* longer to run, avoiding a race (?) that leads to the leak.
I put some timing measurements of how long onboarding_tourset took to wait for the loading of the onboarding overlay and the opening of the onboarding overlay, to see if there's some correlation with leaking or not. (I also allowed longer timeouts to avoid bug 1383070.) These are linux64 numbers across 10 runs.

With bug 1159003's change, it mostly didn't leak:
load waited: 2861ms median (1002ms min, 7825ms max)
open waited:  440ms median ( 105ms min, 9990ms max)

Whereas that change's parent mostly leaked:
load waited:  296ms median ( 136ms min, 1701ms max)
open waited: 1893ms median ( 806ms min, 5055ms max)

And applying the backout of Bug 1379762 part 3 to the parent resulted in no leaks:
load waited:  281ms median ( 106ms min, 2573ms max)
open waited: 2032ms median ( 718ms min, 4059ms max)

Not really sure what can be said other than with bug 1159003, the load takes almost 10x longer, leading to the timeouts of bug 1383070.

If anyone wants to look at the raw timings for patterns, I put them in a spreadsheet:
https://docs.google.com/spreadsheets/d/1LxMduc9I1-gmTxcsEw3a6ItkIyrJW_Y0MhyjqUZUreE
https://treeherder.mozilla.org/#/jobs?repo=try&revision=852945be69dc4ef6bb7a1d81d988a5ab22755a13&group_state=expanded&filter-searchStr=linux64%20bc
https://treeherder.mozilla.org/#/jobs?repo=try&revision=763f31e0f88e40e3fa569a59128f9a848f98919d&group_state=expanded&filter-searchStr=linux64%20bc
https://treeherder.mozilla.org/#/jobs?repo=try&revision=2295952ae6d56c370b50545fbad47140929c40b0&group_state=expanded&filter-searchStr=linux64%20bc
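For reference, the median/min/max summaries above can be reproduced from raw per-run timings with a small helper like this (illustrative, not the script actually used for the spreadsheet):

```javascript
// Summarize an array of per-run wait times (in ms) into the median/min/max
// shape used in the comment above. The median of an even-length run is the
// average of the two middle values of a sorted copy.
function summarize(timingsMs) {
  const sorted = [...timingsMs].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const median = sorted.length % 2
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
  return { median, min: sorted[0], max: sorted[sorted.length - 1] };
}
```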
https://hg.mozilla.org/mozilla-central/rev/b19a95c5c5b7

Tracking relanding the backout in bug 1383301. Tracking the onboarding test timeout in bug 1383070.
Whiteboard: [stockwell needswork] → [stockwell fixed:backout]