Closed Bug 1420594 Opened 7 years ago Closed 7 years ago

Regression: Latest Nightly needs a longer time to Quit (Spinning Beach Ball appears)

Categories

(Core :: DOM: Core & HTML, defect)

59 Branch
defect
Not set
major

Tracking

()

RESOLVED FIXED
mozilla59
Tracking Status
firefox-esr52 --- unaffected
firefox57 --- unaffected
firefox58 --- unaffected
firefox59 --- fixed

People

(Reporter: mehmet.sahin, Assigned: bkelly)

References

Details

(Keywords: nightly-community, perf, regression)

Attachments

(6 files, 8 obsolete files)

550.43 KB, video/quicktime
Details
1.07 MB, video/quicktime
Details
7.68 KB, image/png
Details
6.78 KB, patch
baku
: review+
Details | Diff | Splinter Review
8.12 KB, patch
bkelly
: review+
Details | Diff | Splinter Review
928 bytes, patch
bkelly
: review+
Details | Diff | Splinter Review
macOS 10.12.6
FF Nightly 59.0a1 (2017-11-25) (64-Bit)

1.) Start Nightly
2.) Quit Nightly via CMD-Q

Actual: Nightly needs a longer time to Quit. The Spinning Beach Ball appears.

Expected: Nightly should quit quickly.

This is a regression.
Summary: Regression: Lastet Nightly needs a longer time to Quit (Spinning Beach Ball) → Regression: Latest Nightly needs a longer time to Quit (Spinning Beach Ball appears)
I did a bisect, and it looks like I found the regression range:

Good: 2017-11-22-10-31-38

 Bad: 2017-11-22-22-00-56

May be you can help to find the culprit in this range that could have cause this regression.

Thanks.
Can you please use https://mozilla.github.io/mozregression/ to determine a specific bug introduced the regression?
Mozregression seems to be a Windows GUI. Since this is a macOS bug, is it possible to use it on macOS too? If yes, do you have a documentation? I found the regression range above via downloading them manually from https://ftp.mozilla.org/pub/firefox/nightly/2017/.

Thank you.
Attached video Expected.mov
Here is a expected Screencast. Nightly quits in 1 sec.
Attached video actual.mov
And here is a actual Screencast. Nightly quits in 7 sec. :-(


If you need more information please let me know.
I can reproduce this on Windows 10, Ubuntu 17.04 and macOS 10.13.1.

Regression range:
https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=e6ace8b5531cd2f50b6685ef8fdb180afcd9344e&tochange=dbbca05aa64a14580d3ab23950e148019a097218
Blocks: 1419536
Has Regression Range: --- → yes
OS: Mac OS X → All
Hardware: Unspecified → All
(In reply to magicp from comment #6)
> I can reproduce this on Windows 10, Ubuntu 17.04 and macOS 10.13.1.
> 
> Regression range:
> https://hg.mozilla.org/integration/mozilla-inbound/
> pushloghtml?fromchange=e6ace8b5531cd2f50b6685ef8fdb180afcd9344e&tochange=dbbc
> a05aa64a14580d3ab23950e148019a097218

Thanks for the regression range.

bkelly@, Could you please take a look at this regression?

Thank you in advance!
Flags: needinfo?(bkelly)
I'll investigate first thing tomorrow morning.
(In reply to Ben Kelly [:bkelly] from comment #8)
> I'll investigate first thing tomorrow morning.

Okay, thank you very much.
My theory is my changes introduced too much IPC traffic late in shutdown.  My plan is to try eagerly shutting down ClientManagerService and all its actors at the first signs of shutdown.

A back up plan is to filter out window/worker clients we don't technically need yet.  I want to avoid this since it limits the usefulness of this code going forward, but it might be a reasonable back up plan for now.

Of course, it could also be something completely different that I broke.  I'll see what works tomorrow.
(In reply to magicp from comment #6)
> I can reproduce this on Windows 10, Ubuntu 17.04 and macOS 10.13.1.

What were you looking for on windows?  I can't reproduce on that platform.  Watching task manager it seems to go away immediately.
Flags: needinfo?(magicp.jp)
Hmm, I think I might be able to reproduce something on linux.
Assignee: nobody → bkelly
Status: NEW → ASSIGNED
Flags: needinfo?(magicp.jp)
Flags: needinfo?(bkelly)
Ben: If you need some information from the Mac side (Repro steps etc.) I can provide. Thanks.
And I can confirm at rev e6ace8b5531c the delay I see is gone.  I'll investigate.

(In reply to Mehmet from comment #13)
> Ben: If you need some information from the Mac side (Repro steps etc.) I can
> provide. Thanks.

Thanks, but I think I have enough information to proceed now.
Attached image close-firefox.png
(In reply to Ben Kelly [:bkelly] from comment #11)
> What were you looking for on windows?  I can't reproduce on that platform. 
> Watching task manager it seems to go away immediately.

It is easy steps, quit and restart Firefox. Then a message will be shown.
"Firefox is already running, but is not responding. The old Firefox process must be closed to open a new window [Close Firefox][Cancel]"
This fixes the problem for me locally.  Lets see if it blows anything up on try:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=62269b6c4492abff89d8c1778474e83433dcbff6
Does this test build fix the issue for you?

https://queue.taskcluster.net/v1/task/I9BZpgUkR9KMa8liN1aAjg/runs/0/artifacts/public/build/target.dmg

Note, you have to temporarily disable application code signing to run it on mac.  For example:

https://apple.stackexchange.com/a/294016
Flags: needinfo?(mehmet.sahin)
Hmm, lots of orange.  It seems before-profile-change does not always fire.  I'll try switching to NS_XPCOM_WILL_SHUTDOWN_OBSERVER_ID tomorrow.
Nevermind.  This binary crashes immediately on my mac.
Flags: needinfo?(mehmet.sahin)
Hello Ben,

I was able to install the target.dmg on my MacBook Air 11 (Mid 2012) without a crasher. Quitting the app worked also fine and quick again as expected.

Okay, since you noticed the crasher in that build I will wait then for a new build from you to test again.

Thanks.
Running into problems because of bug 1421712.  I'll have to refactor this a bit.
Switch to using nsIAsyncShutdown service since it tells me if the observer topic has already fired.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=bc4b733cdc4add341d19d92dd51d057a1f34194a
Comment on attachment 8932617 [details] [diff] [review]
P1 Make ClientManagerService track active ClientManagerParent actors. r=baku

Andrea, this patch makes us track the top level PClientManagerParent actors in a list on the ClientManagerService.  This list will be used in the next patch to kill these actors at the start of the shutdown process.

I also added an Init() method so I can kill the actor immediately if shutdown has already begun.
Attachment #8932617 - Flags: review?(amarchesini)
Comment on attachment 8933037 [details] [diff] [review]
P2 Eagerly shutdown ClientManagerService. r=baku

This patch adds an nsIAsyncShutdownBlocker that resolves a MozPromise when xpcom-shutdown is reached.  This then causes the ClientManagerService on the background thread to start deleting the PClientManagerParent actors which in turn stops all Client IPC actors.

We don't actually block shutdown, though.  We don't currently leak without listening to shutdown since the service is gracefully ref-counted away as windows are destroyed.  We only really care about shutdown as a way to jump start the teardown to avoid IPC traffic late in the process.

Note, this does hold the ClientManagerService alive until shutdown, but we effectively do that anyway since every window object in the system holds it alive.
Attachment #8933037 - Flags: review?(amarchesini)
Please correct the component if this does not fit. Thanks.
Component: Untriaged → General
Product: Firefox → Core
Component: General → DOM
Attachment #8932617 - Flags: review?(amarchesini) → review+
Attachment #8933037 - Flags: review?(amarchesini) → review+
Pushed by bkelly@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/ff3ee0cc91ab
P1 Make ClientManagerService track active ClientManagerParent actors. r=baku
https://hg.mozilla.org/integration/mozilla-inbound/rev/007fde92382e
P2 Eagerly shutdown ClientManagerService. r=baku
Attachment #8933360 - Attachment is obsolete: true
Attachment #8933411 - Flags: review+
These leaks are so weird.  Its like something landed that depends on clients API delaying shutdown and when I fix that its uncovering that new problem.
Let's see if just disabling ClientSource creation added in bug 1419536 also triggers these leaks.
I am having a similar problem but I don't have any visual indication that Fx hasn't quit.

Please look at the bug report I opened regarding my problem: https://bugzilla.mozilla.org/show_bug.cgi?id=1420995
(In reply to Gary [:streetwolf] from comment #39)
> I am having a similar problem but I don't have any visual indication that Fx
> hasn't quit.
> 
> Please look at the bug report I opened regarding my problem:
> https://bugzilla.mozilla.org/show_bug.cgi?id=1420995

I'm going to dupe it to here.  Thanks for pointing it out.
Jan-Ivar, can you help me figure out why content/test/webrtc tests are leaking in my try builds?

Some context:

1. Last week I landed bug 1419536
2. This bug is because that landing introduced a delay in shutdown.  The browser now takes many extra seconds to stop.
3. When I fix the shutdown delay I get leaks in webrtc tests.
4. When I disable the code from bug 1419536 I get leaks in webrtc tests.

Did something land in webrtc in the last week?  Is it possible that its relying on the slow shutdown to cleanup properly?  Right now I don't even know where to look.
Flags: needinfo?(jib)
I see HTMLMediaElement::ShutdownObserver and MediaStreamGraph in comment 44 and on Windows as well in https://treeherder.mozilla.org/logviewer.html#?job_id=148926257&repo=try&lineNumber=10253

My initial suspicion given the 5 day time window was the CamerasParent reorg in bug 1399413 and bug 1388667, which backends getUserMedia calls, which are being exercised in the problematic tests in browser/base/content/test/webrtc/ (UI permission prompt tests that call getUserMedia).

But with these leaks happening on Windows, where real camera and mics are not involved in these tests, it's unlikely.

A more likely suspect might be bug 1412394 which landed 10 days ago, as that made changes to MSG and shutdown. I'm ni?ing people involved with that bug.
Flags: needinfo?(jib) → needinfo?(apehrson)
Try with bug backed 1412394 out:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=3763225c80c7ff94bfc94c15cd0e58381e639a06

The time frames still don't seem to line up with any of these bugs landing, though.
Ben.  Do you want me to run Process Manager and give you the data during the 6 seconds that it takes FX to startup?  My bug report mentions that I saw under the Network data that for these 6 seconds it was executing some profiling task.  This I thought might be Fx sending data to Mozilla.
(In reply to Gary [:streetwolf] from comment #48)
> Ben.  Do you want me to run Process Manager and give you the data during the
> 6 seconds that it takes FX to startup?  My bug report mentions that I saw
> under the Network data that for these 6 seconds it was executing some
> profiling task.  This I thought might be Fx sending data to Mozilla.

Thanks but the problems I'm running into are not really related to the actual delay your experienced.  Don't worry, we'll get it sorted out.
(In reply to Ben Kelly [:bkelly] from comment #46)
> Try with bug backed 1412394 out:
> 
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=3763225c80c7ff94bfc94c15cd0e58381e639a06

Backing out bug 1412394 did not help.
Results are still coming in, but so far its looking like:

last good: 3f6b9aaed8cd
first bad: 7de1ab07869e
Flags: needinfo?(apehrson)
last good: b0a56e8287da
first bad: 583311c68376

Only about 10 changesets in here.  This one looks like a probably candidate:

changeset:   394231:bbb02c1fccd6
user:        Andrea Marchesini <amarchesini@mozilla.com>
date:        Wed Nov 29 09:40:16 2017 +0100
summary:     Bug 1420419 - Postpone the removing of BlobURL for 5 seconds in order to allow the loading of them in a remote process, r=smaug
I'm going to do a trigger for each of these revisions on try.

Andrea, in the meantime, do you think bug 1420419 could be causing this leak?

Again, the problem is that there is a bug in the tree which makes us delay shutdown for 5 to 10 seconds.  When I fix this shutdown delay I get a perma-leak in some tests.

This change in the bisect range to hold stuff alive for 5 seconds seems suspicious.  Does it listen for shutdown at all?
Flags: needinfo?(amarchesini)
Depends on: 1422314
Yea, my last set of try pushes are pointing at bug 1420419 as the conclusive cause of the leak.
My local mac build consistently takes >5s to shutdown. I got a shutdown profile (https://perfht.ml/2j2lqwM) that shows we are blocked for 5s at https://searchfox.org/mozilla-central/rev/7f45cb7cc0229398fc99849bdc150cb6462d6966/dom/workers/RuntimeService.cpp#2249

Not sure if this helps, only confirms what you already knew, or is an entirely different bug.
Please make sure to push this after bug 1422314.
Flags: needinfo?(amarchesini)
Keywords: checkin-needed
Pushed by bkelly@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/7908b821ad3f
P1 Make ClientManagerService track active ClientManagerParent actors. r=baku
https://hg.mozilla.org/integration/mozilla-inbound/rev/f5a3054a4c38
P2 Eagerly shutdown ClientManagerService. r=baku a=aryx on CLOSED TREE
Keywords: checkin-needed
I'm running on the latest inbound which includes this patch as well as 1422314.  Happy to say that the delay is gone.  Kudos.
https://hg.mozilla.org/mozilla-central/rev/7908b821ad3f
https://hg.mozilla.org/mozilla-central/rev/f5a3054a4c38
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla59
Thanks so much! Looks fixed for me in FF Nightly 59.0a1 (2017-12-03) (64-Bit) on macOS 10.12.6
I'm kind of upset this was backed out for a single failing test when the backout creates the risk of more leaks being added.
I'm relanding with the test disabled on android for now.
Blocks: 1422823
Pushed by bkelly@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/7ba9a363b3d2
P1 Make ClientManagerService track active ClientManagerParent actors. r=baku
https://hg.mozilla.org/integration/mozilla-inbound/rev/a3e3a096629c
P2 Eagerly shutdown ClientManagerService. r=baku
https://hg.mozilla.org/integration/mozilla-inbound/rev/b0f2a1cfbd45
P3 Disable test_ext_contentScripts_register.js on android since it fails without the shutdown delay bug. r=baku
Please talk with me before backing out this bug.  Its very critical that it stick since backing it out will create the opportunity for more erroneous code to be added.  In the short time frame we had the extra shutdown delay we already had a late shutdown leak added and a problematic test added.  We need to try to fix any new problems detected instead of backing this out.

Thank you.
https://hg.mozilla.org/mozilla-central/rev/7ba9a363b3d2
https://hg.mozilla.org/mozilla-central/rev/a3e3a096629c
https://hg.mozilla.org/mozilla-central/rev/b0f2a1cfbd45
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla59
Depends on: 1423554
See Also: → 1419605
See Also: → 1477688
Component: DOM → DOM: Core & HTML
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: