Closed Bug 1542884 Opened 5 years ago Closed 4 years ago

After using Google Maps, parent process shutdown delayed an extra 10 seconds until PBackground gives up waiting for PBackground actors to shutdown.

Categories

(Core :: IPC, defect, P2)

70 Branch
x86_64
Windows 10
defect

Tracking

()

VERIFIED FIXED
mozilla75
Tracking Status
firefox-esr60 --- unaffected
firefox-esr68 --- verified
firefox66 --- wontfix
firefox67 --- wontfix
firefox68 --- wontfix
firefox69 --- wontfix
firefox70 --- wontfix
firefox71 - wontfix
firefox72 --- wontfix
firefox73 --- wontfix
firefox74 --- wontfix
firefox75 --- verified

People

(Reporter: streetwolf52, Assigned: janv)

References

(Regression, )

Details

(Keywords: regression, reproducible, Whiteboard: [sci-exclude])

Attachments

(7 files)

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0

Steps to reproduce:

Go to https://www.google.com/maps.

Exit Fx.

Immediately start Fx again.

Actual results:

Fx doesn't start and a message appears that Fx is still running and must be closed first. Task Manager shows one Nightly process still running. It takes ~7 seconds for the process to end at which point Fx can be started without any delay or message.

After getting into Google Maps you can go to other sites and exit Fx from any site and you will still encounter the problem.

Expected results:

Fx should immediately start up after using Google Maps.

OS: Unspecified → Windows 10
Hardware: Unspecified → x86_64
Component: Untriaged → Startup and Profile System
Product: Firefox → Toolkit
Attached image Message

This is the message I get after using Google Maps, exiting Fx, then starting it right away. When I press close it still takes 5-10 secs for Fx to display.

Anyone looking into this?

Attached image Fx Process.jpg

I attached a snippet from Process Explorer that shows the main Fx Process with 4 processes running under it. When I get into Google Maps then exit Fx the process that remains running is PID 3676 which is the main process. It is always the main PID that continues to run. Again this continues to run for up to 10 seconds. Google Maps is the only site I've encountered so far that does this. I tried Google Maps in MS Edge and did not have a problem getting right back into Edge after exiting it.

By setting 'browser.tabs.remote.autostart = false' Google Maps does not keep a firefox process running after I exit Fx. Fx therefore starts up immediately. Hope this helps nail this bug.

Component: Startup and Profile System → Graphics: WebRender
Product: Toolkit → Core
Summary: Google Maps leaves a Nightly process running after exiting Fx. → [Webrender] Google Maps leaves a Nightly process running after exiting Fx.
Component: Graphics: WebRender → General
Summary: [Webrender] Google Maps leaves a Nightly process running after exiting Fx. → [Multi-Process] Google Maps leaves a Nightly process running after exiting Fx.
Summary: [Multi-Process] Google Maps leaves a Nightly process running after exiting Fx. → [E10s] Google Maps leaves a Nightly process running after exiting Fx.

Changing the pref to false also fixes another problem. Some of my add-on icons display an empty menu when the pref is true. With the pref set to false they display properly.

Depends on: e10s

I can reproduce this issue.
Launch with NSPR logging indicates significant delay when shutdown.

set NSPR_LOG_MODULES=sync,timestamp,all:5
set NSPR_LOG_FILE=debug.log
firefox.exe

2019-04-11 02:50:35.979000 UTC - 2996[2519970f400]: Unloaded library xul.dll
2019-04-11 02:50:36.014000 UTC - 4360[2519df67c00]: thread exiting
2019-04-11 02:50:36.014000 UTC - 7204[25199157c00]: thread exiting
2019-04-11 02:50:36.014000 UTC - 2780[2519bd93400]: thread exiting
2019-04-11 02:50:36.014000 UTC - 2996[2519970f400]: thread exiting
2019-04-11 02:50:36.014000 UTC - 2916[25199250000]: thread exiting
2019-04-11 02:50:36.014000 UTC - 6936[25195fa0400]: thread exiting
2019-04-11 02:50:46.019000 UTC - 5720[251969ae800]: thread exiting
2019-04-11 02:50:46.019000 UTC - 7616[251969ad800]: thread exiting
2019-04-11 02:50:46.019000 UTC - 3876[2519f3b8000]: thread exiting
2019-04-11 02:50:46.019000 UTC - 3316[251969ae000]: thread exiting
2019-04-11 02:50:46.019000 UTC - 3064[25195fa6000]: thread exiting
2019-04-11 02:50:46.019000 UTC - 8296[2518b87d000]: thread exiting
2019-04-11 02:50:46.019000 UTC - 2908[25195f06800]: thread exiting
2019-04-11 02:50:46.019000 UTC - 9120[2518b87c400]: thread exiting
2019-04-11 02:50:46.019000 UTC - 9080[2518b848c00]: thread exiting

Significant delay here.

2019-04-11 02:50:36.014000 UTC - 6936[25195fa0400]: thread exiting
2019-04-11 02:50:46.019000 UTC - 5720[251969ae800]: thread exiting

No such delay in normal case.

mozregression:
https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=71af1a8b7eadef8c377be4d8ae5b4402ea1013fc&tochange=5173c4b6f97c04522159e58f46b9359975c8b9d9

So this must be a regression by bug 1471189.

Component: General → DOM: Workers

(In reply to Gary [:streetwolf] from comment #5)

Changing the pref to false also fixes another problem. Some of my add-on
icons display an empty menu when the pref is true. With the pref set to
false they display properly.

Do you think this issue is related to the original problem reported? Should I open up another bug report? Perhaps it's a separate problem related to Webrender since turning off e10s also disables WR.

(In reply to Gary [:streetwolf] from comment #7)

(In reply to Gary [:streetwolf] from comment #5)

Changing the pref to false also fixes another problem. Some of my add-on
icons display an empty menu when the pref is true. With the pref set to
false they display properly.

Do you think this issue is related to the original problem reported? Should
I open up another bug report? Perhaps it's a separate problem related to
Webrender since turning off e10s also disables WR.

I turned off WR while keeping e10s enabled and my menus still displayed empty so I guess it's the same issue as this original bug.

Status: UNCONFIRMED → NEW
Has Regression Range: --- → yes
Has STR: --- → yes
Ever confirmed: true
Regressed by: 1471189
Version: 68 Branch → 63 Branch
Severity: normal → major
Priority: -- → P2

Opened up Bug 1544206 for the empty menu issue.

It might be interesting to use the profiler in shutdown mode as documented at https://profiler.firefox.com/docs/#/./guide-startup-shutdown (noting that I think the comment about symbolication is moot, as the referenced issue is fixed). Note that you might want to configure the profiler so that is profiles "Worker" threads too. (This will capture both the "DOM Worker" threads that are actual workers, as well as the "Worker Launcher" which is the RemoteWorkerService that supports SharedWorker instances.)

I don't want to seem like a troll but is there any movement on this issue?

Bodgan, can someone from QA try to reproduce this and try using the profiler to capture Worker threads on shutdown? Thanks!

Severity: major → normal
Flags: needinfo?(bogdan.maris)

We managed to reproduce the issue easily.
Webrenderer did not have any effect over it, the workstation did not have it enabled/available and the issue still reproduced. (Win10 1803- Firefox 67.0b11)

One thing worth mentioning is that if Firefox is open via the profile launcher; the close window is no longer displayed but it does take the extra time for Firefox to start up and the process is still hanging in the Task Manager list.

Here's a profile for a run, with the Worker threads captured as well:
https://perfht.ml/2Pg2ppA

However, since I was unable to set up the profiler to capture the shutdown part will have to take it with Andrew* (or someone else who can help) and get some extra details on the setup; if still needed.

Flags: needinfo?(bogdan.maris)

:streetwolf, is this happening with a current nightly? I ask because it seems likely that bug 1540080 seems like it should have fixed problems similar to this one.

I attempted to reproduce this locally on Linux nightly but was unable to. However, this could be due to having a fast machine and a fast network. I did notice, however, that there were multiple issues with the Worker blocking due to use of XHR that entails a syncloop to the main thread. I believe the XHR uses are async... but it appears that the syncloops triggered by touching XMLHttpRequest.responseType and calling XMLHttpRequest.open resulted in very meaningful jank on the worker thread. Jank would be proportional to main thread contention. Note that bug 1330826 (or its successor) cover eliminating this.

Flags: needinfo?(garyshap)

I'm on the latest Nightly and it is still happening. I also have a fast machine i9-9900K and a 1Gb connection. Also, did you see my other report Bug 1544206? That issue started when this issue began.

Flags: needinfo?(garyshap)

Thanks for the clarification. That helps rule out a class of possible problems.

:streetwolf, are you able to try capturing a profile as documented at https://profiler.firefox.com/docs/#/./guide-startup-shutdown? Note that it's not just sufficient to set the environment variable, the profiler also has to already be running when shutdown occurs (and have captured data!). Once you have the profile file, you'll want to load it into the https://profiler.firefox.com/ site and then upload it from there, as the profiler still needs to resolve the memory addresses to symbols, and I believe that can only be done locally on your machine. Note that URLs may be included with your performance profile from active sites (which could also include sites that don't have open tabs, due to Service Workers and push notifications), so if you're using a Firefox profile that would have personally identifying information, you definitely want to check the "Privacy" box option in the profiler add-on's panel OR better yet, be using a fresh Firefox profile via the "-P" option. (The add-on UI can sorta be seen at https://profiler.firefox.com/docs/#/./guide-profiler-fundamentals, although the settings UI has changed since that screenshot was taken. The "privacy" option is available by scrolling the panel way down.)

Flags: needinfo?(garyshap)

If I didn't do things correctly, as it's the first time I've done profiling, let me know.

https://perfht.ml/2Gl9Jwc

Flags: needinfo?(garyshap)
Attached image profiler-settings.png

Thank you very much for undertaking grabbing a profile! Unfortunately, there wasn't much in that profile. Here's an example of a shutdown profile I grabbed where I didn't reproduce the shutdown hang but I could see the XHR syncloop jank: https://perfht.ml/2GtLZan

In particular, the area of interest are the DOM Worker threads in the content process, which is where I'd suspect things are going wrong.

I've attached a screenshot of the settings I currently have for my profiler. IN particular, you may need to increase your buffer size, and you'll want to make sure you've checked the "DOM Worker" box or otherwise made the string "Worker" appear in the "custom threads by name" section. (And of course, in the screenshot, I still need to click the "start" button to activate the profiler!)

Uh, and also note that the network timeline seems to capture URLs even when I told it to sanitize resource URLs on upload, so I'd suggest indeed using a different Firefox disk "profile" (in the other sense, the "-P" when launching Firefox sense) for any captured performance profile.

I did use another profile with a shortcut using the -P param. I must not be doing something right. How about a 'Profiler for Dummies' type of instructions :-)

Hm, so I think you may have done everything right but the default "Publish..." settings did the wrong thing. When you bring up the profile, the profiler seems to choose the set of threads to display based on some preconfigured list. This does not include the DOM Workers. You can make the DOM Workers visible by clicking on the "blue "M/N tracks visible" drop-down thing and selecting them. Alternately, when you go to "Publish...", you want to expand "Include Additional Data" and check the "Hidden threads" box.

I think what happened is that since you didn't manually make the threads visible and you didn't check the box, it filtered them out. (Also, I think I got confused and thought that the checkboxes under "Include Additional Data" where actually filtering things OUT because right above the accordion toggle is the checkbox for "Filter out potentially identifying information"... so when I checked the box "Resource URLs" I was actually explicitly including the URLs! (Whoops! Thankfully there's already an issue on this at https://github.com/firefox-devtools/profiler/issues/1957 and I'm chiming in there.)

The steps I followed (on linux, so the paths and command-line args vary a little:

  1. Run Firefox with shutdown profiler saving enabled: MOZ_PROFILER_SHUTDOWN=/tmp/maps-profile ./firefox-nightly/firefox -P nightly -no-remote
  2. When Firefox starts, I bring up the profiler by clicking on its icon in the toolbar. I make sure the settings are right then and click the start button. If I make any changes after starting the profiler, I need to click the "Apply" button at the bottom of the pane. In particular, the buffer size is important, because the profiler uses a circular buffer and if it overflows, details will be lost.
  3. I bring up Google Maps and drag around to cause it to do something.
  4. I quit Firefox via the menu.
  5. In another Firefox window, I go to https://profiler.firefox.com/ and click on "Load a profile from file"
  6. I browse to the file /tmp/maps-profile where I told it to dump the profile.
  7. I wait a little bit and the profile comes up.
  8. I click "Publish...". I think the right thing to do here if checking "Filter out potentially identifying information" is to check "Hidden threads" (but not check "resource URLs").

I'm having problems. I decided to add an env. variable called MOZ_PROFILER_SHUTDOWN pointing to C:\profiler. I start the profiler but I never see any file in this folder. Same thing when I add it to the Fx shortcut. btw.. The way you show MOZ_PROFILER_SHUTDOWN it's treated as a URL. I added a '-' in front of it but still no profile file. I also can't get the screen where I can publish it to see if this works. I am getting confused.

I found the publish things again. I turned on the DOM workers. How's this... https://perfht.ml/2GywY67

MOZ_PROFILER_SHUTDOWN is the name of the file to (over)write to. So, if C:\profiler is a directory, C:\profiler\shutdown-profile would be the right thing to pass to create a "shutdown-profile" file in the directory. Indeed on Windows you want to use backslashes.

You need a profile file before https://profiler.firefox.com/ will load it and show you the UI which has the "Publish..." button in the upper right of the profile UI.

(In reply to Gary [:streetwolf] from comment #24)

I found the publish things again. I turned on the DOM workers. How's this... https://perfht.ml/2GywY67

The profile only includes 1.6 seconds of data which makes me think your buffer is way too small. You'll want to drag the "buffer size" slider to the right to make it much larger. (Or this profile is from clicking to start the profiler and then almost immediately afterwards triggering a "capture profile". You don't want to click "capture profile" for what we're doing here. We only want to load the profile from what gets written to disk at shutdown.)

I can only get a profile using publish. How's this.. https://perfht.ml/2UHOMWj

Managed to upload the profiler's result based on the additional info.
Here's a link for it, hope it helps: https://perfht.ml/2PlK76u

Thanks, Cristi! The profiler indicates that the PBackground "IPDL Background" thread is blocking shutdown. Is it possible to gather another profile but with "Background" added to the list of threads to profile? So the thread list in the profiler setting would look like: "GeckoMain,Compositor,DOM Worker,Background".

The profile also indicates that all workers were shutdown by the time PBackground shutdown started. Given that QuotaManager shutdown already completed and completed quickly, unless this is WebRTC, I'm not sure what would be blocking PBackground shutdown.

Flags: needinfo?(cristina.madaras)
Flags: needinfo?(cristina.madaras) → needinfo?(cristian.fogel)

The "Background" part is still a bit unclear as for the options.
Read grabbed with 68.0a1 (2019-04-21) - win10

Profile grabbed with "GeckoMain,Compositor,DOM Worker":
https://perfht.ml/2PnuZ8H
Profile grabbed with "GeckoMain,Compositor":
https://perfht.ml/2PlW1gx

Flags: needinfo?(cristian.fogel)

Thank you for the profiler traces with "Background" in comment 30, atlanto! They show that:

  • There is no activity on the IPDL Background thread.
  • ParentImpl::ShutdownBackgroundThread[1] is waiting for its 10 second timer to elapse and call ParentImpl;::ShutdownTimerCallback[2] that force-closes the actors.
  • Bug 1471189 which you also very helpfully bisected to in comment 6 (thank you for that too!) could have broken the teardown of the PBackground actor somehow.

In terms of how PBackground shutdown happens on workers:

  • WorkerThreadPrimaryRunnable::Run calls BackgroundChild::CloseForCurrentThread()[3] after control flow returns from DoRunLoop.
  • WorkerPrivate::ClearMainEventQueue gets invoked which drains the event loop[4] after that. Which I mention because it's always worrying if there's an opportunity for a sub-system to re-initialize immediately after it was supposed to shutdown.

I'd suspect the same-process DOM Workers because I would expect any content-process workers to get torn down when the content processes terminate. In the parent process we appear to have the following workers:

  • osfile_async_worker.js
  • SessionWorker.js
  • what presumably is PageThumbsWorker.js. We only ever seem to have 3 workers in the content process, and although there were no JS stacks for the 3rd worker, that's what shows up in other profiles from this bug.

:baku / :jld, does anything jump to mind about how we could be effectively leaking a PBackground actor?

1: https://searchfox.org/mozilla-central/rev/d80f0a570736dce76a2eb184fb65517462089e8a/ipc/glue/BackgroundImpl.cpp#968
2: https://searchfox.org/mozilla-central/rev/d80f0a570736dce76a2eb184fb65517462089e8a/ipc/glue/BackgroundImpl.cpp#1013
3: https://searchfox.org/mozilla-central/rev/d80f0a570736dce76a2eb184fb65517462089e8a/dom/workers/RuntimeService.cpp#2311
4: https://searchfox.org/mozilla-central/rev/d80f0a570736dce76a2eb184fb65517462089e8a/dom/workers/WorkerPrivate.cpp#3307

Flags: needinfo?(jld)
Flags: needinfo?(amarchesini)

I suspect that the shutdown of ChromeWorkers could be one of the culpits here. ChromeWorkers do not have a parent window and their global is a BackstagePass object. Because of this, they go from 'running' state to 'killing' directly when firefox is shutting down. If they are using any PBackground actor, they are not correctly terminated.

Reading the patch quickly, it seems that this setup was the same even before bug 1471189, but maybe I'm wrong.

Would be great if we can have a correct shutdown procedure for ChromeWorkers. I take this bug.

Assignee: nobody → amarchesini
Flags: needinfo?(amarchesini)
Flags: needinfo?(jld)
Depends on: 1546576

Bug 1546576 has landed so when can we expect a fix for this issue?

If the problem was related to ChromeWorker shutdown, yes. Can you please test this issue again using nightly? Thank you.

Flags: needinfo?(garyshap)

The issue still remains on the latest Nightly. It does seem that the wait is a tiny bit shorter but nevertheless there shouldn't be any wait.

Flags: needinfo?(garyshap)

Any progress on this issue?

Just realized that the wait time is the same as before. I thought it was a little shorter but that is not the case.

Severity: normal → major

:baku, are there next steps you want to take? Possible next steps (don't need to be done by you, feel free to unassign) might include creating a try build that explicitly enumerates the outstanding PBackground actors and ideally the threads they belong to.

Flags: needinfo?(amarchesini)
Summary: [E10s] Google Maps leaves a Nightly process running after exiting Fx. → After using Google Maps, parent process shutdown delayed an extra 10 seconds until PBackground gives up waiting for PBackground actors to shutdown.

Would be great is somebody can help with the testing here. What you are suggesting is probably the right way to go (if we cannot record the bug with rr), but I don't have time to do it. Andrew, can you take this bug?

Assignee: amarchesini → nobody
Flags: needinfo?(amarchesini)
Assignee: nobody → bugmail
Status: NEW → ASSIGNED

Is my issue on hold until the soft-freeze period is over?

No, it's not. Andrew just took the bug. In a few days, we will have a fix, or new comments to know how to continue the debugging.

Status on this issue is appreciated.

Please don't take this as spam but I use Google maps a lot and very often exit Fx right after doing so the extra delay impacts me.

Please what is going on with this bug?

Flags: needinfo?(bugmail)

On Visual Studio 2019, if I set breakpoint in ForceCloseBackgroundActorsRunnable (https://searchfox.org/mozilla-central/source/ipc/glue/BackgroundImpl.cpp#1185)
I could get only this. (I'm not familiar with VS2019)

-		actorsToClose	{...}	nsTArray<(anonymous namespace)::ParentImpl *>
-		nsTArray_Impl<(anonymous namespace)::ParentImpl *,nsTArrayInfallibleAllocator>	{...}	nsTArray_Impl<(anonymous namespace)::ParentImpl *,nsTArrayInfallibleAllocator>
-		nsTArray_base<nsTArrayInfallibleAllocator,nsTArray_CopyWithMemutils>	{mHdr=0x0000027985ebbaf0 {mLength=0x00000001 mCapacity=0x00000001 mIsAutoArray=0x00000000 } }	nsTArray_base<nsTArrayInfallibleAllocator,nsTArray_CopyWithMemutils>
-		mHdr	0x0000027985ebbaf0 {mLength=0x00000001 mCapacity=0x00000001 mIsAutoArray=0x00000000 }	nsTArrayHeader *
		mLength	0x00000001	unsigned int
		mCapacity	0x00000001	unsigned int
		mIsAutoArray	0x00000000	unsigned int
-		nsTArray_TypedBase<(anonymous namespace)::ParentImpl *,nsTArray_Impl<(anonymous namespace)::ParentImpl *,nsTArrayInfallibleAllocator> >	{...}	nsTArray_TypedBase<(anonymous namespace)::ParentImpl *,nsTArray_Impl<(anonymous namespace)::ParentImpl *,nsTArrayInfallibleAllocator> >
		nsTArray_SafeElementAtHelper<(anonymous namespace)::ParentImpl *,nsTArray_Impl<(anonymous namespace)::ParentImpl *,nsTArrayInfallibleAllocator> >	{...}	nsTArray_SafeElementAtHelper<(anonymous namespace)::ParentImpl *,nsTArray_Impl<(anonymous namespace)::ParentImpl *,nsTArrayInfallibleAllocator> >

How can I provide useful data?
(cannot inspect '{...}', cannot step in actorsToClose[index]->Close(), ...)

Thanks for your willingness to help debug! Things are somewhat busy and your assistance is greatly appreciated.

Inside actorsToClose there should be a single ParentImpl*. This ParentImpl should have a member mChannel. The channel should have a mWorkerLoop and mWorkerThread that identify the thread in question.

Since VS2019 doesn't seem to understand our nsTArray container type based on it not letting you inspect, I think the best course of action might be:

  • At this point (prior to close being invoked), activate a breakpoint on MessageChannel::Close[1]. You might need to move your existing breakpoint up a little bit if you're finding that the call has actually already happened.
  • Resume execution.
  • Once the breakpoint triggers, you should be able to inspect mWorkerLoop and mWorkerThread.
  • Ideally mWorkerThread[2] should be identified as a PRThread* and then you can look at its "name" field[3].

1: https://searchfox.org/mozilla-central/rev/0da35261b6789eec65476dbdd4913df6e235af6d/ipc/glue/MessageChannel.cpp#2662
2: https://searchfox.org/mozilla-central/rev/0da35261b6789eec65476dbdd4913df6e235af6d/ipc/glue/MessageChannel.h#623
3: https://searchfox.org/mozilla-central/rev/0da35261b6789eec65476dbdd4913df6e235af6d/nsprpub/pr/include/private/primpl.h#1579

Flags: needinfo?(bugmail)
Attached image (screenshot of Watch)

The result at the moment... snapshot when stopped at MessageChannel::Close.
I couldn't inspect mWorkerThread.

Further investigating now...
(I'm taking 'attach process' method with public released firefox.exe for now, btw)

I would also be willing to test but unfortunately my debugging skills are way above my pay grade so to speak. It's good that others with way more knowledge are now testing.

Got "IPDL Background" by using PR_GetThreadName(mWorkerThread) in MessageChannel::Close.

Fantastic detective work, :atlanto, thank you!

This suggests some API is opening a PBackground connection from the PBackground thread. Which... actually makes a lot of sense... but I don't know that it was something that was originally foreseen for PBackground shutdown purposes...

My naive thought of what to do here would be that if we initialize a PBackground-PBackground actor, we should save it off into a specific variable so that in ParentImpl::ShutdownBackgroundThread we can trigger its close immediately before starting the event loop spin.

:nika, in bug 1487249 you added the intra-thread actor support... is this something PBackground should really use for its PBackground-PBackground connection, or would my naive hand-waving above be sufficient?

:atlanto, if you have additional cycles, it would be interesting to know what the stack is of the caller creating that connection from the PBackground thread. Setting a breakpoint at https://searchfox.org/mozilla-central/rev/0da35261b6789eec65476dbdd4913df6e235af6d/ipc/glue/BackgroundImpl.cpp#1456 should fire once per thread. I think Visual Studio should already tell you what the thread name is when it hits the breakpoint (which should be "IPDL Background" in the case we care about). If it doesn't tell you, PR_GetThreadName(PR_GetCurrentThread()) should work.

Flags: needinfo?(nika)

Alternately, setting a breakpoint on ParentImpl::CreateActorForSameProcess works. The point would be that you only want to break on actor creation, not on getting the actor (which will be called a ton).

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #51)

:nika, in bug 1487249 you added the intra-thread actor support... is this something PBackground should really use for its PBackground-PBackground connection, or would my naive hand-waving above be sufficient?

Sounds scary. PBackground supports sync IPC, which is intentionally not supported by intra-thread actors. Supporting it would either require spinning a nested event loop or would cause a permanent deadlock. Right now we crash if you try to use one because there's no good way to prevent it statically.

Because of that I'm not sure I'd be comfortable with setting up an intra-thread version of an existing actor. At one point I was considering requiring the actor to be declared as async to have OpenOnSameThread called on it, but I never got around to setting it up.

Flags: needinfo?(nika)

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #51)

:atlanto, if you have additional cycles, it would be interesting to know what the stack is of the caller creating that connection from the PBackground thread. Setting a breakpoint at https://searchfox.org/mozilla-central/rev/0da35261b6789eec65476dbdd4913df6e235af6d/ipc/glue/BackgroundImpl.cpp#1456 should fire once per thread. I think Visual Studio should already tell you what the thread name is when it hits the breakpoint (which should be "IPDL Background" in the case we care about). If it doesn't tell you, PR_GetThreadName(PR_GetCurrentThread()) should work.

Unfortunately, it was unable to set breakpoint at BackgroundImpl.cpp#1456, so I inserted printf(PR_GetThreadName(PR_GetCurrentThread()) at there. It indicates

operation: Launch Firefox
  "DOM Worker"
  "(null)"
  "Worker Launcher"
  "DOM Worker"
operation: Visit google map
  "DOM File"
operation: Close tab (exit Firefox)
  "DOM Worker"

(Maybe I can collect more data with NS_DebugBreak but I don't know which data is helpful)

Is this issue a hard one to solve? It's not a criticism but I'm curious why it seems like it is a hard one to solve.

Please do not let this fall into a black hole.

Hey Andrew, checking on this one (came up in regression triage). Any updates?

(I'm guessing this issue should be assigned to someone who is able to reproduce the problem and investigate it.
I'm also still wondering why this occurs only on Windows, why Google Map can cause this, why this occurs even after leaving the site or closing the tab or such. If something zombie can survive, why it could not be a virus.)

Lacking even basic Firefox debugging skills, if someone gives me very detailed instructions on what is needed to get the information to fix this problem I'll be happy to do help. I can reproduce this problem 100% of the time. The key is DETAILED instructions. Assume I am a complete idiot :-)

I'd even be willing to let a debugging Guru access my system remotely to do the debugging. My program of choice for this is Teamviewer

Please don't let this bug hang around for years. There must be someone who has this problem on your end and can get the info needed to fix it.

For what it's worth, I just reproduced the problem with Nightly 70.0a1 (2019-07-24) (64-bit) build-id 20190724095428 using a brand new profile under Windows 10 Home (64-bit) , version 1903 (OS Build 18362.239).

While my skills are limited in this area and I am just a Firefox user, not a developer, is there anything I can provide that may help track down this bug?

Please, please fix this issue. On a few occasions I encountered the same problem even though I did not go to Google Maps. It's possible that a web site that uses GM caused this, IDK.

BTW, the console(firefox.exe -console) shows these messages when visited www.google.com/maps. (they appear on build 20180713100116 but not on build 20180712220034)

[Parent 2400, Gecko_IOThread] WARNING: file z:/task_1563383129/build/src/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 307
[Parent 2400, Gecko_IOThread] WARNING: file z:/task_1563383129/build/src/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 307

https://searchfox.org/mozilla-release/source/ipc/chromium/src/chrome/common/ipc_channel_win.cc#307
The error code(err) is 232(0xE8) - ERROR_NO_DATA.

No offense but if the assignee isn't up to fixing this issue then perhaps someone else with more knowledge is called for.

I found another site that exhibits the same issue per this report.

https://www.amazon.com/Sterling-Silver-Created-Sapphire-White/dp/B004BR3LSG/ref=pd_srecs_sabr_st_76?_encoding=UTF8&pd_rd_i=B004BR3LSG&pd_rd_r=21eb5377-b52e-46cd-a01e-220a390fc59f&pd_rd_w=tbKRD&pd_rd_wg=4IKjF&pf_rd_p=a25d43de-f6f4-41bb-aace-e6f7c0927a67&pf_rd_r=KSCMP0FPRKAAC4DGTKS0&psc=1&refRID=KSCMP0FPRKAAC4DGTKS0#technicalSpecifications_feature_div

Like Google Maps when I exit Fx and restart it there is the same delay. For the most part I don't have a problem with amazon.com. Hope this can help fix the problem.

I am getting into more situations that certain sites, that so far I haven't identified, result in the same 10 second delay. In all probability this is the same thing that is going on with Google Maps.

Severity: major → critical
Whiteboard: [sci-exclude]
Version: 63 Branch → 70 Branch

Why isn't there any progress on this issue? It appears that because users were asked to provide debugging information the assignee doesn't even have Windows 10 machine to test on or can't repro the problem. IMO it's probably the former. If another assignee has a machine with Windows 10 perhaps he/she should be assigned the problem. As mentioned I've found other sites that exhibit the same issue. BTW... This in no way is meant to denigrate the current assignee.

Hi Gary! I expect to get to this bug this upcoming week or hand it off with a direct action plan. We've been somewhat engineer-limited as of late, but we are now able to delve into this issue in more depth. Responding to comment 55, an over-arching problem with workers is that we've had weak and conflicting invariants around their life-cycles. We've been improving this in the interim on other bugs (you may have seen some go by involving WorkerRefs), and generally improving our understanding of these invariants. Accordingly, we've been somewhat reluctant to alter things without first understanding what's changing, especially when altering the status quo might create security bugs. Thanks to :atlanto's help in initial diagnosis, it appeared that this would be a sufficiently large investigation and fix to potentially require 1+ week of engineering time to investigate and fix correctly, which made it harder to slot into the schedule.

In regards to reproduction on Windows, I should be clear that I and others do have access to Windows machines. It's just frequently preferable for us to try and reproduce on linux because we have significantly better tooling available to diagnose and analyze failures on linux through "rr" (https://rr-project.org/) and "pernosco" (a related but currently private tool) so it's frequently easier to investigate and address problems, especially with the multi-threaded, multi-process nature of the issue of a self-PBackground client loop. (There is an experimental windows debugging time travel debugger and I've reached out to those who've tried using it to see if that's an option. My impression previously was that it was not, but hopefully that impression is wrong!)

Flags: needinfo?(bugmail)

Thanks for the clarification Andrew. Considering that something like 90% of all computers use Windows I would think having the proper tools for testing, debugging, whatever, would be a high priority for Windows. Perhaps The Mozilla engineers need to develop these tools in house. I'm looking forward to a resolution of this issue and thanking you in advance for your work on solving it.

Is there a patch forthcoming? I'm finding more and more pages that cause this issue.

Too late for a fix in 70, but sounds like the team has this on its radar.

(In reply to Liz Henry (:lizzard) from comment #72)

Too late for a fix in 70, but sounds like the team has this on its radar.

Perhaps your radar stopped working ;-). This bug becomes a real PITA when I visit a site that causes the delay and I have to do a restart of Fx to update to a new Nightly version for example. True, waiting 10 seconds or more for Fx to come back to life is not the end of the world but it really should be fixed.

Andrew, could you give us an update on this bug? Thanks

Flags: needinfo?(bugmail)

Status please? Seems like a P2 severity critical bug report isn't getting the attention it deserves. Is everyone working on Fission :-) ?

With all due respect and perhaps there are more pressing issues that need to be fixed. When can we expect a resolution of this problem? As previously mentioned it is not limited to Google Maps. Other sites like amazon.com also cause a delay on an Fx restart.

We have shipped our last beta for 71 and this is an unfixed P2, so this is fix-optional for 71 in case we have a safe patch we could uplift in a potential dot-release for 71.

Please help me understand what's going on here. The only action that appears to be happening is that when a new release comes out you simply mark the previous one as WONTFIX. At what point will this problem be fixed?

There's other work that's P1 and from reading comment 69 it sounds like the right team knows about this bug.
We run through the backlog of regressions every week or two, and for issues that aren't moving, try to find owners.

And, yes, marking them wontfix to keep people informed and to clear our queue of outstanding issues for the upcoming release -- it helps to make it clear what the remaining release blocking issues for 72 are. I realize it's frustrating to see something like this unfixed for months but your earlier guess about Fission is probably correct - maybe there is underlying work being done that will make this easier to deal with eventually.

Perry, do you have cycles to pick this up maybe?

Flags: needinfo?(perry)
Assignee: bugmail → nobody
Status: ASSIGNED → NEW
Flags: needinfo?(bugmail)

Perhaps I am interpreting things incorrectly. It appears that after I opened this report 9 months ago it is now in a status as if I just opened it today, ie., NEW with no assignee. If I am wrong please enlighten me, if not what progress has been made during these 9 months?

(I've replied to a variant of comment 81 via email.)

I've placed this bug on our team agenda for Tuesday to find someone with availability to dig into this, which probably won't be Perry because of sw-e10s priorities, so clearing his needinfo.

Flags: needinfo?(perry)
Flags: needinfo?(jvarga)

Broken record here but I've never seen a bug report handled the way this one is. Yes I've seen bug reports get resolved after more than a decade after being reported but those were so called inconsequential ones (although not to those who are adversely affected). This one, might I say, is a little more substantial in it's affect on the user. Seems like nobody wants to touch it. Is it a lack of qualified people or just a general lack of personnel? Is it deemed too inconsequential to be worked on expeditiously? Surely someone in the Mozilla orbit can get this resolved with a little effort and perseverance. If I had the expertise I would offer myself but alas I have no knowledge to do so.

All this aside I've always appreciated the work all the developers do whether Mozilla employees or the myriad of users who contribute their expertise to fix issues.

Sorry for the delay. I spent several hours today analyzing this and I'm finally able to simulate a condition when shutdown is delayed by 10 seconds.
I'm still investigating what would be a proper fix for this.

Flags: needinfo?(jvarga)

Great news. Just want to remind you that going on amazon.com also produces the same delay. Presumably both sites have the same issue.

Assignee: nobody → jvarga
Status: NEW → ASSIGNED
Component: DOM: Workers → IPC

Gary, can you test a build (with a possible fix for this bug) ?

Not sure if you ever downloaded a build from the try server, but here's a link:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=c4a01805f50062819212ed35edd733903e216307

I guess you want the Windows 64 optimized build, so look for "Windows 2012 x64 shippable opt". Just click on "Bs" and then download target.zip or setup.exe.

Just installed your try version. Fx starts up just fine after going to Google Maps and amazon.com. I'll run with this to see if I encounter any other problems.

Can we get this fix over to Central or at least autoland? Seems to be no problems I found with it.

Indeed, I was waiting for your feedback.
Anyway, I have to get a review for the patch first.

So, here's a bit more information about the problem...

First of all, I wouldn't be able to resolve this so quickly without the investigation done by others. Thank you.

I can reproduce the problem by creating a worker and immediately terminating it (but only on windows since it's windows specific bug). The problem is actually the fact that there's a zombie PBackground parent actor which needs to be explicitly closed during shutdown. We do that only if actors don't close automatically in 10 seconds.

Here's what happens at low level:

  • we create a pipe on the child side, duplicate it and send it to the parent
  • the parent needs to asynchronously connect to it to make things work
  • however, the child can die very quickly and disconnect from the pipe before parent connects to it
  • we get ERR_NO_DATA on the parent side which is not handled currently
Pushed by jvarga@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/d7e7f63dc1bf
Close message channel on failure to connect to the pipe; r=jld
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla75

Gary, can you confirm that this is working better for you now?

Flags: needinfo?(garyshap)

Everything works as it should after this fix. Thanks to all those involved in fixing it.

Flags: needinfo?(garyshap)
Status: RESOLVED → VERIFIED
See Also: → 1620806

Is this something we should consider uplifting to ESR?

Flags: needinfo?(jvarga)

Well, the patch is quite risky I would say, we may consider uplifting it after FF 75 is released.

Flags: needinfo?(jvarga)
Attached image mar3.PNG

Just thought I'd highlight one impact of this. This is a graph of the 75th percentile for shutdown durations, where we see something like a 33% win as soon as this landed, which is fantastic.

Comment on attachment 9128520 [details]
Bug 1542884 - Close message channel on failure to connect to the pipe; r=jld

ESR Uplift Approval Request

  • If this is not a sec:{high,crit} bug, please state case for ESR consideration: 33% shutdown duration reduction, see https://bugzilla.mozilla.org/show_bug.cgi?id=1542884#c101
  • User impact if declined: Users would still experience 10s shutdown delay in some cases.
  • Fix Landed on Version: 75
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): We waited roughly 2 months before requesting an uplift to ESR (just in case there are any regressions given the patch modifies low level IPC code). There were no regressions AFAIK.
  • String or UUID changes made by this patch: None
Attachment #9128520 - Flags: approval-mozilla-esr68?

Comment on attachment 9128520 [details]
Bug 1542884 - Close message channel on failure to connect to the pipe; r=jld

Small, well-baked patch which has been demonstrated to drastically reduce shutdown hangs in some situations. Approved for 68.8esr.

Attachment #9128520 - Flags: approval-mozilla-esr68? → approval-mozilla-esr68+
Flags: qe-verify+
QA Whiteboard: [qa-triaged]

Reproduced the initial issue using Firefox 68.0. Verified that using Firefox 68.8.0esr on Windows 10 64bit build I can no longer reproduce this issue, the message that Firefox is still running and must be closed first is not displayed after exiting and starting Fx.

Flags: qe-verify+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: