Closed Bug 872033 Opened 12 years ago Closed 4 years ago

Nightly builds will not fully shutdown, zombie process left behind

Categories

(Core :: XPCOM, defect)

x86
Windows XP
defect
Not set
critical

Tracking

()

RESOLVED WORKSFORME

People

(Reporter: Noah, Unassigned)

Details

(Keywords: hang)

Attachments

(2 files)

Using Windows XP, nightlies from the range 3/8/13 - 5/14/13 do not fully exit. They leave behind a zombie firefox.exe process of random memory usage size in Task Manager. It does not close itself after any period of time. I happened to notice one floating in Task Manager and ProcMon said it was open since 5/12/13 (so about 2 days). Only way to close it was to force kill it. The craziest part about this bug, is that when I start a new instance of nightly, I assume I will be warned w/ the error "Firefox is already in use..." error message but I am not. Instead Nightly starts up as if the other zombie firefox.exe process did not exist. Even though it was still present in the background! Double checked w/ ProcMon and the zombie firefox.exe was pointing to the same profile path as the new nightly I opened. Range: 3/7/13 - works 3/8/13 - busted I'm using the -no-remote command. That may play a role here, not sure though. I thought this might've been Telemetry's or FHR's fault since they were enabled by default. Disabled them both and repeated restarts still have the issue. New profiles were used everytime. So no addons to blame here.
Another interesting side effect is that after the 1st nightly zombie process is left behind, future openings of nightly open a separate instance of nightly alongside the zombie process but when that nightly instance is closed, the latest firefox.exe process is fully shutdown. Meanwhile the original instance (which is now the zombie process) continues to float in the background. So the good news is, no more than 1 zombie instance of nightly will be left behind. I suspected bug 848036 & bug 847468 as the causes since they seemed related to shutdown. But there's also bug 846471 & bug 848020. Changelog: http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=ee4879719f78&tochange=cb432984d5ce
Component: General → XPCOM
Noah, would you be willing to attach a debugger to the process and get a stack of it when it's stuck? Since our crash reporter cannot be used for this, you'd need to use either windbg or Visual Studio debugger (both free) to do this. Some instructions are here: https://developer.mozilla.org/en-US/docs/How_to_get_a_stacktrace_with_WinDbg
Alright finally ran Windbg and got a stacktrace after having to deal with a very annoying, mysterious problem where neither Firefox's or Windows' symbols would download. After 4 days of relentlessly troubleshooting this, turned out the source of the problem was that IE was set to Work Offline mode! Unchecking that in its File menu allowed symbols to d/l instantly! Noting this for my future self in case I hit this problem again. ;) Stacktrace: 0:002> !uniqstack Processing 3 threads, please wait . 0 Id: 1a38.14b4 Suspend: 1 Teb: 7ffdf000 Unfrozen Start: firefox!wmainCRTStartup (004023b6) Priority: 0 Priority class: 32 Affinity: 1 ChildEBP RetAddr 0012d5f4 7c90df5a ntdll!KiFastSystemCallRet 0012d5f8 7c8025cb ntdll!NtWaitForSingleObject+0xc 0012d65c 7c802532 kernel32!WaitForSingleObjectEx+0xa8 0012d670 01d3ebf7 kernel32!WaitForSingleObject+0x12 0012d690 01d8ed10 xul!EnsureImageHlpInitialized+0x62 [e:\builds\moz2_slave\m-cen-w32-ntly-000000000000000\build\xpcom\base\nsstackwalk.cpp @ 281] 0012f6f0 021bfa0e xul!NS_StackWalk+0x26 [e:\builds\moz2_slave\m-cen-w32-ntly-000000000000000\build\xpcom\base\nsstackwalk.cpp @ 474] 0012f8ac 7c810e06 xul!mozilla::ValidWriteAssert+0x83 [e:\builds\moz2_slave\m-cen-w32-ntly-000000000000000\build\xpcom\build\mozpoisonwritebase.cpp @ 189] 0012f90c 00fd2ad7 kernel32!WriteFile+0xf7 WARNING: Stack unwind information not available. Following frames may be wrong. 0012f930 00fd2a80 SBHook!UnInstallHook+0x118 0012fd40 00fd2951 SBHook!UnInstallHook+0xc1 0012fd74 7c90118a SBHook+0x2951 0012fd94 7c9235f7 ntdll!LdrpCallInitRoutine+0x14 0012fe18 7c81cd86 ntdll!LdrShutdownProcess+0x14f 0012ff0c 7c81cdfe kernel32!_ExitProcess+0x42 0012ff20 78ac7ec3 kernel32!ExitProcess+0x14 0012ff2c 78ac7fdf MSVCR100!__crtExitProcess+0x17 [f:\dd\vctools\crt_bld\self_x86\crt\src\crt0dat.c @ 708] 0012ff70 78ac8051 MSVCR100!doexit+0xfb [f:\dd\vctools\crt_bld\self_x86\crt\src\crt0dat.c @ 621] 0012ff84 004022ab MSVCR100!exit+0x11 [f:\dd\vctools\crt_bld\self_x86\crt\src\crt0dat.c @ 393] 0012ffc0 7c816fe7 firefox!__tmainCRTStartup+0x138 [f:\dd\vctools\crt_bld\self_x86\crt\src\crtexe.c @ 566] 0012fff0 00000000 kernel32!BaseProcessStart+0x23 . 1 Id: 1a38.1894 Suspend: 1 Teb: 7ffde000 Unfrozen Start: MSVCR100!_threadstartex (78afc724) Priority: 0 Priority class: 32 Affinity: 1 ChildEBP RetAddr 02abfc10 7c90df5a ntdll!KiFastSystemCallRet 02abfc14 7c918b23 ntdll!NtWaitForSingleObject+0xc 02abfc9c 7c901046 ntdll!RtlpWaitForCriticalSection+0x132 02abfca4 7c926aeb ntdll!RtlEnterCriticalSection+0x46 02abfd1c 7c90e457 ntdll!_LdrpInitialize+0xf0 00000000 00000000 ntdll!KiUserApcDispatcher+0x7 . 2 Id: 1a38.1b2c Suspend: 1 Teb: 7ffdc000 Unfrozen Priority: 0 Priority class: 32 Affinity: 1 ChildEBP RetAddr 02acffc8 7c951e38 ntdll!DbgBreakPoint 02acfff4 00000000 ntdll!DbgUiRemoteBreakin+0x2d Total threads: 3
If you still have you're session open i'd be nice to get a backtrace with the symbols for kernel32 as well to get a proper stack. It is pointing to ValidWriteAssert but it's not clear where this write is coming form.
I'm pretty sure kernel32 symbols are loaded. It's SBHook which doesn't have symbols (I think). The web seems to think that it's from Kaspersky antivirus and probably we're calling DllMain(DLL_PROCESS_DETACH) on it and it's writing to a file from that, which hits our WriteFile poison and then tries to initialize NS_Stackwalk but can't because... well any number of reasons, but probably something related to http://hg.mozilla.org/mozilla-central/annotate/6a739f41d1ba/xpcom/base/nsStackWalk.cpp#l261
I think we need to modify SetShutdownChecks to make sure that imagehlp.dll is loaded if we plan to record the stacks.
Attached file Windbg log 5/26/2013
The SBHook.dll (SmartBridge Hook) is part of the SmartBridge service which is used by my AT&T DSL in some way. Presumably it maintains my internet connection in some way. It's in the directory of C:\Program Files\SBC Self Support Tool\SmartBridge Never had Kaspersky Antivirus installed. Is this file causing any problems or is it a innocent victim caught up in the mix? @BenWa In the log, I see: ModLoad: 7c800000 7c8f5000 C:\WINDOWS\system32\kernel32.dll So kernel32 has symbols. Attaching full Windbg log in case it helps clarify anything else.
Ben, since you know this code pretty well, could you backout the patches I mentioned in comment 1 and create a tryserver build I can test with? Or if all the patches aren't dependent on each other, create maybe 2 tryserver builds each removing 2 certain patches out of the 4 checkins that are related to this bug. That way we can point the blame at a particular checkin in case that helps you guys understand/narrow down the problem better. I suspect bug 848036 or bug 848020. I have no coding experience whatsoever so I have nothing to back my claims up. :P Just a gut feeling since one [bug 848036] mentions shutdown and the other [bug 848020] had doubts & worrisome talk in the bug comments. The other 2 are: Bug 847468 & bug 846471 What do these changes accomplish? I'm not sure what the benefits are of these changes. And what's the exit(0) project? (from Bug 848020 comment #2)
Flags: needinfo?(bgirard)
Noah we talked about this on IRC. Did you find out exactly which addon was loading the library that was causing this?
Flags: needinfo?(bgirard)
My analysis and workaround method: http://pcxfirefox.sourceforge.net/?p=43 http://pcxfirefox.sourceforge.net/zh/?p=20 The method seems to fix the issue.
Attached patch patchSplinter Review
I've decided that at this point this is causing more harm then good in the release channel and no one has time to deal with the reports. We should keep this on in Nightly for now however.
Attachment #788915 - Flags: review?(vdjeric)
(In reply to Benoit Girard (:BenWa) from comment #11) > Created attachment 788915 [details] [diff] [review] > patch > > I've decided that at this point this is causing more harm then good in the > release channel and no one has time to deal with the reports. We should keep > this on in Nightly for now however. In fact, I don't know what is SCM_RECORD, does it belong to telemetry function? But when disable telemetry, SCM_RECORD is also hitted.
(In reply to xunxun from comment #12) > (In reply to Benoit Girard (:BenWa) from comment #11) > > Created attachment 788915 [details] [diff] [review] > > patch > > > > I've decided that at this point this is causing more harm then good in the > > release channel and no one has time to deal with the reports. We should keep > > this on in Nightly for now however. > > In fact, I don't know what is SCM_RECORD, does it belong to telemetry > function? But when disable telemetry, SCM_RECORD is also hitted. Its meant to watch for things writing to the disk late into the firefox shutdown. We plan on shutting down firefox earlier/faster so anything that writes to the disk late will be removed.
(In reply to Benoit Girard (:BenWa) from comment #11) > Created attachment 788915 [details] [diff] [review] > patch > > I've decided that at this point this is causing more harm then good in the > release channel and no one has time to deal with the reports. We should keep > this on in Nightly for now however. And what does "default" mean? Does it contain release, beta and aurora? If yes, I can understand it. If no, why firefox not nightly edition also may have the problem?
Comment on attachment 788915 [details] [diff] [review] patch > #ifdef DEBUG > gShutdownChecks = SCM_CRASH; > #else > const char* releaseChannel = NS_STRINGIFY(MOZ_UPDATE_CHANNEL); >- if (strcmp(releaseChannel, "nightly") == 0 || >- strcmp(releaseChannel, "default") == 0) { >+ if (strcmp(releaseChannel, "nightly") == 0) { > gShutdownChecks = SCM_RECORD; > } else { > gShutdownChecks = SCM_NOTHING; > } > #endif Actually the "default" update channel represents local developer builds, so I don't think this change would help. The update channels are "release", "beta", "aurora", "nightly", "default".
Attachment #788915 - Flags: review?(vdjeric) → review-
I can confirm that problem still persists on Windows 8 x64 - a random process is left behind after I close Nightly. Sometimes it remains with about 30MB of memory used, sometimes it remains with over 200MB used....
(In reply to Frank from comment #16) > I can confirm that problem still persists on Windows 8 x64 - a random > process is left behind after I close Nightly. Sometimes it remains with > about 30MB of memory used, sometimes it remains with over 200MB used.... You can try to add a environment value called MOZ_SHUTDOWN_CHECKS, value is nothing
Does this still occur when using a current version?
Severity: normal → critical
Flags: needinfo?(mye)
Flags: needinfo?(krystaiceman)
Keywords: hang
I can see that happening in; Version 52.0a1 Build ID 20160919065232 User Agent Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.0) Gecko/20100101 Firefox/52.0 OS Windows_NT 10.0 I'm not exactly sure how to replicate it, so I can't be sure if disabling extensions solve the issue or not. It kinda happens at random. However I do noticed that whenever I try to run Firefox the first time after a boot, it just spawns a single process in task manager which gets stuck at 2.5~ mb ram usage and stays like that forever, until I force close and re-open it. Not sure if it has any relation to above issue.

Erdin, can you still reproduce?

Flags: needinfo?(unique.ek)
Flags: needinfo?(mye)
Flags: needinfo?(krystaiceman)

I have attempted reproducing this issue on Windows 7 and Windows 10 with the versions that were reported to have been reproduced it (Nightly v22.0a1 3/8/13 and v52.0a1 20160919065232). I attempted opening some random pages (Youtube/Wikipedia/Reddit/others), then closed the browser and inspected the Processes tab of Windows's Task Manager. I could not reproduce it and I am pretty sure it may be caused by the use of some addon or some kind of user data in the original user profiles.

@Erdin: if you can still reproduce it, you may be able to help us troubleshoot it.
The contents of the "about:support" page, from the browser that reproduces it might hold some useful information. Can you attach it to this bug?

@Noah: Have you observed this issue again during all this time since you've logged it? Can you still reproduce it now?
The contents of the "about:support" page, from the browser that reproduces it might hold some useful information. Can you attach it to this bug?

If you guys are still willing to help, more troubleshooting questions might be provided by the devs.

Flags: needinfo?(krystaiceman)

Considering how old this bug is and the fact that the people who reproduced it no longer replied, I will close this bug as worksforme.
If you can reproduce it again, please provide the requested information and reopen it. Thank you!

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → WORKSFORME

Thanks to all who investigated & also confirmed seeing the bug themselves! I'm not able to reproduce this anymore & I hope the future versions of Windows never trigger this bug again.

If it ever does, I will report it here or in a new bug with a reference to this bug. Sorry for not responding sooner!
Replying to remove the NeedInfo flag. ;)

Flags: needinfo?(krystaiceman)
Flags: needinfo?(unique.ek)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: