Intermittent zombiecheck | child process XXXX still alive after shutdown

RESOLVED FIXED in Firefox 66

Status

()

defect
P3
normal
RESOLVED FIXED
4 years ago
3 months ago

People

(Reporter: KWierso, Assigned: froydnj)

Tracking

(Regressed 1 bug, {intermittent-failure})

unspecified
mozilla67
Points:
---

Firefox Tracking Flags

(firefox46 wontfix, firefox47 wontfix, firefox48 wontfix, firefox49 wontfix, firefox50 wontfix, firefox-esr60 affected, firefox65 wontfix, firefox66 fixed, firefox67 fixed)

Details

(Whiteboard: [stockwell unknown])

Attachments

(2 attachments)

Version: 47 Branch → unspecified
Can we just inject crashes into these zombies to see what they're doing?
Flags: needinfo?(ted)
We try to do that already:
https://dxr.mozilla.org/mozilla-central/rev/6ea654cad929c9bedd8a4161a182b6189fbeae6a/testing/mochitest/runtests.py#1833

I'm not sure why this isn't working, I don't see any interesting errors in the log.
Flags: needinfo?(ted)
The log that allows us to do this check was just fixed in bug 1193861.
Yeah, these are all Windows, mostly in e10s tests.
Component: Plug-ins → General
Unfortunately, bug 1262898 didn't help with this one.
Bulk assigning P3 to all open intermittent bugs without a priority set in Firefox components per bug 1298978.
Priority: -- → P3
 can see that happening in;

Version 	52.0a1
Build ID 	20160919065232
User Agent 	Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.0) Gecko/20100101 Firefox/52.0
OS 	Windows_NT 10.0

I'm not exactly sure how to replicate it, so I can't be sure if disabling extensions solve the issue or not. It kinda happens at random.

However I do noticed that whenever I try to run Firefox the first time after a boot, it just spawns a single process in task manager which gets stuck at 2.5~ mb ram usage and stays like that forever, until I force close and re-open it. Not sure if it has any relation to above issue.

Also trying to force close that process with either process name or pid via taskkill in Windows results in couldn't find process error. 

Not sure which one of those topics are active so also updated here: https://bugzilla.mozilla.org/show_bug.cgi?id=959544
Here are a couple screenshots. The issue just happened again.

Task Manager Process using 2.6mb ram without any progress: 
http://i.imgur.com/tYvP7w8.png
http://i.imgur.com/T2OR9pR.png

Every time i clicked the Firefox icon to open it it spawned another zombie process similar to above:
http://i.imgur.com/iEnT3ba.png

When i tried to kill the process via command line with taskkill
http://i.imgur.com/SWeWTm9.png

ot sure if I should I open a new bug for this or is this related enough for this one. Let me know.
Erdin, this bug is about a specific failure that occurs within our automated test harness. Please file a new bug for the issue you're finding so that it can be properly investigated. Thanks!
(In reply to Ryan VanderMeulen [:RyanVM] from comment #40)
> Erdin, this bug is about a specific failure that occurs within our automated
> test harness. Please file a new bug for the issue you're finding so that it
> can be properly investigated. Thanks!

Sorry about that, submitted here: https://bugzilla.mozilla.org/enter_bug.cgi#h=bugForm%7CFirefox

There are 30 failures associated to this bug in the last 7 days. These are occurring mainly on windows10-64 all builds.

:selena can you take a look at this?

Flags: needinfo?(sdeckelmann)
Whiteboard: [stockwell needswork]

Nathan - can you triage this?

Flags: needinfo?(sdeckelmann) → needinfo?(nfroyd)

OK, we're winding up here:

https://searchfox.org/mozilla-central/rev/5c8ea961d04767db723a0a15e3a8f7fbca154129/testing/mochitest/runtests.py#1967-1982

On asan builds, we don't have the crashreporter, so we just give up and don't even try to get a minidump. That seems...reasonable? I think such problems are separate, so perhaps we'll tackle those in a separate bug.

On builds where we do have the crashreporter, we try to terminate the process and get a minidump. In this case, we get logging like the following:

09:58:28    ERROR - TEST-UNEXPECTED-FAIL | zombiecheck | child process 6340 still alive after shutdown
09:58:28     INFO - Killing process: 6340
09:58:28     INFO - TEST-INFO | started process screenshot
09:58:28     INFO - TEST-INFO | screenshot: exit 0
09:58:28  WARNING - mozcrash kill_pid(): wait failed (-1) terminating pid 6340: error 5
09:58:28     INFO - zombiecheck | Checking for orphan process with PID: 6748
09:58:28     INFO - zombiecheck | Checking for orphan process with PID: 2100
09:58:28     INFO - mozcrash Downloading symbols from: https://queue.taskcluster.net/v1/task/OPr-Y2JsR6Cl6A4b_qT34Q/artifacts/public/build/target.crashreporter-symbols.zip
09:58:31     INFO - mozcrash Copy/paste: Z:\task_1547630818\build\win32-minidump_stackwalk.exe c:\users\task_1547630818\appdata\local\temp\tmp305rzp.mozrunner\minidumps\7cb47b5a-dfbf-48b1-b598-ebaa66adbaa1.dmp c:\users\task_1547630818\appdata\local\temp\tmpfz5aiv
09:58:34     INFO - mozcrash Saved minidump as Z:\task_1547630818\build\blobber_upload_dir\7cb47b5a-dfbf-48b1-b598-ebaa66adbaa1.dmp
09:58:34     INFO - PROCESS-CRASH | Main app process exited normally | application crashed [None]

Notice the mozcrash_kill_pid() warning: we couldn't wait for the process after termination:

https://searchfox.org/mozilla-central/rev/5c8ea961d04767db723a0a15e3a8f7fbca154129/testing/mozbase/mozcrash/mozcrash/mozcrash.py#487-494

The error is ERROR_ACCESS_DENIED. dmajor helped me understand that the WaitForSingleObject call in the above actually needs to request SYNCHRONIZE privileges on the process handle:

https://docs.microsoft.com/en-us/windows/desktop/ProcThread/process-security-and-access-rights

presumably if we did that, at least that error would go away.

The other peculiar thing is that the stacks that we do get are all complete garbage. To wit, one example:

09:58:34     INFO - Crash dump filename: c:\users\task_1547630818\appdata\local\temp\tmp305rzp.mozrunner\minidumps\7cb47b5a-dfbf-48b1-b598-ebaa66adbaa1.dmp
09:58:34     INFO - Operating system: Windows NT
09:58:34     INFO -                   10.0.15063 
09:58:34     INFO - CPU: amd64
09:58:34     INFO -      family 6 model 85 stepping 4
09:58:34     INFO -      8 CPUs
09:58:34     INFO - 
09:58:34     INFO - GPU: UNKNOWN
09:58:34     INFO - 
09:58:34     INFO - No crash
09:58:34     INFO - Process uptime: 64 seconds
09:58:34     INFO - 
09:58:34     INFO - Thread 0
09:58:34     INFO -  0  ntdll.dll!NtWaitForSingleObject + 0x14
09:58:34     INFO -     rax = 0x000001f2a9c68080   rdx = 0x0000000000000060
09:58:34     INFO -     rcx = 0x000001f2a9c68080   rbx = 0x0000000000000000
09:58:34     INFO -     rsi = 0x00000000ffffffff   rdi = 0x000000000000011c
09:58:34     INFO -     rbp = 0x0000008e8b78fa10   rsp = 0x0000008e8b78f868
09:58:34     INFO -      r8 = 0xf1190c019cc37d11    r9 = 0x000000000000002a
09:58:34     INFO -     r10 = 0x000001f2a9c602e0   r11 = 0x0000000000000000
09:58:34     INFO -     r12 = 0x0000000000000000   r13 = 0x0000000000000000
09:58:34     INFO -     r14 = 0x0000000000000000   r15 = 0x00007ff76491bde8
09:58:34     INFO -     rip = 0x00007ff9e4315424
09:58:34     INFO -     Found by: given as instruction pointer in context
09:58:34     INFO -  1  KERNELBASE.dll!RtlpHpLargeAlloc + 0x1bb
09:58:34     INFO -     rbx = 0x0000000000000000   rbp = 0x0000008e8b78fa10
09:58:34     INFO -     rsp = 0x0000008e8b78f870   r12 = 0x0000000000000000
09:58:34     INFO -     r13 = 0x0000000000000000   r14 = 0x0000000000000000
09:58:34     INFO -     r15 = 0x00007ff76491bde8   rip = 0x00007ff9e136988f
09:58:34     INFO -     Found by: call frame info

and the other threads are not much better, being primarily internal ntdll.dll symbols.

Talking with bobowen on #developers suggested one of two possibilities:

  1. We are hitting a race where we really did think we launched a new child process, but we terminated Firefox before the child process is actually anywhere close to being initialized, and so zombiecheck discovers this process in process, as it were.
  2. We really did launch the child processes we thought we did, but they were terminated in some way, and by the end of the test run, Windows managed to re-use the PID for one (or more) of our child processes...and we think the child process is still alive, but it's actually some Windows-internal thing. Ye Olde Stack Overflowe answers suggest that PID recycling is definitely A Thing in Windows (https://stackoverflow.com/a/31257979), so this suggestion is not completely wild.

Number 2 is easier to deal with than number 1, so I have some patches up that attempt to check that child processes we think are alive are actually Firefox processes. We'll do a bunch of retriggers on those and see if these not-super-frequent failures show up. Leaving needinfo for a reminder.

We're at a thousand-ish retriggers across a couple different testsuites on win64 opt, and no incidences of the orange so far. I don't think that's quite to a 95% confidence level, but...let's run with it:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=b9fd94d578e6242e72e04b0597751a3994c6287a

Patches incoming.

Flags: needinfo?(nfroyd)
Windows documentation indicates that it's invalid to WaitForSingleObject
on a process handle unless you request the SYNCHRONIZE access right.
And indeed, we see errors in the logs like:

09:58:28  WARNING - mozcrash kill_pid(): wait failed (-1) terminating pid 6340: error 5

That "error 5" is an ERROR_ACCESS_DENIED code.  Such errors should go
away with requesting the proper access rights.  Credit to dmajor for
noticing the discrepancy.
Attachment #9039792 - Flags: review?(gbrown)
We're seeing a lot of cases where our "check for zombie child processes"
check is finding live processes, but the minidumps that we get from such
processes are nonsense, and don't even feature Firefox symbols.

The working theory at this point, courtesy of bobowen, is that child
processes that we launch are getting closed during the test runs,
completely normally, and then we are finding other (non-Firefox) live
processes with the PIDs that were used for Firefox child processes at
the end of the test run.  This scenario is plausible due to Windows's
aggressive reuse of PIDs.  We don't see the same behavior on our Unix
test machines because Linux (and OS X, apparently) are not nearly as
aggressive in reusing PIDs.

Since we should be ensuring that any live processes are actually Firefox
processes anyway, let's add the appropriate check.  If the check works
to reduce the incidence of zombiecheck failures, that's great!  If not,
we've at least made our test runner more robust and can investigate
other possibilities for these intermittent failures.
Attachment #9039794 - Flags: review?(gbrown)
Comment on attachment 9039792 [details] [diff] [review]
part 1 - request synchronization privileges in kill_pid

Review of attachment 9039792 [details] [diff] [review]:
-----------------------------------------------------------------

Aha! So glad you/dmajor found this - thanks!
Attachment #9039792 - Flags: review?(gbrown) → review+
Comment on attachment 9039794 [details] [diff] [review]
part 2 - be more stringent in checking for live pids

Review of attachment 9039794 [details] [diff] [review]:
-----------------------------------------------------------------

::: testing/mochitest/runtests.py
@@ +385,5 @@
> +            if namelen == 0:
> +                # Still an active process, so conservatively assume it's Firefox.
> +                return True
> +
> +            return pName.value.endswith(('firefox.exe', 'plugin-container.exe'))

I'm sure you know more definitively that I do: Are those the only possibilities, for Windows firefox processes?
Attachment #9039794 - Flags: review?(gbrown) → review+

I was going to suggest that you could get the process' parent pid and compare against the Firefox pid but on Windows that's extremely complicated to do so your solution seems fine.

(In reply to Geoff Brown [:gbrown] from comment #114)

::: testing/mochitest/runtests.py
@@ +385,5 @@

  •        if namelen == 0:
    
  •            # Still an active process, so conservatively assume it's Firefox.
    
  •            return True
    
  •        return pName.value.endswith(('firefox.exe', 'plugin-container.exe'))
    

I'm sure you know more definitively that I do: Are those the only
possibilities, for Windows firefox processes?

That's a good question! I don't know definitively, but looking at a Firefox install on my local Windows machine, the only other .exe files that I see are all utility related, so I think we're in the clear here.

Assignee: nobody → nfroyd
Pushed by nfroyd@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/072ba80a8f3e
part 1 - request synchronization privileges in kill_pid; r=gbrown
https://hg.mozilla.org/integration/mozilla-inbound/rev/9fb802aba6b9
part 2 - be more stringent in checking for live pids; r=gbrown
Status: NEW → RESOLVED
Closed: 6 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla67
No longer depends on: 1525152
Regressions: 1525152
You need to log in before you can comment on or make changes to this bug.