Closed Bug 1339594 Opened 7 years ago Closed 7 years ago

Intermittent TEST-UNEXPECTED-ERROR | damp,ts_paint,tps,a11yr,tresize | psutil.NoSuchProcess process no longer exists (pid=3740, name=u'firefox.exe')

Categories

(Testing :: Talos, defect)

defect
Not set
normal

Tracking

(firefox52 unaffected, firefox53 fixed, firefox54 fixed)

RESOLVED FIXED
mozilla54
Tracking Status
firefox52 --- unaffected
firefox53 --- fixed
firefox54 --- fixed

People

(Reporter: philor, Assigned: gbrown)

References

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell fixed])

Attachments

(1 file)

+++ This bug was initially created as a clone of Bug #1236770 +++

Carrying on the intermittents, since bug 1236770 got hijacked by one cause of one flavor in one subtest of one suite.
With mis-stars eliminated, I think this may be Windows-only and e10s-only, and mostly pgo.
This seems typical in recent failures:

23:53:35     INFO -  TEST-UNEXPECTED-ERROR | ts_paint | psutil.NoSuchProcess process no longer exists (pid=3360, name=u'firefox.exe')
23:53:35    ERROR -  Traceback (most recent call last):
23:53:35     INFO -    File "C:\slave\test\build\tests\talos\talos\run_tests.py", line 202, in run_tests
23:53:35     INFO -      talos_results.add(mytest.runTest(browser_config, test))
23:53:35     INFO -    File "C:\slave\test\build\tests\talos\talos\ttest.py", line 70, in runTest
23:53:35     INFO -      return self._runTest(browser_config, test_config, setup)
23:53:35     INFO -    File "C:\slave\test\build\tests\talos\talos\ttest.py", line 174, in _runTest
23:53:35     INFO -      if counter_management else None),
23:53:35     INFO -    File "C:\slave\test\build\tests\talos\talos\talos_process.py", line 140, in run_browser
23:53:35     INFO -      return_code = context.kill_process()
23:53:35     INFO -    File "C:\slave\test\build\tests\talos\talos\talos_process.py", line 40, in kill_process
23:53:35     INFO -      procs = self.process.children()
23:53:35     INFO -    File "C:\slave\test\build\venv\lib\site-packages\psutil\__init__.py", line 299, in wrapper
23:53:35     INFO -      raise NoSuchProcess(self.pid, self._name)
23:53:35     INFO -  NoSuchProcess: psutil.NoSuchProcess process no longer exists (pid=3360, name=u'firefox.exe')


It looks like, if self.process.terminate() throws NoSuchProcess, there is a good chance that self.process.children() will also throw NoSuchProcess, making the change in bug 1236770 not very effective. Brad, could we just ignore NoSuchProcess in kill_process? Or get a list of children some other way/at some other time??
Flags: needinfo?(blassey.bugs)
This is automation code, I have no idea.
Flags: needinfo?(blassey.bugs)
I don't mind ignoring NoSuchProcess, or when we get it follow another code path.  I am just not sure how to ensure we get the child processes and why we would get NoSuchProcess.  Maybe it terminated before we get to this point?
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=79049477&lineNumber=1947

06:12:52     INFO -  Running cycle 13/20 for ts_paint test...
06:12:52     INFO -  TEST-INFO | started process 1944 (C:\slave\test\build\application\firefox\firefox -profile c:\users\cltbld\appdata\local\temp\tmpkpbugm\profile http://localhost:49677/startup_test/tspaint_test.html)
06:15:22     INFO -  Terminating psutil.Process(pid=1944, name=u'firefox.exe')
06:15:23     INFO -  TEST-UNEXPECTED-ERROR | ts_paint | psutil.NoSuchProcess process no longer exists (pid=1944, name=u'firefox.exe')
06:15:23    ERROR -  Traceback (most recent call last):
06:15:23     INFO -    File "C:\slave\test\build\tests\talos\talos\run_tests.py", line 202, in run_tests
06:15:23     INFO -      talos_results.add(mytest.runTest(browser_config, test))
06:15:23     INFO -    File "C:\slave\test\build\tests\talos\talos\ttest.py", line 70, in runTest
06:15:23     INFO -      return self._runTest(browser_config, test_config, setup)
06:15:23     INFO -    File "C:\slave\test\build\tests\talos\talos\ttest.py", line 174, in _runTest
06:15:23     INFO -      if counter_management else None),
06:15:23     INFO -    File "C:\slave\test\build\tests\talos\talos\talos_process.py", line 140, in run_browser
06:15:23     INFO -      return_code = context.kill_process()
06:15:23     INFO -    File "C:\slave\test\build\tests\talos\talos\talos_process.py", line 40, in kill_process
06:15:23     INFO -      procs = self.process.children()
06:15:23     INFO -    File "C:\slave\test\build\venv\lib\site-packages\psutil\__init__.py", line 299, in wrapper
06:15:23     INFO -      raise NoSuchProcess(self.pid, self._name)
06:15:23     INFO -  NoSuchProcess: psutil.NoSuchProcess process no longer exists (pid=1944, name=u'firefox.exe')

After "started process", there is a 150 second pause with no output. Perhaps there has been a timeout and TalosError has been raised:

https://dxr.mozilla.org/mozilla-central/source/testing/talos/talos/talos_process.py#136

but I think that won't be reported if we raise again in the finally handler, in kill_process.
Assignee: nobody → gbrown
I reproduced the NoSuchProcess on try and determined that the browser was hung and the test timed out at https://dxr.mozilla.org/mozilla-central/rev/a180b976c165b6cd174d24bbb77941919cdc53cb/testing/talos/talos/talos_process.py#124. A minidump was probably produced, but then the TalosError was raised, triggering the finally handler and the attempt to kill the browser...which had already been killed in kill_and_get_minidump().

In this patch, I catch and log exceptions from the possibly-second kill attempt, but then allow processing to continue.

In the case of a timeout, the TalosError will still be raised, which would bypass crash reporting. To handle that, I added another check_for_crashes() call. This part of the patch is not an elegant solution, but gets the job done.

Will try to demonstrate the new handling at https://treeherder.mozilla.org/#/jobs?repo=try&revision=33f2d50b458bae93328e3eea526b85a25ea77b3c; I expect the NoSuchProcess failures to be transformed into PROCESS-CRASH failures, with crash reports.
Attachment #8840540 - Flags: review?(jmaher)
Comment on attachment 8840540 [details] [diff] [review]
handle timeouts better

Review of attachment 8840540 [details] [diff] [review]:
-----------------------------------------------------------------

this looks good.  consider in the future possibly raising something from the finally:try/catch.

::: testing/talos/talos/talos_process.py
@@ +145,5 @@
> +                return_code = proc.wait(1)
> +        except:
> +            # Maybe killed by kill_and_get_minidump(), maybe ended?
> +            LOG.info("Unable to kill process")
> +            LOG.info(traceback.format_exc())

should we raise anything here?
Attachment #8840540 - Flags: review?(jmaher) → review+
(In reply to Joel Maher ( :jmaher) from comment #11)
> ::: testing/talos/talos/talos_process.py
> @@ +145,5 @@
> > +                return_code = proc.wait(1)
> > +        except:
> > +            # Maybe killed by kill_and_get_minidump(), maybe ended?
> > +            LOG.info("Unable to kill process")
> > +            LOG.info(traceback.format_exc())
> 
> should we raise anything here?

Maybe...sometimes? If there has been a timeout, a TalosError has already been raised - no sense in raising another exception. Otherwise, I suppose the test has completed successfully but there was a problem with killing the process; that might be cause for raising an exception. I'm hoping that in that case either 1. everything is actually okay and we complete successfully, or 2. something goes wrong at a later time and someone will notice the NoSuchProcess exception in the log as a possibly related clue.

It might be better to restructure the code somehow to handle timeouts more directly...I'm not sure how exactly. I'll press on with this change for now.
(In reply to Geoff Brown [:gbrown] from comment #10)
> Will try to demonstrate the new handling at
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=33f2d50b458bae93328e3eea526b85a25ea77b3c; I expect
> the NoSuchProcess failures to be transformed into PROCESS-CRASH failures,
> with crash reports.

Here's one:

https://archive.mozilla.org/pub/firefox/try-builds/gbrown@mozilla.com-33f2d50b458bae93328e3eea526b85a25ea77b3c/try-win32/try_win7_ix_test-chromez-e10s-bm112-tests1-windows-build37.txt.gz

16:10:11     INFO -  TEST-INFO | started process 3588 (C:\slave\test\build\application\firefox\firefox -profile c:\users\cltbld\appdata\local\temp\tmp_xpvjo\profile http://localhost:49816/startup_test/tresize/addon/content/tresize-test.html)
16:12:41     INFO -  Timeout waiting for test completion; killing browser...
16:12:42     INFO -  Terminating psutil.Process(pid=3588, name=u'firefox.exe')
16:12:42     INFO -  Unable to kill process
16:12:42    ERROR -  Traceback (most recent call last):
16:12:42     INFO -    File "C:\slave\test\build\tests\talos\talos\talos_process.py", line 143, in run_browser
16:12:42     INFO -      return_code = context.kill_process()
16:12:42     INFO -    File "C:\slave\test\build\tests\talos\talos\talos_process.py", line 41, in kill_process
16:12:42     INFO -      procs = self.process.children()
16:12:42     INFO -    File "C:\slave\test\build\venv\lib\site-packages\psutil\__init__.py", line 299, in wrapper
16:12:42     INFO -      raise NoSuchProcess(self.pid, self._name)
16:12:42     INFO -  NoSuchProcess: psutil.NoSuchProcess process no longer exists (pid=3588, name=u'firefox.exe')
16:12:42     INFO -  mozcrash Downloading symbols from: https://queue.taskcluster.net/v1/task/B6M4Wx_rQCGEqZ2av4vOHQ/artifacts/public/build/firefox-54.0a1.en-US.win32.crashreporter-symbols.zip
16:12:49     INFO -  mozcrash Copy/paste: C:\slave\test\build\win32-minidump_stackwalk.exe c:\users\cltbld\appdata\local\temp\tmp_xpvjo\profile\minidumps\6456ec98-13ba-4cc8-aeea-7e46e8ef4c20.dmp c:\users\cltbld\appdata\local\temp\tmprp75bd
16:12:56     INFO -  mozcrash Saved minidump as C:\slave\test\build\blobber_upload_dir\6456ec98-13ba-4cc8-aeea-7e46e8ef4c20.dmp
16:12:56     INFO -  PROCESS-CRASH | tresize | application crashed [unknown top frame]
16:12:56     INFO -  Crash dump filename: c:\users\cltbld\appdata\local\temp\tmp_xpvjo\profile\minidumps\6456ec98-13ba-4cc8-aeea-7e46e8ef4c20.dmp
16:12:56     INFO -  Operating system: Windows NT
16:12:56     INFO -                    6.1.7601 Service Pack 1
16:12:56     INFO -  CPU: x86
16:12:56     INFO -       GenuineIntel family 6 model 30 stepping 5
16:12:56     INFO -       8 CPUs
16:12:56     INFO -  GPU: UNKNOWN
16:12:56     INFO -  No crash
16:12:56     INFO -  Process uptime: 151 seconds
16:12:56     INFO -  Thread 0
16:12:56     INFO -   0  KERNELBASE.dll!GetLocaleInfoHelper + 0xb1
16:12:56     INFO -      eip = 0x7577092c   esp = 0x002ec9d8   ebp = 0x002ecac0   ebx = 0x000cee7c
16:12:56     INFO -      esi = 0x000b65cc   edi = 0x00000000   eax = 0x00000019   ecx = 0x00000001
16:12:56     INFO -      edx = 0x00000000   efl = 0x00000202
16:12:56     INFO -      Found by: given as instruction pointer in context
Pushed by gbrown@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/eb573b6f4d66
Improve Talos error handling for hung browser; r=jmaher
https://hg.mozilla.org/mozilla-central/rev/eb573b6f4d66
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla54
I am reopening as this is more debugging code, but please close if we think this fixes the bug or will morph a new bug as more actionable.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
See Also: → 1342685
See Also: → 1342594
See Also: → 1342735
See Also: → 1343242
See Also: → 1310638
Whiteboard: [stockwell needswork]
Blocks: 1343549
(In reply to Joel Maher ( :jmaher) from comment #16)
> I am reopening as this is more debugging code, but please close if we think
> this fixes the bug or will morph a new bug as more actionable.

We haven't had any new reports of NoSuchProcess (just a few mis-stars) and I don't think we will. The change has correctly morphed the NoSuchProcess reports into crashes (see see-also bugs).
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Resolution: --- → FIXED
Whiteboard: [stockwell needswork] → [stockwell fixed]
Depends on: 1622257
You need to log in before you can comment on or make changes to this bug.