Closed
Bug 1339594
Opened 8 years ago
Closed 8 years ago
Intermittent TEST-UNEXPECTED-ERROR | damp,ts_paint,tps,a11yr,tresize | psutil.NoSuchProcess process no longer exists (pid=3740, name=u'firefox.exe')
Categories
(Testing :: Talos, defect)
Testing
Talos
Tracking
(firefox52 unaffected, firefox53 fixed, firefox54 fixed)
RESOLVED
FIXED
mozilla54
Tracking | Status | |
---|---|---|
firefox52 | --- | unaffected |
firefox53 | --- | fixed |
firefox54 | --- | fixed |
People
(Reporter: philor, Assigned: gbrown)
References
Details
(Keywords: intermittent-failure, Whiteboard: [stockwell fixed])
Attachments
(1 file)
2.56 KB,
patch
|
jmaher
:
review+
|
Details | Diff | Splinter Review |
+++ This bug was initially created as a clone of Bug #1236770 +++
Carrying on the intermittents, since bug 1236770 got hijacked by one cause of one flavor in one subtest of one suite.
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 4•8 years ago
|
||
With mis-stars eliminated, I think this may be Windows-only and e10s-only, and mostly pgo.
Assignee | ||
Comment 5•8 years ago
|
||
This seems typical in recent failures:
23:53:35 INFO - TEST-UNEXPECTED-ERROR | ts_paint | psutil.NoSuchProcess process no longer exists (pid=3360, name=u'firefox.exe')
23:53:35 ERROR - Traceback (most recent call last):
23:53:35 INFO - File "C:\slave\test\build\tests\talos\talos\run_tests.py", line 202, in run_tests
23:53:35 INFO - talos_results.add(mytest.runTest(browser_config, test))
23:53:35 INFO - File "C:\slave\test\build\tests\talos\talos\ttest.py", line 70, in runTest
23:53:35 INFO - return self._runTest(browser_config, test_config, setup)
23:53:35 INFO - File "C:\slave\test\build\tests\talos\talos\ttest.py", line 174, in _runTest
23:53:35 INFO - if counter_management else None),
23:53:35 INFO - File "C:\slave\test\build\tests\talos\talos\talos_process.py", line 140, in run_browser
23:53:35 INFO - return_code = context.kill_process()
23:53:35 INFO - File "C:\slave\test\build\tests\talos\talos\talos_process.py", line 40, in kill_process
23:53:35 INFO - procs = self.process.children()
23:53:35 INFO - File "C:\slave\test\build\venv\lib\site-packages\psutil\__init__.py", line 299, in wrapper
23:53:35 INFO - raise NoSuchProcess(self.pid, self._name)
23:53:35 INFO - NoSuchProcess: psutil.NoSuchProcess process no longer exists (pid=3360, name=u'firefox.exe')
It looks like, if self.process.terminate() throws NoSuchProcess, there is a good chance that self.process.children() will also throw NoSuchProcess, making the change in bug 1236770 not very effective. Brad, could we just ignore NoSuchProcess in kill_process? Or get a list of children some other way/at some other time??
Flags: needinfo?(blassey.bugs)
Comment hidden (Intermittent Failures Robot) |
Comment 8•8 years ago
|
||
I don't mind ignoring NoSuchProcess, or when we get it follow another code path. I am just not sure how to ensure we get the child processes and why we would get NoSuchProcess. Maybe it terminated before we get to this point?
Assignee | ||
Comment 9•8 years ago
|
||
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=79049477&lineNumber=1947
06:12:52 INFO - Running cycle 13/20 for ts_paint test...
06:12:52 INFO - TEST-INFO | started process 1944 (C:\slave\test\build\application\firefox\firefox -profile c:\users\cltbld\appdata\local\temp\tmpkpbugm\profile http://localhost:49677/startup_test/tspaint_test.html)
06:15:22 INFO - Terminating psutil.Process(pid=1944, name=u'firefox.exe')
06:15:23 INFO - TEST-UNEXPECTED-ERROR | ts_paint | psutil.NoSuchProcess process no longer exists (pid=1944, name=u'firefox.exe')
06:15:23 ERROR - Traceback (most recent call last):
06:15:23 INFO - File "C:\slave\test\build\tests\talos\talos\run_tests.py", line 202, in run_tests
06:15:23 INFO - talos_results.add(mytest.runTest(browser_config, test))
06:15:23 INFO - File "C:\slave\test\build\tests\talos\talos\ttest.py", line 70, in runTest
06:15:23 INFO - return self._runTest(browser_config, test_config, setup)
06:15:23 INFO - File "C:\slave\test\build\tests\talos\talos\ttest.py", line 174, in _runTest
06:15:23 INFO - if counter_management else None),
06:15:23 INFO - File "C:\slave\test\build\tests\talos\talos\talos_process.py", line 140, in run_browser
06:15:23 INFO - return_code = context.kill_process()
06:15:23 INFO - File "C:\slave\test\build\tests\talos\talos\talos_process.py", line 40, in kill_process
06:15:23 INFO - procs = self.process.children()
06:15:23 INFO - File "C:\slave\test\build\venv\lib\site-packages\psutil\__init__.py", line 299, in wrapper
06:15:23 INFO - raise NoSuchProcess(self.pid, self._name)
06:15:23 INFO - NoSuchProcess: psutil.NoSuchProcess process no longer exists (pid=1944, name=u'firefox.exe')
After "started process", there is a 150 second pause with no output. Perhaps there has been a timeout and TalosError has been raised:
https://dxr.mozilla.org/mozilla-central/source/testing/talos/talos/talos_process.py#136
but I think that won't be reported if we raise again in the finally handler, in kill_process.
Assignee: nobody → gbrown
Assignee | ||
Comment 10•8 years ago
|
||
I reproduced the NoSuchProcess on try and determined that the browser was hung and the test timed out at https://dxr.mozilla.org/mozilla-central/rev/a180b976c165b6cd174d24bbb77941919cdc53cb/testing/talos/talos/talos_process.py#124. A minidump was probably produced, but then the TalosError was raised, triggering the finally handler and the attempt to kill the browser...which had already been killed in kill_and_get_minidump().
In this patch, I catch and log exceptions from the possibly-second kill attempt, but then allow processing to continue.
In the case of a timeout, the TalosError will still be raised, which would bypass crash reporting. To handle that, I added another check_for_crashes() call. This part of the patch is not an elegant solution, but gets the job done.
Will try to demonstrate the new handling at https://treeherder.mozilla.org/#/jobs?repo=try&revision=33f2d50b458bae93328e3eea526b85a25ea77b3c; I expect the NoSuchProcess failures to be transformed into PROCESS-CRASH failures, with crash reports.
Attachment #8840540 -
Flags: review?(jmaher)
Comment 11•8 years ago
|
||
Comment on attachment 8840540 [details] [diff] [review]
handle timeouts better
Review of attachment 8840540 [details] [diff] [review]:
-----------------------------------------------------------------
this looks good. consider in the future possibly raising something from the finally:try/catch.
::: testing/talos/talos/talos_process.py
@@ +145,5 @@
> + return_code = proc.wait(1)
> + except:
> + # Maybe killed by kill_and_get_minidump(), maybe ended?
> + LOG.info("Unable to kill process")
> + LOG.info(traceback.format_exc())
should we raise anything here?
Attachment #8840540 -
Flags: review?(jmaher) → review+
Assignee | ||
Comment 12•8 years ago
|
||
(In reply to Joel Maher ( :jmaher) from comment #11)
> ::: testing/talos/talos/talos_process.py
> @@ +145,5 @@
> > + return_code = proc.wait(1)
> > + except:
> > + # Maybe killed by kill_and_get_minidump(), maybe ended?
> > + LOG.info("Unable to kill process")
> > + LOG.info(traceback.format_exc())
>
> should we raise anything here?
Maybe...sometimes? If there has been a timeout, a TalosError has already been raised - no sense in raising another exception. Otherwise, I suppose the test has completed successfully but there was a problem with killing the process; that might be cause for raising an exception. I'm hoping that in that case either 1. everything is actually okay and we complete successfully, or 2. something goes wrong at a later time and someone will notice the NoSuchProcess exception in the log as a possibly related clue.
It might be better to restructure the code somehow to handle timeouts more directly...I'm not sure how exactly. I'll press on with this change for now.
Assignee | ||
Comment 13•8 years ago
|
||
(In reply to Geoff Brown [:gbrown] from comment #10)
> Will try to demonstrate the new handling at
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=33f2d50b458bae93328e3eea526b85a25ea77b3c; I expect
> the NoSuchProcess failures to be transformed into PROCESS-CRASH failures,
> with crash reports.
Here's one:
https://archive.mozilla.org/pub/firefox/try-builds/gbrown@mozilla.com-33f2d50b458bae93328e3eea526b85a25ea77b3c/try-win32/try_win7_ix_test-chromez-e10s-bm112-tests1-windows-build37.txt.gz
16:10:11 INFO - TEST-INFO | started process 3588 (C:\slave\test\build\application\firefox\firefox -profile c:\users\cltbld\appdata\local\temp\tmp_xpvjo\profile http://localhost:49816/startup_test/tresize/addon/content/tresize-test.html)
16:12:41 INFO - Timeout waiting for test completion; killing browser...
16:12:42 INFO - Terminating psutil.Process(pid=3588, name=u'firefox.exe')
16:12:42 INFO - Unable to kill process
16:12:42 ERROR - Traceback (most recent call last):
16:12:42 INFO - File "C:\slave\test\build\tests\talos\talos\talos_process.py", line 143, in run_browser
16:12:42 INFO - return_code = context.kill_process()
16:12:42 INFO - File "C:\slave\test\build\tests\talos\talos\talos_process.py", line 41, in kill_process
16:12:42 INFO - procs = self.process.children()
16:12:42 INFO - File "C:\slave\test\build\venv\lib\site-packages\psutil\__init__.py", line 299, in wrapper
16:12:42 INFO - raise NoSuchProcess(self.pid, self._name)
16:12:42 INFO - NoSuchProcess: psutil.NoSuchProcess process no longer exists (pid=3588, name=u'firefox.exe')
16:12:42 INFO - mozcrash Downloading symbols from: https://queue.taskcluster.net/v1/task/B6M4Wx_rQCGEqZ2av4vOHQ/artifacts/public/build/firefox-54.0a1.en-US.win32.crashreporter-symbols.zip
16:12:49 INFO - mozcrash Copy/paste: C:\slave\test\build\win32-minidump_stackwalk.exe c:\users\cltbld\appdata\local\temp\tmp_xpvjo\profile\minidumps\6456ec98-13ba-4cc8-aeea-7e46e8ef4c20.dmp c:\users\cltbld\appdata\local\temp\tmprp75bd
16:12:56 INFO - mozcrash Saved minidump as C:\slave\test\build\blobber_upload_dir\6456ec98-13ba-4cc8-aeea-7e46e8ef4c20.dmp
16:12:56 INFO - PROCESS-CRASH | tresize | application crashed [unknown top frame]
16:12:56 INFO - Crash dump filename: c:\users\cltbld\appdata\local\temp\tmp_xpvjo\profile\minidumps\6456ec98-13ba-4cc8-aeea-7e46e8ef4c20.dmp
16:12:56 INFO - Operating system: Windows NT
16:12:56 INFO - 6.1.7601 Service Pack 1
16:12:56 INFO - CPU: x86
16:12:56 INFO - GenuineIntel family 6 model 30 stepping 5
16:12:56 INFO - 8 CPUs
16:12:56 INFO - GPU: UNKNOWN
16:12:56 INFO - No crash
16:12:56 INFO - Process uptime: 151 seconds
16:12:56 INFO - Thread 0
16:12:56 INFO - 0 KERNELBASE.dll!GetLocaleInfoHelper + 0xb1
16:12:56 INFO - eip = 0x7577092c esp = 0x002ec9d8 ebp = 0x002ecac0 ebx = 0x000cee7c
16:12:56 INFO - esi = 0x000b65cc edi = 0x00000000 eax = 0x00000019 ecx = 0x00000001
16:12:56 INFO - edx = 0x00000000 efl = 0x00000202
16:12:56 INFO - Found by: given as instruction pointer in context
Comment 14•8 years ago
|
||
Pushed by gbrown@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/eb573b6f4d66
Improve Talos error handling for hung browser; r=jmaher
Comment 15•8 years ago
|
||
bugherder |
Status: NEW → RESOLVED
Closed: 8 years ago
status-firefox54:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla54
Comment 16•8 years ago
|
||
I am reopening as this is more debugging code, but please close if we think this fixes the bug or will morph a new bug as more actionable.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment hidden (Intermittent Failures Robot) |
Updated•8 years ago
|
Whiteboard: [stockwell needswork]
Assignee | ||
Comment 18•8 years ago
|
||
(In reply to Joel Maher ( :jmaher) from comment #16)
> I am reopening as this is more debugging code, but please close if we think
> this fixes the bug or will morph a new bug as more actionable.
We haven't had any new reports of NoSuchProcess (just a few mis-stars) and I don't think we will. The change has correctly morphed the NoSuchProcess reports into crashes (see see-also bugs).
Status: REOPENED → RESOLVED
Closed: 8 years ago → 8 years ago
Resolution: --- → FIXED
Updated•8 years ago
|
status-firefox52:
--- → unaffected
status-firefox53:
--- → affected
Comment 19•8 years ago
|
||
bugherder uplift |
Flags: in-testsuite+
Updated•8 years ago
|
Whiteboard: [stockwell needswork] → [stockwell fixed]
You need to log in
before you can comment on or make changes to this bug.
Description
•