Closed
Bug 1339594
Opened 7 years ago
Closed 7 years ago
Intermittent TEST-UNEXPECTED-ERROR | damp,ts_paint,tps,a11yr,tresize | psutil.NoSuchProcess process no longer exists (pid=3740, name=u'firefox.exe')
Categories
(Testing :: Talos, defect)
Testing
Talos
Tracking
(firefox52 unaffected, firefox53 fixed, firefox54 fixed)
RESOLVED
FIXED
mozilla54
Tracking | Status | |
---|---|---|
firefox52 | --- | unaffected |
firefox53 | --- | fixed |
firefox54 | --- | fixed |
People
(Reporter: philor, Assigned: gbrown)
References
Details
(Keywords: intermittent-failure, Whiteboard: [stockwell fixed])
Attachments
(1 file)
2.56 KB,
patch
|
jmaher
:
review+
|
Details | Diff | Splinter Review |
+++ This bug was initially created as a clone of Bug #1236770 +++ Carrying on the intermittents, since bug 1236770 got hijacked by one cause of one flavor in one subtest of one suite.
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 4•7 years ago
|
||
With mis-stars eliminated, I think this may be Windows-only and e10s-only, and mostly pgo.
Assignee | ||
Comment 5•7 years ago
|
||
This seems typical in recent failures: 23:53:35 INFO - TEST-UNEXPECTED-ERROR | ts_paint | psutil.NoSuchProcess process no longer exists (pid=3360, name=u'firefox.exe') 23:53:35 ERROR - Traceback (most recent call last): 23:53:35 INFO - File "C:\slave\test\build\tests\talos\talos\run_tests.py", line 202, in run_tests 23:53:35 INFO - talos_results.add(mytest.runTest(browser_config, test)) 23:53:35 INFO - File "C:\slave\test\build\tests\talos\talos\ttest.py", line 70, in runTest 23:53:35 INFO - return self._runTest(browser_config, test_config, setup) 23:53:35 INFO - File "C:\slave\test\build\tests\talos\talos\ttest.py", line 174, in _runTest 23:53:35 INFO - if counter_management else None), 23:53:35 INFO - File "C:\slave\test\build\tests\talos\talos\talos_process.py", line 140, in run_browser 23:53:35 INFO - return_code = context.kill_process() 23:53:35 INFO - File "C:\slave\test\build\tests\talos\talos\talos_process.py", line 40, in kill_process 23:53:35 INFO - procs = self.process.children() 23:53:35 INFO - File "C:\slave\test\build\venv\lib\site-packages\psutil\__init__.py", line 299, in wrapper 23:53:35 INFO - raise NoSuchProcess(self.pid, self._name) 23:53:35 INFO - NoSuchProcess: psutil.NoSuchProcess process no longer exists (pid=3360, name=u'firefox.exe') It looks like, if self.process.terminate() throws NoSuchProcess, there is a good chance that self.process.children() will also throw NoSuchProcess, making the change in bug 1236770 not very effective. Brad, could we just ignore NoSuchProcess in kill_process? Or get a list of children some other way/at some other time??
Flags: needinfo?(blassey.bugs)
Comment hidden (Intermittent Failures Robot) |
Comment 8•7 years ago
|
||
I don't mind ignoring NoSuchProcess, or when we get it follow another code path. I am just not sure how to ensure we get the child processes and why we would get NoSuchProcess. Maybe it terminated before we get to this point?
Assignee | ||
Comment 9•7 years ago
|
||
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=79049477&lineNumber=1947 06:12:52 INFO - Running cycle 13/20 for ts_paint test... 06:12:52 INFO - TEST-INFO | started process 1944 (C:\slave\test\build\application\firefox\firefox -profile c:\users\cltbld\appdata\local\temp\tmpkpbugm\profile http://localhost:49677/startup_test/tspaint_test.html) 06:15:22 INFO - Terminating psutil.Process(pid=1944, name=u'firefox.exe') 06:15:23 INFO - TEST-UNEXPECTED-ERROR | ts_paint | psutil.NoSuchProcess process no longer exists (pid=1944, name=u'firefox.exe') 06:15:23 ERROR - Traceback (most recent call last): 06:15:23 INFO - File "C:\slave\test\build\tests\talos\talos\run_tests.py", line 202, in run_tests 06:15:23 INFO - talos_results.add(mytest.runTest(browser_config, test)) 06:15:23 INFO - File "C:\slave\test\build\tests\talos\talos\ttest.py", line 70, in runTest 06:15:23 INFO - return self._runTest(browser_config, test_config, setup) 06:15:23 INFO - File "C:\slave\test\build\tests\talos\talos\ttest.py", line 174, in _runTest 06:15:23 INFO - if counter_management else None), 06:15:23 INFO - File "C:\slave\test\build\tests\talos\talos\talos_process.py", line 140, in run_browser 06:15:23 INFO - return_code = context.kill_process() 06:15:23 INFO - File "C:\slave\test\build\tests\talos\talos\talos_process.py", line 40, in kill_process 06:15:23 INFO - procs = self.process.children() 06:15:23 INFO - File "C:\slave\test\build\venv\lib\site-packages\psutil\__init__.py", line 299, in wrapper 06:15:23 INFO - raise NoSuchProcess(self.pid, self._name) 06:15:23 INFO - NoSuchProcess: psutil.NoSuchProcess process no longer exists (pid=1944, name=u'firefox.exe') After "started process", there is a 150 second pause with no output. Perhaps there has been a timeout and TalosError has been raised: https://dxr.mozilla.org/mozilla-central/source/testing/talos/talos/talos_process.py#136 but I think that won't be reported if we raise again in the finally handler, in kill_process.
Assignee: nobody → gbrown
Assignee | ||
Comment 10•7 years ago
|
||
I reproduced the NoSuchProcess on try and determined that the browser was hung and the test timed out at https://dxr.mozilla.org/mozilla-central/rev/a180b976c165b6cd174d24bbb77941919cdc53cb/testing/talos/talos/talos_process.py#124. A minidump was probably produced, but then the TalosError was raised, triggering the finally handler and the attempt to kill the browser...which had already been killed in kill_and_get_minidump(). In this patch, I catch and log exceptions from the possibly-second kill attempt, but then allow processing to continue. In the case of a timeout, the TalosError will still be raised, which would bypass crash reporting. To handle that, I added another check_for_crashes() call. This part of the patch is not an elegant solution, but gets the job done. Will try to demonstrate the new handling at https://treeherder.mozilla.org/#/jobs?repo=try&revision=33f2d50b458bae93328e3eea526b85a25ea77b3c; I expect the NoSuchProcess failures to be transformed into PROCESS-CRASH failures, with crash reports.
Attachment #8840540 -
Flags: review?(jmaher)
Comment 11•7 years ago
|
||
Comment on attachment 8840540 [details] [diff] [review] handle timeouts better Review of attachment 8840540 [details] [diff] [review]: ----------------------------------------------------------------- this looks good. consider in the future possibly raising something from the finally:try/catch. ::: testing/talos/talos/talos_process.py @@ +145,5 @@ > + return_code = proc.wait(1) > + except: > + # Maybe killed by kill_and_get_minidump(), maybe ended? > + LOG.info("Unable to kill process") > + LOG.info(traceback.format_exc()) should we raise anything here?
Attachment #8840540 -
Flags: review?(jmaher) → review+
Assignee | ||
Comment 12•7 years ago
|
||
(In reply to Joel Maher ( :jmaher) from comment #11) > ::: testing/talos/talos/talos_process.py > @@ +145,5 @@ > > + return_code = proc.wait(1) > > + except: > > + # Maybe killed by kill_and_get_minidump(), maybe ended? > > + LOG.info("Unable to kill process") > > + LOG.info(traceback.format_exc()) > > should we raise anything here? Maybe...sometimes? If there has been a timeout, a TalosError has already been raised - no sense in raising another exception. Otherwise, I suppose the test has completed successfully but there was a problem with killing the process; that might be cause for raising an exception. I'm hoping that in that case either 1. everything is actually okay and we complete successfully, or 2. something goes wrong at a later time and someone will notice the NoSuchProcess exception in the log as a possibly related clue. It might be better to restructure the code somehow to handle timeouts more directly...I'm not sure how exactly. I'll press on with this change for now.
Assignee | ||
Comment 13•7 years ago
|
||
(In reply to Geoff Brown [:gbrown] from comment #10) > Will try to demonstrate the new handling at > https://treeherder.mozilla.org/#/ > jobs?repo=try&revision=33f2d50b458bae93328e3eea526b85a25ea77b3c; I expect > the NoSuchProcess failures to be transformed into PROCESS-CRASH failures, > with crash reports. Here's one: https://archive.mozilla.org/pub/firefox/try-builds/gbrown@mozilla.com-33f2d50b458bae93328e3eea526b85a25ea77b3c/try-win32/try_win7_ix_test-chromez-e10s-bm112-tests1-windows-build37.txt.gz 16:10:11 INFO - TEST-INFO | started process 3588 (C:\slave\test\build\application\firefox\firefox -profile c:\users\cltbld\appdata\local\temp\tmp_xpvjo\profile http://localhost:49816/startup_test/tresize/addon/content/tresize-test.html) 16:12:41 INFO - Timeout waiting for test completion; killing browser... 16:12:42 INFO - Terminating psutil.Process(pid=3588, name=u'firefox.exe') 16:12:42 INFO - Unable to kill process 16:12:42 ERROR - Traceback (most recent call last): 16:12:42 INFO - File "C:\slave\test\build\tests\talos\talos\talos_process.py", line 143, in run_browser 16:12:42 INFO - return_code = context.kill_process() 16:12:42 INFO - File "C:\slave\test\build\tests\talos\talos\talos_process.py", line 41, in kill_process 16:12:42 INFO - procs = self.process.children() 16:12:42 INFO - File "C:\slave\test\build\venv\lib\site-packages\psutil\__init__.py", line 299, in wrapper 16:12:42 INFO - raise NoSuchProcess(self.pid, self._name) 16:12:42 INFO - NoSuchProcess: psutil.NoSuchProcess process no longer exists (pid=3588, name=u'firefox.exe') 16:12:42 INFO - mozcrash Downloading symbols from: https://queue.taskcluster.net/v1/task/B6M4Wx_rQCGEqZ2av4vOHQ/artifacts/public/build/firefox-54.0a1.en-US.win32.crashreporter-symbols.zip 16:12:49 INFO - mozcrash Copy/paste: C:\slave\test\build\win32-minidump_stackwalk.exe c:\users\cltbld\appdata\local\temp\tmp_xpvjo\profile\minidumps\6456ec98-13ba-4cc8-aeea-7e46e8ef4c20.dmp c:\users\cltbld\appdata\local\temp\tmprp75bd 16:12:56 INFO - mozcrash Saved minidump as C:\slave\test\build\blobber_upload_dir\6456ec98-13ba-4cc8-aeea-7e46e8ef4c20.dmp 16:12:56 INFO - PROCESS-CRASH | tresize | application crashed [unknown top frame] 16:12:56 INFO - Crash dump filename: c:\users\cltbld\appdata\local\temp\tmp_xpvjo\profile\minidumps\6456ec98-13ba-4cc8-aeea-7e46e8ef4c20.dmp 16:12:56 INFO - Operating system: Windows NT 16:12:56 INFO - 6.1.7601 Service Pack 1 16:12:56 INFO - CPU: x86 16:12:56 INFO - GenuineIntel family 6 model 30 stepping 5 16:12:56 INFO - 8 CPUs 16:12:56 INFO - GPU: UNKNOWN 16:12:56 INFO - No crash 16:12:56 INFO - Process uptime: 151 seconds 16:12:56 INFO - Thread 0 16:12:56 INFO - 0 KERNELBASE.dll!GetLocaleInfoHelper + 0xb1 16:12:56 INFO - eip = 0x7577092c esp = 0x002ec9d8 ebp = 0x002ecac0 ebx = 0x000cee7c 16:12:56 INFO - esi = 0x000b65cc edi = 0x00000000 eax = 0x00000019 ecx = 0x00000001 16:12:56 INFO - edx = 0x00000000 efl = 0x00000202 16:12:56 INFO - Found by: given as instruction pointer in context
Comment 14•7 years ago
|
||
Pushed by gbrown@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/eb573b6f4d66 Improve Talos error handling for hung browser; r=jmaher
Comment 15•7 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/eb573b6f4d66
Status: NEW → RESOLVED
Closed: 7 years ago
status-firefox54:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla54
Comment 16•7 years ago
|
||
I am reopening as this is more debugging code, but please close if we think this fixes the bug or will morph a new bug as more actionable.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment hidden (Intermittent Failures Robot) |
Updated•7 years ago
|
Whiteboard: [stockwell needswork]
Assignee | ||
Comment 18•7 years ago
|
||
(In reply to Joel Maher ( :jmaher) from comment #16) > I am reopening as this is more debugging code, but please close if we think > this fixes the bug or will morph a new bug as more actionable. We haven't had any new reports of NoSuchProcess (just a few mis-stars) and I don't think we will. The change has correctly morphed the NoSuchProcess reports into crashes (see see-also bugs).
Status: REOPENED → RESOLVED
Closed: 7 years ago → 7 years ago
Resolution: --- → FIXED
Updated•7 years ago
|
status-firefox52:
--- → unaffected
status-firefox53:
--- → affected
Comment 19•7 years ago
|
||
bugherder uplift |
https://hg.mozilla.org/releases/mozilla-aurora/rev/e695a8e2017a
Flags: in-testsuite+
Updated•7 years ago
|
Whiteboard: [stockwell needswork] → [stockwell fixed]
You need to log in
before you can comment on or make changes to this bug.
Description
•