Intermittent TEST-UNEXPECTED-ERROR | damp,ts_paint,tps,a11yr,tresize | psutil.NoSuchProcess process no longer exists (pid=3740, name=u'firefox.exe')

RESOLVED FIXED in Firefox 53

Status

RESOLVED FIXED
2 years ago
2 years ago

People

(Reporter: philor, Assigned: gbrown)

Tracking

({intermittent-failure})

Trunk
mozilla54
intermittent-failure
Points:
---
Dependency tree / graph
Bug Flags:
in-testsuite +

Firefox Tracking Flags

(firefox52 unaffected, firefox53 fixed, firefox54 fixed)

Details

(Whiteboard: [stockwell fixed])

Attachments

(1 attachment)

(Reporter)

Description

2 years ago
+++ This bug was initially created as a clone of Bug #1236770 +++

Carrying on the intermittents, since bug 1236770 got hijacked by one cause of one flavor in one subtest of one suite.
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
(Assignee)

Comment 4

2 years ago
With mis-stars eliminated, I think this may be Windows-only and e10s-only, and mostly pgo.
(Assignee)

Comment 5

2 years ago
This seems typical in recent failures:

23:53:35     INFO -  TEST-UNEXPECTED-ERROR | ts_paint | psutil.NoSuchProcess process no longer exists (pid=3360, name=u'firefox.exe')
23:53:35    ERROR -  Traceback (most recent call last):
23:53:35     INFO -    File "C:\slave\test\build\tests\talos\talos\run_tests.py", line 202, in run_tests
23:53:35     INFO -      talos_results.add(mytest.runTest(browser_config, test))
23:53:35     INFO -    File "C:\slave\test\build\tests\talos\talos\ttest.py", line 70, in runTest
23:53:35     INFO -      return self._runTest(browser_config, test_config, setup)
23:53:35     INFO -    File "C:\slave\test\build\tests\talos\talos\ttest.py", line 174, in _runTest
23:53:35     INFO -      if counter_management else None),
23:53:35     INFO -    File "C:\slave\test\build\tests\talos\talos\talos_process.py", line 140, in run_browser
23:53:35     INFO -      return_code = context.kill_process()
23:53:35     INFO -    File "C:\slave\test\build\tests\talos\talos\talos_process.py", line 40, in kill_process
23:53:35     INFO -      procs = self.process.children()
23:53:35     INFO -    File "C:\slave\test\build\venv\lib\site-packages\psutil\__init__.py", line 299, in wrapper
23:53:35     INFO -      raise NoSuchProcess(self.pid, self._name)
23:53:35     INFO -  NoSuchProcess: psutil.NoSuchProcess process no longer exists (pid=3360, name=u'firefox.exe')


It looks like, if self.process.terminate() throws NoSuchProcess, there is a good chance that self.process.children() will also throw NoSuchProcess, making the change in bug 1236770 not very effective. Brad, could we just ignore NoSuchProcess in kill_process? Or get a list of children some other way/at some other time??
Flags: needinfo?(blassey.bugs)
This is automation code, I have no idea.
Flags: needinfo?(blassey.bugs)
Comment hidden (Intermittent Failures Robot)
I don't mind ignoring NoSuchProcess, or when we get it follow another code path.  I am just not sure how to ensure we get the child processes and why we would get NoSuchProcess.  Maybe it terminated before we get to this point?
(Assignee)

Comment 9

2 years ago
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=79049477&lineNumber=1947

06:12:52     INFO -  Running cycle 13/20 for ts_paint test...
06:12:52     INFO -  TEST-INFO | started process 1944 (C:\slave\test\build\application\firefox\firefox -profile c:\users\cltbld\appdata\local\temp\tmpkpbugm\profile http://localhost:49677/startup_test/tspaint_test.html)
06:15:22     INFO -  Terminating psutil.Process(pid=1944, name=u'firefox.exe')
06:15:23     INFO -  TEST-UNEXPECTED-ERROR | ts_paint | psutil.NoSuchProcess process no longer exists (pid=1944, name=u'firefox.exe')
06:15:23    ERROR -  Traceback (most recent call last):
06:15:23     INFO -    File "C:\slave\test\build\tests\talos\talos\run_tests.py", line 202, in run_tests
06:15:23     INFO -      talos_results.add(mytest.runTest(browser_config, test))
06:15:23     INFO -    File "C:\slave\test\build\tests\talos\talos\ttest.py", line 70, in runTest
06:15:23     INFO -      return self._runTest(browser_config, test_config, setup)
06:15:23     INFO -    File "C:\slave\test\build\tests\talos\talos\ttest.py", line 174, in _runTest
06:15:23     INFO -      if counter_management else None),
06:15:23     INFO -    File "C:\slave\test\build\tests\talos\talos\talos_process.py", line 140, in run_browser
06:15:23     INFO -      return_code = context.kill_process()
06:15:23     INFO -    File "C:\slave\test\build\tests\talos\talos\talos_process.py", line 40, in kill_process
06:15:23     INFO -      procs = self.process.children()
06:15:23     INFO -    File "C:\slave\test\build\venv\lib\site-packages\psutil\__init__.py", line 299, in wrapper
06:15:23     INFO -      raise NoSuchProcess(self.pid, self._name)
06:15:23     INFO -  NoSuchProcess: psutil.NoSuchProcess process no longer exists (pid=1944, name=u'firefox.exe')

After "started process", there is a 150 second pause with no output. Perhaps there has been a timeout and TalosError has been raised:

https://dxr.mozilla.org/mozilla-central/source/testing/talos/talos/talos_process.py#136

but I think that won't be reported if we raise again in the finally handler, in kill_process.
Assignee: nobody → gbrown
(Assignee)

Comment 10

2 years ago
I reproduced the NoSuchProcess on try and determined that the browser was hung and the test timed out at https://dxr.mozilla.org/mozilla-central/rev/a180b976c165b6cd174d24bbb77941919cdc53cb/testing/talos/talos/talos_process.py#124. A minidump was probably produced, but then the TalosError was raised, triggering the finally handler and the attempt to kill the browser...which had already been killed in kill_and_get_minidump().

In this patch, I catch and log exceptions from the possibly-second kill attempt, but then allow processing to continue.

In the case of a timeout, the TalosError will still be raised, which would bypass crash reporting. To handle that, I added another check_for_crashes() call. This part of the patch is not an elegant solution, but gets the job done.

Will try to demonstrate the new handling at https://treeherder.mozilla.org/#/jobs?repo=try&revision=33f2d50b458bae93328e3eea526b85a25ea77b3c; I expect the NoSuchProcess failures to be transformed into PROCESS-CRASH failures, with crash reports.
Attachment #8840540 - Flags: review?(jmaher)
Comment on attachment 8840540 [details] [diff] [review]
handle timeouts better

Review of attachment 8840540 [details] [diff] [review]:
-----------------------------------------------------------------

this looks good.  consider in the future possibly raising something from the finally:try/catch.

::: testing/talos/talos/talos_process.py
@@ +145,5 @@
> +                return_code = proc.wait(1)
> +        except:
> +            # Maybe killed by kill_and_get_minidump(), maybe ended?
> +            LOG.info("Unable to kill process")
> +            LOG.info(traceback.format_exc())

should we raise anything here?
Attachment #8840540 - Flags: review?(jmaher) → review+
(Assignee)

Comment 12

2 years ago
(In reply to Joel Maher ( :jmaher) from comment #11)
> ::: testing/talos/talos/talos_process.py
> @@ +145,5 @@
> > +                return_code = proc.wait(1)
> > +        except:
> > +            # Maybe killed by kill_and_get_minidump(), maybe ended?
> > +            LOG.info("Unable to kill process")
> > +            LOG.info(traceback.format_exc())
> 
> should we raise anything here?

Maybe...sometimes? If there has been a timeout, a TalosError has already been raised - no sense in raising another exception. Otherwise, I suppose the test has completed successfully but there was a problem with killing the process; that might be cause for raising an exception. I'm hoping that in that case either 1. everything is actually okay and we complete successfully, or 2. something goes wrong at a later time and someone will notice the NoSuchProcess exception in the log as a possibly related clue.

It might be better to restructure the code somehow to handle timeouts more directly...I'm not sure how exactly. I'll press on with this change for now.
(Assignee)

Comment 13

2 years ago
(In reply to Geoff Brown [:gbrown] from comment #10)
> Will try to demonstrate the new handling at
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=33f2d50b458bae93328e3eea526b85a25ea77b3c; I expect
> the NoSuchProcess failures to be transformed into PROCESS-CRASH failures,
> with crash reports.

Here's one:

https://archive.mozilla.org/pub/firefox/try-builds/gbrown@mozilla.com-33f2d50b458bae93328e3eea526b85a25ea77b3c/try-win32/try_win7_ix_test-chromez-e10s-bm112-tests1-windows-build37.txt.gz

16:10:11     INFO -  TEST-INFO | started process 3588 (C:\slave\test\build\application\firefox\firefox -profile c:\users\cltbld\appdata\local\temp\tmp_xpvjo\profile http://localhost:49816/startup_test/tresize/addon/content/tresize-test.html)
16:12:41     INFO -  Timeout waiting for test completion; killing browser...
16:12:42     INFO -  Terminating psutil.Process(pid=3588, name=u'firefox.exe')
16:12:42     INFO -  Unable to kill process
16:12:42    ERROR -  Traceback (most recent call last):
16:12:42     INFO -    File "C:\slave\test\build\tests\talos\talos\talos_process.py", line 143, in run_browser
16:12:42     INFO -      return_code = context.kill_process()
16:12:42     INFO -    File "C:\slave\test\build\tests\talos\talos\talos_process.py", line 41, in kill_process
16:12:42     INFO -      procs = self.process.children()
16:12:42     INFO -    File "C:\slave\test\build\venv\lib\site-packages\psutil\__init__.py", line 299, in wrapper
16:12:42     INFO -      raise NoSuchProcess(self.pid, self._name)
16:12:42     INFO -  NoSuchProcess: psutil.NoSuchProcess process no longer exists (pid=3588, name=u'firefox.exe')
16:12:42     INFO -  mozcrash Downloading symbols from: https://queue.taskcluster.net/v1/task/B6M4Wx_rQCGEqZ2av4vOHQ/artifacts/public/build/firefox-54.0a1.en-US.win32.crashreporter-symbols.zip
16:12:49     INFO -  mozcrash Copy/paste: C:\slave\test\build\win32-minidump_stackwalk.exe c:\users\cltbld\appdata\local\temp\tmp_xpvjo\profile\minidumps\6456ec98-13ba-4cc8-aeea-7e46e8ef4c20.dmp c:\users\cltbld\appdata\local\temp\tmprp75bd
16:12:56     INFO -  mozcrash Saved minidump as C:\slave\test\build\blobber_upload_dir\6456ec98-13ba-4cc8-aeea-7e46e8ef4c20.dmp
16:12:56     INFO -  PROCESS-CRASH | tresize | application crashed [unknown top frame]
16:12:56     INFO -  Crash dump filename: c:\users\cltbld\appdata\local\temp\tmp_xpvjo\profile\minidumps\6456ec98-13ba-4cc8-aeea-7e46e8ef4c20.dmp
16:12:56     INFO -  Operating system: Windows NT
16:12:56     INFO -                    6.1.7601 Service Pack 1
16:12:56     INFO -  CPU: x86
16:12:56     INFO -       GenuineIntel family 6 model 30 stepping 5
16:12:56     INFO -       8 CPUs
16:12:56     INFO -  GPU: UNKNOWN
16:12:56     INFO -  No crash
16:12:56     INFO -  Process uptime: 151 seconds
16:12:56     INFO -  Thread 0
16:12:56     INFO -   0  KERNELBASE.dll!GetLocaleInfoHelper + 0xb1
16:12:56     INFO -      eip = 0x7577092c   esp = 0x002ec9d8   ebp = 0x002ecac0   ebx = 0x000cee7c
16:12:56     INFO -      esi = 0x000b65cc   edi = 0x00000000   eax = 0x00000019   ecx = 0x00000001
16:12:56     INFO -      edx = 0x00000000   efl = 0x00000202
16:12:56     INFO -      Found by: given as instruction pointer in context

Comment 14

2 years ago
Pushed by gbrown@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/eb573b6f4d66
Improve Talos error handling for hung browser; r=jmaher

Comment 15

2 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/eb573b6f4d66
Status: NEW → RESOLVED
Last Resolved: 2 years ago
status-firefox54: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla54
I am reopening as this is more debugging code, but please close if we think this fixes the bug or will morph a new bug as more actionable.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(Assignee)

Updated

2 years ago
See Also: → bug 1342685
(Assignee)

Updated

2 years ago
See Also: → bug 1342594
Comment hidden (Intermittent Failures Robot)
(Assignee)

Updated

2 years ago
See Also: → bug 1342735
(Assignee)

Updated

2 years ago
See Also: → bug 1343242
(Assignee)

Updated

2 years ago
See Also: → bug 1310638
Whiteboard: [stockwell needswork]
(Assignee)

Updated

2 years ago
Blocks: 1343549
(Assignee)

Comment 18

2 years ago
(In reply to Joel Maher ( :jmaher) from comment #16)
> I am reopening as this is more debugging code, but please close if we think
> this fixes the bug or will morph a new bug as more actionable.

We haven't had any new reports of NoSuchProcess (just a few mis-stars) and I don't think we will. The change has correctly morphed the NoSuchProcess reports into crashes (see see-also bugs).
Status: REOPENED → RESOLVED
Last Resolved: 2 years ago2 years ago
Resolution: --- → FIXED
status-firefox52: --- → unaffected
status-firefox53: --- → affected

Comment 19

2 years ago
bugherderuplift
https://hg.mozilla.org/releases/mozilla-aurora/rev/e695a8e2017a
status-firefox53: affected → fixed
Flags: in-testsuite+
Whiteboard: [stockwell needswork] → [stockwell fixed]
You need to log in before you can comment on or make changes to this bug.