Open Bug 1378002 Opened 2 years ago Updated 16 days ago

Intermittent PROCESS-CRASH | tps, tp5o_scroll, damp, tp6_youtube_heavy, ts_paint_heavy, tp6_google_heavy, tp6_amazon_heavy, tp6_google, glvideo | application crashed [unknown top frame]

Categories

(Testing :: Talos, defect, P3)

Version 3
defect

Tracking

(Not tracked)

People

(Reporter: intermittent-bug-filer, Unassigned)

References

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell unknown])

Attachments

(1 file)

Duplicate of this bug: 1414492
Summary: Intermittent PROCESS-CRASH | tps | application crashed [unknown top frame] → Intermittent PROCESS-CRASH | tps, tp5o_scroll, damp, tp6_youtube_heavy | application crashed [unknown top frame]
Duplicate of this bug: 1407176
Duplicate of this bug: 1410051
Duplicate of this bug: 1422525
Duplicate of this bug: 1420661
Summary: Intermittent PROCESS-CRASH | tps, tp5o_scroll, damp, tp6_youtube_heavy | application crashed [unknown top frame] → Intermittent PROCESS-CRASH | tps, tp5o_scroll, damp, tp6_youtube_heavy, glvideo | application crashed [unknown top frame]
Duplicate of this bug: 1415519
Summary: Intermittent PROCESS-CRASH | tps, tp5o_scroll, damp, tp6_youtube_heavy, glvideo | application crashed [unknown top frame] → Intermittent PROCESS-CRASH | tps, tp5o_scroll, damp, tp6_youtube_heavy, tp6_amazon_heavy, glvideo | application crashed [unknown top frame]
Duplicate of this bug: 1423511
Blocks: 1422525
Summary: Intermittent PROCESS-CRASH | tps, tp5o_scroll, damp, tp6_youtube_heavy, tp6_amazon_heavy, glvideo | application crashed [unknown top frame] → Intermittent PROCESS-CRASH | tps, tp5o_scroll, damp, tp6_youtube_heavy, tp6_amazon_heavy, tp6_google, glvideo | application crashed [unknown top frame]
There have been 39 total failures in the last 7 days.
There are failures across 3 platforms and 2 build types:

- Windows 7: 18
- windows10-64: 12
- OS X 10.10: 9

- opt: 29
- pgo: 10

Here is a recent log file and a snippet with the failure:
https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=159981381&lineNumber=1974

16:19:54     INFO -  mozcrash Downloading symbols from: https://queue.taskcluster.net/v1/task/HLdGg4LuRKGRsrIIESRRUA/artifacts/public/build/target.crashreporter-symbols.zip
16:20:03     INFO -  mozcrash Copy/paste: C:\slave\test\build\win32-minidump_stackwalk.exe c:\users\cltbld~1.t-w\appdata\local\temp\tmptjw_b3\profile\minidumps\4b13ff50-6d26-4605-adc4-1e1f8300b7fd.dmp c:\users\cltbld~1.t-w\appdata\local\temp\tmpfd76kq
16:20:11     INFO -  mozcrash Saved minidump as C:\slave\test\build\blobber_upload_dir\4b13ff50-6d26-4605-adc4-1e1f8300b7fd.dmp
16:20:11     INFO -  PROCESS-CRASH | tp5o_scroll | application crashed [unknown top frame]
Flags: needinfo?(rwood)
Whiteboard: [stockwell needswork]
Summary: Intermittent PROCESS-CRASH | tps, tp5o_scroll, damp, tp6_youtube_heavy, tp6_amazon_heavy, tp6_google, glvideo | application crashed [unknown top frame] → Intermittent PROCESS-CRASH | tps, tp5o_scroll, damp, tp6_youtube_heavy, tp6_google_heavy, tp6_amazon_heavy, tp6_google, glvideo | application crashed [unknown top frame]
Lots of failures, but we are moving to the new Windows hardware next week; I am fine waiting until then.
This bug has failed 70 times in the last 7 days, mainly on Windows 7 and Windows 10, with a few occurrences on OS X 10.10, affecting both pgo and opt build types.

Failing tests: talos-tp6-stylo-threads-e10, talos-tp6-e10.

Link to a recent log: https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=162317413&lineNumber=1540

Part of that log:
IndexError: list index out of range
18:15:09     INFO -      During handling of the above exception, another exception occurred:
18:15:09     INFO -      Traceback (most recent call last):
18:15:09     INFO -        File "<string>", line 1, in <module>
18:15:09     INFO -        File "C:\Users\CLTBLD~1.T-W\AppData\Local\Temp\pip-build-a91etw0y\cffi\setup.py", line 113, in <module>
18:15:09     INFO -          if sys.platform == 'win32' and uses_msvc():
18:15:09     INFO -        File "C:\Users\CLTBLD~1.T-W\AppData\Local\Temp\pip-build-a91etw0y\cffi\setup.py", line 91, in uses_msvc
18:15:09     INFO -          return config.try_compile('#ifndef _MSC_VER\n#error "not MSVC"\n#endif')
18:15:09     INFO -        File "C:\slave\test\build\python3.6\lib\distutils\command\config.py", line 227, in try_compile
18:15:09     INFO -          self._compile(body, headers, include_dirs, lang)
18:15:09     INFO -        File "C:\slave\test\build\python3.6\lib\distutils\command\config.py", line 133, in _compile
18:15:09     INFO -          self.compiler.compile([src], include_dirs=include_dirs)
18:15:09     INFO -        File "C:\slave\test\build\python3.6\lib\distutils\_msvccompiler.py", line 304, in compile
18:15:09     INFO -          self.initialize()
18:15:09     INFO -        File "C:\slave\test\build\python3.6\lib\distutils\_msvccompiler.py", line 197, in initialize
18:15:09     INFO -          vc_env = _get_vc_env(plat_spec)
18:15:09     INFO -        File "c:\slave\test\py3venv\lib\site-packages\setuptools\msvc.py", line 172, in msvc14_get_vc_env
18:15:09     INFO -          return EnvironmentInfo(plat_spec, vc_min_ver=14.0).return_env()
18:15:09     INFO -        File "c:\slave\test\py3venv\lib\site-packages\setuptools\msvc.py", line 771, in __init__
18:15:09     INFO -          self.si = SystemInfo(self.ri, vc_ver)
18:15:09     INFO -        File "c:\slave\test\py3venv\lib\site-packages\setuptools\msvc.py", line 480, in __init__
18:15:09     INFO -          raise distutils.errors.DistutilsPlatformError(err)
18:15:09     INFO -      distutils.errors.DistutilsPlatformError: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools

:rwood, did you have a chance to look at this bug?
Over the last 7 days there have been 65 failures on this bug. They happen on Linux x64, macosx64-nightly, OS X 10.10, Windows 7, and windows10-64.

Here is the most recent log example: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=163916283&lineNumber=2682

Here is a relevant part of that log: 

13:18:45     INFO -      Found by: given as instruction pointer in context
13:18:45     INFO -  TEST-UNEXPECTED-ERROR | damp | Found crashes after test run, terminating test
13:18:45    ERROR -  Traceback (most recent call last):
13:18:45     INFO -    File "C:\Users\task_1519387965\build\tests\talos\talos\run_tests.py", line 296, in run_tests
13:18:45     INFO -      talos_results.add(mytest.runTest(browser_config, test))
13:18:45     INFO -    File "C:\Users\task_1519387965\build\tests\talos\talos\ttest.py", line 62, in runTest
13:18:45     INFO -      return self._runTest(browser_config, test_config, setup)
13:18:45     INFO -    File "C:\Users\task_1519387965\build\tests\talos\talos\ttest.py", line 209, in _runTest
13:18:45     INFO -      test_config['name'])
13:18:45     INFO -    File "C:\Users\task_1519387965\build\tests\talos\talos\ttest.py", line 46, in check_for_crashes
13:18:45     INFO -      raise TalosCrash('Found crashes after test run, terminating test')
13:18:45     INFO -  TalosCrash: Found crashes after test run, terminating test
13:18:45     INFO -  TEST-INFO took 3607888ms
13:18:45     INFO -  SUITE-END | took 3607s
13:18:45    ERROR - Return code: 2
13:18:45  WARNING - setting return code to 2
13:18:45    ERROR - # TBPL FAILURE #
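The traceback above boils down to a simple post-run check. As a hedged sketch (the names mirror the traceback, but this is an illustration, not the actual talos source):

```python
# Sketch of the control flow in the traceback above: after a test run
# the harness scans the profile's minidump directory and raises
# TalosCrash if any .dmp files are present. Illustrative only.
import os

class TalosCrash(Exception):
    """Raised when the browser left crash dumps behind after a test run."""

def check_for_crashes(minidump_dir):
    dumps = [f for f in os.listdir(minidump_dir) if f.endswith(".dmp")]
    if dumps:
        raise TalosCrash("Found crashes after test run, terminating test")
```

This is why a plain harness timeout can surface as PROCESS-CRASH: force-killing the browser can leave dumps behind, which this check then reports as a crash.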
These are primarily Windows failures; we just made a major change to the Windows hardware yesterday. I wonder what this will look like a week from now.
There have been 85 failures in the last week and 207 failures in the last 21 days.

For the last 7 days I see that the failures occur on: 

Platforms:
- OS X 10.10: 48
- windows10-64: 22
- Windows 7: 10
- macosx64-nightly: 3
- linux64-stylo-disabled: 1
- linux64-nightly: 1

Build type:
- opt: 65
- pgo: 19
- debug: 1

Here is a recent log file and a snippet with the failure:
https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=165736923&lineNumber=3687

04:13:02     INFO -  mozcrash Copy/paste: /Users/cltbld/tasks/task_1520074748/build/macosx64-minidump_stackwalk /var/folders/46/s7s5j29s62v4stfn569xx97m00000w/T/tmprtPiG6/profile/minidumps/79C62B57-26C8-4CB1-AD2A-53A839CD82B5.dmp /var/folders/46/s7s5j29s62v4stfn569xx97m00000w/T/tmp_gExnn
04:13:02     INFO -  mozcrash Saved minidump as /Users/cltbld/tasks/task_1520074748/build/blobber_upload_dir/79C62B57-26C8-4CB1-AD2A-53A839CD82B5.dmp
04:13:02     INFO -  PROCESS-CRASH | damp | application crashed [unknown top frame]
04:13:02     INFO -  Crash dump filename: /var/folders/46/s7s5j29s62v4stfn569xx97m00000w/T/tmprtPiG6/profile/minidumps/79C62B57-26C8-4CB1-AD2A-53A839CD82B5.dmp
04:13:02     INFO -  stderr from minidump_stackwalk:
04:13:02     INFO -  2018-03-03 04:13:02: minidump.cc:4359: INFO: Minidump opened minidump /var/folders/46/s7s5j29s62v4stfn569xx97m00000w/T/tmprtPiG6/profile/minidumps/79C62B57-26C8-4CB1-AD2A-53A839CD82B5.dmp
04:13:02     INFO -  2018-03-03 04:13:02: minidump.cc:4808: ERROR: ReadBytes: read 0/32
04:13:02     INFO -  2018-03-03 04:13:02: minidump.cc:4453: ERROR: Minidump cannot read header
04:13:02     INFO -  2018-03-03 04:13:02: stackwalk.cc:133: ERROR: Minidump /var/folders/46/s7s5j29s62v4stfn569xx97m00000w/T/tmprtPiG6/profile/minidumps/79C62B57-26C8-4CB1-AD2A-53A839CD82B5.dmp could not be read
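The stackwalk errors above ("ReadBytes: read 0/32", "Minidump cannot read header") indicate the .dmp file was empty or truncated before stackwalk ever parsed it. A cheap sanity check is possible because valid minidumps start with the 4-byte magic b"MDMP"; this is a sketch, not part of the actual harness:

```python
# Sketch: cheaply reject empty/truncated dumps before invoking
# minidump_stackwalk. Valid minidump files begin with the magic b"MDMP".
def looks_like_minidump(path):
    with open(path, "rb") as f:
        return f.read(4) == b"MDMP"
```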

All of the bug dependencies have been resolved.

:jmaher do you think we should wait some more on disabling this? About 41% of failures are on OS X 10.10 / opt in the last week.
Flags: needinfo?(jmaher)
I will look into this more. This is the g2 job failing on OS X and Windows; the profiling jobs on nightly are different issues, and a random browser-chrome failure is mixed in there. We will disable or fix this; thanks for filing!
The main crash here is damp, and this is what we see in the related logs:
03:33:36     INFO -  PID 965 | Open toolbox on 'complicated.inspector'
03:33:37     INFO -  PID 965 | Unable to read VR Path Registry from /Users/cltbld/Library/Application Support/OpenVR/.openvr/openvrpaths.vrpath
03:33:37     INFO -  PID 965 | 2018-03-03 03:33:37.195 plugin-container[999:66350] *** CFMessagePort: bootstrap_register(): failed 1100 (0x44c) 'Permission denied', port = 0x9057, name = 'com.apple.tsm.portname'
03:33:37     INFO -  PID 965 | See /usr/include/servers/bootstrap_defs.h for the error codes.
03:33:38     INFO -  PID 965 | Garbage collect
03:33:38     INFO -  PID 965 | Reload page on 'complicated.inspector'
03:33:42     INFO -  PID 965 | Close toolbox on 'complicated.inspector'
03:33:42     INFO -  PID 965 | Garbage collect
03:33:44     INFO -  PID 965 | Open toolbox on 'complicated.jsdebugger'
03:33:44     INFO -  PID 965 | Unable to read VR Path Registry from /Users/cltbld/Library/Application Support/OpenVR/.openvr/openvrpaths.vrpath
03:33:44     INFO -  PID 965 | 2018-03-03 03:33:44.814 plugin-container[1000:66676] *** CFMessagePort: bootstrap_register(): failed 1100 (0x44c) 'Permission denied', port = 0x953b, name = 'com.apple.tsm.portname'
03:33:44     INFO -  PID 965 | See /usr/include/servers/bootstrap_defs.h for the error codes.
03:33:45     INFO -  PID 965 | Waiting for state change: count sources
03:33:45     INFO -  PID 965 | Finished waiting for state change: count sources
03:33:45     INFO -  PID 965 | Selecting source: ga.js
03:33:45     INFO -  PID 965 | Waiting for state change: selected source
04:30:46     INFO -  Timeout waiting for test completion; killing browser...
04:30:46     INFO -  Terminating psutil.Process(pid=965, name='firefox', started='03:30:46')
04:30:46     INFO -  PID 965 | [GFX1-]: Receive IPC close with reason=AbnormalShutdown
04:30:46     INFO -  PID 965 | ** Unknown exception behavior: -2147483647
04:30:46     INFO -  PID 965 | ** Unknown exception behavior: -2147483647
04:30:46     INFO -  PID 965 | ** Unknown exception behavior: -2147483647


OS X seems to be the most affected.  :ochameau, are you familiar with any recent changes in damp which would cause the above hang?
Flags: needinfo?(rwood)
Flags: needinfo?(poirot.alex)
Flags: needinfo?(jmaher)
Not that I know of. It seems to have increased significantly around early February.

Yulia, any idea why this method would never resolve:
https://searchfox.org/mozilla-central/source/testing/talos/talos/tests/devtools/addon/content/damp.js#167-188
  selectSource(dbg, url) {
    dump(`Selecting source: ${url}\n`);
    const line = 1;
    const source = this.findSource(dbg, url);
    dbg.actions.selectLocation({ sourceId: source.get("id"), line });
    return this.waitForState(
      dbg,
      state => {
        const source = dbg.selectors.getSelectedSource(state);
        const isLoaded = source && source.get("loadedState") === "loaded";
        if (!isLoaded) {
          return false;
        }

        // wait for symbols -- a flat map of all named variables in a file -- to be calculated.
        // this is a slow process and becomes slower the larger the file is
        return dbg.selectors.hasSymbols(state, source.toJS());
      },
      "selected source"
    );
  }
I'm wondering if hasSymbols could be the issue here, or if loadedState never ends up as "loaded"?

DAMP times out here intermittently:
  https://searchfox.org/mozilla-central/source/testing/talos/talos/tests/devtools/addon/content/damp.js#821
Flags: needinfo?(poirot.alex) → needinfo?(ystartsev)
I'm taking a look. In late January the sources and symbols were both refactored, and this could be caused by that. For now I will split the test so we have a better idea of why it's failing.
Assignee: nobody → ystartsev
Flags: needinfo?(ystartsev)
I looked into the code and I am having a hard time seeing what might be causing this. I also reran the test on talos 30 times and it didn't fail. For now I have made a change that adds more logging to the test, so that we can narrow down whether the issue is the source never loading or the symbols never loading.

It is not a fix but it might give us more information
Comment on attachment 8959608 [details]
Bug 1378002 - add more information to track intermittent process crash.

https://reviewboard.mozilla.org/r/228424/#review234556

This change adds quite a lot of logs:
02:49:55     INFO -  PID 28724 | Waiting for state change: selected source
02:49:55     INFO -  PID 28724 | Checking if source is loaded
02:49:56     INFO -  PID 28724 | Checking if source is loaded
02:49:56     INFO -  PID 28724 | Waiting for symbols
02:49:56     INFO -  PID 28724 | Checking if source is loaded
02:49:56     INFO -  PID 28724 | Waiting for symbols
02:49:56     INFO -  PID 28724 | Checking if source is loaded
02:49:56     INFO -  PID 28724 | Waiting for symbols
02:49:56     INFO -  PID 28724 | Checking if source is loaded
02:49:56     INFO -  PID 28724 | Waiting for symbols
02:49:56     INFO -  PID 28724 | Checking if source is loaded
02:49:56     INFO -  PID 28724 | Waiting for symbols
02:49:56     INFO -  PID 28724 | Checking if source is loaded
02:49:56     INFO -  PID 28724 | Waiting for symbols
02:49:57     INFO -  PID 28724 | Checking if source is loaded
02:49:57     INFO -  PID 28724 | Waiting for symbols
02:49:57     INFO -  PID 28724 | Finished waiting for state change: selected source

Instead of landing it, you can push it to try like this:
  ./mach try -b o -p macosx64 -t g2-e10s --rebuild-talos 10 --artifact
Unless you think it is valuable to always have these logs?
Or were you unable to reproduce this intermittent on try?
Attachment #8959608 - Flags: review?(poirot.alex)
I wasn't able to reproduce it. I am rerunning it this time with 20 rebuilds and with the logs; let's see if that gives more information...
Joel, I looked at the last 10 failures for this bug on OrangeFactor. They are all happening on Windows, and the vast majority of them look like this failure:
01:57:52     INFO -  PID 3888 | JavaScript note: http://localhost:49424/tests/tp5n/goo.ne.jp/search.goo.ne.jp/cdn/goo/js/v2/sgt.v2.js, line 1: { opened at line 1, column 11981
01:58:07     INFO -  PID 3888 | Cycle 1(4): loaded http://localhost:49424/tests/tp5n/goo.ne.jp/goo.ne.jp/index.html (next: http://localhost:49424/tests/tp5n/guardian.co.uk/www.guardian.co.uk/index.html)
01:58:08     INFO -  PID 3888 | JavaScript note: http://localhost:49424/tests/tp5n/goo.ne.jp/search.goo.ne.jp/cdn/goo/js/v2/sgt.v2.js, line 1: { opened at line 1, column 11981
01:58:09     INFO -  Timeout waiting for test completion; killing browser...

It looks like it is mostly a tp5o_scroll timeout.
https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=170226617&lineNumber=1818

The timeout related to the debugger seems to be gone.
But I was wondering if we could make logviewer/Perfherder produce a better summary.
It isn't handy to see various, very different errors happening during g2 runs all aggregated into this bug.
We could tweak DAMP to print particular strings when a test times out or throws.
Are there particular patterns that are picked up or ignored by Perfherder, or whatever tool processes the test output?
Flags: needinfo?(jmaher)
I agree this isn't damp related anymore.  I am open to suggestions for making the error reporting better, so we can reduce the time spent digging through the large list of raw logs to find a real failure pattern.

One problem is that when we get a timeout in the browser, we have to emit a failure message such as "Timeout waiting for test completion" and then close the browser and clean up.  When we force the browser closed, we also look for crash dumps, and unfortunately that produces the same pattern in many of these logs.  What is interesting is that we seem to be running for more than an hour to complete a test that should take ~30 minutes; in the few logs I looked at, we are loading each page very slowly (5-15 seconds instead of 1-3 seconds) and then we time out.
Flags: needinfo?(jmaher)
Is there a particular string/pattern that we can use to ensure that's the first line logged in perfherder?

Today, in 99% of DAMP failures, the first error displayed in Perfherder > Failure summary is:
  PROCESS-CRASH | damp | application crashed [unknown top frame]
Like here:
  https://treeherder.mozilla.org/#/jobs?repo=try&revision=34ae98a0d2e19d03c137d8d747c1525ff699719d&selectedJob=168907788

We can tweak damp.js to print which test timed out and report it as a test failure rather than a crash.
Do you know where the code that decides which logs are kept in the "Failure summary" screen lives?
If not, any idea who would know?
We are killing this from the Python code that launched the browser:
https://searchfox.org/mozilla-central/source/testing/talos/talos/talos_process.py#138

We time out after 1 hour; if the process still exists at that point, the Python code takes over and kills the browser.

We do something different for TART, where it detects an internal timeout:
https://searchfox.org/mozilla-central/source/testing/talos/talos/talos_process.py#71

I suppose damp could do the same thing by having a smaller internal timeout using tptimeout.  Here is an example of a 5-second timeout from tp5:
https://searchfox.org/mozilla-central/source/testing/talos/talos/test.py#497

I am not sure what value would make sense for damp or if it needs to do something more like TART, but there seems to be some options to pursue.
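The two-level timeout idea above can be sketched in a few lines: the harness keeps its hard 1-hour kill, while the test enforces a smaller internal timeout (like tp5's tptimeout) so a hung page fails fast with a specific marker string instead of surfacing as a generic crash. All names and values here are illustrative, not the actual talos implementation:

```python
# Sketch: run test work with an internal timeout smaller than the
# harness-level kill, so a hang produces a named, searchable failure
# line rather than a forced browser kill. Illustrative names/values.
import threading

HARNESS_TIMEOUT = 3600   # harness-level kill, seconds (per talos_process.py)
INTERNAL_TIMEOUT = 300   # hypothetical per-test timeout, seconds

def run_with_internal_timeout(work, timeout=INTERNAL_TIMEOUT):
    """Run `work` in a thread; emit a specific marker if it hangs."""
    result = {}

    def runner():
        result["value"] = work()

    t = threading.Thread(target=runner, daemon=True)
    t.start()
    t.join(timeout)
    if t.is_alive():
        # A distinctive string makes log aggregation unambiguous.
        return "TEST-UNEXPECTED-FAIL | internal timeout"
    return result["value"]
```

The key design point is that the internal timeout fires first, so by the time the harness's 1-hour kill would run, the failure has already been attributed to the right test.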
Bug 1444826, which is in autoland, should help us better sort Talos crashes related to DAMP, so it would be great to revisit this bug in one week and see if DAMP still crashes; now we should more easily know why.
Summary: Intermittent PROCESS-CRASH | tps, tp5o_scroll, damp, tp6_youtube_heavy, tp6_google_heavy, tp6_amazon_heavy, tp6_google, glvideo | application crashed [unknown top frame] → Intermittent PROCESS-CRASH | tps, tp5o_scroll, damp, tp6_youtube_heavy, ts_paint_heavy, tp6_google_heavy, tp6_amazon_heavy, tp6_google, glvideo | application crashed [unknown top frame]
In the last 2 weeks we had 3 failures:
2 - win 7 g2 (tps)
1 - win 10 h1 (ts_paint_heavy)

I don't think this warrants a disable recommendation anymore; the failure rate is low and I will leave it as [stockwell unknown].
Whiteboard: [stockwell disable-recommended] → [stockwell unknown]
Assignee: ystartsev → nobody