Closed Bug 1433270 Opened 7 years ago Closed 7 years ago

Intermittent tp6_google | timeout

Categories

(Testing :: Talos, defect, P5)

Version 3
defect

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: intermittent-bug-filer, Unassigned)

References

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell unknown])

There have been 39 failures in the last 7 days. This was filed on January 24th. This fails on Linux x64 / opt & pgo. There are some exceptions for linux64-qr (2) and OS X 10.10.

Here is a relevant log file and a snippet with the failure:
https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=158483519&lineNumber=1849

09:51:05 INFO - PID 16806 | 17255
09:51:05 INFO - PID 16806 | ExceptionHandler::SendContinueSignalToChild sent continue signal to child
09:51:05 INFO - TEST-UNEXPECTED-ERROR | tp6_google | timeout
09:51:05 ERROR - Traceback (most recent call last):
09:51:05 INFO - File "/builds/slave/test/build/tests/talos/talos/run_tests.py", line 289, in run_tests
09:51:05 INFO - talos_results.add(mytest.runTest(browser_config, test))
09:51:05 INFO - File "/builds/slave/test/build/tests/talos/talos/ttest.py", line 62, in runTest
09:51:05 INFO - return self._runTest(browser_config, test_config, setup)
09:51:05 INFO - File "/builds/slave/test/build/tests/talos/talos/ttest.py", line 214, in _runTest
09:51:05 INFO - debugger_args=browser_config['debugger_args']
09:51:05 INFO - File "/builds/slave/test/build/tests/talos/talos/talos_process.py", line 139, in run_browser
09:51:05 INFO - raise TalosError("timeout")
09:51:05 INFO - TalosError: timeout
09:51:05 INFO - TEST-INFO took 3608353ms
09:51:05 INFO - SUITE-END | took 3608s

:rwood could you please take a look?
Flags: needinfo?(rwood)
Whiteboard: [stockwell needswork]
:rwood unfortunately the new hardware didn't reduce this failure rate; it might be worth investigating in the short term.
I really have *no idea* what is causing this. It looks like tp6_google gets through several tp page cycles successfully but then, out of the blue, times out (there's nothing obvious in the logs). I'll see if I can reproduce this locally, and if not I'll get a loaner and try it there.
In the last 7 days we have 60 failures. They occur mostly on windows10-64 (opt, pgo), Windows 7 (opt, pgo), OS X 10.10 (opt), Linux x64 (opt, pgo).

Failure log:
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=161508309&lineNumber=2459

04:23:47 INFO - PROCESS-CRASH | tp6_google | application crashed [unknown top frame]
04:23:47 INFO - Crash dump filename: c:\users\cltbld\appdata\local\temp\tmpfvxdld\profile\minidumps\bcef21a3-6ebd-4ee0-9885-f6858add5c38.dmp
04:23:47 INFO - stderr from minidump_stackwalk:
04:23:47 INFO - 2018-02-10 04:23:47: minidump.cc:4359: INFO: Minidump opened minidump c:\users\cltbld\appdata\local\temp\tmpfvxdld\profile\minidumps\bcef21a3-6ebd-4ee0-9885-f6858add5c38.dmp
04:23:47 INFO - 2018-02-10 04:23:47: minidump.cc:4808: ERROR: ReadBytes: read 0/32
04:23:47 INFO - 2018-02-10 04:23:47: minidump.cc:4453: ERROR: Minidump cannot read header
04:23:47 INFO - 2018-02-10 04:23:47: stackwalk.cc:133: ERROR: Minidump c:\users\cltbld\appdata\local\temp\tmpfvxdld\profile\minidumps\bcef21a3-6ebd-4ee0-9885-f6858add5c38.dmp could not be read
04:23:47 INFO - 2018-02-10 04:23:47: minidump.cc:4331: INFO: Minidump closing minidump
04:23:47 INFO - minidump_stackwalk exited with return code 1
04:23:47 INFO - TEST-UNEXPECTED-ERROR | tp6_google | Found crashes after test run, terminating test
Most likely a duplicate of bug 1378002; hopefully resolved once we upgrade to new machines and a fresh OS.
Over the last 7 days this bug has 30 failures. These happen on Linux x64, linux64-nightly, OS X 10.10, Windows 7 and windows10-64.

Here is the most relevant log example:
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=163249283&lineNumber=1937

Here is a relevant part from that log:

10:25:04 INFO - PROCESS-CRASH | tp6_google | application crashed [unknown top frame]
10:25:04 INFO - Crash dump filename: c:\users\cltbld\appdata\local\temp\tmp85ghrn\profile\minidumps\cd2a3f5d-b99c-4e74-9b5f-497955b96277.dmp
10:25:04 INFO - stderr from minidump_stackwalk:
10:25:04 INFO - 2018-02-20 10:25:04: minidump.cc:4359: INFO: Minidump opened minidump c:\users\cltbld\appdata\local\temp\tmp85ghrn\profile\minidumps\cd2a3f5d-b99c-4e74-9b5f-497955b96277.dmp
10:25:04 INFO - 2018-02-20 10:25:04: minidump.cc:4808: ERROR: ReadBytes: read 0/32
10:25:04 INFO - 2018-02-20 10:25:04: minidump.cc:4453: ERROR: Minidump cannot read header
10:25:04 INFO - 2018-02-20 10:25:04: stackwalk.cc:133: ERROR: Minidump c:\users\cltbld\appdata\local\temp\tmp85ghrn\profile\minidumps\cd2a3f5d-b99c-4e74-9b5f-497955b96277.dmp could not be read
10:25:04 INFO - 2018-02-20 10:25:04: minidump.cc:4331: INFO: Minidump closing minidump
10:25:04 INFO - minidump_stackwalk exited with return code 1
10:25:04 INFO - TEST-UNEXPECTED-ERROR | tp6_google | Found crashes after test run, terminating test
10:25:04 ERROR - Traceback (most recent call last):
10:25:04 INFO - File "C:\slave\test\build\tests\talos\talos\run_tests.py", line 299, in run_tests
10:25:04 INFO - talos_results.add(mytest.runTest(browser_config, test))
10:25:04 INFO - File "C:\slave\test\build\tests\talos\talos\ttest.py", line 62, in runTest
10:25:04 INFO - return self._runTest(browser_config, test_config, setup)
10:25:04 INFO - File "C:\slave\test\build\tests\talos\talos\ttest.py", line 209, in _runTest
10:25:04 INFO - test_config['name'])
10:25:04 INFO - File "C:\slave\test\build\tests\talos\talos\ttest.py", line 46, in check_for_crashes
10:25:04 INFO - raise TalosCrash('Found crashes after test run, terminating test')
10:25:04 INFO - TalosCrash: Found crashes after test run, terminating test
10:25:04 INFO - TEST-INFO took 3633885ms
10:25:04 INFO - SUITE-END | took 3633s
WARNING | IO Completion Port failed to signal process shutdown
Parent process 5820 exited with children alive:
PIDS: 7040, 7148
Attempting to kill them, but no guarantee of success
10:28:09 ERROR - Return code: 2
10:28:09 WARNING - setting return code to 2
10:28:09 ERROR - # TBPL FAILURE #
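The excerpts in this bug show two different signatures for tp6_google (a plain timeout, and a crash with an unreadable minidump), so it may help to separate them when counting failures. What follows is only a minimal sketch, assuming log lines in the format quoted above. The function name `summarize_log` and the labels "timeout", "crash" and "other" are hypothetical and are not part of Talos or Treeherder.

```python
import re

# Patterns modelled on the log excerpts quoted in this bug (an assumption
# about the general log format, not a Talos API).
UNEXPECTED = re.compile(r"TEST-UNEXPECTED-ERROR \| (?P<test>\S+) \| (?P<msg>.+)")
CRASH = re.compile(r"PROCESS-CRASH \| (?P<test>\S+) \| ")
TOOK_MS = re.compile(r"TEST-INFO took (?P<ms>\d+)ms")


def summarize_log(lines):
    """Hypothetical helper: classify a failed run and report its duration.

    Returns a dict with the test name, a kind ("timeout", "crash",
    "other", or None if no unexpected error was seen) and the run
    duration in minutes (None if no "TEST-INFO took" line was seen).
    """
    summary = {"test": None, "kind": None, "minutes": None}
    saw_crash = False
    for line in lines:
        if CRASH.search(line):
            saw_crash = True
        match = UNEXPECTED.search(line)
        if match:
            summary["test"] = match.group("test")
            msg = match.group("msg").strip()
            if msg == "timeout":
                summary["kind"] = "timeout"
            elif saw_crash or "crash" in msg.lower():
                summary["kind"] = "crash"
            else:
                summary["kind"] = "other"
        match = TOOK_MS.search(line)
        if match:
            # Convert milliseconds to minutes, rounded to one decimal place.
            summary["minutes"] = round(int(match.group("ms")) / 60000, 1)
    return summary
```

Fed the excerpts quoted in this bug, the sketch labels the first log as a timeout and the later ones as crashes. Both runs that report a duration took roughly 60 minutes (3608353 ms and 3633885 ms).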
:rwood, as this is not just Windows, we should look into this. Possibly this is Google-specific and we stop running that test, or we determine that it is toolchain related. Some investigation in the short term may be worthwhile.
:rwood do you have any updates on this bug?
I believe a lot of these timeouts are the same tp6 issue as in Bug 1439979. I have a couple of things running on try now to get more info / try to fix this. https://bugzilla.mozilla.org/show_bug.cgi?id=1439979#c10 https://bugzilla.mozilla.org/show_bug.cgi?id=1439979#c11
Flags: needinfo?(rwood)
:rwood do you have any updates on this bug?
Flags: needinfo?(rwood)
There are 3 instances in the last 30 days. I think this bug is mostly fixed by the migration to the new hardware and the taskcluster worker.
Whiteboard: [stockwell disable-recommended] → [stockwell unknown]
Looks good - there have been zero instances of this since April 10th.
Status: NEW → RESOLVED
Closed: 7 years ago
Flags: needinfo?(rwood)
Resolution: --- → WORKSFORME