Closed Bug 1439522 Opened 8 years ago Closed 5 years ago

Intermittent ts_paint_webext/ts_paint | application crashed [unknown top frame]

Categories

(Testing :: Talos, defect, P3)

Version 3
defect

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 1630448

People

(Reporter: intermittent-bug-filer, Unassigned)

References

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell needswork:owner])

Filed by: apavel [at] mozilla.com https://treeherder.mozilla.org/logviewer.html#?job_id=163136925&repo=autoland https://queue.taskcluster.net/v1/task/Mp-f0iGqRlqQ-TXW5qspvQ/runs/0/artifacts/public/logs/live_backing.log 23:54:32 INFO - TEST-INFO | started process 1957 (/home/cltbld/workspace/build/application/firefox/firefox -profile /tmp/tmplcr9K_/profile http://localhost:36243/startup_test/tspaint_test.html) 23:54:33 INFO - PID 1957 | 1519113273064 addons.webextension.talos@mozilla.org WARN Please specify whether you want browser_style or not in your page_action options. 23:54:33 INFO - PID 1957 | 1519113273072 addons.webextension.talos@mozilla.org WARN Please specify whether you want browser_style or not in your browser_action options. 23:57:02 INFO - Timeout waiting for test completion; killing browser... 23:57:02 INFO - Terminating psutil.Process(pid=1957, name='firefox', started='23:54:32') 23:57:02 INFO - PID 1957 | ExceptionHandler::GenerateDump cloned child 2049 23:57:02 INFO - PID 1957 | ExceptionHandler::SendContinueSignalToChild sent continue signal to child 23:57:02 INFO - PID 1957 | ExceptionHandler::WaitForContinueSignal waiting for continue signal... 23:57:02 INFO - mozcrash Downloading symbols from: https://queue.taskcluster.net/v1/task/WItqOn28SFS7ivXRc0q7FQ/artifacts/public/build/target.crashreporter-symbols.zip 23:57:11 INFO - mozcrash Copy/paste: /home/cltbld/workspace/build/linux64-minidump_stackwalk /tmp/tmplcr9K_/profile/minidumps/438e5a63-ac1b-1ee8-0688-dd550acaf472.dmp /tmp/tmpx5ZCuz 23:57:11 INFO - mozcrash Saved minidump as /home/cltbld/workspace/build/blobber_upload_dir/438e5a63-ac1b-1ee8-0688-dd550acaf472.dmp 23:57:11 INFO - PROCESS-CRASH | ts_paint_webext | application crashed [unknown top frame] 23:57:11 INFO - Crash dump filename: /tmp/tmplcr9K_/profile/minidumps/438e5a63-ac1b-1ee8-0688-dd550acaf472.dmp 23:57:11 INFO - stderr from minidump_stackwalk: 23:57:11 INFO - 2018-02-19 23:57:11: minidump.cc:4359: INFO: Minidump opened minidump /tmp/tmplcr9K_/profile/minidumps/438e5a63-ac1b-1ee8-0688-dd550acaf472.dmp 23:57:11 INFO - 2018-02-19 23:57:11: minidump.cc:4808: ERROR: ReadBytes: read 0/32 23:57:11 INFO - 2018-02-19 23:57:11: minidump.cc:4453: ERROR: Minidump cannot read header 23:57:11 INFO - 2018-02-19 23:57:11: stackwalk.cc:133: ERROR: Minidump /tmp/tmplcr9K_/profile/minidumps/438e5a63-ac1b-1ee8-0688-dd550acaf472.dmp could not be read 23:57:11 INFO - 2018-02-19 23:57:11: minidump.cc:4331: INFO: Minidump closing minidump 23:57:11 INFO - minidump_stackwalk exited with return code 1 23:57:11 INFO - TEST-UNEXPECTED-ERROR | ts_paint_webext | Found crashes after test run, terminating test 23:57:11 ERROR - Traceback (most recent call last): 23:57:11 INFO - File "/home/cltbld/workspace/build/tests/talos/talos/run_tests.py", line 299, in run_tests 23:57:11 INFO - talos_results.add(mytest.runTest(browser_config, test)) 23:57:11 INFO - File "/home/cltbld/workspace/build/tests/talos/talos/ttest.py", line 62, in runTest 23:57:11 INFO - return self._runTest(browser_config, test_config, setup) 23:57:11 INFO - File "/home/cltbld/workspace/build/tests/talos/talos/ttest.py", line 209, in _runTest 23:57:11 INFO - test_config['name']) 23:57:11 INFO - File "/home/cltbld/workspace/build/tests/talos/talos/ttest.py", line 46, in check_for_crashes 23:57:11 INFO - raise TalosCrash('Found crashes after test run, terminating test') 23:57:11 INFO - TalosCrash: Found crashes after test run, terminating test 23:57:11 INFO - TEST-INFO took 167041ms 23:57:11 INFO - SUITE-END | took 167s 23:57:12 ERROR - Return code: 2 23:57:12 WARNING - setting return code to 2 23:57:12 ERROR - # TBPL FAILURE # 23:57:12 INFO - Running post-action listener: _package_coverage_data 23:57:12 INFO - Running post-action listener: _resource_record_post_action 23:57:12 INFO - [mozharness: 2018-02-20 07:57:12.157378Z] Finished run-tests step (success) 23:57:12 INFO - Running post-run listener: _resource_record_post_run 23:57:12 INFO - Total resource usage - Wall time: 178s; CPU: 2.0%; Read bytes: 0; Write bytes: 571793408; Read time: 0; Write time: 41588 23:57:12 INFO - TinderboxPrint: CPU usage<br/>2.3% 23:57:12 INFO - TinderboxPrint: I/O read bytes / time<br/>0 / 0 23:57:12 INFO - TinderboxPrint: I/O write bytes / time<br/>571,793,408 / 41,588 23:57:12 INFO - TinderboxPrint: CPU idle<br/>1,388.2 (97.7%) 23:57:12 INFO - TinderboxPrint: CPU user<br/>27.6 (1.9%) 23:57:12 INFO - TinderboxPrint: Swap in / out<br/>0 / 0 23:57:12 INFO - install - Wall time: 12s; CPU: 13.0%; Read bytes: 0; Write bytes: 1032192; Read time: 0; Write time: 8 23:57:12 INFO - setup-mitmproxy - Wall time: 0s; CPU: Can't collect data; Read bytes: 0; Write bytes: 0; Read time: 0; Write time: 0 23:57:12 INFO - run-tests - Wall time: 168s; CPU: 2.0%; Read bytes: 0; Write bytes: 568909824; Read time: 0; Write time: 41508 23:57:12 INFO - Running post-run listener: _upload_blobber_files 23:57:12 WARNING - Blob upload gear skipped. Missing cmdline options. 23:57:12 INFO - Running post-run listener: copy_logs_to_upload_dir 23:57:12 INFO - Copying logs to upload dir... 23:57:12 INFO - mkdir: /home/cltbld/workspace/build/upload/logs 23:57:12 INFO - Copying logs to upload dir... 23:57:12 WARNING - returning nonzero exit status 2 cleanup + cleanup + local rv=2 + [[ -s /home/cltbld/.xsession-errors ]] + cp /home/cltbld/.xsession-errors /home/cltbld/artifacts/public/xsession-errors.log + false + exit 2
There have been 46 failures in the last week. Summary: Intermittent ts_paint_webext/ts_paint | application crashed [unknown top frame] Failures per platform: - Windows 7: 25 - windows10-64: 16 - OS X 10.10: 5 Failures per build type: - opt: 40 - pgo: 6 Here is a relevant log file and a snippet with the failure: https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=166795025&lineNumber=2987 10:19:32 INFO - Terminating psutil.Process(pid=1932L, name='firefox.exe', started='10:17:01') 10:19:32 INFO - mozcrash Downloading symbols from: https://queue.taskcluster.net/v1/task/XMQ3ttFCSd6dLt-kWqm06w/artifacts/public/build/target.crashreporter-symbols.zip 10:19:40 INFO - mozcrash Copy/paste: C:\slave\test\build\win32-minidump_stackwalk.exe c:\users\cltbld~1.002\appdata\local\temp\tmpux9zyq\profile\minidumps\b13dc3d3-d177-47d7-b95a-d3d6c31a8062.dmp c:\users\cltbld~1.002\appdata\local\temp\tmpvlue1t 10:19:50 INFO - mozcrash Saved minidump as C:\slave\test\build\blobber_upload_dir\b13dc3d3-d177-47d7-b95a-d3d6c31a8062.dmp 10:19:50 INFO - PROCESS-CRASH | ts_paint_webext | application crashed [unknown top frame] 10:19:50 INFO - Crash dump filename: c:\users\cltbld~1.002\appdata\local\temp\tmpux9zyq\profile\minidumps\b13dc3d3-d177-47d7-b95a-d3d6c31a8062.dmp 10:19:50 INFO - Operating system: Windows NT 10:19:50 INFO - 6.1.7601 Service Pack 1 10:19:50 INFO - CPU: x86 10:19:50 INFO - GenuineIntel family 6 model 30 stepping 5 10:19:50 INFO - 8 CPUs 10:19:50 INFO - GPU: UNKNOWN 10:19:50 INFO - No crash 10:19:50 INFO - Process uptime: 151 seconds 10:19:50 INFO - Thread 0
Flags: needinfo?(rwood)
Whiteboard: [stockwell needswork]
Hi Jim, can you please have a look at this ts_paint crash? Here's a link to a recent log: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=166291517&lineNumber=3032 Thanks!
Flags: needinfo?(rwood) → needinfo?(jmathies)
(In reply to Robert Wood [:rwood] (PTO until 9-May) from comment #6) > Hi Jim, can you please have a look at this ts_paint crash? Here's a link to > a recent log: > > https://treeherder.mozilla.org/logviewer.html#?repo=mozilla- > central&job_id=166291517&lineNumber=3032 > > Thanks! I'm using a search query to find these now. Will be working on them over the next month or so.
Flags: needinfo?(jmathies)
Priority: P5 → P3
(In reply to Jim Mathies [:jimm] from comment #8) > (In reply to Robert Wood [:rwood] (PTO until 9-May) from comment #6) > > Hi Jim, can you please have a look at this ts_paint crash? Here's a link to > > a recent log: > > > > https://treeherder.mozilla.org/logviewer.html#?repo=mozilla- > > central&job_id=166291517&lineNumber=3032 > > > > Thanks! > > I'm using a search query to find these now. Will be working on them over the > next month or so. Oh, not this one though. It's not in a component I triage. I will make a manual note to take a look.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → INCOMPLETE
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Resolution: --- → INCOMPLETE
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Resolution: --- → INCOMPLETE
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---

Gabrielle, is that the same issue again as filed as bug 1614931?

Flags: needinfo?(gsvelto)

No, this is worse. The only minidump I found is completely empty. There is no .extra file either. Somehow we failed in a way that was so catastrophic that we didn't write anything out.

Flags: needinfo?(gsvelto)

Ah, so looks like it depends on bug 981641 then.

Depends on: 981641
Status: REOPENED → RESOLVED
Closed: 7 years ago6 years ago
Resolution: --- → INCOMPLETE

In the past week there were 27 failures on windows10-64-shippable opt.

Recent log:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=306259778&repo=mozilla-central&lineNumber=966

[task 2020-06-13T23:34:00.722Z] 23:34:00 INFO - TEST-INFO | started process 708 (C:\Users\task_1592081630\build\application\firefox\firefox -wait-for-browser -no-deelevate -profile c:\users\task_1592081630\appdata\local\temp\tmpei5iw6\profile http://127.0.0.1:49796/startup_test/tspaint_test.html?gecko_profile_entries=10000000&gecko_profile_threads=GeckoMain%2CCompositor&gecko_profile_dir=c%3A%5Cusers%5Ctask_1592081630%5Cappdata%5Clocal%5Ctemp%5Ctmpq3er2o&gecko_profile_interval=1 http://127.0.0.1:49796/startup_test/tspaint_test.html?gecko_profile_entries=10000000&gecko_profile_threads=GeckoMain%2CCompositor&gecko_profile_dir=c%3A%5Cusers%5Ctask_1592081630%5Cappdata%5Clocal%5Ctemp%5Ctmpq3er2o&gecko_profile_interval=1)
[task 2020-06-13T23:34:02.377Z] 23:34:02 INFO - PID 708 | __start_report396__end_report
[task 2020-06-13T23:34:02.377Z] 23:34:02 INFO - PID 708 |
[task 2020-06-13T23:34:02.379Z] 23:34:02 INFO - PID 708 | __startTimestamp1592091242377__endTimestamp
[task 2020-06-13T23:34:10.390Z] 23:34:10 INFO - Browser shutdown timed out after 5 seconds, killing process.
[task 2020-06-13T23:34:10.392Z] 23:34:10 INFO - Launcher process psutil.Process(pid=708L, name='firefox.exe', started='23:34:00') detected. Killing parent process psutil.Process(pid=1732, name='firefox.exe', started='23:34:00') instead.
[task 2020-06-13T23:34:10.392Z] 23:34:10 INFO - Killing psutil.Process(pid=1732, name='firefox.exe', started='23:34:00') and writing a minidump file
[task 2020-06-13T23:34:10.392Z] 23:34:10 INFO - mozcrash Writing a dump to c:\users\task_1592081630\appdata\local\temp\tmpei5iw6\profile\minidumps\421ddf84-4036-48c9-8871-9ba97819b5b8.dmp for [1732]
[task 2020-06-13T23:34:10.470Z] 23:34:10 INFO - PID 708 | Exiting due to channel error.
[task 2020-06-13T23:34:10.515Z] 23:34:10 INFO - mozcrash checking c:\users\task_1592081630\appdata\local\temp\tmpei5iw6\profile\minidumps for minidumps...
[task 2020-06-13T23:34:10.515Z] 23:34:10 INFO - mozcrash Downloading symbols from: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/P_lm41ksRCGdvUqP331FwQ/artifacts/public/build/target.crashreporter-symbols.zip
[task 2020-06-13T23:34:11.963Z] 23:34:11 INFO - mozcrash Copy/paste: C:/Users/task_1592081630/fetches\minidump_stackwalk\minidump_stackwalk.exe c:\users\task_1592081630\appdata\local\temp\tmpei5iw6\profile\minidumps\421ddf84-4036-48c9-8871-9ba97819b5b8.dmp c:\users\task_1592081630\appdata\local\temp\tmp1bhlxm
[task 2020-06-13T23:34:16.838Z] 23:34:16 INFO - mozcrash Saved minidump as C:\Users\task_1592081630\build\blobber_upload_dir\421ddf84-4036-48c9-8871-9ba97819b5b8.dmp
[task 2020-06-13T23:34:16.838Z] 23:34:16 INFO - PROCESS-CRASH | ts_paint_webext | application crashed [unknown top frame]

Flags: needinfo?(gmierz2)
Whiteboard: [stockwell unknown] → [stockwell needswork:owner]

I ni?'ed :gsvelto about this in bug 981641.

Flags: needinfo?(gmierz2)

See :gsvelto's comment here: https://bugzilla.mozilla.org/show_bug.cgi?id=981641#c20

:gbrown would you be able to look into this issue? It sounds like it's coming from mozcrash based on :gsvelto's comment.

Flags: needinfo?(gbrown)

FYI I've pulled down a few minidumps and had a look:

  • All of them were of the parent process
  • All of them had different stacks, so the parent process was killed at a different point every time
  • Looking at the logs it seems that the test never started, have a look at this snippet:
[task 2020-06-11T11:17:02.484Z] 11:17:02     INFO -  TEST-INFO | started process 8188 (C:\Users\task_1591864905\build\application\firefox\firefox -wait-for-browser -no-deelevate -profile c:\users\task_1591864905\appdata\local\temp\tmpbplloy\profile http://127.0.0.1:49799/startup_test/tspaint_test.html?gecko_profile_entries=10000000&gecko_profile_threads=GeckoMain%2CCompositor&gecko_profile_dir=c%3A%5Cusers%5Ctask_1591864905%5Cappdata%5Clocal%5Ctemp%5Ctmpetgzbj&gecko_profile_interval=1 http://127.0.0.1:49799/startup_test/tspaint_test.html?gecko_profile_entries=10000000&gecko_profile_threads=GeckoMain%2CCompositor&gecko_profile_dir=c%3A%5Cusers%5Ctask_1591864905%5Cappdata%5Clocal%5Ctemp%5Ctmpetgzbj&gecko_profile_interval=1)
[task 2020-06-11T11:17:08.023Z] 11:17:08     INFO -  PID 8188 | BROWSER FAILED TO GENERATE MOZAFTERPAINT IN 5 SECONDS
[task 2020-06-11T11:17:08.023Z] 11:17:08     INFO -  PID 8188 | __startTimestamp1591874228020__endTimestamp
[task 2020-06-11T11:17:16.069Z] 11:17:16     INFO -  Browser shutdown timed out after 5 seconds, killing process.

Connecting the dots above it seems that the main process might have been slow to start and the harness killed it after 5 seconds w/o a first paint event. Not knowing how the harness works my guess is that the main process might not be stuck, just very very slow starting up. It might be interesting to bump that timeout and see if the issue goes away. As to why the main process is taking so long to start up I have no idea unfortunately.

No longer depends on: 981641

At a glance, I'm not sure there is anything wrong with mozcrash, or the crash reports produced here. Most of these failures are timeouts and mozcrash forces minidump generation and kills firefox. Of course there is no crashing thread in that scenario, and there may be more threads just running in OS code...maybe that is causing some confusion? Looking through a few crash reports, most seem to have some threads running firefox code, and the stacks are symbolicated and appear consistent...if not very helpful. Unhelpful stacks may also be related to the scenario called out in comment 41: if firefox is still starting up, hits a timeout and is shutting down when the minidump is generated, the crash report may not be very interesting.

To follow up on comment 41, I pushed to try with longer timeouts, concentrating on the Windows 10 T(prof) tests:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=334430599e41c198ddad4289f79a89285216b076
Increasing wait_for_quit_timeout from 5 to 30 seconds seems to avoid many of the crash-on-shutdown-timeout occurrences.
Increasing the MozAfterPaint timeout from 5 to 120 seconds does not seem to help at all: there are still frequent timeouts after 120 seconds.
Even with both changes, these tasks remain near perma-fail.

Flags: needinfo?(gbrown)

Maybe push another try with toolkit.asyncshutdown.log enabled. It might give us some ideas what's hanging during startup/shutdown.

As it looks like with the fix on bug 1630448 the number of crashes have been reduced a lot or are even gone. There is only one instance on June 26th. It shows a problem with minidump stackwalk:

https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=307623880&repo=mozilla-central

[task 2020-06-26T04:57:37.276Z] 04:57:37 INFO - 2020-06-26 04:57:37: minidump_stackwalk.cc:135: ERROR: Minidump c:\users\task_1593134568\appdata\local\temp\tmpm79dej\profile\minidumps\eb95f136-02c9-4301-abda-2fe6ef2a28d8.dmp could not be read
[task 2020-06-26T04:57:37.276Z] 04:57:37 INFO - minidump_stackwalk exited with return code 1

The minidump is only 64 bytes in size. Looks like was a problem in saving it? Gabriele, any idea?

I would say we close this bug.

Depends on: 1630448
Flags: needinfo?(gsvelto)

The stackwalker couldn't analyze the minidump because it's empty. That's interesting because when this happens we often attributed it to an issue in Gecko, but this is a minidump that was generated by mozcrash. mozcrash just calls the appropriate Windows API which means that the failure to write the minidump didn't happen in our code... which is encouraging. Either way yeah, I think we can close this bug.

Flags: needinfo?(gsvelto)
Status: REOPENED → RESOLVED
Closed: 6 years ago5 years ago
Resolution: --- → DUPLICATE
You need to log in before you can comment on or make changes to this bug.