talos should be able to get a stack trace from browser hangs

Status

enhancement
P5
normal
RESOLVED FIXED
10 years ago
3 years ago

People

(Reporter: ted, Assigned: parkouss)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [talos])

Attachments

(1 attachment, 1 obsolete attachment)

Currently if the browser hangs, Talos kills it and we don't get a stack (at least on OS X). The output there looks like:
../Minefield.app/Contents/MacOS/run-mozilla.sh: line 399:   991 Terminated              "$prog" ${1+"$@"}
Failed tp: 
		Stopped Wed, 18 Mar 2009 04:35:47
FAIL: Busted: tp
FAIL: browser crash
program finished with exit code 0

This would be way more useful if we could get a stack out of this. I'm not sure if this happens on other platforms, so setting to just OS X for now.

Comment 1

10 years ago
I haven't tested, but I'd expect that if you sent a SIGSEGV (instead of SIGKILL) signal to a Linux build, the crash handler would be invoked. Darwin does things differently, so I'm not sure what you could do there. Perhaps pretend to be a debugger and inject a crash into the process?

The other option is, if we have enough "real" symbols for it, to generate a profile using dtrace before killing the app.
Yeah, OS X is tricky. I've looked into it a number of times and never figured out how to get it to work. :-/
This sounds like Future...
Component: Release Engineering: Talos → Release Engineering: Future
Just doing the SEGV thing on Linux would be a huge step.  It would have helped me a ton when I landed interruptible reflow (which was precisely hanging on Linux).

Should I just file a separate bug on the Linux issue?
Assignee: nobody → anodelman
Priority: -- → P3
This should fix up one platform.  Still need solutions for mac/win.
Attachment #383597 - Flags: review?(catlee)
Attachment #383597 - Flags: review?(catlee) → review+
Comment on attachment 383597 [details] [diff] [review]
[Checked in]fix for linux (send signal.SIGSEGV before doing signal.SIGKILL)

Checking in ffprocess_linux.py;
/cvsroot/mozilla/testing/performance/talos/ffprocess_linux.py,v  <--  ffprocess_linux.py
new revision: 1.12; previous revision: 1.11
done
Attachment #383597 - Attachment description: fix for linux (send signal.SIGSEGV before doing signal.SIGKILL) → [Checked in]fix for linux (send signal.SIGSEGV before doing signal.SIGKILL)
Attachment #383597 - Flags: checked-in+
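
For illustration, here is a minimal Python sketch of the "SIGSEGV first, SIGKILL as a fallback" idea described in the attachment. It is not the checked-in ffprocess_linux.py change: the function name crash_then_kill and the grace_seconds parameter are hypothetical, and it assumes the harness holds a subprocess.Popen handle for the browser on a POSIX system.

```python
import signal
import subprocess
import sys


def crash_then_kill(proc, grace_seconds=10.0):
    """Send SIGSEGV so a crash handler can run, then SIGKILL if needed.

    Returns the number of the signal that terminated the process, or
    None if the process had already exited or exited with a normal status.
    """
    if proc.poll() is not None:
        # Nothing to do: the process is already gone.
        return None
    # Give the crash reporter a chance to write a dump.
    proc.send_signal(signal.SIGSEGV)
    try:
        proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        # The process ignored or survived SIGSEGV; force it down.
        proc.kill()
        proc.wait()
    # On POSIX, a negative return code -N means "killed by signal N".
    return -proc.returncode if proc.returncode < 0 else None
```

A harness would call this in place of a plain kill, for example crash_then_kill(browser_proc), and then look for any dump the crash handler left behind.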
Could buildbot do the same for test suites (at least)?
(In reply to comment #7)
> Could buildbot do the same for test suites (at least)?

Wouldn't do us any good, because it doesn't have the PID of the actual browser process, just the PID of the test harness. You can file a separate bug on implementing something similar in the test harnesses (if there's not one on file already).
(In reply to comment #8)
> You can file a separate bug on
> implementing something similar in the test harnesses

I filed bug 501034.
Assignee: anodelman → nobody
I've got a patch in bug 501034 that has a "crashinject.exe" utility that can crash a Windows process by PID in a way that invokes Breakpad. If you'd like, I can provide a binary of it for use in Talos.
Mass move of bugs from Release Engineering:Future -> Release Engineering. See
http://coop.deadsquid.com/2010/02/kiss-the-future-goodbye/ for more details.
Component: Release Engineering: Future → Release Engineering
Priority: P3 → P5
Whiteboard: [talos]
Moving this to the Testing:Talos component so Alice can evaluate what's left to do on this bug.
Component: Release Engineering → Talos
Flags: checked-in+
Product: mozilla.org → Testing
QA Contact: release → talos
Version: other → unspecified
ted, iirc this is a completed item, can you explain what is left to do here?
I think this is fixed for Linux only. Our unittest suites handle both Linux and Windows (by way of the crashinject binary).
(Assignee)

Comment 15

4 years ago
What is the status of this bug, now that we use minidump stackwalk from mozharness for that?
You probably still need to copy what Mochitest does:
https://dxr.mozilla.org/mozilla-central/rev/5fe9ed3edd6811a662d40d05e37b0d66e9520d82/testing/mochitest/runtests.py#1573

...but it's pretty simple these days.
(Assignee)

Comment 17

4 years ago
Well, we already have something similar: https://github.com/mozilla/build-talos/blob/master/talos/ttest.py#L49

Is that enough?
No, that handles crash dumps that are left behind if the process crashes, but it doesn't do anything about timeouts. What you need is to call mozcrash.kill_and_get_minidump in the timeout handling code:
https://github.com/mozilla/build-talos/blob/master/talos/talos_process.py#L93
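
A rough sketch of where such a call would sit in timeout handling, written in Python. This is not the Talos code: run_with_hang_handler and on_hang are hypothetical names, and the hang handler is passed in as a plain callable taking a PID so that the sketch does not assume anything about the exact mozcrash.kill_and_get_minidump signature.

```python
import subprocess
import sys


def run_with_hang_handler(cmd, timeout, on_hang):
    """Run cmd; if it exceeds `timeout` seconds, call on_hang(pid).

    Returns a (returncode, hung) tuple. The process is always killed
    and reaped after on_hang returns, in case the handler left it alive.
    """
    proc = subprocess.Popen(cmd)
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        try:
            # In Talos this is where the mozcrash call would go
            # (assumption: it is given the browser PID).
            on_hang(proc.pid)
        finally:
            if proc.poll() is None:
                proc.kill()
            proc.wait()
        return proc.returncode, True
    return proc.returncode, False
```

The point of the structure is that the crash-dump hook runs only on the timeout path, before the unconditional kill, which is the distinction between handling hangs and handling dumps left by a process that crashed on its own.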
(Assignee)

Comment 19

4 years ago
Posted patch 483968.patch
Right, thanks Ted! I think I missed the "hangs" part in the bug title. :(

So, I'm asking you for review; feel free to redirect to :jmaher if you want.
Assignee: nobody → j.parkouss
Attachment #383597 - Attachment is obsolete: true
Status: NEW → ASSIGNED
Attachment #8658228 - Flags: review?(ted)
Comment on attachment 8658228 [details] [diff] [review]
483968.patch

That seems pretty reasonable.
Attachment #8658228 - Flags: review?(ted) → review+
(Assignee)

Comment 21

4 years ago
Landed in https://hg.mozilla.org/build/talos/rev/c0de097a7159.
Status: ASSIGNED → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
(Assignee)

Comment 22

4 years ago
Pushed to try to test it, though (broken on Windows, but we'll see for Linux at least):

https://treeherder.mozilla.org/#/jobs?repo=try&revision=02517d2d24cc
(Assignee)

Comment 23

4 years ago
Ah! It was worth the try! So on the Talos side this *looks* good, i.e. we call the right mozcrash function. But then we have:

16:11:35    ERROR -  Traceback (most recent call last):
16:11:35     INFO -    File "/builds/slave/test/build/talos_repo/talos/run_tests.py", line 268, in <module>
16:11:35     INFO -      main()
16:11:35     INFO -    File "/builds/slave/test/build/talos_repo/talos/run_tests.py", line 264, in main
16:11:35     INFO -      sys.exit(run_tests(config, browser_config))
16:11:35     INFO -    File "/builds/slave/test/build/talos_repo/talos/run_tests.py", line 216, in run_tests
16:11:35     INFO -      talos_results.add(mytest.runTest(browser_config, test))
16:11:35     INFO -    File "/builds/slave/test/build/talos_repo/talos/ttest.py", line 75, in runTest
16:11:35     INFO -      return self._runTest(browser_config, test_config, setup)
16:11:35     INFO -    File "/builds/slave/test/build/talos_repo/talos/ttest.py", line 185, in _runTest
16:11:35     INFO -      if counter_management else None),
16:11:35     INFO -    File "/builds/slave/test/build/talos_repo/talos/talos_process.py", line 95, in run_browser
16:11:35     INFO -      mozcrash.kill_and_get_minidump(proc.pid)
16:11:35     INFO -    File "/builds/slave/test/build/venv/local/lib/python2.7/site-packages/mozcrash/mozcrash.py", line 443, in kill_and_get_minidump
16:11:35     INFO -      os.kill(pid, signal.SIGABRT)
16:11:35    ERROR -  NameError: global name 'signal' is not defined

So, this is because we use the latest mozcrash release (0.15), which has a bug (already fixed in m-c, as far as I can see). So we should release mozcrash 0.16, put it in pypi/internal packages and update our requirements.txt.
(Assignee)

Updated

4 years ago
Depends on: 1203040
(Assignee)

Comment 24

4 years ago
OK, so I'm reopening the bug until mozcrash 0.16 is available on both pypi and internal pypi. We should then update the requirements.txt in talos to fix that issue.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
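
As a small illustration of the version constraint discussed here (0.15 is affected, 0.16 or later is needed), the comparison can be sketched in Python. The helper names parse_version and version_at_least are hypothetical, and the parsing assumes purely numeric dotted versions; a real requirements.txt pin would be handled by pip rather than by code like this.

```python
def parse_version(text):
    """Turn a dotted numeric version such as '0.16' into (0, 16)."""
    return tuple(int(part) for part in text.split("."))


def version_at_least(installed, minimum):
    """Return True if `installed` is the same as or newer than `minimum`."""
    return parse_version(installed) >= parse_version(minimum)
```

With these helpers, version_at_least("0.15", "0.16") is False and version_at_least("0.16", "0.16") is True, matching the requirement stated in the comment above.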
(Assignee)

Updated

4 years ago
Depends on: 1203654
(Assignee)

Updated

4 years ago
See Also: → 1203892
(Assignee)

Comment 25

4 years ago
Ok, pushed to try with the fix using mozcrash >=0.16:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=a99651963f79
(Assignee)

Comment 26

4 years ago
Pushed to try with a change that makes the browser hang on Linux, to see if we really get stack traces now:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=79fd2b13c934
(Assignee)

Comment 27

4 years ago
Hmm, no stack trace, but we see the "TalosError: timeout" exception, so we are executing the code.

I thought that maybe we needed to give the browser some more time to exit after the mozcrash.kill_and_get_minidump call, so here is another try:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=0a6609d55f4b

But same thing, no stack trace. And in those logs you can see that there is no "Terminating psutil.Process" log line before the error, so mozcrash.kill_and_get_minidump really killed the process this time.

:ted, any idea?
Flags: needinfo?(ted)
(Assignee)

Updated

4 years ago
See Also: → 1211608
(Assignee)

Comment 28

3 years ago
We are getting stack traces for sure, you can see this in bug 1220934 as an example.

I believe it is my test for generating a stack trace that is not working. So I'm closing this bug; feel free to reopen it if you think it should stay open.
Status: REOPENED → RESOLVED
Last Resolved: 4 years ago → 3 years ago
Flags: needinfo?(ted)
Resolution: --- → FIXED