Closed
Bug 483968
Opened 15 years ago
Closed 8 years ago
talos should be able to get a stack trace from browser hangs
Categories
(Testing :: Talos, enhancement, P5)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: ted, Assigned: parkouss)
References
Details
(Whiteboard: [talos])
Attachments
(1 file, 1 obsolete file)
1.54 KB, patch | ted: review+
Currently if the browser hangs, Talos kills it and we don't get a stack (at least on OS X). The output there looks like:
../Minefield.app/Contents/MacOS/run-mozilla.sh: line 399: 991 Terminated "$prog" ${1+"$@"}
Failed tp: Stopped Wed, 18 Mar 2009 04:35:47
FAIL: Busted: tp
FAIL: browser crash
program finished with exit code 0
This would be way more useful if we could get a stack out of this. I'm not sure if this happens on other platforms, so setting to just OS X for now.
Comment 1•15 years ago
|
||
I haven't tested, but I'd expect that if you sent a SIGSEGV (instead of SIGKILL) signal to a Linux build, the crash handler would be invoked. Darwin does things differently, so I'm not sure what you could do there. Perhaps pretend to be a debugger and inject a crash into the process? The other option is, if we have enough "real" symbols for it, to generate a profile using dtrace before killing the app.
Reporter | ||
Comment 2•15 years ago
|
||
Yeah, OS X is tricky. I've looked into it a number of times and never figured out how to get it to work. :-/
Comment 3•15 years ago
|
||
This sounds like Future...
Component: Release Engineering: Talos → Release Engineering: Future
Comment 4•15 years ago
|
||
Just doing the SEGV thing on Linux would be a huge step. It would have helped me a ton when I landed interruptible reflow (which was precisely hanging on Linux). Should I just file a separate bug on the Linux issue?
Updated•15 years ago
|
Assignee: nobody → anodelman
Priority: -- → P3
Comment 5•15 years ago
|
||
This should fix up one platform. Still need solutions for mac/win.
Attachment #383597 -
Flags: review?(catlee)
Updated•15 years ago
|
Attachment #383597 -
Flags: review?(catlee) → review+
Comment 6•14 years ago
|
||
Comment on attachment 383597 [details] [diff] [review] [Checked in]fix for linux (send signal.SIGSEGV before doing signal.SIGKILL)
Checking in ffprocess_linux.py;
/cvsroot/mozilla/testing/performance/talos/ffprocess_linux.py,v <-- ffprocess_linux.py
new revision: 1.12; previous revision: 1.11
done
Attachment #383597 -
Attachment description: fix for linux (send signal.SIGSEGV before doing signal.SIGKILL) → [Checked in]fix for linux (send signal.SIGSEGV before doing signal.SIGKILL)
Attachment #383597 -
Flags: checked‑in+ checked‑in+
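The approach named in the attachment description (send SIGSEGV, then fall back to SIGKILL) can be sketched roughly as follows. This is not the contents of the checked-in ffprocess_linux.py patch, which is not shown in this bug; the function name `kill_with_crash_signal` and the `grace_seconds` parameter are hypothetical, and the sketch assumes a POSIX system and a `subprocess.Popen` handle for the browser.

```python
import os
import signal
import subprocess
import sys
import time


def kill_with_crash_signal(proc, grace_seconds=5.0):
    """Hypothetical helper: crash the process first, hard-kill it only if needed.

    SIGSEGV gives an installed crash handler (e.g. a crash reporter) a chance
    to run and write a stack; SIGKILL cannot be caught, so it is the fallback.
    """
    os.kill(proc.pid, signal.SIGSEGV)

    # Poll until the process exits or the grace period runs out.
    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline:
        if proc.poll() is not None:
            return proc.returncode
        time.sleep(0.1)

    # The process ignored or is still handling SIGSEGV: force it down.
    os.kill(proc.pid, signal.SIGKILL)
    return proc.wait()


if __name__ == "__main__":
    # Stand-in for a hung browser: a child that just sleeps.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print("exit status:", kill_with_crash_signal(child))
```

The grace period matters because a crash handler needs time to write its dump; sending SIGKILL immediately after SIGSEGV would defeat the purpose.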
Comment 7•14 years ago
|
||
Could buildbot do the same for test suites (at least)?
Reporter | ||
Comment 8•14 years ago
|
||
(In reply to comment #7)
> Could buildbot do the same for test suites (at least)?
Wouldn't do us any good, because it doesn't have the PID of the actual browser process, just the PID of the test harness. You can file a separate bug on implementing something similar in the test harnesses (if there's not one on file already).
Comment 9•14 years ago
|
||
(In reply to comment #8)
> You can file a separate bug on
> implementing something similar in the test harnesses
I filed bug 501034.
Updated•14 years ago
|
Assignee: anodelman → nobody
Reporter | ||
Comment 10•14 years ago
|
||
I've got a patch in bug 501034 that has a "crashinject.exe" utility that can crash a Windows process by PID in a way that invokes Breakpad. If you'd like, I can provide a binary of it for use in Talos.
Comment 11•14 years ago
|
||
Mass move of bugs from Release Engineering:Future -> Release Engineering. See http://coop.deadsquid.com/2010/02/kiss-the-future-goodbye/ for more details.
Component: Release Engineering: Future → Release Engineering
Priority: P3 → P5
Updated•14 years ago
|
Whiteboard: [talos]
Comment 12•13 years ago
|
||
Moving this to the Testing:Talos component so Alice can evaluate what's left to do on this bug.
Component: Release Engineering → Talos
Flags: checked-in+
Product: mozilla.org → Testing
QA Contact: release → talos
Version: other → unspecified
Comment 13•11 years ago
|
||
ted, iirc this is a completed item, can you explain what is left to do here?
Reporter | ||
Comment 14•11 years ago
|
||
I think this is fixed for Linux only. Our unittest suites handle both Linux and Windows (by way of the crashinject binary).
Assignee | ||
Comment 15•8 years ago
|
||
What is the status of this bug, now that we use minidump stackwalk from mozharness for that?
Reporter | ||
Comment 16•8 years ago
|
||
You probably still need to copy what Mochitest does: https://dxr.mozilla.org/mozilla-central/rev/5fe9ed3edd6811a662d40d05e37b0d66e9520d82/testing/mochitest/runtests.py#1573 ...but it's pretty simple these days.
Assignee | ||
Comment 17•8 years ago
|
||
Well, we already have something similar: https://github.com/mozilla/build-talos/blob/master/talos/ttest.py#L49 Is that enough?
Reporter | ||
Comment 18•8 years ago
|
||
No, that handles crash dumps that are left behind if the process crashes, but it doesn't do anything about timeouts. What you need is to call mozcrash.kill_and_get_minidump in the timeout handling code: https://github.com/mozilla/build-talos/blob/master/talos/talos_process.py#L93
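The distinction drawn here (handling leftover dumps versus acting at timeout) can be illustrated with a minimal sketch of a timeout path that invokes a crash hook before giving up. This is not the Talos code: `run_with_hang_handler` and `abort_process` are hypothetical names, and `abort_process` only stands in for `mozcrash.kill_and_get_minidump`, whose exact arguments are not given in this thread. The sketch assumes a POSIX system.

```python
import os
import signal
import subprocess
import sys


def abort_process(pid):
    """Hypothetical stand-in for the crash hook: make the process abort."""
    os.kill(pid, signal.SIGABRT)


def run_with_hang_handler(cmd, timeout, on_hang=abort_process, exit_grace=5.0):
    """Run cmd; if it outlives `timeout` seconds, crash it via on_hang(pid)."""
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Timeout path: crash the process so a dump can be produced,
        # instead of killing it outright and losing the stack.
        on_hang(proc.pid)
    try:
        # Give the crashing process a moment to finish exiting.
        return proc.wait(timeout=exit_grace)
    except subprocess.TimeoutExpired:
        proc.kill()
        return proc.wait()


if __name__ == "__main__":
    hang = [sys.executable, "-c", "import time; time.sleep(60)"]
    print("exit status:", run_with_hang_handler(hang, timeout=1.0))
```

Passing the hook in as a callable keeps the timeout logic independent of how the dump is actually collected.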
Assignee | ||
Comment 19•8 years ago
|
||
Right, thanks Ted! I think I missed the "hangs" part in the bug title. :( So I'm asking you for review; feel free to redirect to :jmaher if you want.
Assignee: nobody → j.parkouss
Attachment #383597 -
Attachment is obsolete: true
Status: NEW → ASSIGNED
Attachment #8658228 -
Flags: review?(ted)
Reporter | ||
Comment 20•8 years ago
|
||
Comment on attachment 8658228 [details] [diff] [review] 483968.patch That seems pretty reasonable.
Attachment #8658228 -
Flags: review?(ted) → review+
Assignee | ||
Comment 21•8 years ago
|
||
Landed in https://hg.mozilla.org/build/talos/rev/c0de097a7159.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Assignee | ||
Comment 22•8 years ago
|
||
Pushed on try to test it though (broken on windows, but we'll see for linux at least): https://treeherder.mozilla.org/#/jobs?repo=try&revision=02517d2d24cc
Assignee | ||
Comment 23•8 years ago
|
||
Ah! It was worth the try! So on the Talos side this *looks* good, i.e. we call the right mozcrash function. But then we have:
16:11:35 ERROR - Traceback (most recent call last):
16:11:35 INFO - File "/builds/slave/test/build/talos_repo/talos/run_tests.py", line 268, in <module>
16:11:35 INFO - main()
16:11:35 INFO - File "/builds/slave/test/build/talos_repo/talos/run_tests.py", line 264, in main
16:11:35 INFO - sys.exit(run_tests(config, browser_config))
16:11:35 INFO - File "/builds/slave/test/build/talos_repo/talos/run_tests.py", line 216, in run_tests
16:11:35 INFO - talos_results.add(mytest.runTest(browser_config, test))
16:11:35 INFO - File "/builds/slave/test/build/talos_repo/talos/ttest.py", line 75, in runTest
16:11:35 INFO - return self._runTest(browser_config, test_config, setup)
16:11:35 INFO - File "/builds/slave/test/build/talos_repo/talos/ttest.py", line 185, in _runTest
16:11:35 INFO - if counter_management else None),
16:11:35 INFO - File "/builds/slave/test/build/talos_repo/talos/talos_process.py", line 95, in run_browser
16:11:35 INFO - mozcrash.kill_and_get_minidump(proc.pid)
16:11:35 INFO - File "/builds/slave/test/build/venv/local/lib/python2.7/site-packages/mozcrash/mozcrash.py", line 443, in kill_and_get_minidump
16:11:35 INFO - os.kill(pid, signal.SIGABRT)
16:11:35 ERROR - NameError: global name 'signal' is not defined
So this is because we use the latest mozcrash release (0.15), which has a bug (fixed in m-c, as far as I can see). So we should release mozcrash 0.16, put it on PyPI and in the internal packages, and update our requirements.txt.
Assignee | ||
Comment 24•8 years ago
|
||
Ok so I reopen the bug until mozcrash 0.16 is available on both pypi and internal pypi. We should then update the requirements.txt in talos to fix that issue.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee | ||
Comment 25•8 years ago
|
||
Ok, pushed to try with the fix using mozcrash >=0.16: https://treeherder.mozilla.org/#/jobs?repo=try&revision=a99651963f79
Assignee | ||
Comment 26•8 years ago
|
||
Pushed to try with a change that makes the browser hang on Linux, to see if we really get stack traces now: https://treeherder.mozilla.org/#/jobs?repo=try&revision=79fd2b13c934
Assignee | ||
Comment 27•8 years ago
|
||
Hmm, no stack trace, but we see the "TalosError: timeout" exception, so we are executing the code. I thought that maybe we needed to give the browser some more time to exit after the mozcrash.kill_and_get_minidump call, so another try here: https://treeherder.mozilla.org/#/jobs?repo=try&revision=0a6609d55f4b But same thing, no stack trace. And in those logs you can see that there is no "Terminating psutil.Process" log line before the error, so mozcrash.kill_and_get_minidump really killed the process this time. :ted, any idea?
Flags: needinfo?(ted)
Assignee | ||
Comment 28•8 years ago
|
||
We are getting stack traces for sure; see bug 1220934 for an example. I believe it is my test for generating a stack trace that is not working. So I'm closing this bug; feel free to reopen if you think it should stay open.
Status: REOPENED → RESOLVED
Closed: 8 years ago → 8 years ago
Flags: needinfo?(ted)
Resolution: --- → FIXED