Closed Bug 804385 Opened 12 years ago Closed 11 years ago

mozharness talos has issues with minidump_stackwalk

Categories

(Release Engineering :: Applications: MozharnessCore, defect, P3)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mozilla, Assigned: jyeo)

References

Details

(Whiteboard: [mozharness][talos])

Attachments

(6 files)

e.g. https://tbpl.mozilla.org/php/getParsedLog.php?id=16357091&tree=Cedar&full=1

15:27:33     INFO -  NOISE: exception getting privileged access, defaulting to XUL_FENNEC
15:27:38     INFO -  NOISE: Found crashdump: /tmp/tmp4OAlyL/profile/minidumps/4dbd15ec-1f53-8926-382c5348-0a6a5952.dmp
15:27:39     INFO -  Failed tp5n:
15:27:39     INFO -  		Stopped Mon, 22 Oct 2012 15:27:39
15:27:39    ERROR -  Traceback (most recent call last):
15:27:39     INFO -    File "/home/cltbld/talos-slave/test/build/venv/lib/python2.6/site-packages/talos/run_tests.py", line 250, in run_tests
15:27:39     INFO -      talos_results.add(mytest.runTest(browser_config, test))
15:27:39     INFO -    File "/home/cltbld/talos-slave/test/build/venv/lib/python2.6/site-packages/talos/ttest.py", line 412, in runTest
15:27:39     INFO -      self.cleanupAndCheckForCrashes(browser_config, profile_dir)
15:27:39     INFO -    File "/home/cltbld/talos-slave/test/build/venv/lib/python2.6/site-packages/talos/ttest.py", line 182, in cleanupAndCheckForCrashes
15:27:39 CRITICAL -      raise talosError("error executing: '%s'" % subprocess.list2cmdline(cmd))
15:27:39 CRITICAL -  talosError: "error executing: '/home/cltbld/talos-slave/test/build/venv/lib/python2.6/site-packages/talos/breakpad/linux64/minidump_stackwalk /tmp/tmp4OAlyL/profile/minidumps/4dbd15ec-1f53-8926-382c5348-0a6a5952.dmp /home/cltbld/talos-slave/test/build/symbols'"
15:27:39 CRITICAL -  FAIL: Busted: tp5n
15:27:39     INFO -  FAIL: error executing: '/home/cltbld/talos-slave/test/build/venv/lib/python2.6/site-packages/talos/breakpad/linux64/minidump_stackwalk /tmp/tmp4OAlyL/profile/minidumps/4dbd15ec-1f53-8926-382c5348-0a6a5952.dmp /home/cltbld/talos-slave/test/build/symbols'
15:27:39    ERROR -  Traceback (most recent call last):
15:27:39     INFO -    File "/home/cltbld/talos-slave/test/build/venv/bin/talos", line 9, in <module>
15:27:39     INFO -      load_entry_point('talos==0.0', 'console_scripts', 'talos')()
15:27:39     INFO -    File "/home/cltbld/talos-slave/test/build/venv/lib/python2.6/site-packages/talos/run_tests.py", line 295, in main
15:27:39     INFO -      run_tests(parser)
15:27:39     INFO -    File "/home/cltbld/talos-slave/test/build/venv/lib/python2.6/site-packages/talos/run_tests.py", line 259, in run_tests
15:27:39     INFO -      raise e
15:27:39 CRITICAL -  talos.utils.talosError: "error executing: '/home/cltbld/talos-slave/test/build/venv/lib/python2.6/site-packages/talos/breakpad/linux64/minidump_stackwalk /tmp/tmp4OAlyL/profile/minidumps/4dbd15ec-1f53-8926-382c5348-0a6a5952.dmp /home/cltbld/talos-slave/test/build/symbols'"
15:27:39    ERROR - Return code: 1
Same for linux32:

15:12:02     INFO -  DEBUG: created profile
15:32:04     INFO -  NOISE: exception getting privileged access, defaulting to XUL_FENNEC
15:32:09     INFO -  NOISE: Found crashdump: /tmp/tmpORRyKf/profile/minidumps/752b6547-0d26-efa5-61862122-6aff5a42.dmp
15:32:09     INFO -  Failed tscrollr:
15:32:09     INFO -  		Stopped Mon, 22 Oct 2012 15:32:09
15:32:09    ERROR -  Traceback (most recent call last):
15:32:09     INFO -    File "/home/cltbld/talos-slave/test/build/venv/lib/python2.6/site-packages/talos/run_tests.py", line 250, in run_tests
15:32:09     INFO -      talos_results.add(mytest.runTest(browser_config, test))
15:32:09     INFO -    File "/home/cltbld/talos-slave/test/build/venv/lib/python2.6/site-packages/talos/ttest.py", line 412, in runTest
15:32:09     INFO -      self.cleanupAndCheckForCrashes(browser_config, profile_dir)
15:32:09     INFO -    File "/home/cltbld/talos-slave/test/build/venv/lib/python2.6/site-packages/talos/ttest.py", line 182, in cleanupAndCheckForCrashes
15:32:09 CRITICAL -      raise talosError("error executing: '%s'" % subprocess.list2cmdline(cmd))
15:32:09 CRITICAL -  talosError: "error executing: '/home/cltbld/talos-slave/test/build/venv/lib/python2.6/site-packages/talos/breakpad/linux/minidump_stackwalk /tmp/tmpORRyKf/profile/minidumps/752b6547-0d26-efa5-61862122-6aff5a42.dmp /home/cltbld/talos-slave/test/build/symbols'"
15:32:09 CRITICAL -  FAIL: Busted: tscrollr
15:32:09     INFO -  FAIL: error executing: '/home/cltbld/talos-slave/test/build/venv/lib/python2.6/site-packages/talos/breakpad/linux/minidump_stackwalk /tmp/tmpORRyKf/profile/minidumps/752b6547-0d26-efa5-61862122-6aff5a42.dmp /home/cltbld/talos-slave/test/build/symbols'
15:32:09    ERROR -  Traceback (most recent call last):
15:32:09     INFO -    File "/home/cltbld/talos-slave/test/build/venv/bin/talos", line 9, in <module>
15:32:09     INFO -      load_entry_point('talos==0.0', 'console_scripts', 'talos')()
15:32:09     INFO -    File "/home/cltbld/talos-slave/test/build/venv/lib/python2.6/site-packages/talos/run_tests.py", line 295, in main
15:32:09     INFO -      run_tests(parser)
15:32:09     INFO -    File "/home/cltbld/talos-slave/test/build/venv/lib/python2.6/site-packages/talos/run_tests.py", line 259, in run_tests
15:32:09     INFO -      raise e
15:32:09 CRITICAL -  talos.utils.talosError: "error executing: '/home/cltbld/talos-slave/test/build/venv/lib/python2.6/site-packages/talos/breakpad/linux/minidump_stackwalk /tmp/tmpORRyKf/profile/minidumps/752b6547-0d26-efa5-61862122-6aff5a42.dmp /home/cltbld/talos-slave/test/build/symbols'"
15:32:09    ERROR - Return code: 1
Any idea why this is failing? Do we need a different environment for launching this process? Aren't we running these tests on Linux? The minidump_stackwalk being used is the Linux version.
I don't know why it's failing.
I do know that it's calling the linux minidump_stackwalk on linux, and linux64 on linux64, etc. so that's not it.

I'm thinking about parsing for this in mozharness and, if I detect it, doing some debugging afterwards (verifying that the binary and directory are there, checking permissions, etc.).
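
A minimal sketch of that idea, assuming the failure keeps the "error executing: '...'" shape seen in the logs above. The names find_stackwalk_args and describe_path are hypothetical and are not part of mozharness; the whitespace split also assumes none of the paths contain spaces.

```python
import os
import re
import stat

# Matches the quoted command in lines such as:
#   talosError: "error executing: '<binary> <dump> <symbols>'"
ERROR_RE = re.compile(r"error executing: '([^']*minidump_stackwalk[^']*)'")


def find_stackwalk_args(log_lines):
    """Return the unique paths named in failed minidump_stackwalk commands."""
    paths = []
    for line in log_lines:
        match = ERROR_RE.search(line)
        if not match:
            continue
        # Simple whitespace split; assumes no spaces inside the paths.
        for item in match.group(1).split():
            if item not in paths:
                paths.append(item)
    return paths


def describe_path(path):
    """Summarize existence, type and permission bits for one path."""
    if not os.path.exists(path):
        return "%s: missing" % path
    mode = os.stat(path).st_mode
    kind = "dir" if stat.S_ISDIR(mode) else "file"
    executable = os.access(path, os.X_OK)
    return "%s: %s mode=%s executable=%s" % (
        path, kind, oct(stat.S_IMODE(mode)), executable)
```

Feeding the log through find_stackwalk_args and printing describe_path for each result would show whether the binary, the dump and the symbols directory were actually present and accessible when the command failed.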
Assignee: nobody → aki
We should remove this code when we figure out what's going on and fix it.
But this should help us figure out what's going on.
Attachment #676422 - Flags: review?(jhammel)
Comment on attachment 676422 [details] [diff] [review]
try to figure out what's going on here

Review of attachment 676422 [details] [diff] [review]:
-----------------------------------------------------------------

r+ with one nit.

::: mozharness/mozilla/testing/talos.py
@@ +454,5 @@
>          self.return_code = self.run_command(command, cwd=self.workdir,
>                                              output_parser=parser)
> +        if parser.minidump_output:
> +            for item in parser.minidump_output:
> +                self.run_command(["ls", "-l", item])

I would like to qualify this output so we know what it is and why we see it.
Attachment #676422 - Flags: review?(jhammel) → review+
Comment on attachment 676422 [details] [diff] [review]
try to figure out what's going on here

http://hg.mozilla.org/build/mozharness/rev/8de53d8a6437
Attachment #676422 - Flags: checked-in+
See Also: → 745193
It would probably be a nice thing to have Talos be a little more informative about this as well.  I'll work on a patch here, though this will likely be a multi-step process unless I can reproduce (FWIW, I haven't seen this error either with mozharness or otherwise).
I can't say whether this is sufficient to really diagnose the failure, but it's probably not a bad change to take.
Attachment #685312 - Flags: review?(jmaher)
Comment on attachment 685312 [details] [diff] [review]
be a little more verbose about a few things

Review of attachment 685312 [details] [diff] [review]:
-----------------------------------------------------------------

this isn't much more verbose, but it cleans a lot of little things up.
Attachment #685312 - Flags: review?(jmaher) → review+
since this is rare, I am not sure of what testing is needed, maybe a sanity check on windows.
pushed to try: https://tbpl.mozilla.org/?tree=Try&rev=ce2120177cde

while the patch isn't a ton more verbose, it should at least give some sort of clue as to why it fails and it ensures that the minidump_stackwalk and the symbols path given are actually found on disk.
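
For illustration only, a sketch of that kind of check; this is not the code from the patch. run_stackwalk and StackwalkError are hypothetical names (talos itself raises talosError), and the argument order follows the command line shown in the logs above.

```python
import os
import subprocess


class StackwalkError(Exception):
    """Hypothetical error type; talos itself raises talosError."""


def run_stackwalk(stackwalk_bin, dump_path, symbols_path):
    """Run minidump_stackwalk and return its stdout, or raise with context."""
    # Check the inputs first so a missing file is reported by name.
    if not os.path.isfile(stackwalk_bin):
        raise StackwalkError("minidump_stackwalk not found: %s" % stackwalk_bin)
    if not os.access(stackwalk_bin, os.X_OK):
        raise StackwalkError(
            "minidump_stackwalk not executable: %s" % stackwalk_bin)
    if not os.path.isdir(symbols_path):
        raise StackwalkError("symbols path not found: %s" % symbols_path)
    if not os.path.isfile(dump_path):
        raise StackwalkError("minidump not found: %s" % dump_path)

    cmd = [stackwalk_bin, dump_path, symbols_path]
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    out, err = proc.communicate()
    if proc.returncode != 0:
        # Include the return code and stderr, not just the command line.
        raise StackwalkError("error executing: '%s' (return code %d)\n%s" % (
            subprocess.list2cmdline(cmd), proc.returncode,
            err.decode("utf-8", "replace")))
    return out.decode("utf-8", "replace")
```

Reporting the return code and stderr alongside the command line is the part most likely to yield the "next clue", since the original error only echoed the command.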
Try run for ce2120177cde is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=ce2120177cde
Results (out of 5 total builds):
    success: 5
Builds (or logs if builds failed) available at:
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/jhammel@mozilla.com-ce2120177cde
pushed: http://hg.mozilla.org/build/talos/rev/e00b0636f9f3 ; hopefully when this hits production/cedar this will at least give us the next clue
We haven't seen a crash recently.

We can try to force a crash (ted's crash page in a page load manifest?) or decide whether this actually blocks rollout.
This probably shouldn't block deployment if it is rare.
My impression is that what's rare are ongoing intermittent talos crashes and hangs, but that this is what happens every time a mozharness talos run crashes or hangs.

As bug 814698 says, talos-only crashes caused by newly landed patches happen around once a month. It would be an interesting gamble to see whether we would tell someone that they just can't land a patch because it crashes, though we cannot tell them where, or whether we would just hide some or all of talos on some or all platforms on some or all branches. I'd bet heavily on the latter for a frequent intermittent crash, especially in just one suite, but for permared? An interesting gamble.
Not actively working on this.
Assignee: aki → nobody
Assignee: nobody → yshun
It seems that this is one of the last things to enable talos mozharness in production.
I was thinking of aiming for switching things on the try branch first, once we verify the numbers on Cedar and fix this bug.
Depends on: 892524
minidump stackwalk output on linux(64)
output for minidump stackwalk on winxp
I have posted 3 logs. I can see the minidump_stackwalk output when Firefox crashes on Linux and Windows, but not on Mac.

I commented out lines 209 through 210 of http://mxr.mozilla.org/build/source/talos/talos/ttest.py#209 so that I wouldn't get the error in bug 892524.
getting closer!
Attachment #774213 - Attachment mime type: text/x-log → text/plain
Attachment #774217 - Attachment mime type: text/x-log → text/plain
Attachment #774206 - Attachment mime type: text/x-log → text/plain
I managed to crash firefox with https://code.google.com/p/crashme/ by executing this hack in the browser console:

    Components.utils.import("resource://crashme/modules/Crasher.jsm");
    Crasher.crash(0);

minidump_stackwalk was able to find the dmp file and printed the output in the log.

I wasn't able to get the output in https://bugzilla.mozilla.org/attachment.cgi?id=774217&action=edit because I was using kill -SEGV to crash firefox. When I try to kill firefox that way, the dmp files are not generated in the minidumps folder.
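
A quick way to confirm whether a given crash method left any dumps behind, sketched under the assumption that dumps land in <profile>/minidumps as in the "Found crashdump" lines above. find_minidumps is a hypothetical helper, not a talos function.

```python
import glob
import os


def find_minidumps(profile_dir):
    """Return sorted .dmp paths under <profile_dir>/minidumps (may be empty)."""
    pattern = os.path.join(profile_dir, "minidumps", "*.dmp")
    return sorted(glob.glob(pattern))
```

An empty list after a kill -SEGV run would match what I saw: no dmp files, so nothing for minidump_stackwalk to process.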
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
We have the minidump output on mac, win and linux. Resolved fixed.
\o/
Product: mozilla.org → Release Engineering
Component: General Automation → Mozharness