Closed Bug 542952 Opened 16 years ago Closed 8 years ago

leak test build reports 'No symbols path given, can't process dump.' when a crash happens

Categories

(Testing :: General, defect, P5)

x86
Linux
defect

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: sgautherie, Unassigned)

References

Details

(Keywords: regression, Whiteboard: [unittest])

Attachments

(1 file, 2 obsolete files)

Moved here from bug 520707 comment 13:
{
Serge Gautherie (:sgautherie) 2010-01-21 06:49:12 PST
http://tinderbox.mozilla.org/showlog.cgi?log=SeaMonkey/1264014698.1264015053.2511.gz
Linux comm-central-trunk leak test build on 2010/01/20 11:11:38
No symbols path given, can't process dump.
I'm not sure what/where we need to do to get the symbol path.
}
And I'm not sure how to check this now that bug 520707 is fixed :-/
Flags: in-testsuite-
Ted, do you have an idea what we could be missing here?
This is crashing running leaktest.py? You'll probably have to append --symbols-path=dist/crashreporter-symbols to the commandline.
It's running leaktest.py through our buildbot harness, yes. Should --symbols-path always be passed, or is it only needed if the symbols are in a non-default location?
It needs to always be passed, there's no default for it.
Interesting, as that means that Firefox leak test builds ought to show the same problem when they crash, and we probably need a general fix in buildbotcustom for that.
Moving to RelEng as it seems that the issue is that we generally always need to pass --symbols-path to leaktest.py to get useful output in case the build crashes.
Component: Build Config → Release Engineering
Flags: in-testsuite-
Product: SeaMonkey → mozilla.org
QA Contact: build-config → release
Version: Trunk → other
Assignee: nobody → ccooper
Status: NEW → ASSIGNED
Priority: -- → P2
Probably a dupe of bug 519195.
Attached patch Proposed fix (obsolete) — Splinter Review
Ted, please correct me if I'm wrong: --symbols-path should point to the _directory_ (dist/crashreporter-symbols) where firefox-*-symbols.txt exists, not the _file_. Coop, it seems the only place where leaktest is being called is the AliveTest step. I've added another optional parameter for symbolsPath and changed the factories accordingly. Please review.
Assignee: ccooper → raliiev
Attachment #428875 - Flags: review?(ccooper)
Comment on attachment 428875 Proposed fix
> workdir='build/%s/_leaktest' % self.mozillaObjdir,
>+ symbolsPath='build/%s/dist/crashreporter-symbols',
You're missing the actual variable to substitute for %s. It should be self.mozillaObjdir, as in the line above, so make it: symbolsPath='build/%s/dist/crashreporter-symbols' % self.mozillaObjdir, Same for the other symbolsPath= lines.
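The review point above comes down to how Python's % operator works: without a right-hand operand, the %s placeholder simply stays in the string. A minimal illustration follows; the objdir value is made up for the example, since the real self.mozillaObjdir value is not shown in this bug.

```python
# Hypothetical objdir value, for illustration only.
mozilla_objdir = "objdir"

# As written in the first patch: no operand, so "%s" is kept verbatim.
broken = 'build/%s/dist/crashreporter-symbols'

# As requested in review: substitute the objdir into the template.
fixed = 'build/%s/dist/crashreporter-symbols' % mozilla_objdir

print(broken)
print(fixed)
```

The first string would reach the test harness with a literal %s in it, which is not a usable directory.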
Yes, you want the directory.
(In reply to comment #9)
> (From update of attachment 428875)
> > workdir='build/%s/_leaktest' % self.mozillaObjdir,
> >+ symbolsPath='build/%s/dist/crashreporter-symbols',
Shouldn't that also be: symbolsPath='../dist/crashreporter-symbols' ...given that the workdir is already under the objdir?
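To check the suggestion above, here is a small sketch that resolves the relative path against the workdir. It assumes a made-up objdir name and uses posixpath so the result is the same on any platform.

```python
import posixpath

# Hypothetical objdir value, for illustration only.
objdir = "objdir"

workdir = 'build/%s/_leaktest' % objdir
relative_symbols = '../dist/crashreporter-symbols'

# Resolve the relative path as the step would see it from its workdir.
resolved = posixpath.normpath(posixpath.join(workdir, relative_symbols))

print(resolved)
```

Under that assumption the relative form resolves to the same directory as 'build/%s/dist/crashreporter-symbols' % objdir.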
Coop, you're probably right. It looks to me like Rail didn't actually test this patch; it should probably run through staging to make sure it basically works. The real test will probably be when it's in production and hits a crash, which seems to still happen intermittently on SeaMonkey trunk, even if the frequency has been greatly reduced recently.
Comment on attachment 428875 Proposed fix
A few things:
* AFAICT --symbols-path is an arg to automation.py rather than leaktest.py, so it needs to go in the extraArgs, i.e. after the '--'
* we probably want to set the symbolsPath every time we run leaktest.py, so I would prefer having it default to being set, so that we would have to explicitly turn it off when adding steps rather than turning it on every time.
Attachment #428875 - Flags: review?(ccooper) → review-
If you want to test this in staging, you could run a leaktest run on Linux, then kill -SEGV the browser process, which should trigger Breakpad and give you a stack trace.
At least I'm on the right way. :) I had no idea how to test it. Ted, thanks for the tip. I'll test it on staging tomorrow and be back with the results. Thanks for the comments.
Summary: [SeaMonkey] 'Linux comm-central-trunk leak test build' reports 'No symbols path given, can't process dump.' when a crash happens → leak test build reports 'No symbols path given, can't process dump.' when a crash happens
Attached patch Proposed fix (obsolete) — Splinter Review
Coop, could you review the following patch? I'm still trying to reproduce, but the only message I get after "kill -SEGV pid" is:
TEST-UNEXPECTED-FAIL | automation.py | Exited with code -11 during test run
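The "-11" in that message is consistent with the process having been killed by SIGSEGV (signal 11 on Linux): Python's subprocess module reports death by signal N as return code -N. The sketch below shows this on a POSIX system, using a child Python interpreter as a stand-in for the browser; it is not the automation.py code.

```python
import signal
import subprocess
import sys

# Stand-in for the browser: a child that announces readiness, then idles.
child = subprocess.Popen(
    [sys.executable, "-c",
     "import sys, time; print('ready'); sys.stdout.flush(); time.sleep(60)"],
    stdout=subprocess.PIPE,
)
child.stdout.readline()            # wait until the child is actually running
child.send_signal(signal.SIGSEGV)  # same effect as `kill -SEGV <pid>`
returncode = child.wait()
child.stdout.close()

# A negative return code means "terminated by that signal number".
print(returncode)
```

So the exit code alone only confirms that the signal arrived; it says nothing about whether a minidump was written or processed.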
Attachment #428875 - Attachment is obsolete: true
Attachment #429198 - Flags: review?(ccooper)
Comment on attachment 429198 Proposed fix
Hrmmm, sorry. On second inspection, we always want to set --symbols-path even when extraArgs is None. Maybe we should make extraArgs=[] the default? We'll have to check any AliveTest consumers to make sure they conform if we do.
Attachment #429198 - Flags: review?(ccooper) → review-
Attached patch Proposed fix - Splinter Review
> Hrmmm, sorry. On second inspection, we always want to set --symbols-path even when extraArgs is None.
Argh... Should have slept more. :)
> Maybe we should make extraArgs=[] the default? We'll have to check any AliveTest consumers to make sure they conform if we do.
I'd prefer to use None; a [] default may become stateful in some cases. Please take a look at this version.
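The concern about [] becoming stateful refers to Python evaluating default arguments once, at function definition time. The sketch below shows the pitfall and a None-based alternative that still always appends --symbols-path. Both functions are hypothetical helpers written for this illustration, not the buildbotcustom AliveTest code.

```python
def append_shared(item, bucket=[]):
    # The default list is created once and shared across calls.
    bucket.append(item)
    return bucket


def build_leaktest_command(symbols_path, extra_args=None):
    """Hypothetical helper: build the command line with --symbols-path."""
    # Copy the caller's list (or start empty) so nothing is shared.
    args = list(extra_args) if extra_args is not None else []
    # Always pass the symbols path, even when no extra args were given.
    args.append('--symbols-path=%s' % symbols_path)
    # Everything after '--' is forwarded as extra arguments.
    return ['python', 'leaktest.py', '--'] + args


first = append_shared('a')
second = append_shared('b')   # same list object as `first`

caller_args = ['--trace-malloc', 'malloc.log']
command = build_leaktest_command('../dist/crashreporter-symbols', caller_args)
print(command)
```

With the None default, each call starts from a fresh list and the caller's list is left unmodified.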
Attachment #429198 - Attachment is obsolete: true
Attachment #429475 - Flags: review?(ccooper)
Comment on attachment 429475 Proposed fix
Looks good.
Attachment #429475 - Flags: review?(ccooper) → review+
pm* reconfig-ed with this change.
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Am I right that the end of http://tinderbox.mozilla.org/showlog.cgi?log=SeaMonkey/1267465601.1267466624.27337.gz&fulltext=1 is the output we want to see when it crashes, and that this bug was fixed successfully?
No, that's just our internal stack walking code from hitting an assertion.
OK, then we'll need to wait on a real crash that isn't from an assertion to be able to confirm this fix.
I think the change here is causing the win32 debug builds to time out doing the tracemalloc alive test (bug 549422). Rather than trying to guess a new timeout, I'm going to back the change out and we can revisit in staging (including figuring out if spending more than an hour on this test is a good use of resources).
Blocks: 549422
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to comment #25)
> I think the change here is causing the win32 debug builds to time out doing the
> tracemalloc alive test (bug 549422). Rather than trying to guess a new timeout,
> I'm going to back the change out and we can revisit in staging (including
> figuring out if spending more than an hour on this test is a good use of
> resources).
Rail: can you try to reproduce this in staging tomorrow, please?
As a note, SeaMonkey doesn't show any timeouts on trunk boxes, not even on Windows.
Another interesting observation. Trying to reproduce this timeout, I got this:
MINIDUMP_STACKWALK binary not found: /e/builds/moz2_slave/mozilla-central-win32-debug/tools/breakpad/win32/minidump_stackwalk.exe
minidump_stackwalk.exe exists in that directory. Should we use Windows-style paths for the MINIDUMP_STACKWALK env variable?
Rail: that's an MSYS path, which is ok as long as you execute things via a msys shell. If you run directly from a Windows shell, or execute a native Windows program (python) that executes the command without going through a msys shell, then it won't translate that path for you.
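To make the difference concrete, here is a rough sketch of the translation an MSYS shell would otherwise do for a drive-style path. The helper is hypothetical and only handles the single-letter drive prefix; real MSYS path translation covers more cases (such as its mount table) that are ignored here.

```python
import re


def msys_to_windows(path):
    """Hypothetical helper: turn '/e/foo/bar' into 'E:\\foo\\bar'."""
    match = re.match(r'^/([A-Za-z])(/.*)?$', path)
    if not match:
        return path  # not a drive-style MSYS path; leave it untouched
    drive = match.group(1)
    rest = match.group(2) or '/'
    return drive.upper() + ':' + rest.replace('/', '\\')


msys_path = ('/e/builds/moz2_slave/mozilla-central-win32-debug'
             '/tools/breakpad/win32/minidump_stackwalk.exe')
print(msys_to_windows(msys_path))
```

A native Windows Python process checking for the file would need the translated form, since it does not understand the /e/ prefix itself.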
(In reply to comment #27)
> Rail: can you try to reproduce this in staging tomorrow, please?
Yes, I can reproduce this. Alive test #5 times out. I changed the timeout value to 5400 and the test looks fine now. It takes more than 4000 sec on Windows now. :( Should I just adjust the timeout, or should we investigate the root cause?
(In reply to comment #31)
> Yes, I can reproduce this. Alive test #5 times out.
>
> I changed the timeout value to 5400 and the test looks fine now. It takes more
> than 4000 sec on Windows now. :(
>
> Should I just adjust the timeout, or should we investigate the root cause?
How long does AliveTest #5 take to complete on Linux and Mac? 4000 sec seems long, but if it's in the same ballpark as the others I'm less worried. Also, does AliveTest #5 take 4000 sec regardless of whether the symbols-path is specified, i.e. was there an unrelated code change that has caused AliveTest to start taking longer on Windows?
(In reply to comment #32)
> How long does AliveTest #5 take to complete on Linux and Mac?
Usually AliveTest #5 takes 1-2 minutes on Linux and Mac.
> Also, does AliveTest #5 take 4000 sec regardless of whether the symbols-path is
> specified, i.e. was there an unrelated code change that has caused AliveTest to
> start taking longer on Windows?
Not sure (can test). Additionally, it looks like the given parameter is not passed to the right place; see bug 549897.
(In reply to comment #32)
> Also, does AliveTest #5 take 4000 sec regardless of whether the symbols-path is
> specified, i.e. was there an unrelated code change that has caused AliveTest to
> start taking longer on Windows?
Just tested: elapsedTime=6214.438000. Looks like the long wait time isn't closely connected to the patch.
(In reply to comment #32)
> How long does AliveTest #5 take to complete on Linux and Mac? 4000 sec seems
> long, but if it's in the same ballpark as the others I'm less worried.
I filed bug 549561 because Windows is much, much slower than the other platforms.

(In reply to comment #34)
> Looks like the long wait time isn't closely connected to the patch.
Have to disagree here. All branches started timing out, and went green again, in close correlation to landing and backing out attachment 429475. That includes relatively inactive branches like 1.9.1 and 1.9.2.
(In reply to comment #35)
> Have to disagree here. All branches started timing out, and went green again,
> in close correlation to landing and backing out attachment 429475. That
> includes relatively inactive branches like 1.9.1 and 1.9.2.
Will test with 1.9.2 tomorrow.
SeaMonkey trunk needs about 1000s for alive test 5, but we're doing shared builds. Might the fact that Firefox does libxul builds instead make a difference here?
1.9.2 Win result:
python leaktest.py -- --trace-malloc malloc.log --shutdown-leaks sdleak.log --symbols-path ../dist/crashreporter-symbols
.....
program finished with exit code 0
elapsedTime=1309.438000
1.9.1 Win result:
python leaktest.py -- --trace-malloc malloc.log --shutdown-leaks sdleak.log --symbols-path ../dist/crashreporter-symbols
...
program finished with exit code 0
elapsedTime=774.328000
Are those results all from VMs or hardware? I.e., are we comparing apples with apples?
SeaMonkey result is from an ESX VM, but I guess the FF ones should be more interesting.
(In reply to comment #40)
> Are those results all from VMs or hardware ? ie are we comparing apples with
> apples
All of the builds were built on the same slave (win32-slave04).
Here are some timings from staging-master.

== Without the patch ==

mozilla-central
---------------
slave: win32-04, 6288 sec, 6303 sec
slave: win32-38, 4068 sec, 3976 sec

mozilla-1.9.2
-------------
slave: win32-04, 1281 sec
slave: win32-38, 907 sec, 898 sec

places
------
slave: win32-38, 4009 sec

== With the patch ==

mozilla-central
---------------
slave: win32-38, 4151 sec, 4357 sec, 4626 sec

mozilla-1.9.2
-------------
slave: win32-04, 1321 sec
slave: win32-38, 910 sec

places
------
slave: win32-04, 5841 sec

Looks like the difference between runs with and without the patch is not notable.
Guess we should try landing this again, and be ready with a timeout bump just in case we need it. Rail says he used 7200 for the runs in comment #43, thinking that it controlled the total step length rather than how long to wait after the last bit of output.
After some investigation playing with timeouts, I can get the following situation:
http://img697.imageshack.us/img697/5260/screenshotinm.png (browser window with URL of --symbols-path).
On my local machine I tried to run the same steps by hand and found that leaktest.py doesn't pass the --symbols-path parameter properly. leaktest.py passes all extra arguments (extraArgs) to Automation.runApp:
http://hg.mozilla.org/mozilla-central/file/5e9d5bbf7596/build/leaktest.py.in#l83
while Automation.runApp expects symbolsPath as a separate parameter:
http://hg.mozilla.org/mozilla-central/file/5e9d5bbf7596/build/automation.py.in#l727
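One way the mismatch described above could be handled is to pull --symbols-path out of the extra arguments before they are forwarded, so the remaining arguments go to the application and the path goes to a separate parameter. The sketch below is only an illustration of that idea with a hypothetical helper name; it is not the leaktest.py or automation.py code, and no fix was landed in this bug.

```python
import argparse


def split_symbols_path(extra_args):
    """Hypothetical helper: separate --symbols-path from other arguments."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument('--symbols-path', dest='symbols_path', default=None)
    # Unrecognized arguments are returned untouched, in their original order.
    known, remaining = parser.parse_known_args(extra_args)
    return known.symbols_path, remaining


extra_args = ['--trace-malloc', 'malloc.log',
              '--shutdown-leaks', 'sdleak.log',
              '--symbols-path', '../dist/crashreporter-symbols']
symbols_path, app_args = split_symbols_path(extra_args)
print(symbols_path)
print(app_args)
```

With the arguments split this way, the application would no longer receive the symbols directory as if it were a URL to open.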
Depends on: 549897
Priority: P2 → P4
Assignee: rail → nobody
Priority: P4 → P5
(In reply to comment #45)
> On my local machine I've tried to run the same steps by hand and found that
> leaktest.py doesn't pass --symbols-path parameter properly.
What's the takeaway here? Should we be reassigning to someone to fix leaktest.py?
Whiteboard: [unittest]
Yeah, it needs to be fixed for this to work.
Component: Release Engineering → General
Product: mozilla.org → Testing
QA Contact: release → general
Version: other → Trunk
Mass closing bugs with no activity in 2+ years. If this bug is important to you, please re-open.
Status: REOPENED → RESOLVED
Closed: 16 years ago → 8 years ago
Resolution: --- → WONTFIX