Closed Bug 1251893 Opened 5 years ago Closed 4 years ago

Leak stacks don't work on Taskcluster test runs

Categories

(Taskcluster :: Services, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED
mozilla48

People

(Reporter: RyanVM, Assigned: armenzg)

References

Details

Attachments

(1 file)

https://treeherder.mozilla.org/logviewer.html#?job_id=17319342&repo=try

 20:20:43     INFO -  allocation stack:

 20:20:43     INFO -  #00: ??? (/home/worker/workspace/build/application/firefox/libxul.so)

 20:20:43     INFO -  #01: ??? (/home/worker/workspace/build/application/firefox/libxul.so)

 20:20:43     INFO -  #02: ??? (/home/worker/workspace/build/application/firefox/libxul.so)

 20:20:43     INFO -  #03: ??? (/home/worker/workspace/build/application/firefox/libxul.so)

 20:20:43     INFO -  #04: ??? (/home/worker/workspace/build/application/firefox/libxul.so)

 20:20:43     INFO -  #05: ??? (/home/worker/workspace/build/application/firefox/libxul.so)

 20:20:43     INFO -  #06: ??? (/home/worker/workspace/build/application/firefox/libxul.so)

 20:20:43     INFO -  #07: ??? (/home/worker/workspace/build/application/firefox/libxul.so)

 20:20:43     INFO -  #08: ??? (/home/worker/workspace/build/application/firefox/libxul.so)

 20:20:43     INFO -  #09: ??? (/home/worker/workspace/build/application/firefox/libxul.so)

 20:20:43     INFO -  #10: start_thread (/build/eglibc-rrybNj/eglibc-2.15/nptl/pthread_create.c:308)

 20:20:43     INFO -  #11: clone (/build/eglibc-rrybNj/eglibc-2.15/misc/../sysdeps/unix/sysv/linux/x86_64/clone.S:114)

20:20:43 INFO - 14 @0x7f2e0bc67670 (0 references; 0 from COMPtrs)
To be clear, these stacks are typically generated when adding |"--setenv=XPCOM_MEM_LOG_CLASSES=<the leaking class you want a stack for>"| to the appropriate mozharness test config in the tree.

An example of this in action - https://hg.mozilla.org/try/rev/4689074c7b22 - which gives a log like https://treeherder.mozilla.org/logviewer.html#?job_id=17324551&repo=try#L74993 when working correctly.
Blocks: 1243024
No longer blocks: tc-linux-debug-tier1
Where can I learn about how these stacks work? or make them work?
Flags: needinfo?(ted)
I don't think the leak log bit is super relevant (despite the way the bug is filed). I think the bug is that the stack hasn't been symbolized, which is supposed to happen inside the Mochitest harness. It should be using the same code we use to get symbols for assertion stacks:
https://dxr.mozilla.org/mozilla-central/rev/5e0140b6d11821e0c2a2de25bc5431783f03380a/testing/mochitest/runtests.py#2517

The actual implementation of that is in mozrunner:
https://dxr.mozilla.org/mozilla-central/rev/5e0140b6d11821e0c2a2de25bc5431783f03380a/testing/mozbase/mozrunner/mozrunner/utils.py#246
Flags: needinfo?(ted)
This should be simpler to test by making a test cause an assertion. You can do that by calling a method on the nsIDebug interface, like:
https://dxr.mozilla.org/mozilla-central/rev/5e0140b6d11821e0c2a2de25bc5431783f03380a/testing/xpcshell/selftest.py#539
Moving to TaskCluster component...
Component: TaskCluster → Integration
Product: Testing → Taskcluster
Version: Version 3 → unspecified
:armenzg, this one issue seemed to have fallen through the cracks, can you pick this up and help get this solved?
Flags: needinfo?(armenzg)
Sure!
Assignee: nobody → armenzg
Flags: needinfo?(armenzg)
Let me try to rephrase this, in this push [1] we can see logs of jobs that should have a meaningful allocation stacks but don't [2]
In one type of jobs we download the symbols on deman while on the Buildbot ones we don't.
I see that get_stack_fixer() takes a symbolsPath. This could be related [3]

Should we move away from using ondemand for downloading symbols?
stackFixer() is called in the __init__ of the OutputHandler.

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=96b972ba996b

[2]
https://public-artifacts.taskcluster.net/HYqHzLTjR3yOkSR4NVWqkA/0/public/logs/live_backing.log
19:20:51     INFO -  allocation stack:
19:20:52     INFO -  #00: ??? (/home/worker/workspace/build/application/firefox/libxul.so)
19:20:52     INFO -  #01: ??? (/home/worker/workspace/build/application/firefox/libxul.so)

[3]
https://dxr.mozilla.org/mozilla-central/rev/5e0140b6d11821e0c2a2de25bc5431783f03380a/testing/mozbase/mozrunner/mozrunner/utils.py#266
I believe that buildbot always downloads symbols for debug tests ahead of time to handle cases like this.
(In reply to Chris AtLee [:catlee] from comment #9)
> I believe that buildbot always downloads symbols for debug tests ahead of
> time to handle cases like this.

Yes, this is correct. For non-debug builds having the symbol path be a URL works great because mozcrash will download the symbols on-demand to produce a stack for a crash. For debug builds, we will almost always hit an assertion which will print a stack like this and needs symbols to make it useful, so we should download them before running tests, unpack them, and set the symbol path to the local directory.
No longer blocks: 1243024
Do we need the change for every debug job? or only for mochitests?
Every debug test job.
Comment on attachment 8741378 [details]
MozReview Request: Bug 1251893 - Change TC Linux64 debug jobs to always download symbols. r=ted

Not ready for review. Sorry.
Attachment #8741378 - Flags: review?(ted)
Comment on attachment 8741378 [details]
MozReview Request: Bug 1251893 - Change TC Linux64 debug jobs to always download symbols. r=ted

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/46439/diff/1-2/
Attachment #8741378 - Flags: review?(ted)
hrmm my next review push raised the flag back.

Anyways, here's the push that shows that nothing breaks [1]
Here's a command showing --download-symbols=true instead of ondemand [2]

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=637e5197199933ea5ad2f931fcf13ba4e54199d3&filter-searchStr=tc&group_state=expanded&selectedJob=19497053
[2] /home/worker/workspace/mozharness/scripts/desktop_unittest.py --config-file mozharness/configs/unittests/linux_unittest.py --config-file mozharness/configs/remove_executables.py --no-read-buildbot-config --installer-url=https://queue.taskcluster.net/v1/task/M_n8wx4wR9aaBC_mEwj-Cw/artifacts/public/build/target.tar.bz2 --test-packages-url=https://queue.taskcluster.net/v1/task/M_n8wx4wR9aaBC_mEwj-Cw/artifacts/public/build/target.test_packages.json --download-symbols=true --gtest-suite=gtest
Comment on attachment 8741378 [details]
MozReview Request: Bug 1251893 - Change TC Linux64 debug jobs to always download symbols. r=ted

https://reviewboard.mozilla.org/r/46439/#review43875

I don't fully understand the current state of the in-tree task definitions, but it looks like these same tasks would be used for opt tests, is that right? If so then changing these to always download symbols isn't what we want. We want opt tests to continue to download ondemand.
At the moment there are no opt tests.
gbrown is likely to tackle this and we will need to split the yml files into opt/debug.
We can have opt jobs at that point use ondemand.

Can we get this landed and ask gbrown to make sure to use ondemand for opt jobs?
wrt to comment 18
Flags: needinfo?(ted)
(In reply to Armen Zambrano [:armenzg] - Engineering productivity from comment #18)
> At the moment there are no opt tests.
> gbrown is likely to tackle this and we will need to split the yml files into
> opt/debug.
> We can have opt jobs at that point use ondemand.
> 
> Can we get this landed and ask gbrown to make sure to use ondemand for opt
> jobs?

OK. I'm not wild about leaving footguns in the tree for the next person to deal with, but if this is something we've already scoped, maybe you can just put a comment in the task definitions to note that those should be `ondemand` for opt jobs?
Flags: needinfo?(ted)
Comment on attachment 8741378 [details]
MozReview Request: Bug 1251893 - Change TC Linux64 debug jobs to always download symbols. r=ted

https://reviewboard.mozilla.org/r/46439/#review44265

::: testing/taskcluster/tasks/tests/fx_desktop_unittest.yml:12
(Diff revision 2)
>        - bash
>        - /home/worker/bin/test.sh
>        - --no-read-buildbot-config
>        - --installer-url={{build_url}}
>        - --test-packages-url={{test_packages_url}}
> -      - --download-symbols=ondemand
> +      - --download-symbols=true

As I said, let's put a comment here to let gbrown (or whoever winds up doing the work) know that opt tests need to use `ondemand`, so that we don't wind up with the reverse of this problem in the near future.
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #21)
> > -      - --download-symbols=ondemand
> > +      - --download-symbols=true
> 
> As I said, let's put a comment here to let gbrown (or whoever winds up doing
> the work) know that opt tests need to use `ondemand`, so that we don't wind
> up with the reverse of this problem in the near future.

This is scoped in bug 1265773 by gbrown.
I will land since it is already going to be worked on.
I will add comments and land it tomorrow.
There are a few debug test jobs that still use ondemand symbols after this change:

  fx_linux64_firefox_ui_functional.yml
  fx_linux64_firefox_ui_functional_e10s.yml

and some that don't specify:

  fx_linux64_luciddream.yml
  fx_linux64_mochitest_vg.yml

Is that intentional?
(In reply to Geoff Brown [:gbrown] from comment #24)
> There are a few debug test jobs that still use ondemand symbols after this
> change:
> 
>   fx_linux64_firefox_ui_functional.yml
>   fx_linux64_firefox_ui_functional_e10s.yml
> 
These are whimboo's jobs and these don't run on developer's pushes but on nightly jobs.
They're not debug jobs.

> and some that don't specify:
> 
>   fx_linux64_luciddream.yml
>   fx_linux64_mochitest_vg.yml
> 
> Is that intentional?

Luciddream (not running) and valgrind are only for opt builds.
https://hg.mozilla.org/integration/mozilla-inbound/rev/cc6534601ad80c04c260493112e9a8d621c304ad
Bug 1251893 - Change TC Linux64 debug jobs to always download symbols. r=ted
https://hg.mozilla.org/mozilla-central/rev/cc6534601ad8
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla48
Component: Integration → Services
You need to log in before you can comment on or make changes to this bug.