Closed Bug 483111 Opened 12 years ago Closed 11 years ago

Get stacks for try server unittest crashes

Categories

(Release Engineering :: General, defect, P3)

x86
All
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: lsblakk, Assigned: nthomas)

References

Details

Attachments

(1 file)

This is a follow-up bug for bug 445611
Priority: -- → P3
Blocks: 427480
Assignee: lsblakk → nobody
Component: Release Engineering → Release Engineering: Future
Looking for someone to take this.
Status: ASSIGNED → NEW
Component: Release Engineering: Future → Release Engineering
I've got other things to work on in try server land so will have a look.
Assignee: nobody → nthomas
Status: NEW → ASSIGNED
Priority: P3 → P2
Nick, do you have any news on this?
Unfortunatly i have a persistent crash on tryservers and i can't debug it :(
We have a few options here. We can go the quick and dirty approach, like this
  http://pastebin.mozilla.org/674526
which is to adapt the exist factories to clone the repo with the stack walkers, build the symbols, and define MINIDUMP_STACKWALK in the environment. That gets us reasonably close to the unit tests on the Firefox tree, but leaves out leak thresholds (and possibly other stuff). There's a factory for each platform to modify, just linux is done in the pastebin.

I also tried to write a TryUnittestBuildFactory based on buildbotcustom.process.factory.UnittestBuildFactory. That wasn't passing the env changes on properly and I didn't get a chance to dig into it more. This approach is nicer because we use the one factory for all three platforms, and get leak thresholds, and generally sync up with the main unit testers. Still needs some work to make it function.

But in the meantime UnittestBuildFactory has been gutted of tests in favour of packaged unit tests. Given bug 493228 and bug 520227 it may better to take the quick and dirty fix now, and jump to
* optimized build also produces symbols and test packages, and maybe 'make check'
* unit test run is triggered by sendchange, using packaged unit test factory
* the time we save by not doing a unit compile could be spent doing a debug build, to get us leak checks on debug tests (overall a load increase)
* talos as usual

catlee and bhearsum, what say you ?
The quick and dirty approach is running on try staging now. Mak, if you push to try again and watch the "try" columns on the MozillaTest tree then you should hopefully get a stack.  Staging gets behind quite quickly because it only has 1 slave per platform, so there may be a delay in producing a build.
(In reply to comment #4)
> But in the meantime UnittestBuildFactory has been gutted of tests in favour of
> packaged unit tests. Given bug 493228 and bug 520227 it may better to take the
> quick and dirty fix now, and jump to
> * optimized build also produces symbols and test packages, and maybe 'make
> check'
> * unit test run is triggered by sendchange, using packaged unit test factory
> * the time we save by not doing a unit compile could be spent doing a debug
> build, to get us leak checks on debug tests (overall a load increase)
> * talos as usual

That sounds great to me.
Attached patch Get stacks, v1Splinter Review
For each platform, this pulls the build/tools repo for the stackwalker, calls 'make buildsymbols' to create objdir/dist/crashreporter-symbols, sets MINIDUMP_STACKWALK in a copy of the unittest environment, and uses that environment for all the test suites. Also fixes up some use of OBJDIR along the way. I've verified that the repo is pulled OK, buildsymbols succeeds, and MINIDUMP_STACKWALK is being set for all unit suite invocations (even though it's probably useless on make check).

Having a problem with resolving the symbols in the stacks though. bz suggested a null-pointer dereference as an easy way to guarantee a crash (by inserting "*(char*)0 = 'x';" at http://mxr.mozilla.org/mozilla-central/source/content/base/src/nsGenericElement.cpp#1782). After pushing that to try staging the resulting xpcshell crashes are correctly resolved (except on mac which misses the topmost call of the crashing thread). But all of ref/crashtest, and the mochi suites return output like this:
TEST-UNEXPECTED-FAIL | automation.py | application crashed (minidump found)
...
Crash reason:  SIGSEGV
Crash address: 0x404eb3f0

Thread 0 (crashed)
 0  libxul.so + 0x4db3f0
    eip = 0x404eb3f0   esp = 0xbf875854   ebp = 0xbf87585c   ebx = 0x40e37d5c
    esi = 0x42848c60   edi = 0x41a7b840   eax = 0x42848c70   ecx = 0x41a7b840
    edx = 0x00200001   efl = 0x00010202
 1  libxul.so + 0x5a2b0d
    eip = 0x405b2b0e   esp = 0xbf875864   ebp = 0xbf87587c
 2  libxul.so + 0x5a2c04
    eip = 0x405b2c05   esp = 0xbf875884   ebp = 0xbf87589c
....

There's no complaint from build/automationutils.py::checkForCrashes about undefined SYMBOLS_PATH. I'm going to try not eating stderr when calling the stack walker to try to get some clues.
There was (is?) a bug in the ref/crashtest harness where the symbols path wasn't being converted to an absolute path to be passed along to the stack walker, which results in failure since the harness also changes its own working directory, invalidating any relative directory references.

Ted, would this explain the behaviour above?  One way to check if you can run the ref/crashtest manually is to pass an absolute path to the symbols directory to the test harness.
There was a bug, but I believe I fixed all instances of it. It's possible there's a related bug, or I missed a spot.
For linux reftest we call 'make reftest'
 in /builds/slave/sendchange-linux-unittest/mozilla/objdir
On crashing, we're looking for symbols at
/builds/slave/sendchange-linux-unittest/dist/crashreporter-symbols/libxul.so/F8CC3D6D8F1AF9564377BDAD9B4B367B0/libxul.so.sym
which is missing a mozilla/objdir before the dist/.
For mochitest-plain it was
/builds/slave/dist/crashreporter-symbols/libxul.so/F8CC3D6D8F1AF9564377BDAD9B4B367B0/libxul.so.sym
so that's missing sendchange-linux-unittest/mozilla/objdir before dist. Same problem on mac and windows.
The default path is preprocessed into automation.py:
http://mxr.mozilla.org/mozilla-central/source/build/automation.py.in#98

From here:
http://mxr.mozilla.org/mozilla-central/source/build/automation-build.mk#24

TARGET_DEPTH comes from the makefiles that include automation-build.mk, like here for reftest:
http://mxr.mozilla.org/mozilla-central/source/layout/tools/reftest/Makefile.in#61

We do try to normalize paths into a full path here (to fix that bug catlee mentioned):
http://mxr.mozilla.org/mozilla-central/source/layout/tools/reftest/runreftest.py#127

So maybe we're mucking up the path somehow? Could you look in automation.py (it gets preprocessed to a variety of locations, but $objdir/_tests/reftest/ is the reftest copy) and see if that path looks good? (It should be a path relative to the directory it sits in.)
Of course, after typing that out and looking at the code, I realized the problem. When I landed a fix for bug 519194, ( http://hg.mozilla.org/mozilla-central/rev/0f0239688a76 ), I made us resolve the symbols path relative to the cwd. However, the SYMBOLS_PATH that gets preprocessed into automation.py is...a path relative to the *script dir*, so I fixed the packaged case and broke the in-tree case. Nice.

I filed bug 523330 on fixing that, so if your work here gets you stacks (even with bogus symbols), you're probably ok.
Comment on attachment 406985 [details] [diff] [review]
Get stacks, v1

My notes about this patch are in comment #7. Can't wait to rip all this duplication out.
Attachment #406985 - Attachment description: WIP → Get stacks, v1
Attachment #406985 - Flags: review?(bhearsum)
Comment on attachment 406985 [details] [diff] [review]
Get stacks, v1

>+    s(SetProperty, name="get toolsdir",
>+                   command=['bash', '-c', 'pwd'],
>+                   property='toolsdir',
>+                   workdir='tools',
>+    ),

Does this work here? We discovered that we have to use 'pwd -W' for packaged unittests, because we run python directly, which doesn't convert the MSYS style path this provides to a windows one. I _think_ this is ok, just want to be sure.

r=bhearsum if it works, or if 'pwd -W' is needed.
Attachment #406985 - Flags: review?(bhearsum) → review+
That's a good question. I'll do another run with my crash patch + the change from bug 523330.

Would wait to roll out this for 523330.
Depends on: 523330
the pwd is fine as is, I think because we're still calling the make target in this build.
Priority: P2 → P3
Whiteboard: waiting on blocker
prod and staging servers reconfig'd.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Whiteboard: waiting on blocker
Product: mozilla.org → Release Engineering
You need to log in before you can comment on or make changes to this bug.