Closed Bug 696376 Opened 13 years ago Closed 13 years ago

enable leaktest on macosx64

Categories

(Release Engineering :: General, defect, P2)

x86_64
Linux
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: espindola, Assigned: espindola)

References

Details

(Whiteboard: see comment 20 - we have resucitated bug 558496)

Attachments

(1 file, 16 obsolete files)

13.10 KB, patch
dbaron
: review+
cwiiis
: feedback+
Details | Diff | Splinter Review
Attached patch enable trace malloc on os x 64 (obsolete) — Splinter Review
Stack walking on trace malloc is being disable for os x too by bug 696281, so it is now sufficiently fast.
Attachment #568646 - Flags: review?(armenzg)
A discussion on IRC pointed out that the proposed patch would cause us to try to enable leaktest on every branch, but that would only work on branches where 696281 is fixed.

Armen will point me at some example code or documentation to change this patch to only enable the leaktest in some branches.
I have looked at the code and I believe it should be like this:

1) pass in init of MercurialBuildFactory another variable [1]
2) move the whole "if self.platform != 'macosx64':" clause as its own function foo
3) inside of "if self.leakTest"[3]
   - call the new function we created if the variable in 1) is defined
4) allow the factory to receive the new variable [4]
5) we should set the variable to True as most platforms have it on by default [5]

Now, for macosx64 we should set that variable to false so the function from 3  does not get called (status quo).
To do so we should add the new variable to the PLATFORM_VARS to make macosx64 to be disabled by default [6]. Notice that this is the buildbot-configs repo.

Notice that up until now we have not yet added what you are asking for.

We can add the exception for mozilla-central in here [7]:
+ BRANCHES['mozilla-central']['platforms']['macosx64']['your_variable'] = True

I am very sorry that you have to deal with the way that the code should have been written to handle branches properly.

I will be gone Wed-Fri next week and I won't be able to handle any reviews.

[1] http://hg.mozilla.org/build/buildbotcustom/file/tip/process/factory.py#l602
[2] http://hg.mozilla.org/build/buildbotcustom/file/tip/process/factory.py#l1166
[3] http://hg.mozilla.org/build/buildbotcustom/file/tip/process/factory.py#l833
[4] http://mxr.mozilla.org/build/source/buildbotcustom/misc.py#1088
[5] http://mxr.mozilla.org/build/source/buildbotcustom/misc.py#1011
[6] http://hg.mozilla.org/build/buildbot-configs/file/78962f070607/mozilla/config.py#l372
[7] http://hg.mozilla.org/build/buildbot-configs/file/78962f070607/mozilla/config.py#l1167
Comment on attachment 568646 [details] [diff] [review]
enable trace malloc on os x 64

It is the write code changes but we have to handle several branches as mentioned on the previous comment.
Attachment #568646 - Flags: review?(armenzg) → review-
Assignee: nobody → respindola
Attachment #568646 - Attachment is obsolete: true
Status: NEW → ASSIGNED
Attachment #568730 - Flags: review?(armenzg)
I tried setting up a bot for testing it, but:

$ grep -r cltbld *  | wc -l 
141

so it looks like i do need real buildbot machine for it to work.

would you mind testing the patch? If not, is it possible to set up a bot one of the machines I have loaned?
Thanks for pushing it this far.
I will be taking care of testing it.
I will have results for you on Monday.
Assignee: respindola → armenzg
cool. Assign back to me once you have the results and I will try to upload a patch with the fix.
Comment on attachment 568730 [details] [diff] [review]
refactor the code before changing it

You are missing the following:
leakEnv = self.env.copy()
leakEnv['MINIDUMP_STACKWALK'] = getPlatformMinidumpPath(self.platform)
leakEnv['MINIDUMP_SAVE_PATH'] = WithProperties('%(basedir:-)s/minidumps')

Besides that this patch is a no-op and good.

Are you still willing to tackle the comment 2? Or want us to finish up your request?
Not sure I follow, what is it that I am missing in

leakEnv = self.env.copy()
leakEnv['MINIDUMP_STACKWALK'] = getPlatformMinidumpPath(self.platform)
leakEnv['MINIDUMP_SAVE_PATH'] = WithProperties('%(basedir:-)s/minidumps')

? The patch creates a addLeakTestStepsCommon that starts with

+        if self.platform == 'macosx64':
+            return

so it should make the change proposed in comment 2 easier to implement (in a followup patch).
In addLeakStepsCommon on the first step this is called:
  env=leakEnv,
Inside of that function leakEnv is not declared.

You can also run ./test-masters.sh and you should see it fail.
I haven't been able to run test-masters.sh since it expects a local buildbot install, but this patch should be a nop and fix the issue you found.
Attachment #568730 - Attachment is obsolete: true
Attachment #568730 - Flags: review?(armenzg)
Attachment #569389 - Flags: review?(armenzg)
Comment on attachment 569389 [details] [diff] [review]
refactor the code before changing it

This is excellent.
I would like to add a comment wrt graphAndUpload

'''
graphAndUpload - This variable is used to prevent the try jobs from
                 doing talos graph posts and uploading logs.
'''

What do you think? If you are OK with it we can land it and be part of the next reconfig happening tonight.
Attachment #569389 - Flags: review?(armenzg) → review+
This uses the existing enable_leaktests variable. This is *not* a nop, but it doesn't look like the headers of both implementations of addLeakTestSteps do much useful testing before calling addLeakTestStepsCommon.
Attachment #569457 - Flags: review?(armenzg)
This patch depends on the change to the config to be pushed live. It removes the special case for macosx form the factory code itself.
Attachment #569460 - Flags: review?(armenzg)
This must only be committed once the previous patches are pushed and 696281 is fixed. I assume mozilla-inbound uses the same setup and mozilla-central, correct?
Attachment #569465 - Flags: review?(armenzg)
Priority: -- → P2
Comment on attachment 569389 [details] [diff] [review]
refactor the code before changing it

This no-op change made it into production with:
http://hg.mozilla.org/build/buildbotcustom/graph/1881 (5 days go)

I am now reviewing the next 3 patches.
All 3 patches have landed:
http://hg.mozilla.org/build/buildbot-configs/rev/0766b4b471c5
http://hg.mozilla.org/build/buildbotcustom/rev/0fc5a343a31d

This will be picked up on Thursday's scheduled reconfig.
Attachment #569457 - Flags: review?(armenzg)
Attachment #569457 - Flags: review+
Attachment #569457 - Flags: checked-in+
Attachment #569460 - Flags: review?(armenzg)
Attachment #569460 - Flags: review+
Attachment #569460 - Flags: checked-in+
Attachment #569465 - Flags: review?(armenzg)
Attachment #569465 - Flags: review+
Attachment #569465 - Flags: checked-in+
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #18)
> All 3 patches have landed:
> http://hg.mozilla.org/build/buildbot-configs/rev/0766b4b471c5
> http://hg.mozilla.org/build/buildbotcustom/rev/0fc5a343a31d
> 
> This will be picked up on Thursday's scheduled reconfig.

This actually went in on Tuesday as a reconfig happeneded soon after:
http://hg.mozilla.org/build/buildbotcustom/rev/4b2ebc7a398c
tl;dr; Enabling leak tests on 10.6 (without stack walking) for m-c happened sometime after 1pm PDT on Nov. 1st and since then we have resurrected an intermittent orange.
We have to figure out this intermittent orange before we can enable this on all branches and we enable 10.5 leak tests on 10.6 machines (as we will hit it).

espindola who do you think could help with this?
1) we can loan a machine
2) who knows bloatcycle

Since we enabled leak tests for 10.6 (without stack walking) for m-c we got one job failing with 404 [1] trying to grab malloc.log.
Then we have had 2 timeouts on alive_tests2 [3][4] since then (2 out of 7 jobs).

To view this you can visit:
https://tbpl.mozilla.org/?jobname=10.6.2%20mozilla-central%20leak

The only thing in common between those 2 oranges are that the two slaves run from the same master bm07-build1 (this has never been a root cause but we never know).
slaves moz2-darwin10-slave16 and moz2-darwin10-slave42.

IIUC we have resuscitated an intermittent orange! (bug 558496)
We don't have the slowness mentioned in bug 573631 because we have removed stack walking in the dependent bug 696281.

The difference in output between a bad run and a good run is this (I obviously cut the output):
  WARNING: NS_ENSURE_TRUE(mTextInputHandler) failed: file /builds/slave/m-cen-osx64-dbg/build/widget/src/cocoa/nsChildView.mm, line 4071
+ localhost - - [03/Nov/2011 06:52:04] "GET /bloatcycle.html HTTP/1.1" 200 -
+ WARNING: NS_ENSURE_TRUE(mMutable) failed: file /builds/slave/m-cen-osx64-dbg/build/netwerk/base/src/nsSimpleURI.cpp, line 292
-
-
- command timed out: 3600 seconds without output, attempting to kill
- process killed by signal 9
- program finished with exit code -1
- elapsedTime=3613.243564

[1] https://tbpl.mozilla.org/php/getParsedLog.php?id=7157618&tree=Firefox&full=1
[2]  http://stage.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-macosx64-debug//malloc.log
[3] https://tbpl.mozilla.org/php/getParsedLog.php?id=7161540&tree=Firefox&full=1
[4] https://tbpl.mozilla.org/php/getParsedLog.php?id=7184419&tree=Firefox&full=1
Whiteboard: see comment 20
Whiteboard: see comment 20 → see comment 20 - we have resucitated bug 558496
On 32 bit OS X we have to walk the stack to check if we must abort immediately. The 64 bit walker never sets that and we can avoid the walk. It is also possible that the extra walking is responsible for some of the recent timeouts (oranges).

A try (with two build runs) is at.
https://tbpl.mozilla.org/?tree=Try&rev=69714951bd04


An alternative approach that ports the check for pthread_cond_wait$UNIX2003 is at

https://tbpl.mozilla.org/?tree=Try&rev=6679261627e8

but I never found such a function on 10.7. I can give it a try on 10.6 if you like this better.
Assignee: armenzg → respindola
Attachment #571829 - Flags: review?(dbaron)
> but I never found such a function on 10.7

in fact, on 10.6:

$ nm /usr/lib/libc.dylib | grep 'pthread_cond_wait$UNIX2003'
0003010f T _pthread_cond_wait$UNIX2003

on 10.7 the same command returns nothing.
Comment on attachment 571829 [details] [diff] [review]
f stacks_enabled is false, only walk the stack on OS X 32 bits.

r=dbaron
Attachment #571829 - Flags: review?(dbaron) → review+
https://hg.mozilla.org/mozilla-central/rev/5bbc4fb5b277
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
We have had since the landing a couple of orange runs:
https://tbpl.mozilla.org/php/getParsedLog.php?id=7240621&tree=Firefox&full=1
https://tbpl.mozilla.org/php/getParsedLog.php?id=7247077&tree=Firefox&full=1

JavaScript strict warning: file:///builds/slave/m-cen-osx64-dbg/build/obj-firefox/dist/NightlyDebug.app/Contents/MacOS/components/nsSearchService.js, line 2567: reference to undefined property cache.directories
WARNING: Subdocument container has no frame: file /builds/slave/m-cen-osx64-dbg/build/layout/base/nsDocumentViewer.cpp, line 2403
++DOMWINDOW == 9 (0x129ff61a8) [serial = 9] [outer = 0x123ccf620]
WARNING: Subdocument container has no frame: file /builds/slave/m-cen-osx64-dbg/build/layout/base/nsDocumentViewer.cpp, line 2403
++DOMWINDOW == 10 (0x129ff9178) [serial = 10] [outer = 0x123ce7cb0]
++DOMWINDOW == 11 (0x12b103318) [serial = 11] [outer = 0x129f25ec0]

command timed out: 3600 seconds without output, attempting to kill
process killed by signal 9
program finished with exit code -1
elapsedTime=3602.667802
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Depends on: 700505
I was able to reproduce this in a new 10.6 installation.

The problem is that the dlsym call we use no longer works and I could not find any reasonable candidate to use on 10.6.

My first try was to see if we could get the address of a local symbol. Unfortunately, gdb reads the macho file directly and not an API. We could use the reader in breakpad, but linking that into the main binary is probably not worth it.

I tried looking at the dyld code, but the only thing that looks at local symbols is the dladdr implementation. My current idea is to use a dummy first stack walk with a callback that calls dladdr to capture the local symbol address. I should have that coded tomorrow.
I will write a patch that uses the same trick.
Attached patch find new_sem_from_pool (obsolete) — Splinter Review
This one fixes some of the problems found by the first try push:

https://tbpl.mozilla.org/?tree=Try&rev=47a94e1f6eba
Attachment #572734 - Attachment is obsolete: true
Attachment #572973 - Attachment is obsolete: true
Attachment #572973 - Flags: review?(dbaron)
Attachment #573234 - Flags: review?(dbaron)
Attached patch find new_sem_from_pool (obsolete) — Splinter Review
This version fixes out of process plugins too.
Attachment #573234 - Attachment is obsolete: true
Attachment #573234 - Flags: review?(dbaron)
Attachment #573349 - Flags: review?(dbaron)
I see several pushes to try but I want to remind you that we disabled leaktest on macosx64 everywhere in bug 700505. This is just in case you are trying to reproduce the intermittent orange in the try server.
No, I was able to reproduce the leaktest problems locally, now I am "just" trying to get all the cases working with the patch :-(
Attached patch find new_sem_from_pool (obsolete) — Splinter Review
This one is finally green:
https://tbpl.mozilla.org/?tree=Try&rev=39632808c53d
Attachment #573349 - Attachment is obsolete: true
Attachment #573349 - Flags: review?(dbaron)
Attachment #573774 - Flags: review?(dbaron)
Benjamin suggested using NS_LogInit. A try with that suggestion is at

https://tbpl.mozilla.org/?tree=Try&rev=3164e005b7c2
Attached patch find new_sem_from_pool (obsolete) — Splinter Review
This is the same version as the last try push, just added comments.
Attachment #573774 - Attachment is obsolete: true
Attachment #573774 - Flags: review?(dbaron)
Attachment #573864 - Flags: review?(benjamin)
Attachment #573864 - Flags: review?(benjamin) → review?(dbaron)
espindola, where are we with this? waiting on dbaron's review?
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #47)
> espindola, where are we with this? waiting on dbaron's review?

correct.
Comment on attachment 573864 [details] [diff] [review]
find new_sem_from_pool

>diff --git a/xpcom/base/nsStackWalk.h b/xpcom/base/nsStackWalk.h
>--- a/xpcom/base/nsStackWalk.h
>+++ b/xpcom/base/nsStackWalk.h
>@@ -128,4 +128,7 @@ NS_FormatCodeAddressDetails(void *aPC, c
> 
> PR_END_EXTERN_C
> 
>+void
>+StackWalkInitCriticalRanges(void);
>+
> #endif /* !defined(nsStackWalk_h_) */

If this needs to be a part of the public API, it should be in the extern "C", marked XPCOM_API(), and documented.  If not, it should be in a different header file.
Attachment #576743 - Attachment is obsolete: true
Attachment #576743 - Flags: review?(dbaron)
Attached patch find new_sem_from_pool (obsolete) — Splinter Review
https://tbpl.mozilla.org/?tree=Try&rev=47331881b800
Attachment #569389 - Attachment is obsolete: true
Attachment #569457 - Attachment is obsolete: true
Attachment #569460 - Attachment is obsolete: true
Attachment #569465 - Attachment is obsolete: true
Attachment #571829 - Attachment is obsolete: true
Attachment #576747 - Attachment is obsolete: true
Attachment #576747 - Flags: review?(dbaron)
Attachment #577390 - Flags: review?(dbaron)
Comment on attachment 577390 [details] [diff] [review]
find new_sem_from_pool

(commit message)
> The good news is that dladdr works with any symbol, not just exported
> ones. This patch changes the init to force a call to new_sem_from_pool
> and walks the stack to find it.

This would probably be better explained in a little more detail, e.g., by changing "force a call to ... find it." to "set up a malloc logger, force a call to new_sem_from_pool, and walk the stack from inside the malloc logger to find it." or something like that.


It's possible this would now work without the loop in FindFunctionAddresses if you just stored the pc (rather than the function's base address) in stack_callback and made a range with pc and pc+1.  But it's fine as-is too.

>+void
>+StackWalkInitCriticalRanges()
>+{
>+  if(gCriticalRange.mInit)

Space between if and (.

>+StackWalkInitCriticalRanges() {
>+  gCriticalRange.mInit = true;
>+}

{ on its own line


r=dbaron with that; sorry for the delay
Attachment #577390 - Flags: review?(dbaron) → review+
I really liked the suggestion of keeping only one address, so I pushed it to try in

https://tbpl.mozilla.org/?tree=Try&rev=de8f8a3a7a61

I will also try locally on 10.5 and 10.6. Assuming all that passes I will push to m-i.
https://hg.mozilla.org/mozilla-central/rev/e81b024e6ec9
Status: REOPENED → RESOLVED
Closed: 13 years ago13 years ago
Resolution: --- → FIXED
This breaks building on android due to _URC_NORMAL_STOP not being in unwind.h.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Attachment #577390 - Attachment is obsolete: true
Attachment #578654 - Flags: review?(dbaron)
Attachment #578654 - Flags: feedback?(chrislord.net)
Comment on attachment 578654 [details] [diff] [review]
like the previous, but s/_URC_NORMAL_STOP/1/

Review of attachment 578654 [details] [diff] [review]:
-----------------------------------------------------------------

I've not tested it, but the only problem seemed to be the undefined enum value, so looks good to me.
Attachment #578654 - Flags: feedback?(chrislord.net) → feedback+
Comment on attachment 578654 [details] [diff] [review]
like the previous, but s/_URC_NORMAL_STOP/1/

Changing your diff settings from patch to patch makes re-review a pain.

>Currently we use dlsym on pthread_cond_wait$UNIX2003 to find a
>function that indicates that new_sem_from_pool is on the stack. This
>works on 10.5, but on 10.6 I could not find a single reliable
>indicator that would work with dlsym.
>
>The good news is that dladdr works with any symbol, not just exported
>ones. To find the address of new_sem_from_pool, we set up a malloc logger
>and force a call to new_sem_from_pool. From the logger callback we walk
>the stack trying dladdr on every address.
>
>To for a call to new_sem_from_pool, the initialization code has to be the

To force?

>first to use semaphores, so it is now run from NS_LogInit.
>
>This works on 10.6 and 10.5 (but we have to look for
>"pthread_cond_wait$UNIX2003"). In 10.7 the call to malloc is gone, so we don't

wrap long line?

>have to worry about critical addresses on it anymore.
>
>+  if(gCriticalAddress.mInit)

space after "if" (per comment 53)


>+    if (IsCriticalAddress(pc)) {
>+        printf("Aborting stack trace, PC is critical\n");
>+        /* We just want to stop the walk, so any error code will do.
>+           Using _URC_NORMAL_STOP would probably be the most accurate,
>+           but it is not defined on Android for ARM. */
>+        return 1;

1 happens to be _URC_FOREIGN_EXCEPTION_CAUGHT, which is defined on Android.  If you're going to use that, at least give the name in a comment.

In any case, the ones I see on Android/ARM are:

      _URC_OK = 0,       /* operation completed successfully */
      _URC_FOREIGN_EXCEPTION_CAUGHT = 1,
      _URC_END_OF_STACK = 5,
      _URC_HANDLER_FOUND = 6,
      _URC_INSTALL_CONTEXT = 7,
      _URC_CONTINUE_UNWIND = 8,
      _URC_FAILURE = 9   /* unspecified failure of some kind */

and the ones I see in unwind.h on Android/x86 and Linux/x86 are:

  _URC_NO_REASON = 0,
  _URC_FOREIGN_EXCEPTION_CAUGHT = 1,
  _URC_FATAL_PHASE2_ERROR = 2,
  _URC_FATAL_PHASE1_ERROR = 3,
  _URC_NORMAL_STOP = 4,
  _URC_END_OF_STACK = 5,
  _URC_HANDLER_FOUND = 6,
  _URC_INSTALL_CONTEXT = 7,
  _URC_CONTINUE_UNWIND = 8

r=dbaron
Attachment #578654 - Flags: review?(dbaron) → review+
https://hg.mozilla.org/mozilla-central/rev/c9a74f4ee1f7
Status: REOPENED → RESOLVED
Closed: 13 years ago13 years ago
Resolution: --- → FIXED
Depends on: 707648
Depends on: 711658
Product: mozilla.org → Release Engineering
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: