Last Comment Bug 658840 - Forked glxinfo process leaks strings
: Forked glxinfo process leaks strings
Status: RESOLVED FIXED
: regression, valgrind
Product: Core
Classification: Components
Component: Graphics (show other bugs)
: Trunk
: x86_64 Linux
: -- normal (vote)
: mozilla7
Assigned To: Benoit Jacob [:bjacob] (mostly away)
:
Mentors:
Depends on:
Blocks: 639842
  Show dependency treegraph
 
Reported: 2011-05-21 20:02 PDT by Jesse Ruderman
Modified: 2011-09-19 09:14 PDT (History)
10 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
valgrind output (firefox debug) (6.13 KB, text/plain)
2011-05-21 20:02 PDT, Jesse Ruderman
no flags Details
return from main() in the glxtest process (2.17 KB, patch)
2011-05-26 07:39 PDT, Benoit Jacob [:bjacob] (mostly away)
benjamin: review+
karlt: checkin+
Details | Diff | Splinter Review
use abort() for errors in glxtest (1.08 KB, patch)
2011-09-18 21:26 PDT, Benoit Jacob [:bjacob] (mostly away)
karlt: review+
Details | Diff | Splinter Review

Description Jesse Ruderman 2011-05-21 20:02:41 PDT
Created attachment 534267 [details]
valgrind output (firefox debug)

Workaround #1: pass --child-silent-after-fork=yes to Valgrind, and ignore Valgrind's exit code.

Workaround #2: add these leaks to a Valgrind suppression file. This 

(How are we dealing with this on Tinderbox?)

(How did our other leak detection tools survive the insanity of having every Firefox startup correspond to two shutdowns?)
Comment 1 Jesse Ruderman 2011-05-21 20:03:21 PDT
> Workaround #2: add these leaks to a Valgrind suppression file.

This doesn't work, because two of the leaks have busted stacks (???)
Comment 2 Benoit Jacob [:bjacob] (mostly away) 2011-05-21 20:53:35 PDT
Can you explain me more about this link? I don't understand it.

Is this a leak whereby the forked glxinfo process stays around as a zombie? I'm afraid this problem currently happens, as the zombie only gets killed by GfxInfo::Init() and I'm not sure if that is always getting called (need do check).

Or are you talking about a different kind of leak?
Comment 3 Jesse Ruderman 2011-05-21 20:56:38 PDT
This is a leak where the forked glxinfo process exits, but doesn't free some of its memory before exiting.  Some of this memory is associated with strings.
Comment 4 David Baron :dbaron: ⌚️UTC+2 (mostly busy through August 4; review requests must explain patch) 2011-05-22 07:04:07 PDT
We fork() but don't exec(), so the forked process doesn't have separate memory mappings.  This could be the tools getting confused.  (Or it could be us doing things we're not supposed to.)
Comment 5 Karl Tomlinson (:karlt) 2011-05-22 13:58:49 PDT
Returning to let main finish (instead of calling exit()) should mean fewer leaks.

(In reply to comment #4)
> We fork() but don't exec(),

AFAIK we still have only one thread at this point, so I think that's OK.
Comment 6 Benjamin Smedberg AWAY UNTIL 2-AUG-2016 [:bsmedberg] 2011-05-23 06:28:27 PDT
I don't think either of the workarounds are correct, because:

#1 would ignore all children, not just this one we don't care about
#2 would ignore these leaks in the main process (where they are a bug) and not just the glxtest forked process

I think that karlt's solution is probably fine: make fire_glxtest_process return a value instead of exiting, and "return" out of main if we're the forked process.

But I'm also a little worried that we're going to confuse the refcount/tracemalloc logs with the forked process, since they have duplicated file handles and are writing to the same log :-(
Comment 7 Jeff Muizelaar [:jrmuizel] 2011-05-24 13:31:25 PDT
What if we came up with a way of telling valgrind "I'm exiting and I know I'm going to leak, ignore them". We could perhaps do this with a valgrind client request from the child process.
Comment 8 Karl Tomlinson (:karlt) 2011-05-24 17:28:03 PDT
(In reply to comment #7)
We'd also need to tell our other leak tools to ignore leaks (but perhaps we have to do something there anyway to deal with the duplicate file handles issue).
Comment 9 Benoit Jacob [:bjacob] (mostly away) 2011-05-26 07:39:22 PDT
Created attachment 535334 [details] [diff] [review]
return from main() in the glxtest process

Like this?
Comment 10 Benoit Jacob [:bjacob] (mostly away) 2011-05-26 07:58:46 PDT
Note that even with this patch, we still exit() in the error cases, see fatal_error() and x_error_handler(). Should I fix that too?
Comment 11 Benoit Jacob [:bjacob] (mostly away) 2011-06-10 14:00:12 PDT
http://hg.mozilla.org/mozilla-central/rev/84636464bba8

Assuming this is all what we want to fix here, otherwise reopen.
Comment 12 David Baron :dbaron: ⌚️UTC+2 (mostly busy through August 4; review requests must explain patch) 2011-06-10 14:09:30 PDT
Can't this approach cause crashes if some of the stuff cleaned up on the exit of the child process is still needed by the parent process (since they're still sharing memory space, since we've never called exec())?
Comment 13 Benoit Jacob [:bjacob] (mostly away) 2011-06-10 14:13:49 PDT
(In reply to comment #12)
> (since they're still sharing memory space

Are they? I didn't know that. That's beyond my understanding of things. Other people please advise!
Comment 14 Karl Tomlinson (:karlt) 2011-06-10 15:53:32 PDT
This is a fork (not clone) so in general they don't share memory space.
"Under Linux, fork() is implemented using copy-on-write pages."

The child has copies of file descriptors for the same files.
Comment 15 David Baron :dbaron: ⌚️UTC+2 (mostly busy through August 4; review requests must explain patch) 2011-06-10 16:02:28 PDT
ah, ok.
Comment 16 Mihaela Velimiroviciu (:mihaelav) 2011-08-23 05:29:40 PDT
Can anyone please provide a test case or some guidelines that I can use to verify this fix? Do I need to create any specific environment to test the fix?

Thanks!
Comment 17 Benoit Jacob [:bjacob] (mostly away) 2011-08-23 08:08:43 PDT
@Mihaela: the bug was detected in valgrind. So the steps-to-reproduce were basically:
1. get a Linux optimized build with debug symbols (might also be possible on Mac, but not on Windows)
2. valgrind ./firefox

Notice that valgrind always reports some leaks that have to be ignored by adding them to a 'valgrind suppressions file'. See:
https://developer.mozilla.org/en/Debugging_Mozilla_with_Valgrind

But frankly, I wouldn't worry too much about this bug. It never was a very big bug and the fix almost certainly works.
Comment 18 Karl Tomlinson (:karlt) 2011-08-23 17:43:26 PDT
Not directly as reported in comment 0, but, in a debug build, glxinfo dumps strings stats on exit (if it doesn't crash).  This happens early while the browser is still running.  Something like:

 => mAllocCount:              5
 => mReallocCount:            3
 => mFreeCount:               5
 => mShareCount:              1
 => mAdoptCount:              0
 => mAdoptFreeCount:          0

mFreeCount should match mAllocCount, as above.
Comment 19 neil@parkwaycc.co.uk 2011-09-18 03:47:22 PDT
(In reply to Benoit Jacob from comment #10)
> Note that even with this patch, we still exit() in the error cases, see
> fatal_error() and x_error_handler(). Should I fix that too?
That would be appreciated, but now at least I know why I seem to get leaks.

(In reply to David Baron from comment #12)
> Can't this approach cause crashes if some of the stuff cleaned up on the
> exit of the child process is still needed by the parent process (since
> they're still sharing memory space, since we've never called exec())?
I thought that applied to vfork() rather than fork()?
Comment 20 Karl Tomlinson (:karlt) 2011-09-18 16:05:33 PDT
(In reply to neil@parkwaycc.co.uk from comment #19)
> (In reply to Benoit Jacob from comment #10)
> > Note that even with this patch, we still exit() in the error cases, see
> > fatal_error() and x_error_handler(). Should I fix that too?
> That would be appreciated, but now at least I know why I seem to get leaks.

Probably even better would be if the leak code didn't report leaks on error exit codes.  on_exit() provides this info, but was dropped between SunOS 4 and 5, so atexit is recommended instead.  I don't know how to get the exit code with atexit().

I guess glxtest could abort() instead of exit().  (ISTR we now make abort perform a SIGSEGV-generating op, though.  That could cause surprises but I guess we already cope with that.)
Comment 21 Benoit Jacob [:bjacob] (mostly away) 2011-09-18 21:26:12 PDT
(In reply to Karl Tomlinson (:karlt) from comment #20)
> (In reply to neil@parkwaycc.co.uk from comment #19)
> > (In reply to Benoit Jacob from comment #10)
> > > Note that even with this patch, we still exit() in the error cases, see
> > > fatal_error() and x_error_handler(). Should I fix that too?
> > That would be appreciated, but now at least I know why I seem to get leaks.
> 
> Probably even better would be if the leak code didn't report leaks on error
> exit codes.  on_exit() provides this info, but was dropped between SunOS 4
> and 5, so atexit is recommended instead.  I don't know how to get the exit
> code with atexit().

I really don't want to use atexit() in library (here, libxul) code. (FWIW, since any code is susceptible of moving to a library in the future, I never want to use atexit).

> 
> I guess glxtest could abort() instead of exit().  (ISTR we now make abort
> perform a SIGSEGV-generating op, though.  That could cause surprises but I
> guess we already cope with that.)

Great idea!
First I reproduced the string leak here on linux, by making the glxtest process trigger an artificial error. Then I replaced exit() by abort() in the fatal_error and X error handler functions, and XPCOM_MEM_LEAK_LOG didn't report any leak anymore. Patch coming.
Comment 22 Benoit Jacob [:bjacob] (mostly away) 2011-09-18 21:26:53 PDT
Created attachment 560845 [details] [diff] [review]
use abort() for errors in glxtest
Comment 23 Karl Tomlinson (:karlt) 2011-09-18 22:46:20 PDT
(In reply to Benoit Jacob [:bjacob] from comment #21)
> (In reply to Karl Tomlinson (:karlt) from comment #20)
> > Probably even better would be if the leak code didn't report leaks on error
> > exit codes.  on_exit() provides this info, but was dropped between SunOS 4
> > and 5, so atexit is recommended instead.  I don't know how to get the exit
> > code with atexit().
> 
> I really don't want to use atexit() in library (here, libxul) code. (FWIW,
> since any code is susceptible of moving to a library in the future, I never
> want to use atexit).

Yes, best avoid atexit wherever possible.
My comment was based on the assumption that the leak reporting code used atexit, but it looks like that is either no longer the case, or never was:
http://hg.mozilla.org/mozilla-central/rev/f4ddad2c0eb7
http://hg.mozilla.org/mozilla-central/annotate/5319b0100025/xpcom/string/src/nsSubstring.cpp#l105
Comment 24 neil@parkwaycc.co.uk 2011-09-19 08:38:27 PDT
Will abort() dump core or trigger crashreporter, or is it too early?
Comment 25 Benoit Jacob [:bjacob] (mostly away) 2011-09-19 09:14:22 PDT
It doesn't, as the glxtest process is forked specifically before the point where we enable the crashreporter. See toolkit/xre/nsAppRunner.cpp.

This was necessary as the whole point of the glxtest process is that it must be able to crash without the user noticing.

Note You need to log in before you can comment on or make changes to this bug.