Closed Bug 473629 Opened 13 years ago Closed 12 years ago

crash on exit in glibc memalign with jemalloc statically linked

Categories

(Toolkit :: Startup and Profile System, defect)

x86
Linux
defect
Not set
normal

Tracking

()

RESOLVED DUPLICATE of bug 555894

People

(Reporter: fta+bugzilla, Unassigned)

References

Details

(Keywords: crash)

Attachments

(1 file, 1 obsolete file)

I'm getting a crash on exit using trunk or 1.9.1.
It's both with the main window (with plenty of tabs) and when firefox is called from another app to open a new link (in a tab).

It's 100% reproducible here.
Here is the not very helpful trace:

$ firefox -g http://www.mozilla.org/
/usr/bin/gdb /usr/lib/firefox-3.2a1pre/firefox -x /tmp/mozargs.GkMszR
GNU gdb 6.8-debian
...
This GDB was configured as "i486-linux-gnu"...
(no debugging symbols found)
(gdb) r
Starting program: /usr/lib/firefox-3.2a1pre/firefox http://www.mozilla.org/
(no debugging symbols found)
...
(no debugging symbols found)
(no debugging symbols found)
[New Thread 0xb7ce26d0 (LWP 5863)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xb7ce26d0 (LWP 5863)]
0xb7d5d3a1 in ptmalloc_init () from /lib/tls/i686/cmov/libc.so.6
(gdb) bt
#0  0xb7d5d3a1 in ptmalloc_init () from /lib/tls/i686/cmov/libc.so.6
#1  0xb7d60cc3 in memalign_hook_ini () from /lib/tls/i686/cmov/libc.so.6
#2  0xb7d60ae2 in memalign () from /lib/tls/i686/cmov/libc.so.6
#3  0xb8083d82 in ?? () from /lib/ld-linux.so.2
#4  0xb808422b in ___tls_get_addr () from /lib/ld-linux.so.2
#5  0xb5e40962 in ?? () from /lib/libselinux.so.1
#6  0xb5e39b7e in ?? () from /lib/libselinux.so.1
#7  0xb5e321d8 in ?? () from /lib/libselinux.so.1
#8  0xb5e42340 in _fini () from /lib/libselinux.so.1
#9  0xb8081a53 in ?? () from /lib/ld-linux.so.2
#10 0xb7d1bb89 in exit () from /lib/tls/i686/cmov/libc.so.6
#11 0xb7d0377d in __libc_start_main () from /lib/tls/i686/cmov/libc.so.6
#12 0x080496f1 in ?? ()

could it be a jemalloc issue?
Better debugging symbols would be nice, but I'll toss this to jemalloc to see if they have any ideas.
Component: General → jemalloc
Product: Firefox → Core
QA Contact: general → jemalloc
Keywords: crash
I have most possible debugging symbols installed on my Ubuntu/Jaunty box, including libc, firefox but apparently, i still miss some. /me looking.
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xb7c906d0 (LWP 1554)]
0xb7d0b3a1 in ptmalloc_init () from /lib/tls/i686/cmov/libc.so.6
(gdb) bt
#0  0xb7d0b3a1 in ptmalloc_init () from /lib/tls/i686/cmov/libc.so.6
#1  0xb7d0ecc3 in memalign_hook_ini () from /lib/tls/i686/cmov/libc.so.6
#2  0xb7d0eae2 in memalign () from /lib/tls/i686/cmov/libc.so.6
#3  0xb8031d82 in tls_get_addr_tail () from /lib/ld-linux.so.2
#4  0xb803222b in ___tls_get_addr_internal () from /lib/ld-linux.so.2
#5  0xb5dee962 in fini_context_translations () at setrans_client.c:211
#6  0xb5de7b7e in fini_lib () at init.c:127
#7  0xb5de01d8 in __do_global_dtors_aux () from /lib/libselinux.so.1
#8  0xb5df0340 in _fini () from /lib/libselinux.so.1
#9  0xb802fa53 in _dl_fini () from /lib/ld-linux.so.2
#10 0xb7cc9b89 in exit () from /lib/tls/i686/cmov/libc.so.6
#11 0xb7cb177d in __libc_start_main () from /lib/tls/i686/cmov/libc.so.6
#12 0x080496f1 in ?? ()
(gdb)
For some reason the TLS code is calling into glibc's malloc implementation (via memalign()).  Since ptmalloc_init() is being called as a result, we can tell that glibc's internal malloc data structures have never been initialized.  I don't know why this is causing a crash, but it isn't even touching jemalloc.  Maybe there's some strange interaction with SELinux...
Summary: crash on exit → crash on exit in glibc memalign with jemalloc statically linked
Attached patch leaking env in nsAppRunner.cpp (obsolete) — Splinter Review
leaking env in nsAppRunner.cpp seems to fix this.
previous wasn't a legal diff
Attachment #383038 - Attachment is obsolete: true
so um, are you saying it isn't safe to call PR_SetEnv w/ a null pointer?
I think it is the strdup call rather than the null pointer
check that fixes the crash.  The string you pass to PR_SetEnv,
which simply passes it to putenv, needs to stay around after
PR_SetEnv returns because the string "shall become part of the
environment":
http://www.opengroup.org/onlinepubs/009695399/functions/putenv.html

The putenv man page also says:
  A potential error is to call putenv() with an automatic
  variable as the argument, then return from the calling
  function while string is still part of the environment.
Comment on attachment 383040 [details] [diff] [review]
 leaking env in nsAppRunner.cpp (non garbage patch)

ok, here's a free r-,
   // We intentionally leak |expr| here since it is required by PR_SetEnv.
the code is already trying to leak, but it's trying to leak an nspr pointer. If you're going to do this, just fix SaveWordToEnv to do:

SaveWordToEnv(const char *name, const nsACString & word)
{
  char *expr = PR_smprintf("%s=%s", name, PromiseFlatCString(word).get());
  if (!expr)
      return;

  /* this copy is intentionally leaked per setenv requirements */
  PR_SetEnv (strdup (env));

  PR_smprintf_free(env);
}
Attachment #383040 - Flags: review-
and note that there shouldn't be a space between foo and () if foo is a function.
(In reply to comment #10)
> (From update of attachment 383040 [details] [diff] [review])
> ok, here's a free r-,
>    // We intentionally leak |expr| here since it is required by PR_SetEnv.
> the code is already trying to leak, but it's trying to leak an nspr pointer. If
> you're going to do this, just fix SaveWordToEnv to do:
> 

Yeah. SaveWordToEnv is already properly leaked. But not all places are using that func to set their env. This patch ensures that also the direct invocations to PR_SetEnv are getting leaked.
well, please find them and identify the right one.
nsAppRunner.cpp is linked in libxul.so so I wonder whether it has been unloaded at this stage so that it's static strings inserted into the environment are no longer valid pointers.

Making sure the static strings are copied to the heap would fix this.
Setting environ to NULL on return might also be a hacky way to fix this.

The stack here highlights another issue too though:

readelf --reloc /lib32/ld-linux.so.2 includes this

Relocation section '.rel.plt' at offset 0x8d4 contains 5 entries:
 Offset     Info    Type            Sym.Value  Sym. Name
0001afd0  00000b07 R_386_JUMP_SLOT   00014b40   __libc_memalign
0001afd4  00000407 R_386_JUMP_SLOT   00014c60   malloc
0001afd8  00001007 R_386_JUMP_SLOT   00014d30   calloc
0001afdc  00000807 R_386_JUMP_SLOT   00014c90   realloc
0001afe0  00000507 R_386_JUMP_SLOT   00014ae0   free

I don't know why only memalign calls use the libc-specific symbol name but dl-tls.c looks like it calls free on pointers from __libc_malloc.
That doesn't seem to explain the crash in ptmalloc_init but is probably something that should be fixed.

Attachment 386469 [details] [diff] would resolve the inconsistency here by making __libc_memalign use jemalloc's memalign.  That would also avoid the call to memalign_hook_ini, avoiding this crash.
Depends on: 493541
Component: jemalloc → Startup and Profile System
Product: Core → Toolkit
QA Contact: jemalloc → startup
(In reply to comment #14)
> nsAppRunner.cpp is linked in libxul.so so I wonder whether it has been unloaded
> at this stage so that it's static strings inserted into the environment are no
> longer valid pointers.
> 

right. This matches my evaluation. hence almost all PR_SetEnv need to be leaked in this file which is why i used this macro. We can move it further down or go through the list and explicitly strcpy all but one PR_Set env things.
(In reply to comment #14)
> 
> Attachment 386469 [details] [diff] would resolve the inconsistency here by making
> __libc_memalign use jemalloc's memalign.  That would also avoid the call to
> memalign_hook_ini, avoiding this crash.

I can check your patch as well. however, we also had other problems because of environ getting trashed after unloading libxul.so (namely, the libatk bridge made things crash when it tried to access environ in exit_hook); so leaking is probably still needed.
(In reply to comment #16)
> however, we also had other problems because of environ getting trashed after
> unloading libxul.so (namely, the libatk bridge made things crash when it
> tried to access environ in exit_hook); so leaking is probably still needed.

Yes, sounds like leaking would still be needed.
Here too:

http://hg.mozilla.org/mozilla-central/file/0471fa063e43/accessible/src/atk/nsAppRootAccessible.cpp#l567

I'm not a toolkit peer, but I think it would be better to change only the
instances that are not already leaked to use a function name that is not
PR_SetEnv.  SaveWordToEnv() is probably fine for this.

Maybe the attempts to clear environment values could check that they are set
before needing to leak:

http://hg.mozilla.org/mozilla-central/file/f46e6aee1335/toolkit/xre/nsAppRunner.cpp#l3303
Just my notes on various environment APIs:

glibc's setenv (which copies) will reuse strings (from previous setenv calls)
if the value matches an old value for a variable.  I don't know why it never
modifies or frees old strings.  Perhaps this is to make the result of getenv
persistant but that is not guaranteed anyway because of putenv.

POSIX Programmer's Manual page for putenv says "The space used by string is no
longer used once a new string which defines name is passed to putenv()."

PR_SetEnv() docs say:
"The caller must ensure that the string passed to PR_SetEnv() is persistent."

nsIEnvironment deletes strings passed to PR_SetEnv once the value has been
replaced with another.
http://mxr.mozilla.org/mozilla-central/source/xpcom/threads/nsIEnvironment.idl

IEEE Std 1003.1, 2003 Edition POSIX does not have const in the prototype
definition for putenv, but I don't know why this is because there doesn't seem
to be any reason why putenv should modify the string:

int putenv(char *string);

/home/karl/moz/dev/nsprpub/pr/src/misc/prenv.c: In function 'PR_SetEnv':
/home/karl/moz/dev/nsprpub/pr/src/misc/prenv.c:96: warning: passing argument 1 of 'putenv' discards qualifiers from pointer target type
(In reply to comment #17)
> (In reply to comment #16)
> > however, we also had other problems because of environ getting trashed after
> > unloading libxul.so (namely, the libatk bridge made things crash when it
> > tried to access environ in exit_hook); so leaking is probably still needed.
> 
> Yes, sounds like leaking would still be needed.

Actually, on further thought, the atk crash is probably caused by something else.
If dependent libraries were unloaded before atexit hooks run then I expect bigger problems.

It's probably worth checking our theory here that libxul has been unloaded even for the crash in comment 3.  "info shared" at the time of the crash will show the loaded shared libraries.
atk is generally broken, there are bugs, you can find them.

putenv isn't const because on some platforms you can do this:

char x[]="test=hello cruel world";
putenv(x);
char *evil = getenv("test");
evil[11]='d';
evil[13]='o';
evil[14]='l';
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 555894
You need to log in before you can comment on or make changes to this bug.