Bugzilla

Assignee

Updated

•

11 years ago

Blocks: valgrind-tbpl-bugs

Boris Zbarsky [:bzbarsky]

Assignee

Comment 1

•

11 years ago

> Just doing a browser startup and shutdown, I got 382 Allocate calls and only
> 204 Destroy calls.

I botched that -- I instrumented the Trace function instead of the Destroy function.

With that fixed, it's even worse!

- Startup/shutdown: 204 allocs, 3 destroys
- Browse for a couple of minutes: 593 allocs, 195 destroys

Comment 2

•

11 years ago

So I just tried to reproduce this by the simple expedient of adding printfs in AllocateProtoAndIfaceCache and DestroyProtoAndIfaceCache.  I also added a printf in mozilla::dom::CreateGlobal to catch the instances of bug 940071.

I then started the browser with a clean profile and quit.  Note that this was a debug build, not an opt one, so we actually do cleanup at shutdown.

I see 182 calls to AllocateProtoAndIfaceCache and 180 calls to DestroyProtoAndIfaceCache.  I also see 2 calls to CreateGlobal, so the numbers all add up...

So I'm not sure what's going on in comment 0, unless that was in an opt build.  :(

Gary Kwong [:gkw] [:nth10sd] (NOT official MoCo now)

Comment 3

•

11 years ago

> So I'm not sure what's going on in comment 0, unless that was in an opt
> build.  :(

Yes, this was in a Linux 64-bit opt build.

ref https://tbpl.mozilla.org/?jobname=valgrind&showall=1&rev=beddd6d4bcdf

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Assignee

Comment 4

•

11 years ago

And my local run was an opt build.  Why does that matter?

Comment 5

•

11 years ago

Because in opt builds we don't run the GC/CC when we shut down ...

Andrew McCreight [:mccr8]

Assignee

Comment 6

•

11 years ago

> Because in opt builds we don't run the GC/CC when we shut down ...

But the leaks reported by Valgrind are ones that are definitely lost, i.e. the leaked blocks don't have any pointers to them anywhere in the (valid portion of the) address space.  Should these blocks have pointers to them?

Comment 7

•

11 years ago

You can run with XPCOM_CC_RUN_DURING_SHUTDOWN=1 if you want to get the shutdown GC/CCs in a non-debug build.

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 8

•

11 years ago

Well it gets stored in a reserved slot of a JSObject, so I'm not sure that there should be an (unmunged) pointer to it.

Assignee

Comment 9

•

11 years ago

bz and I worked this out on IRC.  The cache pointers are stored as jsvals, which means they are munged.  Valgrind doesn't know about the munging, and so thinks that there are no pointers to these blocks.  The only way around this false positive is to add a suppression.

Nathan Froyd [:froydnj]

Comment 10

•

11 years ago

(In reply to Nicholas Nethercote [:njn] from comment #9)
> bz and I worked this out on IRC.  The cache pointers are stored as jsvals,
> which means they are munged.  Valgrind doesn't know about the munging, and
> so thinks that there are no pointers to these blocks.  The only way around
> this false positive is to add a suppression.

Just because Valgrind is confused about the pointers doesn't mean that these blocks are actually being freed, right?  Bug 940071 certainly makes it sound like we are leaking some of them.

Gary Kwong [:gkw] [:nth10sd] (NOT official MoCo now)

Assignee

Comment 11

•

11 years ago

> Just because Valgrind is confused about the pointers doesn't mean that these
> blocks are actually being freed, right?  Bug 940071 certainly makes it sound
> like we are leaking some of them.

Correct.  And writing a suppression that doesn't suppress genuine leaks would be hard.  So mccr8's suggestion about XPCOM_CC_RUN_DURING_SHUTDOWN=1 is a good idea, and it works when I test it locally.  However, http://hg.mozilla.org/build/tools/file/default/scripts/valgrind/valgrind.sh#l82 indicates that we already use that env var in the tbpl builds!  Maybe it's not being exported correctly?

Comment 12

•

11 years ago

> So mccr8's suggestion about XPCOM_CC_RUN_DURING_SHUTDOWN=1
> is a good idea, and it works when I test it locally.  However,
> http://hg.mozilla.org/build/tools/file/default/scripts/valgrind/valgrind.
> sh#l82 indicates that we already use that env var in the tbpl builds!  Maybe
> it's not being exported correctly?

Perhaps. I originally added it back in bug 822029 (Dec 2012) and I think it at least worked for awhile back then.

Assignee

Comment 13

•

11 years ago

I ran the exact commands locally (|make pgo-profile-run|) and XPCOM_CC_RUN_DURING_SHUTDOWN is seen correctly.  Hmm.

Assignee

Comment 14

•

11 years ago

So either XPCOM_CC_RUN_DURING_SHUTDOWN isn't being set/seen (though it was for me locally), or CC is running at shutdown but there are still some ProtoAndIfaceCaches lingering.

How effective is the CC at shutdown?  Is it certain to collect all dead objects?

Assignee

Updated

•

11 years ago

Summary: The ProtoAndIfaceCache is leaking badly → The ProtoAndIfaceCache is causing false leak positives with Valgrind

Assignee

Comment 15

•

11 years ago

I looked at the other leaks on TBPL.  Most of them go through these functions

  initFunctionAndObjectClasses
  CompileScript
  XDRScript
  CompileFunctionBody

All of these allocate a ScriptSource object and then stash it in a slot, munging the pointer the same way as the ProtoAndIfaceCache.

Just like the ProtoAndIfaceCache, there are *many* leaks involving ScriptSource objects.  So if we can work out and fix the problem -- is CC not running at shutdown, or is it running but not collecting everything? -- then almost all of the TBPL leaks will go away.

Assignee

Comment 16

•

11 years ago

Is there some way to check if that shutdown CC is occurring?  If I could run Valgrind on tryserver (bug 631838) I'd just add some debugging printfs and be on my way.  But I can't...

Assignee

Comment 17

•

11 years ago

(In reply to Nicholas Nethercote [:njn] from comment #16)
> Is there some way to check if that shutdown CC is occurring?  If I could run
> Valgrind on tryserver (bug 631838) I'd just add some debugging printfs and
> be on my way.  But I can't...

So I just pushed a MOZ_VALGRIND-only debugging printf patch:
https://hg.mozilla.org/mozilla-central/rev/ba9ecdea3a90

I'll back it out again tomorrow, after the overnight Valgrind run has occurred.

Whiteboard: [MemShrink] → [MemShrink][leave open]

Gary Kwong [:gkw] [:nth10sd] (NOT official MoCo now)

Assignee

Comment 18

•

11 years ago

> So I just pushed a MOZ_VALGRIND-only debugging printf patch:
> https://hg.mozilla.org/mozilla-central/rev/ba9ecdea3a90
> 
> I'll back it out again tomorrow, after the overnight Valgrind run has
> occurred.

The run is here:
https://tbpl.mozilla.org/php/getParsedLog.php?id=30758937&tree=Mozilla-Central

The results were odd.  We pass the point where the env var is set:

+ export XPCOM_CC_RUN_DURING_SHUTDOWN=1
+ XPCOM_CC_RUN_DURING_SHUTDOWN=1

and then the debug printf is never hit.  Or, it possibly is hit, but it's within the truncated part of the output.

It's weird.  The debug printf is certainly working -- prior to setting the env var, Firefox is run a couple of times (I'm not sure why) and we get three lines like this:

  XPCOM_CC_RUN_DURING_SHUTDOWN = (null)

which makes sense, because the env var hasn't been set yet.

During the main run (after setting the env var) there are 16 "LEAK SUMMARY" lines in the output, each one corresponding to a different process, like this:

  ==13901== LEAK SUMMARY:
  ==13926== LEAK SUMMARY:
  ==13927== LEAK SUMMARY:
  ==13929== LEAK SUMMARY:
  ==13930== LEAK SUMMARY:
  ==13931== LEAK SUMMARY:
  ==13936== LEAK SUMMARY:
  ==13937== LEAK SUMMARY:
  ==13938== LEAK SUMMARY:
  ==13942== LEAK SUMMARY:
  ==13943== LEAK SUMMARY:
  ==13944== LEAK SUMMARY:
  ==13945== LEAK SUMMARY:
  ==13946== LEAK SUMMARY:
  ==13947== LEAK SUMMARY:
  ==13949== LEAK SUMMARY:

But there is only one "Command:" line:

  ==13898== Command: /builds/slave/m-cen-l64-valgrind-00000000000/objdir/dist/fire
fox/firefox http://localhost:8888/index.html -profile /tmp/tmpE0zvLS

which is what you get when you explicit start a process under Valgrind (as opposed to a new process being forked).  So it looks like Firefox forks itself at least 16 times.  Maybe that's expected for a PGO profile run?  Weird.

Anyway, *none* of those forked processes print out the env var value, which seemingly means that none of them call nsCycleCollector::Shutdown().  Maybe that's expected, who knows.

But process 13898, which started the whole thing, doesn't have a LEAK SUMMARY in the output so that must be due to truncation.  And so if it did run nsCycleCollector::Shutdown(), we missed the output.  Sigh.

My best idea for proceeding is to modify valgrind.sh to dial down --num-callers from 50 to 10 or 15.  That will drastically reduce the amount of output because (a) stacks are shorter, and (b) a lot more of them will be merged.  That will hopefully be enough to get us back below the 50MB output truncation threshold, and then the debug printf will hopefully give useful information tomorrow night.

gkw, what do you think?  If I land a change to valgrind.sh in the |tools| repo will the TBPL run pick it up automatically, or do I have to do some kind of manual update in m-c to pick up changes?

Flags: needinfo?(gary)

Comment 19

•

11 years ago

> My best idea for proceeding is to modify valgrind.sh to dial down
> --num-callers from 50 to 10 or 15.

15 sounds good - 50 was kind of arbitrary anyway.

> will the TBPL run pick it up automatically

Yes, the TBPL will likely pick it up automatically.

Flags: needinfo?(gary)

Assignee

Comment 20

•

11 years ago

Thanks, gkw.  BTW, I will probably increase --num-callers again at some point... 15 is a little low, but 50 is pretty high.  I use 24 in DMD, so something like that is probably good.

Assignee

Comment 21

•

11 years ago

Attached patch Reduce Valgrind's --num-callers value to 15, to reduce the voluminous output. — Details — Splinter Review

Attachment #8334783 - Flags: review?(gary)

Gary Kwong [:gkw] [:nth10sd] (NOT official MoCo now)

Assignee

Updated

•

11 years ago

Assignee: nobody → n.nethercote

Status: NEW → ASSIGNED

Comment 22

•

11 years ago

Comment on attachment 8334783 [details] [diff] [review]
Reduce Valgrind's --num-callers value to 15, to reduce the voluminous output.

Review of attachment 8334783 [details] [diff] [review]:
-----------------------------------------------------------------

r=me. Bonus points for filing a follow-up bug to bump that back up to ~25, when the dust has settled.

Attachment #8334783 - Flags: review?(gary) → review+

Assignee

Comment 23

•

11 years ago

I landed the --num-callers change:
https://hg.mozilla.org/build/tools/rev/8dd52029bd85

Assignee

Comment 24

•

11 years ago

coop triggered an off-schedule run for me.  It's at https://tbpl.mozilla.org/php/getParsedLog.php?id=30794315&tree=Mozilla-Central.  The --num-callers change worked, so the output is no longer truncated.

The starting command invoked under Valgrind runs as pid 11207.  The command is this:

 /builds/slave/m-cen-l64-valgrind-00000000000/objdir/dist/firefox/firefox http://localhost:8888/index.html -profile /tmp/tmpne5dAl

AFAICT, this process then forks 21 times.  Here is the "definitely lost" line for every process in the log:

 ==11210==    definitely lost: 0 bytes in 0 blocks
 ==11251==    definitely lost: 967,075 bytes in 1,362 blocks
 ==11256==    definitely lost: 976,179 bytes in 1,372 blocks
 ==11258==    definitely lost: 976,179 bytes in 1,372 blocks
 ==11260==    definitely lost: 976,179 bytes in 1,372 blocks
 ==11261==    definitely lost: 976,179 bytes in 1,372 blocks
 ==11266==    definitely lost: 1,219,603 bytes in 1,473 blocks
 ==11267==    definitely lost: 1,219,603 bytes in 1,473 blocks
 ==11268==    definitely lost: 1,219,603 bytes in 1,473 blocks
 ==11272==    definitely lost: 1,308,371 bytes in 1,539 blocks
 ==11273==    definitely lost: 1,308,371 bytes in 1,539 blocks
 ==11274==    definitely lost: 1,308,371 bytes in 1,539 blocks
 ==11275==    definitely lost: 1,308,499 bytes in 1,541 blocks
 ==11276==    definitely lost: 1,308,499 bytes in 1,541 blocks
 ==11277==    definitely lost: 1,308,499 bytes in 1,541 blocks
 ==11279==    definitely lost: 1,335,683 bytes in 1,569 blocks
 ==11280==    definitely lost: 1,335,683 bytes in 1,569 blocks
 ==11281==    definitely lost: 1,335,683 bytes in 1,569 blocks
 ==11283==    definitely lost: 1,423,523 bytes in 1,619 blocks
 ==11284==    definitely lost: 1,423,523 bytes in 1,619 blocks
 ==11285==    definitely lost: 1,423,523 bytes in 1,619 blocks
 ==11207==    definitely lost: 291 bytes in 5 blocks

Process 11207 executes nsCycleCollector::Shutdown() and correctly observes the XPCOM_CC_RUN_DURING_SHUTDOWN env var:

 XPCOM_CC_RUN_DURING_SHUTDOWN = 1
 XPCOM_CC_RUN_DURING_SHUTDOWN = 1
 XPCOM_CC_RUN_DURING_SHUTDOWN = 1
 XPCOM_CC_RUN_DURING_SHUTDOWN = 1

I think there are four because there is the main JSRuntime and three worker JSRuntimes.

But process 11207 is the only process that calls nsCycleCollector::Shutdown().  And it only leaks 291 bytes, and none of those leaks are the false positives mentioned above.  So the CC on shutdown is successful at preventing the false positives.

But none of the 21 forked processes (11210..11285) that leak between 967 KB and 1,423 KB do the CC at shutdown, which explains why they have all the false positives.  (I guess process 11210 finishes early enough that we haven't done any of these allocations that Valgrind misdiagnoses as leaks.)

The increasing number of leaks makes sense -- presumably the forks are happening regularly during the workload, and in the later ones Firefox has allocated more memory.

So now I just need to understand why on earth all this forking occurs, and if it can be avoided.  It actually seems really silly for Valgrind purposes, because we end up doing much of the same leak checking (and other checking) over and over!

(not currently active) Ted Mielczarek

Assignee

Comment 25

•

11 years ago

Ted, does it sound right that |make pgo-profile-run| forks Firefox 21 times?

Flags: needinfo?(ted)

Comment 26

•

11 years ago

No, that doesn't make any sense to me. pgo-profile-run simply starts the browser pointing at the test index page:
http://mxr.mozilla.org/mozilla-central/source/build/pgo/index.html?force=1

and then quits when it's done. There's no magic here. It's possible mozrunner or mozprocess is doing something weird here that confuses valgrind.

Flags: needinfo?(ted)

Gary Kwong [:gkw] [:nth10sd] (NOT official MoCo now)

Comment 27

•

11 years ago

I agree with Ted. I don't believe Firefox was forked 21 times some time ago, the last I looked at it.

Assignee

Comment 28

•

11 years ago

I just tried a local |make pgo-profile-run| very similar to the test machines one, and I got a single forked process, right at start-up, which quickly went away.  WTF, man.

Assignee

Comment 29

•

11 years ago

Last night's run is here:
https://tbpl.mozilla.org/php/getParsedLog.php?id=30994695&tree=Mozilla-Central

It apparently forked 115 times!  And then the initial process aborted like this:

==6032== Unsupported clone() flags: 0x800600
==6032==
==6032== The only supported clone() uses are:
==6032==  - via a threads library (LinuxThreads or NPTL)
==6032==  - via the implementation of fork or vfork
==6032==
==6032== Valgrind detected that your program requires
==6032== the following unimplemented functionality:
==6032==    Valgrind does not support general clone().
==6032== This may be because the functionality is hard to implement,
==6032== or because no reasonable program would behave this way,
==6032== or because nobody has yet needed it.  In any case, let us know at
==6032== www.valgrind.org and/or try to work around the problem, if you can.
==6032==
==6032== Valgrind has to exit now.  Sorry.  Bye!

I've seen this before locally, though it didn't happen the last time I tried to reproduce it.

Assuming the flags on the test machine match those on my machine (which may be an invalid assumption), the flags are
CLONE_UNTRACED|CLONE_FS|CLONE_FILES.  That looks a lot like we're at http://dxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/google-breakpad/src/client/linux/handler/exception_handler.cc#464, which is in breakpad, which means we've already crashed and are just trying to report the crash in some way.  This follows an invalid read in the JS jit, which I suspect is the real cause of the crash.  Hmm.

Summary: The ProtoAndIfaceCache is causing false leak positives with Valgrind → Weird forking(?) happening with Valgrind-on-TBPL runs

Assignee

Comment 30

•

11 years ago

> So I just pushed a MOZ_VALGRIND-only debugging printf patch:
> https://hg.mozilla.org/mozilla-central/rev/ba9ecdea3a90
> 
> I'll back it out again tomorrow, after the overnight Valgrind run has
> occurred.

Oh, I backed this out a few days ago but forgot to link it here:
https://hg.mozilla.org/mozilla-central/rev/2850ab9720c9

Assignee

Comment 31

•

11 years ago

I landed a change to valgrind.sh where I've switched to Nulgrind (the Valgrind tool that does nothing) and has --trace-syscalls=yes:
https://hg.mozilla.org/build/tools/rev/859904b596df

This will hopefully give me some new insight into the forking situation.

(not currently active) Ted Mielczarek

Assignee

Comment 32

•

11 years ago

tl;dr: Firefox *is* indeed forking repeatedly;  each time it forks it then
immediately tries to exec dbus-launch, fails, and then exits without doing any
clean-up, causing Valgrind to issue false positive leak reports.  Simply
installing dbus-launch on the test machines might be enought to rectify this
problem.

ted, does this make more sense now?  Do you know what dbus-launch is used for?


FULL DETAILS

So, the --trace-syscalls run completed.  Here are the syscalls for an example
forking.  Process 6099 (the main process) calls clone(), creating process 6193:

> SYSCALL[6099,1]( 56) sys_clone ( 1200011, 0x0, 0x0, 0x402e3b0, 0x17d3 )   clone(fork): process 6099 created child 6193
>  --> [pre-success] Success(0x0:0x1831)
> SYSCALL[6099,1](  3) sys_close ( 48 )[sync] --> Success(0x0:0x0)
> SYSCALL[6099,1](  3) sys_close ( 50 )[sync] --> Success(0x0:0x0)
> SYSCALL[6099,1](  0) sys_read ( 47, 0x37f42f00, 1024 ) --> [async] ...
>  --> [pre-success] Success(0x0:0x0)

The first arg of sys_clone() is the |flags|, which is 0x1200011, which is
CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID.  Those |flags| cause some ctid
fiddling I don't understand, but the thing to note is that this is a fork-like
call to clone, and not a thread-creation clone, because |flags| doesn't include
CLONE_THREAD).  Valgrind even describes it as "clone(fork)".

(Nb: We see lots of thread-creation clone() calls elsewhere in the log, which
have flags set to 3d0f00, which does include CLONE_THREAD.)

Process 6193 then does some fd fiddling:

> SYSCALL[6193,1](  2) sys_open ( 0xbd005ff(/dev/null), 2 ) --> [async] ...
> SYSCALL[6193,1](  2) ... [async] --> Success(0x0:0x33)
> SYSCALL[6193,1](  3) sys_close ( 47 )[sync] --> Success(0x0:0x0)
> SYSCALL[6193,1](  3) sys_close ( 49 )[sync] --> Success(0x0:0x0)
> SYSCALL[6193,1](  3) sys_close ( 0 )[sync] --> Success(0x0:0x0)
> SYSCALL[6193,1](  3) sys_close ( 1 )[sync] --> Success(0x0:0x0)
> SYSCALL[6193,1](  3) sys_close ( 2 )[sync] --> Success(0x0:0x0)
> SYSCALL[6193,1]( 33) sys_dup2 ( 51, 0 )[sync] --> Success(0x0:0x0)
> SYSCALL[6193,1]( 33) sys_dup2 ( 48, 1 )[sync] --> Success(0x0:0x1)
> SYSCALL[6193,1]( 33) sys_dup2 ( 50, 2 )[sync] --> Success(0x0:0x2)
> SYSCALL[6193,1]( 97) sys_getrlimit ( 7, 0x7feffd3d0 )[sync] --> Success(0x0:0x0)

And then it tries to close every fd from 3 to 4085:

> SYSCALL[6193,1](  3) sys_close ( 3 )[sync] --> Success(0x0:0x0)
> SYSCALL[6193,1](  3) sys_close ( 4 )[sync] --> Success(0x0:0x0)
> SYSCALL[6193,1](  3) sys_close ( 5 )[sync] --> Success(0x0:0x0)
> SYSCALL[6193,1](  3) sys_close ( 6 )[sync] --> Success(0x0:0x0)
> SYSCALL[6193,1](  3) sys_close ( 7 )[sync] --> Success(0x0:0x0)
> SYSCALL[6193,1](  3) sys_close ( 8 )[sync] --> Success(0x0:0x0)
> SYSCALL[6193,1](  3) sys_close ( 9 )[sync] --> Success(0x0:0x0)
> [... omit many entries here...]
> SYSCALL[6193,1](  3) sys_close ( 4081 )[sync] --> Failure(0x9)
> SYSCALL[6193,1](  3) sys_close ( 4082 )[sync] --> Failure(0x9)
> SYSCALL[6193,1](  3) sys_close ( 4083 )[sync] --> Failure(0x9)
> SYSCALL[6193,1](  3) sys_close ( 4084 )[sync] --> Failure(0x9)
> SYSCALL[6193,1](  3) sys_close ( 4085 )[sync] --> Failure(0x9)

Then it tries to exec dbus-launch in a bunch of locations, failing on every
one:

> SYSCALL[6193,1]( 14) sys_rt_sigprocmask ( 2, 0x7feffd450, 0x0, 8 ) --> [pre-success] Success(0x0:0x0)
> SYSCALL[6193,1]( 59) sys_execve ( 0xbd005ee(/bin/dbus-launch), 0xbf0c8a0, 0x13e614b0 ) --> [pre-fail] Failure(0x2)
> SYSCALL[6193,1]( 59) sys_execve ( 0x7feffd391(/tools/buildbot/bin/dbus-launch), 0xbf0c8a0, 0x13e614b0 ) --> [pre-fail] Failure(0x2)
> SYSCALL[6193,1]( 59) sys_execve ( 0x7feffd396(/usr/local/bin/dbus-launch), 0xbf0c8a0, 0x13e614b0 ) --> [pre-fail] Failure(0x2)
> SYSCALL[6193,1]( 59) sys_execve ( 0x7feffd393(/usr/lib64/ccache/dbus-launch), 0xbf0c8a0, 0x13e614b0 ) --> [pre-fail] Failure(0x2)
> SYSCALL[6193,1]( 59) sys_execve ( 0x7feffd3a0(/bin/dbus-launch), 0xbf0c8a0, 0x13e614b0 ) --> [pre-fail] Failure(0x2)
> SYSCALL[6193,1]( 59) sys_execve ( 0x7feffd39c(/usr/bin/dbus-launch), 0xbf0c8a0, 0x13e614b0 ) --> [pre-fail] Failure(0x2)
> SYSCALL[6193,1]( 59) sys_execve ( 0x7feffd395(/usr/local/sbin/dbus-launch), 0xbf0c8a0, 0x13e614b0 ) --> [pre-fail] Failure(0x2)
> SYSCALL[6193,1]( 59) sys_execve ( 0x7feffd39b(/usr/sbin/dbus-launch), 0xbf0c8a0, 0x13e614b0 ) --> [pre-fail] Failure(0x2)
> SYSCALL[6193,1]( 59) sys_execve ( 0x7feffd39f(/sbin/dbus-launch), 0xbf0c8a0, 0x13e614b0 ) --> [pre-fail] Failure(0x2)
> SYSCALL[6193,1]( 59) sys_execve ( 0x7feffd396(/tools/git/bin/dbus-launch), 0xbf0c8a0, 0x13e614b0 ) --> [pre-fail] Failure(0x2)
> SYSCALL[6193,1]( 59) sys_execve ( 0x7feffd391(/tools/python27/bin/dbus-launch), 0xbf0c8a0, 0x13e614b0 ) --> [pre-fail] Failure(0x2)
> SYSCALL[6193,1]( 59) sys_execve ( 0x7feffd387(/tools/python27-mercurial/bin/dbus-launch), 0xbf0c8a0, 0x13e614b0 ) --> [pre-fail] Failure(0x2)
> SYSCALL[6193,1]( 59) sys_execve ( 0x7feffd394(/home/cltbld/bin/dbus-launch), 0xbf0c8a0, 0x13e614b0 ) --> [pre-fail] Failure(0x2)

Then it exits without doing any standard Firefox shutdown stuff:

> SYSCALL[6193,1](231) exit_group( 1 ) --> [pre-success] Success(0x0:0x0)

which causes Valgrind to complain about leaks.

If the exec of dbus-launch succeeded, then the dbus-launch process would
overwrite the forked Valgrind process, and we wouldn't have any problems.  So
the question(s) is now:  what is dbus-launch, why are we trying to invoke it,
and why is it failing?

From the man page:

  dbus-launch - Utility to start a message bus from a shell script

  The dbus-launch command is used to start a session bus instance of
  dbus-daemon from a shell script.

So it's some kind of messaging thing, which seems kind of reasonable, though I
don't know anything beyond that.

Perhaps we could simply install dbus-launch on the test machines?

Flags: needinfo?(ted)

Comment 33

•

11 years ago

(In reply to Nicholas Nethercote [:njn] from comment #32)
> ted, does this make more sense now?  Do you know what dbus-launch is used
> for?

No, this is still a mystery to me. I don't see any reference to dbus-launch in the tree, so I assume this is something that's being called via some library function (GTK?). Is it possible to get a stack into the Gecko code where this is happening?

CCing some Linux expertise, perhaps this may ring a bell to someone.

Flags: needinfo?(ted)

(not currently active) Ted Mielczarek

Comment 34

•

11 years ago

(In reply to Nicholas Nethercote [:njn] from comment #32)
> If the exec of dbus-launch succeeded, then the dbus-launch process would
> overwrite the forked Valgrind process, and we wouldn't have any problems.  So
> the question(s) is now:  what is dbus-launch, why are we trying to invoke it,
> and why is it failing?

> So it's some kind of messaging thing

Yes, dbus-launch sets up a per-user-session message bus.

It is invoked when starting the X11 session (See 934843 comment 95) and sets the
DBUS_SESSION_BUS_ADDRESS environment variable for all processes in that session.

Tests run from within a process in this session that listens to messages to start new tests.  In this way the tests have the necessary environment variables set.

Why is the process trying to invoke dbus-launch?

A process using a dbus library, perhaps even GIO, will start its own message bus if DBUS_SESSION_BUS_ADDRESS is not set.

> Perhaps we could simply install dbus-launch on the test machines?

dbus-launch is on the test machines, unless the set-up is different for valgrind tests.

I don't know why the dbus library is not finding dbus-launch.
Could the PATH have been corrupted perhaps?

But the first question to answer is why is the process not finding the existing dbus session?  Is DBUS_SESSION_BUS_ADDRESS set appropriately?
Perhaps the dbus-daemon has crashed?

Comment 35

•

11 years ago

So, one thing to note here is that this whole run is taking place on a *build* machine, not a test machine. That being said, build machines should be capable of running Firefox, since we do PGO builds on them which require starting the browser (with the same invocation that Valgrind builds are using, even). It's possible there are simply some environment variables that are not being set properly for Valgrind builds.

Gary Kwong [:gkw] [:nth10sd] (NOT official MoCo now)

Comment 36

•

11 years ago

> It's possible there are simply some environment variables that
> are not being set properly for Valgrind builds.

And in this case, it's probably DBUS_SESSION_BUS_ADDRESS?

Comment 37

•

11 years ago

Yes, DBUS_SESSION_BUS_ADDRESS is the relevant variable here.

Having seen bug 797242, I suspect build machines don't run jobs from within a full session, but may set variables such as DISPLAY to talk to an Xvfb display.  I don't know whether there is a dbus-daemon maintaining a session bus for that display.

Assignee

Updated

•

11 years ago

Whiteboard: [MemShrink][leave open] → [leave open]

Assignee

Comment 38

•

11 years ago

Thanks for the info, karlt!

> I don't know why the dbus library is not finding dbus-launch.
> Could the PATH have been corrupted perhaps?

If I'm reading the Valgrind output correctly, all of the execve() calls fail with errno=2, which is ENOENT, which represents "No such file or directory".

I don't know if PATH has been corrupted, but the invoked locations include /bin, /usr/bin, /usr/local/bin, which are pretty reasonable places to look.  (On my Ubuntu box dbus-launch is in /usr/bin.)

> But the first question to answer is why is the process not finding the
> existing dbus session?  Is DBUS_SESSION_BUS_ADDRESS set appropriately?
> Perhaps the dbus-daemon has crashed?

I guess I can modify the valgrind.sh script to print the value of DBUS_SESSION_BUS_ADDRESS.  (Actually, I'll just run |printenv| to get all the env vars).  I might invoke |pstree| and |ps aux| as well.  Anything else I should do at the same time?

Flags: needinfo?(karlt)

Assignee

Comment 39

•

11 years ago

> I guess I can modify the valgrind.sh script to print the value of
> DBUS_SESSION_BUS_ADDRESS.  (Actually, I'll just run |printenv| to get all
> the env vars).  I might invoke |pstree| and |ps aux| as well.

I did this here:
https://hg.mozilla.org/build/tools/rev/d94e5b0e9177

Flags: needinfo?(karlt)

Assignee

Comment 40

•

11 years ago

Here's the debugging output.  |printenv| indicates that DBUS_SESSION_BUS_ADDRESS is not set.  |pstree| indicates that dbus-daemon is running.

karlt, any suggestions on how to proceed now?

Flags: needinfo?(karlt)

Gary Kwong [:gkw] [:nth10sd] (NOT official MoCo now)

Assignee

Updated

•

11 years ago

Summary: Weird forking(?) happening with Valgrind-on-TBPL runs → Valgrind-on-TBPL: failure to exec dbus-launch in forked processes causes many false positive leak reports

Updated

•

11 years ago

Attachment #8338986 - Attachment mime type: text/x-vhdl → text/plain

Comment 41

•

11 years ago

(In reply to Nicholas Nethercote [:njn] from comment #38)
> If I'm reading the Valgrind output correctly, all of the execve() calls fail
> with errno=2, which is ENOENT, which represents "No such file or directory".
> 
> I don't know if PATH has been corrupted, but the invoked locations include
> /bin, /usr/bin, /usr/local/bin, which are pretty reasonable places to look. 
> (On my Ubuntu box dbus-launch is in /usr/bin.)

Yes,
http://vault.centos.org/6.0/os/x86_64/Packages/dbus-x11-1.2.24-3.el6.x86_64.rpm
has dbus-launch in /usr/bin/

So it seems it is not installed.

Could dbus-libs be installed without dbus-x11?

Comment 42

•

11 years ago

init-+-abrt-dump-oops
     |-abrtd
     |-collectdmon---collectd---10*[{collectd}]
     |-crond
     |-dbus-daemon
     |-dhclient
     |-hald---hald-runner-+-hald-addon-acpi
     |                    `-hald-addon-inpu
     |-irqbalance
     |-master-+-pickup
     |        `-qmgr
     |-6*[mingetty]
     |-nrpe
     |-ntpd
     |-portreserve
     |-rpcbind
     |-rsyslogd---3*[{rsyslogd}]
     |-sshd
     |-supervisord---Xvfb
     |-twistd-+-mock_mozilla---valgrind.sh---pstree
     |        `-{twistd}
     `-udevd---2*[udevd]

dbus      1307  0.0  0.0  21436   588 ?        Ss   19:27   0:00 dbus-daemon --system

This dbus-daemon is the "system" bus, not the session bus.

It looks like there is no full X session, but just Xvfb to provide enough for basic browser start-up.

Running on a test machine instead of a build machine may be preferable, but installing dbus-x11 will probably be enough to avoid this issue.

Flags: needinfo?(karlt)

Assignee

Updated

•

11 years ago

Depends on: 943713

Gary Kwong [:gkw] [:nth10sd] (NOT official MoCo now)

Assignee

Comment 43

•

11 years ago

> installing dbus-x11 will probably be enough to avoid this issue.

I filed bug 943713 for this.

Julian Seward [:jseward]

Comment 44

•

11 years ago

(In reply to Nicholas Nethercote [:njn] from comment #9)
> bz and I worked this out on IRC.  The cache pointers are stored as jsvals,
> which means they are munged.  Valgrind doesn't know about the munging, and
> so thinks that there are no pointers to these blocks.  The only way around
> this false positive is to add a suppression.

I see the bug as a whole is well on the way to being fixed.  One
comment re the above .. the leak checker in 3.9.0 acquired some
heuristics that allow it to accept that certain pointers really
do hold a block live even though they don't point exactly at its
first byte:

    --leak-check-heuristics=heur1,heur2,... which heuristics to use for
        improving leak search false positive [none]
        where heur is one of stdstring newarray multipleinheritance all none

and so we could easily add another one, that de-munges pointers hidden
in JSVals.  Would that help?  From talking to njn on Vidyo I got the
impression that this would be trivial -- merely need to shift them
left one bit, or something like that.

Comment 45

•

11 years ago

(In reply to Nicholas Nethercote [:njn] from comment #43)
> > installing dbus-x11 will probably be enough to avoid this issue.
> 
> I filed bug 943713 for this.

And in the latest logs (after installing dbus-x11):

https://tbpl.mozilla.org/php/getParsedLog.php?id=31154515&tree=Mozilla-Central&full=1

There's the following messages:

Xlib:  extension "RANDR" missing on display ":2.0".
JavaScript error: , line 0: uncaught exception: 2147942487

Comment 46

•

11 years ago

To be clear, the RANDR message is not new.
I don't know about the other.

Assignee

Comment 47

•

11 years ago

> And in the latest logs (after installing dbus-x11):
> 
> https://tbpl.mozilla.org/php/getParsedLog.php?id=31154515&tree=Mozilla-
> Central&full=1

dbus-x11 was not installed for that run.  It started at 3am on Nov 27, but dbus-x11 was installed around 9am on Nov 27, according to https://bugzilla.mozilla.org/show_bug.cgi?id=943713#c4.

However, dbus-x11 was installed for this run:  https://tbpl.mozilla.org/php/getParsedLog.php?id=31181413&tree=Mozilla-Central.  And it worked!  The DBUS_SESSION_BUS_ADDRESS env var isn't set, but that doesn't seem to matter.  Here's the full Valgrind output from that run -- only two errors! :)


==4553== Memcheck, a memory error detector
==4553== Copyright (C) 2002-2012, and GNU GPL'd, by Julian Seward et al.
==4553== Using Valgrind-3.8.2.SVN.MOZ-13022-2546 and LibVEX; rerun with -h for copyright info
==4553== Command: /builds/slave/m-cen-l64-valgrind-00000000000/objdir/dist/firefox/firefox http://localhost:8888/index.html -profile /tmp/tmpFvW9zE
==4553==
==4556==
==4556== HEAP SUMMARY:
==4556==     in use at exit: 216,510 bytes in 634 blocks
==4556==   total heap usage: 791 allocs, 157 frees, 237,352 bytes allocated
==4556==
==4556== LEAK SUMMARY:
==4556==    definitely lost: 0 bytes in 0 blocks
==4556==    indirectly lost: 0 bytes in 0 blocks
==4556==      possibly lost: 1,117 bytes in 32 blocks
==4556==    still reachable: 215,393 bytes in 602 blocks
==4556==         suppressed: 0 bytes in 0 blocks
==4556== Reachable blocks (those to which a pointer was found) are not shown.
==4556== To see them, rerun with: --leak-check=full --show-reachable=yes
==4556==
==4556== For counts of detected and suppressed errors, rerun with: -v
==4556== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 6 from 6)
Xlib:  extension "RANDR" missing on display ":2.0".
JavaScript error: , line 0: uncaught exception: 2147942487
==4553==
==4553== HEAP SUMMARY:
==4553==     in use at exit: 844,921 bytes in 9,126 blocks
==4553==   total heap usage: 434,720 allocs, 425,594 frees, 364,148,177 bytes allocated
==4553==
==4553== 33 bytes in 1 blocks are definitely lost in loss record 2,461 of 5,156
==4553==    at 0x4C28A49: malloc (vg_replace_malloc.c:270)
==4553==    by 0x7614457: moz_xmalloc (mozalloc.cpp:52)
==4553==    by 0x813BC80: NS_Alloc (nsMemoryImpl.cpp:197)
==4553==    by 0x812C405: ToNewCString(nsACString_internal const&) (nsMemory.h:36)
==4553==    by 0x96B120C: CrashReporter::SetRestartArgs(int, char**) (nsExceptionHandler.cpp:1652)
==4553==    by 0x969F269: XREMain::XRE_mainInit(bool*) (nsAppRunner.cpp:3025)
==4553==    by 0x96A3522: XREMain::XRE_main(int, char**, nsXREAppData const*) (nsAppRunner.cpp:4022)
==4553==    by 0x96A383E: XRE_main (nsAppRunner.cpp:4244)
==4553==    by 0x404933: do_main(int, char**, nsIFile*) (nsBrowserApp.cpp:280)
==4553==    by 0x404A5A: main (nsBrowserApp.cpp:647)
==4553==
{
   <insert_a_suppression_name_here>
   Memcheck:Leak
   fun:malloc
   fun:moz_xmalloc
   fun:NS_Alloc
   fun:_Z12ToNewCStringRK19nsACString_internal
   fun:_ZN13CrashReporter14SetRestartArgsEiPPc
   fun:_ZN7XREMain12XRE_mainInitEPb
   fun:_ZN7XREMain8XRE_mainEiPPcPK12nsXREAppData
   fun:XRE_main
   fun:_ZL7do_mainiPPcP7nsIFile
   fun:main
}
==4553== 258 bytes in 4 blocks are definitely lost in loss record 4,891 of 5,156
==4553==    at 0x4C28A49: malloc (vg_replace_malloc.c:270)
==4553==    by 0x7614457: moz_xmalloc (mozalloc.cpp:52)
==4553==    by 0x813BC80: NS_Alloc (nsMemoryImpl.cpp:197)
==4553==    by 0x812C405: ToNewCString(nsACString_internal const&) (nsMemory.h:36)
==4553==    by 0x96B1193: CrashReporter::SetRestartArgs(int, char**) (nsExceptionHandler.cpp:1638)
==4553==    by 0x969F269: XREMain::XRE_mainInit(bool*) (nsAppRunner.cpp:3025)
==4553==    by 0x96A3522: XREMain::XRE_main(int, char**, nsXREAppData const*) (nsAppRunner.cpp:4022)
==4553==    by 0x96A383E: XRE_main (nsAppRunner.cpp:4244)
==4553==    by 0x404933: do_main(int, char**, nsIFile*) (nsBrowserApp.cpp:280)
==4553==    by 0x404A5A: main (nsBrowserApp.cpp:647)
==4553==
{
   <insert_a_suppression_name_here>
   Memcheck:Leak
   fun:malloc
   fun:moz_xmalloc
   fun:NS_Alloc
   fun:_Z12ToNewCStringRK19nsACString_internal
   fun:_ZN13CrashReporter14SetRestartArgsEiPPc
   fun:_ZN7XREMain12XRE_mainInitEPb
   fun:_ZN7XREMain8XRE_mainEiPPcPK12nsXREAppData
   fun:XRE_main
   fun:_ZL7do_mainiPPcP7nsIFile
   fun:main
}
==4553== LEAK SUMMARY:
==4553==    definitely lost: 291 bytes in 5 blocks
==4553==    indirectly lost: 5,432 bytes in 169 blocks
==4553==      possibly lost: 2,129 bytes in 59 blocks
==4553==    still reachable: 307,176 bytes in 2,145 blocks
==4553==         suppressed: 529,893 bytes in 6,748 blocks
==4553== Reachable blocks (those to which a pointer was found) are not shown.
==4553== To see them, rerun with: --leak-check=full --show-reachable=yes
==4553==
==4553== For counts of detected and suppressed errors, rerun with: -v
==4553== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 63 from 57)


Hallelujah.

Gary Kwong [:gkw] [:nth10sd] (NOT official MoCo now)

Assignee

Comment 48

•

11 years ago

You can only imagine how hard I'll click the "Save Changes" button after marking this RESOLVED FIXED.

Many thanks for you help, Karl!

Status: ASSIGNED → RESOLVED

Closed: 11 years ago

Resolution: --- → FIXED

Comment 49

•

11 years ago

(In reply to Nicholas Nethercote [:njn] from comment #47)
> > And in the latest logs (after installing dbus-x11):
> > 
> > https://tbpl.mozilla.org/php/getParsedLog.php?id=31154515&tree=Mozilla-
> > Central&full=1
> 
> dbus-x11 was not installed for that run.  It started at 3am on Nov 27, but
> dbus-x11 was installed around 9am on Nov 27, according to
> https://bugzilla.mozilla.org/show_bug.cgi?id=943713#c4.
> 
> However, dbus-x11 was installed for this run: 
> https://tbpl.mozilla.org/php/getParsedLog.php?id=31181413&tree=Mozilla-
> Central.  And it worked!  The DBUS_SESSION_BUS_ADDRESS env var isn't set,
> but that doesn't seem to matter.  Here's the full Valgrind output from that
> run -- only two errors! :)

Oh, you're right, I had both tabs open, just copied the wrong URL. Oops.

Ryan VanderMeulen [:RyanVM]

Assignee

Comment 50

•

11 years ago

 the leak checker in 3.9.0 acquired some
> heuristics that allow it to accept that certain pointers really
> do hold a block live... and so we could easily add another one, that
> de-munges pointers hidden in JSVals.  Would that help?

Thanks for the offer.  At this point it's not necessary, and I'm ready to move on!

Updated

•

11 years ago

Whiteboard: [leave open]

Target Milestone: --- → mozilla28

Assignee

Comment 51

•

11 years ago

I removed the |printenv|/|pstree|/|ps aux| from valgrind.sh:
https://hg.mozilla.org/build/tools/rev/1919970de608

Assignee

Updated

•

11 years ago

Blocks: valgrind-on-tbpl
No longer blocks: valgrind-tbpl-bugs

Updated

•

11 years ago

status-firefox28: --- → fixed

Whiteboard: [qa-]

Assignee

Comment 52

•

11 years ago

This bug didn't touch Firefox's code; it just involved configuration of build machines, and the build/tools/ repo. So setting the status-firefox28 flag doesn't seem right.

Comment 53

•

11 years ago

(In reply to Nicholas Nethercote [:njn] from comment #52)
> This bug didn't touch Firefox's code; it just involved configuration of
> build machines, and the build/tools/ repo. So setting the status-firefox28
> flag doesn't seem right.

What B2G version is aligned to Firefox 28?

Assignee

Comment 54

•

11 years ago

> What B2G version is aligned to Firefox 28?

I don't know, and this bug has nothing to do with B2G. It's about the Valgrind test job on TBPL, which is desktop-only.

Comment 55

•

11 years ago

I don't know where this belongs then. Is the product/component correct for this bug?

Assignee

Comment 56

•

11 years ago

> Is the product/component correct for this bug?

No. I changed it to Testing/General, because that's what I use for Valgrind-on-TBPL in general.

Component: DOM → General

Product: Core → Testing

Target Milestone: mozilla28 → ---