Closed Bug 946037 Opened 6 years ago Closed 6 years ago

Valgrind-on-TBPL: new leaks involving MessageLoop()

Categories

(Core :: General, defect)

x86_64
Linux
defect
Not set

Tracking

()

RESOLVED WORKSFORME

People

(Reporter: njn, Unassigned)

References

(Blocks 1 open bug)

Details

Attachments

(2 files)

This Valgrind-on-TBPL run was successful:
Linux x86-64 mozilla-central valgrind on 2013-12-02 03:02:10 PST for push 84a5a5800bd3

This one has some new leaks:
Linux x86-64 mozilla-central valgrind on 2013-12-03 03:02:16 PST for push 8648aa476eef

The regression range has *lots* of changes:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=84a5a5800bd3&tochange=8648aa476eef

I skimmed through the regression range and didn't see anything that caught my eye.

----

The leaks involving the following functions:

- js::frontend::CompileScript
- js::GlobalObject::initFunctionAndObjectClasses
- xpc::CreateGlobalObject

are all familiar from bug 940069 -- they're cases where a pointer is munged when stored as a JS::Value, and so Valgrind doesn't know about them.  But when we had those previously we had bazillions of them, whereas in this run we have just a handful, which makes me wonder if they're real.

The remaining three are almost identical...

> 568 (64 direct, 504 indirect) bytes in 1 blocks are definitely lost in loss record 2,164 of 2,455
>    at 0x4C28A49: malloc (vg_replace_malloc.c:270)
>    by 0x7614457: moz_xmalloc (mozalloc.cpp:52)
>    by 0x83712ED: std::_Deque_base<MessageLoop::PendingTask, std::allocator<MessageLoop::PendingTask> >::_M_initialize_map(unsigned long) (mozalloc.h:201)
>    by 0x8372451: MessageLoop::MessageLoop(MessageLoop::Type) (stl_deque.h:454)
>    by 0x8375598: base::Thread::ThreadMain() (thread.cc:144)
>    by 0x8362E49: ThreadFunc(void*) (platform_thread_posix.cc:39)
>    by 0x4E377F0: start_thread (in /lib64/libpthread-2.12.so)
>    by 0x5CDD92C: clone (in /lib64/libc-2.12.so)

This is a weird mix of system stack frames and our stack frames, but it involves the creation of a new thread.  The closest type we have to that Deque one is this:

  typedef std::queue<PendingTask> TaskQueue

in ipc/chromium/src/base/message_loop.h, and:

  typedef std::priority_queue<base::PendingTask> DelayedTaskQueue;

in security/sandbox/base/pending_task.h.

But nothing changed in those directories recently.
There used to be a suppression in bug 794350 about MessageLoop::MessageLoop. I wonder if it might be related..
From the successful run:

==2200== HEAP SUMMARY:
==2200==     in use at exit: 845,096 bytes in 9,128 blocks
==2200==   total heap usage: 438,156 allocs, 429,028 frees, 365,134,617 bytes allocated
==2200==
==2200== LEAK SUMMARY:
==2200==    definitely lost: 0 bytes in 0 blocks
==2200==    indirectly lost: 5,432 bytes in 169 blocks
==2200==      possibly lost: 2,129 bytes in 59 blocks
==2200==    still reachable: 307,351 bytes in 2,147 blocks
==2200==         suppressed: 530,184 bytes in 6,753 blocks

These results are almost deterministic -- the prior run had *very* similar numbers.

From the bad run:

==1814== HEAP SUMMARY:
==1814==     in use at exit: 922,656 bytes in 11,052 blocks
==1814==   total heap usage: 440,562 allocs, 429,510 frees, 369,142,954 bytes allocated
==1814==
==1814== LEAK SUMMARY:
==1814==    definitely lost: 0 bytes in 0 blocks
==1814==    indirectly lost: 5,432 bytes in 169 blocks
==1814==      possibly lost: 62,750 bytes in 1,926 blocks
==1814==    still reachable: 324,226 bytes in 2,202 blocks
==1814==         suppressed: 530,248 bytes in 6,755 blocks

"in use" is up, as is "still reachable", and "possibly lost" is way up.
> There used to be a suppression in bug 794350 about MessageLoop::MessageLoop.
> I wonder if it might be related..

Hmm, the stack in that bug is very similar to this one.  I wonder if this is non-deterministic.
I can't reproduce locally :(
The successful run on Dec 2 featured only two processes: 2419 and 2422.  I.e. 2419 forked once.

The failing run on Dec 3 featured three processes: 1814, 1817, 1826.  I.e. 1814 forked twice.  All the new leaks are in the 1826 process.  The familiar ones (CompileScript et al) will be due to the familiar reason:  process 1826 isn't doing a full shutdown, so some blocks aren't freed, and the munged pointers to those blocks aren't recognized by Valgrind.  I don't know how the std::_Deque_base ones fit it with that story, though.

I'll do another --trace-syscalls=yes run (as in https://hg.mozilla.org/build/tools/rev/859904b596df) to see why this additional fork is occurring.

Another data point:  the failing run has this output which the successful run does not:

> Home directory /builds not ours.
> ==1826== Warning: invalid file descriptor 4086 in syscall close()
Attachment #8342072 - Attachment description: Valgrind output from the run → Valgrind output from Dec 3 (failing)
Attachment #8344230 - Attachment description: Valgrind output from the 2nd (successful) → Valgrind output from Dec 2 (successful)
I pushed a change to use --trace-syscalls=yes:

https://hg.mozilla.org/build/tools/rev/41508ca8b2a6

And then backed it out after a couple of runs:

https://hg.mozilla.org/build/tools/rev/d0539ca7b0cf

(The runs took *ages* -- I cancelled a couple after ~18 hours of running!  Which
is weird because when I did this in bug 940069 that didn't happen.  No matter,
I was able to get useful sufficient info to understand what's happening.)

Firefox is doing a fork-and-exec of /usr/bin/pulseaudio, but that's failing,
and the forked process then exits immediately.  This is familiar from bug
940069, and explains the familiar (false positive) leaks.  Hopefully the
MessageLoop() one has the same cause.

The simple fix probably just to install the pulseaudio executable on the build
machines.  (This is the second time we've had to install something on the build
machines to satisfy Valgrind... if it happens a third time I'll investigate a
deeper fix, either by making Valgrind not complain, or moving valgrind-on-tbpl
runs to test machines :)

An interesting question for kinetik and padenot:  what changed on Dec 2nd to
cause /usr/bin/pulseaudio to be fork-and-exec'd?  None of the patches in the
regression range (see comment 0) had "pulse" in their titles, but the following
ones had "audio":

	0575a10042e2	Jan Gerber — Bug 938686 - Support Opus in WebM. r=kinetik Support the Opus audio codec in the WebM (Matroska) container. This is part of the "WebM 2" proposed spec, which also includes the new VP9 video codec. Alas we weren't able to get concensus to change the doctype of filename extension to mark the revision allowing the new codecs.

	66be4716e86e	Paul Adenot — Bug 944707 - Stop locking when getting the preferred samplerate from the AudioStream. r=kinetik

	72f2f506fe01	Shelly Lin — Bug 945135 - Part 1: Refactor of TrackEncoder and AudioTrackEncoder. r=roc

	3117fa37236a	Shawn Ku — Bug 931722 - Part 2: Extra mute request to RIL at AudioManager. r=mchen

	c7994af691f5	Robert O'Callahan — Bug 938022. Part 1: Update mAudioEndTime from SendStreamData. r=cpearce

	81c66b34eb05	Paul Adenot — Bug 944132 - Make sure we don't loose precision when computing the audio clock when using opensl. r=sotaro
Depends on: 947752
> The simple fix probably just to install the pulseaudio executable on the
> build machines.

I filed bug 947752 for this.
It'll be bug 944707.  That added a call to InitPreferredSampleRate() to AudioStream::InitLibrary(), which is called from nsLayoutStatics::Initialize().  InitPreferredSampleRate() initializes a libcubeb context if one is not already initialized (which is true at startup).

That's unfortunate, because it means we spin up whatever resources are necessary to initialize a cubeb context (on Linux, that means a connection to PulseAudio, including starting PA if it's not running) on every startup.

We should find a way to call InitPreferredSampleRate() lazily the first time the result is needed.
Blocks: 944707
No longer depends on: 947752
Depends on: 947752
I backed out the offending patch, we'll find a different way to fix it that doesn't cause this problem.
(In reply to Nicholas Nethercote [:njn] from comment #7)
Random observation:
> (The runs took *ages* -- I cancelled a couple after ~18 hours of
> running! 

You might like to use --fair-sched=yes (only works on Linux).  This
enforces round-robin scheduling of threads and in some cases avoids
very large delays caused by unfair scheduling when using the original
pipe-based sequentialisation scheme (which is still the default).

Generally I've found --fair-sched=yes improves interactive behaviour
of the browser when running it on Memcheck.  I always use it now.
kinetik's backout hopefully will have fixed this, but I can't tell because every Valgrind run on m-c since the backout has either crashed  (bug 947671), or mysteriously timed out (bug 948145).  Argh.
The hangs and crashes are resolved.  The back-out clearly fixed this.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.