946037 - Valgrind-on-TBPL: new leaks involving MessageLoop()

Reporter

Description

•

12 years ago

This Valgrind-on-TBPL run was successful: Linux x86-64 mozilla-central valgrind on 2013-12-02 03:02:10 PST for push 84a5a5800bd3 This one has some new leaks: Linux x86-64 mozilla-central valgrind on 2013-12-03 03:02:16 PST for push 8648aa476eef The regression range has *lots* of changes: http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=84a5a5800bd3&tochange=8648aa476eef I skimmed through the regression range and didn't see anything that caught my eye. ---- The leaks involving the following functions: - js::frontend::CompileScript - js::GlobalObject::initFunctionAndObjectClasses - xpc::CreateGlobalObject are all familiar from bug 940069 -- they're cases where a pointer is munged when stored as a JS::Value, and so Valgrind doesn't know about them. But when we had those previously we had bazillions of them, whereas in this run we have just a handful, which makes me wonder if they're real. The remaining three are almost identical... > 568 (64 direct, 504 indirect) bytes in 1 blocks are definitely lost in loss record 2,164 of 2,455 > at 0x4C28A49: malloc (vg_replace_malloc.c:270) > by 0x7614457: moz_xmalloc (mozalloc.cpp:52) > by 0x83712ED: std::_Deque_base<MessageLoop::PendingTask, std::allocator<MessageLoop::PendingTask> >::_M_initialize_map(unsigned long) (mozalloc.h:201) > by 0x8372451: MessageLoop::MessageLoop(MessageLoop::Type) (stl_deque.h:454) > by 0x8375598: base::Thread::ThreadMain() (thread.cc:144) > by 0x8362E49: ThreadFunc(void*) (platform_thread_posix.cc:39) > by 0x4E377F0: start_thread (in /lib64/libpthread-2.12.so) > by 0x5CDD92C: clone (in /lib64/libc-2.12.so) This is a weird mix of system stack frames and our stack frames, but it involves the creation of a new thread. The closest type we have to that Deque one is this: typedef std::queue<PendingTask> TaskQueue in ipc/chromium/src/base/message_loop.h, and: typedef std::priority_queue<base::PendingTask> DelayedTaskQueue; in security/sandbox/base/pending_task.h. But nothing changed in those directories recently.

Gary Kwong [:gkw] [:nth10sd] (NOT official MoCo now)

Comment 1

•

12 years ago

There used to be a suppression in bug 794350 about MessageLoop::MessageLoop. I wonder if it might be related..

Nicholas Nethercote [inactive]

Reporter

Comment 2

•

12 years ago

From the successful run: ==2200== HEAP SUMMARY: ==2200== in use at exit: 845,096 bytes in 9,128 blocks ==2200== total heap usage: 438,156 allocs, 429,028 frees, 365,134,617 bytes allocated ==2200== ==2200== LEAK SUMMARY: ==2200== definitely lost: 0 bytes in 0 blocks ==2200== indirectly lost: 5,432 bytes in 169 blocks ==2200== possibly lost: 2,129 bytes in 59 blocks ==2200== still reachable: 307,351 bytes in 2,147 blocks ==2200== suppressed: 530,184 bytes in 6,753 blocks These results are almost deterministic -- the prior run had *very* similar numbers. From the bad run: ==1814== HEAP SUMMARY: ==1814== in use at exit: 922,656 bytes in 11,052 blocks ==1814== total heap usage: 440,562 allocs, 429,510 frees, 369,142,954 bytes allocated ==1814== ==1814== LEAK SUMMARY: ==1814== definitely lost: 0 bytes in 0 blocks ==1814== indirectly lost: 5,432 bytes in 169 blocks ==1814== possibly lost: 62,750 bytes in 1,926 blocks ==1814== still reachable: 324,226 bytes in 2,202 blocks ==1814== suppressed: 530,248 bytes in 6,755 blocks "in use" is up, as is "still reachable", and "possibly lost" is way up.

Nicholas Nethercote [inactive]

Reporter

Comment 3

•

12 years ago

Attached file Valgrind output from Dec 3 (failing) — Details

Nicholas Nethercote [inactive]

Reporter

Comment 4

•

12 years ago

> There used to be a suppression in bug 794350 about MessageLoop::MessageLoop. > I wonder if it might be related.. Hmm, the stack in that bug is very similar to this one. I wonder if this is non-deterministic.

Nicholas Nethercote [inactive]

Reporter

Comment 5

•

12 years ago

I can't reproduce locally :(

Nicholas Nethercote [inactive]

Reporter

Comment 6

•

12 years ago

Attached file Valgrind output from Dec 2 (successful) — Details

The successful run on Dec 2 featured only two processes: 2419 and 2422. I.e. 2419 forked once. The failing run on Dec 3 featured three processes: 1814, 1817, 1826. I.e. 1814 forked twice. All the new leaks are in the 1826 process. The familiar ones (CompileScript et al) will be due to the familiar reason: process 1826 isn't doing a full shutdown, so some blocks aren't freed, and the munged pointers to those blocks aren't recognized by Valgrind. I don't know how the std::_Deque_base ones fit it with that story, though. I'll do another --trace-syscalls=yes run (as in https://hg.mozilla.org/build/tools/rev/859904b596df) to see why this additional fork is occurring. Another data point: the failing run has this output which the successful run does not: > Home directory /builds not ours. > ==1826== Warning: invalid file descriptor 4086 in syscall close()

Nicholas Nethercote [inactive]

Reporter

Updated

•

12 years ago

Attachment #8342072 - Attachment description: Valgrind output from the run → Valgrind output from Dec 3 (failing)

Nicholas Nethercote [inactive]

Reporter

Updated

•

12 years ago

Attachment #8344230 - Attachment description: Valgrind output from the 2nd (successful) → Valgrind output from Dec 2 (successful)

Nicholas Nethercote [inactive]

Reporter

Comment 7

•

12 years ago

I pushed a change to use --trace-syscalls=yes: https://hg.mozilla.org/build/tools/rev/41508ca8b2a6 And then backed it out after a couple of runs: https://hg.mozilla.org/build/tools/rev/d0539ca7b0cf (The runs took *ages* -- I cancelled a couple after ~18 hours of running! Which is weird because when I did this in bug 940069 that didn't happen. No matter, I was able to get useful sufficient info to understand what's happening.) Firefox is doing a fork-and-exec of /usr/bin/pulseaudio, but that's failing, and the forked process then exits immediately. This is familiar from bug 940069, and explains the familiar (false positive) leaks. Hopefully the MessageLoop() one has the same cause. The simple fix probably just to install the pulseaudio executable on the build machines. (This is the second time we've had to install something on the build machines to satisfy Valgrind... if it happens a third time I'll investigate a deeper fix, either by making Valgrind not complain, or moving valgrind-on-tbpl runs to test machines :) An interesting question for kinetik and padenot: what changed on Dec 2nd to cause /usr/bin/pulseaudio to be fork-and-exec'd? None of the patches in the regression range (see comment 0) had "pulse" in their titles, but the following ones had "audio": 0575a10042e2 Jan Gerber — Bug 938686 - Support Opus in WebM. r=kinetik Support the Opus audio codec in the WebM (Matroska) container. This is part of the "WebM 2" proposed spec, which also includes the new VP9 video codec. Alas we weren't able to get concensus to change the doctype of filename extension to mark the revision allowing the new codecs. 66be4716e86e Paul Adenot — Bug 944707 - Stop locking when getting the preferred samplerate from the AudioStream. r=kinetik 72f2f506fe01 Shelly Lin — Bug 945135 - Part 1: Refactor of TrackEncoder and AudioTrackEncoder. r=roc 3117fa37236a Shawn Ku — Bug 931722 - Part 2: Extra mute request to RIL at AudioManager. r=mchen c7994af691f5 Robert O'Callahan — Bug 938022. Part 1: Update mAudioEndTime from SendStreamData. r=cpearce 81c66b34eb05 Paul Adenot — Bug 944132 - Make sure we don't loose precision when computing the audio clock when using opensl. r=sotaro

Nicholas Nethercote [inactive]

Reporter

Updated

•

12 years ago

Depends on: 947752

Nicholas Nethercote [inactive]

Reporter

Comment 8

•

12 years ago

> The simple fix probably just to install the pulseaudio executable on the > build machines. I filed bug 947752 for this.

Matthew Gregan [:kinetik]

Comment 9

•

12 years ago

It'll be bug 944707. That added a call to InitPreferredSampleRate() to AudioStream::InitLibrary(), which is called from nsLayoutStatics::Initialize(). InitPreferredSampleRate() initializes a libcubeb context if one is not already initialized (which is true at startup). That's unfortunate, because it means we spin up whatever resources are necessary to initialize a cubeb context (on Linux, that means a connection to PulseAudio, including starting PA if it's not running) on every startup. We should find a way to call InitPreferredSampleRate() lazily the first time the result is needed.

Blocks: 944707

No longer depends on: 947752

Matthew Gregan [:kinetik]

Updated

•

12 years ago

Depends on: 947752

Matthew Gregan [:kinetik]

Comment 10

•

12 years ago

I backed out the offending patch, we'll find a different way to fix it that doesn't cause this problem.

Julian Seward [:jseward]

Comment 11

•

12 years ago

(In reply to Nicholas Nethercote [:njn] from comment #7) Random observation: > (The runs took *ages* -- I cancelled a couple after ~18 hours of > running! You might like to use --fair-sched=yes (only works on Linux). This enforces round-robin scheduling of threads and in some cases avoids very large delays caused by unfair scheduling when using the original pipe-based sequentialisation scheme (which is still the default). Generally I've found --fair-sched=yes improves interactive behaviour of the browser when running it on Memcheck. I always use it now.

Nicholas Nethercote [inactive]

Reporter

Comment 12

•

12 years ago

kinetik's backout hopefully will have fixed this, but I can't tell because every Valgrind run on m-c since the backout has either crashed (bug 947671), or mysteriously timed out (bug 948145). Argh.

Nicholas Nethercote [inactive]

Reporter

Comment 13

•

12 years ago

The hangs and crashes are resolved. The back-out clearly fixed this.

Status: NEW → RESOLVED

Closed: 12 years ago

Resolution: --- → WORKSFORME

Bugzilla

Valgrind-on-TBPL: new leaks involving MessageLoop()

Categories

(Core :: General, defect)

Tracking

()

People

(Reporter: n.nethercote, Unassigned)

References

(Blocks 1 open bug)

Details

Crash Data

Security

(public)

User Story

Attachments

(2 files)

Description

Comment 1

Comment 2

Comment 3

Comment 4

Comment 5

Comment 6

Updated

Updated

Comment 7

Updated

Comment 8

Comment 9

Updated

Comment 10

Comment 11

Comment 12

Comment 13

Attachment

General

Description

File Name

Content Type

Valgrind output from Dec 3 (failing) 12 years ago Nicholas Nethercote [inactive] 17.32 KB, text/plain		Details
Valgrind output from Dec 2 (successful) 12 years ago Nicholas Nethercote [inactive] 2.13 KB, text/plain		Details