1069695 - Make the Valgrind test job work with e10s

Assignee

Description

•

10 years ago

By default, Valgrind doesn't trace into child processes. We'll need to change this for e10s. And there'll be some other stuff to deal with.

Nicholas Nethercote [inactive]

Assignee

Comment 1

•

10 years ago

Attached file TBPL run with e10s and --trace-children=yes enabled (obsolete) — Details

I did a try run with e10s enabled and Valgrind's --trace-children option on:
https://tbpl.mozilla.org/?tree=Try&rev=023c92badd58. The Valgrind output is
attached. There are three things that need dealing with.

First, there are lots of variants on this:

> ==19426== Mismatched free() / delete / delete []
> ==19426==    at 0x4C2806A: free (vg_replace_malloc.c:446)
> ==19426==    by 0x504E3C9: moz_free (mozalloc.cpp:46)
> ==19426==    by 0x504E346: operator delete(void*) (mozalloc.h:225)
> ==19426==    by 0x63D7566: CommandLine::IsSwitch(std::string const&, std::string*, std::string*) (basic_string.h:246)
> ...
> ==19426==  Address 0x12006a30 is 0 bytes inside a block of size 27 alloc'd
> ==19426==    at 0x4C29B0B: operator new(unsigned long) (vg_replace_malloc.c:298)
> ==19426==    by 0xA7623C8: std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&) (in /usr/lib64/libstdc++.so.6.0.13)
> ==19426==    by 0xA762DE4: ??? (in /usr/lib64/libstdc++.so.6.0.13)
> ==19426==    by 0xA762F32: std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, std::allocator<char> const&) (in /usr/lib64/libstdc++.so.6.0.13)
> ==19426==    by 0x63D7380: CommandLine::IsSwitch(std::string const&, std::string*, std::string*) (command_line.cc:142)
> ...

Something's confusing Valgrind here. The code is correctly using |operator new|
together with |operator delete|. For the latter, we're somehow seeing the
underlying |moz_free| and |free| calls, but we don't see that for the former.

All the errors occur with std::string. I guess we only see these when we enable
e10s because they all occur in the Chromium IPC code -- normal Mozilla code
avoids std::string.

Second, there are lots of variants on this:

> ==19385== Syscall param sendmsg(msg.msg_iov[0]) points to uninitialised byte(s)
> ==19385==    at 0x4E3EADD: ??? (in /lib64/libpthread-2.12.so)
> ==19385==    by 0x856B8ED: IPC::Channel::ChannelImpl::ProcessOutgoingMessages() (ipc_channel_posix.cc:719)
> ==19385==    by 0x856C9BA: IPC::Channel::ChannelImpl::Send(IPC::Message*) (ipc_channel_posix.cc:795)
> ==19385==    by 0x856C9CA: IPC::Channel::Send(IPC::Message*) (ipc_channel_posix.cc:994)

This happens when structs with uninitialized gaps are copied between processes
via IPC syscalls. This behaviour is legitimate but violates Valgrind's
assumption that all syscall inputs should be defined. Most of the cases seems
to involve nsHttpChannel, which is a large class with lots of members, many via
inheritance.

I see two options:
- Find a way to zero these gaps. This might be tricky.
- Suppress it. Not as nice, but might be necessary.

Finally, there are lots of reported leaks in the child process. I suspect this
is lots of stuff we deliberately don't free, and we need to set
MOZ_CC_RUN_DURING_SHUTDOWN for the child process to force freeing, or something
like that.

Andrew McCreight [:mccr8]

Comment 2

•

10 years ago

Child processes in opt builds call exit(0) fairly early in the shutdown process
  http://mxr.mozilla.org/mozilla-central/source/dom/ipc/ContentChild.cpp#1562
So yeah there's going to be a lot of leaks.  I'm not sure there's any easy fix. :(  Maybe you can add some kind of environment variable that skips that QuickExit.  Of course, we crash later right now in debug builds, too, so you'll still get leaks...

Nicholas Nethercote [inactive]

Assignee

Comment 3

•

10 years ago

Just to complicate things: doing a local run, I don't see the alloc/free mismatch errors, and I see fewer variants of each of the sendmsg and leak errors. Hmph.

Nicholas Nethercote [inactive]

Assignee

Comment 4

•

10 years ago

I tried zeroing the realloc'd memory in Pickle::Resize() but it doesn't help. Which means that undefined values must be written over those zeroes. The undefined values could be padding, or could be genuine undefined values. I'll have to try some client requests to get a better idea of what's happening.

Brad Lassey [:blassey] (use needinfo?)

Updated

•

10 years ago

tracking-e10s: ? → +

Julian Seward [:jseward]

Comment 5

•

10 years ago

(In reply to Nicholas Nethercote [:njn] from comment #1)
> First, there are lots of variants on this:
> [...]
> Something's confusing Valgrind here. The code is correctly using |operator
> new|
> together with |operator delete|. For the latter, we're somehow seeing the
> underlying |moz_free| and |free| calls, but we don't see that for
> the former.

I think this might be the result of differential inlining, where we
have

  operator new (...) { ... moz_alloc(...) ... }
  operator delete (...) { ... moz_free(...) ... }

and one but not both (or neither) of these gets inlined.  Hence V
winds up seeing calls to new matched by calls to moz_free, or vice
versa.

There's no sane way to suppress these, and I got fed up with looking
at them, so I added --show-mismatched-frees=no|yes [yes] for V 3.10.
I use it (set to =no) for V/Mochi runs now.  It's a bit suboptimal
though, because I think it might sometimes hide real bugs.  I remember
JimB mentioning that such mismatches cause crashing on Win8 and that V
managed to find one that was causing him problems.  Or something like
that.

Julian Seward [:jseward]

Comment 6

•

10 years ago

(In reply to Nicholas Nethercote [:njn] from comment #4)
> I tried zeroing the realloc'd memory in Pickle::Resize() but it doesn't
> help. Which means that undefined values must be written over those
> zeroes.

I tried that too, and came to the same conclusion.  I think we
discussed this a bit via Vidyo recently, btw.  Perhaps a reasonable
solution is to find the origins of the structs sent over IPC and zero
them out at allocation time.

I have been doing non-e10s Mochi runs on V for the past couple of
months.  I can change to doing e10s runs and would be happy to try to
fix this up.

Nicholas Nethercote [inactive]

Assignee

Comment 7

•

10 years ago

Attached patch (DON'T LAND) Add some code to help understand Valgrind's complaints (obsolete) — Details — Splinter Review

Here's a patch with which I've been doing some digging. It's hard to understand
what's going on because so much of this code is generated from the .ipdl files.
Nonetheless, in this generated code it looks like all the various structs get
written into the to-be-sent buffers on a field by field basis. So zeroing the
buffer initially *should* suffice.

Furthermore, the complaints I was getting with this patch applied suggested
that there are actually one or two uninitialized fields being copied into the
buffer, so these errors might not be false positives after all.

Nicholas Nethercote [inactive]

Assignee

Comment 8

•

10 years ago

Attached patch Force-enable e10s — Details — Splinter Review

This patch force-enables e10s, which is useful.

Nicholas Nethercote [inactive]

Assignee

Comment 9

•

10 years ago

Attached patch Use --trace-children=yes in |mach valgrind-test| — Details — Splinter Review

Nicholas Nethercote [inactive]

Assignee

Comment 10

•

10 years ago

Attached file njn's local suppressions file — Details

I run the valgrind test with |mach valgrind-test --suppressions valgrind-test.supp|. I also need the patch in bug 1069034 applied when using 3.10.

Nicholas Nethercote [inactive]

Assignee

Comment 11

•

10 years ago

Julian: using |mach valgrind-test| is a lot easier than doing mochitest runs, once you have it set up. I've attached all the patches I've been using with it. I'm planning to look again at this and grovel through the generated code more carefully once I've finished my current batch of DMD work, but if you want to look at it in the meantime that's fine by me :)

Nicholas Nethercote [inactive]

Assignee

Updated

•

10 years ago

Depends on: 1098185

Nicholas Nethercote [inactive]

Assignee

Updated

•

10 years ago

Depends on: 1098703

Nicholas Nethercote [inactive]

Assignee

Comment 12

•

10 years ago

Comment on attachment 8496330 [details] [diff] [review]
Use --trace-children=yes in |mach valgrind-test|

--trace-children doesn't cause problems with e10s off, and we'll need it eventually, so I want to land it now just to get it done.

Attachment #8496330 - Flags: review?(jseward)

Nicholas Nethercote [inactive]

Assignee

Updated

•

10 years ago

Depends on: 1100205

Julian Seward [:jseward]

Updated

•

10 years ago

Attachment #8496330 - Flags: review?(jseward) → review+

Nicholas Nethercote [inactive]

Assignee

Comment 13

•

10 years ago

This is the addition of --trace-children=yes to the Valgrind test job invocation:
https://hg.mozilla.org/integration/mozilla-inbound/rev/328195791752

Keywords: leave-open

Nicholas Nethercote [inactive]

Assignee

Updated

•

10 years ago

Attachment #8496327 - Attachment is obsolete: true

Nicholas Nethercote [inactive]

Assignee

Comment 14

•

10 years ago

Attached patch Turn on e10s for the Valgrind test job — Details — Splinter Review

This was just a guess, but it seems to work.

I don't know what is the difference between browser.tabs.remote.autostart and
browser.tabs.remote.autostart.1.

Attachment #8525063 - Flags: review?(mh+mozilla)

Mike Hommey [:glandium]

Comment 15

•

10 years ago

Comment on attachment 8525063 [details] [diff] [review]
Turn on e10s for the Valgrind test job

Review of attachment 8525063 [details] [diff] [review]:
-----------------------------------------------------------------

::: build/valgrind/mach_commands.py
@@ +69,5 @@
>              prefs.update(Preferences.read_prefs(prefpath))
>              interpolation = { 'server': '%s:%d' % httpd.httpd.server_address,
>                                'OOP': 'false'}
>              prefs = json.loads(json.dumps(prefs) % interpolation)
> +            prefs['browser.tabs.remote.autostart'] = True

I can only say this looks like a good place to add it, not that it's the right pref to turn on (although I think it is).

(BTW, that OOP above is not used)

Attachment #8525063 - Flags: review?(wmccloskey)

Attachment #8525063 - Flags: review?(mh+mozilla)

Attachment #8525063 - Flags: review+

Carsten Book [:Tomcat]

Comment 16

•

10 years ago

https://hg.mozilla.org/mozilla-central/rev/328195791752

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 17

•

10 years ago

Comment on attachment 8525063 [details] [diff] [review]
Turn on e10s for the Valgrind test job

Review of attachment 8525063 [details] [diff] [review]:
-----------------------------------------------------------------

Yes, this is the right pref.

autostart.1 is only for turning e10s on for nightly users. The idea is that, once we've fixed a bunch of bugs, we'll turn on autostart.2 by default. Then all the people who disabled e10s when we originally enabled it on nightly (via autostart.1) will have it enabled again. We'll keep doing this until enough bugs are fixed that people stop turning it off.

Attachment #8525063 - Flags: review?(wmccloskey) → review+

Nicholas Nethercote [inactive]

Assignee

Comment 18

•

10 years ago

Attached patch (DRAFT) Don't exit(0) in the content process (obsolete) — Details — Splinter Review

Here's an update.

The undefined value errors are now all fixed. Hooray.
  
The mismatched alloc/free errors have disappeared. I don't know why. Hooray.
  
That leaves just the leaks from the content process. They vary a lot; for
example, see the multiple runs at
https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=e5d11a7749ed

If you look at the first run there, you'll see info from three processes: the
parent (pid 24937), a short-lived process at start-up that does something to
detect the graphics driver (pid 24940), and the content process (pid 24987). 
The first two have a "HEAP SUMMARY", "LEAK SUMMARY" and "ERROR SUMMARY"; the
content process has just a "HEAP SUMMARY", but there are leaks reported for it 
nonetheless. I don't know what to make of the missing summaries. (Locally, I
get all three summaries for the content process, like I'd expect.)

Furthermore, if I apply the attached patch that neuters two of the exit(0)
calls done by the child process, then the test passes! E.g. see the multiple
runs at
https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=176f385ca02a

But if you look at the logs, the content process has *none* of the three 
summaries. (And I get the same result locally.) I don't know what this means,
but I'm suspicious that something is going wrong in the child process and it's
not actually checking like it should.

Julian, any idea what might be happening?

Attachment #8526533 - Flags: feedback?(jseward)

Nicholas Nethercote [inactive]

Assignee

Comment 19

•

10 years ago

I added this code to ActorDestroy():

 void
 ContentChild::ActorDestroy(ActorDestroyReason why)
 {
+    // njn: this should give an undefined value error, and a leak(?)
+    int* x = (int*) malloc(100000 * sizeof(int));
+    fprintf(stderr, "ACTOR DESTROY: %d\n", x[0]);

It triggered in the content process and I got an undefined value error from Valgrind about it. But I didn't get a leak error, and again the three SUMMARY lines didn't show up.

Nicholas Nethercote [inactive]

Assignee

Comment 20

•

10 years ago

> It triggered in the content process and I got an undefined value error from
> Valgrind about it. But I didn't get a leak error, and again the three
> SUMMARY lines didn't show up.

VG_(terminate_NORETURN)() isn't being hit in the content process.

Julian Seward [:jseward]

Comment 21

•

10 years ago

(In reply to Nicholas Nethercote [:njn] from comment #18)
> Julian, any idea what might be happening?

No, but the following does occur to me.  Presumably there is some
logging mechanism that collects the output of the run, so as to create
the the log file you show.  Presumably also, that needs to wait for
both the parent and content processes to finish before stopping
"listening".  Is it possible that this logger could have stopped
before the content process finishes?

For this to work correctly, either:

* the logger must wait for parent and all content processes to finish, or

* the parent process must wait for all content processes to finish,
  and the logger must wait for the parent to finish.

w/ apologies if this is a complete red herring.

Nicholas Nethercote [inactive]

Assignee

Comment 22

•

10 years ago

> Is it possible that this logger could have stopped before the content process finishes?

I doubt it. I'm pretty sure it just hoovers up everything that goes to stdout and stderr. And the parent process's SUMMARY stuff comes after where the child process's SUMMARY stuff is supposed to occur, so the logger is clearly still working.

I'll add some debugging printfs to Valgrind in order to work out just where/how the child process is terminating.

Nicholas Nethercote [inactive]

Assignee

Comment 23

•

10 years ago

I've confirmed it's not related to the logger -- I see similar behaviour when running things locally and directly, i.e. not using |mach valgrind-test|.

Doing it that way, sometimes I get no SUMMARY for the content process. Doing that, the final thing printed by the content process with -d was:

--16637:1:syswrap- thread_wrapper(tid=17): exit, schedreturncode VgSrc_ExitThread
--16637:1:syswrap- run_a_thread_NORETURN(tid=17): post-thread_wrapper
--16637:1:syswrap- run_a_thread_NORETURN(tid=17): not last one standing

And sometimes I get just the HEAP SUMMARY. In one such case, the final thing printed by the content process with -d was:

==6145==  
==6145== 24 bytes in 1 blocks are definitely lost in loss record 3,012 of 10,308
==6145==    at 0x4C2ABBD: malloc (vg_replace_malloc.c:299)
==6145==    by 0x81469BE: js_malloc (Utility.h:99)
==6145==    by 0x81469BE: js_pod_malloc<unsigned char> (Utility.h:560)

That's a partial leak report, which is really weird. It feels like the content process is being killed in such a way that cuts Valgrind off right in the middle of things.

Nicholas Nethercote [inactive]

Assignee

Comment 24

•

10 years ago

Here's debug output from another example where no SUMMARY was produced. The
first three lines are debugging printfs I put in XRE_ShutdownChildProcess().
After that you can see that threads start getting killed, but there are still
21 threads present when the output ends.

> njn: XRE_ShutdownChildProcess
> njn: XRE_ShutdownChildProcess: pre-Quit
> njn: XRE_ShutdownChildProcess: post-Quit
> --24179:1:syswrap- thread_wrapper(tid=29): exit, schedreturncode VgSrc_ExitThread
> --24179:1:syswrap- run_a_thread_NORETURN(tid=29): post-thread_wrapper
> --24179:1:syswrap- run_a_thread_NORETURN(tid=29): not last one standing (c = 29)
> --24179:1:syswrap- run_a_thread_NORETURN(tid=29): pre-thread_wrapper
> --24179:1:syswrap- thread_wrapper(tid=29): entry
> --24179:1:syswrap- run_a_thread_NORETURN(tid=30): pre-thread_wrapper
> --24179:1:syswrap- thread_wrapper(tid=30): entry
> --24179:1:syswrap- thread_wrapper(tid=25): exit, schedreturncode VgSrc_ExitThread
> --24179:1:syswrap- run_a_thread_NORETURN(tid=25): post-thread_wrapper
> --24179:1:syswrap- run_a_thread_NORETURN(tid=25): not last one standing (c = 30)
> --24179:1:syswrap- run_a_thread_NORETURN(tid=25): pre-thread_wrapper
> --24179:1:syswrap- thread_wrapper(tid=25): entry
> --24179:1:aspacem  allocated thread stack at 0x82cfb5000 size 1064960
> --24179:1:syswrap- run_a_thread_NORETURN(tid=31): pre-thread_wrapper
> --24179:1:syswrap- thread_wrapper(tid=31): entry
> --24179:1:syswrap- thread_wrapper(tid=29): exit, schedreturncode VgSrc_ExitThread
> --24179:1:syswrap- run_a_thread_NORETURN(tid=29): post-thread_wrapper
> --24179:1:syswrap- run_a_thread_NORETURN(tid=29): not last one standing (c = 31)
> --24179:1:syswrap- thread_wrapper(tid=30): exit, schedreturncode VgSrc_ExitThread
> --24179:1:syswrap- run_a_thread_NORETURN(tid=30): post-thread_wrapper
> --24179:1:syswrap- run_a_thread_NORETURN(tid=30): not last one standing (c = 30)
> --24179:1:syswrap- thread_wrapper(tid=25): exit, schedreturncode VgSrc_ExitThread
> --24179:1:syswrap- run_a_thread_NORETURN(tid=25): post-thread_wrapper
> --24179:1:syswrap- run_a_thread_NORETURN(tid=25): not last one standing (c = 29)
> --24179:1:syswrap- thread_wrapper(tid=31): exit, schedreturncode VgSrc_ExitThread
> --24179:1:syswrap- run_a_thread_NORETURN(tid=31): post-thread_wrapper
> --24179:1:syswrap- run_a_thread_NORETURN(tid=31): not last one standing (c = 28)
> --24179:1:syswrap- thread_wrapper(tid=18): exit, schedreturncode VgSrc_ExitThread
> --24179:1:syswrap- run_a_thread_NORETURN(tid=18): post-thread_wrapper
> --24179:1:syswrap- run_a_thread_NORETURN(tid=18): not last one standing (c = 27)
> --24179:1:syswrap- thread_wrapper(tid=26): exit, schedreturncode VgSrc_ExitThread
> --24179:1:syswrap- run_a_thread_NORETURN(tid=26): post-thread_wrapper
> --24179:1:syswrap- run_a_thread_NORETURN(tid=26): not last one standing (c = 26)
> --24179:1:syswrap- thread_wrapper(tid=27): exit, schedreturncode VgSrc_ExitThread
> --24179:1:syswrap- run_a_thread_NORETURN(tid=27): post-thread_wrapper
> --24179:1:syswrap- run_a_thread_NORETURN(tid=27): not last one standing (c = 25)
> --24179:1:syswrap- thread_wrapper(tid=28): exit, schedreturncode VgSrc_ExitThread
> --24179:1:syswrap- run_a_thread_NORETURN(tid=28): post-thread_wrapper
> --24179:1:syswrap- run_a_thread_NORETURN(tid=28): not last one standing (c = 24)
> --24179:1:syswrap- thread_wrapper(tid=20): exit, schedreturncode VgSrc_ExitThread
> --24179:1:syswrap- run_a_thread_NORETURN(tid=20): post-thread_wrapper
> --24179:1:syswrap- run_a_thread_NORETURN(tid=20): not last one standing (c = 23)
> --24179:1:syswrap- thread_wrapper(tid=23): exit, schedreturncode VgSrc_ExitThread
> --24179:1:syswrap- run_a_thread_NORETURN(tid=23): post-thread_wrapper
> --24179:1:syswrap- run_a_thread_NORETURN(tid=23): not last one standing (c = 22)
> --24179:1:syswrap- thread_wrapper(tid=24): exit, schedreturncode VgSrc_ExitThread
> --24179:1:syswrap- run_a_thread_NORETURN(tid=24): post-thread_wrapper
> --24179:1:syswrap- run_a_thread_NORETURN(tid=24): not last one standing (c = 21)
> --24179:1:mallocfr newSuperblock at 0x82898F000 (pszB   65504)  owner VALGRIND/ttaux
> --24179:1:mallocfr newSuperblock at 0x82D0B9000 (pszB  659424) unsplittable owner VALGRIND/ttaux
> --24179:1:mallocfr reclaimSuperblock at 0x82CF4A000 (pszB  438240) unsplittable owner VALGRIND/ttaux

Nicholas Nethercote [inactive]

Assignee

Comment 25

•

10 years ago

Aha! Julian suggested I try --trace-signals and it quickly showed that the parent process is sending |kill -9| to the content process. Turns out there's a class called ChildGrimReaper which does this, and it gives the content process two seconds (kMaxWaitMs) to finish up before killing it. With Valgrind running things are much slower, plus the content process does leak checking, so two seconds isn't nearly enough. If I change it to 20 seconds I get all three SUMMARY lines for the child, as expected. Yay.

But the bad news is that the content process has bazillions of leaks, even when it doesn't do the exit(0). Boo.

(not currently active) Ted Mielczarek

Comment 26

•

10 years ago

I know mccr8 was looking to fix the leak logs in child processes thing, but maybe we could special-case Valgrind jobs with an environment variable that disables the ChildGrimReaper?

Andrew McCreight [:mccr8]

Comment 27

•

10 years ago

(In reply to Nicholas Nethercote [:njn] from comment #25)
> But the bad news is that the content process has bazillions of leaks, even
> when it doesn't do the exit(0). Boo.

Could you attach a log here? I'd be curious to see what it looks like.

Nicholas Nethercote [inactive]

Assignee

Comment 28

•

10 years ago

Attached file log: non-e10s, without MOZ_CC_RUN_DURING_SHUTDOWN — Details

Nicholas Nethercote [inactive]

Assignee

Updated

•

10 years ago

Attachment #8528690 - Attachment description: log: non-e10s → log: non-e10s, without MOZ_CC_RUN_DURING_SHUTDOWN

Nicholas Nethercote [inactive]

Assignee

Comment 29

•

10 years ago

Attached file log: e10s — Details

Nicholas Nethercote [inactive]

Assignee

Comment 30

•

10 years ago

Attached file log: e10s, without QuickExit, with slower GrimReaper — Details

Nicholas Nethercote [inactive]

Assignee

Updated

•

10 years ago

Attachment #8526533 - Attachment is obsolete: true

Attachment #8526533 - Flags: feedback?(jseward)

Nicholas Nethercote [inactive]

Assignee

Updated

•

10 years ago

Attachment #8491949 - Attachment is obsolete: true

Nicholas Nethercote [inactive]

Assignee

Comment 31

•

10 years ago

I just attached three log files.

The first log file is from a non-10s build without MOZ_CC_RUN_DURING_SHUTDOWN.
This shows the kind of leaks Valgrind finds in that situation. Here are the top
ten leak sites, with frequencies:

> 365 counts:
> (  1) 39 (10.7%, 10.7%): TEST-UNEXPECTED-FAIL | valgrind-test | 110 (96 direct, 14 indirect) bytes in 1 blocks are definitely lost at malloc / CreateFunctionPrototype / js::GlobalObject::resolveConstructor / js::GlobalObject::ensureConstructor
> (  2) 36 ( 9.9%, 20.5%): TEST-UNEXPECTED-FAIL | valgrind-test | 108 (96 direct, 12 indirect) bytes in 1 blocks are definitely lost at malloc / js::frontend::CreateScriptSourceObject / js::CloneScript / JSRuntime::cloneSelfHostedFunctionScript
> (  3) 15 ( 4.1%, 24.7%): TEST-UNEXPECTED-FAIL | valgrind-test | 720 (16 direct, 704 indirect) bytes in 1 blocks are definitely lost at malloc / moz_xmalloc / xpc::CreateGlobalObject / XPCWrappedNative::WrapNewGlobal
> (  4) 11 ( 3.0%, 27.7%): TEST-UNEXPECTED-FAIL | valgrind-test | 220 (192 direct, 28 indirect) bytes in 2 blocks are definitely lost at malloc / CreateFunctionPrototype / js::GlobalObject::resolveConstructor / js::GlobalObject::ensureConstructor
> (  5)  5 ( 1.4%, 29.0%): TEST-UNEXPECTED-FAIL | valgrind-test | 136 (96 direct, 40 indirect) bytes in 1 blocks are definitely lost at malloc / js::frontend::CreateScriptSourceObject / js::frontend::CompileScript / JS::Compile
> (  6)  4 ( 1.1%, 30.1%): TEST-UNEXPECTED-FAIL | valgrind-test | 1,440 (32 direct, 1,408 indirect) bytes in 2 blocks are definitely lost at malloc / moz_xmalloc / xpc::CreateGlobalObject / XPCWrappedNative::WrapNewGlobal
> (  7)  4 ( 1.1%, 31.2%): TEST-UNEXPECTED-FAIL | valgrind-test | 848 (16 direct, 832 indirect) bytes in 1 blocks are definitely lost at malloc / moz_xmalloc / xpc::CreateGlobalObject / XPCWrappedNative::WrapNewGlobal
> (  8)  3 ( 0.8%, 32.1%): TEST-UNEXPECTED-FAIL | valgrind-test | 1,104 (16 direct, 1,088 indirect) bytes in 1 blocks are definitely lost at malloc / moz_xmalloc / xpc::CreateGlobalObject / XPCWrappedNative::WrapNewGlobal
> (  9)  3 ( 0.8%, 32.9%): TEST-UNEXPECTED-FAIL | valgrind-test | 166 (96 direct, 70 indirect) bytes in 1 blocks are definitely lost at malloc / js::frontend::CreateScriptSourceObject / js::frontend::CompileScript / JS::Compile
> ( 10)  3 ( 0.8%, 33.7%): TEST-UNEXPECTED-FAIL | valgrind-test | 131 (96 direct, 35 indirect) bytes in 1 blocks are definitely lost at malloc / js::frontend::CreateScriptSourceObject / js::frontend::CompileScript / JS::Compile

Lots of JS stuff, relating to global objects, scripts, functions, etc.

The second log file is from an e10s build which does exit(0). More leak site
frequencies:

> 73 counts:
> (  1) 14 (19.2%, 19.2%): TEST-UNEXPECTED-FAIL | valgrind-test | 110 (96 direct, 14 indirect) bytes in 1 blocks are definitely lost at malloc / CreateFunctionPrototype / js::GlobalObject::resolveConstructor / js::GlobalObject::ensureConstructor
> (  2)  5 ( 6.8%, 26.0%): TEST-UNEXPECTED-FAIL | valgrind-test | 108 (96 direct, 12 indirect) bytes in 1 blocks are definitely lost at malloc / js::frontend::CreateScriptSourceObject / js::CloneScript / JSRuntime::cloneSelfHostedFunctionScript
> (  3)  4 ( 5.5%, 31.5%): TEST-UNEXPECTED-FAIL | valgrind-test | 720 (16 direct, 704 indirect) bytes in 1 blocks are definitely lost at malloc / moz_xmalloc / xpc::CreateGlobalObject / XPCWrappedNative::WrapNewGlobal
> (  4)  4 ( 5.5%, 37.0%): TEST-UNEXPECTED-FAIL | valgrind-test | 40 bytes in 1 blocks are definitely lost at malloc / g_malloc / g_slice_alloc / gdk_region_new
> (  5)  3 ( 4.1%, 41.1%): TEST-UNEXPECTED-FAIL | valgrind-test | 220 (192 direct, 28 indirect) bytes in 2 blocks are definitely lost at malloc / CreateFunctionPrototype / js::GlobalObject::resolveConstructor / js::GlobalObject::ensureConstructor
> (  6)  2 ( 2.7%, 43.8%): TEST-UNEXPECTED-FAIL | valgrind-test | 131 (96 direct, 35 indirect) bytes in 1 blocks are definitely lost at malloc / js::frontend::CreateScriptSourceObject / js::frontend::CompileScript / JS::Compile
> (  7)  2 ( 2.7%, 46.6%): TEST-UNEXPECTED-FAIL | valgrind-test | 168 (96 direct, 72 indirect) bytes in 1 blocks are definitely lost at malloc / js::frontend::CreateScriptSourceObject / js::frontend::CompileScript / JS::Compile
> (  8)  2 ( 2.7%, 49.3%): TEST-UNEXPECTED-FAIL | valgrind-test | 1,440 (32 direct, 1,408 indirect) bytes in 2 blocks are definitely lost at malloc / moz_xmalloc / xpc::CreateGlobalObject / XPCWrappedNative::WrapNewGlobal
> (  9)  2 ( 2.7%, 52.1%): TEST-UNEXPECTED-FAIL | valgrind-test | 770 (672 direct, 98 indirect) bytes in 7 blocks are definitely lost at malloc / CreateFunctionPrototype / js::GlobalObject::resolveConstructor / js::GlobalObject::ensureConstructor
> ( 10)  2 ( 2.7%, 54.8%): TEST-UNEXPECTED-FAIL | valgrind-test | 976 (16 direct, 960 indirect) bytes in 1 blocks are definitely lost at malloc / moz_xmalloc / xpc::CreateGlobalObject / XPCWrappedNative::WrapNewGlobal

Not as many, but the leak sites are very similar to the first log file.

The third log file is from an e10s build, which doesn't do exit(0), and has had
the GrimReaper timeout increased from 2s to 60s so that Valgrind has time to do
its leak checking. The leak site frequencies:

> 1360 counts:
> (  1) 122 ( 9.0%,  9.0%): TEST-UNEXPECTED-FAIL | valgrind-test | 64 bytes in 1 blocks are definitely lost at malloc / js::Nursery::allocateSlots / js::NativeObject::growSlots / js::NativeObject::setLastProperty
> (  2)  61 ( 4.5%, 13.5%): TEST-UNEXPECTED-FAIL | valgrind-test | 2,112 bytes in 33 blocks are definitely lost at malloc / js::Nursery::allocateSlots / js::NativeObject::growSlots / js::NativeObject::setLastProperty
> (  3)  60 ( 4.4%, 17.9%): TEST-UNEXPECTED-FAIL | valgrind-test | 24 bytes in 1 blocks are definitely lost at malloc / js::ProxyObject::New / js::NewProxyObject / js::Wrapper::New
> (  4)  49 ( 3.6%, 21.5%): TEST-UNEXPECTED-FAIL | valgrind-test | 280 (24 direct, 256 indirect) bytes in 1 blocks are definitely lost at malloc / js::Shape::hashify / js::NativeObject::lookup / js::baseops::GetProperty
> (  5)  35 ( 2.6%, 24.0%): TEST-UNEXPECTED-FAIL | valgrind-test | 48 bytes in 1 blocks are definitely lost at calloc / JSScript::makeTypes / js::jit::BaselineCompiler::compile / js::jit::BaselineCompile
> (  6)  31 ( 2.3%, 26.3%): TEST-UNEXPECTED-FAIL | valgrind-test | 64 bytes in 1 blocks are definitely lost at malloc / js::Nursery::allocateSlots / js::NativeObject::growSlots / js::NativeObject::setSlotSpan
> (  7)  30 ( 2.2%, 28.5%): TEST-UNEXPECTED-FAIL | valgrind-test | 56 (24 direct, 32 indirect) bytes in 1 blocks are definitely lost at malloc / js::Shape::hashify / js::NativeObject::toDictionaryMode / js_fun_bind
> (  8)  20 ( 1.5%, 30.0%): TEST-UNEXPECTED-FAIL | valgrind-test | 8 bytes in 1 blocks are definitely lost at calloc / JSScript::partiallyInit / JSScript::fullyInitFromEmitter / js::frontend::EmitFunctionScript
> (  9)  18 ( 1.3%, 31.3%): TEST-UNEXPECTED-FAIL | valgrind-test | 88 (24 direct, 64 indirect) bytes in 1 blocks are definitely lost at malloc / js::PropertyTree::insertChild / js::PropertyTree::getChild / js::NativeObject::getChildProperty
> ( 10)  17 ( 1.2%, 32.6%): TEST-UNEXPECTED-FAIL | valgrind-test | 128 bytes in 2 blocks are definitely lost at malloc / js::Nursery::allocateSlots / js::NativeObject::growSlots / js::NativeObject::setLastProperty

More JS stuff, but the leak sites are entirely different. And the number of
leaks is *much* higher than with exit(0) -- 1360 vs 73!

Andrew McCreight [:mccr8]

Comment 32

•

10 years ago

That's really weird.  I'd guess we're still not running enough shutdown CC+GCs somehow, but maybe the JS engine just shuts down in a different way that leaks slots in content processes?

Chris Peterson [:cpeterson]

Updated

•

8 years ago

Priority: -- → P3

Chris Peterson [:cpeterson]

Updated

•

8 years ago

Blocks: e10s-tests

BugBot [:suhaib / :marco/ :calixte]

Comment 33

•

6 years ago

The leave-open keyword is there and there is no activity for 6 months.
:njn, maybe it's time to close this bug?

Flags: needinfo?(n.nethercote)

Nicholas Nethercote [inactive]

Assignee

Comment 34

•

6 years ago

I suspect the test job is running e10s naturally now, given that pref-flipping is no longer required to enable e10s. I looked at logs of a recent run of this job and I saw Valgrind output for two processes. I think it's good enough.

Status: NEW → RESOLVED

Closed: 6 years ago

Flags: needinfo?(n.nethercote)

Resolution: --- → WORKSFORME

BugBot [:suhaib / :marco/ :calixte]

Updated

•

6 years ago

Keywords: leave-open

TBPL run with e10s and --trace-children=yes enabled 10 years ago Nicholas Nethercote [inactive] 709.54 KB, text/plain		Details
(DON'T LAND) Add some code to help understand Valgrind's complaints 10 years ago Nicholas Nethercote [inactive] 2.35 KB, patch		Details \| Diff \| Splinter Review
Force-enable e10s 10 years ago Nicholas Nethercote [inactive] 1.17 KB, patch		Details \| Diff \| Splinter Review
Use --trace-children=yes in \|mach valgrind-test\| 10 years ago Nicholas Nethercote [inactive] 1.26 KB, patch	jseward : review+	Details \| Diff \| Splinter Review
njn's local suppressions file 10 years ago Nicholas Nethercote [inactive] 3.58 KB, text/plain		Details
Turn on e10s for the Valgrind test job 10 years ago Nicholas Nethercote [inactive] 1.33 KB, patch	billm : review+ glandium : review+	Details \| Diff \| Splinter Review
(DRAFT) Don't exit(0) in the content process 10 years ago Nicholas Nethercote [inactive] 1.44 KB, patch		Details \| Diff \| Splinter Review
log: non-e10s, without MOZ_CC_RUN_DURING_SHUTDOWN 10 years ago Nicholas Nethercote [inactive] 2.52 MB, text/plain		Details
log: e10s 10 years ago Nicholas Nethercote [inactive] 491.93 KB, text/plain		Details
log: e10s, without QuickExit, with slower GrimReaper 10 years ago Nicholas Nethercote [inactive] 581.27 KB, text/plain		Details