Closed Bug 1069695 Opened 10 years ago Closed 5 years ago

Make the Valgrind test job work with e10s

Categories

(Testing :: General, defect, P3)

x86_64
Linux
defect

Tracking

(e10s+)

RESOLVED WORKSFORME
Tracking Status
e10s + ---

People

(Reporter: n.nethercote, Assigned: n.nethercote)

References

(Blocks 1 open bug)

Details

Attachments

(7 files, 3 obsolete files)

By default, Valgrind doesn't trace into child processes. We'll need to change this for e10s. And there'll be some other stuff to deal with.
I did a try run with e10s enabled and Valgrind's --trace-children option on:
https://tbpl.mozilla.org/?tree=Try&rev=023c92badd58. The Valgrind output is
attached. There are three things that need dealing with.

First, there are lots of variants on this:

> ==19426== Mismatched free() / delete / delete []
> ==19426==    at 0x4C2806A: free (vg_replace_malloc.c:446)
> ==19426==    by 0x504E3C9: moz_free (mozalloc.cpp:46)
> ==19426==    by 0x504E346: operator delete(void*) (mozalloc.h:225)
> ==19426==    by 0x63D7566: CommandLine::IsSwitch(std::string const&, std::string*, std::string*) (basic_string.h:246)
> ...
> ==19426==  Address 0x12006a30 is 0 bytes inside a block of size 27 alloc'd
> ==19426==    at 0x4C29B0B: operator new(unsigned long) (vg_replace_malloc.c:298)
> ==19426==    by 0xA7623C8: std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&) (in /usr/lib64/libstdc++.so.6.0.13)
> ==19426==    by 0xA762DE4: ??? (in /usr/lib64/libstdc++.so.6.0.13)
> ==19426==    by 0xA762F32: std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, std::allocator<char> const&) (in /usr/lib64/libstdc++.so.6.0.13)
> ==19426==    by 0x63D7380: CommandLine::IsSwitch(std::string const&, std::string*, std::string*) (command_line.cc:142)
> ...

Something's confusing Valgrind here. The code is correctly using |operator new|
together with |operator delete|. For the latter, we're somehow seeing the
underlying |moz_free| and |free| calls, but we don't see that for the former.

All the errors occur with std::string. I guess we only see these when we enable
e10s because they all occur in the Chromium IPC code -- normal Mozilla code
avoids std::string.

Second, there are lots of variants on this:

> ==19385== Syscall param sendmsg(msg.msg_iov[0]) points to uninitialised byte(s)
> ==19385==    at 0x4E3EADD: ??? (in /lib64/libpthread-2.12.so)
> ==19385==    by 0x856B8ED: IPC::Channel::ChannelImpl::ProcessOutgoingMessages() (ipc_channel_posix.cc:719)
> ==19385==    by 0x856C9BA: IPC::Channel::ChannelImpl::Send(IPC::Message*) (ipc_channel_posix.cc:795)
> ==19385==    by 0x856C9CA: IPC::Channel::Send(IPC::Message*) (ipc_channel_posix.cc:994)

This happens when structs with uninitialized gaps are copied between processes
via IPC syscalls. This behaviour is legitimate but violates Valgrind's
assumption that all syscall inputs should be defined. Most of the cases seems
to involve nsHttpChannel, which is a large class with lots of members, many via
inheritance.

I see two options:
- Find a way to zero these gaps. This might be tricky.
- Suppress it. Not as nice, but might be necessary.

Finally, there are lots of reported leaks in the child process. I suspect this
is lots of stuff we deliberately don't free, and we need to set
MOZ_CC_RUN_DURING_SHUTDOWN for the child process to force freeing, or something
like that.
Child processes in opt builds call exit(0) fairly early in the shutdown process
  http://mxr.mozilla.org/mozilla-central/source/dom/ipc/ContentChild.cpp#1562
So yeah there's going to be a lot of leaks.  I'm not sure there's any easy fix. :(  Maybe you can add some kind of environment variable that skips that QuickExit.  Of course, we crash later right now in debug builds, too, so you'll still get leaks...
Just to complicate things: doing a local run, I don't see the alloc/free mismatch errors, and I see fewer variants of each of the sendmsg and leak errors. Hmph.
I tried zeroing the realloc'd memory in Pickle::Resize() but it doesn't help. Which means that undefined values must be written over those zeroes. The undefined values could be padding, or could be genuine undefined values. I'll have to try some client requests to get a better idea of what's happening.
(In reply to Nicholas Nethercote [:njn] from comment #1)
> First, there are lots of variants on this:
> [...]
> Something's confusing Valgrind here. The code is correctly using |operator
> new|
> together with |operator delete|. For the latter, we're somehow seeing the
> underlying |moz_free| and |free| calls, but we don't see that for
> the former.

I think this might be the result of differential inlining, where we
have

  operator new (...) { ... moz_alloc(...) ... }
  operator delete (...) { ... moz_free(...) ... }

and one but not both (or neither) of these gets inlined.  Hence V
winds up seeing calls to new matched by calls to moz_free, or vice
versa.

There's no sane way to suppress these, and I got fed up with looking
at them, so I added --show-mismatched-frees=no|yes [yes] for V 3.10.
I use it (set to =no) for V/Mochi runs now.  It's a bit suboptimal
though, because I think it might sometimes hide real bugs.  I remember
JimB mentioning that such mismatches cause crashing on Win8 and that V
managed to find one that was causing him problems.  Or something like
that.
(In reply to Nicholas Nethercote [:njn] from comment #4)
> I tried zeroing the realloc'd memory in Pickle::Resize() but it doesn't
> help. Which means that undefined values must be written over those
> zeroes.

I tried that too, and came to the same conclusion.  I think we
discussed this a bit via Vidyo recently, btw.  Perhaps a reasonable
solution is to find the origins of the structs sent over IPC and zero
them out at allocation time.

I have been doing non-e10s Mochi runs on V for the past couple of
months.  I can change to doing e10s runs and would be happy to try to
fix this up.
Here's a patch with which I've been doing some digging. It's hard to understand
what's going on because so much of this code is generated from the .ipdl files.
Nonetheless, in this generated code it looks like all the various structs get
written into the to-be-sent buffers on a field by field basis. So zeroing the
buffer initially *should* suffice.

Furthermore, the complaints I was getting with this patch applied suggested
that there are actually one or two uninitialized fields being copied into the
buffer, so these errors might not be false positives after all.
This patch force-enables e10s, which is useful.
I run the valgrind test with |mach valgrind-test --suppressions valgrind-test.supp|. I also need the patch in bug 1069034 applied when using 3.10.
Julian: using |mach valgrind-test| is a lot easier than doing mochitest runs, once you have it set up. I've attached all the patches I've been using with it. I'm planning to look again at this and grovel through the generated code more carefully once I've finished my current batch of DMD work, but if you want to look at it in the meantime that's fine by me :)
Depends on: 1098185
Depends on: 1098703
Comment on attachment 8496330 [details] [diff] [review]
Use --trace-children=yes in |mach valgrind-test|

--trace-children doesn't cause problems with e10s off, and we'll need it eventually, so I want to land it now just to get it done.
Attachment #8496330 - Flags: review?(jseward)
Depends on: 1100205
Attachment #8496330 - Flags: review?(jseward) → review+
This is the addition of --trace-children=yes to the Valgrind test job invocation:
https://hg.mozilla.org/integration/mozilla-inbound/rev/328195791752
Keywords: leave-open
Attachment #8496327 - Attachment is obsolete: true
This was just a guess, but it seems to work.

I don't know what is the difference between browser.tabs.remote.autostart and
browser.tabs.remote.autostart.1.
Attachment #8525063 - Flags: review?(mh+mozilla)
Comment on attachment 8525063 [details] [diff] [review]
Turn on e10s for the Valgrind test job

Review of attachment 8525063 [details] [diff] [review]:
-----------------------------------------------------------------

::: build/valgrind/mach_commands.py
@@ +69,5 @@
>              prefs.update(Preferences.read_prefs(prefpath))
>              interpolation = { 'server': '%s:%d' % httpd.httpd.server_address,
>                                'OOP': 'false'}
>              prefs = json.loads(json.dumps(prefs) % interpolation)
> +            prefs['browser.tabs.remote.autostart'] = True

I can only say this looks like a good place to add it, not that it's the right pref to turn on (although I think it is).

(BTW, that OOP above is not used)
Attachment #8525063 - Flags: review?(wmccloskey)
Attachment #8525063 - Flags: review?(mh+mozilla)
Attachment #8525063 - Flags: review+
Comment on attachment 8525063 [details] [diff] [review]
Turn on e10s for the Valgrind test job

Review of attachment 8525063 [details] [diff] [review]:
-----------------------------------------------------------------

Yes, this is the right pref.

autostart.1 is only for turning e10s on for nightly users. The idea is that, once we've fixed a bunch of bugs, we'll turn on autostart.2 by default. Then all the people who disabled e10s when we originally enabled it on nightly (via autostart.1) will have it enabled again. We'll keep doing this until enough bugs are fixed that people stop turning it off.
Attachment #8525063 - Flags: review?(wmccloskey) → review+
Here's an update.

The undefined value errors are now all fixed. Hooray.
  
The mismatched alloc/free errors have disappeared. I don't know why. Hooray.
  
That leaves just the leaks from the content process. They vary a lot; for
example, see the multiple runs at
https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=e5d11a7749ed

If you look at the first run there, you'll see info from three processes: the
parent (pid 24937), a short-lived process at start-up that does something to
detect the graphics driver (pid 24940), and the content process (pid 24987). 
The first two have a "HEAP SUMMARY", "LEAK SUMMARY" and "ERROR SUMMARY"; the
content process has just a "HEAP SUMMARY", but there are leaks reported for it 
nonetheless. I don't know what to make of the missing summaries. (Locally, I
get all three summaries for the content process, like I'd expect.)

Furthermore, if I apply the attached patch that neuters two of the exit(0)
calls done by the child process, then the test passes! E.g. see the multiple
runs at
https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=176f385ca02a

But if you look at the logs, the content process has *none* of the three 
summaries. (And I get the same result locally.) I don't know what this means,
but I'm suspicious that something is going wrong in the child process and it's
not actually checking like it should.

Julian, any idea what might be happening?
Attachment #8526533 - Flags: feedback?(jseward)
I added this code to ActorDestroy():

 void
 ContentChild::ActorDestroy(ActorDestroyReason why)
 {
+    // njn: this should give an undefined value error, and a leak(?)
+    int* x = (int*) malloc(100000 * sizeof(int));
+    fprintf(stderr, "ACTOR DESTROY: %d\n", x[0]);

It triggered in the content process and I got an undefined value error from Valgrind about it. But I didn't get a leak error, and again the three SUMMARY lines didn't show up.
> It triggered in the content process and I got an undefined value error from
> Valgrind about it. But I didn't get a leak error, and again the three
> SUMMARY lines didn't show up.

VG_(terminate_NORETURN)() isn't being hit in the content process.
(In reply to Nicholas Nethercote [:njn] from comment #18)
> Julian, any idea what might be happening?

No, but the following does occur to me.  Presumably there is some
logging mechanism that collects the output of the run, so as to create
the the log file you show.  Presumably also, that needs to wait for
both the parent and content processes to finish before stopping
"listening".  Is it possible that this logger could have stopped
before the content process finishes?

For this to work correctly, either:

* the logger must wait for parent and all content processes to finish, or

* the parent process must wait for all content processes to finish,
  and the logger must wait for the parent to finish.

w/ apologies if this is a complete red herring.
> Is it possible that this logger could have stopped before the content process finishes?

I doubt it. I'm pretty sure it just hoovers up everything that goes to stdout and stderr. And the parent process's SUMMARY stuff comes after where the child process's SUMMARY stuff is supposed to occur, so the logger is clearly still working.

I'll add some debugging printfs to Valgrind in order to work out just where/how the child process is terminating.
I've confirmed it's not related to the logger -- I see similar behaviour when running things locally and directly, i.e. not using |mach valgrind-test|.

Doing it that way, sometimes I get no SUMMARY for the content process. Doing that, the final thing printed by the content process with -d was:

--16637:1:syswrap- thread_wrapper(tid=17): exit, schedreturncode VgSrc_ExitThread
--16637:1:syswrap- run_a_thread_NORETURN(tid=17): post-thread_wrapper
--16637:1:syswrap- run_a_thread_NORETURN(tid=17): not last one standing

And sometimes I get just the HEAP SUMMARY. In one such case, the final thing printed by the content process with -d was:

==6145==  
==6145== 24 bytes in 1 blocks are definitely lost in loss record 3,012 of 10,308
==6145==    at 0x4C2ABBD: malloc (vg_replace_malloc.c:299)
==6145==    by 0x81469BE: js_malloc (Utility.h:99)
==6145==    by 0x81469BE: js_pod_malloc<unsigned char> (Utility.h:560)

That's a partial leak report, which is really weird. It feels like the content process is being killed in such a way that cuts Valgrind off right in the middle of things.
Here's debug output from another example where no SUMMARY was produced. The
first three lines are debugging printfs I put in XRE_ShutdownChildProcess().
After that you can see that threads start getting killed, but there are still
21 threads present when the output ends.

> njn: XRE_ShutdownChildProcess
> njn: XRE_ShutdownChildProcess: pre-Quit
> njn: XRE_ShutdownChildProcess: post-Quit
> --24179:1:syswrap- thread_wrapper(tid=29): exit, schedreturncode VgSrc_ExitThread
> --24179:1:syswrap- run_a_thread_NORETURN(tid=29): post-thread_wrapper
> --24179:1:syswrap- run_a_thread_NORETURN(tid=29): not last one standing (c = 29)
> --24179:1:syswrap- run_a_thread_NORETURN(tid=29): pre-thread_wrapper
> --24179:1:syswrap- thread_wrapper(tid=29): entry
> --24179:1:syswrap- run_a_thread_NORETURN(tid=30): pre-thread_wrapper
> --24179:1:syswrap- thread_wrapper(tid=30): entry
> --24179:1:syswrap- thread_wrapper(tid=25): exit, schedreturncode VgSrc_ExitThread
> --24179:1:syswrap- run_a_thread_NORETURN(tid=25): post-thread_wrapper
> --24179:1:syswrap- run_a_thread_NORETURN(tid=25): not last one standing (c = 30)
> --24179:1:syswrap- run_a_thread_NORETURN(tid=25): pre-thread_wrapper
> --24179:1:syswrap- thread_wrapper(tid=25): entry
> --24179:1:aspacem  allocated thread stack at 0x82cfb5000 size 1064960
> --24179:1:syswrap- run_a_thread_NORETURN(tid=31): pre-thread_wrapper
> --24179:1:syswrap- thread_wrapper(tid=31): entry
> --24179:1:syswrap- thread_wrapper(tid=29): exit, schedreturncode VgSrc_ExitThread
> --24179:1:syswrap- run_a_thread_NORETURN(tid=29): post-thread_wrapper
> --24179:1:syswrap- run_a_thread_NORETURN(tid=29): not last one standing (c = 31)
> --24179:1:syswrap- thread_wrapper(tid=30): exit, schedreturncode VgSrc_ExitThread
> --24179:1:syswrap- run_a_thread_NORETURN(tid=30): post-thread_wrapper
> --24179:1:syswrap- run_a_thread_NORETURN(tid=30): not last one standing (c = 30)
> --24179:1:syswrap- thread_wrapper(tid=25): exit, schedreturncode VgSrc_ExitThread
> --24179:1:syswrap- run_a_thread_NORETURN(tid=25): post-thread_wrapper
> --24179:1:syswrap- run_a_thread_NORETURN(tid=25): not last one standing (c = 29)
> --24179:1:syswrap- thread_wrapper(tid=31): exit, schedreturncode VgSrc_ExitThread
> --24179:1:syswrap- run_a_thread_NORETURN(tid=31): post-thread_wrapper
> --24179:1:syswrap- run_a_thread_NORETURN(tid=31): not last one standing (c = 28)
> --24179:1:syswrap- thread_wrapper(tid=18): exit, schedreturncode VgSrc_ExitThread
> --24179:1:syswrap- run_a_thread_NORETURN(tid=18): post-thread_wrapper
> --24179:1:syswrap- run_a_thread_NORETURN(tid=18): not last one standing (c = 27)
> --24179:1:syswrap- thread_wrapper(tid=26): exit, schedreturncode VgSrc_ExitThread
> --24179:1:syswrap- run_a_thread_NORETURN(tid=26): post-thread_wrapper
> --24179:1:syswrap- run_a_thread_NORETURN(tid=26): not last one standing (c = 26)
> --24179:1:syswrap- thread_wrapper(tid=27): exit, schedreturncode VgSrc_ExitThread
> --24179:1:syswrap- run_a_thread_NORETURN(tid=27): post-thread_wrapper
> --24179:1:syswrap- run_a_thread_NORETURN(tid=27): not last one standing (c = 25)
> --24179:1:syswrap- thread_wrapper(tid=28): exit, schedreturncode VgSrc_ExitThread
> --24179:1:syswrap- run_a_thread_NORETURN(tid=28): post-thread_wrapper
> --24179:1:syswrap- run_a_thread_NORETURN(tid=28): not last one standing (c = 24)
> --24179:1:syswrap- thread_wrapper(tid=20): exit, schedreturncode VgSrc_ExitThread
> --24179:1:syswrap- run_a_thread_NORETURN(tid=20): post-thread_wrapper
> --24179:1:syswrap- run_a_thread_NORETURN(tid=20): not last one standing (c = 23)
> --24179:1:syswrap- thread_wrapper(tid=23): exit, schedreturncode VgSrc_ExitThread
> --24179:1:syswrap- run_a_thread_NORETURN(tid=23): post-thread_wrapper
> --24179:1:syswrap- run_a_thread_NORETURN(tid=23): not last one standing (c = 22)
> --24179:1:syswrap- thread_wrapper(tid=24): exit, schedreturncode VgSrc_ExitThread
> --24179:1:syswrap- run_a_thread_NORETURN(tid=24): post-thread_wrapper
> --24179:1:syswrap- run_a_thread_NORETURN(tid=24): not last one standing (c = 21)
> --24179:1:mallocfr newSuperblock at 0x82898F000 (pszB   65504)  owner VALGRIND/ttaux
> --24179:1:mallocfr newSuperblock at 0x82D0B9000 (pszB  659424) unsplittable owner VALGRIND/ttaux
> --24179:1:mallocfr reclaimSuperblock at 0x82CF4A000 (pszB  438240) unsplittable owner VALGRIND/ttaux
Aha! Julian suggested I try --trace-signals and it quickly showed that the parent process is sending |kill -9| to the content process. Turns out there's a class called ChildGrimReaper which does this, and it gives the content process two seconds (kMaxWaitMs) to finish up before killing it. With Valgrind running things are much slower, plus the content process does leak checking, so two seconds isn't nearly enough. If I change it to 20 seconds I get all three SUMMARY lines for the child, as expected. Yay.

But the bad news is that the content process has bazillions of leaks, even when it doesn't do the exit(0). Boo.
I know mccr8 was looking to fix the leak logs in child processes thing, but maybe we could special-case Valgrind jobs with an environment variable that disables the ChildGrimReaper?
(In reply to Nicholas Nethercote [:njn] from comment #25)
> But the bad news is that the content process has bazillions of leaks, even
> when it doesn't do the exit(0). Boo.

Could you attach a log here? I'd be curious to see what it looks like.
Attachment #8528690 - Attachment description: log: non-e10s → log: non-e10s, without MOZ_CC_RUN_DURING_SHUTDOWN
Attached file log: e10s
Attachment #8526533 - Attachment is obsolete: true
Attachment #8526533 - Flags: feedback?(jseward)
Attachment #8491949 - Attachment is obsolete: true
I just attached three log files.

The first log file is from a non-10s build without MOZ_CC_RUN_DURING_SHUTDOWN.
This shows the kind of leaks Valgrind finds in that situation. Here are the top
ten leak sites, with frequencies:

> 365 counts:
> (  1) 39 (10.7%, 10.7%): TEST-UNEXPECTED-FAIL | valgrind-test | 110 (96 direct, 14 indirect) bytes in 1 blocks are definitely lost at malloc / CreateFunctionPrototype / js::GlobalObject::resolveConstructor / js::GlobalObject::ensureConstructor
> (  2) 36 ( 9.9%, 20.5%): TEST-UNEXPECTED-FAIL | valgrind-test | 108 (96 direct, 12 indirect) bytes in 1 blocks are definitely lost at malloc / js::frontend::CreateScriptSourceObject / js::CloneScript / JSRuntime::cloneSelfHostedFunctionScript
> (  3) 15 ( 4.1%, 24.7%): TEST-UNEXPECTED-FAIL | valgrind-test | 720 (16 direct, 704 indirect) bytes in 1 blocks are definitely lost at malloc / moz_xmalloc / xpc::CreateGlobalObject / XPCWrappedNative::WrapNewGlobal
> (  4) 11 ( 3.0%, 27.7%): TEST-UNEXPECTED-FAIL | valgrind-test | 220 (192 direct, 28 indirect) bytes in 2 blocks are definitely lost at malloc / CreateFunctionPrototype / js::GlobalObject::resolveConstructor / js::GlobalObject::ensureConstructor
> (  5)  5 ( 1.4%, 29.0%): TEST-UNEXPECTED-FAIL | valgrind-test | 136 (96 direct, 40 indirect) bytes in 1 blocks are definitely lost at malloc / js::frontend::CreateScriptSourceObject / js::frontend::CompileScript / JS::Compile
> (  6)  4 ( 1.1%, 30.1%): TEST-UNEXPECTED-FAIL | valgrind-test | 1,440 (32 direct, 1,408 indirect) bytes in 2 blocks are definitely lost at malloc / moz_xmalloc / xpc::CreateGlobalObject / XPCWrappedNative::WrapNewGlobal
> (  7)  4 ( 1.1%, 31.2%): TEST-UNEXPECTED-FAIL | valgrind-test | 848 (16 direct, 832 indirect) bytes in 1 blocks are definitely lost at malloc / moz_xmalloc / xpc::CreateGlobalObject / XPCWrappedNative::WrapNewGlobal
> (  8)  3 ( 0.8%, 32.1%): TEST-UNEXPECTED-FAIL | valgrind-test | 1,104 (16 direct, 1,088 indirect) bytes in 1 blocks are definitely lost at malloc / moz_xmalloc / xpc::CreateGlobalObject / XPCWrappedNative::WrapNewGlobal
> (  9)  3 ( 0.8%, 32.9%): TEST-UNEXPECTED-FAIL | valgrind-test | 166 (96 direct, 70 indirect) bytes in 1 blocks are definitely lost at malloc / js::frontend::CreateScriptSourceObject / js::frontend::CompileScript / JS::Compile
> ( 10)  3 ( 0.8%, 33.7%): TEST-UNEXPECTED-FAIL | valgrind-test | 131 (96 direct, 35 indirect) bytes in 1 blocks are definitely lost at malloc / js::frontend::CreateScriptSourceObject / js::frontend::CompileScript / JS::Compile

Lots of JS stuff, relating to global objects, scripts, functions, etc.

The second log file is from an e10s build which does exit(0). More leak site
frequencies:

> 73 counts:
> (  1) 14 (19.2%, 19.2%): TEST-UNEXPECTED-FAIL | valgrind-test | 110 (96 direct, 14 indirect) bytes in 1 blocks are definitely lost at malloc / CreateFunctionPrototype / js::GlobalObject::resolveConstructor / js::GlobalObject::ensureConstructor
> (  2)  5 ( 6.8%, 26.0%): TEST-UNEXPECTED-FAIL | valgrind-test | 108 (96 direct, 12 indirect) bytes in 1 blocks are definitely lost at malloc / js::frontend::CreateScriptSourceObject / js::CloneScript / JSRuntime::cloneSelfHostedFunctionScript
> (  3)  4 ( 5.5%, 31.5%): TEST-UNEXPECTED-FAIL | valgrind-test | 720 (16 direct, 704 indirect) bytes in 1 blocks are definitely lost at malloc / moz_xmalloc / xpc::CreateGlobalObject / XPCWrappedNative::WrapNewGlobal
> (  4)  4 ( 5.5%, 37.0%): TEST-UNEXPECTED-FAIL | valgrind-test | 40 bytes in 1 blocks are definitely lost at malloc / g_malloc / g_slice_alloc / gdk_region_new
> (  5)  3 ( 4.1%, 41.1%): TEST-UNEXPECTED-FAIL | valgrind-test | 220 (192 direct, 28 indirect) bytes in 2 blocks are definitely lost at malloc / CreateFunctionPrototype / js::GlobalObject::resolveConstructor / js::GlobalObject::ensureConstructor
> (  6)  2 ( 2.7%, 43.8%): TEST-UNEXPECTED-FAIL | valgrind-test | 131 (96 direct, 35 indirect) bytes in 1 blocks are definitely lost at malloc / js::frontend::CreateScriptSourceObject / js::frontend::CompileScript / JS::Compile
> (  7)  2 ( 2.7%, 46.6%): TEST-UNEXPECTED-FAIL | valgrind-test | 168 (96 direct, 72 indirect) bytes in 1 blocks are definitely lost at malloc / js::frontend::CreateScriptSourceObject / js::frontend::CompileScript / JS::Compile
> (  8)  2 ( 2.7%, 49.3%): TEST-UNEXPECTED-FAIL | valgrind-test | 1,440 (32 direct, 1,408 indirect) bytes in 2 blocks are definitely lost at malloc / moz_xmalloc / xpc::CreateGlobalObject / XPCWrappedNative::WrapNewGlobal
> (  9)  2 ( 2.7%, 52.1%): TEST-UNEXPECTED-FAIL | valgrind-test | 770 (672 direct, 98 indirect) bytes in 7 blocks are definitely lost at malloc / CreateFunctionPrototype / js::GlobalObject::resolveConstructor / js::GlobalObject::ensureConstructor
> ( 10)  2 ( 2.7%, 54.8%): TEST-UNEXPECTED-FAIL | valgrind-test | 976 (16 direct, 960 indirect) bytes in 1 blocks are definitely lost at malloc / moz_xmalloc / xpc::CreateGlobalObject / XPCWrappedNative::WrapNewGlobal

Not as many, but the leak sites are very similar to the first log file.

The third log file is from an e10s build, which doesn't do exit(0), and has had
the GrimReaper timeout increased from 2s to 60s so that Valgrind has time to do
its leak checking. The leak site frequencies:

> 1360 counts:
> (  1) 122 ( 9.0%,  9.0%): TEST-UNEXPECTED-FAIL | valgrind-test | 64 bytes in 1 blocks are definitely lost at malloc / js::Nursery::allocateSlots / js::NativeObject::growSlots / js::NativeObject::setLastProperty
> (  2)  61 ( 4.5%, 13.5%): TEST-UNEXPECTED-FAIL | valgrind-test | 2,112 bytes in 33 blocks are definitely lost at malloc / js::Nursery::allocateSlots / js::NativeObject::growSlots / js::NativeObject::setLastProperty
> (  3)  60 ( 4.4%, 17.9%): TEST-UNEXPECTED-FAIL | valgrind-test | 24 bytes in 1 blocks are definitely lost at malloc / js::ProxyObject::New / js::NewProxyObject / js::Wrapper::New
> (  4)  49 ( 3.6%, 21.5%): TEST-UNEXPECTED-FAIL | valgrind-test | 280 (24 direct, 256 indirect) bytes in 1 blocks are definitely lost at malloc / js::Shape::hashify / js::NativeObject::lookup / js::baseops::GetProperty
> (  5)  35 ( 2.6%, 24.0%): TEST-UNEXPECTED-FAIL | valgrind-test | 48 bytes in 1 blocks are definitely lost at calloc / JSScript::makeTypes / js::jit::BaselineCompiler::compile / js::jit::BaselineCompile
> (  6)  31 ( 2.3%, 26.3%): TEST-UNEXPECTED-FAIL | valgrind-test | 64 bytes in 1 blocks are definitely lost at malloc / js::Nursery::allocateSlots / js::NativeObject::growSlots / js::NativeObject::setSlotSpan
> (  7)  30 ( 2.2%, 28.5%): TEST-UNEXPECTED-FAIL | valgrind-test | 56 (24 direct, 32 indirect) bytes in 1 blocks are definitely lost at malloc / js::Shape::hashify / js::NativeObject::toDictionaryMode / js_fun_bind
> (  8)  20 ( 1.5%, 30.0%): TEST-UNEXPECTED-FAIL | valgrind-test | 8 bytes in 1 blocks are definitely lost at calloc / JSScript::partiallyInit / JSScript::fullyInitFromEmitter / js::frontend::EmitFunctionScript
> (  9)  18 ( 1.3%, 31.3%): TEST-UNEXPECTED-FAIL | valgrind-test | 88 (24 direct, 64 indirect) bytes in 1 blocks are definitely lost at malloc / js::PropertyTree::insertChild / js::PropertyTree::getChild / js::NativeObject::getChildProperty
> ( 10)  17 ( 1.2%, 32.6%): TEST-UNEXPECTED-FAIL | valgrind-test | 128 bytes in 2 blocks are definitely lost at malloc / js::Nursery::allocateSlots / js::NativeObject::growSlots / js::NativeObject::setLastProperty

More JS stuff, but the leak sites are entirely different. And the number of
leaks is *much* higher than with exit(0) -- 1360 vs 73!
That's really weird.  I'd guess we're still not running enough shutdown CC+GCs somehow, but maybe the JS engine just shuts down in a different way that leaks slots in content processes?
Priority: -- → P3
The leave-open keyword is there and there is no activity for 6 months.
:njn, maybe it's time to close this bug?
Flags: needinfo?(n.nethercote)
I suspect the test job is running e10s naturally now, given that pref-flipping is no longer required to enable e10s. I looked at logs of a recent run of this job and I saw Valgrind output for two processes. I think it's good enough.
Status: NEW → RESOLVED
Closed: 5 years ago
Flags: needinfo?(n.nethercote)
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.