1354215 - stylo: Sort out malloc overhead issues

Reporter

Description

•

8 years ago

This has been discussed in a few places, and I wanted to centralize discussion on a bug. There are basically two things at play here:

(1) On OSX (but not Linux), we have lots of single-thread malloc overhead. See bug 1331843 comment 28 and onward.

(2) On Linux (but not necessarily elsewhere), heycam saw (last summer) a huge amount of locking contention in malloc, which caused us to leave MOZ_MEMORY_NARENAS_DEFAULT_ONE undefined in mozjemalloc for stylo builds [1]. Assuming this is still an issue, we need to determine whether this is acceptable to ship.

[1] http://searchfox.org/mozilla-central/rev/b8cce081200129a1307e2a682f121135b5ca1d19/memory/mozjemalloc/jemalloc.c#151

Bobby Holley (:bholley)

Reporter

Comment 1

•

8 years ago

I'm going to copy some recent posts out of a thread to add continuity to this discussion.

Bobby Holley (:bholley)

Reporter

Comment 2

•

8 years ago

Julian writes:

> I modified main() in nsBrowserApp.cpp to run a malloc speed test, with selectable numbers of threads and iterations, and then quit. Patch is attached. I hijacked our own main() rather than write a standalone program so as to ensure that the test uses whatever jemalloc setup is baked into the build.
>
> I'm now confused. For various tests up to 100 threads each iterating one million times, I don't see signs of massive contention or slowdown on Mac as compared to Linux. The Mac (a 2.3 GHz Ivy Bridge) performs on an absolute basis about 1/3 slower than the Linux box (2.4 GHz Skylake Xeon), but that ratio seems more or less consistent across the spectrum of threads and iterations that I tested.
>
> Also .. if malloc is really much slower on MacOS, wouldn't we have seen that already in the existing C++-only code base?
>
> So .. I think I missed something. But what?
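For readers without the patch, the general shape of such a test (N threads, each doing M malloc/free pairs, timed as a whole) can be sketched as below. This is not Julian's patch. It is a rough Python stand-in that calls the C library's malloc/free through ctypes, so it exercises whatever allocator the Python process is linked against, not mozjemalloc, and interpreter overhead dominates the timings. The names run_malloc_test and _worker, and the 64-byte default size, are made up for illustration. It assumes a POSIX system.

```python
import ctypes
import threading
import time

# Assumption: a POSIX system, where CDLL(None) exposes the process's C library.
# ctypes releases the GIL around CDLL calls, so threads can enter malloc
# concurrently, but this is NOT mozjemalloc and the numbers are only indicative.
_libc = ctypes.CDLL(None)
_libc.malloc.restype = ctypes.c_void_p
_libc.malloc.argtypes = [ctypes.c_size_t]
_libc.free.restype = None
_libc.free.argtypes = [ctypes.c_void_p]


def _worker(iterations, size, counts, index):
    """Allocate and immediately free `size` bytes, `iterations` times."""
    done = 0
    for _ in range(iterations):
        ptr = _libc.malloc(size)
        if ptr:  # a NULL return comes back as None
            _libc.free(ptr)
            done += 1
    counts[index] = done


def run_malloc_test(threads, iterations, size=64):
    """Return (elapsed_seconds, successful_allocations) for the whole run."""
    counts = [0] * threads
    pool = [
        threading.Thread(target=_worker, args=(iterations, size, counts, i))
        for i in range(threads)
    ]
    start = time.perf_counter()
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return time.perf_counter() - start, sum(counts)
```

Calling run_malloc_test(threads, iterations) for a range of thread counts on each machine and comparing the elapsed times gives the kind of cross-platform ratio described above. A real measurement of mozjemalloc needs to run inside the Firefox binary, as the patch does.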

Bobby Holley (:bholley)

Reporter

Comment 3

•

8 years ago

Nathan writes:

> I am skeptical that mutex overhead is the main blocker here. mozjemalloc uses spin locks on OS X:
>
> http://dxr.mozilla.org/mozilla-central/source/memory/mozjemalloc/jemalloc.c#1658
>
> The copy of jemalloc (jemalloc4?) we have in tree also uses spin locks:
>
> http://dxr.mozilla.org/mozilla-central/source/memory/jemalloc/src/include/jemalloc/internal/mutex.h#41
>
> though it looks like 10.12 has a separate unfair lock type (?).
>
> I mean, we could see about substituting unfair pthread_mutex_t locks for those, but I'd expect spin locks to be about right for jemalloc. On other platforms (Windows, Linux), jemalloc chooses to spin before blocking in the kernel; unless we think that blocking in the kernel is faster (less CPU intensive, maybe) than continuing to spin, I don't know if we can do much better there.
>
> The WTF::Lock work showed a win over (naive) pthread_mutex_t on OS X; it didn't say anything about spin locks. So we'd have to remeasure there to assure ourselves that we would actually win with that.

Bobby Holley (:bholley)

Reporter

Comment 4

•

8 years ago

So I haven't actually verified that the issue is indeed the locks - it was just a guess. That said, it's not clear to me what else would cause 200ms of time in mozjemalloc on OSX and 6ms on Linux for the same (main-thread-only, no-contention) testcase.

Bobby Holley (:bholley)

Reporter

Comment 5

•

8 years ago

From bug 1331843:

Simon's linux profile is https://perfht.ml/2oPZTrR

My osx profile: https://perfht.ml/2n7lHBk

Julian Seward [:jseward]

Assignee

Comment 6

•

8 years ago

Bobby, could you please give some details about how to reproduce the numbers you saw?

* build config (+ mozconfig)
* workload
* what profiler you used, what steps inside that profiler?

Flags: needinfo?(bobbyholley)

Bobby Holley (:bholley)

Reporter

Comment 7

•

8 years ago

(In reply to Julian Seward [:jseward] from comment #6)
> Bobby, could you please give some details about how to reproduce the numbers you saw?
>
> * build config (+ mozconfig)

OSX, brand new MBP.

export LDFLAGS=-Wl,-no_compact_unwind
mk_add_options AUTOCLOBBER=1
ac_add_options --enable-stylo
ac_add_options --enable-warnings-as-errors
ac_add_options --disable-debug
ac_add_options --enable-optimize
ac_add_options --enable-profiling

> * workload

Download this testcase: https://www.dropbox.com/s/h51fspacftf1lcq/myspace.tar.bz2?dl=0
./mach run
Make sure the gecko profiler is installed ( https://perf-html.io/ )
Queue up file:///path/to/testcases/myspace.com/www.myspace.com/albumart.html
Start the profiler (Ctrl-shift-1)
Load the page
Capture the profile (Ctrl-shift-2)
Filter for Servo_StyleSheet_FromUTF8Bytes, and select that sub-region of the timeline (this is all main-thread-only)

> * what profiler you used, what steps inside that profiler?

See above. Note that Simon has been working on turning this workflow into a gtest, which you may find useful. I think he just got it working.

Flags: needinfo?(bobbyholley)

Nathan Froyd [:froydnj]

Comment 8

•

8 years ago

(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #5)
> From bug 1331843:
>
> Simon's linux profile is https://perfht.ml/2oPZTrR

This says that we have:

- 443 ms: style::properties::declaration_block::parse_property_declaration_list (34 ms self-time)
- 206 ms: style::properties::ParsedDeclaration::parse (34 ms self-time)
- 73 ms: style::properties::ParsedDeclaration::expand_into (24 ms self-time)
- 55 ms: __memmove_avx_unaligned_erms (55 ms self-time)
- 43 ms: cssparser::parser::Parser::next_including_whitespace_and_comments (15 ms self-time)
- 25 ms: style::properties::PropertyId::parse (17 ms self-time)

> My osx profile: https://perfht.ml/2n7lHBk

This says that we have:

- 515 ms: style::properties::declaration_block::parse_property_declaration_list::he587fadafe443966 (35 ms self-time)
- 214 ms: style::properties::ParsedDeclaration::parse::h2bded4b3d7f7f982 (24 ms self-time)
- 148 ms: style::properties::ParsedDeclaration::expand::h16add01a38f7c1b7 (32 ms self-time)
- 48 ms: _platform_memmove$VARIANT$Haswell (48 ms self-time)
- 31 ms: cssparser::parser::Parser::next_including_whitespace_and_comments::h26d9183299f59cad (14 ms self-time)
- 26 ms: style::properties::PropertyId::parse::h73968137e91960cf (17 ms self-time)

Those look pretty similar, actually, certainly nothing that would be explained by malloc overhead. Though the 75ms difference between ParsedDeclaration::expand_into on Linux and ParsedDeclaration::expand on Mac looks like the entirety of the difference between Linux's 443ms and Mac's 515ms. Didn't Simon recently rewrite ParsedDeclaration::expand to only be expand_into? Or is the compiler making different inlining decisions here?
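To make the comparison above easier to check, here is a small sketch that tabulates the per-function difference between the two profiles. The millisecond values are copied from the two lists in this comment. The dictionary keys are shortened labels and the helper name osx_minus_linux is made up for illustration.

```python
# (linux_ms, osx_ms) total times, copied from the two profiles quoted above.
# Keys are shortened labels, not the exact symbol names.
PROFILE_MS = {
    "parse_property_declaration_list": (443, 515),
    "ParsedDeclaration::parse": (206, 214),
    "ParsedDeclaration::expand[_into]": (73, 148),
    "memmove": (55, 48),
    "Parser::next_including_whitespace_and_comments": (43, 31),
    "PropertyId::parse": (25, 26),
}


def osx_minus_linux(profile):
    """Return {label: osx_ms - linux_ms} for each row."""
    return {name: osx - linux for name, (linux, osx) in profile.items()}


deltas = osx_minus_linux(PROFILE_MS)
# Print the largest differences first, whichever direction they go.
for name, delta in sorted(deltas.items(), key=lambda item: -abs(item[1])):
    print(f"{delta:+4d} ms  {name}")
```

The expand/expand_into row comes out at +75 ms and the top-level parse_property_declaration_list row at +72 ms, which is the match noted above. The rows appear to be nested call-tree totals rather than disjoint buckets, so the individual deltas should not be summed.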

Nathan Froyd [:froydnj]

Comment 9

•

8 years ago

(In reply to Nathan Froyd [:froydnj] from comment #8)
> Though the 75ms difference between ParsedDeclaration::expand_into on Linux and ParsedDeclaration::expand on Mac looks like the entirety of the difference between Linux's 443ms and Mac's 515ms. Didn't Simon recently rewrite ParsedDeclaration::expand to only be expand_into? Or is the compiler making different inlining decisions here?

To be clear: are these profiles actually testing the same code? I don't see ParsedDeclaration::expand anymore, only ParsedDeclaration::expand_into.

Bobby Holley (:bholley)

Reporter

Comment 10

•

8 years ago

(In reply to Nathan Froyd [:froydnj] from comment #9)
> (In reply to Nathan Froyd [:froydnj] from comment #8)
> > Though the 75ms difference between ParsedDeclaration::expand_into on Linux and ParsedDeclaration::expand on Mac looks like the entirety of the difference between Linux's 443ms and Mac's 515ms. Didn't Simon recently rewrite ParsedDeclaration::expand to only be expand_into? Or is the compiler making different inlining decisions here?
>
> To be clear: are these profiles actually testing the same code? I don't see ParsedDeclaration::expand anymore, only ParsedDeclaration::expand_into.

The stylo profile I linked to was captured a few weeks ago. I get the same results on an up-to-date one (git rev fd7a240d3085fe35b1d67f4216459f28f1e53944): https://perfht.ml/2oFdtSG

Bobby Holley (:bholley)

Reporter

Comment 11

•

8 years ago

We discovered that the linux profile does have a roughly-equivalent amount of time spent in malloc, and the samples are just garbage and categorized under OnInputStreamReady.

So that's one mystery solved. Remaining questions are:
(1) Why is stylo css parsing so much faster on linux, and
(2) Do we still need multiple mozjemalloc arenas, and what is the impact of enabling them?

Nathan Froyd [:froydnj]

Comment 12

•

8 years ago

(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #11)
> We discovered that the linux profile does have a roughly-equivalent amount of time spent in malloc, and the samples are just garbage and categorized under OnInputStreamReady.

Julian volunteered to look into the unwinding issues, FWIW.

Bobby Holley (:bholley)

Reporter

Comment 13

•

8 years ago

(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #11)
> We discovered that the linux profile does have a roughly-equivalent amount of time spent in malloc, and the samples are just garbage and categorized under OnInputStreamReady.
>
> So that's one mystery solved. Remaining questions are:
> (1) Why is stylo css parsing so much faster on linux, and

Simon pointed out that this is obviously because the malloc overhead isn't getting billed to Servo_StyleSheet_FromUTF8Bytes due to the garbage samples. So there's no discrepancy, stylo parsing is objectively slower, and Simon can hopefully fix that in bug 1331843.

So we just need to investigate the multiple arenas issue here.

Eric Rahm [:erahm]

Comment 14

•

8 years ago

(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #11)
> (2) Do we still need multiple mozjemalloc arenas, and what is the impact of enabling them?

Do we have details for heycam's original test?

Bobby Holley (:bholley)

Reporter

Comment 15

•

8 years ago

(In reply to Eric Rahm [:erahm] from comment #14)
> (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #11)
> > (2) Do we still need multiple mozjemalloc arenas, and what is the impact of enabling them?
>
> Do we have details for heycam's original test?

It was https://en.wikipedia.org/wiki/Barack_Obama , and I believe it was on a Linux Xeon box. Worth retesting. Note that the overhead in this case will be under Servo_TraverseSubtree. You'll want to set STYLO_THREADS=N to try different numbers of worker threads.
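For anyone retesting, a minimal sketch of sweeping STYLO_THREADS from a script, assuming it is run from the top of a mozilla-central checkout with a stylo build. STYLO_THREADS and ./mach run come from this bug. The helper names stylo_env and launch are made up for illustration, and the script only starts the browser, so profiling or timing the page load is still done by hand.

```python
import os
import subprocess


def stylo_env(threads, base_env=None):
    """Copy the environment and set STYLO_THREADS to the given worker count."""
    env = dict(os.environ if base_env is None else base_env)
    env["STYLO_THREADS"] = str(threads)
    return env


def launch(threads, command=("./mach", "run")):
    """Start the browser with the given number of style worker threads.

    Blocks until the browser exits. Assumes the current directory is the
    top of the source tree, so that ./mach resolves.
    """
    return subprocess.run(list(command), env=stylo_env(threads), check=False)
```

Calling launch(n) for n in, say, 1, 2, 4 and 8, and capturing a profile of Servo_TraverseSubtree each time, would show how the traversal scales with the number of workers.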

Bobby Holley (:bholley)

Reporter

Comment 16

•

8 years ago

(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #13)
> (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #11)
> > We discovered that the linux profile does have a roughly-equivalent amount of time spent in malloc, and the samples are just garbage and categorized under OnInputStreamReady.
> >
> > So that's one mystery solved. Remaining questions are:
> > (1) Why is stylo css parsing so much faster on linux, and
>
> Simon pointed out that this is obviously because the malloc overhead isn't getting billed to Servo_StyleSheet_FromUTF8Bytes due to the garbage samples. So there's no discrepancy, stylo parsing is objectively slower, and Simon can hopefully fix that in bug 1331843.

_except_ that gecko _also_ has garbage malloc samples, it turns out. So Simon's profiles do show that linux is still faster, which doesn't match what we see on osx.

Julian Seward [:jseward]

Assignee

Comment 17

•

8 years ago

(In reply to Nathan Froyd [:froydnj] from comment #12)
> Julian volunteered to look into the unwinding issues, FWIW.

I put a patch containing a fix for at least some of the unwind breakage at https://bugzilla.mozilla.org/show_bug.cgi?id=1354546#c1.

Eric Rahm [:erahm]

Comment 18

•

8 years ago

So just to be clear we think the OSX profile is legit? Also I assume the OSX profile was *with* multiple arenas right?

It looks like you're doing a ton of memmoving:
82 ms - _platform_memmove$VARIANT$Haswell

That's mostly coming from:
- style::properties::declaration_block::parse_property_declaration_list::he587fadafe443966
- realloc

There's also:
13 ms - _platform_bzero$VARIANT$Haswell

This is odd, that should really only show up when we're using calloc, which makes me question our stacks.

And another odd one:
16 ms - _je_malloc_usable_size

This should be super fast, so maybe a degradation from enabling multiple arenas?

Bobby Holley (:bholley)

Reporter

Comment 19

•

8 years ago

(In reply to Eric Rahm [:erahm] from comment #18)
> So just to be clear we think the OSX profile is legit? Also I assume the OSX profile was *with* multiple arenas right?

Yes. But keep in mind that the stuff you're looking at is CSS parsing, which is main thread only, and therefore unlikely to be improved by multiple arenas (though maybe regressed, as you note). The multithreaded part is under Servo_TraverseSubtree, where you should just see a condvar wait, and you won't see the interesting work on the worker threads because we don't have profiler support for them yet.

> It looks like you're doing a ton of memmoving:
> 82 ms - _platform_memmove$VARIANT$Haswell
>
> That's mostly coming from:
> - style::properties::declaration_block::parse_property_declaration_list::he587fadafe443966
> - realloc

Yeah, in general Rust is pretty memmove-heavy. There are two reasons:
(1) The move semantics of Rust mean that we end up moving more, in general.
(2) Reliance on Vecs causes overhead when doubling.

See bug 1331843 comment 26 for my proposal on using a stack-allocated vec when parsing property declarations to eliminate the realloc bits (at the expense of slightly more memmoving in general).

> There's also:
> 13 ms - _platform_bzero$VARIANT$Haswell
>
> This is odd, that should really only show up when we're using calloc, which makes me question our stacks.

I'm not sure - would be interested to know what's going on there.

> And another odd one:
> 16 ms - _je_malloc_usable_size
>
> This should be super fast, so maybe a degradation from enabling multiple arenas?

I'd be interested in this too.
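On the Vec-doubling point above, a rough model of where the realloc and memmove cost comes from may help. The sketch below counts how many times a growable buffer has to grow, and how many elements get copied in the worst case, while pushing N items. It assumes a plain doubling policy starting from capacity 1 and assumes every growth moves the old contents. Rust's actual Vec growth policy and the allocator's ability to extend in place will differ in detail. The name simulate_pushes is made up for illustration.

```python
def simulate_pushes(pushes, initial_capacity=0):
    """Model a growable buffer that doubles its capacity when full.

    Returns (grow_calls, elements_copied), where elements_copied is the
    worst case in which every growth has to move the existing contents.
    """
    capacity = initial_capacity
    length = 0
    grow_calls = 0
    elements_copied = 0
    for _ in range(pushes):
        if length == capacity:
            # Buffer is full: grow to double the size (or to 1 from empty).
            elements_copied += length
            capacity = max(1, capacity * 2)
            grow_calls += 1
        length += 1
    return grow_calls, elements_copied


# Growing from empty versus starting with enough room for everything.
print(simulate_pushes(1000))        # (11, 1023)
print(simulate_pushes(1000, 1024))  # (0, 0)
```

Under these assumptions, 1000 pushes into an initially empty buffer cost 11 growth calls and up to 1023 element copies, while a buffer with enough up-front capacity never touches the allocator during the pushes. That second case is the effect a stack-allocated vec is after for small declaration lists.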

Manish Goregaokar [:manishearth]

Updated

•

8 years ago

Blocks: stylo

Eric Rahm [:erahm]

Comment 20

•

8 years ago

I attempted to start looking into this, but am blocked by bug 1355253. Once that's fixed and stylo runs on osx without crashing I can look into things further.

Depends on: 1355253

Eric Rahm [:erahm]

Updated

•

8 years ago

Depends on: 1354772
No longer depends on: 1355253

Bobby Holley (:bholley)

Reporter

Comment 21

•

8 years ago

The dependent bug is now fixed. NI erahm to follow up here.

Flags: needinfo?(erahm)

Eric Rahm [:erahm]

Comment 22

•

8 years ago

(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #21)
> The dependent bug is now fixed. NI erahm to follow up here.

There's been a lot of churn in this bug, I'm not sure exactly what we want to look into at this point. It might make sense to make this a meta bug and split out blockers.

Below are the issues that I think have been raised and are still valid with some specific questions, but first a general question about how you are profiling.

I know we're using the profiler add-on (https://perf-html.io/). Is the flow just:
- Launch browser w/ e10s disabled
- Navigate to page and quickly hit ctrl+shift+1
- Wait for something to be displayed and hit ctrl-shift-2

Or is there a cleaner, more reproducible method?

Now to the issues that have been raised:

#1 - There was talk about a lot of lock contention in mozjemalloc (maybe just on Linux) being a driving factor for enabling multiple arenas.

Follow up questions:
- What specific test case are we looking at? I assume it needs to exercise the parallel parts of stylo.
- What configuration do I need?
- What function(s) should I be focused on
- Is this just in stylo builds, or in general?

#2 - We saw a big discrepancy b/w linux and mac. But then we didn't. But then we did.

Follow up questions:
- What specific test case are we looking at?
- What configuration do I need?
- What function(s) should I be focused on
- Is this just in stylo builds, or in general?

#3 - There's a fair amount of overhead on mac in je_malloc_usable_size

I'll split out a bug for this, I'd guess this is stylo agnostic.

#4 - There's a fair amount of overhead (on mac?) in _platform_bzero$VARIANT$Haswell

This might just be from bad stacks, cursory profiling with Instruments didn't show much overhead. I can split this out as well.

Flags: needinfo?(erahm) → needinfo?(bobbyholley)

Bobby Holley (:bholley)

Reporter

Comment 23

•

8 years ago

Responding in a hurry here because we're about to head out to dinner.

(In reply to Eric Rahm [:erahm] from comment #22)
> (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #21)
> > The dependent bug is now fixed. NI erahm to follow up here.
>
> There's been a lot of churn in this bug, I'm not sure exactly what we want to look into at this point. It might make sense to make this a meta bug and split out blockers.

Yep. Item 2 (the question of what to do about the multiple jemalloc arenas) is the most urgent.

> Below are the issues that I think have been raised and are still valid with some specific questions, but first a general question about how you are profiling.
>
> I know we're using the profiler add-on (https://perf-html.io/). Is the flow just:
> - Launch browser w/ e10s disabled

e10s doesn't need to be disabled AFAIK (is that still in the etherpad? If so we should remove it). We should focus perf efforts on e10s, since that's what'll ship in 57. I don't think the issues here differ with/without.

> - Navigate to page and quickly hit ctrl+shift+1
> - Wait for something to be displayed and hit ctrl-shift-2
>
> Or is there a cleaner, more reproducible method?

That is the flow if you want profiles. If you just want timing, you can set DUMP_STYLE_STATISTICS=1. It will be less spammy after https://github.com/servo/servo/pull/16379 gets merged to m-c (currently on autoland).

> Now to the issues that have been raised:
>
> #1 - There was talk about a lot of lock contention in mozjemalloc (maybe just on Linux) being a driving factor for enabling multiple arenas.

Correct. Cameron says that parallel performance was terrible without it. I just confirmed with him that he noticed that it was slow on linux, but did the actual profiling on a mac pro. So that suggests it may be cross-platform. If you can't reproduce, let me know and I'll see if he can.

> Follow up questions:
> - What specific test case are we looking at? I assume it needs to exercise the parallel parts of stylo.

A good starting point is obama's wikipedia page, saved locally as webpage (complete). That's where cameron observed the locking contention (I think he also observed it on the html spec).

> - What configuration do I need?

A regular opt build with --enable-stylo.

> - What function(s) should I be focused on

Servo_TraverseSubtree. While that happens, the main thread should be waiting on a condvar, and the worker thread should have rayon on the stack (did Nathan land support for profiling the worker threads yet?). The interesting bits are:
* selector matching (get_all_matching_rules)
* style struct computation (cascade)

The latter does a lot of allocating, so I'm guessing that's where the contention was. But there is some allocation during selector matching as well (when inserting the results into the rule tree, which is a data structure that didn't exist when cameron did the profiling). All needs to be remeasured.

> - Is this just in stylo builds, or in general?

Just stylo builds. Cameron noticed that the parallel style traversal was very slow without multiple arenas.

> #2 - We saw a big discrepancy b/w linux and mac. But then we didn't. But then we did.
>
> Follow up questions:
> - What specific test case are we looking at?

The albumart.html myspace testcase with the 100x-duplicated stylesheet. This is the non-"orig" testcase in https://www.dropbox.com/s/h51fspacftf1lcq/myspace.tar.bz2?dl=0

> - What configuration do I need?

Mac profiles were taken on an MBP. Linux profiles were taken on an i7-7700K, on Ubuntu 16.10 64-bit.

> - What function(s) should I be focused on
> - Is this just in stylo builds, or in general?

This was on parsing performance. Look at Servo_StyleSheet_FromUTF8Bytes. Make sure you account for the garbage samples, unless Julian has fixed that.

> #3 - There's a fair amount of overhead on mac in je_malloc_usable_size
>
> I'll split out a bug for this, I'd guess this is stylo agnostic.

Sounds good.

> #4 - There's a fair amount of overhead (on mac?) in _platform_bzero$VARIANT$Haswell
>
> This might just be from bad stacks, cursory profiling with Instruments didn't show much overhead. I can split this out as well.

I would be interested to know what's up with this.

Flags: needinfo?(bobbyholley)

Julian Seward [:jseward]

Assignee

Comment 24

•

8 years ago

(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #23)
> Mac profiles were taken on an MBP. Linux profiles were taken on an i7-7700K, on Ubuntu 16.10 64-bit.
>
> > - What function(s) should I be focused on
> > - Is this just in stylo builds, or in general?
>
> This was on parsing performance. Look at Servo_StyleSheet_FromUTF8Bytes. Make sure you account for the garbage samples, unless Julian has fixed that.

This is improving, but there are multiple problems with unwinding on Linux. I am working on it. Try with https://bugzilla.mozilla.org/show_bug.cgi?id=1354546 which is currently on m-i but not m-c yet, and https://bugzilla.mozilla.org/show_bug.cgi?id=1355853 which is waiting to land. These should reduce the bogus samples significantly. There are other fixes in the pipeline.

Bobby Holley (:bholley)

Reporter

Comment 25

•

8 years ago

Also note that, to reproduce the malloc overhead in #2, you'll of course need to remove the ifndef here: http://searchfox.org/mozilla-central/rev/944f87c575e8a0bcefc1ed8efff10b34cf7a5169/memory/mozjemalloc/jemalloc.c#151

Bobby Holley (:bholley)

Reporter

Comment 26

•

8 years ago

Probably best to continue the discussion of #2 in bug 1291355.

Eric Rahm [:erahm]

Updated

•

8 years ago

Depends on: 1356701

Chris Peterson [:cpeterson]

Comment 27

•

8 years ago

Possibly related? Bug 1329888 is about the Gecko profiler using the wrong symbols on Mac and getting confused about jemalloc and _platform_bzero$VARIANT$Haswell.

Updated

•

8 years ago

Blocks: stylo-jemalloc

Bobby Holley (:bholley)

Reporter

Comment 28

•

8 years ago

(In reply to Eric Rahm [:erahm] from comment #22)
> (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #21)
> > The dependent bug is now fixed. NI erahm to follow up here.
>
> There's been a lot of churn in this bug, I'm not sure exactly what we want to look into at this point. It might make sense to make this a meta bug and split out blockers.
>
> Below are the issues that I think have been raised and are still valid with some specific questions, but first a general question about how you are profiling.
>
> I know we're using the profiler add-on (https://perf-html.io/). Is the flow just:
> - Launch browser w/ e10s disabled
> - Navigate to page and quickly hit ctrl+shift+1
> - Wait for something to be displayed and hit ctrl-shift-2
>
> Or is there a cleaner, more reproducible method?
>
> Now to the issues that have been raised:
>
> #1 - There was talk about a lot of lock contention in mozjemalloc (maybe just on Linux) being a driving factor for enabling multiple arenas.

This was fixed for stylo in bug 1361258.

> Follow up questions:
> - What specific test case are we looking at? I assume it needs to exercise the parallel parts of stylo.
> - What configuration do I need?
> - What function(s) should I be focused on
> - Is this just in stylo builds, or in general?
>
> #2 - We saw a big discrepancy b/w linux and mac. But then we didn't. But then we did.

This was the garbage sample issue.

> Follow up questions:
> - What specific test case are we looking at?
> - What configuration do I need?
> - What function(s) should I be focused on
> - Is this just in stylo builds, or in general?
>
> #3 - There's a fair amount of overhead on mac in je_malloc_usable_size
>
> I'll split out a bug for this, I'd guess this is stylo agnostic.

This is bug 1356701.

> #4 - There's a fair amount of overhead (on mac?) in _platform_bzero$VARIANT$Haswell

This is bug 1360772.

> This might just be from bad stacks, cursory profiling with Instruments didn't show much overhead. I can split this out as well.

So I think we can close this bug as filed.

Status: NEW → RESOLVED

Closed: 8 years ago

Resolution: --- → WORKSFORME