stylo: Sort out malloc overhead issues

RESOLVED WORKSFORME

Status

()

Core
CSS Parsing and Computation
P1
normal
RESOLVED WORKSFORME
4 months ago
2 months ago

People

(Reporter: bholley, Assigned: jseward)

Tracking

(Blocks: 2 bugs)

Firefox Tracking Flags

(Not tracked)

Details

This has been discussed in a few places, and I wanted to centralize discussion on a bug.

There are basically two things at play here:
(1) On OSX (but not linux), we have lots of single-thread malloc overhead. See bug 1331843 comment 28 and onward.
(2) On Linux (but not necessarily elsewhere), heycam saw (last summer) a huge amount of locking contention in malloc, which caused us to unundef MOZ_MEMORY_NARENAS_DEFAULT_ONE in mozjemalloc for stylo builds [1]. Assuming this is still an issue, we need to determine whether this is acceptable to ship.

[1] http://searchfox.org/mozilla-central/rev/b8cce081200129a1307e2a682f121135b5ca1d19/memory/mozjemalloc/jemalloc.c#151
I'm going to copy some recent posts out of a thread to add continuity to this discussion.
Julian writes:

> I modified main() in nsBrowserApp.cpp to run a malloc speed test, with
> selectable numbers of threads and iterations, and then quit.  Patch is attached.
> I hijacked our own main() rather than write a standalone program so as to
> ensure that the test uses whatever jemalloc setup is baked into the build.
> 
> I'm now confused.  For various tests up to 100 threads each iterating one
> million times, I don't see signs of massive contention or slowdown on Mac
> as compared to Linux.  The Mac (a 2.3 GHz Ivy Bridge) performs on an
> absolute basis about 1/3 slower than the Linux box (2.4 GHz Skylake Xeon),
> but that ratio seems more or less consistent across the spectrum of threads
> and iterations that I tested.
> 
> Also .. if malloc is really much slower on MacOS, wouldn't we have seen that
> already in the existing C++-only code base?
>
> So .. I think I missed something. But what?
Nathan writes:

> I am skeptical that mutex overhead is the main blocker here.
> mozjemalloc uses spin locks on OS X:
>
> http://dxr.mozilla.org/mozilla-central/source/memory/mozjemalloc/jemalloc.c#1658
>
> The copy of jemalloc (jemalloc4?) we have in tree also uses spin locks:
>
> http://dxr.mozilla.org/mozilla-central/source/memory/jemalloc/src/include/jemalloc/internal/mutex.h#41
>
> though it looks like 10.12 has a separate unfair lock type (?).
>
> I mean, we could see about substituting unfair pthread_mutex_t locks
> for those, but I'd expect spin locks to be about right for jemalloc.
> On other platforms (Windows, Linux), jemalloc chooses to spin before
> blocking in the kernel; unless we think that blocking in the kernel is
> faster (less CPU intensive, maybe) than continuing to spin, I don't
> know if we can do much better there.
> 
> The WTF::Lock work showed a win over (naive) pthread_mutex_t on OS X;
> it didn't say anything about spin locks.  So we'd have to remeasure
> there to assure ourselves that we would actually win with that.
So I haven't actually verified that the issue is indeed the locks - it was just a guess. That said, it's not clear to me what else would cause 200ms of time in mozjemalloc of osx and 6ms for linux on the same (main-thread-only, no-contention) testcase.
From bug 1331843:

Simon's linux profile is https://perfht.ml/2oPZTrR
My osx profile: https://perfht.ml/2n7lHBk
(Assignee)

Comment 6

4 months ago
Bobby, could you please give some details about how to reproduce the
numbers you saw?

* build config (+ mozconfig)

* workload

* what profiler you used, what steps inside that profiler?
Flags: needinfo?(bobbyholley)
(In reply to Julian Seward [:jseward] from comment #6)
> Bobby, could you please give some details about how to reproduce the
> numbers you saw?
> 
> * build config (+ mozconfig)

OSX, brand new MBP.

export LDFLAGS=-Wl,-no_compact_unwind
mk_add_options AUTOCLOBBER=1
ac_add_options --enable-stylo
ac_add_options --enable-warnings-as-errors
ac_add_options --disable-debug
ac_add_options --enable-optimize
ac_add_options --enable-profiling


> 
> * workload

Download this testcase: https://www.dropbox.com/s/h51fspacftf1lcq/myspace.tar.bz2?dl=0

./mach run
Make sure the gecko profiler is installed ( https://perf-html.io/ )
queue up file:///path/to/testcases/myspace.com/www.myspace.com/albumart.html
Start the profiler (Ctrl-shift-1)
Load the page
Capture the profile (Ctrl-shift-2)
Filter for Servo_StyleSheet_FromUT8Bytes, and select that sub-region of the timeline (this is all main-thread-only)


> 
> * what profiler you used, what steps inside that profiler?

See above. Note that Simon has been working on turning this workflow into a gtest, which you may find useful. I think he just got it working.
Flags: needinfo?(bobbyholley)
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #5)
> From bug 1331843:
> 
> Simon's linux profile is https://perfht.ml/2oPZTrR

This says that we have:

- 443 ms: style::properties::declaration_block::parse_property_declaration_list (34 ms self-time)
  - 206 ms: style::properties::ParsedDeclaration::parse (34 ms self-time)
  - 73 ms: style::properties::ParsedDeclaration::expand_into (24 ms self-time)
  - 55 ms: __memmove_avx_unaligned_erms (55 ms self-time)
  - 43 ms: cssparser::parser::Parser::next_including_whitespace_and_comments (15 ms self-time)
  - 25 ms: style::properties::PropertyId::parse (17 ms self-time)

> My osx profile: https://perfht.ml/2n7lHBk

This says that we have:

- 515 ms: style::properties::declaration_block::parse_property_declaration_list::he587fadafe443966 (35 ms self-time)
  - 214 ms: style::properties::ParsedDeclaration::parse::h2bded4b3d7f7f982 (24 ms self-time)
  - 148 ms: style::properties::ParsedDeclaration::expand::h16add01a38f7c1b7 (32 ms self-time)
  - 48 ms: _platform_memmove$VARIANT$Haswell (48 ms self-time)
  - 31 ms: cssparser::parser::Parser::next_including_whitespace_and_comments::h26d9183299f59cad (14 ms self-time)
  - 26 ms: style::properties::PropertyId::parse::h73968137e91960cf (17 ms self-time)

Those look pretty similar, actually, certainly nothing that would be explained by malloc overhead.

Though the 75ms difference between ParsedDeclaration::expand_into on Linux and ParsedDeclaration::expand on Mac looks like the entirety of the difference between Linux's 443ms and Mac's 515ms.  Didn't Simon recently rewrite ParsedDeclaration::expand to only be expand_into?  Or is the compiler making different inlining decisions here?
(In reply to Nathan Froyd [:froydnj] from comment #8)
> Though the 75ms difference between ParsedDeclaration::expand_into on Linux
> and ParsedDeclaration::expand on Mac looks like the entirety of the
> difference between Linux's 443ms and Mac's 515ms.  Didn't Simon recently
> rewrite ParsedDeclaration::expand to only be expand_into?  Or is the
> compiler making different inlining decisions here?

To be clear: are these profiles actually testing the same code?  I don't see ParsedDeclaration::expand anymore, only ParsedDeclaration::expand_into.
(In reply to Nathan Froyd [:froydnj] from comment #9)
> (In reply to Nathan Froyd [:froydnj] from comment #8)
> > Though the 75ms difference between ParsedDeclaration::expand_into on Linux
> > and ParsedDeclaration::expand on Mac looks like the entirety of the
> > difference between Linux's 443ms and Mac's 515ms.  Didn't Simon recently
> > rewrite ParsedDeclaration::expand to only be expand_into?  Or is the
> > compiler making different inlining decisions here?
> 
> To be clear: are these profiles actually testing the same code?  I don't see
> ParsedDeclaration::expand anymore, only ParsedDeclaration::expand_into.

The stylo profile I linked to was captured a few weeks ago. I get the same results on an up-to-date one (git rev fd7a240d3085fe35b1d67f4216459f28f1e53944): https://perfht.ml/2oFdtSG
We discovered that the linux profile does have a roughly-equivalent amount of time spent in malloc, and the samples are just garbage and categorized under OnInputStreamReady.

So that's one mystery solved. Remaining questions are:
(1) Why is stylo css parsing so much faster on linux, and
(2) Do we still need multiple mozjemalloc arenas, and what is the impact of enabling them?
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #11)
> We discovered that the linux profile does have a roughly-equivalent amount
> of time spent in malloc, and the samples are just garbage and categorized
> under OnInputStreamReady.

Julian volunteered to look into the unwinding issues, FWIW.
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #11)
> We discovered that the linux profile does have a roughly-equivalent amount
> of time spent in malloc, and the samples are just garbage and categorized
> under OnInputStreamReady.
> 
> So that's one mystery solved. Remaining questions are:
> (1) Why is stylo css parsing so much faster on linux, and

Simon pointed out that this is obviously because the malloc overhead isn't getting billed to Servo_StyleSheet_FromUTF8Bytes due to the garbage samples. So there's no discrepancy, stylo parsing is objectively slower, and Simon can hopefully fix that in bug 1331843.

So we just need to investigate the multiple arenas issue here.
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #11)
> (2) Do we still need multiple mozjemalloc arenas, and what is the impact of
> enabling them?

Do we have details for heycam's original test?
(In reply to Eric Rahm [:erahm] from comment #14)
> (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #11)
> > (2) Do we still need multiple mozjemalloc arenas, and what is the impact of
> > enabling them?
> 
> Do we have details for heycam's original test?

It was https://en.wikipedia.org/wiki/Barack_Obama , and I believe it was on a Linux Xeon box. Worth retesting.

Note that the overhead in this case will be under Servo_TraverseSubtree. You'll want to set STYLO_THREADS=N to try different numbers of worker threads.
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #13)
> (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #11)
> > We discovered that the linux profile does have a roughly-equivalent amount
> > of time spent in malloc, and the samples are just garbage and categorized
> > under OnInputStreamReady.
> > 
> > So that's one mystery solved. Remaining questions are:
> > (1) Why is stylo css parsing so much faster on linux, and
> 
> Simon pointed out that this is obviously because the malloc overhead isn't
> getting billed to Servo_StyleSheet_FromUTF8Bytes due to the garbage samples.
> So there's no discrepancy, stylo parsing is objectively slower, and Simon
> can hopefully fix that in bug 1331843.

_except_ that gecko _also_ has garbage malloc samples, it turns out. So Simon's profiles do show that linux is still faster, which doesn't match what we see on osx.
(Assignee)

Comment 17

4 months ago
(In reply to Nathan Froyd [:froydnj] from comment #12)
> Julian volunteered to look into the unwinding issues, FWIW.

I put a patch containing a fix for at least some of the unwind breakage at
https://bugzilla.mozilla.org/show_bug.cgi?id=1354546#c1.
So just to be clear we think the OSX profile is legit? Also I assume the OSX profile was *with* multiple arenas right?

It looks like you're doing a ton of memmoving:
82 ms - _platform_memmove$VARIANT$Haswell

That's mostly coming from:
- style::properties::declaration_block::parse_property_declaration_list::he587fadafe443966
- realloc


There's also:
13 ms - _platform_bzero$VARIANT$Haswell

This is odd, that should really only show up with we're using calloc, which makes me question our stacks.

And another odd one:
16 ms - _je_malloc_usable_size

This should be super fast, so maybe a degradation from enabling multiple arenas?
(In reply to Eric Rahm [:erahm] from comment #18)
> So just to be clear we think the OSX profile is legit? Also I assume the OSX
> profile was *with* multiple arenas right?

Yes. But keep in mind that the stuff you're looking at is CSS parsing, which is main thread only, and therefore unlikely to be improved by multiple arenas (though maybe regressed, as you note). The multithreaded part is under Servo_TraverseSubtree, where you should just see a condvar wait, and you won't see the interesting work on the worker threads because we don't have profiler support for them yet.

> 
> It looks like you're doing a ton of memmoving:
> 82 ms - _platform_memmove$VARIANT$Haswell
> 
> That's mostly coming from:
> -
> style::properties::declaration_block::parse_property_declaration_list::
> he587fadafe443966
> - realloc

Yeah, in general Rust is pretty memmov-heavy. There are two reasons:
(1) The move semantics of Rust mean that we end up moving more, in general.
(2) Reliance on Vecs causes overhead when doubling.

see bug 1331843 comment 26 for my proposal on using a stack-allocated vec when parsing property declarations to eliminate the realloc bits (at the expense of slightly more memmoving in general).

> 
> 
> There's also:
> 13 ms - _platform_bzero$VARIANT$Haswell
> 
> This is odd, that should really only show up with we're using calloc, which
> makes me question our stacks.

I'm not sure - would be interested to know what's going on there.

> 
> And another odd one:
> 16 ms - _je_malloc_usable_size
> 
> This should be super fast, so maybe a degradation from enabling multiple
> arenas?

I'd be interested in this too.
Blocks: 1243581
I attempted to start looking into this, but am blocked by bug 1355253. Once that's fixed and stylo runs on osx without crashing I can look into things further.
Depends on: 1355253
Depends on: 1354772
No longer depends on: 1355253
The dependent bug is now fixed. NI erahm to follow up here.
Flags: needinfo?(erahm)
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #21)
> The dependent bug is now fixed. NI erahm to follow up here.

There's been a lot of churn in this bug, I'm not sure exactly what we want to look into at this point. It might make sense to make this a meta bug and split out blockers.

Below are the issues that I think have been raised and are still valid with some specific questions, but first a general question about how you are profiling.

I know we're using the profiler add-on (https://perf-html.io/). Is the flow just:
  - Launch browser w/ e10s disabled
  - Navigate to page and quickly hit ctrl+shift+1
  - Wait for something to be displayed and hit ctrl-shift-2

Or is there cleaner method more reproducible method?

Now to the issues that have been raised:

#1 - There was talk about a lot of lock contention in mozjemalloc (maybe just on Linux) being a driving factor for enabling multiple arenas.

Follow up questions:
  - What specific test case are we looking at? I assume it needs to exercise the parallel parts of stylo.
  - What configuration do I need?
  - What function(s) should I be focused on
  - Is this just in stylo builds, or in general?

#2 - We saw a big discrepancy b/w linux and mac. But then we didn't. But then we did.

Follow up questions:
  - What specific test case are we looking at?
  - What configuration do I need?
  - What function(s) should I be focused on
  - Is this just in stylo builds, or in general?

#3 - There's a fair amount of overhead on mac in je_malloc_usable_size

I'll split out a bug for this, I'd guess this is stylo agnostic.

#4 - There's a fair amount of overhead (on mac?) in _platform_bzero$VARIANT$Haswell

This might just be from bad stacks, cursory testing profiling with Instruments didn't show much overhead. I can split this out as well.
Flags: needinfo?(erahm) → needinfo?(bobbyholley)
Responding in a hurry here because we're about to head out to dinner.

(In reply to Eric Rahm [:erahm] from comment #22)
> (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #21)
> > The dependent bug is now fixed. NI erahm to follow up here.
> 
> There's been a lot of churn in this bug, I'm not sure exactly what we want
> to look into at this point. It might make sense to make this a meta bug and
> split out blockers.

Yep. Item 2 (the question of what to do about the multiple jemalloc arenas) is the most urgent.

> 
> Below are the issues that I think have been raised and are still valid with
> some specific questions, but first a general question about how you are
> profiling.
> 
> I know we're using the profiler add-on (https://perf-html.io/). Is the flow
> just:
>   - Launch browser w/ e10s disabled

e10s doesn't need to be disabled AFAIK (is that still in the etherpad? If so we should remove it). We should focus perf efforts on e10s, since that's what'll ship in 57. I don't think the issues here differ with/without.

>   - Navigate to page and quickly hit ctrl+shift+1
>   - Wait for something to be displayed and hit ctrl-shift-2
> 
> Or is there cleaner method more reproducible method?

That is the flow if you want profiles. If you just want timing, you can set DUMP_STYLE_STATISTICS=1. It will be less spammy after https://github.com/servo/servo/pull/16379 gets merged to m-c (currently on autoland).

> 
> Now to the issues that have been raised:
> 
> #1 - There was talk about a lot of lock contention in mozjemalloc (maybe
> just on Linux) being a driving factor for enabling multiple arenas.

Correct. Cameron says that parallel performance was terrible without it. I just confirmed with him that he noticed that it was slow on linux, but did the actual profiling on a mac pro. So that suggests it may be cross-platform. If you can't reproduce, let me know and I'll see if he can.

> 
> Follow up questions:
>   - What specific test case are we looking at? I assume it needs to exercise
> the parallel parts of stylo.

A good starting point is obama's wikipedia page, saved locally as webpage (complete). That's where cameron observed the locking contention (I think he also observed it on the html spec).

>   - What configuration do I need?

A regular opt build with --enable-stylo.

>   - What function(s) should I be focused on

Servo_TraverseSubtree. While that happens, the main thread should be waiting on a condvar, and the worker thread should have rayon on the stack (did Nathan land support for profiling the worker threads yet?). The interesting bits are:
* selector matching (get_all_matching_rules)
* style struct computation (cascade)

The latter does a lot of allocating, so I'm guessing that's where the contention was. But there is some allocation during selector matching as well (when inserting the results into the rule tree, which is a data structure that didn't exist when cameron did the profiling). All needs to be remeasured.

>   - Is this just in stylo builds, or in general?

Just stylo builds. Cameron noticed that the parallel style traversal was very slow without.

> 
> #2 - We saw a big discrepancy b/w linux and mac. But then we didn't. But
> then we did.
> 
> Follow up questions:
>   - What specific test case are we looking at?

The albumart.html myspace testcase with the 100x-duplicated stylesheet. This is the non-"orig" testcase in https://www.dropbox.com/s/h51fspacftf1lcq/myspace.tar.bz2?dl=0

>   - What configuration do I need?

Mac profiles were taken on an MBP. Linux profiles were taken on an i7-7700K, on Ubuntu 16.10 64-bit.
>   - What function(s) should I be focused on
>   - Is this just in stylo builds, or in general?

This was on parsing performance. Look at Servo_StyleSheet_FromUT8Bytes. Make sure you account for the garbage samples, unless Julian has fixed that.


> 
> #3 - There's a fair amount of overhead on mac in je_malloc_usable_size
> 
> I'll split out a bug for this, I'd guess this is stylo agnostic.

Sounds good.

> 
> #4 - There's a fair amount of overhead (on mac?) in
> _platform_bzero$VARIANT$Haswell
> 
> This might just be from bad stacks, cursory testing profiling with
> Instruments didn't show much overhead. I can split this out as well.

I would be interested to know what's up with this.
Flags: needinfo?(bobbyholley)
(Assignee)

Comment 24

3 months ago
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #23)
> Mac profiles were taken on an MBP. Linux profiles were taken on an i7-7700K,
> on Ubuntu 16.10 64-bit.
> >   - What function(s) should I be focused on
> >   - Is this just in stylo builds, or in general?
> 
> This was on parsing performance. Look at Servo_StyleSheet_FromUT8Bytes. Make
> sure you account for the garbage samples, unless Julian has fixed that.

This is improving, but there are multiple problems with unwinding on Linux.
I am working on it.  Try with

  https://bugzilla.mozilla.org/show_bug.cgi?id=1354546
  which is currently on m-i but not m-c yet, and

  https://bugzilla.mozilla.org/show_bug.cgi?id=1355853
  which is waiting to land.

These should reduce the bogus samples significantly.  There are other fixes
in the pipeline.
Also note that, to reproduce the malloc overhead in #2, you'll of course need to remove the ifndef here:

http://searchfox.org/mozilla-central/rev/944f87c575e8a0bcefc1ed8efff10b34cf7a5169/memory/mozjemalloc/jemalloc.c#151
Probably best to continue the discussion of #2 in bug 1291355.
Depends on: 1356701
Possibly related? Bug 1329888 is about the Gecko profiler using the wrong symbols on Mac and getting confused about jemalloc and _platform_bzero$VARIANT$Haswell.
See Also: → bug 1329888
Blocks: 1291355
(In reply to Eric Rahm [:erahm] from comment #22)
> (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #21)
> > The dependent bug is now fixed. NI erahm to follow up here.
> 
> There's been a lot of churn in this bug, I'm not sure exactly what we want
> to look into at this point. It might make sense to make this a meta bug and
> split out blockers.
> 
> Below are the issues that I think have been raised and are still valid with
> some specific questions, but first a general question about how you are
> profiling.
> 
> I know we're using the profiler add-on (https://perf-html.io/). Is the flow
> just:
>   - Launch browser w/ e10s disabled
>   - Navigate to page and quickly hit ctrl+shift+1
>   - Wait for something to be displayed and hit ctrl-shift-2
> 
> Or is there cleaner method more reproducible method?
> 
> Now to the issues that have been raised:
> 
> #1 - There was talk about a lot of lock contention in mozjemalloc (maybe
> just on Linux) being a driving factor for enabling multiple arenas.

This was fixed for stylo in bug 1361258.

> 
> Follow up questions:
>   - What specific test case are we looking at? I assume it needs to exercise
> the parallel parts of stylo.
>   - What configuration do I need?
>   - What function(s) should I be focused on
>   - Is this just in stylo builds, or in general?
> 
> #2 - We saw a big discrepancy b/w linux and mac. But then we didn't. But
> then we did.

This was the garbage sample issue.

> 
> Follow up questions:
>   - What specific test case are we looking at?
>   - What configuration do I need?
>   - What function(s) should I be focused on
>   - Is this just in stylo builds, or in general?
> 
> #3 - There's a fair amount of overhead on mac in je_malloc_usable_size
> 
> I'll split out a bug for this, I'd guess this is stylo agnostic.


This is bug 1356701.

> 
> #4 - There's a fair amount of overhead (on mac?) in
> _platform_bzero$VARIANT$Haswell

This is bug 1360772.

> 
> This might just be from bad stacks, cursory testing profiling with
> Instruments didn't show much overhead. I can split this out as well.

So I think we can close this bug as filed.
Status: NEW → RESOLVED
Last Resolved: 2 months ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.