Closed Bug 1361258 Opened 3 years ago Closed 3 years ago

stylo: Use Thread-local jemalloc arenas

Categories

(Core :: CSS Parsing and Computation, enhancement, P1)

enhancement

Tracking

()

RESOLVED FIXED
mozilla55
Tracking Status
firefox55 --- fixed

People

(Reporter: bholley, Assigned: glandium)

References

(Blocks 1 open bug)

Details

Attachments

(6 files, 6 obsolete files)

11.28 KB, patch
glandium
: review+
glandium
: checkin+
Details | Diff | Splinter Review
4.09 KB, patch
erahm
: review+
bholley
: feedback+
glandium
: checkin+
Details | Diff | Splinter Review
1.43 KB, patch
glandium
: review+
Details | Diff | Splinter Review
1.34 KB, patch
sfink
: review+
glandium
: checkin+
Details | Diff | Splinter Review
7.10 KB, patch
erahm
: review+
glandium
: checkin+
Details | Diff | Splinter Review
2.16 KB, patch
bholley
: review+
Details | Diff | Splinter Review
See the discussion surrounding bug 1291355 comment 35.
Flags: needinfo?(bobbyholley)
Attached patch bug1361258-1.cset (obsolete) — Splinter Review
Patch for contemplation.  The switching is done in traverse_subtree
in servo/ports/geckolib/glue.rs.
Nice! bug 1347525 is the benchmark to measure, according to smaug.
Assignee: bobbyholley → jseward
Flags: needinfo?(bobbyholley)
Here are "[PERF],traversal_time_ms" numbers for the Obama testcase,
taking the best of 4 runs:

one arena, obama
  1  97 97 99 101 -> 97
  2  70 63 63 64  -> 63
  4  50 52 47 46  -> 46
  8  71 61 61 62  -> 61

multi arena, obama
  1  103 101 102 101 -> 101
  2  56 57 57 68     ->  56
  4  36 35 35 35     ->  35
  8  25 25 24 24     ->  24

adaptive, obama
  1  98 109 99 98 -> 98
  2  55 58 56 56  -> 55
  4  35 38 39 36  -> 35
  8  25 24 25 31  -> 24

Single-arena scales poorly but is marginally faster in the 1-thread case.
Multi arena scales much better.  Adaptive (meaning, with this patch) does
as well as Multi for 8 threads and as well as a single arena for the one
thread case.

<marginally OT>
Ignoring the one-arena case, we get a more or less 3 x speedup with 4 
threads but only a 4 x speedup with 8 threads.  It would be interesting
to know why that is:

* lack of available parallelism in the problem?

* low level resource contention of some kind (RFO misses)?

* the effects of hyperthreading? -- a core shared by two threads
  is going to be slower for each thread than if only one thread
  runs on it

* something else?
</marginally OT>
(In reply to Julian Seward [:jseward] from comment #1)
> Patch for contemplation.  The switching is done in traverse_subtree
> in servo/ports/geckolib/glue.rs.

One thing I don't like about this is the fact that we have to "manually"
change the allocation mode back to single-arena after the parallel
traversal.  It would be all too easy to forget to do that some time
in the future.  Also, what happens if the parallel traversal takes an
exception of some kind?

I would prefer to wrap this up in a C++ style RAII construct.  Is that
possible in Rust?  And would that help in the exception case?

For reference, the relevant code fragment (post patch) is:

if traversal_driver.is_parallel() {
    unsafe { bindings::Gecko_SetMultipleJemallocArenasEnabled(true); }
    parallel::traverse_dom(&traversal, element, token,
                           global_style_data.style_thread_pool.as_ref().unwrap());
    unsafe { bindings::Gecko_SetMultipleJemallocArenasEnabled(false); }
} else {
    sequential::traverse_dom(&traversal, element, token);
}
Flags: needinfo?(bobbyholley)
Here are some numbers for smaug's microbenchmark as discussed in bug 1347525,
specifically using the test case at bug 1347525 comment 13.

This is on Linux x86_64, with some effort made to run on quiet machine.
The following numbers are time in ms:

STYLO_THREADS=1    single arena     78
STYLO_THREADS=8    single arena     78

STYLO_THREADS=1    multi arena      75
STYLO_THREADS=8    multi arena      76

STYLO_THREADS=1    adaptive         76
STYLO_THREADS=8    adaptive         77

m-c, -Og, non stylo, ~ a week old   64

So there's no evidence here that multi arenas slows down this test, nor that
this patch is helping.  I'll try on OSX, since one thing that might be different
there is the TLS costs.
Stylo overhead in general is horrible there. That must get fixed.

Comparing opt m-i (no stylo), 
MOZ_MEMORY_NARENAS_DEFAULT_ONE enabled (the default setup) I get best value 47
MOZ_MEMORY_NARENAS_DEFAULT_ONE disabled (MOZ_STYLO case), I get best value 49.
So the difference is smaller than it used to be, which makes sense since malloc usage has been reduced in HTML parser, and elsewhere in DOM.
(In reply to Julian Seward [:jseward] from comment #4)
> One thing I don't like about this is the fact that we have to "manually"
> change the allocation mode back to single-arena after the parallel
> traversal.  It would be all too easy to forget to do that some time
> in the future.  Also, what happens if the parallel traversal takes an
> exception of some kind?
> 
> I would prefer to wrap this up in a C++ style RAII construct.  Is that
> possible in Rust?

It would look something like:

  pub struct AutoMultipleJemallocArenas;

  impl AutoMultipleJemallocArenas {
      pub fn new() -> AutoMultipleJemallocArenas {
          unsafe {
              bindings::Gecko_SetMultipleJemallocArenasEnabled(true);
          }
      }
  }

  impl Drop for AutoMultipleJemallocArenas {
      fn drop(&mut self) {
          unsafe {
              bindings::Gecko_SetMultipleJemallocArenasEnabled(false);
          }
      }
  }

And then usage would look something like:

  let mut maybe_arenas = None;
  if traversal_driver.is_parallel() {
      maybe_arenas = Some(AutoMultipleJemallocArenas::new());
  }

  // do traversal...
      

> if traversal_driver.is_parallel() {
>     unsafe { bindings::Gecko_SetMultipleJemallocArenasEnabled(true); }
>     parallel::traverse_dom(&traversal, element, token,
>                           
> global_style_data.style_thread_pool.as_ref().unwrap());
>     unsafe { bindings::Gecko_SetMultipleJemallocArenasEnabled(false); }
> } else {
>     sequential::traverse_dom(&traversal, element, token);
> }
(In reply to Olli Pettay [:smaug] from comment #6)
> Stylo overhead in general is horrible there. That must get fixed.

On this testcase specifically, with stylo enabled? Can you file a bug with a profile?

> 
> Comparing opt m-i (no stylo), 
> MOZ_MEMORY_NARENAS_DEFAULT_ONE enabled (the default setup) I get best value
> 47
> MOZ_MEMORY_NARENAS_DEFAULT_ONE disabled (MOZ_STYLO case), I get best value
> 49.
> So the difference is smaller than it used to be, which makes sense since
> malloc usage has been reduced in HTML parser, and elsewhere in DOM.

Ok, I think the next things to do are:
(1) Remeasure comment 5 with a checkout of m-c from about 21 days ago, which is when the discussion around bug 1291355 comment 6 happened. We should be able to reproduce the 10% overhead he talked about, and then see if adaptive arenas fix it.
(2) Push both multiple arenas and adapative arenas to try with -t all to get talos results.

Comment 7 is spot-on for doing RAII in Rust. I also think we should do it in parallel.rs rather than in the caller in case we call traverse_dom from multiple places.
Flags: needinfo?(bobbyholley)
(We can make it compile-time conditional by doing if cfg!(feature = "gecko") { }).
FWIW, talos won't give too useful numbers.

Also, if we focus on the innerHTML case, bug 1355441 will reduce malloc usage there some more.


(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #8)
> (In reply to Olli Pettay [:smaug] from comment #6)
> > Stylo overhead in general is horrible there. That must get fixed.
> 
> On this testcase specifically, with stylo enabled? Can you file a bug with a
> profile?
Julian, could you file that? I don't have stylo builds around to get Gecko profiler profile.
Flags: needinfo?(jseward)
(In reply to Olli Pettay [:smaug] from comment #10)
> FWIW, talos won't give too useful numbers.
> 
> Also, if we focus on the innerHTML case, bug 1355441 will reduce malloc
> usage there some more.

Yes, but the point is to figure out the impact of adaptive arenas on malloc-bound workloads. I want to make sure it fixes the problem you measured before so that we can be sure there's no overhead in other areas when we turn it on.
Sure, we need to find some new microbenchmarks.
Perhaps some tescase which is creating lots of new elements.
Bug 1232023 might be such?
(In reply to Olli Pettay [:smaug] from comment #12)
> Sure, we need to find some new microbenchmarks.

In comment 8, I proposed just checking out a copy of m-c from last month when you measured it and using that. Is there any reason that wouldn't work?
(In reply to Olli Pettay [:smaug] from comment #10)
> (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #8)
> > (In reply to Olli Pettay [:smaug] from comment #6)
> > > Stylo overhead in general is horrible there. That must get fixed.
> > 
> > On this testcase specifically, with stylo enabled? Can you file a bug with a
> > profile?
> Julian, could you file that? I don't have stylo builds around to get Gecko
> profiler profile.

Can do, but I need to know: What sampling rate?  Which threads?
Flags: needinfo?(jseward) → needinfo?(bugs)
Whatever is the default rate in the profiler. And main thread should be enough, since that is where the slow down is. I assume the issue is that stylo does something slow whenever an element is added to DOM, 
or the issue is that stylo increases the sizeof Element by 8 bytes, and that causes elements to no fit so well into cpu caches or something.
Flags: needinfo?(bugs)
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #8)
> (1) Remeasure comment 5 with a checkout of m-c from about 21 days ago, which
> is when the discussion around bug 1291355 comment 6 happened. We should be
> able to reproduce the 10% overhead he talked about, and then see if adaptive
> arenas fix it.

Here's what I've got.  The tl;dr is that I can't repro it.

I compared m-c, optimised, non-debug, non-stylo, from a few hours ago
("current") and for 21 days ago ("previous"), using the following changeset
chosen more or less arbitrarily:

  changeset:   353289:02bf71459f40
  user:        Florian Queze <florian@queze.net>
  date:        Fri Apr 14 21:51:40 2017 +0200
  summary:     Bug 1356569 - fix test_SyncedTabsDeckComponent.js, r=jaws.

With results (milliseconds, smaller is better)

  current, single arena   69
  current, multi arena    69

  previous, single arena  66
  previous, multi arena   67

So there's not much to see here.  I *think* I did this right.

I also tried current-vs-previous with optimised, non-debug, stylo enabled:

  current: see comment 5, but copying in for convenience:

    STYLO_THREADS=1    single arena     78
    STYLO_THREADS=8    single arena     78

    STYLO_THREADS=1    multi arena      75
    STYLO_THREADS=8    multi arena      76

    STYLO_THREADS=1    adaptive         76
    STYLO_THREADS=8    adaptive         77

  previous:

    STYLO_THREADS=1    single arena     82
    STYLO_THREADS=8    single arena     81

    STYLO_THREADS=1    multi arena      79
    STYLO_THREADS=8    multi arena      81

    STYLO_THREADS=1    adaptive         82
    STYLO_THREADS=8    adaptive         81

This is on Linux x86_64, Fedora 25, Intel(R) Xeon(R) CPU E3-1505M v5, 2.6 GHz.
(4 cores, 8 threads, 8 MB L3).
ok, thanks. Perhaps I should have then re-run the test whenever I did it.
(I did re-run jemalloc4 test more often)

Might be worth to try also https://bugzilla.mozilla.org/show_bug.cgi?id=1232023

Stylo regression is anyhow very bad.
(in bug 1232023, malloc seems to take 25% of cloneNode)
(In reply to Julian Seward [:jseward] from comment #16)

Ah, wait.  I made a stupid mistake with the following measurements
and they should be ignored:

> With results (milliseconds, smaller is better)
> 
>   current, single arena   69
>   current, multi arena    69
> 
>   previous, single arena  66
>   previous, multi arena   67

I built with --disable-jemalloc.  So these numbers are irrelevant.

The values for the optimised, non-debug, stylo enabled build are
AFAIK still valid.
(In reply to Julian Seward [:jseward] from comment #19)
> (In reply to Julian Seward [:jseward] from comment #16)
> Ah, wait.  I made a stupid mistake with the following measurements
> and they should be ignored:

Sorry for the noise.  Here are new measurements.  They use the mozconfig
shown below, and the following configurations for single vs multi arena:

single arena:
 memory/mozjemalloc/jemalloc.c:151:
  #ifndef MOZ_STYLO
  #define MOZ_MEMORY_NARENAS_DEFAULT_ONE
  #endif

multi arena:
  #ifndef MOZ_STYLO
  //#define MOZ_MEMORY_NARENAS_DEFAULT_ONE
  #endif

Results are: (without stylo, m-c curr)

  single arena  53 54
  multi arena   53 54

(14 Apr) 353289:02bf71459f40

  single arena  59 57 58
  multi arena   57 60 56

If we take the minimum time as most accurately representing the actual cost,
then I still can't demonstrate any slowdown for the multi-arena case.

smaug, can you sanity check this config?  Am I doing something different or
wrong here, relative to what you measured?

--- mozconfig -----------------------------------------------------

. $topsrcdir/browser/config/mozconfig
mk_add_options MOZ_OBJDIR=@TOPSRCDIR@/gcc-O2-nondebug-plain
ac_add_options --enable-tests

ac_add_options --enable-optimize="-g -O2"
ac_add_options --enable-debug-symbols

ac_add_options --enable-profiling

ac_add_options --disable-crashreporter

mk_add_options MOZ_MAKE_FLAGS="-j8"
mk_add_options AUTOCLOBBER=1
Flags: needinfo?(bugs)
Well, that sounds like there isn't regression, as I hinted in comment 17.
But I'll try again.
Flags: needinfo?(bugs)
I don't know why I got the slower performance a month ago.

Anyhow, what is still clear that there is some major stylo specific regression, based on comment 16.
Has the bug filed for that?
(In reply to Olli Pettay [:smaug] from comment #22)
> Anyhow, what is still clear that there is some major stylo specific
> regression, based on comment 16.  Has the bug filed for that?

Yes, bug 1362982.
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #8)
> (2) Push both multiple arenas and adapative arenas to try with -t all to get
> talos results.

Here are results for single and multiple arenas (not adaptive).  They are
generated with
./mach try -b do -p all -u all -t all --rebuild-talos 5

Single   is changeset 5a190b..
Multiple is changeset dcf903..

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=5a190bf0eeb8d55404cd84062c573cbed092a64e&newProject=try&newRevision=dcf903ecb43d590a983a467c339dd588ee6fe3a3&framework=1&showOnlyImportant=0

They show no (important) perf changes, but there are RSS regressions for
windows7-32, up to 15.80%.  No other platforms seem to space-regress, though.
There are also very large changes to "tp5o XRes opt" and "tp5o XRes opt e10s".
What are those?
(In reply to Julian Seward [:jseward] from comment #24)
> (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #8)
> > (2) Push both multiple arenas and adapative arenas to try with -t all to get
> > talos results.
> 
> Here are results for single and multiple arenas (not adaptive).  They are
> generated with
> ./mach try -b do -p all -u all -t all --rebuild-talos 5
> 
> Single   is changeset 5a190b..
> Multiple is changeset dcf903..
> 
> https://treeherder.mozilla.org/perf.html#/
> compare?originalProject=try&originalRevision=5a190bf0eeb8d55404cd84062c573cbe
> d092a64e&newProject=try&newRevision=dcf903ecb43d590a983a467c339dd588ee6fe3a3&
> framework=1&showOnlyImportant=0
> 
> They show no (important) perf changes, but there are RSS regressions for
> windows7-32, up to 15.80%.  No other platforms seem to space-regress, though.

Oh hm, that's worth looking into. How does it look with adaptive? I'm curious as to whether it's a sunk cost or depends on the amount of multiple arena usage.

> There are also very large changes to "tp5o XRes opt" and "tp5o XRes opt
> e10s".
> What are those?

https://wiki.mozilla.org/Buildbot/Talos/Tests#Xres_.28X_Resource_Monitoring.29
(In reply to Julian Seward [:jseward] from comment #24)
> (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #8)
> > (2) Push both multiple arenas and adapative arenas to try with -t all to get
> > talos results.
> 
> Here are results for single and multiple arenas (not adaptive).  They are
> generated with
> ./mach try -b do -p all -u all -t all --rebuild-talos 5
> 
> Single   is changeset 5a190b..
> Multiple is changeset dcf903..
> 
> https://treeherder.mozilla.org/perf.html#/
> compare?originalProject=try&originalRevision=5a190bf0eeb8d55404cd84062c573cbe
> d092a64e&newProject=try&newRevision=dcf903ecb43d590a983a467c339dd588ee6fe3a3&
> framework=1&showOnlyImportant=0
> 
> They show no (important) perf changes, but there are RSS regressions for
> windows7-32, up to 15.80%.  No other platforms seem to space-regress, though.

Really? If I uncheck 'show only important changes', I see a 12% linux regression too.

These regressions are on the order of 25MB, which is very large, much higher than it should be, at least for adaptive. Can you test adaptive? I would expect a maximum overhead of chunksize * num_extra_arenas, which IIUC should be something on the order of 4MB.
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #26)
> Really? If I uncheck 'show only important changes', I see a 12% linux
> regression too.

Ah, right, I didn't see that at first.  It's marked as low confidence,
but I'm not sure what that means.

> Can you test adaptive?

Sure.  I also re-tested multi, so as to test all 3 variants from exactly
the same sources.

Single vs Multi:

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=22028266be5e4485a959d44b1619c7e3d3f80dfa&newProject=try&newRevision=3dbb7af4687d3ee209016ce9ce01c23d1a1eebd2&framework=1

Single vs Adaptive:

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=22028266be5e4485a959d44b1619c7e3d3f80dfa&newProject=try&newRevision=64247742801704909670ff22d798413f871e09ee&framework=1

Single vs Adaptive doesn't seem to show much by way of high-confidence
differences apart from a 6% improvement on linux64-stylo.  I assume that's
because most tests are done on non-Stylo-enabled builds, so they behave
the same as single-arena.

Is that what you wanted me to measure?  I'm somewhat confused.
Flags: needinfo?(bobbyholley)
(In reply to Julian Seward [:jseward] from comment #27)
> (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #26)
> > Really? If I uncheck 'show only important changes', I see a 12% linux
> > regression too.
> 
> Ah, right, I didn't see that at first.  It's marked as low confidence,
> but I'm not sure what that means.
> 
> > Can you test adaptive?
> 
> Sure.  I also re-tested multi, so as to test all 3 variants from exactly
> the same sources.
> 
> Single vs Multi:
> 
> https://treeherder.mozilla.org/perf.html#/
> compare?originalProject=try&originalRevision=22028266be5e4485a959d44b1619c7e3
> d3f80dfa&newProject=try&newRevision=3dbb7af4687d3ee209016ce9ce01c23d1a1eebd2&
> framework=1
> 
> Single vs Adaptive:
> 
> https://treeherder.mozilla.org/perf.html#/
> compare?originalProject=try&originalRevision=22028266be5e4485a959d44b1619c7e3
> d3f80dfa&newProject=try&newRevision=64247742801704909670ff22d798413f871e09ee&
> framework=1
> 
> Single vs Adaptive doesn't seem to show much by way of high-confidence
> differences apart from a 6% improvement on linux64-stylo.  I assume that's
> because most tests are done on non-Stylo-enabled builds, so they behave
> the same as single-arena.

The linux64-stylo talos tests run everything with stylo, so I don't think that's accurate. The 6% improvement on linux64-stylo is presumably because stylo builds currently force multiple arenas, and so enabling adapative wins a lot of that back.

From what I see in your try pushes, linux64-stylo still uses about 15MB more RSS than vanilla linux64. We should investigate how much of that is due to multiple arenas and how much of it is due to other memory overhead in stylo. If the overhead of multiple arenas is more than a few MB, we should investigate why.



> 
> Is that what you wanted me to measure?  I'm somewhat confused.

Yes. The key point I wanted to establish is that adaptive doesn't regress other random stuff when stylo isn't turned on. The near-term goal is to get stylo built by default, which means getting rid of the special compile-time stylo behavior in mozjemalloc.c.

So in summary, we should do two things:
(1) Get adaptive arenas reviewed and landed. Ideally this would be on m-c before the end of the week, because that's our target date for building stylo by default. Do you think this is realistic?
(2) File a separate bug to measure the RSS hit of adaptive arenas on stylo, as described above. Once (1) is landed, this can by done by simply pushing a linux64-stylo try push with adaptive arenas disabled, and comparing that to the baseline results.

Let me know if that makes sense and what you think timelines for the above items are. This stuff is important. :-)
Flags: needinfo?(bobbyholley) → needinfo?(jseward)
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #28)
> Let me know if that makes sense and what you think timelines for the above
> items are. This stuff is important. :-)

Yes that makes sense.  The patch needs tidying up before review -- tidying
at the jemalloc.c end and RAIIification at the Rust end.  I can prepare a
version for review by end of Weds.  I'll also file the followup, RSS
investigation, bug.
Flags: needinfo?(jseward)
Julian is it possible to get a summary of where we're at with this? I think there was a lot of talk about single-vs-multi and adaptive-vs-multi but not single-vs-adaptive and really that's all we care about for this bug. There was talk of regressions on microbenchmarks but it's not clear whether that was a non-issue or still persists (again we only care about single-vs-adaptive for this bug).

I'd also like to see some numbers comparing a non-stylo base build w/o the adaptive patches with a --enable-stylo build w/ the adaptive patches, this will make it easier to compare in perfherder. For this we want multiple talos runs and awsy runs as well.
Flags: needinfo?(jseward)
RE: not getting high confidence results, apparently the try flag is '--rebuild 5' not '--rebuild-talos 5'.
(In reply to Eric Rahm [:erahm] from comment #31)
> RE: not getting high confidence results, apparently the try flag is
> '--rebuild 5' not '--rebuild-talos 5'.

It might be more nuanced than that, it seems like taskcluster might not handle either rebuild flag (hence having good results on win/mac but not linux).
(In reply to Eric Rahm [:erahm] from comment #30)
> Julian is it possible to get a summary of where we're at with this?

comment 28 summarizes my understanding of the current state of things.

> I think
> there was a lot of talk about single-vs-multi and adaptive-vs-multi but not
> single-vs-adaptive and really that's all we care about for this bug.

This is covered in comment 27, no?

> There
> was talk of regressions on microbenchmarks but it's not clear whether that
> was a non-issue or still persists (again we only care about
> single-vs-adaptive for this bug).

My impression is that there is no regression from enabling adaptie.

> 
> I'd also like to see some numbers comparing a non-stylo base build w/o the
> adaptive patches with a --enable-stylo build w/ the adaptive patches, this
> will make it easier to compare in perfherder.

I think that would conflate two variables. We know that adaptive + nostylo has no RSS regression. The next interesting thing to compare is stylo w/adaptive vs stylo w/o adaptive to measure the overhead of the arena choice (vs the general overhead of other stylo data structures).

> For this we want multiple
> talos runs and awsy runs as well.

FWIW, you don't need --rebuild to trigger multiple talos runs. Just click the rebuild button for the job of interest on treeherder.
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #33)

I agree with Bobby's summary in comment 33.  From my point of view:

(1) Stylo (parallel) scales poorly with single arenas, for reasons we can guess
    at but don't yet know.

(2) With multiple arenas it scales much better.  There had been concern that
    multiple arenas would cause perf loss for malloc-intensive serial code, but
    that hasn't been repeatably observed.

(3) Multiple arenas do however cause some significant RSS regressions on Talos,
    for reasons to be investigated in a followup bug.

(4) Bobby suggested an adaptive approach, where we enable multiple arenas
    exactly during the parallel styling pass(es) but nowhere else.  The original
    motivation was to avoid the problems expected in (2), but that turned out to
    be a moot point.  Instead, we discovered we need the adaptive approach to
    avoid the problems observed in (3).

(5) For a non-stylo-enabled run, the adaptive approach has the same perf
    characteristics as single arena, because multiple arenas are never enabled.

(6) For a stylo-enabled run, the adaptive approach gets us the parallel
    scalability of (2) with less of the space penalty of (3).

(7) For the moment this seems the least-worst approach, so we're going with it.
Flags: needinfo?(jseward)
Attached patch Mike's proposal, part 1 of 2 (obsolete) — Splinter Review
Attachment #8864146 - Attachment is obsolete: true
Attached patch Mike's proposal, part 2 of 2 (obsolete) — Splinter Review
These two patches (from Mike) implement the scheme discussed at
bug 1291355 comment 69 through bug 1291355 comment 74.
Some numbers for the patches in comment 35 and comment 36:

Obama parallel traversal times, milliseconds

single

  STYLO_THREADS=1 84 83 83  (best 83)  1.00 x
  STYLO_THREADS=2 58 55 55  (best 55)  1.51 x
  STYLO_THREADS=3 43 42 43  (best 42)  1.97 x
  STYLO_THREADS=4 43 43 50  (best 43)  1.93 x
  STYLO_THREADS=6 45 47 60  (best 45)  1.84 x
  STYLO_THREADS=8 56 59 59  (best 56)  1.48 x

adaptive-mh

  STYLO_THREADS=1 86 85 83  (best 83)  1.00 x
  STYLO_THREADS=2 49 48 49  (best 48)  1.73 x
  STYLO_THREADS=3 34 40 34  (best 34)  2.44 x
  STYLO_THREADS=4 32 31 30  (best 30)  2.77 x
  STYLO_THREADS=6 26 24 24  (best 24)  3.45 x
  STYLO_THREADS=8 22 23 21  (best 21)  3.95 x


"innerHTML" test case from
https://bugzilla.mozilla.org/show_bug.cgi?id=1347525#c13, milliseconds

single

  73 72 75  (best 72)

adaptive-mh

  75 74 72  (best 72)


So -- it looks good from both the parallel and single-thread aspects.
Talos measurements to come.


--- command for Obama testcase -------------------------------

STYLO_THREADS=8 DISPLAY=:2.0 DUMP_STYLE_STATISTICS=1 ./mach run --disable-e10s file:///home/sewardj/MOZ/STYLO/Barack_Obama_Wikipedia.html

--- command for innerHTML testcase ---------------------------

STYLO_THREADS=1 DISPLAY=:2.0 QUMP_STYLE_STATISTICS=1 ./gcc-Og-nondebug-stylo/dist/bin/firefox-bin -P dev -no-remote file:///home/sewardj/MOZ/STYLO/bug1347525testcase_looping2.html

--- config for adaptive-mh -----------------------------------

is with the patches in comment 35 and comment 36 applied

--- config for single ----------------------------------------

is with the patches in comment 35 and comment 36 applied, plus
the following

diff --git a/memory/mozjemalloc/jemalloc.c b/memory/mozjemalloc/jemalloc.c
--- a/memory/mozjemalloc/jemalloc.c
+++ b/memory/mozjemalloc/jemalloc.c
@@ -3048,17 +3048,17 @@ chunk_dealloc(void *chunk, size_t size)
  * Begin arena.
  */
 
 MOZ_JEMALLOC_API void
 jemalloc_set_multiple_arenas_enabled_impl(bool multi_enabled)
 {
        arena_t *arena;
 
-       if (multi_enabled) {
+       if (0 && multi_enabled) {
                arena = choose_arena_hard();
        } else {
                arena = arenas[0];
        }
 #ifdef MOZ_MEMORY_WINDOWS
        TlsSetValue(tlsIndex, arena);
 #else
        arenas_map = arena;

--- mozconfig ------------------------------------------------

sewardj@nienna[C39]:/space/MOZ/MC-STYLO$ cat $MOZCONFIG
. $topsrcdir/browser/config/mozconfig
mk_add_options MOZ_OBJDIR=@TOPSRCDIR@/gcc-Og-nondebug-stylo
ac_add_options --enable-tests

ac_add_options --enable-optimize="-g -Og"
ac_add_options --enable-debug-symbols

ac_add_options --enable-stylo

ac_add_options --enable-valgrind

ac_add_options --enable-profiling
ac_add_options --enable-elf-hack

ac_add_options --disable-crashreporter

mk_add_options MOZ_MAKE_FLAGS="-j8"
mk_add_options AUTOCLOBBER=1
Thanks for the thorough breakdown Julian. One question: if we're comparing perf should we be doing more aggressively optimized builds, ie: -O3 vs -Og?
(In reply to Julian Seward [:jseward] from comment #36)
> Created attachment 8866354 [details] [diff] [review]
> Mike's proposal, part 2 of 2
> 
> These two patches (from Mike) implement the scheme discussed at
> bug 1291355 comment 69 through bug 1291355 comment 74.

Credit goes where credit due: the first part comes entirely from Julian's previous patch.

(In reply to Eric Rahm [:erahm] from comment #38)
> Thanks for the thorough breakdown Julian. One question: if we're comparing
> perf should we be doing more aggressively optimized builds, ie: -O3 vs -Og?

Note that --enable-optimize flags are not used for rust code.
This is extracted from Julian's original patch, with a change to the API name.
Assignee: jseward → mh+mozilla
Attachment #8866351 - Attachment is obsolete: true
Attachment #8866606 - Flags: review+
See message in the patch for the tradeoffs made in order to make this landable quickly.

Bobby, please check that you're okay with the second item for the short term (I intend to fix it next week along with the last item, but I didn't want to block enabling stylo on this)
Attachment #8866354 - Attachment is obsolete: true
Attachment #8866607 - Flags: review?(erahm)
Attachment #8866607 - Flags: feedback?(bobbyholley)
Attachment #8866608 - Flags: review?(bobbyholley)
(In reply to Eric Rahm [:erahm] from comment #38)
-O2 would probably be wise in future.  -Og is somewhat less aggressive
than -O2.  The main thing though is that opt levels between the two 
compared versions are the same, and that is of course true.
(In reply to Julian Seward [:jseward] from comment #37)
> So -- it looks good from both the parallel and single-thread aspects.
> Talos measurements to come.

Talos results are at [1].  MacOS failed to build, so the results will
presumably lack numbers for MacOS.

[1]
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=70a83a6491aa7e1d5293df57763f4815027c8773&newProject=try&newRevision=4a8ef276b769fcbdf600513b9113bc1e692dc242&framework=1&showOnlyImportant=0
Attachment #8866645 - Flags: review?(erahm)
Comment on attachment 8866645 [details] [diff] [review]
Use Thread Local Storage in mozjemalloc on mac

Review of attachment 8866645 [details] [diff] [review]:
-----------------------------------------------------------------

Somehow this is busted.
Attachment #8866645 - Flags: review?(erahm)
Attachment #8866645 - Attachment is obsolete: true
Attachment #8866664 - Flags: review?(erahm)
Comment on attachment 8866664 [details] [diff] [review]
Use Thread Local Storage in mozjemalloc on mac

This is busted as well.
Attachment #8866664 - Flags: review?(erahm)
Comment on attachment 8866607 [details] [diff] [review]
Initial implementation for jemalloc_thread_local_arena

Review of attachment 8866607 [details] [diff] [review]:
-----------------------------------------------------------------

Under-reporting memory usage for a week sounds like a fine price to pay for moving ahead on this. Thanks for working on this!
Attachment #8866607 - Flags: feedback?(bobbyholley) → feedback+
Comment on attachment 8866608 [details] [diff] [review]
Opt-in to thread-local jemalloc arenas for stylo rayon threads

Review of attachment 8866608 [details] [diff] [review]:
-----------------------------------------------------------------

This is much nicer than twiddling a bit whenever we use the thread pool - thank you!

Note that this will need to be rebased on top of https://github.com/servo/servo/pull/16786
Attachment #8866608 - Flags: review?(bobbyholley) → review+
Talos agrees that this essentially has no additional overhead (comparing with and without this patch, but with all the others).
Attachment #8866664 - Attachment is obsolete: true
Attachment #8866721 - Flags: review?(erahm)
This compares unpatched vs. patched (without the TLS change on mac). Note that without the TLS change on mac, this patch queue is a noop, and the differences that appear on perfherder for glterrain and bloom are just due to those tests being noisy or bimodal:

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=8e8da236d6c0&newProject=try&newRevision=f0338f956b40&framework=1&showOnlyImportant=0

(note this means that for linux-stylo, this is multi-arena vs. thread-local arena, and there's a nice drop in memory usage)

This compares unpacked vs. fully patched on mac (although, without the initialization ordering change, which doesn't have an impact on performance):

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=8e8da236d6c0&newProject=try&newRevision=1868d1ad74ea&framework=1&showOnlyImportant=0

And I just pushed mozilla-central with forced single-arena for stylo, to compare against thread-local arena (mostly for memory usage):

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=703266e25919&newProject=try&newRevision=f0338f956b40&framework=1&showOnlyImportant=0
Rebased ; does this need to go through github?
Attachment #8866608 - Attachment is obsolete: true
Attachment #8866960 - Flags: review?(bobbyholley)
Summary: stylo: Investigate enabling multiple jemalloc arenas only during the parallel traversal → stylo: Use Thread-local jemalloc arenas
Attachment #8866721 - Flags: review?(erahm) → review+
Comment on attachment 8866607 [details] [diff] [review]
Initial implementation for jemalloc_thread_local_arena

Review of attachment 8866607 [details] [diff] [review]:
-----------------------------------------------------------------

So just to be clear: every time the stylo thread pool gets created and kicks in we leak an arena right? Do you know if we reuse the same pool each time (so we only leak thread-pool-size arenas total) or if it gets recreated for each styling (so we leak number_of_styloings * thread-pool-size). I guess as long as you promise to fix it and stylo is disabled by default I don't care that much.

::: memory/mozjemalloc/jemalloc.c
@@ +3063,5 @@
> +		arena = arenas_extend(NO_INDEX);
> +	} else {
> +		arena = arenas[0];
> +	}
> +#ifdef MOZ_MEMORY_WINDOWS

I guess 'thread_local' (available in msvc) is C++11 only, whereas '__thread' is available for C (in gcc/clang but not msvc)? I wonder if it would make sense to just compile this code as C++ instead...regardless that's all follow up worthey, the status quo is fine here.
Attachment #8866607 - Flags: review?(erahm) → review+
(In reply to Eric Rahm [:erahm] from comment #56)
> Comment on attachment 8866607 [details] [diff] [review]
> Initial implementation for jemalloc_thread_local_arena
> 
> Review of attachment 8866607 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> So just to be clear: every time the stylo thread pool gets created and kicks
> in we leak an arena right? Do you know if we reuse the same pool each time
> (so we only leak thread-pool-size arenas total) or if it gets recreated for
> each styling (so we leak number_of_styloings * thread-pool-size). I guess as
> long as you promise to fix it and stylo is disabled by default I don't care
> that much.

Maybe the commit message wasn't clear enough, but the thread pool for stylo only starts threads once, and they live until the end of the process (and in fact, the threads didn't even call the exit handler when I tried ; they're probably just killed).

> ::: memory/mozjemalloc/jemalloc.c
> @@ +3063,5 @@
> > +		arena = arenas_extend(NO_INDEX);
> > +	} else {
> > +		arena = arenas[0];
> > +	}
> > +#ifdef MOZ_MEMORY_WINDOWS
> 
> I guess 'thread_local' (available in msvc) is C++11 only, whereas '__thread'
> is available for C (in gcc/clang but not msvc)? I wonder if it would make
> sense to just compile this code as C++ instead...regardless that's all
> follow up worthey, the status quo is fine here.

I'm actually considering it, because there's a lot of boilerplate in this code that could use RAII. And some awfully large macros that could use templates. That wouldn't necessarily allow to use thread_local, though, because of e.g. what's in the commit message for the mac TLS thing. But yeah, followup.
Blocks: 1364358
Blocks: 1364359
Comment on attachment 8866960 [details] [diff] [review]
Opt-in to thread-local jemalloc arenas for stylo rayon threads

Review of attachment 8866960 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good. This should be PR-ed against github.com/servo/servo when the relevant APIs have hit autoland. Ping somebody in #servo to trigger the landing.

When you write that patch, you'll also need to copy over the generated bindings.rs file from your objdir into components/style/gecko/generated/bindings.rs . We check in a copy of the bindings so that the servo CI can at least compile libgeckoservo independently. The bindings don't need to be perfectly up to date, so long as things compile.
Attachment #8866960 - Flags: review?(bobbyholley) → review+
Ok, so since mozreview won't allow to autoland this because of the mixed authorship and some of the patches being r=glandium, I'll go ahead and land on inbound now, ahead of sfink's review of the trivial one-liner exception for hazard builds. Eventually, this will make it to autoland through merges, at which point I'll be able to get servo landed. And then I'll disable multiple-arenas.
Keywords: leave-open
Pushed by mh@glandium.org:
https://hg.mozilla.org/integration/mozilla-inbound/rev/182823a741b4
Add a jemalloc_thread_local_arena API with a binding for rust. r=glandium
https://hg.mozilla.org/integration/mozilla-inbound/rev/1290ff350272
Add a hazard exception for Gecko_SetJemallocThreadLocalArena. r=me
https://hg.mozilla.org/integration/mozilla-inbound/rev/751e131baa5d
Initial implementation for jemalloc_thread_local_arena. r=erahm
https://hg.mozilla.org/integration/mozilla-inbound/rev/bacbf98fc818
Use Thread Local Storage in mozjemalloc on mac. r=erahm
Attachment #8866606 - Flags: checkin+
Attachment #8866607 - Flags: checkin+
Comment on attachment 8866640 [details] [diff] [review]
Add a hazard exception for Gecko_SetJemallocThreadLocalArena

Still interested in the review, after the fact.
Attachment #8866640 - Flags: checkin+
Attachment #8866721 - Flags: checkin+
Comment on attachment 8866640 [details] [diff] [review]
Add a hazard exception for Gecko_SetJemallocThreadLocalArena

Review of attachment 8866640 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/devtools/rootAnalysis/analyzeHeapWrites.js
@@ +384,5 @@
>          "Gecko_EnsureMozBorderColors",
>          "Gecko_ClearMozBorderColors",
>          "Gecko_AppendMozBorderColors",
>          "Gecko_CopyMozBorderColors",
> +        "Gecko_SetJemallocThreadLocalArena",

At some point, I'd like to go through all these and try to remove as many as possible, but this one seems more reasonable to have there than most.
Attachment #8866640 - Flags: review?(sphink) → review+
https://github.com/servo/servo/pull/16847 is landing the servo part in the servo repo. After that only the backout of 2214b3c57c9c is left, and that will need to wait for the autolanded servo commit to reach inbound.
(In reply to Mike Hommey [:glandium] from comment #64)
> https://github.com/servo/servo/pull/16847 is landing the servo part in the
> servo repo. After that only the backout of 2214b3c57c9c is left, and that
> will need to wait for the autolanded servo commit to reach inbound.

I can just push this directly to autoland to avoid round-tripping through central, which will be especially slow on a weekend.
Keywords: leave-open
Pushed by bholley@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/503e83d0d7ab
Backout 2214b3c57c9c (bug 1291356), disabling multiple jemalloc arenas. r=glandium
Blocks: 1364838
Everything landed on central.
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Blocks: 1403766
You need to log in before you can comment on or make changes to this bug.