Stylo thread-specific jemalloc arenas are effectively disabled

RESOLVED FIXED in Firefox 59

Status

()

defect
P2
normal
RESOLVED FIXED
Last year
Last year

People

(Reporter: bholley, Assigned: bholley)

Tracking

unspecified
mozilla60
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox-esr52 unaffected, firefox58 wontfix, firefox59 fixed, firefox60 fixed)

Details

Attachments

(1 attachment, 1 obsolete attachment)

Assignee

Description

Last year
So this is fun. Found while investigating bug 1431285.

In bug 1397101, when we were optimizing memory right before stylo went to beta, we introduced a bug that permanently disabled thread-specific jemalloc arenas for a style thread as soon as it makes an allocation larger than 512 bytes (which happens very early on). 

In other words, the stylo we shipped in Quantum had enormous amounts of locking and cache contention during heap allocation, and was thus way slower than it could have been. It's kind of amazing that nobody noticed this.
Assignee

Comment 1

Last year
MozReview-Commit-ID: 9i5B76vkNfr
Attachment #8949180 - Flags: review?(mh+mozilla)
Assignee

Comment 2

Last year
On the testcase in bug 1431285:

Before this fix: https://perfht.ml/2FByRN1
After this fix: https://perfht.ml/2GWd5Fm

Note that the time spent in malloc for each of the worker threads goes from ~125 to ~20. And that doesn't even account for the cache contention that we measured when we implemented the multiple arenas.

One interesting question is whether some of our memory wins from bug 1397101 will evaporate with this fix.
Assignee

Updated

Last year
Blocks: 1431285
Comment on attachment 8949180 [details] [diff] [review]
Don't clobber the thread-local arenas when we happen to hit a large allocation. v1

Review of attachment 8949180 [details] [diff] [review]:
-----------------------------------------------------------------

::: memory/build/mozjemalloc.cpp
@@ +2178,5 @@
>      ret = thread_arena.get();
>    }
>  
>    if (!ret) {
> +    ret = gArenas.GetDefault();

Nice find. I wonder how this hasn't caused a talos warning when this landed.

With that being said, it would be preferable to keep defaulting to thread_local_arena(false), so that subsequent calls for smaller sizes can actually get the arena from the thread local variable.

So it would be better to feed ret with the default arena in the else case of the kMaxQuantumClass test above.
Attachment #8949180 - Flags: review?(mh+mozilla)
Assignee

Comment 4

Last year
MozReview-Commit-ID: 9i5B76vkNfr
Attachment #8949211 - Flags: review?(mh+mozilla)
Assignee

Updated

Last year
Attachment #8949180 - Attachment is obsolete: true
Comment on attachment 8949211 [details] [diff] [review]
Don't clobber the thread-local arenas when we happen to hit a large allocation. v2

Review of attachment 8949211 [details] [diff] [review]:
-----------------------------------------------------------------

Extra points for making the code more legible at the same time.
Attachment #8949211 - Flags: review?(mh+mozilla) → review+
Assignee

Comment 6

Last year
So the obvious question here is why we didn't detect this regression when it landed. I believe this is because, as we saw when landing stylo, the vast majority of our talos tests don't really capture large-scale content styling. Tp6 amazon captures it to some extent, however it only measures time to first non-blank paint, stylo 57 already had the style overhead in that interval down to ~15ms.

In general, I think we did a lot of work to reduce allocation (by sharing) after the point where we concluded that pinned arenas was a must-have, so it's definitely not as critical as it once was.

That said, I can still measure some big wins with this (on OSX):
* MotionMark, i.e. the testcase in bug 1431285
* the HTML spec (Servo_TraverseSubtree time drops from low 700s to low 600s, malloc time drops from ~45ms to ~15ms per worker)
* StyleBench (time spent styling drops from ~520 to ~475, malloc time goes from ~30ms per worker to zero).
* Large github diff [1] (time spent styling drops from 350-500ms to ~120ms, malloc time drops from ~25ms to ~2ms per worker).

The github one is particularly interesting because the styling time is highly variable, and totally dwarfs the difference in malloc time. I interpret that as an indication of the previously-measured cache contention, which can vary significantly run-to-run.

The only potential downside of landing this is that it might cause AWSY regressions, depending on how much of glandium's intended optimization continues to hold, and how much its measured effect was actually because of the bug. I did a try push at [2], which looks fine so far. I'm going to wait on the remaining retriggers and then push.

[1] https://github.com/bholley/gecko/commit/96ecf505adde95240799d285de90a007310f7bda
[2] https://treeherder.mozilla.org/#/jobs?repo=try&revision=6a9a3a9d901409a0db64dc8b9cfc080763f1285c

Comment 7

Last year
Pushed by bholley@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/1934fc70f158
Don't clobber the thread-local arenas when we happen to hit a large allocation. r=glandium
Bobby, is this fix worth uplifting to Beta 59? We still have five weeks of bake time remaining in this Beta cycle.
Flags: needinfo?(bobbyholley)
Priority: -- → P2
I'm not Bobby, but totally.
Flags: needinfo?(bobbyholley)
https://hg.mozilla.org/mozilla-central/rev/1934fc70f158
Status: NEW → RESOLVED
Closed: Last year
Resolution: --- → FIXED
Target Milestone: --- → mozilla60
Assignee

Comment 11

Last year
Comment on attachment 8949211 [details] [diff] [review]
Don't clobber the thread-local arenas when we happen to hit a large allocation. v2

Approval Request Comment
[Feature/Bug causing the regression]: This bug was introduced in bug 1397101, which landed right before the initial landing of Stylo went to Beta. So this isn't a regression in terms of anything we've shipped on our release channel, but it is very much a malfunction in terms of how Stylo is supposed to operate and perform. We built a mechanism to give Stylo threads separate memory pools to improve performance, but this bug causes those memory pools to be ignored.
[User impact if declined]: Slower Stylo performance.
[Is this code covered by automated tests?]: Our performance tests don't seem to be very sensitive to the slowdowns here, but we have tons of tests that would tell us if there were any correctness issues here.
[Has the fix been verified in Nightly?]: I have verified that the allocation overhead goes away on MotionMark with the latest nightly.
[Needs manual test from QE? If yes, steps to reproduce]: Probably not.
[List of other uplifts needed for the feature/fix]: None
[Is the change risky?]: Moderate-low.
[Why is the change risky/not risky?]: It's a change in the memory allocator, which very central code that affects everything. It is conceivable that even a benign change to that code could trigger other bugs or have unforeseen consequences. That said, the change is very small and simply makes the code operate as intended. 
[String changes made/needed]: None
Attachment #8949211 - Flags: approval-mozilla-beta?
> [Why is the change risky/not risky?]: It's a change in the memory allocator, which very central code that affects everything. It is conceivable that even a benign change to that code could trigger other bugs or have unforeseen consequences. That said, the change is very small and simply makes the code operate as intended. 

Also, it makes the code operate the same as before bug 1397101, so it's not like we haven't already shipped nightlies with this "new" configuration.
This improved performance, according to stylobench:

== Change summary for alert #11443 (as of Thu, 08 Feb 2018 00:18:06 GMT) ==

Improvements:

  4%  stylebench linux64 opt e10s     61.72 -> 63.98
  3%  stylebench linux64 pgo e10s     68.40 -> 70.28

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=11443
Comment on attachment 8949211 [details] [diff] [review]
Don't clobber the thread-local arenas when we happen to hit a large allocation. v2

Wow! Perf improvement for Stylo. Let's uplift this for beta 10.

We still have 2 weeks until building the release candidate, so that should give us some time to detect possible regressions.
Attachment #8949211 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #13)
> This improved performance, according to stylobench:
> 
> == Change summary for alert #11443 (as of Thu, 08 Feb 2018 00:18:06 GMT) ==
> 
> Improvements:
> 
>   4%  stylebench linux64 opt e10s     61.72 -> 63.98
>   3%  stylebench linux64 pgo e10s     68.40 -> 70.28
> 
> For up to date results, see:
> https://treeherder.mozilla.org/perf.html#/alerts?id=11443

Yay for stylebench doing its thing :)
You need to log in before you can comment on or make changes to this bug.