Closed Bug 1436541 Opened 6 years ago Closed 6 years ago

Stylo thread-specific jemalloc arenas are effectively disabled

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla60

Tracking Flags:

Tracking

Status

firefox-esr52

---

unaffected

firefox58

---

wontfix

firefox59

---

fixed

firefox60

---

fixed

People

(Reporter: bholley, Assigned: bholley)

References

Details

Attachments

(1 file, 1 obsolete file)

Don't clobber the thread-local arenas when we happen to hit a large allocation. v1 6 years ago Bobby Holley (:bholley) 834 bytes, patch		Details \| Diff \| Splinter Review
Don't clobber the thread-local arenas when we happen to hit a large allocation. v2 6 years ago Bobby Holley (:bholley) 1.31 KB, patch	glandium : review+ lizzard : approval-mozilla-beta+	Details \| Diff \| Splinter Review

Bobby Holley (:bholley)

Assignee

Description

•

6 years ago

So this is fun. Found while investigating bug 1431285.

In bug 1397101, when we were optimizing memory right before stylo went to beta, we introduced a bug that permanently disabled thread-specific jemalloc arenas for a style thread as soon as it makes an allocation larger than 512 bytes (which happens very early on). 

In other words, the stylo we shipped in Quantum had enormous amounts of locking and cache contention during heap allocation, and was thus way slower than it could have been. It's kind of amazing that nobody noticed this.

Bobby Holley (:bholley)

Assignee

Comment 1

•

6 years ago

Attached patch Don't clobber the thread-local arenas when we happen to hit a large allocation. v1 (obsolete) — Details — Splinter Review

MozReview-Commit-ID: 9i5B76vkNfr

Attachment #8949180 - Flags: review?(mh+mozilla)

Bobby Holley (:bholley)

Assignee

Comment 2

•

6 years ago

On the testcase in bug 1431285:

Before this fix: https://perfht.ml/2FByRN1
After this fix: https://perfht.ml/2GWd5Fm

Note that the time spent in malloc for each of the worker threads goes from ~125 to ~20. And that doesn't even account for the cache contention that we measured when we implemented the multiple arenas.

One interesting question is whether some of our memory wins from bug 1397101 will evaporate with this fix.

Bobby Holley (:bholley)

Assignee

Updated

•

6 years ago

Blocks: 1431285

Mike Hommey [:glandium]

Comment 3

•

6 years ago

Comment on attachment 8949180 [details] [diff] [review]
Don't clobber the thread-local arenas when we happen to hit a large allocation. v1

Review of attachment 8949180 [details] [diff] [review]:
-----------------------------------------------------------------

::: memory/build/mozjemalloc.cpp
@@ +2178,5 @@
>      ret = thread_arena.get();
>    }
>  
>    if (!ret) {
> +    ret = gArenas.GetDefault();

Nice find. I wonder how this hasn't caused a talos warning when this landed.

With that being said, it would be preferable to keep defaulting to thread_local_arena(false), so that subsequent calls for smaller sizes can actually get the arena from the thread local variable.

So it would be better to feed ret with the default arena in the else case of the kMaxQuantumClass test above.

Attachment #8949180 - Flags: review?(mh+mozilla)

Bobby Holley (:bholley)

Assignee

Comment 4

•

6 years ago

Attached patch Don't clobber the thread-local arenas when we happen to hit a large allocation. v2 — Details — Splinter Review

MozReview-Commit-ID: 9i5B76vkNfr

Attachment #8949211 - Flags: review?(mh+mozilla)

Bobby Holley (:bholley)

Assignee

Updated

•

6 years ago

Attachment #8949180 - Attachment is obsolete: true

Mike Hommey [:glandium]

Comment 5

•

6 years ago

Comment on attachment 8949211 [details] [diff] [review]
Don't clobber the thread-local arenas when we happen to hit a large allocation. v2

Review of attachment 8949211 [details] [diff] [review]:
-----------------------------------------------------------------

Extra points for making the code more legible at the same time.

Attachment #8949211 - Flags: review?(mh+mozilla) → review+

Bobby Holley (:bholley)

Assignee

Comment 6

•

6 years ago

So the obvious question here is why we didn't detect this regression when it landed. I believe this is because, as we saw when landing stylo, the vast majority of our talos tests don't really capture large-scale content styling. Tp6 amazon captures it to some extent, however it only measures time to first non-blank paint, stylo 57 already had the style overhead in that interval down to ~15ms.

In general, I think we did a lot of work to reduce allocation (by sharing) after the point where we concluded that pinned arenas was a must-have, so it's definitely not as critical as it once was.

That said, I can still measure some big wins with this (on OSX):
* MotionMark, i.e. the testcase in bug 1431285
* the HTML spec (Servo_TraverseSubtree time drops from low 700s to low 600s, malloc time drops from ~45ms to ~15ms per worker)
* StyleBench (time spent styling drops from ~520 to ~475, malloc time goes from ~30ms per worker to zero).
* Large github diff [1] (time spent styling drops from 350-500ms to ~120ms, malloc time drops from ~25ms to ~2ms per worker).

The github one is particularly interesting because the styling time is highly variable, and totally dwarfs the difference in malloc time. I interpret that as an indication of the previously-measured cache contention, which can vary significantly run-to-run.

The only potential downside of landing this is that it might cause AWSY regressions, depending on how much of glandium's intended optimization continues to hold, and how much its measured effect was actually because of the bug. I did a try push at [2], which looks fine so far. I'm going to wait on the remaining retriggers and then push.

[1] https://github.com/bholley/gecko/commit/96ecf505adde95240799d285de90a007310f7bda
[2] https://treeherder.mozilla.org/#/jobs?repo=try&revision=6a9a3a9d901409a0db64dc8b9cfc080763f1285c

Pulsebot

Comment 7

•

6 years ago

Pushed by bholley@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/1934fc70f158
Don't clobber the thread-local arenas when we happen to hit a large allocation. r=glandium

Chris Peterson [:cpeterson]

Comment 8

•

6 years ago

Bobby, is this fix worth uplifting to Beta 59? We still have five weeks of bake time remaining in this Beta cycle.

status-firefox58: --- → wontfix

status-firefox59: --- → affected

status-firefox60: --- → affected

status-firefox-esr52: --- → unaffected

Flags: needinfo?(bobbyholley)

Priority: -- → P2

Mike Hommey [:glandium]

Comment 9

•

6 years ago

I'm not Bobby, but totally.

Flags: needinfo?(bobbyholley)

Cristian Brindusan [:cbrindusan]

Comment 10

•

6 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/1934fc70f158

Status: NEW → RESOLVED

Closed: 6 years ago

status-firefox60: affected → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla60

Bobby Holley (:bholley)

Assignee

Comment 11

•

6 years ago

Comment on attachment 8949211 [details] [diff] [review]
Don't clobber the thread-local arenas when we happen to hit a large allocation. v2

Approval Request Comment
[Feature/Bug causing the regression]: This bug was introduced in bug 1397101, which landed right before the initial landing of Stylo went to Beta. So this isn't a regression in terms of anything we've shipped on our release channel, but it is very much a malfunction in terms of how Stylo is supposed to operate and perform. We built a mechanism to give Stylo threads separate memory pools to improve performance, but this bug causes those memory pools to be ignored.
[User impact if declined]: Slower Stylo performance.
[Is this code covered by automated tests?]: Our performance tests don't seem to be very sensitive to the slowdowns here, but we have tons of tests that would tell us if there were any correctness issues here.
[Has the fix been verified in Nightly?]: I have verified that the allocation overhead goes away on MotionMark with the latest nightly.
[Needs manual test from QE? If yes, steps to reproduce]: Probably not.
[List of other uplifts needed for the feature/fix]: None
[Is the change risky?]: Moderate-low.
[Why is the change risky/not risky?]: It's a change in the memory allocator, which very central code that affects everything. It is conceivable that even a benign change to that code could trigger other bugs or have unforeseen consequences. That said, the change is very small and simply makes the code operate as intended. 
[String changes made/needed]: None

Attachment #8949211 - Flags: approval-mozilla-beta?

Mike Hommey [:glandium]

Comment 12

•

6 years ago

> [Why is the change risky/not risky?]: It's a change in the memory allocator, which very central code that affects everything. It is conceivable that even a benign change to that code could trigger other bugs or have unforeseen consequences. That said, the change is very small and simply makes the code operate as intended. 

Also, it makes the code operate the same as before bug 1397101, so it's not like we haven't already shipped nightlies with this "new" configuration.

Ionuț Goldan [:igoldan]

Comment 13

•

6 years ago

This improved performance, according to stylobench:

== Change summary for alert #11443 (as of Thu, 08 Feb 2018 00:18:06 GMT) ==

Improvements:

  4%  stylebench linux64 opt e10s     61.72 -> 63.98
  3%  stylebench linux64 pgo e10s     68.40 -> 70.28

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=11443

Liz Henry (:lizzard) (relman/hg->git project)

Comment 14

•

6 years ago

Comment on attachment 8949211 [details] [diff] [review]
Don't clobber the thread-local arenas when we happen to hit a large allocation. v2

Wow! Perf improvement for Stylo. Let's uplift this for beta 10.

We still have 2 weeks until building the release candidate, so that should give us some time to detect possible regressions.

Attachment #8949211 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Noemi Erli[:noemi_erli]

Comment 15

•

6 years ago

bugherder uplift

https://hg.mozilla.org/releases/mozilla-beta/rev/22be9d71d70b

status-firefox59: affected → fixed

Emilio Cobos Álvarez (:emilio)

Comment 16

•

6 years ago

(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #13)
> This improved performance, according to stylobench:
> 
> == Change summary for alert #11443 (as of Thu, 08 Feb 2018 00:18:06 GMT) ==
> 
> Improvements:
> 
>   4%  stylebench linux64 opt e10s     61.72 -> 63.98
>   3%  stylebench linux64 pgo e10s     68.40 -> 70.28
> 
> For up to date results, see:
> https://treeherder.mozilla.org/perf.html#/alerts?id=11443

Yay for stylebench doing its thing :)

You need to log in before you can comment on or make changes to this bug.