Try out an arena based allocator for DOM nodes again

Status: NEW
Product: Core
Component: DOM
Priority: P2
Severity: normal
Reported: 10 months ago
Modified: 7 months ago
People

(Reporter: Ehsan, Assigned: Ehsan)

Tracking

(Blocks: 1 bug)

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [MemShrink:P2])

Attachments

(4 attachments)

(Assignee)

Description

10 months ago
While profiling some slow DOM code that comes up in benchmarks like Speedometer, we end up bottlenecking on pure pointer chasing for DOM traversal because of bad memory locality of DOM nodes.  I think we can do better, by attempting arena allocation of DOM nodes again.

I have some ideas on how to do this again without the downsides of the previous approach (see bug 403830).  Firstly, we can try to share code with nsPresArena.  That in itself will give us the advantage of freelists that aren't fixed-size, contrary to the previous approach.  But I'd also like to try next-power-of-two bucket rounding like jemalloc does, in order to avoid fragmentation when reusing memory of freed nodes to allocate new nodes; nodes will be sorted into separate arenas based on their bucket size.  Furthermore, I would like to put attributes and content nodes in separate arenas, since they are typically accessed separately during traversal, so keeping them apart improves the locality of traversal algorithms going over each one.
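To make the bucketing idea concrete, here is a minimal sketch of next-power-of-two rounding; the names (`RoundUpToBucket`, `kMinBucket`) are made up for illustration and are not part of the actual patches:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch of next-power-of-two bucket rounding, as jemalloc
// does: an allocation request is rounded up to the nearest power-of-two
// bucket so freed slots can be reused for any node of a similar size.
constexpr size_t kMinBucket = 64;  // illustrative minimum bucket size

size_t RoundUpToBucket(size_t aSize) {
  size_t bucket = kMinBucket;
  while (bucket < aSize) {
    bucket <<= 1;  // next power of two
  }
  return bucket;
}
```

For example, a 120-byte node would share the 128-byte bucket with 128-byte nodes, while a 200-byte node would go into the 256-byte bucket.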

One complication is how to handle node adoption.  Olli's suggestion was to allocate adopted nodes using the default allocator and store the allocator in a property on the node.  This way we don't have to worry about finding the right arena when the node is about to be destroyed.  We do of course need a node flag to know whether to check the property, and node flags are scarce these days.
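A rough sketch of how that adoption fallback could look, under the assumption of a single node flag plus a stored-allocator property; every name here (`Node`, `DOMArena`, `DeallocateNode`) is hypothetical and not real Gecko code:

```cpp
#include <cassert>

// Illustrative-only sketch: a scarce node flag records whether the node
// was allocated by the default allocator during adoption; destruction
// checks the flag to decide whether to consult the stored allocator
// instead of locating an owning arena.
struct DOMArena {
  int mFreed = 0;
  void Free(void*) { ++mFreed; }  // stand-in for pushing onto a freelist
};

struct Node {
  bool mHeapAllocated;  // the hypothetical node flag
  DOMArena* mArena;     // the hypothetical property on the node
};

// Returns 1 for the default-allocator path, 0 for the arena path, so
// the dispatch policy is easy to verify.
int DeallocateNode(Node* aNode) {
  if (aNode->mHeapAllocated) {
    delete aNode;  // adopted node: free via the default allocator
    return 1;
  }
  aNode->mArena->Free(aNode);  // arena node: return the slot to its arena
  return 0;
}
```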

This setup will obviously use more memory than we currently use, the question is how much, and I'd like to find that out.  We're hoping bug 1377993 improves DOM memory usage to some extent in order to buy us some breathing room here.  I'd also like to get some performance measurements once the changes are implemented in order to know what the exact memory-speed trade-off is.

After we have this data we can have a conversation about whether we should land the patches, tune them or give up!  CCing erahm to keep him in the loop from the beginning but no reason to be alarmed just yet.  :-)
Priority: -- → P1
(Assignee)

Comment 1

10 months ago
So I did a study of the current sizes of our DOM nodes on 32-bit and 64-bit.  Here is the raw data: https://docs.google.com/a/mozilla.com/spreadsheets/d/1i6yWXsDz_LFvvG6uvy57HupkHwSfdEjlIZq25Jsh7os/edit?usp=sharing

Here are the take-aways that are going to shape the design here:

  * HTMLAudioElement and HTMLVideoElement are huge in size and total outliers.  Due to the next-power-of-two bucketing, we unfortunately can't include them in this optimization.  I filed bug 1378506 as a follow-up to reduce the size of HTMLMediaElement.
  * On 32-bit, the only node that would fit in a 64-byte bucket is the Comment node.  On 64-bit, this node's size is 120 bytes, and there are a lot of other nodes that are 128 bytes, so we would need a 128-byte bucket there.  So we could arena-allocate Comment nodes on 64-bit platforms and leave them heap-allocated on 32-bit platforms, but for now, for simplicity, I'd rather leave them alone on all platforms.
  * All other nodes on both platforms fit nicely in one of these three buckets: 128 bytes, 256 bytes, or 512 bytes.
(Assignee)

Comment 2

10 months ago
I forgot to include Text and CDATASection; I've added those to the spreadsheet now.  Their presence, together with Comment, might make it worthwhile to have a 64-byte bucket on 32-bit platforms, since Text especially is very common.
(Assignee)

Updated

10 months ago
Depends on: 1378983
(Assignee)

Comment 3

10 months ago
Eric, what types of tests should I run to get a sense of whether the memory usage of the patches here is acceptable or not?
Flags: needinfo?(erahm)

Comment 4

10 months ago
> Firstly we can try to share code with nsPresArena.

I'd argue against using nsPresArena for DOM nodes.
nsPresArena is intentionally slower and uses more memory than needed
in order to support poisoning and type confusion mitigation measures.
Listing a few things off the top of my head:
1. writing the poison value costs CPU cycles and possibly cache misses
2. memory for one type of instance can only be re-used for that same
   type, which leads to a slight over-allocation / cache misses
3. the "free lists" are not lists, but nsTArrays, which means malloc /
   free calls just to manage these pointers.  (without poisoning you
   could instead use the free instances themselves for a linked list,
   which would be faster and use less memory)
4. memory is never free()'d; we always spend the high-water mark for
   each type separately

I don't think these measures are necessary for ref-counted DOM nodes.
I think it would be better to write a new non-poisoning arena class
that can be optimized for performance.
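For contrast, a minimal sketch of the non-poisoning intrusive freelist described in point 3, where the free slots themselves store the links (assumes slots are at least pointer-sized and pointer-aligned; not Gecko code):

```cpp
#include <cassert>

// Intrusive freelist sketch: the first word of each freed slot links to
// the next free slot, so no side arrays or extra mallocs are needed.
class FreeList {
 public:
  void Push(void* aSlot) {
    *static_cast<void**>(aSlot) = mHead;  // link slot to the old head
    mHead = aSlot;
  }
  void* Pop() {
    void* slot = mHead;
    if (slot) {
      mHead = *static_cast<void**>(slot);  // unlink the head slot
    }
    return slot;
  }

 private:
  void* mHead = nullptr;
};
```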
(Assignee)

Comment 5

10 months ago
(In reply to Mats Palmgren (:mats) from comment #4)
> > Firstly we can try to share code with nsPresArena.
> 
> I'd argue against using nsPresArena for DOM nodes.
> nsPresArena is intentionally slower and uses more memory than needed
> in order to support poisoning and type confusion mitigation measures.
> Listing a few things off the top of my head:
> 1. writing the poison value costs CPU cycles and possibly cache misses

I have disabled that part for DOM arenas.

> 2. memory for one type of instance can only be re-used for that same
>    type, which leads to a slight over-allocation / cache misses

For DOM arenas, our bucketing strategy will be very different.  I have a set of WIP patches which I will post later today which should demonstrate where I'm heading.  For now, we use 4 arenas for DOM node allocations on 64-bit platforms and 5 on 32-bit platforms, based on the description in comment 0: bucketing depends on the size of the nodes, not their type.

> 3. the "free lists" are not lists, but nsTArrays, which means malloc /
>    free calls just to manage these pointers.  (without poisoning you
>    could instead use the free instances themselves for a linked list,
>    which would be faster and use less memory)

Yeah, this is one aspect that isn't ideal.  Right now the mozilla::dom::Arena class that I have is fairly simple.  I may end up switching its backend implementation to not share anything with nsPresArena for that reason, if profiling shows that the cost of maintaining the "free lists" matters, but right now there are other, more important performance issues with the patches, which I'll write about in a separate comment...  (For now I have decided to reuse the nsPresArena code, since the arena manipulation code is kind of the least interesting aspect of this experiment.  None of the decisions here are final yet.)

> 4. memory is never free()'d; we always spend the high-water mark for
>    each type separately

Not sure if I follow this part, but the memory management of DOM nodes will not change; they will stay refcounted and will be freed.  We do eventually need to make sure that if an entire arena gets freed up, we reclaim its memory back to the OS, which nsPresArena currently doesn't know how to do.

> I don't think these measures are necessary for ref-counted DOM nodes.
> I think it would be better to write a new non-poisoning arena class
> that can be optimized for performance.

I expect to end up doing that in the end, but I'd still like to try to share code if we can.  So far we aren't sharing all that much code at the "nsPresArena" level at any rate, as most of the sharing happens at the ArenaAllocator level...
(Assignee)

Comment 6

10 months ago
So here is a summary of where I am right now with this.  I have the basic patches that implement the core of the idea here.  So far I have spent most of the time on the correctness side of things; the patches are good enough for local browsing, and they manage to run Speedometer successfully, but the try server was fairly unhappy with me last night when I pushed what I had.  I haven't looked into the details of the failures too closely yet.

I have been looking at the perf side of things, and things have not been looking all that great!  Initially when I ran Speedometer on a build with my patches, it was around 10 points *slower* than a build without them!  As far as I can tell, none of the code that I have added really shows up in the profile all that much; here are some details on the profiles I have looked at:

  * nsContentList::PopulateSelf() seems to show up in profiles after the patches at around half the time it used to take before the patches.  The pointer chasing loops there were one of the motivating factors behind the work here.
  * The cost of allocating new HTML elements seems to have gone down in the profiles after the patches compared to profiles before.  Again, this is what I would expect.
  * The cost of CC graph building (CCGraphBuilder::BuildGraph and stuff under it) has skyrocketed in the profiles after the patches!!!  In a profile of about 100 Speedometer subtest runs before the patches, this function took 0.45% of the total time.  After the patches, it takes 26.74% of the total time!
  ** As far as I can tell, at least a huge part of this cost is due to a massive increase in the cost of hashtable lookups under this code!  PLDHashTable::Add() has gone up from 0.38% of total time before to 9.32% of total time after!!!  The profile shows that after the patches, when we are trying to add new entries to the CC hashtable, we're getting collisions all the time, and we're getting destroyed by the hashtable performance. :-(  I think this could be due to the fact that the pointers stored into the hashtable now follow a pattern that hashes quite awfully for some reason, but I have yet to investigate this very deeply.

It's still unclear if there are other things at play here, but the hashtable collisions issue here makes any kind of perf comparison pointless for now...

Comment 7

10 months ago
> Not sure if I follow this part, but the memory management of DOM nodes
> will not change, they will stay refcounted and will be freed.

You mean "freed" in the sense "given back to the arena", but that only
means it'll end up on a free list.  What I mean is, even if you "free"
all your DOM nodes, so that the arena chunks could be *really* freed
in the jemalloc free() sense, it won't.  Memory that's been malloc'ed
for the chunks is only free()'d when the arena is destroyed.

Worse, nsPresArena only re-uses memory for the same type as it was
originally allocated for, so for C++ types A and B that are the same
size you can have thousands of free A's in the arena, but if there
are no free B's, a B allocation will still claim new arena space
rather than re-using a free A.

You can work around this latter feature, though, by using "synthetic"
type IDs so that multiple C++ classes use the same ID, at the cost
of over-allocating for the smaller types.
(I'm guessing this is what you're effectively doing for the buckets
you're talking about)

Comment 8

10 months ago
> PLDHashTable::Add() has gone up from 0.38% of total time before to 9.32% of total time after

You're using RegisterArenaRefPtr/DeregisterArenaRefPtr?
Those are really expensive and should be avoided if possible -
they keep a hashtable of all objects.

Comment 9

10 months ago
From reading the past few comments, I take it these arenas share no code with the GC's arenas? GC arenas just hold a single type of object (really all they care about is the size), using bump allocation where they can and spans to deal with fragmentation.

It shouldn't be hard to remove the GC-specific bits and make something a little more generic, but I suppose that's another rabbit hole (especially since GC arenas come from GC chunks, and I don't know as much about how the GC chunk pool works).

Comment 10

10 months ago
Right, nsPresArena is mostly used for layout/style system (non-refcounted) objects.
(Assignee)

Comment 11

10 months ago
(In reply to Mats Palmgren (:mats) from comment #8)
> > PLDHashTable::Add() has gone up from 0.38% of total time before to 9.32% of total time after
> 
> You're using RegisterArenaRefPtr/DeregisterArenaRefPtr?

No.  This is CCGraph::mPtrNodeToMap <https://searchfox.org/mozilla-central/rev/f1472e5b57fea8d4a7bbdd81c44e46958fe3c1ce/xpcom/base/nsCycleCollector.cpp#842>, nothing that I added.

The cause of the massive increase in hashtable collisions is of course quite obvious.  It is this extremely crappy hash function used for this hash table: <https://searchfox.org/mozilla-central/rev/f1472e5b57fea8d4a7bbdd81c44e46958fe3c1ce/xpcom/ds/PLDHashTable.cpp#73>.  Sigh.  :-(  What my patch changes is that a lot of the nodes that get added to this table as keys now have addresses that are numerically very close, and this hash function just shifts them down and otherwise leaves them unchanged, so the entries don't end up distributed across the hashtable *at all*, leading to the really horrible performance I was observing.  I'll file a separate bug to fix that.
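A simplified illustration of the failure mode (this is not the actual PLDHashTable code): open-addressed tables that derive the slot index from the hash's top bits need a hash that mixes the pointer's low bits upward, which a plain shift does not do, while a Fibonacci-style multiplicative scramble does:

```cpp
#include <cassert>
#include <cstdint>

// Simplified, illustrative hashes.  Nearby arena addresses share their
// high bits, so a shift-only hash leaves the top bits identical and
// everything piles into one slot; a multiplicative scramble spreads
// low-bit differences across the whole hash.
constexpr uint32_t kGoldenRatio = 0x9E3779B9u;  // ~2^32 / phi

uint32_t ShiftOnlyHash(std::uintptr_t aPtr) {
  return static_cast<uint32_t>(aPtr >> 2);  // old stub-style pointer hash
}

uint32_t ScrambledHash(std::uintptr_t aPtr) {
  return static_cast<uint32_t>(aPtr >> 2) * kGoldenRatio;
}

// Slot index taken from the hash's top aBits bits, as a table with
// 2^aBits slots might do.
uint32_t TopBitsIndex(uint32_t aHash, uint32_t aBits) {
  return aHash >> (32 - aBits);
}
```

With the shift-only hash, addresses 64 bytes apart all map to the same top-bits index; after the multiplicative scramble they spread across the table.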
(Assignee)

Comment 12

10 months ago
(In reply to Mats Palmgren (:mats) from comment #7)
> > Not sure if I follow this part, but the memory management of DOM nodes
> > will not change, they will stay refcounted and will be freed.
> 
> You mean "freed" in the sense "given back to the arena", but that only
> means it'll end up on a free list.

Yes.

> What I mean is, even if you "free"
> all your DOM nodes, so that the arena chunks could be *really* freed
> in the jemalloc free() sense, it won't.  Memory that's been malloc'ed
> for the chunks are only free()'d when the arena is destroyed.

Yes, I understand.  That is what I was talking about in the second part of the paragraph you quoted.  :-)  We should fix that for DOM nodes.  (I'm not really sure why the current situation is OK for frames really either, my idea was to fix it centrally for both if possible, but I care about the DOM arena side here.)

> Worse, nsPresArena only re-uses memory for the same type as it was
> originally allocated for, so for C++ types A and B that are the same
> size you can have thousands of free A's in the arena, but if there
> are no free B's, a B allocation will still claim new arena space
> rather than re-using a free A.
> 
> You can work around this latter feature, though, by using "synthetic"
> type IDs so that multiple C++ classes use the same ID, at the cost
> of over-allocating for the smaller types.
> (I'm guessing this is what you're effectively doing for the buckets
> you're talking about)

Yes that is exactly what I'm doing.  :-)

(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #9)
> From reading the past few comments, I take it these arenas share no code
> with the GC's arenas?

Correct.

> GC arenas just hold a single type of object (really
> all they care about is the size), using bump allocation where they can and
> spans to deal with fragmentation.
> 
> It shouldn't be hard to remove the GC-specific bits and make something a
> little more generic, but I suppose that's another rabbit hole (especially
> since GC arenas come from GC chunks, and I don't know as much about how the
> GC chunk pool works).

That sounds like overkill to me to be honest.  :-)
(Assignee)

Comment 13

10 months ago
With the patch in bug 1379282, the cost of hashtable lookups drops back down to the pre-patch levels!
Depends on: 1379282

Comment 14

10 months ago
> I'm not really sure why the current situation is OK for frames really either

It's both a mitigation against type confusion - it guarantees a memory
address is never re-used for a different type; and a mitigation against
UAF - it's always filled with poison while not used (both for the lifetime
of the shell).  It's a price we're willing to pay, since it has mitigated
hundreds of otherwise exploitable crashes.
(Assignee)

Comment 15

10 months ago
(In reply to Mats Palmgren (:mats) from comment #14)
> > I'm not really sure why the current situation is OK for frames really either
> 
> It's both a mitigation against type confusion - it guarantees a memory
> address is never re-used for a different type; and a mitigation against
> UAF - it's always filled with poison while not used (both for the lifetime
> of the shell).  It's a price we're willing to pay, since it has mitigated
> hundreds of otherwise exploitable crashes.

OK, makes sense.  I think your comments have made me not want to reuse much code from nsPresArena after all: if I end up special-casing the arena page deallocation issue only for DOM arenas, there is not much sharing in practice anyway.  I'll post my WIP patches now before I go on vacation, but the first 2 parts will probably just completely change, and the third part will get rewritten on top of them when I get back...
(Assignee)

Comment 16

10 months ago
Created attachment 8884429 [details] [diff] [review]
Part 1: Refactor nsPresArena into a reusable base class so that it can be shared between DOM and layout
(Assignee)

Comment 17

10 months ago
Created attachment 8884430 [details] [diff] [review]
Part 2: Disambiguate ArenaObjectID in nsPresArena.cpp
(Assignee)

Comment 18

10 months ago
Created attachment 8884431 [details] [diff] [review]
Part 3: Add support for allocating DOM nodes into arenas
(Assignee)

Comment 19

10 months ago
Created attachment 8884432 [details] [diff] [review]
Part 4: Port the most common nodes appearing in HTML documents to be allocated in arenas

Comment 20

10 months ago
Just to remind you when you come back: it's suboptimal to use nsTArray
for the free lists in a non-poisoning arena.  Using a singly-linked
list on the objects themselves is much faster.  Maybe the FreeList type
could be a template param?  Or maybe it could have FreeListArray and
FreeListLinkedList and then select which one to use based on
EnablePoisoning through some template magic?
Yeah, nsTArray is bad for performance; a linked list or SegmentedVector should work better.
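One possible shape for the FreeList-as-template-parameter suggestion above, sketched with hypothetical names (`FreeListArray`, `FreeListLinkedList`, and `Arena` do not exist in the tree):

```cpp
#include <cassert>
#include <type_traits>
#include <vector>

// Hypothetical sketch: select the freelist representation at compile
// time from the poisoning policy.
struct FreeListArray {  // side-table of pointers; safe with poisoning
  std::vector<void*> mSlots;
  void Push(void* aSlot) { mSlots.push_back(aSlot); }
  void* Pop() {
    if (mSlots.empty()) {
      return nullptr;
    }
    void* slot = mSlots.back();
    mSlots.pop_back();
    return slot;
  }
};

struct FreeListLinkedList {  // links threaded through the free slots
  void* mHead = nullptr;
  void Push(void* aSlot) {
    *static_cast<void**>(aSlot) = mHead;
    mHead = aSlot;
  }
  void* Pop() {
    void* slot = mHead;
    if (slot) {
      mHead = *static_cast<void**>(slot);
    }
    return slot;
  }
};

template <bool EnablePoisoning>
class Arena {
  // Poisoning arenas keep free-slot pointers in a side array so the
  // slots themselves can stay filled with the poison pattern.
  using FreeList = typename std::conditional<EnablePoisoning,
                                             FreeListArray,
                                             FreeListLinkedList>::type;
  FreeList mFreeList;

 public:
  void Recycle(void* aSlot) { mFreeList.Push(aSlot); }
  void* Reuse() { return mFreeList.Pop(); }
};
```

Both freelist types expose the same Push/Pop shape, so the arena code above compiles unchanged for either policy.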
(In reply to (Away 7/10-7/17) from comment #3)
> Eric, what types of tests should I run to get a sense of whether the memory
> usage of the patches here is acceptable or not?

The "tabs open" measurements from awsy should be a good indicator. Do a try run without your patches, do a bunch of retriggers, then repeat with the patches. You can then diff them in perfherder. I think the following should work:

> ./mach try -b o -p linux64 -u awsy-e10s -t none --rebuild 5
Flags: needinfo?(erahm)
I sort of wonder whether it's worth doing some sort of moving allocator, with handles, etc., for nodes.  It's a big project, but it would allow us to do arenas without weird high-water-mark or fragmentation behavior: when an arena chunk gets too sparse, just move things out of it and kill it off.
P2 for now, but we definitely want to have a good feel for the memory usage changes before landing this.
Whiteboard: [MemShrink] → [MemShrink:P2]
(Assignee)

Comment 25

8 months ago
I have decided to punt on this for 57, I think...  :/  I have kept the patches in my queue, but people keep bitrotting them, and it's no longer worth my time trying to keep them applied.  So far what I had causes slowdowns as opposed to making anything faster, and I haven't had the time to fix the freelist issues yet, but those issues don't really show up in profiles anyhow, so I doubt that would be where to look for the sources of the slowdown.  Perhaps we can look at this again when there is not so much time pressure...
(Assignee)

Comment 26

8 months ago
Note to self: current non-building tree here: https://github.com/ehsan/gecko-dev/tree/arena_allocated_dom_nodes

I think the most recent bitrot was due to bug 1387956, possibly among others...
AFAIK these aren't destined to be fixed in 56 or 57, so I'm setting the priority to P2.
Priority: P1 → P2