Try out an arena-based allocator for DOM nodes again
Categories
(Core :: DOM: Core & HTML, enhancement, P2)
Tracking
()
Tracking | Status | |
---|---|---|
firefox76 | --- | fixed |
People
(Reporter: ehsan.akhgari, Assigned: sefeng)
References
(Depends on 1 open bug, Blocks 2 open bugs)
Details
(Keywords: parity-safari, Whiteboard: [MemShrink:P2])
Attachments
(23 files, 5 obsolete files)
13.86 KB,
patch
|
Details | Diff | Splinter Review | |
1.86 KB,
patch
|
Details | Diff | Splinter Review | |
21.30 KB,
patch
|
Details | Diff | Splinter Review | |
187.95 KB,
patch
|
Details | Diff | Splinter Review | |
47 bytes,
text/x-phabricator-request
|
Details | Review | |
47 bytes,
text/x-phabricator-request
|
Details | Review | |
47 bytes,
text/x-phabricator-request
|
Details | Review | |
47 bytes,
text/x-phabricator-request
|
Details | Review | |
47 bytes,
text/x-phabricator-request
|
Details | Review | |
47 bytes,
text/x-phabricator-request
|
Details | Review | |
47 bytes,
text/x-phabricator-request
|
Details | Review | |
47 bytes,
text/x-phabricator-request
|
Details | Review | |
47 bytes,
text/x-phabricator-request
|
Details | Review | |
47 bytes,
text/x-phabricator-request
|
Details | Review | |
47 bytes,
text/x-phabricator-request
|
Details | Review | |
47 bytes,
text/x-phabricator-request
|
Details | Review | |
47 bytes,
text/x-phabricator-request
|
Details | Review | |
47 bytes,
text/x-phabricator-request
|
Details | Review | |
47 bytes,
text/x-phabricator-request
|
Details | Review | |
47 bytes,
text/x-phabricator-request
|
Details | Review | |
47 bytes,
text/x-phabricator-request
|
Details | Review | |
47 bytes,
text/x-phabricator-request
|
Details | Review | |
47 bytes,
text/x-phabricator-request
|
Details | Review |
While profiling some slow DOM code that comes up in benchmarks like Speedometer, we end up bottlenecking on pure pointer chasing for DOM traversal because of bad memory locality of DOM nodes. I think we can do better, by attempting arena allocation of DOM nodes again. I have some ideas on how to do this again without the downsides of the previous approach (see bug 403830). Firstly we can try to share code with nsPresArena. That in itself will give us the advantage of freelists that aren't fixed-size, contrary to the approach above. But I'd also like to try next power of two rouding bucket allocation like jemalloc does in order to avoid fragmentation when reusing memory of freed nodes to allocate new nodes, so nodes will be sorted into separate arenas based on their bucket size. Furthermore I would like to put attributes and content nodes in separate arenas since they are typically accessed separately during traversal, so leaving them separate improves the locality of traversal algorithms going over each one. One complication is how to handle node adoption. Olli's suggestion was to allocate them using the default allocator in the adoption case and store the allocator in a property on the node. This way we don't have to worry about finding the right arena when the node is about to be destroyed. We of course need a node flag to know whether to check the property and those are scarce these days. This setup will obviously use more memory than we currently use, the question is how much, and I'd like to find that out. We're hoping bug 1377993 improves DOM memory usage to some extent in order to buy us some breathing room here. I'd also like to get some performance measurements once the changes are implemented in order to know what the exact memory-speed trade-off is. After we have this data we can have a conversation about whether we should land the patches, tune them or give up! CCing erahm to keep him in the loop from the beginning but no reason to be alarmed just yet. :-)
Updated•7 years ago
|
Reporter | ||
Comment 1•7 years ago
|
||
So I did a study of the current sizes of our DOM nodes on 32-bit and 64-bit. Here is the raw data: https://docs.google.com/a/mozilla.com/spreadsheets/d/1i6yWXsDz_LFvvG6uvy57HupkHwSfdEjlIZq25Jsh7os/edit?usp=sharing Here are the take-aways that are going to shape the design here: * HTMLAudioElemennt and HTMLVideoElement are huge in size and total outliers. Due to the next power of two bucketing we can't unfortunately include them in this optimization. I filed bug 1378506 about this as a follow-up to reduce the size of HTMLMediaElement. * On 32-bit, the only node that would fit in a 64-bit bucket is a Comment node. On 64-bit, this node's size is 120 bytes, and there are a lot of other nodes that are 128 bytes, so we need to have a bucket size of 128 bytes size there. So we could arena allocate Comment nodes on 64-bit platforms and leave it heap allocated on 32-bit platforms, but for now for simplicity I'd rather leave it alone on all platforms. * All other nodes on both platforms fit nicely in one of these three buckets: 128 bytes, 256 bytes, or 512 bytes.
Reporter | ||
Comment 2•7 years ago
|
||
I forgot to include Text and CDATASection, added those to the spreadsheet now. Their presence together with comment might make it worthwhile to have a 64 byte bucket for 32-bit platforms since Text especially is very common.
Reporter | ||
Comment 3•7 years ago
|
||
Eric, what types of tests should I run to get a sense of whether the memory usage of the patches here is acceptable or not?
Comment 4•7 years ago
|
||
> Firstly we can try to share code with nsPresArena.
I'd argue against using nsPresArena for DOM nodes.
nsPresArena is intentionally slower and use more memory than needed
in order to support poisoning and type confusion mitigation measures.
Listing a few things off the top of my head:
1. writing the poison value costs CPU cycles and possibly cache misses
2. memory for one type of instance can only be re-used for that same
type, which leads to a slight over-allocation / cache misses
3. the "free lists" are not lists, but nsTArrays, which means malloc /
free calls just to manage these pointers. (without poisoning you
could instead use the free instances themselves for a linked list,
which would be faster and use less memory)
4. memory is never free()'d; we always spend the high-water mark for
each type separately
I don't think these measures are necessary for ref-counted DOM nodes.
I think it would be better to write a new non-poisoning arena class
that can be optimized for performance.
Reporter | ||
Comment 5•7 years ago
|
||
(In reply to Mats Palmgren (:mats) from comment #4) > > Firstly we can try to share code with nsPresArena. > > I'd argue against using nsPresArena for DOM nodes. > nsPresArena is intentionally slower and use more memory than needed > in order to support poisoning and type confusion mitigation measures. > Listing a few things off the top of my head: > 1. writing the poison value costs CPU cycles and possibly cache misses I have disabled that part for DOM arenas. > 2. memory for one type of instance can only be re-used for that same > type, which leads to a slight over-allocation / cache misses For DOM arenas, our bucketing strategy will be very different. I have a set of WIP patches which I will post later today which should demonstrate where I'm moving towards. But for now, we use 4 arenas for DOM node allocations in 64-bit and 5 in 32-bit platforms, based on the description in comment 0 depending on the size of the nodes not their type. > 3. the "free lists" are not lists, but nsTArrays, which means malloc / > free calls just to manage these pointers. (without poisoning you > could instead use the free instances themselves for a linked list, > which would be faster and use less memory) Yeah this is one aspect which isn't ideal. Right now the mozilla::dom::Arena class that I have is fairly simple. I may end up switching its backend implementation to not share anything with nsPresArena for that reason if profiling shows the cost of maintaining the "free lists" is important, but right now there are other more important performance issues with the patches which I'll write about in a separate comment... (For now I have decided to reuse the nsPresArena code since the arena manipulation code is kind of the least interesting aspect of this experiment. None of the decisions here are final yet.) > 4. memory is never free()'d; we always spend the high-water mark for > each type separately Not sure if I follow this part, but the memory management of DOM nodes will not change, they will stay refcounted and will be freed. We of course eventually need to make sure if an entire arena gets freed up we reclaim the memory of the arena back to the OS which nsPresArena currently doesn't know how to do. > I don't think these measures are necessary for ref-counted DOM nodes. > I think it would be better to write a new non-poisoning arena class > that can be optimized for performance. I expect to end up doing that in the end, but I'd still like to try to share code if we can. So far we aren't sharing all that much code at the "nsPresArena" level at any rate, as most of the sharing happens at the ArenaAllocator level...
Reporter | ||
Comment 6•7 years ago
|
||
So here is a summary of where I am right now with this. I have the basic patches that implement the core of the idea here. So far I have spent most of the time on the correctness side of things, and the patches are good enough for local browsing, and they manage to successfully run Speedometer but the try server was fairly unhappy with me last night when I pushed what I had to try. I haven't looked into the details of failures too closely yet. I have been looking at the perf side of things, and things have been looking not all that great! Initially when I ran Speedometer on a build with my patches, it was around 10 points *slower* than a build without these patches! As far as I can tell, none of the code that I have added really shows up in the profile all that much, here is some details on the profiles I have looked at: * nsContentList::PopulateSelf() seems to show up in profiles after the patches at around half the time it used to take before the patches. The pointer chasing loops there were one of the motivating factors behind the work here. * The cost of allocating new HTML elements seems to have gone down in the profiles after the patches compared to profiles before. Again, this is what I would expect. * The cost of CC graph building (CCGraphBuilder::BuildGraph and stuff under it) has skyrocketted in the profiles after the patches!!! In a profile before the patches of about 100 of Speedometer subtest runs, this function took 0.45% of the total time. After the patches, it takes 26.74% of the total time! ** As far as I can tell, at least a huge part of this cost is due to a massive increase in the cost of hashtable lookups under this code! PLDHashTable::Add() has gone up from 0.38% of total time before to 9.32% of total time after!!! The profile shows that after the patches, when we are trying to add new entries to the CC hashtable, we're getting collisions all the time, and we're getting destroyed by the hashtable performance. :-( I think this could be due to the fact that the pointers stored into the hashtable now follow a pattern that hashes quite awfully for some reason, but I have yet to investigate this very deeply. It's still unclear if there are other things at play here, but the hashtable collisions issue here makes any kind of perf comparison pointless for now...
Comment 7•7 years ago
|
||
> Not sure if I follow this part, but the memory management of DOM nodes
> will not change, they will stay refcounted and will be freed.
You mean "freed" in the sense "given back to the arena", but that only
means it'll end up on a free list. What I mean is, even if you "free"
all your DOM nodes, so that the arena chunks could be *really* freed
in the jemalloc free() sense, it won't. Memory that's been malloc'ed
for the chunks are only free()'d when the arena is destroyed.
Worse, nsPresArena only re-use memory for the same type as it was
originally allocated for, so for C++ types A and B that are the same
size you can have thousands of free A's in the arena, but if there
are no free B's, a B allocation will still claim new arena space
rather than re-using a free A.
You can workaround this latter feature though, by using "synthetic"
type IDs so that multiple C++ classes to use the same ID, at the cost
of over-allocating for the smaller types.
(I'm guessing this is what you're effectively doing for the buckets
you're talking about)
Comment 8•7 years ago
|
||
> PLDHashTable::Add() has gone up from 0.38% of total time before to 9.32% of total time after
You're using RegisterArenaRefPtr/DeregisterArenaRefPtr?
Those are really expensive and should be avoided if possible -
it keeps a hashtable of all objects.
Comment 9•7 years ago
|
||
From reading the past few comments, I take it these arenas share no code with the GC's arenas? GC arenas just hold a single type of object (really all they care about is the size), using bump allocation where they can and spans to deal with fragmentation. It shouldn't be hard to remove the GC-specific bits and make something a little more generic, but I suppose that's another rabbit hole (especially since GC arenas come from GC chunks, and I don't know as much about how the GC chunk pool works).
Comment 10•7 years ago
|
||
Right, nsPresArena is mostly used for layout/style system (non-refcounted) objects.
Reporter | ||
Comment 11•7 years ago
|
||
(In reply to Mats Palmgren (:mats) from comment #8) > > PLDHashTable::Add() has gone up from 0.38% of total time before to 9.32% of total time after > > You're using RegisterArenaRefPtr/DeregisterArenaRefPtr? No. This is CCGraph::mPtrNodeToMap <https://searchfox.org/mozilla-central/rev/f1472e5b57fea8d4a7bbdd81c44e46958fe3c1ce/xpcom/base/nsCycleCollector.cpp#842>, nothing that I added. The massive increase in hashtable collisions is of course quite obvious. It is because of this extremely crappy hash function that is used for this hash table: <https://searchfox.org/mozilla-central/rev/f1472e5b57fea8d4a7bbdd81c44e46958fe3c1ce/xpcom/ds/PLDHashTable.cpp#73>. Sigh. :-( What my patch changes is it makes a lot of nodes that get added to this table as keys have addresses that are very close in numerical values, and this hash function just shifts them down and otherwise leaves them unchanged, so the entries don't end up distributed in the hashtable *at all*, leading to the really horrible performance I was observing. I'll file a separate bug to fix that.
Reporter | ||
Comment 12•7 years ago
|
||
(In reply to Mats Palmgren (:mats) from comment #7) > > Not sure if I follow this part, but the memory management of DOM nodes > > will not change, they will stay refcounted and will be freed. > > You mean "freed" in the sense "given back to the arena", but that only > means it'll end up on a free list. Yes. > What I mean is, even if you "free" > all your DOM nodes, so that the arena chunks could be *really* freed > in the jemalloc free() sense, it won't. Memory that's been malloc'ed > for the chunks are only free()'d when the arena is destroyed. Yes, I understand. That is what I was talking about in the second part of the paragraph you quoted. :-) We should fix that for DOM nodes. (I'm not really sure why the current situation is OK for frames really either, my idea was to fix it centrally for both if possible, but I care about the DOM arena side here.) > Worse, nsPresArena only re-use memory for the same type as it was > originally allocated for, so for C++ types A and B that are the same > size you can have thousands of free A's in the arena, but if there > are no free B's, a B allocation will still claim new arena space > rather than re-using a free A. > > You can workaround this latter feature though, by using "synthetic" > type IDs so that multiple C++ classes to use the same ID, at the cost > of over-allocating for the smaller types. > (I'm guessing this is what you're effectively doing for the buckets > you're talking about) Yes that is exactly what I'm doing. :-) (In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #9) > From reading the past few comments, I take it these arenas share no code > with the GC's arenas? Correct. > GC arenas just hold a single type of object (really > all they care about is the size), using bump allocation where they can and > spans to deal with fragmentation. > > It shouldn't be hard to remove the GC-specific bits and make something a > little more generic, but I suppose that's another rabbit hole (especially > since GC arenas come from GC chunks, and I don't know as much about how the > GC chunk pool works). That sounds like overkill to me to be honest. :-)
Reporter | ||
Comment 13•7 years ago
|
||
With the path in bug 1379282, the cost of hashtable lookups drop back down to the pre-patch levels!
Comment 14•7 years ago
|
||
> I'm not really sure why the current situation is OK for frames really either
It's both a mitigation against type confusion - it guarantees a memory
address is never re-used for a different type; and a mitigation against
UAF - it's always filled with poison while not used (both for the lifetime
of the shell). It's a price we're willing to pay, since it's mitigated
hundreds of otherwise exploitable crashes.
Reporter | ||
Comment 15•7 years ago
|
||
(In reply to Mats Palmgren (:mats) from comment #14) > > I'm not really sure why the current situation is OK for frames really either > > It's both a mitigation against type confusion - it guarantees a memory > address is never re-used for a different type; and a mitigation against > UAF - it's always filled with poison while not used (both for the lifetime > of the shell). It's a price we're willing to pay, since it's mitigated > hundreds of otherwise exploitable crashes. OK, makes sense. I think your comments have made me want to not really reuse much code from nsPresArena any more after all, since I think if I end up special casing the arena page deallocation issue only for DOM arenas then there is not much sharing in practice anyway. I'll post my WIP patches now before I go on vacation but the first 2 parts will probably just completely change, and the third part will get rewritten on top of it when I get back...
Reporter | ||
Comment 16•7 years ago
|
||
Reporter | ||
Comment 17•7 years ago
|
||
Reporter | ||
Comment 18•7 years ago
|
||
Reporter | ||
Comment 19•7 years ago
|
||
Comment 20•7 years ago
|
||
Just to remind you when you come back: it's suboptimal to use nsTArray for the free lists in a non-poisoning arena. Using a single-link list on the objects themselves is much faster. Maybe the FreeList type could be a template param? Or maybe it could have FreeListArray and FreeListLinkedList and then select which one to use based on EnablePoisoning through some template magic?
Comment 21•7 years ago
|
||
yeah, nsTArray is bad for performance. linked list or SegmentedVector should work better.
Comment 22•7 years ago
|
||
(In reply to (Away 7/10-7/17) from comment #3) > Eric, what types of tests should I run to get a sense of whether the memory > usage of the patches here is acceptable or not? The "tabs open" measurements from awsy should be a good indicator. Do a try run without your patches, do a bunch of retriggers, then repeat with the patches. You can then diff them in perfherder. I think the following should work: > ./mach try -b o -p linux64 -u awsy-e10s -t none --rebuild 5
Comment 23•7 years ago
|
||
I sort of wonder whether it's worth doing some sort of moving allocator, with handles, etc, for nodes. It's a big project, but it would allow us to do arenas without weird high-water-mark or fragmentation behavior: when an arena chunk gets too sparse, just move things out of it and kill it off.
Comment 24•7 years ago
|
||
P2 for now, but we definitely want to have a good feel for the memory usage changes before landing this.
Reporter | ||
Comment 25•7 years ago
|
||
I have decided to punt on this for 57, I think... :/ I have kept the patches in my queue but people keep bitrotting them, and it's no longer worth my time trying to keep them applied. So far what I had causes slowdowns as opposed to make anything faster and I haven't had the time to fix the freelist issues yet but those issues don't really show up in profiles anyhow, so I doubt that would be where to look for the sources of the slowdown. Perhaps we can look at this again when there is not so much time pressure...
Reporter | ||
Comment 26•7 years ago
|
||
Note to self: current non-building tree here: https://github.com/ehsan/gecko-dev/tree/arena_allocated_dom_nodes I think the most recent bitrot was due to bug 1387956, possibly among others...
Comment 27•7 years ago
|
||
AFAIK these aren't destined to be fixed in 56 or 57 so I'm setting the priority to P2.
Reporter | ||
Updated•6 years ago
|
Assignee | ||
Comment 28•6 years ago
|
||
Just to provide an update about preventing cross docGroup adoption (The context here is we want to see if we can prevent cross docGroup adoption because it could make the implementation of this bug easier and it could also help Fission for isolating docGroups) I created a new patch (https://phabricator.services.mozilla.com/D15150) with minimum changes to prevent cross docGroup adoption, pushed to try server and spent some time on test failures to see why the tests failed and what they were trying to test (Try build of the patch https://treeherder.mozilla.org/#/jobs?repo=try&revision=d8be95a7639bcd98fe406a3c1d49ef4c963565a6). I think the Try build looks good enough to do a short summary for what we should expect if we prevent cross docGroup adoption (No related test failures for Linux, although there are a few for Windows) I think the test failures fall into these categories: 1) Adoption happens around custom XULElements (different security principals); There are only a few use cases in our code 2) Tests related to testing cross compartment features (creates node adoption between Chrome code and content document) 3) Feature tests which "accidentally" trigger this (For example, https://searchfox.org/mozilla-central/rev/64ef7bc9fb478ba7c292f9fa57813d3fe864201e/accessible/tests/mochitest/treeupdate/test_ariaowns.html#601) I think it is fair to say we can prevent cross docGroup adoption, this is doable and shouldn't cause much trouble for us.
Comment 29•6 years ago
|
||
That looks green enough to do some performance testing. You could also push to try with all the talos tests, with and without your patch so that we could see how it behaves with existing talos tests.
Updated•6 years ago
|
Assignee | ||
Comment 30•5 years ago
|
||
I am really sorry for not updating the patch for a while and I'll make sure I update the patch in future!
I am going to give a summary of what I have tried so far and what my plan is, please feel free to provide any suggestions!
The idea is each docGroup will have its own arena, and all DOM objects belong to this docGroup will be allocated into this arena. If the dom objects is going to be allocated before the docGroup is created, it will use the normal malloc function.
The first memory allocator approach I did (Thanks Ehsan for the based patch) used the common memory allocator approach. Like having dedicated memory chunks for different sizes, and place the dom objects accordingly into these memory chunks. I reused the xpcom/ds/ArenaAllocator.h to achieve it.
-
4 sizes (128, 256, 512, Attributes): https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=8ad795d272596f8c87a3f7e2ad95d3a5976fbe33&newProject=try&newRevision=3858ed4436941cffc88a359d40454a80df47932c&framework=1
-
Same as 1) but also increased the chunk size from 8KB to 1MB: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=8ad795d272596f8c87a3f7e2ad95d3a5976fbe33&newProject=try&newRevision=baeafc14b7b4d97d82ff28801402272a323e226e&framework=1
-
No more sizes differentiation, place everything together based on their allocation order, 10MB chunk size : https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=8ad795d272596f8c87a3f7e2ad95d3a5976fbe33&newProject=try&newRevision=9380f80e0c586996ba30e80563a1c2ee54123ec5&framework=1
-
No more sizes differentiation, each type of node has its own chunk (chunk per node): https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=8ad795d272596f8c87a3f7e2ad95d3a5976fbe33&newProject=try&newRevision=24bf0ecc08c1b423ac54e90e8fcdc58a3dc07732&framework=1
-
Chunk per node and also 1MB Chunk size: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=8ad795d272596f8c87a3f7e2ad95d3a5976fbe33&newProject=try&newRevision=986bc580f281cec14b445b6185f48103f356e400&framework=1
-
6 sizes (128, 144, 176, 256, 512, Attribute): https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=8ad795d272596f8c87a3f7e2ad95d3a5976fbe33&newProject=try&newRevision=a8676a1647ff7e77f3b511a39e4425de378f38c5&framework=1
None of them really improve anything except 3), so I started to focus to make 3) better.
If we want to be able to place dom objects that have different sizes next to each other, A proper implementation should be able to merge adjacent free objects with different sizes. So I started with a bitmap implementation. However this implementation doesn't work well because the larger the memory chunk, the longer time to flip the bits, which increased the time to do memory allocation. And it also didn't work very well https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=d19fa4ef993b0330b23abcc26370672426453ac8&newProject=try&newRevision=a50004bf9769d6b0b4605c952237ed43ba413d17.
These are the 2 approaches that I am trying to make it better.
https://phabricator.services.mozilla.com/D36114, back to 3), and accept the fact that we can't merge adjacent memory blocks`. https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=d19fa4ef993b0330b23abcc26370672426453ac8&newProject=try&newRevision=ded51b9113b2cab20c4ec9b81a4f15e914bbda55
https://phabricator.services.mozilla.com/D36105, my own implementation to handle free blocks by using a lot of extra data structures https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=d19fa4ef993b0330b23abcc26370672426453ac8&newProject=try&newRevision=9071ea8e420b608e71787d3632af155138e55e91.
Again I haven't found a good implementation yet and I also don't have a lot of confidences for my patches because they are a lot of stuff in there and I am worrying I did something totally unnecessary. Any help would be much appreciated!
Randell pointed out jemalloc arena might help, so I will give it a try.
Comment 31•5 years ago
|
||
It sounds a lot like you're effectively re-implementing mozjemalloc
(with a pre-allocated backing buffer) through trial and error, it might be worth familiarizing yourself with that logic to get an idea of how we balance various requirements (fragmentation, free list overhead, cache line sharing, perf, etc).
A chunk size of 10MB is going to be a hard sell from a memshrink perspective; I'm also a bit concerned that any chance of wins from cache locality is going to be diminished at that size.
As jesup has suggested it might make sense to just dedicate a mozjemalloc arena to dom allocations (particularly with fission enabled where fewer pages will be sharing a dom allocator)
Assignee | ||
Comment 32•5 years ago
•
|
||
Yeah, 10MB was just a test to see if it makes sense to place them based on their allocation order, despite their sizes.
I just got the test results for jemalloc arena, the result looks good.
Talso: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=621f3eafaf3a8c5b179071440532ddc0d11984f7&newProject=try&newRevision=3f4e0d61a36fe2732d809c73106ce73feb301e8d, quite a few high confidence improvements, also some med improvements, and no high confidence regressions. However, some subtests regressed, for example https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&newProject=try&newRevision=3f4e0d61a36fe2732d809c73106ce73feb301e8d&originalSignature=1657197&newSignature=1657197&framework=1&originalRevision=621f3eafaf3a8c5b179071440532ddc0d11984f7
raptor: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=621f3eafaf3a8c5b179071440532ddc0d11984f7&newProject=try&newRevision=3f4e0d61a36fe2732d809c73106ce73feb301e8d&framework=10, no high confidence changes, a few med confidences improvements/regressions.
Assignee | ||
Comment 33•5 years ago
|
||
seems fine for most of them, Resident Memory opt stylo regressed
also some improvements, for example https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&newProject=try&newRevision=bd41da1b6e4a9c7136b8be0ab3f08364834a3c33&originalSignature=1887533&newSignature=1887533&framework=4&originalRevision=b00ef1a5f7d39403275e8d8ba3fa9d1c6f9b9bc9
Assignee | ||
Comment 34•5 years ago
|
||
Assignee | ||
Comment 35•5 years ago
|
||
Depends on D41054
Reporter | ||
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Comment 37•5 years ago
|
||
When adding this new arena, can we please ensure that it has memory checking support (MOZ_MAKE_MEM_NOACCESS and friends)? Otherwise, use-after-free inside the new arena will be much harder to detect.
Comment 38•5 years ago
|
||
(In reply to Christian Holler (:decoder) from comment #37)
When adding this new arena, can we please ensure that it has memory checking support (MOZ_MAKE_MEM_NOACCESS and friends)? Otherwise, use-after-free inside the new arena will be much harder to detect.
Nevermind, just found it in the patch now :)
Updated•5 years ago
|
Updated•5 years ago
|
Assignee | ||
Comment 39•5 years ago
|
||
Assignee | ||
Comment 40•5 years ago
|
||
Depends on D43135
Updated•5 years ago
|
Assignee | ||
Comment 41•5 years ago
|
||
Some updates to the latest perf numbers.
Tests I did was all talos tests, all raptor tp6 tests, all AWSY test and speedometer tests.
With Pref disabled (uses the current allocation methodology): https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=cdccfb33532721b5e2ef69b157b4bad74db2eb1a&newProject=try&newRevision=ac8cad1eab833dfe2bcc11ffb6aa5bcf47eb546c&framework=1
WIth Pref enabled (uses arena allocation): https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=cdccfb33532721b5e2ef69b157b4bad74db2eb1a&newProject=try&newRevision=2e884e10d09503a1036e99a23da22b2f2958f4e5&framework=1
There's only one regression which exists in both comparisons, which is time_to_session_store_window_restored_ms for windows 7. I am not too worrying about it as it exists in both comparisons so it shouldn't be related to the arena allocation itself, but I am looking into it.
Comment 42•5 years ago
|
||
Comment on attachment 9087538 [details]
Bug 1377999 - Unsupport cross docGroup node adoption
Revision D43135 was moved to bug 1467970. Setting attachment 9087538 [details] to obsolete.
Assignee | ||
Comment 43•5 years ago
|
||
Just an update to this bug.
At this point, it is unlikely that we will land the patch, because we couldn't find a good reason to land it. (Event if the DOM arena is disabled by default).
We tried to find a real use case where it improved the DOM object iteration performance. We pick a large document from html standard, then used NodeIterator
to iterator all DOM nodes to eliminate the overheads of javascript, and we expected using DOM Arena should improve the performance of NodeIterator
(because it should.....), however, it wasn't the case, and we found a 2% regression. Since we are using jemalloc arena, there aren't too much things we can tune.
I could do more testing to find a special case where the performance is improved, maybe a large document with only Div elements, but again, we would need a real use case improvement for us to land it.
Comment 44•5 years ago
|
||
I could do more testing to find a special case where the performance is improved, maybe a large document with only Div elements, but again, we would need a real use case improvement for us to land it.
There are security benefits to isolating DOM Nodes. Most of the security benefit exists if each type (really: size!) of DOM Node would have a separate arena, but there might be some benefit by having them all somewhere else compared to, say, the media codec allocations. So we might not need a performance win to take this. On the other hand, the patch looks huge, so I could understand there are complexity concerns.
Do I understand your comments correctly that you had a version earlier that was able to group DOM Nodes per object size?
Assignee | ||
Comment 45•5 years ago
|
||
The current patch which gives us a seems-okayish performance groups DOM Nodes per DocGroup, so it isn't really per object size, as objects with different sizes get grouped into to the same arena.
I have never tried to do it per object size. In the very beginning, I tried to group DOM objects per node type, but it didn't work well, lots of regressions.
Comment 46•5 years ago
|
||
If you're using moz_arena_malloc, you do end up with nodes of the same size being grouped together.
Comment 47•5 years ago
|
||
If you're using moz_arena_malloc, you do end up with nodes of the same size being grouped together.
In this case this patch could be a huge security gain wrt. robustness against UAF exploits. A pointer to an incorrectly freed DOM node can't be manipulated (by heap feng shui) to point to somewhere else (like the vtable of another class). If it is performance neutral, we should very seriously consider still taking it.
Sean, does the above change your assessment on whether it is worthwhile to mold this into land-able state?
Comment 48•5 years ago
|
||
How is using this mozjemalloc arena so different to use normal mozjemalloc from security point of view.
mozjemalloc has bucket sizes, so nodes within certain size limits end up using same bucket.
Comment 49•5 years ago
|
||
How is using this mozjemalloc arena so different to use normal mozjemalloc from security point of view.
If the DOM nodes are mixed with regular allocations, there's more opportunity to get a different object that has a size that fits in the same bucket, into the memory area previously occupied by an object that has a dangling pointer (i.e. UAF).
Comment 50•5 years ago
•
|
||
Say I manage to convince Firefox to free a DOM Node X that I have partial control over (by changing the attributes). I can't do anything interesting just changing the attributes of this DOM node. It is freed (but still used, the UAF) by the allocator, which now allocates a More Interesting Internal Data Structure on the same spot. Now, the memory spot that I control and that was a harmless attribute actually contains a vtable ptr = profit.
This kind of attack is harder if the arena only allows other Less Interesting Objects to be allocated into the freed up memory. It's even harder if it's similarly sized objects of a similar type (i.e. other DOM Nodes), since there's less chance the vtable is in a useful position.
It is even harder if every DOM Node has its own arena. WebKit has this, but it's quite a step further from this still.
Comment 51•5 years ago
|
||
It is even harder if every DOM Node has its own arena.
Fwiw, this is basically what we have for frame classes in the PresShell arena. An arena slot allocated to say a nsBlockFrame is only ever reallocated to a new nsBlockFrame, thus eliminating type confusion.
Assignee | ||
Comment 52•5 years ago
|
||
gcp: Thanks for all of these info. Yeah, I mean that makes sense to me.
I think for us to make a decision, we need to understand how much security gain does this patch provide, and also whether the security gain worth the performance regressions.
To help us understand more about the performance regressions, based on Ehsan's suggestion, I am going to run the nodeIterator benchmark again on some phones, and we also hope it may shine on some low-end phones because low-end phones usually have a very small amount of cache.
I also feel this thing might need a broader discussion.
Comment 53•5 years ago
•
|
||
(In reply to Mats Palmgren (:mats) from comment #51)
It is even harder if every DOM Node has its own arena.
Fwiw, this is basically what we have for frame classes in the PresShell arena. An arena slot allocated to say a nsBlockFrame is only ever reallocated to a new nsBlockFrame, thus eliminating type confusion.
Yeah, that I know and understand the benefits there. But I'm really wondering if the benefits of using a mozjemalloc arena for all the DOM nodes (well, more precisely, all HTML elements and Text nodes) are worth the complexity and possible performance and memory usage regressions.
(I was expecting performance benefits from the patch, but that doesn't seem to be the case.)
(nsIFrames had the disadvantage that they weren't refcounted, so ensuring sane lifetime management was harder than with refcounted and CCed DOM nodes. And WeakFrames were added because of somewhat unsafe crashes, and that was even before the current arena.)
Reporter | ||
Comment 54•5 years ago
|
||
(In reply to Gian-Carlo Pascutto [:gcp] from comment #50)
Say I manage to convince Firefox to free a DOM Node X that I have partial control over (by changing the attributes). I can't do anything interesting just changing the attributes of this DOM node. It is freed (but still used, the UAF) by the allocator, which now allocates a More Interesting Internal Data Structure on the same spot. Now, the memory spot that I control and that was a harmless attribute actually contains a vtable ptr = profit.
I don't actually think that we can achieve this benefit, unless if we have per-node-type arenas. Consider the following two nodes, A and B, A with one base class and B with two, so B would have two vtable pointers. So in the scenario above if the UAF'ed ptr is an A pointer and you manipulate its first word-size data member, and then get Gecko to allocate a B node in its place, you're controlling a vtable pointer again.
For frames, we actually use a much different allocation strategy, we allocate giving the type of the frame object, which is used to reuse frame allocation spots of the same type. This guarantees that the scenario that I described above can never happen, but it is not something that this patch currently implements, and the complexity of implementing such an approach is quite high...
Assignee | ||
Comment 55•5 years ago
|
||
While I was doing the testing on low-end phones, we found a bug of NodeIterator (bug 1596238), and once we fixed it, things started to changed.
I did more testing 2 low-end phones (Moto G5, Galaxy S4) and on my high-end XPS 15. I used the same testing technique which was using NodeIterator API to iterate all DOM elements, and as the result, DOM Arena performed very well, it was better than the baseline(which was a commit where my changes based of).
Full results: https://docs.google.com/spreadsheets/d/1AlsPqeEiaed83Nm8OnZNbFxrnAuYdX-ima6djvfR18c/edit?usp=sharing
So yeah, we are reconsidering our decision!
Assignee | ||
Comment 56•5 years ago
|
||
Assignee | ||
Comment 57•5 years ago
|
||
Depends on D57698
Assignee | ||
Comment 58•5 years ago
|
||
Depends on D57699
Assignee | ||
Comment 59•5 years ago
|
||
Depends on D57700
Assignee | ||
Comment 60•5 years ago
|
||
Depends on D57701
Assignee | ||
Comment 61•5 years ago
|
||
Depends on D57702
Assignee | ||
Comment 62•5 years ago
|
||
Depends on D57703
Assignee | ||
Comment 63•5 years ago
|
||
Depends on D57704
Assignee | ||
Comment 64•5 years ago
|
||
Depends on D57705
Assignee | ||
Comment 65•5 years ago
|
||
Depends on D57706
Updated•5 years ago
|
Assignee | ||
Comment 66•5 years ago
|
||
We customized the new operator to allow Dom Arena to be hooked
in.
Depends on D57698
Assignee | ||
Comment 67•5 years ago
|
||
Document is a special case. Since Documents could own NodeInfoManager,
use the global new operator for Documents
Depends on D62350
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Assignee | ||
Comment 68•5 years ago
|
||
Depends on D57707
Assignee | ||
Comment 69•5 years ago
|
||
Depends on D62352
Assignee | ||
Comment 70•5 years ago
|
||
Depends on D62353
Assignee | ||
Comment 71•5 years ago
|
||
Depends on D62354
Assignee | ||
Comment 72•5 years ago
|
||
Depends on D62355
Updated•5 years ago
|
Assignee | ||
Comment 73•5 years ago
|
||
There's no existing macro to allow use declare a final
DeleteCycleCollectable, this patch adds one
Depends on D62351
Assignee | ||
Comment 74•5 years ago
|
||
Depends on D63262
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Assignee | ||
Comment 75•5 years ago
|
||
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Comment 76•5 years ago
|
||
Pushed by sefeng@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/4152a66c9cab Add the capability to do DOM node Arena allocation r=smaug https://hg.mozilla.org/integration/autoland/rev/9b10d4a18d8f Handle cross docGroup node adoption by keeping arena alive r=smaug https://hg.mozilla.org/integration/autoland/rev/479b7648e14f Set nsINode to use the customized new operator r=smaug https://hg.mozilla.org/integration/autoland/rev/be4727306b56 Enable default new operator back for Document node r=smaug https://hg.mozilla.org/integration/autoland/rev/cd8a79470a89 Refactor some macros to support a final DeleteCycleCollectable r=smaug https://hg.mozilla.org/integration/autoland/rev/c8f21f2b4452 Make nsIContent to declare a final DeleteCycleCollectable r=smaug https://hg.mozilla.org/integration/autoland/rev/52013e3b7c7e Make HTML Element to adapt the DOMArena changes r=smaug https://hg.mozilla.org/integration/autoland/rev/ad20ed22a644 Make Attribute to adapt the DOMArena changes r=smaug https://hg.mozilla.org/integration/autoland/rev/fc5b51e31c10 Make SVG nodes to adapt the DOMArena changes r=smaug https://hg.mozilla.org/integration/autoland/rev/d377d84d8500 Make MathML nodes to adapt the DOMArena changes r=smaug https://hg.mozilla.org/integration/autoland/rev/83e21f5ac307 Make nsXMLElement to adapt the DOMArena changes r=smaug https://hg.mozilla.org/integration/autoland/rev/1d82b510aae0 Make nsXULElement to adapt the DOMArena changes r=smaug https://hg.mozilla.org/integration/autoland/rev/538c6cf14b44 Make GeneratedImageContent to adapt the DOMArena changes r=smaug https://hg.mozilla.org/integration/autoland/rev/e9f0c79a628a Make DocumentFragment to adapt the DOMArena changes r=smaug https://hg.mozilla.org/integration/autoland/rev/ac3e0036d412 Make Comment to adapt the DOMArena changes r=smaug https://hg.mozilla.org/integration/autoland/rev/cdc0080abe33 Make TextNode to adapt the DOMArena changes r=smaug https://hg.mozilla.org/integration/autoland/rev/6d39a375d5e2 Make CDATASection to adapt the DOMArena changes r=smaug https://hg.mozilla.org/integration/autoland/rev/5ed5a5b322db Make DocumentType to adapt the DOMArena changes r=smaug https://hg.mozilla.org/integration/autoland/rev/32010372f437 Make ProcessingInstruction to adapt the DOMArena changes r=smaug
Comment 77•5 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/4152a66c9cab
https://hg.mozilla.org/mozilla-central/rev/9b10d4a18d8f
https://hg.mozilla.org/mozilla-central/rev/479b7648e14f
https://hg.mozilla.org/mozilla-central/rev/be4727306b56
https://hg.mozilla.org/mozilla-central/rev/cd8a79470a89
https://hg.mozilla.org/mozilla-central/rev/c8f21f2b4452
https://hg.mozilla.org/mozilla-central/rev/52013e3b7c7e
https://hg.mozilla.org/mozilla-central/rev/ad20ed22a644
https://hg.mozilla.org/mozilla-central/rev/fc5b51e31c10
https://hg.mozilla.org/mozilla-central/rev/d377d84d8500
https://hg.mozilla.org/mozilla-central/rev/83e21f5ac307
https://hg.mozilla.org/mozilla-central/rev/1d82b510aae0
https://hg.mozilla.org/mozilla-central/rev/538c6cf14b44
https://hg.mozilla.org/mozilla-central/rev/e9f0c79a628a
https://hg.mozilla.org/mozilla-central/rev/ac3e0036d412
https://hg.mozilla.org/mozilla-central/rev/cdc0080abe33
https://hg.mozilla.org/mozilla-central/rev/6d39a375d5e2
https://hg.mozilla.org/mozilla-central/rev/5ed5a5b322db
https://hg.mozilla.org/mozilla-central/rev/32010372f437
Description
•