Keep arena list sorted by free size to reduce fragmentation

RESOLVED FIXED in mozilla33

People

(Reporter: terrence, Assigned: ehoogeveen)


(Whiteboard: [MemShrink:P2])

Attachments

(4 attachments, 22 obsolete attachments)

Two PNG images (9.72 KB and 9.29 KB) and two patches (30.20 KB and 6.20 KB), both patches carrying review+.
(Reporter)

Description

5 years ago
Right now this list is unsorted, so we end up with a bunch of half-full arenas. Even without compacting GC, we should be able to reduce heap fragmentation by being smarter about how we choose our next arena.
I guess you'd prefer to allocate in almost-full arenas? That way the almost-empty arenas would have a good chance of having their few elements die soon?
(Assignee)

Comment 2

5 years ago
So 1) do a binary search on the sorted list to find the most-filled arena with enough space left to hold the new element, 2) allocate it there and 3) do a binary search on the more-filled arenas and move it to its new position?

Could I work on this? It sounds relatively straightforward (and a good follow-up to bug 1005849 in a way).
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #2)
> So 1) do a binary search on the sorted list to find the most-filled arena
> with enough space left to hold the new element, 2) allocate it there and 3)
> do a binary search on the more-filled arenas and move it to its new position?

No binary search should be needed. There are 20 or so different AllocKinds, and a separate arena list for each one. So all GC things allocated in a particular arena list have the same size, which keeps things simple. And there's a "cursor" in each list which indicates where the first non-full arena is, so if the list after the cursor is sorted, you'll always be able to just allocate from the first arena after the cursor.

The tricky part is sorting the list. You'll have to do it during finalization. See FinalizeTypedArenas() in js/src/jsgc.cpp, where we currently re-insert non-empty arenas in the order they're processed.

You'll need to look at the FreeSpan, FreeList, ArenaHeader, ArenaList and ArenaLists classes to understand all this.
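For illustration, a minimal sketch of the list-plus-cursor arrangement njn describes (the names echo the real classes, but this is not the actual SpiderMonkey code):

struct ArenaHeader {
    ArenaHeader *next;
    // ... per-arena free-span bookkeeping elided ...
};

struct ArenaList {
    ArenaHeader *head_;
    ArenaHeader **cursorp_;  // points at the link to the first non-full arena

    ArenaList() : head_(nullptr), cursorp_(&head_) {}

    // Arenas before the cursor are full; arenas from the cursor on still
    // have free space. If the tail is kept sorted, allocation can always
    // take the first arena at the cursor and advance past it once it has
    // been picked to be filled up.
    ArenaHeader *takeNextNonFullArena() {
        ArenaHeader *a = *cursorp_;  // first non-full arena, or nullptr
        if (a)
            cursorp_ = &a->next;
        return a;
    }
};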

> Could I work on this?

Sure. One of the harder aspects of this will be getting measurements to show that it actually improves things, and that the cost of sorting isn't too high.
Whiteboard: [MemShrink]
You should also try the "visualize arenas" patch in bug 989379. Running it helped me greatly to understand how the arena lists work.
(Assignee)

Comment 5

5 years ago
(In reply to Nicholas Nethercote [:njn] from comment #3)
> (In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #2)
> > So 1) do a binary search on the sorted list to find the most-filled arena
> > with enough space left to hold the new element, 2) allocate it there and 3)
> > do a binary search on the more-filled arenas and move it to its new position?
> 
> No binary search should be needed. There are 20 or so different AllocKinds,
> and a separate arena list for each one. So all GC things allocated in a
> particular arena list have the same size, which keeps things simple. And
> there's a "cursor" in each list which indicates where the first non-full
> arena is, so if the list after the cursor is sorted, you'll always be able
> to just allocate from the first arena after the cursor.

Ah, so each non-full arena will be able to hold at least one more element. That also means the order of the list won't change after allocation (though the cursor might move down). So this part should be very fast, but the actual sorting will happen all at once.

I'll have a look at the classes involved, thanks for the pointers!
> Ah, so each non-full arena will be able to hold at least one more element.
> That also means the order of the list won't change after allocation (though
> the cursor might move down). So this part should be very fast, but the
> actual sorting will happen all at once.

Yes.

Another performance effect: the common allocation case is that the arena isn't full, in which case the allocation is fast. The uncommon case is that the arena has filled up and we need to move to the next non-full one (or create one if there aren't any non-full ones). That uncommon case will become a little more common with this change. It might not be enough to matter, though.
(Assignee)

Comment 7

5 years ago
Here's a first cut at this. Sorting linked lists doesn't appear to be a very common operation, so I had to roll my own. I implemented the merge sort described at http://www.chiark.greenend.org.uk/~sgtatham/algorithms/listsort.html (using the description but not the example code) - which should be about the best you can do for linked lists. An alternative would be to copy the list (or pointers to its elements) over into a regular array, and sort that. That might be faster, but it would also use more memory, and we'd want a way to query how many arenas there are.
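For reference, a self-contained sketch of the bottom-up linked-list merge sort described on that page, sorting integer keys in increasing order (illustrative only - the patch sorts ArenaHeaders by free space, and none of this is the patch's code):

#include <cstddef>

// Illustrative node; the patch's equivalent is ArenaHeader.
struct Node {
    Node *next;
    int key;
};

// Bottom-up merge sort for a singly linked list: merge adjacent runs of
// length k, doubling k each pass, until a pass performs at most one merge.
// O(n log n) time, O(1) extra space, stable.
Node *listMergeSort(Node *head)
{
    if (!head)
        return nullptr;
    for (size_t k = 1;; k *= 2) {
        Node *p = head;
        Node *tail = nullptr;
        size_t merges = 0;
        head = nullptr;
        while (p) {
            ++merges;
            // The first run starts at p; find the second run's start, q.
            Node *q = p;
            size_t psize = 0;
            for (size_t i = 0; q && i < k; ++i, ++psize)
                q = q->next;
            size_t qsize = k;
            // Merge the run at p (psize nodes) with the run at q (up to k).
            while (psize > 0 || (qsize > 0 && q)) {
                Node *e;
                if (psize == 0) {
                    e = q; q = q->next; --qsize;
                } else if (qsize == 0 || !q || p->key <= q->key) {
                    e = p; p = p->next; --psize;
                } else {
                    e = q; q = q->next; --qsize;
                }
                if (tail)
                    tail->next = e;
                else
                    head = e;
                tail = e;
            }
            p = q;  // continue with the pair of runs after this one
        }
        tail->next = nullptr;
        if (merges <= 1)
            return head;
    }
}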

In any case, now I need to 1) properly test my sort implementation and 2) figure out what effect this has on performance and fragmentation. I expect the latter will take more time than writing this patch did, so I'm holding off putting this up for review until I have some numbers.
Assignee: nobody → emanuel.hoogeveen
Status: NEW → ASSIGNED
(Assignee)

Comment 8

5 years ago
And by 'decreasing' I actually mean 'increasing' order, heh. I'll fix the descriptions when I've done some testing.
Hmm... I was thinking that you could do this more easily by just changing this:

> dest.insertAtCursor(aheader);

with something like this:

> dest.insertInSortedPosition(aheader);

and implementing that function. So it'll be an insertion sort, though you can probably optimize it by using the cursor to partition the insertion of full arenas vs. non-full arenas.

Sorry I didn't make this clear :(
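For illustration, a minimal sketch of the suggested insertion approach (hypothetical names and bookkeeping; terrence's review in comment 35 below shows the same pointer-to-pointer trick against the real code):

#include <cstddef>

// Illustrative stand-ins; the real classes track free space via FreeSpans.
struct ArenaHeader {
    ArenaHeader *next;
    size_t freeSpace;  // bytes still free in this arena
    bool isFull() const { return freeSpace == 0; }
};

struct ArenaList {
    ArenaHeader *head_;
    ArenaHeader **cursorp_;  // link to the first non-full arena

    void insertInSortedPosition(ArenaHeader *a) {
        if (a->isFull()) {
            // Full arenas live before the cursor; prepending is O(1).
            a->next = head_;
            if (cursorp_ == &head_)
                cursorp_ = &a->next;
            head_ = a;
            return;
        }
        // Walk the sorted tail to find the first arena with at least as
        // much free space, and splice |a| in before it.
        ArenaHeader **p = cursorp_;
        while (*p && (*p)->freeSpace < a->freeSpace)
            p = &(*p)->next;
        a->next = *p;
        *p = a;
    }
};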
(Assignee)

Comment 10

5 years ago
Hmm, I guess that would work too. O(n^2) behavior, technically, but easier to spread out across incremental slices too. To be honest I went into this not realizing that ArenaHeaders worked like a linked list, and then after I learned more I didn't re-evaluate ;)

I don't think the code I did implement is too bad, so maybe we can compare the two approaches. I didn't have much time to work on it today.
It is certainly worth comparing :) 

I suspect the ability to partition via the cursor will be a sizeable win. Inserting full arenas -- which is very common -- will be an O(1) operation.
(Assignee)

Comment 12

5 years ago
Technically, my current patch does that too - it only sorts the non-full arenas, by starting at the cursor. It just does it all in one go, instead of as they come in.
Oh, I didn't see that. Great!
(Assignee)

Comment 14

5 years ago
I will be extremely busy this coming week, so I probably won't be able to work on this until next Saturday.
Gecko 32 will ship with FxOS 2.0, but Nightly 32 will be merged to Aurora next week. If this patch is simple and reduces memory usage, we should consider uplifting it to Aurora 32. Our next opportunity to ship GC improvements with FxOS (2.1) will be Gecko 34.
(Assignee)

Comment 16

5 years ago
I haven't had the time to do any testing on this, and njn's idea from comment #9 would be a competing approach (simpler implementation). I can whip up a patch with that approach very easily, but I won't have time to do any testing until Friday at the earliest. If this is something we want *now*, someone else should take over. Otherwise I can work on it when I have time, and if it makes a big difference we can consider it for uplift.
Emanuel: sorry, I didn't mean to pressure you! <:) I know you are busy. We still have six weeks of Aurora if we'd like to uplift. I was just noting that this is an interesting bug we might want to track for Aurora 32.
(Assignee)

Comment 18

5 years ago
Updated to fix comments and description, still untested.

(In reply to Chris Peterson (:cpeterson) from comment #17)
> Emanuel: sorry, I didn't mean to pressure you! <:) I know you are busy. We
> still have six weeks of Aurora if we'd like to uplift. I was just noting
> that this is an interesting bug we might want to track for Aurora 32.

No worries, just making sure we're on the same page :)
Attachment #8430908 - Attachment is obsolete: true
(Assignee)

Comment 19

5 years ago
This implements njn's suggestion from comment #9 (also untested).

When I have more time, I'll probably write a standalone test for this (using random numbers as a proxy) to test these for correctness and get some numbers on relative performance.
(Assignee)

Comment 20

5 years ago
The latter doesn't compile because there are some other existing uses of insertAtCursor. I'll fix these after testing locally; both patches probably need to adjust those sites.
(Assignee)

Comment 21

5 years ago
I wrote a test to check the correctness of both algorithms and compare their performance. They appear to work correctly; my computer is pretty busy with another calculation at the moment, but the one-shot merge sort approach appears to overtake the amortized insertion sort approach around n = 150. How big are these lists, generally? I'll try adding some logging code myself tomorrow.
My experience is that they're mostly less than 150, but you should do some real measurements to find out the proportions.
(Assignee)

Comment 23

5 years ago
I put a check in ArenaList's deepCheck to see how many elements there are, storing the max count of Arenas per ArenaList seen so far in a static member variable. Running Octane I see this number spiking to ~25k - so I think we should take the merge sort approach. I'll do some more performance testing first though; I want to see how destroying cache locality will affect the relative performance of the algorithms.
(Assignee)

Comment 24

5 years ago
Some concrete performance results for 25k objects (included the insert results for fun):

cache locality    algorithm    time (ms)
           bad        merge          8.6
          good        merge          1.7
           bad       insert       1841.4
          good       insert        435.7

Here 'good' cache locality means all the (dummy) arenas were initially adjacent, and 'bad' cache locality means they were chosen at random from a 4GiB list. As you can see, the merge sort is 5x as fast with good cache locality, and the insertion sort is similarly affected. I would expect actual real life results to be somewhere in between these extremes.

This also means that for the 25k list I saw during a run of Octane, the merge sort would probably take around 5ms, which I believe is also the length of an incremental slice. It may be possible to optimize the sort somewhat (at the expense of complexity and memory overhead) by searching for already-sorted spans first. Then we could even conceivably split the sort up into chunks by sorting once at the *start* of each slice, then sorting what's left at the end. I don't know if this is worth the added complexity, however (and I haven't tried implementing it).

Terrence, what do you think of these numbers? I think Octane serves as a pretty good proxy for a worst case situation since it's so heavy on the GCing, but I don't really want to make GC less incremental if I can help it.
Flags: needinfo?(terrence)
(Assignee)

Comment 25

5 years ago
Oh, one thing I could implement without too much trouble is to interrupt the sort between merge passes - then let you resume it later. This would work as long as the ArenaList isn't used or modified after all the Arenas are inserted and before the GC is complete (is that something we can rely on, or can new items be added to the src list of FinalizeTypedArenas between slices?).
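A rough sketch of what resuming between merge passes could look like, assuming the bottom-up sort from comment 7 (|mergeOnePass| and |SliceBudget| are stand-ins, not the patch's actual interface):

#include <cstddef>

struct SliceBudget {
    bool isOverBudget() const;  // assumed
};

struct ArenaListSorter {
    // Persists between slices (e.g. on the list or the runtime) so a later
    // slice can resume where the sort left off.
    size_t currentSortRunLength = 1;

    // Hypothetical helper: one bottom-up merge pass over runs of the given
    // length, returning how many merges it performed.
    size_t mergeOnePass(size_t runLength);

    // Returns true when fully sorted, false when interrupted over budget.
    bool sortIncrementally(const SliceBudget &budget) {
        while (true) {
            size_t merges = mergeOnePass(currentSortRunLength);
            if (merges <= 1) {
                currentSortRunLength = 1;  // done; reset for the next GC
                return true;
            }
            currentSortRunLength *= 2;
            if (budget.isOverBudget())
                return false;  // resume from the saved run length next slice
        }
    }
};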
(Assignee)

Comment 26

5 years ago
Alright, this uses the merge sort approach during finalization and adds the ability to interrupt it between passes for incremental GC. It also uses insertInSortedPosition from the other patch for an edge case related to parallel execution. There are a few things I'm uncertain about:

1) At http://dxr.mozilla.org/mozilla-central/source/js/src/jsgc.cpp#1583 we take the first non-empty arena and move the cursor past it, and set it as fully used. How do we know the arena will be fully used after this allocation?

2) At http://dxr.mozilla.org/mozilla-central/source/js/src/jsgc.cpp#1635 we insert the newly allocated, empty arena at the start of the ArenaList. This doesn't change the cursor, so it can't affect the sort. I assume the idea is that this will get moved to its correct position during the next GC?

3) At http://dxr.mozilla.org/mozilla-central/source/js/src/jsgc.cpp#1774 we append arenas to a list where the cursor is assumed to be at the end. How do we know the arenas allocated outside of background finalization are full? Should we be doing another sort here?
Attachment #8433681 - Attachment is obsolete: true
Attachment #8433682 - Attachment is obsolete: true
Attachment #8436542 - Flags: review?(terrence)
You have measured the time impact of this change, which is good. But you haven't measured the space impact, which was the original motivation :)
(Assignee)

Comment 28

5 years ago
Indeed! I'll get to that tomorrow ;) I figured I might as well put the patch up and ask my questions in case there's something I overlooked in my implementation.
(Assignee)

Comment 29

5 years ago
Running with the "Visualize arenas" patch Nicholas mentioned showed that my size calculation was too naive - and how to fix it, too. This patch fixes that, and I can see it working as expected. I haven't got any results on potential memory savings yet, though.

I've been trying to think what a good way to test this would be. Octane is a pretty good stress test for the garbage collector, but once it's done I wouldn't expect us to hang on to any of the memory it used, so that should free all the arenas that weren't already in use to begin with. Where this *should* make a difference is in loading websites that use a lot of JS. Their arena usage will peak while the pages are loading, but they should hang on to some results, leaving behind partially filled arenas. I'll probably try membench tomorrow and see if the patch makes a clear difference - unfortunately, that loads so many pages that I expect the result will be quite noisy.
Attachment #8436542 - Attachment is obsolete: true
Attachment #8436542 - Flags: review?(terrence)
Attachment #8437417 - Flags: review?(terrence)
techcrunch.com is my preferred test case when I need a large page with lots of JS. Make sure you roll over the social media button strip at the bottom of each story so that all the buttons get loaded.
FWIW jemalloc uses an rb_tree to hold its set of fixed-sized-things-that-aren't-full (runs in their parlance, arenas here I suppose). If I recall correctly, they just grab the 'arena' with the lowest address in memory (it could be highest, I don't think it really matters, you just have to do the same thing each time).

I believe this avoids the need to re-sort in most cases (until the arena becomes full). I haven't looked at your patch, it's quite possible this is what you're achieving anyhow, but thought it might be helpful.
(Assignee)

Comment 32

5 years ago
Stuff in GC arenas is collected all at once during GC, so there's no need to sort until after that (we can always choose the first arena after the cursor). We could sort them based on the address, or use it as a tiebreaker when two arenas have an equal amount of free space left. We could presumably even get the heap growth direction from the GC's chunk allocator, though that would make comparisons a little more involved. None of that really changes the needed infrastructure though, so I'd like to see how much of a difference the current patch makes first.
(Reporter)

Comment 33

5 years ago
You should use Octane to test for performance changes and membench to test effectiveness. Even though the absolute size will be a bit noisy, what you're really looking for is fragmentation magnitude. So ideally about:memory would show a lower 'unused-gc-things' and potentially a higher 'gc-heap/decommitted-arenas'.

(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #26)
>
> 1) At http://dxr.mozilla.org/mozilla-central/source/js/src/jsgc.cpp#1583 we
> take the first non-empty arena and move the cursor past it, and set it as
> fully used. How do we know the arena will be fully used after this
> allocation?

The first-free-span is packed when stored in the arena header. We unpack it into a FreeList and allocate out of the unpacked version stored there instead of unpacking and re-packing at every allocation. In the meantime, we set the packed one stored directly in the ArenaHeader to "fully used" in order to prevent anyone accidentally using it to traverse things in the arena, since it is not up-to-date. See AutoCopyFreeListToArenas and uses.
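For illustration, a heavily simplified sketch of that unpack/repack lifecycle (AutoCopyFreeListToArenas is the real class name; everything else here is hypothetical scaffolding, not the actual implementation):

struct ArenaHeaderSketch {
    bool packedSpanIsFullyUsed;  // "fully used" => do not traverse via it
};

struct FreeListSketch {
    ArenaHeaderSketch *arena;

    void copyToArena() {
        // Sync the packed span so GC code can traverse the in-use arena.
        arena->packedSpanIsFullyUsed = false;
    }
    void markArenaFullyUsed() {
        // The packed span is stale again while allocation continues.
        arena->packedSpanIsFullyUsed = true;
    }
};

// RAII guard in the spirit of AutoCopyFreeListToArenas.
class AutoCopyFreeListToArenasSketch {
    FreeListSketch &fl_;
  public:
    explicit AutoCopyFreeListToArenasSketch(FreeListSketch &fl) : fl_(fl) {
        fl_.copyToArena();
    }
    ~AutoCopyFreeListToArenasSketch() {
        fl_.markArenaFullyUsed();
    }
};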
 
> 2) At http://dxr.mozilla.org/mozilla-central/source/js/src/jsgc.cpp#1635 we
> insert the newly allocated, empty arena at the start of the ArenaList. This
> doesn't change the cursor, so it can't affect the sort. I assume the idea is
> that this will get moved to its correct position during the next GC?

Yeah, I've always been a bit confused by this. The comment mentions cache locality. It's probably nonsense. I'd do what makes sense and profile to make sure it's not killing performance.

> 3) At http://dxr.mozilla.org/mozilla-central/source/js/src/jsgc.cpp#1774 we
> append arenas to a list where the cursor is assumed to be at the end. How do
> we know the arenas allocated outside of background finalization are full?
> Should we be doing another sort here?

We don't and yes. We're probably going to be freeing more things from older arenas, so this list is likely to be in exactly the opposite order of what we want.
Flags: needinfo?(terrence)
(In reply to Terrence Cole [:terrence] from comment #33)
> We don't and yes. We're probably going to be freeing more things from older
> arenas, so this list is likely to be in exactly the opposite order of what
> we want.

If cases like this and the already-sorted spans from comment 24 exist, Timsort might be a good algorithm to use. It's a merge sort that specifically optimizes for these cases, and is a lot faster than a generic merge sort for already-sorted or exactly-inverse lists.
(Reporter)

Comment 35

5 years ago
Comment on attachment 8437417 [details] [diff] [review]
v2 - Sort arenas in order of increasing free space.

Review of attachment 8437417 [details] [diff] [review]:
-----------------------------------------------------------------

This is very nice and amazingly simple. Great work! I'm clearing review because I want a second pass with the nits fixed, once you've collected numbers.

::: js/src/jsgc.cpp
@@ +536,5 @@
> +            l = r;
> +            // Find the start of the second chunk.
> +            for (int i = 0; r && i < sortChunk; ++i)
> +                r = r->next;
> +            // Cache the size calculation

Period at end.

@@ +537,5 @@
> +            // Find the start of the second chunk.
> +            for (int i = 0; r && i < sortChunk; ++i)
> +                r = r->next;
> +            // Cache the size calculation
> +            size_t lSize = (l ? l->freeSpaceLeft(ts) : 0);

No need for the parens around the ternaries here, no nesting so no real possibility for confusion.

@@ +572,5 @@
> +            // Append the remaining items in the first chunk.
> +            while (l && lUnused--) {
> +                s->next = l;
> +                l = l->next;
> +                s = s->next;

This is equivalent to:

s = l;
l = l->next;

@@ +578,5 @@
> +            // Append the remaining items in the second chunk.
> +            while (r && rUnused--) {
> +                s->next = r;
> +                r = r->next;
> +                s = s->next;

Same here:

s = r;
r = r->next;

@@ +593,5 @@
> +            return false;
> +        }
> +    } while (true);
> +    *cursorp_ = h; // Adjust the cursor to point to the new head.
> +    sortChunk = 1; // Reset chunk size

I know that merge sort works on "chunks", but the fact that the arenas we are sorting already live in a "Chunk" with the same name makes this a bit confusing. Could we perhaps change the name of |sortChunk| to something like |currentSortRunLength|, or something both more descriptive and less already-used.

::: js/src/jsgc.h
@@ +544,5 @@
> +        while (c->next && spaceLeft > c->next->freeSpaceLeft(ts))
> +            c = c->next;
> +        a->next = c->next;
> +        c->next = a;
> +        check();

|cursorp_| is indirect specifically because it allows us to do this type of linked-list traversal without special casing the first element. We can use the same trick here:

// Find the first arena with more space than |a|.
ArenaHeader **p = cursorp_;
while (*p && spaceLeft > (*p)->freeSpaceLeft(ts))
    p = &(*p)->next;

// |p| points to the prior entry's |next|, so we can update it directly.
// Note that this works even if the "prior list next" is still cursorp_.
a->next = *p;
*p = a;
check();

@@ +575,5 @@
>                  cursorp_ = other.cursorp_;
>          }
>          deepCheck();
>      }
> +    

Whitespace.
Attachment #8437417 - Flags: review?(terrence)
Whiteboard: [MemShrink] → [MemShrink:P2]
(In reply to Terrence Cole [:terrence] from comment #35)
> Comment on attachment 8437417 [details] [diff] [review]
> @@ +572,5 @@
> > +            // Append the remaining items in the first chunk.
> > +            while (l && lUnused--) {
> > +                s->next = l;
> > +                l = l->next;
> > +                s = s->next;
> 
> This is equivalent to:
> 
> s = l;
> l = l->next;

Maybe it's the review interface, but that is confusing. Those two lines are equivalent to the last two of the original three lines, not the whole set. So really it's

  s->next = l;
  s = l;
  l = l->next;

though I wonder, is

  s = s->next = l;
  l = l->next;

more or less clear?
(Assignee)

Comment 37

5 years ago
Thanks for the review! I'll comment on everything later.

(In reply to Steve Fink [:sfink] from comment #36)
> though I wonder, is
> 
>   s = s->next = l;
>   l = l->next;
> 
> more or less clear?

I think that makes the relationship clearer, thanks. It's hard enough to reason about this without going cross-eyed trying to figure out how the exact ordering affects things :P
(Reporter)

Comment 38

5 years ago
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #37)
> Thanks for the review! I'll comment on everything later.
> 
> (In reply to Steve Fink [:sfink] from comment #36)
> > though I wonder, is
> > 
> >   s = s->next = l;
> >   l = l->next;
> > 
> > more or less clear?
> 
> I think that makes the relationship clearer, thanks. It's hard enough to
> reason about this without going cross-eyed trying to figure out how the
> exact ordering affects things :P

I'm pretty sure I was pretty sure that s->next is always clobbered and that the s = s->next above it in the follow-on loop has no effect. Now that I'm re-reading, I'm thinking I may have gone cross-eyed too: until we have test coverage statistics on tbpl, let's keep the assignment to s->next.

I forgot to include it in my review, but please also whip up a test-case for the merge algorithm itself. The uses of it will be super-hard to test other than running membench and octane, but the massive, complex algorithm itself should be easy to mock up a harness for. Please throw some hand-crafted arenas at it to give us some extra confidence in the large new block of highly complex code.
(Assignee)

Comment 39

5 years ago
I ran Membench 5 times with and without the (updated) patch, and got these numbers for unused-gc-things:

before: 180.15, 178.24, 180.34, 182.42, 175.64
after:  167.59, 173.85, 185.72, 170.58, 164.81

So ~180 on average before, and 172.5 on average after (or 170 ignoring the outlier). Everything else was too noisy to draw any conclusions. So that's promising, but I really wish I had a better test for this :) I did see some intermittent crashes with the updated patch, so I'll have to look into that first (these are probably related to the changes I made to appendToListWithCursorAtEnd(), which I haven't run through my local tests).

Ideally I'd also like to compare different ways of sorting:

(1) free(a) < free(b) || (free(a) == free(b) && a < b)
    This is what the current patch uses (my local updated version, that is).
    The idea is that we try the oldest, most full arenas first to give newer
    arenas a better chance of being freed entirely.

(2) free(a) < free(b) || (free(a) == free(b) && a > b)
    The idea here is that we still try the fullest arenas first, but we start
    with the most recently allocated ones because these allocations are more
    likely to be related (and thus get freed all at once).

(3) a < b
(4) a > b
    Here we drop ordering by size entirely in favor of ordering by address,
    with the same reasoning as for (1) and (2).

Unfortunately, without a better test for this, comparing these options isn't going to be easy (at the very least, it'll be time consuming).
(Assignee)

Comment 40

5 years ago
Those numbers are in MiB, by the way. Total memory usage at the end of the test was around 2450 MiB, though still spiking randomly by up to 200 MiB.
(Reporter)

Comment 41

5 years ago
10MiB on membench is a great win! I'd say it's worth taking with order (1), sans any other changes, at least once the bugs are worked out. We can always investigate further wins later.

How is the octane performance?
(Assignee)

Comment 42

5 years ago
Seemed to be about equal, hovering around 26k either way (tested in the browser, on Windows). I'll get a larger sample once I fix the crashes (or sufficiently convince myself they're unrelated - I also saw a crash testing an older base revision, so membench may simply be close to OOMing on 32-bit Windows).
(Assignee)

Comment 43

5 years ago
A quick update: I was wrong about Octane being unaffected; the overall score doesn't change that much, but Splay Latency is heavily impacted. I've been working to optimize every part of the patch - it's adding a few extra parts, but oddly the individual parts themselves keep getting simpler every time I refactor them, so I don't think it'll hurt the clarity that much.
(Assignee)

Comment 44

5 years ago
I think this is probably about the best I can do for now. Splay Latency still takes a hit, unfortunately: before this patch I saw it alternating between ~7500 and ~10500, giving 10.5k more often than not - after the patch I see it giving ~6500, ~8500 or ~9500, with ~8500 probably being the most likely. The overall score is not affected very much, 100-200 at most on a total of ~26600. Still, it's clear that this isn't free. I haven't tested PGO builds or the shell.

The one run of membench I did with this patch showed 157.30 MiB of unused-gc-things, which would be a pretty substantial decrease - but that may have been a lucky run.

(In reply to Terrence Cole [:terrence] from comment #33)
> > insert the newly allocated, empty arena at the start of the ArenaList. This
> > doesn't change the cursor, so it can't affect the sort. I assume the idea is
> > that this will get moved to its correct position during the next GC?
> 
> Yeah, I've always been a bit confused by this. The comment mentions cache
> locality. It's probably nonsense. I'd do what makes sense and profile to
> make sure it's not killing performance.

I changed this to insertAtCursor as part of my performance enhancements - this keeps it naturally sorted (since outside finalization, the arenas can only fill up) and lets me merge these arenas with the finalized ones in a single pass later on.

(In reply to Terrence Cole [:terrence] from comment #35)
> This is very nice and amazingly simple.

I hope you still think so when you see this patch ;)

> > +            size_t lSize = (l ? l->freeSpaceLeft(ts) : 0);
> 
> No need for the parens around the ternaries here, no nesting so no real
> possibility for confusion.

Okay, got rid of these.

> I know that merge sort works on "chunks", but the fact that the arenas we
> are sorting already live in a "Chunk" with the same name makes this a bit
> confusing. Could we perhaps change the name of |sortChunk| to something like
> |currentSortRunLength|, or something both more descriptive and less
> already-used.

Thanks, I went with |currentSortRunLength|. Naming things is hard.

> |cursorp_| is indirect specifically because it allows us to do this type of
> linked-list traversal without special casing the first element. We can use
> the same trick here:

Thanks for pointing this out, I hadn't fully grasped how this works when I wrote that version of the patch. I'm using this sort of logic all over the place now - it's hard to get right, but it really simplified a lot of code.

I haven't gotten started on writing any in-tree tests for this yet; I don't know much about actually *using* the jsapi, so it might take me some time. I have tested every stage locally using large amounts of random numbers, however, so I'm fairly sure it's correct now. Those crashes I was seeing earlier *were* due to a bug, and I had to wade through a lot more when I started optimizing the logic, but I'm reasonably confident that everything is correct now. Now to wait for Try to prove me wrong.. ;) (though I think I'll have a break first, wew)
Attachment #8437417 - Attachment is obsolete: true
Attachment #8439975 - Flags: review?(terrence)
(Reporter)

Comment 45

5 years ago
Comment on attachment 8439975 [details] [diff] [review]
v3 - Sort arenas in order of increasing free space.

Review of attachment 8439975 [details] [diff] [review]:
-----------------------------------------------------------------

(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #44)
> Created attachment 8439975 [details] [diff] [review]
> v3 - Sort arenas in order of increasing free space.
> 
> I think this is probably about the best I can do for now. Splay Latency
> still takes a hit, unfortunately: before this patch I saw it alternating
> between ~7500 and ~10500, giving 10.5k more often than not - after the patch
> I see it giving ~6500, ~8500 or ~9500, with ~8500 probably being the most
> likely. The overall score is not affected very much, 100-200 at most on a
> total of ~26600. Still, it's clear that this isn't free. I haven't tested
> PGO builds or the shell.

You should be able to fix this quite easily. Splay latency doesn't return to the event loop so we only run non-incremental GCs there. Even aside from benchmarks, if we're GCing and it's not an incremental GC then we should be focused on getting out of GC (i.e. interactive again) instead of minimizing space usage. I think if you just check budget.isUnlimited() before sorting, it should solve the perf issue.

This would be r+ other than the perf issue. We have a no regression policy for arewefastyet though, so we have to get that sorted out first.

::: js/src/jsgc.cpp
@@ +1980,5 @@
> +    // Merging will change |finalized|, so determine this now.
> +    bool arenaListNotEmpty = !finalized.isEmpty();
> +    // Note that both lists should be sorted at this point: |al| by filling up
> +    // naturally and |finalized| by being sorted during finalization.
> +    al->mergeWithSortedList(finalized);

Add a whitespace before the second comment, after the declaration of arenaListNotEmpty.
Attachment #8439975 - Flags: review?(terrence)
Comment on attachment 8439975 [details] [diff] [review]
v3 - Sort arenas in order of increasing free space.

Review of attachment 8439975 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jsgc.h
@@ +440,5 @@
>      // |cursorp_| is never null.
>      //
>      ArenaHeader     *head_;
>      ArenaHeader     **cursorp_;
> +    size_t          currentSortRunLength;

I haven't read the patch closely, but does this need to be a class member? My gut feeling is that it could be a local variable in sort() and passed into sortInner(), or something like that...
(Reporter)

Comment 47

5 years ago
(In reply to Nicholas Nethercote [:njn] from comment #46)
> Comment on attachment 8439975 [details] [diff] [review]
> 
> I haven't read the patch closely, but does this need to be a class member?
> My gut feeling is that it could be a local variable in sort() and passed
> into sortInner(), or something like that...

The algorithm is incrementalized so that it doesn't dominate the last slice. The incremental state has to go on the runtime somewhere and this seems like a reasonable spot.

Now that I think about it, this should probably have a section in the stats buffer as well, since it takes so long.
(Assignee)

Comment 48

5 years ago
(In reply to Terrence Cole [:terrence] from comment #45)
> You should be able to fix this quite easily. Splay latency doesn't return to
> the event loop so we only run non-incremental GCs there. Even aside from
> benchmarks, if we're GCing and it's not an incremental GC then we should be
> focused on getting out of GC (i.e. interactive again) instead of minimizing
> space usage. I think if you just check budget.isUnlimited() before sorting,
> it should solve the perf issue.

Good idea, this patch does that. Unfortunately this completely disables sorting for background finalization, since that isn't done incrementally. I tried leaving it enabled if onBackgroundThread == true (by giving it a near unlimited budget), but I had to take it out completely to fix the performance regression. I don't know how much this hurts the memory fragmentation improvements.

With this patch, I get numbers similar to before the patch, perhaps even slightly higher (I saw up to 10800 for Splay Latency, and 26850 in total).

> > +    bool arenaListNotEmpty = !finalized.isEmpty();
> > +    // Note that both lists should be sorted at this point: |al| by filling up
> > +    // naturally and |finalized| by being sorted during finalization.
> > +    al->mergeWithSortedList(finalized);
> 
> Add a whitespace before the second comment, after the declaration of
> arenaListNotEmpty.

Done.
Attachment #8439975 - Attachment is obsolete: true
Attachment #8440216 - Flags: review?(terrence)
(Assignee)

Comment 49

5 years ago
(In reply to Terrence Cole [:terrence] from comment #47)
> Now that I think about it, this should probably have a section in the stats
> buffer as well, since it takes so long.

Could you describe what needs to be done there? I assume it involves js::gcstats::Statistics or the RAII classes surrounding it somehow.
(Assignee)

Comment 50

5 years ago
I don't really have a good handle on how much we're missing by not sorting background finalized objects, but it seems very unfortunate (especially given that being on a background thread is presumably supposed to make it a bit less performance sensitive). Is there any way we could incrementalize background finalization, or lower its priority, to allow us to sort these arenas? Doesn't have to be in this bug, just wondering if it's feasible.
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #50)
I'm interested by the fact that this affects splay latency so much - are we really doing that much more work with sorting that it makes a difference?  I guess we must be.

What do you mean by incrementalizing background finalization, and how would that help here?  I filed bug 989390 but I'm not sure whether that's the same thing you're thinking of.
(Assignee)

Comment 52

5 years ago
(In reply to Jon Coppeard (:jonco) from comment #51)
> I'm interested by the fact that this affects splay latency so much - are we
> really doing that much more work with sorting that it makes a difference?  I
> guess we must be.

It might be possible to optimize the sort more.

For one thing, I could special case the currentSortRunLength == 1 case to not use the full merge mechanics.

The fact that arenas are stored in a linked list makes a lot of other sorting techniques (which rely on having access to indices) awkward. If we kept track of how many elements we have, at least, we could copy the linked list over to an array and go from there, but that would require a lot more auxiliary storage than we use now.

One of the things that makes the sort potentially slow (in a way I haven't measured locally) is that each size calculation has to iterate over the FreeSpans of each arena. If we could cache this data somehow, that might speed things up considerably. IIRC arenas are 4kiB, so we would need 12 bits for this in an ArenaHeader. We'd still have to iterate over all elements to either initialize the cache or clear it, unless we kept track of the size at all times (linking up changes in the FreeLists to the ArenaHeaders).

If we know we're just inserting a small number of elements into a larger, sorted list, we could avoid querying the size so much by doing something like a binary search on the larger list. You'd have to traverse the list more (since we can't use indices), but perhaps that's not the slow part.

> What do you mean by incrementalizing background finalization, and how would
> that help here?  I filed bug 989390 but I'm not sure whether that's the same
> thing you're thinking of.

Currently background finalization always uses an unlimited incremental budget, and in v4 I only sort when the budget is limited. This is a heuristic, since we know that Splay Latency causes full GCs, and we should strive to make full GCs as short as possible anyway since they cause more jank. But this means that, by extension, background finalized objects will never be sorted.

I guess I'm not so much looking for *incremental* background finalization, but more a heuristic along the lines of "Only sort if GC pressure is not too high". As for why Splay Latency is affected so much by activity on a background thread, I don't really know ... perhaps taking longer on the background thread causes us to allocate more temporary arenas, hurting memory locality and causing us to GC more, with whatever administrative overhead that involves. Bug 989390 sounds like it could help with that, but I'll admit that I'm kind of stabbing in the dark here.
(Assignee)

Comment 53

5 years ago
I've done a lot of work on this over the past few days. In a fruitful IRC conversation, billm suggested dividing the arenas up into a few size bins instead of doing a full sort, thereby achieving an O(n) algorithm that still gives most of the benefits. Following that conversation, I realized that:

1) Since the size of an Arena is 4kiB and the minimum size of an object is 16 bytes (technically there could be 8-byte objects, but there aren't any of those right now), and 0 < sizeof(ArenaHeader) <= 16 (or <= 32 on 64-bit), there can be at most 255 objects in an arena: 4096 / 16 = 256 slots, minus at least one slot for the ArenaHeader.

2) We only need 12 bits to store offsets within an Arena, so we only need 24 bits to store the firstOffset_ and lastOffset_ members of a CompactFreeSpan - leaving extra bits in a 32-bit bitfield (see the sketch below).

As a result:

a) We can cache the number of free things, which in most cases we get for free during marking (and if we don't have it, we only have to iterate over the FreeSpans once).

b) If we use 256 size bins, we can completely sort any object type in O(n) time (though we can't sort based on address as well).
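A sketch of the packed layout from 2) above, with the cached count from a) (one possible arrangement, not the patch's exact definition):

#include <cstdint>

// 12 + 12 bits for the span offsets leaves 8 bits, enough to cache a
// free-thing count of up to 255.
struct CompactFreeSpanSketch {
    uint32_t firstOffset_ : 12;  // offset of the span's first free thing
    uint32_t lastOffset_  : 12;  // offset of the span's last free thing
    uint32_t nFree_       : 8;   // cached number of free things in the arena
};
static_assert(sizeof(CompactFreeSpanSketch) == 4, "packs into one word");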

I reimplemented the sorting algorithm, taking these things into account, and that made it a lot faster. I don't have any in-browser measurements for v4 of the patch, but for v5 I found that sorts of 11.5k objects during Octane took about 0.9ms (and that was with a bug that made it iterate over *all* the arenas, including the full ones). The only disadvantage is that the sort now uses two 256-element arrays, so it uses 2kiB of stack on 32-bit and 4kiB of stack on 64-bit. I don't know if that's considered 'a lot', though.
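For illustration, a minimal sketch of that bin sort (the two 256-element arrays mentioned above would be the per-bin head and tail pointers; names are illustrative, not the patch's):

#include <cstddef>

static const size_t BinCount = 256;

// Illustrative; the real ArenaHeader caches its free-thing count elsewhere.
struct ArenaHeader {
    ArenaHeader *next;
    size_t nFree;  // number of free things, 0..255
};

struct FreeSpaceBins {
    ArenaHeader *heads[BinCount] = {};
    ArenaHeader *tails[BinCount] = {};

    // O(1) per arena: append it to the bin for its free-thing count.
    void insert(ArenaHeader *a) {
        a->next = nullptr;
        if (tails[a->nFree])
            tails[a->nFree]->next = a;
        else
            heads[a->nFree] = a;
        tails[a->nFree] = a;
    }

    // Concatenate the bins from fullest (fewest free things) to emptiest,
    // yielding a list sorted by increasing free space in O(BinCount + n).
    ArenaHeader *concatenate() {
        ArenaHeader *head = nullptr;
        ArenaHeader *tail = nullptr;
        for (size_t i = 0; i < BinCount; ++i) {
            if (!heads[i])
                continue;
            if (tail)
                tail->next = heads[i];
            else
                head = heads[i];
            tail = tails[i];
        }
        return head;
    }
};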

I then spent about two days tracking down an annoying assertion failure in debug builds of the browser. It turns out that in ArenaLists::adoptArenas and the various functions calling FinalizeArenas, we copy the current FreeList into its ArenaHeader's CompactFreeSpan, and this was occasionally causing the post-sort cursor to point at the Arena in active use.

This wasn't causing opt crashes since the Arena is marked as fully used again after these functions are done, but it would leave the cursor pointing at a full Arena, forcing ArenaLists::allocateFromArenaInline to allocate a new Arena and presumably wasting memory until the next GC. This will have affected all the patches in this bug so far - I just hadn't run a debug build yet.

I fixed this by temporarily clearing the reference to the FreeList before sorting - a simple fix, in the end. v5 triggers no assertions that I know of, and I added quite a few checks while debugging (deepCheck now ensures the ArenaList is fully sorted, for instance).

Performance is still an issue, but I think I've figured out the cause. If I sort the Arenas from most-filled to least-filled, Splay and Splay Latency both take about a 1000 point hit (the whole benchmark takes about a 300 point hit). If I sort the Arenas from least-filled to most-filled, performance on the whole benchmark is even worse (another 300-500 points lower).

But if I sort the Arenas from *half* filled to empty or full (taking the distance from half full), performance is restored to the pre-patch numbers, perhaps even higher (I seem to get 10.7k on Splay Latency fairly consistently). I don't know if that's necessarily a satisfying solution - it might undo the memory savings, for one thing. I didn't include this change in v5.
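For illustration, that ordering as a small key function (illustrative only; |freeBytes| and |capacityBytes| are per-arena quantities):

#include <cstddef>

// Sorting by ascending distance tries half-filled arenas first and leaves
// nearly-empty and nearly-full arenas for last.
size_t halfFullDistance(size_t freeBytes, size_t capacityBytes)
{
    size_t half = capacityBytes / 2;
    return freeBytes > half ? freeBytes - half : half - freeBytes;
}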

Perhaps we could sort each *bin* by address to improve memory locality, though that would make the sorts slower again (and somewhat difficult to incrementalize). I'm also getting kind of tired of doing runs of super noisy Octane, but c'est la vie I guess. Anyway, uploading the patch since it's otherwise complete, and in case anyone has further thoughts.
Attachment #8440216 - Attachment is obsolete: true
Attachment #8440216 - Flags: review?(terrence)
(Assignee)

Comment 54

5 years ago
Welp, now I can't reproduce a performance regression at all - I got a highest score of 10.8k on Splay Latency with the patch, which is pretty much as high as I've seen, and pre-patch numbers look consistently worse. I also tried to instrument the number of calls to refillFreeList and any difference in the number of calls was well within the noise for Octane. Maybe we can just take this in its current form (barring review comments, obviously) and see how it does on AWFY. v5.1 just fixes a trailing blank and adds a few small comments.
Attachment #8443856 - Attachment is obsolete: true
Attachment #8443967 - Flags: review?(terrence)
(Assignee)

Comment 55

5 years ago
By the way, with the extensive checking that deepCheck() does now, does this still need additional tests? All the bugs that I found while working on this showed up consistently just starting up a debug browser with a (more-or-less) clean profile.
(Assignee)

Comment 56

5 years ago
Oops, somewhere along the way I made a new patch and forgot to fill in the description. Sorry about the churn.
Attachment #8443967 - Attachment is obsolete: true
Attachment #8443967 - Flags: review?(terrence)
Attachment #8443969 - Flags: review?(terrence)
Comment on attachment 8443969 [details] [diff] [review]
v5.2 - Sort arenas in order of increasing free space.

Thanks for the patch. I'd like to look at this too. Here are some high-level comments:

- I would strongly prefer that we not use any extra bits to store the number of free things in the ArenaHeader. I was hoping that, as we finalize each arena, we could insert it into the appropriate array entry based on how many free things it has. Then, at the end of finalization, we could just concatenate the array entries together. This is pretty much what you're doing in sort(), except that you do it as a post-pass and I'd like to do it simultaneously with finalizing each arena.

I realize that that causes a few problems for background finalization and adopting. In the background finalization case, the existing arenas (the ones that didn't just get finalized) should all have a single free span, which makes it very easy to know how much free space they have. I think adoption should be similar--all the arenas being adopted will have a single free span.

- Please try to use more descriptive variable names. Names like |r| and |c| for arenas are too short.
Attachment #8443969 - Flags: review?(wmccloskey)
(Assignee)

Comment 58

5 years ago
Thanks for the comments.

(In reply to Bill McCloskey (:billm) from comment #57)
> Comment on attachment 8443969 [details] [diff] [review]
> - I would strongly prefer that we not use any extra bits to store the number
> of free things in the ArenaHeader.

Technically we don't use any *extra* bits - we go from two 16-bit members to a single 32-bit bitfield (perhaps I should have used uint32_t explicitly). Of course, this does add some complexity and the compiler will have to do a bit more work under the hood, but CompactFreeSpan::compact and decompact shouldn't be too hot, comparatively speaking.

> I was hoping that, as we finalize each
> arena, we could insert it into the appropriate array entry based on how many
> free things it has. Then, at the end of finalization, we could just
> concatenate the array entries together. This is pretty much what you're
> doing in sort(), except that you do it as a post-pass and I'd like to do it
> simultaneously with finalizing each arena.

I did this initially, but I wasn't sure about the implications of keeping such a GCRuntime-wide array around between incremental GC slices, and I wanted to see if centralizing it in a single sort() function would be cleaner. Then I got stuck debugging for several days. It shouldn't be too hard to do this again, though I'm not sure how it would interact with the copyFreeListToArena/clearFreeListInArena thing I found.

> I realize that that causes a few problems for background finalization and
> adopting. In the background finalization case, the existing arenas (the ones
> that didn't just get finalized) should all have a single free span, which
> makes it very easy to know how much free space they have. I think adoption
> should be similar--all the arenas being adopted will have a single free span.

Yeah, I would actually expect all the arenas allocated during background finalization to be either full or in use (and thus full for the purposes of sorting). I'm not sure about the to-be-adopted arenas - v5.2 already asserts that they're sorted, but I don't know if the same thing applies as for background finalization.

But even assuming that we do need to check their size and insert them in the appropriate positions, we should only have to do it once per insertion, so it probably wouldn't be the end of the world. The main source of overhead was checking the size of each finalized arena, which we now get for free.

> - Please try to use more descriptive variable names. Names like |r| and |c|
> for arenas are too short.

Heh, will do. I sometimes feel that short names keep things more readable at a glance, but I'm also just terrible at naming things.
(Assignee)

Comment 59

5 years ago
Oh, and one reason I made the sort function is that I wanted to keep track of the tail pointers in the same pass so we wouldn't have to iterate over all of them again at the end - but there's no fundamental reason we couldn't do this when inserting them one at a time (and in fact, no reason we couldn't still put the logic in a function).
Removing tracking-firefox32 because it's too late in the release cycle to uplift this feature to Aurora 32.
(Assignee)

Comment 61

5 years ago
Bill, is this closer to what you had in mind? Unfortunately I'm getting random crashes with this patch - I think we're resuming incremental GC in the wrong spot and linking up the wrong arenas, though I haven't directly confirmed this. Do you think this could be the case? Could we keep a bit more state between slices to prevent it?
Attachment #8443969 - Attachment is obsolete: true
Attachment #8443969 - Flags: review?(wmccloskey)
Attachment #8443969 - Flags: review?(terrence)
Attachment #8444625 - Flags: feedback?(wmccloskey)
Comment on attachment 8444625 [details] [diff] [review]
v6 - Sort arenas in order of increasing free space.

Review of attachment 8444625 [details] [diff] [review]:
-----------------------------------------------------------------

This looks pretty good. I think the main thing I'd suggest is to create a class that holds numFreeHeads and numFreeCursors, and only manipulate those using methods of the class. I think it would make some of this code a lot easier to understand.

I'm afraid I don't see any reason why you would get crashes. We do sometimes "reset" incremental GCs by calling resetIncrementalGC [1]. During the mark phase, that just stops the GC. But during sweeping, we're pretty careful to finish sweeping up the current zone group. So I'd be surprised if we started or stopped any sweeping in the middle.

Let me know if you need any help debugging. Are you getting crashes in shell tests or while running a browser?

[1] http://mxr.mozilla.org/mozilla-central/source/js/src/jsgc.cpp#4534

::: js/src/jsgc.cpp
@@ +1943,5 @@
> +
> +#ifdef DEBUG
> +    // Clear our reference to the current free list (if any) so that
> +    // the debug checks won't conclude that the list is not sorted.
> +    clearFreeListInArena(thingKind);

I'm having trouble figuring this part out. Can you explain more?

@@ +5743,4 @@
>          toList->deepCheck();
> +        tempList.deepCheck();
> +        // Merge the remaining arenas from fromList into toList.
> +        toList->mergeWithOtherList(tempList);

My hope was that tempList would contain only completely full arenas, except for the last one. Is that not the case? I'd really like to be able to avoid this complicated merge algorithm, if possible.
Attachment #8444625 - Flags: feedback?(wmccloskey) → feedback+
(Assignee)

Comment 63

5 years ago
(In reply to Bill McCloskey (:billm) from comment #62)
> This looks pretty good. I think the main thing I'd suggest is to create a
> class that holds numFreeHeads and numFreeCursors, and only manipulate those
> using methods of the class. I think it would make some of this code a lot
> easier to understand.

Okay, can do.

> I'm afraid I don't see any reason why you would get crashes. We do sometimes
> "reset" incremental GCs by calling resetIncrementalGC [1]. During the mark
> phase, that just stops the GC. But during sweeping, we're pretty careful to
> finish sweeping up the current zone group. So I'd be surprised if we started
> or stopped any sweeping in the middle.
> 
> Let me know if you need any help debugging. Are you getting crashes in shell
> tests or while running a browser?

Most of the crashes seem to happen during marking and in jitcode related checks, if that tells you anything. I usually get one running Octane in the browser, though I could look into reproducing them in the shell (would make testing changes quicker, for sure).

> > +    clearFreeListInArena(thingKind);
> 
> I'm having trouble figuring this part out. Can you explain more?

When we start allocating from an Arena, we initially set it to full (its CompactFreeSpan is initialized with zero offsets). To be able to mark the in-use arena, we copy the FreeList of each thingkind to its associated ArenaHeader (synchronizing the ArenaHeader's CompactFreeSpan with the FreeList).

But this means that when we sort the arenas, the in-use arena won't be seen as full. Regardless of exactly where it ends up in the sorted list, the arena will continue to fill up and leave a 'hole' in that position - if you ask the ArenaList for a new arena and its cursor happens to be pointing to this (now full) arena, it will return null and force us to allocate a new arena from the chunk (until GC rebuilds the list).

Similarly, when doing the debug-mode checks to see if an ArenaList is sorted, the in-use arena has a good chance of showing up as a non-sorted item unless we clear out the information first.

Now, what I don't know is if the in-use arena will be put in the arenaListsToSweep. If not, then it probably can't show up during finalization. I *do* know the problem happens in adoptArenas - that's where I initially saw it happen. If the in-use arena *can* show up during finalization, it occurs to me that patch v6 might have reintroduced this problem. I alluded to this in comment #58, then forgot. I should see if this has anything to do with the crashes I'm seeing.

> > +        toList->mergeWithOtherList(tempList);
> 
> My hope was that tempList would contain only completely full arenas, except
> for the last one. Is that not the case? I'd really like to be able to avoid
> this complicated merge algorithm, if possible.

That's right, only the last one has to be inserted. I didn't change the function name, but it now just inserts the single partial arena at the end of tempList into the partial arenas in toList. The complexity in insertInSortedPosition (which mergeWithOtherList calls) comes from the fact that we really don't want to compute the free space for thousands of partially full arenas, so it does something like a binary search for singly linked lists. insertInSortedPosition is already significantly less complicated than InsertInto from v5.2 was though. Of course, we could call it Good Enough and just insert it at the start of the partially filled arenas in toList ;)
(Assignee)

Comment 64

5 years ago
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #63)
> the
> arena will continue to fill up and leave a 'hole' in that position - if you
> ask the ArenaList for a new arena and its cursor happens to be pointing to
> this (now full) arena, it will return null and force us to allocate a new
> arena from the chunk (until GC rebuilds the list).

Er, that's not actually true, I'm not sure how I managed to convince myself of that. It will happily return the full arena, and I suppose allocation will then fail (in a position where allocation is not expected to fail). I'm not sure why I didn't see crashes from that.
Just wanted to make sure you were aware of this code:
http://mxr.mozilla.org/mozilla-central/source/js/src/jsgc.cpp#4050
Before we begin sweeping a zone group, we clear its free lists.

I'm still thinking about what we need to do for adopt, though.
(Assignee)

Comment 66

5 years ago
Thanks. I probably would have found that.. eventually :)

I just got a debug crash that actually tripped one of my assertions, and it showed two things:
1) Two different allocKinds in the same list (specifically, one ArenaHeader had allocKind 20 and the rest had allocKind 19).
2) The resulting ArenaList wasn't sorted. I couldn't look into how many items were in disarray, since I removed the size cache.

So something's definitely going badly wrong here, though I don't know where yet.
(Assignee)

Comment 67

5 years ago
Aha! I think I figured it out (from that same crash). If we go over budget, we return false in FinalizeTypedArenas() even if there are no more things left to finalize. Then the next time we enter foregroundFinalize(), we return early because there's nothing left to sweep - skipping the sorting step! I think we can just take this early return out altogether. Compiling now to see if this fixes the crashes.
That would explain it :). Nice work!
(Assignee)

Comment 69

5 years ago
That fixed the problem I was seeing, but it turns out it wasn't the only one:

1) In foregroundFinalize(), arenaLists[thingKind] isn't necessarily empty going in, so constructFromBins() was stomping all over its previous contents. arenaLists[thingKind] may also contain the currently active FreeList, so we do indeed need to clear it before constructFromBins() so that the debug mode assertions don't trip (of course, my patch added those assertions in the first place, but I think they're a nice addition).

2) With that fixed, this assertion seems to fail consistently during Octane Crypto:

http://hg.mozilla.org/mozilla-central/annotate/e86b84998b18/js/src/vm/Shape.cpp#l1830

This has the following stack:

JSCompartment::sweepInitialShapeTable() Line 1830
JSCompartment::sweep() Line 599
GCRuntime::beginSweepingZoneGroup() Line 4156
GCRuntime::beginSweepPhase() Line 4277
GCRuntime::incrementalCollectSlice() Line 4850
GCRuntime::gcCycle() Line 4996
GCRuntime::collect() Line 5149
GC() Line 5180
ShrinkingGC() Line 195
WorkerPrivate::GarbageCollectInternal() Line 5614
GarbageCollectRunnable::WorkerRun() Line 1660
WorkerRunnable::Run() Line 312
WorkerPrivate::ProcessAllControlRunnablesLocked() Line 4501
WorkerPrivate::DoRunLoop() Line 3997
WorkerThreadPrimaryRunnable::Run() Line 2591

I don't know what to make of this one.. I'm not sure what the failure indicates. Bill, any idea?
Flags: needinfo?(wmccloskey)
(Assignee)

Comment 70

5 years ago
Talked to Jonco on IRC - it looks like I was just linking up the lists wrong in my fix for problem (1) above. I still don't really understand *what* was wrong, but after writing the code in a more obviously correct way the crashes seem to be gone!
Flags: needinfo?(wmccloskey)
(Assignee)

Comment 71

5 years ago
Okay, I think this is ready for review. Aside from fixing the crashes and applying various bits of polish, I made the following changes:

(1) The size bin stuff is now encapsulated into a new class, ArenaFreeSpaceHistogram. I'm not entirely sure about the name, but it's the best I could come up with. I also wasn't really sure where to put it, so I just stuck it above the definition of ArenaList.

(2) I merged wrapInOtherList() and mergeWithOtherList() into one function, joinIntoThisList(), which takes a 'simple' and a 'complex' ArenaList, merges them and sets |this| to the result. This still uses insertInSortedPosition() like mergeWithOtherList() did, but is actually simpler than either of the original functions.

Finally, while debugging this version I realized one non-obvious way in which this patch changes the behavior. Prior to this patch, incremental foreground finalization would immediately append freshly swept arenas to the ArenaLists, thus allowing these arenas to be selected for use between slices. After this patch, we sort everything into bins first, and only merge them together and into the ArenaLists at the end of finalization. So for a GC that takes a lot of slices, this patch might cause us to allocate more new arenas. I doubt it'll make a big difference in practice, but there it is.
Attachment #8444625 - Attachment is obsolete: true
Attachment #8445640 - Flags: review?(wmccloskey)
Attachment #8445640 - Flags: review?(terrence)
(Assignee)

Comment 72

5 years ago
I sent this off to try (debug) and it came back pretty orange, going to have to see what's up - so feel free to consider those review requests as feedback for now (though I don't expect the patch to change a lot).
(Assignee)

Comment 73

5 years ago
As a quick update: I fixed three problems flagged by the new assertions that were otherwise harmless, and that eliminated most of the assertion failures for me locally. TBPL is still orange, however - it seems to have hit a rare bug in one of the new functions. So far I've managed to reproduce it once after running a jit-test that trips it ~18k times; hopefully the fix will be obvious.
(Assignee)

Comment 74

5 years ago
Well, that was fun. It turns out that background finalization on the GC helper threads can race with the free lists being copied into the arenas allocated during that same finalization. That means we can't reliably clear the free lists and copy them back in to avoid tripping the debug assertions, because doing so may itself race with the main thread and trip more assertions.

So to fix that, I've added a new field to ArenaHeaders, using the final free bit of the bitfield, to indicate when free lists have been copied in (and make hasFreeThings() and numFreeThings() return 0 in that case). Purging clears this bit, and I had to add a hasFreeThingsRaw() for the functions that are specifically testing to see if the free lists have been copied in. With that, I think this is ready for another try run.

Note that all of this is debug only - we don't have to check the bit at all in opt builds since they don't use any of the assertions. It's still a bit of extra complexity, but on the flip side it's self-contained, and gets rid of the clearFreeListsInArenas()/copyFreeListsToArenas() calls that I was sprinkling about.
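As a rough illustration of the scheme (the real ArenaHeader bitfield layout and names differ; this is just the shape of the idea):

#include <cstdint>

struct FreeSpan {
    uintptr_t first = 0;
    uintptr_t last = 0;
    bool isEmpty() const { return first == 0; } // simplified emptiness check
};

struct ArenaHeader {
    FreeSpan firstFreeSpan;
    uintptr_t allocKind : 8;
    uintptr_t freeListCopiedIn : 1; // the "final free bit" described above

    // Bypasses the bit, for code that explicitly checks whether the free
    // list has been copied in.
    bool hasFreeThingsRaw() const { return !firstFreeSpan.isEmpty(); }

    // Reports "no free things" while the arena's free span data is owned
    // by the active free list, so the sortedness assertions stay happy.
    bool hasFreeThings() const {
        return !freeListCopiedIn && hasFreeThingsRaw();
    }
};

Purging would then clear freeListCopiedIn again, per the description above.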
(Assignee)

Comment 75

5 years ago
So in the end, the debug checks may not be worth the added complexity. Figuring out that all the problems are in fact harmless probably counts for something, and showing that the lists are *always* sorted in debug mode with this patch is pretty cool, but it might not be that important to keep checking this for every build. Either way, here's the fixed patch. To get deepCheck passing I had to:

(1) Use a bit in ArenaHeader to track whether a reference to its free list has been copied into it.

(2) Add a function, correctAfterPurge(), to correct the position of the cursor and the purged arena.
Attachment #8445640 - Attachment is obsolete: true
Attachment #8445640 - Flags: review?(wmccloskey)
Attachment #8445640 - Flags: review?(terrence)
Attachment #8447539 - Flags: review?(wmccloskey)
Attachment #8447539 - Flags: review?(terrence)
(Assignee)

Comment 76

5 years ago
Looking pretty much green on try now: https://tbpl.mozilla.org/?tree=Try&rev=0001efe4e3a6

The one worry I have about doing these checks is that they will slow down debug builds, increasing the risk of timeouts. The two oranges in that try run support that worry, though there are already bugs on file for both of them - but it would be bad if this increased the frequency of orange.
(Assignee)

Comment 77

5 years ago
Oh, and I'm not sure what's up with the reftest crash - it doesn't *look* related, but..
Comment on attachment 8447539 [details] [diff] [review]
v8 - Sort arenas in order of increasing free space during finalization.

Review of attachment 8447539 [details] [diff] [review]:
-----------------------------------------------------------------

This is looking really good. Thanks for all the work. I'd still like to see one more version after you make these changes. The main changes I'm asking for are:
- Introduce an ArenaListSegment class.
- Get rid of deepCheck and let the list become unsorted after adoptArenas. Hopefully that will allow us to remove the extra bit in ArenaHeader?

::: js/src/gc/GCRuntime.h
@@ +642,5 @@
>      mozilla::DebugOnly<PRThread *>   lockOwner;
>  
>      GCHelperState helperState;
>  
> +    ArenaFreeSpaceHistogram arenaFreeSpaceHistogram;

Let's call this "incrementalSweepList". And please put a comment above it:
"During incremental sweeping, this field tracks the arenas being swept in sorted order."

::: js/src/jsgc.cpp
@@ +538,5 @@
> +    return nmarked;
> +}
> +
> +inline void
> +ArenaFreeSpaceHistogram::appendToArenaList(ArenaList &list, size_t thingsPerArena)

I think it would help to put some "paragraph breaks" (i.e., blank lines) in this function to separate the different pieces.

@@ +545,5 @@
> +                 thingsPerArena + 1 == numSizeBinsToUse_);
> +    // Check that the list we're adding on to is sorted.
> +    list.deepCheck();
> +    // Make sure that the list has no partially full arenas.
> +    JS_ASSERT(!*list.cursorp_);

You should be able to use isCursorAtEnd here.

@@ +547,5 @@
> +    list.deepCheck();
> +    // Make sure that the list has no partially full arenas.
> +    JS_ASSERT(!*list.cursorp_);
> +    // Append any full arenas |this| might have.
> +    if (list.head_) {

And !list.isEmpty() here.

@@ +552,5 @@
> +        // If !heads_[0], this will set heads_[0] to the list head.
> +        *tails_[0] = list.head_;
> +        tails_[0] = list.cursorp_;
> +    }
> +    // Link the tails up to each list head.

(Later on in the review, I suggest that we create an ArenaListSegment class that has |head| and |tail| fields.) We could have a local ArenaListSegment variable here. The ArenaListSegment class could have an append method that would append another ArenaListSegment. Then this function could just iterate over each bin and append its list to the local ArenaListSegment. And then at the end we'd add the local ArenaListSegment to |list|. I think that would be a lot easier to follow.
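For readers following along, this is roughly the shape being suggested (a hedged sketch with simplified types; the real class ended up with different names and details):

struct ArenaHeader { ArenaHeader *next; };

// A segment is a head pointer plus the address of the last |next| field.
// Keeping the tail as a pointer-to-pointer makes append O(1) and lets
// whole segments be chained without walking them.
struct ArenaListSegment {
    ArenaHeader *head;
    ArenaHeader **tailp; // == &head while the segment is empty

    void clear() { head = nullptr; tailp = &head; }
    bool isEmpty() const { return tailp == &head; }

    void append(ArenaHeader *aheader) {
        *tailp = aheader;
        tailp = &aheader->next;
    }

    void append(const ArenaListSegment &other) {
        if (other.isEmpty())
            return;
        *tailp = other.head;
        tailp = other.tailp;
    }
};

Note that because an empty segment's tailp points into the segment itself, segments must be reset rather than copied around while empty.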

@@ +595,5 @@
>  
>      while (ArenaHeader *aheader = *src) {
>          *src = aheader->next;
> +        size_t nMarked = aheader->getArena()->finalize<T>(fop, thingKind, thingSize);
> +        size_t nFree = thingsPerArena - nMarked;

I'd prefer these be "nmarked" and "nfree" to match other names. Also, please put an empty line after this line.

@@ +1336,5 @@
>          return false;
>  
> +    // Initialize free things histogram with its maximum capacity,
> +    // so that it can be used by all possible AllocKinds.
> +    arenaFreeSpaceHistogram.init(MaxThingsPerArena);

Why can't you just pass MaxThingsPerArena in to the constructor?

@@ +4359,5 @@
> +    // Build a sorted list of objects from the size bins.
> +    dest.appendToArenaList(arenaLists[thingKind], thingsPerArena);
> +
> +    // Reinitialize the GCRuntime-wide histogram for further use.
> +    dest.init(MaxThingsPerArena);

Can you move this line to the caller? I think it makes more sense there.

::: js/src/jsgc.h
@@ +417,5 @@
> +// array points to the first arena in each size bin, and the tails array tracks
> +// the address of the |next| field of the last arena in each size bin, allowing
> +// us to preserve the ordering of arenas as much as possible in constant time
> +// and quickly link up subsequent bins into a contiguous, sorted list.
> +class ArenaFreeSpaceHistogram

Let's call this SortedArenaList.

@@ +421,5 @@
> +class ArenaFreeSpaceHistogram
> +{
> +    size_t numSizeBinsToUse_;
> +    ArenaHeader *heads_[MaxThingsPerArena + 1];
> +    ArenaHeader **tails_[MaxThingsPerArena + 1];

Instead of these two arrays, how about we have a class called ArenaListSegment? It would have |head| and |tail| fields. Then this class would have an array of ArenaListSegments.

@@ +431,5 @@
> +        else
> +            numSizeBinsToUse_ = 0;
> +    }
> +
> +    void init(size_t thingsPerArena) {

Let's call this "reset" instead of init. It's similar to some other reset methods we have. It could just call a reset method on each ArenaListSegment.

@@ +442,5 @@
> +            tails_[i] = heads_ + i;
> +        }
> +    }
> +
> +    void placeInBin(ArenaHeader *aHeader, size_t bin) {

Let's call this "insert". It would call an insert method on the correct ArenaListSegment.

@@ +518,5 @@
>  #endif
>      }
>  
>      // This does checking involving all the arenas in the list.
>      void deepCheck() const {

Let's get rid of deepCheck and just use check.

@@ +632,5 @@
> +    // stored in |this|. Both input lists should be sorted. |simple| may have
> +    // any number of full arenas, but must have at most one non-full arena.
> +    // |complex| on the other hand may have any number of non-full arenas.
> +    // Either |simple| or |complex| may be the same as |this|.
> +    void joinIntoThisList(ArenaList &simple, ArenaList &complex) {

As long as we're not doing deepCheck, it doesn't really matter what we do with that one non-full Arena from adoptArenas. And aside from that, all the Arenas in simple will be full. To make this code simpler, can we just put all the arenas from |simple| right before the cursor in |complex|?
Attachment #8447539 - Flags: review?(wmccloskey) → feedback+
Would it be worth having telemetry counts for the arena size bins? It might help with detecting regressions in the future. Maybe in a follow-up bug?
It'd be good to have new measurements once all the fix-ups have been made, too.
(Assignee)

Comment 81

5 years ago
Thanks for the comments! Altogether they should make the patch a fair bit simpler.

(In reply to Hugh Nougher [:Hughman] from comment #79)
> Would it be worth having telemetry for counts for the arena size bins? It
> might help with detecting regressions in the future. Maybe in a follow-up
> bug?

Counts in what sense?

(1) The number of bins is determined by the number of things an arena can contain, which can be at most 255 right now (the patch adds a bunch of static asserts to ensure that this upper limit is correct).
(2) The number of arenas in a single bin can be anything up to the number of arenas in the list that we're sorting - but that's no different from right now; the amount of memory used by the bins themselves is constant.
(3) With this patch, we have one ArenaFreeSpaceHistogram / SortedArenaList on the heap per GCRuntime. The others are allocated on the stack, they're short-lived, and the peak amount should be limited by the number of helper threads.
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #81)
> Counts in what sense?

I suggest three telemetry probes: the mean, a lower percentile, and an upper percentile. Maybe the total number of arenas would also be good. I think these would be enough to see strange changes to the distribution of fullness of arenas. Do you agree?

For the note, this is what I thought before reading what could be recorded with telemetry:
> After sending the counts (eg: 30 with 0 free, 25 with 1 free, ..., 4 with
> 254 free), a graph or histogram of the data averaged over runs and/or
> time could be used to detect things like:
> - Why is the count of 254 free suddenly risen? (could this indicate a leak?)
> - Why is the free space in arenas rising? (eg: a graph may show it usually
> averaging 10 free but changed to 15 over a week)
(Assignee)

Comment 83

5 years ago
I think it would be hard to get meaningful numbers out of that. Every zone has its own allocator, and each allocator has an ArenaList for each AllocKind, which will get sorted separately. We'd have to store statistics for each sort, turn them into percentages (since different AllocKinds have a different number of things per arena), and sum over them - and some of those lists may only have one or two arenas in them.

In addition, different workloads give *vastly* different results. For instance, using Nicholas' "Visualize Arenas" patch, you can see that one test of Octane generates a huge number of full arenas, whereas another generates a huge number of non-full ones. Of course, telemetry probes happen on idle, not during benchmarks, but I'm skeptical that the numbers would be meaningful.

Perhaps once we have a compacting GC, we could use the number of partially full arenas to see if we're defragmenting the arenas effectively. But without that, I'm afraid the numbers will vary wildly depending on what webpages the user happens to have loaded.
Ah! My apologies - I thought arenas were more global, or dealt with globally. I agree with you that it's a little pointless given your arguments.
(Assignee)

Comment 85

5 years ago
Thanks for all the feedback! This should address all the comments, and does a bit more refactoring besides. In particular, I added a copy constructor and assignment operator overload to ArenaList, and an ArenaList cast operator overload to SortedArenaList. I know operator overloads can easily start to obfuscate what's going on, but in this case I think they make everything a fair bit cleaner without sacrificing clarity. Let me know what you think.

I haven't retested MemBench yet with this patch, but I did run (a slightly older version of) it through try.

Debug: https://tbpl.mozilla.org/?tree=Try&rev=f22065c0814e
Opt:   https://tbpl.mozilla.org/?tree=Try&rev=21dd20c11dbe

The small changes I made after that should be safe, but just in case here's a single platform push: https://tbpl.mozilla.org/?tree=Try&rev=ddb98870f597

(In reply to Bill McCloskey (:billm) from comment #78)
> - Get rid of deepCheck and let the list become unsorted after adoptArenas.
> Hopefully that will allow us to remove the extra bit in ArenaHeader?

Done. And yes, check() doesn't need the bit. Since deepCheck was already commented out originally, and would never be able to consistently succeed with or without this patch, I just took it out altogether.

> > +    ArenaFreeSpaceHistogram arenaFreeSpaceHistogram;
> 
> Let's call this "incrementalSweepList". And please put a comment above it:
> "During incremental sweeping, this field tracks the arenas being swept in
> sorted order."

Done. I went with a slightly longer comment since it would have spanned multiple lines anyway.

> > +ArenaFreeSpaceHistogram::appendToArenaList(ArenaList &list, size_t thingsPerArena)
> 
> I think it would help to put some "paragraph breaks" (i.e., blank lines) in
> this function to separate the different pieces.

I didn't end up doing this, because after all the refactoring it was so simple that it seemed unnecessary.

> > +    JS_ASSERT(!*list.cursorp_);
> 
> You should be able to use isCursorAtEnd here.

Done, and elsewhere. I also added an isCursorAtHead to check the opposite condition.

> > +    if (list.head_) {
> 
> And !list.isEmpty() here.

Done.

> Let's call this SortedArenaList.

> Instead of these two arrays, how about we have a class called
> ArenaListSegment? It would have |head| and |tail| fields. Then this class
> would have an array of ArenaListSegments.

> We could have a local ArenaListSegment
> variable here. The ArenaListSegment class could have an append method that
> would append another ArenaListSegment. Then this function could just iterate
> over each bin and append its list to the local ArenaListSegment. And then at
> the end we'd add the local ArenaListSegment to |list|. I think that would be
> a lot easier to follow.

Done. I went with SortedArenaListSegment since it isn't really meant for general use, and it fit well with SortedArenaList. And you were right - this cleaned things up nicely :)

> > +        size_t nMarked = aheader->getArena()->finalize<T>(fop, thingKind, thingSize);
> > +        size_t nFree = thingsPerArena - nMarked;
> 
> I'd prefer these be "nmarked" and "nfree" to match other names. Also, please
> put an empty line after this line.

Done.

> > +    arenaFreeSpaceHistogram.init(MaxThingsPerArena);
> 
> Why can't you just pass MaxThingsPerArena in to the constructor?

No reason, just a thinko. I've made the constructor default to MaxThingsPerArena now as the safe default, so no explicit initialization is needed.

> > +    // Reinitialize the GCRuntime-wide histogram for further use.
> > +    dest.init(MaxThingsPerArena);
> 
> Can you move this line to the caller? I think it makes more sense there.

Done. I also made this only clear the fields that were actually used during finalization for that particular AllocKind, since the others should still be in their vanilla state (and we assert if we try to use an index > thingsPerArena_).

> > +    void init(size_t thingsPerArena) {
> 
> Let's call this "reset" instead of init. It's similar to some other reset
> methods we have. It could just call a reset method on each ArenaListSegment.

Done. I went with "clear" inside SortedArenaListSegment for parity with ArenaList.

> > +    void placeInBin(ArenaHeader *aHeader, size_t bin) {
> 
> Let's call this "insert". It would call an insert method on the correct
> ArenaListSegment.

I went with "insertAt" for SortedArenaList, and "insert" in SortedArenaListSegment.

> As long as we're not doing deepCheck, it doesn't really matter what we do
> with that one non-full Arena from adoptArenas. And aside from that, all the
> Arenas in simple will be full. To make this code simpler, can we just put
> all the arenas from |simple| right before the cursor in |complex|?

In this patch I prepend the non-full arena using insertAtCursor, but I still append the full arenas. The reasoning is that in each case, the arenas in the simple list should on average be allocated later - so to promote freeing them as soon as possible, we want them as close to the end of the list as the sorting order allows.

Though, regardless of the reasoning, I think the function is simple enough now that there would be little to no gain from switching things around.
Attachment #8447539 - Attachment is obsolete: true
Attachment #8447539 - Flags: review?(terrence)
Attachment #8448675 - Flags: review?(wmccloskey)
Attachment #8448675 - Flags: review?(terrence)
(Assignee)

Comment 86

5 years ago
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #85)
> The small changes I made after that should be safe, but just in case here's
> a single platform push: https://tbpl.mozilla.org/?tree=Try&rev=ddb98870f597

Welp, so much for that. I'll look into this in a bit, should be an easy fix.
(Reporter)

Comment 87

5 years ago
Comment on attachment 8448675 [details] [diff] [review]
v9 - Sort arenas in order of increasing free space during finalization.

Bill's review is more than adequate.
Attachment #8448675 - Flags: review?(terrence)
Comment on attachment 8448675 [details] [diff] [review]
v9 - Sort arenas in order of increasing free space during finalization.

Review of attachment 8448675 [details] [diff] [review]:
-----------------------------------------------------------------

This looks really nice. I think it's actually easier to read than what we have now. Thanks so much Emanuel!

::: js/src/gc/Heap.h
@@ +508,5 @@
>      bool hasFreeThings() const {
>          return !firstFreeSpan.isEmpty();
>      }
>  
> +    size_t numFreeThings(const size_t ThingSize) const {

I don't think we need this anymore.

::: js/src/jsgc.cpp
@@ +260,5 @@
>  
> +// The minimum size of a GC thing, in bytes.
> +static const size_t MinThingSize = 16;
> +
> +static_assert(MinThingSize == SortedArenaList::MinThingSize,

Why do we need two MinThingSize variables?

@@ +540,5 @@
> +}
> +
> +SortedArenaList::operator ArenaList()
> +{
> +    // Link the non-empty bin tails up to the non-empty bin heads.

I was hoping we could have a local SortedArenaListSegment variable that we'd move non-full Arenas into. And we could have an append method on SortedArenaListSegment that could take another SortedArenaListSegment. That way we could just do something like this:

SortedArenaListSegment nonFullSegment;
for (size_t i = 1; i <= thingsPerArena_; i++) {
    nonFullSegment.append(bins[i]);
    bins[i].clear();
}

And maybe we could add a constructor to ArenaList that would take a full segment and a non-full segment. Then we could just say |return ArenaList(bins[0], nonFullSegment)|.

The constructor would work like this I guess:

ArenaList::ArenaList(const SortedArenaListSegment &full, const SortedArenaListSegment &nonFull)
{
    if (full.isEmpty()) {
        head_ = nonFull.head;
        cursorp_ = &head_;
    } else {
        head_ = full.head;
        *full.tail = nonFull.head;
        // The cursor must reference the first non-full arena, which is
        // whatever the last full arena's |next| field now points at.
        cursorp_ = full.tail;
    }
}

@@ +551,5 @@
> +    }
> +    *tailAt(tailIndex) = nullptr;
> +    // Initialize an ArenaList to represent a flattened version of |this|.
> +    ArenaList result;
> +    result.head_ = headAt(0);

This seems wrong. What if the first bin was empty?

::: js/src/jsgc.h
@@ +585,3 @@
>      }
> +
> +    bool isTailAtHead() const {

Any reason not to call this isEmpty()?

@@ +599,5 @@
> +static_assert(ArenaSize <= 4096, "When increasing the Arena size, please consider "\
> +                                 "how this will affect the size of a SortedArenaList.");
> +
> +// A class that holds arenas in sorted order by appending arenas to specific
> +// size bins. SortedArenaLists can be cast to an ArenaList when sorting is done.

Please say "converted" or "coerced" instead of "cast". "Casts" sort of suggests that they have the same representation.

@@ +642,5 @@
> +        JS_ASSERT(nfree <= thingsPerArena_);
> +        bin[nfree].append(aheader);
> +    }
> +
> +    operator ArenaList();

This is nice, but I think I'd prefer it to be an actual function call. That way it's clear in the caller that we're doing something interesting and a little bit expensive.
Attachment #8448675 - Flags: review?(wmccloskey) → review+
(Assignee)

Comment 89

5 years ago
I finally figured out what was causing those problems with the final try run in comment #85. It turns out that the fromList in adoptArenas *can* legitimately have more than one non-full arena in it.

The source of these lists is parallel JS [1], where we scan through the arenas one at a time, purging the freeLists on each loop to avoid marking and allocating into an arena at the same time. This will purge the freeList arena at the end of the list, forcing us to allocate out of a new arena, which will then be purged in adoptArenas, potentially leaving two (or more? I'm not sure how much we can allocate during marking) non-full arenas at the end of the list. I think this only shows up during nursery evacuating GCs, since otherwise we just move the marked memory over to the other nursery directly.

For this bug, I think I will just iterate over the non-full arenas in ArenaList::join and prepend the whole range. There shouldn't be more than 2 or 3 non-full arenas in there.

[1] http://hg.mozilla.org/mozilla-central/annotate/ac6960197eb6/js/src/gc/ForkJoinNursery.cpp#l419
(Assignee)

Comment 90

5 years ago
Thanks for the reviews! Applied the above and fixed the review comments, carrying forward r=billm.

A final try push: https://tbpl.mozilla.org/?tree=Try&rev=c2f3ac621022

I've checked with the Visualize Arenas patch that this still works, going to do some tests with Membench and Octane to see how performance and memory usage look.

> > +    size_t numFreeThings(const size_t ThingSize) const {
> 
> I don't think we need this anymore.

Ah, you're right - removed.

> > +static_assert(MinThingSize == SortedArenaList::MinThingSize,
> 
> Why do we need two MinThingSize variables?

Leftover from an earlier scheme. I moved this static assert into jsgc.h and made the ones below use SortedArenaList::MinThingSize instead.

> I was hoping we could have a local SortedArenaListSegment variable that we'd
> move non-full Arenas into. And we could have an append method on
> SortedArenaListSegment that could take another SortedArenaListSegment. That
> way we could just do something like this:

Great suggestion. It actually turned out even cleaner than this: I gave ArenaList a constructor that takes the first SortedArenaListSegment, which contains all the information it needs (the segment's tail is the ArenaList's cursor). I gave SortedArenaListSegment a "linkTo" method which just does |*tailp_ = aheader;| - it's the same length, but probably more obvious. I also added some comments describing each function.

> > +    result.head_ = headAt(0);
> 
> This seems wrong. What if the first bin was empty?

If the first bin is empty, the segment tail points at the segment head. Therefore when we link up the segment tails to subsequent segment heads, the head of the first segment will end up pointing to the same arena as the head of the first non-empty segment. So the head and tail of the first segment are the head and cursor of the ArenaList we want to return (unless the list is empty, in which case segment[0].isEmpty() and we point the ArenaList's cursor at *its* head).
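To make the linking step concrete, here is a standalone sketch (simplified types; the real code's names differ). Empty bins are skipped so that each non-empty bin links straight to the next non-empty one, and an empty bins[0] falls out naturally because its tail aliases its head:

#include <cstddef>

struct ArenaHeader { ArenaHeader *next; };

struct Segment {
    ArenaHeader *head;   // first arena in the bin, or nullptr
    ArenaHeader **tailp; // &head when empty, else &lastArena->next
};

// Flattens |bins| (ordered by increasing free count) into one list and
// returns its head; bins[0].tailp then serves as the cursor.
ArenaHeader *
flatten(Segment *bins, size_t nbins)
{
    size_t tailIndex = 0;
    for (size_t headIndex = 1; headIndex < nbins; headIndex++) {
        if (bins[headIndex].head) {
            *bins[tailIndex].tailp = bins[headIndex].head;
            tailIndex = headIndex;
        }
    }
    *bins[tailIndex].tailp = nullptr; // terminate the final segment
    return bins[0].head;
}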

> > +    bool isTailAtHead() const {
> 
> Any reason not to call this isEmpty()?

Fixed. I called it this for parity with ArenaList, but in context isEmpty() makes a lot more sense.

> This is nice, but I think I'd prefer it to be an actual function call. That
> way it's clear in the caller that we're doing something interesting and a
> little bit expensive.

Changed to SortedArenaList::toArenaList() - it's indeed clearer that way.
Attachment #8448675 - Attachment is obsolete: true
Attachment #8450592 - Flags: review+
(Assignee)

Comment 91

5 years ago
With v10 I still see a regression of ~1000 points on Splay Latency when I run Octane in the browser. Running Splay by itself in the shell the regression is smaller (~400 points), but it's definitely there. In comment #53 I wondered if the regression results from performing more calls to refillFreeList, but I saw no difference in the total number of calls.

Even if the total number of calls is the same, however, the distribution might not be. So I instrumented the number of calls per millisecond, running 1400 iterations of Splay in the shell in Octane's deterministic mode. I've attached graphs showing the distribution before and after the patch, across three runs for each.

As you can see, we consistently have a lot more spikes in the number of calls after sorting, with the maximum number of calls per millisecond going from less than 700 to almost 1200. The total number of calls is virtually unchanged, coming in around ~440k. The increased variance in the number of VM calls per millisecond might well explain the lower Splay Latency score (though there might just be a correlation).

As a result, I'd like to suggest making a fastpath in the JIT for getting the next free list, which would fall through if we need to allocate a new arena. We could also fastpath allocating a new arena from an existing chunk, but the more we fastpath the more work we do if the fast path fails (unless we refactor things to take into account that the fast paths have already happened).

Unfortunately, I don't really know anything about how to write this fastpath. Getting the next available arena should simply be a case of calling arenaAfterCursor() and moveCursorPast() on the current ArenaList, but we also do some checks for GC in refillFreeList and allocateFromArenaInline that we'd probably have to keep (if only to bail if they fail). Jan, does that sound reasonable?
Flags: needinfo?(jdemooij)
(Assignee)

Comment 93

5 years ago
Oh, and I'm happy to do the work, but it might take me some time to get started with the JITs :)
(Assignee)

Comment 94

5 years ago
By the way, reading scrollback on IRC from yesterday between billm and shu:

Purging does just clear the free list - the problem is that when we get a new free list, we move the ArenaList cursor past the arena we're allocating from. Since ArenaHeaders are only singly linked, that means we can't make the purged arena available for allocation again except by stepping through the entire list - which may be long.

If I'm reading the code right, though, PJS probably doesn't have to purge on every iteration. In a non-evacuating GC, we'll just copy the marked objects into the ForkJoinNursery::newspace, which is infallible [1]. For evacuating GCs, we can probably just add a small function to ArenaLists to check if a particular ArenaHeader matches the one for the current free list, and break out of the loop before we try to apply ArenaCellIter to it. Then we could just move the purge() call outside the loop, I think.

[1] http://hg.mozilla.org/mozilla-central/annotate/fdf4539783a3/js/src/gc/ForkJoinNursery.cpp#l764
(Assignee)

Comment 95

5 years ago
Like so. Only lightly tested, seems to pass the PJS jit-tests just fine. Terrence, can you review this?
Attachment #8450795 - Flags: review?(terrence)
(Assignee)

Comment 96

5 years ago
Comment on attachment 8450795 [details] [diff] [review]
Avoid purging the free lists every iteration in PJS by checking if the current arena is in use.

Moved to bug 1034225.
Attachment #8450795 - Attachment is obsolete: true
Attachment #8450795 - Flags: review?(terrence)
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #91)
> As a result, I'd like to suggest making a fastpath in the JIT for getting
> the next free list, which would fall through if we need to allocate a new
> arena. We could also fastpath allocating a new arena from an existing chunk,
> but the more we fastpath the more work we do if the fast path fails (unless
> we refactor things to take into account that the fast paths have already
> happened).
> 
> Unfortunately, I don't really know anything about how to write this
> fastpath. Getting the next available arena should simply be a case of
> calling arenaAfterCursor() and moveCursorPast() on the current ArenaList,
> but we also do some checks for GC in refillFreeList and
> allocateFromArenaInline that we'd probably have to keep (if only to bail if
> they fail). Jan, does that sound reasonable?

Sounds interesting, it might be good to prototype this and see how much it helps. Also, are we sure the fast path failure is causing the performance regression? I saw you were asking about macro-assembler stuff, let me know if you need any help with that.
Flags: needinfo?(jdemooij)
(Assignee)

Comment 98

5 years ago
I've been plugging away at this over the weekend. It doesn't work, and I'm starting to feel a bit stumped, so maybe this is a good time to ask for feedback. The actual JIT changes are about 1/3 of the way down the patch.

This crashes trying to get the clasp_ of a JSObject that appears to have a null type_->value during typed arena finalization on a GC helper thread. I can think of about 4 reasons for this:

(1) The macro assembler code I wrote to get the next arena and update the free list is just wrong. The generated code looks alright to me, but my knowledge of assembly is shallow at best. For the record, this is the C++ equivalent of what I'm trying to do (ignoring the 'addressOf' stuff and C++ constraints):

// Given lists == zone()->allocator.arenas
ArenaList &arenaList = lists.arenaLists[allocKind];
FreeList &freeList = lists.freeLists[allocKind];
ArenaHeader *arenaToUse = *arenaList.cursorp_;
if (!arenaToUse)
    bail;
arenaList.cursorp_ = &arenaToUse->next;
freeList.head.first = arenaToUse->arenaAddress() +
                      arenaToUse->firstFreeSpan.firstOffset_;
arenaToUse->firstFreeSpan.firstOffset_ = 0;
freeList.head.last = arenaToUse->arenaAddress() +
                     arenaToUse->firstFreeSpan.lastOffset_;
arenaToUse->firstFreeSpan.lastOffset_ = 0;

(2) A GC is starting while this code runs, so that zone->wasGCStarted() == true by the end and I'm not doing the GCRuntime::arenaAllocatedDuringGC() dance. I wouldn't expect this to happen so consistently, but maybe it does. I haven't looked into this yet because it's a bit of a pain to deal with, mostly because of the bitfield in ArenaHeader.

(3) We're not checking the BackgroundFinalizeState in a thread-safe way. I don't think this matters on a strongly ordered platform like x86 (although we'd have to deal with it somehow on ARM) because the release/acquire barriers we insert should only affect compiler optimizations, not insert actual instructions - but maybe this is tripping things up somehow.

(4) Something else to do with race conditions that I haven't thought of.

(In reply to Jan de Mooij [:jandem] from comment #97)
> Also, are we sure the fast path failure is causing the performance
> regression?

Not 100% sure, no - but see the graphs above. I had hoped that this would be relatively quick and easy to confirm, but it's proving to be pretty challenging for me :(
Attachment #8452569 - Flags: feedback?(jdemooij)
(Assignee)

Comment 99

5 years ago
Oh, maybe I was looking too hard for a way to disable the other threads and should just try a non-threadsafe shell. That might confirm if the code is just wrong ((1) above).
(Assignee)

Comment 100

5 years ago
Okay, so on a non-threadsafe build with Clang on Linux, I get "Assertion failure: first <= last, at js/src/gc/Heap.h:251", so it looks like my code is doing something wrong in setting the FreeSpan members, or before that. I'm glad there's something to fix there, but I'll need help figuring out what it is :)
(Assignee)

Comment 101

5 years ago
Ah, interesting: building a non-threadsafe shell on Windows, it fails trying to dereference the null type_->value again, except without the GC helper thread in the call stack. So different compilers fail in different places, but regardless it doesn't seem like a threading issue.
Comment on attachment 8452569 [details] [diff] [review]
WIP - Inline refillFreeList for the common case

Review of attachment 8452569 [details] [diff] [review]:
-----------------------------------------------------------------

Have you tried this patch without the JIT changes, to make sure it's really the JIT code that's misbehaving?

Below are some issues that could be responsible for the crashes. If it still doesn't work I can take a closer look.

::: js/src/jit/IonMacroAssembler.cpp
@@ +517,5 @@
>      // no room remaining in the span, we bail to finish the allocation. The
>      // interpreter will call |refillFreeLists|, setting up a new FreeSpan so
>      // that we can continue allocating in the jit.
>      loadPtr(AbsoluteAddress(zone->addressOfFreeListFirst(allocKind)), result);
> +    branchPtr(Assembler::GreaterThan, AbsoluteAddress(zone->addressOfFreeListLast(allocKind)), result, &success);

This should be Assembler::Above (unsigned comparison) instead of Assembler::GreaterThan (signed comparison).

@@ +542,5 @@
> +    branch32(Assembler::NotEqual, result, Imm32(0), fail);
> +#endif
> +    // Bail if GCRuntime::heapState != js::Idle
> +    load32(AbsoluteAddress(runtime->addressOfHeapState()), result);
> +    branch32(Assembler::NotEqual, result, Imm32(js::Idle), fail);

HeapState is an enum so it's not guaranteed to be 32 bits. You could use the MOZ_ENUM_TYPE macro in mfbt/TypedEnum.h, but unfortunately that doesn't do anything with older compilers. You could also change the heapState field in GCRuntime to uint32_t...

@@ +545,5 @@
> +    load32(AbsoluteAddress(runtime->addressOfHeapState()), result);
> +    branch32(Assembler::NotEqual, result, Imm32(js::Idle), fail);
> +    // Bail if Zone::gcState_ != NoGC
> +    load32(AbsoluteAddress(zone->addressOfGCState()), result);
> +    branch32(Assembler::NotEqual, result, Imm32(JS::Zone::NoGC), fail);

Same here.

@@ +548,5 @@
> +    load32(AbsoluteAddress(zone->addressOfGCState()), result);
> +    branch32(Assembler::NotEqual, result, Imm32(JS::Zone::NoGC), fail);
> +    // Bail if ArenaLists::backgroundFinalizeState[allocKind] != BFS_DONE
> +    load32(AbsoluteAddress(zone->addressOfBackgroundFinalizeState(allocKind)), result);
> +    branch32(Assembler::NotEqual, result, Imm32(js::gc::ArenaLists::BFS_DONE), fail);

And here.
Attachment #8452569 - Flags: feedback?(jdemooij)
(Assignee)

Comment 103

5 years ago
Thanks for the advice! I actually looked at the generated code in a debugger earlier and the problem turned out to be something much more obvious, but the things you mentioned will probably be important later.

So, it turns out I'm just missing a step. MacroAssembler::freeSpanAllocate() currently just deals with the absolute simplest case: bump allocation. But the FreeList may be fragmented, and we may simply need to move from one FreeSpan to the next. FreeList::allocate() deals with that possibility, but the code I added does not, which is probably causing all sorts of weirdness.
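For reference, the missing case analysis looks roughly like this (a simplified standalone sketch, not the real FreeList code; the trick is that a span's last free cell doubles as storage for the next span's descriptor):

#include <cstdint>
#include <cstddef>

struct FreeSpan {
    uintptr_t first; // address of the first free thing in the span
    uintptr_t last;  // address of the last free thing in the span
};

void *
allocate(FreeSpan &span, size_t thingSize)
{
    uintptr_t thing = span.first;
    if (thing < span.last) {
        // Bump allocation within the current span - the only case the
        // JIT code above handled.
        span.first = thing + thingSize;
        return reinterpret_cast<void *>(thing);
    }
    if (thing == span.last) {
        // The span has exactly one thing left; load the next span's
        // descriptor out of that cell before handing the cell out.
        span = *reinterpret_cast<FreeSpan *>(thing);
        return reinterpret_cast<void *>(thing);
    }
    return nullptr; // free list exhausted; the caller must refill
}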
(Assignee)

Comment 104

5 years ago
Finally got this running - turns out it avoids about 25% of the calls to refillFreeList on Splay. Unfortunately, it doesn't change the performance at all :(

I also redid an earlier experiment where I sorted things from 50% full to 0%/100% full, and couldn't reproduce the good numbers I saw earlier. So maybe this *is* just making GC slightly slower, and I've been chasing a red herring.

I can try to confirm this, but there's not much I can do about it with the current approach. Alternatively I could try to shave off some time elsewhere, perhaps just by tuning some GC settings - but I'm not sure where to start.
We could try landing this and see if the benchmark numbers are actually affected.

If that doesn't work, we can turn off sorting during benchmarks. We have a "high-frequency" GC mode that should be active during splay and almost never during browsing, so it's pretty easy to tell when to turn off sorting. Having code around for sorting and not sorting would be unfortunate, but maybe we'll just have to do that.
(Assignee)

Comment 106

5 years ago
Yeah, the numbers in the shell seem very close, and the browser is pretty noisy. I'd still like to pin down exactly where the slowdown is coming from - I think I'll do some measurements to see if average GC time does get longer, and by how much (but I'll save that for tomorrow).
(Assignee)

Comment 107

5 years ago
Seems I missed a spot. When we get to forwardFromTenured(), we might not be allocating from the fromSpace anymore, so we might not need to purge. This adds a small function, ForkJoinNursery::purgeFreeListInsideFromSpace(), which purges the current free list *only* if said free list is located in fromSpace. With this, I no longer see any cases where the fromList has more than 1 partially full arena, allowing me to simplify ArenaList::join().
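The check itself is tiny; a hypothetical sketch of its shape (names here are assumptions, not the real ForkJoinNursery API):

#include <cstdint>

struct Span { uintptr_t start, end; };

// Purge only when the arena backing the current free list lies inside
// fromSpace, i.e. only when it is actually about to be evacuated.
inline bool
shouldPurgeFreeList(Span fromSpace, uintptr_t freeListArenaAddr)
{
    return freeListArenaAddr >= fromSpace.start &&
           freeListArenaAddr < fromSpace.end;
}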

This passes jit-tests locally (including the slow ones). Here's a try run including part 1: https://tbpl.mozilla.org/?tree=Try&rev=4d4aae21b974
Attachment #8454211 - Flags: review?(lhansen)
(Assignee)

Comment 108

5 years ago
Carrying forward r=billm once more. Compared to v10, this just simplifies ArenaList::join() again (and adds back the assertion ensuring that the 'simple' list has no more than one non-full arena).
Attachment #8450592 - Attachment is obsolete: true
Attachment #8452569 - Attachment is obsolete: true
Attachment #8454212 - Flags: review+
(Assignee)

Updated

5 years ago
Attachment #8454212 - Attachment description: Part 1 v1: Sort arenas in order of increasing free space during finalization. → Part 1 v11: Sort arenas in order of increasing free space during finalization.
(Assignee)

Comment 109

5 years ago
Since we don't actually insert the partially filled arenas in their sorted positions anymore in adoptArenas, there's no reason to use a temporary list instead of inserting them directly. As that's also the only place where we had to join two non-full lists (and where the complex list was the target), I turned ArenaList::join into ArenaList::insertListWithCursorAtEnd. Complexity be gone!
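The resulting operation is just a pointer splice at the cursor. Roughly (simplified types; it assumes |other|'s cursor sits at its end, as the name implies):

struct ArenaHeader { ArenaHeader *next; };

struct ArenaList {
    ArenaHeader *head;
    ArenaHeader **cursorp; // address of the link to the first non-full arena

    // Splice all of |other| in at the cursor; since |other|'s cursor is at
    // its end, the cursor of |this| ends up after the inserted arenas.
    void insertListWithCursorAtEnd(ArenaList &other) {
        if (!other.head)
            return;
        *other.cursorp = *cursorp; // tail of |other| -> rest of |this|
        *cursorp = other.head;     // cursor slot -> head of |other|
        cursorp = other.cursorp;
    }
};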
Attachment #8454212 - Attachment is obsolete: true
Attachment #8454527 - Flags: review+
(Assignee)

Comment 110

5 years ago
Comment on attachment 8454211 [details] [diff] [review]
Part 0: Remove the final source of unnecessary purging in PJS.

Part 1 v12 no longer depends on this, and I'd like to change the stopping condition of the inner loop to do a fromSpace check instead (which seems less ad hoc to me), so let's move this to its own bug (to be filed).
Attachment #8454211 - Attachment is obsolete: true
Attachment #8454211 - Flags: review?(lhansen)
(Assignee)

Updated

5 years ago
Attachment #8454527 - Attachment description: Part 1 v12: Sort arenas in order of increasing free space during finalization. → v12: Sort arenas in order of increasing free space during finalization.
(Assignee)

Comment 111

5 years ago
I'd like to try landing this and see how it does on AWFY. As billm said, if it causes a meaningful regression we can consider our options, but the patch itself should be fine.
Keywords: checkin-needed
Emanuel, did you get any numbers on memory usage reduction in the end? That was the original motivation :)
(Assignee)

Comment 114

5 years ago
Er, yes! I got a bit side-tracked by the performance investigations. I'll see if I can reproduce the same improvements as earlier.
(Assignee)

Comment 115

5 years ago
Alright, I ran Membench 5 times without and 5 times with the patch again, here's the results:

Before:
  (174.30 + 185.35 + 180.22 + 178.14 + 181.56) / 5 = 179.914

After:
  (171.08 + 177.68 + 168.55 + 171.06 + 172.55) / 5 = 172.184

So applying the patch seems to decrease the total unused-gc-things by anywhere up to 14 MiB, with a mean improvement of around 7 MiB (though the data is still very noisy).

Also, early numbers on AWFY look pretty much unaffected!
Blocks: 1037772
> Alright, I ran Membench 5 times without and 5 times with the patch again,
> here's the results:
> 
> Before:
>   (174.30 + 185.35 + 180.22 + 178.14 + 181.56) / 5 = 179.914
> 
> After:
>   (171.08 + 177.68 + 168.55 + 171.06 + 172.55) / 5 = 172.184

Note to self: as per comment 39, this is the "unused-gc-things" value measured at the end of Membench execution.

Thanks for the new measurements.
https://hg.mozilla.org/mozilla-central/rev/0d237b824f96
Status: ASSIGNED → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla33
Backed out for causing bug 1037750 and bug 1037772 per IRC discussion with ehoogeveen.
https://hg.mozilla.org/mozilla-central/rev/bf09d63db425
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: mozilla33 → ---
Depends on: 1037750
(Assignee)

Updated

5 years ago
Attachment #8454527 - Attachment description: v12: Sort arenas in order of increasing free space during finalization. → Part 1 v12: Sort arenas in order of increasing free space during finalization.
Depends on: 1038396
(Assignee)

Comment 120

5 years ago
Argh, I can't believe how long it took me to track this down. All my theories from bug 1037772 turned out to be wrong: the logic in the previous version of this patch was just broken! Somehow the problem only showed up incredibly rarely; thankfully gkw's unreduced testcase from the sec-sensitive bug allowed me to debug this locally, but it still took me all night to figure out where things were going wrong.

I'm pretty sure this is correct now, but here's a final try run on both parts: https://tbpl.mozilla.org/?tree=Try&rev=f0a4d0879011
Attachment #8455593 - Attachment is obsolete: true
Attachment #8455593 - Flags: review?(wmccloskey)
Attachment #8455941 - Flags: review?(wmccloskey)
Comment on attachment 8455941 [details] [diff] [review]
Part 2 v2: Let ArenaLists track swept arenas between slices and extend ArenaIter to walk incrementally swept arenas as well.

Review of attachment 8455941 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jsgc.cpp
@@ +4367,5 @@
>                                 SortedArenaList &sweepList)
>  {
> +    if (!FinalizeArenas(fop, &arenaListsToSweep[thingKind], sweepList, thingKind, sliceBudget)) {
> +        incrementalSweptArenaKind = thingKind;
> +        incrementalSweptArenas = sweepList.toArenaList();

Is this right? toArenaList will modify the arena list and put everything in bin 0 while also leaving the heads of the other bins as they are. I would think that sweepList would be unusable after calling toArenaList().

Would you mind trying out the implementation for toArenaList that I suggested in comment 88? It has the advantage of not modifying the SortedArenaList, which I think is what we need here.
Attachment #8455941 - Flags: review?(wmccloskey)
(Assignee)

Comment 122

5 years ago
(In reply to Bill McCloskey (:billm) from comment #121)
> Is this right? toArenaList will modify the arena list and put everything in
> bin 0 while also leaving the heads of the other bins as they are. I would
> think that sweepList would be unusable after calling toArenaList().

Each tail pointer holds the address of the |next| field of the last arena in each bin (or the address of the head of the bin if the bin is empty). toArenaList doesn't change the tail pointers: it changes what the |next| field of the arena at the end of each bin *points* to.

We can change this as many times as we like without making the SortedArenaList invalid, because we use the tails *themselves* when appending a new arena to a bin, which will always be valid. This is similar to how the cursor of ArenaList works, except that we don't care what arena might follow a tail pointer.

Does that clear things up? I found it pretty hard to wrap my head around this double pointer arithmetic (see e.g. the penultimate chunk of Terrence's comment #35), but it really does save a lot of bookkeeping.
Flags: needinfo?(wmccloskey)
(Assignee)

Comment 123

5 years ago
Perhaps I should make this additional clarification (still figuring out how to explain all this): toArenaList essentially turns every tail pointer of the SortedArenaList into a cursor pointer - however, SortedArenaList doesn't care whether its tail pointers are *also* cursor pointers, as long as they still point to the same thing.

The ArenaList returned by toArenaList simply uses the *first* cursor pointer in SortedArenaList, since that is the cursor sitting between the full and non-full arenas.
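Put in code terms, insertion only ever writes *through* a tail pointer, so whatever a previous flatten left behind in the terminal |next| field is simply overwritten. A minimal sketch (simplified types):

struct ArenaHeader { ArenaHeader *next; };

struct Bin {
    ArenaHeader *head;
    ArenaHeader **tailp; // &head when empty, else &lastArena->next
};

void
append(Bin &bin, ArenaHeader *aheader)
{
    *bin.tailp = aheader;       // clobbers any stale link from toArenaList
    bin.tailp = &aheader->next; // the tail itself never goes stale
}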
(Assignee)

Comment 124

5 years ago
That Valgrind red on try is part 2 by the way. It didn't like that I failed to initialize incrementalSweptArenaKind. With that fixed: https://tbpl.mozilla.org/?tree=Try&rev=0a74889077c8
Comment on attachment 8455941 [details] [diff] [review]
Part 2 v2: Let ArenaLists track swept arenas between slices and extend ArenaIter to walk incrementally swept arenas as well.

Review of attachment 8455941 [details] [diff] [review]:
-----------------------------------------------------------------

OK, I guess you're right.
Attachment #8455941 - Flags: review+
Flags: needinfo?(wmccloskey)
(Assignee)

Comment 126

5 years ago
This version moves SortedArenaList::toArenaList back into the class body and moves SortedArenaList below ArenaList, so everything is in one place and so its description follows the big comment above ArenaList. In addition I beefed up some of the comments surrounding toArenaList to hopefully make part 2 more obviously correct.
Attachment #8454527 - Attachment is obsolete: true
Attachment #8457682 - Flags: review+
(Assignee)

Comment 127

5 years ago
Carrying forward r=billm. Valgrind on try rightly pointed out that incrementalSweptArenaKind could be used uninitialized, so this version initializes it to FINALIZE_LIMIT (and changes it to an unsigned int, since FINALIZE_LIMIT isn't part of the AllocKind enum).
Attachment #8455941 - Attachment is obsolete: true
Attachment #8457687 - Flags: review+
(Assignee)

Comment 128

5 years ago
Builds are green on try, let's try landing this again: https://tbpl.mozilla.org/?tree=Try&rev=61a875305ed4
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/bbfa802abc93
https://hg.mozilla.org/mozilla-central/rev/42d193544465
Status: REOPENED → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla33
Blocks: 1041307