Closed Bug 445780 Opened 16 years ago Closed 14 years ago

Page map needs to be sparse

Categories

(Tamarin Graveyard :: Garbage Collection (mmGC), defect, P2)

Tracking

(Not tracked)

RESOLVED FIXED
Q3 11 - Serrano

People

(Reporter: treilly, Assigned: pnkfelix)

References

Details

(Whiteboard: 64-bit, Tracker)

Attachments

(2 obsolete files)

SPAM puts the regions in a sorted list and does a binary search; then a map per region can be maintained. It'd be easy for MMgc to do this as well (in fact I think I'll make a bug). With a default region size of 16MB, searching should be quick.
Why not make a 2-level or 3-level tree? This would replace the binary search with an array lookup.
Sounds interesting, care to expound?
Er, sorry if that was kind of abstruse. What I meant by "2-level or 3-level tree" was: instead of using a dense, sorted array of pointers-to-pagemap-regions, use a sparse array of those pointers. Instead of doing a binary search to find the desired region, you can just index into the array. For example, on a 32-bit platform, GetPageMapValue can treat its argument like this:

    32                                0
    rrrrrrrr wwwwwwww bbbbaaaa aaaaaaaa   = 32-bit address
    ^^^^^^^^                              r = which region to look in
             ^^^^^^^^                     w = what word within the region
                      ^^^^                b = which bits within the word
                          ^^^^^^^^^^^^^   ignored (points to something within
                                          the page, plus tag bits)

The pagemap, then, is an array of 256 pointers:

    /* initially null; each pointer can point to a 1KB pagemap-region */
    uint32 *regions[256];

The lookup code:

    pRegion = regions[r];                // r is addr>>24
    if (pRegion == NULL)
        return 0;                        // definitely not mapped yet
    word = pRegion[w];                   // w is ((addr>>16) & 0xff)
    return GET_BITS_FROM_WORD(word, b);  // b is ((addr>>12) & 0x0f)

Of course you can carve up the bits various different ways. (You can make "regions" an array of bits instead of an array of pointers if the OS supports setting aside 128KB of contiguous virtual memory for pagemap regions, then committing it a page at a time, as needed. I don't know if it would actually make things faster, though.) I almost implemented this, not to save memory or make things faster, but because it's easier to make this thread-safe.

But on 64-bit platforms, pointers are 48 bits. You would need an additional level in the hierarchy, so it would become a tree. It might look like this:

    48                32                                0
    rrrrrrrr rrrsssss ssssssww wwwwwwww bbbbaaaa aaaaaaaa

But, I dunno, when relatively few regions and subregions are actually populated, this might not be any better than a binary search.
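The two-level scheme above can be sketched as a small self-contained example. This is an illustration only; the class name SparsePageMap and the Set/Get interface are hypothetical, not MMgc's actual GetPageMapValue code. It uses the same bit carving as the diagram (top 8 bits select a lazily-allocated 1KB region, 2 bits of map state per 4KB page):

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Hypothetical sketch of the two-level sparse pagemap described above.
// Each region covers 16MB (2^24 bytes); each 4KB page gets a 2-bit entry,
// so a region's map is 256 x 32-bit words = 1KB, allocated only when used.
class SparsePageMap {
    uint32_t* regions[256];  // top 8 bits of the address select a region
public:
    SparsePageMap() { memset(regions, 0, sizeof(regions)); }
    ~SparsePageMap() {
        for (int i = 0; i < 256; i++) free(regions[i]);
    }
    uint32_t Get(uint32_t addr) const {
        uint32_t* pRegion = regions[addr >> 24];
        if (pRegion == NULL)
            return 0;                       // region never touched: not mapped
        uint32_t word = pRegion[(addr >> 16) & 0xff];
        return (word >> (((addr >> 12) & 0x0f) * 2)) & 0x3;
    }
    void Set(uint32_t addr, uint32_t value) {
        uint32_t*& pRegion = regions[addr >> 24];
        if (pRegion == NULL)                // allocate the 1KB region lazily
            pRegion = (uint32_t*)calloc(256, sizeof(uint32_t));
        uint32_t shift = ((addr >> 12) & 0x0f) * 2;
        uint32_t& word = pRegion[(addr >> 16) & 0xff];
        word = (word & ~(0x3u << shift)) | ((value & 0x3) << shift);
    }
};
```

A lookup in an untouched region costs one array index and one NULL check, rather than a binary search over a sorted region list.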
Flags: flashplayer-triage+
Flags: flashplayer-qrb?
Flags: flashplayer-qrb? → flashplayer-qrb+
Target Milestone: --- → Future
I've used this in a couple of places - the ZCT uses it, and an experimental card-marking write barrier uses it. The nice thing is that you can map the entire address space at very low cost, so there's no need to subtract a base pointer to get the right offset for the start of the table. A shared pagemap for all the GCs might also be good - though it would mean inserting GC identifiers at some level of the tree, which adds complexity and storage; it may or may not be worth it (i.e., we need to calculate, and we need some data to calculate with).
Assignee: treilly → nobody
The linear region map bites us on 64-bit Linux. Quoting Tommy: "This is causing problems in 64 bit linux b/c VMPI_allocateAlignedMemory is giving us memory from low 32 bit addresses and mmap is giving us addresses from 0x7fxxxxxxxxxxxxxx so we need a 4 GB page map." and also: "Actually it could have been much worse, looks like valloc uses address from 0x00007f0000000000, so the memory range was like 2^48 which requires 2^33 page map". This was discovered in connection with bug #554191 and a workaround was proposed there, namely avoiding valloc or malloc for the region table and relying entirely on reserve/commit to obtain memory (to narrow the address range). But this is just a bug waiting to happen.
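To make the scaling problem concrete, here is a rough cost calculation for a dense page map, assuming (hypothetically) 2 bits of state per 4KB page; the exact per-page bit count in MMgc may differ, but the point is that a dense map's size scales with the address *span* between the lowest and highest heap addresses, not with the memory actually in use:

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed by a dense page map covering [loAddr, hiAddr), assuming
// 4KB pages and 2 bits of map state per page. A span of 2^48 bytes
// (low valloc addresses up to mmap addresses near 0x7f....) needs 16GB
// of page map; a 4GB (2^32) span already needs 256KB.
uint64_t denseMapBytes(uint64_t loAddr, uint64_t hiAddr) {
    uint64_t span  = hiAddr - loAddr;
    uint64_t pages = span >> 12;        // number of 4KB pages in the span
    return (pages * 2 + 7) / 8;        // 2 bits per page, rounded up to bytes
}
```

Under these assumptions, the 2^48 span Tommy observed on 64-bit Linux makes a dense map hopeless, which is why the structure has to be sparse rather than merely rebased.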
Priority: -- → P2
Summary: Use a per-region page map instead of one big one → Page map needs to be sparse
Whiteboard: 64-bit
Target Milestone: Future → flash10.1.1
Targeting for 10.2; if the fix is easy and credible we can backport to 10.1.1 but the default assumption is that any fix will be somewhat pervasive.
Target Milestone: flash10.1.1 → flash10.2
Blocks: 564119
Assignee: nobody → fklockii
Status: NEW → ASSIGNED
So there are at least three work items here; we may want to split them out and create a tracker, etc.:

- the GCHeap page map must be sparsely allocated or we'll be bitten on 64-bit
- each GC has a linear page table (two bits per page) that has the same problem, even if it is smaller
- in FixedMalloc we're using an expensive search in debug builds to query ownership of pages; this is not working out well in practice. The reasonable thing to do would be to ask GCHeap.

Thus IMO the design we may be looking for here is a generic design in GCHeap that sparsely stores information about every page allocated by GCHeap, and which can be queried about any pointer in the address space. If the page is assigned to a GC it needs to be able to extract a pointer to the GC that owns it; if it is assigned to FixedMalloc it should be possible to figure out whether it's a large-block or small-block page, etc.

Right now we're storing per-block information in the block itself, but with such a global data structure we could reconsider that, especially if it is fast enough and if the storage can be more efficient (e.g., if multiple pages with the same attributes can share one information blob). Information sharing in this table is in fact appealing; the current structure in GCHeap appears to me to be a little profligate in its use of memory. Performance is important for some applications (querying the data structure on the hot path in the GC) and sometimes not (many GCHeap operations, probably).
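One way such a generic, queryable per-page table could look is sketched below. This is a hedged illustration of the idea, not the eventual MMgc design: PageKind, PageInfo, PageTable, and the 32-bit two-level layout are all assumptions made for brevity (a real 64-bit version would need another tree level or a hash over the top address bits):

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Who owns a given 4KB page, as queried by any pointer into it.
enum PageKind { kUnknown, kGCPage, kFixedMallocLarge, kFixedMallocSmall };

struct PageInfo {
    PageKind kind;
    void*    owner;   // e.g. the owning GC, when kind == kGCPage
};

class PageTable {
    // Two-level sparse table over a 32-bit address space: 2^20 pages,
    // top 8 page-number bits select a lazily allocated leaf of 4096 entries.
    static const uint32_t kTopBits = 8, kLeafSize = 1u << (20 - kTopBits);
    PageInfo* leaves[1u << kTopBits];
public:
    PageTable() { memset(leaves, 0, sizeof(leaves)); }
    ~PageTable() {
        for (uint32_t i = 0; i < (1u << kTopBits); i++) free(leaves[i]);
    }
    void Register(uint32_t addr, PageKind kind, void* owner) {
        uint32_t page = addr >> 12;                  // 4KB page number
        PageInfo*& leaf = leaves[page >> (20 - kTopBits)];
        if (!leaf)                                   // allocate leaves lazily
            leaf = (PageInfo*)calloc(kLeafSize, sizeof(PageInfo));
        leaf[page & (kLeafSize - 1)].kind  = kind;
        leaf[page & (kLeafSize - 1)].owner = owner;
    }
    PageInfo Query(uint32_t addr) const {
        uint32_t page = addr >> 12;
        PageInfo* leaf = leaves[page >> (20 - kTopBits)];
        if (!leaf) { PageInfo none = { kUnknown, NULL }; return none; }
        return leaf[page & (kLeafSize - 1)];
    }
};
```

With a table like this, FixedMalloc's debug-build ownership query becomes a constant-time lookup instead of a search, and sharing one blob among pages with identical attributes is a natural extension (store a pointer to a shared PageInfo rather than an inline one).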
Another interesting point about storing meta-information outside the block is that it becomes possible to allocate blocks in groups for the per-size allocators; that would allow larger objects to be accommodated by the allocators, or more large objects to fit into a superblock allocated from GCHeap. The latter may be important for Flex, since there's some evidence from Alex Harui that many Flex objects are quite large; we may be fragmenting a bit because very few large objects fit into a single page. We would reduce that fragmentation overhead if we could let an object cross a block boundary, and that's only possible if the meta-information is stored elsewhere.
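As a rough illustration of the fragmentation argument (the object and page sizes here are made up for the example, not taken from Flex profiling data):

```cpp
#include <cassert>
#include <cstdint>

// With per-page meta-data stored in the page itself, an object cannot cross
// a page boundary, so a page holds only floor(pageSize / objSize) objects.
// If the meta-data lives elsewhere, objects may straddle boundaries and a
// run of N contiguous pages holds floor(N * pageSize / objSize) objects.
uint32_t objsPerPageRun(uint32_t pages, uint32_t pageSize, uint32_t objSize,
                        bool canCrossBoundary) {
    if (canCrossBoundary)
        return (pages * pageSize) / objSize;
    return pages * (pageSize / objSize);
}
```

For example, with hypothetical 1536-byte objects in 4KB pages, each page wastes about a quarter of its space when objects cannot straddle a boundary; letting them straddle recovers most of that waste across a multi-page run.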
This is an adaptation of a change Tommy did on bug 554191 to expose large address ranges from within GCHeap. (It's really a revert to some experimental code he wrote and then backed out upon encountering the large address space issues.) This is not intended ever to be pushed to repos; it's just a way to recreate one context where we encountered the problem.
This is more play code than a proper test. It has been useful for me to narrow down the scope of the problem for this bug. For example, from inspecting the code and my experiments thus far, GCHeap is already storing its meta-data sparsely (though not necessarily in the most efficient fashion). However, the GC is not.

Here's the sample output from some runs (note that all of these runs are on a build with the prior attached patch that switches GCHeap to use valloc via VMPI_allocAlignedMemory):

    % /usr/bin/time -f "maxmem:%Mk elapsed:%E" ./shell/avmshell -Dselftest=mmgc,*,alloc_loop_viaheap_far && echo && /usr/bin/time -f "maxmem:%Mk elapsed:%E" ./shell/avmshell -Dselftest=mmgc,*,alloc_loop_viaheap_near && echo && /usr/bin/time -f "maxmem:%Mk elapsed:%E" ./shell/avmshell -Dselftest=mmgc,*,alloc_loop_viagc_far
    ['start', 'mmgc', 'bugzilla_445780']
    ['test', 'mmgc', 'bugzilla_445780', 'alloc_loop_viaheap_far']
    min: 0x        f97000 max: 0x  7f493114e000 delta:139952316444672
    ['pass', 'mmgc', 'bugzilla_445780', 'alloc_loop_viaheap_far']
    ['end', 'mmgc', 'bugzilla_445780']
    maxmem:50912k elapsed:0:00.16

    ['start', 'mmgc', 'bugzilla_445780']
    ['test', 'mmgc', 'bugzilla_445780', 'alloc_loop_viaheap_near']
    min: 0x  7fcfd5de0000 max: 0x  7fcfd6616000 delta:8609792
    ['pass', 'mmgc', 'bugzilla_445780', 'alloc_loop_viaheap_near']
    ['end', 'mmgc', 'bugzilla_445780']
    maxmem:54048k elapsed:0:00.21

    ['start', 'mmgc', 'bugzilla_445780']
    ['test', 'mmgc', 'bugzilla_445780', 'alloc_loop_viagc_far']
    numPagesNeeded:2092063
    error: out of memory
    OUT OF MEMORY
    Command exited with non-zero status 128
    maxmem:28432k elapsed:0:00.06

The first run above is direct GCHeap allocations with a very large address space (0xfxxxxx -- 0x7fxxxxxxxxxx). The second is GCHeap allocations with a more reasonable address space. The max mem for both is comparable. The third run is GC allocations with a telescoping memory. It's dying because of the large address range that emerges.
(That "numPagesNeeded" annotation towards the end is residue from instrumentation I added to my local build; that is printed right before we OOM boom.)
Attachment #456274 - Attachment mime type: application/octet-stream → text/plain
Spent today puzzling over my inability to replicate the valloc behavior documented in the above trace. I eventually found ways to get similarly large address spaces, but now I realize why I wasn't able to replicate it: the valloc behavior being tested above is on 64-bit Linux (as Tommy said originally). In particular, OS X behaves in a different manner.
Depends on: 580603
Depends on: 581070
Target Milestone: flash10.x - Serrano → flash10.2.x-Spicy
(In reply to comment #7)
> So there are at least three work items here, we may want to split them out and
> create a tracker etc:
>
> - the GCHeap page map must be sparsely allocated or we'll be bitten on 64-bit

Every time I read the code I end up toggling my notion of whether the GCHeap page map is sparse or dense. I managed to convince myself in July that it was sparse, and my experiments with Bug 580603 led me to think that it "must" be sparse, but now I'm re-reading the GCHeap source and I do not see how the HeapBlocks could be anything but dense, based on how they are accessed (pointer arithmetic using the sizes of the blocks as offsets). I'm already going through the GCHeap code (to try to close out Bug 563889); I'll resolve this mystery as part of that.

> - each GC has a linear page table (two bits per page) that has the same
> problem, even if it is smaller

This was done in Bug 581070.

> - in FixedMalloc we're using an expensive search in debug builds to query
> ownership of pages, this is not working out well in practice. The reasonable
> thing to do would be to ask GCHeap.

Haven't done anything with this yet.

> Thus IMO the design we may be looking for here is a generic design in GCHeap
> that sparsely stores information about every page allocated by GCHeap, and
> which can be queried about any pointer in the address space. [...]

This still sounds like a reasonable long-term goal (see comment #7 for more details). Maybe I should open up a separate bug dedicated to that and close this one out to clean things up. (As Lars says at the very top above, a tracker may have been appropriate here.)
Please annotate this bug with remaining information, open new bugs for remaining issues if appropriate, and close this bug if we've done the important work.
1. Main remaining short-term work item: Bug 580603 (which was transitively linked from Bug 445780), which adds a new (debug-only) mode and associated self test. I'll get on it.

Other items (open as of comment 12), with associated information:

2. > > - the GCHeap page map must be sparsely allocated or we'll be bitten on 64-bit

The GCHeap representation is sparse. It uses O(M) internal state to represent M units of committed memory; unreserved memory and memory between regions isn't represented. This needs to be documented better in comments in the code, which is part of the remaining work for Bug 563889.

3. > > - in FixedMalloc we're using an expensive search in debug builds to query
> > ownership of pages, this is not working out well in practice. The reasonable
> > thing to do would be to ask GCHeap.

Still haven't done anything with this. Not going to happen for Serrano. Probably goes with item 4 below.

4. > > Thus IMO the design we may be looking for here is a generic design in GCHeap
> > that sparsely stores information about every page allocated by GCHeap, and
> > which can be queried about any pointer in the address space. [...]

Opened Bug 604339 for this.
No longer blocks: 564119
Target Milestone: flash10.2.x-Spicy → flash10.x - Serrano
Comment on attachment 456272 [details] [diff] [review] Use valloc in gcheap to expose large address range exploratory code; not relevant anymore.
Attachment #456272 - Attachment is obsolete: true
Comment on attachment 456274 [details] exploratory code exposing issues with massive address range exploratory code; not relevant anymore.
Attachment #456274 - Attachment is obsolete: true
Whiteboard: 64-bit → 64-bit, Tracker
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Opened Bug 610982 to cover the last relevant item discussed here: consolidating the various page map data structures into GCHeap.
Flags: flashplayer-bug+