Closed
Bug 601724
Opened 14 years ago
Closed 14 years ago
nanojit: Reduce indirect call overhead caused by CodeAlloc.alloc
Categories
(Tamarin Graveyard :: Baseline JIT (CodegenLIR), defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: rreitmai, Assigned: rreitmai)
References
Details
(Whiteboard: fixed-in-nanojit, fixed-in-tamarin, fixed-in-tracemonkey)
Attachments
(1 file, 2 obsolete files)
CodeAlloc.alloc calls CodeAlloc.allocCodeChunks, which on Tamarin results in a call to VMPI_allocateCodeMemory; this normally results in an OS page request. On some non-Intel platforms (e.g. ARM) the request is for a single page, which can be quite expensive if repeated frequently. One option for reducing this overhead (i.e. reducing the number of calls to allocCodeChunk) is to ask for a larger size and then partition the piece of memory into multiple blocks that get added to the available list.
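The partitioning idea described above can be sketched as follows. This is a minimal illustration, not nanojit's actual CodeAlloc interface: the names (`partitionChunk`, `Block`) and the page/chunk sizes are assumptions chosen for demonstration.

```cpp
#include <cstddef>
#include <vector>

// Assumed sizes for illustration only: a 4K code-block granule (the ARM
// limit discussed in this bug) and one OS request covering 16 of them.
static const size_t kPageSize        = 4096;
static const size_t kPagesPerOsAlloc = 16;

// Hypothetical stand-in for a free-list entry.
struct Block {
    char*  start;
    size_t size;
};

// Carve one large OS allocation [base, base + 16*4K) into page-sized
// blocks that can all be pushed onto the available list, so the expensive
// VMPI_allocateCodeMemory-style call happens once instead of 16 times.
std::vector<Block> partitionChunk(char* base) {
    std::vector<Block> avail;
    for (size_t i = 0; i < kPagesPerOsAlloc; ++i)
        avail.push_back(Block{ base + i * kPageSize, kPageSize });
    return avail;
}
```

Subsequent alloc() calls would then pop blocks from this list until it runs dry, amortizing the OS request over many compilations.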
Updated•14 years ago
Flags: flashplayer-qrb+
Updated•14 years ago
Assignee: nobody → rreitmai
Assignee
Comment 1•14 years ago
Support the partitioning of the memory region allocated via codeChunkAlloc into a number of pieces. This allows us to request more than a single page at a time on ARM platforms, which should significantly reduce overhead. I have not yet run performance numbers to validate this, but will shortly. On non-ARM platforms this patch does not change policy, although it could be used to solve bug 601165, and it provides more flexibility for when we start integrating the background compiler.
Attachment #481124 - Flags: review?(nnethercote)
Attachment #481124 - Flags: feedback?(siwilkin)
Assignee
Updated•14 years ago
Attachment #481124 - Flags: review?(edwsmith)
Comment 2•14 years ago
Comment on attachment 481124 [details] [diff] [review]
v1

Assumption: the problem you're trying to solve is too many mmap/mprotect calls on ARM, which has a small chunk size. Is this right?

The VM<-->CodeAlloc interface deals in chunks, which want to be large and coarse, to reduce mmap/mprotect operations. The CodeAlloc<-->Assembler interface deals in blocks, to support small pieces of code, and to support generating code in 2 streams at once (mainline and exit paths).

Adding the idea of a partition that is smaller than a chunk, for the purposes of modifying permissions (but not for allocation, nor for restricting the max block size), overcomplicates the whole class by introducing a new intermediate "granularity" for managing memory.

Currently there is a tension between having a low max block size (for ARM) and a high chunk size (for allocation efficiency). I would prefer it if we can set this up so page permissions are still set on whole chunks.

Could we add a max_bytes parameter to alloc(), for ARM? If a free block is larger than that, then CodeAlloc could split it immediately, leaving the leftover bytes on the free list. Then pagesPerAlloc could be increased for ARM, giving fewer mmap calls *and* fewer mprotect calls, compared to this patch which (if I understand the code) only reduces the number of mprotect calls.
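The max_bytes suggestion above might look roughly like this. `Block` and `splitForAlloc` are hypothetical stand-ins for illustration only; the real CodeAlloc free list uses linked CodeList nodes with headers, not a plain struct.

```cpp
#include <cstddef>

// Hypothetical free-list entry for illustration.
struct Block {
    char*  start;
    size_t size;
};

// Sketch of alloc() honoring a byte limit: if the chosen free block is
// larger than byteLimit, split it immediately. The caller receives the
// head; *leftover gets the remainder, which goes back on the free list.
Block splitForAlloc(Block freeBlock, size_t byteLimit, Block* leftover) {
    if (freeBlock.size <= byteLimit) {
        *leftover = Block{ nullptr, 0 };  // nothing left over
        return freeBlock;
    }
    Block head{ freeBlock.start, byteLimit };
    *leftover = Block{ freeBlock.start + byteLimit,
                       freeBlock.size - byteLimit };
    return head;
}
```

With this shape, chunks stay large (fewer mmap/mprotect calls) while the blocks handed to the assembler still respect the ARM size limit.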
Attachment #481124 - Flags: review?(edwsmith) → review-
Comment 3•14 years ago
Also, ignoring comment #2, we don't need the new concept of a "partition". It's really an MMU page; its size is owned by the OS (getpagesize()), and we should just call it that, if necessary.

The ARM limitation of 4K is for the purpose of placing inline constants within 4K of the code that references them; there is no requirement for page alignment, and the 4K value is derived from the instruction set encoding, not the OS page size.

In hindsight, it's a mistake in the existing code to conflate page size with this limit; the patch propagates that mistake by introducing "partition", which is both a unit of permission-setting and an upper limit on the block size returned by CodeAlloc::alloc().
Comment 4•14 years ago
(In reply to comment #3)
> Also, ignoring comment #2, we don't need the new concept of a "partition".
> It's really a MMU page; its size is owned by the OS (getpagesize()), and we
> should just call it that, if necessary.

VMPI_getVMPageSize().
Assignee
Updated•14 years ago
Attachment #481124 - Flags: review?(nnethercote)
Assignee
Comment 5•14 years ago
(In reply to comment #2)
> Comment on attachment 481124 [details] [diff] [review]
> v1
>
> Could we add a max_bytes parameter to alloc(), for ARM? If a free block is
> larger than that, then CodeAlloc could split it immediately, leaving leftover
> bytes on the free list. Then pagesPerAlloc could be increased for ARM, giving
> less mmap calls *and* less mprotect calls, compared to this patch which (if i
> understand the code) only reduces the # of mprotect calls.

Interesting idea and a much more elegant solution; as pointed out in comment 3, it also avoids baking in any additional page-size and alignment criteria.
Comment 6•14 years ago
Edwin's comment 3 is right on the money. On ARM QNX platforms we want to take advantage of the large page support offered by the system, and this means allocating 64K or 1M blocks to avoid the high cost of TLB refills (measured on our system to be as high as 15% of CPU cycles). On ARM, due to the indirect tie of the system page size to the maximum size of the instruction set encoding, we can't easily return large blocks out of VMPI_allocateCodeMemory() without creating a sub-allocator inside of our own implementation ... which just seems wrong given all of the other allocator gear that is in place.
Assignee
Comment 7•14 years ago
(In reply to comment #6)
> Edwin's comment 3 is right on the money. On ARM QNX platforms we want to take
> advantage of the large page support offered by the system, and this means
> allocating 64K or 1M blocks to avoid the high cost of TLB refills (measured on
> our system to be as high as 15% of CPU cycles).

Just to reinforce one point: under the new design, page permission bits will still be toggled at the chunk level. This means that if 64K or 1M chunks are used, the bits for the entire chunk will be flipped each time we compile a new method anywhere in that chunk. If this is not desirable, we'll have to come up with a separate fix for that.
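The chunk-level permission toggle described above can be sketched with POSIX mprotect. The chunk size and helper names here are assumptions for illustration, not nanojit's actual values; the point is that the mprotect length covers the whole chunk, however small the block being compiled into is.

```cpp
#include <cstddef>
#include <sys/mman.h>

// Assumed chunk size for illustration (one of the sizes mentioned above).
static const size_t kChunkSize = 64 * 1024;

// Make the entire chunk writable before generating code into any block
// within it. Note the length is the whole chunk, not one block.
bool beginWrite(void* chunk) {
    return mprotect(chunk, kChunkSize, PROT_READ | PROT_WRITE) == 0;
}

// Flip the entire chunk back to executable once compilation finishes.
bool endWrite(void* chunk) {
    return mprotect(chunk, kChunkSize, PROT_READ | PROT_EXEC) == 0;
}
```

With 1M chunks, each compile pays two chunk-wide permission flips; that cost is what a separate fix would have to address if it turns out to matter.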
Assignee
Comment 8•14 years ago
The patch consolidates all allocation logic into addMem(). The cost is a couple of extra updates to the next pointer in the case of a new allocation, but besides cleaning up the code, this change paves the way for the next patch, which adds logic for splitting a free-list block.
Attachment #481124 - Attachment is obsolete: true
Attachment #482409 - Flags: review?(edwsmith)
Attachment #481124 - Flags: feedback?(siwilkin)
Assignee
Updated•14 years ago
Attachment #482409 - Flags: review?(nnethercote)
Comment 9•14 years ago
Comment on attachment 482409 [details] [diff] [review]
v1

BTW, when you post patches that are done as HG changesets, the file names don't show up in Bugzilla's "diff" view, which makes it hard to see what's going on.

>@@ -365,7 +362,8 @@
>         // add terminator to heapblocks list so we can track whole blocks
>         terminator->next = heapblocks;
>         heapblocks = terminator;
>-        return b;
>+
>+        addBlock(availblocks, b);  // add to free list
>     }
>
>     CodeList* CodeAlloc::getBlock(NIns* start, NIns* end) {

Is your tree up-to-date? This looks like it's based on old code -- the current tip already has an addBlock() call in there.

Patch seems fine -- it doesn't change functionality, AFAICT -- is that right?
Attachment #482409 - Flags: review?(nnethercote) → review+
Comment 10•14 years ago
(In reply to comment #9)
> BTW, when you post patches that are done as HG changesets, the file names don't
> show up in Bugzilla's "diff" view, which makes it hard to see what's going on.

Turning on git-style diffing in one's hgrc (see below) fixes this problem.

[diff]
# Note: makes diffs incompatible with UNIX patch; if that matters,
# then either use patch -p1 or get rid of this.
git = True
Assignee
Comment 11•14 years ago
Comment on attachment 482409 [details] [diff] [review]
v1

Reassigning review to Bill.
Attachment #482409 - Flags: review?(edwsmith) → review?(wmaddox)
Updated•14 years ago
Attachment #482409 - Flags: review?(wmaddox) → review+
Assignee
Comment 12•14 years ago
rreitmai http://hg.mozilla.org/projects/nanojit-central/rev/480ca3961ba1
Assignee
Comment 13•14 years ago
Currently, requests to alloc() return blocks equal in size to those requested from the OS. This patch adds support for limiting the size of the block returned from CodeAlloc.alloc().
Attachment #482409 - Attachment is obsolete: true
Attachment #488398 - Flags: review?(nnethercote)
Assignee
Updated•14 years ago
Attachment #488398 - Flags: review?(Jacob.Bramley)
Assignee
Updated•14 years ago
Status: NEW → ASSIGNED
Assignee
Comment 14•14 years ago
rreitmai http://hg.mozilla.org/tamarin-redux/rev/7951f96fc6ad
Whiteboard: fixed-in-tamarin
Updated•14 years ago
Attachment #488398 - Flags: feedback?(rreitmai)
Updated•14 years ago
Attachment #488398 - Flags: review?(Jacob.Bramley) → review+
Assignee
Comment 15•14 years ago
All tests I've run have shown little difference when the patch is applied; that aside, comment 6 makes it apparent that some platforms will benefit from this change.

Running valgrind on an ARM Cortex-A8, I'm seeing the following cachegrind behaviour on misc/boidshack.abc (see below). These results appear to hint at a slightly reduced miss rate for D1 and LLi.

*** baseline ***
==22495== I   refs:      2,260,103,455
==22495== I1  misses:        2,007,496
==22495== LLi misses:          223,331
==22495== I1  miss rate:          0.08%
==22495== LLi miss rate:          0.00%
==22495==
==22495== D   refs:      1,265,908,141  (798,512,227 rd + 467,395,914 wr)
==22495== D1  misses:       23,327,074  ( 22,610,448 rd +     716,626 wr)
==22495== LLd misses:        1,982,144  (  1,488,085 rd +     494,059 wr)
==22495== D1  miss rate:           1.8% (        2.8%   +         0.1%  )
==22495== LLd miss rate:           0.1% (        0.1%   +         0.1%  )
==22495==
==22495== LL refs:          25,334,570  ( 24,617,944 rd +     716,626 wr)
==22495== LL misses:         2,205,475  (  1,711,416 rd +     494,059 wr)
==22495== LL miss rate:            0.0% (        0.0%   +         0.1%  )

*** with attachment 488398 [details] [diff] [review] ***
==22408== I   refs:      2,252,810,927
==22408== I1  misses:        1,940,698
==22408== LLi misses:          164,551
==22408== I1  miss rate:          0.08%
==22408== LLi miss rate:          0.00%
==22408==
==22408== D   refs:      1,261,926,399  (795,785,350 rd + 466,141,049 wr)
==22408== D1  misses:       21,784,320  ( 21,066,715 rd +     717,605 wr)
==22408== LLd misses:        1,891,321  (  1,404,122 rd +     487,199 wr)
==22408== D1  miss rate:           1.7% (        2.6%   +         0.1%  )
==22408== LLd miss rate:           0.1% (        0.1%   +         0.1%  )
==22408==
==22408== LL refs:          23,725,018  ( 23,007,413 rd +     717,605 wr)
==22408== LL misses:         2,055,872  (  1,568,673 rd +     487,199 wr)
==22408== LL miss rate:            0.0% (        0.0%   +         0.1%  )

The run below is from the same machine looking at misc/, where avm2 is a version with the attachment. The results are fairly typical of what is seen on the broader set of tests:

iterations: 3
                    avm                  avm2
test                best     avg         best     avg        %dBst  %dAvg
Metric: time
Dir: misc/
  boids              33119    33160.7     34277    34311.7   -3.5   -3.5  -
  boidshack           6908     6994        6598     6644      4.5    5.0  +
  gameoflife         44480    44491.7     44271    44490      0.5    0.0
  primes            105578   105972      105599   105621     -0.0    0.3
Comment 16•14 years ago
http://hg.mozilla.org/tracemonkey/rev/c94d644bd756
Whiteboard: fixed-in-tamarin → fixed-in-nanojit, fixed-in-tamarin, fixed-in-tracemonkey
Assignee
Updated•14 years ago
Attachment #488398 - Flags: feedback?(rreitmai) → feedback?(edwsmith)
Comment 17•14 years ago
Comment on attachment 488398 [details] [diff] [review]
Allow alloc() requests to limit size of returned block

(nit) Could you fix caps and punctuation in CodeAlloc.h since you edited the comment anyway? And the new comments in CodeAlloc.cpp.

> /** allocate some memory (up to...

In CodeAlloc::alloc(), I'm having trouble following the new code. I think the problem scenario is:
1. if the new block is too big
2. and the leftover part is too small
3. then you cut desired in half.
Right? Could you clarify the comment?

Questions:
- Can the resulting halves be too small? If so, there should be an assert to guard against it. For example, if

    byteLimit = sizeofMinBlock + minAllocSize

then

    desired = 2*sizeofMinBlock + minAllocSize

and

    desired/2 = sizeofMinBlock + minAllocSize/2

so all this only works if sizeofMinBlock >= minAllocSize/2. Is it? I haven't looked. (And check my math :-) Otherwise desired could end up too small.

Given the complexity, one or two more asserts that check the exit invariants (no block got too small or too big) would be good.

F+, but I highly recommend some clarifying comments and asserts.
Attachment #488398 - Flags: feedback?(edwsmith) → feedback+
Updated•14 years ago
Attachment #488398 - Flags: review?(nnethercote) → review+
Assignee
Comment 18•14 years ago
(In reply to comment #17)
> Comment on attachment 488398 [details] [diff] [review]
> questions:
> - can the resulting halves be too small?

I think it's more than an assert... Looking at addRemainder(), minHole is set to max(minAllocSize, 2*sizeofMinBlock), which is wrong. In the last 'else' case of addRemainder(), we introduce a new free block b2 (which needs >= sizeofMinBlock + minAllocSize) and a used block b3 (which needs to be >= sizeofMinBlock). So minHole should be:

    2*sizeofMinBlock + minAllocSize

Assuming my math is correct (please check), then we've got an existing bug. That aside, we'd need to bump up that value to ensure the above patch works:

    4*sizeofMinBlock + 3*minAllocSize

The additional 2*(sizeofMinBlock + minAllocSize) ensures that we never have a free block on the list that cannot be split.
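The minHole arithmetic above can be written out as a small check. This is purely illustrative: the sizes passed in are made-up example values, and `minHole` is a sketch of the proposed formula, not the actual addRemainder() code.

```cpp
#include <cstddef>

// Sketch of the proposed minimum-hole computation from the comment above.
// Arguments are whatever the real CodeAlloc uses; values here are examples.
size_t minHole(size_t sizeofMinBlock, size_t minAllocSize) {
    // b2 (new free block) needs sizeofMinBlock + minAllocSize bytes,
    // b3 (used block) needs sizeofMinBlock bytes:
    size_t base = 2 * sizeofMinBlock + minAllocSize;
    // extra 2*(sizeofMinBlock + minAllocSize) of headroom so a block on
    // the free list can itself always be split again:
    return base + 2 * (sizeofMinBlock + minAllocSize);
}
```

For example, with sizeofMinBlock = 16 and minAllocSize = 64, the base requirement is 96 bytes and the split-safe requirement is 256 bytes, matching 4*sizeofMinBlock + 3*minAllocSize.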
Comment 19•14 years ago
http://hg.mozilla.org/mozilla-central/rev/c94d644bd756
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED