Closed Bug 601724 Opened 9 years ago Closed 9 years ago

nanojit: Reduce indirect call overhead caused by CodeAlloc.alloc

Categories

(Tamarin Graveyard :: Baseline JIT (CodegenLIR), defect)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rreitmai, Assigned: rreitmai)

References

Details

(Whiteboard: fixed-in-nanojit, fixed-in-tamarin, fixed-in-tracemonkey)

Attachments

(1 file, 2 obsolete files)

CodeAlloc.alloc calls CodeAlloc.allocCodeChunk, which on Tamarin results in a call to VMPI_allocateCodeMemory; this normally results in an OS page request.

On some non-intel platforms (ARM) the request is for 1 page, which can be quite expensive if repeated frequently.  

One option to reduce this overhead (i.e. reduce the # of calls to allocCodeChunk) is to ask for a larger size and then partition the piece of memory into multiple blocks which get added to the available list.
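As a rough sketch of this idea (the names `Block` and `partitionRegion` and the sizes are illustrative, not nanojit's actual API), a single large request to the OS can be carved into page-sized pieces for the free list:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a free-list entry (nanojit's real CodeList
// differs); each Block covers one slice of the big region.
struct Block {
    uint8_t* start;
    size_t   size;
};

// Carve one large OS allocation into blockSize-sized pieces, so a single
// call to allocCodeChunk can feed many subsequent alloc() requests.
std::vector<Block> partitionRegion(uint8_t* region, size_t regionSize,
                                   size_t blockSize) {
    std::vector<Block> freeList;
    for (size_t off = 0; off + blockSize <= regionSize; off += blockSize)
        freeList.push_back(Block{region + off, blockSize});
    return freeList;
}
```

For example, a 16K region carved into 4K pieces yields four blocks on the free list from one OS request instead of four.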
Flags: flashplayer-qrb+
Assignee: nobody → rreitmai
Attached patch v1 (obsolete) — Splinter Review
Support the partitioning of the mem region allocated via codeChunkAlloc into a number of pieces.

This allows us to request more than a single page at a time on ARM platforms, which should significantly reduce overhead.   I have not yet run performance numbers to validate, but will shortly.

On non-ARM platforms this patch does not change policy, although it could be used to solve bug 601165. It also provides more flexibility when we start integrating the background compiler.
Attachment #481124 - Flags: review?(nnethercote)
Attachment #481124 - Flags: feedback?(siwilkin)
Depends on: 587727
Attachment #481124 - Flags: review?(edwsmith)
Comment on attachment 481124 [details] [diff] [review]
v1

Assumption: the problem you're trying to solve is too many mmap/mprotect calls, on ARM, which has a small chunk size.  Is this right?  

The VM<-->CodeAlloc interface deals in chunks, which want to be large and coarse, to reduce mmap/mprotect operations.  The CodeAlloc<-->Assembler interface deals in blocks, to support small pieces of code, and to support generating code in 2 streams at once (mainline and exit paths).  Adding the idea of a partition that is smaller than a chunk, for the purposes of modifying permissions (but not for allocation, nor for restricting the max block size), overcomplicates the whole class by introducing a new intermediate "granularity" for managing memory.

Currently there is a tension between having a low max block size (for ARM) and a high chunk size (for allocation efficiency).  I would prefer it if we can set this up so page permissions are still set on whole chunks.

Could we add a max_bytes parameter to alloc(), for ARM?  If a free block is larger than that, then CodeAlloc could split it immediately, leaving leftover bytes on the free list.  Then pagesPerAlloc could be increased for ARM, giving fewer mmap calls *and* fewer mprotect calls, compared to this patch which (if I understand the code) only reduces the # of mprotect calls.
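A sketch of the split-on-alloc policy suggested here (the name `grantWithLimit` and its signature are illustrative, not the actual CodeAlloc API): the decision of how much to hand out, with the leftover going back on the free list whenever it is big enough to keep:

```cpp
#include <cstddef>

// Decide how much of a free block of size blockSize to hand out when the
// caller passed a max_bytes limit. If the block exceeds the limit and the
// leftover piece would still be a usable free block (>= minBlock), split;
// otherwise grant the whole block. (Illustrative policy, not nanojit code.)
size_t grantWithLimit(size_t blockSize, size_t maxBytes, size_t minBlock) {
    if (blockSize <= maxBytes)
        return blockSize;      // within the limit: no split needed
    size_t leftover = blockSize - maxBytes;
    if (leftover < minBlock)
        return blockSize;      // leftover too small to track: don't split
    return maxBytes;           // split: caller keeps the leftover on the list
}
```

With this policy the chunk size requested from the OS can stay large (fewer mmap calls) while blocks handed to the Assembler stay under the ARM limit.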
Attachment #481124 - Flags: review?(edwsmith) → review-
Also, ignoring comment #2, we don't need the new concept of a "partition".  It's really an MMU page; its size is owned by the OS (getpagesize()), and we should just call it that, if necessary.

The ARM limitation of 4K is for the purpose of placing inline constants within 4K of the code that references them; there is no requirement for page alignment, and the 4K value is derived from the instruction set encoding, not the OS page size.  In hindsight, it's a mistake in the existing code to conflate page size with this limit; the patch propagates that mistake by introducing "partition", which is a unit of permission-setting and also an upper limit on the block size returned by CodeAlloc::alloc().
(In reply to comment #3)
> Also, igoring comment #2, we don't need the new concept of a "partition".  it's
> really a MMU page; its size is owned by the OS (getpagesize()), and we should
> just call it that, if necessary.

VMPI_getVMPageSize().
Attachment #481124 - Flags: review?(nnethercote)
(In reply to comment #2)
> Comment on attachment 481124 [details] [diff] [review]
> v1
>
> Could we add a max_bytes parameter to alloc(), for ARM?  If a free block is
> larger than that, then CodeAlloc could split it immediately, leaving leftover
> bytes on the free list.  Then pagesPerAlloc could be increased for ARM, giving
> less mmap calls *and* less mprotect calls, compared to this patch which (if i
> understand the code) only reduces the # of mprotect calls.

Interesting idea and a much more elegant solution; as pointed out in comment 3, it also avoids baking in any additional page-size and alignment criteria.
Edwin's comment 3 is right on the money.  On ARM QNX platforms we want to take advantage of the large page support offered by the system, and this means allocating 64K or 1M blocks to avoid the high cost of TLB refills (measured on our system to be as high as 15% of CPU cycles).

On ARM, due to the indirect tie of the system page size to the maximum size of the instruction set encoding, we can't easily return large blocks out of VMPI_allocateCodeMemory() without creating a sub-allocator inside of our own implementation ... which just seems wrong given all of the other allocator gear that is in place.
(In reply to comment #6)
> Edwin's comment 3 is right on the money.  On ARM QNX platforms we want to take
> advantage of the large page support offered by the system, and this means
> allocating 64K or 1M blocks to avoid the high cost of TLB refills (measured on
> our system to be as high as 15% of CPU cycles).
> 

Just to reinforce one point: with the new design, page permission bits will still be toggled at the chunk level.  This means that if 64K or 1M chunks are used, all bits for the entire chunk will be flipped each time we compile a new method anywhere in that chunk.

If this is not desirable, we'll have to come up with a separate fix for that.
Attached patch v1 (obsolete) — Splinter Review
The patch consolidates all allocation logic into addMem().  The cost is a couple of extra updates to the next pointer in the case of a new allocation, but besides cleaning up the code this change paves the way for the next patch, which adds logic for splitting a free-list block.
Attachment #481124 - Attachment is obsolete: true
Attachment #482409 - Flags: review?(edwsmith)
Attachment #481124 - Flags: feedback?(siwilkin)
Attachment #482409 - Flags: review?(nnethercote)
Comment on attachment 482409 [details] [diff] [review]
v1

BTW, when you post patches that are done as HG changesets, the file names don't show up in Bugzilla's "diff" view, which makes it hard to see what's going on.


>@@ -365,7 +362,8 @@
>         // add terminator to heapblocks list so we can track whole blocks
>         terminator->next = heapblocks;
>         heapblocks = terminator;
>-        return b;
>+
>+        addBlock(availblocks, b); // add to free list
>     }
> 
>     CodeList* CodeAlloc::getBlock(NIns* start, NIns* end) {

Is your tree up-to-date?  This looks like it's based on old code -- the
current tip already has an addBlock() call in there.

Patch seems fine -- it doesn't change functionality, AFAICT -- is that
right?
Attachment #482409 - Flags: review?(nnethercote) → review+
(In reply to comment #9)
> BTW, when you post patches that are done as HG changesets, the file names don't
> show up in Bugzilla's "diff" view, which makes it hard to see what's going on.

Turning on git-style diff'ing in one's hgrc (see below) fixes this problem.

  [diff]
  # Note: makes diff's incompatible with UNIX patch; if that matters,
  # then either use patch -p1 or get rid of this.
  git = True
Comment on attachment 482409 [details] [diff] [review]
v1

reassign review to Bill.
Attachment #482409 - Flags: review?(edwsmith) → review?(wmaddox)
Attachment #482409 - Flags: review?(wmaddox) → review+
Currently a request to alloc() will return blocks equal in size to those requested from the OS.

This patch adds support for limiting the size of the block returned from CodeAlloc.alloc().
Attachment #482409 - Attachment is obsolete: true
Attachment #488398 - Flags: review?(nnethercote)
Attachment #488398 - Flags: review?(Jacob.Bramley)
Status: NEW → ASSIGNED
rreitmai http://hg.mozilla.org/tamarin-redux/rev/7951f96fc6ad
Whiteboard: fixed-in-tamarin
Attachment #488398 - Flags: feedback?(rreitmai)
Attachment #488398 - Flags: review?(Jacob.Bramley) → review+
All tests I've run have shown little difference when the patch is applied; that aside, comment 6 makes it apparent that some platforms will benefit from this change.

Running valgrind on an ARM Cortex-A8 I'm seeing the following cachegrind behaviour on misc/boidshack.abc (see below).   These results appear to hint at a slightly reduced miss rate for D1 and LLi.

*** baseline ***

==22495== I   refs:      2,260,103,455
==22495== I1  misses:        2,007,496
==22495== LLi misses:          223,331
==22495== I1  miss rate:          0.08%
==22495== LLi miss rate:          0.00%
==22495==
==22495== D   refs:      1,265,908,141  (798,512,227 rd   + 467,395,914 wr)
==22495== D1  misses:       23,327,074  ( 22,610,448 rd   +     716,626 wr)
==22495== LLd misses:        1,982,144  (  1,488,085 rd   +     494,059 wr)
==22495== D1  miss rate:           1.8% (        2.8%     +         0.1%  )
==22495== LLd miss rate:           0.1% (        0.1%     +         0.1%  )
==22495==
==22495== LL refs:          25,334,570  ( 24,617,944 rd   +     716,626 wr)
==22495== LL misses:         2,205,475  (  1,711,416 rd   +     494,059 wr)
==22495== LL miss rate:            0.0% (        0.0%     +         0.1%  )

*** with attachment 488398 [details] [diff] [review] ***

==22408== I   refs:      2,252,810,927
==22408== I1  misses:        1,940,698
==22408== LLi misses:          164,551
==22408== I1  miss rate:          0.08%
==22408== LLi miss rate:          0.00%
==22408==
==22408== D   refs:      1,261,926,399  (795,785,350 rd   + 466,141,049 wr)
==22408== D1  misses:       21,784,320  ( 21,066,715 rd   +     717,605 wr)
==22408== LLd misses:        1,891,321  (  1,404,122 rd   +     487,199 wr)
==22408== D1  miss rate:           1.7% (        2.6%     +         0.1%  )
==22408== LLd miss rate:           0.1% (        0.1%     +         0.1%  )
==22408==
==22408== LL refs:          23,725,018  ( 23,007,413 rd   +     717,605 wr)
==22408== LL misses:         2,055,872  (  1,568,673 rd   +     487,199 wr)
==22408== LL miss rate:            0.0% (        0.0%     +         0.1%  )

The run below is from the same machine looking at misc/ , where avm2 is a version with the attachment.  The results are fairly typical of what is seen on the broader set of tests:

iterations: 3
                                      avm          avm2
test                          best    avg   best    avg  %dBst  %dAvg
Metric: time 
Dir: misc/
  boids                      33119 33160.7  34277 34311.7   -3.5   -3.5 - 
  boidshack                   6908   6994   6598   6644    4.5    5.0 + 
  gameoflife                 44480 44491.7  44271  44490    0.5    0.0   
  primes                    105578 105972 105599 105621   -0.0    0.3
http://hg.mozilla.org/tracemonkey/rev/c94d644bd756
Whiteboard: fixed-in-tamarin → fixed-in-nanojit, fixed-in-tamarin, fixed-in-tracemonkey
Attachment #488398 - Flags: feedback?(rreitmai) → feedback?(edwsmith)
Comment on attachment 488398 [details] [diff] [review]
Allow alloc() requests to limit size of returned block

(nit) could you fix caps and punctuation in CodeAlloc.h since you edited the comment anyway?  And, the new comments in CodeAlloc.cpp.

>  /** allocate some memory (up to...

In CodeAlloc::alloc(), I'm having trouble following the new code.  I think the problem scenario is:

1. if new block is too big
2. and leftover part is too small
3. then you cut desired in half.

Right?  could you clarify the comment?

questions:
 - can the resulting halves be too small?  if so, there should be an assert
   to guard against it.

For example if

     byteLimit = sizeofMinBlock + minAllocSize

then

     desired = 2*sizeofMinBlock + minAllocSize

And desired/2 = sizeofMinBlock + minAllocSize/2

so all this only works if sizeofMinBlock >= minAllocSize/2.  is it? I haven't looked.  (and, check my math :-) otherwise desired could end up too small.  

Given the complexity, one or two more asserts that check the exit invariants (no block got too small or too big) would be good.

F+, but I highly recommend some clarifying comments and asserts.
Attachment #488398 - Flags: feedback?(edwsmith) → feedback+
Attachment #488398 - Flags: review?(nnethercote) → review+
(In reply to comment #17)
> Comment on attachment 488398 [details] [diff] [review]
> questions:
>  - can the resulting halves be too small?  

I think it's more than an assert ... looking at addRemainder(), minHole is set to max(minAllocSize, 2*sizeofMinBlock), which is wrong.

Look at the last 'else' case of addRemainder(), where we introduce a new free block b2 (which needs to be >= sizeofMinBlock+minAllocSize) and a used block b3 (which needs to be >= sizeofMinBlock).

So minHole should be 2*sizeofMinBlock + minAllocSize

Assuming my math is correct (please check), then we've got an existing bug; that aside, we'd need to bump up that value to ensure the above patch works:

  4*sizeofMinBlock + 3*minAllocSize

The additional 2*(sizeofMinBlock+minAllocSize) ensures that we never have a free block on the list that cannot be split.
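The sizing argument above can be checked numerically (a sketch with made-up constants; sizeofMinBlock and minAllocSize here stand in for the real values in CodeAlloc):

```cpp
#include <cstddef>

// Can a free hole of minHole bytes be split into a used block b3 (needing
// at least sizeofMinBlock) plus a remaining free block b2 (needing at least
// sizeofMinBlock + minAllocSize)? Illustrative check of the math above.
bool holeIsSplittable(size_t minHole, size_t sizeofMinBlock,
                      size_t minAllocSize) {
    if (minHole < sizeofMinBlock)
        return false;
    size_t b2 = minHole - sizeofMinBlock;  // free space left after b3
    return b2 >= sizeofMinBlock + minAllocSize;
}
```

For instance, with hypothetical values sizeofMinBlock = 32 and minAllocSize = 64, the existing minHole = max(minAllocSize, 2*sizeofMinBlock) = 64 fails this check, while the proposed 2*sizeofMinBlock + minAllocSize = 128 passes.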
http://hg.mozilla.org/mozilla-central/rev/c94d644bd756
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Blocks: 614126