nanojit: Reduce indirect call overhead caused by CodeAlloc.alloc

RESOLVED FIXED

Status

Tamarin
Baseline JIT (CodegenLIR)
RESOLVED FIXED
8 years ago
8 years ago

People

(Reporter: Rick Reitmaier, Assigned: Rick Reitmaier)

Tracking

unspecified
Bug Flags:
flashplayer-qrb +

Details

(Whiteboard: fixed-in-nanojit, fixed-in-tamarin, fixed-in-tracemonkey)

Attachments

(1 attachment, 2 obsolete attachments)

(Assignee)

Description

8 years ago
CodeAlloc.alloc calls CodeAlloc.allocCodeChunk, which on Tamarin results in a call to VMPI_allocateCodeMemory; this normally results in an OS page request.

On some non-Intel platforms (ARM) the request is for a single page, which can be quite expensive if repeated frequently.

One option to reduce this overhead (i.e. reduce the # of calls to allocCodeChunk) is to ask for a larger size and then partition the piece of memory into multiple blocks, which get added to the available list.
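For illustration, a minimal sketch of that partitioning idea; the names Block, freeList and osAllocCode are hypothetical stand-ins, not the real CodeAlloc/VMPI API:

  // Hypothetical sketch only: ask the OS for several pages in one call, then
  // carve the region into page-sized blocks and park each one on a free list.
  #include <cstddef>
  #include <cstdint>

  struct Block {
      Block* next;
      size_t size;
  };

  static Block* freeList = nullptr;

  void* osAllocCode(size_t nbytes);   // stand-in for VMPI_allocateCodeMemory

  void refillFreeList(size_t pageSize, size_t pagesPerAlloc)
  {
      uint8_t* base = static_cast<uint8_t*>(osAllocCode(pageSize * pagesPerAlloc));
      for (size_t i = 0; i < pagesPerAlloc; ++i) {
          Block* b = reinterpret_cast<Block*>(base + i * pageSize);
          b->size = pageSize;
          b->next = freeList;      // add to the available list
          freeList = b;
      }
  }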

Updated

8 years ago
Flags: flashplayer-qrb+

Updated

8 years ago
Assignee: nobody → rreitmai
(Assignee)

Comment 1

8 years ago
Created attachment 481124 [details] [diff] [review]
v1

Support partitioning the memory region allocated via allocCodeChunk into a number of pieces.

This allows us to request more than a single page at a time on ARM platforms, which should significantly reduce overhead.  I have not yet run performance numbers to validate, but will shortly.

On non-ARM platforms this patch does not change policy, although it could be used to solve bug 601165.  It also provides more flexibility when we start integrating the background compiler.
Attachment #481124 - Flags: review?(nnethercote)
Attachment #481124 - Flags: feedback?(siwilkin)
(Assignee)

Updated

8 years ago
Depends on: 587727
(Assignee)

Updated

8 years ago
Attachment #481124 - Flags: review?(edwsmith)

Comment 2

8 years ago
Comment on attachment 481124 [details] [diff] [review]
v1

Assumption: the problem you're trying to solve is too many mmap/mprotect calls, on ARM, which has a small chunk size.  Is this right?  

The VM<-->CodeAlloc interface deals in chunks, which want to be large and coarse, to reduce mmap/mprotect operations.  The CodeAlloc<-->Assembler interface deals in blocks, to support small pieces of code, and to support generating code in 2 streams at once (mainline and exit paths).  Adding the idea of a partition that is smaller than a chunk, for the purposes of modifying permissions (but not for allocation, nor for restricting the max block size), overcomplicates the whole class by introducing a new intermediate "granularity" for managing memory.

Currently there is a tension between having a low max block size (for ARM) and a high chunk size (for allocation efficiency).  I would prefer it if we can set this up so page permissions are still set on whole chunks.

Could we add a max_bytes parameter to alloc(), for ARM?  If a free block is larger than that, then CodeAlloc could split it immediately, leaving the leftover bytes on the free list.  Then pagesPerAlloc could be increased for ARM, giving fewer mmap calls *and* fewer mprotect calls, compared to this patch which (if I understand the code) only reduces the # of mprotect calls.
Attachment #481124 - Flags: review?(edwsmith) → review-
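To make the proposal concrete, a rough sketch reusing the hypothetical Block/freeList types from the description above; this is not the actual CodeAlloc::alloc signature:

  // Rough sketch of the max_bytes idea: if the free block is larger than the
  // caller's limit, split it immediately and return the remainder to the free
  // list.  Illustrative only; the real CodeAlloc keeps more state.
  Block* allocWithLimit(size_t maxBytes)
  {
      if (!freeList)
          refillFreeList(/*pageSize=*/4096, /*pagesPerAlloc=*/16);

      Block* b = freeList;
      freeList = b->next;

      if (maxBytes != 0 && b->size > maxBytes) {
          // Carve off the tail and keep it available; the caller gets at
          // most maxBytes of code space.
          Block* rest = reinterpret_cast<Block*>(
              reinterpret_cast<uint8_t*>(b) + maxBytes);
          rest->size = b->size - maxBytes;
          rest->next = freeList;
          freeList   = rest;
          b->size    = maxBytes;
      }
      return b;
  }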

Comment 3

8 years ago
Also, ignoring comment #2, we don't need the new concept of a "partition".  It's really an MMU page; its size is owned by the OS (getpagesize()), and we should just call it that, if necessary.

The ARM limitation of 4K is for the purpose of placing inline constants within 4K of the code that references them; there is no requirement for page alignment, and the 4K value is derived from the instruction set encoding, not the OS page size.  In hindsight, it's a mistake in the existing code to conflate page size with this limit; the patch propagates that mistake by introducing "partition", which is a unit of permission-setting and also an upper limit on the block size returned by CodeAlloc::alloc().

Comment 4

8 years ago
(In reply to comment #3)
> Also, igoring comment #2, we don't need the new concept of a "partition".  it's
> really a MMU page; its size is owned by the OS (getpagesize()), and we should
> just call it that, if necessary.

VMPI_getVMPageSize().
(Assignee)

Updated

8 years ago
Attachment #481124 - Flags: review?(nnethercote)
(Assignee)

Comment 5

8 years ago
(In reply to comment #2)
> Comment on attachment 481124 [details] [diff] [review]
> v1
>
> Could we add a max_bytes parameter to alloc(), for ARM?  If a free block is
> larger than that, then CodeAlloc could split it immediately, leaving leftover
> bytes on the free list.  Then pagesPerAlloc could be increased for ARM, giving
> less mmap calls *and* less mprotect calls, compared to this patch which (if i
> understand the code) only reduces the # of mprotect calls.

Interesting idea, and a much more elegant solution; as pointed out in comment 3, it also avoids baking in any additional page and alignment criteria.

Comment 6

8 years ago
Edwin's comment 3 is right on the money.  On ARM QNX platforms we want to take advantage of the large page support offered by the system, and this means allocating 64K or 1M blocks to avoid the high cost of TLB refills (measured on our system to be as high as 15% of CPU cycles).

On ARM, due to the indirect tie of the system page size to the maximum size of the instruction set encoding, we can't easily return large blocks out of VMPI_allocateCodeMemory() without creating a sub-allocator inside of our own implementation ... which just seems wrong given all of the other allocator gear that is in place.
(Assignee)

Comment 7

8 years ago
(In reply to comment #6)
> Edwin's comment 3 is right on the money.  On ARM QNX platforms we want to take
> advantage of the large page support offered by the system, and this means
> allocating 64K or 1M blocks to avoid the high cost of TLB refills (measured on
> our system to be as high as 15% of CPU cycles).
> 

Just to reinforce one point: the new design means page permission bits will still be toggled at the chunk level.  That means if 64K or 1M chunks are used, permissions for the entire chunk will be flipped each time we compile a new method anywhere in that chunk.

If this is not desirable, we'll have to come up with a separate fix for that.
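To illustrate what chunk-level toggling means in practice, a hedged sketch; makeChunkWritable/makeChunkExecutable are hypothetical helpers, not nanojit functions:

  // With chunk-level permission management, compiling into any part of a chunk
  // flips protections for the whole 64K/1M chunk, not just the pages touched.
  #include <sys/mman.h>
  #include <cstddef>

  void makeChunkWritable(void* chunk, size_t chunkBytes)
  {
      mprotect(chunk, chunkBytes, PROT_READ | PROT_WRITE);   // whole chunk
  }

  void makeChunkExecutable(void* chunk, size_t chunkBytes)
  {
      mprotect(chunk, chunkBytes, PROT_READ | PROT_EXEC);    // whole chunk
  }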
(Assignee)

Comment 8

8 years ago
Created attachment 482409 [details] [diff] [review]
v1

The patch consolidates all allocation logic into addMem().  The cost is a couple of extra updates to the next pointer in the case of a new allocation, but besides cleaning up the code this change paves the way for the next patch, which adds logic for splitting a free-list block.
Attachment #481124 - Attachment is obsolete: true
Attachment #482409 - Flags: review?(edwsmith)
Attachment #481124 - Flags: feedback?(siwilkin)
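A guess at the shape of that consolidation (illustrative only; the real addMem() and CodeList in CodeAlloc.cpp carry more bookkeeping, and addBlock() keeps the free list maintained):

  // Illustrative sketch: every new chunk goes through one path that records
  // the whole chunk on heapblocks and then parks it on the free list, so
  // alloc() is always served from availblocks.
  struct CodeList { CodeList* next; };

  void addMem(CodeList*& heapblocks, CodeList*& availblocks,
              CodeList* block, CodeList* terminator)
  {
      // Track the whole chunk so it can be protected/freed later.
      terminator->next = heapblocks;
      heapblocks = terminator;

      // New memory is not handed to the caller directly; it becomes a free
      // block like any other.
      block->next = availblocks;
      availblocks = block;
  }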
(Assignee)

Updated

8 years ago
Attachment #482409 - Flags: review?(nnethercote)

Comment 9

8 years ago
Comment on attachment 482409 [details] [diff] [review]
v1

BTW, when you post patches that are done as HG changesets, the file names don't show up in Bugzilla's "diff" view, which makes it hard to see what's going on.


>@@ -365,7 +362,8 @@
>         // add terminator to heapblocks list so we can track whole blocks
>         terminator->next = heapblocks;
>         heapblocks = terminator;
>-        return b;
>+
>+        addBlock(availblocks, b); // add to free list
>     }
> 
>     CodeList* CodeAlloc::getBlock(NIns* start, NIns* end) {

Is your tree up-to-date?  This looks like it's based on old code -- the
current tip already has an addBlock() call in there.

Patch seems fine -- it doesn't change functionality, AFAICT -- is that
right?
Attachment #482409 - Flags: review?(nnethercote) → review+

Comment 10

8 years ago
(In reply to comment #9)
> BTW, when you post patches that are done as HG changesets, the file names don't
> show up in Bugzilla's "diff" view, which makes it hard to see what's going on.

Turning on git-style diff'ing in one's hgrc (see below) fixes this problem.

  [diff]
  # Note: makes diff's incompatible with UNIX patch; if that matters,
  # then either use patch -p1 or get rid of this.
  git = True
(Assignee)

Comment 11

8 years ago
Comment on attachment 482409 [details] [diff] [review]
v1

reassign review to Bill.
Attachment #482409 - Flags: review?(edwsmith) → review?(wmaddox)

Updated

8 years ago
Attachment #482409 - Flags: review?(wmaddox) → review+
(Assignee)

Comment 13

8 years ago
Created attachment 488398 [details] [diff] [review]
Allow alloc() requests to limit size of returned block

Currently, requests to alloc() return blocks equal in size to the chunks requested from the OS.

This patch adds support for limiting the size of the block returned from CodeAlloc.alloc().
Attachment #482409 - Attachment is obsolete: true
Attachment #488398 - Flags: review?(nnethercote)
(Assignee)

Updated

8 years ago
Attachment #488398 - Flags: review?(Jacob.Bramley)
(Assignee)

Updated

8 years ago
Status: NEW → ASSIGNED
(Assignee)

Comment 14

8 years ago
rreitmai http://hg.mozilla.org/tamarin-redux/rev/7951f96fc6ad
Whiteboard: fixed-in-tamarin

Updated

8 years ago
Attachment #488398 - Flags: feedback?(rreitmai)
Attachment #488398 - Flags: review?(Jacob.Bramley) → review+
(Assignee)

Comment 15

8 years ago
All tests I've run have shown little difference when the patch is applied; that aside, comment 6 makes it apparent that some platforms will benefit from this change.

Running valgrind on an ARM Cortex-A8, I'm seeing the following cachegrind behaviour on misc/boidshack.abc (see below).  These results appear to hint at a slightly reduced miss rate for D1 and LLi.

*** baseline ***

==22495== I   refs:      2,260,103,455
==22495== I1  misses:        2,007,496
==22495== LLi misses:          223,331
==22495== I1  miss rate:          0.08%
==22495== LLi miss rate:          0.00%
==22495==
==22495== D   refs:      1,265,908,141  (798,512,227 rd   + 467,395,914 wr)
==22495== D1  misses:       23,327,074  ( 22,610,448 rd   +     716,626 wr)
==22495== LLd misses:        1,982,144  (  1,488,085 rd   +     494,059 wr)
==22495== D1  miss rate:           1.8% (        2.8%     +         0.1%  )
==22495== LLd miss rate:           0.1% (        0.1%     +         0.1%  )
==22495==
==22495== LL refs:          25,334,570  ( 24,617,944 rd   +     716,626 wr)
==22495== LL misses:         2,205,475  (  1,711,416 rd   +     494,059 wr)
==22495== LL miss rate:            0.0% (        0.0%     +         0.1%  )

*** with attachment 488398 [details] [diff] [review] ***

==22408== I   refs:      2,252,810,927
==22408== I1  misses:        1,940,698
==22408== LLi misses:          164,551
==22408== I1  miss rate:          0.08%
==22408== LLi miss rate:          0.00%
==22408==
==22408== D   refs:      1,261,926,399  (795,785,350 rd   + 466,141,049 wr)
==22408== D1  misses:       21,784,320  ( 21,066,715 rd   +     717,605 wr)
==22408== LLd misses:        1,891,321  (  1,404,122 rd   +     487,199 wr)
==22408== D1  miss rate:           1.7% (        2.6%     +         0.1%  )
==22408== LLd miss rate:           0.1% (        0.1%     +         0.1%  )
==22408==
==22408== LL refs:          23,725,018  ( 23,007,413 rd   +     717,605 wr)
==22408== LL misses:         2,055,872  (  1,568,673 rd   +     487,199 wr)
==22408== LL miss rate:            0.0% (        0.0%     +         0.1%  )

The run below is from the same machine looking at misc/, where avm2 is a version with the attachment.  The results are fairly typical of what is seen on the broader set of tests:

iterations: 3
                                      avm          avm2
test                          best    avg   best    avg  %dBst  %dAvg
Metric: time 
Dir: misc/
  boids                      33119 33160.7  34277 34311.7   -3.5   -3.5 - 
  boidshack                   6908   6994   6598   6644    4.5    5.0 + 
  gameoflife                 44480 44491.7  44271  44490    0.5    0.0   
  primes                    105578 105972 105599 105621   -0.0    0.3
http://hg.mozilla.org/tracemonkey/rev/c94d644bd756
Whiteboard: fixed-in-tamarin → fixed-in-nanojit, fixed-in-tamarin, fixed-in-tracemonkey
(Assignee)

Updated

8 years ago
Attachment #488398 - Flags: feedback?(rreitmai) → feedback?(edwsmith)

Comment 17

8 years ago
Comment on attachment 488398 [details] [diff] [review]
Allow alloc() requests to limit size of returned block

(nit) Could you fix caps and punctuation in CodeAlloc.h since you edited the comment anyway?  And the new comments in CodeAlloc.cpp.

>  /** allocate some memory (up to...

In CodeAlloc::alloc(), I'm having trouble following the new code.  I think the problem scenario is:

1. if new block is too big
2. and leftover part is too small
3. then you cut desired in half.

Right?  Could you clarify the comment?

questions:
 - can the resulting halves be too small?  if so, there should be an assert
   to guard against it.

For example if

     byteLimit = sizeofMinBlock + minAllocSize

then

     desired = 2*sizeofMinBlock + minAllocSize

And desired/2 = sizeofMinBlock + minAllocSize/2

so all this only works if sizeofMinBlock >= minAllocSize/2.  is it? I haven't looked.  (and, check my math :-) otherwise desired could end up too small.  

Given the complexity, one or two more asserts that check the exit invariants (no block got too small or too big) would be good.

F+, but I highly recommend some clarifying comments and asserts.
Attachment #488398 - Flags: feedback?(edwsmith) → feedback+
Attachment #488398 - Flags: review?(nnethercote) → review+
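A sketch of the kind of exit-invariant asserts being asked for in comment 17 (hypothetical names; the real code would use NanoAssert and the actual CodeAlloc fields):

  #include <cassert>
  #include <cstddef>

  // Check, on exit from alloc(), that the returned block neither fell below
  // the smallest usable size nor exceeded the caller's byte limit.
  void checkAllocResult(size_t blockBytes, size_t byteLimit,
                        size_t sizeofMinBlock, size_t minAllocSize)
  {
      assert(blockBytes >= sizeofMinBlock + minAllocSize);   // not too small
      if (byteLimit != 0)
          assert(blockBytes <= byteLimit);                    // not too big
  }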
(Assignee)

Comment 18

8 years ago
(In reply to comment #17)
> Comment on attachment 488398 [details] [diff] [review]
> questions:
>  - can the resulting halves be too small?  

I think it's more than an assert ... looking at addRemainder(), minHole is set to max(minAllocSize, 2*sizeofMinBlock), which is wrong.

Looking at the last 'else' case of addRemainder(), where we introduce a new free block b2 (which needs >= sizeofMinBlock+minAllocSize) and a used block b3 (which needs to be >=sizeofMinBlock).

So minHole should be 2*sizeofMinBlock + minAllocSize

Assuming my math is correct (please check), we've got an existing bug; that aside, we'd need to bump up that value to ensure the above patch works:

  4*sizeofMinBlock + 3*minAllocSize

The additional 2*(sizeofMinBlock+minAllocSize) ensures that we never have a free block on the list that cannot be split.
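Restating that arithmetic as code (my reading of the reasoning above, not code from the patch):

  #include <cstddef>

  // Smallest free block that addRemainder() can split into a new free block
  // b2 and a used block b3, plus headroom so that neither half of a later
  // split falls below that minimum; this works out to the
  // 4*sizeofMinBlock + 3*minAllocSize quoted above.
  size_t minHoleFor(size_t sizeofMinBlock, size_t minAllocSize)
  {
      size_t smallestSplit = 2 * sizeofMinBlock + minAllocSize;
      return smallestSplit + 2 * (sizeofMinBlock + minAllocSize);
  }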

Comment 19

8 years ago
http://hg.mozilla.org/mozilla-central/rev/c94d644bd756
Status: ASSIGNED → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → FIXED
(Assignee)

Updated

8 years ago
Blocks: 614126