Open Bug 676181 Opened 13 years ago Updated 2 years ago

Load then use is inefficient on arm

Tracking

()

Status:

NEW

People

(Reporter: mjrosenb, Unassigned)

Details

Attachments

(2 files)

does some of the described stuff 13 years ago Marty Rosenberg [:mjrosenb] 7.68 KB, patch		Details \| Diff \| Splinter Review
faster_getInt.diff 13 years ago Marty Rosenberg [:mjrosenb] 6.72 KB, patch		Details \| Diff \| Splinter Review

Marty Rosenberg [:mjrosenb]

Reporter

Description

•

13 years ago

Attached patch does some of the described stuff — Details — Splinter Review

While looking at code generated by our YARR, I noticed an odd sequence:

[jaeger] Insns mov r3, #0x200000 @ (2097152)
[jaeger] Insns orr r3, r3, #0x20 @ (32)
[jaeger] Insns orrs r5, r5, r3

instead, what we should produce is

orr r5, #0x200020
orrs r5, #0x20

It looks like there are many other places where we generate a move into a temp register, followed by a use of the temp register and the target register, when doing two successive operations on the dest register will suffice.

Another less obvious case that I found in the same test was:
[jaeger] Insns mov r3, #0x610000 @ (6356992)
[jaeger] Insns orr r3, r3, #0x61 @ (97)
[jaeger] Insns cmp r5, r3
[jaeger] Insns ldrne pc, =0xffffffff @ (-1)

In this case, it only works because the cmp can in fact be a teq without changing the result of the comparison. In this case, we can emit:
eor r3, r5, #0x610000
teq r3, 0x61
ldrne pc, =0xffffffff

the eor/teq sequence will not work if we were testing for negative/overflow rather than equality/zeroness.
It may be tempting to try
sub r3, r5, #0x610000
cmp r3, 0x61
but that is not guaranteed to set *all* of the bits correctly. Any bits that are dependent on the actual computation rather than just the final value written into the register have the potential to be wrong. It may be possible, by doing something like this:

subs r3, r5, 0x....
cmpvc r3, 0x...
ldrvs pc, ....
the idea being that if we are testing for overflow, we do part of the subtraction, test for overflow, if there was overflow, then that was going to happen if we did part of the subtraction or all of it, so we don't bother doing the second half of the subtraction, and branch accordingly. However, I have not thought that through thoroughly, and am not convinced it is correct.

The attached patch compiles, but is other than that *completely* *untested*. I should be testing it soon, and hopefully getting some numbers.

Marty Rosenberg [:mjrosenb]

Reporter

Comment 1

•

13 years ago

Initial results are disheartening.  
=============================================================================

** TOTAL **:           ??                3625.4ms +/- 2.0%   3644.5ms +/- 1.7%     not conclusive: might be *1.005x as slow*

=============================================================================

  3d:                  *1.022x as slow*   677.3ms +/- 1.0%    692.3ms +/- 0.2%     significant
    cube:              -                  265.7ms +/- 1.4%    264.5ms +/- 0.3% 
    morph:             *1.138x as slow*   118.7ms +/- 1.1%    135.1ms +/- 0.3%     significant
    raytrace:          -                  292.9ms +/- 1.5%    292.7ms +/- 0.2% 

  bitops:              ??                 165.8ms +/- 0.6%    167.2ms +/- 1.9%     not conclusive: might be *1.008x as slow*
    3bit-bits-in-byte: ??                  12.9ms +/- 0.7%     13.9ms +/- 14.6%     not conclusive: might be *1.073x as slow*
    nsieve-bits:       -                   52.3ms +/- 1.5%     51.9ms +/- 0.8% 
    bitwise-and:       ??                  23.9ms +/- 2.0%     24.6ms +/- 9.7%     not conclusive: might be *1.029x as slow*
    bits-in-byte:      ??                  76.8ms +/- 0.3%     76.9ms +/- 0.3%     not conclusive: might be *1.002x as slow*

  math:                -                  404.6ms +/- 3.5%    398.7ms +/- 0.2% 
    cordic:            -                  114.6ms +/- 4.0%    112.7ms +/- 0.5% 
    partial-sums:      -                  180.1ms +/- 4.1%    177.2ms +/- 0.3% 
    spectral-norm:     -                  109.9ms +/- 2.1%    108.8ms +/- 0.3% 

  regexp:              -                  161.6ms +/- 7.1%    156.0ms +/- 0.3% 
    dna:               -                  161.6ms +/- 7.1%    156.0ms +/- 0.3% 

  string:              ??                 852.4ms +/- 1.1%    856.8ms +/- 0.1%     not conclusive: might be *1.005x as slow*
    tagcloud:          ??                 267.8ms +/- 0.6%    269.4ms +/- 0.3%     not conclusive: might be *1.006x as slow*
    unpack-code:       *1.027x as slow*   279.1ms +/- 0.3%    286.7ms +/- 0.2%     significant
    validate-input:    -                  103.9ms +/- 0.3%    103.8ms +/- 0.5% 
    base64:            -                   54.8ms +/- 0.8%     54.5ms +/- 0.5% 
    fasta:             -                  146.8ms +/- 5.2%    142.4ms +/- 0.4% 

  access:              ??                 528.7ms +/- 0.5%    536.0ms +/- 3.7%     not conclusive: might be *1.014x as slow*
    nbody:             ??                 120.8ms +/- 0.4%    122.8ms +/- 3.9%     not conclusive: might be *1.017x as slow*
    binary-trees:      -                   73.5ms +/- 1.5%     73.0ms +/- 0.4% 
    fannkuch:          ??                 210.3ms +/- 0.8%    214.4ms +/- 5.4%     not conclusive: might be *1.019x as slow*
    nsieve:            ??                 124.1ms +/- 0.6%    125.7ms +/- 2.7%     not conclusive: might be *1.013x as slow*

  date:                ??                 472.8ms +/- 4.9%    474.3ms +/- 3.7%     not conclusive: might be *1.003x as slow*
    format-tofte:      ??                 244.3ms +/- 4.2%    246.7ms +/- 7.2%     not conclusive: might be *1.010x as slow*
    format-xparb:      -                  228.5ms +/- 5.7%    227.6ms +/- 0.3% 

  controlflow:         -                   33.7ms +/- 1.5%     33.6ms +/- 1.4% 
    recursive:         -                   33.7ms +/- 1.5%     33.6ms +/- 1.4% 

  crypto:              ??                 328.6ms +/- 5.2%    329.7ms +/- 6.5%     not conclusive: might be *1.004x as slow*
    md5:               -                   93.4ms +/- 7.3%     91.6ms +/- 2.6% 
    aes:               -                  176.3ms +/- 5.8%    176.0ms +/- 5.9% 
    sha1:              ??                  58.8ms +/- 1.1%     62.1ms +/- 14.0%     not conclusive: might be *1.055x as slow*

=============================================================================

** TOTAL **:      *1.014x as slow*  16729.8ms +/- 0.5%   16962.2ms +/- 0.4%     significant

=============================================================================

  v8:             *1.014x as slow*  16729.8ms +/- 0.5%   16962.2ms +/- 0.4%     significant
    crypto:       -                  2019.5ms +/- 0.6%    2016.6ms +/- 0.4% 
    raytrace:     *1.050x as slow*   2626.3ms +/- 2.0%    2757.1ms +/- 1.1%     significant
    richards:     ??                 1842.8ms +/- 0.2%    1858.2ms +/- 1.2%     not conclusive: might be *1.008x as slow*
    splay:        ??                 1351.0ms +/- 0.2%    1353.6ms +/- 0.3%     not conclusive: might be *1.002x as slow*
    deltablue:    ??                 3799.3ms +/- 1.2%    3848.4ms +/- 1.3%     not conclusive: might be *1.013x as slow*
    earley-boyer: *1.015x as slow*   2705.7ms +/- 0.6%    2745.7ms +/- 0.4%     significant
    regexp:       -                  2385.2ms +/- 0.2%    2382.4ms +/- 0.4% 

I don't really believe that the jitted code is actually slower.  I can only assume that the extra calls to disInt (and getOp2) are negating any benefits of the better code.  I am now trying to write a better variant of those functions using gcc's builtin clz.

Marty Rosenberg [:mjrosenb]

Reporter

Comment 2

•

13 years ago

Attached patch faster_getInt.diff — Details — Splinter Review

So I've implemented a fster getInt. It is about 30% longer than I thought it was going to be, but trying it on every 32 bit integer, it is about 3 times as fast as the pre-existing code.  It helps a bit with benchmarks, but unfortunately, not as much as I had expected.

=============================================================================

** TOTAL **:           -                 3631.6ms +/- 0.6%   3609.7ms +/- 0.1% 

=============================================================================

  3d:                  -                  695.5ms +/- 0.3%    693.7ms +/- 0.2% 
    cube:              -                  268.2ms +/- 0.7%    266.0ms +/- 0.5% 
    morph:             -                  135.0ms +/- 0.2%    135.0ms +/- 0.2% 
    raytrace:          ??                 292.2ms +/- 0.1%    292.7ms +/- 0.3%     not conclusive: might be *1.002x as slow*

  bitops:              -                  166.8ms +/- 2.1%    165.7ms +/- 0.7% 
    3bit-bits-in-byte: ??                  13.0ms +/- 0.5%     13.8ms +/- 7.5%     not conclusive: might be *1.066x as slow*
    nsieve-bits:       -                   51.7ms +/- 0.4%     51.6ms +/- 0.3% 
    bitwise-and:       -                   24.3ms +/- 5.6%     23.5ms +/- 0.7% 
    bits-in-byte:      -                   77.8ms +/- 2.7%     76.8ms +/- 0.1% 

  math:                -                  398.0ms +/- 0.2%    397.8ms +/- 0.1% 
    cordic:            1.001x as fast     112.3ms +/- 0.1%    112.2ms +/- 0.1%     significant
    partial-sums:      ??                 176.8ms +/- 0.1%    177.1ms +/- 0.2%     not conclusive: might be *1.002x as slow*
    spectral-norm:     -                  108.9ms +/- 0.7%    108.5ms +/- 0.1% 

  regexp:              -                  155.5ms +/- 0.1%    155.6ms +/- 0.1% 
    dna:               -                  155.5ms +/- 0.1%    155.6ms +/- 0.1% 

  string:              1.006x as fast     856.5ms +/- 0.1%    851.2ms +/- 0.1%     significant
    tagcloud:          1.007x as fast     269.6ms +/- 0.1%    267.7ms +/- 0.2%     significant
    unpack-code:       1.013x as fast     285.5ms +/- 0.1%    281.9ms +/- 0.2%     significant
    validate-input:    1.004x as fast     103.9ms +/- 0.1%    103.4ms +/- 0.1%     significant
    base64:            -                   54.6ms +/- 0.3%     54.5ms +/- 0.2% 
    fasta:             ??                 143.0ms +/- 0.6%    143.7ms +/- 0.6%     not conclusive: might be *1.005x as slow*

  access:              -                  532.4ms +/- 1.6%    527.8ms +/- 0.1% 
    nbody:             -                  121.8ms +/- 1.2%    120.8ms +/- 0.1% 
    binary-trees:      -                   73.2ms +/- 0.7%     73.0ms +/- 0.2% 
    fannkuch:          -                  212.2ms +/- 2.8%    209.1ms +/- 0.1% 
    nsieve:            -                  125.3ms +/- 0.7%    124.9ms +/- 0.2% 

  date:                -                  471.6ms +/- 1.6%    464.3ms +/- 0.2% 
    format-tofte:      -                  243.1ms +/- 2.6%    239.5ms +/- 0.3% 
    format-xparb:      1.016x as fast     228.5ms +/- 0.7%    224.8ms +/- 0.4%     significant

  controlflow:         -                   33.4ms +/- 0.2%     33.4ms +/- 0.2% 
    recursive:         -                   33.4ms +/- 0.2%     33.4ms +/- 0.2% 

  crypto:              -                  321.8ms +/- 1.1%    320.0ms +/- 0.1% 
    md5:               -                   90.8ms +/- 0.8%     90.3ms +/- 0.1% 
    aes:               -                  172.4ms +/- 1.6%    171.3ms +/- 0.1% 
    sha1:              1.003x as fast      58.6ms +/- 0.3%     58.4ms +/- 0.2%     significant

=============================================================================

** TOTAL **:      *1.020x as slow*  16729.8ms +/- 0.5%   17070.0ms +/- 0.5%     significant

=============================================================================

  v8:             *1.020x as slow*  16729.8ms +/- 0.5%   17070.0ms +/- 0.5%     significant
    crypto:       ??                 2019.5ms +/- 0.6%    2029.0ms +/- 1.1%     not conclusive: might be *1.005x as slow*
    raytrace:     *1.070x as slow*   2626.3ms +/- 2.0%    2811.3ms +/- 1.5%     significant
    richards:     ??                 1842.8ms +/- 0.2%    1864.6ms +/- 1.7%     not conclusive: might be *1.012x as slow*
    splay:        *1.011x as slow*   1351.0ms +/- 0.2%    1366.3ms +/- 0.3%     significant
    deltablue:    *1.022x as slow*   3799.3ms +/- 1.2%    3882.8ms +/- 1.8%     significant
    earley-boyer: *1.015x as slow*   2705.7ms +/- 0.6%    2745.5ms +/- 0.3%     significant
    regexp:       1.006x as fast     2385.2ms +/- 0.2%    2370.5ms +/- 0.3%     significant

Jacob Bramley [:jbramley]

Updated

•

13 years ago

OS: Linux → All

Hardware: x86_64 → ARM

BugBot [:suhaib / :marco/ :calixte]

Comment 3

•

2 years ago

The bug assignee didn't login in Bugzilla in the last 7 months, so the assignee is being reset.

Assignee: marty.rosenberg → nobody

BMO Automation

Updated

•

2 years ago

Severity: normal → S3

You need to log in before you can comment on or make changes to this bug.

Bugzilla

Quick Search

Load then use is inefficient on arm

Categories

(Core :: JavaScript Engine, defect)

Tracking

()

People

(Reporter: mjrosenb, Unassigned)

References

Details

Crash Data

Security

(public)

User Story

Attachments

(2 files)

Description

Comment 1

Comment 2

Updated

Comment 3

Updated

Attachment

General

Description

File Name

Content Type