Open Bug 1915793 Opened 3 months ago Updated 19 days ago

) is 2.1x slower in Nightly (and spends lots of time around GC)

Tracking

()

Status:

NEW

People

(Reporter: mayankleoboy1, Unassigned)

References

(Blocks 1 open bug,
URL
)

Details

Attachments

(1 file)

Replace all the code in the demo with this code 3 months ago Mayank Bansal 2.39 KB, text/plain		Details

Mayank Bansal

Reporter

Description

•

3 months ago

•

Edited

Go to http://paperjs.org/examples/spiral-raster/
On the top-right, click on "source'
EITHER replace all the code in the demo with the attached code OR
search for: "length: count / 100" and replace by : "length: count / 1000"
Click on "Run" in the top-right.

Nightly: https://share.firefox.dev/4dF7xhp (13s)
Chrome: https://share.firefox.dev/4g3buhs (6s)

Mayank Bansal

Reporter

Comment 1

•

3 months ago

Attached file Replace all the code in the demo with this code — Details

Mayank Bansal

Reporter

Comment 2

•

3 months ago

STR2: Replace "angle: count * 5" -- > "angle: count * 50"

Nightly: https://share.firefox.dev/3XnchCq

Iain Ireland [:iain]

Comment 3

•

3 months ago

If I run this through samply with a local browser build (making both of the changes listed above), I see ~15% in minor GC, but by far the hot spot is this code in _add:

for (var i = index + amount, l = segments.length; i < l; i++)
  segments[i]._index = i;

Running with IONPERF=ir, the hot part is simply storing to a dynamic slot:

InterruptCheck	   
    86    mov r11, 0x79dc99735aac
    70    cmp dword [r11], 0x0
    116   jnz 0x21a504
CompareAndBranch    
    60    cmp eax, ebx
    7     jge 0x21a103
LoadElementAndUnbox    
    110   mov r11, -0x2000000000000
    234   xor r11, qword [rdx + rax * 8]
    184   mov rcx, r11
    134   shr r11, 0x2f
    158   jnz 0x21a532
GuardShape    
    147   mov r11, 0x25dc7b9f67a0
    180   cmp qword [rcx], r11
    439   jnz 0x21a53c
Slots    
    6628  mov rcx, qword [rcx + 0x8]
StoreDynamicSlotT
    1305  mov r11, 0x79dc99749a10
    145   test dword [r11], 0x1
    208   jz 0x21a0f2
    1     mov r11, qword [rcx]
    304   shr r11, 0x2f
    4     cmp r11d, 0x1fff6
          jb 0x21a0f2
          push rdx
          lea rdx, qword [rcx]
          call 0xffffffffffec2d24
          pop rdx
    175   mov dword [rcx], eax
    1353  mov dword [rcx + 0x4], -0x78000
AddI    
    100   add eax, 0x1
Goto    
    82    jmp 0x21a06f

The single hottest instruction is loading the slots array. The next hottest code is the pre-barrier check and the store itself.

This code all looks pretty good.

Will Medina [:willyelm]

Updated

•

3 months ago

Severity: -- → N/A

Priority: -- → P3

Matthew Gaudet (he/him) [:mgaudet]

Comment 4

•

20 days ago

Something something figuring out if we could allocate a larger object to avoid this load ending up in the dynamic slots.

I believe the optimization in bug 1823811 doesn't apply here because the index property is set outside of a constructor function.

Matthew Gaudet (he/him) [:mgaudet]

Updated

•

20 days ago

Comment 5

•

20 days ago

It's not really the point here, but staring at that code makes me think that the prebarrier check could be hoisted out of the loop. It only works because the loop contains no calls or nested loops, but here the only time the prebarrier check results will change is at the loop head via the InterruptCheck.

(Then again, I don't understand why so many samples hit in the prebarrier test in the first place. The load should be from L1 and the branch should be 100% predictable and not taken. Hardware is hard.)

Iain Ireland [:iain]

Comment 6

•

19 days ago

(In reply to Steve Fink [:sfink] [:s:] from comment #5)

It's not really the point here, but staring at that code makes me think that the prebarrier check could be hoisted out of the loop. It only works because the loop contains no calls or nested loops, but here the only time the prebarrier check results will change is at the loop head via the InterruptCheck.

(Then again, I don't understand why so many samples hit in the prebarrier test in the first place. The load should be from L1 and the branch should be 100% predictable and not taken. Hardware is hard.)

Interesting idea. It's a bit awkward, because in general prebarrier checks are implicit, rather than being exposed as MIR nodes, so it's not hoistable unless we split it into separate nodes. I can imagine an optimization that looks for loops without calls and hoists the prebarrier check outside the loop, but then what do you do when the prebarrier is actually needed? I think you might have to generate two separate copies of the loop and switch between them, which is maybe a bit much.

You need to log in before you can comment on or make changes to this bug.

Bugzilla

Slightly modified paper.js demo (http://paperjs.org/examples/spiral-raster/) is 2.1x slower in Nightly (and spends lots of time around GC)

Categories

(Core :: JavaScript Engine, task, P3)

Tracking

()

People

(Reporter: mayankleoboy1, Unassigned)

References

(Blocks 1 open bug,
URL
)

Details

Crash Data

Security

(public)

User Story

Attachments

(1 file)

Description

Comment 1

Comment 2

Comment 3

Updated

Comment 4

Updated

Comment 5

Comment 6

Attachment

General

Description

File Name

Content Type