Closed Bug 1499507 Opened Last year Closed Last year

Remove some code size overhead from WebIDL profiling stack frames

Categories

(Core :: Gecko Profiler, enhancement)

enhancement
Not set

Tracking

()

RESOLVED FIXED
mozilla65
Tracking Status
firefox64 --- wontfix
firefox65 --- fixed

People

(Reporter: mstange, Assigned: mstange)

References

Details

Attachments

(9 files)

46 bytes, text/x-phabricator-request
Details | Review
46 bytes, text/x-phabricator-request
Details | Review
46 bytes, text/x-phabricator-request
Details | Review
46 bytes, text/x-phabricator-request
Details | Review
46 bytes, text/x-phabricator-request
Details | Review
46 bytes, text/x-phabricator-request
Details | Review
46 bytes, text/x-phabricator-request
Details | Review
46 bytes, text/x-phabricator-request
Details | Review
46 bytes, text/x-phabricator-request
Details | Review
Bug 785440 added profiling stack frames to all WebIDL APIs and caused some code size increases, see bug 1437772.

In this bug I want to land some improvements to this situation, but it probably won't fix the whole regression.
Assignee: nobody → mstange
Status: NEW → ASSIGNED
Here are the results from the patches I'm about to attach:

                      XUL section sizes opt:

                      Linux x64    Mac x64     Win x64     Android 32

                      93300567     116789591   92883950    58319924
fold checks            -158576      -239520     -189936     +124
                      93141991     116550071   92694014    58320048
infallible capacity    -94672       -88624      -98744      -128
                      93047319     116461447   92595270    58319920
no line numbers        -101120      -107968                 -98112
                      92946199     116353479   n/a         58221808
remove strings         -160         -201744                 -208
                      92946039     116151735   n/a         58221600
stackPointer load      -159568      -54320                  -129088
                      92786471     116097415   n/a         58092512

I introduced some build failures in the Windows part of the "no line numbers" patch, which is why I don't have numbers for the Windows builds yet, but I'll get them by tomorrow. I'm a bit surprised that only Mac is showing a reduction from the "remove strings" patch, but my guess is that that's because the "XUL section sizes opt" number doesn't measure all sections of the binary. But I haven't verified that assumption.
This eliminates a few instructions from every profiler label and saves code size.
We have around 9000 WebIDL constructors + methods + getters + setters which all
have an inlined instance of this code.

This change reduces the binary size on Linux x64 by around 160KB.

Here's a diff of the impact on the code generated for Attr_Binding::get_specified
in the Mac build:

    movq       %rsp, %rbp
    pushq      %r15
    pushq      %r14
    pushq      %r12
    pushq      %rbx
    subq       $0x10, %rsp
    movq       %rcx, %r14
    movq       %rdx, %r15
-   movq       __ZN7mozilla8profiler6detail12RacyFeatures18sActiveAndFeaturesE@GOT, %rax ; __ZN7mozilla8profiler6detail12RacyFeatures18sActiveAndFeaturesE@GOT
-   movl       (%rax), %eax
-   testl      %eax, %eax
-   js         loc_xxxxx
-
-   movq       $0x0, -40(%rbp)
-   jmp        loc_xxxxx
-
-   movq       0x78(%rdi), %rbx
+   movq       0x80(%rdi), %rbx
    movq       %rbx, -40(%rbp)
    testq      %rbx, %rbx
    je         loc_xxxxx

    movl       0x10(%rbx), %r12d
    cmpl       %r12d, (%rbx)
    jbe        loc_xxxxx
This eliminates a few instructions from each inlined instance of
AutoProfilerLabel because we no longer need to handle allocation failure in the
inlined code.
I think this allocation should be fine to make infallible: The allocation size
is limited by the thread's stack depth, and we only hit this code path when the
stack is the deepest it's ever been during the thread's life time.

This change reduces the binary size on Linux x64 by around 100KB.

Here's a diff of the impact on the code generated for Attr_Binding::get_specified
in the Mac build, it really just eliminates one test and one jump at the very end
of the method:

@@ -9,30 +9,29 @@
    movq       %rcx, %r14
    movq       %rdx, %r15
    movq       0x80(%rdi), %rbx
    movq       %rbx, -40(%rbp)
    testq      %rbx, %rbx
    je         loc_xxxxx

    movl       0x10(%rbx), %r12d
-   cmpl       %r12d, (%rbx)
-   jbe        loc_xxxxx
+   cmpl       (%rbx), %r12d
+   jae        loc_xxxxx

    movq       0x8(%rbx), %rax
    movq       %r12, %rcx
    shlq       $0x5, %rcx
    leaq       aGetAttrspecifi, %rdx                       ; "get Attr.specified"
    movq       %rdx, (%rax,%rcx)
    movq       $0x0, 0x8(%rax,%rcx)
    leaq       -40(%rbp), %rdx
    movq       %rdx, 0x10(%rax,%rcx)
    movl       $0x106, 0x18(%rax,%rcx)
    movl       $0x1c, 0x1c(%rax,%rcx)
-
    leal       0x1(%r12), %eax
    movl       %eax, 0x10(%rbx)

    movq       %r15, %rdi
    call       __ZNK7mozilla3dom4Attr9SpecifiedEv          ; mozilla::dom::Attr::Specified() const
    movzxl     %al, %eax
    movabsq    $0xfff9000000000000, %rcx
    orq        %rax, %rcx
@@ -50,12 +49,9 @@
    popq       %r14
    popq       %r15
    popq       %rbp
    ret
                         ; endp

    movq       %rbx, %rdi
    call       __ZN14ProfilingStack18ensureCapacitySlowEv  ; ProfilingStack::ensureCapacitySlow()
-   testb      %al, %al
-   jne        loc_xxxxx
-
    jmp        loc_xxxxx

Depends on D9192
They were not displayed in the UI, and the instructions to initialize the line
field of a stack frame increased code size unnecessarily.

This change reduces the binary size on Linux x64 by around 100KB.

Here's a diff of the impact on the code generated for Attr_Binding::get_specified
in the Mac build:

@@ -20,17 +20,16 @@
    movq       0x8(%rbx), %rax
    movq       %r12, %rcx
    shlq       $0x5, %rcx
    leaq       aGetAttrspecifi, %rdx                       ; "get Attr.specified"
    movq       %rdx, (%rax,%rcx)
    movq       $0x0, 0x8(%rax,%rcx)
    leaq       -40(%rbp), %rdx
    movq       %rdx, 0x10(%rax,%rcx)
-   movl       $0x106, 0x18(%rax,%rcx)
    movl       $0x1c, 0x1c(%rax,%rcx)
    leal       0x1(%r12), %eax
    movl       %eax, 0x10(%rbx)

    movq       %r15, %rdi
    call       __ZNK7mozilla3dom4Attr9SpecifiedEv          ; mozilla::dom::Attr::Specified() const
    movzxl     %al, %eax
    movabsq    $0xfff9000000000000, %rcx

Depends on D9193
This code is run during JSON serialization so performance is not a big concern.

Depends on D9195
This makes it easier to add more flags.

Depends on D9197
This means that our binary does not need to include concatenated strings such
as "set CanvasRenderingContext2D.fillStyle". It only needs to contain the
individual strings "CanvasRenderingContext2D" and "fillStyle" which are most
likely already present in the binary.

This change reduces the binary size on macOS x64 by around 200KB.

Here's a diff of the impact on the code generated for Attr_Binding::get_specified
in the Mac build. This change makes us generate slightly more code, which is
very much offset by the reduction in the amount of strings we ship.

@@ -15,22 +15,23 @@

    movl       0x10(%rbx), %r12d
    cmpl       (%rbx), %r12d
    jae        loc_xxxxx

    movq       0x8(%rbx), %rax
    movq       %r12, %rcx
    shlq       $0x5, %rcx
-   leaq       aGetAttrspecifi, %rdx                       ; "get Attr.specified"
+   leaq       aAttr, %rdx                                 ; "Attr"
    movq       %rdx, (%rax,%rcx)
-   movq       $0x0, 0x8(%rax,%rcx)
+   leaq       aSpecified, %rdx                            ; "specified"
+   movq       %rdx, 0x8(%rax,%rcx)
    leaq       -40(%rbp), %rdx
    movq       %rdx, 0x10(%rax,%rcx)
-   movl       $0x71, 0x1c(%rax,%rcx)
+   movl       $0x3a1, 0x1c(%rax,%rcx)
    leal       0x1(%r12), %eax
    movl       %eax, 0x10(%rbx)

    movq       %r15, %rdi
    call       __ZNK7mozilla3dom4Attr9SpecifiedEv          ; mozilla::dom::Attr::Specified() const
    movzxl     %al, %eax
    movabsq    $0xfff9000000000000, %rcx
    orq        %rax, %rcx

Depends on D9204
This change reduces the binary size on macOS x64 by around 50KB.

Here's a diff of the impact on the code generated for Attr_Binding::get_specified
in the Mac build. It's a bit hard to read because %r12 and %rbx swap their
function, but what happens in this method is that "movq       %r12, %rcx" goes
away, and the two instructions "leal       0x1(%r12) %eax" and
"movl       %eax, 0x10(%rbx)" turn into an "incl       0x10(%r12)".
So the old code was preserving the original value of profilingStack->stackPointer
in a register, and then using it later to compute the incremented stackPointer.
The new code uses an "incl" instruction for the stackPointer increment and
doesn't worry that the stackPointer value might have changed since the stack
size check at the start of the function. (It can't have changed.)

before: %rbx has the ProfilingStack*, %r12 has profilingStack->stackPointer
after: %r12 has the ProfilingStack*, %rbx has profilingStack->stackPointer

@@ -3,37 +3,35 @@
    movq       %rsp, %rbp
    pushq      %r15
    pushq      %r14
    pushq      %r12
    pushq      %rbx
    subq       $0x10, %rsp
    movq       %rcx, %r14
    movq       %rdx, %r15
-   movq       0x80(%rdi), %rbx
-   movq       %rbx, -40(%rbp)
-   testq      %rbx, %rbx
+   movq       0x80(%rdi), %r12
+   movq       %r12, -40(%rbp)
+   testq      %r12, %r12
    je         loc_xxxxx

-   movl       0x10(%rbx), %r12d
-   cmpl       (%rbx), %r12d
+   movl       0x10(%r12), %ebx
+   cmpl       (%r12), %ebx
    jae        loc_xxxxx

-   movq       0x8(%rbx), %rax
-   movq       %r12, %rcx
-   shlq       $0x5, %rcx
-   leaq       aAttr, %rdx                                 ; "Attr"
-   movq       %rdx, (%rax,%rcx)
-   leaq       aSpecified, %rdx                            ; "specified"
-   movq       %rdx, 0x8(%rax,%rcx)
-   leaq       -40(%rbp), %rdx
-   movq       %rdx, 0x10(%rax,%rcx)
-   movl       $0x3a1, 0x1c(%rax,%rcx)
-   leal       0x1(%r12), %eax
-   movl       %eax, 0x10(%rbx)
+   movq       0x8(%r12), %rax
+   shlq       $0x5, %rbx
+   leaq       aAttr, %rcx                                 ; "Attr"
+   movq       %rcx, (%rax,%rbx)
+   leaq       aSpecified, %rcx                            ; "specified"
+   movq       %rcx, 0x8(%rax,%rbx)
+   leaq       -40(%rbp), %rcx
+   movq       %rcx, 0x10(%rax,%rbx)
+   movl       $0x3a1, 0x1c(%rax,%rbx)
+   incl       0x10(%r12)

    movq       %r15, %rdi
    call       __ZNK7mozilla3dom4Attr9SpecifiedEv          ; mozilla::dom::Attr::Specified() const
    movzxl     %al, %eax
    movabsq    $0xfff9000000000000, %rcx
    orq        %rax, %rcx
    movq       %rcx, (%r14)
    movq       -40(%rbp), %rax
@@ -47,11 +45,11 @@
    popq       %rbx
    popq       %r12
    popq       %r14
    popq       %r15
    popq       %rbp
    ret
                         ; endp

-   movq       %rbx, %rdi
+   movq       %r12, %rdi
    call       __ZN14ProfilingStack18ensureCapacitySlowEv  ; ProfilingStack::ensureCapacitySlow()
    jmp        loc_xxxxx

Depends on D9205
Updated numbers from successful try pushes:

                      XUL section sizes opt:

                      Linux x64    Mac x64     Win x64     Android 32

                      93300567     116789591   92883950    58319924
fold checks            -158576      -239520     -189936     +124
                      93141991     116550071   92694014    58320048
infallible capacity    -94672       -88624      -98744      -128
                      93047319     116461447   92595270    58319920
no line numbers        -101120      -107968     -129552     -98112
                      92946199     116353479   92465718    58221808
remove strings         -160         -201744     -164800     -208
                      92946039     116151735   92300918    58221600
stackPointer load      -159568      -54320      +6608       -129088
                      92786471     116097415   92307526    58092512


Try pushes:

> baseline:                    https://treeherder.mozilla.org/#/jobs?repo=try&revision=721d857b06913ecaacc0d65e2d760cb5ef56613d
> after "fold checks":         https://treeherder.mozilla.org/#/jobs?repo=try&revision=f9d367e5467c11ebf05d4193f9039bde67ff47b7
> after "infallible capacity": https://treeherder.mozilla.org/#/jobs?repo=try&revision=17d6eae0985be3108f31ff4f4499eb967e84e6fc
> after "no line numbers":     https://treeherder.mozilla.org/#/jobs?repo=try&revision=b5be11594dc0a50a8785ad233d9b596b9bd981e9
> after "remove strings":      https://treeherder.mozilla.org/#/jobs?repo=try&revision=870a7409e58c039a48c5cd327243c30dbabc241d
> after "stackPointer load":   https://treeherder.mozilla.org/#/jobs?repo=try&revision=8fe115c646ebec9f964f68b0966d05124bd7d7e1
Blocks: 1500467
Pushed by mstange@themasta.com:
https://hg.mozilla.org/integration/autoland/rev/4a9c9a91182c
Fold the 'profiler is active' check into the 'JSContext has a non-null PseudoStack' check. r=sfink
https://hg.mozilla.org/integration/autoland/rev/bc134fe1722a
Make ensureCapacitySlow infallible. r=emilio
https://hg.mozilla.org/integration/autoland/rev/16d6c90333de
Don't collect line numbers for profiling stack frames. r=njn
https://hg.mozilla.org/integration/autoland/rev/ca23a517da63
Use AppendPrintf to concatenate the label with the dynamic string. r=njn
https://hg.mozilla.org/integration/autoland/rev/9225e9aea377
Convert the ProfilingStackFrame kind into a set of flags. r=njn
https://hg.mozilla.org/integration/autoland/rev/58dc19fb2b76
Add ProfilingStackFrame flags for to choose the string template that is used to combine the label with the dynamic string. r=njn
https://hg.mozilla.org/integration/autoland/rev/f427afc392b0
Add AUTO_PROFILER_LABEL_DYNAMIC_FAST which allows specifying flags. r=njn
https://hg.mozilla.org/integration/autoland/rev/8a3f4acbad3b
Use AUTO_PROFILER_DYNAMIC_FAST for WebIDL APIs. r=bzbarsky
https://hg.mozilla.org/integration/autoland/rev/541186291b88
Allow the compiler to generate a non-atomic increment instruction for the stack pointer increment. r=njn
Attachment #9018501 - Attachment description: Bug 1499507 - Use AUTO_PROFILER_DYNAMIC_FAST for WebIDL APIs. r?bzbarsky → Bug 1499507 - Use AUTO_PROFILER_LABEL_DYNAMIC_FAST for WebIDL APIs. r?bzbarsky
Pushed by mstange@themasta.com:
https://hg.mozilla.org/integration/autoland/rev/24a9494155fe
Fold the 'profiler is active' check into the 'JSContext has a non-null PseudoStack' check. r=sfink
https://hg.mozilla.org/integration/autoland/rev/30476b824eb4
Make ensureCapacitySlow infallible. r=emilio
https://hg.mozilla.org/integration/autoland/rev/41cede6bc7d2
Don't collect line numbers for profiling stack frames. r=njn
https://hg.mozilla.org/integration/autoland/rev/6256446f16c4
Use AppendPrintf to concatenate the label with the dynamic string. r=njn
https://hg.mozilla.org/integration/autoland/rev/76dd85b9aaf7
Convert the ProfilingStackFrame kind into a set of flags. r=njn
https://hg.mozilla.org/integration/autoland/rev/cf40b044af3f
Add ProfilingStackFrame flags for to choose the string template that is used to combine the label with the dynamic string. r=njn
https://hg.mozilla.org/integration/autoland/rev/9254e6c721b3
Add AUTO_PROFILER_LABEL_DYNAMIC_FAST which allows specifying flags. r=njn
https://hg.mozilla.org/integration/autoland/rev/35d05a53e0a9
Use AUTO_PROFILER_LABEL_DYNAMIC_FAST for WebIDL APIs. r=bzbarsky
https://hg.mozilla.org/integration/autoland/rev/4c79a5557d7b
Allow the compiler to generate a non-atomic increment instruction for the stack pointer increment. r=njn
Uh oh :( I thought I could do without a debug build try push...
Flags: needinfo?(mstange)
Fixed the Android build failure, it compiles locally. Going to try again.
Pushed by mstange@themasta.com:
https://hg.mozilla.org/integration/autoland/rev/18a8aa4a02c3
Fold the 'profiler is active' check into the 'JSContext has a non-null PseudoStack' check. r=sfink
https://hg.mozilla.org/integration/autoland/rev/1f19af3b4374
Make ensureCapacitySlow infallible. r=emilio
https://hg.mozilla.org/integration/autoland/rev/cd066bdf78f5
Don't collect line numbers for profiling stack frames. r=njn
https://hg.mozilla.org/integration/autoland/rev/76602ccd5ca3
Use AppendPrintf to concatenate the label with the dynamic string. r=njn
https://hg.mozilla.org/integration/autoland/rev/be8a9a7f29bd
Convert the ProfilingStackFrame kind into a set of flags. r=njn
https://hg.mozilla.org/integration/autoland/rev/2bc3310f933c
Add ProfilingStackFrame flags for to choose the string template that is used to combine the label with the dynamic string. r=njn
https://hg.mozilla.org/integration/autoland/rev/408a37a9e42b
Add AUTO_PROFILER_LABEL_DYNAMIC_FAST which allows specifying flags. r=njn
https://hg.mozilla.org/integration/autoland/rev/5ec157cbbdd1
Use AUTO_PROFILER_LABEL_DYNAMIC_FAST for WebIDL APIs. r=bzbarsky
https://hg.mozilla.org/integration/autoland/rev/50f1edbfb52c
Allow the compiler to generate a non-atomic increment instruction for the stack pointer increment. r=njn
You need to log in before you can comment on or make changes to this bug.