Open Bug 1376323 Opened 7 years ago Updated 2 years ago

Speed up StripCRLF()

Categories

(Core :: XPCOM, enhancement, P3)

enhancement

Tracking

()

REOPENED

People

(Reporter: ehsan.akhgari, Unassigned)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

Here is a perf profile of nsACString::StripTaggedASCII().  This function shows up under StripCRLF() calls in HTMLInputElement::SanitizeValue().

       │    void                                                                  ▒
       │    nsTSubstring_CharT::StripTaggedASCII(const ASCIIMaskArray& aToStrip)  ▒
       │    {                                                                     ▒
  6.06 │      push   %rbp                                                         ▒
  3.03 │      mov    %rsp,%rbp                                                    ▒
       │      push   %r15                                                         ▒
  1.52 │      push   %r14                                                         ◆
       │      push   %rbx                                                         ▒
       │      push   %rax                                                         ▒
       │      mov    %rsi,%rbx                                                    ▒
  1.52 │      mov    %rdi,%r14                                                    ▒
       │      if (mLength == 0) {                                                 ▒
       │      mov    0x8(%r14),%r15d                                              ▒
       │      test   %r15d,%r15d                                                  ▒
       │    ↓ je     6c                                                           ▒
       │    _ZN9nsAString13EnsureMutableEj():                                     ▒
       │        if (mFlags & (F_FIXED | F_OWNED)) {                               ▒
       │      mov    0xc(%r14),%eax                                               ▒
       │      test   $0x18,%al                                                    ▒
       │    ↓ je     77                                                           ▒
       │    _ZN9nsAString16StripTaggedASCIIERKSt5arrayIbLm128EE():                ▒
       │                                                                          ▒
       │      if (!EnsureMutable()) {                                             ▒
       │        AllocFailed(mLength);                                             ▒
       │      }                                                                   ▒
       │                                                                          ▒
       │      char_type* to   = mData;                                            ▒
       │21:   mov    (%r14),%rsi                                                  ▒
       │      char_type* from = mData;                                            ▒
       │      char_type* end  = mData + mLength;                                  ▒
       │      mov    %r15d,%eax                                                   ▒
       │      lea    (%rsi,%rax,2),%rcx                                           ▒
       │      mov    %rsi,%rax                                                    ▒
       │    ↓ jmp    40                                                           ▒
       │        uint32_t theChar = (uint32_t)*from++;                             ▒
       │        // Replacing this with a call to ASCIIMask::IsMasked              ▒
       │        // regresses performance somewhat, so leaving it inlined.         ▒
       │        if (!mozilla::ASCIIMask::IsMasked(aToStrip, theChar)) {           ▒
       │          // Not stripped, copy this char.                                ▒
       │          *to++ = (char_type)theChar;                                     ▒
  3.03 │30:   mov    %di,(%rax)                                                   ▒
  7.58 │      add    $0x2,%rax                                                    ▒
       │        }                                                                 ▒
  1.52 │      add    $0x2,%rdx                                                    ▒
  3.03 │      mov    %rdx,%rsi                                                    ▒
  1.52 │      xchg   %ax,%ax                                                      ▒
  4.55 │40:   mov    %rsi,%rdx                                                    ▒
       │      while (from < end) {                                                ▒
  1.52 │      cmp    %rcx,%rdx                                                    ▒
       │    ↓ jae    5d                                                           ▒
       │        uint32_t theChar = (uint32_t)*from++;                             ▒
  7.58 │      movzwl (%rdx),%edi                                                  ▒
  4.55 │      cmp    $0x7f,%rdi                                                   ▒
       │    ↑ ja     30                                                           ▒
  9.09 │      lea    0x2(%rdx),%rsi                                               ▒
       │        if (!mozilla::ASCIIMask::IsMasked(aToStrip, theChar)) {           ▒
  6.06 │      cmpb   $0x0,(%rbx,%rdi,1)                                           ▒
 28.79 │    ↑ jne    40                                                           ▒
  4.55 │    ↑ jmp    30                                                           ▒
       │      }                                                                   ▒
       │      *to = char_type(0); // add the null                                 ▒
  1.52 │5d:   movw   $0x0,(%rax)                                                  ▒
       │      mLength = to - mData;                                               ▒
  1.52 │      sub    (%r14),%rax                                                  ▒
       │      shr    %rax                                                         ▒
       │      mov    %eax,0x8(%r14)                                               ▒
       │    }                                                                     ▒
  1.52 │6c:   add    $0x8,%rsp                                                    ▒
       │      pop    %rbx                                                         ▒
       │      pop    %r14                                                         ▒
       │      pop    %r15                                                         ▒
       │      pop    %rbp                                                         ▒
       │    ← retq                                                                ▒


See how we are stalling all of our time on two dependent loads in the body of our hot loop where the CPU is sitting around waiting for the memory fetches to arrive.  With two levels of loop unrolling this should get to optimimum performance.

I wrote a patch to unroll this loop, and here is the impact when running the following tests.

export MOZ_XRE_DIR=/path/to/objdir/dist/bin
export MOZ_RUN_GTEST=1
export GTEST_FILTER=Strings.PerfStripWhitespace:Strings.PerfStripCRLF # These two tests examine StripTaggedASCII()

Before:
ehsan@ehsan-precision:~/moz/src.1347035$ ./mach run
 0:00.09 /home/ehsan/moz/src.1347035/obj-ff-opt.noindex/dist/bin/firefox -no-remote -profile /home/ehsan/moz/src.1347035/obj-ff-opt.noindex/tmp/scratch_user
Running GTest tests...
Note: Google Test filter = Strings.PerfStripWhitespace:Strings.PerfStripCRLF
[==========] Running 2 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 2 tests from Strings
[ RUN      ] Strings.PerfStripWhitespace
PERFHERDER_DATA: {"framework": {"name": "platform_microbench"}, "suites": [{"name": "Strings", "subtests": [{"name": "PerfStripWhitespace", "value": 105074, "lowerIsBetter": true}]}]}
[       OK ] Strings.PerfStripWhitespace (105 ms)
[ RUN      ] Strings.PerfStripCRLF
PERFHERDER_DATA: {"framework": {"name": "platform_microbench"}, "suites": [{"name": "Strings", "subtests": [{"name": "PerfStripCRLF", "value": 119844, "lowerIsBetter": true}]}]}
[       OK ] Strings.PerfStripCRLF (119 ms)
[----------] 2 tests from Strings (225 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test case ran. (225 ms total)
[  PASSED  ] 2 tests.
Finished running GTest tests.
ehsan@ehsan-precision:~/moz/src.1347035$ ./mach run
 0:00.09 /home/ehsan/moz/src.1347035/obj-ff-opt.noindex/dist/bin/firefox -no-remote -profile /home/ehsan/moz/src.1347035/obj-ff-opt.noindex/tmp/scratch_user
Running GTest tests...
Note: Google Test filter = Strings.PerfStripWhitespace:Strings.PerfStripCRLF
[==========] Running 2 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 2 tests from Strings
[ RUN      ] Strings.PerfStripWhitespace
PERFHERDER_DATA: {"framework": {"name": "platform_microbench"}, "suites": [{"name": "Strings", "subtests": [{"name": "PerfStripWhitespace", "value": 104696, "lowerIsBetter": true}]}]}
[       OK ] Strings.PerfStripWhitespace (104 ms)
[ RUN      ] Strings.PerfStripCRLF
PERFHERDER_DATA: {"framework": {"name": "platform_microbench"}, "suites": [{"name": "Strings", "subtests": [{"name": "PerfStripCRLF", "value": 124698, "lowerIsBetter": true}]}]}
[       OK ] Strings.PerfStripCRLF (124 ms)
[----------] 2 tests from Strings (229 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test case ran. (229 ms total)
[  PASSED  ] 2 tests.
Finished running GTest tests.
ehsan@ehsan-precision:~/moz/src.1347035$ ./mach run
 0:00.08 /home/ehsan/moz/src.1347035/obj-ff-opt.noindex/dist/bin/firefox -no-remote -profile /home/ehsan/moz/src.1347035/obj-ff-opt.noindex/tmp/scratch_user
Running GTest tests...
Note: Google Test filter = Strings.PerfStripWhitespace:Strings.PerfStripCRLF
[==========] Running 2 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 2 tests from Strings
[ RUN      ] Strings.PerfStripWhitespace
PERFHERDER_DATA: {"framework": {"name": "platform_microbench"}, "suites": [{"name": "Strings", "subtests": [{"name": "PerfStripWhitespace", "value": 100907, "lowerIsBetter": true}]}]}
[       OK ] Strings.PerfStripWhitespace (101 ms)
[ RUN      ] Strings.PerfStripCRLF
PERFHERDER_DATA: {"framework": {"name": "platform_microbench"}, "suites": [{"name": "Strings", "subtests": [{"name": "PerfStripCRLF", "value": 116099, "lowerIsBetter": true}]}]}
[       OK ] Strings.PerfStripCRLF (116 ms)
[----------] 2 tests from Strings (217 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test case ran. (217 ms total)
[  PASSED  ] 2 tests.
Finished running GTest tests.


After:
ehsan@ehsan-precision:~/moz/src.1347035$ ./mach run
 0:00.09 /home/ehsan/moz/src.1347035/obj-ff-opt.noindex/dist/bin/firefox -no-remote -profile /home/ehsan/moz/src.1347035/obj-ff-opt.noindex/tmp/scratch_user
Running GTest tests...
Note: Google Test filter = Strings.PerfStripWhitespace:Strings.PerfStripCRLF
[==========] Running 2 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 2 tests from Strings
[ RUN      ] Strings.PerfStripWhitespace
PERFHERDER_DATA: {"framework": {"name": "platform_microbench"}, "suites": [{"name": "Strings", "subtests": [{"name": "PerfStripWhitespace", "value": 64004, "lowerIsBetter": true}]}]}
[       OK ] Strings.PerfStripWhitespace (64 ms)
[ RUN      ] Strings.PerfStripCRLF
PERFHERDER_DATA: {"framework": {"name": "platform_microbench"}, "suites": [{"name": "Strings", "subtests": [{"name": "PerfStripCRLF", "value": 76056, "lowerIsBetter": true}]}]}
[       OK ] Strings.PerfStripCRLF (76 ms)
[----------] 2 tests from Strings (140 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test case ran. (140 ms total)
[  PASSED  ] 2 tests.
Finished running GTest tests.
ehsan@ehsan-precision:~/moz/src.1347035$ ./mach run
 0:00.09 /home/ehsan/moz/src.1347035/obj-ff-opt.noindex/dist/bin/firefox -no-remote -profile /home/ehsan/moz/src.1347035/obj-ff-opt.noindex/tmp/scratch_user
Running GTest tests...
Note: Google Test filter = Strings.PerfStripWhitespace:Strings.PerfStripCRLF
[==========] Running 2 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 2 tests from Strings
[ RUN      ] Strings.PerfStripWhitespace
PERFHERDER_DATA: {"framework": {"name": "platform_microbench"}, "suites": [{"name": "Strings", "subtests": [{"name": "PerfStripWhitespace", "value": 65946, "lowerIsBetter": true}]}]}
[       OK ] Strings.PerfStripWhitespace (66 ms)
[ RUN      ] Strings.PerfStripCRLF
PERFHERDER_DATA: {"framework": {"name": "platform_microbench"}, "suites": [{"name": "Strings", "subtests": [{"name": "PerfStripCRLF", "value": 76946, "lowerIsBetter": true}]}]}
[       OK ] Strings.PerfStripCRLF (77 ms)
[----------] 2 tests from Strings (143 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test case ran. (143 ms total)
[  PASSED  ] 2 tests.
Finished running GTest tests.
ehsan@ehsan-precision:~/moz/src.1347035$ ./mach run
 0:00.09 /home/ehsan/moz/src.1347035/obj-ff-opt.noindex/dist/bin/firefox -no-remote -profile /home/ehsan/moz/src.1347035/obj-ff-opt.noindex/tmp/scratch_user
Running GTest tests...
Note: Google Test filter = Strings.PerfStripWhitespace:Strings.PerfStripCRLF
[==========] Running 2 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 2 tests from Strings
[ RUN      ] Strings.PerfStripWhitespace
PERFHERDER_DATA: {"framework": {"name": "platform_microbench"}, "suites": [{"name": "Strings", "subtests": [{"name": "PerfStripWhitespace", "value": 67360, "lowerIsBetter": true}]}]}
[       OK ] Strings.PerfStripWhitespace (67 ms)
[ RUN      ] Strings.PerfStripCRLF
PERFHERDER_DATA: {"framework": {"name": "platform_microbench"}, "suites": [{"name": "Strings", "subtests": [{"name": "PerfStripCRLF", "value": 78263, "lowerIsBetter": true}]}]}
[       OK ] Strings.PerfStripCRLF (78 ms)
[----------] 2 tests from Strings (145 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test case ran. (146 ms total)
[  PASSED  ] 2 tests.
Finished running GTest tests.
Comment on attachment 8881307 [details] [diff] [review]
Unroll the loop in nsTSubstring_CharT::StripTaggedASCII() in order to speed up StripCRLF()

Review of attachment 8881307 [details] [diff] [review]:
-----------------------------------------------------------------

That's quite a speedup!  To be clear, the claim is that we're stalling on 1) the load from the original string and 2) the load from the mask array?
Attachment #8881307 - Flags: review?(nfroyd) → review+
That's correct.
Pushed by eakhgari@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/a5c0a74876ae
Unroll the loop in nsTSubstring_CharT::StripTaggedASCII() in order to speed up StripCRLF(); r=froydnj
https://hg.mozilla.org/mozilla-central/rev/a5c0a74876ae
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla56
Depends on: 1379578
backed out for causing bug 1379578
Status: RESOLVED → REOPENED
Flags: needinfo?(ehsan)
Resolution: FIXED → ---
Backout by cbook@mozilla.com:
https://hg.mozilla.org/mozilla-central/rev/715eae7152cc
Backed out changeset a5c0a74876ae for causing bug 1379578
Target Milestone: mozilla56 → ---
Nathan, sorry I dropped the ball on this...

For a summary of what changed on Talos as a result of this landing, please see https://treeherder.mozilla.org/perf.html#/alerts?id=7555

When I came back from vacation I did a few Talos pushes based on the idea of disabling this optimization on 32-bit architectures by setting the unrolling factor to 1.  The rationale for that is that we don't care about 32-bit anywhere besides Win32, and we don't care about it all that much even on Windows for our users on Win64 who would use 64-bit builds.

Are you OK with that general approach?  Should I invest more time in this?  The alternative would be to give up on this and move on, since I have no idea how to have our cake and eat it on both 32 and 64-bit architectures.
Flags: needinfo?(ehsan) → needinfo?(nfroyd)
I think giving up on unrolling on win32 (and linux32, I guess?) is fine.  We may want to keep the optimization on ARM--we have more registers there and can probably live with an unrolling factor of 2.  But if trying to tweak things for ARM is too much, I think having an unroll factor of 2 on 64-bit and not unrolling on 32-bit would be fine.
Flags: needinfo?(nfroyd)
Do we run gtests on Android?  It seems that we don't...  I'm now thinking how to assess what to do on ARM...
Priority: -- → P3
Just a note if we get back to this, the tests we have don't really do what they intend to do [1]. `StripCRLF` is an inline operation that mutates the string, so after the first call to strip there will be no characters to strip for the remaining 19,999 iterations. We're effectively testing an edge case of no CRLF in the string (which is an okay test, but not what we intend to do currently).

We should really fix the tests and possibly add a few more stress tests before we start relying on them for perf comparisons.

[1] http://searchfox.org/mozilla-central/rev/1c4da216e00ac95b38a3f236e010b31cdfaae03b/xpcom/tests/gtest/TestStrings.cpp#1312-1342
Assignee: ehsan → nobody
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: