Closed Bug 1754932 Opened 2 years ago Closed 2 years ago

LUL initialisation: replace `CallFrameInfo::Rule` and 7 children with a fixed-sized structure

Categories

(Core :: Gecko Profiler, enhancement, P3)

All
Linux
enhancement

Tracking

()

RESOLVED FIXED
104 Branch
Tracking Status
firefox104 --- fixed

People

(Reporter: jseward, Assigned: jseward)

References

(Blocks 1 open bug)

Details

Attachments

(3 files, 1 obsolete file)

At profiler startup, LUL reads and preprocesses Dwarf unwind info for all
shared objects mapped into the process. That includes libxul.so and around 90
other shared objects on an x86_64-linux build. This takes too long.

Profiling just this reading phase shows a huge number of mallocs and frees
resulting from LulDwarf.cpp. The largest cause of these is the implementation
of a sum-of-products type with 7 variants (CallFrameInfo::*Rule) using C++
inheritance. The variants have different sizes and so have to be boxed
(stored on the heap and referred to by pointers). They are very short lived
and this causes a lot of heap turnover.

Attached patch speedup-lul-tagged-union.diff-4 (obsolete) — Splinter Review

Here's a WIP patch. It doesn't work right, so is unusable and absolutely
un-landable. It fails its self tests, in particular
./mach gtest "LulDwarfCFIInsn.DW_CFA_expression" and more generally
./mach gtest "Lul*". But it's close enough that it seemed worth profiling
to establish whether this is a path worth persuing.

Results are, for reading from all .so's up to and including libxul.so, then
_Exit(0)-ing, on an Intel Core i5-1135G7, around 4 GHz-ish, are:

Before:
  0.50user 0.02system 0:00.53elapsed 100%CPU
  heap usage: 8,087,404 allocs, 397,177,435 bytes allocated
  I   refs:      6,848,730,307

After
  0.42user 0.02system 0:00.45elapsed 100%CPU
  heap usage: 3,176,909 allocs, 351,186,650 bytes allocated
  I   refs:      5,702,345,799

More details in the attachment.

This is clearly a disappointing result. But once it's made to work correctly,
there may be other optimisations to do. Of the remaining 3.18 million
mallocs, about 1.97 million of them come from building the
CallFrameInfo::QRuleMap::registers_ mapping, so that's also a route for
improvement.

Profiling also shows the majority of the instructions (about 73%) go into
parsing the Dwarf .debug_frame/.eh_frame section and initial
summarisation, and the remaining 27% goes into sorting the final summaries by
address. However, if we measure Level-1 data cache (D1) misses, the picture
is the other way around: sorting causes 75% of D1 misses. So it may be worth
looking into whether the cache friendlyness of sorting could be improved.

Also, a sort implementation that can inline the comparison function is needed.
We don't currently have that and consequently the above workload contains 89
million calls to lul::CmpExtentsByOffsetLE, each costing just 7
instructions, which is silly.

Next step is to fully debug the patch.

Also the replacement structure for Rule (in the patch, struct QRule)
could be implemented with just 3 machine words, instead of circa 5
as it currently is.

Severity: -- → N/A
Priority: -- → P3

First working version. Even less speedup than described in comment 1.

Attachment #9263436 - Attachment is obsolete: true

At profiler startup, LUL reads and preprocesses Dwarf unwind info for all
shared objects mapped into the process. That includes libxul.so and around 90
other shared objects on an x86_64-linux build. This takes too long.

Profiling just this reading phase shows a huge number of mallocs and frees
resulting from LulDwarf.cpp. The largest cause of these is the implementation
of a sum-of-products type with 7 variants (CallFrameInfo::*Rule) using C++
inheritance. The variants have different sizes and so have to be boxed (stored
on the heap and referred to by pointers). They are very short lived and this
causes a lot of heap turnover.

For an m-c build on x86_64-linux on Fedora 35 using clang-13.0.0 -g -O2, when
measuring a MOZ_PROFILER_STARTUP=1 run that is hacked so as to exit
immediately after the unwind info for libxul.so has been read, this patch has
the following perf effects:

  • run time reduced from 0.50user to 0.41user (Core i5 1135G7, circa 4 GHz)

  • heap allocation reduced from 8,087,404 allocs/397,177,435 bytes to
    3,176,936 allocs, 328,204,042 bytes.

  • instruction count reduced from 6,848,730,307 to 5,572,028,393.

Main changes are:

  • class CallFrameInfo::Rule has been completely rewritten, so as to merge
    its 7 children into it, and those children have been removed. The new
    representation uses just 3 machine words to represent all 7 alternatives.
    The various virtual methods of class CallFrameInfo::Rule (Handle,
    operator==, etc) have been rewritten to use a simple switch-based scheme.

  • The code that uses class CallFrameInfo::Rule has been changed (but not
    majorly rewritten) so as to pass around instances of it by value, rather
    than pointers to it. This removes the need to allocate them on the heap.
    To simulate the previous use case where a NULL value had a meaning, the
    revised class in fact has an 8th alternative, INVALID, and a routine
    isVALID. INVALID rules are now used in place of NULL pointers to rules as
    previously.

  • Accessors and constructors for the revised class CallFrameInfo::Rule hide
    the underlying 3-word representation, ensure that the correct
    representational invariants are maintained, and make it impossible to access
    fields that are meaningless for the given variant. So it's not any less
    safe than the original.

  • Additionally, Dwarf expression strings are no longer represented using
    std::string. Instead a new two-word type ImageSlice has been provided.
    This avoids all unnecessary (and unknown) overheads with std::string and
    provides a significant part of the measured speedup.

Blocks: 1777540
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 104 Branch
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: