LUL initialisation: replace `CallFrameInfo::Rule` and 7 children with a fixed-sized structure
Categories: (Core :: Gecko Profiler, enhancement, P3)
Tracking: firefox104 status: fixed
People: (Reporter: jseward, Assigned: jseward)
References: (Blocks 1 open bug)
Attachments: (3 files, 1 obsolete file)
At profiler startup, LUL reads and preprocesses Dwarf unwind info for all
shared objects mapped into the process. That includes libxul.so and around 90
other shared objects on an x86_64-linux build. This takes too long.
Profiling just this reading phase shows a huge number of mallocs and frees
resulting from LulDwarf.cpp. The largest cause of these is the implementation
of a sum-of-products type with 7 variants (CallFrameInfo::*Rule) using C++
inheritance. The variants have different sizes and so have to be boxed
(stored on the heap and referred to by pointers). They are very short lived
and this causes a lot of heap turnover.
Comment 1•3 years ago
Here's a WIP patch. It doesn't work right, so is unusable and absolutely
un-landable. It fails its self tests, in particular
./mach gtest "LulDwarfCFIInsn.DW_CFA_expression"
and, more generally,
./mach gtest "Lul*"
But it's close enough that it seemed worth profiling
to establish whether this is a path worth pursuing.
Results, for reading from all .so's up to and including libxul.so, then
_Exit(0)-ing, on an Intel Core i5-1135G7 at around 4 GHz, are:
Before:
0.50user 0.02system 0:00.53elapsed 100%CPU
heap usage: 8,087,404 allocs, 397,177,435 bytes allocated
I refs: 6,848,730,307
After:
0.42user 0.02system 0:00.45elapsed 100%CPU
heap usage: 3,176,909 allocs, 351,186,650 bytes allocated
I refs: 5,702,345,799
More details in the attachment.
This is clearly a disappointing result. But once it's made to work correctly,
there may be other optimisations to do. Of the remaining 3.18 million
mallocs, about 1.97 million of them come from building the
CallFrameInfo::QRuleMap::registers_ mapping, so that's also a route for
improvement.
Profiling also shows that the majority of the instructions (about 73%) go into
parsing the Dwarf .debug_frame/.eh_frame section and initial
summarisation, and the remaining 27% go into sorting the final summaries by
address. However, if we measure Level-1 data cache (D1) misses, the picture
is the other way around: sorting causes 75% of D1 misses. So it may be worth
looking into whether the cache friendliness of sorting could be improved.
Also, a sort implementation that can inline the comparison function is needed.
We don't currently have that, and consequently the above workload contains 89
million calls to lul::CmpExtentsByOffsetLE, each costing just 7
instructions, which is silly.
Next step is to fully debug the patch.
Comment 2•3 years ago
Comment 3•3 years ago
Also, the replacement structure for Rule (in the patch, struct QRule)
could be implemented with just 3 machine words, instead of circa 5
as it currently is.
Comment 4•3 years ago
First working version. Even less speedup than described in comment 1.
Comment 5•3 years ago
For an m-c build on x86_64-linux on Fedora 35 using clang-13.0.0 -g -O2, when
measuring a MOZ_PROFILER_STARTUP=1 run that is hacked so as to exit
immediately after the unwind info for libxul.so has been read, this patch has
the following perf effects:
- run time reduced from 0.50user to 0.41user (Core i5 1135G7, circa 4 GHz)
- heap allocation reduced from 8,087,404 allocs / 397,177,435 bytes to
  3,176,936 allocs / 328,204,042 bytes
- instruction count reduced from 6,848,730,307 to 5,572,028,393
Main changes are:
- class CallFrameInfo::Rule has been completely rewritten, so as to merge
  its 7 children into it, and those children have been removed. The new
  representation uses just 3 machine words to represent all 7 alternatives.
  The various virtual methods of class CallFrameInfo::Rule (Handle,
  operator==, etc) have been rewritten to use a simple switch-based scheme.
- The code that uses class CallFrameInfo::Rule has been changed (but not
  majorly rewritten) so as to pass around instances of it by value, rather
  than pointers to it. This removes the need to allocate them on the heap.
  To simulate the previous use case where a NULL value had a meaning, the
  revised class in fact has an 8th alternative, INVALID, and a routine
  isVALID. INVALID rules are now used in place of NULL pointers to rules as
  previously.
- Accessors and constructors for the revised class CallFrameInfo::Rule hide
  the underlying 3-word representation, ensure that the correct
  representational invariants are maintained, and make it impossible to access
  fields that are meaningless for the given variant. So it's not any less
  safe than the original.
- Additionally, Dwarf expression strings are no longer represented using
  std::string. Instead a new two-word type ImageSlice has been provided.
  This avoids all unnecessary (and unknown) overheads with std::string and
  provides a significant part of the measured speedup.
Comment 6•2 years ago
bugherder