LUL initialisation: replace `CallFrameInfo::Rule` and 7 children with a fixed-sized structure
Categories: (Core :: Gecko Profiler, enhancement, P3)
Tracking: firefox104 status: fixed
People: (Reporter: jseward, Assigned: jseward)
References: (Blocks 1 open bug)
Attachments: (3 files, 1 obsolete file)
At profiler startup, LUL reads and preprocesses Dwarf unwind info for all
shared objects mapped into the process. That includes libxul.so and around 90
other shared objects on an x86_64-linux build. This takes too long.
Profiling just this reading phase shows a huge number of mallocs and frees
resulting from LulDwarf.cpp. The largest cause of these is the implementation
of a sum-of-products type with 7 variants (CallFrameInfo::*Rule) using C++
inheritance. The variants have different sizes and so have to be boxed
(stored on the heap and referred to by pointers). They are very short lived
and this causes a lot of heap turnover.
Comment 1•3 years ago
Here's a WIP patch. It doesn't work right, so is unusable and absolutely
un-landable. It fails its self tests, in particular
./mach gtest "LulDwarfCFIInsn.DW_CFA_expression"
and, more generally,
./mach gtest "Lul*"
But it's close enough that it seemed worth profiling
to establish whether this is a path worth pursuing.
Results, for reading from all .so's up to and including libxul.so, then
_Exit(0)-ing, on an Intel Core i5-1135G7 at around 4 GHz, are:
Before:
0.50user 0.02system 0:00.53elapsed 100%CPU
heap usage: 8,087,404 allocs, 397,177,435 bytes allocated
I refs: 6,848,730,307
After:
0.42user 0.02system 0:00.45elapsed 100%CPU
heap usage: 3,176,909 allocs, 351,186,650 bytes allocated
I refs: 5,702,345,799
More details in the attachment.
This is clearly a disappointing result. But once it's made to work correctly,
there may be other optimisations to do. Of the remaining 3.18 million
mallocs, about 1.97 million of them come from building the
CallFrameInfo::QRuleMap::registers_ mapping, so that's also a route for
improvement.
Profiling also shows that the majority of the instructions (about 73%) go into
parsing the Dwarf .debug_frame/.eh_frame section and initial
summarisation, and the remaining 27% go into sorting the final summaries by
address. However, if we measure Level-1 data cache (D1) misses, the picture
is the other way around: sorting causes 75% of D1 misses. So it may be worth
looking into whether the cache friendliness of sorting could be improved.
Also, a sort implementation that can inline the comparison function is needed.
We don't currently have that, and consequently the above workload contains 89
million calls to lul::CmpExtentsByOffsetLE, each costing just 7
instructions, which is silly.
Next step is to fully debug the patch.
Comment 2•3 years ago
Comment 3•3 years ago
Also, the replacement structure for Rule (in the patch, struct QRule)
could be implemented with just 3 machine words, instead of circa 5
as it currently is.
Comment 4•3 years ago
First working version. Even less speedup than described in comment 1.
Comment 5•3 years ago
For an m-c build on x86_64-linux on Fedora 35 using clang-13.0.0 -g -O2, when
measuring a MOZ_PROFILER_STARTUP=1 run that is hacked so as to exit
immediately after the unwind info for libxul.so has been read, this patch has
the following perf effects:
- run time reduced from 0.50user to 0.41user (Core i5 1135G7, circa 4 GHz)
- heap allocation reduced from 8,087,404 allocs / 397,177,435 bytes to
  3,176,936 allocs / 328,204,042 bytes
- instruction count reduced from 6,848,730,307 to 5,572,028,393
Main changes are:
- class CallFrameInfo::Rule has been completely rewritten, so as to merge
  its 7 children into it, and those children have been removed. The new
  representation uses just 3 machine words to represent all 7 alternatives.
  The various virtual methods of class CallFrameInfo::Rule (Handle,
  operator==, etc) have been rewritten to use a simple switch-based scheme.
- The code that uses class CallFrameInfo::Rule has been changed (but not
  majorly rewritten) so as to pass around instances of it by value, rather
  than pointers to it. This removes the need to allocate them on the heap.
  To simulate the previous use case where a NULL value had a meaning, the
  revised class in fact has an 8th alternative, INVALID, and a routine
  isVALID. INVALID rules are now used in place of NULL pointers to rules as
  previously.
- Accessors and constructors for the revised class CallFrameInfo::Rule hide
  the underlying 3-word representation, ensure that the correct
  representational invariants are maintained, and make it impossible to access
  fields that are meaningless for the given variant. So it's not any less
  safe than the original.
- Additionally, Dwarf expression strings are no longer represented using
  std::string. Instead a new two-word type ImageSlice has been provided.
  This avoids all unnecessary (and unknown) overheads with std::string and
  provides a significant part of the measured speedup.
Comment 6•2 years ago
bugherder