Closed Bug 1581049 Opened 5 years ago Closed 5 years ago

Optimize stack-sample handling in profiler

Categories

(Core :: Gecko Profiler, task, P2)

task

Tracking

()

RESOLVED FIXED
mozilla71
Tracking Status
firefox71 --- fixed

People

(Reporter: mozbugz, Assigned: mozbugz)

References

(Blocks 1 open bug)

Details

Attachments

(4 files)

Following the previous patches (especially bug 1576551 and bug 1576555), there is a significant drop in profiler performance when storing the JSON profile, and during some other operations handling the stack samples.
This make capturing less-than-trivial profiles much slower, so slow that it sometimes fails to make it to the front-end. This needs fixing, before we can land all these patches.

Profiling the profiler showed a massive amount of time spent just skipping entries!

This got worse with the new ProfileBuffer storage, because:

  • As detailed in bug 1577658, creating the JSON output involves reading the full buffer many times, at least twice per thread, and a few more times for the other types of data. So most of the each reading pass ignores most of the buffer!
  • Entries now have variable sizes. So instead of "just" reading the ProfileBufferEntry::Kind byte and then always skipping 9 bytes, we now need to read the entry size (usually 1 byte) and (also usually) the full legacy entry, before skipping a variable number of bytes.
  • Stack samples may contain hundreds to thousands of entries, which need to be skipped one-by-one in most cases.

Bug 1577658 should help with the above, but it is too much work to implement right now, and it is yet uncertain whether we will actually gain speed in the end, because we would be shifting the burden towards writing many parts of the JSON output at the same time.

So as a first step, a smaller optimization that should make a big difference, would be to group all data entries from a sample into one new type of entry. This means that when we're skipping samples, we will only skip one entry instead of hundreds/thousands.

And I will explore other simple optimizations based on further profiling-the-profiler investigation...

Some objects are copied byte-by-byte to/from ModuloBuffers.
E.g., serialized BlocksRingBuffers, or duplicate stacks. (And more to come.)

Iterator::ReadInto(Iterator&) optimizes these copies by using the minimum
number of memcpys possible.

Depends on D44435

Instead of copying BlocksRingBuffer data byte-by-byte (using iterators byte
dereferencers), we can now use ModuloBuffer::Iterator::ReadInto(Iterator&) to
copy them using a small number of memcpys.

Depends on D45838

Stack samples may contain up to hundreds or thousands of entries, which need to
be copied (during sampling and duplicating), and also mostly skipped when
creating the JSON output for other threads or other types of profile data.

This patch gathers all sample legacy entries (apart from the thread id and the
timestamp) into a single non-legacy entry, which is much faster to copy or
skip. The preceding timestamp now has a distinct Kind (TimeBeforeCompactStack)
to know whether to handle legacy entries of the new CompactStack entry kind.

Depends on D45839

Profiling the profiler showed a lot of time spent allocating memory for the
frame-address strings, almost half the time of StreamSamplesToJSON!
Using an "Auto" string will prevent these allocations in most cases.

Also removed the printf("0x%llx"), instead just appending "0x" and an hex-
formatted number with AppendInt() (which handles 32- and 64-bit numbers
separately, so there is no more needs to do a double-cast to avoid sign
extensions -- There is still a double cast, but a no-op one to fix the type to
either uint32_t or uint64_t).

Using an Auto string for frame labels as well.

Depends on D45840

Pushed by gsquelart@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/b30942f9db54
ModuloBuffer::Iterator::ReadInto(ModuloBuffer::Iterator&) - r=gregtatum
https://hg.mozilla.org/integration/autoland/rev/b4b6bbb18cc1
BlocksRingBuffer (de)serializer uses optimized inter-ModuloBuffer copier - r=gregtatum
https://hg.mozilla.org/integration/autoland/rev/be1ec4577d37
Gather stack sample data into fewer entries - r=gregtatum
https://hg.mozilla.org/integration/autoland/rev/27afea20c396
Use AutoCStrings in StreamSamplesToJSON to avoid mallocs - r=gregtatum
Pushed by gsquelart@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/5829fae53029
ModuloBuffer::Iterator::ReadInto(ModuloBuffer::Iterator&) - r=gregtatum
https://hg.mozilla.org/integration/autoland/rev/257f8a17217f
BlocksRingBuffer (de)serializer uses optimized inter-ModuloBuffer copier - r=gregtatum
https://hg.mozilla.org/integration/autoland/rev/1ceccc3c86a4
Gather stack sample data into fewer entries - r=gregtatum
https://hg.mozilla.org/integration/autoland/rev/4001594f15dd
Use AutoCStrings in StreamSamplesToJSON to avoid mallocs - r=gregtatum
Regressions: 1582357
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: