Use small same-as-previous-sample entries instead of DuplicateLastSample for AUTO_PROFILER_THREAD_SLEEP samples
Categories: Core :: Gecko Profiler, task, P2
Tracking:

| | Tracking | Status |
|---|---|---|
| firefox93 | --- | fixed |

People: Reporter: mstange; Assigned: mozbugz
References: Blocks 2 open bugs
Attachments: 4 files
At the moment, we try to reduce sampling overhead for threads that aren't running by using AUTO_PROFILER_THREAD_SLEEP: We get one stack the first time we observe a thread in the sleeping state, and then, as long as the AUTO_PROFILER_THREAD_SLEEP scope isn't exited, we just keep duplicating that sample on each sample tick. This is possible because threads don't change their stacks while they're sleeping.
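The pre-existing scheme can be sketched roughly like this (all names here are hypothetical, not the actual Gecko Profiler code):

```cpp
// Hypothetical sketch of per-thread sleep tracking: AUTO_PROFILER_THREAD_SLEEP
// scopes set the state to Sleeping; the sampler captures one fresh stack on the
// first tick in that state, then duplicates the previous sample afterwards.
enum class SleepState { Awake, Sleeping, SleepingObserved };

// Returns true when the sampler must walk the stack on this tick;
// false when it can just duplicate the previous sample.
bool MustSampleStack(SleepState& aState) {
  switch (aState) {
    case SleepState::Awake:
      return true;  // running thread: always capture a fresh stack
    case SleepState::Sleeping:
      aState = SleepState::SleepingObserved;  // first tick while asleep
      return true;  // capture the stack once...
    case SleepState::SleepingObserved:
      return false;  // ...then duplicate it on every later tick
  }
  return true;
}
```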
However, this approach has two problems:
- Sleeping threads still occupy a lot of buffer space.
- Duplicating samples is per-sample CPU work that could be avoided.
The first idea would be to move the sample duplication to profile serialization time. At sampling time, we could just insert the sampling timestamp into the buffer and nothing else, and then the samples could be generated at serialization time. In fact, we wouldn't even need to duplicate any stacks at that point, because during serialization, stacks are already in a de-duplicated tree structure, so we could just repeat the stack ID of the previous sample.
But then the next idea is the following: We don't really care about the timestamps of the samples while the thread was sleeping. Sample timestamps are useful when the stack is changing; but when it's unchanging, and we know when it entered and when it left that unchanging state, the timestamps in the middle aren't very interesting. In fact, we'd be fine having just one biiig sample for the sleeping state. And we could make such a big sample by giving our samples a duration field, and then putting a long duration on the sample from the sleeping state.
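The "one biiig sample" idea could be sketched as a post-processing step like this (a hypothetical illustration; `Sample` and `CollapseUnchangedRuns` are not real Gecko names):

```cpp
#include <vector>

// A run of samples with an unchanged stack collapses into a single sample
// whose duration spans the whole run.
struct Sample {
  double time;      // sampling timestamp
  int stackId;      // index into the de-duplicated stack tree
  double duration;  // how long this stack was observed
};

std::vector<Sample> CollapseUnchangedRuns(const std::vector<Sample>& aIn,
                                          double aInterval) {
  std::vector<Sample> out;
  for (const Sample& s : aIn) {
    if (!out.empty() && out.back().stackId == s.stackId) {
      // Same stack as the previous sample: extend that sample's duration
      // up to the end of this tick instead of emitting a new sample.
      out.back().duration = s.time + aInterval - out.back().time;
    } else {
      out.push_back({s.time, s.stackId, aInterval});
    }
  }
  return out;
}
```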
Comment 1 (Assignee) • 5 years ago
Great idea, thank you.
Some thoughts:
About avoiding duplicating stacks: If we only store the sampled stack once in the profiler buffer, there's a risk that it could "fall off" the buffer if the thread is asleep for a long time. We would need some way to either re-sample when needed, or keep the sample outside of the buffer (e.g., in RegisteredThread) to avoid its destruction.
About the "biig" sample: as you noted, the Gecko Profiler currently only keeps the timestamp of each sample; the duration is inferred later during processing. So we'd need to add an extra field (duration or last timestamp), and also make sure that the frontend correctly handles these long samples in the different stack views.
Comment 2 • 5 years ago
I think the UI is already ready to handle a duration property in the sample table; we use it for the "comparison" view. It's also handled so that we don't need to have it on every sample, just on the ones whose duration differs from the interval, so that would work for what you want to do here.
For now this is implemented in the processed format only, not the gecko format. So we would still need to implement this in the gecko format and implement the code to process the gecko format in our UI code.
Comment 3 (Assignee) • 4 years ago
Florian reminded me about this bug, and I've just thought of a possible simple-enough solution to my worry about losing the original sample when its buffer chunk is destroyed:
So we'd keep the basic idea of recording a small entry with the current timestamp and the index of the original sample (instead of a full copy).
The difference will come when we're at the beginning of a new chunk: If we know that the index of the original sample is in a previous chunk, then we'd do a full copy into the new chunk, and this copy would become the new original. This way, even as we destroy old chunks, the oldest remaining chunk will always contain full usable stacks.
And there will be the corner case where the index is in an already-destroyed chunk, in which case we'll need to redo a full sampling. Very unlikely, but it still needs to be handled just in case.
It's not the best solution (which would be to keep the repeated stack outside of the buffer, so it can't be lost in old chunks), but I believe it's a much simpler solution, and should deliver 99% of the size & speed benefits.
(Not talking about the "biig" sample trick here, we'll need to coordinate with the front-end for that; I may split this bug when I get to it...)
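The per-tick decision in the comment 3 scheme could be sketched like this (hypothetical names and index scheme, not the actual Gecko code):

```cpp
#include <cstdint>

// Each sampling tick of a sleeping thread either records a tiny reference to
// the original sample, re-copies the full stack at a chunk boundary, or
// redoes a full sampling if the original's chunk is already destroyed.
enum class SleepAction { RecordReference, CopyFullStack, ResampleStack };

SleepAction ChooseSleepAction(uint64_t aOriginalSampleIndex,
                              uint64_t aCurrentChunkStartIndex,
                              uint64_t aOldestLiveIndex) {
  if (aOriginalSampleIndex < aOldestLiveIndex) {
    // Corner case: the original sample's chunk was destroyed.
    return SleepAction::ResampleStack;
  }
  if (aOriginalSampleIndex < aCurrentChunkStartIndex) {
    // The original lives in a previous chunk: copy it into the current chunk
    // (this copy becomes the new original), so the oldest remaining chunk
    // always contains full usable stacks.
    return SleepAction::CopyFullStack;
  }
  // Original is in the current chunk: a small same-as-previous entry suffices.
  return SleepAction::RecordReference;
}
```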
Comment 4 (Assignee) • 4 years ago
Starting work on this bug.
I'll go with the plan I stated in comment 3.
On top of that:
- I think it's actually useful to record each sampling time. Sometimes we notice gaps between samplings (which may be due to external influences), and this interesting clue would be lost if we switched to single samples with long durations instead.
- Each sample also contains CPU measurements, which may sometimes be non-zero during AUTO_PROFILER_THREAD_SLEEP (most probably due to OS activity). This was not a feature yet when this bug was first filed.
I'm changing the bug title to reflect what I'm going to do.
I don't think we need a bug to implement sample durations covering multiple samplings (mainly for reason #1 above), but please open one if you think of good reasons.
Comment 5 (Assignee) • 4 years ago
First prototype: code in Try.
Quick test, sampling all tracks after Firefox startup, with a limited buffer, and doing nothing for a while:
Before: https://share.firefox.dev/3lSPa0o
After: https://share.firefox.dev/3lVGPZH
Samples captured in the parent main thread: 1563 -> 2388, that's 53% better. 🎉
Running time captured: 3.1s -> 4.8s, 55% better.
So we can capture more data with the same buffer size.
Time spent handling identical stacks (during ~30s): 2.1s -> 0.3s, that's 7 times better! 🎉🎉🎉
Before, we copied full stacks 762,750 times; after, we only copied 29,305, and used the new index-to-previous-stack entries 828,880 times! (More references, but they take much less time and memory.)
So the profiling overhead of sleeping threads is much lower, this should help a lot when capturing lots of threads, assuming most would be sleeping.
I think this is a success for a fairly small code change! I will finalize the patches soon...
Now, I was hoping for a much better space improvement; I may have overestimated stack sizes relative to the remainder of sample entries (tag, timestamp, CPU measurements). I will try to investigate further, in case there are other easy wins there.
Comment 6 (Assignee) • 4 years ago
Some measurements:
On the parent main thread, a typical duplicated stack entry takes 314 bytes.
With the prototype patch above, the new no-copy entry was 38 bytes, around 8 times smaller. I think the reason we only saw a 1.5x improvement in effective data amount is that there is so much more other data, in particular screenshots and markers.
While investigating entry sizes, I found two size optimizations that brought the no-copy entry size down to 25 bytes.
Another before/after test, no screenshots, looking at the Socket process main thread: 3332 samples / 6.66s -> 7863 samples / 15.7s, 2.4x better 😁
Comment 7 (Assignee) • 4 years ago
PROFILER_RUNTIME_STATS was broken recently. It's only used during development (mostly to benchmark new code), so it's not critical and no tests are needed.
Comment 8 (Assignee) • 4 years ago
Depends on D122677
Comment 9 (Assignee) • 4 years ago
Instead of copying the full stack from the previous sample when it is identical, the new ProfileBufferEntryKind::TimeBeforeSameSample + SameSample entry pair indicates that this is an identical sample. Later, when producing the final JSON profile, we can just re-use the same sample identifier as before.
This effectively lowers the size of this kind of entry from hundreds of bytes, down to 20-30 bytes, which should help with capturing more samples in the same buffer size. And it also uses less CPU resources, since we don't need to find the previous stack and copy it.
We still need to perform a full copy at the start of a buffer chunk, to make sure there is always a full stack available in case older previous chunks have been destroyed.
Depends on D122678
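At JSON-serialization time, the reuse described above might look roughly like this (a hypothetical flattened sketch, not the actual Gecko buffer-reading code):

```cpp
#include <cstddef>
#include <vector>

struct JsonSample {
  double time;
  int stackId;
};

// aIsSameSample[i] is true when entry i was a TimeBeforeSameSample/SameSample
// pair: we then reuse the previous sample's stack instead of deserializing a
// full CompactStack for it.
std::vector<JsonSample> BuildSamples(const std::vector<double>& aTimes,
                                     const std::vector<int>& aStackIds,
                                     const std::vector<bool>& aIsSameSample) {
  std::vector<JsonSample> out;
  for (std::size_t i = 0; i < aTimes.size(); ++i) {
    int stackId = (aIsSameSample[i] && !out.empty())
                      ? out.back().stackId  // reuse previous sample's stack
                      : aStackIds[i];       // full stack entry
    out.push_back({aTimes[i], stackId});
  }
  return out;
}
```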
Comment 10 (Assignee) • 4 years ago
The majority of RunningTimes objects contain zero or near-zero values (no/low activity), so it makes sense to serialize them using ULEB128, which only takes 1 byte for values under 128, instead of the full 8 bytes.
And high values in the millions would still serialize in 4 bytes or less, half of the 8-byte type size.
This patch is particularly useful now because SameSample entries are much smaller than CompactStack entries, so the size taken by RunningTimes is much more significant; saving up to 7 bytes means we could record up to 24% more of these entries in the same buffer space.
Depends on D122679
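For reference, ULEB128 stores 7 value bits per byte with a continuation bit on all but the last byte; a minimal encoder sketch (not the actual Gecko serialization code) looks like this:

```cpp
#include <cstdint>
#include <vector>

// Encode an unsigned 64-bit value as ULEB128 bytes: low 7 bits per byte,
// high bit set when more bytes follow. Values < 128 take a single byte;
// values in the millions fit in 4 bytes (4 * 7 = 28 bits).
std::vector<uint8_t> EncodeULEB128(uint64_t aValue) {
  std::vector<uint8_t> bytes;
  do {
    uint8_t byte = aValue & 0x7f;  // take the low 7 bits
    aValue >>= 7;
    if (aValue != 0) {
      byte |= 0x80;  // continuation bit: more bytes follow
    }
    bytes.push_back(byte);
  } while (aValue != 0);
  return bytes;
}
```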
Comment 11 • 4 years ago
Comment 12 (bugherder) • 4 years ago
https://hg.mozilla.org/mozilla-central/rev/7fff5ffa9b08
https://hg.mozilla.org/mozilla-central/rev/6c1fbffae7f6
https://hg.mozilla.org/mozilla-central/rev/dce613844060
https://hg.mozilla.org/mozilla-central/rev/e197b02749a2