Closed
Bug 1344258
Opened 8 years ago
Closed 8 years ago
Remove integer mod from loop in ProfileBuffer::FindLastSampleOfThread
Categories
(Core :: Gecko Profiler, enhancement)
Core
Gecko Profiler
Tracking
()
RESOLVED
FIXED
mozilla55
Tracking | Status | |
---|---|---|
firefox55 | --- | fixed |
People
(Reporter: jseward, Assigned: jseward)
Details
Attachments
(1 file)
1.57 KB,
patch
|
n.nethercote
:
review+
|
Details | Diff | Splinter Review |
This follows up on the performance fix in bug 1344118. Even with
that fix in place, by far the most expensive part of the profiler
core is ProfileBuffer::FindLastSampleOfThread. At least, on Linux
x86_64, with LUL unwinding, 9MB buffer, 1 KHz sampling, and profiling
all threads.
ProfileBuffer::FindLastSampleOfThread requires 2 32-bit integer %
instructions per loop. If we assume a CPU core with a single divider
unit that can produce 2 result bits per cycle, that's a 32 cycle
cost per iteration, which is high considering that the rest of the
loop is a load, two or three conditional branches and some integer
decrementing. The mods are not even necessary since we can just
do an underflow and wraparound check in the obvious way, and since
that will be almost perfectly predicted, it will be very cheap.
Assignee | ||
Comment 1•8 years ago
|
||
To test a fix for this, I made a new implementation of FindLastSampleOfThread
and ran it together with the old implementation, asserting that the same
values are produced. I also tested with the fix for bug 1344118 disabled, so
as to check the "not found, wraparound" case.
This gives the opportunity to accurately compare old- vs new- with
Linux's "perf" profiler. Typical results are
73.41% SamplerThread [.] ProfileBuffer::FindLastSampleOfThread_OLD
11.04% SamplerThread [.] ProfileBuffer::FindLastSampleOfThread_NEW
that is, the new version is 6.5 times cheaper than the old.
In terms of overall impact, with the browser idling and profiler running
in the configuration specified in comment 0, I get, from "top", this
(roughly -- there's a lot of noise):
old min cpu use 42%
new min cpu use 22%
Another way to characterise it is to compare the relative costs of
ProfileBuffer::FindLastSampleOfThread with the next hottest function,
lul::SecMap::FindRuleSet:
Old:
82.21% SamplerThread ProfileBuffer::FindLastSampleOfThread_OLD
2.93% SamplerThread lul::SecMap::FindRuleSet
New:
30.90% SamplerThread ProfileBuffer::FindLastSampleOfThread_NEW
5.71% SamplerThread lul::SecMap::FindRuleSet
With the old implementation, FindLastSampleOfThread is 28.05 times more
expensive than FindRuleSet. With the new implementation, it is 5.4 times
more expensive. That implies a cost reduction for FindLastSampleOfThread
of about 5.2, which is more or less consistent with the direct measurements
at the start of this comment.
Comment 2•8 years ago
|
||
Nice!
It sounds like you improved the search loop. You also talked about recording the last sample position directly and skipping the search loop entirely. Have you tried that?
Assignee | ||
Comment 3•8 years ago
|
||
(In reply to Nicholas Nethercote [:njn] from comment #2)
That could be a follow up to this, but it's not entirely straightforward:
* We'd need to handle the case where there is a last-position recorded for a
thread, but since then we've overwritten the entire buffer, so that the
recorded position is invalid. Tedious, but probably not a big deal.
* More seriously, the obvious implementation is to have a mozilla::Vector that
maps thread IDs to their last sample position. The tricky bit comes when we
have to extend the vector to accomodate a new thread. Then we have to be
sure that we're not in a critical section, else we risk deadlock. Not sure
what to do about that.
Assignee | ||
Comment 4•8 years ago
|
||
This changes the number of integer mod uses from 2 per loop iteration
(there are typically around 500 iterations per call) to 1 per call.
Attachment #8843579 -
Flags: review?(n.nethercote)
Comment 5•8 years ago
|
||
Comment on attachment 8843579 [details] [diff] [review]
Remove integer mod from loop in ProfileBuffer::FindLastSampleOfThread. r=n.nethercote.
Review of attachment 8843579 [details] [diff] [review]:
-----------------------------------------------------------------
You could avoid the condition within the loop by first checking if mReadPos < mWritePos. If so, use a loop that doesn't check for wraparound. Otherwise, use two loops, one that counts down to zero and then one that counts down from the buffer's top. But it probably isn't worth the extra code, so this will suffice :)
::: tools/profiler/core/ProfileBufferEntry.cpp
@@ +745,5 @@
> int ProfileBuffer::FindLastSampleOfThread(int aThreadId)
> {
> // We search backwards from mWritePos-1 to mReadPos.
> // Adding mEntrySize makes the result of the modulus positive.
> + int readPos = (mWritePos + mEntrySize - 1) % mEntrySize;
Nit: I suspect currPos would be a better name than readPos.
@@ +756,5 @@
> + readPos--;
> + if (readPos < 0) {
> + // This almost never happens, so is perfectly predicted, and hence
> + // almost free.
> + readPos = mEntrySize-1;
Nit: spaces around '-'.
Attachment #8843579 -
Flags: review?(n.nethercote) → review+
Comment 6•8 years ago
|
||
(In reply to Julian Seward [:jseward] from comment #3)
> (In reply to Nicholas Nethercote [:njn] from comment #2)
>
> That could be a follow up to this, but it's not entirely straightforward:
>
> * We'd need to handle the case where there is a last-position recorded for a
> thread, but since then we've overwritten the entire buffer, so that the
> recorded position is invalid. Tedious, but probably not a big deal.
AFAICT the two places that would need to handle it are ProfileBuffer::reset() and the (mWritePos == mReadPos) case in ProfileBuffer::addTag().
> * More seriously, the obvious implementation is to have a mozilla::Vector
> that
> maps thread IDs to their last sample position. The tricky bit comes when
> we
> have to extend the vector to accomodate a new thread. Then we have to be
> sure that we're not in a critical section, else we risk deadlock. Not sure
> what to do about that.
We could add another field to ThreadInfo instead. ProfileBuffer.cpp currently doesn't deal with TheadInfos, so it would increase module entanglement a bit.
You should definitely land this patch, because comment 1 shows such a big improvement. But that comment does also show that additional improvement might be worth the effort.
Pushed by jseward@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/6f118a165c53
Remove integer mod from loop in ProfileBuffer::FindLastSampleOfThread. r=n.nethercote.
Comment 8•8 years ago
|
||
bugherder |
Status: NEW → RESOLVED
Closed: 8 years ago
status-firefox55:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
You need to log in
before you can comment on or make changes to this bug.
Description
•