Closed Bug 1344258 Opened 8 years ago Closed 8 years ago

Remove integer mod from loop in ProfileBuffer::FindLastSampleOfThread

Categories

(Core :: Gecko Profiler, enhancement)

enhancement
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla55
Tracking Status
firefox55 --- fixed

People

(Reporter: jseward, Assigned: jseward)

Details

Attachments

(1 file)

This follows up on the performance fix in bug 1344118. Even with that fix in place, by far the most expensive part of the profiler core is ProfileBuffer::FindLastSampleOfThread. At least, on Linux x86_64, with LUL unwinding, 9MB buffer, 1 KHz sampling, and profiling all threads. ProfileBuffer::FindLastSampleOfThread requires 2 32-bit integer % instructions per loop. If we assume a CPU core with a single divider unit that can produce 2 result bits per cycle, that's a 32 cycle cost per iteration, which is high considering that the rest of the loop is a load, two or three conditional branches and some integer decrementing. The mods are not even necessary since we can just do an underflow and wraparound check in the obvious way, and since that will be almost perfectly predicted, it will be very cheap.
To test a fix for this, I made a new implementation of FindLastSampleOfThread and ran it together with the old implementation, asserting that the same values are produced. I also tested with the fix for bug 1344118 disabled, so as to check the "not found, wraparound" case. This gives the opportunity to accurately compare old- vs new- with Linux's "perf" profiler. Typical results are 73.41% SamplerThread [.] ProfileBuffer::FindLastSampleOfThread_OLD 11.04% SamplerThread [.] ProfileBuffer::FindLastSampleOfThread_NEW that is, the new version is 6.5 times cheaper than the old. In terms of overall impact, with the browser idling and profiler running in the configuration specified in comment 0, I get, from "top", this (roughly -- there's a lot of noise): old min cpu use 42% new min cpu use 22% Another way to characterise it is to compare the relative costs of ProfileBuffer::FindLastSampleOfThread with the next hottest function, lul::SecMap::FindRuleSet: Old: 82.21% SamplerThread ProfileBuffer::FindLastSampleOfThread_OLD 2.93% SamplerThread lul::SecMap::FindRuleSet New: 30.90% SamplerThread ProfileBuffer::FindLastSampleOfThread_NEW 5.71% SamplerThread lul::SecMap::FindRuleSet With the old implementation, FindLastSampleOfThread is 28.05 times more expensive than FindRuleSet. With the new implementation, it is 5.4 times more expensive. That implies a cost reduction for FindLastSampleOfThread of about 5.2, which is more or less consistent with the direct measurements at the start of this comment.
Nice! It sounds like you improved the search loop. You also talked about recording the last sample position directly and skipping the search loop entirely. Have you tried that?
(In reply to Nicholas Nethercote [:njn] from comment #2) That could be a follow up to this, but it's not entirely straightforward: * We'd need to handle the case where there is a last-position recorded for a thread, but since then we've overwritten the entire buffer, so that the recorded position is invalid. Tedious, but probably not a big deal. * More seriously, the obvious implementation is to have a mozilla::Vector that maps thread IDs to their last sample position. The tricky bit comes when we have to extend the vector to accomodate a new thread. Then we have to be sure that we're not in a critical section, else we risk deadlock. Not sure what to do about that.
This changes the number of integer mod uses from 2 per loop iteration (there are typically around 500 iterations per call) to 1 per call.
Attachment #8843579 - Flags: review?(n.nethercote)
Comment on attachment 8843579 [details] [diff] [review] Remove integer mod from loop in ProfileBuffer::FindLastSampleOfThread. r=n.nethercote. Review of attachment 8843579 [details] [diff] [review]: ----------------------------------------------------------------- You could avoid the condition within the loop by first checking if mReadPos < mWritePos. If so, use a loop that doesn't check for wraparound. Otherwise, use two loops, one that counts down to zero and then one that counts down from the buffer's top. But it probably isn't worth the extra code, so this will suffice :) ::: tools/profiler/core/ProfileBufferEntry.cpp @@ +745,5 @@ > int ProfileBuffer::FindLastSampleOfThread(int aThreadId) > { > // We search backwards from mWritePos-1 to mReadPos. > // Adding mEntrySize makes the result of the modulus positive. > + int readPos = (mWritePos + mEntrySize - 1) % mEntrySize; Nit: I suspect currPos would be a better name than readPos. @@ +756,5 @@ > + readPos--; > + if (readPos < 0) { > + // This almost never happens, so is perfectly predicted, and hence > + // almost free. > + readPos = mEntrySize-1; Nit: spaces around '-'.
Attachment #8843579 - Flags: review?(n.nethercote) → review+
(In reply to Julian Seward [:jseward] from comment #3) > (In reply to Nicholas Nethercote [:njn] from comment #2) > > That could be a follow up to this, but it's not entirely straightforward: > > * We'd need to handle the case where there is a last-position recorded for a > thread, but since then we've overwritten the entire buffer, so that the > recorded position is invalid. Tedious, but probably not a big deal. AFAICT the two places that would need to handle it are ProfileBuffer::reset() and the (mWritePos == mReadPos) case in ProfileBuffer::addTag(). > * More seriously, the obvious implementation is to have a mozilla::Vector > that > maps thread IDs to their last sample position. The tricky bit comes when > we > have to extend the vector to accomodate a new thread. Then we have to be > sure that we're not in a critical section, else we risk deadlock. Not sure > what to do about that. We could add another field to ThreadInfo instead. ProfileBuffer.cpp currently doesn't deal with TheadInfos, so it would increase module entanglement a bit. You should definitely land this patch, because comment 1 shows such a big improvement. But that comment does also show that additional improvement might be worth the effort.
Pushed by jseward@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/6f118a165c53 Remove integer mod from loop in ProfileBuffer::FindLastSampleOfThread. r=n.nethercote.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: