Closed
Bug 1345032
Opened 7 years ago
Closed 7 years ago
Further cost reductions for ProfileBuffer::FindLastSampleOfThread
Categories
(Core :: Gecko Profiler, enhancement)
Core
Gecko Profiler
Tracking
()
RESOLVED
FIXED
mozilla55
Tracking | Status | |
---|---|---|
firefox55 | --- | fixed |
People
(Reporter: jseward, Assigned: jseward)
Details
Attachments
(2 files, 1 obsolete file)
14.24 KB,
patch
|
n.nethercote
:
review-
|
Details | Diff | Splinter Review |
14.99 KB,
patch
|
n.nethercote
:
review+
|
Details | Diff | Splinter Review |
FindLastSampleOfThread currently involves a linear search backwards through the sample buffer. Profiling showed that to be the largest profiler cost by far. Bugs 1344118 and 1344258 significantly improve the situation, collectively reducing the cost by a factor of at least 5 and often much more. But the linear search is still present and still dominant. The worst of it is that it's unnecessary: we could achieve the same by using a table, one entry per thread, that records the start of the most recent sample for each thread.
![]() |
||
Comment 1•7 years ago
|
||
From bug 1344258 comment 6, the places that need to be handled: > AFAICT the two places that would need to handle it are ProfileBuffer::reset() > and the (mWritePos == mReadPos) case in ProfileBuffer::addTag(). and as for storing the data: > We could add another field to ThreadInfo instead. ProfileBuffer.cpp currently > doesn't deal with TheadInfos, so it would increase module entanglement a bit.
Assignee | ||
Comment 2•7 years ago
|
||
(In reply to Nicholas Nethercote [:njn] from comment #1) > and as for storing the data: > > > We could add another field to ThreadInfo instead. ProfileBuffer.cpp currently > > doesn't deal with TheadInfos, so it would increase module entanglement a bit. There's one ThreadInfo for each thread, whereas the ProfileBuffer is shared between all threads. So we'd have to go fish around in gPS->mThreads to find the relevant ThreadInfo. At least so it seems from looking at the code. Does that sync with your understanding, or am I missing something? An alternative that I've been mulling over to change ThreadId entries to contain a pair of ints, one of which is the thread's ID as at present and the other one is the index of the previous ThreadId entry. That way the ThreadId entries are chained together, and we can search back without having to examine all intervening entries. It has the advantage of no further module entanglement and no extra dynamic memory allocation. It's still a linear search backwards but with a cost of perhaps 1/100th what we have at the moment. ProfileBufferEntries currently may contain doubles (eg ResidentMemory) so having a pair of 32-bit ints wouldn't make them any larger.
Flags: needinfo?(n.nethercote)
Assignee | ||
Comment 3•7 years ago
|
||
> An alternative that I've been mulling over to change ThreadId entries to
> contain a pair of ints, one of which is the thread's ID as at present [..]
This works, but is rather inelegant and only gives a cost reduction of about a
factor of 4 for FindLastSampleOfThread. It might be that hopping backwards
along a linked list is cache-unfriendly.
A more elegant solution would be to return to the original plan, but put the
table that maps thread-ids to last-sample-indices in ProfileBuffer. That
seems a fitting place for it and doesn't involve any further cross module
entanglement. When the ProfileBuffer is created we know how many threads
will be associated with it and so the table can be dynamically allocated at
that point.
![]() |
||
Comment 4•7 years ago
|
||
> > > We could add another field to ThreadInfo instead. ProfileBuffer.cpp currently > > > doesn't deal with TheadInfos, so it would increase module entanglement a bit. > > There's one ThreadInfo for each thread, whereas the ProfileBuffer is shared > between all threads. So we'd have to go fish around in gPS->mThreads to find > the relevant ThreadInfo. At least so it seems from looking at the code. > Does > that sync with your understanding, or am I missing something? Which operation are you talking about? When we are sampling we have the relevant ThreadInfo and ProfileBuffer at hand. Are you talking about when ProfileBuffer is reset? At that point we might have to go through gPS->mThreads to clear all the cached "last sample" entries. Or we could avoid that by using the existing generation counter in ProfileBuffer to indicate staleness. > A more elegant solution would be to return to the original plan, but put the > table that maps thread-ids to last-sample-indices in ProfileBuffer. That > seems a fitting place for it and doesn't involve any further cross module > entanglement. When the ProfileBuffer is created we know how many threads > will be associated with it and so the table can be dynamically allocated at > that point. Can't more threads be added later? In which case you can't allocate a fixed-size table. Variable-sized tables aren't that hard, though.
Flags: needinfo?(n.nethercote)
Assignee | ||
Comment 5•7 years ago
|
||
(In reply to Nicholas Nethercote [:njn] from comment #4) > > entanglement. When the ProfileBuffer is created we know how many threads > > will be associated with it and so the table can be dynamically allocated at > > that point. > > Can't more threads be added later? Yes, I discovered that later (locked_register_thread). Yes, resizing the table accordingly is easy.
Assignee | ||
Comment 6•7 years ago
|
||
ProfileBuffer::FindLastSampleOfThread currently involves a linear search backwards through the sample buffer. Profiling showed that to be the largest profiler cost by far, at least on Linux. Bugs 1344118 and 1344258 significantly improve the situation, collectively reducing the cost by a factor of at least 5 and often much more. But the linear search is still present and still dominant. The worst of it is that it's unnecessary: we could achieve the same by using a table, one entry per thread, that records the start of the most recent sample for each thread. This patch adds such a table to ProfileBuffer. The table is updated in ProfileBuffer::addTag when a sample-start entry is observed, and is queried in ProfileBuffer::FindLastSampleOfThread. This is straightforward apart from the minor twist that we can't resize the table in the profiler critical section without risking deadlock. The table is an array of pairs, with lookup being a simple linear traversal. Because the array is typically small -- less than 100 entries -- the cost is tolerable. A followup to change it to binary search might be advisable. This patch reduces the cost of FindLastSampleOfThread by about a factor of 5, putting it on a par with the overall unwinding costs on Linux.
Attachment #8848930 -
Flags: review?(n.nethercote)
![]() |
||
Comment 7•7 years ago
|
||
Comment on attachment 8848930 [details] [diff] [review] Further cost reductions for ProfileBuffer::FindLastSampleOfThread Review of attachment 8848930 [details] [diff] [review]: ----------------------------------------------------------------- ::: tools/profiler/core/ProfileBuffer.cpp @@ +12,5 @@ > , mEntrySize(aEntrySize) > , mGeneration(0) > + , mLSMap(static_cast<LSMEntry*>(malloc(nThreads * sizeof(LSMEntry)))) > + , mLSMapSize(nThreads) > + , mLSMapUsed(0) This is very old-school C style code! A more C++/Mozilla-y way to do it would be with mozilla::Vector, which keeps track of length, capacity, allocation, resizing, and deallocation for you. That would simplify the patch considerably. You also wouldn't have to pass in nThreads here. (That's what I meant by "Variable-sized tables aren't that hard" in comment 5.) But I think the extra table can be avoided entirely. Here's my suggestion: - ThreadInfo gets two extra fields: a pointer (or index) to the last sample, and the ProfileBuffer's generation when that sample was taken. No need for a separate table. - DuplicateLastSample() gets two extra arguments: the above two fields. (The relevant ThreadInfo is available at the callsite.) It only does the duplication if the pointer is valid, which involves checking that the generation matches the ProfileBuffer's current generation. By passing in these arguments, we can avoid ProfileBuffer having a dependence on ThreadInfo. The lookup is also O(1). - FindLastSampleOfThread() can be deleted. Thoughts? ::: tools/profiler/core/ProfileBufferEntry.cpp @@ +759,5 @@ > + MOZ_RELEASE_ASSERT(ix >= 0 && ix < mEntrySize); > + ProfileBufferEntry entry = mEntries[ix]; > + if (entry.isThreadId() && entry.mTagInt == aThreadId) { > + // It's still valid. > + return ix; Could we get unlucky here? Could the sample have been overwritten by the same thread? Not sure, though it's irrelevant if we take my suggested approach. ::: tools/profiler/core/platform.cpp @@ +1742,5 @@ > + // its last-sample-map to accommodate this new thread. > + ProfileBuffer* buffer = gPS->Buffer(aLock); > + if (buffer) { > + buffer->BumpLSMSize(); > + } This is a bit gross. Using mozilla::Vector would avoid that, as would avoiding the table altogether.
Attachment #8848930 -
Flags: review?(n.nethercote) → review-
![]() |
||
Comment 8•7 years ago
|
||
Oh, I forgot about ProfileBuffer::addTag(), which needs to record the last sample position and generation. It looks like you can pass in pointers to the two fields and it can fill them in. Even better, if you wrap up the two fields into a LastSample struct that will make the passing neater.
Assignee | ||
Comment 9•7 years ago
|
||
(In reply to Nicholas Nethercote [:njn] from comment #7) > But I think the extra table can be avoided entirely. Here's my suggestion: [..] It's a nice proposal. Patch to follow. Resulting performance is the best yet. > Even better, if you wrap up the two fields into a LastSample struct that will > make the passing neater. I did that, and also added a third read-only field, which is the thread ID. This simplifies various bits because it means we don't have to pass around the thread ID to which a LastSample pertains, and we can't pass the wrong thread ID. An alternative would have been to make the third field be a ThreadInfo*, but that increases module entanglement and pointer chasing. > ::: tools/profiler/core/ProfileBufferEntry.cpp > @@ +759,5 @@ > > + MOZ_RELEASE_ASSERT(ix >= 0 && ix < mEntrySize); > > + ProfileBufferEntry entry = mEntries[ix]; > > + if (entry.isThreadId() && entry.mTagInt == aThreadId) { > > + // It's still valid. > > + return ix; > > Could we get unlucky here? Could the sample have been overwritten by the > same thread? Not sure, though it's irrelevant if we take my suggested > approach. That check is still required with the new approach, because the alternative is to have addTag check each sample to see if we're overwriting a sample start marker. That means we'd have to somehow get hold of associated LastSample and invalidate it. Also, it means that addTag drags all old samples up through the memory hierarchy merely to overwrite them. From Linux "perf" numbers, I am beginning to suspect that the profiler's performance is already largely memory limited. I tested the logic by running the old and new FindLastSample implementations in parallel for several minutes (several tens of millions of entries) and asserting that they produce identical results, with the buffer size set to minimum (90k) so as to maximise the number of wraparounds. Testing like that has found bugs in the various predecessor patches. If we adopted the indexing scheme proposed in bug 1348959, I think this validity testing would become simpler.
Assignee | ||
Comment 10•7 years ago
|
||
ProfileBuffer::FindLastSampleOfThread currently involves a linear search backwards through the sample buffer. Profiling showed that to be the largest profiler cost by far, at least on Linux. Bugs 1344118 and 1344258 significantly improve the situation, collectively reducing the cost by a factor of at least 5 and often much more. But the linear search is still present and still dominant. The worst of it is that it's unnecessary: we could achieve the same by recording the start point of the most recent sample for each thread in that thread's ThreadInfo record. This patch does exactly that, adding the type ProfileBuffer::LastSample to store the start points. LastSample also includes the ID of the thread it pertains to as a read-only field, as that is needed in various places. addTag doesn't check whether we're overwriting buffer entries containing start points. Instead, FindLastSample checks whether the entry pointed to the LastSample it is given still contains a marker.
Attachment #8849322 -
Flags: review?(n.nethercote)
![]() |
||
Comment 11•7 years ago
|
||
Comment on attachment 8849322 [details] [diff] [review] Further cost reductions for ProfileBuffer::FindLastSampleOfThread Review of attachment 8849322 [details] [diff] [review]: ----------------------------------------------------------------- Looking good. There are enough nits that I'd like to see it again before landing. Thanks. ::: tools/profiler/core/ProfileBuffer.cpp @@ +38,5 @@ > mReadPos = (mReadPos + 1) % mEntrySize; > } > } > > +void ProfileBuffer::addTagThreadId(LastSample& ls) Param should be |aLS| or |aLastSample|. ::: tools/profiler/core/ProfileBuffer.h @@ +26,5 @@ > + : mThreadId(aThreadId) > + , mGeneration(0) > + , mPos(-1) > + {} > + // The thread to which this LastSample pertains. Nit: blank line after the constructor. @@ +30,5 @@ > + // The thread to which this LastSample pertains. > + const int mThreadId; > + // The profiler-buffer generation number at which the sample was created. > + uint32_t mGeneration; > + // And its position in the buffer. Mention in the comment that -1 means "invalid". @@ +39,5 @@ > void addTag(const ProfileBufferEntry& aTag); > + > + // Add to the buffer, a sample start (ThreadId) entry, for the thread that > + // |ls| belongs to, and record the resulting generation and index in |ls|. > + void addTagThreadId(LastSample& ls); |aLS| or |aLastSample|. @@ +51,5 @@ > + > + // Find the most recent sample for the thread denoted by |ls| and clone it, > + // patching in the |aStartTime| as appropriate. > + bool DuplicateLastSample(const mozilla::TimeStamp& aStartTime, > + LastSample& ls); |aLS| or |aLastSample|. What line length do you have your text editor set to? https://developer.mozilla.org/en-US/docs/Mozilla/Developer_guide/Coding_Style says line lengths should be 80 chars or fewer. This declaration is 79 chars, so can easily fit on a single line. I think a number of the comments in this patch could be reflowed to take up more of each line, too. @@ +63,5 @@ > size_t SizeOfIncludingThis(mozilla::MallocSizeOf aMallocSizeOf) const; > > protected: > char* processDynamicTag(int readPos, int* tagsConsumed, char* tagBuff); > + int FindLastSampleOfThread(const LastSample& ls); |aLS| or |aLastSample|. ::: tools/profiler/core/ProfileBufferEntry.cpp @@ +741,5 @@ > readPos = (readPos + 1) % mEntrySize; > } > } > > +int ProfileBuffer::FindLastSampleOfThread(const LastSample& ls) |aLS| or |aLastSample|. @@ +748,5 @@ > + // generation, or is one behind the buffer's generation, since the buffer's > + // generation is incremented on wraparound. There's no ambiguity relative > + // to ProfileBuffer::reset, since that increments mGeneration by two. > + if (ls.mGeneration == mGeneration || > + (mGeneration > 0 && ls.mGeneration == mGeneration-1)) { Nit: space around '-'. @@ +752,5 @@ > + (mGeneration > 0 && ls.mGeneration == mGeneration-1)) { > + int ix = ls.mPos; > + // It might be that the sample has since been overwritten, so check > + // that it is still valid. > + MOZ_RELEASE_ASSERT(ix >= 0 && ix < mEntrySize); Nit: I write conditions like these as |0 <= ix && ix < mEntrySize|... Also, isn't -1 a possible value for mPos at this point? @@ +753,5 @@ > + int ix = ls.mPos; > + // It might be that the sample has since been overwritten, so check > + // that it is still valid. > + MOZ_RELEASE_ASSERT(ix >= 0 && ix < mEntrySize); > + ProfileBufferEntry entry = mEntries[ix]; Make this a reference (|ProfileBufferEntry&|) so we don't copy it unnecessarily. @@ +757,5 @@ > + ProfileBufferEntry entry = mEntries[ix]; > + if (entry.isThreadId() && entry.mTagInt == ls.mThreadId) { > + // It's still valid. > + return ix; > + } else { Moz style says no else after return... in this case it might look awkward. I wonder if using ?: might help. @@ +761,5 @@ > + } else { > + // The entry has been overwritten. Given that |ix| is the index of the > + // thread denoted by |ls|, we can be sure that there are no other older > + // samples for it in the buffer either, since they would have been > + // overwritten before this most recent one. Is the absence of older samples relevant? I don't think so, and mentioning them just confused me. @@ +767,3 @@ > } > + } else { > + MOZ_ASSERT(ls.mGeneration <= mGeneration-2); Nit: space around '-'. @@ +772,4 @@ > } > > bool > +ProfileBuffer::DuplicateLastSample(const TimeStamp& aStartTime, LastSample& ls) |aLS| or |aLastSample|. Also, can it be a |const| reference?
Attachment #8849322 -
Flags: review?(n.nethercote) → feedback+
Assignee | ||
Comment 12•7 years ago
|
||
ProfileBuffer::FindLastSampleOfThread currently involves a linear search backwards through the sample buffer. Profiling showed that to be the largest profiler cost by far, at least on Linux. Bugs 1344118 and 1344258 significantly improve the situation, collectively reducing the cost by a factor of at least 5 and often much more. But the linear search is still present and still dominant. The worst of it is that it's unnecessary: we could achieve the same by recording the start point of the most recent sample for each thread in that thread's ThreadInfo record. This patch does exactly that, adding the type ProfileBuffer::LastSample to store the start points. LastSample also includes the ID of the thread it pertains to as a read-only field, as that is needed in various places. addTag doesn't check whether we're overwriting buffer entries containing start points. Instead, FindLastSample checks whether the entry pointed to the LastSample it is given still contains a marker.
Attachment #8849472 -
Flags: review?(n.nethercote)
Assignee | ||
Updated•7 years ago
|
Attachment #8849322 -
Attachment is obsolete: true
Assignee | ||
Comment 13•7 years ago
|
||
(In reply to Nicholas Nethercote [:njn] from comment #11) > Comment on attachment 8849322 [details] [diff] [review] > Further cost reductions for ProfileBuffer::FindLastSampleOfThread > > Review of attachment 8849322 [details] [diff] [review]: [..] All nits fixed. > > + bool DuplicateLastSample(const mozilla::TimeStamp& aStartTime, > > + LastSample& ls); > > |aLS| or |aLastSample|. > > What line length do you have your text editor set to? [..] emacs thought the one-line version was already 80, and with ls->aLS it's now 81. So it's still wrapped. > > + // It might be that the sample has since been overwritten, so check > > + // that it is still valid. > > + MOZ_RELEASE_ASSERT(ix >= 0 && ix < mEntrySize); > > Also, isn't -1 a possible value for mPos at this point? Hmm, that's a good point. That assertion never failed in testing, and I suspect that's because of the logic associated with CanDuplicateLastSampleDueToSleep, but I didn't follow it entirely. In any case I added a test here to be on the safe side. > > bool > > +ProfileBuffer::DuplicateLastSample(const TimeStamp& aStartTime, LastSample& ls) > > Also, can it be a |const| reference? No, because it gets modified by the call to addTagThreadId.
![]() |
||
Comment 14•7 years ago
|
||
Comment on attachment 8849472 [details] [diff] [review] Further cost reductions for ProfileBuffer::FindLastSampleOfThread Review of attachment 8849472 [details] [diff] [review]: ----------------------------------------------------------------- ::: tools/profiler/core/ProfileBuffer.cpp @@ +38,5 @@ > mReadPos = (mReadPos + 1) % mEntrySize; > } > } > > +void ProfileBuffer::addTagThreadId(LastSample& aLS) I realize it's late to be saying this, but "Last" isn't a very good word to put in a name, because it's not clear if it means "previous" or "final". Would PrevSample or LatestSample or MostRecentSample be better? ::: tools/profiler/core/ProfileBufferEntry.cpp @@ +741,5 @@ > readPos = (readPos + 1) % mEntrySize; > } > } > > +int ProfileBuffer::FindLastSampleOfThread(const LastSample& aLS) Nit: return type on its own line.
Attachment #8849472 -
Flags: review?(n.nethercote) → review+
Assignee | ||
Comment 15•7 years ago
|
||
(In reply to Nicholas Nethercote [:njn] from comment #14) > > +void ProfileBuffer::addTagThreadId(LastSample& aLS) > > I realize it's late to be saying this, but "Last" isn't a very good word to > put in a name, because it's not clear if it means "previous" or "final". > Would PrevSample or LatestSample or MostRecentSample be better? MostRecentSample is a bit wordy. How about LatestSample? So .. shall I rename just the type, or also DuplicateLastSample() too? It looks a bit strange to have DuplicateLastSample(..., LatestSample& aLS).
Flags: needinfo?(n.nethercote)
Assignee | ||
Comment 16•7 years ago
|
||
To do this consistently would require renaming CanDuplicateLastSampleDueToSleep FindLastSampleOfThread at their various call sites, which makes the patch much bigger. LastSample isn't quite as descriptive as one might hope, but at least it's consistent with existing usage. How do you want to proceed?
![]() |
||
Comment 17•7 years ago
|
||
I am happy to let you decide, hence the r+. Sorry I didn't make it clear. Sticking with LastSample is good enough.
Flags: needinfo?(n.nethercote)
Comment 18•7 years ago
|
||
Pushed by jseward@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/6235799ad4fd Further cost reductions for ProfileBuffer::FindLastSampleOfThread. r=n.nethercote.
Comment 19•7 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/6235799ad4fd
Status: NEW → RESOLVED
Closed: 7 years ago
status-firefox55:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
You need to log in
before you can comment on or make changes to this bug.
Description
•