1345032 - Further cost reductions for ProfileBuffer::FindLastSampleOfThread

Assignee

Description

•

9 years ago

FindLastSampleOfThread currently involves a linear search backwards through the sample buffer. Profiling showed that to be the largest profiler cost by far. Bugs 1344118 and 1344258 significantly improve the situation, collectively reducing the cost by a factor of at least 5 and often much more. But the linear search is still present and still dominant. The worst of it is that it's unnecessary: we could achieve the same by using a table, one entry per thread, that records the start of the most recent sample for each thread.

Nicholas Nethercote [inactive]

Comment 1

•

9 years ago

From bug 1344258 comment 6, the places that need to be handled: > AFAICT the two places that would need to handle it are ProfileBuffer::reset() > and the (mWritePos == mReadPos) case in ProfileBuffer::addTag(). and as for storing the data: > We could add another field to ThreadInfo instead. ProfileBuffer.cpp currently > doesn't deal with TheadInfos, so it would increase module entanglement a bit.

Julian Seward [:jseward]

Assignee

Comment 2

•

9 years ago

(In reply to Nicholas Nethercote [:njn] from comment #1) > and as for storing the data: > > > We could add another field to ThreadInfo instead. ProfileBuffer.cpp currently > > doesn't deal with TheadInfos, so it would increase module entanglement a bit. There's one ThreadInfo for each thread, whereas the ProfileBuffer is shared between all threads. So we'd have to go fish around in gPS->mThreads to find the relevant ThreadInfo. At least so it seems from looking at the code. Does that sync with your understanding, or am I missing something? An alternative that I've been mulling over to change ThreadId entries to contain a pair of ints, one of which is the thread's ID as at present and the other one is the index of the previous ThreadId entry. That way the ThreadId entries are chained together, and we can search back without having to examine all intervening entries. It has the advantage of no further module entanglement and no extra dynamic memory allocation. It's still a linear search backwards but with a cost of perhaps 1/100th what we have at the moment. ProfileBufferEntries currently may contain doubles (eg ResidentMemory) so having a pair of 32-bit ints wouldn't make them any larger.

Flags: needinfo?(n.nethercote)

Julian Seward [:jseward]

Assignee

Comment 3

•

9 years ago

> An alternative that I've been mulling over to change ThreadId entries to > contain a pair of ints, one of which is the thread's ID as at present [..] This works, but is rather inelegant and only gives a cost reduction of about a factor of 4 for FindLastSampleOfThread. It might be that hopping backwards along a linked list is cache-unfriendly. A more elegant solution would be to return to the original plan, but put the table that maps thread-ids to last-sample-indices in ProfileBuffer. That seems a fitting place for it and doesn't involve any further cross module entanglement. When the ProfileBuffer is created we know how many threads will be associated with it and so the table can be dynamically allocated at that point.

Nicholas Nethercote [inactive]

Comment 4

•

9 years ago

> > > We could add another field to ThreadInfo instead. ProfileBuffer.cpp currently > > > doesn't deal with TheadInfos, so it would increase module entanglement a bit. > > There's one ThreadInfo for each thread, whereas the ProfileBuffer is shared > between all threads. So we'd have to go fish around in gPS->mThreads to find > the relevant ThreadInfo. At least so it seems from looking at the code. > Does > that sync with your understanding, or am I missing something? Which operation are you talking about? When we are sampling we have the relevant ThreadInfo and ProfileBuffer at hand. Are you talking about when ProfileBuffer is reset? At that point we might have to go through gPS->mThreads to clear all the cached "last sample" entries. Or we could avoid that by using the existing generation counter in ProfileBuffer to indicate staleness. > A more elegant solution would be to return to the original plan, but put the > table that maps thread-ids to last-sample-indices in ProfileBuffer. That > seems a fitting place for it and doesn't involve any further cross module > entanglement. When the ProfileBuffer is created we know how many threads > will be associated with it and so the table can be dynamically allocated at > that point. Can't more threads be added later? In which case you can't allocate a fixed-size table. Variable-sized tables aren't that hard, though.

Flags: needinfo?(n.nethercote)

Julian Seward [:jseward]

Assignee

Comment 5

•

9 years ago

(In reply to Nicholas Nethercote [:njn] from comment #4) > > entanglement. When the ProfileBuffer is created we know how many threads > > will be associated with it and so the table can be dynamically allocated at > > that point. > > Can't more threads be added later? Yes, I discovered that later (locked_register_thread). Yes, resizing the table accordingly is easy.

Julian Seward [:jseward]

Assignee

Comment 6

•

9 years ago

Attached patch Further cost reductions for ProfileBuffer::FindLastSampleOfThread — Details — Splinter Review

ProfileBuffer::FindLastSampleOfThread currently involves a linear search backwards through the sample buffer. Profiling showed that to be the largest profiler cost by far, at least on Linux. Bugs 1344118 and 1344258 significantly improve the situation, collectively reducing the cost by a factor of at least 5 and often much more. But the linear search is still present and still dominant. The worst of it is that it's unnecessary: we could achieve the same by using a table, one entry per thread, that records the start of the most recent sample for each thread. This patch adds such a table to ProfileBuffer. The table is updated in ProfileBuffer::addTag when a sample-start entry is observed, and is queried in ProfileBuffer::FindLastSampleOfThread. This is straightforward apart from the minor twist that we can't resize the table in the profiler critical section without risking deadlock. The table is an array of pairs, with lookup being a simple linear traversal. Because the array is typically small -- less than 100 entries -- the cost is tolerable. A followup to change it to binary search might be advisable. This patch reduces the cost of FindLastSampleOfThread by about a factor of 5, putting it on a par with the overall unwinding costs on Linux.

Attachment #8848930 - Flags: review?(n.nethercote)

Nicholas Nethercote [inactive]

Comment 7

•

9 years ago

Comment on attachment 8848930 [details] [diff] [review] Further cost reductions for ProfileBuffer::FindLastSampleOfThread Review of attachment 8848930 [details] [diff] [review]: ----------------------------------------------------------------- ::: tools/profiler/core/ProfileBuffer.cpp @@ +12,5 @@ > , mEntrySize(aEntrySize) > , mGeneration(0) > + , mLSMap(static_cast<LSMEntry*>(malloc(nThreads * sizeof(LSMEntry)))) > + , mLSMapSize(nThreads) > + , mLSMapUsed(0) This is very old-school C style code! A more C++/Mozilla-y way to do it would be with mozilla::Vector, which keeps track of length, capacity, allocation, resizing, and deallocation for you. That would simplify the patch considerably. You also wouldn't have to pass in nThreads here. (That's what I meant by "Variable-sized tables aren't that hard" in comment 5.) But I think the extra table can be avoided entirely. Here's my suggestion: - ThreadInfo gets two extra fields: a pointer (or index) to the last sample, and the ProfileBuffer's generation when that sample was taken. No need for a separate table. - DuplicateLastSample() gets two extra arguments: the above two fields. (The relevant ThreadInfo is available at the callsite.) It only does the duplication if the pointer is valid, which involves checking that the generation matches the ProfileBuffer's current generation. By passing in these arguments, we can avoid ProfileBuffer having a dependence on ThreadInfo. The lookup is also O(1). - FindLastSampleOfThread() can be deleted. Thoughts? ::: tools/profiler/core/ProfileBufferEntry.cpp @@ +759,5 @@ > + MOZ_RELEASE_ASSERT(ix >= 0 && ix < mEntrySize); > + ProfileBufferEntry entry = mEntries[ix]; > + if (entry.isThreadId() && entry.mTagInt == aThreadId) { > + // It's still valid. > + return ix; Could we get unlucky here? Could the sample have been overwritten by the same thread? Not sure, though it's irrelevant if we take my suggested approach. ::: tools/profiler/core/platform.cpp @@ +1742,5 @@ > + // its last-sample-map to accommodate this new thread. > + ProfileBuffer* buffer = gPS->Buffer(aLock); > + if (buffer) { > + buffer->BumpLSMSize(); > + } This is a bit gross. Using mozilla::Vector would avoid that, as would avoiding the table altogether.

Attachment #8848930 - Flags: review?(n.nethercote) → review-

Nicholas Nethercote [inactive]

Comment 8

•

9 years ago

Oh, I forgot about ProfileBuffer::addTag(), which needs to record the last sample position and generation. It looks like you can pass in pointers to the two fields and it can fill them in. Even better, if you wrap up the two fields into a LastSample struct that will make the passing neater.

Julian Seward [:jseward]

Assignee

Comment 9

•

9 years ago

(In reply to Nicholas Nethercote [:njn] from comment #7) > But I think the extra table can be avoided entirely. Here's my suggestion: [..] It's a nice proposal. Patch to follow. Resulting performance is the best yet. > Even better, if you wrap up the two fields into a LastSample struct that will > make the passing neater. I did that, and also added a third read-only field, which is the thread ID. This simplifies various bits because it means we don't have to pass around the thread ID to which a LastSample pertains, and we can't pass the wrong thread ID. An alternative would have been to make the third field be a ThreadInfo*, but that increases module entanglement and pointer chasing. > ::: tools/profiler/core/ProfileBufferEntry.cpp > @@ +759,5 @@ > > + MOZ_RELEASE_ASSERT(ix >= 0 && ix < mEntrySize); > > + ProfileBufferEntry entry = mEntries[ix]; > > + if (entry.isThreadId() && entry.mTagInt == aThreadId) { > > + // It's still valid. > > + return ix; > > Could we get unlucky here? Could the sample have been overwritten by the > same thread? Not sure, though it's irrelevant if we take my suggested > approach. That check is still required with the new approach, because the alternative is to have addTag check each sample to see if we're overwriting a sample start marker. That means we'd have to somehow get hold of associated LastSample and invalidate it. Also, it means that addTag drags all old samples up through the memory hierarchy merely to overwrite them. From Linux "perf" numbers, I am beginning to suspect that the profiler's performance is already largely memory limited. I tested the logic by running the old and new FindLastSample implementations in parallel for several minutes (several tens of millions of entries) and asserting that they produce identical results, with the buffer size set to minimum (90k) so as to maximise the number of wraparounds. Testing like that has found bugs in the various predecessor patches. If we adopted the indexing scheme proposed in bug 1348959, I think this validity testing would become simpler.

Julian Seward [:jseward]

Assignee

Comment 10

•

9 years ago

Attached patch Further cost reductions for ProfileBuffer::FindLastSampleOfThread (obsolete) — Details — Splinter Review

ProfileBuffer::FindLastSampleOfThread currently involves a linear search backwards through the sample buffer. Profiling showed that to be the largest profiler cost by far, at least on Linux. Bugs 1344118 and 1344258 significantly improve the situation, collectively reducing the cost by a factor of at least 5 and often much more. But the linear search is still present and still dominant. The worst of it is that it's unnecessary: we could achieve the same by recording the start point of the most recent sample for each thread in that thread's ThreadInfo record. This patch does exactly that, adding the type ProfileBuffer::LastSample to store the start points. LastSample also includes the ID of the thread it pertains to as a read-only field, as that is needed in various places. addTag doesn't check whether we're overwriting buffer entries containing start points. Instead, FindLastSample checks whether the entry pointed to the LastSample it is given still contains a marker.

Attachment #8849322 - Flags: review?(n.nethercote)

Nicholas Nethercote [inactive]

Comment 11

•

9 years ago

Comment on attachment 8849322 [details] [diff] [review] Further cost reductions for ProfileBuffer::FindLastSampleOfThread Review of attachment 8849322 [details] [diff] [review]: ----------------------------------------------------------------- Looking good. There are enough nits that I'd like to see it again before landing. Thanks. ::: tools/profiler/core/ProfileBuffer.cpp @@ +38,5 @@ > mReadPos = (mReadPos + 1) % mEntrySize; > } > } > > +void ProfileBuffer::addTagThreadId(LastSample& ls) Param should be |aLS| or |aLastSample|. ::: tools/profiler/core/ProfileBuffer.h @@ +26,5 @@ > + : mThreadId(aThreadId) > + , mGeneration(0) > + , mPos(-1) > + {} > + // The thread to which this LastSample pertains. Nit: blank line after the constructor. @@ +30,5 @@ > + // The thread to which this LastSample pertains. > + const int mThreadId; > + // The profiler-buffer generation number at which the sample was created. > + uint32_t mGeneration; > + // And its position in the buffer. Mention in the comment that -1 means "invalid". @@ +39,5 @@ > void addTag(const ProfileBufferEntry& aTag); > + > + // Add to the buffer, a sample start (ThreadId) entry, for the thread that > + // |ls| belongs to, and record the resulting generation and index in |ls|. > + void addTagThreadId(LastSample& ls); |aLS| or |aLastSample|. @@ +51,5 @@ > + > + // Find the most recent sample for the thread denoted by |ls| and clone it, > + // patching in the |aStartTime| as appropriate. > + bool DuplicateLastSample(const mozilla::TimeStamp& aStartTime, > + LastSample& ls); |aLS| or |aLastSample|. What line length do you have your text editor set to? https://developer.mozilla.org/en-US/docs/Mozilla/Developer_guide/Coding_Style says line lengths should be 80 chars or fewer. This declaration is 79 chars, so can easily fit on a single line. I think a number of the comments in this patch could be reflowed to take up more of each line, too. @@ +63,5 @@ > size_t SizeOfIncludingThis(mozilla::MallocSizeOf aMallocSizeOf) const; > > protected: > char* processDynamicTag(int readPos, int* tagsConsumed, char* tagBuff); > + int FindLastSampleOfThread(const LastSample& ls); |aLS| or |aLastSample|. ::: tools/profiler/core/ProfileBufferEntry.cpp @@ +741,5 @@ > readPos = (readPos + 1) % mEntrySize; > } > } > > +int ProfileBuffer::FindLastSampleOfThread(const LastSample& ls) |aLS| or |aLastSample|. @@ +748,5 @@ > + // generation, or is one behind the buffer's generation, since the buffer's > + // generation is incremented on wraparound. There's no ambiguity relative > + // to ProfileBuffer::reset, since that increments mGeneration by two. > + if (ls.mGeneration == mGeneration || > + (mGeneration > 0 && ls.mGeneration == mGeneration-1)) { Nit: space around '-'. @@ +752,5 @@ > + (mGeneration > 0 && ls.mGeneration == mGeneration-1)) { > + int ix = ls.mPos; > + // It might be that the sample has since been overwritten, so check > + // that it is still valid. > + MOZ_RELEASE_ASSERT(ix >= 0 && ix < mEntrySize); Nit: I write conditions like these as |0 <= ix && ix < mEntrySize|... Also, isn't -1 a possible value for mPos at this point? @@ +753,5 @@ > + int ix = ls.mPos; > + // It might be that the sample has since been overwritten, so check > + // that it is still valid. > + MOZ_RELEASE_ASSERT(ix >= 0 && ix < mEntrySize); > + ProfileBufferEntry entry = mEntries[ix]; Make this a reference (|ProfileBufferEntry&|) so we don't copy it unnecessarily. @@ +757,5 @@ > + ProfileBufferEntry entry = mEntries[ix]; > + if (entry.isThreadId() && entry.mTagInt == ls.mThreadId) { > + // It's still valid. > + return ix; > + } else { Moz style says no else after return... in this case it might look awkward. I wonder if using ?: might help. @@ +761,5 @@ > + } else { > + // The entry has been overwritten. Given that |ix| is the index of the > + // thread denoted by |ls|, we can be sure that there are no other older > + // samples for it in the buffer either, since they would have been > + // overwritten before this most recent one. Is the absence of older samples relevant? I don't think so, and mentioning them just confused me. @@ +767,3 @@ > } > + } else { > + MOZ_ASSERT(ls.mGeneration <= mGeneration-2); Nit: space around '-'. @@ +772,4 @@ > } > > bool > +ProfileBuffer::DuplicateLastSample(const TimeStamp& aStartTime, LastSample& ls) |aLS| or |aLastSample|. Also, can it be a |const| reference?

Attachment #8849322 - Flags: review?(n.nethercote) → feedback+

Julian Seward [:jseward]

Assignee

Comment 12

•

9 years ago

Attached patch Further cost reductions for ProfileBuffer::FindLastSampleOfThread — Details — Splinter Review

ProfileBuffer::FindLastSampleOfThread currently involves a linear search backwards through the sample buffer. Profiling showed that to be the largest profiler cost by far, at least on Linux. Bugs 1344118 and 1344258 significantly improve the situation, collectively reducing the cost by a factor of at least 5 and often much more. But the linear search is still present and still dominant. The worst of it is that it's unnecessary: we could achieve the same by recording the start point of the most recent sample for each thread in that thread's ThreadInfo record. This patch does exactly that, adding the type ProfileBuffer::LastSample to store the start points. LastSample also includes the ID of the thread it pertains to as a read-only field, as that is needed in various places. addTag doesn't check whether we're overwriting buffer entries containing start points. Instead, FindLastSample checks whether the entry pointed to the LastSample it is given still contains a marker.

Attachment #8849472 - Flags: review?(n.nethercote)

Julian Seward [:jseward]

Assignee

Updated

•

9 years ago

Attachment #8849322 - Attachment is obsolete: true

Julian Seward [:jseward]

Assignee

Comment 13

•

9 years ago

(In reply to Nicholas Nethercote [:njn] from comment #11) > Comment on attachment 8849322 [details] [diff] [review] > Further cost reductions for ProfileBuffer::FindLastSampleOfThread > > Review of attachment 8849322 [details] [diff] [review]: [..] All nits fixed. > > + bool DuplicateLastSample(const mozilla::TimeStamp& aStartTime, > > + LastSample& ls); > > |aLS| or |aLastSample|. > > What line length do you have your text editor set to? [..] emacs thought the one-line version was already 80, and with ls->aLS it's now 81. So it's still wrapped. > > + // It might be that the sample has since been overwritten, so check > > + // that it is still valid. > > + MOZ_RELEASE_ASSERT(ix >= 0 && ix < mEntrySize); > > Also, isn't -1 a possible value for mPos at this point? Hmm, that's a good point. That assertion never failed in testing, and I suspect that's because of the logic associated with CanDuplicateLastSampleDueToSleep, but I didn't follow it entirely. In any case I added a test here to be on the safe side. > > bool > > +ProfileBuffer::DuplicateLastSample(const TimeStamp& aStartTime, LastSample& ls) > > Also, can it be a |const| reference? No, because it gets modified by the call to addTagThreadId.

Nicholas Nethercote [inactive]

Comment 14

•

9 years ago

Comment on attachment 8849472 [details] [diff] [review] Further cost reductions for ProfileBuffer::FindLastSampleOfThread Review of attachment 8849472 [details] [diff] [review]: ----------------------------------------------------------------- ::: tools/profiler/core/ProfileBuffer.cpp @@ +38,5 @@ > mReadPos = (mReadPos + 1) % mEntrySize; > } > } > > +void ProfileBuffer::addTagThreadId(LastSample& aLS) I realize it's late to be saying this, but "Last" isn't a very good word to put in a name, because it's not clear if it means "previous" or "final". Would PrevSample or LatestSample or MostRecentSample be better? ::: tools/profiler/core/ProfileBufferEntry.cpp @@ +741,5 @@ > readPos = (readPos + 1) % mEntrySize; > } > } > > +int ProfileBuffer::FindLastSampleOfThread(const LastSample& aLS) Nit: return type on its own line.

Attachment #8849472 - Flags: review?(n.nethercote) → review+

Julian Seward [:jseward]

Assignee

Comment 15

•

9 years ago

(In reply to Nicholas Nethercote [:njn] from comment #14) > > +void ProfileBuffer::addTagThreadId(LastSample& aLS) > > I realize it's late to be saying this, but "Last" isn't a very good word to > put in a name, because it's not clear if it means "previous" or "final". > Would PrevSample or LatestSample or MostRecentSample be better? MostRecentSample is a bit wordy. How about LatestSample? So .. shall I rename just the type, or also DuplicateLastSample() too? It looks a bit strange to have DuplicateLastSample(..., LatestSample& aLS).

Flags: needinfo?(n.nethercote)

Julian Seward [:jseward]

Assignee

Comment 16

•

9 years ago

To do this consistently would require renaming CanDuplicateLastSampleDueToSleep FindLastSampleOfThread at their various call sites, which makes the patch much bigger. LastSample isn't quite as descriptive as one might hope, but at least it's consistent with existing usage. How do you want to proceed?

Nicholas Nethercote [inactive]

Comment 17

•

9 years ago

I am happy to let you decide, hence the r+. Sorry I didn't make it clear. Sticking with LastSample is good enough.

Flags: needinfo?(n.nethercote)

Pulsebot

Comment 18

•

9 years ago

Pushed by jseward@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/6235799ad4fd Further cost reductions for ProfileBuffer::FindLastSampleOfThread. r=n.nethercote.

Carsten Book [:Tomcat]

Comment 19

•

9 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/6235799ad4fd

Status: NEW → RESOLVED

Closed: 9 years ago

status-firefox55: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla55

Further cost reductions for ProfileBuffer::FindLastSampleOfThread 9 years ago Julian Seward [:jseward] 14.24 KB, patch	n.nethercote : review-	Details \| Diff \| Splinter Review
Further cost reductions for ProfileBuffer::FindLastSampleOfThread 9 years ago Julian Seward [:jseward] 14.98 KB, patch	n.nethercote : feedback+	Details \| Diff \| Splinter Review
Further cost reductions for ProfileBuffer::FindLastSampleOfThread 9 years ago Julian Seward [:jseward] 14.99 KB, patch	n.nethercote : review+	Details \| Diff \| Splinter Review