Further cost reductions for ProfileBuffer::FindLastSampleOfThread

RESOLVED FIXED in Firefox 55

Status

()

Core
Gecko Profiler
RESOLVED FIXED
5 months ago
4 months ago

People

(Reporter: jseward, Assigned: jseward)

Tracking

unspecified
mozilla55
Points:
---

Firefox Tracking Flags

(firefox55 fixed)

Details

Attachments

(2 attachments, 1 obsolete attachment)

(Assignee)

Description

5 months ago
FindLastSampleOfThread currently involves a linear search backwards through
the sample buffer.  Profiling showed that to be the largest profiler cost by
far.  Bugs 1344118 and 1344258 significantly improve the situation,
collectively reducing the cost by a factor of at least 5 and often much more.
But the linear search is still present and still dominant.  The worst of it is
that it's unnecessary: we could achieve the same by using a table, one entry
per thread, that records the start of the most recent sample for each thread.
From bug 1344258 comment 6, the places that need to be handled:

> AFAICT the two places that would need to handle it are ProfileBuffer::reset()
> and the (mWritePos == mReadPos) case in ProfileBuffer::addTag().

and as for storing the data:

> We could add another field to ThreadInfo instead. ProfileBuffer.cpp currently
> doesn't deal with TheadInfos, so it would increase module entanglement a bit.
(Assignee)

Comment 2

4 months ago
(In reply to Nicholas Nethercote [:njn] from comment #1)

> and as for storing the data:
> 
> > We could add another field to ThreadInfo instead. ProfileBuffer.cpp currently
> > doesn't deal with TheadInfos, so it would increase module entanglement a bit.

There's one ThreadInfo for each thread, whereas the ProfileBuffer is shared
between all threads.  So we'd have to go fish around in gPS->mThreads to find
the relevant ThreadInfo.  At least so it seems from looking at the code.  Does
that sync with your understanding, or am I missing something?

An alternative that I've been mulling over to change ThreadId entries to
contain a pair of ints, one of which is the thread's ID as at present and the
other one is the index of the previous ThreadId entry.  That way the ThreadId
entries are chained together, and we can search back without having to examine
all intervening entries.  It has the advantage of no further module
entanglement and no extra dynamic memory allocation.  It's still a linear
search backwards but with a cost of perhaps 1/100th what we have at the
moment.  ProfileBufferEntries currently may contain doubles (eg
ResidentMemory) so having a pair of 32-bit ints wouldn't make them any larger.
Flags: needinfo?(n.nethercote)
(Assignee)

Comment 3

4 months ago
> An alternative that I've been mulling over to change ThreadId entries to
> contain a pair of ints, one of which is the thread's ID as at present [..]

This works, but is rather inelegant and only gives a cost reduction of about a
factor of 4 for FindLastSampleOfThread.  It might be that hopping backwards
along a linked list is cache-unfriendly.

A more elegant solution would be to return to the original plan, but put the
table that maps thread-ids to last-sample-indices in ProfileBuffer.  That
seems a fitting place for it and doesn't involve any further cross module
entanglement.  When the ProfileBuffer is created we know how many threads
will be associated with it and so the table can be dynamically allocated at
that point.
> > > We could add another field to ThreadInfo instead. ProfileBuffer.cpp currently
> > > doesn't deal with TheadInfos, so it would increase module entanglement a bit.
> 
> There's one ThreadInfo for each thread, whereas the ProfileBuffer is shared
> between all threads.  So we'd have to go fish around in gPS->mThreads to find
> the relevant ThreadInfo.  At least so it seems from looking at the code. 
> Does
> that sync with your understanding, or am I missing something?

Which operation are you talking about? When we are sampling we have the relevant ThreadInfo and ProfileBuffer at hand. Are you talking about when ProfileBuffer is reset? At that point we might have to go through gPS->mThreads to clear all the cached "last sample" entries. Or we could avoid that by using the existing generation counter in ProfileBuffer to indicate staleness.

> A more elegant solution would be to return to the original plan, but put the
> table that maps thread-ids to last-sample-indices in ProfileBuffer.  That
> seems a fitting place for it and doesn't involve any further cross module
> entanglement.  When the ProfileBuffer is created we know how many threads
> will be associated with it and so the table can be dynamically allocated at
> that point.

Can't more threads be added later? In which case you can't allocate a fixed-size table. Variable-sized tables aren't that hard, though.
Flags: needinfo?(n.nethercote)
(Assignee)

Comment 5

4 months ago
(In reply to Nicholas Nethercote [:njn] from comment #4)
> > entanglement.  When the ProfileBuffer is created we know how many threads
> > will be associated with it and so the table can be dynamically allocated at
> > that point.
> 
> Can't more threads be added later?

Yes, I discovered that later (locked_register_thread).  Yes, resizing
the table accordingly is easy.
(Assignee)

Comment 6

4 months ago
Created attachment 8848930 [details] [diff] [review]
Further cost reductions for ProfileBuffer::FindLastSampleOfThread

ProfileBuffer::FindLastSampleOfThread currently involves a linear search
backwards through the sample buffer.  Profiling showed that to be the largest
profiler cost by far, at least on Linux.  Bugs 1344118 and 1344258
significantly improve the situation, collectively reducing the cost by a
factor of at least 5 and often much more.  But the linear search is still
present and still dominant.  The worst of it is that it's unnecessary: we
could achieve the same by using a table, one entry per thread, that records
the start of the most recent sample for each thread.

This patch adds such a table to ProfileBuffer.  The table is updated in
ProfileBuffer::addTag when a sample-start entry is observed, and is queried in
ProfileBuffer::FindLastSampleOfThread.  This is straightforward apart from the
minor twist that we can't resize the table in the profiler critical section
without risking deadlock.

The table is an array of pairs, with lookup being a simple linear traversal.
Because the array is typically small -- less than 100 entries -- the cost is
tolerable.  A followup to change it to binary search might be advisable.

This patch reduces the cost of FindLastSampleOfThread by about a factor of 5,
putting it on a par with the overall unwinding costs on Linux.
Attachment #8848930 - Flags: review?(n.nethercote)
Comment on attachment 8848930 [details] [diff] [review]
Further cost reductions for ProfileBuffer::FindLastSampleOfThread

Review of attachment 8848930 [details] [diff] [review]:
-----------------------------------------------------------------

::: tools/profiler/core/ProfileBuffer.cpp
@@ +12,5 @@
>    , mEntrySize(aEntrySize)
>    , mGeneration(0)
> +  , mLSMap(static_cast<LSMEntry*>(malloc(nThreads * sizeof(LSMEntry))))
> +  , mLSMapSize(nThreads)
> +  , mLSMapUsed(0)

This is very old-school C style code!

A more C++/Mozilla-y way to do it would be with mozilla::Vector, which keeps track of length, capacity, allocation, resizing, and deallocation for you. That would simplify the patch considerably. You also wouldn't have to pass in nThreads here. (That's what I meant by "Variable-sized tables aren't that hard" in comment 5.)

But I think the extra table can be avoided entirely. Here's my suggestion:

- ThreadInfo gets two extra fields: a pointer (or index) to the last sample, and the ProfileBuffer's generation when that sample was taken. No need for a separate table.

- DuplicateLastSample() gets two extra arguments: the above two fields. (The relevant ThreadInfo is available at the callsite.) It only does the duplication if the pointer is valid, which involves checking that the generation matches the ProfileBuffer's current generation. By passing in these arguments, we can avoid ProfileBuffer having a dependence on ThreadInfo. The lookup is also O(1).

- FindLastSampleOfThread() can be deleted.

Thoughts?

::: tools/profiler/core/ProfileBufferEntry.cpp
@@ +759,5 @@
> +  MOZ_RELEASE_ASSERT(ix >= 0 && ix < mEntrySize);
> +  ProfileBufferEntry entry = mEntries[ix];
> +  if (entry.isThreadId() && entry.mTagInt == aThreadId) {
> +    // It's still valid.
> +    return ix;

Could we get unlucky here? Could the sample have been overwritten by the same thread? Not sure, though it's irrelevant if we take my suggested approach.

::: tools/profiler/core/platform.cpp
@@ +1742,5 @@
> +  // its last-sample-map to accommodate this new thread.
> +  ProfileBuffer* buffer = gPS->Buffer(aLock);
> +  if (buffer) {
> +    buffer->BumpLSMSize();
> +  }

This is a bit gross. Using mozilla::Vector would avoid that, as would avoiding the table altogether.
Attachment #8848930 - Flags: review?(n.nethercote) → review-
Oh, I forgot about ProfileBuffer::addTag(), which needs to record the last sample position and generation. It looks like you can pass in pointers to the two fields and it can fill them in. Even better, if you wrap up the two fields into a LastSample struct that will make the passing neater.
(Assignee)

Comment 9

4 months ago
(In reply to Nicholas Nethercote [:njn] from comment #7)
> But I think the extra table can be avoided entirely. Here's my suggestion: [..]

It's a nice proposal.  Patch to follow.  Resulting performance is the best
yet.

> Even better, if you wrap up the two fields into a LastSample struct that will
> make the passing neater.

I did that, and also added a third read-only field, which is the thread ID.
This simplifies various bits because it means we don't have to pass around the
thread ID to which a LastSample pertains, and we can't pass the wrong thread
ID.  An alternative would have been to make the third field be a ThreadInfo*,
but that increases module entanglement and pointer chasing.

> ::: tools/profiler/core/ProfileBufferEntry.cpp
> @@ +759,5 @@
> > +  MOZ_RELEASE_ASSERT(ix >= 0 && ix < mEntrySize);
> > +  ProfileBufferEntry entry = mEntries[ix];
> > +  if (entry.isThreadId() && entry.mTagInt == aThreadId) {
> > +    // It's still valid.
> > +    return ix;
> 
> Could we get unlucky here? Could the sample have been overwritten by the
> same thread? Not sure, though it's irrelevant if we take my suggested
> approach.

That check is still required with the new approach, because the alternative is
to have addTag check each sample to see if we're overwriting a sample start
marker.  That means we'd have to somehow get hold of associated LastSample and
invalidate it.  Also, it means that addTag drags all old samples up through
the memory hierarchy merely to overwrite them.  From Linux "perf" numbers, I
am beginning to suspect that the profiler's performance is already largely
memory limited.

I tested the logic by running the old and new FindLastSample implementations
in parallel for several minutes (several tens of millions of entries) and
asserting that they produce identical results, with the buffer size set to
minimum (90k) so as to maximise the number of wraparounds.  Testing like that
has found bugs in the various predecessor patches.

If we adopted the indexing scheme proposed in bug 1348959, I think this
validity testing would become simpler.
(Assignee)

Comment 10

4 months ago
Created attachment 8849322 [details] [diff] [review]
Further cost reductions for ProfileBuffer::FindLastSampleOfThread

ProfileBuffer::FindLastSampleOfThread currently involves a linear search
backwards through the sample buffer.  Profiling showed that to be the largest
profiler cost by far, at least on Linux.  Bugs 1344118 and 1344258
significantly improve the situation, collectively reducing the cost by a
factor of at least 5 and often much more.  But the linear search is still
present and still dominant.  The worst of it is that it's unnecessary: we
could achieve the same by recording the start point of the most recent sample
for each thread in that thread's ThreadInfo record.

This patch does exactly that, adding the type ProfileBuffer::LastSample to
store the start points.  LastSample also includes the ID of the thread it
pertains to as a read-only field, as that is needed in various places.

addTag doesn't check whether we're overwriting buffer entries containing start
points.  Instead, FindLastSample checks whether the entry pointed to the
LastSample it is given still contains a marker.
Attachment #8849322 - Flags: review?(n.nethercote)
Comment on attachment 8849322 [details] [diff] [review]
Further cost reductions for ProfileBuffer::FindLastSampleOfThread

Review of attachment 8849322 [details] [diff] [review]:
-----------------------------------------------------------------

Looking good. There are enough nits that I'd like to see it again before landing. Thanks.

::: tools/profiler/core/ProfileBuffer.cpp
@@ +38,5 @@
>      mReadPos = (mReadPos + 1) % mEntrySize;
>    }
>  }
>  
> +void ProfileBuffer::addTagThreadId(LastSample& ls)

Param should be |aLS| or |aLastSample|.

::: tools/profiler/core/ProfileBuffer.h
@@ +26,5 @@
> +      : mThreadId(aThreadId)
> +      , mGeneration(0)
> +      , mPos(-1)
> +    {}
> +    // The thread to which this LastSample pertains.

Nit: blank line after the constructor.

@@ +30,5 @@
> +    // The thread to which this LastSample pertains.
> +    const int mThreadId;
> +    // The profiler-buffer generation number at which the sample was created.
> +    uint32_t mGeneration;
> +    // And its position in the buffer.

Mention in the comment that -1 means "invalid".

@@ +39,5 @@
>    void addTag(const ProfileBufferEntry& aTag);
> +
> +  // Add to the buffer, a sample start (ThreadId) entry, for the thread that
> +  // |ls| belongs to, and record the resulting generation and index in |ls|.
> +  void addTagThreadId(LastSample& ls);

|aLS| or |aLastSample|.

@@ +51,5 @@
> +
> +  // Find the most recent sample for the thread denoted by |ls| and clone it,
> +  // patching in the |aStartTime| as appropriate.
> +  bool DuplicateLastSample(const mozilla::TimeStamp& aStartTime,
> +                           LastSample& ls);

|aLS| or |aLastSample|.

What line length do you have your text editor set to? https://developer.mozilla.org/en-US/docs/Mozilla/Developer_guide/Coding_Style says line lengths should be 80 chars or fewer. This declaration is 79 chars, so can easily fit on a single line. I think a number of the comments in this patch could be reflowed to take up more of each line, too.

@@ +63,5 @@
>    size_t SizeOfIncludingThis(mozilla::MallocSizeOf aMallocSizeOf) const;
>  
>  protected:
>    char* processDynamicTag(int readPos, int* tagsConsumed, char* tagBuff);
> +  int FindLastSampleOfThread(const LastSample& ls);

|aLS| or |aLastSample|.

::: tools/profiler/core/ProfileBufferEntry.cpp
@@ +741,5 @@
>      readPos = (readPos + 1) % mEntrySize;
>    }
>  }
>  
> +int ProfileBuffer::FindLastSampleOfThread(const LastSample& ls)

|aLS| or |aLastSample|.

@@ +748,5 @@
> +  // generation, or is one behind the buffer's generation, since the buffer's
> +  // generation is incremented on wraparound.  There's no ambiguity relative
> +  // to ProfileBuffer::reset, since that increments mGeneration by two.
> +  if (ls.mGeneration == mGeneration ||
> +      (mGeneration > 0 && ls.mGeneration == mGeneration-1)) {

Nit: space around '-'.

@@ +752,5 @@
> +      (mGeneration > 0 && ls.mGeneration == mGeneration-1)) {
> +    int ix = ls.mPos;
> +    // It might be that the sample has since been overwritten, so check
> +    // that it is still valid.
> +    MOZ_RELEASE_ASSERT(ix >= 0 && ix < mEntrySize);

Nit: I write conditions like these as |0 <= ix && ix < mEntrySize|...

Also, isn't -1 a possible value for mPos at this point?

@@ +753,5 @@
> +    int ix = ls.mPos;
> +    // It might be that the sample has since been overwritten, so check
> +    // that it is still valid.
> +    MOZ_RELEASE_ASSERT(ix >= 0 && ix < mEntrySize);
> +    ProfileBufferEntry entry = mEntries[ix];

Make this a reference (|ProfileBufferEntry&|) so we don't copy it unnecessarily.

@@ +757,5 @@
> +    ProfileBufferEntry entry = mEntries[ix];
> +    if (entry.isThreadId() && entry.mTagInt == ls.mThreadId) {
> +      // It's still valid.
> +      return ix;
> +    } else {

Moz style says no else after return... in this case it might look awkward. I wonder if using ?: might help.

@@ +761,5 @@
> +    } else {
> +      // The entry has been overwritten.  Given that |ix| is the index of the
> +      // thread denoted by |ls|, we can be sure that there are no other older
> +      // samples for it in the buffer either, since they would have been
> +      // overwritten before this most recent one.

Is the absence of older samples relevant? I don't think so, and mentioning them just confused me.

@@ +767,3 @@
>      }
> +  } else {
> +    MOZ_ASSERT(ls.mGeneration <= mGeneration-2);

Nit: space around '-'.

@@ +772,4 @@
>  }
>  
>  bool
> +ProfileBuffer::DuplicateLastSample(const TimeStamp& aStartTime, LastSample& ls)

|aLS| or |aLastSample|.

Also, can it be a |const| reference?
Attachment #8849322 - Flags: review?(n.nethercote) → feedback+
(Assignee)

Comment 12

4 months ago
Created attachment 8849472 [details] [diff] [review]
Further cost reductions for ProfileBuffer::FindLastSampleOfThread

ProfileBuffer::FindLastSampleOfThread currently involves a linear search
backwards through the sample buffer.  Profiling showed that to be the largest
profiler cost by far, at least on Linux.  Bugs 1344118 and 1344258
significantly improve the situation, collectively reducing the cost by a
factor of at least 5 and often much more.  But the linear search is still
present and still dominant.  The worst of it is that it's unnecessary: we
could achieve the same by recording the start point of the most recent sample
for each thread in that thread's ThreadInfo record.

This patch does exactly that, adding the type ProfileBuffer::LastSample to
store the start points.  LastSample also includes the ID of the thread it
pertains to as a read-only field, as that is needed in various places.

addTag doesn't check whether we're overwriting buffer entries containing start
points.  Instead, FindLastSample checks whether the entry pointed to the
LastSample it is given still contains a marker.
Attachment #8849472 - Flags: review?(n.nethercote)
(Assignee)

Updated

4 months ago
Attachment #8849322 - Attachment is obsolete: true
(Assignee)

Comment 13

4 months ago
(In reply to Nicholas Nethercote [:njn] from comment #11)
> Comment on attachment 8849322 [details] [diff] [review]
> Further cost reductions for ProfileBuffer::FindLastSampleOfThread
> 
> Review of attachment 8849322 [details] [diff] [review]: [..]

All nits fixed.

> > +  bool DuplicateLastSample(const mozilla::TimeStamp& aStartTime,
> > +                           LastSample& ls);
> 
> |aLS| or |aLastSample|.
> 
> What line length do you have your text editor set to? [..]

emacs thought the one-line version was already 80, and with ls->aLS it's now 81.
So it's still wrapped.

> > +    // It might be that the sample has since been overwritten, so check
> > +    // that it is still valid.
> > +    MOZ_RELEASE_ASSERT(ix >= 0 && ix < mEntrySize);
>
> Also, isn't -1 a possible value for mPos at this point?

Hmm, that's a good point.  That assertion never failed in testing, and I suspect
that's because of the logic associated with CanDuplicateLastSampleDueToSleep,
but I didn't follow it entirely.  In any case I added a test here to be on the
safe side.

> >  bool
> > +ProfileBuffer::DuplicateLastSample(const TimeStamp& aStartTime, LastSample& ls)
> 
> Also, can it be a |const| reference?

No, because it gets modified by the call to addTagThreadId.
Comment on attachment 8849472 [details] [diff] [review]
Further cost reductions for ProfileBuffer::FindLastSampleOfThread

Review of attachment 8849472 [details] [diff] [review]:
-----------------------------------------------------------------

::: tools/profiler/core/ProfileBuffer.cpp
@@ +38,5 @@
>      mReadPos = (mReadPos + 1) % mEntrySize;
>    }
>  }
>  
> +void ProfileBuffer::addTagThreadId(LastSample& aLS)

I realize it's late to be saying this, but "Last" isn't a very good word to put in a name, because it's not clear if it means "previous" or "final". Would PrevSample or LatestSample or MostRecentSample be better?

::: tools/profiler/core/ProfileBufferEntry.cpp
@@ +741,5 @@
>      readPos = (readPos + 1) % mEntrySize;
>    }
>  }
>  
> +int ProfileBuffer::FindLastSampleOfThread(const LastSample& aLS)

Nit: return type on its own line.
Attachment #8849472 - Flags: review?(n.nethercote) → review+
(Assignee)

Comment 15

4 months ago
(In reply to Nicholas Nethercote [:njn] from comment #14)
> > +void ProfileBuffer::addTagThreadId(LastSample& aLS)
> 
> I realize it's late to be saying this, but "Last" isn't a very good word to
> put in a name, because it's not clear if it means "previous" or "final".
> Would PrevSample or LatestSample or MostRecentSample be better?

MostRecentSample is a bit wordy.  How about LatestSample?

So .. shall I rename just the type, or also DuplicateLastSample() too?
It looks a bit strange to have DuplicateLastSample(..., LatestSample& aLS).
Flags: needinfo?(n.nethercote)
(Assignee)

Comment 16

4 months ago
To do this consistently would require renaming 

  CanDuplicateLastSampleDueToSleep
  FindLastSampleOfThread

at their various call sites, which makes the patch much bigger.  LastSample
isn't quite as descriptive as one might hope, but at least it's consistent with
existing usage.  How do you want to proceed?
I am happy to let you decide, hence the r+. Sorry I didn't make it clear.

Sticking with LastSample is good enough.
Flags: needinfo?(n.nethercote)

Comment 18

4 months ago
Pushed by jseward@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/6235799ad4fd
Further cost reductions for ProfileBuffer::FindLastSampleOfThread.  r=n.nethercote.

Comment 19

4 months ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/6235799ad4fd
Status: NEW → RESOLVED
Last Resolved: 4 months ago
status-firefox55: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
You need to log in before you can comment on or make changes to this bug.