Last Comment Bug 683229 - Add user space profiling using Simple Profiler System (SPS)
: Add user space profiling using Simple Profiler System (SPS)
Status: RESOLVED FIXED
:
Product: Core
Classification: Components
Component: Gecko Profiler (show other bugs)
: Trunk
: x86 Mac OS X
: -- normal with 1 vote (vote)
: mozilla10
Assigned To: Benoit Girard (:BenWa)
:
Mentors:
Depends on: 698526 699918 700754 710296 730302
Blocks: 698002 698005 703444 705856 707185 713227 744026
  Show dependency treegraph
 
Reported: 2011-08-30 11:22 PDT by Benoit Girard (:BenWa)
Modified: 2016-02-05 03:00 PST (History)
18 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
SPS Profiler (47.83 KB, patch)
2011-08-30 11:22 PDT, Benoit Girard (:BenWa)
jmuizelaar: feedback+
Details | Diff | Splinter Review
Sample (Throw away log parsing UI) (53.67 KB, image/png)
2011-08-30 11:33 PDT, Benoit Girard (:BenWa)
no flags Details
Sample log with symbolicated leaf data (205.70 KB, text/plain)
2011-08-30 15:24 PDT, Benoit Girard (:BenWa)
no flags Details
Sample log with symbolicated leaf data (227.37 KB, text/plain)
2011-08-30 21:07 PDT, Benoit Girard (:BenWa)
no flags Details
SPS Profiler (52.06 KB, patch)
2011-09-01 16:41 PDT, Benoit Girard (:BenWa)
cjones.bugs: review-
Details | Diff | Splinter Review
review diff (wip) (28.59 KB, patch)
2011-09-06 07:34 PDT, Benoit Girard (:BenWa)
no flags Details | Diff | Splinter Review
Review diff (34.48 KB, patch)
2011-09-07 12:42 PDT, Benoit Girard (:BenWa)
no flags Details | Diff | Splinter Review
Review diff (34.47 KB, patch)
2011-09-07 12:51 PDT, Benoit Girard (:BenWa)
no flags Details | Diff | Splinter Review
review diff (34.48 KB, patch)
2011-10-19 12:23 PDT, Benoit Girard (:BenWa)
no flags Details | Diff | Splinter Review
Review diff (35.32 KB, patch)
2011-10-19 12:25 PDT, Benoit Girard (:BenWa)
no flags Details | Diff | Splinter Review
Review diff (+build fix) (36.00 KB, patch)
2011-10-19 13:22 PDT, Benoit Girard (:BenWa)
no flags Details | Diff | Splinter Review
review diff (36.30 KB, patch)
2011-10-20 11:22 PDT, Benoit Girard (:BenWa)
cjones.bugs: review+
Details | Diff | Splinter Review
Use 'MOZ_PROFILER_SPS' env variable to turn profiler on (1.34 KB, patch)
2011-10-27 13:25 PDT, Benoit Girard (:BenWa)
jmuizelaar: review+
Details | Diff | Splinter Review

Description Benoit Girard (:BenWa) 2011-08-30 11:22:17 PDT
Created attachment 556929 [details] [diff] [review]
SPS Profiler

Simple Profiler System is a user space profiling system. It instruments the code base with 'Checkpoints' used to build partial 'stack' made up of the active checkpoints. It periodically samples the pc and the list of active checkpoint. Code markers can also be added to filter samples of interest. A goal of this profiler is to provide an universal cross platform, and light overhead, profiling tool.

This profiler lets us gather profiling data for platforms where there is no kernel support for profiling and can complement current tools in platform where we already have profiling data.

There's a few ideas that have been discussed on additional uses for this tool, however the immediate use for this profiler is to profile on Android Fennec.

Open issues with this patch:
- We will want to change the build integration to enable this profiler with a pref for now.
- Needs to be ported for non android platform, this will be addressed on follow up bugs.
- The save logic will need to change.
- Add prefs to control options such as save location.

This tool could also be used to solve bug 653703.
Comment 1 Benoit Girard (:BenWa) 2011-08-30 11:33:47 PDT
Created attachment 556934 [details]
Sample (Throw away log parsing UI)

I developed a sample parsing program to see what kind of data we can extract from the log as a proof of concept. We are working on a better HTML/JS permanent replacement UI that should hopefully be done around the time this land.

As we add more checkpoints in the source code the information should become more meaningful.

https://github.com/ehsan/cleopatra
Comment 2 Benoit Girard (:BenWa) 2011-08-30 15:24:27 PDT
Created attachment 557014 [details]
Sample log with symbolicated leaf data
Comment 3 Benoit Girard (:BenWa) 2011-08-30 16:11:37 PDT
We can get libc function names (but not source line) using arm-linux-androideabi-addr2line so we can improve symbolication.
Comment 4 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2011-08-30 18:41:27 PDT
Can I suggest the name "Simps" as something more pronounceable?  (Comes from /Brave New World/, IIRC.  You probably don't want to know from what specifically, but it fits with the motivation here ;)

A description of overall methodology would be useful.

What I heard from BenWa sounds basically sound: you're (mostly) uniformly sampling real time, but only some samples have useful info.  The rest are "dark matter".  As long as the dark-matter samples are recorded with the detailed ones, then this just results in a pre-filter on the data presented to the user.

You should be able to uniformly sample "unhalted" time with "halted" time for a given thread, e.g. while it's waiting for the GPU.  As long as the Gecko main thread is mostly sleeping in an interruptible state, then we'll be able to charge real time accurately there.  That'd be pretty cool.

You'd likely get more precise samples by using a high-resolution timer (POSIX timer or itimer on linux) programmed for the thread you want to sample instead of delivering signals from a separate thread.  That would preclude switching on sampling for the samplee while it's in the middle of, say, a too-long IO operation, though.

I think this scheme should work with multiple processes, and since samples require grabbing timestamps anyway, you'll be able to kinda sorta match up data across processes.

I definitely see a useful niche for this project as something that can stay on in opt builds (but unarmed by default).  As long as the probes stay sparse and out of hot loops, this would be a great spot check for slow pages.  Could have triagers grab this data to see what component to file a bug in, e.g.

It would be cool if this were compatible with oprofile (either the sampler output data in oprofile format, or the visualizer understood oprofile format), only because I haven't found a good oprofile visualizer yet.  It's always a pain extracting useful data from oprofile samples.  oprofile's data format looks pretty whack though.
Comment 5 Benoit Girard (:BenWa) 2011-08-30 21:07:39 PDT
Created attachment 557085 [details]
Sample log with symbolicated leaf data

Here's a better log with symbols for system libraries and name demanding.
Comment 6 Jeff Muizelaar [:jrmuizel] 2011-08-31 09:59:18 PDT
(In reply to Chris Jones [:cjones] [:warhammer] from comment #4)
> Can I suggest the name "Simps" as something more pronounceable?  (Comes from
> /Brave New World/, IIRC.  You probably don't want to know from what
> specifically, but it fits with the motivation here ;)
> 
> A description of overall methodology would be useful.

Here's a basic description:
We periodically interrupt a thread using a platform specific mechanism. We can then sample it using a variety of methods (stack walking, keeping our own pseudo-stack, etc.) The current implementation annotates particular functions so that they push a string onto a stack on entry and pop it on exit. This gives a reliable stack that we can sample from. We can add annotations in areas as needed.

> You'd likely get more precise samples by using a high-resolution timer
> (POSIX timer or itimer on linux) programmed for the thread you want to
> sample instead of delivering signals from a separate thread.  That would
> preclude switching on sampling for the samplee while it's in the middle of,
> say, a too-long IO operation, though.

The v8 code that this comes from used to use itimer. They switched for higher resolution:

http://codereview.chromium.org/4000007

> I think this scheme should work with multiple processes, and since samples
> require grabbing timestamps anyway, you'll be able to kinda sorta match up
> data across processes.

Samples don't require grabbing timestamps. You could also use a shared atomic counter for matching up data across processes.
Comment 7 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2011-08-31 11:58:31 PDT
(In reply to Jeff Muizelaar [:jrmuizel] from comment #6)
> (In reply to Chris Jones [:cjones] [:warhammer] from comment #4)
> > You'd likely get more precise samples by using a high-resolution timer
> > (POSIX timer or itimer on linux) programmed for the thread you want to
> > sample instead of delivering signals from a separate thread.  That would
> > preclude switching on sampling for the samplee while it's in the middle of,
> > say, a too-long IO operation, though.
> 
> The v8 code that this comes from used to use itimer. They switched for
> higher resolution:
> 
> http://codereview.chromium.org/4000007
> 

I don't see a rationale for that change there, or latency measurements.  Delivering a signal from another thread should have higher latency because IIRC posted signals are only delivered at next context switch, they don't interrupt the signalee.  If that's right, then in CPU-intensive code, the average latency is going to be timeslice/2.  The profilee thread setting up an hrtimer to interrupt itself means that the kernel can directly program a timer interrupt for that core, which should have lowest latency in theory.

We can directly measure this easily enough.

> > I think this scheme should work with multiple processes, and since samples
> > require grabbing timestamps anyway, you'll be able to kinda sorta match up
> > data across processes.
> 
> Samples don't require grabbing timestamps. You could also use a shared
> atomic counter for matching up data across processes.

How are you going to accurately charge real time to backtraces without having a delta between samples?  Relying on exactly precise sampling interrupts sounds like a recipe for accreting error.  We really need to measure interrupt precision ...
Comment 8 Benoit Girard (:BenWa) 2011-08-31 12:34:58 PDT
(In reply to Chris Jones [:cjones] [:warhammer] from comment #7)
> 
> How are you going to accurately charge real time to backtraces without
> having a delta between samples?  Relying on exactly precise sampling
> interrupts sounds like a recipe for accreting error.  We really need to
> measure interrupt precision ...

I don't think we need to charge real time for this tool to be useful. Even if the samples are not evenly distributed we are stopping the program several hundred time randomly each second. If certain functions show up appropriately more frequently in the samples then we can be reasonably certain that there is a hot spot. The current patch only tries to find hotspot, not measure the duration. I tried to leave this tool as open as possible to be extended in the future.

Matching up samples recorded across threads and process would in fact be very useful for electrolysis, especially to see what is taking a long time to complete in sync ipc calls. Jeff tells me that using an atomic counter in shared memory could be cheaper then using a high resolution timer.
Comment 9 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2011-08-31 12:42:28 PDT
Yes, nonuniform sampling could still be useful, it just needs to be something that users of the tool are aware of.

A counter will be cheaper, but be less accurate (than a high-res clock).  A counter would be somewhat harder to implement.  It's just a tradeoff: if correlation inaccuracies on the order of interrupt interval are OK, then a counter will be just fine.
Comment 10 Benoit Girard (:BenWa) 2011-08-31 13:27:42 PDT
For now I'd rather land it as is and get a feel of how useful this data is. In the future we should experiment with some of the suggestions above.

We should probably flesh out the tool better, and experiment with these suggestion, on android/linux before we start porting it to other platform instead of having to back port too many patches.
Comment 11 Benoit Girard (:BenWa) 2011-09-01 16:41:08 PDT
Created attachment 557694 [details] [diff] [review]
SPS Profiler

I need to verify that the omitting the build pref correctly disable the profiler before checking in.

Usage:
1) Set the mozconfig pref
2) Run fennec, the profiler will keep a rotating buffer of X samples with no noticeable overhead.
3) Dump the profile to /sdcard/profile_0.txt : |adb shell kill -42 <PID>|
4) (optional) post process + symbolicate the data using addr2line.
5) Paste the log in http://ehsan.github.com/cleopatra/ (still being improved)

I haven't done the rename yet. I like SPS because the name is more descriptive to new users but I'm willing to rename it to "Simps".

I also plan on submitting this to Talos so that we can investigate turning this on by default.

For number 4 I'm still experimenting with the script that does post-processing with addr2line/objdump, I don't have anything to publish yet but if you ping me I'll send you what I have.
Comment 12 Benoit Girard (:BenWa) 2011-09-01 16:42:53 PDT
Also for a quick demo copy paste the latest sample attached log into http://ehsan.github.com/cleopatra/.
Comment 13 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2011-09-01 16:43:11 PDT
(In reply to Benoit Girard (:BenWa) from comment #11)
>
> 3) Dump the profile to /sdcard/profile_0.txt : |adb shell kill -42 <PID>|

Quick nit: can this be /sdcard/profile_${PID}.txt?
Comment 14 Benoit Girard (:BenWa) 2011-09-01 16:47:09 PDT
Quick note about sharing this code:
- We need a bit of code to integrate with the application (mozilla). All that code lives in TableTicker.cpp. Integrating with an application will require a bit of work so replacing the moz code in TabltTicker.cpp (TableTicker::HandleSaveRequest(), TableTicker::TableTicker()) wont add a very much work.
- I noticed I used mozilla_ in some names, I can rename those once we pick a name.
Comment 15 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2011-09-02 23:22:47 PDT
Comment on attachment 557694 [details] [diff] [review]
SPS Profiler

I should say right off that I really like where this patch is going
:).  I left a bunch of comments, but I don't think it should be too
much more work to get this landable.

Overall, I think this patch is trying to be a bit too general with the
profiler>SPS sorting.  I don't really expect we'll have multiple,
entirely separate low-overhead always-enabled profiling backends
united under the same API.  Do you?  If not, let's focus on SPS, and
add provisions for its impl on multiple platforms, and do away with
the abstract profiler notion.

>diff --git a/config/autoconf.mk.in b/config/autoconf.mk.in
>--- a/config/autoconf.mk.in
>+++ b/config/autoconf.mk.in
>diff --git a/configure.in b/configure.in
>--- a/configure.in
>+++ b/configure.in

Have you run this patch through try with the profiler enabled?  It'd
be great to land it initially without all this configure crap, if it
doesn't affect talos of course.

>diff --git a/content/base/src/nsDocument.cpp b/content/base/src/nsDocument.cpp

(I didn't review usage very closely.  It would have been better to
split that into a separate patch.)

>diff --git a/tools/profiler/Makefile.in b/tools/profiler/Makefile.in
>new file mode 100644
>--- /dev/null
>+++ b/tools/profiler/Makefile.in

>+# The Initial Developer of the Original Code is
>+# Kipp E.B. Hickman.

I don't think that's true ;).

>+DEPTH		= ../..
>+topsrcdir	= @top_srcdir@
>+srcdir		= @srcdir@
>+VPATH       = \

Nit: this should line up with "="s above or not attempt to line up.

>+  $(srcdir) \
>+  $(srcdir)/sps \
>+  $(NULL)
>+
>+
>+include $(DEPTH)/config/autoconf.mk
>+
>+EXPORTS = \
>+        sampler.h \
>+        sps_sampler.h \
>+        $(NULL)
>+

Nit: should match the VPATH style above.

>+ifeq ($(OS_TARGET),Android)

Won't this work on desktop linux too?  It would be easier to build and
test there.

>+CPPSRCS += \
>+        platform.cc \
>+        TableTicker.cpp \
>+        $(NULL)

Nit: same here.

>diff --git a/tools/profiler/sampler.h b/tools/profiler/sampler.h

The mix of naming style (sampler.h vs. TableTicker.cpp
vs. platform.cc) is a bit off-putting.  We decided to just fork the v8
stuff, right?  (There wasn't enough of it to be worth the effort of
sharing?)  If so, let's import it with moz naming style, LikeThis.h.

>new file mode 100644
>--- /dev/null
>+++ b/tools/profiler/sampler.h
>+// General include for the standard profiler interface,
>+// each profiler implementation should overwrite the macros
>+// as desired or define them as no-ops.
>+
>+#ifndef SAMPLER_H
>+#define SAMPLER_H
>+
>+#if defined(_MSC_VER)
>+#define FULLFUNCTION __FUNCSIG__
>+#elif (__GNUC__ >= 4)
>+#define FULLFUNCTION __PRETTY_FUNCTION__
>+#else
>+#define FULLFUNCTION __FUNCTION__
>+#endif
>+

Mmm we should have a general helper for this.  This is the second or
third place I've seen it copied to.  Nothing to worry about for this
patch, though.

>+#define SAMPLE_CHECKPOINT(name_space, info) // NOP
>+#define SAMPLE_MARKER(info) // NOP
>+

Nit: I would use "SAMPLER" here to avoid ambiguity with the verb "to
sample".

These macros need very good comments.

I don't want to bikeshed this too much, but if we want folks to use
this well, I'd like to pick some good names.  SAMPLER_MARKER() is OK;
the mental model is that there's a timeline, and you're marking that
timeline with some specific event?  I wonder if
SAMPLER_TIMELINE_EVENT() might be clearer, although it's longer.

SAMPLER_CHECKPOINT() is confusing, I think.  The mental model is that
the profiler keeps an abstraction of the actual machine stack, and
this helper pushes a new frame onto that stack.  Can we find a more
descriptive name?  How about SAMPLER_STACK_FRAME()?

This file would be a good place for a long-ish comment giving an
overview of the system.

>diff --git a/tools/profiler/sps/Makefile.in b/tools/profiler/sps/Makefile.in

Per general note above, I think you should consider getting rid of the
sps/ directory here.  But if you don't, and just for future reference,
Ted will get mad at you for adding an extraneous Makefile :).  You can
just VPATH the sources here into the tools/profiler/Makefile.in and
nuke this one.  That'll result in faster builds.  See
gfx/layers/Makefile.in for current best practices.

>diff --git a/tools/profiler/sps/TableTicker.cpp b/tools/profiler/sps/TableTicker.cpp
>new file mode 100644
>--- /dev/null
>+++ b/tools/profiler/sps/TableTicker.cpp
>@@ -0,0 +1,392 @@
>+static void reverse(char s[])
>+static void kr_itoa(int n, char s[])

These functions are fun to write, but snprintf() is the right tool for
this job ;).  kr_itoa() is used before a series of system calls that
write to a file, it's not going to be hot code.  In general, this kind
of stuff is a good way to introduce security bugs, so prefer to use
stdlib/STL stuff unless absolutely necessary not to.

>+class StackEntry

This doesn't seem to be used.

>+
>+class ProfileEntry
>+{
>+public:
>+  ProfileEntry() {}
>+

Make this initialize all the members to 0/null.

>+  // aTagData must not need release (i.e. be a string from the text segment)
>+  ProfileEntry(char aTagName, const char *aTagData)
>+    : mTagName(aTagName)
>+    , mTagData(aTagData)
>+    , mLeafAddress(0)
>+    , mTagDataNeedRelease(false)
>+    , mUsed(true)
>+  { }
>+
>+  ProfileEntry(char aTagName, const char *aTagData, Address aLeafAddress)
>+    : mTagName(aTagName)
>+    , mTagData(aTagData)
>+    , mLeafAddress(aLeafAddress)
>+    , mTagDataNeedRelease(false)
>+    , mUsed(true)
>+  { }
>+
>+  void write(Profile *profile, int fileno);
>+
>+  void free()

Naming methods after libc functions is usually not a great idea
because sometimes they're macros and weird things happen during
builds.  Would recommend choosing something else.

You don't need this free() method.

>+private:
>+  char mTagName;
>+  const char* mTagData;
>+  Address mLeafAddress;

Put mTagName after mLeafAddress for better natural alignment.

>+  bool mTagDataNeedRelease;
>+  bool mUsed;

You don't need these members.

>+#define PROFILE_MAX_ENTRY 100000

sizeof(ProfileEntry) will be 12 bytes on 32-bit machines, 24 on
64-bit.  So that's a 1.2/2.4 MB buffer for samples.  I can live with
that, but unless we get lucky with demand-paging, this is going to
regress one of the talos memory benchmarks, I bet.  It's probably
going to be easier to dynamically allocate the sample buffer when the
profiler is turned on.  That'll also let us do stuff like set an env
variable or preference to override the number of entries.

>+class Profile
>+  void addTag(ProfileEntry aTag)
>+  {
>+    // Called from signal, call only Reentrant functions
>+    // May require memory release
>+    entries[writePos].free();

You can nuke this line.

>+    entries[writePos] = aTag;
>+    writePos = (writePos + 1) % PROFILE_MAX_ENTRY;
>+    if (writePos == readPos) {
>+      // Keep on sloth open

"one slot"

>+      entries[readPos].free();

entries[readPos] = ProfileEntry();

>+  void setType(int aType)
>+  {
>+    mType = aType;
>+  }

This doesn't seem to be used.

>+private:
>+  // Circular buffer 'Keep One Slot Open' implementation
>+  // for simplicity
>+  ProfileEntry entries[PROFILE_MAX_ENTRY];
>+  int writePos; // points to the next entry we will write to
>+  int readPos;  // points to the next entry we will read to

I would prefer a separate CircularBuffer datatype, so that it's easier
to check its correctness in isolation, but I wouldn't r- this patch
for that.

>+class TableTicker: public Sampler {
>+
>+  virtual void Tick(TickSample* sample);
>+
>+  virtual void RequestSave()
>+  {
>+    mSaveRequested = true;
>+  }

Please very clearly note that the above two functions run in a
signal-handler context, so can only use async-signal-safe functions.
You also need to do this for the functions that Tick() calls.

>+/**
>+ * This is an event used to unref a GLContext on the main thread and
>+ * optionally delete a texture associated with that context.
>+ */

No it's not :P.

>+class SaveProfileTask : public nsRunnable {
>+
>+  NS_IMETHOD Run() {
>+
>+    sprintf(buff, FOLDER "profile_%i.txt", XRE_GetProcessType());
>+

|snprintf|.  Need to have the pid in the filename too.

>+  nsCOMPtr<nsIRunnable> runnable = new SaveProfileTask();
>+  NS_DispatchToMainThread(runnable);

Mmmmkay this is a big ironical bowl of suck, but it would be better to
use the ipc/chromium Tasks here than XPCOM runnables :/.  Subprocesses
aren't guaranteed to have XPCOM.  That can go into a followup though.

>+
>+void TableTicker::Tick(TickSample* sample)
>+{
>+  // Marker(s) come before the sample

You need to document the file format somewhere visible.  Sampler.h
would be great, near the overview comment.

>+  int i = 0;
>+  while(true) {
>+    const char *marker = mStack.getMarker(i++);

Write this as a for loop, plz.

>+    if(!marker)
>+      break;
>+    mProfile.addTag(ProfileEntry('m', marker));
>+  }
>+  mStack.queueClearMarker = true;
>+
>+  // Sample
>+  // 's' tag denotes the start of a sample block
>+  // followed by 0 or more 'c' tags.
>+  for (int i=0; i<mStack.sp; i++) {

"i = 0" etc.

>+    if (i == 0) {
>+      Address pc = 0;
>+      if (sample) {
>+        pc = sample->pc;
>+      }
>+      mProfile.addTag(ProfileEntry('s', mStack._stack[i], pc));

Do this before the for loop.

>+void ProfileEntry::write(Profile *profile, int fileno)
>+{
>+    for (int i=0; i<maps.count; i++) {

Whitespace here plz, |i = 0| etc.

>+      map_entry *e = &maps.entries[i];
>+      if (pc > e->start && pc < e->end) {

You can sort the map table (since the ranges are disjoint) then use
binary search to find the right entry.  We can have thousands of
memory mappings, so that might be worth doing if this function gets
hot.  If we were using STL for this, it would be just a few lines of
code and cleaner in addition to faster ;).

>+        if (e->name) {
>+          char offsetStr[33];
>+          found = true;
>+          kr_itoa(pc - e->start, offsetStr);
>+          ::write(fileno, "l-", 2);
>+          ::write(fileno, e->name, strlen(e->name));

You'd save a lot of bytes by writing a "map legend" and then using a
small identifier here instead of the full library name.  But that's an
optimization that can happen in a followup.

>+      ::write(fileno, "l-???@", 6);
>+      ::write(fileno, offsetStr, strlen(offsetStr));
>+    }
>+    ::write(fileno, "\n", strlen("\n"));
>+  }
>+#endif
>+}

It might not matter because I don't think this function is going to be
terribly perf-critical, but you might be better off perf-wise with
userspace-buffered writes followed by an fflush() than going in and
out of the kernel for this many small writes.  Being able to use
fprintf() would save a lot of LoC too.

>+
>+void mozilla_sampler_init()
>+{
>+  // profiler uses getspecific because TLS is not supported on android.
>+  // getspecific was picked over nspr because it had less overhead required
>+  // to make the checkpoint function fast.

TLS variables are faster than pthread_getspecific() and more
convenient, so we should use them when available.  If the crack-baby
android linker doesn't have __thread linkage, then tough luck for it,
but that shouldn't slow down other platforms.  I don't think the
ifdef-ery would be too bad.

I've heard horror stories about MSVC's __tls though, like it only
having 256 slots.  For MSVC, we'll probably want to use NSPR's TLS
thing.  I think bsmedberg knows about MSVC TLS in some detail.  That
can of course go in a followup.

>+  int key_create_err = 0;
>+  key_create_err |= pthread_key_create(&pkey_stack, NULL);
>+  key_create_err |= pthread_key_create(&pkey_ticker, NULL);
>+  if (key_create_err) {
>+    LOG("Failed to init.");
>+    return;
>+  }

Since you're not logging the error code, 

  if (pthread_key_create() || pthread_key_create()) {

would be better.

>+
>+  TableTicker *t = new TableTicker(10);
>+  __android_log_print(ANDROID_LOG_ERROR, "profiler", "Set stack %p", t->GetStack());
>+  pthread_setspecific(pkey_ticker, t);
>+  pthread_setspecific(pkey_stack, t->GetStack());
>+
>+  t->Start();

Starting the sampler by default is definitely going to regress talos.
Probably don't want to do that.

>+}
>+
>diff --git a/tools/profiler/sps/platform.cc b/tools/profiler/sps/platform.cc
>new file mode 100644
>--- /dev/null
>+++ b/tools/profiler/sps/platform.cc

This should be Platform_linux.cc.

>+#ifdef ENABLE_SPS_LEAF_DATA
>+/* a crapy version of getline */
>+static ssize_t getline(char **lineptr, size_t *n, FILE *stream)

Why can't we use libc getline()?

>+struct map_info get_maps(pid_t pid)

This name isn't moz style.

This function is kinda sorta wrong because it'll miss
dynamically-loaded code, and also JIT'd code.  Maybe not all that big
of a deal.  It does mean though that for the time being we have to be
very careful to ensure that INIT() is called after glandium's lib
loading stuff at startup.

>+   char name[4096] = "";

PATH_MAX

>+   ret = sscanf(line, "%lx-%lx %*s %lx %*s %*x %s\n",  &entry.start, &entry.end, &entry.offset, name);
>+   entry.name = strdup(name);
>+   if (ret != 4 && ret != 3) {
>+     LOG("Get maps failed");
>+   }

Can cut down on map table size by filtering out ones that don't have
the 'x' bit in the map string.

>+static Sampler* active_sampler_ = NULL;
>+

This isn't moz style.

>+
>+#if !defined(__GLIBC__) && (defined(__arm__) || defined(__thumb__))
>+// Android runs a fairly new Linux kernel, so signal info is there,
>+// but the C library doesn't have the structs defined.
>+

I hate bionic.

>+//class Sampler::PlatformData : public Malloced {
>+class Sampler::PlatformData {
>+ public:
>+  explicit PlatformData(Sampler* sampler)
>+      : sampler_(sampler),
>+        signal_handler_installed_(false),
>+        vm_tgid_(getpid()),
>+        // Glibc doesn't provide a wrapper for gettid(2).
>+        vm_tid_(syscall(SYS_gettid)),

Let's make a sys_gettid() wrapper that does this so that the
parameters are documented.

>+        signal_sender_launched_(false) {
>+  }
>+
>+  void SignalSender() {
>+    while (sampler_->IsActive()) {
>+      sampler_->HandleSaveRequest();
>+
>+      // Glibc doesn't provide a wrapper for tgkill(2).
>+      syscall(SYS_tgkill, vm_tgid_, vm_tid_, SIGPROF);

Same here, sys_tgkill().

>+      // Convert ms to us and subtract 100 us to compensate delays
>+      // occuring during signal delivery.
>+
>+      const useconds_t interval = sampler_->interval_ * 1000 - 100;

This is pretty ghetto ;).  I won't complain any more until we measure,
though.

>+void Sampler::Start() {
>+  // There can only be one active sampler at the time on POSIX
>+  // platforms.

There can be more than one of these ... the reason there's only one
right now is that active_sampler_ is a static var.  There's no
fundamental limitation.

>+  // Request start/stop profiling signals
>+  LOG("Request signal");
>+  struct sigaction sa2;
>+  sa2.sa_sigaction = ProfilerSaveSignalHandler;
>+  sigemptyset(&sa2.sa_mask);
>+  sa2.sa_flags = SA_RESTART | SA_SIGINFO;
>+  if (sigaction(42, &sa2, &data_->old_signal_handler_) != 0) {

Let's make 42 a symbolic constant somewhere visible, maybe in
Sampler.h.  We need to assert that SIGRTMIN <= 42 <= SIGRTMAX.

>+void Sampler::Stop() {
>+  active_ = false;
>+
>+  // Wait for signal sender termination (it will exit after setting
>+  // active_ to false).
>+  if (data_->signal_sender_launched_) {
>+    pthread_join(data_->signal_sender_thread_, NULL);
>+    data_->signal_sender_launched_ = false;
>+  }
>+
>+  // Restore old signal handler
>+  if (data_->signal_handler_installed_) {
>+    sigaction(SIGPROF, &data_->old_signal_handler_, 0);
>+    data_->signal_handler_installed_ = false;
>+  }

Need to restore the "signal 42" handler too, or we'll have a nice
security bug.

>+
>+  // This sampler is no longer the active sampler.
>+  active_sampler_ = NULL;

Please note here that we're not deleting active_sampler_ because this
is only run from ~Sampler atm.

Which reminds me: who's going to delete our Sampler?  I guess that can
be a followup unless it turns tinderbox orange.

>diff --git a/tools/profiler/sps/platform.h b/tools/profiler/sps/platform.h
>new file mode 100644
>--- /dev/null
>+++ b/tools/profiler/sps/platform.h
>@@ -0,0 +1,116 @@
>+#define ASSERT(a)

why u no assert?  MOZ_ASSERT(a) would work well here.

>+#include <stdint.h>
>+typedef uint8_t byte;
>+typedef byte* Address;

This isn't going to compile on windoze.  Probably need to pull in
mozilla/Util.h and use |uint8| :/.

>+
>+// The USE(x) template is used to silence C++ compiler warnings
>+// issued for (yet) unused variables (typically parameters).
>+template <typename T>
>+static inline void USE(T) { }

You can nuke this, we have our own helper: |unused << x|;

>+struct map_entry {

Let's C++-ify this.

>+ unsigned long start;
>+ unsigned long end;
>+ unsigned long offset;
>+ char *name;
>+};
>+
>+struct map_info {
>+ struct map_entry *entries;
>+ int count;
>+};

This really wants to be a std::vector<MapEntry>.  We don't need to
mutate the map info in a signal-safe way, so vector is OK.

>diff --git a/tools/profiler/sps/sps_sampler.h b/tools/profiler/sps/sps_sampler.h

I think we can go ahead and start thinking about how to grow this code
to other platforms without creating too much work for this patch.  I
think everything in this file is generic (or rather, should be; see
notes below about inlining).  We should be able to move all the
interfaces here to Sampler.h.  We can make Sampler.h the public
interface, and keep platform-specific code out of it unless absolutely
necessary.  Then the rest should be covered by Platform.h and
Platform_*.cc implementations, which are private to the profiler.
Sound OK?

>+#include <pthread.h>
>+
>+extern pthread_key_t pkey_stack;
>+extern pthread_key_t pkey_ticker;
>+

These are here to support the inlined functions, I assume.  See below.

>+#define SAMPLER_INIT() mozilla_sampler_init();

This should have been part of the Sampler.h interface, but if this
becomes the public interface, nothing to fix.

Probably want a SAMPLER_DEINIT() to delete our Sampler.

>+#define SAMPLE_CHECKPOINT(name_space, info) SamplingRAII only_one_sampleraii_per_func(FULLFUNCTION, name_space "::" info);

(Technically it's one per C++ scope.)

>+//#define NS_SAMPLE_START() mozilla_sampler_init()
>+//#define NS_SAMPLE_STOP()

What's this for?

>+// Returns a handdle to pass on exit. This can check that we are popping the
>+// correct callstack.
>+inline void* mozilla_sampler_call_enter(const char *aInfo);
>+inline void  mozilla_sampler_call_exit(void* handle);
>+inline void  mozilla_sampler_add_marker(const char *aInfo);
>+

OK, a few things here ---

I don't think we want external code to call these functions directly,
do we?  If not, we should say so, and leave a note that the macros
should be used instead.

Do you expect to call these from C code?  I'm guessing not because
they're not using C linkage, and you're using C++ features to
implement them :) ("inline" keyword).  If not, let's use normal style
and put these in a namespace and use our regular naming conventions
"addMarker()" etc.  (That's js/src/ C++ style, and we use that in
mfbt/ too, so I don't see a problem with using it here.)  If so, then
you have some bustage to fix :).

Whether or not these remain C, you should use |SamplerStack*| instead
of |void*| for the arguments to these.  |void*| is forcing lots of
unnecessary casts in this code for no gain, AFAICT.

Last and probably most importantly, why are they inlined?  Do you have
data showing that the function call is a measurable overhead?  Did you
check that the compiler is actually inlining the function body?  It's
free to ignore the |inline| keyword, and the rumor is that both gcc
and MSVC don't even look at it.  This uglifies quite a bit of the
impl, so I'd it to be properly motivated.  If the compiler isn't
choosing to inline these, then we're getting the worst of all worlds.

>+void mozilla_sampler_init();
>+
>+#ifdef __cplusplus
>+class SamplingRAII {

More descriptive name plz; "SamplerStackFrame"?  Also, this should be
annotated NS_STACK_CLASS and be in the mozilla:: namespace.

>+
>+// WARNING: Signal can (and do) come when modifying this structure
>+// so beware of the order in which you modify things, the structure
>+// must remain consistant at EVERY point. i.e. the order of your increments
>+// do matter.

I would just say, the SamplerStack members are read by signal
handlers, so the mutation of them needs to be signal-safe.

>+struct Stack

This isn't ifdef __cplusplus, but it's full of C++-isms.

I would call this "SamplerStack" for clarity.

If we de-inline the interface here, this struct can be moved into
Sampler.cpp.

>+{
>+  // Keep unused parts of this stack cleared
>+  const char *_stack[1024];
>+  const char *_markers[1024];

SM style for members is "likeThis", no underscores anywhere.  These
need better comments.

>+  int sp;
>+  int mp;

These need to be sig_atomic_t.

Write out the full names, please.  They need to be discussed in the
comments above.

Should note here that we're using a bare-bones impl for efficiency
partly, but also because the impl needs to be signal safe.  Using
std::stack or something like that wouldn't be signal safe.

I'm not so sure I like the approach of silently dropping frames and
markers when the buffers fill.  For one, you've got a nasty bug here
with push()/pop(): if the push() fails because the buffer is full,
then pop() will still happily pop what was last pushed.  This will
cause a buffer underflow.

As things stand now, markers will accumulate while the profiler is
turned off.  That means the first marker dump will be garbage.  I
think we have two options here: (i) not record markers when the
sampler is turned off, which is just a well-predicted branch in
addMarker(), or (ii) use a ring buffer for the markers.  I think I
would prefer the extra branch.

Since the sampler can be dynamically enabled/disabled, we always have
to record stack frames or we could get garbage samples for a while
after it's first enabled.  So no other options here, have to record
them always.

>+  bool queueClearMarker;

This also needs to be sig_atomic_t.

>+  Stack()
>+  {
>+    sp = 0;
>+    mp = 0;
>+    queueClearMarker = false;
>+  }

This is a C++ ctor right now, so use C++ initializers for the members
plz :).

>+  void addMarker(const char *marker)
>+  {
>+    if (queueClearMarker) {
>+      clearMarker();
>+    }
>+    if (!marker) {
>+      return; //discard
>+    }
>+    if (mp == 1024) {
>+      return; //array full, silently drop

If we go with the test-if-active instead of ring buffer approach here
(or don't change anything), I think we should warn when the marker
buffer overflows.

>+    }
>+    _markers[mp] = marker;
>+    asm("":::"memory");

Split this out into a CompilerBarrier() helper for clarity, plz.

>+  void clearMarker()

"clearMarkers()".

>+  void push(const char *name)
>+  {
>+    if (sp>=1024)

"sp >= 1024"

>+      return;
>+

(Probably not worth warning here since this is likely to happen with
nested event loops etc.)

>+    _stack[sp] = name;
>+    // Prevent the optimizer from re-ordering these instructions
>+    asm("":::"memory");

CompilerBarrier(), and then can drop the comment.

>+    sp++;
>+  }
>+  void pop()
>+  {
>+    sp--;

Since callExit() is a public symbol atm, callers can manually cause
buffer underflow by calling exit() repeatedly, another underflow in
addition to failed push()es as noted above.

It's somewhat subtle, and I need to think about it in a bit more to be
100% sure, but I don't think you need to use atomic
increment/decrement here for stackPointer/markerPointer.  I have a
hand-wavy argument in mind, but I need to formalize it.  If you guys
already worked this out, please add a comment describing your
argument.

>+
>+inline void* mozilla_sampler_call_enter(const char *aInfo)
>+{
>+  Stack *stack = (Stack*)pthread_getspecific(pkey_stack);

pthread_getspecific()/setspecific() should be the only places we cast
from void*<->SamplerStack*.

>+  if (!stack) {
>+    return stack;
>+  }
>+  stack->push(aInfo);
>+
>+  // The handle is meant to support future changes
>+  // but for now it is simply use to save a call to
>+  // pthread_getspecific on exit.

I don't understand the first part of this comment, but the second part
is a valid optimization that's likely to be the case even for TLS
variables.

There's also a subtlety here that you should document: in theory,
pthread_getspecific could return a null SamplerStack to callEnter(),
but return a nonnull one to callExit().  If that happened, and you
*didn't* save the stack handle, then callExit() would pop() something
it didn't push, and that would be another buffer underflow bug.

So this is necessary for correctness, in addition to being a nice
optimization (probably).  In the current impl, the stack will always
be either always null or always nonnull, so if you want to enforce
that as a strong invariant and avoid the subtlety above, that would be
fine too.  Document it and assert it :).

r- for security bugs.
Comment 16 Benoit Girard (:BenWa) 2011-09-03 14:16:46 PDT
(In reply to Chris Jones [:cjones] [:warhammer] from comment #15)

Thanks for the very throughout review, I really appreciate! Lots of good feedback. I didn't write a comment for rhetorical questions or suggestions I agreed with. A lot of styles nits are inconsistencies between v8/jeff/me that I will fix up. I started a new patch in my queue based on this one that will have the review changes to make a nice 'interdiff' for the follow up review but I will qfold them into one patch on check-in.

> Overall, I think this patch is trying to be a bit too general with the
> profiler>SPS sorting.  I don't really expect we'll have multiple,
> entirely separate low-overhead always-enabled profiling backends
> united under the same API.  Do you?  If not, let's focus on SPS, and
> add provisions for its impl on multiple platforms, and do away with
> the abstract profiler notion.

My idea was to make it easy for people to hack on an alternate implementation(s) that perhaps might be higher overhead platform specific alternative for example. I guess if I don't know what it might be then it's perhaps not worth adding support for it.

> Have you run this patch through try with the profiler enabled?  It'd
> be great to land it initially without all this configure crap, if it
> doesn't affect talos of course.

I only had the pref to make it less controversial to land it at first. In the end me and Jeff really want this by default (as long as there overhead is negligible). If it allows us to fix performance issues faster then it will pay for itself in performance costs.

> Won't this work on desktop linux too?  It would be easier to build and
> test there.

I haven't tried this on the linux yet. I had a port for mac by replacing tgkill with pthread_kill but jeff had a different idea for porting to mac and asked me to remove it. I'm thinking of enabling this one platform at a time to make the transition less complex.

> You'd save a lot of bytes by writing a "map legend" and then using a
> small identifier here instead of the full library name.  But that's an
> optimization that can happen in a followup.

Optimizing profile saving speed/size is something we've discussed a bit. We plan to address this in a follow bug since saving only occurs on demand and logs have a reasonable size (smaller then shark's).

> It might not matter because I don't think this function is going to be
> terribly perf-critical, but you might be better off perf-wise with
> userspace-buffered writes followed by an fflush() than going in and
> out of the kernel for this many small writes.  Being able to use
> fprintf() would save a lot of LoC too.

I'll rewrite with fprintf. We avoided them because saving used to happen in the signal handler but we changed that for obvious reasons.

> TLS variables are faster than pthread_getspecific() and more
> convenient, so we should use them when available.  If the crack-baby
> android linker doesn't have __thread linkage, then tough luck for it,
> but that shouldn't slow down other platforms.  I don't think the
> ifdef-ery would be too bad.

I agree. Me and Jeff confirmed that they are not available on android. I want to do Linux port separately even if it is trivial because this patch is already complicated.

> >+}
> >+
> >diff --git a/tools/profiler/sps/platform.cc b/tools/profiler/sps/platform.cc
> >diff --git a/tools/profiler/sps/platform.h b/tools/profiler/sps/platform.h

These we're 95% imported from v8 but you suggest valid improvement and since we don't plan on merging (since we have different goal then the original code) we should do it.

> >diff --git a/tools/profiler/sps/sps_sampler.h b/tools/profiler/sps/sps_sampler.h

> >+// Returns a handdle to pass on exit. This can check that we are popping the
> >+// correct callstack.
> >+inline void* mozilla_sampler_call_enter(const char *aInfo);
> >+inline void  mozilla_sampler_call_exit(void* handle);
> >+inline void  mozilla_sampler_add_marker(const char *aInfo);
> >+
> 
> OK, a few things here ---
> 
> I don't think we want external code to call these functions directly,
> do we?  If not, we should say so, and leave a note that the macros
> should be used instead.

I wrote these so that c code could still add pseudo stack frame where the RAII helpers would not be unavailable (ex: gfx/qcms). Perhaps we don't care about this use case? Or we could just extend it then.

> 
> Do you expect to call these from C code?  I'm guessing not because
> they're not using C linkage, and you're using C++ features to
> implement them :) ("inline" keyword).  If not, let's use normal style
> and put these in a namespace and use our regular naming conventions
> "addMarker()" etc.  (That's js/src/ C++ style, and we use that in
> mfbt/ too, so I don't see a problem with using it here.)  If so, then
> you have some bustage to fix :).
> 
> Whether or not these remain C, you should use |SamplerStack*| instead
> of |void*| for the arguments to these.  |void*| is forcing lots of
> unnecessary casts in this code for no gain, AFAICT.
> 
> Last and probably most importantly, why are they inlined?  Do you have
> data showing that the function call is a measurable overhead?  Did you
> check that the compiler is actually inlining the function body?  It's
> free to ignore the |inline| keyword, and the rumor is that both gcc
> and MSVC don't even look at it.  This uglifies quite a bit of the
> impl, so I'd it to be properly motivated.  If the compiler isn't
> choosing to inline these, then we're getting the worst of all worlds.

We've been taking a lot of care to avoid adding branching and getting these calls to inline to reduce the overhead. If we don't support c code then we can remove these.

> These need to be sig_atomic_t.

I didn't know these existed. Searching them has pointed me to a few docs about being signal safe I will take a look at. Apparently 'int' should be sig atomic everywhere but I like more explicit types.

> I'm not so sure I like the approach of silently dropping frames and
> markers when the buffers fill.  For one, you've got a nasty bug here
> with push()/pop(): if the push() fails because the buffer is full,
> then pop() will still happily pop what was last pushed.  This will
> cause a buffer underflow.

We're trying to avoid branching to reduce overhead since this is in the 'hot path'. We've discussed using protected pages to prevent overflows but I'm not convinced of that either. The stack should be large enough but that's not a great argument either. I will add branching until we can find a solution that doesn't need it (if its even possible). 
> 
> As things stand now, markers will accumulate while the profiler is
> turned off.  That means the first marker dump will be garbage.  I
> think we have two options here: (i) not record markers when the
> sampler is turned off, which is just a well-predicted branch in
> addMarker(), or (ii) use a ring buffer for the markers.  I think I
> would prefer the extra branch.

So far we've been using this as always-on and we don't have any markers on start-up. But your suggestion does make this more flexible.

> I don't understand the first part of this comment, but the second part
> is a valid optimization that's likely to be the case even for TLS
> variables.

This is just to make these more extensible in the future. We return an unknown handle that gives us data needed to remove the checkpoint/speudo stack frame, but if we stick to only having the sps profiler implementation (and remove support for other profilers) then we can remove it.
Comment 17 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2011-09-03 14:53:52 PDT
(In reply to Benoit Girard (:BenWa) from comment #16)
> (In reply to Chris Jones [:cjones] [:warhammer] from comment #15)
> > Overall, I think this patch is trying to be a bit too general with the
> > profiler>SPS sorting.  I don't really expect we'll have multiple,
> > entirely separate low-overhead always-enabled profiling backends
> > united under the same API.  Do you?  If not, let's focus on SPS, and
> > add provisions for its impl on multiple platforms, and do away with
> > the abstract profiler notion.
> 
> My idea was to make it easy for people to hack on an alternate
> implementation(s) that perhaps might be higher overhead platform specific
> alternative for example. I guess if I don't know what it might be then it's
> perhaps not worth adding support for it.
> 

If another use case arises in the future, then it would probably be easier to design the abstract interface with two concrete backends in hand.  Until then, I would prefer to keep things as simple as possible.

> > Have you run this patch through try with the profiler enabled?  It'd
> > be great to land it initially without all this configure crap, if it
> > doesn't affect talos of course.
> 
> I only had the pref to make it less controversial to land it at first. In
> the end me and Jeff really want this by default (as long as there overhead
> is negligible). If it allows us to fix performance issues faster then it
> will pay for itself in performance costs.
> 

Sure.  My recommendation was just to send a job off to try without the configure stuff (or with the profiler enabled), just to see if it actually moves talos.  If not, then yay!, and we don't need to land the configure goop, which we'd just want to remove in the future anyway.

> > >diff --git a/tools/profiler/sps/sps_sampler.h b/tools/profiler/sps/sps_sampler.h
> 
> > >+// Returns a handdle to pass on exit. This can check that we are popping the
> > >+// correct callstack.
> > >+inline void* mozilla_sampler_call_enter(const char *aInfo);
> > >+inline void  mozilla_sampler_call_exit(void* handle);
> > >+inline void  mozilla_sampler_add_marker(const char *aInfo);
> > >+
> > 
> > OK, a few things here ---
> > 
> > I don't think we want external code to call these functions directly,
> > do we?  If not, we should say so, and leave a note that the macros
> > should be used instead.
> 
> I wrote these so that c code could still add pseudo stack frame where the
> RAII helpers would not be unavailable (ex: gfx/qcms). Perhaps we don't care
> about this use case? Or we could just extend it then.
> 

Are qcms and cairo "allowed" to use moz stuff?  I thought they weren't supposed to.  In general though, I would hold off on C support until there's C code we want to call into here from.  Calling this stuff from C code makes me nervous because it's going to be easier to misuse the interface and cause security bugs there.

> > I'm not so sure I like the approach of silently dropping frames and
> > markers when the buffers fill.  For one, you've got a nasty bug here
> > with push()/pop(): if the push() fails because the buffer is full,
> > then pop() will still happily pop what was last pushed.  This will
> > cause a buffer underflow.
> 
> We're trying to avoid branching to reduce overhead since this is in the 'hot
> path'. We've discussed using protected pages to prevent overflows but I'm
> not convinced of that either. The stack should be large enough but that's
> not a great argument either. I will add branching until we can find a
> solution that doesn't need it (if its even possible). 
> > 

We gotta fix the security bug if this is going to be on by default ;).
Comment 18 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2011-09-03 15:19:50 PDT
(In reply to Chris Jones [:cjones] [:warhammer] from comment #15)
> Comment on attachment 557694 [details] [diff] [review]
> SPS Profiler
> 
> It's somewhat subtle, and I need to think about it in a bit more to be
> 100% sure, but I don't think you need to use atomic
> increment/decrement here for stackPointer/markerPointer.  I have a
> hand-wavy argument in mind, but I need to formalize it.  If you guys
> already worked this out, please add a comment describing your
> argument.
> 

Here's the argument: at all times, the stack contains valid pointers to strings up to but not including index K.  The invariant we have to maintain is that 0 <= stackPointer <= K; IOW, all the pointers in the stack that are visible to other threads are valid pointers.  That can be argued rigorously by induction, but I'll skip to the meat

push(): assuming 0 <= stackPointer <= K on entry, push does the following
  (1) Overwrite the pointer at stackPointer.  This either increases K by 1 (adds another valid pointer), or doesn't change it, if the stack had been larger previously (so stack[stackPointer] contained a valid pointer that was overwritten with another valid pointer).  Either way, we still have stackPointer < K' after this operation completes.
  (2) Increment stackPointer.  Since stackPointer < K', then we have stackPointer' <= K'.
Now, this argument relies on (1) and (2) being strongly ordered.  The compiler barrier isn't enough per se; the actual memory stores to the stack and stackPointer *also* need to be ordered.  The stores won't be reordered on x86[1].  I'm not 100% sure about ARM.  We should find out, because if stores can be reordered, we need to use a memory fence to prevent that.

pop(): assuming 0 <= stackPointer <= K on entry, then as long as stackPointer > 0, it holds on exit because stackPointer' < K, K is unaffected (i.e. the stack isn't touched), and it's the only thread writing stackPointer.  It's crucial to this argument that the stack not be modified.

We REALLY need to assert(stackPointer > 0) in pop() :).

We should also add comments about the subtleties needed to maintain correctness, so other folks don't come along later and accidentally break the invariants and cause security bugs.  (I was this happen in some moz multi-threaded code that was thankfully only used in DEBUG builds.)

[1] http://www.cl.cam.ac.uk/~pes20/weakmemory/index3.html#x86-TSO
Comment 19 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2011-09-03 15:24:01 PDT
Oh, I forgot to mention above wrt atomic increment/decrement: it doesn't matter if the the increment in push() is observed in progress, because both stackPointer/stackPointer+1 are valid.  And similarly in pop(), both stackPointer/stackPointer-1 are valid during the decrement, because we don't mutate stack[stackPointer].
Comment 20 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2011-09-03 16:46:09 PDT
(In reply to Chris Jones [:cjones] [:warhammer] from comment #18)
> Now, this argument relies on (1) and (2) being strongly ordered.  The
> compiler barrier isn't enough per se; the actual memory stores to the stack
> and stackPointer *also* need to be ordered.  The stores won't be reordered
> on x86[1].  I'm not 100% sure about ARM.  We should find out, because if
> stores can be reordered, we need to use a memory fence to prevent that.
> 

From [1]

  The ARM and Power architectures allow processors
to reorder their individual memory accesses rather freely.
Test ppc1 below shows that a pair of stores, or a pair of
loads, to non-overlapping different memory locations, can
be reordered.

Bummer.  Gotta use a fence or come up with another approach that's not sensitive to store order.

[1] http://www.cl.cam.ac.uk/~pes20/weakmemory/draft-ppc-arm.pdf
Comment 21 Benoit Girard (:BenWa) 2011-09-04 10:02:44 PDT
> This function is kinda sorta wrong because it'll miss
> dynamically-loaded code, and also JIT'd code.  Maybe not all that big
> of a deal.  It does mean though that for the time being we have to be
> very careful to ensure that INIT() is called after glandium's lib
> loading stuff at startup.

I do this on profile save to try to minimize the problem. I think we have to be careful about code being unmapped and even more frightening is that new code could be mapped with the old address. At this point I'm hoping this doesn't become a problem otherwise we should wait until it happens to deal with it (unless we can think of an easy solution without increasing sampling overhead). We shouldn't have to worry about this happening before lib loading stuff.
Comment 22 :Ehsan Akhgari 2011-09-05 14:55:56 PDT
The parts of the code borrowed from V8 should be mentioned in about:license, I think.
Comment 23 Benoit Girard (:BenWa) 2011-09-06 07:34:29 PDT
Created attachment 558471 [details] [diff] [review]
review diff (wip)
Comment 24 Benoit Girard (:BenWa) 2011-09-07 10:53:21 PDT
I was under the impression that |asm volatile("" ::: "memory");| was sufficient but it doesn't appear to be the case. Perhaps we want to use |__sync_synchronize()|? There might be something cheaper so I'll leave a TODO in the code.

http://en.wikipedia.org/wiki/Memory_ordering
Comment 25 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2011-09-07 11:51:46 PDT
|asm volatile("" ::: "memory");| prevents the compiler from reordering instructions across the |asm| syntax.  That means that the store instructions on either side of the |asm| will enter the CPU pipeline in the order they're in the program.

However, that's *not* the same as the memory effects of the stores being made visible to all cores in the order you want.  x86 will order things like we want, but ARM doesn't guarantee it.  So we need something else for ARM.

__sync_synchronize() will do what you want, but it can be expensive in hot code.  Another approach might be possible.
Comment 26 Benoit Girard (:BenWa) 2011-09-07 12:08:46 PDT
I'm going to measure it with and without so that we can know exactly what it is costing us.
Comment 27 Benoit Girard (:BenWa) 2011-09-07 12:42:32 PDT
Created attachment 558914 [details] [diff] [review]
Review diff
Comment 28 Benoit Girard (:BenWa) 2011-09-07 12:51:55 PDT
Created attachment 558917 [details] [diff] [review]
Review diff

This patch includes the review change from the previous patch. I can attach a folded patch if that's easier to review.

The code moves are still pending.
Comment 29 Nathan Froyd [:froydnj] 2011-09-13 14:55:37 PDT
I was talking to Luke Wagner this morning and he mentioned something sort of similar that he had done:

http://hg.mozilla.org/users/lwagner_mozilla.com/measure/

The sampling is not there, but the idea of RAII calipers is at least there.  He mentioned that the overhead was small, but non-negligible, so his effort was not-landable as-is.

Jeff suggested that we should have a BoF for everybody to share their private profiler they had created. :)  I think there's clearly some sort of need for this stuff; layout has something kind of like this, there's jprof, there's telemetry probes, there's some Moz-specific DTrace effort going on...we should get everybody talking together, at least.  It'd be good to have One Profiler To Rule Them All.
Comment 30 Steve Fink [:sfink] [:s:] 2011-09-13 15:42:30 PDT
Yep, and Dave Mandelin wrote a sampling profiler which partly used RAII to tag things too, in bug 642054, which I've taken over.
Comment 31 Luke Wagner [:luke] 2011-09-15 12:07:02 PDT
Ideally this would be always-on for release Firefox (e.g., several of us have been talking about an about:cpu).  However, even with fast TLS, I think there is overhead that would prohibit certain high-frequency types of measurement (e.g., measuring the overhead of quickstubs), so DTrace-style dynamic-nop-replacement would be ideal.  Are there any plans to take this profiler in that direction?
Comment 32 :Ehsan Akhgari 2011-09-15 12:22:51 PDT
(In reply to Nathan Froyd (:froydnj) from comment #29)
> Jeff suggested that we should have a BoF for everybody to share their
> private profiler they had created. :)  I think there's clearly some sort of
> need for this stuff; layout has something kind of like this, there's jprof,
> there's telemetry probes, there's some Moz-specific DTrace effort going
> on...we should get everybody talking together, at least.  It'd be good to
> have One Profiler To Rule Them All.

Can we make sure that these sessions are either recorded for others to watch as well, or at least ask every person who present things to blog about them, post something to dev.platform, etc?  Not everybody who's interested in this stuff is at the all-hands.  :/
Comment 33 Benoit Girard (:BenWa) 2011-09-15 13:20:50 PDT
(In reply to Luke Wagner [:luke] from comment #31)

That's an interesting idea. Are you suggesting self modifying code? We should sync up and discuss this I'm definitely interested in anything that could help us ship this without regressing.
Comment 34 Steve Fink [:sfink] [:s:] 2011-09-15 14:51:51 PDT
(In reply to Benoit Girard (:BenWa) from comment #33)
> (In reply to Luke Wagner [:luke] from comment #31)
> 
> That's an interesting idea. Are you suggesting self modifying code? We
> should sync up and discuss this I'm definitely interested in anything that
> could help us ship this without regressing.

If you want to pursue this, talk to Kevin Gadd. He has implemented (some of?) the core functionality for Windows.
Comment 35 Nathan Froyd [:froydnj] 2011-09-15 21:12:32 PDT
(In reply to Ehsan Akhgari [:ehsan] from comment #32)
> (In reply to Nathan Froyd (:froydnj) from comment #29)
> > Jeff suggested that we should have a BoF for everybody to share their
> > private profiler they had created.
> 
> Can we make sure that these sessions are either recorded for others to watch
> as well, or at least ask every person who present things to blog about them,
> post something to dev.platform, etc?  Not everybody who's interested in this
> stuff is at the all-hands.  :/

Yeah, sorry about that, I should have suggested something different.  I will see if there's room in the schedule tomorrow; if we wind up doing something, I will blog about it and/or post to dev.platform.
Comment 36 Benoit Girard (:BenWa) 2011-09-15 22:05:50 PDT
I'd love to discuss our options. If we can't get a room we could just meet up in a hacking room. Should we post something on dev.platform to reach out to everyone that is working on something and/or curious?
Comment 37 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2011-09-19 18:47:53 PDT
Sorry for the review latency, but is this patch still trying to be landed as-is?  Happy to review asap if so.
Comment 38 Benoit Girard (:BenWa) 2011-09-19 20:22:04 PDT
No problem. We noticed that the saving code is incorrect, during the all hands profiling BoF, since you can take a sample and modify the profile data while a saving. I'll get that fix.

Nathan Froyd gave a good summary of our discussion:
http://blog.mozilla.com/nfroyd/2011/09/17/notes-from-the-all-hands-profiling-bof/
Comment 39 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2011-09-19 21:49:30 PDT
(In reply to Benoit Girard (:BenWa) from comment #38)
> No problem. We noticed that the saving code is incorrect, during the all
> hands profiling BoF, since you can take a sample and modify the profile data
> while a saving. I'll get that fix.
> 

That's perfectly ok per my argument above.  You just get a slightly different stack.  The $ip captured on sample would be slightly misleading.  But if you have a good and performant fix, go for it ;).
Comment 40 Ted Mielczarek [:ted.mielczarek] 2011-10-04 11:48:18 PDT
(In reply to Benoit Girard (:BenWa) from comment #11)
> 4) (optional) post process + symbolicate the data using addr2line.

> For number 4 I'm still experimenting with the script that does
> post-processing with addr2line/objdump, I don't have anything to publish yet
> but if you ping me I'll send you what I have.

Using platform tools for this is a pain in the butt. If you'd like to piggyback on the work that we've done in Breakpad, you can use Breakpad format symbols. You just have to "make buildsymbols" in your objdir, or grab the symbols.zip file next to a build from FTP. Jesse wrote some Python code that parses the Breakpad symbol files and uses them to fix up assertion stack traces, it's in the tree here:
http://mxr.mozilla.org/mozilla-central/source/tools/rb/fix_stack_using_bpsyms.py
Comment 41 Benoit Girard (:BenWa) 2011-10-19 12:23:41 PDT
Created attachment 568150 [details] [diff] [review]
review diff
Comment 42 Benoit Girard (:BenWa) 2011-10-19 12:25:59 PDT
Created attachment 568152 [details] [diff] [review]
Review diff

Forgot to qref.

These changes include a crash fix for Tablets.

Submitting a try job.
Comment 43 Benoit Girard (:BenWa) 2011-10-19 13:22:10 PDT
Created attachment 568165 [details] [diff] [review]
Review diff (+build fix)

This version doesn't break non supported platforms (non-android for now). You'll be able to include sampler.h but the macros will be defined to no-op and you wont link against libprofiler.a.
Comment 44 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2011-10-20 04:49:19 PDT
Comment on attachment 568165 [details] [diff] [review]
Review diff (+build fix)

>+  // Called within a signal. This function must be Reentrant

Nit: "reentrant", no title case.

>diff --git a/tools/profiler/sps/platform.h b/tools/profiler/sps/platform.h

>+class MapEntry {

>+  MapEntry(const MapEntry& aEntry)

>+  unsigned long GetStart()
>+  {
>+    return mStart;
>+  }

Do you really need getters for all these?

>diff --git a/tools/profiler/sps/sps_sampler.h b/tools/profiler/sps/sps_sampler.h

> #define SAMPLER_INIT() mozilla_sampler_init();
>-#define SAMPLE_CHECKPOINT(name_space, info) SamplingRAII only_one_sampleraii_per_func(FULLFUNCTION, name_space "::" info);
>+#define SAMPLER_DEINIT() mozilla_sampler_deinit();
>+#define SAMPLE_CHECKPOINT(name_space, info) mozilla::SamplerStackFrameRAII only_one_sampleraii_per_scope(FULLFUNCTION, name_space "::" info);
> #define SAMPLE_MARKER(info) mozilla_sampler_add_marker(info);
>-//#define NS_SAMPLE_START() mozilla_sampler_init()
>-//#define NS_SAMPLE_STOP()
>+
>+#ifdef ANDROID

You need this if __arm__.

>+// TODO Is there something cheaper that will prevent
>+//      memory stores from being reordered
>+#define MEMORY_CLOBBER() __sync_synchronize()

This is a memory fence, not a clobber (not sure what that would be).
Note that some places on teh interwebs suggest that
__sync_synchronize() isn't implemented by gcc for ARM.  Instead of
going down the rathole of getting this right, it's better for us just
to use atomicops.h from ipc/chromium.

>+#else
>+#define MEMORY_CLOBBER() asm volatile("" ::: "memory");

This should be if x86/*.  Need a comment explaining why a real memory
fence isn't needed.

>+#endif
> 

If the arch is unknown, you should #error.  Otherwise there could be
security bugs like what could happen on ARM.

I'd like to see one more version with the above fixed.
Comment 45 Benoit Girard (:BenWa) 2011-10-20 11:22:47 PDT
Created attachment 568448 [details] [diff] [review]
review diff

Applied the review comments. I could only remove on getter in MapEntry, I made the others single line.

All other suggestions are done.

I started a project to help log retrieval and resolving symbols using Breakpad and addr2line, more work needed on that:
https://github.com/bgirard/SPS_Helper
Comment 46 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2011-10-25 18:51:10 PDT
Comment on attachment 568448 [details] [diff] [review]
review diff

>+#ifdef __arm__

See below; can use |ARCH_CPU_ARM_FAMILY| here.

>+#include "base/atomicops.h"
>+// TODO Is there something cheaper that will prevent
>+//      memory stores from being reordered
>+// Uses: pLinuxKernelMemoryBarrier
>+#define MEMORY_CLOBBER() base::subtle::MemoryBarrier();

As I mentioned in the previous review, |MEMORY_CLOBBER()| isn't a
descriptive name.  STORE_SEQUENCER() would be better; this isn't a
full memory fence everywhere.  This macro needs a good comment.

Nit: |# define| to pseudo-indent for easier scanning.

>+//Investigate when porting
>+//#else __i386__
>+//#define MEMORY_CLOBBER() asm volatile("" ::: "memory");

Not sure why this is commented out.

If you pull in "base/atomicops.h" unconditionally, then you can use
chromium's nice |ARCH_CPU_X86_FAMILY| define here.

r=me with above fixed, but please fix it.
Comment 47 Benoit Girard (:BenWa) 2011-10-27 13:25:25 PDT
Created attachment 570066 [details] [diff] [review]
Use 'MOZ_PROFILER_SPS' env variable to turn profiler on

I have this configuration on the try server. I plan on landing this without any checkpoints if the try run looks good. It is very easy to use with SPS_Helper on android.
Comment 48 Ed Morley [:emorley] 2011-10-28 08:11:19 PDT
The changeset that has just landed on inbound added MOZ_ENABLE_PROFILER_SPS to configure, but doesn't actually use it anywhere. ie: this is always built on android. Was this intentional? If so, surely the congfigure part can come out...?
Comment 49 Benoit Girard (:BenWa) 2011-10-28 08:13:09 PDT
(In reply to Ed Morley [:edmorley] from comment #48)
> The changeset that has just landed on inbound added MOZ_ENABLE_PROFILER_SPS
> to configure, but doesn't actually use it anywhere. ie: this is always built
> on android. Was this intentional? If so, surely the congfigure part can come
> out...?

Good catch, I'll submit a patch to remove this.
Comment 50 Ed Morley [:emorley] 2011-10-28 09:18:23 PDT
(In reply to Benoit Girard (:BenWa) from comment #49)
> Good catch, I'll submit a patch to remove this.

Thanks :-)

(Filed as a separate to ease merging, bug 698005)
Comment 51 Marco Bonardo [::mak] 2011-10-29 01:45:49 PDT
https://hg.mozilla.org/mozilla-central/rev/78e97de08278
Comment 52 Benoit Girard (:BenWa) 2011-12-23 07:54:19 PST
Created a proper tracking bug 713227.

Note You need to log in before you can comment on or make changes to this bug.