Closed Bug 1148204 Opened 9 years ago Closed 6 years ago

Implement and integrate Backtrack

Categories

(Core :: Gecko Profiler, defect)


Tracking


RESOLVED INCOMPLETE

People

(Reporter: mayhemer, Assigned: mayhemer)

References

(Depends on 2 open bugs, Blocks 1 open bug, )

Details

Attachments

(2 files, 11 obsolete files)

202.18 KB, patch
Details | Diff | Splinter Review
342.18 KB, patch
Details | Diff | Splinter Review
This is more of an umbrella and introduction bug.

The analysis is about finding the shortest path of operations (deliberately not using the word "tasks") that need to happen through several chained objects and on multiple threads (and processes) to achieve a complex result, e.g. getting from a user clicking a link to the first-paint event during the render of the resulting web page.

To achieve that we need to collect Gantt chart-like [1] data from as many running objects (not just runnables) as possible, and the interconnections (dependencies) among them.

Each thread represents a "person" able to serially process (sub)operations.  Each operation dependency chain is a "project" or "sub-project" that needs to be done and jumps among "people", i.e. threads.

This generates a graph of dependencies between upper-level (dependent) objects requesting that something be done by lower-level (working) objects.  (E.g. nsHttpChannel creates and waits for nsHttpTransaction to load content from an HTTP server.)

With information on how long runnables are waiting in queues - which Task Tracer already collects - we can get a clear picture of what is blocking them and how we are actually doing inside.

When we define two points in this graph, we can find the shortest path between them, and by analyzing the rest of the graph we know which operations unnecessary for the target block this shortest (critical) path.  Based on that data we can decide how to optimize - e.g. move execution to different threads, add new threads, postpone some of the (sub)operations - while this analysis mechanism also gives immediate feedback and an exact measurement of the changes.



The intermediate output shall then be the following:
- the critical path: shortest backward walk from the target to the source
- the all-related tree: all operations needed to achieve the target from the source
- the unrelated trees: all operations that intervene in the time span between source and target
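As an illustration of the three outputs above, here is a small sketch (all names and the graph are made up for this example; the real analysis would weight edges by time rather than hop count, as described later in this bug) deriving them from a recorded dependency graph:

```python
from collections import defaultdict, deque

def backtrack(edges, source, target):
    """Split a dependency graph into the three outputs described above.

    edges: list of (parent, child) pairs meaning `parent` triggered `child`.
    Returns (critical_path, all_related, unrelated).
    """
    parents = defaultdict(list)
    for parent, child in edges:
        parents[child].append(parent)

    # All-related tree: everything reachable walking backwards from the target.
    all_related, queue = set(), deque([target])
    while queue:
        op = queue.popleft()
        if op in all_related:
            continue
        all_related.add(op)
        queue.extend(parents[op])

    # Critical path: shortest backward walk from target to source (BFS).
    prev = {target: None}
    queue = deque([target])
    while queue:
        op = queue.popleft()
        if op == source:
            break
        for p in parents[op]:
            if p not in prev:
                prev[p] = op
                queue.append(p)

    path = []
    op = source if source in prev else None  # guard: source may be unreachable
    while op is not None:
        path.append(op)      # walk forward again: source -> ... -> target
        op = prev[op]

    # Unrelated: operations seen in the time span but not needed for the target.
    every_op = {op for edge in edges for op in edge}
    unrelated = every_op - all_related
    return path, all_related, unrelated
```

For instance, feeding it a click -> HTML channel -> parse -> two CSS channels -> first-paint graph plus an unrelated timer/GC pair yields the click-to-first-paint chain as the critical path and the timer/GC operations as the unrelated set.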

The final output shall be:
- the exact listing/visualization of unrelated operations that somehow block the tasks/operations on the critical path and/or the all-related tree

Nice to have:
- a UI that (with some intelligent filtering involved) can show the graphs of operations and events for each object with easy tracking of dependencies; this can be used as in-time logging with high visual value; the current about:timeline sucks at this
- the UI should be switchable between per-object and per-thread views




To achieve this we need to:
- define relations between objects and also the runnables these objects dispatch (a complete walkable chain)
- define async operation paths:
  - object A calls a method on object B and expects a result via a callback
  - object B calls back with a result on object A (A was blocked on B until this moment)
- also track an operation's result depending on the results of multiple objects *) to obtain the all-related tree
- track sync I/O blocking
- track single-thread CPU loads (complex cycles, CSS reflow, JS execution..)
- track time we wait for a lock
- track network response dependencies
- track any code-specific queuing (DNS request serialization, HTTP transaction queues etc..)
- annotate objects with the resource name (URL, host name, etc..) they work with



Shelly, I may need some guidance/consultation from you on how to proceed here.  Task Tracer is similar to the Visual Event Tracer.


[1] http://en.wikipedia.org/wiki/Gantt_chart


*) E.g. an HTML page cannot start rendering before the two CSS files it refers to are downloaded.  There are then two channels, one loading each CSS file.  The one that finishes last is on the critical path; both are on the all-related tree.
Based on m-c@c20f8549d631

This is what I started to write a very long time ago: a rewrite of VET to become Backtrack.  The difference is that now I collect data per object instance, not primarily per thread.  Links and calls between objects are tracked, so it gives enough information to analyze object relation trees.  It also measures CPU time, I/O, lock wait, queuing, dispatch delays, etc.

This is for reference and backup only; I'm submitting it at a time when it has not been decided how this should be implemented in production quality in relation to the existing Task Tracer code.

Currently I'm writing another patch on top of this one to start path annotations (instrumenting the begin and end of a path) and do the actual critical/related analysis.

Next step - prototype a UI to show the paths/trees visually for human analysis.
Summary: Enhance Task Tracer for critical paths analyzes → Enhance Task Tracer/Visual Event Tracer for critical paths analyzes: Backtrack
Summary: Enhance Task Tracer/Visual Event Tracer for critical paths analyzes: Backtrack → Enhance GeckoProfile and TaskTracer to capture full object interactions, do a critical path analyzes on that data
Summary: Enhance GeckoProfile and TaskTracer to capture full object interactions, do a critical path analyzes on that data → Enhance GeckoProfiler and TaskTracer to capture full object interactions, do a critical path analyzes on that data
- adds a CallerChain class (under tools/profiler) collecting the new data
- STS and TT code lightly enhanced to capture the new data automatically
  - monitors object lifetime (classes need to be instrumented - very simple!)
  - collects the method call hierarchy with knowledge of the objects calling them
  - collects runnable calls
- some instrumentation added to the platform code as a usage example
- from these data we can reconstruct all per-object/per-thread code paths (not part of this patch yet)


The patch won't compile.  Actually, there is not much to run right now.  The fun will start when I have the analysis part that processes the logs and creates the actual object inter-call trees and maps.


How the data collection works:
- each thread has its own active "marker batch" that can linearly hold up to some 96 "markers"
- each "marker" represents some event, like entering a method, leaving a method, starting an object lifetime, etc., and has a globally unique, (atomically) increasing sequence number
- the batches are added to a global linked list that can later be examined
Attachment #8586768 - Attachment is obsolete: true
Attachment #8594049 - Flags: feedback?(bgirard)
(In reply to Honza Bambas (:mayhemer) from comment #2)
> How the data collection works:

A better description could be:

- each event creates a marker
- the marker is stored (with a unique sequence number) into a "marker batch"
- a "batch" is thread-private
- each batch is an array of markers with a static, limited capacity (96)
- when its capacity is filled, a new batch is allocated and put as the new head of the global linked list of batches
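A rough sketch of this batching scheme (a Python stand-in for the C++ structures; the real code uses a fixed-capacity array and atomic operations where this sketch uses a list and a lock):

```python
import itertools
import threading

BATCH_CAPACITY = 96
_seq = itertools.count()     # atomic in C++; a plain counter suffices here
_batches_head = None         # global singly linked list of batches
_head_lock = threading.Lock()
_tls = threading.local()     # each thread fills its own private batch

class MarkerBatch:
    def __init__(self, prev):
        self.markers = []    # fixed-capacity array in the real patch
        self.prev = prev     # link to the previously published batch
        self.thread = threading.current_thread().name

def add_marker(kind, payload):
    global _batches_head
    batch = getattr(_tls, "batch", None)
    if batch is None or len(batch.markers) == BATCH_CAPACITY:
        # batch full (or first use): allocate a fresh one and publish it
        # as the new head of the global linked list
        with _head_lock:
            batch = MarkerBatch(_batches_head)
            _batches_head = batch
        _tls.batch = batch
    batch.markers.append({"seq": next(_seq), "kind": kind, "payload": payload})
    return batch
```

The globally increasing sequence numbers are what later allow the per-thread batches to be merged back into one ordered event stream.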
Depends on: 1167634
Comment on attachment 8594049 [details] [diff] [review]
[reference only] preview of SPS/TT enhancements

Review of attachment 8594049 [details] [diff] [review]:
-----------------------------------------------------------------

So far the patch looks interesting. I'd like to review the API surface changes first, and I find them unclear, so let's first iterate on a clean and simple API to expose this feature; then we can look at the implementation details once we're happy with the API.

Sorry about the delay in reviewing this, I should have found an alternate reviewer instead of having you wait :(.

::: tools/profiler/CallerChain.cpp
@@ +223,5 @@
> +
> +CallerChain::MarkerBatch::MarkerBatch(MarkerBatch * aPrev)
> +  : mCount(0)
> +  , mPrev(aPrev)
> +  , mThreadName(DupCurrentThreadName())

From what I understand so far, this is one of the things that sticks out the most. Ideally I try to keep overhead as low as possible, which means doing memory allocations only when required, so I'd rather avoid this. If you insert a profiler marker it will already know which thread it belongs to, but I still need to understand this patch better before I can say whether you should be using that.

::: tools/profiler/CallerChain.h
@@ +19,5 @@
> +
> +extern bool mozilla_caller_chain_init();
> +extern mozilla::CallerChain * mozilla_caller_chain();
> +
> +extern uint32_t mozilla_caller_chain_method_enter(const void * aThis, const char * aInfo);

I'm having a lot of trouble understanding what these methods do, and they are the core of this patch. They could really use documentation. This isn't quite how I envisioned this working, but I'm curious to understand your approach.
Depends on: 1170534
Definitely an early stage (needs some tweaks to land), but I'm asking for review since the feedback request didn't catch much attention.

- 'preview' and 'v1' difference:
  - renamed from callerchain to backtrack
  - this is a full patch that builds and runs

- see comment 2 and comment 3 for description what the patch does
- most interesting changes are in GeckoProfilerImpl.h and BackTrack(Impl).h
- BacktrackAnalyzer.* is the core of the data processing; it 'linearizes' the per-thread captured batches and links objects, runnables and stacks together, then allows backtracking of a selected path; the result is JSON that can be processed by any UI front end (a preview will be published later)
- it doesn't connect with data captured by SPS

- to build (currently verified only on Windows):
  - apply patch from bug 1167634
  - apply patch from bug 1170534
  - ac_add_options --enable-tasktracer
  - ac_add_options --enable-profiling
  - ac_add_options --enable-backtrack

- to run: automatically starts the capturing from the very start, there is no stop button
- to view: see www.janbambas.cz for instructions (coming soon)
Attachment #8594049 - Attachment is obsolete: true
Attachment #8594049 - Flags: feedback?(bgirard)
Attachment #8614066 - Flags: review?(bgirard)
I can't find you on IRC, a 1:1 convo would save a lot of back and forth.

I understand that you want a full review, but I'd like to understand and agree with the API-level changes that you're making before I look at the low level. This patch is going to bring a lot of complexity, so it's important to look at the high-level design first.

Here what I'm trying to understand:
1) Why do you need the marker batch? Why not just emit SPS markers directly? This is what they were designed for. You can add the payload that you need.
2) You're exposing a -massive- API surface with over 20 new functions and several new objects. I'd like to see proper documentation to a) Help with the review, b) Document the code once it goes to land.
3) Where is the interaction with the profiler? You're just re-using the profiler code base hooks but you're reinventing markers.

It looks like you're just tacking something onto the profiler module that is entirely self-contained and independent, and actually duplicates a bunch of the functionality (thread-local markers). Unless I'm misunderstanding, what should happen here is that you emit proper SPS markers, which are very well supported and work well. Otherwise this should be its own independent module.
(In reply to Benoit Girard (:BenWa) from comment #6)
> I can't find you on IRC, a 1:1 convo would save a lot of back and forth.
> 
> I understand that you want a full review but I'd like to understand and
> agree with the API levels changes that you're making before I look at the
> low level. This patch is going to bring on a lot of complexity so it's
> important to look at the high level design first.

That is why I was asking (a long time ago!) for feedback and asking directly how to proceed here.

> 
> Here what I'm trying to understand:
> 1) Why do you need the marker batch? Why not just emit SPS marker directly?
> This is what they were designed for. you can add the payload that you need.

I didn't know how, even though I thought I had explained well what I wanted to do and asked around how to proceed.

> 2) You're exposing a -massive- API surface with over 20 new functions and
> several new objects. I'd like to see proper documentation to a) Help with
> the review, b) Document the code once it goes to land.

Definitely, this is one thing that needs to improve.  Probably a rush on my side.  Will document and resubmit.

> 3) Where is the interaction with the profiler? You're just re-using the
> profiler code base hooks but you're reinventing markers.

Exactly.  I didn't know how to hook inside, so I am just reusing the existing markers and putting code inside.

I already had some of my batch/marker collection code written, so to prove the idea I went that way.

I will be happy to get rid of it and inject into the stack somehow.  But I first need to understand it.  The SPS data structures appear to me as something that is JS-specific and not very adaptable for my purpose.  But I'll be happy to be wrong.

> 
> It looks like you're just tacking on something in the profiler module that
> is entirely self-contained and independent, which actually duplicates a
> bunch of the functionality (thread local markers). 

Definitely.  But it depends on whether we need something new or not; we'll see.

> Unless I'm
> misunderstanding what should happen here is you should emit proper SPS
> marker which is very well supported and works well. Otherwise this should be
> its own independent module.

OK, can you please give me some more detailed direction on how to proceed?  I will gladly do it (I won't try to protect the code I've written - I'll gladly throw it away..)

Thanks for looking into this.
Flags: needinfo?(bgirard)
(In reply to Honza Bambas (:mayhemer) from comment #7)
> That is why I was asking (a long time ago!) for a feedback and asking
> directly how to proceed here.
> 

Yes, sorry, I was away for 3 months; my handle had a notice and I didn't keep enough of an eye on my bugmail to forward the review. My bad.

> I will be happy to get rid of it and inject into the stack somehow.  But I
> first need to understand it.  The SPS data structures appear to me as
> something that is JS-specific and not much mutable for my purpose.  But I'll
> be happy to be wrong.
>

There's a misunderstanding here. Let's ignore the profiler stack and focus on the markers. The profiler markers are separate from the stack. The profiler has an extensible marker API; see PROFILER_MARKER_PAYLOAD and ProfilerMarkerPayload. It looks like one of your use cases (or a similar one), tracking when something starts and finishes, is already covered by ProfilerMarkerTracing. This is what powers the devtools timeline and the Cleopatra frames timeline. See the first two sections of https://benoitgirard.wordpress.com/2014/05/05/profiler-frame-performance-and-debugging/.

http://mxr.mozilla.org/mozilla-central/ident?i=profiler_tracing

> 
> OK, can you please give me some more detailed direction how to proceed?  I
> will gladly do it (I won't try to protect my code I've written - I'll gladly
> throw it away..)
> 
> Thanks for looking into this.

Now of course you require more complex relationship tracking than ProfilerMarkerTracing. You can create your own class which derives from ProfilerMarkerPayload and adds the fields that you need to track. You probably want to look at ProfilerMarkerTracing for an example. Facility to capture backtraces is already implemented for markers, so you get that for free.

We do the 'linearization' of the data in the UI ATM, but if you really wanted, that could be done in Gecko during the profile save/export phase.

Now AFAIK this should cover your use case for Gantt charts.

As a benefit of doing it like this you can easily retrieve CPU sampling data for any period during your Gantt chart.

> (I won't try to protect my code I've written - I'll gladly
> throw it away..)

If you change your mind and want to land your code, then I think it should land as a separate module, since it really has no reason to depend on the profiler.
Flags: needinfo?(bgirard) → needinfo?(honzab.moz)
Thanks, I'll answer tomorrow!
If you have a chance, please check out the live preview too: http://www.janbambas.cz/moz/backtrack/?first-paint-janbambas.cz.txt
Looks very interesting. It does help to understand what you're trying to achieve, seeing the end results. I don't fully understand it since I don't know the network code that well, but I get the idea.

If we had profiling samples alongside, by making the changes I suggest, then it would be easy to ask what the samples are during important events. We just need to use the timestamp range plus the tid. For instance, a lot of your events have a long dispatch period (red+pink). If I understand correctly, that's when the event is in the queue but waiting on previous events to complete. If you store your markers in the profile then it becomes trivial to request what the sampling data is during both the dispatch phase (what unrelated execution is on our critical path by blocking our event) and what we're doing once the event is running (the black section).

All you'd need to do for your UI/viewer is make sure that it plays well in an iframe and we can throw it into a tab in cleopatra, or throw cleopatra in a tab in your UI. Then they can pass data across like 'I'm looking at this time range'.
(In reply to Benoit Girard (:BenWa) from comment #8)

Thanks for this feedback!

> There's a misunderstanding here. Let's ignore the profiler stack and focus
> on the markers. The profiler markers is separate from the stack. The
> profiler has an extensible marker API. See PROFILER_MARKER_PAYLOAD and
> ProfilerMarkerPayload. It looks like one of your use case, or one similar,
> to track when sometimes starts and finishes is already covered by
> ProfilerMarkerTracing. This is what powers the devtools timeline and the
> cleopatra frames timeline. 


As I understand it, I can add whatever markers I want.  The Cleopatra UI then has to be enhanced to understand and interpret my newly added types of markers, right?

I assume a marker cannot 'start' on one thread and 'end' on a different one, right?  Assuming so, a single marker with start and stop timestamps won't work for e.g. object lifetime, and probably queue/dequeue and network req/resp.  Separate markers will unfortunately (at the cost of some memory) be needed for most of my cases.
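One way around this limitation (a sketch of the idea only, not existing profiler API) is to emit separate begin/end markers that share an id and pair them up during analysis, which also works when the two halves come from different threads:

```python
def pair_spans(markers):
    """Pair separate begin/end markers (possibly on different threads) by id.

    markers: iterable of dicts with 'kind' ('begin'/'end'), 'id',
    'thread', and 'time' keys.
    Returns {id: (begin_marker, end_marker)} for completed spans only.
    """
    begins, spans = {}, {}
    for m in sorted(markers, key=lambda m: m["time"]):
        if m["kind"] == "begin":
            begins[m["id"]] = m
        elif m["kind"] == "end" and m["id"] in begins:
            spans[m["id"]] = (begins.pop(m["id"]), m)
    return spans
```

Object lifetime, queue/dequeue, and network request/response would each become such a pair, keyed e.g. by object pointer or request id; unmatched begins (still-live objects) simply stay unpaired.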


> Now of course you require more complex relationship tracking than
> ProfilerMarkerTracing. You can create your owns class which derives
> ProfilerMarkerPayload and adds the field that you need to track. You
> probably want to look at ProfilerMarkerTracing for an example. Facility to
> capture back-traces is already implemented for markers so you can get that
> for free.


I'll make a detailed plan per each of the info I want to capture and post it here to confirm.


One major unknown is how to track 'this' as part of the stack frame.  My first attempts ended up in the JS representation of a stack frame.  There was no place to add any 'this' info, obviously.  Two possibilities:

1. Record the 'this' value via 'profiler_tracing' (or via my own class) and linearize the info later, probably in the UI.  It's just that I probably don't need mStartTime and mEndTime from ProfilerMarkerPayload.

2. Could 'this' be tracked as part of SamplerStackFrameRAII?  It passes itself to js::ProfileEntry via setCppFrame - beyond that I'm lost.  But making this work somehow would be pretty nice!


The other major unknown is how you track threads.  'mozilla_sampler_register_thread' and the code around it are a mystery to me.  The registration was missing almost all the threads I was trying to capture on.  How does this work?  It's a big showstopper for me unless I understand how it works.


> 
> We do the 'linearization' of the data in the UI ATM but if you really wanted
> that could be done in Gecko during the profile save/export phase.


Where to 'linearize' depends on the complexity.  My BacktrackAnalyzer class does a 2-phase walk-through of the data: 1. preprocessing, which creates a linear list of objects and thread events and connects dispatches, network responses, etc. - this may take a few milliseconds to complete; and 2. 'analysis', which is actually just a walk from a selected 'way point' back to its start.  It simply jumps to a runnable's dispatch source, to a network request send, to a queuing point, etc.  Each marker hit is then annotated as 'critical'.  You may see BacktrackAnalyzer::Preprocess() and Analyze() in my patch.
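The 'linearization' part of the preprocessing step can be illustrated in a few lines (a sketch only; it relies on the globally increasing sequence numbers the capture already assigns to markers):

```python
import heapq

def linearize(per_thread_markers):
    """Merge per-thread marker lists into one stream ordered by global seq.

    per_thread_markers: {thread_name: [marker, ...]} where each thread's
    markers are already in increasing 'seq' order (the capture guarantees
    this, since each thread appends to its own batch).
    """
    streams = [
        [(m["seq"], name, m) for m in markers]
        for name, markers in per_thread_markers.items()
    ]
    # heapq.merge does an n-way merge of the already-sorted streams;
    # seq is globally unique, so the tuples never tie
    return [m for _, _, m in heapq.merge(*streams)]
```

The backward walk of the analysis phase then runs over this single ordered stream, jumping between threads via the dispatch/response links.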


> 
> As a benefit of doing it like this you can easily retrieve CPU sampling data
> for any period during your Gantt chart.

That is what I want.  Blank spaces filled with SPS data.

> If you change your mind and want to land your code then I thing that it
> should be landed as a separate module since it really has no purpose
> depending on the profiler.

I want to do this properly, and there are plenty of benefits to doing it as you suggest.



(In reply to Benoit Girard (:BenWa) from comment #11)
> Looks very interesting. It does help to understand what you're trying to
> achieve seeing the ends results. I don't fully understand it since I don't
> know the network code that well but I get the idea.
> 
> If we had profiling samples along side, by making the changes I suggest,
> then it would be easy to ask what the samples are during important events.

If we also add information about objects then yes.

> We just need the use the timestamp range plus tid. For instance a lot of
> your events have a long dispatch period (red+pink). If I understand
> correctly that's when the event is in the queue but it's waiting on previous
> events to complete. 

Exactly!  Usually the biggest source of bottlenecks.  But there could be more, like mutex wait, I/O wait, and various combinations we had e.g. in the old version of the HTTP cache ;)

> If you store your markers in the profile then it becomes
> trivial to request what the sampling data is during both the dispatch phase
> (what unrelated execution is on our critical path by blocking our event) and
> what we're doing once the event is running the black section.

Again, only if we also have the 'object' data as well.

> 
> All you'd need to do for your UI/viewer is make sure that it plays well in
> an iframe and we can throw it into a tab in cleopatra, or throw cleopatra in
> a tab in your UI. Then they can pass data across like 'I'm looking at this
> time range'.

Forget the viewer on my site; it's only a preview - a prototype.  It was never intended to be a production-quality thing.  It's just a way to show the idea for real.

OTOH, does Cleopatra show multiple threads like my preview does?
Flags: needinfo?(honzab.moz) → needinfo?(bgirard)
And a few more questions:
- does the profiler work across processes?
- do you have any idea how to directly use/integrate the data captured by TaskTracer?  I'm afraid I'll have to add markings to the TT code (as I do for Backtrack in the v1 patch).  But I'd like to reuse the IPC crossing that TT already implements.  I didn't get much info from :shelly.
Attached image Cleopatra/Backtrack integration mockup (obsolete) —
(In reply to Honza Bambas (:mayhemer) from comment #12)
> 
> As I understand, I can add whatever markers.  The Cleopatra UI then has to
> be enhanced to understand and interpret my newly added types of markers,
> right?

You already have your custom UI that works well. I say we just add a new tab to Cleopatra that has your UI in it. See the attachment. Then we add a few messages to 'sync' the view and the selection. When you select something in backtrack we jump to the sample. If we have a way to go from sample to Backtrack (harder problem) then we can do that as well. We just send cross document messages.

Cleopatra will just ignore markers that it doesn't know how to present.

If you'd rather extend Cleopatra directly feel free to, but I imagine it's easier to just embed your current UI.

> 
> I assume a marker cannot 'start' on one thread and 'end' on a different one,
> right?  Assuming so, a single marker with start and stop timestamps won't
> work for e.g. object lifetime, and probably queue/dequeue, network req/resp.
> Separate markers will unfortunately (by means of wasting memory) be needed
> for most of my cases.
> 

Yes, but we're talking about wasting hundreds of bytes per marker. The real cost is going to be in the size of the backtraces that we capture, so we don't need to micro-optimize a non-bottleneck.

> 
> > Now of course you require more complex relationship tracking than
> > ProfilerMarkerTracing. You can create your owns class which derives
> > ProfilerMarkerPayload and adds the field that you need to track. You
> > probably want to look at ProfilerMarkerTracing for an example. Facility to
> > capture back-traces is already implemented for markers so you can get that
> > for free.
> 
> 
> I'll make a detailed plan per each of the info I want to capture and post it
> here to confirm.
> 
> 
> One major unknown is how to track 'this' as part of the stack frame.  My
> first attempts ended up in the JS representation of a stack frame.  There
> were no place to add any 'this' info, obviously.  Two possibilities:
> 
> 1. 'profiler_tracing' (or via my own class) the 'this' value and linearize
> the info later, in the UI probably.  It's just that I probably don't need
> mStartTime and mEndTime from ProfilerMarkerPayload.  
> 
> 2. Could 'this' be tracked as part of SamplerStackFrameRAII?  It passes
> itself to js::ProfileEntry via setCppFrame - further I'm lost.  But to make
> this somehow work would be pretty nice!

Yes, if we did this we could, say, map object markers to samples. It would be possible.

> 
> 
> Other major unknown is how you track threads.  The
> 'mozilla_sampler_register_thread' and the code around is a mystery for me. 
> The registration were missing almost all the threads I was trying to capture
> on.  How does this work?  It's a big show stopper for me unless I understand
> how it works.
> 

I've only registered threads that I've wanted to profile. There's no cross-platform API to get a list of the threads; for instance, there's no pthread_get_thread_list()-like API. Therefore we decided to just force the registration of the threads we want to watch. We use this hook to do a platform-specific equivalent of pthread_self().

If it's missing and you need it, just add it!

> 
> (In reply to Benoit Girard (:BenWa) from comment #11)
> > If you store your markers in the profile then it becomes
> > trivial to request what the sampling data is during both the dispatch phase
> > (what unrelated execution is on our critical path by blocking our event) and
> > what we're doing once the event is running the black section.
> 
> Again, only if we also have the 'object' data as well.

I'm not sure I 100% understand what you're asking for. I'm starting to think that what you're asking for is already covered with PROFILER_LABEL_PRINTF. Example:
http://mxr.mozilla.org/mozilla-central/source/layout/base/nsPresShell.cpp#9194

This means that every sample taken while this RAII is active is tagged with the URL of the document we're reflowing. Your patch was also trying to tag identifiers and URLs. If the marker wants this information it has to capture a backtrace. I linked a video below; you'll see that I know which document is being reflowed because of this annotation. I imagine you want to know which names/URLs are given to a network channel when it's blocked. You should be able to use this for that.
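The idea behind such labels can be mimicked in a few lines (a language-agnostic sketch, not the actual profiler code): an RAII-style label pushes formatted info onto a per-thread pseudo-stack, and any sample taken while it is active sees that info:

```python
import threading

_pseudo_stack = threading.local()   # per-thread stack of active labels

class Label:
    """RAII-style label, loosely modelled on the PROFILER_LABEL_PRINTF idea:
    any sample taken while the label is on the stack sees its formatted info."""
    def __init__(self, name, fmt, *args):
        self.entry = f"{name}: {fmt % args}"

    def __enter__(self):
        stack = getattr(_pseudo_stack, "frames", None)
        if stack is None:
            stack = _pseudo_stack.frames = []
        stack.append(self.entry)
        return self

    def __exit__(self, *exc):
        _pseudo_stack.frames.pop()

def sample():
    # what a sampler would record for the current thread right now
    return list(getattr(_pseudo_stack, "frames", []))
```

Usage would look like `with Label("PresShell::DoReflow", "(doc: %s)", url): ...`, so every sample during the reflow carries the document URL with no per-sample allocation beyond the label itself.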

> 
> > 
> > All you'd need to do for your UI/viewer is make sure that it plays well in
> > an iframe and we can throw it into a tab in cleopatra, or throw cleopatra in
> > a tab in your UI. Then they can pass data across like 'I'm looking at this
> > time range'.
> 
> Forget the viewer on my site, it's just and only a preview - a prototype. 
> It was never intended to be a production quality thing.  It's just a way to
> show the idea in real.
> 
> OTOH, does Cleopatra show multiple threads like my preview?

Yes, here's a walk-through of a complex profile recorded for B2G. It shows multiple threads on multiple processes: http://people.mozilla.org/~bgirard/keyboard_timeline.mp4
Flags: needinfo?(bgirard)
(In reply to Honza Bambas (:mayhemer) from comment #13)
> And few more questions:
> - is the profiler working inter-process?
Yes, see above. It works with e10s (some bugs need to be fixed however) and b2g. See my example profile which shows the system process and the Messaging app talking to each other.
> - do you have any idea how to directly use/integrate the data captured by
> TaskTracer?  I'm afraid I'll have to add markings to the TT code (as I do
> for backtrack in the v1 patch).  But I'd like to reuse the IPC crossing that
> TT is already implementing.  I didn't get much info from :shelly.

We discussed integrating TaskTracer better, but that work hasn't happened yet.
(In reply to Benoit Girard (:BenWa) from comment #14)
> You already have your custom UI that works well. 

I have yet to decide which way to go.  Cleopatra already has nice layout, zoom, resize and refresh capabilities.  My prototype doesn't, has a few other flaws, and only works with my processed data structures.

> I say we just add a new tab
> to Cleopatra that has your UI in it. See the attachment. Then we add a few
> messages to 'sync' the view and the selection. When you select something in
> backtrack we jump to the sample. If we have a way to go from sample to
> Backtrack (harder problem) then we can do that as well. We just send cross
> document messages.

Good to know!

> The real cost
> is going to be in the size of the backtraces that we capture so we don't
> need to micro-optimize a non bottleneck.

So, each PROFILER_LABEL_* does a full stack capture from the very bottom?

> > 2. Could 'this' be tracked as part of SamplerStackFrameRAII?  It passes
> > itself to js::ProfileEntry via setCppFrame - further I'm lost.  But to make
> > this somehow work would be pretty nice!
> 
> Yes if we did this we could say map object markers to samples. It would be
> possible.

I think I'll try the PRINTF way (see below).

> I've only registered threads that I've wanted to profile. There's no cross
> platform API to get a list of the threads, for instance there's no
> pthread_get_thread_list() like APIs. Therefore we decided to just force the
> registration of threads we want to watch. We use this hook to do a platform
> specific equivalent of pthread_self().
> 
> If it's missing and you need it, just add it!

so profiler_register_thread(name, local) is the magic?  How could I miss that!

> > Again, only if we also have the 'object' data as well.
> 
> I'm not sure I 100% understand what you're asking for. I'm starting to think
> that what you're asking for is already covered with PROFILER_LABEL_PRINTF.
> Example:
> http://mxr.mozilla.org/mozilla-central/source/layout/base/nsPresShell.
> cpp#9194

This is interesting.  So instead of using (and hacking on) PROFILER_LABEL_FUNC, which is handed 'this' in my v1 patch, I could simply use the PRINTF version instead and pass it a list of formatted arguments that can later be read or just displayed directly.  Cool!

With that I probably don't need to track object lifetime (and stick resource identification to the 'this' pointer value).  Maybe we can start like that - no lifetime tracking.  What do you think?

> Yes, here's a walk-through of a complex profile recorded for B2G. It shows
> multiple threads on multiple processes:
> http://people.mozilla.org/~bgirard/keyboard_timeline.mp4

Hmm... Cleopatra has evolved since the last time I used it, nice!

My original idea was to reuse your timelines (at the top) for showing my captured data (since they have all the zooming, resizing, refreshing and nice layout ;)).

Backtrack's critical path flow can be put on a single "tape", since it's a sequence of events, each arising from the previous, forming a linear chain (regardless of crossing threads/processes).

There is the "Frames" specific timeline.  We could add a new specific timeline - "Backtrack" - showing the critical path.  That timeline would (automatically) uniformly zoom and subrange with the others.  It would be seeded by letting the user manually select a "way point" (from a drop-down list, probably with filtering) to start tracking back from (the way-point concept is something I find important since picking the start point to look at used to be hard for me, e.g. in Visual Event Tracer or any similar tool).  

Then, selecting an 'event block' on the Backtrack tape would select the proper range in the proper thread (and vice versa, to some extent).  Then I have to think about how to show the "unrelated" concurrent flows.  Probably as another "Backtrack" timeline under the selected thread?  Not sure.

Another thing is that you show the stack samples coalesced by method (label) name and sorted by overall time spent inside.  That is of course the purpose of a profiler!  But for backtrack (i.e. when something on the backtrack timeline is selected) it would be better to switch to a view with natural ordering and no coalescing, as I do in my UI.  I'd also like to add info about I/O and lock-wait blockages into this call sample listing.

Another feature should be selecting (clicking) any frame from the stack listing, which would start a "sub-backtrack" that could either cross (actually connect to) the current critical path or reveal that this new sub-path is completely unrelated, and where it comes from.

Does this all sound reasonable?  If so, I'll start looking at the Cleopatra code to see how complicated it would be to introduce the Backtrack timeline and design the markers I need to fill it.

Thanks so far for the great feedback!
Flags: needinfo?(bgirard)
(In reply to Honza Bambas (:mayhemer) from comment #16)
> (In reply to Benoit Girard (:BenWa) from comment #14)
> > The real cost
> > is going to be in the size of the backtraces that we capture so we don't
> > need to micro-optimize a non bottleneck.
> 
> So, each PROFILER_LABEL_* does a full stack capture from the very bottom?

No, that's not it. Sorry, this concept is not very intuitive. PROFILER_LABEL_* will add extra information to a pseudo execution stack. This is pushed during execution and popped after using RAII. When the profiler takes a sample on its 1 ms clock, or we insert a marker which requests a backtrace (collection is optional since it's a bit slow and consumes a lot of memory), or a piece of code wants a backtrace explicitly, we will capture the backtrace and include any active PROFILER_LABEL_*.

The fancy trick we do is that we in-order merge the C++ stack with this side pseudo stack, so the PROFILER_LABEL_ frames fall next to their C++ counterparts.
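A minimal sketch of that merge idea (hypothetical types and field names, not the profiler's actual code): if each pseudo-stack entry records the C++ stack pointer current when it was pushed, the two outermost-first frame lists can be zipped together by comparing addresses, assuming a downward-growing stack where outer frames have higher stack-pointer values.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical frame record; the real profiler's structures differ.
struct Frame {
  std::string label;
  uintptr_t sp;  // stack pointer associated with this frame
};

// Merge native C++ frames and pseudo-stack (PROFILER_LABEL_*) entries,
// both ordered outermost-first, so each label lands next to the native
// frame that pushed it. Assumes a downward-growing stack: outer frames
// have higher stack-pointer values. On a tie, the native frame wins so
// the label appears "inside" it.
std::vector<std::string> MergeStacks(const std::vector<Frame>& native,
                                     const std::vector<Frame>& pseudo) {
  std::vector<std::string> out;
  size_t n = 0, p = 0;
  while (n < native.size() || p < pseudo.size()) {
    if (p == pseudo.size() ||
        (n < native.size() && native[n].sp >= pseudo[p].sp)) {
      out.push_back(native[n++].label);
    } else {
      out.push_back(pseudo[p++].label);
    }
  }
  return out;
}
```

This is just the ordering trick in isolation; the real sampler also has to walk the native stack first, which is the expensive part.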

> > > Again, only if we also have the 'object' data as well.
> > 
> > I'm not sure I 100% understand what you're asking for. I'm starting to think
> > that what you're asking for is already covered with PROFILER_LABEL_PRINTF.
> > Example:
> > http://mxr.mozilla.org/mozilla-central/source/layout/base/nsPresShell.
> > cpp#9194
> 
> This is interesting.  So instead of using (and hacking on)
> PROFILER_LABEL_FUNC, which is handed with 'this' in my v1 patch, I could
> simply use PRINTF version instead and pass it a list of formatted arguments
> that can later be read or just displayed directly.  Cool!
> 

Exactly.

> With that I probably don't need to track object lifetime (and stick resource
> identification to 'this' pointer value).  Maybe we can start like that - no
> lifetime tracking.  What do you think?
> 

It would certainly make things easier if you don't need object lifetimes. Otherwise keep your ProfilerTracked object and set a marker when you construct/destroy. If you don't track construction/destruction then you don't know if the 'this' pointer is the same object or if you had a delete/new which reused the same address.
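The construct/destroy pairing could be sketched like this (hypothetical names throughout; `backtrack_marker` stands in for whatever marker API the patch uses, and the sink is an in-memory vector just for illustration):

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-memory marker sink; the real patch would emit a
// profiler marker instead of appending to a vector.
std::vector<std::pair<std::string, const void*>> gMarkers;

void backtrack_marker(const char* what, const void* obj) {
  gMarkers.emplace_back(what, obj);
}

// RAII member: the owner's construction/destruction emits paired
// markers, so a reused heap address can be distinguished from the
// original object when identifying resources by 'this' pointer.
class ProfilerTracked {
 public:
  explicit ProfilerTracked(const void* owner) : mOwner(owner) {
    backtrack_marker("construct", mOwner);
  }
  ~ProfilerTracked() { backtrack_marker("destruct", mOwner); }
  ProfilerTracked(const ProfilerTracked&) = delete;

 private:
  const void* mOwner;
};

// Hypothetical instrumented class standing in for e.g. nsHttpChannel.
class TrackedChannel {
 public:
  TrackedChannel() : mTracked(this) {}

 private:
  ProfilerTracked mTracked;
};
```

With both markers present, two occurrences of the same address separated by a "destruct" marker are known to be different objects.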

> > Yes, here's a walk-through of a complex profile recorded for B2G. It shows
> > multiple threads on multiple processes:
> > http://people.mozilla.org/~bgirard/keyboard_timeline.mp4
> 
> Hmm... Cleopatra has evolved from the last time I've been using it, nice!
> 
> My original idea was to reuse your timelines (at the top) for showing my
> captured data (since it has all the zooming, resize refreshing and nice
> layout ;)).  
> 
> Backtrack's critical path flow can be put to a single "tape" since its a
> sequence of events coming one from other forming a linear chain (regardless
> crossing threads/processes).  
> 
> There is the "Frames" specific timeline.  We could add a new specific
> timeline - "Backtrack" - showing the critical path.  That timeline would
> (automatically) uniformly zoom and subrange with others.  It would be seeded
> by letting the user manually select a "way point" (from a drop down list,
> with filtering probably) to start track back from (the way-point concept is
> something I find important since picking the start point to look at used to
> be hard for me e.g. in Visual Event Tracer or any kind of such a tool).  
> 
> Then, selecting an 'event block' on the Backtrack tape would select the
> proper range in the proper thread (and vice versa, to some extent).  Then I
> have to think how to show the "unrelated" concurrent flows.  Probably as
> another "Backtrack" timeline under the selected thread?  Not sure.
> 
> Other thing is that you show the stack samples coalesced by method (label)
> name and sorted by overall time spent inside.  It's of course purpose of a
> profiler!  But for backtrack (i.e. when something on the backtrack timeline
> is selected) it would be better to switch to a view with natural ordering
> and no coalescing as I do in my UI.  I'd also like to add info about I/O and
> lock wait blockades into this call sample listing.
> 
> Another feature should be selecting (clicking) any frame from the stack
> listing that would start a "sub-backtrack", that could either cross
> (actually connect to) the current critical path or reveal that this new
> sub-path is completely unrelated and where from it comes.
> 
> Does this all sound reasonable?  If so, I'll start looking at the Cleopatra
> code how complicated would be to introduce the Backtrack timeline and
> concept the markers I need to fill it.
> 
> Thanks so far for the great feedback!

Yes, this sounds like a great plan. I'd suggest starting with the minimum viable changes to Cleopatra just for the sake of simplicity. See 'js/waterfall.js' for an example of how to provide a new UI widget.

Also, you might have noticed that the profiler has a lot of optional features. These are the checkboxes in the profiler add-on. You could integrate backtrack as an optional feature. This lets the user choose between enabling more features (and having more data to diagnose bottlenecks) and increasing the data collected and therefore the overhead. When you want the most accurate CPU samples you really want to minimize what data you collect.

If you want to do that then you add a new feature here to indicate that the current build supports a profiling feature:
http://mxr.mozilla.org/mozilla-central/source/tools/profiler/platform.cpp#603

and then when the profiler is started you can run your init if the backtrack feature was requested:
http://mxr.mozilla.org/mozilla-central/source/tools/profiler/TableTicker.h#108
Flags: needinfo?(bgirard) → needinfo?(honzab.moz)
Comment on attachment 8614066 [details] [diff] [review]
[prototype, poc, collecting data itself] v1

Review of attachment 8614066 [details] [diff] [review]:
-----------------------------------------------------------------

Given the current discussion I don't think I need to review this. Let me know if I'm misunderstanding.
Attachment #8614066 - Flags: review?(bgirard)
Sorry for late answer, I was sick for two weeks.

(In reply to Benoit Girard (:BenWa) from comment #17)
> (In reply to Honza Bambas (:mayhemer) from comment #16)
> > (In reply to Benoit Girard (:BenWa) from comment #14)
> > > The real cost
> > > is going to be in the size of the backtraces that we capture so we don't
> > > need to micro-optimize a non bottleneck.
> > 
> > So, each PROFILER_LABEL_* does a full stack capture from the very bottom?
> 
> No that's it. Sorry this concept is not very intuitive. PROFILER_LABEL_*
> will add extra information to a pseudo execution stack. This is pushed
> during execution and popped after using RAII. When the profiler takes a
> sample on its 1 ms clock, or we insert a marker which requests a backtrace
> (collection is optimal since its a bit slow and consume a lot of memory), or
> a piece of code wants a backtrace explicitly we will capture the backtrace
> and include any active PROFILER_LABEL_*.
> 
> The fancy trick we do is that we in-order merge the C++ stack with this side
> pseudo stack. So the PROFILER_LABEL_ fall next to their C++ frame
> counterparts.


Hmm... this sounds a bit like a show stopper for me.  I need all the info I capture.  Only this way can I chain all the stuff that happens, such as dispatch/run, network request/response, queue/dequeue, lock wait/ack/release.  

It seems like until a backtrace is manually requested or that 1 ms interval is hit, any info I capture with PROFILER_LABEL_* (mainly PRINTF) will just be lost.  Is that so?



Thanks for all the leads to start hacking!
Flags: needinfo?(honzab.moz) → needinfo?(bgirard)
I think what you want to do is mix labels and markers. If you have time to meet this week perhaps we can take 15 minutes to go over a design in person.
Flags: needinfo?(bgirard)
(In reply to Benoit Girard (:BenWa) from comment #20)
> I think what you want to do is mix labels and markers. If you have time to
> meet this week perhaps we can take 15 minutes to go over a design in person.

I'm not in Whistler.  I'll try something and, if I get stuck, ask more questions.  Maybe next week we can do a Vidyo chat if needed, will let you know.  

Thanks so far!
Alias: backtrack
Status: NEW → ASSIGNED
Summary: Enhance GeckoProfiler and TaskTracer to capture full object interactions, do a critical path analyzes on that data → Implement and integrate Backtrack
No longer depends on: 1167634
Attachment #8614066 - Flags: review-
Hi Benoit, it's been a while...

I started to work on this feature again.  I abandoned (probably just for now) reusing the tasktracer code at all.

Now I'm trying to add some of my own data structures, actually pretty similar to what tasktracer adds at GeckoSampler.cpp!StreamTaskTracer.  I've duplicated, I believe, all the places where tasktracer tries to carry its own data, doing the same for backtrack.  However, parserWorker doesn't see that data in the raw profile (parseRawProfile function).  I also made tasktracer compile and run on Windows.  And there is no "tasktracer" data in the raw profile either.

According to the GitHub history [1] [2] you've added these bits.  Was it ever working?  Is there some pending work to be done around it?  Or is the code wrong from its basis?

Thanks.

[1] https://github.com/mayhemer/cleopatra/commit/8bea215b290f2777e245ecd3baf2bf6a926d2863
[2] https://github.com/mayhemer/cleopatra/commit/b8d99a1105b9fc59fe0dc1476b3c097a32482428
Flags: needinfo?(bgirard)
Attached patch wip v2 (obsolete) — Splinter Review
[for reference only]
(Continuation for comment 23)

So, I partially figured this out.  parseRawProfile reads tasktracer data from the wrong property.  [1] is wrong.  It's a property of rawProfile.profileJSON.  So I'm moving it to the parseProfileJSON() nested function.


OK, one problem solved, more arise.  Tasktracer (and I) see all the threads running, with info taken at [2].  But markers in the profile (rawProfile.profileJSON(parsed).threads) cover only the 3 selected threads of interest.  But I need all of them; I need markers and callstacks (definitely call graphs) captured on all threads.  I've seen some filtering happen, which is there for some reason.  I presume not to pollute the UI and the profiler.  But I need the data internally... what to do?



And yet another problem is that when I click on a stored local profile in the cleopatra UI, the worker's parseRawProfile gets something completely different to parse: an apparently already-filtered profile with different properties, with no tasktracer nor backtrack data present.  I'm pretty confused and have spent a lot of time already figuring this all out.  So, some advice would save a lot of that time for me.


[1] https://github.com/mayhemer/cleopatra/blob/b8d99a1105b9fc59fe0dc1476b3c097a32482428/js/parserWorker.js#L333
[2] http://hg.mozilla.org/mozilla-central/annotate/33d36bf6ca0c/tools/profiler/core/GeckoSampler.cpp#l321
A note that when I set the "profiler.threadfilter" pref to an empty string, I'm getting the following on Shift-Ctrl-2:

[JavaScript Error: "TypeError: profile.threads[i] is undefined" {file: "resource://jid0-edalmuivkozlouyij0lpdx548bc-at-jetpack/data/SymbolicateWorker.js" line: 218}]
(In reply to Honza Bambas (:mayhemer) from comment #23)
> According github history [1] [2] you've added these bits.  Was it ever
> working?  Is there some pending work to be done around?  Or is the code
> wrong from its basis?

No, it wasn't. TaskTracer needs to have its data transformed by a B2G script between the point where it's dumped and when the TaskTracer UI sees it. That part was never plugged in to work outside the B2G profiling script.

> And yet another problem is that when in the cleopatra UI I click on a stored
> local profile, the worker's parseRawProfile gets something completely
> different to parse.

Yes, this is expected. The problem is we've drastically changed the Cleopatra format over the years. What happens is Cleopatra will read the format that Gecko outputted (and we're also still supporting the old format, which is why the code is messy) and will normalize it into the format that it uses internally. When you save the profile again (upload, save to file, IndexedDB) it will be saved using the Cleopatra internal format.

The state of all the formats is a mess now. :mstange is playing with a rewrite of Cleopatra on his free time so in the long term we might have something cleaner to work with but in the short term we shouldn't block on that work.

Let me know if that leaves anything unanswered.
Flags: needinfo?(bgirard)
Thanks, my biggest concern now is comment 25, specifically this part:

"
Tasktracer (and me) see all the threads running, with info taken at [2].  But markers in the profile (rawProfile.profileJSON(parsed).threads) are only 3 selected threads of the interest.  But I need all of them, I need markers, callstacks (definitely call graphs) captured on all threads.  I've seen some filtering happen, which is there from some reason.  I presume not to pollute the UI and the profiler.  But I need the data internally... what to do?
"


With profiler.threadfilter = "" set, I did a few small fixes, like skipping when arrays are null or empty, to get past these exceptions, but in the end I'm getting just an empty UI (no samples) and no errors/exceptions shown anywhere (console, cleopatra UI).


I'd like to try to join the data (call graphs mainly) captured by the sampler with the ranges captured by backtrack.  If it turns out to be too big a task for the profiler, I may at least go with a subset and capture any important data myself, using my own internal marker engine.


And more... I tried to list more threads in the profiler.threadfilter pref (like "GeckoMain,Compositor,Cache2 I/O,Socket Thread") and to my surprise... nothing changed.  I'm still getting just the 3 threads.
Flags: needinfo?(bgirard)
That's because we don't call profiler_(un)register_thread on those threads you added to the filter:
http://mxr.mozilla.org/mozilla-central/search?string=register_thread&find=&findi=&filter=%5E%5B%5E%5C0%5D*%24&hitlimit=&tree=mozilla-central

We don't have a good cross-platform API for getting a list of threads and the thread context for all threads, so we have to manually register all the threads that we care about.

We have to be very strict about which threads we profile because the cost of profiling is linear in the number of threads. The more threads you sample, the worse the samples you get and the higher the profiling overhead is. This is why by default we profile threads relevant to painting and nothing else.

I'd suggest implementing your tool to work well even with data recorded from threads that we don't have any sampling data for.
Flags: needinfo?(bgirard)
We call register_thread on many others, though.  I've seen some 13 to 20 threads in the registered-threads global.

Anyway, good point, I'll register them and see how bad it is.  The problem is that I need the stack samples.  That was the whole point of integrating it into the Gecko profiler: so that I can get call graphs for the pieces where events are waiting.

I'll rather go with my own marker system anyway (as in v1) and make stack sampling optional.  I don't need stack sampling to select a backtrack objective point and find its origin anyway.

Thanks so far for all the help!
(In reply to Honza Bambas (:mayhemer) from comment #30)
> We call register_thread on many others tho.  I've seen some 13 to 20 threads
> being in the registered profile global.

I meant sRegisteredThreads global
One more question: how can I correctly find the right thread to get the samples when I have tid (gettid()) in hands?  I can see tids are not stored along with the sample arrays.
Flags: needinfo?(bgirard)
It's outputted by Gecko. It's possible Cleopatra discards it when converting to its internal format, so it might need to be kept. If that's the case then it should just be as simple as assigning it.

http://mxr.mozilla.org/mozilla-central/source/tools/profiler/core/ThreadProfile.cpp#126
Flags: needinfo?(bgirard)
Attachment #8614066 - Attachment description: v1 → [prototype, poc, collecting data itself] v1
Hi again,

So, I'm now collecting my markers with my own engine.  I'm able to stream them in GeckoSampler::StreamJSON.  I've checked that this code is called on both the parent process and the child process.  But the data from the child process is not visible in cleopatra's gProfiles[0].backtrack.

I assume I need to add my own code for my specific data merge.  How exactly are the child parts of the profile JSON merged into the parent (overall) profile?  Or simply said: what do I need to do, and where, to merge my data properly?

Thanks.
Flags: needinfo?(bgirard)
Attached patch wip v3 (obsolete) — Splinter Review
- my own markers engine
- very little instrumentation
Attachment #8614852 - Attachment is obsolete: true
Attachment #8727878 - Attachment is obsolete: true
They should be under the gProfiles 'threads' property somewhere. If you dump & beautify the JSON it should be easy to find. Subprocesses embed a sub profile.
Flags: needinfo?(bgirard)
Depends on: 1261038
Attached patch wip v4 (obsolete) — Splinter Review
- growing and growing...
- js interpreter instrumentation (still not fully established)
- full inter process time sync
- ipc bridging
- many small details
Attachment #8733982 - Attachment is obsolete: true
Attached patch wip v5 (obsolete) — Splinter Review
Attachment #8747255 - Attachment is obsolete: true
Depends on: 1276221
Attached patch wip v6 (obsolete) — Splinter Review
Attachment #8614066 - Attachment is obsolete: true
Attachment #8755133 - Attachment is obsolete: true
Depends on: 1280077
Attachment #8760213 - Attachment is obsolete: true
schien, mayhemer, is either of you planning to actively work on this? I'm very interested in helping you finish this.

mayhemer, can you briefly describe how, for each type of marker, its cause can be traced? I see that DEQUEUE, DONE, SUB_COMPLETED and SUBS_RESOLVED have an origin field, but it's not obvious to me how it works for QUEUE and OBJECTIVE. Though I suppose for OBJECTIVE you can just check what task is currently running (i.e. is between DEQUEUE and DONE).
Flags: needinfo?(honzab.moz)
(In reply to Markus Stange [:mstange] from comment #42)
> schien, mayhemer, is either of you planning to actively work on this? I'm
> very interested in helping you finish this.

Hi Markus!

Yes, we want!  sc helped me recently to move a bit forward by making this build on platforms other than Windows (my main dev platform).  He wants to use it to confirm some of his performance patches.

My current plan is to land this patch (since maintaining it is hard) mostly as is, disabled at compile time by default, and finish other tasks later.  That means getting this code reviewed and building on all tier-1 platforms, at least when disabled (should be easy).

The very next step is to allow taking backtrace (callstack) samples manually when an event/task is being QUEUED.  I want to use the STS sampler because it's super fast.  Symbolication should be done in the web UI when the profile is processed in cleopatra (same as we do for stacks now).

I got some instructions from :benwa back in Orlando on how to do it.  But honestly, those were a bit vague (more like "inspect the code yourself").  I haven't moved a bit since then.

So, if you know the stack sampler, or know someone who does and would be willing to help with this, that would be great!

> 
> mayhemer, can you briefly describe how, for each type of marker, its cause
> can be traced? I see that DEQUEUE, DONE, SUB_COMPLETED and SUBS_RESOLVED
> have an origin field, but it's not obvious to me how it works for QUEUE and
> OBJECTIVE. 

So, even though you apparently understand how the origin field works, I'd rather repeat it to make the picture complete:

Each thread simply has an array of captured markers, indexed from 0 to N (this index is only unique per-thread, not inter-thread).  Each marker can then be identified by {tid, id}.  The origin on a marker then is simply a {tid, id} pair pointing at the marker that originated this one.  DEQUEUE's origin is a QUEUE marker (e.g. Run() points to its Dispatch() point).  DONE's origin is DEQUEUE (spanning the Run() execution time).

QUEUE, INPUT and OBJECTIVE don't have origins.  How to handle them is explained in the following paragraph:

During backtrack you start from a selected OBJECTIVE marker.  You are, and stay, on a certain thread.  You walk markers *on this thread* back (simply id--) until you hit a marker with the type DEQUEUE.  It then has the origin field, which points to a thread (maybe the same one you are currently walking, maybe a different one) and a per-thread marker id you have to jump to.  (This shows in cleopatra as a pink-ish line in the graph.)  And then you simply again walk that thread until you hit DEQUEUE again.

backtrack stops when you hit an INPUT marker.

It may also happen that you hit DONE on the thread.  It means there is either a nested event loop or a loop dequeuing tasks from some implementation-specific queue (e.g. js parsing, repaint, network socket handling, from what I quickly recall).  In that case (on a DONE marker) you jump over to its origin.  The marker found there is expected to be a DEQUEUE marker on the same thread.  You skip that DEQUEUE (ignore it) and then simply continue walking the same thread as before.  (In cleopatra this is shown as a gray rectangle in the graph.)
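The walk just described could be sketched roughly like this (hypothetical types and layout; the actual marker structures in the patch differ, and underflow/error handling is omitted):

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical marker layout mirroring the description above.
enum class Type { QUEUE, DEQUEUE, DONE, INPUT, OBJECTIVE };
struct MarkerId { uint32_t tid; uint32_t id; };
struct Marker { Type type; MarkerId origin; };  // origin = {0,0} if none

using Threads = std::map<uint32_t, std::vector<Marker>>;

// Walk back from an OBJECTIVE marker to the INPUT that caused it,
// returning the visited {tid,id} pairs (the critical path, newest first).
std::vector<MarkerId> Backtrack(const Threads& threads, MarkerId cur) {
  std::vector<MarkerId> path;
  while (true) {
    path.push_back(cur);
    const Marker& m = threads.at(cur.tid).at(cur.id);
    if (m.type == Type::INPUT) {
      break;  // reached the originating input; the backtrack ends here
    }
    if (m.type == Type::DEQUEUE) {
      // Jump to the QUEUE marker that originated this dequeue,
      // possibly on a different thread.
      cur = m.origin;
    } else if (m.type == Type::DONE) {
      // Nested/implementation-specific loop: jump to this DONE's
      // DEQUEUE on the same thread, skip it, keep walking the thread.
      cur = {m.origin.tid, m.origin.id - 1};
    } else {
      // QUEUE, OBJECTIVE, etc.: just step back on the current thread.
      cur = {cur.tid, cur.id - 1};
    }
  }
  return path;
}
```

The SUB_COMPLETED/SUBS_RESOLVED chains described below would be a separate scan over the same data rather than part of this main walk.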

SUB_COMPLETED marks a sub-task ending (e.g. the load of one of the CSS files a web page refers to has just finished).

SUBS_RESOLVED marks the point when all sub-tasks have been finished (e.g. all CSS files that a web page refers to in its HEAD have been loaded, and some other operations, like first paint etc., have just been unblocked).

All of them (IIRC) link back to each other.  When you hit SUBS_RESOLVED during backtracking, its origin points to the last SUB_COMPLETED marker taken.  The origin of that SUB_COMPLETED marker then points to the previous SUB_COMPLETED marker, and so on, until the first one, which has origin = {0,0}.  There the chain ends.

SUBs are not something that creates the graph in cleopatra.  These two are, though, very important for finding out what is related to reaching the objective.  Without these you can't easily say which event chains can potentially be rescheduled to happen later (lower their priority).  Capturing all of the SUBs is going to be relatively complicated; the instrumentation is pretty sensitive overall... one disadvantage of backtrack.

I can send a more detailed email about all of this.  I know I lack comments in the patch... and there is no doc for this, except in my head.


Some source of information (last resort :))) can be https://github.com/mayhemer/cleopatra/blob/backtrack/js/backtrack.js.  It does the actual backtrack (back from OBJECTIVE to INPUT).  There are two parts of the backtrack loop: scan and main.  The first one collects info about subtasks, the other draws on the screen (adds HTML elements to the dedicated backtrack div) and puts some info for the UI into those elements.

And sorry, I'm no JS virtuoso :)

> Though I suppose for OBJECTIVE you can just check what task is
> currently running (i.e. is between DEQUEUE and DONE).

yes!
Flags: needinfo?(honzab.moz)
Thanks! This is already very helpful.

On the weekend I tried out both backtrack and TaskTracer. Both needed some work to actually compile (bug 1287392 tracks the TaskTracer build failure), but once I got them working, I was able to get data out of both.

(In reply to Honza Bambas (:mayhemer) from comment #43)
> Yes, we want!  sc helped me recently to move a bit forward by making this
> build on platforms other than windows (my main dev plat).  he wants to use
> it to confirm some of his performance patches.

That's great!

> my current plan is to land this patch (since maintaining it is hard) mostly
> as is, disabled at compile time by default and finish other tasks later.  it
> means to get this code reviewed and building on all t1 platforms, at least
> when disabled (should be easy).

Hmm... can we try to reduce its surface area first? I understand that the goal is to get a picture of what's going on that's as complete as possible, but the patch really adds a ton of instrumentation. The more instrumentation you have, the more likely it is that some of it breaks, especially if it's not built by default. The same happened to TaskTracer and TaskTracer has a lot less instrumentation.

We should also try to get each piece of instrumentation reviewed by a person who knows the code that it's instrumenting.

> the very next step is to allow taking backtrace (callstack) samples manually
> when an event/task is being QUEUED.  I want to use STS sampler because it's
> super fast.  symbolication should be done in the web ui when the profile is
> processed in cleopatra (same as we do for stacks now) 
> 
> I got some instructions from :benwa yet in orlando on how to do it.  but
> honestly, those were a bit wage (more like "inspect the code yourself").  I
> didn't move a bit since then.

All you need to do is call profiler_get_backtrace(). See for example the two places where the refresh driver does it.

> each thread simply has an array of captured markers, indexed from 0 to N
> (this index is only unique per-thread, not inter-thread).  each marker can
> then be identified by {tid, id}.

As an aside, I think TaskTracer's solution to this is a little nicer: A taskId is a uint64_t, which includes both the thread ID and the task index in that thread:
uint64_t taskid = ((uint64_t)tid << 32) | ++info->mLastUniqueTaskId;

> DEQUEUE's origin is a QUEUE marker (e.g. Run() points to it's Dispatch() point). 

How does this work for IPC messages? (Does it work at all?)
TaskTracer had to add a "parent_task_id" field to the IPC Message type so that it could track origins across processes.

For QUEUE markers, or in TaskTracer terms, DISPATCH markers, TaskTracer automatically links up the current task on the stack as the dispatched task's parent.

Here are a few more questions I had yesterday:

Is every DEQUEUE marker expected to be followed up with a DONE marker? That seems to be quite broken in the current patch, especially around the SourceCompressionTask and GlobalHelperThreadState business; there are lots of 'd' markers without corresponding 'e' markers.

In one of the earlier emails about backtrack that I've found, you said that it's valuable to be able to trace calls *by object*, e.g. all stages that an individual nsHttpChannel object goes through. Does the current patch try to make this possible? If so, how?

What is the purpose of the instrumentation of the HelperThread? What problems would one run into if those pieces of instrumentation were not present?

Are you planning to change the coding style to match Gecko style? E.g. naming fields mField instead of field or field_.
Sorry for late answer again.. I'm so busy these days :/

(In reply to Markus Stange [:mstange] from comment #44)
> Thanks! This is already very helpful.
> 
> On the weekend I tried out both backtrack and TaskTracer. Both needed some
> work to actually compile (bug 1287392 tracks the TaskTracer build failure),
> but once I got them working, I was able to get data out of both.
> 
> (In reply to Honza Bambas (:mayhemer) from comment #43)
> > Yes, we want!  sc helped me recently to move a bit forward by making this
> > build on platforms other than windows (my main dev plat).  he wants to use
> > it to confirm some of his performance patches.
> 
> That's great!
> 
> > my current plan is to land this patch (since maintaining it is hard) mostly
> > as is, disabled at compile time by default and finish other tasks later.  it
> > means to get this code reviewed and building on all t1 platforms, at least
> > when disabled (should be easy).
> 
> Hmm... can we try to reduce its surface area first? I understand that the
> goal is to get a picture of what's going on that's as complete as possible,
> but the patch really adds a ton of instrumentation. The more instrumentation
> you have, the more likely it is that some of it breaks, especially if it's
> not built by default. The same happened to TaskTracer and TaskTracer has a
> lot less instrumentation.

I don't know what exactly you mean.  All I've added is needed, or is at least an outline of places where we should think of something more automated.  Wrapping nsIRunnables when added to nsEventQueue is one such example of automated tracking collection - similar to what TT does.  But not all code can just migrate to use nsEventQueue or something common like that.  Hence there is instrumentation and class members (usually origin tracking) added in various places.

btw, in the chromium code I've seen they have some more generic support for tracking tasks...

I can see the patch being split into the core + each module's instrumentation separately.  Get review for the core first and land it.  Then each module.  It can also be the step to get feedback on whether it's actually correct :)

> 
> We should also try to get each piece of instrumentation reviewed by a person
> who knows the code that it's instrumenting.

Yep, as above.

> 
> > the very next step is to allow taking backtrace (callstack) samples manually
> > when an event/task is being QUEUED.  I want to use STS sampler because it's
> > super fast.  symbolication should be done in the web ui when the profile is
> > processed in cleopatra (same as we do for stacks now) 
> > 
> > I got some instructions from :benwa yet in orlando on how to do it.  but
> > honestly, those were a bit wage (more like "inspect the code yourself").  I
> > didn't move a bit since then.
> 
> All you need to do is call profiler_get_backtrace(). See for example the two
> places where the refresh driver does it.

I think I've already tried that... anyway, I will check on it, thanks for the tip!

Question: does this function return something that is symbolicated later (as the sampler does) or does it do something more complete?  I need the first case, since speed is crucial here.

> 
> > each thread simply has an array of captured markers, indexed from 0 to N
> > (this index is only unique per-thread, not inter-thread).  each marker can
> > then be identified by {tid, id}.
> 
> As an aside, I think TaskTracer's solution to this is a little nicer: A
> taskId is a uint64_t, which includes both the thread ID and the task index
> in that thread:
> uint64_t taskid = ((uint64_t)tid << 32) | ++info->mLastUniqueTaskId;

It's the same information, just with more work: if you want just the TID you must mask and shift.  Is there some reason why that would be better than having a structure as I do?
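For the record, the two schemes are interchangeable - a standalone sketch (type and function names are hypothetical, not from either patch) showing the round trip between a {tid, id} pair and the TaskTracer-style packed uint64_t:

```cpp
#include <cstdint>

// What my patch uses: an explicit pair.
struct MarkerId {
  uint32_t tid;  // thread id
  uint32_t id;   // per-thread marker index
};

// TaskTracer-style packing: tid in the high 32 bits, index in the low 32.
uint64_t Pack(MarkerId m) {
  return (uint64_t(m.tid) << 32) | m.id;
}

// Getting the pieces back out requires the shift/mask mentioned above.
MarkerId Unpack(uint64_t packed) {
  return MarkerId{uint32_t(packed >> 32), uint32_t(packed & 0xffffffffu)};
}
```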

> 
> > DEQUEUE's origin is a QUEUE marker (e.g. Run() points to its Dispatch() point). 
> 
> How does this work for IPC messages? (Does it work at all?)

Yep, I mostly copied what they did for Task Tracer, with a few fixes.

> TaskTracer had to add a "parent_task_id" field to the IPC Message type so
> that it could track origins across processes.

I do the same, plus a few more pieces of instrumentation to track the message across all the threads it actually walks through (more precise/deep than TT).

> 
> For QUEUE markers, or in TaskTracer terms, DISPATCH markers, TaskTracer
> automatically links up the current task on the stack as the dispatched
> task's parent.

Backtrack doesn't need to do that; you walk the thread manually later.  Backtrack is more detailed than TT, since it also has to include networking, nested loops etc.  It's more a performance tool than a tracking tool.

> 
> Here are a few more questions I had yesterday:
> 
> Is every DEQUEUE marker expected to be followed up with a DONE marker? 

Not always, but that's probably just to save some memory and profile size.  I would have to take a look at where exactly that's the case.  But we could enforce always having a DONE marker as a general rule if you have some plans here.

> That
> seems to be quite broken in the current patch, especially around the
> SourceCompressionTask and GlobalHelperThreadState business; there are lots
> of 'd' markers without corresponding 'e' markers.

Hmm.. the more recently added instrumentation tries to use an RAII class that surrounds execution blocks with both DEQUEUE and DONE.  Maybe it's easy to migrate the DEQUEUE-only markers to do that everywhere; not sure at the moment, though, I have to take a look.
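A minimal standalone sketch of the RAII idea (the global marker string and class name are mock stand-ins, not the actual patch code) - the guard guarantees every 'd' (DEQUEUE) gets its matching 'e' (DONE), even on early return or exception:

```cpp
#include <string>

std::string gMarkers;  // mock marker stream; the real thing records per-thread

// Emits a DEQUEUE ('d') marker when an execution block starts and a DONE
// ('e') marker when it ends, on every exit path, via the destructor.
class AutoDequeueDone {
public:
  AutoDequeueDone() { gMarkers += 'd'; }
  ~AutoDequeueDone() { gMarkers += 'e'; }
};

void RunTask(bool failEarly) {
  AutoDequeueDone guard;
  if (failEarly)
    return;  // 'e' is still emitted by the destructor
  // ... do the actual work ...
}
```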

> 
> In one of the earlier emails about backtrack that I've found, you said that
> it's valuable to be able to trace calls *by object*, e.g. all stages that an
> individual nsHttpChannel object goes through. Does the current patch try to
> make this possible? If so, how?

No, I don't find it important for the goal of this work.  I tried to do that several times and it was always too complicated.  My immediate plan is to utilize NSPR logging (with maybe some changes if needed) for this.  I want to create an analyzer that is given an NSPR log and lets you see the object chains of your liking there.  (No bug filed.)

> 
> What is the purpose of the instrumentation of the HelperThread? What
> problems would one run into if those pieces of instrumentation were not
> present?

Hmm.. I don't recall whether I found it missing when inspecting profiles or added it just because it's something that queues/dequeues tasks.  But I believe those are needed.. why do you ask?

> 
> Are you planning to change the coding style to match Gecko style? E.g.
> naming fields mField instead of field or field_.

I hate the Gecko style :)  If someone forces me to change it, e.g. as a review requirement, yes, but not otherwise.  The code looks and reads so much clearer with the more modern style!  SPS is somewhat using it too.
Flags: needinfo?(mstange)
(In reply to Markus Stange [:mstange] from comment #44)
> All you need to do is call profiler_get_backtrace(). See for example the two
> places where the refresh driver does it.

Seems like this works!  I only did a rough test by calling the function, checking that it actually grabs a stack (invoked from a tick) and that it doesn't cause a major slowdown.  It only adds a kind of light "tremor" to the UI responsiveness.  I also haven't disabled the 1ms sampler, so we can lower the perf impact even more by disabling it.

Now I need to make sure it can be symbolicated in the UI.  I don't yet know whether that works.

Thanks for this advice, Markus!
Blocks: 1329137
I'm going to discuss the latest thoughts in an external document.
Flags: needinfo?(mstange)
Attachment #8798118 - Attachment description: wip v7 (not super-deeply tested) → 2016 patch (wip, reference for instrumentation)
Attached patch 2018.patch (basic queue) (obsolete) — Splinter Review
- this is an outline of a simple template class to queue and dequeue strongly-referenced tasks
- includes test for serial q/deq and leak checking
- no instrumentation and no marker capturing code
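As a rough standalone sketch of what such a template queue might look like (using std::shared_ptr as a stand-in for Gecko strong references; none of these names are from the patch):

```cpp
#include <deque>
#include <memory>

// Holds strong references to queued tasks so they can't be destroyed while
// pending; Dequeue() hands the reference back to the caller.
template <typename T>
class StrongQueue {
public:
  void Enqueue(std::shared_ptr<T> task) {
    mQueue.push_back(std::move(task));
  }

  // Returns nullptr when the queue is empty.
  std::shared_ptr<T> Dequeue() {
    if (mQueue.empty())
      return nullptr;
    std::shared_ptr<T> task = std::move(mQueue.front());
    mQueue.pop_front();
    return task;
  }

  bool IsEmpty() const { return mQueue.empty(); }

private:
  std::deque<std::shared_ptr<T>> mQueue;
};
```

(Single-threaded for brevity; the real thing would of course need locking and the marker-capturing hooks.)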
Depends on: 1028264
Depends on: 1500118
The work has moved to a new bug 1507282.
Alias: backtrack
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → INCOMPLETE