Bug 1399136 (Open), opened 7 years ago, updated 2 years ago

The profiler has blanks where nothing seems to be processing

Categories: Core :: Gecko Profiler, enhancement, P3

People: Reporter: ochameau; Assignee: Unassigned

Attachments: (1 file)

Here are my STR, but the issue doesn't always reproduce:
  * open a clean Firefox (a mostly clean profile, no add-ons except perf-html)
  * start recording with a 0.1 ms sampling interval and a 270 MB buffer
  * quickly open DevTools on the Inspector with CmdOrCtrl+Shift+C
  * once the toolbox is loaded, open the profile

I often get this:
  https://perfht.ml/2w4Byq4
There is a clear blank between 1.0 and 1.6 s, which is starting to be a lot given the recent optimizations I made to the DevTools load time.

Nothing is recorded in the child process, really nothing!
And the parent process seems to be doing nothing but polling.
No markers.

Two things happen just before this blank:
* we load the toolbox.xul document (I think we see some DOM markers related to this, like DOMContentLoaded/load, but that ends before 1.0 s)
* we load this frame script: resource://devtools/server/child.js
  (which is still missing from the call tree per bug 1397718)

I see two possibilities here:
* the profiler misses something
* platform code freezes for some reason
Markus, any idea what is wrong here?
Flags: needinfo?(mstange)
Oh, I see a third option. The profiler allocates its recording buffer at the beginning of the profile and doesn't record itself doing that, so it freezes Firefox without recording anything. Is that it?
I'm pretty sure that this pause is created by the profiler itself. On Linux, when the profiler is started for the first time in a given process, it has to initialize LUL, which it needs for stackwalking. This involves reading files and processing stackwalking tables in them.

It's worth investigating whether we could at least capture pseudostacks during that time. I think we don't collect any samples because the profiler lock is locked while LUL is getting initialized. Maybe we can move part of LUL initialization outside of the lock.
Oh, yeah, allocating the recording buffer can also take a short amount of time, but the vast majority of the initialization time here is probably LUL.
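
For what it's worth, here's a rough sketch of what "move LUL initialization outside the lock" could look like. This is only an illustration: ReadLULUnwindInfo() is a hypothetical helper, and the real profiler_start()/locked_profiler_start() in tools/profiler/core/platform.cpp take more parameters than shown.

  // Sketch only (Linux / USE_LUL_STACKWALK path); names approximated.
  void profiler_start(int aEntries, double aInterval)
  {
    // The expensive part -- reading and parsing ~16 MB of .eh_frame data --
    // happens before the profiler state lock is taken, so nothing that
    // needs the lock is blocked while it runs.
    UniquePtr<lul::LUL> lul = ReadLULUnwindInfo();   // hypothetical helper

    PSAutoLock lock(gPSMutex);
    // Hand the pre-built LUL instance to the locked startup path.
    locked_profiler_start(lock, aEntries, aInterval, Move(lul));
  }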
Flags: needinfo?(mstange)
Could we display LUL processing as a marker, like the event processing markers? It would really help confirm that this is actually what's happening.

It sounds similar to GC: that's the typical kind of thing that happens in the background and helps explain why nothing is happening, or why execution is slow!
Julian, do you have bandwidth to pick this bug up?
Flags: needinfo?(jseward)
Priority: -- → P3
I looked at this today.  To put a bit of context on it, here are some
numbers for LUL startup on x86_64-linux, using a local build done with
"-Og -g" (moderately optimised).

LUL reads unwind info from around 120 shared objects (.so files).  Most of
those are pretty small, except for libxul.so.  See attachment for details.

In those files, it is the .eh_frame sections that hold unwind info.  Their
total size is about 16MB, of which libxul.so accounts for 12MB.

Reading and processing the unwind info for libxul.so alone requires almost 5
billion instructions.  Scaling that up for all libraries gives a total cost
of 6.6 billion insns for LUL startup.

For a high-end Intel, reading the .eh_frame info out of buffer cache, I
would expect this to cost between 1 and 1.5 seconds.  I am somewhat
surprised that Alexandre sees only a 0.6 second stall.

LUL's unwind-info parser isn't very efficient.  It is derived from Breakpad
code, and appears to favour robustness and simplicity over performance -- a
fair enough tradeoff, no criticism there.  With some considerable
hoop-jumping I think it would be possible to roughly halve the instruction
count.  However, the disk I/O limitations would remain.  When reading from
buffer cache or from an SSD, I can see that might not be a problem, but from
a magnetic disk I suspect it would be a bottleneck.

A different reason to invest effort in making the reader faster is that we
could spend the gains on reducing LUL's memory usage.  The current in-memory
representation of unwind info is space-inefficient, using on the order of
50MB of memory.  As described in [1], we could do a lot better, but that
will take some extra computation to de-duplicate the information.

[1] https://blog.mozilla.org/jseward/2013/09/03/how-compactly-can-cfiexidx-stack-unwinding-info-be-represented/
Flags: needinfo?(jseward)
I guess the first thing to do here is see if we can get the LUL loading
start/end points displayed as a marker somehow.  Markus, does that sound
simple to do?
Flags: needinfo?(mstange)
All my Linux profiles are done within VMware on a Windows 10 host; my CPU is an i7-7500U.

I would suggest displaying markers whenever the profiler itself is doing anything.
The most important thing is to know that it is doing work that may interfere with the profiling results.
Then I can understand such pauses and either redo the recording or interpret it with this special case in mind.

Also, personally, I'm totally fine with waiting a couple of seconds before my recording starts.
(That may be less true for developers looking at startup performance!)

You could even tweak the add-on UI, e.g. change the color of the recording icon, to tell us when we can start profiling / perform the user action without noise.
We'll probably want to insert a tracing marker around the MakeUnique<lul::LUL> call inside the SamplerThread constructor.

I think the only problem with inserting a marker inside the SamplerThread constructor is that we're holding the profiler state lock at that point, and we don't want to create a deadlock. AutoProfilerTracing currently doesn't grab the lock (since bug 1347274), so maybe we can just put an AutoProfilerTracing object in there and be done with it, but it seems a bit brittle.
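
Roughly something like this, as a sketch only (the exact AutoProfilerTracing constructor arguments and the surrounding SamplerThread code are approximated; mLUL, logging_sink_for_LUL and read_procmaps are illustrative names, not necessarily the real ones):

  // Inside the SamplerThread constructor (Linux / USE_LUL_STACKWALK path).
  {
    // RAII object: emits a start/end tracing marker pair around the
    // expensive LUL setup, without taking the profiler state lock.
    AutoProfilerTracing tracing("Profiler", "LUL initialization");

    mLUL = MakeUnique<lul::LUL>(logging_sink_for_LUL);
    read_procmaps(mLUL.get());  // read unwind info for every loaded .so
  }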

Nick, do you have any suggestions?
Flags: needinfo?(mstange) → needinfo?(n.nethercote)
My increasingly drastic ideas...

- Insert the marker and just live with the brittleness.

- Move LUL initialization outside of the locked code, and pass a "LUL has been initialized" token to locked_profiler_start(). It'll be a bit ugly because locked_profiler_start() is called from multiple places.

- We have recursive mutexes now! Use them and all these deadlock problems will go away. A few other places in the existing code will become simpler too.

- Stop using LUL:
  - Mac uses FramePointerStackWalk(), which requires MOZ_PROFILING to be set.
  - Win32 uses FramePointerStackWalk(), which works because frame pointers are always enabled on Win32.
  - Win64 uses MozStackWalk(), because Win64 never has frame pointers.
  - Android uses EHABIStackWalk().

  We could make Linux be like Mac or Win64. I very briefly tried this a while back and it seemed to basically work. The stack traces weren't identical to the ones we got with LUL, but I didn't spend enough time on it to find out why.
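
For context, here is a minimal self-contained illustration of what frame-pointer walking does (this is not the mozglue FramePointerStackWalk() implementation, just a sketch). The point is that it needs no unwind tables at all, which is why it sidesteps the LUL startup cost entirely.

  #include <cstdint>
  #include <cstdio>

  // On x86_64 with frame pointers enabled (-fno-omit-frame-pointer, or the
  // equivalent in MOZ_PROFILING builds), each frame's base pointer holds the
  // caller's saved base pointer, and the return address sits one slot above.
  void WalkFramePointers(void** aBp, void* aStackEnd, uint32_t aMaxFrames)
  {
    for (uint32_t i = 0; i < aMaxFrames; i++) {
      void* returnAddress = aBp[1];                // PC to record for this frame
      void** next = static_cast<void**>(aBp[0]);   // caller's saved frame pointer
      printf("frame %u: pc=%p\n", i, returnAddress);
      // Stop when the chain leaves the stack or stops moving towards its base.
      if (next <= aBp || next >= static_cast<void**>(aStackEnd)) {
        break;
      }
      aBp = next;
    }
  }

  // Usage sketch: WalkFramePointers(
  //     static_cast<void**>(__builtin_frame_address(0)), stackTop, 64);

The trade-off is that it only works when frame pointers are actually present in all the code on the stack, which on Linux would mean building with frame pointers enabled (presumably what MOZ_PROFILING ensures on Mac, per the list above).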
Flags: needinfo?(n.nethercote)
Severity: normal → S3