Closed Bug 1422061 Opened 2 years ago Closed 2 years ago

DAMP/Talos profiles don't include content process data

Categories

(Core :: Gecko Profiler, enhancement)

enhancement
Not set

Tracking

()

RESOLVED FIXED
mozilla59
Tracking Status
firefox59 --- fixed

People

(Reporter: ochameau, Assigned: ochameau)

References

Details

Attachments

(1 file)

Talos relies on MOZ_PROFILER_STARTUP env variable to start the profiler:
https://searchfox.org/mozilla-central/source/testing/talos/talos/gecko_profile.py#82-84

Which is great as it allows to see Firefox startup.
But it looks like it misses the content processes.
I can see two "content" processes, but they seem to be privileged...

See this profile:
  https://perfht.ml/2isltFd

Is there any way to also record all the content processes?
Same profile but without search and "JS only" filters: https://perfht.ml/2Arx4Lz

The profiler knows nothing about "privileged" or "unprivileged" processes. If content processes are missing, there must be a different reason.

However, I'm not sure that there are any content processes missing from your profile. It seems to be a recording of the Talos test DAMP, and the page chrome://damp/content/damp.html is loaded in the parent process. So it's possible that for the time during which that profile has no content processes, there simply were no content processes.

I was able to see that the URL chrome://damp/content/damp.html was loaded in the parent process by looking at the C++ stack of a sample inside painting, around here: https://perfht.ml/2AsTr3c
You can see a pseudo stack frame called "PresShell::Paint chrome://damp/content/damp.html#auto" on the stack there.
Let's repurpose this bug to make profiles generated while running DAMP to include info about the content process.
Blocks: 1329132
Summary: MOZ_PROFILER_STARTUP records only parent/privileged processes? → DAMP/Talos profiles don't include content process data
Comment on attachment 8933597 [details]
Bug 1422061 - Prevent performance actor from stopping profiler started by Talos.

I didn't really meant to ask for review at this stage...

A profile with this patch against DAMP
  https://perfht.ml/2Auewdz
  Done via:
  ./mach talos-test --activeTests damp --geckoProfile --cycles 1 --tppagecycles 1 --subtest custom.inspector --geckoProfileEntries 30000000
  unzip /mnt/desktop/gecko/testing/mozharness/build/blobber_upload_dir/profile_damp.zip
  Open profile_damp/page_0_pagecycle_1/cycle_0.profile from perf-html.io

This test opens a tab on this URL: http://localhost:55591/tests/devtools/addon/content/pages/custom/inspector.html
We should see a content process full of devtools code.

But even with this patch, we still don't.

FYI, it looks like following file is unused:
https://searchfox.org/mozilla-central/source/testing/talos/talos/tests/devtools/addon/content/Profiler.js
The profiler is controled by TalosPowersService.js.
Attachment #8933597 - Flags: review?(mstange)
I verified how talos started the profiler, it calls:
  nsIProfiler.StartProfiler(30000000, 10, ["js", "leaf", "stackwalk", "threads"], 4, [GeckoMain, Compositor], 2)
From here:
  https://searchfox.org/mozilla-central/source/testing/talos/talos/talos-powers/components/TalosPowersService.js#95-98
This function is called multiple times, I ensured calling StartProfiler only once, but it doesn't change anything.

Just to be sure, I tried disabling the symbolicator on the python side of Talos, by commenting these lines:
  https://searchfox.org/mozilla-central/source/testing/talos/talos/gecko_profile.py#92-95
Nothing changes.

Also, I was wondering if the content process for the tabs we open is destroyed before the profiler dumps,
so I kept the tabs open. Same result.

Finally, I commented the MOZ_PROFILER_STARTUP env variable, set by Talos:
  https://searchfox.org/mozilla-central/source/testing/talos/talos/gecko_profile.py#82
But I still get the same result.

I'm clueless now.
Flags: needinfo?(mstange)
Still happening, here is a profile of DAMP with current tip:
  https://perfht.ml/2mcZzU3
(In case it help having latest profile version, with stripes.
It looks like the content process is not present because reloading the inspector triggers a call to StopProfiler() inside the content process. This discards the content process profile and means that, once the parent process requests a profile from the relevant child process, that child process does not have a profile it can hand back to the parent process.

Here's how I found this out:
I ran the command that you gave above, but with profiler logging enabled:

MOZ_LOG="prof:3" ./mach talos-test --activeTests damp --geckoProfile --cycles 1 --tppagecycles 1 --subtest custom.inspector --geckoProfileEntries 30000000

Here's the relevant excerpt from the log:

> [...]
> 12:14:53     INFO -  PID 79690 | [Main Thread]: I/prof [79690] profiler_start
> 12:14:53     INFO -  PID 79690 | [Main Thread]: I/prof [79690] locked_profiler_stop
> 12:14:53     INFO -  PID 79690 | [Main Thread]: I/prof [79690] locked_profiler_start
> [...]
> 12:14:53     INFO -  PID 79690 | [ProfilerChild]: I/prof [79692] locked_profiler_stop
> 12:14:53     INFO -  PID 79690 | [Main Thread]: I/prof [79690] profiler_pause
> 12:14:53     INFO -  PID 79690 | [ProfilerChild]: I/prof [79692] profiler_start
> [...]
> 12:14:53     INFO -  PID 79690 | [Main Thread]: I/prof [79690] profiler_resume
> 12:14:53     INFO -  PID 79690 | Garbage collect
> 12:14:53     INFO -  PID 79690 | [ProfilerChild]: I/prof [79692] profiler_pause
> 12:14:53     INFO -  PID 79690 | [ProfilerChild]: I/prof [79692] profiler_resume
> 12:14:54     INFO -  PID 79690 | Open toolbox on 'custom.inspector'
> 12:14:54     INFO -  PID 79690 | [Main Thread]: I/prof [79693] profiler_init
> [...]
> 12:14:54     INFO -  PID 79690 | [Main Thread]: I/prof [79693] locked_profiler_start
> [...]
> 12:14:54     INFO -  PID 79690 | Unable to read VR Path Registry from /Users/mstange/Library/Application Support/OpenVR/.openvr/openvrpaths.vrpath
> 12:14:55     INFO -  PID 79690 | [ProfilerChild]: I/prof [79693] profiler_ensure_started
> 12:14:55     INFO -  PID 79690 | 2018-01-12 12:14:55.022 plugin-container[79693:3722432] *** CFMessagePort: bootstrap_register(): failed 1100 (0x44c) 'Permission denied', port = 0x9937, name = 'com.apple.tsm.portname'
> 12:14:55     INFO -  PID 79690 | See /usr/include/servers/bootstrap_defs.h for the error codes.
> 12:14:56     INFO -  PID 79690 | Garbage collect
> 12:14:56     INFO -  PID 79690 | Reload page on 'custom.inspector'
> 12:14:57     INFO -  PID 79690 | Close toolbox on 'custom.inspector'
> 12:14:58     INFO -  PID 79690 | [Main Thread]: I/prof [79692] profiler_stop
> 12:14:58     INFO -  PID 79690 | [Main Thread]: I/prof [79692] locked_profiler_stop
> 12:14:58     INFO -  PID 79690 | Garbage collect

I have three processes. The parent process (79690) and two child processes (79692 and 79693). 79693 is present in the profile but idle, so 79692 must be the one we're missing.

And this line tells us why it's missing:

> 12:14:58     INFO -  PID 79690 | [Main Thread]: I/prof [79692] profiler_stop

This profiler_stop invocation runs on the main thread, which means that it must have been triggered through the nsIProfiler interface from within the content process itself.
Because if the profiler was stopped in the parent process, then the forwarded "stop profiler" command would have arrived in the child process on that process's "ProfilerChild" thread and not on its Main Thread.


This current design is obviously really suboptimal. We have two controllers of the profiler fighting with each other: the Talos harness and the devtools performance panel. I'm not sure what to do about this. Greg's rewrite of the performance panel will only control the profiler from the parent process and not within the child process, so it will not have this problem.
Maybe for the purposes of getting proper profiles from Talos, you could just comment out all .StopProfiler() calls within devtools?
https://searchfox.org/mozilla-central/search?q=.stopprofiler%28%29&path=devtools
Flags: needinfo?(mstange)
Here's a profile I got after I commented out all those StopProfiler calls: https://perfht.ml/2rdEpLA
"Content (2 of 2)" is the thread we want.
Thanks a ton for figuring this out. Yes, the profile looks correct!

If that's because of StopProfiler calls from Talos here:
  https://searchfox.org/mozilla-central/source/testing/talos/talos/tests/devtools/addon/content/Profiler.js
It means that it affects not only DAMP but all talos tests involving this copy of Profiler.js:
  https://searchfox.org/mozilla-central/search?q=.stopprofiler()&case=false&regexp=false&path=Profiler.js

Otherwise, if that relates to some calls from devtools, it most likely comes from performance panel test, where we open it at some point during DAMP.

In any case, I can easily fix that from here. I will try to confirm what is the precise StopProfiler call that breaks the profile.
Thanks again!
So it is related to this precise call site:
  https://searchfox.org/mozilla-central/source/devtools/server/performance/profiler.js#143

If I fix this and run all DAMP tests:
  ./mach talos-test --activeTests damp --geckoProfile --cycles 1 --tppagecycles 1 --geckoProfileEntries 100000000

I get such profile:
  https://perfht.ml/2Dj6NRf
It is interesting to see that there is 6 content processes.
These new content processes are clearly related to DAMP, but at first sight, we are still missing activity in content processes.
Could it be that 100MB profile entries is still too short? My issue is that tab process crash at creation if I increase the buffer even more.
(In reply to Alexandre Poirot [:ochameau] from comment #10)
> These new content processes are clearly related to DAMP, but at first sight,
> we are still missing activity in content processes.
> Could it be that 100MB profile entries is still too short?

I think so, yes. For example you can see a "BHR-detected hang" marker in content process 6 which extends into the "out of buffer" range on the left.

> My issue is that
> tab process crash at creation if I increase the buffer even more.

Can you try increasing the interval instead? Maybe to 20ms or 50ms, just to check whether that helps.
Comment on attachment 8933597 [details]
Bug 1422061 - Prevent performance actor from stopping profiler started by Talos.

https://reviewboard.mozilla.org/r/204548/#review218982

Looks good to me! Thanks for the great comments.
Attachment #8933597 - Flags: review?(gtatum) → review+
Using 50ms doesn't seem to change anything:
 ./mach talos-test --activeTests damp --geckoProfile --cycles 1 --tppagecycles 1 --geckoProfileEntries 100000000 --geckoProfileInterval 50

https://perfht.ml/2mEeRBt
It is interesting to see it is intermittent, in this profile, for the same command, exposes 7 content processes (instead of 6)
https://perfht.ml/2DnoxdW
(In reply to Markus Stange [:mstange] from comment #15)
> Oh, I think you're running into this limit:
> https://searchfox.org/mozilla-central/rev/
> 41925c0b6c6d58578690121de439d2a8d3d690f3/tools/profiler/gecko/nsProfiler.
> cpp#537
> 
> Try setting it to something really large.

You are right, setting it to 50 allows me to get a full profile of everything \o/
  https://perfht.ml/2reZPYq

Would it be fine to bump this threshold to 50 (25 would work too I think) or would it require a more thoughtful fix?
Depends on: 1431179
I've attached a patch to bug 1431179 that removes the limit. I think we can land the patch in this bug already and close this bug out. What do you think?

The dumpProfileToFileAsync patch might not be necessary, after all, it would just be nice cleanup, if I understand correctly.
(In reply to Markus Stange [:mstange] from comment #17)
> I've attached a patch to bug 1431179 that removes the limit. I think we can
> land the patch in this bug already and close this bug out. What do you think?

Yes!
 
> The dumpProfileToFileAsync patch might not be necessary, after all, it would
> just be nice cleanup, if I understand correctly.

If I remember correctly my investigation, the code that saves the profile is this one:
https://searchfox.org/mozilla-central/source/testing/talos/talos/talos-powers/components/TalosPowersService.js#116-132
Do you think there is value in refactoring this to use dumpProfileToFileAsync?

Otherwise, Talos would benefit from some cleanup to remove unused redundant files like this one:
https://searchfox.org/mozilla-central/source/testing/talos/talos/tests/devtools/addon/content/Profiler.js
Pushed by apoirot@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/cccd776a80d9
Prevent performance actor from stopping profiler started by Talos. r=gregtatum
https://hg.mozilla.org/mozilla-central/rev/cccd776a80d9
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla59
Assignee: nobody → poirot.alex
You need to log in before you can comment on or make changes to this bug.