Closed Bug 1557447 Opened 5 months ago Closed 25 days ago

Profiler support for IPC information (Gecko side)

Categories

(Core :: Gecko Profiler, enhancement, P3)


Tracking


RESOLVED FIXED
mozilla71
Tracking Status
firefox69 --- wontfix
firefox70 --- wontfix
firefox71 --- fixed

People

(Reporter: jesup, Assigned: squib)

References

Details

(Whiteboard: [qf:p2:pageload])

Attachments

(3 files, 2 obsolete files)

See also bug 1557443

We need a way to log and visualize IPC events in a profile; they add considerable understanding of what code is doing and why something runs when it does. We'd want to see what was sent and to where, have IDs to match up send and receive (jump to the right place?), and record or calculate send/receive/event-run latencies, etc.

See a hack here that uses TextMarkers: https://perfht.ml/2QSInCn. These are too expensive (they require allocations, waste space, etc.), so it's more of a proof of concept. Patch attached.

Comment from ehsan: "I think having a timeline view in the profiler UI with horizontal bands for threads and profile markers as vertical lines on the thread timeline may be a nice way to visualize the concurrency patterns which could be helpful in identifying lags or where scheduling could have happened better (e.g. by having IPCed some data earlier, etc.)"

For perf, we'll want to minimize overhead when not recording this info: probably have a profiler Feature for deciding to record it, and put it in the main profiler buffer (or alternatively a separate circular buffer, though that's probably not needed). In the UI, try to link up a request in one direction with a response, or a set of IPCs for (for example) an HttpChannel, etc. I'm sure there are more ideas...

Type: defect → enhancement
Priority: -- → P3
Whiteboard: [qf] → [qf:p2:pageload]

I took a peek at the usages of LogMessageForProtocol, and they all seem to be auto-generated by IPDL. In this case, instead of allocating strings for text markers, we could add markers that contain:

  • the actor (this can be derived from the IPC message type);
  • the IPC message type (which we can map back to strings when we write out the profile);
  • whether it is a send or a recv; and
  • the PID of the other process.

This should only be a few bytes per call.
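As a minimal sketch (the struct name and fields below are hypothetical, only to show that everything fits in a few integral values; the actor is derived from the message type, so it needs no field of its own):

```cpp
#include <cstdint>

// Hypothetical compact record for one IPC profiler marker.
struct IPCMarkerData {
  int32_t mOtherPid;      // PID of the process on the other end
  uint32_t mMessageType;  // IPC message type id; mapped back to its string
                          // name when the profile is written out
  bool mIsSend;           // true = send, false = recv
};

static_assert(sizeof(IPCMarkerData) <= 12, "stays at a few bytes per call");
```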

:jld, does IPC have per-message IDs that we could use to correlate the sends/recvs between processes?

:jesup, how does all of the above sound as an alternative to text markers?

Flags: needinfo?(rjesup)
Flags: needinfo?(jld)

They are auto-generated.

Per discussion with Nika, sync IPCs have IDs, but async ones leave it as 0 or -1 or some such. We could put IDs in there, though, so we can correlate after the fact in the profiler (or in logs).
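As a rough illustration only (nothing here reflects the actual IPC::Message internals), one way to hand out real IDs for async messages would be a simple per-process atomic counter:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical seqno allocator: sync IPCs already carry an id, but async
// ones currently leave it as 0/-1.  Giving them a monotonically increasing
// value lets the profiler match the send marker with the recv marker later.
static int32_t NextAsyncSeqno() {
  static std::atomic<int32_t> sCounter{0};
  // Start at 1 so that 0 keeps meaning "no seqno assigned".
  return ++sCounter;
}
```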

Markers do allocate memory, and processing a lot of markers might generate overhead (they have to be culled by time to avoid stretching longer than the buffer has samples for).

The best solution would be to put these in ProfileBuffer entries (like we did for counters), or in a separate circular buffer. But we could put them into markers for now and see what the impact is; I'm just concerned there may be a LOT of IPCs, and when they flood I don't want things to stall on memory locks, for example.

Interesting data to collect or post-process would be time-to-reception and time-to-run-on-target-thread. Maybe time-to-send (but probably not). If we record on the receiver side and put the timestamp of sending in the message, we can have just one block saved per IPC.
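A sketch of that receiver-side bookkeeping, assuming the send timestamp travels inside the message; mozilla::TimeStamp/TimeDuration are the real Gecko time types, but the record struct and helper are hypothetical:

```cpp
#include "mozilla/TimeStamp.h"

using mozilla::TimeDuration;
using mozilla::TimeStamp;

// Hypothetical per-IPC record, written once on the receiving side.
struct IPCTimingRecord {
  TimeDuration mTimeToReception;  // send -> message arrives in this process
  TimeDuration mTimeToRun;        // send -> handler runs on the target thread
};

// aSendTime is carried inside the message; the other two are stamped locally.
static IPCTimingRecord MakeTimingRecord(const TimeStamp& aSendTime,
                                        const TimeStamp& aReceived,
                                        const TimeStamp& aRun) {
  return {aReceived - aSendTime, aRun - aSendTime};
}
```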

Flags: needinfo?(rjesup)

Looks like Nika answered the ID question (and I'd basically just be looking up IPC::Message::Header and searching for what its fields do, because I don't know a huge amount about this corner of IPC off the top of my head).

Flags: needinfo?(jld)

Taking this...

Good info above! Comment 2 is similar to what I was thinking. I'll have to look more into ProfileBuffer entries, since I'm not familiar with them...

I'll also start looking into how to show all this stuff in perf.html. Hopefully I'll have a part with at least a new marker payload on the Gecko side up tomorrow.

Assignee: nobody → jporter+bmo
Status: NEW → ASSIGNED

This is a proof of concept expanding on :jesup's patch. I added a new marker type with all the relevant fields to profile the IPC calls. I also changed how we set the seqno for messages so that we have a real seqno for each IPC message; this is a bit messy and should probably be implemented differently. Still, it works.

This gives us the sender pid, recipient pid, and message seqno, which should be enough to correlate both sides of an IPC message in the profiler. I'll also post a patch for the profiler itself to show this info in a very simplistic form...
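As a small sketch (the key type and helper below are hypothetical, not part of the patch), those three values form a natural correlation key: both sides of the same message produce the same key, so markers from two processes' profiles can be joined after the fact.

```cpp
#include <cstdint>
#include <tuple>

// Hypothetical correlation key built from the values stored in each marker.
using IPCMessageKey = std::tuple<int32_t /* sender pid */,
                                 int32_t /* recipient pid */,
                                 int64_t /* message seqno */>;

// The "send" marker in one process and the "recv" marker in the other
// yield an identical key, which is what lets the frontend pair them up.
inline IPCMessageKey MakeIPCMessageKey(int32_t aSenderPid,
                                       int32_t aRecipientPid,
                                       int64_t aSeqno) {
  return {aSenderPid, aRecipientPid, aSeqno};
}
```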

... and here's the profiler patch: https://github.com/jimporter/profiler/tree/ipc-markers

That's fine... but for IPC info I think we really need to avoid the memory allocations that markers introduce, since they can block; i.e. it needs to be in ProfileBuffer entries in the ring buffer. For that matter, we should strongly consider modifying the ring buffer to include markers instead of the current complex method that involves allocations; the downside is that the buffer would become variable-length-entries instead of fixed-length (I presume). Alternatively we could create a second ring buffer for IPC (or for Markers), but then you have the issues of differing wrap times depending on buffer traffic. That might be ok. (NI to gerald for comment/awareness.)

Flags: needinfo?(gsquelart)

All that said, this is quite useful for iterating on the UI choices and design; for just seeing what IPCs are in use and for non-critical timing info, it would work fine (much like my hacky patch did, but better).

(In reply to Randell Jesup [:jesup] (needinfo me) from comment #8)

... For that matter, we should strongly consider modifying the ring buffer to include markers instead of the current complex method that involves allocations; the downside is that the buffer would become variable-length-entries instead of fixed-length (I presume).

I'm working on that in bug 1562604. 😉
Got an initial prototype that replaces the ProfileBuffer bump buffer with the new variable-length-entries buffer; so far so good -- even saving us ~20% of space!
https://treeherder.mozilla.org/#/jobs?repo=try&revision=e129d36bc91461c5f37066693c30714ff2112e77
Next step is to save marker data in there, which may be tricky, we'll see...

Alternatively we could create a second ring buffer for IPC (or for Markers), but then you have the issues of differing wrap times depending on buffer traffic. That might be ok.

Anything's possible!
You'll have to balance the amount of work needed to implement this separate IPC buffer against waiting for bug 1562604, which you should reasonably expect in a matter of weeks. (Hoping for ~2w, but don't trust my estimates! Assume more like 4-8w.)

Flags: needinfo?(gsquelart)

I'll definitely keep an eye on bug 1562604, thanks! I can focus on the UX side of this in the meantime, and if I finish that up before you've got a complete-ish patch up, I can always work on an interim solution for buffering the IPC markers.

Here's the (still WIP) patch for the profiler frontend: https://github.com/firefox-devtools/profiler/pull/2172

One thing I've noticed during testing is that occasionally we end up with duplicate seqnos for IPC messages, which is obviously a problem. We also naturally can't correlate both sides of an IPC message if we're not actually profiling both the src and dest threads. Any thoughts about how we should surface this in the UI?

Currently, I'm just using the normal track for markers in the UI, but it's vertically offset by a bit now. It might help to have a separate IPC track, but I'll discuss that more with the profiler UI folks when they're back. :)

Flags: needinfo?(rjesup)

:gerald/:jesup, I'm not sure how close to done bug 1562604 is, but if there's still a moderate amount left to be done on it (say, a couple weeks), would it make sense to clean up this patch and land it with a disabled-by-default build flag? That would make it slightly easier to deploy an initial version of the front-end side for people to provide feedback on other useful visualizations.

I'm open to other options here, but I thought it would be valuable to get some more early feedback on the UX, and I wanted to make it easier for people who are interested in this data to start trying things out.

Flags: needinfo?(gsquelart)

(In reply to Jim Porter (:squib) from comment #13)

:gerald/:jesup, I'm not sure how close to done bug 1562604 is, but if there's still a moderate amount left to be done on it (say, a couple weeks), would it make sense to clean up this patch and land it with a disabled-by-default build flag? That would make it slightly easier to deploy an initial version of the front-end side for people to provide feedback on other useful visualizations. ...

Realistically, I see bug 1562604 landing in 1-2 months if there is no major hump (/me touches wood).
I'd be fine with landing this behind a build flag or even a pref, if that could help with the frontend, and help some people start to track IPC perf issues.

(In reply to Jim Porter (:squib) from comment #12)

... if we're not actually profiling both the src and dest threads.

Maybe it would be worth considering enabling some markers over all threads? Especially after bug 1562604 lands (/me crosses fingers), we won't have the cost of sampling everything, so it should be simpler to allow markers from all threads in this case. If you (Jim, Randell, anyone) think it would be worth it, feel free to open a follow-up bug. 😉

Flags: needinfo?(gsquelart)

Just as an FYI, we on the security team under Paul Theriault are super interested in using this for security testing and dynamic analysis of IPC calls.

Attached patch Mostly-complete patch (obsolete) — Splinter Review

This isn't quite done, but I think it's mostly there. IPC message profiling is hidden behind a profiler feature, so it won't have a performance cost by default, and I should also be able to remove all the strings in the marker payload to further reduce the number of memory allocations. Currently, I'm including the actor (the name of the class that sent the message, e.g. "PContentChild"), but I think this is mostly redundant with the message type (which gets sent as a string of the form "PContent::Msg_foobar"). We might be able to remove this entirely, replace it with a boolean parent/child flag, or use an nsAutoCStringN to avoid the memory allocation for the string. We could also try sharing the strings in some global table if we need to.
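As a hedged sketch of the feature gating mentioned here (profiler_feature_active is a real Gecko profiler API, but the feature identifier below is an assumption and may not match the patch exactly):

```cpp
#include "GeckoProfiler.h"

// Bail out before doing any work unless the (assumed) IPC profiler feature
// is enabled, so the instrumentation costs nothing by default.
static void MaybeRecordIPCMarker(/* message info elided */) {
  if (!profiler_feature_active(ProfilerFeature::IPCMessages)) {
    return;  // profiler off, or the IPC feature not selected
  }
  // ... build the small, integral-only payload and add the marker ...
}
```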

Randell, do you have any major concerns with the patch as it is now? If not, I'll finish cleaning things up and put it up for review in the next couple days.

Attachment #9073301 - Attachment is obsolete: true
Flags: needinfo?(rjesup)
Attachment #9085630 - Flags: feedback?(rjesup)

... Currently, I'm including the actor (the name of the class that sent the message, e.g. "PContentChild"), but I think this is mostly redundant with the message type (which gets sent as a string of the form "PContent::Msg_foobar"). We might be able to remove this entirely, replace it with a boolean parent/child flag, or use an nsAutoCStringN to avoid the memory allocation for the string. ...

Is there any possibility that the message prefix and the actor name could ever be unequal? If not, a boolean to indicate direction is certainly enough.

(In reply to Frederik Braun [:freddyb] from comment #17)

Is there any possibility that the message prefix and the actor name could ever be unequal? If not, a boolean to indicate direction is certainly enough.

After reading through ipc/ipdl/ipdl/lower.py and parsing the resulting profile data, I think I've convinced myself that the message prefix and actor name are always the same (aside from "Parent"/"Child", of course), so I've converted this to be a bool. That means our IPC marker is now just a handful of integral values! \o/

This adds the ability to add profile markers for both the sender and recipient
sides of IPC messages. These can then be correlated with one another in the
profile visualization. For the UI component of this patch, see
https://github.com/firefox-devtools/profiler/pull/2172.

Comment on attachment 9085630 [details] [diff] [review]
Mostly-complete patch

Moving to Phabricator now that this should be pretty much review-ready...
Attachment #9085630 - Attachment is obsolete: true
Attachment #9085630 - Flags: feedback?(rjesup)

:jld + :gregtatum, this patch should be ready for review. If either of you are too backlogged on other reviews, feel free to forward the r? to someone else. Thanks!

Flags: needinfo?(jld)
Flags: needinfo?(gtatum)

I can review the profiler related changes tomorrow.

Flags: needinfo?(gtatum)

(Answered out-of-band.)

Flags: needinfo?(jld)
Pushed by jporter@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/cd4dc0ae3364
Profiler support for IPC information; r=nika
Pushed by jporter@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/debb6f82c687
part 2 - Remove unnecessary declarations from bug 1575448; r=gerald
Pushed by jporter@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/2923afce519a
Profiler support for IPC information; r=nika

Whoops, I forgot to reference mSync in one place in ProfilerMarkerPayload when I added it! I've re-landed with the fix after confirming locally that the test failed without the fix and passed with it.
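For illustration only (the class below and the JSON writer calls are approximations, not the actual ProfilerMarkerPayload code), the kind of omission being described is a field that is stored but never streamed:

```cpp
#include <cstdint>
#include "mozilla/JSONWriter.h"

// Simplified stand-in for the IPC marker payload: mSync has to be written
// out when the marker is streamed, or the profile silently loses it.
class IPCMarkerPayloadSketch {
 public:
  IPCMarkerPayloadSketch(int32_t aOtherPid, int32_t aSeqno, bool aSync)
      : mOtherPid(aOtherPid), mSeqno(aSeqno), mSync(aSync) {}

  void Stream(mozilla::JSONWriter& aWriter) const {
    aWriter.IntProperty("otherPid", mOtherPid);
    aWriter.IntProperty("messageSeqno", mSeqno);
    aWriter.BoolProperty("sync", mSync);  // the easily-forgotten reference
  }

 private:
  int32_t mOtherPid;
  int32_t mSeqno;
  bool mSync;
};
```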

Flags: needinfo?(jporter+bmo)
Status: ASSIGNED → RESOLVED
Closed: 25 days ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla71

Since the statuses are different for nightly and release, what's the status for beta?
For more information, please visit auto_nag documentation.

Backed out changeset 2923afce519a (Bug 1557447) for a browser-chrome failure at tools/profiler/tests/browser/browser_test_feature_preferencereads.js.

Push with failure: https://treeherder.mozilla.org/#/jobs?repo=autoland&selectedJob=271817005&searchStr=linux%2Cx64%2Cdebug%2Cmochitests%2Ctest-linux64%2Fdebug-mochitest-browser-chrome-e10s-3%2Cm%28bc3%29&revision=2923afce519ae4339e8fb7d29dfba065717eff83

Failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=271817005&repo=autoland&lineNumber=16543

Backout link: https://treeherder.mozilla.org/#/jobs?repo=autoland&selectedJob=271817005&searchStr=linux%2Cx64%2Cdebug%2Cmochitests%2Ctest-linux64%2Fdebug-mochitest-browser-chrome-e10s-3%2Cm%28bc3%29&revision=70426e17c6445e928a200123d6a65859870859ff

[task 2019-10-18T04:06:29.125Z] 04:06:29     INFO - TEST-UNEXPECTED-FAIL | tools/profiler/tests/browser/browser_test_feature_preferencereads.js | PreferenceRead profile markers for layout.css.dpi were recorded when the PreferenceRead feature was turned on. - 0 > 0 - JS frame :: chrome://mochitests/content/browser/tools/profiler/tests/browser/browser_test_feature_preferencereads.js :: test_profile_feature_preferencereads/< :: line 65
[task 2019-10-18T04:06:29.126Z] 04:06:29     INFO - Stack trace:
[task 2019-10-18T04:06:29.126Z] 04:06:29     INFO - chrome://mochitests/content/browser/tools/profiler/tests/browser/browser_test_feature_preferencereads.js:test_profile_feature_preferencereads/<:65
[task 2019-10-18T04:06:29.127Z] 04:06:29     INFO - resource://testing-common/BrowserTestUtils.jsm:withNewTab:135
[task 2019-10-18T04:06:29.127Z] 04:06:29     INFO - chrome://mochitests/content/browser/tools/profiler/tests/browser/browser_test_feature_preferencereads.js:test_profile_feature_preferencereads:49
[task 2019-10-18T04:06:29.128Z] 04:06:29     INFO - chrome://mochikit/content/browser-test.js:Tester_execTest/<:1067
[task 2019-10-18T04:06:29.128Z] 04:06:29     INFO - chrome://mochikit/content/browser-test.js:Tester_execTest:1102
[task 2019-10-18T04:06:29.129Z] 04:06:29     INFO - chrome://mochikit/content/browser-test.js:nextTest/<:930
[task 2019-10-18T04:06:29.129Z] 04:06:29     INFO - chrome://mochikit/content/tests/SimpleTest/SimpleTest.js:SimpleTest.waitForFocus/waitForFocusInner/focusedOrLoaded/<:805
[task 2019-10-18T04:06:29.130Z] 04:06:29     INFO - GECKO(6333) | ++DOMWINDOW == 3 (0x7fdf263ccc00) [pid = 6652] [serial = 4] [outer = 0x7fdf25902f20]
[task 2019-10-18T04:06:29.428Z] 04:06:29     INFO - Console message: [JavaScript Error: "getScreenshot(http://example.com/browser/tools/profiler/tests/browser/fixed_height.html) failed: TypeError: NetworkError when attempting to fetch resource." {file: "resource://activity-stream/lib/Screenshots.jsm" line: 59}]
[task 2019-10-18T04:06:29.429Z] 04:06:29     INFO - getScreenshotForURL@resource://activity-stream/lib/Screenshots.jsm:59:10
[task 2019-10-18T04:06:29.429Z] 04:06:29     INFO - async*maybeCacheScreenshot@resource://activity-stream/lib/Screenshots.jsm:112:37
[task 2019-10-18T04:06:29.430Z] 04:06:29     INFO - _fetchScreenshot@resource://activity-stream/lib/TopSitesFeed.jsm:527:23
[task 2019-10-18T04:06:29.431Z] 04:06:29     INFO - _fetchIcon@resource://activity-stream/lib/TopSitesFeed.jsm:515:16
[task 2019-10-18T04:06:29.431Z] 04:06:29     INFO - getLinksWithDefaults@resource://activity-stream/lib/TopSitesFeed.jsm:408:16
[task 2019-10-18T04:06:29.432Z] 04:06:29     INFO - async*refresh@resource://activity-stream/lib/TopSitesFeed.jsm:431:30
[task 2019-10-18T04:06:29.432Z] 04:06:29     INFO - onAction@resource://activity-stream/lib/TopSitesFeed.jsm:771:14
[task 2019-10-18T04:06:29.433Z] 04:06:29     INFO - _middleware/</<@resource://activity-stream/lib/Store.jsm:63:17
[task 2019-10-18T04:06:29.433Z] 04:06:29     INFO - Store/this[method]@resource://activity-stream/lib/Store.jsm:39:54
[task 2019-10-18T04:06:29.433Z] 04:06:29     INFO - init/this.intervalId<@resource://activity-stream/lib/SystemTickFeed.jsm:27:24
[task 2019-10-18T04:06:29.434Z] 04:06:29     INFO - notify@resource://gre/modules/Timer.jsm:62:17
Flags: needinfo?(jporter+bmo)

Hmm, interesting. I'm not able to reproduce that failure locally. I'll dig into it and see if I can figure out what's going on...

Flags: needinfo?(jporter+bmo)

Jim, do you have a link to a tryserver run with the test failure, and a link to a run without the failure (with the presumably fixed test)?

Flags: needinfo?(jporter+bmo)

smaug, here are the assorted try runs. See also comment 33 which has the autoland run that failed and caused the backout. The only real difference is that I think my mozilla-central for the try runs is a few days older, but that didn't seem to have any effect on the try results.

Failing try run (just Linux64 debug mochitests): https://treeherder.mozilla.org/#/jobs?repo=try&revision=411824631724634ede43aaa08b8e9e4cd1439f5f&selectedJob=271832887
Passing try run (all the tests): https://treeherder.mozilla.org/#/jobs?repo=try&revision=5f652d8a5bac3b43d71b7995ccaefe2ddc8746e2

Flags: needinfo?(jporter+bmo) → needinfo?(bugs)
Pushed by jporter@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/0fb346412075
Profiler support for IPC information; r=nika,smaug
Flags: needinfo?(bugs)
See Also: → 853864