Closed Bug 1337063 Opened 7 years ago Closed 7 years ago

PContent::Msg_GetGraphicsDeviceInitData can take a long time

Categories

(Core :: Graphics, defect, P3)

defect

Tracking

()

RESOLVED FIXED
mozilla56
Performance Impact high
Tracking Status
firefox56 --- fixed

People

(Reporter: ehsan.akhgari, Assigned: mozbugz)

References

(Blocks 1 open bug)

Details

(Whiteboard: [gfx-noted])

Attachments

(1 file, 1 obsolete file)

Priority: -- → P3
See Also: → 1337062
Whiteboard: gfx-noted
This is the third worst sync IPC that we have after bug 1337062 and bug 1194751 with 95 percentile of 677.
I have the same question here as in bug 1337062 comment 13.
Flags: needinfo?(milan)
Whiteboard: gfx-noted → [qf:p1][gfx-noted]
Assignee: nobody → gsquelart
Flags: needinfo?(milan)
Doing bug 1337062 first.  Close to landing.
That's right.

I've actually started working on it (~half done now), using the same awkward command-line strategy for now, so we can get its benefits quickly before we implement a more elegant solution...
I don't understand why we can't do a more elegant solution now? Why can't we put this data into the initial IPC pipe?
David, could you please open a separate bug for that?

I'd like to focus here on PContent::Msg_GetGraphicsDeviceInitData, so we don't block processes because of it anymore, regardless of the method of transfer for the initial data.

I know almost nothing about how the IPC works, so I think someone who is better versed into it should tackle that (I'd be happy to learn & look into it, but it would take much more time); and then we can move all the existing work that uses the command line, to use the pipe instead (or whatever other solution is chosen*).

* My own uneducated thought: It would be great if we could somehow reuse the IPC calls, so that the code that de/serializes data is automatically generated from ipdls, instead of having to write it (again) by hand and having to maintain it when structures change.
(In reply to David Anderson [:dvander] from comment #5)
> Why can't we put this data into the initial IPC pipe?

And just about that question, in case you missed it, it was answered there:
> Bill McCloskey (:billm) wrote in bug 1337062 comment #8:
> (In reply to David Anderson [:dvander] from bug 1337062 comment #7)
> > This is needed during startup, but the data is pretty simple. Out of
> > curiosity, we launch the subprocess with an initial pipe right? Is there any
> > reason we can't shove the data in there instead of the command line, which
> > is super gross? Or can we not write to a pipe until both ends are connected?
> There was some work in bug 1303096 to do that using PeekMessages, but Brad
> hit some test failures that he couldn't figure out. If you want to give it a
> go, that would be great. The command line thing is gross.

So I think we all want a better solution, but it's not that trivial! Help welcome. ;-)
My objection to hacking a solution via argv is that, as soon as we need to modify anything involving this message, someone is going to have to touch that code, and it seems like exactly the sort of thing that is hard to maintain. And unless we're planning to immediately uplift this to 53, it doesn't make sense that we need to rush a hack. If we file a follow-up bug it'll probably never get fixed, until the maintenance burden becomes too high for someone else.

I'll take a look at the failures in bug 1303096 to see what went wrong.
(In reply to David Anderson [:dvander] from comment #8)
> My objection to hacking a solution via argv is that, as soon as we need to
> modify anything involving this message, someone is going to have to touch
> that code, and it seems like exactly the sort of thing that is hard to
> maintain. And unless we're planning to immediately uplift this to 53, it
> doesn't make sense that we need to rush a hack. If we file a follow-up bug
> it'll probably never get fixed, until the maintenance burden becomes too
> high for someone else.

Fixing the issue is a very high priority and it is a good candidate for an uplift. The sooner the code lands the better position we will be in to make that determination. I see no reason hijack this bug with repaying technical debt from the other similar bugs.

I'm not concerned about incurring some technical debt here and neither should you. My team have a really good track record of repaying technical debt so any followup will get prioritised and fixed if it is important enough.

Can you please create a follow up bug with a clear statement of exactly what you think is wrong with the current implementation, separate from any proposed solutions? You'll need to needinfo? me so I don't miss it.
I objected in those bugs too - we have a system in place already for serializing structures to other processes, and adding a new one based on string parsing doesn't seem good. Every time we change these messages we have to consider the string parsing code.
Yuck. The IPC bug Brad encountered is proving to be very difficult to track down. Luckily, *this* bug shouldn't need either approach.

The content process does not request graphics data until it initializes XPCOM, which happens from an async message from the parent process [1]. We can send the ContentDeviceData as part of the SetXPCOMProcessAttributes message. gfxVars could have gone there as well.

Note we can't eliminate this sync message in general but we can on startup - a "don't fetch the first time" flag in gfxPlatform would be okay. If I can help please let me know.

[1] http://searchfox.org/mozilla-central/rev/c4fdb67bca7889e59af9dd99c604651a49c4aafa/dom/ipc/PContent.ipdl#455
[2] http://searchfox.org/mozilla-central/rev/c4fdb67bca7889e59af9dd99c604651a49c4aafa/dom/ipc/ContentChild.cpp#523
(In reply to David Anderson [:dvander] from comment #11)
> The content process does not request graphics data until it initializes
> XPCOM, which happens from an async message from the parent process [1]. We
> can send the ContentDeviceData as part of the SetXPCOMProcessAttributes
> message. gfxVars could have gone there as well.

Thank you for the tip, this seems to work, and does look like the ideal place to do this.

Tries in progress:
- ContentDeviceData:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=9dbd8a23e4b79157679da985051a9809e5b7e917
- And gfxVarUpdates:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=18826ac92a411dcadd4b7f9efa95156fb66afa48

Could you please have a quick look at the patches there? Feedback welcome before I push for official review.
(I've put MOZ_ASSERTs to catch unexpected races; but I'm keeping the old sync requests as a fallback otherwise, until we're confident it's all good.)
Flags: needinfo?(dvander)
These look great, thanks!
Flags: needinfo?(dvander)
(In reply to Gerald Squelart [:gerald] from comment #12)
> ...
> (I've put MOZ_ASSERTs to catch unexpected races; but I'm keeping the old
> sync requests as a fallback otherwise, until we're confident it's all good.)

MOZ_CRASH or gfxDevCrash() instead of MOZ_ASSERT may be a better choice if we really want to find out quickly if this is a problem or not.  A lot more people run optimized build :)
I didn't want to be too annoying to our users, in case our assumptions are wrong; and especially this close to a new release.

However, if you think it would be worth the risk, I'd be happy to switch to MOZ_CRASH.
Let's do a MOZ_CRASH (or, gfxDevCrash which is MOZ_CRASH in Nightly, and MOZ_ASSERT in Beta/Release.)  I'm assuming it would be unexpected if we hit it, and tell us we have a bug to fix - may as well know about that sooner.
Thanks for doing all this!
Comment on attachment 8856817 [details]
Bug 1337063 - Transfer child's initial ContentDeviceData with SendSetXPCOMProcessAttributes -

https://reviewboard.mozilla.org/r/128738/#review131576

Thanks!
Attachment #8856817 - Flags: review?(dvander) → review+
Pushed by gsquelart@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/a6300a521f9c
Transfer child's initial ContentDeviceData with SendSetXPCOMProcessAttributes - r=dvander
https://hg.mozilla.org/mozilla-central/rev/a6300a521f9c
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
This was backed out: https://hg.mozilla.org/mozilla-central/rev/c73219703d72
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I'll try to come back to this bug, but for now, if someone else with a better clue wants to give it a try, please go ahead!
Assignee: gsquelart → nobody
(In reply to Milan Sreckovic [:milan] from comment #24)
> Created attachment 8872691 [details]
> Bug 1337063 - Transfer child's initial ContentDeviceData with
> SendSetXPCOMProcessAttributes -
> 
> Review commit: https://reviewboard.mozilla.org/r/144224/diff/#index_header
> See other reviews: https://reviewboard.mozilla.org/r/144224/

Rebased the patch, and a fresh try run  https://treeherder.mozilla.org/#/jobs?repo=try&revision=7d942abf880c2dcc95cb571f0ab1ebcf455ac8af when combined with the patch from bug 1337062 (which depends on this one.)
Having trouble making the patch come from :gerald with the rebase :)
Comment on attachment 8872691 [details]
Bug 1337063: Transfer child's initial ContentDeviceData with SendSetXPCOMProcessAttributes -

https://reviewboard.mozilla.org/r/144224/#review148056
Attachment #8872691 - Flags: review?(dvander) → review+
Right - this was just a rebase, but at least we know it wasn't a fluke.
Will dig through the talos runs, see what comes up.
Annoyingly, sessionrestore doesn't regress when I run it locally.  Sigh.
Assignee: nobody → milan
Fun times.  As a reminder, we're looking at the sessionrestore talos results.

For reference, the original implementation of gfxPlatform::FetchAndImportContentDeviceData (whitespace and asserts modified):

void
gfxPlatform::FetchAndImportContentDeviceData()
{
 mozilla::dom::ContentChild* cc = mozilla::dom::ContentChild::GetSingleton();
 mozilla::gfx::ContentDeviceData data;
 cc->SendGetGraphicsDeviceInitData(&data);
 ImportContentDeviceData(data);
}

Attached patch (https://treeherder.mozilla.org/#/jobs?repo=try&revision=e44fc2fbf6e4cf681362f05d0d42a4605007380f) with the ~17% slowdown (https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-central&newProject=try&newRevision=e44fc2fbf6e4cf681362f05d0d42a4605007380f&framework=1&showOnlyImportant=0&selectedTimeRange=172800) with gfxPlatform::FetchAndImportContentDeviceData method modified to  use the data we collected earlier if it's available.  Thus avoiding the sync IPC call to parent process.

void
gfxPlatform::FetchAndImportContentDeviceData()
{
  if (gContentDeviceInitData) {
    ImportContentDeviceData(*gContentDeviceInitData);
    return;
  }
  mozilla::dom::ContentChild* cc = mozilla::dom::ContentChild::GetSingleton();
  mozilla::gfx::ContentDeviceData data;
  cc->SendGetGraphicsDeviceInitData(&data); 
  ImportContentDeviceData(data);
}

Slightly modified version of the above (https://treeherder.mozilla.org/#/jobs?repo=try&revision=ddec719a2efd5b974759734011bbf39a007e65ba) with the performance results (https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-central&newProject=try&newRevision=ddec719a2efd5b974759734011bbf39a007e65ba&framework=1&showOnlyImportant=0&selectedTimeRange=172800).  This modification to the attached patch version of gfxPlatform::FetchAndImportContentDeviceData is otherwise the only difference from the above:

void
gfxPlatform::FetchAndImportContentDeviceData()
{
  mozilla::dom::ContentChild* cc = mozilla::dom::ContentChild::GetSingleton();
  mozilla::gfx::ContentDeviceData data;
  cc->SendGetGraphicsDeviceInitData(&data);
  if (gContentDeviceInitData) {
    ImportContentDeviceData(*gContentDeviceInitData);
    return;
  }
  ImportContentDeviceData(data);
}

Note the difference - we still use the previously obtained data, but we do it after the sync IPC call.  It is a waste of a sync IPC call, as we don't end up using the result.

Except that we get the talos performance numbers back.  The regression from the attached patch goes away.
Classic talos
Any sync IPC call will do above, for the talos result to be good (https://treeherder.mozilla.org/#/jobs?repo=try&revision=f8a72d65376e797169927d76e53d4bbfe385deee).

This is probably real, somehow the sync call at that point leaves us at a better state to deal with, perhaps others upcoming sync calls later on (but still part of the test.)  Fun times indeed.
Sigh :)

Sounds like a repeat of the other issue Gerald had, where dropping an early sync message changed startup timing elsewhere, exposing bugs. Probably no bug is being exposed here but maybe a subsequent sync message is taking longer or occurring later than it did before.
I had to read this several times to convince myself I'm not reading anything wrong! I can't say I understand what's going on here.  :-)
https://hg.mozilla.org/try/rev/3adece4f3b4735ff544b2282b410f26e2d5793d8 - making a sync IPC call to parent at the end if gfxWindowsPlatform::InitAcceleration (in addition to the performance regression patch) gets us back the performance.  I'll chase a bit more - one of the possibilities is that the content process, without this sync IPC call, gets too far ahead, and creates more work for the parent.
This run (https://treeherder.mozilla.org/#/jobs?repo=try&revision=ff82dade2c07f3708ca1cea871f9ed53285978be) gives us a definitive answer that the issue is either in the timing, or not a problem.  We take the existing code, *remove* the sync IPC call in question, and we get 17-18% regression.  Removing it means content process can get further ahead than with the call, so either it asks the parent process for more work, or for some work earlier, so that it gets measured within the talos test, where otherwise, with the content process effectively doing a sleep, that work would have shown up after the talos test stopped the timing.

Randomly, one of the tests that regresses (sessionrestore_no_auto_restore opt e10s) improves (before the attached patches) ~4-5% when we switch to a single content process (https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-central&newProject=try&newRevision=9e0374a7a0c5d74172f5a79f8cdebb5cd6db7c48&framework=1&showOnlyImportant=0&selectedTimeRange=172800)
Comment 39 (together with all the previous runs in the earlier comments) has a pretty much definitive "proof" that the regression we're seeing in talos is not.  It shows that removing a sync point between content and parent (content waiting on parent), with the otherwise no changes to the behavior gives us the same measured slowdown.

So, this is not really accepting the regression in talos, just explaining that there is no regression, just that the numbers got worse.
Assignee: milan → gsquelart
Pushed by msreckovic@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/644e5b5678b9
Transfer child's initial ContentDeviceData with SendSetXPCOMProcessAttributes - r=dvander
https://hg.mozilla.org/mozilla-central/rev/644e5b5678b9
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Resolution: --- → FIXED
Target Milestone: mozilla55 → mozilla56
Performance Impact: --- → P1
Whiteboard: [qf:p1][gfx-noted] → [gfx-noted]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: