Closed Bug 1337062 Opened 7 years ago Closed 7 years ago

PContent::Msg_GetGfxVars can take a long time

Categories

(Core :: Graphics, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla56
Performance Impact high
Tracking Status
firefox56 --- fixed

People

(Reporter: ehsan.akhgari, Assigned: milan)

References

(Blocks 1 open bug)

Details

(Whiteboard: [gfx-noted])

Attachments

(2 files, 1 obsolete file)

<https://telemetry.mozilla.org/new-pipeline/dist.html#!cumulative=0&end_date=2017-02-05&keys=PJavaScript%253A%253AMsg_DOMInstanceOf!PContent%253A%253AMsg_RpcMessage!PContent%253A%253AMsg_GetGfxVars!PAPZCTreeManager%253A%253AMsg_ReceiveMouseInputEvent&max_channel_version=nightly%252F54&measure=IPC_SYNC_LATENCY_MS&min_channel_version=null&product=Firefox&sanitize=1&sort_keys=submissions&start_date=2017-02-02&table=0&trim=1&use_submission_date=0>

Start	End	IPC_SYNC_LATENCY_MS Count
0	1	0 (0%)
1	3	48.71k (20.67%)
3	8	44.31k (18.8%)
8	20	38.55k (16.36%)
20	50	37.82k (16.04%)
50	126	28.53k (12.1%)
126	317	20.1k (8.53%)
317	796	11.16k (4.73%)
796	2k	4.65k (1.97%)
2k	Infinity	1.87k (0.79%)

We should switch to sending this info either asynchronously to the child process, or in case we need it during startup, send it on the command line.
It seems like it would be valuable to have some infrastructure where the parent process sends a big bag of ipdl data to the child immediately up on startup and the child waits for this before continuing. That would make porting these things much easier than having to use async or the command line.
Bug 1303096 is converting things like setting an initial set of prefs to use the command line.  CCing Bill and Brad in case they have thought of doing something more generic.  But the whole point here is to not make the child process to wait on anything.  If the data is indeed needed during startup, it should be sent on the command line so that it can be parsed and consumed immediately without requiring any IPC.
See Also: → 1303096
(In reply to :Ehsan Akhgari from comment #2)
> Bug 1303096 is converting things like setting an initial set of prefs to use
> the command line.

Is that going to run into issues on Windows with its short command-line limits and potentially very large preference files?  Or are we only sending a short whitelisted number of prefs which we know can't be large (i.e no unbounded string values)?
It's a not-so-short list of whitelisted preferences. I checked the command line limit on Windows and it seems like we have enough space. Almost all the prefs we're sending are booleans.
(In reply to :Ehsan Akhgari from comment #2)
> Bug 1303096 is converting things like setting an initial set of prefs to use
> the command line.  CCing Bill and Brad in case they have thought of doing
> something more generic.  But the whole point here is to not make the child
> process to wait on anything.

If the parent process sends the data immediately I can't imagine the child waiting for more than one or two OS ticks in the worst case. Further, given that child is largely dependent on the parent for IO it isn't going to be able to make much progress without some kind of waiting for the parent.
(In reply to Jeff Muizelaar [:jrmuizel] from comment #5)
> (In reply to :Ehsan Akhgari from comment #2)
> > Bug 1303096 is converting things like setting an initial set of prefs to use
> > the command line.  CCing Bill and Brad in case they have thought of doing
> > something more generic.  But the whole point here is to not make the child
> > process to wait on anything.
> 
> If the parent process sends the data immediately I can't imagine the child
> waiting for more than one or two OS ticks in the worst case.

Usually when you're starting a new process, the parent is busy doing stuff (layout out a new tab in the tab strip, for example).

> Further, given
> that child is largely dependent on the parent for IO it isn't going to be
> able to make much progress without some kind of waiting for the parent.

Just starting a child process takes some time (starting services, parsing a bunch of JS, etc.). If we could do that work in parallel with the layout work in the parent, we would save time.
Priority: -- → P3
Whiteboard: gfx-noted
See Also: → 1337063
This is needed during startup, but the data is pretty simple. Out of curiosity, we launch the subprocess with an initial pipe right? Is there any reason we can't shove the data in there instead of the command line, which is super gross? Or can we not write to a pipe until both ends are connected?
(In reply to David Anderson [:dvander] from comment #7)
> This is needed during startup, but the data is pretty simple. Out of
> curiosity, we launch the subprocess with an initial pipe right? Is there any
> reason we can't shove the data in there instead of the command line, which
> is super gross? Or can we not write to a pipe until both ends are connected?

There was some work in bug 1303096 to do that using PeekMessages, but Brad hit some test failures that he couldn't figure out. If you want to give it a go, that would be great. The command line thing is gross.
Now we have a sense of how bad this sync IPC is.  See this spreadsheet: <https://docs.google.com/spreadsheets/d/1x_BWVlnQPg0DHbsrvPFX7g89lnFGa3lAIHWD_pLa_dE/edit#gid=738048355>

After loading Flash(!!!) and with bug 1303096, this is *the* worst sync IPC message that we have.  The 95 percentile is 857ms.
Longer term we probably want to use MozPromise to get prefs (see bug 1313200) and hang all the graphics initialisation off those. That would be much easier if we MozPromise the graphics stack first. That might take a while.

Right now we need to discuss a short term fix. We either need to push the prefs into the child process or make it so we can service PContent::Msg_GetGfxVars without touching the main thread on the parent process.
I think we should just do what we are doing for the rest of these early initialization messages (which is, either send it on the command line or through some async message from parent to child after it creates a new child.)  To be honest I don't think the gross-ness of the command line arguments needs to be fixed before we can resolve this bug.
I think it's worth having a maintainable solution. Another thing we can do, if we don't want to investigate using the initial pipe, is send the sync message to the background thread instead. gfxVars almost never changes after startup and if it does we can just throw a mutex around it.
(In reply to David Anderson [:dvander] from comment #12)
> Another thing we can do,
> if we don't want to investigate using the initial pipe, is send the sync
> message to the background thread instead. gfxVars almost never changes after
> startup and if it does we can just throw a mutex around it.

Can we do this? :)
Flags: needinfo?(dvander)
Whiteboard: gfx-noted → [qf:p1][gfx-noted]
Assignee: nobody → gsquelart
Flags: needinfo?(dvander)
Comment on attachment 8851881 [details]
Bug 1337062 - Transfer initial gfxVars over command line -

https://reviewboard.mozilla.org/r/124108/#review126834

::: dom/ipc/ContentParent.cpp:2036
(Diff revision 1)
> +    if (gfxVars.IsEmpty()) {
> +      // Cannot pass an empty string.
> +      extraArgs.push_back("0");
> +    } else {
> +      extraArgs.push_back(gfxVars.get());
> +    }

Let's check the combined length of all of the extraArgs against ARG_MAX (or 8k on Windows) and only append -gfxVars if we're within that limit.
A note - we're already way past 1024; we currently send value of media.wmf.disable-d3d11-for-dlls and media.wmf.disable-d3d9-for-dlls preferences across, and those two alone are more than 2k.

Separately from that topic; is there an approach where we can avoid sprintf/parse steps?  It would probably mean not using nsCString though...
Comment on attachment 8851881 [details]
Bug 1337062 - Transfer initial gfxVars over command line -

https://reviewboard.mozilla.org/r/124108/#review126834

> Let's check the combined length of all of the extraArgs against ARG_MAX (or 8k on Windows) and only append -gfxVars if we're within that limit.

I assumed that we only had four up-to-1KB strings and a few other short strings, so we should always be under 8K.
Could ARG_MAX we lower than 4KB?

But it would be easy for me to add this check anyway, so I'll get it ready...
(In reply to Milan Sreckovic [:milan] from comment #16)
> A note - we're already way past 1024; we currently send value of
> media.wmf.disable-d3d11-for-dlls and media.wmf.disable-d3d9-for-dlls
> preferences across, and those two alone are more than 2k.

We only send these if they are different from their default values, so in most cases we're only sending a dozen characters.

As per the commit description (see hidden comment 14), I believe that the small number of people who would want to modify it would most likely just clear it or overwrite it with a small string.


> Separately from that topic; is there an approach where we can avoid
> sprintf/parse steps?  It would probably mean not using nsCString though...

I was following the existing style, it was the easiest&quickest way for me to get this work done & landed. -- I'm new in this area.

But anything's possible!
I think it would be great if we could just use the same serialization code as used in IPCs.
Alternatively, we could send blobs of binary data, to at least avoid printing&parsing numbers.
If we're worried about blowing the command line limit budget, can we stick the initial preference values in environment variables instead of putting them in the command line?
Thanks for the reminder Chris, you and I did discuss this option with dmajor recently.

Unfortunately, a bit of research led me to [1], showing that Windows has a limit of 32KB for *all* env-vars, so there is still a limit, and in the worst case, setting our own big env-var could block other important ones!
Linux seems looser, but still has limits [2].

So we may be able to buy ourselves some extra space, but it comes at some risk, and would not solve the problem for all possible use cases anyway.
I think a better solution would require some IPC work -- but that was not successful so far, see comment 8.

At the moment I would echo Ehsan's comment 11, and just go with the command line option, which should work fine for the vast majority of users. And later on we can switch to the Next Gen solution when it's available...


[1] https://blogs.msdn.microsoft.com/oldnewthing/20100203-00/?p=15083
[2] http://man7.org/linux/man-pages/man2/execve.2.html
I really would rather not tie ourselves to a solution that requires a fixed size. gfxVars is seeing increasing use and I don't want to put the burden of coming up with another solution on the person who needs to add something and happens to go over the limit.

I find it hard to believe that we'll be able to start up the content process without doing a single sync IPC call. Surely it's better to just group all the needed sync IPC into a single call.
(In reply to Jeff Muizelaar [:jrmuizel] from comment #22)
> I really would rather not tie ourselves to a solution that requires a fixed
> size. gfxVars is seeing increasing use and I don't want to put the burden of
> coming up with another solution on the person who needs to add something and
> happens to go over the limit.

My understanding is that the failure mode if we go over the limit is that we end up with a sync message, so we'd be no worse off than we are now.

> I find it hard to believe that we'll be able to start up the content process
> without doing a single sync IPC call. Surely it's better to just group all
> the needed sync IPC into a single call.

My simplistic model for how this works is that the content process does a bunch of startup work and then can start loading. Meanwhile, the parent process is busy animating the tab strip and then becomes relatively free once that's done. So it's not actually that relevant *how many* sync messages we send. It's more about how early or late we send the first sync message. The longer we can delay it, the more work can happen in parallel. So coalescing messages doesn't really help much if we're not actually moving anything later.

I looked through the remaining sync messages at startup. I'm copying from https://bugzilla.mozilla.org/show_bug.cgi?id=1301392#c24 and excluding the messages that have already been eliminated (as well as any non-PContent messages, under the assumption that the compositor is less likely to be janked).

GetGfxVars (158ms into startup)
GetGraphicsDeviceInitData (174ms into startup)
SyncMessage (447ms into startup)
ClipboardHasType (491ms)
NotifyIMEFocus (508ms)

So the two graphics messages seem to be the most problematic. I'm strongly in favor of fixing this via command line arguments. If the size issue becomes a problem, we know how to fix it later.
(In reply to Bill McCloskey (:billm) from comment #23)
> 
> So the two graphics messages seem to be the most problematic. I'm strongly
> in favor of fixing this via command line arguments. If the size issue
> becomes a problem, we know how to fix it later.

I guess it's better than nothing, but I suspect we won't be able to put off fixing this properly forever.
Another option is to send the information in an Async message and if that doesn't come before we first need these values to block at that point in the child process start up. But, I agree that the commandline is the best of the options I know of though.
Comment on attachment 8851881 [details]
Bug 1337062 - Transfer initial gfxVars over command line -

https://reviewboard.mozilla.org/r/124108/#review127362
Attachment #8851881 - Flags: review?(blassey.bugs) → review-
Brad, you seem fine with the command line solution (as least for now), and I addressed your comment 15.
But you've just given me r- with no explanation.
Please let me know if that was just a mistake, otherwise I'd like to know what's expected, thanks.
Flags: needinfo?(blassey.bugs)
I'm expecting another patch that uses ARG_MAX
Flags: needinfo?(blassey.bugs)
(In reply to Brad Lassey [:blassey] (use needinfo?) from comment #28)
> I'm expecting another patch that uses ARG_MAX

Brad, I've had trouble trying to use ARG_MAX, e.g.:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=01a58a1236bbe8cf8e657b93600e59ffb9023726

Anyway, when I asked for help from Mike (:glandium), he suggested using `sysconf(_SC_ARG_MAX)` instead, which is what I actually did in the patch you rejected.
Did you miss it, or are there some reasons you do prefer `ARG_MAX`?


Also, Mike thinks this is "essentially useless" because ARG_MAX is also a limit for the environment, not just the command line.
However, I think it's still a little bit useful, to catch obvious abuses in case some gfxVars are *really* too big.
Flags: needinfo?(blassey.bugs)
(In reply to Gerald Squelart [:gerald] from comment #29)
> (In reply to Brad Lassey [:blassey] (use needinfo?) from comment #28)
> > I'm expecting another patch that uses ARG_MAX
> 
> Brad, I've had trouble trying to use ARG_MAX, e.g.:
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=01a58a1236bbe8cf8e657b93600e59ffb9023726
> 
> Anyway, when I asked for help from Mike (:glandium), he suggested using
> `sysconf(_SC_ARG_MAX)` instead, which is what I actually did in the patch
> you rejected.
> Did you miss it, or are there some reasons you do prefer `ARG_MAX`?
Yes, I didn't see the new patch at all, I'm still not used to mozreviw
Flags: needinfo?(blassey.bugs)
Comment on attachment 8851881 [details]
Bug 1337062 - Transfer initial gfxVars over command line -

https://reviewboard.mozilla.org/r/124108/#review129444

::: dom/ipc/ContentParent.cpp:2029
(Diff revision 2)
>    extraArgs.push_back("-stringPrefs");
>    extraArgs.push_back(stringPrefs.get());
>  
> +  char gfxBuf[1024];
> +  nsFixedCString gfxVarsUpdates(gfxBuf, 1024, 0);
> +  if (gfxVars::SerializeNonDefaultVarUpdates(gfxVarsUpdates, 1023)) {

My original thought was using ARG_MAX (or SG_ARG_MAX here) would allow for us to safely pass a larger gfx arg string since Milan said that non default args would frequently be above the 1024 limit. Did glandium suggest that would not be safe?
Attachment #8851881 - Flags: review- → review?(blassey.bugs)
Comment on attachment 8851881 [details]
Bug 1337062 - Transfer initial gfxVars over command line -

https://reviewboard.mozilla.org/r/124108/#review129444

> My original thought was using ARG_MAX (or SG_ARG_MAX here) would allow for us to safely pass a larger gfx arg string since Milan said that non default args would frequently be above the 1024 limit. Did glandium suggest that would not be safe?

glandium pointed at the fact that ARG_MAX/_SC_ARG_MAX is the total environment limit, including command line and environment variables.
So if we used it in this test (to pass gfxVarsUpdates or not), we may still have problems in the actual subprocess launch, because other environment variables could be set as well (see GeckoChildProcessHost::PerformAsyncLaunchInternal [1]), and all of them together could go past the limit.
This is why I kept the initial 1KB limit for gfxVarsUpdates; but I'm still making a rough ARG_MAX check to hopefully catch a fraction of potential issues -- But with no absolute guarantee.

I think removing the 1KB check and relying only on ARG_MAX would indeed be unwise, unless we can know for sure how much space the environment occupies (apart from the command line). I've punted at 1KB in the test at line 2041.

Regarding Milan's note, I responded in comment 18, saying that a vast majority of users would never modify the D3D block-lists, and therefore these massive strings will not be passed through the command line. The remainder of gfxVarsUpdates are numbers or bools, so in my tests, I've only seen a handful of characters in the command line serialization.

For the handful of users who may modify these D3D block-lists, they would either overwrite it completely and therefore it should be small enough; Or they may modify it, in which case the 1KB-limit will probably kick in, and these users will just go back to the old sync PContent::Msg_GetGfxVars, i.e., only a bit worse than before because of the extra work done in this patch here.

In the future, if/when we implement a better way to pass initial data, most of the code in this patch will still be useful, we'll just have to remove the arbitrary 1024-character string limits and the ARG_MAX check.

So I personally think it's worth pushing it, as it's an improvement for almost everybody, and should be reusable when we change tactics later on.

[1] https://dxr.mozilla.org/mozilla-central/rev/b043233ec04f06768d59dcdfb9e928142280f3cc/ipc/glue/GeckoChildProcessHost.cpp#711
(In reply to Gerald Squelart [:gerald] from comment #32)
> ...
> 
> So I personally think it's worth pushing it, as it's an improvement for
> almost everybody, and should be reusable when we change tactics later on.
> 

Sounds reasonable to me.  It may be useful, probably in a separate bug, to put in telemetry for "the length was too much" vs. "the length was fine", or even better "length was <1k", "length was <2k", length was <5k", "length was long" type of a histogram for it...
Comment on attachment 8851881 [details]
Bug 1337062 - Transfer initial gfxVars over command line -

https://reviewboard.mozilla.org/r/124108/#review129968

::: dom/ipc/ContentParent.cpp:2029
(Diff revision 2)
>    extraArgs.push_back("-stringPrefs");
>    extraArgs.push_back(stringPrefs.get());
>  
> +  char gfxBuf[1024];
> +  nsFixedCString gfxVarsUpdates(gfxBuf, 1024, 0);
> +  if (gfxVars::SerializeNonDefaultVarUpdates(gfxVarsUpdates, 1023)) {

It looks like we pass an empty environment to the child process [http://searchfox.org/mozilla-central/source/ipc/chromium/src/base/process_util_linux.cc#143], though it would be worth confirming that if we're going to rely on it.

We do add some other command line args though in PerfromAsyncLaunchInternal(), though we can probably set an upper limit for that.

It's good that we won't normally need more than 1024 for the gfx args, but I'm interested in putting in a check that would be at least somewhat effective at keeping us from failing to launch subprocesses.
Attachment #8851881 - Flags: review?(blassey.bugs) → review+
Comment on attachment 8851881 [details]
Bug 1337062 - Transfer initial gfxVars over command line -

https://reviewboard.mozilla.org/r/124108/#review130016
Comment on attachment 8851881 [details]
Bug 1337062 - Transfer initial gfxVars over command line -

https://reviewboard.mozilla.org/r/124108/#review129968

> It looks like we pass an empty environment to the child process [http://searchfox.org/mozilla-central/source/ipc/chromium/src/base/process_util_linux.cc#143], though it would be worth confirming that if we're going to rely on it.
> 
> We do add some other command line args though in PerfromAsyncLaunchInternal(), though we can probably set an upper limit for that.
> 
> It's good that we won't normally need more than 1024 for the gfx args, but I'm interested in putting in a check that would be at least somewhat effective at keeping us from failing to launch subprocesses.

Thank you Brad.
I think the main way we're preventing launch failure is by forcefully limiting what we send. Currently that's: Few options, and at most 4*1KB big strings. And now, just in case, we have an extra check that makes sure adding the gfxVarsUpdates won't leave less than 1KB to the environment.
Pushed by gsquelart@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/4f7947062b4f
Transfer initial gfxVars over command line - r=blassey
Of course, after all this, Sotaro just landed bug 1323316, which broke my hand-crafted de/serialization! :-(

Back to the drawing board.
(I'm thinking: Either a quick-fix static_assert to catch these discrepancies early, or we do need to work on the better IPC-friendly solution now.)

I think I'll work on bug 1337063 first, as that one can be more easily dealt with, without hand-crafted de/serialization code.
Two options that may be a bit more future proof. One is to send an Async message in InitInternal like we do with SendSetXPCOMProcessAttributes() and and block on the child side if the message doesn't arrive before its first needed.

The second option is to continue using the command line flags, but use the IPCMessage serialization and deserialization to get the argument.
(In reply to Brad Lassey [:blassey] (use needinfo?) from comment #40)
> Two options that may be a bit more future proof. One is to send an Async
> message in InitInternal like we do with SendSetXPCOMProcessAttributes() and
> and block on the child side if the message doesn't arrive before its first
> needed.

I did a try run for bug 1337063 [1], with a MOZ_CRASH if content tries to use gfxPlatform before it receives this message. It didn't crash. I would bet we can stick gfxVars in this message as well.

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=6fc760f22732c26f2941be4398c560ca2f2d71ca
Thank you David.
Now trying to pass gfxVarUpdates through SendSetXPCOMProcessAttributes(), so far so good (details in bug 1337063 comment 12).
Comment on attachment 8856818 [details]
Bug 1337062 - Transfer initial gfxVars with SendSetXPCOMProcessAttributes -

Retracting review, as I forgot to register to get later gfxVars updates. Will work on it tomorrow...
Attachment #8856818 - Flags: review?(dvander)
Comment on attachment 8856818 [details]
Bug 1337062 - Transfer initial gfxVars with SendSetXPCOMProcessAttributes -

https://reviewboard.mozilla.org/r/128740/#review131702

::: dom/ipc/ContentParent.cpp:2228
(Diff revision 2)
>    }
>  
> +  nsTArray<GfxVarUpdate> varUpdates = gfxVars::FetchNonDefaultVars();
> +  for (auto& update : varUpdates) {
> +    xpcomInit.gfxNonDefaultVarUpdates().AppendElement(Move(update));
> +  }

Does it work to xpcomInit.gfxNonDefaultVarUpdates() = Move(varUpdates) ?
Attachment #8856818 - Flags: review?(dvander) → review+
Comment on attachment 8856818 [details]
Bug 1337062 - Transfer initial gfxVars with SendSetXPCOMProcessAttributes -

https://reviewboard.mozilla.org/r/128740/#review131702

> Does it work to xpcomInit.gfxNonDefaultVarUpdates() = Move(varUpdates) ?

Good idea, thanks!
I can even do `xpcomInit.gfxNonDefaultVarUpdates() = gfxVars::FetchNonDefaultVars();` and let C++ choose the best way to move things. ;-)
Pushed by gsquelart@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/e8efc3c5d61f
Transfer initial gfxVars with SendSetXPCOMProcessAttributes - r=dvander
gfxVars::Initialize not called in time?
Flags: needinfo?(gsquelart)
My gut feel is a rogue gfxVarUpdate getting through before the big xpcom-init.
But I've only just restarted the old Windows machine, I'll try and debug that soon...
Priority: P3 → --
After debugging, I think I was correct: Right after the gfxVarUpdate registration, BuildContentDeviceData was modifying a gfxVar, causing an gfxVarUdpate message to be sent before the XPCOMInit one! So on the content side, ApplyUpdate was called before gfxVars were initialized.
Swapping the two does the trick.

Waiting for try results to be sure, before pushing the update:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=8cdfd1fa30691cb7a4470b92b478293266c2f243
Flags: needinfo?(gsquelart)
Depends on: 1356138
Pushed by gsquelart@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/2d7ef036a01f
Transfer initial gfxVars with SendSetXPCOMProcessAttributes - r=dvander
https://hg.mozilla.org/mozilla-central/rev/2d7ef036a01f
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Backed out in https://hg.mozilla.org/mozilla-central/rev/f77f3057b8ee

Not only did it make https://treeherder.mozilla.org/logviewer.html#?job_id=91516724&repo=autoland nearly permaorange, but that shutdown crash is also showing up in lots of other otherwise-normal-seeming failures, like https://treeherder.mozilla.org/logviewer.html#?job_id=91511655&repo=mozilla-central
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: mozilla55 → ---
ipc_channel_error=PNecko::Msg_NotifyCurrentTopLevelOuterContentWindowId Route error: message sent to unknown actor ID

I have no idea how this patch could have caused that...
After some testing, I've found that moving gfxVars::Initialize *after* ContentChild::InitXPCOM seems to make things work fine again:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b7a8d1e9e64cc3ee9d81accb35bad67581a62b21

I initially thought gfxVars::Initialize would just set some variables, so it made sense to me to do it first thing.

So either initializing gfxVars first disturbs InitXPCOM, or InitXPCOM somehow modifies gfxVars?
Both seem unlikely, but obviously something does go wrong, and I don't know enough about XPCOM and I haven't dived too deep...
David, what do you think?

In the meantime I'll launch a more comprehensive try with this.
Flags: needinfo?(dvander)
(In reply to Gerald Squelart [:gerald] from comment #59)
> After some testing, I've found that moving gfxVars::Initialize *after*
> ContentChild::InitXPCOM seems to make things work fine again:
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=b7a8d1e9e64cc3ee9d81accb35bad67581a62b21
> 
> I initially thought gfxVars::Initialize would just set some variables, so it
> made sense to me to do it first thing.
> 
> So either initializing gfxVars first disturbs InitXPCOM, or InitXPCOM
> somehow modifies gfxVars?
> Both seem unlikely, but obviously something does go wrong, and I don't know
> enough about XPCOM and I haven't dived too deep...
> David, what do you think?
> 
> In the meantime I'll launch a more comprehensive try with this.

InitXPCOM will trigger gfxPlatform/layout stuff, which will read from gfxVars. So it's probably best to do it first, unless we can defer it to first use or something. (Or assert that it is not read during InitXPCOM.) The backout reason looks strikingly similar to the problems blassey saw when doing messages through the initial pipe.

Have you tried to reproduce the backout problem locally with any success? I will try to do so myself... something seems really odd here.
Flags: needinfo?(dvander)
I think I managed to reproduce the crash locally, but it took like 90 minutes of mochitest running in debug mode. I'm trying again with the crashreporter off since mach seems to delete all the crash reports.
I lied, apparently, I'm having trouble reproducing this locally. Tomorrow I'll do some try runs with more logging.

Gerald, one thing you could try is moving the initialization a little later and using the same trick you did for DeviceInitData: store the initial vars somewhere global, then read them on Initialize() and unset the global. Then just let the Initialize call here [1] happen, and don't worry about initializing in ContentChild.

[1] http://searchfox.org/mozilla-central/source/gfx/thebes/gfxPlatform.cpp#622
Comment on attachment 8856818 [details]
Bug 1337062 - Transfer initial gfxVars with SendSetXPCOMProcessAttributes -

I haven't been able to reproduce it locally.

David, I've updated the patch as per your comment 63 (store the initial updates, and use them lazily when gfxVars are first initialized).
Try:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=05e25fb6e5464b435cc626e9d3984ca70473d6d7
(Looking fine to me, compared to others where 'bc' suites had obvious crashes, e.g.: https://treeherder.mozilla.org/#/jobs?repo=try&revision=b7a8d1e9e64cc3ee9d81accb35bad67581a62b21 )

Would you mind giving it a quick last review, before I push?
Attachment #8856818 - Flags: review+ → review?(dvander)
Comment on attachment 8856818 [details]
Bug 1337062 - Transfer initial gfxVars with SendSetXPCOMProcessAttributes -

https://reviewboard.mozilla.org/r/128740/#review137066

Looks good, crossing my fingers that this sticks...
Attachment #8856818 - Flags: review?(dvander) → review+
Pushed by gsquelart@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/1ad211b75475
Transfer initial gfxVars with SendSetXPCOMProcessAttributes - r=dvander
https://hg.mozilla.org/mozilla-central/rev/1ad211b75475
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
That was considerably better, plenty good enough to get by professional sheriffs, though not amateur sheriffs.

Backed out in https://hg.mozilla.org/mozilla-central/rev/413f9d65fe4b for still hitting the timeout-and-then-shutdown-crash thing, enough that I now know where it's frequent enough you can hit it on try. https://wiki.mozilla.org/ReleaseEngineering/TryChooser#What_if_I_want_PGO_for_my_build to get PGO and "try: -b o -p win64 -u mochitest-bc[Windows 8],mochitest-dt[Windows 8] -t none --rebuild 10" should give you two or three failures like https://treeherder.mozilla.org/logviewer.html#?job_id=94717791&repo=autoland
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: mozilla55 → ---
Oops, I always do that in my try pushes, too. Make that "try: -b o -p win64 -u mochitest-e10s-bc[Windows 8],mochitest-e10s-devtools-chrome[Windows 8] -t none --rebuild 10" since it's e10s tests you want.
Sweet, this is not our fault! bug 1360549 comment #7 for details. When that's fixed this should be good to reland, aside from talos investigation I guess.
Thanks David.

Bug 1360549 could use our patch from this bug here, and Phil's Try suggestion from comment 69, to verify that they have really fixed the issue; and at the same time it would prove (or at least give more confidence) that our patch is good to land.

Though we should probably try to investigate the Talos issue, if still present, before we actually land something?
Another interesting data point:
On a hunch, I first backed-out bug 1337063 (sibling bug that sends DeviceContentData during the content process launch, and which seems to be causing Talos regressions) and then rebased this bug's patch (for gfxVars), and the Try is as green as can be!
https://treeherder.mozilla.org/#/jobs?repo=try&revision=05a96d78299c1ff9c554d13f844d126f00e1b351

So it looks more like the interaction between these two is causing trouble here.
David, any guesses/insight?

(Note that this Try ran before bug 1356468 got fixed. I shall try all combinations of bug 1337062 and bug 1337063 again...)
Flags: needinfo?(dvander)
(In reply to Gerald Squelart [:gerald] from comment #75)
> Another interesting data point:
> On a hunch, I first backed-out bug 1337063 (sibling bug that sends
> DeviceContentData during the content process launch, and which seems to be
> causing Talos regressions) and then rebased this bug's patch (for gfxVars),
> and the Try is as green as can be!
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=05a96d78299c1ff9c554d13f844d126f00e1b351
> 
> So it looks more like the interaction between these two is causing trouble
> here.
> David, any guesses/insight?
> 
> (Note that this Try ran before bug 1356468 got fixed. I shall try all
> combinations of bug 1337062 and bug 1337063 again...)

Because by re-introducing the sync message in bug 1337063, you effectively mask the timing issue in bug 1360549. The problem is that you resolved all the sync issues that were delaying the broken high-priority message, such that it arrived at the right time.
Flags: needinfo?(dvander)
Err, "right time" as in "right time to cause a crash".
Thanks David.
And I meant "this Try ran while bug 1360549 was not fixed" in my comment 75.

So my Try helps to confirms your theory that bug 1360549 relies on at least one of the sync messages in this bug 1337062 or bug 1337063. :-)

I guess now we'll just have to wait for bug 1360549 to be fixed before relanding this one.
But in the meantime I'll keep investigating the Talos issues...
I'll try to come back to this bug, but for now, if someone else with a better clue wants to give it a try, please go ahead!
Assignee: gsquelart → nobody
Comment on attachment 8872693 [details]
Bug 1337062: Transfer initial gfxVars with SendXPCOMProcessAttributes. Deal with potential (future) race condition.

https://reviewboard.mozilla.org/r/144228/#review148058
Attachment #8872693 - Flags: review?(dvander) → review+
Thank you for looking into it, Milan. Good luck!
Assignee: nobody → milan
Attachment #8872693 - Attachment is obsolete: true
Attachment #8856818 - Attachment is obsolete: false
Comment on attachment 8872693 [details]
Bug 1337062: Transfer initial gfxVars with SendXPCOMProcessAttributes. Deal with potential (future) race condition.

Gerald, would you mind taking a look as I made some changes from your original patch.  In particular, I removed the "things must arrive in the correct order" condition.
Attachment #8872693 - Flags: review+ → review?(gsquelart)
Comment on attachment 8872693 [details]
Bug 1337062: Transfer initial gfxVars with SendXPCOMProcessAttributes. Deal with potential (future) race condition.

https://reviewboard.mozilla.org/r/144228/#review157510

Thank you for taking this over the line.

::: gfx/config/gfxVars.cpp:32
(Diff revision 2)
> +  if (sInstance) {
> +    // Apply the updates, the object has been created already
> +    for (const auto& varUpdate : aInitUpdates) {
> +      ApplyUpdate(varUpdate);
> +    }
> +     

Remove empty line.
Attachment #8872693 - Flags: review?(gsquelart) → review+
Pushed by msreckovic@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/7b6b179caf46
Transfer initial gfxVars with SendXPCOMProcessAttributes. Deal with potential (future) race condition. r=dvander,gerald
https://hg.mozilla.org/mozilla-central/rev/7b6b179caf46
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla56
Performance Impact: --- → P1
Whiteboard: [qf:p1][gfx-noted] → [gfx-noted]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: