all talos on e10s fails to run since Feb 10th.

RESOLVED FIXED

Status

RESOLVED FIXED
4 years ago
4 years ago

People

(Reporter: jmaher, Assigned: jmaher)

Tracking

Trunk
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(e10s+)

Details

Attachments

(1 attachment, 1 obsolete attachment)

(Assignee)

Description

4 years ago
with this changeset we fail all linux/windows talos on e10s:
https://hg.mozilla.org/mozilla-central/pushloghtml?changeset=3d3f1b07ef0f

here is the treeherder view:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=3d3f1b07ef0f (note, the jobs are hidden by default)

I assume we can backout a bunch of these patches and figure out the root cause of the failure.
tracking-e10s: --- → ?
(Assignee)

Comment 1

4 years ago
odd, I ran locally on a build from inbound (build today) and it works as expected.  This could be pgo related or something on linux64 which works
tracking-e10s: ? → +

Updated

4 years ago
Flags: needinfo?(jmathies)
Looking into getting a more precise regression window here.
(Assignee)

Comment 3

4 years ago
in trying to debug this, I put some dump statements in the pageloader code and on a build from feb 9th, I can run e10s mode and see my dump statements from inside the contentLoadHandler function:
http://hg.mozilla.org/build/talos/file/tip/talos/pageloader/chrome/pageloader.js#l188

using a recent build, I don't see my dump statement printed out, although I do see a dump statement printed out indicating I added an eventListener for 'load'.  So the dump statements work, but the load event is failing to either be sent or call the function.
Using mozregression I bisected builds until I got this pushlog:

https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=a793dd926e90&tochange=ed3e996b3805

To reproduce, run talos with e10s enabled against your firefox executable:

talos -a tsvgx --e10s --develop -e /path/to/firefox

If it only loads one page from that suite and then dies, you've got a build with the bug. From the pushlog, it appears that it was changes introduced in bug 1047603 that caused this problem.
Depends on: 1047603

Updated

4 years ago
Flags: needinfo?(jmathies)
So my theory here is that the pageloader extension that talos relies on needs to be updated to deal with an assumption that bug 1047603 changed.

Bug 1047603 changes the following assumption:

For newly opened e10s windows, the initial browser is remote.

Bug 1047603 makes this no longer true. The initial browser is only ever made remote once it starts to load a page that it deems should become remote.

Marionette had to deal with this problem as well, and there is some platform code that does require us to make that initial browser remote ASAP to work properly, so there's a function "forceInitialBrowserRemote" that is exposed that perhaps pageloader could use.

I'm not 100% familiar with how pageloader starts up, but I assume it opens a browser window, and starts interacting with it. What I suggest is that the first thing that pageloader does as soon as the browser window opens, is to force the initial tab to become remote by calling that function.

For example, here's what we did for marionette: https://reviewboard.mozilla.org/r/3459/diff/#0

jmaher - do you think this solution could be adapted for pageloader?
Flags: needinfo?(jmaher)
(Assignee)

Comment 6

4 years ago
Created attachment 8566702 [details] [diff] [review]
force new window to be remote (1.0)
Assignee: nobody → jmaher
Status: NEW → ASSIGNED
Flags: needinfo?(jmaher)
Attachment #8566702 - Flags: review?(mconley)
Comment on attachment 8566702 [details] [diff] [review]
force new window to be remote (1.0)

Review of attachment 8566702 [details] [diff] [review]:
-----------------------------------------------------------------

::: talos/pageloader/chrome/pageloader.js
@@ +175,5 @@
>        window.resizeTo(10,10);
>  
>        var browserLoadFunc = function (ev) {
>          browserWindow.removeEventListener('load', browserLoadFunc, true);
> +        browserWindow.XULBrowserWindow.forceInitialBrowserRemote();

This should only ever occur if gMultiProcessBrowser is true, so we should wrap that in a conditional. Otherwise, I suspect this will break the non-e10s case.

We should also add some documentation to explain what the hell is going on. Something like:

"For e10s windows, the initial browser is not remote until it attempts to browse to some URI that should be remote. We bypass this restriction by forcing the initial browser to be remote before it attempts to go anywhere."
Attachment #8566702 - Flags: review?(mconley) → review-
(Assignee)

Comment 8

4 years ago
Created attachment 8566713 [details] [diff] [review]
force new window to be remote (1.1)

updated to check if we are in a remote setting and added a comment.
Attachment #8566702 - Attachment is obsolete: true
(Assignee)

Updated

4 years ago
Blocks: 1134824
(Assignee)

Updated

4 years ago
Status: ASSIGNED → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
So I've been speaking with avih and jmaher in #perf.

I suspect that this patch broke TART for e10s. TART has a chrome:// page for managing the tests, and I believe it is into the content of this page that the add-on injects a function for dumping out the test results. When run with bug 1134238, it sounds like TART is never calling that function.

I believe this is because Talos is forcing the initial browser to be remote, then TART injects the function into the content of the initial browser, and then the chrome URI loads. The chrome URI is not one that we would normally load in the content process, so the browser is updated to be non-remote, which invalidates anything set on its "content". I suspect that this is why the function is not being called. (Note that this is all speculation on my part, and I haven't actually tested this to see if this is the case).

Reverting the patch seems to allow TART to run and complete, though I do not recommend backing this patch out, as this would break Talos entirely. I think it's prudent to find a  workaround instead for the TART test.

I suggest waiting until the chrome:// page loads before injecting any functions into the content.

Updated

4 years ago
Depends on: 1138749
You need to log in before you can comment on or make changes to this bug.