Make talos work with e10s (2014)

RESOLVED FIXED

Status

Testing
Talos
RESOLVED FIXED
3 years ago
2 years ago

People

(Reporter: avih, Unassigned)

Tracking

(Blocks: 1 bug)

Trunk
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(e10s+)

Details

Attachments

(5 attachments, 7 obsolete attachments)

15.80 KB, text/plain
Details
9.32 KB, patch
jmaher
: review+
Details | Diff | Splinter Review
1.25 KB, patch
jmaher
: review+
Details | Diff | Splinter Review
9.53 KB, patch
Details | Diff | Splinter Review
1.14 KB, patch
armenzg
: review+
Details | Diff | Splinter Review
(Reporter)

Description

3 years ago
A lot of work was done at bug 578343 (2010) to make talos run with e10s, and then some more work at bug 689518 (Nov 2011). Bug 941459 (special powers) was apparently done for Android but Joel says talos doesn't use special powers.

Joel thinks that the last time talos worked with e10s was 2011.

Yesterday Joel pushed a e10s talos run and it seems none of the talos run groups complete (all appear as red - some crashes, some maybe no results to parse):  https://tbpl.mozilla.org/?tree=Try&rev=33497d62b45d

We need to analyze the run logs to determined what/when failed, then coem up with an approach to fix them.
Blocks: 984139
tracking-e10s: --- → +
OS: Windows 8.1 → All
Hardware: x86_64 → All
I had run this on try server and everything failed, running it locally (https://wiki.mozilla.org/Buildbot/Talos/Running#Running_locally_-_Source_Code), I edited the PerfConfigurator to add a global preference:
http://hg.mozilla.org/build/talos/file/dea55d0aff4d/talos/PerfConfigurator.py#l321

upon running it fails on its warmup run, here we call:
http://hg.mozilla.org/build/talos/file/dea55d0aff4d/talos/getInfo.html

this requires priviledged access, mostly for quitting:
http://hg.mozilla.org/build/talos/file/dea55d0aff4d/talos/page_load_test/quit.js
oh, my bad here- I had an old build I was testing against.

I got the talos warmup to load, but running -a tsvgx, I get stuck in the pageloader:
http://hg.mozilla.org/build/talos/file/tip/talos/pageloader/chrome/pageloader.js#l226
(Reporter)

Comment 3

3 years ago
I ran it locally after setting the pref on PerfConfigurator.py (windows 7, latest nightly 2014-08-21).

For me it crashes _after_ it runs through getinfo.html (i.e. i think once it opens the browser instance which should actually run the test).

However, I added some debug prints at getinfo.html, and found out that it can't get the browser ID properly. On my system it goes through this line ( http://hg.mozilla.org/build/talos/file/dea55d0aff4d/talos/getInfo.html#l28 ):

> dumpLog('exception getting privileged access, defaulting to XUL_FENNEC');

Which means it probably throws an exception on one of these two lines:

> netscape.security.PrivilegeManager.enablePrivilege('UniversalPreferencesRead UniversalPreferencesWrite UniversalXPConnect');
> var appInfo = Components.classes['@mozilla.org/xre/app-info;1']
>                         .getService(Components.interfaces.nsIXULAppInfo);


So if it can't get the proper privileges on getinfo, I'm guessing the same issue could also affect the test instance.

Bill, could you please try to run it locally, reproduce the issue, and try to fix it? AFAIK, it _used_ to work, but apparently not anymore.
Flags: needinfo?(wmccloskey)
Created attachment 8477113 [details] [diff] [review]
talos-patch

I guess I made a little progress here. appInfo.ID is expected to return an error in content processes. We could fix this, but it's probably not a big deal. Rather than mess around with it, I just hard coded Firefox.

The real problem seems to be that we're crashing. I tracked this down to bug 946407. If someone notifies the child-memory-reporter-request observer and passes a null data value, we crash. I'm needinfo? jld to see what we can do about that. Jed, here is the code where Talos is notifying:
http://hg.mozilla.org/build/talos/file/dea55d0aff4d/talos/pageloader/chrome/memory.js#l42
To work around it, I just made up a fake string to pass in.

At that point, I was able to start running dromaeo. However, it seems to hang after the first test. I don't really know why. Avi, could you maybe look at that? If you point me at the place where it's going wrong, I can try to help. I'm attaching the patch that gets us this far.
Flags: needinfo?(wmccloskey)
Flags: needinfo?(jld)
Flags: needinfo?(avihpit)
Before I changed it, the message value was an integer cast to char16_t*, so probably null mapped to 0 and  "generation=0 minimize=0 DMDident=" would be the updated value.  But I don't understand why it's using the observer service directly instead of the memory reporter manager.
Flags: needinfo?(jld)
thanks for this patch, I got ts paint to run locally, but tsvgx didn't run, it got stuck here:
http://hg.mozilla.org/build/talos/file/tip/talos/pageloader/chrome/pageloader.js#l226

I assume we are getting much closer.  Bill, when you say the first test, did a page load and you see it do stuff?  Talos launches a getinfo.html page which gets some information about the build/system and then closes down.  This is the 'warmup run' which initializes the profile as well.  Is that what passed for you when you say the first test passed?
As an additional data point, I reproduced the same behaviour as :jmaher on linux64. Here's the error that I'm seeing on the console if I try to run tsvg (talos --develop -e ~/tmp/firefox/firefox -a tsvg --datazilla-url output.json --cycles 3): 

System JS : ERROR chrome://pageloader/content/pageloader.js:226 - Error: unable to modify interposed property content

Running the same test with "ts" works fine, but then "ts" doesn't use the pageloader.
OK, I see what the problem is. It's a bug in our addon shims (the pageloader is running as an addon). You should be able to fix it by adding the multiprocessCompatible flag to pageloader's install.rdf file:

https://developer.mozilla.org/en-US/Add-ons/Install_Manifests#multiprocessCompatible
(In reply to Bill McCloskey (:billm) from comment #8)
> OK, I see what the problem is. It's a bug in our addon shims (the pageloader
> is running as an addon). You should be able to fix it by adding the
> multiprocessCompatible flag to pageloader's install.rdf file:
> 
> https://developer.mozilla.org/en-US/Add-ons/
> Install_Manifests#multiprocessCompatible

If I do this, the tsvg page seems to load (woohoo!), but then it hangs again.
(In reply to William Lachance (:wlach) from comment #9)
> (In reply to Bill McCloskey (:billm) from comment #8)
> > OK, I see what the problem is. It's a bug in our addon shims (the pageloader
> > is running as an addon). You should be able to fix it by adding the
> > multiprocessCompatible flag to pageloader's install.rdf file:
> > 
> > https://developer.mozilla.org/en-US/Add-ons/
> > Install_Manifests#multiprocessCompatible
> 
> If I do this, the tsvg page seems to load (woohoo!), but then it hangs again.

No apparent useful information, here are the last few messages on my console:

IPDL protocol error: [PContentChild] Received an invalid file descriptor!
RSS: Main: 134713344
(Reporter)

Comment 11

3 years ago
Nice progress. Looks like billm and wlach can fix this in no time ;)
Flags: needinfo?(avihpit)
No longer blocks: 984139
Blocks: 1058354
(In reply to William Lachance (:wlach) from comment #10)
> (In reply to William Lachance (:wlach) from comment #9)
> > (In reply to Bill McCloskey (:billm) from comment #8)
> > > OK, I see what the problem is. It's a bug in our addon shims (the pageloader
> > > is running as an addon). You should be able to fix it by adding the
> > > multiprocessCompatible flag to pageloader's install.rdf file:
> > > 
> > > https://developer.mozilla.org/en-US/Add-ons/
> > > Install_Manifests#multiprocessCompatible
> > 
> > If I do this, the tsvg page seems to load (woohoo!), but then it hangs again.
> 
> No apparent useful information, here are the last few messages on my console:
> 
> IPDL protocol error: [PContentChild] Received an invalid file descriptor!
> RSS: Main: 134713344

Decided to spend a bit more time narrowing this down. It looks like the old e10s pageloader code relied on a property called "remote" being set to activate various bits of e10s code (notably, use of message manager as opposed to setting events on the window). Example:

http://hg.mozilla.org/build/talos/file/49b74c08dad4/talos/pageloader/chrome/pageloader.js#l229

I guess this is probably obsolete?

As I am pretty sure non-e10s firefox supports messagemanager just fine, it might be a worthy experiment just to try unconditionally enabling that code on a non-e10s branch and see what happens, and then fix any remaining issues with e10s firefox. I'll give this a try later today or tomorrow if someone else doesn't beat me to it.
That's odd. The remote attribute on the browser element is not obsolete. It should be set to true in e10s. It definitely shouldn't be blank. Can you confirm that dumping |content| produces an nsXULElement rather than a window?
(In reply to Bill McCloskey (:billm) from comment #13)
> That's odd. The remote attribute on the browser element is not obsolete. It
> should be set to true in e10s. It definitely shouldn't be blank. Can you
> confirm that dumping |content| produces an nsXULElement rather than a window?

Yes, putting some debug logging in pageloader.js, around line 230 (when creating the browserLoadFunc object):

  dumpLine("content: " + content);
  dumpLine("content has remote: " + content.hasAttribute('remote'));

Output:

content: [object XULElement]
content has remote: false
OK, thanks. I'll have to take a look at this. There's some funky stuff we do when we initialize browser elements. The pageloader code might be running before the initialization, which could cause this sort of problem.
Hey Bill, any progress on this? I'm happy to help out with any Talos issues after we've figured out what's going on here.
Flags: needinfo?(wmccloskey)
Created attachment 8489676 [details] [diff] [review]
talos2

Here's a new patch to go on top of the first one. The problem is that "content" actually pointed to a <tabbrowser> element, not a <browser>. You can get to the <browser> by using .selectedBrowser.

Now I'm able to run through some SVG tests. However, at the end I get several Python errors. Could you look into it Will?
Flags: needinfo?(wmccloskey) → needinfo?(wlachance)
Looking into it!
Flags: needinfo?(wlachance)
Created attachment 8490373 [details]
talos output log

So I had a look at this today. The python error seems to be because we have duplicate information in the browser log. And it seems like *that's* happening because we're still sending "page loaded" messages after the browser should have quit (we restart the browser between pageloader cycles). This is very mysterious to me, as there is code in pageloader.js which should be removing these listeners, and it *is* being activated:

http://hg.mozilla.org/build/talos/file/5e18c3e56875/talos/pageloader/chrome/pageloader.js#l720

Any idea what could be happening? It's also quite strange that we're getting these events just as the browser is shutting down. I don't really see why that should be.
Flags: needinfo?(wmccloskey)
Created attachment 8490500 [details] [diff] [review]
talos3

Another patch. This seems to make it run considerably longer. Is it actually working now?
Flags: needinfo?(wmccloskey) → needinfo?(wlachance)
Yup that seems to work! At least for tsvg and tart, dromaeo_dom is hanging for some reason. Nonetheless, I think a consolidated patch would be worth landing as-is and maybe we can start turning on some tests while working on dromaeo as a followup.
Flags: needinfo?(wlachance)
(Reporter)

Comment 22

3 years ago
Just make sure the patch doesn't break non-e10s windows ;)
Created attachment 8490906 [details] [diff] [review]
Consolidated patch

This is a consolidated version of billm's patches and suggestions, with the following differences:

1. A comment on why we're hardcoding the id in getInfo.html
2. The pref enabling e10s is not set (if you want to enable it, use --extra-prefs)

It works with tsvg and tart. dromaeo_dom does not work for some reason.

I'll give this to jmaher to review, billm for feedback on anything we should bear in mind before landing.
Assignee: nobody → wlachance
Attachment #8477113 - Attachment is obsolete: true
Attachment #8489676 - Attachment is obsolete: true
Attachment #8490500 - Attachment is obsolete: true
Attachment #8490906 - Flags: review?(jmaher)
Attachment #8490906 - Flags: feedback?(wmccloskey)
(In reply to Avi Halachmi (:avih) from comment #22)
> Just make sure the patch doesn't break non-e10s windows ;)

dromaeo_dom seems to run ok if e10s is not enabled. We would of course want to stage a try run before landing/using this. :)
We also need to set the multiprocessCompatible flag to true in the pageloader install.rdf.
(In reply to Bill McCloskey (:billm) from comment #25)
> We also need to set the multiprocessCompatible flag to true in the
> pageloader install.rdf.

This is already added, forget to mention it.
Comment on attachment 8490906 [details] [diff] [review]
Consolidated patch

Review of attachment 8490906 [details] [diff] [review]:
-----------------------------------------------------------------

we don't use the specialpowers bits, so that won't get tested.  I am a bit concerned about a few things as outlined in the review.

::: talos/getInfo.html
@@ +23,4 @@
>          version = appInfo.version;
>          buildid = appInfo.appBuildID;
> +        // Hardcoding id to Firefox for e10s: https://bugzilla.mozilla.org/show_bug.cgi?id=1050706#c4
> +        id = '{99bceaaa-e3c6-48c1-b981-ef9b46b67d60}'; //appInfo.ID

I don't like this, how will we differentiate Fennec?

::: talos/pageloader/chrome/memory.js
@@ +39,4 @@
>      var memTimer = Components.classes["@mozilla.org/timer;1"].createInstance(Components.interfaces.nsITimer);
>      memTimer.initWithCallback(event, 10000, Components.interfaces.nsITimer.TYPE_ONE_SHOT);
>  
> +    os.notifyObservers(null, "child-memory-reporter-request", "generation=1 anonymize=0 minimize=0 DMDident=0");

how will this work with e10s turned off?

::: talos/pageloader/install.rdf
@@ +16,4 @@
>      <em:name>PageLoader extension</em:name>
>      <em:description>Cycles through pages and measures load times</em:description>
>      <em:creator>Vladimir Vukicevic</em:creator>
> +    <em:multiprocessCompatible>true</em:multiprocessCompatible>

will this have side effects when not run in e10s mode?
Attachment #8490906 - Flags: review?(jmaher) → review-
(In reply to Joel Maher (:jmaher) from comment #27)
> Comment on attachment 8490906 [details] [diff] [review]
> Consolidated patch
> 
> Review of attachment 8490906 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> we don't use the specialpowers bits, so that won't get tested.  I am a bit
> concerned about a few things as outlined in the review.
> 
> ::: talos/getInfo.html
> @@ +23,4 @@
> >          version = appInfo.version;
> >          buildid = appInfo.appBuildID;
> > +        // Hardcoding id to Firefox for e10s: https://bugzilla.mozilla.org/show_bug.cgi?id=1050706#c4
> > +        id = '{99bceaaa-e3c6-48c1-b981-ef9b46b67d60}'; //appInfo.ID
> 
> I don't like this, how will we differentiate Fennec?

I agree, we shouldn't do it this way. I filed bug 1068989 to use mozversion, which should allow us to delete this code altogether.

> ::: talos/pageloader/install.rdf
> @@ +16,4 @@
> >      <em:name>PageLoader extension</em:name>
> >      <em:description>Cycles through pages and measures load times</em:description>
> >      <em:creator>Vladimir Vukicevic</em:creator>
> > +    <em:multiprocessCompatible>true</em:multiprocessCompatible>
> 
> will this have side effects when not run in e10s mode?

No, it just says that the addon is multiprocess compatible. When I run talos in non-e10s mode it works fine.
Depends on: 1068989
Hey Bill, could you answer jmaher's question here?

(In reply to Joel Maher (:jmaher) from comment #27)

> ::: talos/pageloader/chrome/memory.js
> @@ +39,4 @@
> >      var memTimer = Components.classes["@mozilla.org/timer;1"].createInstance(Components.interfaces.nsITimer);
> >      memTimer.initWithCallback(event, 10000, Components.interfaces.nsITimer.TYPE_ONE_SHOT);
> >  
> > +    os.notifyObservers(null, "child-memory-reporter-request", "generation=1 anonymize=0 minimize=0 DMDident=0");
> 
> how will this work with e10s turned off?
Flags: needinfo?(wmccloskey)
> > +    os.notifyObservers(null, "child-memory-reporter-request", "generation=1 anonymize=0 minimize=0 DMDident=0");
> 
> how will this work with e10s turned off?

This code is specific to child processes. It doesn't do anything unless e10s is enabled (before or after the change).
Flags: needinfo?(wmccloskey)
Also, we won't need to worry about the app ID stuff once bug 1067229. Then we can revert the change in this patch.
Comment on attachment 8490906 [details] [diff] [review]
Consolidated patch

This looks fine aside from the ID issue.
Attachment #8490906 - Flags: feedback?(wmccloskey) → feedback+
(In reply to Bill McCloskey (:billm) from comment #31)
> Also, we won't need to worry about the app ID stuff once bug 1067229. Then
> we can revert the change in this patch.

Ok, for now we'll block this on bug 1068989 (make talos use mozversion) for now; patches in flight, but there's a bunch of moving parts involved. If bug 1067229 lands first, we can push a modified version of this patch without waiting for that.

Here's a try run against a version of talos with the e10s changes included, hopefully it will give us a broader idea of what needs more work while we figure that out:

https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=9b2b5fc81164
Results of the try run: 

* tsvg, tp works on Linux, Win7, Win8
* everything else fails
* nothing works on mac or winxp

Digging into why mac fails right now.
(In reply to William Lachance (:wlach) from comment #34)
> Results of the try run: 
> 
> * tsvg, tp works on Linux, Win7, Win8
> * everything else fails
> * nothing works on mac or winxp
> 
> Digging into why mac fails right now.

According to :jimm, apparently xp failures are expected due to bug 1037445.
Did some investigation on what was going on in MacOS X. It looks like we're not even getting as far as plInit() in the page loader extension (the first thing it does unload, basically). Any idea what could be going on?
Flags: needinfo?(wmccloskey)
Spent some more time debugging. It looks like quitting the application is not working, so we're not getting past the stage where we gather browser information. I checked and ipcMode is true, so it looks like we're firing the appropriate event, it's just not being processed in such a way as to actually quit the browser.

http://hg.mozilla.org/build/talos/file/bc955b57ddd1/talos/getInfo.html#l68
http://hg.mozilla.org/build/talos/file/bc955b57ddd1/talos/page_load_test/quit.js#l74
Sorry, I don't have much insight into the Mac issue. I don't have a Mac that I can easily test on right now. Let's focus on Linux and Windows for now.

What tests are failing (the everything else ones)?
Flags: needinfo?(wmccloskey)
(In reply to Bill McCloskey (:billm) from comment #38)
> Sorry, I don't have much insight into the Mac issue. I don't have a Mac that
> I can easily test on right now. Let's focus on Linux and Windows for now.
> 
> What tests are failing (the everything else ones)?

See the try run in comment 33: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=9b2b5fc81164
Can you help me a bit more? I've never understood the different letters in Talos try runs. I ran |talos --print-tests|, and here are the ones that look useful to me for e10s:

cart, tart - these would be somewhat interesting
tdhtml, tp5* - very useful.
tp5o_tscroll, tscroll*, tresize - also very useful
tpaint, ts_paint, ts - maybe somewhat useful
JS and CSS tests - we can ignore these for now

Which suites are these tests part of?
(In reply to Bill McCloskey (:billm) from comment #40)
> Can you help me a bit more? I've never understood the different letters in
> Talos try runs. I ran |talos --print-tests|, and here are the ones that look
> useful to me for e10s:
> 
> cart, tart - these would be somewhat interesting
> tdhtml, tp5* - very useful.
> tp5o_tscroll, tscroll*, tresize - also very useful
> tpaint, ts_paint, ts - maybe somewhat useful
> JS and CSS tests - we can ignore these for now
> 
> Which suites are these tests part of?

The suites are defined here: http://hg.mozilla.org/mozilla-central/file/tip/testing/talos/talos.json

I believe this roughly maps as follows:

c == chrome ["tresize", "tcanvasmark"]
d == dromaeojs ["dromaeo_css", "dromaeo_dom", "kraken", "v8_7"]
g1 == ["tp5o_scroll", "glterrain"]
o == other ["a11yr", "ts_paint", "tpaint", "sessionrestore", "sessionrestore_no_auto_restore"]
s == svg ["tsvgx", "tsvgr_opacity", "tart", "tscrollx", "cart"]
tp = tp5o ["tp5o"]
OK, thanks. As a next step, let's try to get "c" and "g1" working.
Depends on: 1071310
(Reporter)

Comment 43

3 years ago
Just to align the terminology, we call the letters "run groups". The framework splits tests into jobs (job is invoking all the tests of a run group) such it can run the jobs in parallel.

Initially the run group letters were names which tried to reflect the tests at this group (like 't' for tp5), but over the years tests got assigned to groups regardless of letters, where the goal is that all groups take similar time to complete (which is about 20 mins these days) such that the parallelism can be effective.

Once we realized the group letter got arbitrary, we switched to plain enumeration, where g1 is the first using this naming scheme (and the latest group to be added as of now). The next one will be g2, etc.

So, g1 (or any of the other letters) can be described as a 'run group' or as a 'job'.
Now that content processes are reporting the right app ids (bug 1067229), we don't need to gate this on making talos use mozversion (bug 1068989)
No longer depends on: 1068989
Created attachment 8495508 [details] [diff] [review]
Updated e10s patch

This is the same as the previous patch except:

1. Reverted changes to getInfo.html (no longer necessary now that bug 1067229)
2. Added an --e10s option which will set the right preferences (currently just browser.tabs.remote.autostart)

Testing in a try run here: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=46d3a320e62b
Attachment #8490906 - Attachment is obsolete: true
Attachment #8495508 - Flags: review?(jmaher)
Comment on attachment 8495508 [details] [diff] [review]
Updated e10s patch

Review of attachment 8495508 [details] [diff] [review]:
-----------------------------------------------------------------

great!
Attachment #8495508 - Flags: review?(jmaher) → review+
Pushed talos change, try run looked good (of course it was running in non-e10s mode, but at least we know this didn't break anything): https://hg.mozilla.org/build/talos/rev/53dab58cff84
Tests wh(In reply to Bill McCloskey (:billm) from comment #42)
> OK, thanks. As a next step, let's try to get "c" and "g1" working.

I believe the problem here is mostly/entirely that tests that do their own timing are broken. Trying to unwind how to fix this...
Created attachment 8498414 [details] [diff] [review]
Fixes to pageloader to make tests that do their own timing work

There were some problems in how we were handling tests that do their own timing for the e10s case. This fixes that and tries to consolidate more code between e10s/non-e10s, as well as doing various other cleanups to try to make the logic more transparent (more work left to do). I have two try runs going for this. One with e10s turned off:

https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=72655de7fc93

And one with e10s turned on (via a hacked version of talos):

https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=68f3c137ce9c
Attachment #8498414 - Flags: review?(jmaher)
Note that the above patch doesn't fix all the problems with the various tests. e.g. tp5o_scroll, tresize, tcanvasmark are all still broken I think because they still have various non-e10s-compatible assumptions. I'll look into fixing that next, however this was a necessary prerequisite to getting them working.
Comment on attachment 8498414 [details] [diff] [review]
Fixes to pageloader to make tests that do their own timing work

Review of attachment 8498414 [details] [diff] [review]:
-----------------------------------------------------------------

we should chat more, a bug is a good way for others to easily see the conversation.  I don't understand why we are removing mozafterpaint?  If there is a good reason, I can see this patch as being beneficial.

::: talos/pageloader/chrome/pageloader.js
@@ +240,3 @@
>                            "    }, true); " + 
>                            "  } else { " +
> +                          "    sendAsyncMessage('PageLoader:Load', {}); " +

I don't understand why we ignore mozafterpaint and just wait for the load message?

@@ +320,3 @@
>      // XXX we use a capturing event here too -- load events don't bubble up
>      // to the <browser> element.  See bug 390263.
> +    loadHandler = plLoadHandler;

why do we set loadHandler, I don't see it used

@@ +328,5 @@
> +    if (useMozAfterPaint) {
> +      content.removeEventListener("MozAfterPaint", plPaintedCapturing, true);
> +      gPaintListener = false;
> +    }
> +  };

I assume here we want loadHandler, but then we use plPaintedCapturing, is plPainted used anywhere else?
Attachment #8498414 - Flags: review?(jmaher) → review-
(In reply to Joel Maher (:jmaher) from comment #51)
> Comment on attachment 8498414 [details] [diff] [review]
> Fixes to pageloader to make tests that do their own timing work
> 
> Review of attachment 8498414 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> we should chat more, a bug is a good way for others to easily see the
> conversation.  I don't understand why we are removing mozafterpaint?  If
> there is a good reason, I can see this patch as being beneficial.
> 
> ::: talos/pageloader/chrome/pageloader.js
> @@ +240,3 @@
> >                            "    }, true); " + 
> >                            "  } else { " +
> > +                          "    sendAsyncMessage('PageLoader:Load', {}); " +
> 
> I don't understand why we ignore mozafterpaint and just wait for the load
> message?

We don't ignore mozafterpaint, we just overload the meaning of the pageload message depending on if mozafterpaint is enabled. This is kind of confusing, but not really any different than what we were doing before, which was trigger the exact same callback function inside the paint message handler.

Maybe we should rename 'PageLoader:Load' to 'PageLoader:LoadEvent' to reduce confusion?

> @@ +320,3 @@
> >      // XXX we use a capturing event here too -- load events don't bubble up
> >      // to the <browser> element.  See bug 390263.
> > +    loadHandler = plLoadHandler;
> 
> why do we set loadHandler, I don't see it used

Yeah this was a mistake, I'll upload a new patch with this backed out.

> @@ +328,5 @@
> > +    if (useMozAfterPaint) {
> > +      content.removeEventListener("MozAfterPaint", plPaintedCapturing, true);
> > +      gPaintListener = false;
> > +    }
> > +  };
> 
> I assume here we want loadHandler, but then we use plPaintedCapturing, is
> plPainted used anywhere else?

Please ignore this, as above.
Created attachment 8498484 [details] [diff] [review]
Fix tests that do their own timing take 2

Hopefully addresses review comments
Attachment #8498414 - Attachment is obsolete: true
Attachment #8498484 - Flags: review?(jmaher)
New try runs.

Without e10s:

https://tbpl.mozilla.org/?tree=Try&rev=33a9f6f7353f

With e10s:

https://tbpl.mozilla.org/?tree=Try&rev=498bc27d68cc
Comment on attachment 8498484 [details] [diff] [review]
Fix tests that do their own timing take 2

Review of attachment 8498484 [details] [diff] [review]:
-----------------------------------------------------------------

did you trim all the code out of here that you can?  I suspect now we might have an extra function or two we are not using :)
Attachment #8498484 - Flags: review?(jmaher) → review+
Created attachment 8499140 [details] [diff] [review]
Fix Android

The initial patch broke the pageloader on android. This should fix it, works locally for me anyway.
Attachment #8499140 - Flags: review?(jmaher)
Comment on attachment 8499140 [details] [diff] [review]
Fix Android

Review of attachment 8499140 [details] [diff] [review]:
-----------------------------------------------------------------

how does this work for android?  are we not doing e10s on android?  is it done differently?  

r=me as this allows us to use the same revision of talos on android + desktop for now.
Attachment #8499140 - Flags: review?(jmaher) → review+
(In reply to Joel Maher (:jmaher) from comment #57)
> Comment on attachment 8499140 [details] [diff] [review]

> how does this work for android?  are we not doing e10s on android?  is it
> done differently?  

No, Android has not used e10s since the 2012 native rewrite.

Will
(In reply to William Lachance (:wlach) from comment #49)
> Created attachment 8498414 [details] [diff] [review]
> Fixes to pageloader to make tests that do their own timing work
> 
> There were some problems in how we were handling tests that do their own
> timing for the e10s case. This fixes that and tries to consolidate more code
> between e10s/non-e10s, as well as doing various other cleanups to try to
> make the logic more transparent (more work left to do). I have two try runs
> going for this. One with e10s turned off:
> 
> https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=72655de7fc93
> 
> And one with e10s turned on (via a hacked version of talos):
> 
> https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=68f3c137ce9c

The hacked-e10s run didn't actually enable e10s, so we'll have to run that one again.

https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=5000230d7986
Landed the Android patch: https://hg.mozilla.org/build/talos/rev/147630cfb53f
Depends on: 1077632
Depends on: 1077640
(In reply to William Lachance (:wlach) from comment #59)
> (In reply to William Lachance (:wlach) from comment #49)
> The hacked-e10s run didn't actually enable e10s, so we'll have to run that
> one again.
> 
> https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=5000230d7986

win7 dromaeo seems to be failing in an unusual way:

http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/wlachance@mozilla.com-5000230d7986/try-win32/try_win7-ix_test-dromaeojs-bm109-tests1-windows-build91.txt.gz

We should be seeing results like this:

12:01:03     INFO -  |i|pagename|runs|
12:01:03     INFO -  |0;Richards;165.18004625041294
12:01:03     INFO -  |1;DeltaBlue;272.5538293813028
12:01:03     INFO -  |2;Encrypt;336.02150537634407
12:01:03     INFO -  |3;Decrypt;6223.60248447205

But instead we're seeing:

11:10:03 INFO - |i|pagename|runs|
11:10:03 INFO - |0;http://localhost/page_load_test/v8_7/run.html;170.1837985023826;291.97080291970804;343.64261168384877;6348.101265822785;2014.0845070422536;690.1311249137336;12268.292682926829;35120.68965517241;834.0283569641368;6666.666666666667

I know dromaeo isn't a huge priority but we probably shouldn't land the patch until we figure out what's going on.
(Reporter)

Comment 62

3 years ago
Will, what's the current status on which tests work and which don't (yet)?
(In reply to Avi Halachmi (:avih) from comment #62)
> Will, what's the current status on which tests work and which don't (yet)?

Pretty much only tp and x are working ok at this point. Dromaeo should work once we figure out the issue mentioned in comment 61. I hope to have a patch for this soon, but I'm also likely to be pretty busy next week. I'll see what I can do.
(Reporter)

Updated

3 years ago
Depends on: 1078250
(Reporter)

Comment 64

3 years ago
(In reply to William Lachance (:wlach) from comment #41)
>
> I believe this roughly maps as follows:
> 
> c == chrome ["tresize", "tcanvasmark"]
> d == dromaeojs ["dromaeo_css", "dromaeo_dom", "kraken", "v8_7"]
> g1 == ["tp5o_scroll", "glterrain"]
> o == other ["a11yr", "ts_paint", "tpaint", "sessionrestore",
> "sessionrestore_no_auto_restore"]
> s == svg ["tsvgx", "tsvgr_opacity", "tart", "tscrollx", "cart"]
> tp = tp5o ["tp5o"]


(In reply to William Lachance (:wlach) from comment #63)
> (In reply to Avi Halachmi (:avih) from comment #62)
> > Will, what's the current status on which tests work and which don't (yet)?
> 
> Pretty much only tp and x are working ok at this point...

'x' being?
x = xperf (win 7 only)
(Reporter)

Comment 66

3 years ago
Do any of the sub-tests within any of the run-groups work in e10s?

Do tests which have OWN_TIMING work at the framework level? (e.g. do the pages have access to a working tpRecordTime function?) If not, then this should first be fixed IMO, which might make tscrollx, tsvgx, glterrain and possibly more work without further changes.
good point Avi.  I suspect the insertion of 'tpRecordTime' is broken.  Maybe we insert it properly, but we don't get the data back to the pageloader addon due to e10s.

I know :wlach is at a work week this week, so he might be distracted.  Once I confirm he has all his code landed in Talos, I could take up part of the work.
(In reply to William Lachance (:wlach) from comment #61)
> (In reply to William Lachance (:wlach) from comment #59)
> > (In reply to William Lachance (:wlach) from comment #49)
> > The hacked-e10s run didn't actually enable e10s, so we'll have to run that
> > one again.
> > 
> > https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=5000230d7986
> 
> win7 dromaeo seems to be failing in an unusual way:
> 
> http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/wlachance@mozilla.
> com-5000230d7986/try-win32/try_win7-ix_test-dromaeojs-bm109-tests1-windows-
> build91.txt.gz
> 
> We should be seeing results like this:
> 
> 12:01:03     INFO -  |i|pagename|runs|
> 12:01:03     INFO -  |0;Richards;165.18004625041294
> 12:01:03     INFO -  |1;DeltaBlue;272.5538293813028
> 12:01:03     INFO -  |2;Encrypt;336.02150537634407
> 12:01:03     INFO -  |3;Decrypt;6223.60248447205
> 
> But instead we're seeing:
> 
> 11:10:03 INFO - |i|pagename|runs|
> 11:10:03 INFO -
> |0;http://localhost/page_load_test/v8_7/run.html;170.1837985023826;291.
> 97080291970804;343.64261168384877;6348.101265822785;2014.0845070422536;690.
> 1311249137336;12268.292682926829;35120.68965517241;834.0283569641368;6666.
> 666666666667
> 
> I know dromaeo isn't a huge priority but we probably shouldn't land the
> patch until we figure out what's going on.

So I just tried to reproduce this locally on my Win7 VM and could not do so. :( The test still ran fine, even with e10s enabled. Jmaher volunteered to look into this. To do so, he'll need to apply the "Fix tests which do own timing take 2" patch.
Flags: needinfo?(jmaher)
Depends on: 1078391
(Reporter)

Comment 69

3 years ago
Tested locally on windows 8.1 with latest (unpatched) talos rev 799 with this invocation:

> talos -e path/to/firefox.exe --develop --tppagecycles=1 --tpcycles=1 --cycles=1 -a <test-name> --e10s

Before "declaring" the result with --e10s I made sure that each test runs and completes without e10s.

> c == chrome ["tresize", "tcanvasmark"]

tresize - throws an exception and doesn't resize the window (and the test doesn't progress):

[Exception... "Method not implemented"  nsresult: "0x80004001 (NS_ERROR_NOT_IMPLEMENTED)"  location: "JS frame :: http://localhost:15707/startup_test/tresize-test.html :: <TOP_LEVEL> :: line 87"  data: no]

Lines 86-87-88 are:
> netscape.security.PrivilegeManager.enablePrivilege('UniversalXPConnect');
> window.moveTo(10,10);
> window.resizeTo(dims,dims);


tcanvasmark - runs and completes, but then talos apparently doesn't see the result and the window doesn't close and the test doesn't finish.


> d == dromaeojs ["dromaeo_css", "dromaeo_dom", "kraken", "v8_7"]

dormaeojs - didn't test yet.

> g1 == ["tp5o_scroll", "glterrain"]

tp5o_Scroll - loads the first page to scroll but then doesn't scroll and doesn't progress to the next page. The test doesn't finish.

glterrain - the page loads, the test runs and completes (slower than non e10s but that's a known issue with WebGL right now), but then talos can't see the results and doesn't exit.

> o == other ["a11yr", "ts_paint", "tpaint", "sessionrestore",
> "sessionrestore_no_auto_restore"]

Didn't test 'o' yet.


> s == svg ["tsvgx", "tsvgr_opacity", "tart", "tscrollx", "cart"]

tsvgx - appears to be working.
tsvgr_opacity - appears to be working.
tscrollx - appears to be working.

tart/cart - Doesn't start the test with current talos repo. I have a patch for bug 1078250, and with it the test starts and completes, but then talos can't see the result and doesn't exit.


> tp = tp5o ["tp5o"]

tp5o - appears to be working.

------------

The fact that tsvgx and tscrollx seems working means that HAS_OWN_TIMING works and tpRecordTime is injected to the content correctly. That's good.

I _think_ that what's common to the tests which complete but talos then can't read the results (tcanvasmark, glterrain, [patched] tart/cart) is that they report a group of results in a single tpRecordTime call (I know that's true for all but tcanvasmark - which I didn't examine yet). Maybe somehow reporting a single result works in e10s but reporting multiple doesn't.

Looks like tresize either needs a framescipt, or maybe that's a bug with e10s that window.moveTo is not implemented.

tp5o_scroll needs to be looked at, as well as the other tests which I didn't try.
glterrain + canvasmark could be figured out.  If we get to the end, but don't have results, some other corner case is broken.  I suspect it is the tpRecordTime where we pass in an array instead of a single value. TART also does this.

here is an example of multiple results coming back:
tpRecordTime([3,1,4,5,9], 0, ['pi1', 'pi2', 'pi3', 'pi4', 'pi5'])

in e10s (current pageloader w/no patches), we define tpRecordTime here (http://hg.mozilla.org/build/talos/file/tip/talos/pageloader/chrome/pageloader.js#l228):
                         "    content.wrappedJSObject.tpRecordTime = function(t, s) { sendAsyncMessage('PageLoader:RecordTime', { time: t, startTime: s }); }; ";


but outside of e10s, we have a different function definition (http://hg.mozilla.org/build/talos/file/tip/talos/pageloader/chrome/pageloader.js#l476):
  content.contentWindow.wrappedJSObject.tpRecordTime = function (time, startTime, testName) {
    gTime = time;
    gStartTime = startTime;
    recordedName = testName;
    setTimeout(plWaitForPaintingCapturing, 0);
  }


we have adjusted the function for tpRecordTime since the original e10s implementation and that would explain the problem.

Wlach has a patch (https://bug1050706.bugzilla.mozilla.org/attachment.cgi?id=8498484) that gets us much further towards e10s, but it doesn't modify the function signature by adding the names parameter.  That should be easy to do and test.
Flags: needinfo?(jmaher)
Created attachment 8502737 [details] [diff] [review]
fix tests that do their own timing (0.8)

this works locally on linux64, I need to get this tested on a variety of platforms, ideally on try or something like that.
Attachment #8498484 - Attachment is obsolete: true
locally (linux 64), I verified:
* canvasmark
* tart/cart (w patch from bug 1078250)

I suspect glterrain works.  this leaves scroll tests and tresize to sort out.
(Reporter)

Comment 73

3 years ago
(In reply to Avi Halachmi (:avih) from comment #69)
> ...
> Looks like tresize either needs a framescipt, or maybe that's a bug with
> e10s that window.moveTo is not implemented.

Bill, do you know anything about this ^ ?
Flags: needinfo?(wmccloskey)
Depends on: 1082127
It looks like window.moveTo is not implemented in e10s. I filed bug 1082127.
I believe that is the last of the tests to get going (assuming we get our current patches reviewed, landed, deployed.

It will be interesting to see how this actually looks when we get it on treeherder.
Flags: needinfo?(wmccloskey)
(Reporter)

Comment 76

3 years ago
(In reply to Joel Maher (:jmaher) from comment #75)
> I believe that is the last of the tests to get going (assuming we get our
> current patches reviewed, landed, deployed.

There's still also tp5o_Scroll.
Testing jmaher's update to my patch. 

Non-e10s:

https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=9c2f7063eca3

e10s (only testing linux + win32):

https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=b07031d1b63e
Created attachment 8505019 [details] [diff] [review]
fix tests that do their own timing (0.9)
Attachment #8502737 - Attachment is obsolete: true
(In reply to Joel Maher (:jmaher) from comment #78)
> Created attachment 8505019 [details] [diff] [review]
> fix tests that do their own timing (0.9)

New try run (e10s only, where things were broken and fixed): https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=daa79eb098e1
with that latest push, just 'c' and 'g1' are failing.  Unfortunately they fail in the beginning so we don't have proof that the issues are just tresize and tp5o_scroll.  We could edit the talos.json file to remove those tests for now!  Or we could hack talos to ignore those tests while setting up.
Comment on attachment 8505019 [details] [diff] [review]
fix tests that do their own timing (0.9)

Review of attachment 8505019 [details] [diff] [review]:
-----------------------------------------------------------------

after looking at this patch often enough, I don't have any issues with it.
(Reporter)

Comment 82

3 years ago
(In reply to Joel Maher (:jmaher) from comment #80)
> with that latest push, just 'c' and 'g1' are failing.  Unfortunately they
> fail in the beginning so we don't have proof that the issues are just
> tresize and tp5o_scroll.  We could edit the talos.json file to remove those
> tests for now!  Or we could hack talos to ignore those tests while setting
> up.

You verified that tcanvasmark works in comment 72, and I verified that glterrain completes and has results, but then fails to report them due to the issue with toRecordTime with multiple values.

I think it's fairly safe to assume both are working. And you could always run then locally with -a tcanvasmark and -a glterrain ;)
running locally vs running on tbpl and on all platforms is a much different thing; I usually say if it runs locally it is ready to run on try.
(Reporter)

Comment 84

3 years ago
I don't disagree, just saying that our observations so far, including those before and after the tpRecordTime patch, make it quite likely that these tests should be working with e10s.
https://hg.mozilla.org/build/talos/rev/6c837558d722
Created attachment 8506357 [details] [diff] [review]
turn on all jobs for e10s (1.0)
Attachment #8506357 - Flags: review?(armenzg)

Updated

3 years ago
Attachment #8506357 - Flags: review?(armenzg) → review+
https://hg.mozilla.org/build/buildbot-configs/rev/89e0cda2c158
In production: https://hg.mozilla.org/build/buildbot-configs/rev/89e0cda2c158
Depends on: 1090421
Depends on: 1090434
Depends on: 1092194
Depends on: 1092322
Depends on: 1092428
Depends on: 1094355
Depends on: 1094757
Depends on: 1098357
Depends on: 1098386
(Reporter)

Comment 89

3 years ago
WRT tresize, after some discussion today we arrived at the conclusion that implementing window.moveTo/resizeTo for the content is not the correct approach.

While such fix will make the test work without further changes, it would also induce a penalty from content-parent processes communication for each resize, and since all this test does is resize as fast as possible, this penalty will not be negligible.

Also such solution where the content process resizes the window is similar to the user resizing the window in a non-e10s case, but is different in e10s.

For these reasons, we should rewrite the test to resize the window from the parent process.
No longer depends on: 1082127
Depends on: 1102479
(Reporter)

Updated

3 years ago
Depends on: 1124728
I'm not actively working on this, which actually seems to be more of a metabug at this point.
Assignee: wlachance → nobody
can we close this bug and track all work in bug 1144120?
Now treeherder report talos e10s, and trychooser also allow to trigger e10s talos jobs, thanks to :jmaher.

Re: Should we close the bug ?
great idea- any remaining issues can be resolved in bug 1144120.
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.