Closed Bug 1261152 Opened 8 years ago Closed 8 years ago

Ensure tps is treating e10s fairly

Categories

(Testing :: Talos, defect)

defect
Not set
normal

Tracking

(e10s+, firefox48 fixed)

RESOLVED FIXED
mozilla48
Tracking Status
e10s + ---
firefox48 --- fixed

People

(Reporter: mconley, Assigned: mconley)

References

(Blocks 1 open bug)

Details

Attachments

(2 files)

The tps test is supposed to measure how long it takes to switch tabs in the browser, and present the content of those tabs to the user. It loads up a bunch of sites from the tp5 set, waits for them all to finish loading, and then switches to each of those tabs, one by one.

Here's a profile I recently gathered from an e10s run on Windows:

http://people.mozilla.org/~bgirard/cleopatra/?zippedProfile=http://mozilla-releng-blobs.s3.amazonaws.com/blobs/Try/sha512/c44bd51496971b55c413c5c923bf8b6c2c8f2fdcd915a20851d21e9fb819560a995dfc2e9a0b0cbc6f608655a62dfbd4a06199666683507c7974d2d0678c0c42#report=29db309b072a460b01461c67a659fad5cee2d0fd&filter=%5B%7B%22type%22%3A%22RangeSampleFilter%22,%22start%22%3A80290,%22end%22%3A80807%7D,%7B%22type%22%3A%22RangeSampleFilter%22,%22start%22%3A80513,%22end%22%3A80570%7D%5D&selection=0,4700,6,7,4701,4702,4,5,6,7,4701,8,9,10,11,12,13,14,15,16,17,4861,4862,4863,4864,4865,287,288,289,290,291,292,293,294,295,296


One thing that strikes me is that it looks like content was done painting at Rasterize #45, and that the next composite is Composite #516 - and that's when the web content should have been displayed to the user.

Instead, we wait a while - we wait for another vsync before we mark the test as done.

I think this test is slightly flawed here - we shouldn't be waiting for that extra vsync.
Assignee: nobody → mconley
Blocks: e10s-tests
tracking-e10s: --- → +
I have more details. Here's a snippet of a tps run on a Linux 64 machine:

http://people.mozilla.org/~bgirard/cleopatra/?zippedProfile=http://mozilla-releng-blobs.s3.amazonaws.com/blobs/Try/sha512/e4e9f40b66605cb4cff3af7bddb8f8574e5b9215357a696a69c6ac1f4c6c0179b57be480fd395e1588ad577bcdfb7fd609a927342bc76cdc6fabe5958a9b3454#report=72c5d0ad920d1f09457659de33518be2f2d8fd4d&filter=%5B%7B%22type%22%3A%22RangeSampleFilter%22,%22start%22%3A93860,%22end%22%3A102630%7D,%7B%22type%22%3A%22RangeSampleFilter%22,%22start%22%3A94760,%22end%22%3A96513%7D,%7B%22type%22%3A%22RangeSampleFilter%22,%22start%22%3A94799,%22end%22%3A94945%7D,%7B%22type%22%3A%22RangeSampleFilter%22,%22start%22%3A94828,%22end%22%3A94901%7D%5D

At 94830, we start the tab switch, and the paint is kicked off in the content process.

That paint completes around 94858. A Vsync tick causes a second, shorter paint to occur (I'm not 100% sure this paint is necessary, but there it is) - that paint goes from 94859 to 94862.

The parent, meanwhile, has heard that the layers are ready from (presumably) that first paint that occurred.

At around 94862, the parent hears both a MozAfterPaint in the parent (since a composite occurred not long ago), as well as the message that the layers are ready.

At that point, the async tab switcher flips the tabpanel deck to reveal the newly painted browser content.

We then wait 12ms for the next composite, which _should_, I believe, show the web content.

The composite occurs, and at 94876, the TabChild dispatches its own MozAfterPaint since the tab has finally been presented to the user.

Here's the kicker - then we wait a whole 16ms before stopping the tps stopwatch.

I suspect tps needs to be modified to pay attention to the MozAfterPaint events that are being dispatched by the TabChild, instead of the requestAnimationFrame stuff it's currently doing.
Hey Joel - I think this is going to address our tps "regression". It turns out this isn't really a regression - we're just measuring wrong.

e10s actually appears to perform better than non-e10s in this test with the first patch:

https://treeherder.mozilla.org/perf.html#/e10s?repo=try&revision=fba2a2487c10

The second patch is hopefully going to address the OS X regression here, since it keeps the content process alive between cycles of the test. I think the reason OS X is regressing is because of the asian character sets that are being cached in the single-process case, but in the multi-process case, because the content process shuts down between cycles, no cache exists.

Try build with both patches is here:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=09910a98365f9982fc9331b4b44fb72eb716fddd
Depends on: 1264409
Attachment #8741869 - Flags: review?(jmaher)
Comment on attachment 8741869 [details]
MozReview Request: Bug 1261152 - Keep the content process alive between tps runs. r?jmaher

https://reviewboard.mozilla.org/r/46805/#review43401

one issue, It appears we still look for data.times... in the code that receives the data.

::: testing/talos/talos/tests/tabswitch/bootstrap.js:437
(Diff revision 1)
>      if (pref)
>        Services.prefs.setBoolPref("browser.tabs.warnOnCloseOtherTabs", pref);
> -    Services.obs.notifyObservers(win, 'tabswitch-test-results', JSON.stringify({'times': times, 'urls': testURLs}));
> +
> +    remotePage.sendAsyncMessage("tabswitch-test-results", {
> +      times,
> +      urls: testURLs,

previously you had a structure of:
\{'times': value,
 'urls': value\}
 
 now you have a similar structure, just there is nothing labeled 'times'
Comment on attachment 8741868 [details]
MozReview Request: Bug 1261152 - Make TPS more accurate using MozAfterPaint with layer transaction ids. r?jmaher

https://reviewboard.mozilla.org/r/46803/#review43405

::: testing/talos/talos/tests/tabswitch/bootstrap.js:407
(Diff revision 1)
> +    let times = [];
> +
> +    for (let tab of tabs) {
> +      yield forceGC(win, tab.linkedBrowser);
> +      let time = yield switchToTab(tab);
> +      dump(`${tab.linkedBrowser.currentURI.spec}: ${time}ms\n`);

is this dump seen in automation?  I would think it is not visible.
Attachment #8741868 - Flags: review?(jmaher) → review+
(In reply to Joel Maher (:jmaher) from comment #8)
> Comment on attachment 8741869 [details]
> MozReview Request: Bug 1261152 - Keep the content process alive between tps
> runs. r?jmaher
> 
> https://reviewboard.mozilla.org/r/46805/#review43401
> 
> one issue, It appears we still look for data.times... in the code that
> receives the data.
> 
> ::: testing/talos/talos/tests/tabswitch/bootstrap.js:437
> (Diff revision 1)
> >      if (pref)
> >        Services.prefs.setBoolPref("browser.tabs.warnOnCloseOtherTabs", pref);
> > -    Services.obs.notifyObservers(win, 'tabswitch-test-results', JSON.stringify({'times': times, 'urls': testURLs}));
> > +
> > +    remotePage.sendAsyncMessage("tabswitch-test-results", {
> > +      times,
> > +      urls: testURLs,
> 
> previously you had a structure of:
> \{'times': value,
>  'urls': value\}
>  
>  now you have a similar structure, just there is nothing labeled 'times'

Ah, this some ES6 goodness.

Specifically, this is a shorthand property name. See https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/Object_initializer#New_notations_in_ECMAScript_2015

Essentially, given some local variable named times and testURLs:

{ times: times, urls: testURLs }

is equivalent to

{ times, urls: testURLs }
nice, then ship it!
(In reply to Joel Maher (:jmaher) from comment #9)
> Comment on attachment 8741868 [details]
> MozReview Request: Bug 1261152 - Make TPS more accurate using MozAfterPaint
> with layer transaction ids. r?jmaher
> 
> https://reviewboard.mozilla.org/r/46803/#review43405
> 
> ::: testing/talos/talos/tests/tabswitch/bootstrap.js:407
> (Diff revision 1)
> > +    let times = [];
> > +
> > +    for (let tab of tabs) {
> > +      yield forceGC(win, tab.linkedBrowser);
> > +      let time = yield switchToTab(tab);
> > +      dump(`${tab.linkedBrowser.currentURI.spec}: ${time}ms\n`);
> 
> is this dump seen in automation?  I would think it is not visible.

It is, yes - you can see these in the test log.
Comment on attachment 8741869 [details]
MozReview Request: Bug 1261152 - Keep the content process alive between tps runs. r?jmaher

https://reviewboard.mozilla.org/r/46805/#review43409
So this seems to have put a pretty dramatic bullet through the tps regression.

non-e10s has gone up. e10s has gone down.

In case people are worried that I've somehow unfairly shifted the test in e10s's favour, here are some profiles I took from my local machine from a single run:

non-e10s: https://cleopatra.io/#report=c77d23d2314b897b2eed8c85b1337d47103711b8&filter=%5B%7B%22type%22%3A%22RangeSampleFilter%22,%22start%22%3A48038,%22end%22%3A75017%7D%5D

e10s: https://cleopatra.io/#report=c835c4853b5832ce6162415ff5e95ba1166ec479&filter=%5B%7B%22type%22%3A%22RangeSampleFilter%22,%22start%22%3A46761,%22end%22%3A90595%7D%5D

Here's a little snippet of a single tab switch to look at, which I'll discuss next:

non-e10s: https://cleopatra.io/#report=c77d23d2314b897b2eed8c85b1337d47103711b8&filter=%5B%7B%22type%22%3A%22RangeSampleFilter%22,%22start%22%3A48038,%22end%22%3A75017%7D,%7B%22type%22%3A%22RangeSampleFilter%22,%22start%22%3A48038,%22end%22%3A48734%7D,%7B%22type%22%3A%22RangeSampleFilter%22,%22start%22%3A48110,%22end%22%3A48241%7D%5D

e10s: https://cleopatra.io/#report=c835c4853b5832ce6162415ff5e95ba1166ec479&filter=%5B%7B%22type%22%3A%22RangeSampleFilter%22,%22start%22%3A46761,%22end%22%3A90595%7D,%7B%22type%22%3A%22RangeSampleFilter%22,%22start%22%3A46863,%22end%22%3A47980%7D,%7B%22type%22%3A%22RangeSampleFilter%22,%22start%22%3A47014,%22end%22%3A47101%7D%5D

Notice in the non-e10s snippet, the big paint goes from time 48128 all the way until around 48190, taking approximately ~62ms. The next composite finishes at 48213, and the MozAfterPaint event for it is serviced at 48218, which ends the test with a time of about ~106ms.

In the e10s snippet, the paint gets to start a _lot_ sooner, at 47019 in the content process. That paint lasts until around 47075, lasting about ~55ms. This is likely faster than non-e10s because we don't have to worry about painting any of the browser UI. I believe the composite with that web content finishes around 47078, but it's 15ms later that the MozAfterPaint event finally has a chance to be serviced, and we can stop the stopwatch for a total time of ~77ms.

There's room for improvement in terms of the accuracy of the measurement in both cases. Specifically, if we were to fix bug 1264798, we wouldn't need to care about when the MozAfterPaint event was serviced - we'd be able to get the time that the pixels were presented to the user with a very high-degree of precision.
In my netbook, when I open heavy javascript page on a tab and before this tab finished loading / rendering / completing javascript execution, switch to another tabs make spinner on middle page appear for quite a while. This make switching tab experience worse than non e10s. Is this test include heavy javascript page? Or Is this test account for switching tabs before another tab completely loaded or rendered? Or Is it tested on another talos test? Thank you.
(In reply to Katsuki from comment #17)
> In my netbook, when I open heavy javascript page on a tab and before this
> tab finished loading / rendering / completing javascript execution, switch
> to another tabs make spinner on middle page appear for quite a while. This
> make switching tab experience worse than non e10s. Is this test include
> heavy javascript page?

This test opens a series of pages from the tp5 set, some with reasonably complex JS running on them, yes.

> Or Is this test account for switching tabs before
> another tab completely loaded or rendered?

This is different. We wait for all of the pages to be loaded before we start switching through them.

Do you have an example site that I could test?
Flags: needinfo?(Shiratori1991)
If I open bugzilla and google on different tab, switching tab between these two tabs seem smooth, google and bugzilla page displayed properly. (I try to quickly switch tab on about 500ms interval).

Then I open this page : https://reviewable.io/reviews/servo/servo/10438  on another tab. Before this page completely rendered or before completely loaded or before review being loaded, I try to switch tab between google and bugzilla on about 500 ms interval, but what I see is just the spinner on middle page. Even switching tab on 2 seconds interval just display the spinner.
Flags: needinfo?(Shiratori1991)
(In reply to Katsuki from comment #19)
> If I open bugzilla and google on different tab, switching tab between these
> two tabs seem smooth, google and bugzilla page displayed properly. (I try to
> quickly switch tab on about 500ms interval).
> 
> Then I open this page : https://reviewable.io/reviews/servo/servo/10438  on
> another tab. Before this page completely rendered or before completely
> loaded or before review being loaded, I try to switch tab between google and
> bugzilla on about 500 ms interval, but what I see is just the spinner on
> middle page. Even switching tab on 2 seconds interval just display the
> spinner.

Interesting. I am able to recreate the spinner that you're seeing, but when I use the same steps in single-process mode, the entire browser UI locks up instead. Are you not seeing this?
Flags: needinfo?(Shiratori1991)
Yes I am seeing it on single process, firefox hangs for couple of seconds. Seems like other bug? Is it about jank or responsiveness bug?
Flags: needinfo?(Shiratori1991)
(In reply to Katsuki from comment #21)
> Yes I am seeing it on single process, firefox hangs for couple of seconds.
> Seems like other bug? Is it about jank or responsiveness bug?

It's essentially the same bug, just manifesting differently. In the non-e10s case, the main process is being bogged down by some heavy JS that reviewable.io is doing. In the e10s case, the content process is getting bogged down by that JS. The fact that the e10s-case allows the browser to still be somewhat responsive is a bonus, though it's not amazing that you can't see any tab content until the content process frees up.

The transition to multiple content processes will ease this up, most likely. You might want to follow the e10s-multi metabug at bug 1207306.

Otherwise, the issue you've reported is unrelated to the tps test.
OK, thank you.
Depends on: 1283812
You need to log in before you can comment on or make changes to this bug.