Bug 1250620 - Opened 8 years ago, Closed 7 years ago

Review all talos tests validity in e10s and non-e10s

Categories

(Testing :: Talos, defect, P3)

Tracking

(e10s+)

RESOLVED WONTFIX

People

(Reporter: avih, Assigned: avih)

References

(Blocks 1 open bug)

Details

As part of the e10s effort (but useful regardless), we're trying to establish which of the talos tests are relevant for comparing e10s against non-e10s, which starts with validating the tests themselves.

For each test, we'd like to answer the following questions:
1. Is the test meaningful/relevant in non-e10s?
2. Is the test meaningful/relevant in e10s?
3. Is the comparison of the results between the above valid?
4. Caveats/Notes/platform-notes?

Below is a list of all current talos tests and their owners as far as we know (or someone I think could help). For a more detailed description of each test, see the wiki at https://wiki.mozilla.org/Buildbot/Talos/Tests .

Each listed dev, please post your answers to the four questions above.

If you feel you're not the right person, pointing at someone who can help would be appreciated.

If the answer is not clear-cut, or if it requires further discussion or investigation, please say so and we'll file a new bug for it.


Validations needed WRT the questions above:
-------------------------------------------

mstange (+mchang): ASAP as a tool to measure throughput.

jimm: tp5-responsiveness, ts_paint, tpaint, tresize

mconley: tp5, TART (ASAP), CART (ASAP)

kats: tscrollx, tp5o_scroll (both ASAP)

jgilbert: glterrain (ASAP)

seth: tsvgx (ASAP)

mattwoodrow: tsvg_opacity

snorp: CanvasMark

erahm: tp5 counters (memory etc)

bz: Dromaeo dom/css

davidb: a11y

aklotz: xperf

h4writer: kraken

bgrins: DAMP
Flags: needinfo?(snorp)
Flags: needinfo?(seth)
Flags: needinfo?(mstange)
Flags: needinfo?(mconley)
Flags: needinfo?(matt.woodrow)
Flags: needinfo?(jmathies)
Flags: needinfo?(jgilbert)
Flags: needinfo?(hv1989)
Flags: needinfo?(erahm)
Flags: needinfo?(dbolter)
Flags: needinfo?(bzbarsky)
Flags: needinfo?(bugmail.mozilla)
Flags: needinfo?(bgrinstead)
Flags: needinfo?(aklotz)
> bz: Dromaeo dom/css

1. Yes.
2. Yes.
3. Yes.
4. I would expect that scores are identical on these tests for e10s and non-e10s. If they're not, that's worth investigating, of course.
Flags: needinfo?(bzbarsky)
(In reply to Avi Halachmi (:avih) from comment #0)
> For each test, we'd like to answer the following questions:
> 1. Is the test meaningful/relevant in non-e10s?

Definitely. Very useful for catching startup main-thread I/O.

> 2. Is the test meaningful/relevant in e10s?

Yes. We still want to check main-thread I/O in the chrome process under e10s.

> 3. Is the comparison of the results between the above valid?

Yes (but see Caveats)

> 4. Caveats/Notes/platform-notes?

On the one hand, we should expect startup main-thread I/O in the chrome process to be better in the e10s case. OTOH, if chrome is blocked on content for some reason (perhaps during content process startup and initialization), and content is doing I/O, then there is an effective dependency between the chrome process and the content process which won't show up in the xperf data.
Flags: needinfo?(aklotz)
(In reply to Avi Halachmi (:avih) from comment #0)
> bgrins: DAMP

> For each test, we'd like to answer the following questions:
> 1. Is the test meaningful/relevant in non-e10s?

Yes

> 2. Is the test meaningful/relevant in e10s?

Yes

> 3. Is the comparison of the results between the above valid?

Yes

> 4. Caveats/Notes/platform-notes?

We definitely care about perf differences between e10s and non-e10s for devtools, especially since we have a lot of users on the Dev Edition channel.
Flags: needinfo?(bgrinstead)
Just for reference, these are non-e10s (left) vs e10s (right) comparisons on Windows 7 32-bit as of yesterday for all tests. Other platforms might show a different picture; wlach is working on a live comparison dashboard, and I'll post a link once it's up.

Results are sorted from biggest e10s regressions (compared to non-e10s) at the top, to biggest e10s improvements at the bottom.

The middle value is the difference in percent.

Keep in mind though that this bug is about the tests/comparison validity, and not about actual regressions/improvements.

Win7 (worst to best e10s):
-- tp5o_scroll            2.9 +/-  0%  [ +70.2%]       5.0 +/-  0%
-- tps                   46.0 +/-  0%  [ +63.8%]      75.3 +/-  0%
-- tpaint               185.1 +/-  2%  [ +28.6%]     238.0 +/-  5%
-- tart                   3.8 +/-  0%  [ +11.8%]       4.2 +/-  0%

-- tresize                9.6 +/-  0%  [  +7.3%]      10.3 +/-  0%
-- tscrollx               2.8 +/-  0%  [  +6.7%]       3.0 +/-  0%
-- cart                  23.4 +/-  0%  [  +6.6%]      24.9 +/-  0%
   a11yr                407.3 +/-  6%  [  +1.5%]     413.5 +/-  8%
   kraken              1603.3 +/- 10%  [  +0.8%]    1616.4 +/- 12%
   damp                 198.2 +/-  1%  [  +0.2%]     198.6 +/-  1%
   sessionrestore      1544.0 +/- 14%  [  -0.1%]    1541.7 +/- 18%
   dromaeo_css         6643.1 +/- 50%  [  +0.1%]    6650.2 +/- 99%
++ tp5o                 217.6 +/-  1%  [  -1.1%]     215.2 +/-  1%
++ tcanvasmark         7779.8 +/- 36%  [  +1.5%]    7893.0 +/- 36%
++ glterrain              5.2 +/-  0%  [  -2.0%]       5.1 +/-  0%
++ sessionrestore_NAR   631.0 +/-  6%  [  -2.6%]     614.5 +/-  6%

++ tsvgx                388.3 +/-  1%  [ -50.4%]     192.4 +/-  1%
++ ts_paint             870.3 +/-  4%  [ -55.8%]     385.1 +/-  3%
++ tp5o_responsiveness   39.7 +/-  1%  [ -84.8%]       6.0 +/-  1%
++ tsvgr_opacity        505.7 +/-  3%  [ -90.2%]      49.5 +/-  1%
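To make the format concrete: each row shows the non-e10s mean and its noise on the left, the e10s mean and its noise on the right, and the bracketed middle value is the relative difference. For tpaint, for example, (238.0 - 185.1) / 185.1 comes out to about +28.6%, matching the bracketed value above.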
> erahm: tp5 counters (memory etc)

I can really only comment on the memory component.

> For each test, we'd like to answer the following questions:
> 1. Is the test meaningful/relevant in non-e10s?

Yes.

> 2. Is the test meaningful/relevant in e10s?

Yes.

> 3. Is the comparison of the results between the above valid?

No. See caveat.

> 4. Caveats/Notes/platform-notes?

e10s is going to use more memory, so a 1:1 comparison isn't useful. A percentage threshold might be useful, i.e. with e10s enabled we use no more than X% of the non-e10s memory.
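A minimal sketch of that threshold idea (hypothetical helper and made-up numbers, not actual talos harness code):

// Flag e10s only when its memory exceeds a given percentage of the
// non-e10s baseline, instead of a raw 1:1 comparison.
function withinThreshold(e10sBytes, nonE10sBytes, thresholdPct) {
  const pct = (e10sBytes / nonE10sBytes) * 100;
  return { pct, pass: pct <= thresholdPct };
}

// Example: allow e10s up to 150% of the non-e10s memory.
const r = withinThreshold(620e6, 450e6, 150);
console.log(`e10s uses ${r.pct.toFixed(1)}% of non-e10s memory; pass=${r.pass}`);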
Flags: needinfo?(erahm)
(In reply to Boris Zbarsky [:bz] from comment #1)
> > bz: Dromaeo dom/css
> 
> 1. Yes.
> 2. Yes.
> 3. Yes.
> 4. I would expect that scores are identical on these tests for e10s and
> non-e10s. If they're not, that's worth investigating, of course.

Same for Kraken
Flags: needinfo?(hv1989)
(In reply to Eric Rahm [:erahm] from comment #5)
> > erahm: tp5 counters (memory etc)
> 
> I can really only comment on the memory component.
> 
> > For each test, we'd like to answer the following questions:
> > 1. Is the test meaningful/relevant in non-e10s?
> 
> Yes.
> 
> > 2. Is the test meaningful/relevant in e10s?
> 
> Yes.
> 
> > 3. Is the comparison of the results between the above valid?
> 
> No. See caveat.
> 
> > 4. Caveats/Notes/platform-notes?
> 
> e10s is going to use more memory, so a 1:1 comparison isn't useful. A
> percentage threshold might be useful, i.e. with e10s enabled we use no more
> than X% of the non-e10s memory.

After clarifying on IRC, erahm meant that e10s is expected to be unfavorable in such a comparison.

Regardless, this comparison is currently also invalid, since on e10s the test doesn't measure all processes. This is handled in bug 1250169, and once that bug lands the comparison should be valid (and, as expected, unfavorable for e10s).
Forgot to add at comment 0:

yoric: Session restore

blassey: tps
Flags: needinfo?(dteller)
Flags: needinfo?(blassey.bugs)
1. Is the test meaningful/relevant in non-e10s?
Yes.

2. Is the test meaningful/relevant in e10s?
Yes.

3. Is the comparison of the results between the above valid?
Yes.

4. Caveats/Notes/platform-notes?
As mentioned in bug 1245891, we may not be measuring what we want. We'll need to fix this soonish.
Flags: needinfo?(dteller)
(In reply to Avi Halachmi (:avih) from comment #0)
> mattwoodrow: tsvg_opacity


> 1. Is the test meaningful/relevant in non-e10s?

Yes

> 2. Is the test meaningful/relevant in e10s?

It should be, but given the massive improvement reported in comment 4, I suspect no.

> 3. Is the comparison of the results between the above valid?

I don't believe so; I can't see why e10s would improve the results of this test.

> 4. Caveats/Notes/platform-notes?
tp5-responsiveness - this measures browser-process main-thread jank while loading a set of tp5o pages.
1. Is the test meaningful/relevant in non-e10s?
  Yes.
2. Is the test meaningful/relevant in e10s?
  The test measures browser-side jank only. There is a general rule that the browser process should never sync-send to the content process, and as such should never block on it for data. If we regress this under e10s, something has landed that needs to be backed out. Also, since content has moved to a content process, the browser main thread should have much less work. Hence browser-side values should be much lower for e10s vs. non-e10s.
3. Is the comparison of the results between the above valid?
  Yes.
4. Caveats/Notes/platform-notes?
  It would be great if this test could be modified at some point to provide a combined jank value for both browser and content processes. This might come in handy when we move to multiple content processes as well.

ts_paint - startup paint
https://dxr.mozilla.org/mozilla-central/source/testing/talos/talos/startup_test/tspaint_test.html
1. Is the test meaningful/relevant in non-e10s?
  Yes.
2. Is the test meaningful/relevant in e10s?
  Yes.
3. Is the comparison of the results between the above valid?
  Yes, although there are questions about whether this test is really useful in e10s land. (See bug 1174767)

tpaint - window paint
https://dxr.mozilla.org/mozilla-central/source/testing/talos/talos/startup_test/tpaint.html
1. Is the test meaningful/relevant in non-e10s?
  Yes.
2. Is the test meaningful/relevant in e10s?
  Yes.
3. Is the comparison of the results between the above valid?
  Yes, we want to minimize regressions here. (See bug 1174770)

tresize - chrome window resize performance, which includes resizing remote layers
https://dxr.mozilla.org/mozilla-central/source/testing/talos/talos/startup_test/tresize/addon/content/tresize-test.html
1. Is the test meaningful/relevant in non-e10s?
  Yes.
2. Is the test meaningful/relevant in e10s?
  Yes.
3. Is the comparison of the results between the above valid?
  Yes, we want to minimize regressions here. (See bug 1179732.) Future work involves resizing tp5o pages to get a sense of the user experience. This would also exercise IPC and the content process more heavily.
Flags: needinfo?(jmathies)
I filed bug 1250717 for the main cause of the difference. It looks like it's still a bit faster with e10s, though that is somewhat believable since we don't have to deal with the throbber.
Flags: needinfo?(matt.woodrow)
mconley has taken ownership of tps
Flags: needinfo?(blassey.bugs)
> kats: tscrollx, tp5o_scroll (both ASAP)

> For each test, we'd like to answer the following questions:
> 1. Is the test meaningful/relevant in non-e10s?

Yes

> 2. Is the test meaningful/relevant in e10s?

Yes

> 3. Is the comparison of the results between the above valid?

Yes

> 4. Caveats/Notes/platform-notes?

Regressions in this test may get masked for APZ-driven scrolling methods in real-world usage. So for example a regression in this test may not result in user-visible jank when wheel-scrolling, because that is handled by APZ. But there are definitely cases where a regression in these tests can result in user-visible issues.
Flags: needinfo?(bugmail.mozilla)
(In reply to Avi Halachmi (:avih) from comment #0)
> davidb: a11y
> For each test, we'd like to answer the following questions:
> 1. Is the test meaningful/relevant in non-e10s?
> 2. Is the test meaningful/relevant in e10s?
> 3. Is the comparison of the results between the above valid?
> 4. Caveats/Notes/platform-notes?

Yes, yes, not sure; we'll probably want to add/change some talos tests when we are farther along with a11y support for e10s.
Flags: needinfo?(dbolter)
tps:

1. Is the test meaningful/relevant in non-e10s?

Yes 

2. Is the test meaningful/relevant in e10s?

Yes

3. Is the comparison of the results between the above valid?

Yes - BenWa did a bunch of work to ensure this.

4. Caveats/Notes/platform-notes?

N/A




TART (ASAP):

1. Is the test meaningful/relevant in non-e10s?

Yes 

2. Is the test meaningful/relevant in e10s?

Yes

3. Is the comparison of the results between the above valid?

Yes.

4. Caveats/Notes/platform-notes?

This test seems to be particularly fragile with e10s for some reason, and will time out. This might have to do with the fast creation/destruction of content processes, but I haven't dug into it too much.




CART (ASAP):

1. Is the test meaningful/relevant in non-e10s?

Yes 

2. Is the test meaningful/relevant in e10s?

Yes

3. Is the comparison of the results between the above valid?

Yes.

4. Caveats/Notes/platform-notes?

This test will open about:customizing, which will initially be opened in a remote browser, but once we find out that about:customizing is "blacklisted" (i.e., cannot be opened remotely), I believe the remoteness will flip, which might result in slightly more overhead and might explain the regression. That's just guesswork, though.



tp5:

1. Is the test meaningful/relevant in non-e10s?

Yes 

2. Is the test meaningful/relevant in e10s?

Yes, I think so.

3. Is the comparison of the results between the above valid?

Yes, I think so.

4. Caveats/Notes/platform-notes?

N/A.
Flags: needinfo?(mconley)
(In reply to Mike Conley (:mconley) - Needinfo me! from comment #16)
> TART (ASAP):
> ...
> 4. Caveats/Notes/platform-notes?
> 
> This test seems to be particularly fragile with e10s for some reason, and
> will time out. This might have to do with the fast creation/destruction of
> content processes, but I haven't dug into it too much.

This is apparently because changing the newtab URL is now async (it was sync in the past), which sometimes ended with the test hanging.

However, changing the newtab URL is not measured by TART (it's just preparation between measurements), so as long as it doesn't hang, it does provide meaningful numbers.

This is handled at bug 1246695 and currently uplifted (it needed a slightly different patch for Aurora, IIRC).
(In reply to Avi Halachmi (:avih) from comment #0)
> mstange (+mchang): ASAP as a tool to measure throughput.

It is a valid tool. Measuring throughput with requestAnimationFrame in ASAP mode measures the whole pipeline, including painting (one guaranteed paint per animation frame, if something needs painting) and composition.
I can't think of a case where ASAP mode would cause us to miss real regressions. On the other hand, it can help us catch regressions during animation frames that are faster than the vsync frame budget - regressions that would be real on slower machines. It's very unlikely that ASAP mode will cause us to catch a regression where there is *no* real regression, not even on slower machines. So in conclusion, it's valid.
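As a rough illustration of the measurement described above, here's a minimal rAF-based throughput sketch (illustrative only, not the actual test code; it assumes ASAP mode, i.e. layout.frame_rate=0, so frames aren't throttled to vsync):

// Measure the mean frame-to-frame interval over a fixed duration.
// In ASAP mode this interval approximates the real cost of the
// paint/composite pipeline rather than the vsync period.
function measureThroughput(durationMs, done) {
  const intervals = [];
  let start = null;
  let last = null;

  function frame(now) {
    if (last === null) {
      start = now;                // skip the first callback (no interval yet)
    } else {
      intervals.push(now - last); // one full pipeline iteration
    }
    last = now;
    if (now - start < durationMs) {
      requestAnimationFrame(frame);
    } else {
      const mean = intervals.reduce((a, b) => a + b, 0) / intervals.length;
      done(mean);                 // lower mean interval = higher throughput
    }
  }
  requestAnimationFrame(frame);
}

measureThroughput(5000, mean =>
  console.log(`mean frame interval: ${mean.toFixed(2)}ms`));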


> seth: tsvgx (ASAP)

I'll answer this one.

> For each test, we'd like to answer the following questions:
> 1. Is the test meaningful/relevant in non-e10s?

Yes.

> 2. Is the test meaningful/relevant in e10s?

Yes.

> 3. Is the comparison of the results between the above valid?

Yes.

> 4. Caveats/Notes/platform-notes?

It looks like this test is about 50% faster in e10s on Windows 7. This should be investigated.
I'm curious about comparisons on other platforms, and especially about whether GPU content acceleration could be responsible for the difference.
Flags: needinfo?(seth)
Flags: needinfo?(mstange)
> jgilbert: glterrain (ASAP)
> 
> For each test, we'd like to answer the following questions:
> 1. Is the test meaningful/relevant in non-e10s?
Yes.

> 2. Is the test meaningful/relevant in e10s?
Yes.

> 3. Is the comparison of the results between the above valid?
Yes.

> 4. Caveats/Notes/platform-notes?
Perf with e10s should be greater-than-or-equal-to non-e10s perf.
It should be largely the same.
Flags: needinfo?(jgilbert)
The 60% improvement on WinXP is likely because we aren't doing readback+upload+composite synchronously on the main thread now.
From bug 1198890, how about Dromaeo CSS?

https://wiki.mozilla.org/Buildbot/Talos/Tests#Dromaeo_Tests
Flags: needinfo?(bzbarsky)
See comment 1?  I'm not sure what the question is....
Flags: needinfo?(bzbarsky)
My bad, I didn't see Dromaeo had already been queried. Thanks!
snorp: are you the right owner for the tcanvasmark test?
Assigning this bug to Avi because he is driving the investigation of these talos tests' validity. We want all blocking bugs to have an owner.
Assignee: nobody → avihpit
Depends on: 1252768
I won't claim to be the owner of canvasmark, but it should be fine to run in e10s and non-e10s.
Flags: needinfo?(snorp)
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #27)
> I won't claim to be the owner of canvasmark, but it should be fine to run in
> e10s and non-e10s.

He won't claim it, but snorp nonetheless accepted ownership on IRC.
is there more work to do on this bug?
Flags: needinfo?(avihpit)
Depends on: 1254121
I suggest keeping this open until we've accepted the status of all the tests.

Open issues as far as I can tell:
1. We're still not conclusive on tsvg_opacity (bug 1250717).
2. IMO tsvgx is still not good (bug 1254121).
3. tp5o %CPU still doesn't have an owner (see comment 5).

Joel, can you help with 3?
Flags: needinfo?(avihpit) → needinfo?(jmaher)
Regarding the %cpu, I don't even know if that is the right measurement to take. It makes sense logically to measure the CPU, but as to what makes sense with the counters, this stuff has been around forever. I really don't want to own this piece. As for CPU, we would need to measure both the main and content processes, and maybe report them separately?
Flags: needinfo?(jmaher)
If we don't know who added, owns, or is familiar with the tp5o %cpu tests, and as a result have zero info on their relevance, then I suggest we drop them altogether.
I am all for dropping tp5o %cpu; we have:
* win7 - % processor time
* linux64 - XRes

should I drop both?
IMO wherever this applies:

> we don't know who added, owns, or is familiar with the tp5o %cpu tests, and
> as a result have zero info on their relevance

I'm completely unfamiliar with the tp5o counters tests. If it applies to win7 %cpu and XRes, then IMO yes for both.
Depends on: 1254628
We have some tests which use MozAfterPaint (a11y, ts_paint, tpaint, tresize, tp5o, tsvg_opacity). These tests have been using it for many years, but I don't fully trust how it's used (long legacy).

Recent unexpected (IMO) results at bug 1254121 (fixing tsvg_opacity for e10s) and bug 1254121 (fixing tsvgx for e10s) made me think we should re-assess how well we're using MozAfterPaint in talos.

I know it's a bit late to start doubting a big chunk of tests, but hopefully we can clear at least some relatively quickly.

Handling this at bug 1254121.
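For reference, the pattern these tests rely on looks roughly like this (an illustrative sketch, not the actual talos code; MozAfterPaint is Gecko-only, and content pages only receive it when dom.send_after_paint_to_content is enabled):

// Time from an action to the next MozAfterPaint. The caveat above:
// the first MozAfterPaint after the action isn't guaranteed to be the
// paint *caused by* that action, especially across e10s processes.
function timeToFirstPaintAfter(action) {
  return new Promise(resolve => {
    const start = performance.now();
    window.addEventListener("MozAfterPaint", function onPaint() {
      window.removeEventListener("MozAfterPaint", onPaint);
      resolve(performance.now() - start);
    });
    action(); // e.g. mutate the DOM, open a window, resize, ...
  });
}

timeToFirstPaintAfter(() => { document.body.style.background = "red"; })
  .then(ms => console.log(`first MozAfterPaint after ${ms.toFixed(2)}ms`));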
ts_paint, tpaint, tresize are good (thanks Jim!). Three to go.
(In reply to Avi Halachmi (:avih) from comment #35)
> ... results at bug 1254121 .. and bug 1254121... Handling this at bug 1254121.

Oops, seems I need to get back to basics with copy-paste (thanks kats! :) ).

tsvg_opacity e10s fix: bug 1250717
tsvgx e10s fix:        bug 1254121
Assess MozAfterPaint:  bug 1254628
Depends on: 1250717
Priority: -- → P3
I believe there is nothing actionable to do here.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX