Bug 1250620 - Opened 8 years ago, Closed 7 years ago

Review all talos tests validity in e10s and non-e10s

Categories

(Testing :: Talos, defect, P3)

Tracking

(e10s+)

RESOLVED WONTFIX

People

(Reporter: avih, Assigned: avih)

References

(Blocks 1 open bug)

Details

As part of the e10s effort (but useful regardless), we're trying to establish which of the talos tests are relevant for comparing e10s against non-e10s, which starts with validating the tests themselves.

For each test, we'd like to answer the following questions:
1. Is the test meaningful/relevant in non-e10s?
2. Is the test meaningful/relevant in e10s?
3. Is the comparison of the results between the above valid?
4. Caveats/Notes/platform-notes?

Below is a list of all current talos tests and their owners as far as we know (or someone I think could help). For a more detailed description of each test, see the wiki at https://wiki.mozilla.org/Buildbot/Talos/Tests .

Each listed dev, please post your answers to the four questions above.

If you feel you're not the right person, pointing at someone who can help would be appreciated.

If the answer is not clear-cut, or if it requires further discussion or investigation, please say so and we'll file a new bug for it.


Validations needed WRT the questions above:
-------------------------------------------

mstange (+mchang): ASAP as a tool to measure throughput.

jimm: tp5-responsiveness, ts_paint, tpaint, tresize

mconley: tp5, TART (ASAP), CART (ASAP)

kats: tscrollx, tp5o_scroll (both ASAP)

jgilbert: glterrain (ASAP)

seth: tsvgx (ASAP)

mattwoodrow: tsvg_opacity

snorp: CanvasMark

erahm: tp5 counters (memory etc)

bz: Dromaeo dom/css

davidb: a11y

aklotz: xperf

h4writer: kraken

bgrins: DAMP
Flags: needinfo?(snorp)
Flags: needinfo?(seth)
Flags: needinfo?(mstange)
Flags: needinfo?(mconley)
Flags: needinfo?(matt.woodrow)
Flags: needinfo?(jmathies)
Flags: needinfo?(jgilbert)
Flags: needinfo?(hv1989)
Flags: needinfo?(erahm)
Flags: needinfo?(dbolter)
Flags: needinfo?(bzbarsky)
Flags: needinfo?(bugmail.mozilla)
Flags: needinfo?(bgrinstead)
Flags: needinfo?(aklotz)
> bz: Dromaeo dom/css

1. Yes.
2. Yes.
3. Yes.
4. I would expect that scores are identical on these tests for e10s and non-e10s. If they're not, that's worth investigating, of course.
Flags: needinfo?(bzbarsky)
(In reply to Avi Halachmi (:avih) from comment #0)
> For each test, we'd like to answer the following questions:
> 1. Is the test meaningful/relevant in non-e10s?

Definitely. Very useful for catching startup main-thread I/O.

> 2. Is the test meaningful/relevant in e10s?

Yes. We still want to check main-thread I/O in the chrome process under e10s.

> 3. Is the comparison of the results between the above valid?

Yes (but see Caveats)

> 4. Caveats/Notes/platform-notes?

On the one hand, we should expect startup main-thread I/O in the chrome process to be better in the e10s case. OTOH, if chrome is blocked on content for some reason (perhaps during content process startup and initialization), and content is doing I/O, then there is an effective dependency between the chrome process and the content process which won't show up in the xperf data.
Flags: needinfo?(aklotz)
(In reply to Avi Halachmi (:avih) from comment #0)
> bgrins: DAMP

> For each test, we'd like to answer the following questions:
> 1. Is the test meaningful/relevant in non-e10s?

Yes

> 2. Is the test meaningful/relevant in e10s?

Yes

> 3. Is the comparison of the results between the above valid?

Yes

> 4. Caveats/Notes/platform-notes?

We definitely care about perf differences between e10s and non-e10s for devtools, especially since we have a lot of users on the Dev Edition channel.
Flags: needinfo?(bgrinstead)
Just for reference, these are non-e10s (left) vs e10s (right) comparisons on Windows 7 32-bit as of yesterday for all tests. Other platforms might show a different picture; wlach is working on a live comparison dashboard, and I'll post a link once it's up.

Results are sorted from biggest e10s regressions (compared to non-e10s) at the top, to biggest e10s improvements at the bottom.

The middle value is the difference in percent.

Keep in mind though that this bug is about the tests/comparison validity, and not about actual regressions/improvements.

Win7 (worst to best e10s):
-- tp5o_scroll            2.9 +/-  0%  [ +70.2%]       5.0 +/-  0%
-- tps                   46.0 +/-  0%  [ +63.8%]      75.3 +/-  0%
-- tpaint               185.1 +/-  2%  [ +28.6%]     238.0 +/-  5%
-- tart                   3.8 +/-  0%  [ +11.8%]       4.2 +/-  0%

-- tresize                9.6 +/-  0%  [  +7.3%]      10.3 +/-  0%
-- tscrollx               2.8 +/-  0%  [  +6.7%]       3.0 +/-  0%
-- cart                  23.4 +/-  0%  [  +6.6%]      24.9 +/-  0%
   a11yr                407.3 +/-  6%  [  +1.5%]     413.5 +/-  8%
   kraken              1603.3 +/- 10%  [  +0.8%]    1616.4 +/- 12%
   damp                 198.2 +/-  1%  [  +0.2%]     198.6 +/-  1%
   sessionrestore      1544.0 +/- 14%  [  -0.1%]    1541.7 +/- 18%
   dromaeo_css         6643.1 +/- 50%  [  +0.1%]    6650.2 +/- 99%
++ tp5o                 217.6 +/-  1%  [  -1.1%]     215.2 +/-  1%
++ tcanvasmark         7779.8 +/- 36%  [  +1.5%]    7893.0 +/- 36%
++ glterrain              5.2 +/-  0%  [  -2.0%]       5.1 +/-  0%
++ sessionrestore_NAR   631.0 +/-  6%  [  -2.6%]     614.5 +/-  6%

++ tsvgx                388.3 +/-  1%  [ -50.4%]     192.4 +/-  1%
++ ts_paint             870.3 +/-  4%  [ -55.8%]     385.1 +/-  3%
++ tp5o_responsiveness   39.7 +/-  1%  [ -84.8%]       6.0 +/-  1%
++ tsvgr_opacity        505.7 +/-  3%  [ -90.2%]      49.5 +/-  1%
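To make the format concrete: each row shows the non-e10s mean and its noise on the left, the e10s mean and its noise on the right, and the bracketed middle value is the relative difference. For tpaint, for example, (238.0 - 185.1) / 185.1 comes out to about +28.6%, matching the bracketed value above.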
> erahm: tp5 counters (memory etc)

I can really only comment on the memory component.

> For each test, we'd like to answer the following questions:
> 1. Is the test meaningful/relevant in non-e10s?

Yes.

> 2. Is the test meaningful/relevant in e10s?

Yes.

> 3. Is the comparison of the results between the above valid?

No. See caveat.

> 4. Caveats/Notes/platform-notes?

e10s is going to use more memory, so a 1:1 comparison isn't useful. A percentage threshold might be useful, i.e. with e10s enabled we use no more than X% of the non-e10s memory.
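A minimal sketch of that threshold idea (hypothetical helper and made-up numbers, not actual talos harness code):

// Flag e10s only when its memory exceeds a given percentage of the
// non-e10s baseline, instead of a raw 1:1 comparison.
function withinThreshold(e10sBytes, nonE10sBytes, thresholdPct) {
  const pct = (e10sBytes / nonE10sBytes) * 100;
  return { pct, pass: pct <= thresholdPct };
}

// Example: allow e10s up to 150% of the non-e10s memory.
const r = withinThreshold(620e6, 450e6, 150);
console.log(`e10s uses ${r.pct.toFixed(1)}% of non-e10s memory; pass=${r.pass}`);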
Flags: needinfo?(erahm)
(In reply to Boris Zbarsky [:bz] from comment #1)
> > bz: Dromaeo dom/css
> 
> 1. Yes.
> 2. Yes.
> 3. Yes.
> 4. I would expect that scores are identical on these tests for e10s and
> non-e10s. If they're not, that's worth investigating, of course.

Same for Kraken
Flags: needinfo?(hv1989)
(In reply to Eric Rahm [:erahm] from comment #5)
> > erahm: tp5 counters (memory etc)
> 
> I can really only comment on the memory component.
> 
> > For each test, we'd like to answer the following questions:
> > 1. Is the test meaningful/relevant in non-e10s?
> 
> Yes.
> 
> > 2. Is the test meaningful/relevant in e10s?
> 
> Yes.
> 
> > 3. Is the comparison of the results between the above valid?
> 
> No. See caveat.
> 
> > 4. Caveats/Notes/platform-notes?
> 
> e10s is going to use more memory, so a 1:1 comparison isn't useful. A
> percentage threshold might be useful, i.e. with e10s enabled we use no more
> than X% of the non-e10s memory.

After clarifying on IRC, erahm meant that e10s is expected to be unfavorable in such a comparison.

Regardless, this comparison is currently also invalid, since on e10s the test doesn't measure all processes. This is handled in bug 1250169, and once that bug lands the comparison should be valid (and, as expected, unfavorable for e10s).
Forgot to add at comment 0:

yoric: Session restore

blassey: tps
Flags: needinfo?(dteller)
Flags: needinfo?(blassey.bugs)
1. Is the test meaningful/relevant in non-e10s?
Yes.

2. Is the test meaningful/relevant in e10s?
Yes.

3. Is the comparison of the results between the above valid?
Yes.

4. Caveats/Notes/platform-notes?
As mentioned in bug 1245891, we may not be measuring what we want. We'll need to fix this soonish.
Flags: needinfo?(dteller)
(In reply to Avi Halachmi (:avih) from comment #0)
> mattwoodrow: tsvg_opacity


> 1. Is the test meaningful/relevant in non-e10s?

Yes

> 2. Is the test meaningful/relevant in e10s?

It should be, but given the massive improvement reported in comment 4, I suspect no.

> 3. Is the comparison of the results between the above valid?

I don't believe so; I can't see why e10s would improve the results of this test.

> 4. Caveats/Notes/platform-notes?
tp5-responsiveness - this measures browser-process main-thread jank while loading a set of tp5o pages.
1. Is the test meaningful/relevant in non-e10s?
  Yes.
2. Is the test meaningful/relevant in e10s?
  The test measures browser-side jank only. There is a general rule that the browser process should never sync-send to the content process, and as such should never block on it for data. If we regress this under e10s, something has landed that needs to be backed out. Also, since content has moved to a content process, the browser main thread should have much less work. Hence browser-side values should be much lower for e10s vs. non-e10s.
3. Is the comparison of the results between the above valid?
  Yes.
4. Caveats/Notes/platform-notes?
  It would be great if this test could be modified at some point to provide a combined jank value for both browser and content processes. This might come in handy when we move to multiple content processes as well.

ts_paint - startup paint
https://dxr.mozilla.org/mozilla-central/source/testing/talos/talos/startup_test/tspaint_test.html
1. Is the test meaningful/relevant in non-e10s?
  Yes.
2. Is the test meaningful/relevant in e10s?
  Yes.
3. Is the comparison of the results between the above valid?
  Yes, although there are questions about whether this test is really useful in e10s land. (See bug 1174767)

tpaint - window paint
https://dxr.mozilla.org/mozilla-central/source/testing/talos/talos/startup_test/tpaint.html
1. Is the test meaningful/relevant in non-e10s?
  Yes.
2. Is the test meaningful/relevant in e10s?
  Yes.
3. Is the comparison of the results between the above valid?
  Yes, we want to minimize regressions here. (See bug 1174770)

tresize - chrome window resize performance, which includes resizing remote layers
https://dxr.mozilla.org/mozilla-central/source/testing/talos/talos/startup_test/tresize/addon/content/tresize-test.html
1. Is the test meaningful/relevant in non-e10s?
  Yes.
2. Is the test meaningful/relevant in e10s?
  Yes.
3. Is the comparison of the results between the above valid?
  Yes, we want to minimize regressions here. (See bug 1179732.) Future work involves resizing tp5o pages to get a sense of the user experience. This would also exercise IPC and the content process more heavily.
Flags: needinfo?(jmathies)
I filed bug 1250717 for the main cause of the difference. It looks like it's still a bit faster with e10s, though that is somewhat believable since we don't have to deal with the throbber.
Flags: needinfo?(matt.woodrow)
mconley has taken ownership of tps
Flags: needinfo?(blassey.bugs)
> kats: tscrollx, tp5o_scroll (both ASAP)

> For each test, we'd like to answer the following questions:
> 1. Is the test meaningful/relevant in non-e10s?

Yes

> 2. Is the test meaningful/relevant in e10s?

Yes

> 3. Is the comparison of the results between the above valid?

Yes

> 4. Caveats/Notes/platform-notes?

Regressions in this test may get masked for APZ-driven scrolling methods in real-world usage. So for example a regression in this test may not result in user-visible jank when wheel-scrolling, because that is handled by APZ. But there are definitely cases where a regression in these tests can result in user-visible issues.
Flags: needinfo?(bugmail.mozilla)
(In reply to Avi Halachmi (:avih) from comment #0)
> davidb: a11y
> For each test, we'd like to answer the following questions:
> 1. Is the test meaningful/relevant in non-e10s?
> 2. Is the test meaningful/relevant in e10s?
> 3. Is the comparison of the results between the above valid?
> 4. Caveats/Notes/platform-notes?

Yes, yes, not sure; we'll probably want to add/change some talos tests when we are farther along with a11y support for e10s.
Flags: needinfo?(dbolter)
tps:

1. Is the test meaningful/relevant in non-e10s?

Yes 

2. Is the test meaningful/relevant in e10s?

Yes

3. Is the comparison of the results between the above valid?

Yes - BenWa did a bunch of work to ensure this.

4. Caveats/Notes/platform-notes?

N/A




TART (ASAP):

1. Is the test meaningful/relevant in non-e10s?

Yes 

2. Is the test meaningful/relevant in e10s?

Yes

3. Is the comparison of the results between the above valid?

Yes.

4. Caveats/Notes/platform-notes?

This test seems to be particularly fragile with e10s for some reason, and will time out. This might have to do with the fast creation/destruction of content processes, but I haven't dug into it too much.




CART (ASAP):

1. Is the test meaningful/relevant in non-e10s?

Yes 

2. Is the test meaningful/relevant in e10s?

Yes

3. Is the comparison of the results between the above valid?

Yes.

4. Caveats/Notes/platform-notes?

This test will open about:customizing, which will initially be opened in a remote browser, but once we find out that about:customizing is "blacklisted" (i.e., cannot be opened remotely), I believe the remoteness will flip, which might result in slightly more overhead and might explain the regression. That's just guesswork, though.



tp5:

1. Is the test meaningful/relevant in non-e10s?

Yes 

2. Is the test meaningful/relevant in e10s?

Yes, I think so.

3. Is the comparison of the results between the above valid?

Yes, I think so.

4. Caveats/Notes/platform-notes?

N/A.
Flags: needinfo?(mconley)
(In reply to Mike Conley (:mconley) - Needinfo me! from comment #16)
> TART (ASAP):
> ...
> 4. Caveats/Notes/platform-notes?
> 
> This test seems to be particularly fragile with e10s for some reason, and
> will time out. This might have to do with the fast creation/destruction of
> content processes, but I haven't dug into it too much.

This is apparently because changing the newtab URL is now async (it was sync in the past), which sometimes ended with the test hanging.

However, changing the newtab URL is not measured by TART (it's just preparation between measurements), so as long as it doesn't hang, it does provide meaningful numbers.

This is handled at bug 1246695 and currently uplifted (it needed a slightly different patch for Aurora, IIRC).
(In reply to Avi Halachmi (:avih) from comment #0)
> mstange (+mchang): ASAP as a tool to measure throughput.

It is a valid tool. Measuring throughput with requestAnimationFrame in ASAP mode measures the whole pipeline, including painting (one guaranteed paint per animation frame, if something needs painting) and composition.
I can't think of a case where ASAP mode would cause us to miss real regressions. On the other hand, it can help us catch regressions during animation frames that are faster than the vsync frame budget - regressions that would be real on slower machines. It's very unlikely that ASAP mode will cause us to catch a regression where there is *no* real regression, not even on slower machines. So in conclusion, it's valid.
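As a rough illustration of the measurement described above, here's a minimal rAF-based throughput sketch (illustrative only, not the actual test code; it assumes ASAP mode, i.e. layout.frame_rate=0, so frames aren't throttled to vsync):

// Measure the mean frame-to-frame interval over a fixed duration.
// In ASAP mode this interval approximates the real cost of the
// paint/composite pipeline rather than the vsync period.
function measureThroughput(durationMs, done) {
  const intervals = [];
  let start = null;
  let last = null;

  function frame(now) {
    if (last === null) {
      start = now;                // skip the first callback (no interval yet)
    } else {
      intervals.push(now - last); // one full pipeline iteration
    }
    last = now;
    if (now - start < durationMs) {
      requestAnimationFrame(frame);
    } else {
      const mean = intervals.reduce((a, b) => a + b, 0) / intervals.length;
      done(mean);                 // lower mean interval = higher throughput
    }
  }
  requestAnimationFrame(frame);
}

measureThroughput(5000, mean =>
  console.log(`mean frame interval: ${mean.toFixed(2)}ms`));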


> seth: tsvgx (ASAP)

I'll answer this one.

> For each test, we'd like to answer the following questions:
> 1. Is the test meaningful/relevant in non-e10s?

Yes.

> 2. Is the test meaningful/relevant in e10s?

Yes.

> 3. Is the comparison of the results between the above valid?

Yes.

> 4. Caveats/Notes/platform-notes?

It looks like this test is about 50% faster in e10s on Windows 7. This should be investigated.
I'm curious about comparisons on other platforms, and especially about whether GPU content acceleration could be responsible for the difference.
Flags: needinfo?(seth)
Flags: needinfo?(mstange)
> jgilbert: glterrain (ASAP)
> 
> For each test, we'd like to answer the following questions:
> 1. Is the test meaningful/relevant in non-e10s?
Yes.

> 2. Is the test meaningful/relevant in e10s?
Yes.

> 3. Is the comparison of the results between the above valid?
Yes.

> 4. Caveats/Notes/platform-notes?
Perf with e10s should be greater-than-or-equal-to non-e10s perf.
It should be largely the same.
Flags: needinfo?(jgilbert)
The 60% improvement on WinXP is likely because we aren't doing readback+upload+composite synchronously on the main thread now.
From bug 1198890, how about Dromaeo CSS?

https://wiki.mozilla.org/Buildbot/Talos/Tests#Dromaeo_Tests
Flags: needinfo?(bzbarsky)
See comment 1?  I'm not sure what the question is....
Flags: needinfo?(bzbarsky)
My bad, I didn't see Dromaeo had already been queried. Thanks!
snorp: are you the right owner for the tcanvasmark test?
Assigning this bug to Avi because he is driving the investigation of these talos tests' validity. We want all blocking bugs to have an owner.
Assignee: nobody → avihpit
Depends on: 1252768
I won't claim to be the owner of canvasmark, but it should be fine to run in e10s and non-e10s.
Flags: needinfo?(snorp)
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #27)
> I won't claim to be the owner of canvasmark, but it should be fine to run in
> e10s and non-e10s.

He won't claim it, but snorp nonetheless accepted ownership on IRC.
is there more work to do on this bug?
Flags: needinfo?(avihpit)
Depends on: 1254121
I suggest keeping this open until we've accepted the status of all the tests.

Open issues as far as I can tell:
1. We're still not conclusive on tsvg_opacity (bug 1250717).
2. IMO tsvgx is still not good (bug 1254121).
3. tp5o %CPU still doesn't have an owner (see comment 5).

Joel, can you help with 3?
Flags: needinfo?(avihpit) → needinfo?(jmaher)
Regarding the %cpu, I don't even know if that is the right measurement to take. It makes sense logically to measure the CPU, but as to what makes sense with the counters, this stuff has been around forever. I really don't want to own this piece. As for CPU, we would need to measure both the main and content processes, and maybe report them separately?
Flags: needinfo?(jmaher)
If we don't know who added, owns, or is familiar with the tp5o %cpu tests, and as a result have zero info on their relevance, then I suggest we drop them altogether.
I am all for dropping tp5o %cpu; we have:
* win7 - % processor time
* linux64 - XRes

should I drop both?
IMO wherever this applies:

> we don't know who added, owns, or is familiar with the tp5o %cpu tests, and
> as a result have zero info on their relevance

I'm completely unfamiliar with the tp5o counters tests. If it applies to win7 %cpu and XRes, then IMO yes for both.
Depends on: 1254628
We have some tests which use MozAfterPaint (a11y, ts_paint, tpaint, tresize, tp5o, tsvg_opacity). These tests have been using it for many years, but I don't fully trust how it's used (long legacy).

Recent unexpected (IMO) results at bug 1254121 (fixing tsvg_opacity for e10s) and bug 1254121 (fixing tsvgx for e10s) made me think we should re-assess how well we're using MozAfterPaint in talos.

I know it's a bit late to start doubting a big chunk of tests, but hopefully we can clear at least some relatively quickly.

Handling this at bug 1254121.
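For reference, the pattern these tests rely on looks roughly like this (an illustrative sketch, not the actual talos code; MozAfterPaint is Gecko-only, and content pages only receive it when dom.send_after_paint_to_content is enabled):

// Time from an action to the next MozAfterPaint. The caveat above:
// the first MozAfterPaint after the action isn't guaranteed to be the
// paint *caused by* that action, especially across e10s processes.
function timeToFirstPaintAfter(action) {
  return new Promise(resolve => {
    const start = performance.now();
    window.addEventListener("MozAfterPaint", function onPaint() {
      window.removeEventListener("MozAfterPaint", onPaint);
      resolve(performance.now() - start);
    });
    action(); // e.g. mutate the DOM, open a window, resize, ...
  });
}

timeToFirstPaintAfter(() => { document.body.style.background = "red"; })
  .then(ms => console.log(`first MozAfterPaint after ${ms.toFixed(2)}ms`));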
ts_paint, tpaint, tresize are good (thanks Jim!). Three to go.
(In reply to Avi Halachmi (:avih) from comment #35)
> ... results at bug 1254121 .. and bug 1254121... Handling this at bug 1254121.

Oops, seems I need to get back to basics with copy-paste (thanks kats! :) ).

tsvg_opacity e10s fix: bug 1250717
tsvgx e10s fix:        bug 1254121
Assess MozAfterPaint:  bug 1254628
Depends on: 1250717
Priority: -- → P3
I believe there is nothing actionable to do here.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX