Closed Bug 1178353 Opened 9 years ago Closed 9 years ago

windows xp svgr talos e10s only regression on m-c

Categories

(Core :: Graphics: ImageLib, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking


RESOLVED WORKSFORME
Tracking Status
e10s m8+ ---
firefox41 + disabled
firefox42 + disabled
firefox43 + disabled
firefox44 - fixed
firefox45 --- ?

People

(Reporter: jmaher, Assigned: seth)

References

(Blocks 1 open bug)

Details

(Keywords: perf, regression, Whiteboard: [talos_regression])

We have two e10s-only regressions on WinXP Talos svgx.

You can see from the graph that we have two bumps (http://graphs.mozilla.org/graph.html#tests=[[281,1,45],[281,1,37]]&sel=1432965814114,1435557814114&displayrange=30&datatype=geo):
0) ~265.00 (baseline)
1) ~285.00, a ~7% regression, June 19th: http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=2694ff2ace6a&tochange=c319f262ce3e
2) ~305.00, another ~7% regression, June 22nd: http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=cddf3b36b5e2&tochange=be81b8d6fae9

In bisecting the second regression, I have a series of (non-PGO) try pushes:
hg update 026e77985e59: https://treeherder.mozilla.org/#/jobs?repo=try&revision=011553447470
hg update 2505945b9d43: https://treeherder.mozilla.org/#/jobs?repo=try&revision=9e249ca2733e
hg update 953f8ac1e47e: https://treeherder.mozilla.org/#/jobs?repo=try&revision=7a41f2e99e7d
hg update 02f640a72dcd: https://treeherder.mozilla.org/#/jobs?repo=try&revision=1c17a0df1e32
hg update 69a9775dc5c9: https://treeherder.mozilla.org/#/jobs?repo=try&revision=44c859f0e373
hg update 513d62fe75c9: https://treeherder.mozilla.org/#/jobs?repo=try&revision=9ba309fffb34

Ideally we will see which of these helps narrow down the regression so we can run another round and find the offending push.

There is still work to do for the first regression.
Keywords: perf, regression
Whiteboard: [talos_regression]
For the first regression, the scope is narrowed to just a single m-c push:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=b6fe3099f8f6&tochange=c319f262ce3e
The range is now between:
hg update 69a9775dc5c9: https://treeherder.mozilla.org/#/jobs?repo=try&revision=44c859f0e373
hg update 513d62fe75c9: https://treeherder.mozilla.org/#/jobs?repo=try&revision=9ba309fffb34

which translates to:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&fromchange=c319f262ce3e&tochange=69a9775dc5c9&filter-searchStr=talos%20svg%20win%20xp

and the pushlog:
http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=c319f262ce3e&tochange=69a9775dc5c9

-----------

hg update 2afd5728f13a: https://treeherder.mozilla.org/#/jobs?repo=try&revision=25a17a9ad784
hg update a79cb8e688cd: https://treeherder.mozilla.org/#/jobs?repo=try&revision=b96a3fedcab0
hg update 51168e14388c: https://treeherder.mozilla.org/#/jobs?repo=try&revision=3676e4ea5862

These will take a few hours to complete, good times! One interesting possibility is that we might have a single regression that happened to show up oddly between two pushes; we will see if that is the case. Folks, feel free to hop in here and add to it.
and the winner is bug 1174923:
https://hg.mozilla.org/integration/mozilla-inbound/rev/a79cb8e688cd

Seth, as the patch author of this, can you comment on what options we have to fix this?
Component: Talos → ImageLib
Flags: needinfo?(seth)
Product: Testing → Core
Blocks: 1174923
Blocks: 1144120
(In reply to Joel Maher (:jmaher) from comment #3)
> and the winner is bug 1174923:
> https://hg.mozilla.org/integration/mozilla-inbound/rev/a79cb8e688cd
> 
> Seth, as the patch author of this, can you comment on what options we have
> to fix this?

I have no idea why firing the document load event *earlier* should cause a regression in svgr performance. Then again, I don't know what the test actually does.

Here's my guess as to what's happening: this test waits for the load event, and then measures how long it takes to do some things that require images to be decoded. (For example, perhaps it draws images to a canvas, or uses them as masks.) Previously we waited for the images to be decoded before firing the load event, so the time it takes to decode those images was not counted in the measurement. Now that we fire the load event earlier, we are probably counting that image decoding time as part of the measurement, causing an apparent regression.

If my guess is correct, then the best bet for fixing this problem is to fix the test to ensure that images are decoded in some other way before the clock starts. One option is to draw them into a canvas, which will force them to be decoded.
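Roughly something like this before the timed portion starts (just a sketch of the idea, not the actual Talos harness code; startMeasurement() is a hypothetical stand-in for however the test begins its timer):

function forceDecode(images) {
  var canvas = document.createElement("canvas");
  var ctx = canvas.getContext("2d");
  for (var i = 0; i < images.length; i++) {
    var img = images[i];
    canvas.width = img.naturalWidth || 1;
    canvas.height = img.naturalHeight || 1;
    // Drawing the image into a canvas forces it to be decoded now,
    // so decode time doesn't land inside the measurement below.
    ctx.drawImage(img, 0, 0);
  }
}

window.addEventListener("load", function () {
  forceDecode(document.images);
  startMeasurement(); // hypothetical: begin timing only after decoding is done
});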
Flags: needinfo?(seth)
As a note, there were two issues mentioned in the first comment; this bug is now dealing with the second issue, and bug 1178643 is dealing with the first.
Hmmm. I'm still not sure how to explain this - I can't think of any explanation for why composite-scale-rotate-opacity.svg should take *five times* as long with this patch applied. Especially since these are page load tests, if the directory they're stored in is to be believed - that suggests that firing the load event earlier should help rather than cause a regression.

However, with such a strong signal, it should be possible to reproduce this locally. I'll take a look tomorrow and see if I can learn anything.
By the way Joel, thanks for the additional details. That will make investigation much easier.
Maybe mconley can help as well. I am on PTO today/tomorrow, so any response will be spotty. I am happy to help investigate where possible; the trick with running on try is that we need this in e10s mode, which isn't easily available on try.
OK, I see this Talos test regress when this landed the first time (as narrowed down in bug 1178643). What is odd is that we don't recover on the backout. We have enough retriggers to show that the test stays regressed each time on the backout.

This is confusing nonetheless, but it does keep pointing at the code in question here.
[Tracking Requested - why for this release]: regression in 41
Assignee: nobody → seth
tracking-e10s: --- → m8+
Joel, do I understand you correctly that there's no way of testing this on try?

Can we fix that?
Right now the best way to run Talos with e10s on try is to edit talos.json in your tree:
diff --git a/testing/talos/talos.json b/testing/talos/talos.json
--- a/testing/talos/talos.json
+++ b/testing/talos/talos.json
@@ -1,16 +1,16 @@
 {
     "talos.zip": {
         "url": "http://talos-bundles.pvt.build.mozilla.org/zips/talos.a6052c33d420.zip",
         "path": ""
     },
     "global": {
-        "talos_repo": "https://hg.mozilla.org/build/talos",
-        "talos_revision": "4a8d22dd38c4"
+        "talos_repo": "http://hg.mozilla.org/users/mconley_mozilla.com/e10s-talos",
+        "talos_revision": "default"
     },
     "extra_options": {
         "android": [ "--apkPath=%(apk_path)s" ]
     },
     "suites": {
         "chromez": {
             "tests": ["tresize", "tcanvasmark"]
         },



There are some quirks to sort out, but we need to get this as an option in the Talos try chooser.
Tracking in 41 because this is a regression.
Just posting to note that some other issues have kept me from focusing on this, but it's still a high priority for me and I plan to come back to it soon.
Don't see this getting uplifted to 41.
We disabled e10s in 42 when it moved to beta.
Still enabled in 43. Requesting tracking to bring this to Liz's radar.
Ok, tracking for 43. Is this also affecting 44?
Seth, is this still a high priority?
Flags: needinfo?(seth)
(In reply to Brad Lassey [:blassey] (use needinfo?) from comment #21)
> Seth, is this still a high priority?

Yes, but not compared to (1) other regressions we've had recently that more directly impact users, and (2) B2G 2.5 needs. Hence I haven't had a chance to address it yet.

That said, there are unrelated reasons to stop sync decoding images in nsSVGImageFrame, which should fix this regression, so I expect it to get fixed soon.
Flags: needinfo?(seth)
I assume this is still an issue for Firefox 44/45?
We only run the tests on trunk, although the code is still a problem on all branches.
Blocks: e10s-perf
Adding a speculative dependency on bug 1217571, which I cannot *believe* managed to exist for so long with no one noticing. Given that this is e10s-specific and involves painting the same image many, many times, it's quite possible that bug 1217571 may fix this.
Depends on: 1217571
Joel, let's watch what happens once bug 1217571 gets merged. Needinfo'ing you to make you aware of the situation.
Flags: needinfo?(jmaher)
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WORKSFORME
No need to track this for FF44, as bug 1217571 is already tracked; marking this as fixed on 44 based on the last comment.
Marking disabled for 43 since we are deferring the e10s experiment to 44.