Closed
Bug 1088898
Opened 10 years ago
Closed 10 years ago
8% Win7|8 Paint regression on b2g-inbound (v.36) Oct 21 from push c4b63beb3d76
Categories
(Core Graveyard :: Widget: Gonk, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
mozilla36
People
(Reporter: jmaher, Assigned: mchang)
References
Details
(Keywords: perf, regression, Whiteboard: [talos_regression])
Attachments
(4 files)
I saw this from an alert on mozilla-central after b2g-inbound merged in; here is a view of the graph:
graphs.mozilla.org/graph.html#tests=[[82,203,31],[82,203,25]]&sel=1413577757567,1414182557567&displayrange=7&datatype=runningl
I did some retriggers on tbpl:
https://tbpl.mozilla.org/?startdate=2014-10-22&enddate=2014-10-23&tree=B2g-Inbound&jobname=Windows%207%2032-bit%20b2g-inbound%20talos%20other_nol64
The tpaint values jump from ~190 to >200 as a result of this patch:
https://hg.mozilla.org/integration/b2g-inbound/rev/c4b63beb3d76
Here is more information about tpaint:
https://wiki.mozilla.org/Buildbot/Talos/Tests#tpaint
Reporter
Comment 1 • 10 years ago
here is a better link that shows the affected change:
https://tbpl.mozilla.org/?startdate=2014-10-22&enddate=2014-10-23&tree=B2g-Inbound&jobname=Windows%207%2032-bit%20b2g-inbound%20talos%20other_nol64&fromchange=15099ba111e8&tochange=badc56be9f64
Assignee
Comment 2 • 10 years ago
Trying two things to confirm the regression (a sketch of the pref-caching idea follows this list):
1) Caching all reads of gfxPrefs::VsyncAlignedCompositor - https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=a2178b1bf386
2) Backout changes to the compositor made in bug 1048667 - https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=edf82a59dfbc
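Roughly, the caching idea in item 1 looks something like the standalone sketch below. This assumes the usual pattern of snapshotting the value once instead of querying gfxPrefs on every composite; the gfxPrefs namespace here is a stub standing in for Gecko's real accessor, and the class name is hypothetical, not the actual try patch.

#include <cstdio>

// Stand-in for Gecko's gfxPrefs accessor; in the tree this is a live pref
// read that the original patch queried on every composite.
namespace gfxPrefs {
inline bool VsyncAlignedCompositor() { return true; }
}

// Hypothetical holder that snapshots the pref once at construction instead
// of re-reading it each time.
class CompositorConfigSketch {
public:
    CompositorConfigSketch()
        : mVsyncAligned(gfxPrefs::VsyncAlignedCompositor()) {}
    bool VsyncAligned() const { return mVsyncAligned; }

private:
    const bool mVsyncAligned; // cached value, never re-read
};

int main() {
    CompositorConfigSketch config;
    std::printf("vsync-aligned compositor: %d\n", config.VsyncAligned());
    return 0;
}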
Assignee
Updated • 10 years ago
Assignee: nobody → mchang
Status: NEW → ASSIGNED
Assignee
Comment 3 • 10 years ago
Hi Joel,
I think I have talos running locally, but I'm confused by the output (attached). The data I see in the console output differs from what I see in the talos result file.
For example, in the console output I see this:
First
13:41:28 INFO - avgOpenTime:136.82
13:41:28 INFO - minOpenTime:117.90
13:41:28 INFO - maxOpenTime:199.15
13:41:28 INFO - medOpenTime:126.7352086536921
But in the talos log I see this:
START
VALUES
latte,tpaint,,eb2ce4d501c1,20141025112119,1414269651
0,126.64,NULL
END
Any idea why there is a difference in what's being reported?
Also, do you know what mozconfig is being used on try servers for Windows builds? I tried reading the MDN Talos pages and saw only VMs, not specific .mozconfigs for reference machines. Thanks!
Flags: needinfo?(jmaher)
Reporter
Comment 4 • 10 years ago
For tpaint (http://hg.mozilla.org/build/talos/file/9dfad0333c76/talos/test.py#l145), we drop the first 5 data points and take the median of the remaining 20. The summary printed to the console doesn't do any of that filtering or any alternative calculations.
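For illustration, here is a rough standalone sketch of that filtering (written in C++ here; the real logic lives in talos's Python test.py, and this is not the actual code): drop the first 5 replicates and take the median of the rest, which is why the reported value can differ from the console average.

#include <algorithm>
#include <cstddef>
#include <vector>

// Drop the first 5 replicates, then return the median of the rest. The
// console avg/min/max are computed over all replicates, which is why the
// two summaries disagree.
double TpaintSummary(std::vector<double> replicates) {
    const std::size_t kIgnoreFirst = 5;
    if (replicates.size() <= kIgnoreFirst) {
        return 0.0; // not enough data; illustrative only
    }
    std::vector<double> kept(replicates.begin() + kIgnoreFirst, replicates.end());
    std::sort(kept.begin(), kept.end());
    const std::size_t n = kept.size();
    return (n % 2 == 1) ? kept[n / 2]
                        : 0.5 * (kept[n / 2 - 1] + kept[n / 2]);
}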
Looking at the two try runs, the "Caching all the reads to gfxPrefs::VsyncAlignedCompositor" didn't seem to affect the numbers (or if it did, it was very minor). But the second push with a backout did revert the numbers.
Flags: needinfo?(jmaher)
Assignee
Comment 5 • 10 years ago
Do you know what the .mozconfig is for this test on try servers?
Flags: needinfo?(jmaher)
Assignee
Comment 6 • 10 years ago
I ran a few more tests, all on Windows 7 SP1 with the latest available Windows updates, built with Visual Studio Express 2013 and the latest mozilla-build. I also changed the talos test to run 500 replicates per run instead of 20.
I see no discernible difference between master and the backout patch. Also, the compositor code runs on both Linux and OS X, neither of which showed a regression. Is there any reason we use the median rather than the average? Can we get the average data from talos?
Reporter
Comment 7 • 10 years ago
For talos we don't have many options for getting raw data with our current system. If Datazilla could handle the volume of talos data we could get it there, and there is also work underway on Treeherder, which collects this data as well.
There could be a difference in the builds being generated, in running via buildbot+mozharness, or in the machines themselves. These are the machines we run on:
https://wiki.mozilla.org/Buildbot/Talos/Misc#Hardware_Profile_of_machines_used_in_automation
I wonder whether the regression would reproduce locally if you downloaded the builds from before and after the patch. If we cannot reproduce it locally, that does reduce the severity of this.
Flags: needinfo?(jmaher)
Assignee
Comment 8 • 10 years ago
Here is some timing data from 10 runs of the prebuilt binaries from comment 2. "Master" is the gfx-pref-caching build, which still showed the regression on try. "Backout" is the patch that fixed the regression on try. All tests were run 10 times with a reboot before the 10 runs, and the first run after each reboot was discarded. I used this talos command:
talos -n -d --develop --executablePath=~/Desktop/master/ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/mchang@mozilla.com-a2178b1bf386/try-win32/firefox-36.0a1.en-US.win32/firefox/firefox.exe --activeTests tpaint
I don't see a very large or sustained regression in the median or the mean. The only big difference is that there are some lower outliers with the backout version. Interestingly, the standard deviation is also lower without the backout patch.
However, after running the 10 tests for the backout patch, I ran master again; that is the Backout/Master average in the timing data. I see another outlier for master that's as fast as the backout patch, so I'm not convinced this is a real regression yet.
Reporter
Comment 9 • 10 years ago
if you look at the trend here with a lot of data points:
http://graphs.mozilla.org/graph.html#tests=%5B%5B82,131,25%5D%5D&sel=none&displayrange=7&datatype=running
there is overlap, but the range of values has gone from 185-200 to what looks like 195-220 since this patch landed. You can see the range of values is a lot larger after this patch, and the graph shows that clearly.
:avih, do you have any additional thoughts here? Mason has done a lot of legwork without pinpointing a specific cause. While it is a problem in our automation, we don't see it locally, and it is hard to reproduce or verify a fix unless we do a backout (on try).
Flags: needinfo?(avihpit)
Comment 10 • 10 years ago
http://graphs.mozilla.org/graph.html#tests=[[82,131,25]]&sel=none&displayrange=30&datatype=running
The graph server does show a clear regression IMO, with the average going from 190 to 206 and becoming noisier, though I didn't try to identify the offending changeset.
However, I'm not familiar with the paint test, so I don't really know what the result depends on or how well paint regressions can be reproduced locally.
Maybe Vladan knows more?
Flags: needinfo?(avihpit) → needinfo?(vdjeric)
Comment 11 • 10 years ago
Why not get a Talos loaner from Joel and analyze it on there?
Flags: needinfo?(vdjeric)
Assignee
Comment 12 • 10 years ago
I've been trying to bisect this through try. Instead of deleting code, I started by backing out the compositor changes and slowly adding parts back in.
What I've tried so far:
Just adding the compositor vsync observer as an empty class
https://tbpl.mozilla.org/?tree=Try&rev=3ef3c1276e9d
Looks ok.
Change the CompositorParent::CompositeCallback signature to accept a mozilla::TimeStamp as a parameter
http://dxr.mozilla.org/mozilla-central/source/gfx/layers/ipc/CompositorParent.cpp#780
https://tbpl.mozilla.org/?tree=Try&rev=de39a6d6700d
Looks ok
Create the compositor vsync observer in the CompositorParent constructor
https://tbpl.mozilla.org/?tree=Try&rev=ad0d9f2e9671
Re-add this if statement - http://dxr.mozilla.org/mozilla-central/source/gfx/layers/ipc/CompositorParent.cpp?from=CompositorParent.cpp&case=true#370
Looks ok
Change mLastCompose to the schedule time
https://tbpl.mozilla.org/?tree=Try&rev=f8813a553d06
http://dxr.mozilla.org/mozilla-central/source/gfx/layers/ipc/CompositorParent.cpp?from=CompositorParent.cpp&case=true#788
Looks bad
Trying a few more tries:
Just the compositor backout plus moving the mLastCompose update into CompositeCallback
https://tbpl.mozilla.org/?tree=Try&rev=93ae924d082b
And adding the compositor vsync observer, creating it in the CompositorParent constructor, and moving the mLastCompose update into CompositeCallback
https://tbpl.mozilla.org/?tree=Try&rev=62a57eef7cf3
Hopefully one of the last two pushes will reproduce the regression.
Assignee
Comment 13 • 10 years ago
This one seemed to reproduce the regression:
https://tbpl.mozilla.org/?tree=Try&rev=93ae924d082b
I started with backing out the compositor changes and just moving the mLastCompose update into CompositeCallback - https://hg.mozilla.org/try/rev/2dadd9634e5c. This exposed a logic bug where we didn't set mLastCompose in all cases, though mLastCompose is only used for OMTA sampling and COMPOSITOR_PERFORMANCE_WARNINGS.
Here is another try push where we correctly set mLastCompose in all places, but I still don't understand why this would cause the regression. I want to see whether this last try fixes the issue, and then we can decide what to do next: https://tbpl.mozilla.org/?tree=Try&rev=694f0a3091e3
Assignee
Comment 14 • 10 years ago
Previously, we set mLastCompose during CompositorParent::CompositeToTarget. Now we set it during CompositeCallback but forgot to set it during forced flushes, which seems to have caused the regression. Correctly setting mLastCompose before every call to CompositeToTarget seems to have fixed the regression, as shown here - https://tbpl.mozilla.org/?tree=Try&rev=6720e1c23d64
I'm still not sure why this actually caused a regression. From reading the code, MozAfterPaint seems to be triggered on the main thread during painting, not compositing. The only other explanation I can think of is here - http://dxr.mozilla.org/mozilla-central/source/gfx/layers/ipc/CompositorParent.cpp?from=CompositorParent.cpp&case=true#754 - that when we forget to set mLastCompose, we schedule the next composite ASAP. But I didn't see this code path being hit while manually creating a new window on OS X or Windows, nor did I see any ForceComposites. I'm testing to see if I can reproduce this locally with talos, but I don't want to hold up the fix.
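For reference, a simplified, self-contained sketch of the shape of the fix (stand-in types and invented, simplified signatures; not the actual CompositorParent.cpp patch): update mLastCompose on every path that leads into CompositeToTarget, including forced flushes, not just the vsync-scheduled callback.

#include <chrono>

// Stand-in for mozilla::TimeStamp in this sketch.
using TimeStamp = std::chrono::steady_clock::time_point;

class CompositorParentSketch {
public:
    // Vsync-scheduled composite: record the schedule time as the last compose.
    void CompositeCallback(TimeStamp aScheduleTime) {
        mLastCompose = aScheduleTime;
        CompositeToTarget();
    }

    // Forced flush: this path previously skipped updating mLastCompose,
    // which is the omission described above.
    void ForceComposeToTarget() {
        mLastCompose = std::chrono::steady_clock::now();
        CompositeToTarget();
    }

private:
    void CompositeToTarget() {
        // ... compositing work; in the real code mLastCompose feeds OMTA
        // sampling and COMPOSITOR_PERFORMANCE_WARNINGS ...
    }

    TimeStamp mLastCompose{};
};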
Attachment #8513683 - Flags: review?(bgirard)
Updated • 10 years ago
Attachment #8513683 - Flags: review?(bgirard) → review+
Assignee
Comment 15 • 10 years ago
Comment 16 • 10 years ago
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla36
Reporter
Comment 17 • 10 years ago
I have confirmed this test has been fixed on graph server:
http://graphs.mattn.ca/graph.html#tests=%5B%5B82,131,31%5D,%5B82,131,25%5D%5D&sel=1414475569364.633,1414654453871.6753&displayrange=7&datatype=running
Thanks for taking the extra time to fix this.
Updated • 6 years ago
Product: Core → Core Graveyard