Bug 1088898 (Closed) - Opened 10 years ago, Closed 10 years ago

8% Win7|8 Paint regression on b2g-inbound (v.36) Oct 21 from push c4b63beb3d76

Categories

(Core Graveyard :: Widget: Gonk, defect)

Platform: All
OS: Windows 7
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

Status: RESOLVED FIXED
Target Milestone: mozilla36

People

(Reporter: jmaher, Assigned: mchang)

References

Details

(Keywords: perf, regression, Whiteboard: [talos_regression])

Attachments

(4 files)

I saw this from an alert on mozilla-central after b2g-inbound merged in. Here is a view on the graph: graphs.mozilla.org/graph.html#tests=[[82,203,31],[82,203,25]]&sel=1413577757567,1414182557567&displayrange=7&datatype=running

I did some retriggers on tbpl: https://tbpl.mozilla.org/?startdate=2014-10-22&enddate=2014-10-23&tree=B2g-Inbound&jobname=Windows%207%2032-bit%20b2g-inbound%20talos%20other_nol64

The paint values jump from ~190 to >200 as a result of this patch: https://hg.mozilla.org/integration/b2g-inbound/rev/c4b63beb3d76

Here is more information about tpaint: https://wiki.mozilla.org/Buildbot/Talos/Tests#tpaint
See Also: → 1048667
Trying two things to confirm the regression:

1) Caching all the reads to gfxPrefs::VsyncAlignedCompositor - https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=a2178b1bf386

2) Backing out the changes to the compositor made in bug 1048667 - https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=edf82a59dfbc
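[Editor's note: for readers unfamiliar with live prefs, here is a minimal sketch of what try push 1) amounts to, assuming gfxPrefs::VsyncAlignedCompositor() is a live-pref getter that is cheap but not free on a hot path. The class and member names are illustrative, not the actual patch:

// Illustrative sketch only -- not the actual patch. The idea: read the
// pref once and keep the result, instead of calling the live getter on
// every composite.
class VsyncPrefCache {
public:
  VsyncPrefCache()
    : mVsyncAligned(gfxPrefs::VsyncAlignedCompositor()) // single pref read
  {}

  // Hot path: returns the cached value with no pref lookup.
  bool VsyncAligned() const { return mVsyncAligned; }

private:
  const bool mVsyncAligned;
};

As a later comment notes, this push did not move the numbers, which ruled the pref reads out as the cause.]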
Assignee: nobody → mchang
Status: NEW → ASSIGNED
Attached file Talos Output
Hi Joel, I think I got talos running locally, but I'm confused about the output (attached). The data in the console output differs from what I see in the talos result file. For example, in the console output I see this:

13:41:28 INFO - avgOpenTime:136.82
13:41:28 INFO - minOpenTime:117.90
13:41:28 INFO - maxOpenTime:199.15
13:41:28 INFO - medOpenTime:126.7352086536921

But in the talos log I see this:

START
VALUES
latte,tpaint,,eb2ce4d501c1,20141025112119,1414269651
0,126.64,NULL
END

Any idea why there is a difference in what's being reported? Also, do you know which mozconfig is used on the try servers for Windows builds? I tried reading the MDN Talos pages and saw only VMs, not specific .mozconfigs for the reference machines. Thanks!
Flags: needinfo?(jmaher)
For tpaint (http://hg.mozilla.org/build/talos/file/9dfad0333c76/talos/test.py#l145), we drop the first 5 data points and take the median of the remaining 20 data points. The summary printed out by talos doesn't do any of that filtering or alternative calculation, which is why the two numbers differ.

Looking at the two try runs: the "Caching all the reads to gfxPrefs::VsyncAlignedCompositor" push didn't seem to affect the numbers (or if it did, only very slightly), but the second push with the backout did revert the numbers.
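[Editor's note: to make the filtering concrete, here is a small self-contained sketch of the summarization described above; the function name and the even-length median behavior are assumptions, not the exact talos code:

#include <algorithm>
#include <vector>

// Sketch of the tpaint reporting rule described above: ignore the first
// 5 replicates, then report the median of the remaining 20. The console
// avg/min/max lines are computed over the raw replicates, which is why
// they differ from the value in the result file.
double TpaintReportedValue(const std::vector<double>& aReplicates) {
  const size_t kIgnoreFirst = 5;
  if (aReplicates.size() <= kIgnoreFirst) {
    return 0.0; // not enough data; real runs have 25 replicates
  }
  std::vector<double> kept(aReplicates.begin() + kIgnoreFirst,
                           aReplicates.end());
  std::sort(kept.begin(), kept.end());
  const size_t n = kept.size();
  // Median: middle element for odd n, mean of the two middle ones for even n.
  return (n % 2 == 1) ? kept[n / 2]
                      : (kept[n / 2 - 1] + kept[n / 2]) / 2.0;
}

This also explains the numbers in the previous comment: the console median (126.735...) is over all replicates, while the reported value (126.64) is over the filtered set.]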
Flags: needinfo?(jmaher)
Do you know what the .mozconfig is for this test on try servers?
Flags: needinfo?(jmaher)
Attached file Timing Data
I ran a few more tests. All tests were done on Windows 7 SP1, patched with the latest available Windows updates, using Visual Studio Express 2013 and the latest mozilla-build. I also changed the talos test to do 500 runs instead of 20 runs per run. I see no discernible difference between master and the backout patch. Also, the compositor code runs on both Linux and OS X, which showed no regression. Is there any reason we use the median rather than the average? Can we get the average data from talos?
For talos we don't have many options to get raw data with our current system. If datazilla were robust enough for the volume of talos data, we could get it there; there is also work underway on treeherder, which is collecting this data as well.

There could be a difference in the builds being generated, in running via buildbot+mozharness, or in the machines themselves. These are the machines we run on: https://wiki.mozilla.org/Buildbot/Talos/Misc#Hardware_Profile_of_machines_used_in_automation

I wonder if it would reproduce locally if you downloaded the builds from before/after the patch. Assuming we cannot see the fix locally, that does reduce the severity of this.
Flags: needinfo?(jmaher)
Here is some timing data of 10 runs from the prebuilt binaries from comment 2. Master is the gfx-caching build, which still showed the regression on try. Backout is the patch that showed the regression fixed. All tests were run 10 times, with a reboot before each set of 10 runs; the first run after a reboot was discarded. I used this talos command:

talos -n -d --develop --executablePath=~/Desktop/master/ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/mchang@mozilla.com-a2178b1bf386/try-win32/firefox-36.0a1.en-US.win32/firefox/firefox.exe --activeTests tpaint

I don't see a very large or sustained regression in the median or the mean. The only big difference is that there are some lower outliers with the backout version. Interestingly, the standard deviation is also lower without the backout patch. However, after running the 10 tests for the backout patch, I tried running master again; this is the Backout/Master avg in the timing data. I see another outlier for master that's as fast as the backout patch, so I'm not convinced this is a real regression yet.
If you look at the trend here with a lot of data points: http://graphs.mozilla.org/graph.html#tests=%5B%5B82,131,25%5D%5D&sel=none&displayrange=7&datatype=running

There is overlap, but the range of values has gone from 185-200 to what looks like 195-220 after this patch landed. The range of values is a lot larger after this patch, and the graph shows that clearly.

:avih, do you have any additional thoughts here? Mason has done a lot of legwork with no specific pinpoint for this. While it is a problem in our automation, we don't see it locally, and it is hard to reproduce or see a fix unless we do a backout (on try).
Flags: needinfo?(avihpit)
http://graphs.mozilla.org/graph.html#tests=[[82,131,25]]&sel=none&displayrange=30&datatype=running

The graph server does show a clear regression IMO, with the average going from 190 to 206 and becoming noisier, though I didn't try to identify the offending changeset. However, I'm not familiar with the paint test, so I don't really know what the results depend on or how well paint regressions can be reproduced locally. Maybe Vladan knows more?
Flags: needinfo?(avihpit) → needinfo?(vdjeric)
Why not get a Talos loaner from Joel and analyze it on there?
Flags: needinfo?(vdjeric)
I've been trying to bisect this through the try servers. Instead of deleting code, I started with backing out the compositor changes and slowly adding parts back in. What I've tried so far:

1) Just adding the compositor vsync observer as an empty class - https://tbpl.mozilla.org/?tree=Try&rev=3ef3c1276e9d - Looks ok.

2) Changing the CompositorParent::CompositeCallback signature to accept a mozilla::TimeStamp as a parameter (http://dxr.mozilla.org/mozilla-central/source/gfx/layers/ipc/CompositorParent.cpp#780) - https://tbpl.mozilla.org/?tree=Try&rev=de39a6d6700d - Looks ok.

3) Creating the compositor vsync observer in the CompositorParent constructor - https://tbpl.mozilla.org/?tree=Try&rev=ad0d9f2e9671 - and re-adding this if statement: http://dxr.mozilla.org/mozilla-central/source/gfx/layers/ipc/CompositorParent.cpp?from=CompositorParent.cpp&case=true#370 - Looks ok.

4) Changing mLastCompose to the schedule time (http://dxr.mozilla.org/mozilla-central/source/gfx/layers/ipc/CompositorParent.cpp?from=CompositorParent.cpp&case=true#788) - https://tbpl.mozilla.org/?tree=Try&rev=f8813a553d06 - Looks bad. (Steps 2 and 4 are sketched in code after this comment.)

Trying a few more pushes:

5) Just the compositor backout plus moving the mLastCompose time to CompositeCallback - https://tbpl.mozilla.org/?tree=Try&rev=93ae924d082b

6) Adding the compositor vsync observer, creating it in the CompositorParent constructor, and moving the mLastCompose time to CompositeCallback - https://tbpl.mozilla.org/?tree=Try&rev=62a57eef7cf3

Hopefully one of the last two pushes will reproduce the regression.
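[Editor's note: for context on steps 2 and 4, here is roughly the shape of the change being bisected, simplified from the CompositorParent.cpp links above; treat it as a sketch rather than the exact patch:

// Before the vsync changes (sketch): the composite callback stamps the
// compose time itself just before compositing.
void CompositorParent::CompositeCallback() {
  mLastCompose = TimeStamp::Now();
  CompositeToTarget(nullptr);
}

// After (sketch): the vsync observer hands the scheduled vsync time
// through, and mLastCompose becomes the schedule time instead of "now".
void CompositorParent::CompositeCallback(TimeStamp aScheduleTime) {
  mLastCompose = aScheduleTime;
  CompositeToTarget(nullptr);
}
]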
This one seemed to reproduce the regression: https://tbpl.mozilla.org/?tree=Try&rev=93ae924d082b

I started by backing out the compositor changes and just moving the mLastCompose time to the CompositeCallback: https://hg.mozilla.org/try/rev/2dadd9634e5c - This exposed a logic bug where we didn't set mLastCompose in all cases, although mLastCompose is only used for OMTA sampling and #COMPOSITOR_PERFORMANCE_WARNINGS.

Here is another try push where we correctly set mLastCompose in all places, though I still don't understand why this would cause the regression: https://tbpl.mozilla.org/?tree=Try&rev=694f0a3091e3 - I want to see if this last try fixes the issue; then we can decide what to do next.
Previously, we set mLastCompose during CompositorParent::CompositeToTarget. Now we set it during the CompositeCallback but forgot to set it during forced flushes, which seems to have caused the regression. Correctly setting mLastCompose before every call to CompositeToTarget seems to fix the regression, as shown here: https://tbpl.mozilla.org/?tree=Try&rev=6720e1c23d64

I'm not sure why this actually caused a regression. From reading the code, MozAfterPaint seems to be triggered on the main thread during painting, not compositing. The only other explanation I can think of is here: http://dxr.mozilla.org/mozilla-central/source/gfx/layers/ipc/CompositorParent.cpp?from=CompositorParent.cpp&case=true#754 - if we forget to set mLastCompose, we schedule the next composite ASAP. But I didn't see this code path being hit while manually creating a new window on OS X or Windows, nor did I see any ForceComposites. I'm testing to see if I can reproduce this locally with talos, but I don't want to hold up the fix.
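[Editor's note: a minimal sketch of the bug and fix described above, assuming a forced-flush entry point like ForceComposeToTarget(); the exact signature here is illustrative:

// Before (regressing, sketch): the forced-flush path composited without
// updating mLastCompose, which the vsync-driven CompositeCallback() now owns:
//
//   void CompositorParent::ForceComposeToTarget(DrawTarget* aTarget) {
//     CompositeToTarget(aTarget); // mLastCompose is stale here
//   }

// After (the fix, sketch): stamp mLastCompose before every call into
// CompositeToTarget, so OMTA sampling, COMPOSITOR_PERFORMANCE_WARNINGS,
// and the "schedule the next composite ASAP" check all see a fresh timestamp.
void CompositorParent::ForceComposeToTarget(DrawTarget* aTarget) {
  mLastCompose = TimeStamp::Now();
  CompositeToTarget(aTarget);
}
]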
Attachment #8513683 - Flags: review?(bgirard)
Attachment #8513683 - Flags: review?(bgirard) → review+
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla36
I have confirmed this test has been fixed on graph server: http://graphs.mattn.ca/graph.html#tests=%5B%5B82,131,31%5D,%5B82,131,25%5D%5D&sel=1414475569364.633,1414654453871.6753&displayrange=7&datatype=running

Thanks for taking the extra time to fix this!
Product: Core → Core Graveyard