Closed
Bug 1220804
Opened 10 years ago
Closed 8 years ago
remote-tsvgx on autophone is bimodal for the composite tests, resulting in overall noise
Categories
(Testing :: Talos, defect)
RESOLVED
WORKSFORME
People
(Reporter: jmaher, Unassigned)
there are 4 tests (composite*.svg):
https://git.mozilla.org/?p=automation/ep1.git;a=tree;f=talos/tsvg;h=40f83cb02172a3d5c1716a331287369e53470b61;hb=579fa0b401f717f6449cb3ce656b3ed2746d1629
these are bi-modal on autophone:
https://treeherder.mozilla.org/perf.html#/graphs?series=[mozilla-inbound,9ba98452b2b91a601962e875e73d8cff588a6957,1]&series=[mozilla-inbound,e4037106cbd2f46d3c803ced3e77d2214df41af8,1]&series=[mozilla-inbound,3e6c927bdeb7946e4388129f60629a93db186472,1]&series=[mozilla-inbound,871bd63c399fee019f4ed291b0861bb18610e226,1]
This results in the overall score of svgx being much noisier than that of the panda version; we should investigate and fix this.
Reporter
Comment 1 • 10 years ago
some data from log files:
1;composite-scale.svg;451;373;136;117;132;132
1;composite-scale.svg;223;145;365;131;114;356
1;composite-scale.svg;427;129;136;105;124;113
1;composite-scale.svg;231;119;118;113;363;163
We actually drop the first value and take the geometric mean of the remaining values. If we took a median of the remaining values instead, it would be more consistent. Maybe we should investigate why we have a few 300+ numbers among mostly <150 numbers.
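The difference between the two summaries can be seen directly on the sample rows above. This is a minimal sketch, not the actual Talos code; the `summarize` helper and the row parsing are hypothetical, but the rule it models (drop the first replicate, then aggregate) is the one described in this comment:

```python
import math
import statistics

# Sample replicates from the log excerpts above; format is
# index;page;run1;run2;...;run6
rows = [
    "1;composite-scale.svg;451;373;136;117;132;132",
    "1;composite-scale.svg;223;145;365;131;114;356",
    "1;composite-scale.svg;427;129;136;105;124;113",
    "1;composite-scale.svg;231;119;118;113;363;163",
]

def summarize(row):
    """Drop the first replicate, then return (geometric mean, median)."""
    values = [int(v) for v in row.split(";")[2:]]
    kept = values[1:]  # first value is discarded as warm-up
    geo = math.exp(sum(math.log(v) for v in kept) / len(kept))
    med = statistics.median(kept)
    return geo, med

for row in rows:
    geo, med = summarize(row)
    print(f"geomean={geo:6.1f}  median={med:6.1f}")
```

The occasional 300+ outliers pull the geometric mean well above the typical ~110-140 range, while the median stays stable, which is why the per-row medians here would produce a much less bimodal summary.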
Reporter
Comment 2 • 10 years ago
For both the pandas and autophone we are loading the pages from a remote server. I wonder if the autophone server has issues with file serving when other traffic is going over the network? Since tp4m is so reliable, I would think not.
Reporter
Comment 3 • 10 years ago
I see similar noise on the nexus-s for the summary; it is just a noisy set of composite tests rather than truly bi-modal. This suggests the issue is device specific.
Comment 4 • 10 years ago
A couple of those files are pretty big. I did notice some GC activity in logcat, but there didn't appear to be markers there to tell what was going on.
522K gearflowers.svg
389K hixie-007.xml
Reporter
Comment 5 • 10 years ago
I looked around for signs in the logcat; sadly, the signs I see are the same ones I see on the panda boards. One thought I had is that the large files (gearflowers.svg, hixie-001.xml, and hixie-007.xml) might be having side effects on the other load times.
Right now we have a 250ms delay between each cycle (currently we do 6 cycles), so maybe we could put the larger pages at the end of the cycle and increase the delay to 500ms?
Likewise, we could try removing the really large pages and see what that does. I am open to trying a few things; I just don't know why this is so unique to the nexus-7. It makes me wonder, if we switch devices in the future, whether we would play this same game.
Comment 6 • 10 years ago
The tests don't take an inordinate amount of time. Perhaps we could put them on some of the other devices which don't have too much load, so we can get a better picture. What is the story on reporting for different devices/Android versions? How would Perfherder handle the Nexus S, Nexus 4, and Nexus 5 in addition to the Nexus 7?
Reporter
Comment 7 • 10 years ago
Right now it will be keyed off of the platform, which is currently "android-4-3-armv7-api11". There is a bug on file to make treeherder use more than the raw platform name in the backend; sadly, I cannot find it after 20 minutes in bugzilla, so I filed a new one, bug 1224571.
If we have a device like the Nexus 4, which has a different version of Android, then it will probably be a simple fix to run remote-tsvgx on there and compare. Honestly, that seems like the right approach here: determine whether it is device specific, then figure out how we want to handle the pattern of data on that device.
:bc:, thoughts?
Comment 8 • 10 years ago
Our breakdown is currently:
Nexus S: Android 2.3, API 9
Nexus 4: Android 4.2, API 11+
Nexus 7: Android 4.3, API 11+
Nexus 5: Android 4.4, API 11+
That should be sufficient to distinguish them for now?
Reporter
Comment 9 • 10 years ago
Yes, I say let's go for it and get a week's worth of data!
Comment 10 • 10 years ago
Yes, Perfherder specifically looks at the machine platform to distinguish platforms, so autophone can set that to whatever it likes if you want the data organized differently. But as Joel said, the fact that the devices are running different versions of Android should be enough to distinguish them for now.
I'd prefer to see the outcome of your experiments, as well as to determine the requirements of Android perf testing in general, before committing time to modifying Perfherder itself. I have a lot of other things to do right now...
Comment 11 • 8 years ago
:bc, is this bug still a valid concern that should be kept open? Thanks :)
Flags: needinfo?(bob)
Comment 12 • 8 years ago
While investigating the current status of our tests, I ran a couple of try runs on the production servers and autophone-4:
production:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b42f3165db903d7bb137f6611e7399a82794fee9&group_state=expanded
autophone-4:
https://treeherder.allizom.org/#/jobs?repo=try&revision=b42f3165db903d7bb137f6611e7399a82794fee9&group_state=expanded
production tp4m:
https://treeherder.mozilla.org/perf.html#/graphs?series=%5Btry,02c8bd1f934b0336c3a45f1c863313ef6aae77b6,1,3%5D&selected=%5Btry,02c8bd1f934b0336c3a45f1c863313ef6aae77b6,260638,134678930%5D
staging tp4m:
https://treeherder.allizom.org/perf.html#/graphs?series=%5Btry,02c8bd1f934b0336c3a45f1c863313ef6aae77b6,1,3%5D&selected=%5Btry,02c8bd1f934b0336c3a45f1c863313ef6aae77b6,388161,127835726%5D
production tsvg:
https://treeherder.mozilla.org/perf.html#/graphs?series=%5Btry,b651e33205845624deea16fa9a2b9cfe9bcb9e0d,1,3%5D&selected=%5Btry,b651e33205845624deea16fa9a2b9cfe9bcb9e0d,260638,134678931%5D
staging tsvg:
https://treeherder.allizom.org/perf.html#/graphs?series=%5Btry,b651e33205845624deea16fa9a2b9cfe9bcb9e0d,1,3%5D&selected=%5Btry,b651e33205845624deea16fa9a2b9cfe9bcb9e0d,388161,127835725%5D
One thing to remember is that these results are from pairs of devices, which can contribute to a bimodal result. But overall, these don't seem too bad, and certainly not as bad as jmaher originally detected.
I say: this isn't much of a concern today, but we can probably improve the noise by using only one device instead of two for the talos tests. jmaher?
Flags: needinfo?(bob) → needinfo?(jmaher)
Reporter
Comment 13 • 8 years ago
I don't see this as a concern; it is noisy, but not bi-modal. I think a single device would help reduce noise. Should we make that change?
Flags: needinfo?(jmaher)
Comment 14 • 8 years ago
Filed Bug 1405707 to reduce the number of devices to one.
I'll go ahead and resolve this as wfm.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WORKSFORME