Closed Bug 1220804 Opened 10 years ago Closed 8 years ago

remote-tsvgx on autophone is bimodal for the composite tests, resulting in overall noise

Categories

(Testing :: Talos, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: jmaher, Unassigned)

Some data from the log files:
1;composite-scale.svg;451;373;136;117;132;132
1;composite-scale.svg;223;145;365;131;114;356
1;composite-scale.svg;427;129;136;105;124;113
1;composite-scale.svg;231;119;118;113;363;163
We actually drop the first value and take the geometric_mean of the remaining values. If we took the median of the remaining values instead, it would be more consistent. Maybe we should consider why we have a few 300+ numbers and mostly <150 numbers.
For both the pandas and autophone we are loading the pages from a remote server. I wonder if the autophone server has issues with file serving when other stuff is going on over the network? Since tp4m is so reliable, I would think not.
I see similar noise on the nexus-s for the summary, but there it is just a noisy set of composite tests rather than truly bi-modal. This indicates the bi-modal pattern is device specific.
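One rough way to tell "bi-modal" from "just noisy" is to pool the replicates, split the sorted values at the largest gap, and compare that gap with the spread inside the lower cluster. The sketch below does this for the composite-scale.svg replicates quoted earlier in this bug. The helper name `split_at_largest_gap` is made up for illustration, and this is a crude heuristic, not a statistical test.

```python
# Replicates for composite-scale.svg from the log excerpt earlier in this bug.
REPLICATES = [
    [451, 373, 136, 117, 132, 132],
    [223, 145, 365, 131, 114, 356],
    [427, 129, 136, 105, 124, 113],
    [231, 119, 118, 113, 363, 163],
]


def split_at_largest_gap(values):
    """Sort the values and split them into two clusters at the widest gap."""
    ordered = sorted(values)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    cut = gaps.index(max(gaps)) + 1
    return ordered[:cut], ordered[cut:]


# Drop the first replicate of each run, as the harness does, and pool the rest.
kept = [value for run in REPLICATES for value in run[1:]]
low, high = split_at_largest_gap(kept)

gap_between = high[0] - low[-1]
spread_low = low[-1] - low[0]

print(f"low cluster: {len(low)} values, {low[0]}..{low[-1]}")
print(f"high cluster: {len(high)} values, {high[0]}..{high[-1]}")
print(f"gap between clusters: {gap_between}, spread inside low cluster: {spread_low}")
```

A gap between clusters that is several times the spread inside the low cluster suggests two modes; a merely noisy series would not show such a clean split.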
A couple of those files are pretty big. I did notice some GC activity in the logcat but there didn't appear to be markers in logcat to tell what was going on.
522K gearflowers.svg
389K hixie-007.xml
I looked around for signs in the logcat; sadly, the signs I see are the same ones that I see on the panda boards. One thought I had was that the large files (gearflowers.svg, hixie-001.xml, and hixie-007.xml) might be having side effects on the other load times. Right now we have a 250ms delay between each cycle (currently we do 6 cycles), so maybe we could put the larger pages at the end of the cycle and increase the delay to 500ms? Likewise, we could try removing the really large pages and seeing what that does. I am open to trying a few things. I just don't know why this is so unique to the nexus-7. It makes me wonder whether we would end up playing this same game if we switched devices in the future.
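A minimal sketch of the reordering idea, assuming the page list is available as plain file names. The example order, the `large_pages_last` helper, and the `CYCLE_DELAY_MS` constant are hypothetical and are not taken from the real tsvgx manifest or harness.

```python
# Pages called out as large in this bug.
LARGE_PAGES = frozenset({"gearflowers.svg", "hixie-001.xml", "hixie-007.xml"})

# Proposed delay between cycles (currently 250 ms); hypothetical constant name.
CYCLE_DELAY_MS = 500


def large_pages_last(pages, large=LARGE_PAGES):
    """Move large pages to the end of the cycle, keeping relative order otherwise.

    sorted() is stable and False sorts before True, so pages that are not in
    `large` keep their original order and come first.
    """
    return sorted(pages, key=lambda page: page in large)


# Hypothetical page order, for illustration only.
example_pages = [
    "gearflowers.svg",
    "composite-scale.svg",
    "hixie-007.xml",
    "hixie-001.xml",
]
reordered = large_pages_last(example_pages)
print(reordered)
```

Comparing the composite results before and after such a change would show whether the large pages are what disturbs the later load times.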
The tests don't take an inordinate amount of time. Perhaps we could put them on some of the other devices which don't have too much load so we can get a better picture. What is the story with reporting with different devices/android versions? How would PerfHerder handle Nexus S, Nexus 4, Nexus 5 in addition to Nexus 7?
Right now it will be keyed off of platform which is currently "android-4-3-armv7-api11". There is a bug on file to make treeherder use more than the raw platform name in the backend, sadly I cannot find it after 20 minutes inside of bugzilla, so I filed a new one, bug 1224571. If we have a device like the Nexus 4 which has a different version of android, then it will probably be a simple fix to run remote-tsvgx on there and compare. Honestly that seems like the right approach here to determine if it is device specific, then figure out how we want to handle the pattern of data on that device. :bc:, thoughts?
Our breakdown is currently:
nexus s Android 2.3 API9
nexus 4 Android 4.2 API11+
nexus 7 Android 4.3 API11+
nexus 5 Android 4.4 API11+
That should be sufficient to distinguish them for now?
Yes, I say let's go for it and get a week's worth of data!
Yes, Perfherder specifically looks at the machine platform to distinguish platforms, so autophone can set that to whatever it likes if you want the data to be organized differently. But as Joel said, the fact that the devices are running different versions of Android should be enough to distinguish them for now. I'd prefer to see the outcome of your experiments, and to determine the requirements of Android perf testing in general, before committing time to modifying Perfherder itself. I have a lot of other things to do right now...
:bc, is this bug still a valid concern that should be kept open? Thanks :)
Flags: needinfo?(bob)
While investigating the current status of our tests, I ran a couple of try runs on the production servers and autophone-4:
production: https://treeherder.mozilla.org/#/jobs?repo=try&revision=b42f3165db903d7bb137f6611e7399a82794fee9&group_state=expanded
autophone-4: https://treeherder.allizom.org/#/jobs?repo=try&revision=b42f3165db903d7bb137f6611e7399a82794fee9&group_state=expanded
production tp4m: https://treeherder.mozilla.org/perf.html#/graphs?series=%5Btry,02c8bd1f934b0336c3a45f1c863313ef6aae77b6,1,3%5D&selected=%5Btry,02c8bd1f934b0336c3a45f1c863313ef6aae77b6,260638,134678930%5D
staging tp4m: https://treeherder.allizom.org/perf.html#/graphs?series=%5Btry,02c8bd1f934b0336c3a45f1c863313ef6aae77b6,1,3%5D&selected=%5Btry,02c8bd1f934b0336c3a45f1c863313ef6aae77b6,388161,127835726%5D
production tsvg: https://treeherder.mozilla.org/perf.html#/graphs?series=%5Btry,b651e33205845624deea16fa9a2b9cfe9bcb9e0d,1,3%5D&selected=%5Btry,b651e33205845624deea16fa9a2b9cfe9bcb9e0d,260638,134678931%5D
staging tsvg: https://treeherder.allizom.org/perf.html#/graphs?series=%5Btry,b651e33205845624deea16fa9a2b9cfe9bcb9e0d,1,3%5D&selected=%5Btry,b651e33205845624deea16fa9a2b9cfe9bcb9e0d,388161,127835725%5D
One thing to remember is that these results are from pairs of devices, which can contribute to a bimodal result. But overall, these don't seem too bad, and certainly not as bad as what jmaher originally detected. I say: this isn't that much of a concern today, but we can probably reduce the noise by using only one device instead of two for the talos tests. jmaher?
Flags: needinfo?(bob) → needinfo?(jmaher)
I don't see this as a concern. It is noisy, but not bi-modal. I think a single device would help reduce noise. Should we make that change?
Flags: needinfo?(jmaher)
See Also: → 1405707
Filed Bug 1405707 to reduce the number of devices to one. I'll go ahead and resolve this as wfm.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WORKSFORME