Closed
Bug 1190877
Opened 9 years ago
Closed 9 years ago
10% linux64 tresize regression on mozilla-inbound on July 30th from revision 5e130ad70aa7
Categories
(Core :: General, defect)
Core
General
Tracking
RESOLVED
WONTFIX
Tracking | Status
---|---
firefox42 | affected
People
(Reporter: jmaher, Unassigned)
References
Details
(Keywords: perf, regression, Whiteboard: [talos_regression])
Talos generated some alerts based on the data coming in. I had done a lot of retriggers on treeherder: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&fromchange=bce7955eaa13&tochange=82b165aa3192&filter-searchStr=Ubuntu%20HW%2012.04%20x64%20mozilla-inbound%20talos%20chromez and you can see from the graph that tresize started to get a lot noisier around July 30: https://treeherder.mozilla.org/perf.html#/graphs?timerange=604800&series=[mozilla-inbound,72f4651f24362c87efb15d5f4113b9ca194d8e3f,1]&series=[mozilla-inbound,90b2964264c531aa801a987595c22c0832c2ac93,0]&highlightedRevisions=bce7955eaa13&highlightedRevisions=3d5c585fa1e3

Looking in detail, we can see the comparison here: https://treeherder.allizom.org/perf.html#/comparesubtest?originalProject=mozilla-inbound&originalRevision=4d0818791d07&newProject=mozilla-inbound&newRevision=5e130ad70aa7&originalSignature=72f4651f24362c87efb15d5f4113b9ca194d8e3f&newSignature=72f4651f24362c87efb15d5f4113b9ca194d8e3f

You can learn more about the test here: https://wiki.mozilla.org/Buildbot/Talos/Tests#tresize

To test a fix for this, you can push to try:
try: -b o -p linux64 -u none[x64] -t chromez [mozharness: --spsProfile]
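For readers unfamiliar with how Talos/Perfherder regression alerts work, here is a minimal illustrative sketch (this is not Perfherder's actual code, and the sample timings are hypothetical): a regression is flagged when the median of the measurements after a push shifts by a meaningful percentage relative to the median before it, which is why retriggers are used to collect enough data points on both sides.

```python
# Illustrative sketch only -- not Perfherder's real algorithm.
# The tresize times below are made-up sample values.
from statistics import median

def percent_change(before, after):
    """Percent change of the after-median relative to the before-median."""
    b, a = median(before), median(after)
    return (a - b) / b * 100.0

# Hypothetical tresize times (ms) from retriggered jobs:
before_push = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2]
after_push  = [22.0, 21.9, 22.3, 22.1, 21.8, 22.2]

delta = percent_change(before_push, after_push)
print(f"tresize changed by {delta:+.1f}%")
```

With these hypothetical samples the computed shift is close to the +10% reported in this bug; the point of the retriggers is to make sure such a shift is consistent rather than one noisy data point.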
Reporter | ||
Comment 1•9 years ago
:jya, can you take a look at this and figure out why this test regressed with your push?
Flags: needinfo?(jyavenard)
Comment 2•9 years ago
I can't. None of that code is active on Linux; it's only in use on Windows >= Vista and OS X.
Flags: needinfo?(jyavenard)
Comment 3•9 years ago
Additionally, none of those changes has an impact on how fast we can resize a window: that code is used in MSE, and at the time this was reported it was only exercised when watching YouTube.
Comment 4•9 years ago
What page does the tresize test have open when it performs the resize? Unless that page has a video or audio element loaded, it seems exceedingly unlikely that jya's code has any impact on page resize. If media code is active at the time of the resize, it's possible that the threads doing the resize could be starved, but the media code would need to be active for that to be a possibility, of course.
Reporter | ||
Comment 5•9 years ago
Here is the code for tresize: http://hg.mozilla.org/build/talos/file/tip/talos/startup_test/tresize/addon/content

Many times regressions like this are introduced accidentally, and that is probably the case here. I have full confidence that the push https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?changeset=5e130ad70aa7 (which contains 3 changesets) is causing this regression. Now we need to figure out why. We will be uplifting to Aurora on Monday, so I would like to either fix it or back it out before the uplift.
Comment 6•9 years ago
Oh come on... Look at the code and those changes; how could they possibly affect JS? Not once is a media element created, and not a single byte of that code will ever be reached, MediaSource being disabled on Linux as there's no H264 decoder.

The first patch changes the content of the MediaSource NSPR_MODULES_LOG. The second patch makes the moof parser not ignore a track (again, only used when playing fragmented MP4 within MSE). The third patch is only triggered if you seek backward in YouTube. They are all changes of less than 3 lines.

We can't back those changes out without affecting the whole MediaSource 42 release plan, and you'll be breaking YouTube.
Reporter | ||
Comment 7•9 years ago
Please look at the data on Treeherder/Perfherder and explain to me why it is different. We have a performance policy in place: https://www.mozilla.org/en-US/about/governance/policies/regressions/ We need to follow it, or start a conversation to modify it for future patches. Please tell me what additional data you need to help debug this.
Comment 9•9 years ago
The regression is real; the current data for retriggered jobs on Treeherder proves that. We can trigger additional jobs (or all the talos jobs) for this revision if there is a need to investigate this patch further.
Reporter | ||
Comment 10•9 years ago
Vaibhav, thanks for double checking the regression and the data! Anthony, let me know if you have questions about the test, the data, or how to run the tests whilst debugging.
Comment 11•9 years ago
I don't doubt that the regression is real; I'm sure it is. What I doubt is that you're barking up the right tree. Now show me a current try run from the top of current central, one with the MediaSource changes and one without, and maybe I'll believe it a tad more.
Reporter | ||
Comment 12•9 years ago
I don't mind doing try runs, but what would we be looking for? How would we remove the MediaSource changes? Normally I do: hg update <change>; push to try; hg update <rev_before_change>; push to try. That has already been done, unless we need some other data.
Comment 13•9 years ago
Revert the changes in 3 new commits, or apply the revert patch and make a single commit; it's not hard. Sorry, but I think the burden of proof is on you here. But again, MediaSource is not active on Linux at this stage... I'll let Anthony deal with this from now on...
Comment 14•9 years ago
Can talos configurations be changed in any way other than through m-c changes?
Flags: needinfo?(jmaher)
Comment 15•9 years ago
Builds for both 4d0818791d07 and 5e130ad70aa7 are marked as purged/clobbered.
Reporter | ||
Comment 16•9 years ago
Talos is pinned to a specific version in m-c. Sometimes numbers can shift because of network issues, machine updates, or other oddities outside of talos and the build. To verify that isn't the case, we retrigger the jobs at the same point in time and look at the results. In this case we have retriggered the jobs, and the data a few days after landing shows the 3 patches causing a regression while the previous revision remains the same.

I am sorry :jya, but this is not an issue you can just ignore; it is fine to have others help out. As Firefox developers we all have responsibilities, and unfortunately I am the bearer of bad news. This is a 10% regression, which is not something to just ignore; if we can justify what is going on, that is great. Reverting the changes would be exactly like looking at the revision prior to landing, for which we have collected 6 data points showing a consistent set of numbers.

In fact, many times I hear the same argument, and upon further examination of the code we find that there is a bug in related code which causes loading of elements on all platforms, not just the one originally thought.
Flags: needinfo?(jmaher)
Comment 17•9 years ago

Quick experiment: I've added a MOZ_CRASH at every single point where bug 1183196 touched the code: https://treeherder.mozilla.org/#/jobs?repo=try&revision=d33ea2356df0 No crashes. This at least proves that the code changed in bug 1183196 is not even reached during this talos test. How could unused code have this impact?

(In reply to Joel Maher (:jmaher) from comment #16)
> In fact, many times I hear the same argument and upon
> further examination of the code we find that there is a bug in related code
> which causes loading of elements on all platforms, not just the one
> originally thought.

Joel, could you please have a look at the original patches (they're small and very localized) and see if you can suggest which part(s) could do such things? The next step would be to actually do a revert, just to see if it helps with tresize.
Comment 18•9 years ago
Try push with MSE changes reverted:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=dfabbfd7d1a0
https://treeherder.mozilla.org/perf.html#/comparechooser?newProject=try&newRevision=dfabbfd7d1a0

Try push without MSE changes reverted:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=ca3611456799
https://treeherder.mozilla.org/perf.html#/comparechooser?newProject=try&newRevision=ca3611456799

qparent was 07befc6f54e7.

Joel: can you please compare these two try pushes, respin jobs if necessary to increase confidence, and tell us whether backing out would help here?
Flags: needinfo?(jmaher)
Comment 19•9 years ago
The only explanation that would make sense is that the extra code somehow tipped the binary size over a threshold, at which point the compiler can't optimise as well as before. But that said, absolutely any other change could cause the same regression, and reverting the change isn't a solution, since any later change could randomly cause it again. Which is why I wanted to see current m-c (not from a few days ago) as it is, and one with the 3 changes reverted on top of it. That will prove my point; I'm happy to bet it won't change a thing, because the tipping point would have been reached by other commits landed in the tree since that day.

Joel, I'm not ignoring the issue. I'm saying you're barking up the wrong tree, and that your approach has only proven that there is a regression, not that a change to unused code is the culprit.

This is current central: https://treeherder.mozilla.org/perf.html#/comparechooser?newProject=try&newRevision=f7a077c6edbe
Try with the three commits reverted: https://treeherder.mozilla.org/perf.html#/comparechooser?newProject=try&newRevision=644a31f88488
Comment 20•9 years ago
In the above, the actual Treeherder links:
central: https://treeherder.mozilla.org/#/jobs?repo=try&revision=f7a077c6edbe
with revert: https://treeherder.mozilla.org/#/jobs?repo=try&revision=644a31f88488
Comment 21•9 years ago

(In reply to Gerald Squelart [:gerald] from comment #17)
> Quick experiment: I've added a MOZ_CRASH at every single point where bug
> 1183196 touched the code.
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=d33ea2356df0
> No crashes.

Sorry, in my test I only added MOZ_CRASHes to what's in bug 1183196; I actually missed one spot from bug 1189138 that's also in the suspect push, so please take my test with a grain of salt. I'm fairly confident that code would not be hit either (as it's in disabled-on-Linux MediaSource), but let me know if you'd like me to really try.
Comment 22•9 years ago
Not sure if bug 1191334 is related...
Comment 23•9 years ago
(In reply to JW Wang [:jwwang] from comment #22)
> Not sure if bug 1191334 is related...

That one is on Windows, playing YouTube, where MediaSource *is* enabled.
Comment 24•9 years ago

Joel - who is responsible for tresize? Can we get them to explain why this test would be affected by media playback code? I'm not saying that it is impossible, but we've got some pressing issues we need to deal with ahead of looking into this.
Flags: needinfo?(ajones)
Comment 25•9 years ago
So here is the comparison between base (left) and revert (right): https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=f7a077c6edbe&newProject=try&newRevision=644a31f88488

With the revert (that is, without the code being blamed):
tcanvasmark opt: +0.23%
tresize opt: +0.36%

So according to this, with the code removed it's *slower*.

Now let's compare Chris Pearce's base (with the blamed code) and the one I pushed (with the blamed code still in): https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=ca3611456799&newProject=try&newRevision=f7a077c6edbe
tcanvasmark opt: +0.53%
tresize opt: -1.22%

The variation is higher between two identical copies of central than it was with the changes reverted.

Now just for fun, Chris's revert vs my revert: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=dfabbfd7d1a0&newProject=try&newRevision=644a31f88488
tcanvasmark opt: +0.18%
tresize opt: +0.45%

Well, there Chris, you got me... Your revert is obviously better than mine, but we've always known you have more experience than I do :) (never mind that the resulting code is identical)

We should really question the validity of the test.
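The argument in the comment above can be summarized in a small sketch (a simplification, using only the percentages quoted in the comment): if the delta measured between two builds of *identical* code is at least as large as the delta between base and revert, then the revert delta is within run-to-run noise and tells us nothing about the blamed patches.

```python
# Simplified restatement of the comment's reasoning; the percentages
# are the tresize numbers quoted above, not freshly measured data.
identical_builds_delta = -1.22  # tresize % between two copies of central
revert_delta = +0.36            # tresize % between base and revert

# If identical code varies more than base-vs-revert does, the revert
# comparison cannot distinguish the patches from noise.
within_noise = abs(revert_delta) <= abs(identical_builds_delta)
print("revert delta within run-to-run noise:", within_noise)
```

This is exactly why the later comments suggest many retriggers: a single compare between two pushes cannot separate a small real change from the measurement noise floor.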
Reporter | ||
Comment 26•9 years ago
First off, thanks for the try pushes and related investigation! I have done a lot of additional retriggers of the existing runs, and things just don't make sense. What I suspect is going on is that the base image on the linux64 machines was upgraded, and we just happened to start deploying it when this landed. Historically this has happened a couple of times, and it takes a day for the changes to really roll out to the machines we run the tests on.

I will take this bug for now and assume it is not related to the media patches in question. When we have a regression where we have ruled out all other code changes, and done retriggers to prove that it is consistent and repeatable, we push on it to understand it. Usually retriggers will adjust the data to show that machine updates were the cause; oddly enough that didn't work 2 days ago, but a bunch of retriggers today showed the shift going back even 20 revisions.
Comment 27•9 years ago
I filed bug 1196419 to try and detect meaningful differences in test runs on the same revision, which seemed to be the root cause of the false positive here.
Comment 28•9 years ago
This doesn't look like something we can resolve.
Component: Audio/Video: Playback → General
Reporter | ||
Comment 29•9 years ago
This appears to be infrastructure related. We are still hunting it down, but this specific bug shouldn't remain open, since there is nothing we can do in the product to fix it.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX