Closed Bug 1190877 Opened 9 years ago Closed 9 years ago

10% linux64 tresize regression on mozilla-inbound on July 30th from revision 5e130ad70aa7

Categories

(Core :: General, defect)


Tracking


RESOLVED WONTFIX
Tracking Status
firefox42 --- affected

People

(Reporter: jmaher, Unassigned)

References

Details

(Keywords: perf, regression, Whiteboard: [talos_regression])

:jya, can you take a look at this and figure out why this test regressed with your push?
Flags: needinfo?(jyavenard)
I can't. None of that code is active on Linux.

It's only in use on Windows >= Vista and OS X.
Flags: needinfo?(jyavenard)
Additionally, none of those changes has any impact on how fast we can resize a window: the code is only used in MSE, and at the time reported it was only used when watching YouTube.
What page does the tresize test have open when it performs the resize? Unless that page has a video or audio element loaded, it seems exceedingly unlikely that jya's code has any impact on page resize. If media code were active at the time of the resize, the threads that perform the resize could be starved, but the media code would need to be active for that to be a possibility, of course.
Here is the code for tresize:
http://hg.mozilla.org/build/talos/file/tip/talos/startup_test/tresize/addon/content

Regressions are often introduced accidentally, and that is probably the case here. I have full confidence that the push:
https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?changeset=5e130ad70aa7

which contains 3 changesets, is causing this regression. Now we need to figure out why that is the case. We will be uplifting to Aurora on Monday, so I would like to either fix it or back it out before the uplift.
Oh come on...

Look at the code and those changes; how could it possibly affect JS... Not once is a media element created, and not a single byte of that code will ever be reached, MediaSource being disabled on Linux as there's no H264 decoder.

The first patch changes the content of the MediaSource NSPR log output.
The second patch makes the moof parser not ignore a track (again, only used when playing fragmented MP4 within MSE).
The third patch is triggered if you seek backward on YouTube.

Each of them is a change of less than 3 lines.

We can't back those changes out without affecting the whole MediaSource 42 release plan, and it would break YouTube.
Please look at the data on Treeherder/Perfherder and explain to me why it is different. We have a performance policy in place:
https://www.mozilla.org/en-US/about/governance/policies/regressions/

We need to follow it, or start a conversation about modifying it for future patches. Please tell me what additional data you need to help debug this.
Anthony, can you please handle this? Thanks.
Flags: needinfo?(ajones)
The regression is real; the current data for retriggered jobs on Treeherder proves that. We can trigger additional jobs, or all the talos jobs, for this revision if that is needed to investigate this patch.
Vaibhav, thanks for double-checking the regression and the data! Anthony, let me know if you have questions about the test, the data, or how to run the tests whilst debugging.
I don't doubt that the regression is real. I'm sure it is. 

What I doubt is that you're barking up the right tree.
Now show me a current try run from the top of current central with the MediaSource changes, and one without, and maybe I'll believe it a tad more.
I don't mind doing try runs, but what would we be looking for? How could we remove the MediaSource changes? Normally I do: hg update <change>; push to try; hg update <rev_before_change>; push to try.

This has already been done, unless we need some other data.
Revert the changes in 3 new commits, or apply the revert patch and make a single commit; it's not hard.

Sorry, but I think that the burden of proof is on you here.

But again, MediaSource is not active on Linux at this stage...

I'll let Anthony deal with this from now on...
Can talos configurations be changed in any way other than through m-c changes?
Flags: needinfo?(jmaher)
Builds for both 4d0818791d07 and 5e130ad70aa7 are marked as purged clobber.
Talos is pinned to a specific version in m-c. Sometimes numbers can be adjusted based on network issues, machine updates, or other oddities outside of talos and the build. To verify that isn't the case, we retrigger the jobs at the same point in time and look at the results. In this case we have retriggered the jobs, and the data a few days after landing shows the 3 patches as causing a regression and the previous revision as remaining the same.

I am sorry, :jya, but this is not an issue you can just ignore; it is fine to have others help out. As Firefox developers we all have responsibilities, and unfortunately I am the bearer of bad news. This is a 10% regression, which is not something to just ignore; if we can justify what is going on, that is great.

Reverting the changes would be exactly like looking at the revision prior to landing, for which we have collected 6 data points that show a consistent set of numbers. In fact, many times I hear the same argument and upon further examination of the code we find that there is a bug in related code which causes loading of elements on all platforms, not just the one originally thought.
Flags: needinfo?(jmaher)
Quick experiment: I've added a MOZ_CRASH at every single point where bug 1183196 touched the code.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=d33ea2356df0

No crashes.
This at least proves that the code changed in bug 1183196 is not even reached during this talos test.
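For illustration, the instrumentation looks roughly like this; a minimal sketch, with a made-up function name rather than the actual patched code:

#include "mozilla/Assertions.h"  // provides MOZ_CRASH

// Hypothetical stand-in for one of the spots touched by the suspect patch.
// With MOZ_CRASH added at the top, any test that reaches this code path
// aborts immediately, so a crash-free (green) try run means the path was
// never executed during the test.
static void UpdatePatchedMoofIndex()
{
  MOZ_CRASH("bug 1183196 code path reached during the talos run");
  // ... the actual patched logic would follow here ...
}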
How could unused code have this impact?

(In reply to Joel Maher (:jmaher) from comment #16)
> In fact, many times I hear the same argument and upon
> further examination of the code we find that there is a bug in related code
> which causes loading of elements on all platforms, not just the one
> originally thought.
Joel, could you please have a look at the original patches (they're small and very localized) and see if you can suggest which part(s) could do such a thing?

The next step would be to actually do a revert, just to see if it helps with tresize.
Try push with MSE changes reverted:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=dfabbfd7d1a0
https://treeherder.mozilla.org/perf.html#/comparechooser?newProject=try&newRevision=dfabbfd7d1a0

Try push without MSE changes reverted:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=ca3611456799
https://treeherder.mozilla.org/perf.html#/comparechooser?newProject=try&newRevision=ca3611456799

qparent was 07befc6f54e7.

Joel: can you please compare these two Try pushes, respin jobs if necessary to increase confidence, and tell us whether backing out would help here?
Flags: needinfo?(jmaher)
The only thing that would make sense is that the extra code somehow tipped the binary size over a certain threshold, at which point the compiler can't optimise as well as it otherwise would.

But that said, absolutely any other change could cause the same regression, and reverting the change isn't a solution, as any change over time could randomly trigger it again.

Which is why I wanted to see current m-c (not from a few days ago) as it is, and one with the 3 changes reverted on top of it.
That will prove my point; I'm happy to bet that it won't change a thing, because the tipping point would have been reached by other commits in the tree since that day.

Joel, I'm not ignoring the issue. I'm saying you're barking up the wrong tree, and that your approach has only proven that there is a regression, not that a change to unused code is the culprit.

This is current central:
https://treeherder.mozilla.org/perf.html#/comparechooser?newProject=try&newRevision=f7a077c6edbe

Try with the three commits reverted:
https://treeherder.mozilla.org/perf.html#/comparechooser?newProject=try&newRevision=644a31f88488
(In reply to Gerald Squelart [:gerald] from comment #17)
> Quick experiment: I've added a MOZ_CRASH at every single point where bug
> 1183196 touched the code.
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=d33ea2356df0
> No crashes.

Sorry, in my test I only added MOZ_CRASHes to what's in bug 1183196; I actually missed one spot from bug 1189138 that's also in the suspect push, so please take my test with a grain of salt.
I'm fairly confident that code would not be hit either (as it's in MediaSource, which is disabled on Linux), but let me know if you'd like me to really try.
Not sure if bug 1191334 is related...
(In reply to JW Wang [:jwwang] from comment #22)
> Not sure if bug 1191334 is related...

That bug is on Windows, playing YouTube, where MediaSource *is* enabled.
Joel - who is responsible for tresize? Can we get them to explain why this test would be affected by media playback code? I'm not saying that it is impossible, but we've got some pressing issues we need to deal with ahead of looking into this.
Flags: needinfo?(ajones)
So here is the comparison between base (left) and revert (right):
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=f7a077c6edbe&newProject=try&newRevision=644a31f88488

So with the revert (that is, without the blamed code), we have:
tcanvasmark opt: +0.23% 
tresize opt: +0.36%

So according to this, with the code removed, it's *slower*.

Now let's compare Chris Pearce's base (with the blamed code) and the one I pushed (still with the blamed code):
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=ca3611456799&newProject=try&newRevision=f7a077c6edbe

tcanvasmark opt: +0.53%
tresize opt: -1.22%

The variation is higher between two identical copies of central than it was with the changes reverted.

Now just for fun:
Chris's revert vs. my revert.
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=dfabbfd7d1a0&newProject=try&newRevision=644a31f88488

tcanvasmark opt: +0.18%
tresize opt: +0.45% 

Well, there you got me, Chris... Your revert is obviously better than mine, but we've always known you have more experience than I do :) (never mind that the resulting code is identical).

We should really question the validity of the test...
First off, thanks for the try pushes and the related investigation! I have done a lot of additional retriggers of the existing runs, and things just don't make sense. What I suspect is going on is that the base image on the linux64 machines was somehow upgraded, and we just happened to start deploying it when this landed. Historically this has happened a couple of times, and it takes a day for the changes to really roll out to the machines we run the tests on. I will take this bug for now and assume it is not related to the media patches in question.

When we have a regression for which we have ruled out all other code changes, and have done retriggers to prove that it is consistent and repeatable, we push on it to understand it. Usually retriggers will adjust the data and show that machine updates were the cause; oddly enough that didn't work 2 days ago, but a bunch of retriggers today showed the regression going back even 20 revisions.
No longer blocks: 1183196, 1189138
Flags: needinfo?(jmaher)
I filed bug 1196419 to try to detect meaningful differences in test runs on the same revision, which seems to have been the root cause of the false positive here.
This doesn't look like something we can resolve.
Component: Audio/Video: Playback → General
This appears to be infrastructure-related. We are still hunting it down, but this specific bug shouldn't stay open any longer, since there is nothing we can do in the product to fix it.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX