Closed Bug 1190877 Opened 9 years ago Closed 9 years ago

10% linux64 tresize regression on mozilla-inbound on July 30th from revision 5e130ad70aa7

Categories

(Core :: General, defect)


Tracking


RESOLVED WONTFIX
Tracking Status
firefox42 --- affected

People

(Reporter: jmaher, Unassigned)

References

Details

(Keywords: perf, regression, Whiteboard: [talos_regression])

:jya, can you take a look at this and figure out why this test regressed with your push?
Flags: needinfo?(jyavenard)
I can't. None of that code is active on Linux.

It's only in use on Windows >= Vista and OS X.
Flags: needinfo?(jyavenard)
Additionally, none of those changes has any impact on how fast we can resize a window: the code is only used in MSE, and at the time reported it was only used when watching YouTube.
What page does the tresize test have open when it performs the resize? Unless that page has a video or audio element loaded, it seems exceedingly unlikely that jya's code has any impact on page resize. If media code were active at the time of the resize, the threads that perform the resize could be starved, but the media code would need to be active for that to be a possibility, of course.
Here is the code for tresize:
http://hg.mozilla.org/build/talos/file/tip/talos/startup_test/tresize/addon/content

Regressions are often introduced accidentally, and that is probably the case here. I have full confidence that the push:
https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?changeset=5e130ad70aa7

which contains 3 changesets, is causing this regression. Now we need to figure out why that is the case. We will be uplifting to Aurora on Monday, so I would like to either fix it or back it out before the uplift.
Oh come on...

Look at the code and those changes; how could it possibly affect JS... Not once is a media element created, and not a single byte of that code will ever be reached, MediaSource being disabled on Linux as there's no H264 decoder.

The first patch changes the content of the MediaSource NSPR log output.
The second patch makes the moof parser not ignore a track (again, only used when playing fragmented MP4 within MSE).
The third patch is triggered if you seek backward on YouTube.

Each of them is a change of less than 3 lines.

We can't back those changes out without affecting the whole MediaSource 42 release plan, and it would break YouTube.
Please look at the data on Treeherder/Perfherder and explain to me why it is different. We have a performance policy in place:
https://www.mozilla.org/en-US/about/governance/policies/regressions/

We need to follow it, or start a conversation about modifying it for future patches. Please tell me what additional data you need to help debug this.
Anthony, can you please handle this? Thanks.
Flags: needinfo?(ajones)
The regression is real; the current data for retriggered jobs on Treeherder proves that. We can trigger additional jobs, or all the talos jobs, for this revision if that is needed to investigate this patch.
Vaibhav, thanks for double-checking the regression and the data! Anthony, let me know if you have questions about the test, the data, or how to run the tests whilst debugging.
I don't doubt that the regression is real. I'm sure it is. 

What I doubt is that you're barking up the right tree.
Now show me a current try run from the top of current central with the MediaSource changes, and one without, and maybe I'll believe it a tad more.
I don't mind doing try runs, but what would we be looking for? How could we remove the MediaSource changes? Normally I do: hg update <change>; push to try; hg update <rev_before_change>; push to try.

This has already been done, unless we need some other data.
Revert the changes in 3 new commits, or apply the revert patch and make a single commit; it's not hard.

Sorry, but I think that the burden of proof is on you here.

But again, MediaSource is not active on Linux at this stage...

I'll let Anthony deal with this from now on...
Can talos configurations be changed in any way other than through m-c changes?
Flags: needinfo?(jmaher)
Builds for both 4d0818791d07 and 5e130ad70aa7 are marked as purged clobber.
Talos is pinned to a specific version in m-c. Sometimes numbers can be adjusted based on network issues, machine updates, or other oddities outside of talos and the build. To verify that isn't the case, we retrigger the jobs at the same point in time and look at the results. In this case we have retriggered the jobs, and the data a few days after landing shows the 3 patches as causing a regression and the previous revision as remaining the same.

I am sorry, :jya, but this is not an issue you can just ignore; it is fine to have others help out. As Firefox developers we all have responsibilities, and unfortunately I am the bearer of bad news. This is a 10% regression, which is not something to just ignore; if we can justify what is going on, that is great.

Reverting the changes would be exactly like looking at the revision prior to landing, for which we have collected 6 data points that show a consistent set of numbers. In fact, many times I hear the same argument and upon further examination of the code we find that there is a bug in related code which causes loading of elements on all platforms, not just the one originally thought.
Flags: needinfo?(jmaher)
Quick experiment: I've added a MOZ_CRASH at every single point where bug 1183196 touched the code.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=d33ea2356df0

No crashes.
This at least proves that the code changed in bug 1183196 is not even reached during this talos test.
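For illustration, the instrumentation looks roughly like this; a minimal sketch, with a made-up function name rather than the actual patched code:

#include "mozilla/Assertions.h"  // provides MOZ_CRASH

// Hypothetical stand-in for one of the spots touched by the suspect patch.
// With MOZ_CRASH added at the top, any test that reaches this code path
// aborts immediately, so a crash-free (green) try run means the path was
// never executed during the test.
static void UpdatePatchedMoofIndex()
{
  MOZ_CRASH("bug 1183196 code path reached during the talos run");
  // ... the actual patched logic would follow here ...
}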
How could unused code have this impact?

(In reply to Joel Maher (:jmaher) from comment #16)
> In fact, many times I hear the same argument and upon
> further examination of the code we find that there is a bug in related code
> which causes loading of elements on all platforms, not just the one
> originally thought.
Joel, could you please have a look at the original patches (they're small and very localized) and see if you can suggest which part(s) could do such a thing?

The next step would be to actually do a revert, just to see if it helps with tresize.
Try push with MSE changes reverted:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=dfabbfd7d1a0
https://treeherder.mozilla.org/perf.html#/comparechooser?newProject=try&newRevision=dfabbfd7d1a0

Try push without MSE changes reverted:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=ca3611456799
https://treeherder.mozilla.org/perf.html#/comparechooser?newProject=try&newRevision=ca3611456799

qparent was 07befc6f54e7.

Joel: can you please compare these two Try pushes, respin jobs if necessary to increase confidence, and tell us whether backing out would help here?
Flags: needinfo?(jmaher)
The only thing that would make sense is that the extra code somehow tipped the binary size over a certain threshold, at which point the compiler can't optimise as well as it otherwise would.

But that said, absolutely any other change could cause the same regression, and reverting the change isn't a solution, as any change over time could randomly trigger it again.

Which is why I wanted to see current m-c (not from a few days ago) as it is, and one with the 3 changes reverted on top of it.
That will prove my point; I'm happy to bet that it won't change a thing, because the tipping point would have been reached by other commits in the tree since that day.

Joel, I'm not ignoring the issue. I'm saying you're barking up the wrong tree, and that your approach has only proven that there is a regression, not that a change to unused code is the culprit.

This is current central:
https://treeherder.mozilla.org/perf.html#/comparechooser?newProject=try&newRevision=f7a077c6edbe

Try with the three commits reverted:
https://treeherder.mozilla.org/perf.html#/comparechooser?newProject=try&newRevision=644a31f88488
(In reply to Gerald Squelart [:gerald] from comment #17)
> Quick experiment: I've added a MOZ_CRASH at every single point where bug
> 1183196 touched the code.
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=d33ea2356df0
> No crashes.

Sorry, in my test I only added MOZ_CRASHes to what's in bug 1183196; I actually missed one spot from bug 1189138 that's also in the suspect push, so please take my test with a grain of salt.
I'm fairly confident that code would not be hit either (as it's in MediaSource, which is disabled on Linux), but let me know if you'd like me to really try.
Not sure if bug 1191334 is related...
(In reply to JW Wang [:jwwang] from comment #22)
> Not sure if bug 1191334 is related...

That bug is on Windows, playing YouTube, where MediaSource *is* enabled.
Joel - who is responsible for tresize? Can we get them to explain why this test would be affected by media playback code? I'm not saying that it is impossible, but we've got some pressing issues we need to deal with ahead of looking into this.
Flags: needinfo?(ajones)
So here is the comparison between base (left) and revert (right):
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=f7a077c6edbe&newProject=try&newRevision=644a31f88488

So with the revert (that is, without the blamed code), we have:
tcanvasmark opt: +0.23% 
tresize opt: +0.36%

So according to this, with the code removed, it's *slower*.

Now let's compare Chris Pearce's base (with the blamed code) and the one I pushed (still with the blamed code):
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=ca3611456799&newProject=try&newRevision=f7a077c6edbe

tcanvasmark opt: +0.53%
tresize opt: -1.22%

The variation is higher between two identical copies of central than it was with the changes reverted.

Now just for fun:
Chris's revert vs. my revert.
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=dfabbfd7d1a0&newProject=try&newRevision=644a31f88488

tcanvasmark opt: +0.18%
tresize opt: +0.45% 

Well, there you got me, Chris... Your revert is obviously better than mine, but we've always known you have more experience than I do :) (never mind that the resulting code is identical).

We should really question the validity of the test...
First off, thanks for the try pushes and the related investigation! I have done a lot of additional retriggers of the existing runs, and things just don't make sense. What I suspect is going on is that the base image on the linux64 machines was somehow upgraded, and we just happened to start deploying it when this landed. Historically this has happened a couple of times, and it takes a day for the changes to really roll out to the machines we run the tests on. I will take this bug for now and assume it is not related to the media patches in question.

When we have a regression for which we have ruled out all other code changes, and have done retriggers to prove that it is consistent and repeatable, we push on it to understand it. Usually retriggers will adjust the data and show that machine updates were the cause; oddly enough that didn't work 2 days ago, but a bunch of retriggers today showed the regression going back even 20 revisions.
No longer blocks: 1183196, 1189138
Flags: needinfo?(jmaher)
I filed bug 1196419 to try to detect meaningful differences in test runs on the same revision, which seems to have been the root cause of the false positive here.
This doesn't look like something we can resolve.
Component: Audio/Video: Playback → General
This appears to be infrastructure-related. We are still hunting it down, but this specific bug shouldn't stay open any longer, since there is nothing we can do in the product to fix it.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX