Closed Bug 1015572 Opened 10 years ago Closed 10 years ago

8% linux32 tpaint regression on fx32, fx31 around May 21

Categories

(Testing :: Talos, defect)

x86
Linux
defect
Not set
normal
Points:
8

Tracking

(firefox31- wontfix, firefox32+ wontfix, firefox33+ wontfix, firefox34+ wontfix)

RESOLVED WONTFIX
Tracking Status
firefox31 - wontfix
firefox32 + wontfix
firefox33 + wontfix
firefox34 + wontfix

People

(Reporter: jmaher, Assigned: dao)

References

Details

(Keywords: perf, regression, Whiteboard: [talos_regression][dzalerts] [qa-])

Attachments

(1 file)

I found this from the datazilla alerts we are working on- there is no tpaint talos alert in the last month.  Looking at the graph this looks real:
http://graphs.mozilla.org/graph.html#tests=[[82,52,33],[82,63,33]]&sel=1400321808476,1400926608476&displayrange=7&datatype=running

I have done a lot of retriggers, it is so hard with the noise to determine what is going on:
https://tbpl.mozilla.org/?tree=Mozilla-Aurora&jobname=Ubuntu%20HW%2012.04%20mozilla-aurora%20pgo%20talos%20other&startdate=2014-05-21&enddate=2014-05-22

Looking at graph server and the retriggers the first point where we cross over the 160 line and continue to report numbers above it (for the most part) is:
https://tbpl.mozilla.org/?tree=Mozilla-Aurora&jobname=Ubuntu%20HW%2012.04%20mozilla-aurora%20pgo%20talos%20other&startdate=2014-05-21&enddate=2014-05-22&rev=6c7ac16fbaaa

There are a lot of changes there, I am not sure why we have landed so many changes at once.

My thoughts are to look at Aurora and then see what landed there around the same time we saw the overall range bump up on Inbound.  Cross referencing those patches we should be able to narrow this down.  Good news is that we don't see this on beta, so any of these patches that landed on beta can be removed from the suspect list.
and the winner is:
https://hg.mozilla.org/releases/mozilla-aurora/rev/f6660863c0aa

I cannot access the bug, so I will just guess on who to cc.

Here is more information about the test: https://wiki.mozilla.org/Buildbot/Talos/Tests#tpaint

Alexandre, I see that you pushed this patch, it appears to have caused a 8% regression.  Do you have any idea why this might have done it?

another rant- if you are going to push code to Mozilla and make bugs private please push with your email address that you use for bugzilla so I can copy/paste your author email and contact you in bugzilla.  That is a major fail!

and another rant- since this bug is private I cannot link to it which makes it even more difficult to link the root cause of this.  The root cause is the patch from bug 1000337.

This is on Aurora, so please respond promptly.
Flags: needinfo?(lissyx+mozillians)
Summary: 8% tpaint regression on fx32, fx31 around May 21 → 8% linux32 tpaint regression on fx32, fx31 around May 21
Blocks: 990085
(In reply to Joel Maher (:jmaher) from comment #3)
> and the winner is:
> https://hg.mozilla.org/releases/mozilla-aurora/rev/f6660863c0aa
> 
> I cannot access the bug, so I will just guess on who to cc.
> 
> Here is more information about the test:
> https://wiki.mozilla.org/Buildbot/Talos/Tests#tpaint
> 
> Alexandre, I see that you pushed this patch, it appears to have caused a 8%
> regression.  Do you have any idea why this might have done it?

I'm a bit surprised, this is code related to the NotificationDB/NotificationStorage; how can it have an impact on paint tests ?!

> 
> another rant- if you are going to push code to Mozilla and make bugs private
> please push with your email address that you use for bugzilla so I can
> copy/paste your author email and contact you in bugzilla.  That is a major
> fail!

Sorry, I was not aware of this.

> 
> and another rant- since this bug is private I cannot link to it which makes
> it even more difficult to link the root cause of this.  The root cause is
> the patch from bug 1000337.

It's a security bug.

> 
> This is on Aurora, so please respond promptly.
Flags: needinfo?(lissyx+mozillians)
Thanks for the quick response.

I am not sure why this is the one which caused the regression.  I do know that with a series of retriggers done on the 4 backouts this was the only one that had numbers back to normal and less noise.

The paint test opens a new application window 25 times, ignores the first 5 and reports the median number of the remaining 20.  Could there be anything that affects new window creation?
(In reply to Joel Maher (:jmaher) from comment #5)
> Thanks for the quick response.
> 
> I am not sure why this is the one which caused the regression.  I do know
> that with a series of retriggers done on the 4 backouts this was the only
> one that had numbers back to normal and less noise.
> 
> The paint test opens a new application window 25 times, ignores the first 5
> and reports the median number of the remaining 20.  Could there be anything
> that affects new window creation?

Does your test involves any Notification API use?
Unfortunately this test doesn't do any intentional use of the notification api.

We call window.open:
http://hg.mozilla.org/build/talos/file/2b78501af8d0/talos/startup_test/tpaint.html#l102

and the uri we open is a small blob of html:
http://hg.mozilla.org/build/talos/file/2b78501af8d0/talos/startup_test/tpaint.html#l88
(In reply to Joel Maher (:jmaher) from comment #7)
> Unfortunately this test doesn't do any intentional use of the notification
> api.
> 
> We call window.open:
> http://hg.mozilla.org/build/talos/file/2b78501af8d0/talos/startup_test/
> tpaint.html#l102
> 
> and the uri we open is a small blob of html:
> http://hg.mozilla.org/build/talos/file/2b78501af8d0/talos/startup_test/
> tpaint.html#l88

I really have no idea how those two things may relates :(
Flags: needinfo?(mhenretty)
Blocks: 1000337
interesting, I have access to the bug now and see that it landed on beta- there is no regression on beta.  This could be one of two things:
1) this is the wrong patch
2) other code which is on aurora/trunk is causing this patch to show the regression
(In reply to Joel Maher (:jmaher) from comment #9)
> interesting, I have access to the bug now and see that it landed on beta-
> there is no regression on beta.  This could be one of two things:
> 1) this is the wrong patch
> 2) other code which is on aurora/trunk is causing this patch to show the
> regression

My feeling would be much more on 1, but until proved otherwise, I cannot assert anything. Do you have any way to get the diff between the two trees ?

Maybe Michael will be able to have an idea, also. He did most of the Notification API work.
Lets see what Michael says, there have been a few times in the past where it looks the the changeset, but it ends up not being it.  It is rare, but if the code doesn't seem related at all we need to not close that door.
Looking at the changeset [1], there is nothing that should change the timing of the first paint. In the patch we added a few lines of pure JS, but no additional asynchronous calls, no additional loops, and no calls into xp/webidl. Even if the startup code from [2] did use the notification API (which it doesn't), the should be no significant additional running time added due to this patch.

Joel, are there any other dependencies that could cause this regression unrelated to the changeset from bug 1000337?


1.) https://hg.mozilla.org/releases/mozilla-aurora/rev/f6660863c0aa

2.) http://hg.mozilla.org/build/talos/file/2b78501af8d0/talos/startup_test/tpaint.html#l88
Flags: needinfo?(mhenretty)
How delightful. Those patches were actually pushed to try to check perf impact and nothing showed up.
Gijs, maybe they are not the culprit, let me see how it turns out:
https://tbpl.mozilla.org/?tree=Try&rev=77c6493f3987
hmm, over 10 retriggers and all are showing the old value of 148-152 (except one high point of 189).  The previous work in this bug was just trying to get this down <160.

Gijs, any thoughts on this?
Flags: needinfo?(gijskruitbosch+bugs)
(I'm out until Monday, unfortunately, and won't have time to look at this before then, sorry - perhaps Dão can help as it was his patch?)
Flags: needinfo?(dao)
Yeah, http://compare-talos.mattn.ca/?oldRevs=5239876497e2&newRev=77c6493f3987&server=graphs.mozilla.org&submit=true looks serious. Dão, can you look into this, please? It would be interesting, for instance, if you could trypush pgo builds with your other idea and see if that fixes our linux startup time issue. :-\

Otherwise, Joel, can you get a recent screenshot of the browser windows on those machines, so we can look at their toolbar colors and see how they would/wouldn't be affected by the patch, just so we can keep that in mind? I'm not sure where to look for something like that, but I'm hoping you know. :-)
Flags: needinfo?(gijskruitbosch+bugs) → needinfo?(jmaher)
I don't have a method for getting screenshots mid test.  I do have a loaner linux64 machine where I could run the test locally and get a screenshot.  This regression is on linux32, but I suspect the colors would be the same.

Gijs, would that be useful?
Flags: needinfo?(jmaher)
(In reply to Joel Maher (:jmaher) from comment #19)
> I don't have a method for getting screenshots mid test.  I do have a loaner
> linux64 machine where I could run the test locally and get a screenshot. 
> This regression is on linux32, but I suspect the colors would be the same.
> 
> Gijs, would that be useful?

Anything's better than nothing, so yes please. :-)
Untracking. beta 6 should be released today and we won't block the release on this bug.
I'm marking this bug as won't fix for 32 as there's been no progress.

Joel - Can you confirm that the regression is still present and as severe as when you reported this 3 months ago?
Thanks for confirming Joel. Reading through the comment history, this bug was left off with :
- a request to Dão to look into this (comment 18)
- Joel needing to get a screenshot on a loaner linux64 machine (comment 20)

We're about to ship this regression in Firefox 32. I would like to see this picked back up in the next couple of weeks to see if progress can be made on 33 while on Aurora.
Flags: needinfo?(jmaher)
Flags: needinfo?(dao)
Flags: firefox-backlog?
Attached image tpaint.png
took a screen shot (vnc -> machine) while the test was running.  I am not sure I am getting anything useful, but now that I have this setup, I can easily try other ones

Gijs, does this help at all?
Flags: needinfo?(jmaher) → needinfo?(gijskruitbosch+bugs)
(In reply to Joel Maher (:jmaher) from comment #25)
> Created attachment 8477397 [details]
> tpaint.png
> 
> took a screen shot (vnc -> machine) while the test was running.  I am not
> sure I am getting anything useful, but now that I have this setup, I can
> easily try other ones
> 
> Gijs, does this help at all?

I mean, from the looks of it we'd find a dark toolbar for some cases (i.e. tab bar, menubar (assuming we're not ignoring collapsed toolbars?)) and not for other cases (i.e. nav-bar).

I still think Dão is the best person to investigate here, as it was his patch. I'm not sure why the needinfo was cleared earlier.
Flags: needinfo?(gijskruitbosch+bugs) → needinfo?(dao)
Blocks: 1012629
Flags: firefox-backlog? → firefox-backlog+
Whiteboard: [talos_regression][dzalerts] → [talos_regression][dzalerts] [qa?]
Whiteboard: [talos_regression][dzalerts] [qa?] → [talos_regression][dzalerts] [qa-]
Last week I sent a few diagnosis patches to try using this syntax:

try: -b o -p linux,linux64 -u none -t other

but no talos jobs were run. :( If somebody can spot a mistake in the syntax or knows what else could have gone wrong, please let me know.
for linux specifically, you need: -t other_l64
Thanks, I'll try again. Why does http://trychooser.pub.build.mozilla.org/ not take care of this?
an oversight on my part, I have a bug to work on now:
https://bugzilla.mozilla.org/show_bug.cgi?id=1059689
So, how would I get tpaint runs on linux32 as opposed to x64?
try: -b o -p linux -u none -t all

if we get trychooser fixed, it could be '-t other_nol64', maybe that works, I am not sure without attempting it though.
Points: --- → 8
Assignee: nobody → dao
Status: NEW → ASSIGNED
Iteration: --- → 35.1
Depends on: 1061947
The patch in bug 1061947 should cut off about half of the regression. Filed as a separate bug because I'll want to land and uplift that patch now while also investigating the other half of the regression, which might lead to more patches.
Flags: needinfo?(dao)
I think this will have to stay half fixed/wontfix. Like I said, bug 1061947 eliminated some inefficiency, but the rest of the regression appears to be inevitable because we're doing more work on startup. We could delay this until after the first paint, but then users would see the wrong icons being used for that first paint.

I've nominated bug 1061947's patch for aurora and beta uplift.
No longer blocks: 1000337
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → WONTFIX
Iteration: 35.1 → ---
Flags: firefox-backlog+ → firefox-backlog-
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: