Closed Bug 1029968 Opened 10 years ago Closed 10 years ago

June 19 regression in all windows PGO talos performance numbers

Categories

(Core :: Graphics: Layers, defect)

x86
Windows 7
defect
Not set
normal

Tracking

RESOLVED FIXED
mozilla33
Tracking Status
firefox33 - ---

People

(Reporter: dbaron, Unassigned)

Details

(Keywords: perf, Whiteboard: [talos_regression])

All (or at least many) Windows performance numbers on Talos regressed substantially on mozilla-inbound on June 19, in this range:
https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=bff872c9d4b2&tochange=e589c195f61d

The regression later appeared on mozilla-central and fx-team in what I believe (although I haven't yet checked) are corresponding ranges:
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=79e69d064957&tochange=bdac18bd6c74
https://hg.mozilla.org/integration/fx-team/pushloghtml?fromchange=3a4d57044461&tochange=36efd6ffbcd0

See example graphs at:
http://mzl.la/1kWKQT9 (Tp5 Optimized WINNT 6.1)
http://mzl.la/1kWLhga (Paint WINNT 5.1)
but there are many more.

Given that it's Windows-specific, I think bug 1027365 seems the most likely in that range at first glance.
Flags: needinfo?(nical.bugzilla)
(In reply to David Baron [:dbaron] (UTC-7) (needinfo? for questions) from comment #0)
> The regression later appeared on mozilla-central and fx-team in what I
> believe although haven't yet checked are corresponding ranges:
> https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=79e69d064957&tochange=bdac18bd6c74
> https://hg.mozilla.org/integration/fx-team/pushloghtml?fromchange=3a4d57044461&tochange=36efd6ffbcd0

Actually, the ranges don't correspond.

Maybe we changed our infrastructure?
So, let's just focus on the graphs for those two tests (which I picked sort of at random, although partly because they were larger numbers and probably important tests):

The graphs for mozilla-central:
http://mzl.la/1q60K5F (Tp5 Optimized WINNT 6.1)
http://mzl.la/1q61eZv (Paint WINNT 5.1)

The graphs for fx-team:
http://mzl.la/1nzcV4u (Tp5 Optimized WINNT 6.1)
http://mzl.la/1nzagI3 (Paint WINNT 5.1)

Do we run non-PGO versions of these tests?  Maybe we just hit some PGO cliff?
Flags: needinfo?(ehsan)
(In reply to David Baron [:dbaron] (UTC-7) (needinfo? for questions) from comment #0)
> Given that it's Windows-specific, I think bug 1027365 seems the most likely
> in that range at first glance.

Bug 1027365 only simplified the prefs around enabling async-video and did not affect Windows (async-video was already enabled without e10s and disabled with e10s). The only change is that async-video is now enabled by default on Linux+emulator and Mac+emulator.
Flags: needinfo?(nical.bugzilla)
Thanks for filing this! The regression seems to be PGO-specific; I have kicked off a few PGO builds to fill in the holes.

This might take a bit of magic and luck. I will be on PTO later today and tomorrow, so if we don't figure this out by Friday, I will work on it more.

Here is a narrow view of the tbpl ranges to work with:
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&jobname=Windows%20XP%2032-bit%20mozilla-inbound%20pgo%20talos&fromchange=f57cf85fd128&tochange=fefe4c4ffe93
Whiteboard: [talos_regression]
(In reply to David Baron [:dbaron] (UTC-7) (needinfo? for questions) from comment #2)
> So, let's just focus on the graphs for those two tests (which I picked sort
> of at random, although partly because they were larger numbers and probably
> important tests):
> 
> The graphs for mozilla-central:
> http://mzl.la/1q60K5F (Tp5 Optimized WINNT 6.1)
> http://mzl.la/1q61eZv (Paint WINNT 5.1)
> 
> The graphs for fx-team:
> http://mzl.la/1nzcV4u (Tp5 Optimized WINNT 6.1)
> http://mzl.la/1nzagI3 (Paint WINNT 5.1)
> 
> Do we run non-PGO versions of these tests?  Maybe we just hit some PGO cliff?

I'm pretty sure we run both PGO and non-PGO versions of these tests (according to TBPL), but the last time I looked at this stuff was a while ago, so I'm not sure I can provide any meaningful info here...
Flags: needinfo?(ehsan)
(In reply to David Baron [:dbaron] (UTC-7) (needinfo? for questions) from comment #2)
> Do we run non-PGO versions of these tests?

Yes, the branch name has the *-Non-PGO suffix on graph server:

Same graphs for mozilla-central Non-PGO (branch 94):
http://mzl.la/1pkNcng (Tp5 Optimized WINNT 6.1)
http://mzl.la/UKM1jd (Paint WINNT 5.1)
Summary: June 19 regression in all windows talos performance numbers → June 19 regression in all windows PGO talos performance numbers
Nominating for tracking Firefox 33 given that this appears to be a 20%-70% performance regression across our primary performance tests on Windows.
regressions:

Windows 7:
* 10% tresize: 21.25 -> 23.75
* 16% kraken: 1650 -> 1912
* 23% dromaeo_css: 4875 -> 3750
* 30% dromaeo_dom: 1250 -> 885
* 15% session_restore: 1317 -> 1515
* 355% a11y: 165 -> 583
* 45% tpaint: 140 -> 203
* 10% ts_paint: 750 -> 823
* 15% sessionrestore_no_auto_restore: 1310 -> 1514
* 3.5% tscrollx: 3.37 -> 3.49
* 30% tsvgr_opacity: 220 -> 286
* 29% tart: 6.85 -> 8.85
* 45% cart: 43.5 -> 64
* 40% tsvgx: 212 -> 297
* 75% tp5o: 207 -> 364
* 255% tp5o_responsiveness: 38 -> 97

Windows XP:
* 27% tresize: 10.3 -> 13.1
* 1% canvasmark: 6780-7000 -> 6710-6750 (lower is worse), possible noise levels
* 19% session_restore: 1080 -> 1285
* 280% a11y: 170 -> 470
* 50% tpaint: 126 -> 193
* 10% ts_paint: 615 -> 675
* 18% sessionrestore_no_auto_restore: 1090 -> 1275
* 9% tscrollx: 2.4 -> 2.6, 3.1 -> 3.26  # this is bimodal and it shifted
* 51% tsvgr_opacity: 352 -> 537
* 40% tart: 4.5 -> 6.45
* 38% cart: 40.25 -> 55.5
* 14% tsvgx: 484 -> 552
* 75% tp5o: 189 -> 333
* 300% tp5o_responsiveness: 29 -> 82

Windows 8 is similar.
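
For anyone re-deriving the percentages in the lists above, here is a minimal sketch of the arithmetic, assuming the usual definition of relative change, (after - before) / before. The function name `percent_change` and the dictionary are illustrative only and are not part of Talos or graph server; the before/after pairs are copied from the Windows 7 list. Under that definition a11y comes out at roughly +253% and tp5o_responsiveness at roughly +155%, so the 355% and 255% figures quoted for those two look closer to the after/before ratio than to the increase.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, as a percentage of `before`."""
    return (after - before) / before * 100.0


# (before, after) pairs copied from the Windows 7 list in this comment.
# The selection of tests is arbitrary; add more rows as needed.
windows7 = {
    "tresize": (21.25, 23.75),
    "kraken": (1650, 1912),
    "dromaeo_css": (4875, 3750),  # the number dropped, so the change is negative
    "a11y": (165, 583),
    "tpaint": (140, 203),
    "tp5o": (207, 364),
    "tp5o_responsiveness": (38, 97),
}

for name, (before, after) in windows7.items():
    change = percent_change(before, after)
    ratio = after / before
    # Print both forms so "increase" and "ratio" are not confused.
    print(f"{name}: {change:+.1f}% ({ratio:.2f}x)")
```

Printing the ratio next to the percentage makes it easier to see which convention a given figure in the list follows.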

the offending push is:
https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?changeset=e589c195f61d

you can see the retriggers I did here:
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&jobname=mozilla-inbound%20pgo%20talos&fromchange=70f19803d1ba&tochange=5f1041f40876

:jandem, can you take this bug and fix this?
Flags: needinfo?(jdemooij)
Yes, this seems to be some kind of PGO compiler issue; there's no way my patches can regress performance like this, and the non-PGO builds confirm this.

See also the discussion in bug 1030706. I'll post more info in this bug tomorrow or early next week.
thanks :jandem!  We can hack next week to figure out the PGO issue.  Since we ship PGO this actually has a real impact on what we ship.
here is a list of all the regressions and 1 improvement as seen on mozilla.dev.tree-management:
http://54.215.155.53:8080/alerts.html?rev=e589c195f61d&showAll=1
See Also: → 1030706
jmaher, is it possible this regression was fixed yesterday?

Several other tests that were affected by the PGO regression seem to be fixed, and I see some improvement mails on dev-tree-management. It's scary because a pretty minor string patch introduced it and another pretty small string patch "fixed" it; let's hope it stays this way.
Flags: needinfo?(jmaher)
oh, things look better.  Strings are scary things for Firefox!  Thanks for fixing this and following up!
Status: NEW → RESOLVED
Closed: 10 years ago
Flags: needinfo?(jmaher)
Resolution: --- → FIXED
(In reply to Joel Maher (:jmaher) from comment #13)
> oh, things look better.  Strings are scary things for Firefox!  Thanks for
> fixing this and following up!

To be clear, I didn't fix it intentionally. It looks like another, unrelated string patch somehow "fixed" the MSVC PGO bug... Scary because it may come back when we land another patch...
Flags: needinfo?(jdemooij)
Target Milestone: --- → mozilla33