25ms Windows ts_paint regression from PGO exclusion

RESOLVED WONTFIX

Status

RESOLVED WONTFIX
6 years ago
8 months ago

People

(Reporter: mbrubeck, Assigned: ted)

Tracking

({perf, regression})

Trunk
All
Windows 7
perf, regression

Firefox Tracking Flags

(firefox23 unaffected, firefox24- affected)

Details

(Reporter)

Description

6 years ago
Bug 871712 switched PGO from opt-out to opt-in per directory on Windows.  This caused a regression of around 25ms (3-4%) of all Ts benchmarks on Windows platforms.  For example:

> Regression: Firefox - Ts Paint, MAX Dirty Profile - Win7 - 3.3% increase
> ------------------------------------------------------------------------
> Previous: avg 764.048 stddev 3.793 of 12 runs up to revision feccfce43b59
> New     : avg 789.267 stddev 4.972 of 12 runs since revision ea767da526ff
> Change  : +25.219 (3.3% / z=6.650)
> Graph   : http://mzl.la/13C1ZJs

https://groups.google.com/d/topic/mozilla.dev.tree-management/Dr_ZXqVKzqw/discussion

(From that thread it looks like Kraken, SVG Opacity, and TResize might be regressed by the same change, though I haven't looked closer at those yet.  The statistics on inbound were confused by several "PGO" builds right before the change that apparently were not PGOed successfully; see bug 874139.)
(Reporter)

Comment 1

6 years ago
Ted suggested on IRC that we might be able to improve this by tweaking the threshold for the analysis from bug 871712.
(Reporter)

Updated

6 years ago
Summary: 15-20ms Windows ts_paint regression from PGO exclusion → 25ms Windows ts_paint regression from PGO exclusion

Comment 2

6 years ago
Ted, can you please see if you can make this go away by adjusting the threshold?
Assignee: nobody → ted
I'm in SF this week, so I won't have time to look at this until next week, but yes, I can test.

Comment 4

6 years ago
(In reply to comment #3)
> I'm in SF this week, so I won't have time to look at this until next week, but
> yes, I can test.

Sounds good.  Thanks!

Updated

5 years ago
tracking-firefox24: ? → +

Comment 5

5 years ago
Ted, ping?
Flags: needinfo?(ted)
Sorry, I forgot about this. I halved the threshold to 50,000 and pushed to try:
https://tbpl.mozilla.org/?tree=Try&rev=ff8e825786d9
Flags: needinfo?(ted)
There are a bunch of results on Try, but I have a hard time reading the tea leaves. Matt, Ehsan: can one of you see if this made a difference?
Flags: needinfo?(mbrubeck)
Flags: needinfo?(ehsan)
(Reporter)

Comment 8

5 years ago
Here's the parent push for comparison:
https://tbpl.mozilla.org/?rev=81b227f1a522&jobname=win.*pgo.talos

Here are raw numbers (before -> after this patch) for the Ts tests.  Overall, I don't see a significant improvement, though the "dirtypaint" tests are slightly improved (based on this very small sample):

Win XP:
ts_paint: 443.58 -> 445.53
tspaint_places_generated_med: 454.95 -> 447.84
tspaint_places_generated_max: 446.47 -> 448.16

Win 7:
ts_paint: 681.68 -> 693.58
tspaint_places_generated_med: 690.53 -> 662.53
tspaint_places_generated_max: 676.63 -> 657.37

Win 8:
ts_paint: 635.32 -> 635.32
tspaint_places_generated_med: 641.58 -> 638.63
tspaint_places_generated_max: 639.95 -> 635.42

linker_max_vsize: 3276537856 ->  3346546688 (+2.1%)

Also note that all these tests are on the new iX slaves, so they can't be directly compared to numbers from comment 0 which ran on the old Talos hardware.
Flags: needinfo?(mbrubeck)
So uh, do either of you have a verdict here? Should we try to land this patch?
(Reporter)

Comment 11

5 years ago
(In reply to :Ehsan Akhgari (needinfo? me!) from comment #9)
> Also,
> <http://perf.snarkfest.net/compare-talos/index.
> html?oldRevs=81b227f1a522&newRev=ff8e825786d9&submit=true> is not screaming
> with sadness

Unfortunately, that link is comparing PGO results from Try with both PGO and non-PGO results from mozilla-central because of this compare-talos bug:
https://bitbucket.org/mconnor/compare-talos/issue/13

I triggered some more runs of the "other" and "dirtypaint" suites and will do a manual comparison with slightly more data when they finish.

(In reply to Ted Mielczarek [:ted.mielczarek] from comment #10)
> So uh, do either of you have a verdict here? Should we try to land this
> patch?

The perf win is small or nil - it definitely does not appear to restore most of the startup perf we lost when bug 871712 landed.  Meanwhile, it adds back 67 MB of the 431 MB linker vsize that we shaved off.  So I'd say it's mostly a wash, depending on how much margin we think we need for the linker memory thing.

One nice thing about bug 871712 is that it slowed the growth in linker vsize to near zero, so if the perf wins are real then maybe we can land this and take the vsize hit without dooming our future selves.

Comment 12

5 years ago
(In reply to comment #11)
> One nice thing about bug 871712 is that it slowed the growth in linker vsize to
> near zero, so if the perf wins are real then maybe we can land this and take
> the vsize hit without dooming our future selves.

Yes, and we can always go back and revert it when we get close to the cap again.
I thought about this a little bit and realized what might be at fault. If you read my ramblings about all the data munging I had to do to get the "instructions per directory" counts out of the pgomgr data, you'd find that I had to discard about 25% of the data from pgomgr because I couldn't uniquely map function names back to object files. It is entirely possible that the regression here is in some code that is being executed in our profiling run but can't be properly mapped back to its object file and thus doesn't have PGO enabled.

Comment 14

5 years ago
(In reply to comment #13)
> I thought about this a little bit and realized what might be at fault. If you
> read my ramblings about all the data munging I had to do to get the
> "instructions per directory" counts out of the pgomgr data, you'd find that I
> had to discard about 25% of the data from pgomgr because I couldn't uniquely
> map function names back to object files. It is entirely possible that the
> regression here is in some code that is being executed in our profiling run but
> can't be properly mapped back to its object file and thus doesn't have PGO
> enabled.

Hmm, do you have any ideas on how we could do a better job at the mapping?
I don't. :-(
I keep getting release tracking emails about this, but I don't know that we ever reached consensus. Ehsan, Matt: do either of you know what we should do here?
Flags: needinfo?(mbrubeck)
Flags: needinfo?(ehsan)
(Reporter)

Comment 17

5 years ago
I think maybe we should just take the hit for now (i.e. WONTFIX this bug), since no one seems to have a feasible solution to the problem.  There may be easier and more sustainable ways to win back the startup perf than further wrestling with PGO exclusions.
Flags: needinfo?(mbrubeck)

Comment 18

5 years ago
I agree.  Our basic choices at this point are either taking this regression or being unable to release our product to users.  I'll go for the first one.

Should I write to dev.platform about this?
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Flags: needinfo?(ehsan)
Resolution: --- → WONTFIX
Doesn't seem incredibly important, but up to you. Thanks guys!
Release drivers: I'm going to tracking- this bug since we've decided to WONTFIX it (see comment 17, comment 18).
tracking-firefox24: + → -

Updated

8 months ago
Product: Core → Firefox Build System
You need to log in before you can comment on or make changes to this bug.