Closed Bug 874141 Opened 11 years ago Closed 11 years ago

25ms Windows ts_paint regression from PGO exclusion

Categories

(Firefox Build System :: General, defect)

All
Windows 7
defect
Not set
normal

Tracking

(firefox23 unaffected, firefox24- affected)

RESOLVED WONTFIX
Tracking Status
firefox23 --- unaffected
firefox24 - affected

People

(Reporter: mbrubeck, Assigned: ted)

References

Details

(Keywords: perf, regression)

Bug 871712 switched PGO from opt-out to opt-in per directory on Windows.  This caused a regression of around 25ms (3-4%) of all Ts benchmarks on Windows platforms.  For example:

> Regression: Firefox - Ts Paint, MAX Dirty Profile - Win7 - 3.3% increase
> ------------------------------------------------------------------------
> Previous: avg 764.048 stddev 3.793 of 12 runs up to revision feccfce43b59
> New     : avg 789.267 stddev 4.972 of 12 runs since revision ea767da526ff
> Change  : +25.219 (3.3% / z=6.650)
> Graph   : http://mzl.la/13C1ZJs

https://groups.google.com/d/topic/mozilla.dev.tree-management/Dr_ZXqVKzqw/discussion

(From that thread it looks like Kraken, SVG Opacity, and TResize might be regressed by the same change, though I haven't looked closer at those yet.  The statistics on inbound were confused by several "PGO" builds right before the change that apparently were not PGOed successfully; see bug 874139.)
Ted suggested on IRC that we might be able to improve this by tweaking the threshold for the analysis from bug 871712.
Summary: 15-20ms Windows ts_paint regression from PGO exclusion → 25ms Windows ts_paint regression from PGO exclusion
Ted, can you please see if you can make this go away by adjusting the threshold?
Assignee: nobody → ted
I'm in SF this week, so I won't have time to look at this until next week, but yes, I can test.
(In reply to comment #3)
> I'm in SF this week, so I won't have time to look at this until next week, but
> yes, I can test.

Sounds good.  Thanks!
Ted, ping?
Flags: needinfo?(ted)
Sorry, I forgot about this. I halved the threshold to 50,000 and pushed to try:
https://tbpl.mozilla.org/?tree=Try&rev=ff8e825786d9
Flags: needinfo?(ted)
There are a bunch of results on Try, but I have a hard time reading the tea leaves. Matt, Ehsan: can one of you see if this made a difference?
Flags: needinfo?(mbrubeck)
Flags: needinfo?(ehsan)
Here's the parent push for comparison:
https://tbpl.mozilla.org/?rev=81b227f1a522&jobname=win.*pgo.talos

Here are raw numbers (before -> after this patch) for the Ts tests.  Overall, I don't see a significant improvement, though the "dirtypaint" tests are slightly improved (based on this very small sample):

Win XP:
ts_paint: 443.58 -> 445.53
tspaint_places_generated_med: 454.95 -> 447.84
tspaint_places_generated_max: 446.47 -> 448.16

Win 7:
ts_paint: 681.68 -> 693.58
tspaint_places_generated_med: 690.53 -> 662.53
tspaint_places_generated_max: 676.63 -> 657.37

Win 8:
ts_paint: 635.32 -> 635.32
tspaint_places_generated_med: 641.58 -> 638.63
tspaint_places_generated_max: 639.95 -> 635.42

linker_max_vsize: 3276537856 ->  3346546688 (+2.1%)

Also note that all these tests are on the new iX slaves, so they can't be directly compared to numbers from comment 0 which ran on the old Talos hardware.
Flags: needinfo?(mbrubeck)
So uh, do either of you have a verdict here? Should we try to land this patch?
(In reply to :Ehsan Akhgari (needinfo? me!) from comment #9)
> Also,
> <http://perf.snarkfest.net/compare-talos/index.
> html?oldRevs=81b227f1a522&newRev=ff8e825786d9&submit=true> is not screaming
> with sadness

Unfortunately, that link is comparing PGO results from Try with both PGO and non-PGO results from mozilla-central because of this compare-talos bug:
https://bitbucket.org/mconnor/compare-talos/issue/13

I triggered some more runs of the "other" and "dirtypaint" suites and will do a manual comparison with slightly more data when they finish.

(In reply to Ted Mielczarek [:ted.mielczarek] from comment #10)
> So uh, do either of you have a verdict here? Should we try to land this
> patch?

The perf win is small or nil - it definitely does not appear to restore most of the startup perf we lost when bug 871712 landed.  Meanwhile, it adds back 67 MB of the 431 MB linker vsize that we shaved off.  So I'd say it's mostly a wash, depending on how much margin we think we need for the linker memory thing.

One nice thing about bug 871712 is that it slowed the growth in linker vsize to near zero, so if the perf wins are real then maybe we can land this and take the vsize hit without dooming our future selves.
(In reply to comment #11)
> One nice thing about bug 871712 is that it slowed the growth in linker vsize to
> near zero, so if the perf wins are real then maybe we can land this and take
> the vsize hit without dooming our future selves.

Yes, and we can always go back and revert it when we get close to the cap again.
I thought about this a little bit and realized what might be at fault. If you read my ramblings about all the data munging I had to do to get the "instructions per directory" counts out of the pgomgr data, you'd find that I had to discard about 25% of the data from pgomgr because I couldn't uniquely map function names back to object files. It is entirely possible that the regression here is in some code that is being executed in our profiling run but can't be properly mapped back to its object file and thus doesn't have PGO enabled.
(In reply to comment #13)
> I thought about this a little bit and realized what might be at fault. If you
> read my ramblings about all the data munging I had to do to get the
> "instructions per directory" counts out of the pgomgr data, you'd find that I
> had to discard about 25% of the data from pgomgr because I couldn't uniquely
> map function names back to object files. It is entirely possible that the
> regression here is in some code that is being executed in our profiling run but
> can't be properly mapped back to its object file and thus doesn't have PGO
> enabled.

Hmm, do you have any ideas on how we could do a better job at the mapping?
I don't. :-(
I keep getting release tracking emails about this, but I don't know that we ever reached consensus. Ehsan, Matt: do either of you know what we should do here?
Flags: needinfo?(mbrubeck)
Flags: needinfo?(ehsan)
I think maybe we should just take the hit for now (i.e. WONTFIX this bug), since no one seems to have a feasible solution to the problem.  There may be easier and more sustainable ways to win back the startup perf than further wrestling with PGO exclusions.
Flags: needinfo?(mbrubeck)
I agree.  Our basic choices at this point are either taking this regression or being unable to release our product to users.  I'll go for the first one.

Should I write to dev.platform about this?
Status: NEW → RESOLVED
Closed: 11 years ago
Flags: needinfo?(ehsan)
Resolution: --- → WONTFIX
Doesn't seem incredibly important, but up to you. Thanks guys!
Release drivers: I'm going to tracking- this bug since we've decided to WONTFIX it (see comment 17, comment 18).
Product: Core → Firefox Build System
You need to log in before you can comment on or make changes to this bug.