Compile Skia with -O3 on optimized builds to avoid performance regression, and instead get performance improvement

ASSIGNED
Assigned to

Status

()

Core
Graphics
P3
normal
ASSIGNED
a year ago
a year ago

People

(Reporter: lsalzman, Assigned: lsalzman)

Tracking

51 Branch
Points:
---

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [gfx-noted])

(Assignee)

Description

a year ago
Currently we compile with -Os optimization flags on Linux and Android. In release builds on Linux, we use -O3, but on Android we still always use -Os even in release. On windows, e use an equivalent /O1 /Oi which implicitly triggers /Os (favor size over speed) which is also used in release.

With the updating effort to Skia milestone 55, I noticed a severe performance regression that involved upstream removing some handcrafted SSE intrinsic code, and replacing it with a more abstracted SSE vector class that relies more heavily upon the compiler to do the work of inlining it away. The regression, in canvasmark, took numbers down approximately 25% from our baseline.

The fix turned out just to be to compile with -O3. With that, numbers are in fact up to about 20% over baseline. So it turns what could have been a regression into an improvement.

The downside to this is that libxul.so/dll will increase in size by about 1% for Android and Windows in our release builds. The net benefit we get is a significant bump in our canvasmark numbers and some other odds and ends on talos, whereas the alternative is a significant decrease in performance.

The reason I wanted to start a discussion on this was to address whether we are okay with this size increase.
(Assignee)

Comment 1

a year ago
It should also be noted that upstream prefers to compile with -O3 themselves, so that is what they are actually testing performance against.
I would be shocked if a 1% increase in library size was an issue. Do we know how much longer it takes for libxul to load on a slow platform, say, an Android device, when Skia is compiled with O3?
I don't see any problem with this size increase, especially for performance gains.
(Assignee)

Comment 4

a year ago
Just to clarify as well, this is only to change the optimization flags for Skia's local moz.build, not the global optimization flags.
(In reply to Lee Salzman [:lsalzman] from comment #0)
> Currently we compile with -Os optimization flags on Linux and Android. In
> release builds on Linux, we use -O3, but on Android we still always use -Os
> even in release. On windows, e use an equivalent /O1 /Oi which implicitly
> triggers /Os (favor size over speed) which is also used in release.

It's more complicated than that: release builds for Linux and Windows are using PGO. With GCC, that means -O3 is the maximum something is going to be compiled with. Cold code will still be built as -Os. On Visual Studio, I'm actually not sure what PGO does to optimization flags. I guess it does something similar.

All this to say: have you compared PGO builds for Linux and Windows (without touching the current optimization flags)
On Android, size is definitely a concern: there is considerable pressure to find ways to _reduce_ the overall size of our APK, rather than allowing it to continue growing. (E.g. see bug 1215247, where there has been considerable discussion of the significance of Fennec APK size.)

As such, I think we need to be cautious about a change that pushes us in the wrong direction there; do we believe the impact on real-world use (as opposed to a specific benchmark) will justify the size increase _on Android_ in particular? (I don't think it's a concern for desktop Linux.) cc'ing some of the people who may want to look at this.
Also, if it's only inlining not happening for some class methods, can you try to add MOZ_ALWAYS_INLINE to them instead?
(Assignee)

Comment 8

a year ago
(In reply to Mike Hommey [:glandium] from comment #7)
> Also, if it's only inlining not happening for some class methods, can you
> try to add MOZ_ALWAYS_INLINE to them instead?

The problem has some local components and some global ones. So while you can hack a few heads off the hydra this way, at the cost of trying to maintain local patches to a library with a high degree of internal flux, this wouldn't give the full 20% boost that is quoted, but only a small portion of it.
(Assignee)

Comment 9

a year ago
(In reply to Mike Hommey [:glandium] from comment #5)
> (In reply to Lee Salzman [:lsalzman] from comment #0)
> > Currently we compile with -Os optimization flags on Linux and Android. In
> > release builds on Linux, we use -O3, but on Android we still always use -Os
> > even in release. On windows, e use an equivalent /O1 /Oi which implicitly
> > triggers /Os (favor size over speed) which is also used in release.
> 
> It's more complicated than that: release builds for Linux and Windows are
> using PGO. With GCC, that means -O3 is the maximum something is going to be
> compiled with. Cold code will still be built as -Os. On Visual Studio, I'm
> actually not sure what PGO does to optimization flags. I guess it does
> something similar.
> 
> All this to say: have you compared PGO builds for Linux and Windows (without
> touching the current optimization flags)

Have some experiments going on try right now to assess the impact on our Windows builds and whether the release builds are currently affected by the regression. The normal opt builds definitely are. Linux is definitely safe on release due to the mentioned -O3 we use there, but opt builds are regressed. Will update when these PGO builds on try finally get somewhere.
As I said, the -O3 on release builds on linux is *not* a simple -O3. You should check Linux PGO builds.
(Assignee)

Comment 11

a year ago
(In reply to Mike Hommey [:glandium] from comment #10)
> As I said, the -O3 on release builds on linux is *not* a simple -O3. You
> should check Linux PGO builds.

I triggered the try run for all platforms, so we'll get in Linux results too. No worries.
(In reply to Jonathan Kew (:jfkthame) from comment #6)
> As such, I think we need to be cautious about a change that pushes us in the
> wrong direction there; do we believe the impact on real-world use (as
> opposed to a specific benchmark) will justify the size increase _on Android_
> in particular? (I don't think it's a concern for desktop Linux.)

I would echo these concerns, though if we think canvasmark is a decent benchmark and we can show that this change is beneficial for real-world webpages, that's certainly a point in favor of doing it.  It's not like we don't already do this for other pieces of code: cairo/pixman had optimizations turned up in bug 386897 for much the same reason that's being proposed here (though the win was much larger), and the whole JS engine gets compiled with optimizations cranked up (!).

(In reply to Lee Salzman [:lsalzman] from comment #8)
> (In reply to Mike Hommey [:glandium] from comment #7)
> > Also, if it's only inlining not happening for some class methods, can you
> > try to add MOZ_ALWAYS_INLINE to them instead?
> 
> The problem has some local components and some global ones. So while you can
> hack a few heads off the hydra this way, at the cost of trying to maintain
> local patches to a library with a high degree of internal flux, this
> wouldn't give the full 20% boost that is quoted, but only a small portion of
> it.

I assume this local/global split alos means that we couldn't compile certain files with -O3?  Comment 0 makes it sound like it's a particular file/routine, whereas comment 8 makes it sound like a more widespread problem, which is a little confusing.

At the risk of sounding like a downer, it's worth noting that this is likely a *permanent* 1% increase with this release of Skia, and probably an increasingly larger increase as Skia gets more code (relative to a hypothetical -Os build and that Skia itself increases in size faster with -O3 vs. -Os).  And it's not just installed size on disk, it's binaries that we build every day in automation and store for quite some time, it's compilation time for hundreds or thousands of builds per day, network bandwidth transferring things around, etc. etc.

It is quite difficult to get a 1% size decrease--something like person-weeks of work on average.  We can't avoid many size increases (more functionality is hard to get without more code), but for deliberate size increases like this, we should be really sure the win is there and there aren't other alternatives.
(Assignee)

Comment 13

a year ago
(In reply to Nathan Froyd [:froydnj] from comment #12)
> (In reply to Jonathan Kew (:jfkthame) from comment #6)
> > As such, I think we need to be cautious about a change that pushes us in the
> > wrong direction there; do we believe the impact on real-world use (as
> > opposed to a specific benchmark) will justify the size increase _on Android_
> > in particular? (I don't think it's a concern for desktop Linux.)
> 
> I would echo these concerns, though if we think canvasmark is a decent
> benchmark and we can show that this change is beneficial for real-world
> webpages, that's certainly a point in favor of doing it.  It's not like we
> don't already do this for other pieces of code: cairo/pixman had
> optimizations turned up in bug 386897 for much the same reason that's being
> proposed here (though the win was much larger), and the whole JS engine gets
> compiled with optimizations cranked up (!).

Evil idea: what if we cranked back down the optimization levels on cairo and pixman to reclaim some of that size, so we have some room to spend it on Skia, considering Skia is now taking over important rendering duties from Cairo? That is to say, if we're going to be spending that size increase anywhere, spending it on Cairo instead of Skia is the wrong place, if we had to make that choice today...
 
> (In reply to Lee Salzman [:lsalzman] from comment #8)
> > (In reply to Mike Hommey [:glandium] from comment #7)
> > > Also, if it's only inlining not happening for some class methods, can you
> > > try to add MOZ_ALWAYS_INLINE to them instead?
> > 
> > The problem has some local components and some global ones. So while you can
> > hack a few heads off the hydra this way, at the cost of trying to maintain
> > local patches to a library with a high degree of internal flux, this
> > wouldn't give the full 20% boost that is quoted, but only a small portion of
> > it.
> 
> I assume this local/global split alos means that we couldn't compile certain
> files with -O3?  Comment 0 makes it sound like it's a particular
> file/routine, whereas comment 8 makes it sound like a more widespread
> problem, which is a little confusing.

Some files can be compiled with the higher optimization levels, yielding a portion of the increase. But there are some further places sprinkled around everywhere else in the library that are not being dealt with optimally via -Os currently that just compiling the whole library with -O3 fixes.

> At the risk of sounding like a downer, it's worth noting that this is likely
> a *permanent* 1% increase with this release of Skia, and probably an
> increasingly larger increase as Skia gets more code (relative to a
> hypothetical -Os build and that Skia itself increases in size faster with
> -O3 vs. -Os).  And it's not just installed size on disk, it's binaries that
> we build every day in automation and store for quite some time, it's
> compilation time for hundreds or thousands of builds per day, network
> bandwidth transferring things around, etc. etc.
> 
> It is quite difficult to get a 1% size decrease--something like person-weeks
> of work on average.  We can't avoid many size increases (more functionality
> is hard to get without more code), but for deliberate size increases like
> this, we should be really sure the win is there and there aren't other
> alternatives.
(Assignee)

Comment 14

a year ago
(In reply to Lee Salzman [:lsalzman] from comment #13)
> (In reply to Nathan Froyd [:froydnj] from comment #12)
> > (In reply to Jonathan Kew (:jfkthame) from comment #6)
> > > As such, I think we need to be cautious about a change that pushes us in the
> > > wrong direction there; do we believe the impact on real-world use (as
> > > opposed to a specific benchmark) will justify the size increase _on Android_
> > > in particular? (I don't think it's a concern for desktop Linux.)
> > 
> > I would echo these concerns, though if we think canvasmark is a decent
> > benchmark and we can show that this change is beneficial for real-world
> > webpages, that's certainly a point in favor of doing it.  It's not like we
> > don't already do this for other pieces of code: cairo/pixman had
> > optimizations turned up in bug 386897 for much the same reason that's being
> > proposed here (though the win was much larger), and the whole JS engine gets
> > compiled with optimizations cranked up (!).
> 
> Evil idea: what if we cranked back down the optimization levels on cairo and
> pixman to reclaim some of that size, so we have some room to spend it on
> Skia, considering Skia is now taking over important rendering duties from
> Cairo? That is to say, if we're going to be spending that size increase
> anywhere, spending it on Cairo instead of Skia is the wrong place, if we had
> to make that choice today...

Just tested this. It only bought back about 0.1%... Oh well, evil ideas rarely pan out.
Another line of investigation: try to use some specific flags on top of -Os instead of going full -O3. Things like -finline-small-functions, -finline-functions, -fpartial-inlining, etc.
(Assignee)

Comment 16

a year ago
(In reply to Mike Hommey [:glandium] from comment #15)
> Another line of investigation: try to use some specific flags on top of -Os
> instead of going full -O3. Things like -finline-small-functions,
> -finline-functions, -fpartial-inlining, etc.

Tried those and all other optimizations listed in the manpage for -O3, the ones it claims -Os disables, the ones listed by --help=optimizers, etc. and no good. There is some other implicit limit being set by -Os that is affecting this mess.
(Assignee)

Comment 17

a year ago
Talos results for PGO builds came in:

Windows PGO build, /Ot on newer Skia displays the speedup above a PGO build without, and above our PGO builds with older version of Skia, but without /Ot the PGO builds for the different versions of Skia seem to result in similar canvasmark scores. So, PGO on Windows seems to compensate for the regression, but we nevertheless get an extra new speedup from /Ot above even that.

Linux PGO build, which was already expected to use -O3, is insensitive to the presence of it as such.
(In reply to Lee Salzman [:lsalzman] from comment #17)
> but we nevertheless get an extra new speedup from /Ot above even that.

that you also get with older version of Skia, so this becomes a separate issue.

> Linux PGO build, which was already expected to use -O3, is insensitive to
> the presence of it as such.

IOW, the concern for this bug is the regression on Android. What about Mac?
(Although, most Linux distro builds are not building with PGO, so there's a concern about regression there. Some of them override the optimization flags and don't use -Os, though)
(Assignee)

Comment 20

a year ago
Mac opt builds seem to show the same pattern of regression and improvement with -O3.
We will have to choose between a few options, and those choices can be platform specific. Depending on what is more important, we may, for example, choose a performance regression over size on Android, but performance improvement on Windows.
What we really don't have a (practical) option for is status quo, where we neither get the performance regression, nor the size increase.
Summary: Compile Skia with -O3 on optimized builds → Compile Skia with -O3 on optimized builds to avoid performance regression, and instead get performance improvement
(In reply to Milan Sreckovic [:milan] from comment #21)
> What we really don't have a (practical) option for is status quo, where we
> neither get the performance regression, nor the size increase.

This is not what comment 17 says (it says PGO builds are unaffected by the regression).
(Assignee)

Comment 23

a year ago
(In reply to Mike Hommey [:glandium] from comment #22)
> (In reply to Milan Sreckovic [:milan] from comment #21)
> > What we really don't have a (practical) option for is status quo, where we
> > neither get the performance regression, nor the size increase.
> 
> This is not what comment 17 says (it says PGO builds are unaffected by the
> regression).

Upon further investigation, this is not strictly true for Windows. There are regressions still showing up, as a side-effect of enabling LTCG, that are worse than the opt build regressions, and require a bit more invasive hackery to fix.
(In reply to Lee Salzman [:lsalzman] from comment #23)
> There are regressions still showing up, as a side-effect of enabling LTCG

Are you saying that PGO builds are slower than opt builds? If yes, that's not what matters most. What matters most is that PGO builds don't do worse with the new version than they did with the old one. If you want to improve performance by tweaking optimization flags, that's a separate issue. Especially, wrt PGO, before touching any flags, we should start making the profile run more relevant code. We barely scratch the surface, and a lot of code is probably considered cold and not optimized fully.
(Assignee)

Comment 25

a year ago
(In reply to Mike Hommey [:glandium] from comment #24)
> (In reply to Lee Salzman [:lsalzman] from comment #23)
> > There are regressions still showing up, as a side-effect of enabling LTCG
> 
> Are you saying that PGO builds are slower than opt builds? If yes, that's
> not what matters most. What matters most is that PGO builds don't do worse
> with the new version than they did with the old one. If you want to improve
> performance by tweaking optimization flags, that's a separate issue.
> Especially, wrt PGO, before touching any flags, we should start making the
> profile run more relevant code. We barely scratch the surface, and a lot of
> code is probably considered cold and not optimized fully.

On some tsvgr_opacity, yes, PGO actually comes out slower.
You need to log in before you can comment on or make changes to this bug.