Closed Bug 544402 Opened 14 years ago Closed 8 years ago

Measure benchmark gains for compiling with SSE2 instructions on Windows

Categories

(Firefox Build System :: General, defect)

x86
Windows 7
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 1271794

People

(Reporter: vlad, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: perf, Whiteboard: [ts])

I just kicked off a simple patch to add -arch:SSE2 to our default build flags for MSVC.  In a simple math benchmark I did for something else this made an enormous difference -- 11,000ms -> 71ms just doing raw matrix multiplications.

Try server has this to say; I only got WINNT 6.0 talos numbers for my run, and I had to compare it to random previous tryserver runs, but:

Tp4:      831    -> 684  (18% win)
Tsspider:  59.24 ->  56.68
Ts:       601.63 -> 574.16
shutdown: 409.84 -> 377.58
V8:       159.5  -> 145.33

Most of the rest of the numbers stayed about the same.  I don't know how much I trust the try server for this, so we should probably do some very controlled tests for this, but the numbers are encouraging.  Figuring out how to get that 18% Tp4 win seems like it's worth it, especially since Tp4 is such a hard number to budge.
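
As a rough sanity check on the deltas above, here is a minimal sketch that turns each before/after pair into a percentage reduction. It assumes lower is better for every row, which the numbers above imply but do not state (that matters most for V8), and the names `RESULTS` and `percent_reduction` are made up for illustration.

```python
# Talos numbers quoted above: (baseline run, run built with -arch:SSE2).
RESULTS = {
    "Tp4": (831.0, 684.0),
    "Tsspider": (59.24, 56.68),
    "Ts": (601.63, 574.16),
    "shutdown": (409.84, 377.58),
    "V8": (159.5, 145.33),
}


def percent_reduction(before, after):
    """Relative drop from `before` to `after`, in percent of `before`."""
    return (before - after) / before * 100.0


for name, (before, after) in RESULTS.items():
    change = percent_reduction(before, after)
    print(f"{name:9s} {before:8.2f} -> {after:8.2f}  ({change:.1f}% lower)")
```

Since the baseline came from unrelated earlier try runs, the smaller deltas may well sit inside run-to-run noise; only a controlled comparison would settle that.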

SSE2 is supported in Pentium 4 or later CPUs (late 2000) and Athlon 64 or later (2003).

We could make the SSE2 build the default, but also offer a non-SSE2 optional download, with additional smarts in the installer to know whether you need to download the non-SSE2 build.
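
A minimal sketch of the kind of check such installer smarts might do, assuming a Windows host. `IsProcessorFeaturePresent` and the `PF_XMMI64_INSTRUCTIONS_AVAILABLE` constant are real Win32 names; `cpu_has_sse2`, `pick_build` and the build labels are hypothetical and only illustrate the decision. This is not how the actual installer works.

```python
import sys

# Win32 constant from winnt.h: "the SSE2 instruction set is available".
PF_XMMI64_INSTRUCTIONS_AVAILABLE = 10


def cpu_has_sse2():
    """Return True/False on Windows, or None where this probe does not apply."""
    if sys.platform != "win32":
        return None
    import ctypes  # imported lazily; windll only exists on Windows

    return bool(
        ctypes.windll.kernel32.IsProcessorFeaturePresent(
            PF_XMMI64_INSTRUCTIONS_AVAILABLE
        )
    )


def pick_build(has_sse2):
    """Choose a build flavor (hypothetical labels); be conservative when unsure."""
    return "sse2" if has_sse2 is True else "non-sse2"


if __name__ == "__main__":
    print(pick_build(cpu_has_sse2()))
```

Falling back to the non-SSE2 build whenever the probe is inconclusive keeps the failure mode to "slower" rather than "crashes on an illegal instruction".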
Is it possible to drill down into that Tp4 number and see if the win is localized to some set of pages or just an across-the-board speedup? Would be nice to see if we can get a portion of that win without flipping this on globally.

Shipping two different build configs sounds like a lot of pain. I'm not sure if it's better or worse than dropping support for people with older CPUs (or new CPUs that don't implement the full instruction set).
Shipping two build configs is unknown amounts of pain, but seems like a pretty simple combination of problems we've already solved.  I don't by any means think that we should rule it out, but I do think that we should narrow this bug to a single, actionable thing.  "Seriously consider" and two distinct other actions (providing a build with SSE2, and not providing a build that runs on non-SSE2 machines) are signs that we're unlikely to get as much actionable data as advocacy and speculation.

Resummarizing, and assigning to vlad to get more information on his promising numbers.  Let's make this bug about getting the data, including the paths that are most affected.  As we've seen in things like UTF8 conversion and tracemonkey, we can often get big wins by selecting SSE2 at runtime, and I suspect that code that's doing matrix math is a good candidate for it as well.
Assignee: nobody → vladimir
Summary: Seriously consider dropping non-SSE2 support and building with -arch:SSE2 on MSVC → Measure benchmark gains for compiling with SSE2 instructions on Windows
(In reply to comment #3)
> I suspect that code that's doing matrix math is a good candidate for it as well.

(It would be if we had any such code; that was just an example based on a different benchmark, entirely not browser related.)
Is this proper SSE2 optimisation (as in by hand optimising) or just enabling the MSVC SSE2 optimisation option?

From what I and others have seen of it, MSVC's method is not that smart and can introduce stability problems into normally stable code.
Comment #0 says that he just used -arch:SSE2 for the benchmarking.
So it's the buggy, half-baked version then.

thanks.
Note that "by hand" optimizing the whole codebase is not feasible.  There are some obvious hotspots where it's been done, but the whole point would be to get the compiler to do it automatically; ideally using pgo to decide where to do it.
Using the VS optimisations is not a feasible or safe option. If you are that intent on allowing compile-time optimisations, then use ICC, for goodness' sake.

As for doing it by hand not being feasible, well, I call shenanigans on that, considering the amount of work a single person has put into his SSE2, SSSE3 and SSE4.x optimised gfx output modules.

To say that the entire Mozilla dev team couldn't is just a joke.

Whatever. When you finally implement it I'll report bugs as usual; I can't say I hate the idea of faster page loading.
Having a huge percentage of the codebase in hand-coded assembly is a maintenance nightmare. It's a non-starter. Using hand-coded assembly for hotspots like gfx is a different matter. If VC++ doesn't produce good output then we won't use it, that's fine, but claiming that doing it manually is a viable option is not correct.
Keywords: perf
Whiteboard: [ts]
Blocks: 447581
I see this bug was abandoned after Danial pointed out some limitations.

However, I think Danial's argument doesn't make some important distinctions between various *kinds* of optimisations possible under SSE2. They include:

- vectorization: using SIMD instructions where SIMD is appropriate. This is the only part being discussed here, and yes, MSVC hardly does that at all, and when it does, hardly ever in the best way. But at the same time -- who cares?

- using SSE2 floating point arithmetic instead of x87. This is very important and, correct me if I'm wrong, Firefox release builds use x87. This part has absolutely nothing to do with Danial's argument and is much more important than SIMD.

- making sure some other instruction extensions, like cmov, are used. Maybe they are now, I don't know, but they'd certainly be under arch:SSE2.

Can we just do this? Please?
The lack of SSE2 appears to be due to providing support for people still on Pentium III and Athlon XP processors.

I wouldn't attempt running Firefox 4+ on either anyhow, as the improved Firefox cores are heavier on system memory and run into the related bottlenecks on these platforms. Firefox 3 is the best version, at least for the older Athlon XP setup that I still maintain.
Right, this I agree on. It's only by motherboard failure that I don't have an Athlon XP system here myself.

What about arch:SSE? Anything that would stop limiting us to the i386 instruction set (enable SSE floating point and cmov).
I've tried compiling Firefox with:
ac_add_options --enable-optimize="-arch:SSE -O1"
ac_add_options --enable-optimize="-arch:SSE2 -O1"

And without the option.
The normal build is faster in every benchmark I ran (Peacekeeper, SunSpider, Dromaeo DOM), except for math, for one of the Peacekeeper scores (Rendering) and for some DOM tests. Overall, the normal build gets the best scores.

I wonder if the compiler cannot use SSE very well or if I did something wrong, because I was sure that compiling with SSE could only be better; maybe not much better, but certainly not worse.

These are the three build configs:
Normal) -TC -nologo -W3 -Gy -Fdgenerated.pdb -we4553 -DNDEBUG -DTRIMMED -Zi -Zi -UDEBUG -DNDEBUG -O1 -Oy
SSE) -TC -nologo -W3 -Gy -Fdgenerated.pdb -we4553 -DNDEBUG -DTRIMMED -Zi -Zi -UDEBUG -DNDEBUG -arch:SSE -O1 -Oy
SSE2) -TC -nologo -W3 -Gy -Fdgenerated.pdb -we4553 -DNDEBUG -DTRIMMED -Zi -Zi -UDEBUG -DNDEBUG -arch:SSE2 -O1 -Oy
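
To confirm that the three configurations really differ only in the `-arch` switch, a small sketch can compare the flag lists directly. The flag strings are copied from the lines above; the helper name `extra_flags` is made up for illustration.

```python
# Flag strings copied verbatim from the three build configs listed above.
COMMON = (
    "-TC -nologo -W3 -Gy -Fdgenerated.pdb -we4553 "
    "-DNDEBUG -DTRIMMED -Zi -Zi -UDEBUG -DNDEBUG"
)
NORMAL = f"{COMMON} -O1 -Oy".split()
SSE = f"{COMMON} -arch:SSE -O1 -Oy".split()
SSE2 = f"{COMMON} -arch:SSE2 -O1 -Oy".split()


def extra_flags(config, baseline):
    """Flags present in `config` that do not appear in `baseline`."""
    return [flag for flag in config if flag not in baseline]


for label, config in (("SSE", SSE), ("SSE2", SSE2)):
    added = extra_flags(config, NORMAL)
    removed = extra_flags(NORMAL, config)
    print(f"{label}: added {added}, removed {removed}")
```

If the only difference is the `-arch` switch, any slowdown comes from the code the compiler generates under that switch rather than from some other change in optimization level.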
Our normal builds use profile-guided optimization ... so if you're not doing that the stock builds should be significantly faster.
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #15)
> Our normal builds use profile-guided optimization ... so if you're not doing
> that the stock builds should be significantly faster.

I've built three different versions from the same code: one without SSE, one with SSE and the last with SSE2. I didn't use the stock build.
This is why I think the results are strange, even though the math category is better with SSE than without it.
What version of Visual C++ were you using for this test?
I've used Visual C++ 2010.
I've also tested the builds on another PC (x86) and the results are the same.
Assignee: vladimir → nobody
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → DUPLICATE
Product: Core → Firefox Build System