Closed Bug 833890 Opened 13 years ago Closed 13 years ago

Figure out how much of a win PGO builds actually are

Categories: Core :: General, defect
Platform: x86, Windows 7
Priority: Not set
Severity: normal
Status: RESOLVED FIXED
People: Reporter: ehsan.akhgari; Assigned: dmandelin
We need somebody to take measurements on a bunch of benchmarks that we consider interesting, to see how much of a win PGO builds are on Windows. Dave, word on the street is that you've done that measurement recently! If that's the case, would you mind sharing the results here? If not, can you please ask someone on your team to do the measurement? I'm aiming to have numbers on this by the next Engineering call. Thanks!
No longer blocks: 833881
Dave's original post: https://groups.google.com/d/msg/mozilla.dev.platform/a1ua8-Y29ls/WwxLeaOmo9sJ

In that thread I linked to (for example) Dromaeo (DOM) results for PGO vs. non-PGO Windows builds: http://graphs.mozilla.org/graph.html#tests=[[73,94,12],[73,1,12]]&sel=none&displayrange=365&datatype=running

You can do this for any Talos test and compare. We could probably write some scripts to find the perf difference on all of our Talos benchmarks between the two types of builds.
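[Editor's note: once per-build means are in hand, such a script could be as simple as computing per-test ratios. A minimal JavaScript sketch; the test names and numbers below are made up for illustration and this is not the actual graph server API. Note that for some Talos tests lower is better, for Dromaeo higher is better.]

// Hypothetical per-test mean scores for the two build types; in practice
// these would be pulled from the graph server for each Talos suite.
// All names and values here are invented placeholders.
const pgoMeans    = { dromaeo_dom: 2100, tsvg: 410, tp5: 320 };
const nonPgoMeans = { dromaeo_dom: 1900, tsvg: 455, tp5: 335 };

for (const test of Object.keys(pgoMeans)) {
  const ratio = nonPgoMeans[test] / pgoMeans[test];
  console.log(`${test}: non-PGO/PGO = ${ratio.toFixed(2)}`);
}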
Thanks for the link, Ted! Do we need to do this experiment on more benchmarks? Or is Dromaeo all that we care about? (/me suspects the answer is that it's not!)
Another important factor is that we are not yet in a position to consider turning off PGO for the JS engine, so _maybe_ Dromaeo is the only important benchmark here?
There's no reason to limit the investigation to Talos; other vendors have benchmarks that, even if some are built for other browsers and have bugs, could be usable to test DOM, canvas, etc.
(In reply to Marco Bonardo [:mak] from comment #4)
> There's no reason to limit the investigation to Talos; other vendors have
> benchmarks that, even if some are built for other browsers and have bugs,
> could be usable to test DOM, canvas, etc.

Yeah, I was not talking about Talos at all!
Got some first tests done. Cold startup, 3 trials (times in ms):

1.23 pgo      3979, 5368, 4772
1.24 non-pgo  4540, 4620, 3685

I don't really see any difference. I'd need to do many more trials to get statistical significance, and that would take a long time with all the reboots. I didn't really expect to see a difference, given that cold startup is mostly I/O-bound.
OS: Mac OS X → Windows 7
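[Editor's note: to put a rough number on "many more trials", here is a sketch of a standard normal-approximation sample-size calculation, seeded with the scatter of the three PGO trials above. The 5% target difference, alpha = 0.05, and 80% power are assumed values, not anything stated in the thread.]

// Rough per-group trial count needed to detect a given cold-startup
// difference, via the approximation n ≈ 2 * sigma^2 * (z_a + z_b)^2 / delta^2.
const pgoTrials = [3979, 5368, 4772];
const mean = xs => xs.reduce((a, b) => a + b, 0) / xs.length;
const m = mean(pgoTrials);
const variance =
  pgoTrials.reduce((a, x) => a + (x - m) ** 2, 0) / (pgoTrials.length - 1);

const delta = 0.05 * m;  // try to detect a 5% difference (~235 ms)
const z = 1.96 + 0.84;   // z for alpha = 0.05 (two-sided) plus z for 80% power
const n = Math.ceil(2 * variance * z ** 2 / delta ** 2);
console.log(n + ' trials per build'); // on the order of 140 with this scatter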
Warm startup, 2 successive trials (times in ms):

1.23 pgo      984, 735
1.24 non-pgo  1579, 741

I think this is also a no-difference result. 750 ms is about what a warm startup takes on this machine generally. The first trial may not have been fully warmed up.
SunSpider, 10 trials (times in ms):

1.23 pgo      260.0, 266.6, 284.7, 285.1, 264.8, 288.7, 269.7, 271.8, 269.4, 266.1  (mean = 272.7)
1.24 non-pgo  286.0, 278.1, 277.3, 276.6, 274.7, 278.1, 279.3, 282.1, 279.7, 281.0  (mean = 279.3)

The t-test p-value for the difference in the means is 0.0596. That's pretty close to statistical significance with the usual 0.05 threshold. I'm not entirely sure what to make of it, but I note that the lowest non-pgo score was 274.7, and pgo had 7 scores lower than that value, so I'm inclined to think it is a real difference. The difference I measured in these tests was 6.6 ms, or 2.5%.
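[Editor's note: the reported p-value is reproducible with a pooled-variance (Student's) two-sample t-test on the trials above. A minimal JavaScript sketch, assuming equal variances: it gives t ≈ 2.01 with 18 degrees of freedom, just under the two-tailed 0.05 critical value of 2.101, consistent with p ≈ 0.0596.]

// Pooled-variance two-sample t statistic for the SunSpider trials above.
const pgo    = [260.0, 266.6, 284.7, 285.1, 264.8, 288.7, 269.7, 271.8, 269.4, 266.1];
const nonPgo = [286.0, 278.1, 277.3, 276.6, 274.7, 278.1, 279.3, 282.1, 279.7, 281.0];

const avg = xs => xs.reduce((a, b) => a + b, 0) / xs.length;
const sumSq = xs => {
  const m = avg(xs);
  return xs.reduce((a, x) => a + (x - m) ** 2, 0);
};

const n1 = pgo.length, n2 = nonPgo.length;
const pooledVar = (sumSq(pgo) + sumSq(nonPgo)) / (n1 + n2 - 2);
const t = (avg(nonPgo) - avg(pgo)) / Math.sqrt(pooledVar * (1 / n1 + 1 / n2));
console.log('t =', t.toFixed(2)); // ≈ 2.01, df = 18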
Dromaeo DOM runs:

1.23 pgo      http://dromaeo.com/?id=189193  total score 2139
1.24 non-pgo  http://dromaeo.com/?id=189187  total score 1903

The pgo version scored about 1.12x better (2139/1903). The difference broadly held up over the different subtests. The main problem is that I don't know what significance a 1.12x Dromaeo DOM difference would have in actual usage. I suppose various people out there are comparing us based on Dromaeo, though.
Is there anything else we should be testing, or is the above enough to go on for decision-making?
> I suppose various people out there are comparing us based on Dromaeo, though.

Yep. In some ways, that's the biggest issue here.

That said, it's interesting to look at the breakdown of the numbers. For example, the "setAttribute" subtest of the "DOM Attributes" test is almost 40% faster with PGO than without. On the other hand, the "element.expando" tests, which are pure jitcode, are of course unaffected. That pattern holds across the board: tests which involve running lots of C++ code are much faster with PGO, while tests that are largely gated on the JIT, or bottlenecked on a single simple loop or library call in C++ (e.g. createTextNode), don't win nearly as much.

As far as the comments earlier in this bug about the JS engine go... I thought we already had PGO off for the JS engine on Windows. Is that not the case?

What I think is really worth measuring, and that I don't think we have good numbers for, is layout performance. I'm not talking pageload (we measure that with Tp); I'm talking "click the reply button in Gmail and see how long that takes".
(In reply to Boris Zbarsky (:bz) from comment #12)
> As far as the comments earlier in this bug about the JS engine go... I
> thought we already had PGO off for the JS engine on Windows. Is that not
> the case?

It was turned back on a while ago, but the bug number is escaping me atm.
(In reply to comment #12)
> What I think is really worth measuring, and that I don't think we have good
> numbers for, is layout performance. I'm not talking pageload (we measure
> that with Tp); I'm talking "click the reply button in Gmail and see how
> long that takes".

Do we have a good benchmark for this kind of thing? Something that we can use to get some numbers? (Microbenchmarks _could_ be useful here, since we're measuring differences in how well the compiler optimizes code, etc.)
I don't know of a good layout performance benchmark offhand, sadly...
But yes, we could try to write a microbenchmark for it.
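[Editor's note: a layout microbenchmark along these lines would force synchronous reflows in a loop and time them. A minimal sketch; the DOM shape and iteration counts are arbitrary illustrations, not an actual Talos test.]

// Build a biggish DOM, then repeatedly dirty style and read back
// offsetHeight, which forces a synchronous reflow each iteration.
const container = document.createElement('div');
for (let i = 0; i < 1000; i++) {
  const p = document.createElement('p');
  p.textContent = 'paragraph ' + i;
  container.appendChild(p);
}
document.body.appendChild(container);

const start = performance.now();
for (let i = 0; i < 100; i++) {
  container.style.width = (300 + (i % 2) * 50) + 'px';
  void container.offsetHeight; // forces the pending reflow now
}
console.log('layout: ' + (performance.now() - start).toFixed(1) + ' ms');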
roc, dbaron, do you know of a good layout microbenchmark?
(In reply to Boris Zbarsky (:bz) from comment #12)
> What I think is really worth measuring, and that I don't think we have good
> numbers for, is layout performance. I'm not talking pageload (we measure
> that with Tp); I'm talking "click the reply button in Gmail and see how
> long that takes".

Video cameras/screen captures work well for that sort of thing. I can do it but probably won't be able to until at least Wednesday.
(In reply to comment #18)
> (In reply to Boris Zbarsky (:bz) from comment #12)
> > What I think is really worth measuring, and that I don't think we have
> > good numbers for, is layout performance. I'm not talking pageload (we
> > measure that with Tp); I'm talking "click the reply button in Gmail and
> > see how long that takes".
>
> Video cameras/screen captures work well for that sort of thing. I can do it
> but probably won't be able to until at least Wednesday.

In that case, I think that's going to be valuable enough for us to delay having the final conversation until then. Thanks, Dave!
(In reply to :Ehsan Akhgari from comment #17)
> roc, dbaron, do you know of a good layout microbenchmark?

This one, maybe? http://www.craftymind.com/factory/guimark2/HTML5TextTest.html
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #20)
> (In reply to :Ehsan Akhgari from comment #17)
> > roc, dbaron, do you know of a good layout microbenchmark?
>
> This one, maybe? http://www.craftymind.com/factory/guimark2/HTML5TextTest.html

pgo      25.33
non-pgo  22.46  (= 1.13x slower)
I tried some Gmail interactions. Clicking reply didn't work because there was no visual effect for the click, so I couldn't get a start time. I successfully measured the time between clicking the login button and the password dialog appearing, the time between clicking the login button and seeing the emails, and the time between clicking an email and seeing it. I did 2 non-pgo trials, under the assumption that non-pgo trial 1 was warming up some caches. The camera is 30 fps, so there is an inherent imprecision of about +/- 0.033 seconds. Times are in seconds.

                  pwd dialog   show list   show email
non-pgo trial 1      0.20         2.83        0.20
pgo                  0.13         2.67        0.23
non-pgo trial 2      0.17         2.70        0.17

Hard to see what this means. I suspect I/O latency and animation timers are in play, so there could be CPU differences, but they are too small to observe against those latencies.

So far, in summary, it seems that we have seen:

- a clear difference in direct tests of DOM and layout speed (Dromaeo, the GUIMark2 test) of about 1.1-1.2x in most cases (but 1.4x in one case)
- a possibly real small difference on SunSpider of about 3%
- no clear difference in startup time
- no clear difference for Gmail interaction

I'm going to close this bug, but feel free to ask more questions and reopen.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Thanks a lot, David. This is very helpful.