Closed Bug 545191 Opened 16 years ago Closed 15 years ago

Investigate Tp4 regression from OOPP

Categories

(Core Graveyard :: Plug-ins, defect)

x86
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: benjamin, Unassigned)

References

Details

Attachments

(2 files)

Tp4 had a performance regression of between 1% and 2.5% on both Windows and Linux when we enabled OOPP by default. The regression is probably concentrated in pages that use Flash, but we're not sure. We'd like to figure out:

* Which pages are slowing down
* Whether that slowdown is mainly due to one-time plugin startup costs, once-per-instance setup costs, or runtime costs associated with RPC
roc asked me to try to figure out whether the Tp hit was due to a specific set of pages. Unfortunately, I can't tell from the data we've got.

The attached PDF requires a very large screen to be readable. Sorry, there's no way around it. It shows the estimated *median difference*, with confidence intervals, from Hg revision 166198dfb055 to f54bb3222492, for each platform and every page in the Tp set. (Specifically, this is the Hodges-Lehmann estimate, with a 95% confidence interval adjusted for multiple comparisons.) There is one column for each OS, and within each column, pages are sorted top to bottom by the change in Tp4 -- positive change is bad, negative is good. The x-axis is on an unusual scale, to make both the dots and the ends of the confidence intervals visible: sqrt(abs(x))*sign(x). If a confidence line extends all the way to the edge of the graph, you should assume it keeps going quite some distance beyond that; I had to cut them off so we could actually see what's going on in the vicinity of zero.

The only conclusion I'm prepared to draw from this chart is that we don't have enough statistical power to answer the question. In almost all cases the confidence interval is huge on both sides of zero. We *might* be able to say that there is an overall Tp regression from this data -- I didn't do that test -- but we cannot pin it down to specific pages without more measurements. To underline the problem, look at the "leopard" column: there are way more dots to the right of zero, even though OSX shouldn't have been affected by the change at all! Also, there is no consistency among platforms, and Vista has an awful lot of pages that seem to have been *helped* by the change, including youtube.com, even though that's a place where we expect there to have been a regression.

I'll ask releng to re-run Tp4 a bunch of times on the relevant revisions tomorrow, so we can try to squeeze those bars down.
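For reference, a minimal sketch (in Python, with made-up sample data) of the two ingredients described above: the Hodges-Lehmann estimate of the per-page shift and the signed-square-root axis scale. This is not the code that produced the attachment, just an illustration of the technique.

```python
import numpy as np
from itertools import product

def hodges_lehmann(before, after):
    """Hodges-Lehmann estimate of the shift between two samples:
    the median of all pairwise differences (after - before)."""
    diffs = [b - a for a, b in product(before, after)]
    return float(np.median(diffs))

def axis_transform(x):
    """Signed-square-root scale used on the x-axis: sqrt(abs(x)) * sign(x).
    It compresses large values so the region near zero stays visible."""
    x = np.asarray(x, dtype=float)
    return np.sqrt(np.abs(x)) * np.sign(x)

# Hypothetical per-page load times (ms) before and after the OOPP landing.
before = [1049, 1066, 1063, 1059, 1062, 1070, 1055]
after  = [1098, 1102, 1080, 1121, 1093, 1110, 1088]
shift = hodges_lehmann(before, after)
print("estimated median difference:", shift)
print("plotted at x =", axis_transform(shift))
```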
Comment on attachment 426161 [details]
preliminary visualization of per-page change due to OOPP

"application/empty"?!
Attachment #426161 - Attachment mime type: application/empty → application/pdf
Assignee: nobody → zweinberg
I would actually expect some pages to be helped by this change: primarily pages that use windowed flash instead of windowless, because it makes it much more likely for the plugin to be able to use the second core of the machine effectively. Can you figure out which of the test pages actually have any plugins on them at all, what type, and whether they are windowed or windowless?
We have a pretty nasty problem here, because we're trying to pick out a very small change against enormous variance.

    os       dmean       dsd
    leopard   52.40900   744.5454
    linux      5.61100   497.7784
    vista    -24.87839   833.5462
    xp        11.36800   491.6698

'dmean' is just the difference in the overall per-OS mean Tp4 score between before-f54bb... and after-f54bb... runs. 'dsd' is the combined standard deviation of both runs. (Something fishy is going on here, because if you ignore the variance, that looks like the patch hurt OSX but helped Vista -- but we know that turning on OOPP has no effect on OSX at all!)

With sufficiently large n, we *can* detect a difference that small against that background, but it takes a whole lot of samples. Let's be generous to ourselves, and consider the largest dmean with the smallest dsd (this is cheating, but you'll see that it doesn't matter in the end), and find out how many samples we need to get a 95% confidence level. Except that we have to correct that for four hundred comparisons, so it's really a 99.9875% confidence level.

    n        est          lo            hi         stat       p             sig
    10       -182.641044  -531.476147    75.89955        31   1.654939e-01  FALSE
    20        155.772352  -350.518367   574.23390       217   6.587979e-01  FALSE
    50        -95.514017  -329.660467   113.25619      1116   3.556051e-01  FALSE
    100        33.584825  -109.918835   178.46921      5176   6.671691e-01  FALSE
    200        55.230076   -41.187300   152.98468     21314   2.557316e-01  FALSE
    500         2.644939   -60.822057    66.68968    125359   9.373399e-01  FALSE
    1000       71.079517    26.922872   114.44988    540971   1.509729e-03  FALSE
    2000       36.636747     5.165186    68.34626   2083144   2.280369e-02  FALSE
    --
    5000       70.731283    50.721647    90.62697  13502656   3.751000e-12  TRUE
    10000      48.908188    34.911248    62.88334  52792655   7.896794e-12  TRUE
    (ideal)    52.400000    52.400000    52.40000       Inf   0.000000      TRUE

That's five thousand trials *per page* to get statistical significance, and even then, you can see that the confidence interval is not very good. Clearly we are not going to run Tp4 five hundred times per patch. I can only hope that when I talk to releng, we can figure out a way to cut the variance down (it will probably help if we can run the before- and after- trials on the same machine, for instance).
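A rough back-of-the-envelope check of that sample-size conclusion, as a sketch in Python. The assumptions here -- a normal approximation, an 80% power target, and pairing the largest dmean (leopard) with the smallest dsd (linux) -- are mine, not taken from the table above, which used a rank-based test.

```python
from scipy.stats import norm

# Bonferroni-corrected two-sided significance level for ~400 page comparisons.
alpha = 0.05 / 400          # i.e. a 99.9875% confidence level
power = 0.80                # assumed power target (not stated in the comment)

dmean = 52.4                # largest per-OS mean difference (leopard), ms
dsd   = 497.8               # smallest per-OS combined std dev (linux), ms

# Standard normal-approximation formula for comparing two means:
#   n per group ~= 2 * (z_{alpha/2} + z_{beta})^2 * (sd / delta)^2
z_alpha = norm.ppf(1 - alpha / 2)
z_beta  = norm.ppf(power)
n = 2 * (z_alpha + z_beta) ** 2 * (dsd / dmean) ** 2
print(f"~{n:.0f} samples per page per build")
```

This lands in the same few-thousand-samples-per-page ballpark as the rank-based results in the table, which is why the conclusion above is pessimistic about per-patch Tp4 runs.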
(In reply to comment #4)
> Can you figure out which of the test pages actually have any plugins on them
> at all, what type, and whether they are windowed or windowless?

Where can I download the Tp4 data set?
(attn jhford + alice) Here's a possible plan for cutting down the intra-run variance so we can get better numbers out of Talos.

First, if we don't still have the builds for Hg revisions 166198dfb055 (henceforth the "A" build) and f54bb3222492 (the "B" build), we're going to need to regenerate them. (If it would be easier, builds for revisions db1f6446efda (A) and 609a51758b08 (B) would also do for this experiment.)

Take one Talos slave per OS (so four machines in total) out of the mozilla-central pool temporarily. They're dedicated to this test; they don't do anything else till it's done. Particularly on the Vista slave, but ideally on all four, someone checks for unnecessary background processes that can be shut down. (I assume this has already been done to some extent, but I'd like even more aggressive pruning for this test.)

Unpack *both* builds and set up the Tp4 pageset on all four machines. What I mean by this is, all the buildsteps for Tp4 up to but *not* including the actual test run have been completed, and we have side-by-side installations, each with its own testing profile, of both builds on the same machine. The point here is that both builds will be executing not only on the same machine but against the exact same set of files on disk. Also, disable sending results anywhere; we don't want to confuse the graphs server with this.

Manually invoke each browser once against the testing profile and immediately quit it, to complete component registration and prime the fastload caches. (Maybe the PerfConfigurator step already does this? If it doesn't, it should.)

Then run the actual tests, once for each build, but with the cycle count bumped up a lot. Given the above prep, I am hoping that 100 cycles will be sufficient, but it might be necessary to come back for even more, so please *don't* clear the machines and put them back into the pool once the tests complete.

The data I need from this is the raw results for each run -- lines like this:

NOISE: |0;youtube.com/www.youtube.com/index.html;1049;693.2222222222222;230;9733;9733;245;1066;231;1063;1059;1049;234;230;1062

(except that there will, presumably, be 100 numbers instead of 10). All other output is useless for this test, and in fact, if you could turn off the median/mean/max/min output for these lines, leaving only the raw per-run numbers and the pagename, that would save me having to throw it away :) Obviously, the raw results need to be labeled as build A or B and with the operating system. One file per run, with meaningful names, would be ideal.

Is all this feasible? I can help with some of the fiddling, but it may be faster for y'all in releng to do it...
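To make the bookkeeping concrete, here's a minimal sketch of how those raw lines could be pulled out of a Talos log and grouped per page. It assumes, based only on the sample line above, that each NOISE: line is "|index;pagename;median;mean;min;max;" followed by the raw per-cycle times; the exact field layout should be checked against real Talos output before trusting it.

```python
import re
from collections import defaultdict

NOISE_RE = re.compile(r"^NOISE: \|(?P<rest>.*)$")

def parse_talos_log(path):
    """Return {pagename: [raw per-cycle times]} from one Talos log.

    Assumed field layout per line (matching the sample in the comment):
        index;pagename;median;mean;min;max;raw1;raw2;...
    """
    results = defaultdict(list)
    with open(path) as fh:
        for line in fh:
            m = NOISE_RE.match(line.strip())
            if not m:
                continue
            fields = m.group("rest").split(";")
            pagename = fields[1]
            raw = [float(x) for x in fields[6:]]  # skip index + 4 summary stats
            results[pagename].extend(raw)
    return results

# Hypothetical usage: one log file per (build, OS) combination.
# a_linux = parse_talos_log("tp4-buildA-linux.log")
# b_linux = parse_talos_log("tp4-buildB-linux.log")
```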
Summary of a talk between Zack and me today.

1) Zack is seeing problems with some results for some Talos suites, like tp4, having > 33% variance in results. Some other test suites are very stable, so this is not a problem with the results of all Talos suites.

2) We agree that the cause of this variance could be anywhere in the stack: hardware+BIOS variances, OS variances, test framework (buildbot), testware (the Talos suite being run), different compiler optimizations on consecutive builds, and the product being tested.

3) Zack was not, but is now, aware of the level of detail we deal with to reduce variance in the stack: machines with sequential serial numbers, identical OS patch levels, same network, identical toolchain, disabling background jobs, reboots after each run, etc. Noted that the Talos suites which return consistent results are run on the same machines as the Talos suites which are returning noisy results.

To narrow down this problem, we proposed a short, clearly defined project to see where the problem might be:

- RelEng to get a Talos machine from production and give Zack access to it. joduinn filed bug#548740 to track.
- Zack to put a specific build (of his choice) on the loaner machine, and test that same build repeatedly to see if he can reproduce the variance in results. Zack will use this bug to track his progress here.
- After a few days, Zack and I will reconnect and see a) if this reproduced the noise variance and b) if so, figure out possible next steps (debug msgs in the opt builds, process monitoring on the machine, etc.)

(zack: it was a hectic day, so if I missed anything from our chat, please chime in, ok?)
Depends on: 548740
It's great that we can work on making Tp4 less noisy, but I'm wondering if it's also possible to do something short-term to correlate the bad/good pages in the PDF attachment against any plugin usage, windowed/windowless, the size of streams fed to the plugin. That might at least give my team code to look at in more detail.
(In reply to comment #9)
> It's great that we can work on making Tp4 less noisy, but I'm wondering if
> it's also possible to do something short-term to correlate the bad/good pages
> in the PDF attachment against any plugin usage, windowed/windowless, the size
> of streams fed to the plugin. That might at least give my team code to look
> at in more detail.

As far as I can tell, the only plugin used anywhere in the Tp4 data set is Flash. There are 18 sites whose archived contents include an .swf object, and another 43 sites that mention the string "application/x-shockwave-flash" somewhere in their HTML or Javascript. A complete list of included SWFs is at the end of this message, annotated by size; they're all pretty small as Flash goes.

I've regenerated the PDF with each line coloured to indicate this used/mentioned/absent distinction. I don't see any correlation at all, so I'm not going to dig into windowed/not just yet.

108K www.guardian.co.uk/static.guim.co.uk/static/70054/common/flash/brightcovewrapper.swf
 96K www.wretch.cc/l.yimg.com/e/serv/index/VideoHomeADV3.swf
 72K www.gamespot.com/image.com.com/gamespot/images/cne_flash/production/slide_show/gs_wide_topslot/topslot_wide.swf
 60K www.corriere.it/www.corriere.it/includes2007/ssi/boxes/boxNews/boxNews.swf
 60K www.guardian.co.uk/static.guim.co.uk/static/70054/common/flash/guMiniPlayer.swf
 44K www.marca.com/estaticos01.marca.com/multimedia/alMinutoC.swf
 36K www.marca.com/estaticos02.marca.com/multimedia/reproductores/newPlayer.swf
 36K www.marca.com/estaticos03.marca.com/multimedia/reproductores/newPlayer.swf
 28K www.people.com.cn/www.people.com.cn/adv/meiling70155.swf
 28K www.spiegel.de/www.spiegel.de/media/0,4906,19498,00.swf
 28K www.ifeng.com/img.ifeng.com/tres/recommend/client/jhsp/090210-maso-660x90.swf
 28K www.imdb.com/ia.media-imdb.com/media/imdb/01/I/77/84/25/10.swf
 24K www.ifeng.com/img.ifeng.com/tres/recommend/client/bankofchina/090104-boc-660x90.swf
 24K www.espn.go.com/a.espncdn.com/prod/assets/totemPoll.swf
 20K www.ku6.com/image.ku6.com/888/200902/sx_hp_banner0210.swf
 20K www.ifeng.com/img.ifeng.com/tres/recommend/client/emirates/090106-emirates-258x215_tea.swf
 16K www.yam.com/www.yam.com/f/422x51.swf
 12K www.blogfa.com/www.blogfa.com/ads/banner/ouriran120-240.swf
 12K www.jugem.jp/img2.afpbb.com/flashdata/thumb/20090211/3779629.swf
 12K www.it168.com/adshow.it168.com/newImage/20090206/11299_184114.swf
 12K www.jugem.jp/img2.afpbb.com/flashdata/thumb/20090212/3782676.swf
 12K www.jugem.jp/img2.afpbb.com/flashdata/thumb/20090210/3775611.swf
  4K www.bbc.co.uk/www.bbc.co.uk/home/object/clock/tiny.swf
  4K www.exblog.jp/md.exblog.jp/sd/flash/Topfla.swf
  4K www.people.com.cn/www.people.com.cn/img/2007people_index/tu.swf
  4K www.minijuegos.com/80.69.64.205/images/mail2.swf
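For anyone who wants to reproduce or extend this classification, a minimal sketch of the kind of scan described above. The pageset directory layout and the PAGESET_ROOT path are assumptions; this is not the script that produced the list.

```python
import os

PAGESET_ROOT = "tp4"  # hypothetical path to the unpacked Tp4 pageset
MIME = b"application/x-shockwave-flash"

includes_swf = set()    # sites that actually ship an .swf file
mentions_flash = set()  # sites that only mention the Flash MIME type

for dirpath, _, filenames in os.walk(PAGESET_ROOT):
    # Assume the first path component under the root names the site.
    site = os.path.relpath(dirpath, PAGESET_ROOT).split(os.sep)[0]
    for name in filenames:
        path = os.path.join(dirpath, name)
        if name.lower().endswith(".swf"):
            size_kb = os.path.getsize(path) // 1024
            includes_swf.add(site)
            print(f"{size_kb}K\t{os.path.relpath(path, PAGESET_ROOT)}")
        else:
            with open(path, "rb") as fh:
                if MIME in fh.read():
                    mentions_flash.add(site)

mentions_flash -= includes_swf
print(len(includes_swf), "sites include an .swf;",
      len(mentions_flash), "more only mention the MIME type")
```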
I don't think it's worth the OOPP team's time to dig into this right now, though; as far as I'm concerned, we don't know that we have a regression!
Further note: some of the sites that one might *expect* to use Flash (like youtube.com) don't.
(In reply to comment #8)
> Summary of a talk between Zack and me today.
> ...
> To narrow down this problem, we proposed a short, clearly defined project to
> see where the problem might be:
> - RelEng to get a Talos machine from production and give Zack access to it.
>   joduinn filed bug#548740 to track.

zack: I forgot to write down which OS you preferred me to set up for you. The problem is reported on Win32 and Linux, so I'll go with a Linux talos machine. If you'd prefer a Win32 talos machine instead, please let me know in the depbug.
Handed over "zack-testing.build.mozilla.org" on 30mar. Zack, last I heard (05apr), you had VPN issues. Did you get past those, and if so, do you have any update?
I'm sorry, I've been completely drowning in other stuff; no progress. (And lately I have not wanted to put even more load on the network.) Next week, though, should be ideal for me to get back to this project as I'll be in the office for a change.
I regret to say that I never found time to get around to this bug, and now I don't work for Mozilla anymore, and will not be *able* to get around to this bug. And perhaps it is moot anyway; I hear the newer talos slaves are more stable. zack-testing.build.m.o should probably be put back into the talos pool.
Assignee: zackw → nobody
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → INCOMPLETE
Product: Core → Core Graveyard