967635 - Add profiling hooks for all the major Talos Test

Reporter

Description

•

12 years ago

Bug 870657 has already started this work for talos ts test. We need to modify talos to trigger profiling recording for the relevant time slice for each talos test. For example the page load test should show samples only for the time while the page is loading.

Markus Stange [:mstange]

Comment 1

•

12 years ago

During the Australis Talos investigation, where we profiled at least tresize, tpaint and tart, it worked really well to have the profile capture the whole Firefox run. We instrumented the tests to add MOZ_MEASUREMENT_START and MOZ_MEASUREMENT_STOP markers to the profile, and removed the unnecessary parts from the profile in a post-processing step.

Benoit Girard (:BenWa)

Reporter

Comment 2

•

12 years ago

I lost the link to that. Can you post the most recent tools you used for that. It should be a great starting point.

Markus Stange [:mstange]

Comment 3

•

12 years ago

The instrumented talos repo is at http://hg.mozilla.org/users/mstange_themasta.com/talos-sps-profiling/file/07984c5b92ab (e.g. with instrumented TART: http://hg.mozilla.org/users/mstange_themasta.com/talos-sps-profiling/file/07984c5b92ab/talos/page_load_test/tart/addon/content/tart.js#l98 ) and the post-processing and symbolication tools are at https://github.com/mstange/analyze-tryserver-profiles .

Viktor Stanchev [:vikstrous]

Assignee

Updated

•

12 years ago

Assignee: nobody → vstanchev

Viktor Stanchev [:vikstrous]

Assignee

Comment 4

•

11 years ago

It would be good to have a common profiler object with a simple interface that can be used from all tests to pause/unpause the profiler and to insert markers. It seems like we have 3 types of tests: pageloader addon tests, html page tests, and tart tests. The tart tests have their own addon. The pageloader tests (tp5) need to call the profiler from within the pageloader addon and the html page tests need to call the profiler from within an html page. I've instrumented some of the html pages, but now I realized that I need to also add the profiler object to the two addons. What's the best way to do this? I don't like the idea of duplicating the profiler javascript file twice and I don't like the idea of using symlinks (because Windows). What's the best way of doing this? Maybe we can have talos copy the profiler javascript file into the extension / page whenever it runs the tests?

Flags: needinfo?(jmaher)

Joel Maher ( :jmaher ) (UTC -8)

Comment 5

•

11 years ago

we use the pageloader addon for the tart/cart tests as well as all the pageloader tests. There are a few startup_tests which need attention. That should minimize your duplication.

Flags: needinfo?(jmaher)

Viktor Stanchev [:vikstrous]

Assignee

Comment 6

•

11 years ago

Attached patch talos_profiler (obsolete) — Details — Splinter Review

The dbghelper.dll part of the patch is untested and just copied from Markus's repository. I have the Profiler JavaScript copied into 3 files. I'm considering using "port". That might help me reduce duplication. https://developer.mozilla.org/en-US/Add-ons/SDK/Guides/Content_Scripts/using_port The list of specifically instrumented tests is: tart, cart, ts_paint, tpaint, tresize, ts, tscrollr, tscrollx, tcanvasmark tsvg and tp5o need only the modification to the pageloader addon modification to give useful information about which page is being tested.

Attachment #8379912 - Flags: review?(bgirard)

Viktor Stanchev [:vikstrous]

Assignee

Comment 7

•

11 years ago

After reading this page (https://developer.mozilla.org/en-US/Add-ons/SDK/Guides/Content_Scripts/Communicating_With_Other_Scripts) I don't know if using port is a great idea... because we also need to jump one more gap between a content script and the javascript on the page. This would require boilerplate that takes up more lines of code than duplicating the profiler file.

Viktor Stanchev [:vikstrous]

Assignee

Comment 8

•

11 years ago

Attached patch talos_profiler2 (obsolete) — Details — Splinter Review

I rebased my patch on top of the most recent talos version

Attachment #8379912 - Attachment is obsolete: true

Attachment #8379912 - Flags: review?(bgirard)

Viktor Stanchev [:vikstrous]

Assignee

Updated

•

11 years ago

Attachment #8381690 - Flags: review?(bgirard)

Justin Wood (:Callek)

Updated

•

11 years ago

Depends on: 982689

Benoit Girard (:BenWa)

Reporter

Comment 9

•

11 years ago

Comment on attachment 8381690 [details] [diff] [review] talos_profiler2 Review of attachment 8381690 [details] [diff] [review]: ----------------------------------------------------------------- Joel are you the right person to review these talos changes? ::: talos/ttest.py @@ +303,5 @@ > test_config['profile_path'] = os.path.normpath(test_config['profile_path']) > > + # Put our special dbgHelper.dll in there so that Windows XP profiles > + # make sense. > + utils.info("Do we have we a file to install?") let's remove the dbghelper and ask relend to deploy it. @@ +313,5 @@ > + > + # Turn on the profiler on startup and write its output to a temp file > + sps_profile = test_config.get('sps_profile', False) > + sps_profile_interval = test_config.get('sps_profile_interval', 1) > + sps_profile_file = None 316/317 can be moved into the if.

Attachment #8381690 - Flags: review?(jmaher)

Joel Maher ( :jmaher ) (UTC -8)

Comment 10

•

11 years ago

Comment on attachment 8381690 [details] [diff] [review] talos_profiler2 Review of attachment 8381690 [details] [diff] [review]: ----------------------------------------------------------------- I am not a fan of 3 profiler.js files, but I don't have a better idea right now. I would really like to see the dbghelp.dll added to the machines and not part of the toolchain. really good start here, I think we are close. ::: talos/test.py @@ +71,5 @@ > """ > cycles = 20 > timeout = 150 > url = 'startup_test/startup_test.html' > + url_timestamp = True why is this added in here? @@ +85,5 @@ > shutdown = None > xperf_counters = ['main_startup_fileio', 'main_startup_netio', 'main_normal_fileio', 'main_normal_netio', 'nonmain_startup_fileio', 'nonmain_startup_netio', 'nonmain_normal_fileio', 'mainthread_readcount', 'mainthread_readbytes', 'mainthread_writecount', 'mainthread_writebytes'] > filters = [["ignore_first", [1]], ['median', []]] > tpmozafterpaint = True > + dbghelp_file = '${talos}/page_load_test/tart/dbghelp.dll' did you add this .dll to the repo? why is this only for tspaint? @@ +120,5 @@ > timeout = 300 > mobile = False # XUL based tests with local files. > tpmozafterpaint = True > filters = [["ignore_first", [5]], ['median', []]] > + dbghelp_file = '${talos}/page_load_test/tart/dbghelp.dll' fyi: we don't run generated_max anymore, I should remove this from the repo. @@ +327,5 @@ > Derived from the tp5n pageset, this is the 49 most reliable webpages. > """ > tpmanifest = '${talos}/page_load_test/tp5n/tp5o.manifest' > responsiveness = True > + sps_profile_interval = 10 why do you set the interval here and not other tests? ::: talos/ttest.py @@ +25,5 @@ > import copy > import mozcrash > import talosconfig > import shutil > +import zipfile can we guarantee that this is available on all the slaves? @@ +303,5 @@ > test_config['profile_path'] = os.path.normpath(test_config['profile_path']) > > + # Put our special dbgHelper.dll in there so that Windows XP profiles > + # make sense. > + utils.info("Do we have we a file to install?") I like this idea, then we could get rid of the other references and code for it throughout this patch. @@ +398,5 @@ > > + # record when we started the test > + url = test_config['url'] > + if 'url_timestamp' in test_config and test_config['url_timestamp']: > + url += "?begin=%d".format(int(time.time()*1000)) this was added in test.py for ts_paint, did you remove it somehow? @@ +474,5 @@ > + arcname = os.path.join(mud, "{0}.zip".format(profile_name)) > + profile_filename = os.path.join(profile_name, cycle_name) > + utils.info("Adding profile {0} to archive {1}".format(profile_filename, arcname)) > + with zipfile.ZipFile(arcname, 'a', mode) as arc: > + arc.write(sps_profile_file, profile_filename) I can see this failing randomly on Windows with some type of access permissions, can we have this in a try except so we can print out a more descriptive failure/error message?

Attachment #8381690 - Flags: review?(jmaher) → review-

Viktor Stanchev [:vikstrous]

Assignee

Comment 11

•

11 years ago

(In reply to Joel Maher (:jmaher) from comment #10) > Comment on attachment 8381690 [details] [diff] [review] > talos_profiler2 > > Review of attachment 8381690 [details] [diff] [review]: > ----------------------------------------------------------------- > > I am not a fan of 3 profiler.js files, but I don't have a better idea right > now. I would really like to see the dbghelp.dll added to the machines and > not part of the toolchain. > > really good start here, I think we are close. > > ::: talos/test.py > @@ +71,5 @@ > > """ > > cycles = 20 > > timeout = 150 > > url = 'startup_test/startup_test.html' > > + url_timestamp = True > > why is this added in here? > In talos/startup_test/startup_test.html it checks for a parameter in the url called begin:  This seemed to be broken when I was working on the test and there were some changes from Markus's repository that added changed the default value for the url_timestamp option, so I kept them because they seemed like they might fix the issue. The results were showing ts time of NaN without this fix. "results": { "ts": [ NaN ] }, In addition to this option being enabled, I had to manually add the parameter at the end of the url because it didn't seem like there was anything adding it yet. That's one of the changes you commented on below. > @@ +85,5 @@ > > shutdown = None > > xperf_counters = ['main_startup_fileio', 'main_startup_netio', 'main_normal_fileio', 'main_normal_netio', 'nonmain_startup_fileio', 'nonmain_startup_netio', 'nonmain_normal_fileio', 'mainthread_readcount', 'mainthread_readbytes', 'mainthread_writecount', 'mainthread_writebytes'] > > filters = [["ignore_first", [1]], ['median', []]] > > tpmozafterpaint = True > > + dbghelp_file = '${talos}/page_load_test/tart/dbghelp.dll' > > did you add this .dll to the repo? why is this only for tspaint? Removed. > > @@ +120,5 @@ > > timeout = 300 > > mobile = False # XUL based tests with local files. > > tpmozafterpaint = True > > filters = [["ignore_first", [5]], ['median', []]] > > + dbghelp_file = '${talos}/page_load_test/tart/dbghelp.dll' > > fyi: we don't run generated_max anymore, I should remove this from the repo. > > @@ +327,5 @@ > > Derived from the tp5n pageset, this is the 49 most reliable webpages. > > """ > > tpmanifest = '${talos}/page_load_test/tp5n/tp5o.manifest' > > responsiveness = True > > + sps_profile_interval = 10 > > why do you set the interval here and not other tests? This test takes longer than all the other tests and if we don't decrease the sampling interval it can't produce a profile of the entire test. > > ::: talos/ttest.py > @@ +25,5 @@ > > import copy > > import mozcrash > > import talosconfig > > import shutil > > +import zipfile > > can we guarantee that this is available on all the slaves? I don't know how to guarantee this. The windows 7 32 bit talos machine I borrowed has it. Looks like it was added in python 1.6: http://docs.python.org/2/library/zipfile > > @@ +303,5 @@ > > test_config['profile_path'] = os.path.normpath(test_config['profile_path']) > > > > + # Put our special dbgHelper.dll in there so that Windows XP profiles > > + # make sense. > > + utils.info("Do we have we a file to install?") > > I like this idea, then we could get rid of the other references and code for > it throughout this patch. Removed. > > @@ +398,5 @@ > > > > + # record when we started the test > > + url = test_config['url'] > > + if 'url_timestamp' in test_config and test_config['url_timestamp']: > > + url += "?begin=%d".format(int(time.time()*1000)) > > this was added in test.py for ts_paint, did you remove it somehow? > See above. Where was this supposed to happen? The test that was missing the timestamp was ts. > @@ +474,5 @@ > > + arcname = os.path.join(mud, "{0}.zip".format(profile_name)) > > + profile_filename = os.path.join(profile_name, cycle_name) > > + utils.info("Adding profile {0} to archive {1}".format(profile_filename, arcname)) > > + with zipfile.ZipFile(arcname, 'a', mode) as arc: > > + arc.write(sps_profile_file, profile_filename) > > I can see this failing randomly on Windows with some type of access > permissions, can we have this in a try except so we can print out a more > descriptive failure/error message? I don't seem to be getting any errors when testing this on Windows, but Windows works in mysterious ways, so I'll add the try/except just in case.

Viktor Stanchev [:vikstrous]

Assignee

Comment 12

•

11 years ago

Attached patch talos_profiler3 (obsolete) — Details — Splinter Review

Attachment #8381690 - Attachment is obsolete: true

Attachment #8381690 - Flags: review?(bgirard)

Viktor Stanchev [:vikstrous]

Assignee

Updated

•

11 years ago

Attachment #8391475 - Flags: review?(jmaher)

Avi Halachmi (:avih)

Comment 13

•

11 years ago

Comment on attachment 8391475 [details] [diff] [review] talos_profiler3 Review of attachment 8391475 [details] [diff] [review]: ----------------------------------------------------------------- ::: talos/test.py @@ +203,5 @@ > tpcycles = 1 # number of time to run each page > cycles = None > timeout = None > filters = None > + keys = ['tpmanifest', 'tpcycles', 'tppagecycles', 'tprender', 'tpchrome', 'tpmozafterpaint', 'tploadaboutblank', tploadaboutblank was modified to tploadnocache few weeks ago. Apparently you rebased your code but accidentally kept the old flag name. This would break TART and CART if the test includes more than one cycle (if affects how the TART addon reloads the page).

Attachment #8391475 - Flags: feedback+

Viktor Stanchev [:vikstrous]

Assignee

Comment 14

•

11 years ago

Attached patch talos_profiler3 (obsolete) — Details — Splinter Review

I fixed the issue with tart I was having earlier (it was the result of a property that was renamed and a merge conflict) and another bug I introduced by moving variables around in last patch.

Attachment #8391475 - Attachment is obsolete: true

Attachment #8391475 - Flags: review?(jmaher)

Viktor Stanchev [:vikstrous]

Assignee

Updated

•

11 years ago

Attachment #8391512 - Flags: review?(jmaher)

Joel Maher ( :jmaher ) (UTC -8)

Comment 15

•

11 years ago

Comment on attachment 8391512 [details] [diff] [review] talos_profiler3 Review of attachment 8391512 [details] [diff] [review]: ----------------------------------------------------------------- this looks good. Have you verified that this patch without profiling turned on do not affect the results on try server? we have compare.py in the talos repository to help sanity check that.

Attachment #8391512 - Flags: review?(jmaher) → review+

Viktor Stanchev [:vikstrous]

Assignee

Comment 16

•

11 years ago

When testing this I was getting mysterious hangs on Windows that I couldn't reproduce locally. This is the stack trace I got from T(s) here: https://tbpl.mozilla.org/?tree=Try&rev=58ce35a411b3 Worker thread: ntdll.dll!_KiFastSystemCallRet@0() Unknown ntdll.dll!_ZwWaitForSingleObject@12() Unknown ntdll.dll!_RtlpWaitOnCriticalSection@8() Unknown ntdll.dll!_RtlEnterCriticalSection@4() Unknown ntdll.dll!_LdrGetDllHandleEx@20() Unknown ntdll.dll!_LdrGetDllHandle@16() Unknown KernelBase.dll!_GetModuleHandleForUnicodeString@4() Unknown KernelBase.dll!_BasepGetModuleHandleExW@16() Unknown KernelBase.dll!_GetModuleHandleW@4() Unknown msvcr100.dll!_initptd(_tiddata * ptd, threadlocaleinfostruct * ptloci) Line 384 C > msvcr100.dll!_beginthreadex(void * security, unsigned int stacksize, unsigned int (void *) * initialcode, void * argument, unsigned int createflag, unsigned int * thrdaddr) Line 161 C xul.dll!EnsureWalkThreadReady() Line 280 C++ xul.dll!NS_StackWalk(void (void *, void *, void *) * aCallback, unsigned int aSkipFrames, unsigned int aMaxFrames, void * aClosure, unsigned int aThread, void * aPlatformData) Line 500 C++ xul.dll!TableTicker::doNativeBacktrace(ThreadProfile & aProfile, TickSample * aSample) Line 473 C++ xul.dll!TableTicker::InplaceTick(TickSample * sample) Line 581 C++ xul.dll!SamplerThread::SampleContext(Sampler * sampler, ThreadProfile * thread_profile) Line 179 C++ xul.dll!SamplerThread::Run() Line 123 C++ xul.dll!`anonymous namespace'::ThreadFunc(void * closure) Line 27 C++ msvcr100.dll!_callthreadstartex() Line 314 C msvcr100.dll!_threadstartex(void * ptd) Line 292 C kernel32.dll!@BaseThreadInitThunk@12() Unknown ntdll.dll!___RtlUserThreadStart@8() Unknown ntdll.dll!__RtlUserThreadStart@8() Unknown Main thread: ntdll.dll!_LdrpHashUnicodeString@4() Unknown ntdll.dll!_LdrpFindLoadedDllByName@12() Unknown ntdll.dll!_LdrpFindLoadedDll@16() Unknown ntdll.dll!_LdrGetDllHandleEx@20() Unknown ntdll.dll!_LdrGetDllHandle@16() Unknown > KernelBase.dll!_GetModuleHandleForUnicodeString@4() Unknown KernelBase.dll!_BasepGetModuleHandleExW@16() Unknown KernelBase.dll!_GetModuleHandleW@4() Unknown xul.dll!SetupErrorHandling(const char * progname) Line 4427 C++ xul.dll!XREMain::XRE_mainInit(bool * aExitFlag) Line 2887 C++ xul.dll!XREMain::XRE_main(int argc, char * * argv, const nsXREAppData * aAppData) Line 4055 C++ xul.dll!XRE_main(int argc, char * * argv, const nsXREAppData * aAppData, unsigned int aFlags) Line 4296 C++ firefox.exe!do_main(int argc, char * * argv, nsIFile * xreDirectory) Line 282 C++ firefox.exe!NS_internal_main(int argc, char * * argv) Line 643 C++ firefox.exe!wmain(int argc, wchar_t * * argv) Line 112 C++ firefox.exe!__tmainCRTStartup() Line 552 C kernel32.dll!@BaseThreadInitThunk@12() Unknown ntdll.dll!___RtlUserThreadStart@8() Unknown ntdll.dll!__RtlUserThreadStart@8() Unknown Posting here just for the record and in case anyone has ideas about why it might be happening...

Viktor Stanchev [:vikstrous]

Assignee

Comment 17

•

11 years ago

I tried replacing GetModuleHandleW with GetModuleHandleExW and that actually made things worse: https://tbpl.mozilla.org/?tree=Try&rev=5954477e1622 I don't think I understand the problem well enough. It seems like a deadlock in ntdll.dll when GetModuleHandleW is called from 2 threads at the same time. The documentation says to use GetModuleHandleExW instead but for a different reason. In any case, we can't change what _beginthreadex calls under the hood. Maybe we can use CreateThread instead? http://msdn.microsoft.com/en-us/library/windows/desktop/ms682453%28v=vs.85%29.aspx Searching on the Internet, I found a few references to deadlocks in ntdll in older versions of Windows, but nothing matched my stack trace exactly and nothing was actually useful for fixing the issue. Hacky fix idea: have our own lock that prevents calls to _beginthreadex while a handle to a dll is being requested. This is not a Good solution, but since I can't look at the source for ntdll it's hard to say what the actual problem is. Who can I ask for help with this?

Flags: needinfo?(bgirard)

Viktor Stanchev [:vikstrous]

Assignee

Comment 18

•

11 years ago

Looks like this was already reported in bug 978585

Viktor Stanchev [:vikstrous]

Assignee

Updated

•

11 years ago

Depends on: 978585

Benoit Girard (:BenWa)

Reporter

Comment 19

•

11 years ago

Clearing the needinfo. Just to be explicit I think we need to follow bug 978585 for a solution.

Flags: needinfo?(bgirard)

Viktor Stanchev [:vikstrous]

Assignee

Comment 20

•

11 years ago

Attached patch talos_profiler4 (obsolete) — Details — Splinter Review

I added the --spsProfileEntries option so that we can control the tests a bit better and I made the Kraken test take up to 1000000 samples at 10 ms intervals by default because it's a very long, javascript heavy test. That will increase the size of those profiles by up to 10x, but the compression should make the increase even less than that. I also added a few more prints to make debugging easier.

Attachment #8391512 - Attachment is obsolete: true

Viktor Stanchev [:vikstrous]

Assignee

Comment 21

•

11 years ago

Attached patch talos_profiler5 (obsolete) — Details — Splinter Review

Oops I forgot the actual Kraken test change

Attachment #8398680 - Attachment is obsolete: true

Viktor Stanchev [:vikstrous]

Assignee

Comment 22

•

11 years ago

Profiling off: https://tbpl.mozilla.org/?tree=Try&rev=cf8e02b1409e Profiling on: https://tbpl.mozilla.org/?tree=Try&rev=6225a12c8b55

Viktor Stanchev [:vikstrous]

Assignee

Comment 23

•

11 years ago

Attached patch talos_profiler6 (obsolete) — Details — Splinter Review

Uhh, I was missing the profiler js files in the last patch.

Attachment #8398681 - Attachment is obsolete: true

Viktor Stanchev [:vikstrous]

Assignee

Comment 24

•

11 years ago

Profiling off: https://tbpl.mozilla.org/?tree=Try&rev=aeed574ddb5c Profiling on: https://tbpl.mozilla.org/?tree=Try&rev=dd76971bd49f

Avi Halachmi (:avih)

Comment 25

•

11 years ago

(In reply to Viktor Stanchev [:vikstrous] from comment #24) > Profiling off: https://tbpl.mozilla.org/?tree=Try&rev=aeed574ddb5c > Profiling on: https://tbpl.mozilla.org/?tree=Try&rev=dd76971bd49f Hmm.. the result numbers look less than 2% apart between the submissions. Does SPS introduce that little overhead??

Viktor Stanchev [:vikstrous]

Assignee

Comment 26

•

11 years ago

(In reply to Avi Halachmi (:avih) from comment #25) > (In reply to Viktor Stanchev [:vikstrous] from comment #24) > > Profiling off: https://tbpl.mozilla.org/?tree=Try&rev=aeed574ddb5c > > Profiling on: https://tbpl.mozilla.org/?tree=Try&rev=dd76971bd49f > > Hmm.. the result numbers look less than 2% apart between the submissions. > Does SPS introduce that little overhead?? Looks like the profiler wasn't actually turned on. I thought mozharness: --spsProfile works now, but I guess not. I need to investigate.

Viktor Stanchev [:vikstrous]

Assignee

Comment 27

•

11 years ago

"mozharness: ..." is not passed on to mozharness unless it's after "try:" Here's a push with profiling on: https://tbpl.mozilla.org/?tree=Try&rev=bd262f3b027a

Viktor Stanchev [:vikstrous]

Assignee

Comment 28

•

11 years ago

Attached patch talos_profiler7 (obsolete) — Details — Splinter Review

I made some tweaks to the default profiler parameters so that we are more likely to get the whole tests recorded.

Attachment #8398777 - Attachment is obsolete: true

Attachment #8399644 - Flags: review?(jmaher)

Avi Halachmi (:avih)

Comment 29

•

11 years ago

Comparing the profiler-off numbers: https://tbpl.mozilla.org/?tree=Try&rev=aeed574ddb5c With the profiler-on numbers: https://tbpl.mozilla.org/?tree=Try&rev=bd262f3b027a 1. on osx (10.8) and linux (64 opt) the numbers again look very similar (0 ~ 10% diff among all the sub tests) between profiler off to on numbers. 2. On windows, most tests aborted with the profiler on. Viktor, in your try runs, could you please verify that there actually are collected profiles from each platform/test-group/sub-test? Benwa, what magnitude of a performance hit should we roughly expect from running the profiler during tests? e.g. 1%/10%/100%/500% ? etc.

Flags: needinfo?(bgirard)

Viktor Stanchev [:vikstrous]

Assignee

Comment 30

•

11 years ago

Attached file getprofiles.py — Details

Here's a python script that grabs all the profiles from the try run. These are the sizes of the compressed profiles from each test ( https://tbpl.mozilla.org/?tree=Try&rev=bd262f3b027a ): 4492 ./Rev5_MacOSX_Mountain_Lion_10.8_try_talos_svgr 3460 ./Ubuntu_HW_12.04_x64_try_talos_chromez 3204 ./Rev5_MacOSX_Mountain_Lion_10.8_try_talos_chromez 3024 ./Rev5_MacOSX_Mountain_Lion_10.8_try_talos_tp5o 2984 ./Ubuntu_HW_12.04_try_talos_chromez 2212 ./WINNT_6.2_try_talos_chromez 2076 ./Rev5_MacOSX_Mountain_Lion_10.8_try_talos_dromaeojs 2004 ./Rev4_MacOSX_Snow_Leopard_10.6_try_talos_other 1984 ./Rev5_MacOSX_Mountain_Lion_10.8_try_talos_other 1948 ./Rev4_MacOSX_Snow_Leopard_10.6_try_talos_chromez 1400 ./Windows_7_32-bit_try_talos_dromaeojs 1164 ./Windows_7_32-bit_try_talos_svgr 1048 ./Windows_7_32-bit_try_talos_tp5o 980 ./WINNT_6.2_try_talos_tp5o 964 ./Ubuntu_HW_12.04_x64_try_talos_other 900 ./Ubuntu_HW_12.04_try_talos_other 724 ./Rev4_MacOSX_Snow_Leopard_10.6_try_talos_dromaeojs 664 ./Ubuntu_HW_12.04_x64_try_talos_svgr 608 ./Windows_XP_32-bit_try_talos_other 584 ./Rev4_MacOSX_Snow_Leopard_10.6_try_talos_svgr 516 ./Ubuntu_HW_12.04_try_talos_svgr 476 ./Windows_7_32-bit_try_talos_chromez 416 ./Ubuntu_HW_12.04_x64_try_talos_dromaeojs 376 ./Ubuntu_HW_12.04_try_talos_dromaeojs 180 ./Ubuntu_HW_12.04_x64_try_talos_tp5o 156 ./Rev4_MacOSX_Snow_Leopard_10.6_try_talos_tp5o 148 ./Windows_XP_32-bit_try_talos_tp5o 136 ./Ubuntu_HW_12.04_try_talos_tp5o 128 ./Windows_XP_32-bit_try_talos_chromez 68 ./Windows_7_32-bit_try_talos_xperf

Viktor Stanchev [:vikstrous]

Assignee

Comment 31

•

11 years ago

P.S. The sizes are in kB. The total is 39MB

Joel Maher ( :jmaher ) (UTC -8)

Comment 32

•

11 years ago

Comment on attachment 8399644 [details] [diff] [review] talos_profiler7 Review of attachment 8399644 [details] [diff] [review]: ----------------------------------------------------------------- this looks good, I marked some things which are unrelated to this patch, that if we don't land this right away, it would be nice to just put these changes in a different patch so we can land it. ::: talos/test.py @@ +73,5 @@ > """ > cycles = 20 > timeout = 150 > url = 'startup_test/startup_test.html' > + url_timestamp = True patch 1 ::: talos/ttest.py @@ +391,5 @@ > > + # record when we started the test > + url = test_config['url'] > + if 'url_timestamp' in test_config and test_config['url_timestamp']: > + url += "?begin={0}".format(int(time.time()*1000)) patch 1 @@ +397,5 @@ > command_args = utils.GenerateBrowserCommandLine(browser_config["browser_path"], > browser_config["extra_args"], > browser_config["deviceroot"], > profile_dir, > + url) patch 1 @@ +422,5 @@ > cmthread.start() > > # todo: ctrl+c doesn't close the browser windows > + code = browser.wait() > + utils.info("Browser exited with error code: {0}".format(code)) patch 1

Attachment #8399644 - Flags: review?(jmaher) → review+

Viktor Stanchev [:vikstrous]

Assignee

Comment 33

•

11 years ago

Android with profiling off: https://tbpl.mozilla.org/?tree=Try&rev=53ba3fbf2281 Android with profiling on: https://tbpl.mozilla.org/?tree=Try&rev=ae5c2f1674bd Reference with old talos: https://tbpl.mozilla.org/?tree=Try&rev=677f3dced389

Viktor Stanchev [:vikstrous]

Assignee

Comment 34

•

11 years ago

In the above android pushes I initially had the wrong path to the taols.zip but after fixing that I re-ran the tests and everything is green. They shouldn't be affected because we don't turn on the profiler for mobile tests. Please take a look at the results yourself, but they look normal to me. I think we can commit this. I made a summary of the try pushes to look at: Reference with old talos: https://tbpl.mozilla.org/?tree=Try&rev=677f3dced389 New talos: Desktop with profiling on: https://tbpl.mozilla.org/?tree=Try&rev=bd262f3b027a Android with profiling on: https://tbpl.mozilla.org/?tree=Try&rev=ae5c2f1674bd Desktop with profiling off: https://tbpl.mozilla.org/?tree=Try&rev=aeed574ddb5c Android with profiling off: https://tbpl.mozilla.org/?tree=Try&rev=53ba3fbf2281

Flags: needinfo?(jmaher)

Joel Maher ( :jmaher ) (UTC -8)

Comment 35

•

11 years ago

Attached file differences for desktop tests with profiling off — Details

I am concerned mostly about: linux (32/64) - tart and tsvgx you can see the rest of the numbers.

Joel Maher ( :jmaher ) (UTC -8)

Comment 36

•

11 years ago

it appears your base for the try pushes are from a few days earlier (prior to march 29th), that would help explain some of the android numbers I see. Otherwise, I would like to see why tart and tsvgx on both linux platforms have problems with just the patch and no sps profiling turned on. Everything else isn't concerning!

Flags: needinfo?(jmaher)

Benoit Girard (:BenWa)

Reporter

Comment 37

•

11 years ago

10% overhead on 64-bit platform is very reasonable. I suspect the win32 overhead is much higher. bug 978585 might reduce the overhead a bit but I would still expect that enable-profiling overhead would be at least ~30%. This kind of overhead isn't great but it means that performance problem should still reproduce. Keep in mind that this is probably much better then instrumenting overhead with approaches like gprof.

Flags: needinfo?(bgirard)

Joel Maher ( :jmaher ) (UTC -8)

Comment 38

•

11 years ago

My understanding is that we see a regression on the tests by just landing the profiler code without using it, that was my understanding from comment 35.

Jonathan Griffin (:jgriffin)

Updated

•

11 years ago

Keywords: ateam-talos-task

Viktor Stanchev [:vikstrous]

Assignee

Comment 39

•

11 years ago

Attached file comparison.txt (obsolete) — Details

I rebased and re-ran the talos tests: https://tbpl.mozilla.org/?tree=Try&rev=6d01ee263ba0

Flags: needinfo?(jmaher)

Joel Maher ( :jmaher ) (UTC -8)

Comment 40

•

11 years ago

the win7 ts, paint is higher- on inbound we have ~900ms times and with this patch we have ~970ms times. Everything else looks good.

Flags: needinfo?(jmaher)

Viktor Stanchev [:vikstrous]

Assignee

Comment 41

•

11 years ago

I investigated this issue on my VM and it looks like the performance decrease is caused by loading Profiler.js on every page. We probably shouldn't be testing external javascript loading performance in each of these tests, but I don't see an easy way to get rid of the issue. I have 2 ideas: duplicate the profiler related code as inline in every test or add a preprocessor that inserts it into pages as a part of the html before every test. A preprocessor is a substantial change to the way talos works and duplicating profiler-related code in every test can be a maintainability issue. Joel, do you have any other ideas? Can we just accept that there will be a decrease in performance on every test that's probably proportional to the number of page loads?

Flags: needinfo?(jmaher)

Joel Maher ( :jmaher ) (UTC -8)

Comment 42

•

11 years ago

as this is making ts,paint worse is there anyway to hack that? For example, could we pass in a url param (profiling=1) and if we have profiling enabled, we could dynamically load the profiling script? Otherwise, taking a single one time hit is not that big of a deal- we can document it and move forward.

Flags: needinfo?(jmaher)

Viktor Stanchev [:vikstrous]

Assignee

Comment 43

•

11 years ago

Well, this affects all tests. I think ts_paint on Windows just happens to be the most sensitive one to this change. If we were to do something about this we should do it for all tests or just not at all. Yeah, the dynamic loading can work, but it's not perfect. 1. We would still have to have the async hack in every file. 2. It would make all calls to the profiler look like: if (typeof Profiler != 'undefined') Profiler.pause(); 3. Deferred loading of javascript can be done in many ways and is a bit hacky as far as I know and while that's okay for websites it may have different effects on different tests. The synchronous loading might have a more predictable behaviour. Disclaimer: I don't know if this is really the case. I have limited time until the end of my internship so I would prefer not to make any more big changes like this and just get this patch in so that we can be sure that we at least have the base functionality in before I leave. If you think any of these concerns are reasonable, I would prefer it if we just got it committed as it is right now. Where would we have to document the increase in the numbers and who needs to be made aware of it?

Flags: needinfo?(jmaher)

Joel Maher ( :jmaher ) (UTC -8)

Comment 44

•

11 years ago

My suggestion was to dynamically load in the tspaint file only, the rest of the results are pretty much the same.

Flags: needinfo?(jmaher)

Viktor Stanchev [:vikstrous]

Assignee

Comment 45

•

11 years ago

Attached patch talos_profiler8 — Details — Splinter Review

It turned out that I was seeing patterns where there were none. I was able to determine with very high certainty that loading the extra javascript has a negligible performance impact in the order of less than 10ms. The thing that was causing the regressions was the addition of the timestamp to the URL that you recommended that I remove in comment 32. I still don't know how that changes anything, but I've proved to myself that it does, so I'm taking it out. I thought that nothing other than ts should be using it, but apparently it affects ts_paint. https://tbpl.mozilla.org/?tree=Try&rev=c7e4007f6e2a This is a push with profiling off. OSX 10.8 is not done yet, but the comparison so far shows very little difference.

Attachment #8399644 - Attachment is obsolete: true

Viktor Stanchev [:vikstrous]

Assignee

Comment 46

•

11 years ago

Attached file comparison.txt — Details

Attachment #8404111 - Attachment is obsolete: true

Viktor Stanchev [:vikstrous]

Assignee

Updated

•

11 years ago

Attachment #8410557 - Flags: review?(jmaher)

Joel Maher ( :jmaher ) (UTC -8)

Comment 47

•

11 years ago

Comment on attachment 8410557 [details] [diff] [review] talos_profiler8 Review of attachment 8410557 [details] [diff] [review]: ----------------------------------------------------------------- this looks great and so does the try run!

Attachment #8410557 - Flags: review?(jmaher) → review+

Joel Maher ( :jmaher ) (UTC -8)

Comment 48

•

11 years ago

I assume we are ready to land this? Do we have documentation (wiki, blogs) for this?

Benoit Girard (:BenWa)

Reporter

Comment 49

•

11 years ago

We will blog and announce this to dev.platform once all the pieces are landed. I was just thinking that likely the trychooser page caption should include a link to a wiki page where we can keep up to date information about the profiling option can cover FAQ like: 1) On which platform does it work/not work 2) Best usage practice (do a push without to get baseline number, do a push with to get the profiles but focus on the numbers from the first push) 3) How to get the profile package from the zip etc...

Joel Maher ( :jmaher ) (UTC -8)

Comment 50

•

11 years ago

that sounds good. I suspect over time the documentation will change based on questions and usage. Go ahead and land this on the talos repo! Thanks for making this work:)

Viktor Stanchev [:vikstrous]

Assignee

Comment 51

•

11 years ago

I updated: https://wiki.mozilla.org/Buildbot/Talos https://wiki.mozilla.org/Buildbot/Talos/Profiling https://wiki.mozilla.org/ReleaseEngineering/TryChooser

Joel Maher ( :jmaher ) (UTC -8)

Comment 52

•

11 years ago

This never landed on the talos repo- should I go ahead and do that?

Benoit Girard (:BenWa)

Reporter

Comment 53

•

11 years ago

Yes please. I'm going to help finish this feature if we hit any problems.

Joel Maher ( :jmaher ) (UTC -8)

Comment 54

•

11 years ago

http://hg.mozilla.org/build/talos/rev/b4907f0b27d3. Any other items we should do before resolving this as closed?

Benoit Girard (:BenWa)

Reporter

Comment 55

•

11 years ago

I think we're good. If there's any issues lets move them to a different bug.

Status: NEW → RESOLVED

Closed: 11 years ago

Resolution: --- → FIXED

Benoit Girard (:BenWa)

Reporter

Comment 57

•

11 years ago

Talos changes are now in production. Pushing a run to test: https://tbpl.mozilla.org/?tree=Try&rev=416f995054d2

Avi Halachmi (:avih)

Comment 58

•

11 years ago

The TART addon is now broken when running outside of talos. I get this: at the error console when trying to use the TART addon: ReferenceError: dumpLog is not defined Profiler.js:13 this._cps2 is undefined browser.js:1700 TypeError: Profiler is undefined tart.js:190 TypeError: detector is undefined tart.js:293

Avi Halachmi (:avih)

Comment 59

•

11 years ago

Also, tscroll[x] pages are now also broken in a similar way when loaded in a normal browser: file:///E:/dev/mozilla/src/talos/talos/page_load_test/scroll/scroll-test.js ReferenceError: dumpLog is not defined Profiler.js:13 file:///E:/dev/mozilla/src/talos/talos/scripts/talos-debug.js?dummy=1399425102048 TypeError: Profiler is undefined scroll-test.js:97 And I noticed that tsvgx is not broken in a normal browser, but then again the tsvgx pages don't have the profiler hooks. Is that intentional? I didn't try other tests except for tart, tscroll and tsvgx so far.

Avi Halachmi (:avih)

Updated

•

11 years ago

Depends on: 1010473

Avi Halachmi (:avih)

Updated

•

11 years ago

Depends on: 1016089

Avi Halachmi (:avih)

Updated

•

11 years ago

Depends on: 1016675

Avi Halachmi (:avih)

Comment 60

•

11 years ago

I'm not sure I fully understand when is the profiler paused/resumed with this patch, and also not what's the optimal/preferred strategy which this patch tries to follow. I'm guessing that the optimal strategy is to profile as little as possible data which doesn't affect the reported result, and as much as possible data which does affect the reported result. Here's what I can see at the patch when the profiler is enabled: - "Turn on the profiler on startup and write its output to a temp file" - Some (but not all) of the pageload tests which report their own measurements (i.e. not a "plain" page load measurement - like some pages of of tscroll, tsvgx, *ART, etc) resume and pause the profiler around the section[s] which they measure. - The pageloader addon (which loads all the pageload tests - including tp5 for instance) adds a marker before it starts loading the tested page URL. - Some of the startup tests only use pause (startup_test.html, tspaint_test.html), while other startup tests use resume and pause (tpaint.html, tresize-test.html). This may OK for some or for all, but I'm unfamiliar with what all these tests measure. So, assuming the profiler is indeed started and resumed on startup, and that there's no postprocessing according to markers (which I couldn't detect at the patch), the above should result in the following deviations from the preferred strategy (as I guessed it): - All the plain pageload tests (like tp5) collect profile from browser startup until the entire run is over, instead of collecting profile only when pages are loaded. It should be easy for the pageloader addon to do this, but it doesn't. - The tests which resume and then pause themselves, also collect data from browser startup because the first resume is no-op if the profiler is already resumed from startup. - Possibly some incorrect resume/pause points (e.g. for TART it also profiles "warmup" runs which aren't measured - bug 1016675), or missing resume/pause (e.g. tsvgx has none but it should. No bug on this yet). Since it looks like the main big issues should be relatively easy to fix (or at least to improve), I'm wondering if I understood the strategy correctly. If I did understand it correctly, then we have some work to do here. Otherwise, I'd appreciate some clarification on why it does what it does.

Flags: needinfo?(bgirard)

Avi Halachmi (:avih)

Comment 61

•

11 years ago

Also, I noticed that the default sps intervals are different for different tests, and that the buffer size is also different. Are we intentionally lowering our resolution due to storage reasons? FWIW, I think we shouldn't concern ourselves with storage just yet. If it becomes an issue, then maybe, but unless we _know_ if will be an issue or actually see one, I think we should default to maximum resolution and as much storage as it needs for each test. I'm guessing that the output could be smaller than the buffer size if the test didn't fill the whole buffer up, right? Also, at least for tp5, if we limit the profile data to only during page loads, I'm guessing it could result in a significant reduction in size, as well as for other tests which profile more than they should.

Joel Maher ( :jmaher ) (UTC -8)

Comment 62

•

11 years ago

I recall the reason was either stability or reducing interference. If it is storage related, that would be a good discussion to revisit.

Benoit Girard (:BenWa)

Reporter

Comment 63

•

11 years ago

(In reply to Avi Halachmi (:avih) from comment #61) > Also, I noticed that the default sps intervals are different for different > tests, and that the buffer size is also different. Are we intentionally > lowering our resolution due to storage reasons? Yes, we wanted to keep each profile to a few MBs each max and we wanted to make sure the whole test fit into the profile. If we spent more time paused then we certainly could increase the interval.

Flags: needinfo?(bgirard)

Benoit Girard (:BenWa)

Reporter

Comment 64

•

11 years ago

well increase the rate, lower the interval.

Benoit Girard (:BenWa)

Reporter

Comment 65

•

11 years ago

(In reply to Avi Halachmi (:avih) from comment #60) > I'm not sure I fully understand when is the profiler paused/resumed with > this patch, and also not what's the optimal/preferred strategy which this > patch tries to follow. Pausing and resuming is just meant to discard samples that are obviously not useful. This isn't a hard requirement for us to do or get right since cleopatra UI is fairly good at filtering if you know what you're looking for. But if we have time we should adjust the mistake that we find. > > I'm guessing that the optimal strategy is to profile as little as possible > data which doesn't affect the reported result, and as much as possible data > which does affect the reported result. > > Here's what I can see at the patch when the profiler is enabled: > > - "Turn on the profiler on startup and write its output to a temp file" This should only be done if the startup is measuring data about the browser on startup. Otherwise we can start the profiler after when we start the test itself if it's not a startup test. Starting the profiler on startup is more cumbersome because some settings can't be passed via environment variables I believe (maybe that was fixed). The test should save when the test is completed, or on shutdown if we measure shutdown. Once the file is saved it should be uploaded and a URL will be put in the Talos log. > > - Some (but not all) of the pageload tests which report their own > measurements (i.e. not a "plain" page load measurement - like some pages of > of tscroll, tsvgx, *ART, etc) resume and pause the profiler around the > section[s] which they measure. That's correct. If a test doesn't do it properly and we have the time to fix it, we should. > > - The pageloader addon (which loads all the pageload tests - including tp5 > for instance) adds a marker before it starts loading the tested page URL. It useful to add a marker that has the Page URL so we can easily tell from the profile what page of tp5 we are in and which page is being loaded. > > - Some of the startup tests only use pause (startup_test.html, > tspaint_test.html), while other startup tests use resume and pause > (tpaint.html, tresize-test.html). This may OK for some or for all, but I'm > unfamiliar with what all these tests measure. > > > So, assuming the profiler is indeed started and resumed on startup, and that > there's no postprocessing according to markers (which I couldn't detect at > the patch), the above should result in the following deviations from the > preferred strategy (as I guessed it): The was post processing with mstange' stuff but we're trying to have Talos do it automatically instead of requiring the user to have to manually postprocess the scripts. > > - All the plain pageload tests (like tp5) collect profile from browser > startup until the entire run is over, instead of collecting profile only > when pages are loaded. It should be easy for the pageloader addon to do > this, but it doesn't. > > - The tests which resume and then pause themselves, also collect data from > browser startup because the first resume is no-op if the profiler is already > resumed from startup. If there's a redundant resume I think its just ignored. > > - Possibly some incorrect resume/pause points (e.g. for TART it also > profiles "warmup" runs which aren't measured - bug 1016675), or missing > resume/pause (e.g. tsvgx has none but it should. No bug on this yet). Missing resume/pause points is not a big deal since the user can do more work and ignore the samples when looking at a profile. But not recording samples during the important port of the test is a big bug and should really be filed and addressed. > > > Since it looks like the main big issues should be relatively easy to fix (or > at least to improve), I'm wondering if I understood the strategy correctly. > > If I did understand it correctly, then we have some work to do here. > Otherwise, I'd appreciate some clarification on why it does what it does. Your understanding is correct. If we can spare the time to fix the pause/resume points then great. But I would give priority to the issues that are preventing people from diagnosing performance regressions with Talos over adjust missing pause/resume points.

Avi Halachmi (:avih)

Comment 66

•

11 years ago

Thanks for the detailed reply. (In reply to Benoit Girard (:BenWa) from comment #65) > Pausing and resuming is just meant to discard samples that are obviously not > useful. This isn't a hard requirement for us to do or get right since > cleopatra UI is fairly good at filtering if you know what you're looking > for. Ah! this is one point I haven't considered. I thought that in order to compare profiles effectively, the profiles would need to be as accurate as possible and only contain data which is relevant to the results/measurements. This could have been done by more accurate resume/pause, or some postprocessing as well. I didn't consider the option that cleopatra itself will filter the data effectively for this task. > But if we have time we should adjust the mistake that we find. If the profiles already have enough data for cleopatra to filter them as required and use them to their fullest potential (which it seems it does), then improving pause/resumes wouldn't help anything other than conserve some disk space, and may introduce bugs. Since I'm not concerned with disk space and there's no other apparent advantage otherwise, it would probably be a waste of time to work on this. Thanks again.

Benoit Girard (:BenWa)

Reporter

Comment 67

•

11 years ago

Just to be clear, when I say cleopatra does the filtering I mean the user needs to go through and understand what data is relevant or not so it does add noise that does the user needs to do with. If we have things like GC/CC running between tests then filtering that out is useful because the user doesn't have to deal with the noise. But the primary concern is that the profiles contains the data we need (signal) then we can worry about removing the useless samples (noise).

talos_profiler 11 years ago Viktor Stanchev [:vikstrous] 683.94 KB, patch		Details \| Diff \| Splinter Review
talos_profiler2 11 years ago Viktor Stanchev [:vikstrous] 681.83 KB, patch	jmaher : review-	Details \| Diff \| Splinter Review
talos_profiler3 11 years ago Viktor Stanchev [:vikstrous] 38.27 KB, patch	avih : feedback+	Details \| Diff \| Splinter Review
talos_profiler3 11 years ago Viktor Stanchev [:vikstrous] 38.21 KB, patch	jmaher : review+	Details \| Diff \| Splinter Review
talos_profiler4 11 years ago Viktor Stanchev [:vikstrous] 35.95 KB, patch		Details \| Diff \| Splinter Review
talos_profiler5 11 years ago Viktor Stanchev [:vikstrous] 35.70 KB, patch		Details \| Diff \| Splinter Review
talos_profiler6 11 years ago Viktor Stanchev [:vikstrous] 39.75 KB, patch		Details \| Diff \| Splinter Review
talos_profiler7 11 years ago Viktor Stanchev [:vikstrous] 46.43 KB, patch	jmaher : review+	Details \| Diff \| Splinter Review
getprofiles.py 11 years ago Viktor Stanchev [:vikstrous] 794 bytes, text/x-python		Details
differences for desktop tests with profiling off 11 years ago Joel Maher ( :jmaher ) (UTC -8) 6.61 KB, text/plain		Details
comparison.txt 11 years ago Viktor Stanchev [:vikstrous] 8.04 KB, text/plain		Details
talos_profiler8 11 years ago Viktor Stanchev [:vikstrous] 44.53 KB, patch	jmaher : review+	Details \| Diff \| Splinter Review
comparison.txt 11 years ago Viktor Stanchev [:vikstrous] 7.42 KB, text/plain		Details