Closed Bug 967635 Opened 6 years ago Closed 6 years ago

Add profiling hooks for all the major Talos Test

Categories

(Testing :: Talos, defect)

x86
macOS
defect
Not set

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: BenWa, Assigned: vikstrous)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(Keywords: ateam-talos-task)

Attachments

(4 files, 9 obsolete files)

Bug 870657 has already started this work for talos ts test. We need to modify talos to trigger profiling recording for the relevant time slice for each talos test. For example the page load test should show samples only for the time while the page is loading.
During the Australis Talos investigation, where we profiled at least tresize, tpaint and tart, it worked really well to have the profile capture the whole Firefox run. We instrumented the tests to add MOZ_MEASUREMENT_START and MOZ_MEASUREMENT_STOP markers to the profile, and removed the unnecessary parts from the profile in a post-processing step.
I lost the link to that. Can you post the most recent tools you used for that. It should be a great starting point.
Assignee: nobody → vstanchev
It would be good to have a common profiler object with a simple interface that can be used from all tests to pause/unpause the profiler and to insert markers.

It seems like we have 3 types of tests: pageloader addon tests, html page tests, and tart tests. The tart tests have their own addon. The pageloader tests (tp5) need to call the profiler from within the pageloader addon and the html page tests need to call the profiler from within an html page.

I've instrumented some of the html pages, but now I realized that I need to also add the profiler object to the two addons. What's the best way to do this? I don't like the idea of duplicating the profiler javascript file twice and I don't like the idea of using symlinks (because Windows). What's the best way of doing this?

Maybe we can have talos copy the profiler javascript file into the extension / page whenever it runs the tests?
Flags: needinfo?(jmaher)
we use the pageloader  addon for the tart/cart tests as well as all the pageloader tests.  There are a few startup_tests which need attention.  That should minimize your duplication.
Flags: needinfo?(jmaher)
Attached patch talos_profiler (obsolete) — Splinter Review
The dbghelper.dll part of the patch is untested and just copied from Markus's repository.

I have the Profiler JavaScript copied into 3 files. I'm considering using "port". That might help me     reduce duplication.              

https://developer.mozilla.org/en-US/Add-ons/SDK/Guides/Content_Scripts/using_port

The list of specifically instrumented tests is:
tart, cart, ts_paint, tpaint, tresize, ts, tscrollr, tscrollx, tcanvasmark

tsvg and tp5o need only the modification to the pageloader addon modification to give useful information about which page is being tested.
Attachment #8379912 - Flags: review?(bgirard)
After reading this page (https://developer.mozilla.org/en-US/Add-ons/SDK/Guides/Content_Scripts/Communicating_With_Other_Scripts) I don't know if using port is a great idea... because we also need to jump one more gap between a content script and the javascript on the page. This would require boilerplate that takes up more lines of code than duplicating the profiler file.
Attached patch talos_profiler2 (obsolete) — Splinter Review
I rebased my patch on top of the most recent talos version
Attachment #8379912 - Attachment is obsolete: true
Attachment #8379912 - Flags: review?(bgirard)
Attachment #8381690 - Flags: review?(bgirard)
Depends on: 982689
Comment on attachment 8381690 [details] [diff] [review]
talos_profiler2

Review of attachment 8381690 [details] [diff] [review]:
-----------------------------------------------------------------

Joel are you the right person to review these talos changes?

::: talos/ttest.py
@@ +303,5 @@
>              test_config['profile_path'] = os.path.normpath(test_config['profile_path'])
>  
> +            # Put our special dbgHelper.dll in there so that Windows XP profiles
> +            # make sense.
> +            utils.info("Do we have we a file to install?")

let's remove the dbghelper and ask relend to deploy it.

@@ +313,5 @@
> +
> +            # Turn on the profiler on startup and write its output to a temp file
> +            sps_profile = test_config.get('sps_profile', False)
> +            sps_profile_interval = test_config.get('sps_profile_interval', 1)
> +            sps_profile_file = None

316/317 can be moved into the if.
Attachment #8381690 - Flags: review?(jmaher)
Comment on attachment 8381690 [details] [diff] [review]
talos_profiler2

Review of attachment 8381690 [details] [diff] [review]:
-----------------------------------------------------------------

I am not a fan of 3 profiler.js files, but I don't have a better idea right now.  I would really like to see the dbghelp.dll added to the machines and not part of the toolchain.

really good start here, I think we are close.

::: talos/test.py
@@ +71,5 @@
>      """
>      cycles = 20
>      timeout = 150
>      url = 'startup_test/startup_test.html'
> +    url_timestamp = True

why is this added in here?

@@ +85,5 @@
>      shutdown = None
>      xperf_counters = ['main_startup_fileio', 'main_startup_netio', 'main_normal_fileio', 'main_normal_netio', 'nonmain_startup_fileio', 'nonmain_startup_netio', 'nonmain_normal_fileio', 'mainthread_readcount', 'mainthread_readbytes', 'mainthread_writecount', 'mainthread_writebytes']
>      filters = [["ignore_first", [1]], ['median', []]]
>      tpmozafterpaint = True
> +    dbghelp_file = '${talos}/page_load_test/tart/dbghelp.dll'

did you add this .dll to the repo?  why is this only for tspaint?

@@ +120,5 @@
>      timeout = 300
>      mobile = False # XUL based tests with local files.
>      tpmozafterpaint = True
>      filters = [["ignore_first", [5]], ['median', []]]
> +    dbghelp_file = '${talos}/page_load_test/tart/dbghelp.dll'

fyi: we don't run generated_max anymore, I should remove this from the repo.

@@ +327,5 @@
>      Derived from the tp5n pageset, this is the 49 most reliable webpages.
>      """
>      tpmanifest = '${talos}/page_load_test/tp5n/tp5o.manifest'
>      responsiveness = True
> +    sps_profile_interval = 10

why do you set the interval here and not other tests?

::: talos/ttest.py
@@ +25,5 @@
>  import copy
>  import mozcrash
>  import talosconfig
>  import shutil
> +import zipfile

can we guarantee that this is available on all the slaves?

@@ +303,5 @@
>              test_config['profile_path'] = os.path.normpath(test_config['profile_path'])
>  
> +            # Put our special dbgHelper.dll in there so that Windows XP profiles
> +            # make sense.
> +            utils.info("Do we have we a file to install?")

I like this idea, then we could get rid of the other references and code for it throughout this patch.

@@ +398,5 @@
>  
> +                # record when we started the test
> +                url = test_config['url']
> +                if 'url_timestamp' in test_config and test_config['url_timestamp']:
> +                    url += "?begin=%d".format(int(time.time()*1000))

this was added in test.py for ts_paint, did you remove it somehow?

@@ +474,5 @@
> +                      arcname = os.path.join(mud, "{0}.zip".format(profile_name))
> +                      profile_filename = os.path.join(profile_name, cycle_name)
> +                      utils.info("Adding profile {0} to archive {1}".format(profile_filename, arcname))
> +                      with zipfile.ZipFile(arcname, 'a', mode) as arc:
> +                          arc.write(sps_profile_file, profile_filename)

I can see this failing randomly on Windows with some type of access permissions, can we have this in a try except so we can print out a more descriptive failure/error message?
Attachment #8381690 - Flags: review?(jmaher) → review-
(In reply to Joel Maher (:jmaher) from comment #10)
> Comment on attachment 8381690 [details] [diff] [review]
> talos_profiler2
> 
> Review of attachment 8381690 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I am not a fan of 3 profiler.js files, but I don't have a better idea right
> now.  I would really like to see the dbghelp.dll added to the machines and
> not part of the toolchain.
> 
> really good start here, I think we are close.
> 
> ::: talos/test.py
> @@ +71,5 @@
> >      """
> >      cycles = 20
> >      timeout = 150
> >      url = 'startup_test/startup_test.html'
> > +    url_timestamp = True
> 
> why is this added in here?
> 


In talos/startup_test/startup_test.html it checks for a parameter in the url called begin:

    <!-- call this with an arg, e.g. file://foo/startup-test.html?begin=12345678 -->

This seemed to be broken when I was working on the test and there were some changes from Markus's repository that added changed the default value for the url_timestamp option, so I kept them because they seemed like they might fix the issue.

The results were showing ts time of NaN without this fix.

"results": {
      "ts": [
        NaN
      ]
    },

In addition to this option being enabled, I had to manually add the parameter at the end of the url because it didn't seem like there was anything adding it yet. That's one of the changes you commented on below.

> @@ +85,5 @@
> >      shutdown = None
> >      xperf_counters = ['main_startup_fileio', 'main_startup_netio', 'main_normal_fileio', 'main_normal_netio', 'nonmain_startup_fileio', 'nonmain_startup_netio', 'nonmain_normal_fileio', 'mainthread_readcount', 'mainthread_readbytes', 'mainthread_writecount', 'mainthread_writebytes']
> >      filters = [["ignore_first", [1]], ['median', []]]
> >      tpmozafterpaint = True
> > +    dbghelp_file = '${talos}/page_load_test/tart/dbghelp.dll'
> 
> did you add this .dll to the repo?  why is this only for tspaint?

Removed.

> 
> @@ +120,5 @@
> >      timeout = 300
> >      mobile = False # XUL based tests with local files.
> >      tpmozafterpaint = True
> >      filters = [["ignore_first", [5]], ['median', []]]
> > +    dbghelp_file = '${talos}/page_load_test/tart/dbghelp.dll'
> 
> fyi: we don't run generated_max anymore, I should remove this from the repo.
> 
> @@ +327,5 @@
> >      Derived from the tp5n pageset, this is the 49 most reliable webpages.
> >      """
> >      tpmanifest = '${talos}/page_load_test/tp5n/tp5o.manifest'
> >      responsiveness = True
> > +    sps_profile_interval = 10
> 
> why do you set the interval here and not other tests?

This test takes longer than all the other tests and if we don't decrease the sampling interval it can't produce a profile of the entire test.

> 
> ::: talos/ttest.py
> @@ +25,5 @@
> >  import copy
> >  import mozcrash
> >  import talosconfig
> >  import shutil
> > +import zipfile
> 
> can we guarantee that this is available on all the slaves?

I don't know how to guarantee this. The windows 7 32 bit talos machine I borrowed has it. Looks like it was added in python 1.6: http://docs.python.org/2/library/zipfile

> 
> @@ +303,5 @@
> >              test_config['profile_path'] = os.path.normpath(test_config['profile_path'])
> >  
> > +            # Put our special dbgHelper.dll in there so that Windows XP profiles
> > +            # make sense.
> > +            utils.info("Do we have we a file to install?")
> 
> I like this idea, then we could get rid of the other references and code for
> it throughout this patch.

Removed.

> 
> @@ +398,5 @@
> >  
> > +                # record when we started the test
> > +                url = test_config['url']
> > +                if 'url_timestamp' in test_config and test_config['url_timestamp']:
> > +                    url += "?begin=%d".format(int(time.time()*1000))
> 
> this was added in test.py for ts_paint, did you remove it somehow?
> 

See above. Where was this supposed to happen? The test that was missing the timestamp was ts.

> @@ +474,5 @@
> > +                      arcname = os.path.join(mud, "{0}.zip".format(profile_name))
> > +                      profile_filename = os.path.join(profile_name, cycle_name)
> > +                      utils.info("Adding profile {0} to archive {1}".format(profile_filename, arcname))
> > +                      with zipfile.ZipFile(arcname, 'a', mode) as arc:
> > +                          arc.write(sps_profile_file, profile_filename)
> 
> I can see this failing randomly on Windows with some type of access
> permissions, can we have this in a try except so we can print out a more
> descriptive failure/error message?
I don't seem to be getting any errors when testing this on Windows, but Windows works in mysterious ways, so I'll add the try/except just in case.
Attached patch talos_profiler3 (obsolete) — Splinter Review
Attachment #8381690 - Attachment is obsolete: true
Attachment #8381690 - Flags: review?(bgirard)
Attachment #8391475 - Flags: review?(jmaher)
Comment on attachment 8391475 [details] [diff] [review]
talos_profiler3

Review of attachment 8391475 [details] [diff] [review]:
-----------------------------------------------------------------

::: talos/test.py
@@ +203,5 @@
>      tpcycles = 1 # number of time to run each page
>      cycles = None
>      timeout = None
>      filters = None
> +    keys = ['tpmanifest', 'tpcycles', 'tppagecycles', 'tprender', 'tpchrome', 'tpmozafterpaint', 'tploadaboutblank',

tploadaboutblank was modified to tploadnocache few weeks ago.

Apparently you rebased your code but accidentally kept the old flag name. This would break TART and CART if the test includes more than one cycle (if affects how the TART addon reloads the page).
Attachment #8391475 - Flags: feedback+
Attached patch talos_profiler3 (obsolete) — Splinter Review
I fixed the issue with tart I was having earlier (it was the result of a property that was renamed and a merge conflict) and another bug I introduced by moving variables around in last patch.
Attachment #8391475 - Attachment is obsolete: true
Attachment #8391475 - Flags: review?(jmaher)
Attachment #8391512 - Flags: review?(jmaher)
Comment on attachment 8391512 [details] [diff] [review]
talos_profiler3

Review of attachment 8391512 [details] [diff] [review]:
-----------------------------------------------------------------

this looks good.  Have you verified that this patch without profiling turned on do not affect the results on try server?  we have compare.py in the talos repository to help sanity check that.
Attachment #8391512 - Flags: review?(jmaher) → review+
When testing this I was getting mysterious hangs on Windows that I couldn't reproduce locally. This is the stack trace I got from T(s) here: https://tbpl.mozilla.org/?tree=Try&rev=58ce35a411b3

Worker thread:
 	ntdll.dll!_KiFastSystemCallRet@0()	Unknown
 	ntdll.dll!_ZwWaitForSingleObject@12()	Unknown
 	ntdll.dll!_RtlpWaitOnCriticalSection@8()	Unknown
 	ntdll.dll!_RtlEnterCriticalSection@4()	Unknown
 	ntdll.dll!_LdrGetDllHandleEx@20()	Unknown
 	ntdll.dll!_LdrGetDllHandle@16()	Unknown
 	KernelBase.dll!_GetModuleHandleForUnicodeString@4()	Unknown
 	KernelBase.dll!_BasepGetModuleHandleExW@16()	Unknown
 	KernelBase.dll!_GetModuleHandleW@4()	Unknown
 	msvcr100.dll!_initptd(_tiddata * ptd, threadlocaleinfostruct * ptloci) Line 384	C
>	msvcr100.dll!_beginthreadex(void * security, unsigned int stacksize, unsigned int (void *) * initialcode, void * argument, unsigned int createflag, unsigned int * thrdaddr) Line 161	C
 	xul.dll!EnsureWalkThreadReady() Line 280	C++
 	xul.dll!NS_StackWalk(void (void *, void *, void *) * aCallback, unsigned int aSkipFrames, unsigned int aMaxFrames, void * aClosure, unsigned int aThread, void * aPlatformData) Line 500	C++
 	xul.dll!TableTicker::doNativeBacktrace(ThreadProfile & aProfile, TickSample * aSample) Line 473	C++
 	xul.dll!TableTicker::InplaceTick(TickSample * sample) Line 581	C++
 	xul.dll!SamplerThread::SampleContext(Sampler * sampler, ThreadProfile * thread_profile) Line 179	C++
 	xul.dll!SamplerThread::Run() Line 123	C++
 	xul.dll!`anonymous namespace'::ThreadFunc(void * closure) Line 27	C++
 	msvcr100.dll!_callthreadstartex() Line 314	C
 	msvcr100.dll!_threadstartex(void * ptd) Line 292	C
 	kernel32.dll!@BaseThreadInitThunk@12()	Unknown
 	ntdll.dll!___RtlUserThreadStart@8()	Unknown
 	ntdll.dll!__RtlUserThreadStart@8()	Unknown


Main thread:
 	ntdll.dll!_LdrpHashUnicodeString@4()	Unknown
 	ntdll.dll!_LdrpFindLoadedDllByName@12()	Unknown
 	ntdll.dll!_LdrpFindLoadedDll@16()	Unknown
 	ntdll.dll!_LdrGetDllHandleEx@20()	Unknown
 	ntdll.dll!_LdrGetDllHandle@16()	Unknown
>	KernelBase.dll!_GetModuleHandleForUnicodeString@4()	Unknown
 	KernelBase.dll!_BasepGetModuleHandleExW@16()	Unknown
 	KernelBase.dll!_GetModuleHandleW@4()	Unknown
 	xul.dll!SetupErrorHandling(const char * progname) Line 4427	C++
 	xul.dll!XREMain::XRE_mainInit(bool * aExitFlag) Line 2887	C++
 	xul.dll!XREMain::XRE_main(int argc, char * * argv, const nsXREAppData * aAppData) Line 4055	C++
 	xul.dll!XRE_main(int argc, char * * argv, const nsXREAppData * aAppData, unsigned int aFlags) Line 4296	C++
 	firefox.exe!do_main(int argc, char * * argv, nsIFile * xreDirectory) Line 282	C++
 	firefox.exe!NS_internal_main(int argc, char * * argv) Line 643	C++
 	firefox.exe!wmain(int argc, wchar_t * * argv) Line 112	C++
 	firefox.exe!__tmainCRTStartup() Line 552	C
 	kernel32.dll!@BaseThreadInitThunk@12()	Unknown
 	ntdll.dll!___RtlUserThreadStart@8()	Unknown
 	ntdll.dll!__RtlUserThreadStart@8()	Unknown

Posting here just for the record and in case anyone has ideas about why it might be happening...
I tried replacing GetModuleHandleW with GetModuleHandleExW and that actually made things worse:

https://tbpl.mozilla.org/?tree=Try&rev=5954477e1622

I don't think I understand the problem well enough. It seems like a deadlock in ntdll.dll when GetModuleHandleW is called from 2 threads at the same time. The documentation says to use GetModuleHandleExW instead but for a different reason. In any case, we can't change what _beginthreadex calls under the hood. Maybe we can use CreateThread instead? http://msdn.microsoft.com/en-us/library/windows/desktop/ms682453%28v=vs.85%29.aspx Searching on the Internet, I found a few references to deadlocks in ntdll in older versions of Windows, but nothing matched my stack trace exactly and nothing was actually useful for fixing the issue.

Hacky fix idea: have our own lock that prevents calls to _beginthreadex while a handle to a dll is being requested. This is not a Good solution, but since I can't look at the source for ntdll it's hard to say what the actual problem is.

Who can I ask for help with this?
Flags: needinfo?(bgirard)
Looks like this was already reported in bug 978585
Depends on: 978585
Clearing the needinfo. Just to be explicit I think we need to follow bug 978585 for a solution.
Flags: needinfo?(bgirard)
Attached patch talos_profiler4 (obsolete) — Splinter Review
I added the --spsProfileEntries option so that we can control the tests a bit better and I made the Kraken test take up to 1000000 samples at 10 ms intervals by default because it's a very long, javascript heavy test. That will increase the size of those profiles by up to 10x, but the compression should make the increase even less than that.

I also added a few more prints to make debugging easier.
Attachment #8391512 - Attachment is obsolete: true
Attached patch talos_profiler5 (obsolete) — Splinter Review
Oops I forgot the actual Kraken test change
Attachment #8398680 - Attachment is obsolete: true
Attached patch talos_profiler6 (obsolete) — Splinter Review
Uhh, I was missing the profiler js files in the last patch.
Attachment #8398681 - Attachment is obsolete: true
(In reply to Viktor Stanchev [:vikstrous] from comment #24)
> Profiling off: https://tbpl.mozilla.org/?tree=Try&rev=aeed574ddb5c
> Profiling on: https://tbpl.mozilla.org/?tree=Try&rev=dd76971bd49f

Hmm.. the result numbers look less than 2% apart between the submissions. Does SPS introduce that little overhead??
(In reply to Avi Halachmi (:avih) from comment #25)
> (In reply to Viktor Stanchev [:vikstrous] from comment #24)
> > Profiling off: https://tbpl.mozilla.org/?tree=Try&rev=aeed574ddb5c
> > Profiling on: https://tbpl.mozilla.org/?tree=Try&rev=dd76971bd49f
> 
> Hmm.. the result numbers look less than 2% apart between the submissions.
> Does SPS introduce that little overhead??

Looks like the profiler wasn't actually turned on. I thought mozharness: --spsProfile works now, but I guess not. I need to investigate.
"mozharness: ..." is not passed on to mozharness unless it's after "try:" Here's a push with profiling on: https://tbpl.mozilla.org/?tree=Try&rev=bd262f3b027a
Attached patch talos_profiler7 (obsolete) — Splinter Review
I made some tweaks to the default profiler parameters so that we are more likely to get the whole tests recorded.
Attachment #8398777 - Attachment is obsolete: true
Attachment #8399644 - Flags: review?(jmaher)
Comparing the profiler-off numbers:
https://tbpl.mozilla.org/?tree=Try&rev=aeed574ddb5c

With the profiler-on numbers:
https://tbpl.mozilla.org/?tree=Try&rev=bd262f3b027a

1. on osx (10.8) and linux (64 opt) the numbers again look very similar (0 ~ 10% diff among all the sub tests) between profiler off to on numbers.

2. On windows, most tests aborted with the profiler on.


Viktor, in your try runs, could you please verify that there actually are collected profiles from each platform/test-group/sub-test?

Benwa, what magnitude of a performance hit should we roughly expect from running the profiler during tests? e.g. 1%/10%/100%/500% ? etc.
Flags: needinfo?(bgirard)
Attached file getprofiles.py
Here's a python script that grabs all the profiles from the try run. These are the sizes of the compressed profiles from each test ( https://tbpl.mozilla.org/?tree=Try&rev=bd262f3b027a ):

4492	./Rev5_MacOSX_Mountain_Lion_10.8_try_talos_svgr
3460	./Ubuntu_HW_12.04_x64_try_talos_chromez
3204	./Rev5_MacOSX_Mountain_Lion_10.8_try_talos_chromez
3024	./Rev5_MacOSX_Mountain_Lion_10.8_try_talos_tp5o
2984	./Ubuntu_HW_12.04_try_talos_chromez
2212	./WINNT_6.2_try_talos_chromez
2076	./Rev5_MacOSX_Mountain_Lion_10.8_try_talos_dromaeojs
2004	./Rev4_MacOSX_Snow_Leopard_10.6_try_talos_other
1984	./Rev5_MacOSX_Mountain_Lion_10.8_try_talos_other
1948	./Rev4_MacOSX_Snow_Leopard_10.6_try_talos_chromez
1400	./Windows_7_32-bit_try_talos_dromaeojs
1164	./Windows_7_32-bit_try_talos_svgr
1048	./Windows_7_32-bit_try_talos_tp5o
980	./WINNT_6.2_try_talos_tp5o
964	./Ubuntu_HW_12.04_x64_try_talos_other
900	./Ubuntu_HW_12.04_try_talos_other
724	./Rev4_MacOSX_Snow_Leopard_10.6_try_talos_dromaeojs
664	./Ubuntu_HW_12.04_x64_try_talos_svgr
608	./Windows_XP_32-bit_try_talos_other
584	./Rev4_MacOSX_Snow_Leopard_10.6_try_talos_svgr
516	./Ubuntu_HW_12.04_try_talos_svgr
476	./Windows_7_32-bit_try_talos_chromez
416	./Ubuntu_HW_12.04_x64_try_talos_dromaeojs
376	./Ubuntu_HW_12.04_try_talos_dromaeojs
180	./Ubuntu_HW_12.04_x64_try_talos_tp5o
156	./Rev4_MacOSX_Snow_Leopard_10.6_try_talos_tp5o
148	./Windows_XP_32-bit_try_talos_tp5o
136	./Ubuntu_HW_12.04_try_talos_tp5o
128	./Windows_XP_32-bit_try_talos_chromez
68	./Windows_7_32-bit_try_talos_xperf
P.S. The sizes are in kB. The total is 39MB
Comment on attachment 8399644 [details] [diff] [review]
talos_profiler7

Review of attachment 8399644 [details] [diff] [review]:
-----------------------------------------------------------------

this looks good, I marked some things which are unrelated to this patch, that if we don't land this right away, it would be nice to just put these changes in a different patch so we can land it.

::: talos/test.py
@@ +73,5 @@
>      """
>      cycles = 20
>      timeout = 150
>      url = 'startup_test/startup_test.html'
> +    url_timestamp = True

patch 1

::: talos/ttest.py
@@ +391,5 @@
>  
> +                # record when we started the test
> +                url = test_config['url']
> +                if 'url_timestamp' in test_config and test_config['url_timestamp']:
> +                    url += "?begin={0}".format(int(time.time()*1000))

patch 1

@@ +397,5 @@
>                  command_args = utils.GenerateBrowserCommandLine(browser_config["browser_path"], 
>                                                                  browser_config["extra_args"], 
>                                                                  browser_config["deviceroot"],
>                                                                  profile_dir, 
> +                                                                url)

patch 1

@@ +422,5 @@
>                          cmthread.start()
>  
>                      # todo: ctrl+c doesn't close the browser windows
> +                    code = browser.wait()
> +                    utils.info("Browser exited with error code: {0}".format(code))

patch 1
Attachment #8399644 - Flags: review?(jmaher) → review+
In the above android pushes I initially had the wrong path to the taols.zip but after fixing that I re-ran the tests and everything is green. They shouldn't be affected because we don't turn on the profiler for mobile tests.

Please take a look at the results yourself, but they look normal to me. I think we can commit this. I made a summary of the try pushes to look at:

Reference with old talos: https://tbpl.mozilla.org/?tree=Try&rev=677f3dced389

New talos:

Desktop with profiling on: https://tbpl.mozilla.org/?tree=Try&rev=bd262f3b027a
Android with profiling on: https://tbpl.mozilla.org/?tree=Try&rev=ae5c2f1674bd

Desktop with profiling off: https://tbpl.mozilla.org/?tree=Try&rev=aeed574ddb5c
Android with profiling off: https://tbpl.mozilla.org/?tree=Try&rev=53ba3fbf2281
Flags: needinfo?(jmaher)
I am concerned mostly about:
linux (32/64) - tart and tsvgx

you can see the rest of the numbers.
it appears your base for the try pushes are from a few days earlier (prior to march 29th), that would help explain some of the android numbers I see.

Otherwise, I would like to see why tart and tsvgx on both linux platforms have problems with just the patch and no sps profiling turned on.  Everything else isn't concerning!
Flags: needinfo?(jmaher)
10% overhead on 64-bit platform is very reasonable. I suspect the win32 overhead is much higher. bug 978585 might reduce the overhead a bit but I would still expect that enable-profiling overhead would be at least ~30%.

This kind of overhead isn't great but it means that performance problem should still reproduce. Keep in mind that this is probably much better then instrumenting overhead with approaches like gprof.
Flags: needinfo?(bgirard)
My understanding is that we see a regression on the tests by just landing the profiler code without using it, that was my understanding from comment 35.
Attached file comparison.txt (obsolete) —
I rebased and re-ran the talos tests: https://tbpl.mozilla.org/?tree=Try&rev=6d01ee263ba0
Flags: needinfo?(jmaher)
the win7 ts, paint is higher- on inbound we have ~900ms times and with this patch we have ~970ms times.  Everything else looks good.
Flags: needinfo?(jmaher)
I investigated this issue on my VM and it looks like the performance decrease is caused by loading Profiler.js on every page. We probably shouldn't be testing external javascript loading performance in each of these tests, but I don't see an easy way to get rid of the issue. I have 2 ideas: duplicate the profiler related code as inline in every test or add a preprocessor that inserts it into pages as a part of the html before every test. A preprocessor is a substantial change to the way talos works and duplicating profiler-related code in every test can be a maintainability issue. Joel, do you have any other ideas? Can we just accept that there will be a decrease in performance on every test that's probably proportional to the number of page loads?
Flags: needinfo?(jmaher)
as this is making ts,paint worse is there anyway to hack that?  For example, could we pass in a url param (profiling=1) and if we have profiling enabled, we could dynamically load the profiling script?

Otherwise, taking a single one time hit is not that big of a deal- we can document it and move forward.
Flags: needinfo?(jmaher)
Well, this affects all tests. I think ts_paint on Windows just happens to be the most sensitive one to this change. If we were to do something about this we should do it for all tests or just not at all.

Yeah, the dynamic loading can work, but it's not perfect.

1. We would still have to have the async hack in every file.

2. It would make all calls to the profiler look like:

if (typeof Profiler != 'undefined') Profiler.pause();

3. Deferred loading of javascript can be done in many ways and is a bit hacky as far as I know and while that's okay for websites it may have different effects on different tests. The synchronous loading might have a more predictable behaviour. Disclaimer: I don't know if this is really the case.

I have limited time until the end of my internship so I would prefer not to make any more big changes like this and just get this patch in so that we can be sure that we at least have the base functionality in before I leave.

If you think any of these concerns are reasonable, I would prefer it if we just got it committed as it is right now. Where would we have to document the increase in the numbers and who needs to be made aware of it?
Flags: needinfo?(jmaher)
My suggestion was to dynamically load in the tspaint file only, the rest of the results are pretty much the same.
Flags: needinfo?(jmaher)
Attached patch talos_profiler8Splinter Review
It turned out that I was seeing patterns where there were none. I was able to determine with very high certainty that loading the extra javascript has a negligible performance impact in the order of less than 10ms. The thing that was causing the regressions was the addition of the timestamp to the URL that you recommended that I remove in comment 32. I still don't know how that changes anything, but I've proved to myself that it does, so I'm taking it out. I thought that nothing other than ts should be using it, but apparently it affects ts_paint.

https://tbpl.mozilla.org/?tree=Try&rev=c7e4007f6e2a

This is a push with profiling off. OSX 10.8 is not done yet, but the comparison so far shows very little difference.
Attachment #8399644 - Attachment is obsolete: true
Attached file comparison.txt
Attachment #8404111 - Attachment is obsolete: true
Attachment #8410557 - Flags: review?(jmaher)
Comment on attachment 8410557 [details] [diff] [review]
talos_profiler8

Review of attachment 8410557 [details] [diff] [review]:
-----------------------------------------------------------------

this looks great and so does the try run!
Attachment #8410557 - Flags: review?(jmaher) → review+
I assume we are ready to land this?  Do we have documentation (wiki, blogs) for this?
We will blog and announce this to dev.platform once all the pieces are landed.

I was just thinking that likely the trychooser page caption should include a link to a wiki page where we can keep up to date information about the profiling option can cover FAQ like:
1) On which platform does it work/not work
2) Best usage practice (do a push without to get baseline number, do a push with to get the profiles but focus on the numbers from the first push)
3) How to get the profile package from the zip
etc...
that sounds good.  I suspect over time the documentation will change based on questions and usage.  

Go ahead and land this on the talos repo!  Thanks for making this work:)
This never landed on the talos repo- should I go ahead and do that?
Yes please. I'm going to help finish this feature if we hit any problems.
http://hg.mozilla.org/build/talos/rev/b4907f0b27d3.

Any other items we should do before resolving this as closed?
I think we're good. If there's any issues lets move them to a different bug.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Duplicate of this bug: 870657
Talos changes are now in production. Pushing a run to test:

https://tbpl.mozilla.org/?tree=Try&rev=416f995054d2
The TART addon is now broken when running outside of talos. I get this: at the error console when trying to use the TART addon:

ReferenceError: dumpLog is not defined Profiler.js:13
this._cps2 is undefined browser.js:1700
TypeError: Profiler is undefined tart.js:190
TypeError: detector is undefined tart.js:293
Also, tscroll[x] pages are now also broken in a similar way when loaded in a normal browser:

file:///E:/dev/mozilla/src/talos/talos/page_load_test/scroll/scroll-test.js
ReferenceError: dumpLog is not defined Profiler.js:13
file:///E:/dev/mozilla/src/talos/talos/scripts/talos-debug.js?dummy=1399425102048
TypeError: Profiler is undefined scroll-test.js:97

And I noticed that tsvgx is not broken in a normal browser, but then again the tsvgx pages don't have the profiler hooks. Is that intentional?

I didn't try other tests except for tart, tscroll and tsvgx so far.
Depends on: 1010473
Depends on: 1016089
Depends on: 1016675
I'm not sure I fully understand when is the profiler paused/resumed with this patch, and also not what's the optimal/preferred strategy which this patch tries to follow.

I'm guessing that the optimal strategy is to profile as little as possible data which doesn't affect the reported result, and as much as possible data which does affect the reported result.

Here's what I can see at the patch when the profiler is enabled:

- "Turn on the profiler on startup and write its output to a temp file"

- Some (but not all) of the pageload tests which report their own measurements (i.e. not a "plain" page load measurement - like some pages of of tscroll, tsvgx, *ART, etc) resume and pause the profiler around the section[s] which they measure.

- The pageloader addon (which loads all the pageload tests - including tp5 for instance) adds a marker before it starts loading the tested page URL.

- Some of the startup tests only use pause (startup_test.html, tspaint_test.html), while other startup tests use resume and pause (tpaint.html, tresize-test.html). This may OK for some or for all, but I'm unfamiliar with what all these tests measure.


So, assuming the profiler is indeed started and resumed on startup, and that there's no postprocessing according to markers (which I couldn't detect at the patch), the above should result in the following deviations from the preferred strategy (as I guessed it):

- All the plain pageload tests (like tp5) collect profile from browser startup until the entire run is over, instead of collecting profile only when pages are loaded. It should be easy for the pageloader addon to do this, but it doesn't.

- The tests which resume and then pause themselves, also collect data from browser startup because the first resume is no-op if the profiler is already resumed from startup.

- Possibly some incorrect resume/pause points (e.g. for TART it also profiles "warmup" runs which aren't measured - bug 1016675), or missing resume/pause (e.g. tsvgx has none but it should. No bug on this yet).


Since it looks like the main big issues should be relatively easy to fix (or at least to improve), I'm wondering if I understood the strategy correctly.

If I did understand it correctly, then we have some work to do here. Otherwise, I'd appreciate some clarification on why it does what it does.
Flags: needinfo?(bgirard)
Also, I noticed that the default sps intervals are different for different tests, and that the buffer size is also different. Are we intentionally lowering our resolution due to storage reasons?

FWIW, I think we shouldn't concern ourselves with storage just yet. If it becomes an issue, then maybe, but unless we _know_ if will be an issue or actually see one, I think we should default to maximum resolution and as much storage as it needs for each test.

I'm guessing that the output could be smaller than the buffer size if the test didn't fill the whole buffer up, right?

Also, at least for tp5, if we limit the profile data to only during page loads, I'm guessing it could result in a significant reduction in size, as well as for other tests which profile more than they should.
I recall the reason was either stability or reducing interference.  If it is storage related, that would be a good discussion to revisit.
(In reply to Avi Halachmi (:avih) from comment #61)
> Also, I noticed that the default sps intervals are different for different
> tests, and that the buffer size is also different. Are we intentionally
> lowering our resolution due to storage reasons?

Yes, we wanted to keep each profile to a few MBs each max and we wanted to make sure the whole test fit into the profile.

If we spent more time paused then we certainly could increase the interval.
Flags: needinfo?(bgirard)
well increase the rate, lower the interval.
(In reply to Avi Halachmi (:avih) from comment #60)
> I'm not sure I fully understand when is the profiler paused/resumed with
> this patch, and also not what's the optimal/preferred strategy which this
> patch tries to follow.

Pausing and resuming is just meant to discard samples that are obviously not useful. This isn't a hard requirement for us to do or get right since cleopatra UI is fairly good at filtering if you know what you're looking for. But if we have time we should adjust the mistake that we find.

> 
> I'm guessing that the optimal strategy is to profile as little as possible
> data which doesn't affect the reported result, and as much as possible data
> which does affect the reported result.
> 
> Here's what I can see at the patch when the profiler is enabled:
> 
> - "Turn on the profiler on startup and write its output to a temp file"

This should only be done if the startup is measuring data about the browser on startup. Otherwise we can start the profiler after when we start the test itself if it's not a startup test. Starting the profiler on startup is more cumbersome because some settings can't be passed via environment variables I believe (maybe that was fixed).

The test should save when the test is completed, or on shutdown if we measure shutdown.

Once the file is saved it should be uploaded and a URL will be put in the Talos log.

> 
> - Some (but not all) of the pageload tests which report their own
> measurements (i.e. not a "plain" page load measurement - like some pages of
> of tscroll, tsvgx, *ART, etc) resume and pause the profiler around the
> section[s] which they measure.

That's correct. If a test doesn't do it properly and we have the time to fix it, we should.

> 
> - The pageloader addon (which loads all the pageload tests - including tp5
> for instance) adds a marker before it starts loading the tested page URL.

It useful to add a marker that has the Page URL so we can easily tell from the profile what page of tp5 we are in and which page is being loaded.

> 
> - Some of the startup tests only use pause (startup_test.html,
> tspaint_test.html), while other startup tests use resume and pause
> (tpaint.html, tresize-test.html). This may OK for some or for all, but I'm
> unfamiliar with what all these tests measure.
> 
> 
> So, assuming the profiler is indeed started and resumed on startup, and that
> there's no postprocessing according to markers (which I couldn't detect at
> the patch), the above should result in the following deviations from the
> preferred strategy (as I guessed it):

The was post processing with mstange' stuff but we're trying to have Talos do it automatically instead of requiring the user to have to manually postprocess the scripts.

> 
> - All the plain pageload tests (like tp5) collect profile from browser
> startup until the entire run is over, instead of collecting profile only
> when pages are loaded. It should be easy for the pageloader addon to do
> this, but it doesn't.
> 
> - The tests which resume and then pause themselves, also collect data from
> browser startup because the first resume is no-op if the profiler is already
> resumed from startup.

If there's a redundant resume I think its just ignored.

> 
> - Possibly some incorrect resume/pause points (e.g. for TART it also
> profiles "warmup" runs which aren't measured - bug 1016675), or missing
> resume/pause (e.g. tsvgx has none but it should. No bug on this yet).

Missing resume/pause points is not a big deal since the user can do more work and ignore the samples when looking at a profile. But not recording samples during the important port of the test is a big bug and should really be filed and addressed.

> 
> 
> Since it looks like the main big issues should be relatively easy to fix (or
> at least to improve), I'm wondering if I understood the strategy correctly.
> 
> If I did understand it correctly, then we have some work to do here.
> Otherwise, I'd appreciate some clarification on why it does what it does.

Your understanding is correct. If we can spare the time to fix the pause/resume points then great. But I would give priority to the issues that are preventing people from diagnosing performance regressions with Talos over adjust missing pause/resume points.
Thanks for the detailed reply.

(In reply to Benoit Girard (:BenWa) from comment #65)
> Pausing and resuming is just meant to discard samples that are obviously not
> useful. This isn't a hard requirement for us to do or get right since
> cleopatra UI is fairly good at filtering if you know what you're looking
> for.

Ah! this is one point I haven't considered. I thought that in order to compare profiles effectively, the profiles would need to be as accurate as possible and only contain data which is relevant to the results/measurements.

This could have been done by more accurate resume/pause, or some postprocessing as well.

I didn't consider the option that cleopatra itself will filter the data effectively for this task.

> But if we have time we should adjust the mistake that we find.

If the profiles already have enough data for cleopatra to filter them as required and use them to their fullest potential (which it seems it does), then improving pause/resumes wouldn't help anything other than conserve some disk space, and may introduce bugs.

Since I'm not concerned with disk space and there's no other apparent advantage otherwise, it would probably be a waste of time to work on this.

Thanks again.
Just to be clear, when I say cleopatra does the filtering I mean the user needs to go through and understand what data is relevant or not so it does add noise that does the user needs to do with.

If we have things like GC/CC running between tests then filtering that out is useful because the user doesn't have to deal with the noise. But the primary concern is that the profiles contains the data we need (signal) then we can worry about removing the useless samples (noise).
You need to log in before you can comment on or make changes to this bug.