Closed Bug 770317 Opened 13 years ago Closed 12 years ago

create a xperf-based mainthread io tracker

Categories

(Testing :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED
mozilla17

People

(Reporter: taras.mozilla, Assigned: jmaher)

References

Details

(Whiteboard: [Snappy:P1])

Attachments

(9 files, 7 obsolete files)

2.44 KB, text/plain
Details
5.07 KB, patch
Details | Diff | Splinter Review
3.03 KB, text/plain
Details
21.48 KB, patch
jmaher
: review+
Details | Diff | Splinter Review
870 bytes, patch
taras.mozilla
: review+
Details | Diff | Splinter Review
2.19 KB, patch
k0scist
: review+
Details | Diff | Splinter Review
8.47 KB, patch
k0scist
: review+
Details | Diff | Splinter Review
1.23 KB, text/plain
Details
4.05 KB, patch
k0scist
: review+
Details | Diff | Splinter Review
We caught a number of surprisingly silly occurrences of main-thread IO in Firefox. These result in unbounded pauses during Firefox execution. We need to eliminate them. We currently have some xperf support in Firefox. I would like to make use of and extend that xperf support in order to output the following:
* number of disk IOPs
* number of network IOPs
Those should be broken up by main/non-main thread and also segmented into [startup | normal use | shutdown]. Once someone on the perf team figures this out, we'll hand it off to someone in Automation to integrate with graph server.
OS: Mac OS X → Windows XP
There are several ways to do this. I believe that we should first try the following:
- from code, fire an xperf probe at the end of startup (done, as per bug 696033);
- from code, fire an xperf probe at the start of shutdown (which event should we use?);
- launch Firefox monitored by |xperf -on FileIO+Network+"Mozilla Generic Provider"|;
- stop xperf, parse results.
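(Editorial sketch, not the eventual talos implementation: the capture sequence above, driven from Python. The kernel flags are the ones the later xperf.config in this bug uses; paths and filenames are placeholders.)

import subprocess

def trace_browser(firefox_cmd, xperf="xperf.exe", etl="iotracker.etl", txt="iotracker.txt"):
    # Start a kernel trace with the file-I/O flags used later in this bug.
    subprocess.check_call([xperf, "-on", "PROC_THREAD+LOADER+FILE_IO+FILE_IO_INIT"])
    try:
        # Run the browser to completion while the trace is active.
        subprocess.check_call(firefox_cmd)
    finally:
        # Stop the kernel session and merge the trace into an .etl file.
        subprocess.check_call([xperf, "-d", etl])
    # Dump the binary trace to text so a parser can aggregate the I/O events.
    subprocess.check_call([xperf, "-i", etl, "-o", txt, "-a", "dumper"])

# e.g. trace_browser(["firefox.exe", "-no-remote", "-p", "dev"])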
Assignee: nobody → necheverria
Attached file xperf File IO tracker
Python script that calls xperf with the FileIO and "Mozilla Generic Provider" providers, parses the dump file, and aggregates I/O operations by main/non-main thread and startup/normal/shutdown stage. Example:

python iotracker.py "C:/firefox-src/mozilla-central/obj-dbg/dist/bin/firefox.exe -no-remote -p dev"

{'non-main': {'startup': defaultdict(<type 'int'>, {'read_bytes': 5313621, 'read_ops': 176}),
              'normal': defaultdict(<type 'int'>, {'write_bytes': 15764688, 'read_bytes': 31482145, 'read_ops': 1057, 'write_ops': 770})},
 'main': {'startup': defaultdict(<type 'int'>, {'write_bytes': 765, 'read_bytes': 136616759, 'read_ops': 2516, 'write_ops': 1}),
          'shutdown': defaultdict(<type 'int'>, {'write_bytes': 28356, 'read_bytes': 10446, 'read_ops': 265, 'write_ops': 5}),
          'normal': defaultdict(<type 'int'>, {'write_bytes': 2550862, 'read_bytes': 2540804, 'read_ops': 250, 'write_ops': 424})}}
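(Editorial note: the nested shape of that output — thread, then stage, then counter — boils down to a couple of defaultdicts. A sketch of the aggregation idea, not the attached script itself.)

from collections import defaultdict

# stats[thread][stage][counter] accumulates bytes and operation counts,
# mirroring the dump above.
stats = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

def record(is_main_thread, stage, op, nbytes):
    thread = "main" if is_main_thread else "non-main"
    stats[thread][stage]["%s_bytes" % op] += nbytes
    stats[thread][stage]["%s_ops" % op] += 1

record(True, "startup", "read", 4096)  # one 4 KB main-thread read during startup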
Is this something we can integrate into telemetry or the slowstartup addon?
(In reply to Marco Castelluccio from comment #3) I'm not sure about integrating it with the slowstartup addon, as Event Tracing for Windows requires administrative privileges. It also has a non-negligible performance overhead when it's enabled (events from all processes are received and must be filtered).
For reference, we currently run xperf+firefox via talos. Here are the launching and parsing scripts: http://hg.mozilla.org/build/talos/file/807c94aef671/talos/xtalos

If we can boil this down to one or two numbers, we can get those added to graph server and report up there. Questions:
* Will we have to install a .mof or other files on the windows machines?
** If so, can this be automated? We have hundreds of windows 7 clients.
* Are we only targeting windows 7, or other flavors of windows as well?
Attachment #642025 - Flags: feedback?(vdjeric)
Attached file xperf File and Network IO tracker (obsolete) —
Attachment #643106 - Flags: feedback?(vdjeric)
(In reply to Joel Maher (:jmaher) from comment #5) Yes, we'll need to install the mof file that includes the XPCOM shutdown event.
Is there a way to do this installation automatically? If not, this is a manual process that will need to happen on hundreds of machines. Also, do you have a link to the .mof file?
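(Editorial aside: registering a MOF with WMI is normally a mofcomp one-liner, so it should be scriptable across the slaves. The sketch below assumes mofcomp.exe is on PATH and uses a placeholder path; it is not a vetted releng script.)

import subprocess

def install_mof(mof_path=r"c:\talos\mozprofilerprobe.mof"):
    # mofcomp compiles and registers the MOF with the local WMI repository;
    # it needs to run with administrative privileges.
    subprocess.check_call(["mofcomp", mof_path])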
Comment on attachment 642025 [details] [diff] [review] Add xperf probe for XPCOM shutdown event Looks good, remember to have a module owner review it next so we can land the patch :)
Attachment #642025 - Flags: feedback?(vdjeric) → feedback+
Comment on attachment 643106 [details]
xperf File and Network IO tracker

In your runp() function, you can just do something like subprocess.Popen(..., stdout=sys.stdout, ...) instead of redirecting the subprocess's stdout to your script and manually "print"-ing it.

Just a nitpick, but could you also change the indent to 2 spaces instead of 4 and use a bit more whitespace in the code, e.g. put "continue" and "break" statements on separate lines from the condition checks?
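(i.e. something along these lines — runp and its argument handling here are placeholders, not the attached script verbatim:)

import subprocess
import sys

def runp(cmd):
    # Let the child process write straight to our stdout instead of piping
    # its output back and re-printing it line by line.
    proc = subprocess.Popen(cmd, stdout=sys.stdout, stderr=subprocess.STDOUT)
    return proc.wait()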
Attachment #643106 - Flags: feedback?(vdjeric) → feedback+
Attachment #643106 - Attachment mime type: text/x-python → text/plain
Attachment #642025 - Attachment is obsolete: true
Attachment #644076 - Flags: review?(mh+mozilla)
Comment on attachment 644076 [details] [diff] [review] Add xperf probe for XPCOM shutdown event Review of attachment 644076 [details] [diff] [review]: ----------------------------------------------------------------- ::: toolkit/components/startup/mozprofilerprobe.mof @@ +21,5 @@ > { > +}; > + > +[dynamic: ToInstance, Description("Mozilla Event: XPCOM Shutdown."): Amended, > + Guid("{26d1e091-0ae7-4f49-a554-4214445c505c}"), Please uppercase the guid, like the others above. ::: toolkit/components/startup/nsAppStartup.cpp @@ +92,5 @@ > { 0xA7, 0x60, 0x8D, 0x49, 0x02, 0x77, 0x48, 0xAE} } > +// {26d1e091-0ae7-4f49-a554-4214445c505c} > +#define NS_XPCOM_SHUTDOWN_EVENT_CID \ > + { 0x26d1e091, 0x0ae7, 0x4f49, \ > + { 0xa5, 0x54, 0x42, 0x14, 0x44, 0x5c, 0x50, 0x5c} } Likewise.
Attachment #644076 - Flags: review?(mh+mozilla) → review+
Changed GUIDs to uppercase.
Attachment #644076 - Attachment is obsolete: true
Attachment #644370 - Flags: review?(vdjeric)
OS: Windows XP → All
Hardware: x86 → All
Attached patch IO tracker + talos (obsolete) — Splinter Review
I have made changes to the way xperf is launched in talos to allow for both user-mode logging and kernel logging, as needed for the IO tracker.
- The PID of the firefox instance is now also passed to etlparser.py.
- Replaced many os.system calls with subprocess.call, since commands containing two sets of double quotes would otherwise fail for some reason.

jmaher will do further work to integrate the IO tracker with talos, based on the "xperf File and Network IO tracker" python script.
Attachment #644497 - Flags: feedback?(jmaher)
Improvements based on vladan's comments.
Attachment #643106 - Attachment is obsolete: true
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla17
Comment on attachment 644497 [details] [diff] [review]
IO tracker + talos

Review of attachment 644497 [details] [diff] [review]:
-----------------------------------------------------------------

just a couple of questions.

::: talos/bcontroller.py
@@ +133,5 @@
> '-o', csvname,
> '-p', self.process,
> '-e', etlname,
> + '-c', self.configFile,
> + '--pid', str(proc.pid)]

do we use --pid anywhere? I don't see it used in etlparser.py

::: talos/xperf.config
@@ +7,5 @@
> # base test overrides
> basetest:
> tpcycles: 1
> xperf_providers : ['PROC_THREAD', 'LOADER', 'HARD_FAULTS', 'FILENAME', 'FILE_IO', 'FILE_IO_INIT']
> + xperf_user_providers: ['"Mozilla Generic Provider"', 'Microsoft-Windows-TCPIP']

I assume we can work with the " properly.
Attachment #644497 - Flags: feedback?(jmaher) → feedback+
Yes, we should probably always add quotes to each of the providers when building the command line string, instead of putting the quotes in the name of the provider itself when it has spaces. options.pid is not used anywhere in etlparser.py, but can be used when implementing network/per-thread IO parsing.
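(A sketch of that quoting suggestion, not the landed start_xperf.py code:)

def join_providers(providers):
    # Quote every provider name when building the command line, so names with
    # spaces ("Mozilla Generic Provider") survive without pre-quoting them in
    # the config file.
    return "+".join(['"%s"' % p for p in providers])

# join_providers(['Mozilla Generic Provider', 'Microsoft-Windows-TCPIP'])
# -> '"Mozilla Generic Provider"+"Microsoft-Windows-TCPIP"'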
Lets not work in a closed bug until the work is done :)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Attached patch IO tracker + talos (obsolete) — Splinter Review
Integrated with etlparser; it now dumps an etl_output_thread_stats.csv file with aggregated counters.
Attachment #644497 - Attachment is obsolete: true
Attachment #645491 - Flags: review?(jmaher)
Comment on attachment 645491 [details] [diff] [review]
IO tracker + talos

- Log entries are no longer pre-filtered by the Firefox process, since incoming network events may occur on any other process. In that case, it's the connection ID (etw:ActivityId) that matters.
- Other events are filtered by the Firefox PID instead of the process name.
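(Editorial sketch of the filtering rule described above; the field names are illustrative, not the exact columns etlparser sees.)

def keep_event(event, firefox_pid, firefox_activity_ids):
    # Network events can arrive on any process, so match them by connection
    # (etw:ActivityId); everything else is matched against the Firefox PID.
    if event["provider"].startswith("Microsoft-Windows-TCPIP"):
        return event["activity_id"] in firefox_activity_ids
    return event["pid"] == firefox_pid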
Comment on attachment 645491 [details] [diff] [review]
IO tracker + talos

Review of attachment 645491 [details] [diff] [review]:
-----------------------------------------------------------------

a couple nits/questions below, but otherwise this is great. I ran this locally and I ended up with:

> more etl_output_thread_stats.csv
thread, stage, counter, value
main, normal, file_read_bytes, 1188312
main, normal, file_read_ops, 57
main, normal, file_write_bytes, 4143423
main, normal, file_write_ops, 206
main, normal, net_recv_bytes, 26
main, normal, net_send_bytes, 26
main, startup, file_read_bytes, 830415
main, startup, file_read_ops, 85
main, startup, file_write_bytes, 735
main, startup, file_write_ops, 18
main, startup, net_recv_bytes, 4
main, startup, net_send_bytes, 5
non-main, normal, file_read_bytes, 1037622
non-main, normal, file_read_ops, 41
non-main, normal, file_write_bytes, 785162
non-main, normal, file_write_ops, 58
non-main, normal, net_recv_bytes, 2527
non-main, normal, net_send_bytes, 2532
non-main, startup, file_read_bytes, 169236
non-main, startup, file_read_ops, 10
non-main, startup, file_write_bytes, 49704
non-main, startup, file_write_ops, 4

What do we want to report to the graph server? I think this is too many variables to report (22 total). Taras mentioned running this against tp5; is that the only test we want to run this against for now? Ideally I would like to get this down to 6 or 8 counters that we report to the graph server. I guess it is not a big deal if we need to do them all, it will just make displaying and reporting these more difficult.

::: talos/xtalos/start_xperf.py
@@ +66,5 @@
> (options.xperf_path,
> '+'.join(options.xperf_providers),
> '+'.join(options.xperf_stackwalk),
> + options.etl_filename,
> + '"+"'.join(options.xperf_user_providers),

nice call on the quotes here.

::: talos/xtalos/stop_xperf.py
@@ +48,5 @@
> if options == None:
> print "Unable to verify options"
> sys.exit(1)
> 
> + xperf_cmd = '"%s" -stop -stop talos_ses' % (options.xperf_path)

do we need two '-stop' commands here?

::: talos/xtalos/xtalos.py
@@ +73,5 @@
> 
> self.add_option("-d", "--debug",
> type="int", dest = "debug_level",
> help = "debug level for output from tool (0-5, 5 being everything), defaults to 1")
> + defaults["debug_level"] = 5

I am not sure we want to make this level=5 by default.
Attachment #645491 - Flags: review?(jmaher) → review+
Here are the results from tp5n (1 cycle, not 25 cycles):

(talos) c:\mozilla\talos\talos>type etl_output_thread_stats.csv
thread, stage, counter, value
main, normal, file_read_bytes, 61656933
main, normal, file_read_ops, 2759
main, normal, file_write_bytes, 5690149
main, normal, file_write_ops, 769
main, normal, net_recv_bytes, 2076257
main, normal, net_send_bytes, 1689701
main, startup, file_read_bytes, 1303547
main, startup, file_read_ops, 104
main, startup, file_write_bytes, 735
main, startup, file_write_ops, 18
main, startup, net_recv_bytes, 1
main, startup, net_send_bytes, 2
non-main, normal, file_read_bytes, 158177202
non-main, normal, file_read_ops, 6186
non-main, normal, file_write_bytes, 500356217
non-main, normal, file_write_ops, 44076
non-main, normal, net_recv_bytes, 68757627
non-main, normal, net_send_bytes, 68953041
non-main, startup, file_read_bytes, 512320
non-main, startup, file_read_ops, 18
We could (maybe) get rid of the ops counters and merge the read/write byte counters.
- Restored defaults["debug_level"] to 1
- Added total I/O counters (e.g. net_io_bytes)
Attachment #645491 - Attachment is obsolete: true
Attachment #647008 - Flags: review?
(In reply to Joel Maher (:jmaher) from comment #24)
> Comment on attachment 645491 [details] [diff] [review]
> ::: talos/xtalos/stop_xperf.py
> @@ +48,5 @@
> > if options == None:
> > print "Unable to verify options"
> > sys.exit(1)
> > 
> > + xperf_cmd = '"%s" -stop -stop talos_ses' % (options.xperf_path)
> 
> do we need two '-stop' commands here?

Yes, two stop commands are needed to stop the two trace sessions (user and kernel). "talos_ses" is the name of the user session; the kernel session has no name.
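(In script form, the stop step amounts to this — a sketch; xperf_path comes from the talos config:)

import subprocess

def stop_xperf(xperf_path, user_session="talos_ses"):
    # The first -stop (no name) ends the kernel session; the second ends the
    # named user session that carries the user-mode providers.
    subprocess.call('"%s" -stop -stop %s' % (xperf_path, user_session))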
(In reply to Joel Maher (:jmaher) from comment #25)
> Here are the results from tp5n (1 cycle, not 25 cycles):
> 
> (talos) c:\mozilla\talos\talos>type etl_output_thread_stats.csv
> ...

We still need to install the mof file and the XPCOM shutdown event patch to get I/O during shutdown separately. Then it will look something like what I get on my machine:

thread, stage, counter, value
main, normal, file_io_bytes, 2714779
main, normal, file_read_bytes, 1122436
main, normal, file_read_ops, 54
main, normal, file_write_bytes, 1592343
main, normal, file_write_ops, 91
main, normal, net_io_bytes, 1684
main, normal, net_recv_bytes, 673
main, normal, net_send_bytes, 1011
main, shutdown, file_io_bytes, 2681987
main, shutdown, file_write_bytes, 2681987
main, shutdown, file_write_ops, 174
main, shutdown, net_io_bytes, 4
main, shutdown, net_recv_bytes, 2
main, shutdown, net_send_bytes, 2
main, startup, file_io_bytes, 4619774
main, startup, file_read_bytes, 4618891
main, startup, file_read_ops, 574
main, startup, file_write_bytes, 883
main, startup, file_write_ops, 18
main, startup, net_io_bytes, 25
main, startup, net_recv_bytes, 12
main, startup, net_send_bytes, 13
non-main, normal, file_io_bytes, 1887236
non-main, normal, file_read_bytes, 1064664
non-main, normal, file_read_ops, 39
non-main, normal, file_write_bytes, 822572
non-main, normal, file_write_ops, 60
non-main, normal, net_io_bytes, 648
non-main, normal, net_recv_bytes, 347
non-main, normal, net_send_bytes, 301
non-main, startup, file_io_bytes, 5489921
non-main, startup, file_read_bytes, 5460697
non-main, startup, file_read_ops, 219
non-main, startup, file_write_bytes, 29224
non-main, startup, file_write_ops, 7
non-main, startup, net_io_bytes, 5055
non-main, startup, net_recv_bytes, 2527
non-main, startup, net_send_bytes, 2528
So I guess we're reporting 12 counters to the graph server: (main|non-main), (startup|normal|shutdown), (file_io_bytes|net_io_bytes) Some counters may not always be present. Last time I ran it, there was no I/O in non-main threads during shutdown.
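(Spelled out, that set is just the cross product — a throwaway snippet:)

combos = [(thread, stage, counter)
          for thread in ("main", "non-main")
          for stage in ("startup", "normal", "shutdown")
          for counter in ("file_io_bytes", "net_io_bytes")]
assert len(combos) == 12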
we can go with the twelve counters. If we do, I want to restrict this to a single test, or no more than 2 tests. Currently we are looking at TP5, are there other tests which would be of use?
is it normal to not have all counters present? i.e. is there a technical reason? If this is expected and normal, then I can make sure that talos handles that scenario properly.
The script just creates the corresponding items in the array as events are parsed.
Comment on attachment 647008 [details] [diff] [review] IO tracker + talos Review of attachment 647008 [details] [diff] [review]: ----------------------------------------------------------------- Overall this is good. I understand the MOF file. For the xpcom shutdown event, do we just need to land a patch on mozilla-central and use that?
Attachment #647008 - Flags: review? → review+
(In reply to Joel Maher (:jmaher) from comment #34)
> Comment on attachment 647008 [details] [diff] [review]
> IO tracker + talos
> 
> Review of attachment 647008 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Overall this is good. I understand the MOF file. For the xpcom shutdown
> event, do we just need to land a patch on mozilla-central and use that?

Yes. The patch is already on mozilla-central: https://hg.mozilla.org/mozilla-central/rev/f0544409e034
(In reply to Joel Maher (:jmaher) from comment #31)
> we can go with the twelve counters. If we do, I want to restrict this to a
> single test, or no more than 2 tests. Currently we are looking at TP5, are
> there other tests which would be of use?

tp5 is good for now. What's our eta for deployment?
the biggest unknown here is getting the .mof file deployed. Otherwise I need to spend some time getting graph server definitions defined, we need to deploy this as a new talos.zip, and we need to update buildbot-configs. Most of the known quantities can be handled this week; the unknown could take a few weeks. If we could get this deployed and just not collect/report data from the .mof file, we could get going and then track the bug to deploy the .mof file.
Why is it hard to deploy the .mof? Is there anything we can do to speed that up? Having the data partitioned makes it a lot more useful.
we just need to deploy the .mof file on all the windows 7 machines which is a few hundred. I assume this can be scripted as my initial test seemed like it would work fine, but this needs to get scheduled in with releng which wouldn't be a top priority for them.
(In reply to Joel Maher (:jmaher) from comment #39)
> we just need to deploy the .mof file on all the windows 7 machines which is
> a few hundred. I assume this can be scripted as my initial test seemed like
> it would work fine, but this needs to get scheduled in with releng which
> wouldn't be a top priority for them.

Can you file a bug specifying what needs to be done? I'll follow up on making it a priority.
Fixed the class name in the mof file. Apparently it didn't end up making much difference, since classes are referred to by GUID.
Attachment #647248 - Flags: review?(mh+mozilla)
Depends on: 778856
Attachment #647248 - Flags: review?(mh+mozilla) → review+
Can we get it checked-in to mozilla-central?
(In reply to Nicolás Chaim from comment #42)
> Can we get it checked-in to mozilla-central?

Mark the patch as checkin-needed and point it at glandium.
Comment on attachment 647248 [details] [diff] [review] Fix class name in mof file I can't seem to get the check-in needed flag.
Attachment #647248 - Flags: review?(mh+mozilla)
Keywords: checkin-needed
Keywords: checkin-needed
Attachment #647248 - Flags: review?(vdjeric)
Attachment #647248 - Flags: review?(vdjeric)
Attachment #647248 - Flags: review?(mh+mozilla)
Depends on: 779272
Status: REOPENED → RESOLVED
Closed: 13 years ago12 years ago
Resolution: --- → FIXED
Not done with the xperf part yet. Let's stop landing m-c changes in this bug :)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to Taras Glek (:taras) from comment #48)
> Not done with the xperf part yet. Let's stop landing m-c changes in this bug :)

If it helps, you can use [leave open] in the whiteboard to stop the merge tool from closing the bug. (https://groups.google.com/d/topic/mozilla.dev.platform/DQyQ56DgmhA/discussion)
Attachment #648736 - Flags: review?(jhammel) → review+
landed graph server definitions: http://hg.mozilla.org/graphs/rev/b3935f847775
Depends on: 780681
Depends on: 781319
One problem I am having getting this patch to land on talos is that it uses python bits from modern talos, and we run python 2.4 on the windows talos machines. We have been trying for the last 2 quarters to bump the python version up, and it is much easier to rewrite our code to work with 2.4. The biggest thing is the 'from collections import defaultdict'. Is the primary purpose of the defaultdict to keep things in proper order, as in:

for cntr in sorted(io.iterkeys()):
for cntr in sorted(io.iterkeys()):

Either we can drop that requirement or work around it with some extra code; I am testing it now with no defaultdict requirement.
Assignee: nicolas → jmaher
I had to assign this bug to myself since nicolas @ mozilla.com was not a valid email in bugzilla and I couldn't submit my comment.

I have cleaned up the patch and it works in talos. I found that providing a user_provider which doesn't exist on the system will fail to produce anything. So, in the spirit of getting things going, we can temporarily remove the "Mozilla Generic Provider" from the listening configs, and this generates:

thread, stage, counter, value
main, startup, file_io_bytes, 40790612
main, startup, file_read_bytes, 38018251
main, startup, file_read_ops, 697
main, startup, file_write_bytes, 2772361
main, startup, file_write_ops, 170
main, startup, net_io_bytes, 43
main, startup, net_recv_bytes, 21
main, startup, net_send_bytes, 22

This isn't ideal, but it gets us going and allows us to integrate into talos until the windows 7 machines are all updated with the .mof file.
defaultdict(int) creates a dict that initializes members to the type's default value (zero for int) if they don't exist.

If we're using a regular dict, the equivalent would be:

key = (th, stg, "net_%s_bytes" % opType)
if key not in io:
    io[key] = 0
io[key] += bytes

We could perhaps also include some existing, external implementation for the defaultdict class, or maybe just initialize the dict with all possible keys we're using and zero values.
(In reply to Nicolás Chaim from comment #54)
> defaultdict(int) creates a dict that initializes members to the type's
> default value (zero for int) if they don't exist.
> 
> If we're using a regular dict, the equivalent would be:
> 
> key = (th, stg, "net_%s_bytes" % opType)
> if key not in io:
>     io[key] = 0
> io[key] += bytes

Please don't do this. python dicts have had setdefault forever:

key = (th, stg, "net_%s_bytes" % opType)
io.setdefault(key, 0) += bytes
(In reply to Jeff Hammel [:jhammel] from comment #55)
> (In reply to Nicolás Chaim from comment #54)
> > defaultdict(int) creates a dict that initializes members to the type's
> > default value (zero for int) if they don't exist.
> > 
> > If we're using a regular dict, the equivalent would be:
> > 
> > key = (th, stg, "net_%s_bytes" % opType)
> > if key not in io:
> >     io[key] = 0
> > io[key] += bytes
> 
> Please don't do this. python dicts have had setdefault forever:
> 
> key = (th, stg, "net_%s_bytes" % opType)
> io.setdefault(key, 0) += bytes

Oh, that's perfect. I'm not very familiar with Python.
I guess it would still need to be

io[key] = io.get(key, 0) + bytes

since io.setdefault(key, 0) will just return an immutable int.
(In reply to Nicolás Chaim from comment #57)
> I guess it would still need to be
> io[key] = io.get(key, 0) + bytes
> since io.setdefault(key, 0) will just return an immutable int

Beh, good point. :(
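(So the 2.4-friendly accumulation settled on above ends up looking roughly like this — a sketch of the pattern, not the exact landed code:)

io = {}

def accrue(thread, stage, op_type, nbytes):
    # Plain-dict equivalent of defaultdict(int); works on Python 2.4.
    key = (thread, stage, "net_%s_bytes" % op_type)
    io[key] = io.get(key, 0) + nbytes

accrue("main", "startup", "recv", 1460)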
landed the talos bits with the slight mods mentioned in the last few comments. This worked on try server without affecting current results: http://hg.mozilla.org/build/talos/rev/d3733d055e49 We still need to do a few modifications to talos to bring the xperf providers up from parsing the etl file and get them input into the graph server. I am working on that with :jhammel.
This patch does what it says it does by publishing all of the counters we collect from xperf to the graph servers. The remaining work is to resolve the config files and how this will be run from buildbot. If this patch is f+, we can start resolving the config file issues and the buildbot configs.
Attachment #655589 - Flags: feedback?(jhammel)
Comment on attachment 655589 [details] [diff] [review] make the counters report to graph server (1.0) >diff -r 6d79047595a4 talos/bcontroller.py >--- a/talos/bcontroller.py Fri Aug 24 20:02:51 2012 -0700 >+++ b/talos/bcontroller.py Mon Aug 27 09:37:21 2012 -0400 >@@ -109,11 +109,13 @@ > print "Error running etlparser: %s" % e > self.returncode = 1 > >- print "__xperf_data_begin__" >- fhandle = open(csvname, 'r') >- print fhandle.read() >+ results_file = open(self.browser_log, "a") >+ results_file.write("__xperf_data_begin__") >+ fhandle = open("etl_output_thread_stats.csv", 'r') >+ results_file.write(fhandle.read()) > fhandle.close() >- print "__xperf_data_end__" >+ results_file.write("__xperf_data_end__") >+ results_file.close() > else: #blocking call to system, non-remote device > self.returncode = os.system(self.command + " > " + self.browser_log) So this is just ensuring that we're writing to the browser log(?) That seems to be a good idea. I've also filed bug 786055 to get rid of the os.system calls and shell-level redirects >diff -r 6d79047595a4 talos/cmanager_win32.py >--- a/talos/cmanager_win32.py Fri Aug 24 20:02:51 2012 -0700 >+++ b/talos/cmanager_win32.py Mon Aug 27 09:37:21 2012 -0400 >@@ -21,6 +21,7 @@ > path = win32pdh.MakeCounterPath((None, 'process', process, > None, -1, counter)) > hq = win32pdh.OpenQuery() >+ hc = None > try: > hc = win32pdh.AddCounter(hq, path) > except: >@@ -33,8 +34,9 @@ > except: > win32pdh.CloseQuery(hq) > >- self.registeredCounters[counter] = [hq, [(hc, path)]] >- self.updateCounterPathsForChildProcesses(counter) >+ if hc: >+ self.registeredCounters[counter] = [hq, [(hc, path)]] >+ self.updateCounterPathsForChildProcesses(counter) Not sure I understand this code, but maybe I don't need to. >diff -r 6d79047595a4 talos/results.py >--- a/talos/results.py Fri Aug 24 20:02:51 2012 -0700 >+++ b/talos/results.py Mon Aug 27 09:37:21 2012 -0400 >@@ -265,6 +265,9 @@ > # regular expression for responsiveness results > RESULTS_RESPONSIVENESS_REGEX = re.compile('MOZ_EVENT_TRACE\ssample\s\d*?\s(\d*?)$', re.DOTALL|re.MULTILINE) > >+ # regular expression for xperf io data >+ XPERF_REGEX = re.compile('([a-z, \-\_]+)_io_bytes, ([0-9]+)') As you can probably guess, I'm not a huge fan of regexes like this, but I guess this is okay. Maybe a comment show example output would help. Also, isn't the xperf output formatted data anyway? Can't we just read the .csv and call it a day? > # classes for results types > classes = {'tsformat': TsResults, > 'tpformat': PageloaderResults} >@@ -391,6 +394,36 @@ > self.shutdown(global_counters) > if 'responsiveness' in global_counters: > global_counters['responsiveness'].extend(self.responsiveness()) >+ self.xperf(global_counters) >+ >+ def xperf(self, counter_results): >+ """record xperf counters in counter_results dictionary""" >+ >+ counters = ['main_startup_fileio', >+ 'main_startup_netio', >+ 'main_normal_fileio', >+ 'main_normal_netio', >+ 'main_shutdown_fileio', >+ 'main_shutdown_netio', >+ 'nonmain_startup_fileio', >+ 'nonmain_startup_netio', >+ 'nonmain_normal_fileio', >+ 'nonmain_normal_netio', >+ 'nonmain_shutdown_fileio', >+ 'nonmain_shutdown_netio'] Pretty messy but I don't know a way around this. >+ if not set(counters).intersection(counter_results.keys()): >+ # no xperf counters to accumulate >+ return >+ for line in self.results_raw.split('\n'): >+ xperfmatch = self.XPERF_REGEX.search(line) Again, I wouldn't use a regex for this. 
>+ if xperfmatch: >+ (type, value) = (xperfmatch.group(1), xperfmatch.group(2)) >+ # type will be similar to 'main, startup, file' >+ cname = ('_').join(type.split(', ')) >+ counter_name = '%sio' % cname >+ if counter_name in counter_results: >+ counter_results[counter_name].append(value) Does this need a setdefault? (not sure). > def rss(self, counter_results): > """record rss counters in counter_results dictionary""" >diff -r 6d79047595a4 talos/test.py >--- a/talos/test.py Fri Aug 24 20:02:51 2012 -0700 >+++ b/talos/test.py Mon Aug 27 09:37:21 2012 -0400 >@@ -49,7 +49,7 @@ > > class TsBase(Test): > """abstract base class for ts-style tests""" >- keys = ['url', 'url_timestamp', 'timeout', 'cycles', 'shutdown', 'profile_path', >+ keys = ['url', 'url_timestamp', 'timeout', 'cycles', 'shutdown', 'profile_path', 'xperf_counters', > 'xperf_providers', 'xperf_user_providers', 'xperf_stackwalk'] > > class ts(TsBase): >@@ -62,6 +62,7 @@ > class ts_paint(ts): > url = 'startup_test/tspaint_test.html?begin=' > shutdown = None >+ xperf_counters = ['main_startup_fileio', 'main_startup_netio', 'main_normal_fileio', 'main_normal_netio', 'main_shutdown_fileio', 'main_shutdown_netio', 'nonmain_startup_fileio', 'nonmain_startup_netio', 'nonmain_normal_fileio', 'nonmain_normal_netio', 'nonmain_shutdown_fileio', 'nonmain_shutdown_netio'] Again, messy, though i can't think of a better way of doing this. Maybe e.g. fileio implies all of '{main,nonmain}_{startup,normal,shutdown}_fileio'... > class ts_places_generated_max(ts): > profile_path = '${talos}/places_generated_max' >@@ -145,7 +146,7 @@ > filters = None > keys = ['tpmanifest', 'tpcycles', 'tppagecycles', 'tprender', 'tpchrome', 'tpmozafterpaint', > 'rss', 'resolution', 'cycles', >- 'win_counters', 'w7_counters', 'linux_counters', 'mac_counters', 'remote_counters', >+ 'win_counters', 'w7_counters', 'linux_counters', 'mac_counters', 'remote_counters', 'xperf_counters', > 'timeout', 'shutdown', 'responsiveness', 'profile_path', > 'xperf_providers', 'xperf_user_providers', 'xperf_stackwalk', 'filters' > ] >@@ -197,6 +198,7 @@ > w7_counters = ['Main_RSS', 'Content_RSS', 'Private Bytes', '% Processor Time', 'Modified Page List Bytes'] > linux_counters = ['Private Bytes', 'XRes', 'Main_RSS', 'Content_RSS'] > mac_counters = ['Private Bytes', 'Main_RSS', 'Content_RSS'] >+ xperf_counters = ['main_startup_fileio', 'main_startup_netio', 'main_normal_fileio', 'main_normal_netio', 'main_shutdown_fileio', 'main_shutdown_netio', 'nonmain_startup_fileio', 'nonmain_startup_netio', 'nonmain_normal_fileio', 'nonmain_normal_netio', 'nonmain_shutdown_fileio', 'nonmain_shutdown_netio'] > > class tp_js(PageloaderTest): > url = """'"http://localhost/page_load_test/framecycler.html?quit=1&cycles=5"'""" >diff -r 6d79047595a4 talos/ttest.py >--- a/talos/ttest.py Fri Aug 24 20:02:51 2012 -0700 >+++ b/talos/ttest.py Mon Aug 27 09:37:21 2012 -0400 >@@ -270,6 +270,10 @@ > # setup global (cross-cycle) counters: > # shutdown, responsiveness > global_counters = {} >+ if browser_config.get('xperf_path'): >+ for c in test_config.get('xperf_counters', []): >+ global_counters[c] = [] >+ > if test_config['shutdown']: > global_counters['shutdown'] = [] > if test_config.get('responsiveness') and platform.system() != "Linux": >diff -r 6d79047595a4 talos/xperf.config >--- a/talos/xperf.config Fri Aug 24 20:02:51 2012 -0700 >+++ b/talos/xperf.config Mon Aug 27 09:37:21 2012 -0400 >@@ -5,10 +5,15 @@ > talos.logfile: browser_output.txt > > # base test overrides >-# TODO: add "Mozilla Generic 
Provider" to xperf_user_providers once we deploy the mof file to test slaves > basetest: >+ cycles: 1 > tpcycles: 1 >- cycles: 1 >+ tppagecycles: 1 > xperf_providers : ['PROC_THREAD', 'LOADER', 'HARD_FAULTS', 'FILENAME', 'FILE_IO', 'FILE_IO_INIT'] >- xperf_user_providers: ['Microsoft-Windows-TCPIP'] >+ xperf_user_providers: ['Mozilla Generic Provider', 'Microsoft-Windows-TCPIP'] > xperf_stackwalk : ['FileRead', 'FileWrite', 'FileFlush'] >+ w7_counters: [] >+ win_counters: [] >+ linux_counters: [] >+ remote_counters: [] >+ mac_counters: []
Attachment #655589 - Flags: feedback?(jhammel) → feedback+
so we could assume there is a known csv (or we could make it .json) file and parse the data that way. That would remove our dependency on regexes. The idea was to keep this similar to the rest of the system, in that we output data to browser_output.txt and then parse it up after the test completes. Minor details; let's figure out the .config problem and start getting this stuff up for real reviews.
Depends on: 786308
A few small tweaks to this to make it work. Let's get this reviewed and talos deployed :) I chose to stick with all output going to browser_output.txt instead of having results.py read the .csv file. I would like to keep it as consistent as possible and put all the filename logic for .etl and .csv in one place.
Attachment #655589 - Attachment is obsolete: true
Attachment #657506 - Flags: review?(jhammel)
Comment on attachment 657506 [details] [diff] [review]
make the counters report to graph server (1.1)

This looks fine. Would you mind if I did a follow-up wrt how we parse the counters in results? (Parsing via CSV vs the regex.)

I would also like to put this on two separate lines:

NOISE: __xperf_data_end____startBeforeLaunchTimestamp1346452998207__endBeforeLaunchTimestamp

We have the following data marked by delimiters:

NOISE: __xperf_data_begin__thread, stage, counter, value
NOISE: main, normal, file_io_bytes, 135479381
NOISE: main, normal, file_read_bytes, 129100860
NOISE: main, normal, file_read_ops, 4280
NOISE: main, normal, file_write_bytes, 6378521
NOISE: main, normal, file_write_ops, 558
NOISE: main, normal, net_io_bytes, 11938485
NOISE: main, normal, net_recv_bytes, 1429236
NOISE: main, normal, net_send_bytes, 10509249
NOISE: main, shutdown, file_io_bytes, 26375
NOISE: main, shutdown, file_read_bytes, 7373
NOISE: main, shutdown, file_read_ops, 151
NOISE: main, shutdown, file_write_bytes, 19002
NOISE: main, shutdown, file_write_ops, 3
NOISE: main, startup, file_io_bytes, 24399710
NOISE: main, startup, file_read_bytes, 24398866
NOISE: main, startup, file_read_ops, 420
NOISE: main, startup, file_write_bytes, 844
NOISE: main, startup, file_write_ops, 18
NOISE: main, startup, net_io_bytes, 2
NOISE: main, startup, net_recv_bytes, 1
NOISE: main, startup, net_send_bytes, 1
NOISE: non-main, normal, file_io_bytes, 471162110
NOISE: non-main, normal, file_read_bytes, 96544227
NOISE: non-main, normal, file_read_ops, 3854
NOISE: non-main, normal, file_write_bytes, 374617883
NOISE: non-main, normal, file_write_ops, 37368
NOISE: non-main, normal, net_io_bytes, 150199046
NOISE: non-main, normal, net_recv_bytes, 84117265
NOISE: non-main, normal, net_send_bytes, 66081781
NOISE: non-main, startup, file_io_bytes, 354464
NOISE: non-main, startup, file_read_bytes, 354464
NOISE: non-main, startup, file_read_ops, 12
NOISE: __xperf_data_end__

This is just CSV. It should be pretty trivial to parse with the csv module. I'm also inclined to put the list of all counters in e.g. xtalos or somewhere they can be globally accessed. This way we can check the validity of what is passed in.
Attachment #657506 - Flags: review?(jhammel) → review+
Attached file example CSV
Attached file example reading CSV (obsolete) —
this is an example of reading the example output from CSV. I don't look for the delimiters, __xperf_data_{start,end}__ (which we currently don't use at all), which still needs to be done, but these can easily be found a la the other delimiters in results (e.g. http://hg.mozilla.org/build/talos/file/6d1c5ee867e6/talos/results.py#l330).
Attachment #657537 - Flags: feedback?(jmaher)
(it would also be nice to have a test for this parsing)
Comment on attachment 657537 [details]
example reading CSV

>import csv
>
>contents = file('xperf.csv').read()

we should do:

if not os.path.exists(browser_config['xperf_csvname']):
    return
contents = file(browser_config['xperf_csvname']).read()

>lines = contents.splitlines()
>reader = csv.reader(lines)
>header = None
>for row in reader:
>
>    # read CSV
>    row = [i.strip() for i in row]
>    if not header:
>        header = row

this assumes the first row is the header; we could look for __xperf_start or something

>        continue
>    values = dict(zip(header, row))
>
>    # format for talos
>    thread = values['thread']
>    if thread == 'non-main':
>        thread = 'nonmain'

come to think of it, we can fix this hack in how we generate the csv in the first place

>    counter = values['counter'].rsplit('_io_bytes', 1)[0]
>    counter_name = '%s_%s_%sio' % (thread, values['stage'], counter)
>    value = float(values['value'])
>
>    # accrue counter
>    if counter_name in counter_results:
>        counter_results[counter_name].append(value)

I would initialize the counter first. Also make sure we only assign counters that are legit:

if counter_name in xtalos.known_counters:
    counter_results.setdefault(counter_name, []).append(value)
Attachment #657537 - Flags: feedback?(jmaher) → feedback+
landed the results->graphserver patch in talos repo: http://hg.mozilla.org/build/talos/rev/a632249f5fe9

still left to do:
- sanity check try server run
- deploy talos.zip
- create patch for buildbot, get r+ and a reconfig
(In reply to Joel Maher (:jmaher) from comment #68)
> Comment on attachment 657537 [details]
> example reading CSV
> 
> >import csv
> >
> >contents = file('xperf.csv').read()
> we should do:
> if not os.path.exists(browser_config['xperf_csvname']):
>     return
> contents = file(browser_config['xperf_csvname']).read()

Sure, this was just an example for local data pasted from your try run. In real life:
- we already have the browser_log in memory: http://hg.mozilla.org/build/talos/file/6d1c5ee867e6/talos/results.py#l284
- we would look for __xperf_start__ and __xperf_end__ like we do the other delimiters (e.g. line 330) and take the content between these delimiters
- these become the contents, vs. the xperf.csv file I read for testing

> >lines = contents.splitlines()
> >reader = csv.reader(lines)
> >header = None
> >for row in reader:
> >
> >    # read CSV
> >    row = [i.strip() for i in row]
> >    if not header:
> >        header = row
> this assumes the first row is the header; we could look for __xperf_start or
> something

Currently, we print the header row right after __xperf_start__.

> >        continue
> >    values = dict(zip(header, row))
> >
> >    # format for talos
> >    thread = values['thread']
> >    if thread == 'non-main':
> >        thread = 'nonmain'
> come to think of it, we can fix this hack in how we generate the csv in the
> first place

Even better :)

> >    counter = values['counter'].rsplit('_io_bytes', 1)[0]
> >    counter_name = '%s_%s_%sio' % (thread, values['stage'], counter)
> >    value = float(values['value'])
> >
> >    # accrue counter
> >    if counter_name in counter_results:
> >        counter_results[counter_name].append(value)
> I would initialize the counter first. Also make sure we only assign
> counters that are legit:
> if counter_name in xtalos.known_counters:
>     counter_results.setdefault(counter_name, []).append(value)

Yep, even better :)

In any case, I'm fine doing this as a follow-up if you want to check in your patch. I mostly wrote this as a proof of concept.
I would rather do this as a follow-up so we can get the xperf tests live as fast as possible. Although with the long weekend, it might be a couple of days before we really do get the tests live.
sounds good to me
this incorporates the suggestions :jhammel had for csv parsing of xperf.csv. No more hacking through browser_output.txt! There was a bug in the regular tp5 runs where we were trying to access counters which didn't exist. This is resolved in the patch by better error checking.
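(For reference, the CSV-parsing approach suggested above amounts to roughly the following. The counter names and csv filename are the ones used earlier in this bug; this is a sketch rather than the landed etlparser/results code, and it targets the Python 2 that talos runs on.)

import csv
import os

KNOWN_COUNTERS = set("%s_%s_%s" % (thread, stage, kind)
                     for thread in ("main", "nonmain")
                     for stage in ("startup", "normal", "shutdown")
                     for kind in ("fileio", "netio"))

def read_xperf_counters(csvname, counter_results):
    if not os.path.exists(csvname):
        return  # no xperf data for this run
    reader = csv.reader(open(csvname))
    header = [h.strip() for h in reader.next()]  # "thread, stage, counter, value"
    for row in reader:
        values = dict(zip(header, [v.strip() for v in row]))
        # "file_io_bytes" -> "file", "net_io_bytes" -> "net"; per-op rows like
        # "file_read_bytes" don't form a known counter name and are skipped below.
        kind = values["counter"].rsplit("_io_bytes", 1)[0]
        name = "%s_%s_%sio" % (values["thread"].replace("non-main", "nonmain"),
                               values["stage"], kind)
        if name in KNOWN_COUNTERS:
            counter_results.setdefault(name, []).append(float(values["value"]))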
Attachment #657537 - Attachment is obsolete: true
Attachment #658128 - Flags: review?(jhammel)
Comment on attachment 658128 [details] [diff] [review]
fix bug in counters for tp5n on windows, cleanup .csv parsing (1.0)

+ if len(self.registeredCounters[counter]) < 2:
+     return None
+

Not sure I understand this line?

+ filename = 'etl_output_thread_stats.csv'
+ if not os.path.exists(filename):
+     return

AIUI, if you have xperf counters, this file *should* exist. So maybe we should make a debug statement here?

+ import csv

Personally I'd put this at the top of the file.

Hmmm... there's also no error handling if the file is of a bad format. I guess that's not a problem until it is.

Other than those nits, looks good to me!
Attachment #658128 - Flags: review?(jhammel) → review+
updated with feedback from review and landed: http://hg.mozilla.org/build/talos/rev/2605620aa7fc
Depends on: 788235
Depends on: 788354
Displaying all graphs at once: http://graphs.mozilla.org/graph.html#tests=[[251,94,12],[250,94,12],[248,94,12],[247,94,12],[246,94,12],[244,94,12],[245,94,12],[243,94,12],[242,94,12]]&sel=none&displayrange=7&datatype=running

tp5n_nonmain_normal_netio_paint: 153735839.0: http://graphs.mozilla.org/graph.html#tests=[[251,94,12]]&sel=none&displayrange=7&datatype=running
tp5n_main_startup_fileio_paint: 24400446.0: http://graphs.mozilla.org/graph.html#tests=[[242,94,12]]&sel=none&displayrange=7&datatype=running
tp5n_main_shutdown_fileio_paint: 59143.0: http://graphs.mozilla.org/graph.html#tests=[[246,94,12]]&sel=none&displayrange=7&datatype=running
tp5n_main_normal_netio_paint: 20300.0: http://graphs.mozilla.org/graph.html#tests=[[245,94,12]]&sel=none&displayrange=7&datatype=running
tp5n_main_startup_netio_paint: 2.0: http://graphs.mozilla.org/graph.html#tests=[[243,94,12]]&sel=none&displayrange=7&datatype=running
tp5n_nonmain_startup_fileio_paint: 354464.0: http://graphs.mozilla.org/graph.html#tests=[[248,94,12]]&sel=none&displayrange=7&datatype=running
tp5n_nonmain_normal_fileio_paint: 502931900.0: http://graphs.mozilla.org/graph.html#tests=[[250,94,12]]&sel=none&displayrange=7&datatype=running
tp5n_main_normal_fileio_paint: 53619254.0: http://graphs.mozilla.org/graph.html#tests=[[244,94,12]]&sel=none&displayrange=7&datatype=running
tp5n_main_shutdown_netio_paint: 2.0: http://graphs.mozilla.org/graph.html#tests=[[247,94,12]]&sel=none&displayrange=7&datatype=running

All of these links are for mozilla-central; the links will be slightly different when looking at inbound data.
Status: REOPENED → RESOLVED
Closed: 12 years ago12 years ago
Resolution: --- → FIXED
What does the _paint suffix mean?
it is a relic of talos. It means that we are waiting for mozAfterPaint before going to the next page (which is what we do by default now in all our tests).
(In reply to Taras Glek (:taras) from comment #77)
> What does the _paint suffix mean?

Please see http://k0s.org/mozilla/blog/20120724135349 for the hoary details. Ultimately, we should not be sending metadata as part of the test name, but for graphserver these things are tied together. For datazilla they will not be.
Joel, do we retain the csv and/or xperf file after running the test? tp5n_main_shutdown_fileio_paint shows a lot of variation (which may be due to some non-determinism in our code). It would be good to compare the io to see where the extra io is going.
Unfortunately we do not save off the .etl or generated .csv file. I could add in a fix to print out the .csv file to stdout and it would show up in the logs. Let's figure out a few things while we watch the tree for another 20-30 pushes. That might bring out other hot spots or patterns.
(In reply to Joel Maher (:jmaher) from comment #81)
> Unfortunately we do not save off the .etl or generated .csv file. I could
> add in a fix to print out the .csv file to stdout and it would show up in
> the logs. Let's figure out a few things while we watch the tree for another
> 20-30 pushes. That might bring out other hot spots or patterns.

Can you turn on logging of tp5n_main_shutdown_fileio_paint io?
For some reason we are not collecting it. It could be we have it configured incorrectly (please check my xperf options: http://hg.mozilla.org/build/talos/file/tip/talos/xperf.config#l12), or maybe the parser is mixing it up. I will be heads down most of this week on mobile stuff. Whoever can get to it first wins.
Blocks: 644744
tp5n_main_shutdown_fileio_paint: 59143.0: http://graphs.mozilla.org/graph.html#tests=[[246,94,12]]&sel=none&displayrange=7&datatype=running

Let me know if that is useful or not.
(In reply to Joel Maher (:jmaher) from comment #85)
> tp5n_main_shutdown_fileio_paint: 59143.0:
> http://graphs.mozilla.org/graph.html#tests=[[246,94,12]]&sel=none&displayrange=7&datatype=running
> 
> Let me know if that is useful or not.

It's not. Can you add a summary of the files read + bytes from them to the logs? This would be handy for at least main thread startup/shutdown io (we can skip the others for now to save space).
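(A possible shape for that per-file summary, assuming the parsed etl rows carry a filename column, as the FILENAME kernel provider in the config suggests; purely illustrative.)

def summarize_files(rows, firefox_pid):
    # Aggregate bytes per (stage, filename) for main-thread startup/shutdown
    # I/O, so the log can show which files account for the variation.
    per_file = {}
    for row in rows:
        if (row["pid"] == firefox_pid and row["thread"] == "main"
                and row["stage"] in ("startup", "shutdown")):
            key = (row["stage"], row["filename"])
            per_file[key] = per_file.get(key, 0) + row["bytes"]
    return per_file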
we already print the data in the logfile:

RETURN:<a href='http://graphs.mozilla.org/graph.html#tests=[[251,94,12]]'>tp5n_nonmain_normal_netio_paint: 140670519.0</a><br>
RETURN:<a href='http://graphs.mozilla.org/graph.html#tests=[[242,94,12]]'>tp5n_main_startup_fileio_paint: 26365858.0</a><br>
RETURN:<a href='http://graphs.mozilla.org/graph.html#tests=[[246,94,12]]'>tp5n_main_shutdown_fileio_paint: 30886.0</a><br>
RETURN:<a href='http://graphs.mozilla.org/graph.html#tests=[[245,94,12]]'>tp5n_main_normal_netio_paint: 5307030.0</a><br>
RETURN:<a href='http://graphs.mozilla.org/graph.html#tests=[[243,94,12]]'>tp5n_main_startup_netio_paint: 2.0</a><br>
RETURN:<a href='http://graphs.mozilla.org/graph.html#tests=[[248,94,12]]'>tp5n_nonmain_startup_fileio_paint: 477344.0</a><br>
RETURN:<a href='http://graphs.mozilla.org/graph.html#tests=[[250,94,12]]'>tp5n_nonmain_normal_fileio_paint: 471793045.0</a><br>
RETURN:<a href='http://graphs.mozilla.org/graph.html#tests=[[244,94,12]]'>tp5n_main_normal_fileio_paint: 39505092.0</a><br>
RETURN:<a href='http://graphs.mozilla.org/graph.html#tests=[[247,94,12]]'>tp5n_main_shutdown_netio_paint: 3.0</a><br>