Closed Bug 493427 Opened 15 years ago Closed 14 years ago

Create profiling mechanism for Windows Mobile

Categories

(Firefox for Android Graveyard :: General, defect)

ARM
Windows Mobile 6 Professional
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: mfinkle, Assigned: crowderbt)

References

Details

Attachments

(3 files, 14 obsolete files)

7.96 KB, text/plain
Details
5.61 KB, text/html
Details
22.29 KB, patch
Details | Diff | Splinter Review
We currently have no way of finding "hot spots" in the Windows Mobile releases of Fennec. The landscape for profilers on Windows Mobile is fairly barren.

We did come across a possible way of getting some basic metrics. Using the _penter/_pexit hooks, Visual Studio will call user supplied hook functions whenever entering and exiting functions. See:
http://msdn.microsoft.com/en-us/library/c63a9b7h.aspx
http://msdn.microsoft.com/en-us/library/xc11y76y.aspx

A basic example can be found here:
http://www.rainer-keuchel.de/wince/profiler.html

(not sure where to put this bug)
Blocks: 491346
Attached patch first cut (obsolete) — Splinter Review
this will let you profile using -callcap or -fastcap, enabled by adding --enable-callcap or --enable-fastcap to your mozconfig.  You can add StartProfiling and StopProfiling around the code you are interested in profiling to get a more narrow profile.
As you can see, I'm not very familiar with perl
Unfortunately, Windows Mobile seems not to support -callcap and -fastcap option.

http://msdn.microsoft.com/en-us/library/aa934480.aspx

The patch does not work at least on emulator.

Load module: fennec.exe
Load module: profile-writer.dll
Load module: mozce_shunt.dll
Load module: toolhelp.dll
Load module: coredll.dll.0409.MUI
Load module: coredll.dll
[Stack fault]: Thread=9c1ec400 Proc=80096ec0 'fennec.exe'
AKY=00001001 PC=781a1388(mozce_shunt.dll+0x00001388) RA=781be8e4(mozce_shunt.dll+0x0001e8e4) BVA=1bab1be4
[Stack fault]: Thread=9c1ec400 Proc=80096ec0 'fennec.exe'
AKY=00001001 PC=781a1388(mozce_shunt.dll+0x00001388) RA=781be8e4(mozce_shunt.dll+0x0001e8e4) BVA=1bab0de4
Unload module: mozce_shunt.dll
Unload module: profile-writer.dll
Unload module: toolhelp.dll
The program '[0x5B77C19E] fennec.exe' has exited with code 1114 (0x45a).

I hope I am wrong or missing something.
Attached patch wip (obsolete) — Splinter Review
I just wanted to attach what I had to the bug before leaving for vacation
Attachment #379527 - Attachment is obsolete: true
Attached file interprets binay profile (obsolete) —
here's a tool to parse the map files in an objdir and write out json for a binary profile.  Its in java, feel free to rewrite in a better language
Attachment #379528 - Attachment is obsolete: true
as of now the json is too big for firefox to handle (and I'm not certain its valid).  Maybe some one who knows js better can pick this up now
Attachment #380766 - Attachment is obsolete: true
I'd be glad to take this over (I need it for profiling startup); do you have the latest of your sampling patch up here, as well?
Yea, the sampling patch is the latest
Assignee: nobody → crowder
Attachment #380765 - Attachment is obsolete: true
Attachment #383521 - Attachment mime type: application/octet-stream → text/plain
Comment on attachment 383521 [details]
Just refreshing this patch to the tip of m-c


> ifeq (_WINNT,$(GNU_CC)_$(OS_ARCH))
>-	$(LD) -NOLOGO -OUT:$@ -PDB:$(LINK_PDBFILE) $(WIN32_EXE_LDFLAGS) $(LDFLAGS) $(PROGOBJS) $(RESFILE) $(LIBS) $(EXTRA_LIBS) $(OS_LIBS)
>+	$(LD) -NOLOGO -OUT:$@ -Fm:$(LINK_MAPFILE) -MAP:$(LINK_MAPFILE) -PDB:$(LINK_PDBFILE) $(WIN32_EXE_LDFLAGS) $(LDFLAGS) $(PROGOBJS) $(RESFILE) $(LIBS) $(EXTRA_LIBS) $(OS_LIBS)

I am wondering why WINNT part is fixing?
I'll leave that answer up to Brad, but there's no reason we couldn't use this profiling technique on Win32 platforms, that I can think of.
yes, this should work for desktop win32 platforms, although I haven't tested that
Attached patch updates (obsolete) — Splinter Review
Fixes:
* This is now threadsafe, afaik (the use of statics in previous version was not)
* I tinkered a bit with the output buffer sizes; not sure if this measurably improves profile times or not.
* The fwrite() itself used to be charged to the caller as part of its processing time, shouldn't be any longer (timing is measured from the end of the "Enter" handler to the beginning of the "Exit" handler.
* Lots of stuff moved to profile-writer.c and tracked in TLS, rather than global; improves thread-safety as mentioned above, but also seems to improve perf.
* We don't do the ToolHelp snapshots as often as we used to.  Now check through the internally maintained module-list before taking a snapshot; this prevents having to stop everything for the snapshot, which seems to be expensive by itself.
* HUMAN_READABLE is gone.  I tried maintaining it for a while, but got sick of it and it's too slow to be usable anyway.
Attachment #383521 - Attachment is obsolete: true
Attached patch match new output better (obsolete) — Splinter Review
This new processing script knows how to handle multiple module stacks, and reads all prof_*.dat files, plus some other cleanups and less output (no JSON yet).
Attachment #383052 - Attachment is obsolete: true
Attached patch updates and tweaks (obsolete) — Splinter Review
Attachment #384703 - Attachment is obsolete: true
The new patch has lots of improvements, but primarily:
* It overcomes a shortcoming of callcap (maybe fastcap, too?) wherein we sometimes don't get appropriate exit data for some routines by keeping an address stack and popping until we return to the address we're trying to exit from.  In other words, if a parent node in the call graph exits, its descendants are considered to have exited as well (at approximately the same time).
* Short-circuit module lookup better/faster
* Most of the time, passing "data" as a function saves us from having to call TlsGetData() redundantly (hopefully saving some measurable time in startup)
* Now using the performance counter instead of tick-counts.  I ran into a lot of fun implementing this improvement.  Apparently, sometimes, QueryPerformanceCounter() flips out and loses tens (hundreds) of thousands of ticks, and then somehow recovers them.  I compensate for that by using ancestors start-times for nodes whose start time (in ticks) is actually later than their descendants.
* We can now lazily/dynamically init for threads that somehow haven't called the DllThreadAttach, but are magically calling the profiling callbacks.

Bugzilla interdiff is failing me, so there are probably other improvements I'm forgetting.
Attached patch python processing script (obsolete) — Splinter Review
This is a newer version of the profile data processor.  It outputs the following:
* mangled.txt -- in descending order, sorted by total-time, the most costly nodes in the call graph, in a human readable form.
* functions.js -- a dictionary of function address IDs (basically address_of_function >> 3) and their corresponding function names.
* callinfo.js -- a dictionary of call IDs (a depth/function ID tuple) and their corresponding data (time, self_time, children, parent)
* edges.js -- A dictionary of edge IDs (callID/function ID tuple) and their corresponding times.

These are consumed by the profiler.html I'm about to upload, which builds an interactive, shark-like call-tree.
Attachment #384785 - Attachment is obsolete: true
Attached file interactive profile data viewer (obsolete) —
To use the output from process.py, you can get unmangled C++ function names by passing functions.js and mangled.txt through undname.exe (provided by Visual Studio), like so:

undname functions.js | tr -d '\r' > tmp.js && mv tmp.js functions.js
undname mangled.txt | tr -d '\r' > unmangled.txt

I'll eventually add these to process.py...  just not in there yet.
process.py needs the following to run:
* find the mapfiles for your build and put them in the same directory with process.py
* copy the mod_*.txt and prof_*.dat files from the root directory of your device into the same directory as process.py.

Once these files are all gathered, you simply need to run process.py
Attached file interactive profile data viewer (obsolete) —
Attachment #386392 - Attachment is obsolete: true
Attached file python processing script (obsolete) —
Attachment #386390 - Attachment is obsolete: true
Attached patch profiling patch (obsolete) — Splinter Review
Attachment #386387 - Attachment is obsolete: true
This knows how to offset the top-level module (the .exe) properly in order to resolve its symbols.
Attachment #386523 - Attachment is obsolete: true
Allows for searching.
Attachment #386522 - Attachment is obsolete: true
A couple of suggestions:

1) Allow the user to specify the folder to put profile data in. 
2) Expose start/stop profiling to js
3) default to not profiling on start up, add a flag to enable this
4) flush data to disk when we stop profiling to help ensure file io doesn't show up in profiles
5) move file io off the main thread
6) check data->profiling in prof.c to avoid an extra function call when its not needed
The capprof hack I did at http://people.mozilla.com/~vladimir/misc/capprof.patch does most of the above (except for #1, though it could be passed down); it saves data in a different format though than what crowder has, not sure which one's better data wise.
Flags: wanted-fennec1.0+
updated to trunk.  doesn't set profile true on init.
Attachment #386524 - Attachment is obsolete: true
Comment on attachment 399386 [details] [diff] [review]
profiling patch (updated to mc trunk)

this is pretty useful.  we should get it into the tree so it doesn't rot further.  Vlad, can you look over it?  It is default off, but does add some build fu to create map files.
Attachment #399386 - Flags: review?(vladimir)
I'm still not a big fan of this approach -- it does a lot of work on the function entry/exit that can be done during postprocessing.
could that be an optimization for follow up bugs?  The cost of work on entry/exit should be uniform (eg. function calls take more time universally because of this overhead).
It isn't uniform, because some functions "randomly" pay for I/O.  It'll be better once the I/O is off main-thread.
...  err, in it's own thread, rather.
Vlad:  Other than the I/O I'm not sure I understand what work you feel it shouldn't be doing, or which could be delayed to post-processing.
Well, two things:

- for the modules, you can just dump all the known modules once (or maybe twice, one at start and once at stop), and use that instead of looking through the modules list for each entry/exit (even if it's found in the current module, or in the currently known list -- no need, just dump all their address ranges).  There's a chance you might miss a module that was loaded and then unloaded during the perf run, but afaik we don't do that anywhere (at least, not for any of our modules)

- the time fixups; if you just record the current value of the timer, you should be able detect a bogus time in the same way while postprocessing, and fix up the value then.. that all operates purely on previously recorded data, so no need to do it live
Yeah, okay.  Those are pretty easily fixed.
Attachment #399386 - Flags: review?(vladimir)
Component: Windows Mobile → General
OS: Windows XP → Windows Mobile 6 Professional
Hardware: x86 → ARM
Ding dong, the Windows Mobile effort is dead.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → INVALID
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: