406810 - Analyze startup time of Firefox3 (TRUNK)

Assignee

Description

•

18 years ago

User-Agent: Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko) Build Identifier: The goal is to speed up startup of FF3. I'll do an analysis of startup on linux using oprofiler. Further steps depends on obtained data. Reproducible: Always

Michal Novotny [:michal]

Assignee

Updated

•

18 years ago

Version: unspecified → Trunk

Robert Sayre

Updated

•

18 years ago

Assignee: nobody → michal

Mike Schroepfer

Comment 1

•

18 years ago

Some graphs that are useful for understanding regression: http://graphs.mozilla.org/#spst=range&spstart=1124636400&spend=1178280000&bpst=Cursor&bpstart=1124636400&bpend=1178280000&m1tid=9315&m1bl=0&m1avg=0&m2tid=7312&m2bl=0&m2avg=0

Michal Novotny [:michal]

Assignee

Comment 2

•

18 years ago

(In reply to comment #1) > Some graphs that are useful for understanding regression: > > http://graphs.mozilla.org/#spst=range&spstart=1124636400&spend=1178280000&bpst=Cursor&bpstart=1124636400&bpend=1178280000&m1tid=9315&m1bl=0&m1avg=0&m2tid=7312&m2bl=0&m2avg=0 > Is somewhere description of what all the available data on the page mean and how they were obtained?

Michal Novotny [:michal]

Assignee

Comment 3

•

18 years ago

Attached file output of opreport — Details

Michal Novotny [:michal]

Assignee

Comment 4

•

18 years ago

I tested startup of MOZILLA_1_8_BRANCH and HEAD and found warm startup of HEAD a little bit faster on i386 linux. 10 startups in a row in ms: MOZILLA_1_8_BRANCH: 835, 839, 835, 836, 836, 840, 839, 833, 835, 833 HEAD: 832, 829, 831, 831, 833, 832, 830, 831, 831, 831 Cold startup is slower in case of HEAD. I'm measuring it using: sudo bash -c "sync; echo 3 > /proc/sys/vm/drop_caches"; ./startup-unix.pl ../../../_obj-browser-release-32-fp/dist/bin/firefox -P ... MOZILLA_1_8_BRANCH: 10986, 10818, 10123, 11509, 10734, 10693, 10643, 10636, 10646, 10519 HEAD: 12142, 12200, 12284, 12159, 12217, 12509, 12350, 12075, 12433, 12900 Data from profiler are in attachment. Callgraph output and annotated source aren't included because of their size. To get more stable data I captured 10 startups and merged them together. That's why there are multiple [vdso] with different tgid. Some indispensable CPU time is used for memory allocations, and symbols lookup. Both increased in HEAD version. I tried to run FF with LD logging using LD_DEBUG_OUTPUT=/tmp/ld_output_ff.log LD_DEBUG=symbols and it seems that number of symbols increased: wc -l ld_output_ff3.log.24023 ld_output_ff2.log.24313 337393 ld_output_ff3.log.24023 261167 ld_output_ff2.log.24313 I'll try to do multiple profilings to see that results are really stable and comparable.

Michal Novotny [:michal]

Assignee

Comment 5

•

18 years ago

I did few tests with dynamic linking to see how much can be spared on this part. For some unknown reason I can't get annotation of do_lookup_x() in do-lookup.h from oprofiler. So I just tried to minimize number of the while loops in this function. By analyzing output of LD_DEBUG=symbols I got following result : 1107 file=/usr/lib/libstdc++.so.6 975 file=./libxpcom_core.so 939 file=/lib/libc.so.6 753 file=/lib/libglib-2.0.so.0 597 file=/usr/lib/libbonobo-2.so.0 487 file=/usr/lib/libORBit-2.so.0 451 file=/usr/lib/libX11.so.6 390 file=/lib/libgobject-2.0.so.0 390 file=./firefox-bin 384 file=./libnspr4.so 316 file=./libmozjs.so 245 file=/usr/lib/libpango-1.0.so.0 201 file=/lib/libpthread.so.0 182 file=/usr/lib/libgdk-x11-2.0.so.0 177 file=/usr/lib/libgconf-2.so.4 176 file=./libsqlite3.so 150 file=/usr/lib/libfreetype.so.6 149 file=/usr/lib/libgtk-x11-2.0.so.0 135 file=./libnss3.so 117 file=/usr/lib/libbonobo-activation.so.4 108 file=/lib/libasound.so.2 107 file=/usr/lib/libXdmcp.so.6 107 file=/lib/libdbus-1.so.3 97 file=/usr/lib/libXt.so.6 83 file=/usr/lib/libaudiofile.so.0 80 file=/lib/libkeyutils.so.1 76 file=./libplds4.so 71 file=/usr/lib/libkrb5.so.3 62 file=/usr/lib/libxml2.so.2 58 file=/usr/lib/libk5crypto.so.3 57 file=/usr/lib/libXrender.so.1 57 file=/usr/lib/libcairo.so.2 50 file=/usr/lib/libfontconfig.so.1 48 file=/usr/lib/libORBitCosNaming-2.so.0 44 file=/usr/lib/libgssapi_krb5.so.2 43 file=/usr/lib/libpangoft2-1.0.so.0 43 file=./libxpcom.so 37 file=./libplc4.so ... This means that during FF startup there were 1107 lookups for symbols in libstdc++.so.6, 975 in libxpcom_core.so ... Order of shared libraries in which ld tries to find symbols can be changed without recompilation by using LD_PRELOAD. I measured startup with standard order and with "optimized" order (sorted by number of resolved symbols). There was no gain from changing order. In fact "optimized" order was slower by few milliseconds even if number of lookups was considerably smaller: # grep -e "lookup in file" ld_output_ff_normal.log.9612 | wc -l 336619 # grep -e "lookup in file" ld_output_ff_optimized_1.log.9642 | wc -l 242599 I also tried to reduce number of lookups by linking libstc++ statically. In this case startup was faster by about 8 ms (1% of startup time). This isn't too much. Anyway, is there some reason why building with --enable-static doesn't produce single binary?

Brendan Eich [:brendan]

Comment 6

•

18 years ago

Really need :bs and :luser on this bug. /be

Status: UNCONFIRMED → NEW

Ever confirmed: true

(not currently active) Ted Mielczarek

Comment 7

•

18 years ago

Are you compiling with the same mozconfig that the nightly tinderboxes use? http://lxr.mozilla.org/mozilla/source/tools/tinderbox-configs/firefox/ I notice you mention "static", but we're shipping builds in the libxul configuration these days, where most of the code gets linked into libxul.so, and the Firefox binary is just a little shim.

Benjamin Smedberg

Comment 8

•

18 years ago

Yes, anything that has an libxpcom_core.so is not a valid target for testing things like dynamic linker overhead.

Sudheer

Comment 9

•

18 years ago

Attached file ff3 startup timeline — Details

I generated the time line (by enabling NS_TIMELINE) on Debian linux PC(attached). While talking about improving start-up performance, somebody mentioned that making fastload files architecture specific will greatly speed-up. I am relatively new to mozilla code. Can somebody throw some light on fastload system? to jump start me? (in the log, nsIOService::NewChannelFromURI () seems to be used a lot during start-up. Are the files loaded (jar, xul, css, dtd,...) by this method the fast loaded files? They seem to be plain xul, css, dtd files and does not seem like pre-compiled/processed. What are fast-loaded files? How are these generated? Where are these processed? What part can be made architecture-specific to improve speed. I am exploring the code and debugging to get more understanding, but some quick help would be highly appreciated. Sorry, if bugzilla is not the right place to discuss/ask for help.

Robert Sayre

Comment 10

•

18 years ago

I believe the mozconfig for doing this analysis should be as follows: . $topsrcdir/browser/config/mozconfig mk_add_options MOZ_OBJDIR=@TOPSRCDIR@/ff-opt-libxul ac_add_options --enable-optimize ac_add_options --enable-libxul ac_add_options --disable-debug ac_add_options --disable-static ac_add_options --disable-shared ac_add_options --disable-tests This is a bit redundant given the defaults, but it makes things clear.

Michal Novotny [:michal]

Assignee

Comment 11

•

18 years ago

And what is the right mozconfig for 1.8 BRANCH? With this mozconfig configure fails: configure: error: Only one of --enable-shared or --enable-static must be specified. *** Fix above errors and then restart with "make -f client.mk build"

Matthew Gertner

Comment 12

•

18 years ago

My distribution Firefox 2 has the following build options (about:buildconfig), so they're probably the ones to use for performance measurement: --enable-application=browser --enable-update-channel=release --enable-official-branding --enable-optimize --disable-debug --disable-tests --enable-static --disable-shared --enable-svg --enable-canvas --enable-update-packaging

Status: NEW → ASSIGNED

Michal Novotny [:michal]

Assignee

Comment 13

•

18 years ago

So I have now following mozconfigs. MOZILLA_1_8_BRANCH: ------------------------------------------------- mk_add_options MOZ_CO_PROJECT=browser mk_add_options PROFILE_GEN_SCRIPT=@TOPSRCDIR@/build/profile_pageloader.pl mk_add_options MOZ_CO_MODULE="mozilla/tools/update-packaging mozilla/tools/codesighs" ac_add_options --enable-application=browser ac_add_options --enable-update-channel=nightly ac_add_options --enable-update-packaging ac_add_options --enable-optimize ac_add_options --disable-debug ac_add_options --disable-tests ac_add_options --enable-codesighs mk_add_options MOZ_OBJDIR=@TOPSRCDIR@/_obj-browser-release-32-tb-fp ac_add_options --enable-static ac_add_options --disable-shared ac_add_options --enable-system-cairo ac_add_options --enable-default-toolkit=gtk2 ac_add_options --enable-xft ac_add_options --disable-freetype2 ac_add_options --enable-svg ac_add_options --enable-canvas CC="gcc -fno-omit-frame-pointer -gstabs+" CXX="g++ -fno-omit-frame-pointer -gstabs+" ------------------------------------------------- HEAD: ------------------------------------------------- mk_add_options MOZ_CO_PROJECT=browser mk_add_options PROFILE_GEN_SCRIPT=@TOPSRCDIR@/build/profile_pageloader.pl mk_add_options MOZ_CO_MODULE="mozilla/tools/update-packaging mozilla/tools/codesighs" ac_add_options --enable-application=browser ac_add_options --enable-update-channel=nightly ac_add_options --enable-update-packaging ac_add_options --enable-optimize ac_add_options --disable-debug ac_add_options --disable-tests ac_add_options --enable-codesighs mk_add_options MOZ_OBJDIR=@TOPSRCDIR@/_obj-browser-release-32-tb-fp ac_add_options --enable-libxul ac_add_options --disable-static ac_add_options --disable-shared CC="gcc -fno-omit-frame-pointer -gstabs+" CXX="g++ -fno-omit-frame-pointer -gstabs+" ------------------------------------------------- Startup times are now: FF3 693 FF2 661

Michal Novotny [:michal]

Assignee

Comment 14

•

18 years ago

Data from profiler can be downloaded from: http://www.allpeers.com/download/michal/ff2.tar.bz2 http://www.allpeers.com/download/michal/ff3.tar.bz2 They include callgraph output so they are too big to upload them to bugzilla. They represent 20 starts of FF merged together. More time is spent in dynamic linker and sqlite3. Major difference that I can see is in pango usage: -------------------------------------------------------------------------- 1 0.0124 libxul.so nsThebesFontMetrics::GetHeight(int&) 8071 99.9876 libxul.so nsThebesFontMetrics::GetMetrics() const 4 0.0065 libxul.so gfxPangoFont::GetMetrics() 5374 66.5676 libpango-1.0.so.0.1800.3 pango_font_get_metrics 2675 33.1351 libxul.so gfxPangoFont::RealizePangoFont(int) 17 0.2106 libxul.so gfxPangoFont::GetCharSize(char, gfxSize&, gfxSize&, unsigned int*) 4 0.0495 libxul.so gfxPangoFont::GetMetrics() [self] 1 0.0124 libxul.so nsACString_internal::Equals(char const*, nsCStringComparator const&) const 1 0.0124 libxul.so GetPangoLanguage(nsACString_internal const&) 1 0.0124 libfreetype.so.6.3.16 FT_Get_Sfnt_Table -------------------------------------------------------------------------- 8071 is 13.17% of all startup time.

Mike Schroepfer

Comment 15

•

18 years ago

Pav/JDaggett looks like pango might be part of the issue here ... or more specifically nsThebesFontMetrics::GetMetrics(). Is there anything we can do to optimize this?

Mike Shaver (:shaver emeritus)

Comment 16

•

18 years ago

Linux doesn't have OS-global caching of font data (only availability), and we're doing things The Linux Way, so AIUI our two paths look like they'll be: 1) hope and push for such caching to appear, in which case the overhead will go basically to zero once that's done and all our users have upgraded 2) use our Win32 font selection code, implement our own caching, and endure the endless complaints about how we just can't play nice on Linux :) (We could short-circuit the time delay from 1) by shipping our own Pango, but -- yeah, no.)

Mike Schroepfer

Updated

•

18 years ago

Blocks: 397947

Mike Schroepfer

Comment 17

•

18 years ago

Did you have a chance to check out windows?

Michal Novotny [:michal]

Assignee

Comment 18

•

18 years ago

(In reply to comment #17) Finally I have some data from windows profiler. I used AQTime with user+kernel counting. It is hard to compare results because there were lot of changes between 1.8 branch and trunk. Anyway I created 3 tables that shows difference between FF2 and FF3 for modules (this is useless), source files and routines. Data can be downloaded as ODS or XLS from: http://www.allpeers.com/download/michal/prof_ods.zip http://www.allpeers.com/download/michal/prof_xls.zip All times are in microseconds.

Matthew Gertner

Comment 19

•

18 years ago

Can we generate web-friendly views for this data? Maybe something like http://simile.mit.edu/exhibit/?

Matthew Gertner

Comment 20

•

18 years ago

I guess Exhibit is probably not worth the effort, but maybe sticking the data in Google Spreadsheets if it can handle the data volume.

Dietrich Ayala (:dietrich)

Updated

•

18 years ago

Blocks: 364272

Mike Schroepfer

Comment 21

•

18 years ago

Google spreadsheets can't take that much data - so imported the first 300 rows at the URL (also here): http://spreadsheets.google.com/pub?key=pq20lsXjZGAykkHScEtvQVg Some things of note: JS_CallTracer js_TraceObject GetGCThingFlags js_TraceScopeProperty js_TraceId Are top five on 1.9 and don't exist in 1.8. PR_Malloc is #7 and is called 3k more times on 1.9 than 1.8 (surprised given the alloc reudctions we've done..) nsID:Equals #11 nsLocalFile::EnsureShortPath doesn't show up in 1.8 but is high in 1.9 JS_DHashTableEnumerate called 4x more in 1.9 than 1.8 but takes 20x more time js_LookupPropertyWithFlags 2x time in 1.9 from 1.8 nsStringBuffer::Realloc 2x time in 1.9 than 1.8.. _cairo_win32_surface_set_clip_region is new for 1.9 (no surprise)

URL: http://spreadsheets.google.com/pub?ke...

Brendan Eich [:brendan]

Comment 22

•

18 years ago

jst pointed out that we whacked the DOM's JSGC_BEGIN callback in a past release to hold off all GC during startup for 10 sec. or so -- that seems to have regressed, and jst mentioned Smaug's name ;-). /be

Boris Zbarsky [:bzbarsky]

Comment 23

•

18 years ago

> Are top five on 1.9 and don't exist in 1.8. Those are all JS GC. We should definitely try to avoid that during startup, if it's happening then. > nsID:Equals #11 Do we have callstacks? Is this just QI, or something else going on? > nsStringBuffer::Realloc 2x time in 1.9 than 1.8.. Again, do we have callstacks? That would be a good one to try and nix. It looks like we have about 30% more calls and 2x the time, right?

Stuart Parmenter

Comment 24

•

18 years ago

callstacks for the top 5 PR_Mallocs would be nice too.

Mike Schroepfer

Comment 25

•

18 years ago

(In reply to comment #23) > > Are top five on 1.9 and don't exist in 1.8. > > Those are all JS GC. We should definitely try to avoid that during startup, if > it's happening then. > > > nsID:Equals #11 > > Do we have callstacks? Is this just QI, or something else going on? > > > nsStringBuffer::Realloc 2x time in 1.9 than 1.8.. > > Again, do we have callstacks? That would be a good one to try and nix. It > looks like we have about 30% more calls and 2x the time, right? > I'm reading 2x time and 3x fewer calls. Michal can we get call stacks for some of these?

Olli Pettay [:smaug][bugs@pettay.fi]

Comment 26

•

18 years ago

(In reply to comment #22) > jst pointed out that we whacked the DOM's JSGC_BEGIN callback in a past release > to hold off all GC during startup for 10 sec. or so -- that seems to have > regressed, and jst mentioned Smaug's name ;-). I changed CC handling (CC then calls GC), but the first CC shouldn't happen during startup. One can enable DEBUG_smaug in nsJSEnvironment.cpp too see when cycle collector is called. First time I see it is around 10 sec after startup (if user is inactive all the time). The CC call happens because sGCTimer fires and the first time that timer is initialized, NS_FIRST_GC_DELAY (10000ms) is used as delay. Is it possible that changes to nsJSContext::DOMBranchCallback have regressed something? Should we prevent (or delay) calling JS_MaybeGC there during startup? Or in nsJSContext::ScriptEvaluated?

Johnny Stenback (:jst)

Comment 27

•

18 years ago

One thing that's likely to make GC show up more in profile data is that the profilers slow things down enough that we actually fire that 10s timer during startup, whereas we wouldn't in normal runs (unless the system is *really* slow, or really busy).

Benjamin Smedberg

Comment 28

•

18 years ago

> nsLocalFile::EnsureShortPath doesn't show up in 1.8 but is high in 1.9 Was this done against a libxul build? We know that EnsureShortPath is expensive and shows up a lot in dynamic builds during component registration, but we don't call it that much in libxul builds. Stacks would be good.

Boris Zbarsky [:bzbarsky]

Comment 29

•

18 years ago

> I'm reading 2x time and 3x fewer calls. Hit Count is number of calls, right?

Michal Novotny [:michal]

Assignee

Comment 30

•

18 years ago

(In reply to comment #23) > > Are top five on 1.9 and don't exist in 1.8. > > Those are all JS GC. We should definitely try to avoid that during startup, if > it's happening then. This seems to be problem on my side. I started FF with parameter "-url file://C:\moz\TRUNK\mozilla\tools\performance\startup\quitForMac.html" and measured whole run of firefox. I hoped that ending code will not influence profiling too much. I tried to create an action in AQTime that disables profiling when nsAppStartup::Quit is called but profiling data are IMO incorrect. Anyway I will provide them soon (with callgraphs).

Boris Zbarsky [:bzbarsky]

Comment 31

•

18 years ago

Oh, yeah. Shutdown does various slow things like sqlite flushing and GC...

Mike Schroepfer

Comment 32

•

18 years ago

(In reply to comment #29) > > I'm reading 2x time and 3x fewer calls. > > Hit Count is number of calls, right? > Thought so...

Boris Zbarsky [:bzbarsky]

Comment 33

•

18 years ago

So maybe I'm misreading things, but for the string Realloc there is a hit count of 24000-some, another of 1800-some, and a 600-some that I assume is the difference. Hence my 30% number. It's not clear which columns in the spreadsheet are 1.9 and which are 1.8, so not sure whether that's a 30% increase or decrease.

Michal Novotny [:michal]

Assignee

Comment 34

•

18 years ago

New data can be found at: http://www.allpeers.com/download/michal/new/ It differs from previous data so that profiling was now stopped on call nsAppStartup::Quit that is called from quitForMac.html. So anything related to exitting firefox shouldn't be now in profiling data. Anyway at least first two methods (XRE_main and nsBaseAppShell::Run) show incorrect values for Time. Maybe because they weren't finished when profiling was stopped (bug in AQTime?). The rest seem to be sane (at least I hope). File ff2_ff3_routines_stop_at_quit contains all routines with differences between 1.8 and 1.9 sorted by time difference descending. For first 30 are callgraphs in "callgraph/FF{2,3}" subdirectory. File ff2_ff3_routines_stop_at_quit_hit_ratio contains only these routines that have time difference bigger that 500 microseconds and are sorted by hit ratio. For all that have ratio more than 2 are also callgraphs available. I wasn't able to export all callgraph data from AQTime, so only the minimum amount mentioned above is present as PNG.

Michal Novotny [:michal]

Assignee

Comment 35

•

18 years ago

(In reply to comment #32) > (In reply to comment #29) > > > I'm reading 2x time and 3x fewer calls. > > > > Hit Count is number of calls, right? > > > > Thought so... > Yes, it means number of routine calls that were profiled. More info about values can be found in manual. http://www.automatedqa.com/downloads/Docs/aqtime_user_manual.pdf

Michal Novotny [:michal]

Assignee

Comment 36

•

18 years ago

(In reply to comment #33) > So maybe I'm misreading things, but for the string Realloc there is a hit count > of 24000-some, another of 1800-some, and a 600-some that I assume is the > difference. Hence my 30% number. It's not clear which columns in the > spreadsheet are 1.9 and which are 1.8, so not sure whether that's a 30% > increase or decrease. > There are labels in original spreadsheets (not in google docs). First set of data is for 1.8, second set for 1.9 and third is the difference.

Michal Novotny [:michal]

Assignee

Comment 37

•

18 years ago

(In reply to comment #28) > > nsLocalFile::EnsureShortPath doesn't show up in 1.8 but is high in 1.9 > > Was this done against a libxul build? We know that EnsureShortPath is expensive > and shows up a lot in dynamic builds during component registration, but we > don't call it that much in libxul builds. Stacks would be good. > Yes, the 1.9 build contains huge xul.dll. I used mozconfig from tinderbox-configs + added debug symbols. Callgraph for this method is at http://www.allpeers.com/download/michal/new/callgraph/FF3/nsLocalFile+EnsureShortPath.png

Michal Novotny [:michal]

Assignee

Comment 38

•

18 years ago

(In reply to comment #25) > > > nsID:Equals #11 http://www.allpeers.com/download/michal/new/callgraph/FF2/nsID+Equals.png http://www.allpeers.com/download/michal/new/callgraph/FF3/nsID+Equals.png Can't get more parent levels in one image :-/ If you need more, I will provide them separately. > > > > Do we have callstacks? Is this just QI, or something else going on? > > > > > nsStringBuffer::Realloc 2x time in 1.9 than 1.8.. http://www.allpeers.com/download/michal/new/callgraph/FF2/nsStringBuffer+Realloc.png http://www.allpeers.com/download/michal/new/callgraph/FF3/nsStringBuffer+Realloc.png More parent levels are present in these, but 4th level is a little bit messy. > > > > Again, do we have callstacks? That would be a good one to try and nix. It > > looks like we have about 30% more calls and 2x the time, right? Yes, exactly 33.27% more hits and 93.65% more time according to new data.

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Comment 39

•

18 years ago

The cairo functions that are showing up are consistent with the thinking that we're doing a lot more repaints/drawing than pre-threadmanager during startup..

Mike Schroepfer

Comment 40

•

18 years ago

(In reply to comment #39) > The cairo functions that are showing up are consistent with the thinking that > we're doing a lot more repaints/drawing than pre-threadmanager during startup.. > Is there some way to capture the event/paint timeline and figure out a way around this?

Boris Zbarsky [:bzbarsky]

Comment 41

•

18 years ago

>http://www.allpeers.com/download/michal/new/callgraph/FF2/nsID+Equals.png >http://www.allpeers.com/download/michal/new/callgraph/FF3/nsID+Equals.png It does look like this is mostly QI. One obvious change here is that the number of hits in nsGenericElement::QueryInterface has gone up by 25% or so. Can we get the callers of nsGenericElement::QueryInterface? Interestingly, the number of QIs on nsXULElement hasn't changed much, so the nsGenericElement stuff is some other elements. There are also a lot more hits in nsStandardURL::QueryInterface (more than twice as many as on branch), so perhaps the callers of this would be good too. And BackstagePass::QueryInterface seems to have jumped. Between them, those three account for 2/3 or more of the jump in nsID::Equals calls. All might need several levels of callers to get to the actual place where the do_QueryInterface happened. It might be worth filing a separate bug (blocking this one) for the QI issue so that we don't get too many confusing things all going on here at once. But we should definitely look into this! For the string realloc, could we get one more level on 1.9 at least for the top two 4th-level callers? Those are the two nsSubstring::Appends there.

Ondrej Brablc

Updated

•

18 years ago

Depends on: 412300

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Updated

•

18 years ago

Depends on: 412476

Jesse Ruderman

Updated

•

18 years ago

Keywords: perf

Michal Novotny [:michal]

Assignee

Comment 42

•

18 years ago

(In reply to comment #41) > For the string realloc, could we get one more level on 1.9 at least for the top > two 4th-level callers? Those are the two nsSubstring::Appends there. > nsSubstring::Appends are top in 1.8. In 1.9 are on the top nsPersistentProperties::Load and nsCSSScanner::GatherIdent http://www.allpeers.com/download/michal/new/callgraph/FF3/nsCSSScanner+GatherIdent.png http://www.allpeers.com/download/michal/new/callgraph/FF3/nsPersistentProperties+Load.png

Reed Loden [:reed]

Comment 43

•

18 years ago

(In reply to comment #42) > nsSubstring::Appends are top in 1.8. In 1.9 are on the top > nsPersistentProperties::Load and nsCSSScanner::GatherIdent > > http://www.allpeers.com/download/michal/new/callgraph/FF3/nsCSSScanner+GatherIdent.png Bug 311566 has a patch to deal with nsCSSScanner::GatherIdent.

Boris Zbarsky [:bzbarsky]

Comment 44

•

18 years ago

Ah, indeed. GatherIdent is a sort of known issue. For nsPersistentProperties, we might be having values (or keys?) that are too big for an autostring. Might be worth using a string with a bigger stack buffer there. I wonder whether we can do something about the expat driver there too (mCDataText, I bet).

Mike Schroepfer

Comment 45

•

18 years ago

(In reply to comment #44) > Ah, indeed. > > GatherIdent is a sort of known issue. For nsPersistentProperties, we might be is getting Bug 311566 done a possibility > having values (or keys?) that are too big for an autostring. Might be worth > using a string with a bigger stack buffer there. > Bug 121341 looks promising here - any chance we can get this done?

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Comment 46

•

18 years ago

(In reply to comment #40) > (In reply to comment #39) > > The cairo functions that are showing up are consistent with the thinking that > > we're doing a lot more repaints/drawing than pre-threadmanager during startup.. > > > > Is there some way to capture the event/paint timeline and figure out a way > around this? Yep, on windows all painting will go through widget/src/windows/nsWindow.cpp, nsWindow::OnPaint. Adding a print of when that's hit, along with the paintRect (scroll down a bit) will give an idea of what area we're repainting and when. ... and as I say that, I see that there's a debug_DumpPaintEvent already in there, which needs nglayout.debug.paint_dumping to be set. But that's only in debug builds, so you should be able to just comment out the NS_DEBUG's and set the pref and be good to go.

Mike Schroepfer

Comment 47

•

18 years ago

(In reply to comment #46) > (In reply to comment #40) > > (In reply to comment #39) > > > The cairo functions that are showing up are consistent with the thinking that > > > we're doing a lot more repaints/drawing than pre-threadmanager during startup.. > > > > > > > Is there some way to capture the event/paint timeline and figure out a way > > around this? > > Yep, on windows all painting will go through widget/src/windows/nsWindow.cpp, > nsWindow::OnPaint. Adding a print of when that's hit, along with the paintRect > (scroll down a bit) will give an idea of what area we're repainting and when. > > ... and as I say that, I see that there's a debug_DumpPaintEvent already in > there, which needs nglayout.debug.paint_dumping to be set. But that's only in > debug builds, so you should be able to just comment out the NS_DEBUG's and set > the pref and be good to go. > Michal can you try this out and post some data to see if there are obvious places for us to prevent repaints?

Michal Novotny [:michal]

Assignee

Comment 48

•

18 years ago

(In reply to comment #47) > > Yep, on windows all painting will go through widget/src/windows/nsWindow.cpp, > > nsWindow::OnPaint. Adding a print of when that's hit, along with the paintRect > > (scroll down a bit) will give an idea of what area we're repainting and when. > > > > ... and as I say that, I see that there's a debug_DumpPaintEvent already in > > there, which needs nglayout.debug.paint_dumping to be set. But that's only in > > debug builds, so you should be able to just comment out the NS_DEBUG's and set > > the pref and be good to go. > > > > Michal can you try this out and post some data to see if there are obvious > places for us to prevent repaints? > There are only few events: 1 PAINT widget=01A8CE1C name=noname id=000A02D0 rect= 0,0 986,913 2 PAINT widget=01C9549C name=noname id=000A0312 rect= 0,0 16,16 3 PAINT widget=01D5E98C name=noname id=00060384 rect= 0,0 986,803 4 PAINT widget=01A8CE1C name=noname id=000A02D0 rect= 94,24 487,38 5 PAINT widget=01A8CE1C name=noname id=000A02D0 rect=206,3 776,51 6 PAINT widget=01C9549C name=noname id=000A0312 rect= 0,0 16,16 7 PAINT widget=01A8CE1C name=noname id=000A02D0 rect= 0,23 986,65

Michal Novotny [:michal]

Assignee

Comment 49

•

18 years ago

It seems that lot of inline functions aren't inlined because some default gcc limits are exceeded. When -Winline option is specified I can see warnings like this: warning: inlining failed in call to ‘void nsCachedStyleData::Destroy(P RUint32, nsPresContext*)’: --param large-function-growth limit reached I have no such warnings when I set following limits to gcc/g++ in flags: --param max-inline-insns-single=1000 --param large-function-growth=1700 --param inline-unit-growth=200 The large-function-growth is quite high, only 200 is enough for everything except layout/style/nsStyleSet.cpp that needs 1700. With this flags I have only a little bigger linux build (26460160 instead of 26081280) but it is quite faster (585ms instead of 630ms). But maybe official builds are done with custom-built gcc with higher defaults limits?

Jonas Sicking (:sicking) No longer reading bugmail consistently

Comment 50

•

18 years ago

There is a known issue with GCC 3.1.2 which won't inline things properly. Can't find the bug filed about it right now. Unfortunately that's the version many distros are using.

Michal Novotny [:michal]

Assignee

Comment 51

•

18 years ago

I spent some time looking at valgrind output on linux but profile data of whole startup is too messy. I tried to separate few comprehensive non-overlapping parts using --toggle-collect option with selected functions: Whole startup (including CCIfUserInactive) 1077202801 CCIfUserInactive 76353317 startup w/o CCIfUserInactive 1000849484 100% cca 600ms SheetLoadData::OnStreamComplete 307915061 30.8% cca 184.6ms PresShell::ReflowEvent::Run 120069008 12% cca 72ms mozJSComponentLoader::LoadModule 99339364 10% cca 59.6ms nsAppStartup::CreateHiddenWindow 81380310 8.1% cca 48.8ms nsLookAndFeelConstructor 18428861 1.8% cca 11ms --------------------------------------------------------------------------- covered by selected parts 627132604 62.66% cca 376ms Absolute numbers are count of valgrind's IR instructions. CCIfUserInactive is subtracted from total count because it is performed only when code is instrumented by valgrind. Calculation of spent time isn't perfect because only IRs in user space are counted. It is also disputable if spent time can be computed proportionally from IR count, but no better metric is available. Here are some more data about loading CSS (SheetLoadData::OnStreamComplete) which takes significant amount of startup time. Top 10 functions sorted by time spent by themself. Note that in graphs are times spent with children in percents. Btw call counts are reliable only 1 level back and forward from inspected function. On other levels they are somehow approximated. -------------------------------------------------------------------------------- Ir function -------------------------------------------------------------------------------- 15,278,356 js_GetToken http://www.allpeers.com/download/michal/kcachegrind/OnStreamComplete_1.5_50_js_GetToken.png 12,684,221 nsUTF8ToUnicode::Convert(char const*, int*, unsigned short*, int*) http://www.allpeers.com/download/michal/kcachegrind/OnStreamComplete_1.5_50_nsUTF8ToUnicode+Convert.png 12,333,877 GetChar http://www.allpeers.com/download/michal/kcachegrind/OnStreamComplete_1.5_50_GetChar.png 12,220,425 strcmp http://www.allpeers.com/download/michal/kcachegrind/OnStreamComplete_1.5_50_strcmp.png 8,249,724 _int_malloc http://www.allpeers.com/download/michal/kcachegrind/OnStreamComplete_1.5_50_malloc.png 8,243,489 little2_updatePosition http://www.allpeers.com/download/michal/kcachegrind/OnStreamComplete_1.5_50_little2_updatePosition.png 5,111,469 nsCSSScanner::Read(unsigned&) http://www.allpeers.com/download/michal/kcachegrind/OnStreamComplete_1.5_50_nsCSSScanner+Read.png 4,705,887 g_key_file_lookup_group_node http://www.allpeers.com/download/michal/kcachegrind/OnStreamComplete_1.5_50_g_key_file_lookup_group_node.png 4,514,941 nsAString_internal::ReplacePrep(unsigned, unsigned, unsigned) http://www.allpeers.com/download/michal/kcachegrind/OnStreamComplete_1.5_50_nsAString_internal+ReplacePrep.png 4,279,656 little2_cdataSectionTok http://www.allpeers.com/download/michal/kcachegrind/OnStreamComplete_1.5_50_little2_cdataSectionTok.png Strcmp and g_key_file_lookup_group_node lead to nsIconChannel::Init (http://www.allpeers.com/download/michal/kcachegrind/OnStreamComplete_1.5_50_nsIconChannel+Init.png) that calls gtk_widget_render_icon what takes 13,36% (cca 25ms). All other functions are called due to XML parsing or JS compilation. Could fixing bug #196843 help here?

Boris Zbarsky [:bzbarsky]

Comment 52

•

18 years ago

Michal, I think most of what you're seeing is that OnStreamComplete calls back into the parser to restart the XUL/XML/HTML parsing process once it finishes parsing the stylesheet. So there's a lot of code "under" OnStreamComplete that has nothing to do with it. I'm looking at js_GetToken here, at the very least. If you want to see what part of the code is really under OnStreamComplete, you want to exclude everything called via the StyleSheetLoaded calls in CSSLoaderImpl::SheetComplete. For some of the things on your list (e.g. the CSS scanner and some of the string stuff), I do think that fixing bug 196843 might help. But it's hard to tell until we separate out the parts that are really happening in CSS-land from the work happening off StyleSheetLoaded.

Michal Novotny [:michal]

Assignee

Comment 53

•

18 years ago

(In reply to comment #52) StyleSheetLoaded takes 92.57% http://www.allpeers.com/download/michal/kcachegrind/OnStreamComplete_1.5_50_CSSLoaderImpl+ParseSheet.png So CSS fast loading can influence only CSSParseImpl::Parse part (7.12%)? This include also parsing all xml files referred in css files?

Boris Zbarsky [:bzbarsky]

Comment 54

•

18 years ago

> So CSS fast loading can influence only CSSParseImpl::Parse part (7.12%)? Yes. > This include also parsing all xml files referred in css files? You mean XBL bindings? If so, no. Those are loaded asynchronously from a totally different place. And CSS fastloading wouldn't affect them anyway.

Michal Novotny [:michal]

Assignee

Comment 55

•

18 years ago

There is a lot of activity in JS: 0.61% 425380x JS_CallTracer'2 1.37% 343723x GetChar 1.69% 241201x js_GetToken 0.19% 198749x js_UngetToken 0.10% 190226x GetGCThingFlags 0.17% 181610x js_TraceId 0.33% 172224x js_MatchToken and many others... To be honest I must say that I'm completely lost in this code. At a first look it seems that they are just called too often. But due to cycles no reasonable callgraph can be produced. Does it make sense to try to optimize this? If yes, can somebody please kick me into right direction?

Brendan Eich [:brendan]

Comment 56

•

18 years ago

(In reply to comment #55) > There is a lot of activity in JS: > > 0.61% 425380x JS_CallTracer'2 GC is running -- that shouldn't happen during startup, if the changes jst made a while back still work. > 1.37% 343723x GetChar > 1.69% 241201x js_GetToken > 0.19% 198749x js_UngetToken > 0.33% 172224x js_MatchToken JS compilation should not happen due to FastLoad, but clearly something is compiling JS from source rather than XDR'ing bytecodes. What is on the callstack that is invoking JS_Compile* or JS_Evaluate* APIs? > 0.10% 190226x GetGCThingFlags > 0.17% 181610x js_TraceId More GC footprints. > and many others... > > To be honest I must say that I'm completely lost in this code. At a first look > it seems that they are just called too often. But due to cycles no reasonable > callgraph can be produced. Does it make sense to try to optimize this? If yes, > can somebody please kick me into right direction? We have long-standing methods to avoid JS GC and compilation costs from accruing at startup, but they seem not to be working. So two action items: * restore the GC avoidance that jst engineered to working order; * find out what is compiling JS and switch it to fastload. /be

Robert Sayre

Comment 57

•

18 years ago

(In reply to comment #56) > * find out what is compiling JS and switch it to fastload. I'm pretty sure these are calls to CompileFunction() in /content/xbl/src/nsXBLProtoImplMethod.cpp and /content/xbl/src/nsXBLProtoImplProperty.cpp, but someone should verify.

Boris Zbarsky [:bzbarsky]

Comment 58

•

18 years ago

> JS compilation should not happen due to FastLoad, We don't fastload XBL. I'll bet money it's the issue. Sadly, I'm not convinced that fastloading things in bite-sized chunks (one field/property/method at a time, possibly involving muxing on every one, etc) is so great. We might be able to do better if we fastload the entire prototype binding at once or something...

Mike Schroepfer

Comment 59

•

18 years ago

(In reply to comment #58) > > JS compilation should not happen due to FastLoad, > > We don't fastload XBL. I'll bet money it's the issue. That's bug 94199 > Sadly, I'm not convinced that fastloading things in bite-sized chunks (one > field/property/method at a time, possibly involving muxing on every one, etc) > is so great. We might be able to do better if we fastload the entire prototype > binding at once or something... What if we did a Ts/Txul run and logged every file, resource, and property touched. Then did a pre-process where we concat all that stuff into one big file that we mmap in? Sounds easy in english! :-)

Mike Schroepfer

Comment 60

•

18 years ago

Also - see Bug 412796

Johnny Stenback (:jst)

Comment 61

•

18 years ago

(In reply to comment #56) > (In reply to comment #55) > > There is a lot of activity in JS: > > > > 0.61% 425380x JS_CallTracer'2 > > GC is running -- that shouldn't happen during startup, if the changes jst made > a while back still work. Last I checked (a couple of weeks ago), the change to keep GC out of startup did work, but it's time based, 10s into startup we can do a GC. And naturally if this is running under a profiler getting startup times of 10s+ is not hard, and then we'd see GC as part of startup in those profiles. I'd suggest setting NS_FIRST_GC_DELAY in dom/src/base/nsJSEnvironment.cpp to a *really* large number when profiling startup to avoid this issue.

Michal Novotny [:michal]

Assignee

Comment 62

•

18 years ago

(In reply to comment #57) > (In reply to comment #56) > > * find out what is compiling JS and switch it to fastload. > > I'm pretty sure these are calls to CompileFunction() in > /content/xbl/src/nsXBLProtoImplMethod.cpp and > /content/xbl/src/nsXBLProtoImplProperty.cpp, but someone should verify. > There is 754 calls to js_CompileFunctionBody and js_CompileScript. 334x from nsXBLProtoImplMethod::CompileMember 326x from nsXBLProtoImplProperty::CompileMember 85x from nsXBLProtoImplField::InstallField 9x from other places

Michal Novotny [:michal]

Assignee

Comment 63

•

18 years ago

(In reply to comment #61) > Last I checked (a couple of weeks ago), the change to keep GC out of startup > did work, but it's time based, 10s into startup we can do a GC. And naturally > if this is running under a profiler getting startup times of 10s+ is not hard, > and then we'd see GC as part of startup in those profiles. > > I'd suggest setting NS_FIRST_GC_DELAY in dom/src/base/nsJSEnvironment.cpp to a > *really* large number when profiling startup to avoid this issue. I know about it and I tried to exclude calls to GC from timer in profiling. Anyway I set NS_FIRST_GC_DELAY to 100000000 and tried again. There are 3 calls to js_CG. Backtrace of the first call: Breakpoint 1, js_GC (cx=0x8ec1788, gckind=GC_NORMAL) at /opt/moz/TRUNK_new3/mozilla/js/src/jsgc.c:2326 2326 rt = cx->runtime; Current language: auto; currently c (gdb) bt #0 js_GC (cx=0x8ec1788, gckind=GC_NORMAL) at /opt/moz/TRUNK_new3/mozilla/js/src/jsgc.c:2326 #1 0x00127ecf in js_DestroyContext (cx=0x8ec1788, mode=JSDCM_FORCE_GC) at /opt/moz/TRUNK_new3/mozilla/js/src/jscntxt.c:433 #2 0x00121a0c in JS_DestroyContext (cx=0x8ec1788) at /opt/moz/TRUNK_new3/mozilla/js/src/jsapi.c:1011 #3 0x00f2bebd in XPCJSContextStack::SetSafeJSContext (this=0x8e78670, aSafeJSContext=0x905d1f8) at /opt/moz/TRUNK_new3/mozilla/js/src/xpconnect/src/xpcthreadcontext.cpp:265 #4 0x00f2cb46 in nsXPCThreadJSContextStackImpl::SetSafeJSContext (this=0x8e7f8e8, aSafeJSContext=0x905d1f8) at /opt/moz/TRUNK_new3/mozilla/js/src/xpconnect/src/xpcthreadcontext.cpp:421 #5 0x014dbc86 in nsAppShellService::SetXPConnectSafeContext (this=0x9033440) at /opt/moz/TRUNK_new3/mozilla/xpfe/appshell/src/nsAppShellService.cpp:123 #6 0x014dc4db in nsAppShellService::CreateHiddenWindow (this=0x9033440, aAppShell=0x8e72d58) at /opt/moz/TRUNK_new3/mozilla/xpfe/appshell/src/nsAppShellService.cpp:197 #7 0x014ecf52 in nsAppStartup::CreateHiddenWindow (this=0x8ec0530) at /opt/moz/TRUNK_new3/mozilla/toolkit/components/startup/src/nsAppStartup.cpp:149 #8 0x00f00515 in XRE_main (argc=5, argv=0xbfb16f54, aAppData=0x8da81e0) at /opt/moz/TRUNK_new3/mozilla/toolkit/xre/nsAppRunner.cpp:3119 #9 0x08048aba in main (argc=5, argv=0xbfb16f54) at /opt/moz/TRUNK_new3/mozilla/browser/app/nsBrowserApp.cpp:158 Second and third are executed from JS_MaybeGC() because the condition is true. (gdb) bt 5 #0 js_GC (cx=0x909f5d8, gckind=GC_NORMAL) at /opt/moz/TRUNK_new3/mozilla/js/src/jsgc.c:2326 #1 0x001206ca in JS_GC (cx=0x909f5d8) at /opt/moz/TRUNK_new3/mozilla/js/src/jsapi.c:2396 #2 0x00120735 in JS_MaybeGC (cx=0x909f5d8) at /opt/moz/TRUNK_new3/mozilla/js/src/jsapi.c:2466 #3 0x012f2d25 in nsJSContext::ScriptEvaluated (this=0x909f5a0, aTerminated=1) at /opt/moz/TRUNK_new3/mozilla/dom/src/base/nsJSEnvironment.cpp:3218 #4 0x012f0ded in nsJSContext::EvaluateStringWithValue (this=0x909f5a0, aScript=@0xbfb15b94, aScopeObject=0xb130b460, aPrincipal=0x0, aURL=0x937a868 "chrome://global/content/bindings/toolbar.xml", aLineNo=79, aVersion=0, aRetValue=0xbfb15ba8, aIsUndefined=0xbfb15ba4) at /opt/moz/TRUNK_new3/mozilla/dom/src/base/nsJSEnvironment.cpp:1337 (More stack frames follow...) (gdb) f 2 #2 0x00120735 in JS_MaybeGC (cx=0x909f5d8) at /opt/moz/TRUNK_new3/mozilla/js/src/jsapi.c:2466 2466 JS_GC(cx); (gdb) p bytes $1 = 532480 (gdb) p lastBytes $2 = 262144 JS backtrace: chrome://global/content/bindings/toolbar.xml:92 (gdb) bt 5 #0 js_GC (cx=0x909f5d8, gckind=GC_NORMAL) at /opt/moz/TRUNK_new3/mozilla/js/src/jsgc.c:2326 #1 0x001206ca in JS_GC (cx=0x909f5d8) at /opt/moz/TRUNK_new3/mozilla/js/src/jsapi.c:2396 #2 0x00120735 in JS_MaybeGC (cx=0x909f5d8) at /opt/moz/TRUNK_new3/mozilla/js/src/jsapi.c:2466 #3 0x012f2d25 in nsJSContext::ScriptEvaluated (this=0x909f5a0, aTerminated=0) at /opt/moz/TRUNK_new3/mozilla/dom/src/base/nsJSEnvironment.cpp:3218 #4 0x012ed18f in nsJSContext::ScriptExecuted (this=0x909f5a0) at /opt/moz/TRUNK_new3/mozilla/dom/src/base/nsJSEnvironment.cpp:3283 (More stack frames follow...) (gdb) f 2 #2 0x00120735 in JS_MaybeGC (cx=0x909f5d8) at /opt/moz/TRUNK_new3/mozilla/js/src/jsapi.c:2466 2466 JS_GC(cx); (gdb) p bytes $3 = 712704 (gdb) p lastBytes $4 = 532480 JS backtrace: chrome://browser/content/browser.js:6062 FullZoom_get__cps() chrome://browser/content/browser.js:6089 FullZoom_init() chrome://browser/content/browser.js:870 delayedStartup() Any ideas?

Mike Shaver (:shaver emeritus)

Comment 64

•

18 years ago

For the MaybeGC cases, I wonder if the work in bug 417052 would help; we're finding that the heuristic there is a little too easily tripped for our performance-tastes.

Michal Novotny [:michal]

Assignee

Comment 65

•

18 years ago

(In reply to comment #64) > For the MaybeGC cases, I wonder if the work in bug 417052 would help; we're > finding that the heuristic there is a little too easily tripped for our > performance-tastes. > Raising the constant from 1.33 to 4 helps. So planned 16 will solve this.

Michal Novotny [:michal]

Assignee

Updated

•

18 years ago

Depends on: 121341

:Felipe Gomes (needinfo for replies!)

Comment 66

•

12 years ago

Probably no longer relevant to track anything in this bug.. If there's more startup analysis to be done it will happen on newer bugs

Status: ASSIGNED → RESOLVED

Closed: 12 years ago

Resolution: --- → WORKSFORME

output of opreport 18 years ago Michal Novotny [:michal] 81.57 KB, application/octet-stream		Details
ff3 startup timeline 18 years ago Sudheer 37.75 KB, text/plain		Details