Closed Bug 406810 Opened 17 years ago Closed 11 years ago

Analyze startup time of Firefox3 (TRUNK)

Categories

(Firefox :: General, enhancement)

Other
Linux
enhancement
Not set
normal

Tracking

()

RESOLVED WORKSFORME

People

(Reporter: michal, Assigned: michal)

References

()

Details

(Keywords: perf)

Attachments

(2 files)

User-Agent:       Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)
Build Identifier: 

The goal is to speed up startup of FF3. I'll do an analysis of startup on linux using oprofiler. Further steps depends on obtained data.

Reproducible: Always
Version: unspecified → Trunk
Assignee: nobody → michal
(In reply to comment #1)
> Some graphs that are useful for understanding regression:
> 
> http://graphs.mozilla.org/#spst=range&spstart=1124636400&spend=1178280000&bpst=Cursor&bpstart=1124636400&bpend=1178280000&m1tid=9315&m1bl=0&m1avg=0&m2tid=7312&m2bl=0&m2avg=0
> 

Is somewhere description of what all the available data on the page mean and how they were obtained?
Attached file output of opreport
I tested startup of MOZILLA_1_8_BRANCH and HEAD and found warm startup of HEAD a little bit faster on i386 linux.

10 startups in a row in ms:

MOZILLA_1_8_BRANCH:
835, 839, 835, 836, 836, 840, 839, 833, 835, 833

HEAD:
832, 829, 831, 831, 833, 832, 830, 831, 831, 831

Cold startup is slower in case of HEAD. I'm measuring it using:
sudo bash -c "sync; echo 3 > /proc/sys/vm/drop_caches"; ./startup-unix.pl ../../../_obj-browser-release-32-fp/dist/bin/firefox -P ...

MOZILLA_1_8_BRANCH:
10986, 10818, 10123, 11509, 10734, 10693, 10643, 10636, 10646, 10519

HEAD:
12142, 12200, 12284, 12159, 12217, 12509, 12350, 12075, 12433, 12900

Data from profiler are in attachment. Callgraph output and annotated source aren't included because of their size. To get more stable data I captured 10 startups and merged them together. That's why there are multiple [vdso] with different tgid.

Some indispensable CPU time is used for memory allocations, and symbols lookup. Both increased in HEAD version.
I tried to run FF with LD logging using LD_DEBUG_OUTPUT=/tmp/ld_output_ff.log LD_DEBUG=symbols and it seems that number of symbols increased:

wc -l ld_output_ff3.log.24023 ld_output_ff2.log.24313  
  337393 ld_output_ff3.log.24023 
  261167 ld_output_ff2.log.24313 

I'll try to do multiple profilings to see that results are really stable and comparable.
I did few tests with dynamic linking to see how much can be spared on this part. For some unknown reason I can't get annotation of do_lookup_x() in do-lookup.h from oprofiler. So I just tried to minimize number of the while loops in this function. By analyzing output of LD_DEBUG=symbols I got following result :

   1107 file=/usr/lib/libstdc++.so.6 
    975 file=./libxpcom_core.so 
    939 file=/lib/libc.so.6 
    753 file=/lib/libglib-2.0.so.0 
    597 file=/usr/lib/libbonobo-2.so.0 
    487 file=/usr/lib/libORBit-2.so.0 
    451 file=/usr/lib/libX11.so.6 
    390 file=/lib/libgobject-2.0.so.0 
    390 file=./firefox-bin 
    384 file=./libnspr4.so 
    316 file=./libmozjs.so 
    245 file=/usr/lib/libpango-1.0.so.0 
    201 file=/lib/libpthread.so.0 
    182 file=/usr/lib/libgdk-x11-2.0.so.0 
    177 file=/usr/lib/libgconf-2.so.4 
    176 file=./libsqlite3.so 
    150 file=/usr/lib/libfreetype.so.6 
    149 file=/usr/lib/libgtk-x11-2.0.so.0 
    135 file=./libnss3.so 
    117 file=/usr/lib/libbonobo-activation.so.4 
    108 file=/lib/libasound.so.2 
    107 file=/usr/lib/libXdmcp.so.6 
    107 file=/lib/libdbus-1.so.3 
     97 file=/usr/lib/libXt.so.6 
     83 file=/usr/lib/libaudiofile.so.0 
     80 file=/lib/libkeyutils.so.1 
     76 file=./libplds4.so 
     71 file=/usr/lib/libkrb5.so.3 
     62 file=/usr/lib/libxml2.so.2 
     58 file=/usr/lib/libk5crypto.so.3 
     57 file=/usr/lib/libXrender.so.1 
     57 file=/usr/lib/libcairo.so.2 
     50 file=/usr/lib/libfontconfig.so.1 
     48 file=/usr/lib/libORBitCosNaming-2.so.0 
     44 file=/usr/lib/libgssapi_krb5.so.2 
     43 file=/usr/lib/libpangoft2-1.0.so.0 
     43 file=./libxpcom.so 
     37 file=./libplc4.so 
     ...

This means that during FF startup there were 1107 lookups for symbols in libstdc++.so.6, 975 in libxpcom_core.so ...
Order of shared libraries in which ld tries to find symbols can be changed without recompilation by using LD_PRELOAD. I measured startup with standard order and with "optimized" order (sorted by number of resolved symbols). There was no gain from changing order. In fact "optimized" order was slower by few milliseconds even if number of lookups was considerably smaller:

# grep -e "lookup in file" ld_output_ff_normal.log.9612 | wc -l
336619
# grep -e "lookup in file" ld_output_ff_optimized_1.log.9642 | wc -l
242599


I also tried to reduce number of lookups by linking libstc++ statically. In this case startup was faster by about 8 ms (1% of startup time). This isn't too much. Anyway, is there some reason why building with --enable-static doesn't produce single binary?
Really need :bs and :luser on this bug.

/be
Status: UNCONFIRMED → NEW
Ever confirmed: true
Are you compiling with the same mozconfig that the nightly tinderboxes use?
http://lxr.mozilla.org/mozilla/source/tools/tinderbox-configs/firefox/

I notice you mention "static", but we're shipping builds in the libxul configuration these days, where most of the code gets linked into libxul.so, and the Firefox binary is just a little shim.

Yes, anything that has an libxpcom_core.so is not a valid target for testing things like dynamic linker overhead.
Attached file ff3 startup timeline
I generated the time line (by enabling NS_TIMELINE) on Debian linux PC(attached). While talking about improving start-up performance, somebody mentioned that making fastload files architecture specific will greatly speed-up. I am relatively new to mozilla code. Can somebody throw some light on fastload system? to jump start me?
(in the log, nsIOService::NewChannelFromURI () seems to be used a lot during start-up. Are the files loaded (jar, xul, css, dtd,...) by this method the fast loaded files?
They seem to be plain xul, css, dtd files and does not seem like pre-compiled/processed. What are fast-loaded files? How are these generated? Where are these processed? What part can be made architecture-specific to improve speed. I am exploring the code and debugging to get more understanding, but some quick help would be highly appreciated.

Sorry, if bugzilla is not the right place to discuss/ask for help.
I believe the mozconfig for doing this analysis should be as follows:

. $topsrcdir/browser/config/mozconfig
mk_add_options MOZ_OBJDIR=@TOPSRCDIR@/ff-opt-libxul
ac_add_options --enable-optimize
ac_add_options --enable-libxul

ac_add_options --disable-debug
ac_add_options --disable-static
ac_add_options --disable-shared
ac_add_options --disable-tests


This is a bit redundant given the defaults, but it makes things clear.
And what is the right mozconfig for 1.8 BRANCH? With this mozconfig configure fails:

configure: error: Only one of --enable-shared or --enable-static must be specified.
*** Fix above errors and then restart with               "make -f client.mk build"
My distribution Firefox 2 has the following build options (about:buildconfig), so they're probably the ones to use for performance measurement:

--enable-application=browser --enable-update-channel=release --enable-official-branding --enable-optimize --disable-debug --disable-tests --enable-static --disable-shared --enable-svg --enable-canvas --enable-update-packaging
Status: NEW → ASSIGNED
So I have now following mozconfigs.

MOZILLA_1_8_BRANCH:
-------------------------------------------------
mk_add_options MOZ_CO_PROJECT=browser
mk_add_options PROFILE_GEN_SCRIPT=@TOPSRCDIR@/build/profile_pageloader.pl
mk_add_options MOZ_CO_MODULE="mozilla/tools/update-packaging mozilla/tools/codesighs"
ac_add_options --enable-application=browser
ac_add_options --enable-update-channel=nightly
ac_add_options --enable-update-packaging
ac_add_options --enable-optimize
ac_add_options --disable-debug
ac_add_options --disable-tests
ac_add_options --enable-codesighs
mk_add_options MOZ_OBJDIR=@TOPSRCDIR@/_obj-browser-release-32-tb-fp
ac_add_options --enable-static
ac_add_options --disable-shared
ac_add_options --enable-system-cairo
ac_add_options --enable-default-toolkit=gtk2
ac_add_options --enable-xft
ac_add_options --disable-freetype2
ac_add_options --enable-svg
ac_add_options --enable-canvas
CC="gcc -fno-omit-frame-pointer -gstabs+"
CXX="g++ -fno-omit-frame-pointer -gstabs+"
-------------------------------------------------

HEAD:
-------------------------------------------------
mk_add_options MOZ_CO_PROJECT=browser
mk_add_options PROFILE_GEN_SCRIPT=@TOPSRCDIR@/build/profile_pageloader.pl
mk_add_options MOZ_CO_MODULE="mozilla/tools/update-packaging mozilla/tools/codesighs"
ac_add_options --enable-application=browser
ac_add_options --enable-update-channel=nightly
ac_add_options --enable-update-packaging
ac_add_options --enable-optimize
ac_add_options --disable-debug
ac_add_options --disable-tests
ac_add_options --enable-codesighs
mk_add_options MOZ_OBJDIR=@TOPSRCDIR@/_obj-browser-release-32-tb-fp
ac_add_options --enable-libxul
ac_add_options --disable-static
ac_add_options --disable-shared
CC="gcc -fno-omit-frame-pointer -gstabs+"
CXX="g++ -fno-omit-frame-pointer -gstabs+"
-------------------------------------------------

Startup times are now:
FF3 693
FF2 661
Data from profiler can be downloaded from:

http://www.allpeers.com/download/michal/ff2.tar.bz2
http://www.allpeers.com/download/michal/ff3.tar.bz2

They include callgraph output so they are too big to upload them to bugzilla. They represent 20 starts of FF merged together.

More time is spent in dynamic linker and sqlite3. Major difference that I can see is in pango usage:

--------------------------------------------------------------------------
  1         0.0124  libxul.so                nsThebesFontMetrics::GetHeight(int&)
  8071     99.9876  libxul.so                nsThebesFontMetrics::GetMetrics() const
4         0.0065  libxul.so                gfxPangoFont::GetMetrics()
  5374     66.5676  libpango-1.0.so.0.1800.3 pango_font_get_metrics
  2675     33.1351  libxul.so                gfxPangoFont::RealizePangoFont(int)
  17        0.2106  libxul.so                gfxPangoFont::GetCharSize(char, gfxSize&, gfxSize&, unsigned int*)
  4         0.0495  libxul.so                gfxPangoFont::GetMetrics() [self]
  1         0.0124  libxul.so                nsACString_internal::Equals(char const*, nsCStringComparator const&) const
  1         0.0124  libxul.so                GetPangoLanguage(nsACString_internal const&)
  1         0.0124  libfreetype.so.6.3.16    FT_Get_Sfnt_Table
--------------------------------------------------------------------------

8071 is 13.17% of all startup time.
Pav/JDaggett looks like pango might be part of the issue here ... or more specifically nsThebesFontMetrics::GetMetrics().  Is there anything we can do to optimize this?
Linux doesn't have OS-global caching of font data (only availability), and we're doing things The Linux Way, so AIUI our two paths look like they'll be:

1) hope and push for such caching to appear, in which case the overhead will go basically to zero once that's done and all our users have upgraded

2) use our Win32 font selection code, implement our own caching, and endure the endless complaints about how we just can't play nice on Linux :)

(We could short-circuit the time delay from 1) by shipping our own Pango, but -- yeah, no.)
Blocks: 397947
Did you have a chance to check out windows?
(In reply to comment #17)

Finally I have some data from windows profiler. I used AQTime with user+kernel counting. It is hard to compare results because there were lot of changes between 1.8 branch and trunk. Anyway I created 3 tables that shows difference between FF2 and FF3 for modules (this is useless), source files and routines. Data can be downloaded as ODS or XLS from:

http://www.allpeers.com/download/michal/prof_ods.zip
http://www.allpeers.com/download/michal/prof_xls.zip

All times are in microseconds.
Can we generate web-friendly views for this data? Maybe something like http://simile.mit.edu/exhibit/?
I guess Exhibit is probably not worth the effort, but maybe sticking the data in Google Spreadsheets if it can handle the data volume.
Blocks: 364272
Google spreadsheets can't take that much data - so imported the first 300 rows at the URL (also here):

http://spreadsheets.google.com/pub?key=pq20lsXjZGAykkHScEtvQVg

Some things of note:

JS_CallTracer
js_TraceObject
GetGCThingFlags
js_TraceScopeProperty
js_TraceId

Are top five on 1.9 and don't exist in 1.8.

PR_Malloc is #7 and is called 3k more times on 1.9 than 1.8 (surprised given the alloc reudctions we've done..)

nsID:Equals #11

nsLocalFile::EnsureShortPath doesn't show up in 1.8 but is high in 1.9

JS_DHashTableEnumerate called 4x more in 1.9 than 1.8 but takes 20x more time

js_LookupPropertyWithFlags 2x time in 1.9 from 1.8

nsStringBuffer::Realloc 2x time in 1.9 than 1.8..

_cairo_win32_surface_set_clip_region is new for 1.9 (no surprise)




jst pointed out that we whacked the DOM's JSGC_BEGIN callback in a past release to hold off all GC during startup for 10 sec. or so -- that seems to have regressed, and jst mentioned Smaug's name ;-).

/be
> Are top five on 1.9 and don't exist in 1.8.

Those are all JS GC.  We should definitely try to avoid that during startup, if it's happening then.

> nsID:Equals #11

Do we have callstacks?  Is this just QI, or something else going on?

> nsStringBuffer::Realloc 2x time in 1.9 than 1.8..

Again, do we have callstacks?  That would be a good one to try and nix. It looks like we have about 30% more calls and 2x the time, right?
callstacks for the top 5 PR_Mallocs would be nice too.
(In reply to comment #23)
> > Are top five on 1.9 and don't exist in 1.8.
> 
> Those are all JS GC.  We should definitely try to avoid that during startup, if
> it's happening then.
> 
> > nsID:Equals #11
> 
> Do we have callstacks?  Is this just QI, or something else going on?
> 
> > nsStringBuffer::Realloc 2x time in 1.9 than 1.8..
> 
> Again, do we have callstacks?  That would be a good one to try and nix. It
> looks like we have about 30% more calls and 2x the time, right?
> 

I'm reading 2x time and 3x fewer calls.

Michal can we get call stacks for some of these?
(In reply to comment #22)
> jst pointed out that we whacked the DOM's JSGC_BEGIN callback in a past release
> to hold off all GC during startup for 10 sec. or so -- that seems to have
> regressed, and jst mentioned Smaug's name ;-).
I changed CC handling (CC then calls GC), but the first CC shouldn't happen
during startup. One can enable DEBUG_smaug in nsJSEnvironment.cpp too see
when cycle collector is called. First time I see it is around 10 sec after
startup (if user is inactive all the time).
The CC call happens because sGCTimer fires and the first time that timer
is initialized, NS_FIRST_GC_DELAY (10000ms) is used as delay.

Is it possible that changes to nsJSContext::DOMBranchCallback have regressed
something? Should we prevent (or delay) calling JS_MaybeGC there during startup?
Or in nsJSContext::ScriptEvaluated?

One thing that's likely to make GC show up more in profile data is that the profilers slow things down enough that we actually fire that 10s timer during startup, whereas we wouldn't in normal runs (unless the system is *really* slow, or really busy).
> nsLocalFile::EnsureShortPath doesn't show up in 1.8 but is high in 1.9

Was this done against a libxul build? We know that EnsureShortPath is expensive and shows up a lot in dynamic builds during component registration, but we don't call it that much in libxul builds. Stacks would be good.
> I'm reading 2x time and 3x fewer calls.

Hit Count is number of calls, right?
(In reply to comment #23)
> > Are top five on 1.9 and don't exist in 1.8.
> 
> Those are all JS GC.  We should definitely try to avoid that during startup, if
> it's happening then.

This seems to be problem on my side. I started FF with parameter "-url file://C:\moz\TRUNK\mozilla\tools\performance\startup\quitForMac.html" and measured whole run of firefox. I hoped that ending code will not influence profiling too much. I tried to create an action in AQTime that disables profiling when nsAppStartup::Quit is called but profiling data are IMO incorrect. Anyway I will provide them soon (with callgraphs).
Oh, yeah.  Shutdown does various slow things like sqlite flushing and GC...
(In reply to comment #29)
> > I'm reading 2x time and 3x fewer calls.
> 
> Hit Count is number of calls, right?
> 

Thought so...
So maybe I'm misreading things, but for the string Realloc there is a hit count of 24000-some, another of 1800-some, and a 600-some that I assume is the difference.  Hence my 30% number.  It's not clear which columns in the spreadsheet are 1.9 and which are 1.8, so not sure whether that's a 30% increase or decrease.
New data can be found at:
http://www.allpeers.com/download/michal/new/

It differs from previous data so that profiling was now stopped on call nsAppStartup::Quit that is called from quitForMac.html. So anything related to exitting firefox shouldn't be now in profiling data. Anyway at least first two methods (XRE_main and nsBaseAppShell::Run) show incorrect values for Time. Maybe because they weren't finished when profiling was stopped (bug in AQTime?). The rest seem to be sane (at least I hope).

File ff2_ff3_routines_stop_at_quit contains all routines with differences between 1.8 and 1.9 sorted by time difference descending. For first 30 are callgraphs in "callgraph/FF{2,3}" subdirectory.

File ff2_ff3_routines_stop_at_quit_hit_ratio contains only these routines that have time difference bigger that 500 microseconds and are sorted by hit ratio. For all that have ratio more than 2 are also callgraphs available.

I wasn't able to export all callgraph data from AQTime, so only the minimum amount mentioned above is present as PNG.
(In reply to comment #32)
> (In reply to comment #29)
> > > I'm reading 2x time and 3x fewer calls.
> > 
> > Hit Count is number of calls, right?
> > 
> 
> Thought so...
> 

Yes, it means number of routine calls that were profiled.

More info about values can be found in manual.
http://www.automatedqa.com/downloads/Docs/aqtime_user_manual.pdf
(In reply to comment #33)
> So maybe I'm misreading things, but for the string Realloc there is a hit count
> of 24000-some, another of 1800-some, and a 600-some that I assume is the
> difference.  Hence my 30% number.  It's not clear which columns in the
> spreadsheet are 1.9 and which are 1.8, so not sure whether that's a 30%
> increase or decrease.
> 

There are labels in original spreadsheets (not in google docs). First set of data is for 1.8, second set for 1.9 and third is the difference.
(In reply to comment #28)
> > nsLocalFile::EnsureShortPath doesn't show up in 1.8 but is high in 1.9
> 
> Was this done against a libxul build? We know that EnsureShortPath is expensive
> and shows up a lot in dynamic builds during component registration, but we
> don't call it that much in libxul builds. Stacks would be good.
> 

Yes, the 1.9 build contains huge xul.dll. I used mozconfig from tinderbox-configs + added debug symbols. Callgraph for this method is at http://www.allpeers.com/download/michal/new/callgraph/FF3/nsLocalFile+EnsureShortPath.png
(In reply to comment #25)
> > > nsID:Equals #11

http://www.allpeers.com/download/michal/new/callgraph/FF2/nsID+Equals.png
http://www.allpeers.com/download/michal/new/callgraph/FF3/nsID+Equals.png

Can't get more parent levels in one image :-/ If you need more, I will provide them separately.

> > 
> > Do we have callstacks?  Is this just QI, or something else going on?
> > 
> > > nsStringBuffer::Realloc 2x time in 1.9 than 1.8..

http://www.allpeers.com/download/michal/new/callgraph/FF2/nsStringBuffer+Realloc.png
http://www.allpeers.com/download/michal/new/callgraph/FF3/nsStringBuffer+Realloc.png

More parent levels are present in these, but 4th level is a little bit messy.

> > 
> > Again, do we have callstacks?  That would be a good one to try and nix. It
> > looks like we have about 30% more calls and 2x the time, right?

Yes, exactly 33.27% more hits and 93.65% more time according to new data.
The cairo functions that are showing up are consistent with the thinking that we're doing a lot more repaints/drawing than pre-threadmanager during startup..
(In reply to comment #39)
> The cairo functions that are showing up are consistent with the thinking that
> we're doing a lot more repaints/drawing than pre-threadmanager during startup..
> 

Is there some way to capture the event/paint timeline and figure out a way around this?
>http://www.allpeers.com/download/michal/new/callgraph/FF2/nsID+Equals.png
>http://www.allpeers.com/download/michal/new/callgraph/FF3/nsID+Equals.png

It does look like this is mostly QI.  One obvious change here is that the number of hits in nsGenericElement::QueryInterface has gone up by 25% or so.  Can we get the callers of nsGenericElement::QueryInterface?  Interestingly, the number of QIs on nsXULElement hasn't changed much, so the nsGenericElement stuff is some other elements.

There are also a lot more hits in nsStandardURL::QueryInterface (more than twice as many as on branch), so perhaps the callers of this would be good too.  

And BackstagePass::QueryInterface seems to have jumped.

Between them, those three account for 2/3 or more of the jump in nsID::Equals calls.  All might need several levels of callers to get to the actual place where the do_QueryInterface happened.

It might be worth filing a separate bug (blocking this one) for the QI issue so that we don't get too many confusing things all going on here at once.  But we should definitely look into this!

For the string realloc, could we get one more level on 1.9 at least for the top two 4th-level callers?  Those are the two nsSubstring::Appends there.
Depends on: 412300
Keywords: perf
(In reply to comment #41)
> For the string realloc, could we get one more level on 1.9 at least for the top
> two 4th-level callers?  Those are the two nsSubstring::Appends there.
> 

nsSubstring::Appends are top in 1.8. In 1.9 are on the top nsPersistentProperties::Load and nsCSSScanner::GatherIdent

http://www.allpeers.com/download/michal/new/callgraph/FF3/nsCSSScanner+GatherIdent.png
http://www.allpeers.com/download/michal/new/callgraph/FF3/nsPersistentProperties+Load.png
(In reply to comment #42)
> nsSubstring::Appends are top in 1.8. In 1.9 are on the top
> nsPersistentProperties::Load and nsCSSScanner::GatherIdent
> 
> http://www.allpeers.com/download/michal/new/callgraph/FF3/nsCSSScanner+GatherIdent.png

Bug 311566 has a patch to deal with nsCSSScanner::GatherIdent.
Ah, indeed.

GatherIdent is a sort of known issue.  For nsPersistentProperties, we might be having values (or keys?) that are too big for an autostring.  Might be worth using a string with a bigger stack buffer there.

I wonder whether we can do something about the expat driver there too (mCDataText, I bet).
(In reply to comment #44)
> Ah, indeed.
> 
> GatherIdent is a sort of known issue.  For nsPersistentProperties, we might be

is getting Bug 311566 done a possibility

> having values (or keys?) that are too big for an autostring.  Might be worth
> using a string with a bigger stack buffer there.
> 

Bug 121341 looks promising here - any chance we can get this done?
(In reply to comment #40)
> (In reply to comment #39)
> > The cairo functions that are showing up are consistent with the thinking that
> > we're doing a lot more repaints/drawing than pre-threadmanager during startup..
> > 
> 
> Is there some way to capture the event/paint timeline and figure out a way
> around this?

Yep, on windows all painting will go through widget/src/windows/nsWindow.cpp, nsWindow::OnPaint.  Adding a print of when that's hit, along with the paintRect (scroll down a bit) will give an idea of what area we're repainting and when.

... and as I say that, I see that there's a debug_DumpPaintEvent already in there, which needs nglayout.debug.paint_dumping to be set.  But that's only in debug builds, so you should be able to just comment out the NS_DEBUG's and set the pref and be good to go.
(In reply to comment #46)
> (In reply to comment #40)
> > (In reply to comment #39)
> > > The cairo functions that are showing up are consistent with the thinking that
> > > we're doing a lot more repaints/drawing than pre-threadmanager during startup..
> > > 
> > 
> > Is there some way to capture the event/paint timeline and figure out a way
> > around this?
> 
> Yep, on windows all painting will go through widget/src/windows/nsWindow.cpp,
> nsWindow::OnPaint.  Adding a print of when that's hit, along with the paintRect
> (scroll down a bit) will give an idea of what area we're repainting and when.
> 
> ... and as I say that, I see that there's a debug_DumpPaintEvent already in
> there, which needs nglayout.debug.paint_dumping to be set.  But that's only in
> debug builds, so you should be able to just comment out the NS_DEBUG's and set
> the pref and be good to go.
> 

Michal can you try this out and post some data to see if there are obvious places for us to prevent repaints? 

(In reply to comment #47)
> > Yep, on windows all painting will go through widget/src/windows/nsWindow.cpp,
> > nsWindow::OnPaint.  Adding a print of when that's hit, along with the paintRect
> > (scroll down a bit) will give an idea of what area we're repainting and when.
> > 
> > ... and as I say that, I see that there's a debug_DumpPaintEvent already in
> > there, which needs nglayout.debug.paint_dumping to be set.  But that's only in
> > debug builds, so you should be able to just comment out the NS_DEBUG's and set
> > the pref and be good to go.
> > 
> 
> Michal can you try this out and post some data to see if there are obvious
> places for us to prevent repaints? 
> 

There are only few events:

   1 PAINT      widget=01A8CE1C name=noname       id=000A02D0 rect=  0,0   986,913 
   2 PAINT      widget=01C9549C name=noname       id=000A0312 rect=  0,0    16,16  
   3 PAINT      widget=01D5E98C name=noname       id=00060384 rect=  0,0   986,803 
   4 PAINT      widget=01A8CE1C name=noname       id=000A02D0 rect= 94,24  487,38  
   5 PAINT      widget=01A8CE1C name=noname       id=000A02D0 rect=206,3   776,51  
   6 PAINT      widget=01C9549C name=noname       id=000A0312 rect=  0,0    16,16  
   7 PAINT      widget=01A8CE1C name=noname       id=000A02D0 rect=  0,23  986,65  
It seems that lot of inline functions aren't inlined because some default gcc limits are exceeded. When -Winline option is specified I can see warnings like this:

warning: inlining failed in call to ‘void nsCachedStyleData::Destroy(P 
RUint32, nsPresContext*)’: --param large-function-growth limit reached

I have no such warnings when I set following limits to gcc/g++ in flags:

--param max-inline-insns-single=1000
--param large-function-growth=1700
--param inline-unit-growth=200

The large-function-growth is quite high, only 200 is enough for everything except layout/style/nsStyleSet.cpp that needs 1700.

With this flags I have only a little bigger linux build (26460160 instead of 26081280) but it is quite faster (585ms instead of 630ms).

But maybe official builds are done with custom-built gcc with higher defaults limits?
There is a known issue with GCC 3.1.2 which won't inline things properly. Can't find the bug filed about it right now. Unfortunately that's the version many distros are using.
I spent some time looking at valgrind output on linux but profile data of whole startup is too messy. I tried to separate few comprehensive non-overlapping parts using --toggle-collect option with selected functions:

Whole startup (including CCIfUserInactive)   1077202801
CCIfUserInactive                               76353317
startup w/o CCIfUserInactive                 1000849484    100% cca 600ms

SheetLoadData::OnStreamComplete               307915061   30.8% cca 184.6ms
PresShell::ReflowEvent::Run                   120069008     12% cca 72ms
mozJSComponentLoader::LoadModule               99339364     10% cca 59.6ms
nsAppStartup::CreateHiddenWindow               81380310    8.1% cca 48.8ms
nsLookAndFeelConstructor                       18428861    1.8% cca 11ms
---------------------------------------------------------------------------
covered by selected parts                     627132604  62.66% cca 376ms

Absolute numbers are count of valgrind's IR instructions. CCIfUserInactive is subtracted from total count because it is performed only when code is instrumented by valgrind. Calculation of spent time isn't perfect because only IRs in user space are counted. It is also disputable if spent time can be computed proportionally from IR count, but no better metric is available.

Here are some more data about loading CSS (SheetLoadData::OnStreamComplete) which takes significant amount of startup time. Top 10 functions sorted by time spent by themself. Note that in graphs are times spent with children in percents. Btw call counts are reliable only 1 level back and forward from inspected function. On other levels they are somehow approximated.
-------------------------------------------------------------------------------- 
        Ir  function 
-------------------------------------------------------------------------------- 
15,278,356  js_GetToken
http://www.allpeers.com/download/michal/kcachegrind/OnStreamComplete_1.5_50_js_GetToken.png
12,684,221  nsUTF8ToUnicode::Convert(char const*, int*, unsigned short*, int*)
http://www.allpeers.com/download/michal/kcachegrind/OnStreamComplete_1.5_50_nsUTF8ToUnicode+Convert.png 
12,333,877  GetChar
http://www.allpeers.com/download/michal/kcachegrind/OnStreamComplete_1.5_50_GetChar.png 
12,220,425  strcmp
http://www.allpeers.com/download/michal/kcachegrind/OnStreamComplete_1.5_50_strcmp.png 
 8,249,724  _int_malloc
http://www.allpeers.com/download/michal/kcachegrind/OnStreamComplete_1.5_50_malloc.png 
 8,243,489  little2_updatePosition
http://www.allpeers.com/download/michal/kcachegrind/OnStreamComplete_1.5_50_little2_updatePosition.png 
 5,111,469  nsCSSScanner::Read(unsigned&)
http://www.allpeers.com/download/michal/kcachegrind/OnStreamComplete_1.5_50_nsCSSScanner+Read.png 
 4,705,887  g_key_file_lookup_group_node
http://www.allpeers.com/download/michal/kcachegrind/OnStreamComplete_1.5_50_g_key_file_lookup_group_node.png 
 4,514,941  nsAString_internal::ReplacePrep(unsigned, unsigned, unsigned)
http://www.allpeers.com/download/michal/kcachegrind/OnStreamComplete_1.5_50_nsAString_internal+ReplacePrep.png 
 4,279,656  little2_cdataSectionTok 
http://www.allpeers.com/download/michal/kcachegrind/OnStreamComplete_1.5_50_little2_cdataSectionTok.png

Strcmp and g_key_file_lookup_group_node lead to nsIconChannel::Init (http://www.allpeers.com/download/michal/kcachegrind/OnStreamComplete_1.5_50_nsIconChannel+Init.png) that calls gtk_widget_render_icon what takes 13,36% (cca 25ms). All other functions are called due to XML parsing or JS compilation. Could fixing bug #196843 help here?
Michal, I think most of what you're seeing is that OnStreamComplete calls back into the parser to restart the XUL/XML/HTML parsing process once it finishes parsing the stylesheet.  So there's a lot of code "under" OnStreamComplete that has nothing to do with it.  I'm looking at js_GetToken here, at the very least.

If you want to see what part of the code is really under OnStreamComplete, you want to exclude everything called via the StyleSheetLoaded calls in CSSLoaderImpl::SheetComplete.

For some of the things on your list (e.g. the CSS scanner and some of the string stuff), I do think that fixing bug 196843 might help.  But it's hard to tell until we separate out the parts that are really happening in CSS-land from the work happening off StyleSheetLoaded.
(In reply to comment #52)

StyleSheetLoaded takes 92.57%
http://www.allpeers.com/download/michal/kcachegrind/OnStreamComplete_1.5_50_CSSLoaderImpl+ParseSheet.png

So CSS fast loading can influence only CSSParseImpl::Parse part (7.12%)? This include also parsing all xml files referred in css files?
> So CSS fast loading can influence only CSSParseImpl::Parse part (7.12%)? 

Yes.

> This include also parsing all xml files referred in css files?

You mean XBL bindings?  If so, no. Those are loaded asynchronously from a totally different place.  And CSS fastloading wouldn't affect them anyway.
There is a lot of activity in JS:

0.61%   425380x   JS_CallTracer'2
1.37%   343723x   GetChar
1.69%   241201x   js_GetToken
0.19%   198749x   js_UngetToken
0.10%   190226x   GetGCThingFlags
0.17%   181610x   js_TraceId
0.33%   172224x   js_MatchToken
and many others...

To be honest I must say that I'm completely lost in this code. At a first look it seems that they are just called too often. But due to cycles no reasonable callgraph can be produced. Does it make sense to try to optimize this? If yes, can somebody please kick me into right direction?
(In reply to comment #55)
> There is a lot of activity in JS:
> 
> 0.61%   425380x   JS_CallTracer'2

GC is running -- that shouldn't happen during startup, if the changes jst made a while back still work.

> 1.37%   343723x   GetChar
> 1.69%   241201x   js_GetToken
> 0.19%   198749x   js_UngetToken
> 0.33%   172224x   js_MatchToken

JS compilation should not happen due to FastLoad, but clearly something is compiling JS from source rather than XDR'ing bytecodes. What is on the callstack that is invoking JS_Compile* or JS_Evaluate* APIs?

> 0.10%   190226x   GetGCThingFlags
> 0.17%   181610x   js_TraceId

More GC footprints.

> and many others...
> 
> To be honest I must say that I'm completely lost in this code. At a first look
> it seems that they are just called too often. But due to cycles no reasonable
> callgraph can be produced. Does it make sense to try to optimize this? If yes,
> can somebody please kick me into right direction?

We have long-standing methods to avoid JS GC and compilation costs from accruing at startup, but they seem not to be working. So two action items:

* restore the GC avoidance that jst engineered to working order; 
* find out what is compiling JS and switch it to fastload.

/be
(In reply to comment #56)
> * find out what is compiling JS and switch it to fastload.

I'm pretty sure these are calls to CompileFunction() in /content/xbl/src/nsXBLProtoImplMethod.cpp and /content/xbl/src/nsXBLProtoImplProperty.cpp, but someone should verify.

> JS compilation should not happen due to FastLoad,

We don't fastload XBL.  I'll bet money it's the issue.

Sadly, I'm not convinced that fastloading things in bite-sized chunks (one field/property/method at a time, possibly involving muxing on every one, etc) is so great.  We might be able to do better if we fastload the entire prototype binding at once or something...
(In reply to comment #58)
> > JS compilation should not happen due to FastLoad,
> 
> We don't fastload XBL.  I'll bet money it's the issue.

That's bug 94199

> Sadly, I'm not convinced that fastloading things in bite-sized chunks (one
> field/property/method at a time, possibly involving muxing on every one, etc)
> is so great.  We might be able to do better if we fastload the entire prototype
> binding at once or something...

What if we did a Ts/Txul run and logged every file, resource, and property touched.  Then did a pre-process where we concat all that stuff into one big file that we mmap in?  Sounds easy in english!  :-)

 

Also - see Bug 412796
(In reply to comment #56)
> (In reply to comment #55)
> > There is a lot of activity in JS:
> > 
> > 0.61%   425380x   JS_CallTracer'2
> 
> GC is running -- that shouldn't happen during startup, if the changes jst made
> a while back still work.

Last I checked (a couple of weeks ago), the change to keep GC out of startup did work, but it's time based, 10s into startup we can do a GC. And naturally if this is running under a profiler getting startup times of 10s+ is not hard, and then we'd see GC as part of startup in those profiles.

I'd suggest setting NS_FIRST_GC_DELAY in dom/src/base/nsJSEnvironment.cpp to a *really* large number when profiling startup to avoid this issue.
(In reply to comment #57)
> (In reply to comment #56)
> > * find out what is compiling JS and switch it to fastload.
> 
> I'm pretty sure these are calls to CompileFunction() in
> /content/xbl/src/nsXBLProtoImplMethod.cpp and
> /content/xbl/src/nsXBLProtoImplProperty.cpp, but someone should verify.
> 

There is 754 calls to js_CompileFunctionBody and js_CompileScript.

334x from nsXBLProtoImplMethod::CompileMember
326x from nsXBLProtoImplProperty::CompileMember
 85x from nsXBLProtoImplField::InstallField
  9x from other places
(In reply to comment #61)
> Last I checked (a couple of weeks ago), the change to keep GC out of startup
> did work, but it's time based, 10s into startup we can do a GC. And naturally
> if this is running under a profiler getting startup times of 10s+ is not hard,
> and then we'd see GC as part of startup in those profiles.
> 
> I'd suggest setting NS_FIRST_GC_DELAY in dom/src/base/nsJSEnvironment.cpp to a
> *really* large number when profiling startup to avoid this issue.

I know about it and I tried to exclude calls to GC from timer in profiling. Anyway I set NS_FIRST_GC_DELAY to 100000000 and tried again. There are 3 calls to js_CG. Backtrace of the first call:

Breakpoint 1, js_GC (cx=0x8ec1788, gckind=GC_NORMAL) at /opt/moz/TRUNK_new3/mozilla/js/src/jsgc.c:2326
2326        rt = cx->runtime;
Current language:  auto; currently c
(gdb) bt
#0  js_GC (cx=0x8ec1788, gckind=GC_NORMAL) at /opt/moz/TRUNK_new3/mozilla/js/src/jsgc.c:2326
#1  0x00127ecf in js_DestroyContext (cx=0x8ec1788, mode=JSDCM_FORCE_GC) at /opt/moz/TRUNK_new3/mozilla/js/src/jscntxt.c:433
#2  0x00121a0c in JS_DestroyContext (cx=0x8ec1788) at /opt/moz/TRUNK_new3/mozilla/js/src/jsapi.c:1011
#3  0x00f2bebd in XPCJSContextStack::SetSafeJSContext (this=0x8e78670, aSafeJSContext=0x905d1f8)
    at /opt/moz/TRUNK_new3/mozilla/js/src/xpconnect/src/xpcthreadcontext.cpp:265
#4  0x00f2cb46 in nsXPCThreadJSContextStackImpl::SetSafeJSContext (this=0x8e7f8e8, aSafeJSContext=0x905d1f8)
    at /opt/moz/TRUNK_new3/mozilla/js/src/xpconnect/src/xpcthreadcontext.cpp:421
#5  0x014dbc86 in nsAppShellService::SetXPConnectSafeContext (this=0x9033440)
    at /opt/moz/TRUNK_new3/mozilla/xpfe/appshell/src/nsAppShellService.cpp:123
#6  0x014dc4db in nsAppShellService::CreateHiddenWindow (this=0x9033440, aAppShell=0x8e72d58)
    at /opt/moz/TRUNK_new3/mozilla/xpfe/appshell/src/nsAppShellService.cpp:197
#7  0x014ecf52 in nsAppStartup::CreateHiddenWindow (this=0x8ec0530)
    at /opt/moz/TRUNK_new3/mozilla/toolkit/components/startup/src/nsAppStartup.cpp:149
#8  0x00f00515 in XRE_main (argc=5, argv=0xbfb16f54, aAppData=0x8da81e0)
    at /opt/moz/TRUNK_new3/mozilla/toolkit/xre/nsAppRunner.cpp:3119
#9  0x08048aba in main (argc=5, argv=0xbfb16f54) at /opt/moz/TRUNK_new3/mozilla/browser/app/nsBrowserApp.cpp:158


Second and third are executed from JS_MaybeGC() because the condition is true.

(gdb) bt 5
#0  js_GC (cx=0x909f5d8, gckind=GC_NORMAL) at /opt/moz/TRUNK_new3/mozilla/js/src/jsgc.c:2326
#1  0x001206ca in JS_GC (cx=0x909f5d8) at /opt/moz/TRUNK_new3/mozilla/js/src/jsapi.c:2396
#2  0x00120735 in JS_MaybeGC (cx=0x909f5d8) at /opt/moz/TRUNK_new3/mozilla/js/src/jsapi.c:2466
#3  0x012f2d25 in nsJSContext::ScriptEvaluated (this=0x909f5a0, aTerminated=1)
    at /opt/moz/TRUNK_new3/mozilla/dom/src/base/nsJSEnvironment.cpp:3218
#4  0x012f0ded in nsJSContext::EvaluateStringWithValue (this=0x909f5a0, aScript=@0xbfb15b94, aScopeObject=0xb130b460, 
    aPrincipal=0x0, aURL=0x937a868 "chrome://global/content/bindings/toolbar.xml", aLineNo=79, aVersion=0, 
    aRetValue=0xbfb15ba8, aIsUndefined=0xbfb15ba4) at /opt/moz/TRUNK_new3/mozilla/dom/src/base/nsJSEnvironment.cpp:1337
(More stack frames follow...)
(gdb) f 2
#2  0x00120735 in JS_MaybeGC (cx=0x909f5d8) at /opt/moz/TRUNK_new3/mozilla/js/src/jsapi.c:2466
2466            JS_GC(cx);
(gdb) p bytes 
$1 = 532480 
(gdb) p lastBytes 
$2 = 262144 

JS backtrace:
chrome://global/content/bindings/toolbar.xml:92

(gdb) bt 5
#0  js_GC (cx=0x909f5d8, gckind=GC_NORMAL) at /opt/moz/TRUNK_new3/mozilla/js/src/jsgc.c:2326
#1  0x001206ca in JS_GC (cx=0x909f5d8) at /opt/moz/TRUNK_new3/mozilla/js/src/jsapi.c:2396
#2  0x00120735 in JS_MaybeGC (cx=0x909f5d8) at /opt/moz/TRUNK_new3/mozilla/js/src/jsapi.c:2466
#3  0x012f2d25 in nsJSContext::ScriptEvaluated (this=0x909f5a0, aTerminated=0)
    at /opt/moz/TRUNK_new3/mozilla/dom/src/base/nsJSEnvironment.cpp:3218
#4  0x012ed18f in nsJSContext::ScriptExecuted (this=0x909f5a0)
    at /opt/moz/TRUNK_new3/mozilla/dom/src/base/nsJSEnvironment.cpp:3283
(More stack frames follow...)
(gdb) f 2
#2  0x00120735 in JS_MaybeGC (cx=0x909f5d8) at /opt/moz/TRUNK_new3/mozilla/js/src/jsapi.c:2466
2466            JS_GC(cx);
(gdb) p bytes
$3 = 712704
(gdb) p lastBytes 
$4 = 532480

JS backtrace:
chrome://browser/content/browser.js:6062 FullZoom_get__cps()
chrome://browser/content/browser.js:6089 FullZoom_init()
chrome://browser/content/browser.js:870  delayedStartup()


Any ideas?
For the MaybeGC cases, I wonder if the work in bug 417052 would help; we're finding that the heuristic there is a little too easily tripped for our performance-tastes.
(In reply to comment #64)
> For the MaybeGC cases, I wonder if the work in bug 417052 would help; we're
> finding that the heuristic there is a little too easily tripped for our
> performance-tastes.
> 

Raising the constant from 1.33 to 4 helps. So planned 16 will solve this.
Depends on: 121341
Probably no longer relevant to track anything in this bug.. If there's more startup analysis to be done it will happen on newer bugs
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: