Closed Bug 75083 Opened 24 years ago Closed 23 years ago

Tokenizer (and more) slowed down by non-inlined code

Categories

(Core :: XPCOM, defect)

x86
Windows 2000
defect
Not set
minor

Tracking

()

RESOLVED WONTFIX

People

(Reporter: bratell, Assigned: bratell)

References

()

Details

(Keywords: perf)

Attachments

(3 files)

nsString::AppendWithConversion, nsStr::Trim and nsLocalCString(nsLocalCString) all do nsCRT::strlen which don't get inlined by the MSVC compiler and they show up high on the Tokenizer cost on the jdk index page. By inlining them manually I got a 15% speedup in the tokenizer. I will try calling strlen direct and see if the compiler can do something even better with that.
Blocks: 56854
Keywords: perf
I have seen this in profile builds but if we use the compiler flags /Oi or /O2 we should get inlined strlen in windows. What compiler flags are really used when building on Windows? The only relevant /O2 I found is: /config/makefile.win, line 87 -- CFLAGS = /O2 /GB /MD but from the comment it looks like that is only for the tools. So aren't we inlining? Are we using the wrong build flag? Can we make a _big_ win here or am I missing something? CC:ing some people who might know how our build system works on Windows.
I'm not sure that we've really done a good job exploring this space. We use -O1 (see mozilla/config/WIN32). Adding mjudge & rpotts, who know a lot about Win32.
I did: set MOZ_FLAGS=/Oi (inline strlen and other similar functions) and rebuilt strings and xpcom (nothing else). The quick and dirty result in quantify was that the total time to load the page in the url above went from 12.3 seconds to 10.3 on my computer. A big part of the difference can be explained by fewer reflows for some reason. Probably because they are timer started and it the layout is quicker, fewer are needed. So even if the "real" speedup is unknown, there seems to be some. I will try to do a full rebuild with /Oi to see what that does, but someone with more resources than I should really try some of the optimizer flags and do wall clock timings. I can't access John Morrison's tests I guess? And by the way, the object files I looked at was smaller than without /Oi so inlining some functions seems to generate smaller code than doing the function call.
Keywords: topperf
Another flags to test would be /Ob2 (let the compiler inline anything it think is suitable to inline). /O1 and /O2 only implies /Ob1 which is to inline functions with the |inline| keyword but nothing else.
A full rebuild with /0i and I'm down to 10.1 seconds in my very unscientific measurements. That's just 2% better than with just xpcom built with /Oi, but it is still a positive trend.
Showing chofmann what Daniel found.
So I changed WIN32 to build everything with /O2 /Ob2 and did a new run. It's much noise but what I see in parts of the program is in the table below. The times are F+D times in seconds as reported by Quantifty. The number in parentheses after the time is number of calls. Before: After: PresShell::ProcessReflowCommands 6.482s (191) 4.998s (635) nsHTMLTokenizer::ConsumeToken 0.623s (91k) 0.409s (91k) nsPresContext::ResolveStyleContextFor 1.200s (32k) 1.039s (32k) nsCSSFr.Const...::ConstructFrameByDisplayType 1.334s (31k) 1.183s (31k) SinkContext::FlushText 0.630s (62k) 0.534s (62k) and so on... I want this change in the real tree as soon as possible, to at least get some real world measurements! We also need to see how much this affects build size. (This put a new light on all those questions about when we are going to release an optimized build. :-) )
Severity: normal → major
Status: NEW → ASSIGNED
Keywords: mozilla0.9
Summary: Tokenizer slowed down by non-inlined strlens in string code → Tokenizer (and more) slowed down by non-inlined code
Target Milestone: --- → mozilla0.9
This caught my eye, so I ran jrgm's performance test once with /Oi and then again with /O2 /Ob2, and both runs produced the same timings as the current (/O1) flags.
That's depressing. I wonder why I got so nice figures during the weekend. Didn't you see any difference at all? Maybe it will only show an effect on big computational intensive pages? Anyway, I'm rebuilding right now to see if my /O1 figures were wrong in any way.
Perhaps Quantify's instrumentation makes non-inline functions seem slower when in fact there's little difference. One thing that has I've always wondered is if we'd save anything by making all the methods in nsCOMPtr & nsString statically linked or inline rather than the cross-DLL calls we make now.
FWIW, I tested release-mode builds on my PII/400mhz/256mb laptop, over a DSL line at home. I did clobber builds and cleared the disk cache before each test run. The graph represents the median page load time for each site (and for all sites - first element in the series), across the five iterations that jrgm's test does. I can post the raw data (or XLS spreadsheet) if anyone wants.
I believe you. The graph is very clear. I think Adam may have a point. The non-inlined functions are "timed", not instrumented as our code, and maybe that timing is wrong in some way. The only problem with that theory is that even if I remove all strlen and memcpy completely from my "Before" build, it will still not be as fast in Quantify. I timed seven loads of the JDK index page with the time Mozilla displays by loading it into an empty mozilla window and then pressing shift+reload six times. These figures are interesting in their own but they shows no improvements. :-( /O1 /O2 /Ob2 1. 8.858s 9.153s 3.3% slower 2. 12.628s 12.047s 4.6% faster 3. 19.608s 18.897s 3.6% faster 4. 19.058s 20.900s 9.7% slower 5. 15.463s 14.912s 3.6% faster 6. 14.551s 14.881s 2.3% slower 7. 13.710s 13.390s 2.3% faster This is so depressing. I thought I really had found something. Well, well. Back to Quantify. :-/ Maybe someone could explain why the load times differs so much in the same pattern: Fast - slower - really slow - really slow- slightly slow - slightly slow - slightly slow.... I can understand the difference between number 1 and 2, because the previuos page has to be torn down, something we are really slow at, but the rest? At least that could explain why different people get different numbers on the same page with the same computer.
Severity: major → minor
Target Milestone: mozilla0.9 → ---
Daniel, the rumor in today's performance meeting is that you may have a patch to manually modify those functions into in-line code?? (besides turning the compilier options) If that's true, can you post your patch to this bug, so we can try it out too? Thanks! Cathleen
Keywords: topperf
Blocks: 71668
What is the status of this bug? Is there similar phenomenon in other OS,such as:solaris?
Phenomena, as in "code that should be inlined is not", or phenomena as in "quantify is reporting inaccurate numbers for timed functions"? I believe heavier inlining can do good things at some places but apparently it didn't affect page load in a way that was measurable.
hi,daniel Because the bug haven't been discussed for at least a year,I want to know whether your changing compling options is effective to enhance the performance problem now. Is there a conclusion of previous discussion? In theory,although inline function can lead to the inflation of binary code, it is still helpful to improve performance of mozilla if changed the small size functions to inline functions. In pratice, we should blance the two sides of inline functions. BTW, have someone tried similar compiling options in other OS,such as, linux,solaris? thanks leon
I modified the functions in nsString and nsCRT,and changed some short functions,whose line-counting number is smaller than 10, into "inline" function. The groups of testing data below is the time consumed(milliseconds) when running local testcase "index-7.html" using build from modified code and original code. index Org1 org2 Org3 Org4 0rg5 Mod1 Mod2 Mod3 Mod4 1 10547 10453 10843 11828 10484 11344 10641 10453 10563 2 10218 10844 10703 12266 10797 11360 10719 11125 11234 3 11454 11703 11437 10547 10672 11750 11078 11031 11187 4 11015 11125 11172 10734 11016 11735 11172 11156 11531 5 10657 11407 10922 12015 11141 11265 11313 11204 11063 6 10969 11453 11297 11110 11187 11109 11250 10813 11094 7 10562 10579 10469 10938 10531 10563 10547 10437 10484 8 10250 10422 10625 10985 10390 10578 10438 10422 10250 9 10781 10468 10204 10453 10266 10141 10047 10156 10172 10 10422 10344 11391 10422 10109 10203 10047 10485 10000 11 10125 10000 10203 10171 11094 10219 10063 10563 10031 12 10297 10062 10953 10203 9985 10547 12266 10438 10688 13 10937 10063 10562 10125 9953 10171 10078 10500 10078 14 10032 10032 10469 10062 10062 10188 10453 10516 10141 15 10047 10110 10641 10110 10391 10172 10062 10531 10688 ------------------------------------------------------------------------------------- Ava 10554 10604 10792 10797 10538 10756 10678 10655 10613 According to the data above,it is difficult to say that these modification is helpful to enhance the performance of mozilla. Although I strongly suggest that we should create inline functions if they are short,I don't think the bug is very helpful and I suggest it be closed if no oppostions. At the same time, we can also get similar conclusion from previous comments.
I modified the functions in nsString and nsCRT,and changed some short functions,whose line-counting number is smaller than 10, into "inline" function. The groups of testing data in attachment is the time consumed(milliseconds) when running local testcase "index-7.html" using build from modified code and original code. According to the testing data ,it is difficult to say that these modification is helpful to enhance the performance of mozilla. Although I strongly suggest that we should create inline functions if they are short,I don't think the bug is very helpful and I suggest it be closed if no oppostions. At the same time, we can also get similar conclusion from previous comments.
Attachment #90200 - Attachment description: testing data from modificated code and original code → testing data from modified code and original code
see previous comments
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → WONTFIX
Component: String → XPCOM
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: