Closed Bug 205810 Opened 22 years ago Closed 21 years ago

XSLT partial perf fix for NS_NewAtom slowdown

Categories

(Core :: XSLT, defect)

RESOLVED INVALID

People

(Reporter: emrick, Assigned: peterv)

Details

(Keywords: perf)

Attachments

(3 files)

User-Agent: Mozilla/5.0 (OS/2; U; Warp 4.5; en-US; rv:1.3) Gecko/20030313
Build Identifier: Mozilla/5.0

The fix for http://bugzilla.mozilla.org/show_bug.cgi?id=195262 causes NS_NewAtom() service time to roughly double due to charset conversion. This patch provides a partial fix for the impact of this on XSLT by avoiding use of NS_NewAtom. The regression in NS_NewAtom performance still impacts other areas of XSLT code. This entire section of XSLT code is pending a major patch by Peter Van der Beken, so this small patch may be obsoleted by Peter's major patch. Nevertheless, for now the problem and fix are being documented in this bug.

Reproducible: Always

Steps to Reproduce:
Measure XSLT processing time before and after the patch.
another issue is to kill http://lxr.mozilla.org/seamonkey/ident?i=TX_StringEqualsAtom Note that this is semi-trivial: one should get the UTF-8 string first and then compare that, for most of the cases. ExprParser and Lexer want that. Probably not high profile, but it should be done. The stuff in attachment 123401 looks reasonable; whoever actually does that, though, should poke sicking for a bug number for the attr rewrite and mention that in the code.
Status: UNCONFIRMED → NEW
Ever confirmed: true
would it be worthwhile to add code that checks whether the input to NS_NewAtom is ASCII before calling NS_ConvertUCS2toUTF8?
Comment on attachment 123464 [details] [diff] [review]
like maybe this would help? (untested)

One loop in IsAscii and one loop in NS_LossyConvert vs. two loops in NS_Convert (one for the length, one for the actual copying). That might just even out.
axel: but you would avoid the shifting and masking operations. anyways, it might be worth it to give this a try... or maybe we should be looking to optimize the conversion function itself. maybe during the length loop it could also check for ASCII-only input; that would be a quick thing to check. then the second loop could be replaced with a truncating conversion. hmm...
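The idea being debated above (one scan to check for ASCII, then a cheap truncating copy instead of a full conversion) can be sketched roughly like this. This is standalone illustrative C++, not the actual Mozilla string code; the names IsAscii and LossyAsciiCopy stand in for whatever the real helpers would be:

```cpp
#include <cstddef>
#include <string>

// Illustrative sketch, not Mozilla's implementation: one pass to check
// whether every UTF-16 code unit is plain ASCII (<= 0x7F).
bool IsAscii(const char16_t* aStr, std::size_t aLen) {
  for (std::size_t i = 0; i < aLen; ++i) {
    if (aStr[i] > 0x7F) {
      return false;  // needs a real UTF-16 -> UTF-8 conversion
    }
  }
  return true;
}

// Truncating copy, valid only when IsAscii() returned true: each
// 16-bit code unit becomes a single byte, no shifting or masking.
std::string LossyAsciiCopy(const char16_t* aStr, std::size_t aLen) {
  std::string out;
  out.reserve(aLen);
  for (std::size_t i = 0; i < aLen; ++i) {
    out.push_back(static_cast<char>(aStr[i]));
  }
  return out;
}
```

Whether the extra IsAscii() pass pays off is exactly the trade-off raised above: one scan plus one cheap copy versus the converter's two loops.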
The best way to fix conversions would be... to never have to do them, except and only for input/output. There should never be a reason to do string conversion internally within Mozilla. Secondly, the unique and arcane string support in Mozilla should be thrown away, and then standard, open, and well-tuned Unicode library functions can be used, such as:
http://oss.software.ibm.com/icu/apiref/index.html
http://oss.software.ibm.com/icu/apiref/utf8_8h.html
http://oss.software.ibm.com/icu/apiref/utf16_8h.html
There is no reason in God's Universe that Mozilla has to be different from Everything Else in string handling and Unicode... except that it chooses to be different. The performance cost of this choice is, at least right now, very large. But this whole stinky thing is not the subject of this bug.
sam: your point is well taken, but the reality is that mozilla is a large project composed of many subprojects, each with very different demands on string encodings. some grew up only caring about the native filesystem charset (because maybe the only input/output is with fopen and other standard friends). other code, like the networking library, never has much use for UTF-16: mostly it needs things to be encoded in the charset of the origin server, whatever that may be. there are also cases where UTF-8 is used. and of course the content handling code wants UTF-16. taken as a whole, it's a big mismatch... i agree. and, like you said... we aren't going to solve that problem here. my point was just that we might be able to improve the performance problems with the atom table to avoid the need to work around those problems in XSLT code. maybe this shows us that it was wrong to store atoms as UTF-8. maybe that is the conclusion. i don't know yet... we don't have enough information.
Yes, I agree Darin... just sometimes it feels good to whine. :-) Sorry. I don't understand why Alec (and others?) wants Atoms to be UTF8. Since I understand why that is a good thing in other ways and places (I assume), it is hard for me to argue against it, except it is quite a performance hit in this area that I'm looking at right now.
Drats. Need to proofread better. Make that "Since I DON'T understand why that is..."
BTW and FWIW, our favorite competitive browser is internally entirely UTF16.
the other point I'd like to make is that there is a reason in God's Universe that mozilla will deal with strings in different ways across subsystems: footprint, performance, and code maintainability/architecture, to name a few. It may make sense for an entire module to deal with its strings in UTF16 because it is dealing with manipulation of raw Unicode strings that may originate in other languages. Another module might be dealing with strings that are almost entirely ASCII but still need to support i18n... and when 99% of strings are ASCII there could be massive bloat from inflating 8-bit character strings into 16-bit Unicode strings. I'm not even going to get into the issues of dealing with legacy systems, legacy character sets, legacy web pages, legacy servers, and so forth. No reasonably sized project has a single prevailing rule for dealing with strings. you can argue this on principle, but you'd be going against what people have learned through practice.
oh, and the reason I want atoms to be UTF8 is that the majority of consumers of the atom table are storing ASCII or nearly-always-ASCII strings, and it makes no sense for them to take the footprint hit of doubling the size of their strings. The problem is that people want one universal atom solution across the product, and this is just wrong. A large portion of atoms in use are either internal tokens or well-defined ASCII values that are seen all over the sourcebase. The folks who are trying to store internationalized strings in the atom table are really trying to use it for the wrong purpose; if anything, we need a totally different atom table for that purpose.
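As a rough illustration of the footprint argument being made here (simple arithmetic, not measured Mozilla data): for a pure-ASCII token, UTF-8 costs one byte per character while UTF-16/UCS2 costs two bytes per code unit, so ASCII-heavy atom storage roughly halves when switched to UTF-8:

```cpp
#include <cstddef>
#include <string>

// Byte cost of the character data alone (ignoring object overhead).
// For pure-ASCII text, UTF-8 uses exactly one byte per character...
std::size_t Utf8Bytes(const std::string& aAsciiToken) {
  return aAsciiToken.size();
}

// ...while UTF-16/UCS2 uses two bytes per code unit, twice the size.
std::size_t Utf16Bytes(const std::u16string& aToken) {
  return aToken.size() * 2;
}
```

So a well-known token like "border-color" takes 12 bytes as UTF-8 but 24 as UTF-16, which is the doubling the comment refers to.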
Attached file NS_NewAtom trace
Alec, with all due respect, I don't buy the footprint argument. The UTF8 change has been in nightlies for a while, so you should have data about all the footprint savings from the change. Please share this footprint-savings data with us.

For my evidence, please see this attachment of a trace of NS_NewAtom: http://bugzilla.mozilla.org/attachment.cgi?id=123565

NS_NewAtom is just a symptom, and Alec, I am not picking on you. It's just that 8- vs. 16-bit chars are the least of Mozilla's footprint problems.

I'd propose to leave http://bugzilla.mozilla.org/show_bug.cgi?id=195262 for its original purpose, statically allocated atoms, but back off the patch to make atoms UTF8... these things are unrelated. Static atom support is good; switching a single internal Mozilla component to UTF8 should be reconsidered separately from the static atom support. If UTF8 is the direction, then there is probably some core set of functions that should be converted to UTF8 together.
we can't have static atoms without UTF8 atoms... at least in any way that I can figure out. The problem is that many consumers were previously using GetUnicode(), which would not work for static atoms.

Now as for your trace: what I see is the lack of CopyUCS2toUTF8. If we implement that, then we avoid creating extra string objects and so forth. If you ask me, that's the bigger performance hit here. NS_ConvertUCS2toUTF8() sucks. Now admittedly maybe I should have implemented CopyUCS2toUTF8 before implementing UTF8 atoms, and all I can say there is "sorry"; in my pageload tests the effect was minimal.

I really feel like this is an area where we need to start looking at the consumers of nsIAtom and deciding whether we can use this change as a lever to reduce our use of UCS2 for storing ASCII strings. My suggestion is to look at your code and see if you can benefit from some of the other methods on nsIAtom such as Equals() and EqualsUTF8(). We can get further improvements out of those functions when I write EqualsUTF8(const nsAString&, const nsACString&), which will do an in-place comparison without creating temporary strings...

Now as for footprint wins: what I did discover was that during a simple startup, layout of 2-3 pages, and shutdown, 75% of the 2100+ atoms created with NS_NewAtom() were pure ASCII, and about 50% of them were actual strings stored statically in our binaries. So what I've done, basically, is converted 50% of those atoms to static atoms, making:
- 1000 static atoms which now allocate only 8 bytes each, as opposed to 8 bytes plus the length of the Unicode version of the string
- 500 or so atoms that were ASCII, and now take up half the room as before

Furthermore, I think the footprint wins are going to be a rippling effect: first we got some minor footprint wins in the atom table, but going forward we're going to be able to store/process other things as UTF8 and avoid going through UCS2 altogether in some cases.
I still hold that UCS2 is not appropriate for all contexts - it is only appropriate when you are storing large amounts of i18n data whose charset may vary greatly depending on the user's interaction with the product, where you need a good "average waste" among the encodings - examples include localized strings and documents downloaded from the web. When the data is primarily ASCII you pick an encoding that is optimized for ASCII, i.e. UTF8 - examples include lists of well known ASCII tokens like HTML tags, CSS keywords, and so forth.
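The in-place comparison mentioned above (an EqualsUTF8-style check that compares a UTF-16 string against a UTF-8 string without building a temporary converted copy) might look something like the following for the common ASCII-only case. This is a hypothetical sketch, not the actual nsIAtom API; a real implementation would also decode multibyte UTF-8 sequences and surrogate pairs instead of bailing out on them:

```cpp
#include <cstddef>

// Hypothetical sketch of an in-place UTF-16 vs. UTF-8 comparison that
// never allocates a temporary converted string. Only the ASCII subset
// is handled; anything non-ASCII conservatively reports "not equal",
// where a full implementation would decode the UTF-8 sequence instead.
bool EqualsUTF8Ascii(const char16_t* aU16, std::size_t aU16Len,
                     const char* aU8, std::size_t aU8Len) {
  if (aU16Len != aU8Len) {
    return false;  // pure ASCII: one UTF-8 byte per UTF-16 code unit
  }
  for (std::size_t i = 0; i < aU16Len; ++i) {
    const unsigned char b = static_cast<unsigned char>(aU8[i]);
    if (b > 0x7F || aU16[i] != b) {
      return false;
    }
  }
  return true;
}
```

Since the vast majority of atoms discussed in this bug are ASCII tokens, the fast path above covers most lookups without any conversion at all.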
Keywords: perf
I thought a 'better patch' for this was being worked on, but I have not seen it yet. Lacking that, can we please move forward with this patch? Thanks. Sam
CopyUTF16toUTF8 now exists.
CopyUTF16toUTF8 has no beef here, sadly enough. Peter lost essential parts of his walker tree, so we may resurrect something like attachment 123401 in the meantime. It's not top-notch priority on my list, though.
The XSLT code that this bug was filed against is not in the tree anymore. Can we close this bug now? I doubt that CopyUTF16toUTF8 will be of any help here, afaict we still need a string to hold the copy in UTF8 as a key for the hash lookup.
Yeah, let's close this. The biggest problem has been taken care of by the walker. There are still some things that are taking a hit and that we should atomize more, most noticeably output, where the result handlers would benefit from having atom rather than string arguments. Closing as invalid since the code in question is gone.
Status: NEW → RESOLVED
Closed: 21 years ago
Resolution: --- → INVALID