Bug 205810 (Closed)
Opened 22 years ago • Closed 21 years ago
XSLT partial perf fix for NS_NewAtom slowdown  
Categories: Core :: XSLT (defect)
Status: RESOLVED INVALID
People: Reporter: emrick; Assigned: peterv
Keywords: perf
Attachments (3 files):
  patch, 4.69 KB
  patch, 2.38 KB
  text/plain, 5.39 KB
User-Agent:       Mozilla/5.0 (OS/2; U; Warp 4.5; en-US; rv:1.3) Gecko/20030313
Build Identifier: Mozilla/5.0
The fix for bug http://bugzilla.mozilla.org/show_bug.cgi?id=195262
causes NS_NewAtom() service time to roughly double due to charset conversion.
This patch provides a partial fix for the impact of this on XSLT by avoiding use
of NS_NewAtom. The regression in NS_NewAtom performance still affects other
areas of the XSLT code.
This entire section of XSLT code is pending a major patch by Peter Van der
Beken, so this small patch may be obsoleted by Peter's work. Nevertheless, for
now the problem and fix are documented in this bug.
Reproducible: Always
Steps to Reproduce:
Measure XSLT processing time before and after the patch.
Comment 1 • Reporter • 22 years ago
Comment 2 • 22 years ago
another issue is to kill
http://lxr.mozilla.org/seamonkey/ident?i=TX_StringEqualsAtom
Note that this is semi-trivial: one should get the UTF-8 string first and then
compare that for most of the cases. ExprParser and Lexer want that. Probably not
high profile, but it should be done.
The stuff in attachment 123401 looks reasonable; whoever actually does that,
though, should poke sicking for a bug number for the attr rewrite and mention it
in the code.
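A minimal standalone sketch of that suggestion (the helper and keyword list below are hypothetical, not the real txAtoms/TX_StringEqualsAtom code): convert the parsed token to UTF-8 exactly once, then run plain byte comparisons against the keyword atoms' stored UTF-8 names. Since the XSLT keywords are ASCII, a token containing a non-ASCII unit cannot match any of them.

    #include <cstddef>
    #include <string>

    // Hypothetical illustration only: the keyword names stand in for the
    // atoms' stored UTF-8 strings.
    static const char* const kKeywords[] = { "template", "apply-templates",
                                             "value-of", "for-each" };

    static int MatchKeyword(const char16_t* aToken, size_t aLen)
    {
        std::string utf8;                 // token converted exactly once
        utf8.reserve(aLen);
        for (size_t i = 0; i < aLen; ++i) {
            if (aToken[i] > 0x7F)
                return -1;                // non-ASCII: no keyword can match
            utf8.push_back(static_cast<char>(aToken[i]));
        }
        for (size_t k = 0; k < sizeof(kKeywords) / sizeof(kKeywords[0]); ++k) {
            if (utf8 == kKeywords[k])     // cheap byte comparison per keyword
                return static_cast<int>(k);
        }
        return -1;                        // no match
    }

The point is that the per-keyword cost drops to a byte comparison instead of a conversion inside every TX_StringEqualsAtom-style call.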
Status: UNCONFIRMED → NEW
Ever confirmed: true
Comment 3 • 22 years ago
would it be worthwhile adding code to check if the input to NS_NewAtom is ascii
before calling NS_ConvertUCS2toUTF8?
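A minimal sketch of that idea (standalone code, not the actual nsAtomTable patch): scan the UTF-16 key once and take a cheap narrowing copy when it is pure ASCII, leaving the caller to fall back to the existing converter (NS_ConvertUCS2toUTF8 in the tree) for the rare non-ASCII key.

    #include <cstddef>
    #include <string>

    // Sketch only: try a cheap ASCII fast path first; return false so the
    // caller falls back to the full UTF-16 -> UTF-8 conversion.
    static bool TryAsciiAtomKey(const char16_t* aStr, size_t aLen,
                                std::string& aKey)
    {
        for (size_t i = 0; i < aLen; ++i) {
            if (aStr[i] > 0x7F)
                return false;            // non-ASCII: do a real conversion
        }
        aKey.assign(aStr, aStr + aLen);  // ASCII code units are valid UTF-8
        return true;
    }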
Comment 4 • 22 years ago
Comment 5 • 22 years ago
Comment on attachment 123464:
like maybe this would help?  (untested)
One loop in IsAscii and one loop in NS_LossyConvert vs. two loops in NS_Convert
(one for the length, one for the actual copying). That might just equal out.
Comment 6 • 22 years ago
axel: but you would avoid the shifting and masking operations.  anyways, it
might be worth it to give this a try... or maybe we should be looking to
optimize the conversion function itself.  maybe during the length loop it could
also check for only ASCII.  that would be a quick thing to check.  then the
second loop could be replaced with a truncation conversion.  hmm...
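A rough sketch of that fused approach (standalone code, not Mozilla's actual converter; BMP-only, so surrogate pairs are ignored for brevity): the length pass also notes whether the input is pure ASCII, and the second pass becomes a plain truncating copy in that case, which addresses comment 5's worry about simply adding a third loop.

    #include <cstddef>
    #include <string>

    // Sketch of the suggestion above: fold the ASCII check into the length
    // pass the converter already does, so the common ASCII case costs one
    // scan plus a truncating copy.
    static std::string ConvertUTF16toUTF8(const char16_t* aStr, size_t aLen)
    {
        size_t utf8Len = 0;
        bool ascii = true;
        for (size_t i = 0; i < aLen; ++i) {   // pass 1: length + ASCII check
            char16_t c = aStr[i];
            utf8Len += (c < 0x80) ? 1 : (c < 0x800) ? 2 : 3;
            if (c > 0x7F)
                ascii = false;
        }

        std::string out;
        out.reserve(utf8Len);
        if (ascii) {
            out.assign(aStr, aStr + aLen);    // pass 2: truncating copy
            return out;
        }
        for (size_t i = 0; i < aLen; ++i) {   // pass 2: full UTF-8 encode
            char16_t c = aStr[i];
            if (c < 0x80) {
                out.push_back(static_cast<char>(c));
            } else if (c < 0x800) {
                out.push_back(static_cast<char>(0xC0 | (c >> 6)));
                out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
            } else {
                out.push_back(static_cast<char>(0xE0 | (c >> 12)));
                out.push_back(static_cast<char>(0x80 | ((c >> 6) & 0x3F)));
                out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
            }
        }
        return out;
    }

Whether this pays off depends on how often the inputs really are ASCII, which is the kind of data quoted later in comment 16.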
Comment 7 • Reporter • 22 years ago
The best way to fix conversions would be... to never have to do them except
for input/output. There should never be a reason to do string conversion
internally within Mozilla. Secondly, the unique and arcane string support in
Mozilla should be thrown away, and then standard, open, and well-tuned Unicode
library functions can be used, such as:
http://oss.software.ibm.com/icu/apiref/index.html
http://oss.software.ibm.com/icu/apiref/utf8_8h.html
http://oss.software.ibm.com/icu/apiref/utf16_8h.html
There is no reason in God's Universe that Mozilla has to be different than
Everything Else in string handling and unicode... except it chooses to be
different. The performance cost of this choice is, at least right now, very large.
But this whole stinky thing is not the subject of this bug.
Comment 8 • 22 years ago
sam: your point is well taken, but the reality is that mozilla is a large
project composed of many subprojects each with very different demands on string
encodings.  some grew up only caring about the native filesystem charset
(because maybe the only input/output is with fopen and other standard friends).
 other code, like the networking library, never has much use for UTF-16.  mostly
it needs things to be encoded in the charset of the origin server, whatever that
may be.  there are also cases where UTF-8 is used.  and of course the content
handling code wants UTF-16.  taken as a whole, it's a big mismatch... i agree. 
and, like you said... we aren't going to solve that problem here.  my point was
just that we might be able to improve the performance problems with the atom
table to avoid the need to work around those problems in XSLT code.  maybe this
shows us that it was wrong to store atoms as UTF-8.  maybe that is the
conclusion.  i don't know yet... we don't have enough information.
Comment 9 • Reporter • 22 years ago
Yes, I agree Darin... just sometimes it feels good to whine. :-) Sorry.
I don't understand why Alec (and others?) wants Atoms to be UTF8. Since 
I understand why that is a good thing in other ways and places (I 
assume), it is hard for me to argue against it, except it is quite a 
performance hit in this area that I'm looking at right now.  
Comment 10 • Reporter • 22 years ago
Drats. Need to proofread better. Make that "Since I DON'T understand 
why that is..."
Comment 11 • Reporter • 22 years ago
BTW and FWIW, our favorite competitive browser is internally entirely 
UTF16.
Comment 12 • 22 years ago
the other point I'd like to make is that there are reasons in God's Universe
that mozilla will deal with strings in different ways across subsystems:
footprint, performance, and code maintainability/architecture, to name a few. It
may make sense for an entire module to deal with its strings in UTF16 because it
is dealing with manipulation of raw Unicode strings that may originate in other
languages. Another module might be dealing with strings that are almost entirely
ASCII but still needs to support i18n... and when 99% of strings are ASCII there
could be massive bloat from inflating 8-bit character strings into 16-bit
Unicode strings. I'm not even going to get into the issues of dealing with
legacy systems, legacy character sets, legacy web pages, legacy servers, and so
forth.
No reasonably sized project has a single prevailing rule for dealing with
strings. You can argue this on principle, but you'd be going against what people
have learned through practice.
Comment 13 • 22 years ago
oh, and the reason I want atoms to be UTF8 is that the majority of consumers of
the atom table are storing ASCII or nearly-always-ASCII strings, and it makes no
sense for them to take the footprint hit of doubling the size of their strings.
The problem is that people want one universal atom solution across the product,
and this is just wrong. A large portion of atoms in use are either internal
tokens or well-defined ASCII values that are seen all over the sourcebase. The
folks who are trying to store internationalized strings in the atom table are
really trying to use it for the wrong purpose;
if anything, we need a totally different atom table for that purpose.
Comment 14 • Reporter • 22 years ago
Comment 15 • Reporter • 22 years ago
Alec, with all due respect, I don't buy the footprint argument. The UTF8 change
has been in nightlies for a while, so you should have data about all the
footprint savings from the change. Please share this footprint-savings data with us.
For my evidence, please see this attachment of a trace of NS_NewAtom:
http://bugzilla.mozilla.org/attachment.cgi?id=123565
NS_NewAtom is just a symptom, and Alec, I am not picking on you. It's just that
8- to 16-bit chars are the least of Mozilla's footprint problems.
I'd propose to leave http://bugzilla.mozilla.org/show_bug.cgi?id=195262
for its original purpose -- statically allocated Atoms -- but back off the patch to
make Atoms UTF8... these things are unrelated.
Static atom support is good; switching a single internal Mozilla component to
UTF8 should be reconsidered separately from the static atom support. If UTF8 is
the direction, then there is probably some core set of functions that
should be converted to UTF8 together.
Comment 16 • 22 years ago
we can't have static atoms without UTF8 atoms.. at least in any way that I can
figure out. The problem is that many consumers were previously using
GetUnicode() which would not work for static atoms.. 
Now as for your trace - what I see is the lack of CopyUCS2toUTF8 - if we
implement that then we avoid creating extra string objects and so forth. If you
ask me, that's the bigger performance hit here. NS_ConvertUCS2toUTF8() sucks.
Now admittedly maybe I should have implemented CopyUCS2toUTF8 before
implementing UTF8 atoms, and all I can say there is "sorry" - in my pageload
tests the effect was minimal.
I really feel like this is an area that we need to start looking at the
consumers of nsIAtom and deciding if we can use this change as a lever to reduce
our use of UCS2 for storing ascii strings. My suggestion is to look at your code
and see if you can benefit from some of the other methods on nsIAtom such as
Equals() and EqualsUTF8() - we can get further improvements out of those
functions when I write EqualsUTF8(const nsAString&, const nsACString&), which
will do an in-place comparison without creating temporary strings....
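For a sense of what such an in-place comparison looks like, here is a rough standalone sketch (not the actual nsReadableUtils/nsIAtom code; BMP-only, with no handling of surrogate pairs or invalid sequences): the UTF-8 side is decoded on the fly and compared code point by code point against the UTF-16 side, with no converted temporary being built.

    #include <cstddef>

    // Standalone sketch only: compare a UTF-16 string against a UTF-8 string
    // without materializing a converted copy. BMP-only for brevity.
    static bool EqualsUTF16AndUTF8(const char16_t* aUTF16, size_t aLen16,
                                   const unsigned char* aUTF8, size_t aLen8)
    {
        size_t i = 0, j = 0;
        while (i < aLen16 && j < aLen8) {
            char32_t c16 = aUTF16[i++];
            char32_t c8;
            unsigned char b = aUTF8[j];
            if (b < 0x80) {                    // 1-byte sequence
                c8 = b;
                j += 1;
            } else if ((b & 0xE0) == 0xC0) {   // 2-byte sequence
                if (j + 1 >= aLen8) return false;
                c8 = ((b & 0x1F) << 6) | (aUTF8[j + 1] & 0x3F);
                j += 2;
            } else {                           // 3-byte sequence (BMP)
                if (j + 2 >= aLen8) return false;
                c8 = ((b & 0x0F) << 12) | ((aUTF8[j + 1] & 0x3F) << 6) |
                     (aUTF8[j + 2] & 0x3F);
                j += 3;
            }
            if (c16 != c8)
                return false;
        }
        return i == aLen16 && j == aLen8;      // both strings fully consumed
    }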
Now as for footprint wins what I did discover was that during a simple startup,
layout 2-3 pages, and shutdown, 75% of the 2100+ atoms created with NS_NewAtom()
were pure ASCII, and about 50% of them were actual strings stored statically in
our binaries. so what I've done basically is converted 50% of those atoms to
static atoms, making:
1000 static atoms which now allocate only 8 bytes each, as opposed to 8 bytes +
the length of the unicode version of the string
500 or so atoms that were ASCII, and now take up half the room as before.
Furthermore, I think the footprint wins are going to be a rippling effect -
first we got some minor footprint wins in the atom table, but going forward
we're going to be able to store/process other things as UTF8 and avoid going
through UCS2 altogether in some cases.
I still hold that UCS2 is not appropriate for all contexts - it is only
appropriate when you are storing large amounts of i18n data whose charset may
vary greatly depending on the user's interaction with the product, where you
need a good "average waste" among the encodings - examples include localized
strings and documents downloaded from the web. When the data is primarily ASCII
you pick an encoding that is optimized for ASCII, i.e. UTF8 - examples include
lists of well known ASCII tokens like HTML tags, CSS keywords, and so forth.
Comment 17 • Reporter • 22 years ago
I thought a 'better patch' for this was being worked on, but I have not seen it yet.
Lacking that, can we please move forward with this patch? Thanks.
Sam
Comment 18 • 22 years ago
CopyUTF16toUTF8 now exists.
Comment 19 • 22 years ago
CopyUTF16toUTF8 is of no help here, sadly enough.
Peter lost essential parts of his walker tree, so we may resurrect something like
attachment 123401 in the meantime. It's not top priority on my list, though.
Comment 20 • Assignee • 21 years ago
The XSLT code that this bug was filed against is not in the tree anymore. Can we
close this bug now?
I doubt that CopyUTF16toUTF8 will be of any help here; afaict we still need a
string to hold the UTF8 copy as a key for the hash lookup.
Yeah, let's close this. The biggest problem has been taken care of by the
walker. There are still some things that are taking a hit and that we should
atomize more, most noticeably output, where the result handlers would benefit
from having atom rather than string arguments.
Closing as invalid since the code in question is gone.
Status: NEW → RESOLVED
Closed: 21 years ago
Resolution: --- → INVALID