Closed Bug 223595 Opened 21 years ago Closed 21 years ago

speed up nsHTMLTags id-string-atom mapping (and thus nsParserService)

Categories

(Core :: DOM: HTML Parser, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: axel, Assigned: bryner)

Details

(Keywords: perf)

Attachments

(2 files, 1 obsolete file)

Here is the plan: Use a (mildly hacked) perfect hash function generated by gperf instead of a PLHashTable, don't do a single string copy. I have the hash function on my disk, though it requires to have a string of known length. I need to hack a bit to find out how to deal with nsIAtoms->IDs nicely. One thing I'd like to change in nsIParserService, though. NS_IMETHOD HTMLIdToStringTag(PRInt32 aId, const PRUnichar **aTagName) const = 0; should either be NS_IMETHOD HTMLIdToStringTag(PRInt32 aId, const char **aTagName) const = 0; or even NS_IMETHOD(const char*) HTMLIdToStringTag(PRInt32 aId) const = 0; This touches the html(fragment)content sinks a bit, but those really want to know the atom for the id. Alec, is there a way to share a static atom array between content and htmlparser? It does take a tweak in mozSanitizingSerialiser, though. That'd need to use CopyASCIIToUTF16 instead of Write in 3 places, http://lxr.mozilla.org/seamonkey/source/content/base/src/mozSanitizingSerializer.cpp#475 (string search is http://lxr.mozilla.org/seamonkey/search?string=HTMLIdToStringTag) I'd like to add a HTMLCaseSensitiveStringTagToId, too, as we'd like to use that in XSLT. (Alec, regarding my mails, I wasn't able to generate a perfect hash function without the string length. As I can do with, I just try to give this a go.)
Bah, after killing the tab, I didn't set the CCs and the assignee when refiling this bug.
Assignee: parser → axel
yes! I actually spent a good deal of time hacking on HTMLIdToStringTag some months back... you'd be amazed at how far out the ripples go when you make that change. I actually ended up giving up because...well now I can't remember what the roadblock was in any case, part of the issue was, as you mention, sharing tables between content and htmlparser. I think for that we really should be rolling gkhtmlpar.dll into gklayout.dll or whatever they are called. In any case, I welcome this change, and I think we could see some good code cleanup as a result... we are going to run into basic issues with PRUnichar vs. char, the real magic is going to be in determining exactly where the right place to do the conversion is. A big issue behind all of this, which I wish we could resolve, is that the htmlparser converts everything to UCS2 on the fly, which means the parser itself is always going to be reading in tag names in unicode! [and yes, I think it would add unnecessary complexity (and likely lead to bugs) to try to read the file in ASCII and only convert sections of the file into UCS2] So we're stuck with tag names (and attributes names) coming in as unicode.
I was a bit too optimistic about the HTMLIDToStringTag. That method is exposed via nsHTMLTags, nsIDTD, nsIParserService. Just to name a few. Same goes for other methods of nsHTMLTags. Thus I'm going to let the interface as it is for now. Maybe one day, I'll really look into htmlparser and understand how objects are tied together, and maybe fold some of this together. This shouldn't impact the main performance goals greatly, however.
I was actually just looking at this exact thing. It is possible to "share" static atoms between the parser and layout since RegisterStaticAtoms looks for an existing atom. I've got a patch which eliminates the string conversion and hashtable lookup for each parsed tag (from the nsHTMLContentSink and nsNodeInfoManager). It does this by just storing an array of atom pointers indexed by the existing enum values... the content sink already knows the enum value so we can get the atom very quickly. Also, since atom lookups of UTF8 strings are faster than for UTF16 strings, I added methods on nsINodeInfoManager that can take a UTF8 string as the prefix/name. I converted some callers that were already doing some sort of string copy or conversion before calling those methods to just pass in a UTF8 string. jprof results before and after show that this reduces the amount of time spent in HTMLContentSink::CreateContentObject quite a bit (the exact numbers are over on my Linux partition). It might be worth looking at implementing a similar system for HTML attribute names. There is a predefined set of valid attribute names; we could just build an array of atoms to optimize the lookup as we do for tag names.
Attached patch patch (obsolete) — Splinter Review
Attachment #134236 - Flags: superreview?(alecf)
Attachment #134236 - Flags: review?(axel)
So, I don't actually think my change to the PRUnichar static storage will work for configurations without HAVE_CPP_2BYTE_WCHAR_T defined. It will compile, but I think it will misbehave at runtime. Consider that section of the patch removed, unless anyone can think of a better way to do it.
Comment on attachment 134236 [details] [diff] [review] patch Basically, yes. +#define HTML_TAG(_tag, _classname) (const PRUnichar *)(NS_L(#_tag)), is not working, though. Just leave those alone. I'm ok with adding those methods to nsIDTD for now, though they should really be in nsIParserService eventually. And just have a NS_IMETHOD(nsIParserService*) GetParserService(); on nsIDTD. As you change those "tooltip" and friends in layout, could you move them over to nodeInfoManager->GetNodeInfo(NS_LITERAL_CSTRING("tooltip"), nsnull, kNameSpaceID_XUL, getter_AddRefs(nodeInfo)); ? Guess I'd like to see a new diff.
Attachment #134236 - Flags: superreview?(alecf)
Attachment #134236 - Flags: review?(axel)
Attachment #134236 - Flags: review-
Attached patch revised patchSplinter Review
Attachment #134236 - Attachment is obsolete: true
Attachment #134277 - Flags: superreview?(alecf)
Attachment #134277 - Flags: review?(axel)
Comment on attachment 134277 [details] [diff] [review] revised patch changing sr request to jst to get module owner input on nsINodeInfoManager changes. alec you're still welcome to review of course :)
Attachment #134277 - Flags: superreview?(alecf) → superreview?(jst)
Point of interest re. conversion to UCS2 on-the-fly: any idea when this change was made (from digging through CVS)? HTML tag names should always be ASCII, and I have a nagging suspicion that this might have been changed so that we could run XML view-source through the parser. If this is true, perhaps it's possible to divert XML view-source through the pretty-print codepath (I don't know much about the XML side of things) and keep the HTML tag strings in ASCII?
I suspect we've always started out with UTF16 in the parser. The reason is that the unicode converters return UTF16, and (in the general case) it's almost certainly as much or more work for them to return UTF8 when converting from an arbitrary character set. This work happens well before we parse out the tags, as I understand it. Now, an interesting question is whether we can do better for single-byte encodings, particularly ones in which all of the characters map to a single byte in UTF8. In such a case, we could conceivably avoid doing any copying of the buffer here.
Attachment #134277 - Flags: review?(axel) → review+
Comment on attachment 134277 [details] [diff] [review] revised patch nice work! I especially like the IntTagToAtom() work. Are all 4 GetNodeInfo's being used? there's no way we can get rid of any of the old GetNodeInfo's? My only dismay with this is the additional code, it would be nice to get rid of at least a few APIs which use unicode tag names...
The size cost of having the AString versions which call through to the ACString versions is about 474 bytes according to readelf. But yeah, it might be good to check what people are actually using (in the general case I'm not sure if I want to force people to convert UTF16 to UTF8 at the call site if that results in more string code inlining).
It looks to me like the only variants of this method that are actually used (with my patch) are: NS_IMETHOD GetNodeInfo(nsIAtom *aName, nsIAtom *aPrefix, PRInt32 aNamespaceID, nsINodeInfo** aNodeInfo); NS_IMETHOD GetNodeInfo(const nsAString& aName, nsIAtom *aPrefix, PRInt32 aNamespaceID, nsINodeInfo** aNodeInfo); NS_IMETHOD GetNodeInfo(const nsAString& aQualifiedName, const nsAString& aNamespaceURI, nsINodeInfo** aNodeInfo); NS_IMETHOD GetNodeInfo(const nsACString& aName, nsIAtom *aPrefix, PRInt32 aNamespaceID, nsINodeInfo** aNodeInfo); jst, should I just rip out the rest of them?
Comment on attachment 134277 [details] [diff] [review] revised patch sr=jst, and assuming you've searched all extensions etc, go ahead and take out the GetNodeInfo()'s that aren't used any more.
Attachment #134277 - Flags: superreview?(jst) → superreview+
Attached file lots of build warnings
Bryner - I think your patch made all of these warnings when building on Mac OS X 10.3. Seems like a lot.
Josh Aas, those warnings are there without bryner's patch. In fact, his patch didn't touch any code anywhere near (within 200 lines of) line 1119.
.
Assignee: axel → bryner
checked in - I see anywhere from 1.1 - 1.5% Tp improvement on btek.
Status: NEW → RESOLVED
Closed: 21 years ago
Resolution: --- → FIXED
> it's almost certainly as much or more work for them to return UTF8 when > converting from an arbitrary character set. Just for interest --- I'm not sure this is true. For example, the Chinese encodings are basically ASCII with escape sequences to switch in and out of double-byte character mode. For documents with a lot of ASCII (and I believe most Chinese pages are mostly HTML tags) UTF8 conversion could be very efficient because you can just copy ASCII bytes until you see the start of an escape sequence. /me hates UTF16 and wishes we were all UTF8
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: