Bug 223595: speed up nsHTMLTags id-string-atom mapping (and thus nsParserService)
Status: CLOSED, RESOLVED FIXED
Opened 21 years ago, Closed 21 years ago
Categories: Core :: DOM: HTML Parser (defect)
People: Reporter: axel, Assignee: bryner
Keywords: perf
Attachments (2 files, 1 obsolete):
  patch, 22.93 KB (axel: review+, jst: superreview+)
  text/plain, 11.12 KB
Here is the plan:
Use a (mildly hacked) perfect hash function generated by gperf instead of a
PLHashTable, and don't do a single string copy.
I have the hash function on my disk already, though it requires a string of
known length. I need to hack a bit to figure out how to deal with the
nsIAtoms->IDs mapping nicely.
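A minimal sketch of the kind of length-aware lookup this plan describes, with illustrative names and table contents only (a linear scan stands in for the gperf-generated perfect hash):

// Hypothetical sketch, not the actual gperf output: the tag name plus its
// known length selects an entry in a static table, with no string copy.
#include <cstddef>
#include <cstring>

enum eHTMLTags { eHTMLTag_unknown = 0, eHTMLTag_a, eHTMLTag_b, eHTMLTag_br };

struct TagEntry {
  const char* name;
  eHTMLTags id;
};

static const TagEntry kTags[] = {
  { "a",  eHTMLTag_a },
  { "b",  eHTMLTag_b },
  { "br", eHTMLTag_br },
};

static eHTMLTags LookupTag(const char* aName, std::size_t aLen) {
  // A gperf-generated perfect hash would jump straight to one slot from
  // (aName, aLen); the linear scan below is purely for illustration.
  for (const TagEntry& entry : kTags) {
    if (std::strlen(entry.name) == aLen &&
        std::memcmp(entry.name, aName, aLen) == 0) {
      return entry.id;
    }
  }
  return eHTMLTag_unknown;
}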
There is one thing I'd like to change in nsIParserService, though:
NS_IMETHOD HTMLIdToStringTag(PRInt32 aId,
const PRUnichar **aTagName) const = 0;
should either be
NS_IMETHOD HTMLIdToStringTag(PRInt32 aId,
const char **aTagName) const = 0;
or even
NS_IMETHOD(const char*) HTMLIdToStringTag(PRInt32 aId) const = 0;
This touches the html(fragment)content sinks a bit, but those really want to
know the atom for the id.
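For illustration, a sketch of what the const char* flavour proposed above could look like (table contents and bounds handling are assumptions, not the interface that actually landed):

// Hypothetical sketch: the id -> tag-name lookup returns a pointer into a
// static ASCII table indexed by the tag id, so no PRUnichar buffer or string
// copy is needed.
static const char* const kTagNames[] = {
  "",     // slot 0: unknown tag
  "a",
  "b",
  "br",
  // ...remaining tags, in enum order
};

const char* HTMLIdToStringTag(int aId) {
  const int count = int(sizeof(kTagNames) / sizeof(kTagNames[0]));
  if (aId <= 0 || aId >= count) {
    return nullptr;   // out of range, including the unknown id
  }
  return kTagNames[aId];
}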
Alec, is there a way to share a static atom array between content and
htmlparser?
It does take a tweak in mozSanitizingSerializer, though. That'd need to use
CopyASCIIToUTF16 instead of Write in 3 places,
http://lxr.mozilla.org/seamonkey/source/content/base/src/mozSanitizingSerializer.cpp#475
(string search is
http://lxr.mozilla.org/seamonkey/search?string=HTMLIdToStringTag)
I'd like to add a HTMLCaseSensitiveStringTagToId, too, as we'd like to use
that in XSLT.
(Alec, regarding my mails: I wasn't able to generate a perfect hash function
without the string length. Since I can live with that, I'll just give this a go.)
Comment 1 • 21 years ago (Reporter)
Bah, after killing the tab, I didn't set the CCs and the assignee when refiling
this bug.
Assignee: parser → axel
Comment 2 • 21 years ago
Yes! I actually spent a good deal of time hacking on HTMLIdToStringTag some
months back... you'd be amazed at how far out the ripples go when you make that
change. I actually ended up giving up because... well, now I can't remember what
the roadblock was.
In any case, part of the issue was, as you mention, sharing tables between
content and htmlparser. I think for that we really should be rolling
gkhtmlpar.dll into gklayout.dll, or whatever they are called.
In any case, I welcome this change, and I think we could see some good code
cleanup as a result... we are going to run into basic issues with PRUnichar vs.
char; the real magic is going to be in determining exactly where the right
place to do the conversion is.
A big issue behind all of this, which I wish we could resolve, is that the
htmlparser converts everything to UCS2 on the fly, which means the parser itself
is always going to be reading in tag names in unicode!
[and yes, I think it would add unnecessary complexity (and likely lead to bugs)
to try to read the file in ASCII and only convert sections of the file into UCS2]
So we're stuck with tag names (and attribute names) coming in as unicode.
Comment 3 • 21 years ago (Reporter)
I was a bit too optimistic about HTMLIdToStringTag.
That method is exposed via nsHTMLTags, nsIDTD, and nsIParserService, just to
name a few. The same goes for other methods of nsHTMLTags. Thus I'm going to
leave the interface as it is for now.
Maybe one day, I'll really look into htmlparser and understand how objects are
tied together, and maybe fold some of this together.
This shouldn't impact the main performance goals greatly, however.
Comment 4 • 21 years ago (Assignee)
I was actually just looking at this exact thing. It is possible to "share"
static atoms between the parser and layout since RegisterStaticAtoms looks for
an existing atom. I've got a patch which eliminates the string conversion and
hashtable lookup for each parsed tag (from the nsHTMLContentSink and
nsNodeInfoManager). It does this by just storing an array of atom pointers
indexed by the existing enum values... the content sink already knows the enum
value so we can get the atom very quickly.
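A rough sketch of the table-indexed-by-enum idea just described; the IntTagToAtom name appears later in this bug, but its shape here, along with the table size and init path, is a guess for illustration, not the patch as checked in:

// Hypothetical sketch: atoms for every known HTML tag are registered once at
// startup and stored in an array indexed by the tag enum value, so the
// content sink can go from enum id to nsIAtom* without hashing or copying.
class nsIAtom;                        // opaque XPCOM atom type

static const int kTagCount = 4;       // stand-in for the real tag count
static nsIAtom* sTagAtomTable[kTagCount];

// Called once at module startup, after the static atoms have been registered.
void InitTagAtomTable(nsIAtom* const* aStaticAtoms) {
  for (int i = 0; i < kTagCount; ++i) {
    sTagAtomTable[i] = aStaticAtoms[i];
  }
}

// Fast path used by the content sink: one bounds check, one array read.
nsIAtom* IntTagToAtom(int aTagId) {
  if (aTagId < 0 || aTagId >= kTagCount) {
    return nullptr;
  }
  return sTagAtomTable[aTagId];
}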
Also, since atom lookups of UTF8 strings are faster than for UTF16 strings, I
added methods on nsINodeInfoManager that can take a UTF8 string as the
prefix/name. I converted some callers that were already doing some sort of
string copy or conversion before calling those methods to just pass in a UTF8
string.
jprof results before and after show that this reduces the amount of time spent
in HTMLContentSink::CreateContentObject quite a bit (the exact numbers are over
on my Linux partition).
It might be worth looking at implementing a similar system for HTML attribute
names. There is a predefined set of valid attribute names; we could just build
an array of atoms to optimize the lookup as we do for tag names.
Comment 5 • 21 years ago (Assignee)
Updated • 21 years ago (Assignee)
Attachment #134236 - Flags: superreview?(alecf)
Attachment #134236 - Flags: review?(axel)
Comment 6 • 21 years ago (Assignee)
So, I don't actually think my change to the PRUnichar static storage will work
for configurations without HAVE_CPP_2BYTE_WCHAR_T defined. It will compile, but
I think it will misbehave at runtime. Consider that section of the patch
removed, unless anyone can think of a better way to do it.
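For context, a minimal illustration (simplified, assumed types) of why such a section compiles everywhere but only behaves when HAVE_CPP_2BYTE_WCHAR_T holds:

// PRUnichar is a 16-bit code unit, but a wide string literal uses
// wchar_t-sized units (4 bytes on many non-Windows platforms). The cast
// compiles either way, yet reads garbage at runtime unless
// sizeof(wchar_t) == 2, which is what HAVE_CPP_2BYTE_WCHAR_T asserts.
#include <cstdint>

typedef std::uint16_t PRUnichar;

static const wchar_t kWideName[] = L"br";
static const PRUnichar* kTagName =
    reinterpret_cast<const PRUnichar*>(kWideName);
// On a little-endian platform with 4-byte wchar_t, kTagName[0] is 'b' but
// kTagName[1] is 0 (the upper half of 'b'), so the string looks truncated.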
Comment 7 • 21 years ago (Reporter)
Comment on attachment 134236 [details] [diff] [review]
patch
Basically, yes.
+#define HTML_TAG(_tag, _classname) (const PRUnichar *)(NS_L(#_tag)),
is not working, though.
Just leave those alone.
I'm ok with adding those methods to nsIDTD for now, though they should really
be in nsIParserService eventually.
And just have a
NS_IMETHOD(nsIParserService*) GetParserService();
on nsIDTD.
As you change those "tooltip" and friends in layout, could you move them over
to
nodeInfoManager->GetNodeInfo(NS_LITERAL_CSTRING("tooltip"),
nsnull, kNameSpaceID_XUL,
getter_AddRefs(nodeInfo));
?
Guess I'd like to see a new diff.
Attachment #134236 - Flags: superreview?(alecf)
Attachment #134236 - Flags: review?(axel)
Attachment #134236 - Flags: review-
Comment 8 • 21 years ago (Assignee)
Attachment #134236 - Attachment is obsolete: true
Updated • 21 years ago (Assignee)
Attachment #134277 - Flags: superreview?(alecf)
Attachment #134277 - Flags: review?(axel)
Comment 9 • 21 years ago (Assignee)
Comment on attachment 134277 [details] [diff] [review]
revised patch
Changing the sr request to jst to get module owner input on the nsINodeInfoManager
changes. Alec, you're still welcome to review, of course :)
Attachment #134277 - Flags: superreview?(alecf) → superreview?(jst)
Comment 10 • 21 years ago
Point of interest re. conversion to UCS2 on-the-fly: any idea when this change
was made (from digging through CVS)? HTML tag names should always be ASCII, and
I have a nagging suspicion that this might have been changed so that we could
run XML view-source through the parser. If this is true, perhaps it's possible
to divert XML view-source through the pretty-print codepath (I don't know much
about the XML side of things) and keep the HTML tag strings in ASCII?
Comment 11 • 21 years ago (Assignee)
I suspect we've always started out with UTF16 in the parser. The reason is that
the unicode converters return UTF16, and (in the general case) it's almost
certainly as much or more work for them to return UTF8 when converting from an
arbitrary character set. This work happens well before we parse out the tags,
as I understand it.
Now, an interesting question is whether we can do better for single-byte
encodings, particularly ones in which all of the characters map to a single byte
in UTF8. In such a case, we could conceivably avoid doing any copying of the
buffer here.
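As a rough illustration of that idea (helper names are assumptions, not parser code): when every byte of the input is ASCII, the "conversion" to UTF-8 degenerates into a plain copy, since ASCII is a strict subset of UTF-8.

#include <cstddef>
#include <string>

// Returns true if the buffer contains only 7-bit ASCII bytes.
static bool IsAllASCII(const char* aBuf, std::size_t aLen) {
  for (std::size_t i = 0; i < aLen; ++i) {
    if (static_cast<unsigned char>(aBuf[i]) > 0x7F) {
      return false;
    }
  }
  return true;
}

// For an all-ASCII buffer no transcoding is needed; otherwise a real charset
// converter would take over (not shown).
std::string ToUTF8IfASCII(const char* aBuf, std::size_t aLen) {
  if (IsAllASCII(aBuf, aLen)) {
    return std::string(aBuf, aLen);
  }
  return std::string();
}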
Updated • 21 years ago (Reporter)
Attachment #134277 - Flags: review?(axel) → review+
Comment 12 • 21 years ago
Comment on attachment 134277 [details] [diff] [review]
revised patch
Nice work! I especially like the IntTagToAtom() work.
Are all 4 GetNodeInfo's being used? Is there no way we can get rid of any of
the old GetNodeInfo's? My only dismay with this is the additional code; it
would be nice to get rid of at least a few APIs which use unicode tag names...
Comment 13 • 21 years ago (Assignee)
The size cost of having the AString versions which call through to the ACString
versions is about 474 bytes according to readelf. But yeah, it might be good to
check what people are actually using (in the general case I'm not sure if I want
to force people to convert UTF16 to UTF8 at the call site if that results in
more string code inlining).
Comment 14 • 21 years ago (Assignee)
It looks to me like the only variants of this method that are actually used
(with my patch) are:
NS_IMETHOD GetNodeInfo(nsIAtom *aName, nsIAtom *aPrefix,
PRInt32 aNamespaceID, nsINodeInfo** aNodeInfo);
NS_IMETHOD GetNodeInfo(const nsAString& aName, nsIAtom *aPrefix,
PRInt32 aNamespaceID, nsINodeInfo** aNodeInfo);
NS_IMETHOD GetNodeInfo(const nsAString& aQualifiedName,
const nsAString& aNamespaceURI,
nsINodeInfo** aNodeInfo);
NS_IMETHOD GetNodeInfo(const nsACString& aName, nsIAtom *aPrefix,
PRInt32 aNamespaceID, nsINodeInfo** aNodeInfo);
jst, should I just rip out the rest of them?
Comment 15 • 21 years ago
Comment on attachment 134277 [details] [diff] [review]
revised patch
sr=jst, and assuming you've searched all extensions etc., go ahead and take out
the GetNodeInfo()'s that aren't used any more.
Attachment #134277 - Flags: superreview?(jst) → superreview+
Comment 16 • 21 years ago
Bryner - I think your patch caused all of these warnings when building on Mac OS
X 10.3. Seems like a lot.
Comment 17 • 21 years ago
Josh Aas, those warnings are there without bryner's patch. In fact, his patch
didn't touch any code anywhere near (within 200 lines of) line 1119.
Comment 19 • 21 years ago (Assignee)
Checked in. I see anywhere from a 1.1-1.5% Tp improvement on btek.
Status: NEW → RESOLVED
Closed: 21 years ago
Resolution: --- → FIXED
> it's almost certainly as much or more work for them to return UTF8 when
> converting from an arbitrary character set.
Just for interest --- I'm not sure this is true. For example, the Chinese
encodings are basically ASCII with escape sequences to switch in and out of
double-byte character mode. For documents with a lot of ASCII (and I believe
most Chinese pages are mostly HTML tags), UTF8 conversion could be very efficient,
because you can just copy ASCII bytes until you see the start of an escape sequence.
/me hates UTF16 and wishes we were all UTF8