Closed Bug 98214 Opened 23 years ago Closed 23 years ago

Avoid unnecessary string copy in xml parser

Categories

(Core :: XML, defect, P3)

x86
Windows 2000
defect

Tracking

()

VERIFIED FIXED
mozilla0.9.6

People

(Reporter: harishd, Assigned: hjtoi-bugzilla)

Details

(Keywords: perf, Whiteboard: [fixinhand])

Attachments

(1 file)

Sample quantify run starting up with about:blank.

Function:
Tokenizer_HandleExternalEntityRef
Calls:
33
Function time:	8.48 usec (0.00% of Focus)
F+D time:	283,385.23 usec (0.02% of Focus)
Avg F time:	0.26 usec
Min F time:	0.26 usec
Max F time:	0.26 usec
Module:
y:\mozilla\dist\WIN32_O.OBJ\bin\components\gkparser.dll
Source File:	y:\mozilla\htmlparser\src\nsExpatTokenizer.cpp
Measurement:
Line
Hidden functions:	(None)
Keywords: perf
Unfortunately the patch does not avoid the copy, it just moves it to somewhere
else. However, I believe this has the advange that if somebody some day figures
a way to do this without a copy they can do that easily.

There is a real performance improvement in the patch though, regarding how we
parse XML declaration and get the encoding. The old code needs to do string copy
& allocation almost every time (we end up here 22 times in startup, doing
allocation 18 times of those). This is 0.00% of startup focus according to
Purify, but the new code takes just a fraction of the time the old code does. It
is also more correct.

Finally there are some cleanups. Reviews?
Status: NEW → ASSIGNED
Priority: -- → P3
Whiteboard: [fixinhand]
Target Milestone: --- → mozilla0.9.6
Comment on attachment 53122 [details] [diff] [review]
Cleanup & speedup

Needs more commenting. Especially places where you
look for 'g' & 'n'.
Attachment #53122 - Flags: review+
Ok, I added this for 'n' and 'g':

// Want to avoid string comparisons, hence looking for 'g'
// and only if found check the string leading to it. Not
// foolproof, but fast.

I see it extremely unlikely that someone would give us something like:

<?xml fooversion"1" barencoding"UTF-8"?>

which would pass this detection as "valid". All this happens before we pass the
data to Expat, we only use this routine to convert the data we see to format
that we want to feed Expat. If Expat barfed on the XML declaration (it does not
since it is not a validating parser) the user would get an error page explaining
that the XML file is not correct.

We also use the knowledge of earliest possible offset into our advantage in the
routine, and this is explained by saying something like:

// The shortest string so far (strlen==5):
// <?xml

after which we start the loop at position 6 and so forth.
Comment on attachment 53122 [details] [diff] [review]
Cleanup & speedup

sr=vidur

Not a big deal, but nsMemory::Free() is preferred over nsCRT::free(). Also, could you not just use an nsXPIDLString for the filename in the nsExpatTokenizer part of the patch?

For future reference, there's a local class in nsParser.cpp called CWordTokenizer that finds word boundaries and could probably be used to simplify the logic of the XML decl parsing code that you have.
Attachment #53122 - Flags: superreview+
I changed to nsXPIDLCString.

I tried to simplify the obfuscated code by using CWordTokenizer<char> but became
blocked by bug 104479.

Marking fixed.
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
Marking verified on the Oct 22nd builds.
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: