Closed Bug 113201 Opened 23 years ago Closed 21 years ago

[FIX]Make XML detection more meaningful in BufferContainsHTML()

Categories

(Core :: DOM: HTML Parser, defect, P1)

defect

Tracking

()

RESOLVED FIXED
mozilla1.8alpha1

People

(Reporter: rbs, Assigned: bzbarsky)

Details

Attachments

(2 files, 2 obsolete files)

There is a function in htmlparser/src/nsDTDUtils.h with the current signature: inline PRBool BufferContainsHTML(const nsString& aBuffer, PRBool& aHasXMLFragment) The boolean aHasXMLFragment is what is of interest in this bug. It is meant to be set when the function detects that the buffer is the beginning of an XML document. As it stands, this function is giving a wrong guess is almost all instances. I tried the following patch in CNavDTD:CanParse() from where the function is called, I will attach the output that I got on startup where all the files are XML as indicated by their MIME type. It shows that the current detection is pretty useless, although the wrong guess has no particular impact later on. Index: src/CNavDTD.cpp =================================================================== RCS file: /cvsroot/mozilla/htmlparser/src/CNavDTD.cpp,v retrieving revision 3.364 diff -u -5 -r3.364 CNavDTD.cpp --- src/CNavDTD.cpp 28 Nov 2001 06:13:01 -0000 3.364 +++ src/CNavDTD.cpp 3 Dec 2001 06:04:18 -0000 @@ -343,10 +343,11 @@ } else if(PR_TRUE==aParserContext.mMimeType.EqualsWithConversion(kTextJSContentType)) { result=ePrimaryDetect; } else { +nsAutoString mimeType(aParserContext.mMimeType); //otherwise, look into the buffer to see if you recognize anything... PRBool theBufHasXML=PR_FALSE; if(BufferContainsHTML(aBuffer,theBufHasXML)){ result = eValidDetect ; if(0==aParserContext.mMimeType.Length()) { @@ -362,10 +363,14 @@ } } else result=eValidDetect; } } +printf("hasXML:%s MIME:%s URI:%s\n", +theBufHasXML ? "Yes" : "No ", +NS_LossyConvertUCS2toASCII(mimeType).get(), +NS_LossyConvertUCS2toASCII(aParserContext.mScanner->GetFilename()).get()); } } return result; }
Priority: -- → P3
Target Milestone: --- → mozilla0.9.8
Status: NEW → ASSIGNED
Target Milestone: mozilla0.9.8 → mozilla0.9.9
Target Milestone: mozilla0.9.9 → mozilla1.0.1
Bug 132681 also spotted at this problem.
Target Milestone: mozilla1.0.1 → mozilla1.2beta
Target Milestone: mozilla1.2beta → mozilla1.4alpha
Attached patch Patch (obsolete) — Splinter Review
Comment on attachment 148069 [details] [diff] [review] Patch The code was looking for "<?xml" starting at position 100 in the file, which only worked if the file had a PI after a license comment, basically... This should fix it to look starting at start of file.
Attachment #148069 - Flags: superreview?(rbs)
Attachment #148069 - Flags: review?(rbs)
Did you consider bug 132681 comment 4? Apparently, that function is even bogus. It was an earlier attempt at sniffing, and hasn't evolved at all, whereas the sniffer of necko is top-notch these days. You are in a better postition for a surgical patch... since you are the maintainer of the sniffer of necko. I tried my little patch (updated to the newest string-fu...), and it didn't print anything (without your patch). This seems due to the patch added for bug 132681. We don't get into that if-branch, coupled with the fact that the mime-type provided by necko is generally correct.
CC:ing bz, see my comment above.
Oh, hmmmm... Yeah, the better thing to do would probably be to remove the XML crap from here altogether. I'll try to see how much of this code I can remove without my sanity being affected... ;)
Attachment #148069 - Flags: superreview?(rbs)
Attachment #148069 - Flags: superreview-
Attachment #148069 - Flags: review?(rbs)
Attachment #148069 - Flags: review-
Attached patch Just remove all this gunk (obsolete) — Splinter Review
This removes most of the MIME gunk from the parser, leaving some in the one place it's useful -- the parser context. Note that DetermineParseMode() is always called before FindSuitableDTD, if needed, so relying on the parser context's doctype is in fact just fine in the CanParse() code (which is only called from FindSuitableDTD).
Assignee: harishd → bzbarsky
Attachment #148069 - Attachment is obsolete: true
Comment on attachment 148162 [details] [diff] [review] Just remove all this gunk This patch does change semantics a bit... we will now assume that anything that's not in our hardcoded HTML/XML list and is being fed to the parser should be parsed as plaintext. Before we would sniff it for tag-like things. I think the new behavior is better, frankly.. ;)
Attachment #148162 - Flags: superreview?(jst)
Attachment #148162 - Flags: review?(rbs)
Priority: P3 → P1
Summary: Make XML detection more meaningful in BufferContainsHTML() → [FIX]Make XML detection more meaningful in BufferContainsHTML()
Target Milestone: mozilla1.4alpha → mozilla1.8alpha
Comment on attachment 148162 [details] [diff] [review] Just remove all this gunk The patch looks good to me. There wasn't much point with the double sniffing because necko is security-sensitive and doesn't pass something with unknown MIME type. All the user would get would be an early 'Save As' (or 'Open With') instead. I was wondering about the (now useless) 'aBuffer' param in CanParse(), but it is harmless.
Attachment #148162 - Flags: review?(rbs) → review+
Comment on attachment 148162 [details] [diff] [review] Just remove all this gunk You wanna save us a few bytes of code by eliminating the unnecessary argument? sr=jst either way. It'll be interesting to see if this causes regressions :-)
Attachment #148162 - Flags: superreview?(jst) → superreview+
I removed the extra args and changed nsParser::WillBuildModel to only copy those first 1024 bytes if we're actually autodetecting the parsemode, since that's the only thing that buffer is needed for now. If this causes regressions, we fix them. ;)
Attachment #148162 - Attachment is obsolete: true
Nice... I have the feeling that the XXXVidur comment doesn't apply anymore. Does it? + // XXXVidur Make a copy and only check in the first 1k His concern seemed to have come from the consumers of aBuffer in the repeated calls of WillBuildModel(). But now, the first call will cache the status [per the code just above in: @@ -1126,17 +1127,17 @@ FindSuitableDTD( ] and since aBuffer is gone, and the peek is buried in the |if|, nobody else will re-peek again... right?
Oh, good catch. Removed that comment.
Checked in. We win about 3700 bytes of build size and maybe 0.2% of Tp... (maybe; that's kinda the noise for those tests...)
Status: ASSIGNED → RESOLVED
Closed: 21 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: