Closed Bug 113201 Opened 23 years ago Closed 21 years ago

[FIX]Make XML detection more meaningful in BufferContainsHTML()

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla1.8alpha1

People

(Reporter: rbs, Assigned: bzbarsky)

Details

Attachments

(2 files, 2 obsolete files)

output of the detection at startup 23 years ago rbs 6.71 KB, text/plain		Details
Patch 21 years ago Boris Zbarsky [:bzbarsky] 1.12 KB, patch	bzbarsky : review- bzbarsky : superreview-	Details \| Diff \| Splinter Review
Just remove all this gunk 21 years ago Boris Zbarsky [:bzbarsky] 15.90 KB, patch	rbs : review+ jst : superreview+	Details \| Diff \| Splinter Review
Patch updated to comments 21 years ago Boris Zbarsky [:bzbarsky] 24.59 KB, patch		Details \| Diff \| Splinter Review

rbs

Reporter

Description

•

23 years ago

There is a function in htmlparser/src/nsDTDUtils.h with the current signature: inline PRBool BufferContainsHTML(const nsString& aBuffer, PRBool& aHasXMLFragment) The boolean aHasXMLFragment is what is of interest in this bug. It is meant to be set when the function detects that the buffer is the beginning of an XML document. As it stands, this function is giving a wrong guess is almost all instances. I tried the following patch in CNavDTD:CanParse() from where the function is called, I will attach the output that I got on startup where all the files are XML as indicated by their MIME type. It shows that the current detection is pretty useless, although the wrong guess has no particular impact later on. Index: src/CNavDTD.cpp =================================================================== RCS file: /cvsroot/mozilla/htmlparser/src/CNavDTD.cpp,v retrieving revision 3.364 diff -u -5 -r3.364 CNavDTD.cpp --- src/CNavDTD.cpp 28 Nov 2001 06:13:01 -0000 3.364 +++ src/CNavDTD.cpp 3 Dec 2001 06:04:18 -0000 @@ -343,10 +343,11 @@ } else if(PR_TRUE==aParserContext.mMimeType.EqualsWithConversion(kTextJSContentType)) { result=ePrimaryDetect; } else { +nsAutoString mimeType(aParserContext.mMimeType); //otherwise, look into the buffer to see if you recognize anything... PRBool theBufHasXML=PR_FALSE; if(BufferContainsHTML(aBuffer,theBufHasXML)){ result = eValidDetect ; if(0==aParserContext.mMimeType.Length()) { @@ -362,10 +363,14 @@ } } else result=eValidDetect; } } +printf("hasXML:%s MIME:%s URI:%s\n", +theBufHasXML ? "Yes" : "No ", +NS_LossyConvertUCS2toASCII(mimeType).get(), +NS_LossyConvertUCS2toASCII(aParserContext.mScanner->GetFilename()).get()); } } return result; }

rbs

Reporter

Comment 1

•

23 years ago

Attached file output of the detection at startup — Details

harishd

Updated

•

23 years ago

Priority: -- → P3

Target Milestone: --- → mozilla0.9.8

harishd

Updated

•

23 years ago

Status: NEW → ASSIGNED

Target Milestone: mozilla0.9.8 → mozilla0.9.9

harishd

Updated

•

23 years ago

Target Milestone: mozilla0.9.9 → mozilla1.0.1

rbs

Reporter

Comment 2

•

23 years ago

Bug 132681 also spotted at this problem.

harishd

Updated

•

23 years ago

Target Milestone: mozilla1.0.1 → mozilla1.2beta

harishd

Updated

•

22 years ago

Target Milestone: mozilla1.2beta → mozilla1.4alpha

Boris Zbarsky [:bzbarsky]

Assignee

Comment 3

•

21 years ago

Attached patch Patch (obsolete) — Details — Splinter Review

Boris Zbarsky [:bzbarsky]

Assignee

Comment 4

•

21 years ago

Comment on attachment 148069 [details] [diff] [review] Patch The code was looking for "<?xml" starting at position 100 in the file, which only worked if the file had a PI after a license comment, basically... This should fix it to look starting at start of file.

Attachment #148069 - Flags: superreview?(rbs)

Attachment #148069 - Flags: review?(rbs)

rbs

Reporter

Comment 5

•

21 years ago

Did you consider bug 132681 comment 4? Apparently, that function is even bogus. It was an earlier attempt at sniffing, and hasn't evolved at all, whereas the sniffer of necko is top-notch these days. You are in a better postition for a surgical patch... since you are the maintainer of the sniffer of necko. I tried my little patch (updated to the newest string-fu...), and it didn't print anything (without your patch). This seems due to the patch added for bug 132681. We don't get into that if-branch, coupled with the fact that the mime-type provided by necko is generally correct.

rbs

Reporter

Comment 6

•

21 years ago

CC:ing bz, see my comment above.

Boris Zbarsky [:bzbarsky]

Assignee

Comment 7

•

21 years ago

Oh, hmmmm... Yeah, the better thing to do would probably be to remove the XML crap from here altogether. I'll try to see how much of this code I can remove without my sanity being affected... ;)

Boris Zbarsky [:bzbarsky]

Assignee

Updated

•

21 years ago

Attachment #148069 - Flags: superreview?(rbs)

Attachment #148069 - Flags: superreview-

Attachment #148069 - Flags: review?(rbs)

Attachment #148069 - Flags: review-

Boris Zbarsky [:bzbarsky]

Assignee

Comment 8

•

21 years ago

Attached patch Just remove all this gunk (obsolete) — Details — Splinter Review

This removes most of the MIME gunk from the parser, leaving some in the one place it's useful -- the parser context. Note that DetermineParseMode() is always called before FindSuitableDTD, if needed, so relying on the parser context's doctype is in fact just fine in the CanParse() code (which is only called from FindSuitableDTD).

Boris Zbarsky [:bzbarsky]

Assignee

Updated

•

21 years ago

Assignee: harishd → bzbarsky

Attachment #148069 - Attachment is obsolete: true

Boris Zbarsky [:bzbarsky]

Assignee

Comment 9

•

21 years ago

Comment on attachment 148162 [details] [diff] [review] Just remove all this gunk This patch does change semantics a bit... we will now assume that anything that's not in our hardcoded HTML/XML list and is being fed to the parser should be parsed as plaintext. Before we would sniff it for tag-like things. I think the new behavior is better, frankly.. ;)

Attachment #148162 - Flags: superreview?(jst)

Attachment #148162 - Flags: review?(rbs)

Boris Zbarsky [:bzbarsky]

Assignee

Updated

•

21 years ago

Priority: P3 → P1

Summary: Make XML detection more meaningful in BufferContainsHTML() → [FIX]Make XML detection more meaningful in BufferContainsHTML()

Target Milestone: mozilla1.4alpha → mozilla1.8alpha

rbs

Reporter

Comment 10

•

21 years ago

Comment on attachment 148162 [details] [diff] [review] Just remove all this gunk The patch looks good to me. There wasn't much point with the double sniffing because necko is security-sensitive and doesn't pass something with unknown MIME type. All the user would get would be an early 'Save As' (or 'Open With') instead. I was wondering about the (now useless) 'aBuffer' param in CanParse(), but it is harmless.

Attachment #148162 - Flags: review?(rbs) → review+

Johnny Stenback (:jst)

Comment 11

•

21 years ago

Comment on attachment 148162 [details] [diff] [review] Just remove all this gunk You wanna save us a few bytes of code by eliminating the unnecessary argument? sr=jst either way. It'll be interesting to see if this causes regressions :-)

Attachment #148162 - Flags: superreview?(jst) → superreview+

Boris Zbarsky [:bzbarsky]

Assignee

Comment 12

•

21 years ago

Attached patch Patch updated to comments — Details — Splinter Review

I removed the extra args and changed nsParser::WillBuildModel to only copy those first 1024 bytes if we're actually autodetecting the parsemode, since that's the only thing that buffer is needed for now. If this causes regressions, we fix them. ;)

Attachment #148162 - Attachment is obsolete: true

rbs

Reporter

Comment 13

•

21 years ago

Nice... I have the feeling that the XXXVidur comment doesn't apply anymore. Does it? + // XXXVidur Make a copy and only check in the first 1k His concern seemed to have come from the consumers of aBuffer in the repeated calls of WillBuildModel(). But now, the first call will cache the status [per the code just above in: @@ -1126,17 +1127,17 @@ FindSuitableDTD( ] and since aBuffer is gone, and the peek is buried in the |if|, nobody else will re-peek again... right?

Boris Zbarsky [:bzbarsky]

Assignee

Comment 14

•

21 years ago

Oh, good catch. Removed that comment.

Boris Zbarsky [:bzbarsky]

Assignee

Comment 15

•

21 years ago

Checked in. We win about 3700 bytes of build size and maybe 0.2% of Tp... (maybe; that's kinda the noise for those tests...)

Status: ASSIGNED → RESOLVED

Closed: 21 years ago

Resolution: --- → FIXED

You need to log in before you can comment on or make changes to this bug.