Closed Bug 113201 Opened 23 years ago Closed 20 years ago

[FIX]Make XML detection more meaningful in BufferContainsHTML()

Categories

(Core :: DOM: HTML Parser, defect, P1)

defect

Tracking

()

RESOLVED FIXED
mozilla1.8alpha1

People

(Reporter: rbs, Assigned: bzbarsky)

Details

Attachments

(2 files, 2 obsolete files)

There is a function in htmlparser/src/nsDTDUtils.h with the current signature:

inline PRBool BufferContainsHTML(const nsString& aBuffer,
                                 PRBool& aHasXMLFragment)

The boolean aHasXMLFragment is what is of interest in this bug. It is meant to 
be set when the function detects that the buffer is the beginning of an XML 
document. 

As it stands, this function is giving a wrong guess is almost all instances.

I tried the following patch in CNavDTD:CanParse() from where the function
is called, I will attach the output that I got on startup where all the files 
are XML as indicated by their MIME type. It shows that the current detection is 
pretty useless, although the wrong guess has no particular impact later on.


Index: src/CNavDTD.cpp
===================================================================
RCS file: /cvsroot/mozilla/htmlparser/src/CNavDTD.cpp,v
retrieving revision 3.364
diff -u -5 -r3.364 CNavDTD.cpp
--- src/CNavDTD.cpp	28 Nov 2001 06:13:01 -0000	3.364
+++ src/CNavDTD.cpp	3 Dec 2001 06:04:18 -0000
@@ -343,10 +343,11 @@
     } 
     else 
if(PR_TRUE==aParserContext.mMimeType.EqualsWithConversion(kTextJSContentType)) {
       result=ePrimaryDetect;
     } 
     else {
+nsAutoString mimeType(aParserContext.mMimeType);
       //otherwise, look into the buffer to see if you recognize anything...
       PRBool theBufHasXML=PR_FALSE;
       if(BufferContainsHTML(aBuffer,theBufHasXML)){
         result = eValidDetect ;
         if(0==aParserContext.mMimeType.Length()) {
@@ -362,10 +363,14 @@
             }
           }
           else result=eValidDetect;
         }
       }
+printf("hasXML:%s MIME:%s URI:%s\n", 
+theBufHasXML ? "Yes" : "No ",
+NS_LossyConvertUCS2toASCII(mimeType).get(),
+NS_LossyConvertUCS2toASCII(aParserContext.mScanner->GetFilename()).get());
     } 
   }
   return result;
 }
Priority: -- → P3
Target Milestone: --- → mozilla0.9.8
Status: NEW → ASSIGNED
Target Milestone: mozilla0.9.8 → mozilla0.9.9
Target Milestone: mozilla0.9.9 → mozilla1.0.1
Bug 132681 also spotted at this problem.
Target Milestone: mozilla1.0.1 → mozilla1.2beta
Target Milestone: mozilla1.2beta → mozilla1.4alpha
Attached patch Patch (obsolete) — Splinter Review
Comment on attachment 148069 [details] [diff] [review]
Patch

The code was looking for "<?xml" starting at position 100 in the file, which
only worked if the file had a PI after a license comment, basically...

This should fix it to look starting at start of file.
Attachment #148069 - Flags: superreview?(rbs)
Attachment #148069 - Flags: review?(rbs)
Did you consider bug 132681 comment 4? Apparently, that function is even bogus.
It was an earlier attempt at sniffing, and hasn't evolved at all, whereas the
sniffer of necko is top-notch these days. You are in a better postition for a
surgical patch... since you are the maintainer of the sniffer of necko.

I tried my little patch (updated to the newest string-fu...), and it didn't
print anything (without your patch). This seems due to the patch added for bug
132681. We don't get into that if-branch, coupled with the fact that the
mime-type provided by necko is generally correct.
CC:ing bz, see my comment above.
Oh, hmmmm... Yeah, the better thing to do would probably be to remove the XML
crap from here altogether.

I'll try to see how much of this code I can remove without my sanity being
affected... ;)
Attachment #148069 - Flags: superreview?(rbs)
Attachment #148069 - Flags: superreview-
Attachment #148069 - Flags: review?(rbs)
Attachment #148069 - Flags: review-
Attached patch Just remove all this gunk (obsolete) — Splinter Review
This removes most of the MIME gunk from the parser, leaving some in the one
place it's useful -- the parser context.

Note that DetermineParseMode() is always called before FindSuitableDTD, if
needed, so relying on the parser context's doctype is in fact just fine in the
CanParse() code (which is only called from FindSuitableDTD).
Assignee: harishd → bzbarsky
Attachment #148069 - Attachment is obsolete: true
Comment on attachment 148162 [details] [diff] [review]
Just remove all this gunk

This patch does change semantics a bit... we will now assume that anything
that's not in our hardcoded HTML/XML list and is being fed to the parser should
be parsed as plaintext.  Before we would sniff it for tag-like things.	I think
the new behavior is better, frankly.. ;)
Attachment #148162 - Flags: superreview?(jst)
Attachment #148162 - Flags: review?(rbs)
Priority: P3 → P1
Summary: Make XML detection more meaningful in BufferContainsHTML() → [FIX]Make XML detection more meaningful in BufferContainsHTML()
Target Milestone: mozilla1.4alpha → mozilla1.8alpha
Comment on attachment 148162 [details] [diff] [review]
Just remove all this gunk

The patch looks good to me. There wasn't much point with the double sniffing
because necko is security-sensitive and doesn't pass something with unknown
MIME type. All the user would get would be an early 'Save As' (or 'Open With')
instead.

I was wondering about the (now useless) 'aBuffer' param in CanParse(), but it
is harmless.
Attachment #148162 - Flags: review?(rbs) → review+
Comment on attachment 148162 [details] [diff] [review]
Just remove all this gunk

You wanna save us a few bytes of code by eliminating the unnecessary argument?

sr=jst either way. It'll be interesting to see if this causes regressions :-)
Attachment #148162 - Flags: superreview?(jst) → superreview+
I removed the extra args and changed nsParser::WillBuildModel to only copy
those first 1024 bytes if we're actually autodetecting the parsemode, since
that's the only thing that buffer is needed for now.

If this causes regressions, we fix them.  ;)
Attachment #148162 - Attachment is obsolete: true
Nice...

I have the feeling that the XXXVidur comment doesn't apply anymore. Does it?

+    // XXXVidur Make a copy and only check in the first 1k

His concern seemed to have come from the consumers of aBuffer in the repeated
calls of WillBuildModel().

But now, the first call will cache the status [per the code just above in:
@@ -1126,17 +1127,17 @@ FindSuitableDTD( ]

and since aBuffer is gone, and the peek is buried in the |if|, nobody else will
re-peek again... right?
Oh, good catch.  Removed that comment.
Checked in.  We win about 3700 bytes of build size and maybe 0.2% of Tp...
(maybe; that's kinda the noise for those tests...)
Status: ASSIGNED → RESOLVED
Closed: 20 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: