Closed
Bug 113201
Opened 23 years ago
Closed 21 years ago
[FIX]Make XML detection more meaningful in BufferContainsHTML()
Categories
(Core :: DOM: HTML Parser, defect, P1)
Core
DOM: HTML Parser
Tracking
()
RESOLVED
FIXED
mozilla1.8alpha1
People
(Reporter: rbs, Assigned: bzbarsky)
Details
Attachments
(2 files, 2 obsolete files)
6.71 KB,
text/plain
|
Details | |
24.59 KB,
patch
|
Details | Diff | Splinter Review |
There is a function in htmlparser/src/nsDTDUtils.h with the current signature:
inline PRBool BufferContainsHTML(const nsString& aBuffer,
PRBool& aHasXMLFragment)
The boolean aHasXMLFragment is what is of interest in this bug. It is meant to
be set when the function detects that the buffer is the beginning of an XML
document.
As it stands, this function is giving a wrong guess is almost all instances.
I tried the following patch in CNavDTD:CanParse() from where the function
is called, I will attach the output that I got on startup where all the files
are XML as indicated by their MIME type. It shows that the current detection is
pretty useless, although the wrong guess has no particular impact later on.
Index: src/CNavDTD.cpp
===================================================================
RCS file: /cvsroot/mozilla/htmlparser/src/CNavDTD.cpp,v
retrieving revision 3.364
diff -u -5 -r3.364 CNavDTD.cpp
--- src/CNavDTD.cpp 28 Nov 2001 06:13:01 -0000 3.364
+++ src/CNavDTD.cpp 3 Dec 2001 06:04:18 -0000
@@ -343,10 +343,11 @@
}
else
if(PR_TRUE==aParserContext.mMimeType.EqualsWithConversion(kTextJSContentType)) {
result=ePrimaryDetect;
}
else {
+nsAutoString mimeType(aParserContext.mMimeType);
//otherwise, look into the buffer to see if you recognize anything...
PRBool theBufHasXML=PR_FALSE;
if(BufferContainsHTML(aBuffer,theBufHasXML)){
result = eValidDetect ;
if(0==aParserContext.mMimeType.Length()) {
@@ -362,10 +363,14 @@
}
}
else result=eValidDetect;
}
}
+printf("hasXML:%s MIME:%s URI:%s\n",
+theBufHasXML ? "Yes" : "No ",
+NS_LossyConvertUCS2toASCII(mimeType).get(),
+NS_LossyConvertUCS2toASCII(aParserContext.mScanner->GetFilename()).get());
}
}
return result;
}
Bug 132681 also spotted at this problem.
![]() |
Assignee | |
Comment 3•21 years ago
|
||
![]() |
Assignee | |
Comment 4•21 years ago
|
||
Comment on attachment 148069 [details] [diff] [review]
Patch
The code was looking for "<?xml" starting at position 100 in the file, which
only worked if the file had a PI after a license comment, basically...
This should fix it to look starting at start of file.
Attachment #148069 -
Flags: superreview?(rbs)
Attachment #148069 -
Flags: review?(rbs)
Did you consider bug 132681 comment 4? Apparently, that function is even bogus.
It was an earlier attempt at sniffing, and hasn't evolved at all, whereas the
sniffer of necko is top-notch these days. You are in a better postition for a
surgical patch... since you are the maintainer of the sniffer of necko.
I tried my little patch (updated to the newest string-fu...), and it didn't
print anything (without your patch). This seems due to the patch added for bug
132681. We don't get into that if-branch, coupled with the fact that the
mime-type provided by necko is generally correct.
![]() |
Assignee | |
Comment 7•21 years ago
|
||
Oh, hmmmm... Yeah, the better thing to do would probably be to remove the XML
crap from here altogether.
I'll try to see how much of this code I can remove without my sanity being
affected... ;)
![]() |
Assignee | |
Updated•21 years ago
|
Attachment #148069 -
Flags: superreview?(rbs)
Attachment #148069 -
Flags: superreview-
Attachment #148069 -
Flags: review?(rbs)
Attachment #148069 -
Flags: review-
![]() |
Assignee | |
Comment 8•21 years ago
|
||
This removes most of the MIME gunk from the parser, leaving some in the one
place it's useful -- the parser context.
Note that DetermineParseMode() is always called before FindSuitableDTD, if
needed, so relying on the parser context's doctype is in fact just fine in the
CanParse() code (which is only called from FindSuitableDTD).
![]() |
Assignee | |
Updated•21 years ago
|
Assignee: harishd → bzbarsky
Attachment #148069 -
Attachment is obsolete: true
![]() |
Assignee | |
Comment 9•21 years ago
|
||
Comment on attachment 148162 [details] [diff] [review]
Just remove all this gunk
This patch does change semantics a bit... we will now assume that anything
that's not in our hardcoded HTML/XML list and is being fed to the parser should
be parsed as plaintext. Before we would sniff it for tag-like things. I think
the new behavior is better, frankly.. ;)
Attachment #148162 -
Flags: superreview?(jst)
Attachment #148162 -
Flags: review?(rbs)
![]() |
Assignee | |
Updated•21 years ago
|
Priority: P3 → P1
Summary: Make XML detection more meaningful in BufferContainsHTML() → [FIX]Make XML detection more meaningful in BufferContainsHTML()
Target Milestone: mozilla1.4alpha → mozilla1.8alpha
Reporter | ||
Comment 10•21 years ago
|
||
Comment on attachment 148162 [details] [diff] [review]
Just remove all this gunk
The patch looks good to me. There wasn't much point with the double sniffing
because necko is security-sensitive and doesn't pass something with unknown
MIME type. All the user would get would be an early 'Save As' (or 'Open With')
instead.
I was wondering about the (now useless) 'aBuffer' param in CanParse(), but it
is harmless.
Attachment #148162 -
Flags: review?(rbs) → review+
Comment 11•21 years ago
|
||
Comment on attachment 148162 [details] [diff] [review]
Just remove all this gunk
You wanna save us a few bytes of code by eliminating the unnecessary argument?
sr=jst either way. It'll be interesting to see if this causes regressions :-)
Attachment #148162 -
Flags: superreview?(jst) → superreview+
![]() |
Assignee | |
Comment 12•21 years ago
|
||
I removed the extra args and changed nsParser::WillBuildModel to only copy
those first 1024 bytes if we're actually autodetecting the parsemode, since
that's the only thing that buffer is needed for now.
If this causes regressions, we fix them. ;)
Attachment #148162 -
Attachment is obsolete: true
Reporter | ||
Comment 13•21 years ago
|
||
Nice...
I have the feeling that the XXXVidur comment doesn't apply anymore. Does it?
+ // XXXVidur Make a copy and only check in the first 1k
His concern seemed to have come from the consumers of aBuffer in the repeated
calls of WillBuildModel().
But now, the first call will cache the status [per the code just above in:
@@ -1126,17 +1127,17 @@ FindSuitableDTD( ]
and since aBuffer is gone, and the peek is buried in the |if|, nobody else will
re-peek again... right?
![]() |
Assignee | |
Comment 14•21 years ago
|
||
Oh, good catch. Removed that comment.
![]() |
Assignee | |
Comment 15•21 years ago
|
||
Checked in. We win about 3700 bytes of build size and maybe 0.2% of Tp...
(maybe; that's kinda the noise for those tests...)
Status: ASSIGNED → RESOLVED
Closed: 21 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•