Closed Bug 100849 Opened 23 years ago Closed 22 years ago

Charset sniffing in parser could not find charset in some cases

Categories

(Core :: DOM: HTML Parser, defect, P3)

defect

Tracking

()

VERIFIED FIXED
mozilla1.0

People

(Reporter: shanjian, Assigned: shanjian)

References

Details

(Keywords: regression, topembed+, Whiteboard: [adt2])

Attachments

(1 file, 5 obsolete files)

While I was working on bug 99666, I found that if meta charset is specified as
<META http-equiv=Content-Type content=text/html;charset=iso-8859-1>
instead of 
<META http-equiv=Content-Type content=text/html; charset=iso-8859-1>
, charset sniffing could not extract the charset information. While I am not 
very sure if the former part is standard compatible, it is certainly a normal 
practice. Our meta charset observer could handle these cases.
QA Contact: bsharma → moied
Status: NEW → ASSIGNED
Priority: -- → P3
Target Milestone: --- → mozilla1.0
Bugs targeted at mozilla1.0 without the mozilla1.0 keyword moved to mozilla1.0.1 
(you can query for this string to delete spam or retrieve the list of bugs I've 
moved)
Target Milestone: mozilla1.0 → mozilla1.0.1
reassign to myself. 
Assignee: harishd → shanjian
Status: ASSIGNED → NEW
*** Bug 112929 has been marked as a duplicate of this bug. ***
Attached patch patch (obsolete) — Splinter Review
harish, could you review my patch?
Status: NEW → ASSIGNED
Target Milestone: mozilla1.0.1 → mozilla0.9.9
shanjian: could you explain to me what your patch does?
The standard form of meta charset specification is :
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">This can be
handled by existing code correctly. However several variations, though
incorrect, are widely used in many websites. They are ,
  <meta http-equiv="Content-Type" content="text/html;charset=UTF-8">  
  <meta http-equiv="Content-Type" content="text/html" charset="UTF-8">  
  <meta charset="UTF-8">

My approach is for each meta tag, I search for "charset=". If found, the string
after '=' excluding single and double quote are interpreted as charset value. It
should be able to handle all these variation forms.
Shanjian: According to your patch we will be able to find the charset value
regradless of whether the attribute name is "content" or not. IMO, that's too
liberal. Do you know what IE does?

FYI: Still reviewing :-/
Also, you have removed the check for "Content-type" attribute value.

Could you please rewrite your patch such that the existing logic ( prior to
applying your patch ) does not change much?
Harish, 
Both IE and Netscape4.x accept the format of <meta charset="UTF-8">. Because of
this, I have to drop all other verification as long as it is a meta tag and
contains certain way to express charset. It might sounds too liberal, but we
really do not have much choice. Besides, we are doing the same thing in meta
charset observer any way. 

reset TM.
Target Milestone: mozilla0.9.9 → mozilla1.0
Moving Netscape owned 0.9.9 and 1.0 bugs that don't have an nsbeta1, nsbeta1+,
topembed, topembed+, Mozilla0.9.9+ or Mozilla1.0+ keyword.  Please send any
questions or feedback about this to adt@netscape.com.  You can search for
"Moving bugs not scheduled for a project" to quickly delete this bugmail.
Target Milestone: mozilla1.0 → mozilla1.2
nominate this bug as nsbeta1
Keywords: nsbeta1
I know all the thing shanjian mentioned are supported by both IE and N4.x
I am sure this is a regression over 6.1. nsbeta1+
Keywords: nsbeta1nsbeta1+, regression
Target Milestone: mozilla1.2 → mozilla1.0
topembed because this level of tolerance is essential on 
many of the international sites we need to be concerned about.
Keywords: topembed
If the charset sniffer code fails then the meta charset observer should be able to 
detect it right? That's, the tag observer would kick in, passing in the META tag
information, and hence should cause a document reload with the correct charset.
kat: Do you mean that there are a lot of international sites that exhibit this
problem? I see only one dupe in this bug and that too doesn't point to any real
world url. Please explain to me why this should be topembed. 
harishd, over the last several years, I have observed sites with
variations of this problem from time to time. This is not very frequent
but occurs regularly. I don't have time to provide URLs now but during the
top 500 site testing in Japan conducted late last year, I saw several 
cases of one variation or another of what shanjian provided above. And we had 
to contact those sites to fix the problem. 
So the answer to your question is: yes, these sites do exist and I have
seen a sizable number of them over the last several years. I suspect that
some authoring tools or scripts create an error of this type. 
It was only a few days ago that somone discovered that the site
build script our team uses for developer.netscape.com pages contained
an error like the following:

=======================
<head>
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
<title>What's New</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
....
....
=======================

We corrected it the next day but it was the XSLT script we were using
that had this problem. This is not exactly the same problem shanjian mentions but
some script I have seen generate something like:

<meta charset="ISO-8859-1">

We are starting test sites in China and we expect to find more of the same.

For these reasons, we should build in some tolerance for these types 
of mistakes.
Kat: Being too liberal would cause problems in the future. One of the main
reasons for HTML to be so broken, in the real world, is because Navigator was
too tolerant. However, I'm not saying that we should be strict and follow the
spec. verbatim. We should solve these problems on a case by case basis (
provided it's a 
high traffic web site ).
harish, 

If it is one or two websites, that will be fine. But there are simply too many 
websites using various practice. All those sites works well with IE, and before 
we get much market share, it does not make sense to change the world. I would 
like to emphasis again that we are doing the same thing in meta charset 
observer. If there is any problem with this approach, the problem is already 
there. You should agree that reload charset later will always cause more 
problem, let alone performance. 
>All those sites works well with IE, and before  we get much market share, it
does >not make sense to change the world.
I never said that we evangelize. All I said was that we should deal with this
problem on a |CASE| by |CASE| basis. There no reason get all paranoid for a
problem not yet seen.

>like to emphasis again that we are doing the same thing in meta charset 
>observer. If there is any problem with this approach, the problem is already 
>there. You should agree that reload charset later will always cause more 
>problem, let alone performance. 

If the meta charset is able to handle it then well and good. We will fall back
on the charset observer for a malformed case. Performance, will not be an issue
for a good mark up right?
Btw, Mike Shaver has a proposal to avoid charset reload ( ref. bug 61363 ).
>>I never said that we evangelize. All I said was that we should deal with this
>>problem on a |CASE| by |CASE| basis. There no reason get all paranoid for a
>>problem not yet seen.
Considering all the following examples (incomplete yet), do you still believe my 
approach is too liberal? We identified a charset inside meta tag, and there is a 
value follow it, should we assume that user intend to specify the charset? 
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1">
<meta http-equiv="Content-Type" content="text/html" charset="UTF-8">
<meta charset="ISO-8859-1">

>>If the meta charset is able to handle it then well and good. We will fall back
>>on the charset observer for a malformed case. Performance, will not be an 
>>issue for a good mark up right?
If meta charset observer work all well, then I will not spend time on this bug 
at this time, and it will not be escalated as nsbeta1+. There is situations 
where meta charset observer's notification will not be honored. So it is not 
only a performance issue. 

I just made some comment in bug 61363 today, I really doubt that the bug can be 
fixed any time soon. It is a very complicated issue because reconversion from 
unicode to native encoding does not work (yet). 
>If meta charset observer work all well, then I will not spend time on this bug 
>at this time, and it will not be escalated as nsbeta1+.

The what do you mean by "I would like to emphasis again that we are doing the
same thing in meta charset observer."?
>If meta charset observer work all well, then I will not spend time on this bug 
>at this time, and it will not be escalated as nsbeta1+.

Then what do you mean by "I would like to emphasis again that we are doing the
same thing in meta charset observer."?
I mean we are parsing the meta tag in a "liberal" way in meta charset observer 
as I suggested in this patch. 
Comment on attachment 68019 [details] [diff] [review]
patch

>+        if (CaseInsensitiveFindInReadable(NS_LITERAL_CSTRING("CHARSET="), tokStart, tokEnd)) {

This wouldn't work if the markup was <META CHARSET = "XXX">. So, my be you
should only search for 
"CHARSET". Also, you should guarantee that "CHARSET" is an attribute name.
That's charset should be
to the left of '='

>+          // extract charset value, taking care of open and close quote
>+          tokStart = tokEnd;
>+          char endChar;
>+          if (*tokStart == '\'' || *tokStart == '\"')
>+            ++tokStart;

You may have to check for iterator end before incrementing

>+          tokEnd = tokStart;
> 
>-        if (foundContentType && (aCharset.Length() > 0)) {
>+          while (*tokEnd != '\'' && *tokEnd != '\"' && tokEnd.get() < end.get()) 
>+            ++tokEnd;

Replace tokEnd.get() < end.get() with tokEnd != end.

>+          aCharset.Assign(NS_ConvertASCIItoUCS2(tokStart.get(), tokEnd.get() - tokStart.get()));

aCharset.Assign(NS_ConvertASCIItoUCS2(tokStart, tokEnd - tokStart)); should
work. right?

>           return PR_TRUE;

How do you handle a CHARSET attribute with no value. That is,
<META CHARSET=>. We shouldn't be returning PR_TRUE if there was no charset
value. right?


BTW: Please get an sr= from vidur
Attachment #68019 - Flags: needs-work+
Attached patch rewrite my patch (obsolete) — Splinter Review
Attachment #68019 - Attachment is obsolete: true
While I was taking harish's suggestion, I felt my last patch was difficult to
understand. So I rewrote it. 
harish, could you take a look again? Thanks.
cc vidur.
Comment on attachment 74419 [details] [diff] [review]
rewrite my patch

>+    // extract charset value, taking care of open and close quote
>+    while (*currPos == ' ')  // skip spaces
>+      ++currPos;
You need to check for iterator end before increamenting currPos. Also, just
checking 
for ' ' is not sufficient. You should also check for '\n', '\r', '\t'.

>+    ++currPos;
>+    while (*currPos == ' ')  //skip spaces
>+      ++currPos;

Again check for iterator end before incrementing.

>+    } else {
>+      currPos = tagEnd;
>+    }

Could you add a comment in the else clause. Also, how do you deal with 
<META CHARSET= > case? Would you return charset found or not found?
Attachment #74419 - Flags: needs-work+
Attached patch update patch (obsolete) — Splinter Review
Attachment #74419 - Attachment is obsolete: true
>>You need to check for iterator end before increamenting currPos. 
In my new approach, I identified the end of tag ">" before anything else. So 
while I was checking for space characters, it won't go out of range since 
we know we will meet ">" before that happens.

>>Also, justchecking for ' ' is not sufficient. You should also check for '\n', '\r', '\t'.

Good point. That is done. 

>>Again check for iterator end before incrementing.
Same reason as above. 

>>Could you add a comment in the else clause. Also, how do you deal with 
>><META CHARSET= > case? Would you return charset found or not found?
Done. In case of "<meta charset= >", it return false (not found). 
harishd, can you r= shanjian's update patch?
Comment on attachment 74836 [details] [diff] [review]
update patch

>+    if (!CaseInsensitiveFindInReadable(NS_LITERAL_CSTRING("META"), currPos, tokEnd) 
>+            || currPos == begin)
>+      break;

Fix the indendation.

>+
>+    // extract charset value, taking care of open and close quote
>+    while (*currPos == kSpace || *currPos == kNewLine ||
>+           *currPos == kCR || *currPos == kTab)  // skip spaces

Update your comment

With those changes r=harishd.
Whiteboard: adt2
Attached patch update patch (obsolete) — Splinter Review
Attachment #74836 - Attachment is obsolete: true
Attachment #76209 - Flags: review+
vidur, could you sr? 
topembed+
-by triage team (chofmann, cathleen, marcia)
Keywords: topembedtopembed+
Comment on attachment 76209 [details] [diff] [review]
update patch

+    // skip spaces before '='
+    while (*currPos == kSpace || *currPos == kNewLine ||
+	    *currPos == kCR || *currPos == kTab)  
+      ++currPos;

This relies on the string underneeth the iterators to be null terminated, I
don't see that being guaranteed anywhere, so check that currPos != tagEnd in
the loop.

+    // skip spaces after '='
+    while (*currPos == kSpace || *currPos == kNewLine ||
+	    *currPos == kCR || *currPos == kTab)  
+      ++currPos;
+	   

Same here, don't step beond the end of the string.

+    // return true if we successfully got something for charset
+    if (currPos != tokEnd) {
+      aCharset.Assign(NS_ConvertASCIItoUCS2(currPos.get(), tokEnd.get() -
currPos.get()));
+      return PR_TRUE;
+    } else {

Sigh, else-after-return! Loose the else here.

+      //nothing specified as charset, continue next loop
+      currPos = tagEnd;
+    }

sr=jst if you fix that.
Attachment #76209 - Flags: superreview+
Um, never mind the two first comments, they are non-issues since you know
there's an '>' before the end of the string.
Attached patch update patch (obsolete) — Splinter Review
drop else after return.
Attachment #76209 - Attachment is obsolete: true
Attachment #76494 - Flags: superreview+
Attachment #76494 - Flags: review+
Comment on attachment 76494 [details] [diff] [review]
update patch

a=asa (on behalf of drivers) for checkin to the 1.0 trunk
Attachment #76494 - Flags: approval+
Attachment #76494 - Attachment is obsolete: true
Harish, I have to rewrite my patch to incorporate the changes you made for bug 124904. 
So could you do code review again? Thanks.
Comment on attachment 76581 [details] [diff] [review]
patch incorporate changes for bug 124904

>-      else if (!FindCharInReadable('>', tagEnd, end)) {
>-        // Can't be sure if this is a real tag since only 
>-        // a part of the tag is in this buffer. 
>-        return PR_FALSE; 
>+    tagEnd = currPos;
>+    if (!FindCharInReadable('>', tagEnd, end))
>+      break;

1) You removed my comments. Please put it back.
2) Before we were checking for '>' if other conditions failed. Now,
   it looks like we check for '>' regardless. Is this intentional?

>+    // If this is not a META tag, continue to next loop
>+    if ( (*currPos != 'm' && *currPos != 'M') ||
>+         (*(++currPos) != 'e' && *currPos != 'E') ||
>+         (*(++currPos) != 't' && *currPos != 'T') ||
>+         (*(++currPos) != 'a' && *currPos != 'A') ) {
>+      currPos = tagEnd;
>+      continue;
>     }

Why can't you simply use whatever that was there before?
That is,  if (CaseInsensitiveFindInReadable(NS_LITERAL_CSTRING("META"),
tagStart, tagEnd)) {

>+    ++currPos;

Are you sure that currPos does not go beyond end?
>>1) You removed my comments. Please put it back.
I have similar comment before the block, ie.
    // Find the end of the tag, break if incomplete
    tagEnd = currPos;
    if (!FindCharInReadable('>', tagEnd, end))
      break;
Since the patch was based on my patch instead of yours, I made the changes 
only when necessary. But I am OK with your comment if you think it is 
better.

>>2) Before we were checking for '>' if other conditions failed. Now,
>>   it looks like we check for '>' regardless. Is this intentional?
Yes. as what I have done in my previous patch, the end of tag is checked 
in all condition. 


>+    // If this is not a META tag, continue to next loop
>+    if ( (*currPos != 'm' && *currPos != 'M') ||
>+         (*(++currPos) != 'e' && *currPos != 'E') ||
>+         (*(++currPos) != 't' && *currPos != 'T') ||
>+         (*(++currPos) != 'a' && *currPos != 'A') ) {
>+      currPos = tagEnd;
>+      continue;
>     }
>>Why can't you simply use whatever that was there before?
>>That is,  if (CaseInsensitiveFindInReadable(NS_LITERAL_CSTRING("META"),
>>tagStart, tagEnd)) {
  
Because we have identified start tag '>' and all those checking 
was immediately follow it. Using CaseInsensitiveFindInReadable might 
skip possible comment. 

>>Are you sure that currPos does not go beyond end? 
Yes, because we have located the end of tag '>'. We won't go beyond the 
limit before meeting this character. 
 Impact Platform: ALL
Impact language users: 560 M 100%
Probability of hitting the problem: MID, we saw some top sites have this kind of
problem. Both 4.x and IE, and 6.2 have no problem
Severity if hit the problem in the worst case: User will see corrupted text
Way of recover after hit the problem: User can try these 80+ charset one by one
to figure out which display correctly.
Risk of the fix: MID
Potential benefit of fix this problem: Other display problem
>>Why can't you simply use whatever that was there before?
>>That is,  if (CaseInsensitiveFindInReadable(NS_LITERAL_CSTRING("META"),
>>tagStart, tagEnd)) {
  
>Because we have identified start tag '>' and all those checking 
>was immediately follow it. Using CaseInsensitiveFindInReadable might 
>skip possible comment. 

I don't understand. Could you give me an example?
There is typo in my previous comment. ">" should be "<".

As you see in my patch, after we identified '<', we check if the 
following stuff are "!--" or not. If so, it is a comment. If not, 
we can check if the following stuff is "meta". If we search for 
meta directly, using following example:

<some_tag> <!-- ...   <meta ...> --> 

After we identififies the start of some_tag, and verifies that it is 
not the start of comment, it's time to check if this tag is a meta tag. 
If I use "CaseInsensitiveFindInReadable", "<!--..." will skipped and 
currPos will be set to "meta". That's say in order to skip all possible 
comment, we have to check each tag individually.
The patch in bug 124904 is meant to take care of situations like <some_tag> <!--
...   <meta ...> -->. That's when you reach

-    if (CaseInsensitiveFindInReadable(NS_LITERAL_CSTRING("META"), tagStart,
tagEnd)) {

you are guaranteed to be out of a comment.
Comment on attachment 76581 [details] [diff] [review]
patch incorporate changes for bug 124904

r=harishd.
Attachment #76581 - Flags: review+
Comment on attachment 76581 [details] [diff] [review]
patch incorporate changes for bug 124904

This code is now full of *(++currPos) and similar code that accesses *currPos
after incrementing it w/o checking if we stepped past the end of the string or
not. Are we guaranteed that the string is always null-terminated? If not, this
is a crash waiting to happen.
jst, before those *(++currPos) operations, the end of tag, ie '>'. has been located.
So we won't go beyond limit as long as we does not meet this '>'. And that's why
I only increment the pointer after the character before has been identified . 
Comment on attachment 76581 [details] [diff] [review]
patch incorporate changes for bug 124904

Oh, duh, should've looked closer.

Looks good to me, sr=jst
Attachment #76581 - Flags: superreview+
Comment on attachment 76581 [details] [diff] [review]
patch incorporate changes for bug 124904

a=asa (on behalf of drivers) for checkin to the 1.0 trunk
Attachment #76581 - Flags: approval+
major site issue
adt1.0.0
Keywords: adt1.0.0
adt1.0.0+ (on ADT's behalf) for checkin into 1.0. Pls test against top sites
after you have checked this in.
Keywords: adt1.0.0adt1.0.0+
Whiteboard: adt2 → [adt2]
fix checked in. 
Status: ASSIGNED → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
Verified fix checked in cvs (Rev 3.304)
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: