Charset sniffing in parser could not find charset in some cases

VERIFIED FIXED in mozilla1.0

Status

()

Core
HTML: Parser
P3
normal
VERIFIED FIXED
17 years ago
16 years ago

People

(Reporter: Shanjian Li, Assigned: Shanjian Li)

Tracking

({regression, topembed+})

Trunk
mozilla1.0
regression, topembed+
Points:
---

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [adt2])

Attachments

(1 attachment, 5 obsolete attachments)

(Assignee)

Description

17 years ago
While I was working on bug 99666, I found that if meta charset is specified as
<META http-equiv=Content-Type content=text/html;charset=iso-8859-1>
instead of 
<META http-equiv=Content-Type content=text/html; charset=iso-8859-1>
, charset sniffing could not extract the charset information. While I am not 
very sure if the former part is standard compatible, it is certainly a normal 
practice. Our meta charset observer could handle these cases.

Updated

17 years ago
QA Contact: bsharma → moied

Updated

17 years ago
Status: NEW → ASSIGNED
Priority: -- → P3
Target Milestone: --- → mozilla1.0

Comment 1

16 years ago
Bugs targeted at mozilla1.0 without the mozilla1.0 keyword moved to mozilla1.0.1 
(you can query for this string to delete spam or retrieve the list of bugs I've 
moved)
Target Milestone: mozilla1.0 → mozilla1.0.1
(Assignee)

Comment 2

16 years ago
reassign to myself. 
Assignee: harishd → shanjian
Status: ASSIGNED → NEW
(Assignee)

Comment 3

16 years ago
*** Bug 112929 has been marked as a duplicate of this bug. ***
(Assignee)

Comment 4

16 years ago
Created attachment 68019 [details] [diff] [review]
patch
(Assignee)

Comment 5

16 years ago
harish, could you review my patch?
Status: NEW → ASSIGNED
(Assignee)

Updated

16 years ago
Target Milestone: mozilla1.0.1 → mozilla0.9.9

Comment 6

16 years ago
shanjian: could you explain to me what your patch does?
(Assignee)

Comment 7

16 years ago
The standard form of meta charset specification is :
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">This can be
handled by existing code correctly. However several variations, though
incorrect, are widely used in many websites. They are ,
  <meta http-equiv="Content-Type" content="text/html;charset=UTF-8">  
  <meta http-equiv="Content-Type" content="text/html" charset="UTF-8">  
  <meta charset="UTF-8">

My approach is for each meta tag, I search for "charset=". If found, the string
after '=' excluding single and double quote are interpreted as charset value. It
should be able to handle all these variation forms.

Comment 8

16 years ago
Shanjian: According to your patch we will be able to find the charset value
regradless of whether the attribute name is "content" or not. IMO, that's too
liberal. Do you know what IE does?

FYI: Still reviewing :-/

Comment 9

16 years ago
Also, you have removed the check for "Content-type" attribute value.

Could you please rewrite your patch such that the existing logic ( prior to
applying your patch ) does not change much?
(Assignee)

Comment 10

16 years ago
Harish, 
Both IE and Netscape4.x accept the format of <meta charset="UTF-8">. Because of
this, I have to drop all other verification as long as it is a meta tag and
contains certain way to express charset. It might sounds too liberal, but we
really do not have much choice. Besides, we are doing the same thing in meta
charset observer any way. 

(Assignee)

Comment 11

16 years ago
reset TM.
Target Milestone: mozilla0.9.9 → mozilla1.0

Comment 12

16 years ago
Moving Netscape owned 0.9.9 and 1.0 bugs that don't have an nsbeta1, nsbeta1+,
topembed, topembed+, Mozilla0.9.9+ or Mozilla1.0+ keyword.  Please send any
questions or feedback about this to adt@netscape.com.  You can search for
"Moving bugs not scheduled for a project" to quickly delete this bugmail.
Target Milestone: mozilla1.0 → mozilla1.2
(Assignee)

Comment 13

16 years ago
nominate this bug as nsbeta1
Keywords: nsbeta1

Comment 14

16 years ago
I know all the thing shanjian mentioned are supported by both IE and N4.x
I am sure this is a regression over 6.1. nsbeta1+
Keywords: nsbeta1 → nsbeta1+, regression
Target Milestone: mozilla1.2 → mozilla1.0

Comment 15

16 years ago
topembed because this level of tolerance is essential on 
many of the international sites we need to be concerned about.
Keywords: topembed

Comment 16

16 years ago
If the charset sniffer code fails then the meta charset observer should be able to 
detect it right? That's, the tag observer would kick in, passing in the META tag
information, and hence should cause a document reload with the correct charset.

Comment 17

16 years ago
kat: Do you mean that there are a lot of international sites that exhibit this
problem? I see only one dupe in this bug and that too doesn't point to any real
world url. Please explain to me why this should be topembed. 

Comment 18

16 years ago
harishd, over the last several years, I have observed sites with
variations of this problem from time to time. This is not very frequent
but occurs regularly. I don't have time to provide URLs now but during the
top 500 site testing in Japan conducted late last year, I saw several 
cases of one variation or another of what shanjian provided above. And we had 
to contact those sites to fix the problem. 
So the answer to your question is: yes, these sites do exist and I have
seen a sizable number of them over the last several years. I suspect that
some authoring tools or scripts create an error of this type. 
It was only a few days ago that somone discovered that the site
build script our team uses for developer.netscape.com pages contained
an error like the following:

=======================
<head>
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
<title>What's New</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
....
....
=======================

We corrected it the next day but it was the XSLT script we were using
that had this problem. This is not exactly the same problem shanjian mentions but
some script I have seen generate something like:

<meta charset="ISO-8859-1">

We are starting test sites in China and we expect to find more of the same.

For these reasons, we should build in some tolerance for these types 
of mistakes.

Comment 19

16 years ago
Kat: Being too liberal would cause problems in the future. One of the main
reasons for HTML to be so broken, in the real world, is because Navigator was
too tolerant. However, I'm not saying that we should be strict and follow the
spec. verbatim. We should solve these problems on a case by case basis (
provided it's a 
high traffic web site ).
(Assignee)

Comment 20

16 years ago
harish, 

If it is one or two websites, that will be fine. But there are simply too many 
websites using various practice. All those sites works well with IE, and before 
we get much market share, it does not make sense to change the world. I would 
like to emphasis again that we are doing the same thing in meta charset 
observer. If there is any problem with this approach, the problem is already 
there. You should agree that reload charset later will always cause more 
problem, let alone performance. 

Comment 21

16 years ago
>All those sites works well with IE, and before  we get much market share, it
does >not make sense to change the world.
I never said that we evangelize. All I said was that we should deal with this
problem on a |CASE| by |CASE| basis. There no reason get all paranoid for a
problem not yet seen.

>like to emphasis again that we are doing the same thing in meta charset 
>observer. If there is any problem with this approach, the problem is already 
>there. You should agree that reload charset later will always cause more 
>problem, let alone performance. 

If the meta charset is able to handle it then well and good. We will fall back
on the charset observer for a malformed case. Performance, will not be an issue
for a good mark up right?

Comment 22

16 years ago
Btw, Mike Shaver has a proposal to avoid charset reload ( ref. bug 61363 ).
(Assignee)

Comment 23

16 years ago
>>I never said that we evangelize. All I said was that we should deal with this
>>problem on a |CASE| by |CASE| basis. There no reason get all paranoid for a
>>problem not yet seen.
Considering all the following examples (incomplete yet), do you still believe my 
approach is too liberal? We identified a charset inside meta tag, and there is a 
value follow it, should we assume that user intend to specify the charset? 
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1">
<meta http-equiv="Content-Type" content="text/html" charset="UTF-8">
<meta charset="ISO-8859-1">

>>If the meta charset is able to handle it then well and good. We will fall back
>>on the charset observer for a malformed case. Performance, will not be an 
>>issue for a good mark up right?
If meta charset observer work all well, then I will not spend time on this bug 
at this time, and it will not be escalated as nsbeta1+. There is situations 
where meta charset observer's notification will not be honored. So it is not 
only a performance issue. 

I just made some comment in bug 61363 today, I really doubt that the bug can be 
fixed any time soon. It is a very complicated issue because reconversion from 
unicode to native encoding does not work (yet). 

Comment 24

16 years ago
>If meta charset observer work all well, then I will not spend time on this bug 
>at this time, and it will not be escalated as nsbeta1+.

The what do you mean by "I would like to emphasis again that we are doing the
same thing in meta charset observer."?

Comment 25

16 years ago
>If meta charset observer work all well, then I will not spend time on this bug 
>at this time, and it will not be escalated as nsbeta1+.

Then what do you mean by "I would like to emphasis again that we are doing the
same thing in meta charset observer."?
(Assignee)

Comment 26

16 years ago
I mean we are parsing the meta tag in a "liberal" way in meta charset observer 
as I suggested in this patch. 

Comment 27

16 years ago
Comment on attachment 68019 [details] [diff] [review]
patch

>+        if (CaseInsensitiveFindInReadable(NS_LITERAL_CSTRING("CHARSET="), tokStart, tokEnd)) {

This wouldn't work if the markup was <META CHARSET = "XXX">. So, my be you
should only search for 
"CHARSET". Also, you should guarantee that "CHARSET" is an attribute name.
That's charset should be
to the left of '='

>+          // extract charset value, taking care of open and close quote
>+          tokStart = tokEnd;
>+          char endChar;
>+          if (*tokStart == '\'' || *tokStart == '\"')
>+            ++tokStart;

You may have to check for iterator end before incrementing

>+          tokEnd = tokStart;
> 
>-        if (foundContentType && (aCharset.Length() > 0)) {
>+          while (*tokEnd != '\'' && *tokEnd != '\"' && tokEnd.get() < end.get()) 
>+            ++tokEnd;

Replace tokEnd.get() < end.get() with tokEnd != end.

>+          aCharset.Assign(NS_ConvertASCIItoUCS2(tokStart.get(), tokEnd.get() - tokStart.get()));

aCharset.Assign(NS_ConvertASCIItoUCS2(tokStart, tokEnd - tokStart)); should
work. right?

>           return PR_TRUE;

How do you handle a CHARSET attribute with no value. That is,
<META CHARSET=>. We shouldn't be returning PR_TRUE if there was no charset
value. right?


BTW: Please get an sr= from vidur
Attachment #68019 - Flags: needs-work+
(Assignee)

Comment 28

16 years ago
Created attachment 74419 [details] [diff] [review]
rewrite my patch
Attachment #68019 - Attachment is obsolete: true
(Assignee)

Comment 29

16 years ago
While I was taking harish's suggestion, I felt my last patch was difficult to
understand. So I rewrote it. 
harish, could you take a look again? Thanks.
cc vidur.

Comment 30

16 years ago
Comment on attachment 74419 [details] [diff] [review]
rewrite my patch

>+    // extract charset value, taking care of open and close quote
>+    while (*currPos == ' ')  // skip spaces
>+      ++currPos;
You need to check for iterator end before increamenting currPos. Also, just
checking 
for ' ' is not sufficient. You should also check for '\n', '\r', '\t'.

>+    ++currPos;
>+    while (*currPos == ' ')  //skip spaces
>+      ++currPos;

Again check for iterator end before incrementing.

>+    } else {
>+      currPos = tagEnd;
>+    }

Could you add a comment in the else clause. Also, how do you deal with 
<META CHARSET= > case? Would you return charset found or not found?
Attachment #74419 - Flags: needs-work+
(Assignee)

Comment 31

16 years ago
Created attachment 74836 [details] [diff] [review]
update patch
Attachment #74419 - Attachment is obsolete: true
(Assignee)

Comment 32

16 years ago
>>You need to check for iterator end before increamenting currPos. 
In my new approach, I identified the end of tag ">" before anything else. So 
while I was checking for space characters, it won't go out of range since 
we know we will meet ">" before that happens.

>>Also, justchecking for ' ' is not sufficient. You should also check for '\n', '\r', '\t'.

Good point. That is done. 

>>Again check for iterator end before incrementing.
Same reason as above. 

>>Could you add a comment in the else clause. Also, how do you deal with 
>><META CHARSET= > case? Would you return charset found or not found?
Done. In case of "<meta charset= >", it return false (not found). 

Comment 33

16 years ago
harishd, can you r= shanjian's update patch?

Comment 34

16 years ago
Comment on attachment 74836 [details] [diff] [review]
update patch

>+    if (!CaseInsensitiveFindInReadable(NS_LITERAL_CSTRING("META"), currPos, tokEnd) 
>+            || currPos == begin)
>+      break;

Fix the indendation.

>+
>+    // extract charset value, taking care of open and close quote
>+    while (*currPos == kSpace || *currPos == kNewLine ||
>+           *currPos == kCR || *currPos == kTab)  // skip spaces

Update your comment

With those changes r=harishd.

Updated

16 years ago
Whiteboard: adt2
(Assignee)

Comment 35

16 years ago
Created attachment 76209 [details] [diff] [review]
update patch
Attachment #74836 - Attachment is obsolete: true
(Assignee)

Updated

16 years ago
Attachment #76209 - Flags: review+
(Assignee)

Comment 36

16 years ago
vidur, could you sr? 

Comment 37

16 years ago
topembed+
-by triage team (chofmann, cathleen, marcia)
Keywords: topembed → topembed+
Comment on attachment 76209 [details] [diff] [review]
update patch

+    // skip spaces before '='
+    while (*currPos == kSpace || *currPos == kNewLine ||
+	    *currPos == kCR || *currPos == kTab)  
+      ++currPos;

This relies on the string underneeth the iterators to be null terminated, I
don't see that being guaranteed anywhere, so check that currPos != tagEnd in
the loop.

+    // skip spaces after '='
+    while (*currPos == kSpace || *currPos == kNewLine ||
+	    *currPos == kCR || *currPos == kTab)  
+      ++currPos;
+	   

Same here, don't step beond the end of the string.

+    // return true if we successfully got something for charset
+    if (currPos != tokEnd) {
+      aCharset.Assign(NS_ConvertASCIItoUCS2(currPos.get(), tokEnd.get() -
currPos.get()));
+      return PR_TRUE;
+    } else {

Sigh, else-after-return! Loose the else here.

+      //nothing specified as charset, continue next loop
+      currPos = tagEnd;
+    }

sr=jst if you fix that.
Attachment #76209 - Flags: superreview+
Um, never mind the two first comments, they are non-issues since you know
there's an '>' before the end of the string.
(Assignee)

Comment 40

16 years ago
Created attachment 76494 [details] [diff] [review]
update patch

drop else after return.
Attachment #76209 - Attachment is obsolete: true
(Assignee)

Updated

16 years ago
Attachment #76494 - Flags: superreview+
Attachment #76494 - Flags: review+

Comment 41

16 years ago
Comment on attachment 76494 [details] [diff] [review]
update patch

a=asa (on behalf of drivers) for checkin to the 1.0 trunk
Attachment #76494 - Flags: approval+
(Assignee)

Comment 42

16 years ago
Created attachment 76581 [details] [diff] [review]
patch incorporate changes for bug 124904
Attachment #76494 - Attachment is obsolete: true
(Assignee)

Comment 43

16 years ago
Harish, I have to rewrite my patch to incorporate the changes you made for bug 124904. 
So could you do code review again? Thanks.

Comment 44

16 years ago
Comment on attachment 76581 [details] [diff] [review]
patch incorporate changes for bug 124904

>-      else if (!FindCharInReadable('>', tagEnd, end)) {
>-        // Can't be sure if this is a real tag since only 
>-        // a part of the tag is in this buffer. 
>-        return PR_FALSE; 
>+    tagEnd = currPos;
>+    if (!FindCharInReadable('>', tagEnd, end))
>+      break;

1) You removed my comments. Please put it back.
2) Before we were checking for '>' if other conditions failed. Now,
   it looks like we check for '>' regardless. Is this intentional?

>+    // If this is not a META tag, continue to next loop
>+    if ( (*currPos != 'm' && *currPos != 'M') ||
>+         (*(++currPos) != 'e' && *currPos != 'E') ||
>+         (*(++currPos) != 't' && *currPos != 'T') ||
>+         (*(++currPos) != 'a' && *currPos != 'A') ) {
>+      currPos = tagEnd;
>+      continue;
>     }

Why can't you simply use whatever that was there before?
That is,  if (CaseInsensitiveFindInReadable(NS_LITERAL_CSTRING("META"),
tagStart, tagEnd)) {

>+    ++currPos;

Are you sure that currPos does not go beyond end?
(Assignee)

Comment 45

16 years ago
>>1) You removed my comments. Please put it back.
I have similar comment before the block, ie.
    // Find the end of the tag, break if incomplete
    tagEnd = currPos;
    if (!FindCharInReadable('>', tagEnd, end))
      break;
Since the patch was based on my patch instead of yours, I made the changes 
only when necessary. But I am OK with your comment if you think it is 
better.

>>2) Before we were checking for '>' if other conditions failed. Now,
>>   it looks like we check for '>' regardless. Is this intentional?
Yes. as what I have done in my previous patch, the end of tag is checked 
in all condition. 


>+    // If this is not a META tag, continue to next loop
>+    if ( (*currPos != 'm' && *currPos != 'M') ||
>+         (*(++currPos) != 'e' && *currPos != 'E') ||
>+         (*(++currPos) != 't' && *currPos != 'T') ||
>+         (*(++currPos) != 'a' && *currPos != 'A') ) {
>+      currPos = tagEnd;
>+      continue;
>     }
>>Why can't you simply use whatever that was there before?
>>That is,  if (CaseInsensitiveFindInReadable(NS_LITERAL_CSTRING("META"),
>>tagStart, tagEnd)) {
  
Because we have identified start tag '>' and all those checking 
was immediately follow it. Using CaseInsensitiveFindInReadable might 
skip possible comment. 

>>Are you sure that currPos does not go beyond end? 
Yes, because we have located the end of tag '>'. We won't go beyond the 
limit before meeting this character. 

Comment 46

16 years ago
 Impact Platform: ALL
Impact language users: 560 M 100%
Probability of hitting the problem: MID, we saw some top sites have this kind of
problem. Both 4.x and IE, and 6.2 have no problem
Severity if hit the problem in the worst case: User will see corrupted text
Way of recover after hit the problem: User can try these 80+ charset one by one
to figure out which display correctly.
Risk of the fix: MID
Potential benefit of fix this problem: Other display problem

Comment 47

16 years ago
>>Why can't you simply use whatever that was there before?
>>That is,  if (CaseInsensitiveFindInReadable(NS_LITERAL_CSTRING("META"),
>>tagStart, tagEnd)) {
  
>Because we have identified start tag '>' and all those checking 
>was immediately follow it. Using CaseInsensitiveFindInReadable might 
>skip possible comment. 

I don't understand. Could you give me an example?
(Assignee)

Comment 48

16 years ago
There is typo in my previous comment. ">" should be "<".

As you see in my patch, after we identified '<', we check if the 
following stuff are "!--" or not. If so, it is a comment. If not, 
we can check if the following stuff is "meta". If we search for 
meta directly, using following example:

<some_tag> <!-- ...   <meta ...> --> 

After we identififies the start of some_tag, and verifies that it is 
not the start of comment, it's time to check if this tag is a meta tag. 
If I use "CaseInsensitiveFindInReadable", "<!--..." will skipped and 
currPos will be set to "meta". That's say in order to skip all possible 
comment, we have to check each tag individually.

Comment 49

16 years ago
The patch in bug 124904 is meant to take care of situations like <some_tag> <!--
...   <meta ...> -->. That's when you reach

-    if (CaseInsensitiveFindInReadable(NS_LITERAL_CSTRING("META"), tagStart,
tagEnd)) {

you are guaranteed to be out of a comment.

Comment 50

16 years ago
Comment on attachment 76581 [details] [diff] [review]
patch incorporate changes for bug 124904

r=harishd.
Attachment #76581 - Flags: review+
Comment on attachment 76581 [details] [diff] [review]
patch incorporate changes for bug 124904

This code is now full of *(++currPos) and similar code that accesses *currPos
after incrementing it w/o checking if we stepped past the end of the string or
not. Are we guaranteed that the string is always null-terminated? If not, this
is a crash waiting to happen.
(Assignee)

Comment 52

16 years ago
jst, before those *(++currPos) operations, the end of tag, ie '>'. has been located.
So we won't go beyond limit as long as we does not meet this '>'. And that's why
I only increment the pointer after the character before has been identified . 
Comment on attachment 76581 [details] [diff] [review]
patch incorporate changes for bug 124904

Oh, duh, should've looked closer.

Looks good to me, sr=jst
Attachment #76581 - Flags: superreview+

Comment 54

16 years ago
Comment on attachment 76581 [details] [diff] [review]
patch incorporate changes for bug 124904

a=asa (on behalf of drivers) for checkin to the 1.0 trunk
Attachment #76581 - Flags: approval+

Comment 55

16 years ago
major site issue
adt1.0.0
Keywords: adt1.0.0

Comment 56

16 years ago
adt1.0.0+ (on ADT's behalf) for checkin into 1.0. Pls test against top sites
after you have checked this in.
Keywords: adt1.0.0 → adt1.0.0+
Whiteboard: adt2 → [adt2]
(Assignee)

Comment 57

16 years ago
fix checked in. 
Status: ASSIGNED → RESOLVED
Last Resolved: 16 years ago
Resolution: --- → FIXED

Comment 58

16 years ago
Verified fix checked in cvs (Rev 3.304)
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.