Bugzilla

Updated

•

23 years ago

Status: NEW → ASSIGNED

Priority: -- → P3

Target Milestone: --- → mozilla1.0

Asa Dotzler [:asa]

Comment 1

•

23 years ago

Bugs targeted at mozilla1.0 without the mozilla1.0 keyword moved to mozilla1.0.1 
(you can query for this string to delete spam or retrieve the list of bugs I've 
moved)

Target Milestone: mozilla1.0 → mozilla1.0.1

Assignee

Comment 2

•

23 years ago

reassign to myself.

Assignee: harishd → shanjian

Status: ASSIGNED → NEW

Assignee

Comment 3

•

23 years ago

*** Bug 112929 has been marked as a duplicate of this bug. ***

Assignee

Comment 4

•

23 years ago

Attached patch patch (obsolete) — Details — Splinter Review

Assignee

Comment 5

•

23 years ago

harish, could you review my patch?

Status: NEW → ASSIGNED

Assignee

Updated

•

23 years ago

Target Milestone: mozilla1.0.1 → mozilla0.9.9

Comment 6

•

23 years ago

shanjian: could you explain to me what your patch does?

Assignee

Comment 7

•

23 years ago

The standard form of meta charset specification is :
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">This can be
handled by existing code correctly. However several variations, though
incorrect, are widely used in many websites. They are ,
  <meta http-equiv="Content-Type" content="text/html;charset=UTF-8">  
  <meta http-equiv="Content-Type" content="text/html" charset="UTF-8">  
  <meta charset="UTF-8">

My approach is for each meta tag, I search for "charset=". If found, the string
after '=' excluding single and double quote are interpreted as charset value. It
should be able to handle all these variation forms.

Comment 8

•

23 years ago

Shanjian: According to your patch we will be able to find the charset value
regradless of whether the attribute name is "content" or not. IMO, that's too
liberal. Do you know what IE does?

FYI: Still reviewing :-/

Comment 9

•

23 years ago

Also, you have removed the check for "Content-type" attribute value.

Could you please rewrite your patch such that the existing logic ( prior to
applying your patch ) does not change much?

Assignee

Comment 10

•

23 years ago

Harish, 
Both IE and Netscape4.x accept the format of <meta charset="UTF-8">. Because of
this, I have to drop all other verification as long as it is a meta tag and
contains certain way to express charset. It might sounds too liberal, but we
really do not have much choice. Besides, we are doing the same thing in meta
charset observer any way.

Assignee

Comment 11

•

23 years ago

reset TM.

Target Milestone: mozilla0.9.9 → mozilla1.0

scottputterman

Comment 12

•

23 years ago

Moving Netscape owned 0.9.9 and 1.0 bugs that don't have an nsbeta1, nsbeta1+,
topembed, topembed+, Mozilla0.9.9+ or Mozilla1.0+ keyword.  Please send any
questions or feedback about this to adt@netscape.com.  You can search for
"Moving bugs not scheduled for a project" to quickly delete this bugmail.

Target Milestone: mozilla1.0 → mozilla1.2

Assignee

Comment 13

•

23 years ago

nominate this bug as nsbeta1

Keywords: nsbeta1

Comment 14

•

22 years ago

I know all the thing shanjian mentioned are supported by both IE and N4.x
I am sure this is a regression over 6.1. nsbeta1+

Keywords: nsbeta1 → nsbeta1+, regression

Target Milestone: mozilla1.2 → mozilla1.0

Katsuhiko Momoi

Comment 15

•

22 years ago

topembed because this level of tolerance is essential on 
many of the international sites we need to be concerned about.

Keywords: topembed

Comment 16

•

22 years ago

If the charset sniffer code fails then the meta charset observer should be able to 
detect it right? That's, the tag observer would kick in, passing in the META tag
information, and hence should cause a document reload with the correct charset.

Comment 17

•

22 years ago

kat: Do you mean that there are a lot of international sites that exhibit this
problem? I see only one dupe in this bug and that too doesn't point to any real
world url. Please explain to me why this should be topembed.

Katsuhiko Momoi

Comment 18

•

22 years ago

harishd, over the last several years, I have observed sites with
variations of this problem from time to time. This is not very frequent
but occurs regularly. I don't have time to provide URLs now but during the
top 500 site testing in Japan conducted late last year, I saw several 
cases of one variation or another of what shanjian provided above. And we had 
to contact those sites to fix the problem. 
So the answer to your question is: yes, these sites do exist and I have
seen a sizable number of them over the last several years. I suspect that
some authoring tools or scripts create an error of this type. 
It was only a few days ago that somone discovered that the site
build script our team uses for developer.netscape.com pages contained
an error like the following:

=======================
<head>
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
<title>What's New</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
....
....
=======================

We corrected it the next day but it was the XSLT script we were using
that had this problem. This is not exactly the same problem shanjian mentions but
some script I have seen generate something like:

<meta charset="ISO-8859-1">

We are starting test sites in China and we expect to find more of the same.

For these reasons, we should build in some tolerance for these types 
of mistakes.

Comment 19

•

22 years ago

Kat: Being too liberal would cause problems in the future. One of the main
reasons for HTML to be so broken, in the real world, is because Navigator was
too tolerant. However, I'm not saying that we should be strict and follow the
spec. verbatim. We should solve these problems on a case by case basis (
provided it's a 
high traffic web site ).

Assignee

Comment 20

•

22 years ago

harish, 

If it is one or two websites, that will be fine. But there are simply too many 
websites using various practice. All those sites works well with IE, and before 
we get much market share, it does not make sense to change the world. I would 
like to emphasis again that we are doing the same thing in meta charset 
observer. If there is any problem with this approach, the problem is already 
there. You should agree that reload charset later will always cause more 
problem, let alone performance.

Comment 21

•

22 years ago

>All those sites works well with IE, and before  we get much market share, it
does >not make sense to change the world.
I never said that we evangelize. All I said was that we should deal with this
problem on a |CASE| by |CASE| basis. There no reason get all paranoid for a
problem not yet seen.

>like to emphasis again that we are doing the same thing in meta charset 
>observer. If there is any problem with this approach, the problem is already 
>there. You should agree that reload charset later will always cause more 
>problem, let alone performance. 

If the meta charset is able to handle it then well and good. We will fall back
on the charset observer for a malformed case. Performance, will not be an issue
for a good mark up right?

Comment 22

•

22 years ago

Btw, Mike Shaver has a proposal to avoid charset reload ( ref. bug 61363 ).

Assignee

Comment 23

•

22 years ago

>>I never said that we evangelize. All I said was that we should deal with this
>>problem on a |CASE| by |CASE| basis. There no reason get all paranoid for a
>>problem not yet seen.
Considering all the following examples (incomplete yet), do you still believe my 
approach is too liberal? We identified a charset inside meta tag, and there is a 
value follow it, should we assume that user intend to specify the charset? 
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1">
<meta http-equiv="Content-Type" content="text/html" charset="UTF-8">
<meta charset="ISO-8859-1">

>>If the meta charset is able to handle it then well and good. We will fall back
>>on the charset observer for a malformed case. Performance, will not be an 
>>issue for a good mark up right?
If meta charset observer work all well, then I will not spend time on this bug 
at this time, and it will not be escalated as nsbeta1+. There is situations 
where meta charset observer's notification will not be honored. So it is not 
only a performance issue. 

I just made some comment in bug 61363 today, I really doubt that the bug can be 
fixed any time soon. It is a very complicated issue because reconversion from 
unicode to native encoding does not work (yet).

Comment 24

•

22 years ago

>If meta charset observer work all well, then I will not spend time on this bug 
>at this time, and it will not be escalated as nsbeta1+.

The what do you mean by "I would like to emphasis again that we are doing the
same thing in meta charset observer."?

Comment 25

•

22 years ago

>If meta charset observer work all well, then I will not spend time on this bug 
>at this time, and it will not be escalated as nsbeta1+.

Then what do you mean by "I would like to emphasis again that we are doing the
same thing in meta charset observer."?

Assignee

Comment 26

•

22 years ago

I mean we are parsing the meta tag in a "liberal" way in meta charset observer 
as I suggested in this patch.

Comment 27

•

22 years ago

Comment on attachment 68019 [details] [diff] [review]
patch

>+        if (CaseInsensitiveFindInReadable(NS_LITERAL_CSTRING("CHARSET="), tokStart, tokEnd)) {

This wouldn't work if the markup was <META CHARSET = "XXX">. So, my be you
should only search for 
"CHARSET". Also, you should guarantee that "CHARSET" is an attribute name.
That's charset should be
to the left of '='

>+          // extract charset value, taking care of open and close quote
>+          tokStart = tokEnd;
>+          char endChar;
>+          if (*tokStart == '\'' || *tokStart == '\"')
>+            ++tokStart;

You may have to check for iterator end before incrementing

>+          tokEnd = tokStart;
> 
>-        if (foundContentType && (aCharset.Length() > 0)) {
>+          while (*tokEnd != '\'' && *tokEnd != '\"' && tokEnd.get() < end.get()) 
>+            ++tokEnd;

Replace tokEnd.get() < end.get() with tokEnd != end.

>+          aCharset.Assign(NS_ConvertASCIItoUCS2(tokStart.get(), tokEnd.get() - tokStart.get()));

aCharset.Assign(NS_ConvertASCIItoUCS2(tokStart, tokEnd - tokStart)); should
work. right?

>           return PR_TRUE;

How do you handle a CHARSET attribute with no value. That is,
<META CHARSET=>. We shouldn't be returning PR_TRUE if there was no charset
value. right?


BTW: Please get an sr= from vidur

Attachment #68019 - Flags: needs-work+

Assignee

Comment 28

•

22 years ago

Attached patch rewrite my patch (obsolete) — Details — Splinter Review

Attachment #68019 - Attachment is obsolete: true

Assignee

Comment 29

•

22 years ago

While I was taking harish's suggestion, I felt my last patch was difficult to
understand. So I rewrote it. 
harish, could you take a look again? Thanks.
cc vidur.

Comment 30

•

22 years ago

Comment on attachment 74419 [details] [diff] [review]
rewrite my patch

>+    // extract charset value, taking care of open and close quote
>+    while (*currPos == ' ')  // skip spaces
>+      ++currPos;
You need to check for iterator end before increamenting currPos. Also, just
checking 
for ' ' is not sufficient. You should also check for '\n', '\r', '\t'.

>+    ++currPos;
>+    while (*currPos == ' ')  //skip spaces
>+      ++currPos;

Again check for iterator end before incrementing.

>+    } else {
>+      currPos = tagEnd;
>+    }

Could you add a comment in the else clause. Also, how do you deal with 
<META CHARSET= > case? Would you return charset found or not found?

Attachment #74419 - Flags: needs-work+

Assignee

Comment 31

•

22 years ago

Attached patch update patch (obsolete) — Details — Splinter Review

Attachment #74419 - Attachment is obsolete: true

Assignee

Comment 32

•

22 years ago

>>You need to check for iterator end before increamenting currPos. 
In my new approach, I identified the end of tag ">" before anything else. So 
while I was checking for space characters, it won't go out of range since 
we know we will meet ">" before that happens.

>>Also, justchecking for ' ' is not sufficient. You should also check for '\n', '\r', '\t'.

Good point. That is done. 

>>Again check for iterator end before incrementing.
Same reason as above. 

>>Could you add a comment in the else clause. Also, how do you deal with 
>><META CHARSET= > case? Would you return charset found or not found?
Done. In case of "<meta charset= >", it return false (not found).

Comment 33

•

22 years ago

harishd, can you r= shanjian's update patch?

Comment 34

•

22 years ago

Comment on attachment 74836 [details] [diff] [review]
update patch

>+    if (!CaseInsensitiveFindInReadable(NS_LITERAL_CSTRING("META"), currPos, tokEnd) 
>+            || currPos == begin)
>+      break;

Fix the indendation.

>+
>+    // extract charset value, taking care of open and close quote
>+    while (*currPos == kSpace || *currPos == kNewLine ||
>+           *currPos == kCR || *currPos == kTab)  // skip spaces

Update your comment

With those changes r=harishd.

Updated

•

22 years ago

Whiteboard: adt2

Assignee

Comment 35

•

22 years ago

Attached patch update patch (obsolete) — Details — Splinter Review

Attachment #74836 - Attachment is obsolete: true

Assignee

Updated

•

22 years ago

Attachment #76209 - Flags: review+

Assignee

Comment 36

•

22 years ago

vidur, could you sr?

Cathleen

Comment 37

•

22 years ago

topembed+
-by triage team (chofmann, cathleen, marcia)

Keywords: topembed → topembed+

Comment 38

•

22 years ago

Comment on attachment 76209 [details] [diff] [review]
update patch

+    // skip spaces before '='
+    while (*currPos == kSpace || *currPos == kNewLine ||
+	    *currPos == kCR || *currPos == kTab)  
+      ++currPos;

This relies on the string underneeth the iterators to be null terminated, I
don't see that being guaranteed anywhere, so check that currPos != tagEnd in
the loop.

+    // skip spaces after '='
+    while (*currPos == kSpace || *currPos == kNewLine ||
+	    *currPos == kCR || *currPos == kTab)  
+      ++currPos;
+	   

Same here, don't step beond the end of the string.

+    // return true if we successfully got something for charset
+    if (currPos != tokEnd) {
+      aCharset.Assign(NS_ConvertASCIItoUCS2(currPos.get(), tokEnd.get() -
currPos.get()));
+      return PR_TRUE;
+    } else {

Sigh, else-after-return! Loose the else here.

+      //nothing specified as charset, continue next loop
+      currPos = tagEnd;
+    }

sr=jst if you fix that.

Attachment #76209 - Flags: superreview+

Comment 39

•

22 years ago

Um, never mind the two first comments, they are non-issues since you know
there's an '>' before the end of the string.

Assignee

Comment 40

•

22 years ago

Attached patch update patch (obsolete) — Details — Splinter Review

drop else after return.

Attachment #76209 - Attachment is obsolete: true

Assignee

Updated

•

22 years ago

Attachment #76494 - Flags: superreview+

Attachment #76494 - Flags: review+

Asa Dotzler [:asa]

Comment 41

•

22 years ago

Comment on attachment 76494 [details] [diff] [review]
update patch

a=asa (on behalf of drivers) for checkin to the 1.0 trunk

Attachment #76494 - Flags: approval+

Assignee

Comment 42

•

22 years ago

Attached patch patch incorporate changes for bug 124904 — Details — Splinter Review

Attachment #76494 - Attachment is obsolete: true

Assignee

Comment 43

•

22 years ago

Harish, I have to rewrite my patch to incorporate the changes you made for bug 124904. 
So could you do code review again? Thanks.

Comment 44

•

22 years ago

Comment on attachment 76581 [details] [diff] [review]
patch incorporate changes for bug 124904

>-      else if (!FindCharInReadable('>', tagEnd, end)) {
>-        // Can't be sure if this is a real tag since only 
>-        // a part of the tag is in this buffer. 
>-        return PR_FALSE; 
>+    tagEnd = currPos;
>+    if (!FindCharInReadable('>', tagEnd, end))
>+      break;

1) You removed my comments. Please put it back.
2) Before we were checking for '>' if other conditions failed. Now,
   it looks like we check for '>' regardless. Is this intentional?

>+    // If this is not a META tag, continue to next loop
>+    if ( (*currPos != 'm' && *currPos != 'M') ||
>+         (*(++currPos) != 'e' && *currPos != 'E') ||
>+         (*(++currPos) != 't' && *currPos != 'T') ||
>+         (*(++currPos) != 'a' && *currPos != 'A') ) {
>+      currPos = tagEnd;
>+      continue;
>     }

Why can't you simply use whatever that was there before?
That is,  if (CaseInsensitiveFindInReadable(NS_LITERAL_CSTRING("META"),
tagStart, tagEnd)) {

>+    ++currPos;

Are you sure that currPos does not go beyond end?

Assignee

Comment 45

•

22 years ago

>>1) You removed my comments. Please put it back.
I have similar comment before the block, ie.
    // Find the end of the tag, break if incomplete
    tagEnd = currPos;
    if (!FindCharInReadable('>', tagEnd, end))
      break;
Since the patch was based on my patch instead of yours, I made the changes 
only when necessary. But I am OK with your comment if you think it is 
better.

>>2) Before we were checking for '>' if other conditions failed. Now,
>>   it looks like we check for '>' regardless. Is this intentional?
Yes. as what I have done in my previous patch, the end of tag is checked 
in all condition. 


>+    // If this is not a META tag, continue to next loop
>+    if ( (*currPos != 'm' && *currPos != 'M') ||
>+         (*(++currPos) != 'e' && *currPos != 'E') ||
>+         (*(++currPos) != 't' && *currPos != 'T') ||
>+         (*(++currPos) != 'a' && *currPos != 'A') ) {
>+      currPos = tagEnd;
>+      continue;
>     }
>>Why can't you simply use whatever that was there before?
>>That is,  if (CaseInsensitiveFindInReadable(NS_LITERAL_CSTRING("META"),
>>tagStart, tagEnd)) {
  
Because we have identified start tag '>' and all those checking 
was immediately follow it. Using CaseInsensitiveFindInReadable might 
skip possible comment. 

>>Are you sure that currPos does not go beyond end? 
Yes, because we have located the end of tag '>'. We won't go beyond the 
limit before meeting this character.

Comment 46

•

22 years ago

 Impact Platform: ALL
Impact language users: 560 M 100%
Probability of hitting the problem: MID, we saw some top sites have this kind of
problem. Both 4.x and IE, and 6.2 have no problem
Severity if hit the problem in the worst case: User will see corrupted text
Way of recover after hit the problem: User can try these 80+ charset one by one
to figure out which display correctly.
Risk of the fix: MID
Potential benefit of fix this problem: Other display problem

Comment 47

•

22 years ago

>>Why can't you simply use whatever that was there before?
>>That is,  if (CaseInsensitiveFindInReadable(NS_LITERAL_CSTRING("META"),
>>tagStart, tagEnd)) {
  
>Because we have identified start tag '>' and all those checking 
>was immediately follow it. Using CaseInsensitiveFindInReadable might 
>skip possible comment. 

I don't understand. Could you give me an example?

Assignee

Comment 48

•

22 years ago

There is typo in my previous comment. ">" should be "<".

As you see in my patch, after we identified '<', we check if the 
following stuff are "!--" or not. If so, it is a comment. If not, 
we can check if the following stuff is "meta". If we search for 
meta directly, using following example:

<some_tag> <!-- ...   <meta ...> --> 

After we identififies the start of some_tag, and verifies that it is 
not the start of comment, it's time to check if this tag is a meta tag. 
If I use "CaseInsensitiveFindInReadable", "<!--..." will skipped and 
currPos will be set to "meta". That's say in order to skip all possible 
comment, we have to check each tag individually.

Comment 49

•

22 years ago

The patch in bug 124904 is meant to take care of situations like <some_tag> <!--
...   <meta ...> -->. That's when you reach

-    if (CaseInsensitiveFindInReadable(NS_LITERAL_CSTRING("META"), tagStart,
tagEnd)) {

you are guaranteed to be out of a comment.

Comment 50

•

22 years ago

Comment on attachment 76581 [details] [diff] [review]
patch incorporate changes for bug 124904

r=harishd.

Attachment #76581 - Flags: review+

Comment 51

•

22 years ago

Comment on attachment 76581 [details] [diff] [review]
patch incorporate changes for bug 124904

This code is now full of *(++currPos) and similar code that accesses *currPos
after incrementing it w/o checking if we stepped past the end of the string or
not. Are we guaranteed that the string is always null-terminated? If not, this
is a crash waiting to happen.

Assignee

Comment 52

•

22 years ago

jst, before those *(++currPos) operations, the end of tag, ie '>'. has been located.
So we won't go beyond limit as long as we does not meet this '>'. And that's why
I only increment the pointer after the character before has been identified .

Comment 53

•

22 years ago

Comment on attachment 76581 [details] [diff] [review]
patch incorporate changes for bug 124904

Oh, duh, should've looked closer.

Looks good to me, sr=jst

Attachment #76581 - Flags: superreview+

Asa Dotzler [:asa]

Comment 54

•

22 years ago

Comment on attachment 76581 [details] [diff] [review]
patch incorporate changes for bug 124904

a=asa (on behalf of drivers) for checkin to the 1.0 trunk

Attachment #76581 - Flags: approval+

Comment 55

•

22 years ago

major site issue
adt1.0.0

Keywords: adt1.0.0

Jaime Rodriguez, Jr.

Comment 56

•

22 years ago

adt1.0.0+ (on ADT's behalf) for checkin into 1.0. Pls test against top sites
after you have checked this in.

Keywords: adt1.0.0 → adt1.0.0+

Whiteboard: adt2 → [adt2]