Closed
Bug 100849
Opened 23 years ago
Closed 22 years ago
Charset sniffing in parser could not find charset in some cases
Categories
(Core :: DOM: HTML Parser, defect, P3)
Core
DOM: HTML Parser
Tracking
()
VERIFIED
FIXED
mozilla1.0
People
(Reporter: shanjian, Assigned: shanjian)
References
Details
(Keywords: regression, topembed+, Whiteboard: [adt2])
Attachments
(1 file, 5 obsolete files)
7.62 KB,
patch
|
harishd
:
review+
jst
:
superreview+
asa
:
approval+
|
Details | Diff | Splinter Review |
While I was working on bug 99666, I found that if meta charset is specified as <META http-equiv=Content-Type content=text/html;charset=iso-8859-1> instead of <META http-equiv=Content-Type content=text/html; charset=iso-8859-1> , charset sniffing could not extract the charset information. While I am not very sure if the former part is standard compatible, it is certainly a normal practice. Our meta charset observer could handle these cases.
Status: NEW → ASSIGNED
Priority: -- → P3
Target Milestone: --- → mozilla1.0
Comment 1•23 years ago
|
||
Bugs targeted at mozilla1.0 without the mozilla1.0 keyword moved to mozilla1.0.1 (you can query for this string to delete spam or retrieve the list of bugs I've moved)
Target Milestone: mozilla1.0 → mozilla1.0.1
Assignee | ||
Comment 2•23 years ago
|
||
reassign to myself.
Assignee: harishd → shanjian
Status: ASSIGNED → NEW
Assignee | ||
Comment 3•23 years ago
|
||
*** Bug 112929 has been marked as a duplicate of this bug. ***
Assignee | ||
Comment 4•23 years ago
|
||
Assignee | ||
Updated•23 years ago
|
Target Milestone: mozilla1.0.1 → mozilla0.9.9
Assignee | ||
Comment 7•23 years ago
|
||
The standard form of meta charset specification is : <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">This can be handled by existing code correctly. However several variations, though incorrect, are widely used in many websites. They are , <meta http-equiv="Content-Type" content="text/html;charset=UTF-8"> <meta http-equiv="Content-Type" content="text/html" charset="UTF-8"> <meta charset="UTF-8"> My approach is for each meta tag, I search for "charset=". If found, the string after '=' excluding single and double quote are interpreted as charset value. It should be able to handle all these variation forms.
Shanjian: According to your patch we will be able to find the charset value regradless of whether the attribute name is "content" or not. IMO, that's too liberal. Do you know what IE does? FYI: Still reviewing :-/
Also, you have removed the check for "Content-type" attribute value. Could you please rewrite your patch such that the existing logic ( prior to applying your patch ) does not change much?
Assignee | ||
Comment 10•23 years ago
|
||
Harish, Both IE and Netscape4.x accept the format of <meta charset="UTF-8">. Because of this, I have to drop all other verification as long as it is a meta tag and contains certain way to express charset. It might sounds too liberal, but we really do not have much choice. Besides, we are doing the same thing in meta charset observer any way.
Comment 12•23 years ago
|
||
Moving Netscape owned 0.9.9 and 1.0 bugs that don't have an nsbeta1, nsbeta1+, topembed, topembed+, Mozilla0.9.9+ or Mozilla1.0+ keyword. Please send any questions or feedback about this to adt@netscape.com. You can search for "Moving bugs not scheduled for a project" to quickly delete this bugmail.
Target Milestone: mozilla1.0 → mozilla1.2
Comment 14•22 years ago
|
||
I know all the thing shanjian mentioned are supported by both IE and N4.x I am sure this is a regression over 6.1. nsbeta1+
Target Milestone: mozilla1.2 → mozilla1.0
Comment 15•22 years ago
|
||
topembed because this level of tolerance is essential on many of the international sites we need to be concerned about.
Keywords: topembed
Comment 16•22 years ago
|
||
If the charset sniffer code fails then the meta charset observer should be able to detect it right? That's, the tag observer would kick in, passing in the META tag information, and hence should cause a document reload with the correct charset.
Comment 17•22 years ago
|
||
kat: Do you mean that there are a lot of international sites that exhibit this problem? I see only one dupe in this bug and that too doesn't point to any real world url. Please explain to me why this should be topembed.
Comment 18•22 years ago
|
||
harishd, over the last several years, I have observed sites with variations of this problem from time to time. This is not very frequent but occurs regularly. I don't have time to provide URLs now but during the top 500 site testing in Japan conducted late last year, I saw several cases of one variation or another of what shanjian provided above. And we had to contact those sites to fix the problem. So the answer to your question is: yes, these sites do exist and I have seen a sizable number of them over the last several years. I suspect that some authoring tools or scripts create an error of this type. It was only a few days ago that somone discovered that the site build script our team uses for developer.netscape.com pages contained an error like the following: ======================= <head> <meta content="text/html; charset=UTF-8" http-equiv="Content-Type"> <title>What's New</title> <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"> .... .... ======================= We corrected it the next day but it was the XSLT script we were using that had this problem. This is not exactly the same problem shanjian mentions but some script I have seen generate something like: <meta charset="ISO-8859-1"> We are starting test sites in China and we expect to find more of the same. For these reasons, we should build in some tolerance for these types of mistakes.
Comment 19•22 years ago
|
||
Kat: Being too liberal would cause problems in the future. One of the main reasons for HTML to be so broken, in the real world, is because Navigator was too tolerant. However, I'm not saying that we should be strict and follow the spec. verbatim. We should solve these problems on a case by case basis ( provided it's a high traffic web site ).
Assignee | ||
Comment 20•22 years ago
|
||
harish, If it is one or two websites, that will be fine. But there are simply too many websites using various practice. All those sites works well with IE, and before we get much market share, it does not make sense to change the world. I would like to emphasis again that we are doing the same thing in meta charset observer. If there is any problem with this approach, the problem is already there. You should agree that reload charset later will always cause more problem, let alone performance.
Comment 21•22 years ago
|
||
>All those sites works well with IE, and before we get much market share, it does >not make sense to change the world. I never said that we evangelize. All I said was that we should deal with this problem on a |CASE| by |CASE| basis. There no reason get all paranoid for a problem not yet seen. >like to emphasis again that we are doing the same thing in meta charset >observer. If there is any problem with this approach, the problem is already >there. You should agree that reload charset later will always cause more >problem, let alone performance. If the meta charset is able to handle it then well and good. We will fall back on the charset observer for a malformed case. Performance, will not be an issue for a good mark up right?
Comment 22•22 years ago
|
||
Btw, Mike Shaver has a proposal to avoid charset reload ( ref. bug 61363 ).
Assignee | ||
Comment 23•22 years ago
|
||
>>I never said that we evangelize. All I said was that we should deal with this >>problem on a |CASE| by |CASE| basis. There no reason get all paranoid for a >>problem not yet seen. Considering all the following examples (incomplete yet), do you still believe my approach is too liberal? We identified a charset inside meta tag, and there is a value follow it, should we assume that user intend to specify the charset? <meta content="text/html; charset=UTF-8" http-equiv="Content-Type"> <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"> <meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1"> <meta http-equiv="Content-Type" content="text/html" charset="UTF-8"> <meta charset="ISO-8859-1"> >>If the meta charset is able to handle it then well and good. We will fall back >>on the charset observer for a malformed case. Performance, will not be an >>issue for a good mark up right? If meta charset observer work all well, then I will not spend time on this bug at this time, and it will not be escalated as nsbeta1+. There is situations where meta charset observer's notification will not be honored. So it is not only a performance issue. I just made some comment in bug 61363 today, I really doubt that the bug can be fixed any time soon. It is a very complicated issue because reconversion from unicode to native encoding does not work (yet).
Comment 24•22 years ago
|
||
>If meta charset observer work all well, then I will not spend time on this bug
>at this time, and it will not be escalated as nsbeta1+.
The what do you mean by "I would like to emphasis again that we are doing the
same thing in meta charset observer."?
Comment 25•22 years ago
|
||
>If meta charset observer work all well, then I will not spend time on this bug
>at this time, and it will not be escalated as nsbeta1+.
Then what do you mean by "I would like to emphasis again that we are doing the
same thing in meta charset observer."?
Assignee | ||
Comment 26•22 years ago
|
||
I mean we are parsing the meta tag in a "liberal" way in meta charset observer as I suggested in this patch.
Comment 27•22 years ago
|
||
Comment on attachment 68019 [details] [diff] [review] patch >+ if (CaseInsensitiveFindInReadable(NS_LITERAL_CSTRING("CHARSET="), tokStart, tokEnd)) { This wouldn't work if the markup was <META CHARSET = "XXX">. So, my be you should only search for "CHARSET". Also, you should guarantee that "CHARSET" is an attribute name. That's charset should be to the left of '=' >+ // extract charset value, taking care of open and close quote >+ tokStart = tokEnd; >+ char endChar; >+ if (*tokStart == '\'' || *tokStart == '\"') >+ ++tokStart; You may have to check for iterator end before incrementing >+ tokEnd = tokStart; > >- if (foundContentType && (aCharset.Length() > 0)) { >+ while (*tokEnd != '\'' && *tokEnd != '\"' && tokEnd.get() < end.get()) >+ ++tokEnd; Replace tokEnd.get() < end.get() with tokEnd != end. >+ aCharset.Assign(NS_ConvertASCIItoUCS2(tokStart.get(), tokEnd.get() - tokStart.get())); aCharset.Assign(NS_ConvertASCIItoUCS2(tokStart, tokEnd - tokStart)); should work. right? > return PR_TRUE; How do you handle a CHARSET attribute with no value. That is, <META CHARSET=>. We shouldn't be returning PR_TRUE if there was no charset value. right? BTW: Please get an sr= from vidur
Attachment #68019 -
Flags: needs-work+
Assignee | ||
Comment 28•22 years ago
|
||
Attachment #68019 -
Attachment is obsolete: true
Assignee | ||
Comment 29•22 years ago
|
||
While I was taking harish's suggestion, I felt my last patch was difficult to understand. So I rewrote it. harish, could you take a look again? Thanks. cc vidur.
Comment 30•22 years ago
|
||
Comment on attachment 74419 [details] [diff] [review] rewrite my patch >+ // extract charset value, taking care of open and close quote >+ while (*currPos == ' ') // skip spaces >+ ++currPos; You need to check for iterator end before increamenting currPos. Also, just checking for ' ' is not sufficient. You should also check for '\n', '\r', '\t'. >+ ++currPos; >+ while (*currPos == ' ') //skip spaces >+ ++currPos; Again check for iterator end before incrementing. >+ } else { >+ currPos = tagEnd; >+ } Could you add a comment in the else clause. Also, how do you deal with <META CHARSET= > case? Would you return charset found or not found?
Attachment #74419 -
Flags: needs-work+
Assignee | ||
Comment 31•22 years ago
|
||
Attachment #74419 -
Attachment is obsolete: true
Assignee | ||
Comment 32•22 years ago
|
||
>>You need to check for iterator end before increamenting currPos. In my new approach, I identified the end of tag ">" before anything else. So while I was checking for space characters, it won't go out of range since we know we will meet ">" before that happens. >>Also, justchecking for ' ' is not sufficient. You should also check for '\n', '\r', '\t'. Good point. That is done. >>Again check for iterator end before incrementing. Same reason as above. >>Could you add a comment in the else clause. Also, how do you deal with >><META CHARSET= > case? Would you return charset found or not found? Done. In case of "<meta charset= >", it return false (not found).
Comment 33•22 years ago
|
||
harishd, can you r= shanjian's update patch?
Comment 34•22 years ago
|
||
Comment on attachment 74836 [details] [diff] [review] update patch >+ if (!CaseInsensitiveFindInReadable(NS_LITERAL_CSTRING("META"), currPos, tokEnd) >+ || currPos == begin) >+ break; Fix the indendation. >+ >+ // extract charset value, taking care of open and close quote >+ while (*currPos == kSpace || *currPos == kNewLine || >+ *currPos == kCR || *currPos == kTab) // skip spaces Update your comment With those changes r=harishd.
Updated•22 years ago
|
Whiteboard: adt2
Assignee | ||
Comment 35•22 years ago
|
||
Attachment #74836 -
Attachment is obsolete: true
Assignee | ||
Updated•22 years ago
|
Attachment #76209 -
Flags: review+
Assignee | ||
Comment 36•22 years ago
|
||
vidur, could you sr?
Comment 37•22 years ago
|
||
topembed+ -by triage team (chofmann, cathleen, marcia)
Comment 38•22 years ago
|
||
Comment on attachment 76209 [details] [diff] [review] update patch + // skip spaces before '=' + while (*currPos == kSpace || *currPos == kNewLine || + *currPos == kCR || *currPos == kTab) + ++currPos; This relies on the string underneeth the iterators to be null terminated, I don't see that being guaranteed anywhere, so check that currPos != tagEnd in the loop. + // skip spaces after '=' + while (*currPos == kSpace || *currPos == kNewLine || + *currPos == kCR || *currPos == kTab) + ++currPos; + Same here, don't step beond the end of the string. + // return true if we successfully got something for charset + if (currPos != tokEnd) { + aCharset.Assign(NS_ConvertASCIItoUCS2(currPos.get(), tokEnd.get() - currPos.get())); + return PR_TRUE; + } else { Sigh, else-after-return! Loose the else here. + //nothing specified as charset, continue next loop + currPos = tagEnd; + } sr=jst if you fix that.
Attachment #76209 -
Flags: superreview+
Comment 39•22 years ago
|
||
Um, never mind the two first comments, they are non-issues since you know there's an '>' before the end of the string.
Assignee | ||
Comment 40•22 years ago
|
||
drop else after return.
Attachment #76209 -
Attachment is obsolete: true
Assignee | ||
Updated•22 years ago
|
Attachment #76494 -
Flags: superreview+
Attachment #76494 -
Flags: review+
Comment 41•22 years ago
|
||
Comment on attachment 76494 [details] [diff] [review] update patch a=asa (on behalf of drivers) for checkin to the 1.0 trunk
Attachment #76494 -
Flags: approval+
Assignee | ||
Comment 42•22 years ago
|
||
Attachment #76494 -
Attachment is obsolete: true
Assignee | ||
Comment 43•22 years ago
|
||
Harish, I have to rewrite my patch to incorporate the changes you made for bug 124904. So could you do code review again? Thanks.
Comment 44•22 years ago
|
||
Comment on attachment 76581 [details] [diff] [review] patch incorporate changes for bug 124904 >- else if (!FindCharInReadable('>', tagEnd, end)) { >- // Can't be sure if this is a real tag since only >- // a part of the tag is in this buffer. >- return PR_FALSE; >+ tagEnd = currPos; >+ if (!FindCharInReadable('>', tagEnd, end)) >+ break; 1) You removed my comments. Please put it back. 2) Before we were checking for '>' if other conditions failed. Now, it looks like we check for '>' regardless. Is this intentional? >+ // If this is not a META tag, continue to next loop >+ if ( (*currPos != 'm' && *currPos != 'M') || >+ (*(++currPos) != 'e' && *currPos != 'E') || >+ (*(++currPos) != 't' && *currPos != 'T') || >+ (*(++currPos) != 'a' && *currPos != 'A') ) { >+ currPos = tagEnd; >+ continue; > } Why can't you simply use whatever that was there before? That is, if (CaseInsensitiveFindInReadable(NS_LITERAL_CSTRING("META"), tagStart, tagEnd)) { >+ ++currPos; Are you sure that currPos does not go beyond end?
Assignee | ||
Comment 45•22 years ago
|
||
>>1) You removed my comments. Please put it back. I have similar comment before the block, ie. // Find the end of the tag, break if incomplete tagEnd = currPos; if (!FindCharInReadable('>', tagEnd, end)) break; Since the patch was based on my patch instead of yours, I made the changes only when necessary. But I am OK with your comment if you think it is better. >>2) Before we were checking for '>' if other conditions failed. Now, >> it looks like we check for '>' regardless. Is this intentional? Yes. as what I have done in my previous patch, the end of tag is checked in all condition. >+ // If this is not a META tag, continue to next loop >+ if ( (*currPos != 'm' && *currPos != 'M') || >+ (*(++currPos) != 'e' && *currPos != 'E') || >+ (*(++currPos) != 't' && *currPos != 'T') || >+ (*(++currPos) != 'a' && *currPos != 'A') ) { >+ currPos = tagEnd; >+ continue; > } >>Why can't you simply use whatever that was there before? >>That is, if (CaseInsensitiveFindInReadable(NS_LITERAL_CSTRING("META"), >>tagStart, tagEnd)) { Because we have identified start tag '>' and all those checking was immediately follow it. Using CaseInsensitiveFindInReadable might skip possible comment. >>Are you sure that currPos does not go beyond end? Yes, because we have located the end of tag '>'. We won't go beyond the limit before meeting this character.
Comment 46•22 years ago
|
||
Impact Platform: ALL Impact language users: 560 M 100% Probability of hitting the problem: MID, we saw some top sites have this kind of problem. Both 4.x and IE, and 6.2 have no problem Severity if hit the problem in the worst case: User will see corrupted text Way of recover after hit the problem: User can try these 80+ charset one by one to figure out which display correctly. Risk of the fix: MID Potential benefit of fix this problem: Other display problem
Comment 47•22 years ago
|
||
>>Why can't you simply use whatever that was there before? >>That is, if (CaseInsensitiveFindInReadable(NS_LITERAL_CSTRING("META"), >>tagStart, tagEnd)) { >Because we have identified start tag '>' and all those checking >was immediately follow it. Using CaseInsensitiveFindInReadable might >skip possible comment. I don't understand. Could you give me an example?
Assignee | ||
Comment 48•22 years ago
|
||
There is typo in my previous comment. ">" should be "<". As you see in my patch, after we identified '<', we check if the following stuff are "!--" or not. If so, it is a comment. If not, we can check if the following stuff is "meta". If we search for meta directly, using following example: <some_tag> <!-- ... <meta ...> --> After we identififies the start of some_tag, and verifies that it is not the start of comment, it's time to check if this tag is a meta tag. If I use "CaseInsensitiveFindInReadable", "<!--..." will skipped and currPos will be set to "meta". That's say in order to skip all possible comment, we have to check each tag individually.
Comment 49•22 years ago
|
||
The patch in bug 124904 is meant to take care of situations like <some_tag> <!-- ... <meta ...> -->. That's when you reach - if (CaseInsensitiveFindInReadable(NS_LITERAL_CSTRING("META"), tagStart, tagEnd)) { you are guaranteed to be out of a comment.
Comment 50•22 years ago
|
||
Comment on attachment 76581 [details] [diff] [review] patch incorporate changes for bug 124904 r=harishd.
Attachment #76581 -
Flags: review+
Comment 51•22 years ago
|
||
Comment on attachment 76581 [details] [diff] [review] patch incorporate changes for bug 124904 This code is now full of *(++currPos) and similar code that accesses *currPos after incrementing it w/o checking if we stepped past the end of the string or not. Are we guaranteed that the string is always null-terminated? If not, this is a crash waiting to happen.
Assignee | ||
Comment 52•22 years ago
|
||
jst, before those *(++currPos) operations, the end of tag, ie '>'. has been located. So we won't go beyond limit as long as we does not meet this '>'. And that's why I only increment the pointer after the character before has been identified .
Comment 53•22 years ago
|
||
Comment on attachment 76581 [details] [diff] [review] patch incorporate changes for bug 124904 Oh, duh, should've looked closer. Looks good to me, sr=jst
Attachment #76581 -
Flags: superreview+
Comment 54•22 years ago
|
||
Comment on attachment 76581 [details] [diff] [review] patch incorporate changes for bug 124904 a=asa (on behalf of drivers) for checkin to the 1.0 trunk
Attachment #76581 -
Flags: approval+
Comment 56•22 years ago
|
||
adt1.0.0+ (on ADT's behalf) for checkin into 1.0. Pls test against top sites after you have checked this in.
Assignee | ||
Comment 57•22 years ago
|
||
fix checked in.
Status: ASSIGNED → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•