Closed Bug 91696 Opened 23 years ago Closed 23 years ago

Meta tag does not work

Categories

(Core :: Internationalization, defect)

defect
Not set
normal

Tracking


VERIFIED FIXED
mozilla0.9.4

People

(Reporter: ftang, Assigned: jbetak)

References


Details

(Keywords: intl, Whiteboard: fixed on the trunk)

Attachments

(4 files)

Steps to reproduce:
1. Visit http://tw.yahoo.com
Expected result: the page displays Chinese text.
Actual result: garbage characters.

The reason is that this page uses an illegal meta tag:
<meta http-equiv=content-type content=text/html; charset=big5>

while it should be
<meta http-equiv=content-type content="text/html; charset=big5">
However, other browsers handle it.
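A minimal sketch (not Gecko's actual code; the splitter below is a hypothetical stand-in) of why the missing quotes matter: a tokenizer that treats unquoted whitespace as an attribute separator turns the charset into a separate attribute instead of keeping it inside the content value.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Naive attribute splitter, for illustration only: whitespace separates
// attributes unless it appears inside a quoted value.
std::vector<std::string> SplitAttributes(const std::string& tag) {
    std::vector<std::string> attrs;
    std::string cur;
    bool inQuote = false;
    char quote = 0;
    for (char c : tag) {
        if (inQuote) {
            cur += c;
            if (c == quote) inQuote = false;
        } else if (c == '"' || c == '\'') {
            cur += c;
            inQuote = true;
            quote = c;
        } else if (c == ' ') {
            if (!cur.empty()) { attrs.push_back(cur); cur.clear(); }
        } else {
            cur += c;
        }
    }
    if (!cur.empty()) attrs.push_back(cur);
    return attrs;
}
```

With quotes the whole "text/html; charset=big5" survives as one value; without them, "charset=big5" splits off on its own.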
Reassign to ftang.
Assignee: nhotta → ftang
OS: Windows NT → All
Hardware: PC → All
Summary: Meta tag does not work → Meta tag does not work
jbetak, can you work on this one?
The code should be in nsMetaCharsetObserver.cpp
Assignee: ftang → jbetak
Keywords: intl
QA Contact: andreasb → ylong
See another case - bug 90433 for un-quoted meta-charset value.
accepting for 0.9.3
Status: NEW → ASSIGNED
Target Milestone: --- → mozilla0.9.3
still investigating - pushing out to 0.9.4
Target Milestone: mozilla0.9.3 → mozilla0.9.4
found something related - bug 88746; confirmed that this is not a regression, 
although we have to make sure not to regress the fix for bug 88746.

Frank, please correct me if I'm wrong, but I believe we need to attack the 
parser logic here:

http://lxr.mozilla.org/seamonkey/source/htmlparser/src/nsParser.cpp#2524

The metaCharsetObserver is not being notified with the parsed value, only with 
the charset value from cache.
OK, I have a working fix in my local tree. I'll polish it up tonight and post 
for review. The culprit seems to be this (from nsParser):

2519             if ((offset = tokenizer.GetNextWord(PR_TRUE)) != kNotFound) {
2520               const char* contentStart = attrStart+offset;
2521               CWordTokenizer<char> contentTokenizer(contentStart, 0, 
2522                                                     tokenizer.GetLength());
2523 

When calling tokenizer.GetNextWord(PR_TRUE), it gives special handling to 
words starting with quotes, regardless of whether we pass in PR_TRUE or PR_FALSE. 
If a word doesn't start with a quote, tokenizer.GetLength() will return the 
length of that word up to the first whitespace delimiter. If a word starts with 
a quote, the length runs to the end of the buffer. Replacing 
tokenizer.GetLength() with tokenizer.GetTotalLength() seems to fix the problem. 
I'll keep working on this, just wanted to give a quick progress status...


OK, this seems to be the simplest solution - I'm still evaluating it in terms 
of correctness, consistency and potential regression risk.

The problem is as follows: on line 2519 in nsParser.cpp we implicitly 
assume that the content value is quoted (content="text/html; charset=big5">). 
The tokenizer is pointing at the first value after the '=' sign, and when 
asked to read the next word, it will either read until the next whitespace 
character or take a whole quoted string as one word. In the absence of quotes 
(content=text/html; charset=big5>), it will only read in 'text/html;', which 
is then assumed to be the whole content value. That value evidently doesn't 
contain the charset information, which therefore never gets parsed out.
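The word-reading behavior described above can be sketched with a simplified stand-in (the real CWordTokenizer API differs; this hypothetical helper mirrors only the quote handling just described):

```cpp
#include <cassert>
#include <string>

// Simplified stand-in for the tokenizer's next-word read: a word ends at
// the first whitespace, unless it starts with a quote, in which case it
// runs to the matching closing quote.
std::string GetNextWord(const std::string& buf) {
    if (buf.empty()) return "";
    if (buf[0] == '"' || buf[0] == '\'') {
        size_t end = buf.find(buf[0], 1);
        return buf.substr(1, end == std::string::npos ? std::string::npos
                                                      : end - 1);
    }
    size_t end = buf.find_first_of(" \t");
    return buf.substr(0, end);
}
```

For the quoted input the whole content value comes back in one read; for the unquoted input only 'text/html;' does, and the charset is lost.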

The proposed solution is a small hack around the current code. Rather than 
assuming that we have a quoted content value, we instantiate a new tokenizer 
spanning from the first character after the '=' sign to the last character 
before the '>' bracket. This might be inaccurate when there are more 
attributes than just content. I'm still looking into it, although the current 
code is very specific about what it will parse out of the content value and 
about the expected location of the value elements. I think this change might 
turn out to be quite safe.

To limit regression risk, we could check whether the content value is quoted 
and apply this tweak only then. Actually, I will do that first thing tomorrow.
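The workaround can be sketched roughly like this (function name and structure are illustrative, not the actual patch): take the whole span between the character after '=' and the character before '>', then strip any surrounding quotes.

```cpp
#include <cassert>
#include <string>

// Illustrative sketch of the proposed fix: grab everything between
// "content=" and the closing '>' instead of reading a single word,
// then strip surrounding quotes if present.
std::string ExtractContentValue(const std::string& tag) {
    size_t eq = tag.find("content=");
    if (eq == std::string::npos) return "";
    size_t start = eq + 8;               // first char after '='
    size_t end = tag.find('>', start);   // up to the closing bracket
    std::string value = tag.substr(start, end == std::string::npos
                                              ? std::string::npos
                                              : end - start);
    if (!value.empty() && (value.front() == '"' || value.front() == '\'')) {
        value.erase(0, 1);
        if (!value.empty() && (value.back() == '"' || value.back() == '\''))
            value.pop_back();
    }
    return value;
}
```

This yields the same content value for both the quoted and the unquoted form of the tag; as noted above, it could misbehave if further attributes follow content.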

cc'ing harishd and vidur for review. A short explanation: I'd like to use the 
first patch to handle META http-equiv content-type tags w/o quoted content 
value. Parsing until the end of the tokenizer seems reasonable given this 
context and the existing code in nsParser.cpp...

I'll also talk to Frank about it.
Whiteboard: have patch - need r/sr
roy: could you help with reviewing the nsMetaCharsetObserver.cpp part of the 
fix before I forget what exactly I'm doing there ;-)

vidur: could you please review the nsParser.cpp change?
I believe shanjian is the best person to review your nsMetaCharsetObserver.cpp changes.
shanjian: can you review his patch?  
--- cc'ing shanjian ----
Juraj, I don't quite understand why you want to modify the parser. The charset 
was parsed by the parser as an additional attribute, so only your change in 
nsMetaCharsetObserver.cpp is necessary. Did I miss something?

Also, for your change in nsMetaCharsetObserver.cpp, are 
((0==nsCRT::strncasecmp(contentValue,texthtml.get(),texthtml.Length())) ||
          (((contentValue[0]=='\'') ||
            (contentValue[0]=='\"'))&&
           (0==nsCRT::strncasecmp(contentValue+1,
                      texthtml.get(),
                      texthtml.Length()))
          ))
AND
((0==nsCRT::strncasecmp(contentValue,texthtml.get(),texthtml.Length())) ||
          (((contentValue[0]=='\'') ||
            (contentValue[0]=='\"'))&&
           (0==nsCRT::strncasecmp(contentValue+1,
                      texthtml.get(),
                      texthtml.Length())) ||
           ((nsnull != charsetValue) && 
            (0==nsCRT::strncasecmp(contentValue,
                       texthtml.get(),
                      texthtml.Length())))
          ))
equivalent? I tried it on my local tree, and removing this part seems to work fine. 


shanjian,

thanks for looking at this - and you are right, the observer changes would be 
sufficient. I already verified that. You see, I made the changes to nsParser 
first, since I was hitting it first in my debugger. Frank then pointed out 
that in poor networking environments the parser might not have the complete 
meta tag available, so we need to fix it in the meta tag observer as well, and 
I used his cgi script to emulate such environments.

I kept the changes in the parser, since I believe it might be better for 
performance. We would set the document charset sooner, before the observer has 
a chance to kick in. I deemed the change in the parser to be negligible, but if 
it has the potential to help, why not? 

But wait - I was just wondering: why do we then parse the meta charset in the 
parser at all? There must be a reason, and I think it's performance. Since we 
parse the meta charset in both places, why not fix both to include the case 
of "missing quotation marks"? I think we should keep the change, if the 
regression risk is low.
oh,

I added this

||
           ((nsnull != charsetValue) && 
            (0==nsCRT::strncasecmp(contentValue,
                       texthtml.get(),
                      texthtml.Length()))

to allow us to bypass the condition that the content value starts with a " 
or ' and the immediately following character sequence is text/html. I've 
expanded this condition to allow text/html without quoting, when the parser 
was able to parse out a charset value. The crucial point is that quoted 
values are read by the parser in one shot; it ignores the whitespace. This way 
the document type and charset come over mangled together in one key value. 
Otherwise the whitespace will separate them into two consecutive key values.
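Once the full content value is in hand (quoted or not), pulling the charset out of it is straightforward. A hedged sketch, with a hypothetical helper name, of what that extraction step amounts to:

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Illustrative charset extraction from a META content value: find
// "charset=" case-insensitively, then read up to the next delimiter.
std::string ExtractCharset(const std::string& contentValue) {
    std::string lower;
    for (char c : contentValue)
        lower += static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    size_t pos = lower.find("charset=");
    if (pos == std::string::npos) return "";
    size_t start = pos + 8;  // first char after "charset="
    size_t end = contentValue.find_first_of(" ;\"'", start);
    return contentValue.substr(start, end == std::string::npos
                                          ? std::string::npos
                                          : end - start);
}
```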

BTW, did you test on tw.yahoo.com? Did you apply the parser changes? If you did, 
the parser is probably catching the charset before the observer kicks in...
Ok, I didn't realize your first change was in the meta charset sniffing part. I 
thought it was for preparing the attributes. (In fact, I didn't check that part 
at all.) So you need to ask harishd or vidur to review that part. And I agree, 
this part should be fixed. 

As for my second comment, the code you added still seems redundant to me. (Yes, 
I tried without that part, and without the parser fix, and it works fine.) No 
matter whether charsetValue is nsnull or not, all that needs to be there is to 
make sure contentValue starts with "text/html". Disable the parser part and try 
it out yourself. 

Shanjian,

you were right - I hadn't read that part of the if statement correctly.

         ((0==nsCRT::strncasecmp(contentValue,texthtml.get(),texthtml.Length())) ||
          (((contentValue[0]=='\'') ||
            (contentValue[0]=='\"'))&&
           (0==nsCRT::strncasecmp(contentValue+1,
                      texthtml.get(),
                      texthtml.Length()))
          ))

explicitly covers both the quoted and the unquoted case. I dropped my extension 
of the if clause and simplified the patch somewhat. Sigh, I still can't get 
vidur to respond to my review request.
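The quoted/unquoted check in the if clause above can be restated as a standalone predicate (a sketch with an invented function name, using POSIX strncasecmp in place of nsCRT::strncasecmp): accept "text/html" either bare or behind a single leading quote.

```cpp
#include <cassert>
#include <cstring>
#include <strings.h>  // strncasecmp (POSIX)

// Restatement of the condition in the patch, for illustration:
// contentValue may start with "text/html" directly, or with one
// leading ' or " immediately followed by "text/html".
bool StartsWithTextHtml(const char* contentValue) {
    static const char kTextHtml[] = "text/html";
    const size_t len = sizeof(kTextHtml) - 1;
    if (strncasecmp(contentValue, kTextHtml, len) == 0) return true;
    if ((contentValue[0] == '\'' || contentValue[0] == '"') &&
        strncasecmp(contentValue + 1, kTextHtml, len) == 0)
        return true;
    return false;
}
```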
r=shanjian for patch #3, file nsMetaCharsetObserver.cpp
The change to the parser treats the rest of the tag after the content attribute
as the attribute value. This should work for the case shown and probably for
cases where there are additional attributes after content (e.g. content appears
before http-equiv), though the latter should be tested.
sr=vidur. 

thanks Vidur! I'll look into the scenario you mentioned tonight. Meanwhile I'm 
clearing the whiteboard status and nominating for the branch. This prevents a 
number of international sites from displaying properly, of which the most 
prominent one is Taiwanese Yahoo.
Keywords: nsBranch
Whiteboard: have patch - need r/sr → prepared to land, waiting for the tree to open
QA: I checked this into the trunk on 08/09/2001. Could you please verify - I'm 
particularly keen to see whether this causes any regressions.
Whiteboard: prepared to land, waiting for the tree to open → fixed on the trunk, need some verification work before lifting to the branch
Yes, it works on the 08-10 trunk win32 / Win2k-CN (just checked only one right now).
Yuying, 

thanks for your prompt reply! I'll keep this around for a few days to see 
whether any regressions arise from smoketests, and do some testing myself. 
Please let me know should you see any problems/changes in META charset 
sniffing and detection...
resolving per ftang's request
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
Whiteboard: fixed on the trunk, need some verification work before lifting to the branch → fixed on the trunk
Marking it as verified - works well on the 09-04 trunk build / Win2k-CN.
Status: RESOLVED → VERIFIED
