Closed Bug 91696 Opened 23 years ago Closed 23 years ago

Meta tag does not work

Categories

(Core :: Internationalization, defect)

defect
Not set
normal

Tracking


VERIFIED FIXED
mozilla0.9.4

People

(Reporter: ftang, Assigned: jbetak)

References


Details

(Keywords: intl, Whiteboard: fixed on the trunk)

Attachments

(4 files)

Steps to reproduce:
1. Visit http://tw.yahoo.com
Expected result: the page displays Chinese text.
Actual result: garbage characters.

The reason is that this page uses an illegal meta tag:
<meta http-equiv=content-type content=text/html; charset=big5>

while it should be
<meta http-equiv=content-type content="text/html; charset=big5">
However, other browsers handle it.
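A minimal sketch (not Gecko's actual code; the splitter below is a hypothetical stand-in) of why the missing quotes matter: a tokenizer that treats unquoted whitespace as an attribute separator turns the charset into a separate attribute instead of keeping it inside the content value.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Naive attribute splitter, for illustration only: whitespace separates
// attributes unless it appears inside a quoted value.
std::vector<std::string> SplitAttributes(const std::string& tag) {
    std::vector<std::string> attrs;
    std::string cur;
    bool inQuote = false;
    char quote = 0;
    for (char c : tag) {
        if (inQuote) {
            cur += c;
            if (c == quote) inQuote = false;
        } else if (c == '"' || c == '\'') {
            cur += c;
            inQuote = true;
            quote = c;
        } else if (c == ' ') {
            if (!cur.empty()) { attrs.push_back(cur); cur.clear(); }
        } else {
            cur += c;
        }
    }
    if (!cur.empty()) attrs.push_back(cur);
    return attrs;
}
```

With quotes the whole "text/html; charset=big5" survives as one value; without them, "charset=big5" splits off on its own.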
Reassign to ftang.
Assignee: nhotta → ftang
OS: Windows NT → All
Hardware: PC → All
Summary: Meta tag does not work → Meta tag does not work
jbetak, can you work on this one?
The code should be in nsMetaCharsetObserver.cpp
Assignee: ftang → jbetak
Keywords: intl
QA Contact: andreasb → ylong
See another case - bug 90433 for un-quoted meta-charset value.
accepting for 0.9.3
Status: NEW → ASSIGNED
Target Milestone: --- → mozilla0.9.3
still investigating - pushing out to 0.9.4
Target Milestone: mozilla0.9.3 → mozilla0.9.4
found something related - bug 88746; confirmed that this is not a regression, 
although we have to make sure not to regress the fix for bug 88746.

Frank, please correct me if I'm wrong, but I believe we need to attack the 
parser logic here:

http://lxr.mozilla.org/seamonkey/source/htmlparser/src/nsParser.cpp#2524

The metaCharsetObserver is not being notified with the parsed value, only with 
the charset value from cache.
OK, I have a working fix in my local tree. I'll polish it up tonight and post 
for review. The culprit seems to be this (from nsParser):

2519             if ((offset = tokenizer.GetNextWord(PR_TRUE)) != kNotFound) {
2520               const char* contentStart = attrStart+offset;
2521               CWordTokenizer<char> contentTokenizer(contentStart, 0, 
2522                                                     tokenizer.GetLength());
2523 

When calling tokenizer.GetNextWord(PR_TRUE), it gives special handling to 
words starting with quotes, regardless of whether we pass in PR_TRUE or PR_FALSE. 
If a word doesn't start with a quote, tokenizer.GetLength() will return the 
length of that word up to the first whitespace delimiter. If a word starts with 
a quote, the length runs to the end of the buffer. Replacing 
tokenizer.GetLength() with tokenizer.GetTotalLength() seems to fix the problem. 
I'll keep working on this, just wanted to give a quick progress status...


OK, this seems to be the simplest solution - I'm still evaluating it in terms 
of correctness, consistency and potential regression risk.

The problem is as follows: on line 2519 in nsParser.cpp we implicitly 
assume that the content value is quoted (content="text/html; charset=big5">). 
The tokenizer is pointing at the first value after the '=' sign, and when 
asked to read the next word, it will either read until the next whitespace 
character or take a whole quoted string as one word. In the absence of quotes 
(content=text/html; charset=big5>), it will only read in 'text/html;', which 
is then assumed to be the whole content value. That value evidently doesn't 
contain the charset information, which therefore never gets parsed out.
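The word-reading behavior described above can be sketched with a simplified stand-in (the real CWordTokenizer API differs; this hypothetical helper mirrors only the quote handling just described):

```cpp
#include <cassert>
#include <string>

// Simplified stand-in for the tokenizer's next-word read: a word ends at
// the first whitespace, unless it starts with a quote, in which case it
// runs to the matching closing quote.
std::string GetNextWord(const std::string& buf) {
    if (buf.empty()) return "";
    if (buf[0] == '"' || buf[0] == '\'') {
        size_t end = buf.find(buf[0], 1);
        return buf.substr(1, end == std::string::npos ? std::string::npos
                                                      : end - 1);
    }
    size_t end = buf.find_first_of(" \t");
    return buf.substr(0, end);
}
```

For the quoted input the whole content value comes back in one read; for the unquoted input only 'text/html;' does, and the charset is lost.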

The proposed solution is a small hack around the current code. Rather than 
assuming that we have a quoted content value, we instantiate a new tokenizer 
spanning from the first character after the '=' sign to the last character 
before the '>' bracket. This might be inaccurate when there are more 
attributes than just content. I'm still looking into it, although the current 
code is very specific about what it will parse out of the content value and 
about the expected location of the value elements. I think this change might 
turn out to be quite safe.

To limit regression risk, we could check whether the content value is quoted 
and apply this tweak only then. Actually, I will do that first thing tomorrow.
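The workaround can be sketched roughly like this (function name and structure are illustrative, not the actual patch): take the whole span between the character after '=' and the character before '>', then strip any surrounding quotes.

```cpp
#include <cassert>
#include <string>

// Illustrative sketch of the proposed fix: grab everything between
// "content=" and the closing '>' instead of reading a single word,
// then strip surrounding quotes if present.
std::string ExtractContentValue(const std::string& tag) {
    size_t eq = tag.find("content=");
    if (eq == std::string::npos) return "";
    size_t start = eq + 8;               // first char after '='
    size_t end = tag.find('>', start);   // up to the closing bracket
    std::string value = tag.substr(start, end == std::string::npos
                                              ? std::string::npos
                                              : end - start);
    if (!value.empty() && (value.front() == '"' || value.front() == '\'')) {
        value.erase(0, 1);
        if (!value.empty() && (value.back() == '"' || value.back() == '\''))
            value.pop_back();
    }
    return value;
}
```

This yields the same content value for both the quoted and the unquoted form of the tag; as noted above, it could misbehave if further attributes follow content.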

cc'ing harishd and vidur for review. A short explanation: I'd like to use the 
first patch to handle META http-equiv content-type tags w/o quoted content 
value. Parsing until the end of the tokenizer seems reasonable given this 
context and the existing code in nsParser.cpp...

I'll also talk to Frank about it.
Whiteboard: have patch - need r/sr
roy: could you help with reviewing the nsMetaCharsetObserver.cpp part of the 
fix before I forget what exactly I'm doing there ;-)

vidur: could you please review the nsParser.cpp change?
I believe shanjian is the best person to review your nsMetaCharsetObserver.cpp changes.
shanjian: can you review his patch?  
--- cc'ing shanjian ----
Juraj, I don't quite understand why you want to modify the parser. The charset 
was parsed by the parser as an additional attribute, so only your change in 
nsMetaCharsetObserver.cpp is necessary. Did I miss something?

Also, for your change in nsMetaCharsetObserver.cpp, are 
((0==nsCRT::strncasecmp(contentValue,texthtml.get(),texthtml.Length())) ||
          (((contentValue[0]=='\'') ||
            (contentValue[0]=='\"'))&&
           (0==nsCRT::strncasecmp(contentValue+1,
                      texthtml.get(),
                      texthtml.Length()))
          ))
AND
((0==nsCRT::strncasecmp(contentValue,texthtml.get(),texthtml.Length())) ||
          (((contentValue[0]=='\'') ||
            (contentValue[0]=='\"'))&&
           (0==nsCRT::strncasecmp(contentValue+1,
                      texthtml.get(),
                      texthtml.Length())) ||
           ((nsnull != charsetValue) && 
            (0==nsCRT::strncasecmp(contentValue,
                       texthtml.get(),
                      texthtml.Length())))
          ))
equivalent? I tried it on my local tree, and removing this part seems to work fine. 


shanjian,

thanks for looking at this - and you are right, the observer changes would be 
sufficient. I already verified that. You see, I made the changes to nsParser 
first, since I was hitting it first in my debugger. Frank then pointed out 
that in poor networking environments the parser might not have the complete 
meta tag available, so we need to fix it in the meta tag observer as well, and 
I used his cgi script to emulate such environments.

I kept the changes in the parser, since I believe it might be better for 
performance. We would set the document charset sooner, before the observer has 
a chance to kick in. I deemed the change in the parser to be negligible, but if 
it has the potential to help, why not? 

But wait - I was just wondering: why do we then parse the meta charset in the 
parser at all? There must be a reason, and I think it's performance. Since we 
parse the meta charset in both places, why not fix both to include the case 
of "missing quotation marks"? I think we should keep the change, if the 
regression risk is low.
oh,

I added this

||
           ((nsnull != charsetValue) && 
            (0==nsCRT::strncasecmp(contentValue,
                       texthtml.get(),
                      texthtml.Length()))

to allow us to bypass the condition that the content value starts with a " 
or ' and the immediately following character sequence is text/html. I've 
expanded this condition to allow text/html without quoting, when the parser 
was able to parse out a charset value. The crucial point is that quoted 
values are read by the parser in one shot; it ignores the whitespace. This way 
the document type and charset come over mangled together in one key value. 
Otherwise the whitespace will separate them into two consecutive key values.
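Once the full content value is in hand (quoted or not), pulling the charset out of it is straightforward. A hedged sketch, with a hypothetical helper name, of what that extraction step amounts to:

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Illustrative charset extraction from a META content value: find
// "charset=" case-insensitively, then read up to the next delimiter.
std::string ExtractCharset(const std::string& contentValue) {
    std::string lower;
    for (char c : contentValue)
        lower += static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    size_t pos = lower.find("charset=");
    if (pos == std::string::npos) return "";
    size_t start = pos + 8;  // first char after "charset="
    size_t end = contentValue.find_first_of(" ;\"'", start);
    return contentValue.substr(start, end == std::string::npos
                                          ? std::string::npos
                                          : end - start);
}
```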

BTW, did you test on tw.yahoo.com? Did you apply the parser changes? If you did, 
the parser is probably catching the charset before the observer kicks in...
Ok, I didn't realize your first change was in the meta charset sniffing part. I 
thought it was for preparing the attributes. (In fact, I didn't check that part 
at all.) So you need to ask harishd or vidur to review that part. And I agree, 
this part should be fixed. 

As for my second comment, the code you added still seems redundant to me. (Yes, 
I tried without that part, and without the parser fix, and it works fine.) No 
matter whether charsetValue is nsnull or not, all that needs to be there is to 
make sure contentValue starts with "text/html". Disable the parser part and try 
it out yourself. 

Shanjian,

you were right - I hadn't read that part of the if statement correctly.

         ((0==nsCRT::strncasecmp(contentValue,texthtml.get(),texthtml.Length())) ||
          (((contentValue[0]=='\'') ||
            (contentValue[0]=='\"'))&&
           (0==nsCRT::strncasecmp(contentValue+1,
                      texthtml.get(),
                      texthtml.Length()))
          ))

explicitly covers both the quoted and the unquoted case. I dropped my extension 
of the if clause and simplified the patch somewhat. Sigh, I still can't get 
vidur to respond to my review request.
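The quoted/unquoted check in the if clause above can be restated as a standalone predicate (a sketch with an invented function name, using POSIX strncasecmp in place of nsCRT::strncasecmp): accept "text/html" either bare or behind a single leading quote.

```cpp
#include <cassert>
#include <cstring>
#include <strings.h>  // strncasecmp (POSIX)

// Restatement of the condition in the patch, for illustration:
// contentValue may start with "text/html" directly, or with one
// leading ' or " immediately followed by "text/html".
bool StartsWithTextHtml(const char* contentValue) {
    static const char kTextHtml[] = "text/html";
    const size_t len = sizeof(kTextHtml) - 1;
    if (strncasecmp(contentValue, kTextHtml, len) == 0) return true;
    if ((contentValue[0] == '\'' || contentValue[0] == '"') &&
        strncasecmp(contentValue + 1, kTextHtml, len) == 0)
        return true;
    return false;
}
```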
r=shanjian for patch #3, file nsMetaCharsetObserver.cpp
The change to the parser treats the rest of the tag after the content attribute
as the attribute value. This should work for the case shown and probably for
cases where there are additional attributes after content (e.g. content appears
before http-equiv), though the latter should be tested.
sr=vidur. 

thanks Vidur! I'll look into the scenario you mentioned tonight. Meanwhile I'm 
clearing the whiteboard status and nominating for the branch. This prevents a 
number of international sites from displaying properly, of which the most 
prominent one is Taiwanese Yahoo.
Keywords: nsBranch
Whiteboard: have patch - need r/sr → prepared to land, waiting for the tree to open
QA: I checked this into the trunk on 08/09/2001. Could you please verify - I'm 
particularly keen to see whether this causes any regressions.
Whiteboard: prepared to land, waiting for the tree to open → fixed on the trunk, need some verification work before lifting to the branch
Yes, it works on the 08-10 trunk win32 / Win2k-CN (just checked only one right now).
Yuying, 

thanks for your prompt reply! I'll keep this around for a few days to see 
whether any regressions arise from smoketests, and do some testing myself. 
Please let me know should you see any problems/changes in META charset 
sniffing and detection...
resolving per ftang's request
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
Whiteboard: fixed on the trunk, need some verification work before lifting to the branch → fixed on the trunk
Marking it as verified - works well on the 09-04 trunk build / Win2k-CN.
Status: RESOLVED → VERIFIED
