Closed Bug 427240 Opened 16 years ago Closed 14 years ago

Wrong character encoding parsing for Spanish chars

Categories

(Core :: DOM: HTML Parser, defect)

x86
All
defect
Not set
major

Tracking

()

RESOLVED FIXED
mozilla1.9.3a5

People

(Reporter: danielgutson, Assigned: danielgutson)

References

Details

(Keywords: intl, Whiteboard: [fixed by the HTML5 parser])

Attachments

(5 files, 1 obsolete file)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9b5) Gecko/2008032620 Firefox/3.0b5
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9b5) Gecko/2008032620 Firefox/3.0b5

Some pages in English containing UTF-8 characters which are NOT followed by a space are not parsed. For example:
  &ntildea
is not parsed as "ña", instead, it is shown textually (so the UTF-8 character is not shown). In other words, the FF parser needs the special characters to finish with a space, otherwise it tries to recognize the ´ntildea´ as a character, which does not exist, instead of ´ntilde´ followed by an ´a´.
Same problem with the &.acute family (&aacute, &oacute, etc.).

This is REALLY annoying. Sorry I am not providing a live web page now, just a modified one I got.

 Daniel.


Reproducible: Always

Steps to Reproduce:
1.
2.
3.
Attached file test case
See the &xacute and &ntilde examples.
(In reply to comment #0)
> Some pages in English containing UTF-8 characters which are NOT followed by a

ERRATA: pages in Spanish.
(sorry)
It needs to be ña - the semicolon terminates the token. For IE
compatibility, we recognize incorrectly terminated entities (including the
space), but maybe ntilde isn't recognized in this way.

PS : the ñ needs not to be a UTF-8 character, it also exists in other
charsets. The title is incorrect.

(In reply to comment #3)
> It needs to be ña - the semicolon terminates the token. For IE
> compatibility, we recognize incorrectly terminated entities (including the
> space), but maybe ntilde isn't recognized in this way.
> 
> PS : the ñ needs not to be a UTF-8 character, it also exists in other
> charsets. The title is incorrect.
> 

What about the rest of the char encodings? (those acute)
And, now I know that the semicolon should follow, but plenty of Spanish pages are written in this way. Could the tokenizer/lexer stop recognizing the token once it has an already valid token? I know that would modify the rules. (BTW, how is this lexer written? With lex?).

  Daniel.
Summary: Wrong UTF-8 character parsing → Wrong character encoding parsing for Spanish chars
(In reply to comment #5)
> see http://www.w3.org/TR/html401/charset.html#entities
> 

Jo, I swear I believe you :) It's just that a lot of pages are written like this, so, in terms of User Experience, I think there should be more work on compatibility so these pages are rendered "properly" (despite they are wrongly -non std- written).
(In reply to comment #3)
> It needs to be ña - the semicolon terminates the token. For IE
> compatibility, we recognize incorrectly terminated entities (including the
> space), but maybe ntilde isn't recognized in this way.
> 
> PS : the ñ needs not to be a UTF-8 character, it also exists in other
> charsets. The title is incorrect.
> 

Let me guess, is it http://mxr.mozilla.org/firefox/ident?i=ConsumeEntity ?
I see the scanner is not prepared for it. Maybe it´s a good deal of work. What you think?
Sorry the amount of emails, I think that ""the problem"" is that the method
  nsScanner::ReadEntityIdentifier
is called, which reads a complete word. For example, in the case of the &ntildea, it will read "ntildea" and return that as a token.

So I think of two possible solutions:
  1) enhance the nsScanner by adding a new method, ReadEntityIdentifier passing a list of valid tokens, so the scanner stops scanning when a token is found.
  2) enhance the upper layer (the one that uses the scanner, i.e. ConsumeEntity) to decompose the "ntildea" in "ntilde" and "a".

What would you suggest?

Please note that this is the first time I ever read FF code.

   Daniel.
I´ll try to fix this myself, by following option #2: I'll add a case here http://mxr.mozilla.org/firefox/source/parser/htmlparser/src/nsHTMLTokens.cpp#104 so I'll invoke a function for specially deal with this IE compatibility, by iterating over the hash and see if I find the token as a substring.

I'll do this instead of #1 because:
 - this is an IE compatibility issue, addressed at the consumer level (rather than the scanner level)
 - I don't want to change the scanner (too big change)
 - I don't want to mess performance, since looking up each token within a table would impact it. 
 - this is a particular case where the token is not found, and not a such frequent situation.

BTW, I will highly appreciate some welcome help about how this works. Should I just download, modify, and submit a candidate patch??

Thanks!
 Daniel.
Assignee: nobody → danielgutson
Status: UNCONFIRMED → NEW
Ever confirmed: true
Assignee: danielgutson → nobody
Component: General → HTML: Parser
Product: Firefox → Core
QA Contact: general → parser
Assignee: nobody → danielgutson
This is my first contribution to FF. Please let me know how to proceed!
Notes:
  * "Antes" means "before" and "Ahora" means "Now", in Spanish
  * there are some characters in the "before" image that were properly rendered, because I changed the test case manually for debugging purposes.
Status: NEW → ASSIGNED
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Blocks: 430809
You'll need to get a patch reviewed and checked in before this is really fixed ;)  mrbkap would probably be a good reviewer.  See http://www.mozilla.org/hacking/life-cycle.html for more information.

You can make your patch a little easier to readby asking |diff| for 8 or 10 lines of context instead of the default of 3, and by using -p.  Please avoid introducing tabs to files that use spaces.

Bug 385776 comment 0 points to the part of the HTML5 spec that covers entity parsing.  Please make sure your patch brings Gecko in line with this spec (e.g. make sure to favor longer entities, and make sure attribute parsing is stricter than normal text parsing).  Bug 385776 has tests for some of these things.

Is this bug a duplicate of bug 385776?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Jesse, thanks.
Comments:
  1- this covers only the first test case of Bug 38776
  2- I'll ensure (in a new patch) to favor longer entities
  3- This does NOT cover attributes entities
  4- I'll do as you say, about the diff params

So, how can we proceed? Besides point 2, I can extend my patch to cover attributes (point 3), and even all the test cases of 385776. But in terms of release strategy, what you think if I do point #2 in this bug, and then complete the rest of the test cases in 385776, so we could ship (if review goes OK) since the majority of the pages we´re seeing in Spanish are about entities in content? (test case #1).

Thanks for all your support and help with my first bug! :)

  Daniel.
Status: REOPENED → NEW
ps: wow, I got a lot of info from the life-cycle link you gave me. Specially when I got into the parser documentation, which would had be handy the first time I faced the code :)

   Daniel.
Status: NEW → ASSIGNED
(In reply to comment #12)
> Is this bug a duplicate of bug 385776?

Jesse, I "fixed my fix" :) in order to translate to the longest entity, and tried testcase 0 of bug 385776. I got this:
FAIL: 

TRADE; (was: ™, expected: ™ (\u2122)) 
lang; (was: 〈 (\u2329), expected: 〈 (\u3008)) 
rang; (was: 〉 (\u232a), expected: 〉 (\u3009)) 

comments about that:
  - TRADE was not found because it was uppercase; 'trade' (lowercase) is supported. Should I add an entry in uppercase too?
  - about lang and rang: I got the same with IE6. Looking at the standard linked in 385776, seems like the code are right. Could it be that the test case has wrong codes? (Simon, could you please comment on this?)

Thanks!

  Daniel.

ps: I'll attach the newer version of the patch.
This (2nd) version of the patch selects the entity of the longest name partial-matching the target.
Solves testcase 0 of bug 385776 (Entities in content), except for lang and rang, which I suspect have the wrong codes in the test case.
Attachment #317672 - Attachment is obsolete: true
Actually it also solves the second testcase of 385776, except for lang and rang.
Attachment #317882 - Flags: review?(mrbkap)
This patch depends on the previous one, and makes all test cases pass of bug 385776 (except for lang and rang).
Attachment #318079 - Flags: review?(mrbkap)
Hey Dan, it's going to be a couple of days before I can review the patch in depth, but at a glance I have a couple of minor nitpicks:
  * Please use two spaces for indentation instead of tabs (see the emacs and vim modelines at the tops of the files).
  * For block statements, like multiline if and for loops, prevailing style in the HTML parser is to put the brace on the same line as the if statement or for loop, so |for (...) {| instead of the next line.

I'll review in more depth later. Thanks for the patches!
The codepoint change of ⟨ and ⟩ was intentional.
(In reply to comment #20)
> The codepoint change of ⟨ and ⟩ was intentional.

Anne, where can I get the proper codes from? Could you please send me some URL or attach some document? Thanks!
(In reply to comment #19)
> Hey Dan, it's going to be a couple of days before I can review the patch in
> depth, but at a glance I have a couple of minor nitpicks:
>   * Please use two spaces for indentation instead of tabs (see the emacs and
> vim modelines at the tops of the files).
>   * For block statements, like multiline if and for loops, prevailing style in
> the HTML parser is to put the brace on the same line as the if statement or for
> loop, so |for (...) {| instead of the next line.
> I'll review in more depth later. Thanks for the patches!

Hi Blake, thanks. I'm already aware of the following issues (so you can skip these)
 * I used bool instead of PR_BOOL  (still applicable?)
 * bracing style (I used Allman instead of K&R)
 * Tabs

Please note that I took care of these issues in the last patch.

 Daniel.
(sorry, I lied :)  Bracing style still Allman in last patch)
Status: ASSIGNED → RESOLVED
Closed: 16 years ago16 years ago
Keywords: intl
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Status: REOPENED → ASSIGNED
Hi, this is how my patched FF renders the output of testcase #2.
As you can see, the left angle and right angle brackets are properly rendered with the existing values. So, PLEASE CONFIRM that I should change them, since once I do it, they will no longer be properly rendered.
(unless I should change them in the rendering engine as well).

Thanks!
  Daniel.
OS: Windows XP → All
Comment on attachment 317882 [details] [diff] [review]
317672: Fix for handling entities not terminated with the ; (new patch for favoring longer entities)

Hi Daniel,

I'm sorry it's taken me so long to get to this review. You caught me just as I was moving and about to take a vacation.

Overall, I think I'd prefer to do this work in CEntityToken::ConsumeEntity. That way, view-source would benefit, and we'd be able to reduce the amount of tokenization-type work that we do in the DTD.

You mentioned performance as a concern; one possible way of alleviating this problem would be to use js/src/jskwgen.c, in a slightly more generalized form to consume non-numeric entities.

For now, it seems like it wouldn't be too difficult to use your current approach to get this into the tree and we can optimize later. I'm not going to review this patch in depth, but please keep in mind the mostly-K&R bracing style and lack of hard tabs in parser/htmlparser.

Thanks again, and sorry for the delay. I'll be a lot more responsive from now on.
Attachment #317882 - Flags: review?(mrbkap) → review-
Hi Blake,
  thanks for answering. Now I will take a little while since I just changed job, and I'm fully on that. I guess I will implement the changes in about 2 weeks. I promise to care about styling (including bracing and spaces).

Thanks!

   Daniel.
Attachment #318079 - Flags: review?(mrbkap)
I believe this is fixed by bug 487949.
Status: ASSIGNED → RESOLVED
Closed: 16 years ago14 years ago
Resolution: --- → FIXED
Whiteboard: [fixed by the HTML5 parser]
Target Milestone: --- → mozilla1.9.3a5
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: