Closed Bug 427240 Opened 16 years ago Closed 14 years ago

Wrong character encoding parsing for Spanish chars

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla1.9.3a5

People

(Reporter: danielgutson, Assigned: danielgutson)

References

Details

(Keywords: intl, Whiteboard: [fixed by the HTML5 parser])

Attachments

(5 files, 1 obsolete file)

test case 16 years ago Daniel Gutson 8.41 KB, text/html		Details
Fix for handling entities not terminated with the ; 16 years ago Daniel Gutson 6.37 KB, patch		Details \| Diff \| Splinter Review
Screen capture of the test case, before and after the fix. 16 years ago Daniel Gutson 177.84 KB, image/jpeg		Details
317672: Fix for handling entities not terminated with the ; (new patch for favoring longer entities) 16 years ago Daniel Gutson 9.70 KB, patch	mrbkap : review-	Details \| Diff \| Splinter Review
2nd patch for fixing semi-colonness entities in attributes (see details) 16 years ago Daniel Gutson 2.49 KB, patch		Details \| Diff \| Splinter Review
[INFORMATION NEEDED] output of bug 385776 testcase #2 with both patches 16 years ago Daniel Gutson 28.71 KB, image/jpeg		Details

Daniel Gutson

Assignee

Description

•

16 years ago

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9b5) Gecko/2008032620 Firefox/3.0b5
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9b5) Gecko/2008032620 Firefox/3.0b5

Some pages in English containing UTF-8 characters which are NOT followed by a space are not parsed. For example:
  &ntildea
is not parsed as "ña", instead, it is shown textually (so the UTF-8 character is not shown). In other words, the FF parser needs the special characters to finish with a space, otherwise it tries to recognize the ´ntildea´ as a character, which does not exist, instead of ´ntilde´ followed by an ´a´.
Same problem with the &.acute family (&aacute, &oacute, etc.).

This is REALLY annoying. Sorry I am not providing a live web page now, just a modified one I got.

 Daniel.


Reproducible: Always

Steps to Reproduce:
1.
2.
3.

Daniel Gutson

Assignee

Comment 1

•

16 years ago

Attached file test case — Details

See the &xacute and &ntilde examples.

Daniel Gutson

Assignee

Comment 2

•

16 years ago

(In reply to comment #0)
> Some pages in English containing UTF-8 characters which are NOT followed by a

ERRATA: pages in Spanish.
(sorry)

Jo Hermans

Comment 3

•

16 years ago

It needs to be &ntilde;a - the semicolon terminates the token. For IE
compatibility, we recognize incorrectly terminated entities (including the
space), but maybe ntilde isn't recognized in this way.

PS : the ñ needs not to be a UTF-8 character, it also exists in other
charsets. The title is incorrect.

Daniel Gutson

Assignee

Comment 4

•

16 years ago

(In reply to comment #3)
> It needs to be &ntilde;a - the semicolon terminates the token. For IE
> compatibility, we recognize incorrectly terminated entities (including the
> space), but maybe ntilde isn't recognized in this way.
> 
> PS : the ñ needs not to be a UTF-8 character, it also exists in other
> charsets. The title is incorrect.
> 

What about the rest of the char encodings? (those acute)
And, now I know that the semicolon should follow, but plenty of Spanish pages are written in this way. Could the tokenizer/lexer stop recognizing the token once it has an already valid token? I know that would modify the rules. (BTW, how is this lexer written? With lex?).

  Daniel.

Daniel Gutson

Assignee

Updated

•

16 years ago

Summary: Wrong UTF-8 character parsing → Wrong character encoding parsing for Spanish chars

Jo Hermans

Comment 5

•

16 years ago

see http://www.w3.org/TR/html401/charset.html#entities

Daniel Gutson

Assignee

Comment 6

•

16 years ago

(In reply to comment #5)
> see http://www.w3.org/TR/html401/charset.html#entities
> 

Jo, I swear I believe you :) It's just that a lot of pages are written like this, so, in terms of User Experience, I think there should be more work on compatibility so these pages are rendered "properly" (despite they are wrongly -non std- written).

Daniel Gutson

Assignee

Comment 7

•

16 years ago

(In reply to comment #3)
> It needs to be &ntilde;a - the semicolon terminates the token. For IE
> compatibility, we recognize incorrectly terminated entities (including the
> space), but maybe ntilde isn't recognized in this way.
> 
> PS : the ñ needs not to be a UTF-8 character, it also exists in other
> charsets. The title is incorrect.
> 

Let me guess, is it http://mxr.mozilla.org/firefox/ident?i=ConsumeEntity ?
I see the scanner is not prepared for it. Maybe it´s a good deal of work. What you think?

Daniel Gutson

Assignee

Comment 8

•

16 years ago

Sorry the amount of emails, I think that ""the problem"" is that the method
  nsScanner::ReadEntityIdentifier
is called, which reads a complete word. For example, in the case of the &ntildea, it will read "ntildea" and return that as a token.

So I think of two possible solutions:
  1) enhance the nsScanner by adding a new method, ReadEntityIdentifier passing a list of valid tokens, so the scanner stops scanning when a token is found.
  2) enhance the upper layer (the one that uses the scanner, i.e. ConsumeEntity) to decompose the "ntildea" in "ntilde" and "a".

What would you suggest?

Please note that this is the first time I ever read FF code.

   Daniel.

Daniel Gutson

Assignee

Comment 9

•

16 years ago

I´ll try to fix this myself, by following option #2: I'll add a case here http://mxr.mozilla.org/firefox/source/parser/htmlparser/src/nsHTMLTokens.cpp#104 so I'll invoke a function for specially deal with this IE compatibility, by iterating over the hash and see if I find the token as a substring.

I'll do this instead of #1 because:
 - this is an IE compatibility issue, addressed at the consumer level (rather than the scanner level)
 - I don't want to change the scanner (too big change)
 - I don't want to mess performance, since looking up each token within a table would impact it. 
 - this is a particular case where the token is not found, and not a such frequent situation.

BTW, I will highly appreciate some welcome help about how this works. Should I just download, modify, and submit a candidate patch??

Thanks!
 Daniel.

Jo Hermans

Updated

•

16 years ago

Assignee: nobody → danielgutson

Status: UNCONFIRMED → NEW

Ever confirmed: true

Jo Hermans

Updated

•

16 years ago

Assignee: danielgutson → nobody

Component: General → HTML: Parser

Product: Firefox → Core

QA Contact: general → parser

Jo Hermans

Updated

•

16 years ago

Assignee: nobody → danielgutson

Daniel Gutson

Assignee

Comment 10

•

16 years ago

Attached patch Fix for handling entities not terminated with the ; (obsolete) — Details — Splinter Review

This is my first contribution to FF. Please let me know how to proceed!

Daniel Gutson

Assignee

Comment 11

•

16 years ago

Attached image Screen capture of the test case, before and after the fix. — Details

Notes:
  * "Antes" means "before" and "Ahora" means "Now", in Spanish
  * there are some characters in the "before" image that were properly rendered, because I changed the test case manually for debugging purposes.

Daniel Gutson

Assignee

Updated

•

16 years ago

Status: NEW → ASSIGNED

Daniel Gutson

Assignee

Updated

•

16 years ago

Status: ASSIGNED → RESOLVED

Closed: 16 years ago

Resolution: --- → FIXED

Daniel Gutson

Assignee

Updated

•

16 years ago

Blocks: 430809

Jesse Ruderman

Comment 12

•

16 years ago

You'll need to get a patch reviewed and checked in before this is really fixed ;)  mrbkap would probably be a good reviewer.  See http://www.mozilla.org/hacking/life-cycle.html for more information.

You can make your patch a little easier to readby asking |diff| for 8 or 10 lines of context instead of the default of 3, and by using -p.  Please avoid introducing tabs to files that use spaces.

Bug 385776 comment 0 points to the part of the HTML5 spec that covers entity parsing.  Please make sure your patch brings Gecko in line with this spec (e.g. make sure to favor longer entities, and make sure attribute parsing is stricter than normal text parsing).  Bug 385776 has tests for some of these things.

Is this bug a duplicate of bug 385776?

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Daniel Gutson

Assignee

Comment 13

•

16 years ago

Jesse, thanks.
Comments:
  1- this covers only the first test case of Bug 38776
  2- I'll ensure (in a new patch) to favor longer entities
  3- This does NOT cover attributes entities
  4- I'll do as you say, about the diff params

So, how can we proceed? Besides point 2, I can extend my patch to cover attributes (point 3), and even all the test cases of 385776. But in terms of release strategy, what you think if I do point #2 in this bug, and then complete the rest of the test cases in 385776, so we could ship (if review goes OK) since the majority of the pages we´re seeing in Spanish are about entities in content? (test case #1).

Thanks for all your support and help with my first bug! :)

  Daniel.

Status: REOPENED → NEW

Daniel Gutson

Assignee

Comment 14

•

16 years ago

ps: wow, I got a lot of info from the life-cycle link you gave me. Specially when I got into the parser documentation, which would had be handy the first time I faced the code :)

   Daniel.

Status: NEW → ASSIGNED

Daniel Gutson

Assignee

Comment 15

•

16 years ago

(In reply to comment #12)
> Is this bug a duplicate of bug 385776?

Jesse, I "fixed my fix" :) in order to translate to the longest entity, and tried testcase 0 of bug 385776. I got this:
FAIL: 

TRADE; (was: &TRADE;, expected: ™ (\u2122)) 
lang; (was: 〈 (\u2329), expected: 〈 (\u3008)) 
rang; (was: 〉 (\u232a), expected: 〉 (\u3009)) 

comments about that:
  - TRADE was not found because it was uppercase; 'trade' (lowercase) is supported. Should I add an entry in uppercase too?
  - about lang and rang: I got the same with IE6. Looking at the standard linked in 385776, seems like the code are right. Could it be that the test case has wrong codes? (Simon, could you please comment on this?)

Thanks!

  Daniel.

ps: I'll attach the newer version of the patch.

Daniel Gutson

Assignee

Comment 16

•

16 years ago

Attached patch 317672: Fix for handling entities not terminated with the ; (new patch for favoring longer entities) — Details — Splinter Review

This (2nd) version of the patch selects the entity of the longest name partial-matching the target.
Solves testcase 0 of bug 385776 (Entities in content), except for lang and rang, which I suspect have the wrong codes in the test case.

Attachment #317672 - Attachment is obsolete: true

Daniel Gutson

Assignee

Comment 17

•

16 years ago

Actually it also solves the second testcase of 385776, except for lang and rang.

Daniel Gutson

Assignee

Updated

•

16 years ago

Attachment #317882 - Flags: review?(mrbkap)

Daniel Gutson

Assignee

Comment 18

•

16 years ago

Attached patch 2nd patch for fixing semi-colonness entities in attributes (see details) — Details — Splinter Review

This patch depends on the previous one, and makes all test cases pass of bug 385776 (except for lang and rang).

Daniel Gutson

Assignee

Updated

•

16 years ago

Attachment #318079 - Flags: review?(mrbkap)

Blake Kaplan (:mrbkap) (inactive)

Comment 19

•

16 years ago

Hey Dan, it's going to be a couple of days before I can review the patch in depth, but at a glance I have a couple of minor nitpicks:
  * Please use two spaces for indentation instead of tabs (see the emacs and vim modelines at the tops of the files).
  * For block statements, like multiline if and for loops, prevailing style in the HTML parser is to put the brace on the same line as the if statement or for loop, so |for (...) {| instead of the next line.

I'll review in more depth later. Thanks for the patches!

Anne (:annevk)

Comment 20

•

16 years ago

The codepoint change of &lang; and &rang; was intentional.

Daniel Gutson

Assignee

Comment 21

•

16 years ago

(In reply to comment #20)
> The codepoint change of &lang; and &rang; was intentional.

Anne, where can I get the proper codes from? Could you please send me some URL or attach some document? Thanks!

Daniel Gutson

Assignee

Comment 22

•

16 years ago

(In reply to comment #19)
> Hey Dan, it's going to be a couple of days before I can review the patch in
> depth, but at a glance I have a couple of minor nitpicks:
>   * Please use two spaces for indentation instead of tabs (see the emacs and
> vim modelines at the tops of the files).
>   * For block statements, like multiline if and for loops, prevailing style in
> the HTML parser is to put the brace on the same line as the if statement or for
> loop, so |for (...) {| instead of the next line.
> I'll review in more depth later. Thanks for the patches!

Hi Blake, thanks. I'm already aware of the following issues (so you can skip these)
 * I used bool instead of PR_BOOL  (still applicable?)
 * bracing style (I used Allman instead of K&R)
 * Tabs

Please note that I took care of these issues in the last patch.

 Daniel.

Daniel Gutson

Assignee

Comment 23

•

16 years ago

(sorry, I lied :)  Bracing style still Allman in last patch)

Daniel Gutson

Assignee

Updated

•

16 years ago

Status: ASSIGNED → RESOLVED

Closed: 16 years ago → 16 years ago

Keywords: intl

Resolution: --- → FIXED

Daniel Gutson

Assignee

Updated

•

16 years ago

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Daniel Gutson

Assignee

Updated

•

16 years ago

Status: REOPENED → ASSIGNED

Daniel Gutson

Assignee

Comment 24

•

16 years ago

Attached image [INFORMATION NEEDED] output of bug 385776 testcase #2 with both patches — Details

Hi, this is how my patched FF renders the output of testcase #2.
As you can see, the left angle and right angle brackets are properly rendered with the existing values. So, PLEASE CONFIRM that I should change them, since once I do it, they will no longer be properly rendered.
(unless I should change them in the rendering engine as well).

Thanks!
  Daniel.

Daniel Gutson

Assignee

Updated

•

16 years ago

OS: Windows XP → All

Blake Kaplan (:mrbkap) (inactive)

Comment 25

•

16 years ago

Comment on attachment 317882 [details] [diff] [review]
317672: Fix for handling entities not terminated with the ; (new patch for favoring longer entities)

Hi Daniel,

I'm sorry it's taken me so long to get to this review. You caught me just as I was moving and about to take a vacation.

Overall, I think I'd prefer to do this work in CEntityToken::ConsumeEntity. That way, view-source would benefit, and we'd be able to reduce the amount of tokenization-type work that we do in the DTD.

You mentioned performance as a concern; one possible way of alleviating this problem would be to use js/src/jskwgen.c, in a slightly more generalized form to consume non-numeric entities.

For now, it seems like it wouldn't be too difficult to use your current approach to get this into the tree and we can optimize later. I'm not going to review this patch in depth, but please keep in mind the mostly-K&R bracing style and lack of hard tabs in parser/htmlparser.

Thanks again, and sorry for the delay. I'll be a lot more responsive from now on.

Attachment #317882 - Flags: review?(mrbkap) → review-

Daniel Gutson

Assignee

Comment 26

•

16 years ago

Hi Blake,
  thanks for answering. Now I will take a little while since I just changed job, and I'm fully on that. I guess I will implement the changes in about 2 weeks. I promise to care about styling (including bracing and spaces).

Thanks!

   Daniel.

Blake Kaplan (:mrbkap) (inactive)

Updated

•

15 years ago

Attachment #318079 - Flags: review?(mrbkap)

:Ms2ger (he/him; ⌚ UTC+1/+2)

Comment 27

•

15 years ago

I believe this is fixed by bug 487949.

Henri Sivonen (:hsivonen)

Updated

•

14 years ago

Status: ASSIGNED → RESOLVED

Closed: 16 years ago → 14 years ago

Resolution: --- → FIXED

Whiteboard: [fixed by the HTML5 parser]

:Ms2ger (he/him; ⌚ UTC+1/+2)

Updated

•

14 years ago

Target Milestone: --- → mozilla1.9.3a5

You need to log in before you can comment on or make changes to this bug.