Wrong character encoding parsing for Spanish chars

RESOLVED FIXED in mozilla1.9.3a5

Status

()

Core
HTML: Parser
--
major
RESOLVED FIXED
10 years ago
8 years ago

People

(Reporter: Daniel Gutson, Assigned: Daniel Gutson)

Tracking

({intl})

unspecified
mozilla1.9.3a5
x86
All
Points:
---

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [fixed by the HTML5 parser])

Attachments

(5 attachments, 1 obsolete attachment)

(Assignee)

Description

10 years ago
User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9b5) Gecko/2008032620 Firefox/3.0b5
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9b5) Gecko/2008032620 Firefox/3.0b5

Some pages in English containing UTF-8 characters which are NOT followed by a space are not parsed. For example:
  &ntildea
is not parsed as "ña", instead, it is shown textually (so the UTF-8 character is not shown). In other words, the FF parser needs the special characters to finish with a space, otherwise it tries to recognize the ´ntildea´ as a character, which does not exist, instead of ´ntilde´ followed by an ´a´.
Same problem with the &.acute family (&aacute, &oacute, etc.).

This is REALLY annoying. Sorry I am not providing a live web page now, just a modified one I got.

 Daniel.


Reproducible: Always

Steps to Reproduce:
1.
2.
3.
(Assignee)

Comment 1

10 years ago
Created attachment 313812 [details]
test case

See the &xacute and &ntilde examples.
(Assignee)

Comment 2

10 years ago
(In reply to comment #0)
> Some pages in English containing UTF-8 characters which are NOT followed by a

ERRATA: pages in Spanish.
(sorry)

Comment 3

10 years ago
It needs to be ña - the semicolon terminates the token. For IE
compatibility, we recognize incorrectly terminated entities (including the
space), but maybe ntilde isn't recognized in this way.

PS : the ñ needs not to be a UTF-8 character, it also exists in other
charsets. The title is incorrect.

(Assignee)

Comment 4

10 years ago
(In reply to comment #3)
> It needs to be ña - the semicolon terminates the token. For IE
> compatibility, we recognize incorrectly terminated entities (including the
> space), but maybe ntilde isn't recognized in this way.
> 
> PS : the ñ needs not to be a UTF-8 character, it also exists in other
> charsets. The title is incorrect.
> 

What about the rest of the char encodings? (those acute)
And, now I know that the semicolon should follow, but plenty of Spanish pages are written in this way. Could the tokenizer/lexer stop recognizing the token once it has an already valid token? I know that would modify the rules. (BTW, how is this lexer written? With lex?).

  Daniel.
(Assignee)

Updated

10 years ago
Summary: Wrong UTF-8 character parsing → Wrong character encoding parsing for Spanish chars
(Assignee)

Comment 6

10 years ago
(In reply to comment #5)
> see http://www.w3.org/TR/html401/charset.html#entities
> 

Jo, I swear I believe you :) It's just that a lot of pages are written like this, so, in terms of User Experience, I think there should be more work on compatibility so these pages are rendered "properly" (despite they are wrongly -non std- written).
(Assignee)

Comment 7

10 years ago
(In reply to comment #3)
> It needs to be ña - the semicolon terminates the token. For IE
> compatibility, we recognize incorrectly terminated entities (including the
> space), but maybe ntilde isn't recognized in this way.
> 
> PS : the ñ needs not to be a UTF-8 character, it also exists in other
> charsets. The title is incorrect.
> 

Let me guess, is it http://mxr.mozilla.org/firefox/ident?i=ConsumeEntity ?
I see the scanner is not prepared for it. Maybe it´s a good deal of work. What you think?
(Assignee)

Comment 8

10 years ago
Sorry the amount of emails, I think that ""the problem"" is that the method
  nsScanner::ReadEntityIdentifier
is called, which reads a complete word. For example, in the case of the &ntildea, it will read "ntildea" and return that as a token.

So I think of two possible solutions:
  1) enhance the nsScanner by adding a new method, ReadEntityIdentifier passing a list of valid tokens, so the scanner stops scanning when a token is found.
  2) enhance the upper layer (the one that uses the scanner, i.e. ConsumeEntity) to decompose the "ntildea" in "ntilde" and "a".

What would you suggest?

Please note that this is the first time I ever read FF code.

   Daniel.
(Assignee)

Comment 9

10 years ago
I´ll try to fix this myself, by following option #2: I'll add a case here http://mxr.mozilla.org/firefox/source/parser/htmlparser/src/nsHTMLTokens.cpp#104 so I'll invoke a function for specially deal with this IE compatibility, by iterating over the hash and see if I find the token as a substring.

I'll do this instead of #1 because:
 - this is an IE compatibility issue, addressed at the consumer level (rather than the scanner level)
 - I don't want to change the scanner (too big change)
 - I don't want to mess performance, since looking up each token within a table would impact it. 
 - this is a particular case where the token is not found, and not a such frequent situation.

BTW, I will highly appreciate some welcome help about how this works. Should I just download, modify, and submit a candidate patch??

Thanks!
 Daniel.

Updated

10 years ago
Assignee: nobody → danielgutson
Status: UNCONFIRMED → NEW
Ever confirmed: true

Updated

10 years ago
Assignee: danielgutson → nobody
Component: General → HTML: Parser
Product: Firefox → Core
QA Contact: general → parser

Updated

10 years ago
Assignee: nobody → danielgutson
(Assignee)

Comment 10

10 years ago
Created attachment 317672 [details] [diff] [review]
Fix for handling entities not terminated with the ;

This is my first contribution to FF. Please let me know how to proceed!
(Assignee)

Comment 11

10 years ago
Created attachment 317673 [details]
Screen capture of the test case, before and after the fix.

Notes:
  * "Antes" means "before" and "Ahora" means "Now", in Spanish
  * there are some characters in the "before" image that were properly rendered, because I changed the test case manually for debugging purposes.
(Assignee)

Updated

10 years ago
Status: NEW → ASSIGNED
(Assignee)

Updated

10 years ago
Status: ASSIGNED → RESOLVED
Last Resolved: 10 years ago
Resolution: --- → FIXED
(Assignee)

Updated

10 years ago
Blocks: 430809

Comment 12

10 years ago
You'll need to get a patch reviewed and checked in before this is really fixed ;)  mrbkap would probably be a good reviewer.  See http://www.mozilla.org/hacking/life-cycle.html for more information.

You can make your patch a little easier to readby asking |diff| for 8 or 10 lines of context instead of the default of 3, and by using -p.  Please avoid introducing tabs to files that use spaces.

Bug 385776 comment 0 points to the part of the HTML5 spec that covers entity parsing.  Please make sure your patch brings Gecko in line with this spec (e.g. make sure to favor longer entities, and make sure attribute parsing is stricter than normal text parsing).  Bug 385776 has tests for some of these things.

Is this bug a duplicate of bug 385776?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(Assignee)

Comment 13

10 years ago
Jesse, thanks.
Comments:
  1- this covers only the first test case of Bug 38776
  2- I'll ensure (in a new patch) to favor longer entities
  3- This does NOT cover attributes entities
  4- I'll do as you say, about the diff params

So, how can we proceed? Besides point 2, I can extend my patch to cover attributes (point 3), and even all the test cases of 385776. But in terms of release strategy, what you think if I do point #2 in this bug, and then complete the rest of the test cases in 385776, so we could ship (if review goes OK) since the majority of the pages we´re seeing in Spanish are about entities in content? (test case #1).

Thanks for all your support and help with my first bug! :)

  Daniel.
Status: REOPENED → NEW
(Assignee)

Comment 14

10 years ago
ps: wow, I got a lot of info from the life-cycle link you gave me. Specially when I got into the parser documentation, which would had be handy the first time I faced the code :)

   Daniel.
Status: NEW → ASSIGNED
(Assignee)

Comment 15

10 years ago
(In reply to comment #12)
> Is this bug a duplicate of bug 385776?

Jesse, I "fixed my fix" :) in order to translate to the longest entity, and tried testcase 0 of bug 385776. I got this:
FAIL: 

TRADE; (was: ™, expected: ™ (\u2122)) 
lang; (was: 〈 (\u2329), expected: 〈 (\u3008)) 
rang; (was: 〉 (\u232a), expected: 〉 (\u3009)) 

comments about that:
  - TRADE was not found because it was uppercase; 'trade' (lowercase) is supported. Should I add an entry in uppercase too?
  - about lang and rang: I got the same with IE6. Looking at the standard linked in 385776, seems like the code are right. Could it be that the test case has wrong codes? (Simon, could you please comment on this?)

Thanks!

  Daniel.

ps: I'll attach the newer version of the patch.
(Assignee)

Comment 16

10 years ago
Created attachment 317882 [details] [diff] [review]
317672: Fix for handling entities not terminated with the ; (new patch for favoring longer entities)

This (2nd) version of the patch selects the entity of the longest name partial-matching the target.
Solves testcase 0 of bug 385776 (Entities in content), except for lang and rang, which I suspect have the wrong codes in the test case.
Attachment #317672 - Attachment is obsolete: true
(Assignee)

Comment 17

10 years ago
Actually it also solves the second testcase of 385776, except for lang and rang.
(Assignee)

Updated

10 years ago
Attachment #317882 - Flags: review?(mrbkap)
(Assignee)

Comment 18

10 years ago
Created attachment 318079 [details] [diff] [review]
2nd patch for fixing semi-colonness entities in attributes (see details)

This patch depends on the previous one, and makes all test cases pass of bug 385776 (except for lang and rang).
(Assignee)

Updated

10 years ago
Attachment #318079 - Flags: review?(mrbkap)
Hey Dan, it's going to be a couple of days before I can review the patch in depth, but at a glance I have a couple of minor nitpicks:
  * Please use two spaces for indentation instead of tabs (see the emacs and vim modelines at the tops of the files).
  * For block statements, like multiline if and for loops, prevailing style in the HTML parser is to put the brace on the same line as the if statement or for loop, so |for (...) {| instead of the next line.

I'll review in more depth later. Thanks for the patches!

Comment 20

10 years ago
The codepoint change of ⟨ and ⟩ was intentional.
(Assignee)

Comment 21

10 years ago
(In reply to comment #20)
> The codepoint change of ⟨ and ⟩ was intentional.

Anne, where can I get the proper codes from? Could you please send me some URL or attach some document? Thanks!
(Assignee)

Comment 22

10 years ago
(In reply to comment #19)
> Hey Dan, it's going to be a couple of days before I can review the patch in
> depth, but at a glance I have a couple of minor nitpicks:
>   * Please use two spaces for indentation instead of tabs (see the emacs and
> vim modelines at the tops of the files).
>   * For block statements, like multiline if and for loops, prevailing style in
> the HTML parser is to put the brace on the same line as the if statement or for
> loop, so |for (...) {| instead of the next line.
> I'll review in more depth later. Thanks for the patches!

Hi Blake, thanks. I'm already aware of the following issues (so you can skip these)
 * I used bool instead of PR_BOOL  (still applicable?)
 * bracing style (I used Allman instead of K&R)
 * Tabs

Please note that I took care of these issues in the last patch.

 Daniel.
(Assignee)

Comment 23

10 years ago
(sorry, I lied :)  Bracing style still Allman in last patch)
(Assignee)

Updated

10 years ago
Status: ASSIGNED → RESOLVED
Last Resolved: 10 years ago10 years ago
Keywords: intl
Resolution: --- → FIXED
(Assignee)

Updated

10 years ago
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(Assignee)

Updated

10 years ago
Status: REOPENED → ASSIGNED
(Assignee)

Comment 24

10 years ago
Created attachment 318847 [details]
[INFORMATION NEEDED] output of bug 385776 testcase #2 with both patches

Hi, this is how my patched FF renders the output of testcase #2.
As you can see, the left angle and right angle brackets are properly rendered with the existing values. So, PLEASE CONFIRM that I should change them, since once I do it, they will no longer be properly rendered.
(unless I should change them in the rendering engine as well).

Thanks!
  Daniel.
(Assignee)

Updated

10 years ago
OS: Windows XP → All
Comment on attachment 317882 [details] [diff] [review]
317672: Fix for handling entities not terminated with the ; (new patch for favoring longer entities)

Hi Daniel,

I'm sorry it's taken me so long to get to this review. You caught me just as I was moving and about to take a vacation.

Overall, I think I'd prefer to do this work in CEntityToken::ConsumeEntity. That way, view-source would benefit, and we'd be able to reduce the amount of tokenization-type work that we do in the DTD.

You mentioned performance as a concern; one possible way of alleviating this problem would be to use js/src/jskwgen.c, in a slightly more generalized form to consume non-numeric entities.

For now, it seems like it wouldn't be too difficult to use your current approach to get this into the tree and we can optimize later. I'm not going to review this patch in depth, but please keep in mind the mostly-K&R bracing style and lack of hard tabs in parser/htmlparser.

Thanks again, and sorry for the delay. I'll be a lot more responsive from now on.
Attachment #317882 - Flags: review?(mrbkap) → review-
(Assignee)

Comment 26

10 years ago
Hi Blake,
  thanks for answering. Now I will take a little while since I just changed job, and I'm fully on that. I guess I will implement the changes in about 2 weeks. I promise to care about styling (including bracing and spaces).

Thanks!

   Daniel.
I believe this is fixed by bug 487949.
Status: ASSIGNED → RESOLVED
Last Resolved: 10 years ago8 years ago
Resolution: --- → FIXED
Whiteboard: [fixed by the HTML5 parser]

Updated

8 years ago
Target Milestone: --- → mozilla1.9.3a5
You need to log in before you can comment on or make changes to this bug.