Closed
Bug 427240
Opened 17 years ago
Closed 15 years ago
Wrong character encoding parsing for Spanish chars
Categories
(Core :: DOM: HTML Parser, defect)
Tracking
()
RESOLVED
FIXED
mozilla1.9.3a5
People
(Reporter: danielgutson, Assigned: danielgutson)
References
Details
(Keywords: intl, Whiteboard: [fixed by the HTML5 parser])
Attachments
(5 files, 1 obsolete file)
8.41 KB,
text/html
|
Details | |
177.84 KB,
image/jpeg
|
Details | |
317672: Fix for handling entities not terminated with the ; (new patch for favoring longer entities)
9.70 KB,
patch
|
mrbkap
:
review-
|
Details | Diff | Splinter Review |
2.49 KB,
patch
|
Details | Diff | Splinter Review | |
28.71 KB,
image/jpeg
|
Details |
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9b5) Gecko/2008032620 Firefox/3.0b5
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9b5) Gecko/2008032620 Firefox/3.0b5
Some pages in English containing UTF-8 characters which are NOT followed by a space are not parsed. For example:
ña
is not parsed as "ña", instead, it is shown textually (so the UTF-8 character is not shown). In other words, the FF parser needs the special characters to finish with a space, otherwise it tries to recognize the ´ntildea´ as a character, which does not exist, instead of ´ntilde´ followed by an ´a´.
Same problem with the &.acute family (á, ó, etc.).
This is REALLY annoying. Sorry I am not providing a live web page now, just a modified one I got.
Daniel.
Reproducible: Always
Steps to Reproduce:
1.
2.
3.
Assignee | ||
Comment 1•17 years ago
|
||
See the &xacute and ñ examples.
Assignee | ||
Comment 2•17 years ago
|
||
(In reply to comment #0)
> Some pages in English containing UTF-8 characters which are NOT followed by a
ERRATA: pages in Spanish.
(sorry)
Comment 3•17 years ago
|
||
It needs to be ña - the semicolon terminates the token. For IE
compatibility, we recognize incorrectly terminated entities (including the
space), but maybe ntilde isn't recognized in this way.
PS : the ñ needs not to be a UTF-8 character, it also exists in other
charsets. The title is incorrect.
Assignee | ||
Comment 4•17 years ago
|
||
(In reply to comment #3)
> It needs to be ña - the semicolon terminates the token. For IE
> compatibility, we recognize incorrectly terminated entities (including the
> space), but maybe ntilde isn't recognized in this way.
>
> PS : the ñ needs not to be a UTF-8 character, it also exists in other
> charsets. The title is incorrect.
>
What about the rest of the char encodings? (those acute)
And, now I know that the semicolon should follow, but plenty of Spanish pages are written in this way. Could the tokenizer/lexer stop recognizing the token once it has an already valid token? I know that would modify the rules. (BTW, how is this lexer written? With lex?).
Daniel.
Assignee | ||
Updated•17 years ago
|
Summary: Wrong UTF-8 character parsing → Wrong character encoding parsing for Spanish chars
Comment 5•17 years ago
|
||
Assignee | ||
Comment 6•17 years ago
|
||
(In reply to comment #5)
> see http://www.w3.org/TR/html401/charset.html#entities
>
Jo, I swear I believe you :) It's just that a lot of pages are written like this, so, in terms of User Experience, I think there should be more work on compatibility so these pages are rendered "properly" (despite they are wrongly -non std- written).
Assignee | ||
Comment 7•17 years ago
|
||
(In reply to comment #3)
> It needs to be ña - the semicolon terminates the token. For IE
> compatibility, we recognize incorrectly terminated entities (including the
> space), but maybe ntilde isn't recognized in this way.
>
> PS : the ñ needs not to be a UTF-8 character, it also exists in other
> charsets. The title is incorrect.
>
Let me guess, is it http://mxr.mozilla.org/firefox/ident?i=ConsumeEntity ?
I see the scanner is not prepared for it. Maybe it´s a good deal of work. What you think?
Assignee | ||
Comment 8•17 years ago
|
||
Sorry the amount of emails, I think that ""the problem"" is that the method
nsScanner::ReadEntityIdentifier
is called, which reads a complete word. For example, in the case of the ña, it will read "ntildea" and return that as a token.
So I think of two possible solutions:
1) enhance the nsScanner by adding a new method, ReadEntityIdentifier passing a list of valid tokens, so the scanner stops scanning when a token is found.
2) enhance the upper layer (the one that uses the scanner, i.e. ConsumeEntity) to decompose the "ntildea" in "ntilde" and "a".
What would you suggest?
Please note that this is the first time I ever read FF code.
Daniel.
Assignee | ||
Comment 9•17 years ago
|
||
I´ll try to fix this myself, by following option #2: I'll add a case here http://mxr.mozilla.org/firefox/source/parser/htmlparser/src/nsHTMLTokens.cpp#104 so I'll invoke a function for specially deal with this IE compatibility, by iterating over the hash and see if I find the token as a substring.
I'll do this instead of #1 because:
- this is an IE compatibility issue, addressed at the consumer level (rather than the scanner level)
- I don't want to change the scanner (too big change)
- I don't want to mess performance, since looking up each token within a table would impact it.
- this is a particular case where the token is not found, and not a such frequent situation.
BTW, I will highly appreciate some welcome help about how this works. Should I just download, modify, and submit a candidate patch??
Thanks!
Daniel.
Updated•17 years ago
|
Assignee: nobody → danielgutson
Status: UNCONFIRMED → NEW
Ever confirmed: true
Updated•17 years ago
|
Assignee: danielgutson → nobody
Component: General → HTML: Parser
Product: Firefox → Core
QA Contact: general → parser
Updated•17 years ago
|
Assignee: nobody → danielgutson
Assignee | ||
Comment 10•17 years ago
|
||
This is my first contribution to FF. Please let me know how to proceed!
Assignee | ||
Comment 11•17 years ago
|
||
Notes:
* "Antes" means "before" and "Ahora" means "Now", in Spanish
* there are some characters in the "before" image that were properly rendered, because I changed the test case manually for debugging purposes.
Assignee | ||
Updated•17 years ago
|
Status: NEW → ASSIGNED
Assignee | ||
Updated•17 years ago
|
Status: ASSIGNED → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
Comment 12•17 years ago
|
||
You'll need to get a patch reviewed and checked in before this is really fixed ;) mrbkap would probably be a good reviewer. See http://www.mozilla.org/hacking/life-cycle.html for more information.
You can make your patch a little easier to readby asking |diff| for 8 or 10 lines of context instead of the default of 3, and by using -p. Please avoid introducing tabs to files that use spaces.
Bug 385776 comment 0 points to the part of the HTML5 spec that covers entity parsing. Please make sure your patch brings Gecko in line with this spec (e.g. make sure to favor longer entities, and make sure attribute parsing is stricter than normal text parsing). Bug 385776 has tests for some of these things.
Is this bug a duplicate of bug 385776?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee | ||
Comment 13•17 years ago
|
||
Jesse, thanks.
Comments:
1- this covers only the first test case of Bug 38776
2- I'll ensure (in a new patch) to favor longer entities
3- This does NOT cover attributes entities
4- I'll do as you say, about the diff params
So, how can we proceed? Besides point 2, I can extend my patch to cover attributes (point 3), and even all the test cases of 385776. But in terms of release strategy, what you think if I do point #2 in this bug, and then complete the rest of the test cases in 385776, so we could ship (if review goes OK) since the majority of the pages we´re seeing in Spanish are about entities in content? (test case #1).
Thanks for all your support and help with my first bug! :)
Daniel.
Status: REOPENED → NEW
Assignee | ||
Comment 14•17 years ago
|
||
ps: wow, I got a lot of info from the life-cycle link you gave me. Specially when I got into the parser documentation, which would had be handy the first time I faced the code :)
Daniel.
Status: NEW → ASSIGNED
Assignee | ||
Comment 15•17 years ago
|
||
(In reply to comment #12)
> Is this bug a duplicate of bug 385776?
Jesse, I "fixed my fix" :) in order to translate to the longest entity, and tried testcase 0 of bug 385776. I got this:
FAIL:
TRADE; (was: ™, expected: ™ (\u2122))
lang; (was: 〈 (\u2329), expected: 〈 (\u3008))
rang; (was: 〉 (\u232a), expected: 〉 (\u3009))
comments about that:
- TRADE was not found because it was uppercase; 'trade' (lowercase) is supported. Should I add an entry in uppercase too?
- about lang and rang: I got the same with IE6. Looking at the standard linked in 385776, seems like the code are right. Could it be that the test case has wrong codes? (Simon, could you please comment on this?)
Thanks!
Daniel.
ps: I'll attach the newer version of the patch.
Assignee | ||
Comment 16•17 years ago
|
||
This (2nd) version of the patch selects the entity of the longest name partial-matching the target.
Solves testcase 0 of bug 385776 (Entities in content), except for lang and rang, which I suspect have the wrong codes in the test case.
Attachment #317672 -
Attachment is obsolete: true
Assignee | ||
Comment 17•17 years ago
|
||
Actually it also solves the second testcase of 385776, except for lang and rang.
Assignee | ||
Updated•17 years ago
|
Attachment #317882 -
Flags: review?(mrbkap)
Assignee | ||
Comment 18•17 years ago
|
||
This patch depends on the previous one, and makes all test cases pass of bug 385776 (except for lang and rang).
Assignee | ||
Updated•17 years ago
|
Attachment #318079 -
Flags: review?(mrbkap)
Comment 19•17 years ago
|
||
Hey Dan, it's going to be a couple of days before I can review the patch in depth, but at a glance I have a couple of minor nitpicks:
* Please use two spaces for indentation instead of tabs (see the emacs and vim modelines at the tops of the files).
* For block statements, like multiline if and for loops, prevailing style in the HTML parser is to put the brace on the same line as the if statement or for loop, so |for (...) {| instead of the next line.
I'll review in more depth later. Thanks for the patches!
Comment 20•17 years ago
|
||
The codepoint change of ⟨ and ⟩ was intentional.
Assignee | ||
Comment 21•17 years ago
|
||
(In reply to comment #20)
> The codepoint change of ⟨ and ⟩ was intentional.
Anne, where can I get the proper codes from? Could you please send me some URL or attach some document? Thanks!
Assignee | ||
Comment 22•17 years ago
|
||
(In reply to comment #19)
> Hey Dan, it's going to be a couple of days before I can review the patch in
> depth, but at a glance I have a couple of minor nitpicks:
> * Please use two spaces for indentation instead of tabs (see the emacs and
> vim modelines at the tops of the files).
> * For block statements, like multiline if and for loops, prevailing style in
> the HTML parser is to put the brace on the same line as the if statement or for
> loop, so |for (...) {| instead of the next line.
> I'll review in more depth later. Thanks for the patches!
Hi Blake, thanks. I'm already aware of the following issues (so you can skip these)
* I used bool instead of PR_BOOL (still applicable?)
* bracing style (I used Allman instead of K&R)
* Tabs
Please note that I took care of these issues in the last patch.
Daniel.
Assignee | ||
Comment 23•17 years ago
|
||
(sorry, I lied :) Bracing style still Allman in last patch)
Assignee | ||
Updated•17 years ago
|
Assignee | ||
Updated•17 years ago
|
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee | ||
Updated•17 years ago
|
Status: REOPENED → ASSIGNED
Assignee | ||
Comment 24•17 years ago
|
||
Hi, this is how my patched FF renders the output of testcase #2.
As you can see, the left angle and right angle brackets are properly rendered with the existing values. So, PLEASE CONFIRM that I should change them, since once I do it, they will no longer be properly rendered.
(unless I should change them in the rendering engine as well).
Thanks!
Daniel.
Assignee | ||
Updated•17 years ago
|
OS: Windows XP → All
Comment 25•17 years ago
|
||
Comment on attachment 317882 [details] [diff] [review]
317672: Fix for handling entities not terminated with the ; (new patch for favoring longer entities)
Hi Daniel,
I'm sorry it's taken me so long to get to this review. You caught me just as I was moving and about to take a vacation.
Overall, I think I'd prefer to do this work in CEntityToken::ConsumeEntity. That way, view-source would benefit, and we'd be able to reduce the amount of tokenization-type work that we do in the DTD.
You mentioned performance as a concern; one possible way of alleviating this problem would be to use js/src/jskwgen.c, in a slightly more generalized form to consume non-numeric entities.
For now, it seems like it wouldn't be too difficult to use your current approach to get this into the tree and we can optimize later. I'm not going to review this patch in depth, but please keep in mind the mostly-K&R bracing style and lack of hard tabs in parser/htmlparser.
Thanks again, and sorry for the delay. I'll be a lot more responsive from now on.
Attachment #317882 -
Flags: review?(mrbkap) → review-
Assignee | ||
Comment 26•17 years ago
|
||
Hi Blake,
thanks for answering. Now I will take a little while since I just changed job, and I'm fully on that. I guess I will implement the changes in about 2 weeks. I promise to care about styling (including bracing and spaces).
Thanks!
Daniel.
Updated•16 years ago
|
Attachment #318079 -
Flags: review?(mrbkap)
Comment 27•16 years ago
|
||
I believe this is fixed by bug 487949.
Updated•15 years ago
|
Status: ASSIGNED → RESOLVED
Closed: 17 years ago → 15 years ago
Resolution: --- → FIXED
Whiteboard: [fixed by the HTML5 parser]
Updated•15 years ago
|
Target Milestone: --- → mozilla1.9.3a5
You need to log in
before you can comment on or make changes to this bug.
Description
•