Closed
Bug 670217
Opened 14 years ago
Closed 14 years ago
No output of Unicode characters with codes over U+FFFD—U+FFFF
Categories
(support.mozilla.org :: General, defect)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: Aleksej, Unassigned)
Details
(Keywords: dataloss, intl, Whiteboard: [testday-20110708])
I only tried this with Questions and Private messages, but I suppose that it can be reproduced with everything, because it also affects or affected AMO (bug 592758) and Bugzilla (example in bug 592755).
A sample message beginning with such characters: attachment 471172
When messages are output, characters with code points above U+FFFD-U+FFFF, and anything entered after them, are not visible in the comment. As a result, if the comment begins with such a character (> U+FFFF), the comment looks empty and is treated as empty by the program when editing.
Comment 1•14 years ago
I'm not sure what those characters are supposed to be, as Firefox on Windows 7 (en-US) doesn't render them, but I think this may actually be because we're using "narrow" Python, that is Python with unicode objects internally represented in UCS-2, and not UCS-4. To switch to UCS-4, we'd need to recompile Python.
Jeremy, Jeff, does that sound right, or am I barking up the wrong tree? Some searching seemed to confirm this hunch.
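One way to check the "narrow" build hunch from comment 1 is to look at `sys.maxunicode` on the server's interpreter. This is only a minimal sketch; the helper name `is_narrow_build` is made up for illustration and is not part of the site's code. On Python 3.3 and later the check always returns False, because the narrow/wide build distinction no longer exists there.

```python
import sys


def is_narrow_build():
    """Return True when the interpreter stores unicode as 16-bit code units.

    Narrow builds report sys.maxunicode == 0xFFFF; wide (UCS-4) builds
    and Python 3.3+ report 0x10FFFF.
    """
    return sys.maxunicode == 0xFFFF


if __name__ == "__main__":
    kind = "narrow (UCS-2)" if is_narrow_build() else "wide (UCS-4)"
    print("This Python build is " + kind)
```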
Comment 2•14 years ago
(In reply to comment #1)
> I'm not sure what those characters are supposed to be, as Firefox on Windows
> 7 (en-US) doesn't render them, but I think this may actually be because
> we're using "narrow" Python, that is Python with unicode objects internally
> represented in UCS-2, and not UCS-4. To switch to UCS-4, we'd need to
> recompile Python.
I assume Python would still see some bytes (possibly misinterpreted) so I'd start by checking the db.
Comment 3•14 years ago
Are these 4 byte codes? We probably ignore them. mysql has trouble with that. We could store things as binary on the db if we need to preserve those.
Comment 4•14 years ago
(In reply to comment #3)
> Are these 4 byte codes? We probably ignore them. mysql has trouble with
> that. We could store things as binary on the db if we need to preserve
> those.
Anything above U+FFFF is beyond what UCS-2 can naively represent, yeah. MySQL does struggle with such characters. Some searching tells me that narrow Python builds will misinterpret higher-plane characters and break them into two (incorrect) characters. FWIW, Firefox seems to do the same thing, at least in this build/OS.
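The "two characters" mentioned here are presumably the UTF-16 surrogate pair that a 16-bit representation uses for code points above U+FFFF. A small sketch of that split follows; the function name `to_surrogate_pair` is hypothetical and only illustrates the arithmetic, not what the site or Firefox actually runs.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits go into the low surrogate
    return high, low


if __name__ == "__main__":
    high, low = to_surrogate_pair(0x1F600)
    print(hex(high), hex(low))
```

Code that treats each 16-bit unit as a character of its own sees two unpaired surrogates instead of one symbol, which would match the breakage described above.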
Comment 5•14 years ago
I'd say this is a WONTFIX unless there's a huge data loss.
Comment 6•14 years ago (Reporter)
Firefox on Linux doesn’t seem to have issues with such characters, at least when they are stand-alone. I have seen (or suspected) a string assembled by copy-paste in an input field on a website break, which caused a PyQt program using the string to fail to render the text; but I couldn’t reproduce it.
However, if Firefox itself is found to have any issues with them, then support questions and bug reports about that will be problematic.
Comment 7•14 years ago
Given questionable support in our Python builds and lack of it in MySQL < 5.5[1], this is WONTFIX for now.
[1] MySQL UTF-8 only supports 1-3 byte chars. http://mzsanford.wordpress.com/2010/12/28/mysql-and-unicode/
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → WONTFIX
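As a rough illustration of the 1-3 byte limit cited in footnote [1]: characters above U+FFFF are exactly the ones whose UTF-8 encoding takes 4 bytes, so they can be detected before a write. The helper names below are hypothetical, and the sketch assumes the affected columns use MySQL's 3-byte `utf8` character set (MySQL 5.5 introduced `utf8mb4` for 4-byte characters).

```python
def chars_needing_four_bytes(text):
    """Return the characters in text that need 4 bytes in UTF-8."""
    # Code points above U+FFFF are exactly those encoded as 4 UTF-8 bytes.
    return [ch for ch in text if ord(ch) > 0xFFFF]


def fits_three_byte_utf8(text):
    """True if every character in text fits in at most 3 UTF-8 bytes."""
    return not chars_needing_four_bytes(text)


if __name__ == "__main__":
    sample = "smile \U0001F600 and euro \u20ac"
    print(fits_three_byte_utf8(sample))
    print([hex(ord(ch)) for ch in chars_needing_four_bytes(sample)])
```

A check like this could be used to warn the user instead of silently dropping the rest of the message, though that would be a workaround and not support for the characters.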
Comment 8•11 years ago
Any rough estimate as to when this might be fixed? It would be nice to be able to add emoticons and other symbols to messages. Tangentially related: bug 769974.
http://www.fileformat.info/info/unicode/block/emoticons/list.htm
http://www.fileformat.info/info/unicode/block/miscellaneous_symbols_and_pictographs/list.htm