Closed Bug 670217 Opened 14 years ago Closed 14 years ago

No output of Unicode characters with codes over U+FFFD—U+FFFF

Categories

(support.mozilla.org :: General, defect)

x86_64
Linux
defect
Not set
minor

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: Aleksej, Unassigned)

Details

(Keywords: dataloss, intl, Whiteboard: [testday-20110708])

I only tried this with Questions and Private messages, but I suppose that it can be reproduced everywhere, because it also affects or affected AMO (bug 592758) and Bugzilla (example in bug 592755). A piece of a message beginning with such characters: attachment 471172. When messages are output, the characters with codes over U+FFFD-U+FFFF and anything entered after them are not visible in the comment. That way, if the comment begins with such a > FFFF character, the comment will look empty, and it is considered empty by the program when editing.
I'm not sure what those characters are supposed to be, as Firefox on Windows 7 (en-US) doesn't render them, but I think this may actually be because we're using "narrow" Python, that is Python with unicode objects internally represented in UCS-2, and not UCS-4. To switch to UCS-4, we'd need to recompile Python. Jeremy, Jeff, does that sound right, or am I barking up the wrong tree? Some searching seemed to confirm this hunch.
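A quick way to check the "narrow Python" hunch from the comment above is to look at `sys.maxunicode`. This is only a sketch; the helper name `is_narrow_build` is made up for illustration. Note that since Python 3.3 there is no narrow/wide distinction any more, so this check only returns True on older interpreters.

```python
import sys


def is_narrow_build():
    # Hypothetical helper: on narrow (UCS-2) builds sys.maxunicode is 0xFFFF,
    # on wide (UCS-4) builds and on Python 3.3+ it is 0x10FFFF.
    return sys.maxunicode == 0xFFFF


print("narrow" if is_narrow_build() else "wide", hex(sys.maxunicode))
```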
(In reply to comment #1)
> I'm not sure what those characters are supposed to be, as Firefox on Windows
> 7 (en-US) doesn't render them, but I think this may actually be because
> we're using "narrow" Python, that is Python with unicode objects internally
> represented in UCS-2, and not UCS-4. To switch to UCS-4, we'd need to
> recompile Python.
I assume Python would still see some bytes (possibly misinterpreted) so I'd start by checking the db.
Are these 4 byte codes? We probably ignore them. mysql has trouble with that. We could store things as binary on the db if we need to preserve those.
(In reply to comment #3)
> Are these 4 byte codes? We probably ignore them. mysql has trouble with
> that. We could store things as binary on the db if we need to preserve
> those.
Anything > U+FFFF is beyond what UCS-2 can naively represent, yeah. MySQL does struggle with them. Some searching tells me that narrow Python builds will misinterpret higher-plane characters and break them into two (incorrect) characters. FWIW, Firefox seems to do the same thing, at least in this build/OS.
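The "two characters" mentioned above are most likely the UTF-16 surrogate pair that a 16-bit representation uses for code points above U+FFFF. A minimal sketch of that split follows; the function name `to_surrogate_pair` is hypothetical, and U+1F600 is just a sample code point, not one taken from the attachment.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 high/low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need a pair")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)
print("U+1F600 -> %04X %04X" % (high, low))
```

Code that treats each 16-bit unit as a full character sees two values here instead of one, which would match the behaviour described in this comment.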
I'd say this is a WONTFIX unless there's a huge data loss.
Firefox on Linux doesn’t seem to have issues with such characters, at least when they are stand-alone. I have seen/suspected a string put together by copy-paste in an input field on a website break, which caused a PyQt program using the string to fail to render the text; but I couldn’t reproduce it. However, if Firefox itself is found to have any issues with them, then support questions and bug reports about that will be problematic.
Given questionable support in our Python builds and lack of it in MySQL < 5.5[1], this is WONTFIX for now. [1] MySQL UTF-8 only supports 1-3 byte chars. http://mzsanford.wordpress.com/2010/12/28/mysql-and-unicode/
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → WONTFIX
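To illustrate the 1-3 byte limit cited in the WONTFIX comment: every code point above U+FFFF takes 4 bytes in UTF-8, so it does not fit a 3-byte-per-character column (MySQL 5.5 added the utf8mb4 character set for this case). The sketch below assumes Python 3.3 or later, where each element of a str is one full code point; the helper names `beyond_bmp` and `utf8_length` are invented for the example and are not part of the site's code.

```python
def beyond_bmp(text):
    """Return the characters in text whose code point is above U+FFFF."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


def utf8_length(ch):
    """Number of bytes a single character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


sample = "ok \U0001F600 after"  # sample text with one emoticon code point
for ch in beyond_bmp(sample):
    print("U+%04X needs %d UTF-8 bytes" % (ord(ch), utf8_length(ch)))
```

A check like this could be used to find affected messages before they are stored, though whether the truncation actually happens in the database layer is not confirmed in this bug.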
Any rough estimate as to when this might be fixed? It would be nice to be able to add emoticons and other symbols to messages. Tangentially related: bug 769974. http://www.fileformat.info/info/unicode/block/emoticons/list.htm http://www.fileformat.info/info/unicode/block/miscellaneous_symbols_and_pictographs/list.htm