Created attachment 623643 [details] Screenshot of the bug

I noticed this with the <U+1F493> character (http://www.fileformat.info/info/unicode/char/1f493/index.htm); see the attached screenshot. The cause of the problem is that the current code assumes that a UTF-8 character requiring more than 2 bytes is encoded on 3 bytes. However, some UTF-8 characters are encoded on 4 bytes. JS strings are sequences of UTF-16 code units, so such a character is seen by the JS code as 2 separate characters (a surrogate pair). When I wrote the flawed code I assumed this case would just work, but it doesn't: the current code returns 3 bytes for each half of the 4-byte character, so the character ends up counted as 6 bytes instead of 4.
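For illustration, a minimal sketch of the corrected counting (the function name is hypothetical, not the actual patch): each UTF-16 surrogate half contributes 2 of the full character's 4 UTF-8 bytes, instead of the 3 bytes the flawed code returned.

```javascript
// Sketch: UTF-8 byte length of a JS string, iterating UTF-16 code units.
// A character above U+FFFF appears as two surrogate halves; counting each
// half as 2 bytes yields the correct 4-byte total for the pair.
function utf8ByteLength(str) {
  let bytes = 0;
  for (let i = 0; i < str.length; ++i) {
    const c = str.charCodeAt(i);
    if (c < 0x80)
      bytes += 1; // U+0000..U+007F: 1 byte
    else if (c < 0x800)
      bytes += 2; // U+0080..U+07FF: 2 bytes
    else if (c >= 0xD800 && c <= 0xDFFF)
      bytes += 2; // surrogate half: 2 of the pair's 4 bytes
    else
      bytes += 3; // U+0800..U+FFFF (excluding surrogates): 3 bytes
  }
  return bytes;
}
```

With this, `utf8ByteLength("\uD83D\uDC93")` (U+1F493) returns 4 rather than 6.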
Created attachment 623644 [details] [diff] [review] Patch
Comment on attachment 623644 [details] [diff] [review] Patch

[Approval Request Comment] This patch fixes a bug in a feature that landed in Thunderbird 12 (bug 518597), so I think we want to take the fix to aurora, and maybe even to beta.
> c >= 32768
c >= 65536
(In reply to Masatoshi Kimura [:emk] from comment #5)
> > c >= 32768
> c >= 65536

Why? The character with which I noticed the bug was <U+1F493>; it's encoded in UTF-8 on 4 bytes, and in UTF-16 as 2 code units: 55357 and 56467.
(In reply to Florian Quèze from comment #6)
> (In reply to Masatoshi Kimura [:emk] from comment #5)
> > > c >= 32768
> > c >= 65536
>
> Why? The character with which I noticed the bug was <U+1F493>; it's
> encoded in UTF-8 on 4 bytes, and in UTF-16 as 2 code units: 55357 and 56467.

Ah, I forgot JS encodes characters in UTF-16. Then it should be (c >= 55296 && c <= 57343).
I'm still a bit confused by your comment. To clarify, could you give an example of a character that isn't correctly handled by the current code?
(In reply to Masatoshi Kimura [:emk] from comment #7)
> it should be (c >= 55296 && c <= 57343).

Right. Reopening to fix this.
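As a quick sanity check of the values mentioned in this thread (illustration only, not part of the patch): both UTF-16 code units of U+1F493 fall inside the surrogate range 55296..57343 (0xD800..0xDFFF).

```javascript
// The two UTF-16 code units of U+1F493 are 55357 (0xD83D) and
// 56467 (0xDC93); both are surrogate halves.
const heart = "\uD83D\uDC93"; // U+1F493
const units = [heart.charCodeAt(0), heart.charCodeAt(1)];
console.log(units); // [55357, 56467]
for (const c of units) {
  if (!(c >= 55296 && c <= 57343))
    throw new Error(c + " is not a surrogate half");
}
```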
Created attachment 637159 [details] [diff] [review] Follow-up to only match UTF-16 surrogate halves

This time I actually looked for documentation about Unicode instead of trying/guessing.

http://en.wikipedia.org/wiki/UTF-16 says:
"Code points U+D800 to U+DFFF: The Unicode standard permanently reserves these code point values for UTF-16 encoding of the lead and trail surrogates, and they will never be assigned a character"

http://en.wikipedia.org/wiki/UTF-8 says:
"code points below U+0080 (which UTF-8 encodes in one byte)"
"for text using only code points below U+0800 [...] each code point's UTF-8 encoding is one or two bytes"
"Characters U+0800 through U+FFFF use three bytes in UTF-8, but only two in UTF-16."
"Invalid code points: According to the UTF-8 definition (RFC 3629) the high and low surrogate halves used by UTF-16 (U+D800 through U+DFFF) are not legal Unicode values, and the UTF-8 encoding of them is an invalid byte sequence"
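The quoted ranges can be summarized in a small sketch (hypothetical helper, for reference only), mapping a Unicode code point to its UTF-8 byte count and rejecting the surrogate range that RFC 3629 declares invalid:

```javascript
// UTF-8 byte count per Unicode *code point* (not UTF-16 code unit),
// following the ranges quoted above.
function utf8BytesForCodePoint(cp) {
  if (cp < 0x80) return 1;   // below U+0080: 1 byte
  if (cp < 0x800) return 2;  // below U+0800: 2 bytes
  if (cp >= 0xD800 && cp <= 0xDFFF)
    throw new RangeError("surrogate code points have no valid UTF-8 encoding");
  if (cp < 0x10000) return 3; // U+0800..U+FFFF: 3 bytes
  return 4;                   // U+10000..U+10FFFF: 4 bytes
}
```

Note the distinction from the patch itself: the patch works on UTF-16 code units as exposed by JS strings, where each surrogate half of a 4-byte character is counted as 2 bytes.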
Comment on attachment 637159 [details] [diff] [review] Follow-up to only match UTF-16 surrogate halves

[Approval Request Comment] My previous (broken) patch landed on aurora, which was Tb14 at the time, so I would like to take the follow-up to aurora and beta.
Landed attachment 637159 [details] [diff] [review] as https://hg.mozilla.org/comm-central/rev/c2e2bef7c4ac
Setting status 14 to affected until we land the second patch on beta.
Checked in:
https://hg.mozilla.org/releases/comm-aurora/rev/c90cfc006a30
https://hg.mozilla.org/releases/comm-beta/rev/949cb94b5667