Closed Bug 754824 Opened 9 years ago Closed 8 years ago
The highlight is off by a few characters in the search result view when some characters are UTF8 encoded on 4 bytes
I noticed this with the <U+1F493> character (http://www.fileformat.info/info/unicode/char/1f493/index.htm). See the attached screenshot. The cause of the problem is that the current code assumes that a UTF-8 character requiring more than 2 bytes is encoded on 3 bytes. However, some UTF-8 characters are encoded on 4 bytes. The JS code sees them as 2 separate characters, so when I wrote the flawed code I assumed this case would just work. It doesn't: the current code returns 3 bytes for each half of the 4-byte character, so the character ends up counted as 6 bytes.
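To make the miscount concrete, here is a minimal sketch of the behavior described above. It is a reconstruction from the description, not the actual Thunderbird code; the function name `naiveUtf8Length` and the exact thresholds are assumptions.

```javascript
// Hypothetical reconstruction of the flawed logic: every UTF-16 code unit
// that needs more than 2 bytes in UTF-8 is assumed to need exactly 3.
function naiveUtf8Length(str) {
  let bytes = 0;
  for (let i = 0; i < str.length; i++) {
    const c = str.charCodeAt(i); // UTF-16 code unit, 0..65535
    if (c < 0x80) {
      bytes += 1;
    } else if (c < 0x800) {
      bytes += 2;
    } else {
      bytes += 3; // wrong for the two halves of a surrogate pair
    }
  }
  return bytes;
}

const heart = String.fromCodePoint(0x1f493); // U+1F493, two code units in JS
const naiveCount = naiveUtf8Length(heart);   // each half is counted as 3 bytes
console.log(naiveCount);
```

For U+1F493 the sketch reports 6, while the UTF-8 encoding is 4 bytes long, so any offset computed after this character is 2 too large.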
Assignee: nobody → florian
Attachment #623644 - Flags: review?(bugmail)
Attachment #623644 - Flags: review?(bugmail) → review+
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → Thunderbird 15.0
Comment on attachment 623644
Patch

[Approval Request Comment]
This patch fixes a bug in a feature landed for Thunderbird 12 (bug 518597) so I think we want to take the fix to aurora, and maybe even to beta.
Attachment #623644 - Flags: approval-comm-aurora?
Attachment #623644 - Flags: approval-comm-aurora? → approval-comm-aurora+
> c >= 32768
c >= 65536
(In reply to Masatoshi Kimura [:emk] from comment #5)
> > c >= 32768
> c >= 65536

Why? The character with which I noticed the bug was <U+1F493>, and it's coded in UTF8 on 4 bytes, and in UTF16 on 2 characters: 55357 and 56467.
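The two values quoted above can be checked directly in JavaScript. This is only an illustration; the variable names are arbitrary.

```javascript
// U+1F493 lies outside the Basic Multilingual Plane, so a JS string
// stores it as two UTF-16 code units (a surrogate pair).
const s = String.fromCodePoint(0x1f493);

const units = [];
for (let i = 0; i < s.length; i++) {
  units.push(s.charCodeAt(i)); // charCodeAt returns one code unit at a time
}

console.log(s.length, units);
```

The string has length 2, and the code units are 55357 (0xD83D) and 56467 (0xDC93), both of which fall in the range U+D800 to U+DFFF.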
(In reply to Florian Quèze from comment #6)
> (In reply to Masatoshi Kimura [:emk] from comment #5)
> > > c >= 32768
> > c >= 65536
>
> Why? The character with which I noticed the bug was <U+1F493>, and it's
> coded in UTF8 on 4 bytes, and in UTF16 on 2 characters: 55357 and 56467.

Ah, I forgot JS encodes characters in UTF-16. Then it should be (c >= 55296 || c <= 57343).
I'm still a bit confused by your comment. To clarify, could you give an example of a character that isn't correctly handled by the current code?
(In reply to Masatoshi Kimura [:emk] from comment #7)
> it should be (c >= 55296 || c <= 57343).

Right. Reopening to fix this.
This time I actually looked for documentation about Unicode instead of trying/guessing.

http://en.wikipedia.org/wiki/UTF-16 says:
"Code points U+D800 to U+DFFF
The Unicode standard permanently reserves these code point values for UTF-16 encoding of the lead and trail surrogates, and they will never be assigned a character"

http://en.wikipedia.org/wiki/UTF-8 says:
"code points below U+0080 (which UTF-8 encodes in one byte)"
"for text using only code points below U+0800 [...] each code point's UTF-8 encoding is one or two bytes"
"Characters U+0800 through U+FFFF use three bytes in UTF-8, but only two in UTF-16."
"Invalid code points
According to the UTF-8 definition (RFC 3629) the high and low surrogate halves used by UTF-16 (U+D800 through U+DFFF) are not legal Unicode values, and the UTF-8 encoding of them is an invalid byte sequence"
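Based on the ranges quoted above, one way to count UTF-8 bytes from UTF-16 code units is sketched below. This is an illustration under assumptions, not the landed patch: the function name `utf8Length` is hypothetical, the input is assumed to contain only well-formed surrogate pairs, and counting 2 bytes per surrogate half is one possible choice that makes a pair add up to 4. Note that the range test has to combine both bounds with `&&`; with `||`, as the condition is written in the comments above, it holds for every value of c.

```javascript
// Hypothetical helper: UTF-8 byte length of a JS string, computed one
// UTF-16 code unit at a time. Assumes surrogates only occur in valid pairs.
function utf8Length(str) {
  let bytes = 0;
  for (let i = 0; i < str.length; i++) {
    const c = str.charCodeAt(i);
    if (c < 0x80) {
      bytes += 1; // below U+0080: one byte
    } else if (c < 0x800) {
      bytes += 2; // below U+0800: two bytes
    } else if (c >= 0xd800 && c <= 0xdfff) {
      bytes += 2; // surrogate half: a pair totals the 4 bytes of one character
    } else {
      bytes += 3; // rest of U+0800..U+FFFF: three bytes
    }
  }
  return bytes;
}

const sample = "a\u00e9\u20ac" + String.fromCodePoint(0x1f493);
const sampleLength = utf8Length(sample); // 1 + 2 + 3 + 4
console.log(sampleLength);
```

With this check, U+1F493 is counted as 4 bytes instead of 6, and characters in the rest of the U+0800 to U+FFFF range still count as 3.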
Attachment #637159 - Flags: review?(bugmail) → review+
Comment on attachment 637159
Follow-up to only match UTF-16 surrogate halves

[Approval Request Comment]
My previous broken patch landed in aurora which was Tb14 at the time, so I would like to take the follow-up to aurora and beta.
Landed attachment 637159 as https://hg.mozilla.org/comm-central/rev/c2e2bef7c4ac
Status: REOPENED → RESOLVED
Closed: 9 years ago → 8 years ago
Resolution: --- → FIXED
Target Milestone: Thunderbird 15.0 → Thunderbird 16.0
Setting status 14 to affected until we land the second patch on beta.