Last Comment Bug 754824 - The highlight is off by a few characters in the search result view when some characters are UTF8 encoded on 4 bytes
: The highlight is off by a few characters in the search result view when some ...
Status: RESOLVED FIXED
:
Product: Thunderbird
Classification: Client Software
Component: Search (show other bugs)
: Trunk
: All All
: -- normal (vote)
: Thunderbird 16.0
Assigned To: Florian Quèze [:florian] [:flo]
:
Mentors:
Depends on:
Blocks: 518597
  Show dependency treegraph
 
Reported: 2012-05-14 05:58 PDT by Florian Quèze [:florian] [:flo]
Modified: 2012-07-03 13:01 PDT (History)
3 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
wontfix
+
fixed
+
fixed


Attachments
Screenshot of the bug (35.23 KB, image/png)
2012-05-14 05:58 PDT, Florian Quèze [:florian] [:flo]
no flags Details
Patch (2.28 KB, patch)
2012-05-14 06:05 PDT, Florian Quèze [:florian] [:flo]
bugmail: review+
standard8: approval‑comm‑aurora+
Details | Diff | Splinter Review
Follow-up to only match UTF-16 surrogate halves (3.18 KB, patch)
2012-06-27 09:29 PDT, Florian Quèze [:florian] [:flo]
bugmail: review+
standard8: approval‑comm‑aurora+
standard8: approval‑comm‑beta+
Details | Diff | Splinter Review

Description Florian Quèze [:florian] [:flo] 2012-05-14 05:58:46 PDT
Created attachment 623643 [details]
Screenshot of the bug

I noticed this with the <U+1F493> character (http://www.fileformat.info/info/unicode/char/1f493/index.htm). See the attached screenshot.

The cause of the problem is that the current code assumes that an UTF8 character that requires more than 2 bytes is coded on 3 bytes. However, some UTF8 characters are coded on 4 bytes; they are seen by the JS code as 2 separate characters so I assumed at the time I wrote the flawed code that this case would just work, but actually it doesn't because the current code returns 3 bytes for each half of the 4 bytes character, and the character ends up counted as 6 bytes.
Comment 1 Florian Quèze [:florian] [:flo] 2012-05-14 06:05:21 PDT
Created attachment 623644 [details] [diff] [review]
Patch
Comment 2 Florian Quèze [:florian] [:flo] 2012-05-15 03:24:15 PDT
http://hg.mozilla.org/comm-central/rev/ec7a7bf505fd
Comment 3 Florian Quèze [:florian] [:flo] 2012-05-15 03:26:08 PDT
Comment on attachment 623644 [details] [diff] [review]
Patch

[Approval Request Comment]
This patch fixes a bug in a feature landed for Thunderbird 12 (bug 518597) so I think we want to take the fix to aurora, and maybe even to beta.
Comment 4 Florian Quèze [:florian] [:flo] 2012-06-04 02:15:51 PDT
https://hg.mozilla.org/releases/comm-aurora/rev/b705a1702ac8
Comment 5 Masatoshi Kimura [:emk] 2012-06-04 02:23:07 PDT
> c >= 32768
c >= 65536
Comment 6 Florian Quèze [:florian] [:flo] 2012-06-04 02:50:13 PDT
(In reply to Masatoshi Kimura [:emk] from comment #5)
> > c >= 32768
> c >= 65536

Why? The character with which I noticed the bug was <U+1F493>, and it's coded in UTF8 on 4 bytes, and in UTF16 on 2 characters: 55357 and 56467.
Comment 7 Masatoshi Kimura [:emk] 2012-06-04 04:57:35 PDT
(In reply to Florian Quèze from comment #6)
> (In reply to Masatoshi Kimura [:emk] from comment #5)
> > > c >= 32768
> > c >= 65536
> 
> Why? The character with which I noticed the bug was <U+1F493>, and it's
> coded in UTF8 on 4 bytes, and in UTF16 on 2 characters: 55357 and 56467.

Ah, I forgot JS encodes characters in UTF-8. Then it should be (c >= 55296 || c <= 57343).
Comment 8 Florian Quèze [:florian] [:flo] 2012-06-04 05:29:48 PDT
I'm still a bit confused by your comment. To clarify, could you give an example of a character that isn't correctly handled by the current code?
Comment 9 Florian Quèze [:florian] [:flo] 2012-06-27 09:18:29 PDT
(In reply to Masatoshi Kimura [:emk] from comment #7)
> it should be (c >= 55296 || c <= 57343).

Right. Reopening to fix this.
Comment 10 Florian Quèze [:florian] [:flo] 2012-06-27 09:29:12 PDT
Created attachment 637159 [details] [diff] [review]
Follow-up to only match UTF-16 surrogate halves

This time I actually looked for documentation about unicode instead of trying/guessing.

http://en.wikipedia.org/wiki/UTF-16 says:
"Code points U+D800 to U+DFFF

The Unicode standard permanently reserves these code point values for UTF-16 encoding of the lead and trail surrogates, and they will never be assigned a character"


http://en.wikipedia.org/wiki/UTF-8 says:
"code points below U+0080 (which UTF-8 encodes in one byte)"

"for text using only code points below U+0800 [...] each code point's UTF-8 encoding is one or two bytes"

"Characters U+0800 through U+FFFF use three bytes in UTF-8, but only two in UTF-16."

"Invalid code points

According to the UTF-8 definition (RFC 3629) the high and low surrogate halves used by UTF-16 (U+D800 through U+DFFF) are not legal Unicode values, and the UTF-8 encoding of them is an invalid byte sequence"
Comment 11 Florian Quèze [:florian] [:flo] 2012-06-27 09:49:20 PDT
Comment on attachment 637159 [details] [diff] [review]
Follow-up to only match UTF-16 surrogate halves

[Approval Request Comment]
My previous broken patch landed in aurora which was Tb14 at the time, so I would like to take the follow-up to aurora and beta.
Comment 12 Florian Quèze [:florian] [:flo] 2012-06-28 09:06:40 PDT
Landed attachment 637159 [details] [diff] [review] as https://hg.mozilla.org/comm-central/rev/c2e2bef7c4ac
Comment 13 Mark Banner (:standard8) (afk until 26th July) 2012-07-03 04:31:42 PDT
Setting status 14 to affected until we land the second patch on beta.
Comment 14 Mark Banner (:standard8) (afk until 26th July) 2012-07-03 13:01:43 PDT
Checked in:

https://hg.mozilla.org/releases/comm-aurora/rev/c90cfc006a30
https://hg.mozilla.org/releases/comm-beta/rev/949cb94b5667

Note You need to log in before you can comment on or make changes to this bug.