Last Comment Bug 754824 - The highlight is off by a few characters in the search result view when some characters are UTF8 encoded on 4 bytes
: The highlight is off by a few characters in the search result view when some ...
Status: RESOLVED FIXED
:
Product: Thunderbird
Classification: Client Software
Component: Search (show other bugs)
: Trunk
: All All
: -- normal (vote)
: Thunderbird 16.0
Assigned To: Florian Quèze [:florian] [:flo]
:
Mentors:
Depends on:
Blocks: 518597
  Show dependency treegraph
 
Reported: 2012-05-14 05:58 PDT by Florian Quèze [:florian] [:flo]
Modified: 2012-07-03 13:01 PDT (History)
3 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
wontfix
+
fixed
+
fixed


Attachments
Screenshot of the bug (35.23 KB, image/png)
2012-05-14 05:58 PDT, Florian Quèze [:florian] [:flo]
no flags Details
Patch (2.28 KB, patch)
2012-05-14 06:05 PDT, Florian Quèze [:florian] [:flo]
bugmail: review+
standard8: approval‑comm‑aurora+
Details | Diff | Review
Follow-up to only match UTF-16 surrogate halves (3.18 KB, patch)
2012-06-27 09:29 PDT, Florian Quèze [:florian] [:flo]
bugmail: review+
standard8: approval‑comm‑aurora+
standard8: approval‑comm‑beta+
Details | Diff | Review

Description Florian Quèze [:florian] [:flo] 2012-05-14 05:58:46 PDT
Created attachment 623643 [details]
Screenshot of the bug

I noticed this with the <U+1F493> character (http://www.fileformat.info/info/unicode/char/1f493/index.htm). See the attached screenshot.

The cause of the problem is that the current code assumes that an UTF8 character that requires more than 2 bytes is coded on 3 bytes. However, some UTF8 characters are coded on 4 bytes; they are seen by the JS code as 2 separate characters so I assumed at the time I wrote the flawed code that this case would just work, but actually it doesn't because the current code returns 3 bytes for each half of the 4 bytes character, and the character ends up counted as 6 bytes.
Comment 1 Florian Quèze [:florian] [:flo] 2012-05-14 06:05:21 PDT
Created attachment 623644 [details] [diff] [review]
Patch
Comment 2 Florian Quèze [:florian] [:flo] 2012-05-15 03:24:15 PDT
http://hg.mozilla.org/comm-central/rev/ec7a7bf505fd
Comment 3 Florian Quèze [:florian] [:flo] 2012-05-15 03:26:08 PDT
Comment on attachment 623644 [details] [diff] [review]
Patch

[Approval Request Comment]
This patch fixes a bug in a feature landed for Thunderbird 12 (bug 518597) so I think we want to take the fix to aurora, and maybe even to beta.
Comment 4 Florian Quèze [:florian] [:flo] 2012-06-04 02:15:51 PDT
https://hg.mozilla.org/releases/comm-aurora/rev/b705a1702ac8
Comment 5 Masatoshi Kimura [:emk] 2012-06-04 02:23:07 PDT
> c >= 32768
c >= 65536
Comment 6 Florian Quèze [:florian] [:flo] 2012-06-04 02:50:13 PDT
(In reply to Masatoshi Kimura [:emk] from comment #5)
> > c >= 32768
> c >= 65536

Why? The character with which I noticed the bug was <U+1F493>, and it's coded in UTF8 on 4 bytes, and in UTF16 on 2 characters: 55357 and 56467.
Comment 7 Masatoshi Kimura [:emk] 2012-06-04 04:57:35 PDT
(In reply to Florian Quèze from comment #6)
> (In reply to Masatoshi Kimura [:emk] from comment #5)
> > > c >= 32768
> > c >= 65536
> 
> Why? The character with which I noticed the bug was <U+1F493>, and it's
> coded in UTF8 on 4 bytes, and in UTF16 on 2 characters: 55357 and 56467.

Ah, I forgot JS encodes characters in UTF-8. Then it should be (c >= 55296 || c <= 57343).
Comment 8 Florian Quèze [:florian] [:flo] 2012-06-04 05:29:48 PDT
I'm still a bit confused by your comment. To clarify, could you give an example of a character that isn't correctly handled by the current code?
Comment 9 Florian Quèze [:florian] [:flo] 2012-06-27 09:18:29 PDT
(In reply to Masatoshi Kimura [:emk] from comment #7)
> it should be (c >= 55296 || c <= 57343).

Right. Reopening to fix this.
Comment 10 Florian Quèze [:florian] [:flo] 2012-06-27 09:29:12 PDT
Created attachment 637159 [details] [diff] [review]
Follow-up to only match UTF-16 surrogate halves

This time I actually looked for documentation about unicode instead of trying/guessing.

http://en.wikipedia.org/wiki/UTF-16 says:
"Code points U+D800 to U+DFFF

The Unicode standard permanently reserves these code point values for UTF-16 encoding of the lead and trail surrogates, and they will never be assigned a character"


http://en.wikipedia.org/wiki/UTF-8 says:
"code points below U+0080 (which UTF-8 encodes in one byte)"

"for text using only code points below U+0800 [...] each code point's UTF-8 encoding is one or two bytes"

"Characters U+0800 through U+FFFF use three bytes in UTF-8, but only two in UTF-16."

"Invalid code points

According to the UTF-8 definition (RFC 3629) the high and low surrogate halves used by UTF-16 (U+D800 through U+DFFF) are not legal Unicode values, and the UTF-8 encoding of them is an invalid byte sequence"
Comment 11 Florian Quèze [:florian] [:flo] 2012-06-27 09:49:20 PDT
Comment on attachment 637159 [details] [diff] [review]
Follow-up to only match UTF-16 surrogate halves

[Approval Request Comment]
My previous broken patch landed in aurora which was Tb14 at the time, so I would like to take the follow-up to aurora and beta.
Comment 12 Florian Quèze [:florian] [:flo] 2012-06-28 09:06:40 PDT
Landed attachment 637159 [details] [diff] [review] as https://hg.mozilla.org/comm-central/rev/c2e2bef7c4ac
Comment 13 Mark Banner (:standard8) 2012-07-03 04:31:42 PDT
Setting status 14 to affected until we land the second patch on beta.

Note You need to log in before you can comment on or make changes to this bug.