Bug 672472 - hyphenator does not handle non-ASCII characters correctly
Status: RESOLVED FIXED
Product: Core
Classification: Components
Component: Internationalization
Version: Trunk
Platform: All All
Importance: -- normal
Target Milestone: mozilla8
Assigned To: Jonathan Kew (:jfkthame)
: Makoto Kato [:m_kato]
Mentors:
Depends on:
Blocks: 672320
Reported: 2011-07-19 04:03 PDT by Jonathan Kew (:jfkthame)
Modified: 2011-07-20 12:56 PDT


Attachments
convert hyphenation-point offsets correctly to utf16 offsets (3.93 KB, patch)
2011-07-19 04:03 PDT, Jonathan Kew (:jfkthame)
smontagu: review+

Description Jonathan Kew (:jfkthame) 2011-07-19 04:03:35 PDT
Created attachment 546742
convert hyphenation-point offsets correctly to utf16 offsets

Because I misinterpreted the libhyphen API in bug 253317, hyphenation positions are not returned correctly when non-ASCII characters are present. (This doesn't affect the en-US patterns, but showed up once I started testing with more languages for bug 672320.)

The issue is that although the hnj_hyphen_hyphenate2() function takes the text as an 8-bit string, with a length in bytes, the hyphens array that it returns (when using UTF-8 dictionaries) is indexed by Unicode character count, not (as I assumed) by UTF-8 code unit positions in the input string.

This means that our conversion of hyphenation-point offsets to the UTF-16 text representation is incorrect whenever the word contains a character that takes more than one byte in UTF-8.
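
To illustrate the offset mapping described above, here is a minimal sketch; it is not the attached patch. The helper name MapHyphensToUtf16 and its signature are hypothetical. It assumes the hyphens array holds one entry per Unicode character and that an odd value marks a permitted break after that character (the libhyphen convention as I understand it). It walks the UTF-16 text, counts a surrogate pair as a single character, and records each break at the last code unit of its character.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper for illustration only (not the code in the patch).
// utf16:   the word as UTF-16 code units.
// hyphens: one entry per Unicode character; an odd value is assumed to mean
//          "a hyphen may be inserted after this character".
// Returns one flag per UTF-16 code unit; true means a break is allowed
// after that code unit.
std::vector<bool> MapHyphensToUtf16(const std::u16string& utf16,
                                    const std::vector<char>& hyphens)
{
    std::vector<bool> breaks(utf16.size(), false);
    std::size_t charIndex = 0;  // index into the character-based hyphens array

    for (std::size_t i = 0; i < utf16.size(); ++charIndex) {
        // A high surrogate followed by a low surrogate forms one character
        // that occupies two UTF-16 code units.
        bool isPair = i + 1 < utf16.size() &&
                      utf16[i] >= 0xD800 && utf16[i] <= 0xDBFF &&
                      utf16[i + 1] >= 0xDC00 && utf16[i + 1] <= 0xDFFF;
        std::size_t last = isPair ? i + 1 : i;  // last code unit of this character

        if (charIndex < hyphens.size() && (hyphens[charIndex] & 1)) {
            breaks[last] = true;
        }
        i = last + 1;  // advance to the first code unit of the next character
    }
    return breaks;
}
```

For a word such as "été" with a break after the first character, the flag lands on UTF-16 index 0, whereas treating the hyphens array as byte-indexed would have pointed into the middle of the two-byte UTF-8 sequence for "é".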
Comment 1 Jonathan Kew (:jfkthame) 2011-07-20 03:26:42 PDT
Pushed to mozilla-inbound:
http://hg.mozilla.org/integration/mozilla-inbound/rev/faa1737443dd
