Last Comment Bug 253317 - Provide hyphenation dictionary for justified text
: Provide hyphenation dictionary for justified text
Status: RESOLVED FIXED
[sg:audit]
: dev-doc-complete, privacy
Product: Core
Classification: Components
Component: Layout: Text (show other bugs)
: Trunk
: All All
: -- enhancement with 8 votes (vote)
: mozilla6
Assigned To: Jonathan Kew (:jfkthame)
:
Mentors:
http://www.w3.org/TR/css3-gcpm/#hyphe...
: 281813 (view as bug list)
Depends on: 655198 685214 770555
Blocks: 67715 656750 655078 672320 677254 677347 1127107
  Show dependency treegraph
 
Reported: 2004-07-27 12:57 PDT by Brian Hunt
Modified: 2015-01-28 15:37 PST (History)
28 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
+
fixed


Attachments
part 1 - implement -moz-hyphens property in CSS, based on CSS3 Text draft (14.31 KB, patch)
2011-03-29 12:56 PDT, Jonathan Kew (:jfkthame)
no flags Details | Diff | Review
part 2 (WIP) - implement nsIHyphenator and nsIHyphenationManager components (89.82 KB, patch)
2011-03-29 13:02 PDT, Jonathan Kew (:jfkthame)
no flags Details | Diff | Review
part 3 - provide the en-US hyphenation patterns (115.53 KB, patch)
2011-03-29 13:03 PDT, Jonathan Kew (:jfkthame)
no flags Details | Diff | Review
part 1 - implement -moz-hyphens property in CSS (13.71 KB, patch)
2011-03-31 11:34 PDT, Jonathan Kew (:jfkthame)
dbaron: review+
Details | Diff | Review
part 2 - import hyphenation code from hyphen-2.7.1 (108.78 KB, patch)
2011-03-31 11:36 PDT, Jonathan Kew (:jfkthame)
smontagu: review+
Details | Diff | Review
part 3 - import en-US hyphenation patterns from hyphen-2.7.1 (116.75 KB, patch)
2011-03-31 11:39 PDT, Jonathan Kew (:jfkthame)
smontagu: review+
Details | Diff | Review
part 4 - implement nsIHyphenationManager and nsIHyphenator components (49.81 KB, patch)
2011-03-31 11:43 PDT, Jonathan Kew (:jfkthame)
no flags Details | Diff | Review
part 5 - support none/manual/auto values for -moz-hyphens in layout (12.82 KB, patch)
2011-03-31 11:45 PDT, Jonathan Kew (:jfkthame)
no flags Details | Diff | Review
part 6 - reftests for automatic hyphenation (4.45 KB, patch)
2011-03-31 11:46 PDT, Jonathan Kew (:jfkthame)
no flags Details | Diff | Review
part 4 - implement nsHyphenationManager and nsHyphenator (34.28 KB, patch)
2011-04-13 06:18 PDT, Jonathan Kew (:jfkthame)
no flags Details | Diff | Review
part 5 - support -moz-hyphens:auto/manual/none in layout (38.11 KB, patch)
2011-04-19 06:42 PDT, Jonathan Kew (:jfkthame)
no flags Details | Diff | Review
part 6 - hyphenation reftests (updated) (8.53 KB, patch)
2011-04-19 06:43 PDT, Jonathan Kew (:jfkthame)
roc: review+
Details | Diff | Review
part 3 - import en-US hyphenation patterns - moved to intl/locale/en-US/hyphenation (116.82 KB, patch)
2011-04-20 08:49 PDT, Jonathan Kew (:jfkthame)
jfkthame: review+
Details | Diff | Review
part 3.1 - build support to install hyphenation patterns in the product (5.03 KB, patch)
2011-04-20 08:52 PDT, Jonathan Kew (:jfkthame)
l10n: review+
Details | Diff | Review
part 4 - implement nsHyphenationManager and nsHyphenator - updated to use nsIAtom instead of strings in the API (29.81 KB, patch)
2011-04-20 08:57 PDT, Jonathan Kew (:jfkthame)
no flags Details | Diff | Review
part 4 - implement nsHyphenationManager and nsHyphenator - updated to use nsIAtom instead of strings in the API (29.82 KB, patch)
2011-04-20 23:16 PDT, Jonathan Kew (:jfkthame)
smontagu: review+
roc: superreview+
Details | Diff | Review
part 5 - support -moz-hyphens:auto/manual/none in layout - updated per comments 32 & 36 (39.18 KB, patch)
2011-04-20 23:18 PDT, Jonathan Kew (:jfkthame)
roc: review+
Details | Diff | Review
wip - package the hyphenation dictionary on android - incomplete (2.01 KB, patch)
2011-05-05 10:37 PDT, Jonathan Kew (:jfkthame)
mark.finkle: feedback-
Details | Diff | Review
Android packaging patch (3.90 KB, patch)
2011-05-06 07:46 PDT, Mark Finkle (:mfinkle) (use needinfo?)
blassey.bugs: review+
Details | Diff | Review

Description Brian Hunt 2004-07-27 12:57:29 PDT
For justified text, provide lookups to a hyphenation dictionary to wrap the text
appropriately. Ie. When text is justified, break large words apart across
multiple lines with hyphenation, to make the justification more precise.

eg. (random example)
This is a security
problem that needs
to     be     kept
confidential.

would become something like:
This is a security
problem that needs
to be  kept  conf-
idential.
Comment 1 Boris Zbarsky [:bz] (Out June 25-July 6) 2004-07-27 13:50:01 PDT
Isn't this a duplicate of bug 67715?  If not, this probably blocks that bug.
Comment 2 Boris Zbarsky [:bz] (Out June 25-July 6) 2005-02-22 14:21:05 PST
*** Bug 281813 has been marked as a duplicate of this bug. ***
Comment 3 Jonathan Kew (:jfkthame) 2011-03-29 12:56:01 PDT
Created attachment 522769 [details] [diff] [review]
part 1 - implement -moz-hyphens property in CSS, based on CSS3 Text draft

As a preliminary step, this implements a "-moz-hyphens" property based on "hyphens" in the CSS3 Text draft, with values none/manual/auto/all. This will make it possible to control the use of hyphenation from CSS (once an actual hyphenation engine is in place).
Comment 4 Jonathan Kew (:jfkthame) 2011-03-29 13:02:01 PDT
Created attachment 522772 [details] [diff] [review]
part 2 (WIP) - implement nsIHyphenator and nsIHyphenationManager components

The hyphenation code here is hyphen-2.5, with slight modifications. Currently, the actual hyphenation is done on 8-bit text, which means we have to convert to utf-8 in order to find hyphen positions; for better performance, we could perhaps adapt the core code to work directly on utf16 strings.
Comment 5 Jonathan Kew (:jfkthame) 2011-03-29 13:03:00 PDT
Created attachment 522773 [details] [diff] [review]
part 3 - provide the en-US hyphenation patterns
Comment 6 Jonathan Kew (:jfkthame) 2011-03-31 11:34:07 PDT
Created attachment 523368 [details] [diff] [review]
part 1 - implement -moz-hyphens property in CSS
Comment 7 Jonathan Kew (:jfkthame) 2011-03-31 11:36:45 PDT
Created attachment 523370 [details] [diff] [review]
part 2 - import hyphenation code from hyphen-2.7.1

This comes directly from the current (2.7.1) source available at http://sourceforge.net/projects/hunspell/files/Hyphen/. The code is MPL/LGPL-licensed, so there should be no licensing issues with using it in Gecko.
Comment 8 Jonathan Kew (:jfkthame) 2011-03-31 11:39:11 PDT
Created attachment 523371 [details] [diff] [review]
part 3 - import en-US hyphenation patterns from hyphen-2.7.1

These are the US-English hyphenation patterns, also used by OpenOffice.org, I believe; we may want to look into sharing OOo files if they're installed, but for the time being it seems simplest to include this in the tree, like we do for the US-English spelling dictionary.
Comment 9 David Baron :dbaron: ⌚️UTC-7 (review requests must explain patch) 2011-03-31 11:42:20 PDT
Comment on attachment 523368 [details] [diff] [review]
part 1 - implement -moz-hyphens property in CSS

>diff --git a/layout/style/nsCSSPropList.h b/layout/style/nsCSSPropList.h
>+CSS_PROP_TEXT(
>+    -moz-hyphens,
>+    _moz_hyphens,

Don't use the "_moz_" in the name of the constant (and fix nsComputedDOMStyle.cpp to match).

>+    MozHyphens,

use CSS_PROP_DOMPROP_PREFIXED(Hyphens) here instead, and change the ValueForMozHyphens() in nsRuleNode.cpp to ValueForHyphens().

Have you tested that NS_STYLE_HINT_REFLOW is sufficient for a CalcDifference hint?

I presume that you've run the layout/style mochitests and they still pass.

r=dbaron with that
Comment 10 Jonathan Kew (:jfkthame) 2011-03-31 11:43:11 PDT
Created attachment 523373 [details] [diff] [review]
part 4 - implement nsIHyphenationManager and nsIHyphenator components

This provides XPCOM components to implement hyphenation for a given locale (based on a file of hyphenation patterns), and to manage the hyphenators, looking them up by language/locale codes.

Currently, the hyphenation patterns for a given locale are never freed (until shutdown) once they've been loaded. If memory usage becomes a concern, we could make the hyphenationManager release them if they haven't been used for a certain amount of time.
Comment 11 Jonathan Kew (:jfkthame) 2011-03-31 11:45:47 PDT
Created attachment 523374 [details] [diff] [review]
part 5 - support none/manual/auto values for -moz-hyphens in layout

Implement automatic hyphenation support in layout. This is an "opt-in" feature; hyphenation will only be done if the CSS calls for "-moz-hyphens: auto", AND if the data is tagged as English.

(If additional patterns are added, then of course additional languages will be supported for hyphenation. We might want to make patterns available as part of language-packs for users to add as desired.)
Comment 12 Jonathan Kew (:jfkthame) 2011-03-31 11:46:26 PDT
Created attachment 523375 [details] [diff] [review]
part 6 - reftests for automatic hyphenation
Comment 13 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2011-03-31 17:24:46 PDT
Comment on attachment 523373 [details] [diff] [review]
part 4 - implement nsIHyphenationManager and nsIHyphenator components

Why do we need to use XPCOM here? Now that libxul is a required part of the build, these can just be regular C++ classes nsHyphenator and nsHyphenationManager.
Comment 14 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2011-03-31 17:27:07 PDT
Also, if insertHyphens is just for testing, I think we should do without it. We can fake it in the test code by laying out at different widths and seeing what boxes are created, using getClientRects.
Comment 15 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2011-03-31 17:33:44 PDT
I see that the hyphenator defines the notion of "a word". What does that mean exactly? We have words for selection/caret movement, and words for CSS capitalization, or is it something else? (We also have line break opportunities which are similar)
Comment 16 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2011-03-31 18:46:34 PDT
Let's name mHyphenatorInited mHyphenatorInitialized.

+      mFrame->GetStyleVisibility()->mLanguage->ToUTF8String(langString);
+      hyphSvc->GetHyphenator(langString.get(), getter_AddRefs(mHyphenator));

Why not make GetHyphenator take an nsIAtom*?

+    nsAutoTArray<PRUint8,200> hyphens;
+    PRBool inhibitHyphenInInitialWord =
+      startOriginalOffset > 0 &&
+      mFrag->CharAt(startOriginalOffset - 1) == CH_SHY;
+
+    if (NS_SUCCEEDED(mHyphenator->Hyphenate(*text, &hyphens,
+                                            inhibitHyphenInInitialWord)))

Seems like 'hyphens' should be an array of PRPackedBool.

Hmm. Actually I think there's a big problem here, which is that we should be looking at text runs that cross frame boundaries. In fact, I think we probably should be computing hyphenation breaks in nsLineBreaker. Probably the SHY breaks should be computed there too.
Comment 17 Carlo Alberto Ferraris 2011-03-31 22:55:12 PDT
(In reply to comment #11)
> We might want to make patterns available as part of
> language-packs for users to add as desired.
Yes, please do.
Comment 18 Jonathan Kew (:jfkthame) 2011-04-01 00:52:31 PDT
(In reply to comment #13)
> Comment on attachment 523373 [details] [diff] [review]
> part 4 - implement nsIHyphenationManager and nsIHyphenator components
> 
> Why do we need to use XPCOM here? Now that libxul is a required part of the
> build, these can just be regular C++ classes nsHyphenator and
> nsHyphenationManager.

(In reply to comment #14)
> Also, if insertHyphens is just for testing, I think we should do without it. We
> can fake it in the test code by laying out at different widths and seeing what
> boxes are created, using getClientRects.

Does XPCOM give people more flexibility to use (and/or replace) the component, e.g. from JS? For example, someone might want to implement alternative hyphenation components, if the libhyphen algorithm isn't ideal for all languages, or use insertHyphens from JS to find most hyphen positions but then massage the results before replacing text in the DOM.

I'd be OK with a simple C++ implementation, but providing XPCOM interfaces here seemed like the "Gecko-ish" way to do things.
Comment 19 Jonathan Kew (:jfkthame) 2011-04-01 01:02:44 PDT
(In reply to comment #15)
> I see that the hyphenator defines the notion of "a word". What does that mean
> exactly? We have words for selection/caret movement, and words for CSS
> capitalization, or is it something else? (We also have line break opportunities
> which are similar)

(There's also the spell-checker, which has a notion of "word" as the unit to be checked.)

For simplicity, I went with a contiguous sequence of characters with the Unicode category "letter" or "combining mark", as that's what the hyphenation patterns are typically designed to handle. It's possible that we could take advantage of existing code for selection (not caret movement, which seems to include trailing punctuation with the "word", and certainly not line-breaking).
Comment 20 Jonathan Kew (:jfkthame) 2011-04-01 01:30:48 PDT
(In reply to comment #16)
> Why not make GetHyphenator take an nsIAtom*?

A string seemed more JS-friendly, but if we decide not to expose this to JS then atoms would be fine, I guess.

> Hmm. Actually I think there's a big problem here, which is that we should be
> looking at text runs that cross frame boundaries. 

That's true, in principle, but I'm not sure how important it is (if it'd be at the cost of significant extra complexity, for example). Cases where a single hyphenatable word crosses frame boundaries are probably not common, and when they do occur (e.g., the author is explicitly styling certain letters of the word, such as highlighting one syllable in bold), it may well be inappropriate to be hyphenating anyway as this is typically a case where the "word" is specifically in focus for some reason, and splitting it across lines then detracts from the intent.

If we can do this readily, I'm fine with it, but if it's going to be tricky, I don't think missing a few potential hyphenation points in such cases is a major issue; it's a limitation we could live with. Authors writing such stuff who _do_ want it hyphenated (which seems like it'll be very rare) could still use &shy; and thereby get explicit control of the possible breaks.

> In fact, I think we probably
> should be computing hyphenation breaks in nsLineBreaker. Probably the SHY
> breaks should be computed there too.

Can nsLineBreaker handle this? Offhand, I don't see anything that suggests it can mark breakpoints as requiring hyphen insertion if they're taken.
Comment 21 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2011-04-01 05:12:01 PDT
(In reply to comment #18)
> Does XPCOM give people more flexibility to use (and/or replace) the component,
> e.g. from JS? For example, someone might want to implement alternative
> hyphenation components, if the libhyphen algorithm isn't ideal for all
> languages, or use insertHyphens from JS to find most hyphen positions but then
> massage the results before replacing text in the DOM.
> 
> I'd be OK with a simple C++ implementation, but providing XPCOM interfaces here
> seemed like the "Gecko-ish" way to do things.

Those are bad old ways.

Who's going to replace the hyphenator with a JS implementation? An extension? Seems unlikely. We shouldn't take complexity and perf hits in our code for an extension point that is unlikely to be used. We should implement the simple thing and then if demand arises for an extension point, provide that as an option.

If Web content wants to discover hyphenation points, we need to create a Web API for that. XPCOM doesn't do it.

We may think it's not a major issue to look only within a frame, in practice, but I think it's fundamentally wrong and will probably bite us later. Fundamentally, wrapping arbitrary chunks of text in unstyled <spans> should not change layout. We go to a lot of effort to make that true, e.g. forming ligatures across element boundaries. I really think we should do this right.

nsLineBreaker can't report hyphen breaks right now. It would need to be extended to do that. But I think the line breaker is the logical place to do this; apart from having the text that we need, hyphenation is fundamentally about exposing line break opportunities.
Comment 22 Jonathan Kew (:jfkthame) 2011-04-01 07:49:14 PDT
(In reply to comment #21)

> Who's going to replace the hyphenator with a JS implementation? An extension?
> Seems unlikely.

It may not be as unlikely as all that. There's already Hyphenator.js (http://code.google.com/p/hyphenator/), for example, showing that people have an interest in the topic. Our hyphenation would basically make that superfluous, but more sophisticated things could be built as refinements on top of our implementation if we expose it to JS - e.g, adding site-specific hyphenation additions/exceptions while still using our hyphenator for the bulk of the work.

Perhaps more likely that replacing the actual hyphenator with a JS implementation, I can imagine people wanting to _use_ the hyphenator from JS and then put the resulting soft-hyphenated text (after post-processing) into the document.

> We may think it's not a major issue to look only within a frame, in practice,
> but I think it's fundamentally wrong and will probably bite us later.
> Fundamentally, wrapping arbitrary chunks of text in unstyled <spans> should not
> change layout. We go to a lot of effort to make that true, e.g. forming
> ligatures across element boundaries. I really think we should do this right.
> 
> nsLineBreaker can't report hyphen breaks right now. It would need to be
> extended to do that. But I think the line breaker is the logical place to do
> this; apart from having the text that we need, hyphenation is fundamentally
> about exposing line break opportunities.

Fair enough; I'll have a look at nsLineBreaker and see what would be involved there.
Comment 23 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2011-04-01 12:50:49 PDT
(In reply to comment #22)
> It may not be as unlikely as all that. There's already Hyphenator.js
> (http://code.google.com/p/hyphenator/), for example, showing that people have
> an interest in the topic. Our hyphenation would basically make that
> superfluous, but more sophisticated things could be built as refinements on top
> of our implementation if we expose it to JS - e.g, adding site-specific
> hyphenation additions/exceptions while still using our hyphenator for the bulk
> of the work.

That could be useful for Web sites themselves, so if there's demand we should just create a proper Web-accessible API for that. And that means not going through an XPCOM service.

> Perhaps more likely that replacing the actual hyphenator with a JS
> implementation, I can imagine people wanting to _use_ the hyphenator from JS
> and then put the resulting soft-hyphenated text (after post-processing) into
> the document.

Ditto.

Basically the right approach is to use something fast and lightweight internally (i.e. a regular C++ class, not even using virtuals if we can help it), then if we want a JS API, we create a scriptable wrapper or extension point which uses XPCOM as our WebIDL to define the interface, but does not use the rest of XPCOM such as services.

> Fair enough; I'll have a look at nsLineBreaker and see what would be involved
> there.

Great, thanks.
Comment 24 Simon Montagu :smontagu 2011-04-12 03:04:33 PDT
Comment on attachment 523371 [details] [diff] [review]
part 3 - import en-US hyphenation patterns from hyphen-2.7.1

rs=me
Comment 25 Jonathan Kew (:jfkthame) 2011-04-13 06:18:00 PDT
Created attachment 525667 [details] [diff] [review]
part 4 - implement nsHyphenationManager and nsHyphenator

Updated to implement hyphenation manager and hyphenator as simple C++ classes, without XPCOM interfaces.

(Also includes the Makefile changes needed to build the new files, and include the hyphenation patterns in the final package.)

Still to come: an update of part 5, to use the revised interface.
Comment 26 Jonathan Kew (:jfkthame) 2011-04-19 06:42:38 PDT
Created attachment 526972 [details] [diff] [review]
part 5 - support -moz-hyphens:auto/manual/none in layout

This implements automatic hyphenation via the nsLineBreaker, so that it works across element boundaries.
Comment 27 Jonathan Kew (:jfkthame) 2011-04-19 06:43:23 PDT
Created attachment 526973 [details] [diff] [review]
part 6 - hyphenation reftests (updated)
Comment 28 Jonathan Kew (:jfkthame) 2011-04-19 07:13:24 PDT
Nominating for tracking-firefox6: hope to land this on m-c fairly soon, at which point it will be a significant new web-facing feature.

Needs dev-doc for authors re the -moz-hyphens property and the importance of proper language tagging to go with it. Also, localizers may wish to consider preparing and bundling hyphenation patterns for additional languages alongside the en-US patterns here.
Comment 29 Carlo Alberto Ferraris 2011-04-19 07:23:19 PDT
I have the it-IT hyphenation dictionary at hand. Should I attach it?
Comment 30 Jonathan Kew (:jfkthame) 2011-04-19 07:36:09 PDT
(In reply to comment #29)
> I have the it-IT hyphenation dictionary at hand. Should I attach it?

Is that the dictionary from an OpenOffice or LibreOffice distribution? Or some other source?

I'm not sure what the best approach will be to keep track of additional dictionaries. I think this bug should be for the core support that becomes part of the base Firefox product; I'd suggest checking with Pike about how we should handle the resources for additional languages.
Comment 31 Carlo Alberto Ferraris 2011-04-19 07:50:41 PDT
It's the latest official (open|libre)office3 it-IT hyphenation dictionary.
Comment 32 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2011-04-19 15:06:58 PDT
This is very nice!

+   * Possible states are 0 (no break), 1 (break allowed), 2 (hyphenation point)

Just use the FLAG_BREAK names here.

+    memset(breakState.Elements(), 0, length*sizeof(PRUint8));

Use FLAG_BREAK_TYPE_NONE here.

+    nsCAutoString lang;
+    mCurrentWordLangGroup->ToUTF8String(lang);
+    nsRefPtr<nsHyphenator> hyphenator =
+      nsHyphenationManager::Instance()->GetHyphenator(lang.get());

Why doesn't GetHyphenator just take the atom as a parameter?

+      if (mCurrentWordLangGroup && mCurrentWordLangGroup != aLangGroup) {
+        mCurrentWordContainsMixedLang = PR_TRUE;
+      } else {
+        mCurrentWordLangGroup = aLangGroup;
+      }

This chunk of code can be moved out to a shared function, say nsLineBreaker::UpdateCurrentWordLangGroup().

+          if (hyphenator) {
+            nsDependentSubstring string(aText + wordStart, aText + offset);
+            nsAutoTArray<PRPackedBool,200> hyphens;
+            if (NS_SUCCEEDED(hyphenator->Hyphenate(string, hyphens))) {
+              for (PRUint32 i = 0; i + 1 < string.Length(); ++i) {
+                if (hyphens[i]) {
+                  breakState[wordStart + i + 1] =
+                    gfxTextRun::CompressedGlyph::FLAG_BREAK_TYPE_HYPHEN;
+                }
+              }
+            }
+          }

I believe this chunk of code can be shared between AppendText and FlushCurrentWord.

I suspect it would be better to have PropertyProvider::GetHyphenationBreaks look up the hypenation state in the textrun itself, so that callers like BreakAndMeasureText and AddInlineMinWidthForFlow don't have to check for both kinds of hyphens.
Comment 33 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2011-04-19 15:10:42 PDT
Are we going to ship en-US hyphenation dictionaries in every build? If not, these reftests are going to break.
Comment 34 Jonathan Kew (:jfkthame) 2011-04-19 15:26:56 PDT
(In reply to comment #32)

I'll go through most of these comments tomorrow, but a couple of responses for now:

> Why doesn't GetHyphenator just take the atom as a parameter?

Because if it doesn't find an exact match, it munges the value to successively drop trailing subtag(s) and calls itself recursively to see if they exist. This means it's dealing with the value as a string and potentially calling itself with a new temporary string, and there's no reason to make it create temporary atoms in order to do that.

> I suspect it would be better to have PropertyProvider::GetHyphenationBreaks
> look up the hypenation state in the textrun itself, so that callers like
> BreakAndMeasureText and AddInlineMinWidthForFlow don't have to check for both
> kinds of hyphens.

Currently, the textrun doesn't explicitly know about &shy; breaks; we'd have to add something to record those during textrun construction in order to be able to do this. I'll take a look... the "obvious" idea of using nsLineBreaker to mark all potential line-breaks doesn't work, as the &shy; characters are filtered out of the text that's passed to the linebreaker.
Comment 35 Jonathan Kew (:jfkthame) 2011-04-19 15:29:40 PDT
(In reply to comment #33)
> Are we going to ship en-US hyphenation dictionaries in every build? If not,
> these reftests are going to break.

I've been assuming "yes" (just like I think we always ship the en-US spelling dictionary), but if product drivers decide against it (e.g. prefer to ship _only_ a local-language dictionary in non-en-US builds) we'd need to figure out some other approach to testing.
Comment 36 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2011-04-19 15:50:07 PDT
(In reply to comment #34)
> Because if it doesn't find an exact match, it munges the value to successively
> drop trailing subtag(s) and calls itself recursively to see if they exist. This
> means it's dealing with the value as a string and potentially calling itself
> with a new temporary string, and there's no reason to make it create temporary
> atoms in order to do that.

Ah OK. But I think it should still take an nsIAtom* parameter and do the conversion internally. That would actually save a bit of code already, and later if we want to we could easily add an extra internal cache mapping nsIAtoms to their hyphenators.

> > I suspect it would be better to have PropertyProvider::GetHyphenationBreaks
> > look up the hypenation state in the textrun itself, so that callers like
> > BreakAndMeasureText and AddInlineMinWidthForFlow don't have to check for both
> > kinds of hyphens.
> 
> Currently, the textrun doesn't explicitly know about &shy; breaks; we'd have to
> add something to record those during textrun construction in order to be able
> to do this. I'll take a look... the "obvious" idea of using nsLineBreaker to
> mark all potential line-breaks doesn't work, as the &shy; characters are
> filtered out of the text that's passed to the linebreaker.

Right. What I was suggesting is modifying PropertyProvider::GetHyphenationBreaks in nsTextFrameThebes to look in the textrun for the auto-hyphenation breaks recorded by nsLineBreaker, and combining those results with the &shy; results that it already computes. That way we won't need to record &shy; breaks in the textrun.
Comment 37 Axel Hecht [:Pike] 2011-04-20 05:11:53 PDT
Sorry if my comments are out of context, I'm going to go "tl;dr;" on this bug.

Hyphenation is tricky language data, as it's less about which language is the preferred one, but which the user would see. How does it determine which hyphenation dict to use if there are multiple?

If we ship it optionally, it should also be installable through AMO. Might be a good idea to do so either way.

How big is the data, if we'd ship multiple dicts?

Also, making them optional makes us less strict about having that data to be triple licensed.

I don't think I want the hyphenation data to be in intl/hyphenation/locales/en-US/. If we go for a single data file (and not ship en-US by default), it should probably be in intl/locales/en-US/hyphenation/, if we can't put it in toolkit/locales/en-US. We'd also need to see how to integrate that into the l10n repacks right. Which patch does the build magic?

If we ship a bunch independent of localization, I care less about the file location.
Comment 38 Jonathan Kew (:jfkthame) 2011-04-20 06:52:50 PDT
(In reply to comment #37)
> Hyphenation is tricky language data, as it's less about which language is the
> preferred one, but which the user would see. How does it determine which
> hyphenation dict to use if there are multiple?

It will _only_ try to auto-hyphenate if the author specifically requests it with "-moz-hyphens:auto" (eventually "hyphens:auto" once CSS3 Text is sufficiently advanced), AND if the text is tagged with a language code that maps to an available hyphenation resource.

So in the case of a multilingual page, individual elements will be hyphenated according to the appropriate rules for their language (or not hyphenated at all, if we don't have rules for them).

> If we ship it optionally, it should also be installable through AMO. Might be a
> good idea to do so either way.
> 
> How big is the data, if we'd ship multiple dicts?

The en-US data is around 100K (but deflates by about 60%, so it would add about 40K to the installer, I assume). Other languages vary greatly; many of them are MUCH smaller (in the 5K-20K range, before compression), as they have simple rules that can be expressed in just a few patterns, though a handful may be larger than en-US.

Although many of the files are small, the cumulative size of shipping all 60+ languages that we could easily derive from the TeX world (for example) is probably more than we want to include in the default installer.

> Also, making them optional makes us less strict about having that data to be
> triple licensed.

Ok, that's good to know. Licensing may well vary among the available patterns.

> I don't think I want the hyphenation data to be in
> intl/hyphenation/locales/en-US/. If we go for a single data file (and not ship
> en-US by default), it should probably be in intl/locales/en-US/hyphenation/, if
> we can't put it in toolkit/locales/en-US. We'd also need to see how to
> integrate that into the l10n repacks right. Which patch does the build magic?

Currently, it's included in part 4 here. I'm happy to move things around to wherever you think they belong.

> If we ship a bunch independent of localization, I care less about the file
> location.

ISTM that it makes sense to treat hyphenation patterns similarly to spelling dictionaries. We have the en-US dictionary in the mozilla-central tree (in extensions/spellcheck/locales/en-US/hunspell), and it becomes part of a "standard" m-c build, in the {application}/dictionaries folder, but it looks like l10n repacks replace this with a localized dictionary. In the patches here, I was aiming to install the en-US hyphenation data in a "hyphenation" folder alongside "dictionaries", on the assumption that it could then be processed in the same way during repacks.

I don't see any intl/locales resources in the tree at the moment; there's intl/locale, but that is source code for locale-related functionality rather than locale-specific resources.
Comment 39 Axel Hecht [:Pike] 2011-04-20 07:10:39 PDT
I favor intl/locales/en-US/hyphenation as it allows us to put other locale info that we may have at some point into the same packaging process. I'm thinking about support data for js i18n, in particular.

Re the bundling piece: Should this be part of browser or part of toolkit? We probably want to talk to the mobile guys about that. Not sure if TB cares, or SM.
The rest of the logic looks OK for repacks, the locales/Makefile.in would be the place to make a call on whether we ship en-US in all builds, or not. Might still make sense to do so.

How are we going to do 
pref("intl.hyphenation-alias.en", "en-us");
for l10n builds? Can't we use sane defaults here, and get by without requiring a pref?

Leaves me with two follow up tasks for you: Feedback on mobile on whether they want to ship with this, and feedback from add-ons, both toolkit code and AMO, if they can hook you up with an extension infrastructure for this data.
Comment 40 Jonathan Kew (:jfkthame) 2011-04-20 07:56:14 PDT
(In reply to comment #39)
> I favor intl/locales/en-US/hyphenation ...

OK, I'll move it and refresh the patches accordingly.

> How are we going to do 
> pref("intl.hyphenation-alias.en", "en-us");
> for l10n builds? Can't we use sane defaults here, and get by without requiring
> a pref?

I don't see a way to handle this without a localizable pref or some similiar mechanism. In the en-GB localization, for example, we might ship both en-us and en-gb patterns (or a user might add the en-us ones, if we only ship en-gb there), but I expect we'd want to map "en" -> "en-gb" by default so that text simply tagged as lang="en", without a more specific locale, will be handled using British rules.

I also think it should be possible for users to customize this behavior if they wish, which the pref allows.

Do l10n repacks currently have a mechanism to modify default prefs at all? I thought that was part of the process, but may of course be mistaken.
Comment 41 Martin von Gagern 2011-04-20 08:01:39 PDT
(In reply to comment #39)
> pref("intl.hyphenation-alias.en", "en-us");

Is this really specific to hyphenation? I'd consider the setting "please treat all unknown English locales as US English" to affect several aspects, like hyphenation, spell checking, and perhaps more to come. Having different settings for different facets might cause trouble, particularly in terms of config UI.
Comment 42 Jonathan Kew (:jfkthame) 2011-04-20 08:05:09 PDT
(In reply to comment #39)
> and feedback from add-ons, both toolkit code and AMO,
> if they can hook you up with an extension infrastructure for this data.

Looking on AMO, I see we already have https://addons.mozilla.org/en-US/firefox/language-tools/ with a whole bunch of spell-check dictionaries, etc.; it seems like hyphenation patterns could be delivered in exactly the same way, no?
Comment 43 Jonathan Kew (:jfkthame) 2011-04-20 08:49:50 PDT
Created attachment 527285 [details] [diff] [review]
part 3 - import en-US hyphenation patterns - moved to intl/locale/en-US/hyphenation

This is the same as the previous "part 3", except that the patterns are placed in intl/locales/en-US/hyphenation as per comment 39.

Carrying forward r=smontagu; Simon, please speak up if you have any concerns with this placement.
Comment 44 Jonathan Kew (:jfkthame) 2011-04-20 08:52:26 PDT
Created attachment 527288 [details] [diff] [review]
part 3.1 - build support to install hyphenation patterns in the product

This is the build changes to copy the hyphenation patterns into the {program}/hyphenation folder, split out from the rest of Part 4 for easier review. Pike, does this look like it does the right thing from your point of view?
Comment 45 Jonathan Kew (:jfkthame) 2011-04-20 08:57:39 PDT
Created attachment 527290 [details] [diff] [review]
part 4 - implement nsHyphenationManager and nsHyphenator - updated to use nsIAtom instead of strings in the API

Updated to use Atom parameters, as per Roc's comments.
Comment 46 Jonathan Kew (:jfkthame) 2011-04-20 14:34:56 PDT
Comment on attachment 527290 [details] [diff] [review]
part 4 - implement nsHyphenationManager and nsHyphenator - updated to use nsIAtom instead of strings in the API

Argh, cancelling r? on part 4 as I just realized I attached a stale patch, so don't waste time on it. Update coming.
Comment 47 Jonathan Kew (:jfkthame) 2011-04-20 23:16:10 PDT
Created attachment 527486 [details] [diff] [review]
part 4 - implement nsHyphenationManager and nsHyphenator - updated to use nsIAtom instead of strings in the API
Comment 48 Jonathan Kew (:jfkthame) 2011-04-20 23:18:14 PDT
Created attachment 527488 [details] [diff] [review]
part 5 - support -moz-hyphens:auto/manual/none in layout - updated per comments 32 & 36
Comment 49 Jonathan Kew (:jfkthame) 2011-04-20 23:26:46 PDT
(In reply to comment #33)
> Are we going to ship en-US hyphenation dictionaries in every build? If not,
> these reftests are going to break.

Regardless of the decision on this, I think we'll have the en-US hyph dic in the main build (like we do for the spelling dic), so these should work during our standard unit tests.

It's true that they'd fail if they were run with an l10n-repackaged build where the dic has been swapped out, but I don't see a good way to deal with that at this point. (We'd have the same issue with spell-check tests if we were running full unit tests on repacks, wouldn't we?)
Comment 50 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2011-04-21 03:00:30 PDT
I guess so.
Comment 51 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2011-04-21 03:01:16 PDT
I think it might make sense to just always package the en-US hyphenation dictionary, anyway, in addition to any other languages we choose to package.
Comment 52 Axel Hecht [:Pike] 2011-04-21 12:05:09 PDT
I agree that it'd make sense to package the en-US hyphenation dictionary in all builds. That'd enable us to run the hyphenation tests on non-en-US builds, too, which is a good thing to have.

For the locale prefs, doing prefs as part of repacks is tough. There's a template prefs file we include, but I'd rather not have localizers touch that, it's too easy to put scary things in there.

Isn't a language mapping algorithm like BCP47 good enough in general? I don't mind if users have prefs to tweak things on top of that, but I don't think we should have prefs for the simple case of walking up a set of language tags. We currently don't have a reusable component implementing bcp47 sadly, but bug 356038 could add that at some point. Right now, going in chunks of "en-US", "en", 'nothing' should do the trick, I think.
Comment 53 Axel Hecht [:Pike] 2011-04-21 12:08:55 PDT
Comment on attachment 527288 [details] [diff] [review]
part 3.1 - build support to install hyphenation patterns in the product

I'd make 

+PATTERN_FILES = $(strip $(wildcard $(LOCALE_SRCDIR)/hyphenation/*.dic))

list both the en-US and the locale pattern files so that we ship the en-US ones by default.

r=me with that.
Comment 54 Cédric Corazza 2011-04-21 13:55:01 PDT
I read through rapidly all the comments and didn't read about one concern: bloating.
This is not really about the issue discussed here, but still. Adding en-US dic to l10n locale will bloat Firefox a little more. I remember times where size packages matters, and it was around 6 MB for Windows in these days, then we added l10n dictionaries (1 MB more or so). We are now around 12 MB.
Just ignore my comment if this doesn't matter anymore.
If it does, I will let the "guys who know" file the bug in the right component.
Sorry if made unncessary noise in this bug.
Comment 55 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2011-04-21 17:52:57 PDT
We worry about bloat, but adding features generally requires adding to the download size. In this case we expect adding the en-US dictionary to add about 40K to the download, see comment #38.
Comment 56 Simon Montagu :smontagu 2011-04-28 00:51:59 PDT
Comment on attachment 527486 [details] [diff] [review]
part 4 - implement nsHyphenationManager and nsHyphenator - updated to use nsIAtom instead of strings in the API

Review of attachment 527486 [details] [diff] [review]:

r=me
Comment 57 Jonathan Kew (:jfkthame) 2011-04-28 05:27:12 PDT
Comment on attachment 527486 [details] [diff] [review]
part 4 - implement nsHyphenationManager and nsHyphenator - updated to use nsIAtom instead of strings in the API

Requesting sr for the new APIs here.
Comment 58 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2011-05-03 14:52:41 PDT
Comment on attachment 527486 [details] [diff] [review]
part 4 - implement nsHyphenationManager and nsHyphenator - updated to use nsIAtom instead of strings in the API

Review of attachment 527486 [details] [diff] [review]:

::: intl/hyphenation/public/nsHyphenator.h
@@ +49,5 @@
+public:
+  nsHyphenator(nsIFile *aFile);
+
+  nsrefcnt AddRef(void) {
+    NS_PRECONDITION(PRInt32(mRefCnt) >= 0, "illegal refcnt");

Use NS_INLINE_DECL_REFCOUNTING

::: intl/hyphenation/src/hnjalloc.h
@@ +43,5 @@
+#include "prmem.h"
+
+#define hnj_malloc(size)      PR_Malloc(size)
+#define hnj_realloc(p, size)  PR_Realloc(p, size)
+#define hnj_free(p)           PR_Free(p)

Use mozalloc.h and moz_malloc or moz_xmalloc. If you think libhyphen is very robust about checking malloc return values for OOM, then use moz_malloc, otherwise use moz_xmalloc.
Comment 59 Jonathan Kew (:jfkthame) 2011-05-04 05:46:26 PDT
(In reply to comment #58)

> Use NS_INLINE_DECL_REFCOUNTING

Done; although I was thinking we might eventually want to enhance this to have a time-based cache that discards hyphenation tables that haven't been used recently, if memory usage becomes a concern. For that, I think we'd want a custom Release implementation so as to track last-used time. But we can do that in a followup bug if it seems necessary.

> Use mozalloc.h and moz_malloc or moz_xmalloc. If you think libhyphen is very
> robust about checking malloc return values for OOM, then use moz_malloc,
> otherwise use moz_xmalloc.

Done. libhyphen does not check allocator return values, it assumes that hnj_{m,re}alloc  are infallible (normally by using its own wrappers), so they're mapped to the moz_x* versions here.


http://hg.mozilla.org/mozilla-central/rev/852a668d69b9
http://hg.mozilla.org/mozilla-central/rev/fb4db64ca8b8
http://hg.mozilla.org/mozilla-central/rev/861b8bbc25b5
http://hg.mozilla.org/mozilla-central/rev/a5877159c9a1
http://hg.mozilla.org/mozilla-central/rev/3f96f2d3dffe
http://hg.mozilla.org/mozilla-central/rev/cf41ba95de85
http://hg.mozilla.org/mozilla-central/rev/3fd770ef6a65
Comment 60 Jonathan Kew (:jfkthame) 2011-05-04 07:49:54 PDT
I backed out parts 3.1 (the build support for including the hyphenation dict) and 6 (tests) because of orange in l10n-check:

make libs
make[4]: Entering directory `/builds/slave/cen-lnx/build/obj-firefox/intl/locales'
/builds/slave/cen-lnx/build/obj-firefox/config/nsinstall -R    ../../dist/xpi-stage/locale-x-test/hyphenation
usage: /builds/slave/cen-lnx/build/obj-firefox/config/nsinstall [-C cwd] [-L linkprefix] [-m mode] [-o owner] [-g group]
                                                                [-DdltR] file [file ...] directory
make[4]: make[4]: Leaving directory `/builds/slave/cen-lnx/build/obj-firefox/intl/locales'
*** [libs] Error 2
NEXT ERROR make[3]: *** [default] Error 2
make[3]: Leaving directory `/builds/slave/cen-lnx/build/obj-firefox/intl/locales'
make[2]: Leaving directory `/builds/slave/cen-lnx/build/obj-firefox/browser/locales'
make[2]: *** [libs-x-test] Error 2
make[1]: *** [l10n-check] Error 2
make[1]: Leaving directory `/builds/slave/cen-lnx/build/obj-firefox/browser/locales'
make: *** [l10n-check] Error 2
program finished with exit code 2

As such, the feature is currently disabled - the code is still in the tree, but with no dictionary, it won't actually do anything.
Comment 61 Jonathan Kew (:jfkthame) 2011-05-04 09:04:51 PDT
Re-landed after fixing the path in the makefile so that it won't fail l10n-check this time:
http://hg.mozilla.org/mozilla-central/rev/5cd93fcc75d9
http://hg.mozilla.org/mozilla-central/rev/4ed2169f5b34
Comment 62 Eric Shepherd [:sheppy] 2011-05-04 12:20:15 PDT
Documentation:

https://developer.mozilla.org/en/CSS/hyphens

I've composed an example but am awaiting a build with this working before inserting it into the page. The Nightly I'm running now doesn't have it yet.

Also mentioned on Firefox 6 for developers.
Comment 63 Jonathan Kew (:jfkthame) 2011-05-04 13:02:31 PDT
(In reply to comment #62)
> Documentation:
> 
> https://developer.mozilla.org/en/CSS/hyphens

Thanks! :)

A couple of notes. First a nit: there's a missing "e" on "The" at the very beginning!

You might want to reference CSS3 Text in addition to Generated-Content; I'm not sure where the final spec will settle down, though personally I think Text is a more appropriate home for it.

  http://www.w3.org/TR/2011/WD-css3-text-20110412/#hyphens

Also, it may be worth pointing out that the "auto" setting depends on appropriate language tagging, as the proper places to break words depend on the language. So Gecko (at least) will not do any automatic hyphenation unless the page/element has an appropriate 'lang' attribute; merely setting "-moz-hyphens:auto" with no 'lang' attribute will have no visible effect.

(As of now we only provide support for auto-hyphenating English, but we're looking at ways to include support for additional languages.)
Comment 64 :Ehsan Akhgari (busy, don't ask for review please) 2011-05-04 15:57:46 PDT
So, I'm not totally comfortable with the choice of only shipping en-US hyphenation support by default.  This is something which affects the layout of web content, and I don't think that we should limit the language specific layout support to only the language the browser UI is in.  If you're a multilingual person, you almost definitely do not run two parallel browsers one in each language.  And we do not restrict other language specific layout features (such as shaping) to only browsers in some locales.  (I know that this example is kind of extreme, but I'm talking in terms of what an average user expects).

So, I'm wondering, is the built-in hyphenation dictionaries choice made in this bug a good one?
Comment 65 Axel Hecht [:Pike] 2011-05-04 16:29:34 PDT
(In reply to comment #64)
> So, I'm wondering, is the built-in hyphenation dictionaries choice made in this
> bug a good one?

I think we should have more bugs on a good default set of hyphenation dicts to ship, plus good localized defaults, plus a good infra to detect if we should download and use additional dicts by demand.
Comment 66 :Ehsan Akhgari (busy, don't ask for review please) 2011-05-04 16:36:02 PDT
(In reply to comment #65)
> (In reply to comment #64)
> > So, I'm wondering, is the built-in hyphenation dictionaries choice made in this
> > bug a good one?
> 
> I think we should have more bugs on a good default set of hyphenation dicts to
> ship, plus good localized defaults, plus a good infra to detect if we should
> download and use additional dicts by demand.

Is that considered to be something that should be addressed before we can ship this in Firefox?
Comment 67 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2011-05-04 19:21:28 PDT
(In reply to comment #64)
> So, I'm not totally comfortable with the choice of only shipping en-US
> hyphenation support by default.  This is something which affects the layout of
> web content, and I don't think that we should limit the language specific
> layout support to only the language the browser UI is in.

There are multiple, somewhat conflicting goals here:
1) Avoid making page layout depend on the browser UI locale.
2) Support auto-hyphenation for the preferred language(s) of each user.
3) Acceptable Firefox download bloat.

What's currently on trunk satisfies #1 and #3. I think this something we can ship; supporting auto-hyphenation for English only is better than not supporting it for any language, IMHO.

I'm not sure what we should do beyond that, though. That's a discussion we should probably have on dev.platform.
Comment 68 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2011-05-04 19:23:26 PDT
(In reply to comment #67)
> There are multiple, somewhat conflicting goals here:
> 1) Avoid making page layout depend on the browser UI locale.
> 2) Support auto-hyphenation for the preferred language(s) of each user.
> 3) Acceptable Firefox download bloat.
> 
> What's currently on trunk satisfies #1 and #3.

Just to clarify: the reason the current code satisfies #1 is that we ship the en-US dictionary for *every* locale (and never ship any other dictionaries).
Comment 69 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2011-05-04 19:34:38 PDT
Varying the hyphenation by the local would leak the locale.
Comment 70 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2011-05-04 19:37:12 PDT
Don't we already leak the locale in the Accept-Language header?
Comment 71 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2011-05-04 19:39:26 PDT
IIRC, there are a few ways we leak the locale. I don't know how much of a problem, or not, it is considered. I just want to make sure Sid is aware of the leak.
Comment 72 Phil Ringnalda (:philor) 2011-05-04 21:59:18 PDT
Do we want to package this on Android, or mark the six permaorange reftests as fails-if? (Or, do the right thing, build, package, and test based on an ifdef.)
Comment 73 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2011-05-05 04:12:15 PDT
We should package it on Android.
Comment 74 Jonathan Kew (:jfkthame) 2011-05-05 10:37:25 PDT
Created attachment 530366 [details] [diff] [review]
wip - package the hyphenation dictionary on android - incomplete

I don't know how the Android packaging works, and what's required in order to incorporate this. I found mobile/installer/package-manifest.in and mobile/locales/Makefile.in and added appropriate entries to these (see attached patch), but this is not sufficient. I'm seeing hyphenation/en-US/hyph_en_US.dic in the $OBJDIR/dist/fennec directory, but it doesn't end up in the final package.

So could someone who knows how this should work please point me in the right direction - or just do it for me! :)
Comment 75 Mark Finkle (:mfinkle) (use needinfo?) 2011-05-05 12:15:36 PDT
How big are the dictionaries? We explicitly remove spellcheck dictionaries in mobile due to size issues and the fact that the spellcheck UI is not 100% functional.
Comment 76 Jonathan Kew (:jfkthame) 2011-05-05 12:36:43 PDT
They vary greatly, from a few K to several hundred K. The en-US dict (currently the only one in the tree; we're looking into how best to handle other locales) is just over 100K, but compresses to about 42K when zipped.
Comment 77 Mark Finkle (:mfinkle) (use needinfo?) 2011-05-05 12:42:29 PDT
I can try the patch and see how it works. It looks OK.
Comment 78 :Ehsan Akhgari (busy, don't ask for review please) 2011-05-05 17:41:18 PDT
(In reply to comment #70)
> Don't we already leak the locale in the Accept-Language header?

The value of the Accept-Language header can be customized by the user, and the UI locale is not its only source.

That said, if we choose to ship en-US in every locale, this won't be a concern, except for the locales which ship their own version of the dictionary in addition to the en-US one.
Comment 79 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2011-05-05 17:42:56 PDT
(In reply to comment #75)
> How big are the dictionaries? We explicitly remove spellcheck dictionaries in
> mobile due to size issues and the fact that the spellcheck UI is not 100%
> functional.

This is probably more important than spellcheck since a) it affects layout and b) autohyphenation would be especially useful on mobile to make text fit in a narrow space.
Comment 80 :Ehsan Akhgari (busy, don't ask for review please) 2011-05-05 17:56:07 PDT
(In reply to comment #67)
> (In reply to comment #64)
> > So, I'm not totally comfortable with the choice of only shipping en-US
> > hyphenation support by default.  This is something which affects the layout of
> > web content, and I don't think that we should limit the language specific
> > layout support to only the language the browser UI is in.
> 
> There are multiple, somewhat conflicting goals here:
> 1) Avoid making page layout depend on the browser UI locale.
> 2) Support auto-hyphenation for the preferred language(s) of each user.
> 3) Acceptable Firefox download bloat.
> 
> What's currently on trunk satisfies #1 and #3. I think this something we can
> ship; supporting auto-hyphenation for English only is better than not
> supporting it for any language, IMHO.
> 
> I'm not sure what we should do beyond that, though. That's a discussion we
> should probably have on dev.platform.

Posted a thread to dev.platform.
Comment 81 Mark Finkle (:mfinkle) (use needinfo?) 2011-05-05 21:30:56 PDT
Comment on attachment 530366 [details] [diff] [review]
wip - package the hyphenation dictionary on android - incomplete

You missed a spot packager.mk

Android omnijar packaging is different and separate from other omnijar packaging. I tweaked your patch and got the hyphen dict files in the package. I sent the patch to the try server to see if the reftests clear up.
Comment 82 Jonathan Kew (:jfkthame) 2011-05-06 02:49:47 PDT
Aha, thanks - I figured there must be something like that I'd missed.

In a local build, I now see the dict in the package as expected. However, testing on my device still doesn't actually work - maybe the code ends up looking in the wrong directory or something.
Comment 83 Mark Finkle (:mfinkle) (use needinfo?) 2011-05-06 06:39:12 PDT
(In reply to comment #82)
> Aha, thanks - I figured there must be something like that I'd missed.
> 
> In a local build, I now see the dict in the package as expected. However,
> testing on my device still doesn't actually work - maybe the code ends up
> looking in the wrong directory or something.

I also found a place in the code where we unpack folders out of the omnijar file on Android.

Do we have any easily accessible test pages to try hyphenation? I already failed once to get any reftests to run on try and I don't have an easy way to try to run the Android reftests myself locally.
Comment 84 Mark Finkle (:mfinkle) (use needinfo?) 2011-05-06 07:18:45 PDT
Not sure of a better place to ask, but are we planning to support reading the hyphen dict from a locale JAR file? It seems natural that the locale specific dictionary could be shipped with a locale itself. It looks like the code is currently designed to only read from extracted, local files - using the GRE or APP folder as the root.
Comment 85 Jonathan Kew (:jfkthame) 2011-05-06 07:34:07 PDT
(In reply to comment #83)
> Do we have any easily accessible test pages to try hyphenation? I already
> failed once to get any reftests to run on try

(That sounds like bug 655046.)

I've put a page at http://people.mozilla.org/~jkew/hyph.html that should demonstrate hyphenation, if it's working.
Comment 86 Jonathan Kew (:jfkthame) 2011-05-06 07:37:21 PDT
(In reply to comment #84)
> Not sure of a better place to ask, but are we planning to support reading
> the hyphen dict from a locale JAR file? It seems natural that the locale
> specific dictionary could be shipped with a locale itself. It looks like the
> code is currently designed to only read from extracted, local files - using
> the GRE or APP folder as the root.

What do we do for spelling dictionaries? ISTM that we probably want a similar strategy for both these kinds of resources. Perhaps it's time to spin off a new bug on that....
Comment 87 Mark Finkle (:mfinkle) (use needinfo?) 2011-05-06 07:46:01 PDT
Created attachment 530621 [details] [diff] [review]
Android packaging patch

This patch builds on Jonathan's but adds the hyphenation folder to the Android specific omnijar packager and then extracts the hyphenation folder out of the omnijar file on the Android device, so the toolkit code can access it.

Using this patch I can see hyphenation working on http://people.mozilla.org/~jkew/hyph.html
Comment 88 Brad Lassey [:blassey] (use needinfo?) 2011-05-06 09:41:56 PDT
Comment on attachment 530621 [details] [diff] [review]
Android packaging patch

Review of attachment 530621 [details] [diff] [review]:
-----------------------------------------------------------------

I'd rather not extract these dictionaries from the omnijar, but this will do for now. Please file a follow up bug
Comment 89 Mark Finkle (:mfinkle) (use needinfo?) 2011-05-06 12:29:40 PDT
pushed Android packaging patch:
http://hg.mozilla.org/mozilla-central/rev/de45da66d672

Filed bug 655337 for loading via a JAR friendly method
Comment 90 Mathias Nater 2011-05-12 14:02:55 PDT
Filed https://bugzilla.mozilla.org/show_bug.cgi?id=656750 for enhancements on hyphenation.
Comment 91 Johnathan Nightingale [:johnath] 2011-05-24 14:20:25 PDT
Marking this tracking+, but it seems like it should pick up status-firefox6:fixed as soon as that flag is available (today, ideally)
Comment 92 Rene 2011-08-02 04:05:04 PDT
(In reply to comment #21)
> Who's going to replace the hyphenator with a JS implementation? An
> extension? Seems unlikely. We shouldn't take complexity and perf hits in our
> code for an extension point that is unlikely to be used. We should implement
> the simple thing and then if demand arises for an extension point, provide
> that as an option.
>
It's not so unlikely at all. There are currently set of speller extensions for smaller languages like Finnish or Estonian that provide much better quality than hunspell.
I would be very happy if there would be an extension interface for hyphenation (like there is mozISpellCheckingEngine for speller) and kindly ask you to consider this for next update.
Comment 93 Mojca Miklavec 2011-08-22 02:03:58 PDT
(In reply to Rene from comment #92)
> There are currently set of speller extensions
> for smaller languages like Finnish or Estonian that provide much better
> quality than hunspell.

What extensions are needed for Finnish or Estonian? What are their requirements that current hyphenation algorithm is not able to satisfy? (I'm interested to know what hyphenation points TeX is missing.)
Comment 94 Rene 2011-08-22 02:23:35 PDT
(In reply to Mojca Miklavec from comment #93)
> What extensions are needed for Finnish or Estonian? What are their
> requirements that current hyphenation algorithm is not able to satisfy? (I'm
> interested to know what hyphenation points TeX is missing.)
Proper Estonian hyphenation is a combination of rules (for base words) and morphological analysis (for compounds). TeX handles the first part nicely, but can't handle compounds. I think having TeX hyphenation as a standard solution is acceptable (as Mozilla's first priority should be browsing, not language technology), but there should be an extension point here to implement language specific better quality options.

Note You need to log in before you can comment on or make changes to this bug.