Closed Bug 253317 Opened 20 years ago Closed 13 years ago

Provide hyphenation dictionary for justified text

Categories

(Core :: Layout: Text and Fonts, enhancement)

enhancement
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla6
Tracking Status
firefox6 + fixed

People

(Reporter: bmh_ca, Assigned: jfkthame)

References

(Blocks 2 open bugs, )

Details

(Keywords: dev-doc-complete, privacy, Whiteboard: [sg:audit])

Attachments

(8 files, 11 obsolete files)

13.71 KB, patch
dbaron
: review+
Details | Diff | Splinter Review
108.78 KB, patch
smontagu
: review+
Details | Diff | Splinter Review
8.53 KB, patch
roc
: review+
Details | Diff | Splinter Review
116.82 KB, patch
jfkthame
: review+
Details | Diff | Splinter Review
5.03 KB, patch
Pike
: review+
Details | Diff | Splinter Review
29.82 KB, patch
smontagu
: review+
Details | Diff | Splinter Review
39.18 KB, patch
roc
: review+
Details | Diff | Splinter Review
3.90 KB, patch
blassey
: review+
Details | Diff | Splinter Review
For justified text, provide lookups to a hyphenation dictionary to wrap the text
appropriately. Ie. When text is justified, break large words apart across
multiple lines with hyphenation, to make the justification more precise.

eg. (random example)
This is a security
problem that needs
to     be     kept
confidential.

would become something like:
This is a security
problem that needs
to be  kept  conf-
idential.
Isn't this a duplicate of bug 67715?  If not, this probably blocks that bug.
Blocks: 67715
*** Bug 281813 has been marked as a duplicate of this bug. ***
As a preliminary step, this implements a "-moz-hyphens" property based on "hyphens" in the CSS3 Text draft, with values none/manual/auto/all. This will make it possible to control the use of hyphenation from CSS (once an actual hyphenation engine is in place).
Assignee: nobody → jfkthame
The hyphenation code here is hyphen-2.5, with slight modifications. Currently, the actual hyphenation is done on 8-bit text, which means we have to convert to utf-8 in order to find hyphen positions; for better performance, we could perhaps adapt the core code to work directly on utf16 strings.
Attachment #522769 - Attachment is obsolete: true
Attachment #522772 - Attachment is obsolete: true
Attachment #522773 - Attachment is obsolete: true
Attachment #523368 - Flags: review?
Attachment #523368 - Flags: review? → review?(dbaron)
This comes directly from the current (2.7.1) source available at http://sourceforge.net/projects/hunspell/files/Hyphen/. The code is MPL/LGPL-licensed, so there should be no licensing issues with using it in Gecko.
Attachment #523370 - Flags: review?(smontagu)
These are the US-English hyphenation patterns, also used by OpenOffice.org, I believe; we may want to look into sharing OOo files if they're installed, but for the time being it seems simplest to include this in the tree, like we do for the US-English spelling dictionary.
Attachment #523371 - Flags: review?(smontagu)
Comment on attachment 523368 [details] [diff] [review]
part 1 - implement -moz-hyphens property in CSS

>diff --git a/layout/style/nsCSSPropList.h b/layout/style/nsCSSPropList.h
>+CSS_PROP_TEXT(
>+    -moz-hyphens,
>+    _moz_hyphens,

Don't use the "_moz_" in the name of the constant (and fix nsComputedDOMStyle.cpp to match).

>+    MozHyphens,

use CSS_PROP_DOMPROP_PREFIXED(Hyphens) here instead, and change the ValueForMozHyphens() in nsRuleNode.cpp to ValueForHyphens().

Have you tested that NS_STYLE_HINT_REFLOW is sufficient for a CalcDifference hint?

I presume that you've run the layout/style mochitests and they still pass.

r=dbaron with that
Attachment #523368 - Flags: review?(dbaron) → review+
This provides XPCOM components to implement hyphenation for a given locale (based on a file of hyphenation patterns), and to manage the hyphenators, looking them up by language/locale codes.

Currently, the hyphenation patterns for a given locale are never freed (until shutdown) once they've been loaded. If memory usage becomes a concern, we could make the hyphenationManager release them if they haven't been used for a certain amount of time.
Attachment #523373 - Flags: review?(smontagu)
Implement automatic hyphenation support in layout. This is an "opt-in" feature; hyphenation will only be done if the CSS calls for "-moz-hyphens: auto", AND if the data is tagged as English.

(If additional patterns are added, then of course additional languages will be supported for hyphenation. We might want to make patterns available as part of language-packs for users to add as desired.)
Attachment #523374 - Flags: review?(roc)
Attachment #523375 - Flags: review?(roc)
Comment on attachment 523373 [details] [diff] [review]
part 4 - implement nsIHyphenationManager and nsIHyphenator components

Why do we need to use XPCOM here? Now that libxul is a required part of the build, these can just be regular C++ classes nsHyphenator and nsHyphenationManager.
Also, if insertHyphens is just for testing, I think we should do without it. We can fake it in the test code by laying out at different widths and seeing what boxes are created, using getClientRects.
I see that the hyphenator defines the notion of "a word". What does that mean exactly? We have words for selection/caret movement, and words for CSS capitalization, or is it something else? (We also have line break opportunities which are similar)
Let's name mHyphenatorInited mHyphenatorInitialized.

+      mFrame->GetStyleVisibility()->mLanguage->ToUTF8String(langString);
+      hyphSvc->GetHyphenator(langString.get(), getter_AddRefs(mHyphenator));

Why not make GetHyphenator take an nsIAtom*?

+    nsAutoTArray<PRUint8,200> hyphens;
+    PRBool inhibitHyphenInInitialWord =
+      startOriginalOffset > 0 &&
+      mFrag->CharAt(startOriginalOffset - 1) == CH_SHY;
+
+    if (NS_SUCCEEDED(mHyphenator->Hyphenate(*text, &hyphens,
+                                            inhibitHyphenInInitialWord)))

Seems like 'hyphens' should be an array of PRPackedBool.

Hmm. Actually I think there's a big problem here, which is that we should be looking at text runs that cross frame boundaries. In fact, I think we probably should be computing hyphenation breaks in nsLineBreaker. Probably the SHY breaks should be computed there too.
(In reply to comment #11)
> We might want to make patterns available as part of
> language-packs for users to add as desired.
Yes, please do.
(In reply to comment #13)
> Comment on attachment 523373 [details] [diff] [review]
> part 4 - implement nsIHyphenationManager and nsIHyphenator components
> 
> Why do we need to use XPCOM here? Now that libxul is a required part of the
> build, these can just be regular C++ classes nsHyphenator and
> nsHyphenationManager.

(In reply to comment #14)
> Also, if insertHyphens is just for testing, I think we should do without it. We
> can fake it in the test code by laying out at different widths and seeing what
> boxes are created, using getClientRects.

Does XPCOM give people more flexibility to use (and/or replace) the component, e.g. from JS? For example, someone might want to implement alternative hyphenation components, if the libhyphen algorithm isn't ideal for all languages, or use insertHyphens from JS to find most hyphen positions but then massage the results before replacing text in the DOM.

I'd be OK with a simple C++ implementation, but providing XPCOM interfaces here seemed like the "Gecko-ish" way to do things.
(In reply to comment #15)
> I see that the hyphenator defines the notion of "a word". What does that mean
> exactly? We have words for selection/caret movement, and words for CSS
> capitalization, or is it something else? (We also have line break opportunities
> which are similar)

(There's also the spell-checker, which has a notion of "word" as the unit to be checked.)

For simplicity, I went with a contiguous sequence of characters with the Unicode category "letter" or "combining mark", as that's what the hyphenation patterns are typically designed to handle. It's possible that we could take advantage of existing code for selection (not caret movement, which seems to include trailing punctuation with the "word", and certainly not line-breaking).
(In reply to comment #16)
> Why not make GetHyphenator take an nsIAtom*?

A string seemed more JS-friendly, but if we decide not to expose this to JS then atoms would be fine, I guess.

> Hmm. Actually I think there's a big problem here, which is that we should be
> looking at text runs that cross frame boundaries. 

That's true, in principle, but I'm not sure how important it is (if it'd be at the cost of significant extra complexity, for example). Cases where a single hyphenatable word crosses frame boundaries are probably not common, and when they do occur (e.g., the author is explicitly styling certain letters of the word, such as highlighting one syllable in bold), it may well be inappropriate to be hyphenating anyway as this is typically a case where the "word" is specifically in focus for some reason, and splitting it across lines then detracts from the intent.

If we can do this readily, I'm fine with it, but if it's going to be tricky, I don't think missing a few potential hyphenation points in such cases is a major issue; it's a limitation we could live with. Authors writing such stuff who _do_ want it hyphenated (which seems like it'll be very rare) could still use &shy; and thereby get explicit control of the possible breaks.

> In fact, I think we probably
> should be computing hyphenation breaks in nsLineBreaker. Probably the SHY
> breaks should be computed there too.

Can nsLineBreaker handle this? Offhand, I don't see anything that suggests it can mark breakpoints as requiring hyphen insertion if they're taken.
(In reply to comment #18)
> Does XPCOM give people more flexibility to use (and/or replace) the component,
> e.g. from JS? For example, someone might want to implement alternative
> hyphenation components, if the libhyphen algorithm isn't ideal for all
> languages, or use insertHyphens from JS to find most hyphen positions but then
> massage the results before replacing text in the DOM.
> 
> I'd be OK with a simple C++ implementation, but providing XPCOM interfaces here
> seemed like the "Gecko-ish" way to do things.

Those are bad old ways.

Who's going to replace the hyphenator with a JS implementation? An extension? Seems unlikely. We shouldn't take complexity and perf hits in our code for an extension point that is unlikely to be used. We should implement the simple thing and then if demand arises for an extension point, provide that as an option.

If Web content wants to discover hyphenation points, we need to create a Web API for that. XPCOM doesn't do it.

We may think it's not a major issue to look only within a frame, in practice, but I think it's fundamentally wrong and will probably bite us later. Fundamentally, wrapping arbitrary chunks of text in unstyled <spans> should not change layout. We go to a lot of effort to make that true, e.g. forming ligatures across element boundaries. I really think we should do this right.

nsLineBreaker can't report hyphen breaks right now. It would need to be extended to do that. But I think the line breaker is the logical place to do this; apart from having the text that we need, hyphenation is fundamentally about exposing line break opportunities.
(In reply to comment #21)

> Who's going to replace the hyphenator with a JS implementation? An extension?
> Seems unlikely.

It may not be as unlikely as all that. There's already Hyphenator.js (http://code.google.com/p/hyphenator/), for example, showing that people have an interest in the topic. Our hyphenation would basically make that superfluous, but more sophisticated things could be built as refinements on top of our implementation if we expose it to JS - e.g, adding site-specific hyphenation additions/exceptions while still using our hyphenator for the bulk of the work.

Perhaps more likely that replacing the actual hyphenator with a JS implementation, I can imagine people wanting to _use_ the hyphenator from JS and then put the resulting soft-hyphenated text (after post-processing) into the document.

> We may think it's not a major issue to look only within a frame, in practice,
> but I think it's fundamentally wrong and will probably bite us later.
> Fundamentally, wrapping arbitrary chunks of text in unstyled <spans> should not
> change layout. We go to a lot of effort to make that true, e.g. forming
> ligatures across element boundaries. I really think we should do this right.
> 
> nsLineBreaker can't report hyphen breaks right now. It would need to be
> extended to do that. But I think the line breaker is the logical place to do
> this; apart from having the text that we need, hyphenation is fundamentally
> about exposing line break opportunities.

Fair enough; I'll have a look at nsLineBreaker and see what would be involved there.
(In reply to comment #22)
> It may not be as unlikely as all that. There's already Hyphenator.js
> (http://code.google.com/p/hyphenator/), for example, showing that people have
> an interest in the topic. Our hyphenation would basically make that
> superfluous, but more sophisticated things could be built as refinements on top
> of our implementation if we expose it to JS - e.g, adding site-specific
> hyphenation additions/exceptions while still using our hyphenator for the bulk
> of the work.

That could be useful for Web sites themselves, so if there's demand we should just create a proper Web-accessible API for that. And that means not going through an XPCOM service.

> Perhaps more likely that replacing the actual hyphenator with a JS
> implementation, I can imagine people wanting to _use_ the hyphenator from JS
> and then put the resulting soft-hyphenated text (after post-processing) into
> the document.

Ditto.

Basically the right approach is to use something fast and lightweight internally (i.e. a regular C++ class, not even using virtuals if we can help it), then if we want a JS API, we create a scriptable wrapper or extension point which uses XPCOM as our WebIDL to define the interface, but does not use the rest of XPCOM such as services.

> Fair enough; I'll have a look at nsLineBreaker and see what would be involved
> there.

Great, thanks.
Keywords: dev-doc-needed
Attachment #523370 - Flags: review?(smontagu) → review+
Comment on attachment 523371 [details] [diff] [review]
part 3 - import en-US hyphenation patterns from hyphen-2.7.1

rs=me
Attachment #523371 - Flags: review?(smontagu) → review+
Updated to implement hyphenation manager and hyphenator as simple C++ classes, without XPCOM interfaces.

(Also includes the Makefile changes needed to build the new files, and include the hyphenation patterns in the final package.)

Still to come: an update of part 5, to use the revised interface.
Attachment #523373 - Attachment is obsolete: true
Attachment #523374 - Attachment is obsolete: true
Attachment #525667 - Flags: review?(smontagu)
Attachment #523374 - Flags: review?(roc)
Attachment #523373 - Flags: review?(smontagu)
This implements automatic hyphenation via the nsLineBreaker, so that it works across element boundaries.
Attachment #526972 - Flags: review?(roc)
Attachment #523375 - Attachment is obsolete: true
Attachment #526973 - Flags: review?(roc)
Attachment #523375 - Flags: review?(roc)
Nominating for tracking-firefox6: hope to land this on m-c fairly soon, at which point it will be a significant new web-facing feature.

Needs dev-doc for authors re the -moz-hyphens property and the importance of proper language tagging to go with it. Also, localizers may wish to consider preparing and bundling hyphenation patterns for additional languages alongside the en-US patterns here.
I have the it-IT hyphenation dictionary at hand. Should I attach it?
(In reply to comment #29)
> I have the it-IT hyphenation dictionary at hand. Should I attach it?

Is that the dictionary from an OpenOffice or LibreOffice distribution? Or some other source?

I'm not sure what the best approach will be to keep track of additional dictionaries. I think this bug should be for the core support that becomes part of the base Firefox product; I'd suggest checking with Pike about how we should handle the resources for additional languages.
It's the latest official (open|libre)office3 it-IT hyphenation dictionary.
This is very nice!

+   * Possible states are 0 (no break), 1 (break allowed), 2 (hyphenation point)

Just use the FLAG_BREAK names here.

+    memset(breakState.Elements(), 0, length*sizeof(PRUint8));

Use FLAG_BREAK_TYPE_NONE here.

+    nsCAutoString lang;
+    mCurrentWordLangGroup->ToUTF8String(lang);
+    nsRefPtr<nsHyphenator> hyphenator =
+      nsHyphenationManager::Instance()->GetHyphenator(lang.get());

Why doesn't GetHyphenator just take the atom as a parameter?

+      if (mCurrentWordLangGroup && mCurrentWordLangGroup != aLangGroup) {
+        mCurrentWordContainsMixedLang = PR_TRUE;
+      } else {
+        mCurrentWordLangGroup = aLangGroup;
+      }

This chunk of code can be moved out to a shared function, say nsLineBreaker::UpdateCurrentWordLangGroup().

+          if (hyphenator) {
+            nsDependentSubstring string(aText + wordStart, aText + offset);
+            nsAutoTArray<PRPackedBool,200> hyphens;
+            if (NS_SUCCEEDED(hyphenator->Hyphenate(string, hyphens))) {
+              for (PRUint32 i = 0; i + 1 < string.Length(); ++i) {
+                if (hyphens[i]) {
+                  breakState[wordStart + i + 1] =
+                    gfxTextRun::CompressedGlyph::FLAG_BREAK_TYPE_HYPHEN;
+                }
+              }
+            }
+          }

I believe this chunk of code can be shared between AppendText and FlushCurrentWord.

I suspect it would be better to have PropertyProvider::GetHyphenationBreaks look up the hypenation state in the textrun itself, so that callers like BreakAndMeasureText and AddInlineMinWidthForFlow don't have to check for both kinds of hyphens.
Are we going to ship en-US hyphenation dictionaries in every build? If not, these reftests are going to break.
(In reply to comment #32)

I'll go through most of these comments tomorrow, but a couple of responses for now:

> Why doesn't GetHyphenator just take the atom as a parameter?

Because if it doesn't find an exact match, it munges the value to successively drop trailing subtag(s) and calls itself recursively to see if they exist. This means it's dealing with the value as a string and potentially calling itself with a new temporary string, and there's no reason to make it create temporary atoms in order to do that.

> I suspect it would be better to have PropertyProvider::GetHyphenationBreaks
> look up the hypenation state in the textrun itself, so that callers like
> BreakAndMeasureText and AddInlineMinWidthForFlow don't have to check for both
> kinds of hyphens.

Currently, the textrun doesn't explicitly know about &shy; breaks; we'd have to add something to record those during textrun construction in order to be able to do this. I'll take a look... the "obvious" idea of using nsLineBreaker to mark all potential line-breaks doesn't work, as the &shy; characters are filtered out of the text that's passed to the linebreaker.
(In reply to comment #33)
> Are we going to ship en-US hyphenation dictionaries in every build? If not,
> these reftests are going to break.

I've been assuming "yes" (just like I think we always ship the en-US spelling dictionary), but if product drivers decide against it (e.g. prefer to ship _only_ a local-language dictionary in non-en-US builds) we'd need to figure out some other approach to testing.
(In reply to comment #34)
> Because if it doesn't find an exact match, it munges the value to successively
> drop trailing subtag(s) and calls itself recursively to see if they exist. This
> means it's dealing with the value as a string and potentially calling itself
> with a new temporary string, and there's no reason to make it create temporary
> atoms in order to do that.

Ah OK. But I think it should still take an nsIAtom* parameter and do the conversion internally. That would actually save a bit of code already, and later if we want to we could easily add an extra internal cache mapping nsIAtoms to their hyphenators.

> > I suspect it would be better to have PropertyProvider::GetHyphenationBreaks
> > look up the hypenation state in the textrun itself, so that callers like
> > BreakAndMeasureText and AddInlineMinWidthForFlow don't have to check for both
> > kinds of hyphens.
> 
> Currently, the textrun doesn't explicitly know about &shy; breaks; we'd have to
> add something to record those during textrun construction in order to be able
> to do this. I'll take a look... the "obvious" idea of using nsLineBreaker to
> mark all potential line-breaks doesn't work, as the &shy; characters are
> filtered out of the text that's passed to the linebreaker.

Right. What I was suggesting is modifying PropertyProvider::GetHyphenationBreaks in nsTextFrameThebes to look in the textrun for the auto-hyphenation breaks recorded by nsLineBreaker, and combining those results with the &shy; results that it already computes. That way we won't need to record &shy; breaks in the textrun.
Sorry if my comments are out of context, I'm going to go "tl;dr;" on this bug.

Hyphenation is tricky language data, as it's less about which language is the preferred one, but which the user would see. How does it determine which hyphenation dict to use if there are multiple?

If we ship it optionally, it should also be installable through AMO. Might be a good idea to do so either way.

How big is the data, if we'd ship multiple dicts?

Also, making them optional makes us less strict about having that data to be triple licensed.

I don't think I want the hyphenation data to be in intl/hyphenation/locales/en-US/. If we go for a single data file (and not ship en-US by default), it should probably be in intl/locales/en-US/hyphenation/, if we can't put it in toolkit/locales/en-US. We'd also need to see how to integrate that into the l10n repacks right. Which patch does the build magic?

If we ship a bunch independent of localization, I care less about the file location.
(In reply to comment #37)
> Hyphenation is tricky language data, as it's less about which language is the
> preferred one, but which the user would see. How does it determine which
> hyphenation dict to use if there are multiple?

It will _only_ try to auto-hyphenate if the author specifically requests it with "-moz-hyphens:auto" (eventually "hyphens:auto" once CSS3 Text is sufficiently advanced), AND if the text is tagged with a language code that maps to an available hyphenation resource.

So in the case of a multilingual page, individual elements will be hyphenated according to the appropriate rules for their language (or not hyphenated at all, if we don't have rules for them).

> If we ship it optionally, it should also be installable through AMO. Might be a
> good idea to do so either way.
> 
> How big is the data, if we'd ship multiple dicts?

The en-US data is around 100K (but deflates by about 60%, so it would add about 40K to the installer, I assume). Other languages vary greatly; many of them are MUCH smaller (in the 5K-20K range, before compression), as they have simple rules that can be expressed in just a few patterns, though a handful may be larger than en-US.

Although many of the files are small, the cumulative size of shipping all 60+ languages that we could easily derive from the TeX world (for example) is probably more than we want to include in the default installer.

> Also, making them optional makes us less strict about having that data to be
> triple licensed.

Ok, that's good to know. Licensing may well vary among the available patterns.

> I don't think I want the hyphenation data to be in
> intl/hyphenation/locales/en-US/. If we go for a single data file (and not ship
> en-US by default), it should probably be in intl/locales/en-US/hyphenation/, if
> we can't put it in toolkit/locales/en-US. We'd also need to see how to
> integrate that into the l10n repacks right. Which patch does the build magic?

Currently, it's included in part 4 here. I'm happy to move things around to wherever you think they belong.

> If we ship a bunch independent of localization, I care less about the file
> location.

ISTM that it makes sense to treat hyphenation patterns similarly to spelling dictionaries. We have the en-US dictionary in the mozilla-central tree (in extensions/spellcheck/locales/en-US/hunspell), and it becomes part of a "standard" m-c build, in the {application}/dictionaries folder, but it looks like l10n repacks replace this with a localized dictionary. In the patches here, I was aiming to install the en-US hyphenation data in a "hyphenation" folder alongside "dictionaries", on the assumption that it could then be processed in the same way during repacks.

I don't see any intl/locales resources in the tree at the moment; there's intl/locale, but that is source code for locale-related functionality rather than locale-specific resources.
I favor intl/locales/en-US/hyphenation as it allows us to put other locale info that we may have at some point into the same packaging process. I'm thinking about support data for js i18n, in particular.

Re the bundling piece: Should this be part of browser or part of toolkit? We probably want to talk to the mobile guys about that. Not sure if TB cares, or SM.
The rest of the logic looks OK for repacks, the locales/Makefile.in would be the place to make a call on whether we ship en-US in all builds, or not. Might still make sense to do so.

How are we going to do 
pref("intl.hyphenation-alias.en", "en-us");
for l10n builds? Can't we use sane defaults here, and get by without requiring a pref?

Leaves me with two follow up tasks for you: Feedback on mobile on whether they want to ship with this, and feedback from add-ons, both toolkit code and AMO, if they can hook you up with an extension infrastructure for this data.
(In reply to comment #39)
> I favor intl/locales/en-US/hyphenation ...

OK, I'll move it and refresh the patches accordingly.

> How are we going to do 
> pref("intl.hyphenation-alias.en", "en-us");
> for l10n builds? Can't we use sane defaults here, and get by without requiring
> a pref?

I don't see a way to handle this without a localizable pref or some similiar mechanism. In the en-GB localization, for example, we might ship both en-us and en-gb patterns (or a user might add the en-us ones, if we only ship en-gb there), but I expect we'd want to map "en" -> "en-gb" by default so that text simply tagged as lang="en", without a more specific locale, will be handled using British rules.

I also think it should be possible for users to customize this behavior if they wish, which the pref allows.

Do l10n repacks currently have a mechanism to modify default prefs at all? I thought that was part of the process, but may of course be mistaken.
(In reply to comment #39)
> pref("intl.hyphenation-alias.en", "en-us");

Is this really specific to hyphenation? I'd consider the setting "please treat all unknown English locales as US English" to affect several aspects, like hyphenation, spell checking, and perhaps more to come. Having different settings for different facets might cause trouble, particularly in terms of config UI.
(In reply to comment #39)
> and feedback from add-ons, both toolkit code and AMO,
> if they can hook you up with an extension infrastructure for this data.

Looking on AMO, I see we already have https://addons.mozilla.org/en-US/firefox/language-tools/ with a whole bunch of spell-check dictionaries, etc.; it seems like hyphenation patterns could be delivered in exactly the same way, no?
This is the same as the previous "part 3", except that the patterns are placed in intl/locales/en-US/hyphenation as per comment 39.

Carrying forward r=smontagu; Simon, please speak up if you have any concerns with this placement.
Attachment #523371 - Attachment is obsolete: true
Attachment #527285 - Flags: review+
This is the build changes to copy the hyphenation patterns into the {program}/hyphenation folder, split out from the rest of Part 4 for easier review. Pike, does this look like it does the right thing from your point of view?
Attachment #527288 - Flags: review?(l10n)
Updated to use Atom parameters, as per Roc's comments.
Attachment #525667 - Attachment is obsolete: true
Attachment #527290 - Flags: review?(smontagu)
Attachment #525667 - Flags: review?(smontagu)
Comment on attachment 527290 [details] [diff] [review]
part 4 - implement nsHyphenationManager and nsHyphenator - updated to use nsIAtom instead of strings in the API

Argh, cancelling r? on part 4 as I just realized I attached a stale patch, so don't waste time on it. Update coming.
Attachment #527290 - Flags: review?(smontagu)
(In reply to comment #33)
> Are we going to ship en-US hyphenation dictionaries in every build? If not,
> these reftests are going to break.

Regardless of the decision on this, I think we'll have the en-US hyph dic in the main build (like we do for the spelling dic), so these should work during our standard unit tests.

It's true that they'd fail if they were run with an l10n-repackaged build where the dic has been swapped out, but I don't see a good way to deal with that at this point. (We'd have the same issue with spell-check tests if we were running full unit tests on repacks, wouldn't we?)
I think it might make sense to just always package the en-US hyphenation dictionary, anyway, in addition to any other languages we choose to package.
I agree that it'd make sense to package the en-US hyphenation dictionary in all builds. That'd enable us to run the hyphenation tests on non-en-US builds, too, which is a good thing to have.

For the locale prefs, doing prefs as part of repacks is tough. There's a template prefs file we include, but I'd rather not have localizers touch that, it's too easy to put scary things in there.

Isn't a language mapping algorithm like BCP47 good enough in general? I don't mind if users have prefs to tweak things on top of that, but I don't think we should have prefs for the simple case of walking up a set of language tags. We currently don't have a reusable component implementing bcp47 sadly, but bug 356038 could add that at some point. Right now, going in chunks of "en-US", "en", 'nothing' should do the trick, I think.
Comment on attachment 527288 [details] [diff] [review]
part 3.1 - build support to install hyphenation patterns in the product

I'd make 

+PATTERN_FILES = $(strip $(wildcard $(LOCALE_SRCDIR)/hyphenation/*.dic))

list both the en-US and the locale pattern files so that we ship the en-US ones by default.

r=me with that.
Attachment #527288 - Flags: review?(l10n) → review+
I read through rapidly all the comments and didn't read about one concern: bloating.
This is not really about the issue discussed here, but still. Adding en-US dic to l10n locale will bloat Firefox a little more. I remember times where size packages matters, and it was around 6 MB for Windows in these days, then we added l10n dictionaries (1 MB more or so). We are now around 12 MB.
Just ignore my comment if this doesn't matter anymore.
If it does, I will let the "guys who know" file the bug in the right component.
Sorry if made unncessary noise in this bug.
We worry about bloat, but adding features generally requires adding to the download size. In this case we expect adding the en-US dictionary to add about 40K to the download, see comment #38.
Comment on attachment 527486 [details] [diff] [review]
part 4 - implement nsHyphenationManager and nsHyphenator - updated to use nsIAtom instead of strings in the API

Review of attachment 527486 [details] [diff] [review]:

r=me
Attachment #527486 - Flags: review?(smontagu) → review+
Comment on attachment 527486 [details] [diff] [review]
part 4 - implement nsHyphenationManager and nsHyphenator - updated to use nsIAtom instead of strings in the API

Requesting sr for the new APIs here.
Attachment #527486 - Flags: superreview?(dbaron)
Attachment #527486 - Flags: superreview?(dbaron) → superreview?(roc)
Comment on attachment 527486 [details] [diff] [review]
part 4 - implement nsHyphenationManager and nsHyphenator - updated to use nsIAtom instead of strings in the API

Review of attachment 527486 [details] [diff] [review]:

::: intl/hyphenation/public/nsHyphenator.h
@@ +49,5 @@
+public:
+  nsHyphenator(nsIFile *aFile);
+
+  nsrefcnt AddRef(void) {
+    NS_PRECONDITION(PRInt32(mRefCnt) >= 0, "illegal refcnt");

Use NS_INLINE_DECL_REFCOUNTING

::: intl/hyphenation/src/hnjalloc.h
@@ +43,5 @@
+#include "prmem.h"
+
+#define hnj_malloc(size)      PR_Malloc(size)
+#define hnj_realloc(p, size)  PR_Realloc(p, size)
+#define hnj_free(p)           PR_Free(p)

Use mozalloc.h and moz_malloc or moz_xmalloc. If you think libhyphen is very robust about checking malloc return values for OOM, then use moz_malloc, otherwise use moz_xmalloc.
Attachment #527486 - Flags: superreview?(roc) → superreview+
(In reply to comment #58)

> Use NS_INLINE_DECL_REFCOUNTING

Done; although I was thinking we might eventually want to enhance this to have a time-based cache that discards hyphenation tables that haven't been used recently, if memory usage becomes a concern. For that, I think we'd want a custom Release implementation so as to track last-used time. But we can do that in a followup bug if it seems necessary.

> Use mozalloc.h and moz_malloc or moz_xmalloc. If you think libhyphen is very
> robust about checking malloc return values for OOM, then use moz_malloc,
> otherwise use moz_xmalloc.

Done. libhyphen does not check allocator return values, it assumes that hnj_{m,re}alloc  are infallible (normally by using its own wrappers), so they're mapped to the moz_x* versions here.


http://hg.mozilla.org/mozilla-central/rev/852a668d69b9
http://hg.mozilla.org/mozilla-central/rev/fb4db64ca8b8
http://hg.mozilla.org/mozilla-central/rev/861b8bbc25b5
http://hg.mozilla.org/mozilla-central/rev/a5877159c9a1
http://hg.mozilla.org/mozilla-central/rev/3f96f2d3dffe
http://hg.mozilla.org/mozilla-central/rev/cf41ba95de85
http://hg.mozilla.org/mozilla-central/rev/3fd770ef6a65
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
I backed out parts 3.1 (the build support for including the hyphenation dict) and 6 (tests) because of orange in l10n-check:

make libs
make[4]: Entering directory `/builds/slave/cen-lnx/build/obj-firefox/intl/locales'
/builds/slave/cen-lnx/build/obj-firefox/config/nsinstall -R    ../../dist/xpi-stage/locale-x-test/hyphenation
usage: /builds/slave/cen-lnx/build/obj-firefox/config/nsinstall [-C cwd] [-L linkprefix] [-m mode] [-o owner] [-g group]
                                                                [-DdltR] file [file ...] directory
make[4]: make[4]: Leaving directory `/builds/slave/cen-lnx/build/obj-firefox/intl/locales'
*** [libs] Error 2
NEXT ERROR make[3]: *** [default] Error 2
make[3]: Leaving directory `/builds/slave/cen-lnx/build/obj-firefox/intl/locales'
make[2]: Leaving directory `/builds/slave/cen-lnx/build/obj-firefox/browser/locales'
make[2]: *** [libs-x-test] Error 2
make[1]: *** [l10n-check] Error 2
make[1]: Leaving directory `/builds/slave/cen-lnx/build/obj-firefox/browser/locales'
make: *** [l10n-check] Error 2
program finished with exit code 2

As such, the feature is currently disabled - the code is still in the tree, but with no dictionary, it won't actually do anything.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Re-landed after fixing the path in the makefile so that it won't fail l10n-check this time:
http://hg.mozilla.org/mozilla-central/rev/5cd93fcc75d9
http://hg.mozilla.org/mozilla-central/rev/4ed2169f5b34
Status: REOPENED → RESOLVED
Closed: 13 years ago13 years ago
Resolution: --- → FIXED
Documentation:

https://developer.mozilla.org/en/CSS/hyphens

I've composed an example but am awaiting a build with this working before inserting it into the page. The Nightly I'm running now doesn't have it yet.

Also mentioned on Firefox 6 for developers.
(In reply to comment #62)
> Documentation:
> 
> https://developer.mozilla.org/en/CSS/hyphens

Thanks! :)

A couple of notes. First a nit: there's a missing "e" on "The" at the very beginning!

You might want to reference CSS3 Text in addition to Generated-Content; I'm not sure where the final spec will settle down, though personally I think Text is a more appropriate home for it.

  http://www.w3.org/TR/2011/WD-css3-text-20110412/#hyphens

Also, it may be worth pointing out that the "auto" setting depends on appropriate language tagging, as the proper places to break words depend on the language. So Gecko (at least) will not do any automatic hyphenation unless the page/element has an appropriate 'lang' attribute; merely setting "-moz-hyphens:auto" with no 'lang' attribute will have no visible effect.

(As of now we only provide support for auto-hyphenating English, but we're looking at ways to include support for additional languages.)
So, I'm not totally comfortable with the choice of only shipping en-US hyphenation support by default.  This is something which affects the layout of web content, and I don't think that we should limit the language specific layout support to only the language the browser UI is in.  If you're a multilingual person, you almost definitely do not run two parallel browsers one in each language.  And we do not restrict other language specific layout features (such as shaping) to only browsers in some locales.  (I know that this example is kind of extreme, but I'm talking in terms of what an average user expects).

So, I'm wondering, is the built-in hyphenation dictionaries choice made in this bug a good one?
(In reply to comment #64)
> So, I'm wondering, is the built-in hyphenation dictionaries choice made in this
> bug a good one?

I think we should have more bugs on a good default set of hyphenation dicts to ship, plus good localized defaults, plus a good infra to detect if we should download and use additional dicts by demand.
(In reply to comment #65)
> (In reply to comment #64)
> > So, I'm wondering, is the built-in hyphenation dictionaries choice made in this
> > bug a good one?
> 
> I think we should have more bugs on a good default set of hyphenation dicts to
> ship, plus good localized defaults, plus a good infra to detect if we should
> download and use additional dicts by demand.

Is that considered to be something that should be addressed before we can ship this in Firefox?
(In reply to comment #64)
> So, I'm not totally comfortable with the choice of only shipping en-US
> hyphenation support by default.  This is something which affects the layout of
> web content, and I don't think that we should limit the language specific
> layout support to only the language the browser UI is in.

There are multiple, somewhat conflicting goals here:
1) Avoid making page layout depend on the browser UI locale.
2) Support auto-hyphenation for the preferred language(s) of each user.
3) Acceptable Firefox download bloat.

What's currently on trunk satisfies #1 and #3. I think this something we can ship; supporting auto-hyphenation for English only is better than not supporting it for any language, IMHO.

I'm not sure what we should do beyond that, though. That's a discussion we should probably have on dev.platform.
(In reply to comment #67)
> There are multiple, somewhat conflicting goals here:
> 1) Avoid making page layout depend on the browser UI locale.
> 2) Support auto-hyphenation for the preferred language(s) of each user.
> 3) Acceptable Firefox download bloat.
> 
> What's currently on trunk satisfies #1 and #3.

Just to clarify: the reason the current code satisfies #1 is that we ship the en-US dictionary for *every* locale (and never ship any other dictionaries).
Varying the hyphenation by the local would leak the locale.
Keywords: privacy
Whiteboard: [sg:audit]
Don't we already leak the locale in the Accept-Language header?
IIRC, there are a few ways we leak the locale. I don't know how much of a problem, or not, it is considered. I just want to make sure Sid is aware of the leak.
Do we want to package this on Android, or mark the six permaorange reftests as fails-if? (Or, do the right thing, build, package, and test based on an ifdef.)
We should package it on Android.
I don't know how the Android packaging works, and what's required in order to incorporate this. I found mobile/installer/package-manifest.in and mobile/locales/Makefile.in and added appropriate entries to these (see attached patch), but this is not sufficient. I'm seeing hyphenation/en-US/hyph_en_US.dic in the $OBJDIR/dist/fennec directory, but it doesn't end up in the final package.

So could someone who knows how this should work please point me in the right direction - or just do it for me! :)
Attachment #530366 - Flags: feedback?(pavlov)
Blocks: 655078
How big are the dictionaries? We explicitly remove spellcheck dictionaries in mobile due to size issues and the fact that the spellcheck UI is not 100% functional.
They vary greatly, from a few K to several hundred K. The en-US dict (currently the only one in the tree; we're looking into how best to handle other locales) is just over 100K, but compresses to about 42K when zipped.
I can try the patch and see how it works. It looks OK.
(In reply to comment #70)
> Don't we already leak the locale in the Accept-Language header?

The value of the Accept-Language header can be customized by the user, and the UI locale is not its only source.

That said, if we choose to ship en-US in every locale, this won't be a concern, except for the locales which ship their own version of the dictionary in addition to the en-US one.
(In reply to comment #75)
> How big are the dictionaries? We explicitly remove spellcheck dictionaries in
> mobile due to size issues and the fact that the spellcheck UI is not 100%
> functional.

This is probably more important than spellcheck since a) it affects layout and b) autohyphenation would be especially useful on mobile to make text fit in a narrow space.
(In reply to comment #67)
> (In reply to comment #64)
> > So, I'm not totally comfortable with the choice of only shipping en-US
> > hyphenation support by default.  This is something which affects the layout of
> > web content, and I don't think that we should limit the language specific
> > layout support to only the language the browser UI is in.
> 
> There are multiple, somewhat conflicting goals here:
> 1) Avoid making page layout depend on the browser UI locale.
> 2) Support auto-hyphenation for the preferred language(s) of each user.
> 3) Acceptable Firefox download bloat.
> 
> What's currently on trunk satisfies #1 and #3. I think this something we can
> ship; supporting auto-hyphenation for English only is better than not
> supporting it for any language, IMHO.
> 
> I'm not sure what we should do beyond that, though. That's a discussion we
> should probably have on dev.platform.

Posted a thread to dev.platform.
Depends on: 655198
Comment on attachment 530366 [details] [diff] [review]
wip - package the hyphenation dictionary on android - incomplete

You missed a spot packager.mk

Android omnijar packaging is different and separate from other omnijar packaging. I tweaked your patch and got the hyphen dict files in the package. I sent the patch to the try server to see if the reftests clear up.
Attachment #530366 - Flags: feedback?(pavlov) → feedback-
Aha, thanks - I figured there must be something like that I'd missed.

In a local build, I now see the dict in the package as expected. However, testing on my device still doesn't actually work - maybe the code ends up looking in the wrong directory or something.
(In reply to comment #82)
> Aha, thanks - I figured there must be something like that I'd missed.
> 
> In a local build, I now see the dict in the package as expected. However,
> testing on my device still doesn't actually work - maybe the code ends up
> looking in the wrong directory or something.

I also found a place in the code where we unpack folders out of the omnijar file on Android.

Do we have any easily accessible test pages to try hyphenation? I already failed once to get any reftests to run on try and I don't have an easy way to try to run the Android reftests myself locally.
Not sure of a better place to ask, but are we planning to support reading the hyphen dict from a locale JAR file? It seems natural that the locale specific dictionary could be shipped with a locale itself. It looks like the code is currently designed to only read from extracted, local files - using the GRE or APP folder as the root.
(In reply to comment #83)
> Do we have any easily accessible test pages to try hyphenation? I already
> failed once to get any reftests to run on try

(That sounds like bug 655046.)

I've put a page at http://people.mozilla.org/~jkew/hyph.html that should demonstrate hyphenation, if it's working.
(In reply to comment #84)
> Not sure of a better place to ask, but are we planning to support reading
> the hyphen dict from a locale JAR file? It seems natural that the locale
> specific dictionary could be shipped with a locale itself. It looks like the
> code is currently designed to only read from extracted, local files - using
> the GRE or APP folder as the root.

What do we do for spelling dictionaries? ISTM that we probably want a similar strategy for both these kinds of resources. Perhaps it's time to spin off a new bug on that....
This patch builds on Jonathan's but adds the hyphenation folder to the Android specific omnijar packager and then extracts the hyphenation folder out of the omnijar file on the Android device, so the toolkit code can access it.

Using this patch I can see hyphenation working on http://people.mozilla.org/~jkew/hyph.html
Attachment #530366 - Attachment is obsolete: true
Attachment #530621 - Flags: review?(blassey.bugs)
Comment on attachment 530621 [details] [diff] [review]
Android packaging patch

Review of attachment 530621 [details] [diff] [review]:
-----------------------------------------------------------------

I'd rather not extract these dictionaries from the omnijar, but this will do for now. Please file a follow up bug
Attachment #530621 - Flags: review?(blassey.bugs) → review+
pushed Android packaging patch:
http://hg.mozilla.org/mozilla-central/rev/de45da66d672

Filed bug 655337 for loading via a JAR friendly method
Filed https://bugzilla.mozilla.org/show_bug.cgi?id=656750 for enhancements on hyphenation.
Marking this tracking+, but it seems like it should pick up status-firefox6:fixed as soon as that flag is available (today, ideally)
Target Milestone: --- → mozilla6
Blocks: 672320
(In reply to comment #21)
> Who's going to replace the hyphenator with a JS implementation? An
> extension? Seems unlikely. We shouldn't take complexity and perf hits in our
> code for an extension point that is unlikely to be used. We should implement
> the simple thing and then if demand arises for an extension point, provide
> that as an option.
>
It's not so unlikely at all. There are currently set of speller extensions for smaller languages like Finnish or Estonian that provide much better quality than hunspell.
I would be very happy if there would be an extension interface for hyphenation (like there is mozISpellCheckingEngine for speller) and kindly ask you to consider this for next update.
Blocks: 677347
(In reply to Rene from comment #92)
> There are currently set of speller extensions
> for smaller languages like Finnish or Estonian that provide much better
> quality than hunspell.

What extensions are needed for Finnish or Estonian? What are their requirements that current hyphenation algorithm is not able to satisfy? (I'm interested to know what hyphenation points TeX is missing.)
(In reply to Mojca Miklavec from comment #93)
> What extensions are needed for Finnish or Estonian? What are their
> requirements that current hyphenation algorithm is not able to satisfy? (I'm
> interested to know what hyphenation points TeX is missing.)
Proper Estonian hyphenation is a combination of rules (for base words) and morphological analysis (for compounds). TeX handles the first part nicely, but can't handle compounds. I think having TeX hyphenation as a standard solution is acceptable (as Mozilla's first priority should be browsing, not language technology), but there should be an extension point here to implement language specific better quality options.
Depends on: 685214
Depends on: 770555
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: