Closed Bug 672320 Opened 13 years ago Closed 10 years ago

add hyphenation resources for more locales

Categories

(Core :: Internationalization, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla8

People

(Reporter: jfkthame, Assigned: jfkthame)

References

(Blocks 2 open bugs)

Details

(Keywords: dev-doc-complete)

Attachments

(31 files, 3 obsolete files)

This is a followup to bug 253317, which provided the code to support auto-hyphenation, and the hyphenation patterns for en-US.

There are patterns for around 25 more languages available in the TeX community under the LPPL license. As per bug 656248 comment 9, we can use these as a basis for hyphenation files in Gecko, provided we follow proper licensing procedures.

The TeX patterns require a preprocessing and packaging step to make them ready for libhyphen use. This will prevent our versions serving as a direct replacement for the original files, which means we will abide by the LPPL requirements for relicensing a derived work.
Depends on: 672472
Component: Layout: Text → Internationalization
QA Contact: layout.fonts-and-text → i18n
As an initial test case, this adds Swedish hyphenation patterns (adapted from those used in TeX). We should be able to process a couple dozen more languages following exactly the same pattern, but I figured it would be simpler to review a single one first.

Gerv, please take a look at the LICENSE file and check that it meets your expectations. Note that the header lines that have been added to the patterns, and the fact that they're stripped of the TeX \patterns{...} markup, mean that this is not usable as a direct replacement for the original work. (There's also been a substring-merging operation to meet libhyphen requirements, but that would not in itself affect TeX use, so it's not relevant to the relicensing requirements.)
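To make the conversion described above concrete, here is a minimal sketch of the "strip the \patterns{...} markup and add header lines" step. The function name, the comment-stripping rule, and the header contents are assumptions for illustration; this is not the actual conversion script, and the substring-merging step is not implemented.

```javascript
// Sketch only: extract the pattern list from a TeX \patterns{...} block and
// emit one pattern per line after caller-supplied header lines.
// texPatternsToPlainList is a hypothetical name, not part of any real tool.
function texPatternsToPlainList(texSource, headerLines) {
  // Drop TeX comments: everything from an unescaped % to the end of the line.
  const withoutComments = texSource
    .split("\n")
    .map((line) => line.replace(/(^|[^\\])%.*$/, "$1"))
    .join("\n");

  // Capture the body of the first \patterns{...} block.
  const match = /\\patterns\s*\{([^}]*)\}/.exec(withoutComments);
  if (!match) {
    throw new Error("no \\patterns{...} block found");
  }

  // Patterns are whitespace-separated inside the block.
  const patterns = match[1].split(/\s+/).filter((p) => p.length > 0);

  // Header lines first (assumed to be supplied by the caller), then patterns.
  return headerLines.concat(patterns).join("\n") + "\n";
}
```

Because the output carries extra header lines and no longer has the TeX wrapper, it cannot be dropped back into a TeX installation in place of the original file, which is the property the relicensing argument relies on.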
Attachment #546743 - Flags: review?(smontagu)
Attachment #546743 - Flags: feedback?
Attachment #546743 - Flags: feedback? → feedback?(gerv)
The general consensus seems to be that we should ship all available (subject to licensing) hyphenation resources in the default build, to provide the most uniform behavior and to minimize fingerprintability.
Attachment #546746 - Flags: review?(smontagu)
Comment on attachment 546743 [details] [diff] [review]
hyphenation patterns for Swedish

Review of attachment 546743 [details] [diff] [review]:
-----------------------------------------------------------------

rs=me
Attachment #546743 - Flags: review?(smontagu) → review+
Attachment #546746 - Flags: review?(smontagu) → review+
Attachment #546747 - Flags: review?(smontagu) → review+
Comment on attachment 546743 [details] [diff] [review]
hyphenation patterns for Swedish

Blimey, that's complicated! What about "(more information to be added later)"? Other than that, looks OK.

Gerv
Attachment #546743 - Flags: feedback?(gerv) → feedback+
(In reply to comment #5)
> What about "(more information to be added later)"?

That comes directly from the upstream packages I'm using; e.g. see http://tug.org/svn/texhyphen/trunk/hyph-utf8/tex/generic/hyph-utf8/patterns/txt/hyph-sv.lic.txt?revision=570&view=markup for the Swedish license file.

I'm using a conversion process that takes our tri-license plus the extra linking wording, and simply appends the old *.lic.txt file verbatim. So if that file contains oddities like this (many of them do!), they'll be preserved as-is.
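As a rough illustration of that assembly step, the sketch below joins the three pieces in the order described. The function and parameter names are hypothetical, and the blank-line separators are an assumption; only the "append the upstream file verbatim" behavior comes from the description above.

```javascript
// Sketch only: assemble a LICENSE text from our own license header, the extra
// linking wording, and the upstream *.lic.txt content.
// buildLicenseText and its parameter names are hypothetical.
function buildLicenseText(triLicenseText, linkingNotice, upstreamLicText) {
  // Our own sections are trimmed so the separators stay uniform; the upstream
  // text is appended unmodified, so any oddities in it are preserved as-is.
  return [triLicenseText.trimEnd(), linkingNotice.trimEnd(), upstreamLicText].join("\n\n");
}
```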
These are the most straightforward cases: patterns licensed under LPPL (so we can relicense our derived work), and tagged with simple language codes. Languages where more than one set of patterns is available, or where there are other complications, will be handled individually so we can review the locale tagging more carefully.
Attachment #547010 - Flags: review?(smontagu)
Just a simple testcase for each supported language, to make sure the patterns are found and used as expected.
Attachment #547011 - Flags: review?(smontagu)
Comment on attachment 547010 [details] [diff] [review]
add a bunch more hyphenation locales derived from TeX patterns

rs=me.

I'm assuming that it doesn't matter whether a specific locale appears in our own locale properties files. kmr doesn't yet (though bug 666662 will add it).
Attachment #547010 - Flags: review?(smontagu) → review+
Comment on attachment 547011 [details] [diff] [review]
basic reftests for the additional patterns

Review of attachment 547011 [details] [diff] [review]:
-----------------------------------------------------------------
Attachment #547011 - Flags: review?(smontagu) → review+
(In reply to comment #9)

> I'm assuming that it doesn't matter whether a specific locale appears in our
> own locale properties files.

Right, this doesn't affect use of the resource. All that's required is for the lang="xxx" tag on the page to match the locale code of the pattern file.
Attached patch hyphenation patterns for German (obsolete) — Splinter Review
For German, there are separate patterns for traditional and reformed orthographies, and for Swiss German. The hyphenation-alias pref is used to select the reformed (1996) patterns as the default for data that is just tagged as lang="de" without a more specific subtag.

(The handling of these tags should probably be updated as part of the BCP47 effort, but for now the hyphenation manager just uses its own limited alias/wildcard scheme.)
Attachment #547062 - Flags: review?(smontagu)
Attached patch reftests for German hyphenation (obsolete) — Splinter Review
Attachment #547065 - Flags: review?(smontagu)
These patterns are labelled as mn-Cyrl in the TeX archives, but it seems likely some data may just be tagged as lang="mn" without the script subtag. Unless/until we also have patterns for mn-Mong (if hyphenation is even a possibility there, which seems doubtful), I think it's simplest to label this as plain "mn".

In the event that someone uses mn-Mong on a web page (whether tagged as "mn" or explicitly "mn-Mong"), this won't result in hyphenation there because the patterns simply won't match any Mongolian-script data.
Attachment #547080 - Flags: review?(smontagu)
Attachment #547081 - Flags: review?(smontagu)
The Serbo-Croatian patterns cover both Latin and Cyrillic orthographies - they are a combination of the two separate sets of LaTeX patterns. This test checks that hyphenation is working for both scripts.
Attachment #547256 - Flags: review?(smontagu)
(In reply to comment #13)
> Created attachment 547062 [details] [diff] [review] [review]
> hyphenation patterns for German

May I report "false" hyphenations here?
Comment on attachment 547282 [details]
false hyphenation war-um instead of wa-rum

><style>
>body {
>   width: 8em;
>   -moz-hyphens: auto;
>   word-wrap: break-word;
>}
></style>
><p lang = "de">
>Warum
>Warum
>Warum
>Warum
>Warum
>Warum
>Warum
>Warum
>Warum
>Warum
>Warum
>Warum
>Warum
>Warum
>Warum
>Warum
Attachment #547282 - Attachment mime type: text/plain → text/html
(In reply to comment #22)
> Created attachment 547282 [details]
> false hyphenation war-um instead of wa-rum

The hyphenation "war-um" is shown in the wordlist at http://repo.or.cz/w/wortliste.git, which is the upstream source for the German patterns here.

(I'm not a German expert, but I suspect this may be a case where the "correct" hyphenation is open to question, perhaps depending on whether you prefer to give more weight to morphology or phonology, or on other factors.)

It would be best to raise this issue with Werner Lemberg, the author/maintainer of the resources we're using (see the intl/locales/de-1996/hyphenation/LICENSE file); you could file a separate Mozilla bug to track the issue so that it doesn't get lost in the meantime, but it's not really practical to debate and alter individual hyphenations here. If there are problems with the patterns for a particular language, this should be addressed upstream where the patterns are maintained, and then a new revision imported into our codebase.
BTW, the dictionary at http://dict.tu-chemnitz.de lists "Word division: wa·r·um", which seems to imply that either "wa-rum" or "war-um" could be permissible.
Comment on attachment 547062 [details] [diff] [review]
hyphenation patterns for German

Review of attachment 547062 [details] [diff] [review]:
-----------------------------------------------------------------

::: modules/libpref/src/init/all.js
@@ +1115,5 @@
> +// use reformed (1996) German patterns by default unless specifically tagged as de-1901
> +// (these prefs may soon be obsoleted by better BCP47-based tag matching, but for now...)
> +pref("intl.hyphenation-alias.de", "de-1996");
> +pref("intl.hyphenation-alias.de-*", "de-1996");
> +pref("intl.hyphenation-alias.de-DE-1901", "de-1901");

Do we not need a pref entry for de-CH and/or de-CH-*?
Comment on attachment 547065 [details] [diff] [review]
reftests for German hyphenation

Review of attachment 547065 [details] [diff] [review]:
-----------------------------------------------------------------

Here again I'd like to see a test for de-CH
Comment on attachment 547080 [details] [diff] [review]
hyphenation patterns for Mongolian (Cyrillic script)

Review of attachment 547080 [details] [diff] [review]:
-----------------------------------------------------------------

rs=me
Attachment #547080 - Flags: review?(smontagu) → review+
Comment on attachment 547081 [details] [diff] [review]
reftest for Mongolian hyphenation

Review of attachment 547081 [details] [diff] [review]:
-----------------------------------------------------------------
Attachment #547081 - Flags: review?(smontagu) → review+
Comment on attachment 547254 [details] [diff] [review]
add patterns for Serbo-Croatian (covering Serbian & Bosnian lang tags)

Review of attachment 547254 [details] [diff] [review]:
-----------------------------------------------------------------
Attachment #547254 - Flags: review?(smontagu) → review+
Comment on attachment 547256 [details] [diff] [review]
reftest for Serbian hyphenation

Review of attachment 547256 [details] [diff] [review]:
-----------------------------------------------------------------
Attachment #547256 - Flags: review?(smontagu) → review+
(In reply to comment #26)
> Do we not need a pref entry for de-CH and/or de-CH-*?

Not for de-CH, because we have a resource by that name; but you're right, we should have de-CH-* aliased to de-CH.

And now that you mention it, we should be adding an alias for xx-* in each of the "simple" cases where we have patterns for language xx. Otherwise, if (for example) text is tagged with lang="pt-PT" or "pt-BR" instead of plain "pt", the patterns won't be found.

I'll put up a patch to add those en masse. (Again, I anticipate that upcoming BCP47 work may supersede this, but for now...)
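For readers following the alias discussion, here is a minimal sketch of how a lang tag might be resolved against the available pattern resources and the intl.hyphenation-alias.* entries. The matching order (exact resource, then exact alias, then the longest "prefix-*" wildcard) and the case-insensitive comparison are assumptions for illustration; the real logic lives in Gecko's hyphenation manager and may differ.

```javascript
// Sketch only: resolve a lang tag to the locale code of a pattern resource.
// `available` lists locale codes that have pattern files; `aliases` maps alias
// keys (exact tags or "prefix-*" wildcards) to locale codes.
// resolveHyphenationLocale is a hypothetical name, not a Gecko API.
function resolveHyphenationLocale(langTag, available, aliases) {
  const tag = langTag.toLowerCase();
  const lowerAvailable = new Map(available.map((code) => [code.toLowerCase(), code]));
  const lowerAliases = new Map(
    Object.entries(aliases).map(([key, value]) => [key.toLowerCase(), value])
  );

  // 1. A resource named exactly like the tag is used directly.
  if (lowerAvailable.has(tag)) return lowerAvailable.get(tag);

  // 2. An exact alias entry.
  if (lowerAliases.has(tag)) return lowerAliases.get(tag);

  // 3. Wildcard aliases, longest prefix first:
  //    "de-ch-1901" tries "de-ch-*" before "de-*".
  const subtags = tag.split("-");
  for (let n = subtags.length - 1; n >= 1; n--) {
    const key = subtags.slice(0, n).join("-") + "-*";
    if (lowerAliases.has(key)) return lowerAliases.get(key);
  }

  // No patterns: the text is simply not auto-hyphenated.
  return null;
}

// The German entries discussed in this bug, used as sample data.
const germanAliases = {
  "de": "de-1996",
  "de-*": "de-1996",
  "de-DE-1901": "de-1901",
  "de-CH-*": "de-CH",
};
const germanResources = ["de-1901", "de-1996", "de-CH"];
```

With this sample data, resolveHyphenationLocale("de-AT", germanResources, germanAliases) falls through to the "de-*" wildcard, while "de-CH" is found directly as a resource, which is why no alias is needed for it.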
This adds a testcase using de-CH. (It actually yields the same hyphen positions as standard German in this case. There are differences in the patterns, but I don't know enough about them to readily identify specific words that should turn out different.)
Attachment #547065 - Attachment is obsolete: true
Attachment #547357 - Flags: review?(smontagu)
Attachment #547065 - Flags: review?(smontagu)
Attachment #547062 - Attachment is obsolete: true
Attachment #547359 - Flags: review?(smontagu)
Attachment #547062 - Flags: review?(smontagu)
This adds the appropriate aliases for the already-landed patterns. (Similar entries should be included with each additional language.)
Attachment #547362 - Flags: review?(smontagu)
(In reply to comment #25)
> either "wa-rum" or "war-um" could be permissible.

I stand corrected. According to § 113 of the rules
http://www.canoo.net/services/GermanSpelling/Amtlich/Trennung/pgf107-112.html#pgf113 both are permissible.

What about "EinsPlus" which is hyphenated as "Ein-sPlus"?
(In reply to comment #36)

> What about "EinsPlus" which is hyphenated as "Ein-sPlus"?

I assume "einsplus" is not a standard word? This illustrates that we should probably give some kind of special treatment to "words" that have CamelCasing. A similar problem arises in English; picking an arbitrary example, "CinemaScope" gets hyphenated as "Cine-maS-cope". :(

Please file a separate bug about this, however; it's not an issue of patterns for particular locales, it's something I think we should solve in a more general way.
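As one possible starting point for that more general solution, the sketch below only detects CamelCased "words" (a lowercase letter immediately followed by an uppercase one). What to do with such words, for example skipping auto-hyphenation or hyphenating each part separately, is left open here, and the function name is hypothetical.

```javascript
// Sketch only: flag words that contain an internal capital, such as
// "EinsPlus" or "CinemaScope". hasInternalCapital is a hypothetical helper.
function hasInternalCapital(word) {
  // \p{Ll} = lowercase letter, \p{Lu} = uppercase letter (Unicode-aware).
  return /\p{Ll}\p{Lu}/u.test(word);
}
```

Ordinary capitalized words such as "Warum" and all-caps words such as "USA" are not flagged by this check.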
Comment on attachment 547357 [details] [diff] [review]
reftest for German hyphenation - added de-CH

Review of attachment 547357 [details] [diff] [review]:
-----------------------------------------------------------------
Attachment #547357 - Flags: review?(smontagu) → review+
Comment on attachment 547359 [details] [diff] [review]
hyphenation patterns for German - added de-CH-* alias

Review of attachment 547359 [details] [diff] [review]:
-----------------------------------------------------------------
Attachment #547359 - Flags: review?(smontagu) → review+
Comment on attachment 547362 [details] [diff] [review]
add missing hyph-aliases for "xx-*" -> "xx"

Review of attachment 547362 [details] [diff] [review]:
-----------------------------------------------------------------
Attachment #547362 - Flags: review?(smontagu) → review+
Keywords: dev-doc-needed
The French patterns are not LPPL-licensed, but have a simple "free" statement that permits redistribution and modified versions; see LICENSE:
+% This file is available for free and can used and redistributed
+% asis for free. Modified versions should have another name.

Gerv, flagging you for feedback so you have a chance to confirm that this is OK.
Attachment #547632 - Flags: review?(smontagu)
Attachment #547632 - Flags: feedback?(gerv)
Attachment #547633 - Flags: review?(smontagu)
Gerv, please check whether the licensing terms are OK here as well. The original terms (included verbatim in our LICENSE file) say:

% Unlimited copying and redistribution of this file
% is permitted so long as the file is not modified
% in any way.
%
% Modifications may be made for private purposes (though
% this is discouraged, as it could result in documents
% hyphenating differently on different systems) but if
% such modifications are re-distributed, the modified
% file must not be capable of being confused with the
% original.  In particular, this means
%
%(a) the filename (the portion before the extension, if any)
%    must not match any of :
%
%        UKHYPH                  UK-HYPH
%        UKHYPHEN                UK-HYPHEN
%        UKHYPHENS               UK-HYPHENS
%        UKHYPHENATION           UK-HYPHENATION
%        UKHYPHENISATION         UK-HYPHENISATION
%        UKHYPHENIZATION         UK-HYPHENIZATION
%
%   regardless of case, and
%
%(b) the file must contain conditions identical to these,
% except that the modifier/distributor may, if he or she
% wishes, augment the list of proscribed filenames.

which looks to me like it covers our situation.
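Condition (a) can be checked mechanically. The sketch below tests a filename against the proscribed list quoted above, ignoring case. Treating the text before the last dot as "the portion before the extension" is an assumption, and the function name and the example filename in the note below are hypothetical.

```javascript
// Sketch only: check condition (a) of the ukhyphen terms quoted above.
// The list is copied from the quoted license text.
const PROSCRIBED_STEMS = new Set([
  "UKHYPH", "UK-HYPH",
  "UKHYPHEN", "UK-HYPHEN",
  "UKHYPHENS", "UK-HYPHENS",
  "UKHYPHENATION", "UK-HYPHENATION",
  "UKHYPHENISATION", "UK-HYPHENISATION",
  "UKHYPHENIZATION", "UK-HYPHENIZATION",
]);

// isProscribedFilename is a hypothetical helper name.
function isProscribedFilename(filename) {
  // Assumption: the extension starts at the last dot, if there is one
  // after the first character.
  const dot = filename.lastIndexOf(".");
  const stem = dot > 0 ? filename.slice(0, dot) : filename;
  // "regardless of case"
  return PROSCRIBED_STEMS.has(stem.toUpperCase());
}
```

Under these assumptions, "ukhyphen.tex" is rejected, while a name in a different scheme, such as the hypothetical "hyph_en_GB.dic", passes.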
Attachment #547649 - Flags: review?(smontagu)
Attachment #547649 - Flags: feedback?(gerv)
Attachment #547677 - Flags: review?(smontagu)
Russian patterns - distributed under LPPL 1.2+. (Should have been included with the rest of the LPPL-licensed languages, but the license file had slightly different phrasing and my simple grep missed it.)
Attachment #547680 - Flags: review?(smontagu)
Attachment #547682 - Flags: review?(smontagu)
There are two versions of Norwegian, "nb" (Bokmål) and "nn" (Nynorsk). The macrolanguage "no" is aliased to "nb" on the grounds that this is the more widely used written form.
Attachment #547688 - Flags: review?(smontagu)
Comment on attachment 547688 [details] [diff] [review]
pt 12.1 - patterns for Norwegian

Also tagging Gerv for feedback, just to double-check the licensing is OK. The original files include the statement:

% Copyright (C) 2007 Karl Ove Hufthammer.
% Copying and distribution of this file, with or without modification,
% are permitted in any medium without royalty, provided the copyright
% notice and this notice are preserved.

which seems pretty clear.
Attachment #547688 - Attachment description: patterns for Norwegian → pt 12.1 - patterns for Norwegian
Attachment #547688 - Flags: feedback?(gerv)
Attached patch pt 12.2 - tests for Norwegian (obsolete) — Splinter Review
Attachment #547689 - Flags: review?(smontagu)
French: OK.
Norwegian: OK.

For both of these, leave the license the same as now. Unlike the LPPL, there is no need to make a change.

UK English: less clear. It depends what it means by "modifications may be made for private purposes" and then talking about distributing them! Do you have precedent for free software redistribution (e.g. by Debian)?

Gerv
(In reply to comment #51)
> French: OK.
> Norwegian: OK.
> 
> For both of these, leave the license the same as now. Unlike the LPPL, there
> is no need to make a change.

OK, thanks.

> UK English: less clear. It depends what it means by "modifications may be
> made for private purposes" and then talking about distributing them!

The only logical interpretation I can come up with (aside from that they're trying to discourage tampering in general) is that for private purposes, the file could be modified "in-place" so as to directly alter the behavior of the overall system (i.e. without a renaming requirement); whereas if you decide to distribute something that is modified, you MUST rename/document/etc so as to avoid the possibility of confusion with the canonical version.

> Do you
> have precedent for free software redistribution (e.g. by Debian)?

I don't think Debian distributes this, but AFAIK their primary reason is that the "source" (the OUP list of hyphenated words from which the patterns were derived) is not available, rather than a licensing concern.

OTOH, OpenOffice distributes a derived version (see http://wiki.services.openoffice.org/wiki/Dictionaries#English_.28AU.2CCA.2CGB.2CNZ.2CUS.2CZA.29) that is relicensed under LGPL.
Oh, and there's the most obvious free software distribution of this stuff - the TeX Live collection. See http://tug.org/texlive/copying.html for the top-level summary of their license conditions.

MikTeX also distributes it, and so apparently considers it free software (see http://miktex.org/copying).
Aha, I found some archived discussion of the UK hyphenation patterns and their (strange) license; see http://forum.soft32.com/linux/Strange-license-ukhyphen-ftopict290515.html. The main point of debate there seems to centre around the renaming requirement for modified versions, and in particular the inclusion of a list of prohibited "similar" filenames.
In this case, the license documentation in hyph-utf8 was less complete, but checking the Lithuanian TeX package that is upstream of that repackaging confirms that the patterns are LPPL-licensed.
Attachment #547798 - Flags: review?(smontagu)
Attachment #547799 - Flags: review?(smontagu)
Attachment #547799 - Attachment is patch: true
Attachment #547907 - Flags: review?(smontagu)
Attachment #547907 - Attachment description: patterns for Finnish → pt 14.1 - patterns for Finnish
Attachment #547908 - Flags: review?(smontagu)
Depends on: 673704
Comment on attachment 547632 [details] [diff] [review]
pt 9.1 - patterns for French

Review of attachment 547632 [details] [diff] [review]:
-----------------------------------------------------------------

rs=me
Attachment #547632 - Flags: review?(smontagu) → review+
Comment on attachment 547633 [details] [diff] [review]
pt 9.2 - test for French hyphenation

Review of attachment 547633 [details] [diff] [review]:
-----------------------------------------------------------------

::: layout/reftests/text/reftest.list
@@ +126,4 @@
>  == auto-hyphenation-mn-1.html auto-hyphenation-mn-1-ref.html
>  == auto-hyphenation-sh-1.html auto-hyphenation-sh-1-ref.html
>  == auto-hyphenation-sr-1.html auto-hyphenation-sr-1-ref.html
> +== auto-hyphenation-fr-1.html auto-hyphenation-fr-1-ref.html

Nit: life will be easier in the future if you alphabetize the list of tests. (Ditto the prefs in all.js, if they aren't already)
Attachment #547633 - Flags: review?(smontagu) → review+
Comment on attachment 547649 [details] [diff] [review]
pt 10.1 - British English patterns

Review of attachment 547649 [details] [diff] [review]:
-----------------------------------------------------------------
Attachment #547649 - Flags: review?(smontagu) → review+
Comment on attachment 547677 [details] [diff] [review]
pt 10.2 - tests for British English patterns

Review of attachment 547677 [details] [diff] [review]:
-----------------------------------------------------------------
Attachment #547677 - Flags: review?(smontagu) → review+
Comment on attachment 547680 [details] [diff] [review]
pt 11.1 - hyphenation patterns for Russian

Review of attachment 547680 [details] [diff] [review]:
-----------------------------------------------------------------
Attachment #547680 - Flags: review?(smontagu) → review+
Comment on attachment 547682 [details] [diff] [review]
pt 11.2 - test for Russian patterns

Review of attachment 547682 [details] [diff] [review]:
-----------------------------------------------------------------
Attachment #547682 - Flags: review?(smontagu) → review+
Comment on attachment 547688 [details] [diff] [review]
pt 12.1 - patterns for Norwegian

Review of attachment 547688 [details] [diff] [review]:
-----------------------------------------------------------------

::: modules/libpref/src/init/all.js
@@ +1149,5 @@
>  pref("intl.hyphenation-alias.sr-*", "sh");
>  pref("intl.hyphenation-alias.bs-*", "sh");
>  
> +// Norwegian has two forms, Bokmål and Nynorsk, with "no" as a macrolanguage encompassing both.
> +// For "no", we'll alias to "nb" (Bokmål) as that is the more widely used written form.

This one made me wonder about a general question: are these prefs overridable by l10ns? On the one hand, I suppose that will let the fingerprinting genie back out of the bottle, but on the other hand since we do have separate nb and nn l10n, I should think the nn version won't be happy with this default.
Attachment #547688 - Flags: review?(smontagu) → review+
Comment on attachment 547689 [details] [diff] [review]
pt 12.2 - tests for Norwegian

Review of attachment 547689 [details] [diff] [review]:
-----------------------------------------------------------------

Do you want to add a test for the examples that appear in the licence files as different in nb and nn (attende and betre)?
Comment on attachment 547798 [details] [diff] [review]
pt 13.1 - patterns for Lithuanian

Review of attachment 547798 [details] [diff] [review]:
-----------------------------------------------------------------
Attachment #547798 - Flags: review?(smontagu) → review+
Comment on attachment 547799 [details] [diff] [review]
pt 13.2 - test for Lithuanian patterns

Review of attachment 547799 [details] [diff] [review]:
-----------------------------------------------------------------
Attachment #547799 - Flags: review?(smontagu) → review+
Comment on attachment 547907 [details] [diff] [review]
pt 14.1 - patterns for Finnish

Review of attachment 547907 [details] [diff] [review]:
-----------------------------------------------------------------
Attachment #547907 - Flags: review?(smontagu) → review+
Comment on attachment 547908 [details] [diff] [review]
pt 14.2 - test for Finnish patterns

Review of attachment 547908 [details] [diff] [review]:
-----------------------------------------------------------------
Attachment #547908 - Flags: review?(smontagu) → review+
(In reply to comment #68)
> Do you want to add a test for the examples that appear in the licence files
> as different in nb and nn (attende and betre)?

Sure, I'll add those to the test files, with the appropriate -ref version for each.

(In reply to comment #67)
> This one made me wonder about a general question: are these prefs
> overridable by l10ns?

We discussed this a bit back when hyphenation support was initially being added. As far as I understood, I don't think there is currently any (easy?) way for l10n to override default prefs. I think this is something we ought to support for a number of reasons, not just hyphenation defaults - e.g. it ought to be possible for localizers to customize the default font settings, too.

Besides Norwegian, another example that might deserve l10n treatment is "en"; currently, this maps to "en-US", but if we add en-GB patterns then it would seem reasonable for the en-GB version to change the mapping for "en".

Probably worth opening a new bug on this specific topic, and discussing again with Pike & others. But perhaps we should allow the BCP47 dust to settle before worrying too much about this; I'm hoping we'll find ourselves with nice new BCP47-based lang/locale-matching APIs that can supersede and improve on the current hyph-alias prefs.
Gerv: any further thoughts re the UK English question (see comments 51-52)?
Have we already decided how to deliver these hyphenation resources to end users? Will they all be installed by default, will users have to download them as add-ons, or will they be downloaded automatically when needed?
Attachment #547632 - Flags: feedback?(gerv)
Attachment #547688 - Flags: feedback?(gerv)
Added the specific example words that differ between nb/nn locales.
Attachment #547689 - Attachment is obsolete: true
Attachment #548406 - Flags: review?(smontagu)
Attachment #547689 - Flags: review?(smontagu)
(In reply to comment #75)
> Have we already decided how to deliver these hyphenation resources to end
> users? Will they all be installed by default, will users have to download
> them as add-ons, or will they be downloaded automatically when needed?

The current approach is to install them all, so as to ensure consistent behavior for everyone. We are also considering other options that we could use if this becomes a problem, either due to size or because of licensing constraints on some of the resources we'd like to offer.
Comment on attachment 548406 [details] [diff] [review]
pt 12.2 - tests for Norwegian (revised)

Review of attachment 548406 [details] [diff] [review]:
-----------------------------------------------------------------
Attachment #548406 - Flags: review?(smontagu) → review+
The author of the Hungarian patterns has kindly relicensed them under the Mozilla tri-license terms (they were formerly GPL-only), so we can now include these.
Attachment #553729 - Flags: review?(smontagu)
Test for the Hungarian patterns.

Also, just realized that I forgot the hyph-alias entry for "hu-*" in the pt 15.1 patch; this will be added before landing.
Attachment #553730 - Flags: review?(smontagu)
Attachment #553887 - Flags: review?(smontagu)
Attachment #553888 - Flags: review?(smontagu)
Attachment #553889 - Flags: review?(smontagu)
Attachment #553890 - Flags: review?(smontagu)
Comment on attachment 547649 [details] [diff] [review]
pt 10.1 - British English patterns

Need more info on precedent for distribution of the UK English patterns, whose license terms are unclear (e.g. from the Debian project).

Gerv
Attachment #547649 - Flags: feedback?(gerv) → feedback-
Original files are at:

http://www.ctan.org/pkg/ukhyph
http://mirrors.ctan.org/language/hyphenation/ukhyphen.tex

But those files are not guaranteed to remain available.

If the licence needs to be resolved (there is currently just a free-text licence description), I would suggest contacting the author and also updating the hyph-utf8 repository.
Added a list of added languages to the following page. This is everything that's in the Aurora build as of today.

https://developer.mozilla.org/en/CSS/hyphens
Thanks a lot for this list.

hsb = Upper Sorbian
kmr = Kurmanji (Northern Kurdish)
de-CH = Swiss German, Traditional Orthography (Czech German - are you joking? :)
de-1901 = German, Traditional Orthography
de-1996 = German, Reformed Orthography

I think that Bokmål and Nynorsk could be written with uppercase. I slightly prefer Slovenian to Slovene as adjective/for language name, but this can be an infinite debate/it's a cosmetic issue anyway.
Fixed those. That "Czech German" thing, I dunno where that came from. Erp.
Attachment #553729 - Flags: review?(smontagu) → review+
Attachment #553730 - Flags: review?(smontagu) → review+
Attachment #553887 - Flags: review?(smontagu) → review+
Attachment #553888 - Flags: review?(smontagu) → review+
Attachment #553889 - Flags: review?(smontagu) → review+
Attachment #553890 - Flags: review?(smontagu) → review+
(In reply to Mojca Miklavec from comment #90)
> hsb = Upper Sorbian
> kmr = Kurmanji (Northern Kurdish)
> de-CH = Swiss German, Traditional Orthography (Czech German - are you
> joking? :)
> de-1901 = German, Traditional Orthography
> de-1996 = German, Reformed Orthography
> 
> I think that Bokmål and Nynorsk could be written with uppercase. I slightly
> prefer Slovenian to Slovene as adjective/for language name, but this can be
> an infinite debate/it's a cosmetic issue anyway.

I should note that all of these names should eventually be generated by bug 666662.
Landed for mozilla9:
http://hg.mozilla.org/mozilla-central/rev/13e47b981869 (Hungarian)
http://hg.mozilla.org/mozilla-central/rev/f381ae05803a (hu-test)
http://hg.mozilla.org/mozilla-central/rev/002abea8ccb9 (Italian)
http://hg.mozilla.org/mozilla-central/rev/b39232627a54 (it-test)
http://hg.mozilla.org/mozilla-central/rev/53e0de790071 (Turkish)
http://hg.mozilla.org/mozilla-central/rev/079f4e4a1f4b (tr-test)

Leaving this open, as it's not clear what is left to do here; please close if appropriate. Thanks! :-)
Status: NEW → ASSIGNED
Whiteboard: [inbound]
Added hu, it, and tr to the documentation, flagged as being in Firefox 9.
Would it be possible to add hyphenation for Greek (el-EL) and ancient Greek (grc)?
(In reply to Pablo Rodríguez from comment #96)
> Would it be possible to add hyphenation for Greek (el-EL) and ancient Greek
> (grc)?

This depends on the availability of hyphenation patterns under licensing terms that allow us to ship them (or rather, a modified form) in Firefox. There are Greek patterns in the hyph-utf8 package in TeX Live, but I don't see any obvious statement of their license terms, so we'd need this to be clarified before trying to use them.
(In reply to Jonathan Kew from comment #97)
> (In reply to Pablo Rodríguez from comment #96)
> > Would it be possible to add hyphenation for Greek (el-EL) and ancient Greek
> > (grc)?
> 
> There are Greek patterns in the hyph-utf8 package in TeX Live, but I don't
> see any obvious statement of their license terms, so we'd need this to be
> clarified before trying to use them.

CTAN contains the same patterns (http://www.ctan.org/tex-archive/language/hyphenation/elhyphen) and states that they are released under the LaTeX Project Public License (http://mirrors.ctan.org/language/hyphenation/elhyphen/copyrite.txt).

I guess this should be OK.

BTW, patterns for polytonic Greek are also interesting but there is no standard tag for it.

Thanks for your help,


Pablo
Jonathan,

sorry, but since I got no reply, I don't know whether the suggested patterns for hyphenating ancient, polytonic and monotonic Greek are fine.

Are the patterns at http://www.ctan.org/tex-archive/language/hyphenation/elhyphen to be included in FF/TB for hyphenation?

Thanks for your help,


Pablo
No, the files in elhyphen are for pdfTeX in some weird font encoding. You should use the patterns from hyph-utf8. They are the same patterns, but in proper UTF-8 encoding.

The licence statements were slightly updated on 13th September. See http://tug.org/svn/texhyphen?view=revision&revision=592. That version should be fine for inclusion, at least that was the intention.
(In reply to Mojca Miklavec from comment #100)
> No, the files in elhyphen are for pdfTeX in some weird font encoding. You
> should use the patterns from hyph-utf8. They are the same patterns, but in
> proper UTF-8 encoding.
> 
> The licence statements were slightly updated on 13th September. See
> http://tug.org/svn/texhyphen?view=revision&revision=592. That version should
> be fine for inclusion, at least that was the intention.

Thanks for the comment, Mojca.

I hope the proper Greek hyphenation dictionaries (at least for monotonic Greek) will be included in Firefox 8 (right now they are missing from https://developer.mozilla.org/en/CSS/hyphens#Gecko_notes).

Thanks again,


Pablo
Would it be possible to add hyphenation for some Indic languages? The patterns are present here: http://tug.org/svn/texhyphen/trunk/hyph-utf8/tex/generic/hyph-utf8/patterns/tex/

I am requesting these patterns: hyph-[hi, sa, ml, ta, gu, as, bn, or, te, kn, mr, pa]

Thanks
(In reply to Santhosh Thottingal from comment #102)
> Would it be possible to add hyphenation for some Indic languages? The
> patterns are present here:
> http://tug.org/svn/texhyphen/trunk/hyph-utf8/tex/generic/hyph-utf8/patterns/
> tex/
> 
> I am requesting for these patterns hyph-[hi, sa, ml, ta, gu, as, bn,  or,
> te,  kn , mr, pa]

We can't simply import those patterns, due to licensing issues - they're distributed under the LGPL only. However, I hope to create and add patterns for Indic languages sometime soon.

(In reply to Pablo Rodríguez from comment #101)
> (In reply to Mojca Miklavec from comment #100)
> > No, the files in elhyphen are for pdfTeX in some weird font encoding. You
> > should use the patterns from hyph-utf8. They are the same patterns, but in
> > proper UTF-8 encoding.
> > 
> > The licence statements were slightly updated on 13th September. See
> > http://tug.org/svn/texhyphen?view=revision&revision=592. That version should
> > be fine for inclusion, at least that was the intention.
> 
> Thanks for the comment, Mojca.
> 
> I hope the proper Greek hyphenation dictionaries (at least for monotonic
> Greek) will be included in Firefox 8 (right now they are missing from
> https://developer.mozilla.org/en/CSS/hyphens#Gecko_notes).

I'm intending to return to this soon, but have been busy with other issues, as well as waiting on further clarification of some licensing questions (not just specific to Greek).
There is a TeX hyphenation file for Polish:

ftp://ftp.gust.org.pl/TeX/language/hyph-utf8/tex/generic/hyph-utf8/patterns/tex/hyph-pl.tex

The license of this file says:

%    Do with this file whatever needs to be done in future for the sake of
%    "a better world" as long as you respect the copyright of original file.
%    If you're the original author of patterns or taking over a new revolution,
%    plese remove all of the TUG comments & credits that we added here -
%    you are the Queen / the King, we are only the servants.

OpenOffice uses it and calls the above license "public domain" (which it probably really isn't) - http://extensions.services.openoffice.org/project/pl-dict - this is more in line with "WTF Public License", unless you treat "a better world" literally, which brings us close to Crockford vs IBM: http://www.youtube.com/watch?v=-hCimLnIsDA :)
Reading more into it, this seems to be a community-maintained former abandonware assumed to be open source. Ugh. :(
That material dates back to the days before people obsessed so much over explicit licensing terms when stuff was made available "freely". The original authors are listed (just a bit higher up in the file). I've been in contact with Jacko and Marek regarding getting the license clarified so that we'd be able to use it, and they'd be fine with that, but unfortunately we don't have contact info for Hanna, and so it's difficult to move forward. There's a plan to create a new (and explicitly-licensed) set of Polish patterns, but that will take some time.
(In reply to Jonathan Kew (:jfkthame) from comment #103)
> (In reply to Pablo Rodríguez from comment #101)
> [...]
> I'm intending to return to this soon, but have been busy with other issues,
> as well as waiting on further clarification of some licensing questions (not
> just specific to Greek).

Is there any chance that Firefox 9 will have these hyphenation dictionaries?

Many thanks,


Pablo
Another issue that I guess would be interesting to solve along with hyphenation is URL breaking.

Of course, URL breaking shouldn't use hyphens and probably a good method would be the default one used by the LaTeX package url.sty.

Would it be possible to enable this feature for URLs when hyphenation is active?

Thanks for your help,


Pablo
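As a rough illustration of the URL-breaking idea above (breaking without inserting hyphens), here is a minimal sketch. It is not how Gecko or url.sty actually implements this: the function name `add_url_breaks` and the set of break characters are assumptions chosen for the example. It inserts a ZERO WIDTH SPACE (U+200B) after runs of URL punctuation, so a layout engine may wrap there without showing any visible character.

```python
# Hypothetical sketch: mark line-break opportunities in a URL without hyphens.
# The break-character set is an assumption for illustration only; it is NOT
# the actual list used by url.sty or by any browser.

ZWSP = "\u200b"  # ZERO WIDTH SPACE: permits a line break, renders nothing

BREAK_AFTER = set("/.-_?&=#:")


def add_url_breaks(url: str) -> str:
    """Return url with a ZWSP after each run of break characters."""
    out = []
    for i, ch in enumerate(url):
        out.append(ch)
        is_last = i + 1 == len(url)
        # Only break at the end of a punctuation run, so "://" stays together,
        # and never append a break opportunity at the very end of the URL.
        if ch in BREAK_AFTER and not is_last and url[i + 1] not in BREAK_AFTER:
            out.append(ZWSP)
    return "".join(out)


if __name__ == "__main__":
    marked = add_url_breaks("http://tug.org/svn/texhyphen")
    # Show the break opportunities as "|" for readability.
    print(marked.replace(ZWSP, "|"))
```

Removing the ZWSP characters gives back the original URL, so the visible text and the copied text do not gain any hyphens.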
(In reply to Jonathan Kew (:jfkthame) from comment #103)
> (In reply to Pablo Rodríguez from comment #101)
> > I hope the proper Greek hyphenation dictionaries (at least for monotonic
> > Greek) will be included in Firefox 8 (right now they are missing from
> > https://developer.mozilla.org/en/CSS/hyphens#Gecko_notes).
> 
> I'm intending to return to this soon, but have been busy with other issues,
> as well as waiting on further clarification of some licensing questions (not
> just specific to Greek).

Has this issue improved since last time?

Many thanks for your help,


Pablo
Issues for Indian languages seem to get stuck. Too bad,

Great idea anyway,

Marcis
For Indic languages, I am the copyright holder (all licensed under LGPL). Please let me know what you need to get the patterns included. No need to get stuck because of licensing issues here :).
I notice that this has a target milestone of mozilla8, and a lot of changes were landed during that cycle, yet this is still open.

Jonathan: What is left to be done here? Might it be better off spun into one or more follow-up bugs? (The bug summary itself is somewhat vague about when this might be considered "fixed".)
Flags: needinfo?(jfkthame)
This basically stalled, pending clarification of some licensing questions.

(Note that LGPL code, for example, was -not- being accepted into Firefox at the time this was active. The policy now states that "it may be permissible to import Third Party Code under the LGPL (version 2.0 upwards) to be Product Code if it's a clearly-demarcated library and will be dynamically linked into the product", which might be OK for hyphenation resources, but we should verify that with Gerv.)

We did add resources for a bunch of languages here, so given that we lost momentum, perhaps it'd be best to resolve this and suggest people file new bugs for specific additional resources, and make them dependencies of the general "Enhance hyphenation" bug 656750.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Flags: needinfo?(jfkthame)
Resolution: --- → FIXED
Pablo, Santhosh, Marcis, Marek, etc: please feel free to file new bugs for any specific languages where suitably-licensed resources are available, and we'll try to get them added. Sorry it's not been at the top of anyone's priority list to keep pushing this forward.
Many thanks for the reply, Jonathan.

Sorry for the obvious question, but I’m afraid that I’m not familiar with Firefox licensing: which are the allowed licenses for hyphenation dictionaries?

Best wishes to all for the new year 2014,

Pablo
Sorry, I forgot to mention: is the information available at https://developer.mozilla.org/en-US/docs/Web/CSS/hyphens#Notes_on_supported_languages updated?
It is a wiki so it is trivial to update it. If new hyphenation dictionaries are added you can also notify the doc team by setting the dev-doc-needed keyword on the relevant bug (though you don't have a guarantee about how long you'll have to wait, the support of a new hyphenation dict is a trivial change in the doc and should be done by the release of the relevant Firefox).

Note that new bugs will also help us in keeping the doc updated.