Last Comment Bug 702121 - Incorrect hyphenation patterns used by CSS moz-hyphens in XHTML content (en-US patterns used, regardless of lang declaration)
: Incorrect hyphenation patterns used by CSS moz-hyphens in XHTML content (en-U...
Status: RESOLVED FIXED
:
Product: Core
Classification: Components
Component: Layout: Text (show other bugs)
: Trunk
: All All
: P3 normal (vote)
: mozilla17
Assigned To: David Baron :dbaron: ⌚️UTC+8 (review requests must explain patch)
:
Mentors:
: 707637 (view as bug list)
Depends on: 822554
Blocks:
  Show dependency treegraph
 
Reported: 2011-11-13 07:08 PST by testit
Modified: 2013-02-25 15:57 PST (History)
8 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
-


Attachments
hyphenation_no_lang.xhtml (358 bytes, text/xml)
2011-11-13 07:08 PST, testit
no flags Details
XHTML testcase for -moz-hyphens:auto that should _not_ apply English rules (287 bytes, application/xhtml+xml)
2011-11-13 14:17 PST, Jonathan Kew (:jfkthame)
no flags Details
Track whether nsStyleVisibility::mLanguage came from explicit information in the document. (, patch 1) (4.19 KB, patch)
2012-07-22 07:42 PDT, David Baron :dbaron: ⌚️UTC+8 (review requests must explain patch)
no flags Details | Diff | Splinter Review
Only do hyphenation when the language was specified explicitly, rather than using an encoding-inferred language. (, patch 2) (15.05 KB, patch)
2012-07-22 07:42 PDT, David Baron :dbaron: ⌚️UTC+8 (review requests must explain patch)
jfkthame: review+
Details | Diff | Splinter Review
Track whether nsStyleVisibility::mLanguage came from explicit information in the document. (, patch 1) (5.05 KB, patch)
2012-08-17 17:13 PDT, David Baron :dbaron: ⌚️UTC+8 (review requests must explain patch)
jfkthame: review+
Details | Diff | Splinter Review
, patch 3: Make the auto-hyphenation-{8,9,10} reftests not depend on picking the same fonts for different language or encoding values, to fix Android reftest failures from landing of bug 831354. (3.75 KB, patch)
2013-01-17 08:35 PST, David Baron :dbaron: ⌚️UTC+8 (review requests must explain patch)
jfkthame: review+
Details | Diff | Splinter Review

Description testit 2011-11-13 07:08:15 PST
Created attachment 574156 [details]
hyphenation_no_lang.xhtml

User Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:8.0) Gecko/20100101 Firefox/8.0
Build ID: 20111104165243

Steps to reproduce:

I used -moz-hyphens in a xhtml file which has no language specified (no lang attribute used).


Actual results:

According to the specification (https://developer.mozilla.org/en/CSS/hyphens), a hyphenation rule should only be applied, if a language is specified using the "lang" attribute.

Which hyphenation rule is applied if no language is specified?
Comment 1 :Ms2ger 2011-11-13 07:22:12 PST
None?
Comment 2 testit 2011-11-13 07:26:50 PST
Sorry, my description was not clear. The actual result is, that a hyphenation rule is applied but it shouldn't.
Comment 3 j.j. 2011-11-13 10:05:00 PST

*** This bug has been marked as a duplicate of bug 702123 ***
Comment 4 testit 2011-11-13 10:10:09 PST
This is not a duplicate of bug 702123. The topic of bug 702123 is, that a hyphenation rule is applied but shouldn't. The topic of this bug is that different hyphenation rules are applied in html and xhtml files.
Comment 5 Boris Zbarsky [:bz] 2011-11-13 11:21:44 PST
> According to the specification (https://developer.mozilla.org/en/CSS/hyphens)

That's not the specification.  That's documentation.  And it was just wrong.  I've fixed it.

What hyphenation dictionary is used depends on the encoding of the page and the declared language.

> The topic of this bug is that different hyphenation rules are applied in html and xhtml
> files.

The attached testcase is using UTF-8 as the encoding.  If I load it a UTF-8 HTML (as opposed to ISO-8859-1 HTML), it hyphenates exactly the same as the attached testcase.

So as far as I can tell the behavior is correct; over to layout:text to verify.
Comment 6 Boris Zbarsky [:bz] 2011-11-13 11:22:26 PST
I guess the difference between ISO-8859-1 and UTF-8 could be considered a problem.  Is that the concern?
Comment 7 Boris Zbarsky [:bz] 2011-11-13 11:24:01 PST
Oh, and what the actual _spec_ at http://dev.w3.org/csswg/css3-text/#hyphens says is this:

  The UA is therefore only required to automatically hyphenate text for which the author
  has declared a language (e.g. via HTML lang or XML xml:lang) and for which it has an
  appropriate hyphenation resource. 

Note that the UA is allowed to hyphenate other texts too; it's just not required to.
Comment 8 Jonathan Kew (:jfkthame) 2011-11-13 13:04:44 PST
It looks to me like we're not correctly detecting the declared language (or lack thereof) in XHTML documents, and always treating the text as being en-US. Even after adding a specific declaration such as xml:lang="af", I still see the en-US hyphenation dictionary being used.

Not sure yet whether this is a problem specifically in the hyphenation code, or a higher-level issue with xml:lang support.
Comment 9 Jonathan Kew (:jfkthame) 2011-11-13 14:17:59 PST
Created attachment 574185 [details]
XHTML testcase for -moz-hyphens:auto that should _not_ apply English rules

In this testcase, the content is explicitly tagged with xml:lang="ar", so we should definitely *not* be applying English rules. (And as we don't ship any rules for Arabic, the net result should be no auto-hyphenation.)

However, it seems that when BuildTextRunsScanner::SetupBreakSinksForTextRun looks at GetStyleVisibility()->mLanguage to find the appropriate language, it's getting "en" despite the "ar" tag in the document.
Comment 10 David Baron :dbaron: ⌚️UTC+8 (review requests must explain patch) 2011-11-14 12:40:34 PST
I think this is a duplicate of bug 234485.  I thought we'd implemented support for xml:lang, but I don't think we have.
Comment 11 David Baron :dbaron: ⌚️UTC+8 (review requests must explain patch) 2011-11-14 12:48:19 PST
... except for the part where we should store whether the language information came from something explicit, and only do language-based hyphenation when it is.
Comment 12 David Baron :dbaron: ⌚️UTC+8 (review requests must explain patch) 2011-11-14 12:57:00 PST
https://hg.mozilla.org/users/dbaron_mozilla.com/patches/raw-file/4e1148566cbf/explicit-language is an untested patch for the style system side of tracking whether the language is explicitly set.
Comment 13 Boris Zbarsky [:bz] 2011-12-15 19:04:17 PST
*** Bug 707637 has been marked as a duplicate of this bug. ***
Comment 14 David Baron :dbaron: ⌚️UTC+8 (review requests must explain patch) 2012-07-22 07:42:35 PDT
Created attachment 644757 [details] [diff] [review]
Track whether nsStyleVisibility::mLanguage came from explicit information in the document.  (, patch 1)
Comment 15 David Baron :dbaron: ⌚️UTC+8 (review requests must explain patch) 2012-07-22 07:42:50 PDT
Created attachment 644758 [details] [diff] [review]
Only do hyphenation when the language was specified explicitly, rather than using an encoding-inferred language.  (, patch 2)
Comment 16 Jonathan Kew (:jfkthame) 2012-08-08 04:14:33 PDT
Comment on attachment 644757 [details] [diff] [review]
Track whether nsStyleVisibility::mLanguage came from explicit information in the document.  (, patch 1)

Review of attachment 644757 [details] [diff] [review]:
-----------------------------------------------------------------

::: layout/style/nsStyleStruct.cpp
@@ +139,5 @@
>        language.FindChar(PRUnichar(',')) == kNotFound) {
>      mLanguage = do_GetAtom(language);
> +    // REVIEW: Should this count as explicit, or should only the lang
> +    // attribute count as explicit?
> +    mExplicitLanguage = true;

My inclination would be to treat only the lang attribute as explicit. That's often the only thing that an author will control (or even be aware of), I suspect. And given that this (for now, at least) is targeted at hyphenation, which in turn is author-controlled via CSS, it seems right to keep the language control also firmly and solely in the hands of the page author, without involving the server configuration.

In which case the initialization of mExplicitLanguage (always false) could be moved to the initializer lists for the non-copy constructors, instead of needing to be included in the Init() method.

::: layout/style/nsStyleStruct.h
@@ +113,5 @@
>    nscoord mScriptMinSize;        // [inherited] length
>    float   mScriptSizeMultiplier; // [inherited]
>    nsCOMPtr<nsIAtom> mLanguage;   // [inherited]
> +  // was mLanguage set based on something in the document?
> +  bool mExplicitLanguage;        // [inherited]

It looks to me like the struct will pack better if you insert this alongside the existing PR[U]Int8 fields, rather than appending it.

(The initializer in nsStyleStruct.cpp would then need to be re-ordered to match.)
Comment 17 Jonathan Kew (:jfkthame) 2012-08-08 04:52:10 PDT
Comment on attachment 644758 [details] [diff] [review]
Only do hyphenation when the language was specified explicitly, rather than using an encoding-inferred language.  (, patch 2)

Review of attachment 644758 [details] [diff] [review]:
-----------------------------------------------------------------

Looks fine to me (assuming it passes the tests, of course!)

If it's easy to do, maybe it'd be nice to split the variable-renaming stuff that doesn't involve any functional change (langGroup -> hyphenationLanguage, etc) into a separate patch, and then the actual behavior-changing patch would become much smaller. But not a big deal if you don't want to bother.
Comment 18 David Baron :dbaron: ⌚️UTC+8 (review requests must explain patch) 2012-08-17 17:13:03 PDT
Created attachment 652991 [details] [diff] [review]
Track whether nsStyleVisibility::mLanguage came from explicit information in the document.  (, patch 1)
Comment 19 David Baron :dbaron: ⌚️UTC+8 (review requests must explain patch) 2012-08-20 19:29:14 PDT
https://hg.mozilla.org/integration/mozilla-inbound/rev/198f6c4784cb
https://hg.mozilla.org/integration/mozilla-inbound/rev/67f1eff2cc7c

I think the remainder of this bug is a duplicate of bug 234485.
Comment 21 David Baron :dbaron: ⌚️UTC+8 (review requests must explain patch) 2013-01-17 08:35:16 PST
Created attachment 703339 [details] [diff] [review]
, patch 3:  Make the auto-hyphenation-{8,9,10} reftests not depend on picking the same fonts for different language or encoding values, to fix Android reftest failures from landing of bug 831354.
Comment 22 David Baron :dbaron: ⌚️UTC+8 (review requests must explain patch) 2013-01-17 12:23:38 PST
https://tbpl.mozilla.org/?tree=Try&rev=e3df072f8858
https://hg.mozilla.org/integration/mozilla-inbound/rev/47166178f558
Comment 24 bhavana bajaj [:bajaj] 2013-02-25 15:57:28 PST
Not rel-noting it as majority of the feature  landed in mozilla-17. The patch that recently landed seems to be reftest's hence not tracking for rel-noting

Note You need to log in before you can comment on or make changes to this bug.