Closed Bug 450088 Opened 16 years ago Closed 16 years ago

Line breaking regression (in Chinese and other languages)

Categories

(Core :: Layout: Text and Fonts, defect)

x86
All
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla1.9.2a1

People

(Reporter: wuyongwei, Assigned: jfkthame)

References

()

Details

(Keywords: fixed1.9.1)

Attachments

(1 file, 6 obsolete files)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.1) Gecko/2008070208 Firefox/3.0.1
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.1) Gecko/2008070208 Firefox/3.0.1

FF3 has problems dealing breaking lines in non-Western text, which used to be good in FF2.  Resizing the window containing the URL above, at least two problems are noticeable:

* Extra spaces appear in the text, like after "61岁了。"
* Improper line breaking after "“" and before "”"

This is a regression, since FF2 renders it correctly.  Both FF2 and IE ignore the line break in the Chinese text flow, and correctly break the line according to the Chinese rules.  Safari also ignores the line break in the Chinese text flow, but seems to break the line only according to Unicode 5.0.0 Standard Annex 14, without recognizing Chinese-specific rules: it prohibits immediate line breaks before and after "“" or "”" altogether.

I really wonder why this happened, since FF2 is correct in its behaviour.  If for some reasons the old code cannot be (easily) reused, you may be interested in looking at another line breaking library I recently implemented (according to Unicode 5.0.0 Standard Annex 14).  The URL is

http://vimgadgets.cvs.sourceforge.net/vimgadgets/common/tools/linebreak/

Its licence is zlib kind, and is compatible with the Firefox licence.

Reproducible: Always

Steps to Reproduce:
1. Open a Chinese page like the above URL
2. Resize the window slowly to check the line-breaking behaviour
Actual Results:  
1. "“" can be seen at the end of the line; or "”" at the beginning of the line
2. There are extra spaces in the page, like "一 件"

Expected Results:  
1. "“" should not appear at the end of the line; nor should "”" at the beginning of the line
2. There should be no spaces in the Chinese text flow

a) Restore the old line-breaking code in FF2; or
b) Use a new line-breaking library, like

http://vimgadgets.cvs.sourceforge.net/vimgadgets/common/tools/linebreak/
OS: Windows XP → All
Version: unspecified → 3.0 Branch
Component: General → Layout: Text
Product: Firefox → Core
QA Contact: general → layout.fonts-and-text
Version: 3.0 Branch → 1.9.0 Branch
Status: UNCONFIRMED → NEW
Ever confirmed: true
Flags: blocking1.9.1?
Jonathan, this might be fun for you to look at. nsLineBreaker.cpp would be a good place to start.
The "extra spaces" mentioned, such as after "61岁了。" or in the middle of "一 件", appear to be present in the source text, AFAICT. Is it really expected that spaces will magically vanish in the midst of Chinese text? I wouldn't have thought so -- and so I don't think we have an issue there.

The bad breaks after open-quote or before close-quote are due to the character classes in jisx4501class.h, which classifies these as "character" (i.e., like CJK ideographs) rather than as opening or closing punctuation. This in turn derives from their assignment to class 18 (Western characters) in intl/lwbrk/tools/jisx4501class.txt. I guess JIS X 4501 doesn't anticipate the use of "Western" quote marks around CJK text.

We can fix the issue by editing jisx4501class.txt (and regenerating our derived tables); we can assign the open-quotes to class 1, and the close-quotes to class 2 (meaning open and close parenthesis respectively, according to comments in nsJISx4501LineBreaker.cpp, which seems the closest match to the desired behavior).

Is there an upstream source for this data that we should also involve, or do we simply maintain it ourselves?
CSS white-space processing is complicated:
http://www.w3.org/TR/CSS21/text.html#white-space-model
It allows newlines in CJK text to be elided completely (whereas for most text they would be turned into spaces). But it would not be correct per CSS 2.1 to eliminate a space between two CJK characters.

I think we maintain the class data ourselves. Masayuki Nakano maintains that code. Masayuki-san, please comment on whether Jonathan's suggestion makes sense.
Yes, for newlines that makes sense, but AFAICS the test page mentioned here has actual spaces, not newlines.

OK, I tried changing the spaces in the test to newlines, and we still render these as spaces, so that's something we might want to change. Trying the file with Safari, though, I see the same thing: newlines in the Chinese text result in spaces, they are not ignored (as suggested in the original report).
Note that the real patch is the change to jisx4501class.txt; the other modified files are derived from this by running the anzx4501.pl tool.

The appropriate classes for U+20xx quote marks are not always clear-cut, because of widely differing quote mark usage in various Western languages; one person's opening quote mark may be another person's closer. In practice, though, I suspect that only the "English-like" convention is likely to be used with CJK text, so as long as U+2018/2019 and U+201C/201D are classified as open/close pairs, we'll be OK.

This patch does not address the issue of ignoring newlines within CJK text, it only resolves the bad breaks adjacent to quote marks.
Attachment #347866 - Flags: review?(masayuki)
That's fine. We quite commonly take a bug with multiple issues and refocus it onto one issue. People can file followup bugs to tackle the remaining issues, although in this case, it seems the remaining issue is not a bug so we probably shouldn't bother.
(In reply to comment #4)
> OK, I tried changing the spaces in the test to newlines, and we still render
> these as spaces, so that's something we might want to change. Trying the file
> with Safari, though, I see the same thing: newlines in the Chinese text result
> in spaces, they are not ignored (as suggested in the original report).

It's probably worth testing Windows IE in addition to Safari.  IE/Windows (the layout engine, partly via other browsers) has (high?) ninety-some percent share in the Chinese browser market, I think.
(In reply to comment #3)
> CSS white-space processing is complicated:
> http://www.w3.org/TR/CSS21/text.html#white-space-model
> It allows newlines in CJK text to be elided completely (whereas for most text
> they would be turned into spaces). But it would not be correct per CSS 2.1 to
> eliminate a space between two CJK characters.

See here:
http://www.w3.org/TR/html401/struct/text.html#h-9.1

> in Latin scripts, inter-word space is typically rendered as an ASCII space ( ),
> while in Thai it is a zero-width word separator (​).
> In Japanese and Chinese, inter-word space is not typically rendered at all.

The last line is important. The old textframe removed the white spaces if it has LF. But new textframe is not so.

> I think we maintain the class data ourselves. Masayuki Nakano maintains that
> code. Masayuki-san, please comment on whether Jonathan's suggestion makes
> sense.

Please wait, I'm checking...
Comment on attachment 347866 [details] [diff] [review]
patch to correct CJK linebreak behavior at quote marks

201A, 201B, 201E, 201F should be 18, probably. The reason was said by you.

2018 and 201C should be 22, and also 2019 and 201D should be 23. The risk of these class is lower for Western languages. If some web pages uses these 4 quotations as different meanings from our definition in Western text, we don't break around these quotations.

00AB and 00BB should be same. Use 22 and 23.
Oh, but following case, we cannot render like web designer wanted.
data:text/html,<div style="width:0;">”字”</div>

But we don't have good mechanism for this case. And it's OK. It might be broken on Fx2.

And you should add some testcases to layout/reftests/line-breaking/quotationmarks-1.html And then, you should use entities for non-ASCII characters in its file.
(In reply to comment #10)
> And you should add some testcases to
> layout/reftests/line-breaking/quotationmarks-1.html And then, you should use
> entities for non-ASCII characters in its file.

Oops, these characters are already tested. Therefore, you need to change the reference file (quotationmakrs-1-ref.html) for new behavior.
(In reply to comment #4)
> Yes, for newlines that makes sense, but AFAICS the test page mentioned here has
> actual spaces, not newlines.
> 
> OK, I tried changing the spaces in the test to newlines, and we still render
> these as spaces, so that's something we might want to change. Trying the file
> with Safari, though, I see the same thing: newlines in the Chinese text result
> in spaces, they are not ignored (as suggested in the original report).

The "extra spaces" I mentioned are newlines in the source.

However, I was mistaken about the Safari case, probably because the space was rendered very narrow.  Chrome behaves the same as Safari.  So it means WebKit currently also has this problem in Chinese rendering.

IE and FF2 behave as I expect, eliminating the newline completely between two Chinese characters.  This makes sense, since Chinese text generally does not contain any spaces.  Without this specific rendering rule, a long paragraph in Chinese may have to be a long line in the source, which presents an issue for some editors.
(In reply to comment #5)
> Created an attachment (id=347866) [details]
> patch to correct CJK linebreak behavior at quote marks
> 
> Note that the real patch is the change to jisx4501class.txt; the other modified
> files are derived from this by running the anzx4501.pl tool.
> 
> The appropriate classes for U+20xx quote marks are not always clear-cut,
> because of widely differing quote mark usage in various Western languages; one
> person's opening quote mark may be another person's closer. In practice,
> though, I suspect that only the "English-like" convention is likely to be used
> with CJK text, so as long as U+2018/2019 and U+201C/201D are classified as
> open/close pairs, we'll be OK.
> 
> This patch does not address the issue of ignoring newlines within CJK text, it
> only resolves the bad breaks adjacent to quote marks.

I am not familiar with the Mozilla code base, so sorry if my question is irrelevant.  Is jisx4501class.txt generic for Unicode line breaking, like the data file for Unicode 5.0.0 Annex 14?

Also notice that a Unicode character may have different line-breaking attributes in different language contexts.  This is conformant to Unicode 5.0.0 Annex 14.  In liblinebreak, U+201C (left double quote) is assigned the "Ambiguous Quotation" class (conservative line breaking in all context); but "Opening Punctuation" in English, Spanish, French, and Chinese; and "Closing Punctuation" in German and Russian.

A quick test shows that the line breaking rule changes with the lang attribute of the text in Internet Explorer too.

Of course, ignoring the lang attribute but make text rendering conformant to Unicode 5.0.0 Annex 14 is better than the current behaviour already.
(In reply to comment #12)
> The "extra spaces" I mentioned are newlines in the source.
> 
> However, I was mistaken about the Safari case, probably because the space was
> rendered very narrow.  Chrome behaves the same as Safari.  So it means WebKit
> currently also has this problem in Chinese rendering.
> 
> IE and FF2 behave as I expect, eliminating the newline completely between two
> Chinese characters.  This makes sense, since Chinese text generally does not
> contain any spaces.  Without this specific rendering rule, a long paragraph in
> Chinese may have to be a long line in the source, which presents an issue for
> some editors.

That's definitely a bug, but harder to fix; please file a separate bug for that, and we'll leave this bug as being about the linebreaking rules.
(In reply to comment #13)

Yes, the character classes from jisx4501class.txt are currently used for line-breaking regardless of language. Making the behavior dependent on the lang attribute would be a nice enhancement but is a separate issue from the current bug.
(In reply to comment #9)

> 201A, 201B, 201E, 201F should be 18, probably. The reason was said by you.

OK, so we'll accept that if these were ever used with CJK text, we could get unwanted breaks. It seems unlikely to be an issue in practice.

> 2018 and 201C should be 22, and also 2019 and 201D should be 23.

Could we have a comment somewhere documenting these class numbers? (Or is there one that I haven't found?) I was initially assuming the classes in jisx4501class.txt should correspond to the first table shown in the comments of nsJISx4501LineBreaker.cpp (around lines 60-80), but that only lists class values up to 20.

Unfortunately, it looks like making 2018 and 201C class 22 here (or class 1 as I initially tried) will lead to breakage for a number of languages such as Belarusian, Danish, Estonian, German, Icelandic, Ukrainian, and others where these are used as *closing* quote marks, so that instead of “quotes” they have „quotes.“ (See http://en.wikipedia.org/wiki/Quotation_mark,_non-English_usage.)

To be safer, I think we need a class for 2018 and 201C that prohibits a break after, but does not introduce a break opportunity before the character. It's not clear to me if we have a suitable class for this at the moment, though.
Trying to understand the line-break code in nsJISx4501LineBreaker.cpp more fully, I noticed a mismatch between the comments and actual values in row [d] of the gPairConservative table (item 8 in the comments describing the various classes and tables). Line 310 says

      [d] 0000 1111 1101 1111  = 0x0EDF

and the actual array has 0x0EDF, but the bit pattern shown (which corresponds to the matrix above, noting that the bits are encoded from right to left) would be 0x0FDF.

Masayuki-san, could you check which value is intended here -- it seems to me that either we should change this to 0x0FDF, or remove the X from the Complex column of row [d] in the matrix.
(In reply to comment #16)
> > 2018 and 201C should be 22, and also 2019 and 201D should be 23.
> 
> Could we have a comment somewhere documenting these class numbers? (Or is there
> one that I haven't found?) I was initially assuming the classes in
> jisx4501class.txt should correspond to the first table shown in the comments of
> nsJISx4501LineBreaker.cpp (around lines 60-80), but that only lists class
> values up to 20.

See http://mxr.mozilla.org/mozilla-central/source/intl/lwbrk/tools/jisx4501simp.txt

> Unfortunately, it looks like making 2018 and 201C class 22 here (or class 1 as
> I initially tried) will lead to breakage for a number of languages such as
> Belarusian, Danish, Estonian, German, Icelandic, Ukrainian, and others where
> these are used as *closing* quote marks, so that instead of “quotes” they have
> „quotes.“ (See http://en.wikipedia.org/wiki/Quotation_mark,_non-English_usage.)

The class 22 and the class 23 are [c] and [d] in the document of nsJISx4501LineBreaker.cpp. These classes are same behavior as class 1 of the document (character class). So, between a Western alphabet and these classes are not breakable point. So, using class 22 and class 23, you don't change the current behavior on non-CJ text.

(In reply to comment #17)
> Trying to understand the line-break code in nsJISx4501LineBreaker.cpp more
> fully, I noticed a mismatch between the comments and actual values in row [d]
> of the gPairConservative table (item 8 in the comments describing the various
> classes and tables). Line 310 says
> 
>       [d] 0000 1111 1101 1111  = 0x0EDF
> 
> and the actual array has 0x0EDF, but the bit pattern shown (which corresponds
> to the matrix above, noting that the bits are encoded from right to left) would
> be 0x0FDF.
> 
> Masayuki-san, could you check which value is intended here -- it seems to me
> that either we should change this to 0x0FDF, or remove the X from the Complex
> column of row [d] in the matrix.

Oh, it should be 0x0FDF! E.g., ")$" should not be breakable. Thank you.
(In reply to comment #18)

> > Unfortunately, it looks like making 2018 and 201C class 22 here (or class 1 as
> > I initially tried) will lead to breakage for a number of languages such as
> > Belarusian, Danish, Estonian, German, Icelandic, Ukrainian, and others where
> > these are used as *closing* quote marks, so that instead of “quotes” they have
> > „quotes.“ (See http://en.wikipedia.org/wiki/Quotation_mark,_non-English_usage.)
> 
> The class 22 and the class 23 are [c] and [d] in the document of
> nsJISx4501LineBreaker.cpp. These classes are same behavior as class 1 of the
> document (character class). So, between a Western alphabet and these classes
> are not breakable point. So, using class 22 and class 23, you don't change the
> current behavior on non-CJ text.

I don't think that's correct. Yes, class 22 won't break after a Western *letter*, but it looks to me like using class 22 for U+2018 and 201C can introduce a possible break between word-final punctuation such as . , : and the following quote. This would be fine in English usage, where these characters are opening quotes, and therefore should "bind" to the following word, but it will be unacceptable for languages that use these same characters as closing quotes (see above).

This shows up in layout/reftests/line-breaking/quotationmarks-1.html, which is purely "Western" sample text, but gets a lot of new line-breaks between punctuation and quote marks if these characters are changed to class 22.

I think we're going to have to subdivide one of the existing classes to make a further distinction, but I'm still figuring out the details.
Attachment #347866 - Attachment is obsolete: true
Attachment #347866 - Flags: review?(masayuki)
(In reply to comment #19)
> (In reply to comment #18)
> 
> > > Unfortunately, it looks like making 2018 and 201C class 22 here (or class 1 as
> > > I initially tried) will lead to breakage for a number of languages such as
> > > Belarusian, Danish, Estonian, German, Icelandic, Ukrainian, and others where
> > > these are used as *closing* quote marks, so that instead of “quotes” they have
> > > „quotes.“ (See http://en.wikipedia.org/wiki/Quotation_mark,_non-English_usage.)
> > 
> > The class 22 and the class 23 are [c] and [d] in the document of
> > nsJISx4501LineBreaker.cpp. These classes are same behavior as class 1 of the
> > document (character class). So, between a Western alphabet and these classes
> > are not breakable point. So, using class 22 and class 23, you don't change the
> > current behavior on non-CJ text.
> 
> I don't think that's correct. Yes, class 22 won't break after a Western
> *letter*, but it looks to me like using class 22 for U+2018 and 201C can
> introduce a possible break between word-final punctuation such as . , : and the
> following quote.

ASCII punctuations (e.g., ".,:;") are also 22 or 23, so, they should not be breakable with 22 or 23.

> This shows up in layout/reftests/line-breaking/quotationmarks-1.html, which is
> purely "Western" sample text, but gets a lot of new line-breaks between
> punctuation and quote marks if these characters are changed to class 22.

Really? It's strange...
'(' is class 22, '.' is class 23. But following test code is not broken.

data:text/html,<div style="width:0;">.(</div>
Ah, sorry. This is correct test. I forgot a rule.

data:text/html,<div style="width:0;">aaaaaaaa(.aaaaaaaa</div>

But this line is not also broken.
Oops, I mistook. This is correct. And this is broken.

data:text/html,<div style="width:0;">aaaaaaaa.(aaaaaaaa</div>

You're right. I'll check the cause.
Ah... I'm sorry. I looked conservative table.

Looks like you need to add a new class for quotations now. A line is only broken between the new class character and CJ character.
It's OK for "aaaaaaaa.(aaaaaaaa" to break before the open-paren, I think, and this is the existing behavior.

The problem is that we can't accept the same behavior for the "open" quotes U+2018 and 201C, because they're not always openers, some people use them as closers. So a suitable test case is:

  data:text/html,<div style="width:0;">aaaaaaaa.&#x201C;aaaaaaaa</div>

which doesn't break currently, but does break if we change the quotes to class 22 as suggested above. We need to prevent that break (because of the German, Ukrainian, etc usage), but allow a break before the quote (and not after it) in CJK context.

To solve this, we're going to have to change how the fullwidth punctuation from the U+FFxx block is handled. Currently, these map to the same classes as their ASCII equivalents, but in that case we can't distinguish between

  data:text/html,<div style="width:0;">aaaaaaaa.&#x201C;aaaaaaaa</div>

where we *don't* want a break between period and quote, and

  data:text/html,<div style="width:0;">字字字字字字字字.&#x201C;字字字字字字字字</div>

where we *do* want to allow the break between (full-width) period and quote mark.
(In reply to comment #16)
> > 2018 and 201C should be 22, and also 2019 and 201D should be 23.
> 
> Could we have a comment somewhere documenting these class numbers? (Or is there
> one that I haven't found?) I was initially assuming the classes in
> jisx4501class.txt should correspond to the first table shown in the comments of
> nsJISx4501LineBreaker.cpp (around lines 60-80), but that only lists class
> values up to 20.
> 
> Unfortunately, it looks like making 2018 and 201C class 22 here (or class 1 as
> I initially tried) will lead to breakage for a number of languages such as
> Belarusian, Danish, Estonian, German, Icelandic, Ukrainian, and others where
> these are used as *closing* quote marks, so that instead of “quotes” they have
> „quotes.“ (See http://en.wikipedia.org/wiki/Quotation_mark,_non-English_usage.)
> 
> To be safer, I think we need a class for 2018 and 201C that prohibits a break
> after, but does not introduce a break opportunity before the character. It's
> not clear to me if we have a suitable class for this at the moment, though.

I am fully aware of this issue, and that is why I have mentioned repeatedly Unicode 5.0.0 Annex 14 (UAX #14).  In UAX #14, the quotation marks (including U+2018 and U+201C) are given the category "Ambiguous Quotation", i.e. they "act like they are both opening and closing".  In liblinebreak, I specialized them to Opening or Closing only in specific language contexts.

Even if you do not want to use a completely new line-breaking implementation, I strongly suggest that you have a look at UAX #14.
Bug 56652 has a general discussion of UAX #14.
(In reply to comment #25)
> To solve this, we're going to have to change how the fullwidth punctuation from
> the U+FFxx block is handled. Currently, these map to the same classes as their
> ASCII equivalents, but in that case we can't distinguish between
> 
>   data:text/html,<div style="width:0;">aaaaaaaa.&#x201C;aaaaaaaa</div>
> 
> where we *don't* want a break between period and quote, and
> 
>   data:text/html,<div style="width:0;">字字字字字字字字.&#x201C;字字字字字字字字</div>
> 
> where we *do* want to allow the break between (full-width) period and quote
> mark.

Just FYI.  UAX #14 does not deal with this specific issue either.  IMHO, ignoring a breaking chance is better than misusing a breaking chance.

I may even argue that some people do not want this breaking chance.  Can "字字字字字字字字." happen to be a quotation inside a German context?  :-)
Flags: blocking1.9.1? → wanted1.9.1+
This test will fail until 450088 is fixed, due to bad breaks adjacent to quote marks used in a CJK context; see earlier comments on the bug for further discussion.
Attachment #348337 - Flags: review?(masayuki)
With this patch applied, the quotation-cjk reftest (see preceding attachment) now passes. All the other linebreak reftests still pass as well.

Because of the ambiguity of many quote-mark characters, the results will still not be perfect in every situation; in particular, we will not find all the good potential breaks in Western text. But I believe this patch significantly improves CJK behavior, and addresses the specific cases shown in the initial bug report, without disrupting behavior for other scripts.
Attachment #348338 - Flags: review?(masayuki)
Why do you need [f]?

        1 [a] 7  8  9 [b]15 18 COMPLEX [c] [d] [e] [f] [g]
[d]        X              X  X              X   X   X   X
[f]        X              X  X              X   X   X    

looks like the difference is only [d][g] vs [f][g] (all [x][d|f] are same). However, [g] is new class. And I think ASCII closing punctuations and Full-width punctuations should have same behavior.
Sorry, I guess I didn't explain this sufficiently. Purely for use in CJK contexts, yes, we'd want to treat ASCII and full-width closing punctuation the same, but we can't do this at the moment because of the risk of breaking Western-language usage.

The pair [f][g] allows a break between the closing (full-width) punctuation and the following open-quote in cases such as ....她说:“我是.... This is important because otherwise we can often get a run of 4 (or more) characters with no possible line-break, which looks pretty bad.

However, we cannot allow the same break for [d][g], where [d] has the ASCII closing punctuation, because the quote marks in [g] are not universally used as openers; in some Western orthographies such as German they can be closing quotes.

So if the punctuation is ASCII (or generic), we can't risk breaking before the "open" quote, as it might really be a closer. But if the punctuation is fullwidth, I think we can risk assuming this is a CJK context, using the quote marks in the commonly-understood (English-like) sense, and permit this break.

To really do the job right, we would have to pass language/locale information to the line-breaking routine, and provide tailored rules for different orthographic conventions. But that is an enhancement beyond the scope of this bug, I think. For now, we need to fix the bad CJK breaks that were reported, without interfering with the behavior for other scripts.

So because of the quote-mark ambiguities, we have to split [f] away from [d]; otherwise we can't provide the break before [g], and that makes typical CJK text look significantly worse.
Bug 465457 suggests language-specific line breaking. (However, even if/when this is implemented, we will still need a "best guess" fallback behavior for use when we don't have specific language information.)
(In reply to comment #32)
> The pair [f][g] allows a break between the closing (full-width) punctuation and
> the following open-quote in cases such as ....她说:“我是.... This is important
> because otherwise we can often get a run of 4 (or more) characters with no
> possible line-break, which looks pretty bad.

I do not think the break is allowed, unless the language is set to zh, or the encoding is a Chinese legacy one.  Consider this case (though maybe rare; and I am not sure the German is correct enough):

  „字字.“ ist ein chinesisches Angebot.
> However, we cannot allow the same break for [d][g], where [d] has the ASCII
> closing punctuation, because the quote marks in [g] are not universally used as
> openers; in some Western orthographies such as German they can be closing
> quotes.

O.K. We can check whether the text fragment is CJK or not by ContextState::mHasCJKChar. You can pass ContextState to GetClass and you should change the result only when the character is U+2018, U+2019, U+201C and U+201D. Then, we don't make damage to non-CJK text. If so, cannot we keep the same behavior between ASCII closing punctuations and Full-width closing punctuations? And can you reduce the [f] class?

(In reply to comment #34)
> (In reply to comment #32)
> > The pair [f][g] allows a break between the closing (full-width) punctuation and
> > the following open-quote in cases such as ....她说:“我是.... This is important
> > because otherwise we can often get a run of 4 (or more) characters with no
> > possible line-break, which looks pretty bad.
> 
> I do not think the break is allowed, unless the language is set to zh, or the
> encoding is a Chinese legacy one.  Consider this case (though maybe rare; and I
> am not sure the German is correct enough):
> 
>   „字字.“ ist ein chinesisches Angebot.

Right. But it should be fixed by bug 465457. The lang attr should override ContextState.mHasCJKChar. If we can fix this bug by the above way.
Or, should we the characters to NEED_CONTEXTUAL_ANALYSIS?
(In reply to comment #36)
> Or, should we the characters to NEED_CONTEXTUAL_ANALYSIS?

oops, I meant "should we add the...". Sorry for the spam.
This is a completely revised version of the patch. Instead of using new classes, the two key characters U+2018 and 201C are added to NEEDS_CONTEXTUAL_ANALYSIS, and mapped to CLASS_OPEN only if they are followed by a CJK character. This gives us the crucial break opportunity needed for CJK text to look good, while minimizing the risk of bad breaks in obscure cases such as the Chinese phrase quoted in German (see comment #34). Looking ahead at the following character seems the best way to identify this potential break.

There are now no changes to the pair tables (except fixing the typo noted in comment #17). The quote mark classes are updated to 22/23 in jisx4501class.txt, but the openers will normally be overridden by ContextualAnalysis to avoid the potential bad breaks as described in comment #19.

This patch includes the new reftest as well as the changes to linebreak code. It passes all the current linebreak reftests.
Attachment #348337 - Attachment is obsolete: true
Attachment #348338 - Attachment is obsolete: true
Attachment #350398 - Flags: review?(masayuki)
Attachment #348337 - Flags: review?(masayuki)
Attachment #348338 - Flags: review?(masayuki)
Oops, just after submitting the patch I realized I'd forgotten to include the open-guillemet in ContextualAnalysis; this might also get used in CJK contexts, I guess, so we'd better treat it like the open-quotes.
Attachment #350398 - Attachment is obsolete: true
Attachment #350400 - Flags: review?(masayuki)
Attachment #350398 - Flags: review?(masayuki)
+  } else if (cur == U_OPEN_SINGLE_QUOTE ||
+             cur == U_OPEN_DOUBLE_QUOTE ||
+             cur == U_OPEN_GUILLEMET) {
+    // for CJK usage, we treat these as openers to allow a break before them,
+    // but otherwise treat them as normal characters because quote mark usage
+    // in various Western languages varies too much; see bug #450088 discussion.
+    if (!aState.UseConservativeBreaking() && IS_CJK_CHAR(next))
+      return CLASS_OPEN;
+    return CLASS_CHARACTER;
   } else {
     NS_ERROR("Forgot to handle the current character!");
   }
   return GetClass(cur);

Should you fallback to GetClass(cur) instead of |return CLASS_CHARACTER|?
That would be possible (if I update jisx4501class.txt to have 18 rather than 22 for these characters), but it seems like that just sends us through a longer code path with more lookups to reach the same result.

If you feel strongly that we should make it refer back to the lookup tables here, we could do it, but I don't see the benefit (only a small cost).
Please do it. The current table is misleadable.

> -2018;201F;18
> +2018;;22
> +2019;;23
> +201A;;18
> +201B;;18
> +201C;;22
> +201D;;23
> +201E;;18
> +201F;;18

Then, this will be just:

>  2018;201F;18
> +2019;;23
> +201D;;23
OK, it now falls back to GetClass() in the conservative-breaking and non-CJK situations, and the class table has been updated to provide the correct values. The resulting behavior is unchanged from the previous version of the patch, so the line-break reftests still pass (including the new CJK one).
Attachment #350400 - Attachment is obsolete: true
Attachment #350437 - Flags: superreview?
Attachment #350437 - Flags: review?(masayuki)
Attachment #350400 - Flags: review?(masayuki)
Attachment #350437 - Flags: superreview? → superreview?(roc)
Attachment #350437 - Flags: review?(masayuki) → review+
Comment on attachment 350437 [details] [diff] [review]
revised patch as requested by Masayuki-san

Looks good for me. Thank you.
Comment on attachment 350437 [details] [diff] [review]
revised patch as requested by Masayuki-san

Fixes a regression from FF2
Attachment #350437 - Flags: superreview?(roc)
Attachment #350437 - Flags: superreview+
Attachment #350437 - Flags: approval1.9.1?
Has this landed on trunk yet? We should let it bake there for a day before taking it.
pushed to trunk.
http://hg.mozilla.org/mozilla-central/rev/556a93a5ffa6
Assignee: nobody → jfkthame
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Whiteboard: [c-n: baking for 1.9.1]
Target Milestone: --- → mozilla1.9.2a1
Version: 1.9.0 Branch → Trunk
backed-out the patch. Because the new reftest failed on Linux.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Whiteboard: [c-n: baking for 1.9.1]
> REFTEST TEST-UNEXPECTED-FAIL | file:///builds/slave/trunk_linux-7/build/layout/reftests/line-breaking/quotationmarks-cjk-1.html | 
> REFTEST   IMAGE 1 (TEST): 
> REFTEST   IMAGE 2 (REFERENCE): 
> REFTEST number of differing pixels: 94
looks like the test failure is caused by gfx/thebes bug. probably, we should disable the test only for linux.
Weird. The failure isn't due to this patch, though. The line-breaking is fine, the problem has something to do with font/line metrics, I guess, causing the "missing glyph" boxes on the last couple of lines to be drawn in a vertically-squished way, as if the code believes the font height is less there.

I'm guessing we wouldn't get this failure if the test box actually had CJK fonts available. (It worked fine for me on Linux.) It'd be nice to get to the root of the problem, but I would suggest it's actually a separate bug.
Maybe mark the test as random on Linux for now, and file a new bug re the missing-glyph rendering issue.
Jonathan, I think that we should only changes the new reftest should not be checked on linux.
http://mxr.mozilla.org/mozilla-central/source/layout/tools/reftest/README.txt#51
Are you able to reproduce the problem (squashed "missing-glyph boxes") somehow? I was hoping to come up with a test case that demonstrates this, and then file a separate bug, but have not been able to reproduce it locally on my Linux system (or on OS X).
(In reply to comment #54)
> Are you able to reproduce the problem (squashed "missing-glyph boxes") somehow?

I cannot.

> I was hoping to come up with a test case that demonstrates this, and then file
> a separate bug, but have not been able to reproduce it locally on my Linux
> system (or on OS X).

I think for 1.9.1 schedule, we need to land the current patch ASAP, so, you should disable the reftest only for Linux.
OK, I have disabled the new reftest on Linux for now, so AFAIK this should be safe to land. The new patch is taken directly from my .hg/patches dir, with user name and commit message; hope that's all correct.
Attachment #350437 - Attachment is obsolete: true
Attachment #352289 - Flags: review?(masayuki)
Attachment #350437 - Flags: approval1.9.1?
The reftest failure on Linux is due to lack of CJK fonts on the test machines, leading to the use of "missing glyph" boxes for the character U+5B57 in the test. The rendering of the missing glyph is affected by the presence of <br> before it in the reference file; this separates it from the preceding character and prevents the font chosen there from being propagated. In the actual test (not reference), where <br> is not present, the size of the missing glyph box varies according to the font chosen during font-matching for the preceding character.

Karl suggested a workaround using a <span> with a bogus family name, to effectively reset the font-matching process. This seems to work; however, I'm uncomfortable about adding this to the test case here as it "pollutes" the text stream that we're trying to line-break, and so makes it difficult to know whether we are still testing the intended sequence.
Does the test work on Linux if, instead of using a <br>, you set white-space:pre and use newlines in the source?
No, that still causes the missing-glyph after the break to be separated from the (fallback) font-matching that was applied to the character before the break.
Is there a specific reason why the missing glyph is drawn according to the metrics of the immediately-preceding font, or would it make (at least as much) sense to always draw it according to the metrics of the first (available) font from the current set, which is what happens if it occurs in isolation?

It's a tough call, perhaps, but arguably it would be nice to draw all missing glyphs within a given style run in a consistent way, rather than varying depending on the particular character that happened to occur before them. (And if we did this, it would also have the effect of making this test behave more predictably.)
Attachment #352289 - Flags: review?(masayuki) → review+
http://hg.mozilla.org/mozilla-central/rev/1eae54a0f48c

pushed the latest patch, please file a new bug for linux missing glyph issue.
Status: REOPENED → RESOLVED
Closed: 16 years ago16 years ago
Resolution: --- → FIXED
Whiteboard: [c-n: baking for 1.9.1]
(In reply to comment #60)
> Is there a specific reason why the missing glyph is drawn according to the
> metrics of the immediately-preceding font, or would it make (at least as much)
> sense to always draw it according to the metrics of the first (available) font
> from the current set, which is what happens if it occurs in isolation?

The latter would make a lot more sense! Can you do it?
I think so - hoping a trivial fix will work, but needs testing. Filed bug #469005 for this.
Whiteboard: [c-n: baking for 1.9.1] → [needs 191 landing
Comment on attachment 352289 [details] [diff] [review]
updated patch: disabled reftest on Linux, added commit msg

changes the line-breaking spec of after open quotes ([‘]U+2018 and [“]U+201c) and [«]U+AB only when the next character is a CJK character.

This is important changes for Chinese users. And Japanese and Korean users don't have problems by this change, maybe.
Attachment #352289 - Flags: approval1.9.1?
Keywords: checkin-needed
Whiteboard: [needs 191 landing → [needs 191 landing]
Comment on attachment 352289 [details] [diff] [review]
updated patch: disabled reftest on Linux, added commit msg

a191=beltzner
Attachment #352289 - Flags: approval1.9.1? → approval1.9.1+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: