Closed Bug 1081858 Opened 10 years ago Closed 8 years ago

Implement css-text-3 segment break transformation rules

Categories

(Core :: Layout: Text and Fonts, enhancement)

enhancement
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla52
Tracking Status
firefox52 --- fixed

People

(Reporter: kanru, Assigned: kanru)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(Keywords: dev-doc-complete)

Attachments

(5 files, 7 obsolete files)

2.60 KB, patch
kanru
: review+
Details | Diff | Splinter Review
858.63 KB, patch
kanru
: review+
Details | Diff | Splinter Review
2.10 KB, patch
kanru
: review+
Details | Diff | Splinter Review
8.01 KB, patch
jfkthame
: review+
Details | Diff | Splinter Review
142 bytes, text/html
Details
See http://dev.w3.org/csswg/css-text/#line-break-transform

This would be good for markup generated content. For example many markdown parser put a paragraph inside a <p> directly:

  <p>一些
  中文</p>

which rendered like

  一些 中文

But ideally it should be

  一些中文

without the linebreak translated to a space in the middle of words.
Severity: normal → enhancement
Assignee: nobody → kchen
Blocks: css-text-3
No longer blocks: 1253614
It turns out we already have this code since beginning but it was never enabled due to off-by-one error.

http://searchfox.org/mozilla-central/rev/84075be5067b68dc6cb3b89f999645650e68c05b/layout/generic/nsTextFrameUtils.cpp#92-98

Index should be -2 and 0 because aText has advanced.
Attached file segment_break_skip_table.py (obsolete) —
As described here: https://drafts.csswg.org/css-text/#line-break-transform

The table is generated from segment_break_skip_table.py: https://bugzilla.mozilla.org/attachment.cgi?id=8803871

I'm not sure if I should use ICU to get the properties and how.
Attachment #8803874 - Flags: review?(dbaron)
You should probably add code to intl/unicharutil/tools/genUnicodePropertyData.pl and generate a table for that. That long condition chain doesn't look good to me especially given that shortcircuit would likely not work for majority of cases.
You can probably use fullwidth as an example to learn how to generate a table based on unicode data files in that code.
fullwidth was added in bug 774560.

I also looked bug 1265631 which added line-break table. Looks like when ENABLE_INTL_API is defined I can use u_getIntPropertyValue from ICU.
And in the !INTL_API_ENABLED case, we have a three-bit field for mPairedBracketType in nsCharProps2, but as the comment notes, only two bits are really needed there. So you can change that to a two-bit field and use the third bit for your flag, rather than needing to add a new table.
When it comes to regenerating Unicode property tables, note that our Unicode support is currently at 8.0; we haven't yet updated to Unicode 9. That's bug 1281448, but there are additional issues in play there (ICU versions, etc).

So if you go ahead with this prior to bug 1281448 being completed, you should use Unicode 8.0 data files to regenerate the tables.
(Note that bug 1312440 also proposes to touch the Unicode table-generating code slightly, but any rebasing needed will be trivial.)
Comment on attachment 8803872 [details] [diff] [review]
Part 1. Fix off-by-one error and correctly discard newlines between CJK chars.

Jonathan is a better reviewer for this than I am.

(But please don't use "CJK" as a term for things that don't include Korean.)
Attachment #8803872 - Flags: review?(dbaron) → review?(jfkthame)
Attachment #8803874 - Flags: review?(dbaron) → review?(jfkthame)
Comment on attachment 8803872 [details] [diff] [review]
Part 1. Fix off-by-one error and correctly discard newlines between CJK chars.

Review of attachment 8803872 [details] [diff] [review]:
-----------------------------------------------------------------

::: layout/generic/nsTextFrameUtils.cpp
@@ +88,5 @@
>            (i + 1 >= aLength ||
>             !IsSpaceCombiningSequenceTail(aText, aLength - (i + 1)))) {
>          nowInWhitespace = true;
>        } else if (ch == '\n' && aCompression == COMPRESS_WHITESPACE_NEWLINE) {
> +        if (i > 0 && IS_CJ_CHAR(aText[-2]) &&

The use of aText[-2] looks scary here; initially, I was going to ask if we have any guarantee that it's valid to look at aText[-2]?

The answer, of course, is that if i > 0 then we're on at least the second iteration of the loop, so the aText++ above has been executed at least twice, and aText[-2] is safe.

But I think this code would be easier to read and understand if you move the increment of aText into the for-loop header, so that `i` and `aText` are always incremented at the same time. (And then pass aText+1 to IsSpaceCombiningSequenceTail above.)
Comment on attachment 8803874 [details] [diff] [review]
Part 2. Implement segment break transformation

Review of attachment 8803874 [details] [diff] [review]:
-----------------------------------------------------------------

See comments above re using Unicode property data tables rather than the long if-condition.

::: intl/unicharutil/util/nsUnicharUtils.h
@@ +144,5 @@
>  uint32_t
>  HashUTF8AsUTF16(const char* aUTF8, uint32_t aLength, bool* aErr);
>  
> +bool
> +IsSegmentBreakSkipChar(char16_t u);

You'll need this to take a 32-bit codepoint; there are lots of CJK ideographs in plane 2, and passing a char16_t means they'll never be recognized properly.

::: layout/generic/nsTextFrameUtils.cpp
@@ +94,5 @@
> +          aSkipChars->SkipChar();
> +          continue;
> +        }
> +        if (i > 0 && IsSegmentBreakSkipChar(aText[-2]) &&
> +            i + 1 < aLength && IsSegmentBreakSkipChar(aText[0])) {

A problem here is that this code doesn't handle surrogate pairs. I guess the original version was written way back before Unicode had all those extra CJK ideographs in Plane 2, and it's never been updated (well, it wasn't working anyway!) for the new, larger character repertoire.

So while you're fixing this, I think you should also handle surrogates.

(For bonus points, please file followup bugs to handle surrogates in the other existing users of IS_CJ_CHAR, which are using it for similar purposes to this code. It looks like nsPlainTextSerializer::Write and nsFind::Find both need similar attention. Probably the old IS_CJ_CHAR macro should just die...)
Attachment #8803871 - Attachment is obsolete: true
Attachment #8803872 - Attachment is obsolete: true
Attachment #8803872 - Flags: review?(jfkthame)
Attachment #8803974 - Flags: review?(jfkthame)
Attachment #8803874 - Attachment is obsolete: true
Attachment #8803874 - Flags: review?(jfkthame)
Attachment #8803977 - Flags: review?(jfkthame)
See Also: → 1312630
See Also: → 1312631
(In reply to Jonathan Kew (:jfkthame) from comment #12)
> Comment on attachment 8803872 [details] [diff] [review]
> Part 1. Fix off-by-one error and correctly discard newlines between CJK
> chars.
> 
> Review of attachment 8803872 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: layout/generic/nsTextFrameUtils.cpp
> @@ +88,5 @@
> >            (i + 1 >= aLength ||
> >             !IsSpaceCombiningSequenceTail(aText, aLength - (i + 1)))) {
> >          nowInWhitespace = true;
> >        } else if (ch == '\n' && aCompression == COMPRESS_WHITESPACE_NEWLINE) {
> > +        if (i > 0 && IS_CJ_CHAR(aText[-2]) &&
> 
> The use of aText[-2] looks scary here; initially, I was going to ask if we
> have any guarantee that it's valid to look at aText[-2]?
> 
> The answer, of course, is that if i > 0 then we're on at least the second
> iteration of the loop, so the aText++ above has been executed at least
> twice, and aText[-2] is safe.

Personally, I'd prefer not touching aText itself, and instead always use index to access content inside, so that we have aText[i - 1], aText[i - 2]. I think that makes the code easier to reason about, as fewer things are changing.
(In reply to Xidorn Quan [:xidorn] (UTC+10) from comment #18)
> (In reply to Jonathan Kew (:jfkthame) from comment #12)
> > Comment on attachment 8803872 [details] [diff] [review]
> > Part 1. Fix off-by-one error and correctly discard newlines between CJK
> > chars.
> > 
> > Review of attachment 8803872 [details] [diff] [review]:
> > -----------------------------------------------------------------
> > 
> > ::: layout/generic/nsTextFrameUtils.cpp
> > @@ +88,5 @@
> > >            (i + 1 >= aLength ||
> > >             !IsSpaceCombiningSequenceTail(aText, aLength - (i + 1)))) {
> > >          nowInWhitespace = true;
> > >        } else if (ch == '\n' && aCompression == COMPRESS_WHITESPACE_NEWLINE) {
> > > +        if (i > 0 && IS_CJ_CHAR(aText[-2]) &&
> > 
> > The use of aText[-2] looks scary here; initially, I was going to ask if we
> > have any guarantee that it's valid to look at aText[-2]?
> > 
> > The answer, of course, is that if i > 0 then we're on at least the second
> > iteration of the loop, so the aText++ above has been executed at least
> > twice, and aText[-2] is safe.
> 
> Personally, I'd prefer not touching aText itself, and instead always use
> index to access content inside, so that we have aText[i - 1], aText[i - 2].
> I think that makes the code easier to reason about, as fewer things are
> changing.

I think so too. I was too lazy to change all the aText references and make sure the index is correct. I'll change the next patch to this approach if there is no particular reason to stick with changing aText itself.
Comment on attachment 8803974 [details] [diff] [review]
Part 1. Fix aText off-by-one indexing

Review of attachment 8803974 [details] [diff] [review]:
-----------------------------------------------------------------

This looks fine, but I'm also OK with changing it to use aText[i] etc, and leave the aText pointer itself untouched.
Attachment #8803974 - Flags: review?(jfkthame) → review+
Comment on attachment 8803975 [details] [diff] [review]
Part 2. Add EastAsianWidthFWH data from Unicode's EastAsianWidth.txt to nsUnicodeProperties for builds without ICU.

Review of attachment 8803975 [details] [diff] [review]:
-----------------------------------------------------------------

::: intl/unicharutil/tools/genUnicodePropertyData.pl
@@ +545,5 @@
> +while (<FH>) {
> +    s/#.*//;
> +    if (m/([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))*\s*;\s*([^ ]+)/) {
> +        my $lb = $3;
> +        warn "unknown EastAsianWidth class" unless exists $eastAsianWidthCode{$lb};

Minor nit: I assume this was copy/pasted from the preceding block that reads the LineBreak data; the local variable $lb clearly was shorthand for "line-break". But here it's not a good variable name; let's call this something like $eaw instead...

@@ +546,5 @@
> +    s/#.*//;
> +    if (m/([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))*\s*;\s*([^ ]+)/) {
> +        my $lb = $3;
> +        warn "unknown EastAsianWidth class" unless exists $eastAsianWidthCode{$lb};
> +        $lb = $lb =~ m/^[FWH]$/;

...and then here, use a new variable for the flag, which is not directly an EastAsianWidth value but a derivative. So I think I'd go for:

    my $isFWH = ($eaw =~ m/^[FWH]$/);

(added parens are optional, but IMO they help the readability here).
Attachment #8803975 - Flags: review?(jfkthame) → review+
Attachment #8803976 - Flags: review?(jfkthame) → review+
Comment on attachment 8803977 [details] [diff] [review]
Part 4. Implement segment break transformation rules

Review of attachment 8803977 [details] [diff] [review]:
-----------------------------------------------------------------

Looks OK, except for the added surrogate checks mentioned below.

(I guess you'll adjust this to use indexing instead of incrementing aText, too; that sounds fine.)

::: intl/unicharutil/util/nsUnicharUtils.cpp
@@ +427,5 @@
>  
> +bool
> +IsSegmentBreakSkipChar(uint32_t u)
> +{
> +    return unicode::IsEastAsianWidthFWH(u) &&

nit: 2-space indent here, please.

::: layout/generic/nsTextFrameUtils.cpp
@@ +97,5 @@
> +        }
> +        uint32_t ucs4before;
> +        uint32_t ucs4after;
> +        if (i > 1 && NS_IS_LOW_SURROGATE(aText[-1])) {
> +          ucs4before = SURROGATE_TO_UCS4(aText[-2], aText[-1]);

Before doing SURROGATE_TO_UCS4, we should also check that the high surrogate is valid. (Unfortunately it's possible for JS code to poke lone surrogates into content, so we need to both codes of the pair.)

@@ +102,5 @@
> +        } else if (i > 0) {
> +          ucs4before = aText[-1];
> +        }
> +        if (i + 2 < aLength && NS_IS_HIGH_SURROGATE(aText[1])) {
> +          ucs4after = SURROGATE_TO_UCS4(aText[1], aText[2]);

Same here, add a check that the low surrogate is valid.
Adjusted to use indexing instead of incrementing aText. carry over r+
Attachment #8803974 - Attachment is obsolete: true
Attachment #8804271 - Flags: review+
Fixed indent and checking high/low surrogates.

I also fixed the editor modeline which used 4 spaces indenting.
Attachment #8803977 - Attachment is obsolete: true
Attachment #8803977 - Flags: review?(jfkthame)
Attachment #8804276 - Flags: review?(jfkthame)
Attachment #8804276 - Flags: review?(jfkthame) → review+
Pushed by kchen@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/44f9c7e8d124
Part 1. Fix aText off-by-one indexing. r=jfkthame
https://hg.mozilla.org/integration/mozilla-inbound/rev/ba420f595902
Part 2. Add EastAsianWidthFWH data from Unicode's EastAsianWidth.txt to nsUnicodeProperties for builds without ICU. r=jfkthame
https://hg.mozilla.org/integration/mozilla-inbound/rev/0dd35a1f895f
Part 3. Implement IsEastAsianWidthFWH using ICU or nsUnicodeProperties data. r=jfkthame
https://hg.mozilla.org/integration/mozilla-inbound/rev/d9e96e907d0a
Part 4. Implement segment break transformation rules. r=jfkthame
Pushed by kchen@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/ac6306117c61
Followup, initialize nsCharProps2 properly. on a CLOSED TREE r=bustage
Pushed by kchen@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/fd1d01c7a9e2
Part 1. Fix aText off-by-one indexing. r=jfkthame
https://hg.mozilla.org/integration/mozilla-inbound/rev/1d3177608997
Part 2. Add EastAsianWidthFWH data from Unicode's EastAsianWidth.txt to nsUnicodeProperties for builds without ICU. r=jfkthame
https://hg.mozilla.org/integration/mozilla-inbound/rev/763deb5caa30
Part 3. Implement IsEastAsianWidthFWH using ICU or nsUnicodeProperties data. r=jfkthame
https://hg.mozilla.org/integration/mozilla-inbound/rev/5044bee3df13
Part 4. Implement segment break transformation rules. r=jfkthame
Backed out for failing reftest segment-break-transformation-1.html on Windows XP:

https://hg.mozilla.org/integration/mozilla-inbound/rev/78e4e0957ce33427d91564211413bd900c7afab0
https://hg.mozilla.org/integration/mozilla-inbound/rev/3b7f2b3d247de346a151fa83b655e687715e5ec5
https://hg.mozilla.org/integration/mozilla-inbound/rev/c11ad1787ba0ed5bf1d78077954a7fcd8362fa72
https://hg.mozilla.org/integration/mozilla-inbound/rev/c02372e09279dc1f6fb31a404e7b6dd7b8076e44

Push with failures: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=5044bee3df13af6a5289cb1affcc50aed38f80b2
Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=38254935&repo=mozilla-inbound

06:31:48     INFO - REFTEST TEST-UNEXPECTED-FAIL | file:///C:/slave/test/build/tests/reftest/tests/layout/reftests/text/segment-break-transformation-1.html == file:///C:/slave/test/build/tests/reftest/tests/layout/reftests/text/segment-break-transformation-1-ref.html | image comparison, max difference: 255, number of differing pixels: 5968

Kanru's test of the fix on Try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=1353bb5fd7539ee0450c13d81136bed61d207cb5
Flags: needinfo?(kchen)
Pushed by kchen@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/8d6560b363d2
Part 1. Fix aText off-by-one indexing. r=jfkthame
https://hg.mozilla.org/integration/mozilla-inbound/rev/766b5a6cec7c
Part 2. Add EastAsianWidthFWH data from Unicode's EastAsianWidth.txt to nsUnicodeProperties for builds without ICU. r=jfkthame
https://hg.mozilla.org/integration/mozilla-inbound/rev/3573997144c6
Part 3. Implement IsEastAsianWidthFWH using ICU or nsUnicodeProperties data. r=jfkthame
https://hg.mozilla.org/integration/mozilla-inbound/rev/be7d9851343b
Part 4. Implement segment break transformation rules. r=jfkthame
Flags: needinfo?(kchen)
Attached file testcase
It seems to me this fix is not complete. See the testcase. If there is any whitespace around the segment break, there will still be a whitespace left.

Per CSS Text 3, those whitespaces should have been removed in the first step of Phase I, so existence of them should not affect the result of the segment break transformation applied in the second step.
(In reply to Xidorn Quan [:xidorn] (UTC+10) from comment #37)
> Created attachment 8809207 [details]
> testcase
> 
> It seems to me this fix is not complete. See the testcase. If there is any
> whitespace around the segment break, there will still be a whitespace left.
> 
> Per CSS Text 3, those whitespaces should have been removed in the first step
> of Phase I, so existence of them should not affect the result of the segment
> break transformation applied in the second step.

Yeah, looks like we need to move the segment break handling after all whitespaces are skipped.
Depends on: 1316482
I suggest we remove the item of this bug from "Firefox 52 for developers", because without fixing bug 1316482, we do not really fully implement the new rule.
(In reply to Xidorn Quan [:xidorn] (UTC+10) from comment #40)
> I suggest we remove the item of this bug from "Firefox 52 for developers",
> because without fixing bug 1316482, we do not really fully implement the new
> rule.

It looks like this bug is now fixed, so I don't have to remove it anymore ;-)

I'm thinking about what else needs to be done to document this bug. Looking at the white-space property page values section:

https://developer.mozilla.org/en-US/docs/Web/CSS/white-space#Values

The table near the bottom says that for values of pre, pre-wrap, and pre-line, new lines are preserved; for the other values they are collapsed. This looks like basically (a simplified explanation of) the behaviour described by

https://drafts.csswg.org/css-text/#line-break-transform

Do I need to add any more than this to explain it to web developers?
(In reply to Chris Mills (Mozilla, MDN editor) [:cmills] from comment #41)
> (In reply to Xidorn Quan [:xidorn] (UTC+10) from comment #40)
> > I suggest we remove the item of this bug from "Firefox 52 for developers",
> > because without fixing bug 1316482, we do not really fully implement the new
> > rule.
> 
> It looks like this bug is now fixed, so I don't have to remove it anymore ;-)

Unfortunately, the fix didn't land in 53...

My friend just told me today that her code didn't work as intended and it was exactly what the bug 1316482 was about.
So, even though 52's already shipped, I think it's still better to move the item to the "53 for devs" page.
Depends on: 1369985
Depends on: 1371351
Depends on: 1373586
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: