Closed Bug 1824671 Opened 2 years ago Closed 1 year ago

Migrate bidi handling from ICU4C to the unicode-bidi crate

Categories

(Core :: Internationalization, task)

task

Tracking

()

RESOLVED FIXED
124 Branch
Tracking Status
firefox124 --- disabled
firefox125 --- fixed

People

(Reporter: hsivonen, Assigned: jfkthame)

References

(Blocks 1 open bug)

Details

Attachments

(12 files, 2 obsolete files)

48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review

ICU4X isn’t aiming to provide an implementation of the Unicode Bidirectional Algorithm and is instead aiming to provide the requisite data for an external implementation of the algorithm to query. The external implementation being targeted is the unicode-bidi crate.

We need to

  1. Make unicode-bidi able to deal with UTF-16 and UTF-16 code unit indices.
  2. Expose iteration over the resolved level runs in logical order.
  3. Add an entry point where the caller guarantees that the input consists of a paragraph and unicode-bidi skips its own paragraph breaking.
Blocks: icu4x
See Also: → 1824638

I have posted a PR to add native UTF-16 support to the unicode-bidi crate.

Additional upstream PRs 115 (single-paragraph API) and 117 (get_base_direction function) have now been merged; at this point I think the unicode-bidi APIs will be sufficient to allow integration into Gecko without too much pain.

unicode-bidi 0.3.14 introduces native utf-16 support, and other additional
APIs that we'll need for Gecko integration.

The failure in bidi-011.xht occurs because unicode-bidi (unlike ICU4C) closely follows the
implementation notes in UAX#9 regarding treatment of the bidi embedding/override controls,
if the characters are not actually removed from the buffer during processing. The (non-normative)
algorithm given in the spec results in embedding initiators taking on the level of the context,
whereas the legacy ICU4C behavior gives them the level of the embedded (or overridden) text.
This difference affects bidi continuation creation when the directional controls occur exactly
at inline element boundaries, and is observable in this test.

We could fix this by post-processing the unicode-bidi results to conform to what ICU4C did,
but I don't think it's worth it for this extremely niche case (which is not mandated by any
spec, AFAICT; and it's unclear whether any particular behavior is better or worse than any
other). Moreover, applying a tweak to fix this will also revert the newly-passing examples
in css-text/white-space/white-space-collapse-002.html, so we'd lose as much as we'd gain.

The new failures in first-letter-punctuation-126 and -127 are because unlike ICU4C, unicode-bidi
(correctly, per spec) assigns level 2 to characters with class AN where the surrounding context
is level 0. These two examples use Arabic numeric punctuation characters that have class AN in
conjunction with a Latin letter (class L), and so we get a bidi level boundary between the letter
and the punctuation, which interferes with our first-letter implementation.

This is such an obscure edge case, extremely unlikely to ever occur in real content, that I'm happy
just marking it as an expected failure for now. Our entire first-letter implementation has much
more serious issues than this.

Depends on D197891

Assignee: nobody → jfkthame
Attachment #9371492 - Attachment description: WIP: Bug 1824671 - patch 1 - [wip] Vendor unicode-bidi 0.3.14, in preparation for migrating the intl Bidi component from ICU4C. → Bug 1824671 - patch 1 - [wip] Vendor unicode-bidi 0.3.14, in preparation for migrating the intl Bidi component from ICU4C.
Status: NEW → ASSIGNED
Attachment #9371493 - Attachment description: WIP: Bug 1824671 - patch 1a - [wip] Fix for unicode-bidi issue #119. → Bug 1824671 - patch 1a - [wip] Fix for unicode-bidi issue #119.
Attachment #9371494 - Attachment description: WIP: Bug 1824671 - patch 2 - [wip] Create a minimal unicode-bidi-ffi crate to expose APIs needed by the intl::Bidi component. → Bug 1824671 - patch 2 - [wip] Create a minimal unicode-bidi-ffi crate to expose APIs needed by the intl::Bidi component.
Attachment #9371495 - Attachment description: WIP: Bug 1824671 - patch 3 - [wip] Convert intl Bidi component to be backed by the unicode-bidi crate. → Bug 1824671 - patch 3 - [wip] Convert intl Bidi component to be backed by the unicode-bidi crate.
Attachment #9371496 - Attachment description: WIP: Bug 1824671 - patch 4 - [wip] Update Gtest to accept either ICU4C or unicode-bidi handling of the embedding controls (explicitly unspecified by UAX#9). → Bug 1824671 - patch 4 - [wip] Update Gtest to accept either ICU4C or unicode-bidi handling of the embedding controls (explicitly unspecified by UAX#9).
Attachment #9371497 - Attachment description: WIP: Bug 1824671 - patch 5 - [wip] Update WPT metadata for tests affected by the change in bidi implementation. → Bug 1824671 - patch 5 - [wip] Update WPT metadata for tests affected by the change in bidi implementation.
Attachment #9371498 - Attachment description: WIP: Bug 1824671 - patch 6 - [wip] Update reference for examples in clone-intrinsic-size-bidi that now work better. → Bug 1824671 - patch 6 - [wip] Update reference for examples in clone-intrinsic-size-bidi that now work better.

This will cause ./mach vendor rust to pull it into the tree.

Depends on D198445

There are some large source files in ICU4X, requiring this limit to be increased.
(That doesn't necessarily equate to bloating the final product, as the linker will
only include the fragments we actually use.)

Depends on D198446

Attachment #9371492 - Attachment description: Bug 1824671 - patch 1 - [wip] Vendor unicode-bidi 0.3.14, in preparation for migrating the intl Bidi component from ICU4C. → Bug 1824671 - patch 1 - Vendor unicode-bidi 0.3.14, in preparation for migrating the intl Bidi component from ICU4C. r=#platform-i18n-reviewers,#supply-chain-reviewers
Attachment #9371493 - Attachment description: Bug 1824671 - patch 1a - [wip] Fix for unicode-bidi issue #119. → Bug 1824671 - patch 1a - Fix for unicode-bidi issue #119. r=#platform-i18n-reviewers,#supply-chain-reviewers
Attachment #9371494 - Attachment description: Bug 1824671 - patch 2 - [wip] Create a minimal unicode-bidi-ffi crate to expose APIs needed by the intl::Bidi component. → Bug 1824671 - patch 2 - Create a minimal unicode-bidi-ffi crate to expose APIs needed by the intl::Bidi component. r=#platform-i18n-reviewers
Attachment #9371495 - Attachment description: Bug 1824671 - patch 3 - [wip] Convert intl Bidi component to be backed by the unicode-bidi crate. → Bug 1824671 - patch 3 - Convert intl Bidi component to be backed by the unicode-bidi crate. r=#platform-i18n-reviewers
Attachment #9371496 - Attachment description: Bug 1824671 - patch 4 - [wip] Update Gtest to accept either ICU4C or unicode-bidi handling of the embedding controls (explicitly unspecified by UAX#9). → Bug 1824671 - patch 4 - Update Gtest to accept either ICU4C or unicode-bidi handling of the embedding controls (explicitly unspecified by UAX#9). r=#platform-i18n-reviewers
Attachment #9371497 - Attachment description: Bug 1824671 - patch 5 - [wip] Update WPT metadata for tests affected by the change in bidi implementation. → Bug 1824671 - patch 5 - Update WPT metadata for tests affected by the change in bidi implementation. r=#platform-i18n-reviewers
Attachment #9371498 - Attachment description: Bug 1824671 - patch 6 - [wip] Update reference for examples in clone-intrinsic-size-bidi that now work better. → Bug 1824671 - patch 6 - Update reference for examples in clone-intrinsic-size-bidi that now work better. r=#platform-i18n-reviewers
Attachment #9372602 - Attachment description: Bug 1824671 - patch 7 - Add icu_properties & dependencies to audits.toml and to package license whitelist, to enable it to be vendored. → Bug 1824671 - patch 7 - Add icu_properties & dependencies to audits.toml and to package license whitelist, to enable it to be vendored. r=#platform-i18n-reviewers,#supply-chain-reviewers
Attachment #9372603 - Attachment description: Bug 1824671 - patch 8 - Add icu_properties as a dependency to unicode-bidi-ffi. → Bug 1824671 - patch 8 - Add icu_properties as a dependency to unicode-bidi-ffi. r=#platform-i18n-reviewers
Attachment #9372605 - Attachment description: Bug 1824671 - patch 10 - Run ./mach vendor rust to bring in icu_properties. → Bug 1824671 - patch 9 - Run ./mach vendor rust to bring in icu_properties. r=#platform-i18n-reviewers,#supply-chain-reviewers
Attachment #9372606 - Attachment description: Bug 1824671 - patch 11 - Switch unicode-bidi-ffi to use icu_properties as data source. → Bug 1824671 - patch 10 - Switch unicode-bidi-ffi to use icu_properties as data source. r=#platform-i18n-reviewers
Attachment #9372604 - Attachment description: Bug 1824671 - patch 9 - Bump file-size limit in vendor_rust.py, to enable vendoring icu4x sources. → Bug 1824671 - [not to be landed] - Bump file-size limit in vendor_rust.py, to enable vendoring icu4x sources.

The following patch is waiting for review from a reviewer who resigned from the review:

ID Title Author Reviewer Status
D198447 Bug 1824671 - [not to be landed] - Bump file-size limit in vendor_rust.py, to enable vendoring icu4x sources. jfkthame glandium: Resigned from review

:jfkthame, could you please find another reviewer?

For more information, please visit BugBot documentation.

Flags: needinfo?(jfkthame)
Flags: needinfo?(jfkthame)
Attachment #9371493 - Attachment description: Bug 1824671 - patch 1a - Fix for unicode-bidi issue #119. r=#platform-i18n-reviewers,#supply-chain-reviewers → Bug 1824671 - patch 1a - Bump unicode-bidi to v0.3.15 to pick up bug-fix #119 and extended get_base_direction_full API. r=#platform-i18n-reviewers,#supply-chain-reviewers
Pushed by jkew@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/17e79f579645 patch 1 - Vendor unicode-bidi 0.3.14, in preparation for migrating the intl Bidi component from ICU4C. r=supply-chain-reviewers,dminor https://hg.mozilla.org/integration/autoland/rev/b273d641a26a patch 1a - Bump unicode-bidi to v0.3.15 to pick up bug-fix #119 and extended get_base_direction_full API. r=supply-chain-reviewers,dminor https://hg.mozilla.org/integration/autoland/rev/9a2fbd21feb5 patch 2 - Create a minimal unicode-bidi-ffi crate to expose APIs needed by the intl::Bidi component. r=platform-i18n-reviewers,dminor https://hg.mozilla.org/integration/autoland/rev/89622f53da27 patch 3 - Convert intl Bidi component to be backed by the unicode-bidi crate. r=platform-i18n-reviewers,dminor https://hg.mozilla.org/integration/autoland/rev/d8414db5bc0c patch 4 - Update Gtest to accept either ICU4C or unicode-bidi handling of the embedding controls (explicitly unspecified by UAX#9). r=platform-i18n-reviewers,dminor https://hg.mozilla.org/integration/autoland/rev/e16c7ccc64e5 patch 5 - Update WPT metadata for tests affected by the change in bidi implementation. r=platform-i18n-reviewers,dminor https://hg.mozilla.org/integration/autoland/rev/5c20e50fc26b patch 6 - Update reference for examples in clone-intrinsic-size-bidi that now work better. r=platform-i18n-reviewers,dminor https://hg.mozilla.org/integration/autoland/rev/71dbcba23335 patch 7 - Add icu_properties & dependencies to audits.toml and to package license whitelist, to enable it to be vendored. r=glandium,supply-chain-reviewers https://hg.mozilla.org/integration/autoland/rev/7fe14219e95e patch 8 - Add icu_properties as a dependency to unicode-bidi-ffi. r=platform-i18n-reviewers,dminor https://hg.mozilla.org/integration/autoland/rev/f029db1566d8 patch 9 - Run ./mach vendor rust to bring in icu_properties. r=supply-chain-reviewers,dminor https://hg.mozilla.org/integration/autoland/rev/f41ef5e3f7b9 patch 10 - Switch unicode-bidi-ffi to use icu_properties as data source. r=platform-i18n-reviewers,dminor

Backed out for causing spidermonkey bustages in Bidi.h

  • Backout link
  • Push with failures
  • Failure Log
  • Failure line: /builds/worker/workspace/obj-spider/dist/include/mozilla/intl/Bidi.h:13:12: fatal error: 'mozilla/intl/unicode_bidi_ffi_generated.h' file not found
Flags: needinfo?(jfkthame)

Looks like the standalone spidermonkey build fails because the Cbindgen-generated header for unicode-bidi-ffi isn't available there.

As spidermonkey does not actually use the intl::Bidi component AFAICT, the simplest fix for now is probably to just omit the Bidi component from the standalone build; there's no point in compiling it at all.

Flags: needinfo?(jfkthame)

For now, standalone JS builds don't appear to use the --compile-environment option,
and so the Cbindgen step to generate the unicode_bidi_ffi_generated.h header file
doesn't happen. This means intl/components/src/Bidi.h will fail to compile.

As SpiderMonkey doesn't actually use the Bidi component, we can avoid this issue by
simply omitting the Bidi sources from the standalone build. If/when it ever needs this
component in future, we can restore it by making the JS build run the Cbindgen step.

Attachment #9376207 - Attachment is obsolete: true
Attachment #9372604 - Attachment is obsolete: true
Pushed by jkew@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/18ec98c3bab1 patch 1 - Vendor unicode-bidi 0.3.14, in preparation for migrating the intl Bidi component from ICU4C. r=supply-chain-reviewers,dminor https://hg.mozilla.org/integration/autoland/rev/cf26994c564c patch 1a - Bump unicode-bidi to v0.3.15 to pick up bug-fix #119 and extended get_base_direction_full API. r=supply-chain-reviewers,dminor https://hg.mozilla.org/integration/autoland/rev/594ba1dd35c1 patch 2 - Create a minimal unicode-bidi-ffi crate to expose APIs needed by the intl::Bidi component. r=platform-i18n-reviewers,dminor https://hg.mozilla.org/integration/autoland/rev/7ca242981444 patch 3 - Convert intl Bidi component to be backed by the unicode-bidi crate. r=platform-i18n-reviewers,dminor https://hg.mozilla.org/integration/autoland/rev/37e201064ad6 patch 4 - Update Gtest to accept either ICU4C or unicode-bidi handling of the embedding controls (explicitly unspecified by UAX#9). r=platform-i18n-reviewers,dminor https://hg.mozilla.org/integration/autoland/rev/7cf1a5f1e892 patch 5 - Update WPT metadata for tests affected by the change in bidi implementation. r=platform-i18n-reviewers,dminor https://hg.mozilla.org/integration/autoland/rev/a3d2816c9be4 patch 6 - Update reference for examples in clone-intrinsic-size-bidi that now work better. r=platform-i18n-reviewers,dminor https://hg.mozilla.org/integration/autoland/rev/def067029471 patch 7 - Add icu_properties & dependencies to audits.toml and to package license whitelist, to enable it to be vendored. r=glandium,supply-chain-reviewers https://hg.mozilla.org/integration/autoland/rev/4576fdd506ed patch 8 - Add icu_properties as a dependency to unicode-bidi-ffi. r=platform-i18n-reviewers,dminor https://hg.mozilla.org/integration/autoland/rev/0aeea071568a patch 9 - Run ./mach vendor rust to bring in icu_properties. r=supply-chain-reviewers,dminor https://hg.mozilla.org/integration/autoland/rev/d96c1ebebf8e patch 10 - Switch unicode-bidi-ffi to use icu_properties as data source. r=platform-i18n-reviewers,dminor https://hg.mozilla.org/integration/autoland/rev/187caa38d349 patch 11 - Make the JS build run cbindgen unconditionally, so that the unicode-bidi-ffi header gets generated. r=firefox-build-system-reviewers,glandium
Regressions: 1879806
Regressions: 1880487
See Also: → 1880746

Given the several perf issues associated with this, I propose to disable it on Beta/Release builds for the time being, so it'll be on Nightly only while we investigate further.

Blocks: 1881232
Regressions: 1906978
Regressions: 1885702
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: