Migrate bidi handling from ICU4C to the unicode-bidi crate
Categories
(Core :: Internationalization, task)
Tracking
()
People
(Reporter: hsivonen, Assigned: jfkthame)
References
(Blocks 1 open bug)
Details
Attachments
(12 files, 2 obsolete files)
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review |
ICU4X isn’t aiming to provide an implementation of the Unicode Bidirectional Algorithm and is instead aiming to provide the requisite data for an external implementation of the algorithm to query. The external implementation being targeted is the unicode-bidi crate.
We need to
- Make unicode-bidi able to deal with UTF-16 and UTF-16 code unit indices.
- Expose iteration over the resolved level runs in logical order.
- Add an entry point where the caller guarantees that the input consists of a paragraph and unicode-bidi skips its own paragraph breaking.
Updated•2 years ago
|
Reporter | ||
Updated•2 years ago
|
Assignee | ||
Comment 1•1 year ago
|
||
I have posted a PR to add native UTF-16 support to the unicode-bidi crate.
Assignee | ||
Comment 2•1 year ago
|
||
Additional upstream PRs 115 (single-paragraph API) and 117 (get_base_direction function) have now been merged; at this point I think the unicode-bidi APIs will be sufficient to allow integration into Gecko without too much pain.
Assignee | ||
Comment 3•1 year ago
|
||
unicode-bidi 0.3.14 introduces native utf-16 support, and other additional
APIs that we'll need for Gecko integration.
Assignee | ||
Comment 4•1 year ago
|
||
Depends on D197887
Assignee | ||
Comment 5•1 year ago
|
||
Depends on D197888
Assignee | ||
Comment 6•1 year ago
|
||
Depends on D197889
Assignee | ||
Comment 7•1 year ago
|
||
Depends on D197890
Assignee | ||
Comment 8•1 year ago
|
||
The failure in bidi-011.xht occurs because unicode-bidi (unlike ICU4C) closely follows the
implementation notes in UAX#9 regarding treatment of the bidi embedding/override controls,
if the characters are not actually removed from the buffer during processing. The (non-normative)
algorithm given in the spec results in embedding initiators taking on the level of the context,
whereas the legacy ICU4C behavior gives them the level of the embedded (or overridden) text.
This difference affects bidi continuation creation when the directional controls occur exactly
at inline element boundaries, and is observable in this test.
We could fix this by post-processing the unicode-bidi results to conform to what ICU4C did,
but I don't think it's worth it for this extremely niche case (which is not mandated by any
spec, AFAICT; and it's unclear whether any particular behavior is better or worse than any
other). Moreover, applying a tweak to fix this will also revert the newly-passing examples
in css-text/white-space/white-space-collapse-002.html, so we'd lose as much as we'd gain.
The new failures in first-letter-punctuation-126 and -127 are because unlike ICU4C, unicode-bidi
(correctly, per spec) assigns level 2 to characters with class AN where the surrounding context
is level 0. These two examples use Arabic numeric punctuation characters that have class AN in
conjunction with a Latin letter (class L), and so we get a bidi level boundary between the letter
and the punctuation, which interferes with our first-letter implementation.
This is such an obscure edge case, extremely unlikely to ever occur in real content, that I'm happy
just marking it as an expected failure for now. Our entire first-letter implementation has much
more serious issues than this.
Depends on D197891
Assignee | ||
Comment 9•1 year ago
|
||
Depends on D197892
Updated•1 year ago
|
Updated•1 year ago
|
Updated•1 year ago
|
Updated•1 year ago
|
Updated•1 year ago
|
Updated•1 year ago
|
Updated•1 year ago
|
Assignee | ||
Comment 10•1 year ago
|
||
Depends on D197893
Assignee | ||
Comment 11•1 year ago
|
||
This will cause ./mach vendor rust
to pull it into the tree.
Depends on D198445
Assignee | ||
Comment 12•1 year ago
|
||
There are some large source files in ICU4X, requiring this limit to be increased.
(That doesn't necessarily equate to bloating the final product, as the linker will
only include the fragments we actually use.)
Depends on D198446
Assignee | ||
Comment 13•1 year ago
|
||
Depends on D198447
Assignee | ||
Comment 14•1 year ago
|
||
Depends on D198448
Updated•1 year ago
|
Updated•1 year ago
|
Updated•1 year ago
|
Updated•1 year ago
|
Updated•1 year ago
|
Updated•1 year ago
|
Updated•1 year ago
|
Updated•1 year ago
|
Updated•1 year ago
|
Updated•1 year ago
|
Updated•1 year ago
|
Updated•1 year ago
|
Comment 15•1 year ago
|
||
The following patch is waiting for review from a reviewer who resigned from the review:
ID | Title | Author | Reviewer Status |
---|---|---|---|
D198447 | Bug 1824671 - [not to be landed] - Bump file-size limit in vendor_rust.py, to enable vendoring icu4x sources. | jfkthame | glandium: Resigned from review |
:jfkthame, could you please find another reviewer?
For more information, please visit BugBot documentation.
Assignee | ||
Updated•1 year ago
|
Updated•1 year ago
|
Comment 16•1 year ago
|
||
Comment 17•1 year ago
|
||
Backed out for causing spidermonkey bustages in Bidi.h
- Backout link
- Push with failures
- Failure Log
- Failure line: /builds/worker/workspace/obj-spider/dist/include/mozilla/intl/Bidi.h:13:12: fatal error: 'mozilla/intl/unicode_bidi_ffi_generated.h' file not found
Assignee | ||
Comment 18•1 year ago
|
||
Looks like the standalone spidermonkey build fails because the Cbindgen-generated header for unicode-bidi-ffi isn't available there.
As spidermonkey does not actually use the intl::Bidi component AFAICT, the simplest fix for now is probably to just omit the Bidi component from the standalone build; there's no point in compiling it at all.
Assignee | ||
Comment 19•1 year ago
|
||
For now, standalone JS builds don't appear to use the --compile-environment option,
and so the Cbindgen step to generate the unicode_bidi_ffi_generated.h header file
doesn't happen. This means intl/components/src/Bidi.h will fail to compile.
As SpiderMonkey doesn't actually use the Bidi component, we can avoid this issue by
simply omitting the Bidi sources from the standalone build. If/when it ever needs this
component in future, we can restore it by making the JS build run the Cbindgen step.
Updated•1 year ago
|
Assignee | ||
Comment 20•1 year ago
|
||
Updated•1 year ago
|
Comment 21•1 year ago
|
||
Comment 22•1 year ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/18ec98c3bab1
https://hg.mozilla.org/mozilla-central/rev/cf26994c564c
https://hg.mozilla.org/mozilla-central/rev/594ba1dd35c1
https://hg.mozilla.org/mozilla-central/rev/7ca242981444
https://hg.mozilla.org/mozilla-central/rev/37e201064ad6
https://hg.mozilla.org/mozilla-central/rev/7cf1a5f1e892
https://hg.mozilla.org/mozilla-central/rev/a3d2816c9be4
https://hg.mozilla.org/mozilla-central/rev/def067029471
https://hg.mozilla.org/mozilla-central/rev/4576fdd506ed
https://hg.mozilla.org/mozilla-central/rev/0aeea071568a
https://hg.mozilla.org/mozilla-central/rev/d96c1ebebf8e
https://hg.mozilla.org/mozilla-central/rev/187caa38d349
Assignee | ||
Comment 23•1 year ago
|
||
Given the several perf issues associated with this, I propose to disable it on Beta/Release builds for the time being, so it'll be on Nightly only while we investigate further.
Updated•1 year ago
|
Description
•