Closed Bug 240943 Opened 17 years ago Closed 16 years ago

Update bidi data files to Unicode 4.0.1

Categories

(Core :: Layout: Text and Fonts, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: smontagu, Assigned: smontagu)

References

()

Details

(Keywords: fixed1.7)

Attachments

(3 files, 1 obsolete file)

Unicode 4.0.1 includes changes to some Bidi properties, and Unicode 4.0 already
added several new right-to-left characters, including some beyond the BMP.

We need at least to rerun layout/tools/genbidicattable.pl and
layout/tools/gensymmtable.pl on the latest version of the Unicode Character
Database, and the scripts themselves may need some modifications.
Blocks: 73251
Asking for blocking1.7. I think making the next stable release compliant with
Unicode 4.0.1 makes a lot of sense. Moreover, fixing this will also resolve Bug
73251 which currently represents the most frequent and obvious rendering bug
with Hebrew pages.

Prog.
Flags: blocking1.7?
Attached patch Patch for 1.7Splinter Review
I wouldn't like to take the risk of adding support for Bidi surrogate
characters to 1.7 at this late stage, so here is a patch which only changes the
character types within the BMP, generated from a subset of the UCDB.
Comment on attachment 147221 [details] [diff] [review]
Patch for 1.7

blizzard, can you rs this change to the generated file bidicattable.h? No
actual code changes are involved, only modifications and additions to the bidi
character types.
Attachment #147221 - Flags: superreview?(blizzard)
Given that smontagu doesn't want to take take the risk of adding Bidi surrogate
characters, is there good reason to hold the release of 1.7 for this patch? (I'm
not suggesting we wouldn't approve a reviewed patch, just questioning whether or
not we should block the release for this change).
Asa, see comment #1 for a couple of good reasons to block 1.7 for this bug. 

Prog.
(In reply to comment #4)
> Given that smontagu doesn't want to take take the risk of adding Bidi surrogate
> characters, is there good reason to hold the release of 1.7 for this patch? 

Let me clarify two points: (1) this patch only includes the parts that I don't
think are risky; (2) as far as I know, no Bidi surrogate characters are yet in
use in the wild, although Unicode 4.0 defined a few.
Comment on attachment 147221 [details] [diff] [review]
Patch for 1.7

I guess I was forgetting correct process and thinking that an "rs" request
didn't need r= first. jshin, can you review?
Attachment #147221 - Flags: review?(jshin)
Attached patch Patch for trunk (obsolete) — Splinter Review
As well as the necessary changes for non-BMP characters, this includes some
tightening up of the code in GetBidiCat(), and changes to the license block to
match the manual changes that were made to bidicattable.h.

I'll cut out the #ifdef DEBUG_Simon part before checkin.
Comment on attachment 147705 [details] [diff] [review]
Patch for trunk

Jungshik, can you review this one too?

> #define CHAR_IS_BIDI(c) ( (IS_HINDI_DIGIT(c) ) || (IS_HEBREW_CHAR(c) ) \
>-                        || (IS_06_CHAR(c) ) || (IS_FE_CHAR(c) ) )
>+                        || (IS_06_CHAR(c) ) || (IS_FE_CHAR(c) ) \
>+                        || (IS_CYPRIOT_CHAR(c) ) )

I don't like this part very much myself. There are a whole bunch of RTL scripts
roadmapped for addition to Unicode, and I don't like the idea of adding another
condition to this #define for each one.

The way to go is probably to move bidicattable.h back into intl/unicharutils
and call it directly, but let's do that separately.
Attachment #147705 - Flags: review?(jshin)
Comment on attachment 147221 [details] [diff] [review]
Patch for 1.7

r=jshin
Attachment #147221 - Flags: review?(jshin) → review+
Comment on attachment 147221 [details] [diff] [review]
Patch for 1.7

transferring rs request. See comment 3 for justification of rs.
Attachment #147221 - Flags: superreview?(blizzard) → superreview?(roc)
Attachment #147221 - Flags: superreview?(roc) → superreview+
Comment on attachment 147221 [details] [diff] [review]
Patch for 1.7

Requesting approval for 1.7. 

The behaviour of hyphen-minus between right-to-left characters and numbers has
long been one of the most obvious difference between the Bidi behaviours of
Mozilla and IE, and has led to many bug reports (see bug 73251 and its
duplicates).

In the past Mozilla was conforming to the Unicode standard better than IE, but
meanwhile the standard has changed under us and we have become non-compliant.

risk: minimal. The patch has no semantic code changes, only additions and
modifications to the table of Bidi categories.
Attachment #147221 - Flags: approval1.7?
Comment on attachment 147221 [details] [diff] [review]
Patch for 1.7

a=mkaply
Attachment #147221 - Flags: approval1.7? → approval1.7+
I'm removing my blocking1.7 request, this is already checked into the Branch. I
hope the Trunk patch will soon follow (reminder: jshin).

Prog.
Flags: blocking1.7?
Could you land the branch patch on the trunk until the trunk patch is ready so
that the branch patch gets more testing?
(In reply to comment #15)
> Could you land the branch patch on the trunk until the trunk patch is ready so
> that the branch patch gets more testing?

I did so, unfortunately regressing the recent manual change to the license block
of bidicattable.h, but that will be corrected when I check in the final version.
Comment on attachment 147705 [details] [diff] [review]
Patch for trunk

r=jshin sorry for the delay. 


>Index: content/shared/src/nsTextFragment.cpp

>@@ -344,17 +345,24 @@

>     while (cp < end) {
>-      PRUnichar ch = *cp++;
>-      if (CHAR_IS_BIDI(ch) ) {
>+      PRUnichar ch1 = *cp++;
>+      PRUint32 utf32Char = ch1;
>+      if (IS_HIGH_SURROGATE(ch1) && cp < end) {
>+        PRUnichar ch2 = *cp++;
>+        if (IS_LOW_SURROGATE(ch2)) {
>+          utf32Char = SURROGATE_TO_UCS4(ch1, ch2);
>+        }
>+      }
>+      if (CHAR_IS_BIDI(utf32Char) ) {
>         mState.mIsBidi = PR_TRUE;
>         break;

an edge case: If a high surrogate code point is followed by a 'BIDI character'
(in BMP), it wouldn't be detected. Therefore, it may be necessary to add 'else
if (CHAR_IS_BIDI(ch2))' after 'if (IS_LOW_SURROGATE(ch2))'
Comment on attachment 147705 [details] [diff] [review]
Patch for trunk

sorry that i forgot to set the r flag.
Attachment #147705 - Flags: review?(jshin) → review+
Thanks for review.

> an edge case: If a high surrogate code point is followed by a 'BIDI character'
> (in BMP), it wouldn't be detected. Therefore, it may be necessary to add 'else
> if (CHAR_IS_BIDI(ch2))' after 'if (IS_LOW_SURROGATE(ch2))'

Good catch! And if we are talking edge cases, I suppose we also have to consider
the case of an isolated high surrogate followed by a well-formed surrogate pair.
Attached patch Patch for trunkSplinter Review
I changed that to: 
+      PRUnichar ch1 = *cp++;
+      PRUint32 utf32Char = ch1;
+      if (IS_HIGH_SURROGATE(ch1) &&
+	   cp < end &&
+	   IS_LOW_SURROGATE(*cp)) {
+	 PRUnichar ch2 = *cp++;
+	 utf32Char = SURROGATE_TO_UCS4(ch1, ch2);
+      }

and removed the debugging code.
Attachment #147705 - Attachment is obsolete: true
Comment on attachment 148662 [details] [diff] [review]
Patch for trunk

Transferring r and requesting sr
Attachment #148662 - Flags: superreview?(rbs)
Attachment #148662 - Flags: review+
Comment on attachment 148662 [details] [diff] [review]
Patch for trunk

sr=rbs
Attachment #148662 - Flags: superreview?(rbs) → superreview+
Fix checked in.
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
This isn't working the way you want. On Linux gcc throws pairs of warnings like
these:

nsTextBoxFrame.cpp:510:
 warning: comparison is always false due to limited range of data type
nsTextBoxFrame.cpp:510:
 warning: comparison is always true due to limited range of data type

This happens because CHAR_IS_BIDI now contains IS_CYPRIOT_CHAR and
IS_CYPRIOT_CHAR tests against values > 0xffff. Gcc knows that PRUnichar is only
16 bits so it always evaluates IS_CYPRIOT_CHAR as false.

If you're expecting to handle U+1xxxx characters you'll have to do something else.

I'm reopening this bug because the patch clearly doesn't work as designed.

Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I saw the warning, but why does it mean that the patch is not working as expected? 

I have suggested getting rid of CHAR_IS_BIDI anyway for performance reasons (bug
244036). If we don't go ahead with that we could at least split it into two to
remove the warning, assuming that the caller will always know whether the
character is BMP or above:

#define UCS2_CHAR_IS_BIDI(c)  ( (IS_HINDI_DIGIT(c) ) || (IS_HEBREW_CHAR(c) ) \
                             || (IS_06_CHAR(c) ) || (IS_FE_CHAR(c) ) )
#define UTF32_CHAR_IS_BIDI(c) ( (IS_CYPRIOT_CHAR(c)) )
> I saw the warning, but why does it mean that the patch is not working as
> expected?

I was basing my statement on what's in the code. One would normally get the
impression that CHAR_IS_BIDI can detect Cypriot glyphs from reading the code.
One would have to know that this doesn't work for mozilla's PRUnichar. I think
that's a sure way to create subtle bugs as other people work on the code. While
gcc handles this at compile time other compilers might leave the tests in which
wouldn't help performance.

I like the idea of useing UCS2 in its name. That makes it more obvious what's
happening. I wonder about having both UCS2 and UTF32/UCS4 running around in
mozilla. Someone will eventually manage to assign one to the other and back
again which obviously doesn't work. Personally I would have gone with 32-bit
quantities form the beginning but I realize why that wasn't done.
Keywords: fixed1.7
(In reply to comment #9)
> ... There are a whole bunch of RTL scripts
> roadmapped for addition to Unicode, and I don't like the idea of adding another
> condition to this #define for each one.

Unicode has clearly roadmapped 0590-08FF and 00010800-00010FFF for RTL scripts,
see http://www.unicode.org/roadmaps/ - although these blocks contain some
combining marks which are formally bidi neutral and in principle usable with LTR
scripts. Although this roadmap is not set in stone, it is reasonable at this
stage to add just two conditions to cover these whole blocks. There are some
other RTL characters, apparently FB1D-FDFF and FE70-FEFF. But the safest thing
is to check with the Unicode bidi tables.
I think Peter Kirk's suggestion is excellent. It will make CHAR_IS_BIDI a
coarser-grained and faster test, more appropriate for the places it is used on
every character in a document, e.g. nsTextFragment::SetBidiFlag, leaving the
more precise and inevitably slower test for the detailed processing of Bidi text
in nsBidi::GetDirProps.
Attachment #152143 - Flags: superreview?(roc)
Attachment #152143 - Flags: review?(roc)
Attachment #152143 - Flags: superreview?(roc)
Attachment #152143 - Flags: superreview+
Attachment #152143 - Flags: review?(roc)
Attachment #152143 - Flags: review+
Checked in.
Status: REOPENED → RESOLVED
Closed: 17 years ago16 years ago
Resolution: --- → FIXED
*** Bug 291511 has been marked as a duplicate of this bug. ***
Component: Layout: BiDi Hebrew & Arabic → Layout: Text
QA Contact: zach → layout.fonts-and-text
You need to log in before you can comment on or make changes to this bug.