118589 - do not draw ? as fallback for Non-Spacing, and control characters

Reporter

Description

•

23 years ago

We currently draw '?' for every character that has no glyph in a font. We
should draw nothing and measure as zero for all characters in the following
classes of the Unicode database:
Mn  Mark, Non-Spacing
Cc  Other, Control
Cf  Other, Format
Lm  Letter, Modifier

We could have an xp ccharmap and use it on all 3 platforms to implement such
a fallback.
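
As a rough illustration of the proposal, the four classes can be tested with Python's standard `unicodedata` module. The function and constant names here are hypothetical and are not part of any Mozilla code; this is only a sketch of the classification, not of the ccharmap itself.

```python
import unicodedata

# General categories named in the description above: Mn, Cc, Cf, Lm.
ZERO_WIDTH_FALLBACK_CATEGORIES = frozenset({"Mn", "Cc", "Cf", "Lm"})


def is_zero_width_fallback_candidate(ch: str) -> bool:
    """Return True if ch falls in one of the proposed categories."""
    return unicodedata.category(ch) in ZERO_WIDTH_FALLBACK_CATEGORIES
```

With this check, U+05B0 HEBREW POINT SHEVA (Mn) and U+200D ZERO WIDTH JOINER (Cf) qualify, while an ordinary letter such as "A" does not.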

Frank Tang

Reporter

Comment 1

•

23 years ago

Attached file: a list of such characters

Frank Tang

Reporter

Comment 2

•

23 years ago

simon- I think we could use this to solve the problem of ZWJ and ZWNJ being shown.
We can also turn off the display of Hebrew and Arabic marks as '?'.

shanjian- how can we build the ccharmap at compile time (or before check-in)
as a bitmap?

For these 582 Unicode code points, we should check right before we draw or
measure a '?'. If the character is one of them, measure it as 0,0 for
GetTextDimension and draw nothing.
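
The check described above could look roughly like the following sketch. `has_glyph`, `ignorable`, and `advance` are hypothetical stand-ins for the font lookup, the table of code points, and the per-character width. Only the width is measured here, for brevity.

```python
FALLBACK_CHAR = "?"


def render_with_fallback(text, has_glyph, ignorable):
    """Return the string that would actually be drawn for text."""
    out = []
    for ch in text:
        if has_glyph(ch):
            out.append(ch)
        elif ord(ch) not in ignorable:
            out.append(FALLBACK_CHAR)
        # Otherwise the character has no glyph and is ignorable: draw nothing.
    return "".join(out)


def measure_width(text, has_glyph, ignorable, advance):
    """Sum the advances of what is drawn, so ignorable characters measure as 0."""
    drawn = render_with_fallback(text, has_glyph, ignorable)
    return sum(advance(ch) for ch in drawn)
```

Drawing and measuring go through the same function, so the two cannot disagree about which characters were dropped.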

Rui Xu

Updated

•

23 years ago

QA Contact: ruixu → ylong

Christopher Hoess (gone)

Comment 3

•

23 years ago

Frank, should bug 106311 be duped into this?

Frank Tang

Reporter

Comment 4

•

23 years ago

shanjian said the compressed cmap will be more than 1K. We should probably
just use an array with binary search, since we don't care too much about
performance here.
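
A minimal sketch of the sorted-array alternative, using Python's standard `bisect` module. The table below is a small illustrative subset, not the real list, and the names are hypothetical.

```python
from bisect import bisect_left

# Illustrative subset only; must be kept sorted for binary search to work.
IGNORABLE_SORTED = (0x05B0, 0x05B4, 0x064B, 0x0651, 0x200C, 0x200D)


def in_sorted_table(code_point, table=IGNORABLE_SORTED):
    """Binary-search the sorted table for code_point."""
    i = bisect_left(table, code_point)
    return i < len(table) and table[i] == code_point
```

The storage cost is just one integer per code point, at the price of a logarithmic number of comparisons per lookup.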

Frank Tang

Reporter

Updated

•

23 years ago

Status: NEW → ASSIGNED

Target Milestone: --- → mozilla0.9.9

Shanjian Li

Comment 5

•

23 years ago

Frank,
I calculated the size of the ccmap by hand (it is not very hard to do). The
size of such a ccmap should be around 964 bytes. The advantage is fast access:
there are 3 memory references before getting the result, whereas a binary search
takes up to 10 comparisons. So the ccmap approach might still be the better idea
if this kind of lookup is done frequently in certain user environments.
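
To make the comparison concrete, here is a simplified three-level lookup in the spirit of a compressed character map. The 4/4/8 bit split and the nested-list layout are assumptions made for illustration and are not the actual CCMAP memory layout. It handles 16-bit code points only.

```python
UPPER_BITS, MID_BITS, PAGE_BITS = 4, 4, 8  # assumed split of a 16-bit code point
MID_MASK = (1 << MID_BITS) - 1
PAGE_MASK = (1 << PAGE_BITS) - 1


def build_map(code_points):
    """Build upper -> mid -> page tables; each page is a 256-bit integer."""
    upper = [None] * (1 << UPPER_BITS)
    for cp in code_points:
        u = cp >> (MID_BITS + PAGE_BITS)
        if upper[u] is None:
            upper[u] = [0] * (1 << MID_BITS)
        upper[u][(cp >> PAGE_BITS) & MID_MASK] |= 1 << (cp & PAGE_MASK)
    return upper


def has_char(upper, cp):
    mid = upper[cp >> (MID_BITS + PAGE_BITS)]    # reference 1: upper table
    if mid is None:
        return False
    page = mid[(cp >> PAGE_BITS) & MID_MASK]     # reference 2: mid table
    return bool((page >> (cp & PAGE_MASK)) & 1)  # reference 3: page bits


# Worst-case probes for a binary search over 582 sorted entries.
WORST_CASE_COMPARISONS = (582).bit_length()
```

For 582 entries the binary search bound works out to 10, which matches the figure quoted above.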

kill this account

Comment 6

•

23 years ago

The map should be a singleton (i.e., only one instance).

Shanjian Li

Comment 7

•

23 years ago

Frank,
My answer to your second question was incorrect. I double-checked the implementation
of CCMAP. It allows several ALU_TYPEs (i.e., 16 bits, 32 bits, 64 bits) to initialize and
access a ccmap. So if we always use 16-bit integers to initialize the array (ccmap), there
will be a difference between BE and LE when ALU_TYPE is 32 or 64 bits.

We should probably get rid of ALU_TYPE. I need to talk to brian.

kill this account

Comment 8

•

23 years ago

Brendan asked specifically for variable sized access (ALU).

Let's discuss the pluses and minuses of the various options before we make a
decision.

Brendan Eich [:brendan]

Comment 9

•

23 years ago

If we initialize a ccmap using PRUint16 stores, with bit-setting within 16-bit
units, but access using wider loads, then shanjian is right and we'll definitely
care about byte order (or PRUint16-order within the 32- or 64-bit units).  But
can we not use the wider (ALU_TYPE) accesses always, whenever loading or storing?

/be

Shanjian Li

Comment 10

•

23 years ago

>>But can we not use the wider (ALU_TYPE) accesses always, whenever loading or
>>storing?
Brendan, are you suggesting that we disable wider accesses only in certain
situations (loading or storing)? CCMAP has a flag field we can use, but that
would add one more memory reference to each access.

Shanjian Li

Comment 11

•

23 years ago

I talked with brian yesterday, and we came up with 3 possible solutions:
1) Dynamically generate the CCMAP
  plus: We can still use the wider access in ccmap.
        A static initialization array is more readable and easier to maintain.
  minus:There is an additional runtime dependency on the ccmap library.
        There is a small additional running cost.
2) Totally disable CCMAP wider access, and initialize the ccmap directly
  plus: All ccmap accesses are just macros, so we have no runtime dependency.
        The CCMAP code will be less complicated, and thus easier to maintain.
  minus:The benefit of wider access is lost.
3) Create a new indexed array similar to ccmap
  plus: We might have the smallest footprint.
  minus:We are reinventing the wheel.
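
Option 1 can be pictured as a readable static table that is expanded into the lookup structure once, on first use. This is only a sketch: the ranges are an illustrative subset, the names are hypothetical, and a `frozenset` stands in for the real map.

```python
from functools import lru_cache

# Readable static source data: inclusive (first, last) ranges, illustrative only.
IGNORABLE_RANGES = ((0x0591, 0x05A1), (0x05A3, 0x05AF), (0x200C, 0x200D))


@lru_cache(maxsize=None)
def ignorable_map():
    """Build the lookup structure on first call and reuse it afterwards."""
    return frozenset(
        cp for first, last in IGNORABLE_RANGES for cp in range(first, last + 1)
    )
```

Because the result is cached, every caller shares one instance, and the small running cost mentioned above is paid only on the first lookup.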

Brendan Eich [:brendan]

Comment 12

•

23 years ago

No, I was suggesting that you use wide accesses for all loads and stores from
the map, where the current code uses a wide access some of the time.  Where the
current code uses only PRUint16 accesses, no need to change.  But first perhaps
we can ascertain the performance gain of wider accesses?

/be

Andreas Becker

Updated

•

23 years ago

Keywords: intl

Frank Tang

Reporter

Comment 13

•

23 years ago

Moving this to an m1.1 item.

Target Milestone: mozilla0.9.9 → mozilla1.1

saari (gone)

Comment 14

•

22 years ago

*** Bug 152958 has been marked as a duplicate of this bug. ***

Frank Tang

Reporter

Updated

•

22 years ago

Target Milestone: mozilla1.1alpha → ---

Frank Tang

Reporter

Comment 15

•

22 years ago

OK, I have changed my position.
First of all, I think we already have a solution. We can add empty entries to
/intl/unicharutil/tables/transliterate.properties to solve this problem instead
of inventing a new method. At least it works on Windows now. I should try it on
Mac and Linux.

For example, if Linux and Mac do not have the Hebrew vowel signs, adding
empty entries for them to /intl/unicharutil/tables/transliterate.properties
will probably turn off the vowel sign rendering when the font is not there,
instead of displaying a ? mark.

2. I think my attachment listing which Unicode characters could be treated this
way is wrong. Some of those characters should not be displayed as nothing. We
need to display them as ? instead.

>Frank, should bug 106311 be duped into this?
No. This bug is about the case where we cannot display a character with a valid
glyph: we should display nothing instead of a question mark.
In bug 106311, those characters are not displayed as a question mark but are
displayed with a glyph that claims to be the glyph for ASCII 0x11. That is a
totally different issue.

This one is about how we treat fallback; that one is about how we decide which
glyph is invalid in a valid TrueType font.

For now, we know that we probably want to address the following characters:
1. Hebrew accent and point marks
2. Arabic points
3. bidi control characters

smontagu, is that true?

Frank Tang

Reporter

Comment 16

•

22 years ago

smontagu- can you give me a list of Hebrew/Arabic characters that you think we
should display as nothing instead of ? when we don't have a glyph in any font?

Whiteboard: [eta 8/25]

Target Milestone: --- → mozilla1.2alpha

Simon Montagu :smontagu

Assignee

Comment 17

•

22 years ago

In the following list, I am sure about the Hebrew characters, but it would be
good if someone could give a second opinion about the Arabic.

0591;HEBREW ACCENT ETNAHTA
0592;HEBREW ACCENT SEGOL
0593;HEBREW ACCENT SHALSHELET
0594;HEBREW ACCENT ZAQEF QATAN
0595;HEBREW ACCENT ZAQEF GADOL
0596;HEBREW ACCENT TIPEHA
0597;HEBREW ACCENT REVIA
0598;HEBREW ACCENT ZARQA
0599;HEBREW ACCENT PASHTA
059A;HEBREW ACCENT YETIV
059B;HEBREW ACCENT TEVIR
059C;HEBREW ACCENT GERESH
059D;HEBREW ACCENT GERESH MUQDAM
059E;HEBREW ACCENT GERSHAYIM
059F;HEBREW ACCENT QARNEY PARA
05A0;HEBREW ACCENT TELISHA GEDOLA
05A1;HEBREW ACCENT PAZER
05A3;HEBREW ACCENT MUNAH
05A4;HEBREW ACCENT MAHAPAKH
05A5;HEBREW ACCENT MERKHA
05A6;HEBREW ACCENT MERKHA KEFULA
05A7;HEBREW ACCENT DARGA
05A8;HEBREW ACCENT QADMA
05A9;HEBREW ACCENT TELISHA QETANA
05AA;HEBREW ACCENT YERAH BEN YOMO
05AB;HEBREW ACCENT OLE
05AC;HEBREW ACCENT ILUY
05AD;HEBREW ACCENT DEHI
05AE;HEBREW ACCENT ZINOR
05AF;HEBREW MARK MASORA CIRCLE
05B0;HEBREW POINT SHEVA
05B1;HEBREW POINT HATAF SEGOL
05B2;HEBREW POINT HATAF PATAH
05B3;HEBREW POINT HATAF QAMATS
05B4;HEBREW POINT HIRIQ
05B5;HEBREW POINT TSERE
05B6;HEBREW POINT SEGOL
05B7;HEBREW POINT PATAH
05B8;HEBREW POINT QAMATS
05B9;HEBREW POINT HOLAM
05BB;HEBREW POINT QUBUTS
05BC;HEBREW POINT DAGESH OR MAPIQ
05BD;HEBREW POINT METEG
05BF;HEBREW POINT RAFE
05C1;HEBREW POINT SHIN DOT
05C2;HEBREW POINT SIN DOT
05C4;HEBREW MARK UPPER DOT
0640;ARABIC TATWEEL
064B;ARABIC FATHATAN
064C;ARABIC DAMMATAN
064D;ARABIC KASRATAN
064E;ARABIC FATHA
064F;ARABIC DAMMA
0650;ARABIC KASRA
0651;ARABIC SHADDA
0652;ARABIC SUKUN
0653;ARABIC MADDAH ABOVE
0654;ARABIC HAMZA ABOVE
0655;ARABIC HAMZA BELOW
0670;ARABIC LETTER SUPERSCRIPT ALEF
06D6;ARABIC SMALL HIGH LIGATURE SAD WITH LAM WITH ALEF MAKSURA
06D7;ARABIC SMALL HIGH LIGATURE QAF WITH LAM WITH ALEF MAKSURA
06D8;ARABIC SMALL HIGH MEEM INITIAL FORM
06D9;ARABIC SMALL HIGH LAM ALEF
06DA;ARABIC SMALL HIGH JEEM
06DB;ARABIC SMALL HIGH THREE DOTS
06DC;ARABIC SMALL HIGH SEEN
06DF;ARABIC SMALL HIGH ROUNDED ZERO
06E0;ARABIC SMALL HIGH UPRIGHT RECTANGULAR ZERO
06E1;ARABIC SMALL HIGH DOTLESS HEAD OF KHAH
06E2;ARABIC SMALL HIGH MEEM ISOLATED FORM
06E3;ARABIC SMALL LOW SEEN
06E4;ARABIC SMALL HIGH MADDA
06E7;ARABIC SMALL HIGH YEH
06E8;ARABIC SMALL HIGH NOON
06EA;ARABIC EMPTY CENTRE LOW STOP
06EB;ARABIC EMPTY CENTRE HIGH STOP
06EC;ARABIC ROUNDED HIGH STOP WITH FILLED CENTRE
06ED;ARABIC SMALL LOW MEEM
FB1E;HEBREW POINT JUDEO-SPANISH VARIKA
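
A list in this `HEX;NAME` form is easy to turn into a lookup table. The parser below is a hypothetical helper, shown only as a sketch. It ignores blank lines and any fields after the name.

```python
def parse_code_point_list(lines):
    """Parse 'HEX;NAME' lines into a {code_point: name} dict."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        table[int(fields[0], 16)] = fields[1].strip()
    return table
```

The keys of the resulting dict can then serve as the set of code points to treat as ignorable, and the names remain available for review.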

Mohammed Elzubeir

Comment 18

•

22 years ago

The Arabic looks right to me.

Roozbeh Pournader

Comment 19

•

22 years ago

The only one I'm not sure about is U+0640 ARABIC TATWEEL. It's a semi-letter,
semi-control character, and is also used as a dingbat. I would prefer to remove
it from the list.

Mohammed Elzubeir

Comment 20

•

22 years ago

Well, what other implications are there for keeping it on the list? If all there
is to it is that the fallback is to simply ignore it, then it certainly should
be on the list. After all, the TATWEEL has no significance really (it's a
formatting character to elongate the length of a word). 

But if there's something else I'm missing then please do enlighten me ;)

Roozbeh Pournader

Comment 21

•

22 years ago

Well, Tatweel is sometimes used as a hyphen in Persian (the hyphen glyph in many
fonts is a little high for Arabic text), it is sometimes used as a bullet, ...

These cases are not ignorable, and I would rather see a question mark than
nothing in these places, so that I can tell there is a font problem.

Mohammed Elzubeir

Comment 22

•

22 years ago

I am now enlightened, thanks ;) I changed my mind, I would rather not see the
U+0640 in that list.

Frank Tang

Reporter

Comment 23

•

19 years ago

What a hack. I have not touched Mozilla code for 2 years, and I have not read
these bugs for 2 years, and they are still here. Just closing them as WONTFIX to
clean up.

Status: ASSIGNED → RESOLVED

Closed: 19 years ago

Resolution: --- → WONTFIX

Jungshik Shin

Comment 24

•

19 years ago

This issue is being dealt with in bug 205387

Status: RESOLVED → REOPENED

Resolution: WONTFIX → ---

Travis Chase

Comment 25

•

19 years ago

Mass reassign. Please excuse the spam.

Assignee: ftang → nobody

Status: REOPENED → NEW

Simon Montagu :smontagu

Assignee

Comment 26

•

19 years ago

This has wider scope than bug 205387.

Assignee: nobody → smontagu

Depends on: 205387

Behnam Esfahbod [:zwnj]

Updated

•

19 years ago

Blocks: Persian

Phil Ringnalda (:philor)

Updated

•

15 years ago

QA Contact: amyy → i18n

Simon Montagu :smontagu

Assignee

Comment 27

•

15 years ago

Very, very WONTFIX: displaying nothing for a given character can be used as a phishing vector.

Status: NEW → RESOLVED

Closed: 19 years ago → 15 years ago

Resolution: --- → WONTFIX