We currently draw '?' for every character for which we do not have a glyph in a font. We should draw nothing and measure as zero for all characters in the following Unicode database classes:
Mn  Mark, Non-Spacing
Cc  Other, Control
Cf  Other, Format
Lm  Letter, Modifier
We could have an XP ccharmap and use it on all 3 platforms to implement such a fallback.
simon- I think we could use this to solve the problem of showing ZWJ and ZWNJ. We can also turn off showing Hebrew and Arabic marks as '?'.
shanjian- how can we build the ccharmap at compile time (or before check-in) as a bitmap? For these 582 Unicode code points, right before we draw / measure as '?', we should check: if the character is one of them, measure it as 0,0 for GetTextDimension and draw nothing.
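A minimal sketch of the check described above, placed at the point where the '?' fallback would otherwise kick in. Every name here (IsIgnorableWhenMissing, MeasureMissingGlyph, FallbackText, Size) is hypothetical and not taken from the Mozilla tree, and the predicate covers only ZWNJ, ZWJ, LRM and RLM as a stand-in for the real 582-entry table.

```cpp
#include <cassert>
#include <string>

struct Size { int width; int height; };

// Hypothetical stand-in for the real table lookup: only U+200C..U+200F
// (ZWNJ, ZWJ, LRM, RLM), all of which are in class Cf.
bool IsIgnorableWhenMissing(char32_t ch) {
  return ch >= 0x200C && ch <= 0x200F;
}

// What to report for a character that no font has a glyph for:
// 0,0 if it is on the ignorable list, otherwise the size of '?'.
Size MeasureMissingGlyph(char32_t ch, Size questionMarkSize) {
  if (IsIgnorableWhenMissing(ch)) return Size{0, 0};
  return questionMarkSize;
}

// What to draw for a character that no font has a glyph for:
// nothing if it is on the ignorable list, otherwise '?'.
std::u32string FallbackText(char32_t ch) {
  if (IsIgnorableWhenMissing(ch)) return std::u32string();
  return std::u32string(1, U'?');
}
```

The point of keeping measurement and drawing behind the same predicate is that the two can never disagree about whether a character takes up space.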
Frank, should bug 106311 be duped into this?
shanjian said the compressed cmap will be more than 1K. We should probably just use an array with binary search, since we don't care too much about performance here.
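A rough sketch of the array-plus-binary-search idea, assuming the code points are stored as sorted, non-overlapping inclusive ranges. The names CodeRange, kIgnorable and IsIgnorable are made up for illustration, and the table is only a small illustrative subset, not the real list.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>

// Inclusive range of 16-bit code points.
struct CodeRange { uint16_t first; uint16_t last; };

// Illustrative subset only; must stay sorted and non-overlapping.
static const CodeRange kIgnorable[] = {
  {0x0591, 0x05A1},  // Hebrew accents
  {0x05A3, 0x05B9},  // Hebrew accents and points
  {0x064B, 0x0655},  // Arabic points
  {0x200C, 0x200F},  // ZWNJ, ZWJ, LRM, RLM
  {0x202A, 0x202E},  // bidi embedding/override controls
};

bool IsIgnorable(uint16_t ch) {
  const CodeRange* end = std::end(kIgnorable);
  // Binary search for the first range whose upper bound is >= ch.
  const CodeRange* it = std::lower_bound(
      std::begin(kIgnorable), end, ch,
      [](const CodeRange& r, uint16_t c) { return r.last < c; });
  // ch is in the table only if that range also starts at or before ch.
  return it != end && it->first <= ch;
}
```

Storing ranges rather than individual code points keeps the table small when the ignorable characters come in contiguous runs, as the Hebrew and Arabic marks mostly do.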
Status: NEW → ASSIGNED
Target Milestone: --- → mozilla0.9.9
Frank, I calculated the size of the ccmap manually (it is not very hard to do). The size of such a ccmap should be around 964 bytes. The advantage is fast access: there are 3 memory references before getting the result, whereas a binary search over 582 entries takes up to 10 comparisons. So the ccmap approach might still be the better idea if this kind of search is done frequently in certain user environments.
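To illustrate the "3 memory references" point, here is a sketch of a three-level bitmap for 16-bit code points. It is not the actual CCMAP layout: the 4/4/8 bit split, the class name ThreeLevelMap, and the use of heap-allocated levels with null checks are all assumptions made for the example.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <memory>

// Hypothetical three-level map: upper 4 bits select a mid table,
// next 4 bits select a page, low 8 bits select a bit in a 256-bit page.
class ThreeLevelMap {
 public:
  void Set(uint16_t ch) {
    auto& mid = upper_[ch >> 12];
    if (!mid) mid.reset(new Mid());
    auto& page = (*mid)[(ch >> 8) & 0xF];
    if (!page) page.reset(new Page());  // value-initialized, all bits clear
    (*page)[(ch & 0xFF) >> 3] |= uint8_t(1u << (ch & 7));
  }

  bool Has(uint16_t ch) const {
    const auto& mid = upper_[ch >> 12];          // memory reference 1
    if (!mid) return false;
    const auto& page = (*mid)[(ch >> 8) & 0xF];  // memory reference 2
    if (!page) return false;
    return ((*page)[(ch & 0xFF) >> 3] >> (ch & 7)) & 1;  // memory reference 3
  }

 private:
  using Page = std::array<uint8_t, 32>;               // 256 bits
  using Mid = std::array<std::unique_ptr<Page>, 16>;  // 16 pages
  std::array<std::unique_ptr<Mid>, 16> upper_;        // 16 mid tables
};
```

The lookup cost is fixed regardless of how many code points are set, which is the trade-off against the binary search: more bytes for the table, fewer steps per query.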
The map should be a singleton (i.e., only one instance).
Frank, my answer to your second question was incorrect. I double-checked the implementation of CCMAP. It allows several ALU_TYPEs (i.e. 16 bits, 32 bits, 64 bits) to initialize and access the ccmap. So if we always use 16-bit integers to initialize the array (ccmap), there will be a difference between BE and LE when ALU_TYPE is 32 or 64 bits. We should probably get rid of ALU_TYPE. I need to talk to brian.
Brendan asked specifically for variable-sized access (ALU). Let's discuss the pluses and minuses of the various options before we make a decision.
If we initialize a ccmap using PRUint16 stores, with bit-setting within 16-bit units, but access using wider loads, then shanjian is right and we'll definitely care about byte order (or PRUint16-order within the 32- or 64-bit units). But can we not use the wider (ALU_TYPE) accesses always, whenever loading or storing? /be
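The byte-order concern can be shown with a small simulation. This sketch does not use the real CCMAP macros; SetBit16, TestBit32 and ByteOrder are hypothetical names, and memory is modeled as a byte vector so that both byte orders can be tried on any host.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

enum class ByteOrder { Little, Big };

// Set bit `bit` of a bitmap using a 16-bit store, laying the bytes out in
// memory the way a machine of the given byte order would.
void SetBit16(std::vector<uint8_t>& mem, unsigned bit, ByteOrder order) {
  unsigned word = bit / 16;
  uint16_t mask = uint16_t(1u << (bit % 16));
  unsigned lowAddr  = (order == ByteOrder::Little) ? 2 * word : 2 * word + 1;
  unsigned highAddr = (order == ByteOrder::Little) ? 2 * word + 1 : 2 * word;
  mem[lowAddr]  |= uint8_t(mask & 0xFF);
  mem[highAddr] |= uint8_t(mask >> 8);
}

// Test bit `bit` of the same bitmap using a 32-bit load on the same machine.
bool TestBit32(const std::vector<uint8_t>& mem, unsigned bit, ByteOrder order) {
  unsigned base = (bit / 32) * 4;
  uint32_t value = 0;
  for (unsigned i = 0; i < 4; ++i) {
    unsigned shift = (order == ByteOrder::Little) ? 8 * i : 8 * (3 - i);
    value |= uint32_t(mem[base + i]) << shift;
  }
  return (value >> (bit % 32)) & 1u;
}
```

With these helpers, a bit set through 16-bit stores is found at the same index by a 32-bit load on the little-endian model, but on the big-endian model the two 16-bit halves of each 32-bit unit trade places, so bit 20 shows up as bit 4. Using the same access width for both stores and loads avoids the mismatch.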
>>But can we not use the wider (ALU_TYPE) accesses always, whenever loading or
>>storing?
Brendan, are you suggesting that we disable wider accesses only in certain situations (loading or storing)? CCMAP has a flag field we can use, but we will have one more memory reference in each access.
I talked with brian yesterday, and we came up with 3 possible solutions:
1) Dynamically generate the CCMAP
plus: We can still utilize the wider access in ccmap. A static initialization array is more readable and easier to maintain.
minus: There is an additional (runtime) dependency on the ccmap library. There is a little additional running cost.
2) Totally disable CCMAP wider access, and initialize the ccmap directly
plus: All ccmap accesses are just macros, so we don't have a runtime dependency. The CCMAP code will be less complicated, and thus easier to maintain.
minus: The benefit of wider access is lost.
3) Create a new indexed array similar to ccmap
plus: We might have the smallest footprint.
minus: We are reinventing the wheel.
No, I was suggesting that you use wide accesses for all loads and stores from the map, where the current code uses a wide access some of the time. Where the current code uses only PRUint16 accesses, no need to change. But first perhaps we can ascertain the performance gain of wider accesses? /be
Move this to an m1.1 item.
Target Milestone: mozilla0.9.9 → mozilla1.1
*** Bug 152958 has been marked as a duplicate of this bug. ***
OK, I have changed my position.
1. First of all, I think we already have a solution. We can add empty entries to /intl/unicharutil/tables/transliterate.properties to solve this problem instead of inventing a new method. At least it works on Windows now; I should try it on Mac and Linux. For example, if Linux and Mac do not have the Hebrew vowel signs, adding the following lines to /intl/unicharutil/tables/transliterate.properties will probably turn off vowel sign rendering when the font is not there, instead of displaying a ? mark.
2. I think my attachment about which Unicode characters could be treated this way is wrong. Some characters should not be displayed as nothing; we need to display them as ? instead.
>Frank, should bug 106311 be duped into this?
No. This bug is about the case where we cannot display a character using a valid glyph: display it as nothing instead of as a question mark. In bug 106311, the characters are not displayed as a question mark but with a glyph which claims to be the glyph for ASCII 0x11. That is a totally different issue. This one is about how we treat fallback; that one is about how we decide which glyph in a valid TrueType font is invalid.
For now, we know that we probably want to address the following characters:
1. Hebrew accents and point marks
2. Arabic points
3. bidi control characters
smontagu, is that true?
smontagu- can you give me a list of the Hebrew/Arabic characters for which you think we should display nothing instead of ? in case we don't have a glyph in any font?
Whiteboard: [eta 8/25]
Target Milestone: --- → mozilla1.2alpha
In the following list, I am sure about the Hebrew characters, but it would be good if someone could give a second opinion about the Arabic.
0591;HEBREW ACCENT ETNAHTA
0592;HEBREW ACCENT SEGOL
0593;HEBREW ACCENT SHALSHELET
0594;HEBREW ACCENT ZAQEF QATAN
0595;HEBREW ACCENT ZAQEF GADOL
0596;HEBREW ACCENT TIPEHA
0597;HEBREW ACCENT REVIA
0598;HEBREW ACCENT ZARQA
0599;HEBREW ACCENT PASHTA
059A;HEBREW ACCENT YETIV
059B;HEBREW ACCENT TEVIR
059C;HEBREW ACCENT GERESH
059D;HEBREW ACCENT GERESH MUQDAM
059E;HEBREW ACCENT GERSHAYIM
059F;HEBREW ACCENT QARNEY PARA
05A0;HEBREW ACCENT TELISHA GEDOLA
05A1;HEBREW ACCENT PAZER
05A3;HEBREW ACCENT MUNAH
05A4;HEBREW ACCENT MAHAPAKH
05A5;HEBREW ACCENT MERKHA
05A6;HEBREW ACCENT MERKHA KEFULA
05A7;HEBREW ACCENT DARGA
05A8;HEBREW ACCENT QADMA
05A9;HEBREW ACCENT TELISHA QETANA
05AA;HEBREW ACCENT YERAH BEN YOMO
05AB;HEBREW ACCENT OLE
05AC;HEBREW ACCENT ILUY
05AD;HEBREW ACCENT DEHI
05AE;HEBREW ACCENT ZINOR
05AF;HEBREW MARK MASORA CIRCLE
05B0;HEBREW POINT SHEVA
05B1;HEBREW POINT HATAF SEGOL
05B2;HEBREW POINT HATAF PATAH
05B3;HEBREW POINT HATAF QAMATS
05B4;HEBREW POINT HIRIQ
05B5;HEBREW POINT TSERE
05B6;HEBREW POINT SEGOL
05B7;HEBREW POINT PATAH
05B8;HEBREW POINT QAMATS
05B9;HEBREW POINT HOLAM
05BB;HEBREW POINT QUBUTS
05BC;HEBREW POINT DAGESH OR MAPIQ
05BD;HEBREW POINT METEG
05BF;HEBREW POINT RAFE
05C1;HEBREW POINT SHIN DOT
05C2;HEBREW POINT SIN DOT
05C4;HEBREW MARK UPPER DOT
0640;ARABIC TATWEEL
064B;ARABIC FATHATAN
064C;ARABIC DAMMATAN
064D;ARABIC KASRATAN
064E;ARABIC FATHA
064F;ARABIC DAMMA
0650;ARABIC KASRA
0651;ARABIC SHADDA
0652;ARABIC SUKUN
0653;ARABIC MADDAH ABOVE
0654;ARABIC HAMZA ABOVE
0655;ARABIC HAMZA BELOW
0670;ARABIC LETTER SUPERSCRIPT ALEF
06D6;ARABIC SMALL HIGH LIGATURE SAD WITH LAM WITH ALEF MAKSURA
06D7;ARABIC SMALL HIGH LIGATURE QAF WITH LAM WITH ALEF MAKSURA
06D8;ARABIC SMALL HIGH MEEM INITIAL FORM
06D9;ARABIC SMALL HIGH LAM ALEF
06DA;ARABIC SMALL HIGH JEEM
06DB;ARABIC SMALL HIGH THREE DOTS
06DC;ARABIC SMALL HIGH SEEN
06DF;ARABIC SMALL HIGH ROUNDED ZERO
06E0;ARABIC SMALL HIGH UPRIGHT RECTANGULAR ZERO
06E1;ARABIC SMALL HIGH DOTLESS HEAD OF KHAH
06E2;ARABIC SMALL HIGH MEEM ISOLATED FORM
06E3;ARABIC SMALL LOW SEEN
06E4;ARABIC SMALL HIGH MADDA
06E7;ARABIC SMALL HIGH YEH
06E8;ARABIC SMALL HIGH NOON
06EA;ARABIC EMPTY CENTRE LOW STOP
06EB;ARABIC EMPTY CENTRE HIGH STOP
06EC;ARABIC ROUNDED HIGH STOP WITH FILLED CENTRE
06ED;ARABIC SMALL LOW MEEM
FB1E;HEBREW POINT JUDEO-SPANISH VARIKA
The Arabic looks right to me.
The only one I'm not sure about is U+0640 ARABIC TATWEEL. It's a semi-letter, semi-control character, and is also sometimes used as a dingbat. I would prefer removing it from the list.
Well, what other implications are there for keeping it on the list? If all there is to it is that the fallback is to simply ignore it, then it certainly should be on the list. After all, the TATWEEL has no real significance (it's a formatting character to elongate a word). But if there's something else I'm missing, then please do enlighten me ;)
Well, Tatweel is sometimes used as a hyphen in Persian (the hyphen glyph in many fonts is a little high for Arabic text), it is sometimes used as a bullet, ... These cases are not ignorable, and I prefer seeing a question mark in these places rather than nothing, so that I can tell there is a font problem.
I am now enlightened, thanks ;) I changed my mind, I would rather not see the U+0640 in that list.
What a hack. I have not touched Mozilla code for 2 years. I have not read these bugs for 2 years. And they are still there. Just close them as WONTFIX to clean up.
Status: ASSIGNED → RESOLVED
Last Resolved: 14 years ago
Resolution: --- → WONTFIX
This issue is being dealt with in bug 205387
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
Mass Reassign Please excuse the spam
Assignee: ftang → nobody
Status: REOPENED → NEW
Very, very WONTFIX: displaying nothing for a given character can be used as a phishing vector.
Status: NEW → RESOLVED
Last Resolved: 14 years ago → 9 years ago
Resolution: --- → WONTFIX