Closed Bug 204993 Opened 21 years ago Closed 1 month ago

add transliteration to Xft

Categories

(Core Graveyard :: GFX, defect)

x86
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: rbs, Assigned: jshin1987)

GfxGTK with core X11 fonts as well as GfxWin do what is called
"transliteration". That is, when a glyph isn't found for a character, the code
substitutes another fallback character or string. For example, if there is no
glyph for the "euro" currency symbol, the font engine uses the string "EUR" as a
substitute. There is a transliteration API which provides such predefined mappings.

However, it requires supporting another class, |nsFontXftSubstitute|, to host the
transliterator. See bug 454 for how it was done in GfxGTK. Another example of
how it was hooked into GfxWin can be found in bug 33498.

Doing the transliteration helps to mitigate the excessive number of question
marks/dominos shown for missing glyphs. It also helps to re-map characters
without a visual representation to nothing. For example, there is no point in
showing a domino for U+200B ZERO WIDTH SPACE. Rather, it can be transliterated
to emptiness.
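As a rough illustration of the behaviour described above (in Python rather than the C++ of the actual Gfx code), the following sketch shows a glyph lookup that falls back to a transliteration table before giving up. The table contents beyond the two examples named in this report, the function name `render_text` and the `?` placeholder are all assumptions made for the example, not Mozilla's API.

```python
# Hypothetical fallback table; only the two entries below come from this report.
TRANSLITERATIONS = {
    "\u20ac": "EUR",  # EURO SIGN -> ASCII fallback string
    "\u200b": "",     # ZERO WIDTH SPACE -> nothing at all
}


def render_text(text, font_chars, unknown="?"):
    """Return the string that would be drawn for `text`.

    `font_chars` is the set of characters the (imaginary) font has glyphs for.
    """
    out = []
    for ch in text:
        if ch in font_chars:
            out.append(ch)                      # the font has a glyph
        elif ch in TRANSLITERATIONS:
            out.append(TRANSLITERATIONS[ch])    # predefined substitute
        else:
            out.append(unknown)                 # stands in for the domino
    return "".join(out)


# An imaginary font that covers printable ASCII only.
ascii_font = {chr(c) for c in range(0x20, 0x7F)}
```

With this font, `render_text("5\u20ac", ascii_font)` yields "5EUR", and a zero width space between two letters simply disappears instead of producing a placeholder.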

re bug 128153 comment 65

> It is lot easier to turn that into nothingness in the font engine
> than higher up.

Now that you say that, I fully agree with you that it is easier. However,
where it ought to be done may not have so clear-cut an answer in every case. For
some characters, such as ZWNJ/ZWJ in Indic script rendering, it is certainly the
font engine that has to take care of them in a context-sensitive and
engine-dependent manner (see bug 202352; in bug 192088, we tried to deal with
them higher up in nsTextFrame and broke Indic script rendering). However, cases
like U+2062 might (or might not) need slightly different treatment. If no font
covers invisible characters, the transliteration approach works well. However,
if some 'broken'(?) fonts have visible glyphs for invisible characters (as on my
system; I don't know which font is to blame), that does not work. We might need
to know the 'authorial intent' (if that kind of authorial intent is allowed;
this is a big if), and the font engine may not be the best place to figure that
out in some cases...
                                                                                
Perhaps what we can do now is to make the font engine do a bit more. In
addition to using nsFontSubStitute for unknown characters,
it also has to turn invisible characters with no visual effect at all
(this excludes characters like ZWJ/ZWNJ from the list but includes
characters like U+2062; having said that, I'm not 100% sure whether
'U+0061 U+0062' and 'U+0061 U+2062 U+0062' are supposed to be rendered
identically) into nothingness even if they have visible glyphs in some
(broken?) fonts. ... There might even be an OpenType GSUB feature to turn the
invisible into the visible that's supposed to be invoked in a controlled manner.
Well, I'm getting off track a bit...
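The order of decisions proposed in this comment could be sketched roughly as follows (Python, for illustration only). The `Action` names, the `classify` function and the two character sets are hypothetical, and whether U+2062 really belongs in the "drop" set is exactly the open question raised above.

```python
from enum import Enum


class Action(Enum):
    SHAPE = "leave to the font engine's shaping"
    DROP = "render as nothing"
    DRAW = "draw the font's glyph"
    TRANSLITERATE = "use the substitute string"
    UNKNOWN = "draw the unknown-glyph box"


# ZWNJ and ZWJ affect shaping, so they are never dropped here.
JOINERS = {"\u200c", "\u200d"}
# Illustrative subset only: invisible characters assumed to have no visual effect.
INVISIBLE_NO_EFFECT = {"\u200b", "\u2062"}


def classify(ch, font_has_glyph, can_transliterate):
    """Decide what to do with one character (a sketch, not Mozilla's logic)."""
    if ch in JOINERS:
        return Action.SHAPE
    if ch in INVISIBLE_NO_EFFECT:
        # Dropped even when an overeager font offers a visible glyph.
        return Action.DROP
    if font_has_glyph:
        return Action.DRAW
    if can_transliterate:
        return Action.TRANSLITERATE
    return Action.UNKNOWN
```

The point of checking the invisible set before asking the font is that a font with a visible glyph for U+2062 can no longer force it onto the screen.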

Isn't it possible to just fall back to the transliterator before falling back to
the unknown glyph, assuming the flag is set? Am I reading the patch correctly?
I don't think that should require a whole new type myself, but I haven't thought
about it at length.

It is possible to embed it without another class. That's precisely what you are
doing at present with the |MiniFont|.

It is a lot easier to read and maintain down the track if there is separate
handling. All the logic moves into that one class, which is just treated like
any other nsFontXft. In particular, the enumerator callbacks become clean and
tidy, since |if (!aFont)| would mean a sure out-of-memory condition (or
something like that) rather than a missing font, which would otherwise need
further special-casing sprinkled here and there.

Essentially, it supersedes the |MiniFont| and does the extras.
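To make the design argument concrete, here is a small Python sketch of a catch-all substitute font that sits at the end of the font list and is treated like any other font. The class and method names (`Font`, `SubstituteFont`, `has_char`, `text_for`, `draw`) and the bracketed hex box standing in for the |MiniFont| are assumptions for illustration, not the nsFontXft interface.

```python
class Font:
    """An ordinary font: knows which characters it has glyphs for."""

    def __init__(self, name, chars):
        self.name = name
        self.chars = set(chars)

    def has_char(self, ch):
        return ch in self.chars

    def text_for(self, ch):
        return ch


class SubstituteFont(Font):
    """Catch-all font: transliterates when it can, else draws a hex box."""

    def __init__(self, table):
        super().__init__("substitute", ())
        self.table = table

    def has_char(self, ch):
        return True  # always accepts, so the lookup below never comes up empty

    def text_for(self, ch):
        if ch in self.table:
            return self.table[ch]
        return "[%04X]" % ord(ch)  # stand-in for the mini-font hex box


def draw(fonts, text):
    """Pick the first font that claims each character; no special cases."""
    out = []
    for ch in text:
        for font in fonts:
            if font.has_char(ch):
                out.append(font.text_for(ch))
                break
    return "".join(out)
```

Because the substitute font is last in the list and accepts everything, the caller's loop needs no separate branch for "no font found".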
This feature is not in as much demand as back in late 1990's when the font
situation for POSIX/X11 was far worse than now. To take just an example,
ISO-8859-15 was new and fonts with the Euro sign (let alone wider Unicode
coverage) were not so commonly available. Back then (and even now for
Moz-X11core), installing more fonts with wider coverage meant a significant
slowdown. Anyone who has visited UTF-8 pages with Mozilla-X11 would know what I
mean. It takes a Pentium 3 700MHz machine more than a minute to load a Google
search result (in UTF-8) if Moz-X11core is run in a UTF-8 locale and one has a
lot of large X11 'BDFs' [0] (CJK or iso10646-1 encoding) installed. These days,
with Mozilla-Xft and TTFs, there's no performance reduction with more fonts
installed. Even with a number of fonts with pan-Unicode coverage, it takes a
second or less, so it's not an unreasonable requirement that fonts with the
necessary coverage of the scripts of one's interest be installed to avoid
'dominos'. Moreover, thanks to fontconfig, it's very easy to add fonts: just
dropping a font into the fc search path does the job. On most standard Linux
distros, it should be rare to see excessive numbers of dominos. If one does,
that's a clear indication that one needs to install more fonts. Transliteration
support is a good guard against that (and perhaps good for small embedded
systems), but one can't keep putting off installing more fonts by relying on
transliteration.

Having said that, I think implementing this can build upon the patch for 176290,
sharing some routines and generally following what's done in GfxGTK w/X11core.
However, we should set the fallback to 'NONE', using only genuine
transliteration and resorting to drawing 'unknown glyphs' when
nsSaveAsCharset::Convert fails, because the current unknown-glyph approach is
better than any of the fallbacks offered by nsISaveAsCharset (hexadecimal,
decimal or '?').[1] This means that nsSaveAsCharset::Convert has to be called on
a per-character basis (in nsXftFontSubstitute::HasChar or equivalent), and the
result might need to be cached for speed (and later use).
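The per-character call with a cached result could look roughly like the Python sketch below. `convert_char` is only a stand-in for a converter configured with no fallback (it is not the nsSaveAsCharset API), and the table, the `NoMapping` exception and the function names are assumptions for the example.

```python
from functools import lru_cache

# Hypothetical mappings; the real table lives in Mozilla's transliteration data.
TABLE = {"\u20ac": "EUR", "\u2496": "15."}


class NoMapping(Exception):
    """Raised when the converter has no genuine transliteration."""


def convert_char(ch):
    """Stand-in converter with fallback 'NONE': fail rather than emit '?'."""
    try:
        return TABLE[ch]
    except KeyError:
        raise NoMapping(ch) from None


@lru_cache(maxsize=None)
def transliterate(ch):
    """Convert one character, remembering both successes and failures."""
    try:
        return convert_char(ch)
    except NoMapping:
        return None  # caller falls back to the unknown glyph


def has_char(ch):
    """True if the substitute font can handle `ch` by transliteration."""
    return transliterate(ch) is not None
```

Caching the failures as well as the successes matters here, since a page full of uncovered characters would otherwise hit the converter once per occurrence.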

> I don't think that should require a whole new type 

Perhaps not, but I'm afraid doing that in situ is not the best path we can
take. Transliteration generally changes the length of the string, so in a sense
it is similar to converting to a custom font encoding.

[0] TTFs packaged and presented as X11core fonts via a font server fall into
this category as well.

[1] I'm writing from memory. Is 'entity' (*non-numerical*) one of the fallbacks?
If it is, some people might like it better than the current unknown glyph.
> I think implementing this can build upon the patch for 176290

I was about to add that but you are too quick... it is better to wait on bug
176290 which provides a helpful basis for this.

>This feature is not in as much demand as back in late 1990's when the font
>situation for POSIX/X11 was far worse than now. To take just an example,

I am not interested in yet another font debate on Linux where the tendency is
often to resist font improvements until being forced... Even on Windows where
there are already many excellent fonts, my experience with Math characters
(thousands of them...) shows that this is a helpful feature.
As a further incentive for this to happen on Xft, I was the one who implemented
it on GfxWin and did it out of _necessity_.
> I was the one who implemented it on GfxWin and did it out of _necessity_.

If I sounded otherwise: I don't have the slightest doubt that it was and
is necessary. We differ only in degree, and your experience should
carry more weight than mine because I haven't tested many MathML pages.
Anyway, I already have a sketch of an implementation, but as we agreed, we'd
better do it after bug 176290 is resolved.

BTW, one of the Unicode 4.0 data files [1] defines 'default_ignorable_codes'.
U+2062 (invisible times) is one of them (ZWJ/ZWNJ are not). According to Mark
Davis, characters with 'default_ignorable_code' can be ignored (be turned to
nothing), but can affect the rendering/layout if supported. That is, 'ab' can be
rendered a little differently from 'a⁢b'. My problem is the opposite. One
of my fonts (I suspect it's CODE2000, which James Kass has been diligently making
as 'Pan-Unicodic' as possible) has a visible glyph for this invisible character.
So, I'm getting a very conspicuous 'dotted x inside a dotted box' where nothing
or just a very thin space should be... I may be getting off track a
bit, but I kind of think of this as the dual of the problem at hand.

[1] http://www.unicode.org/Public/4.0-Update/DerivedCoreProperties-4.0.0.txt

Transliteration is quite useful for some of the more exotic characters... How
many people have fonts with U+2496? ("15.")

In case that was directed at me: for the record, I didn't say it's not useful.
It is certainly useful for characters like U+2496 (and many others that are in
Unicode purely for the sake of backward compatibility with legacy character
sets, and some others that are in Unicode on their own merits).[1] However, for
some other characters, it _could_ (depending on the situation) be better to
alert users to the need to install more fonts with conspicuous 'unknown glyph
symbols'. Mozilla's transliteration table is currently not as exhaustive as,
say, glibc 2.x's transliteration table, so at the moment this is not an issue.

[1] U+2496 is a part of PRC's GB 2312-80 and I suspect that's the sole reason
it's in Unicode/ISO 10646. Everybody would object to encoding it if we could
start from scratch. Being a part of GB 2312-80, it's in every simplified
Chinese font (BDF, TrueType or whatever other format; even some very old X
terminals from the early 1990s are likely to have a font or two with it) and
any font that aims to be pan-Unicodic (e.g. shareware Code2000, Cyberbit, Arial
Unicode MS, etc.). I also found that some GPL'd Japanese fonts (that come by
default in major Linux distros) have it. Note that in this age of TrueType
dominance, there's _no_ platform dependency in font availability, barring
license issues. As I wrote, my problem is exactly the opposite: some
overaggressive fonts have visible glyphs for invisible characters like U+2062.

Of course, this fact doesn't reduce the usefulness of transliterating 'U+2496' 
to U+0031, U+0035, FullStop.
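For what it's worth, this particular mapping is already present in the Unicode character database as a compatibility decomposition, so it can be reproduced with Python's standard `unicodedata` module. This is only an illustration of where such a mapping could come from; it says nothing about how Mozilla's own transliteration table is built, and the helper name `compat_fallback` is made up for the example.

```python
import unicodedata


def compat_fallback(ch):
    """Return the NFKC form of `ch` if it differs from `ch`, else None."""
    folded = unicodedata.normalize("NFKC", ch)
    return folded if folded != ch else None


# U+2496 NUMBER FIFTEEN FULL STOP folds to the three characters "15."
fifteen = compat_fallback("\u2496")
```

Note that the euro sign has no such decomposition, so a mapping like "EUR" would still have to come from a hand-made table.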
 
For quick reference, I am "connecting" to bug 205387, which is about ignorable
characters.
Blocks: 205387
One of the reasons I haven't implemented this is that I don't like any of the
'transliteration options' available in nsISaveAsCharset. For instance, I prefer
'minifont' to using '&#ddddd;' (NCR). There's a way to have the best of both,
but it takes some work. I have to add an API (or options) to nsISaveAsCharset to
preserve 'untransliterable' characters for which I want to use minifont.

On the other hand, the need for nsFontXftSubstitute is clearly there (see bug
221024).
Assignee: blizzard → jshin
I've filed bug 230088 for a new API for transliteration that can preserve the
untransliterable.
Status: NEW → ASSIGNED
Product: Core → Core Graveyard
Status: ASSIGNED → RESOLVED
Closed: 1 month ago
Resolution: --- → INCOMPLETE