Closed Bug 9574 Opened 22 years ago Closed 21 years ago

[FEATURE] Transliteration API


(Core :: Internationalization, defect, P3)






(Reporter: bobj, Assigned: nhottanscp)


We need an API which will provide transliteration into ASCII characters.
Cases where this would be used
 - Save As Text
   E.g., saving a Euro in ISO-Latin1 plain text or a copyright into SJIS
 - Font Rendering
   When there is no font available for a given glyph
Comments from Erik:
Another idea is to warn the user before doing this.
For example:

"You are trying to save this document in a character encoding that
cannot express some of the characters in the document. Would you like to
have these characters transliterated? Or would you like to use one of
the following encodings that can express the whole document:

  ISO 8859-15
Target Milestone: M14
CC'ing tague and ftang.  Feel free to ponder this while I'm on sabbatical :-)
This is not a Beta stopper.  Setting TFV to M14.
Maybe we don't need this for 5.0...
But Erik and Tague should keep this in mind when dealing with no-glyph available
rendering and no valid conversion to plain text cases, respectively.
Can nsITextTransform.h be used as the interface? Currently, we have one
implementation of nsITextTransform, which is nsHankakuToZenkaku.cpp. I think
we could reuse this interface for the transliteration process. We would implement
other classes which implement nsITextTransform and use a PROG_ID
(NS_TEXTTRANSFORM_PROGID_BASE + the name of the transliteration) to get that
object. The remaining problem is how we know which "name of the
transliteration" to use.
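A minimal sketch of the lookup scheme proposed above, written in Python rather than XPCOM. Transform objects are registered under a base prefix plus the transliteration name, and callers fetch them by that key. The prefix string, the registry, and the function names are hypothetical stand-ins; they are not the real value of NS_TEXTTRANSFORM_PROGID_BASE or the nsITextTransform method signatures.

```python
from typing import Callable, Dict

# Hypothetical stand-in for NS_TEXTTRANSFORM_PROGID_BASE (real value not shown in this bug).
TRANSFORM_ID_BASE = "texttransform/"

# Maps "base + name" to a callable that transforms a string.
_registry: Dict[str, Callable[[str], str]] = {}


def register_transform(name: str, transform: Callable[[str], str]) -> None:
    """Register a transform under the base prefix plus its name."""
    _registry[TRANSFORM_ID_BASE + name] = transform


def get_transform(name: str) -> Callable[[str], str]:
    """Look up a transform by name; raises KeyError if none is registered."""
    return _registry[TRANSFORM_ID_BASE + name]


def to_ascii_quotes(text: str) -> str:
    """Replace curly single and double quotes with their ASCII counterparts."""
    return text.translate({0x2018: "'", 0x2019: "'", 0x201C: '"', 0x201D: '"'})


register_transform("ascii-quotes", to_ascii_quotes)
```

The open question from the comment remains in this sketch too: the caller still has to know which name to pass to get_transform.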
Sounds good to me.  I expect that most transliterations needed are
transliteration of characters to "ASCII" strings (e.g.
copyright-symbol -> "(c)", a-e-ligature -> "ae", smart-quotes ->
ASCII-quotes, euro-symbol -> "EURO").  So our first transliteration
object would cover these.  We could create other transliteration
objects for other cases when needed.  We might even have alternative
transliterations -- we could have a "Unicode Name" transliterator which
would use the formal names from the Unicode standard.
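As an illustration of the "Unicode Name" idea, here is a small Python sketch that replaces each non-ASCII character with its formal Unicode name, using the standard unicodedata module. The square-bracket wrapping and the "?" fallback for unnamed characters are assumptions made for this example, not something proposed in the bug.

```python
import unicodedata


def unicode_name_transliterate(text: str) -> str:
    """Replace each non-ASCII character with its formal Unicode name in brackets."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)  # ASCII passes through unchanged
            continue
        name = unicodedata.name(ch, None)  # None if the character has no name
        out.append("[" + name + "]" if name else "?")
    return "".join(out)
```

For example, unicode_name_transliterate("\u2122") gives "[TRADE MARK SIGN]".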
Why don't we just use my entity converter
(mozilla/intl/unicharutil/idl/nsIEntityConverter.idl)? It already handles this
kind of transliteration from a Unicode code point to a Unicode string of
ASCII-range characters. All someone needs to do is generate the transliteration
table and the entity converter would handle it. The entity converter also has
the benefit that it is reflected in JavaScript.
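The table-driven approach described above can be sketched in Python as follows. A table maps a single code point to an ASCII string, and the converter only does lookups, so supporting new characters means adding table rows. The "decimal code point = replacement" file format and the function names are assumptions for this example; they are not the actual property file syntax or the nsIEntityConverter API.

```python
# Hypothetical table format: "<decimal code point>=<ASCII replacement>" per line.
SAMPLE_TABLE = """
# code point (decimal) = ASCII replacement
169=(c)
230=ae
8364=EUR
8482=(tm)
"""


def load_table(text: str) -> dict:
    """Parse the table text into {code point: replacement}, skipping blanks and comments."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        table[int(key)] = value
    return table


def transliterate(text: str, table: dict) -> str:
    """Keep ASCII as-is; map other characters through the table, or '?' if unmapped."""
    return "".join(
        ch if ord(ch) < 128 else table.get(ord(ch), "?") for ch in text
    )
```

With this table, transliterate("\u00a9 2000 \u20ac5", load_table(SAMPLE_TABLE)) yields "(c) 2000 EUR5".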
Assignee: bobj → nhotta
Reassigned to nhotta who completed the entity converter work.
Is that sufficient to meet this need (with the addition of
additional tables)?

We should make a specific list of how this will be used in 5.0.  E.g.,
saving to plain text file/mail, rendering when no glyph is found.
A document about entity converter extensibility.
Summary: Transliteration API → [FEATRUE] Transliteration API
Summary: [FEATRUE] Transliteration API → [FEATURE] Transliteration API
This page documents some compatibility issues related to
potential transliteration for glyph substitution.
Change OS to ALL
OS: other → All
Checked in a new property file.
I am going to change nsISaveAsCharset to use it as a fallback for text/plain.
cc'd erik because transliteration may be useful for rendering chars without
glyphs on current system.
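The fallback idea (encode in the target charset, and transliterate only the characters that charset cannot express) can be sketched with Python's codecs.register_error. This is an analogy, not the nsISaveAsCharset implementation; the handler name and the small fallback table are assumptions for the example.

```python
import codecs

# Illustrative fallback table; a real one would be much larger.
FALLBACKS = {"\u00a9": "(c)", "\u20ac": "EUR", "\u2122": "(tm)", "\u2026": "..."}


def transliterate_errors(err):
    """Error handler: replace unencodable characters with ASCII fallbacks or '?'."""
    if not isinstance(err, UnicodeEncodeError):
        raise err
    bad = err.object[err.start:err.end]
    replacement = "".join(FALLBACKS.get(ch, "?") for ch in bad)
    return replacement, err.end  # resume encoding after the bad span


codecs.register_error("transliterate", transliterate_errors)


def save_as(text: str, charset: str) -> bytes:
    """Encode text, transliterating only what the charset cannot represent."""
    return text.encode(charset, errors="transliterate")
```

Characters the charset can express are left alone: saving "\u00a9 \u20ac5" as latin-1 keeps the copyright sign and only rewrites the Euro, while saving as ASCII rewrites both.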
I changed the mail code to use the transliteration for text/plain and message 
Remaining issue: the Unicode transliteration bobj mentioned (2000-01-12 15:34)
probably needs a different implementation. The entity converter can convert one
character code to multiple codes, but not multiple to multiple.
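To illustrate the limitation, here is a hedged Python sketch of what a multiple-to-multiple transliterator might look like: it matches sequences of characters (longest first) rather than single code points. The rules shown, such as a base letter plus a combining diaeresis, are illustrative assumptions and are not taken from any Mozilla table.

```python
# Illustrative rules: keys may be more than one code point long.
RULES = {
    "a\u0308": "ae",  # 'a' followed by COMBINING DIAERESIS
    "o\u0308": "oe",  # 'o' followed by COMBINING DIAERESIS
    "\u00e4": "ae",   # precomposed a-diaeresis (single code point)
}


def transliterate_sequences(text: str, rules: dict) -> str:
    """Greedy longest-match replacement of character sequences."""
    max_len = max(len(key) for key in rules)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate sequence first, then shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in rules:
                out.append(rules[piece])
                i += size
                break
        else:
            out.append(text[i])  # no rule matched; copy the character through
            i += 1
    return "".join(out)
```

A per-code-point table can express the precomposed rule but not the two-code-point ones, which is the gap the comment points out.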
Moving to M15.
Target Milestone: M14 → M15
Shouldn't we at least add the MS-Latin1 chars and MacRoman chars that are not
in the ISO-Latin1 charset (and *-Latin1 chars that are not in the MacRoman
charset) to 
I agree that we want to add those, but I am not sure how to assign abbreviations.
They do not seem to be identical to the named entities, e.g. "EUR" instead of
"EURO", for which I checked in a fix yesterday. They are also not the same as in
the Unicode book, where (tm) is described as "TRADE MARK SIGN".
I found that windows-1252 data has been added to the property file, marking as 
Closed: 21 years ago
Resolution: --- → FIXED
Should we open additional bug(s) for other transliterations?
E.g., MacRoman characters not in ISO-Latin1, Currency Symbols, etc.
My experience has been that "tracking" bugs end up getting longer and longer
and harder to follow (for both the engineer and QA). I prefer to set up separate
bug reports with precise and strictly scoped details. These can then be fixed
and closed.
Yes, that is what I was requesting.  Sorry if it was unclear...
I created entity html file to 

I copied all the symbols in the page and pasted them into a plain text mail.
I sent it to myself.  The symbols are displayed as ".............".
I used the 2000031312 beta1 US build on Winnt 4.0J.  I am reopening this.
Resolution: FIXED → ---
Running 2000040409 build on US Win95.

Sent in Western Charset Coding and received:
   EUR , f " ? + ++ ^ 0/00 S < (OE Z ' ' " " . - -- ~ (tm)=(tm) s > oe z
   EUR , f " ... + ++ ^ 0/00 S < OE Z ' ' " " . - -- ~ (tm)=(tm) s > oe z

Sent in Japanese Charset Coding and received:
   EUR , f " ? + ++ ^ ? S < (OE Z ? ? ? . - -- ~ (tm)=(tm) s > oe z
   EUR ?? , ?? f ?? " ?? ? ?? ? ?? ? ?? ^ ?? ? ?? S ?? < ? OE ?? Z ?? ? ?? ? ??
? ? ? ?? . ? - ?? -- ?? ~ ?? (tm)=(tm) ?? s ?? > ?? oe ? z
I will investigate and open a separate bug if the problem is not in 
Target Milestone: M15 → M16
I tried on my machine (pulled this morning) and both cases (ISO-8859-1 and 
ISO-2022-JP) worked fine.
Please try again with today's build.
I used today's (2000040409) Win32 build on Winnt 4.0J and Windows 98J, and this
worked fine.  This did not work in the beta1 build.  I am marking this bug as fixed.
I tried to verify this in today's Mac and Linux builds, but the Mac build crashed
when launching Netscape and I could not create a new message in the Linux build.
I will verify on Mac and Linux later.
Closed: 21 years ago
Resolution: --- → FIXED
From my previous comment, it seems like it mostly works.
We may want to add more transliteration (e.g., ellipses) to our tables.

But the other strange thing about my results is that the Subject and Body
got different results in both cases.  Is that as designed?
> Is that as designed?
No, but using the html file provided by teruko, I could not reproduce your 
problem. May require some other conditions for that to happen. But probably a 
separate problem from the transliteration itself. So please file a separate bug 
if you have a reproducible case.
I did the following:
(1) opened in a browser window the same link 
(2) selected the whole line of text
(3) opened a new message window
(4a) set the character coding menu as Western
(4b) set the character coding menu as ISO-2022-JP
(5) pasted into subject field
(6) pasted into main body
(7) sent as plain text

Note in the Western case, the subject had an ellipsis character, but the body
had three periods.  There are other differences too, such as "(OE" vs. "OE".  In
the Japanese message, I see the character that looks like [0/00] in the
subject, but "??" in the body.

Also, what you see in the thread pane is different from what you see in the
subject field of the message pane.
I verified this in 2000041307 Mac and 2000041310 Linux build.