Closed Bug 9574 Opened 22 years ago Closed 21 years ago
[FEATURE] Transliteration API
We need an API which will provide transliteration into ASCII characters. Cases where this would be used:
- Save As Text: e.g., saving a Euro in ISO-Latin1 plain text or a copyright into SJIS
- Font Rendering: when there is no font available for a given glyph
Comments from Erik: Another idea is to warn the user before doing this. For example: "You are trying to save this document in a character encoding that cannot express some of the characters in the document. Would you like to have these characters transliterated? Or would you like to use one of the following encodings that can express the whole document: ISO 8859-15 UTF-8 ..."
CC'ing tague and ftang. Feel free to ponder this while I'm on sabbatical :-) This is not a Beta stopper. Setting TFV to M14. Maybe we don't need this for 5.0... But Erik and Tague should keep this in mind when dealing with no-glyph available rendering and no valid conversion to plain text cases, respectively.
Can nsITextTransform.h be used as the interface? Currently, we have one implementation of nsITextTransform, which is nsHankakuToZenkaku.cpp. I think we could reuse this interface for the transliteration process. We would implement other classes which implement nsITextTransform and use the PROG_ID (NS_TEXTTRANSFORM_PROGID_BASE + the name of the transliteration) to get that object. The remaining problem is how we know which "name of the transliteration" to use.
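A minimal sketch of the lookup described above, in standalone C++ rather than XPCOM. Everything here is hypothetical: the `TextTransform` class only stands in for nsITextTransform (whose real method signatures are not shown in this bug), `kTextTransformIdBase` is a placeholder for NS_TEXTTRANSFORM_PROGID_BASE, and the registry is a plain map, not the component manager.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for nsITextTransform: one method that rewrites text.
class TextTransform {
public:
    virtual ~TextTransform() = default;
    virtual std::u16string Transform(const std::u16string& in) const = 0;
};

// Placeholder value; the real NS_TEXTTRANSFORM_PROGID_BASE is not quoted in this bug.
const std::string kTextTransformIdBase = "example.texttransform.";

using Factory = std::function<std::unique_ptr<TextTransform>()>;

// Assumed registry: maps (base + transliteration name) to a factory.
std::map<std::string, Factory>& Registry() {
    static std::map<std::string, Factory> registry;
    return registry;
}

void RegisterTransform(const std::string& name, Factory factory) {
    Registry()[kTextTransformIdBase + name] = std::move(factory);
}

// Returns nullptr when no transform is registered under that name.
std::unique_ptr<TextTransform> CreateTransform(const std::string& name) {
    auto it = Registry().find(kTextTransformIdBase + name);
    if (it == Registry().end()) return nullptr;
    return it->second();
}

// Toy transform used for illustration: COPYRIGHT SIGN becomes "(c)".
class CopyrightToAscii : public TextTransform {
public:
    std::u16string Transform(const std::u16string& in) const override {
        std::u16string out;
        for (char16_t ch : in) {
            if (ch == u'\u00A9') out += u"(c)";
            else out += ch;
        }
        return out;
    }
};
```

The caller still has to pick the name to pass to `CreateTransform`, which is the open question raised in the comment; this sketch does not address it.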
Sounds good to me. I expect that most transliterations needed are transliteration of characters to "ASCII" strings (e.g. copyright-symbol -> "(c)", a-e-ligature -> "ae", smart-quotes -> ASCII-quotes, euro-symbol -> "EURO"). So our first transliteration object would cover these. We could create other transliteration objects for other cases when needed. We might even have alternative transliterations -- we could have a "Unicode Name" transliterator which would use the formal names from the Unicode standard.
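As an illustration of the character-to-"ASCII"-string idea, here is a small table-driven sketch in standalone C++. The function and table names are hypothetical, the entries are only the examples listed in the comment, and falling back to "?" for unmapped characters is an assumption, not something decided in this bug.

```cpp
#include <map>
#include <string>

// Illustrative table: one Unicode code point maps to an ASCII string.
// Entries follow the examples in the comment above; this is not the real table.
const std::map<char32_t, std::string>& AsciiTable() {
    static const std::map<char32_t, std::string> table = {
        {U'\u00A9', "(c)"},   // COPYRIGHT SIGN
        {U'\u00E6', "ae"},    // LATIN SMALL LETTER AE
        {U'\u2018', "'"},     // LEFT SINGLE QUOTATION MARK
        {U'\u2019', "'"},     // RIGHT SINGLE QUOTATION MARK
        {U'\u201C', "\""},    // LEFT DOUBLE QUOTATION MARK
        {U'\u201D', "\""},    // RIGHT DOUBLE QUOTATION MARK
        {U'\u20AC', "EURO"},  // EURO SIGN
    };
    return table;
}

// ASCII passes through unchanged; mapped characters are replaced by their
// table entry; anything else becomes "?" (an assumed fallback).
std::string TransliterateToAscii(const std::u32string& text) {
    std::string out;
    for (char32_t ch : text) {
        if (ch < 0x80) {
            out += static_cast<char>(ch);
            continue;
        }
        auto it = AsciiTable().find(ch);
        out += (it != AsciiTable().end()) ? it->second : std::string("?");
    }
    return out;
}
```

An alternative "Unicode Name" transliterator, as suggested above, could reuse the same loop with a different table.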
Reassigned to nhotta who completed the entity converter work. Is that sufficient to meet this need (with the addition of additional tables)? We should make a specific list of how this will be used in 5.0. E.g., saving to plain text file/mail, rendering when no glyph is found.
A document about entity converter extensibility: http://www.mozilla.org/projects/intl/entity-conversion.html
Summary: Transliteration API → [FEATRUE] Transliteration API
Summary: [FEATRUE] Transliteration API → [FEATURE] Transliteration API
This page documents some compatibility issues related to potential transliteration for glyph substitution. http://home.sol.no/~huftis/mozilla/compat.html
Change OS to ALL
OS: other → All
Checked in a new property file. http://lxr.mozilla.org/seamonkey/source/intl/unicharutil/tables/transliterate.properties I am going to change nsISaveAsCharset to use it as a fallback for text/plain.
cc'd erik because transliteration may be useful for rendering chars without glyphs on current system.
I changed the mail code to use the transliteration for text/plain and message headers. Remaining issue: the Unicode transliteration bobj mentioned (2000-01-12 15:34) probably needs a different implementation. The entity converter may convert one character code to multiple codes but not multiple to multiple. Moving to M15.
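A rough sketch of the fallback behaviour for one target charset, in standalone C++. It does not use nsISaveAsCharset or the entity converter; `SaveAsLatin1` and `kFallback` are hypothetical names, the table entries are illustrative, and the final "?" substitution is an assumption. The table key is a single code point, which mirrors the one-to-multiple limit noted above: a sequence of several input characters cannot be matched as a unit.

```cpp
#include <map>
#include <string>

// Hypothetical fallback table: one code point -> ASCII replacement string.
const std::map<char32_t, std::string> kFallback = {
    {U'\u20AC', "EUR"},   // EURO SIGN
    {U'\u2122', "(tm)"},  // TRADE MARK SIGN
    {U'\u0152', "OE"},    // LATIN CAPITAL LIGATURE OE
};

// Encodes text as ISO-8859-1 bytes. Code points U+0000..U+00FF map directly
// to the same byte value; anything else goes through the fallback table, and
// unmapped characters become "?" (assumed behaviour).
std::string SaveAsLatin1(const std::u32string& text) {
    std::string bytes;
    for (char32_t ch : text) {
        if (ch <= 0xFF) {
            bytes += static_cast<char>(static_cast<unsigned char>(ch));
        } else {
            auto it = kFallback.find(ch);
            bytes += (it != kFallback.end()) ? it->second : std::string("?");
        }
    }
    return bytes;
}
```

Supporting multiple-to-multiple conversion would need matching on substrings rather than single code points, which is why a different implementation is suggested for that case.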
Target Milestone: M14 → M15
Shouldn't we at least add the MS-Latin1 chars and MacRoman chars that are not in the ISO-Latin1 charset (and *-Latin1 chars that are not in the MacRoman charset) to http://lxr.mozilla.org/seamonkey/source/intl/unicharutil/tables/transliterate.pr
I agree that we want to add those. But I am not sure how to assign abbreviations. They do not seem to be identical to the named entities: for example, "EUR" instead of "EURO", a fix I checked in yesterday. Nor are they the same as in the Unicode book, where (tm) is described as "TRADE MARK SIGN".
I found that windows-1252 data has been added to the property file, marking as FIXED.
Status: ASSIGNED → RESOLVED
Closed: 21 years ago
Resolution: --- → FIXED
Should we open additional bug(s) for other transliterations? E.g., MacRoman characters not in ISO-Latin1, Currency Symbols, etc.
My experience has been that "tracking" bugs end up getting longer and longer and harder to follow (for both the engineer and QA). I prefer to set up separate bug reports with precise and strictly scoped details. These can then be fixed and closed.
Yes, that is what I was requesting. Sorry if it was unclear...
I created an entity HTML file at http://jazz/users/teruko/publish/test/9574/test2.html I copied all the symbols in the page and pasted them into a plain text mail. I sent it to myself. The symbols are displayed as ".............". I used the 2000031312 beta1 US build on Winnt 4.0J. I reopen this.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Running 2000040409 build on US Win95.
Sent in Western Charset Coding and received:
Subject: EUR , f " ? + ++ ^ 0/00 S < (OE Z ' ' " " . - -- ~ (tm)=(tm) s > oe z
Body: EUR , f " ... + ++ ^ 0/00 S < OE Z ' ' " " . - -- ~ (tm)=(tm) s > oe z
Sent in Japanese Charset Coding and received:
Subject: EUR , f " ? + ++ ^ ? S < (OE Z ? ? ? . - -- ~ (tm)=(tm) s > oe z
Body: EUR ?? , ?? f ?? " ?? ? ?? ? ?? ? ?? ^ ?? ? ?? S ?? < ? OE ?? Z ?? ? ?? ? ?? ? ? ? ?? . ? - ?? -- ?? ~ ?? (tm)=(tm) ?? s ?? > ?? oe ? z
I will investigate and open a separate bug if the problem is not in transliteration.
Status: REOPENED → ASSIGNED
Target Milestone: M15 → M16
I tried on my machine (pulled this morning) and both cases (ISO-8859-1 and ISO-2022-JP) worked fine. Please try again with today's build.
I used today's (2000040409) Win32 build on Winnt 4.0J and Windows 98J, and this worked fine. This did not work in the beta1 build. I mark this bug as fixed. I tried to verify this in today's Mac and Linux builds, but the Mac build crashed when launching Netscape and I could not create a new message in the Linux build. I will verify on Mac and Linux later.
Status: ASSIGNED → RESOLVED
Closed: 21 years ago → 21 years ago
Resolution: --- → FIXED
From my previous comment, it seems like it mostly works. We may want to add more transliteration (e.g., ellipses) to our tables. But the other strange thing about my results is that the Subject and Body got different results in both cases. Is that as designed?
> Is that as designed? No, but using the html file provided by teruko, I could not reproduce your problem. May require some other conditions for that to happen. But probably a separate problem from the transliteration itself. So please file a separate bug if you have a reproducible case.
I did the following:
(1) opened in a browser window the same link http://jazz/users/teruko/publish/test/9574/test2.html
(2) selected the whole line of text
(3) opened a new message window
(4a) set the character coding menu as Western
(4b) set the character coding menu as ISO-2022-JP
(5) pasted into subject field
(6) pasted into main body
(7) sent as plain text
Note that in the Western case, the subject had an ellipsis character, but the body had three periods. There are other differences too, such as "(OE" vs. "OE". In the Japanese message, I see the character that looks like [0/00] in the subject, but "??" in the body. Also, what you see in the thread pane is different from what you see in the subject field of the message pane.
I verified this in 2000041307 Mac and 2000041310 Linux build.
Status: RESOLVED → VERIFIED