Closed Bug 186463 Opened 22 years ago Closed 20 years ago

Request to provide Tamil character coding (TSCII) support

Categories

(Core :: Internationalization, defect)

defect
Not set
normal

RESOLVED WONTFIX

People

(Reporter: ev122, Assigned: smontagu)

Details

(Keywords: intl)

Attachments

(6 files, 1 obsolete file)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.2.1) Gecko/20021130
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.2.1) Gecko/20021130

At present, Mozilla doesn't provide support for viewing Tamil webpages. It would be
better if one could select Tamil via View > Character Coding >
More > SE & SW Asian > Tamil.

The necessary fonts (and related info) may be obtained here:
http://www.tamil.net/tscii/

Reproducible: Always

Steps to Reproduce:
1. Please go to: http://www.tamil.net/projectmadurai//pub/pm0143/kprose1.html


Actual Results:  
See the words below "in Tamil Script, TSCII format)". This shouldn't be a bunch
of question marks. 

Expected Results:  
A sample of Tamil characters: http://www.tamil.net/projectmadurai/pmdr0.gif

Two good Tamil fonts: TSCMaduram (Serif) and TSCArial (Sans-Serif)
see also bug 140013
The Tamil characters are in Unicode (see http://www.unicode.org/charts/ ) so
presuming the fonts are correctly encoded (which bug 140013 suggests *may* not
be the case), then the characters should be supported when using UTF-8 or UCS2
encoding without any support from Mozilla.

Do you know if the encoding is documented somewhere?  Is it in use on the web? 
What other browsers support it?  If none, then it's probably better to just
encourage sites to use UTF-8 or UCS2.

->i18n
Assignee: font → smontagu
Component: Layout: Fonts and Text → Internationalization
QA Contact: ian → ylong
The page in question has
<META HTTP-EQUIV="Content-Type" CONTENT="text/html"; charset="x-user-defined">

I really don't see how Mozilla is supposed to know what the character encoding
is.  Therefore it's going to assume some default, and since the Tamil font you
specified doesn't have glyphs for the characters Mozilla thinks are there, it
will (and should) use some other font.
Thanks for the suggestions. Someday, of course, UTF-8 will become the preferred
encoding for all Tamil webmasters. However, at present, most webmasters continue
to write Tamil pages using TSCII encoding (mostly because UTF-8 is not supported
by WinMe, Win98, Win95).
More info: http://www.tamil.net/tscii/tscii.html

As to the "x-user-defined", it was simply a temporary measure. As explained below:

"Internationalisation part of HTML standards propose usage of "character set" to
display non-roman language materials. One of the near-term goals of the Internet
Working Group for TSCII is to get Internet Protocols Standardisation Agencies
such as IETF to accept the proposed Encoding scheme TSCII as a "char-set" for
Tamil. This is along the same lines of specific character sets we have for
Russian, Korean, Japanese, Greek etc. Then we can have TSCII as one of the
recognized character set to invoke in HTML files.

"Till that time, an immediate option is to invoke "x-user-defined" case for the
char-set in the META Header of the HTML file and have the end-user choose
TSCII-conformant Tamil font as the font to use for the "User-defined" encoding
(using Browser Preferences Menu)." [Archived at:
http://www.geocities.com/athens/5180/tscguide.html ]
Oh btw, IE 6.0 does have a provision to select "Tamil" (Tools > Internet
Options > Fonts > Tamil). Netscape and Mozilla don't have any such provision. :(
TSCII is the encoding most widely used by Tamils as of now. Mandrake Linux supports TSCII.
Until Unicode becomes popular, it's necessary for Mozilla to support the 8-bit
character encoding TSCII as well.
Keywords: intl
TSCII is the 8-bit glyph encoding standard. glibc 2.3.1 includes TSCII support.
TSCII is fully explained in this page...
http://www.geocities.com/athens/5180/tscii.html

TSCII encoding is represented in this gif
http://www.tamil.net/tscii/charset17.gif
Google search for Tamil is also based on TSCII encoding only.
http://dmoz.org/World/Tamil/  

Because Mozilla has no support for it, it is difficult for a normal 'joe' user to
view Tamil sites. (Every site is currently forced to use x-user-defined.)
As an alternative, most of the sites use dynamic fonts, which are not supported
in Mozilla.

Since this is similar to ISO 8859-1, full support for TSCII was included in
Mandrake Linux 9.0.

The Tamil community is 70 million strong and spread across India, Sri Lanka,
Singapore, Malaysia, Canada, USA & other countries. TSCII is the encoding most
commonly used on the Internet by Tamils from all countries around the world. The
list of sites using TSCII is documented here:
http://groups.yahoo.com/group/e-Uthavi/database?method=reportRows&tbl=2

We're ready to provide TSCII fonts with opentype fonts for inclusion with
Mozilla if required, under any license of your choice.



Severity: enhancement → normal
The l10n is well underway, and its alpha should be available for download
at http://thamizha.com before the end of Jan '03. However, the lack of
TSCII support would deny Mozilla-based browsers the much-needed 'edge' that
might be necessary to 'convert' Tamil surfers around the world.
Hello sir,

Please provide Tamil font coding (TSCII) support for our new browser.
It is a request from 5 million people in South India.
Tamil is universally accepted as an old language, with good grammar, poetry, etc.
Please accept this request.

Thanks
Mathavan, tamilan
Please fix this bug in your next release and help everybody.

This will def. increase the number of netscape users.
> "Till that time, an immediate option is to invoke "x-user-defined" case for the
> char-set in the META Header of the HTML file and have the end-user choose
> TSCII-conformant Tamil font as the font to use for the "User-defined" encoding
> (using Browser Preferences Menu)." [Archived at: 
> http://www.geocities.com/athens/5180/tscguide.html ]

 You could have used 'x-tscii', but then browsers have to recognize it,
which no browser does at the moment.
Anyway, to support TSCII, we need the mapping (tabular or algorithmic)
between TSCII and Unicode. Can you provide that?  With that, it _might_
(or might not) be possible to do something quickly 

> Since this is similair to ISO8859-1, full suport was included for TSCII in

  Well, there's a big difference between ISO 8859-x and TSCII in that
the former has a well-defined *one-to-one* _character_ mapping to Unicode while
the latter appears to be a 'font/glyph encoding' rather than a character encoding.
Therefore, mapping between TSCII and Unicode is not as simple as mapping between
Unicode and ISO-8859-x. Nonetheless, it's possible if we have
the mapping between TSCII and Unicode. The TSCII page
(http://www.tamil.net/tscii/tscii.html)
doesn't seem to have any mapping table other than font-glyph tables. Do you
have any TSCII <=> Unicode mapping table available? It cannot be 1 to 1,
so it may come as a kind of pseudo-code. Do you have anything like that? My
version of glibc doesn't support TSCII and I wonder how Mandrake 9.0 supports TSCII.
(Supporting Tamil with Unicode is different from supporting Tamil with TSCII.)

I'll look into Yudit source code which supports Tamil in both TSCII and Unicode.

 
> The Tamil characters are in Unicode (see http://www.unicode.org/charts/ ) so
> presuming the fonts are correctly encoded (which bug 140013 suggests *may* not
> be the case), then the characters should be supported when using UTF-8 or UCS2
> encoding without any support from Mozilla.

  Not that simple because Tamil script is a complex script as other scripts in
South Asia are. It takes some work on Mozilla's side (reordering, ligature
for conjunction, etc). Under Windows XP(or even 2k), TextOutW may do the
magic by invoking Uniscribe and OTLS on its own if Tamil is supported
by Uniscribe.  Yeah, it does (see
http://www.microsoft.com/typography/otfntdev/tamilot/)
Have you tried UTF-8 encoded Tamil web pages with
Mozilla running under Windows XP **after** installing a Tamil **opentype**
font? Win XP may come with at least one Tamil OT font. If not,
you can use the CODE2000 font (at http://home.att.net/~jameskass) that
has the opentype layout table for Tamil.

 On other platforms, Mozilla has more work to do unless
it becomes capable of taking advantage of 'native' rendering/layout libraries 
like Pango and AAT(? what's the name of Apple's rough equivalent of
Uniscribe/OTLS?).
  
BTW, you may find it interesting to see http://sila.mozdev.org/

I found the mapping table between TSCII and Unicode at
http://www.tamil.net/tscii/faq5.html
As expected, it's context-dependent and m to n. 
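
To illustrate what 'context-dependent and m to n' means in practice, here is a minimal Python sketch. The byte values in the toy tables are hypothetical placeholders, NOT the real TSCII assignments (those are in the FAQ linked above); only the Unicode code points are real. It shows two cases: one glyph code that expands to two characters, and a prefix vowel sign that comes before its consonant in glyph order but after it in Unicode logical order.

```python
# Toy glyph-encoding -> Unicode converter.
# The byte values below are HYPOTHETICAL placeholders, not the real TSCII table.
KA = "\u0b95"      # TAMIL LETTER KA
SIGN_E = "\u0bc6"  # TAMIL VOWEL SIGN E (drawn to the left of its consonant)
SIGN_U = "\u0bc1"  # TAMIL VOWEL SIGN U

SIMPLE = {0xB8: KA}              # one byte -> one character
EXPANDING = {0xC0: KA + SIGN_U}  # one ligature glyph -> two characters (1 to n)
PREFIX_SIGNS = {0xA6: SIGN_E}    # glyph stored before the consonant it modifies


def glyphs_to_unicode(data: bytes) -> str:
    """Convert visual-order glyph codes to logical-order Unicode text."""
    out = []
    pending = None  # prefix vowel sign waiting for its consonant
    for b in data:
        if b < 0x80:
            out.append(chr(b))            # ASCII passes through unchanged
        elif b in PREFIX_SIGNS:
            pending = PREFIX_SIGNS[b]     # hold it until the consonant arrives
        elif b in SIMPLE or b in EXPANDING:
            out.append(SIMPLE.get(b) or EXPANDING[b])
            if pending:
                out.append(pending)       # reorder: the sign follows the consonant
                pending = None
        else:
            out.append("\ufffd")          # unknown code -> replacement character
    if pending:
        out.append(pending)               # stray sign at end of input
    return "".join(out)
```

A real converter would need the full table and more contextual rules (two-part vowel signs, for instance), which is why a plain lookup table is not enough.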
This bug can be split into three parts:
 
1. supporting TSCII (x-tscii) as a text/document encoding.
2. rendering Tamil text with a TSCII font (custom-encoded font). (A
   temporary/stop-gap measure until the third item below is implemented and
   existing Tamil truetype fonts are converted to opentype fonts.)
3. rendering Tamil text with opentype fonts for Tamil.

The first one is pretty straightforward, and the second one is not so difficult
with the latest patch for bug 176290 checked in. The third one is
platform-dependent. It seems that under Windows XP (or it might be true of 9X as
well with the latest version of Uniscribe installed) Mozilla does not have much
to do (if I'm not mistaken). For gtk, implementing nsFontMetricsXftPango(?
derived from nsFontMetricsXft) that delegates most of layout/rendering to Pango
could well do 'the trick' (as in SILA; actually a lot more extensive delegation
can/should be done to Pango than to Graphite, I think).
At the moment, built-in PangoLite is used for Thai and Devanagari in gtk-x11
(but not in gtk-xft) when CTL is turned on.

> At present, Mozilla doesn't provide support to view Tamil webpages

As you and numerous other people in South Asia know, you can view
non-standard-compliant web pages in Indic scripts treating them
as if they're ISO-8859-1 or Windows-1252 and setting  fonts to
one of 'custom-encoded' fonts.  It's an old trick that has been used
for Indic scripts perhaps since Netscape 3.x, isn't it?   

What's missing is rendering support for *properly* encoded Tamil web
pages (in UTF-8 or other Unicode transformation formats) with opentype
Tamil fonts and, as a temporary measure, with TSCII-encoded fonts
(the third and the second items in my list). I realized that that's
the realm of bug 140013.


> Thanks for the suggestions. Someday, of course, UTF-8 would become the preferred
> encoding for all Tamil webmasters. 
> More info: http://www.tamil.net/tscii/tscii.html

> As to the "x-user-defined", it was simply a temporary measure.

Considering that there are quite a lot of web pages in TSCII,
Mozilla may support TSCII as a document encoding, but I think
it's not a top priority because it's a dead end to support
TSCII as a document encoding (the first item in my list). [1]
TSCII is a glyph encoding (as opposed to a character encoding).
As such, it appears to me that it cannot represent Tamil text as faithfully as
Unicode.
I guess those at Tamil.net and elsewhere trying to support TSCII should consider
this point and decide which way is better in the long run: to keep promoting a
limited measure like TSCII or to help and encourage people to switch over to
Unicode (including the conversion of TSCII-encoded truetype fonts to opentype
fonts). Everybody is moving to Unicode and other Indic script users have been
making the transition.
  
> However at present, most webmasters continue
> write Tamil pages using TSCII encoding. (Mostly because UTF-8 is not supported
> by WinMe, Win98, Win95.)

  Those OSes don't support Unicode as well as Win2k/XP do, but there are a couple of
freely available Unicode editors that run under Win9x/ME. For instance, try Yudit
at http://www.yudit.org. The Windows version is hidden somewhere :-)
It supports Tamil well and can import TSCII-encoded
text and export to UTF-8. You can also try 'SC Unipad' (try to google it).
If you don't feel at home in Yudit, you can just edit your html
files with your favorite editor, save them in TSCII and convert them to
UTF-8 with 'uniconv -decode tscii -encode utf-8 input.tscii output.utf8'


My opinion, regarding point No. 1, is that
we should still go for a new document encoding [x-tscii] to support
the TSCII encoding. This will also ensure that we are able to move smoothly
from TSCII --> Unicode.

This is important considering the vast amount of data available in TSCII format.
And TSCII documents are still being created on a daily basis. It might take quite
a long time for Tamil users to change completely to Unicode.

So it's very important for Mozilla to support a new encoding type 'x-tscii' as an
interim solution for Tamil users. This will also ensure that Tamil
language usage doesn't suffer until Unicode is completely adopted.
How did I get the screenshot? It's easy with my
patch for bug 176290 applied. (So please vote for bug 176290 if
you want Tamil pages encoded in TSCII and tagged as
x-user-defined to be rendered in Mozilla-Xft 1.4. Even better
is to speak up that you want that feature for 1.4 :-))
 
I had to do nothing other than add an entry to the
fontEncoding.properties file (in the res/fonts directory) like this.
(I suspected this would work, but my first experiment failed
because I typed 'i' in place of 'l' in 'tsc_avarangal'.)

# Tamil fonts
encoding.tsc_avarangal.ttf = x-user-defined

You can add as many entries like the above as you want. 

Then, set View|Character Coding to User-defined. 
You also have to set the fonts to use for user-defined in Edit|
Preference|Appearance|Fonts to Tamil fonts. You can avoid
this step if your web pages specify the fonts to use with
either the old font-face or the new CSS font-family.

BTW, Mozilla for Windows already has custom-font encoding support,
so you can enjoy the feature right now by following the procedure
above.

I think with this there's little need to support TSCII as a document 
encoding (item 1 in my list). Needless to say, we need to work
on item 2 and item 3, but that's for bug 1400??.
Sorry for attachment 121512 [details]. Even though I don't know Tamil,
I thought it looked strange, but couldn't pinpoint what's wrong. Now I know....

Please take a look at this new screenshot and let me know whether
it's rendered correctly this time.

While trying to figure out why Yudit's mapping of TSCII <-> Unicode is different
from the mapping at tscii.net, I found out that TSCII truetype fonts have a
_bogus_ Mac Roman cmap (PID=1, EID=0). (Additionally, some of them have PID=0,
EID=0 - Unicode default - cmaps and an MS Symbol cmap while others have PID=3,
EID=1 - MS Unicode - cmaps.) This is the same problem as Mathematica fonts have.
This problem was solved by specifying the cmap to use for the Freetype library
(see bug 176290 comment #75 and other comments referenced therein). Therefore,
for each TSCII font, two lines are necessary in the fontEncoding.properties file,
as shown below.

# Tamil fonts (TSCII encoding : see http://www.tscii.net)
encoding.tsc_avarangal.ttf = x-user-defined
encoding.tsc_aparanarpdf.ttf = x-user-defined
encoding.tsc_aandaal.ttf = x-user-defined
encoding.tsc_avarangalfxd.ttf = x-user-defined
encoding.tsc_paranbold.ttf = x-user-defined
encoding.tsc_paranarpdf.ttf = x-user-defined
encoding.tsc_paranarho.ttf = x-user-defined
encoding.tsc_kannadaasan.ttf = x-user-defined

# TSCII fonts have a pseudo Apple Roman cmap, as Mathematica fonts do.
encoding.tsc_avarangal.ftcmap=apple-roman
encoding.tsc_aparanarpdf.ftcmap=apple-roman
encoding.tsc_aandaal.ftcmap=apple-roman
encoding.tsc_avarangalfxd.ftcmap=apple-roman
encoding.tsc_paranbold.ftcmap=apple-roman
encoding.tsc_paranarpdf.ftcmap=apple-roman
encoding.tsc_paranarho.ftcmap=apple-roman
encoding.tsc_kannadaasan.ftcmap=apple-roman
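
Since every font needs the same pair of lines, the entries can be generated rather than typed by hand (which would also have saved me from the 'i' for 'l' typo). This is only a sketch: the helper name is made up, and lowercasing the file name is an assumption based on the entries shown above, not a documented rule.

```python
def font_encoding_entries(font_files, encoding="x-user-defined",
                          ftcmap="apple-roman"):
    """Build the two fontEncoding.properties lines needed per font file.

    Hypothetical helper; the key format is copied from the entries above.
    """
    lines = []
    # dict.fromkeys() drops duplicate names while keeping the original order.
    for name in dict.fromkeys(f.lower() for f in font_files):
        stem = name[:-4] if name.endswith(".ttf") else name
        lines.append(f"encoding.{stem}.ttf = {encoding}")
        lines.append(f"encoding.{stem}.ftcmap={ftcmap}")
    return lines


if __name__ == "__main__":
    print("# Tamil fonts (TSCII encoding : see http://www.tscii.net)")
    for line in font_encoding_entries(["tsc_avarangal.ttf", "tsc_aandaal.ttf"]):
        print(line)
```

The output keeps each font's two lines together instead of in two separate groups; as far as I know the order of entries in a properties file doesn't matter.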
Attachment #121512 - Attachment is obsolete: true
Comment on attachment 121630 [details]
another incorrect rendering of www.tamil.net/poetry

this is still incorrect. x-user-defined
mapping is straight pass-thru converter
so that it should work.... something
is wrong with the font I used for testing...
I'll keep trying..
Attachment #121630 - Attachment description: the correct rendering of www.tamil.net/poetry → another incorrect rendering of www.tamil.net/poetry
This screenshot was obtained with Moz-Win 1.3 with fonts for
user-defined set to TSC_Kannadassan.
(http://www.tamil.net/projectmadurai/pub/pm0098/kanavin.html)
With the same setting, MS IE 6 rendered the page exactly identically. So, I
think we can take this shot as 'the' reference. User-defined converter is just
a straight-through single byte converter (with 0x80-0xFF stored in U+F780 -
U+F7FF when in UTF-16) so that this should
just work fine under Linux as well. But it does not...
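
For reference, the straight-through mapping described above amounts to the following. This is a Python sketch of the mapping only, not Mozilla's actual converter code, and the function names are made up.

```python
def user_defined_decode(data: bytes) -> str:
    """ASCII passes through; bytes 0x80-0xFF land on U+F780-U+F7FF."""
    return "".join(chr(b) if b < 0x80 else chr(0xF700 + b) for b in data)


def user_defined_encode(text: str) -> bytes:
    """Inverse mapping; anything outside the two ranges is rejected."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(cp)
        elif 0xF780 <= cp <= 0xF7FF:
            out.append(cp - 0xF700)
        else:
            raise ValueError(f"U+{cp:04X} is not representable in x-user-defined")
    return bytes(out)
```

So the font has to offer glyphs at U+F780 - U+F7FF (or through a cmap that gets remapped to that range) for the high bytes to show up at all.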
This is Mozilla-Xft's rendering of the TSCII table (in x-user-defined)
at http://jshin.net/i18n/tscii2.html. Compare it with 
http://www.tamil.net/tscii/charset17.gif and you'll see that they're in
exact match. With this particular font (TSCAparanar.ttf), what I described
in comment #18 works (along with the patch for bug 176290). Why not with
other fonts? See below.

> a straight-through single byte converter (with 0x80-0xFF stored in U+F780 -
> U+F7FF when in UTF-16) so that this should
> just work fine under Linux as well. But it does not...

This turned out to be because 'TSCII-compliant' truetype fonts come with various
Truetype cmaps that are internally inconsistent. Some fonts come with PID=0, EID=0
(Unicode cmap) and others with PID=3, EID=1 (MS Unicode cmap). Still others come
with PID=3, EID=0 (MS Symbol). _All_ of them come with a MacRoman cmap (PID=1,
EID=0).
So I assumed that using the MacRoman cmap would give Mozilla the same glyph
arrangement as given in <http://www.tamil.net/tscii/charset17.gif>.
It turned out that the MacRoman cmaps are not consistent with the PID=3, EID=1 /
PID=3, EID=0 cmaps. Among the 7 or so TSCII fonts I downloaded, only
TSCAparanar.ttf has a MacRoman cmap that matches the TSCII 1.7 glyph arrangement.
I don't know how they work under Windows (they do according to my test).
Anyway, to render Tamil text with these fonts and Mozilla-Xft, we need
to write a new converter (TSCII codepoints [0x80 - 0x9F] mapped
as if they're in Windows-1252) assuming that the Unicode cmap has
pseudo-Unicode glyph indices for TSCII glyphs.
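
If anyone wants to check which cmap subtables a given font carries, a small Python sketch like the one below should do. It only reads the sfnt table directory and the cmap header, following the TrueType layout as I understand it; it handles a single-font .ttf, not collections. The function name and the file path in the usage comment are just examples.

```python
import struct


def list_cmap_subtables(font: bytes):
    """Return [(platformID, encodingID), ...] from a TrueType font's cmap table."""
    # Offset table: sfnt version (4 bytes), numTables (2), then 6 more bytes.
    num_tables = struct.unpack_from(">H", font, 4)[0]
    for i in range(num_tables):
        # Each table record: tag, checksum, offset, length (16 bytes in total).
        tag, _checksum, offset, _length = struct.unpack_from(
            ">4sLLL", font, 12 + 16 * i)
        if tag == b"cmap":
            _version, count = struct.unpack_from(">HH", font, offset)
            # Each encoding record: platformID, encodingID, subtable offset.
            return [struct.unpack_from(">HH", font, offset + 4 + 8 * j)
                    for j in range(count)]
    return []


# Usage (hypothetical path):
#   with open("TSCAparanar.ttf", "rb") as f:
#       print(list_cmap_subtables(f.read()))
# (1, 0) is MacRoman, (3, 0) is MS Symbol, (3, 1) is MS Unicode.
```

This only tells you which subtables exist; whether their contents agree with each other still has to be checked glyph by glyph.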
Supporting TSCII is not such a good idea. It's better to write utf-8 encoded
Tamil web pages and ask Mozilla to write converters to make the existing
TSCII-encoded fonts show utf-8 encoded Tamil pages properly. I understand the
infrastructure for doing this is already available and has been implemented for
other Indic scripts.

That way webmasters can create utf-8 encoded pages and viewers don't need new fonts
to view them either.

FTR http://dmoz.org/World/Tamil , on which Google's Tamil search is based, got
converted to UTF-8, in fact the whole of http://dmoz.org is about to convert to
UTF-8.

The sooner we convert Tamil pages to utf-8 the better. It's a big task which
will become bigger as each day passes by so we might as well start it soon.
Enabling TSCII would only delay the transition to UTF-8.
Google in Tamil (http://google.com -> Preferences, choose Tamil) is also moving
from Latin script to Tamil script, UTF-8 encoded.
> It's better to write utf-8 encoded
> tamil web pages and ask mozilla to write converters to make the existing tscii
> encoded fonts show utf-8 encoded tamil pages properly.

  Sure. See bug 204039. 

Writing TSCII->Unicode converter is certainly doable, but is a low-priority
item per your comment and because user-defined should work with some fonts.
Status: NEW → RESOLVED
Closed: 20 years ago
Resolution: --- → WONTFIX
jshin:
Do you agree with the WONTFIX (26 votes so far) ?
(In reply to comment #26)
> jshin:
> Do you agree with the WONTFIX (26 votes so far) ?
--------------
True, but I don't think anyone is attempting to fix this bug. Moreover, as time
goes on, it becomes less and less useful, with UTF-8 sites becoming the
preferred choice among Tamil webmasters... I would rather see effort put into fixing
UTF-8 related bugs -- especially the 'justify' problem!
*** Bug 261239 has been marked as a duplicate of this bug. ***
*** Bug 266617 has been marked as a duplicate of this bug. ***
*** Bug 292946 has been marked as a duplicate of this bug. ***