Closed Bug 41564 Opened 24 years ago Closed 23 years ago

Internationalize ChatZilla to handle different language scripts

Categories

(Other Applications :: ChatZilla, defect, P3)

Tracking

(Not tracked)

RESOLVED FIXED
mozilla1.0.1

People

(Reporter: m_kato, Assigned: oliver)

References

Details

Attachments

(2 files, 6 obsolete files)

The current implementation supports only Latin-1 encoding, but Japanese IRC
uses ISO-2022-JP.

Please support multiple encodings for I18N.
I'm going to confirm this bug and summarize the current status of Mozilla chat below and
make some recommendations as to what the specs for internationalization should be.

1. Currently, we can handle only Latin-1 (ISO-8859-1) characters in the chat window.
   We should follow the practice used elsewhere in Mozilla: send Unicode and expect to
   receive Unicode.
2. #1 will simplify character encoding issues among Mozilla ChatZilla users.
3. Many existing IRC clients are geared toward a single language. As people communicate
    across continents using ChatZilla and talk to other chat clients that might not know Unicode,
    what we need is a Character Coding menu like the one in the Messenger or Browser
    components. In fact, you can copy the menu from there. Ask the i18n people how to do this.
    Then, when the Character Coding menu is set to Japanese (ISO-2022-JP), for example, use
    that encoding both to send chat data and to interpret incoming data.
    This way you will be able to deal with legacy clients.

4. Currently, we do not handle CJK input methods correctly. We commit any entry when
    CR is pressed. In CJK, pressing CR means different things depending on the IME state.
    If the IME is in the candidate state, pressing CR means "commit to canvas" but NOT send the data.
    When the IME is not in the candidate state, pressing CR means send the data, etc.
    Someone familiar with CJK IMEs should be able to fix this quickly.
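
A minimal sketch of the Enter-key rule from item 4, assuming key events shaped like a modern DOM KeyboardEvent (with `key` and `isComposing` fields). The function name `handleEnterKey` and the `sendLine` callback are hypothetical and are not part of ChatZilla.

```javascript
// Sketch only: decide what Enter should do based on IME composition state.
// `event` is assumed to look like a modern DOM KeyboardEvent (key, isComposing);
// `sendLine` is a hypothetical callback that sends the input line to the server.
function handleEnterKey(event, sendLine) {
    if (event.key !== "Enter")
        return "ignored";

    // While the IME is composing (for example, showing candidates), Enter
    // belongs to the IME: it commits the candidate and nothing is sent.
    if (event.isComposing)
        return "commit";

    sendLine();
    return "sent";
}

// Usage with plain objects standing in for real key events.
let sentCount = 0;
const duringComposition = handleEnterKey(
    { key: "Enter", isComposing: true }, () => { sentCount++; });
const afterComposition = handleEnterKey(
    { key: "Enter", isComposing: false }, () => { sentCount++; });
```

In this sketch only the second Enter press reaches `sendLine`; the first is left to the IME.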

Let's make Chatzilla into a great multilingual chat tool!
Status: UNCONFIRMED → NEW
Ever confirmed: true
QA Contact: rginda → momoi
There are other issues concerning this, and m_kato has raised them in the
mozilla-i18n group. Kato-san, please send that message to rginda,
who may not have seen it.
Depends on: 27805
Status: NEW → ASSIGNED
Robert, sorry, my mistake.
Please assign this to me. I have code for a fix.
reassigning to m_kato
Assignee: rginda → m_kato
Status: ASSIGNED → NEW
Status: NEW → ASSIGNED
TODO plan:
o create a scriptable Unicode converter interface. (bug 54857)
o add a new command "/charset <character-set>"
Depends on: 54857
*MASS SPAM*

Changing QA contact on all open or unverified ChatZilla bugs to me, David
Krause, as I am now the QA contact for this component.
QA Contact: momoi → David
David, the fact that you're the default contact does not 
mean that you should take over all the bugs. Some bugs
can be re-assigned to appropriate people. 
This bug should be QA'ed by an international contact
with machines and environments set up for that task.

For now changing it to ji@netscape.com.
We may assign this to someone in Mozilla.org Japan.
Maybe Koike-san can take over this one?
QA Contact: david → ji
Kato-san said he would fix this by 1.0.
QA Contact: ji → kazhik
Target Milestone: --- → mozilla1.0
Whoops, sorry about that. I just did a "Change all bugs at once" and didn't
notice that I was removing a QA contact other than rginda.  I'll try to be more careful
next time.  It is true that I am not the best person to QA this type of bug.
Sorry again.
Reassigning to Furukawa-san.
Assignee: m_kato → oliver
Status: ASSIGNED → NEW
In my experience, ChatZilla currently doesn't allow the user to change fonts,
so even if Russian (significant for me) is transferred correctly, I see
only garbage because of the Latin-1 font. Could someone please change the Summary
field? I am not sure whether I am allowed to.
OK. I changed the summary line to include language scripts other
than Japanese. We need to deal with the different language scripts
that Mozilla supports.
Summary: Cannot use Japanese via IRC chat → Internationalize ChatZilla to handle different language scripts
*** Bug 102757 has been marked as a duplicate of this bug. ***
Blocks: patchmaker
No longer blocks: patchmaker
*** Bug 111216 has been marked as a duplicate of this bug. ***
Attached patch patch (obsolete)
Maruyama-san (mal@mozilla.gr.jp) and I have made a minimal patch to
handle non-ASCII characters in ChatZilla.

user_pref("extensions.irc.charset", "iso-2022-jp");

Users can set their default charset like this. 

We don't have a UI or a command to switch charsets yet.
That's great, thanks for the contribution.  I'll check this into the chatzilla
0.8.5 branch, which will hopefully land in a week or so.
Depends on: 103386
> user_pref("extensions.irc.charset", "iso-2022-jp");

Mistake. This should be:

user_pref("extensions.irc.default_charset", "iso-2022-jp");
Do we need to encode strings from the .properties file too?
Attachment #59093 - Attachment is obsolete: true
Attached patch new patch (obsolete)
I've reworked the patch a bit to integrate better with the existing codebase.
I've already checked this into the CHATZILLA_0_8_5_BRANCH, and will respin an
xpi for <http://www.hacksrus.com/~ginda/chatzilla/>.  The first xpi to have
this code will be 0.8.5-pre23, look for it in an hour or so.  The 0.8.5 branch
will hopefully land early next week, so please test this out asap.
I have installed the new 0.8.5-pre23 from the URL you posted.
Mozilla already had an older ChatZilla client installed.
How can I find out which version of ChatZilla I'm running now? Is it the new one (was it
updated by the xpi)?
The problem is that it doesn't show Russian/Ukrainian (koi8-r, koi8-u).
I have put both
user_pref("extensions.irc.charset", "koi8-r");
user_pref("extensions.irc.default_charset", "koi8-r");
into prefs.js

Test network: ForestNet (irc.ForestNet.org)
Run /list and look at the topics.
To change the server-side charset: "/quote codepage koi8" (for koi8-u and koi8-r)
> Do we need to encode strings from the .properties file too?

What sort of string are you talking about? Localizable strings? or just
some settings for charset.
OK, let me CC tao about this. It looks like chatzilla.jar contains
localizable .dtd files. chatzilla.jar is self-contained, as is venkman.jar.
How does localization work in this type of case? Should it not follow
the localization convention like en-US.jar? en-US.chatzilla.jar, for
example. So, it would make sense to insert the default charset of the
chatzilla client you're shipping into chatzilla.properties file. But that
should be easily discoverable by localizers. Suggestions?
If the resources are locale-specific, we need to put them in properties files
packaged in a locale-specific jar such as en-US-irc.jar so localizers can easily
translate irc to other languages.
Bugs targeted at mozilla1.0 without the mozilla1.0 keyword moved to mozilla1.0.1 
(you can query for this string to delete spam or retrieve the list of bugs I've 
moved)
Target Milestone: mozilla1.0 → mozilla1.0.1
kat:
>> Do we need to encode strings from the .properties file too?
>
> What sort of string are you talking about? Localizable strings? or just
> some settings for charset.

I was asking if we need to pass strings we get from a string bundle through the
decoder.  Now that I've got an idea what's going on here, I see we don't because
they are already in unicode.

I've just landed the branch, and branded it 0.8.5-rc1 (release candidate one.) 
I'd like to get malvin's problem cleared up before mozilla 0.9.7.  Any debugging
help would be greatly appreciated.

In 0.8.5-rc1, users should be able to switch the charset on the fly by typing
'/eval setCharset("iso-2022-jp");' in chatzilla.  This setting should be
persisted in prefs for the next session.

There are two problems in 0.8.5-rc2.

(1) Second outgoing message after executing "/eval setCharset()"
 isn't converted.

function ucConvertOutgoingMessage (msg)
{
    if (client.ucConverter)
        return client.ucConverter.ConvertFromUnicode(msg);

    return msg;
}

If you create an instance of ucConverter every time you send a message,
as in my first patch, ConvertFromUnicode() works well.


(2) Outgoing messages are always displayed as garbage in message display area.

if (!client.eventPump.getHook("uc-hook"))
{
    client.eventPump.addHook ([{type: "privmsg", set: "server"}],
        ucConvertIncomingMessage, "uc-hook");
}

This doesn't work for outgoing messages.
Attached patch (2) easy fixed patch (obsolete)
A patch that fixes Koike's problem (2).
Raw characters (UTF-8) are sent to the display area.
Attached patch modified prevous patch (obsolete)
Made a "ucConvert" class; problem (1) is now fixed!
Attached patch latest patch (obsolete)
Outbound conversions need to be done in more places than just
sayToCurrentTarget.  Doing them in filterOutput (as I had done) was too early,
and resulted in us trying to display encoded text (instead of Unicode) in the
output window.  I've added a fromUnicode() function and called it at each site
that sends plain text to the server that I can think of (I may have missed some
call sites.)

I'm not sure why only the first outbound message was converted (possibly
because of my createInstance vs. getService mixup) but it seems to be fixed
now.

I think adding a new class for this is a little too much, and re-creating the
xpcom component for every message processed is *definitely* too much.

I've tested this patch on irc.forestnet.org as described by malvin in comment
#20, and it looks like it works to me.  I see Cyrillic characters in the
topics, and when I paste those characters in a private message to myself they
appear at both ends.

I'll post this to hacksrus as pre3 for further testing.

Thanks to everyone who has commented and attached patches to this bug, I
wouldn't have been able to fix this without your help.
Attachment #59819 - Attachment is obsolete: true
Attachment #61005 - Attachment is obsolete: true
Attachment #61013 - Attachment is obsolete: true
rc3, not pre3.

rc3 is now available on www.hacksrus.com/~ginda/chatzilla/.  Please test it out,
I'd like to check it in by tomorrow (which is the 0.9.7 close.)
rc3 doesn't convert the second outgoing message. But every message is 
displayed fine in local window.
Creating an instance of the Unicode converter for every message isn't a good solution.
But that is the only way we know now. I think we should adopt it
as the temporary fix for 0.9.7.
I'm sorry, but I'm not sure I agree.  Creating a new encoder for every message
sent will just hide the real problem, which I'd much prefer to solve and get on
the 0.9.7 branch.

The koi8 encoder seems to work for me.  I attach to forestnet, /list #moldova,
and paste some of the characters from the topic into the input box.  I can /msg
those characters to myself multiple times, and they always look the same.

Could it be that the ISO-2022-JP encoder leaves itself in a bad state after
encoding the first message?  I'm trying to verify this, but nothing obvious goes
wrong when I pass two ASCII messages through it.  Can you name an irc server
which has 2022-jp users so I can see the problem for myself?
/attach moznet, /join #mozillazine-jp.

If we put an ASCII character at the beginning of Japanese characters,
every outgoing message is converted correctly.
Attached patch ad-hoc patch for iso-2022-jp (obsolete)
An ad-hoc patch for iso-2022-jp.

"iso-2022-jp" has several STATEs.
Once the STATE of ucConverter becomes a non-ASCII charset,
it won't change until the next ASCII char.

Trick: A dummy call to "client.ucConverter.ConvertFromUnicode('a');" changes the
STATE back to the ASCII charset.

A matter of concern: Is ucConverter synchronized?
     client.ucConverter.ConvertFromUnicode('a');
     return client.ucConverter.ConvertFromUnicode(msg);

The following code also works.

     client.ucConverter.charset = client.CHARSET;
     return client.ucConverter.ConvertFromUnicode(msg);
shoji, what does it mean for the converter to be "synchronized"?
When ucConverter.convert(From|To)Unicode() is called by TWO (or more) callers
simultaneously, strings and STATE changes will be mixed.

fromUnicode() and toUnicode() must lock ucConverter.

in Java style,

fromUnicode(msg) {
  ...
  synchronized (client.ucConverter) {
    client.ucConverter.fromUnicode("a");
    return client.ucConverter.fromUnicode(msg);
  }
}

# oops.., I made a mistake in adhoc patch.
# client.ucConverter.ConvertFromUnicode('a') is called in toUnicode()..
# It must be client.ucConverter.ConvertToUnicode('a')


I think ConvertFromUnicode() should add "ESC ( B" at the end of 
the returned string. Then JavaScript code doesn't have to care about
STATE.
shom: We have no locking/synchronization constructs in xpcom or javascript, the
converter would have to provide its own synchronization api.  More likely, the
converter should synchronize itself, so the caller doesn't have to worry about
the details.

kazhik: what is "ESC ( B" in bytes?  What does that sequence mean, and is it
valid for all encodings, or just iso-2022-jp?
"ESC ( B" means the beginning of ASCII characters.

An ISO-2022-JP string in ChatZilla begins with "ESC $ B" and
ends with no escape sequence. So the following text is assumed
to be ISO-2022-JP.


It seems nsScriptableUnicodeConverter::ConvertFromUnicode()
should call mEncoder->Finish() after mEncoder->Convert().
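
To illustrate the state problem discussed in the comments above, here is a toy model of a stateful encoder. It is a sketch under stated assumptions and not the Mozilla converter: the class `ToyIso2022JpEncoder` is hypothetical, non-ASCII characters are encoded as the placeholder "##" rather than real JIS X 0208 bytes, and `finish()` only imitates the role that mEncoder->Finish() is expected to play.

```javascript
// Toy model of a stateful ISO-2022-JP style encoder. Hypothetical class, not
// the Mozilla converter; non-ASCII characters become the placeholder "##".
const ESC_ASCII = "\x1b(B"; // "ESC ( B": switch to ASCII
const ESC_JIS = "\x1b$B";   // "ESC $ B": switch to JIS X 0208

class ToyIso2022JpEncoder {
    constructor() {
        this.state = "ascii";
    }

    // Encode one string, emitting an escape sequence only when the
    // character set changes relative to the remembered state.
    convert(text) {
        let out = "";
        for (const ch of text) {
            const wanted = ch.codePointAt(0) < 0x80 ? "ascii" : "jis";
            if (wanted !== this.state) {
                out += wanted === "ascii" ? ESC_ASCII : ESC_JIS;
                this.state = wanted;
            }
            out += wanted === "ascii" ? ch : "##";
        }
        return out;
    }

    // Return the escape sequence needed to end in ASCII, and reset the state.
    finish() {
        if (this.state === "ascii")
            return "";
        this.state = "ascii";
        return ESC_ASCII;
    }
}

// Shared converter without finish(): the second message loses its "ESC $ B".
const enc = new ToyIso2022JpEncoder();
const first = enc.convert("\u3042");
const second = enc.convert("\u3042");
// Workaround seen in this bug: a leading ASCII character forces both escapes.
const prefixed = enc.convert("a\u3042");

// With finish() after every message, each message is self-contained.
const fixed = new ToyIso2022JpEncoder();
const fixedFirst = fixed.convert("\u3042") + fixed.finish();
const fixedSecond = fixed.convert("\u3042") + fixed.finish();
```

In this model the second message from the shared converter carries no escape sequence at all, which matches the symptom that only the first outgoing message was converted correctly.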


Attached patch more patching
The escape sequence seems to have done the trick for the problems I saw with
iso-2022-jp.  In this patch, I am assuming that ESC(B is the ASCII sequence for
all iso-2022 encodings, can anyone verify that this is a valid assumption?


I'll post this as rc5 in a minute.
Attachment #61183 - Attachment is obsolete: true
Attachment #61316 - Attachment is obsolete: true
> In this patch, I am assuming that ESC(B is the ASCII sequence for
> all iso-2022 encodings, can anyone verify that this is a valid
> assumption?

It is. The final "B" is not unique in ISO-2022 encodings but 
"ESC ( B" is unique to ANSI X3.4-1986 (=ASCII).
I've checked the latest patch into the trunk.

kazhik, mEncoder->Finish() sounds like the right fix to me too, will you file an
i18n bug with a patch?
I filed bug 114923 for the nsScriptableUnicodeConverter::ConvertFromUnicode()
problem.

We need a command to change charset.

/charset iso-2022-jp
/charset euc-kr
I've just landed the client.ucConverter.charset = client.CHARSET; fix, along
with the /charset command on the trunk.

Provided everything works as expected, we'll have charset support in chatzilla
for 0.9.7!  Thanks again to all who helped out.

I'll mark this bug as fixed, please reopen if there are any problems.
Status: NEW → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
The charsets aren't working properly.

I can't see support for right-to-left languages here, for example Hebrew!
I can only see it backwards (with both charsets:
ISO-8859-8 - should show it backwards
Windows-1255 - should switch it so DCBA will be ABCD).

The same thing happens with Hebrew input.
I can only input in one direction. I'm not sure whether the problem is in
the display or in the input, but I see it backwards!

ChatZilla should use the same charset methods that the Mozilla browser uses.

I would like this bug to be reopened.
m_vitaly, I think you should open another bug for bidi support
in Chatzilla. We need specialists in that area to diagnose what needs
to happen for Chatzilla to support Hebrew, Arabic and other bidi languages.
This bug put in basic charset support in Chatzilla and we should leave it
at that.
When you file a new bug, in addition to rginda@netscape.com, CC also
mkaply@us.ibm.com and smontagu@netscape.com.
Blocks: 128773
By the way...
Some IRC servers support a "codepage" command. I know about the RusNet and ForestNet
servers for certain. (http://www.rus.net.ua and http://www.ForestNet.Org).

So the user must type "/quote codepage koi8u" to get the KOI8-U charset (ForestNet), or
"/quote codepage cp1251" for the "windows-1251" charset.
It would be just great to have this feature in ChatZilla (so it would send this
command to the server when the user changes the client charset).

What do you think?
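
A rough sketch of what that could look like, assuming the client keeps a small table from its own charset names to the server's codepage names. The table and the function `codepageCommandFor` are hypothetical; only the two pairs mentioned in this comment are listed, and other servers may use different names.

```javascript
// Hypothetical mapping from client charset names to server codepage names.
// Only the two pairs mentioned in this bug are listed.
const CODEPAGE_FOR_CHARSET = new Map([
    ["koi8-u", "koi8u"],
    ["windows-1251", "cp1251"],
]);

// Return the raw IRC line to send after a charset change,
// or null when no codepage is known for that charset.
function codepageCommandFor(charset) {
    const codepage = CODEPAGE_FOR_CHARSET.get(charset.toLowerCase());
    return codepage ? "CODEPAGE " + codepage : null;
}

// Usage: what would be sent after "/charset koi8-u" or "/charset iso-2022-jp".
const forKoi8u = codepageCommandFor("koi8-u");
const forJapanese = codepageCommandFor("iso-2022-jp");
```

Returning null for unknown charsets lets the caller skip the command on servers or charsets where no mapping is known.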
I believe the updated IRC standard supports character encoding negotiation
between client and server. So chatzilla should support this whenever the server
supports it and get rid of cryptic config file editing and have it "Just
Work(tm)" ;-)

Here are some useful links :

http://www.irc.org/tech_docs/005.html
http://www.irc.org/tech_docs/draft-brocklesby-irc-isupport-03.txt
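
As a sketch of how a client might read the 005 numeric described in those links, the function below splits a reply into its tokens. The function name `parseIsupport`, the example server line, and the idea that a server would advertise a CHARSET token are all assumptions for illustration; the sketch also assumes the line carries a server prefix.

```javascript
// Parse the tokens of a 005 (RPL_ISUPPORT) line into a Map.
// Assumes the form ":server 005 nick TOKEN[=value] ... :trailing text".
function parseIsupport(line) {
    const tokens = new Map();

    // Drop the trailing human-readable part, if any.
    const trailing = line.indexOf(" :");
    const head = trailing === -1 ? line : line.slice(0, trailing);

    // words: [":server", "005", "nick", TOKEN, TOKEN, ...]
    const words = head.split(" ").filter((w) => w.length > 0);
    for (const word of words.slice(3)) {
        const eq = word.indexOf("=");
        if (eq === -1)
            tokens.set(word, true);
        else
            tokens.set(word.slice(0, eq), word.slice(eq + 1));
    }
    return tokens;
}

// Made-up example line; a real server may not send a CHARSET token at all.
const example = ":irc.example.org 005 nick CHARSET=koi8-u NICKLEN=30 SAFELIST " +
    ":are supported by this server";
const supported = parseIsupport(example);
```

A client could then check `supported.get("CHARSET")` before deciding whether to fall back to the charset preference.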
Product: Core → Other Applications