Closed Bug 203264 Opened 21 years ago Closed 17 years ago

"Hankaku kana" should be converted to zenkaku

Categories

(Other Applications :: ChatZilla, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: kazhik, Assigned: bugzilla-mozilla-20000923)

References

(Blocks 1 open bug, )

Details

(Whiteboard: [cz-charset] [cz-0.9.79])

Attachments

(1 file, 1 obsolete file)

"Hankaku kana" in iso-2022-jp should be converted to zenkaku
before sending, like many IRC clients being used in Japan
and Mozilla Mail.

http://lxr.mozilla.org/seamonkey/source/mailnews/base/util/nsMsgI18N.cpp#639
In RFC 2812 2.2, "No specific character set is specified. The protocol is based
on a set of codes which are composed of eight (8) bits, making up an octet."
Non-delimiter characters should be sent as is, in favor of original intention of
sender. In other words, in IRC you should be able to send any binary data except
for delimiter characters, anyway.

IMHO such hankaku->zenkaku conversion should be done at message interpretation
by receiver who expects Japanese strings by detecting Japanese encoding in given
data, not sender side.

This bug should be marked WONT or at least there should be configurable option
to disable sender-side conversion. And more preferably, receiver-side
conversion, which should be configurable too, may be nice enhancement.
Product: Core → Other Applications
Whiteboard: [cz-charset]
I'm happy to do this conversion on send, but only if someone can actually explain why it is done at all - it seems like a very odd thing to do. None of the existing code or associated bugs seems to help, so can someone explain what this conversion is for and why is it necessary?
See http://en.wikipedia.org/wiki/Katakana#Computer_encoding

Since ISO-2022-JP does not encode the Hankaku kana (half-width katakana) forms,  text input using these characters will be converted to question marks by the ISO-2022-JP encoder. In mailnews we prevent this happening by converting to zenkaku (full-width katakana) before encoding.
Ah, that makes sense. Let's see what we can do...
Turning off this conversion results in ? being sent instead, meaning that the data is lost long before the receiver gets it, so I don't propose to give any option to disable this.

However, unfortunately we cannot do this like mailnews can; nsITextTransform is not scriptable, making the use of the conversion component impossible.

I'm not sure I feel comfortable duplicating the code into JS, although that seems like the only option now.
Well, it's possible in JavaScript, that's for sure. I don't like having to do a string concat for each character, though, but I can't find a more efficient way to build the result string.

Note that I opted to keep the old MessageManager behaviour by default and simply turn on the conversion when used in ChatZilla. People had a habit of re-using some of our useful libraries. :)
Assignee: rginda → silver
Status: NEW → ASSIGNED
Attachment #266268 - Flags: review?(gijskruitbosch+bugs)
Comment on attachment 266268 [details] [diff] [review]
Convert halfwidth hankaku to fullwith zenkaku

>RCS file: /cvsroot/mozilla/extensions/irc/js/lib/message-manager.js,v
>@@ -46,12 +46,17 @@ function MessageManager(entities)
>     this.ucConverter =
>         Components.classes[UC_CTRID].getService(nsIUnicodeConverter);
>     this.defaultBundle = null;
>     this.bundleList = new Array();
>     // Provide a fallback so we don't break getMsg and related constants later.
>     this.entities = entities || {};
>+
>+    // ISO-2022-JP (often used for Japanese on IRC) doesn't contain any
>+    // support for "halfwith" hankaku, so we support the option to convert it to
>+    // "fullwith" zenkaku. This does not affect any other encoding at this time.
>+    this.enableHankakuToZenkaku = false;
> }

Nit: halfwidth, fullwidth

Why not put that boolean on the prototype? That would be cleaner in my opinion.

<snip>

>+            // If it is in hankaku - do basic mapping first.
Nit: colon, comma or nothing here - not a dash.

>+            dest = basicMapping[src - HANKAKU_BASE1];
>+
>+            // If is some char could be modifier, and the next char
>+            // is a modifier, modify it and eat one byte.
>+
Nit: This isn't valid English. I see you've copied it from the original code, but please fix this so it makes sense. Also, please remove this blank line. :-)

>+            if (i < msg.length - 1)
>+            {
>+                if ((MOD_NIGORI == srcMod) &&
>+                    (((src >= NIGORI_MIN1) && (src <= NIGORI_MAX1)) ||
>+                     ((src >= NIGORI_MIN2) && (src <= NIGORI_MAX2))))
>+                {
>+                    dest += NIGORI_MODIFIER;
>+                    i++;
>+                }
>+                else if ((MOD_MARU == srcMod) &&
>+                    ((src >= MARU_MIN) && (src <= MARU_MAX)))
Nit: lose a set of braces here, they're all AND operators. Also line up the remaining braces.

<snip>

r=me with the nits fixed and the boolean on the prototype
Attachment #266268 - Flags: review?(gijskruitbosch+bugs) → review+
Attached patch Updated per nitsSplinter Review
Re-written all the comments in the conversion code, and also spotted and removed some trailing whitespace.
Attachment #266268 - Attachment is obsolete: true
Attachment #267810 - Flags: review?(gijskruitbosch+bugs)
Hmm, charsetalias.properties says there are a few aliases for iso-2022-jp (e.g. csiso2022jp).  If I use /charset csiso2022jp, fromUnicode does see its |charset| param as "csiso2022jp" (not iso-2022-jp), but I _think_ the same encoder/decoder gets used.

Would a follow-up patch be needed to handle that?
Yeah, probably; let's not hold up this patch on that for now. One thing that concerns me is how we could cope generally, e.g. does the converter return the canonical name or the one used, could we read chatsetalias.properties ourselves, etc.?
Comment on attachment 267810 [details] [diff] [review]
Updated per nits

+    // We get the following character as well, since it can be a modifier.
Please move that comment to the line before the if statement.

Please file a followup bug about Mook's comment as necessary.

Otherwise, r=gijs.
Attachment #267810 - Flags: review?(gijskruitbosch+bugs) → review+
Blocks: 383878
(In reply to comment #9)
> Hmm, charsetalias.properties says there are a few aliases for iso-2022-jp (e.g.
> csiso2022jp).  If I use /charset csiso2022jp, fromUnicode does see its
> |charset| param as "csiso2022jp" (not iso-2022-jp), but I _think_ the same
> encoder/decoder gets used.
> 
> Would a follow-up patch be needed to handle that?
> 

To handle what exactly? nsIScriptableUnicodeConverter does charset alias resolution when setting its charset, but doesn't expose the canonical name of the charset in use. You can do your own alias->canonical name conversion with GetCharsetAlias() from nsICharsetConverterManager.
We're invoking our own code for the hankaku to zenkaku conversion based on the character encoding we've been set to use; if the user specifies an alias to ISO-2022-jp, we should still invoke said code. That's all in bug 383878.
Checked in --> FIXED.
Status: ASSIGNED → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
Whiteboard: [cz-charset] → [cz-0.9.79]
Changing multiple bugs' whiteboards should be smarter. :-(
Sorry for bugspam.
Whiteboard: [cz-0.9.79] → [cz-charset] [cz-0.9.79]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: