Closed Bug 203264 Opened 21 years ago Closed 17 years ago

"Hankaku kana" should be converted to zenkaku

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: kazhik, Assigned: bugzilla-mozilla-20000923)

References

(Blocks 1 open bug,
URL
)

Details

(Whiteboard: [cz-charset] [cz-0.9.79])

Attachments

(1 file, 1 obsolete file)

Convert halfwidth hankaku to fullwith zenkaku 17 years ago James Ross 5.20 KB, patch	Gijs : review+	Details \| Diff \| Splinter Review
Updated per nits 17 years ago James Ross 5.40 KB, patch	Gijs : review+	Details \| Diff \| Splinter Review

Koike Kazuhiko

Reporter

Description

•

21 years ago

"Hankaku kana" in iso-2022-jp should be converted to zenkaku
before sending, like many IRC clients being used in Japan
and Mozilla Mail.

http://lxr.mozilla.org/seamonkey/source/mailnews/base/util/nsMsgI18N.cpp#639

Ryuichi KUBUKI

Comment 1

•

21 years ago

In RFC 2812 2.2, "No specific character set is specified. The protocol is based
on a set of codes which are composed of eight (8) bits, making up an octet."
Non-delimiter characters should be sent as is, in favor of original intention of
sender. In other words, in IRC you should be able to send any binary data except
for delimiter characters, anyway.

IMHO such hankaku->zenkaku conversion should be done at message interpretation
by receiver who expects Japanese strings by detecting Japanese encoding in given
data, not sender side.

This bug should be marked WONT or at least there should be configurable option
to disable sender-side conversion. And more preferably, receiver-side
conversion, which should be configurable too, may be nice enhancement.

Myk Melez [:myk] [@mykmelez]

Updated

•

20 years ago

Product: Core → Other Applications

Samuel Sieb

Updated

•

19 years ago

URL: http://lxr.mozilla.org/seamonkey/sour...

James Ross

Assignee

Updated

•

19 years ago

Whiteboard: [cz-charset]

James Ross

Assignee

Comment 2

•

18 years ago

I'm happy to do this conversion on send, but only if someone can actually explain why it is done at all - it seems like a very odd thing to do. None of the existing code or associated bugs seems to help, so can someone explain what this conversion is for and why is it necessary?

Simon Montagu :smontagu

Comment 3

•

18 years ago

See http://en.wikipedia.org/wiki/Katakana#Computer_encoding

Since ISO-2022-JP does not encode the Hankaku kana (half-width katakana) forms,  text input using these characters will be converted to question marks by the ISO-2022-JP encoder. In mailnews we prevent this happening by converting to zenkaku (full-width katakana) before encoding.

URL: http://lxr.mozilla.org/seamonkey/sour... → http://bonsai.mozilla.org/cvsblame.cg...

:Gavin Sharp [email: gavin@gavinsharp.com]

Updated

•

18 years ago

URL: http://bonsai.mozilla.org/cvsblame.cg... → http://bonsai.mozilla.org/cvsblame.cg...

James Ross

Assignee

Comment 4

•

18 years ago

Ah, that makes sense. Let's see what we can do...

James Ross

Assignee

Comment 5

•

17 years ago

Turning off this conversion results in ? being sent instead, meaning that the data is lost long before the receiver gets it, so I don't propose to give any option to disable this.

However, unfortunately we cannot do this like mailnews can; nsITextTransform is not scriptable, making the use of the conversion component impossible.

I'm not sure I feel comfortable duplicating the code into JS, although that seems like the only option now.

James Ross

Assignee

Comment 6

•

17 years ago

Attached patch Convert halfwidth hankaku to fullwith zenkaku (obsolete) — Details — Splinter Review

Well, it's possible in JavaScript, that's for sure. I don't like having to do a string concat for each character, though, but I can't find a more efficient way to build the result string.

Note that I opted to keep the old MessageManager behaviour by default and simply turn on the conversion when used in ChatZilla. People had a habit of re-using some of our useful libraries. :)

Assignee: rginda → silver

Status: NEW → ASSIGNED

Attachment #266268 - Flags: review?(gijskruitbosch+bugs)

:Gijs (he/him)

Comment 7

•

17 years ago

Comment on attachment 266268 [details] [diff] [review]
Convert halfwidth hankaku to fullwith zenkaku

>RCS file: /cvsroot/mozilla/extensions/irc/js/lib/message-manager.js,v
>@@ -46,12 +46,17 @@ function MessageManager(entities)
>     this.ucConverter =
>         Components.classes[UC_CTRID].getService(nsIUnicodeConverter);
>     this.defaultBundle = null;
>     this.bundleList = new Array();
>     // Provide a fallback so we don't break getMsg and related constants later.
>     this.entities = entities || {};
>+
>+    // ISO-2022-JP (often used for Japanese on IRC) doesn't contain any
>+    // support for "halfwith" hankaku, so we support the option to convert it to
>+    // "fullwith" zenkaku. This does not affect any other encoding at this time.
>+    this.enableHankakuToZenkaku = false;
> }

Nit: halfwidth, fullwidth

Why not put that boolean on the prototype? That would be cleaner in my opinion.

<snip>

>+            // If it is in hankaku - do basic mapping first.
Nit: colon, comma or nothing here - not a dash.

>+            dest = basicMapping[src - HANKAKU_BASE1];
>+
>+            // If is some char could be modifier, and the next char
>+            // is a modifier, modify it and eat one byte.
>+
Nit: This isn't valid English. I see you've copied it from the original code, but please fix this so it makes sense. Also, please remove this blank line. :-)

>+            if (i < msg.length - 1)
>+            {
>+                if ((MOD_NIGORI == srcMod) &&
>+                    (((src >= NIGORI_MIN1) && (src <= NIGORI_MAX1)) ||
>+                     ((src >= NIGORI_MIN2) && (src <= NIGORI_MAX2))))
>+                {
>+                    dest += NIGORI_MODIFIER;
>+                    i++;
>+                }
>+                else if ((MOD_MARU == srcMod) &&
>+                    ((src >= MARU_MIN) && (src <= MARU_MAX)))
Nit: lose a set of braces here, they're all AND operators. Also line up the remaining braces.

<snip>

r=me with the nits fixed and the boolean on the prototype

Attachment #266268 - Flags: review?(gijskruitbosch+bugs) → review+

James Ross

Assignee

Comment 8

•

17 years ago

Attached patch Updated per nits — Details — Splinter Review

Re-written all the comments in the conversion code, and also spotted and removed some trailing whitespace.

Attachment #266268 - Attachment is obsolete: true

Attachment #267810 - Flags: review?(gijskruitbosch+bugs)

:Mook

Comment 9

•

17 years ago

Hmm, charsetalias.properties says there are a few aliases for iso-2022-jp (e.g. csiso2022jp).  If I use /charset csiso2022jp, fromUnicode does see its |charset| param as "csiso2022jp" (not iso-2022-jp), but I _think_ the same encoder/decoder gets used.

Would a follow-up patch be needed to handle that?

James Ross

Assignee

Comment 10

•

17 years ago

Yeah, probably; let's not hold up this patch on that for now. One thing that concerns me is how we could cope generally, e.g. does the converter return the canonical name or the one used, could we read chatsetalias.properties ourselves, etc.?

:Gijs (he/him)

Comment 11

•

17 years ago

Comment on attachment 267810 [details] [diff] [review]
Updated per nits

+    // We get the following character as well, since it can be a modifier.
Please move that comment to the line before the if statement.

Please file a followup bug about Mook's comment as necessary.

Otherwise, r=gijs.

Attachment #267810 - Flags: review?(gijskruitbosch+bugs) → review+

James Ross

Assignee

Updated

•

17 years ago

Blocks: 383878

Simon Montagu :smontagu

Comment 12

•

17 years ago

(In reply to comment #9)
> Hmm, charsetalias.properties says there are a few aliases for iso-2022-jp (e.g.
> csiso2022jp).  If I use /charset csiso2022jp, fromUnicode does see its
> |charset| param as "csiso2022jp" (not iso-2022-jp), but I _think_ the same
> encoder/decoder gets used.
> 
> Would a follow-up patch be needed to handle that?
> 

To handle what exactly? nsIScriptableUnicodeConverter does charset alias resolution when setting its charset, but doesn't expose the canonical name of the charset in use. You can do your own alias->canonical name conversion with GetCharsetAlias() from nsICharsetConverterManager.

James Ross

Assignee

Comment 13

•

17 years ago

We're invoking our own code for the hankaku to zenkaku conversion based on the character encoding we've been set to use; if the user specifies an alias to ISO-2022-jp, we should still invoke said code. That's all in bug 383878.

James Ross

Assignee

Comment 14

•

17 years ago

Checked in --> FIXED.

Status: ASSIGNED → RESOLVED

Closed: 17 years ago

Resolution: --- → FIXED

:Gijs (he/him)

Updated

•

17 years ago

Whiteboard: [cz-charset] → [cz-0.9.79]

:Gijs (he/him)

Comment 15

•

17 years ago

Changing multiple bugs' whiteboards should be smarter. :-(
Sorry for bugspam.

Whiteboard: [cz-0.9.79] → [cz-charset] [cz-0.9.79]

You need to log in before you can comment on or make changes to this bug.