Closed Bug 72468 Opened 23 years ago Closed 23 years ago

ISO-2022-JP-2 charset decoder

Tracking

()

Status:

VERIFIED FIXED

Milestone:

mozilla0.9.2

People

(Reporter: watanabe, Assigned: nhottanscp)

References

(
URL
)

Details

(Keywords: intl, Whiteboard: got 'r', 'sr', need updated patch before check in)

Attachments

(9 files)

This is a patch file under directory mozilla/intl/ 23 years ago watanabe 69.93 KB, patch		Details \| Diff \| Splinter Review
This is a patch file under directory mozilla/intl/ 23 years ago watanabe 69.93 KB, patch		Details \| Diff \| Splinter Review
I will resend a patch. 23 years ago watanabe 168.06 KB, patch		Details \| Diff \| Splinter Review
I've succeeded in sending a patch. This is an example HTML of ISO-2022-JP-2. 23 years ago watanabe 348 bytes, text/html; charset=iso-2022-jp		Details
This is a new patch for the original tree. 23 years ago watanabe 14.93 KB, patch		Details \| Diff \| Splinter Review
This is a new patch for the original tree. 23 years ago watanabe 10.77 KB, patch		Details \| Diff \| Splinter Review
This is a new patch using enum for the original tree. 23 years ago watanabe 15.34 KB, patch		Details \| Diff \| Splinter Review
This is a new patch not using aCharset for the original tree. 23 years ago watanabe 15.36 KB, patch		Details \| Diff \| Splinter Review
This is a new patch with more comments for the original tree. 23 years ago watanabe 16.54 KB, patch		Details \| Diff \| Splinter Review

watanabe

Reporter

Description

•

23 years ago

I want to add a ISO-2022-JP-2 charset decoder to Mozilla.

ISO-2022-JP-2 is one of IANA registered charset, that is an extension
of ISO-2022-JP.  ISO-2022-JP-2 is not only a Japanese character set, but
an united CJK character set.  However, ISO-2022-JP-2 is not compatible
with ISO-2022-CN or ISO-2022-KR, and is full upper compatible charset of
ISO-2022-JP.  Therefore, it may be natural that we regard ISO-2022-JP-2
as one of Japanese character set.

Since ISO-2022-JP-2 is very similar to ISO-2022-JP, it is difficult for
us to do auto detect between these two.  Therefore, it may be good
way that we won't distinguish them each other at the time of decoding,
and that we will reconstruct a ISO-2022-JP decoder to be a ISO-2022-JP-2
decoder.  Because it is full upper compatible charset of ISO-2022-JP.

According to this plan, I have made a patch.  I will send it as a next
message.  This patch will make two new files which do not exist in the
CVS repository.  I think it is useful for us to be able to read
ISO-2022-JP-2 by Mozilla.  However, I think it is not so important for us
to be able to write ISO-2022-JP-2 by Mozilla.  This patch contains only a
decoder.

Would you review and adopt it.

watanabe

Reporter

Comment 1

•

23 years ago

Attached patch This is a patch file under directory mozilla/intl/ — Details — Splinter Review

watanabe

Reporter

Comment 2

•

23 years ago

Attached patch This is a patch file under directory mozilla/intl/ — Details — Splinter Review

watanabe

Reporter

Comment 3

•

23 years ago

I'm sorry.  Both two files I sent are broken.  The right file has 172096 bytes.
"time out" error has occured during sending.  I will resend after a while.

nhottanscp

Assignee

Comment 4

•

23 years ago

From RFC 1554,

                              94 character sets
      reg#  character set      ESC sequence                designated to
      ------------------------------------------------------------------
      6     ASCII              ESC 2/8 4/2      ESC ( B    G0
      42    JIS X 0208-1978    ESC 2/4 4/0      ESC $ @    G0
      87    JIS X 0208-1983    ESC 2/4 4/2      ESC $ B    G0
      14    JIS X 0201-Roman   ESC 2/8 4/10     ESC ( J    G0
      58    GB2312-1980        ESC 2/4 4/1      ESC $ A    G0
      149   KSC5601-1987       ESC 2/4 2/8 4/3  ESC $ ( C  G0
      159   JIS X 0212-1990    ESC 2/4 2/8 4/4  ESC $ ( D  G0

                              96 character sets
      reg#  character set      ESC sequence                designated to
      ------------------------------------------------------------------
      100   ISO8859-1          ESC 2/14 4/1     ESC . A    G2
      126   ISO8859-7(Greek)   ESC 2/14 4/6     ESC . F    G2

nhottanscp

Assignee

Comment 5

•

23 years ago

I have a couple of questions.

* Are there mail user agents or HTML composers which send/write out this
characters set? I think the user can use UTF-8 instead.

* If we do not support encoder, reply/forward inline message going to have
problem. So why viewing only has to be supported?

Status: UNCONFIRMED → NEW

Ever confirmed: true

watanabe

Reporter

Comment 6

•

23 years ago

Attached patch I will resend a patch. — Details — Splinter Review

watanabe

Reporter

Comment 7

•

23 years ago

Attached file I've succeeded in sending a patch. This is an example HTML of ISO-2022-JP-2. — Details

Katsuhiko Momoi

Updated

•

23 years ago

URL: http://www.cis.ohio-state.edu/cgi-bin...

Katsuhiko Momoi

Comment 8

•

23 years ago

I don't think we want to support creation of ISO-2022-JP-2
documents or messages. 
watanabe-san, are there existing documents using ISO-2022-JP-2?
For Japanese, the only thing this charset adds is 
JIS X 0212. But EUC-JP can cover that character set. So can
ISO-2022-JP-1. I don't think people would be using this
charset to encode Simplified Chinese or Korean. 

So the question is do we need this charset even for decoding?

Andreas Becker

Comment 9

•

23 years ago

Changing QA Contact to ylong@netscape.com.

QA Contact: andreasb → ylong

watanabe

Reporter

Comment 10

•

23 years ago

My answers are ...

* Mule (Multilingual Emacs) can read/write ISO-2022-JP-2, and that is
default of Mule for mixing JIS X0208, JIS X0212, GB2312 and KSC5601
characters.  W3.el for Mule can browse ISO-2022-JP-2 HTML.   I like
ISO-2022-JP-2 more than UTF-8, and sometimes write in this charset
because UTF-8 cannot save a difference of script in the same file
(eg. JIS X0208 character or GB2312 character) when mapped to the
same UNICODE.

* Well, Mozilla's ISO-2022-JP decoder is now supporting JIS X0212
characters, and ISO-2022-JP encoder is not supporting JIS X0212.  That
is the same case.  JIS X0212 is not a part of ISO-2022-JP but a part
of ISO-2022-JP-2.
  I think it is right that ISO-2022-JP encoder won't support JIS X0212,
because the system which only support ISO-2022-JP might confuse if
Mozilla send JIS X0212 as ISO-2022-JP; sender should be strict for the
standard and receiver should be tolerant.  If we will support
ISO-2022-JP-2 encoder (I don't think we need so), we must separate it
from ISO-2022-JP encoder, even if we will use the same function for both
of decoding.
  I think it is useful that Mozilla can read (browse), even if it cannot
write.

Katsuhiko Momoi

Comment 11

•

23 years ago

> ISO-2022-JP-2 more than UTF-8, and sometimes write in this charset
> because UTF-8 cannot save a difference of script in the same file
> (eg. JIS X0208 character or GB2312 character) when mapped to the
> same UNICODE.

I think the number of such characters are fairly small. 
The recent Unicode 3.0 has made more characters available
for CJK distinctions. Are there really some **important**
characters that Unicode 3.0 still cannot distinguish?
If so, can you give us some example characters that
must be distinguished between JIS X 0212 and GB2312 but
not so in Unicode 3.0?

Frank Tang

Comment 12

•

23 years ago

I think we should add ISO-2022-JP-2, but I think this implementation will simply 
blow the binary size. We already have the conversion table in ucvcn, ucvtw and 
ucvko. The right thing to do is to add a converter to call other converters 
instead of duplicating the table in ucvja. 
I do not agree with the implementation here.

watanabe

Reporter

Comment 13

•

23 years ago

> The recent Unicode 3.0 has made more characters available
> for CJK distinctions. Are there really some **important**
> characters that Unicode 3.0 still cannot distinguish?

  Well... I wrote on a difference of scripts (eg. Japanese script or
Chinese script), not difference of characters.  Anyway, I think that
the question whether we should append ISO-2022-JP-2 decoder don't depend
on how the Unicode is.


> I think we should add ISO-2022-JP-2, but I think this implementation
> will simply blow the binary size. We already have the conversion table
> in ucvcn, ucvtw and ucvko.

  I have remade a patch, that don't have any additional table.

watanabe

Reporter

Comment 14

•

23 years ago

Attached patch This is a new patch for the original tree. — Details — Splinter Review

nhottanscp

Assignee

Comment 15

•

23 years ago

Frank, could you review the patch?

The patch adds ISO-2022-JP-2 as an alias of ISO-2022-JP. ISO-2022-JP-2 will not
appear on the charset menu. So the user cannot save as or send as ISO-2022-JP-2.
When the message is replied or forwarded as inline, ISO-2022-JP encoder will be
used. Some characters may not be converted and will turn to '?' for plain text
or NCR for HTML mail.
Watanabe san, is my expectation above correct?

BTW, anybody know about JIS 0213 support (OS, charset)?

watanabe

Reporter

Comment 16

•

23 years ago

> Watanabe san, is my expectation above correct?

  Almost yes, thank you.  If we browse a ISO-2022-JP-2 file and select
"Save As ...", the file is saved as the original ISO-2022-JP-2 (encoding
is not changed).
  If we save a ISO-2022-JP-2 file after editing, some characters which
are not members of ISO-2022-JP will be converted to numeric character
references such as "&#20320;".


> BTW, anybody know about JIS 0213 support (OS, charset)?

  We can read/write it with Emacs-20.6 (or over) + X0213 patch + free
fonts.
  charset name may be not registered to IANA yet, but already registered
to ISO-2022 ("ESC $ ( O" and "ESC $ ( P").

nhottanscp

Assignee

Comment 17

•

23 years ago

One more question, is there a mapping table for JIS 0213 <-> Unicode?

watanabe

Reporter

Comment 18

•

23 years ago

> One more question, is there a mapping table for JIS 0213 <-> Unicode?

The JIS X0213 code table is in

   http://www.jca.apc.org/%7Eearthian/aozora/0213/jisx0213code.zip

This table contains Japanese characters encoded by Shift_JIS.  Some
X0213 characters are not members of the Unicode.  The graphical shape
tables are in

   http://www.itscj.ipsj.or.jp/ISO-IR/228.pdf
   http://www.itscj.ipsj.or.jp/ISO-IR/229.pdf

These are ISO-2022 registry.

nhottanscp

Assignee

Comment 19

•

23 years ago

Thanks for the information. I filed a tracking bug 73035 for JIS 0213.

nhottanscp

Assignee

Updated

•

23 years ago

Whiteboard: patch need review by ftang

Frank Tang

Comment 20

•

23 years ago

I don't think we should change
intl/chardet/src/nsISO2022JPVerifier.h
and 
intl/chardet/tools/geniso2022jp.pl

intl/uconv/src/charsetalias.properties change is ok
intl/uconv/ucvja/nsJapaneseToUnicode.h change is ok
intl/uconv/ucvja/nsJapaneseToUnicode.cpp need to be change: We should not return 
error when we cannot create a delegate converter, instead, we should convert to 
0xfffd

watanabe

Reporter

Comment 21

•

23 years ago

> I don't think we should change
> intl/chardet/src/nsISO2022JPVerifier.h
> and 
> intl/chardet/tools/geniso2022jp.pl

> intl/uconv/ucvja/nsJapaneseToUnicode.cpp need to be change: We should
> not return error when we cannot create a delegate converter, instead,
> we should convert to 0xfffd

  OK.  I will resend a patch.

watanabe

Reporter

Comment 22

•

23 years ago

Attached patch This is a new patch for the original tree. — Details — Splinter Review

Frank Tang

Comment 23

•

23 years ago

r=ftang

Frank Tang

Comment 24

•

23 years ago

nhotta: please change 
                  if (NS_FAILED(rv)) mEUCKRDecoder = NULL;
to
                 if (NS_FAILED(rv)) 
						mEUCKRDecoder = nsnull;
for all similar code. make sure you change line after if and use nsnull instead 
of NULL

Christopher Blizzard (:blizzard)

Comment 25

•

23 years ago

Just a general comment about the code in that file.  Isn't it possible to use
enums or defines for all of those switch and case statements?  You have comments
after every one of them.  It might be better to use juse something that is
self-describing.

Brendan Eich [:brendan]

Comment 26

•

23 years ago

I'm baffled by all the magic numbers for mState -- please declare an enum.

Could use nsCOMPtr<nsIUnicodeDecoder> for these, and save yourself the manual
ref-counting headaches:

+ nsIUnicodeDecoder *mGB2312Decoder;
+ nsIUnicodeDecoder *mEUCKRDecoder;
+ nsIUnicodeDecoder *mISO88597Decoder;

ftang asked for a new line for each "then" statement governed by if
(NS_FAILED(rv)), after trying to create the decoder -- I'll go for that as a
style improvement, but I wonder why you need to null the out parameter.  Is the
ccm->GetUnicodeDecoder(&aCharset, &mGB2312Decoder) call, for example, storing
garbage in mGB2312Decoder on failure?  It shouldn't be.  This is important for
nsCOMPtr usage.

/be

watanabe

Reporter

Comment 27

•

23 years ago

Attached patch This is a new patch using enum for the original tree. — Details — Splinter Review

nhottanscp

Assignee

Comment 28

•

23 years ago

I am looking at the patch because Frank is busy.
I have some questions.
* What those two states are for, mState_ESC_2e, mState_ESC_4?
* If creation of a decoder fails, it tries again for every character of that 
kind. This would be slow but would that be acceptable?
* Local variable aCharset may be confused as 'a' for argument. Please use other 
name.
* For case mState_JISX0212_1990:, the old code added 3 to mState, the new code 
always set a constant. Is this intended?
             } else {
               mData = fbIdx[*src & 0x7F];
-              mState = (0xFFFD == mData) ? 13 : (mState+3); // 10, 11, or 12
+              mState = (0xFFFD == mData) ? mState_ERROR
+                : mState_JISX0212_1990_2ndbyte;
             }

Status: NEW → ASSIGNED

Target Milestone: --- → mozilla0.9.1

watanabe

Reporter

Comment 29

•

23 years ago

> * What those two states are for, mState_ESC_2e, mState_ESC_4?

  2e of mState_ESC_2e means hexadecimal character code, that is "ESC ."
and is a part of the sequence "ESC . A" or "ESC . F".  mState_ESC_4e means
"ESC N", that is the sequence of single shift.

> * If creation of a decoder fails, it tries again for every character of that
> kind. This would be slow but would that be acceptable?

  This problem will never occur on JIS characters.  So, that is rare case
to be acceptable, but it may be better to rewrite.

> * Local variable aCharset may be confused as 'a' for argument. Please use
> other name.

  OK.

> * For case mState_JISX0212_1990:, the old code added 3 to mState, the new
> code always set a constant. Is this intended?

  Oh, I'm sorry.  I mistaked.

Andreas Becker

Updated

•

23 years ago

Keywords: intl

nhottanscp

Assignee

Comment 30

•

23 years ago

r=nhotta

Please update the patch for the last two issues, thanks.

watanabe

Reporter

Comment 31

•

23 years ago

>> * For case mState_JISX0212_1990:, the old code added 3 to mState, the new
>> code always set a constant. Is this intended?
>  Oh, I'm sorry.  I mistaked

  Oh, this is not a mistake.

  I separated the case 7, 8 and 9 to the case mState_JISX0208_1978,
mState_JISX0208_1983 and mState_JISX0212_1990.

[[old example]]

case 7:
case 8:
case 9:
  .......
  mState = mState + 3;

[[new example]]

case mState_JISX0208_1978:
  .......
  mState = mState_JISX0208_1978_2ndbyte;

case mState_JISX0208_1983:
  .......
  mState = mState_JISX0208_1983_2ndbyte;

case mState_JISX0212_1990:
  .......
  mState = mState_JISX0212_1990_2ndbyte;


  However, "diff" command outputs the following.

[[diff output]]

+case mState_JISX0208_1978:
+  .......
+  mState = mState_JISX0208_1978_2ndbyte;
+
+case mState_JISX0208_1983:
+  .......
+  mState = mState_JISX0208_1983_2ndbyte;

-case 7:
-case 8:
-case 9:
+case mState_JISX0212_1990:
   .......
-  mState = mState + 3;
+  mState = mState_JISX0212_1990_2ndbyte;


  And you discussed the last two lines.  But, this is the only "diff"
command magic.  I didn't change mState+3 to mState_JISX0212_1990_2ndbyte,
but I separated the cases.

nhottanscp

Assignee

Comment 32

•

23 years ago

Okay, so you want to create a new patch for the 'aCharset' issue before asking
for a super-review?
Also if you agree with brendan's comment on 2001-04-17 11:41 about nsCOMPtr then
change that part too, thanks.

nhottanscp

Assignee

Updated

•

23 years ago

Whiteboard: patch need review by ftang

watanabe

Reporter

Comment 33

•

23 years ago

> Okay, so you want to create a new patch for the 'aCharset' issue before
> asking for a super-review?

  OK.

> Also if you agree with brendan's comment on 2001-04-17 11:41 about nsCOMPtr
> then change that part too, thanks.

  If it makes our code simple, I agree with it of course.  However,
I don't know the nsCOMPtr<nsIUnicodeDecoder> interface so much.  Do we
use do_GetService()?  ...Well, I leave it now.

watanabe

Reporter

Comment 34

•

23 years ago

Attached patch This is a new patch not using aCharset for the original tree. — Details — Splinter Review

nhottanscp

Assignee

Comment 35

•

23 years ago

asked blizzard for super review

nhottanscp

Assignee

Comment 36

•

23 years ago

Here is a document about nsCOMPtr.
http://www.mozilla.org/projects/xpcom/nsCOMPtr.html

Christopher Blizzard (:blizzard)

Comment 37

•

23 years ago

+              mData = fbIdx[*src & 0x7F];
+              mState = (0xFFFD == mData) ? mState_ERROR
+                : mState_JISX0208_1978_2ndbyte;

I'm happy that you're moving all of these state variables to human readable
enums ( yay! ) but can you add some comments to code like this, please?  What is
it doing?

+              if (!mGB2312Decoder) {// failed creating a delegate converter
+                *dest++ = 0xFFFD;

Same thing.  In fact, anywhere you are twiddling bits can you please comment?

The code looks fine other than that.

Slap some comments in the code and you can have an sr=blizzard

nhottanscp

Assignee

Updated

•

23 years ago

Whiteboard: got 'r', 'sr', need updated patch before check in

Marek Z. Jeziorek

Comment 38

•

23 years ago

TM to 0.9.2 per PDT triage (it's OK to check it in by Friday or after 0.9.1
branch is made).

Target Milestone: mozilla0.9.1 → mozilla0.9.2

watanabe

Reporter

Comment 39

•

23 years ago

Attached patch This is a new patch with more comments for the original tree. — Details — Splinter Review

nhottanscp

Assignee

Comment 40

•

23 years ago

Checked in, thank you for your contribution.

Status: ASSIGNED → RESOLVED

Closed: 23 years ago

Resolution: --- → FIXED

Teruko Kobayashi

Comment 41

•

23 years ago

I changed QA contact to nhotta@netscape.com since this is code change.
Hotta san, please mark this as verified.

QA Contact: ylong → nhotta

nhottanscp

Assignee

Comment 42

•

23 years ago

Please use test data attached in this bug.
http://bugzilla.mozilla.org/showattachment.cgi?attach_id=28130

nhottanscp

Assignee

Updated

•

23 years ago

QA Contact: nhotta → teruko

Teruko Kobayashi

Comment 43

•

23 years ago

Verified as fixed in 2001-05-24-04 Win32, 2001-05-24-08 Mac, and 2001-05-23-09 
LInux build.

Status: RESOLVED → VERIFIED

Simon Montagu :smontagu

Updated

•

17 years ago

Attachment #28130 - Attachment mime type: text/html → text/html; charset=iso-2022-jp

You need to log in before you can comment on or make changes to this bug.