Closed Bug 135323 Opened 22 years ago Closed 22 years ago

A return code between the two CJK characters is converted to a space code

Categories

(Core :: Internationalization, defect)

defect
Not set
normal

Tracking

()

VERIFIED FIXED

People

(Reporter: hsaito54, Assigned: shanjian)

References

Details

(Keywords: intl, Whiteboard: [adt2 rtm])

Attachments

(4 files, 10 obsolete files)

A space code between the two kanji characters is desirable to remove.

For example, this Japanese sentence
"¤¢
¤¤"
is displayed "¤¢ ¤¤" but I expect to display "¤¢¤¤" without a space.
-> Int
Assignee: Matti → yokoyama
Component: Browser-General → Internationalization
QA Contact: imajes-qa → ruixu
Please apply this patch next to apply a patch 
"patch to wrap long word by key-characters" at
"http://bugzilla.mozilla.org/attachment.cgi?id=75233&action=view" in
 "http://bugzilla.mozilla.org/show_bug.cgi?id=95067".
Attached file testcase
This bug is related to bug 95067.

I believe most Japanese people prefer that a line-break isn't rendered as space.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Hardware: Other → All
Attached patch additional patch #1 (obsolete) — Splinter Review
I made a error in my patch carelessly.
I correct errors with this patch program.
Keywords: intl
QA Contact: ruixu → ylong
saito-san:  Your patch looks simple; but I think the patch is rotten.
               Can you update the patch for the current trunk?  I'd like to
apply the patch
               to verify the fix.  Thanks
The patch was remade.
assign to shanjian.
Shanjian: can you review his code?  It's in the layout. Thanks
Assignee: yokoyama → shanjian
saito san,
I think '\n' should be removed only between CJK characters. That's say inside
function "ScanNormalWhiteSpace_F", you need to check character before and after
'\n' before setting the flag. If '\n' is the last character in this fragment, it
is OK to only check its precedence. (This will happen when text span multiple
fragment.) 

I am also wondering about this change:
-  *aContentLenResult = offset - mOffset;
+  *aContentLenResult = offset - save_offset;

Why did you use save_offset instead of mOffset, which is more updated? I assumed
you have tested the former one and it didn't work well. But I would like to know
why. 







Status: NEW → ASSIGNED
To be sure, my patch program is wrong.
Testing the following pages, it found two errors.

http://www.asahi-net.or.jp/~wq6k-yn/para.html

Every time Reload, a vertical scrill bar moves the first error.
But returning a program to a basis,

    *aContentLenResult = offset - mOffset;

the error was corrected.

A patch needs to be added to correct the next error.
The next sentence is not sometimes displayed accurately.

用途を詳しく分析した上で DTDを設計すれば、そのようなこと
はあまり起きない (かもしれない)。しかし、                 <===== error
HTMLのように一つの                                       <===== error
DTDを様々なタイプの文書に用いるのでは、この種の不都合は避  <===== error
け難い。

A variable contentLength is in the function nsTextFrame::MeasureText() at 
nsTextFrame.cpp.

When a space between Kanji characters is deleted, 1 must be subtracted from 
this variable.
Can't I write the document of Japanese?

用途を詳しく分析した上で DTDを設計すれば、そのようなこと
はあまり起きない (かもしれない)。しかし、
HTMLのように一つの
DTDを様々なタイプの文書に用いるのでは、この種の不都合は避
け難い。
>>Can't I write the document of Japanese?
What do you mean? As a test case, that's fine. But if you plan to use japanese
for communication, I am incapable to read it.  
I made a mistake.
I wrote in an Shift-JIS code and displayed it in an EUC-JP code.

> I think '\n' should be removed only between CJK characters.

I appreciate your suggestion.
I will try to investigate whether a patch can act only to the CJK characters.
The patch was remade.
It is happy if you can evaluate.
saito san,
I really appreciated your help one this issue. This part of code is complicated
and very sensitive, so please be patient with me. Please also excuse me if my
question does not make sense to you. I am learning this part of code as well. 
Several suggestions:
1, It is probably not a good idea to add Probe method to word breaker. You can
add the macro to local file directly. In different places, we may have different
definition of CJK char, so it is fine for me to define it multiply. Of cause, if
we can add widely used CJK macro to XPCOM module to avoid this, it will be nice.
But let's leave that for later.
2. Can we avoid the use of SpaceInfo? In nsTextTransformer::GetNextWord, we have
some code dealing with German Szlig character, we need to transform this one to
"SS". I assume you can do it the same way. But I am not very sure. I hope
DeleteOneChar can also be avoided.


I am sorry, there was a spelling mistake.
> Every time Reload, a vertical scrill bar moves the first error.
Every time Reload, a vertical scroll bar moves the first error.
> patch to remove a space between two CJK caracters 
patch to remove a space between two CJK characters 

> DeleteOneChar can also be avoided.

If I can also do, I do not want to use this function.
But a space will remain by the copy & paste using the mouse.
The place of correction was not able to be found.

Summary: A return code between the two kanji caracters is converted to a space code → A return code between the two CJK characters is converted to a space code
please let me know.

Although the character is stored to the buffer "mTransformBuf.mBuffer",
I recognize that this buffer is not used by the copy & paste using the mouse.
Is it right?

If right, when copying & pasting by the mouse,
where is the processing which transposes a return code to a space or 
the continuous space to one space?
I can't answer your question without looking into the code. But I guess that we
do not do any transformation for copy-n-paste. If we do collapse multiple spaces
to one in copy-n-paste, we need to do this transformation (remove space between
kanji) in the same way. 
The patch was remade again.
Please give me a review.
This patch is acceptable to me, but the changes in nsTextFrame.cpp does not seem
necessary. So could you take one more look?

The changes in nsTextFrame.cpp is required in order that a variable "totLen"
in nsPlainTextSerializer.cpp may acquire the right length of a character
sequence.

Is it better to look for other methods for a its variable acquiring the length
of a character sequence?

content/base/src/nsPlainTextSerializer.cpp:
  1513  void
  1514  nsPlainTextSerializer::Write(const nsAString& aString)
  1515  {

  1524    PRInt32 totLen = aString.Length();

An example is shown below, "xx" shows the Kanji character of one character.

"xx
yy"

The value of a variable "totLen" is 2, when a patch is not applied to
nsTextFrame.cpp.
However, the value should be 3.
Saito san, 
Sorry for not being able to response promptly. I was busy in fixing some urgent
issues. I did spend some time today and check your patch carefully. I still
could not understand why the change in nsTextFrame.cpp is necessary. Display and
copy-n-paste works fine without that change. So could you post a test case to
show the problem when that part is not there? thanks.
I appreciate for your preview and suggestions.

Please test a testcase "Created an attachment (id=77572)".
Can you copy "¤¢¤¤" using the mouse and paste it?

When a patch was not applied to nsTextFrame.cpp , "¤¢ " was pasted.
I tested by the version 0.9.8.
The latest version cannot immediately be made up, but should I do so?

P.S.
"¤¢¤¤" is "xx" and "yy" that indicates two Kanji characters.
"¤¢ " is "xx" and a white space.
I am sorry, it mistook again.
> I appreciate for your preview and suggestions.
I appreciate for your review and suggestions.
The previous message containing kanji characters was written in the code of
EUC-JP.
Attached patch patch v5, (obsolete) — Splinter Review
Saito, I finally got enough time to understand all your changes. I changed the
patch to make it more local. (I hope you don't mind my modification.) The idea
is the same. Please review my patch to see if I missed any of your points. I
appreciate you contribution, this is a complicate area and you did a good job. 
I agree with your changes, but the next program needs to change.

nsTextTransformer.cpp: old
+          if (firstChar == '\n' && 
+              offset - mOffset == 1 && 
+              fragLen > offset) 

nsTextTransformer.cpp: new
+          if (firstChar == '\n' && 
+              offset - mOffset == 1 && mOffset > 0 &&
+              fragLen > offset) 

This change is added and a test is O.K.
Attached patch patch v6 (obsolete) — Splinter Review
Attachment #77569 - Attachment is obsolete: true
Attachment #77597 - Attachment is obsolete: true
Attachment #80550 - Attachment is obsolete: true
Attachment #82001 - Attachment is obsolete: true
Attachment #82597 - Attachment is obsolete: true
Attachment #82989 - Attachment is obsolete: true
good point. I updated the patch. Now let's ask for r/sr.

rbs/attinasi, could you give r/sr?
Comment on attachment 83046 [details] [diff] [review]
patch v6

-	 i = contentLen;
+	 if (1 == wordLen && contentLen == 2 && IS_CJK_CHAR(*bp)) {
+	   i = contentLen;
+	 } else {
+	   i = contentLen;
	 }

No need to duplicate the statement.

What happens when there are _multiple_ '\n' between the CJK chars?

I am not a big fan of macros defined in several places.
You can define IS_CJK_CHAR in one place (next to IS_HIGH/LOW_SURROGATE) in
intl/unicharutil/util/nsUnicharUtils.h
Attached patch patch v7 (obsolete) — Splinter Review
Moved macro to nsUnicharUtils.h.
Attachment #83046 - Attachment is obsolete: true
> What happens when there are _multiple_ '\n' between the CJK chars?

One space is replaced, when two or more '\n' are between CJK characters, or 
when a space is included. It is the conventional processing.
When changed into patch-v6 from patch-v5, a mistake occurs.
Please check.

patch-v5:
+          if (firstChar == '\n' && 
+              offset - mOffset == 1 && 
+              fragLen > offset) 
+          {

patch-v6:
+          if (firstChar == '\n' && 
+              offset - mOffset == 1 && 
+              mOffset > 0) 
+          {

new:
+          if (firstChar == '\n' && 
+              offset - mOffset == 1 && 
+              mOffset > 0 &&
+              fragLen > offset)
+          {
Attached patch patch v8 (obsolete) — Splinter Review
Attachment #83587 - Attachment is obsolete: true
rbs/waterson, could you r/sr?
Comment on attachment 85265 [details] [diff] [review]
patch v8

I didn't understand the reply to my question about  what happens when there are
multiple '\n' and/or tabs and/or space chars in-between the CJK chars?
Two or more '\n' between CJK characters replaces one space.
Although one or more TAB and spaces between CJK characters also replace one
space, the above both processing is the conventional processing and this patch
does not bar the processing.
rbs, let me explain hideo's statement. We only remove a single return between 2
CJK characters. We do nothing to multiple return, tab, space, whatsoever. Why is
that? For tab, space, and other non-return space equivalent characters, we
believe user has a reason to put it there when it is there, so we don't want to
do anything. In case of multiple return, that's arguable. When user put a single
return between 2 CJK char, they just want to shorten line length and they expect
text flow to be continuous. We decide to only handle this situation and nothing
else. I personally believe hideo's approach is reasonable. To remove multiple
return between CJK char also make sense to me, but I didn't find it really
necessary.
Is there a real-life page with the problem? Something that has been bugging me
is why there wasn't a duplicate/precedent to this bug. (If there is one in
bugzilla, care to point it?).

What is conventional elsewhere (e.g., telegraphic text -- just a funny example)
might be irrelevant in HTML. Surely, there should be a real-life page, in this
massive web, with this bug? Is the patch not going to cause a suprise and
regress pages that have been using '\n' as end-of-word *and* end-of-line -- not
just end-of-text-line? Also, I wonder about the rendering of <pre>formatted
text, the patch might mess all that.

I applied the patch and it seemed to "fix" the testcase. I had some doubts about
the necessity of the plain-text part, but again, it might be the same doubt as
above (BTW, double-check/confirm that mInWhitespace is in sync).
Again, as a reminder, this isn't an easy area, and tip-toeing to arrive safely
is better -- although it might take longer. So bear with the picky r/sr.
By the text of usual Japanese, a space is not placed between kanji characters.
In case a long document is edited, starting a new line, and continuing and
describing a text in the suitable place of immediately after a punctuation and
kanji characters other than a punctuation is performed ordinarily.

Please see the following examples (was written in the code of EUC-JP).
If HTML describes the text of 1, like 2, although it is visible like 3, we do
not desire a text of 3.

1,
ºé¤¤¤¿¡¢ºé¤¤¤¿¡¢ºù¤¬ºé¤¤¤¿¡£

2,
ºé¤¤¤¿¡¢
ºé¤¤¤¿¡¢
ºù¤¬
ºé¤¤¤¿¡£

3,
ºé¤¤¤¿¡¢ ºé¤¤¤¿¡¢ ºù¤¬ ºé¤¤¤¿¡£

It is possible to continue writing a Japanese text without a new-line for a long
time. However, it is pain for us.

It is checking that this patch does not influence the <pre> element.
> Is there a real-life page with the problem?

It shows one example, but not a good example, since CSS is used and the space 
between kanji characters spreads, 

http://www.asahi-net.or.jp/~wq6 k-yn/para.html

> Something that has been bugging me is why there wasn't a duplicate/precedent 
to this bug.

The argument is made even now at Bugzilla-jp (http://www.mozilla.gr.jp/), and 
there is a contrary opinion about this affair.
That is, it says that it is the specification of HTML and the specification 
should be followed, and has been considered so until now.
Sorry, it is the following URL.

http://www.asahi-net.or.jp/~wq6k-yn/para.html
> It is possible to continue writing a Japanese text without a new-line for a
> long time.

Do you mean that there are words that excessively long and cannot be broken by
users as they type them? Such a word would be weird to read...

> The argument is made even now at Bugzilla-jp (http://www.mozilla.gr.jp/), and 
> there is a contrary opinion about this affair.
> That is, it says that it is the specification of HTML and the specification 
> should be followed, and has been considered so until now.

I am leaning towards the opinion of leaving things as they are myself (i.e.,
retain the HTML-spec way). This will avoid undue uncertainties with <pre>, not
to mention the (still unknown) ramifications when typing in Composer.

Plus, IE6 does what Mozilla is currently doing, and a change will not only be
non-spec compliant, but it will also deviate from what other popular browsers
are doing.
> Do you mean that there are words that excessively long and cannot be broken by
> users as they type them? Such a word would be weird to read...

CJK text doesn't have a space between words. So, if somebody hates
current Mozilla behaviour, he/she have to write a paragraph in a single line.

IE6/Win doesn't display a line-break as a space.
> CJK text doesn't have a space between words. So, if somebody hates
> current Mozilla behaviour, he/she have to write a paragraph in a single line.

But if you use the Mozilla's editor, |nsJISx4501LineBreaker| does line-breaking
nicely as you type, right?
It is different to everybody what is used for a text editor.

Can't it be interpreted as deletion being possible about the space between 
words by HTML 4.01 specifications depending on a each languege?
HTML spec:

<http://www.w3.org/TR/REC-html40/struct/text.html#h-9.1>
> This layout may involve putting space between words (called inter-word space),
> but conventions for inter-word space vary from script to script. For example,
> in Latin scripts, inter-word space is typically rendered as an ASCII space
> (&#x0020;), while in Thai it is a zero-width word separator (&#x200B;).
> In Japanese and Chinese, inter-word space is not typically rendered at all.

CSS spec:
http://www.w3.org/TR/2002/WD-css3-text-20020515/#linefeed-treatment
> Plus, IE6 does what Mozilla is currently doing, and a change will not only be
> non-spec compliant, but it will also deviate from what other popular browsers
> are doing.

The specifications described by "9.1 White space" of HTML is right.
> In Japanese and Chinese and inter-word space is not typically rendered at all.
I believe that it is left to the browser how inter-word space is rendered.

I consider that those browsers force the custom of a Latin scripts on us.
Another real life example of this problem.

    http://alltheweb.com/
Oh, this one too.

    http://www.google.com/

To see the problem, have Japanese at the top of your preferred langage list,
and search something in Japanese.
http://www.asahi-net.or.jp/~wq6k-yn/para.html

Looking at that page, it looks bad. Although that page is intentionally creating
the problem, it can happen in real pages like machine generated output (e.g.
google). &#12288;
I think we want to fix this on the trunk.
I propose to fix this in branch so that it makes to Netscape 7 release.
We shouldn't ignore 2nd and 3rd mostly used language on the internet.
Taka, you can nominate this as nsbeta1 if you think this is important.
Please provide some info. 
1) How frequent does the user encounter the page with the problem.
2) How bad the problem can be? Please provide a screen shot.
Naoki,

It's not a matter of how often users encounter this.  As far as we state that
Mozilla is HTML 4.0 compliant browser and it supports Chinese and Japanese,
the document in comment 53, which is error free HTML 4.0 Strict document, 
must be rendered as native language speakers expect.

Please see the screenshot and judge how bad we are doing with error free
HTML document.

This issue has been around since Netscape 4.0 and discussed on various ML, 
BBS, Newsgroup, etc.  I don't want to see people blaming on Mozilla/Netscape
about this issue anymore.

Nominating to nsbeta1.
Keywords: nsbeta1
Attached patch patch v9 (obsolete) — Splinter Review
Exclude Korean from patch v8.
Attachment #85265 - Attachment is obsolete: true
Comment on attachment 88737 [details] [diff] [review]
patch v9

r=shanjian
Attachment #88737 - Flags: review+
I just put r=shanjian since this patch is originally proposed by saito, and new
one is touched by taka. 

chris waterson, could you please sr?
rbs, feel free to express your objection if you still have some. 
can we reduce this patch by take out the non essential change in
nsJISx4501LineBreaker.cpp
and +#define IS_CJK_CHAR(u)  in nsUnicharUtils.h

[adt2] since this a strong recommendation/complains from Japanese mozilla
community. 
Keywords: nsbeta1nsbeta1+
Whiteboard: [adt2]
Blocks: 141008
Attached patch patch v10Splinter Review
don't touch nsJISx4501LineBreaker.cpp.
Attachment #88737 - Attachment is obsolete: true
After we added the macro to nsUnicharUtils.h, it does not make sense to keep
another copy in nsJISx4501LineBreaker.cpp. Since those are identical macro,
there should be no risk. I would suggest to use v9 instead of v10. 



Actually, IS_CJK_CHAR(u) is not used in this patch anymore because we have to 
exclude Korean from special case handling.  I added IS_CJ_CHAR() in 
nsUnicharUtils.h and that's the macro we are using.
Comment on attachment 89113 [details] [diff] [review]
patch v10

I am Ok with v10.
Attachment #89113 - Flags: review+
Comment on attachment 89113 [details] [diff] [review]
patch v10

sr=waterson
Attachment #89113 - Flags: superreview+
checked into trunk.

/cvsroot/mozilla/layout/html/base/src/nsTextFrame.cpp,v  <--  nsTextFrame.cpp
new revision: 1.374; previous revision: 1.373
/cvsroot/mozilla/layout/html/base/src/nsTextTransformer.cpp,v  <--  nsTextTransf
ormer.cpp
new revision: 1.69; previous revision: 1.68
/cvsroot/mozilla/content/base/src/nsPlainTextSerializer.cpp,v  <--  nsPlainTextS
erializer.cpp
new revision: 1.60; previous revision: 1.59
/cvsroot/mozilla/intl/unicharutil/util/nsUnicharUtils.h,v  <--  nsUnicharUtils.h

new revision: 1.15; previous revision: 1.14
I talked to naoki and we double checked the code. The patch will only affect
chinese and japanese characters. For characters of other languages, the behavior
should be the same. For verification testing, we can also use any non-chinese
and non-japanese website to verify this. Since '\n' is used in many long peice
of text.
If there is something, it shouldn't be difficult to identify. 
For chinese/japanese verification, see previous comment for testcases. Be sure
to test copy-n-paste. 
On 07-02 trunk build:

1. It works fine when a return code between 2 Japanese text characters. 
However, when there is a return code HTML tag mixed together with character(s),
there still like "an extra space" shows in browser window.
http://www.asahi-net.or.jp/~wq6k-yn/para.html

2. When the HTML source has space(s) between return code, like page:
http://www.php.net/manual/ja/function.setcookie.php or:
http://www.vinelinux.org/
There is a single byte space between 2 characters. 
I don't if it is OK, but on Netscape I can highlight the space between
characters while on IE I can not.

3. No Copy/Paste regression was found right now.

4. I don't find any regression with English or other languages so far. 
Notice:
For the display in browser, there is no different with this case before or
after the fix.

When I copy/paste the string from Broswer window to Composer, the space will be
shorter.
Taka, the first two issues in Comment #70 were they intended to be fixed by the
last patch? If not, we can verify this as fixed.
>Taka, the first two issues in Comment #70 were they intended to be fixed by the
>last patch? If not, we can verify this as fixed.
Let me ask this to the original author of the patch.
Saito san, could you answer my last comment?  Is it okay for mozilla community
in Japan about the current level of the support (i.e. remaining problems are not
so important)?
There are about three future examination matters.
A new-line replaces a space, without being deleted in the following examples.

(1) a new-line is between a Kanji character and an English character.
KK
HTML
KK

(2) the Kanji character is inserted into HTML tag 
KK
<a>KK</a>
KK

(3) there is a space following a new-line
KK
    KK

KK is the Kanji character of one character.
The user group of Bugzilla-jp examines whether deletion of a space is required.
Okay, please file a separate bug for those problems. This bug can be marked as
fixed. I think the current fix mostly covers exisiting problems. Nominate for
adt1.0.1
Keywords: adt1.0.1
Whiteboard: [adt2] → [adt2 rtm]
adding adt1.0.1- per ADT.  Let's get this in the next release.
> re: comment #59
>
> rbs, feel free to express your objection if you still have some.

I have been travelling, and was out of bugzilla. I don't like the fix very much
(especially that it doesn't cover everything and it is short-sighted for the
remaining parts), but I don't know what else to suggest -- except perhaps the
following: would that possible to transform the space (when in the CJ context)
into a zero-width space (ZWSP) to make it disappear in a natural/visual way?
Motivation: avoid complications in nsTextFrame, if possible.
What rbs suggested is a interesting idea. But I am not sure how much it can do
to simplify the code.
I will mark this bug as fixed. File another bug anybody like to explore that idea.
Status: ASSIGNED → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
Mark as verified for this one.
Status: RESOLVED → VERIFIED
Bug 156369 is filed for remaining problems.
Adding adt1.0.1- per the adt.  Let's try to get it into the next release.
Keywords: adt1.0.1adt1.0.1-
I'm not very familiar with the Mozilla source but if the macro IS_CJK_CHAR or
IS_CJ_CHAR in intl/unicharutil/util/nsUnicharUtils.h works like 
IS_HIGH/LOW_SURROGATE, won't Chinese and Japanese characters composed of two
surrogates be missed and still cause this bug?


You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: