Closed
Bug 27376
Opened 25 years ago
Closed 25 years ago
U+2026 becomes &hellip;
Categories
(Core :: Internationalization, defect, P3)
Tracking
()
VERIFIED
FIXED
M15
People
(Reporter: hobbit_mak, Assigned: nhottanscp)
Attachments
(2 files)
patch, 1.36 KB
patch, 1.30 KB
In making utf-8 page in Japanese environment, U+2026 becomes &hellip;.
Both are displayed the same, but HTML 4.01 section 5.3 says
>A given character encoding may not be able to express all characters of the
>document character set. For such encodings, or when hardware or software
>configurations do not allow users to input some document characters directly,
>authors may use SGML character references.
So it is better to output the proper character code instead of the character reference &hellip;.
Comment 1•25 years ago
Sounds like either a layout rendering problem, or an I18N problem.
Assignee: beard → ftang
Component: Compositor → Internationalization
QA Contact: petersen → teruko
Comment 2•25 years ago
What do you mean by "In making utf-8 page in Japanese environment"?
Do you mean using Composer and saving the page as UTF-8?
This seems to be an effect of the nsIEntityConverter work nhotta did. Reassigning to nhotta.
Naoki, please get agreement on what we should do with this bug before you
change it. Should we have a pref to control this?
Assignee: ftang → nhotta
Comment 3•25 years ago (Reporter)
When I make a page with Composer and save it in UTF-8 encoding,
the JIS 01-36 character becomes &hellip;.
It should be U+2026.
Comment 4•25 years ago
This seems to be a general problem and we should try to fix this
very soon. For example, in addition to Shift_JIS 0x81 0x63 (horizontal
ellipsis) being mapped to &hellip;, we also map Shift_JIS 0x81 0x64
(two dot leader) to &uml; -- this is simply wrong. This should
map to \u2025.
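The two mappings named in this comment can be checked independently. The following is a minimal sketch that uses Python's own Shift_JIS codec and HTML entity table as a stand-in; it says nothing about Mozilla's converter tables, and the variable names are only illustrative.

```python
from html import unescape

# Decode the two Shift_JIS byte pairs quoted in the comment
# (Python's codec tables, assumed here as an independent reference).
ellipsis = bytes([0x81, 0x63]).decode("shift_jis")
two_dot = bytes([0x81, 0x64]).decode("shift_jis")
print(f"0x81 0x63 -> U+{ord(ellipsis):04X}")
print(f"0x81 0x64 -> U+{ord(two_dot):04X}")

# Compare with the characters the named entities stand for.
print("&hellip; is the same character:", unescape("&hellip;") == ellipsis)
print("&uml; is the same character:", unescape("&uml;") == two_dot)
```

If the second comparison prints False, that matches the complaint here: the entity chosen for the two dot leader names a different character, so the mapping is wrong regardless of whether entities should be emitted at all.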
Comment 6•25 years ago (Assignee)
There was a related issue before about sending character references in mail:
bug 20062, "Send messages with non-encoded NBSPs".
The interface has options so the references can be created before or after the
charset conversion, and I think we want to make that controllable by a pref. Whether
it gets UI, and which behavior should be the default, still needs discussion.
Copied the cc list from 20062.
As described, this is not a bug: "&hellip;" is equivalent and maps to U+2026.
Either one should be acceptable.
Is this bug report a request for enhancement (RFE), to be able to output
numeric character references instead of named entities?
Or is the bug that we are generating "&hellip;" for JIS instead of generating
the JIS values 0x81 0x63?
I just tried sending myself Japanese email with an ellipsis and it did generate
"&hellip;" instead of the JIS value for ellipsis. Composer probably does
the same thing. Maybe this should be logged as a separate bug? Here is the
source of that email:
Return-Path: <bobj@netscape.com>
Received: from netscape.com ([208.12.37.163]) by dredd.mcom.com
(Netscape Messaging Server 4.1 Aug 9 1999 18:28:31) with ESMTP
id FQAS7F00.F2U for <bobj@netscape.com>; Mon, 21 Feb 2000 12:42:51
-0800
Message-ID: <38B1A06F.9050204@netscape.com>
Date: Mon, 21 Feb 2000 12:30:39 -0800
From: bobj@netscape.com (Bob Jung)
User-Agent: Netscape 5.0
X-Accept-Language: en
MIME-Version: 1.0
To: bobj <bobj@netscape.com>
Subject: ellipsis-ja
Content-Type: text/html; charset=ISO-2022-JP
Content-Transfer-Encoding: 7bit
<html><head></head>
<body>ellipsis: &hellip;</body>
</html>
Comment 8•25 years ago
So I should log "Shift_JIS 0x81 0x64 (two dot leader) to &uml;"
as a separate bug? This is simply a wrong mapping even with
the HTML entity. "Two dot leader" is not the same as "umlaut".
Comment 10•25 years ago
I tried using Composer to create an EUC-JP document with an ellipsis. And
as I speculated in my earlier comment, it generates the named entity instead
of the actual code point. Here is the source created by Composer:
<html><head>
<meta http-equiv="Content-Type" content="text/html;charset=EUC-JP">
<title>ellipsis</title></head><body>ellipsis: &hellip;</body>
</html>
Comment 11•25 years ago
I thought that we used to send HTML entities only for ISO-8859-1
msgs as a backward compatibility measure and that we would send
8-bit values in all other encodings if they support the codepoint
in question.
Should not the default be to send 8-bit characters except for
Latin 1 (as backward compatibility) or when the encoding does not
support that codepoint?
Comment 12•25 years ago
I really think that generating &hellip; when the encoding in
question supports that character is a misuse of character
entity references. I for one don't want to see CERs in my
Japanese mail or document when editing the source.
I quote from W3C document on this question at:
http://www.w3.org/TR/html4/charset.html#entities
"5.3 Character references
A given character encoding may not be able to express all characters of
the document character set. For such encodings, or when hardware or
software configurations do not allow users to input some document characters
directly, authors may use SGML character references. Character references
are a character encoding-independent mechanism for entering any
character from the document character set."
CERs are not meant to be inserted just because they can be used.
More judicious use is what I would like to recommend. In that sense
this bug is valid. Neither UTF-8 nor any of the Japanese encodings
needs &hellip; to represent the Horizontal Ellipsis character.
By the way, 4.x does not generate &hellip; for this character
under Japanese or UTF-8 encodings.
Comment 13•25 years ago (Assignee)
&hellip; cannot be interpreted by 4.x browsers.
Using UTF-8 (without entities) avoids the problem that the level of entity
support differs between browsers.
Whether html32 or html40 entities are generated can be controlled by an option
of the entity converter (currently html40 is used). In addition to the other pref
I mentioned before (use entities only for characters that cannot be mapped to a code
point of the encoding), we may control this by a pref.
Comment 14•25 years ago
An option is OK but the default should be: not using the CERs unless the characters in
question are not supported by the chosen encoding.
Comment 15•25 years ago
To try and summarize the discussion up until now...
For UTF8 (or any Unicode encoding):
It is better to not generate any entities and instead we should just
use the raw Unicode values.
For ISO-8859-1:
Do we continue 4.x behavior and generate certain set of entities?
For other encodings:
Do not generate entities if there is a corresponding codepoint in that
particular encoding.
We can add a pref to enable entity generation, but the default should be off.
Would that pref generate entities wherever possible or some defined subset?
Are there several proposed prefs?
Comment 16•25 years ago (Assignee)
I was not thinking about different behavior depending on the output charset.
If that's really needed, please add your comment.
We can have two prefs.
1) Specify the version of html entities (default to html32).
2) Do not generate entities if there is a corresponding codepoint in that
particular encoding (default is ON).
These prefs would apply to both HTML save and mail send. If a separate pref is needed for
mail send only, please add your comment.
Comment 17•25 years ago
The non-ISO-8859-1 cases are really the same case -- only replace with
entities that cannot be represented in the encoding. (Unicode is special
because we could optimize the implementation and not check for entity
replacement.)
But do we want to treat ISO-8859-1 specially and provide backwards compatibility?
Comment 18•25 years ago (Assignee)
If option 2) is ON, entities are generated only as a fallback when the
conversion fails (the character cannot be mapped). Since no conversion failure is
expected for UTF-8, no fallback is executed (so it is already optimized, sort
of).
ISO-8859-1 backwards compatibility means HTML source-level compatibility, not
the browser's display, correct?
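The fallback idea described here (emit an entity only when the charset conversion fails for a character) can be sketched with a custom encoding error handler. This is an illustration in Python, not the nsIEntityConverter code; the handler name `entity_fallback` and the function `encode_with_fallback` are made up for this example.

```python
import codecs
from html.entities import codepoint2name

def _entity_fallback(err):
    """Replace only the characters the target charset cannot encode."""
    if not isinstance(err, UnicodeEncodeError):
        raise err
    refs = []
    for ch in err.object[err.start:err.end]:
        name = codepoint2name.get(ord(ch))
        # Named entity when HTML defines one, numeric reference otherwise.
        refs.append(f"&{name};" if name else f"&#{ord(ch)};")
    return "".join(refs), err.end

# Hypothetical handler name, registered once at import time.
codecs.register_error("entity_fallback", _entity_fallback)

def encode_with_fallback(text, charset):
    """Charset conversion first; entities only where conversion fails."""
    return text.encode(charset, errors="entity_fallback")

print(encode_with_fallback("\u2026", "utf-8"))       # raw UTF-8 bytes, no entity
print(encode_with_fallback("\u2026", "euc_jp"))      # raw EUC-JP bytes, no entity
print(encode_with_fallback("\u00c0", "iso2022_jp"))  # not in the charset: entity
```

Because the handler runs only on a conversion failure, UTF-8 output never reaches it, which is the "already optimized, sort of" point above.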
Comment 19•25 years ago (Assignee)
Regarding ISO-8859-1 backwards compatibility, we discussed it before in bug
20062, so I will just keep the current behavior.
We need to change the client code of the interface (HTML content sink and
message compose). For ISO-8859-1, convert to entities before the charset conversion;
for other charsets, entities are used only as fallbacks.
Here is a list of expected results after the change.
             \u00A0    \u00C0    \u2026
ISO-8859-1   &nbsp;    &Agrave;  &hellip;
ISO-2022-JP  &nbsp;    &Agrave;  0x2144
UTF-8        0xC2A0    0xC380    0xE280A6
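The expected results above can be reproduced with a small sketch. It assumes that for ISO-8859-1 every non-ASCII character is turned into a reference before conversion (the table suggests this, but the exact set is not spelled out in this bug), and that other charsets get a reference only for characters they cannot encode. `char_ref` and `serialize` are hypothetical names, and Python's codecs stand in for the real converters.

```python
from html.entities import codepoint2name

def char_ref(ch):
    """Named entity when HTML defines one, otherwise a numeric reference."""
    name = codepoint2name.get(ord(ch))
    return f"&{name};" if name else f"&#{ord(ch)};"

def serialize(text, charset):
    """Hypothetical stand-in for the save/send path; not Mozilla code."""
    if charset.upper() == "ISO-8859-1":
        # Assumed rule: entities first for every non-ASCII character,
        # then the charset conversion.
        text = "".join(char_ref(c) if ord(c) > 0x7F else c for c in text)
        return text.encode(charset)
    # Other charsets: a reference only where the charset has no code point.
    out = []
    for c in text:
        try:
            c.encode(charset)
        except UnicodeEncodeError:
            c = char_ref(c)
        out.append(c)
    return "".join(out).encode(charset)

for charset in ("ISO-8859-1", "ISO-2022-JP", "UTF-8"):
    cells = [serialize(ch, charset) for ch in "\u00a0\u00c0\u2026"]
    print(charset, cells)
```

Note that Python's ISO-2022-JP codec wraps the two JIS bytes 0x21 0x44 in escape sequences that switch into and out of JIS X 0208, whereas the table lists only the JIS bytes themselves.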
Comment 20•25 years ago (Assignee)
Comment 21•25 years ago (Assignee)
Comment 22•25 years ago (Assignee)
fix checked in
Status: ASSIGNED → RESOLVED
Closed: 25 years ago
Resolution: --- → FIXED