27376 - U+2026 becomes &hellip;

Reporter

Description

•

26 years ago

When making a UTF-8 page in a Japanese environment, U+2026 becomes &hellip;. Both are displayed the same, but HTML 4.01 section 5.3 says:
>A given character encoding may not be able to express all characters of the
>document character set. For such encodings, or when hardware or software
>configurations do not allow users to input some document characters directly,
>authors may use SGML character references.
So it is better to output the proper code instead of the character reference &hellip;.

Patrick C. Beard

Comment 1

•

26 years ago

Sounds like either a layout rendering problem, or an I18N problem.

Assignee: beard → ftang

Component: Compositor → Internationalization

QA Contact: petersen → teruko

Frank Tang

Comment 2

•

26 years ago

What do you mean by "In making utf-8 page in Japanese environment"? Do you mean using Composer and saving the page as UTF-8? This seems to be an effect of the nsIEntityConverter work nhotta has done. Reassigning to nhotta. Naoki, please get common agreement on what we should do with this bug before you change it. Should we have a pref to control this?

Assignee: ftang → nhotta

TAKAHASHI Makoto

Reporter

Comment 3

•

26 years ago

When I make a page with Composer and save it in UTF-8 encoding, the JIS 01-36 character becomes &hellip;. It should be U+2026.

Katsuhiko Momoi

Comment 4

•

26 years ago

This seems to be a general problem and we should try to fix this very soon. For example, in addition to Shift_JIS 0x81 0x63 (horizontal ellipsis) being mapped to &hellip;, we also map Shift_JIS 0x81 0x64 (two dot leader) to &uml; -- this is simply wrong. This should map to \u2025.
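The mapping described in this comment can be checked with a short sketch. It assumes Python's built-in `shift_jis` codec and the standard `html` module as stand-ins; neither is the Mozilla converter under discussion, and the variable names are illustrative only.

```python
import html

# Shift_JIS byte pairs quoted in the comment above.
ELLIPSIS_SJIS = b"\x81\x63"        # horizontal ellipsis
TWO_DOT_LEADER_SJIS = b"\x81\x64"  # two dot leader

# Decode each pair to its Unicode character.
ellipsis = ELLIPSIS_SJIS.decode("shift_jis")
two_dot = TWO_DOT_LEADER_SJIS.decode("shift_jis")

# Resolve the named entity that was wrongly emitted for the two dot leader.
uml = html.unescape("&uml;")

print(f"0x8163 -> U+{ord(ellipsis):04X}")
print(f"0x8164 -> U+{ord(two_dot):04X}")
print(f"&uml;  -> U+{ord(uml):04X}")
```

The two dot leader and the diaeresis that `&uml;` names resolve to different code points, which is the mismatch the comment objects to.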

nhottanscp

Assignee

Comment 5

•

26 years ago

Accepting for M15.

Status: NEW → ASSIGNED

Target Milestone: M15

nhottanscp

Assignee

Comment 6

•

26 years ago

There was a related issue before about sending character references in mail: bug 20062, "Send messages with non-encoded NBSPs". The interface has options so the references can be created before or after the charset conversion. I think we want to make it controllable by a pref. With or without UI is something we need to discuss (as well as which should be the default behavior). Copied the cc list from 20062.

bobj

Comment 7

•

26 years ago

As described, this is not a bug: "&hellip;" is equivalent and maps to U+2026. Either one should be acceptable. Is this bug report a request for enhancement (RFE), to be able to output numeric character references instead of named entities? Or is the bug that we are generating "&hellip;" for JIS instead of generating the JIS values 0x81 0x63? I just tried sending myself Japanese email with an ellipsis and it did generate "&hellip;" instead of the JIS value for ellipsis. Composer probably does the same thing. Maybe this should be logged as a separate bug? Here is the source of that email:

Return-Path: <bobj@netscape.com>
Received: from netscape.com ([208.12.37.163]) by dredd.mcom.com (Netscape Messaging Server 4.1 Aug 9 1999 18:28:31) with ESMTP id FQAS7F00.F2U for <bobj@netscape.com>; Mon, 21 Feb 2000 12:42:51 -0800
Message-ID: <38B1A06F.9050204@netscape.com>
Date: Mon, 21 Feb 2000 12:30:39 -0800
From: bobj@netscape.com (Bob Jung)
User-Agent: Netscape 5.0
X-Accept-Language: en
MIME-Version: 1.0
To: bobj <bobj@netscape.com>
Subject: ellipsis-ja
Content-Type: text/html; charset=ISO-2022-JP
Content-Transfer-Encoding: 7bit

<html><head></head>
<body>ellipsis: &hellip;</body>
</html>
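The equivalence claimed at the start of this comment (named entity, numeric reference, and raw code point all denote the same character) can be illustrated with a minimal sketch. It assumes Python's standard `html.unescape` as a stand-in for an HTML parser; the variable names are illustrative.

```python
import html

# Three ways of writing the horizontal ellipsis in HTML source.
named = html.unescape("&hellip;")         # character entity reference
decimal = html.unescape("&#8230;")        # decimal numeric reference
hexadecimal = html.unescape("&#x2026;")   # hexadecimal numeric reference

for label, value in [("named", named), ("decimal", decimal), ("hex", hexadecimal)]:
    print(f"{label}: U+{ord(value):04X}")
```

All three forms resolve to one character, so the open question in the thread is about which form the serializer should write, not about what a browser displays.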

Katsuhiko Momoi

Comment 8

•

26 years ago

So I should log "Shift_JIS 0x81 0x64 (two dot leader) to &uml;" as a separate bug? This is simply a wrong mapping even with the HTML entity. "Two dot leader" is not the same as "umlaut".

bobj

Comment 9

•

26 years ago

momoi-san, Yes, please log that as a separate bug.

bobj

Comment 10

•

26 years ago

I tried using Composer to create an EUC-JP document with an ellipsis. As I speculated in my earlier comment, it generates the named entity instead of the actual code point. Here is the source created by Composer:

<html><head>
<meta http-equiv="Content-Type" content="text/html;charset=EUC-JP">
<title>ellipsis</title></head><body>ellipsis: &hellip;</body>
</html>

Katsuhiko Momoi

Comment 11

•

26 years ago

I thought that we used to send HTML entities only for ISO-8859-1 msgs as a backward compatibility measure and that we would send 8-bit values in all other encodings if they support the codepoint in question. Should not the default be to send 8-bit characters except for Latin 1 (as backward compatibility) or when the encoding does not support that codepoint?

Katsuhiko Momoi

Comment 12

•

26 years ago

I really think that generating &hellip; when the encoding in question supports that character is a misuse of the character entity reference. I for one don't want to see CERs in my Japanese mail or documents when editing the source. I quote from the W3C document on this question at: http://www.w3.org/TR/html4/charset.html#entities

"5.3 Character references
A given character encoding may not be able to express all characters of the document character set. For such encodings, or when hardware or software configurations do not allow users to input some document characters directly, authors may use SGML character references. Character references are a character encoding-independent mechanism for entering any character from the document character set."

CERs are not meant to be inserted just because they can be used. More judicious use is what I would like to recommend. In that sense this bug is valid. Neither UTF-8 nor any of the Japanese encodings needs &hellip; to represent the Horizontal Ellipsis character. By the way, 4.x does not generate &hellip; for this character under Japanese or UTF-8 encodings.

nhottanscp

Assignee

Comment 13

•

26 years ago

&hellip; cannot be interpreted by 4.x browsers. Using UTF-8 (without entities) solves the problem of entity support differing between browsers. As for generating htm32 or htm40 entities, that can be controlled by an option of the entity converter (currently htm40 is used). In addition to the other pref I mentioned before (use entities only for characters that cannot be mapped to a code point of the encoding), we may control it by a pref.

Katsuhiko Momoi

Comment 14

•

26 years ago

An option is OK but the default should be: not using the CERs unless the characters in question are not supported by the chosen encoding.

bobj

Comment 15

•

26 years ago

To try to summarize the discussion up until now...

For UTF-8 (or any Unicode encoding): It is better not to generate any entities; instead we should just use the raw Unicode values.
For ISO-8859-1: Do we continue the 4.x behavior and generate a certain set of entities?
For other encodings: Do not generate entities if there is a corresponding codepoint in that particular encoding.

We can add a pref to enable entity generation, but the default should be off. Would that pref generate entities wherever possible or some defined subset? Are there several proposed prefs?

nhottanscp

Assignee

Comment 16

•

26 years ago

I was not thinking about different behavior depending on the output charset. If that's really needed, please add your comment. We can have two prefs.
1) Specify the version of HTML entities (default to html32).
2) Do not generate entities if there is a corresponding codepoint in that particular encoding (default is ON).
The above prefs apply to both HTML save and mail send. If a separate pref is needed for mail send only, please add your comment.

bobj

Comment 17

•

26 years ago

The non-ISO-8859-1 cases are really the same case -- only replace with entities the characters that cannot be represented in the encoding. (Unicode is special because we could optimize the implementation and not check for entity replacement.) But do we want to treat ISO-8859-1 specially and provide backwards compatibility?

nhottanscp

Assignee

Comment 18

•

26 years ago

If option 2) is ON, entities are generated as a fallback in case of conversion failure (the character cannot be mapped). Since no conversion failure is expected for UTF-8, no fallback is executed (so it is optimized already, sort of). As for ISO-8859-1 backwards compatibility, that's HTML source level compatibility, not the browser's display, correct?
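The fallback idea in this comment (emit an entity only when charset conversion fails) can be sketched as a codec error handler. This is an illustration using Python's `codecs` module, not the nsIEntityConverter code; the handler name `entity_fallback` and the helper `encode_with_fallback` are hypothetical, and the choice of a numeric reference when no named entity exists is an assumption.

```python
import codecs
from html.entities import codepoint2name

def entity_fallback(error):
    """Codec error handler: replace unencodable characters with references."""
    if not isinstance(error, UnicodeEncodeError):
        raise error
    pieces = []
    for ch in error.object[error.start:error.end]:
        name = codepoint2name.get(ord(ch))
        # Prefer a named entity; otherwise use a decimal numeric reference.
        pieces.append(f"&{name};" if name else f"&#{ord(ch)};")
    # Return the replacement text and the position at which to resume encoding.
    return "".join(pieces), error.end

# Register the handler under a hypothetical name so str.encode() can use it.
codecs.register_error("entity_fallback", entity_fallback)

def encode_with_fallback(text, charset):
    """Encode text; the handler runs only for characters the charset lacks."""
    return text.encode(charset, errors="entity_fallback")

for charset in ("utf-8", "shift_jis", "iso-8859-1"):
    print(charset, encode_with_fallback("\u00c0 \u2026", charset))
```

Because the handler is invoked only on conversion failure, UTF-8 output never contains entities, which matches the "optimized already, sort of" remark.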

nhottanscp

Assignee

Comment 19

•

26 years ago

Regarding ISO-8859-1 backwards compatibility, we discussed it before in bug 20062, so I will just keep the current behavior. We need to change the client code of the interface (HTML content sink and message compose). For ISO-8859-1, convert to entities before charset conversion; for other charsets, entities are used as fallbacks. Here is a list of expected results after the change.

              \u00A0    \u00C0     \u2026
ISO-8859-1    &nbsp;    &Agrave;   &hellip;
ISO-2022-JP   &nbsp;    &Agrave;   0x2144
UTF-8         0xC2A0    0xC380     0xE280A6
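The expected results listed above can be reproduced with a small sketch of the two-mode policy. It assumes Python's built-in codecs in place of the Mozilla converters; `serialize`, `to_entity`, and `can_encode` are hypothetical helpers, and treating every non-ASCII character with a named entity as "entity first" for ISO-8859-1 is an assumption about what the kept 4.x behavior covers.

```python
import codecs
from html.entities import codepoint2name

def to_entity(ch):
    """Named entity when one exists, else a decimal numeric reference."""
    name = codepoint2name.get(ord(ch))
    return f"&{name};" if name else f"&#{ord(ch)};"

def can_encode(ch, charset):
    """True if the charset has a code point for this single character."""
    try:
        ch.encode(charset)
        return True
    except UnicodeEncodeError:
        return False

def serialize(text, charset):
    """Apply the per-charset entity policy, then encode the whole string once."""
    # ISO-8859-1: entities are substituted before charset conversion.
    entity_first = codecs.lookup(charset).name == "iso8859-1"
    parts = []
    for ch in text:
        if ord(ch) <= 0x7F:
            parts.append(ch)  # leave ASCII (including markup) untouched
        elif entity_first and ord(ch) in codepoint2name:
            parts.append(to_entity(ch))
        elif not can_encode(ch, charset):
            parts.append(to_entity(ch))  # other charsets: fallback only
        else:
            parts.append(ch)
    return "".join(parts).encode(charset)

sample = "\u00a0\u00c0\u2026"
for charset in ("iso-8859-1", "iso2022_jp", "utf-8"):
    print(charset, serialize(sample, charset))
```

For ISO-2022-JP the ellipsis comes out as the JIS X 0208 bytes 0x21 0x44 wrapped in the escape sequences that switch character sets, which corresponds to the 0x2144 entry in the list.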

nhottanscp

Assignee

Comment 20

•

26 years ago

Attached patch: patch for SaveAs

nhottanscp

Assignee

Comment 21

•

26 years ago

Attached patch: patch for mail send

nhottanscp

Assignee

Comment 22

•

26 years ago

fix checked in

Status: ASSIGNED → RESOLVED

Closed: 26 years ago

Resolution: --- → FIXED

Teruko Kobayashi

Comment 23

•

26 years ago

I verified this in 2000040509 Win32 build.

Status: RESOLVED → VERIFIED

patch for SaveAs, 26 years ago, nhottanscp, 1.36 KB, patch
patch for mail send, 26 years ago, nhottanscp, 1.30 KB, patch
