Closed Bug 27376 Opened 25 years ago Closed 25 years ago

U+2026 becomes &hellip;

Categories

(Core :: Internationalization, defect, P3)

x86
Windows 98
defect

VERIFIED FIXED

People

(Reporter: hobbit_mak, Assigned: nhottanscp)

Details

Attachments

(2 files)

In making utf-8 page in Japanese environment, U+2026 becomes &hellip;. Both are displayed the same, but HTML 4.01 section 5.3 says:
> A given character encoding may not be able to express all characters of the
> document character set. For such encodings, or when hardware or software
> configurations do not allow users to input some document characters directly,
> authors may use SGML character references.
So it is better to output the proper code instead of the character reference &hellip;.
Sounds like either a layout rendering problem, or an I18N problem.
Assignee: beard → ftang
Component: Compositor → Internationalization
QA Contact: petersen → teruko
What do you mean by "In making utf-8 page in Japanese environment"? Do you mean using Composer and saving the page as UTF-8? This seems to be the effect of the nsIEntityConverter work nhotta did. Reassigning to nhotta. Naoki, please get common agreement on what we should do with this bug before you change it. Should we have a pref to control this?
Assignee: ftang → nhotta
When I make a page with Composer and save it in UTF-8 encoding, the JIS 01-36 character becomes &hellip;. It should be U+2026.
This seems to be a general problem and we should try to fix this very soon. For example, in addition to Shift_JIS 0x81 0x63 (horizontal ellipsis) being mapped to &hellip;, we also map Shift_JIS 0x81 0x64 (two dot leader) to &uml; -- this is simply wrong. This should map to \u2025.
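The mappings mentioned in this comment can be checked with a short sketch. It uses Python's standard `shift_jis` codec and `html.entities` table as stand-ins; this is not the Mozilla converter code, and the helper name `code_point` is made up for illustration.

```python
from html.entities import name2codepoint

def code_point(sjis_bytes):
    """Decode one Shift_JIS character and return its code point as U+XXXX."""
    ch = sjis_bytes.decode("shift_jis")
    return f"U+{ord(ch):04X}"

# Shift_JIS 0x81 0x63 is HORIZONTAL ELLIPSIS, 0x81 0x64 is TWO DOT LEADER.
ellipsis = code_point(b"\x81\x63")
two_dot_leader = code_point(b"\x81\x64")

# Code points that the HTML 4 named entities stand for.
hellip = f"U+{name2codepoint['hellip']:04X}"
uml = f"U+{name2codepoint['uml']:04X}"

print(ellipsis, hellip)      # same character
print(two_dot_leader, uml)   # different characters
```

&hellip; names the same character as 0x81 0x63, while &uml; (DIAERESIS, U+00A8) is a different character from the two dot leader, which is why the second mapping is wrong even as an entity.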
Accepting for M15.
Status: NEW → ASSIGNED
Target Milestone: M15
There was a related issue before about sending character references in mail: bug 20062, "Send messages with non-encoded NBSPs". The interface has options so the references can be created before or after the charset conversion. I think we want to make it controllable by a pref. Whether it gets UI or not is something we need to discuss (as well as which should be the default behavior). Copied the cc list from 20062.
As described, this is not a bug: "&hellip;" is equivalent and maps to U+2026. Either one should be acceptable. Is this bug report a request for enhancement (RFE), to be able to output numbered character references instead of named entities? Or is the bug that we are generating "&hellip;" for JIS instead of generating the JIS values 0x81 0x63? I just tried sending myself Japanese email with an ellipsis and it did generate "&hellip;" instead of the JIS value for ellipsis. Composer probably does the same thing. Maybe this should be logged as a separate bug? Here is the source of that email:
Return-Path: <bobj@netscape.com>
Received: from netscape.com ([208.12.37.163]) by dredd.mcom.com (Netscape Messaging Server 4.1 Aug 9 1999 18:28:31) with ESMTP id FQAS7F00.F2U for <bobj@netscape.com>; Mon, 21 Feb 2000 12:42:51 -0800
Message-ID: <38B1A06F.9050204@netscape.com>
Date: Mon, 21 Feb 2000 12:30:39 -0800
From: bobj@netscape.com (Bob Jung)
User-Agent: Netscape 5.0
X-Accept-Language: en
MIME-Version: 1.0
To: bobj <bobj@netscape.com>
Subject: ellipsis-ja
Content-Type: text/html; charset=ISO-2022-JP
Content-Transfer-Encoding: 7bit
<html><head></head>
<body>ellipsis: &hellip;</body>
</html>
So I should log "Shift_JIS 0x81 0x64 (two dot leader) to &uml;" as a separate bug? This is simply a wrong mapping even with the HTML entity. "Two dot leader" is not the same as "umlaut".
momoi-san, Yes, please log that as a separate bug.
I tried using Composer to create an EUC-JP document with an ellipsis. As I speculated in my earlier comment, it generates the named entity instead of the actual code point. Here is the source created by Composer:
<html><head>
<meta http-equiv="Content-Type" content="text/html;charset=EUC-JP">
<title>ellipsis</title></head><body>ellipsis: &hellip;</body>
</html>
I thought that we used to send HTML entities only for ISO-8859-1 msgs as a backward compatibility measure and that we would send 8-bit values in all other encodings if they support the codepoint in question. Should not the default be to send 8-bit characters except for Latin 1 (as backward compatibility) or when the encoding does not support that codepoint?
I really think that generating &hellip; when the encoding in question supports that character is a misuse of character entity references. I for one don't want to see CERs in my Japanese mail or documents when editing the source. I quote from the W3C document on this question at: http://www.w3.org/TR/html4/charset.html#entities
"5.3 Character references
A given character encoding may not be able to express all characters of the document character set. For such encodings, or when hardware or software configurations do not allow users to input some document characters directly, authors may use SGML character references. Character references are a character encoding-independent mechanism for entering any character from the document character set."
CERs are not meant to be inserted just because they can be used. More judicious use is what I would like to recommend. In that sense this bug is valid. Neither UTF-8 nor any of the Japanese encodings needs &hellip; to represent the Horizontal Ellipsis character. By the way, 4.x does not generate &hellip; for this character under Japanese or UTF-8 encodings.
&hellip; cannot be interpreted by 4.x browsers. Using UTF-8 (without entities) avoids the problem of entity support differing from browser to browser. As for generating html32 or html40 entities, that can be controlled by an option of the entity converter (currently html40 is used). In addition to the other pref I mentioned before (use entities only for characters that cannot be mapped to a code point of the encoding), we may control it by a pref.
An option is OK but the default should be: not using the CERs unless the characters in question are not supported by the chosen encoding.
To try and summarize the discussion up until now...
For UTF-8 (or any Unicode encoding): It is better not to generate any entities; we should just use the raw Unicode values.
For ISO-8859-1: Do we continue the 4.x behavior and generate a certain set of entities?
For other encodings: Do not generate entities if there is a corresponding codepoint in that particular encoding.
We can add a pref to enable entity generation, but the default should be off. Would that pref generate entities wherever possible or only for some defined subset? Are there several proposed prefs?
I was not thinking about different behavior depending on the output charset. If that's really needed, please add your comment. We can have two prefs.
1) Specify the version of HTML entities (default to html32).
2) Do not generate entities if there is a corresponding codepoint in that particular encoding (default is ON).
The above prefs would apply to both HTML save and mail send. If a separate pref is needed for mail send only, please add your comment.
The non-ISO-8859-1 cases are really the same case -- only characters that cannot be represented in the encoding get replaced with entities. (Unicode is special because we could optimize the implementation and not check for entity replacement.) But do we want to treat ISO-8859-1 specially and provide backwards compatibility?
If option 2) is ON, entities are generated as a fallback in case of conversion failure (the character cannot be mapped). Since no conversion failure is expected for UTF-8, no fallback is executed (so it is already optimized, sort of). ISO-8859-1 backwards compatibility means HTML source level compatibility, not the browser's display, correct?
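The "entities only as a fallback on conversion failure" behavior can be sketched in Python with a custom codec error handler. This only illustrates the policy, not the nsIEntityConverter implementation: the handler name `namedentityreplace` and the function `save_as` are hypothetical, and the ISO-8859-1 backwards-compatibility case is not modeled here.

```python
import codecs
from html.entities import codepoint2name

def _named_entity_fallback(exc):
    """Replace only the characters the target charset cannot encode."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    refs = []
    for ch in exc.object[exc.start:exc.end]:
        name = codepoint2name.get(ord(ch))
        # Prefer a named entity; fall back to a numeric character reference.
        refs.append(f"&{name};" if name else f"&#{ord(ch)};")
    return "".join(refs), exc.end

# "namedentityreplace" is a made-up handler name for this sketch.
codecs.register_error("namedentityreplace", _named_entity_fallback)

def save_as(text, charset):
    """Encode text, emitting entities only where conversion would fail."""
    return text.encode(charset, errors="namedentityreplace")

print(save_as("\u2026", "utf-8"))       # raw UTF-8 bytes, no entity
print(save_as("\u2026", "shift_jis"))   # raw Shift_JIS bytes, no entity
print(save_as("\u00c0", "shift_jis"))   # not in the charset, so an entity
```

Because UTF-8 can encode every character, the handler is never invoked for it, which matches the observation that no fallback runs in that case. Python's built-in `xmlcharrefreplace` handler does the same thing with numeric references only.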
Regarding ISO-8859-1 backwards compatibility, we discussed it before in bug 20062, so I will just keep the current behavior. We need to change the client code of the interface (HTML content sink and message compose). For ISO-8859-1, convert to entities before charset conversion; for other charsets, entities are to be used as fallbacks. Here is a list of expected results after the change.
             \u00A0   \u00C0    \u2026
ISO-8859-1   &nbsp;   &Agrave;  &hellip;
ISO-2022-JP  &nbsp;   &Agrave;  0x2144
UTF-8        0xC2A0   0xC380    0xE280A6
Attached patch: patch for SaveAs
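The byte values in the expected-results list can be cross-checked with Python's standard codecs. This is a sketch for verification only; the helper names `hex_bytes` and `encodable` are made up, and the `iso2022_jp` codec is assumed to behave like the converter under discussion for these three characters.

```python
NBSP, A_GRAVE, ELLIPSIS = "\u00a0", "\u00c0", "\u2026"

def hex_bytes(text, charset):
    """Return the encoded bytes as an uppercase hex string."""
    return text.encode(charset).hex().upper()

def encodable(ch, charset):
    """True if the charset has a code point for ch (no fallback needed)."""
    try:
        ch.encode(charset)
        return True
    except UnicodeEncodeError:
        return False

# UTF-8 row: every character is written as raw bytes.
utf8_row = [hex_bytes(ch, "utf-8") for ch in (NBSP, A_GRAVE, ELLIPSIS)]

# ISO-2022-JP row: only the ellipsis has a code point; the JIS bytes
# 0x21 0x44 appear between the escape sequences that switch character sets.
jp_row = [encodable(ch, "iso2022_jp") for ch in (NBSP, A_GRAVE, ELLIPSIS)]
jp_ellipsis = ELLIPSIS.encode("iso2022_jp")

print(utf8_row)
print(jp_row, jp_ellipsis)
```

The two characters that are not encodable in ISO-2022-JP are exactly the ones the list shows as entities for that charset.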
fix checked in
Status: ASSIGNED → RESOLVED
Closed: 25 years ago
Resolution: --- → FIXED
I verified this in 2000040509 Win32 build.
Status: RESOLVED → VERIFIED