Bug 8865 - need API to convert UCS-2 HTML into a target encoding
Product: Core
Classification: Components
Component: Internationalization
Version: Trunk
Platform: All Other
Importance: P3 normal
Target Milestone: M11
Assigned To: nhottanscp
: nhottanscp
: Axel Hecht [:Pike]
Depends on:
Blocks: 12392 13401 15475 15674 16441 16950
Reported: 1999-06-24 17:48 PDT by bobj
Modified: 1999-10-26 20:47 PDT


Description bobj 1999-06-24 17:48:10 PDT
Provide an API which will convert UCS-2 HTML into a target encoding including
appropriate HTML entity substitution.  The implementation will probably call
the standard Unicode converter APIs with additional code to handle the entity
substitutions.  For a given character range, the converter may apply one
of the following conversions (controlled via prefs):
  1. convert to character code in target charset encoding
  2. convert to HTML3 Named Entity
  3. convert to various subsets of HTML4 Named Entities (e.g., math)
  4. convert to decimal Numeric Character Reference (NCR)
  5. convert to HTML4 hexadecimal NCR
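The five conversions above can be read as a fallback chain. The following is a minimal Python sketch of that reading, not the Mozilla implementation: the function name `to_target_encoding` and the `hex_ncr` flag are hypothetical, the named-entity table is Python's own HTML 4 table (`html.entities.codepoint2name`), and the target charset is assumed to be ASCII-compatible. It does not escape markup-significant characters such as < and &, which is a separate step.

```python
from html.entities import codepoint2name


def to_target_encoding(text: str, charset: str, hex_ncr: bool = False) -> bytes:
    """Encode text for an HTML document in `charset`, substituting entities.

    Hypothetical helper: tries the target charset first, then a named
    entity, then a numeric character reference (NCR).
    """
    out = bytearray()
    for ch in text:
        try:
            # 1. Character code in the target charset, when one exists.
            out += ch.encode(charset)
            continue
        except UnicodeEncodeError:
            pass
        name = codepoint2name.get(ord(ch))
        if name is not None:
            # 2/3. Named entity (Python's table covers the HTML 4 set).
            ref = f"&{name};"
        elif hex_ncr:
            # 5. HTML 4 hexadecimal NCR.
            ref = f"&#x{ord(ch):X};"
        else:
            # 4. Decimal NCR.
            ref = f"&#{ord(ch)};"
        # Assumption: the target charset is ASCII-compatible.
        out += ref.encode("ascii")
    return bytes(out)
```

This sketch always prefers the character code when the charset has one; whether named entities should instead take priority is exactly the open question about default behavior.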

Default behavior is still being debated.  Factors to be considered include
backwards compatibility (e.g., decimal NCRs are supported by almost all clients
currently in use, but those clients may not support HTML4 hex NCRs).  The biggest
debate concerns when to output character code values instead of entities.

UI work is needed to control the pref values.
Comment 1 bobj 1999-06-24 19:00:59 PDT
Of course when converting from UCS-2, we need to always entity-ize:
     "&lt;" represents the < sign
     "&gt;" represents the > sign
     "&amp;" represents the & sign
     "&quot;" represents the " mark
but Ender should do that before the conversion, right?
Comment 2 Akkana Peck 1999-06-25 11:05:59 PDT
Do we always need to entity-ize &quot; ?  I'm not so sure of that one, or of
&amp; though I agree about &lt; and &gt;.

Ender doesn't do anything like this; it just works with the content tree.
The XIF converter creates <entity> tags for (currently) &, <, and >.  The
nsXIFDTD then lets the parser map them back to unicode chars.  Finally, the
content sink (e.g. nsHTMLContentSinkStream.cpp) calls NS_UnicodeToEntity on each
unicode char in the stream to turn them into &foo; format.
Comment 3 bobj 1999-06-25 17:58:59 PDT
Good question about &quot;  Seems like we didn't handle that before.  I just
tried it in the 4.5 composer.  But can't quotes cause problems for things like:
     <tag-name param-name="foobar">

For reference,
Four character entity references deserve special mention since they
are frequently used to escape special characters:

     - "&lt;" represents the < sign.
     - "&gt;" represents the > sign.
     - "&amp;" represents the & sign.
     - "&quot;" represents the " mark.

Authors wishing to put the "<" character in text should use "&lt;"
(ASCII decimal 60) to avoid possible confusion with the beginning
of a tag (start tag open delimiter). Similarly, authors should use
"&gt;" (ASCII decimal 62) in text instead of ">" to avoid problems
with older user agents that incorrectly perceive this as the end
of a tag (tag close delimiter) when it appears in quoted attribute values.

Authors should use "&amp;" (ASCII decimal 38) instead of "&" to
avoid confusion with the beginning of a character reference
(entity reference open delimiter). Authors should also use
"&amp;" in attribute values since character references are
allowed within CDATA attribute values.

Some authors use the character entity reference "&quot;" to
encode instances of the double quote mark (") since that
character may be used to delimit attribute values.
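The distinction drawn in the excerpt above (text content versus quoted attribute values) can be sketched as two small Python helpers. The names `escape_text` and `escape_attribute` are hypothetical and only illustrate the idea that " needs escaping inside a double-quoted attribute value but not in running text; they are not what Composer or the content sink does.

```python
def escape_text(s: str) -> str:
    """Escape the characters that are significant in HTML text content."""
    # "&" goes first so the "&" of the other references is not escaped again.
    return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def escape_attribute(s: str) -> str:
    """Escape a value that will be placed inside a double-quoted attribute."""
    return escape_text(s).replace('"', "&quot;")
```

For example, `escape_attribute('Robert >"Bob"< Jung')` yields `Robert &gt;&quot;Bob&quot;&lt; Jung`, while `escape_text` leaves the quotes untouched.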
Comment 4 bobj 1999-06-25 18:10:59 PDT
For my strawman pref controls, I think we need (UE should rename/reword):

Editor Entity Output Preferences

  Named Entity Priority (radio buttons)
   o After valid character code values (default - but this is debatable)
   o Before valid character code values

  Additional (base set is HTML 3.2) Named Entities(check boxes)
   [ ] HTML 4.0 ISO 8859-1 (Latin-1) characters (default off)
   [ ] HTML 4.0 symbols, math symbols, and Greek letters (default off)
   [ ] HTML 4.0 markup-significant and i18n characters  (default off)

  NCR format (radio buttons)
   o decimal (default)
   o hexadecimal (recommended by HTML4, but would cause backwards
                  compatibility problems)
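As an illustration only, the strawman prefs above could be modeled as a small settings object. Every name in this Python sketch (`EntityOutputPrefs`, `NamedEntityPriority`, `NcrFormat`, `format_ncr`) is hypothetical; the defaults mirror the ones marked "default" in the list.

```python
from dataclasses import dataclass
from enum import Enum


class NamedEntityPriority(Enum):
    AFTER_CHARACTER_CODES = "after"    # strawman default (debatable)
    BEFORE_CHARACTER_CODES = "before"


class NcrFormat(Enum):
    DECIMAL = "decimal"                # strawman default
    HEXADECIMAL = "hexadecimal"        # HTML4 form, weaker backwards compatibility


@dataclass
class EntityOutputPrefs:
    named_entity_priority: NamedEntityPriority = NamedEntityPriority.AFTER_CHARACTER_CODES
    # Additional named-entity sets on top of the HTML 3.2 base set.
    html40_latin1: bool = False
    html40_symbols: bool = False
    html40_special: bool = False
    ncr_format: NcrFormat = NcrFormat.DECIMAL


def format_ncr(ch: str, prefs: EntityOutputPrefs) -> str:
    """Format one character as an NCR in the style the prefs select."""
    cp = ord(ch)
    if prefs.ncr_format is NcrFormat.HEXADECIMAL:
        return f"&#x{cp:X};"
    return f"&#{cp};"
```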
Comment 5 Akkana Peck 1999-06-25 18:32:59 PDT
I'm a bit confused by that list.  Does it offer some way to encode &, < and > to
named entities but not to encode " ?  I think a lot of people will be annoyed if
their quotes all become entities when there doesn't seem to be any pressing
reason for that.
Comment 6 bobj 1999-06-28 17:43:59 PDT
There will be cases when we need to convert " to an entity.  The easiest
implementation is probably to entity-ize all occurrences of the 4 special
characters.  Otherwise Ender needs to have smarts when it is OK or not OK to
use the raw character codes.

I tried a test using 4.5 Composer.  In Notepad I created foo.html:

   <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
   <meta name="Author" content="Robert &gt;&quot;Bob&quot;&lt; Jung">
less-than: &lt;
<br>greater-than: &gt;
<br>double-quote: &quot;
<br>ampersand: &amp;

I opened foo.html with 4.5 Composer and saved it to foo2.html.  It changed all
"&gt;" to > and "&quot;" to ".  It even changed the "&lt;" in the
<meta author...> tag to <.  Here is the resulting foo2.html:

<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
   <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
   <meta name="Author" content="Robert >"Bob"< Jung">
   <meta name="GENERATOR" content="Mozilla/4.5 [en]C-NSCP  (Win95; U)
less-than: &lt;
<br>greater-than: >
<br>double-quote: "
<br>ampersand: &amp;
Comment 7 Akkana Peck 1999-06-28 17:51:59 PDT
Putting that logic in the editor library would require some sort of big ugly
hack.  Right now it's handled by the content sink stream on output, after ender
no longer has control over it.  The content sink stream can have multiple modes;
but if every sink stream has to know all about which special character codes do
and do not get mapped to entities, then what's the point of having a service to
do the mapping?
Comment 8 tague 1999-06-29 02:23:59 PDT
this probably won't make it until m9, any issue with moving it?
Comment 9 tague 1999-07-23 13:00:59 PDT
not going to make m9 cutoff, moving to m10
Comment 10 bobj 1999-08-18 04:03:59 PDT
Do you still feel this has a dependency on bug 8865?
Comment 11 Frank Tang 1999-08-31 14:21:59 PDT
Tague, can we make this for M10?  The scope of this bug is the converter itself
without the integration part, right?
Comment 12 tague 1999-08-31 14:25:59 PDT
i have an api and a shell on my machine at home.  i stopped working on this,
because i found that someone had already implemented some entity conversion code

right now i'm looking at html save as path to try to figure out what is going on
with entity conversion before i invest a lot of time in building this converter.
i doubt this is going to make m10.
Comment 13 Frank Tang 1999-09-10 10:17:59 PDT
reassign to naoki, tague, please give naoki a brain dump
Comment 14 nhottanscp 1999-09-13 13:36:59 PDT
tague, are you able to check in whatever you have to the tree?
Then I can start working on this.
For M11, messenger can use it to generate entities before converting to the mail
Later if entities are generated elsewhere then it can be simply removed from
Comment 15 nhottanscp 1999-09-14 14:31:59 PDT
The plan is to call nsIEntityConverter::ConvertToEntity from message compose
before converting unicode to mail charset.
Comment 16 nhottanscp 1999-09-15 14:07:59 PDT
Hooked up the entity converter to nsMsgSend.cpp.
If later editor generates entities then it can be removed from messenger.
Comment 17 nhottanscp 1999-09-27 16:36:59 PDT
Reopening the bug. We have not implemented everything we planned.
The original spec says:
  1. convert to character code in target charset encoding
  2. convert to HTML3 Named Entity
  3. convert to various subsets of HTML4 Named Entities (e.g., math)
  4. convert to decimal Numeric Character Reference (NCR)
  5. convert to HTML4 hexadecimal NCR

We currently have 1, 2 and partially 3. We need a simple interface that
converts Unicode to HTML (or plain text) using the charset converters and the
entity converter, and that also generates NCRs.
In the plain text case, instead of using NEs and NCRs, either skip unconvertible
chars or convert to UTF-8 plain text.
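For comparison, Python's standard codecs expose the same choices through encoder error handlers. The sketch below uses only built-in behavior (`xmlcharrefreplace` and `ignore`); the helper `encode_plain` is a hypothetical name for the "fall back to UTF-8" option and is an illustration, not the interface being proposed in this bug.

```python
text = "caf\u00e9 \u2603"  # U+2603 has no ISO-8859-1 code

# HTML output: unmappable characters become decimal NCRs.
html_bytes = text.encode("iso-8859-1", errors="xmlcharrefreplace")

# Plain text, option 1: skip characters the charset cannot represent.
skipped = text.encode("iso-8859-1", errors="ignore")


# Plain text, option 2: switch the whole text to UTF-8 when anything is unmappable.
def encode_plain(text: str, charset: str) -> tuple[str, bytes]:
    """Return (charset actually used, encoded bytes)."""
    try:
        return charset, text.encode(charset)
    except UnicodeEncodeError:
        return "utf-8", text.encode("utf-8")
```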
Comment 18 msanz 1999-09-27 16:38:59 PDT
Clearing fixed resolution since it's been reopened.
Comment 19 nhottanscp 1999-09-29 13:31:59 PDT
Removing 6672 from the dependency list because it depended on nbsp entity
generation, which is done.
Comment 20 nhottanscp 1999-10-06 13:33:59 PDT
Adding 15706 as a dependency; it is needed for the fallback case when the unicode
converter fails because there is no mapping. For now, I can work around it by
doing the conversion per character, but that is slow.
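The difference between converting per character and resuming after a reported failure can be sketched in Python, whose `UnicodeEncodeError` carries the `start` and `end` positions of the unmappable run. This is an analogy under stated assumptions, not the Mozilla encoder API: the function name `encode_with_ncr` is hypothetical, the charset is assumed to be stateless and ASCII-compatible, and the fallback shown is a decimal NCR.

```python
def encode_with_ncr(text: str, charset: str) -> bytes:
    """Encode in chunks, falling back to decimal NCRs only where needed."""
    out = bytearray()
    pos = 0
    while pos < len(text):
        chunk = text[pos:]
        try:
            # Fast path: the rest of the text converts in one call.
            out += chunk.encode(charset)
            break
        except UnicodeEncodeError as err:
            # err.start and err.end are offsets into `chunk`.
            out += chunk[:err.start].encode(charset)
            for ch in chunk[err.start:err.end]:
                out += f"&#{ord(ch)};".encode("ascii")
            # Resume just past the unmappable characters.
            pos += err.end
    return bytes(out)
```

Knowing exactly where the encoder stopped is what makes the chunked loop possible; without that position the caller is left with the slow one-character-at-a-time workaround.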
Comment 21 nhottanscp 1999-10-18 17:46:59 PDT
I checked in my changes today. There are two changes.
1) Changed nsIEntityConverter to support the complete html40 entities. It supports
items 2 and 3 of the original spec.
2) Added a new interface nsISaveAsCharset. This is a superset of
nsIEntityConverter: it supports entities and NCRs and also does a charset
conversion. It also supports plain text input (see the idl file for detail). This
interface supports all the requirements of the original spec.

Note: the nsISaveAsCharset implementation depends on the unicode encoder bug 15706;
for some charsets (e.g. ISO-8859-1) it doesn't work correctly until 15706 is fixed.
nsIEntityConverter does not have this dependency.
Comment 22 nhottanscp 1999-10-26 20:47:59 PDT
Removing 15706 from the dependency list. We agreed that encoders should include the
unmapped character in the consumed length, which is the current behavior of all
encoders except ISO-2022-JP. 15706 is now a specific problem for ISO-2022-JP.
I made a change to the callers of the unicode encoder. Marking this bug as FIXED.
