Last Comment Bug 169590 - Error in unicode conversion.
: Error in unicode conversion.
Status: VERIFIED FIXED
: intl
Product: Core
Classification: Components
Component: Internationalization (show other bugs)
: Trunk
: x86 Windows 2000
: -- normal (vote)
: mozilla1.3alpha
Assigned To: rbs
: Yuying Long
: Makoto Kato [:m_kato]
Mentors:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2002-09-18 18:39 PDT by Stephane Saux
Modified: 2007-08-27 22:18 PDT (History)
14 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
test case (864 bytes, text/html)
2002-09-19 13:10 PDT, Shanjian Li
no flags Details
patch (1.61 KB, patch)
2002-09-19 13:11 PDT, Shanjian Li
no flags Details | Diff | Splinter Review
patch to allow consumers of the encoder to do what they want (12.26 KB, patch)
2002-11-28 05:51 PST, rbs
no flags Details | Diff | Splinter Review
patch to allow consumers of the encoder to do what they want - take2 (12.85 KB, patch)
2002-12-03 04:44 PST, rbs
nhottanscp: review+
jst: superreview+
Details | Diff | Splinter Review
part2: implement the remaining catch-all case and the user-pref (8.99 KB, patch)
2002-12-16 10:14 PST, rbs
nhottanscp: review+
jst: superreview+
Details | Diff | Splinter Review

Description Stephane Saux 2002-09-18 18:39:33 PDT
Gecko's nsHTMLContentSerializer::AppendToString() has a bug. In this
method, it converts Unicode code point between 128 and 255 to entity
such as " ", but it only happens when document charset encoding is
set to ISO-8859-1. Since the method is dealing with Unicode, it should
convert code points in this range regardless of document charset encoding.
Comment 1 Stephane Saux 2002-09-18 18:52:14 PDT
According to Taka:
http://lxr.mozilla.org/mozilla/source/content/base/src/nsHTMLContentSerializer.cpp#156
(test on latin1) should be removed.
Comment 2 Roy Yokoyama 2002-09-19 10:15:01 PDT
to shanjian
Comment 3 Shanjian Li 2002-09-19 13:10:43 PDT
Created attachment 99861 [details]
test case
Comment 4 Shanjian Li 2002-09-19 13:11:28 PDT
Created attachment 99862 [details] [diff] [review]
patch
Comment 5 Shanjian Li 2002-09-19 13:16:12 PDT
How to verify this fix:
Open the attached test case in browser, and choose file->Edit page. Inside
composer, select saveAs and choose a different name to save it. Open the new
html file in a notebook and check its source. Before the patch, ² will be
displayed as "²". After the patch, it should be displayed as "&sup2". 

Comment 6 Shanjian Li 2002-09-19 13:16:48 PDT
naoki, could you r=?
Comment 7 nhottanscp 2002-09-19 13:30:15 PDT
This was actually done intentionally. For Latin1 documents, it is preferred to
use entities even when they can be encoded by the document charset. I think this
is also done in order to keep the save behavior as 4.x.
I think we want to keep the current behavior unless we get complaints from users.

cc to akkana, bobj, jst
Comment 8 Shanjian Li 2002-09-19 14:49:14 PDT
resolve as wfm per naoki's comment.
Comment 9 Takayuki Tei 2002-09-19 15:37:34 PDT
don't rush ;-)

I encountered this problem while trying to figure out why
nsIEditor::OutputToString() doesn't return entities even though I'm specifying
nsIDocumentEncoder::OutputEncodeEntities flag. I set document encoding to UTF-8.
Is it expected that nobody composes Latin document in UTF-8 encoding?  What
about ISO-8859-15?  Should output be different between ISO-8859-1 and
ISO-8859-15? I don't think so.

Especially non-breaking space character is problematic. As far as you stick to
0xA0 code point, it won't be portable across character set encoding (e.g.
JIS0208 doesn't have such code point). Only solution for this is to use  
entity instead of charset encoding dependent code point.

Re-opening.
Comment 10 Shanjian Li 2002-09-19 16:03:44 PDT
I am confused.   does not mean code point 160 in current charset, it always
mean u+00a0 in unicode. Why   doesn't  work and   works? Does it
indicate another problem?
Comment 11 nhottanscp 2002-09-19 16:15:38 PDT
There are two cases.
1) Use entities only if the character cannot be encoded to the target charset.
2) Use available entities as much as possible.

For #1, that is done as a fallback for the charset conversion. This is basically
the way the serializer implemented.
For #2, the entity mapping can be applied before the charset conversion. This is
not done by the serializer currently. 
nsISaveAsCharset has an option to do this which also takes a flag for kind of
entities (e.g. Latin1, Symbol). I think the similar feature can be added to the
serializer.



Comment 12 Takayuki Tei 2002-09-20 11:35:49 PDT
Please look at the problem from embedding application devloppers' point of view.
 As for me, I'm not using Gecko's charset convertor.  Thus, #1 is helpless.

Existance of nsIDocumentEncoder::OutputEncodeEntities flag implies #2, and it's
easily done by removing the testing of document charset encoding.
Comment 13 Takayuki Tei 2002-09-26 11:43:39 PDT
Any comments or updates?
Comment 14 Shanjian Li 2002-09-26 12:44:05 PDT
Reassign this one to naoki. Now is the question is about possible impact this
patch may have. Naoki is better aware of those issues than me. 
Comment 15 nhottanscp 2002-09-26 18:21:04 PDT
>I'm not using Gecko's charset convertor.  Thus, #1 is helpless.
nsIEntityConverter may be used to encode Unicode to entity.

>Existance of nsIDocumentEncoder::OutputEncodeEntities flag implies #2
If the flag is set then we should create the entity (CER) before the charset
conversion. I think the serializer is working that way.
Comment 16 Takayuki Tei 2002-09-26 18:26:44 PDT
>>Existance of nsIDocumentEncoder::OutputEncodeEntities flag implies #2
>If the flag is set then we should create the entity (CER) before the
>charset conversion. I think the serializer is working that way.

Only when document charset encoding is ISO-8859-1. It should be 
consistent regardless of document charset encoding.
Comment 17 nhottanscp 2002-09-27 09:48:26 PDT
If that is the case then need to remove that from the serializer and move the
special case handling up to the editor.
Comment 18 rbs 2002-11-27 13:38:38 PST
I might take this bug to implement the combined thoughts from this bug and 
bug 165686 comment 21 while what/where to do are still fresh in my mind.

The current plan is:

1) Remove the charset check (in other words, revert the entire patch of bug
65324 where it was introduced).

2) remove nsIDocumentEncoder::OutputEncodeEntities and instead define:

nsIDocumentEncoder::OutputEncodeBasicEntities (just for 'nbsp', 'amp', 'gt', ...)
for callers such as nsGenericHTMLElement::GetInnerHTML() to use for
compatibility with IE6's innerHTML or for interoperability with other 4xp
products that don't support 'α' and friends.

nsIDocumentEncoder::OutputEncodeHTMLEntities
for other callers who want a blanket convertion of all HTML entities as
requested in this bug.

3) special-case at call sites:

   * for GetInnerHTML (bug 165686)
     use flag = OutputEncodeBasicEntities

   * for the editor (comment 16 and comment 17)
     use flag = (charset == ISO-8859-1)
              ? OutputEncodeHTMLEntities 
              : OutputEncodeBasicEntities

   * for "Save As..."
     use flag = (is pref saying to encode entity names)
              ? OutputEncodeHTMLEntities
              : OutputEncodeBasicEntities

   * other callers (comment 9) will set as desired.

Anything missing from the plan?
Comment 19 rbs 2002-11-27 14:13:21 PST
> 1)
Patch in attachment 99862 [details] [diff] [review] is doing what is needed. Only need some tweaks to
accommodate the two options now.
Comment 20 rbs 2002-11-28 05:51:38 PST
Created attachment 107699 [details] [diff] [review]
patch to allow consumers of the encoder to do what they want

1) Upon noting that nsIWebBrowserPersist.idl is @UNDER_REVIEW and will be
frozen soon, thereby complicating any further iteration, I eagerly added a
catch-all flag, OutputEncodeW3CEntities, for documents with mixed contents 
(e.g., XHTML+MathML) to leave room for some future extensions (not yet
implemented).

2) I didn't do anything re: "Save" with pref for HTML-entity-zation because it
wasn't clear to me how to guarantee its consistent effectiveness. The current
code that saves the document uses the download manager; c.f. foundHeaderInfo()
in xpfe/communicator/resources/content/contentAreaUtils.js. Sometimes it calls
nsIWebBrowserPersist::saveDocument(), while other times it instead calls
nsIWebBrowserPersist::saveURI(). The latter just works like a copy a-la FTP.

If the pref was meant for the editor (which I am starting to think was the
case), then of course this can be easily done by tweaking GetOutputFlags() in
editor/ui/composer/content/ComposerCommands.js. GetOutputFlags() purposely
"returns output flags based on mimetype, wrapCol and prefs".

Speaking of this function, any reason why it setting the encoding flag even
with text/plain?
 
3) For other embeddors (i.e., this bug), they can now set their desired
entity-zation flag to achieve what they prefer.
Comment 21 rbs 2002-11-28 05:59:13 PST
-          } else if (mIsLatin1 && val > 127 && val < 256) {
+          } else if ((mFlags & nsIDocumentEncoder::OutputEncodeHTMLEntities) &&
+                     (val > 127) && (val < 256)) {

Hey... this excludes &alpha and other greek letters. Perhaps, the comments needs
to be updated. Or alternatively, there can a flag for the latin1 case, and
another for the complete HTML case. Thoughts?
Comment 22 nhottanscp 2002-12-02 11:34:26 PST
Thanks for the patch, I reassign this to you.
Can OutputEncodeBasicEntities only include the four (&lt; &gt; &amp; &quot;)
which really need to be encoded?
Comment 23 rbs 2002-12-03 03:17:27 PST
Yes, that is what the code is presently doing. I didn't document it accurately.
The entities &nbsp; &lt; &gt; &amp; get converted in the textual content of
tags, while the entities &nbsp; &lt; &gt; &amp; &quot; get converted in
attributes. [The fact that &quot; isn't recovered in the textual content is
because it is not (presently) possible to tell if a |"| character that shows up
in the DOM was initially a literal |"| character or an entity |&quot;| in the
source. From bug 165686 comment 7, there could be entity nodes in the DOM to
allow exact recovery, but that is something further down the track. They would
just make the entity-zation of the |"| work better. However unlike |&nbsp;| the
current beahvior of not recovering |&quot;| doesn't cause datatoss during
copy-paste. So no particular urgency here.]

Two things that I have to do:
1) update documentation as per clarification above 
3) use OutputEncodeLatin1Entities for the Latin1 case.
Comment 24 rbs 2002-12-03 04:44:55 PST
Created attachment 108016 [details] [diff] [review]
 patch to allow consumers of the encoder to do what they want - take2
Comment 25 rbs 2002-12-03 05:15:54 PST
Comment on attachment 108016 [details] [diff] [review]
 patch to allow consumers of the encoder to do what they want - take2

Seeking r/sr.

Note: I will be filing another bug for the catch-all case once done with this.
Shouldn't be hard to implement. Rather than the HTML parser service, using  the
entity converter (nsIEntityConverter::ConvertToEntity with all its flags) might
do the trick for that.
Comment 26 nhottanscp 2002-12-03 10:00:24 PST
One question, what is a difference between OutputEncodeW3CEntities and
OutputEncodeHTMLEntities, or are they same?
Comment 27 Johnny Stenback (:jst, jst@mozilla.com) 2002-12-03 14:42:51 PST
Comment on attachment 108016 [details] [diff] [review]
 patch to allow consumers of the encoder to do what they want - take2

sr=jst
Comment 28 rbs 2002-12-04 19:20:37 PST
Re: comment 26
There are not the same.

OutputEncodeHTMLEntities is the set of entities defined by the HTML4 spec only,
whereas OutputEncodeW3CEntities is the overall sum of entities defined at W3C
for specs that we know something about (see nsIEntityConverter.idl from where
the implementation will be based as I alluded earlier). For example, if your
document is in XHTML, you can mix HTML+MathML, and since MathML has its own set
of entities, doing a conversion just with HTMLEntities would leave other bits
out.
Comment 29 rbs 2002-12-16 10:14:45 PST
Created attachment 109424 [details] [diff] [review]
part2: implement the remaining catch-all case and the user-pref

This is the remainder of the task to be done. To keep the ball rolling I am
doing it here rather than in another bug. The patch adds a (hidden) user-pref
allowing the editor to save with as much entity names as the user wants,
irrespective of the charset. The current behavior of defaulting to latin1
entities when the charset is ISO-8859-1 has been retained, and the user-pref
allows to override the default.

Tested with the editor's "<HTML> Source" and "Save As".
Comment 30 rbs 2002-12-16 10:16:28 PST
Comment on attachment 108016 [details] [diff] [review]
 patch to allow consumers of the encoder to do what they want - take2

I forgot to mention that this one was already checked in.
Comment 31 rbs 2002-12-16 10:25:56 PST
+        case "html"   : outputEntity =
webPersist.ENCODE_FLAGS_ENCODE_W3C_ENTITIES; break;

oops, s/ENCODE_FLAGS_ENCODE_W3C_ENTITIES/ENCODE_FLAGS_ENCODE_HTML_ENTITIES/
                           ^^^^^                            ^^^^^^
(I didn't exposose the catch-all "w3c" case, but was using the editor to test
that the added codepath in the serializer was OK, and I forgot to revert back to
"html". Will do so in my tree.)
Comment 32 nhottanscp 2002-12-16 10:50:06 PST
Comment on attachment 109424 [details] [diff] [review]
part2: implement the remaining catch-all case and the user-pref

r=nhotta

+	 char* fullEntityText = nsnull;
xpidl string can be used here

+ * output &nbsp; otherwise, we'll output 0xa0.
depends on the output charset, NCR may be generated, please mention that in the
comment
Comment 33 rbs 2002-12-16 11:20:36 PST
> +	 char* fullEntityText = nsnull;
> xpidl string can be used here

I have hesitated to do that because of perf reasons. Since the codepath is
rarely going to be used, we will avoid an undue cost of the xpidl string's
ctor/dtor for each character in the loop.

> + * output &nbsp; otherwise, we'll output 0xa0.
> depends on the output charset, NCR may be generated, please mention that in the
> comment

It is a bit edgy to explain (since 'nbsp' is invariant/particular - comment
#10), so changed to: "If set, we'll output &nbsp; otherwise, we may output 0xa0
depending on the charset."
Comment 34 Johnny Stenback (:jst, jst@mozilla.com) 2002-12-16 15:47:14 PST
Comment on attachment 109424 [details] [diff] [review]
part2: implement the remaining catch-all case and the user-pref

sr=jst
Comment 35 rbs 2002-12-16 17:21:34 PST
Checked in. Bug completed.
Comment 36 rbs 2002-12-17 13:53:45 PST
I forgot to add r=nhotta and sr=jst in my checkin comment... Last days of a long
year?
Comment 37 neil@parkwaycc.co.uk 2002-12-20 01:53:50 PST
Comment on attachment 109424 [details] [diff] [review]
part2: implement the remaining catch-all case and the user-pref

> function GetOutputFlags(aMimeType, aWrapColumn)
> {
>+  var outputFlags = 0;
>   var editor = GetCurrentEditor();
>-  var outputFlags = (editor && editor.documentCharacterSet == "ISO-8859-1")
>+  var outputEntity = (editor && editor.documentCharacterSet == "ISO-8859-1")
>     ? webPersist.ENCODE_FLAGS_ENCODE_LATIN1_ENTITIES
>     : webPersist.ENCODE_FLAGS_ENCODE_BASIC_ENTITIES;
>   if (aMimeType == "text/plain")
>@@ -997,14 +998,23 @@
>   }
>   else
>   {
>-    // Should we prettyprint? Check the pref
>     try {
>+      // Should we prettyprint? Check the pref
>       var prefs = GetPrefs();
>       if (prefs.getBoolPref("editor.prettyprint"))
>         outputFlags |= webPersist.ENCODE_FLAGS_FORMATTED;
>+      // How much entity names should we output? Check the pref
>+      var encodeEntity = prefs.getCharPref("editor.encode_entity");
>+      switch (encodeEntity) {
>+        case "basic"  : outputEntity = webPersist.ENCODE_FLAGS_ENCODE_BASIC_ENTITIES; break;
>+        case "latin1" : outputEntity = webPersist.ENCODE_FLAGS_ENCODE_LATIN1_ENTITIES; break;
>+        case "html"   : outputEntity = webPersist.ENCODE_FLAGS_ENCODE_W3C_ENTITIES; break;
>+        case "none"   : outputEntity = 0; break;
>+      }
>     }
>     catch (e) {}
>   }
>+  outputFlags |= outputEntity;
The way you did this in editor.js is alot more readable, and would translate to
something like this:
function GetOutputFlags(aMimeType, aWrapColumn)
{
  var editor = GetCurrentEditor();
  var outputFlags = (editor && editor.documentCharacterSet == "ISO-8859-1")
    ? webPersist.ENCODE_FLAGS_ENCODE_LATIN1_ENTITIES
    : webPersist.ENCODE_FLAGS_ENCODE_BASIC_ENTITIES;
  if (aMimeType == "text/plain")
  {
    // When saving in "text/plain" format, always do formatting
    outputFlags |= webPersist.ENCODE_FLAGS_FORMATTED;
  }
  else
  {
    try {
      var prefs = GetPrefs();

      // How much entity names should we output? Check the pref
      var encodeEntity = prefs.getCharPref("editor.encode_entity");
      switch (encodeEntity) {
	case "basic"  : outputFlags =
webPersist.ENCODE_FLAGS_ENCODE_BASIC_ENTITIES; break;
	case "latin1" : outputFlags =
webPersist.ENCODE_FLAGS_ENCODE_LATIN1_ENTITIES; break;
	case "html"   : outputFlags =
webPersist.ENCODE_FLAGS_ENCODE_HTML_ENTITIES; break;
	case "none"   : outputFlags = 0; break;
      }

      // Should we prettyprint? Check the pref
      if (prefs.getBoolPref("editor.prettyprint"))
	outputFlags |= webPersist.ENCODE_FLAGS_FORMATTED;
    }
    catch (e) {}
  }
Comment 38 Yuying Long 2002-12-27 15:21:12 PST
I have tried to veirify the fix by using test case in comment #3 and steps in
comment #5:
On 12-26 trunk build / WinXP, &#178; still show as "&#178;" when I opened the
saved file in notepad, not like in comment #5 said it should be "&sup2".  Is it
OK or not?
Comment 39 rbs 2002-12-27 15:33:15 PST
This is now controlled by a pref. In your prefs.js (or user.js or editor.js),
you also need to additionally set:
pref("editor.encode_entity", "html"); // which includes &sup2; &alpha; etc

or, to just limit the conversion to character codes below 255:
pref("editor.encode_entity", "latin1");
Comment 40 Yuying Long 2002-12-27 15:46:02 PST
I see.  Verified as fixed by following the description above.  Thanks!

Note You need to log in before you can comment on or make changes to this bug.