Closed Bug 169590 Opened 22 years ago Closed 22 years ago

Error in unicode conversion.

Categories

(Core :: Internationalization, defect)

Platform: x86, Windows 2000
Type: defect
Priority: Not set
Severity: normal

Tracking

Status: VERIFIED FIXED
Target Milestone: mozilla1.3alpha

People

(Reporter: ssaux, Assigned: rbs)

Details

(Keywords: intl)

Attachments

(3 files, 2 obsolete files)

Gecko's nsHTMLContentSerializer::AppendToString() has a bug. In this
method, it converts Unicode code points between 128 and 255 to entities
such as "&nbsp;", but it only does so when the document charset encoding is
set to ISO-8859-1. Since the method is dealing with Unicode, it should
convert code points in this range regardless of the document charset encoding.
According to Taka:
http://lxr.mozilla.org/mozilla/source/content/base/src/nsHTMLContentSerializer.cpp#156
(test on latin1) should be removed.
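For illustration only, a minimal standalone sketch of the requested behavior (not the actual Gecko code): escape markup characters and emit character references for the 128-255 range regardless of the document charset. The real serializer emits named entities looked up through the parser service; numeric references stand in for them here.

  #include <string>

  // Sketch: serialize one Unicode code point for HTML output, using
  // character references for the Latin-1 supplement range no matter
  // what the document charset is.
  std::string SerializeChar(unsigned int cp) {
      if (cp == '<') return "&lt;";
      if (cp == '>') return "&gt;";
      if (cp == '&') return "&amp;";
      if (cp > 127 && cp < 256)
          return "&#" + std::to_string(cp) + ";";   // e.g. U+00A0 -> "&#160;"
      if (cp > 255)
          return "&#" + std::to_string(cp) + ";";   // the real code leaves these to charset conversion
      return std::string(1, static_cast<char>(cp)); // plain ASCII passes through
  }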
Keywords: intl
QA Contact: ruixu → ylong
Reassigning to shanjian.
Assignee: yokoyama → shanjian
Attached file test case
Attached patch patch (obsolete) — Splinter Review
How to verify this fix:
Open the attached test case in the browser and choose File->Edit Page. Inside
Composer, select Save As and choose a different name to save it. Open the new
HTML file in Notepad and check its source. Before the patch, &#178; will be
displayed as "&#178;". After the patch, it should be displayed as "&sup2;".

Status: NEW → ASSIGNED
naoki, could you r=?
This was actually done intentionally. For Latin1 documents, it is preferred to
use entities even when they can be encoded by the document charset. I think this
is also done in order to keep the save behavior the same as in 4.x.
I think we want to keep the current behavior unless we get complaints from users.

cc to akkana, bobj, jst
Resolving as WFM per naoki's comment.
Status: ASSIGNED → RESOLVED
Closed: 22 years ago
Resolution: --- → WORKSFORME
don't rush ;-)

I encountered this problem while trying to figure out why
nsIEditor::OutputToString() doesn't return entities even though I'm specifying the
nsIDocumentEncoder::OutputEncodeEntities flag. I set the document encoding to UTF-8.
Is it expected that nobody composes a Latin document in UTF-8 encoding?  What
about ISO-8859-15?  Should the output be different between ISO-8859-1 and
ISO-8859-15? I don't think so.

Especially the non-breaking space character is problematic. As long as you stick to
the 0xA0 code point, it won't be portable across character set encodings (e.g.
JIS0208 doesn't have such a code point). The only solution for this is to use the
&nbsp; entity instead of a charset-encoding-dependent code point.

Re-opening.
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
I am confused. &#160; does not mean code point 160 in the current charset; it
always means U+00A0 in Unicode. Why doesn't &#160; work while &nbsp; works? Does it
indicate another problem?
Status: REOPENED → ASSIGNED
There are two cases.
1) Use entities only if the character cannot be encoded to the target charset.
2) Use available entities as much as possible.

For #1, that is done as a fallback for the charset conversion. This is basically
the way the serializer is implemented.
For #2, the entity mapping can be applied before the charset conversion. This is
not currently done by the serializer.
nsISaveAsCharset has an option to do this, which also takes a flag for the kind of
entities (e.g. Latin1, Symbol). I think a similar feature can be added to the
serializer.
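Taking the non-breaking space as a hypothetical example, with a JIS-based output charset that has no 0xA0 code point (an illustration, not text from the bug):

  Case #1 (fallback):       charset conversion is attempted first; U+00A0 cannot be
                            encoded, so the fallback emits a reference such as &#160;
  Case #2 (entities first): the entity table is consulted before charset conversion
                            and &nbsp; is emitted directly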



Please look at the problem from embedding application developers' point of view.
As for me, I'm not using Gecko's charset converter. Thus, #1 is of no help.

The existence of the nsIDocumentEncoder::OutputEncodeEntities flag implies #2, and it's
easily done by removing the test on the document charset encoding.
Any comments or updates?
Reassigning this one to naoki. The question now is about the possible impact this
patch may have. Naoki is more aware of those issues than I am.
Assignee: shanjian → nhotta
Status: ASSIGNED → NEW
>I'm not using Gecko's charset converter. Thus, #1 is of no help.
nsIEntityConverter may be used to encode Unicode to entities.

>The existence of the nsIDocumentEncoder::OutputEncodeEntities flag implies #2
If the flag is set then we should create the entity (CER) before the charset
conversion. I think the serializer is working that way.
Status: NEW → ASSIGNED
>>The existence of the nsIDocumentEncoder::OutputEncodeEntities flag implies #2
>If the flag is set then we should create the entity (CER) before the
>charset conversion. I think the serializer is working that way.

Only when the document charset encoding is ISO-8859-1. It should be
consistent regardless of the document charset encoding.
If that is the case then we need to remove that from the serializer and move the
special-case handling up to the editor.
Target Milestone: --- → mozilla1.3alpha
I might take this bug to implement the combined thoughts from this bug and
bug 165686 comment 21 while the what and where of the changes are still fresh in my mind.

The current plan is:

1) Remove the charset check (in other words, revert the entire patch of bug
65324 where it was introduced).

2) Remove nsIDocumentEncoder::OutputEncodeEntities and instead define:

nsIDocumentEncoder::OutputEncodeBasicEntities (just for 'nbsp', 'amp', 'gt', ...)
for callers such as nsGenericHTMLElement::GetInnerHTML() to use for
compatibility with IE6's innerHTML or for interoperability with other 4xp
products that don't support &alpha; and friends.

nsIDocumentEncoder::OutputEncodeHTMLEntities
for other callers who want a blanket conversion of all HTML entities, as
requested in this bug.

3) special-case at call sites:

   * for GetInnerHTML (bug 165686)
     use flag = OutputEncodeBasicEntities

   * for the editor (comment 16 and comment 17)
     use flag = (charset == ISO-8859-1)
              ? OutputEncodeHTMLEntities 
              : OutputEncodeBasicEntities

   * for "Save As..."
     use flag = (is pref saying to encode entity names)
              ? OutputEncodeHTMLEntities
              : OutputEncodeBasicEntities

   * other callers (comment 9) will set as desired.

Anything missing from the plan?
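As a rough translation of the call-site selection into code (illustrative only; the helper function and the plain enum below are hypothetical stand-ins for the proposed nsIDocumentEncoder constants):

  // Hypothetical stand-ins for the proposed flags.
  enum EntityFlag {
      kEncodeBasicEntities,  // just 'nbsp', 'amp', 'gt', ...
      kEncodeHTMLEntities    // the full HTML4 entity set
  };

  // Illustrative flag selection mirroring the call-site list above.
  EntityFlag ChooseEntityFlag(bool isInnerHTML, bool isEditor,
                              bool charsetIsLatin1, bool prefWantsEntityNames) {
      if (isInnerHTML)   // GetInnerHTML (bug 165686): IE6/4xp compatibility
          return kEncodeBasicEntities;
      if (isEditor)      // the editor (comment 16 and comment 17)
          return charsetIsLatin1 ? kEncodeHTMLEntities : kEncodeBasicEntities;
      // "Save As...": honor the pref about encoding entity names
      return prefWantsEntityNames ? kEncodeHTMLEntities : kEncodeBasicEntities;
  }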
> 1)
The patch in attachment 99862 [details] [diff] [review] is doing what is needed. It only needs some tweaks to
accommodate the two options now.
1) Upon noting that nsIWebBrowserPersist.idl is @UNDER_REVIEW and will be
frozen soon, thereby complicating any further iteration, I eagerly added a
catch-all flag, OutputEncodeW3CEntities, for documents with mixed content
(e.g., XHTML+MathML) to leave room for some future extensions (not yet
implemented).

2) I didn't do anything re: "Save" with pref for HTML-entity-zation because it
wasn't clear to me how to guarantee its consistent effectiveness. The current
code that saves the document uses the download manager; c.f. foundHeaderInfo()
in xpfe/communicator/resources/content/contentAreaUtils.js. Sometimes it calls
nsIWebBrowserPersist::saveDocument(), while other times it instead calls
nsIWebBrowserPersist::saveURI(). The latter just works like a copy a-la FTP.

If the pref was meant for the editor (which I am starting to think was the
case), then of course this can be easily done by tweaking GetOutputFlags() in
editor/ui/composer/content/ComposerCommands.js. GetOutputFlags() purposely
"returns output flags based on mimetype, wrapCol and prefs".

Speaking of this function, is there any reason why it sets the encoding flag even
for text/plain?
 
3) For other embedders (i.e., this bug), they can now set their desired
entity-ization flag to achieve what they prefer.
Attachment #99862 - Attachment is obsolete: true
-          } else if (mIsLatin1 && val > 127 && val < 256) {
+          } else if ((mFlags & nsIDocumentEncoder::OutputEncodeHTMLEntities) &&
+                     (val > 127) && (val < 256)) {

Hey... this excludes &alpha; and other Greek letters. Perhaps the comment needs
to be updated. Or alternatively, there could be a flag for the latin1 case and
another for the complete HTML case. Thoughts?
Thanks for the patch; I'm reassigning this to you.
Can OutputEncodeBasicEntities only include the four (&lt; &gt; &amp; &quot;)
which really need to be encoded?
Assignee: nhotta → rbs
Status: ASSIGNED → NEW
Yes, that is what the code is presently doing. I didn't document it accurately.
The entities &nbsp; &lt; &gt; &amp; get converted in the textual content of
tags, while the entities &nbsp; &lt; &gt; &amp; &quot; get converted in
attributes. [The fact that &quot; isn't recovered in the textual content is
because it is not (presently) possible to tell if a |"| character that shows up
in the DOM was initially a literal |"| character or an entity |&quot;| in the
source. From bug 165686 comment 7, there could be entity nodes in the DOM to
allow exact recovery, but that is something further down the track. They would
just make the entity-zation of the |"| work better. However unlike |&nbsp;| the
current beahvior of not recovering |&quot;| doesn't cause datatoss during
copy-paste. So no particular urgency here.]

Two things that I have to do:
1) update the documentation as per the clarification above
2) use OutputEncodeLatin1Entities for the Latin1 case.
Attachment #107699 - Attachment is obsolete: true
Comment on attachment 108016 [details] [diff] [review]
 patch to allow consumers of the encoder to do what they want - take2

Seeking r/sr.

Note: I will be filing another bug for the catch-all case once done with this.
Shouldn't be hard to implement. Rather than the HTML parser service, using  the
entity converter (nsIEntityConverter::ConvertToEntity with all its flags) might
do the trick for that.
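A rough sketch of what that could look like; treat the exact nsIEntityConverter signature, constants, and contract ID here as assumptions recalled from the IDL rather than verified code, and the surrounding variable names (aChar, aOutputStr) as placeholders:

  // Hypothetical: map one character to a W3C entity via the entity
  // converter instead of the HTML parser service.
  nsresult rv;
  nsCOMPtr<nsIEntityConverter> converter =
      do_CreateInstance("@mozilla.org/intl/entityconverter;1", &rv);
  if (NS_SUCCEEDED(rv)) {
      char* entity = nsnull;
      // entityW3C is expected to cover the HTML4 + MathML entity sets
      rv = converter->ConvertToEntity(aChar, nsIEntityConverter::entityW3C, &entity);
      if (NS_SUCCEEDED(rv) && entity) {
          aOutputStr.AppendWithConversion(entity);  // e.g. U+03B1 -> "&alpha;"
          nsMemory::Free(entity);
      }
  }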
Attachment #108016 - Flags: superreview?(jst)
Attachment #108016 - Flags: review?(nhotta)
One question: what is the difference between OutputEncodeW3CEntities and
OutputEncodeHTMLEntities, or are they the same?
Comment on attachment 108016 [details] [diff] [review]
 patch to allow consumers of the encoder to do what they want - take2

sr=jst
Attachment #108016 - Flags: superreview?(jst) → superreview+
Re: comment 26
They are not the same.

OutputEncodeHTMLEntities is the set of entities defined by the HTML4 spec only,
whereas OutputEncodeW3CEntities is the overall sum of entities defined at the W3C
for specs that we know something about (see nsIEntityConverter.idl, on which
the implementation will be based, as I alluded to earlier). For example, if your
document is in XHTML, you can mix HTML+MathML, and since MathML has its own set
of entities, doing a conversion just with HTMLEntities would leave other bits
out.
Attachment #108016 - Flags: review?(nhotta) → review+
This is the remainder of the task to be done. To keep the ball rolling I am
doing it here rather than in another bug. The patch adds a (hidden) user pref
allowing the editor to save with as many entity names as the user wants,
irrespective of the charset. The current behavior of defaulting to latin1
entities when the charset is ISO-8859-1 has been retained, and the user pref
allows the user to override that default.

Tested with the editor's "<HTML> Source" and "Save As".
Comment on attachment 108016 [details] [diff] [review]
 patch to allow consumers of the encoder to do what they want - take2

I forgot to mention that this one was already checked in.
Attachment #109424 - Flags: superreview?(jst)
Attachment #109424 - Flags: review?(nhotta)
+        case "html"   : outputEntity =
webPersist.ENCODE_FLAGS_ENCODE_W3C_ENTITIES; break;

oops, s/ENCODE_FLAGS_ENCODE_W3C_ENTITIES/ENCODE_FLAGS_ENCODE_HTML_ENTITIES/
                           ^^^^^                            ^^^^^^
(I didn't exposose the catch-all "w3c" case, but was using the editor to test
that the added codepath in the serializer was OK, and I forgot to revert back to
"html". Will do so in my tree.)
Comment on attachment 109424 [details] [diff] [review]
part2: implement the remaining catch-all case and the user-pref

r=nhotta

+	 char* fullEntityText = nsnull;
xpidl string can be used here

+ * output &nbsp; otherwise, we'll output 0xa0.
Depending on the output charset, an NCR may be generated; please mention that in the
comment.
Attachment #109424 - Flags: review?(nhotta) → review+
> +	 char* fullEntityText = nsnull;
> xpidl string can be used here

I hesitated to do that for perf reasons. Since the codepath is rarely going to be
used, keeping the raw char* avoids the undue cost of the xpidl string's
ctor/dtor for each character in the loop.

> + * output &nbsp; otherwise, we'll output 0xa0.
> depends on the output charset, NCR may be generated, please mention that in the
> comment

It is a bit tricky to explain (since 'nbsp' is invariant/particular - comment
#10), so I changed it to: "If set, we'll output &nbsp; otherwise, we may output 0xa0
depending on the charset."
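To make the new comment concrete, here is how a lone non-breaking space could come out under the various settings (a hypothetical illustration, assuming the charset-conversion fallback produces NCRs as noted above):

  flag set:                          &nbsp;
  flag not set, charset ISO-8859-1:  the raw 0xA0 byte
  flag not set, charset Shift_JIS:   the NCR &#160; (fallback for an unmappable character)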
Comment on attachment 109424 [details] [diff] [review]
part2: implement the remaining catch-all case and the user-pref

sr=jst
Attachment #109424 - Flags: superreview?(jst) → superreview+
Checked in. Bug completed.
Status: NEW → RESOLVED
Closed: 22 years ago22 years ago
Resolution: --- → FIXED
I forgot to add r=nhotta and sr=jst in my checkin comment... Last days of a long
year?
Comment on attachment 109424 [details] [diff] [review]
part2: implement the remaining catch-all case and the user-pref

> function GetOutputFlags(aMimeType, aWrapColumn)
> {
>+  var outputFlags = 0;
>   var editor = GetCurrentEditor();
>-  var outputFlags = (editor && editor.documentCharacterSet == "ISO-8859-1")
>+  var outputEntity = (editor && editor.documentCharacterSet == "ISO-8859-1")
>     ? webPersist.ENCODE_FLAGS_ENCODE_LATIN1_ENTITIES
>     : webPersist.ENCODE_FLAGS_ENCODE_BASIC_ENTITIES;
>   if (aMimeType == "text/plain")
>@@ -997,14 +998,23 @@
>   }
>   else
>   {
>-    // Should we prettyprint? Check the pref
>     try {
>+      // Should we prettyprint? Check the pref
>       var prefs = GetPrefs();
>       if (prefs.getBoolPref("editor.prettyprint"))
>         outputFlags |= webPersist.ENCODE_FLAGS_FORMATTED;
>+      // How much entity names should we output? Check the pref
>+      var encodeEntity = prefs.getCharPref("editor.encode_entity");
>+      switch (encodeEntity) {
>+        case "basic"  : outputEntity = webPersist.ENCODE_FLAGS_ENCODE_BASIC_ENTITIES; break;
>+        case "latin1" : outputEntity = webPersist.ENCODE_FLAGS_ENCODE_LATIN1_ENTITIES; break;
>+        case "html"   : outputEntity = webPersist.ENCODE_FLAGS_ENCODE_W3C_ENTITIES; break;
>+        case "none"   : outputEntity = 0; break;
>+      }
>     }
>     catch (e) {}
>   }
>+  outputFlags |= outputEntity;
The way you did this in editor.js is a lot more readable, and would translate to
something like this:
function GetOutputFlags(aMimeType, aWrapColumn)
{
  var editor = GetCurrentEditor();
  var outputFlags = (editor && editor.documentCharacterSet == "ISO-8859-1")
    ? webPersist.ENCODE_FLAGS_ENCODE_LATIN1_ENTITIES
    : webPersist.ENCODE_FLAGS_ENCODE_BASIC_ENTITIES;
  if (aMimeType == "text/plain")
  {
    // When saving in "text/plain" format, always do formatting
    outputFlags |= webPersist.ENCODE_FLAGS_FORMATTED;
  }
  else
  {
    try {
      var prefs = GetPrefs();

      // How much entity names should we output? Check the pref
      var encodeEntity = prefs.getCharPref("editor.encode_entity");
      switch (encodeEntity) {
        case "basic"  : outputFlags = webPersist.ENCODE_FLAGS_ENCODE_BASIC_ENTITIES; break;
        case "latin1" : outputFlags = webPersist.ENCODE_FLAGS_ENCODE_LATIN1_ENTITIES; break;
        case "html"   : outputFlags = webPersist.ENCODE_FLAGS_ENCODE_HTML_ENTITIES; break;
        case "none"   : outputFlags = 0; break;
      }

      // Should we prettyprint? Check the pref
      if (prefs.getBoolPref("editor.prettyprint"))
	outputFlags |= webPersist.ENCODE_FLAGS_FORMATTED;
    }
    catch (e) {}
  }
I have tried to verify the fix by using the test case in comment #3 and the steps in
comment #5:
On the 12-26 trunk build / WinXP, &#178; still shows as "&#178;" when I open the
saved file in Notepad, not "&sup2;" as comment #5 said it should be.  Is that
OK or not?
This is now controlled by a pref. In your prefs.js (or user.js or editor.js),
you additionally need to set:
pref("editor.encode_entity", "html"); // which includes &sup2; &alpha; etc

or, to just limit the conversion to character codes below 255:
pref("editor.encode_entity", "latin1");
I see.  Verified as fixed by following the description above.  Thanks!
Status: RESOLVED → VERIFIED