Closed Bug 169590 Opened 22 years ago Closed 22 years ago

Error in unicode conversion.

Categories

(Core :: Internationalization, defect)

Platform: x86, Windows 2000
Type: defect
Priority: Not set
Severity: normal

Tracking

Status: VERIFIED FIXED
Target Milestone: mozilla1.3alpha

People

(Reporter: ssaux, Assigned: rbs)

Details

(Keywords: intl)

Attachments

(3 files, 2 obsolete files)

Gecko's nsHTMLContentSerializer::AppendToString() has a bug. In this
method, it converts Unicode code points between 128 and 255 to entities
such as "&nbsp;", but it only does so when the document charset encoding is
set to ISO-8859-1. Since the method is dealing with Unicode, it should
convert code points in this range regardless of the document charset encoding.
According to Taka:
http://lxr.mozilla.org/mozilla/source/content/base/src/nsHTMLContentSerializer.cpp#156
(test on latin1) should be removed.
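For illustration only, a minimal standalone sketch of the requested behavior (not the actual Gecko code): escape markup characters and emit character references for the 128-255 range regardless of the document charset. The real serializer emits named entities looked up through the parser service; numeric references stand in for them here.

  #include <string>

  // Sketch: serialize one Unicode code point for HTML output, using
  // character references for the Latin-1 supplement range no matter
  // what the document charset is.
  std::string SerializeChar(unsigned int cp) {
      if (cp == '<') return "&lt;";
      if (cp == '>') return "&gt;";
      if (cp == '&') return "&amp;";
      if (cp > 127 && cp < 256)
          return "&#" + std::to_string(cp) + ";";   // e.g. U+00A0 -> "&#160;"
      if (cp > 255)
          return "&#" + std::to_string(cp) + ";";   // the real code leaves these to charset conversion
      return std::string(1, static_cast<char>(cp)); // plain ASCII passes through
  }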
Keywords: intl
QA Contact: ruixu → ylong
Reassigning to shanjian.
Assignee: yokoyama → shanjian
Attached file test case
Attached patch patch (obsolete) — Splinter Review
How to verify this fix:
Open the attached test case in the browser and choose File->Edit Page. Inside
Composer, select Save As and choose a different name to save it. Open the new
HTML file in Notepad and check its source. Before the patch, &#178; will be
displayed as "&#178;". After the patch, it should be displayed as "&sup2;".

Status: NEW → ASSIGNED
naoki, could you r=?
This was actually done intentionally. For Latin1 documents, it is preferred to
use entities even when they can be encoded by the document charset. I think this
is also done in order to keep the save behavior the same as in 4.x.
I think we want to keep the current behavior unless we get complaints from users.

cc to akkana, bobj, jst
Resolving as WFM per naoki's comment.
Status: ASSIGNED → RESOLVED
Closed: 22 years ago
Resolution: --- → WORKSFORME
don't rush ;-)

I encountered this problem while trying to figure out why
nsIEditor::OutputToString() doesn't return entities even though I'm specifying the
nsIDocumentEncoder::OutputEncodeEntities flag. I set the document encoding to UTF-8.
Is it expected that nobody composes a Latin document in UTF-8 encoding?  What
about ISO-8859-15?  Should the output be different between ISO-8859-1 and
ISO-8859-15? I don't think so.

Especially the non-breaking space character is problematic. As long as you stick to
the 0xA0 code point, it won't be portable across character set encodings (e.g.
JIS0208 doesn't have such a code point). The only solution for this is to use the
&nbsp; entity instead of a charset-encoding-dependent code point.

Re-opening.
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
I am confused. &#160; does not mean code point 160 in the current charset; it
always means U+00A0 in Unicode. Why doesn't &#160; work while &nbsp; works? Does it
indicate another problem?
Status: REOPENED → ASSIGNED
There are two cases.
1) Use entities only if the character cannot be encoded to the target charset.
2) Use available entities as much as possible.

For #1, that is done as a fallback for the charset conversion. This is basically
the way the serializer is implemented.
For #2, the entity mapping can be applied before the charset conversion. This is
not currently done by the serializer.
nsISaveAsCharset has an option to do this, which also takes a flag for the kind of
entities (e.g. Latin1, Symbol). I think a similar feature can be added to the
serializer.
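Taking the non-breaking space as a hypothetical example, with a JIS-based output charset that has no 0xA0 code point (an illustration, not text from the bug):

  Case #1 (fallback):       charset conversion is attempted first; U+00A0 cannot be
                            encoded, so the fallback emits a reference such as &#160;
  Case #2 (entities first): the entity table is consulted before charset conversion
                            and &nbsp; is emitted directly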



Please look at the problem from embedding application developers' point of view.
As for me, I'm not using Gecko's charset converter. Thus, #1 is of no help.

The existence of the nsIDocumentEncoder::OutputEncodeEntities flag implies #2, and it's
easily done by removing the test on the document charset encoding.
Any comments or updates?
Reassigning this one to naoki. The question now is about the possible impact this
patch may have. Naoki is more aware of those issues than I am.
Assignee: shanjian → nhotta
Status: ASSIGNED → NEW
>I'm not using Gecko's charset converter. Thus, #1 is of no help.
nsIEntityConverter may be used to encode Unicode to entities.

>The existence of the nsIDocumentEncoder::OutputEncodeEntities flag implies #2
If the flag is set then we should create the entity (CER) before the charset
conversion. I think the serializer is working that way.
Status: NEW → ASSIGNED
>>The existence of the nsIDocumentEncoder::OutputEncodeEntities flag implies #2
>If the flag is set then we should create the entity (CER) before the
>charset conversion. I think the serializer is working that way.

Only when the document charset encoding is ISO-8859-1. It should be
consistent regardless of the document charset encoding.
If that is the case then we need to remove that from the serializer and move the
special-case handling up to the editor.
Target Milestone: --- → mozilla1.3alpha
I might take this bug to implement the combined thoughts from this bug and
bug 165686 comment 21 while the what and where of the changes are still fresh in my mind.

The current plan is:

1) Remove the charset check (in other words, revert the entire patch of bug
65324 where it was introduced).

2) Remove nsIDocumentEncoder::OutputEncodeEntities and instead define:

nsIDocumentEncoder::OutputEncodeBasicEntities (just for 'nbsp', 'amp', 'gt', ...)
for callers such as nsGenericHTMLElement::GetInnerHTML() to use for
compatibility with IE6's innerHTML or for interoperability with other 4xp
products that don't support &alpha; and friends.

nsIDocumentEncoder::OutputEncodeHTMLEntities
for other callers who want a blanket conversion of all HTML entities, as
requested in this bug.

3) special-case at call sites:

   * for GetInnerHTML (bug 165686)
     use flag = OutputEncodeBasicEntities

   * for the editor (comment 16 and comment 17)
     use flag = (charset == ISO-8859-1)
              ? OutputEncodeHTMLEntities 
              : OutputEncodeBasicEntities

   * for "Save As..."
     use flag = (is pref saying to encode entity names)
              ? OutputEncodeHTMLEntities
              : OutputEncodeBasicEntities

   * other callers (comment 9) will set as desired.

Anything missing from the plan?
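As a rough translation of the call-site selection into code (illustrative only; the helper function and the plain enum below are hypothetical stand-ins for the proposed nsIDocumentEncoder constants):

  // Hypothetical stand-ins for the proposed flags.
  enum EntityFlag {
      kEncodeBasicEntities,  // just 'nbsp', 'amp', 'gt', ...
      kEncodeHTMLEntities    // the full HTML4 entity set
  };

  // Illustrative flag selection mirroring the call-site list above.
  EntityFlag ChooseEntityFlag(bool isInnerHTML, bool isEditor,
                              bool charsetIsLatin1, bool prefWantsEntityNames) {
      if (isInnerHTML)   // GetInnerHTML (bug 165686): IE6/4xp compatibility
          return kEncodeBasicEntities;
      if (isEditor)      // the editor (comment 16 and comment 17)
          return charsetIsLatin1 ? kEncodeHTMLEntities : kEncodeBasicEntities;
      // "Save As...": honor the pref about encoding entity names
      return prefWantsEntityNames ? kEncodeHTMLEntities : kEncodeBasicEntities;
  }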
> 1)
The patch in attachment 99862 [details] [diff] [review] is doing what is needed. It only needs some tweaks to
accommodate the two options now.
1) Upon noting that nsIWebBrowserPersist.idl is @UNDER_REVIEW and will be
frozen soon, thereby complicating any further iteration, I eagerly added a
catch-all flag, OutputEncodeW3CEntities, for documents with mixed content
(e.g., XHTML+MathML) to leave room for some future extensions (not yet
implemented).

2) I didn't do anything re: "Save" with pref for HTML-entity-zation because it
wasn't clear to me how to guarantee its consistent effectiveness. The current
code that saves the document uses the download manager; c.f. foundHeaderInfo()
in xpfe/communicator/resources/content/contentAreaUtils.js. Sometimes it calls
nsIWebBrowserPersist::saveDocument(), while other times it instead calls
nsIWebBrowserPersist::saveURI(). The latter just works like a copy a-la FTP.

If the pref was meant for the editor (which I am starting to think was the
case), then of course this can be easily done by tweaking GetOutputFlags() in
editor/ui/composer/content/ComposerCommands.js. GetOutputFlags() purposely
"returns output flags based on mimetype, wrapCol and prefs".

Speaking of this function, is there any reason why it sets the encoding flag even
for text/plain?
 
3) For other embedders (i.e., this bug), they can now set their desired
entity-ization flag to achieve what they prefer.
Attachment #99862 - Attachment is obsolete: true
-          } else if (mIsLatin1 && val > 127 && val < 256) {
+          } else if ((mFlags & nsIDocumentEncoder::OutputEncodeHTMLEntities) &&
+                     (val > 127) && (val < 256)) {

Hey... this excludes &alpha; and other Greek letters. Perhaps the comment needs
to be updated. Or alternatively, there could be a flag for the latin1 case and
another for the complete HTML case. Thoughts?
Thanks for the patch; I'm reassigning this to you.
Can OutputEncodeBasicEntities only include the four (&lt; &gt; &amp; &quot;)
which really need to be encoded?
Assignee: nhotta → rbs
Status: ASSIGNED → NEW
Yes, that is what the code is presently doing. I didn't document it accurately.
The entities &nbsp; &lt; &gt; &amp; get converted in the textual content of
tags, while the entities &nbsp; &lt; &gt; &amp; &quot; get converted in
attributes. [The fact that &quot; isn't recovered in the textual content is
because it is not (presently) possible to tell if a |"| character that shows up
in the DOM was initially a literal |"| character or an entity |&quot;| in the
source. From bug 165686 comment 7, there could be entity nodes in the DOM to
allow exact recovery, but that is something further down the track. They would
just make the entity-zation of the |"| work better. However unlike |&nbsp;| the
current beahvior of not recovering |&quot;| doesn't cause datatoss during
copy-paste. So no particular urgency here.]

Two things that I have to do:
1) update the documentation as per the clarification above
2) use OutputEncodeLatin1Entities for the Latin1 case.
Attachment #107699 - Attachment is obsolete: true
Comment on attachment 108016 [details] [diff] [review]
 patch to allow consumers of the encoder to do what they want - take2

Seeking r/sr.

Note: I will be filing another bug for the catch-all case once done with this.
Shouldn't be hard to implement. Rather than the HTML parser service, using  the
entity converter (nsIEntityConverter::ConvertToEntity with all its flags) might
do the trick for that.
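A rough sketch of what that could look like; treat the exact nsIEntityConverter signature, constants, and contract ID here as assumptions recalled from the IDL rather than verified code, and the surrounding variable names (aChar, aOutputStr) as placeholders:

  // Hypothetical: map one character to a W3C entity via the entity
  // converter instead of the HTML parser service.
  nsresult rv;
  nsCOMPtr<nsIEntityConverter> converter =
      do_CreateInstance("@mozilla.org/intl/entityconverter;1", &rv);
  if (NS_SUCCEEDED(rv)) {
      char* entity = nsnull;
      // entityW3C is expected to cover the HTML4 + MathML entity sets
      rv = converter->ConvertToEntity(aChar, nsIEntityConverter::entityW3C, &entity);
      if (NS_SUCCEEDED(rv) && entity) {
          aOutputStr.AppendWithConversion(entity);  // e.g. U+03B1 -> "&alpha;"
          nsMemory::Free(entity);
      }
  }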
Attachment #108016 - Flags: superreview?(jst)
Attachment #108016 - Flags: review?(nhotta)
One question: what is the difference between OutputEncodeW3CEntities and
OutputEncodeHTMLEntities, or are they the same?
Comment on attachment 108016 [details] [diff] [review]
 patch to allow consumers of the encoder to do what they want - take2

sr=jst
Attachment #108016 - Flags: superreview?(jst) → superreview+
Re: comment 26
They are not the same.

OutputEncodeHTMLEntities is the set of entities defined by the HTML4 spec only,
whereas OutputEncodeW3CEntities is the overall sum of entities defined at the W3C
for specs that we know something about (see nsIEntityConverter.idl, on which
the implementation will be based, as I alluded to earlier). For example, if your
document is in XHTML, you can mix HTML+MathML, and since MathML has its own set
of entities, doing a conversion just with HTMLEntities would leave other bits
out.
Attachment #108016 - Flags: review?(nhotta) → review+
This is the remainder of the task to be done. To keep the ball rolling I am
doing it here rather than in another bug. The patch adds a (hidden) user pref
allowing the editor to save with as many entity names as the user wants,
irrespective of the charset. The current behavior of defaulting to latin1
entities when the charset is ISO-8859-1 has been retained, and the user pref
allows the user to override that default.

Tested with the editor's "<HTML> Source" and "Save As".
Comment on attachment 108016 [details] [diff] [review]
 patch to allow consumers of the encoder to do what they want - take2

I forgot to mention that this one was already checked in.
Attachment #109424 - Flags: superreview?(jst)
Attachment #109424 - Flags: review?(nhotta)
+        case "html"   : outputEntity =
webPersist.ENCODE_FLAGS_ENCODE_W3C_ENTITIES; break;

oops, s/ENCODE_FLAGS_ENCODE_W3C_ENTITIES/ENCODE_FLAGS_ENCODE_HTML_ENTITIES/
                           ^^^^^                            ^^^^^^
(I didn't exposose the catch-all "w3c" case, but was using the editor to test
that the added codepath in the serializer was OK, and I forgot to revert back to
"html". Will do so in my tree.)
Comment on attachment 109424 [details] [diff] [review]
part2: implement the remaining catch-all case and the user-pref

r=nhotta

+	 char* fullEntityText = nsnull;
xpidl string can be used here

+ * output &nbsp; otherwise, we'll output 0xa0.
Depending on the output charset, an NCR may be generated; please mention that in the
comment.
Attachment #109424 - Flags: review?(nhotta) → review+
> +	 char* fullEntityText = nsnull;
> xpidl string can be used here

I hesitated to do that for perf reasons. Since the codepath is rarely going to be
used, keeping the raw char* avoids the undue cost of the xpidl string's
ctor/dtor for each character in the loop.

> + * output &nbsp; otherwise, we'll output 0xa0.
> depends on the output charset, NCR may be generated, please mention that in the
> comment

It is a bit tricky to explain (since 'nbsp' is invariant/particular - comment
#10), so I changed it to: "If set, we'll output &nbsp; otherwise, we may output 0xa0
depending on the charset."
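To make the new comment concrete, here is how a lone non-breaking space could come out under the various settings (a hypothetical illustration, assuming the charset-conversion fallback produces NCRs as noted above):

  flag set:                          &nbsp;
  flag not set, charset ISO-8859-1:  the raw 0xA0 byte
  flag not set, charset Shift_JIS:   the NCR &#160; (fallback for an unmappable character)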
Comment on attachment 109424 [details] [diff] [review]
part2: implement the remaining catch-all case and the user-pref

sr=jst
Attachment #109424 - Flags: superreview?(jst) → superreview+
Checked in. Bug completed.
Status: NEW → RESOLVED
Closed: 22 years ago22 years ago
Resolution: --- → FIXED
I forgot to add r=nhotta and sr=jst in my checkin comment... Last days of a long
year?
Comment on attachment 109424 [details] [diff] [review]
part2: implement the remaining catch-all case and the user-pref

> function GetOutputFlags(aMimeType, aWrapColumn)
> {
>+  var outputFlags = 0;
>   var editor = GetCurrentEditor();
>-  var outputFlags = (editor && editor.documentCharacterSet == "ISO-8859-1")
>+  var outputEntity = (editor && editor.documentCharacterSet == "ISO-8859-1")
>     ? webPersist.ENCODE_FLAGS_ENCODE_LATIN1_ENTITIES
>     : webPersist.ENCODE_FLAGS_ENCODE_BASIC_ENTITIES;
>   if (aMimeType == "text/plain")
>@@ -997,14 +998,23 @@
>   }
>   else
>   {
>-    // Should we prettyprint? Check the pref
>     try {
>+      // Should we prettyprint? Check the pref
>       var prefs = GetPrefs();
>       if (prefs.getBoolPref("editor.prettyprint"))
>         outputFlags |= webPersist.ENCODE_FLAGS_FORMATTED;
>+      // How much entity names should we output? Check the pref
>+      var encodeEntity = prefs.getCharPref("editor.encode_entity");
>+      switch (encodeEntity) {
>+        case "basic"  : outputEntity = webPersist.ENCODE_FLAGS_ENCODE_BASIC_ENTITIES; break;
>+        case "latin1" : outputEntity = webPersist.ENCODE_FLAGS_ENCODE_LATIN1_ENTITIES; break;
>+        case "html"   : outputEntity = webPersist.ENCODE_FLAGS_ENCODE_W3C_ENTITIES; break;
>+        case "none"   : outputEntity = 0; break;
>+      }
>     }
>     catch (e) {}
>   }
>+  outputFlags |= outputEntity;
The way you did this in editor.js is a lot more readable, and would translate to
something like this:
function GetOutputFlags(aMimeType, aWrapColumn)
{
  var editor = GetCurrentEditor();
  var outputFlags = (editor && editor.documentCharacterSet == "ISO-8859-1")
    ? webPersist.ENCODE_FLAGS_ENCODE_LATIN1_ENTITIES
    : webPersist.ENCODE_FLAGS_ENCODE_BASIC_ENTITIES;
  if (aMimeType == "text/plain")
  {
    // When saving in "text/plain" format, always do formatting
    outputFlags |= webPersist.ENCODE_FLAGS_FORMATTED;
  }
  else
  {
    try {
      var prefs = GetPrefs();

      // How much entity names should we output? Check the pref
      var encodeEntity = prefs.getCharPref("editor.encode_entity");
      switch (encodeEntity) {
        case "basic"  : outputFlags = webPersist.ENCODE_FLAGS_ENCODE_BASIC_ENTITIES; break;
        case "latin1" : outputFlags = webPersist.ENCODE_FLAGS_ENCODE_LATIN1_ENTITIES; break;
        case "html"   : outputFlags = webPersist.ENCODE_FLAGS_ENCODE_HTML_ENTITIES; break;
        case "none"   : outputFlags = 0; break;
      }

      // Should we prettyprint? Check the pref
      if (prefs.getBoolPref("editor.prettyprint"))
	outputFlags |= webPersist.ENCODE_FLAGS_FORMATTED;
    }
    catch (e) {}
  }
I have tried to verify the fix by using the test case in comment #3 and the steps in
comment #5:
On the 12-26 trunk build / WinXP, &#178; still shows as "&#178;" when I open the
saved file in Notepad, not "&sup2;" as comment #5 said it should be.  Is that
OK or not?
This is now controlled by a pref. In your prefs.js (or user.js or editor.js),
you additionally need to set:
pref("editor.encode_entity", "html"); // which includes &sup2; &alpha; etc

or, to just limit the conversion to character codes below 255:
pref("editor.encode_entity", "latin1");
I see.  Verified as fixed by following the description above.  Thanks!
Status: RESOLVED → VERIFIED