65324 - Composer should used numerical entities for Greek, not named entities

Reporter

Description

•

24 years ago

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Win 9x 4.90; en-US; 0.7) Gecko/20010109
BuildID:    2001010901

I pasted a couple of passages of Unicode Greek into Composer and saved as UTF-8.
 The text is saved using named entities (i.e., &amp;alpha; for a lower case
alpha, etc.) where available. These named entities are not recognized by
Netscape 4.7. Therefore numeric entities should be used instead.

Reproducible: Always
Steps to Reproduce:
Copy a Greek text in Unicode, say from http://www.perseus.tufts.edu, or one of
the attached pages. Paste it into Composer.  Save as character set UTF-8. Look
at the page with Communicator 4.7.	

Actual Results:  Saved

&amp;nu;&amp;epsilon;?&amp;omicron;&amp;sigmaf; &amp;mu;&amp;epsilon;?&amp;nu;
&amp;kappa;&amp;alpha;&amp;iota;?
&amp;alpha;??&amp;pi;&amp;epsilon;&amp;iota;&amp;rho;&amp;omicron;&amp;sigmaf;

[amp entity used in case the entities are resolved by bugzilla; if you see amp;
ignore it]

which is not readable in Netscape 4.7



Expected Results:  Should have saved

&#957;&#8051;&#959;&#962; &#956;&#8050;&#957; &#954;&#945;&#8054;
&#7940;&#960;&#949;&#953;&#961;&#959;&#962;

[note: this is the behavior of FrontPage 2000]

or (even better!)

[pretend this is the Unicode text; it looked fine when I pasted it into Bugzilla
Helper, but became unrecognized symbols after submitting to this form]

 ????? ???? ???? ?????????

[Note: this is the behavior of Word 2000]

which are readable in Netscape 4.7.

http://www.ziplink.net/~ptrourke/mozcompose1.html
http://www.ziplink.net/~ptrourke/mozcompose2.html

I'd like to be able to recommend Mozilla Composer to those who have to publish
polytonic Greek texts to the web (Greek literature scholars, biblical scholars,
etc.), but can't until this bug is fixed).

Thanks.

PTRourke

Reporter

Comment 1

•

24 years ago

*** Bug 65326 has been marked as a duplicate of this bug. ***

Fabian Guisset

Comment 2

•

24 years ago

*** Bug 65326 has been marked as a duplicate of this bug. ***

Asa Dotzler [:asa]

Comment 3

•

24 years ago

setting bug status to New

Status: UNCONFIRMED → NEW

Ever confirmed: true

PTRourke

Reporter

Comment 4

•

24 years ago

Note also: when editing an existing page which has numerical entities, Mozilla 
converts those with available named entities to named entities upon save.  

BTW, Shouldn't use of named entities be discouraged throughout, as so few are 
supported by XML?

rubydoo123

Comment 5

•

24 years ago

reassigning to nhotta. Naoki, the conversion from numeric entity to mneumonic 
entity should not happen, we should preserve what the user entered.

Assignee: beppe → nhotta

nhottanscp

Assignee

Comment 6

•

24 years ago

I think preserving the original is not easy. Input HTML may use numeric or
mneumonic (or both may be used).

Keywords: intl

nhottanscp

Assignee

Comment 7

•

24 years ago

So I think we want to have UI to let the user control the output.

nhottanscp

Assignee

Comment 8

•

24 years ago

But the case reported here, it saves as UTF-8, so output does not need to have
any entities, since the characters have code points in unicode.
I can reproduce this with 6.0 RTM. I think something got broken before RTM.

Status: NEW → ASSIGNED

nhottanscp

Assignee

Comment 9

•

24 years ago

Those entities are generated by nsHTMLContentSerializer.
This is related to jst's check in (e.g. nsDocumentEncoder.cpp, rev=1.33).
Cc to him, why is this entity conversion needed in there? I think Composer 
eventually does entity conversion (by using nsIEntityConverter) along with 
charset conversion, so I think it's not necessary doing it in 
nsHTMLContentSerializer.

nsHTMLEntities::UnicodeToEntity(int 0x000000f4) line 190
nsParserService::HTMLConvertUnicodeToEntity(const nsParserService * const 
0x03b4c540, int 0x000000f4, nsCString & {...}) line 105 + 9 bytes
nsHTMLContentSerializer::AppendToString(const basic_nsAReadableString<unsigned 
short> & {...}, basic_nsAWritableString<unsigned short> & {...}, int 0x00000001, 
int 0x00000000) line 532
nsXMLContentSerializer::AppendTextData(nsIDOMNode * 0x058561c0, int 0x00000000, 
int 0xffffffff, basic_nsAWritableString<unsigned short> & {...}, int 0x00000001, 
int 0x00000000) line 100 + 58 bytes
nsHTMLContentSerializer::AppendText(nsHTMLContentSerializer * const 0x05b9de80, 
nsIDOMText * 0x058561c0, int 0x00000000, int 0xffffffff, 
basic_nsAWritableString<unsigned short> & {...}) line 132 + 31 bytes
nsDocumentEncoder::SerializeNodeStart(nsIDOMNode * 0x058561c0, int 0x00000000, 
int 0xffffffff, basic_nsAWritableString<unsigned short> & {...}) line 260
nsDocumentEncoder::SerializeToStringRecursive(nsIDOMNode * 0x058561c0, 
basic_nsAWritableString<unsigned short> & {...}) line 315 + 20 bytes
nsDocumentEncoder::SerializeToStringRecursive(nsIDOMNode * 0x0579f2a4, 
basic_nsAWritableString<unsigned short> & {...}) line 336 + 21 bytes
nsDocumentEncoder::SerializeToStringRecursive(nsIDOMNode * 0x057d4fec, 
basic_nsAWritableString<unsigned short> & {...}) line 336 + 21 bytes
nsDocumentEncoder::SerializeToStringRecursive(nsIDOMNode * 0x057d2174, 
basic_nsAWritableString<unsigned short> & {...}) line 336 + 21 bytes
nsDocumentEncoder::EncodeToString(nsDocumentEncoder * const 0x05b9c7f0, 
basic_nsAWritableString<unsigned short> & {...}) line 882 + 21 bytes
nsDocumentEncoder::EncodeToStream(nsDocumentEncoder * const 0x05b9c7f0, 
nsIOutputStream * 0x05b99b30) line 917 + 19 bytes
nsDocument::SaveFile(nsDocument * const 0x057d218c, nsIFile * 0x05b9c910, int 
0x00000000, int 0x00000000, const unsigned short * 0x05b99c70, const unsigned 
short * 0x0423b3f0, unsigned int 0x00000102, unsigned int 0x00000048) line 3442 
+ 44 bytes
nsEditor::SaveFile(nsEditor * const 0x03c23e50, nsIFile * 0x05b9c910, int 
0x00000000, int 0x00000000, const nsString & {...}) line 1637 + 76 bytes
nsEditorShell::SaveDocument(nsEditorShell * const 0x05446dc0, int 0x00000001, 
int 0x00000000, int * 0x0012c3e8) line 1926 + 53 bytes

nhottanscp

Assignee

Comment 10

•

24 years ago

also cc to akkana

nhottanscp

Assignee

Comment 11

•

24 years ago

Reassign to jst. Entity generation should only be done as a fallback of charset
conversion error, also use Latin1 entity only for compatibility.

Assignee: nhotta → jst

Status: ASSIGNED → NEW

Keywords: nsbeta1

Johnny Stenback (:jst)

Comment 12

•

24 years ago

Reassigning to anthonyd who now owns this code.

Assignee: jst → anthonyd

PTRourke

Reporter

Comment 13

•

24 years ago

> But the case reported here, it saves as UTF-8, so output
> does not need to have any entities, since the characters
> have code points in unicode.
> [Later] Entity generation should only be done
> as a fallback of charset conversion error, also use Latin1
> entity only for compatibility.

This would be ideal.

rubydoo123

Comment 14

•

24 years ago

no, anthony does not own the conversion code, this is either nhotta or jst, 
assign back to nhotta

Assignee: anthonyd → nhotta

nhottanscp

Assignee

Comment 15

•

24 years ago

>> But the case reported here, it saves as UTF-8, so output
>> does not need to have any entities, since the characters
>> have code points in unicode.
>> [Later] Entity generation should only be done
>> as a fallback of charset conversion error, also use Latin1
>> entity only for compatibility.
>
> This would be ideal.

It had been working that way until beta3.
See nsDocumentEncoder.cpp, rev=1.33 ~ 1.35.
Reassign to jst.

Assignee: nhotta → jst

Johnny Stenback (:jst)

Comment 16

•

24 years ago

If this needs to be fixed for nsbeta1 someone other than me needs to grab this
and do something about it, I won't get to this in time for nsbeta1. Marking Future.

OS: Windows ME → All

Hardware: PC → All

Target Milestone: --- → Future

nhottanscp

Assignee

Comment 17

•

24 years ago

Reassign to nhotta.

Assignee: jst → nhotta

nhottanscp

Assignee

Comment 18

•

24 years ago

Attached patch Patch from jst to pass charset to serializers — Details — Splinter Review

nhottanscp

Assignee

Comment 19

•

24 years ago

r=nhotta
I will write a patch later to use the charset in HTML serializer.

Status: NEW → ASSIGNED

nhottanscp

Assignee

Comment 20

•

24 years ago

jst, who can I ask for a super-review for the patch?

Johnny Stenback (:jst)

Comment 21

•

24 years ago

Vidur could probably sr the patch.

One thing that I'm not sure about in the patch is:

  mIsLatin1 = aCharSet.Equals(NS_LITERAL_STRING("iso-8859-1"));

Does that do the right thing? Is the name of the latin 1 charset always
"iso-8859-1"? Would the Equals call need to be case in-sensitive?

nhottanscp

Assignee

Comment 22

•

24 years ago

It needs to get charset atom from the string using nsICharsetConverterManager2
and compare atoms.
I will do that with my other change to the nsHTMLContentSerializer (to actually
use that flag). Before that, let's leave the part unchanged since the flag is
not used anyway.

nhottanscp

Assignee

Comment 23

•

24 years ago

It would be better to pass nsIAtom* instead. I will try to do that.

nhottanscp

Assignee

Comment 24

•

24 years ago

Attached patch Patch, pass charset atom to init method and use it for entity generation — Details — Splinter Review

nhottanscp

Assignee

Comment 25

•

24 years ago

I changed Init method to pass charset atom and use it for NCR generation.
Jst, please review the patch.

nhottanscp

Assignee

Updated

•

24 years ago

Keywords: review

Target Milestone: Future → mozilla0.9

Johnny Stenback (:jst)

Comment 26

•

24 years ago

Sorry about dragging my feet here, I saw this patch but forgot about it until now...

I have a couple of issues with this pathc:

nsIContentSerializer.h:

-  NS_IMETHOD Init(PRUint32 flags, PRUint32 aWrapColumn) = 0;
+  NS_IMETHOD Init(PRUint32 flags, PRUint32 aWrapColumn,
+                  const nsIAtom* aCharSet) = 0;

do not make aCharSet *const*, const and interface pointers just don't mix, loose
the const'ness here.

nsDocumentEncoder.cpp:

+      char finish_buf[32];
+      charLength = 32;

smells like purify ABW's (Array Bounds Write) to mee, maybe I'm wrong, but
shouldn't charLength be 31 here?

In nsDocumentEncoder.cpp, around line 880, wouldn't the code be cleaner if you'd
declare charsetAtom nsCOMPtr outside the if block and then do:

  if (!mCharset.IsEmpty()) {
    // get the atom...
  }

  mSerializer->Init(..., charsetAtom);

less code, and more maintainable, no?

Having to get the charset converter manager in the html serializer just to do an
atom compare seems like lots of unneccessary work, couldn't you just get the
string out of the atom and do a string compare, you know the charset string (in
the atom) is normalized at that point, right? I'd say you should do something
like this:

  mIsLatin1 = PR_FALSE;
  if (aCharSet) {
    const PRUnichar *charset;
    aCharSet->GetUnicode(&charset);

    if (NS_LITERAL_STRING("ISO-8859-1").Equals(charset)) {
      mIsLatin1 = (aCharSet == charsetAtom);
    }
  }

This is less code, and faster, and it would work, or did I miss something here?

Other than that I think the patch looks good, could you make these changes and
attach a new patch?

Johnny Stenback (:jst)

Comment 27

•

24 years ago

Oops, my last comment contained an error, here's the correction:

 mIsLatin1 = PR_FALSE;
  if (aCharSet) {
    const PRUnichar *charset;
    aCharSet->GetUnicode(&charset);

    if (NS_LITERAL_STRING("ISO-8859-1").Equals(charset)) {
      mIsLatin1 = (aCharSet == charsetAtom);
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This line should've been simply:

      mIsLatin1 = PR_TRUE;

nhottanscp

Assignee

Comment 28

•

24 years ago

> nsDocumentEncoder.cpp:
>
> +      char finish_buf[32];
> +      charLength = 32;
>
> smells like purify ABW's (Array Bounds Write) to mee, maybe I'm wrong, but
> shouldn't charLength be 31 here?
32 is fine Finish takes a buffer and its length.

>
> In nsDocumentEncoder.cpp, around line 880, wouldn't the code be cleaner if you'd
> declare charsetAtom nsCOMPtr outside the if block and then do:
>
> if (!mCharset.IsEmpty()) {
> // get the atom...
> }
>
> mSerializer->Init(..., charsetAtom);
>
> less code, and more maintainable,
But I need to pass null if the charset is empty.

Johnny Stenback (:jst)

Comment 29

•

24 years ago

> But I need to pass null if the charset is empty.

nsCOMPtr's initialize them selves to null so unless you put a value in a
nsCOMPtr it will be null, thus what I said does *exactly* what your original
proposal does.

nhottanscp

Assignee

Comment 30

•

24 years ago

Attached patch new patch, changed to include jst's suggestion — Details — Splinter Review

Johnny Stenback (:jst)

Comment 31

•

24 years ago

r=jst for the new patch but please remove:

+#include "nsICharsetConverterManager.h"
+#include "nsICharsetConverterManager2.h"

from nsHTMLContentSerializer.cpp before checking in since they're not needed at
all in the new patch.

vidur (gone)

Comment 32

•

24 years ago

sr=vidur. Johnny seems to have reviewed this patch to death already. :-)

nhottanscp

Assignee

Comment 33

•

24 years ago

checked in

Status: ASSIGNED → RESOLVED

Closed: 24 years ago

Resolution: --- → FIXED

nhottanscp

Assignee

Comment 34

•

24 years ago

*** Bug 44374 has been marked as a duplicate of this bug. ***

sujay

Comment 35

•

24 years ago

verified in 2/13 build.

Status: RESOLVED → VERIFIED

PTRourke

Reporter

Comment 36

•

24 years ago

Beautiful job. Thank you.

Patch from jst to pass charset to serializers 24 years ago nhottanscp 6.22 KB, patch		Details \| Diff \| Splinter Review
Patch, pass charset atom to init method and use it for entity generation 24 years ago nhottanscp 10.02 KB, patch		Details \| Diff \| Splinter Review
new patch, changed to include jst's suggestion 24 years ago nhottanscp 9.70 KB, patch		Details \| Diff \| Splinter Review