Closed Bug 65324 Opened 24 years ago Closed 24 years ago

Composer should used numerical entities for Greek, not named entities

Categories

(Core :: DOM: Editor, defect)

defect
Not set
major

Tracking

()

VERIFIED FIXED
mozilla0.9

People

(Reporter: ptrourke, Assigned: nhottanscp)

References

()

Details

(Keywords: intl)

Attachments

(3 files)

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Win 9x 4.90; en-US; 0.7) Gecko/20010109
BuildID:    2001010901

I pasted a couple of passages of Unicode Greek into Composer and saved as UTF-8.
 The text is saved using named entities (i.e., α for a lower case
alpha, etc.) where available. These named entities are not recognized by
Netscape 4.7. Therefore numeric entities should be used instead.

Reproducible: Always
Steps to Reproduce:
Copy a Greek text in Unicode, say from http://www.perseus.tufts.edu, or one of
the attached pages. Paste it into Composer.  Save as character set UTF-8. Look
at the page with Communicator 4.7.	

Actual Results:  Saved

νε?ος με?ν
και?
α??πειρος

[amp entity used in case the entities are resolved by bugzilla; if you see amp;
ignore it]

which is not readable in Netscape 4.7



Expected Results:  Should have saved

νέος μὲν καὶ
ἄπειρος

[note: this is the behavior of FrontPage 2000]

or (even better!)

[pretend this is the Unicode text; it looked fine when I pasted it into Bugzilla
Helper, but became unrecognized symbols after submitting to this form]

 ????? ???? ???? ?????????

[Note: this is the behavior of Word 2000]

which are readable in Netscape 4.7.

http://www.ziplink.net/~ptrourke/mozcompose1.html
http://www.ziplink.net/~ptrourke/mozcompose2.html

I'd like to be able to recommend Mozilla Composer to those who have to publish
polytonic Greek texts to the web (Greek literature scholars, biblical scholars,
etc.), but can't until this bug is fixed).

Thanks.
*** Bug 65326 has been marked as a duplicate of this bug. ***
*** Bug 65326 has been marked as a duplicate of this bug. ***
setting bug status to New
Status: UNCONFIRMED → NEW
Ever confirmed: true
Note also: when editing an existing page which has numerical entities, Mozilla 
converts those with available named entities to named entities upon save.  

BTW, Shouldn't use of named entities be discouraged throughout, as so few are 
supported by XML?
reassigning to nhotta. Naoki, the conversion from numeric entity to mneumonic 
entity should not happen, we should preserve what the user entered.
Assignee: beppe → nhotta
I think preserving the original is not easy. Input HTML may use numeric or
mneumonic (or both may be used).
Keywords: intl
So I think we want to have UI to let the user control the output.
But the case reported here, it saves as UTF-8, so output does not need to have
any entities, since the characters have code points in unicode.
I can reproduce this with 6.0 RTM. I think something got broken before RTM.
Status: NEW → ASSIGNED
Those entities are generated by nsHTMLContentSerializer.
This is related to jst's check in (e.g. nsDocumentEncoder.cpp, rev=1.33).
Cc to him, why is this entity conversion needed in there? I think Composer 
eventually does entity conversion (by using nsIEntityConverter) along with 
charset conversion, so I think it's not necessary doing it in 
nsHTMLContentSerializer.

nsHTMLEntities::UnicodeToEntity(int 0x000000f4) line 190
nsParserService::HTMLConvertUnicodeToEntity(const nsParserService * const 
0x03b4c540, int 0x000000f4, nsCString & {...}) line 105 + 9 bytes
nsHTMLContentSerializer::AppendToString(const basic_nsAReadableString<unsigned 
short> & {...}, basic_nsAWritableString<unsigned short> & {...}, int 0x00000001, 
int 0x00000000) line 532
nsXMLContentSerializer::AppendTextData(nsIDOMNode * 0x058561c0, int 0x00000000, 
int 0xffffffff, basic_nsAWritableString<unsigned short> & {...}, int 0x00000001, 
int 0x00000000) line 100 + 58 bytes
nsHTMLContentSerializer::AppendText(nsHTMLContentSerializer * const 0x05b9de80, 
nsIDOMText * 0x058561c0, int 0x00000000, int 0xffffffff, 
basic_nsAWritableString<unsigned short> & {...}) line 132 + 31 bytes
nsDocumentEncoder::SerializeNodeStart(nsIDOMNode * 0x058561c0, int 0x00000000, 
int 0xffffffff, basic_nsAWritableString<unsigned short> & {...}) line 260
nsDocumentEncoder::SerializeToStringRecursive(nsIDOMNode * 0x058561c0, 
basic_nsAWritableString<unsigned short> & {...}) line 315 + 20 bytes
nsDocumentEncoder::SerializeToStringRecursive(nsIDOMNode * 0x0579f2a4, 
basic_nsAWritableString<unsigned short> & {...}) line 336 + 21 bytes
nsDocumentEncoder::SerializeToStringRecursive(nsIDOMNode * 0x057d4fec, 
basic_nsAWritableString<unsigned short> & {...}) line 336 + 21 bytes
nsDocumentEncoder::SerializeToStringRecursive(nsIDOMNode * 0x057d2174, 
basic_nsAWritableString<unsigned short> & {...}) line 336 + 21 bytes
nsDocumentEncoder::EncodeToString(nsDocumentEncoder * const 0x05b9c7f0, 
basic_nsAWritableString<unsigned short> & {...}) line 882 + 21 bytes
nsDocumentEncoder::EncodeToStream(nsDocumentEncoder * const 0x05b9c7f0, 
nsIOutputStream * 0x05b99b30) line 917 + 19 bytes
nsDocument::SaveFile(nsDocument * const 0x057d218c, nsIFile * 0x05b9c910, int 
0x00000000, int 0x00000000, const unsigned short * 0x05b99c70, const unsigned 
short * 0x0423b3f0, unsigned int 0x00000102, unsigned int 0x00000048) line 3442 
+ 44 bytes
nsEditor::SaveFile(nsEditor * const 0x03c23e50, nsIFile * 0x05b9c910, int 
0x00000000, int 0x00000000, const nsString & {...}) line 1637 + 76 bytes
nsEditorShell::SaveDocument(nsEditorShell * const 0x05446dc0, int 0x00000001, 
int 0x00000000, int * 0x0012c3e8) line 1926 + 53 bytes
also cc to akkana
Reassign to jst. Entity generation should only be done as a fallback of charset
conversion error, also use Latin1 entity only for compatibility.
Assignee: nhotta → jst
Status: ASSIGNED → NEW
Keywords: nsbeta1
Reassigning to anthonyd who now owns this code.
Assignee: jst → anthonyd
> But the case reported here, it saves as UTF-8, so output
> does not need to have any entities, since the characters
> have code points in unicode.
> [Later] Entity generation should only be done
> as a fallback of charset conversion error, also use Latin1
> entity only for compatibility.

This would be ideal.
no, anthony does not own the conversion code, this is either nhotta or jst, 
assign back to nhotta
Assignee: anthonyd → nhotta
>> But the case reported here, it saves as UTF-8, so output
>> does not need to have any entities, since the characters
>> have code points in unicode.
>> [Later] Entity generation should only be done
>> as a fallback of charset conversion error, also use Latin1
>> entity only for compatibility.
>
> This would be ideal.

It had been working that way until beta3.
See nsDocumentEncoder.cpp, rev=1.33 ~ 1.35.
Reassign to jst.

Assignee: nhotta → jst
If this needs to be fixed for nsbeta1 someone other than me needs to grab this
and do something about it, I won't get to this in time for nsbeta1. Marking Future.
OS: Windows ME → All
Hardware: PC → All
Target Milestone: --- → Future
Reassign to nhotta.
Assignee: jst → nhotta
r=nhotta
I will write a patch later to use the charset in HTML serializer.
Status: NEW → ASSIGNED
jst, who can I ask for a super-review for the patch?
Vidur could probably sr the patch.

One thing that I'm not sure about in the patch is:

  mIsLatin1 = aCharSet.Equals(NS_LITERAL_STRING("iso-8859-1"));

Does that do the right thing? Is the name of the latin 1 charset always
"iso-8859-1"? Would the Equals call need to be case in-sensitive?
It needs to get charset atom from the string using nsICharsetConverterManager2
and compare atoms.
I will do that with my other change to the nsHTMLContentSerializer (to actually
use that flag). Before that, let's leave the part unchanged since the flag is
not used anyway.
It would be better to pass nsIAtom* instead. I will try to do that.
I changed Init method to pass charset atom and use it for NCR generation.
Jst, please review the patch.
Keywords: review
Target Milestone: Future → mozilla0.9
Sorry about dragging my feet here, I saw this patch but forgot about it until now...

I have a couple of issues with this pathc:

nsIContentSerializer.h:

-  NS_IMETHOD Init(PRUint32 flags, PRUint32 aWrapColumn) = 0;
+  NS_IMETHOD Init(PRUint32 flags, PRUint32 aWrapColumn,
+                  const nsIAtom* aCharSet) = 0;

do not make aCharSet *const*, const and interface pointers just don't mix, loose
the const'ness here.

nsDocumentEncoder.cpp:

+      char finish_buf[32];
+      charLength = 32;

smells like purify ABW's (Array Bounds Write) to mee, maybe I'm wrong, but
shouldn't charLength be 31 here?

In nsDocumentEncoder.cpp, around line 880, wouldn't the code be cleaner if you'd
declare charsetAtom nsCOMPtr outside the if block and then do:

  if (!mCharset.IsEmpty()) {
    // get the atom...
  }

  mSerializer->Init(..., charsetAtom);

less code, and more maintainable, no?

Having to get the charset converter manager in the html serializer just to do an
atom compare seems like lots of unneccessary work, couldn't you just get the
string out of the atom and do a string compare, you know the charset string (in
the atom) is normalized at that point, right? I'd say you should do something
like this:

  mIsLatin1 = PR_FALSE;
  if (aCharSet) {
    const PRUnichar *charset;
    aCharSet->GetUnicode(&charset);

    if (NS_LITERAL_STRING("ISO-8859-1").Equals(charset)) {
      mIsLatin1 = (aCharSet == charsetAtom);
    }
  }

This is less code, and faster, and it would work, or did I miss something here?

Other than that I think the patch looks good, could you make these changes and
attach a new patch?
Oops, my last comment contained an error, here's the correction:

 mIsLatin1 = PR_FALSE;
  if (aCharSet) {
    const PRUnichar *charset;
    aCharSet->GetUnicode(&charset);

    if (NS_LITERAL_STRING("ISO-8859-1").Equals(charset)) {
      mIsLatin1 = (aCharSet == charsetAtom);
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This line should've been simply:

      mIsLatin1 = PR_TRUE;
> nsDocumentEncoder.cpp:
>
> +      char finish_buf[32];
> +      charLength = 32;
>
> smells like purify ABW's (Array Bounds Write) to mee, maybe I'm wrong, but
> shouldn't charLength be 31 here?
32 is fine Finish takes a buffer and its length.

>
> In nsDocumentEncoder.cpp, around line 880, wouldn't the code be cleaner if you'd
> declare charsetAtom nsCOMPtr outside the if block and then do:
>
> if (!mCharset.IsEmpty()) {
> // get the atom...
> }
>
> mSerializer->Init(..., charsetAtom);
>
> less code, and more maintainable,
But I need to pass null if the charset is empty.
> But I need to pass null if the charset is empty.

nsCOMPtr's initialize them selves to null so unless you put a value in a
nsCOMPtr it will be null, thus what I said does *exactly* what your original
proposal does.
r=jst for the new patch but please remove:

+#include "nsICharsetConverterManager.h"
+#include "nsICharsetConverterManager2.h"

from nsHTMLContentSerializer.cpp before checking in since they're not needed at
all in the new patch.
sr=vidur. Johnny seems to have reviewed this patch to death already. :-)
checked in
Status: ASSIGNED → RESOLVED
Closed: 24 years ago
Resolution: --- → FIXED
*** Bug 44374 has been marked as a duplicate of this bug. ***
verified in 2/13 build.
Status: RESOLVED → VERIFIED
Beautiful job. Thank you.  
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: