Closed
Bug 65324
Opened 24 years ago
Closed 24 years ago
Composer should used numerical entities for Greek, not named entities
Categories
(Core :: DOM: Editor, defect)
Core
DOM: Editor
Tracking
()
VERIFIED
FIXED
mozilla0.9
People
(Reporter: ptrourke, Assigned: nhottanscp)
References
()
Details
(Keywords: intl)
Attachments
(3 files)
6.22 KB,
patch
|
Details | Diff | Splinter Review | |
10.02 KB,
patch
|
Details | Diff | Splinter Review | |
9.70 KB,
patch
|
Details | Diff | Splinter Review |
From Bugzilla Helper: User-Agent: Mozilla/5.0 (Windows; U; Win 9x 4.90; en-US; 0.7) Gecko/20010109 BuildID: 2001010901 I pasted a couple of passages of Unicode Greek into Composer and saved as UTF-8. The text is saved using named entities (i.e., α for a lower case alpha, etc.) where available. These named entities are not recognized by Netscape 4.7. Therefore numeric entities should be used instead. Reproducible: Always Steps to Reproduce: Copy a Greek text in Unicode, say from http://www.perseus.tufts.edu, or one of the attached pages. Paste it into Composer. Save as character set UTF-8. Look at the page with Communicator 4.7. Actual Results: Saved νε?ος με?ν και? α??πειρος [amp entity used in case the entities are resolved by bugzilla; if you see amp; ignore it] which is not readable in Netscape 4.7 Expected Results: Should have saved νέος μὲν καὶ ἄπειρος [note: this is the behavior of FrontPage 2000] or (even better!) [pretend this is the Unicode text; it looked fine when I pasted it into Bugzilla Helper, but became unrecognized symbols after submitting to this form] ????? ???? ???? ????????? [Note: this is the behavior of Word 2000] which are readable in Netscape 4.7. http://www.ziplink.net/~ptrourke/mozcompose1.html http://www.ziplink.net/~ptrourke/mozcompose2.html I'd like to be able to recommend Mozilla Composer to those who have to publish polytonic Greek texts to the web (Greek literature scholars, biblical scholars, etc.), but can't until this bug is fixed). Thanks.
*** Bug 65326 has been marked as a duplicate of this bug. ***
Note also: when editing an existing page which has numerical entities, Mozilla converts those with available named entities to named entities upon save. BTW, Shouldn't use of named entities be discouraged throughout, as so few are supported by XML?
Comment 5•24 years ago
|
||
reassigning to nhotta. Naoki, the conversion from numeric entity to mneumonic entity should not happen, we should preserve what the user entered.
Assignee: beppe → nhotta
Assignee | ||
Comment 6•24 years ago
|
||
I think preserving the original is not easy. Input HTML may use numeric or mneumonic (or both may be used).
Keywords: intl
Assignee | ||
Comment 7•24 years ago
|
||
So I think we want to have UI to let the user control the output.
Assignee | ||
Comment 8•24 years ago
|
||
But the case reported here, it saves as UTF-8, so output does not need to have any entities, since the characters have code points in unicode. I can reproduce this with 6.0 RTM. I think something got broken before RTM.
Status: NEW → ASSIGNED
Assignee | ||
Comment 9•24 years ago
|
||
Those entities are generated by nsHTMLContentSerializer. This is related to jst's check in (e.g. nsDocumentEncoder.cpp, rev=1.33). Cc to him, why is this entity conversion needed in there? I think Composer eventually does entity conversion (by using nsIEntityConverter) along with charset conversion, so I think it's not necessary doing it in nsHTMLContentSerializer. nsHTMLEntities::UnicodeToEntity(int 0x000000f4) line 190 nsParserService::HTMLConvertUnicodeToEntity(const nsParserService * const 0x03b4c540, int 0x000000f4, nsCString & {...}) line 105 + 9 bytes nsHTMLContentSerializer::AppendToString(const basic_nsAReadableString<unsigned short> & {...}, basic_nsAWritableString<unsigned short> & {...}, int 0x00000001, int 0x00000000) line 532 nsXMLContentSerializer::AppendTextData(nsIDOMNode * 0x058561c0, int 0x00000000, int 0xffffffff, basic_nsAWritableString<unsigned short> & {...}, int 0x00000001, int 0x00000000) line 100 + 58 bytes nsHTMLContentSerializer::AppendText(nsHTMLContentSerializer * const 0x05b9de80, nsIDOMText * 0x058561c0, int 0x00000000, int 0xffffffff, basic_nsAWritableString<unsigned short> & {...}) line 132 + 31 bytes nsDocumentEncoder::SerializeNodeStart(nsIDOMNode * 0x058561c0, int 0x00000000, int 0xffffffff, basic_nsAWritableString<unsigned short> & {...}) line 260 nsDocumentEncoder::SerializeToStringRecursive(nsIDOMNode * 0x058561c0, basic_nsAWritableString<unsigned short> & {...}) line 315 + 20 bytes nsDocumentEncoder::SerializeToStringRecursive(nsIDOMNode * 0x0579f2a4, basic_nsAWritableString<unsigned short> & {...}) line 336 + 21 bytes nsDocumentEncoder::SerializeToStringRecursive(nsIDOMNode * 0x057d4fec, basic_nsAWritableString<unsigned short> & {...}) line 336 + 21 bytes nsDocumentEncoder::SerializeToStringRecursive(nsIDOMNode * 0x057d2174, basic_nsAWritableString<unsigned short> & {...}) line 336 + 21 bytes nsDocumentEncoder::EncodeToString(nsDocumentEncoder * const 0x05b9c7f0, basic_nsAWritableString<unsigned short> & {...}) line 882 + 21 bytes nsDocumentEncoder::EncodeToStream(nsDocumentEncoder * const 0x05b9c7f0, nsIOutputStream * 0x05b99b30) line 917 + 19 bytes nsDocument::SaveFile(nsDocument * const 0x057d218c, nsIFile * 0x05b9c910, int 0x00000000, int 0x00000000, const unsigned short * 0x05b99c70, const unsigned short * 0x0423b3f0, unsigned int 0x00000102, unsigned int 0x00000048) line 3442 + 44 bytes nsEditor::SaveFile(nsEditor * const 0x03c23e50, nsIFile * 0x05b9c910, int 0x00000000, int 0x00000000, const nsString & {...}) line 1637 + 76 bytes nsEditorShell::SaveDocument(nsEditorShell * const 0x05446dc0, int 0x00000001, int 0x00000000, int * 0x0012c3e8) line 1926 + 53 bytes
Assignee | ||
Comment 10•24 years ago
|
||
also cc to akkana
Assignee | ||
Comment 11•24 years ago
|
||
Reassign to jst. Entity generation should only be done as a fallback of charset conversion error, also use Latin1 entity only for compatibility.
Reporter | ||
Comment 13•24 years ago
|
||
> But the case reported here, it saves as UTF-8, so output
> does not need to have any entities, since the characters
> have code points in unicode.
> [Later] Entity generation should only be done
> as a fallback of charset conversion error, also use Latin1
> entity only for compatibility.
This would be ideal.
Comment 14•24 years ago
|
||
no, anthony does not own the conversion code, this is either nhotta or jst, assign back to nhotta
Assignee: anthonyd → nhotta
Assignee | ||
Comment 15•24 years ago
|
||
>> But the case reported here, it saves as UTF-8, so output >> does not need to have any entities, since the characters >> have code points in unicode. >> [Later] Entity generation should only be done >> as a fallback of charset conversion error, also use Latin1 >> entity only for compatibility. > > This would be ideal. It had been working that way until beta3. See nsDocumentEncoder.cpp, rev=1.33 ~ 1.35. Reassign to jst.
Assignee: nhotta → jst
Comment 16•24 years ago
|
||
If this needs to be fixed for nsbeta1 someone other than me needs to grab this and do something about it, I won't get to this in time for nsbeta1. Marking Future.
OS: Windows ME → All
Hardware: PC → All
Target Milestone: --- → Future
Assignee | ||
Comment 18•24 years ago
|
||
Assignee | ||
Comment 19•24 years ago
|
||
r=nhotta I will write a patch later to use the charset in HTML serializer.
Status: NEW → ASSIGNED
Assignee | ||
Comment 20•24 years ago
|
||
jst, who can I ask for a super-review for the patch?
Comment 21•24 years ago
|
||
Vidur could probably sr the patch. One thing that I'm not sure about in the patch is: mIsLatin1 = aCharSet.Equals(NS_LITERAL_STRING("iso-8859-1")); Does that do the right thing? Is the name of the latin 1 charset always "iso-8859-1"? Would the Equals call need to be case in-sensitive?
Assignee | ||
Comment 22•24 years ago
|
||
It needs to get charset atom from the string using nsICharsetConverterManager2 and compare atoms. I will do that with my other change to the nsHTMLContentSerializer (to actually use that flag). Before that, let's leave the part unchanged since the flag is not used anyway.
Assignee | ||
Comment 23•24 years ago
|
||
It would be better to pass nsIAtom* instead. I will try to do that.
Assignee | ||
Comment 24•24 years ago
|
||
Assignee | ||
Comment 25•24 years ago
|
||
I changed Init method to pass charset atom and use it for NCR generation. Jst, please review the patch.
Comment 26•24 years ago
|
||
Sorry about dragging my feet here, I saw this patch but forgot about it until now... I have a couple of issues with this pathc: nsIContentSerializer.h: - NS_IMETHOD Init(PRUint32 flags, PRUint32 aWrapColumn) = 0; + NS_IMETHOD Init(PRUint32 flags, PRUint32 aWrapColumn, + const nsIAtom* aCharSet) = 0; do not make aCharSet *const*, const and interface pointers just don't mix, loose the const'ness here. nsDocumentEncoder.cpp: + char finish_buf[32]; + charLength = 32; smells like purify ABW's (Array Bounds Write) to mee, maybe I'm wrong, but shouldn't charLength be 31 here? In nsDocumentEncoder.cpp, around line 880, wouldn't the code be cleaner if you'd declare charsetAtom nsCOMPtr outside the if block and then do: if (!mCharset.IsEmpty()) { // get the atom... } mSerializer->Init(..., charsetAtom); less code, and more maintainable, no? Having to get the charset converter manager in the html serializer just to do an atom compare seems like lots of unneccessary work, couldn't you just get the string out of the atom and do a string compare, you know the charset string (in the atom) is normalized at that point, right? I'd say you should do something like this: mIsLatin1 = PR_FALSE; if (aCharSet) { const PRUnichar *charset; aCharSet->GetUnicode(&charset); if (NS_LITERAL_STRING("ISO-8859-1").Equals(charset)) { mIsLatin1 = (aCharSet == charsetAtom); } } This is less code, and faster, and it would work, or did I miss something here? Other than that I think the patch looks good, could you make these changes and attach a new patch?
Comment 27•24 years ago
|
||
Oops, my last comment contained an error, here's the correction: mIsLatin1 = PR_FALSE; if (aCharSet) { const PRUnichar *charset; aCharSet->GetUnicode(&charset); if (NS_LITERAL_STRING("ISO-8859-1").Equals(charset)) { mIsLatin1 = (aCharSet == charsetAtom); ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ This line should've been simply: mIsLatin1 = PR_TRUE;
Assignee | ||
Comment 28•24 years ago
|
||
> nsDocumentEncoder.cpp: > > + char finish_buf[32]; > + charLength = 32; > > smells like purify ABW's (Array Bounds Write) to mee, maybe I'm wrong, but > shouldn't charLength be 31 here? 32 is fine Finish takes a buffer and its length. > > In nsDocumentEncoder.cpp, around line 880, wouldn't the code be cleaner if you'd > declare charsetAtom nsCOMPtr outside the if block and then do: > > if (!mCharset.IsEmpty()) { > // get the atom... > } > > mSerializer->Init(..., charsetAtom); > > less code, and more maintainable, But I need to pass null if the charset is empty.
Comment 29•24 years ago
|
||
> But I need to pass null if the charset is empty.
nsCOMPtr's initialize them selves to null so unless you put a value in a
nsCOMPtr it will be null, thus what I said does *exactly* what your original
proposal does.
Assignee | ||
Comment 30•24 years ago
|
||
Comment 31•24 years ago
|
||
r=jst for the new patch but please remove: +#include "nsICharsetConverterManager.h" +#include "nsICharsetConverterManager2.h" from nsHTMLContentSerializer.cpp before checking in since they're not needed at all in the new patch.
Comment 32•24 years ago
|
||
sr=vidur. Johnny seems to have reviewed this patch to death already. :-)
Assignee | ||
Comment 33•24 years ago
|
||
checked in
Status: ASSIGNED → RESOLVED
Closed: 24 years ago
Resolution: --- → FIXED
Assignee | ||
Comment 34•24 years ago
|
||
*** Bug 44374 has been marked as a duplicate of this bug. ***
Reporter | ||
Comment 36•24 years ago
|
||
Beautiful job. Thank you.
You need to log in
before you can comment on or make changes to this bug.
Description
•