Closed Bug 240717 Opened 21 years ago Closed 21 years ago

DOMParser.parseFromString() confused by character encodings

Categories

(Core :: XML, defect, P3)

RESOLVED FIXED
mozilla1.8alpha1

People

(Reporter: matthew, Assigned: bzbarsky)

Details

Attachments

(2 files)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.6) Gecko/20040206 Firefox/0.8
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.6) Gecko/20040206 Firefox/0.8

DOMParser.parseFromString() seems not to take into account the character
encoding present in the XML declaration.

Reproducible: Always
Steps to Reproduce:
1. Load the testcase (to be attached).
2. Select the "test" link.
3. The first 'window.alert' shows a JavaScript string which has been constructed
to be (the serialization of) an XML document containing the 'squared' character
(U+00B2 SUPERSCRIPT TWO in Unicode terms). The string contains an XML declaration
which declares the encoding as ISO-8859-1.
4. After that, an XML document is created using new
DOMParser().parseFromString(xml, "text/xml");, and there is a second
window.alert showing the value of the text node in that document.

Actual Results:  
The second window.alert shows that some mangling has gone on, presumably
relating to the character encoding: the string shown appears to be abc[U+00C2
LATIN CAPITAL LETTER A WITH CIRCUMFLEX][U+00B2 SUPERSCRIPT TWO].

Expected Results:  
The second window.alert should correctly show the text that was set as the
text node's value, i.e. abc[U+00B2 SUPERSCRIPT TWO].

The behaviour is the same on a nightly build of Firefox less than one week old.
Attached file Test case
> DOMParser.parseFromString() seems not to take into account the character
> encoding present in the XML declaration.

On the contrary, it does.  When parsing.

What you're doing is passing Unicode data into the DOMParser. It converts
this into UTF-8 bytes, then feeds them to the XML parser.  But the XML parser
sees the encoding decl and parses those bytes as ISO-8859-1.  Hence the mangling.
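
The round trip described above can be reproduced outside the browser. This is only a minimal sketch, assuming Node.js: `Buffer` stands in for the parser's internal byte handling, and the variable names are made up for illustration.

```javascript
// The text node value from the test case: "abc" followed by U+00B2.
const text = "abc\u00B2";

// Step 1: the Unicode string is converted to UTF-8 bytes.
// U+00B2 becomes the two bytes 0xC2 0xB2.
const utf8Bytes = Buffer.from(text, "utf8");

// Step 2: those bytes are then read as ISO-8859-1, one character per byte,
// so 0xC2 turns into U+00C2 and 0xB2 turns into U+00B2.
const misdecoded = utf8Bytes.toString("latin1");

console.log(utf8Bytes.toString("hex"));
console.log(misdecoded === "abc\u00C2\u00B2");
```

The extra U+00C2 in the result is the same stray character reported in the second alert.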

So either the conversion to bytes needs to scan the string for the XML decl
first (ugh!) or the nsDOMParser::ParseFromStream method needs to do something
with the "charset" arg it gets (like set it on the channel so that things don't
break).
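
For illustration only, the first option (scanning the string for the XML declaration) might look roughly like the sketch below. `declaredEncoding` is a hypothetical helper, not anything in the Mozilla tree, and it is written in JavaScript rather than in the parser's own C++.

```javascript
// Hypothetical helper: return the encoding named in a leading XML
// declaration, or null if the string has no declaration or no encoding.
function declaredEncoding(xml) {
  // The declaration must be at the very start of the document. [^>]*? keeps
  // the search inside the declaration, so an "encoding" attribute on a
  // later element is never picked up.
  const match =
    /^<\?xml\s[^>]*?\bencoding\s*=\s*(["'])([A-Za-z][A-Za-z0-9._-]*)\1/.exec(xml);
  return match ? match[2] : null;
}

const sample = '<?xml version="1.0" encoding="ISO-8859-1"?><a>abc\u00B2</a>';
console.log(declaredEncoding(sample));
```

A converter could use the returned name to pick the byte encoding before handing the data to the XML parser. The second option avoids the scan entirely by telling the parser which charset the bytes are actually in.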
Attached patch Say like this
Attachment #146284 - Flags: superreview?(jst)
Attachment #146284 - Flags: review?(jst)
Comment on attachment 146284
Say like this

r+sr=jst
Attachment #146284 - Flags: superreview?(jst) → superreview+
Attachment #146284 - Flags: review?(jst) → review+
Taking.
Assignee: hjtoi-bugzilla → bzbarsky
OS: Windows XP → All
Priority: -- → P3
Hardware: PC → All
Target Milestone: --- → mozilla1.8alpha
Checked in.
Status: NEW → RESOLVED
Closed: 21 years ago
Resolution: --- → FIXED
Hmmm...looks like a "return NS_OK;" is needed in SetContentCharset:

Right now, it's:

NS_IMETHODIMP
nsDOMParserChannel::SetContentCharset(const nsACString &aContentCharset)
{
  mContentCharset = aContentCharset;
}

But it probably should read:

NS_IMETHODIMP
nsDOMParserChannel::SetContentCharset(const nsACString &aContentCharset)
{
  mContentCharset = aContentCharset;
  return NS_OK;
}
Doug, thanks for the heads-up, and you're right.  Fix checked in.
Any chance of getting this fix ported onto the Aviary branch?
I have no plans to port this change to any branches.  If someone makes an aviary
patch and convinces the aviary maintainers to take it, I don't plan to stop them
(nor could I), though I would appreciate not getting too much bugspam in the
process.