Closed Bug 305075 Opened 19 years ago Closed 19 years ago

problem with character encoding of loaded xml document

Categories

(Core :: Internationalization, defect)

x86
Windows 2000
defect
Not set
normal

RESOLVED WORKSFORME

People

(Reporter: surkov, Assigned: smontagu)

Details

Attachments

(3 files)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.0; ru-RU; rv:1.7.8) Gecko/20050511 Firefox/1.0.4 (ax)
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.0; ru-RU; rv:1.7.8) Gecko/20050511 Firefox/1.0.4 (ax)

I load an XML file and parse it (for instance, with XMLRequestor and DOMParser).
The XML file was saved in 'utf-8' encoding. After I load and parse the file, the
character encoding of (for instance) attribute values is still 'utf-8'. If I
want to work with the XML document, I must convert it to Unicode myself. Even if
I specify the 'encoding' attribute of the <?xml?> declaration, I get the same
behaviour.

I guess if the 'encoding' attribute is not specified then Mozilla should convert
it from 'utf-8' to Unicode, and if the 'encoding' attribute is present then
Mozilla should convert it from that encoding to Unicode. When Mozilla loads an
XML file itself (for instance, when I open a XUL file or load an XML file into a
frame), it converts the file to Unicode on its own. I expect the same behaviour
when I load and parse XML files.

Reproducible: Always
But UTF-8 is just one way to encode Unicode characters (alongside UTF-16,
UTF-32, UCS-2 and UCS-4). So I don't quite understand what you mean by "mozilla
should convert it from 'utf-8' to unicode" (roughly speaking, UTF-8 is an
encoding and Unicode is a character set).
I don't know exactly what encoding Mozilla uses internally when it loads a file,
but I think it is not utf-8. I mean that when I load a file, Mozilla should
convert it to its internal encoding. I have an XML document saved in utf-8 with
Russian characters. After I load and parse the XML document, I have to convert
attributes and text nodes to Unicode with
nsIScriptableUnicodeConverter.ConvertToUnicode(string, "utf-8"). I think I
should not have to convert them myself.
Do you mean that the value of a text node you get from XMLRequestor and
DOMParser is "U+00D0 U+0090" (a zero-extended sequence of the UTF-8
representation of U+0410) when it should be "U+0410", because what you have in
that node is Cyrillic Capital Letter A (U+0410)? If your XML file is in
Windows-1251 and a text node has U+0410 (0xC0 in Windows-1251), do you get
U+00C0 instead of "U+0410"? If that's the case, this is clearly a bug (probably
already reported; I may have reported it or seen it before...).
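
To make the UTF-8 case concrete, here is a minimal sketch in plain JavaScript
of what "zero-extended" means. It is only an illustration under stated
assumptions, not Mozilla's loading code: the function names zeroExtendUtf8 and
decodeZeroExtendedUtf8 are made up for this example, and only the standard
TextEncoder and TextDecoder APIs are used.

```javascript
// Hypothetical helpers for illustration; not part of any Mozilla API.

// Simulate the suspected bug: encode the text as UTF-8, then treat every
// byte as its own UTF-16 code unit (zero-extension).
function zeroExtendUtf8(text) {
  const bytes = new TextEncoder().encode(text);
  return Array.from(bytes, (b) => String.fromCharCode(b)).join("");
}

// Undo it: collapse each code unit back into one byte and decode the
// resulting byte sequence as UTF-8.
function decodeZeroExtendedUtf8(garbled) {
  const bytes = new Uint8Array(garbled.length);
  for (let i = 0; i < garbled.length; i++) {
    const unit = garbled.charCodeAt(i);
    if (unit > 0xff) {
      // A code unit above 0xFF cannot be a zero-extended byte.
      throw new RangeError("not a zero-extended byte string");
    }
    bytes[i] = unit;
  }
  return new TextDecoder("utf-8").decode(bytes);
}

const garbled = zeroExtendUtf8("\u0410"); // Cyrillic Capital Letter A
console.log(garbled === "\u00D0\u0090"); // true
console.log(decodeZeroExtendedUtf8(garbled) === "\u0410"); // true
```

If the text node really contains "U+00D0 U+0090", the second helper recovers
U+0410, which is the kind of manual conversion the reporter describes having to
do.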

Will you please put up a simple test case somewhere, or attach it to this bug,
and tell us what you expect and what you actually get?

Exactly as you say. I'll attach a testcase.
Attached file xml file in utf-8
Attached file testcase
Can't reproduce the problem. Testcase is invalid.
Status: UNCONFIRMED → RESOLVED
Closed: 19 years ago
Resolution: --- → WORKSFORME