User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; ru-RU; rv:1.7.8) Gecko/20050511 Firefox/1.0.4 (ax)
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.0; ru-RU; rv:1.7.8) Gecko/20050511 Firefox/1.0.4 (ax)

I load an XML file and parse it (for instance, with XMLRequestor and DOMParser). The XML file was saved in 'utf-8' encoding. When I load and parse the file, the character encoding of (for instance) attribute values is still 'utf-8'. If I want to work with the XML document, I must convert it to Unicode myself. I see the same behaviour even if I specify the 'encoding' attribute of the <?xml?> processing instruction.

I guess if attribute 'encoding' is not specified then Mozilla should convert it from 'utf-8' to Unicode, and if attribute 'encoding' is present then Mozilla should convert it from that encoding to Unicode. When Mozilla loads an XML file by itself (for instance, when I open a XUL file or load an XML file into a frame), it converts it to Unicode by itself. I expect the same behaviour when I load and parse XML files.

Reproducible: Always
But UTF-8 is just one way to encode Unicode characters (alongside UTF-16, UTF-32, UCS-2 and UCS-4). So I don't quite understand what you mean by "I guess if attribute 'encoding' is not specified then mozilla should convert it from 'utf-8' to unicode" (roughly speaking, UTF-8 is an encoding and Unicode is a character set).
I don't know exactly what encoding Mozilla uses internally when it loads a file, but I think it is not UTF-8. I mean that when I load a file, Mozilla should convert it to its internal encoding. I have an XML document saved in UTF-8 that contains Russian characters. When I load and parse the document, I have to convert attributes and text nodes to Unicode with nsIScriptableUnicodeConverter.ConvertToUnicode(string, "utf-8"). I think I should not have to convert them myself.
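For readers following along, the manual conversion described above appears to amount to treating each character code in the string as a single byte and decoding those bytes as UTF-8. Below is a minimal sketch of that idea in Python, not Mozilla code; the function name `undo_zero_extension` is hypothetical, and it assumes every character in the garbled string has a code of 0xFF or lower.

```python
def undo_zero_extension(garbled: str, charset: str = "utf-8") -> str:
    """Treat each character code (0x00-0xFF) as one byte, then decode the bytes.

    Assumption: the input holds raw encoded bytes widened to characters.
    Raises UnicodeEncodeError if any character code is above 0xFF.
    """
    raw = garbled.encode("latin-1")  # latin-1 maps U+0000..U+00FF to bytes 1:1
    return raw.decode(charset)


# Two characters that are really the UTF-8 bytes 0xD0 0x90
garbled = "\u00d0\u0090"
fixed = undo_zero_extension(garbled)
print(fixed == "\u0410")  # True: Cyrillic Capital Letter A
```

If the parser had already decoded the document correctly, this step would be unnecessary, and applying it to correctly decoded Cyrillic text would raise an error instead of silently corrupting it.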
Do you mean that the value of a text node you get from XMLRequestor and DOMParser is "U+00D0 U+0090" (a zero-extended sequence of the UTF-8 representation of U+0410) when it should be "U+0410", because what you have in that node is Cyrillic Capital Letter A (U+0410)? If your XML file is in Windows-1251 and a text node has U+0410 (0xC0 in Windows-1251), do you get U+00C0 instead of "U+0410"? If that's the case, this is clearly a bug (probably already reported; I may have reported it or seen it before). Will you please put up a simple test case somewhere, or attach it to this bug, and tell us what you expect and what you actually get?
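The code points quoted above can be checked with a short illustration. This is a sketch in Python rather than Mozilla code, and the helper names `zero_extend` and `code_points` are made up for the example; it only shows what "zero-extended" bytes look like when each byte is widened to a character without decoding.

```python
def zero_extend(text: str, charset: str) -> str:
    """Encode text in the given charset, then widen each byte to one character."""
    raw = text.encode(charset)
    return raw.decode("latin-1")  # latin-1 maps every byte to U+0000..U+00FF


def code_points(s: str) -> str:
    """Format a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


cyrillic_a = "\u0410"  # Cyrillic Capital Letter A

print(code_points(zero_extend(cyrillic_a, "utf-8")))   # U+00D0 U+0090
print(code_points(zero_extend(cyrillic_a, "cp1251")))  # U+00C0
```

A correctly decoded text node would contain the single code point U+0410 in both cases.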
Exactly as you say. I'll attach a testcase.
Can't reproduce the problem. Testcase is invalid.
Status: UNCONFIRMED → RESOLVED
Last Resolved: 13 years ago
Resolution: --- → WORKSFORME