Closed Bug 12375 Opened 20 years ago Closed 20 years ago
Optimize charset conversion process for intl documents
This bug is being created as a result of the discussion on bug 8607. We fixed that bug, but the issues raised by James Clark's comments needed further tracking. Hence, this bug. The last two comments on bug 8607 are pasted here to provide context.

My post:
I just discussed charset conversion with Frank Tang, an internationalization engineer. We decided that we will sniff the encoding and convert the incoming data to UCS2 before the data gets passed to expat. So, expat will always see UCS2. Until now, we converted the incoming data to UCS2 without sniffing the encoding (we assumed that the encoding of the incoming data was UTF-8) and passed the UCS2 data on to expat. The expectation was that if the encoding was not UTF-8 and our guess was wrong, we would reload the document and convert the incoming data to UCS2 using the specified encoding. This is why we needed the encoding callback from expat. Now that we'll determine the encoding before expat sees the data, we don't need the callback any more.

James Clark's reply:
The approach you now have in mind should work. Note that you should only sniff the encoding if the content-type is application/xml with no charset parameter. If the content-type is text/xml or application/xml with a charset parameter, then RFC 2376 requires you to use the specified charset parameter. If the content-type is text/xml with no charset parameter, RFC 2376 requires you to use us-ascii as the encoding. If you know the encoding from the content-type and the encoding is one that expat can handle internally (utf-8, utf-16, us-ascii, iso-8859-1), then it is much more efficient just to pass the data unconverted to expat, tell expat what the encoding is in XML_ParserCreate and let expat do the conversion.

However, the approach you have in mind is less efficient than it need be. An approach more like what you originally had in mind would be more efficient:
 - pass the incoming data without conversion to expat
 - the encoding parameter passed to XML_ParserCreate should be the encoding determined by the content-type per RFC 2376, unless it is application/xml without a charset, in which case it should be null
 - set an UnknownEncodingHandler; if that gets called then stop the parse; reload the document, this time converting the data into UTF-16
 - better yet, use expat's unknown encoding handling machinery; for example, if you have a single byte encoding, the unknown encoding handler can pass expat a table that maps the encoding into Unicode, and then expat can do the conversion as part of the encoding process; only reload the document when you get an encoding of a type that expat's unknown encoding handling machinery cannot handle (i.e. a stateful encoding such as ISO-2022-JP)
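
For illustration only, the content-type rules and the "pass the data unconverted" path described above might be sketched as follows. This uses Python's standard xml.parsers.expat binding, whose ParserCreate takes an encoding argument in the same spirit as XML_ParserCreate; it is not the Mozilla code. The function names charset_from_content_type and parse_unconverted are hypothetical, and the set of natively handled encodings is simply the list given in the reply.

```python
import xml.parsers.expat
from email.message import Message

# Encodings the reply above says expat can handle internally.
EXPAT_NATIVE = {"utf-8", "utf-16", "us-ascii", "iso-8859-1"}


def charset_from_content_type(content_type):
    """Return the charset the content-type dictates, or None if sniffing is allowed.

    Hypothetical helper following the rules quoted in this bug:
    an explicit charset parameter always wins, text/xml without one
    means us-ascii, and application/xml without one leaves it open.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_param("charset")
    if charset is not None:
        return str(charset).lower()
    if msg.get_content_type() == "text/xml":
        return "us-ascii"
    return None


def parse_unconverted(raw_bytes, content_type):
    """Feed raw bytes to expat and return the concatenated character data."""
    encoding = charset_from_content_type(content_type)
    if encoding is not None and encoding not in EXPAT_NATIVE:
        # Assumption: the caller converts such data itself before parsing.
        raise ValueError("expat cannot decode %r natively" % encoding)
    # encoding=None lets expat work the encoding out from the document.
    parser = xml.parsers.expat.ParserCreate(encoding)
    chunks = []
    parser.CharacterDataHandler = chunks.append
    parser.Parse(raw_bytes, True)
    return "".join(chunks)
```

In this sketch a Latin-1 document served as application/xml; charset=iso-8859-1 is never converted by the caller; expat decodes it directly.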
Status: ASSIGNED → RESOLVED
Closed: 20 years ago
Resolution: --- → FIXED
Added code to nsParser.cpp to detect the BOM and to implement Appendix F of XML 1.0. Jim: What I did is slightly different from what Nisheeth specified. I check the first block of data, perform the detection specified in XML Appendix F, and reset the charset converter in the scanner. No reload will be required, since we can always (almost?) find the charset in the first block. BTW, a separate fix for the HTTP header charset for XML/XUL/RDF will be checked in later (after I have time to test it), so don't worry about that. I have already done that for HTML but have not had time to develop test cases for XML yet (do you have one?). See bug 125000 for that one. Marking this fixed for M10.
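
As a rough illustration of the first-block detection described above, here is a minimal sketch in Python. It is not the nsParser.cpp code: the function name sniff_xml_encoding is hypothetical, the byte patterns are a simplified subset of the Appendix F table (EBCDIC is omitted), and Python codec names such as utf-32-be stand in for the UCS-4 labels used in the specification.

```python
import re

# Byte order marks; four-byte marks come first because FF FE 00 00
# also starts with the two-byte UTF-16 little-endian mark.
_BOMS = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
]

# Without a BOM, the document is expected to start with "<?xml",
# so the layout of "<" and "?" reveals the code unit size and order.
_NO_BOM = [
    (b"\x00\x00\x00<", "utf-32-be"),
    (b"<\x00\x00\x00", "utf-32-le"),
    (b"\x00<\x00?", "utf-16-be"),
    (b"<\x00?\x00", "utf-16-le"),
]

_DECL = re.compile(
    rb'<\?xml[^>]*?\bencoding\s*=\s*(["\'])([A-Za-z][A-Za-z0-9._-]*)\1'
)


def sniff_xml_encoding(first_block):
    """Guess the encoding of an XML document from its first block of bytes."""
    for prefix, name in _BOMS + _NO_BOM:
        if first_block.startswith(prefix):
            return name
    # ASCII-compatible start: trust the encoding declaration if present.
    match = _DECL.match(first_block)
    if match:
        return match.group(2).decode("ascii")
    # No BOM and no declaration: XML defaults to UTF-8.
    return "utf-8"
```

Because everything is decided from the leading bytes, a caller could pick the charset converter before handing any data to the parser, which is why no reload is needed in the common case.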
Can you give us some test cases to verify this? If not, can you mark it verified?
To create test cases:
1. Take an existing XUL/XML file.
2. Replace the displayable text with Japanese.
3. Change the first line from <?xml version="1.0"?> to <?xml version="1.0" encoding="Shift_JIS"?>.
I believe Allan Masri already has some test cases for this. I remember we talked about this before...
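
The steps above could be scripted roughly as follows. This is only a sketch: the helper name make_shift_jis_case and its parameters are hypothetical, and it assumes the input starts with exactly the declaration shown in step 3 and that the Japanese text is representable in Shift_JIS.

```python
OLD_DECL = '<?xml version="1.0"?>'
NEW_DECL = '<?xml version="1.0" encoding="Shift_JIS"?>'


def make_shift_jis_case(xml_text, displayable, japanese):
    """Build a Shift_JIS test document from an existing XML/XUL source.

    xml_text    -- original document as a str, starting with OLD_DECL
    displayable -- the displayable text to replace
    japanese    -- the Japanese text to put in its place
    """
    if not xml_text.startswith(OLD_DECL):
        raise ValueError("document does not start with the expected declaration")
    # Steps 2 and 3: swap the text, then swap the declaration.
    body = xml_text[len(OLD_DECL):].replace(displayable, japanese)
    # The bytes on disk must match the declared encoding.
    return (NEW_DECL + body).encode("shift_jis")
```

The returned bytes can be written to a file and loaded in the browser; if the charset detection works, the Japanese text should display correctly.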
Test cases are in http://babel/automation/erik/framework. I verified this in the 9-30 build.