Closed Bug 12375 Opened 25 years ago Closed 25 years ago

Optimize charset conversion process for intl documents

Categories

(Core :: Internationalization, defect, P3)

x86
Windows NT
defect

Tracking

VERIFIED FIXED

People

(Reporter: nisheeth_mozilla, Assigned: ftang)

Details

This bug is being created as a result of discussion that happened on bug 8607.
We fixed that bug but the issues raised by James Clark's comments needed further
tracking.  Hence, this bug.

The last two comments on bug 8607 are pasted here to provide contextual
information.

My post:

I just discussed charset conversion with Frank Tang, an internationalization
engineer.  We decided that we will sniff the encoding and convert the incoming
data to UCS2 before the data gets passed to expat.  So, expat will always see
UCS2.  Till now, we converted the incoming data to UCS2 without sniffing the
encoding (we assumed that the encoding of the incoming data was UTF-8) and
passed on the UCS2 data to expat.  The expectation was that if the encoding was
non-UTF-8 and our guess was wrong, we would re-load the document and convert
the incoming data to UCS2 using the specified encoding.  This is why we needed
the encoding callback from expat.  Now that we'll determine the encoding before
expat sees the data, we don't need the callback any more.

James Clark's reply:

The approach you now have in mind should work.  Note that you should only sniff
the encoding if the content-type is application/xml with no charset parameter.
If the content-type is text/xml with a charset or application/xml with a charset
parameter, then RFC 2376 requires you to use the specified charset parameter.
If the content-type is text/xml with no charset parameter, RFC 2376 requires you
to use us-ascii as the encoding.
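
The content-type rules described above can be sketched as a small lookup. This is only an illustration of the rules as stated in this comment, not code from the Mozilla tree; the function name `charset_from_content_type` and the simple `;`-based header parsing are assumptions made for the example.

```python
def charset_from_content_type(content_type):
    """Return the charset to use, or None when the encoding must be sniffed.

    Hypothetical helper: follows the RFC 2376 rules as summarized in the
    comment above, for text/xml and application/xml only.
    """
    parts = [p.strip() for p in content_type.split(";")]
    media_type = parts[0].lower()

    # Look for an explicit charset parameter, e.g. 'charset="Shift_JIS"'.
    charset = None
    for param in parts[1:]:
        name, _, value = param.partition("=")
        if name.strip().lower() == "charset":
            charset = value.strip().strip('"').lower()

    if charset:
        # An explicit charset always wins.
        return charset
    if media_type == "text/xml":
        # text/xml without a charset defaults to us-ascii.
        return "us-ascii"
    # application/xml without a charset (and anything else this sketch
    # does not cover): the caller should sniff the encoding.
    return None
```

A return value of `None` here corresponds to the one case in which sniffing is allowed.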

If you know the encoding from the content-type and the encoding is one that
expat can handle internally (utf-8, utf-16, us-ascii, iso-8859-1), then it is
much more efficient just to pass the data unconverted to expat, tell expat what
the encoding is in XML_ParserCreate and let expat do the conversion.

However, the approach you have in mind is less efficient than it need be.  An
approach more like what you originally had in mind would be more efficient:

- pass the incoming data without conversion to expat

- the encoding parameter passed to XML_ParserCreate should be the encoding
determined by the content-type per RFC 2376 unless it is application/xml without
a charset in which case it should be null

- set an UnknownEncodingHandler; if that gets called then stop the parse; reload
the document, this time converting the data into UTF-16

- better yet, use expat's unknown encoding handling machinery; for example, if
you have a single byte encoding, the unknown encoding handler can pass expat a
table that maps the encoding into Unicode, and then expat can do the conversion
as part of the encoding process; only reload the document when you get an
encoding of a type that expat's unknown encoding handling machinery cannot
handle (i.e., a stateful encoding such as ISO-2022-JP)
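
As a rough illustration of the last point: for a single-byte encoding, the table handed to expat is a 256-entry map from byte value to Unicode code point (in the C API this is the `map` array of `XML_Encoding`). The sketch below builds such a table in Python from a standard-library codec. The function name `single_byte_map` is an assumption for the example, and using -1 for bytes the codec does not define is a simplification.

```python
def single_byte_map(codec_name):
    """Build a 256-entry byte -> code point table for a single-byte codec.

    Hypothetical helper: only meaningful for single-byte encodings.
    Bytes the codec cannot decode are marked with -1.
    """
    table = []
    for byte in range(256):
        try:
            table.append(ord(bytes([byte]).decode(codec_name)))
        except UnicodeDecodeError:
            table.append(-1)
    return table
```

For example, `single_byte_map("iso-8859-2")[0xA1]` gives `0x0104`. A stateful encoding such as ISO-2022-JP cannot be described by a table like this, which is why that case still needs the reload.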
Status: NEW → ASSIGNED
Target Milestone: M10
Status: ASSIGNED → RESOLVED
Closed: 25 years ago
Resolution: --- → FIXED
Added code to nsParser.cpp to detect the BOM and to implement Appendix F of
XML 1.0.

Jim: What I did is slightly different from what Nisheeth specified. I check the
first block of data, perform the detection specified in XML Appendix F, and
reset the charset converter in the scanner. No reload should be required, since
we can almost always find the charset in the first block.
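
The first-block detection described above might look roughly like the following. This is a simplified sketch, not the code that went into nsParser.cpp: it covers only the UTF-8 and UTF-16 byte order marks, UTF-16 without a BOM, and an ASCII-compatible encoding declaration, and it leaves out the UCS-4 and EBCDIC cases from Appendix F. The name `sniff_xml_encoding` and the returned labels are assumptions for the example.

```python
import re

# Byte order marks checked first (hypothetical label names).
_BOMS = [
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
]

# Encoding declaration in an ASCII-compatible XML declaration.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_xml_encoding(first_block):
    """Guess the encoding of an XML document from its first block of bytes."""
    for bom, name in _BOMS:
        if first_block.startswith(bom):
            return name
    # "<?" encoded as UTF-16 without a byte order mark.
    if first_block.startswith(b"\x00<\x00?"):
        return "utf-16-be"
    if first_block.startswith(b"<\x00?\x00"):
        return "utf-16-le"
    # ASCII-compatible start: read the encoding declaration if there is one.
    match = _DECL.match(first_block)
    if match:
        return match.group(1).decode("ascii").lower()
    # No BOM and no declaration: XML defaults to UTF-8.
    return "utf-8"
```

Because the XML declaration has to come at the very start of the document, the first block is normally enough to decide. A declaration split across two network reads is one case this sketch would miss.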

BTW, a separate fix will be checked in later (after I have time to test) for
the HTTP header charset for XML/XUL/RDF, so don't worry about that. I have
already done that for HTML but have not had time to develop test cases for XML
yet (do you have one?). See bug 125000 for that one.

Mark it M10 fixed.
Can you give us some test cases to verify this? If not, can you mark it
verified?
To create test cases:
1. take existing xul/xml file
2. replace displayable text with Japanese
3. replace the first line from
<?xml version="1.0"?>
to

<?xml version="1.0" encoding="Shift_JIS"?>

I believe Allan Masri already has some test cases for this. I remember we talked
about this before...
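
The test-case steps above could be automated along these lines. This is a sketch under assumptions: the function name `make_shift_jis_test_case` and its `placeholder` parameter are made up for the example, and it takes the existing file's content as a string rather than reading a real xul/xml file.

```python
import re


def make_shift_jis_test_case(xml_text, japanese_text, placeholder):
    """Return the bytes of a Shift_JIS test document.

    Hypothetical helper: swaps the XML declaration for one that names
    Shift_JIS, replaces `placeholder` with Japanese text, and encodes
    the result as Shift_JIS.
    """
    declaration = '<?xml version="1.0" encoding="Shift_JIS"?>'
    # Step 3: replace the first-line XML declaration.
    body = re.sub(r"^<\?xml[^>]*\?>", declaration, xml_text, count=1)
    # Step 2: replace the displayable text with Japanese.
    body = body.replace(placeholder, japanese_text)
    # The bytes on disk must match the declared encoding.
    return body.encode("shift_jis")
```

The returned bytes should be written in binary mode, so that the declared encoding and the encoding on disk stay in agreement.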
Status: RESOLVED → VERIFIED
Test cases are in http://babel/automation/erik/framework.
I verified this in the 9-30 build.