Closed
Bug 68738
(BOM)
Opened 25 years ago
Closed 22 years ago
UTF-XX with BOM always maps to UTF-XXBE
Categories
(Core :: Internationalization, defect)
Core
Internationalization
Tracking
VERIFIED
FIXED
mozilla1.4beta
People
(Reporter: nhottanscp, Assigned: ftang)
References
Details
(Keywords: intl, testcase)
Attachments
(1 file, 2 obsolete files)
Attachment: patch, 5.45 KB (hjtoi-bugzilla: superreview+)
Currently "UTF-16" is always mapped to "UTF-16BE" for alias resolution, but it
could be "UTF-16LE" depending on the data.
http://lxr.mozilla.org/seamonkey/source/intl/uconv/src/charsetalias.properties#69
We could keep "UTF-16" without mapping it to another name.
Comment 1 (Assignee) • 25 years ago
assign. Future
Status: NEW → ASSIGNED
Target Milestone: --- → Future
Updated•25 years ago
QA Contact: teruko → ftang
Comment 2•24 years ago
This bug seems to mean that Windows' 'save as Unicode' format (which is
little-endian UTF-16 with a byte-order mark) cannot be used in web pages. If
I tag my page as UTF-16, IE 5.5 reads it correctly but Mozilla attempts to read it
as UTF-16BE, producing gibberish. If I tag it as UTF-16LE (and remove the BOM),
Mozilla works fine but IE doesn't recognize the tag and defaults to ISO 8859-1,
producing gibberish again.
UTF-16 is neither UTF-16BE nor UTF-16LE. The -LE and -BE encodings NEVER have
a byte-order mark, whereas text tagged as UTF-16 MAY or may not have a BOM
(RFC 2781).
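To illustrate the distinction, here is a minimal sketch (a hypothetical helper, not Mozilla's actual alias-resolution code) of what resolving the label "UTF-16" per RFC 2781 looks like: FE FF selects big-endian, FF FE selects little-endian, and only when there is no BOM does the RFC say the text should be interpreted as big-endian.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: pick a concrete decoder for data labeled "UTF-16"
// by sniffing the BOM, as RFC 2781 describes.
std::string ResolveUtf16(const unsigned char* aBytes, std::size_t aLen) {
    if (aLen >= 2) {
        if (aBytes[0] == 0xFE && aBytes[1] == 0xFF)
            return "UTF-16BE"; // BOM FE FF: big-endian
        if (aBytes[0] == 0xFF && aBytes[1] == 0xFE)
            return "UTF-16LE"; // BOM FF FE: little-endian
    }
    return "UTF-16BE"; // no BOM: RFC 2781 says to assume big-endian
}
```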
Comment 3•23 years ago
Same happens with UTF-32
Summary: UTF-16 always maps to UTF-16BE → UTF-XX always maps to UTF-XXBE
Comment 5•23 years ago
There are too many testcases on that page; this bug is concerned only with these two:
http://jshin.net/i18n/utftest/bom_utf16le.utf16.html
http://jshin.net/i18n/utftest/bom_utf32le.utf32.html
Comment 6•23 years ago
The problem is with pages that contain a BOM.
We do the correct thing with BOM-less pages, which should be treated as BE.
See: http://www-106.ibm.com/developerworks/library/utfencodingforms/#h4
Summary: UTF-XX always maps to UTF-XXBE → UTF-XX with BOM always maps to UTF-XXBE
Comment 7•23 years ago
*** Bug 112690 has been marked as a duplicate of this bug. ***
Comment 8•23 years ago
Always assuming BE for data without a BOM is not in the spirit of Annex F of the
XML recommendation:
http://www.w3.org/TR/REC-xml#sec-guessing
It recommends checking whether the first character is '<', in any of the UTFs
(i.e. accompanied by null bytes). Even though this annex is not normative, it
sounds like a sensible thing to do. If the autodetection fails, you can still
fall back to UTF-16BE. In a strict interpretation of this annex, you would
require '<?' for autodetection to work, but I think the scheme readily extends
to HTML.
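The Annex F heuristic described above can be sketched as follows (a hypothetical helper for illustration, not the patch attached to this bug): with no BOM, look for '<' (0x3C) padded with null bytes in the first four octets, and fall back to UTF-16BE when nothing matches.

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch of XML Annex F byte-pattern guessing for BOM-less data.
std::string GuessFromMarkup(const unsigned char b[4]) {
    if (b[0] == 0x00 && b[1] == 0x00 && b[2] == 0x00 && b[3] == 0x3C)
        return "UTF-32BE"; // '<' as a big-endian 32-bit unit
    if (b[0] == 0x3C && b[1] == 0x00 && b[2] == 0x00 && b[3] == 0x00)
        return "UTF-32LE"; // '<' as a little-endian 32-bit unit
    if (b[0] == 0x00 && b[1] == 0x3C && b[2] == 0x00 && b[3] == 0x3F)
        return "UTF-16BE"; // "<?" in big-endian UTF-16
    if (b[0] == 0x3C && b[1] == 0x00 && b[2] == 0x3F && b[3] == 0x00)
        return "UTF-16LE"; // "<?" in little-endian UTF-16
    return "UTF-16BE";     // autodetection failed: fall back to BE
}
```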
Comment 9•23 years ago
*** Bug 175083 has been marked as a duplicate of this bug. ***
Frank, could we get this fixed within milestone 1.3 perhaps? We are starting to
get dupes so people clearly are affected by this.
Updated•23 years ago
Alias: BOM
Comment 11•23 years ago
With this patch, all the test cases given at the page in the URL field
get rendered as intended.
Comment 12•23 years ago
> Always assuming BE on data without a BOM is not in the spirit of Annex
> F of the XML recommendation:
I fully agree with Martin that the spirit of the XML recommendation
is to try as hard as possible to detect the endianness. UTF-16/UTF-32
should not be taken to mean UTF-16BE/UTF-32BE by default.
Only after all those efforts fail do we have to assume
network byte order (BE) as a last resort. My patch implements that, and all 48(?)
test cases on my test page are rendered successfully with my patch.
Updated•23 years ago
Attachment #108664 -
Flags: review?(ftang)
Comment 13•23 years ago
Sorry for spamming. I didn't realize that I wasn't on the CC list.
Comment 14•23 years ago
This looks very good, and it also covers <!-- and <![CDATA - everything you'd want.
But have a look at my solution in bug 183354
This is very similar kind of code we're navigating, but in different places.
I may be wrong, but the way I see it, BOM detection should be left to
nsUniversalDetector.cpp and BOM-less detection to nsParser.cpp.
UniversalDetector needs BOM detection because a document isn't necessarily
markup; text/plain UTF needs detection too, and it doesn't go through the Parser.
And if markup goes through the Universal Auto Detector first and you have
detection in the Parser as well, isn't that doing the same work twice?
This is merely speculation, for I don't know where the document goes through in
different cases. It just sounds logical, but I might be wrong.
Comment 15•23 years ago
What a coincidence ! I was about to look at it because
I was also concerned about text/plain, but you got there
first :-) Thank you for the note.
According to nsIParser.h, 'autodetection' has a lower priority than
'BOM detection' (see the set of macro definitions in the file).
I haven't yet checked the code path followed by html/xml documents,
but we may need both (yours and mine). I'll try your patch
after removing BOM detection in nsParser and see what happens...
Even if we don't need BOM detection in nsParser, it may be good to
keep it just for the sake of robustness....
Comment 16•23 years ago
I haven't done the experiment mentioned in my prev. comment
(my mozilla build got kinda screwed up and I'm clobbering and
rebuilding it). Anyway, I think we have to put both patches
in because the universal detection is not always on (see
View|Character Coding|AutoDetection menu). Perhaps, a separate
bug might as well be filed (if deemed necessary) to make Mozilla
always detect BOM even with autodetection off (it does for html/xml in
nsParser.cpp but it does not for text/plain). Alternatively,
a new entry may be added to View|Character Coding|AutoDetection
that is to enable BOM detection only with other detection turned
off. Again this is only useful for non-html/non-xml files.
BTW, there are a couple of other places where BOM is searched for.
One of them is in intl/chardet, which is for language-specific
(as opposed to 'universal') charset detection, I believe.
Comment 17•23 years ago
cc-ing Hixie, because we were discussing this issue on IRC.
>I may be wrong, but the way I see it - BOM detection should be left to
>nsUniversalDetector.cpp and BOM-less detection - for nsParser.cpp
I think exactly the opposite, but we probably need to define the different cases
more systematically.
Comment 18•23 years ago
Yes, I actually was thinking the same thing yesterday about user switching off
auto detection completely. So I agree, it should be both places.
And I'd hope that the code avoids redundancy, and that if the charset is already
known through autodetector, the parser won't bother detecting it. But in case of
UTF-16 and UTF-32 (when charset is already provided) what drives it to do
detection of endianness?
By the way, there is at least one more place where this kind of code can be
found. I think either the JavaScript or CSS parser also does UTF detection.
Comment 19•23 years ago
Just a thought: to make BOM-less detection completely bulletproof, we
could also skip leading whitespace before doing detection.
Worth the bother?
Comment 20•23 years ago
>>I may be wrong, but the way I see it - BOM detection should be left to
>>nsUniversalDetector.cpp and BOM-less detection - for nsParser.cpp
>I think exactly the opposite,
I sorta agree, but the answer really depends on what 'BOM-less detection' is.
If it's a simple check of the first four octets, it has to be done
in nsParser and only works for html/xml. On the other hand,
a more 'invasive' detection method (reading the first few tens of
octets and analyzing the octet pattern, e.g. '00 xx 00 xx 00 xx 00 xx...'
for UTF-16BE) that works for text/plain as well as for
html/xml belongs in the universal character detection module.
As for BOM detection, Alexey may have reached that conclusion
because the BOM is not part of html/xml per se but is a signature
for the endianness, and is found not only in html/xml but in
text/plain as well.
In any case, because the universal character detection can be turned off, I
think we need to keep a simple 'bom-less' detection as well
as BOM detection in nsParser.
Now, what should we do to avoid the code repetition in
multiple modules? Do we need a method/function to
return endianness given 'four octets' that can be shared
across modules?
I think the XML recommendation requires that only the first four octets be
checked. I should check the spec to make sure, but if so, then the more generic
approach is not appropriate for XML.
Comment 22 (Assignee) • 23 years ago
Comment on attachment 108664 [details] [diff] [review]
a patch
It looks like http://www.w3.org/TR/REC-xml#sec-guessing-no-ext-info adds
additional conditions compared to
http://www.w3.org/TR/1998/REC-xml-19980210#sec-guessing
Your patch does not cover the following cases mentioned in
XML 1.0 2nd edition:
00 00 FF FE UCS-4, unusual octet order (2143)
FE FF 00 00 UCS-4, unusual octet order (3412)
Attachment #108664 -
Flags: review?(ftang) → review-
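For reference, the full signature table from the XML 1.0 2nd edition, including the two unusual UCS-4 octet orders, can be sketched like this (a hypothetical helper, not the attached patch). Note that the four-octet signatures must be tested before the two-octet UTF-16 ones, since FF FE 00 00 also begins with the UTF-16LE BOM FF FE.

```cpp
#include <cassert>
#include <string>

// Hypothetical mapping of the first four octets to an encoding, following
// the BOM table in XML 1.0 2nd edition Appendix F. Longest signatures first.
std::string GuessFromSignature(const unsigned char b[4]) {
    if (b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF)
        return "UCS-4BE";    // 1234 octet order
    if (b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00)
        return "UCS-4LE";    // 4321 octet order
    if (b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFF && b[3] == 0xFE)
        return "UCS-4-2143"; // unusual octet order
    if (b[0] == 0xFE && b[1] == 0xFF && b[2] == 0x00 && b[3] == 0x00)
        return "UCS-4-3412"; // unusual octet order
    if (b[0] == 0xFE && b[1] == 0xFF)
        return "UTF-16BE";
    if (b[0] == 0xFF && b[1] == 0xFE)
        return "UTF-16LE";
    if (b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
        return "UTF-8";
    return "unknown";
}
```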
Comment 23•23 years ago
Frank's concern addressed.
Attachment #108664 -
Attachment is obsolete: true
Comment 24•23 years ago
Comment on attachment 115591 [details] [diff] [review]
a new patch with two unusual BO UCS-4's added
+ // 3C 00 XX 00
+ if(IS_SECOND_MARKER(aBytes[2])) {
+ // 00 3C SM 00 UTF-16, little-endian, no Byte Order Mark
There's a typo in comment above.
// 00 3C SM 00 UTF-16, little...
should read
// 3C 00 SM 00 UTF-16, little....
The actual code is not affected.
I have tested all usual BO UTF-16/32's,
but haven't tested unusual BO cases.
Do we have decoders for them? Anyway,
the detection should work for them, too.
Attachment #115591 -
Attachment description: a new patch with two unusual BO UCS-4 added → a new patch with two unusual BO UCS-4's added
Attachment #115591 -
Flags: review?(ftang)
Comment 25 (Assignee) • 22 years ago
Comment on attachment 115591 [details] [diff] [review]
a new patch with two unusual BO UCS-4's added
r=ftang
Attachment #115591 -
Flags: review?(ftang) → review+
Comment 26•22 years ago
Comment on attachment 115591 [details] [diff] [review]
a new patch with two unusual BO UCS-4's added
asking for sr. let's try
to put this before 1.4
Attachment #115591 -
Flags: superreview?(heikki)
Comment 27•22 years ago
Comment on attachment 115591 [details] [diff] [review]
a new patch with two unusual BO UCS-4's added
+ oCharset.AssignWithConversion(UCS2_LE); // should change to UTF-16BE
Typo in comment?
Besides, what do those comments mean anyways?
What should be changed?
Changing milestone to 1.4beta to reflect that this is what we are trying to achieve.
I've been going back and forth on this because intl is not really my speciality.
Adding blizzard to Cc 'cos he seems to be (according to sr list).
Anyway, I think using an obscure function like PL_strnpbrk to just test if a
character matches one of four possible characters seems like an overkill. Why
not write a simple inline function for this, for example:
static inline PRBool IsSecondMarker(unsigned char aChar)
{
switch (aChar) {
case '!':
case '?':
case 'h':
case 'H':
return PR_TRUE;
}
return PR_FALSE;
}
Something that I am concerned about is that for XML the detection must be very
strict; does this IsSecondMarker stuff change that?
http://www.w3.org/TR/REC-xml#sec-guessing
Target Milestone: Future → mozilla1.4beta
Comment 29•22 years ago
> Why not write a simple inline function for this, for example:
Thanks for the suggestion. Done in my tree. I'll upload a new patch
after rebuilding Mozilla. Another possibility is just using
memcmp instead of a series of switch/if-else as suggested by
Chris in bug 183354. If memcmp is sorta 'inlined', it may be faster
(I read in another bug that 'VC++ -O1' doesn't do that..)
> does this IsSecondMarker stuff change that?
> http://www.w3.org/TR/REC-xml#sec-guessing
No, I don't think so. The reason I added it is that without it,
Mozilla fails to detect the encoding of html docs in BOM-less
UTF-16LE/UTF-16BE. The code I'm changing deals with both
html and xml. See also comment #8, comment #12, comment #14.
> Besides, what do those comments mean anyways?
I'm not sure ...
> What should be changed?
I just kept them. Anyway, in a new patch I'm
gonna upload, I removed them, replacing UCS2_LE|BE with UTF16_LE|BE.
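The memcmp alternative mentioned above can be sketched like this (an assumption for illustration; this is not the code that actually landed). Comparing the leading octets against a signature in one memcmp call replaces a chain of per-byte if/else tests.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical signature tables; names are illustrative, not Mozilla's.
const unsigned char kUtf16LeBom[] = {0xFF, 0xFE};
const unsigned char kUtf16BeBom[] = {0xFE, 0xFF};

// Returns true if the buffer starts with the given byte signature.
bool HasSignature(const unsigned char* aBytes, std::size_t aLen,
                  const unsigned char* aSig, std::size_t aSigLen) {
    return aLen >= aSigLen && std::memcmp(aBytes, aSig, aSigLen) == 0;
}
```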
Comment 30•22 years ago
Basically the same patch.
Can I get r/sr from Frank and Heikki?
This has been sitting for quite a while. Let's try to get this in
for 1.4beta as Heikki suggested.
Attachment #115591 -
Attachment is obsolete: true
Comment 31•22 years ago
Comment on attachment 119949 [details] [diff] [review]
a new patch with inline function for SM detection
+ oCharset.AssignWithConversion(UCS4_LE); // should change to UTF-32LE
Comment on attachment 119949 [details] [diff] [review]
a new patch with inline function for SM detection
You don't need |const| in the parameter to IsSecondMarker().
Attachment #119949 -
Flags: superreview+
Comment 33•22 years ago
Thank you all. Fix just checked in.
In the patch checked in, I removed |const|.
I don't know what I was thinking :-)
I also removed the comment pointed out by Alexey.
Status: ASSIGNED → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
Updated•22 years ago
Attachment #115591 -
Flags: superreview?(heikki)
Comment 34•22 years ago
Verified.
Although it overrides Universal Charset Detector now, even for text/plain files.
(See bug 183354)
Status: RESOLVED → VERIFIED
The HTML parser isn't the only consumer of encoders/decoders. This patch made
the CSS parser not support UTF-16 anymore. See bug 235090.
Comment 36•22 years ago
Even without this patch, UTF-16 (with BOM indicating LE) didn't work because we
mapped UTF-16 to UTF-16BE in charset.aliases.