Closed Bug 183354 Opened 22 years ago Closed 21 years ago

Make Universal Charset Autodetector recognise UTF by BOM

Categories

(Core :: Internationalization, defect)

x86
All
defect
Not set
normal

Tracking

()

VERIFIED FIXED
mozilla1.4beta

People

(Reporter: alexeyc2003, Assigned: alexeyc2003)

References

()

Details

(Keywords: intl)

Attachments

(2 files, 2 obsolete files)

At the moment Universal Charset Autodetection only recognises UTF-16XX by it's BOM.

It should recognise all UTF by their BOM, see:
http://oss.software.ibm.com/icu/docs/papers/forms_of_unicode/

For testcases see:
http://jshin.net/i18n/utftest/

In particular all these should be recognisable:
http://jshin.net/i18n/utftest/bom_utf16be.html
http://jshin.net/i18n/utftest/bom_utf16le.html
http://jshin.net/i18n/utftest/bom_utf32be.html
http://jshin.net/i18n/utftest/bom_utf32le.html

patch coming up.
Attached patch patch (obsolete) — Splinter Review
The patch recognises BOM in following encodings:

UTF-8
UTF-16BE
UTF-16LE
UTF-32BE
UTF-32LE

I have no build environment so I haven't tested it.

I assumed that "UTF-8", "UTF-32BE" and "UTF-32LE" is meaningfull output, please
see that it is so.

I increased the minimum data size for BOM detection from 2 to 4 bytes.
taking the bug

also see related patch in bug 68738
Assignee: smontagu → alexey
Target Milestone: --- → mozilla1.3beta
Status: NEW → ASSIGNED
Attachment #108117 - Flags: superreview?(blizzard)
Attachment #108117 - Flags: review?(ftang)
Comment on attachment 108117 [details] [diff] [review]
patch

>+    if (aLen > 3)

...

>+          if ((0x00 == aBuf[1]) && (0xFE == aBuf[2]) && (0xFF == aBuf[3]))
>+            // 00 00 FE FF UTF-32, big-endian BOM

This is going to run off the end of the buffer if aLen == 3.

Also, instead of mixed switch() and if() why not just use memcmp() and if()
statements?  It will probably be more readable.
Attachment #108117 - Flags: superreview?(blizzard) → superreview-
Comment on attachment 108117 [details] [diff] [review]
patch

sorry, I mis-read that
Attachment #108117 - Flags: superreview- → superreview+
Blocks: 182796
Comment on attachment 108117 [details] [diff] [review]
patch

+	 case 0xFE:
+	   if (0xFF == aBuf[1])
+	     // FE FF UTF-16, big endian BOM
+	     mDetectedCharset = "UTF-16BE";

not necessary true. What about the case FE FF 00 00 ?

+	 case 0xFF:
+	   if ((0xFE == aBuf[1]) && (0x00 == aBuf[2]) && (0x00 == aBuf[3]))
This may cause access read violation since 
+    if (aLen > 3)
instead of 
+    if (aLen > 4)
Attachment #108117 - Flags: review?(ftang) → review-
This patch just recognises all 5 UTF encodings by BOM instead of just 2 in the
current code.
When would FE FF 00 00 be encountered? And what should Mozilla do here in that case?

And I can't see how aBuf[3] can cause access read violation in at least 4 byte
long array?
>When would FE FF 00 00 be encountered?
looks at http://www.w3.org/TR/REC-xml#sec-guessing-no-ext-info 
for the case
>how aBuf[3] can cause access read violation in at least 4 byte
long array?
you are right, I didn't see that if(aLen >3)
But this is not nessesarily markup. This could be a plaintext document.
Or should the markup specific checks still be done?

And what values should be assigned to mDetectedCharset then?
By the way, I couldn't find what is receiving these values, and what are valid ones?
>But this is not nessesarily markup. This could be a plaintext document.
>Or should the markup specific checks still be done?

My ref to the XML spec it to anwer your question about "When would FE FF 00 00
be encountered"

FE FF 00 00 is a BOM, not a mark up at all, it still apply to plain text.

>And what values should be assigned to mDetectedCharset then?
>By the way, I couldn't find what is receiving these values, and what are valid
ones?

1987 #define UCS4_2143 "X-ISO-10646-UCS-4-2143"
1988 #define UCS4_3412 "X-ISO-10646-UCS-4-3412"

as defined in htmlparser/nsParser.cpp I think
I don't think those nsParser strings are used here. They seem to be just for
internal use in it's own detector function. And their definitions are not used
anywhere else in the code.

The Universal CharDet however seems to be using (via nsICharsetAlias.h):
http://lxr.mozilla.org/seamonkey/source/intl/uconv/src/charsetalias.properties
And there are no entries for these unusual encodings in it.

Moreover:
http://lxr.mozilla.org/mozilla/source/intl/uconv/ucvlatin/nsUTF32ToUnicode.cpp
doesn't seem to handle it either.

And I can't find a converter/decoder in LXR that handles these.
I have no idea what nsParser relies on to handle those. They seem to be useless.

There is no mention of these in the UTF-32 spec:
http://www.unicode.org/unicode/reports/tr19/
or in RFC:
http://www.ietf.org/rfc/rfc3023.txt
Or anywhere else:
http://www-106.ibm.com/developerworks/library/utfencodingforms/#h4
apart from the XML spec:
http://www.w3.org/TR/REC-xml#sec-guessing-no-ext-info
And XML spec does not define UFT-32 or how it should be generally handled.
Besides,
If BOM is an unusual FE FF 00 00, we still can't tell wether it is a (3412) or
(3421) order.
Do you know of any actual machines that operate with such byte orders?
Do we have a backend to support it?
Do we really *have* to support it when there is no mention of it in UTF-32 spec?
Attachment #108117 - Attachment is obsolete: true
Attachment #119607 - Flags: superreview?(blizzard)
Attachment #119607 - Flags: review?(ftang)
Target Milestone: mozilla1.3beta → mozilla1.4beta
Comment on attachment 119607 [details] [diff] [review]
patch with unusual BOM checks added

r=ftang
Attachment #119607 - Flags: review?(ftang) → review+
Comment on attachment 119607 [details] [diff] [review]
patch with unusual BOM checks added

sr=blizzard
Attachment #119607 - Flags: superreview?(blizzard) → superreview+
Checking in nsUniversalDetector.cpp;
/cvsroot/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp,v  <--
 nsUniversalDetector.cpp
new revision: 1.14; previous revision: 1.13
done


(I added the missing "if" on the "else (..)" line)
Status: ASSIGNED → RESOLVED
Closed: 21 years ago
Resolution: --- → FIXED
however, after checking in, I believe this fix is not quite right.
See these compile warnings:
/home/chb/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:129:
warning: comparison is always false due to limited range of data type
/home/chb/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:129:
warning: comparison is always false due to limited range of data type
/home/chb/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:134:
warning: comparison is always false due to limited range of data type
/home/chb/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:137:
warning: comparison is always false due to limited range of data type
/home/chb/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:142:
warning: comparison is always false due to limited range of data type
/home/chb/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:142:
warning: comparison is always false due to limited range of data type
/home/chb/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:145:
warning: comparison is always false due to limited range of data type
/home/chb/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:145:
warning: comparison is always false due to limited range of data type
/home/chb/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:150:
warning: comparison is always false due to limited range of data type
/home/chb/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:153:
warning: comparison is always false due to limited range of data type


You just can't assume that a char is unsigned. Maybe this function should take a
"const PRUint8*" isntead of "const char*".
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
just some friendly reminders:
//If the data starts with BOM, we know it is UTF

'BOM' needs an article 'a'
'data' doesn't match 'is'
the line could be a sentence
you could replace ',' with ' then'
This warning is also fun:

5.extensions/universalchardet/src/nsUniversalDetector.cpp:133 (See build log
excerpt)
Overflow in implicit constant conversion

This is on the "case 0xFE" line.  It's taking the 0xFE and casting it to "char".

Yes, you can't assume that "char" is unsigned.  It's not.
Attached patch fix (obsolete) — Splinter Review
Comment on attachment 119979 [details] [diff] [review]
fix

smontagu/dbaron still don't like data(pl) starts(sing)

and i'm not fond of the end of that line, but we'll agree on something before
someone checks in
Attachment #119979 - Flags: review+
Comment on attachment 119979 [details] [diff] [review]
fix

Um, |const unsigned char*| and |const char*| shouldn't be compatible types on
platforms where |char| is signed, right?

Also, "data start", not "data starts".
Attachment #119979 - Flags: superreview-
For what it's worth, the gcc 3.2.2 warnings are

/builds/trunk/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:
In member function `virtual void nsUniversalDetector::HandleData(const char*,
unsigned int)':
/builds/trunk/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:128:
warning: overflow in implicit constant conversion
/builds/trunk/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:129:
warning: comparison is always false due to limited range of data type
/builds/trunk/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:129:
warning: comparison is always false due to limited range of data type
/builds/trunk/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:133:
warning: overflow in implicit constant conversion
/builds/trunk/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:134:
warning: comparison is always false due to limited range of data type
/builds/trunk/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:137:
warning: comparison is always false due to limited range of data type
Attachment #119979 - Attachment is obsolete: true
Attachment #119985 - Flags: superreview?(bzbarsky)
Attachment #119985 - Flags: review?(smontagu)
Comment on attachment 119985 [details] [diff] [review]
proposed fix for warnings

r=smontagu
Attachment #119985 - Flags: review?(smontagu) → review+
Attachment #119985 - Flags: superreview?(bzbarsky) → superreview+
Comment on attachment 119985 [details] [diff] [review]
proposed fix for warnings

Checked in to trunk, 2003-04-09 14:13 PDT.
Due to bug 68738 check in that happened on the same day, there is no way to test
this.
Testing shows, that parser code runs even for text/plain files, and this
detector is pretty much useless now.
Anyways, just assume fixed.
R.I.P Universal Detector.
Status: REOPENED → RESOLVED
Closed: 21 years ago21 years ago
Resolution: --- → FIXED
Mark as verified.  Pages:
http://jshin.net/i18n/utftest/bom_utf32be.html
http://jshin.net/i18n/utftest/bom_utf32le.html
are displayed fine on 05-08 trunk build while they were not in N7.02.
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: