Closed Bug 183354 Opened 23 years ago Closed 23 years ago

Make Universal Charset Autodetector recognise UTF by BOM

Tracking

()

Status:

VERIFIED FIXED

Milestone:

mozilla1.4beta

People

(Reporter: alexeyc2003, Assigned: alexeyc2003)

References

(
URL
)

Details

(Keywords: intl)

Attachments

(2 files, 2 obsolete files)

patch 23 years ago Alexey Chernyak 1.78 KB, patch	ftang : review- blizzard : superreview+	Details \| Diff \| Splinter Review
patch with unusual BOM checks added 23 years ago Alexey Chernyak 2.19 KB, patch	ftang : review+ blizzard : superreview+	Details \| Diff \| Splinter Review
fix 23 years ago Alexey Chernyak 906 bytes, patch	timeless : review+ dbaron : superreview-	Details \| Diff \| Splinter Review
proposed fix for warnings 23 years ago David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla) 3.14 KB, patch	smontagu : review+ bzbarsky : superreview+	Details \| Diff \| Splinter Review

Alexey Chernyak

Assignee

Description

•

23 years ago

At the moment Universal Charset Autodetection only recognises UTF-16XX by it's BOM. It should recognise all UTF by their BOM, see: http://oss.software.ibm.com/icu/docs/papers/forms_of_unicode/ For testcases see: http://jshin.net/i18n/utftest/ In particular all these should be recognisable: http://jshin.net/i18n/utftest/bom_utf16be.html http://jshin.net/i18n/utftest/bom_utf16le.html http://jshin.net/i18n/utftest/bom_utf32be.html http://jshin.net/i18n/utftest/bom_utf32le.html patch coming up.

Alexey Chernyak

Assignee

Comment 1

•

23 years ago

Attached patch patch (obsolete) — Details — Splinter Review

The patch recognises BOM in following encodings: UTF-8 UTF-16BE UTF-16LE UTF-32BE UTF-32LE I have no build environment so I haven't tested it. I assumed that "UTF-8", "UTF-32BE" and "UTF-32LE" is meaningfull output, please see that it is so. I increased the minimum data size for BOM detection from 2 to 4 bytes.

Alexey Chernyak

Assignee

Comment 2

•

23 years ago

taking the bug also see related patch in bug 68738

Assignee: smontagu → alexey

Target Milestone: --- → mozilla1.3beta

Alexey Chernyak

Assignee

Updated

•

23 years ago

Status: NEW → ASSIGNED

Alexey Chernyak

Assignee

Updated

•

23 years ago

Attachment #108117 - Flags: superreview?(blizzard)

Attachment #108117 - Flags: review?(ftang)

Christopher Blizzard (:blizzard)

Comment 3

•

23 years ago

Comment on attachment 108117 [details] [diff] [review] patch >+ if (aLen > 3) ... >+ if ((0x00 == aBuf[1]) && (0xFE == aBuf[2]) && (0xFF == aBuf[3])) >+ // 00 00 FE FF UTF-32, big-endian BOM This is going to run off the end of the buffer if aLen == 3. Also, instead of mixed switch() and if() why not just use memcmp() and if() statements? It will probably be more readable.

Attachment #108117 - Flags: superreview?(blizzard) → superreview-

Christopher Blizzard (:blizzard)

Comment 4

•

23 years ago

Comment on attachment 108117 [details] [diff] [review] patch sorry, I mis-read that

Attachment #108117 - Flags: superreview- → superreview+

Simon Montagu :smontagu

Updated

•

23 years ago

Blocks: 182796

Frank Tang

Comment 5

•

23 years ago

Comment on attachment 108117 [details] [diff] [review] patch + case 0xFE: + if (0xFF == aBuf[1]) + // FE FF UTF-16, big endian BOM + mDetectedCharset = "UTF-16BE"; not necessary true. What about the case FE FF 00 00 ? + case 0xFF: + if ((0xFE == aBuf[1]) && (0x00 == aBuf[2]) && (0x00 == aBuf[3])) This may cause access read violation since + if (aLen > 3) instead of + if (aLen > 4)

Attachment #108117 - Flags: review?(ftang) → review-

Alexey Chernyak

Assignee

Comment 6

•

23 years ago

This patch just recognises all 5 UTF encodings by BOM instead of just 2 in the current code. When would FE FF 00 00 be encountered? And what should Mozilla do here in that case? And I can't see how aBuf[3] can cause access read violation in at least 4 byte long array?

Frank Tang

Comment 7

•

23 years ago

>When would FE FF 00 00 be encountered? looks at http://www.w3.org/TR/REC-xml#sec-guessing-no-ext-info for the case

Frank Tang

Comment 8

•

23 years ago

>how aBuf[3] can cause access read violation in at least 4 byte long array? you are right, I didn't see that if(aLen >3)

Alexey Chernyak

Assignee

Comment 9

•

23 years ago

But this is not nessesarily markup. This could be a plaintext document. Or should the markup specific checks still be done? And what values should be assigned to mDetectedCharset then? By the way, I couldn't find what is receiving these values, and what are valid ones?

Frank Tang

Comment 10

•

23 years ago

>But this is not nessesarily markup. This could be a plaintext document. >Or should the markup specific checks still be done? My ref to the XML spec it to anwer your question about "When would FE FF 00 00 be encountered" FE FF 00 00 is a BOM, not a mark up at all, it still apply to plain text. >And what values should be assigned to mDetectedCharset then? >By the way, I couldn't find what is receiving these values, and what are valid ones? 1987 #define UCS4_2143 "X-ISO-10646-UCS-4-2143" 1988 #define UCS4_3412 "X-ISO-10646-UCS-4-3412" as defined in htmlparser/nsParser.cpp I think

Alexey Chernyak

Assignee

Comment 11

•

23 years ago

I don't think those nsParser strings are used here. They seem to be just for internal use in it's own detector function. And their definitions are not used anywhere else in the code. The Universal CharDet however seems to be using (via nsICharsetAlias.h): http://lxr.mozilla.org/seamonkey/source/intl/uconv/src/charsetalias.properties And there are no entries for these unusual encodings in it. Moreover: http://lxr.mozilla.org/mozilla/source/intl/uconv/ucvlatin/nsUTF32ToUnicode.cpp doesn't seem to handle it either. And I can't find a converter/decoder in LXR that handles these. I have no idea what nsParser relies on to handle those. They seem to be useless. There is no mention of these in the UTF-32 spec: http://www.unicode.org/unicode/reports/tr19/ or in RFC: http://www.ietf.org/rfc/rfc3023.txt Or anywhere else: http://www-106.ibm.com/developerworks/library/utfencodingforms/#h4 apart from the XML spec: http://www.w3.org/TR/REC-xml#sec-guessing-no-ext-info And XML spec does not define UFT-32 or how it should be generally handled.

Alexey Chernyak

Assignee

Comment 12

•

23 years ago

Besides, If BOM is an unusual FE FF 00 00, we still can't tell wether it is a (3412) or (3421) order. Do you know of any actual machines that operate with such byte orders? Do we have a backend to support it? Do we really *have* to support it when there is no mention of it in UTF-32 spec?

Alexey Chernyak

Assignee

Comment 13

•

23 years ago

Attached patch patch with unusual BOM checks added — Details — Splinter Review

Attachment #108117 - Attachment is obsolete: true

Alexey Chernyak

Assignee

Updated

•

23 years ago

Attachment #119607 - Flags: superreview?(blizzard)

Attachment #119607 - Flags: review?(ftang)

Alexey Chernyak

Assignee

Updated

•

23 years ago

Target Milestone: mozilla1.3beta → mozilla1.4beta

Frank Tang

Comment 14

•

23 years ago

Comment on attachment 119607 [details] [diff] [review] patch with unusual BOM checks added r=ftang

Attachment #119607 - Flags: review?(ftang) → review+

Christopher Blizzard (:blizzard)

Comment 15

•

23 years ago

Comment on attachment 119607 [details] [diff] [review] patch with unusual BOM checks added sr=blizzard

Attachment #119607 - Flags: superreview?(blizzard) → superreview+

Christian :Biesinger (don't email me, ping me on IRC)

Comment 16

•

23 years ago

Checking in nsUniversalDetector.cpp; /cvsroot/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp,v <-- nsUniversalDetector.cpp new revision: 1.14; previous revision: 1.13 done (I added the missing "if" on the "else (..)" line)

Status: ASSIGNED → RESOLVED

Closed: 23 years ago

Resolution: --- → FIXED

Christian :Biesinger (don't email me, ping me on IRC)

Comment 17

•

23 years ago

however, after checking in, I believe this fix is not quite right. See these compile warnings: /home/chb/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:129: warning: comparison is always false due to limited range of data type /home/chb/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:129: warning: comparison is always false due to limited range of data type /home/chb/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:134: warning: comparison is always false due to limited range of data type /home/chb/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:137: warning: comparison is always false due to limited range of data type /home/chb/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:142: warning: comparison is always false due to limited range of data type /home/chb/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:142: warning: comparison is always false due to limited range of data type /home/chb/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:145: warning: comparison is always false due to limited range of data type /home/chb/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:145: warning: comparison is always false due to limited range of data type /home/chb/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:150: warning: comparison is always false due to limited range of data type /home/chb/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:153: warning: comparison is always false due to limited range of data type You just can't assume that a char is unsigned. Maybe this function should take a "const PRUint8*" isntead of "const char*".

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

timeless

Comment 18

•

23 years ago

just some friendly reminders: //If the data starts with BOM, we know it is UTF 'BOM' needs an article 'a' 'data' doesn't match 'is' the line could be a sentence you could replace ',' with ' then'

Boris Zbarsky [:bzbarsky]

Comment 19

•

23 years ago

This warning is also fun: 5.extensions/universalchardet/src/nsUniversalDetector.cpp:133 (See build log excerpt) Overflow in implicit constant conversion This is on the "case 0xFE" line. It's taking the 0xFE and casting it to "char". Yes, you can't assume that "char" is unsigned. It's not.

Alexey Chernyak

Assignee

Comment 20

•

23 years ago

Attached patch fix (obsolete) — Details — Splinter Review

timeless

Comment 21

•

23 years ago

Comment on attachment 119979 [details] [diff] [review] fix smontagu/dbaron still don't like data(pl) starts(sing) and i'm not fond of the end of that line, but we'll agree on something before someone checks in

Attachment #119979 - Flags: review+

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Comment 22

•

23 years ago

Attachment #119979 - Flags: superreview-

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Comment 23

•

23 years ago

Attached patch proposed fix for warnings — Details — Splinter Review

For what it's worth, the gcc 3.2.2 warnings are /builds/trunk/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp: In member function `virtual void nsUniversalDetector::HandleData(const char*, unsigned int)': /builds/trunk/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:128: warning: overflow in implicit constant conversion /builds/trunk/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:129: warning: comparison is always false due to limited range of data type /builds/trunk/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:129: warning: comparison is always false due to limited range of data type /builds/trunk/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:133: warning: overflow in implicit constant conversion /builds/trunk/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:134: warning: comparison is always false due to limited range of data type /builds/trunk/mozilla/extensions/universalchardet/src/nsUniversalDetector.cpp:137: warning: comparison is always false due to limited range of data type

Attachment #119979 - Attachment is obsolete: true

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Updated

•

23 years ago

Attachment #119985 - Flags: superreview?(bzbarsky)

Attachment #119985 - Flags: review?(smontagu)

Simon Montagu :smontagu

Comment 24

•

23 years ago

Comment on attachment 119985 [details] [diff] [review] proposed fix for warnings r=smontagu

Attachment #119985 - Flags: review?(smontagu) → review+

Boris Zbarsky [:bzbarsky]

Updated

•

23 years ago

Attachment #119985 - Flags: superreview?(bzbarsky) → superreview+

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Comment 25

•

23 years ago

Comment on attachment 119985 [details] [diff] [review] proposed fix for warnings Checked in to trunk, 2003-04-09 14:13 PDT.

Alexey Chernyak

Assignee

Comment 26

•

23 years ago

Due to bug 68738 check in that happened on the same day, there is no way to test this. Testing shows, that parser code runs even for text/plain files, and this detector is pretty much useless now. Anyways, just assume fixed. R.I.P Universal Detector.

Status: REOPENED → RESOLVED

Closed: 23 years ago → 23 years ago

Resolution: --- → FIXED

Yuying Long

Comment 27

•

22 years ago

Mark as verified. Pages: http://jshin.net/i18n/utftest/bom_utf32be.html http://jshin.net/i18n/utftest/bom_utf32le.html are displayed fine on 05-08 trunk build while they were not in N7.02.

Status: RESOLVED → VERIFIED

You need to log in before you can comment on or make changes to this bug.