Closed
Bug 368142
Opened 18 years ago
Closed 18 years ago
Tests for correct byte-by-byte encodings of Unicode strings
Categories
(Core :: Internationalization, defect)
Core
Internationalization
Tracking
()
RESOLVED
FIXED
mozilla1.9alpha1
People
(Reporter: Waldo, Assigned: Waldo)
References
Details
Attachments
(1 file, 2 obsolete files)
15.51 KB,
patch
|
smontagu
:
review+
|
Details | Diff | Splinter Review |
We need tests for correct encoding of Unicode strings in various character sets. Here are some for UTF-8, UTF-16 with BOM, UTF-16LE, and UTF-16BE.
Flags: in-testsuite?
Attachment #252715 -
Flags: review?(smontagu)
Assignee | ||
Comment 1•18 years ago
|
||
I decided to read the UTF-8 RFC to see how it described UTF-8 (I'd previously been using the Wikipedia article), and as I read I realized I wasn't validating output -- specifically, I wasn't checking that a multi-byte sequence encoded a value in the correct range and not a value in a lower range. (RFC 3629 gives C0 80 as one example, which matches the form of a two-byte code point but encodes U+000000, not a value in the range U+000080 to U+0007FF inclusive.) This patch does that checking, for both UTF-8 (where it's needed most) and for UTF-16 (where the check only ensures the code point is <= 0x10FFFF, and only for the uncommon case of non-BMP code points). Evaluating the test in the JS shell and pasting the following at a prompt afterward is a simple way to test that invalid sequences cannot be decoded with the changes in this patch: BIS = function(arr){ this._a = arr; this._index = 0; } BIS.prototype.read8 = function(){ if (this._index >= this._a.length) throw Cr.NS_ERROR_FAILURE; return this._a[this._index++]; } var u = new UTF8([0xC0, 0x80]); // or any other array of byte values u.readUnit().toString(16); There's also some logic simplification and an added check for the correct range for the second word of a non-BMP UTF-16 code point.
Attachment #252715 -
Attachment is obsolete: true
Attachment #252994 -
Flags: review?(smontagu)
Attachment #252715 -
Flags: review?(smontagu)
Comment 2•18 years ago
|
||
Comment on attachment 252994 [details] [diff] [review] Range-check code points >+const UNICODE_STRINGS = >+ [ >+ '\u00BD + \u00BE == \u00BD\u00B2 + \u00BC + \u00BE', >+ >+ 'AZaz09 \u007F ' + // U+000000 to U+00007F >+ '\u0080 \u0398 \u03BB \u0725 ' + // U+000080 to U+0007FF >+ '\u0964 \u0F5F \u20AC \uFFFB' // U+000800 to U+00FFFF >+ ]; I would include something in the U+10000 - U+10FFFF range here also, to test 4-byte UTF-8 and UTF-16 surrogate pairs. >+ if (w1 > 0xD700 && w1 < 0xE000) >+ { >+ // non-BMP, use surrogate pair This range is wrong. You want something like: if (w1 > 0xDBFF && w1 < 0xE000) // low surrogate not preceded by high surrogate throw NS_ERROR_ILLEGAL_VALUE); if (w1 > 0xD7FF && w1 < 0xDC00)> { // non-BMP, use surrogate pair For completeness, you probably also want to throw if UTF-8 decodes to a surrogate character.
Attachment #252994 -
Flags: review?(smontagu) → review-
Assignee | ||
Comment 3•18 years ago
|
||
This patch, in addition to addressing review comments, hopefully cleans up the UTF-8 stream reading code (less nesting, etc.). If you have further suggestions for cleanup, please make them. (In reply to comment #2) > I would include something in the U+10000 - U+10FFFF range here also, to test > 4-byte UTF-8 and UTF-16 surrogate pairs. JS strings are UCS-2 and thus suck at non-BMP data, so I handled this another way -- store the data in a file copied and encoded with multiple encodings, then do cross-conversion and comparison to error on inequality. The files are all cvs -kb'd on my machine so as to avoid line-ending snafus, but for whatever reason the UTF-8 version shows up in the patch -- perhaps convenient for seeing the characters the file includes. > This range is wrong. Fixed. > For completeness, you probably also want to throw if UTF-8 decodes to a > surrogate character. Sure, done.
Attachment #252994 -
Attachment is obsolete: true
Attachment #253130 -
Flags: review?(smontagu)
Updated•18 years ago
|
Attachment #253130 -
Flags: review?(smontagu) → review+
Assignee | ||
Comment 4•18 years ago
|
||
Test checked in.
Status: NEW → RESOLVED
Closed: 18 years ago
Flags: in-testsuite? → in-testsuite+
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•