Closed Bug 368142 Opened 18 years ago Closed 18 years ago

Tests for correct byte-by-byte encodings of Unicode strings

Categories

(Core :: Internationalization, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla1.9alpha1

People

(Reporter: Waldo, Assigned: Waldo)

References

Details

Attachments

(1 file, 2 obsolete files)

Attached patch Tests (obsolete) — Splinter Review
We need tests for correct encoding of Unicode strings in various character sets.  Here are some for UTF-8, UTF-16 with BOM, UTF-16LE, and UTF-16BE.
Flags: in-testsuite?
Attachment #252715 - Flags: review?(smontagu)
Attached patch Range-check code points (obsolete) — Splinter Review
I decided to read the UTF-8 RFC to see how it described UTF-8 (I'd previously been using the Wikipedia article), and as I read I realized I wasn't validating output -- specifically, I wasn't checking that a multi-byte sequence encoded a value in the correct range and not a value in a lower range.  (RFC 3629 gives C0 80 as one example, which matches the form of a two-byte code point but encodes U+000000, not a value in the range U+000080 to U+0007FF inclusive.)  This patch does that checking, for both UTF-8 (where it's needed most) and for UTF-16 (where the check only ensures the code point is <= 0x10FFFF, and only for the uncommon case of non-BMP code points).

Evaluating the test in the JS shell and pasting the following at a prompt afterward is a simple way to test that invalid sequences cannot be decoded with the changes in this patch:

BIS = function(arr){ this._a = arr; this._index = 0; }
BIS.prototype.read8 = function(){ if (this._index >= this._a.length)
throw Cr.NS_ERROR_FAILURE; return this._a[this._index++]; }
var u = new UTF8([0xC0, 0x80]); // or any other array of byte values
u.readUnit().toString(16);

There's also some logic simplification and an added check for the correct range for the second word of a non-BMP UTF-16 code point.
Attachment #252715 - Attachment is obsolete: true
Attachment #252994 - Flags: review?(smontagu)
Attachment #252715 - Flags: review?(smontagu)
Comment on attachment 252994 [details] [diff] [review]
Range-check code points

>+const UNICODE_STRINGS =
>+  [
>+    '\u00BD + \u00BE == \u00BD\u00B2 + \u00BC + \u00BE',
>+
>+    'AZaz09 \u007F ' +               // U+000000 to U+00007F
>+    '\u0080 \u0398 \u03BB \u0725 ' + // U+000080 to U+0007FF
>+    '\u0964 \u0F5F \u20AC \uFFFB'    // U+000800 to U+00FFFF
>+  ];

I would include something in the U+10000 - U+10FFFF range here also, to test 4-byte UTF-8 and UTF-16 surrogate pairs.

>+      if (w1 > 0xD700 && w1 < 0xE000)
>+      {
>+        // non-BMP, use surrogate pair

This range is wrong. You want something like:

        if (w1 > 0xDBFF && w1 < 0xE000)
          // low surrogate not preceded by high surrogate
          throw NS_ERROR_ILLEGAL_VALUE);

        if (w1 > 0xD7FF && w1 < 0xDC00)>
        {
          // non-BMP, use surrogate pair

For completeness, you probably also want to throw if UTF-8 decodes to a surrogate character.
Attachment #252994 - Flags: review?(smontagu) → review-
This patch, in addition to addressing review comments, hopefully cleans up the UTF-8 stream reading code (less nesting, etc.).  If you have further suggestions for cleanup, please make them.

(In reply to comment #2)
> I would include something in the U+10000 - U+10FFFF range here also, to test
> 4-byte UTF-8 and UTF-16 surrogate pairs.

JS strings are UCS-2 and thus suck at non-BMP data, so I handled this another way -- store the data in a file copied and encoded with multiple encodings, then do cross-conversion and comparison to error on inequality.  The files are all cvs -kb'd on my machine so as to avoid line-ending snafus, but for whatever reason the UTF-8 version shows up in the patch -- perhaps convenient for seeing the characters the file includes.

> This range is wrong.

Fixed.

> For completeness, you probably also want to throw if UTF-8 decodes to a
> surrogate character.

Sure, done.
Attachment #252994 - Attachment is obsolete: true
Attachment #253130 - Flags: review?(smontagu)
Attachment #253130 - Flags: review?(smontagu) → review+
Test checked in.
Status: NEW → RESOLVED
Closed: 18 years ago
Flags: in-testsuite? → in-testsuite+
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: