Closed Bug 231659 Opened 21 years ago Closed 21 years ago

UTF-8 decoder accepts overlong sequences

Categories

(NSS :: Libraries, defect, P2)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jgmyers, Assigned: jgmyers)

References

Details

Attachments

(1 file)

The UTF-8 decoder in NSS accepts overlong sequences and surrogates in violation
of the Unicode standard.
Attached patch Proposed fixSplinter Review
Prohibits overlong sequences, UTF-8 encoded surrogates, and characters above
the maximum value of 10FFFF, all as required by the Unicode standard
conformance clause C12.

Removes the #ifdefs for always-defined symbol UTF16.

Pulls the UTF-8 parser into a separate function, reducing code size by over a
third.

Adjusts the test cases as necessary.  Add tests for detecting ill-formed UTF-8.
Attachment #139524 - Flags: review?(MisterSSL)
The code to handle UCS4 values > 10FFFF wasn't an accident, but I don't 
know what spec it came from.  This code was originally written by Fred 
Roeber, who researched various standards thoroughly and carefully before 
cocding this.  

It appears to me that John must be citing a different specification than 
the one that Fred used.  I'd like to understand why they're so different 
before giving r+ or r- to this code.  I've asked Fred to add comments 
here about this.
Section C.2 of The Unicode Standard, version 4.0 states in part:

The Principles and Procedures document of JTC1/SC2/WG2 states that all future
assignments of characters to 10646 will be constrained to the BMP or the first
14 supplementary planes. This is to ensure interoperability between the 10646
transformation formats (see below). It also guarantees interoperability with
implementations of the Unicode Standard, for which only code positions
0..10FFFF(16) are meaningful. The former provision for private-use code
positions in groups 60 to 7F and in planes E0 to FF in 10646 has been removed
from 10646. As a consequence, UCS-4 can now be taken effectively as an alias for
the Unicode encoding form UTF-32, except that UTF-32 has the extra requirement
that additional Unicode semantics be observed for all characters.
...in other words, ISO 10646 has changed to remove code points above 10FFFF.
Comment on attachment 139524 [details] [diff] [review]
Proposed fix

I have not verified that the new specification give above is crrect, but I
believe this code properly implements it.
Thanks, John.
Attachment #139524 - Flags: review?(MisterSSL) → review+
Attachment #139524 - Flags: superreview?(wchang0222)
Comment on attachment 139524 [details] [diff] [review]
Proposed fix

John, you can go ahead and check this patch in.
I will need more time to review this patch because
it is quite large.

Some comments from my preliminary review.

1. It would be nice to add a comment describing what
the sec_port_read_utf8 function does and what its
return value is.

2. It would be nice to assert "i < inBufLen" at
the beginning of sec_port_read_utf8.  I understand
that we are already doing that check because we
always call sec_port_read_utf8 from within a for
loop that tests "i < inBufLen".

3. The original UTF-8 parsing code has a lot of
comments like this:

>-        /* 0000 0000-0000 007F <- 0xxxxxx */
>-        /* 0abcdefg -> 
>-           00000000 00000000 00000000 0abcdefg */

Would be nice if sec_port_read_utf8 can cite where
the specification of the algorithm can be found.

4. Is it possible to declare the 'ucs4' local
variables inside the for loops to reduce their
scope?

5. I think that some compilers will warn about
truncation of PRUint32 to unsigned char in the
last three lines below:

>+      outBuf[len+L_0] = 0x00;
>+      outBuf[len+L_1] = (ucs4 >> 16);
>+      outBuf[len+L_2] = (ucs4 >> 8);
>+      outBuf[len+L_3] = ucs4;

6. It seems that all the cases you removed from
the 'ucs4' array can be added to the new 'utf8_bad'
array, no?
Target Milestone: --- → 3.10
I put a carefully selected subset of the cases I removed from the 'ucs4' array
into 'utf8_bad'.  I didn't put all of them in because that would have made the
utf8_bad array excessively large for insignificant additional coverage.
Fix and review comments checked in.
Status: NEW → RESOLVED
Closed: 21 years ago
Resolution: --- → FIXED
John,  Thanks for your contribution!  
Attachment #139524 - Flags: superreview?(wchang0222)
*** Bug 255463 has been marked as a duplicate of this bug. ***
Setting priorities on unprioritized bugs resolved fixed for NSS 3.10.
Priority: -- → P2
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: