Closed Bug 231659 Opened 20 years ago Closed 20 years ago
UTF-8 decoder accepts overlong sequences
The UTF-8 decoder in NSS accepts overlong sequences and surrogates in violation of the Unicode standard.
Prohibits overlong sequences, UTF-8 encoded surrogates, and characters above the maximum value of 10FFFF, all as required by the Unicode standard conformance clause C12. Removes the #ifdefs for always-defined symbol UTF16. Pulls the UTF-8 parser into a separate function, reducing code size by over a third. Adjusts the test cases as necessary. Add tests for detecting ill-formed UTF-8.
The code to handle UCS4 values > 10FFFF wasn't an accident, but I don't know what spec it came from. This code was originally written by Fred Roeber, who researched various standards thoroughly and carefully before cocding this. It appears to me that John must be citing a different specification than the one that Fred used. I'd like to understand why they're so different before giving r+ or r- to this code. I've asked Fred to add comments here about this.
Section C.2 of The Unicode Standard, version 4.0 states in part: The Principles and Procedures document of JTC1/SC2/WG2 states that all future assignments of characters to 10646 will be constrained to the BMP or the first 14 supplementary planes. This is to ensure interoperability between the 10646 transformation formats (see below). It also guarantees interoperability with implementations of the Unicode Standard, for which only code positions 0..10FFFF(16) are meaningful. The former provision for private-use code positions in groups 60 to 7F and in planes E0 to FF in 10646 has been removed from 10646. As a consequence, UCS-4 can now be taken effectively as an alias for the Unicode encoding form UTF-32, except that UTF-32 has the extra requirement that additional Unicode semantics be observed for all characters.
...in other words, ISO 10646 has changed to remove code points above 10FFFF.
Comment on attachment 139524 [details] [diff] [review] Proposed fix I have not verified that the new specification give above is crrect, but I believe this code properly implements it. Thanks, John.
Attachment #139524 - Flags: review?(MisterSSL) → review+
Comment on attachment 139524 [details] [diff] [review] Proposed fix John, you can go ahead and check this patch in. I will need more time to review this patch because it is quite large. Some comments from my preliminary review. 1. It would be nice to add a comment describing what the sec_port_read_utf8 function does and what its return value is. 2. It would be nice to assert "i < inBufLen" at the beginning of sec_port_read_utf8. I understand that we are already doing that check because we always call sec_port_read_utf8 from within a for loop that tests "i < inBufLen". 3. The original UTF-8 parsing code has a lot of comments like this: >- /* 0000 0000-0000 007F <- 0xxxxxx */ >- /* 0abcdefg -> >- 00000000 00000000 00000000 0abcdefg */ Would be nice if sec_port_read_utf8 can cite where the specification of the algorithm can be found. 4. Is it possible to declare the 'ucs4' local variables inside the for loops to reduce their scope? 5. I think that some compilers will warn about truncation of PRUint32 to unsigned char in the last three lines below: >+ outBuf[len+L_0] = 0x00; >+ outBuf[len+L_1] = (ucs4 >> 16); >+ outBuf[len+L_2] = (ucs4 >> 8); >+ outBuf[len+L_3] = ucs4; 6. It seems that all the cases you removed from the 'ucs4' array can be added to the new 'utf8_bad' array, no?
I put a carefully selected subset of the cases I removed from the 'ucs4' array into 'utf8_bad'. I didn't put all of them in because that would have made the utf8_bad array excessively large for insignificant additional coverage.
Fix and review comments checked in.
Status: NEW → RESOLVED
Closed: 20 years ago
Resolution: --- → FIXED
John, Thanks for your contribution!
*** Bug 255463 has been marked as a duplicate of this bug. ***
Setting priorities on unprioritized bugs resolved fixed for NSS 3.10.
Priority: -- → P2
You need to log in before you can comment on or make changes to this bug.