Closed Bug 231659 Opened 21 years ago Closed 21 years ago

UTF-8 decoder accepts overlong sequences

Tracking

(Not tracked)

Status:

RESOLVED FIXED

Milestone:

3.10

People

(Reporter: jgmyers, Assigned: jgmyers)

References

Details

Attachments

(1 file)

Proposed fix (21 years ago, John G. Myers; 28.53 KB, patch; nelson: review+)

John G. Myers

Assignee

Description

•

21 years ago

The UTF-8 decoder in NSS accepts overlong sequences and surrogates in violation of the Unicode standard.
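For readers unfamiliar with the term, an "overlong" sequence encodes a code point in more bytes than necessary. The sketch below is not NSS code; `naive_decode2` and `is_overlong2` are hypothetical helpers that show how a decoder which only masks and shifts the payload bits maps the two-byte sequence C0 80 to U+0000, the same value as the single byte 00.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not NSS code: decodes a two-byte UTF-8 sequence
 * by masking and shifting only, with no minimum-value check. */
static uint32_t naive_decode2(unsigned char b0, unsigned char b1)
{
    /* Precondition: b0 is a two-byte lead, b1 is a continuation byte. */
    assert((b0 & 0xE0) == 0xC0 && (b1 & 0xC0) == 0x80);
    return ((uint32_t)(b0 & 0x1F) << 6) | (uint32_t)(b1 & 0x3F);
}

/* A two-byte sequence is overlong when its value would fit in one byte. */
static int is_overlong2(unsigned char b0, unsigned char b1)
{
    return naive_decode2(b0, b1) < 0x80;
}
```

Because several byte sequences then decode to the same character, a filter that inspects raw bytes can disagree with the decoder about what the text contains.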

John G. Myers

Assignee

Comment 1

•

21 years ago

Attached patch: Proposed fix

Prohibits overlong sequences, UTF-8 encoded surrogates, and characters above the maximum value of 10FFFF, all as required by the Unicode standard conformance clause C12. Removes the #ifdefs for always-defined symbol UTF16. Pulls the UTF-8 parser into a separate function, reducing code size by over a third. Adjusts the test cases as necessary. Adds tests for detecting ill-formed UTF-8.
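A minimal sketch of the three checks described above, assuming a standalone reader function. This is not the code from the attached patch: the name `read_utf8_strict`, its signature, and the `BAD_UTF8` sentinel are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BAD_UTF8 ((uint32_t)-1) /* hypothetical error sentinel */

/* Hypothetical strict reader: decodes one code point starting at *index.
 * On success it advances *index past the sequence and returns the code
 * point; on ill-formed input it returns BAD_UTF8 and leaves *index alone. */
static uint32_t read_utf8_strict(const unsigned char *buf, size_t len,
                                 size_t *index)
{
    size_t i = *index;
    uint32_t cp, min;
    int trail;
    unsigned char b;

    assert(i < len);
    b = buf[i++];

    if (b < 0x80) {
        cp = b;        trail = 0; min = 0;
    } else if ((b & 0xE0) == 0xC0) {
        cp = b & 0x1F; trail = 1; min = 0x80;
    } else if ((b & 0xF0) == 0xE0) {
        cp = b & 0x0F; trail = 2; min = 0x800;
    } else if ((b & 0xF8) == 0xF0) {
        cp = b & 0x07; trail = 3; min = 0x10000;
    } else {
        return BAD_UTF8; /* stray continuation byte or invalid lead byte */
    }

    while (trail-- > 0) {
        if (i >= len || (buf[i] & 0xC0) != 0x80)
            return BAD_UTF8; /* truncated or missing continuation byte */
        cp = (cp << 6) | (uint32_t)(buf[i++] & 0x3F);
    }

    if (cp < min)
        return BAD_UTF8; /* overlong sequence */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return BAD_UTF8; /* UTF-8 encoded surrogate */
    if (cp > 0x10FFFF)
        return BAD_UTF8; /* above the maximum code point */

    *index = i;
    return cp;
}
```

Keeping the parser in one function like this means each conversion routine can share the same validity checks instead of repeating them per output format.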

John G. Myers

Assignee

Updated

•

21 years ago

Attachment #139524 - Flags: review?(MisterSSL)

Nelson Bolyard (seldom reads bugmail)

Comment 2

•

21 years ago

The code to handle UCS4 values > 10FFFF wasn't an accident, but I don't know what spec it came from. This code was originally written by Fred Roeber, who researched various standards thoroughly and carefully before coding this. It appears to me that John must be citing a different specification than the one that Fred used. I'd like to understand why they're so different before giving r+ or r- to this code. I've asked Fred to add comments here about this.

John G. Myers

Assignee

Comment 3

•

21 years ago

Section C.2 of The Unicode Standard, version 4.0 states in part:

The Principles and Procedures document of JTC1/SC2/WG2 states that all future assignments of characters to 10646 will be constrained to the BMP or the first 14 supplementary planes. This is to ensure interoperability between the 10646 transformation formats (see below). It also guarantees interoperability with implementations of the Unicode Standard, for which only code positions 0..10FFFF(16) are meaningful. The former provision for private-use code positions in groups 60 to 7F and in planes E0 to FF in 10646 has been removed from 10646. As a consequence, UCS-4 can now be taken effectively as an alias for the Unicode encoding form UTF-32, except that UTF-32 has the extra requirement that additional Unicode semantics be observed for all characters.

John G. Myers

Assignee

Comment 4

•

21 years ago

...in other words, ISO 10646 has changed to remove code points above 10FFFF.
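The interoperability point can be checked with a little arithmetic: the largest value a UTF-16 surrogate pair can represent is exactly 10FFFF, so any UCS-4 value above that has no UTF-16 form. The sketch below illustrates this; `utf16_pair_to_ucs4` is a hypothetical helper, not an NSS function.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: combines a UTF-16 surrogate pair into a code point.
 * Each surrogate carries 10 payload bits, offset by 0x10000. */
static uint32_t utf16_pair_to_ucs4(uint16_t hi, uint16_t lo)
{
    /* Precondition: hi is a high surrogate, lo is a low surrogate. */
    assert(hi >= 0xD800 && hi <= 0xDBFF);
    assert(lo >= 0xDC00 && lo <= 0xDFFF);
    return 0x10000u + (((uint32_t)hi - 0xD800u) << 10)
                    + ((uint32_t)lo - 0xDC00u);
}
```

The highest pair, DBFF DFFF, yields 0x10000 + 0xFFC00 + 0x3FF = 0x10FFFF, which matches the upper bound quoted from the standard.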

Nelson Bolyard (seldom reads bugmail)

Comment 5

•

21 years ago

Comment on attachment 139524: Proposed fix

I have not verified that the new specification given above is correct, but I believe this code properly implements it. Thanks, John.

Attachment #139524 - Flags: review?(MisterSSL) → review+

John G. Myers

Assignee

Updated

•

21 years ago

Attachment #139524 - Flags: superreview?(wchang0222)

Wan-Teh Chang

Comment 6

•

21 years ago

Comment on attachment 139524: Proposed fix

John, you can go ahead and check this patch in. I will need more time to review this patch because it is quite large. Some comments from my preliminary review.

1. It would be nice to add a comment describing what the sec_port_read_utf8 function does and what its return value is.

2. It would be nice to assert "i < inBufLen" at the beginning of sec_port_read_utf8. I understand that we are already doing that check because we always call sec_port_read_utf8 from within a for loop that tests "i < inBufLen".

3. The original UTF-8 parsing code has a lot of comments like this:

>- /* 0000 0000-0000 007F <- 0xxxxxx */
>- /* 0abcdefg ->
>- 00000000 00000000 00000000 0abcdefg */

Would be nice if sec_port_read_utf8 can cite where the specification of the algorithm can be found.

4. Is it possible to declare the 'ucs4' local variables inside the for loops to reduce their scope?

5. I think that some compilers will warn about truncation of PRUint32 to unsigned char in the last three lines below:

>+ outBuf[len+L_0] = 0x00;
>+ outBuf[len+L_1] = (ucs4 >> 16);
>+ outBuf[len+L_2] = (ucs4 >> 8);
>+ outBuf[len+L_3] = ucs4;

6. It seems that all the cases you removed from the 'ucs4' array can be added to the new 'utf8_bad' array, no?
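One common way to address the truncation warning raised in point 5 is to mask and cast each byte explicitly. The sketch below is an assumption about how that could look, not the code that was checked in; `write_ucs4_be` is a hypothetical helper, and `uint32_t` stands in for PRUint32.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: stores a code point as four big-endian bytes.
 * The explicit mask and cast make the narrowing intentional, so
 * compilers have no implicit truncation to warn about. */
static void write_ucs4_be(unsigned char *outBuf, uint32_t ucs4)
{
    assert(outBuf != NULL);
    assert(ucs4 <= 0x10FFFF); /* top byte is always zero in this range */

    outBuf[0] = 0x00;
    outBuf[1] = (unsigned char)((ucs4 >> 16) & 0xFF);
    outBuf[2] = (unsigned char)((ucs4 >> 8) & 0xFF);
    outBuf[3] = (unsigned char)(ucs4 & 0xFF);
}
```

The leading assertions follow the same idea as point 2: they document a precondition the caller already guarantees, at no cost in release builds.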

Wan-Teh Chang

Updated

•

21 years ago

Target Milestone: --- → 3.10

John G. Myers

Assignee

Comment 7

•

21 years ago

I put a carefully selected subset of the cases I removed from the 'ucs4' array into 'utf8_bad'. I didn't put all of them in because that would have made the utf8_bad array excessively large for insignificant additional coverage.
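To illustrate what a small table of ill-formed inputs might look like, here is a sketch with one representative case per failure class. The actual contents of 'utf8_bad' are not shown in this bug, so the table `utf8_bad_sketch`, the checker `utf8_is_well_formed`, and the chosen byte sequences are all assumptions. The checker uses the well-formed byte ranges from the Unicode Standard's UTF-8 table rather than decoding the value.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical checker: returns 1 if s[0..n) is well-formed UTF-8.
 * Only the first trailing byte has a restricted range, which is how
 * overlongs, surrogates, and values above 10FFFF are excluded. */
static int utf8_is_well_formed(const unsigned char *s, size_t n)
{
    size_t i = 0;

    assert(s != NULL || n == 0);
    while (i < n) {
        unsigned char b = s[i++];
        unsigned char lo = 0x80, hi = 0xBF;
        int trail;

        if (b <= 0x7F) {
            continue;
        } else if (b >= 0xC2 && b <= 0xDF) {
            trail = 1;                 /* C0 and C1 would be overlong */
        } else if (b >= 0xE0 && b <= 0xEF) {
            trail = 2;
            if (b == 0xE0) lo = 0xA0;  /* excludes overlong 3-byte forms */
            if (b == 0xED) hi = 0x9F;  /* excludes surrogates D800..DFFF */
        } else if (b >= 0xF0 && b <= 0xF4) {
            trail = 3;
            if (b == 0xF0) lo = 0x90;  /* excludes overlong 4-byte forms */
            if (b == 0xF4) hi = 0x8F;  /* excludes values above 10FFFF */
        } else {
            return 0;                  /* stray continuation or bad lead */
        }
        while (trail-- > 0) {
            if (i >= n || s[i] < lo || s[i] > hi)
                return 0;
            i++;
            lo = 0x80;                 /* later bytes use the full range */
            hi = 0xBF;
        }
    }
    return 1;
}

struct bad_case { unsigned char bytes[4]; size_t len; };

/* Illustrative cases only, one per class of ill-formed input. */
static const struct bad_case utf8_bad_sketch[] = {
    { {0xC0, 0x80}, 2 },             /* overlong 2-byte form of U+0000 */
    { {0xE0, 0x80, 0x80}, 3 },       /* overlong 3-byte form */
    { {0xF0, 0x80, 0x80, 0x80}, 4 }, /* overlong 4-byte form */
    { {0xED, 0xA0, 0x80}, 3 },       /* encoded surrogate D800 */
    { {0xF4, 0x90, 0x80, 0x80}, 4 }, /* 110000, above the maximum */
    { {0x80}, 1 },                   /* stray continuation byte */
    { {0xE2, 0x82}, 2 },             /* truncated sequence */
};

/* Counts how many table entries the checker rejects. */
static size_t count_rejected(void)
{
    size_t i, rejected = 0;
    size_t total = sizeof utf8_bad_sketch / sizeof utf8_bad_sketch[0];

    for (i = 0; i < total; i++) {
        if (!utf8_is_well_formed(utf8_bad_sketch[i].bytes,
                                 utf8_bad_sketch[i].len))
            rejected++;
    }
    return rejected;
}
```

A table like this stays short because each entry exercises a different rejection path; adding more sequences of the same class adds size without exercising new code.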

John G. Myers

Assignee

Comment 8

•

21 years ago

Fix and review comments checked in.

Status: NEW → RESOLVED

Closed: 21 years ago

Resolution: --- → FIXED

Nelson Bolyard (seldom reads bugmail)

Comment 9

•

21 years ago

John, Thanks for your contribution!

John G. Myers

Assignee

Updated

•

21 years ago

Attachment #139524 - Flags: superreview?(wchang0222)

Wan-Teh Chang

Comment 10

•

21 years ago

*** Bug 255463 has been marked as a duplicate of this bug. ***

Nelson Bolyard (seldom reads bugmail)

Comment 11

•

20 years ago

Setting priorities on unprioritized bugs resolved fixed for NSS 3.10.

Priority: -- → P2

Categories

(NSS :: Libraries, defect, P2)

Security

(public)
