Closed Bug 686312 Opened 13 years ago Closed 13 years ago

Firefox/WebSockets denies to receive certain valid UTF-8 sequences

Categories

(Core :: Networking: WebSockets, defect)

9 Branch
x86_64
Windows 7
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla9

People

(Reporter: tobias.oberstein, Assigned: mcmanus)

Details

(Keywords: dev-doc-complete, Whiteboard: [inbound])

Attachments

(1 file, 1 obsolete file)

User Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:8.0a2) Gecko/20110911 Firefox/8.0a2
Build ID: 20110911042006

Steps to reproduce:

I am running a WebSockets protocol test suite which covers UTF-8 by fuzzing an implementation with valid and invalid UTF-8 sequences.
The test suite sends different WS text messages. The exact sequences sent, incl. wire logs can be found in the failed cases
6.9.3, 6.9.4, 6.11.4, 6.22.1, 6.22.2
under the following link
http://www.tavendo.de/autobahn/testsuite/report/clients/index.html


Actual results:

The WS connection is failed by Firefox.


Expected results:

The text message is processed and provided in the JS event handler as text payload.
The behavior is identical for Firefox 7, 8 and 9 (tested, but only on Win64).

At least some of the sequences not processed show up correctly in HTML (i.e. have a look at http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt => 5.3.1  U+FFFE). So I don't know if it's really WS specific or not.

In any case, I filed the bug without "security on", since FF denies more than it should, not less.
The test I U+FFFE and U+FFFF are non-characters. That's why they are being rejected during the utf-8 validation.

I'm going to have to ask around to figure out if that's the behavior we want or not, but at least at this point I can tell you that's why it is happening.
Ok.
Those 2 are valid UTF-8 by RFC3629 and hence by WS draft spec at least ..
What about the other 3 sequences?
Hi Tobias,

You report 5 test cases of interest, but I only see 3 different byte sequences being tested.

6.9.3 and 6.22.2 both seem to test ef bf bf, and 6.9.4 and 6.11.4 both seem to test f4 8f bf bf. The other sequence is from test 6.22.1 (ef bf be).

Do I have that correct?
Hello Patrick,

sorry, yes, you are right, it's only 3 sequences - rechecked:

6.9.4 / 6.11.4 => f48fbfbf

6.22.1 => efbfbe

6.9.3 / 6.22.2 => efbfbf

I missed that, since I somehow "blindly" constructed most of the cases by adapting the cases from the Markus Kuhn page above. Well ..
Ah, I know why I missed it: at the time Markus wrote the page, UTF-8 was on the one hand not yet limited to U+10FFFF and on the other hand U+FFFE/U+FFFF were probably really illegal code positions, not just non-characters. So I had to adjust the cases, and by that introduced the overlap.

--

Don't know if thats relevant: a couple of cases from section 6.4. are deactivated for FF7, since it crashes. FF8 and higher don't crash and pass (non-strict). It also crashes in section 9.3.

https://github.com/oberstet/Autobahn/blob/master/lib/python/autobahn/fuzzing.py#L197
(In reply to Tobias Oberstein from comment #5)

> 6.9.4 / 6.11.4 => f48fbfbf
> 
> 6.22.1 => efbfbe
> 
> 6.9.3 / 6.22.2 => efbfbf

ok, those are all non-characters. Brian Smith, David Baron, Jason Duell and I talked face to face and decided that it is appropriate to let WS pass strings with non characters included - so I'll make a patch.
Attached patch patch v1 (obsolete) — Splinter Review
Assignee: nobody → mcmanus
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Attachment #561253 - Flags: review?(jduell.mcbugs)
Attached patch patch v2Splinter Review
Attached the wrong (the test was still full of debug) patch the first time. doh!
Attachment #561253 - Attachment is obsolete: true
Attachment #561253 - Flags: review?(jduell.mcbugs)
Attachment #561257 - Flags: review?(jduell.mcbugs)
Comment on attachment 561257 [details] [diff] [review]
patch v2

Actually, David probably has the best sense of this change. The tweak is small and there is a test that exercises it.
Attachment #561257 - Flags: review?(jduell.mcbugs) → review?(dbaron)
Comment on attachment 561257 [details] [diff] [review]
patch v2

>diff --git a/xpcom/string/public/nsReadableUtils.h b/xpcom/string/public/nsReadableUtils.h
>--- a/xpcom/string/public/nsReadableUtils.h
>+++ b/xpcom/string/public/nsReadableUtils.h
>@@ -236,26 +236,26 @@ PRBool IsASCII( const nsACString& aStrin
>    * Returns |PR_TRUE| if |aString| is a valid UTF-8 string.
>    * XXX This is not bullet-proof and nor an all-purpose UTF-8 validator. 
>    * It is mainly written to replace and roughly equivalent to
>    *
>    *    str.Equals(NS_ConvertUTF16toUTF8(NS_ConvertUTF8toUTF16(str)))
>    *
>    * (see bug 191541)
>    * As such,  it does not check for non-UTF-8 7bit encodings such as 
>-   * ISO-2022-JP and HZ. However, it filters out  UTF-8 representation
>+   * ISO-2022-JP and HZ. However, by default it filters out UTF-8 representation
>    * of surrogate codepoints and non-characters ( 0xFFFE and 0xFFFF
>    * in planes 0 through 16.) as well as overlong UTF-8 sequences. 
>    * Also note that it regards UTF-8 sequences corresponding to 
>    * codepoints above 0x10FFFF as invalid in accordance with 
>    * http://www.ietf.org/internet-drafts/draft-yergeau-rfc2279bis-04.txt

This comment change isn't sufficient to describe your change; you need to be much more specific about what it filters when.  In particular, I'd suggest replacing from "However, by default it filters out UTF-8 representation" to the end of what I quoted with:

>It rejects sequences with the following errors:
> * byte sequences that cannot be decoded into characters according to UTF-8's rules (including cases where the input is part of a valid UTF-8 sequence but starts or ends mid-character)
> * overlong sequences (i.e., cases where a character was encoded non-canonically by using more bytes than necessary)
> * surrogate codepoints (i.e., the codepoints reserved for representing astral characters in UTF-16)
> * codepoints above the unicode range (i.e., outside the first 17 planes; higher than U+10FFFF), in accordance with http://tools.ietf.org/html/rfc3629
> * when aRejectNonChar is true (the default), any codepoint whose low 16 bits are 0xFFFE or 0xFFFF

This suggested change:
 * categorizes the things that are rejected more clearly
 * clearly marks which one is affected by aRejectNonChar
 * updates the RFC2279bis reference to RFC3629.

r=dbaron with that fixed
Attachment #561257 - Flags: review?(dbaron) → review+
(Please wrap my comment appropriately, too; I just didn't want to try wrapping it to the right length in the bugzilla input.)
Whiteboard: [inbound]
Target Milestone: --- → mozilla9
  In particular, I'd suggest
> replacing from "However, by default it filters out UTF-8 representation" to
> the end of what I quoted with:
>

Thanks!
Try run for 4fc81a7e251e is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=4fc81a7e251e
Results (out of 59 total builds):
    success: 56
    warnings: 3
Builds available at http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/mcmanus@ducksong.com-4fc81a7e251e
https://hg.mozilla.org/mozilla-central/rev/f98f144896ad
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
I've just added a note to Firefox 9 for developers for this, as well as a brief discussion here:

https://developer.mozilla.org/en/WebSockets/Writing_WebSocket_client_applications#Text_data_format
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: