Last Comment Bug 686312 - Firefox/WebSockets denies to receive certain valid UTF-8 sequences
: Firefox/WebSockets denies to receive certain valid UTF-8 sequences
: dev-doc-complete
Product: Core
Classification: Components
Component: Networking: WebSockets (show other bugs)
: 9 Branch
: x86_64 Windows 7
: -- normal (vote)
: mozilla9
Assigned To: Patrick McManus [:mcmanus]
: Patrick McManus [:mcmanus]
Depends on:
  Show dependency treegraph
Reported: 2011-09-12 08:24 PDT by Tobias Oberstein
Modified: 2011-11-04 12:46 PDT (History)
8 users (show)
See Also:
Crash Signature:
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---

patch v1 (7.55 KB, patch)
2011-09-20 11:32 PDT, Patrick McManus [:mcmanus]
no flags Details | Diff | Splinter Review
patch v2 (7.69 KB, patch)
2011-09-20 11:42 PDT, Patrick McManus [:mcmanus]
dbaron: review+
Details | Diff | Splinter Review

Description Tobias Oberstein 2011-09-12 08:24:26 PDT
User Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:8.0a2) Gecko/20110911 Firefox/8.0a2
Build ID: 20110911042006

Steps to reproduce:

I am running a WebSockets protocol test suite which covers UTF-8 by fuzzing an implementation with valid and invalid UTF-8 sequences.
The test suite sends different WS text messages. The exact sequences sent, incl. wire logs can be found in the failed cases
6.9.3, 6.9.4, 6.11.4, 6.22.1, 6.22.2
under the following link

Actual results:

The WS connection is failed by Firefox.

Expected results:

The text message is processed and provided in the JS event handler as text payload.
Comment 1 Tobias Oberstein 2011-09-12 08:29:40 PDT
The behavior is identical for Firefox 7, 8 and 9 (tested, but only on Win64).

At least some of the sequences not processed show up correctly in HTML (i.e. have a look at => 5.3.1  U+FFFE). So I don't know if it's really WS specific or not.

In any case, I filed the bug without "security on", since FF denies more than it should, not less.
Comment 2 Patrick McManus [:mcmanus] 2011-09-13 10:14:24 PDT
The test I U+FFFE and U+FFFF are non-characters. That's why they are being rejected during the utf-8 validation.

I'm going to have to ask around to figure out if that's the behavior we want or not, but at least at this point I can tell you that's why it is happening.
Comment 3 Tobias Oberstein 2011-09-13 10:44:57 PDT
Those 2 are valid UTF-8 by RFC3629 and hence by WS draft spec at least ..
What about the other 3 sequences?
Comment 4 Patrick McManus [:mcmanus] 2011-09-20 07:52:45 PDT
Hi Tobias,

You report 5 test cases of interest, but I only see 3 different byte sequences being tested.

6.9.3 and 6.22.2 both seem to test ef bf bf, and 6.9.4 and 6.11.4 both seem to test f4 8f bf bf. The other sequence is from test 6.22.1 (ef bf be).

Do I have that correct?
Comment 5 Tobias Oberstein 2011-09-20 08:30:13 PDT
Hello Patrick,

sorry, yes, you are right, it's only 3 sequences - rechecked:

6.9.4 / 6.11.4 => f48fbfbf

6.22.1 => efbfbe

6.9.3 / 6.22.2 => efbfbf

I missed that, since I somehow "blindly" constructed most of the cases by adapting the cases from the Markus Kuhn page above. Well ..
Comment 6 Tobias Oberstein 2011-09-20 08:52:20 PDT
Ah, I know why I missed it: at the time Markus wrote the page, UTF-8 was on the one hand not yet limited to U+10FFFF and on the other hand U+FFFE/U+FFFF were probably really illegal code positions, not just non-characters. So I had to adjust the cases, and by that introduced the overlap.


Don't know if thats relevant: a couple of cases from section 6.4. are deactivated for FF7, since it crashes. FF8 and higher don't crash and pass (non-strict). It also crashes in section 9.3.
Comment 7 Patrick McManus [:mcmanus] 2011-09-20 08:57:26 PDT
(In reply to Tobias Oberstein from comment #5)

> 6.9.4 / 6.11.4 => f48fbfbf
> 6.22.1 => efbfbe
> 6.9.3 / 6.22.2 => efbfbf

ok, those are all non-characters. Brian Smith, David Baron, Jason Duell and I talked face to face and decided that it is appropriate to let WS pass strings with non characters included - so I'll make a patch.
Comment 8 Patrick McManus [:mcmanus] 2011-09-20 11:32:49 PDT
Created attachment 561253 [details] [diff] [review]
patch v1
Comment 9 Patrick McManus [:mcmanus] 2011-09-20 11:42:15 PDT
Created attachment 561257 [details] [diff] [review]
patch v2

Attached the wrong (the test was still full of debug) patch the first time. doh!
Comment 10 Patrick McManus [:mcmanus] 2011-09-20 11:47:39 PDT
Comment on attachment 561257 [details] [diff] [review]
patch v2

Actually, David probably has the best sense of this change. The tweak is small and there is a test that exercises it.
Comment 11 David Baron :dbaron: ⌚️UTC-10 2011-09-20 14:09:56 PDT
Comment on attachment 561257 [details] [diff] [review]
patch v2

>diff --git a/xpcom/string/public/nsReadableUtils.h b/xpcom/string/public/nsReadableUtils.h
>--- a/xpcom/string/public/nsReadableUtils.h
>+++ b/xpcom/string/public/nsReadableUtils.h
>@@ -236,26 +236,26 @@ PRBool IsASCII( const nsACString& aStrin
>    * Returns |PR_TRUE| if |aString| is a valid UTF-8 string.
>    * XXX This is not bullet-proof and nor an all-purpose UTF-8 validator. 
>    * It is mainly written to replace and roughly equivalent to
>    *
>    *    str.Equals(NS_ConvertUTF16toUTF8(NS_ConvertUTF8toUTF16(str)))
>    *
>    * (see bug 191541)
>    * As such,  it does not check for non-UTF-8 7bit encodings such as 
>-   * ISO-2022-JP and HZ. However, it filters out  UTF-8 representation
>+   * ISO-2022-JP and HZ. However, by default it filters out UTF-8 representation
>    * of surrogate codepoints and non-characters ( 0xFFFE and 0xFFFF
>    * in planes 0 through 16.) as well as overlong UTF-8 sequences. 
>    * Also note that it regards UTF-8 sequences corresponding to 
>    * codepoints above 0x10FFFF as invalid in accordance with 
>    *

This comment change isn't sufficient to describe your change; you need to be much more specific about what it filters when.  In particular, I'd suggest replacing from "However, by default it filters out UTF-8 representation" to the end of what I quoted with:

>It rejects sequences with the following errors:
> * byte sequences that cannot be decoded into characters according to UTF-8's rules (including cases where the input is part of a valid UTF-8 sequence but starts or ends mid-character)
> * overlong sequences (i.e., cases where a character was encoded non-canonically by using more bytes than necessary)
> * surrogate codepoints (i.e., the codepoints reserved for representing astral characters in UTF-16)
> * codepoints above the unicode range (i.e., outside the first 17 planes; higher than U+10FFFF), in accordance with
> * when aRejectNonChar is true (the default), any codepoint whose low 16 bits are 0xFFFE or 0xFFFF

This suggested change:
 * categorizes the things that are rejected more clearly
 * clearly marks which one is affected by aRejectNonChar
 * updates the RFC2279bis reference to RFC3629.

r=dbaron with that fixed
Comment 12 David Baron :dbaron: ⌚️UTC-10 2011-09-20 14:10:39 PDT
(Please wrap my comment appropriately, too; I just didn't want to try wrapping it to the right length in the bugzilla input.)
Comment 13 Patrick McManus [:mcmanus] 2011-09-20 15:15:31 PDT
  In particular, I'd suggest
> replacing from "However, by default it filters out UTF-8 representation" to
> the end of what I quoted with:

Comment 14 Mozilla RelEng Bot 2011-09-20 16:02:05 PDT
Try run for 4fc81a7e251e is complete.
Detailed breakdown of the results available here:
Results (out of 59 total builds):
    success: 56
    warnings: 3
Builds available at
Comment 15 Marco Bonardo [::mak] 2011-09-21 02:59:59 PDT
Comment 16 Eric Shepherd [:sheppy] 2011-11-04 12:46:55 PDT
I've just added a note to Firefox 9 for developers for this, as well as a brief discussion here:

Note You need to log in before you can comment on or make changes to this bug.