User Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:8.0a2) Gecko/20110911 Firefox/8.0a2
Build ID: 20110911042006
Steps to reproduce:
I am running a WebSockets protocol test suite which covers UTF-8 by fuzzing an implementation with valid and invalid UTF-8 sequences.
The test suite sends different WS text messages. The exact sequences sent, incl. wire logs can be found in the failed cases
6.9.3, 6.9.4, 6.11.4, 6.22.1, 6.22.2
under the following link
The WS connection is failed by Firefox.
The text message is processed and provided in the JS event handler as text payload.
The behavior is identical for Firefox 7, 8 and 9 (tested, but only on Win64).
At least some of the sequences not processed show up correctly in HTML (i.e. have a look at http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt => 5.3.1 U+FFFE). So I don't know if it's really WS specific or not.
In any case, I filed the bug without "security on", since FF denies more than it should, not less.
The test I U+FFFE and U+FFFF are non-characters. That's why they are being rejected during the utf-8 validation.
I'm going to have to ask around to figure out if that's the behavior we want or not, but at least at this point I can tell you that's why it is happening.
Those 2 are valid UTF-8 by RFC3629 and hence by WS draft spec at least ..
What about the other 3 sequences?
You report 5 test cases of interest, but I only see 3 different byte sequences being tested.
6.9.3 and 6.22.2 both seem to test ef bf bf, and 6.9.4 and 6.11.4 both seem to test f4 8f bf bf. The other sequence is from test 6.22.1 (ef bf be).
Do I have that correct?
sorry, yes, you are right, it's only 3 sequences - rechecked:
6.9.4 / 6.11.4 => f48fbfbf
6.22.1 => efbfbe
6.9.3 / 6.22.2 => efbfbf
I missed that, since I somehow "blindly" constructed most of the cases by adapting the cases from the Markus Kuhn page above. Well ..
Ah, I know why I missed it: at the time Markus wrote the page, UTF-8 was on the one hand not yet limited to U+10FFFF and on the other hand U+FFFE/U+FFFF were probably really illegal code positions, not just non-characters. So I had to adjust the cases, and by that introduced the overlap.
Don't know if thats relevant: a couple of cases from section 6.4. are deactivated for FF7, since it crashes. FF8 and higher don't crash and pass (non-strict). It also crashes in section 9.3.
(In reply to Tobias Oberstein from comment #5)
> 6.9.4 / 6.11.4 => f48fbfbf
> 6.22.1 => efbfbe
> 6.9.3 / 6.22.2 => efbfbf
ok, those are all non-characters. Brian Smith, David Baron, Jason Duell and I talked face to face and decided that it is appropriate to let WS pass strings with non characters included - so I'll make a patch.
Created attachment 561253 [details] [diff] [review]
Created attachment 561257 [details] [diff] [review]
Attached the wrong (the test was still full of debug) patch the first time. doh!
Comment on attachment 561257 [details] [diff] [review]
Actually, David probably has the best sense of this change. The tweak is small and there is a test that exercises it.
Comment on attachment 561257 [details] [diff] [review]
>diff --git a/xpcom/string/public/nsReadableUtils.h b/xpcom/string/public/nsReadableUtils.h
>@@ -236,26 +236,26 @@ PRBool IsASCII( const nsACString& aStrin
> * Returns |PR_TRUE| if |aString| is a valid UTF-8 string.
> * XXX This is not bullet-proof and nor an all-purpose UTF-8 validator.
> * It is mainly written to replace and roughly equivalent to
> * str.Equals(NS_ConvertUTF16toUTF8(NS_ConvertUTF8toUTF16(str)))
> * (see bug 191541)
> * As such, it does not check for non-UTF-8 7bit encodings such as
>- * ISO-2022-JP and HZ. However, it filters out UTF-8 representation
>+ * ISO-2022-JP and HZ. However, by default it filters out UTF-8 representation
> * of surrogate codepoints and non-characters ( 0xFFFE and 0xFFFF
> * in planes 0 through 16.) as well as overlong UTF-8 sequences.
> * Also note that it regards UTF-8 sequences corresponding to
> * codepoints above 0x10FFFF as invalid in accordance with
> * http://www.ietf.org/internet-drafts/draft-yergeau-rfc2279bis-04.txt
This comment change isn't sufficient to describe your change; you need to be much more specific about what it filters when. In particular, I'd suggest replacing from "However, by default it filters out UTF-8 representation" to the end of what I quoted with:
>It rejects sequences with the following errors:
> * byte sequences that cannot be decoded into characters according to UTF-8's rules (including cases where the input is part of a valid UTF-8 sequence but starts or ends mid-character)
> * overlong sequences (i.e., cases where a character was encoded non-canonically by using more bytes than necessary)
> * surrogate codepoints (i.e., the codepoints reserved for representing astral characters in UTF-16)
> * codepoints above the unicode range (i.e., outside the first 17 planes; higher than U+10FFFF), in accordance with http://tools.ietf.org/html/rfc3629
> * when aRejectNonChar is true (the default), any codepoint whose low 16 bits are 0xFFFE or 0xFFFF
This suggested change:
* categorizes the things that are rejected more clearly
* clearly marks which one is affected by aRejectNonChar
* updates the RFC2279bis reference to RFC3629.
r=dbaron with that fixed
(Please wrap my comment appropriately, too; I just didn't want to try wrapping it to the right length in the bugzilla input.)
In particular, I'd suggest
> replacing from "However, by default it filters out UTF-8 representation" to
> the end of what I quoted with:
Try run for 4fc81a7e251e is complete.
Detailed breakdown of the results available here:
Results (out of 59 total builds):
Builds available at http://firstname.lastname@example.org
I've just added a note to Firefox 9 for developers for this, as well as a brief discussion here: