Open Bug 316178 Opened 19 years ago Updated 2 years ago

Frequent disconnects with Chatzilla (utf-8 string related?)

Categories

(Core :: XPConnect, defect)

Platform: x86
OS: All
People

(Reporter: mcsmurf, Unassigned)

References

(Depends on 1 open bug)

Details

(Keywords: regression)

Recently I started getting disconnected quite often from IRC when using ChatZilla, with the error code 2153185290. I did some debugging with Venkman and got these two errors when being disconnected:

Error: [Exception... "Could not convert Native argument arg 1 [nsIScriptableInputStream.read]"  nsresult: "0x8057000a (NS_ERROR_XPC_BAD_CONVERT_NATIVE)"  location: "JS frame :: chrome://chatzilla/content/lib/js/connection-xpcom.js :: bc_readdata :: line 394"  data: no]
Source File: chrome://chatzilla/content/lib/js/connection-xpcom.js
Line: 394

Error ``malformed UTF-8 character sequence at offset 8972'' [xs] in file ``chrome://chatzilla/content/lib/js/connection-xpcom.js'', line 394, character 0.
The interface in question, nsIScriptableInputStream, uses a return value of |string|:

  string read(in unsigned long aCount);

Now, this data is arbitrary binary data. I assume XPConnect tries to convert this to a JS string using some API that the JS engine offers, which now expects UTF-8 due to the JS_STRINGS_ARE_UTF8 define that we have now enabled. This will probably fail if one of the bytes has a value > 127, and if it doesn't fail, it probably won't produce the right result. Another example where treating binary data as a string causes problems! ;)
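(For illustration, a minimal sketch of the pattern that now breaks, assuming the stream comes from a connected socket transport; the variable names are illustrative, not ChatZilla's actual code:)

  // Wrap the transport's input stream in nsIScriptableInputStream, the usual
  // way script code reads text data from a socket.
  var sis = Components.classes["@mozilla.org/scriptableinputstream;1"]
                      .createInstance(Components.interfaces.nsIScriptableInputStream);
  sis.init(rawInputStream);  // rawInputStream: the transport's nsIInputStream

  // read() hands the data back as an XPIDL |string|.  With JS strings treated
  // as UTF-8, any byte sequence that is not valid UTF-8 (e.g. a lone 0xE9 from
  // an ISO-8859-1 IRC message) makes the native-to-JS conversion fail with
  // NS_ERROR_XPC_BAD_CONVERT_NATIVE -- the error quoted above.
  var data = sis.read(sis.available());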
Depends on: 232182
Fixed by back-out of patch for bug 315288. Shaver said he'd help fix this so bug 315288 can be re-fixed. /be
Same apparent problem on the 2005111105 OS/2 trunk build; not present in 2005111005.
OS: Windows 2000 → All
*** Bug 316227 has been marked as a duplicate of this bug. ***
So what is the plan here? At the moment the problem as I see it is that |string| only promises to be ASCII-safe but has historically been "safe" in a number of other cases... So is the plan to change this interface, or change the way XPConnect treats |string|, or both?
This interface is frozen.
<sigh>...
I'm not sure how you could ever expect nsIScriptableInputStream's read function to reliably return binary data. "string" is treated as a null-terminated string, IIRC. Binary data could contain a 0 byte, thus losing the trailing data. So the JS UTF8 issue is just one of the issues with using this interface for retrieving non-string data. I think the only real way to address ChatZilla's issue is to use a different interface that really handles binary data.
Null is not a valid character in the IRC protocol. We (ChatZilla) need 8bit clean (apart from the NUL restriction) data, as has always been the case when using the various interfaces we use. We use the binary stream interfaces for file transfers, because they are binary, and the non-binary ones for the non-binary streams (i.e. the normal connections). Changing the behaviour of the non-binary streams to strip or otherwise mess with characters > 127 would go against the entire point of freezing interfaces.
I'm not sure I follow comment 9... The way things stand is that nsIScriptableInputStream is not usable for non-ASCII data. This is a bug in the interface, but there we are -- it's a frozen bug. That means that if there is any chance whatsoever that your data will be non-ASCII you need to use nsIBinaryInputStream to read that data.
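(For reference, a sketch of the binary-stream alternative being suggested; as above, rawInputStream is an assumed name for the transport's nsIInputStream:)

  // nsIBinaryInputStream.readBytes() returns the data one JS character per raw
  // byte (NULs included), so it is 8-bit clean regardless of how XPConnect
  // treats the XPIDL |string| type.
  var bis = Components.classes["@mozilla.org/binaryinputstream;1"]
                      .createInstance(Components.interfaces.nsIBinaryInputStream);
  bis.setInputStream(rawInputStream);

  var avail = bis.available();
  var data = avail ? bis.readBytes(avail) : "";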
The frozen behaviour is that it sends each byte of the stream input as a JS character, with NUL issues, as far as I know. That's perfectly acceptable for ChatZilla's use.
The frozen _interface_ is that if there are any non-ASCII bytes in the stream then behavior is undefined. The fact that there was some particular behavior that people came to depend on is somewhat incidental...
The interface fails to define ANYTHING about the behaviour, save for the NUL issue.
Yes, but in ALL XPIDL interfaces the |string| type is only safe for ASCII data. That's not a special property of this interface.
Fucking pain to actually confirm that (the string doc is a pain), but that says they are 8-bit bytes with an undefined encoding. It doesn't change the fact that I don't think you should be changing the behaviour of a frozen interface. (Note: this doesn't matter to ChatZilla, as it is so trivial to change it to use binary streams it isn't funny; this is about doing stupid things to frozen interfaces.) Anyway, I'm not going to make any more comments on here. I'm sick of changes like this, and just don't have the time to waste trying to point out the stupidity of it.
James, I'm ccing you again because this is important: 1) Chatzilla is depending on behavior that is undefined. I understand that it's annoying when such behavior changes, and I agree that gratuitous changes to it should be minimized. But the fact remains that depending on such behavior is a bug. 2) We should move toward introducing a usable (as in, fully defined) nsIScriptableInputStream2 and deprecating nsIScriptableInputStream. biesi has further suggested removing the current scriptable stream contract from future releases, unless it's frozen. Just figured you'd want to know.
Silver, don't go away yet. I'm with you on wanting actual behavior compatibility to be preserved.

The limitation on string goes back to http://www.mozilla.org/scriptable/typelib_file.html, see footnote 2 under the type definition table. This is pretty ancient history, and it is not clear what the practical meaning of it is: was it intended to future-proof string being extended to mean UTF-8? I think it was because jband at the time rightly feared i18n code treating string as using an arbitrary multibyte character set encoding.

If in fact we either always use string to contain 7-bit ASCII, or 8-bit data (which may or may not be ISO-Latin-1, a proper subset of Unicode), then should we not try to keep compatible with behavior rather than original spec? Yes, I know this will stress the JS_C_STRINGS_ARE_UTF8 work. That's ok, we can do something special there.

dbradley: embedded NULs are not allowed in UTF-8, so the only issue is whether string requires 7-bit-clean strings.

bz: undefined-by-spec but works-for-years behavior is something we could prefer to keep compatibility with, as you say. Why not do so aggressively?

/be
> If in fact we either always use string to contain 7-bit ASCII, or 8-bit data

The current behavior of XPConnect for |string| is as follows:

1) For a |string| retval or out param, just pass the raw bytes to JS_NewStringCopyZ. If STRICT_CHECK_OF_UNICODE is defined, assert if any byte is non-ASCII. (Why's this not defined by default in debug builds? <sigh>)

2) For a |string| in param, call JS_GetStringBytes on the JSString and then duplicate using nsMemory::Alloc and memcpy. Assert if any byte is non-ASCII (this assert _is_ enabled in all debug builds).

So what the behavior actually is depends on what JS_GetStringBytes and JS_NewStringCopyZ do, as things stand. As I recall, JS_GetStringBytes used to deflate from UTF16 by just dropping the high byte. That's still the behavior without JS_C_STRINGS_ARE_UTF8 defined. With that defined, we get UTF16-to-UTF8 conversion.

XPConnect can probably keep the old behavior for retval/out params by byte-inflating itself and calling JS_NewUCStringCopyZ. For in params, it could JS_GetStringChars and deflate itself. Assuming we want to do this in spite of the existing assert, of course. If we want to enable the UTF8 stuff in JSEng, we should probably do the above.

Note that nowhere am I advocating breaking XPConnect compat gratuitously. I _am_ advocating not relying on de-facto API behavior that contradicts documentation (that is, per docs I would fully expect non-ASCII stuff passed out through |string| or in through |string| to throw).
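(To make the "byte-inflating itself" idea concrete, here is the mapping expressed in JS terms rather than JSAPI terms; inflateBytesToJSString is a made-up helper for illustration, not anything in the tree:)

  // Byte inflation: each raw byte 0x00-0xFF becomes the JS character with the
  // same code point (effectively ISO-Latin-1) instead of being run through a
  // UTF-8 decoder.  This is the 16-bit data XPConnect would hand to
  // JS_NewUCStringCopyZ after widening the bytes itself.
  function inflateBytesToJSString(bytes) {
    var s = "";
    for (var i = 0; i < bytes.length; i++)
      s += String.fromCharCode(bytes[i] & 0xFF);
    return s;
  }

  // A lone 0xE9 byte is malformed UTF-8, but under byte inflation it simply
  // becomes "\u00E9", preserving the old behaviour:
  //   inflateBytesToJSString([0x48, 0x69, 0xE9]) === "Hi\u00E9"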
Boris, according to the string doc (as I read it), it is up to the interface to define what encoding |string| uses, thus the interface was flawed from the start. However, "undefined" behaviour in frozen and non-frozen interfaces is, IMHO, quite different. In a non-frozen interface, only that which is defined is ever guaranteed, and obviously can be changed at any point. In a frozen interface, anything that is not explicitly undefined is implicitly defined and fixed by the existing implementation. In this case, the encoding was not specified, and thus it is (IMO) fixed by the original implementation as 8-bit binary data (with NUL termination, as declared in the interface). If you want to use some other encoding, you will have to define a new interface for it.

Regarding point 1, as I implied in comment 15, it is easy enough to change ChatZilla to binary streams (4 bytes, to be precise), so just file a bug if you think it needs changing. Although it will depend on your compatibility story with binary streams, as we've only had to care about them for post-server-socket builds up to now.

Note: I couldn't care less how any of this is implemented; all I care about is that someone is trying to change the behaviour of a frozen interface, and I think that is a seriously wrong thing to do.

Boris, your last comment about expecting a throw makes no sense whatsoever. Nowhere that I've seen does it say that non-ASCII is not allowed - it only says that the effect and use of it is not defined in an XPIDL-global way.
> it is up to the interface to define what encoding |string| uses James, that's exactly what I'm saying is _not_ the case. |string| is always ASCII. And yes, the interface is buggy in that it doesn't allow reasonable access to non-ASCII data... > In a frozen interface, anything that is not explicitly undefined is implicitly > defined and fixed by the existing implementation. "the" existing implementation? There can be multiple implementations of an interface. That's the point of an interface.
(In reply to comment #20) > James, that's exactly what I'm saying is _not_ the case. |string| is always > ASCII. And yes, the interface is buggy in that it doesn't allow reasonable > access to non-ASCII data... So the MDC string guide is just completely wrong? Lovely. > "the" existing implementation? There can be multiple implementations of an > interface. That's the point of an interface. s/implementation/implementations/ if it makes you happy. There is only one implementation that is relevant to my comments, AFAIK. As usual, you're just missing the point, though...
(In reply to comment #21) > So the MDC string guide is just completely wrong? Lovely. It's not about IDL syntax and semantics, is it? > s/implementation/implementations/ if it makes you happy. There is only one > implementation that is relevant to my comments, AFAIK. As usual, you're just > missing the point, though... You really shouldn't rely on anything that's not explicitly mentioned in the IDL. Only the documented behaviour is frozen...
> So the MDC string guide is just completely wrong? Lovely.

Where? What it has to say on the subject (at http://developer.mozilla.org/en/docs/XPCOM:Strings#IDL_String_types) is:

  IDL type: string
  C++ type: char*
  Purpose:  Raw character pointer to ASCII (7-bit) string, no string classes
            used. High bit is not guaranteed across XPConnect boundaries

This is the only place I see the IDL |string| type discussed in this document. Has this section changed since you last read it?

> As usual, you're just missing the point, though...

I usually miss the point? ;) It seems to me that there are at least three points:

1) Changes to de-facto behavior are undesirable if they can be avoided.
2) XPConnect should not be depending on undocumented aspects of the JSAPI (like the encoding used for C strings).
3) Consumers of XPIDL interfaces should not be passing non-ASCII data through |string|. The ones that are doing it should be fixed.

Which fourth point did I miss?
The string guide explicitly defines the types and representations used for the XPIDL |string| and |wstring| types. Also, the best XPIDL doc I can find does nothing more than define the types as |char*| and |PRUnichar*| respectively.

(In reply to comment #22)
> You really shouldn't rely on anything that's not explicitly mentioned in the
> IDL. Only the documented behaviour is frozen...

Absolutely nothing is defined in the interface, as already mentioned (except the NUL thing). Your definition would make the method useless, and impossible to call, because there is no defined behaviour for it. Oh, except for the *IDL documentation*, which says it is char* and thus opaque 8-bit character data! Which then conflicts with everything Boris has said, as well as the MDC string guide, which Boris also seems to be in disagreement with.
(In reply to comment #24) > Your definition would make the method useless Yes, I think that's his and my opinion.
> The string guide explicitly defines the types and representations Yes, and for |string| it says |char*| and |ASCII| (in C++). It doesn't say what it is in JS, but does say that passing non-ASCII across language boundaries is not safe. > Your definition would make the method useless, The method _is_ useless, as it happens. As in, the interface just generally sucks. > which says it is char* and thus opaque 8bit character data! How do you infer the latter from the former? That is, how do you go from "JSString is somehow converted to char* in a way that is safe for ASCII" to "JSString is treated as an opaque array of bytes (or something, since UTF16 has to come in somehow) and directly mapped to char* that way"?
Seems the string guide is shit for searching - XPIDL, which is what everyone here is talking about, is not mentioned ANYWHERE NEAR the actual bit on XPIDL stuff. Fucking load of good *that* is. Anyway, for a start, I've never EVER seen anything saying |string| was 7-bit before, and I've been using XPIDL stuff for years. And I *have* read the string guide quite a few times... Whatever, I'm off again, as this is not getting anyone anywhere, least of all me.
(In reply to comment #26) > > which says it is char* and thus opaque 8bit character data! > > How do you infer the latter from the former? That is, how do you go from > "JSString is somehow converted to char* in a way that is safe for ASCII" to > "JSString is treated as an opaque array of bytes (or something, since UTF16 has > to come in somehow) and directly mapped to char* that way"? The document in question never mentions ASCII (this is not the string guide, this is the IDL docs).
Silver, will you please calm down? This bug is going to be fixed the way you want if I can help it, and I can. Boris, the interface may "suck" but it worked for years in ways that we should try to preserve. We should preserve behavioral compatibility over spec compatibility if doing so is (a) not unsound formally; (b) not too costly. Why would we not try to preserve behavioral compatibility? /be
Brendan, where have I suggested breaking compat? In comment 18 I even point to the exact code we would need to change to keep XPConnect behavior while allowing JS to turn on the UTF8 define! Please don't construe my comments about misuse of existing interfaces (including forced misuse due to suckage of said interfaces) needing fixing on the consumer side to mean that I advocate unilaterally breaking all such misusers. I'm sure we have plenty of them.... :(
Yeah, my comments on this topic were more meant as general remarks on what people can/should rely on; I didn't mean to advocate changing this behaviour. Although I happen to think that we should a) implement a better stream, b) enable that assert in all debug builds, and maybe c) remove this interface in future Geckos.
(In reply to comment #30) > Brendan, where have I suggested breaking compat? In comment 18 I even point to > the exact code we would need to change to keep XPConnect behavior while > allowing JS to turn on the UTF8 define! Well, let me be blunt: if we preserve compatibility, we should change the docs or specs to reflect the behavior, and then there will be no point in saying "don't rely on this interface you've used for many years, it sucks". That kind of talk just sets Silver off, and grates on me too. > Please don't construe my comments about misuse of existing interfaces > (including forced misuse due to suckage of said interfaces) needing fixing on > the consumer side to mean that I advocate unilaterally breaking all such > misusers. I'm sure we have plenty of them.... :( Cheer up, we have platform-based apps at least. Lots of dreamy sky-castles don't. Sure, sucking less is a goal -- secondary to backward compatibility. /be
(In reply to comment #32) > "don't rely on *the behavior of the sole implementation of* > this interface you've used for many years, it sucks". MS went through this compatibility-over-cleanliness trial with Windows; it's a victory signal. See Raymond Chen's blog, or Joel Spolsky's writings on the topic. /be
We'd need to modify the ACString conversion in XPConnect too, since that uses JS_GetStringBytes. In fact, we'd probably want to audit all JS_GetStringBytes and JS_NewString* callers in dom and XPConnect...
Assignee: dbradley → nobody
QA Contact: pschwartau → xpconnect
Depends on: 329448
Severity: normal → S3