Closed Bug 84186 Opened 24 years ago Closed 24 years ago

Add ACString, AUTF8String, AString types to XPIDL

Categories

Core :: XPCOM, defect
Priority: Not set
Severity: normal

Tracking

Status: VERIFIED FIXED
Target Milestone: mozilla0.9.9
People

(Reporter: dmosedale, Assigned: nisheeth_mozilla)

References

Details

Attachments

(4 files, 8 obsolete files)

- 16.87 KB, text/plain (Details)
- 85.61 KB, patch (Details | Diff | Splinter Review)
- 61.90 KB, patch; dbradley: review+, jband_mozilla: superreview+, shaver: approval+ (Details | Diff | Splinter Review)
- 567 bytes, patch (Details | Diff | Splinter Review)
shaver and I discussed this briefly in #mozilla a while ago. The short story is that nsIURI is moving in the direction of being UTF8 encoded rather than ASCII encoded. Since we need to be able to deal with this reasonably from JS, and as a way to allow folks to write less bloated scriptable interfaces, shaver suggested, with some hesitation, adding a [utf8] qualifier to xpidl. Thoughts?
Please No. If we *need* a UTF8 type then add a UTF8 type. Don't make it a funky attribute. There would still need to be a distinct xpt type anyway. You better have plans for how to do that before you launch off into this. scc and I exchanged ideas about extending the AString abstract interface style string idl types and *trying* to retire the flat "null terminated array of chars" type as much as possible. So, did someone want to take this bug and develop a solid plan and justification for imposing a new type on the world? Or should we just mark this future/WONTFIX now and forget about it? :)
I agree with John. I think a new type would make more sense and be cleaner.
Status: NEW → ASSIGNED
This is ONE of many possible character encodings, that happen to fit in 8-bit characters. I don't think this is a proper use of the attribute mechanism, to single out a particular character encoding. Let's either add a new type, or pressure the nsIURI folks to just move to UNICODE. Either way, the nsIURI implementation will have to become UNICODE aware.
beard: UTF8 _is_ a Unicode encoding. Do you mean UCS2? Anyway, I'm fine with a new type. The important thing is that there's some standard way to write scriptable interfaces that is: a) i18n friendly b) doesn't require 16-bit-wide chars Since i18n is required everywhere, and a non-trivial number of folks (including the nsIURI owners) are unwilling to accept the footprint cost of 16-bit chars. UTF8 is a particularly good choice for the encoding in question because at least a few popular protocols (IMAP, LDAP, ...) use it natively as their wire format, and this cuts down on the number of conversions necessary.
Blocks: 86000
Updating the summary. So if we're gonna make it a separate type, wouldn't it be better to base it around an |ACString| type rather than a |string|?
Summary: [utf8] attribute for xpidl |string| type → UTF8 string type for XPIDL
Marking this won't fix, since an alternative solution was presented and seems to be accepted. If incorrect feel free to change it back and discuss the issue further.
Status: ASSIGNED → RESOLVED
Closed: 24 years ago
Resolution: --- → WONTFIX
Status: RESOLVED → VERIFIED
Marking Verified -
Um, what alternate solution are you referring to?
Status: VERIFIED → REOPENED
Resolution: WONTFIX → ---
Sorry Dan, I forgot you had changed the summary. I still had this tagged as creating a utf8 qualifier.
I'd like to see this for Mozilla 1.0, as the LDAP string APIs are a mess, and I think the only sane way out of their current state is to move to UTF8 encoding, since that's how they come over the wire. I'd be interested in seeing nsACString exposed to idl, so that we keep all the nifty advantages of the new string fu.
John, Suppose I have a function: void foo (in string a) Is 'a' required to be in ASCII? If so, many interfaces violate this, and can we safely extend the meaning to any encoding (such as UTF8)?
Yes they do and no we can't. We've had many discussions about this on the xpcom newsgroup. Take a look at some branches of the thread entitled "Encoding Wars --- more in the Big String Story", circa 4/13/00. We might concoct a new type, but we are not going to redefine the conventional meaning of xpidl's 'string' type. Regardless of the amount of abuse, I think it would be a big mistake to rewrite the contract for 'string'.
sounds fair to me. there are a few interfaces that want to pass around utf8 strings. URI/URL interfaces will want to do this sometime too. Question for you, is it possible to have a string type that could contain either UTF8 or ASCII? /me goes reads ng history....
Yes, ASCII is a subset of UTF8. If we had a UTF8 type then putting ASCII in it is fine. The thing we don't want to do is force users of our ASCII type to have to be prepared to get non-ASCII UTF8. If we *really* need clean support for UTF8 then we need to suffer the pain of bringing a new type into the world.
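The subset claim above is easy to verify; here is a quick sketch (Python used purely for illustration, not Mozilla code) showing that ASCII is a byte-for-byte subset of UTF8, while the reverse does not hold:

```python
# ASCII text encodes to identical bytes under ASCII and UTF-8, so any
# ASCII value is automatically a valid UTF-8 value.
ascii_text = "hello, world"
assert ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# The reverse fails: non-ASCII characters need multi-byte UTF-8
# sequences, which a strict ASCII consumer must reject.
utf8_bytes = "héllo".encode("utf-8")
assert utf8_bytes == b"h\xc3\xa9llo"
try:
    utf8_bytes.decode("ascii")
    decoded_as_ascii = True
except UnicodeDecodeError:
    decoded_as_ascii = False
assert not decoded_as_ascii
```

This is exactly why putting ASCII into a UTF8-typed parameter is safe, but handing non-ASCII UTF8 to an ASCII-typed consumer is not.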
watching this bug.
Moving out past 1.0. If someone wants this for 1.0, let me know and I'll see about rearranging my bug list.
Target Milestone: --- → mozilla1.0.1
If we are still thinking about making the nsIURI/nsIURL/nsIFile interfaces pass utf8 strings, we are going to need this to be fixed.
Ok, that's what I needed to know. Is there a bug for that change or is that what bug 86000 implies?
Target Milestone: mozilla1.0.1 → mozilla0.9.7
If "1.0" means frozen interfaces does that include the typelib structure or accepted .idl types?
This is not a bug, it's a feature. I'm convinced that adding it will require making typelibs that cannot be read by old typelib readers (e.g. NS6.2). I'm against this.
UTF-8 can be passed as char* or unsigned char*, so this does not block making nsIURI to use UTF-8.
that is illegal to do. The "string" keyword in xpidl is not intended for utf8 encoding. John can give you the reasoning behind this rule.
One reason is that xpconnect assumes ascii and will royally screw any interface that's trying to pass UTF8 through a string API. John's arguments seem to lead to the conclusion that it's better to do it before 1.0 rather than after. I don't really see the problem if 6.2 gets broken, 6.2 isn't going to work with components compiled against mozilla1.0 anyway due to interface and c++ signature changes.
Let's make an effort to fix this soon, for mozilla1.0. We should of course already be using the typelib file major version to control compatibility. See http://www.mozilla.org/scriptable/typelib_file.html. /be
dveditz: I think you are confusing form and content. The fact that there will be additional frozen interfaces has little to do with the distributed overhead of making a new incompatible typelib version. It is not just the interfaces in our tree that take the hit. A new version impacts vendors of components and plugins, and embedded clients. If I ship a plugin with its own interfaces for scriptability I'd like to be able to ship a version that works in 6.2 as well as with future clients. This implies that we'll need to support reading and writing both old and new versions. And, we'll need to add support to the xpidl compiler to spit out the older versions on demand. We can do this. But it is not cheap. We also need to add support to all the consumers of type info: xpconnect, PyXPCOM, xpcom/proxy, etc.

I have yet to see a compelling argument that says we really need to support moving UTF8 across scriptable interfaces. How many interfaces need to do this? Why can't they be converted to use wstring or AString? How many cases can't be so converted? Why? Perhaps UTF8 can be an implementation detail within components? Maybe there is some deCOMtamination that ought to be done here? How much code do we have that has to deal with UTF8ness? Would it not be better to move more of the code to work with unicode? If we make an AString-ish UTF8 scheme are people going to whine that they can't easily move raw char* through such interfaces; i.e. are we going to need 5(!) string types?

Why are non-scriptable methods not an option? You can declare a native type right now for UTF8. You just can't call such methods using JS. But you could have parallel scriptable methods that use unicode (xpconnect is going to have to convert to unicode anyway). Anyway, arguments that say we need something have to be backed up with facts and need to address the tradeoffs and alternate solutions.
Can someone come up with a compelling argument for this enhancement that jband's suggestions wouldn't address? I'd like to get a decision one way or another fairly soon so I can assess the impact.
I'm going to take the silence as agreement. Moving -> 1.1
Target Milestone: mozilla0.9.7 → mozilla1.1
It seems like the *only* pressing reason to provide a UTF-8 datatype is for the nsIURI interface -- which must be frozen for mozilla 1.0. I have no problem creating 'parallel' interfaces (i.e. nsIURI_UCS2) to access data as unicode. I guess my only question is when the unicode versions should be used vs. the ASCII ones :-) I suppose the ASCII ones could *always* return data as URL-encoded 7-bit data while the unicode ones would just return the 'raw' data as unicode. Is this sufficient? Or can we simply ditch the ASCII ones and always access the data as unicode -- and possibly translate it to URL-encoded 7-bit ASCII if necessary? Thoughts? Opinions? -- rick
It really depends on the protocol, and we should not escape twice. With the new urlparser rewrite we already store the url escaped. Depending on the protocol (could be set by the protocol flags) we should call the escape functions, which always do the normal escaping for the first 127 chars (this is a must!) and sometimes drop the escaping of the other chars.
I think the DOM APIs would like to have this too, but i could be wrong.
Nope, no need for this in the DOM interfaces.
what we found with nsIURI is that the ASCII network ready string and string-parts are frequently accessed. this means that the ASCII url-encoded form is what we should optimize for. if we need unicode [gs]etters then they can live on an alternate interface (as rick suggests), and the underlying implementation could be changed if the unicode methods start to get called more frequently than the ASCII-network-ready ones.
Sorry to get back to this discussion late. The thing that prompted me to file this bug is that the LDAP XPCOM SDK does currently use AString/wstring in its interfaces. However, the XPCOM SDK is really just a wrapper around the LDAP C SDK, which is all UTF8, because that's how LDAP stuff is encoded on the wire. Typically, clients of the XPCOM SDK call into it a number of times in a row, using data from each previous call in the subsequent one. So this means that most data will be unnecessarily converted a bunch of times in a row. As far as deCOMtamination goes, I'm not sure that's the right thing in this case, because the LDAP code really is used in more than one part of the app (xpfe, mailnews, autoconfig).

jband asks:

> Would it not be better to move more of the code to work with unicode?

For data that's often mostly ASCII, isn't this unnecessary bloat?

> If we make an AString-ish UTF8 scheme are people going to whine that they
> can't easily move raw char* through such interfaces; i.e. are we going to need
> 5(!) string types?

This sounds like a straw man to me.

> Why are non-scriptable methods not an option? You can declare a native type
> right now for UTF8. You just can't call such methods using JS.

Well, for one thing, the LDAP datasource (which uses the XPCOM SDK) is written in JS.

> But you could have parallel scriptable methods that use unicode (xpconnect is
> going to have to convert to unicode anyway).

Well, maybe this is what I'll have to do, or even use arrays of bytes, and then convert them in JS. But it's worth keeping in mind when asking all these questions not just "is it possible to do without a new type?" but also "is it performant and footprint-efficient?"
After further discussion it was decided 1.0.1 makes more sense as a post 1.0 milestone.
Target Milestone: mozilla1.1 → mozilla1.0.1
Kevin McClusky and I were talking last week about how much space we potentially could save if we were to use utf8 instead of UCS2 wherever possible, converting at output time (like inside of widget) when UCS2 is needed. i just got wind of this bug, thought I'd at least add my observation that this might be a good thing for helping to combat bloat.
Most of the huge UCS2 bloat is inside of layout, not from XPCOM boundary crossing, I think. We've been talking about this for literally years, but I was always under the impression that it was Too Hard to fix layout. We have bugs related to using UCS-2 even when the data is all-ASCII because of defects in AppendData, if people are looking for a place to start.
Most of the string data in layout (or at least in the DOM) is stored as char*'s if it's ASCII data, which is what most data on most pages is. Do we have other huge consumers of string data in layout? Style system?
I'm seeing a lot more reasons to *not* do this than reasons to do it. I do want to reiterate the thought that if we *do* decide to add such support to xpidl then it should be as a way to talk about an aString-like *class* (yet to be written!) and not about a pointer to an array of UTF8 chars. We want to get rid of uses of string and wstring and their ilk, not encourage their use.
Changing bug summary to more accurately describe what is desired after discussion with bryner and jag in irc.
OS: Linux → All
Hardware: PC → All
Summary: UTF8 string type for XPIDL → 8-bit UTF8-capable string type for XPIDL
I don't see how the summary change from "UTF8 string type" to "UTF8-capable string type" helps. The current string type is "utf8 capable", the problem is that xpconnect or other idl users can't know whether any given string does in fact contain utf8 and so can't safely convert to UCS2 if desired.
dveditz: according to jband, passing UTF8 through |string| is illegal; why this is so, I don't entirely understand. Despite it being illegal, it does work. What we need is a type for which this both works and is legal. Perhaps jband can shed some light on why this is illegal for |string|.
It's "illegal" because xpconnect assumes ASCII and will mangle the conversion to UCS2; utf8 can be made to work, though, in non-scriptable interfaces. My question was about the distinction between a UTF8 type and a UTF8 "capable" type, and why that distinction was important enough to change the summary for. I know why we want a UTF8 type, what I don't know is what else a "capable" type would do. I imagine some sort of argument pair (char* plus charset arg) would be utf8 capable, but I suspect most of the people who want a UTF8 type want to declare "This routine takes UTF8" in their interfaces not "This routine will handle any arbitrary charset you throw at it". I know some people would like the latter, but I believe there are other bugs for that, and if we have to wait for the arbitrary case we might not ever get it. Since UTF8 is an encoding not a character set it's much easier to deal with than arbitrary character sets, and in western languages saves memory over UCS2 which is one reason we want it now.
> It's "illegal" because xpconnect assumes ASCII and will mangle the conversion
> to UCS2; utf8 can be made to work, though, in non-scriptable interfaces.

That's actually not correct. It doesn't do any conversion at all, but merely narrows or widens the chars as appropriate. So as a result, you can take the string and convert it by hand from inside JS between UCS2 <-> UTF8. There's even code that does this in the tree now.

> My question was about the distinction between a UTF8 type and a UTF8 "capable"
> type, and why that distinction was important enough to change the summary for.
> I know why we want a UTF8 type, what I don't know is what else a "capable"
> type would do.

This is an interesting distinction because an "encoding-agnostic" |nsACString| (or even |string|) would fulfill this requirement. Special knowledge of UTF8 would be cool, but is not necessary.
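The narrow/widen behavior described in this comment can be illustrated outside Mozilla. A small sketch (Python standing in for xpconnect; its byte-wise widening with a zero high byte is equivalent to a latin-1 decode) showing both the apparent mangling and the by-hand recovery from script:

```python
# xpconnect-style widening: each 8-bit unit becomes a 16-bit code unit
# with a zero high byte; latin-1 decoding models this exactly.
utf8_bytes = "é".encode("utf-8")          # b'\xc3\xa9'
widened = utf8_bytes.decode("latin-1")    # two garbage-looking code units
assert widened == "\u00c3\u00a9"          # not 'é'

# Because no information is lost, script code can convert by hand:
# narrow back to bytes, then decode as UTF-8.
recovered = widened.encode("latin-1").decode("utf-8")
assert recovered == "é"
```

This is why the comment says conversion "can be done by hand from inside JS": the widening is lossless, just wrong-looking until re-decoded.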
so, i've got an example of why having a UTF8 string type would be very useful. nsIURI::spec, nsIURI::host, and nsIURI::prePath can all contain UTF8 characters, yet these are declared |string| in nsIURI.idl. all components of an URL with the exception of the hostname are escaped. this is done to support internationalized domain names, which must be passed down to the bowels of necko in UTF8 format. up until very recently, mozilla did not support internationalized domain names (see bug 42898), and it has always been assumed that these attributes of nsIURI would just be plain ASCII. now that assumption has changed.

for C++ code, i've had to manually convert consumers from making ASCII assumptions to treating these attributes more generally as UTF8. no problem so far. the problem comes in when i consider the JS consumers. what happens when JS code accesses some of the attributes? if |string| means ASCII, then my JS consumers will get garbage. so, i either need |string| to mean UTF8 to xpconnect, or i need some other string type that i can use to designate UTF8 strings in my IDL. and, since nsIURI is an interface that must be frozen by mozilla1.0, i think we need some sort of solution to this bug soon.

oh, and making these interfaces use wstring or AString is not going to cut it. these strings need to be UTF8... for performance reasons, i can't afford to convert them every time someone asks for them. moreover, storing a UCS2 version alongside a UTF8 version is just going to bloat mozilla unnecessarily. please please please add a UTF8 string type :-) (my preference would be for |string| to be treated as UTF8, unless it can be shown that this would be a performance problem.)
Keywords: mozilla1.0
Can we cut the Gordian knot? Brute-force and ignorance: redefine XPIDL string to mean UTF-8, instead of ASCII. Make sure all our code that was inflating ASCII to UCS-2 (or is it UTF-16? Neither, actually) now converts from UTF-8 to PRUnichar vectors. Darin has had to go down this path. Can we follow and take our chances with the performance hit? My little finger says "yes". Jband, throw your trump card. Backward compatibility in the pre-1.0 world, which mixed ASCII and UTF-8 in string, doesn't count with me -- except to underscore that (by design) UTF-8 is a superset of ASCIIZ, so the only hazard is the possible loss of performance converting from string to wstring. /be
I don't suppose I'm going to go on fighting the whole friggin' world. But... I've argued from the beginning that the real reason for 'string' even existing is to interface with code that needs to use flat C style strings (even in a Unicode world with a modern 16 bit string class available). And, the point of the ASCII restrictions is so that code using these strings can reasonably make assumptions about the nature of the strings. If you say that 'string' now means UTF8 then you've erased any way of tracking that part of the *explicit* contract about the strings. And, *all* code that receives such strings is on its own to figure out if they are ASCII or not - and have *some* sort of strategy to deal with the unexpected. There is no way to track mismatches between code that vends UTF8 and code that does not know how to deal with UTF8 (and has always been promised ASCII!).

I was serious in my comments above that I think it is simply *wrong* to unleash another 'dumb' string type. It is also wrong (but apparently an unavoidable legacy) that we are passing around flat ASCII buffers rather than objects whose length is determined *once*. It is even whackier to reuse that ancient C string paradigm for UTF8 when we *ought* to have an object that can answer both the question of length and the question of whether or not any 'extended' chars are present (when usually they are not).

I was arguing that if people think support for UTF8 is so important then they should commit the resources to construct a modern class for representing them. *Then* we can decide if we are willing to pay the cost of extending the typelib spec (and client code) to support the new class at interface boundaries. Just changing the spec and hacking xpconnect (et al.) a little will get you UTF8 support. But this is like proclaiming that there is no difference between signed and unsigned numbers and turning off static type checking.
It will mostly work, but some code will invariably get screwed when handed an instance whose value does not conform to expectations. And you'd be throwing away the tools to track the distinction.
jband: I think we're already in the world you describe in your last paragraph, and XPConnect's string-must-be-ASCII checks won't save C++ code called directly from C++ code (of which we have plenty). I agree we're also risking handing UTF-8 with high bit set off to old ASCII-only stdio routines, or whatnot (though aren't some platforms modernizing? Or do the C and C++ standards dictate that char[] memory passed, e.g., to printf, is ASCII only?). This in addition to any perf hit from string => wstring where now we merely inflate. But worse-is-better is a harsh master, and it is beating us about the face and shoulders here. Unless someone (darin with jag help?) is willing to evolve a "strongly typed" UTF-8 nsAString subclass and get dbradley to extend typelib format and XPConnect *quickly*, I predict string will be redefined to be UTF-8. Then at least our current ASCII vs. UTF-8 confusion will yield to UTF-8 unity, even where we do not want it. :-/. /be
i'm a little confused about darin's argument for needing a utf-8 string type. As far as i can tell, he is saying that we need a utf-8 datatype for javascript callers. but then, he says that AString or wstring won't be sufficient for *performance* reasons. performance for what? the javascript consumers? i can't imagine that performing a utf-8 to unicode conversion before calling into javascript will kill our performance. Right now, nsIURI is the *only* public API that could use utf-8 strings. It seems crazy to completely eliminate ASCII strings just for this one interface !! -- rick
> i can't imagine that performing a utf-8 to unicode conversion before calling
> into javascript will kill our performance.

Right, that's not the problem. I think darin meant C++ code performance if we used wstring or anything widening for utf-8. dmose has use-cases for utf-8 strings, it's not just nsIURI. But what I would like to emphasize (not sure this helps, tell me why not, if not) is that we aren't losing ASCII strings -- UTF-8 is a proper superset of ASCII (ASCIIZ, I mean). We are losing the current string=>wstring performance profile of inflating char to PRUnichar, if we go this route. Instead, for an ASCII string, we'll inflate only after testing bit 7 in each char. I don't fear that right now. I'm probably missing something, however! There are lots of different cases. /be
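The inflate-only-after-testing-bit-7 scheme brendan describes might look like this (a sketch for illustration only, not Mozilla code; the function name is made up):

```python
def inflate_string(buf: bytes) -> str:
    """Widen an 8-bit buffer to 16-bit text, UTF-8-aware."""
    # Fast path: a buffer with no byte having bit 7 set is pure ASCII
    # and can be zero-extended one code unit per byte, like the old
    # char -> PRUnichar inflation.
    if all(b < 0x80 for b in buf):
        return buf.decode("ascii")
    # Slow path: decode multi-byte UTF-8 sequences (the output may have
    # fewer code units than the input has bytes).
    return buf.decode("utf-8")

assert inflate_string(b"plain ascii") == "plain ascii"
assert inflate_string("héllo".encode("utf-8")) == "héllo"
```

For the common all-ASCII case the cost over plain inflation is just the per-byte high-bit test, which is the perf tradeoff being debated here.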
> we aren't losing ASCII strings But you *are* losing ASCII strings. You are making *everyone* pay the cost of not being able to have a contract that says ASCII and only ASCII when a *few* want a superset. The cost is more checking everywhere and/or more uncertainty in general.
I see, we're losing a "contract" that only XPConnect enforced. I'm full of insecurity right now. UTF-8 strings will leak, I bet they are now. How less secure are we by facing this problem and switching string from "I hope it's ASCII and XPConnect debug builds may spank me" to "I know it's UTF-8 and maybe it'll leak into some old stdio and print garbage (to a debug-only log)"? /be
I don't understand this concern about a contract for "ASCII and only ASCII". Few if any contracts will accept arbitrary strings taken from the entire range of ASCII, including control characters. A contract will normally specify some strict subset of the language defined by NUL-terminated ASCII strings as being legal.
rpotts: what brendan said. currently nsIURI::host is declared |string|, but it's really UTF-8 to allow for internationalized domain names (this also applies to nsIURI::spec and nsIURI::prePath since they contain the hostname). at one time the hostname was URL escaped, but in order to solve a security problem with embedded %00 sequences, i (perhaps naively) decided to do away with URL escaping the hostname. (this actually matches the behavior of other browsers (such as IE and Nav4) as far as i can tell.) i say that this was perhaps done naively because i was unaware at the time that |string| was meant to "strictly?" represent a null terminated string of ASCII characters. as i've learned more about this issue, i see that there are many pitfalls resulting from this... namely having to do with xpconnect, since it will not know to treat nsIURI::host as UTF-8.

so, if this usage of |string| is invalid (and if there will never be a UTF-8 string type), then i need to find some other solution for nsIURI. it probably needs to have AString attributes in addition to ASCII attributes for the portions that need to be in network ready format. i'm not sure what this will mean to the implementation of nsStandardURL. there will likely be foot-print / run-time performance issues to consider. maybe all i need to do is go back to URL escaping the hostname and then try to solve the issues w.r.t. %00 and other fun escape sequences in another way :-/
darin, it's not just nsIURI. I wrote "UTF-8 strings will leak" and probably confused people -- by "leak", I meant that these char arrays will flow through C++ and even C code into places that do not expect UTF-8, and be wrongly widened to PRUnichar arrays simply by inflating each char with a zero high byte. I don't know that this is a worse problem than the "leakage" of UTF-8 strings into code that expects only ASCII, but I suspect it is. Mozilla is a fairly closed system, apart from debug logging, for portability and localizability. Code tends to do AssignWithConversion more than pass a char array off to a stdio or other legacy ASCII-only routine. Anyway, I think a "contract" that's enforced only by debug builds for data flowing through XPConnect is likely to be observed in the breach more than we would like. I prefer to think we can change such a contract still, document the change, and flush out bugs. Who is with me? /be
darin: how does UTF8 help with embedded nulls? If you're using 0xC080 to represent null then you're violating the UTF8 spec which says that each char must be represented in shortest form. NULL must be 0x00 as in ASCII. Our UTF8 converters probably don't validate for speed reasons, but don't count on it staying that way. We don't want to leak invalid UTF8 outside Mozilla so we may end up adding validation to debug builds.
Our UTF8 converters validate strictly for security reasons.
internally, our xpcom-centered utf8 converters do not validate (i.e. NS_ConvertUTF8toUCS2).
NS_ConvertUTF8toUCS2() calls class ConvertUTF8ToUCS2, which does validate. Note the use of minUCS4 in ConvertUTF8ToUCS2::write() -- a C0 80 sequence will decode to U+FFFD.
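The shortest-form rule being discussed is easy to check with any strict UTF-8 decoder. A sketch (Python for illustration; note its 'replace' handler substitutes one U+FFFD per invalid byte, so C0 80 yields two replacement characters rather than the single U+FFFD described above for ConvertUTF8ToUCS2):

```python
# C0 80 is an overlong (non-shortest-form) encoding of U+0000, which the
# UTF-8 spec forbids; strict decoders reject it outright.
overlong_nul = b"\xc0\x80"
try:
    overlong_nul.decode("utf-8")
    accepted = True
except UnicodeDecodeError:
    accepted = False
assert not accepted

# With replacement, each invalid byte becomes U+FFFD in this decoder.
assert overlong_nul.decode("utf-8", errors="replace") == "\ufffd\ufffd"

# The only legal UTF-8 encoding of NUL is the single byte 0x00, as in ASCII.
assert "\x00".encode("utf-8") == b"\x00"
```

So an embedded NUL cannot legally be smuggled through validating UTF-8 as 0xC080, which is dveditz's point.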
It seems like we need to choose our evil here. We need UTF8 functionality one way or another. If no one can afford the time to evolve a new UTF8 type & typelib functionality (and I don't yet see any volunteers :-), then I agree with brendan that we need to redefine |string| slightly. If the extra checking turns out to be a real perf problem, we could go the more-uncertainty route: |string| could be defined in general to be UTF8, but XPConnect could continue to treat |string| as it does today, as ASCII: |string| would be inflated with zeros and UCS2 would be truncated to get back to |string|. This would force the few JS callers that need to use these interfaces with UTF8 data to do the UCS2<->UTF8 munging by hand.
Note that in my previous comment, when I say "ASCII", I really mean "8-bit ASCII". XPconnect currently does treat |string| that way.
By "8-bit ASCII" I think you really mean "ISO-8859-1".
dveditz: i should have been more specific: i was talking about the case of a href with an embedded %00 sequence in the hostname, such as: www.amazon.com%00www.nasty.com

back in the days when we used to URL escape hostnames, this would have gotten unescaped in certain places... which has the unintended side effect of introducing an embedded null. a security hole was discovered that exploited this unescaping, so we decided to disable URL escaping and unescaping of the hostname to counter this exploit (see bug 110418).

ok, that aside... i've thought more about my previous comments. URL escaping is not the solution for nsIURI. perhaps it's worth it for me to explain why. nsIURI setters currently allow for unescaped input, and nsIURI getters are expected to generate unescaped output. this means that for attributes like nsIURI::host, escaping will not help us, because the nsIURI::host getter will still want to return a UTF-8 string.

so, what does this mean for nsIURI? well, if |string| were modified to mean UTF-8, then nsIURI, its consumers, and all of its implementations would remain as is. if |string| were to mean ASCII (or ISO-Latin-1) exclusively, then nsIURI would have to be changed to use wstring (or AString) attributes. this would be a costly change, not only in development time but also in performance (resulting in either increased foot-print or increased execution time). you might be tempted to solve nsIURI's problems by changing the escaping policy, such that the attribute values would always be in escaped form, but this would also be a very difficult solution to implement.

OK, i think i've convinced myself (perhaps twice now) that nsIURI and its consumers would really benefit from |string| becoming UTF-8. existing code that uses |string| to represent raw network data can be modified to use [array] octet, without even breaking any existing C++ code :-) so, brendan's plan gets my vote!
This bug needs to be pulled back from mozilla1.0.1 to mozilla0.9.9 -- dbradley, can you do it? We can certainly muster helpers, jag, darin, dmose, and myself among them, to check for NS_ConvertASCIItoUCS2 calls that need to be converted to NS_ConvertUTF8toUCS2. /be
Pulling back to 0.9.9. I'll be in Mountain View next week. We can discuss it then. I'll take a look at it this weekend.
Target Milestone: mozilla1.0.1 → mozilla0.9.9
yeah, if i can help out with any of the grunt work just let me know.
I certainly see where jband is coming from - tracking down encoding related errors could be gruesome. Take PyXPCOM for example; this change gives me a number of dilemmas about how to handle string conversions. Every option has a significant downside, including too much encoding magic (Python has a default encoding, generally strict 7-bit ASCII; Moz uses UTF-8; on Windows all the narrow CRT functions use "mbcs" encoding; etc.) causing strange and hard-to-reproduce failures when first tested with non-7-bit-ASCII data. So I would prefer to see this not go ahead, but jband's idea of a new IDL type explored. I do, however, appreciate the reality of the situation and that my 2c doesn't change it :)
Trying to put the finishing touches on a patch, I ran into something that I wanted to get a consensus on. For a type T_CHAR (single character) that is being translated into UCS2, how should I handle that? Should I still call the UTF8 to UCS2 function or assume that it can only be ASCII and leave it as it is? There's warning code for testing the high bit, but I'm not familiar enough with UTF8 encoding to know if there's a valid UTF8 character that is one byte and has the high bit set. My take is that it doesn't make sense to use T_CHAR for UTF8 and that it should be assumed ASCII, leaving the assert in. Conversely, if a conversion from a UCS2 character to T_CHAR would result in a non-ASCII character, I should probably fail the conversion?
Any UTF8 byte with the high-bit set is part of a multi-byte sequence (or invalid).
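That property is what makes the T_CHAR decision above safe; a quick check of the claim (Python for illustration only):

```python
# Code points below U+0080 encode as a single byte with bit 7 clear...
assert all(chr(cp).encode("utf-8") == bytes([cp]) for cp in range(0x80))

# ...and everything else encodes as two or more bytes, each with the
# high bit set. So a single byte with bit 7 set can never be a complete
# UTF-8 character on its own.
for ch in ("é", "\u20ac", "\U0001d11e"):   # 2-, 3-, and 4-byte cases
    encoded = ch.encode("utf-8")
    assert len(encoded) >= 2
    assert all(b & 0x80 for b in encoded)
```

Hence treating T_CHAR as ASCII-only and asserting on the high bit loses nothing: no valid one-byte UTF8 character has it set.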
Attached patch Prelim patch (obsolete) — Splinter Review
Here's a first whack. I haven't double checked all the uses of ASCII strings yet. What I wanted to get feedback on is the use of NS_ConvertXXXtoXXX classes and whether there are more efficient methods to do this. The complicated part was the fact that JSData2Native is externally told whether or not to allocate the string. And if told not to alloc, it just returns back the deflated string JS provides, which from what I could tell truncates and does not translate the UCS2 characters. So I adjusted the callers (where I think it matters) to call it with useAllocator set to true and set the is-allocated flag on the variant. It's not a great solution; if someone comes up with a good alternative, let me know. In the meantime I'll double check to see if I missed other conversions and whether I can come up with a better solution for the useAllocator issue. Also need some UTF8 test cases to pump true UTF8 through this code.
dbradley, please contact ftang, I think he wrote the old code.
dbradley writes:
> My take is that it doesn't make sense to use T_CHAR for UTF8 and
> that it should be assumed ASCII, leaving the assert in.
>
> Conversely if a conversion from UCS2 character converted to T_CHAR I probably
> should fail the conversion if it would result in a non-ASCII character?

Both of these behaviors sound to me like the right thing.
I'm w/ jband here. 'string' has a strong meaning of ASCII only. Changing that to include UTF8 changes the ground rules and would be bad (too many conversion/leak bugs to track down w/ consumers). It's a pre-1.0 contract sure, but a contract nonetheless. There's grey area around contract integrity pre-1.0, but this one lands on the "let's not break this contract" side of things IMO. We have lots of pre-1.0 contracts that can't be broken (see http://lxr.mozilla.org/seamonkey/search?string=us+frozen). Brendan: Mozilla is not a "fairly closed system." We have *many* consumers of 'string' (both JS and C++). Pulling a switcheroo on them would be rude. "Oh, that's UTF8 all of a sudden? Now you tell me; I've been leaking all along. Oh, I need to do all this encoding/conversion magic; that bites!" Of course, if they're going to be thanking us later for it, that's another thing, but my crystal ball broke a couple of years ago. As for the reneging on the nsIURI::host 'string' contract (now it can include UTF8), Darin, please add a comment to the interface indicating that it can be UTF8 so unsuspecting consumers know it's an exception (for now). I see LDAP's and nsIURI::host's need for UTF8 to pale in comparison to the system (open and closed) wide impact of this mod. IMHO, we pound out (I can't sign up for the work :-/ ) a new string type, or special-case the UTF8 needs in a new interface. dmose, customers needing your LDAP data can use said interface, and we can either leave exceptions (nsIURI::host) as exceptions, or push them to use the new interface.
Extending string from US-ASCII only to be capable of handling UTF-8 does not "break" existing contracts unless such contracts use the current undocumented semantics of permitting ISO-8859-1. Existing contracts restrict the set of legal values for string arguments to some subset of the language of US-ASCII strings; these languages are still valid subsets of the language of UTF-8 strings. Put another way, extending string to handle UTF-8 does not itself require any existing contract to start accepting UTF-8. Interfaces that handle data which originate from untrusted sources cannot now rely on the "string" type carrying data conforming to the US-ASCII subset. They are already required to fail safe when given non-ASCII data.
No. The proposed change *would* change contracts. The current xpidl spec is that *all* 'string' values are zero terminated arrays of 7bit ASCII chars. Any consumer of type 'string' can reasonably expect to only receive 7bit ASCII strings. That is part of the contract. Saying that 'string' is extended to include other values can result in those implementations being given values that fall outside the range of values they were promised. They had no preexisting need to protect themselves from such values. This is not only possible, but we'd be explicitly changing XPConnect code such that JSStrings that previously did not convert to anything other than 7bit ASCII would now create UTF-8. This is a big change. (Yes, it can be argued that the existing conversion code *ought* to detect the loss of information). This is absolutely a unilateral change of contract for all 'string' consumers.
No to both points. The current xpidl spec may say that all 'string' values are zero terminated arrays of 7bit ASCII characters, but the current implementation treats these as zero terminated arrays of 8bit ISO-8859-1 characters. Security sensitive implementations currently *must* safely deal with non-7bit characters in these strings, as the current implementation permits such values to be passed from untrusted sources. Changing the definition of 'string' does not change existing contracts using 'string'. Take for example nsIClassInfo::contractID. A contractID has a specific syntax. Consumers of nsIClassInfo::contractID do not currently have to handle the full range of US-ASCII strings in values received from this interface; they only have to handle values that are valid per the syntax specified by the nsIClassInfo contract. Just as consumers of this interface don't have to handle contractID values consisting of, say, 103945 consecutive Control-G characters, consumers of this interface after the proposed change would still not be required to handle a contractID value containing, say, a U+0234 character.
All this talk about contracts is making me queasy; it reminds me of the scene in "Play It Again, Sam" where Woody Allen's wife is leaving him:

Wife: "My lawyer will call your lawyer."
Woody: "I don't have a lawyer -- have him call my doctor."

We have more than a "little bit" of contract violation in the current codebase, and even a little bit can hurt. I don't see how valeski's comment 78 request that darin warn consumers of nsIURI attributes about the UTF-8 hazard helps, if those callers are written in JS or another language that currently inflates char arrays to 16-bit-character-type arrays. Such callers will get bogus ISO-Latin-1 code points instead of proper UTF-8 decoding (and JS ones will be spanked by XPConnect's 8th-bit-must-be-zero check in debug builds). Jud, is the proposal that we remove the XPConnect sanity check and make all such callers use a UTF-8 decoder? That removes all "contract" enforcement and adds avoidable costs to such callers. See comment 44 for darin's rejection of using other existing, scriptable string types. I am not proposing a utopia here. I still view switching string as the lesser evil, given all the constraints. /be
John Myers, the point I'm making (and that I believe jband is making as well) is that regardless of ASCII being a subset of UTF8, there is an implied usage semantic that 'string' is null terminated, and that no decoding has to occur. Wrong or right (technically it's wrong, I agree), it's a contract we'd be breaking. Brendan, if you're queasy now, maybe you should jump into some of the UA string mod discussions going on elsewhere ;-). I'm not proposing switching XPConnect's definition of 'string' (from ASCII to UTF8) and pushing the UTF8 decoding onto callers. IMO, that's asking callers to do a lot of work for what, AFAICT, is an edge case that can be handled by a "new" (non existing) string type that explicitly defines itself as UTF8. If I can be convinced this isn't an edge case, then my view changes. If there is true benefit to exposing UTF8 all over the place, then, asking callers to explicitly decode doesn't feel as bad. I'll note here that whatever version of IE I'm running has a preference checkbox to "treat all URL strings as UTF8." That to me feels like IE is testing the water w/ the very contract change we're discussing here (changing char * to mean UTF8 instead of ASCII).
The semantic that 'string' is null terminated would not be changed by the proposal. I have no idea what you mean by "no decoding has to occur". Any language that uses 16-bit string types has to decode 'string' somehow.
(I just bet that IE already has their entire story except "what does the user expect us to do with the incompatible content on the web?" nicely in hand.) I want us to have a widely-used UTF8 type, because I want to be able to gently migrate us away from the pervasive bloat of UCS2 when the vast (vast!) majority of the data we handle is ASCII. If this change (string == utf8) will get us closer to that (and I think it will), I am _all_for_it_. (Even if we have to sacrifice counted strings to do it.)
It may be interesting to think about exactly how the additional XPConnect decoding load would affect interfaces that don't need it. As jgmyers just pointed out, we need to traverse the string once anyway to inflate it even in the current world. So for interfaces that only actually use 7-bit ASCII, the cost would still be O(n) for strings of length n, with the constant being made slightly larger because each char will have a single test of the high bit, which will fail. Exactly what this means in terms of real-time cost is left as an exercise to the reader, but my suspicion is that this is gonna be lost in the noise of crossing XPConnect. In languages that support 8-bit strings natively (eg C++), there's no "decoding" that needs to be done at all.
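The inflation loop in question can be sketched as follows; the function name and signature are illustrative, not XPConnect's actual code. For pure 7-bit input the cost stays O(n), with exactly one extra high-bit test per char that never fires.

```cpp
#include <cassert>
#include <string>

// Sketch of single-pass inflation from 8-bit chars to 16-bit units, with
// the per-char high-bit test discussed above. Returns false on the first
// non-ASCII byte, signaling that the caller needs a real decoder (e.g.
// UTF-8) instead of simple zero-extension.
bool InflateAsciiToUCS2(const char* src, std::u16string& dest)
{
    dest.clear();
    for (const unsigned char* p =
             reinterpret_cast<const unsigned char*>(src); *p; ++p) {
        if (*p & 0x80)
            return false;                          // non-ASCII: bail out
        dest.push_back(static_cast<char16_t>(*p)); // zero-extend to 16 bits
    }
    return true;
}
```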
I wanted to comment briefly saying that jband set me straight on XPConnect's current enforcement code being not just for DEBUG builds -- he gleefully clears the eighth bit in each char when inflating string to JS string. Sorry if this was known, I missed it and kept babbling as if it weren't the case. It means that our scripted uses of nsIURI attributes are not going to work with IDNS, but that's also a big "duh". So a bunch of people got together and hashed out an alternative proposal that rpotts, jband, and vidur presented: implement an nsAString subclass for UTF-8 and use that in nsIURI and LDAP. Darin'll be commenting with the details soon, but while I was here I thought I'd spread the word about the new proposal. /be
brendan: sorry, we miscommunicated. I did not mean that xpconnect clears the high bit of each 8bit char, but that it uses only the low 8 bits of each 16bit char - throwing away the high 8 bits. Really it just lets the JS engine do the conversion in its simple way by calling JS_GetStringBytes... http://lxr.mozilla.org/seamonkey/source/js/src/xpconnect/src/xpcconvert.cpp#637 XPConnect does not currently enforce in release builds that nothing other than 7bit ASCII is passed. It just punishes you (by mangling your data) if you do pass something else. This is certainly not ideal. There is an old bug of mine on dbradley's list about defining and enforcing a coherent set of data conversion rules for xpconnect - esp. in edge cases where the caller is actually passing 'illegal' values. John.
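The mangling jband describes can be shown with a minimal stand-in (the real path goes through JS_GetStringBytes; this sketch only illustrates the keep-the-low-8-bits behavior, not the engine code):

```cpp
#include <cassert>
#include <string>

// Deflate 16-bit units to 8-bit chars the "simple way": keep only the low
// 8 bits of each unit. Anything outside Latin-1 is silently corrupted --
// no error, no warning, just wrong data.
std::string DeflateUCS2(const char16_t* src)
{
    std::string out;
    for (; *src; ++src)
        out.push_back(static_cast<char>(*src & 0xFF)); // high 8 bits lost
    return out;
}
```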
Ok, so the proposal that brendan alluded to in comment #81 is the following:

1) modify nsAString to have setters that take UTF-8 strings |const char *|.
2) create a nsAString implementation (nsSharableUTF8String??) that implements nsAString with UTF-8 storage instead of UCS2 storage.
3) change all uses of |string| that really intend UTF-8 to use AString.

This solution is good for a number of reasons. First of all, it does not impact the typelibs at all, since we can use an existing string type (namely AString). This means that there would be no changes required to xpconnect. Also, callers of the nsIURI getters could now choose between the different implementations of AString and either get a UTF-8 result or a UCS2 result. It is even possible that some of the nsIURI attributes could now be "shared" instead of strdup'd. This feels like the right compromise between adding a new UTF-8 specific string type to xpidl and forcing |string| to mean UTF-8. If there is rough consensus that this is the right thing to do, then I'll file a bug and get started :-)
so jag and shaver demonstrated on IRC why this solution is not so straightforward... it goes against the idea that charset conversions should be explicit for one thing, and that can mean trouble. another problem: nsAString::BeginReading cannot be easily implemented if the storage type is UTF-8. perhaps jag and/or shaver will want to elaborate further on these problems.
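The BeginReading problem can be made concrete with a sketch. The class below is hypothetical (and its decoder handles only 1- and 2-byte sequences, to stay short); the point is that a UTF-8-backed implementation of a UCS-2 reading interface must materialize a converted 16-bit buffer behind an innocent-looking accessor.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical nsAString-like class with compact UTF-8 storage. To honor
// a BeginReading() contract that exposes 16-bit units, it has to perform
// a hidden UTF-8 -> UCS-2 conversion into a scratch buffer -- exactly the
// implicit conversion (and duplicate allocation) the design tried to avoid.
class UTF8String {
public:
    explicit UTF8String(std::string utf8) : mBytes(std::move(utf8)) {}

    const char16_t* BeginReading() {
        mScratch.clear();  // hidden conversion happens here, on every call
        for (size_t i = 0; i < mBytes.size(); ) {
            unsigned char b = mBytes[i];
            if (b < 0x80) {                       // one-byte (ASCII) char
                mScratch.push_back(b);
                i += 1;
            } else {                              // assume 110xxxxx 10xxxxxx
                unsigned char b2 = mBytes[i + 1];
                mScratch.push_back(static_cast<char16_t>(
                    ((b & 0x1F) << 6) | (b2 & 0x3F)));
                i += 2;
            }
        }
        mScratch.push_back(0);
        return mScratch.data();
    }

private:
    std::string mBytes;                     // compact UTF-8 storage
    std::basic_string<char16_t> mScratch;   // the duplicate UCS-2 copy
};
```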
See also the log I'll attach... The point of BeginReading was that once a UTF8String is passed through a nsAString interface, you don't know whether it's a UTF8String or a nsString or whatever, so it should behave like a UCS-2 string and hand you an iterator on a UCS-2 buffer there. In other words, you're going to end up with a lot of implicit conversions. The alternative is adding UTF-8 (read: const char*) methods to nsAString (and all its derivatives), which means adding support for implicit (expensive) conversions back and forth to UCS-2, which is something we've been trying to avoid in the current string design. My question is, say we accept the above as a necessary complication to avoid having to add another xpidl type and to avoid having to go through and fix our functions that can't cleanly deal with UTF-8 input ('coz they're expecting pure ASCII or some form of extended ASCII), and prevent our code from handing non-ASCII input to outside systems, how is xpconnect supposed to know that for one |AString| it's supposed to hand over the string as 16-bit UCS-2 encoded characters, and for another |AString| it's supposed to take the string and perform a UCS-2 to UTF-8 conversion on it and store it in a UTF8String? Or alternatively, assuming that the string is already in UTF-8 (but in 16-bit characters), chop off the high 8 bits and store it in a UTF8String? 
One way would be to mark the parameter somehow (which I think was rejected earlier as an option on |string|, but if possible and chosen I'd prefer the generated interface to have |nsUTF8String&| as its parameter), the other is that you don't discriminate between the two, pass UCS-2 across the XPCOM boundary, and let UTF-8 conversion happen automatically (since all nsAStrings then will have the hidden auto UCS-2 to UTF-8 conversion), in which case I suggest you set up a non-scriptable interface for the |const char*| version of these methods that takes UTF-8 encoded strings, and a scriptable version that takes UCS-2 encoded strings, which would do the conversion explicitly (or not until necessary, as suggested by shaver).
ok, i'm very discouraged by all of this. there does not seem to be a good solution for nsIURI given the current constraints of XPCOM. people at yesterday's meeting all seemed to think that nsIURI should use AString attributes (myself included). now jag tells me that AString cannot be used without there being implicit or explicit conversions to UCS-2 (and i agree), which is NOT acceptable for nsIURI. why should we bloat our URL structures to use UCS-2 just to support intl domain names? mozilla cannot even use intl domain names w/o a plugin from i-dns.net! and their plugin is not open-source! any solution to the nsIURI problem that imposes a UCS-2 encoding is an unnecessary performance burden. all we need is a way to safely transmit UTF-8 across an XPCOM border w/o converting to another character encoding. are we really happy with the fact that there is no way to do this in an efficient manner? why is it ok that one should have to convert UTF-8 to UCS-2 just to play with XPCOM? i do not think it is correct to slow down mozilla just to be conformant w/ the current rules of XPCOM when XPCOM itself could change to avoid the cost. the original bug description by dmose sums up this issue the best, and i think we've gone around in circles trying to get around it: "The short story is the nsIURI is moving in the direction of being UTF8 encoded rather than ASCII encoded. Since we need to be able to deal with this *reasonably* from JS, and as a way to allow folks to write *less bloated* scriptable interfaces, shaver suggested, with some hesitation, adding a [utf8] qualifier to xpidl." i really believe that this problem would be best served by the addition of a new string type that is UTF-8 specific. i don't care too much if this manifests itself as |char *| in C++ or if it uses some special string class (how about nsACString?). i think our C++ code can deal...
besides we already have things like NS_ConvertUCS2toUTF8, which introduce UTF-8 into our C++ code, and there is no protection against using the results from NS_ConvertUCS2toUTF8 with code that doesn't support UTF-8... grep 'printf.*UCS2toUTF8' if you're not convinced!! so what's the problem with this simple approach. if making a change to the typelibs is difficult now... i imagine it will only get more difficult later on. why not do the grunt work of expanding the typelibs now (before 1.0)? add a utf-8 string type... we have good use for it. and |string| should not imply any particular charset encoding IMO. it seems wrong to put any restrictions on |string|. anyways, we don't enforce a charset in optimized code, so what's the big deal? there are so many places in our code where |string| is used for passing around byte arrays that make use of the 8th bit... i think it is unreasonable to say that we should go now and "fix" all of these "broken" uses of |string|. there's nothing broken about them. |string| just means null terminated array of bytes, and |wstring| just means null terminated array of double-bytes... can't we just live with this definition? i'm suggesting that we really have no choice... this definition has grown up all around us... interfaces make use of it, and if mozilla is ever to reach 1.0 we can't start enforcing a particular charset on |string| now... not ASCII, not UTF8. i think we really have to just leave |string| alone... it is a null terminated array of bytes whether we like it or not. too late to change this now, but we can add another string type that would enforce a specific character encoding, namely UTF-8. why are we making our lives so difficult? let's just add a UTF-8 string type and be done with it. haven't we tested all other options? i don't see any other solution besides something along the lines of the original proposal.
darin:
> there are so many places in our code where |string| is used for passing
> around byte arrays that make use of the 8th bit...

Really? Apart from image data going through nsIInputStream, where else?

> i think it is unreasonable to say that we should go now and "fix" all of
> these "broken" uses of |string|. there's nothing broken about them.
> |string| just means null terminated array of bytes, and |wstring| just
> means null terminated array of double-bytes... can't we just live with
> this definition?

First, and perhaps least important, you are arguing against history. Jband is right that string originally and mostly meant ASCII, even if that meaning wasn't enforced well. Second, the stronger reason we can't just treat string as a nul-terminated byte string is that XPConnect and layers like it need to know how to convert string to JS string or a similar UCS2-like string (or to UTF-16, or to UCS-4, or whatever wider code another language or subsystem may require). If we merely inflate by interleaving zeroes, then we *are* saying something more about such strings: that they are ISO-Latin-1 character sequences. That contradicts your premise. We can't have it both ways: string means a character sequence, not a byte array with no zero bytes. This character sequence definition begs the question of "which charset" (as well as what about NULs, which we have answered both ways: NUL terminated unless a [size_is(aCount)] property is used). The answer has been "ASCII". I am proposing "UTF-8" without much conviction, but to cut the Gordian knot or smoke out a better plan. Say we leave string as ASCII and fix the places that want a byte array to use [array, size_is(aCount)] in octet aBuffer, or whatever. For "inplaceout" type data transfer, _a la_ Unix read(2), we'd need these methods to be [noscript] or even [notxpcom]. Ok, back to the question of UTF-8.
Can we not do as darin says, and extend typelib format to include a UTF-8 string type *now*, before 1.0, more easily than later, after 1.0? Won't it always be harder, never easier? Jband, dbradley: can you guys provide more detail on the costs (like, it would take dbradley three weeks) of adding a new string type and supporting old typelibs? /be
> Really? Apart from image data going through nsIInputStream, where else?

ok, some examples from netwerk/base/public where the ASCII requirement of |string| is currently violated:

interface nsIIDNService {
  // input is UTF-8
  string UTF8ToIDNHostName(in string input);
};

interface nsIIOService {
  // all of the uses of |string| here are actually UTF-8
  nsIURI newURI(in string aSpec, in nsIURI aBaseURI);
  nsIChannel newChannel(in string aSpec, in nsIURI aBaseURI);
  string extractScheme(in string urlString, out unsigned long schemeStartPos, out unsigned long schemeEndPos);
  string extractUrlPart(in string urlString, in short flag, out unsigned long startPos, out unsigned long endPos);
  long extractPort(in string str);
  string getURLSpecFromFile(in nsIFile file);
  void initFileFromURLSpec(in nsIFile file, in string url);
};

interface nsIProtocolHandler {
  // aSpec is UTF-8 encoded
  nsIURI newURI(in string aSpec, in nsIURI aBaseURI);
};

interface nsIProtocolProxyService {
  // host is UTF-8 encoded
  nsIProxyInfo newProxyInfo(in string type, in string host, in long port);
  void addNoProxyFor(in string host, in long port);
  void removeNoProxyFor(in string host, in long port);
  // url is UTF-8 encoded
  void configureFromPAC(in string url);
};

interface nsISocketTransportService {
  // host is UTF-8 encoded
  nsITransport createTransport(in string host, ...);
  nsITransport createTransportOfType(in string socketType, in string host, ...);
  nsITransport createTransportOfTypes(..., in string host, ...);
};

interface nsIStreamLoaderObserver {
  // result is not limited to ASCII... it is network data.
  void onStreamComplete(..., in unsigned long resultLength, [size_is(resultLength)] in string result);
};

interface nsIURIChecker {
  // uri is UTF-8 encoded
  nsIRequest asyncCheckURI(in string uri, ...);
};

interface nsIURI {
  // these are all UTF-8 encoded
  attribute string spec;
  attribute string prePath;
  attribute string host;
};

interface nsIStandardURL {
  // initialSpec is UTF-8 encoded
  void init(..., in string initialSpec, ...);
};

interface nsIURLParser {
  // spec is UTF-8 encoded
  void parseURL(in string spec, ...);
  // authority is UTF-8 encoded
  void parseAuthority(in string authority, ...);
  // serverinfo is UTF-8 encoded
  void parseServerInfo(in string serverinfo, ...);
};

here's some examples from xpcom/io where the ASCII requirement of |string| is violated:

interface nsIStringInputStream {
  void setData(in string data, in long dataLen);
  [noscript] void shareData(in string data, in long dataLen);
};

interface nsIScriptableInputStream {
  // returns null terminated data
  string read(in unsigned long aCount);
};

interface nsIOutputStream {
  // buf is just data, length is count
  unsigned long write(in string buf, in unsigned long count);
};

interface nsISearchableInputStream {
  // forString can be any null terminated hunk of bytes
  void search(in string forString, ...);
};

interface nsIFastLoadService {
  // not that fastload should include iDNS hostnames, but...
  void startMuxedDocument(..., in string aURISpec, ...);
};

interface nsIFastLoadFileControl {
  // not that fastload should include iDNS hostnames, but...
  void startMuxedDocument(..., in string aURISpec);
};

interface nsIBinaryInputStream {
  // documentation says that this is a 8-bit string
  string readStringZ();
};

interface nsIBinaryOutputStream {
  // documentation says that this is a 8-bit string
  void writeStringZ(in string aString);
};

OK, so if |string| means 7-bit ASCII, then all of these interfaces would need to change. I really don't look forward to changing all of this code before mozilla 1.0.
Why place such strict requirements on what |string| can be? I don't see what it buys us. Just let it be an opaque string of bytes. Always inflate to double byte using zero padding. If this results in something that cannot be displayed, so be it. It probably doesn't matter since the |string| value probably wasn't meant to be displayed. This will work nicely if we have a UTF-8 string type. Then we can fix the string attributes/parameters that know they will not be opaque but rather UTF-8 if 8-bit at all.
Darin, I was asking for places where we pump opaque, zero-terminated (or not if you use size_is, as I pointed out) octet arrays. The UTF-8 cases you cite, which are relevant to this bug, are not germane to the question of "should string be an octet array, whether terminated or counted?" The xpcom/io cases that matter should use octet arrays, I think. Then a caller such as JS won't get the unwanted inflation that presumes the octets are code points in a certain character set (never mind that JS-without-JS2-or-extensions will get a big fat JS Array of uint8-domain elements). If the transfer mode is not scriptable, then forget about JS -- the C++ signature will still be char *. It seems to me you are trying to simplify the work before 1.0 at the expense of the (string) => (JS or other, wider-char string) problem. But I admit, you are matching current (ab)usage and minimizing effort! You're in danger of winning right there. So perhaps the opaque octet array type is off-topic for the current bug, although I hope it's still of interest to the cc: list. In all this, you and I have not disagreed on the need for a new XPIDL and typelib string type for UTF-8, but we're not helping design or implement that. We need someone to say "it'll take me two weeks" or "I did it, here's the patch!" dbradley, jband: can you spew some words of advice? Maybe someone else will do the grunt-work (I swear, I will soon if I get fed up enough with this logjam). /be
brendan: right, my examples do not exactly answer your question, but i needed to spew out that list because i think a lot of folks don't appreciate how many interfaces actually "mis-use" |string|. your question just prompted me to spew out the list of such instances. but, my point stands unanswered: why is it important to be strict about the character encoding implied by |string| attributes? why not just let it be the case that playing w/ the 8th bit leaves you to fend for yourself... that XPCOM will not guarantee that the string will be interpreted as anything but an array of bytes (meaning that |string| to double-byte representation would result in 0 padding, and double-byte to |string| would result in truncation of the upper 8 bits). this is what we currently do (with the exception of a debug warning), and i think it is OK. it works if the data is 7-bit ASCII... ie. the result in JS land can be displayed... can be matched to a charset. but, when people play with the 8th bit they are on their own. would this really be too flexible? do we need the added structure of requiring |string| to be 7-bit ASCII? i don't see why. moving to the octet solution requires support for inplaceout, which we don't have. moreover it'll slow down JS code (like chatzilla) which doesn't need to be any slower. JS2 is not here yet, so we need to deal with performance issues given the JS that we got. after all, if we've gotten this far without enforcing |string| == 7-bit ASCII, then what's the real problem? anyways, back to the discussion of UTF-8... jag has said that he is willing to work on and support a separate UTF-8 string hierarchy. he'll get some help from me, but as for the required changes to XPIDL to support a UTF-8 string class... i'm obviously not going to be of much help there.
Will the new variant class need to handle UTF8 strings or can it get away with ignoring it?
Also would adding a type to the idl affect/break the Python and maybe Perl connects?
It may possibly break Python :( I will ensure that whatever is decided will work eventually, though :)
I want to make sure I understand the latest direction. We're looking at building a new string class to support UTF8. We need a new type in the idl that will map to this. We're willing to break typelib compatibility. XPConnect conversion routines will have to be adjusted to convert this type. Python, nsIVariant, and Perl may be affected. At this point I have a pretty full list of 0.9.9 bugs, plus I'm tasked with a non-mozilla project as well. I can get this done, but I suspect some of my 0.9.9 bugs may have to be sacrificed.
Status: REOPENED → ASSIGNED
David, I can help out. If you list out the to do items, I'll go hack on the code. Hopefully, you won't delay your 0.9.9 bugs and we'll get this in too.
What are we going to call the string class and the type in IDL? Is UTF8String for IDL ok?
UTF8String sounds good to me as the name. I'm psyched that nisheeth can help -- big thanks to him! /be
UTF8String or AUTF8String... given that we have AString, shouldn't it be AUTF8String? however, i'd have to say that UTF8String is slightly easier to read.
The 'A' in "AString" is for "Abstract". If UTF8String is a concrete type implementing (derived from) nsAString, then I see no need for an 'A'. /be
concrete class, really? i didn't think we'd want to pass concrete classes via XPCOM. shouldn't it be an abstract class, with virtual functions only? so as to hide the implementation, etc, etc.? maybe i'm missing something.
On the one hand, we don't bother to hide the "implementation" (memory layout) of string or wstring right now, so why do we need it for utf8string? On the other, we apparently needed a bunch of options for AString, so we only specified the abstract base. Do we not need all these options (Sliding, Sharing, COW, whatever) for UTF-8 encoded stuff? If so, I think we need UTF8String to be abstract. If not, why not just specify it as a sequence of bytes?
I mentioned concrete because we want static type checking benefits: we don't want the compiler to silently swallow lots of hidden runtime conversions to and from UTF-8. That's an argument against implementing nsAString with UTF-8 storage and declaring victory in this bug. But if we support nsACString in typelibs, and make a concrete class nsUTF8String that implements nsACString, do we need a more specific IDL type? The issue is what .Length() and iteration over elements numbered by [0, length) count -- bytes or characters? I understand current nsACString users want Length to count bytes. But UTF8String users would want Length to count characters, and string users are too comfortable already with calling Length (to the point of ignoring IsEmpty). Jag, please advise. If we do need a more specific IDL type for UTF-8, I don't see the need to put a gratuitous 'A' in its name. We were talking about all this just now, and I believe a consensus in favor of UTF-8 first class support in typelibs and strings emerged. We (I especially) feel that UTF-8 should have been used in the ancient days instead of PRUnichar vectors (wstring, "UCS-2" but not really). We're facing plane 1 character codes coming in via certain locales (Korean, I hear), and if we have to switch to an encoding that doesn't store characters in equal-sized units, we might as well use UTF-8 and save space in this hemisphere. /be
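The Length() ambiguity above can be made concrete. Both helpers below are illustrative (not the string classes' actual API): for the same UTF-8 buffer, a byte count and a character (code point) count disagree as soon as any non-ASCII character appears, so whichever one Length() returns will surprise some callers.

```cpp
#include <cassert>
#include <cstring>

// Length in bytes: what current nsACString users expect.
size_t ByteLength(const char* utf8) { return std::strlen(utf8); }

// Length in characters (code points): count every byte that is NOT a
// UTF-8 continuation byte (continuation bytes look like 10xxxxxx).
size_t CharLength(const char* utf8)
{
    size_t n = 0;
    for (const unsigned char* p =
             reinterpret_cast<const unsigned char*>(utf8); *p; ++p) {
        if ((*p & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```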
"Suicide is Painless" Words by Mike Altman Music by Johnny Mandel Through early morning fog I see Visions of the things to be The pains that are withheld for me I realize and I can see that Chorus: Suicide is painless It brings on many changes And I can take or leave it if I please. I try to find a way to make All our little joys relate Without that ever-present hate But now I know that it's too late, and (chorus) The game of life is hard to play I'm going to lose it anyway The losing card I'll someday lay And this is all I have to say, that (chorus) The only way to win is cheat And lay it down before I'm beat And to another give a seat For that's the only painless feat, cause (chorus) The sword of time will pierce our skins It doesn't hurt when it begins But as it works its way on in The pain grows stronger - watch it grin (chorus) A brave man once requested me To answer questions that are key Is it to be or not to be? And I replied, "Oh why ask me?", cause (chorus) And you can do the same thing if you please. ====== Let's choose carefully :-) -- rick
OK, just so that everyone is on the same page, here's what I plan to do:

a) Add two new types to XPIDL: UTF8String and CString. (I know there's a holy war on to decide what to name them. I'm gonna avoid the "A" prefix for now. I'll change it to the consensus demand before I land the patch)
b) In XPConnect, hook up UTF8String and CString types to the native nsACString object for now.
c) Add a cmd line option to XPIDL that generates old style type libs.
d) Make XPIDL barf at IDL that uses the new types if the old style type lib cmd line option is set.
e) Jag will create a new nsUTF8String class that inherits from the ACString abstract base class. Once that is done, I will change XPConnect to hook up the UTF8String type to native nsUTF8String objects.

Sounds good?
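Under step (a), interface authors could declare UTF-8 and raw 8-bit parameters explicitly. A sketch of the intended usage, not the final syntax (the bug summary shows the names eventually settled as ACString, AUTF8String, and AString):

```idl
interface nsIURI : nsISupports
{
    // carried as UTF-8 across the XPCOM boundary; no forced
    // inflation to UCS2, and scripted callers get real decoding
    attribute UTF8String spec;
    attribute UTF8String host;

    // single-byte data with no charset implied
    attribute CString userPass;
};
```

With these types in the typelib, XPConnect can pick the right conversion per parameter instead of guessing what a |string| holds.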
For (e), shouldn't it be distinct from both nsACString and nsAString (i.e, nsAUTF8String or something of the sort)?
weren't people concerned with adding lots of new types? (not that i don't think we need them)
dbaron: that was actually the plan. Make a new nsAUTF8String line parallel to nsAString and nsACString.
We should also bump the version number as well -> http://lxr.mozilla.org/seamonkey/source/xpcom/typelib/xpt/public/xpt_struct.h#118 I think we should bump the major, the interface manager will then refuse to process the file. This will keep a new component from causing problems with an older version of Netscape. Also read the comment below that and bump the incompatible value as well.
dbradley: Brendan, Nisheeth, and I convinced ourselves that we could get away with only incrementing the minor version number. See the discussion in the first comment of bug 65762. We are flirting with tripping up much older browsers by doing this - especially if we add more than one type.

Except for the issue with *real* old browsers with the shorter lookup table in xpcconvert.cpp, we are free (as far as xpconnect is concerned) to add more type tags. XPConnect will simply fail to reflect methods with unsupported types. We *may* be playing a little fast and loose here, but my hope is to avoid crossing the XPT_MAJOR_INCOMPATIBLE_VERSION threshold until we find a compelling reason to actually change the pattern of bytes in the structs that get deserialized from the typelibs. At that point the problem would be that the old readers would literally be unable to correctly read the files, and not just a problem of types that clients of the reader would be unable to proxy. Certainly other classes of change might come along that might make us cross that threshold. But we are thinking that we can get away with not doing it now.

Also, before we go hog wild adding new types, let's not forget that there is still a very short list of available type tags, and that adding *any* types adds work for supporting those types at various places in our xpconnect and proxy code. Also, as Brendan pointed out, the last available tag should absolutely be reserved for future use when we (may) reach the point where some type tags will require an additional byte. That tag value would allow us to flag that case while still leaving the existing tags unchanged. (Of course, that *would* be an incompatible version change - and would be hell on typeinfo clients too!)

As I said in that other bug, the tools (the xpidl compiler *and* xpt_link) should support creating the current xpt version 1.1 files on request and fail with an error if non-1.1 types are used. I think this is necessary.
Though, I suspect, some plugin vendors (etc.) who might have use of this feature will likely fail to discover it and will be blindly playing hit and miss depending on what types they happen to attempt to pass through their interfaces.
Taking...
Assignee: dbradley → nisheeth
Status: ASSIGNED → NEW
nisheeth: what's the ETA on this one looking like? i'd like to squeeze the nsIURI changes in by 0.9.9 if possible.
I expect to land this by Wednesday (Feb 13) of next week. Would that work for you?
sure, that'll be fine.
Blocks: 124042
Just to clarify. I am only gonna get the XPIDL and XPConnect changes landed. Jag or some string guy needs to do the new string classes needed for the world to be happy.
No longer blocks: 124042
Also want to acknowledge David Bradley's contribution. Even though he is bogged down with 0.9.9 bugs, he's sent me the XPIDL/XPConnect changes for adding UTF8String. Hopefully, I'm just gonna have to replicate them (famous last words! :)) for CString and then test, test, test.
Blocks: 124042
here are my comments:
1. I think adding a new UTF8String type is better than changing an existing type, so we are on the right path.
2. Even if we add UTF8String, I want to make sure all the callers are careful when they use such a data type.
Darin, Rick, Jag and I had a meeting today to firm up what needs to happen here. Here's what we decided:

1) AUTF8String and ACString in XPIDL will map to nsACString objects in the native world.
2) For AUTF8Strings, XPConnect will convert JS data from UCS2 to UTF8 and back.
3) For ACStrings, XPConnect will convert JS data from UCS2 to ASCII and back.
4) We will not implement a new nsAUTF8String class. Until now, abstract string classes have only specified the storage size of the string, not enforced the encoding. nsACString and nsAString dictate single-byte and double-byte storage respectively. We do not want to break this model and create a new abstract string class that enforces encoding. We will only add the notion of a UTF8String to XPIDL so that XPConnect can convert strings appropriately. In the future, we can get the benefit of type checking, admittedly at run time not compile time, by implementing a concrete class, nsUTF8String, that enforces the encoding behind the nsACString interface. The fact that we need to do less work this way is also a big factor in making this decision.

Comments?
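The UCS-2 to UTF-8 transcoding that XPConnect would perform for AUTF8Strings can be sketched as follows. This is a hypothetical, BMP-only illustration (no surrogate-pair handling), not the actual XPConnect conversion code:

```cpp
#include <string>

// Encode a UCS-2 (BMP-only) string as UTF-8: 1 byte below U+0080,
// 2 bytes below U+0800, otherwise 3 bytes.
std::string UCS2ToUTF8(const std::u16string& in)
{
    std::string out;
    for (char16_t u : in) {
        if (u < 0x80) {
            out += static_cast<char>(u);
        } else if (u < 0x800) {
            out += static_cast<char>(0xC0 | (u >> 6));
            out += static_cast<char>(0x80 | (u & 0x3F));
        } else {
            out += static_cast<char>(0xE0 | (u >> 12));
            out += static_cast<char>(0x80 | ((u >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (u & 0x3F));
        }
    }
    return out;
}
```

The reverse direction (UTF-8 back to UCS2 for JS) is the mirror image; the real code must also deal with surrogates once plane 1 characters show up.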
Status: NEW → ASSIGNED
> 3) For ACStrings, XPConnect will convert JS Data from UCS2 to ASCII and back.

also, XPConnect will preserve the 8th bit as it currently does for |string|.
So in other words, for ACStrings, XPConnect will convert JS data between UCS2 and ISO-8859-1.
Um, this has nothing to do with ISO-8859-1. I believe the conversion in XPConnect will be UCS2 -> UTF8 -> UCS2...
sure, that would be one interpretation... but i really just meant that ACString should be sufficient for passing raw data provided both ends understand that that's what they're getting ;-)
If you don't expose a new interface and implementation for UTF8 now, you'll probably never be able to roll one out without breaking the world. You're giving up on the whole idea of C++ static type checking to distinguish between ASCII v. UTF8 expectations? ACString is going to support embedded nulls? The fact that xpconnect does not enforce the 0 in the high bit of any 8-bit string types does not mean that it *shouldn't* do that enforcement. What do people *really* want the rules to be for the various string types?
jband: darin sought my comments on the plan in comment #118, and I lobbied him and jag strongly in favor of static type checking. Jag was game. We didn't want to rock the boat too much in the bug yet, but since you raised the same issue, I thought I'd give a brief update. For details, jag should say more. /be
jband: 'You're giving up on the whole idea of C++ static type checking to distinguish between ASCII v. UTF8 expectations?'

yes :-( at the C++ level i've given up on trying to maintain static type checking. Part of the reason is simply time pressure -- it requires more work to create a C++ abstract UTF8 interface and concrete implementation... the other part of the reason is that treating aCString as an encoding-free representation of the data is (actually) following the same design pattern as was chosen for aString.

Currently, we are using aString to represent *both* UCS-2 and UTF-16 data. Unfortunately, at the IDL level we do *not* have a UTF16 string type... so xpconnect is potentially broken :-( this is unfortunate, but does not alter the fact that we have chosen aString to be a representation of 16-bit storage units (without explicitly enforcing an encoding)... so, at the C++ interface level, we are expanding that notion by saying that aCString represents 8-bit storage units without specifying a particular encoding.

I know what you're going to say -- this is just BS. and i agree with you :-) but at this point, we are in so deep and so screwed up, that i think this is the best that we can hope for (at this point in time)... at least, at the IDL level we are providing encoding information - allowing native <-> JS string conversions to work. anything more will require MAJOR work on the codebase :-( and we just don't have the time/resources to make it happen :-(

the one issue that we may want to address is whether we should *also* introduce a UTF-16 string type at the IDL level. This would allow us to properly encode/decode UTF16 strings at the XPConnect boundary (in the future)... Of course the C++ interface would still translate to aString :-(

i think that everyone will agree that this is not an 'ideal' solution... i just need to decide if this approach solves 'enough' of the problem.

-- rick
*i* just need to decide if this approach solves 'enough' of the problem. should be changed to: *everyone* here needs to decide if this approach solves 'enough' of the problem. -- rick
On ACString: yes, it supports embedded nulls and is really just a container of 8-bit data, though the methods on it heavily lean towards interpreting that data as strings ;-) I think we should do a C++ nsAUTF8String branch, where we could just copy the nsACString branch and do appropriate renames, which shouldn't take more than a few hours, at least for 0.9.9, and then decide what we need to do to give it more value (other than being able to use the type system to discern between any 8-bit string type and UTF8 strings). The 1.0 cycle should give us plenty of time for that :-) Yes, that all sounds rather ugly, but it will allow us to move forward without painting ourselves into a corner. I don't think we need to make all of our code use this nsAUTF8String type immediately, and for darin it'll be just as much work to convert to AUTF8String as it would be to convert to ACString.
Rick: I fear what happens when we change the meaning of single-byte within our codebase, without compiler/xpconnect assistance in tracking down mismatched expectations. Are we going to be repaying that effort savings with interest, tracking down bugs that only occur on crazy international pages, or when people have non-ASCII characters in passwords, etc.? Wasn't that the whole point behind not just changing the |string| documentation to permit UTF-8?
This begs the question of why people shouldn't be able to have non-ASCII characters in passwords.
I thought we permitted non-ASCII characters in passwords now. A patch I recently reviewed certainly seemed to think so, since it was using |wstring| for password values in interfaces. But I can see us wanting to use UTF8String for those interfaces, especially in places like IMAP where the on-the-wire encoding will need to be UTF8 anyway.
rick: brendan did a pretty good job of convincing me that we shouldn't miss this albeit rushed opportunity to somewhat fix our string story... that we should try if possible to work in a way to have static type checking. jag says he'll be able to make it happen... the scariest thing is perhaps the fact that nsAUTF8String will also need to be frozen for mozilla 1.0 :-/
In comment #130, it is mentioned: > I thought we permitted non-ASCII characters in passwords now. > A patch I recently reviewed certainly seemed to think so, I would like to see us support non-ASCII characters in passwords, and if Mozilla is supporting them, that is good. Note that some protocols such as HTTP have a recommended way of dealing with such password input. (Cf. http://www.cis.ohio-state.edu/cs/Services/rfc/rfc-text/rfc2617.txt)
Who said any nsA*String is frozen for 1.0? If they are going to be, we've got some work to do.
dbaron: if nsIURI is to be frozen and if its attributes are of type nsAUTF8String & nsACString, then it seems like we have no choice but to also freeze these string interfaces. i thought this situation was well understood... after all, isn't nsAString freezing? as i understood it, nsAString is used on some interfaces that are frozen or are going to be frozen.
If nsAString is going to be frozen for 1.0, then there are some changes that need to be made first. I was under the impression that it was considered acceptable not to freeze it.
Rick mentioned both UCS2 and UTF16. How is the code in the tree split between these two?
hey dbaron, unfortunately since we are freezing the DOM interfaces we need to freeze nsAString as well... since it is used extensively in these interfaces... similarly, since nsIURI is freezing, we'll need to freeze nsAUTF8String too. Currently, no public interfaces expose the nsACString interface so this one is still up in the air ;-) but realistically, if AString and AUTF8String freeze, then ACString may end up being frozen too... -- rick
All I've heard about freezing the DOM interfaces is bug 110795 comment 11. Who's willing to make the changes needed to nsAString needed before we freeze it? (At a minimum we need the changes that are needed to implement nsStackString (the nsAutoString in the new world), PromiseSingleFragmentString (15 minutes of work), API cleanup for Mid/Right/Left, and that's just what I thought of off the top of my head without even looking at the header file.) Anyone who thinks nsAString needs to be frozen should file a bug on the string component requesting that it happen. (And please cc: me, and/or note the bug here.)
rick: nsIURI will be using nsACString as well for the ASCII compatible versions of the spec and host attributes. this is not necessary i guess since it could just use |string| for these, but i would prefer |ACString| since it would result in something potentially more efficient.
The bad news is that I couldn't meet tonight's deadline to land my changes. The good news is that there is enough done to unblock Darin.

Darin, the patch I'm about to attach should get you going with your nsIURI related changes. With it in your tree, you should be able to define 'AUTF8String' and 'ACString' types in your IDL files. These types will map into 'nsACString' in the generated header file and XPConnect will perform the appropriate string conversions for them. Just as a sanity check, run the following test to make sure things work for you:

1) Run dist/bin/xpcshell and pass in the JS file "js/src/xpconnect/tests/js/old/xpctest_echo.js" as an argument.
2) The JS file does a bunch of tests and prints the results to the console. Make sure that all lines that begin with "In2OutOneUTF8String" and "EchoIn2OutOneUTF8String" end with "passed".

If the above works, you are all set to start using these new types in IDL and revamping your nsIURI interfaces.

Here's what I plan to do next *before* I land:

1) Make an XPIDL compiler flag that forces old-style typelib generation and reports errors when new types are used.
2) Add a similar flag to the XPT linker that forces it to report errors if it comes across an XPT file that uses these new types.

I will post the patches for these two changes to this bug.

Here's what I will do *after* I land:

1) Change nsIVariant and XPCOM's proxy event mechanism to support these new string types. I have filed bug 125465 and bug 125466 on myself to track these two changes respectively. I'll post patches on those bugs.

General question: The DOMString type uses an XPCVoidableString class that inherits from nsAutoString and adds the ability to store a JS null value. So, a boolean comparison of a null DOMString with the null JS value yields TRUE. When the DOMString is passed into the JS print() method, it prints out "null". Do we need to replicate this behavior for CStrings and UTF8Strings?
jband, dbradley, shaver, I'd like you guys to r/sr this patch and the upcoming compiler/linker patches. You can start on this one now or wait until all the patches are on here. Up to you... :-) An extra but currently unavoidable copy of UTF8Strings is made in the XPConnect conversion routines. Jag is aware of this problem and will fix them when he lands his string changes. CString conversions are fine, AFAIK.
Attachment #66086 - Attachment is obsolete: true
Just a blind copy of nsACString and kids so you can see the ramifications of this addition and the changes needed to ns{Reading,Writing}Iterator.
I've made and tested XPIDL and XPTLink changes that generate typelibs of a version specified on the command line. I'm about to attach a unified patch. Now, I really am ready for a code review. jband, dbradley, shaver, please jump in! Thanks!
This patch subsumes attachment 69457 [details] [diff] [review] and adds the [-t output version number] command line argument to the XPIDL compiler and the XPT Linker. When the output version number is specified, the compiler checks that the input IDL file only uses constructs that are supported in that version. The linker checks that all typelib files it reads are of the version specified or below. Both the compiler and the linker report errors and abort if these checks fail.
Attachment #69457 - Attachment is obsolete: true
We'd hold mozilla1.0 for this, but I think that means holding for followup bugfixes to the main patches that land in 0.9.9. The main patches need talkback and ~300-400K downloads' worth of testing. /be
Blocks: 122050
Keywords: mozilla0.9.9
I forgot to add this earlier. Thanks a lot to David Bradley, Jag, and John Bandhauer for their help on this bug!
This is a variation on just the xpcom/typelib part of Nisheeth's patch. It changes the following:
- don't use 3 different schemes to parse the version string.
- combine that parsing into shared code in libxpt
- don't declare vars in library code and implement them in exes
- don't use // comments in .c files
- don't have big comment blocks that go past column 80
- allow -t to specify *current* version

Nisheeth, I didn't really test this. If you like these changes then please try them using the tests you have. I'll look at the xpconnect part of the changes soon.
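The shared version-string parsing mentioned above can be as simple as the following sketch (ParseVersionString is an invented name; the actual libxpt code may differ):

```cpp
#include <cstdio>

// Parse a "major.minor" typelib version string such as "1.1" or "1.2".
// Returns true and fills in the two components on success.
bool ParseVersionString(const char* str, int* major, int* minor)
{
    return str && std::sscanf(str, "%d.%d", major, minor) == 2;
}
```

The xpidl compiler's -t option and xpt_link would then compare the parsed pair against the versions they know how to emit, failing with an error otherwise.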
I assume the OS/2 changes are needed. I notice that your lines use tab chars and the lines in that file use spaces. I have no idea if it matters, but you ought to not use the tabs just in case.

In xpconnect we use 'if(' not 'if ('. Please fix those. Please wrap lines longer than 80 columns.

In XPCConvert::NativeData2JS the T_UTF8STRING case looks reasonable. The T_CSTRING case makes me uncomfortable. I like avoiding the extra string copy, but I don't like the implicit assumption that the JS string system uses the same allocator as the xpcom global allocator. I'd rather see us do a one-time registration of a finalizer using JS_AddExternalStringFinalizer and then use JS_NewExternalString in this T_CSTRING string creation case.

For JSData2Native I don't think you gain anything by trying to factor out GetJSStringInternals because... You tried to copy the T_DOMSTRING stuff here, but that stuff has weird behavior specifically for the DOM. I think you should follow the pattern used for the other string types here and just convert void and null jsvals to nsnull and be done with them. Also, in the T_CSTRING case you can use JS_GetStringBytes to get the 8-bit char arrays that the JS engine caches for us. This saves us a conversion when the same string is passed through more than once.

Also, is the plan to land all of this using nsACString for both types? Or is AUTF8String going in too?

Please address these issues (and whatever you want from my previous attachment) and post us another patch when you have it. Thanks! I'll look at doing the nsIVariant changes.
Looks like John got most of the big things. Some minor issues:

I'm wondering if we really need the create_old_typelib flag used outside of the parameter checking for additional -t's. It eliminates some version checks, but those don't look too costly. From what I can tell, all the other checks for create_old_typelib can be eliminated safely and would make the code a little simpler.

The outFileName declared in the main function in xpt_link.c should be const char *.

You commented out the Foo attribute in interface nsITestXPCFoo in xpctest.idl. Just curious if that was accidental or not.
One more issue came up in a short offline discussion between nisheeth, jband and myself.

We currently have special code in XPConnect for string conversions from JS to native DOMStrings that special-cases |undefined| for backwards compatibility in the DOM. Currently, passing |undefined| from JS as a DOMString into native code will convert the |undefined| value into the string "undefined". This is different from how we're treating |undefined| for string and wstring (where |undefined| is treated the same as |null|), and it will be different from how we treat |undefined| in AUTF8String and ACString.

My suggestion for avoiding this special case for the probably most commonly used (non-DOM) XPIDL string type is to separate DOMString from AString. Now that we're extending the typelibs anyway, taking this additional step for consistency between non-DOM strings would IMO be worth the trouble, and others seemed to agree with me. DOMString is after all the special case from an XPConnect point of view, not the other way around. The amount of changes needed to do this should be trivial, since the treatment of DOMString would remain almost identical to the treatment of AString, the only difference being how we treat conversion from JS |undefined| to native strings.

IOW, let's add one more string type, AString (!= DOMString) :-)
WOOT! Please do :-)
Sorry for the delay. I was only able to start looking at jband and dbradley's comments today (Tuesday) afternoon.

Replies to jband's comments:
--

Your patch to xpcom/typelib looks good except that <math.h> was included in xpt_link.c. Now that you've done away with the math-based parsing, we no longer need to include that file.

> Also, is the plan to land all of this using nsACString for both types? Or is
> AUTF8String going in too?

The plan was to land using nsACString for both types. When Jag lands nsUTF8String, he will change XPConnect to use it for the AUTF8String type. But the final decision on this depends on how far along Jag and Darin think they are with the nsUTF8String class and nsIURI interface changes respectively. Darin, Jag, please add your 2 cents about what you think the landing strategy for our changes should be. Thanks!

> For JSData2Native I don't think you gain anything by trying to factor out the
> GetJSStringInternals because... You tried to copy the T_DOMSTRING stuff here,
> but that stuff has weird behavior specifically for the DOM. I think you should
> follow the pattern used for the other string types here and just convert void
> and null jsvals to nsnull and be done with them. Also, because in the T_CSTRING
> case you can use JS_GetStringBytes to get the 8-bit char arrays that the JS
> engine caches for us. This saves us a conversion when the same string is passed
> through more than once.

After consulting with Johnny and other DOM hackers, we had decided to make the null and void JS string cases behave the same across DOMString, UTF8String, and CString. That is why you see the DOMString code copied for the UTF8String and CString cases. Bug 125481 (Make ACStrings and AUTF8Strings voidable) was filed with the same strategy in mind. In a followup discussion with jband and Johnny, we changed our minds on this. We will now make UTF8Strings and CStrings behave like the char and wchar strings.
The patch I'm about to attach reflects this change. Bug 125481 now becomes invalid. I've used JS_GetStringBytes() as you suggested to save a conversion in the CString case.

Replies to dbradley's comments:
--

> You commented out the Foo attribute in interface nsITestXPCFoo in xpctest.idl.
> Just curious if that was accidental or not.

Yup, this was accidental. I caught it when I did a clobber build and fixed it, but didn't attach a new patch just for the one-line change.

> I'm wondering if we really need the create_old_typelib flag used outside of the
> parameter checking for additional -t's. It eliminates some version checks, but
> those don't look too costly. From what I can tell, all the other checks for
> create_old_typelib can be eliminated safely and would make the code a little
> simpler.

I agree that create_old_typelib isn't necessary anymore. I've left it as a local variable in xpidl.c and removed it from other places.

> The outFileName declared in the main function in xpt_link.c should be const
> char *.

Done!

Miscellaneous comments:
--

- I've line-wrapped better, fixed if-statement whitespace in XPConnect, and converted tabs to spaces in the OS/2 .asm file.
- I've used JS_AddExternalStringFinalizer() for the case where the unicode string buffer is created by XPCOM.
Attachment #69802 - Attachment is obsolete: true
Attachment #70041 - Attachment is obsolete: true
In comment 153, s/bug 125481/bug 125841. Darin, how are your nsIURI changes coming along? The plan is still to land our changes together for 0.9.9 without waiting for nsUTF8String, right? jband/dbradley/shaver, please review attachment 70474 [details] [diff] [review]. Thanks!
Yes, that's the plan as far as I know.
OK, I think this patch addresses everything so far except Johnny's request to add a new string type. Please look through this patch to see if anything bad jumps out at you. I will go through the code to add a new string type tomorrow. You guys have one day to object! :-)

Once I'm done there will be four IDL string types (among others):

DOMString - handles void (undefined) and null JS string values in a special DOM-compatible way as described in comment 151 from Johnny.
AString, ACString, and AUTF8String - will treat null and void JS string values as empty strings.
Attachment #70474 - Attachment is obsolete: true
Um, I had the |undefined| conversion a bit wrong in my mind: non-DOM strings throw an exception when asked to convert |undefined| into a string value, whereas conversion into a DOMString yields the string "undefined". I vote for parity between all the non-DOMString string types.
jst: You mean when converting from JS to native, right? For the normal non-domstring types JSVAL_VOID *does* convert to null. It is only in the (unlikely) case of a 'out' [w]string declared as a C++ reference that this is an error - only because it is not legal for a C++ reference to hold null. http://lxr.mozilla.org/seamonkey/source/js/src/xpconnect/src/xpcconvert.cpp#690
Um, my previous comment was based on what Nisheeth told me happens when we try to convert |undefined| in JS into a native string or wstring, maybe I didn't understand what he was testing... Either way, my vote is for consistency between the non-DOMString string types...
In attachment 70479 [details] [diff] [review], xpt_link.c still includes <math.h> even though it doesn't need to. Please ignore that nit because I've fixed it in my local build. The rest of the patch looks good. Hang in there for a final patch that will add an additional string type to IDL.
I should have realized this earlier but better late than never. Just to clarify: I am not really adding a new string type (AString) in addition to DOMString, AUTF8String, and ACString. AString was already an accepted string type in XPIDL but it was mapped behind the scenes to do *exactly* what DOMStrings do. What I am doing is changing things so that AStrings behave *mostly* like DOMStrings except that they assign empty strings into themselves when null and void jsvals are passed into them. So, please ignore my earlier posts about a whole new string type.
Sure, you are not adding *another* new string C++ type for AString (abstract or otherwise). But you *are* adding a new xpidl type.
Thanks for the clarification, jband. Here's the final patch with support for the new AString XPIDL type. jband, shaver, dbradley are requested to r/sr. Thanks!
Attachment #70479 - Attachment is obsolete: true
Just a nit, if the dipper/idl type is "astring", shouldn't it be "acstring" and "autf8string" too for consistency?
Looks good to me. We need someone with a good handle on strings to make sure everything string wise is correct. Might want someone familar with the OS/2 assembler code take a quick look at the small addition. Some minor cleanup issues: Looks like you interjected some additional spaces here in the xpidl.c - break; - + break; + In XPCConvert::JSData2Native the jschar* chars = nsnull; initialization is unnecessary in the T_UTF8STRING case. Looks like you may have some tab/space issues in xpctest_echo.js. Braces and code look a little misaligned.
Thanks for the comments, David. I'll make the changes you suggest in my build. I don't know anyone familiar with OS/2 assembler code off the top of my head. Anyone out there know anyone? Jag, would you please give me a code review of the string related code in the XPConnect conversion routines? Thanks! Jband, would you please sr? Thanks!
Attached patch Version 2.0 of final patch! :-) (obsolete) — Splinter Review
Jband pointed out a performance problem in JSData2Native() which is fixed in this patch. Dbradley's comments are also addressed. David, would you please give this patch a review stamp also? You need only look at the xpcconvert.cpp diffs. The rest of the files haven't changed since the last patch. Jag, please review the string code in the conversion routines in xpcconvert.cpp. Jband, please sr. Thanks!
Attachment #70643 - Attachment is obsolete: true
Rearranged code in the UTF8String and CString cases inside JSData2Native() in xpcconvert.cpp per jband's request. Rest of the patch is the same as earlier. David, please r=. Jband, please sr=. Thanks!
Attachment #71031 - Attachment is obsolete: true
Comment on attachment 71051 [details] [diff] [review]
version 3.0 of patch (I've given up on calling it final.)

sr=jband. This is the one. I agree that we should land it. Note that I will be very annoyed if we *don't* follow through with all of the following:
- utf8 string class.
- changes to this to use that class.
- changes to nsVariant.

This change by itself leaves things in a state that we should pass through quickly. But let's get it in and get on with things.
Attachment #71051 - Flags: superreview+
Comment on attachment 71051 [details] [diff] [review]
version 3.0 of patch (I've given up on calling it final.)

I thought the non-DOM strings were going to do what |string| and |wstring| do, that is to say, no support for null/void/undefined converting from or to.

>Index: js/src/xpconnect/src/xpcconvert.cpp
>===================================================================
>RCS file: /cvsroot/mozilla/js/src/xpconnect/src/xpcconvert.cpp,v
>retrieving revision 1.73
>diff -u -r1.73 xpcconvert.cpp
>--- js/src/xpconnect/src/xpcconvert.cpp 5 Feb 2002 06:45:02 -0000 1.73
>+++ js/src/xpconnect/src/xpcconvert.cpp 23 Feb 2002 06:04:30 -0000
>@@ -623,7 +740,7 @@
>         {
>             nsAWritableString* ws = *((nsAWritableString**)d);
>
>-            if(JSVAL_IS_NULL(s))
>+            if(JSVAL_IS_NULL(s) || (!isDOMString && JSVAL_IS_VOID(s)))
>             {
>                 ws->Truncate();
>                 ws->SetIsVoid(PR_TRUE);
>@@ -726,6 +843,116 @@
>             return JS_TRUE;
>         }
>
>+        case nsXPTType::T_UTF8STRING:
>+        {
>+            jschar* chars;
>+            PRUint32 length;
>+            JSString* str;
>+
>+            if(JSVAL_IS_NULL(s) || JSVAL_IS_VOID(s))
>+            {
>+                if(useAllocator)
>+                {
>+                    nsACString *rs = new nsCString();
>+                    if(!rs)
>+                        return JS_FALSE;
>+
>+                    rs->SetIsVoid(PR_TRUE);

nsCString's SetIsVoid does nothing, there's no point in setting this.

>+                    *((nsACString**)d) = rs;
>+                }
>+                else
>+                {
>+                    nsCString* rs = *((nsCString**)d);
>+                    rs->Truncate();
>+                    rs->SetIsVoid(PR_TRUE);

Same here and further below.

r=jag on the string changes, but check with jst on the use of DOMString-like IsVoid.
Jag, we will have the notion of "voidness" for AStrings, CStrings and UTF8Strings (see bug 125841). You are right that SetIsVoid() doesn't do anything for CStrings and UTF8Strings currently. The plan is to implement something similar to XPCVoidableString for CStrings and UTF8Strings that *does* implement "voidness". The calls to SetIsVoid() are no-ops right now but will do real work in the future. Thanks for the review!
Comment on attachment 71051 [details] [diff] [review] version 3.0 of patch (I've given up on calling it final.) r=dbradley
Attachment #71051 - Flags: review+
when i tried the very first version of this patch, i had to explicitly add

%{C++
#include "nsAString.h"
%}

at the top of each .idl file that referenced AUTF8String. is this still required?
Darin: that has long been the case for AString idl users. It is debatable whether that block should be added to the idl or if the #include'rs of the generated .h should be expected to pre-#include the string header. Nevertheless, there is no automatic generation of that #include line. There is a bug around suggesting that this would be nice, but it requires xpidl hacking far beyond the scope of this bug.
Let #includers of those headers #include "nsAString.h". Scott Meyers says so, if you need a reason!
Comment on attachment 71051 [details] [diff] [review] version 3.0 of patch (I've given up on calling it final.) a=shaver for 0.9.9, but I won't cry if you take out those %{C++%} header warts. Nice work, all.
Attachment #71051 - Flags: approval+
The fix is in. Finally! :-) I've updated the status summary to reflect reality.
Status: ASSIGNED → RESOLVED
Closed: 24 years ago24 years ago
Resolution: --- → FIXED
Summary: 8-bit UTF8-capable string type for XPIDL → Add ACString, AUTF8String, and AString to XPIDL
jband, shaver, "scott-meyers": so adding #include "nsAString.h" to nsrootidl.idl is bad because it causes everyone to include that file? How about at least adding class nsAString; class nsACString; to nsrootidl.idl?
Summary: Add ACString, AUTF8String, and AString to XPIDL → 8-bit UTF8-capable string type for XPIDL
Summary: 8-bit UTF8-capable string type for XPIDL → Add ACString, AUTF8String, AString types to XPIDL
>sr=jband This is the one. I agree that we should land it. Note that I will be >very annoyed if we *don't* follow through with any of the following: >- utf8 string class. >- changes to this to use that class. I've filed bug 127789 on jag to implement utf8strings. I've added a comment on it to go fix up the extra string copies in xpcconvert.cpp. >- changes to nsVariant. I'd filed bug 125465 on myself last week to track this.
I like darin's most recent suggestion. The bug in question that jag mentioned earlier is bug 78848.
I can't think of any good reason to not do the forward declarations that Darin suggested.
OK, I'll make the nsrootidl.idl change and check it in with my changes to add the new string types to nsVariant (bug 125465).
I'm attaching the patch suggested by Darin that will get checked in as part of bug 125465's fix. Please holler if you see problems with it. Thanks!
Yeah, don't #include when you can forward-declare the class name. /be
Brendan, please clarify. Exactly what should I not #include? Are you saying that I should remove the #include lines that were there in nsrootidl.idl before my changes?
nisheeth, sorry I was unclear. I was supporting your final patch, which adds class nsACString; to nsrootidl.idl's C++ section, thereby relieving others from having to #include "nsACString.h" or repeat the class forward decl. IOW, I'm all happy here. /be
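The change being agreed on here amounts to forward declarations in nsrootidl.idl's C++ block; a sketch of the idea (not the exact checked-in attachment 71980):

```
%{C++
/*
 * Forward-declare the abstract string classes so that consumers of
 * generated headers need not #include the full string headers.
 * A forward declaration suffices because the generated signatures
 * only use these types by reference or pointer.
 */
class nsAString;
class nsACString;
%}
```

Since every generated header includes nsrootidl.h, these two lines relieve each individual .idl file (and each #includer) of the burden described in the earlier comments.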
Attachment 71980 [details] [diff] got checked into the 0.9.9 branch and the trunk just now as part of the fix to bug 125465. Jag, when you implement nsAUTF8String, please remember to add a forward declaration to it in nsrootidl.idl. Thanks!
Marking Verified -
Status: RESOLVED → VERIFIED
Blocks: 129613
Component: xpidl → XPCOM
QA Contact: pschwartau → xpcom