Closed Bug 124042 Opened 23 years ago Closed 23 years ago

support internationalized resource identifiers (UTF-8 URI)

Categories

(Core :: Networking, defect, P2)

defect

Tracking

()

VERIFIED FIXED
mozilla0.9.9

People

(Reporter: darin.moz, Assigned: darin.moz)

References

Details

(Keywords: topembed)

Attachments

(3 files, 7 obsolete files)

nsIURI attributes need to support unicode. this is needed for both IDN support as well as full IRI support. IDN - see http://search.ietf.org/internet-drafts/draft-ietf-idn-uname-01.txt IRI - see http://search.ietf.org/internet-drafts/draft-masinter-url-i18n-08.txt pending support for UTF-8 in XPCOM (see bug 84186), all (or most) nsIURI and nsIURL string attributes should be typed as UTF-8 (scheme being the possible exception). this will enable support for UCS w/o compromising performance of the many US-ASCII only websites and protocols that do not support IDN or IRIs. changing nsIURI attributes to UTF-8 means that there needs to be some additional ASCII encoded (ACE) versions of some of the attributes. at a minimum: ACString nsIURI::ACESpec ACString nsIURI::ACEHost ACEHost is the hostname converted to ASCII per the IDN draft specification (or perhaps by a medley of different converters all hidden behind the existing nsIIDNService::ConvertUTF8toIDN interface). this interface is currently implemented by a commercial XPCOM component from i-dns.net. there is a bug to open source this component (see bug 112979). the implementation is tricky because of the many different ASCII encoding schemes currently deployed on the internet (the standard is not yet ubiquitous). ACESpec is the fully escaped URI spec w/ the ACEHost in place of the UTF-8 host. If unescaped ACESpec would not necessarily be UTF-8. instead it would need to be in the encoding of the origin document (UTF-8 if conforming to the IRI draft specification). this is necessary to support the very common usage by content providers of non-standard non-US-ASCII and non-UTF-8 URLs (eg. ISO-8859-1). as a result, nsIURI must provide a charset attribute. (see related bug 84032, which is probably obsoleted by this bug.) NS_MakeAbsoluteURIWithCharset would need to set the proper charset attribute on nsIURI. at this point i'm not sure if nsStandardURL should store the spec as ASCII or as UTF-8. it'll depend on which is called most frequently. i'm suspecting it'll be best to optimize for ACESpec, and then change imagelib to use this when interacting with the image cache (iirc: image cache is the biggest consumer of GetSpec).
Severity: normal → major
Status: NEW → ASSIGNED
Priority: -- → P2
Target Milestone: --- → mozilla0.9.9
one more issue that this bug also addresses that i forgot to mention is HTTP redirects w/ ACE hostnames given in the Location: header. we would of course want to display the UTF-8 version of a redirected URL and not the ACE version. this means that nsStandardURL will have to interact with the IDN service and the IDN service will need to be extended to provide a method like ConvertIDNtoUTF8. bug 110028 discusses adding a ConvertIDNtoUTF8 method on nsIIDNService.
Blocks: 124030
Depends on: 84186, 110028
I think the URL for IDN specification should be http://search.ietf.org/internet-drafts/draft-ietf-idn-idna-06.txt One concern with converting from ACE back to UTF8 is that we need to detect the ACE prefix (which is yet to be fixed as a standard), so we might want to just send the hostname to nsIIDNService::ConvertIDNToUTF8. This may have some performance implications but should be fine if the implementation is efficient and the nsIIDNService is already cached.
william: thx for the link. also, we can also cache the UTF8 version of the ACE hostname and vice versa to minimize conversions.
No longer depends on: 84186
Depends on: 84186
there is a problem w/ the charset attribute (see bug 84032 comment #7). in some cases we do not know the charset of the escaped characters (eg. URL resulting from a http redirect). in such cases it is impossible to convert the unescaped URL to UTF8. some protocols (such as http) must always keep the path and prehost components URL escaped. thus for http, everything but the hostname must be treated as US-ASCII. such considerations are not addressed by the IRI draft specification :( nsIURI can have a charset attribute, but it will mostly be unused for http URLs.
Keywords: nsbeta1+
Blocks: 125718
Blocks: 125080
Is 0.9.9 realistic? If I understand the discussion so far, the plan is: There would be a charset attr in nsIURI serving as a hint for unescaping characters > 0x7F for display purposes. In cases where the charset could not be determined, the path name would simply be displayed in escaped form. However, the hostname would be sent to nsIIDNService to obtain the UTF8 version for display.
i'm wrapping up the patch now :-) and drivers have already expressed strong interest in taking this for mozilla 0.9.9. the patch does the following: 1) adds a charset parameter to nsIProtocolHandler::newURI, so protocols can know the charset from which the UTF-8 input originated. this allows protocols to generate the proper escaped form of the URL at creation time. if the origin charset is not specified it's assumed to be UTF-8. if it is not UTF-8, then we convert the non-ASCII characters to the given charset and escape them. if the origin charset is UTF-8 then we leave them be (unless a preference is set which causes even UTF-8 to be escaped -- this will be the default). the origin charset and this transformation that i've just described does not apply to the hostname since it can only be UTF-8. as a result we'll store the URL spec UTF-8 encoded. 2) adds ascii host and ascii spec getters on nsIURI that return the ACE forms of these per the IDNA rules. 3) all existing nsIURI and nsIURL getters return strictly UTF-8 character strings. combined this makes most of the code in NS_MakeAbsoluteURIWithCharset obsolete. furthermore, all normal getters simply return substrings of the stored spec. the getters for the attributes including username, password, param, query, ref, fileBaseName, and fileExtension are no longer unescaped. callers must unescape these manually if desired. the reason for this change is simple: websites can serve up URLs with escaped characters corresponding to some charset that is unknown to us. callers should therefore be careful about unescaping... especially since the resulting strings could contain embedded nulls, and other fun control characters.
this patch (830k uncompressed) includes an older version of nisheeth's patch for bug 84186. the entire patch can be applied to today's trunk. i've tested it on rh72, win2k, and macosx-macho. i'll be generating optimized linux and win32 builds... hopefully i'll have them posted on ftp.mozilla.org by tomorrow.
darin: I hope you know what you do ... the change in returning not-unescaped parts of the url when accessing the atomic parts of the url seems dangerous to me. Some might build on this behavior, we have to be very carefull and check all callers about this.
The "network.enableIDN" pref is not read correctly in nsStandardURL, it got FALSE even when I have it set to TRUE in my prefs.js. Just for experimentation I made nsStandardURL::nsStandardURL read the prefs, eventually it got TRUE. So I presume that this is because nsPrefService has not yet read the user prefs file. Once enableIDN is TRUE, and nsIIDNService is loaded we can chase IDN links and type in IDN on the URL bar. Just doing a quick smoketest, it segfaults when I click on the non-ASCII link at http://playground.i-dns.net/~wil/moztest/nonascii-link.html If nsIIDNService is no longer used in nsHttpChannel then we can skip the #include "nsIIDNService.h". Tested this on RH6.2+.
Darin: while you're at it, could you incorporate this as well? Thanks!
andreas: yup... that was my biggest fear too, but i've found that there are not that many callers of the atomic parts. i've tried to fix all of them. i'll be giving it another pass to make sure.
william: thx for trying this out. i'll incorporate your patch. the reason why its not reading your prefs.js file is because i have not added a pref observer. so, it'll only pick up the preference if it is set in all.js. my next patch will use a pref observer. and i'll remove the include from nsHttpChannel.cpp :) when it crashed did you only have enableIDN set? or did you also set network.standard-url.escape-utf8 to false? does it still crash if you disable IDN? thx!
william: nevermind.. i'm seeing the crash now.
Darin: how come the LDAP bits of code in the patch have been changed to use GetAsciiHostname() rather than the UTF8 GetHostname(). The LDAP C SDK is entirely UTF8 under the hood.
dmose: but does it talk to the nsIIDNService to convert a UTF-8 hostname to something that bind will understand? hostnames are traditionally restricted to US-ASCII. if you want LDAP to work with UTF-8 hostnames (IDN) then you will have to do the conversion via the nsIIDNService manually elsewhere. moreover, i seriously doubt the LDAP SDK supports UTF-8 hostnames since there isn't even a real RFC on it yet.. just a draft specification that isn't even universally followed. hence the XPIDN plugin from i-dns.net, which tried to do the right thing given the great variety of ASCII compatible encodings currently in use ;-)
this patch is ready for heavy testing / code review (with of course the subtraction of nisheeth's xpidl changes).
Attachment #70879 - Attachment is obsolete: true
take 2 ... bugzilla/mozilla won't let me change the mime type for some reason.
Attachment #71034 - Attachment is obsolete: true
Assumptions: nsXPIDLCString is should not be used anymore and nsCAutoString should be used. NS_NewURI needs to cast uri string to a nsDependentCString. Index: caps/src/nsCodebasePrincipal.cpp Are there hard tabs in this file?? Use of strdup might break mac build: + if (NS_FAILED(mJSPrincipals.Init(strdup(codebase.get())))) Index: caps/src/nsScriptSecurityManager.cpp Again use of strdup may cause you problems. I will not repeat this concern. ReportErrorToConsole why are you getting the Ascii spec? How about just getting the spec and calling msg.Append(). Index: content/base/src/nsContentAreaDragDrop.cpp I am not sure why this yyou made this change. (xx) Index: content/base/src/nsDocument.cpp Does this mean anything: // XXXdarin CopyUTF8toUCS2 and throughout this patch. Index: content/html/content/src/nsGenericHTMLElement.cpp Is there any code that would depend on a escapedQuery being UTF8? - if (!search.IsEmpty()) { - // XXX is escapedQuery really ASCII or UTF8 - CopyASCIItoUCS2(NS_LITERAL_CSTRING("?") + search, aSearch); - } - + if (!search.IsEmpty()) + aSearch.Assign(NS_LITERAL_STRING("?") + NS_ConvertUTF8toUCS2(search)); return NS_OK; Index: content/xul/document/src/nsXULDocument.cpp Shouldn't this be GetAsciiSpec? @@ -5559,12 +5557,12 @@ rv = mCurrentPrototype->GetURI(getter_AddRefs(url)); if (NS_FAILED(rv)) return rv; - nsXPIDLCString urlspec; - rv = url->GetSpec(getter_Copies(urlspec)); + nsCAutoString urlspec; + rv = url->GetSpec(urlspec); if (NS_FAILED(rv)) return rv; PR_LOG(gXULLog, PR_LOG_ALWAYS, - ("xul: error parsing '%s'", (const char*) urlspec)); + ("xul: error parsing '%s'", urlspec.get())); LDAP: Make sure dmose knows about these changes. Index: docshell/base/appstrings.properties Does l10n know about this? Also, I prefer present tense. +netReset=The document contained no data. More coming later.
it is still valid to use nsXPIDLCString, but it would be used w/o getter_Copies: nsXPIDLCString spec; url->GetSpec(spec); printf("spec=%s\n", spec.get()); in most cases i've replaced the nsXPIDLCString with nsCAutoString, but in some cases the surrounding code depends on nsXPIDLCString so i just left it that way. strdup - i'll make sure this thing compiles on the mac before landing :) i made the change to GetURLSpecFromFile and InitFileFromURLSpec because URL strings are all supposed to be UTF-8 :) we don't have a CopyUTF8toUCS2 yet. once we do there are several places where it could be used to eliminate a buffer copy. jag also tells me that w/o changing any of the callers he can make NS_ConvertUTF8toUCS2 avoid the extra buffer copy as well... so, i should probably do away with my CopyUTF8toUCS2 comments. GetEscapedQuery was 100% ASCII before this patch. it now corresponds to GetQuery() with the exception that GetQuery() may not return UTF-8 characters. by default however GetQuery() will return ASCII. uses of GetAsciiSpec vs. GetSpec for logging are not critical... it doesn't really hurt to printf(some-UTF-8-string) :-/ dmose is aware of these changes... see comment #15. uh oh... looks like some unintended appstrings.properties changes crept into the patch :(
Index: docshell/base/nsWebShell.cpp I see other references to netRest in this patch. Please make sure that they are all removed. Index: dom/src/jsurl/nsJSProtocolHandler.cpp Please factor out this code. Similar logic is also in content/shared/src/nsHTMLUtils.cpp + if (IsASCII(aSpec)) + rv = url->SetSpec(aSpec); + else { + // need special encoding for unicode characters... + // XXXdarin iterate over the UTF-8 chars instead + NS_ConvertUTF8toUCS2 ucsSpec(aSpec); + nsCAutoString encSpec; + + char buf[6+1]; // space for \uXXXX plus a NUL at the end + for (const PRUnichar *uch = ucsSpec.get(); *uch; ++uch) { + if (*uch > 0x7F) { + PR_snprintf(buf, sizeof(buf), "\\u%.4x", *uch); + encSpec.Append(buf); + } + else + encSpec.AppendWithConversion(*uch); + } + rv = url->SetSpec(encSpec); + } + Index: js/src/xpconnect/src/xpcconvert.cpp Index: js/src/xpconnect/src/xpcwrappednative.cpp Index: js/src/xpconnect/tests/components/xpctest_echo.cpp Index: js/src/xpconnect/tests/idl/xpctest.idl Index: js/src/xpconnect/tests/js/old/xpctest_echo.js Index: xpcom/base/nsrootidl.idl Index: xpcom/reflect/xptcall/public/xptcall.h Index: xpcom/reflect/xptcall/src/md/os2/xptcinvoke_vacpp.asm Index: xpcom/reflect/xptinfo/public/xptinfo.h Index: xpcom/typelib/xpidl/xpidl. Index: xpcom/typelib/xpidl/xpidl_header.c Index: xpcom/typelib/xpidl/xpidl_typelib.c Index: xpcom/typelib/xpidl/xpidl_util.c Index: xpcom/typelib/xpt/public/xpt_struct.h Index: xpcom/typelib/xpt/tools/xpt_dump.c This is not part of *your* patch, right? Index: mailnews/compose/src/nsMsgSend.cpp The spacing in a few places is off. Maybe more tabs? Index: mailnews/import/eudora/src/nsEudoraCompose.cpp (and a few other places) Do you really need to Adopt first? + urlStr.Adopt(0); + pAttach->pAttachment->GetURLString(getter_Copies(urlStr)); Index: netwerk/base/public/nsIIOService.idl (and other .idl's) Do we *really* have to #include this? If we ever try to freeze this interface, it will be one more thing that we will have to change. Will this not be included by nsrootidl.idl +%{C++ +#include "nsAString.h" +%} + Index: netwerk/base/src/nsIOService.cpp Shouldn't this be a static cast? - for(int i=0; gBadPortList[i]; i++) - { - mRestrictedPortList.AppendElement((void*)gBadPortList[i]); + for(int i=0; gBadPortList[i]; i++) { + mRestrictedPortList.AppendElement(NS_REINTERPRET_CAST(void *, gBadPortList[i])); Misc: Are you going to add "network.standard-url.escape-utf8" and "network.enableIDN" to all.js? Otherwise looks okay. r=dougt.
Comment on attachment 71035 [details] v1 patch w/ one of nisheeth's older XPIDL patches from bug 84186 r=dougt
Attachment #71035 - Flags: review+
- will remove NET_RESET patch - moved JS escaping logic from NS_MakeAbsoluteURIWithCharset into the JS protocol handler (which is where it really belongs). - yes, those files are nisheeth's changes - i keep forgetting to ":set noexpandtab" before editing mailnews code :P - yeah, the Adopt(0) is unnecessary since getter_Copies(foo) clears the buffer before returning. - right, i need to ask nisheeth about nsAString.h... i don't like %{C++ including it anymore than you ;) - nsIOService.cpp:215: invalid static_cast from type `short int' to type `void *' - yes, i should add those prefs to all.js dougt: thanks so much for reviewing this monster-of-a-patch :-)
Blocks: 125223
*** Bug 57524 has been marked as a duplicate of this bug. ***
Blocks: 81019
Attached file v2 final patch w/ dougt's comments (obsolete) —
gzip compressed unified diff for the trunk, revised per dougt's review comments.
Attachment #71035 - Attachment is obsolete: true
Comment on attachment 71562 [details] v2 final patch w/ dougt's comments carrying forward r=dougt
Attachment #71562 - Flags: review+
same patch as before w/ addition of mac project changes
Attachment #70927 - Attachment is obsolete: true
Attachment #71562 - Attachment is obsolete: true
Comment on attachment 71671 [details] v3 final patch w/ dougt's comments + mac project changes carrying forward r=dougt
Attachment #71671 - Flags: review+
Attached file v4 fixed some indentation problems (obsolete) —
Attachment #71671 - Attachment is obsolete: true
Comment on attachment 71717 [details] v4 fixed some indentation problems carrying forward r=dougt
Attachment #71717 - Flags: review+
Blocks: 127282
Blocks: 127408
Need this for embeddding.
Keywords: topembed
Comment on attachment 71717 [details] v4 fixed some indentation problems a few minor nits: + if (NS_FAILED(mJSPrincipals.Init(strdup(codebase.get())))) use ToNewCString(codebase), not strdup() - guarantee the right allocator aResult.Assign(NS_ConvertUTF8toUCS2(scheme) + NS_LITERAL_STRING("://") + - NS_ConvertUTF8toUCS2(preHost) + aHost + + NS_ConvertUTF8toUCS2(userpass) + aHost + NS_ConvertUTF8toUCS2(path)); do you need an "@" sign here, now that you're using userpass rather than prehost? or is that just a method name change? + aBuf = ToNewCString(buf); // XXX ToNewUTF8String? ToNewCString(const nsAString&) is pretty evil - any chance you could leverage NS_LossyConvertUCS2toASCII here? (there are a few lines just like this in html content) +static NS_NAMED_LITERAL_CSTRING(kLDAPScheme, "ldap"); +static NS_NAMED_LITERAL_CSTRING(kLDAPSSLScheme, "ldaps"); Sorry, this isn't really allowed - "no static classes" I think is the rule from the C++ portability guidelines I'm about 1/3 of the way through the patch, but I'll post these comments and add more later today
alec: thx so much for the review comments... - replaced the strdup w/ ToNewCString. - userpass has '@' appended just above the assignment you're mentioning: if (!userpass.IsEmpty()) userpass.Append('@'); - buf is already a "C" string, so the call uses ToNewCString(const nsACString &) - actually, the NS_NAMED_LITERAL_CSTRING's didn't need to be there.
Keywords: mozilla0.9.9
I saw that you needed this tested on #mozillazine, and I downloaded the test binary. What do you want us to test?
netdemon: anything and everything that has anything to do with URLs ;-)
I am still working on this, haven't forgotten.. but if I haven't finished by about 7:15pm tonight, I won't be able to get to it again until probably monday morning, but I'll finish before the tree opens
alec: ok, thanks for the update :)
Blocks: 69859
+nsSegmentEncoder::EncodeSegment(const nsASingleFragmentCString &str, + PRInt16 mask, + nsAFlatCString &result) +{ + const char *text; + PRUint32 resultLen = result.Length(); + EncodeSegment(str.BeginReading(text), URLSegment(0, str.Length()), mask, result); I think you need to make "text" a nsASingleFragmentCString::const_iterator just for good measure I'm slightly confused about the two different EncodeSegment routines - one which returns an integer and one which returns a string. I see the use of both of them but the same name makes it hard to understand whats going on sometimes.. any chance we can just get different names? - // XXX the string code's ToLowerCase doesn't operate on substrings - ToLowerCase((char *)mSpec.get() + mScheme.mPos, mScheme.mLen); + ToLowerCase((char *) mSpec.get(), mScheme.mLen); can we keep that comment in there? Might someday remind us to fix that, and explain the need for casting... also, can you use NS_CONST_CAST here, while you're changing it? + nsCString mOriginCharset; + PRUint32 mURLType; + nsCOMPtr<nsIURLParser> mParser; + nsCOMPtr<nsIFile> mFile; // cached result for nsIFileURL::GetFile + char *mHostA; // cached result for nsIURI::GetHostA + PRPackedBool mMutable; + + enum nsEncodingType { + eEncoding_Unknown, + eEncoding_ASCII, + eEncoding_UTF8 + }; + nsEncodingType mHostEncoding; + nsEncodingType mSpecEncoding; Move mMutable after the enums - enums are 32-bit on most platforms.. the structure will still get bumped to the nearest 4 bytes, but you might as well put it at the end. + const nsPromiseFlatCString flatURI( PromiseFlatCString(inURI) ); You can avoid a copy by making this const nsPromiseFlatCString& - I think that should work? (if that's not allowed, never mind) + nsCAutoString ext; + url->GetFileExtension(ext); + if (!nsCRT::strcasecmp(ext.get(), "dll") || + !nsCRT::strcasecmp(ext.get(), "exe")) doMimeLookup = PR_FALSE; Any chance for .Equals() here? if you use NS_LITERAL_CSTRING("dll") etc you get length for free, could speed up the assignment similar here, can't we just compare these two strings: + if (NS_SUCCEEDED(rv) && strcmp(otherScheme.get(), myScheme.get()) == 0) + { + if (strcmp(otherScheme.get(), "file") == 0) and use NS_LITERAL_CSTRING above and here: + else if (strcmp(otherScheme.get(), "imap") == 0 || + strcmp(otherScheme.get(), "mailbox") == 0 || + strcmp(otherScheme.get(), "news") == 0) here, you can just use Equals() + *result = strcmp(otherSpec.get(), mySpec.get()) == 0; I won't bother pasting more, but if sweep for strcmp and strcasecmp I see a lot of similar candidates... it's not going to be easy to see what's new code though. + /** url components **/ + esc_Scheme = (1<<0), + esc_Username = (1<<1), + esc_Password = (1<<2), + esc_Host = (1<<3), you could use PR_BIT() here.. not necessary but might be easier to read phew. I think that's it.
Comment on attachment 71717 [details] v4 fixed some indentation problems oh, and with all the above suggestions & changes, sr=alecf (I'm satisfied that the right things will be done :)) I'll just put sr=alecf on the final patch when it lands.
Comment on attachment 71717 [details] v4 fixed some indentation problems a=asa (on behalf of drivers) for checkin to the 1.0 trunk. We're still discussing what to do on the branch.
Attachment #71717 - Flags: approval+
Attached file v5 final
w/ changes suggested by alecf
Attachment #71717 - Attachment is obsolete: true
Comment on attachment 72672 [details] v5 final carrying forward: r=dougt sr=alecf a=asa
Attachment #72672 - Flags: superreview+
Attachment #72672 - Flags: review+
Attachment #72672 - Flags: approval+
Blocks: 118587
No longer depends on: 110028
landed on the trunk... thx to everyone (especially dougt and alecf) who helped :-) caused at least one regression (see bug 129250).
Depends on: 110028
... and it broke --enable-ldap-experimental option ... dmose ...
Comment on attachment 73173 [details] [diff] [review] Patch to fix build-bustage for --with-ldap-experimental, thanks to peterv. r=dmose
Attachment #73173 - Flags: review+
Verified IDN functions works for Linux and Windows.
this patch landed on the mozilla 0.9.9 branch as well. marking FIXED. BTW: this bug was not really blocked by bug 110028.
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
Removing dependency on Bug 110028 per Darin's Comment #49.
No longer depends on: 110028
Is there an i18n QA person for this bug?
Benjamin Chuang, I am IQA contact for this. Should I verify this?
Probably, you should take qa ownership. I don't know much about how this technology works.
QA Contact: benc → teruko
this patch broke solaris's build, please see bug 143379
Changed QA contact to ylong@netscape.com.
QA Contact: teruko → ylong
Mark as verified per comment #47.
Status: RESOLVED → VERIFIED
Blocks: 93419
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: