Closed
Bug 83277
Opened 24 years ago
Closed 16 years ago
Multilingual (unicode) input into address (url) bar
Categories
(Core :: Internationalization, defect)
Tracking
()
RESOLVED
WORKSFORME
Future
People
(Reporter: nprobert, Assigned: nhottanscp)
Details
(Keywords: intl)
One upcoming development is support for multilingual internationalized domain
names (IDN). The conversion to an Ascii Character Encoding (ACE) is handled
outside of the browser at the resolver level in the case of our client.
It is especially important that Mozilla not treat all address bar and form input
as Latin-1, and that it neither normalize nor case fold the Unicode the user
enters in the address bar.
It looks like anchors and JavaScript work okay, but input into the address bar
makes a mess of the Unicode input. I suspect that the input locale is getting
in the way here and that Mozilla is not respecting it, particularly when the
keyboard input locale and language are changed under Windows and input is done
through the IME.
Unfortunately, the Windows version of Mozilla does mangle the input, as does IE.
Microsoft won't change IE because it case folds the Latin-1 to optimize caching.
I hope Mozilla is not doing the same thing.
Given the Unicode input of these Chinese characters: 4e00 4e8c 4e09.com (one,
two and three horizontal bars dot com), which converted to UTF-8 is:
00e4 00b8 0080 00e4 00ba 008c 00e4 00b8 0089 002e 0063 006f 006d
The result is this mangled UTF-8, by the time the resolver gets it:
00e4 00b8 20ac 00e4 00ba 0152 00e4 00b8 2030 002e 0063 006f 006d
If Mozilla can be fixed to do this right, then it can be pushed as the preferred
browser to the international community over IE.
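The mangled sequence above is consistent with the UTF-8 bytes being reinterpreted as Windows-1252 text somewhere between the address bar and the resolver. The Python sketch below reproduces it. The choice of code page 1252 is an assumption inferred from the values 20ac, 0152 and 2030; nothing in this report confirms which layer or code page is involved.

```python
# Assumption: the corruption is a byte-wise reinterpretation of UTF-8 as
# Windows-1252 (cp1252). The report does not confirm which layer does this.
hostname = "\u4e00\u4e8c\u4e09.com"

utf8_bytes = hostname.encode("utf-8")
# Treat every UTF-8 byte as one cp1252 character.
mangled = utf8_bytes.decode("cp1252")


def hex4(values):
    """Format integers the way the report does: four hex digits each."""
    return " ".join(f"{v:04x}" for v in values)


print(hex4(utf8_bytes))                 # the expected UTF-8, one byte per group
print(hex4(ord(ch) for ch in mangled))  # the values after the reinterpretation
```

The second line printed matches the sequence the resolver reported, which suggests the damage happens after the text has already been encoded as UTF-8.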
Comment 1•24 years ago
Assignee: rchen → nhotta
Status: UNCONFIRMED → NEW
Component: Localization → Internationalization
Ever confirmed: true
Summary: Multilingual input into address bar → Multilingual (unicode) input into address (url) bar
Assignee
Comment 2•24 years ago
I tried a Japanese string before and got correct UTF-8 at the call to
WSAAsyncGetHostByName().
http://lxr.mozilla.org/seamonkey/source/netwerk/dns/src/nsDnsService.cpp#620
>00e4 00b8 0080 00e4 00ba 008c 00e4 00b8 0089 002e 0063 006f 006d
Where did you get this string? Did you put a break point somewhere in Mozilla?
> The conversion to an Ascii Character Encoding
> (ACE) is handled outside of the browser at the resolver level in the case of
> our client.
What does "our client" mean? There is a separate bug 42898 that covers implementing
http://search.ietf.org/internet-drafts/draft-ietf-idn-idna-01.txt.
Keywords: intl
Assignee
Comment 3•24 years ago
>00e4 00b8 0080 00e4 00ba 008c 00e4 00b8 0089 002e 0063 006f 006d
UTF-8 does not contain zero bytes except as a terminator.
The UTF-8 values for \u4e00 \u4e8c \u4e09 are 0xE4B880 0xE4BA8C 0xE4B889.
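For reference, the three-byte values quoted above can be derived by hand from the UTF-8 bit layout. This is a minimal sketch; the function name is made up for illustration and it covers only the three-byte range.

```python
def utf8_encode_3byte(code_point: int) -> bytes:
    """Encode a code point in U+0800..U+FFFF (excluding surrogates) as UTF-8."""
    if not 0x0800 <= code_point <= 0xFFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("this sketch only handles the three-byte range")
    return bytes([
        0xE0 | (code_point >> 12),          # 1110xxxx: top 4 bits
        0x80 | ((code_point >> 6) & 0x3F),  # 10xxxxxx: middle 6 bits
        0x80 | (code_point & 0x3F),         # 10xxxxxx: low 6 bits
    ])


for cp in (0x4E00, 0x4E8C, 0x4E09):
    print(f"U+{cp:04X} -> {utf8_encode_3byte(cp).hex().upper()}")
```

Every byte produced this way has its high bit set, so no zero byte can appear inside an encoded character.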
Reporter
Comment 4•24 years ago
5/30/01 2:48 PM
---------------
We've only used the MS IME or the Character Map.
Are you saying that Mozilla properly handles all forms of Unicode input by the
user? Regardless of the keyboard and/or input locale?
Is there some way we can log what Mozilla or Netscape sees when we put URLs in
the address bar and follow it through to the tail pipe?
Our client (resolver plug-in) only sees what comes out of Mozilla.
Naoki Hotta wrote:
>
> Hi,
>
> What kind of IME have you tried? I heard some third party IME have problem
> with Mozilla. I think there is no data corruption when using MS IME.
>
> Naoki
>
Neal Probert wrote:
-------------------
For more information on IDNA, see
http://www.ietf.org/html.charters/idn-charter.html and http://www.i-d-n.net/.
Certain Latin-1 characters (which are really UTF-8 sequences) are case folded,
particularly C0-CF. Others like 80-8F are transformed into something else, but
I'm not sure exactly what is happening there. There may be other unexpected
transformations as well, but the behavior is the same.
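To illustrate why case folding is destructive here, the sketch below lowercases a UTF-8 byte string as if each byte were a Latin-1 character. It is only a model of the reported behavior: the helper name is hypothetical, and the report does not establish that this is what Mozilla or IE actually does.

```python
def fold_as_latin1(data: bytes) -> bytes:
    """Lowercase a byte string as if every byte were a Latin-1 character."""
    return data.decode("latin-1").lower().encode("latin-1")


original = "\u00c9".encode("utf-8")  # "É" is C3 89 in UTF-8
folded = fold_as_latin1(original)     # lead byte C3 ("Ã") becomes E3 ("ã")
print(original.hex(), "->", folded.hex())

try:
    folded.decode("utf-8")
except UnicodeDecodeError as exc:
    print("no longer valid UTF-8:", exc.reason)
```

Because the folded lead byte now announces a three-byte sequence, the result is not merely a different character but malformed UTF-8.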
Our client plugs into the resolver library, so that's how we get our data when
we turn on our debugging. We do expect UTF-8 and can handle URI escaped UTF-8
as well. A copy of our client can be found at http://www.walid.com/ which is
used for nameprep and ACE.
We plug in at the resolver level so that any client application can begin using
IDNA immediately, and we will follow the standard set by the IDN WG and update
our client accordingly.
To get the characters, we paste them from Character Map (using Arial
Unicode MS) into the address bar. It can also be done from the IME, using the
Chinese virtual keyboard, with identical results.
Reporter
Comment 5•24 years ago
With the calls to MultiByteToWideChar(), perhaps the codepage is wrong and is
not being adjusted to match the data type coming from the paste operation or the
IME itself.
We've been testing on Windows 2000. According to the documentation, perhaps the
codepage should be CP_UTF8 instead on NT, 2K and XP. Somehow, it looks like the
language is being mapped to a codepage, and that may be the cause of our problem.
Btw, the UTF-8 data I posted was the result of printf( "%04x ", c[i] ); so
it may be misleading.
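The effect of that format string can be shown with a short sketch, assuming each element is an unsigned value. A width of four pads every byte with two leading zeros, which is where the apparent zeros in the posted UTF-8 come from.

```python
data = "\u4e00".encode("utf-8")  # three bytes: E4 B8 80

as_printed = " ".join(f"{b:04x}" for b in data)  # mirrors printf("%04x ", c[i])
as_bytes = " ".join(f"{b:02x}" for b in data)    # two digits per byte

print(as_printed)
print(as_bytes)
```

Note that values such as 20ac in the mangled dump cannot fit in a single byte, so the elements being printed there must have been wider than 8 bits.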
Assignee
Comment 6•24 years ago
The code page is mapped from the language identifier.
http://lxr.mozilla.org/seamonkey/source/widget/src/windows/nsWindow.cpp#388
The language identifier is obtained by calling the Windows API GetKeyboardLayout().
http://msdn.microsoft.com/library/psdk/winui/keybinpt_5sxg.htm
I think the paste case may not work if the pasted characters do not match the
keyboard locale.
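As a rough model of the mapping described above, the sketch below picks a code page from a language identifier and decodes input bytes with it. The table is a small illustrative subset of well-known Windows values, and the table and function names are hypothetical; this is not the code in nsWindow.cpp.

```python
# Illustrative subset of Windows language identifiers and their ANSI code pages.
# The table and function names are hypothetical, not Mozilla's actual code.
LANGID_TO_CODEPAGE = {
    0x0409: "cp1252",  # English (United States)
    0x0411: "cp932",   # Japanese
    0x0804: "cp936",   # Chinese (Simplified, PRC)
    0x0404: "cp950",   # Chinese (Traditional, Taiwan)
    0x0412: "cp949",   # Korean
}


def decode_input(raw: bytes, keyboard_langid: int) -> str:
    """Decode input bytes using the code page implied by the keyboard layout."""
    codepage = LANGID_TO_CODEPAGE[keyboard_langid]
    return raw.decode(codepage, errors="replace")


# Bytes as a Simplified Chinese (code page 936) source would produce them.
raw = "\u4e00\u4e8c\u4e09".encode("cp936")
print(decode_input(raw, 0x0804))  # layout matches the data
print(decode_input(raw, 0x0409))  # layout does not match: mojibake
```

The second call shows the failure mode this comment worries about: the data is intact, but it is decoded with a code page chosen from the keyboard rather than from the data.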
Reporter
Comment 7•24 years ago
It's not safe to assume that whatever language is entered via the IME, or even
pasted, will match the language, codepage or keyboard locale.
While the keyboard may be tied to the locale (language), the IME is independent
of the keyboard locale. Data from the IME should be handled with the same
language/codepage/locale setting as the IME.
If pasted text is in Unicode then it should be treated as Unicode, independent
of the keyboard locale.
Assignee
Comment 8•24 years ago
My assumption about the paste behavior was incorrect; pasted text is actually
treated as Unicode (e.g. I pasted Chinese characters from Character Map into the
URL bar and they were converted to UTF-8 when fed to search.netscape.com).
I do not understand your request about the independence of the IME from the
keyboard locale. Do you have examples where the keyboard and the IME are set to
different languages? In my environment (Windows 2000), I cannot use the Japanese
IME unless I set the keyboard to Japanese.
Assignee
Comment 9•24 years ago
Here is a memory dump of a host name at the code calling
WSAAsyncGetHostByName().
http://lxr.mozilla.org/seamonkey/source/netwerk/dns/src/nsDnsService.cpp#620
051329C0 E4 B8 80 E4 BA 8C E4 B8 89 一二三
051329C9 2E 63 6F 6D 00 FD FD FD FD .com.ýýýý
The test case was three Chinese characters (\u4e00 \u4e8c \u4e09) and ".com".
The three Chinese characters were converted to valid UTF-8 (0xE4B880 0xE4BA8C
0xE4B889).
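The dump can be checked mechanically. A minimal sketch that takes the bytes before the 00 terminator and decodes them strictly as UTF-8:

```python
# Bytes copied from the memory dump, up to (not including) the 00 terminator.
dump_hex = "E4 B8 80 E4 BA 8C E4 B8 89 2E 63 6F 6D"

# A strict decode raises UnicodeDecodeError if the bytes are malformed UTF-8.
host = bytes.fromhex(dump_hex).decode("utf-8")
print(host)
print([f"U+{ord(ch):04X}" for ch in host[:3]])
```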
I tried two cases, pasting characters from "Character Map" and typing with the
Japanese IME, and both gave the same result. I used a debug build on my machine
(Windows 2000 with system locale en-US), pulled this morning.
Marking as WORKSFORME.
Status: NEW → RESOLVED
Closed: 24 years ago
Resolution: --- → WORKSFORME
Comment 10•24 years ago
QA -> jonrubin@netscape.com.
Jon, when you get a chance can you take a look at this?
Reporter, if possible please let us know if you are still able to reproduce this
problem. Thanks.
QA Contact: andreasb → jonrubin
Comment 11•24 years ago
Could someone provide a testcase for this? Is it sufficient to enter a
multilingual address and see if the resulting error message displays the error
correctly, as in bug 81019? If so, then I can verify that this is fixed.
Reporter
Comment 12•24 years ago
This is still not fixed in 0.9.1 for the hostname part for IDNA. The Unicode
code points taken from Character Map were 4e00, 4e8c, 4e09, which should be sent
as the UTF-8 byte sequences e4b880, e4ba8c, e4b889, but they were still mangled.
Perhaps we are looking in the wrong places for this problem? If the call to
WSAAsyncGetHostByName() is passed the correct UTF-8 data, then where is it
getting mangled? Does this call get repeated again by Mozilla anywhere else?
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Assignee
Updated•24 years ago
Target Milestone: --- → mozilla0.9.3
Reporter
Comment 13•24 years ago
After using the debug/trace ws2_32.dll from the Platform SDK in conjunction with
a modified version of dt_dll built from the MSDN example, I was able to verify
that Mozilla calls WSAAsyncGetHostByName() with the correct UTF-8.
Our name space provider (NSP) for Winsock still sees the mangled UTF-8, which
leads me to the conclusion that the Winsock layer is converting the character
data using the thread's current locale.
I'm wondering whether the thread that calls WSAAsyncGetHostByName() can set its
locale to UTF-8 before making the call, and whether that would solve the problem.
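If the conversion really is a byte-wise mapping through an ANSI code page, it is reversible whenever every byte has a mapping. The sketch below assumes code page 1252, which this report never names, and the function is hypothetical rather than part of any NSP or Winsock API.

```python
def undo_ansi_widening(seen: str, ansi_codepage: str = "cp1252") -> str:
    """Recover text whose UTF-8 bytes were widened through an ANSI code page."""
    return seen.encode(ansi_codepage).decode("utf-8")


# The mangled values from the original report, written as code points.
seen_by_nsp = "\u00e4\u00b8\u20ac\u00e4\u00ba\u0152\u00e4\u00b8\u2030.com"
print(undo_ansi_widening(seen_by_nsp))
```

This is a diagnostic aid rather than a fix: Python's cp1252 codec has no mapping for bytes 81, 8D, 8F, 90 and 9D, so hostnames whose UTF-8 contains those bytes would not round-trip this way.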
Comment 15•23 years ago
mass change, switching qa contact from jonrubin to ruixu.
QA Contact: jonrubin → ruixu
Comment 16•23 years ago
The same kind of bug appears with French accented letters, and also when opening
a URL via a script (for example, when searching on Google).
I maintain http://www.tchatche.com/bd , where a script creates links to
pictures using the name of the comic. Sometimes the resulting URL contains accents.
For example, when you click on
http://www.tchatche.com/BD/Contents/consultation/chroniques/DetailBD.asp?bd=634
«Agrandir la...» or «voir un...», a JavaScript call loads a picture with an
accented name (
javascript:cover('/BD/','/bd/Images/ImagesBD/D\u00e9luge-Universal_-4-l.jpg','')
where \u00e9 stands for é). The image is not loaded, unlike in IE.
If I search for the French word for "summer" on Google from the main page, it gives «
http://www.google.fr/search?q=%E9t%E9&hl=fr&btnG=Recherche+Google&meta= »
(correct). But from the sidebar, «
http://www.google.com/search?q=%C3%A9t%C3%A9&sourceid=mozilla-xul&btnG=Google+Search
»
will be wrongly translated « été ».
Is it the same bug?
(and sorry for my mispelled anglishe)
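The two Google URLs differ only in which encoding was percent-escaped. A small sketch with Python's standard library shows both forms and what a Latin-1 reading of the UTF-8 form produces. It illustrates the encodings only and makes no claim about how Google or the sidebar handled the query.

```python
from urllib.parse import quote, unquote

word = "\u00e9t\u00e9"  # "été"

print(quote(word, encoding="latin-1"))  # form seen in the main-page URL
print(quote(word, encoding="utf-8"))    # form seen in the sidebar URL

# Reading the UTF-8 form as Latin-1 yields two characters for each "é".
print(unquote("%C3%A9t%C3%A9", encoding="latin-1"))
```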
Comment 17•22 years ago
Beep, we need this fixed. Try going to www.domäninfo.com. It works fine in IE 6
SP1 on Windows XP. Mozilla can't resolve the name.
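Resolving a name like this generally means converting it to an ASCII form before the DNS query, which is the approach of the IDNA work referenced in comment 2. As a rough illustration only, Python's built-in "idna" codec (which follows the later RFC 3490 rules, not necessarily the draft cited above or whatever IE did at the time) performs that conversion:

```python
host = "www.dom\u00e4ninfo.com"

# Python's built-in "idna" codec implements the IDNA 2003 rules (RFC 3490).
ace = host.encode("idna")
print(ace)                 # ASCII-only form suitable for a DNS query
print(ace.decode("idna"))  # back to the Unicode form
```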
Comment 18•16 years ago
WFM Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9) Gecko/2008052906 Firefox/3.0
Status: REOPENED → RESOLVED
Closed: 24 years ago → 16 years ago
Resolution: --- → WORKSFORME