Closed Bug 119825 Opened 23 years ago Closed 21 years ago

URL (location) bar Search Feature ignores national encoding (google)

Categories

(Core :: Internationalization, defect)

x86
All
defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: M.Hankus, Assigned: jshin1987)

References

Details

(Keywords: fixed1.6, intl)

Attachments

(1 file)

Linux build 2002011108

I use the search feature of the URL bar, and I noticed that the URL bar ignores the
national encoding of the entered text. As an example, I use ISO-8859-2, with Google as
my preferred search engine. When I enter something in the URL bar and select search,
Mozilla queries Google with

http://www.google.com/search?q=%3F%F3%3F%3F&sourceid=mozilla-search

but when I open Google and enter the same sentence in the form, I get the query string

http://www.google.com/search?q=%BF%F3%B3%E6&hl=pl&btnG=Szukaj+z+Google

so the results are completely different.
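The two query strings above can be reproduced with a rough sketch. It assumes that the form submitted ISO-8859-2 bytes (so the typed word is recovered by decoding them that way) and that the URL bar encoded the text to ISO-8859-1, substituting '?' for anything that charset cannot represent. Neither assumption is stated in the report; they are only consistent with the bytes shown.

```python
from urllib.parse import quote, unquote

# Query value produced by the Google form, as quoted in the report.
form_query = "%BF%F3%B3%E6"

# Assumption: the form submitted ISO-8859-2 bytes, so decode them that way.
word = unquote(form_query, encoding="iso-8859-2")

# Re-encoding the word as ISO-8859-2 gives back the form's query value.
as_latin2 = quote(word, encoding="iso-8859-2")

# Assumption: the URL bar encoded to ISO-8859-1 and replaced every
# unrepresentable character with '?' (0x3F) before percent-escaping.
as_latin1 = quote(word.encode("iso-8859-1", errors="replace"))

print(as_latin2)  # %BF%F3%B3%E6
print(as_latin1)  # %3F%F3%3F%3F
```

Only the second character survives the ISO-8859-1 step because it is the only one of the four that exists in that charset.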
*** Bug 118339 has been marked as a duplicate of this bug. ***
It might be more general, because the Search tab in the Sidebar behaves in the same way
as the URL bar. In the case of ISO-8859-2, all non-ASCII chars are converted to %3F
*** Bug 131126 has been marked as a duplicate of this bug. ***
It bothers me on Win2k too.
Can someone reproduce this on 1.0 RC1?
It disappeared on Win2K (it was present in 0.9.9).
I can reproduce it in 2002041903 on Windows 98SE. I have not tested RC1.
It is reproducible on the Linux RC1 build, and also in 2002042121 (Linux).
*** Bug 124588 has been marked as a duplicate of this bug. ***
*** Bug 141393 has been marked as a duplicate of this bug. ***
*** Bug 141841 has been marked as a duplicate of this bug. ***
Verified with Hebrew characters and BeOS (1.0 RC1.0 - 2002050509)

Searching using the Google homepage worked fine, giving 16000 results:
http://www.google.com/search?hl=en&q=%26%231496%3B%26%231511%3B%26%231505%3B%26%231496%3B&btnG=Google+Search

The URL search for the same string returned no results:
http://www.google.com/search?q=%3F%3F%3F%3F&sourceid=mozilla-search

Request to change OS from Linux to All
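As a small illustration of the two URLs in the comment above: the Google form did not send raw bytes at all but URL-escaped numeric character references, while the URL bar sent literal question marks. The sketch below only decodes the two query values quoted in the comment; nothing else is assumed.

```python
import html
from urllib.parse import unquote

# Query values copied from the two URLs in the comment.
form_query = "%26%231496%3B%26%231511%3B%26%231505%3B%26%231496%3B"
url_bar_query = "%3F%3F%3F%3F"

# Undo the URL escaping first: this yields numeric character references.
ncr_text = unquote(form_query)

# Resolve the references to the actual Hebrew characters.
hebrew = html.unescape(ncr_text)

# The URL bar query decodes to plain question marks; the text is lost.
lost = unquote(url_bar_query)

print(ncr_text)
print([f"U+{ord(c):04X}" for c in hebrew])
print(lost)
```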
I can confirm this bug on Win2K in RC3, using Cyrillic characters.
Searching with the Google form is fine; when searching through the URL bar, all characters
are sent as %3F, which obviously screws up the search.
*** Bug 143838 has been marked as a duplicate of this bug. ***
OS: Linux → All
*** Bug 136858 has been marked as a duplicate of this bug. ***
related: bug 102984, bug 83277
changing component
Assignee: hewitt → yokoyama
Component: URL Bar → Internationalization
QA Contact: claudius → ruixu
can we assume this has been confirmed then? :)
Status: UNCONFIRMED → NEW
Ever confirmed: true
Keywords: intl
QA Contact: ruixu → kasumi
-> nhotta
Assignee: yokoyama → nhotta
*** Bug 152065 has been marked as a duplicate of this bug. ***
*** Bug 153487 has been marked as a duplicate of this bug. ***
The search description file defaults to ISO-8859-1.
http://lxr.mozilla.org/seamonkey/source/xpfe/components/search/datasets/google.src

Adding bobj to cc. He was trying to send UTF-8 for google search.
It looks like it is fixed now (it works for me) build 2002071911 Linux.
*** Bug 144939 has been marked as a duplicate of this bug. ***
cc nhotta
bug 161181 is about google.src change.
Status: NEW → ASSIGNED
I'm not sure if anything has changed but linux build 2002080321 worked fine, 
and 2002080508 is not working (I just installed latest build). 
Summary: URL bar Search Feature ignores national encoding → URL (location) bar Search Feature ignores national encoding (google)
*** Bug 155386 has been marked as a duplicate of this bug. ***
*** Bug 128224 has been marked as a duplicate of this bug. ***
*** Bug 149029 has been marked as a duplicate of this bug. ***
So many dups here.
The latest one is 155386, which was reported on 07/02/2002.
As Mirek mentioned in comment #27, he tested 2002080321, and it works.
I tested the 2002101805 build. It works as well.
Mirek: Could you please test on the latest build?
For me it has been working fine for some time (including the 2002121922 Linux build).
Some time?
Not all the time?
Since many bugs have been merged into this one, I have to describe all my
observations, although I really doubt that all of these are simply one bug.

I'm using yesterday's nightly build (English) for Windows, running on English W2K.
1) If you search for "中文" in the sidebar, it returns no results.
2) If you search for "中文" in the address bar, such as:
http://www.google.com/search?hl=en&ie=UTF-8&oe=UTF-8&q=中文&btnG=Google+Search
it translates "中文" to "%D6%D0%CE%C4" and gets no results back. It should
translate it to "%E4%B8%AD%E6%96%87" and get plenty of results.
3) If you highlight "中文" in the browser, right-click, and select web search, it
translates to
http://www.google.com/search?q=%3F%3F&sourceid=mozilla-search&start=0&start=0

In short, all Mozilla-based Chinese searches failed :(
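The two escaped forms in case 2 can be compared with a short sketch. The comment does not name the legacy encoding that produced "%D6%D0%CE%C4"; GB2312 is assumed here because it yields exactly those bytes.

```python
from urllib.parse import quote

# The two characters searched for in the comment.
term = "\u4e2d\u6587"

# Assumption: the address bar used the GB2312 legacy encoding.
legacy = quote(term, encoding="gb2312")

# What the URL promises with ie=UTF-8: UTF-8 bytes, percent-escaped.
utf8 = quote(term, encoding="utf-8")

print(legacy)  # %D6%D0%CE%C4
print(utf8)    # %E4%B8%AD%E6%96%87
```

Since the URL declares ie=UTF-8, the legacy bytes are not valid input for the server, which would explain the empty result.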
Has bug 145375 affected this one? 
QA Contact: kasumi → cpetersen0953
It appears that search in both the sidebar and the URL bar is affected by the
Preferences | Navigator | Language setting. However, they're affected differently.
In the URL bar, what's entered by a user is correctly converted to UTF-8 no
matter what language is at the top of the preferred language list. That is, if I type
U+AC00 and U+AC01,
the URL of the search result shown in the URL bar contains '%ea%b0%80%ea%b0%81'
(the URL-escaped UTF-8 representation of <U+AC00><U+AC01>). Moreover, what I can
type in the URL bar is NOT restricted by the repertoire of the locale charset
(at least under Win2k; I guess the same is true of Moz-Linux, at least under an
ll_CC.UTF-8 locale).
However, the search result page is all mangled (the result itself appears to be
correct, though, provided the actual search engine used, as opposed to the meta
search server, supports UTF-8).
Changing the character coding (to EUC-KR) doesn't help. With 'ko' at the top of
the list, the search result (for a Korean word) is properly rendered.

Given this, the problem is not on Mozilla's side but on the server side.
The server is converting the search result into the legacy MIME charset that is
primarily associated with the language at the top of the list. If it's English,
the 'search server' assumes the results are in ISO-8859-1 although they're
actually in EUC-KR. Converting EUC-KR to UTF-8 on the assumption that it's
ISO-8859-1 leads to a lot of question marks. Two things have to be done: 1. The 'meta
search server' should store everything in UTF-8 in its DB. 2. When sending
back the result, it should just hand over the result without any conversion,
regardless of the preferred language setting. These will make multilingual
search possible.

The search in the sidebar behaves differently, and for this Mozilla is also to
blame because Mozilla is not converting the input to UTF-8. It only works when
the language of the keywords entered matches the language at the top of the preferred
language list.

This is definitely an item for the I18N release notes. To make search in language
'X' work correctly, that language has to be at the top of the preferred language
list in Pref|Navigator|Language.

Matt, can you move up zh(-CN) or zh-TW to the top and see what you get? 

re: comment #12
> On BeOS...
> The URL search for the same string returned no results:
> http://www.google.com/search?q=%3F%3F%3F%3F&sourceid=mozilla-search

Is it still the case that Hebrew characters typed in the URL bar turn to '?'
(U+003F) even with Hebrew at the top of your preferred lang. list?

What's the locale under which you  run Mozilla (if BeOS has such a thing..)? 
It might have to do with Unicode-based system (Win2k/XP and Linux with UTF-8
locale) vs legacy encoding based system (Win9x/ME and Linux with locales using
legacy encodings).

With the Google sherlock file updated, the search sidebar works perfectly well for
Google regardless of what's at the top of the preferred lang. list.
I tested en-US Mozilla under Win2k(KO) with zh-CN at the top of the pref.
lang. list. Both a Korean word and a Greek word worked well with Google. The Greek
word contains a letter NOT representable in EUC-KR: what I tried is 'Καλωσήλθατε'.
CJK legacy character sets cover modern Greek letters without diacritic marks, but
don't cover those with diacritic marks such as 'ή' (U+03AE, eta with tonos).

However, search in the location(URL) bar doesn't work so well. When I typed
'가각' (U+AC00, U+AC01. set View|Character Coding to UTF-8 to see the word)
in the location bar with zh-CN at the top of the pref. lang list, I got no
result with the URL in the location bar that reads:

http://search-intl.netscape.com/zh-cn/google.tmpl?
cp=clkzhcnsrp&charset=UTF-8&search=%EA%B0%80%EA%B0%81&
lr=lang_zh-CN

'%EA%B0%80%EA%B0%81' is the correct UTF-8 representation of '가각'(U+AC00,
U+AC01) so that the URL seems to be right. It's most likely that google.tmpl
at http://search-intl.netscape.com is to blame. It's assuming that
lang=zh-CN means that the character repertoire should be restricted to
that of GB2312. 
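To illustrate what "restricting the repertoire" to a legacy charset does, here is a minimal sketch using Python's built-in codecs. It is not the server's code, which nobody in this bug has seen; it only shows that forcing text through GB2312 or EUC-KR with replacement turns unrepresentable characters into question marks.

```python
# Test strings taken from this comment.
hangul = "\uac00\uac01"   # the Korean syllables typed in the location bar
eta_tonos = "\u03ae"      # Greek eta with tonos
alpha = "\u03b1"          # plain Greek alpha, no diacritic

# Hangul syllables are outside GB2312, so each one degrades to '?'.
hangul_in_gb2312 = hangul.encode("gb2312", errors="replace")

# EUC-KR covers unaccented Greek letters but not eta with tonos.
eta_in_euckr = eta_tonos.encode("euc-kr", errors="replace")
alpha_in_euckr = alpha.encode("euc-kr")  # encodes without any replacement

print(hangul_in_gb2312)  # b'??'
print(eta_in_euckr)      # b'?'
print(alpha_in_euckr)
```

A server that keeps the query in UTF-8 end to end never hits this replacement step.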


With 'ko' at the top, I expected '가각' in the location bar
to work fine. I was surprised to find that it does not. Note that the URL
below has a different format from the one that appeared with zh-CN as the most
preferred lang. Notably, 'ko/' is missing before 'google.tmpl' and '&lr=lang_ko'
is missing after search. 

http://search-intl.netscape.com/google.tmpl?
cp=clkkosrp&charset=UTF-8&all=yes&cat=World/Korean
&search=%EA%B0%80%EA%B0%81

When I manually fixed up the url as follows, it worked.


http://search-intl.netscape.com/ko/google.tmpl?
cp=clkkosrp&charset=UTF-8&cat=World/Korean&search=%EA%B0%80%EA%B0%81&lr=lang_ko

So, this problem with Korean  has to be fixed on the Mozilla's side. 

Next I put Greek(el) at the top of my pref. lang. list and tried 
'Καλωσήλθατε'. The search result seems to be correct, but the result
looked totally garbled. The URL used was  

http://search.netscape.com/nscp_results.adp?
query=%ce%9a%ce%b1%ce%bb%cf%89%cf%83%ce%ae%ce%bb%ce%b8%ce%b1%cf%84%ce%b5
&source=NSCPRedirect

The url-escaped string after query= is the correct representation of 
'Καλωσήλθατε'. 

http://search-intl.netscape.com/el/google.tmpl?
cp=clkelsrp&charset=UTF-8
&search=%ce%9a%ce%b1%ce%bb%cf%89%cf%83%ce%ae%ce%bb%ce%b8%ce%b1%cf%84%ce%b5&
lr=lang_el

Greek was not so lucky, and fixing up the URL as above didn't work.
So, this is another 'meta search server' issue. There's no
'el/google.tmpl' for Greek. I don't know why the 'meta search server' cannot simply
fall back to the English version if the localized version of 'google.tmpl' is not
available on the server. Google supports a large number of languages, and the 'meta
search server' should be able to be a bridge between Google's multilingual
search and the location bar.



> With 'ko' at the top, I expected '가각' in the location bar
> to work fine. I was suprised to find that it does not. 

  Somehow it began to work (with /ko/google.tmpl?....)

> So, this problem with Korean  has to be fixed on the Mozilla's side

  This turned out to be wrong. Most, if not all, fixes have to be done on the
server side (keyword.netscape.com). keyword.netscape.com determines which 'meta
server' to call with what parameters depending on the value of Accept-Lang http
header (that comes from the pref. lang. list of a client) and maybe other
parameters handed over from Mozilla. 

 
  I don't know how keyword.netscape.com determines which meta-search server to
redirect incoming requests to based on accept-lang. (can it be configurable on
the client side?). There seem to be three classes of 'meta search servers':

1. http://search-intl.netscape.com/ll-CC/google.tmpl : This one works well if
'll-CC' matches the first element in Accept-Lang. However, this one seems to be
used only when one of CJK lang. is at the top of the pref. lang. list. Even when
that's the case, there's a problem. It makes an invalid association between
ll-CC and MIME charset and replaces characters outside the repertoire of the
associated MIME charset with question marks. That is, when I include eta with
tonos (ή) with ko as my pref. language, it becomes '?'. This one should be
easiest to fix because google supports multilingual search very well and the
sidebar search already works well.  Perhaps, this is a server-side complement of
 the fix for bug 145375 (which is done on the client-side.)  

2. The second category is completely broken. 
www.netscape.fr (used with fr as my pref. language) and suche.netscape.de (for
German). They seem to interpret a UTF-8 sequence as a Windows-1252 sequence: when I
gave '가' (U+AC00 : 0xEA 0xB0 0x80), it searched for U+00EA, U+00B0, U+0080
(ê°€) instead. This means that they don't even work for French and German
keywords if there's even a single character outside US-ASCII. I just tried
Österreich with 'de' as my pref. language, and suche.netscape.de looked for
Ã–sterreich instead. Note that Ö in UTF-8 is 0xC3 0x96, which turns into Ã– when
interpreted as Windows-1252.

3. The third category is search.netscape.com/nscp_results.adp. It  appears that
it's   used when the first element in Accept-Lang is English or other languages
for which there's no dedicated meta-search server. At the moment, the latter
group includes Russian and Greek among many other languages. This is a curious case.

a. With Russian or Greek as my pref. lang.

When I give keywords not covered by US-ASCII, the search script running there
interprets incoming UTF-8 sequences correctly as UTF-8, judging from the fact
that the pre-filled search box (for retry) in the result page preserves the
input string intact. It also comes up with some relevant hits. For instance, it
returns sites like  http://www.vienna.at for Österreich with 'ru' as my pref.
lang. For 'Καλωσήλθατε' with Greek, some Greek sites are returned. However,
characters outside US-ASCII are all rendered with question marks.  If I try a
Chinese/Japanese/Korean keyword, a couple of hits in the first page are relevant
while others appear to be off the mark.  

A really funny thing happened when I gave 'Österreich' with the Russian pref. and
manually switched to Windows-1252. The prefilled keyword for retry turned from
Österreich to Ã–sterreich, which is perfectly understandable. The strange thing is
that there is a mix of hits, some with Österreich and the others with Ã–sterreich.
Apparently, what's stored in the DB for search.netscape.com is a mixture of data
in UTF-8 (or a legacy encoding with the proper encoding tag) and data in a legacy
encoding (with no or a wrong encoding tag).

The simplest fix (at least when Google is the preferred search engine for the
sidebar search) may be to make keyword.netscape.com redirect all keyword searches
to search-intl.netscape.com/xx/google.tmpl instead of the lang-specific ones (that
don't even work for their target languages) and search.netscape.com/nscp_results.adp.
And, needless to say, the google.tmpl script should not restrict the repertoire to
that of legacy encodings. Instead, it should allow any character in Unicode.
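The misinterpretation described for the second category of servers (UTF-8 bytes read as Windows-1252) is easy to reproduce. The helper name below is hypothetical and the sketch only models the byte-level mix-up, not anything the servers actually run.

```python
def misread_as_cp1252(text: str) -> str:
    """Encode text as UTF-8, then decode the bytes as if they were Windows-1252.

    errors="replace" guards against the few byte values that Windows-1252
    leaves undefined; none of them occur in the examples below.
    """
    return text.encode("utf-8").decode("cp1252", errors="replace")


german = misread_as_cp1252("\u00d6sterreich")  # O with diaeresis + "sterreich"
korean = misread_as_cp1252("\uac00")           # the Hangul syllable U+AC00

print(german)
print(korean)
```

The first result starts with A-tilde followed by an en dash (bytes 0xC3 0x96), and the second is the three-character sequence from bytes 0xEA 0xB0 0x80, matching what the servers searched for.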
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.6b) Gecko/20031120

I have a probably related problem:

When I try to search (from both the Sidebar and the address bar, but not the additional
MozillaPL XUL applet) for a word with Polish diacritical chars in it, it gets
messed up:

word: moździerz

Address bar/Sidebar (broken)
http://www.google.com/search?q=mo%25u017Adzierz&sourceid=mozilla-search&start=0&start=0&ie=utf-8&oe=utf-8
Google XUL applet (works)
http://www.google.com/search?q=mo%C5%BAdzierz&ie=utf8&oe=utf8&sourceid=mozilla-xul
No problem in Mozilla Firebird (with default-charset set to iso8859-2):
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6b) Gecko/20031206 Firebird/0.7+

Got:
http://www.google.com/search?q=mo%C5%BAdzierz&sourceid=mozilla-search&start=0&start=0&ie=utf-8&oe=utf-8
Sorry for spam - I triplechecked and the browser had ISO8859-1 set as default
charset.
Setting it to ISO8859-2 doesn't cause the problem to show up though.
As I wrote in comment #39, it's still broken (in some cases, it works while in
other cases it doesn't) NOT because Mozilla (as a client) does anything wrong
BUT because keyword.netscape.com (search.netscape.com) is broken. Presumably,
search.netscape.com/keyword.netscape.com is not under the control of mozilla.org
anymore. 

asa, I'm sorry to bother you, but what's mozilla.org's plan for the keyword
server(s)? There's nothing we can do on the 'client side' and a relatively
simple fix on the server side would fix the problem (comment #39). I'm tempted
to change the product field to 'mozilla.org'. For better tracking, I'm
assigning this to myself, but it should eventually be reassigned to someone who can
fix things on the server side.

P.S.  Everyone who wants to post to this bug has to set the character coding in
View menu to UTF-8 _before_ posting to avoid characters outside the repertoire
of the current character encoding turn to NCRs (&#12345;) as in comment #34.


re: comment #40. That was a 'transitive' bug. We fixed our escape/unescape to be
compliant with the ECMAScript standard (bug 44272), but hadn't fixed all our
__misuses__ of escape/unescape (bug 225695). Those problems have since been
addressed, so 1.6b should be fine in that respect.
Assignee: nhottanscp → jshin
Status: ASSIGNED → NEW
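The broken query from comment #40 (q=mo%25u017Adzierz) looks like the output of an ECMAScript-style escape() that was then percent-encoded a second time. The sketch below rests on that assumption; the thread only says escape/unescape was misused. legacy_escape is a hypothetical Python emulation of escape(), not Mozilla's code.

```python
from urllib.parse import quote

# Characters that ECMAScript escape() leaves untouched.
_ESCAPE_SAFE = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789@*_+-./"
)


def legacy_escape(text: str) -> str:
    """Emulate escape(): %XX for code units below 256, %uXXXX otherwise."""
    data = text.encode("utf-16-be", errors="surrogatepass")
    out = []
    for i in range(0, len(data), 2):
        unit = int.from_bytes(data[i:i + 2], "big")  # one UTF-16 code unit
        ch = chr(unit)
        if ch in _ESCAPE_SAFE:
            out.append(ch)
        elif unit < 256:
            out.append(f"%{unit:02X}")
        else:
            out.append(f"%u{unit:04X}")
    return "".join(out)


word = "mo\u017adzierz"             # the Polish word from comment #40
escaped = legacy_escape(word)       # contains a %uXXXX sequence
double_escaped = quote(escaped)     # escaping again turns '%' into '%25'
utf8_escaped = quote(word)          # plain UTF-8 percent-encoding

print(double_escaped)  # mo%25u017Adzierz
print(utf8_escaped)    # mo%C5%BAdzierz
```

The first output matches the broken address bar URL and the second matches the working one.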
> to avoid characters outside the repertoire
> of the current character encoding turn to NCRs (&#12345;) as in comment #34.

 to avoid turning characters outside the repertoire of the current character
encoding to NCRs (&#21308;) as in comment #34. 
Status: NEW → ASSIGNED
Attached patch: a patch
Because we're not sure of the value of setting up a separate keyword server at
mozilla.org, and it's too late for 1.6 even if we decide to do that, we'd better
take the simple way out by setting 'keyword.URL' to Google.
Should we use Google's 'feeling lucky' search as Firebird does? In this patch, I'm
using 'the plain google search'.
I think this should be fixed in both 1.4.2 and 1.6.

chofmann, what do you think?  I guess you favor setting up our own server, but
as you wrote it's too late for 1.6. As for fixing things on AOL servers, I can
only guess it's a rather simple fix, but can't be sure because I have never seen
the code on that side. Therefore, making the default keyword.URL point to Google
seems to be an easy way out.
 
Flags: blocking1.6?
Flags: blocking1.4.2?
Comment on attachment 137625
a patch

asking for r/sr.

I can't quite decide who to ask for r/sr... (I would have asked smontagu for r,
but he's on vacation). 
This is kinda just filling the hole, but should be a lot better than what we
have now.
Attachment #137625 - Flags: superreview?(brendan)
Attachment #137625 - Flags: review?(chofmann)
This would not block the release. Please request approval when you have the
necessary reviews and drivers will consider the fix for inclusion in 1.6.
Flags: blocking1.6?
Flags: blocking1.6-
Flags: blocking1.4.2?
Flags: blocking1.4.2-
Comment on attachment 137625
a patch

Someone test this heavily; code review is not the thing here.

/be
Attachment #137625 - Flags: superreview?(brendan) → superreview+
Thanks for sr.

All the test cases mentioned here (Greek, Russian, Polish, German, Korean,
Japanese, Chinese) and some others I just made up work well as far as I can
tell. Others can test it by setting 'keyword.URL' to
'http://www.google.com/search?ie=UTF-8&oe=utf-8&q=' in about:config and enabling
'keyword' in Edit|Preference|Navigator|Smart Browsing.

See http://www.mozilla.org/docs/end-user/internet-keywords.html for details.
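For anyone testing, the effect of that pref value can be sketched as follows. The prefix string is the one quoted in this comment; the helper name is hypothetical, and how Mozilla itself escapes spaces or punctuation in the keyword is not covered here.

```python
from urllib.parse import quote

# The keyword.URL value suggested in the comment.
KEYWORD_URL = "http://www.google.com/search?ie=UTF-8&oe=utf-8&q="


def keyword_search_url(term: str) -> str:
    """Append the UTF-8 percent-encoded search term to the keyword.URL prefix."""
    return KEYWORD_URL + quote(term, safe="", encoding="utf-8")


# The Korean test string used earlier in this bug (U+AC00, U+AC01).
url = keyword_search_url("\uac00\uac01")
print(url)
```

Because ie=UTF-8 is part of the prefix, the escaped bytes are interpreted as UTF-8 regardless of the preferred language list.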


Blocks: 229262
the patch works for me for French language, thanks
Comment on attachment 137625
a patch

asking the module owner for review
Attachment #137625 - Flags: review?(chofmann) → review?(smontagu)
Comment on attachment 137625
a patch

r=smontagu.
This seems to work well enough out of the box, but I see the %3Fs can still
resurface if the default search engine is reset from the search sidebar, e.g.
to AskJeeves. There may not be much we can do about that.
Attachment #137625 - Flags: review?(smontagu) → review+
fix checked into the trunk. 
Status: ASSIGNED → RESOLVED
Closed: 21 years ago
Resolution: --- → FIXED
Comment on attachment 137625
a patch

asking for a1.6
Attachment #137625 - Flags: approval1.6?
Comment on attachment 137625
a patch

a=asa (on behalf of drivers) for checkin to 1.6
Attachment #137625 - Flags: approval1.6? → approval1.6+
Keywords: fixed1.6
forgot to comment; checked in to 1.6 branch this afternoon.