Closed Bug 25034 (opened 25 years ago, closed 25 years ago)
Location field Search results display CJK links in garbage
Categories: Core :: Internationalization, defect, P3
Tracking: VERIFIED FIXED, target milestone M14
People: Reporter: blee, Assigned: mozilla
Whiteboard: [PDT+] Expected to be fixed by 2/11/2000
Attachments (4 files)
845 bytes, text/html
944 bytes, text/html
2.00 KB, patch
734 bytes, patch
To see this,
1) Type JA or KO keywords (e.g., "mainichi" or "dongailbo") in the location field.
2) Click Search at the end of the location field.
3) Select the proper charset and check the links displayed under Search Results in the Sidebar.
==> CJK links show up as garbage.
Builds: Win32 01-24-09-M14; not available in Mac and Linux builds yet.
Yes, according to the Beta1 criteria, search from the location bar should work
at least for Latin1 and Japanese.
Keywords: beta1
Comment 3•25 years ago
We currently always send out escaped UTF-8. We should replace the escape() call
with the ConvertAndEscape I plan to implement in nsITextToSubURI. Unfortunately,
I haven't completed that implementation in nsTextToSubURI.cpp yet, and the code
is not fully tested.
Reassigning to naoki. Naoki, can you
1. Finish the implementation of nsTextToSubURI and make sure it can be called
from JavaScript, and
2. Work with rjc to use it with the search.
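(For illustration only, a rough sketch of what that replacement could look like from JS once nsITextToSubURI is scriptable; charset, text, and searchURL stand for the existing internet.js variables, and the actual change nhotta tried appears in comment 9 below.)
// Sketch only, assuming the proposed nsITextToSubURI interface is finished
// and callable from JS.
var texttoSubURI = Components.classes["component://netscape/intl/texttosuburi"].createInstance();
texttoSubURI = texttoSubURI.QueryInterface(Components.interfaces.nsITextToSubURI);
searchURL += "&text=" + texttoSubURI.ConvertAndEscape(charset, text);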
Assignee: ftang → nhotta
Status: ASSIGNED → NEW
Comment 4•25 years ago
Comment 5•25 years ago
Task number 1 is done: the code is implemented and callable from JS.
According to Frank, the search description file can specify a charset name. With
that charset name, using this interface to escape the URI should enable Japanese
search. If it's just a small change, I may try it on my machine with Japanese
data. Let me know.
Status: NEW → ASSIGNED
Comment 7•25 years ago
Reassigning to rjc. I checked that nsTextToSubURI is working for both ISO-8859-1
and Shift_JIS.
Assignee: nhotta → rjc
Status: ASSIGNED → NEW
Assignee
Comment 8•25 years ago
nhotta, I have no idea what needs to be done with this bug. Care to provide
some more details?
Comment 9•25 years ago
I made the following change to a JS file and tried a Japanese search.
Index: internet.js
===================================================================
RCS file: /cvsroot/mozilla/xpfe/components/search/resources/internet.js,v
retrieving revision 1.21
diff -c -r1.21 internet.js
*** internet.js 1999/12/24 21:14:24 1.21
--- internet.js 2000/02/03 18:22:57
***************
*** 479,485 ****
return(false);
}
! searchURL += "&text=" + escape(text);
dump("\nInternet Search URL: " + searchURL + "\n");
// set text in results pane
--- 479,490 ----
return(false);
}
! var charset = "Shift_JIS";
! var texttoSubURI = Components.classes["component://netscape/intl/texttosuburi"].createInstance();
! texttoSubURI = texttoSubURI.QueryInterface(Components.interfaces.nsITextToSubURI);
! searchURL += "&text=" + texttoSubURI.ConvertAndEscape(charset, text);
!
! // searchURL += "&text=" + escape(text);
dump("\nInternet Search URL: " + searchURL + "\n");
// set text in results pane
Looking at the debug dump results, it was using UTF-8 before the change; after
the change it encodes the text as Shift_JIS. Once it gets the correct charset,
the search URL can be encoded in that charset properly.
Before the change:
Internet Search URL: internetsearch:engine=NC:SearchCategory?engine=urn:search:engine:1&text=%E3%81%82%E3%81%84
doSearch done.
After the change:
Internet Search URL: internetsearch:engine=NC:SearchCategory?engine=urn:search:engine:1&text=%82%A0%82%A2
doSearch done.
So I think sending the correct search string can be done with the I18N
interface. But this bug is about the display of the result URLs. The result
URLs, which are encoded in the server charset (e.g., Shift_JIS), also need to be
converted to a Unicode string before being displayed. If necessary, I18N needs
to provide an additional interface to unescape and convert to Unicode.
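(As an illustration of that direction, a minimal sketch of converting an escaped result URL back to Unicode for display, using the UnEscapeAndConvert method that gets added in comment 17; resultURL and serverCharset are hypothetical names.)
// Sketch only: resultURL is a %xx-escaped URL in the server charset
// (e.g. Shift_JIS), and serverCharset is that charset's name.
var texttoSubURI = Components.classes["component://netscape/intl/texttosuburi"].createInstance();
texttoSubURI = texttoSubURI.QueryInterface(Components.interfaces.nsITextToSubURI);
var displayURL = texttoSubURI.UnEscapeAndConvert(serverCharset, resultURL);
// displayURL is now a Unicode string suitable for the Search Results pane.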
Assignee
Comment 10•25 years ago
nhotta, thanks for the additional detail.
I think that the "text" should be encoded in the URL in UTF-8. The reason: the
URL can actually refer to multiple search engines, each of which could
potentially have a different "text" encoding. Note that the "text" option is
only added to the URL once, regardless of how many search engines are being
used.
I'm thinking that we need a way, in native code, to go from the UTF-8-encoded
text to a given charset; we'd do this when constructing the HTTP GET or POST
data. What do you think of that?
Comment 11•25 years ago
rjc, where is the code in which you loop to construct the HTTP URL? Is it in
JavaScript or C++? Is it true that you also need a TEC-to-charset-name
conversion for the Sherlock file? Should that conversion (number to charset
name) be done in JavaScript or C++?
Comment 12•25 years ago
I think we should take out the escape() in the JavaScript and put our
replacement into the loop for the per-request HTTP construction. This can only
be done if we do not convert PRUnichar* to char* until the place where we
construct the HTTP request. That way, we do not need escape and unescape.
Assignee
Comment 13•25 years ago
The URL is constructed in JavaScript, so the "text" should really be encoded via
JavaScript as well.
The actual code which constructs the HTTP GET/POST requests is in C++. The
TEC->charset mapping is probably going to have to be in C++ as well.
Comment 14•25 years ago
>the "text" should really be encoded via JavaScript as well.
Does this mean the C++ code receives a URL-encoded UTF-8 string?
I am going to check in an additional function to unescape and convert to
Unicode. In the C++ code, that can be used first, and the result can then be
converted to the server charset and URL-escaped again before it's sent to the
server (lots of conversions).
Regarding the TEC->charset mapping code, is that needed for beta1? Can a charset
name be used as an option in addition to a TEC code?
Assignee
Comment 15•25 years ago
>Does this mean the C++ code receives a URL-encoded UTF-8 string?
Yes.
One trick is that users can bookmark these search URLs. If they open up the
equivalent search folder in their bookmarks, all that's sent to the RDF Search
datasource is this magical URL.
>Regarding the TEC->charset mapping code, is that needed for beta1?
Yes, although for beta1 we probably don't need a full/complete mapping; we can
just have a simple map of the languages we plan on shipping search datasets
for. That should be a fairly small list, I'd presume.
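(Not the actual mapping code, just a sketch of the kind of small TEC-number-to-charset table being described here; the two numbers are the ones that come up later in this bug, and the UTF-8 fallback is an assumption.)
// Sketch only: map a Sherlock resultEncoding (Apple TEC number) to a charset
// name; a real table would cover each language we ship search datasets for.
function charsetFromTEC(resultEncoding)
{
  switch (resultEncoding) {
    case 2336: return "EUC-JP";     // kTextEncodingEUC_JP   = 0x0920
    case 2561: return "Shift_JIS";  // kTextEncodingShiftJIS = 0x0A01
    default:   return "UTF-8";      // assumed fallback
  }
}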
Comment 16•25 years ago
Comment 17•25 years ago
I checked in an additional function to unescape and convert to Unicode.
Example of the usage:
unicodeString = texttoSubURI.UnEscapeAndConvert("Shift_JIS", "%82%A0%82%A2");
Assignee
Updated•25 years ago
Status: NEW → ASSIGNED
Assignee
Comment 18•25 years ago
Hi nhotta, do you have any example code of unescaping and then converting to a
particular charset (instead of Unicode)?
Comment 19•25 years ago
No, but you can use UnEscapeAndConvert to convert to Unicode, then use
ConvertAndEscape to convert to any charset and escape.
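(A short sketch of that round trip, assuming an existing texttoSubURI instance as in comment 9; targetCharset is a hypothetical name for the search engine's charset.)
// Sketch only: escaped Shift_JIS input -> Unicode -> escaped output in the
// target charset, via the two nsITextToSubURI methods named above.
var unicodeText = texttoSubURI.UnEscapeAndConvert("Shift_JIS", "%82%A0%82%A2");
var reEscaped = texttoSubURI.ConvertAndEscape(targetCharset, unicodeText);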
Assignee
Comment 20•25 years ago
Great! Those methods are exactly what I wanted.
Assignee
Updated•25 years ago
Whiteboard: [PDT+] → [PDT+] Expected to be fixed by 2/11/2000
Assignee
Comment 21•25 years ago
Small problem: the header file "nsITextToSubURI.h" doesn't appear to be
exported properly out of mozilla/intl/uconv into $(DIST), so I can't use it.
Can you fix that up? It looks like it's a problem on at least Mac and Windows
(so, probably Unix as well).
Assignee
Comment 22•25 years ago
Assignee
Comment 23•25 years ago
nhotta, I've attached a diff for
mozilla/xpfe/components/search/src/nsInternetSearchService.cpp for converting
from escaped UTF-8 to Unicode to the appropriate charset for a given search
engine.
After you fix the problem with getting "nsITextToSubURI.h" into $DIST, could you
apply the patch and tell me what you think?
Thanks!
Comment 24•25 years ago
About "nsITextToSubURI.h", Mac exports to a wrong directory dist:locale instead
of dist:uconv so I need to change the project file.
For windows and unix, the file is in dist/include on my machine. Could you check
again?
Assignee
Comment 25•25 years ago
Yep, I rechecked my Windows build, and it's there. :^) [As you pointed out, just
the Mac needs tweaking.]
Comment 26•25 years ago
I applied the patch and now I can use Japanese text for search and the results
are displayed in Japanese, thanks.
Comment 27•25 years ago
I fixed the Mac project and got it reviewed, but the tree is closed, so I am not
checking in today; probably tomorrow.
Comment 28•25 years ago
I checked in the Mac project file last night.
Assignee
Comment 29•25 years ago
Fixed. Thanks again, nhotta.
Status: ASSIGNED → RESOLVED
Closed: 25 years ago
Resolution: --- → FIXED
Reporter
Comment 30•25 years ago
Still happens in Win32 02-11-09-M14 bld. Was the fix checked in to this bld?
Comment 31•25 years ago
Clearing the Fixed resolution until you know for sure whether it was checked in.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee
Comment 32•25 years ago
Jan, please don't just re-open bugs without doing testing.
Please test a later build.
Status: REOPENED → RESOLVED
Closed: 25 years ago → 25 years ago
Resolution: --- → FIXED
Comment 33•25 years ago
blee, you need to have some Japanese data files in order to do a Japanese
search. Please contact me; I have the data.
Reporter
Comment 34•25 years ago
Verified (with nhotta's data file) fixed in Win32 02-17-09 bld.
Status: RESOLVED → VERIFIED
Comment 35•25 years ago
We tried 5 Japanese search engines and saw that some of them did not work.
This needs investigation to see whether the problem is in the description file
(by trying it with Sherlock) or in the ID mapping in mozilla.
Comment 36•25 years ago
For a failing site like infoseek, it was sending UTF-8.
Document http://japan.infoseek.com/Titles?qt=%E3%81%84&oq=site%3Ajp&nh=25&col=JW&sv=JA&lk=noframes loaded successfully
The description file for that site has a comment after the charset definition:
resultEncoding=2336 # kTextEncodingEUC_JP = 0x0920
After I removed the comment, it started to work.
Is this a bad description file that is not supported, or does our parser have to
change to read it?
Assignee
Comment 37•25 years ago
nhotta, this is very strange. Using Infoseek works just dandy for me (on my Mac
running Mac OS 9) with no changes.
Assignee
Comment 38•25 years ago
nhotta, ignore my last comment. My Mac's Infoseek-Japan file doesn't have a #
comment on the same line. (I'll add it and see what happens.)
Assignee
Comment 39•25 years ago
Note: according to http://developer.apple.com/technotes/tn/tn1141.html:
Lines beginning with a # character are ignored and are treated as comments.
Assignee
Comment 40•25 years ago
nhotta, where did you get your search file from? Did you create it yourself, or
get it from somewhere off the net?
Assignee
Comment 41•25 years ago
Assignee
Comment 42•25 years ago
nhotta, I've attached a small diff to this bug. Could you apply it and see if
your search files with comments begin to work correctly for you?
This proposed patch makes us more lenient (i.e., slightly less strict in terms
of following the spec at "http://developer.apple.com/technotes/tn/tn1141.html")
with regard to handling comments. With the patch applied, any line can have a
comment following a valid construct, and we'll notice it and ignore the comment.
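(The attached patch itself is a C++ diff and isn't reproduced here; purely as an illustration of the idea, a small JS-style sketch of accepting a trailing # comment after a valid construct.)
// Sketch only, not the attached patch. Given a Sherlock plugin line such as
//   resultEncoding=2336 # kTextEncodingEUC_JP = 0x0920
// keep the value and ignore everything from the '#' onward.
function stripTrailingComment(line)
{
  var hash = line.indexOf("#");
  if (hash >= 0)
    line = line.substring(0, hash);
  return line.replace(/^\s+|\s+$/g, "");  // trim surrounding whitespace
}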
Comment 43•25 years ago
I applied the patch, and now infoseek works without the modification.
Now 4 out of 5 of the search engines I have are working on my PC. I found that
the one which was not working has a problem in its data:
resultEncoding=2336 # kTextEncodingShiftJIS = 0x0A01
The comment says Shift_JIS, but the actual encoding number, 2336, is EUC-JP
(kTextEncodingShiftJIS = 0x0A01 would be 2561). It started to work after I
changed the number to 2561.
So far, I have tested on Windows using the data files copied from the Macintosh.
I will try on my Mac using the same data after I refresh my build.
Comment 44•25 years ago
On the Macintosh, I saw that UTF-8 is actually sent.
I tried excite and lycos (both working fine on Windows). The query string was
being sent as UTF-8 (%E7%8A%AC) instead of Shift_JIS (%8C%A2). As a result, it
searches for something else and the text is sometimes garbage.
Document http://www.excite.co.jp/search.gw?lk=excite_jp&c=japan&s=%E7%8A%AC loaded successfully
Document: Done (10.721 secs)
Document http://www-jp.lycos.com/cgi-bin/pursuit?mtemp=main&query=%E7%8A%AC&cat=jp&maxhits=25&results=default&encoding=Shift_JIS loaded successfully
Document: Done (10.52 secs)
Comment 45•25 years ago
In my Macintosh test, I didn't apply rjc's patch, but those two searches were
working on the PC without that patch.
Comment 46•25 years ago
We tried the same test on blee's Macintosh with yesterday's build, and it works
fine for Japanese. I think my local build has some problems. Ignore my previous
comments about the Macintosh problem. Sorry for the confusion.
Assignee
Comment 47•25 years ago
I'll check in my changes (for # comment handling) when the tree opens up for
general M15 work. Thanks, nhotta!