Closed Bug 25034 Opened 25 years ago Closed 25 years ago

Location field Search results display CJK links in garbage

Categories

(Core :: Internationalization, defect, P3)

x86
Windows NT
defect

Tracking

()

VERIFIED FIXED

People

(Reporter: blee, Assigned: mozilla)

Details

(Whiteboard: [PDT+] Expected to be fixed by 2/11/2000)

Attachments

(4 files)

To see this:
1) Type JA or KO keywords (e.g., "mainichi" or "dongailbo") in the location field.
2) Click Search at the end of the location field.
3) Select the proper charset and check the links displayed under Search Results in the Sidebar.
==> CJK links show up in garbage.
Builds: Win32 01-24-09-M14; n/a in Mac and Linux builds yet.
QA Contact: teruko → blee
Is search a beta feature ?
Status: NEW → ASSIGNED
Target Milestone: M14
Yes, according to the Beta1 criteria, search from location bar should work at least for Latin1 and Japanese.
Keywords: beta1
We currently always send out escaped UTF-8. We should replace the escape() call with the ConvertAndEscape I plan to implement in nsITextToSubURI. Unfortunately, I haven't completed that implementation in nsTextToSubURI.cpp yet, and the code is not fully tested. Reassigning to Naoki. Naoki, can you 1. finish the implementation of nsTextToSubURI and make sure it can be called from JavaScript, and 2. work with rjc to use it with the search.
Assignee: ftang → nhotta
Status: ASSIGNED → NEW
Task number 1 is done; the code is implemented and callable from JS. According to Frank, the search description file can specify a charset name. Using this interface with that charset name to escape the URI should enable Japanese search. If that's just a small change, I may try it on my machine with Japanese data. Let me know.
Status: NEW → ASSIGNED
PDT+
Whiteboard: [PDT+]
Reassigning to rjc. I verified that nsTextToSubURI works for both ISO-8859-1 and Shift_JIS.
Assignee: nhotta → rjc
Status: ASSIGNED → NEW
nhotta, I have no idea what needs to be done with this bug. Care to provide some more details?
I made the following change to a JS file and tried Japanese search:

Index: internet.js
===================================================================
RCS file: /cvsroot/mozilla/xpfe/components/search/resources/internet.js,v
retrieving revision 1.21
diff -c -r1.21 internet.js
*** internet.js	1999/12/24 21:14:24	1.21
--- internet.js	2000/02/03 18:22:57
***************
*** 479,485 ****
  		return(false);
  	}
! 	searchURL += "&text=" + escape(text);
  	dump("\nInternet Search URL: " + searchURL + "\n");
  	// set text in results pane
--- 479,490 ----
  		return(false);
  	}
! 	var charset = "Shift_JIS";
! 	var texttoSubURI = Components.classes["component://netscape/intl/texttosuburi"].createInstance();
! 	texttoSubURI = texttoSubURI.QueryInterface(Components.interfaces.nsITextToSubURI);
! 	searchURL += "&text=" + texttoSubURI.ConvertAndEscape(charset, text);
!
! 	// searchURL += "&text=" + escape(text);
  	dump("\nInternet Search URL: " + searchURL + "\n");
  	// set text in results pane

Looking at the debug dump results: before the change it was using UTF-8; after the change it encodes the text as Shift_JIS. Once it gets the correct charset, the search URL can be encoded in that charset properly.

Before the change:
Internet Search URL: internetsearch:engine=NC:SearchCategory?engine=urn:search:engine:1&text=%E3%81%82%E3%81%84
doSearch done.

After the change:
Internet Search URL: internetsearch:engine=NC:SearchCategory?engine=urn:search:engine:1&text=%82%A0%82%A2
doSearch done.

So I think sending the correct search string can be done with the I18N interface. But this bug is about the display of the result URLs. The result URL, which is encoded in the server charset (e.g., Shift_JIS), also needs to be converted to a Unicode string before being displayed. If necessary, I18N needs to provide an additional interface to unescape and convert to Unicode.
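The two dump lines above differ only in how the same two-character query (あい, "ai") is percent-escaped. A small standalone sketch in plain modern JavaScript (not the XPCOM interface) reproduces both forms; the two-entry Shift_JIS byte table is a hand-rolled illustration, not a real converter:

```javascript
// encodeURIComponent always emits UTF-8 percent-escapes, matching the
// pre-patch dump produced by the old escape(text) call.
const text = "\u3042\u3044"; // あい
const utf8Form = encodeURIComponent(text);

// What ConvertAndEscape("Shift_JIS", text) would produce, illustrated with a
// hand-rolled two-character byte table (a real converter covers the full charset).
const SJIS = { "\u3042": [0x82, 0xa0], "\u3044": [0x82, 0xa2] };
const sjisForm = Array.from(text)
  .flatMap((ch) => SJIS[ch])
  .map((b) => "%" + b.toString(16).toUpperCase().padStart(2, "0"))
  .join("");

console.log(utf8Form); // %E3%81%82%E3%81%84
console.log(sjisForm); // %82%A0%82%A2
```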
nhotta, thanks for the additional detail. I think that the "text" should be encoded in the URL in UTF-8. The reason: the URL can actually refer to multiple search engines, each of which could potentially have a different "text" encoding. Note that the "text" option is only added once to the URL, regardless of how many search engines are being used. I'm thinking that we need a way in native code of going from the UTF-8 encoded text to a given charset; we'd do this when constructing the HTTP GET or POST data. What do you think of that?
rjc, where is the code with the loop that constructs the HTTP URL? Is it in JavaScript or C++? Is it true that you also need a TEC-to-charset-name conversion for the Sherlock file? Should that conversion (number to charset name) be done in JavaScript or C++?
I think we should take out the escape() in the JavaScript and put our replacement into the loop for the per-HTTP-request construction. This can only be done if we do not convert PRUnichar* to char* until the place where we construct the HTTP request. That way, we do not need escape and unescape.
The URL is constructed in JavaScript, so the "text" should really be encoded via JavaScript as well. The actual code which constructs the HTTP GET/POST requests is in C++. The TEC->charset mapping is probably going to have to be in C++ as well.
>the "text" should really be encoded via JavaScript as well.
Does this mean the C++ code receives a URL-encoded UTF-8 string? I am going to check in an additional function to unescape and convert to Unicode. In the C++ code, that can be used first; then the result can be converted to the server charset and URL-escaped again before it's sent to the server (lots of conversions). Regarding the TEC->charset mapping code, is that needed for beta1? Can a charset name be used as an option in addition to the TEC code?
>Does this mean C++ code to received URL encoded UTF-8 string? Yes. One trick is that users can bookmark these search URLs. If they open up the equivalent search folder in their bookmarks, all that's sent to the RDF Search datasource is this magical URL. >Regarding the TEC->charset mapping code, is that needed for beta1? Yes, although for beta1 we probably don't need a full/complete mapping, we can just have a simple map of the languages we plan on shipping search datasets for. That should be a fairly small list, I'd presume.
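The "simple map of the languages we plan on shipping search datasets for" can be sketched as a tiny lookup table. The TEC numbers are Mac OS TextEncoding constants (kTextEncodingEUC_JP = 0x0920 = 2336 and kTextEncodingShiftJIS = 0x0A01 = 2561 appear later in this bug; the ISO Latin-1 entry and the UTF-8 fallback are illustrative assumptions, not Mozilla's actual code):

```javascript
// Minimal TEC (Mac OS TextEncoding number) -> charset-name map covering only
// the languages we'd ship search datasets for. Illustrative sketch only.
const TEC_TO_CHARSET = {
  0x0201: "ISO-8859-1", // kTextEncodingISOLatin1 (513)
  0x0a01: "Shift_JIS",  // kTextEncodingShiftJIS  (2561)
  0x0920: "EUC-JP",     // kTextEncodingEUC_JP    (2336)
};

function charsetForTEC(tec) {
  // Fall back to UTF-8 for unmapped encodings (a hypothetical default).
  return TEC_TO_CHARSET[tec] || "UTF-8";
}

console.log(charsetForTEC(2336)); // EUC-JP
console.log(charsetForTEC(2561)); // Shift_JIS
```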
I checked in an additional function to unescape and convert to Unicode. Example of the usage: unicodeString = texttoSubURI.UnEscapeAndConvert("Shift_JIS", "%82%A0%82%A2");
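What UnEscapeAndConvert does can be sketched in plain JavaScript: percent-unescape the string to raw bytes, then decode those bytes with the named charset. This sketch uses the standard TextDecoder (Shift_JIS decoding needs a full-ICU Node.js build, the default in current releases), not the XPCOM component:

```javascript
// Sketch of UnEscapeAndConvert("Shift_JIS", "%82%A0%82%A2"):
// 1) turn the %XX escapes into raw bytes,
// 2) decode the bytes using the given charset label.
function unEscapeAndConvert(charset, escaped) {
  const bytes = Uint8Array.from(
    escaped.match(/%[0-9A-Fa-f]{2}/g).map((h) => parseInt(h.slice(1), 16))
  );
  return new TextDecoder(charset).decode(bytes);
}

console.log(unEscapeAndConvert("Shift_JIS", "%82%A0%82%A2")); // あい
```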
Status: NEW → ASSIGNED
Hi nhotta, do you have any example code for unescaping and then converting to a particular charset (instead of Unicode)?
No, but you can use UnEscapeAndConvert to convert to Unicode, then use ConvertAndEscape to convert to any charset and escape.
Great! Those method names are exactly what I wanted.
Whiteboard: [PDT+] → [PDT+] Expected to be fixed by 2/11/2000
Small problem: the header file "nsITextToSubURI.h" doesn't appear to be exported properly out of mozilla/intl/uconv into $(DIST), so I can't use it. Can you fix that up? It looks like it's a problem on at least Mac and Windows (so, probably Unix as well).
nhotta, I've attached a diff for mozilla/xpfe/components/search/src/nsInternetSearchService.cpp for converting from escaped UTF-8 to Unicode to the appropriate charset for a given search engine. After you fix the problem with getting "nsITextToSubURI.h" into $DIST, could you apply the patch and tell me what you think? Thanks!
About "nsITextToSubURI.h": the Mac build exports it to the wrong directory, dist:locale instead of dist:uconv, so I need to change the project file. For Windows and Unix, the file is in dist/include on my machine. Could you check again?
Yep, I rechecked my Windows build; it's there. :^) [As you pointed out, just the Mac needs tweaking.]
I applied the patch and now I can use Japanese text for search and the results are displayed in Japanese, thanks.
I fixed the Mac project and got it reviewed, but the tree is closed, so I am not checking in today, probably tomorrow.
I checked in the Mac project file last night.
Fixed. Thanks again, nhotta.
Status: ASSIGNED → RESOLVED
Closed: 25 years ago
Resolution: --- → FIXED
Still happens in the Win32 02-11-09-M14 build. Was the fix checked in to this build?
Clearing Fixed resolution until you know for sure that it was checked in.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Jan, please don't just re-open bugs without doing testing. Please test a later build.
Status: REOPENED → RESOLVED
Closed: 25 years ago
Resolution: --- → FIXED
blee, you need to have some Japanese data files in order to do a Japanese search. Please contact me; I have the data.
Verified (with nhotta's data file) fixed in the Win32 02-17-09 build.
Status: RESOLVED → VERIFIED
We tried 5 Japanese search engines and saw that some of them did not work. This needs investigation to see whether there is a problem in the description file (by trying it with Sherlock) or in the ID mapping in Mozilla.
For a failing site like Infoseek, it was sending UTF-8.
Document http://japan.infoseek.com/Titles?qt=%E3%81%84&oq=site%3Ajp&nh=25&col=JW&sv=JA&lk=noframes loaded successfully
The description file for that site has a comment after the charset definition:
resultEncoding=2336 # kTextEncodingEUC_JP = 0x0920
After I removed the comment, it started to work. Is this a bad description which is not supported, or does our parser have to change to read it?
nhotta, this is very strange. Using Infoseek works just dandy for me (on my Mac running Mac OS 9) with no changes.
nhotta, ignore my last comment. My Mac's Infoseek-Japan file doesn't have a # comment on the same line. (I'll add it and see what happens.)
Note: according to http://developer.apple.com/technotes/tn/tn1141.html: Lines beginning with a # character are ignored and are treated as comments.
nhotta, where did you get your search file from? Did you create it yourself, or get it from somewhere off the net?
nhotta, I've attached a small diff to this bug. Could you apply it and see if your search files with comments begin to work correctly for you? This proposed patch makes us more lenient (i.e., slightly less strict in terms of following the spec at "http://developer.apple.com/technotes/tn/tn1141.html") with regard to handling comments. With the patch applied, any line can have a comment following a valid construct, and we'll notice it and ignore the comment.
I applied the patch and now Infoseek is working without the modification. Now 4 out of 5 search engines I have are working on my PC. I found that the one which was not working has a problem in its data:
resultEncoding=2336 # kTextEncodingShiftJIS = 0x0A01
The comment says Shift_JIS, but the actual encoding number is EUC-JP. It started to work after I changed the number to 2561. So far, I have tested on Windows using the data files copied from a Macintosh. I will try on my Mac using the same data after I refresh my build.
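The lenient comment handling described above, accepting a trailing # comment after a valid construct rather than only whole-line comments, can be sketched as a small line parser. This is an illustration only (the real change is in the C++ search datasource), and parseSherlockLine is a hypothetical helper name:

```javascript
// Parse a "key=value" line from a Sherlock description file, tolerating a
// trailing "# ..." comment after the value. TN1141 itself only defines
// whole-line comments, so the trailing-comment tolerance is the leniency
// the proposed patch adds.
function parseSherlockLine(line) {
  const trimmed = line.trim();
  if (trimmed === "" || trimmed.startsWith("#")) return null; // blank or comment line
  const eq = trimmed.indexOf("=");
  if (eq < 0) return null; // not a key=value construct
  const key = trimmed.slice(0, eq).trim();
  let value = trimmed.slice(eq + 1);
  const hash = value.indexOf("#");
  if (hash >= 0) value = value.slice(0, hash); // drop trailing comment
  return { key, value: value.trim() };
}

console.log(parseSherlockLine("resultEncoding=2336 # kTextEncodingEUC_JP = 0x0920"));
// { key: 'resultEncoding', value: '2336' }
```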
On Macintosh, I saw that UTF-8 is actually sent. I tried Excite and Lycos (both working fine on Windows). The query string was sending UTF-8 (%E7%8A%AC) instead of Shift_JIS (%8C%A2). As a result, it searches for something else and the texts are garbage sometimes.
Document http://www.excite.co.jp/search.gw?lk=excite_jp&c=japan&s=%E7%8A%AC loaded successfully
Document: Done (10.721 secs)
Document http://www-jp.lycos.com/cgi-bin/pursuit?mtemp=main&query=%E7%8A%AC&cat=jp&maxhits=25&results=default&encoding=Shift_JIS loaded successfully
Document: Done (10.52 secs)
In my Macintosh test, I didn't apply rjc's patch, but those two searches were working on the PC without that patch.
We tried the same test on blee's Macintosh with yesterday's build and that's working fine for Japanese. I think my local build has some problems. Ignore my previous comments about the Macintosh problem. Sorry for the confusion.
I'll check in my changes (for # comment handling) when the tree opens up for general M15 work. Thanks, nhotta!