Location field Search results display CJK links in garbage

VERIFIED FIXED in M14

Status

P3
normal
VERIFIED FIXED
19 years ago
19 years ago

People

(Reporter: blee, Assigned: mozilla)

Tracking

Trunk
x86
Windows NT
Points:
---

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [PDT+] Expected to be fixed by 2/11/2000)

Attachments

(4 attachments)

(Reporter)

Description

19 years ago
To see this:
1) Type JA or KO keywords (e.g., "mainichi" or "dongailbo") in the location field.
2) Click Search at the end of the location field.
3) Select the proper charset and check the links displayed under Search Results in
   the Sidebar.
==> CJK links show up as garbage.

blds: Win32 01-24-09-M14, n/a in Mac and Linux blds yet.
(Reporter)

Updated

19 years ago
QA Contact: teruko → blee

Comment 1

19 years ago
Is search a beta feature?
Status: NEW → ASSIGNED
Target Milestone: M14

Comment 2

19 years ago
Yes, according to the Beta1 criteria, search from location bar should work
at least for Latin1 and Japanese.
Keywords: beta1

Comment 3

19 years ago
We currently always send out escaped UTF-8. We should replace the escape() call with
the ConvertAndEscape method I plan to implement in nsITextToSubURI. Unfortunately, I
haven't completed that implementation in nsTextToSubURI.cpp yet, and the code is not
fully tested.

Reassign to naoki. Naoki, can you
1. Finish the implementation of nsTextToSubURI, make sure it can be called from
JavaScript, and
2. Work with rjc to use it with the search.

Assignee: ftang → nhotta
Status: ASSIGNED → NEW

Comment 4

19 years ago
Created attachment 4851 [details]
a test for the interface

Comment 5

19 years ago
Task number 1 is done: the code is implemented and callable from JS.
According to Frank, the search description file can specify a charset name. Using
this interface with that charset name to escape the URI should enable Japanese
search. If that's just a small change, I may try it on my machine with Japanese
data. Let me know.
Status: NEW → ASSIGNED

Comment 6

19 years ago
PDT+
Whiteboard: [PDT+]

Comment 7

19 years ago
Reassign to rjc. I checked that nsTextToSubURI is working for both ISO-8859-1 and
Shift_JIS.
Assignee: nhotta → rjc
Status: ASSIGNED → NEW
(Assignee)

Comment 8

19 years ago
nhotta, I have no idea what needs to be done with this bug.  Care to provide 
some more details?

Comment 9

19 years ago
I made the following change to a JS file and tried Japanese search.

Index: internet.js
===================================================================
RCS file: /cvsroot/mozilla/xpfe/components/search/resources/internet.js,v
retrieving revision 1.21
diff -c -r1.21 internet.js
*** internet.js	1999/12/24 21:14:24	1.21
--- internet.js	2000/02/03 18:22:57
***************
*** 479,485 ****
  		return(false);
  	}
  
! 	searchURL += "&text=" + escape(text);
  	dump("\nInternet Search URL: " + searchURL + "\n");
  
  	// set text in results pane
--- 479,490 ----
  		return(false);
  	}
  
! 	var charset = "Shift_JIS";
! 	var texttoSubURI = Components.classes["component://netscape/intl/texttosuburi"].createInstance();
! 	texttoSubURI = texttoSubURI.QueryInterface(Components.interfaces.nsITextToSubURI);
! 	searchURL += "&text=" + texttoSubURI.ConvertAndEscape(charset, text);
! 
! //	searchURL += "&text=" + escape(text);
  	dump("\nInternet Search URL: " + searchURL + "\n");
  
  	// set text in results pane

Looking at the debug dump results, it was using UTF-8 before the change; after
the change it encodes the text as Shift_JIS. Once it gets a correct charset, the
search URL can be encoded to that charset properly.

Before the change:
Internet Search URL: internetsearch:engine=NC:SearchCategory?engine=urn:search:engine:1&text=%E3%81%82%E3%81%84
doSearch done.
After the change:
Internet Search URL: internetsearch:engine=NC:SearchCategory?engine=urn:search:engine:1&text=%82%A0%82%A2
doSearch done.

So I think sending the correct search string can be done with the I18N
interface. But this bug is about the display of the result URLs. The result URL,
which is encoded in the server charset (e.g., Shift_JIS), also needs to be
converted to a Unicode string before it is displayed. If necessary, I18N needs to
provide an additional interface to unescape and convert to Unicode.
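
The byte-level difference between the two debug dumps can be reproduced with a
short sketch in present-day JavaScript. This is illustrative only and is not the
Mozilla code: percentEscapeBytes is a hypothetical helper, and the Shift_JIS
bytes are copied by hand from the "after" dump because standard JavaScript has
no built-in Shift_JIS encoder.

```javascript
// Hypothetical helper: percent-escape every byte, in the style of the dumps.
// Unlike escape(), it escapes ASCII bytes too; that is fine for this demo.
function percentEscapeBytes(bytes) {
  return Array.from(bytes, (b) =>
    "%" + b.toString(16).toUpperCase().padStart(2, "0")
  ).join("");
}

// The two hiragana characters whose bytes appear in the dumps (U+3042, U+3044).
const text = "\u3042\u3044";

// UTF-8 bytes come from the standard TextEncoder, which only produces UTF-8.
const utf8Escaped = percentEscapeBytes(new TextEncoder().encode(text));

// Shift_JIS bytes for the same text, hard-coded from the "after" dump.
const shiftJisEscaped = percentEscapeBytes(
  Uint8Array.of(0x82, 0xa0, 0x82, 0xa2)
);

console.log(utf8Escaped);     // matches the "before" dump
console.log(shiftJisEscaped); // matches the "after" dump
```

The same characters produce three bytes each in UTF-8 and two bytes each in
Shift_JIS, which is why a server expecting Shift_JIS misreads the UTF-8 query.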
(Assignee)

Comment 10

19 years ago
nhotta, thanks for the additional detail.

I think that the "text" should be encoded in the URL in UTF-8.  The reason:  the
URL can actually refer to multiple search engines, each of which could
potentially have a different "text" encoding.  Note that the "text" option is
only added once into the URL, regardless of how many search engines are being
used.

I'm thinking that we need a way in native code of going from the UTF-8 encoded
text to a given charset; we'd do this when constructing the HTTP GET or POST
data.  What do you think of that?

Comment 11

19 years ago
rjc, where is the code with the loop that constructs the HTTP URL? Is it
in JavaScript or C++? Is it true that you also need a TEC-to-charset-name
conversion for the Sherlock file? Should that conversion (number to charset
name) be done in JavaScript or C++?

Comment 12

19 years ago
I think we should take out the escape() in the JavaScript and put our
replacement into the loop that constructs each HTTP request. This can only be
done if we do not convert PRUnichar* to char* until the place where we construct
the HTTP request. That way, we do not need escape and unescape.
(Assignee)

Comment 13

19 years ago
The URL is constructed in JavaScript, so the "text" should really be encoded via
JavaScript as well.

The actual code which constructs the HTTP GET/POST requests is in C++.  The
TEC->charset mapping is probably going to have to be in C++ as well.

Comment 14

19 years ago
>the "text" should really be encoded via JavaScript as well.
Does this mean the C++ code receives a URL-encoded UTF-8 string?
I am going to check in an additional function to unescape and convert to Unicode.
In the C++ code, that can be used first; then the result is converted to the
server charset and URL-escaped again before it's sent to the server (lots of
conversions).
Regarding the TEC->charset mapping code, is that needed for beta1? Can a charset
name be used as an option in addition to the TEC code?
(Assignee)

Comment 15

19 years ago
>Does this mean the C++ code receives a URL-encoded UTF-8 string?

Yes.

One trick is that users can bookmark these search URLs.  If they open up the
equivalent search folder in their bookmarks, all that's sent to the RDF Search
datasource is this magical URL.

>Regarding the TEC->charset mapping code, is that needed for beta1?

Yes, although for beta1 we probably don't need a full/complete mapping, we can
just have a simple map of the languages we plan on shipping search datasets
for.  That should be a fairly small list, I'd presume.

Comment 16

19 years ago
Created attachment 4892 [details]
Added a test for unescape then convert to unicode.

Comment 17

19 years ago
I checked in an additional function to unescape and convert to Unicode.
Example of the usage:
unicodeString = texttoSubURI.UnEscapeAndConvert("Shift_JIS", "%82%A0%82%A2");
(Assignee)

Updated

19 years ago
Status: NEW → ASSIGNED
(Assignee)

Comment 18

19 years ago
Hi nhotta, do you have any example code for unescaping and then converting to a
particular charset (instead of Unicode)?

Comment 19

19 years ago
No, but you can use UnEscapeAndConvert to convert to unicode then use 
ConvertAndEscape to convert to any charset and escape.
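
The two-step chaining described here can be sketched in present-day JavaScript.
This is an analogue only, not the nsITextToSubURI implementation:
unescapeToBytes and unescapeAndConvert are hypothetical names, and because
standard JavaScript can only encode to UTF-8, the second step re-escapes as
UTF-8 rather than as an arbitrary charset. It also assumes a runtime whose
TextDecoder supports the "shift_jis" label.

```javascript
// Hypothetical helper: turn "%82%A0..." into raw bytes without assuming a charset.
function unescapeToBytes(escaped) {
  const bytes = [];
  for (let i = 0; i < escaped.length; i++) {
    const hex = escaped.slice(i + 1, i + 3);
    if (escaped[i] === "%" && /^[0-9A-Fa-f]{2}$/.test(hex)) {
      bytes.push(parseInt(hex, 16));
      i += 2; // skip the two hex digits just consumed
    } else {
      // Assumption: unescaped characters are plain ASCII.
      bytes.push(escaped.charCodeAt(i) & 0xff);
    }
  }
  return Uint8Array.from(bytes);
}

// Rough analogue of UnEscapeAndConvert(charset, escapedText).
function unescapeAndConvert(charset, escaped) {
  return new TextDecoder(charset).decode(unescapeToBytes(escaped));
}

// Step 1: Shift_JIS-escaped text to a Unicode string.
const unicode = unescapeAndConvert("shift_jis", "%82%A0%82%A2");

// Step 2: Unicode string back to an escaped form (UTF-8 only in standard JS).
const reEscaped = encodeURIComponent(unicode);

console.log(reEscaped);
```

Going through a Unicode string in the middle means neither step needs to know
about the other's charset, which is the point of composing the two methods.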
(Assignee)

Comment 20

19 years ago
Great!  Method names are exactly what I wanted.
(Assignee)

Updated

19 years ago
Whiteboard: [PDT+] → [PDT+] Expected to be fixed by 2/11/2000
(Assignee)

Comment 21

19 years ago
Small problem:  the header file  "nsITextToSubURI.h"  doesn't appear to be 
exported properly out of mozilla/intl/uconv into $(DIST), so I can't use it.

Can you fix that up?  It looks like it's a problem on at least Mac and Windows
(so, probably Unix as well).
(Assignee)

Comment 22

19 years ago
Created attachment 5094 [details] [diff] [review]
proposed diff for converting from escaped-UTF-8 to Unicode to appropriate charset
(Assignee)

Comment 23

19 years ago
nhotta, I've attached a diff for mozilla/xpfe/components/search/src/
nsInternetSearchService.cpp  for converting from escaped-UTF-8 to Unicode to the 
appropriate charset for a given search engine.

After you fix the problem with getting "nsITextToSubURI.h" into $DIST, could you
apply the patch and tell me what you think?

Thanks! 

Comment 24

19 years ago
About "nsITextToSubURI.h": the Mac build exports it to the wrong directory
(dist:locale instead of dist:uconv), so I need to change the project file.
For Windows and Unix, the file is in dist/include on my machine. Could you check
again?
(Assignee)

Comment 25

19 years ago
Yep, I rechecked my Windows build, it's there.  :^)  [As you pointed out, just the
Mac needs tweaking.]

Comment 26

19 years ago
I applied the patch and now I can use Japanese text for search and the results 
are displayed in Japanese, thanks.

Comment 27

19 years ago
I fixed the Mac project and got it reviewed, but the tree is closed, so I am not
checking in today; probably tomorrow.

Comment 28

19 years ago
I checked in the Mac project file last night.
(Assignee)

Comment 29

19 years ago
Fixed. Thanks again, nhotta.
Status: ASSIGNED → RESOLVED
Last Resolved: 19 years ago
Resolution: --- → FIXED
(Reporter)

Comment 30

19 years ago
Still happens in Win32 02-11-09-M14 bld. Was the fix checked in to this bld?

Comment 31

19 years ago
Clearing the Fixed resolution until you know for sure whether it was checked in.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(Assignee)

Comment 32

19 years ago
Jan, please don't just re-open bugs without doing testing.

Please test a later build.
Status: REOPENED → RESOLVED
Last Resolved: 19 years ago
Resolution: --- → FIXED

Comment 33

19 years ago
blee, you need to have some Japanese data files in order to do Japanese search.
Please contact me; I have the data.
(Reporter)

Comment 34

19 years ago
Verified (with nhotta's data file) fixed in Win32 02-17-09 bld.
Status: RESOLVED → VERIFIED

Comment 35

19 years ago
We tried 5 Japanese search engines and saw that some of them did not work.
This needs investigation to see whether the problem is in the description file
(by trying it with Sherlock) or in the ID mapping in Mozilla.

Comment 36

19 years ago
For a failing site like infoseek, it was sending UTF-8.
Document http://japan.infoseek.com/Titles?qt=%E3%81%84&oq=site%3Ajp&nh=25&col=JW&sv=JA&lk=noframes loaded successfully

The description file for that engine has a comment after the charset definition.
resultEncoding=2336				# kTextEncodingEUC_JP	= 0x0920

After I removed the comment, it started to work.
Is this a bad description that is not supported, or does our parser have to
change to read it?
(Assignee)

Comment 37

19 years ago
nhotta, this is very strange. Using Infoseek works just dandy for me (on my Mac
running Mac OS 9) with no changes.
(Assignee)

Comment 38

19 years ago
nhotta, ignore my last comment. My Mac's Infoseek-Japan file doesn't have a #
comment on the same line.  (I'll add it and see what happens.)
(Assignee)

Comment 39

19 years ago
Note:  according to http://developer.apple.com/technotes/tn/tn1141.html:
Lines beginning with a # character are ignored and are treated as comments.
(Assignee)

Comment 40

19 years ago
nhotta, where did you get your search file from?  Did you create it yourself, or 
get it from somewhere off the net?
(Assignee)

Comment 41

19 years ago
Created attachment 5404 [details] [diff] [review]
Proposed diff to be less strict on # comment handling
(Assignee)

Comment 42

19 years ago
nhotta, I've attached a small diff to this bug. Could you apply it and see if
your search files with comments begin to work correctly for you?

This proposed patch makes us more lenient (i.e. slightly less strict in terms of
following the spec at "http://developer.apple.com/technotes/tn/tn1141.html") in
regards to handling comments.  With the patch applied, any line can have a
comment following a valid construct, and we'll notice it and ignore the comment.
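
The lenient behaviour described here can be sketched in JavaScript. This is not
the attached C++ diff: stripTrailingComment and parseAttribute are hypothetical
names, and skipping "#" inside double-quoted values is an assumption made for
the sketch, not something the patch is known to do.

```javascript
// Drop a trailing "# ..." comment, ignoring "#" inside double-quoted values
// (the quote handling is an assumption of this sketch).
function stripTrailingComment(line) {
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line[i];
    if (ch === '"') {
      inQuotes = !inQuotes;
    } else if (ch === "#" && !inQuotes) {
      return line.slice(0, i).trimEnd();
    }
  }
  return line.trimEnd();
}

// Parse "name=value", returning null for blank or comment-only lines.
function parseAttribute(line) {
  const cleaned = stripTrailingComment(line).trim();
  const eq = cleaned.indexOf("=");
  if (cleaned === "" || eq < 0) {
    return null;
  }
  return {
    name: cleaned.slice(0, eq).trim(),
    value: cleaned.slice(eq + 1).trim(),
  };
}

// The line from the Infoseek description file quoted earlier in this bug.
const parsed = parseAttribute(
  "resultEncoding=2336\t\t\t\t# kTextEncodingEUC_JP\t= 0x0920"
);
console.log(parsed);
```

A strict reading of the tech note only treats lines that begin with # as
comments; the sketch accepts both forms so that existing files keep working.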

Comment 43

19 years ago
I applied the patch and now infoseek is working without the modification.
Now 4 out of the 5 search engines I have are working on my PC. I found that the
one which was not working has a problem in its data.
resultEncoding=2336	# kTextEncodingShiftJIS		= 0x0A01
The comment says it is Shift_JIS but the actual encoding number is EUC-JP. It
started to work after I changed the number to 2561.
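
The mismatch above comes down to a numeric lookup. A minimal JavaScript sketch
of such a lookup follows; it is not the Mozilla mapping code. TEC_TO_CHARSET and
charsetForResultEncoding are hypothetical names, and the table holds only the
two values quoted in this bug.

```javascript
// Text Encoding Converter (TEC) numbers quoted in this bug, mapped to charsets.
const TEC_TO_CHARSET = new Map([
  [0x0920, "EUC-JP"],    // 2336, kTextEncodingEUC_JP
  [0x0a01, "Shift_JIS"], // 2561, kTextEncodingShiftJIS
]);

// Map a resultEncoding value (a decimal string) to a charset name,
// or null when the number is not in the table.
function charsetForResultEncoding(value) {
  const code = Number.parseInt(value, 10);
  return TEC_TO_CHARSET.get(code) || null;
}

console.log(charsetForResultEncoding("2336")); // the value found in the bad file
console.log(charsetForResultEncoding("2561")); // the value after the fix
```

The lookup uses only the number, so a wrong number with a correct comment still
selects the wrong charset, which is what happened with the failing data file.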

So far, I have tested on Windows using the data files copied from Macintosh.
I will try on my Mac using the same data after I refresh my build.

Comment 44

19 years ago
On Macintosh, I saw that UTF-8 is actually sent.

I tried excite and lycos (both working fine on Windows). The query string was
sending UTF-8 (%E7%8A%AC) instead of Shift_JIS (%8C%A2). As a result, it searches
for something else and the text is sometimes garbage.

Document http://www.excite.co.jp/search.gw?lk=excite_jp&c=japan&s=%E7%8A%AC loaded successfully
Document: Done (10.721 secs)

Document http://www-jp.lycos.com/cgi-bin/pursuit?mtemp=main&query=%E7%8A%AC&cat=jp&maxhits=25&results=default&encoding=Shift_JIS loaded successfully
Document: Done (10.52 secs)

Comment 45

19 years ago
In my Macintosh test, I didn't apply rjc's patch, but those two searches were
working on PC without that patch.

Comment 46

19 years ago
We tried the same test on blee's Macintosh with yesterday's build and that's 
working fine for Japanese. I think my local build has some problems. Ignore my 
previous comments about the Macintosh problem. Sorry for the confusion.
(Assignee)

Comment 47

19 years ago
I'll check in my changes (for # comment handling) when the tree opens up for
general M15 work.  Thanks, nhotta!