Closed Bug 92314 Opened 18 years ago Closed 18 years ago

Ja search engine title garbled on sidebar

Categories

(SeaMonkey :: Search, defect, P3)

defect

Tracking

(Not tracked)

VERIFIED FIXED
mozilla1.0.1

People

(Reporter: yinglinxia, Assigned: nhottanscp)

References

Details

(Keywords: intl, Whiteboard: [adt2 RTM] [ETA 05/31] [Needs a=])

Attachments

(5 files, 6 obsolete files)

If localize the search engine title in the searchplugins\xxx.src file, for 
example Japanese, the title will be garbled on the sidebar. The src file should 
use Shift-JIS for Japanese, but I also tested UTF-8 and unicode, none of them 
works.
Fortunately, this does not affect Latin-1 characters. I see that a Spanish 
sherlock file title "Netscape Search en Español" and it looks fine in browser 
(entered directly without any special encoding).
Status: UNCONFIRMED → NEW
Ever confirmed: true
Thanks for the information, Lynn. Changing summary.
Summary: Search engine title can not be localized → Ja search engine title garbled on sidebar
->search, 0.9.7
Assignee: matt → sgehani
Component: Sidebar → Search
QA Contact: sujay → claudius
Target Milestone: --- → mozilla0.9.7
Moving to mozilla0.9.8.
Target Milestone: mozilla0.9.7 → mozilla0.9.8
Target Milestone: mozilla0.9.8 → mozilla0.9.9
-> mozilla1.0, nsbeta1+
Keywords: nsbeta1+
Priority: -- → P3
Target Milestone: mozilla0.9.9 → mozilla1.0
Reassigning 'several bugs at once' to Steve Morse, to level the load better
across the team.
Assignee: sgehani → morse
ADT2 per ADT triage team
Whiteboard: [adt2]
Keywords: intl
The problem is in the nsInternetSearchService.cpp 
look at line 4596 of function OnDataAvailable
The decoder is probably null so we call context->AppendBytes instead of
decoder->Convert + context->AppendUnicodeBytes to process the data. 
which nsIInternetSearchContext is used for "name" and "description" ? should it
have a "UTF-8" converter ?
still does not work even with UTF-8
This sounds very similar to 
http://bugzilla.mozilla.org/show_bug.cgi?id=118179

I'll see what I can do after fixing 118179
Assignee: morse → yokoyama
roy: the code is very different 
the code should be in 
xpfe/components/search/src/nsInternetSearchService.cpp

Problem is at
http://lxr.mozilla.org/seamonkey/source/xpfe/components/search/src/nsInternetSearchService.cpp#4176
InternetSearchDataSource::ReadFileContents()
{
...
  sourceContents.AssignWithConversion(contents, contentsLen); <= HERE!
...
}

Basicall, here is what's happening
1) Read the search file content as char * 
   and stuff it to nsString by calling AssignWithConversion() 
   which, in this case, creates a "0x00 prefixed Shift-JIS" UCS2 
   disguised string.    2) By the time we read attribes (name, description, etc)
 
http://lxr.mozilla.org/seamonkey/source/xpfe/components/search/src/nsInternetSearchService.cpp#4235,
 the data is already corrupted.
Status: NEW → ASSIGNED
In search of Sherlock file encoding specification
http://developer.apple.com/technotes/tn/tn1141.html
I failed to find an appropriate encoding for the file.

Attributes are defined as :
 "name"  
  This is a human-readable name for the search plug-in.

Which encoding is _human-readable name_?
Can we assume the file will be encoded in UTF-8?
ftang: can you comment?  if it's ok, then can you give /r=?
I was thinking of Decoding at InternetSearchDataSource::ReadFileContents();
but it will have more performance impact.
Instead, I call InternetSearchDataSource::DecodeData() when it is asked.
Attachment #80713 - Attachment is obsolete: true
nhotta: I can't find an owner of internet-search.  I found a doc in mozilla.org
you wrote 2 yrs ago
(http://www.mozilla.org/projects/intl/internet_search_charset_spec.html)
 
Can you review or recommend someone else?
Humm.  nhotta's doc says something about queryCharset in SEARCH section.
queryCharset - a charset name string for server queries, specify in "SEARCH"
section 

Attached test case didn't contain queryCharset
http://bugzilla.mozilla.org/attachment.cgi?id=79522&action=view

My patch using charset in "INTERPRET" section. Is it wrong?
The charset in interpret section is used for the search result from the server
while the query charset is for a charset of the query by the client. So please
use the query charset for now (because currently no way to specify a charset of
the search .src).
I am going to look at the patch.
* You can use const PRUnichar* to pass a charset and PRUnichar** for the
converted result. See my comments below.
+InternetSearchDataSource::DecodeData(const nsString &aCharset, nsString &aValue)

* Please use nsICharsetConverterManager2. Because the charset is from outside,
you need to resolve charset alias. nsICharsetConverterManager2::GetCharsetAtom
does that work (and it takes const const PRUnichar* instead of nsString&)
+    nsCOMPtr<nsICharsetConverterManager> charsetConv = 

* Please add a comment, this is needed to restore the original multibyte but I
think needs comment to explain what is going on.
+        char *value = ToNewCString(aValue);

* Please change to use strlen.
+        PRInt32 srcLength = nsCRT::strlen(NS_REINTERPRET_CAST(const char *,
value));

* I think the function can use PRUnichar ** and avoid to duplicate the buffer in
the function. I see the caller is depending on nsString but the caller can
allocate nsString from the returned buffer.
+                aValue.Assign(unicodeData, outUnicodeLen);
+                nsMemory::Free(unicodeData);

Attached patch revised (obsolete) — Splinter Review
It looks like nsAWritableString has depreciated.
nhotta: can you review?
Attachment #80879 - Attachment is obsolete: true
Comment on attachment 81095 [details] [diff] [review]
revised

Please change "interpret" to the constant for query charset.

This is not needed, the next line you assign with a length specified.
+	     (unicodeData)[outUnicodeLen] = '\0';    // null terminate.
Convert() doesn't do it for us

Change those then r=nhotta
Attachment #81095 - Flags: review+
This is to use the queryCharset defined in 
http://www.mozilla.org/projects/intl/internet_search_charset_spec.html
For example:
<SEARCH 
 name="Netscape <Japanese text in Shift-JIS>"
 description = "<Japanese text in Shift-JIS>"
 queryCharset="Shift_JIS"
 ............................
>
Attachment #81095 - Attachment is obsolete: true
Comment on attachment 81220 [details] [diff] [review]
Use of queryCharset in SEARCH section and remove the null terminate

Commet #23
r=nhotta
Attachment #81220 - Flags: review+
Why can't we just apply the converter at the time we read the file?
It's simply the performance issue.
Please see Comment #17

I could call rv = GetData(dataUni, "search", 0, "queryCharset", charsetValue)
right after reading the file and then search for "name" and "description" 
and decode  them ; 
but we decided to call the converter only when the name and description
attriubtes are required.
or do you mean to decode everything in the file after reading?
Comment on attachment 81220 [details] [diff] [review]
Use of queryCharset in SEARCH section and remove the null terminate


ugh, I wish the file was just stored in UTF8
and then we could just use the UTF8ConverterStream already built into XPCOM,
and the data 
would be decoded as it is read in.

That said, there's no need for 
char *value = ToNewCString(PromiseFlatString(aValue));
you should just say
char *value = ToNewCString(aValue)

(i.e. the PromiseFlatString is unnecessary and might make an extra copy of the
string)

However, even that is lame. You should really use 
NS_LossyConvertUCS2toASCII value(aValue), 
then you won't have to call strlen(value), nor free up the string when you're
done.

on the trunk, we eventually need to fix nsICharsetConverter2 to start taking 
nsAString&/nsACString& so we can avoid all this extra copying and allocating.

but in the mean time, looking at this you could clean this up to avoid the
extra copying
by changing DecodeData() to return a PRUnichar**, and return the buffer you
allocated. 
Then using Adopt() to make descValue take ownership of the newly allocated
string. 

it would look something like:
>+        nsAutoString charsetValue;
>+        if (NS_SUCCEEDED(rv = GetData(dataUni, "search", 0, "queryCharset", charsetValue)) && !charsetValue.IsEmpty())
>+        {
>+            PRUnichar *decodedValue;
>+            rv = DecodeData(charsetValue.get(), &decodedValue);
>+            if (NS_SUCCEEDED(rv))
>+                descValue.Adopt(decodedValue);
>+        }
Attachment #81220 - Flags: needs-work+
I think what waterson means is to decode the file into unicode as you read it
out of the file.. so you're dealing with a correctly-decoded unicode buffer, to
begin with. 

However, I'm not quite sure how we'd accomplish this since the charset name is
actually stored in the file itself. That's why I was hoping for UTF8. 

Roy: Are there already files like this that we need to support (i.e. that went
out in previous releases), or is there any chance we can just fix our own
versions of these files in CVS to be UTF8, and then use the UTF8ConverterStream,
which assumes the file is in UTF8?
Ya, my first patch (04/24/02 16:46)
http://bugzilla.mozilla.org/attachment.cgi?id=80713&action=view
was exactly that.  (assume the file is encoding in UTF-8)

>Roy: Are there already files like this that we need to support
I think tao or ying can answer your question; but ftang and I
looked at a sherlock file in Macintosh-Ja and it
was encoded in Shift-JIS. 



but it never worked before, right? I mean, if this bug has always existed, then
we won't be breaking anything that used to work. 

All of our other data files (.properties, .xml, .dtd, .rdf, etc) are all in
UTF8, it seems odd that we'd decide not to use UTF8 in this case.
Sherlock files use native encoding :-\. However, like xml parser, the browser
should use the embedded charset info to decode the data. 
Notice it does not have a queryCharset but a queryEncoding="2561"
Please make sure this also work.
This file is ship with MacOS X. (in "Users:ftang:Library:Internet Search
Sites:Internet", replace 'ftang' with your login name of course)
notice the name is encoded in Shift_JIS instead of UTF-8
ah! frank that's the best reason yet to give up on the UTF8 idea :)
so this is completely outside of our control - 3rd parties are already
distributing non-UTF8 files, and these files are shared between apps. 

That said, we should definitely do what frank said, and support queryEncoding as
well.
nhotta,ftang and I did little digging how Apple handle the encoding of Sherlock.
It looks like they are using sourceTextEncoding field(not queryEncoding)
to identify the file encoding and if it's not defined, then it
falls back to "x-mac-roman"

sourceTextEncoding is using the Script Codes.  I am going to
add a mapScriptCode table or modify the table in ::MapEncoding()
to support this field.
Blocks: 141008
I need to ask yokoyama to focus on bug 141513 now. nhotta- can you take over
this bug?
Assignee: yokoyama → nhotta
Status: ASSIGNED → NEW
Status: NEW → ASSIGNED
Attachment #81220 - Attachment is obsolete: true
Comment on attachment 83593 [details] [diff] [review]
Changed to use "sourceTextEncoding" which contains a Macintosh script code.

r=rjc  (Please be sure to follow tabbed indentation formatting used in file.)
Attachment #83593 - Flags: review+
Attached patch Changed formatting. (obsolete) — Splinter Review
Attachment #83593 - Attachment is obsolete: true
Alec, could you super review?
Comment on attachment 83975 [details] [diff] [review]
Changed formatting.

Why does MapScriptCodeToCharsetName need to allocate its result? It seems like
the whole function could just return a const char* - since the strings are just
static strings. Also, I think scriptList[] could be "static" as well.. 

I also wonder if it might be even better to make MapScriptCodeToCharsetName
take a const PRUnichar* as a parameter, then you could avoid the excess
copying/conversion... you could declare scriptList[] like

static struct scripts scriptList[] = 
{
  { { '0', '\0' }, "x-mac-roman" },
..etc

it will be annoying to manually expand all that unicode, but once you've done
it once, then we never ever have to do it at runtime! :)

and actually, since really what you want is a charset in PRUnichar, you should
make DecodeData take a PRUnichar* as the charset, and let the consumers of that
function do the conversion

finally, does this compiler? the call to MapScriptCodeToCharsetName  seems to
be lacking a 2nd parameter, the line ends with a ","?

overall this looks fine, I just think we can simplify the code a lot by doing
less conversions
Attachment #83975 - Flags: needs-work+
Attachment #83975 - Attachment is obsolete: true
Alec, please review the new patch, thanks.
Comment on attachment 84102 [details] [diff] [review]
Changed to reduce runtime allocation and copy for strings.

oh, wow! thanks for going the extra mile and putting in the unicode names too. 
sr=alecf
Attachment #84102 - Flags: superreview+
checked in to the trunk
Status: ASSIGNED → RESOLVED
Closed: 18 years ago
Resolution: --- → FIXED
Keywords: adt1.0.0, approval
Yuying, could you verify this on the trunk? You can use FreshEye data attached
to the bug.
Verified it works fine on 05-20 trunk build / WinME-JA by applied the
FreshEye.src in comment #34.
Status: RESOLVED → VERIFIED
Just want to make sure, does it also work on the location bar? You know, if you 
set the search engine as default, when you type something on the URL location 
bar, there will be a dropdown string says "Search xxxx for xxxx".
Blocks: 143047
Whiteboard: [adt2] → [adt2 RTM] [ETA 05/31] [Needs a=]
adt1.0.0+ (on ADT's behalf) for approval to checkin to the 1.0 branch,pending
Driver's approval.  After, checking in, please add the fixed1.0 keyword.
Keywords: adt1.0.0adt1.0.0+
>Just want to make sure, does it also work on the location bar? You know, if you 
>set the search engine as default, when you type something on the URL location 
>bar, there will be a dropdown string says "Search xxxx for xxxx"

I am not sure if URL location bar will get fixed; but I know it will
get Preference/Internet Search fixed as well.
ylong?
>>Just want to make sure, does it also work on the location bar?
Yes, fixed on the trunk.
Adding adt1.0.1+ for adt approval for branch checkin for the 1.0.1 milestone. 
Please get drivers approval before checking in.
Keywords: adt1.0.0+adt1.0.1+
Keywords: mozilla1.0.1
Comment on attachment 84102 [details] [diff] [review]
Changed to reduce runtime allocation and copy for strings.

please check into the 1.0.1 branch ASAP. once landed remove the 
mozilla1.0.1+ keyword and add the fixed1.0.1 keyword
Target Milestone: mozilla1.0 → mozilla1.0.1
fix checked in to the 1.0 branch
Blocks: 146292
No longer blocks: 141008
I'm making a new JA build now, and the Japanese search engine title is showing 
properly! Just want to thank you guys again to make it happens. Yeah!!!
VERIFIED Fixed with 20020719 branch builds
Product: Core → SeaMonkey
You need to log in before you can comment on or make changes to this bug.