Closed Bug 258223 Opened 20 years ago Closed 19 years ago

Bookmark keyword quicksearch need a way to specify character encoding for query URLs

Categories

(Firefox :: Bookmarks & History, defect)

defect
Not set
major

Tracking

()

RESOLVED FIXED
Firefox1.5

People

(Reporter: qiling, Assigned: jshin1987)

References

Details

(Keywords: intl, relnote)

Attachments

(2 files, 10 obsolete files)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7) Gecko/20040616
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7) Gecko/20040616

using the quicksearch feature in Mozilla and Firefox to search for non ASCII
search items often result in the said search item to be sent to website in
non-universal encoding. I cannot decipher what encoding it is being sent, but
sending it in UTF-8 would be nice.

Reproducible: Always
Steps to Reproduce:
1. define a quicksearch with a keyword (for example, "g" for
http://www.google.com/search?q=%s)
2. use the feature: type "g [searchitem]" in address bar ([searchitem] would
necessarily be non-english. I am testing this feature using CHINESE and JAPANESE
3. press enter to activate feature.
Actual Results:  
the query is sent as a wrong encoding and google searches for "????" instead.

Expected Results:  
the search website is sent the properly encoded (UTF-8) queries.

Several notes:

I am providing some search items that can be used for debugging. japanese
searchitem would be "凪," chinese searchitem should be "东" (these are
characters that does not exist in the other language).

Interestingly, all of the chinese searches come out alright. Even more
interestingly, if the chinese character is coupled with the japanese "g 东 凪,"
the japanese character is sent properly. HOWEVER, if sending japanese alone,
google receives something in something strange.

I use Japanese version of windows., and the same problem was confirmed in both
Moz 1.7 and Firefox 0.9.3.

As a final usage note, the addressbar search in Mozilla AND firefox searchbox
performs these searches flawlessly.

Onto some speculation: I believe that the cause of this problem is that the
quicksearch feature probably involves two encodings. one is the encoding for the
searchitem, and the other is the encoding requested to the website. This feature
need to *force* the query items to interpreted as UTF-8, which it is not doing
right now, even though the website requests are all in UTF. When the
searchstring is japanese, the quicksearch feature would send the item as the
default OS encoding (probably EUC-JP), while the page request specifies UTF.
This results in google decoding the searchitem by UTF-8 even though it is
encoded in EUC. In the case where chinese or chinese + japanese is sent, the
default encoding cannot express the chinese characters and sends the query as
UTF, which is accepted and processed properly.

*) I am marking this bug as major because its existance implies that all
non-english users of mozilla / firefox cannot use this feature. It also means
that for these users firefox have no address bar search functionality at all and
must rely on the searchbox. This becomes even more pronounced as firefox will
replace mozilla as the standard browser.

*) this bug is probably related to bug # 205652.
Interesting.  The testcases in this bug workforme with an English Win98 system....

Can someone point me to where bookmark keywords are implemented?  I thought the
"keyword" protocol handler handled that, but it looks like that's a different
beastie....
Small comment: Please view this page in UTF-8 encoding to see the test cases
properly.
Some additional comments:

凪: japanese native, JIS code 4664, Shift JIS code 93E2, Unicode 51EA
东: chinese (simplified), Unicode 4E1C. Does not exist in JIS code pages.

I apparently made an mistake earlier. The character in japanese is sent as S_JIS
but interpreted as UTF.

Using the quicksearch feature for japanese, the following results are obtained
(in firefox):
1) type "g 凪" and press enter
2) the address bar performs substitution to "http://www.google.com/search?q=凪"
3) step 2 quickly changes to "http://www.google.com/search?q=%93%E2" (encoded in
Shift_JIS)

Using the same feature and searching for chinese
1) type "g 东" and press enter
2) the address bar performs substitution to "http://www.google.com/search?q=东"
3) step 2 quickly changes to "http://www.google.com/search?q=%E4%B8%9C" which I
have no idea what encoding it is, but is sent through properly.

*) if not using quicksearch but rather Firefox's searchbox (same with moz
address bar search), the address will become
"http://www.google.com/search?q=%E5%87%AA" directly.

Probably the replace function in quicksearch (s/%s/[searchterm]) should have
code that will convert [searchterm] to proper encoding before sending it to
address bar for page retrieval.
(In reply to comment #3)
> *) if not using quicksearch but rather Firefox's searchbox (same with moz
> address bar search), the address will become
> "http://www.google.com/search?q=%E5%87%AA" directly.

This is for the japansee test case 凪
Attached patch Possible fix (obsolete) — Splinter Review
I can't actually test this because I don't have any Japanese or Chinese fonts.
ccing intl folks and darin.

Since bookmark keywords are handled outside the core, they run up against the
code added in bug 130393 (which encodes URIs for certain schemes in the OS
charset instead of UTF-8 because servers at the time didn't deal well at all
well with UTF-8).

Is that situation still present?  If not, can we remove that code?

That said, Neil's fix should do a decent job for this particular bug, I think...
Ling Qi, could you possibly test it?  (Testing doesn't require a build
environment; just modifying the navigator.js in the jar in the chrome dir in a
Mozilla install.)
Assignee: p_ch → jag
Status: UNCONFIRMED → NEW
Component: Bookmarks → XP Apps
Ever confirmed: true
QA Contact: seamonkey.bookmarks → pawyskoczka
Patch 158088 was a charming fix for Mozilla. For 1.7 (sorry I didn't have time
to install 1.7.2), I changed line 1431 (Diff shows 1495), which fixed the
problem for Moz.

The I cannot find the same line for Firefox but in /content/browser/browser.js,
function getShortcutOrURI(aURL, aPostDataRef) seem to be the one related to this
bug (0.9.3 release code). So Firefox remains untested.
This is rather tricky. Perhaps sending UTF-8 would be the best we can do, but
wouldn't work always. For instance, google can be configured to accept keywords
in an non-UTF-8-encoding. 

re comment #6: More and more servers can deal with url-escaped UTF-8s (even
though their server file system character encoding is not UTF-8). It seems that
MS IIS can deal with them very well (well, most of them are run on Win 2k/XP so
that it should), which may explain why they have, by default, 'always send URLs
in UTF-8' ON in MS IE. There's an apache module for this well. However, I'm not
sure how this apache module deals with cases where a url is not a direct
reference to a file on the file system but is a 'GET' URL (whose 'existence' can
only be checked after running a CGI). See bug 129726. 

re comment #3:
> 3) step 2 quickly changes to "http://www.google.com/search?q=%E4%B8%9C" which
> I have no idea what encoding it is, but is sent through properly.

 '0xE4 0xB8 0x9c' represents U+4E1C (东) in UTF-8. 

> 3) step 2 quickly changes to "http://www.google.com/search?q=%93%E2" 
> (encoded in Shift_JIS)

  I'm surprised that this worked (it doesn't work for me). Is the UI language
for google set to Japanese?  The server side program on Google may do some
encoding checking (check whether it's UTF-8. if not, fall back to a couple of
most widely used encoding for the language selected for the UI. For Japanse,
they would be Shift_JIS, EUC-JP, and perhaps ISO-2022-JP which can  be rather
reliably distinguished from each other).  
Keywords: intl
(In reply to comment #8)
>   I'm surprised that this worked (it doesn't work for me). Is the UI language
> for google set to Japanese?  The server side program on Google may do some
> encoding checking (check whether it's UTF-8. if not, fall back to a couple of
> most widely used encoding for the language selected for the UI. For Japanse,
> they would be Shift_JIS, EUC-JP, and perhaps ISO-2022-JP which can  be rather
> reliably distinguished from each other).  

I am not too sure what you mean by "this" in "...surprised that this worked..."

In any case, the google UI is set to english, but the same behavior exists for
google.com in japanese AND google.co.jp (which, coincidentally, is also in
japanese). I have also tried this with altavista with the same results. I
believe it has to do with the fact that my OS is japanese version of windows
than the interface language of google. 

While I admit that google and other websites attempts to detect the searchterm
and figure out its encoding, it is non-functional in this case because the URL
sent to the website is already in escaped form, e.g. %93%E2, probably because
mozilla is skeptical about sending urls in UTF encoding?

A bit offtopic, but how do does 0xE4 0xB8 0x9C represents U+4E1C? I could not
see any corrolation between the two numbers.
So then the question is whether just URL-encoding the data as the patch does is
ok....
(In reply to comment #10)
> So then the question is whether just URL-encoding the data as the patch does is
> ok....

I honestly do not know where to look but I would suspect that the Moz
address-bar search and the firefox searchbox already URL-encode the data before
sending it out.
Of course; the question is whether the data is URL-encoded before we start
converting encodings to the plaform encoding or after...
(In reply to comment #9)
> While I admit that google and other websites attempts to detect the searchterm
> and figure out its encoding, it is non-functional in this case because the URL
> sent to the website is already in escaped form, e.g. %93%E2, probably because
> mozilla is skeptical about sending urls in UTF encoding?

URLs _MUST_ be sent in escaped form. You can not send 8bit data in HTTP requests.

> A bit offtopic, but how do does 0xE4 0xB8 0x9C represents U+4E1C? I could not
> see any corrolation between the two numbers.

See http://en.wikipedia.org/wiki/UTF-8
This is the third time I'm writting this. (I keep forgetting to submit my
comment before I quit mozilla ...). 

We need to offer a way to let users specify the character encoding to use when
converting what's entered in the address bar before url-escaping and submitting
to the server. Just always using UTF-8 doesn't work. When you take a look at the
definition of a searchplugin, it's clear why you need that. To define a
searchplugin, you have to specify what character encoding to use.  For instance,
a Korean search engine (http://search.naver.com) expects query strings to come
in EUC-KR (or x-windows-949). If we convert a keyword to UTF-8 before
url-escaping and putting in place of '%s' in
'http://search.naver.com/search.naver?where=nexearch&query=%s', it would return
a lot of garbages. 

What we can do are:

1. parse 'mozencoding=<ENCODING>' (e.g. 'mozencoding=ISO-8859-1',
'mozencoding=UTF-8', 'mozencoding=EUC-JP') in the bookmark keyword quick search
definition 
2. convert the input string to the encoding specified via 'mozencoding' before
url-escaping if it's present.
otherwise, use either the platform encdoing or UTF-8. Which one to use, we have
to think about. (currently, we always conver to the platform encoding)
3. get rid of 'mozencoding=<ENCODING>' in the query string before submitting it
to the server
4. document this feature in the bookmark quick search document (and release
notes). I'm adding asa to CC because he documented the bookmark quick search at
the mozilla.org web site.
> URLs _MUST_ be sent in escaped form. You can not send 8bit data in HTTP requests.

  I don't think HTTP has such a restriction. It's just the definition of URI
that restricts what can be used in URI to a subset of ASCII characters. Now
Martin Duerest and others have been working on IRI drafts according to which
non-ASCII characters are allowed in UTF-8. (no other encoding is allowed unless
it's url-escaped). See http://www.w3.org/2004/Talks/IUC25iri/ 
(In reply to comment #15)
>   I don't think HTTP has such a restriction. It's just the definition of URI
> that restricts what can be used in URI to a subset of ASCII characters.

Well, HTTP (RFC 2616) normatively references RFC 2396.

> Now
> Martin Duerest and others have been working on IRI drafts according to which
> non-ASCII characters are allowed in UTF-8. (no other encoding is allowed unless
> it's url-escaped). See http://www.w3.org/2004/Talks/IUC25iri/ 

why doesn't it use escaped utf-8?
(In reply to comment #16)
> (In reply to comment #15)
> >   I don't think HTTP has such a restriction. It's just the definition of URI
> > that restricts what can be used in URI to a subset of ASCII characters.
> 
> Well, HTTP (RFC 2616) normatively references RFC 2396.

Ok. You're right. For HTTP 1.1 request, IRIs have to be converted to URIs. 

> > it's url-escaped). See http://www.w3.org/2004/Talks/IUC25iri/ 
> 
> why doesn't it use escaped utf-8?

 See http://www.w3.org/International/iri-edit/draft-duerst-iri-07.txt and
references therein. 
Bug 123006 actually covers the use of encoding.
We can keep this bug for the character set issues.
*** Bug 266667 has been marked as a duplicate of this bug. ***
Taking. I've just made a patch but have to test it.  
Assignee: jag → jshin
Attached patch patch (obsolete) — Splinter Review
I implemented 'mozcharencoding=XXX' I mentioned earlier.
Attachment #158088 - Attachment is obsolete: true
Comment on attachment 163888 [details] [diff] [review]
patch

asking for r/sr.

+  return str.replace(/[^a-zA-Z0-9+\-*\._@]/gi, encodeChar);

I'll get rid of 'A-Z' (or 'i' flag)
Attachment #163888 - Flags: superreview?(neil.parkwaycc.co.uk)
Attachment #163888 - Flags: review?(jag)
*** Bug 264406 has been marked as a duplicate of this bug. ***
Attached patch firefox patch v1 (obsolete) — Splinter Review
I'm not sure what aPostDataReference does so that I leave that part alone.
However, the part shared with Mozilla is patched the same way as is done in
attachment 163888 [details] [diff] [review].
+        .classes["@mozilla.org/intl/scriptableunicodeconverter"]
+        .getService(Components.interfaces.nsIScriptableUnicodeConverter);

given that it has state, shouldn't it be used with createInstance? (like
calendar and mail do. unlike chatzilla and venkman do...)
Attached patch patch v2 (per cbie's comment) (obsolete) — Splinter Review
cbie, thanks for the catch. I'm not using nsIScriptableUnicodeConverter as a
service any more. I also renamed 'legacyStrEncode' to 'encodeByteString' with a
more extensive comment.
Attachment #163888 - Attachment is obsolete: true
Attachment #163897 - Attachment is obsolete: true
Attachment #163888 - Flags: superreview?(neil.parkwaycc.co.uk)
Attachment #163888 - Flags: review?(jag)
Attached patch firefox patch v2 (obsolete) — Splinter Review
fb counterpart to attachment 163945 [details] [diff] [review]
Comment on attachment 163945 [details] [diff] [review]
patch v2 (per cbie's comment)

asking for r/sr
Attachment #163945 - Flags: superreview?(neil.parkwaycc.co.uk)
Attachment #163945 - Flags: review?(jag)
My patch implemented what I wrote in comment #14, in which I also wrote :
> 2. convert the input string to the encoding specified via 'mozencoding' before
> url-escaping if it's present.
> otherwise, use either the platform encdoing or UTF-8. Which one to use, we have
> to think about. (currently, we always conver to the platform encoding)

In bug 123006, '%S' was added (in mozilla but not in firefox) to send strings in
the platform encoding. '%s' was chnaged to be used for sending in url-escaped
UTF-8. Here we're adding a way to specify explicitly the character encoding to
use for URLs. This is necessary because there are still a lot of 'local' search
engines and search-engine like web sites that are not enlightnened enough to go
all the way to UTF-8 as google has. In addition, on modern Linux (and probably
Mac OS X), the platform encoding is UTF-8 so that users really have no choice
without this patch.
OS: Windows XP → All
Hardware: PC → All
Summary: Bookmark keyword quicksearch sends query in non-universal encoding → Bookmark keyword quicksearch need a way to specify character encoding for query URLs
Jungshik, I did some testing with the patch of bug 136006, and I found that
oppposite to what I initially thought, with that patch %s doesn't always send
the URL as UTF-8 with it.

I was testing initially with google and was always getting UTF-8, but with
dictionary.reference.com, it goes out as cp1252.

I'd like to understand better what the call to 'encodeURIComponent' does exactly.

Don't you thing it's possible to be smart enough to do the right thing by using
LAST_CHARSET to determine the encoding to use, and not require the special
mozencoding keyword ?
(In reply to comment #30)
> with that patch %s doesn't always send the URL as UTF-8 with it.
> 
> I was testing initially with google and was always getting UTF-8, but with
> dictionary.reference.com, it goes out as cp1252.
> 
> I'd like to understand better what the call to 'encodeURIComponent' does exactly.

I guess dictionary.reference.com is running either IIS or Apache with IRI module
that automatically translates incoming UTF-8 to what its server-side program
understands (in this case Windows-1252). See ECMA 262 (ECMAscript standard)
section 15.1.3.  We implemented 
ECMA 262 version of encodeURI* failthfully in our JS engine. That is, it always
produces URL-escaped UTF-8. 

 
> Don't you thing it's possible to be smart enough to do the right thing by using
> LAST_CHARSET to determine the encoding to use, and not require the special
> mozencoding keyword ?

  I don't think that's reliable enough. Moreover, that wouldn't work until the
page in question is tried for the first time. With that, the 'mass/automatic'
deployment of keyword search entries wouldn't be possible.

Status: NEW → ASSIGNED
Comment on attachment 163945 [details] [diff] [review]
patch v2 (per cbie's comment)

>+          re = /^(.*)(\&mozcharencoding=)([a-z][_\-a-z0-9]+)$/i; 
Well, if that's the easiest way to specify the charset... oh, and don't forget
to watch out for trailing spaces.

>           shortcutURL = /%[sS]/.test(shortcutURL) ?
>-              shortcutURL.replace(/%s/g, encodeURIComponent(text))
>-                         .replace(/%S/g, text) :
>+            (newText ? shortcutURL.replace(/%[sS]/g, escapeByteString(newText)) :
>+             shortcutURL.replace(/%s/g, encodeURIComponent(text))
>+                        .replace(/%S/g, text))               :
>               null;
This looks ugly and offhand the easiest way I could think of simplifying it was
to have a variable encodedText which is either encodeURIComponent(text) or
escape(convertFromUnicode(text)) for the %s case, and just use text for the %S
case; using the escaped charset in the %S case doesn't make sense to me.

>+    var unicodeConverter = Components
>+       .classes["@mozilla.org/intl/scriptableunicodeconverter"]
>+       .createInstance(Components.interfaces.nsIScriptableUnicodeConverter);
>+    unicodeConverter.charset = charset;
>+    if ("Finish" in unicodeConverter) {
Why would "Finish" not be in unicodeConverter? The interface declares it, so it
must exist.

>+// url-escape a 'byte string' made up of zero-padded '16-bit characters'
Ordinary escape() should do this these days. As you know it used to be the same
as encodeURIComponent but that violated the ECMA spec. Chatzilla uses its own
version because the XPIs are designed to be compatible with previous versions
of Mozilla.
Attached patch patch v3 (obsolete) — Splinter Review
addressed Neil's concerns
Attachment #163945 - Attachment is obsolete: true
Attachment #163946 - Attachment is obsolete: true
Attachment #163945 - Flags: superreview?(neil.parkwaycc.co.uk)
Attachment #163945 - Flags: review?(jag)
Comment on attachment 164872 [details] [diff] [review]
patch v3

thanks for comments. 
asking for r/sr.
Attachment #164872 - Flags: superreview?(neil.parkwaycc.co.uk)
Attachment #164872 - Flags: review?(jag)
Attached patch patch for firefox (1.0) (obsolete) — Splinter Review
Comment on attachment 164872 [details] [diff] [review]
patch v3

(In reply to comment #32)
>don't forget to watch out for trailing spaces.
That came out wrong... I was actually referring to your source code!
Attachment #164872 - Flags: superreview?(neil.parkwaycc.co.uk) → superreview+
oops. thanks for sr. I got rid of trailing spaces in the source code lines I
added. Also removed was  '\s*' in the regular expression.
Comment on attachment 164872 [details] [diff] [review]
patch v3

+	   re = /^(.*)(\&mozcharencoding=)([a-z][_\-a-z0-9]+)\s*$/i; 

Missing |var| or |const|, though for one-time uses, prefer to inline.

No need really to put &mozcharencoding= into a group, so you can change
matches[3] to matches[2].

Now, I have to point out that this is a rather ugly hack. A cleaner solution
would be to make the charset be an explicit field in the new bookmark /
bookmark properties dialogs, and save it as a property of the bookmark, instead
of encoding it in the URL. Also, though highly unlikely, your solution won't
play nice with any url that actually uses "mozencoding" in its query.

This may require some changes to the bookmark service (specifically, the parser
and serializer), perhaps talk to Ben Goodger first.
Attachment #164872 - Flags: review?(jag) → review-
(In reply to comment #38)
> +	   re = /^(.*)(\&mozcharencoding=)([a-z][_\-a-z0-9]+)\s*$/i; 
> 
> Missing |var| or |const|, though for one-time uses, prefer to inline. 
> No need really to put &mozcharencoding= into a group, so you can change
> matches[3] to matches[2].

Thanks for point them out.
 
> A cleaner solution would be to make the charset be an explicit field 
> in the new bookmark /bookmark properties dialogs, and save it as 
> a property of the bookmark, 

Sure, it would be cleaner that way and I thought about it before making this
hack. However, it seemed to be an overkill to add all those changes to both
front-end and back-end for this rather 'obscure' feature.  Moreover, unless the
front-end is designed to make it clear what 'character encoding' is for, it can
easily get users confused. 


> Also, though highly unlikely, your solution won't
> play nice with any url that actually uses "mozencoding" in its query.

That's why I made it rather long ('mozcharencoding') and match only at the end
(to minimize that possibility as much as possible). 
 
> This may require some changes to the bookmark service (specifically, the parser
> and serializer), perhaps talk to Ben Goodger first.

Anyway, I'll talk to him about it.


don't bookmarks already store a charset?
*** Bug 267755 has been marked as a duplicate of this bug. ***
Christian, see the LAST_CHARSET discussion in comments #30 and #31

I'm not completely convinced by junshik's answer, because I don't see how the
deployment of keyword search entries can be done without deploying a complete
bookmark entry, including the LAST_CHARSET element.

Also when you create by hand a new entry, the encoding will be wrong only one
time, so it should not be too bad. I don't know how LAST_CHARSET is selected for
newly created bookmarks, but if it is correctly set using the value that
corresponds to the last display of the page, it will fail even more rarely than
that.
LAST_CHARSET will never be changed for bookmarks containing %s because it works
by looking up the bookmarks for the loaded URL and updating the last visited
date and character set, so such a bookmark would only have a character set if it
was altered from a visited static bookmark.
In fact the same bug applies to bookmarks pointing to redirected pages - their
last visited date never updates because only the target page loads.
*** Bug 242508 has been marked as a duplicate of this bug. ***
Blocks: 270703
Blocks: 260006
Product: Core → Mozilla Application Suite
Keywords: relnote
*** Bug 276122 has been marked as a duplicate of this bug. ***
*** Bug 287050 has been marked as a duplicate of this bug. ***
MSIE-based variants have support for this and MS IE7 may do the same. 
I'll ping Ben once more (comment #39)

This should belong to core,but it doesn't have a suitable component so that I'm
moving it to firefox | bookmark for a better exposure.
Component: XP Apps → Bookmarks
Flags: review-
Product: Mozilla Application Suite → Firefox
Attached patch new patch for firefox (obsolete) — Splinter Review
With this patch, three methods are tried in turn :

1 . mozcharset=xxx at the end of a shortcut URL : needs a manual intervention
of a user, needs to mention in the release notes (or asa's note on keyword)
2. LAST_CHARSET value in the bookmark which is only useful when keyword search
bookmarks are made from static bookmarks. needs to mention in the release notes
(or asa's note on keyword)

3. UTF-8 (the default)

I added |getLastCharset| method to nsIBookmarksService and called it in
browser.js to refer to |LAST_CHARSET| if available.

If method #1 is considered too inconvenient for ordinary users, I can get rid
of it.
Attachment #164880 - Attachment is obsolete: true
Attachment #188830 - Flags: superreview?(neil.parkwaycc.co.uk)
Attachment #188830 - Flags: review?(vladimir)
Comment on attachment 188830 [details] [diff] [review]
new patch for firefox 

I don't think you should be guessing the character set when the URL doesn't
even contain %s.

>+                nsXPIDLString charset;
>+                charsetData->GetValue(getter_Copies(charset));
>+                aCharset = charset;
Slightly more efficient would be to use
const PRUnichar *charset;
charsetData->GetValueConst(&charset);
aCharset.Assign(charset);
Attachment #188830 - Flags: superreview?(neil.parkwaycc.co.uk) → superreview-
(In reply to comment #50)
> (From update of attachment 188830 [details] [diff] [review] [edit])
> I don't think you should be guessing the character set when the URL doesn't
> even contain %s.

 You're worried about an unncessary(but harmless otherwise) call to
|BMSC.getLastCharset| ? I'll invoked that only when '%s' is present

> const PRUnichar *charset;
> charsetData->GetValueConst(&charset);
> aCharset.Assign(charset);

  Ok, I'll change that.
 

(In reply to comment #49)
> Created an attachment (id=188830) [edit]
> new patch for firefox 
> 
> With this patch, three methods are tried in turn :
> 
> 1 . mozcharset=xxx at the end of a shortcut URL : needs a manual intervention
> of a user, needs to mention in the release notes (or asa's note on keyword)
> 2. LAST_CHARSET value in the bookmark which is only useful when keyword search
> bookmarks are made from static bookmarks. needs to mention in the release notes
> (or asa's note on keyword)
> 
> 3. UTF-8 (the default)
> 
> I added |getLastCharset| method to nsIBookmarksService and called it in
> browser.js to refer to |LAST_CHARSET| if available.
> 
> If method #1 is considered too inconvenient for ordinary users, I can get rid
> of it. 
> 

The order to handle what encoding should use for the searching is nice, and the
patch works fine for me for those non UTF-8 encoding search sites with GET method.

But there are still lots of websites use POST method (and some sites reject GET
method to query), in this case, the %s is in the post data which is still
handled in the old way and don't work correctly.
It should be noted that nsFormSubmission.cpp uses nsISaveAsCharset (with
attr_EntityAfterCharsetConv and attr_FallbackDecimalNCR) as encoder.
Made some small changes for browser.js based on Jungshik Shin's last patch. 

Should works for POST method search sites now, tested in Firefox 1.0.4/1.0.5
Windows.

Notes:
This patch file is based on the release package(<install
dir>\chrome\browser.jar) of Firefox not the CVS code. it don't contain the
patch code to cpp/idl files to support LAST_CHARSET (check Jungshik Shin's
patch code for it). I made this so you can temporary fix the problem without
waiting for new official release or download/build the source code. (but it
only support mozcharset in URL as metioned above)
Comment on attachment 189407 [details] [diff] [review]
browser.js patch for firefox 1.0.4

>--- browser.jar.1.0.4\content\browser\browser.js	Tue Apr 12 21:43:20 2005
>+++ browser\content\browser\browser.js	Fri Jul 15 19:05:29 2005
>@@ -1468,51 +1468,86 @@
>     var shortcutURL = BMSVC.resolveKeyword(aURL, aPostDataRef);
>     if (!shortcutURL) {
>       // rjc: add support for string substitution with shortcuts (4/4/2000)
>       //      (see bug # 29871 for details)
>       var aOffset = aURL.indexOf(" ");
>       if (aOffset > 0) {
>         var cmd = aURL.substr(0, aOffset);
>         var text = aURL.substr(aOffset+1);
>         shortcutURL = BMSVC.resolveKeyword(cmd, aPostDataRef);
>         if (shortcutURL && text) {
>+            var encodedText = null; 
>+            var charset= "";
>+            var matches = shortcutURL.match(/^(.*)\&mozcharset=([a-zA-Z][_\-a-zA-Z0-9]+)\s*$/);
>+
>+            if (matches) {
>+               shortcutURL = matches[1];
>+               charset = matches[2];
>+            } else {
>+              try {
>+                charset = BMSVC.getLastCharset(shortcutURL);
>+              } catch (ex) {
>+              }
>+            }
>+            if (charset != "") 
>+              encodedText = escape(convertFromUnicode(charset, text)); 
>+            else  // default case: charset=UTF-8
>+              encodedText = encodeURIComponent(text);
>+
>           if (aPostDataRef && aPostDataRef.value) {
>             // XXXben - currently we only support "application/x-www-form-urlencoded"
>             //          enctypes.
>             aPostDataRef.value = unescape(aPostDataRef.value);
>-            if (aPostDataRef.value.match(/%s/))
>-              aPostDataRef.value = getPostDataStream(aPostDataRef.value, text, 
>+            if (encodedText && aPostDataRef.value.match(/%[sS]/)) {
>+              aPostDataRef.value = getPostDataStream(aPostDataRef.value.replace(/%s/g, encodedText).replace(/%S/g, text), text, 
>                                                      "application/x-www-form-urlencoded");
>-            else {
>+	    } else {
>               shortcutURL = null;
>               aPostDataRef.value = null;
>             }
>-          }
>-          else
>-            shortcutURL = shortcutURL.match(/%s/) ? shortcutURL.replace(/%s/g, encodeURIComponent(text)) : null;
>+          } else {
>+            if (encodedText && /%[sS]/.test(shortcutURL))
>+              shortcutURL = shortcutURL.replace(/%s/g, encodedText)
>+                                       .replace(/%S/g, text);
>+            else 
>+              shortcutURL = null;
>+	  }
>         }
>       }
>     }
> 
>     if (shortcutURL)
>       aURL = shortcutURL;
> 
>   } catch (ex) {
>   }
>   return aURL;
> }
> 
>+function convertFromUnicode(charset, str)
>+{
>+  try {
>+    var unicodeConverter = Components
>+       .classes["@mozilla.org/intl/scriptableunicodeconverter"]
>+       .createInstance(Components.interfaces.nsIScriptableUnicodeConverter);
>+    unicodeConverter.charset = charset;
>+    str = unicodeConverter.ConvertFromUnicode(str);
>+    return str + unicodeConverter.Finish();
>+  } catch(ex) {
>+    return null; 
>+  }
>+}
>+
> function getPostDataStream(aStringData, aKeyword, aType)
> {
>   var dataStream = Components.classes["@mozilla.org/io/string-input-stream;1"]
>                             .createInstance(Components.interfaces.nsIStringInputStream);
>-  aStringData = aStringData.replace(/%s/g, encodeURIComponent(aKeyword));
>   dataStream.setData(aStringData, aStringData.length);
> 
>   var mimeStream = Components.classes["@mozilla.org/network/mime-input-stream;1"]
>                               .createInstance(Components.interfaces.nsIMIMEInputStream);
>   mimeStream.addHeader("Content-Type", aType);
>   mimeStream.addContentLength = true;
>   mimeStream.setData(dataStream);
>   return mimeStream.QueryInterface(Components.interfaces.nsIInputStream);
> }
>
Attached patch seamonkey patch (obsolete) — Splinter Review
Attachment #164872 - Attachment is obsolete: true
Attachment #188830 - Attachment is obsolete: true
Attachment #190223 - Flags: superreview?(neil.parkwaycc.co.uk)
Attachment #190223 - Flags: review?(vladimir)
Comment on attachment 190223 [details] [diff] [review]
seamonkey patch

Neil's concerns are addressed in this patch.

>+            if (charsetData) {
>+                PRUnichar *charset;
>+                charsetData->GetValueConst(&charset);

It should be |const PRUnichar *charset|. I fixed it in my tree.
Attachment #188830 - Flags: review?(vladimir)
Attached patch firefox patchSplinter Review
firefox patch with 'post'  support
Attachment #190860 - Flags: superreview?(neil.parkwaycc.co.uk)
Attachment #190860 - Flags: review?(vladimir)
My latest patch for firefox includes the patch of wyns_sh@hotmail.com (I changed
it slightly, though). 

Vladimir and Neil, can you r/sr my two patches soon? 
Comment on attachment 190223 [details] [diff] [review]
seamonkey patch

Where's the definition of convertFromUnicode?
|convertFromUnicode| was added and a typo was fixed.
Attachment #189407 - Attachment is obsolete: true
Attachment #190223 - Attachment is obsolete: true
Attachment #191586 - Flags: superreview?(neil.parkwaycc.co.uk)
Attachment #191586 - Flags: review?(vladimir)
Attachment #190223 - Flags: superreview?(neil.parkwaycc.co.uk)
Attachment #190223 - Flags: review?(vladimir)
Comment on attachment 190860 [details] [diff] [review]
firefox patch

r=me, this looks ok visually -- haven't tried it out yet.  I can't think of
anything bad that could happen with sending a &mozcharset=foo to a site if the
bookmark is selected instead of used as a keyword, but it's probably no worse
than sending "%s" down..
Attachment #190860 - Flags: review?(vladimir) → review+
Comment on attachment 191586 [details] [diff] [review]
seamonkey patch with the missing function and a typo fixed

>+          var charset= "";
Nit: space before =

>+          re = /^(.*)\&mozcharset=([a-zA-Z][_\-a-zA-Z0-9]+)\s*$/; 
No declaration; I assume you mean "const re" here?

>+          if (charset != "") 
Nit: trailing space. Also, I would just use if (charset)
Attachment #191586 - Flags: superreview?(neil.parkwaycc.co.uk) → superreview+
Thanks for r/sr.
Vlad, can I assume that your 'r' of firefox patch imples 'r' of seamonkey patch?
The latter is a subset of the former. 
Neil, can you review firefox patch (attachment  190860 [details] [diff] [review]) as well? 
Comment on attachment 190860 [details] [diff] [review]
firefox patch

I can accept vlad's + as applying to attachment 191586 [details] [diff] [review] too.
Attachment #190860 - Flags: superreview?(neil.parkwaycc.co.uk) → superreview+
Comment on attachment 190860 [details] [diff] [review]
firefox patch

asking for a 1.8b4
As vlad noted, this is a low risk patch, but those who need to use keyword
quicksearch with encodings other than UTF-8 love to see this in ff 1.5
Attachment #190860 - Flags: approval1.8b4?
Comment on attachment 191586 [details] [diff] [review]
seamonkey patch with the missing function and a typo fixed

asking for a 1.8b4
As vlad noted (he granted 'r' for the superset of this patch for firefox), this
is a low risk patch, but those who need to use keyword
quicksearch with encodings other than UTF-8 love to see this in 1.8

btw, I've already taken car of Neil's nits in my tree.
Attachment #191586 - Flags: approval1.8b4?
Attachment #191586 - Flags: approval1.8b4? → approval1.8b4+
Attachment #190860 - Flags: approval1.8b4? → approval1.8b4+
Whiteboard: [checkin needed]
landed on trunk
Status: ASSIGNED → RESOLVED
Closed: 19 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox1.1
Whiteboard: [checkin needed]
Attachment #191586 - Flags: review?(vladimir)
*** Bug 260006 has been marked as a duplicate of this bug. ***
One more tiny bug need to be fixed, the post data is also stored in encoded form in bookmarks like: name=%3D%25s
so in browser.js (aPostDataRef && /%s/.test(aPostDataRef.value)) won't match before unescaping even there is %s in post data. resulting the last_charset isn't used.

I am using firefox 1.5 RC3, a quick fix is:
--- browser-orig.js	Sun Oct 23 12:39:12 2005
+++ browser.js	Sat Dec 10 12:31:49 2005
@@ -1631,6 +1631,9 @@
           var charset = "";
           const re = /^(.*)\&mozcharset=([a-zA-Z][_\-a-zA-Z0-9]+)\s*$/; 
           var matches = shortcutURL.match(re);
+	  if (aPostDataRef && aPostDataRef.value) {
+            aPostDataRef.value = unescape(aPostDataRef.value);
+	  }
           if (matches) {
              shortcutURL = matches[1];
              charset = matches[2];
@@ -1651,7 +1654,6 @@
           if (aPostDataRef && aPostDataRef.value) {
             // XXXben - currently we only support "application/x-www-form-urlencoded"
             //          enctypes.
-            aPostDataRef.value = unescape(aPostDataRef.value);
             if (aPostDataRef.value.match(/%[sS]/)) {
               aPostDataRef.value = getPostDataStream(aPostDataRef.value,
                                                      text, encodedText,
*** Bug 336664 has been marked as a duplicate of this bug. ***
*** Bug 246848 has been marked as a duplicate of this bug. ***
*** Bug 270515 has been marked as a duplicate of this bug. ***
*** Bug 272419 has been marked as a duplicate of this bug. ***
Can’t you automatically detect this, based on the encoding the form field would have?


~Grauw
I agree with Laurens Holst, the browser should detect the character encoding, and the user will be able to override this auto-detected encoding, if needed.
You need to log in before you can comment on or make changes to this bug.