Closed Bug 13393 Opened 25 years ago Closed 23 years ago

Implement Accept-Charset Header according to HTTP/1.1

Categories

(Core :: Networking: HTTP, defect, P3)

defect

Tracking


VERIFIED FIXED
mozilla0.9

People

(Reporter: momoi, Assigned: darin.moz)

References


Details

(Keywords: intl)

Attachments

(3 files)

In 5.0, there is currently no Accept-Charset header entry in our HTTP
request headers. We should implement it as we did in 4.x.
Currently, DSGW 4.x/3.x requires an Accept-Charset header from the client.

We do need to revise the way this was implemented in 4.x.
There we had something like this and it was hard-coded:

primary_charset, *, utf-8

and L10n had to localize this value for Windows. Mac and Unix simply shipped
with Latin 1 values, which was not correct. But given that there was
no easy way to localize the values, this was understandable.

Under 5.0, we should do something like the following honoring HTTP/1.1:

primary_charset, utf-8, *;q=0.8

The idea is to supply the "primary_charset" based on the
user's selection of the default language as described in the
5.0 Intl UI proposal document:

http://rocknroll/users/momoi/publish/seamonkey/50intlui.html

This way, L10n need not be involved at all in setting this
manually.

As to the "q" values, we should just pick an arbitrary value (less
than 0) for the 3rd argument, "*". Our aim should be to give
servers choices to pick from Primary_charset or UTF-8, or any
other charset if they cannot provide either of the 2 main
choices.

The value for the 4.x prefs.js line looks like this:

user_pref("intl.accept_charsets", "iso-8859-1,utf-8,*;q=0.8");
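The rule proposed above (primary charset first, then utf-8, then "*" with a lower q) can be sketched in C. This is my own illustration, not code from 4.x or Mozilla; the function name and the UTF-8 deduplication are assumptions:

```c
#include <stdio.h>
#include <string.h>
#include <strings.h>  /* strcasecmp (POSIX) */
#include <assert.h>

/* Hypothetical sketch: build "primary,utf-8,*;q=0.8" from the
 * user's primary charset, collapsing the list when the primary
 * charset is already UTF-8 so it is not listed twice. */
static void build_accept_charset(const char *primary,
                                 char *out, size_t outlen)
{
    if (strcasecmp(primary, "utf-8") == 0)
        snprintf(out, outlen, "utf-8,*;q=0.8");
    else
        snprintf(out, outlen, "%s,utf-8,*;q=0.8", primary);
}
```

With a primary charset of "shift_jis" this yields "shift_jis,utf-8,*;q=0.8", matching the shape of the 4.x prefs.js value above.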
Correction:

"..As to the "q" values, we should just pick an arbitrary value (less
than 0).."

I meant arbitrary value (less than 1).
Assignee: ftang → warren
Warren, Necko needs to implement the back end of this. You just need to pick up
the pref value, and our group will do (or find someone to do) the pref UI part.
The LDAP gateway depends on this.
Status: NEW → RESOLVED
Closed: 25 years ago
Resolution: --- → DUPLICATE
*** This bug has been marked as a duplicate of 12790 ***
Frank, we need to make sure that our part will be done
so that proper values are picked up when #12790
is fixed. Should we open another bug for that?
Warren, bug 12790 talks of only one of the "accept" headers and
doesn't refer to the "Accept-Charset" header specifically, though it is
quoted in the data sample from 4.61. Does the fix there apply to all
Accept-headers?
QA Contact: teruko → momoi
Status: RESOLVED → REOPENED
Status: REOPENED → RESOLVED
Closed: 25 years ago
Status: RESOLVED → REOPENED
** Checked with 9/16/99 Win32 build **

I put in 2 prefs.js lines like this:

user_pref("intl.accept_charsets", "shift_jis,utf-8,*;q=0.8");
user_pref("intl.accept_languages", "en");

then accessed:

http://kaze:8000/bin/echo.cgi

and found that we are still not sending either the Accept-Language
or the Accept-Charset header.

Someone has to make this work. Frank, is this yours now? Or is it still
warren's?
Until we know what needs to be done to get the right results,
I'm re-opening this bug.
Resolution: DUPLICATE → ---
Assignee: warren → gagan
Status: REOPENED → NEW
Back to Gagan...
Status: NEW → ASSIGNED
Target Milestone: M12
Moving Assignee from gagan to warren since he is away.
Moving what's not done for M12 to M13.
Assignee: warren → gagan
Back to Gagan for M13.
Status: NEW → RESOLVED
Closed: 25 years ago
Resolution: --- → WORKSFORME
From my discussions with Erik, this is more debatable and hence I am closing
this for now. Apparently IE doesn't send a charset either and works just fine
with the directory server. If you feel that this should still be sent, then let's
discuss this on the newsgroup before opening this bug here again.
Hi, I filed this bug for the convenience of our own DSGW,
which checks for Accept-Charset to see if it can send UTF-8. When
it sees the header we came up with, it then sends UTF-8.

Here's a comment on this issue from a DS developer,
noriko@netscape.com.

> >> Thanks for the explanation.  We understand UTF-8 is now more
> >> common.  We can change DSGW in the next version (5.0) not to check
> >> the Accept-Charset.  But the DSGW already in the market is
> >> expecting the variable...  So, if Communicator 5.0 stops sending
> >> it, the 4.X/3.X DSGW would get screwed up.  I'd like to avoid the
> >> risk.
> >>
Actually, the word 'convenience' is wrong. It is so that we
'avoid' screwing up our own DS Gateway, which is used in
web-based access to DS data.
I agree that from an Internet protocol-level discussion, this feature
is debatable, but there is also a practical issue.
I'll send you guys one of the msgs I exchanged with DS people.
MSIE does not emit Accept-Charset. How does DSGW handle this situation?
erik, I looked at the charset handling code noriko sent me on
DSGW3.x/4.x. It makes special allowance for MS IE4. It doesn't
look like it does so for IE 5, however.

DSGW seems to decide on the charset to use based on Accept-Language
and Accept-Charset. If there is no Accept-Charset info, it will
default to a charset appropriate for the Accept-Language.
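The decision logic just described might look roughly like this in C. This is a hedged sketch only; the actual DSGW code is not shown in this bug, and the ja/Shift_JIS mapping is just one illustrative entry:

```c
#include <stddef.h>
#include <string.h>
#include <strings.h>  /* strncasecmp (POSIX) */
#include <assert.h>

/* Return 1 if the Accept-Charset header lists utf-8 (case-insensitively). */
static int lists_utf8(const char *accept_charset)
{
    if (accept_charset == NULL)
        return 0;
    for (const char *p = accept_charset; *p != '\0'; p++)
        if (strncasecmp(p, "utf-8", 5) == 0)
            return 1;
    return 0;
}

/* Sketch of a DSGW-style choice: answer in UTF-8 when the client's
 * Accept-Charset allows it, otherwise fall back to a default charset
 * for the Accept-Language. */
static const char *pick_charset(const char *accept_charset,
                                const char *accept_language)
{
    if (lists_utf8(accept_charset))
        return "UTF-8";
    if (accept_language != NULL && strncmp(accept_language, "ja", 2) == 0)
        return "Shift_JIS";
    return "ISO-8859-1";  /* the implicit HTTP/1.0 default */
}
```

This also shows why a client that never sends Accept-Charset is stuck with the per-language fallback even when it could have handled UTF-8.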

Take my own server, polyglot (DSGW 3.x). It can serve both
Japanese and English interface pages based on Accept-Language.
The data contained there, however, has both Japanese and
Latin 1 accented characters. Also, the search root, o="Netscape"
part is in Japanese.
I tried the following with the current Mozilla and IE5 with 
accept-lang set to ja or en.

Mozilla w/ ja:

1. Can display Japanese names but not Latin 1 accents (because DSGW
   does not use UTF-8 but Shift_JIS charset.)

Mozilla w/ en:

2. Cannot find a single entry because the search root o="Netscape"
   is in Japanese but charset used is ISO-8859-1 in this case,
   and thus ldap url simply fails to match.

MS IE 5 w/ja:

3. Can display Japanese names but not Latin 1 accents (because DSGW
   does not use UTF-8 but Shift_JIS charset.)

MS IE 5 w/en:

4. It even refuses to display the first page in the gateway because
   it contains data from "o="Netscape"" in Japanese but the charset
   sent in is ISO-8859-1. 

In summary, not sending Accept-Charset, and thus preventing DSGW from
sending its data in UTF-8, spells disaster for those DSGW 3.x/4.x users
who may have 1) multilingual data, and/or 2) LDAP attribute names
in non-ASCII.

I am very much inclined to re-open this bug for the above reasons.
If you don't want me to, please provide arguments before too
long.
Needless to say, 4.72, which I'm using now, had none of the problems
mentioned above.
The sniffer script DSGW 3.x/4.x uses has a special allowance
for IE4 and so, though I haven't tried it, IE4 probably gets 
UTF-8 data from DSGW and thus avoids these problems.
My suggestion is to update the sniffer script for DSGW's next version. If the
sniffer script is able to deal with MSIE4, then it should be able to deal with
Mozilla 5. Also, current DSGW customers can be asked to update their script,
which hopefully is a text file.

MSIE5 does not emit Accept-Charset, and MSIE5 has a large market share. If DSGW
is interested in supporting a large fraction of Internet users, DSGW will have
to make changes to their own releases and to their customers' installations.

Mozilla is trying to reduce the amount of stuff it sends out with EVERY HTTP
request. Accept-Charset has limited value. Mozilla needs to weigh all of these
factors and make a decision. It's not my decision to make, but my opinion is
that Mozilla 5 should refrain from emitting Accept-Charset for the above
reasons.
I'm reasonably sure that what you suggest are all doable.
I have no idea, however, how practical that is in this
situation or how much extra work that would entail.
I hear occasionally from Russian users that their sites
use an accept-charset sniffer. I guess in languages where multiple
charsets are competing, accept-charset would be nice, but again
I don't know how sorely this is needed for such a case.

I think I've stated the reasons for re-opening the bug. Other 
opinions are welcome.
I've talked to noriko further about this, and it looks like
the script is part of C code and cannot be changed without
patching the source itself. This will fall into sustaining
engineering's area. There is apparently a less-than-perfect but
nonetheless workable way to turn off accept-charset sniffing and
send UTF-8 data, however. This will be a tech support issue.

I don't necessarily buy an argument that we are sending too many
HTTP headers -- I compared IE5 and Comm 4.72 and the only difference
is that IE5 does not send out accept-charset.
But I can buy an argument that we should not send out what is not
an important or sorely needed HTTP header. This might at this
point in time fall into that category.

The only other point I would like to pursue is that others in the 
net community agree with this assessment. It won't hurt to ask
before verifying the resolution. And that is what I will do now.
I've publicly asked net people about this feature,
and no one expressed concern about this feature not
being in Mozilla. The question was asked some time ago,
and I now feel that we have waited long enough for
a reaction.
I think the resolution should be wontfix rather than
worksforme, however.
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Changing the resolution to WontFix.
Status: REOPENED → RESOLVED
Closed: 25 years ago
Resolution: --- → WONTFIX
Verified as Wontfix.
Status: RESOLVED → VERIFIED
I have read through all the arguments in bug 13393 and would like to
weigh in with a few:

>>>>> From Erik van der Poel 2000-01-22 13:56 -------

ep> MSIE5 does not emit Accept-Charset, and MSIE5 has a large market
ep> share. If DSGW is interested in supporting a large fraction of
ep> Internet users, DSGW will have to make changes to their own
ep> releases and to their customers' installations.

As far as I remember, MSIE5 sends HTTP/1.1 and thus is required to
understand UTF-8 (cf. section 14.2, HTTP/1.1). If a browser
understands UTF-8 and everybody knows this because it is HTTP/1.1, it
can refrain from sending this header, it would be redundant. But even
in this case it would still be polite to send Accept-Charset because a
HTTP/1.0 proxy will be required to downgrade a request to HTTP/1.0 and
thus the server can't find out that the browser behind the proxy is
HTTP/1.1.

ep> Mozilla is trying to reduce the amount of stuff it sends out with
ep> EVERY HTTP request. Accept-Charset has limited value. Mozilla needs
ep> to weigh all of these factors and make a decision. It's not my
ep> decision to make, but my opinion is that Mozilla 5 should refrain
ep> from emitting Accept-Charset for the above reasons.

Erik doesn't say why. I honour the decision to send terse headers, but
it is a wrong decision to say, let's just follow IE5. As long as we do
not have the arguments on the table why they decided their way, we
must find them out ourselves.

>>>>> ------ Additional Comments From Katsuhiko Momoi 2000-01-22 14:22 -------

km> I'm reasonably sure that what you suggest are all doable. I have
km> no idea, however, how practical that is in this situation or how
km> much extra work that would entail. I hear occasionally from
km> Russian users that their sites use accept-charset sniffer. I guess
km> in languages where multiple charsets are competing, accept-charset
km> would be nice but again I don't know how sorely this is needed for
km> such a case.

I'm not speaking for languages where multiple charsets are competing,
I'm speaking from the perspective of an i18n'd server, of which I have
implemented a few. An i18n'd server typically works with Unicode
internally and converts on request. The server can be implemented in a
language-ignorant way, it sends many languages. Talking about language
here somehow muddies the waters. If Mozilla doesn't send
Accept-Charset, the server side must convert to iso-8859-1 because
this was the standard charset in HTTP/1.0. Period.

So my revised suggestion of how to form this header would be:

    Accept-Charset: utf-8,*;q=0.8

and leave the primary charset out of the equation. I see no reason why
the primary charset should be announced to servers at all. Mozilla can
convert to it anyway. And if the conversion would be lossy, it would be
wise not to convert to it. But that's beyond the scope of this bugid.
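For illustration, here is one way a server could weigh the suggested header. This is my own minimal sketch, not a full RFC 2616 Accept-Charset parser; it handles only comma-separated items, an optional ";q=" parameter, and the "*" wildcard:

```c
#include <stdlib.h>
#include <string.h>
#include <strings.h>  /* strcasecmp (POSIX) */
#include <assert.h>

/* Return the q-value a header like "utf-8,*;q=0.8" assigns to a
 * charset: an item without ";q=" counts as q=1.0, and "*" covers any
 * charset not listed explicitly. */
static double qvalue_for(const char *header, const char *charset)
{
    char buf[256];
    double exact = -1.0, wildcard = 0.0;

    strncpy(buf, header, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';

    for (char *item = strtok(buf, ","); item != NULL;
         item = strtok(NULL, ",")) {
        double q = 1.0;
        while (*item == ' ')  /* skip optional whitespace */
            item++;
        char *semi = strchr(item, ';');
        if (semi != NULL) {
            char *qs = strstr(semi + 1, "q=");
            *semi = '\0';  /* truncate item to the bare charset token */
            if (qs != NULL)
                q = atof(qs + 2);
        }
        if (strcasecmp(item, charset) == 0)
            exact = q;
        else if (strcmp(item, "*") == 0)
            wildcard = q;
    }
    return (exact >= 0.0) ? exact : wildcard;
}
```

Under "utf-8,*;q=0.8" the server would prefer UTF-8 (q=1.0) but could fall back to anything else at q=0.8, which is exactly the negotiation Andreas describes.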

-- 
andreas

Status: VERIFIED → REOPENED
Resolution: WONTFIX → ---
Andreas, the LDAP server case I was referring to above is one example of your
i18n'ed server. It stores all the data in UTF-8. It then sends that data to a client
in UTF-8, or in an encoding appropriate for the language of the client in case
the client does not explicitly say what charset it can accept.

(The question of language does come into play for certain types of data.)
*** Bug 48361 has been marked as a duplicate of this bug. ***
"in an encoding appropriate for the language of the client" is a very vague
concept. What if the document is not in the language of the client and is not
displayable in the encoding appropriate for the language of the client? Note
that the language of the client can be a set of languages too.
Andreas, there are many different ways to make use of accept-charset.
Suppose you have a directory server deployed in a predominantly Japanese
environment. The LDAP protocol's default charset is UTF-8, thus all the data
would be in UTF-8. Now if a Japanese client accesses it and says that its
primary charset is Shift_JIS but UTF-8 is OK, then the server simply sends
UTF-8. If not, it sends Shift_JIS-encoded Japanese data. This kind of use is
what we have in the case described above.
Then there are the kinds of cases you describe above. You may have data in many
languages on a single page, which can be encoded in ISO-8859-1 or UTF-8.

The notion of primary charset is quite useful in some of these cases.
Note also that ISO-8859-1 is always assumed even if it is not explicitly listed. 

 
Katsuhiko, I'd like to structure the things to discuss, not all of
them need to be addressed or resolved now.

1. Should Mozilla send an Accept-Charset header that contains at least
utf-8 and "*"? I believe my arguments above prove this is necessary,
and Mozilla should have it, at least for the next few years during
which the rest of the world is not utf-8 safe.

2. Should Mozilla have the notion of a primary charset? I did not
question this and I still believe it is useful for Mozilla. I see the
main usefulness when it comes to storing content on disk, but also
when it comes to browsing sites that do not declare their charset and
heuristics are needed to determine it. But this is an entirely
different problem domain, so let's not get carried away with these
problems.

3. Should Mozilla include the primary charset in the Accept-Charset
header? I see no need to. Mozilla can most probably read any charset
and this is expressed with the star. If Mozilla has no bugs in the
conversion engine, it makes no difference for the user if he gets a
LATIN SMALL LETTER C WITH CEDILLA as u+00E7 in utf-8 or as 0xE7 in
iso-8859-1. Or, to try an equivalent: 0xC4 in Shift_JIS is a
HALFWIDTH KATAKANA LETTER TO, and U+FF84 is the same thing. No need to
express a preference of one over the other.

4. Does the user need to be able to configure the Accept-Charset
header? I see no reason to. Same argument as in (3) above.

5. Does Mozilla need to consider the set of languages the user has
chosen in the language preferences when sending the Accept-Charset
header? I'd say, definitely not.


Among the 5 topics, only #1 needs to be addressed.
My response to issues raised by Andreas:

#1: Agreed.
#2: We already have that expressed in the Navigator default charset in the
    Preferences. (This is the client-side preference setting and has no
    interactive aspect with servers.)
#3: In an ideal world, this would be true. But just like your argument in #1,
    i.e. the world is not UTF-8 safe yet, not everyone tags their Unicode
    documents with a lang tag indicating what language they are in. And Mozilla
    has a dependency on language for deciding which font glyphs to use. For
    example, Unicode CJK ideographs are not necessarily rendered the same from
    language to language. The same code point may lead to different font glyphs
    depending on what language it is. Unless everyone uses a lang tag, I may
    end up seeing a Japanese document with some Chinese glyphs. And I
    definitely don't want that!
    (See how fonts are set in the preference dialog -- according to language.
    But if language info is not available in the docs, we do our best by
    looking at the charset info -- a charset is a good secondary determining
    factor for some languages, e.g. Chinese, Japanese, Korean, etc. Thus, the
    notion of primary charset is still useful in this situation.)
#4: The user does not have to, as long as the localization process can take
    care of it.
#5: Agreed. But we may use the Navigator default charset for this.
Thank you for the background info for #3--very interesting, I see more light now
and agree with you.
Target Milestone: M13 → Future
*** Bug 60496 has been marked as a duplicate of this bug. ***
There is a patch attached to bug 60496, by the way.
Added "patch" keyword.
Keywords: patch
Thanks a lot for the patch!

There's some purely cosmetic thing left. When the default character set chosen
via Preferences/Languages is "Unicode (UTF-8)", then the resulting
Accept-Charset header becomes:

    Accept-Charset: UTF-8, utf-8; q=0.667, *; q=0.667

which seemingly is legal but redundant.
Koenig: whoops, you're right... the patch is designed to avoid the duplicate
"utf-8", but it doesn't check for case. Change line 116 of the patch from:

+  if (PL_strstr(acceptable, "utf-8") == NULL) {

to

+  if (PL_strcasestr(acceptable, "utf-8") == NULL) {

and that should do the trick.
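To see why the case-sensitive search misses the duplicate, compare plain strstr with a case-insensitive scan. contains_ci below is a portable stand-in for NSPR's PL_strcasestr, written here only for illustration:

```c
#include <stddef.h>
#include <string.h>
#include <strings.h>  /* strncasecmp (POSIX) */
#include <assert.h>

/* Case-insensitive substring test: return 1 if needle occurs anywhere
 * in haystack, ignoring ASCII case (what PL_strcasestr provides). */
static int contains_ci(const char *haystack, const char *needle)
{
    size_t n = strlen(needle);
    for (const char *p = haystack; *p != '\0'; p++)
        if (strncasecmp(p, needle, n) == 0)
            return 1;
    return 0;
}
```

With the default charset set to "UTF-8", the case-sensitive check fails to find "utf-8" and appends a second entry, producing the redundant header quoted above.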
Also, while the Language Preference screen won't let you do it, the above patch
will allow a comma-separated list of character sets/encodings in
intl.charset.default, which you can set by manually editing your prefs.js.

Nothing else seems to use intl.charset.default (true?), but if something else
isn't expecting comma-delimited tokens in that preference, this could get you
into trouble.
intl.charset.default must be a single-item entry.
(No comma-delimited list should be in it -- it
defeats the purpose of this pref!)
It is your default fallback encoding for browsing
in case HTTP, HTTP Meta-Equiv, or Auto-detection
cannot give you a document charset.
For Composer, it is used as the default encoding
for a new document.

This value should be set by a localizer to be
suitable for each locale. It has a UI also:

Edit | Prefs | Navigator | Languages | Character Coding.
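The fallback order described above can be sketched as follows. The function and parameter names are hypothetical, not Mozilla's actual charset resolver:

```c
#include <stddef.h>
#include <string.h>
#include <assert.h>

/* Illustrative fallback chain: the document charset comes from the
 * HTTP header, then the META http-equiv tag, then auto-detection,
 * and only then from the single-item intl.charset.default pref. */
static const char *resolve_doc_charset(const char *http_charset,
                                       const char *meta_charset,
                                       const char *detected_charset,
                                       const char *pref_default)
{
    if (http_charset != NULL)
        return http_charset;
    if (meta_charset != NULL)
        return meta_charset;
    if (detected_charset != NULL)
        return detected_charset;
    return pref_default;  /* e.g. "Shift_JIS" for a ja localization */
}
```

This is why the pref must stay a single item: it is the final answer when every other source comes up empty, not a negotiation list.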
Understood. The above patch is still o.k.: while _it_ can handle a
comma-delimited list, it doesn't add a comma list to the pref itself -- just a
little bit of robustness, unneeded for now until the patch is changed to use a
preference other than intl.charset.default. Access to intl.charset.default is
read-only.
http bugs to "Networking::HTTP"
Assignee: gagan → darin
Status: REOPENED → NEW
Component: Internationalization → Networking: HTTP
QA Contact: momoi → tever
Target Milestone: Future → M19
Keywords: intl
Depends on: 65092
No longer depends on: 65092
Blocks: 65092
nominating for moz 0.9
Target Milestone: --- → mozilla0.9
Looks good.  r=darin
adding keyword nsbeta1
Keywords: nsbeta1
Fix checked in.
Status: NEW → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
You can check what mozilla sends at:
http://gemal.dk/browserspy/accept.cgi
Henrik Gemal wrote:
> You can check what mozilla sends at:
> http://gemal.dk/browserspy/accept.cgi

or you can use
http://www.mozilla.gr.jp:4321/

which is step B20 of the smoketests at
http://www.mozilla.org/quality/smoketests/

verified
Status: RESOLVED → VERIFIED