Closed Bug 13393 Opened 25 years ago Closed 23 years ago

Implement Accept-Charset Header according to HTTP/1.1

Categories

(Core :: Networking: HTTP, defect, P3)

defect

Tracking


VERIFIED FIXED
mozilla0.9

People

(Reporter: momoi, Assigned: darin.moz)

References


Details

(Keywords: intl)

Attachments

(3 files)

In 5.0, there is currently no Accept-Charset header entry in our HTTP
request headers. We should implement it as we did in 4.x.
Currently, DSGW 4.x/3.x requires an Accept-Charset header from the client.

We do need to revise the way this was implemented in 4.x.
There we had something like this and it was hard-coded:

primary_charset, *, utf-8

and L10n had to localize this value for Windows. Mac and Unix simply shipped
with Latin 1 values, which was not correct. But given that there was
no easy way to localize the values, this was understandable.

Under 5.0, we should do something like the following honoring HTTP/1.1:

primary_charset, utf-8, *;q=0.8

The idea is to supply the "primary_charset" based on the
user's selection of the default language as described in the
5.0 Intl UI proposal document:

http://rocknroll/users/momoi/publish/seamonkey/50intlui.html

This way, L10n need not be involved at all in setting this
manually.

As to the "q" values, we should just pick an arbitrary value (less
than 0) for the 3rd argument, "*". Our aim should be to give
servers choices to pick from Primary_charset or UTF-8, or any
other charset if they cannot provide either of the 2 main
choices.

The value for the 4.x prefs.js line looks like this:

user_pref("intl.accept_charsets", "iso-8859-1,utf-8,*;q=0.8");
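The rule proposed above (primary charset first, then utf-8, then "*" with a lower q) can be sketched in C. This is my own illustration, not code from 4.x or Mozilla; the function name and the UTF-8 deduplication are assumptions:

```c
#include <stdio.h>
#include <string.h>
#include <strings.h>  /* strcasecmp (POSIX) */
#include <assert.h>

/* Hypothetical sketch: build "primary,utf-8,*;q=0.8" from the
 * user's primary charset, collapsing the list when the primary
 * charset is already UTF-8 so it is not listed twice. */
static void build_accept_charset(const char *primary,
                                 char *out, size_t outlen)
{
    if (strcasecmp(primary, "utf-8") == 0)
        snprintf(out, outlen, "utf-8,*;q=0.8");
    else
        snprintf(out, outlen, "%s,utf-8,*;q=0.8", primary);
}
```

With a primary charset of "shift_jis" this yields "shift_jis,utf-8,*;q=0.8", matching the shape of the 4.x prefs.js value above.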
Correction:

"..As to the "q" values, we should just pick an arbitrary value (less
than 0).."

I meant arbitrary value (less than 1).
Assignee: ftang → warren
Warren, Necko needs to implement the back end of this. You just need to pick up
the pref value, and our group will do (or find someone to do) the pref UI part.
The LDAP gateway depends on this.
Status: NEW → RESOLVED
Closed: 25 years ago
Resolution: --- → DUPLICATE
*** This bug has been marked as a duplicate of 12790 ***
Frank, we need to make sure that our part will be done
so that proper values are picked up when #12790
is fixed. Should we open another bug for that?
Warren, bug 12790 talks of only one of the "accept" headers and
doesn't refer to the "Accept-Charset" header specifically, though it is
quoted in the data sample from 4.61. Does the fix there apply to all
Accept-headers?
QA Contact: teruko → momoi
Status: RESOLVED → REOPENED
Status: REOPENED → RESOLVED
Closed: 25 years ago
Status: RESOLVED → REOPENED
** Checked with 9/16/99 Win32 build **

I put in 2 prefs.js lines like this:

user_pref("intl.accept_charsets", "shift_jis,utf-8,*;q=0.8");
user_pref("intl.accept_languages", "en");

then accessed:

http://kaze:8000/bin/echo.cgi

and found that we are still not sending either the Accept-Language
or the Accept-Charset header.

Someone has to make this work. Frank, is this yours now? Or is it still
warren's?
Until we know what needs to be done to get the right results,
I'm re-opening this bug.
Resolution: DUPLICATE → ---
Assignee: warren → gagan
Status: REOPENED → NEW
Back to Gagan...
Status: NEW → ASSIGNED
Target Milestone: M12
Moving Assignee from gagan to warren since he is away.
Moving what's not done for M12 to M13.
Assignee: warren → gagan
Back to Gagan for M13.
Status: NEW → RESOLVED
Closed: 25 years ago
Resolution: --- → WORKSFORME
From my discussions with Erik, this is more debatable and hence I am closing
this for now. Apparently IE doesn't send a charset either and works just fine
with the directory server. If you feel that this should still be sent, then let's
discuss this on the newsgroup before opening this bug here again.
Hi, I filed this bug for the convenience of our own DSGW,
which checks for Accept-Charset to see if it can send UTF-8. When
it sees the header we came up with, it then sends UTF-8.

Here's a comment on this issue from a DS developer,
noriko@netscape.com.

> >> Thanks for the explanation.  We understand UTF-8 is now more
> >> common.  We can change DSGW in the next version (5.0) not to check
> >> the Accept-Charset.  But the DSGW already in the market is
> >> expecting the variable...  So, if Communicator 5.0 stops sending
> >> it, the 4.X/3.X DSGW would get screwed up.  I'd like to avoid the
> >> risk.
> >>
Actually, the word 'convenience' is wrong. It is so that we
'avoid' screwing up our own DS Gateway, which is used in
web-based access to DS data.
I agree that from an Internet protocol-level discussion, this feature
is debatable, but there is also a practical issue.
I'll send you guys one of the msgs I exchanged with DS people.
MSIE does not emit Accept-Charset. How does DSGW handle this situation?
erik, I looked at the charset handling code noriko sent me on
DSGW3.x/4.x. It makes special allowance for MS IE4. It doesn't
look like it does so for IE 5, however.

DSGW seems to decide on the charset to use based on Accept-Language
and Accept-Charset. If there is no Accept-Charset info, it will
default to a charset appropriate for the Accept-Language.
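The decision logic just described might look roughly like this in C. This is a hedged sketch only; the actual DSGW code is not shown in this bug, and the ja/Shift_JIS mapping is just one illustrative entry:

```c
#include <stddef.h>
#include <string.h>
#include <strings.h>  /* strncasecmp (POSIX) */
#include <assert.h>

/* Return 1 if the Accept-Charset header lists utf-8 (case-insensitively). */
static int lists_utf8(const char *accept_charset)
{
    if (accept_charset == NULL)
        return 0;
    for (const char *p = accept_charset; *p != '\0'; p++)
        if (strncasecmp(p, "utf-8", 5) == 0)
            return 1;
    return 0;
}

/* Sketch of a DSGW-style choice: answer in UTF-8 when the client's
 * Accept-Charset allows it, otherwise fall back to a default charset
 * for the Accept-Language. */
static const char *pick_charset(const char *accept_charset,
                                const char *accept_language)
{
    if (lists_utf8(accept_charset))
        return "UTF-8";
    if (accept_language != NULL && strncmp(accept_language, "ja", 2) == 0)
        return "Shift_JIS";
    return "ISO-8859-1";  /* the implicit HTTP/1.0 default */
}
```

This also shows why a client that never sends Accept-Charset is stuck with the per-language fallback even when it could have handled UTF-8.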

Take my own server, polyglot (DSGW 3.x). It can serve both
Japanese and English interface pages based on Accept-Language.
The data contained there, however, has both Japanese and
Latin 1 accented characters. Also, the search root, o="Netscape"
part is in Japanese.
I tried the following with the current Mozilla and IE5 with 
accept-lang set to ja or en.

Mozilla w/ ja:

1. Can display Japanese names but not Latin 1 accents (because DSGW
   does not use UTF-8 but Shift_JIS charset.)

Mozilla w/ en:

2. Cannot find a single entry because the search root o="Netscape"
   is in Japanese but charset used is ISO-8859-1 in this case,
   and thus ldap url simply fails to match.

MS IE 5 w/ja:

3. Can display Japanese names but not Latin 1 accents (because DSGW
   does not use UTF-8 but Shift_JIS charset.)

MS IE 5 w/en:

4. It even refuses to display the first page in the gateway because
   it contains data from "o="Netscape"" in Japanese but the charset
   sent in is ISO-8859-1. 

In summary, not sending Accept-Charset, and thus preventing DSGW from
sending its data in UTF-8, spells disaster for those DSGW 3.x/4.x users
who may have 1) multilingual data, and/or 2) LDAP attribute names
in non-ASCII.

I am very much inclined to re-open this bug for the above reasons.
If you don't want me to, please provide arguments before too
long.
Needless to say, 4.72, which I'm using now, had none of the problems
mentioned above.
The sniffer script DSGW 3.x/4.x uses has a special allowance
for IE4 and so, though I haven't tried it, IE4 probably gets 
UTF-8 data from DSGW and thus avoids these problems.
My suggestion is to update the sniffer script for DSGW's next version. If the
sniffer script is able to deal with MSIE4, then it should be able to deal with
Mozilla 5. Also, current DSGW customers can be asked to update their script,
which hopefully is a text file.

MSIE5 does not emit Accept-Charset, and MSIE5 has a large market share. If DSGW
is interested in supporting a large fraction of Internet users, DSGW will have
to make changes to their own releases and to their customers' installations.

Mozilla is trying to reduce the amount of stuff it sends out with EVERY HTTP
request. Accept-Charset has limited value. Mozilla needs to weigh all of these
factors and make a decision. It's not my decision to make, but my opinion is
that Mozilla 5 should refrain from emitting Accept-Charset for the above
reasons.
I'm reasonably sure that what you suggest are all doable.
I have no idea, however, how practical that is in this
situation or how much extra work that would entail.
I hear occasionally from Russian users that their sites
use an accept-charset sniffer. I guess in languages where multiple
charsets are competing, accept-charset would be nice, but again
I don't know how sorely this is needed for such a case.

I think I've stated the reasons for re-opening the bug. Other 
opinions are welcome.
I've talked to noriko further about this, and it looks like
the script is part of C code and cannot be changed without
patching the source itself. This will fall into sustaining
engineering's area. There is apparently a less-than-perfect but
nonetheless workable way to turn off accept-charset sniffing and
send UTF-8 data, however. This will be a tech support issue.

I don't necessarily buy an argument that we are sending too many
HTTP headers -- I compared IE5 and Comm 4.72 and the only difference
is that IE5 does not send out accept-charset.
But I can buy an argument that we should not send out what is not
an important or sorely needed HTTP header. This might at this
point in time fall into that category.

The only other point I would like to pursue is that others in the 
net community agree with this assessment. It won't hurt to ask
before verifying the resolution. And that is what I will do now.
I've publicly asked net people about this feature,
and no one expressed concern about this feature not
being in Mozilla. The question was asked some time ago,
and I now feel that we have waited long enough for
a reaction.
I think the resolution should be wontfix rather than
worksforme, however.
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Changing the resolution to WontFix.
Status: REOPENED → RESOLVED
Closed: 25 years ago
Resolution: --- → WONTFIX
Verified as Wontfix.
Status: RESOLVED → VERIFIED
I have read through all the arguments in bug 13393 and would like to
weigh in with a few:

>>>>> From Erik van der Poel 2000-01-22 13:56 -------

ep> MSIE5 does not emit Accept-Charset, and MSIE5 has a large market
ep> share. If DSGW is interested in supporting a large fraction of
ep> Internet users, DSGW will have to make changes to their own
ep> releases and to their customers' installations.

As far as I remember, MSIE5 sends HTTP/1.1 and thus is required to
understand UTF-8 (cf. section 14.2, HTTP/1.1). If a browser
understands UTF-8 and everybody knows this because it is HTTP/1.1, it
can refrain from sending this header, it would be redundant. But even
in this case it would still be polite to send Accept-Charset because a
HTTP/1.0 proxy will be required to downgrade a request to HTTP/1.0 and
thus the server can't find out that the browser behind the proxy is
HTTP/1.1.

ep> Mozilla is trying to reduce the amount of stuff it sends out with
ep> EVERY HTTP request. Accept-Charset has limited value. Mozilla needs
ep> to weigh all of these factors and make a decision. It's not my
ep> decision to make, but my opinion is that Mozilla 5 should refrain
ep> from emitting Accept-Charset for the above reasons.

Erik doesn't say why. I honour the decision to send terse headers, but
it is a wrong decision to say, let's just follow IE5. As long as we do
not have the arguments on the table why they decided their way, we
must find them out ourselves.

>>>>> ------ Additional Comments From Katsuhiko Momoi 2000-01-22 14:22 -------

km> I'm reasonably sure that what you suggest are all doable. I have
km> no idea, however, how practical that is in this situation or how
km> much extra work that would entail. I hear occasionally from
km> Russian users that their sites use accept-charset sniffer. I guess
km> in languages where multiple charsets are competing, accept-charset
km> would be nice but again I don't know how sorely this is needed for
km> such a case.

I'm not speaking for languages where multiple charsets are competing,
I'm speaking from the perspective of an i18n'd server, of which I have
implemented a few. An i18n'd server typically works with Unicode
internally and converts on request. The server can be implemented in a
language-ignorant way, it sends many languages. Talking about language
here somehow muddies the waters. If Mozilla doesn't send
Accept-Charset, the server side must convert to iso-8859-1 because
this was the standard charset in HTTP/1.0. Period.

So my revised suggestion of how to form this header would be:

    Accept-Charset: utf-8,*;q=0.8

and leave the primary charset out of the equation. I see no reason why
the primary charset should be announced to servers at all. Mozilla can
convert to it anyway. And if the conversion would be lossy, it would be
wise not to convert to it. But that's beyond the scope of this bugid.
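For illustration, here is one way a server could weigh the suggested header. This is my own minimal sketch, not a full RFC 2616 Accept-Charset parser; it handles only comma-separated items, an optional ";q=" parameter, and the "*" wildcard:

```c
#include <stdlib.h>
#include <string.h>
#include <strings.h>  /* strcasecmp (POSIX) */
#include <assert.h>

/* Return the q-value a header like "utf-8,*;q=0.8" assigns to a
 * charset: an item without ";q=" counts as q=1.0, and "*" covers any
 * charset not listed explicitly. */
static double qvalue_for(const char *header, const char *charset)
{
    char buf[256];
    double exact = -1.0, wildcard = 0.0;

    strncpy(buf, header, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';

    for (char *item = strtok(buf, ","); item != NULL;
         item = strtok(NULL, ",")) {
        double q = 1.0;
        while (*item == ' ')  /* skip optional whitespace */
            item++;
        char *semi = strchr(item, ';');
        if (semi != NULL) {
            char *qs = strstr(semi + 1, "q=");
            *semi = '\0';  /* truncate item to the bare charset token */
            if (qs != NULL)
                q = atof(qs + 2);
        }
        if (strcasecmp(item, charset) == 0)
            exact = q;
        else if (strcmp(item, "*") == 0)
            wildcard = q;
    }
    return (exact >= 0.0) ? exact : wildcard;
}
```

Under "utf-8,*;q=0.8" the server would prefer UTF-8 (q=1.0) but could fall back to anything else at q=0.8, which is exactly the negotiation Andreas describes.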

-- 
andreas

Status: VERIFIED → REOPENED
Resolution: WONTFIX → ---
Andreas, the LDAP server case I was referring to above is one example of your
i18n'ed server. It stores all the data in UTF-8. It then sends that data to a client
in UTF-8, or in an encoding appropriate for the language of the client in case
the client does not explicitly say what charset it can accept.

(The question of language does come into play for certain types of data.)
*** Bug 48361 has been marked as a duplicate of this bug. ***
"in an encoding appropriate for the language of the client" is a very vague
concept. What if the document is not in the language of the client and is not
displayable in the encoding appropriate for the language of the client? Note
that the language of the client can be a set of languages too.
Andreas, there are many different ways to make use of accept-charset.
Suppose you have a directory server deployed in a predominantly Japanese
environment. The LDAP protocol's default charset is UTF-8, thus all the data
would be in UTF-8. Now if a Japanese client accesses it and says that its
primary charset is Shift_JIS but UTF-8 is OK, then the server simply sends
UTF-8. If not, it sends Shift_JIS-encoded Japanese data. This kind of use is
what we have in the case described above.
Then there are the kinds of cases you describe above. You may have data in many
languages on a single page, which can be encoded in ISO-8859-1 or UTF-8.

The notion of primary charset is quite useful in some of these cases.
Note also that ISO-8859-1 is always assumed even if it is not explicitly listed. 

 
Katsuhiko, I'd like to structure the things to discuss, not all of
them need to be addressed or resolved now.

1. Should Mozilla send an Accept-Charset header that contains at least
utf-8 and "*"? I believe my arguments above prove this is necessary,
and Mozilla should have it, at least for the next few years during
which the rest of the world is not utf-8 safe.

2. Should Mozilla have the notion of a primary charset? I did not
question this and I still believe it is useful for Mozilla. I see the
main usefulness when it comes to storing content on disk, but also
when it comes to browsing sites that do not declare their charset and
heuristics are needed to determine it. But this is an entirely
different problem domain, so let's not get carried away with these
problems.

3. Should Mozilla include the primary charset in the Accept-Charset
header? I see no need to. Mozilla can most probably read any charset
and this is expressed with the star. If Mozilla has no bugs in the
conversion engine, it makes no difference for the user if he gets a
LATIN SMALL LETTER C WITH CEDILLA as u+00E7 in utf-8 or as 0xE7 in
iso-8859-1. Or, to try an equivalent: 0xC4 in Shift_JIS is a
HALFWIDTH KATAKANA LETTER TO, and U+FF84 is the same thing. No need to
express a preference of one over the other.

4. Does the user need to be able to configure the Accept-Charset
header? I see no reason to. Same argument as in (3) above.

5. Does Mozilla need to consider the set of languages the user has
chosen in the language preferences when sending the Accept-Charset
header? I'd say, definitely not.


Among the 5 topics, only #1 needs to be addressed.
My response to issues raised by Andreas:

#1: Agreed.
#2: We already have that expressed in the Navigator default charset in the
    Preferences. (This is the client-side preference setting and has no
    interactive aspect with servers.)
#3: In an ideal world, this would be true. But just like your argument in #1,
    i.e. the world is not UTF-8 safe yet, not everyone tags their Unicode
    documents with a lang tag indicating what language they are in. And Mozilla
    has a dependency on language for deciding which font glyphs to use. For
    example, Unicode CJK ideographs are not necessarily rendered the same from
    language to language. The same code point may lead to different font glyphs
    depending on what language it is. Unless everyone uses a lang tag, I may
    end up seeing a Japanese document with some Chinese glyphs. And I
    definitely don't want that!
    (See how fonts are set in the preference dialog -- according to language.
    But if language info is not available in the docs, we do our best by
    looking at the charset info -- a charset is a good secondary determining
    factor for some languages, e.g. Chinese, Japanese, Korean, etc. Thus, the
    notion of primary charset is still useful in this situation.)
#4: The user does not have to, as long as the localization process can take
    care of it.
#5: Agreed. But we may use the Navigator default charset for this.
Thank you for the background info for #3--very interesting, I see more light now
and agree with you.
Target Milestone: M13 → Future
*** Bug 60496 has been marked as a duplicate of this bug. ***
There is a patch attached to bug 60496, by the way.
Added "patch" keyword.
Keywords: patch
Thanks a lot for the patch!

There's some purely cosmetic thing left. When the default character set chosen
via Preferences/Languages is "Unicode (UTF-8)", then the resulting
Accept-Charset header becomes:

    Accept-Charset: UTF-8, utf-8; q=0.667, *; q=0.667

which seemingly is legal but redundant.
Koenig: whoops, you're right... the patch is designed to avoid the duplicate
"utf-8", but it doesn't check for case. Change line 116 of the patch from:

+  if (PL_strstr(acceptable, "utf-8") == NULL) {

to

+  if (PL_strcasestr(acceptable, "utf-8") == NULL) {

and that should do the trick.
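To see why the case-sensitive search misses the duplicate, compare plain strstr with a case-insensitive scan. contains_ci below is a portable stand-in for NSPR's PL_strcasestr, written here only for illustration:

```c
#include <stddef.h>
#include <string.h>
#include <strings.h>  /* strncasecmp (POSIX) */
#include <assert.h>

/* Case-insensitive substring test: return 1 if needle occurs anywhere
 * in haystack, ignoring ASCII case (what PL_strcasestr provides). */
static int contains_ci(const char *haystack, const char *needle)
{
    size_t n = strlen(needle);
    for (const char *p = haystack; *p != '\0'; p++)
        if (strncasecmp(p, needle, n) == 0)
            return 1;
    return 0;
}
```

With the default charset set to "UTF-8", the case-sensitive check fails to find "utf-8" and appends a second entry, producing the redundant header quoted above.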
Also, while the Language Preference screen won't let you do it, the above patch
will allow a comma-separated list of character sets/encodings in
intl.charset.default, which you can set by manually editing your prefs.js.

Nothing else seems to use intl.charset.default (true?), but if something else
isn't expecting comma-delimited tokens in that preference, this could get you
into trouble.
intl.charset.default must be a single-item entry.
(No comma-delimited list should be in it -- it
defeats the purpose of this pref!)
It is your default fallback encoding for browsing
in case HTTP, HTTP Meta-Equiv, or Auto-detection
cannot give you a document charset.
For Composer, it is used as the default encoding
for a new document.

This value should be set by a localizer to be
suitable for each locale. It has a UI also:

Edit | Prefs | Navigator | Languages | Character Coding.
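The fallback order described above can be sketched as follows. The function and parameter names are hypothetical, not Mozilla's actual charset resolver:

```c
#include <stddef.h>
#include <string.h>
#include <assert.h>

/* Illustrative fallback chain: the document charset comes from the
 * HTTP header, then the META http-equiv tag, then auto-detection,
 * and only then from the single-item intl.charset.default pref. */
static const char *resolve_doc_charset(const char *http_charset,
                                       const char *meta_charset,
                                       const char *detected_charset,
                                       const char *pref_default)
{
    if (http_charset != NULL)
        return http_charset;
    if (meta_charset != NULL)
        return meta_charset;
    if (detected_charset != NULL)
        return detected_charset;
    return pref_default;  /* e.g. "Shift_JIS" for a ja localization */
}
```

This is why the pref must stay a single item: it is the final answer when every other source comes up empty, not a negotiation list.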
Understood. The above patch is still o.k.: while _it_ can handle a
comma-delimited list, it doesn't add a comma list to the pref itself -- just a
little bit of robustness, unneeded for now until the patch is changed to use a
preference other than intl.charset.default. Access to intl.charset.default is
read-only.
http bugs to "Networking::HTTP"
Assignee: gagan → darin
Status: REOPENED → NEW
Component: Internationalization → Networking: HTTP
QA Contact: momoi → tever
Target Milestone: Future → M19
Keywords: intl
Depends on: 65092
No longer depends on: 65092
Blocks: 65092
nominating for moz 0.9
Target Milestone: --- → mozilla0.9
Looks good.  r=darin
adding keyword nsbeta1
Keywords: nsbeta1
Fix checked in.
Status: NEW → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
You can check what mozilla sends at:
http://gemal.dk/browserspy/accept.cgi
Henrik Gemal wrote:
> You can check what mozilla sends at:
> http://gemal.dk/browserspy/accept.cgi

or you can use
http://www.mozilla.gr.jp:4321/

which is step B20 of the smoketests at
http://www.mozilla.org/quality/smoketests/

verified
Status: RESOLVED → VERIFIED