Open Bug 259529 Opened 20 years ago Updated 2 years ago

Submitted form doesn't use best charset from accept-charset

Categories

(Core :: DOM: Core & HTML, defect)

x86
Windows XP
defect

Tracking

()

UNCONFIRMED

People

(Reporter: st, Unassigned)

Details

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; de-DE; rv:1.7) Gecko/20040803 Firefox/0.9.3
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; de-DE; rv:1.7) Gecko/20040803 Firefox/0.9.3

In a <form accept-charset="iso-8859-1,utf-8">, data is always sent as
iso-8859-1, even if it contains characters that are not in the iso-8859-1 table.
Firefox sends utf-8 data only if accept-charset is set to "utf-8" alone.

The following table compares the POST data Firefox and IE 6 send. The text
entered in a <textarea> was "& ę ö" (whatever appears as the second character
here, it is Unicode character 281; in the following table, however, everything
is displayed as intended).

          iso-8859-1,utf-8  utf-8,iso-8859-1  iso-8859-1  utf-8
Firefox:  & &#281; ö        & &#281; ö        & &#281; ö  & Ä? ö
IE 6:     & Ä? ö            & Ä? ö            & ê ö       & Ä? ö

This is a somewhat serious problem since the server cannot distinguish between
the Unicode character 281 (sent as &#281;) and the literal text &#281; (also
sent as &#281;), so reliable server-side correction is impossible.
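
The ambiguity can be reproduced outside the browser. The following is a minimal Python sketch, not Firefox code. It assumes the browser's fallback behaves like Python's "xmlcharrefreplace" error handler, which matches the &#281; shown in the table but is only an assumption about the implementation. The percent-encoding step of a real form submission is left out.

```python
# Assumption: characters missing from the target charset are replaced with
# decimal numeric character references, as Python's "xmlcharrefreplace"
# error handler does. This is an illustration, not the browser's code.
typed_character = "\u0119"   # the user typed U+0119 (decimal 281)
typed_literal = "&#281;"     # the user typed these six ASCII characters

sent_character = typed_character.encode("iso-8859-1", errors="xmlcharrefreplace")
sent_literal = typed_literal.encode("iso-8859-1", errors="xmlcharrefreplace")

# Both inputs yield identical bytes, so the server cannot tell them apart.
print(sent_character == sent_literal)  # True

# With utf-8 the two inputs stay distinguishable.
print(typed_character.encode("utf-8") == typed_literal.encode("utf-8"))  # False
```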

Even though IE 6 fails at iso-8859-1, replacing character 281 with ê, it is
basically doing the right thing by using utf-8 whenever possible. If Firebird
chose iso-8859-1 because it is the most widely supported charset, that reasoning
does not apply here: if the server says it accepts utf-8, we may use it.

Reproducible: Always
Steps to Reproduce:

afaik the spec doesn't demand this behavior
Assignee: bugs → form-submission
Severity: major → enhancement
Component: Form Manager → HTML: Form Submission
Product: Firefox → Browser
QA Contact: firefox.form-manager
Version: unspecified → Trunk
(In reply to comment #1)
> afaik the spec doesn't demand this behavior

Excuse me, the spec doesn't demand *usable* forms? Are you kidding?
Severity: enhancement → normal
Reporter, does the behavior change if you use a space-separated list of charsets
instead of a comma-separated one?

We _should_ be taking the first charset listed that we support (since that's the
way Accept-* things tend to work), but we may be screwing this up for
comma-separated lists....
(In reply to comment #3)
> We _should_ be taking the first charset listed that we support (since that's the
> way Accept-* things tend to work), 

1. Then you're breaking the rules, simple as that. We're not talking about HTTP
here, this is HTML, and the HTML recommendation clearly states that "the client
must interpret" (must!) <form>'s accept-charset attribute "as an exclusive-or
list" (http://www.w3.org/TR/html4/interact/forms.html#h-17.3). There are no
quality values for this HTML attribute and there is no explicit ordering.

2. Even though I do admit that most authors will most likely order the charsets
the way they prefer them, there is absolutely no reason to just select the first
one you support and thereby possibly break the complete form processing. I'm not
even sure that the way you encode out-of-charset characters is defined somewhere
in the HTML form recommendations.

Please use the first charset you support _and_ that is able to transmit all
characters unmodified.
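
The requested selection rule can be sketched in a few lines of Python. This is only an illustration of the idea under stated assumptions, not a proposal for the actual implementation: the function name pick_charset is hypothetical, and "supported" here simply means that the Python runtime knows the codec.

```python
def pick_charset(text, accepted):
    """Return the first accepted charset that encodes text losslessly, else None.

    Hypothetical helper for illustration only.
    """
    for name in accepted:
        try:
            # Strict encoding raises UnicodeEncodeError if any character is
            # missing from the charset, and LookupError if the charset
            # itself is unknown to this runtime.
            text.encode(name)
        except (LookupError, UnicodeEncodeError):
            continue
        return name
    return None


print(pick_charset("& \u0119 \u00f6", ["iso-8859-1", "utf-8"]))  # utf-8
print(pick_charset("& \u00f6", ["iso-8859-1", "utf-8"]))         # iso-8859-1
```

With this rule the author's ordering still expresses a preference, but a charset is skipped whenever it would force character substitution.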

Nothing else I could add, except that "the specs don't demand" working forms and
"some other protocol does it like that" do not make any sense to me, at least
not in terms of why I should not be able to use some letter combinations with
Firefox. Sorry.
(In reply to comment #4)
> 1. Then you're breaking the rules, simple as that.
...
> the HTML recommendation clearly states that "the client
> must interpret" (must!) <form>'s accept-charset attribute "as an exclusive-or
> list"

We're doing that.  Where do you see us not doing that?

> quality values for this HTML attribute and there is no explicit ordering.

Indeed.  So it's up to the user-agent to somehow select a charset.  The
algorithm we have implemented is "select the first one in the list".  Which is
perfectly compliant with the spec...

> I'm not even sure that the way you encode out-of-charset characters is defined
> somewhere in the HTML form recommendations.

It's not.  It's the de-facto standard, though.

> Please use the first charset you support

That's a reasonable request, but a lot of work....  The problem you ran into is
actually a bit more extensive than this, though.  Note that "utf-8,iso-8859-1"
wasn't sent as utf-8 in your test.  This is why I asked you to test a
space-separated charset list in comment 3.

If my guess is right, we'll need to split this into two bugs -- one on the
parsing of the accept-charset attribute (which is a bug and a spec violation)
and one requesting an enhancement (desirable, but difficult) to behavior that
is already spec-compliant.  So if you could do the test I asked you to do, that
would be much appreciated.
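
For reference, HTML 4.01 defines the accept-charset value as a space- and/or comma-delimited list, so both separators should produce the same result. The following Python sketch shows what treating them equally looks like. It is not Gecko's parser, and the function name parse_accept_charset is hypothetical.

```python
import re


def parse_accept_charset(value):
    """Split an accept-charset attribute value into lowercase charset names.

    Hypothetical helper for illustration only.
    """
    # Commas and whitespace are both treated as delimiters; empty pieces
    # produced by leading, trailing, or doubled delimiters are dropped.
    return [name.lower() for name in re.split(r"[,\s]+", value) if name]


print(parse_accept_charset("iso-8859-1,utf-8"))      # ['iso-8859-1', 'utf-8']
print(parse_accept_charset("UTF-8  ISO-8859-1"))     # ['utf-8', 'iso-8859-1']
print(parse_accept_charset(" utf-8 , iso-8859-1 "))  # ['utf-8', 'iso-8859-1']
```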
Assignee: form-submission → nobody
QA Contact: form-submission
Component: HTML: Form Submission → DOM: Core & HTML
Severity: normal → S3