Closed Bug 5313 Opened 28 years ago Closed 24 years ago

Accept-Charset for form is not implemented.

Categories

(Core :: Internationalization, defect, P1)

defect

VERIFIED FIXED

People

(Reporter: ftang, Assigned: ftang)

Details

(Whiteboard: [nsbeta2-][nsbeta3+]patch in hand need review.)

(This bug imported from BugSplat, Netscape's internal bugsystem.  It
was known there as bug #56223
http://scopus.netscape.com/bugsplat/show_bug.cgi?id=56223
Imported into Bugzilla on 04/20/99 12:24)


Split from Bug 48964:
From: http://www.nagual.ru/~ache/n4w95.html#bug_list
1) Netscape does not convert <FORM> input from CP1251 (the Russian Windows
default character set) to KOI8-R when needed. I.e. it totally
ignores the ACCEPT-CHARSET="KOI8-R" <FORM> attribute, and the global HTML
page character set too, for both the <META> and HTTP header cases. See
Internationalization of the Hypertext Markup Language (RFC 2070)
for details. Look at http://www.nagual.ru/~ache/main.html#form_input
to see this bug in action.
*** Bug 48964 has been marked as a duplicate of this bug. ***
We don't plan to support Accept-Charset in FORM according to the Multilingual
HTML RFC in Dogbert. Later this.
Per 6/30 I18n Latered Bug Meeting, this bug is marked as WONTFIX.
We should do review of RFC specs compliance and this bug should be
marked as a duplicate of that bug.
*** This bug has been marked as a duplicate of 75280 ***
Duplicate bug, bulk Verified
This bug will be moved over to 5.0 for a review.
It is true that we don't do anything with the Accept-Charset
attribute for Form Input and TextArea.

The relevant section of the RFC 2070 is "5.1 DTD additions". This does
not seem to be a requirement but rather a recommendation for a user
agent. The recommended action upon encountering the Accept-Charset
attribute would be:

1) a warning to the user about what charset the form can accept, or
2) restrict the input charsets to those listed as the attribute values.

We need to decide if we should follow this recommendation.
Assignee: erik → bobj
Bob, we need to decide who will own HTML form I18N issues.
Target Milestone: M7
Here's a reference:
http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.3:
  accept-charset = charset list [CI]
     This attribute specifies the list of character encodings for input
     data that must be accepted by the server processing this form. The
     value is a space- and/or comma-delimited list of charset values.
     The server must interpret this list as an exclusive-or list, i.e.,
     the server must be able to accept any single character encoding per
     entity received.
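
A minimal sketch of how a client might tokenize the "space- and/or comma-delimited list" described above. The helper name ParseAcceptCharset is hypothetical and is not part of the Mozilla source; it only splits the attribute value and does not validate the charset names.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: split an accept-charset attribute value on spaces
// and/or commas, dropping the empty tokens produced by runs of delimiters.
std::vector<std::string> ParseAcceptCharset(const std::string& value) {
  std::vector<std::string> charsets;
  std::string token;
  for (char c : value) {
    if (c == ' ' || c == ',') {
      if (!token.empty()) {
        charsets.push_back(token);
        token.clear();
      }
    } else {
      token.push_back(c);
    }
  }
  if (!token.empty()) charsets.push_back(token);
  return charsets;
}
```

Called with "KOI8-R, windows-1251 UTF-8", this yields three entries in their original order, which matters if the list is treated as prioritized.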

We need a strategy on supporting charset encodings in form submissions
   http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13
Bob, you forgot to include the important paragraph:
The default value for this attribute is the reserved string "UNKNOWN". User
agents MAY interpret this value as the character encoding that was used
to transmit the document containing this FORM element.

Basically, the HTML spec does not say that the user agent MUST return the value in
those charsets (and which one from the list?). It only says the server MUST
be able to process these charsets. The user agent MAY interpret this value as
the character encoding that was used to transmit the document. So, in other words, this is an
invalid bug. Ignoring this value does conform to the HTML spec.
What is the relationship of RFC2070 to this bug? I thought that this bug was
originally about a case like the following:

1. The web designer wants to restrict the input charset to those
   she/he specifies as the Accept-Charset attributes of Form.
2. Now if someone inputs into form, via a client, in a charset not
   listed as Accept-Charset attributes, then the client can
   either 1) warn the user that the input charset is not allowed by the
   form but send it anyway or 2) refuse to submit in that charset, or
   3) convert it to a charset which is of the same encoding family
      if that is possible.

3. If no Accept-Charset value is present, then it's the same as "UNKNOWN".
   If "UNKNOWN" is present, then it's still the same thing. But if
   explicit values are present, then we need to honor these and
   do one of the things listed in 2 above.

This is my interpretation of RFC 2070 and this seems to be also consistent
with what the HTML 4.0 spec says about Accept-Charset in form.

These are all client-responsibilities.
Status: NEW → ASSIGNED
Target Milestone: M7 → M8
Target Milestone: M8 → M9
There are 2 content types into which form data can be encoded (enctype):
  (1) application/x-www-form-urlencoded
  (2) multipart/form-data
See: http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13.4

In case (1), there is no way to pass the charset encoding back to the
server, so I think we should continue with the current 4.x behavior of
encoding the form data set in the charset encoding of the form.

In case (2) (not supported prior to 5.0), we can specify the charset
of the form data being submitted by using the charset parameter in
the MIME content-type [see RFC2045].  I suggest that we try to listen
to the <FORM> accept-charset parameter by trying to convert the form
data set into the specified charset(s).  If it converts without
error, submit the converted data, otherwise try the next charset in
the accept-charset list.  If none of the listed charsets convert
without error, then default to the charset of the form.  But we always
include the charset parameter.
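
The fallback order proposed above can be sketched as follows. This is only an illustration under stated assumptions: CanEncodeToy is a made-up stand-in for a real charset converter and models just three charsets by their highest representable code point, and PickSubmitCharsetSketch is a hypothetical name, not an existing Mozilla function.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-in for a real converter: reports whether every code point in
// `text` is representable in `charset`. Only three charsets are modelled;
// anything else counts as a conversion failure.
bool CanEncodeToy(const std::u32string& text, const std::string& charset) {
  char32_t maxCodePoint;
  if (charset == "us-ascii") maxCodePoint = 0x7F;
  else if (charset == "iso-8859-1") maxCodePoint = 0xFF;
  else if (charset == "utf-8") maxCodePoint = 0x10FFFF;
  else return false;
  for (char32_t cp : text) {
    if (cp > maxCodePoint) return false;
  }
  return true;
}

// Try each accept-charset entry in order; the first one that converts
// without error wins. If none does, default to the charset of the form.
std::string PickSubmitCharsetSketch(const std::u32string& text,
                                    const std::vector<std::string>& acceptCharsets,
                                    const std::string& formCharset) {
  for (const std::string& charset : acceptCharsets) {
    if (CanEncodeToy(text, charset)) return charset;
  }
  return formCharset;
}
```

In this sketch the whole form data set is tested against one charset at a time, so a single unconvertible character moves the submission to the next entry in the list.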

Comments?
In case (1), there are 2 subcases (a) and (b):

  (a) method=get
  (b) method=post

In case (1)(a), it is not possible to send the charset label along with the
form submission. In case (1)(b), it *is* possible:

  Content-Type: application/x-www-form-urlencoded; charset=iso-8859-1

Note that the entire form submission must be in this charset, so we would have
to try converting all of the fields into that charset to see if it's OK.

Note also that we had problems with certain servers/CGIs when we tried this a
while ago (adding charset label in POST case).
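
For illustration, a small sketch of case (1)(b): the field bytes, already converted to the chosen charset, are percent-encoded, and the charset label is appended to the Content-Type header only when requested. Both function names are hypothetical. Leaving `*`, `-`, `.` and `_` unescaped is a choice made for this sketch, not something specified in this thread.

```cpp
#include <cassert>
#include <string>

// Percent-encode bytes for application/x-www-form-urlencoded.
// The input is assumed to be already converted to the submission charset.
std::string FormUrlEncodeSketch(const std::string& bytes) {
  static const char kHex[] = "0123456789ABCDEF";
  std::string out;
  for (unsigned char c : bytes) {
    bool alnum = (c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') ||
                 (c >= 'a' && c <= 'z');
    if (alnum || c == '*' || c == '-' || c == '.' || c == '_') {
      out += static_cast<char>(c);
    } else if (c == ' ') {
      out += '+';
    } else {
      out += '%';
      out += kHex[c >> 4];
      out += kHex[c & 0x0F];
    }
  }
  return out;
}

// Build the request header, optionally labelled with the charset.
std::string UrlEncodedContentType(const std::string& charset, bool labelCharset) {
  std::string header = "Content-Type: application/x-www-form-urlencoded";
  if (labelCharset) header += "; charset=" + charset;
  return header;
}
```

Because the label applies to the whole body, every field has to be encoded with the same charset before being passed to FormUrlEncodeSketch.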

In case (2), it is not necessary for the entire form submission to be in a
single charset, since you can label each field separately:

--AaB03x
content-disposition: form-data; name="field1"
content-type: text/plain;charset=windows-1250
content-transfer-encoding: quoted-printable

Joe Blow owes =80100.
--AaB03x
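
A rough sketch of how one such labelled part could be assembled. The function names are hypothetical, and the quoted-printable encoder is deliberately simplified: it escapes '=', control bytes and bytes at or above 0x7F, but omits soft line breaks and trailing-whitespace handling.

```cpp
#include <cassert>
#include <string>

// Simplified quoted-printable encoder (no soft line breaks, no special
// handling of trailing whitespace).
std::string QuotedPrintableSketch(const std::string& bytes) {
  static const char kHex[] = "0123456789ABCDEF";
  std::string out;
  for (unsigned char c : bytes) {
    if (c == '=' || c < 0x20 || c >= 0x7F) {
      out += '=';
      out += kHex[c >> 4];
      out += kHex[c & 0x0F];
    } else {
      out += static_cast<char>(c);
    }
  }
  return out;
}

// Assemble one multipart/form-data part whose value is labelled with its
// own charset. `bytes` is assumed to be already encoded in `charset`.
std::string FormDataPartSketch(const std::string& boundary,
                               const std::string& name,
                               const std::string& charset,
                               const std::string& bytes) {
  std::string part;
  part += "--" + boundary + "\r\n";
  part += "content-disposition: form-data; name=\"" + name + "\"\r\n";
  part += "content-type: text/plain;charset=" + charset + "\r\n";
  part += "content-transfer-encoding: quoted-printable\r\n";
  part += "\r\n";
  part += QuotedPrintableSketch(bytes) + "\r\n";
  return part;
}
```

With the windows-1250 byte 0x80 in the value, the encoder produces the `=80` escape seen in the example above.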
Good points.  But what are you recommending?

I don't think it is normally useful to submit different fields in
different charsets in the multipart/form-data case.

For (1b), we could modify the proposal to label the post with a charset.
But as you point out, it may cause problems for servers/CGI's which
cannot handle the parameter.  We could control the behavior by prefs
for cases (1b) and (2), with defaults off and on respectively?

I still like the first proposal.  It preserves backward compatibility
and HTML4 does recommend ("should") using multipart/form-data for non-ASCII:
  http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13.4

   The content type "application/x-www-form-urlencoded" is inefficient for
   sending large quantities of binary data or text containing non-ASCII
   characters. The content type "multipart/form-data" should be used for
   submitting forms that contain files, non-ASCII data, and binary data.

Content developers who want to add accept-charset could also change
the forms to use multipart/form-data.
I didn't intend to recommend that we use more than one charset in the form-data
case. I was just pointing out that our implementation *must* use a single
charset in the other case (1). It is probably better to use a single charset
in the form-data case, just to avoid needless confusion and so on, but I don't
feel too strongly about this.

Using prefs to control whether or not we append the charset in case (1b) is
probably a good idea. Those prefs do not need to be surfaced in UI, I think.

I also like the idea of trying to convert to one of the charsets in the
accept-charset attribute.

Furthermore, it might be a good idea to experiment with adding the charset
in case (1b).

Maybe we should even try adding a Content-Type header with a charset to the
request headers immediately following the GET command. GET doesn't have a body,
so it's abnormal, but it might work, and would allow CGIs to receive the
charset info.

Added Valeski to Cc list for opinions.
Currently the label part of (1b) is implemented in 5.0;
see http://lxr.mozilla.org/mozilla/source/layout/html/forms/src/nsFormFrame.cpp
for details (look at #ifdef SPECIFY_CHARSET_IN_CONTENT_TYPE). We can easily
remove this feature/bug by commenting out the #define
SPECIFY_CHARSET_IN_CONTENT_TYPE

I didn't do this for case (2). It should be easy- just change
1108   sprintf(buffer, "Content-type: %s; boundary=%s" CRLF, MULTIPART,
boundary);

Currently it decides the submission charset based on what it believes the
document charset is, the same way we did in 1.x - 4.x.

There is a method called GetSubmitCharset() which returns one charset.
Currently it returns the charset of the document. We can change it to return one
charset from the Accept-Charset list.
Assignee: bobj → ftang
Status: ASSIGNED → NEW
Assigned to ftang.  Here's my updated proposal:
  (1a) application/x-www-form-urlencoded, method=get
         Submit in charset of HTML form document (4.x behavior)  - Done
  (1b) application/x-www-form-urlencoded, method=post
         If pref-xxx enabled
            Submit in GetSubmitCharset() and label with charset parameter
         Else (4.x behavior)
            Submit in charset of form, and no charset parameter
  (2) multipart/form-data
         Submit in GetSubmitCharset() and label with charset parameter

GetSubmitCharset() would return either
   (i)  a valid charset from the prioritized accept-charset list, or
   (ii) form charset
A "valid charset" means that the data for submission can successfully be
converted into that charset.
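
The proposal above amounts to a small decision table, sketched here. The type and function names are hypothetical, `prefEnabled` stands for the unnamed pref-xxx, and the sketch ignores the request method for multipart/form-data because the proposal does not distinguish it there.

```cpp
#include <cassert>

enum class EncType { UrlEncoded, MultipartFormData };
enum class Method { Get, Post };

// What the proposal decides for a given submission.
struct SubmitPlan {
  bool useGetSubmitCharset;  // true: GetSubmitCharset(); false: charset of the form
  bool labelWithCharset;     // true: add a charset parameter to Content-Type
};

SubmitPlan PlanSubmission(EncType enctype, Method method, bool prefEnabled) {
  // (2) multipart/form-data: always convert and label.
  if (enctype == EncType::MultipartFormData) return {true, true};
  // (1b) urlencoded POST: convert and label only when the pref is on.
  if (method == Method::Post && prefEnabled) return {true, true};
  // (1a), and (1b) with the pref off: 4.x behavior.
  return {false, false};
}
```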

Should the default for pref-xxx be disabled (4.x behavior) or enabled?

Do we want to consider Erik's suggestion for (1a) (under pref control):
  Maybe we should even try adding a Content-Type header with a charset to
  the request headers immediately following the GET command. GET doesn't
  have a body, so it's abnormal, but it might work, and would allow CGIs
  to receive the charset info.
Target Milestone: M9 → M12
move to M12
Status: NEW → ASSIGNED
Target Milestone: M12 → M11
move it back to M11
Priority: P2 → P3
Assignee: ftang → tague
Status: ASSIGNED → NEW
QA Contact: ftang
Blocks: 16127
Status: NEW → ASSIGNED
Assignee: tague → ftang
Status: ASSIGNED → NEW
reassign this to myself.
Status: NEW → ASSIGNED
Target Milestone: M11 → M12
No longer blocks: 16127
Assignee: ftang → bobj
Status: ASSIGNED → NEW
Target Milestone: M12 → M13
Status: NEW → ASSIGNED
Target Milestone: M13 → M14
Change OS to ALL
OS: Windows NT → All
Keywords: beta1
Keywords: beta1
Target Milestone: M14 → M15
Target Milestone: M15 → M16
Reassigned to jbetak for Beta2.
Assignee: bobj → jbetak
Status: ASSIGNED → NEW
Keywords: beta2
Status: NEW → ASSIGNED
Keywords: nsbeta2
Keywords: beta2
Putting on [nsbeta2+] radar.  Feature, must fix by 5/16.
Whiteboard: [nsbeta2+][5/16][FEATURE]
Removed "[FEATURE]" from Status Whiteboard since this is really an old HTML
compliance bug originally logged in bugsplat against the old code base.
Whiteboard: [nsbeta2+][5/16][FEATURE] → [nsbeta2+][5/16]
Attempted to test this bug.
clicked on link
result- error message "www.nagual.ru could not be found. Please check the name
and try again."
Putting on [nsbeta2-] radar. Missed the Netscape 6 feature train.  Please set to 
MFuture.

Whiteboard: [nsbeta2+][5/16] → [nsbeta2-]
M16 has been out for a while now; these bugs' target milestones need to be
updated.
reassigning to ftang for resource reallocation
Assignee: jbetak → ftang
Status: ASSIGNED → NEW
Adding nsbeta3. We need this to be compatible with HTML 4.0.
The fix is local to one file and low risk. The only reason we have not done it yet
is because it is "local and low risk".
We should fix this in nsbeta3.
Status: NEW → ASSIGNED
Keywords: nsbeta3
FYI:
 Subject: RE: URL-encode international characters in Java?
 Resent-Date: Fri, 7 Jul 2000 12:24:44 -0400 (EDT)
 Resent-From: www-international@w3.org
        Date: Fri, 7 Jul 2000 09:23:25 -0700
        From: Chris Wendt <christw@MICROSOFT.com>
          To: "'Martin J. Duerst'" <duerst@w3.org>,
             "'Vinod Balakrishnan'" <vinod@filemaker.com>,
             Lenny Turetsky <LTuretsky@salesforce.com>,
             "'www-international@w3c.org'" <www-international@w3c.org>,
             "'servlet-interest@java.sun.com'" <servlet-interest@java.sun.com>

From: Martin J. Duerst [mailto:duerst@w3.org]
Sent: Thursday, July 06, 2000 11:53 PM
>Does IE support the 'accept-charset' parameter on FORM?

Yes. In a _very_ limited fashion:
If (accept-charset includes "UTF-8") AND (input contains characters not
fitting the document charset) THEN submit in UTF-8, regardless of the
document charset.

Chris..
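
The rule quoted above can be written down as a short sketch. This is an illustration of the description in the mail only, not IE's actual code. The function names are hypothetical, and whether the input fits the document charset is passed in as a flag rather than computed.

```cpp
#include <cassert>
#include <string>
#include <vector>

// ASCII-only case-insensitive comparison for charset names.
bool EqualsIgnoreAsciiCase(const std::string& a, const std::string& b) {
  if (a.size() != b.size()) return false;
  for (std::string::size_type i = 0; i < a.size(); ++i) {
    char x = a[i], y = b[i];
    if (x >= 'A' && x <= 'Z') x = static_cast<char>(x - 'A' + 'a');
    if (y >= 'A' && y <= 'Z') y = static_cast<char>(y - 'A' + 'a');
    if (x != y) return false;
  }
  return true;
}

// Submit in UTF-8 only if accept-charset lists it AND the input does not
// fit the document charset; otherwise keep the document charset.
std::string LimitedUtf8Rule(const std::vector<std::string>& acceptCharsets,
                            bool inputFitsDocumentCharset,
                            const std::string& documentCharset) {
  if (!inputFitsDocumentCharset) {
    for (const std::string& charset : acceptCharsets) {
      if (EqualsIgnoreAsciiCase(charset, "UTF-8")) return "UTF-8";
    }
  }
  return documentCharset;
}
```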
set it to P1 M18
Priority: P3 → P1
Target Milestone: M16 → M18
here is the patch http://warp/u/ftang/tmp/fix5313.txt
Index: src/nsFormFrame.cpp
===================================================================
RCS file: /m/pub/mozilla/layout/html/forms/src/nsFormFrame.cpp,v
retrieving revision 3.122
diff -u -r3.122 nsFormFrame.cpp
--- nsFormFrame.cpp     2000/07/12 23:31:07     3.122
+++ nsFormFrame.cpp     2000/07/21 23:29:09
@@ -25,6 +25,7 @@
 
 #define NS_IMPL_IDS
 #include "nsICharsetConverterManager.h"
+#include "nsICharsetAlias.h"
 #include "nsIPlatformCharset.h"
 #undef NS_IMPL_IDS 
 
@@ -970,7 +971,49 @@
   // XXX
   // We may want to get it from the HTML 4 Accept-Charset attribute first
   // see 17.3 The FORM element in HTML 4 for details
-
+  nsresult result = NS_OK;
+  nsAutoString acceptCharsetValue;
+  if (mContent) {
+    nsIHTMLContent* form = nsnull;
+    result = mContent->QueryInterface(kIHTMLContentIID, (void**)&form);
+    if (NS_SUCCEEDED(result) && (nsnull != form)) {
+      nsHTMLValue value;
+      result = form->GetHTMLAttribute(nsHTMLAtoms::acceptcharset, value);
+      if (NS_CONTENT_ATTR_HAS_VALUE == result) {
+        if (eHTMLUnit_String == value.GetUnit()) {
+          value.GetStringValue(acceptCharsetValue);
+        }
+      }
+      NS_RELEASE(form);
+    }
+  }
+#ifdef DEBUG_ftang
+  printf("accept-charset = %s\n", acceptCharsetValue.ToNewUTF8String());
+#endif
+  PRInt32 l = acceptCharsetValue.Length();
+  if(l > 0 ) {
+    PRInt32 offset=0;
+    PRInt32 spPos=0;
+    // get charset from charsets one by one
+    NS_WITH_SERVICE(nsICharsetAlias, calias, kCharsetAliasCID, &rv);
+    if(NS_SUCCEEDED(rv) && (nsnull != calias)) {
+      do {
+        spPos = acceptCharsetValue.FindChar(PRUnichar(' '),PR_TRUE, offset);
+        PRInt32 cnt = ((-1==spPos)?(l-offset):(spPos-offset));
+        if(cnt > 0) {
+          nsAutoString charset;
+          acceptCharsetValue.Mid(charset, offset, cnt);
+#ifdef DEBUG_ftang
+          printf("charset[i] = %s\n",charset.ToNewUTF8String());
+#endif
+          if(NS_SUCCEEDED(calias->GetPreferred(charset,oCharset)))
+            return;
+        }
+        offset = spPos + 1;
+      } while(spPos != -1);
+    }
+  }
+  // if there are no accept-charset or all the charset are not supported
   // Get the charset from document
   nsIDocument* doc = nsnull;
   mContent->GetDocument(doc);
@@ -987,6 +1030,9 @@
   nsAutoString charset;
   nsresult rv = NS_OK;
   GetSubmitCharset(charset);
+#ifdef DEBUG_ftang
+  printf("charset=%s\n", charset.ToNewCString());
+#endif
   
   // Get Charset, get the encoder.
   nsICharsetConverterManager * ccm = nsnull;
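
A standalone sketch of the selection logic in the GetSubmitCharset() hunk above, for readers without the Mozilla tree. It splits the attribute on single spaces and returns the first token the alias lookup recognizes, otherwise the document charset. LookupPreferredToy and its alias table are made up for this example and merely stand in for nsICharsetAlias::GetPreferred.

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy alias table standing in for the real charset alias service.
bool LookupPreferredToy(const std::string& alias, std::string& preferred) {
  static const std::map<std::string, std::string> kAliases = {
      {"koi8-r", "KOI8-R"},
      {"utf-8", "UTF-8"},
      {"utf8", "UTF-8"},
      {"iso-8859-1", "ISO-8859-1"},
      {"latin1", "ISO-8859-1"}};
  std::string key;
  for (char c : alias) {
    key += (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
  }
  std::map<std::string, std::string>::const_iterator it = kAliases.find(key);
  if (it == kAliases.end()) return false;
  preferred = it->second;
  return true;
}

// Same shape as the patch's loop: space-delimited tokens, first
// recognized name wins, document charset as the fallback.
std::string SubmitCharsetFromAttribute(const std::string& acceptCharset,
                                       const std::string& documentCharset) {
  std::string::size_type offset = 0;
  for (;;) {
    std::string::size_type sp = acceptCharset.find(' ', offset);
    std::string token = (sp == std::string::npos)
                            ? acceptCharset.substr(offset)
                            : acceptCharset.substr(offset, sp - offset);
    std::string preferred;
    if (!token.empty() && LookupPreferredToy(token, preferred)) return preferred;
    if (sp == std::string::npos) break;
    offset = sp + 1;
  }
  return documentCharset;
}
```

Two things are worth noting in this sketch. The choice depends only on whether the name is recognized, not on whether the form data converts into that charset. A comma-delimited value such as "koi8-r, utf-8" yields the token "koi8-r,", which the toy table does not recognize. Whether the real alias service behaves the same way is not established in this thread.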
Whiteboard: [nsbeta2-] → [nsbeta2-]patch in hand need review.
Also, we need http://warp/u/ftang/tmp/fix5313also.txt

Index: src/nsHTMLAtomList.h
===================================================================
RCS file: /m/pub/mozilla/layout/html/base/src/nsHTMLAtomList.h,v
retrieving revision 3.17
diff -u -r3.17 nsHTMLAtomList.h
--- nsHTMLAtomList.h    2000/06/07 06:58:43     3.17
+++ nsHTMLAtomList.h    2000/07/21 23:31:27
@@ -53,7 +53,7 @@
 HTML_ATOM(abbr, "abbr")
 HTML_ATOM(above, "above")
 HTML_ATOM(accept, "accept")
-HTML_ATOM(acceptcharset, "acceptcharset")
+HTML_ATOM(acceptcharset, "accept-charset")
 HTML_ATOM(accesskey, "accesskey")
 HTML_ATOM(action, "action")
 HTML_ATOM(align, "align")
Whiteboard: [nsbeta2-]patch in hand need review. → [nsbeta2-][nsbeta3+]patch in hand need review.
Checked in. Marking it fixed.
Status: ASSIGNED → RESOLVED
Closed: 24 years ago
Resolution: --- → FIXED
Verified as fixed.
Status: RESOLVED → VERIFIED
*** Bug 5314 has been marked as a duplicate of this bug. ***