Closed Bug 35970 Opened 20 years ago Closed 20 years ago

Error handling non-latin-1 entity refs inside form input elements

Categories

(Core :: DOM: Core & HTML, defect, P3)

x86
Windows 98
defect

Tracking

()

VERIFIED WORKSFORME

People

(Reporter: ian.graham, Assigned: pollmann)

References

()

Details

From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt)
BuildID:    20000022820

A document with no specified charset mishandles entity
references to non-latin-1 characters inside form input elements.

Reproducible: Always
Steps to Reproduce:
1. Access referenced document. Note, initially, how the greek
   letters (alpha, omega) are correctly rendered in heading and
   inside form input elements. Note there is no charset specified,
   so I _believe_ the browser assumes iso-8859-1?
2. Submit the form. Note how the greek chars are encoded as escaped
   question marks (as good as anything, since they aren't defined
   in iso-latin-1).
3. Press 'back' button. Note how greek letters in form boxes
   are now displayed as periods. If you do a view source, you'll see
   the correct entity references.
4. Press form 'submit' button to verify that the browser is processing
   these chars as periods!

Expected Results:  Unclear, since greek letters are not defined in 8859-1.
However, I would expect consistency, such as:
  a) always render entity references using correct characters (greek,
     whatever)
  b) always URL-escape characters that aren't defined in the
     document charset as encoded question marks (or as per future
     specification for this case)
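
For illustration, option (b) can be sketched in Python. This is only a
sketch of the observed behavior, not the browser's code; the helper name
urlencode_in_charset is made up for this example, and iso-8859-1 is assumed
as the fallback charset.

```python
from urllib.parse import quote


def urlencode_in_charset(text: str, charset: str = "iso-8859-1") -> str:
    """URL-escape text after encoding it in the given charset.

    Characters that the charset cannot represent are replaced by "?"
    before escaping, so each one ends up as %3F in the result.
    """
    raw = text.encode(charset, errors="replace")
    # safe="" escapes every reserved character, including "?" and "/"
    return quote(raw, safe="")


print(urlencode_in_charset("αω"))            # %3F%3F
print(urlencode_in_charset("αω", "utf-8"))   # %CE%B1%CF%89
```

With the UTF-8 charset the greek letters survive, which matches NOTE 1 below.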

NOTE 1: Things work fine if a meta element declares the charset to
   be UTF-8: See, for example
   http://www.java.utoronto.ca/NS5-bugs/encoding-test.html

NOTE 2: I haven't tested this for the case where the server sends
   a content-type header that gives the correct charset.

NOTE 3: [RFE] -- this whole mechanism is problematic if 
   a) the document is encoded in one charset (e.g. EUC-JP)
   b) I want to encode the URL using another charset (E.g. UTF-8)
  Any thoughts on how this could be done? For now, the URL seems
  to always encode data using the charset specified as the document
  charset.
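
To make the NOTE 3 case concrete, here is a small Python sketch of the same
form value escaped under two different charsets. The sample value "日本" is
an arbitrary choice for this example, and the Python standard library is
used only as a stand-in for whatever the browser does internally.

```python
from urllib.parse import quote, unquote

value = "日本"  # arbitrary sample form value, assumed for illustration

# Escaped using the document charset (EUC-JP in the NOTE 3 scenario)
as_document_charset = quote(value, encoding="euc_jp")

# Escaped using UTF-8 instead
as_utf8 = quote(value, encoding="utf-8")

print(as_document_charset)
print(as_utf8)

# The receiver must know which charset was used to recover the value
print(unquote(as_document_charset, encoding="euc_jp") == value)
print(unquote(as_utf8, encoding="utf-8") == value)
```

The two escaped strings differ, so the server cannot decode the query
correctly unless it knows (or is told) which charset the browser picked.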
Not sure if the guidelines for URI values in
"B.2.1 Non-ASCII characters in URI attribute values" in the HTML 4 spec:
http://www.w3.org/TR/REC-html40/appendix/notes.html#h-B.2.1 are relevant.
They say: convert to UTF-8, then URL-encode. The reasoning: if the request
method for the form is "GET", the form values could end up in a URL.
This is a good reference. However, the problem is backward compatibility, as
the use of non-UTF-8 encoding cannot always be inferred from the URL text. For
example, the UTF-8 encoding of the character at Unicode position 03EF is
the sequence %CF%AF, which can also be interpreted as two valid Latin-1 chars
(Ï and ¯). Thus if a URL was originally encoded to have these
two Latin-1 characters (using the traditional ISO-8859-1-based encoding), the
decoding algorithm would convert them to a single greek character.....
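
The ambiguity described above can be reproduced with a few lines of Python.
This is only a sketch using the standard library's urllib.parse; it says
nothing about how any browser or server actually decodes URLs.

```python
from urllib.parse import quote, unquote

# The character at Unicode position 03EF, escaped as UTF-8
escaped = quote("\u03ef", encoding="utf-8")
print(escaped)  # %CF%AF

# The same escaped text decodes cleanly under both interpretations
as_utf8 = unquote(escaped, encoding="utf-8")
as_latin1 = unquote(escaped, encoding="iso-8859-1")

print(len(as_utf8), len(as_latin1))  # 1 2
print(as_latin1)                     # Ï¯
```

Both decodings succeed without error, so the escaped text alone does not
reveal which charset the sender used.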
reassigning
Assignee: rods → pollmann
nominating for nsbeta2 based on:
 - visibility
 - major functionality broken
Keywords: nsbeta2
Not absolutely essential for beta2. Removing beta2 status.
Keywords: nsbeta2
These are handled in a logical manner now.  Marking WORKSFORME!
Status: NEW → RESOLVED
Closed: 20 years ago
Resolution: --- → WORKSFORME
Verified working in M16 build on Windows 98 (Build 2000061311)

Open Issues DOCUMENTATION (For authors/Web programmers/users):
1) need to document how DOCTYPE declarations or other mechanisms
define what character encoding/escaping mechanism is used by
the browser when it encodes FORM data in a URL
(application/x-www-form-urlencoded).

2) need to document what the browser does if data in a form contains
characters that cannot be URL-encoded in a URL. An example would be
a document trying to construct a URL encoded assuming ISO-8859-1
charset (i.e., the 'old fashioned' way), but where the document
contains characters (such as greek letters) referenced in the
document using entity or character references.

At present, the browser replaces such characters by the encoding
%3F (an encoded question mark).
Updating QA contact.
QA Contact: ckritzer → vladimire
Works for me:
Platform: PC
OS: Windows 98
Mozilla Build: 2000101014 M18 Trunk Build

Marking as Verified.
Status: RESOLVED → VERIFIED
Blocks: 135762
Component: HTML: Form Submission → DOM: Core & HTML