Closed
Bug 35970
Opened 25 years ago
Closed 25 years ago
Error handling non-latin-1 entity refs inside form input elements
Categories
(Core :: DOM: Core & HTML, defect, P3)
VERIFIED
WORKSFORME
People
(Reporter: ian.graham, Assigned: pollmann)
Details
From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt)
BuildID: 20000022820
A document with no charset specified mishandles entity
references to non-Latin-1 characters.
Reproducible: Always
Steps to Reproduce:
1. Access referenced document. Note, initially, how the greek
letters (alpha, omega) are correctly rendered in heading and
inside form input elements. Note there is no charset specified,
so I _believe_ the browser assumes iso-8859-1?
2. Submit the form. Note how the Greek chars are encoded as escaped
question marks (as good as anything, since they aren't defined
in iso-latin-1).
3. Press the 'back' button. Note how the Greek letters in the form boxes
are now displayed as periods. If you do a view source, you'll see
the correct entity references.
4. Press the form 'submit' button to verify that the browser is processing
these chars as periods!
Expected Results: unclear, since greek letters are not defined in 8859-1.
however, I would expect consistency, such as:
a) always render entity references using correct characters (greek,
whatever)
b) always URL-escape characters that aren't defined in the
document charset as encoded question marks (or as per future
specification for this case)
NOTE 1: Things work fine if a meta element declares the charset to
be UTF-8: See, for example
http://www.java.utoronto.ca/NS5-bugs/encoding-test.html
NOTE 2: I haven't tested this for the case where the server sends
a content-type header that gives the correct charset.
NOTE 3: [RFE] -- this whole mechanism is problematic if
a) the document is encoded in one charset (e.g. EUC-JP)
b) I want to encode the URL using another charset (E.g. UTF-8)
Any thoughts on how this could be done? For now, the URL seems
to always encode data using the charset specified as the document
charset.
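To make NOTE 3 concrete, here is a minimal Python sketch (not browser code) of why the choice of charset matters: the same character produces different escaped bytes depending on the encoding used before URL-escaping. The sample character (hiragana "a", U+3042) and the variable names are illustrative assumptions.

```python
from urllib.parse import quote

# Illustrative sample only: hiragana "a" (U+3042).
sample = "\u3042"

# Escaped as the document charset from NOTE 3 (a) would produce it.
as_euc_jp = quote(sample, safe="", encoding="euc_jp")

# Escaped as the charset wanted in NOTE 3 (b) would produce it.
as_utf8 = quote(sample, safe="", encoding="utf-8")

print(as_euc_jp)  # %A4%A2
print(as_utf8)    # %E3%81%82
```

A server that receives only the escaped bytes has no way to tell which of the two encodings was used unless the charset is agreed on out of band.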
Comment 1•25 years ago
Not sure if the guidelines for URI values in
"B.2.1 Non-ASCII characters in URI attribute values" in the HTML 4 spec:
http://www.w3.org/TR/REC-html40/appendix/notes.html#h-B.2.1 are relevant here.
They say to convert to UTF-8, then URL-encode. The reasoning: if the request
method for the form is "GET", the form values could end up in a URL.
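A minimal sketch of the "convert to UTF-8, then URL-encode" rule, written in Python purely for illustration. The helper name encode_form_value_utf8 is hypothetical and this is not how the browser implements it.

```python
from urllib.parse import quote

def encode_form_value_utf8(value: str) -> str:
    """Hypothetical helper: convert to UTF-8, then percent-escape every byte
    that is not an unreserved ASCII character."""
    return quote(value, safe="", encoding="utf-8")

# The two Greek letters mentioned in the report.
alpha_encoded = encode_form_value_utf8("\u03b1")  # Greek small alpha
omega_encoded = encode_form_value_utf8("\u03c9")  # Greek small omega

print(alpha_encoded)  # %CE%B1
print(omega_encoded)  # %CF%89
```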
Reporter
Comment 2•25 years ago
This is a good reference. However, the problem is backward compatibility, as
the use of non-UTF-8 encoding cannot always be inferred from the URL text. For
example, the UTF-8 encoding of the character at Unicode position 03EF is
the sequence %CF%AF, which can also be interpreted as two valid Latin-1 chars
(Ï and ¯). Thus if a URL was originally encoded to have these
two Latin-1 characters (using the traditional ISO-8859-1-based encoding), the
decoding algorithm would convert them to a single Greek character.
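The ambiguity can be reproduced with a short Python sketch (illustrative only; the variable names are assumptions): the same escaped sequence decodes to different text depending on which charset the decoder assumes.

```python
from urllib.parse import unquote

escaped = "%CF%AF"

# Decoder that assumes UTF-8: one character, U+03EF.
as_utf8 = unquote(escaped, encoding="utf-8")

# Decoder that assumes ISO-8859-1: two characters, U+00CF and U+00AF.
as_latin1 = unquote(escaped, encoding="iso-8859-1")

print(len(as_utf8), [hex(ord(c)) for c in as_utf8])      # 1 ['0x3ef']
print(len(as_latin1), [hex(ord(c)) for c in as_latin1])  # 2 ['0xcf', '0xaf']
```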
Comment 4•25 years ago
nominating for nsbeta2 based on:
- visibility
- major functionality broken
Keywords: nsbeta2
Not absolutely essential for beta2. Removing beta2 status.
Keywords: nsbeta2 (removed)
Assignee
Comment 6•25 years ago
These are handled in a logical manner now. Marking WORKSFORME!
Status: NEW → RESOLVED
Closed: 25 years ago
Resolution: --- → WORKSFORME
Reporter
Comment 7•25 years ago
Verified working in M16 build on Windows 98 (Build 2000061311)
Open Issues DOCUMENTATION (For authors/Web programmers/users):
1) need to document how DOCTYPE declarations or other mechanisms
define what character encoding/escaping mechanism is used by
the browser when it encodes FORM data in a URL
(application/x-www-form-urlencoded).
2) need to document what the browser does if data in a form contains
characters that cannot be URL-encoded in a URL. An example would be
a document trying to construct a URL encoded assuming the ISO-8859-1
charset (i.e., the 'old fashioned' way), but where the document
contains characters (such as Greek letters) referenced in the
document using entity or character references.
At present, the browser replaces such characters by the encoding
%3F (an encoded question mark).
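The observed %3F behaviour can be mimicked with a small Python sketch. This only models the outcome described above and is not the browser's actual code; the helper name encode_latin1_lossy is hypothetical.

```python
from urllib.parse import quote

def encode_latin1_lossy(value: str) -> str:
    """Hypothetical helper: encode as ISO-8859-1, substituting '?' for any
    character the charset cannot represent, then percent-escape the bytes."""
    return quote(value, safe="", encoding="iso-8859-1", errors="replace")

greek_alpha = encode_latin1_lossy("\u03b1")  # not representable in ISO-8859-1
e_acute = encode_latin1_lossy("\u00e9")      # representable in ISO-8859-1

print(greek_alpha)  # %3F
print(e_acute)      # %E9
```

Note that the substitution is lossy: once a character has become %3F, the server cannot tell it apart from a literal question mark typed by the user.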
Comment 9•24 years ago
Works for me:
Platform: PC
OS: Windows 98
Mozilla Build: 2000101014 M18 Trunk Build
Marking as Verified.
Status: RESOLVED → VERIFIED
Updated•6 years ago
Component: HTML: Form Submission → DOM: Core & HTML