Error handling non-latin-1 entity refs inside form input elements

VERIFIED WORKSFORME

Status

P3
normal
VERIFIED WORKSFORME
19 years ago
16 years ago

People

(Reporter: ian.graham, Assigned: pollmann)

Tracking

Trunk
x86
Windows 98

Firefox Tracking Flags

(Not tracked)

Details

(URL)

(Reporter)

Description

19 years ago
From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt)
BuildID:    20000022820

A document with no specified charset mishandles entity
references to non-Latin-1 characters.

Reproducible: Always
Steps to Reproduce:
1. Access referenced document. Note, initially, how the greek
   letters (alpha, omega) are correctly rendered in heading and
   inside form input elements. Note there is no charset specified,
   so I _believe_ the browser assumes iso-8859-1?
2. Submit the form. Note how the greek chars are encoded as escaped
   question marks (as good as anything, since they aren't defined
   in iso-latin-1).
3. Press 'back' button. Note how greek letters in form boxes
   are now displayed as periods. If you do a view source, you'll see
   the correct entity references.
4. Press form 'submit' button to verify that the browser is processing
   these chars as periods!

Expected Results: unclear, since Greek letters are not defined in ISO-8859-1.
However, I would expect consistency, such as:
  a) always render entity references using correct characters (greek,
     whatever)
  b) always URL-escape characters that aren't defined in the
     document charset as encoded question marks (or as per future
     specification for this case)

NOTE 1: Things work fine if a meta element declares the charset to
   be UTF-8: See, for example
   http://www.java.utoronto.ca/NS5-bugs/encoding-test.html

NOTE 2: I haven't tested this for the case where the server sends
   a content-type header that gives the correct charset.

NOTE 3: [RFE] -- this whole mechanism is problematic if 
   a) the document is encoded in one charset (e.g. EUC-JP)
   b) I want to encode the URL using another charset (E.g. UTF-8)
  Any thoughts on how this could be done? For now, the URL seems
  to always encode data using the charset specified as the document
  charset.
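
A minimal sketch of the separation NOTE 3 asks for, written in Python rather than browser code. It only illustrates that decoding the document and encoding the URL can use different charsets; the helper name `encode_query`, the field name `q`, and the sample text are assumptions for illustration, not anything the browser exposes.

```python
from urllib.parse import urlencode


def encode_query(fields, url_charset="utf-8"):
    """Percent-encode form fields using an explicitly chosen URL charset."""
    # urlencode() encodes str values with the given charset before escaping.
    return urlencode(fields, encoding=url_charset)


# Hypothetical document bytes, stored as EUC-JP as in case (a) above.
doc_bytes = "日本語".encode("euc_jp")

# Step 1: decode the document with its own charset.
text = doc_bytes.decode("euc_jp")

# Step 2: pick the URL charset independently of the document charset.
as_document_charset = encode_query({"q": text}, url_charset="euc_jp")
as_utf8 = encode_query({"q": text}, url_charset="utf-8")

print(as_document_charset)
print(as_utf8)  # q=%E6%97%A5%E6%9C%AC%E8%AA%9E
```

The two printed query strings differ, so the server has to know which charset was used in order to decode the value.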

Comment 1

19 years ago
Not sure if the guidelines for URI values in
"B.2.1 Non-ASCII characters in URI attribute values" in the HTML 4 spec:
http://www.w3.org/TR/REC-html40/appendix/notes.html#h-B.2.1 are relevant --
convert to UTF-8, then URL-encode. The reasoning: if the request method for
the form is "GET", the form values could end up in a URL.
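
For illustration, the "convert to UTF-8, then URL-encode" guideline can be sketched in Python as below. The function name `utf8_percent_encode` is made up for this example, and the sketch says nothing about how the browser implements it.

```python
from urllib.parse import quote


def utf8_percent_encode(text: str) -> str:
    """Encode text as UTF-8, then percent-escape every byte that is not unreserved."""
    # safe="" means no reserved characters (such as "/") are left unescaped.
    return quote(text.encode("utf-8"), safe="")


# Greek alpha and omega, as used in the test document from the description.
print(utf8_percent_encode("αω"))  # %CE%B1%CF%89
```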
(Reporter)

Comment 2

19 years ago
This is a good reference. However, the problem is backward compatibility, as
the use of non-UTF-8 encoding cannot always be inferred from the URL text. For
example, the UTF-8 encoding of the character at Unicode position 03EF is
the sequence %CF%AF, which can also be interpreted as two valid Latin-1 chars
(Ï and ¯). Thus, if a URL was originally encoded to have these
two Latin-1 characters (using the traditional ISO-8859-1-based encoding), the
decoding algorithm would convert them to a single Greek character.
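
The ambiguity can be reproduced with a short Python sketch. It decodes the same escaped bytes under both assumptions; the variable names are illustrative only.

```python
from urllib.parse import unquote_to_bytes

# The escaped sequence from the example above.
raw = unquote_to_bytes("%CF%AF")

# Interpretation 1: the bytes are UTF-8, giving one character (U+03EF).
as_utf8 = raw.decode("utf-8")

# Interpretation 2: the bytes are ISO-8859-1, giving two characters.
as_latin1 = raw.decode("iso-8859-1")

print(len(as_utf8), hex(ord(as_utf8)))  # 1 0x3ef
print(len(as_latin1), as_latin1)        # 2 Ï¯
```

Both decodings succeed without error, so nothing in the URL text itself indicates which one the author intended.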

Comment 3

19 years ago
reassigning
Assignee: rods → pollmann

Comment 4

19 years ago
nominating for nsbeta2 based on:
 - visibility
 - major functionality broken
Keywords: nsbeta2

Comment 5

19 years ago
Not absolutely essential for beta2. Removing beta2 status.
Keywords: nsbeta2
(Assignee)

Comment 6

19 years ago
These are handled in a logical manner now.  Marking WORKSFORME!
Status: NEW → RESOLVED
Last Resolved: 19 years ago
Resolution: --- → WORKSFORME
(Reporter)

Comment 7

19 years ago
Verified working in M16 build on Windows 98 (Build 2000061311)

Open Issues DOCUMENTATION (For authors/Web programmers/users):
1) need to document how DOCTYPE declarations or other mechanisms
define what character encoding/escaping mechanism is used by
the browser when it encodes FORM data in a URL
(application/x-www-form-urlencoded).

2) need to document what the browser does if data in a form contains
characters that cannot be URL-encoded in a URL. An example would be
a document trying to construct a URL encoded assuming ISO-8859-1
charset (i.e., the 'old fashioned' way), but where the document
contains characters (such as greek letters) referenced in the
document using entity or character references.

At present, the browser replaces such characters by the encoding
%3F (an encoded question mark).
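
The observed behavior can be approximated with the Python sketch below. This is an illustration only, not the browser's code: it resolves entity references, encodes the result as ISO-8859-1 with "?" substituted for anything unrepresentable, and then percent-escapes the bytes. The function name `latin1_form_encode` is made up for this example.

```python
from html import unescape
from urllib.parse import quote


def latin1_form_encode(markup_value: str) -> str:
    """Resolve entity references, then escape the value as ISO-8859-1 bytes."""
    text = unescape(markup_value)  # "&alpha;" becomes the character U+03B1
    # errors="replace" substitutes "?" for characters outside ISO-8859-1.
    data = text.encode("iso-8859-1", errors="replace")
    return quote(data, safe="")


print(latin1_form_encode("&alpha;&omega;"))  # %3F%3F
print(latin1_form_encode("caf&eacute;"))     # caf%E9
```

Note that the substitution is lossy: once a character has become %3F, the server cannot tell it apart from a literal question mark typed by the user.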

Comment 8

18 years ago
Updating QA contact.
QA Contact: ckritzer → vladimire

Comment 9

18 years ago
Works for me:
Platform: PC
OS: Windows 98
Mozilla Build: 2000101014 M18 Trunk Build

Marking as Verified.
Status: RESOLVED → VERIFIED

Updated

17 years ago
Blocks: 135762