From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt)
BuildID: 20000022820

A document with no charset specified mishandles entity references to non-Latin-1 characters.

Reproducible: Always

Steps to Reproduce:
1. Access the referenced document. Note, initially, how the Greek letters (alpha, omega) are correctly rendered in the heading and inside form input elements. Note there is no charset specified, so I _believe_ the browser assumes ISO-8859-1?
2. Submit the form. Note how the Greek chars are encoded as escaped question marks (as good as anything, since they aren't defined in ISO-8859-1).
3. Press the 'back' button. Note how the Greek letters in the form boxes are now displayed as periods. If you do a view source, you'll see the correct entity references.
4. Press the form 'submit' button to verify that the browser is processing these chars as periods!

Expected Results: Unclear, since Greek letters are not defined in ISO-8859-1. However, I would expect consistency, such as:
a) always render entity references using the correct characters (Greek, whatever)
b) always URL-escape characters that aren't defined in the document charset as encoded question marks (or as per a future specification for this case)

NOTE 1: Things work fine if a meta element declares the charset to be UTF-8. See, for example, http://www.java.utoronto.ca/NS5-bugs/encoding-test.html

NOTE 2: I haven't tested this for the case where the server sends a Content-Type header that gives the correct charset.

NOTE 3: [RFE] -- this whole mechanism is problematic if
a) the document is encoded in one charset (e.g. EUC-JP)
b) I want to encode the URL using another charset (e.g. UTF-8)
Any thoughts on how this could be done? For now, the URL seems to always encode data using the charset specified as the document charset.
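For illustration only, the behaviour described in step 2 can be imitated outside the browser. This is a rough Python sketch, not the browser's actual code path: it assumes the form value arrives as entity references, that the document charset is ISO-8859-1, and that unencodable characters are replaced with a question mark before escaping. The field name `q` is made up for the example.

```python
import html
from urllib.parse import urlencode

# Form value as written in the page source (entity references).
raw_value = "&alpha;&omega;"

# Resolve the entity references to real characters (alpha, omega).
value = html.unescape(raw_value)

# Assumption: the document charset is ISO-8859-1, and characters that
# cannot be represented in it are replaced by "?" before URL-escaping.
latin1_query = urlencode({"q": value}, encoding="latin-1", errors="replace")

# With a UTF-8 document charset the same value survives intact.
utf8_query = urlencode({"q": value}, encoding="utf-8")

print(latin1_query)
print(utf8_query)
```

Under these assumptions the ISO-8859-1 query carries only escaped question marks, while the UTF-8 query keeps the Greek letters as escaped byte sequences, which matches the difference noted in NOTE 1.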
Not sure if the guidelines for URI values in "B.2.1 Non-ASCII characters in URI attribute values" in the HTML 4 spec: http://www.w3.org/TR/REC-html40/appendix/notes.html#h-B.2.1 are relevant -- convert to UTF-8, then URL-encode. The reasoning: if the request method for the form is "GET", the form values could end up in a URL.
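As a minimal sketch of the "convert to UTF-8, then URL-encode" recommendation, assuming it is applied only to the non-ASCII characters of the value: the helper name `escape_non_ascii` and the sample URI are hypothetical, chosen just for this example.

```python
def escape_non_ascii(uri_value: str) -> str:
    """Escape each non-ASCII character as the %HH bytes of its UTF-8 form."""
    out = []
    for ch in uri_value:
        if ord(ch) < 128:
            # ASCII characters are passed through unchanged in this sketch.
            out.append(ch)
        else:
            # Step 1: represent the character as UTF-8 bytes.
            # Step 2: escape every byte as %HH.
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(out)


escaped = escape_non_ascii("search?q=\u03b1\u03c9")  # alpha, omega
print(escaped)
```

Note that this sketch leaves ASCII characters alone, including ones such as spaces that a real form submission would also escape.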
This is a good reference. However, the problem is backward compatibility, as the use of a non-UTF-8 encoding cannot always be inferred from the URL text. For example, the UTF-8 encoding of the character at Unicode position 03EF is the sequence %CF%AF, which can also be interpreted as two valid Latin-1 chars (Ï and ¯). Thus, if a URL was originally encoded to have these two Latin-1 characters (using the traditional ISO-8859-1-based encoding), the decoding algorithm would convert them to a single Greek character.
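The ambiguity is easy to reproduce. The following Python sketch decodes the same escaped sequence under both assumptions; it only demonstrates that the URL text alone does not say which reading is intended.

```python
from urllib.parse import quote, unquote

# The character at Unicode position 03EF, escaped via UTF-8.
encoded = quote("\u03ef", encoding="utf-8")

# The same escaped text read back under two different charset assumptions.
as_utf8 = unquote(encoded, encoding="utf-8")      # one character, U+03EF
as_latin1 = unquote(encoded, encoding="latin-1")  # two characters, U+00CF U+00AF

print(encoded, len(as_utf8), len(as_latin1))
```

Both decodings succeed without error, so a decoder cannot detect from the bytes which encoding the author of the URL had in mind.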
Assignee: rods → pollmann
Nominating for nsbeta2 based on:
- visibility
- major functionality broken
Not absolutely essential for beta2. Removing beta2 status.
These are handled in a logical manner now. Marking WORKSFORME!
Status: NEW → RESOLVED
Last Resolved: 19 years ago
Resolution: --- → WORKSFORME
Verified working in M16 build on Windows 98 (Build 2000061311).

Open Issues

DOCUMENTATION (for authors/Web programmers/users):
1) Need to document how DOCTYPE declarations or other mechanisms define what character encoding/escaping mechanism is used by the browser when it encodes FORM data in a URL (www-form-urlencoding).
2) Need to document what the browser does if data in a form contains characters that cannot be URL-encoded in a URL. An example would be a document trying to construct a URL encoded assuming the ISO-8859-1 charset (i.e., the 'old fashioned' way), but where the document contains characters (such as Greek letters) referenced in the document using entity or character references. At present, the browser replaces such characters by the encoding %3F (an encoded question mark).
Updating QA contact.
QA Contact: ckritzer → vladimire
Works for me:
Platform: PC
OS: Windows 98
Mozilla Build: 2000101014 M18 Trunk Build
Marking as Verified.
Status: RESOLVED → VERIFIED