35970 - Error handling non-latin-1 entity refs inside form input elements

Reporter

Description

•

26 years ago

From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt)
BuildID: 20000022820

A document with no specified charset mishandles entity references to non-Latin-1 characters.

Reproducible: Always

Steps to Reproduce:
1. Access the referenced document. Note, initially, how the Greek letters (alpha, omega) are correctly rendered in the heading and inside the form input elements. Note there is no charset specified, so I _believe_ the browser assumes iso-8859-1?
2. Submit the form. Note how the Greek characters are encoded as escaped question marks (as good as anything, since they aren't defined in iso-latin-1).
3. Press the 'back' button. Note how the Greek letters in the form boxes are now displayed as periods. If you do a view source, you'll see the correct entity references.
4. Press the form 'submit' button to verify that the browser is processing these characters as periods!

Expected Results: unclear, since Greek letters are not defined in 8859-1. However, I would expect consistency, such as:
a) always render entity references using the correct characters (Greek, whatever)
b) always URL-escape characters that aren't defined in the document charset as encoded question marks (or as per a future specification for this case)

NOTE 1: Things work fine if a meta element declares the charset to be UTF-8. See, for example, http://www.java.utoronto.ca/NS5-bugs/encoding-test.html

NOTE 2: I haven't tested this for the case where the server sends a content-type header that gives the correct charset.

NOTE 3: [RFE] -- this whole mechanism is problematic if
a) the document is encoded in one charset (e.g. EUC-JP)
b) I want to encode the URL using another charset (e.g. UTF-8)
Any thoughts on how this could be done? For now, the URL seems to always encode data using the charset specified as the document charset.
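To make the behavior in step 2 concrete, here is a minimal Python sketch of what the report describes: entity references are resolved to characters, and any character the assumed document charset cannot represent is replaced by a question mark before URL-escaping. This is not the browser's code; the function name `encode_form_value` and the default of iso-8859-1 are assumptions made for illustration.

```python
import html
from urllib.parse import quote_plus


def encode_form_value(raw_value: str, charset: str = "iso-8859-1") -> str:
    """Hypothetical helper: escape one form value for a URL."""
    # Resolve entity references such as &alpha; into real characters.
    text = html.unescape(raw_value)
    # Characters the charset cannot represent are replaced by "?",
    # which is then percent-escaped as %3F.
    return quote_plus(text, encoding=charset, errors="replace")


print(encode_form_value("&alpha;&omega;"))           # %3F%3F
print(encode_form_value("&alpha;&omega;", "utf-8"))  # %CE%B1%CF%89
```

With UTF-8 as the charset the Greek letters survive, which matches NOTE 1 above.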

Sean Richardson

Comment 1

•

26 years ago

Not sure if the guidelines for URI values in "B.2.1 Non-ASCII characters in URI attribute values" in the HTML 4 spec (http://www.w3.org/TR/REC-html40/appendix/notes.html#h-B.2.1) are relevant: convert to UTF-8, then URL-encode. The reasoning: if the request method for the form is "GET", the form values could end up in a URL.
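The "convert to UTF-8, then URL-encode" approach for a GET form could look roughly like the following Python sketch. The helper `build_get_url`, the field name `q`, and the `example.invalid` address are all made up for illustration; only `urllib.parse.urlencode` is a real library call.

```python
from urllib.parse import urlencode


def build_get_url(action: str, fields: dict) -> str:
    """Hypothetical helper: append form fields to a form action URL."""
    # Each value is first encoded as UTF-8 bytes, then every byte that is
    # not URL-safe is written as %HH.
    query = urlencode(fields, encoding="utf-8")
    return f"{action}?{query}"


print(build_get_url("http://example.invalid/search", {"q": "\u03b1\u03c9"}))
# http://example.invalid/search?q=%CE%B1%CF%89
```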

Ian Graham

Reporter

Comment 2

•

26 years ago

This is a good reference. However, the problem is backward compatibility, as the use of a non-UTF-8 encoding cannot always be inferred from the URL text. For example, the UTF-8 encoding of the character at Unicode position 03EF is the sequence %CF%AF, which can also be interpreted as two valid Latin-1 characters (Ï and ¯). Thus, if a URL was originally encoded to have these two Latin-1 characters (using the traditional ISO-8859-1-based encoding), the decoding algorithm would convert them to a single Greek character.
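The ambiguity can be reproduced with a few lines of Python, shown here only as an illustration of the point above: the same escaped bytes decode to one character under UTF-8 and to two characters under ISO-8859-1, and nothing in the URL text says which was intended.

```python
from urllib.parse import unquote

escaped = "%CF%AF"

# One character, U+03EF, when the bytes are read as UTF-8.
as_utf8 = unquote(escaped, encoding="utf-8")
# Two characters, U+00CF and U+00AF, when read as ISO-8859-1.
as_latin1 = unquote(escaped, encoding="iso-8859-1")

print(len(as_utf8), len(as_latin1))  # 1 2
```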

rods (gone)

Comment 3

•

26 years ago

reassigning

Assignee: rods → pollmann

ckritzer (gone)

Comment 4

•

26 years ago

Nominating for nsbeta2 based on:
- visibility
- major functionality broken

Keywords: nsbeta2

rickg

Comment 5

•

26 years ago

Not absolutely essential for beta2. Removing beta2 status.

Keywords: nsbeta2

Eric Pollmann

Assignee

Comment 6

•

25 years ago

These are handled in a logical manner now. Marking WORKSFORME!

Status: NEW → RESOLVED

Closed: 25 years ago

Resolution: --- → WORKSFORME

Ian Graham

Reporter

Comment 7

•

25 years ago

Verified working in M16 build on Windows 98 (Build 2000061311).

Open Issues

DOCUMENTATION (for authors/Web programmers/users):
1) Need to document how DOCTYPE declarations or other mechanisms define what character encoding/escaping mechanism is used by the browser when it encodes FORM data in a URL (www-form-urlencoding).
2) Need to document what the browser does if data in a form contains characters that cannot be URL-encoded in a URL. An example would be a document trying to construct a URL encoded assuming the ISO-8859-1 charset (i.e., the 'old fashioned' way), but where the document contains characters (such as Greek letters) referenced in the document using entity or character references. At present, the browser replaces such characters by the encoding %3F (an encoded question mark).
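For authors who want to know ahead of time which characters in a form value would end up as %3F, a check along these lines may help. It is a sketch only: the function name `unencodable_chars` is hypothetical, and it assumes the browser submits using the charset passed in.

```python
def unencodable_chars(value: str, charset: str = "iso-8859-1") -> list:
    """Hypothetical helper: list the characters the charset cannot represent."""
    missing = []
    for ch in value:
        try:
            ch.encode(charset)
        except UnicodeEncodeError:
            # This character would be replaced on submission.
            missing.append(ch)
    return missing


# U+00E9 is in ISO-8859-1; the two Greek letters are not.
print(unencodable_chars("caf\u00e9 \u03b1\u03c9"))
```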

Vladimir Ermakov

Comment 8

•

25 years ago

Updating QA contact.

QA Contact: ckritzer → vladimire

Keyser Sose

Comment 9

•

25 years ago

Works for me:
Platform: PC
OS: Windows 98
Mozilla Build: 2000101014 M18 Trunk Build
Marking as Verified.

Status: RESOLVED → VERIFIED

Markus Kuhn

Updated

•

24 years ago

Blocks: 135762

Dave Miller [:justdave]

Updated

•

22 years ago

No longer blocks: 135762

Nobody; OK to take it and work on it

Updated

•

7 years ago

Component: HTML: Form Submission → DOM: Core & HTML

Categories

(Core :: DOM: Core & HTML, defect, P3)

People

(Reporter: ian.graham, Assigned: pollmann)
