Closed Bug 50654 Opened 25 years ago Closed 25 years ago

charset in Content-Type ignored

Categories

(Core :: Internationalization, defect, P3)

x86
Windows 2000
defect

Tracking

()

VERIFIED FIXED

People

(Reporter: claus, Assigned: ftang)

References

()

Details

(Whiteboard: [nsbeta3+]fix in hand)

Mozilla M17 ignores the charset declaration within HTTP Content-Type headers. Instead, it displays the page with the default character set. Steps to reproduce: On the page listed above, click on "Archiv". The (C) character in the bottom line will be displayed incorrectly unless you manually switch to UTF-8. This is a clear violation of HTTP standards and will cause major damage to WWW i18n if Mozilla is released with this bug.
Nothing to do with ActiveX wrapper control. Marking as INVALID.
Status: UNCONFIRMED → RESOLVED
Closed: 25 years ago
Resolution: --- → INVALID
Hm, I'm quite sure I DID select Browser-General in the first place...
Status: RESOLVED → UNCONFIRMED
Component: ActiveX Wrapper → Browser-General
Resolution: INVALID → ---
Reassigning to module owner
Assignee: locka → asa
QA Contact: cpratt → doronr
Changing component and setting default owner.
Assignee: asa → gagan
Component: Browser-General → Networking
QA Contact: doronr → tever
Hmm, shouldn't this be "Internationalization"? Resembles bug 50893.
Right. This should first be looked at in i18n. Confirming the bug and re-assigning to ftang. Changing other fields also. The other bug is about not dealing correctly with document-based HTTP Meta equivalent charset info. This one is about server-generated HTTP content-type charset info, which is UTF-8. The other bug is a regression from a few days ago, but this one seems to have been there for some time. It seems, however, that it may not be a wholesale failure of HTTP charset handling. I have seen some pages displayed correctly with HTTP charset info sent from a server. It may be the handling of specific characters, in this case the copyright symbol.
Assignee: gagan → ftang
QA Contact: tever → teruko
Status: UNCONFIRMED → NEW
Ever confirmed: true
Component: Networking → Internationalization
With the 8/30/2000 Win32 build, the menu checkmark seems to be wrong. Even when it is displaying a UTF-8 page it stays at the default charset, e.g. ISO-8859-1. The server for the URL provided by the original poster is sending UTF-8 as the charset info.
This is an interesting bug. The HTTP header returned is

    Content-Type: text/html; charset="utf-8"

(I used telnet www.fachschaft.jura.uni-muenchen.de 80 to connect and GET /archiv/ HTTP/1.0 to get the page.) Notice that it says

    Content-Type: text/html; charset="utf-8"

and not

    Content-Type: text/html; charset=utf-8

I first thought this was wrong, but after double-checking the HTTP 1.1 spec, http://www.cis.ohio-state.edu/htbin/rfc/rfc2068.html , it says on [Page 16]:

    quoted-string = ( <"> *(qdtext) <"> )
    qdtext = <any TEXT except <">>

and on page 24:

    3.7 Media Types

    HTTP uses Internet Media Types in the Content-Type (section 14.18) and Accept (section 14.1) header fields in order to provide open and extensible data typing and type negotiation.

    media-type = type "/" subtype *( ";" parameter )
    type = token
    subtype = token

    Parameters may follow the type/subtype in the form of attribute/value pairs.

    parameter = attribute "=" value
    attribute = token
    value = token | quoted-string

So it is OK to use charset="utf-8". We need to change nsHTMLDocument.cpp to fix it.
Status: NEW → ASSIGNED
Keywords: nsbeta3
Actually, we support server-sent HTTP charset names like "UTF-8" in Communicator 4.75, and the above page works there.
I have NS-internal test cases for UTF-8 and "UTF-8" if you like.
It should be easy to fix; we need to change nsHTMLDocument.cpp and nsXMLDocument.cpp. I estimate a total of 2 hours of debugging, coding, and engineer testing (not QA testing) to fix it.
I have a fix in hand. It took me 20 minutes in total to write the code.
Whiteboard: fix in hand
Here is the patch:

Index: nsHTMLDocument.cpp
===================================================================
RCS file: /cvsroot/mozilla/layout/html/document/src/nsHTMLDocument.cpp,v
retrieving revision 3.272
diff -c -2 -r3.272 nsHTMLDocument.cpp
*** nsHTMLDocument.cpp 2000/09/02 07:21:57 3.272
--- nsHTMLDocument.cpp 2000/09/06 19:26:58
***************
*** 550,556 ****
          {
            start += 8; // 8 = "charset=".length
!           PRInt32 end = contentType.FindCharInSet(";\n\r ", start );
!           if(kNotFound == end )
!             end = contentType.Length();
            nsAutoString theCharset;
            contentType.Mid(theCharset, start, end - start);
--- 550,564 ----
          {
            start += 8; // 8 = "charset=".length
!           PRInt32 end = 0;
!           if(PRUnichar('"') == contentType.CharAt(start)) {
!             start++;
!             end = contentType.FindCharInSet("\"", start );
!             if(kNotFound == end )
!               end = contentType.Length();
!           } else {
!             end = contentType.FindCharInSet(";\n\r ", start );
!             if(kNotFound == end )
!               end = contentType.Length();
!           }
            nsAutoString theCharset;
            contentType.Mid(theCharset, start, end - start);

and

Index: nsXMLDocument.cpp
===================================================================
RCS file: /cvsroot/mozilla/layout/xml/document/src/nsXMLDocument.cpp,v
retrieving revision 1.84
diff -c -2 -r1.84 nsXMLDocument.cpp
*** nsXMLDocument.cpp 2000/09/02 15:33:40 1.84
--- nsXMLDocument.cpp 2000/09/06 19:25:03
***************
*** 327,334 ****
          if(kNotFound != start)
          {
!           start += 8; // 8 = "charset=".length
!           PRInt32 end = contentType.FindCharInSet(";\n\r ", start );
!           if(kNotFound == end )
!             end = contentType.Length();
            nsAutoString theCharset;
            contentType.Mid(theCharset, start, end - start);
--- 327,342 ----
          if(kNotFound != start)
          {
!           start += 8; // 8 = "charset=".length
!           PRInt32 end = 0;
!           if(PRUnichar('"') == contentType.CharAt(start)) {
!             start++;
!             end = contentType.FindCharInSet("\"", start );
!             if(kNotFound == end )
!               end = contentType.Length();
!           } else {
!             end = contentType.FindCharInSet(";\n\r ", start );
!             if(kNotFound == end )
!               end = contentType.Length();
!           }
            nsAutoString theCharset;
            contentType.Mid(theCharset, start, end - start);
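For readers without the Mozilla tree at hand, the logic of the patch can be sketched in standard C++. This is only an illustration, not the checked-in code: the function name ExtractCharset is hypothetical, std::string stands in for the Mozilla string classes, and the "charset=" match is assumed to be case-sensitive.

```cpp
#include <string>

// Hypothetical stand-alone sketch of the patched logic (not Mozilla code).
// Returns the charset parameter value, or "" if no "charset=" is present.
std::string ExtractCharset(const std::string& contentType) {
  const std::string key = "charset=";  // assumption: case-sensitive match
  std::string::size_type start = contentType.find(key);
  if (start == std::string::npos) return "";
  start += key.size();

  std::string::size_type end;
  if (start < contentType.size() && contentType[start] == '"') {
    // Quoted form: skip the opening quote and stop at the next quote.
    ++start;
    end = contentType.find('"', start);
  } else {
    // Token form: stop at a separator, as the code did before the patch.
    end = contentType.find_first_of(";\n\r ", start);
  }
  // No terminator found: take everything up to the end of the header value.
  if (end == std::string::npos) end = contentType.size();
  return contentType.substr(start, end - start);
}
```

Under these assumptions, both text/html; charset="utf-8" and text/html; charset=utf-8 yield utf-8.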
Hm, for complete HTTP/1.1 compliance, you would also have to handle headers such as:

    Content-Type: text/html; charset="u\t\f-8"

as quoted-string allows quoted-pair, i.e. "\" CHAR. So besides the missing decoding of "\x", a FindInStr(..."\"") is actually not enough, as the <"> might actually be part of a <\"> sequence... I doubt that any user agent gets this right, though.
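To make the quoted-pair point concrete, here is a rough sketch of a parser that decodes "\" CHAR sequences inside a quoted charset value. It is an assumption-laden illustration in standard C++, not part of the checked-in patch: the name ExtractCharsetUnescaped is hypothetical, and the "charset=" match is again assumed to be case-sensitive.

```cpp
#include <string>

// Hypothetical sketch: extract the charset value, decoding quoted-pair
// ("\" CHAR) inside a quoted-string. Not the code that was checked in.
std::string ExtractCharsetUnescaped(const std::string& contentType) {
  const std::string key = "charset=";  // assumption: case-sensitive match
  std::string::size_type pos = contentType.find(key);
  if (pos == std::string::npos) return "";
  pos += key.size();

  std::string value;
  if (pos < contentType.size() && contentType[pos] == '"') {
    // Quoted form: walk character by character after the opening quote.
    for (++pos; pos < contentType.size(); ++pos) {
      char c = contentType[pos];
      if (c == '\\' && pos + 1 < contentType.size()) {
        value += contentType[++pos];  // quoted-pair: keep the escaped char
      } else if (c == '"') {
        break;                        // unescaped quote ends the value
      } else {
        value += c;
      }
    }
  } else {
    // Token form: stop at the first separator.
    std::string::size_type end = contentType.find_first_of(";\n\r ", pos);
    if (end == std::string::npos) end = contentType.size();
    value = contentType.substr(pos, end - pos);
  }
  return value;
}
```

With this sketch, the header value text/html; charset="u\t\f-8" decodes to utf-8, and an escaped quote inside the value no longer terminates it early.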
[nsbeta3+] P3 per i18n bug meeting. Patch checked in; marking it fixed.
Whiteboard: fix in hand → [nsbeta3+]fix in hand
mark it fixed
Status: ASSIGNED → RESOLVED
Closed: 25 years ago
Resolution: --- → FIXED
This is still reproducible in the 2000-09-18-05 Win32, 9-18-08 Mac and Linux builds. The (C) character in the bottom line will be displayed incorrectly unless you manually switch to UTF-8.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
UTF-8 in Character coding should be marked. After I went to the above URL, UTF-8 is not even added to the cached character menu.
We tried it again. It is fixed. We (teruko and I) cannot reproduce this using today's build.
Status: REOPENED → RESOLVED
Closed: 25 years ago
Resolution: --- → FIXED
I verified this in the 2000-09-19-05 Win32, 2000-09-19-10 Mac, and 2000-09-19-08 Linux builds.
Status: RESOLVED → VERIFIED