Closed Bug 244964 Opened 20 years ago Closed 19 years ago

[FIX]Support quotes around the charset parameter value

Categories

(Core :: Networking: HTTP, defect, P2)

defect

Tracking

()

RESOLVED FIXED
mozilla1.8beta4

People

(Reporter: aha, Assigned: bzbarsky)

References

(Blocks 1 open bug, )

Details

Attachments

(1 file, 3 obsolete files)

Pages of czech companies register are often displayed with wrong charset
(MacCyrilic, Windows-1252) instead of Windows 1250. Charset of page isn't
present as meta tag, but comes in HTTP response header.

telnet www.justice.cz 80
Trying 194.213.41.34...
Connected to www.justice.cz.
Escape character is '^]'.
GET /cgi-bin/sqw1250.cgi/or/s_detail.sqw?DET=1&CEK=455630&PF=112&K=2466935c HTTP/1.1
Host: www.justice.cz

HTTP/1.1 200 OK
Date: Fri, 28 May 2004 13:24:17 GMT
Server: Apache/1.3.29 (Unix) mod_jk/1.2.5
Transfer-Encoding: chunked
Content-Type: text/html; charset="windows-1250"

Mozilla 1.7 RC2 / W2K
Reporter, are you using the 'universal' auto-detector ? See View->Charcter
Coding->Auto-detect.
I'm seeing this on Linux too.  I have autodetect off, and I've verified via HTTP
NSPR logging that the page in the URL field is sending:

  Content-Type: text/html; charset="windows-1250"

At the same time, Mozilla renders that page as ISO-8859-1 over here (according
to both document.characterSet and the View menu).

I'll try to check into the content side of things tonight.
OS: Windows 2000 → All
Hardware: PC → All
Jo Hermans: Yep, Autodetect Universal.
So the issue here is that the header is:

    Content-Type: text/html; charset="windows-1250"

and not

    Content-Type: text/html; charset=windows-1250

So we end up using the string

    "windows-1250"

as the charset (complete with quotes), and that's an unknown charset.

This seems like a bug in NS_ParseContentType to me, since param values _are_
allowed to be quoted....
Status: UNCONFIRMED → NEW
Ever confirmed: true
Note that this bug has been present ever since darin checked in the
NS_ParseContentType code in March 2002.  :(

I think this is worth fixing for 1.8a2....
making summary more specific, as per bz's comment 4.
Summary: Mozilla doesn't use charset from HTTP response → Mozilla doesn't ignore quotes around charset from HTTP response
Flags: blocking1.8a2?
Summary: Mozilla doesn't ignore quotes around charset from HTTP response → Support quotes around the charset parameter value
Attached patch Patch (obsolete) — Splinter Review
The change to nsHttpResponseHead is what actually fixes this bug... The change
to nsNetUtil may still be worth making, though.  That said, I wish we'd not
have two  separate codepaths for parsing content-types....
Attachment #149560 - Flags: superreview?(darin)
Attachment #149560 - Flags: review?(darin)
hmm, that's some rather complex code you're adding (nsNetUtil.h)... is the
possible speed increase worth the complex code, instead of something like:
   charset.StripWhitespace();
+  if (charset.Length() >= 2 &&
+      charset.First() == charset.Last() &&
+      (charset.First() == '"' || charset.First() == '\'')) {
+       charset = Substring(charset, 1, charset.Length() - 1);
+   }
(pseudocode, especially the substring arguments need verifying)
Yeah, I was considering something like that. The problem is that if it's a
quoted string then StripWhitespace is wrong, no?  It'll strip whitespace from
inside the quotes...  (and yes, I know there better be no whitespace there for a
charset, but..)

Also, does the |foo = Substring(foo, ...)| thing work now with Darin's string
changes?  It used to hang....
(In reply to comment #9)
> Yeah, I was considering something like that. The problem is that if it's a
> quoted string then StripWhitespace is wrong, no?  It'll strip whitespace from
> inside the quotes...  (and yes, I know there better be no whitespace there for a
> charset, but..)

Well then, use nsCString::Trim? hmm, you could maybe even put ' and " in the set
passed to Trim, although that would strip the quotes from "foo'
But really, I don't think there is any correct way to handle spaces in charsets...

> Also, does the |foo = Substring(foo, ...)| thing work now with Darin's string
> changes?  It used to hang....

oh... no idea, I'd hope it doesn't hang :)
Another alternative would be to call nsIMIMEHeaderParam::getParameter, but that
would be a gross overkill and perhaps be a lot slower. 
> Also, does the |foo = Substring(foo, ...)| thing work now with Darin's string
> changes?  It used to hang....

foo = Substring(foo, ...) works fine now.
if this function stays large, then i think the one in nsNetUtil.h should instead
go through an interface indirection.  you could implement the function as
net_ParseContentType and put it in nsURLHelpers.cpp or something, and then make
all internal necko callers use that instead.  i'd be fine if you wanted to
create some new interface that nsIOService implements that exposes this method.
 anyways, it might be a useful function to expose to JS protocol handlers ;-)

thanks for working on this bug bz!
Well, the first question I have is whether the code in nsHttpResponseHead should
use the same algorithm as the nsNetUtils or whatever code. The algorithm in
nsHttpResponseHead seems much more robust to me, especially in the face of other
params on the content-type.  Consider 

  text/html; foo="charset=ISO-8859-1"; charset="whatever"

and think about what the nsNetUtils code will do with it as things stand....

The second question is how we want to treat |foo/bar; charset="aa bb  dd  "| and
whether it matters.

I agree that it would make sense to expose this function to script.

If we can get the desired behavior sorted out, I'll try to get this done before
June 14, I guess...
We should probably try to come up with one implementation of this algorithm
(based on the one in nsHttpResponseHead) and expose it via some interface
implemented by either the io service or under some other contract id.

maybe nsINetUtil implemented via "@mozilla.org/network/util;1" ?

I think the algorithm used by nsHttpResponseHead is a bit more involved because
there was no "pressure" to keep it small in size :-/
Fair enough. ;)

The suggestion sounds like a good plan.
Comment on attachment 149560 [details] [diff] [review]
Patch

minusing this patch and marking it obsolete based on previous discussions.
Attachment #149560 - Attachment is obsolete: true
Attachment #149560 - Flags: superreview?(darin)
Attachment #149560 - Flags: review?(darin)
Attachment #149560 - Flags: review-
Flags: blocking1.8a2? → blocking1.8a2-
Blocks: 231908
*** Bug 231908 has been marked as a duplicate of this bug. ***
Keywords: helpwanted
Target Milestone: --- → Future
*** Bug 271888 has been marked as a duplicate of this bug. ***
Blocks: 254868
I tested this a good bit, including on
http://landfill.mozilla.org/ryl/testCharset.cgi to make sure that this makes
HTTP per se happy (most of my testing was using the new service).

I didn't change strings, mostly because I don't think I can change nsACString,
and that means that I have to deal with PromiseFlatString, etc anyway in this
code...
Attachment #189008 - Flags: superreview?(darin)
Attachment #189008 - Flags: review?(darin)
Attached patch Slightly tweaked (obsolete) — Splinter Review
Attachment #189011 - Flags: superreview?(darin)
Attachment #189011 - Flags: review?(darin)
Attachment #189008 - Flags: superreview?(darin)
Attachment #189008 - Flags: superreview-
Attachment #189008 - Flags: review?(darin)
Attachment #189008 - Flags: review-
Comment on attachment 189011 [details] [diff] [review]
Slightly tweaked

>Index: netwerk/base/public/nsNetUtil.h

>+    nsCAutoString charset;
>+    PRBool hadCharset;
>+    rv = util->ParseContentType(rawContentType, charset, &hadCharset,
>+                                contentType);
>+    if (NS_SUCCEEDED(rv) && hadCharset)
>+        contentCharset = charset;

perhaps it would be more optimal to use nsCString for charset
instead since you know that in the success case you will be 
assigning it into the result string.

We can probably add other things to nsINetUtil over time to 
reduce codesize at the callsites.


>Index: netwerk/base/src/nsIOService.cpp

>+NS_IMPL_THREADSAFE_ISUPPORTS4(nsIOService,
>                               nsIIOService,
>                               nsIObserver,
>+                              nsISupportsWeakReference,
>+                              nsINetUtil)

Maybe QI to nsINetUtil is more common than QI to nsIObserver?


>+nsIOService::ParseContentType(const nsACString& aTypeHeader,
>+                              nsACString& aCharset,
>+                              PRBool* aHadCharset,
>+                              nsACString& aContentType)

nit: operators should be placed next to variable names to be 
consistent with the existing code, e.g. "PRBool *aHadCharset"


>Index: netwerk/base/src/nsURLHelper.cpp

>+#define HTTP_LWS " \t"
>+
>+inline PRBool
>+net_StrContainsChar(const char* str,
>+                    const char* str_end,
>+                    char ch)
>+{
>+    for (const char* iter = str; iter < str_end; ++iter) {
>+        if (*iter == ch)
>+            return PR_TRUE;
>+    }
>+    return PR_FALSE;
>+}

I think this function should just be declared static
instead of inline.  It will be inlined by the compiler
if appropriate.  The important thing to tell the
compiler is that the function has file-static scope.

BTW, how is this different from memchr?

  PRBool hasChar = (memchr(str, ch, str_end - str) != NULL);


>+// Return the index of the closing quote of the string, if any
>+inline PRUint32

s/inline/static/


>+    } while (1);

s/1/PR_TRUE/


>+
>+    NS_NOTREACHED("How did we get here?");    
>+}

I think some compilers may spit out an error since there is no
return statement at the end here.


>+inline PRUint32
>+net_FindMediaDelimiter(const nsAFlatCString& flatStr,

I think I would prefer to see less of nsAFlatCString since
it is just a typedef for nsCString (left-over from the old
days).	Use nsCString to minimize string types :-)

Again, static instead of inline.


>+                       PRUint32 searchStart,
>+                       char delimiter)
>+{
>+    do {
>+        // searchStart points to the spot from which we should start looking
>+        // for the delimiter.
>+        const char delimStr[] = { delimiter, '"', '\'', '\0' };
>+        PRUint32 curDelimPos = flatStr.FindCharInSet(delimStr, searchStart);
>+        if (curDelimPos == PRUint32(-1))

use PR_UINT32_MAX instead of PRUint32(-1)


>+    } while (1);
>+
>+    NS_NOTREACHED("How did we get here?");
>+}

same nits as above.


>+inline void
>+net_ParseMediaType(const nsACString &aMediaTypeStr,

static :)


>Index: netwerk/build/nsNetModule.cpp

>+static NS_METHOD
>+nsNetUtilConstructor(nsISupports* aOuter, REFNSIID aIID, void** aResult)
>+{
>+    // Just get the IOService by the ioservice contractid.  This
>+    // ensures that we never create two IOService objects.
>+    return CallGetService(NS_IOSERVICE_CONTRACTID, aIID, aResult);
>+}

Hmm.. 


> static const nsModuleComponentInfo gNetModuleInfo[] = {
>     { NS_IOSERVICE_CLASSNAME,
>       NS_IOSERVICE_CID,
>       NS_IOSERVICE_CONTRACTID,
>       nsIOServiceConstructor },
>+    { NS_IOSERVICE_CLASSNAME,
>+      NS_IOSERVICE_CID,
>+      NS_NETUTIL_CONTRACTID,
>+      nsNetUtilConstructor },

But doesn't the Service Manager match CIDs when looking for an
existing instance?  How would it ever be the case that both
nsNetUtilConstructor and nsIOServiceConstructor would be called?

Ideally, we'd have a special constructor function that would
force singleton semantics, maybe using:
NS_GENERIC_FACTORY_SINGLETON_CONSTRUCTOR

But, right now I don't think we do anything to protect against
CreateInstance being used on the IO service, so why change that?

I'd probably try to use nsIOServiceConstructor in both entries
in gNetModuleInfo.


What about the caller of NS_ParseContentType in
nsMultiMixedConv.cpp?


Can you please create a testcase for this?  I recommend using
the JS-based unit testing framework in netwerk/test/unit/
See test_http_headers.js for a simple example.	Thanks!
Attachment #189011 - Flags: superreview?(darin)
Attachment #189011 - Flags: superreview-
Attachment #189011 - Flags: review?(darin)
Attachment #189011 - Flags: review-
>       nsIOServiceConstructor },
>+    { NS_IOSERVICE_CLASSNAME,
>+      NS_IOSERVICE_CID,
>+      NS_NETUTIL_CONTRACTID,
>+      nsNetUtilConstructor },

we have other places where the same CID is used for multiple contract ids, and
those all share the same constructor. the service manager makes sure to only
create one of them. example:
http://lxr.mozilla.org/seamonkey/source/caps/src/nsSecurityManagerFactory.cpp#372
> BTW, how is this different from memchr?

It's not; I didn't know memchr existed.  ;)

> use PR_UINT32_MAX instead of PRUint32(-1)

FindCharInSet() returns a -1 (it actually has PRInt32 as its return value).  I
shoild be using PRUint32(kNotFound), probably; the cast and all the rest is just
needed to avoid compiler warnings (since string lengths _are_ unsigned).  Will
PRUint32(kNotFound) be ok with you?

I'll double-check the service manager stuff, and if it already handles this for
me, so much the better.

biesi, your example is not so great, since CAPS does its own ensuring that it
always returns a singleton (it has a gScriptSecMan, etc).
> > use PR_UINT32_MAX instead of PRUint32(-1)
> 
> FindCharInSet() returns a -1 (it actually has PRInt32 as its return value).  I
> shoild be using PRUint32(kNotFound), probably; the cast and all the rest is just
> needed to avoid compiler warnings (since string lengths _are_ unsigned).  Will
> PRUint32(kNotFound) be ok with you?

I didn't think about the return type.  Whatever you prefer is fine.
Attachment #189008 - Attachment is obsolete: true
Attachment #189011 - Attachment is obsolete: true
Attachment #190198 - Flags: superreview?(darin)
Attachment #190198 - Flags: review?(darin)
Attachment #190198 - Flags: superreview?(darin)
Attachment #190198 - Flags: superreview+
Attachment #190198 - Flags: review?(darin)
Attachment #190198 - Flags: review+
Comment on attachment 190198 [details] [diff] [review]
Updated to comments

Requesting 1.8b4 approval.  This makes our HTTP header parsing much closer to
spec.
Attachment #190198 - Flags: approval1.8b4?
Assignee: darin → bzbarsky
Keywords: helpwanted
Priority: -- → P2
Summary: Support quotes around the charset parameter value → [FIX]Support quotes around the charset parameter value
Target Milestone: Future → mozilla1.8beta4
Attachment #190198 - Flags: approval1.8b4? → approval1.8b4+
Fixed.
Status: NEW → RESOLVED
Closed: 19 years ago
Resolution: --- → FIXED
Boris, this caused bug 302213, I believe...
Depends on: 302213, 303336
Depends on: 305999
I think there are still quite a few mime types that are broken from this change
and need updated to have slashes in them:

http://lxr.mozilla.org/mozilla/search?string=x-application-&luckytricky=1
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: