Closed Bug 548516 Opened 16 years ago Closed 2 years ago

data: URL charset detection is too brittle

Categories

(Core :: Networking, defect, P3)

Product:

Component:

Platform:

x86

Windows XP

Type:

defect

Priority:

P3

Severity:

S3

Tracking

()

Status:

RESOLVED FIXED

People

(Reporter: shadow2531, Unassigned)

References

(
URL
)

Details

(Keywords: student-project, Whiteboard: [necko-backlog])

Attachments

(3 files, 3 obsolete files)

I have something that builds and is a rough start 14 years ago Michael A. Puls II 5.24 KB, patch		Details \| Diff \| Splinter Review
Honor only first charset found and cleanup some things 14 years ago Michael A. Puls II 5.38 KB, patch	bzbarsky : feedback-	Details \| Diff \| Splinter Review
Here's an example of FindFirstOfNotQuoted that could work for finding the first unquoted comma 14 years ago Michael A. Puls II 609 bytes, text/plain		Details
Example patch of using header parser for data URI 14 years ago Michael A. Puls II 5.84 KB, patch		Details \| Diff \| Splinter Review
Updated patch that includes the lowercasing and whitespace stripping etc. 14 years ago Michael A. Puls II 7.27 KB, patch		Details \| Diff \| Splinter Review
Using Trim() instead of StripWhitespace() since it makes more sense 14 years ago Michael A. Puls II 7.26 KB, patch		Details \| Diff \| Splinter Review

Michael A. Puls II

Reporter

Description

•

16 years ago

User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6 (.NET CLR 3.5.30729) Build Identifier: Load the test URL. The resulting text/plain page should be treated as UTF-8 and you should see √ in the page. Instead, you see âˆš because the page isn't treated as UTF-8. This is because charset=utf-8 is ignored when unknown headers like foo=bar are present. Safari and Opera handle this fine and treat the page as UTF-8. For an example of why this might be important to fix (besides matching Opera and Safari), see <http://lists.w3.org/Archives/Public/uri/2010Feb/0058.html>. Reproducible: Always

Michael A. Puls II

Reporter

Updated

•

16 years ago

Blocks: 144766

Comment 1

•

16 years ago

Interestingly, the status bar shows the check mark when you hover over the link.

Remko de Keijzer

Comment 2

•

16 years ago

The adressbar also displays the checkmark correctly.

Boris Zbarsky [:bzbarsky]

Updated

•

16 years ago

Blocks: 532230

Boris Zbarsky [:bzbarsky]

Comment 3

•

16 years ago

Michael, do you want to just give this a shot? The relevant code is at http://hg.mozilla.org/mozilla-central/file/4e1b68ecf126/netwerk/protocol/data/src/nsDataHandler.cpp#l190 and the right fix is probably to just use nsIMIMEHeaderParam instead of reinventing the wheel. Alternately, I'm happy to guide whoever wants to fix this through doing it; I just won't be able to get to it myself in the near future.

Status: UNCONFIRMED → NEW

Ever confirmed: true

Keywords: student-project

Michael A. Puls II

Reporter

Comment 4

•

16 years ago

This computer doesn't have the resources to build FF in a practical amount of time. However, I could recreate the parse function in C++ (while still using char* etc.), test it out, pull the latest code, adapt my code to use any Mozilla str functions and create a patch blindly for someone to try. (I'd probably produce an example JS version first though) However, I noticed that white-space handling of the parts separated by ';' differs between browsers. For example, each browser can handle each of these parts differently: "data: text/plain ; charset = utf-8 ; base4 ," Personally, I'd trim all that so that it's like: "text/plain" "charset = utf-8" -> "charset " : " utf-8" -> "charset" : "utf-8" "base64" Further, because raw spaces get resolved to %20, some browsers handle it in some parts and not others: "data:%20text/plain%20;%20charset%20=%20utf-8%20;%20base4%20," , which I would personally make give the same result as the previous example by completely percent-decoding each of the values before trimming, or at least by trimming percent-encoded white-space (no '+' to ' ' of course) I can do some more testing and ask on the URI list what's supposed to happen as far as trimming goes. > nsIMIMEHeaderParam <http://mxr.mozilla.org/mozilla-central/source/netwerk/mime/src/nsMIMEHeaderParamImpl.cpp>?

Boris Zbarsky [:bzbarsky]

Comment 5

•

16 years ago

> <http://mxr.mozilla.org/mozilla-central/source/netwerk/mime/src > /nsMIMEHeaderParamImpl.cpp>? That's the implementation; the idea would be to just reuse that component for parsing here. The %20/space thing is a mess. :( That needs to be specced, I think.

Michael A. Puls II

Reporter

Comment 6

•

16 years ago

(In reply to comment #5) > > <http://mxr.mozilla.org/mozilla-central/source/netwerk/mime/src > > /nsMIMEHeaderParamImpl.cpp>? > > That's the implementation; the idea would be to just reuse that component for > parsing here. Understood. Will look to see if I can find usage examples to see how well it'd work. > The %20/space thing is a mess. :( That needs to be specced, I think. Please review <http://lists.w3.org/Archives/Public/uri/2010Feb/0097.html> when you have a chance. The spacing issue could perhaps be as a different bug, handled later on.

Boris Zbarsky [:bzbarsky]

Comment 7

•

16 years ago

So you actually want me to look over http://shadow2531.com/opera/testcases/datauri/data_uri_rules.html yes?

Michael A. Puls II

Reporter

Comment 8

•

16 years ago

(In reply to comment #7) > So you actually want me to look over > http://shadow2531.com/opera/testcases/datauri/data_uri_rules.html yes? Yes, but only if you have time and feel like it.

Boris Zbarsky [:bzbarsky]

Comment 9

•

16 years ago

I did skim it and it looks ok at first glance, but the devil's obviously in the details... I haven't checked those.

Michael A. Puls II

Reporter

Comment 10

•

16 years ago

(In reply to comment #9) > I did skim it and it looks ok at first glance, but the devil's obviously in the > details... I haven't checked those. Thanks. I will gather some data URI tests and how they behave in different browsers to support those rules. As for parseURI <http://hg.mozilla.org/mozilla-central/file/4e1b68ecf126/netwerk/protocol/data/src/nsDataHandler.cpp#l190>, I was able to recreate it locally so I can test simple fixes. I'll post when I have something worthwhile.

Boris Zbarsky [:bzbarsky]

Comment 11

•

15 years ago

We have a volunteer!

Assignee: nobody → may.gup

David Humphrey (:humph)

Updated

•

15 years ago

Status: NEW → ASSIGNED

Michael A. Puls II

Reporter

Comment 12

•

15 years ago

The comment I made in <http://lists.w3.org/Archives/Public/uri/2010Mar/0004.html> may be useful.

Comment 13

•

15 years ago

Sorry for the delay. I have been able to extract the parameters in the URI as described in the comment above (comment 12), but what am I supposed to do with the values ? Ideally it should offer to Save/Open the data, but I dunno where to go from here, please guide..

Boris Zbarsky [:bzbarsky]

Comment 14

•

15 years ago

I don't follow the question. What are you trying to do with which values, and how is it related to this bug?

Mikko Rantalainen

Comment 15

•

15 years ago

Any progress? Comment 13 seems to suggest that code to fix this bug (data URL parsing is too brittle) is fixed but probably not committed anywhere and this bug is missing the patch. It seems to me that comment 13 is already hinting towards an another bug (feature request) that should be opened for actually using the disposition and filename information parsed from the data URL. It definitely is an another bug and this one should be fixed first.

Michael A. Puls II

Reporter

Comment 16

•

14 years ago

The problem with the test URI <data:text/plain;charset=utf-8;foo=bar,%E2%88%9A>, is that Firefox detects the charset as "utf-8;foo=bar" At <http://hg.mozilla.org/mozilla-central/file/4e1b68ecf126/netwerk/protocol/data/src/nsDataHandler.cpp#l210>, if I change: if (charset) contentCharset = charset + sizeof("charset=") - 1; to: if (charset) { char *test = (char *) strchr(charset, ';'); if (test) { *test = '\0'; } contentCharset = charset + sizeof("charset=") - 1; if (test) { *test = ';'; } } , that should fix it so the charset is just "utf-8". I have yet to actually test that though.

Michael A. Puls II

Reporter

Comment 17

•

14 years ago

Attached patch I have something that builds and is a rough start (obsolete) — Details — Splinter Review

I needed a percent-decoding function. Found NS_UnescapeURL and used it, but I'm not sure that's correct. nsCString::Trim() supports a 4th argument that say to ignore quotes. It defaults to false, but I don't really know what ignore quotes means in this case. Still lots of testing to do.

Attachment #537204 - Flags: feedback?(bzbarsky)

Boris Zbarsky [:bzbarsky]

Comment 18

•

14 years ago

Which kind of %-decoding did you want? There are several different types...

Michael A. Puls II

Reporter

Comment 19

•

14 years ago

(In reply to comment #18) > Which kind of %-decoding did you want? There are several different types... Just need to convert all the %HH to the bytes they represent (and then possibly convert to utf-8). Something that works like javascript's decodeURIComponent(). For example, '+' should be left alone and not converted to a space. For example, if I was going to do this in JS, I'd use decodeURIComponent().

Boris Zbarsky [:bzbarsky]

Comment 20

•

14 years ago

Yeah, then NS_UnescapeURL should work. I'll try to look at the patch in more detail on Monday; is there something you want feedback on in particular?

Michael A. Puls II

Reporter

Comment 21

•

14 years ago

(In reply to comment #20) > Yeah, then NS_UnescapeURL should work. Thanks. > I'll try to look at the patch in more detail on Monday; O.K. > is there something you want feedback on in particular? Mainly want to make sure that I can't get stuck in the loop and want to make sure the char*'s I use are never out of bounds etc. Besides that, just wondering what things you require to be changed. Also, if there are existing data URI tests I can run to look for regression, I'd like to run them. And, eventually add more tests.

Michael A. Puls II

Reporter

Comment 22

•

14 years ago

Also, the way I have it now, if there's more than one charset=, the last non-empty one (empty ones are currently ignored) is honored. This can be changed to allow empty ones to override the previous too. Or, it could be changed so that once a non-empty (or empty too if required) charset= is found, the rest after that are ignored. I like the current way I have it though.

Michael A. Puls II

Reporter

Comment 23

•

14 years ago

(In reply to comment #22) > Also, the way I have it now, if there's more than one charset=, the last > non-empty one (empty ones are currently ignored) is honored. This can be > changed to allow empty ones to override the previous too. Or, it could be > changed so that once a non-empty (or empty too if required) charset= is > found, the rest after that are ignored. I like the current way I have it > though. Testing with Opera and Firefox, it looks like the first charset= (empty or not) gets honored and any charset= after that are ignored. So, I'll change to match that.

Michael A. Puls II

Reporter

Comment 24

•

14 years ago

Attached patch Honor only first charset found and cleanup some things — Details — Splinter Review

Attachment #537204 - Attachment is obsolete: true

Attachment #537720 - Flags: feedback?(bzbarsky)

Attachment #537204 - Flags: feedback?(bzbarsky)

Michael A. Puls II

Reporter

Comment 25

•

14 years ago

(In reply to comment #23) > Testing with Opera and Firefox, it looks like the first charset= (empty or > not) gets honored and any charset= after that are ignored. So, I'll change > to match that. Safari agrees with this too. However, Chrome differs in that it honors the first non-empty one.

Comment 26

•

14 years ago

(In reply to comment #25) > Safari agrees with this too. However, Chrome differs in that it honors the > first non-empty one. I wouldn't loose too much sleep over handling invalid instances if they do not interoperate anyway.

Michael A. Puls II

Reporter

Comment 27

•

14 years ago

(In reply to comment #26) > (In reply to comment #25) > > Safari agrees with this too. However, Chrome differs in that it honors the > > first non-empty one. > > I wouldn't loose too much sleep over handling invalid instances if they do > not interoperate anyway. Understood. It's easy to change how it's handled, so no biggie. I think doing it the Firefox, Opera, Safari way makes sense as then I'm not changing current behavior with that part.

Boris Zbarsky [:bzbarsky]

Comment 28

•

14 years ago

So _is_ there a reason you're not using the mime header param code?

Comment 29

•

14 years ago

(In reply to comment #28) > So _is_ there a reason you're not using the mime header param code? I think re-using that code in new places is a bad idea, unless we clean it up a lot. For instance, it would be bad if the non-compliant RFC2047 handling leaks out into new places.

Boris Zbarsky [:bzbarsky]

Comment 30

•

14 years ago

Well, I don't think we should be reinventing the param-parsing wheel here unless we think the data: syntax is different enough from normal param parsing that we need to special-case it.

Michael A. Puls II

Reporter

Comment 31

•

14 years ago

(In reply to comment #28) > So _is_ there a reason you're not using the mime header param code? Yeh, meant to talk about that. In <http://hg.mozilla.org/mozilla-central/file/8904812b90a7/netwerk/mime/nsMIMEHeaderParamImpl.cpp>: I'm looking for a function that accepts the data URI (well the part before "," after that part is percent-decoded) that parses it like an HTTP header and produces a container (name, value) of all the resulting params. The container would be like vector<pair<string, string>> (no map, want to retain order), where the values are already unfolded (in the case of header folding), rfc2231-decoded/percent-decoded/mime-decoded, converted to UTF-8, trimmed and sanitized etc. and duplicate headers are already taken care of according to the spec. Then, in ParseURI, I'd use that function (GetParamsFromDataURIHeader() for example) to get that container and analyze it to see what name/values are there. The problem is, looking at nsMIMEHeaderParamImpl.cpp, I don't see any function like that and just see functions that look like they do bits and pieces. So, I'd have to write the function I need by using all those functions. The problem with that is I don't know anything about those functions, at all. Also, it'd probably be best to stick to rfc2231 support strictly so the old mime QP stuff doesn't have to be supported. Any idea on how I could make such a function that I need?

Boris Zbarsky [:bzbarsky]

Comment 32

•

14 years ago

> and just see functions that look like they do bits and pieces mime header param gives you functions that let you say "here's a header value; what's the value of this named parameter, if it has one at all?" So you can't get a set of name/value pairs, but for any given name you can ask for the value. You have a known set of names here you need to ask about, so this should work, I'd think. Strict RFC 2231 support is a bigger sticking point; there's no way to skip rfc 2047 processing for mime header param right now without, as you say, gluing bits and pieces together. If that's the only issue, I would prefer that we just exposed an API like getParameter that only does the RFC 2231 stuff. Would that be sufficient here?

Michael A. Puls II

Reporter

Comment 33

•

14 years ago

Oh, so, you're saying to keep parsing pretty much like I am now and just support stuff like "filename=utf-8''%E2%88%9A.txt" where once I find filename, I'd do getParameter("utf-8''%E2%88%9A.txt") to get the string "\xE2\x88\x9A.txt"? If so, I guess I'd want getParameter to work similar to how mimeConverter in <http://hg.mozilla.org/comm-central/file/8f6268c4dc0a/mailnews/compose/src/nsSmtpUrl.cpp#l224> works where you can also specify the encoding to convert to (just in case it's not utf-8 after it's RFC2231-decoded). Or, do you mean to really support filename*= where there could be multiple of them togther separated by "\r\n " (well, "%0D%0A%20" in the URI before the part before ',' is percent-decoded)? If that's the case, I'd need to modify the parsing to look for '=' that's not preceded by '*' etc. It doesn't seem like the header folding part is needed. > If that's the only issue, I would prefer that we just exposed an API like getParameter that only does the RFC 2231 stuff. That sounds better I think now that I know more of what you mean.

Comment 34

•

14 years ago

(In reply to comment #32) > ... > If that's the only issue, I would prefer that we just exposed an API like > getParameter that only does the RFC 2231 stuff. Would that be sufficient > here? ... The proposed patch for bug 610054 would probably help. It would take out quoted-pair, but we'd still carry over all other parsing quirks.

Boris Zbarsky [:bzbarsky]

Comment 35

•

14 years ago

> Oh, so, you're saying to keep parsing pretty much like I am now No, I'm saying you grab the part between ':' and ',', call it "header", say, and then do things like: nsString filename; headerParam->GetParameter(header, "filename", EmptyCString(), PR_FALSE, nsnull, filename); After which |filename| will be the value of the filename parameter, if there was one, decoded, etc, etc and converted to UTF-16. It'll handle header folding and the like. IF you want to pass a fallback encoding you can pass something other than EmptyCString() there.

Comment 36

•

14 years ago

(In reply to comment #35) > > Oh, so, you're saying to keep parsing pretty much like I am now > > No, I'm saying you grab the part between ':' and ',', call it "header", say, > and then do things like: You can't do that, because the first ";" might be part of a quoted-string. I don't believe the current code is reusable for data: URIs (or even the Link header field, which I am looking at). To make things reusable, we need to decouple tokenizing the value from decoding the individual values. And we'll either need a very generic tokenizer, or tokenizers for each type of value.

Boris Zbarsky [:bzbarsky]

Comment 37

•

14 years ago

You mean the first ',', right? OK, point taken. That's really unfortunate, though. :( So I guess the question is whether it's worth writing such a generic tokenizer that would start at some char* and walk forward until it reaches end of data or a given delimiter. Seems like all we need there is a tokenizer that knows how to skip quoted stuff; it doesn't need to do the various decoding and whatnot, right?

Comment 38

•

14 years ago

(In reply to comment #37) > You mean the first ',', right? OK, point taken. Si. > That's really unfortunate, though. :( > > So I guess the question is whether it's worth writing such a generic > tokenizer that would start at some char* and walk forward until it reaches > end of data or a given delimiter. Seems like all we need there is a > tokenizer that knows how to skip quoted stuff; it doesn't need to do the > various decoding and whatnot, right? Having something like that would be very useful. There are still instances where parsing can't be generalized like that entirely, such as with the Link header, I believe. But that wouldn't make the generic code less useful. For header fields, a generic pattern might be something like: thing = [ name ] [ "=" ] [ value ] value = RFC2616token / RFC2616quotedstring headerfield = thing *( ";" [thing]) However, it's not clear how "forgiving" this needs to be. For instance, if we'd be forgiving and allow ";" inside RFC2616token, then we couldn't detect the delimiter in data URIs. If we do not allow "," there, we could, but then we probably would regress parsing of filenames in C-D (where people may be putting commas into non-quoted parameters).

Michael A. Puls II

Reporter

Comment 39

•

14 years ago

(In reply to comment #35) > > Oh, so, you're saying to keep parsing pretty much like I am now > > No, I'm saying you grab the part between ':' and ',', call it "header", say, > and then do things like: > > nsString filename; > headerParam->GetParameter(header, "filename", EmptyCString(), PR_FALSE, > nsnull, filename); > > After which |filename| will be the value of the filename parameter, if there > was one, decoded, etc, etc and converted to UTF-16. It'll handle header > folding and the like. IF you want to pass a fallback encoding you can pass > something other than EmptyCString() there. Ah, O.K. How would you search for base64 then? Just use "base64" for the second argument? What about for the content-type, how would you search for that with that function? (In reply to comment #36) > (In reply to comment #35) > > > Oh, so, you're saying to keep parsing pretty much like I am now > > > > No, I'm saying you grab the part between ':' and ',', call it "header", say, > > and then do things like: > > You can't do that, because the first "," might be part of a quoted-string. But, everything between ":" and "," can (according to Opera, Safari and the code I wrote and the idea I was going with) and should be percent-encoded. So, that "," would be %2C at the time the substring between ":" and "," is gotten.

Comment 40

•

14 years ago

(In reply to comment #39) > (In reply to comment #35) > > > Oh, so, you're saying to keep parsing pretty much like I am now > > > > No, I'm saying you grab the part between ':' and ',', call it "header", say, > > and then do things like: > > > > nsString filename; > > headerParam->GetParameter(header, "filename", EmptyCString(), PR_FALSE, > > nsnull, filename); > > > > After which |filename| will be the value of the filename parameter, if there > > was one, decoded, etc, etc and converted to UTF-16. It'll handle header > > folding and the like. IF you want to pass a fallback encoding you can pass > > something other than EmptyCString() there. > > Ah, O.K. How would you search for base64 then? Just use "base64" for the > second argument? What about for the content-type, how would you search for > that with that function? I don't think it makes sense to try that function. We need to refactor all of this to make it reusable for "data:". > (In reply to comment #36) > > (In reply to comment #35) > > > > Oh, so, you're saying to keep parsing pretty much like I am now > > > > > > No, I'm saying you grab the part between ':' and ',', call it "header", say, > > > and then do things like: > > > > You can't do that, because the first "," might be part of a quoted-string. > > But, everything between ":" and "," can (according to Opera, Safari and the > code I wrote and the idea I was going with) and should be percent-encoded. > So, that "," would be %2C at the time the substring between ":" and "," is > gotten. I can be percent-encoded, but doesn't have to. You can't rely on it.

Michael A. Puls II

Reporter

Comment 41

•

14 years ago

> I can be percent-encoded, but doesn't have to. You can't rely on it. "," in a param in a data URI isn't allowed now or it'll break things. Seems like that all UAs that improved support to support "," in params could require that everything between ":" and "," is percent-encoded. Of course, it shouldn't be hard to find the first non-quoted comma anyway.

Comment 42

•

14 years ago

(In reply to comment #41) > > I can be percent-encoded, but doesn't have to. You can't rely on it. > > "," in a param in a data URI isn't allowed now or it'll break things. Seems > like that all UAs that improved support to support "," in params could > require that everything between ":" and "," is percent-encoded. > > Of course, it shouldn't be hard to find the first non-quoted comma anyway. Well, if we do all of this let's try to get it right :-)

Michael A. Puls II

Reporter

Comment 43

•

14 years ago

(In reply to comment #41) > Of course, it shouldn't be hard to find the first non-quoted comma anyway. In a header, is there anything that closes a quoted value besides " ? In a quoted value, can " appear if it's escaped by \ ? If so, I assume that \ can appear if it's escaped by \. Can things be quoted with ' instead?

Comment 44

•

14 years ago

(In reply to comment #43) > (In reply to comment #41) > > Of course, it shouldn't be hard to find the first non-quoted comma anyway. > > In a header, is there anything that closes a quoted value besides " ? No. But it must be a " that isn't quoted itself. > In a quoted value, can " appear if it's escaped by \ ? Yes. > If so, I assume that \ can appear if it's escaped by \. Yes. > Can things be quoted with ' instead? No. See... http://greenbytes.de/tech/webdav/draft-ietf-httpbis-p1-messaging-14.html#rfc.section.1.2.2.p.10

Michael A. Puls II

Reporter

Comment 45

•

14 years ago

Attached file Here's an example of FindFirstOfNotQuoted that could work for finding the first unquoted comma — Details

Boris Zbarsky [:bzbarsky]

Comment 46

•

14 years ago

> Just use "base64" for the second argument? Yes. > What about for the content-type, how would you search for that with that > function? Use null as the second argument.

Michael A. Puls II

Reporter

Comment 47

•

14 years ago

(In reply to comment #46) > > Just use "base64" for the second argument? > > Yes. > > > What about for the content-type, how would you search for that with that > > function? > > Use null as the second argument. O.K. Then if the result ends up in an nsString, I'd need to convert it to an nsCString utf-8-wise. Searching, I think that's "nsCString test = NS_ConvertUTF16toUTF8(result)". Is that correct? Given that and my function, I should have enough to try some things out.

Michael A. Puls II

Reporter

Comment 48

•

14 years ago

(In reply to comment #46) > > Just use "base64" for the second argument? > > Yes. Does that mean that base64=anything will also trigger base64 handling?

Comment 49

•

14 years ago

(In reply to comment #48) > Does that mean that base64=anything will also trigger base64 handling? That's why I think that re-using code that doesn't do exactly what we need s dangerous...

Michael A. Puls II

Reporter

Comment 50

•

14 years ago

If there's a way to differentiate between "base64" not found and base64="", then I think if (found && isempty) would work. I see it returns NS_OK. I'm assuming that only happens if a base64 param is found. Another issue though. If Firefox's original code (and what I tried to follow), only the first charset= is honored. Is that what happens in header parsers too or will it be the last duplicate that gets honored? (In reply to comment #49) > (In reply to comment #48) > > Does that mean that base64=anything will also trigger base64 handling? > > That's why I think that re-using code that doesn't do exactly what we need s > dangerous... For my original proposal, name=%HH (utf-8-based %HH) and disposition=attachment|inline is all that's needed to support getting a filename and "attachment" mode.

Boris Zbarsky [:bzbarsky]

Comment 51

•

14 years ago

> nsCString test = NS_ConvertUTF16toUTF8(result) NS_ConvertUTF16toUTF8 test(result); should work. > Does that mean that base64=anything will also trigger base64 handling? Well, you can check the value... but "base64=;" might not be distinguishable from "base64;"; would need to check what's actually returned. > Is that what happens in header parsers too or will it be the last duplicate > that gets honored? The "normal" header content-type parser is.... weird. I wouldn't worry about it for the moment, unless you really care about comma-separate content-types. That code might be changing too.

Boris Zbarsky [:bzbarsky]

Comment 52

•

14 years ago

Just to make my stance here clear, I would much rather switch from the current behavior to whatever (possibly suboptimal) behavior we get from using mime header param than write a bunch of custom code that's still suboptimal due to lack of a spec but needs to be maintained. Once there's a sane data: spec (e.g. one which answers your question about multiple charset params), we can think about how best to follow it.

Comment 53

•

14 years ago

(In reply to comment #52) > Just to make my stance here clear, I would much rather switch from the > current behavior to whatever (possibly suboptimal) behavior we get from > using mime header param than write a bunch of custom code that's still > suboptimal due to lack of a spec but needs to be maintained. > > Once there's a sane data: spec (e.g. one which answers your question about > multiple charset params), we can think about how best to follow it. I agree that a revision to RFC 2397 would be useful. That being said; it just inherits the media type syntax from somewhere else, and the answer is that multiple params with the same name are invalid. A new spec might make that clearer, but it's unlikely to recommend a specific behavior for this case.

Boris Zbarsky [:bzbarsky]

Comment 54

•

14 years ago

> the answer is that multiple params with the same name are invalid. Which doesn't say what you _do_ with it. A typical problem. > but it's unlikely to recommend a specific behavior for this case. Then it's broken by design, imo.

Michael A. Puls II

Reporter

Comment 55

•

14 years ago

(In reply to comment #51) > > nsCString test = NS_ConvertUTF16toUTF8(result) > > NS_ConvertUTF16toUTF8 test(result); > > should work. > > > Does that mean that base64=anything will also trigger base64 handling? > > Well, you can check the value... but "base64=;" might not be distinguishable > from "base64;"; would need to check what's actually returned. > > > Is that what happens in header parsers too or will it be the last duplicate > > that gets honored? > > The "normal" header content-type parser is.... weird. I wouldn't worry > about it for the moment, unless you really care about comma-separate > content-types. That code might be changing too. O.K. Thanks. Gonna try to play with the code.

Michael A. Puls II

Reporter

Comment 56

•

14 years ago

Attached patch Example patch of using header parser for data URI (obsolete) — Details — Splinter Review

Wow, this works surprisingly good just for a thrown-together first example. And it makes the code so so much simpler. Of course, I didn't do any lowercasing, trimming, stripchr stuff on the content type and charset yet, but that won't make it any more complex really. The "base64=" triggering base64 handling doesn't seem as bad now. Either does the handling of multiple charset=. I'll mess with it more to add the trimming and lowercasing of the content type and charset etc. later.

Michael A. Puls II

Reporter

Comment 57

•

14 years ago

> The "base64=" triggering base64 handling doesn't seem as bad now. Either > does the handling of multiple charset=. In this case, it's the last charset= that gets honored. (Which matches subject header parsing in mailto URIs btw)

Boris Zbarsky [:bzbarsky]

Comment 58

•

14 years ago

> And it makes the code so so much simpler. Victory! ;)

Michael A. Puls II

Reporter

Comment 59

•

14 years ago

Attached patch Updated patch that includes the lowercasing and whitespace stripping etc. (obsolete) — Details — Splinter Review

This one should behave pretty much the same as the currently released code. So, the use of getParameter() shouldn't affect compatibility really. The only difference is that "base64=" will trigger base64 handling ("base64=something" will not though) and the last charset= is honored instead of the first. For valid data URIs, I don't think this matters. I added the code for filename= and disposition= and commented it out to show how easy it is to add to ParseURI(). Of course, I have no idea how to modify the data URI loading code to take advantage of that for data URIs that are loaded directly. (I could make it moz-filename and moz-disposition for now also.) A question I have about the original code is its use of StripWhitespace() on the contentType and contentCharset strings. That function strips whites-space even in the middle of the string. It doesn't just trim leading and trailing white-space. For example, "t e x t / p l a i n" will become "text/plain" and "charset=u t f - 8" will set the charset to "utf-8". Is that really needed? It seems like the strings should just be trimmed instead and any newlines in the middle converted to spaces or removed. Also, for my FindFirstOfNotQuoted function, do you see any errors in it? Also, I just stuck it in nsDataHandler.cpp. Not sure if that'd be the best place for it or not. Also, should I be using any fallback encodings or anything for getParameter()?

Attachment #542534 - Attachment is obsolete: true

Boris Zbarsky [:bzbarsky]

Comment 60

•

14 years ago

> Of course, I have no idea how to modify the data URI loading code to take > advantage of that Once bug 589292 lands that becomes very clear. > Is that really needed? Almost certainly not! > Also, for my FindFirstOfNotQuoted function, do you see any errors in it? Going to defer to Julian on this one. > Also, should I be using any fallback encodings or anything for getParameter()? Let's not and see if anyone notices. I hope we can just get away with requiring UTF-8 here.

Michael A. Puls II

Reporter

Comment 61

•

14 years ago

Attached patch Using Trim() instead of StripWhitespace() since it makes more sense — Details — Splinter Review

Attachment #543063 - Attachment is obsolete: true

Boris Zbarsky [:bzbarsky]

Updated

•

14 years ago

Attachment #537720 - Flags: feedback?(bzbarsky) → feedback-

Michael A. Puls II

Reporter

Comment 62

•

13 years ago

Someone made some alternative proposals to this at <http://lists.w3.org/Archives/Public/uri/2012Mar/0000.html>.

Patrick McManus [:mcmanus]

Updated

•

10 years ago

Whiteboard: [necko-backlog]

Updated

•

8 years ago

Blocks: 1392241

Firefox Bug Husbandry Bot

Comment 63

•

8 years ago

Bulk change to priority: https://bugzilla.mozilla.org/show_bug.cgi?id=1399258

Priority: -- → P1

Firefox Bug Husbandry Bot

Comment 64

•

8 years ago

Bulk change to priority: https://bugzilla.mozilla.org/show_bug.cgi?id=1399258

Priority: P1 → P3

BugBot [:suhaib / :marco/ :calixte]

Updated

•

4 years ago

Assignee: may.gup → nobody

Status: ASSIGNED → NEW

Updated

•

3 years ago

Severity: normal → S3

Thomas Wisniewski [:twisniewski]

Comment 65

•

2 years ago

The testcase now renders fine as of bug 1845006 landing, and we now align much more closely with the current spec (with further caveats being considered in bug 1845005).

Status: NEW → RESOLVED

Closed: 2 years ago

Resolution: --- → FIXED

You need to log in before you can comment on or make changes to this bug.