Last Comment Bug 610054 - Clean up RFC 2231 MIME param handling, and add support for RFC 5987 subset
: Clean up RFC 2231 MIME param handling, and add support for RFC 5987 subset
Status: RESOLVED FIXED
: dev-doc-needed
Product: Core
Classification: Components
Component: Networking (show other bugs)
: unspecified
: All All
: -- normal with 1 vote (vote)
: mozilla10
Assigned To: Jason Duell [:jduell] (needinfo? me)
:
Mentors:
Depends on: 588781
Blocks: 609667 686429
  Show dependency treegraph
 
Reported: 2010-11-05 15:48 PDT by Jason Duell [:jduell] (needinfo? me)
Modified: 2011-11-03 09:57 PDT (History)
15 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
Fix 2231 parsing and add RFC 5987 mode (no continuations) (24.18 KB, patch)
2010-11-05 15:48 PDT, Jason Duell [:jduell] (needinfo? me)
no flags Details | Diff | Review
updates Jason's patch to apply to the current version (24.08 KB, patch)
2011-04-16 05:37 PDT, Julian Reschke
bzbarsky: review-
Details | Diff | Review
proposed patch brought up-to-date with trunk (20.76 KB, patch)
2011-09-11 09:25 PDT, Julian Reschke
no flags Details | Diff | Review
proposed patch brought up-to-date with trunk (21.67 KB, patch)
2011-09-11 14:05 PDT, Julian Reschke
no flags Details | Diff | Review
proposed patch brought up-to-date with trunk (21.65 KB, patch)
2011-10-05 10:29 PDT, Julian Reschke
bzbarsky: review+
Details | Diff | Review
proposed patch (22.69 KB, patch)
2011-10-07 13:17 PDT, Julian Reschke
bzbarsky: review+
Details | Diff | Review

Description Jason Duell [:jduell] (needinfo? me) 2010-11-05 15:48:28 PDT
Created attachment 488587 [details] [diff] [review]
Fix 2231 parsing and add RFC 5987 mode (no continuations)

This patch fixes a number of bugs in our MIME parameter parsing, especially related to continuations.  See 
  
  http://greenbytes.de/tech/tc2231/#attfncontlz

Also, RFC 5987 removes continuations from HTTP (but they remain valid for RFC
2231 apps like email): 

  http://tools.ietf.org/html/draft-ietf-httpbis-content-disp-03

so I've implemented a enum that changes the parser to ignore them.  I'm not sure of the best way to actually hook it up to the browser, though, so for now it's tested in xpcshell (via a GetParameterRFC5987 function) but otherwise unused.  Note that I've also kept the RFC 2047 decoding in the new function:  do we still want that?   Any advice on how (and/or whether) this code should get hooked in would be welcome.

There are still other parsing  errors from  http://greenbytes.de/tech/tc2231/ that are not fixed by this patch--those can be dealt with in a followup bug.

This patch is on top of my one for bug 588781.
Comment 1 Boris Zbarsky [:bz] 2010-11-05 17:34:27 PDT
Jason, is this something we're trying to do for 2.0?  Just trying to prioritize reviews....

As for RFC 2047, the question is whether that's still being used, right?  I assume that it's not compatible with what RFC 5987 defines?
Comment 2 Julian Reschke 2010-11-05 23:47:23 PDT
(In reply to comment #0)
> Note that I've also kept the RFC 2047 decoding in the new function: 
> do we still want that?   Any advice on how (and/or whether) this code should
> get hooked in would be welcome.

From my reading of the specs, RFC 2047 encoding does not apply to header field parameters, see <http://greenbytes.de/tech/webdav/draft-ietf-httpbis-content-disp-03.html#rfc.section.C.1>.

In HTTP C-D it was only implemented in Mozilla, and also made it into Chrome because of this code being written by the same author. It's not supported in other UAs (see <http://greenbytes.de/tech/webdav/draft-ietf-httpbis-content-disp-03.html#rfc.section.C.4>). So I doubt a lot it's used by anybody, and I'd recommend to get rid of it.

For MIME, the specification argument is the same, but I have no data on who's implementing this, so the risk of removing this might be higher.
Comment 3 Julian Reschke 2010-11-05 23:49:30 PDT
(In reply to comment #1)
> As for RFC 2047, the question is whether that's still being used, right?  I
> assume that it's not compatible with what RFC 5987 defines?

Clarification: the situation hasn't changed with 5987. It never was applicable to *parameter* encoding, according to the specs.
Comment 4 Jason Duell [:jduell] (needinfo? me) 2010-11-09 14:57:59 PST
Cc-ing a couple of Thunderbird developers so they're aware of this.

> is this something we're trying to do for 2.0?

Nope--I just got on a roll looking over the spec and figured I'd code up the bugfixes I ran into while I had it paged into my head.

> As for RFC 2047, the question is whether that's still being used, right?

Right.  It was never spec-supported, but was apparently being used circa 2003.  From nsIMIMEHeaderParam.idl:

 * Note that a lot of MUAs and HTTP servers put RFC 2047-encoded parameters 
 * in mail headers and HTTP headers. Unfortunately, this includes Mozilla 
 * as of 2003-05-30. Even more standard-ignorant MUAs, web servers and 
 * application servers put 'raw 8bit characters'. This will try to cope 
 * with all these cases as gracefully as possible. Additionally, it 
 * returns the language tag if the parameter is encoded per RFC 2231 and 
 * includes lang.

We should look at whether this is still the case (perhaps thunderbird is still using 2047?), and nuke the 2047 (and raw 8 bits?) handling if possible.

Re: plugging in this code: Julian points out that RFC 5987 only applies to HTTP header parameters, so 2231 mode should still be used when Content-Disposition appears in the *payload* of an HTTP message, such as for multipart/whatever."  And we'd obviously still need it for Thunderbird.   So perhaps the answer is to have two functions in the IDL, as I've done, and figure out where we want to convert calls into the 5987 variant.
Comment 5 Julian Reschke 2010-11-09 23:19:14 PST
(In reply to comment #4)
> > As for RFC 2047, the question is whether that's still being used, right?
> 
> Right.  It was never spec-supported, but was apparently being used circa 2003. 
> From nsIMIMEHeaderParam.idl:
> 
>  * Note that a lot of MUAs and HTTP servers put RFC 2047-encoded parameters 
>  * in mail headers and HTTP headers. Unfortunately, this includes Mozilla 
>  * as of 2003-05-30. Even more standard-ignorant MUAs, web servers and 
>  * application servers put 'raw 8bit characters'. This will try to cope 
>  * with all these cases as gracefully as possible. Additionally, it 
>  * returns the language tag if the parameter is encoded per RFC 2231 and 
>  * includes lang.

As far as I can tell, historically *only* Mozilla/FF supported RFC 2047 in HTTP header params (incorrectly). See <http://greenbytes.de/tech/webdav/draft-ietf-httpbis-content-disp-03.html#rfc.section.C.4>. Chrome just inherited this years later, no other UA is doing that.
Comment 6 Julian Reschke 2010-12-04 02:36:41 PST
Jason, I've got a new test case:

  <http://greenbytes.de/tech/tc2231/#emptydisposition>

where FF is the only UA to fail. Do we need a separate bug? (I currently don't have time to test with the proposed change).
Comment 7 Jungshik Shin 2011-01-10 16:13:35 PST
Thunderbird's behavior does not matter for this issue. 

For emission : 

The last time I checked (a few years ago), Thunderbird by default used RFC 2231 for C-D, but it breaks the interoperability because MS Outlook and Outlook Express did not understand RFC 2231 (at least until 2 yrs ago). So, there's a config param (about:config) by which a user can select to use RFC 2047 for C-D header. 

The latest version of MS Outlook / Outlook Express (the latter is free, but sometimes, it's better in terms of I18N than non-free MS Outlook) might support RFC 2231, but I haven't tried them recently.

For receiving: TB has to support RFC2047 in C-D because there are MUAs and web mail services that produce RFC 2047 for C-D in email messages. 

As for Firefox, note that multiple Google products (including gmail) emits RFC 2047. So, it can't be simply removed overnight. 

We have to find out what Yahoo, Hotmail/Live and various country-specific web mail services do in HTTP before deciding what to do. 




I haven't followed what TB does this days
Comment 8 Jungshik Shin 2011-01-10 16:17:43 PST
See also http://blog.gmane.org/gmane.org.w3c.internationalization.general/month=20031101  

It's a 7-yr old mail thread, but some people voiced the support for using RFC 2047 in HTTP C-D header. So, it's likely that there are some servers (other than Google's) that still emit RFC 2047 in HTTP C-D header.
Comment 9 Julian Reschke 2011-04-16 05:37:03 PDT
Created attachment 526485 [details] [diff] [review]
updates Jason's patch to apply to the current version

The previous patch didn't apply anymore; I *believe* I have resolved the merge conflict properly, but Jason should double-check.
Comment 10 Julian Reschke 2011-04-16 05:39:24 PDT
(In reply to comment #0)
> so I've implemented a enum that changes the parser to ignore them.  I'm not
> sure of the best way to actually hook it up to the browser, though, so for now
> it's tested in xpcshell (via a GetParameterRFC5987 function) but otherwise
> unused.  Note that I've also kept the RFC 2047 decoding in the new function: 
> do we still want that?   Any advice on how (and/or whether) this code should
> get hooked in would be welcome.

I tested this with the revised patch I just attached, and I can confirm that FF will still process continuations, and the fixes Jason added indeed work as described.
Comment 11 Masatoshi Kimura [:emk] 2011-04-16 10:39:22 PDT
Outlook Express/Windows Mail/Windows Live Mail/Office Outlook have never been supported RFC2231 encoding. We can't remove RFC2047 support at least for mail.
Comment 12 Julian Reschke 2011-04-16 11:54:21 PDT
(In reply to comment #11)
> Outlook Express/Windows Mail/Windows Live Mail/Office Outlook have never been
> supported RFC2231 encoding. We can't remove RFC2047 support at least for mail.

And the change as proposed doesn't do that; neither for email nor for HTTP.

That being said, I think we *should* disable it for HTTP (in the nightly channel) in order to find out what sites "break". Those sites should be fixed to use something which is properly specified and interoperable. (yes, looking at you, GMail!)
Comment 13 Boris Zbarsky [:bz] 2011-04-20 21:31:54 PDT
I don't think seriously breaking gmail in nightlies is an option, fwiw.  We'll need to get them to change first.

I'm sorry about the slow review here; I need to do a bunch more reviewing....
Comment 14 Julian Reschke 2011-04-21 00:04:30 PDT
(In reply to comment #13)
> I don't think seriously breaking gmail in nightlies is an option, fwiw.  We'll
> need to get them to change first.
> ...

Just to be clear: this patch doesn't break GMail, disabling the RFC 2047 encoding just happens to become more easy once the path is in.
Comment 15 Boris Zbarsky [:bz] 2011-04-21 05:36:46 PDT
Yes, that was clear from comment 12.  I just think that the followup plan in that comment is not realistic as presented.  ;)
Comment 16 Julian Reschke 2011-05-03 03:14:53 PDT
Can we get this committed before FF6 Aurora branches? (I'd like to do some more work on C-D, but would like unnecessary conflicts because of other pending changes)
Comment 17 Jason Duell [:jduell] (needinfo? me) 2011-05-16 23:44:30 PDT
The logic in this patch isn't currently hooked up to anything, so we could theoretically land it.   Perhaps we should first add a way to hook it up to HTTP header parsing and provide an about:config pref to turn it on, and see what happens (first manually, and if things look promising, then try it on by default in the nightlies).

bz: we call GetParameter in a bunch of places:  do have a idea of which we'd want to switch to the getParameter5987() function?

nsHTMLMediaElement::GetCanPlay
nsFeedSniffer::HasAttachmentDisposition()
nsContentTypeParser::GetType()
nsScriptLoader::ProcessScriptElement()
XULContentSinkImpl::OpenScript()
nsDocumentOpenInfo::DispatchContent()
nsExternalHelperAppService::GetFilenameAndExtensionFromChannel()
Comment 18 Julian Reschke 2011-05-17 00:25:34 PDT
The support for the RFC 5987 variant ("param*") should only be activated for parameters that are defined to support it - which hopefully is the case for all fields that require I18N.

For now, this should be "filename" in Content-Disposition and "title" in Link (which currently doesn't do it -- see http://greenbytes.de/tech/tc/httplink/#simplecsstitle5987 -- but it's on my list :-).
Comment 19 Jason Duell [:jduell] (needinfo? me) 2011-05-17 21:18:41 PDT
We can experiment with switching C-D filename parsing fairly easily.  I suggest we wait until bug 589292 lands first, as it touches the same code.
Comment 20 Boris Zbarsky [:bz] 2011-05-26 21:49:41 PDT
Comment on attachment 526485 [details] [diff] [review]
updates Jason's patch to apply to the current version

Please update the "decode RFC 2231 ..." comments in DoGetParameter.

Should DoParameterInternal be DoGetParameterInternal?  Either way, I guess.

Using a char for nextContinuation seems like it will fail if someone sends a header with 256 continuations.  Why not use an int with -1 as the failure signal?

caseCStart is really caseC_or_DStart or something, right?  Is it worth naming accordingly?

I don't understand the handling of nextContinuation == '!'.  How does that cause us to skip future continuations, in cases of >9 continuations?

The rest looks fine, but the nextContinuation thing needs fixing.  Sorry about the terrible review lag here.  :(
Comment 21 Julian Reschke 2011-09-11 09:25:39 PDT
Created attachment 559771 [details] [diff] [review]
proposed patch brought up-to-date with trunk

(this doesn't address any of Boris' comments though)
Comment 22 Julian Reschke 2011-09-11 14:05:31 PDT
Created attachment 559792 [details] [diff] [review]
proposed patch brought up-to-date with trunk

this should address some of Boris' comments; see separate comment in bugzilla
Comment 23 Julian Reschke 2011-09-11 14:08:21 PDT
(In reply to Boris Zbarsky (:bz) from comment #20)
> Comment on attachment 526485 [details] [diff] [review]
> updates Jason's patch to apply to the current version
> 
> Please update the "decode RFC 2231 ..." comments in DoGetParameter.

I augmented one of the comments in .h. Note sure about the rest; do you want to *replace* mentions of 2231 by something else.

> Should DoParameterInternal be DoGetParameterInternal?  Either way, I guess.

I don't feel strongly about this; except that we don't want the names to get too long, right?

> Using a char for nextContinuation seems like it will fail if someone sends a
> header with 256 continuations.  Why not use an int with -1 as the failure
> signal?

Done. This also means that gaps are now detected past segment 9 (test cases enhanced).

> caseCStart is really caseC_or_DStart or something, right?  Is it worth
> naming accordingly?

Not sure.

> I don't understand the handling of nextContinuation == '!'.  How does that
> cause us to skip future continuations, in cases of >9 continuations?

Gone by switching to int.

> The rest looks fine, but the nextContinuation thing needs fixing.  Sorry
> about the terrible review lag here.  :(
Comment 24 Julian Reschke 2011-10-05 10:29:30 PDT
Created attachment 564915 [details] [diff] [review]
proposed patch brought up-to-date with trunk
Comment 25 Boris Zbarsky [:bz] 2011-10-07 11:17:26 PDT
Comment on attachment 564915 [details] [diff] [review]
proposed patch brought up-to-date with trunk

> do you want to *replace* mentions of 2231 by something else.

I meant the comments in the .cpp.  Like this one:

89     // get parameter (decode RFC 2231 if it's RFC 2231-encoded and 
90     // return charset.)

which are not quite true anymore; the behavior now depends on the new flag this patch adds.

> Is it worth naming accordingly?

I think it is.  Let's call it caseCorDStart.

Now for the patch:

>+        // no trailing zeros allowed except for ... position 0

That should be "leading zeros", not trailing ones, right?

r=me with those nits picked, and another apology for lag.  :(
Comment 26 Julian Reschke 2011-10-07 13:17:21 PDT
Created attachment 565628 [details] [diff] [review]
proposed patch

Review comments hopefully addressed.
Comment 27 Boris Zbarsky [:bz] 2011-10-07 19:49:53 PDT
Comment on attachment 565628 [details] [diff] [review]
proposed patch

r=me
Comment 28 Julian Reschke 2011-10-08 00:55:12 PDT
(In reply to Boris Zbarsky (:bz) from comment #25)
> ...
> r=me with those nits picked, and another apology for lag.  :(

Thanks a lot. I'll proceed with 686429 (the cleanup of the tests).
Comment 29 Ed Morley [:emorley] 2011-10-10 00:19:22 PDT
In my queue with a few other bits that are being sent to try first and then onto inbound :-)

https://tbpl.mozilla.org/?tree=Try&rev=c1528769b893
Comment 31 Matt Brubeck (:mbrubeck) 2011-10-10 11:14:33 PDT
https://hg.mozilla.org/mozilla-central/rev/68b7ba483004

Note You need to log in before you can comment on or make changes to this bug.