Last Comment Bug 500713 - Update HTML content sniffing heuristic to match the MIME Sniffing Standard
: Update HTML content sniffing heuristic to match the MIME Sniffing Standard
Status: UNCONFIRMED
dom-triaged
:
Product: Core
Classification: Components
Component: DOM (show other bugs)
: unspecified
: All All
: -- normal with 1 vote (vote)
: ---
Assigned To: Nobody; OK to take it and work on it
:
Mentors:
https://mimesniff.spec.whatwg.org/
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2009-06-26 10:41 PDT by Adam Barth
Modified: 2016-03-17 16:17 PDT (History)
11 users (show)
hsivonen: needinfo? (gphemsley)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
patch (2.68 KB, patch)
2009-06-26 11:35 PDT, Adam Barth
no flags Details | Diff | Splinter Review
mochitest test cases for html5 sniffer (21.02 KB, patch)
2009-07-31 11:55 PDT, Jonathan Griffin (:jgriffin)
no flags Details | Diff | Splinter Review
mochitest test cases for html5 sniffer (47.31 KB, patch)
2009-08-03 17:33 PDT, Jonathan Griffin (:jgriffin)
no flags Details | Diff | Splinter Review
mochitest test cases for html5 sniffer (v3) (48.75 KB, patch)
2009-08-04 11:57 PDT, Jonathan Griffin (:jgriffin)
no flags Details | Diff | Splinter Review
mochitest test cases for html5 sniffer (v4) (48.20 KB, patch)
2009-08-05 10:53 PDT, Jonathan Griffin (:jgriffin)
no flags Details | Diff | Splinter Review

Description Adam Barth 2009-06-26 10:41:29 PDT
We should update our content sniffing heuristic for text/html to match HTML 5.  Patch forthcoming.
Comment 1 Adam Barth 2009-06-26 11:35:35 PDT
Created attachment 385430 [details] [diff] [review]
patch

I updated the text/html signature except for <!DOCTYPE html, which I think we should do separately, to be conservative.
Comment 2 Boris Zbarsky [:bz] 2009-06-26 12:16:50 PDT
So is it intentional that "<htmlfoo" should be detected as HTML?
Comment 3 Adam Barth 2009-06-26 12:25:54 PDT
Yes.  Although, in practice, it doesn't seem to matter much.  If you feel strongly about it, we can change the spec to match the existing " " or ">" behavior (at a slight cost of complexity).
Comment 4 Boris Zbarsky [:bz] 2009-06-26 14:37:10 PDT
I think I would somewhat prefer that on "better to not guess it's HTML if it's not" grounds....
Comment 5 Adam Barth 2009-06-26 14:53:30 PDT
Comment on attachment 385430 [details] [diff] [review]
patch

Ok.  I'll update the spec.
Comment 6 Jonathan Griffin (:jgriffin) 2009-07-29 15:27:36 PDT
I can add the sniffer tests from html5lib to mochitest once this is implemented.
Comment 7 Jonathan Griffin (:jgriffin) 2009-07-31 11:55:22 PDT
Created attachment 391924 [details] [diff] [review]
mochitest test cases for html5 sniffer

Here are the html5lib content type sniffer tests.  Currently all the feed tests fail; they're all detected as text/html.  Is the sniffer code going to be updated to handle these?  If not, I can mark them as 'todo'.  I can also add additional text/html sniffer cases per the current code if you'd like.
Comment 8 Adam Barth 2009-07-31 12:19:00 PDT
I don't have the power to review changes for Mozilla, but these test cases don't have very much coverage.  They mostly test RSS and "<!DOCTYPE html", but they don't test whitespace, length limits, tag variety, case variety, etc.
Comment 9 Jonathan Griffin (:jgriffin) 2009-07-31 12:47:03 PDT
Yes, the tests are just directly ported from the html5lib test cases at http://html5lib.googlecode.com/hg/testdata/sniffer/.   I can add additional tests, but should I use the HTML5 spec or the current code as a reference?

I'll change the review to Henri.
Comment 10 Adam Barth 2009-07-31 13:07:24 PDT
(In reply to comment #9)
> Yes, the tests are just directly ported from the html5lib test cases at
> http://html5lib.googlecode.com/hg/testdata/sniffer/.   I can add additional
> tests, but should I use the HTML5 spec or the current code as a reference?

I would use the spec as a reference.  That way we'll see more and more tests pass as we improve our sniffer implementation to match the spec.
Comment 11 Jonathan Griffin (:jgriffin) 2009-07-31 16:13:36 PDT
Comment on attachment 391924 [details] [diff] [review]
mochitest test cases for html5 sniffer

will attach an updated version Monday
Comment 12 Jonathan Griffin (:jgriffin) 2009-08-03 17:33:28 PDT
Created attachment 392383 [details] [diff] [review]
mochitest test cases for html5 sniffer

A better content-type sniffer test harness along with additional sniffer tests.  This version:

- handles the download dialog which appears when attempting to load certain content types (like application/octet-stream)
- allows tests to specify whether or not they want an HTTP Content-Type header
- prints out prettier failure messages to assist in isolating failures
- contains tests for "unknown type" and "text vs binary" as well as "feed or HTML"

There are currently 39 failures and 60 passes.  Of the more interesting failures:

- 0xFFFE 0x0000 is detected as application/octet-stream, should be text/plain according to spec
- "<body>" (and a lot of other HTML elements) is detected as text/html, should be text/plain according to spec
- unclosed tags like "<html" are detected as text/plain, should be text/html according to spec
- all of the feed detections are wrong; they currently are detected as text/html or application/xhtml+xml instead of application/rss+xml or application/atom+xml

I didn't include tests for image types, or for correct handling of most Content-Type HTTP header values.  

FYI, there's a typo in the HTML 5 spec at http://www.w3.org/TR/html5/infrastructure.html#content-type-sniffing:-unknown-type; the table shows the UTF-16LE BOM as FF FF 00 00, but it should be FF FE 00 00.
Comment 13 Adam Barth 2009-08-03 17:37:27 PDT
Thanks for doing this.  In some cases we should change the spec and in other cases we should change the implementation.
Comment 14 Jonathan Griffin (:jgriffin) 2009-08-04 11:57:42 PDT
Created attachment 392541 [details] [diff] [review]
mochitest test cases for html5 sniffer (v3)

Added just a couple more cases for text vs binary, including one for privilege escalation attack.
Comment 15 Henri Sivonen (:hsivonen) 2009-08-05 04:46:16 PDT
Comment on attachment 392541 [details] [diff] [review]
mochitest test cases for html5 sniffer (v3)

The content type sniffer is not part of the HTML5 parser. Therefore, the html5.enable pref shouldn't have an effect and the comment talking about the HTML5 parser is incorrect.

Otherwise the test looks reasonable, but I think I'm not qualified to review this, because I don't have the right experience of MochiTest.
Comment 16 Jonathan Griffin (:jgriffin) 2009-08-05 10:53:58 PDT
Created attachment 392751 [details] [diff] [review]
mochitest test cases for html5 sniffer (v4)

I've updated the tests per comment #15, and verified that the html5.enable pref has no effect on these tests.  I will ask bz to review when we're ready to check this test in.
Comment 17 cmtalbert 2009-08-06 17:29:59 PDT
(In reply to comment #12)

> FYI, there's a typo in the HTML 5 spec at
> http://www.w3.org/TR/html5/infrastructure.html#content-type-sniffing:-unknown-type;
> the table shows the UTF-16LE BOM as FF FF 00 00, but it should be FF FE 00 00.

In case this hasn't been reported yet, would you mind letting the happy editors of WHATWG know at: whatwg@lists.whatwg.org?
Comment 18 Adam Barth 2009-08-06 18:06:18 PDT
(In reply to comment #17)
> In case this hasn't been reported yet, would you mind letting the happy editors
> of WHATWG know at: whatwg@lists.whatwg.org?

This typo has been fixed already.  The W3C draft at that URL is pretty old at this point.
Comment 19 Andrew Overholt [:overholt] (back Aug 31) 2016-02-04 11:36:29 PST
Jonathan, what should be done here?
Comment 20 Jonathan Griffin (:jgriffin) 2016-02-04 11:45:10 PST
Good question, I last looked at this over 5 years ago! Probably we should run this patch on try, and see whether it passes or not, and someone on the DOM team should then review to determine if these tests still make any sense. I'll leave the needinfo for myself to remind me to schedule this on try later.
Comment 21 Jonathan Griffin (:jgriffin) 2016-02-23 13:23:03 PST
Here's a try run:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=410d9afe64cf&selectedJob=17114426

Tests currently fail with:

1323 INFO TEST-UNEXPECTED-FAIL | parser/htmlparser/tests/mochitest/test_html5_sniffer.html | unexpected content type - got "text/html", expected "text/plain"
1324 INFO TEST-UNEXPECTED-FAIL | parser/htmlparser/tests/mochitest/test_html5_sniffer.html | uncaught exception - ReferenceError: appendChildNodes is not defined at http://mochi.test:8888/tests/parser/htmlparser/tests/mochitest/test_html5_sniffer.html:151

Andrew, do you want to have a developer look at these?
Comment 22 Andrew Overholt [:overholt] (back Aug 31) 2016-02-24 08:45:41 PST
I don't know enough about the parser to know what to do here. I'll needinfo Henri (who's not around for a few weeks).
Comment 23 Henri Sivonen (:hsivonen) 2016-03-17 02:05:27 PDT
I'm not familiar with our MIME sniffing code. To the point that I don't know if we should be patching our existing code incrementally of if we should be doing a rewrite from spec in Rust.

mcmanus, could you, please, elaborate on why this was move out of the Networking component? MIME sniffing doesn't belong in the HTML parser and, AFAICT, belongs in Networking.

GPHemsley, could you, please, check whether these tests still make sense?
Comment 24 Boris Zbarsky [:bz] 2016-03-17 06:40:00 PDT
Note that the relevant code lives in netwerk/streamconv/converters/nsUnknownDecoder.cpp and in particular in nsUnknownDecoder::SniffForHTML.  As Henri says, this has absolutely nothing to do with the HTML parser.
Comment 25 Patrick McManus [:mcmanus] PTO until Sep 6 2016-03-17 14:28:58 PDT
I don't think you're going to find a lot of MIME type expertise on the networking team.. I suspect that code is just there beacuse its a convenient place for a stream converter. The patch made changes to the parser directory of the code base - that's why it landed there to try and find some love.. maybe DOM would have more expertise?

If we're not interested in the patch, we should WONTFIX.
Comment 26 Boris Zbarsky [:bz] 2016-03-17 16:17:31 PDT
The parser bits are all tests, because that's where it's convenient to put the tests.  Anyway, what this bug really needs is an owner and in particular someone who is willing to respond to bug reports of regressions from this and push back on the spec as needed...

Note You need to log in before you can comment on or make changes to this bug.