Open Bug 500713 Opened 15 years ago Updated 2 years ago

Update HTML content sniffing heuristic to match the MIME Sniffing Standard

Categories

(Core :: DOM: Core & HTML, defect, P3)

defect

Tracking

()

ASSIGNED

People

(Reporter: abarth-mozilla, Assigned: padenot)

References

(Blocks 1 open bug, )

Details

(Whiteboard: dom-triaged)

Attachments

(1 file, 4 obsolete files)

We should update our content sniffing heuristic for text/html to match HTML 5.  Patch forthcoming.
Attached patch patch (obsolete) — Splinter Review
I updated the text/html signature except for <!DOCTYPE html, which I think we should do separately, to be conservative.
Attachment #385430 - Flags: superreview?(bzbarsky)
Attachment #385430 - Flags: review?(bzbarsky)
So is it intentional that "<htmlfoo" should be detected as HTML?
Yes.  Although, in practice, it doesn't seem to matter much.  If you feel strongly about it, we can change the spec to match the existing " " or ">" behavior (at a slight cost of complexity).
I think I would somewhat prefer that on "better to not guess it's HTML if it's not" grounds....
Attachment #385430 - Attachment is obsolete: true
Attachment #385430 - Flags: superreview?(bzbarsky)
Attachment #385430 - Flags: review?(bzbarsky)
Comment on attachment 385430 [details] [diff] [review]
patch

Ok.  I'll update the spec.
I can add the sniffer tests from html5lib to mochitest once this is implemented.
Here are the html5lib content type sniffer tests.  Currently all the feed tests fail; they're all detected as text/html.  Is the sniffer code going to be updated to handle these?  If not, I can mark them as 'todo'.  I can also add additional text/html sniffer cases per the current code if you'd like.
Attachment #391924 - Flags: review?
Attachment #391924 - Flags: review? → review?(abarth-mozilla)
I don't have the power to review changes for Mozilla, but these test cases don't have very much coverage.  They mostly test RSS and "<!DOCTYPE html", but they don't test whitespace, length limits, tag variety, case variety, etc.
Yes, the tests are just directly ported from the html5lib test cases at http://html5lib.googlecode.com/hg/testdata/sniffer/.   I can add additional tests, but should I use the HTML5 spec or the current code as a reference?

I'll change the review to Henri.
Attachment #391924 - Flags: review?(abarth-mozilla) → review?(hsivonen)
(In reply to comment #9)
> Yes, the tests are just directly ported from the html5lib test cases at
> http://html5lib.googlecode.com/hg/testdata/sniffer/.   I can add additional
> tests, but should I use the HTML5 spec or the current code as a reference?

I would use the spec as a reference.  That way we'll see more and more tests pass as we improve our sniffer implementation to match the spec.
Comment on attachment 391924 [details] [diff] [review]
mochitest test cases for html5 sniffer

will attach an updated version Monday
Attachment #391924 - Flags: review?(hsivonen)
A better content-type sniffer test harness along with additional sniffer tests.  This version:

- handles the download dialog which appears when attempting to load certain content types (like application/octet-stream)
- allows tests to specify whether or not they want an HTTP Content-Type header
- prints out prettier failure messages to assist in isolating failures
- contains tests for "unknown type" and "text vs binary" as well as "feed or HTML"

There are currently 39 failures and 60 passes.  Of the more interesting failures:

- 0xFFFE 0x0000 is detected as application/octet-stream, should be text/plain according to spec
- "<body>" (and a lot of other HTML elements) is detected as text/html, should be text/plain according to spec
- unclosed tags like "<html" are detected as text/plain, should be text/html according to spec
- all of the feed detections are wrong; they currently are detected as text/html or application/xhtml+xml instead of application/rss+xml or application/atom+xml

I didn't include tests for image types, or for correct handling of most Content-Type HTTP header values.  

FYI, there's a typo in the HTML 5 spec at http://www.w3.org/TR/html5/infrastructure.html#content-type-sniffing:-unknown-type; the table shows the UTF-16LE BOM as FF FF 00 00, but it should be FF FE 00 00.
Attachment #391924 - Attachment is obsolete: true
Attachment #392383 - Flags: review?(hsivonen)
Thanks for doing this.  In some cases we should change the spec and in other cases we should change the implementation.
Added just a couple more cases for text vs binary, including one for privilege escalation attack.
Attachment #392383 - Attachment is obsolete: true
Attachment #392541 - Flags: review?(hsivonen)
Attachment #392383 - Flags: review?(hsivonen)
Comment on attachment 392541 [details] [diff] [review]
mochitest test cases for html5 sniffer (v3)

The content type sniffer is not part of the HTML5 parser. Therefore, the html5.enable pref shouldn't have an effect and the comment talking about the HTML5 parser is incorrect.

Otherwise the test looks reasonable, but I think I'm not qualified to review this, because I don't have the right experience of MochiTest.
Attachment #392541 - Flags: review?(hsivonen)
I've updated the tests per comment #15, and verified that the html5.enable pref has no effect on these tests.  I will ask bz to review when we're ready to check this test in.
Attachment #392541 - Attachment is obsolete: true
(In reply to comment #12)

> FYI, there's a typo in the HTML 5 spec at
> http://www.w3.org/TR/html5/infrastructure.html#content-type-sniffing:-unknown-type;
> the table shows the UTF-16LE BOM as FF FF 00 00, but it should be FF FE 00 00.

In case this hasn't been reported yet, would you mind letting the happy editors of WHATWG know at: whatwg@lists.whatwg.org?
(In reply to comment #17)
> In case this hasn't been reported yet, would you mind letting the happy editors
> of WHATWG know at: whatwg@lists.whatwg.org?

This typo has been fixed already.  The W3C draft at that URL is pretty old at this point.
Component: Networking → HTML: Parser
Jonathan, what should be done here?
Flags: needinfo?(jgriffin)
Whiteboard: dom-triaged
Good question, I last looked at this over 5 years ago! Probably we should run this patch on try, and see whether it passes or not, and someone on the DOM team should then review to determine if these tests still make any sense. I'll leave the needinfo for myself to remind me to schedule this on try later.
Here's a try run:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=410d9afe64cf&selectedJob=17114426

Tests currently fail with:

1323 INFO TEST-UNEXPECTED-FAIL | parser/htmlparser/tests/mochitest/test_html5_sniffer.html | unexpected content type - got "text/html", expected "text/plain"
1324 INFO TEST-UNEXPECTED-FAIL | parser/htmlparser/tests/mochitest/test_html5_sniffer.html | uncaught exception - ReferenceError: appendChildNodes is not defined at http://mochi.test:8888/tests/parser/htmlparser/tests/mochitest/test_html5_sniffer.html:151

Andrew, do you want to have a developer look at these?
Flags: needinfo?(jgriffin) → needinfo?(overholt)
I don't know enough about the parser to know what to do here. I'll needinfo Henri (who's not around for a few weeks).
Flags: needinfo?(overholt) → needinfo?(hsivonen)
I'm not familiar with our MIME sniffing code. To the point that I don't know if we should be patching our existing code incrementally of if we should be doing a rewrite from spec in Rust.

mcmanus, could you, please, elaborate on why this was move out of the Networking component? MIME sniffing doesn't belong in the HTML parser and, AFAICT, belongs in Networking.

GPHemsley, could you, please, check whether these tests still make sense?
Component: HTML: Parser → Networking
Flags: needinfo?(mcmanus)
Flags: needinfo?(hsivonen)
Flags: needinfo?(gphemsley)
Summary: Update HTML content sniffing heuristic to match HTML 5 → Update HTML content sniffing heuristic to match the MIME Sniffing Standard
Note that the relevant code lives in netwerk/streamconv/converters/nsUnknownDecoder.cpp and in particular in nsUnknownDecoder::SniffForHTML.  As Henri says, this has absolutely nothing to do with the HTML parser.
I don't think you're going to find a lot of MIME type expertise on the networking team.. I suspect that code is just there beacuse its a convenient place for a stream converter. The patch made changes to the parser directory of the code base - that's why it landed there to try and find some love.. maybe DOM would have more expertise?

If we're not interested in the patch, we should WONTFIX.
Status: NEW → RESOLVED
Closed: 8 years ago
Component: Networking → DOM
Flags: needinfo?(mcmanus)
Resolution: --- → FIXED
Status: RESOLVED → UNCONFIRMED
Ever confirmed: false
Resolution: FIXED → ---
The parser bits are all tests, because that's where it's convenient to put the tests.  Anyway, what this bug really needs is an owner and in particular someone who is willing to respond to bug reports of regressions from this and push back on the spec as needed...
Gordon is no longer actively editing. Paul has been doing some work recently, mostly for media-related sniffing. I suspect he's busy, but worth asking if he's interested in cleaning up the remainder of the sniffing code too.

It does seem like a shame to throw away these tests.
Flags: needinfo?(gphemsley) → needinfo?(padenot)
I can have a look, but as you mention, I can't commit to a specific timeline.
Assignee: nobody → padenot
Flags: needinfo?(padenot)
Blocks: mimesniff
Priority: -- → P3
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Component: DOM → DOM: Core & HTML
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: