Open Bug 500713 Opened 15 years ago Updated 2 years ago

Update HTML content sniffing heuristic to match the MIME Sniffing Standard

Tracking

()

Status:

ASSIGNED

People

(Reporter: abarth-mozilla, Assigned: padenot)

References

(Blocks 1 open bug,
URL
)

Details

(Whiteboard: dom-triaged)

Attachments

(1 file, 4 obsolete files)

patch 15 years ago Adam Barth 2.68 KB, patch		Details \| Diff \| Splinter Review
mochitest test cases for html5 sniffer 15 years ago Jonathan Griffin (:jgriffin) 21.02 KB, patch		Details \| Diff \| Splinter Review
mochitest test cases for html5 sniffer 15 years ago Jonathan Griffin (:jgriffin) 47.31 KB, patch		Details \| Diff \| Splinter Review
mochitest test cases for html5 sniffer (v3) 15 years ago Jonathan Griffin (:jgriffin) 48.75 KB, patch		Details \| Diff \| Splinter Review
mochitest test cases for html5 sniffer (v4) 15 years ago Jonathan Griffin (:jgriffin) 48.20 KB, patch		Details \| Diff \| Splinter Review

Adam Barth

Reporter

Description

•

15 years ago

We should update our content sniffing heuristic for text/html to match HTML 5.  Patch forthcoming.

Adam Barth

Reporter

Comment 1

•

15 years ago

Attached patch patch (obsolete) — Details — Splinter Review

I updated the text/html signature except for <!DOCTYPE html, which I think we should do separately, to be conservative.

Attachment #385430 - Flags: superreview?(bzbarsky)

Attachment #385430 - Flags: review?(bzbarsky)

Boris Zbarsky [:bzbarsky]

Comment 2

•

15 years ago

So is it intentional that "<htmlfoo" should be detected as HTML?

Adam Barth

Reporter

Comment 3

•

15 years ago

Yes.  Although, in practice, it doesn't seem to matter much.  If you feel strongly about it, we can change the spec to match the existing " " or ">" behavior (at a slight cost of complexity).

Boris Zbarsky [:bzbarsky]

Comment 4

•

15 years ago

I think I would somewhat prefer that on "better to not guess it's HTML if it's not" grounds....

Adam Barth

Reporter

Updated

•

15 years ago

Attachment #385430 - Attachment is obsolete: true

Attachment #385430 - Flags: superreview?(bzbarsky)

Attachment #385430 - Flags: review?(bzbarsky)

Adam Barth

Reporter

Comment 5

•

15 years ago

Comment on attachment 385430 [details] [diff] [review]
patch

Ok.  I'll update the spec.

Jonathan Griffin (:jgriffin)

Comment 6

•

15 years ago

I can add the sniffer tests from html5lib to mochitest once this is implemented.

Jonathan Griffin (:jgriffin)

Comment 7

•

15 years ago

Attached patch mochitest test cases for html5 sniffer (obsolete) — Details — Splinter Review

Here are the html5lib content type sniffer tests.  Currently all the feed tests fail; they're all detected as text/html.  Is the sniffer code going to be updated to handle these?  If not, I can mark them as 'todo'.  I can also add additional text/html sniffer cases per the current code if you'd like.

Attachment #391924 - Flags: review?

Jonathan Griffin (:jgriffin)

Updated

•

15 years ago

Attachment #391924 - Flags: review? → review?(abarth-mozilla)

Adam Barth

Reporter

Comment 8

•

15 years ago

I don't have the power to review changes for Mozilla, but these test cases don't have very much coverage.  They mostly test RSS and "<!DOCTYPE html", but they don't test whitespace, length limits, tag variety, case variety, etc.

Jonathan Griffin (:jgriffin)

Comment 9

•

15 years ago

Yes, the tests are just directly ported from the html5lib test cases at http://html5lib.googlecode.com/hg/testdata/sniffer/.   I can add additional tests, but should I use the HTML5 spec or the current code as a reference?

I'll change the review to Henri.

Jonathan Griffin (:jgriffin)

Updated

•

15 years ago

Attachment #391924 - Flags: review?(abarth-mozilla) → review?(hsivonen)

Adam Barth

Reporter

Comment 10

•

15 years ago

(In reply to comment #9)
> Yes, the tests are just directly ported from the html5lib test cases at
> http://html5lib.googlecode.com/hg/testdata/sniffer/.   I can add additional
> tests, but should I use the HTML5 spec or the current code as a reference?

I would use the spec as a reference.  That way we'll see more and more tests pass as we improve our sniffer implementation to match the spec.

Jonathan Griffin (:jgriffin)

Comment 11

•

15 years ago

Comment on attachment 391924 [details] [diff] [review]
mochitest test cases for html5 sniffer

will attach an updated version Monday

Attachment #391924 - Flags: review?(hsivonen)

Jonathan Griffin (:jgriffin)

Comment 12

•

15 years ago

Attached patch mochitest test cases for html5 sniffer (obsolete) — Details — Splinter Review

A better content-type sniffer test harness along with additional sniffer tests.  This version:

- handles the download dialog which appears when attempting to load certain content types (like application/octet-stream)
- allows tests to specify whether or not they want an HTTP Content-Type header
- prints out prettier failure messages to assist in isolating failures
- contains tests for "unknown type" and "text vs binary" as well as "feed or HTML"

There are currently 39 failures and 60 passes.  Of the more interesting failures:

- 0xFFFE 0x0000 is detected as application/octet-stream, should be text/plain according to spec
- "<body>" (and a lot of other HTML elements) is detected as text/html, should be text/plain according to spec
- unclosed tags like "<html" are detected as text/plain, should be text/html according to spec
- all of the feed detections are wrong; they currently are detected as text/html or application/xhtml+xml instead of application/rss+xml or application/atom+xml

I didn't include tests for image types, or for correct handling of most Content-Type HTTP header values.  

FYI, there's a typo in the HTML 5 spec at http://www.w3.org/TR/html5/infrastructure.html#content-type-sniffing:-unknown-type; the table shows the UTF-16LE BOM as FF FF 00 00, but it should be FF FE 00 00.

Attachment #391924 - Attachment is obsolete: true

Attachment #392383 - Flags: review?(hsivonen)

Adam Barth

Reporter

Comment 13

•

15 years ago

Thanks for doing this.  In some cases we should change the spec and in other cases we should change the implementation.

Jonathan Griffin (:jgriffin)

Comment 14

•

15 years ago

Attached patch mochitest test cases for html5 sniffer (v3) (obsolete) — Details — Splinter Review

Added just a couple more cases for text vs binary, including one for privilege escalation attack.

Attachment #392383 - Attachment is obsolete: true

Attachment #392541 - Flags: review?(hsivonen)

Attachment #392383 - Flags: review?(hsivonen)

Henri Sivonen (:hsivonen)

Comment 15

•

15 years ago

Comment on attachment 392541 [details] [diff] [review]
mochitest test cases for html5 sniffer (v3)

The content type sniffer is not part of the HTML5 parser. Therefore, the html5.enable pref shouldn't have an effect and the comment talking about the HTML5 parser is incorrect.

Otherwise the test looks reasonable, but I think I'm not qualified to review this, because I don't have the right experience of MochiTest.

Attachment #392541 - Flags: review?(hsivonen)

Jonathan Griffin (:jgriffin)

Comment 16

•

15 years ago

Attached patch mochitest test cases for html5 sniffer (v4) — Details — Splinter Review

I've updated the tests per comment #15, and verified that the html5.enable pref has no effect on these tests.  I will ask bz to review when we're ready to check this test in.

Attachment #392541 - Attachment is obsolete: true

cmtalbert

Comment 17

•

15 years ago

(In reply to comment #12)

> FYI, there's a typo in the HTML 5 spec at
> http://www.w3.org/TR/html5/infrastructure.html#content-type-sniffing:-unknown-type;
> the table shows the UTF-16LE BOM as FF FF 00 00, but it should be FF FE 00 00.

In case this hasn't been reported yet, would you mind letting the happy editors of WHATWG know at: whatwg@lists.whatwg.org?

Adam Barth

Reporter

Comment 18

•

15 years ago

(In reply to comment #17)
> In case this hasn't been reported yet, would you mind letting the happy editors
> of WHATWG know at: whatwg@lists.whatwg.org?

This typo has been fixed already.  The W3C draft at that URL is pretty old at this point.

Patrick McManus [:mcmanus]

Updated

•

8 years ago

Component: Networking → HTML: Parser

Andrew Overholt [:overholt]

Comment 19

•

8 years ago

Jonathan, what should be done here?

Flags: needinfo?(jgriffin)

Whiteboard: dom-triaged

Jonathan Griffin (:jgriffin)

Comment 20

•

8 years ago

Good question, I last looked at this over 5 years ago! Probably we should run this patch on try, and see whether it passes or not, and someone on the DOM team should then review to determine if these tests still make any sense. I'll leave the needinfo for myself to remind me to schedule this on try later.

Jonathan Griffin (:jgriffin)

Comment 21

•

8 years ago

Here's a try run:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=410d9afe64cf&selectedJob=17114426

Tests currently fail with:

1323 INFO TEST-UNEXPECTED-FAIL | parser/htmlparser/tests/mochitest/test_html5_sniffer.html | unexpected content type - got "text/html", expected "text/plain"
1324 INFO TEST-UNEXPECTED-FAIL | parser/htmlparser/tests/mochitest/test_html5_sniffer.html | uncaught exception - ReferenceError: appendChildNodes is not defined at http://mochi.test:8888/tests/parser/htmlparser/tests/mochitest/test_html5_sniffer.html:151

Andrew, do you want to have a developer look at these?

Flags: needinfo?(jgriffin) → needinfo?(overholt)

Andrew Overholt [:overholt]

Comment 22

•

8 years ago

I don't know enough about the parser to know what to do here. I'll needinfo Henri (who's not around for a few weeks).

Flags: needinfo?(overholt) → needinfo?(hsivonen)

Henri Sivonen (:hsivonen)

Comment 23

•

8 years ago

I'm not familiar with our MIME sniffing code. To the point that I don't know if we should be patching our existing code incrementally of if we should be doing a rewrite from spec in Rust.

mcmanus, could you, please, elaborate on why this was move out of the Networking component? MIME sniffing doesn't belong in the HTML parser and, AFAICT, belongs in Networking.

GPHemsley, could you, please, check whether these tests still make sense?

URL: https://mimesniff.spec.whatwg.org/

Component: HTML: Parser → Networking

Flags: needinfo?(mcmanus)

Flags: needinfo?(hsivonen)

Flags: needinfo?(gphemsley)

Summary: Update HTML content sniffing heuristic to match HTML 5 → Update HTML content sniffing heuristic to match the MIME Sniffing Standard

Boris Zbarsky [:bzbarsky]

Comment 24

•

8 years ago

Note that the relevant code lives in netwerk/streamconv/converters/nsUnknownDecoder.cpp and in particular in nsUnknownDecoder::SniffForHTML.  As Henri says, this has absolutely nothing to do with the HTML parser.

Patrick McManus [:mcmanus]

Comment 25

•

8 years ago

I don't think you're going to find a lot of MIME type expertise on the networking team.. I suspect that code is just there beacuse its a convenient place for a stream converter. The patch made changes to the parser directory of the code base - that's why it landed there to try and find some love.. maybe DOM would have more expertise?

If we're not interested in the patch, we should WONTFIX.

Status: NEW → RESOLVED

Closed: 8 years ago

Component: Networking → DOM

Flags: needinfo?(mcmanus)

Resolution: --- → FIXED

Patrick McManus [:mcmanus]

Updated

•

8 years ago

Status: RESOLVED → UNCONFIRMED

Ever confirmed: false

Resolution: FIXED → ---

Boris Zbarsky [:bzbarsky]

Comment 26

•

8 years ago

The parser bits are all tests, because that's where it's convenient to put the tests.  Anyway, what this bug really needs is an owner and in particular someone who is willing to respond to bug reports of regressions from this and push back on the spec as needed...

Anne (:annevk)

Comment 27

•

7 years ago

Gordon is no longer actively editing. Paul has been doing some work recently, mostly for media-related sniffing. I suspect he's busy, but worth asking if he's interested in cleaning up the remainder of the sniffing code too.

It does seem like a shame to throw away these tests.

Flags: needinfo?(gphemsley) → needinfo?(padenot)

Paul Adenot (:padenot)

Assignee

Comment 28

•

7 years ago

I can have a look, but as you mention, I can't commit to a specific timeline.

Assignee: nobody → padenot

Flags: needinfo?(padenot)

Anne (:annevk)

Updated

•

6 years ago

Blocks: mimesniff

Marion Daly [:mdaly]

Updated

•

6 years ago

Priority: -- → P3

Sylvestre Ledru [:Sylvestre]

Updated

•

6 years ago

Status: UNCONFIRMED → ASSIGNED

Ever confirmed: true

Nobody; OK to take it and work on it

Updated

•

5 years ago

Component: DOM → DOM: Core & HTML

BMO Automation

Updated

•

2 years ago

Severity: normal → S3

You need to log in before you can comment on or make changes to this bug.