XMLHttpRequest can't parse HTML from file: protocol correctly, seems parsing as XHTML

RESOLVED INVALID

Status


Core
DOM
4 years ago
3 years ago

People

(Reporter: Duan Yao, Unassigned)

Tracking

31 Branch
x86_64
Linux

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment)

(Reporter)

Description

4 years ago
Created attachment 8454847 [details]
xhr-html.htm

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:33.0) Gecko/20100101 Firefox/33.0 (Beta/Release)
Build ID: 20140703030200

Steps to reproduce:

Load the attached HTML file (xhr-html.htm) with firefox.
The attachment loads itself with XMLHttpRequest, setting responseType = 'document'. The attachment doesn't conform to XHTML syntax.



Actual results:

A parse error is shown in the console; it seems XMLHttpRequest is parsing the file as XHTML:

mismatched tag. Expected: </meta>. xhr-html.htm:17
"<parsererror xmlns="http://www.mozilla.org/newlayout/xml/parsererror.xml">XML Parsing Error: mismatched tag. Expected: &lt;/meta&gt;.
Location: file:///media/DATA/project/MyHTML/xhr-html.htm
Line Number 17, Column 3:<sourcetext>&lt;/head&gt;
--^</sourcetext></parsererror>"


Expected results:

Parse the HTML file correctly.

It seems HTML parsing in XMLHttpRequest was fixed long ago:
https://bugzilla.mozilla.org/show_bug.cgi?id=651072

However, Firefox 28-33 do not get this right.
(Reporter)

Comment 1

4 years ago
Note: HTML files served via http: are parsed correctly, but those via file: are not.
Summary: [regression] XMLHttpRequest can't parse HTML correctly, seems parsing as XHTML → [regression] XMLHttpRequest can't parse HTML from file: protocol correctly, seems parsing as XHTML
(Reporter)

Comment 2

4 years ago
Note: HTML files are also not parsed correctly via chrome: protocol.
Boris Zbarsky [:bz]

Comment 3

From the XMLHttpRequest spec at <https://dvcs.w3.org/hg/xhr/raw-file/tip/Overview.html#response-entity-body-0>:

  The response MIME type is the MIME type the Content-Type header contains excluding any
  parameters and converted to ASCII lowercase, or null if the response header can not be
  parsed or was omitted.

When loading from file:// this will therefore be null, since there is no Content-Type header.

  Final MIME type is the override MIME type unless that is null in which case it is the
  response MIME type. 

which in this case is also null, since the testcase does not call overrideMimeType, and the response MIME type is null.

Then for a response of type "document" the relevant spec section is https://dvcs.w3.org/hg/xhr/raw-file/tip/Overview.html#document-response-entity-body which says:

  If final MIME type is text/html, run these substeps: 

which are not run, since final MIME type is null and then:

  Otherwise, let document be a document that represents the result of parsing the
  response entity body following the rules set forth in the XML specifications. If that
  fails (unsupported character encoding, namespace well-formedness error, etc.), return
  null.

In other words, when loading from file://, XMLHttpRequest will always parse as XML unless overrideMimeType is called.
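
The rule traced above can be sketched in a few lines. This is only an illustration of the passages quoted in this comment: the function names are hypothetical, and the full spec text may contain further steps (for example, returning null for unsupported types) that are not modeled here.

```javascript
// Sketch of the "final MIME type" rule quoted above.
// Both arguments are lowercase MIME type strings, or null when absent.
function finalMimeType(overrideMimeType, responseMimeType) {
  return overrideMimeType !== null ? overrideMimeType : responseMimeType;
}

// For responseType "document": the HTML parser is used only for text/html;
// per the quoted passages, everything else falls through to the XML parser.
function documentParserFor(finalType) {
  return finalType === 'text/html' ? 'html' : 'xml';
}

// file:// load without overrideMimeType: no Content-Type, so both inputs are null.
console.log(documentParserFor(finalMimeType(null, null)));        // xml
// The same load after xhr.overrideMimeType('text/html').
console.log(documentParserFor(finalMimeType('text/html', null))); // html
```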

I don't know why you think this is a regression, exactly; this has always been the behavior.
Status: UNCONFIRMED → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → INVALID
Boris Zbarsky [:bz]

Comment 4

> Note: HTML files are also not parsed correctly via chrome: protocol.

chrome:// doesn't have a Content-Type header, so yes, you get the same behavior.

Really, XHR is designed to work with XML and HTTP.  The HTML support was shoehorned in, but without breaking backwards compat in the process, which means that it only works reasonably well over HTTP.  Otherwise you have to opt into it with overrideMimeType.
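
A minimal sketch of that opt-in, assuming the caller already knows the file is HTML. The helper name and the XhrCtor parameter are illustrative assumptions (the constructor is passed in only so the function can be exercised outside a browser); open, overrideMimeType, responseType, onload and send are the standard XMLHttpRequest members.

```javascript
// Load a resource as a Document by opting into the HTML parser.
// XhrCtor is a parameter of this sketch; in a page, pass XMLHttpRequest.
function loadAsHtmlDocument(url, XhrCtor, onDocument) {
  const xhr = new XhrCtor();
  xhr.open('GET', url);
  // Must be called before send(); it makes the final MIME type text/html.
  xhr.overrideMimeType('text/html');
  xhr.responseType = 'document';
  // Register the handler before send() so the load event cannot be missed.
  xhr.onload = function () {
    onDocument(xhr.response);
  };
  xhr.send();
  return xhr;
}

// In a browser:
// loadAsHtmlDocument('xhr-html.htm', XMLHttpRequest, function (doc) {
//   console.log(doc.documentElement.outerHTML);
// });
```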
(Reporter)

Comment 5

4 years ago
(In reply to Boris Zbarsky [:bz] from comment #3)
> When loading from file:// this will therefore be null, since there is no
> Content-Type header.
I don't think the spec specifies this. In the "1 Introduction" section:

  Second, it can be used to make requests over both HTTP and HTTPS (some implementations support 
  protocols in addition to HTTP and HTTPS, but that functionality is not covered by this specification).

So I think the file: and chrome: protocols are just "not covered", and the exact behavior is left to the implementors.

In section "6 data: URLs and HTTP", the spec says:

  To ensure data: URLs can function in APIs designed around HTTP, such as XMLHttpRequest, this section 
  details how they work. Specifications defining similar URL schemes ought to take inspiration from this section. 

  When a data: URL is fetched using the HTTP method GET, determine the response as follows: 
    ...
    * Include a single response header whose header field name is "content-type" and whose value is the 
    MIME type (including any parameters) given in the data: URL, or the default otherwise. 

So I think the spec allows and encourages XMLHttpRequest implementations to determine the mime type of the response by themselves, if the content-type header is not available in the protocol in question.
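
To make the data: URL case concrete, here is a rough sketch of how such a synthetic content-type value could be derived. The function name is hypothetical, the default is the one from RFC 2397, and edge cases (such as a media type that consists only of parameters) are deliberately not handled.

```javascript
// Default from RFC 2397 when a data: URL omits its media type.
const DATA_URL_DEFAULT_TYPE = 'text/plain;charset=US-ASCII';

// Return the content-type value for a data: URL, or null if the string is
// not a data: URL containing the mandatory comma.
function dataUrlContentType(url) {
  if (!/^data:/i.test(url)) return null;
  const comma = url.indexOf(',');
  if (comma === -1) return null;
  // Everything between "data:" and the comma is the media type plus an
  // optional ";base64" marker, which is not part of the MIME type.
  const meta = url.slice(5, comma).replace(/;base64$/i, '');
  return meta === '' ? DATA_URL_DEFAULT_TYPE : meta;
}

console.log(dataUrlContentType('data:text/html;charset=utf-8,<p>hi</p>')); // text/html;charset=utf-8
console.log(dataUrlContentType('data:,hello'));                            // text/plain;charset=US-ASCII
```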

For file: protocol, it is a trivial task to determine mime type by a file's extension name -- <input type=file> element already does that, right? Chrome browser also does exactly this.

> 
> I don't know why you think this is a regression, exactly; this has always
> been the behavior.

That was my mistake; I was under the illusion that Firefox used to behave like Chrome.
(Reporter)

Comment 6

4 years ago
(In reply to Boris Zbarsky [:bz] from comment #4)
> > Note: HTML files are also not parsed correctly via chrome: protocol.
> 
> chrome:// doesn't have a Content-Type header, so yes, you get the same
> behavior.
> 
> Really, XHR is designed to work with XML and HTTP.  The HTML support was
> shoehorned in, but without breaking backwards compat in the process, which
> means that it only works reasonably well over HTTP.  Otherwise you have to
> opt into it with overrideMimeType.

How can custom code know what to pass to overrideMimeType? The most common guess is still based on the file extension. Browsers can do this too, so why not let the browser do it and keep the custom code simple?

XHR was designed to work with XML and HTTP, but now it is much more versatile. Browser-based applications are not limited to HTTP, not to mention XULRunner.
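
For illustration, this is roughly the kind of extension-based guess that custom code ends up writing today. The table and the function name are hypothetical; a real application would list only the extensions it actually ships.

```javascript
// Hypothetical lookup table from file extension to MIME type.
const MIME_BY_EXTENSION = {
  htm: 'text/html',
  html: 'text/html',
  xhtml: 'application/xhtml+xml',
  xml: 'text/xml',
  svg: 'image/svg+xml',
};

// Guess a MIME type from the last extension in a URL or path, or return null.
function guessMimeFromExtension(url) {
  const path = url.split(/[?#]/)[0];                 // drop query and fragment
  const name = path.slice(path.lastIndexOf('/') + 1); // last path segment
  const dot = name.lastIndexOf('.');
  if (dot === -1) return null;
  const ext = name.slice(dot + 1).toLowerCase();
  return Object.prototype.hasOwnProperty.call(MIME_BY_EXTENSION, ext)
    ? MIME_BY_EXTENSION[ext]
    : null;
}

// Typical use before xhr.send():
// const mime = guessMimeFromExtension(url);
// if (mime !== null) xhr.overrideMimeType(mime);
```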
(Reporter)

Comment 7

4 years ago
The following code demonstrates that browsers (including Firefox) can already determine a local file's MIME type. So it should be possible for Firefox to incorporate this functionality into XHR.

<label>Select file <input type="file"></label>
<script>
document.querySelector('input[type=file]').onchange = function(ev) {
    var file = ev.target.files[0];
    console.log(file); // contains the file's MIME type in the 'type' field
    var url = URL.createObjectURL(file);
    var xhr = new XMLHttpRequest();
    xhr.open('GET', url);
    xhr.responseType = 'document';
    // Register the handler before send() so the load event cannot be missed.
    xhr.onload = function() {
        console.log(this.response.documentElement.outerHTML);
        URL.revokeObjectURL(url); // release the blob: URL once we are done
    };
    xhr.send(); // works for HTML
};
</script>
Boris Zbarsky [:bz]

Comment 8

> For file: protocol, it is a trivial task to determine mime type by a file's extension

That would break backwards compat in a number of cases for XHR, actually, for extensions that end up being neither HTML nor XML.

Seriously, I believe what we implement is correct per the XHR spec.  If you think that spec should change, please raise spec issues as needed.
Summary: [regression] XMLHttpRequest can't parse HTML from file: protocol correctly, seems parsing as XHTML → XMLHttpRequest can't parse HTML from file: protocol correctly, seems parsing as XHTML
(Reporter)

Comment 9

4 years ago
(In reply to Boris Zbarsky [:bz] from comment #8)
> > For file: protocol, it is a trivial task to determine mime type by a file's extension
> 
> That would break backwards compat in a number of cases for XHR, actually,
> for extensions that end up being neither HTML nor XML.

Can you give some examples? I don't know why users should expect the browser to treat a non-popular file extension as XML via file protocol. Even if some of them do, they already break the cross-browser compatibility because chrome etc. don't do that. 

> 
> Seriously, I believe what we implement is correct per the XHR spec.  If you
> think that spec should change, please raise spec issues as needed.

I think the current implementation is "correct" only because the current spec doesn't specify the details of the file: protocol; that doesn't mean it is the only correct way.

I think the most relevant spec section for this issue is "Fetch - 4.1 Basic fetch" (http://fetch.spec.whatwg.org/#basic-fetch). In this section, a method for making a fake Content-Type header is specified for every scheme other than http(s), i.e. about, blob, and data; the file and ftp schemes, however, are the exception:

    "file"
    "ftp"

        For now, unfortunate as it is, file and ftp URLs are left as an exercise for the reader.

So I believe the method for making a fake Content-Type header for the file and ftp schemes is simply not covered by this spec for now; that doesn't mean UAs must or should treat Content-Type as null. I think UAs can and should make a reasonable guess at the Content-Type for these schemes, as Chrome does.
Boris Zbarsky [:bz]

Comment 10

> I don't know why users should expect the browser to treat a non-popular file extension as
> XML via file protocol.

Because that's what browsers do.

> Even if some of them do, they already break the cross-browser compatibility because
> chrome etc. don't do that.

Chrome doesn't let you use XHR to file:// URIs at all, unless you're loading the file your web page is in.  So pretty much anything involving local files and XHR is broken in Chrome.

But other browsers (just tested Safari and Firefox) will let you do XHR to a .txt file and will parse it as XML, for example.

But more importantly, there is absolutely no way we want to change behavior and then change it _again_ if/when the spec decides to define this.  That's just not an OK thing to do.  So unless the spec defines a behavior here, I don't think we should be changing it from what people are already expecting.
(Reporter)

Comment 11

4 years ago
(In reply to Boris Zbarsky [:bz] from comment #10)
> > I don't know why users should expect the browser to treat a non-popular file extension as
> > XML via file protocol.
> 
> Because that's what browsers do.
> 
> > Even if some of them do, they already break the cross-browser compatibility because
> > chrome etc. don't do that.
> 
> Chrome doesn't let you use XHR to file:// URIs at all, unless you're loading
> the file your web page is in.  So pretty much anything involving local files
> and XHR is broken in Chrome.

Not exactly. 
Firstly, Chrome has command-line options that relax the restrictions on local-file XHR, such as --disable-web-security or --allow-file-access-from-files (http://stackoverflow.com/questions/3102819/disable-same-origin-policy-in-chrome); these are convenient for debugging web apps locally. 

Secondly, there are Chromium-based embeddable runtimes (more or less like XULRunner), such as CEF (http://code.google.com/p/chromiumembedded) and node-webkit (https://github.com/rogerwang/node-webkit), which also allow local XHR.

Thirdly, Android WebView allows local XHR via the file:///android_asset/ pattern or plain file: URLs, and also treats .htm[l] files as text/html. I note that GeckoView wants to simulate the file:///android_asset/ pattern (https://bugzilla.mozilla.org/show_bug.cgi?id=948465), so what should be done about XHR there?

I'll check IE later.

> 
> But other browsers (just tested Safari and Firefox) will let you do XHR to a
> .txt file and will parse it as XML, for example.
Don't you think this usage pattern is crazy? To me, it indicates a big flaw in a web app. If Firefox and Safari shouted at me for misusing a .txt file as XML, I would appreciate it very much.

> 
> But more importantly, there is absolutely no way we want to change behavior
> and then change it _again_ if/when the spec decides to define this.  That's
> just not an OK thing to do.  So unless the spec defines a behavior here, I
> don't think we should be changing it from what people are already expecting.
Sure. But Mozilla is a major participant in the W3C and WHATWG, and specs usually take the major browsers' implementations into account. So as a web app developer, I'd like to know your position on this cross-browser incompatibility. Which behavior do you think should become the spec? 

To me, "what people are already expecting" from Firefox is actually a flaw. Fixing it will benefit developers in the long term.
Boris Zbarsky [:bz]

Comment 12

> Which behavior do you think should become the spec? 

Whatever behavior we can actually get browsers to agree on.  I'm certainly not wedded to our current behavior; I just don't want to change it willy nilly.
(Reporter)

Comment 13

4 years ago
(In reply to Boris Zbarsky [:bz] from comment #12)
> > Which behavior do you think should become the spec? 
> 
> Whatever behavior we can actually get browsers to agree on.  I'm certainly
> not wedded to our current behavior; I just don't want to change it willy
> nilly.

Alright, I just hope this happens soon.
Boris Zbarsky [:bz]

Comment 14

It'll happen when it becomes a priority for someone.  If you care about the issue, I strongly suggest you send mail with a spec proposal to the public standards mailing list for XHR.  That's the best way of getting anything to happen here.
(Reporter)

Comment 15

4 years ago
(In reply to Boris Zbarsky [:bz] from comment #14)
> It'll happen when it becomes a priority for someone.  If you care about the
> issue, I strongly suggest you send mail with a spec proposal to the public
> standards mailing list for XHR.  That's the best way of getting anything to
> happen here.

Thank you for your suggestion, I'll consider it. But it seems the recent XHR spec (http://xhr.spec.whatwg.org/) doesn't concern itself with specific protocols; maybe this issue is supposed to be addressed by the Fetch spec (http://fetch.spec.whatwg.org)?
Boris Zbarsky [:bz]

Comment 16

I don't know; I haven't been following the fetch work that closely.
(Reporter)

Comment 17

4 years ago
I think I have come up with another reason why this issue should be fixed even before the specs are improved: static resource fetching via file: in Firefox already provides a reasonable Content-Type header, so XHR fetching is inconsistent with it.

For example, I embed an HTML file (inner.html) into an outer HTML file via <object>, and load them via the file: protocol in Firefox:
  
  <object data="inner.html" type="application/xhtml+xml" typemustmatch="" >fallback!</object>

Then inner.html is not loaded, and "fallback!" is shown instead. Evidently Firefox has detected that inner.html is text/html, not the application/xhtml+xml indicated by the type attribute, and so refuses to load it. If I remove the typemustmatch attribute, inner.html is loaded, and we can confirm that it is parsed as HTML by checking its tagName. If I rename inner.html to inner.xhtml, it is also loaded.

According to HTML spec (http://www.whatwg.org/specs/web-apps/current-work/multipage/the-iframe-element.html#attr-object-typemustmatch):

   The typemustmatch attribute is a boolean attribute whose presence indicates that the resource
   specified by the data attribute is only to be used if the value of the type attribute and the
   Content-Type of the aforementioned resource match.

So I conclude that Firefox sets a reasonable Content-Type header for the file: protocol, at least for static resources. 

So the question is: why not do it for XHR too? I understand this is for the sake of legacy code, but this inconsistency should be fixed sooner or later, right?
Boris Zbarsky [:bz]

Comment 18

Firefox can certainly guess a file type for file:// URLs and does so based on a combination of extensions and content sniffing.

What you're ignoring is that there is a real compat issue here.  You think that .txt files should not be parsed as XML.  But what about .plist files?  Those are XML but get detected as text/plain.  There are lots of other file formats that have all sorts of random MIME types but contain XML.

What should happen for those file types?  Do they get treated as XML?  How do we decide?  Do we just treat anything that's not HTML as XML?  Something else?

_That_ is why this needs a spec if we only want to change behavior once.  Because I very much doubt that something we pick here without talking to other UAs would end up being the final specified behavior.
(Reporter)

Comment 19

4 years ago
I think your concern is resolvable: XHR fetching just does what static fetching or <input type=file> does.

I have tried

   <object data="inner.plist" type="text/xml" typemustmatch="" >fallback</object>

and it loads, but with type="text/plain" it does not, and neither does an empty type (which is invalid HTML anyway).

I have also checked inner.plist's MIME type via <input> as shown in Comment 7; it is "" (empty), and the file is parsed by XHR as XML.

So the answer is clear: .plist is recognized as XML by firefox, not plain text.

For those extensions that are really "random", I have some suggestions:

1. Consult the OS. I suspect some browsers already do this.

2. Treat them as XML. This should keep backward compatibility. However, we should warn developers that this is unstable, because new extensions may be registered over time, and they really should set overrideMimeType() if they use home-made extensions.

3. Treat them as application/octet-stream. This is my favorite, because the browser will warn me if I don't set overrideMimeType() for home-made extensions. However, this may break some legacy code; maybe we can hide this behind a preference.
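
The three suggestions above could be combined roughly as follows. This is only a sketch: resolveLocalMime, the osLookup callback (standing in for "consult the OS") and the two sample systems are all hypothetical, and nothing here reflects what any browser actually does.

```javascript
// Resolve a MIME type for a local file extension.
// osLookup: hypothetical callback for suggestion 1; returns a MIME string or null.
// fallback: 'xml' (suggestion 2) or 'octet-stream' (suggestion 3).
function resolveLocalMime(extension, osLookup, fallback) {
  const fromOs = osLookup(extension.toLowerCase());
  if (fromOs) return fromOs;
  return fallback === 'xml' ? 'text/xml' : 'application/octet-stream';
}

// Two made-up systems whose extension mappings differ.
const systemA = (ext) => new Map([['html', 'text/html']]).get(ext) || null;
const systemB = (ext) =>
  new Map([['html', 'text/html'], ['plist', 'text/plain']]).get(ext) || null;

console.log(resolveLocalMime('plist', systemA, 'xml'));          // text/xml
console.log(resolveLocalMime('plist', systemA, 'octet-stream')); // application/octet-stream
console.log(resolveLocalMime('plist', systemB, 'xml'));          // text/plain
```

Note that the same file resolves differently depending on which system's mapping is consulted, whatever fallback is chosen.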
(Reporter)

Comment 20

4 years ago
(In reply to Boris Zbarsky [:bz] from comment #18)

> _That_ is why this needs a spec if we only want to change behavior once. 
> Because I very much doubt that something we pick here without talking to
> other UAs would end up being the final specified behavior.

I think the spec IS there -- MIME. 
Web developers should use well-known extensions and MIME types as-is; if they don't, or if they use private extensions, they should call overrideMimeType(), or they are asking for trouble.

Let me make myself clearer: I think that until all browsers converge on the same XHR behavior, Firefox can and should make its static fetching and XHR behave consistently with each other.
Boris Zbarsky [:bz]

Comment 21

> So the answer is clear: .plist is recognized as XML by firefox, not plain text.

It's recognized as text/plain on my (Mac) system.  Welcome to the world of MIME mappings for file://.

> 1. Consult the OS. I suspect some browsers already do this.

We do this already.

> and they really should set overrideMimeType() if they use home-made extensions.

I don't think .plist is a particularly "home-made" extension.  It's a preexisting file format that someone might want to read via XHR.

> I think the spec IS there -- MIME. 

There is no sane spec for getting MIME types out of files, sadly.
(Reporter)

Comment 22

4 years ago
(In reply to Boris Zbarsky [:bz] from comment #21)
> > So the answer is clear: .plist is recognized as XML by firefox, not plain text.
> 
> It's recognized as text/plain on my (Mac) system.  Welcome to the world of
> MIME mappings for file://.
> 
> > 1. Consult the OS. I suspect some browsers already do this.
> 
> We do this already.
> 
> > and they really should set overrideMimeType() if they use home-made extensions.
> 
> I don't think .plist is a particularly "home-made" extension.  It's a
> preexisting file format that someone might want to read via XHR.
Well, it seems .plist can actually take one of 3 formats (plain text, XML, and binary) for historical reasons. I think it's a poor decision to reuse an extension for completely different formats, and this is where the mess comes from: it is very hard or impossible to determine a .plist's actual format, especially on non-Apple platforms. 

Also, I think .plist IS a home-made extension because it has no standard MIME type registered with IANA and is not used on non-Apple platforms. Someone suggests application/x-plist, which is non-standard (http://stackoverflow.com/questions/3603851/what-is-the-http-content-type-for-binary-plist).

> 
> > I think the spec IS there -- MIME. 
> 
> There is no sane spec for getting MIME types out of files, sadly.
I think there never will be one. But in practice, many standard MIME types have recommended extensions, for example PDF: http://www.rfc-editor.org/rfc/rfc3778.txt.

Think about web servers: how do they determine a file's Content-Type before sending it? The situation is similar. A Content-Type sent via HTTP is by no means more reliable than a guess at a local file's MIME type.
(Reporter)

Comment 23

4 years ago
I found another 2 relevant specs:

In "HTML - 2.6 Fetching resources - 2.6.4 Determining the type of a resource" (http://www.whatwg.org/specs/web-apps/current-work/multipage/fetching-resources.html#content-type-sniffing):

    The Content-Type metadata of a resource must be obtained and interpreted in a manner consistent 
    with the requirements of the MIME Sniffing specification. 
    The sniffed type of a resource must be found in a manner consistent with the requirements given in
    the MIME Sniffing specification for finding the sniffed media type of the relevant sequence of
    octets. [MIMESNIFF]

I interpret this as: both static fetching and XHR fetching must apply the MIME Sniffing algorithm to determine the actual media type.

And in MIME Sniffing specification (http://mimesniff.spec.whatwg.org):

   5.1 Interpreting the resource metadata
   ...
   If the resource is retrieved directly from the file system, set supplied-type to the MIME type
   provided by the file system. 

I don't know what this really means: almost no mainstream file system records a MIME type for files. Does it actually mean "provided by the operating system" or "provided by the file extension"?

Nevertheless, the spec does require UAs to obtain MIME type for local files from somewhere, not just treat MIME type as null.

Even if the supplied-type is undefined, the actual type should be sniffed (7.1 Identifying a resource with an unknown MIME type). This largely means searching for "<!DOCTYPE HTML", "<HTML", etc. for text/html and "<?xml" for text/xml.
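
As a very simplified illustration of that kind of sniffing (not the algorithm from the MIME Sniffing specification, which works on bytes and defines many more patterns), a sketch with a hypothetical function name:

```javascript
// Simplified illustration only: look at the start of the text for the
// markers mentioned above and return a MIME type, or null if none match.
function sniffMarkupType(text) {
  // Ignore a leading byte order mark and whitespace, then inspect a short prefix.
  const head = text.replace(/^\uFEFF/, '').trimStart().slice(0, 64);
  if (/^<!DOCTYPE HTML/i.test(head) || /^<HTML/i.test(head)) return 'text/html';
  if (head.startsWith('<?xml')) return 'text/xml';
  return null;
}

console.log(sniffMarkupType('<!doctype html><html><head></head></html>')); // text/html
console.log(sniffMarkupType('  <?xml version="1.0"?><plist/>'));           // text/xml
console.log(sniffMarkupType('just some text'));                            // null
```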

So my conclusion is: according to the above specs, XHR shouldn't misinterpret a local .htm file as XML. Even if the OS or filename can't provide useful type info, sniffing should probably give the right content type. The same goes for .plist files and others.

P.S. The current XHR spec (http://xhr.spec.whatwg.org/) references the Fetch standard (http://fetch.spec.whatwg.org), which is obviously incomplete and doesn't mention sniffing. So I think we should consult the fetching section of the HTML spec (http://www.whatwg.org/specs/web-apps/current-work/multipage/fetching-resources.html) for now.
(Reporter)

Comment 24

4 years ago
Well, one thing is still unclear: the MIME Sniffing specification defines "8 Context-specific sniffing", but no XHR-related clause can be found there.

In the WHATWG wiki (http://wiki.whatwg.org/wiki/Contexts), I can see that XHR triggers the "connection" context, and the "Sniffing Algorithm" for the "connection" context is left blank. 

What does that mean? XHR doesn't sniff at all, and trusts supplied-type?
Boris Zbarsky [:bz]

Comment 25

> Nevertheless, the spec does require UAs to obtain MIME type for local files from
> somewhere, not just treat MIME type as null.

That's not necessarily backwards compatible in the case of XHR, as I pointed out.

> the actual type should be sniffed 

You missed http://mimesniff.spec.whatwg.org/#determining-the-sniffed-mime-type-of-a-resource step 2.

> XHR doesn't sniff at all, and trusts supplied-type?

Yes.
(Reporter)

Comment 26

4 years ago
(In reply to Boris Zbarsky [:bz] from comment #25)
> > Nevertheless, the spec does require UAs to obtain MIME type for local files from
> > somewhere, not just treat MIME type as null.
> 
> That's not necessarily backwards compatible in the case of XHR, as I pointed
> out.
If this kind of backward compatibility is kept, consistency between static resource fetching and XHR is degraded, and cross-browser compatibility is also lost. I think that is a much higher price to pay. Do you think the fetching section of HTML spec should be changed?
> 
> > the actual type should be sniffed 
> 
> You missed
> http://mimesniff.spec.whatwg.org/#determining-the-sniffed-mime-type-of-a-resource
> step 2.
> 
> > XHR doesn't sniff at all, and trusts supplied-type?
> 
> Yes.
Step 2 is about the "no-sniff flag", but I don't see a sentence stating that "in case of XHR, no-sniff flag should be set"; how can you be sure about this?

Also, I don't understand why XHR doesn't sniff. If it did, it could probably handle the .plist format problem gracefully.
Boris Zbarsky [:bz]

Comment 27

> Do you think the fetching section of HTML spec should be changed?

I think this needs to be handled on the spec level one way or another.  I said that already in comment 10.

> but I don't see a sentence stating that "in case of XHR, no-sniff flag should be set",

From http://mimesniff.spec.whatwg.org/#no-sniff-flag :

  A no-sniff flag, which defaults to set if the user agent does not wish to perform
  sniffing on the resource and unset otherwise. 

Basically, UAs can decide whether to sniff or not however they want to, as far as I can tell.

Seriously, if you want to figure out how these various specs interact, the right way to do that is the standards mailing lists, where Ian and Anne would see your questions, not this bug.
(Reporter)

Comment 28

4 years ago
(In reply to Boris Zbarsky [:bz] from comment #27)
OK, I'll try the mailing lists. Thanks!