Closed Bug 1425479 Opened 7 years ago Closed 7 years ago

Expose (sniffed) content type (MIME type/charset) of responses through webRequest/StreamFilter API

Categories

(WebExtensions :: Request Handling, enhancement, P5)

56 Branch
enhancement

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: robwu, Unassigned)

References

Details

(Whiteboard: [design-decision-denied])

(forked off bug 1287264) The webRequest.onHeadersReceived event offers access to the response headers, including Content-Type. This can be used to determine how the browser should process the response, except when the content sniffer kicks in. It is possible to try and implement a content sniffer in JavaScript ( https://github.com/Rob--W/open-in-browser/issues/5 ), but it's complicated and difficult to maintain. I have two concrete use cases for desiring access to the sniffed MIME type: - The "Open in Browser" extension allows the user to change the content type of a response via a dialog (e.g. Open in Browser as plain text). - The "PDF viewer" (PDF.js) extension intercepts PDF responses and shows a HTML page instead. Both use cases require read/write access to the response's content type. Knowing the exact content type is mainly useful in conjunction with the response body, so I suggest to tie this feature to the StreamFilter API (webRequest.filterReponseData): - To the filter.ondata event, add the mimeType and charset attributes to signal the MIME type and charset at the time of the event. Note that this is provisional and can be changed by content sniffing (or another extension) until the data has been written. - To the filter.write method, add a way to optionally specify the MIME type and charset. If the channel's type is already locked (e.g. because the extension or another extension) has already written data, the write request should be rejected. Optionally: - To the filter.onstart event, add the mimeType and charset attributes. An alternative to the onstart/ondata event attributes is to add MIME/charset as a getter to the StreamFilter.
As I said in bug 1424543, extensions may want to reencode the request contents into UTF-8, for example because the JSON Viewer only recognizes that encoding but some servers use other ones. But if multiple add-ons attempt to do so, they will just mangle the contents. For example, a latin1-encoded 0xE0 (à) will first become 0xC3 0xA0 (à) and then 0xC3 0x83 0xC2 0xA0 (Ã ). To avoid this problem, the `onstart` event listener of the subsequent filters should not run until the current one has set the encoding of the output. This can automatically happen when it writes something for the first time (because after that it shouldn't be possible to alter the encoding). But in bug 1424543 I proposed an `open` method that would allow to run subsequent `onstart` listeners earlier, without having to wait for a write in case the current filter wants to accumulate some data first. > - To the filter.ondata event, add the mimeType and charset attributes to > signal the MIME type and charset at the time of the event. Note that this is > provisional and can be changed by content sniffing (or another extension) > until the data has been written. This should not be provisional, that would be an awful API. Once an extension is told that the data will be in some encoding and content type, it should be in that encoding and content type until the end. What could happen is that another extension or sniffer may then do a second conversion, e.g. 1. A webextension converts application/json in latin1 into application/json in utf-8. 2. The JSON Viewer converts the result into text/html in utf-8. These should be independent, the webextension should not be aware that the JSON Viewer is converting its output, and if the JSON Viewer checked `request.contentCharset` in `onStartRequest`, it should see `utf-8` instead of `latin1`. > - To the filter.write method, add a way to optionally specify the MIME type > and charset. If the channel's type is already locked (e.g. because the > extension or another extension) has already written data, the write request > should be rejected. Since the encoding and content type can't be changed after writing something, I don't think that the `write` method is the right place to set them. I think it would be much better to add an `open` method which would be called only once and would lock the encoding and content type, and call the `onstart` listener of the subsequent extension. > - To the filter.onstart event, add the mimeType and charset attributes. Yes, the `onstart` listener is a good place to read them. But note that an extension may want to set them after receiving some data (e.g. to check if there is a BOM). > An alternative to the onstart/ondata event attributes is to add MIME/charset > as a getter to the StreamFilter. Yes, but reading them before `onstart` listener would be unreliable and should probably not be allowed, and setting them after using `open`/`write`/`close`/`suspend` should fail.
(In reply to Oriol Brufau [:Oriol] from comment #2) > But if multiple add-ons attempt to do so, they will just mangle the > contents. For example, a latin1-encoded 0xE0 (à) will first become 0xC3 0xA0 > (à) and then 0xC3 0x83 0xC2 0xA0 (Ã ). Extensions have no control over the order that the filters are applied. Neither of our API proposals solve the problem when extensions try to modify the response body in a conflicting way. However, that problem is separate from our feature requests (the most that our feature requests can achieve is to provide enough information to allow extensions to detect whether their response body modification would break the response body, e.g. by providing accurate type/charset information). > To avoid this problem, the `onstart` event listener of the subsequent > filters should not run until the current one has set the encoding of the > output. "onstart" signals the start of a request, with no information whatsoever about the content (except for what is known from the response headers). Delaying onstart adds some complexity to the implementation (and some latency to propagation of "onstart"). What would be the value of delaying "onstart"? The encoding/type can depend on the response body as I explain below. > > - To the filter.ondata event, add the mimeType and charset attributes to > > signal the MIME type and charset at the time of the event. Note that this is > > provisional and can be changed by content sniffing (or another extension) > > until the data has been written. > > This should not be provisional, that would be an awful API. Whether it is provisional or not is not a desire from the API, but a consequence of reality. Because of content sniffing, it is possible for the following MIME types to become a completely different type based on the response data (the response data which can be modified by a previous extension/filter): - (no MIME) (internally referred to as application/x-unknown-content-type) - application/octet-stream - text/html - text/plain - (anything containing "xml") For all other types, the type (derived from the Content-Type response header) is not provisional, but fixed. > Once an extension is told that the data will be in some encoding and content type, > it should be in that encoding and content type until the end. Agreed. However, in some cases (see above) the encoding and MIME type can only be fixed once part of the response body is known, due to content sniffing. > What could > happen is that another extension or sniffer may then do a second conversion, > e.g. > 1. A webextension converts application/json in latin1 into application/json > in utf-8. > 2. The JSON Viewer converts the result into text/html in utf-8. > These should be independent, the webextension should not be aware that the > JSON Viewer is converting its output The proposed API does not force any dependency (filters have no knowledge of previous/later filters). This would happen: network [latin1] -> [latin1] StreamFilter (extension) [utf-8] -> [utf-8] JSONView) (and in the case of JSONViewer, the value of the channel encoding does not matter since it is forced to UTF-8: https://searchfox.org/mozilla-central/rev/110706c3c09d457dc70293b213d7bccb4f6f5643/devtools/client/jsonview/converter-child.js#86 ) > > - To the filter.write method, add a way to optionally specify the MIME type > > and charset. If the channel's type is already locked (e.g. because the > > extension or another extension) has already written data, the write request > > should be rejected. > > Since the encoding and content type can't be changed after writing > something, I don't think that the `write` method is the right place to set > them. I don't necessarily view the API as a way to change the type, but as a way to assert that the body is of a certain type. > I think it would be much better to add an `open` method which would be > called only once and would lock the encoding and content type, A separate "open" method sounds good to me. > and call the `onstart` listener of the subsequent extension. I have previously expressed my concern of delaying "onstart". > > - To the filter.onstart event, add the mimeType and charset attributes. > > Yes, the `onstart` listener is a good place to read them. But note that an > extension may want to set them after receiving some data (e.g. to check if > there is a BOM). I would allow the type to be modified (once?) as long as the body has not been written yet.
(In reply to Rob Wu [:robwu] from comment #3) > Extensions have no control over the order that the filters are applied. Yes, but if the extensions have been coded correctly, it wouldn't matter which one ran first. The result could be completely different, but it wouldn't be incorrect. For example, if you have two extensions: - Extension A converts text/plain in any charset into text/plain in UTF-8 - Extension B converts text/plain in any charset into text/html in UTF-8 Then, - If A runs first, it will reencode into UTF-8 preserving text/plain, and then B will convert to text/html without reencoding because the data it receives is already in UTF-8. - If B runs first, it will reencode into UTF-8 and convert to text/html. Then A won't see a text/plain and will do nothing. > "onstart" signals the start of a request, with no information whatsoever > about the content (except for what is known from the response headers). > Delaying onstart adds some complexity to the implementation (and some > latency to propagation of "onstart"). > What would be the value of delaying "onstart"? The encoding/type can depend > on the response body as I explain below. When you use a nsIStreamConverter you can delay the onStartRequest call. Why should it be different for webextensions? The value would be that when the onstart event fires, the extension would know that preceding webextensions would no longer be able to randomly change the content type or encoding. Otherwise the code would need to be moved to the `ondata` listener, storing a boolean somewhere to know if it's the first call. This seems ugly an unnecessary in a great amount of cases. And of course lots of people won't know that reading the content type or encoding in `ondata` will be unreliable, so their webextension will be broken and will annoy users. > Whether it is provisional or not is not a desire from the API, but a > consequence of reality. > Because of content sniffing, it is possible for the following MIME types to > become a completely different type based on the response data (the response > data which can be modified by a previous extension/filter): > - (no MIME) (internally referred to as application/x-unknown-content-type) > - application/octet-stream > - text/html > - text/plain > - (anything containing "xml") Interesting, I have never seen such a thing (except for application/x-unknown-content-type). Anyways, if the final content type will depend on the content that the webextension writes, then I don't see the point for the webextension to depend on the final content type in order to know which kind of content to write. It's a circularity. So I would just pass that content type, or "application/x-unknown-content-type" if there is none. If later it changes, well, it's its problem, the webextension could have written some other content that would have not triggered the change. It won't be an unexpected change so I don't think it's problematic. The problem is if a preceding webextension changes it unexpectedly after `onstart`, this is what I would want to prevent. Otherwise I think that `onstart` is useless. > (and in the case of JSONViewer, the value of the channel encoding does not > matter since it is forced to UTF-8 Yes, that's what I'm trying to fix with a webextension. It was just an hypothetical example. > I don't necessarily view the API as a way to change the type, > but as a way to assert that the body is of a certain type. Then it's the opposite for me :) Not sure how you can assert this, though, given that something that runs later (like the JSON Viewer) can change it. > I would allow the type to be modified (once?) as long as the body has not > been written yet. Yep, of course.
(In reply to Oriol Brufau [:Oriol] from comment #4) > > "onstart" signals the start of a request, with no information whatsoever > > about the content (except for what is known from the response headers). > > Delaying onstart adds some complexity to the implementation (and some > > latency to propagation of "onstart"). > > What would be the value of delaying "onstart"? The encoding/type can depend > > on the response body as I explain below. > > When you use a nsIStreamConverter you can delay the onStartRequest call. Why > should it be different for webextensions? Because there haven't been any use case so far? > The value would be that when the onstart event fires, the extension would > know that preceding webextensions would no longer be able to randomly change > the content type or encoding. The expected primary cause for a changed content type is not an extension, but content sniffing. > Otherwise the code would need to be moved to > the `ondata` listener, storing a boolean somewhere to know if it's the first > call. This seems ugly an unnecessary in a great amount of cases. And of > course lots of people won't know that reading the content type or encoding > in `ondata` will be unreliable, so their webextension will be broken and > will annoy users. Proper documentation should help with the educational part. > Anyways, if the final content type will depend on the content that the > webextension writes, then I don't see the point for the webextension to > depend on the final content type in order to know which kind of content to > write. It's a circularity. Extensions that want to handle a certain MIME type (e.g. application/pdf) need the type (from Content-Type, or the sniffed type) to know whether to rewrite a response (to HTML). The Content-Type header from webRequest.onHeadersReceived is not fully reliable because it can be changed by content sniffing or by other extensions. > So I would just pass that content type, or > "application/x-unknown-content-type" if there is none. If later it changes, > well, it's its problem, the webextension could have written some other > content that would have not triggered the change. There is no need for extensions to set a temporary dummy type if they don't know the type yet. If the extension intends to change the content type, it can buffer the data until the type has been fixed. > It won't be an unexpected change so I don't think it's problematic. The > problem is if a preceding webextension changes it unexpectedly after > `onstart`, this is what I would want to prevent. Otherwise I think that > `onstart` is useless. onstart signals that data is about to become available and that write()/disconnect()/close() can be used. No more, no less.
(In reply to Rob Wu [:robwu] from comment #5) > Because there haven't been any use case so far? Well, that's because currently there is no way to read the content type or encoding. But this is a needed feature, and providing reliable data in `onstart` seems a valid use-case. > Proper documentation should help with the educational part. And providing reliable data everywhere would help even more :) > The Content-Type header from webRequest.onHeadersReceived is not fully reliable > because it can be changed by content sniffing or by other extensions. Then let Firefox accumulate enough data to sniff the content type before calling the `onstart` method. This would be just like nsIContentSniffer.getMIMETypeFromContent, which is used before nsIStreamConverter.onStartRequest. And for the case with multiple extensions, that's why I ask `onstart` to be delayed until the previous extension uses `open`/`write`/`close`/`suspend`.
Severity: normal → enhancement
Priority: -- → P5
Whiteboard: [design-decision-needed]
Hi Rob, this has been added to the agenda for the March 6, 2018 WebExtensions APIs triage. Would you be able to join us? Here’s a quick overview of what to expect at the triage: * We normally spend 5 minutes per bug * The more information in the bug, the better * The goal of the triage is to give a general thumbs up or thumbs down on a proposal; we won't be going deep into implementation details Relevant Links: * Wiki for the meeting: https://wiki.mozilla.org/WebExtensions/Triage#Next_Meeting * Meeting agenda: https://docs.google.com/document/d/1n-Cbk3d6j394mObWPMhQsfIyHIL08RhC_IEV2QBV1x8/edit# * Vision doc for WebExtensions: https://wiki.mozilla.org/WebExtensions/Vision
This design decision has been denied. Shane (:mixedpuppy) noted that this proposal appeared to request something similar to what a register content handler would have done if it had been implemented. Rob, you might want to follow up with Shane if this is something you're interested in pursuing.
Whiteboard: [design-decision-needed] → [design-decision-denied]
Conversation in IRC: Oriol: Why denied? Oriol: It says "register content handler was never fully implemented". Then why not do it? mixedpuppy: the use case never bubbled up as something important. navigator.registerContentHandler was never fully implemented due to some security issue (or impression of one), that was well before my time. Oriol: and couldn't @mozilla.org/streamconv;1 be used instead? mixedpuppy: though I do think that the use case has merit...epub readers, rss readers (like feedly), etc. mixedpuppy: streamconv might be usable for that. it’s been over a year since I looked into all this, so it is not fresh in my mind. mixedpuppy: more importantly though, I think the use case needs to be clear (it was a bit muddled in that bug) and mconca would need to be on board with that functionaliyt. after that, picking the implementation path that would be approved would be fine mixedpuppy: I also think understanding why registerContentHandler was never completed (and I thought actually recently removed?) would be good
Closing all open bugs with the [design-decision-denied] whiteboard flag.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX
Product: Toolkit → WebExtensions
You need to log in before you can comment on or make changes to this bug.