Closed Bug 1388594 Opened 7 years ago Closed 8 months ago

Should we strip whitespace for non-text and non-base64 data: URI?

Categories

(Core :: Networking, defect, P3)

RESOLVED FIXED
Tracking Status
firefox57 --- affected

People

(Reporter: allstars.chh, Unassigned)

References

(Blocks 1 open bug)

Details

(Whiteboard: [necko-triaged])

For data: URIs, we strip whitespace unless the content type is text.
http://searchfox.org/mozilla-central/rev/0f16d437cce97733c6678d29982a6bcad49f817b/netwerk/protocol/data/nsDataHandler.cpp#92

This was done in bug 391951 and bug 390126.

However, I've already found some failures in wpt where the tests expect whitespace to be kept.

For example:
http://searchfox.org/mozilla-central/rev/0f16d437cce97733c6678d29982a6bcad49f817b/testing/web-platform/tests/XMLHttpRequest/data-uri.htm#35

In this case the MIME type is image/png, so we strip whitespace, and the test fails because the result is 'Hello,World!' instead of the expected 'Hello, World!'.
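To make the failure concrete, here is a minimal Python sketch of the behavior described above. It is only a model, not the nsDataHandler.cpp code: the function name body_after_stripping and the "starts with text/" check are assumptions made for illustration.

```python
def body_after_stripping(mime_type: str, data: str) -> str:
    """Hypothetical model of the stripping rule described above."""
    if mime_type.startswith("text/"):
        # Assumed rule: text payloads keep their whitespace.
        return data
    # Everything else has spaces, tabs, CR and LF removed.
    return "".join(ch for ch in data if ch not in " \t\r\n")


print(body_after_stripping("image/png", "Hello, World!"))   # Hello,World!
print(body_after_stripping("text/plain", "Hello, World!"))  # Hello, World!
```

With image/png the space after the comma is lost, which is exactly the mismatch the wpt test reports.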

There's another wpt I ran into:
https://github.com/w3c/web-platform-tests/blob/af610fabf05f1761321e41b031cc71ae9840bdc0/workers/data-url.html#L53

The line 'else postMessage(...)' is parsed as 'elsepostMessage(...)', so we threw a ReferenceError (I've fixed that in our bug 1340974).

As we're fixing bug 1324406 I'd like to know in which cases we should strip whitespace.

Bz, I found that you reviewed bug 391951 and bug 390126.
Could you suggest what we should do here?

Thanks
So...

In spec terms, the syntax for data: URIs is originally given in https://tools.ietf.org/html/rfc2397#section-3 as follows:

       dataurl    := "data:" [ mediatype ] [ ";base64" ] "," data
       mediatype  := [ type "/" subtype ] *( ";" parameter )
       data       := *urlchar

and "urlchar" is claimed to come from https://tools.ietf.org/html/rfc2396 but that RFC doesn't actually define that production.  Looks like there's an erratum at https://www.rfc-editor.org/errata/eid2045 that says this should actually be:

       data       := *uric

which in RFC2396 is defined as:

      uric        = reserved | unreserved | escaped
      reserved    = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
                    "$" | ","
      unreserved  = alphanum | mark
      alphanum    = alpha | digit
      mark        = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"
      escaped     = "%" hex hex
      hex         = digit | "A" | "B" | "C" | "D" | "E" | "F" |
                            "a" | "b" | "c" | "d" | "e" | "f"

I left out alpha and digit, which are [a-zA-Z0-9].  The upshot of all this is that whitespace is not valid in a data: URI, so if it's in there at all we're now in error-recovery mode.
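As a quick check of that conclusion, the productions above can be turned into a regular expression. This is a rough Python sketch, assuming the erratum's "data := *uric" reading; the names URIC and is_valid_data_payload are made up for illustration.

```python
import re

# uric = reserved | unreserved | escaped, following the productions above.
URIC = re.compile(
    r"(?:[;/?:@&=+$,]"       # reserved
    r"|[A-Za-z0-9]"          # alphanum
    r"|[-_.!~*'()]"          # mark
    r"|%[0-9A-Fa-f]{2})*"    # escaped
)


def is_valid_data_payload(data: str) -> bool:
    """True if every character of the payload fits the uric production."""
    return URIC.fullmatch(data) is not None


print(is_valid_data_payload("Hello,World!"))   # True
print(is_valid_data_payload("Hello, World!"))  # False, the literal space is not uric
print(is_valid_data_payload("X%20X"))          # True, the space is percent-escaped
```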

There's an attempt at https://simonsapin.github.io/data-urls/ to address some of the ambiguities of RFC 2397, but it doesn't seem to include this bit.

I'd guess that either the fetch spec or https://simonsapin.github.io/data-urls/ should describe what should happen here.
Flags: needinfo?(simon.sapin)
Flags: needinfo?(annevk)
So the question is why Chrome and Firefox both pass the SVG test, but only Chrome passes the data:image/png,Hello World! test? Does Chrome only strip spaces for base64?

(And yeah, at some point someone should write the definitive algorithm for data URLs. I can probably have another go at it if we think it's important, but I suspect it might take a while given all the tests that would need to be written.)
Flags: needinfo?(annevk)
Whitespace is invalid per RFC 2396, but that RFC doesn’t say what to do with invalid inputs. (Non-ASCII characters for example are UTF-8-encoded then percent-encoded, if I remember correctly.)

Base 64 decoding can be thought of as an additional step after URL parsing (whether or not it’s implemented that way). I think it makes sense to ignore whitespace in the former, but not the latter. Then again, interop is often not about what makes sense…
Flags: needinfo?(simon.sapin)
When exactly does the whitespace stripping happen inside image/png? I tried to reproduce with data:image/png,X%20X, but the byte output of that is always 0x58 0x20 0x58 as far as I can tell. Is it specific to XMLHttpRequest somehow?
> I tried to reproduce with data:image/png,X%20X

The whitespace stripping happens before URI unescaping.

You'd need a data: URI string with an actual whitespace in it.
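The ordering matters, and a small Python sketch shows why. It assumes the two steps are "remove literal whitespace, then percent-decode"; decode_body is a hypothetical helper, not the actual handler code.

```python
from urllib.parse import unquote_to_bytes


def decode_body(data: str) -> bytes:
    """Hypothetical model: strip literal whitespace first, then unescape."""
    stripped = "".join(ch for ch in data if ch not in " \t\r\n")
    return unquote_to_bytes(stripped)


print(decode_body("X%20X"))  # b'X X'  (0x58 0x20 0x58, the escaped space survives)
print(decode_body("X X"))    # b'XX'   (the literal space is stripped)
```

So %20 never triggers the stripping, because it only becomes a space after the stripping step has already run.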
Chrome also strips whitespace. It seems that Edge and Safari do not strip whitespace.

I think the Edge/Safari behavior is better as it doesn't rely on interpreting the MIME type to produce a byte sequence for the body.

I also tested stripping for base64. It seems only Edge and Firefox strip U+000C (FF; \f). Everyone strips U+0020. (Note that U+0009, U+000A, and U+000D are already stripped by the URL parser for all URLs.) I think what Edge and Firefox do is reasonable for base64. If we're going to strip we might as well strip all known ASCII whitespace.

https://github.com/whatwg/fetch/issues/234 is the tracking issue for a more proper specification.
(And as far as I can tell data URL base64 reuses the window.atob() algorithm which already discards ASCII whitespace so we wouldn't have to do anything special there.)
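For the base64 case, the idea of discarding all ASCII whitespace before decoding can be sketched in Python as follows. This is an illustration only, not the atob() algorithm itself (which handles further cases that are left out here); decode_base64_body is a hypothetical name.

```python
import base64

# ASCII whitespace: U+0009, U+000A, U+000C, U+000D, U+0020.
ASCII_WHITESPACE = "\t\n\x0c\r "


def decode_base64_body(data: str) -> bytes:
    """Drop ASCII whitespace, then decode strictly as base64."""
    cleaned = "".join(ch for ch in data if ch not in ASCII_WHITESPACE)
    return base64.b64decode(cleaned, validate=True)


print(decode_base64_body("SGVs bG8="))     # b'Hello'
print(decode_base64_body("SGVs\x0cbG8="))  # b'Hello', U+000C is stripped too
```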
Blocks: 1392241
Component: General → DOM: Security
Component: DOM: Security → Networking
Priority: -- → P3
Whiteboard: [necko-triaged]
Severity: normal → S3

We now basically align with the current spec as of bug 1845006 landing (with further caveats being considered in bug 1845005). I think it's safe to close this bug unless the spec needs changes.

Status: NEW → RESOLVED
Closed: 8 months ago
Resolution: --- → FIXED