JSON.stringify produces different unicode escaping results



8 months ago
7 months ago


(Reporter: kats, Assigned: kats)



Firefox Tracking Flags

(Not tracked)


Run this in the console:
  JSON.stringify({c: "#c\ud838"})

This used to produce

Now it produces

Note the extra backslash before the unicode escape. Is this intentional? It's a regression in the last day or so.
Looking at the commit log for JSON.cpp, it seems like this is probably bug 1469021, so I'm ni?ing Waldo.
Flags: needinfo?(jwalden+bmo)
Blocks: 1469021
The context here is that this change caused an error in a SearchFox run.
Can you provide a bit more details. I think everyone was hoping that this change would be web compatible. Especially which library or language is causing the error?
Indeed, mozregression points to https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=9dd02ae2ee0ae93d12ea26e7a1d405427f090dc6&tochange=e8aec0dccbb8213dd292d7331f01307969cdff9a so bug 1469021 seems likely.

To flesh out the context a bit more, it looks like trying to parse this JSON:
  {"loc":"159:29-31","source":1,"syntax":"def,prop","pretty":"property c\ud83d","sym":"#c\ud83d"}
using rustc_serialize::json fails with this error:
  error SyntaxError("unexpected end of hex escape", 1, 78)

and previously the JSON had the literal unicode character instead of the escaped thing.
(So parsing with rustc_serialize::json in Rust, not using web libraries)
This is bug 1469021, as people seem to already have concluded.  IMO Searchfox should fix its code, because this is a bit of a silly bit of hyperfixation on JSON.stringify output.
Flags: needinfo?(jwalden+bmo)
It's not something we can easily fix in searchfox itself, it would be best fixed in rustc-serialize. I filed this issue because I wasn't sure if the change in behaviour was expected/desirable or not. If the consensus is that it's here to stay then I can file an issue against rustc-serialize.
Or I guess we could migrate searchfox to serde, since it seems like rustc-serialize is deprecated. If that still has the same problem I can file an issue against the appropriate repo.
Yeah, I meant Searchfox as synecdoche for it and its dependencies.  I think we should file bugs against the relevant repositories.
So actually all three of the Rust JSON parsers I tried error out on this.

Test code at https://github.com/staktrace/jsontest

json failed: Unexpected character: " at (1:17)
rustc_serialize failed: SyntaxError("unexpected end of hex escape", 1, 18)
serde_json failed: unexpected end of hex escape at line 1 column 17
In fact you can't even compile a rust program that tries to use the literal string "\ud838" because it's not valid unicode, and AFAICT Rust only allows valid unicode. So I don't know how you'd even fix the json parsing libraries because the thing is not representable in rust strings at all.

Comment 13

8 months ago

https://github.com/rust-lang-deprecated/rustc-serialize/blob/master/src/json.rs#L1684-L1699 guards against lone surrogates in their escaped form, throwing an error when they’re encountered.

Non-escaped lone surrogates (which JSON-parse into the exact same result) however end up here: https://github.com/rust-lang-deprecated/rustc-serialize/blob/71b8d6ef874cb53209406824317575c06a7727e0/src/json.rs#L1719

It seems the intention is to block any use of lone surrogates (`return self.error(InvalidUnicodeCodePoint)`), but it’s not being done properly for unescaped ones. I imagine that would be the fix in the Rust library.

That would make the test case in comment #4 fail consistently, in both the escaped and unescaped case.
I was getting some inconsistent results upon further testing, and realized I made a mistake.

Lone surrogates *are* disallowed in Rust entirely, so even just trying to read a file with such a character into a rust String will fail. So there's actually no way you can even have a rust String with a lone surrogate, and so there's no way you can pass such a string to any JSON parsing libraries. So the code that appears inconsistent in rustc-serialize is actually fine because the one codepath that could actually deal with this issue will error out, and the other codepath will never actually encounter a lone surrogate.

So then going back to my original problem, I discovered that prior to this change landing in Spidermonkey, the thing that we was in the file we were trying to parse was *not* actually the literal \uD838 character but the \uFFFD replacement character. I'm not totally sure yet which step replaces \uD838 with \uFFFD but I'm trying to figure that out. With the change in SpiderMonkey, we no longer have the \uFFFD replacement char but instead have the escaped \\uD838 string, which started to trigger the parsing failures.

Sorry for the erroneous information :(
The \uFFFD replacement was happening in the SpiderMonkey "print" function to print to stdout.

js> print('\ud83d')

Assuming this change is here to stay, I guess we'll have to change Searchfox to gracefully handle input JSON files that are not valid Unicode and thus not representable in Rust strings.
Component: JavaScript: Standard Library → Searchfox
Product: Core → Webtools
Version: Other Branch → other
So I thought about this some more, here are various options for completeness:

1) Modify rustc_serialize to convert lone surrogates to \uFFFD - unlikely to happen since rustc_serialize is deprecated
2) Modify serde-json to convert lone surrogates to \uFFFD and use that. Rejected [1]
3) Modify json-rust to convert lone surrogates to \uFFFD and use that. Requested [2] but unlikely to happen, considering none of the recent issues filed against that repo have any sort of response or resolution
4) Fork one of the above. Seems a bit overkill
5) Write a new Rust JSON library. Also overkill, although I might do this for unrelated reasons anyway
6) Preprocess the analysis file after it's emitted by js-analyzer.js and before it's parsed by the Rust tooling. Preprocessing correctly would need to use a JSON parser, which can't be written in Rust. So it's better to just change js-analyzer.js instead.
7) Change js-analyzer.js to strip out analysis lines with lone surrogates. This is actually fairly easy since we already have a nameValid() checker function that prevents us from emitting analysis on names with invalid characters. It looks like JS strings can contain lone surrogates, but identifier names cannot, and the nameValid function seems to allow some superset of valid identifier names. It's not totally clear to me why nameValid only disallows some things but it seems like a good place to reject names with lone surrogates.

[1] https://github.com/serde-rs/json/issues/495#issuecomment-427593533
[2] https://github.com/maciejhirsz/json-rust/issues/149
Assignee: nobody → kats
Blocks: 1496808
PR merged
Last Resolved: 7 months ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.