Closed Bug 832664 Opened 11 years ago Closed 11 years ago

Efficient, asynchronous JSON operations

Status: RESOLVED WONTFIX
Component: Core :: Networking: File (defect, severity normal)
Reporter: Yoric; Assignee: Unassigned
Keywords: perf; Whiteboard: [Snappy:P2]
Attachments: 3 files

Nowadays, we frequently need to (de)serialize multi-megabyte JSON files. We have common occurrences of 3 MB+ sessionstore.js files, and known occurrences of 50 MB+ sessionstore.js files.

JSON.parse and JSON.stringify, being synchronous, are unusable for handling files this large properly. So, we need one of the following:
1. asynchronous variants of JSON.parse and JSON.stringify; or
2. a mechanism for transferring JSON objects from/to a worker without copy.

Note that we have nsIJSON::decodeFromStream and (deprecated and apparently not actually asynchronous) nsIJSON::encodeToStream. While I would personally prefer option 2., option 1. seems more realistic and could be built on top of {encodeTo, decodeFrom}Stream.

Not sure whether this should go in JS Engine, File or XPCOM.
Note that we can also write an asynchronous JSON.stringify in JavaScript.
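
To make that idea concrete, here is a minimal sketch (not any of the attached patches, and written with modern syntax for brevity) of an asynchronous stringify in plain JavaScript: a generator walks the value and yields JSON text fragments, and a driver consumes a bounded number of fragments per turn of the event loop. The function names are placeholders; undefined- and function-valued properties are ignored.

// Yield JSON text fragments for |value|, one small piece at a time.
function* stringifyChunks(value) {
  if (value === null || typeof value !== "object") {
    yield JSON.stringify(value);          // defer leaf formatting/escaping to the native code
  } else if (Array.isArray(value)) {
    yield "[";
    for (let i = 0; i < value.length; i++) {
      if (i > 0) yield ",";
      yield* stringifyChunks(value[i]);
    }
    yield "]";
  } else {
    yield "{";
    let first = true;
    for (let key of Object.keys(value)) {
      if (!first) yield ",";
      first = false;
      yield JSON.stringify(key) + ":";
      yield* stringifyChunks(value[key]);
    }
    yield "}";
  }
}

// Consume a bounded number of fragments per event-loop turn, then yield back.
function stringifyAsync(value, fragmentsPerTick = 10000) {
  return new Promise(resolve => {
    let gen = stringifyChunks(value);
    let parts = [];
    (function step() {
      for (let i = 0; i < fragmentsPerTick; i++) {
        let {value: chunk, done} = gen.next();
        if (done) {
          resolve(parts.join(""));
          return;
        }
        parts.push(chunk);
      }
      setTimeout(step, 0);                // or a Timer.jsm timeout in a JSM
    })();
  });
}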
Moving this to File, just in case.
Assignee: general → nobody
Component: JavaScript Engine → Networking: File
So will this depend on the async shutdown observer work?
Whiteboard: [Snappy:P2]
I'm almost sure that we can implement a cooperatively backgrounded main thread API in either JS or C++ that would not depend on bug 722648. Also, in the current state of things, I'm almost certain that we cannot offload this to a non-main thread, as JavaScript objects are not meant to be transplanted that easily.

Jason, could you confirm that last bit?
Flags: needinfo?(jorendorff)
What should happen if, during serialization, the script suddenly changes the contents of the object that it has passed to JSONUtils.serialize? Should these changes be picked up by the serialization process as they come in? Or should the API guarantee that any changes made after the call to serialize won't be reflected in the serialization? Both options sound bad: The first one is non-deterministic, and the second one would probably require making a copy of the object, synchronously, which would defeat the whole point of this bug.
I would go for "not specified".
Comment on attachment 704841 [details] [diff] [review]
Possible API for the library

Iterators are interesting. I do wonder if it is enough. We may also need support for a nsIInputStream variant. And, since many consumers are still talking strings instead of typed arrays, perhaps we need a variant for Iterator<string> as well.
Attachment #704841 - Flags: feedback?(gps)
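
For readers without access to the attachment, the kind of surface being discussed might look roughly like the following; the names and signatures here are guesses for illustration, not the actual proposed patch.

var JSONUtils = {
  // Would return an iterator of Uint8Array (UTF-8) chunks representing |object|.
  serialize(object) { /* ... */ },

  // Would accept an iterator of Uint8Array chunks - or, per the feedback above,
  // possibly an nsIInputStream or an Iterator<string> variant - and resolve to
  // the parsed value.
  deserialize(chunks) { /* ... */ }
};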
(In reply to Markus Stange from comment #7)
> What should happen if, during serialization, the script suddenly changes the
> contents of the object that it has passed to JSONUtils.serialize?

The serialization will have a promise that resolves on completion (or some other mechanism). I'd be happy specifying the API as "don't modify this object until that promise resolves".

(Either that, or we need an efficient copy-on-write system, or a handoff mechanism (you get the object back when the promise resolves).)

Parsing is perhaps more of an issue, if the source is a file: we would have to ensure that the file contents don't change during reading, in a cross-platform way.
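
Under that contract, a caller would look something like this (JSONUtils.serialize and the helpers are hypothetical; the point is only that the object must be left alone until the promise resolves):

let state = collectSessionState();            // hypothetical helper
let promise = JSONUtils.serialize(state);     // hypothetical async API
// Do not mutate |state| here: the serializer may still be walking it.
promise.then(serialized => {
  // Serialization is complete; |state| may safely be modified again.
  updateSessionState(state);                  // hypothetical helper
});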
(In reply to Gregory Szorc [:gps] from comment #9)
> Comment on attachment 704841 [details] [diff] [review]
> Possible API for the library
> 
> Iterators are interesting. I do wonder if it is enough. We may also need
> support for a nsIInputStream variant. And, since many consumers are still
> talking strings instead of typed arrays, perhaps we need a variant for
> Iterator<string> as well.

nsIInputStream makes sense, although I would prefer if we could avoid it.

I fear that Iterator<string> would encourage bad patterns. We could provide it, but with a deprecation warning (bug 812859).
(In reply to Richard Newman [:rnewman] from comment #10)
> (In reply to Markus Stange from comment #7)
> > What should happen if, during serialization, the script suddenly changes the
> > contents of the object that it has passed to JSONUtils.serialize?
> 
> The serialization will have a promise that resolves on completion (or some
> other mechanism). I'd be happy specifying the API as "don't modify this
> object until that promise resolves".
> 
> (Either that, or we need an efficient copy-on-write system, or a handoff
> mechanism (you get the object back when the promise resolves).)

I don't think we can count on such a feature before starting implementation work on this bug. If we do get it, we can probably add it behind the scenes, with a "told you so" to whoever writes code broken by the change.

> Parsing is perhaps more of an issue, if the source is a file: we would have
> to ensure that the file contents don't change during reading, in a
> cross-platform way.

In my mind, if you are reading a file that could change, you should read it entirely (or lock it). Since you can do the full read off the main thread, I don't think we need to put too much effort into such a precaution.
Comment on attachment 704841 [details] [diff] [review]
Possible API for the library

Gavin, what would you think about such an API in toolkit?
Attachment #704841 - Flags: feedback?(gavin.sharp)
Comment on attachment 704841 [details] [diff] [review]
Possible API for the library

Sounds great! deserialize taking a byte array would not avoid some of the inefficiencies Greg mentions in bug 827852 comment 3 - for some of those we might need an API that takes e.g. a JSON file's path and keeps the actual file I/O+parsing tightly coupled in the back-end. But I don't have an intuitive sense for which tradeoff makes more sense.
Attachment #704841 - Flags: feedback?(gavin.sharp) → feedback+
I realize that my primary use for asynchronous JSONUtils.serialize could be to serialize objects in which some fields are promises. This could certainly be implemented by returning an iterator of promises. Or does this complicate things too much?
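
For illustration, one way to support promise-valued fields would be to resolve them into plain values before (or while) walking the object. The helper below is a hypothetical sketch in modern async/await syntax, not part of any attached patch.

async function resolvePromiseFields(obj) {
  let out = Array.isArray(obj) ? [] : {};
  for (let key of Object.keys(obj)) {
    // Promise.resolve() unwraps promise-valued fields and passes plain values through.
    let value = await Promise.resolve(obj[key]);
    out[key] = (value !== null && typeof value === "object")
      ? await resolvePromiseFields(value)
      : value;
  }
  return out;
}

// Usage (hypothetical API): resolvePromiseFields(data).then(plain => JSONUtils.serialize(plain));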
As it turns out, I will probably need this sooner rather than later. Gavin, what do you think of this early experiment? I am planning to land only |JSON.serialize| in a first patch, mind you.
Assignee: nobody → dteller
Attachment #710776 - Flags: feedback?(gavin.sharp)
Comment on attachment 710776 [details] [diff] [review]
Async JSON sneak preview

I won't have time to look into this in the near future, perhaps Greg can take a look?
Attachment #710776 - Flags: feedback?(gavin.sharp) → feedback?(gps)
I should add that this implementation uses recursion and generators pretty much everywhere, for the sake of quickly attaining a working prototype. Once we have some real use cases for benchmarking and testing purposes, I suspect that we will want to progressively replace this with a de-recursified implementation and a manual iterator - or possibly do this in C++.
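
For what it's worth, the "de-recursified implementation and manual iterator" could look roughly like the sketch below: an explicit stack replaces the recursive generators, so deeply nested session data cannot overflow the call stack. This is an illustrative guess, not the attached patch; undefined and function values are ignored.

function makeChunkIterator(root) {
  let stack = [{ value: root }];            // work items still to emit
  return {
    next() {
      if (stack.length === 0) {
        return { done: true, value: undefined };
      }
      let item = stack.pop();
      if ("text" in item) {
        return { done: false, value: item.text };   // pre-rendered keys/punctuation
      }
      let value = item.value;
      if (value === null || typeof value !== "object") {
        return { done: false, value: JSON.stringify(value) };
      }
      if (Array.isArray(value)) {
        stack.push({ text: "]" });
        for (let i = value.length - 1; i >= 0; i--) {
          stack.push({ value: value[i] });
          if (i > 0) stack.push({ text: "," });
        }
        return { done: false, value: "[" };
      }
      let keys = Object.keys(value);
      stack.push({ text: "}" });
      for (let i = keys.length - 1; i >= 0; i--) {
        stack.push({ value: value[keys[i]] });
        stack.push({ text: JSON.stringify(keys[i]) + ":" });
        if (i > 0) stack.push({ text: "," });
      }
      return { done: false, value: "{" };
    }
  };
}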
Comment on attachment 710776 [details] [diff] [review]
Async JSON sneak preview

Review of attachment 710776 [details] [diff] [review]:
-----------------------------------------------------------------

I don't think I'm qualified to review this. I will say that the thought of having two JSON serializers in the tree scares me. I think I'd prefer to see this implemented in C++ and have both the synchronous JSON.{parse,stringify} and the asynchronous versions provided by this bug share the same APIs under the hood. Who knows, perhaps we can see about getting async JSON APIs standardized.

It's also worth calling out the performance of a pure JS implementation. We're rolling this out primarily for chrome-privileged code in areas where performance matters. It's my understanding that chrome JS is still not fully jitted. So, we can theorize that pure JS will be slower than C++. That's another point for implementing this in C++.

On the API front, what about an nsIOutputStream (or whatever it is called)? Many times I want to serialize an object directly to a socket, file handle, etc. We already have lots of code that knows how to handle these stream classes. To maintain the benefits of streaming with the current API, we'll need to incur a conversion from iterator output to stream chunks. Why don't we eliminate the middle man and go straight to a stream? Yes, this likely means 2 discrete APIs.

I'd really like to see someone from the core JS team comment here. Perhaps they can provide an API for us to use. I'd feel much better about things if this was their responsibility (that's not a slight against you or anyone else - it's just that logically the JS engine seems the proper place for this).
Attachment #710776 - Flags: feedback?(gps)
(In reply to Gregory Szorc [:gps] from comment #19)
> Comment on attachment 710776 [details] [diff] [review]
> Async JSON sneak preview
> 
> Review of attachment 710776 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I don't think I'm qualified to review this.

Ok, bouncing back to gavin.

> I will say that the thought of
> having two JSON serializers in the tree scares me.

Will you be scared if I tell you that we already have two, including one in pure JavaScript? :)

> I think I'd prefer to see
> this implemented in C++ and have both the synchronous JSON.{parse,stringify}
> and the asynchronous versions provided by this bug share the same APIs under
> the hood. Who knows, perhaps we can see about getting async JSON APIs
> standardized.
> 
> It's also worth calling out performance of a pure JS implementation. We're
> rolling this out primarily for chrome-privileged code in areas where
> performance matters. It's my understanding that chrome JS still is not fully
> jitted. So, we can theorize pure JS will be slower than C++. That's another
> point for implementing this in C++.

I agree that we will probably want both eventually. This JS implementation is meant as a prototype. A prototype that we might wish to land on m-c until we have a better idea of bottlenecks (or not), but a prototype nevertheless.

> On the API front, what about an nsIOutputStream (or whatever it is called)?
> Many times I want to serialize an object directly to a socket, file handle,
> etc. We already have lots of code that knows how to handle these stream
> classes. To maintain the benefits of streaming with the current API, we'll
> need to incur an conversion from iterator output to stream chunk. Why don't
> we eliminate the middle man and go straight to a stream? Yes, this likely
> means 2 discrete APIs.

You mean having both:
1. an API that produces something JS-friendly, for use with OS.File, IndexedDB, XHR, etc. (presumably an iterator); and
2. an API that produces something XPCOM-friendly, for use with Necko (presumably with nsIOutputStream)?

I guess that makes sense. Can you point me to use cases you would have for 2.?

> I'd really like to see someone from the core JS team comment here. Perhaps
> they can provide an API for us to use. I'd feel much better about things if
> this was their responsibility (that's not a slight against you or anyone
> else - it's just that logically the JS engine seems the proper place for
> this).

I understand, and I'd be glad if the JS engine could indeed produce something useful for us. We are actually trying to pressure them into doing this. Here again, a JS implementation can serve as a prototype to be replaced by something closer to the metal.
Comment on attachment 710776 [details] [diff] [review]
Async JSON sneak preview

Back to Gavin, then.
Attachment #710776 - Flags: feedback?(gavin.sharp)
Comment on attachment 710776 [details] [diff] [review]
Async JSON sneak preview

Clearing for gavin.
Attachment #710776 - Flags: feedback?(gavin.sharp)
So, I have discussed this with what seems like half of the JS team.

The main cost that we want to avoid is that of copying strings when crossing compartments.

1. If we write the code in JS and load it as a .jsm, for the moment, we cannot escape this cost. However, in the near future, zones will remove this cost.
2. If we write the code in JS and load it as a subscript, we do not pay this cost. This is, however, an ugly hack.
3. If we write the code in C++ and hide the native code through a .jsm, for the moment, we cannot escape this cost. However, in the near future, zones will also remove this cost.
4. If we write the code in C++ and expose the XPCOM component, we do not pay this cost. This is, however, rather ugly.

So, regardless of whether we do it in JS or in C++, we will eventually get rid of the cost of string copies. I would suggest going for a JS version first, at least for experimentation purposes, and only later move to a C++ version.

Note that the JS version currently attached is clearly suboptimal. If there is agreement on doing this in JS, I should be able to quickly write a more efficient JS version.

What do you think, gavin?
Flags: needinfo?(jorendorff) → needinfo?(gavin.sharp)
Zones only address cross-compartment string copies if the compartments in question end up in the same zone. Why would the code doing the I/O (OS.File?) and this code end up in the same zone? Of course we could try to enforce that somehow, but it's not clear to me that it would make sense to do that.

I think at this point I'm probably not the right person to be evaluating this proposal. Someone with a better grasp of JS engine perf bottlenecks might be a better choice. Also, I think it would help to illustrate how a real user of this component (sessionstore?) would use it.
Flags: needinfo?(gavin.sharp)
Sorry Gavin, I should have mentioned on IRC that all system stuff (JSMs, browser XUL, whatever) ends up in a single zone.
As per gavin's request, here is what writing sessionstore would look like with this API.

Note that, ideally, I would like OS.File.writeAtomic to accept the output of JSON.serialize directly, which would make this a one-liner.
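
For reference, assuming OS.File.writeAtomic did learn to accept the output of the (still hypothetical) JSON.serialize directly, the sessionstore write could shrink to something like the following, inside a Task.jsm generator; the path and state names are stand-ins.

yield OS.File.writeAtomic(this.path, JSON.serialize(state),
                          { tmpPath: this.path + ".tmp" });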
Attachment #725774 - Flags: feedback?(gavin.sharp)
Comment on attachment 725774 [details] [diff] [review]
Demoing the API with sessionstore.js

As we discussed in person, we should investigate improving postMessage support to make this solution unnecessary. If that's not possible, then we should move forward with an implementation like this one.

(the sessionstore-state-write thing seems to only be used in tests, we should be able to come up with a test-only solution somehow)
Attachment #725774 - Flags: feedback?(gavin.sharp) → feedback+
I haven't read everything here, but when you are talking about no-copy string transferring, I'm reminded of transferable objects [1], while postMessage reminds me of MessagePort (bug 741618), which is transferable (IIRC).

Is this of any help here?


[1] https://developer.mozilla.org/en-US/docs/DOM/Using_web_workers?redirectlocale=en-US&redirectslug=Using_web_workers#Passing_data_by_transferring_.C2.A0ownership_%28transferable_objects%29
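
For context, the suggestion alluded to above is roughly the following: serialize in a ChromeWorker, then hand the resulting bytes back through a transfer list instead of a structured-clone copy. This is an illustrative sketch only (the worker URL is a placeholder), and, as noted below, it does not actually solve this bug, since sending the object to the worker still requires a structured-clone copy.

let worker = new ChromeWorker("resource://gre/modules/json_worker.js"); // placeholder URL
worker.onmessage = event => {
  let bytes = new Uint8Array(event.data);   // the ArrayBuffer was moved, not cloned
  // ... hand |bytes| to OS.File.writeAtomic, a socket, etc. ...
};
worker.postMessage(someState);              // this direction still copies the object

// In the worker script: stringify, encode to UTF-8, transfer the buffer back.
// onmessage = event => {
//   let buffer = new TextEncoder().encode(JSON.stringify(event.data)).buffer;
//   postMessage(buffer, [buffer]);         // second argument: the transfer list
// };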
Marking as WONTFIX. Please reopen the bug if you find you need this feature.

Florian: Thanks, but it doesn't solve the issue.
Assignee: dteller → nobody
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → WONTFIX