956713 - Investigate a "json + journal" API

David Teller [:Yoric] - still alive but not very active

Reporter

Description

•

12 years ago

Several modules seem to be hitting the limits of the "save as JSON" approach: 1a. generate big object containing all the data we wish to save; 2a. serialize said object; 3a. write serialization to disk as "foo.json". This approach is very nice when the object is just a few kb large, but Session Restore, Bookmark Backups and possibly other subsystems (Add-on manager?) need to save files that can in extreme cases reach in the multi-megabyte range. In these cases, the cost of 2. and 3. can be unacceptable (e.g. bug 824433). Now, in many cases, we could do as follows: 1b. (the first time) generate big object; 1b'. (in permanent mode) generate object-style diff; 2b. serialize diff; 3b. write diff to disk as "foo.1.journal". Opening this bug to investigate.

David Teller [:Yoric] - still alive but not very active

Reporter

Comment 1

•

12 years ago

Note that we'll want a few maintenance operations: 4b. on startup, instead of reading only "foo.json", read "foo.json", "foo.1.journal", and build the object from the diff; 5b. from a worker, consolidate diffs into a single file every so often (e.g. idle-daily, shutdown). Intuitively: - replacing 1a. with 1b'. should somewhat decrease jank and memory usage; - replacing 2a. with 2b. should considerably decrease jank and memory usage; - replacing 3a. with 3b. should considerably decrease jank and memory usage; - 4b. will increase I/O and jank during startup, but if we have consolidated during shutdown, this will only happen when recovering from a crash; - 5b. should increase memory usage, but only for short periods of time, during idle-daily.

David Teller [:Yoric] - still alive but not very active

Reporter

Comment 2

•

12 years ago

Attached file Rough sketch of a possible API (obsolete) — Details

Attaching a possible draft.

Attachment #8363567 - Flags: feedback?(nfroyd)

Attachment #8363567 - Flags: feedback?(mak77)

David Teller [:Yoric] - still alive but not very active

Reporter

Updated

•

12 years ago

Whiteboard: [Async:team]

Marco Bonardo [:mak]

Comment 3

•

12 years ago

(In reply to David Rajchenbach Teller [:Yoric] (please use "needinfo?") from comment #1) > - 4b. will increase I/O and jank during startup, but if we have consolidated > during shutdown, this will only happen when recovering from a crash; > - 5b. should increase memory usage, but only for short periods of time, > during idle-daily. A couple generic thoughts: 1. shutdown is not less important than startup. The situations where we get "Firefox is still running" message or long shutdowns in general are not any better than long startups, for our users. 2. idle is hard to get right. Many users hardly hit any idle time since they use the browser as a one-shot app (open, navigate, close). We figured with time that idle is not the best choice unless there's also a fallback option, that in this case you found in shutdown. I don't think shutdown is going to work, in the case of bookmark backups it ended up happening on shutdown too often and causing long shutdowns. Ideally we want exit(0) right? So, what I think to do, is completely remove them from shutdown, and just force it at a random time if we really couldn't find any good idle in the past days. (I am clearly improving their creation time and making all IO async at the same time). So, that means you need very short idle times to avoid the fallback, and then likely you will run in the middle of the user watching a youtube video or such. 3. a good thing to take into account is files portability, in the case of bookmarks backups it's important that the user doesn't have to search for all of the small journals in addition to the main object... A container file (even just tar), with the main and the journals inside, would be far nicer from this point of view.

Marco Bonardo [:mak]

Comment 4

•

12 years ago

and, just to know, did we look around for maintained and license-compatible third party journaling libraries, before trying to make one by ourselves?

David Teller [:Yoric] - still alive but not very active

Reporter

Comment 5

•

12 years ago

(In reply to Marco Bonardo [:mak] from comment #3) > A couple generic thoughts: > 1. shutdown is not less important than startup. The situations where we get > "Firefox is still running" message or long shutdowns in general are not any > better than long startups, for our users. Good point. > 2. idle is hard to get right. [...] Ok, good point, too. We can shift this to having some worker thread (e.g. OS.File) idle, and this happens quite frequently. > 3. a good thing to take into account is files portability, in the case of > bookmarks backups it's important that the user doesn't have to search for > all of the small journals in addition to the main object... A container file > (even just tar), with the main and the journals inside, would be far nicer > from this point of view. In my early draft, there are only two files: - foo.json; - foo.journal. The rough idea is that operations are added to the journal, which is overwritten each time, until it reaches some limit (e.g. 32 kb), at which stage we consolidate everything into foo.json. We could certainly merge this into a single file, if we don't care about json compatibility. > and, just to know, did we look around for maintained and license-compatible third party journaling libraries, before trying to make one by ourselves? No, we have not. Do you know of any?

Marco Bonardo [:mak]

Comment 6

•

12 years ago

(In reply to David Rajchenbach Teller [:Yoric] (please use "needinfo?") from comment #5) > The rough idea is that operations are added to the journal, which is > overwritten each time, until it reaches some limit (e.g. 32 kb), at which > stage we consolidate everything into foo.json. > > We could certainly merge this into a single file, if we don't care about > json compatibility. we are already throwing away json compatibility with lz4 compression, a container won't be any different. And would not be hard to make an add-on to decompress lz4 or this container format. So, based on your idea, what happens if there's a .journal that has not yet been merged, and we need to make other changes? did you mean "appended" instead of "overwritten"? What happens if we are appending to an existing journal, and we crash? Sqlite has write barriers to detect good points in the journal file and read only the good parts, but json doesn't have anything that much sophisticated. Off-hand looks like it's very hard to make a reliable single-journal system, it's easier to make a multi-journal one. > > and, just to know, did we look around for maintained and license-compatible third party journaling libraries, before trying to make one by ourselves? > > No, we have not. Do you know of any? Nope, but I didn't search too. I was only wondering we are surely not the only ones with this issue.

Nathan Froyd [:froydnj]

Updated

•

12 years ago

Attachment #8363567 - Attachment mime type: application/x-javascript → text/plain

David Teller [:Yoric] - still alive but not very active

Reporter

Comment 7

•

12 years ago

(In reply to Marco Bonardo [:mak] from comment #6) > So, based on your idea, what happens if there's a .journal that has not yet > been merged, and we need to make other changes? did you mean "appended" > instead of "overwritten"? Right, appended is better. > What happens if we are appending to an existing > journal, and we crash? My initial reaction to this was "we lose the journal, not such a big deal," but we can certainly do better. > Sqlite has write barriers to detect good points in > the journal file and read only the good parts, but json doesn't have > anything that much sophisticated. Off-hand looks like it's very hard to make > a reliable single-journal system, it's easier to make a multi-journal one. Well, we could decide that a single file looks like * header; * number of bytes of the chunk; * lz4-compressed chunk (extracts to json); * number of bytes of the chunk; * lz4-compressed chunk (extracts to json); * ... We stop writing whenever a lz4-compressed chunk doesn't decompress cleanly or doesn't parse. > Nope, but I didn't search too. I was only wondering we are surely not the > only ones with this issue. Would you want to take a look around?

Nathan Froyd [:froydnj]

Comment 8

•

12 years ago

Comment on attachment 8363567 [details] Rough sketch of a possible API At first glance, having separate files seems like the way to go; I don't think trying to squash multiple JSON blobs into a single file is going to wind up winning. I don't see how the Operation.{field,splice} bits are actually modifying the object that gets passed in. So this is all the file writing stuff, but ISTM you want something more like a JournaledObject that intercepts (via proxies?) all interesting operations and constructs the journal from those. That way you make sure that everything gets journaled...and you know exactly what your journal operations are from the get-go. And/or you can complain early if somebody does something inefficient to your object. Does that seem plausible? Canceling feedback because I don't think + or - is the right call at this stage. This is going to need superreview from somebody eventually, too.

Attachment #8363567 - Flags: feedback?(nfroyd)

David Teller [:Yoric] - still alive but not very active

Reporter

Comment 9

•

12 years ago

(In reply to Nathan Froyd (:froydnj) from comment #8) > So this is all the file writing stuff, but ISTM you want something more like > a JournaledObject that intercepts (via proxies?) all interesting operations > and constructs the journal from those. That way you make sure that > everything gets journaled...and you know exactly what your journal > operations are from the get-go. And/or you can complain early if somebody > does something inefficient to your object. Does that seem plausible? That is certainly possible, but this sounds a bit over-engineered to me. I was planning to have Session Restore produce directly |splice| and |update| operations directly, which I hope should be quite sufficient for a v1.

David Teller [:Yoric] - still alive but not very active

Reporter

Updated

•

12 years ago

Depends on: 974642

David Teller [:Yoric] - still alive but not very active

Reporter

Comment 10

•

12 years ago

Attached file Implementation, WIP — Details

Ok, here is a mostly fleshed out version (untested). I have decided to keep the following strategy: - 1 master file, in json, without specific annotations; - 1 journal file, also in json; - backups for each of these files. The journal file contains a hash for the master file and a list of operations, and is overwritten whenever we commit new operations to that list. It is the caller's responsibility to decide when to consolidate the journal into the master file. Also, the API doesn't attempt to provide high-level primitives for data manipulation. Rather, we only provide |splice| (which maps to Array.prototype.splice) and |insert| (which offers the features of both |foo.bar = sna| and |delete foo.bar|).

Attachment #8363567 - Attachment is obsolete: true

Attachment #8363567 - Flags: feedback?(mak77)

Attachment #8378606 - Flags: feedback?(vdjeric)

Attachment #8378606 - Flags: feedback?(nfroyd)

Nathan Froyd [:froydnj]

Updated

•

12 years ago

Attachment #8378606 - Attachment mime type: application/x-javascript → text/plain

Nathan Froyd [:froydnj]

Comment 11

•

12 years ago

Comment on attachment 8378606 [details] Implementation, WIP On the one hand, I think this is reasonable, though I didn't carefully walk through all the I/O logic to make sure we're not winding up in some weird state. On the other hand, I still think having some sort of proxy to construct the journal operations automagically is a much better way to do this. Do we really think it's reasonable to have clients do: blob.field = v; journaled.field("field", v); every single time they want to touch something? The potential for getting mismatches and/or forgetting one or the other of those operations seems rather high. Perhaps you have a sketch of a client using this that you could share; it's possible I'm misunderstanding things. Maybe you could, as a half-way measure, have |Journaled| maintain the object on its own, so Journaled.prototype.{field,splice} were the only way to modify/query things?

Attachment #8378606 - Flags: feedback?(nfroyd)

David Teller [:Yoric] - still alive but not very active

Reporter

Comment 12

•

12 years ago

(In reply to Nathan Froyd (:froydnj) from comment #11) > Comment on attachment 8378606 [details] > Implementation, WIP > On the other hand, I still think having some sort of proxy to construct the > journal operations automagically is a much better way to do this. Do we > really think it's reasonable to have clients do: > > blob.field = v; > journaled.field("field", v); I believe that this can be added later, on top of Journaled. Further, I am not sure if/how we should bind the in-memory object and the Journaled. For instance, in Session Restore, the in-memory object would live on the main thread and the Journaled in the worker.

Vladan Djeric (:vladan)

Comment 13

•

11 years ago

Comment on attachment 8378606 [details] Implementation, WIP Clearing f? flag for now, this is on hold

Attachment #8378606 - Flags: feedback?(vdjeric)

BMO Automation

Updated

•

3 years ago

Severity: normal → S3

Rough sketch of a possible API 12 years ago David Teller [:Yoric] - still alive but not very active 3.96 KB, text/plain		Details
Implementation, WIP 12 years ago David Teller [:Yoric] - still alive but not very active 13.16 KB, text/plain		Details