Bug 1304389 Comment 69

I believe it's possible to incrementally update recovery.jsonlz4/sessionstore.jsonlz4 while avoiding the issues with atomicity and garbage collection of stale files: append diffs to the existing file, and rewrite it entirely when (for example) the diff part grows to be as large as the original part. Restoring then consists of reading the base file, decompressing the diffs and applying them. This only works if the rest of the session restore machinery doesn't rely on being able to move the file away every single time - I'm not 100% sure that isn't the case, but I think that only happens on a normal shutdown, not during the 15-second updates.
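
To make that concrete, here's a minimal sketch in Rust of the append-or-rewrite decision. The file layout (base snapshot followed by length-prefixed diff records) and the function name are made up for illustration; nothing like this exists in session restore today:
```
use std::fs::{File, OpenOptions};
use std::io::Write;
use std::path::Path;

// Hypothetical layout: compressed base snapshot, then appended,
// length-prefixed diff records. `base_len` would in practice come from a
// small header or be kept in memory between flushes.
fn write_update(path: &Path, base_len: u64, diff: &[u8], full_snapshot: &[u8]) -> std::io::Result<()> {
    let current_len = std::fs::metadata(path).map(|m| m.len()).unwrap_or(0);
    let diff_part = current_len.saturating_sub(base_len);

    if diff_part + diff.len() as u64 >= base_len {
        // The diff tail has grown as large as the base: rewrite the whole
        // file atomically (temp file + rename), which also garbage-collects
        // the stale diffs.
        let tmp = path.with_extension("tmp");
        File::create(&tmp)?.write_all(full_snapshot)?;
        std::fs::rename(&tmp, path)?;
    } else {
        // Otherwise append one more diff record; a torn append can be
        // detected via the length prefix and ignored on restore.
        let mut f = OpenOptions::new().append(true).open(path)?;
        f.write_all(&(diff.len() as u32).to_le_bytes())?;
        f.write_all(diff)?;
    }
    Ok(())
}
```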

For generating the diffs, I can see several options:
- There are several libraries that can do JSON diff and patch that could be used for this. I'm not sure any of them are production-solid enough for session restore, though.
- Something ad hoc is possible: instead of writing out the raw JSON, split it on entries (e.g. windows->tabs->entries, _closedTabs->state->entries, etc.; probably the same should be done for the cookie list) and compute and store a hash per entry (with modern hashes like BLAKE3 or AHash this is negligible overhead). Then write out the compressed data (in one block) and keep the hashes around. For the next session dump, do the same, but check whether each entry's hash matches the previous one: if it does, drop the entry from the dump; if it doesn't, record the old and new hash in a list. At the end, store *all* compressed data including new entries, plus the list of hashes (entries) that were deleted, and append this to the file. Basically we're patching by hashing each "entry" and recording the adds/removes/replaces by hash; see the first sketch after this list.
- Probably the easiest: decode (or keep in RAM) the previous JSON, call `ZSTD_createCDict()` with it, and then compress the new one against that dictionary; see the second sketch after this list. This should work very well as long as the zstd window is at least as large as the previous session, and I think we can guarantee that, since a huge session is less likely to happen on a low-RAM machine.
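
A rough sketch of the per-entry hashing from the second option (the key/type choices and the `blake3` crate are my assumptions; real code would key entries by a stable window/tab/entry path):
```
use std::collections::HashMap;

// Hash every serialized entry and keep only the ones whose hash changed;
// unchanged and deleted entries are referenced purely by key/hash.
fn diff_entries(
    previous: &HashMap<String, blake3::Hash>,
    current: &[(String, Vec<u8>)],
) -> (Vec<(String, Vec<u8>)>, Vec<String>, HashMap<String, blake3::Hash>) {
    let mut new_hashes = HashMap::new();
    let mut changed = Vec::new();
    for (key, bytes) in current {
        let h = blake3::hash(bytes);
        if previous.get(key) != Some(&h) {
            changed.push((key.clone(), bytes.clone())); // added or replaced entry
        }
        new_hashes.insert(key.clone(), h);
    }
    // Entries that existed before but not anymore are recorded as deletions.
    let deleted = previous
        .keys()
        .filter(|k| !new_hashes.contains_key(*k))
        .cloned()
        .collect();
    (changed, deleted, new_hashes)
}
```

And the dictionary option, via the zstd crate's bulk API (which, as far as I know, sits on top of the same dictionary machinery as `ZSTD_createCDict()`; crate version and exact API are assumptions):
```
use zstd::bulk::{Compressor, Decompressor};

// Compress the new serialized session against the previous one used as a
// dictionary, so unchanged regions cost almost nothing.
fn compress_delta(previous: &[u8], current: &[u8]) -> std::io::Result<Vec<u8>> {
    let mut enc = Compressor::with_dictionary(1, previous)?; // level 1
    enc.compress(current)
}

// Restoring needs the same previous snapshot as the dictionary.
fn decompress_delta(previous: &[u8], delta: &[u8], max_size: usize) -> std::io::Result<Vec<u8>> {
    let mut dec = Decompressor::with_dictionary(previous)?;
    dec.decompress(delta, max_size)
}
```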

The advantage of these approaches is that they work entirely at the data serialization level. You don't need to know anything about or depend on Firefox internals to get this working, and can even shove it into an external Rust library or whatever.

My first thought was to just hash the per-tab data and re-transmit only the changed tabs, but look at this:

Data size per second-level domain (in bytes), taken from the recovery.jsonlz4 in my personal profile:
```
google.com: 169965
slack.com: 462642
discord.com: 19308696
mozilla.org: 70929
chatgpt.com: 266999
openai.com: 11871
github.com: 198099
pytorch.org: 9431
nvidia.com: 7445
openreview.net: 11008
arxiv.org: 19187
github.io: 20058
reddit.com: 104768
searchfox.org: 15451
<rest mostly negligible>
...
```
Given that discord.com accounts for the majority of the data, and is also the part that keeps changing, you really do need to diff/patch the data itself. But the zstd approach should be fast, easy and efficient for this. So quite doable, I hope.
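
For reference, here's roughly how a per-domain breakdown like the one above can be produced. The mozLz4 framing (magic + uncompressed size + one LZ4 block) is real; the crates (`lz4_flex`, `serde_json`), the crude second-level-domain grouping and the attribution rule are simplifications I picked for illustration, not how these exact numbers were generated:
```
use std::collections::HashMap;

fn main() {
    // recovery.jsonlz4 is "mozLz4"-framed: 8-byte magic "mozLz40\0",
    // a little-endian u32 uncompressed size, then one raw LZ4 block.
    let raw = std::fs::read("recovery.jsonlz4").expect("read file");
    let size = u32::from_le_bytes(raw[8..12].try_into().unwrap()) as usize;
    let json = lz4_flex::block::decompress(&raw[12..], size).expect("lz4");
    let root: serde_json::Value = serde_json::from_slice(&json).expect("json");

    let mut sizes: HashMap<String, usize> = HashMap::new();
    let mut stack = vec![&root];
    while let Some(v) = stack.pop() {
        match v {
            serde_json::Value::Object(obj) => {
                if let Some(url) = obj.get("url").and_then(|u| u.as_str()) {
                    if let Some(host) = url.split("://").nth(1).and_then(|r| r.split('/').next()) {
                        // Crude "second-level domain": the last two host labels.
                        let labels: Vec<&str> = host.split('.').collect();
                        let sld = labels[labels.len().saturating_sub(2)..].join(".");
                        // Charge the whole object to that domain and stop
                        // descending, so nested frames aren't counted twice.
                        *sizes.entry(sld).or_insert(0) += v.to_string().len();
                        continue;
                    }
                }
                stack.extend(obj.values());
            }
            serde_json::Value::Array(arr) => stack.extend(arr),
            _ => {}
        }
    }
    for (domain, bytes) in &sizes {
        println!("{domain}: {bytes}");
    }
}
```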

Also, FWIW, just switching to zstd at level 1 instead of LZ4 gives a 2x-3x improvement in compression ratio. These kinds of fast, high-ratio compressors weren't available 8 years ago.
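
If you want to check that on your own profile, a quick one-shot comparison (crate names/versions are assumptions, and this measures block compression only, not whatever streaming setup Firefox would actually use):
```
fn main() {
    // `recovery.json` here is the already-decompressed session JSON
    // (e.g. produced by the mozLz4 snippet above).
    let json = std::fs::read("recovery.json").expect("read decompressed session JSON");
    let lz4 = lz4_flex::block::compress(&json);
    let zst = zstd::bulk::compress(&json, 1).expect("zstd level 1");
    println!("raw {} / lz4 {} / zstd-1 {}", json.len(), lz4.len(), zst.len());
}
```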
