Bug 1304389 Comment 69 Edit History
I believe it's possible to incrementally update recovery.jsonlz4/sessionstore.jsonlz4 while avoiding the issues with atomicity and garbage collection of stale files: append diffs to the existing file, and rewrite it entirely when (for example) the diff part grows to equal or exceed the original part. Restoring then consists of reading the normal file, decompressing the diffs, and applying them. This only works if the rest of the session restore machinery doesn't rely on being able to move the file away every single time - I'm not 100% sure that's not the case, but I think that only happens on a normal shutdown, not during the 15 second updates.

For generating the diffs, I can see several options:

- There are several libraries that can do JSON diff and patch that could be used for this. I'm not sure any of them are solid enough for production use in session restore, though.
- Something ad hoc is possible: instead of writing out the raw JSON, split it on entries (e.g. windows->tabs->entries, _closedTabs->state->entries, etc.; probably want to do the same for the cookie list) and compute and store a hash per entry (with modern hashes like BLAKE3 or AHash this is negligible overhead). Then write out the compressed data (in one block) and keep the hashes around. For the next session dump, do the same operation, but check whether the hash matches for each entry. If it does, skip the entry. If it does not, store the old hash and the new hash in a list. At the end, append to the file *all* compressed data for new and changed entries, plus the list of hashes for entries that were deleted. Basically we're patching by hashing each "entry" and recording the adds/removes/replaces by hash.
- Probably the easiest: decode (or keep in RAM) the previous JSON, call ``ZSTD_createCDict()`` with it, and then compress the new one against that dictionary.
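To make the ad hoc hash-per-entry idea concrete, here's a minimal sketch in Python. All names (``ehash``, ``make_patch``, ``apply_patch``) and the patch layout are made up for illustration, not Firefox code, and stdlib SHA-256 stands in for BLAKE3/AHash:

```python
import hashlib
import json

def ehash(entry):
    # Canonical JSON so key order doesn't affect the hash; a real
    # implementation would use BLAKE3/AHash, sha256 is just stdlib.
    return hashlib.sha256(json.dumps(entry, sort_keys=True).encode()).hexdigest()

def make_patch(prev_hashes, entries):
    """Diff the current entries against the previous dump's hash set.
    Unchanged entries are skipped entirely; a replace shows up as one
    del (old hash) plus one add (new entry)."""
    cur = {ehash(e): e for e in entries}  # identical entries collapse here
    patch = {
        "add": [e for h, e in cur.items() if h not in prev_hashes],
        "del": sorted(prev_hashes - cur.keys()),
    }
    return patch, set(cur)

def apply_patch(state, patch):
    """Rebuild the session: state maps hash -> entry from the base dump."""
    for h in patch["del"]:
        del state[h]
    for e in patch["add"]:
        state[ehash(e)] = e
    return state

# One tab unchanged, one changed (scroll position), one newly opened.
old_entries = [{"url": "https://a"}, {"url": "https://b", "scroll": 0}]
new_entries = [{"url": "https://a"}, {"url": "https://b", "scroll": 5}, {"url": "https://c"}]

state = {ehash(e): e for e in old_entries}
patch, _ = make_patch(set(state), new_entries)
restored = apply_patch(dict(state), patch)
assert set(restored) == {ehash(e) for e in new_entries}
```

In this sketch only the two add entries and one delete hash would be compressed and appended to the file; the unchanged entry costs nothing on each 15-second dump.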
This should work very well as long as the zstd window is at least as large as the previous session, and I think we can guarantee that, since a huge session is less likely to happen on a low-RAM machine. The advantage of these approaches is that they work entirely at the data serialization level: you don't need to know anything about, or depend on, Firefox internals to get this working, and could even put it in an external Rust library or whatever.

My first thought was to just hash the per-tab data and re-transmit changed tabs, but look at this data size per second-level domain (in bytes), from the recovery.jsonlz4 of my personal profile:

```
google.com:      169965
slack.com:       462642
discord.com:   19308696
mozilla.org:      70929
chatgpt.com:     266999
openai.com:       11871
github.com:      198099
pytorch.org:       9431
nvidia.com:        7445
openreview.net:   11008
arxiv.org:        19187
github.io:        20058
reddit.com:      104768
searchfox.org:    15451
<rest mostly negligible>
...
```

Given that discord.com is the majority of the data, and also the part that keeps changing, you really do need to diff/patch the data itself, but the zstd approach should be fast, easy and efficient for this. So quite doable, I hope.

Also, FWIW, just switching to zstd at level 1 instead of LZ4 gives a 2x-3x improvement. These kinds of fast, high-ratio compressors weren't available 8 years ago.
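The ``ZSTD_createCDict()`` approach can be demonstrated without zstd bindings: Python's stdlib ``zlib`` accepts a preset dictionary via ``zdict``, which is the same principle, though with only a 32 KiB window where zstd could cover the whole previous session. The session data below is synthetic, just to show the effect:

```python
import json
import zlib

def compress_against(prev: bytes, cur: bytes) -> bytes:
    # Compress the new dump using the previous dump as a preset
    # dictionary - the zlib analogue of zstd's ZSTD_createCDict().
    c = zlib.compressobj(level=1, zdict=prev)
    return c.compress(cur) + c.flush()

def decompress_against(prev: bytes, blob: bytes) -> bytes:
    # Restore needs the same previous dump to rebuild the dictionary.
    d = zlib.decompressobj(zdict=prev)
    return d.decompress(blob) + d.flush()

# A synthetic "session" where only one tab's scroll state changed.
prev = json.dumps(
    {"tabs": [{"url": f"https://x/{i}", "scroll": 0} for i in range(200)]}
).encode()
cur = json.dumps(
    {"tabs": [{"url": f"https://x/{i}", "scroll": 3 if i == 7 else 0} for i in range(200)]}
).encode()

delta = compress_against(prev, cur)
plain = zlib.compress(cur, 1)
assert decompress_against(prev, delta) == cur
# The dictionary-compressed delta is smaller than standalone compression,
# because nearly all of cur matches into prev.
assert len(delta) < len(plain)
```

With zstd the same shape applies (``ZSTD_compress_usingCDict()`` on write, ``ZSTD_decompress_usingDDict()`` on restore), and the window can be sized to the full previous session rather than zlib's 32 KiB.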