Open Bug 1867137 Opened 6 months ago Updated 4 months ago

Session Restore shouldn't save redundant copies of the same data URI for a favicon over and over, if you have hundreds of tabs that share the same favicon

Categories

(Firefox :: Session Restore, defect, P3)

defect

Tracking

()

People

(Reporter: dholbert, Unassigned)

References

Details

I've been investigating some session-restore-serialization hangs that happen periodically (noticeably janking the Firefox parent process), and I'm filing this bug for the first issue that I noticed there that's causing this serialization to be heavier than it should be.

If you've got e.g. hundreds of tabs from the same site open (bugzilla in my case), then Session Store appears to save a distinct copy of the base64-encoded favicon for each tab. For bugzilla.mozilla.org's 128x128 PNG favicon, that data URI is 186969 characters long, i.e. nearly 200KB i.e. 0.2MB. So: if you have 200 bugzilla tabs open, then that represents 200 * 0.2MB = 40MB of json that Firefox needs to re-serialize as part of your session state. That's quite a lot.

I recognize this isn't trivial, but: in the interests of making session-store reserialization a "lightweight" operation (since it janks the frontend process), could we somehow de-duplicate these data URIs and only serialize them once when we're generating JSON?

Summary: Session Restore shouldn't save redundant copies of the same favicon over and over → Session Restore shouldn't save redundant copies of the same data URI for a favicon over and over, if you have hundreds of tabs that share the same favicon

Making things concrete:

  • My regular browsing profile's $PROFILE/sessionstore-backups/recovery.jsonlz4 file is 32MB in size.
  • When I extract that file using https://gist.github.com/sfoster/c17fde43362bdf8173776cd9fcfb9b71 , I get a 139MB JSON file.
  • 59% of that JSON file's content is verbatim repetitions of the bugzilla favicon data-URI (457 repetitions in fact; maybe I have that many bugzilla tabs open in various windows, I guess). If I search-and-replace to remove all exact copies of that data URI string from the JSON, I'm left with a 57MB JSON file (down from 139MB).

So, the upshot is: every I perform some action that requires an update to my saved sessionstore data (e.g. typing a character into a textfield, navigating or opening/closing a tab), Firefox queues up an idle task to redundantly serialize and compress 139MB of text, of which >80MB represents >450 copies of a base64 string representing the bugzilla favicon. This task janks the frontend when it executes (which I notice as skipped frames in videos playing on twitter and YouTube, for example; I've profiled these skipped frames and the profiler shows it to be due to sessionstore writing out its JSON).

This could be made much better if we only serialized a single copy of this 200KB bugzilla favicon and referenced that single copy from multiple places in the sessionstore JSON content.

(In reply to Daniel Holbert [:dholbert] from comment #2)

This could be made much better if we only serialized a single copy of this 200KB bugzilla favicon and referenced that single copy from multiple places in the sessionstore JSON content.

Presumably we have the same issue de-serializing the icon data when we restore a session? We could probably use a unique icons map or object for the icon urls, and replace the inlined value with a reference. We seem to only use $ref in the context of JSON Schemas, but if we could use it here don't think fixing this would be that much work actually.

We also need to be able to incrementally update that persisted state without writing the whole file to disk every time. But that is a separate issue.

Severity: -- → S3
Priority: -- → P3
See Also: → 1867161
See Also: 1867161

FWIW, after those large favicons, the next largest long-strings that I'm seeing in my own sessionstore files are for base64-encoded csp fields, which are thousands of characters long and stored alongside every URL (sometimes many times for the same tab if we've got some saved bfcache history, I think). Those might be candidates for deduplication or removal, too. (It's not obvious to me why we need to store those, if we're going to have to hit the network [and rediscover the CSP] anyway to re-request the actual site content when we restore the tab... Or, if there are cases where we're reinstantiating the tab from cached data (and hence wouldn't rediscover the CSP), maybe the CSP could be stored in that same cache rather than in the reserialized-every-15-seconds session-restore data.

The data URI favicon bit of this bug is more or less a dupe of bug 1583384 right?

(In reply to Daniel Holbert [:dholbert] from comment #4)

FWIW, after those large favicons, the next largest long-strings that I'm seeing in my own sessionstore files are for base64-encoded csp fields, which are thousands of characters long and stored alongside every URL (sometimes many times for the same tab if we've got some saved bfcache history, I think). Those might be candidates for deduplication or removal, too. (It's not obvious to me why we need to store those, if we're going to have to hit the network [and rediscover the CSP] anyway to re-request the actual site content when we restore the tab... Or, if there are cases where we're reinstantiating the tab from cached data (and hence wouldn't rediscover the CSP), maybe the CSP could be stored in that same cache rather than in the reserialized-every-15-seconds session-restore data.

--> Christoph/Freddy - when do we actually need/need the CSP when starting a new load? Only when we inherit principals or something? What would be the best way for session restore to make this determination? (is it just in session restore because triggering principals ended up there and then we split the CSP out from them later on?)

Flags: needinfo?(fbraun)
Flags: needinfo?(ckerschb)
See Also: → 1583384

We need the CSP before starting any load, actually.

I don't know how we synthesize a document from session restore.
Are we creating a document from disk without hitting the network again?
Then we shouldn't forget the CSP.

Flags: needinfo?(fbraun)

(In reply to Frederik Braun [:freddy] from comment #6)

We need the CSP before starting any load, actually.

Why? What CSP governs a document/docshell load from a CSP-less about:blank to e.g. https://foo.com/, other than whatever foo.com returns over the network?

Edit: put differently, what CSP do we use for a load from the awesomebar? My understanding was "none", but...

I don't know how we synthesize a document from session restore.
Are we creating a document from disk without hitting the network again?

We'll always hit necko, though necko may serve from cache (including headers). Session restore is only metadata like title, favicon, form inputs, potentially cookies, etc.

Flags: needinfo?(ckerschb) → needinfo?(fbraun)

Maybe I misunderstand, but the CSP governs whether we load subsequent things.
So, what if we restore a document that includes a subresource that shouldn't be loaded as per the CSP.
If we load the doc without its CSP we would erroneously also load the disallowed subresource, wouldn't we?

Flags: needinfo?(fbraun)

(In reply to Frederik Braun [:freddy] from comment #8)

Maybe I misunderstand, but the CSP governs whether we load subsequent things.
So, what if we restore a document that includes a subresource that shouldn't be loaded as per the CSP.
If we load the doc without its CSP we would erroneously also load the disallowed subresource, wouldn't we?

I would assume that loading the doc is what sets a CSP for the subsequent loads. I'm talking only about the CSP that is used to start the load of the doc in the docshell in the first place, as part of the loadURIOptions - it was introduced in bug 1518454.

Flags: needinfo?(fbraun)

I can see this making sense for top level delivered with a 'sandbox' directive or subdocuments (in case thoes are also in session storage). But this predates my knowledge of the codebase. I think we need to ask ckerschb after all.

Flags: needinfo?(fbraun) → needinfo?(ckerschb)

Hey Gijs, I am not entirely sure what's needed here, but let me add some thoughts which might allow us to move forward.

Generally speaking, the first time we actually need the CSP is when we start parsing the HTML of the document to be loaded. In case we do a pre-fetch of an image triggered through that HTML, then it should already be blocked by the CSP. In terms of code, this is (as far as I can tell) within Document::StartDocumentLoad.

I suppose you are right and the only reason we used to store it in the cache was because the CSP used to hang of the Principal. But as you pointed out correctly, that's not the case anymore. So I suppose we could lift the CSP out of that case, but I don't have a good sense of where the CSP could live.

Flags: needinfo?(ckerschb)

Looks like we've been talking past each other and I apologize for that. A quick chat with ckerschb helped us untangle that for a bit, I believe.

I think the main question is: What do we get from session restore? Just a URL or an artifact of the parsed & rendered HTML?
If all we get is a URL then I agree we don't need to store the CSP in most cases, because it should be retrieved when fetching from network/cache.

However, this is not true for restoring content from URLs with an inherited principal (about:blank, and friends). Though I suppose those might be unable to fetch through session storage since the actual backing data might be long gone.

We started storing the CSP in session history entries in bug 1518454. I think previously it was stored on the principals, but those are also stored in the session history entries.

We use the CSP at least when starting a subframe load when navigating in history (imagine doing a new load in a subframe and then going back). We might be able to get the CSP from the containing document (?), but currently that's how loading works. Because session store also restores the session history, the CSPs get serialized/deserialized too.

You need to log in before you can comment on or make changes to this bug.