Closed Bug 1249333 Opened 9 years ago Closed 4 years ago

plugins data sanitization takes a long time

Categories

(Core Graveyard :: Plug-ins, defect, P3)

defect

Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: mak, Unassigned)

References

Details

I know in the past this issue has already been investigated and I'm not sure we have a lot of space to improve perf, but still it's worth some brainstorming, maybe for alternative clearing paths.

The crash in bug 1248489 made clear some users are constantly hitting the async shutdown timeout abort (1 minute).
The 2 things taking most time to clear (in general, not just on shutdown) based off telemetry are history and plugins. But, while history for a user having clear-history-on-shutdown set is likely small (thus it should be far below the upper measure), I expect the plugins santization time to not be affected by the fact plugins have data to sanitize or not.

The fact is, for about 1% of users (based on FX_SANITIZE_PLUGINS) we take more than 20 seconds(or more than 30 seconds for 0.6%) to clear plugins on shutdown.
I guess we are likely spawning one plugin process per plugin and asking them to sanitize.

20s or more on shutdown is a lot of time, so I wonder if there are ideas to improve this situation. What can we do?

Considered the status of the plugins API, I wonder about things like introducing a whitelist for all versions, to limit number of plugins, or always do this work on startup (but then we may have stale plugin processes around?)
Do we do anything to try and limit the plugins we clear to just those that have been instantiated since the last clear (basically a per-plugin "dirty bit"). With most plugins being CTP now, I'd hope that in the common case it's only Flash that we'd need to spin up at shutdown (and I'd expect it to often be running anyway).

Is there any data on how long it takes plugins to init (in the normal case, ignoring sanitization)? BLOCKED_ON_PLUGIN_INSTANCE_INIT_MS? But that's only showing Flash. Might we worthwhile to add a couple keyed probes to split apart what's happening within FX_SANITIZE_PLUGINS... What plugins are involved, how long do they take to init vs clear vs shutdown?
(In reply to Justin Dolske [:Dolske] from comment #1)
> Do we do anything to try and limit the plugins we clear to just those that
> have been instantiated since the last clear (basically a per-plugin "dirty
> bit").

I don't think so, but someone who works on plug-ins should chime in.

> Might we worthwhile to add a couple keyed probes to split
> apart what's happening within FX_SANITIZE_PLUGINS... What plugins are
> involved, how long do they take to init vs clear vs shutdown?

Considered there are tens of plugins we may not know about, I wonder how we could report that through the current telemetry, also considering possible privacy implications.
It may be simpler to go through the top 5 plugins, check which ones can actually clear data, and only try to clear those ones, I guess.
(In reply to Marco Bonardo [::mak] from comment #0)
> The fact is, for about 1% of users (based on FX_SANITIZE_PLUGINS) we take
> more than 20 seconds(or more than 30 seconds for 0.6%) to clear plugins on
> shutdown.
> I guess we are likely spawning one plugin process per plugin and asking them
> to sanitize.

Because of that we seem to have a flash crash as covered by bug 1232649.
The plugin host only clears site data in one of two cases:

* for the Flash plugin OR
* for a plugin that is already loaded

This avoids loading the plugin process for anything but Flash.

See http://mxr.mozilla.org/mozilla-central/source/dom/plugins/base/nsPluginHost.cpp#1727

I really doubt that any plugin except for Flash supports the clear-site-data API, or that this slowdown is caused by anything except Flash.

What's really weird about all this is that we're not just asking Flash to clear all data. That's probably a really simple/fast things to do. Instead, we're asking the plugin to enumerate all sites with data and then clearing each site individually. If that list is large, or stored in a weird format, that could easily be a very time-consuming operation.
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #4)
> What's really weird about all this is that we're not just asking Flash to
> clear all data. 

Do we have such API? I thought it was implemented the way it is cause was the only available api...
wait, we pass null as domain so we should hit this path

1740   if (domain.IsVoid()) {
1741     return library->NPP_ClearSiteData(nullptr, flags, maxAge, callback);
1742   }
ok, so based on what benjamin said and the code
* we only clear flash or plug-ins that have already been instantiated, it's likely only flash properly supports clearing
* we seem to take the fast path that clears everything, not fetching every page and then clearing them one by one, since we pass a null domain

Thus it's one of these:
1. Flash is really slow clearing its data
2. Flash is not instantiated, instantiating it takes a lot of time
3. Unknown plugin is really slow clearing its data
4. all of them
We may want to add a telemetry probe specifically for flash to exclude 3 and 4.
I'm not sure we can tell from js if flash is enabled? I don't see such a field exposed by nsIPluginTag
when I said enabled, I actually meant "running"
Ah, there is tag.loaded. So looks like we have everything to measure how much it takes to clear a loaded and an unloaded flash.
Do we have a timeout set for plugins to clear their data? If yes do we kill it after it exceeds the allowed time? It would explain what we see on the other bug as mentioned in comment 3.
(In reply to Henrik Skupin (:whimboo) from comment #11)
> Do we have a timeout set for plugins to clear their data? If yes do we kill
> it after it exceeds the allowed time? It would explain what we see on the
> other bug as mentioned in comment 3.

not specifically for plugins, we only have a 1 minute timeout on shutdown.
I don't know if we can easily stop a plugin from clearing out its data. But yes, adding a soft timeout (i.e. one that doesn't crash) for Release might be a good idea.
we could use Promise.race with a timeout, but that won't help shutdown and will false our telemetry data about shutdown. I suppose we will keep waiting for the plugin process, no idea if we terminate plugin processed abruptly when the main process should go away.
I'm looking at the first incoming data regarding Flash sanitizations taking more than 10s to clear data
In the unloaded case: 2.1% 
In the already loaded case: 0,94%

I was expecting we had some cost to pay to start the plugin, but it's still surprising an already loaded plugin takes 10s to clear data.
Maybe we can use our contacts at Adobe to understand why that happens?
I seem to remember that bsmedberg is our link to Adobe. Or perhaps aklotz knows more about that piece of code.
Flags: needinfo?(benjamin)
Flags: needinfo?(aklotz)
Jeromie, what does the Adobe side think about this? We're sometimes seeing NPP_ClearSiteData taking a _long_ time.
Flags: needinfo?(aklotz) → needinfo?(jeclark)
There's not a lot to go on. It sounds like an instance where a parent process terminates, but we don't know about it for whatever reason, and eventually timeout and exit.

If there is a set of STRs that would allow us to reproduce and debug the issue, it should be straightforward.  Alternatively, if it's intermittent, maybe we have engineers collaborate on instrumenting things so that we can clearly identify the sequence of events that leads to this condition.
Flags: needinfo?(jeclark)
(In reply to Jeromie Clark from comment #18)
> There's not a lot to go on. It sounds like an instance where a parent
> process terminates, but we don't know about it for whatever reason, and
> eventually timeout and exit.

Note, this is not the shutdown crash case that we discussed in another bug, the Firefox process is well alive, we are waiting for Flash to complete sanitization, before proceeding with any shutdown.

Telemetry tells us for some users it takes more than 10 seconds (in some cases even more than 30 seconds) just for NPP_ClearSiteData to complete. I wonder if there's any low hanging fruit on your side to generally speed up data clearing.

Unfortunately I don't know if there's some specific situation, maybe it's just lots of data to clear?
Flags: needinfo?(jeclark)
I can't think of any legitimate reason why a memory operation would come anywhere close to 10 (or 1) seconds.  I really don't think this is a performance issue.  We're waiting on something, but we really need to be able to debug it to do anything beyond guess.
Flags: needinfo?(jeclark)
(In reply to Marco Bonardo [::mak] from comment #15)
> I'm looking at the first incoming data regarding Flash sanitizations taking
> more than 10s to clear data
> In the unloaded case: 2.1% 
> In the already loaded case: 0,94%

Might be interesting to run a telemetry analysis to see if these instances are correlated with anything else that might give a clue (session length? low-end hardware?)

(In reply to Jeromie Clark from comment #20)
> I can't think of any legitimate reason why a memory operation would come
> anywhere close to 10 (or 1) seconds.

How much data are we talking about? If it's been paged out, it could be slow. Presumably the cleared storage needs to be reflected on disk, too? Does that block returning?
If the system is thrashing in swap, that's certainly going to affect performance.  We did some memory hardening work over the last several months, and in the process made is so that some classes of objects weren't being garbage collected correctly, which led to sharp increases in memory consumption in very specific situations. That could certainly have led to large swap files in recent versions. 

Assuming well written content (unused objects are freed appropriately and eligible for garbage collection) and a normally operating Flash Player, we try to operate within the footprint of available RAM.  On systems with less ram, we garbage collect more frequently, so you pay for the smaller RAM footprint with increased CPU consumption.  The goal is to not push you into swap, but it's possible to write content that we can't optimize for (open a bunch of NetStream objects on large videos and never call close() or dispose() on them, for instance).

We fixed the issue that would have led to large memory footprints in Flash Player 21.0.0.140, and it would have been publicly available since 21.0.0.0157.  If you have the ability to identify this behavior and correlate it to a Flash Player version, it would be really interesting to know whether or not the frequency drops off after that point.  That would bear out the possibility that this is disk swap.
No longer blocks: 1250424
Marco if this is something we want to keep in product, I think we need one of the following next steps:

* Find a person who's seeing this slowness and collect a profile
* show some progress UI so that users know what's going on (and suspend the shutdown crashing stuff)
* Don't actually clear data on shutdown: instead, if the option is checked tell Flash that we're always in private browsing mode so that it never saves anything
* Remove the UI option
Flags: needinfo?(benjamin)
For now we are addressing the issue with a workaround in bug 1253204. We will basically wait 10 seconds, if by that time we didn't head back, we just let shutdown continue. That will likely mean we ask the flash process to terminate abruptly at a certain point in shutdown. But we won't abort crash anymore.

(In reply to Benjamin Smedberg  [:bsmedberg] from comment #23)
> * Don't actually clear data on shutdown: instead, if the option is checked
> tell Flash that we're always in private browsing mode so that it never saves
> anything

This is an interesting option.
If I read the code correctly, we should set NPNVprivateModeBool... Unfortunately I don't know the plugins code well enough to do that in the right place without the fear to cause troubles.
It could be considered a decent fix for this specific bug and could speed up shutdowns.

That said, for now we'll be covered up by the workaround.
I don't really see how forcing Private Browsing mode helps.  If the assertion in 21 is correct that this is just the system paging memory to disk, private browsing mode doesn't help.  

In terms of Flash behavior, private browsing mode means that Flash is still saving data (similar to cookie behavior in HTML), but it does so in a temporary directory that's destroyed when exiting the private browsing session.  AFAIK, forcing private browsing mode wouldn't eliminate existing writes (if the concern is that they are potentially hanging), and LSOs tend to be small <100K, although it's possible to write huge numbers of them in a tight loop.  It is possible to make really big ones (this is handy for game developers that want to cache sprite maps, as an example), but even on the extreme end for legitimate content, we're talking 10-100MB at the high end.

I wonder if you just killed us in this instance, whether or not it would produce a useful stack that we could debug.
Priority: -- → P3

We're in the process of removing support for plugins (bug 1677160), so I think this bug is irrelevant now.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → INVALID
Product: Core → Core Graveyard
You need to log in before you can comment on or make changes to this bug.