Closed Bug 1398099 Opened 7 years ago Closed 6 years ago

forEach iteration over sparse arrays seems to have pathological worst case

Categories

(Core :: JavaScript Engine, defect, P3)

57 Branch
defect

Tracking


RESOLVED FIXED
mozilla59
Tracking Status
firefox59 --- fixed

People

(Reporter: dimas.sc, Assigned: evilpie)

References

(Blocks 1 open bug)

Details

Attachments

(2 files, 2 obsolete files)

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0
Build ID: 20170907220212

Steps to reproduce:

This has been occurring since I migrated from 56 Beta to 57 Nightly a week ago. It doesn't happen every day; it has happened 3 or 4 times since then.

1. Open Firefox Nightly
2. It restores the tabs from the last session


Actual results:

3. Everything works normally for the next few seconds or minutes. After a random amount of time, no new pages can be loaded and the existing ones cannot be refreshed. The grey loading indicator is shown in the affected tabs. Screenshot: https://i.imgur.com/0EtiHws.png

Possible workarounds:
- Restart Firefox
- Open Task Manager, locate a Nightly process that is using around 30% CPU, and kill it. The tabs instantly load their web pages and Firefox works as expected. Screenshot: https://i.imgur.com/YUveOcq.png
Do you have any add-ons that might interfere?  Does this reproduce with a new profile?
Flags: needinfo?(dimas.sc)
Dimas, can you reproduce this problem in Safe Mode or with your extensions disabled?
P3 for investigation.  

It's possible that there's a misbehaving website causing your problem.  If you can help us narrow down the problem to a particular website or extension (or combination), we can go forward with figuring out the root cause.  Without that info, we're stuck.
Priority: -- → P3
ni on the reporter for an answer to Comment 3.
Flags: needinfo?(dimas.sc)
Missing info from the reporter, not tracking for now, we can reconsider if this becomes actionable in time.
Sorry, I have been out for a few days. I just updated to 57 beta and the problem happened again. A new crash report has been generated, does it help? https://crash-stats.mozilla.com/report/index/daaa7400-4bd3-4ee8-be18-63b0d0170928

I will try to test a bit more...
It seems to be related to the add-on KeeFox; the problem is reported on the extension's GitHub page: https://github.com/kee-org/browser-addon/issues/27

But I suppose that Firefox needs to fix something to avoid the general crash.
Flags: needinfo?(dimas.sc)
See Also: → 1404597
I want to help resolve this but have nothing to go on at the moment beyond the general idea that it happens when KeeFox is enabled. I have never been able to reproduce it on any of my test or production environments, although even if I could, I'm not sure how to efficiently track down a bug like this.

Is there any Firefox debugging tool that can be downloaded and attached to a process that's spinning the CPU in this way? Even just a clue as to the APIs being invoked shortly before the freeze could be a start. Is there a way to run a custom build of Firefox that logs all WebExtension API calls for example?

Have any Firefox developers had a chance to investigate the crash reports supplied by Dimas?

Thanks,
Chris
(KeeFox addon author)
Julian, is the information Dimas provided and all the crash reports enough to start tracking this for 57 release again?

If not, what more can we do to help? Can any devs point us where to look for further helpful information?

Thanks,
Chris
Flags: needinfo?(jcristau)
I don't know.  I'm not the right person to ask here, sorry.
Flags: needinfo?(jcristau)
I guess this is a bug in Firefox. I was using the Kee add-on version 2.0.0.8 for several days in Nightly without any issue across several Nightly updates. Then yesterday I got another update (58.0a1 (2017-10-09) Build ID: 20171009100134) and Firefox froze again.


There seems to be no way to debug this issue on my side, because Firefox just doesn't react anymore and consumes 100% load on one thread until I hard-kill it. It's not a subprocess that freezes for me but Firefox itself. Even when I run Firefox from the terminal, it doesn't report anything.
Note that there seem to be two separate issues: one where Firefox is not responding anymore and has high CPU usage, and one where Firefox is responsive but doesn't make requests or load pages anymore. I created a separate issue on the KeeFox GitHub repo for the "not loading" part: https://github.com/kee-org/browser-addon/issues/38
I hypothesise that the different described behaviours are due to whether the extensions.webextensions.remote pref is enabled or not. Maybe on those systems with that pref enabled there will be a way to gather some more information about what is causing the freeze.
I can confirm that when I enable extensions.webextensions.remote pref on Linux, I get the same experience as the Windows Firefox users. Firefox doesn't freeze completely anymore but the tabs stop loading any data.
I don't see a problem.
I posted more information on Github, maybe the following browser console log and crash log that happened during one of the hangs can help:

https://github.com/kee-org/browser-addon/files/1441110/kee.freeze.browser.console.log
https://crash-stats.mozilla.com/report/index/84cb1d01-afb4-4322-a4bc-3c15d0171103
@Dimas, can you maybe change the title of this issue so that other people can easily discover the connection between the browser hangs and the Kee add-on?
Summary: Firefox 57 Nightly not loading pages → Extension Kee 2.0 causes Firefox 57 to hang / WebExtensions process has 100% CPU usage
Component: Untriaged → Add-ons
Product: Firefox → Tech Evangelism
Version: 57 Branch → Firefox 57
Component: Add-ons → WebExtensions: Untriaged
Product: Tech Evangelism → Toolkit
Version: Firefox 57 → 57 Branch
Here's a heapdump of the process with max CPU usage from the Windows task manager: https://drive.google.com/file/d/0B4C6tG8cJfRmTjNYQllGbWNlaEk/view?usp=sharing
I've loaded the heapdump into VS2017 and hooked up the symbols (https://developer.mozilla.org/en-US/docs/Mozilla/Using_the_Mozilla_symbol_server)

I've looked through the stacks and local vars and, unsurprisingly, can't see anything relating to Kee or any JavaScript code. Being my first foray into the Firefox source code, and without a live instance with the problem to debug, I didn't expect that to be especially enlightening, but I had to try.

@jorgev do you have any suggestions for what we can try next? It feels like we've been fumbling in the dark for a month now, and the deadline when we might have to push this bug out to maybe 50K users is getting very close :-(
Flags: needinfo?(jorge)
That's a bit out of my league. Maybe someone on Andy's team can help you figure this bug out.
Flags: needinfo?(jorge) → needinfo?(amckay)
Any insight Kris?
Flags: needinfo?(kmaglione+bmo)
I can't tell you anything without reliable steps to reproduce. The stacks in the crash report aren't useful, so someone who can reproduce will either need to connect to the process with a debugger, or add logging to the extension to try to figure out where it's happening.
Flags: needinfo?(kmaglione+bmo)
Kris, there are some profile logs posted in https://github.com/kee-org/browser-addon/issues/27#issuecomment-342156741 and below. Are those helpful?
In the two perf-html.io examples posted on the GitHub issue, the WebExtensions content thread has a huge event processing delay. I'm not certain how to interpret the performance data within the charts (mainly since I don't understand a lot of the event terms and the sampling limitations) but a couple of things could be worthy of further investigation.

(1) https://perfht.ml/2m2iYKA
(2) https://perfht.ml/2lZAHCq

A) In (2) there is a massive bar (duration?) for a "DOM close" event. It has a latency of only 0.003ms though so maybe a sampling glitch?

B) In (1) and (2) there are two different JS functions that use forEach and result in a massive amount of time spent in C++ code that appears to be related to testing for property existence.

B is also interesting because yesterday I managed to get brief access to an instance experiencing the bug and was able to attach a VS2017 community edition debugger to the runaway process. I enabled CPU performance reporting and let the process run for a while. I'm not sure if there's any way to export this data for later analysis (advice welcomed if you know how to do this for next time!) but I took a screenshot and will attach it to this ticket.

You'll see that the same NativeHasProperty function is responsible for most of the CPU time. The screenshot is only focussed on around 1.5 seconds of execution time but this is entirely representative of the complete sample.

Both examples post messages to ports that were opened from content scripts and then hit a code path which will often send a new message back to the background script from the message handler in the content script.

The code that calls postMessage from these forEach functions is below:

function updateForegroundTab(tabId) {
    if (kee && kee.tabStates[tabId] && kee.tabStates[tabId].framePorts) {
        if (KeeLog && KeeLog.debug)
            KeeLog.debug("kee activated on tab: " + tabId);
        kee.foregroundTabId = tabId;
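        // Note: framePorts is indexed by frame ID (per the diagnosis later in
        // this bug), so it is effectively a sparse array.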
        kee.tabStates[tabId].framePorts.forEach(port => {
            port.postMessage({ appState: kee.appState, isForegroundTab: true, action: "detectForms" });
        });
        return;
    }
    setTimeout(id => {
        updateForegroundTab(tabId);
    }, 1000, tabId);
}

_pauseKee() {
	KeeLog.debug("Pausing Kee.");
	this.appState.KeePassDatabases = null;
	this.appState.ActiveKeePassDatabaseIndex = -1;
	this.appState.connected = false;
	try {
		kee.browserPopupPort.postMessage({ appState: this.appState });
		// Poke every port. In future might just limit to active tab?
		kee.tabStates.forEach(ts => {
			ts.framePorts.forEach(port => {
				port.postMessage({ appState: this.appState, isForegroundTab: port.sender.tab.id === this.foregroundTabId });
			}, this);
		}, this);
	}
	catch (e) {
		KeeLog.warn("Exception posting message: " + e.message + " : " + e.stack);
	}
	browser.browserAction.setBadgeText({ text: "OFF" });
	browser.browserAction.setBadgeBackgroundColor({ color: "red" });
	commandManager.setupContextMenuItems();
	KeeLog.info("Kee paused.");
}

The version of the add-on I was running when I took the screenshot is https://addons.mozilla.org/firefox/downloads/file/760523/keefox-2.0.0.20b2-an+fx-windows.xpi?src=version-history

The two functions are in the main background/app.js file in case you want a starting point for further investigation into the JS code (lines 1393 and 1959).

I hope you don't mind me asking you to take another quick look at this information Kris in case it gives you any ideas - I know you're really busy at the moment so please let me know if there's someone more suitable I should continue this investigation with.
Flags: needinfo?(kmaglione+bmo)
It's hard to say for sure, since a frozen process, by definition, means we can't rely on profile data from it.

But it looks like the problem may have something to do with the fact that you're using a sparse array of frame IDs to store your ports. It looks like Ion may be trying to optimize it by iterating over the whole range of the array and checking for the presence of each individual property...

A sparse array really isn't the correct data structure for what you're doing. A Map() would be better. But we also really shouldn't choke like this...
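
To illustrate (a minimal sketch, not the extension's actual code):

    // Sparse array: storing one port under index 50000 makes the array's
    // length 50001, and Array.prototype.forEach then has to consider every
    // index up to that length even though only one element exists.
    var framePorts = [];
    framePorts[50000] = { postMessage(msg) { /* ... */ } };
    framePorts.forEach(port => port.postMessage({ action: "detectForms" }));

    // Map: only the entries that were actually set exist, so iteration cost
    // is proportional to the number of ports, not to the largest frame ID.
    var framePortsMap = new Map();
    framePortsMap.set(50000, { postMessage(msg) { /* ... */ } });
    framePortsMap.forEach(port => port.postMessage({ action: "detectForms" }));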
Component: WebExtensions: Untriaged → JavaScript Engine: JIT
Flags: needinfo?(kmaglione+bmo)
Product: Toolkit → Core
Summary: Extension Kee 2.0 causes Firefox 57 to hang / WebExtensions process has 100% CPU usage → Ion forEach iteration over sparse arrays seems to have pathological worst case
This is not Ion specific. forEach basically does:

    for (var k = 0; k < len; k++) {
        if (k in O) {
            callback.call(thisArg, O[k], k, O);
        }
    }

So yeah, that will be slow if you have a large array, especially if we use a sparse representation internally.

I fixed a similar issue for slice, but it's more painful and slower to do that for self-hosted code.
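
For example, a minimal sketch of the worst case:

    var a = [];
    a[999999] = "only element";   // length is now 1000000

    // The callback runs exactly once, but the loop above still performs a
    // "k in a" presence check for each of the 1000000 indices, which is where
    // the time goes once the elements are stored sparsely.
    a.forEach(v => console.log(v));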
Component: JavaScript Engine: JIT → JavaScript Engine
Summary: Ion forEach iteration over sparse arrays seems to have pathological worst case → forEach iteration over sparse arrays seems to have pathological worst case
Flags: needinfo?(amckay)
I think we can mitigate this a little bit by making "in" on sparse arrays faster. I'm going to try that.
This is about 2x faster, though measured on a debug-opt build, so not really a fair comparison. Currently this guards on the receiver's shape, which is obviously a bad idea, because the guard would fail after defining a new indexed property. I need to figure out how to manually check for the prototype situation described in the comment above it. I also need to add back the isNative check in the C++ stub.
The latest Kee release uses Map instead of a sparse array as Kris suggested and this has solved the hanging problem. Thanks for your help diagnosing this!
Blocks: CacheIR
Assignee: nobody → evilpies
Attached patch WIP v2 (obsolete) — Splinter Review
I replaced the shape check with a native object check, so this is now more useful. I started writing tests. I added a check for typed arrays, because those are native and contain indexed properties.
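
A rough sketch of the kind of shell test this could use (assumed, not the actual tests in the patch):

    // Exercise "in" against a sparse array often enough for the new HasProp
    // stub to attach, checking both a present and an absent index.
    function hasIndex(arr, i) {
        return i in arr;
    }
    var sparse = [];
    sparse[100000] = "present";
    for (var i = 0; i < 2000; i++) {
        assertEq(hasIndex(sparse, 100000), true);
        assertEq(hasIndex(sparse, 50000), false);
    }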
Attachment #8927512 - Attachment is obsolete: true
Attachment #8933288 - Attachment is obsolete: true
Attachment #8933890 - Flags: review?(jdemooij)
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Comment on attachment 8933890 [details] [diff] [review]
Specialized HasProp stub for sparse elements

Review of attachment 8933890 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry for the delay. Nice patch, thanks for doing this!

::: js/src/jit/CacheIR.cpp
@@ +1734,2 @@
>  {
> +    MOZ_ASSERT(obj->isNative());

Maybe |obj| should be NativeObject*? Then we can remove the obj->as<NativeObject>() call below. We will need some of these calls in the callers but that is kind of nice too.

::: js/src/jit/VMFunctions.cpp
@@ +1861,5 @@
> +        return true;
> +    }
> +
> +    jsid id = INT_TO_JSID(index);
> +    if (obj->as<NativeObject>().lastProperty()->search(cx, id)) {

Nit: can remove the |as<NativeObject>().| part.

@@ +1872,5 @@
> +    if (MOZ_UNLIKELY(ClassMayResolveId(cx->names(), obj->getClass(), id, obj)))
> +        return false;
> +    // TypedArrayObject are also native and contain indexed properties.
> +    if (MOZ_UNLIKELY(obj->is<TypedArrayObject>()))
> +        return false;

What do you think about doing something like this:

  vp[0].setBoolean(uint32_t(index) < obj->as<TypedArrayObject>().length());
  return true;

I think that should be correct?
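
As a quick JS-level sanity check of that equivalence (just a sketch):

  // For a typed array, "index in ta" with a non-negative integer index is
  // equivalent to a bounds check against the length.
  var ta = new Int32Array(4);
  console.log(0 in ta);  // true  (0 < 4)
  console.log(3 in ta);  // true  (3 < 4)
  console.log(4 in ta);  // false (4 >= 4)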
Attachment #8933890 - Flags: review?(jdemooij) → review+
Good ideas!
Pushed by evilpies@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/3fa61093e722
Specialized HasProp stub for sparse elements. r=jandem
https://hg.mozilla.org/mozilla-central/rev/3fa61093e722
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla59
Tom, I'm wondering if I can roll back the guardIsNativeObject addition and use guardAnyClass (or guardClass(ARRAY)) instead? The attach checks currently end up checking ClassCanHaveExtraProperties, but guardIsNativeObject doesn't enforce that. In practice we don't seem to hit a problem (maybe with module environments I can create an assert failure?), but in general it is hard to reason about whether /any/ NativeObject is legal here.
Flags: needinfo?(evilpies)
(Talked to Tom on IRC. He's okay with using guardAnyClass)
Flags: needinfo?(evilpies)