16 Second Pause in ExtensionUtils.jsm's truncate()

RESOLVED FIXED in Firefox 55

Status

()

defect
RESOLVED FIXED
2 years ago
2 years ago

People

(Reporter: sstangl, Assigned: kmag)

Tracking

({regression})

55 Branch
mozilla57
Points:
---
Dependency tree / graph
Bug Flags:
in-testsuite ?

Firefox Tracking Flags

(firefox-esr52 unaffected, firefox55+ fixed, firefox56+ fixed, firefox57+ fixed)

Details

(Whiteboard: [qf])

Attachments

(3 attachments)

Several users on Reddit are reporting that Firefox 55 is completely unusable for them due to extremely long pauses. I asked them to attempt a recording using the Gecko Profiler. One user was unable to get the profiler installed and running, but another one was able to upload several profiles.

The one this bug is about is here: https://perfht.ml/2vJH3JQ

The user is experiencing a 16 second hang on the main thread under WindowDestroyedEvent::Run(). In ExtensionUtils.jsm, the function truncate() is called on something, which then proceeds to spend 16 whole seconds seeking through an OrderedHashTable, calling onRemove() handlers, and finally remove()ing each item.

The user says, "All addons except session manager and uBlock disabled."

This seems pretty severe, and multiple users are reporting that 55 is effectively unusable for them.

The specific user's report on Reddit is here: https://www.reddit.com/r/firefox/comments/6siseo/loading_websites_in_firefox_very_slow/dlgcl0q/

The full thread is here: https://www.reddit.com/r/firefox/comments/6siseo/loading_websites_in_firefox_very_slow/

I'm really not sure how to diagnose this. Ehsan or bz, would you be able to point me in the right direction, or see what the likely culprit would be? I remember that Ehsan changed some stuff with hash tables recently, but I don't know if that could be related.
Flags: needinfo?(ehsan)
Flags: needinfo?(bzbarsky)
This seems like something Kris might know something about, at least based on blame.

But fundamentally, we're starting off at http://searchfox.org/mozilla-central/rev/4b79f3b23aebb4080ea85e94351fd4046116a957/toolkit/components/extensions/ExtensionContent.jsm#555 where we're told an inner window died.  This calls http://searchfox.org/mozilla-central/rev/4b79f3b23aebb4080ea85e94351fd4046116a957/toolkit/components/extensions/MessageChannel.jsm#782 which adds things to this.abortedResponses.  That's a LimitedSet (of max size 30), so we end up at http://searchfox.org/mozilla-central/rev/4b79f3b23aebb4080ea85e94351fd4046116a957/toolkit/components/extensions/ExtensionUtils.jsm#280 which calls http://searchfox.org/mozilla-central/rev/4b79f3b23aebb4080ea85e94351fd4046116a957/toolkit/components/extensions/ExtensionUtils.jsm#271

But... in this case that should do exactly one this.delete() call per add() call.  That shouldn't be unreasonable.  So maybe we have a ton of this.pendingResponses in abortResponses?  Or we have a ton of DocumentManagers around?  But at first glance DocumentManager is a singleton...
Flags: needinfo?(bzbarsky) → needinfo?(kmaglione+bmo)
If LimitedSet#add() (resp. #truncate()) is called a few thousand times in consecutive order, and no GC happens in that time, the SetIteratorObjects aren't getting collected, which means their internal OrderedHashTable#Range isn't destroyed. In that case we spent a huge amount of time in http://searchfox.org/mozilla-central/rev/4b79f3b23aebb4080ea85e94351fd4046116a957/js/src/ds/OrderedHashTable.h#205-208 updating effectively stale ranges.

Simple shell test case for that scenario:
---
class LimitedSet extends Set {
  constructor(limit, iterable = undefined) {
    super(iterable);
    this.limit = limit;
  }

  truncate(limit) {
    for (let item of this) {
      if (this.size <= limit) {
        break;
      }
      this.delete(item);
    }
  }

  add(item) {
    if (!this.has(item) && this.size >= this.limit) {
      this.truncate(this.limit - 1);
    }
    super.add(item);
  }
}

var ls = new LimitedSet(30);
var t = dateNow();
for (var i = 0; i < 50000; ++i) {
    ls.add(i);

    // GC inactive set iterators to destroy Range instances.
    if (i % 1000 === 0)
        gc(); // <-- Without this explicit GC, the test case needs much more time.
}
print(dateNow() - t);
---
And if this is causing this slow-down, the simplest fix would be to change truncate() to:
---
  truncate(limit) {
    // Iterate over all entries to ensure the SetIteratorObject's Range gets destroyed. (bug 1389381)
    for (let item of this) {
      if (this.size > limit) {
        this.delete(item);
      }
    }
  }
---
Hmm.  Can we nursery-allocate set iterator objects?  That seems like it would help too...
(In reply to Boris Zbarsky [:bz] (vacation Aug 14-27) from comment #4)
> Hmm.  Can we nursery-allocate set iterator objects?  That seems like it
> would help too...

bug 1368420. But for nursery allocation we'd need to have a different implementation for OrderedHashTable#Range, so we can background finalize the iterator objects. (At least that's my understanding.)
[Tracking Requested - why for this release]: Severe perf issue.

Is comment 3 something we could ship in a dot release (and maybe 56 too) while we figure out a better long-term fix?
(In reply to Ryan VanderMeulen [:RyanVM] from comment #6)
> Is comment 3 something we could ship in a dot release (and maybe 56 too)
> while we figure out a better long-term fix?

It seems like we need a way to reproduce the problem so that we know this fixes it for that to be a viable option...
Flags: needinfo?(ehsan)
Based on code inspection, I think this should be a regression from bug 1260548 which first added |this.abortedResponses = new ExtensionUtils.LimitedSet(30);| <https://hg.mozilla.org/mozilla-central/rev/8e67a3e9a5bf#l4.17>
Blocks: 1260548
FWIW I tried installing both the Session Manager and uBlock Origin extensions in a new profile on 55.0.1 and browsed around a bit to see if I can reproduce this issue and didn't succeed.
There's a chance this is caused by bug 1386937, which caused us to add duplicate inner-window-destroyed observers. But that seems pretty unlikely, since only the first observer should actually find any responses to abort, which is why I haven't asked for uplift.

In the worst case, we can probably just remove the code that adds entries to abortedResponses, since it's really only there to prevent unhelpful console spam. But I think I'd rather try the solution in comment 3 first, combined with only truncating when we're something like 25% over the limit.

I'd still like to know exactly why this is happening, though, and find a way to reproduce. This seems like a pretty pathological worst case, and I can't even guess what an extension would have to do in order to wind up with that many responses to abort.


Although, looking at this more closely... When we call abortResponses, the aborted responses are removed from the set from a promise handler, which happens asynchronously. Which means that we should wind up iterating over the already aborted responses again from each observer callback. But even then, we only truncate when the item isn't already in the set, which means we'd have to have more than 30 responses to abort for each iteration.
Flags: needinfo?(kmaglione+bmo)
Also, neither Session Manager nor any of the public versions of μBlock are WebExtensions, so it would be nice to know exactly which extensions they're using.
Ah, looking at the profile some more, it looks like the only WebExtension they're using is Zotero.
So, just running the Zotero add-on for a short while, I get a huge number of console errors. If I had to guess, I'd say their messaging wrapper code (which is pretty hairy) doesn't handle those correctly, and they wind up with a bunch of bunch of messages that are waiting for response callbacks that will never come, and all get canceled when when the window running the content scripts finally closes.

So, given that, I'm just going to fix the issues that causes us to handle this so poorly, and not worry too much about trying to reproduce it.
Depends on: 1386937
Comment on attachment 8896468 [details]
Bug 1389381: Part 1 - Synchronously remove rending responses after aborting them.

https://reviewboard.mozilla.org/r/167714/#review173012
Attachment #8896468 - Flags: review?(aswan) → review+
Comment on attachment 8896469 [details]
Bug 1389381: Part 2 - Add slop to the maximum size of a LimitedSet.

https://reviewboard.mozilla.org/r/167716/#review173014

Good intentions, but where does `this.slop` actually get set?
Attachment #8896469 - Flags: review?(aswan) → review-
Comment on attachment 8896470 [details]
Bug 1389381: Part 3 - Loop over the entire set iterator when truncating a LimitedSet.

https://reviewboard.mozilla.org/r/167718/#review173016
Attachment #8896470 - Flags: review?(aswan) → review+
Comment on attachment 8896469 [details]
Bug 1389381: Part 2 - Add slop to the maximum size of a LimitedSet.

https://reviewboard.mozilla.org/r/167716/#review173088
Attachment #8896469 - Flags: review?(aswan) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/4b63f0f44148519df73c725ab99746d4d686e12f
Bug 1389381: Part 1 - Synchronously remove rending responses after aborting them. r=aswan

https://hg.mozilla.org/integration/mozilla-inbound/rev/364ef71685abd9959f7816810e422a15ec92087f
Bug 1389381: Part 2 - Add slop to the maximum size of a LimitedSet. r=aswan

https://hg.mozilla.org/integration/mozilla-inbound/rev/96fb5933d07336b44f19b77a511981cba96e29f3
Bug 1389381: Part 3 - Loop over the entire set iterator when truncating a LimitedSet. r=aswan
Assignee: nobody → kmaglione+bmo
See Also: → 1389968
Comment on attachment 8896468 [details]
Bug 1389381: Part 1 - Synchronously remove rending responses after aborting them.

Approval Request Comment
[Feature/Bug causing the regression]: Bug 1260548
[User impact if declined]: This can cause significant performance issues in certain cases where extension scripts create large numbers of message listeners which never call their response callbacks.
[Is this code covered by automated tests?]: Yes.
[Has the fix been verified in Nightly?]: N/A
[Needs manual test from QE? If yes, steps to reproduce]: No.
[List of other uplifts needed for the feature/fix]: None.
[Is the change risky?]: No.
[Why is the change risky/not risky?]: This is a fairly trivial change to make sure that aborted response objects are removed from a map synchronously, as expected, rather than only after returning to the main event loop or microtask queue. There should be no change in behavior except in the existing corner cases which currently produce incorrect behavior.
[String changes made/needed]: None.
Attachment #8896468 - Flags: approval-mozilla-release?
Attachment #8896468 - Flags: approval-mozilla-beta?
Comment on attachment 8896469 [details]
Bug 1389381: Part 2 - Add slop to the maximum size of a LimitedSet.

Approval Request Comment
[Feature/Bug causing the regression]: Bug 1260548
[User impact if declined]: This can cause significant performance issues in certain cases where extension scripts create large numbers of message listeners which never call their response callbacks.
[Is this code covered by automated tests?]: Yes.
[Has the fix been verified in Nightly?]: N/A
[Needs manual test from QE? If yes, steps to reproduce]: No.
[List of other uplifts needed for the feature/fix]: None.
[Is the change risky?]: No.
[Why is the change risky/not risky?]: This is a fairly simple change to reduce overhead by pruning entries from a set less often, rather than every time a new entry is added after it reaches its maximum capacity.
[String changes made/needed]: None.
Attachment #8896469 - Flags: approval-mozilla-release?
Attachment #8896469 - Flags: approval-mozilla-beta?
Comment on attachment 8896470 [details]
Bug 1389381: Part 3 - Loop over the entire set iterator when truncating a LimitedSet.

Approval Request Comment
[Feature/Bug causing the regression]: Bug 1260548
[User impact if declined]: This can cause significant performance issues in certain cases where extension scripts create large numbers of message listeners which never call their response callbacks.
[Is this code covered by automated tests?]: Yes.
[Has the fix been verified in Nightly?]: N/A
[Needs manual test from QE? If yes, steps to reproduce]: No.
[List of other uplifts needed for the feature/fix]: None.
[Is the change risky?]: No.
[Why is the change risky/not risky?]: This is a fairly simple change to reduce overhead by looping over an entire iterator, to make sure it's finalized immediately, rather than exiting the loop early. The logic is unchanged, and the extra loop overhead from a maximum of about 30 entries in a JIT-optimized function should not amount to much.
[String changes made/needed]: None.
Attachment #8896470 - Flags: approval-mozilla-release?
Attachment #8896470 - Flags: approval-mozilla-beta?
Would it be possible to have a test to make sure we don't regress here?
Thanks
Comment on attachment 8896468 [details]
Bug 1389381: Part 1 - Synchronously remove rending responses after aborting them.

Fix some severe perf issues, taking it as ride along of 55.0.2
Attachment #8896468 - Flags: approval-mozilla-release?
Attachment #8896468 - Flags: approval-mozilla-release+
Attachment #8896468 - Flags: approval-mozilla-beta?
Attachment #8896468 - Flags: approval-mozilla-beta+
Attachment #8896469 - Flags: approval-mozilla-release?
Attachment #8896469 - Flags: approval-mozilla-release+
Attachment #8896469 - Flags: approval-mozilla-beta?
Attachment #8896469 - Flags: approval-mozilla-beta+
Attachment #8896470 - Flags: approval-mozilla-release?
Attachment #8896470 - Flags: approval-mozilla-release+
Attachment #8896470 - Flags: approval-mozilla-beta?
Attachment #8896470 - Flags: approval-mozilla-beta+
Sean, would it be possible to ask some of the affected Reddit users to try out the CI builds that contain this fix? Should be reasonably safe to do, but backing up their profile first would be a reasonable precaution :)

OSX: https://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-release-macosx64/1502721201/firefox-55.0.2.en-US.mac.dmg
Win32: https://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-release-win32/1502720964/firefox-55.0.2.en-US.win32.zip
Win64: https://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-release-win64/1502720964/firefox-55.0.2.en-US.win64.zip
Flags: needinfo?(sstangl)
We tried to reproduce this issue on 55.0.2-build1 as well, but we didn't managed to due so.
Added to the release notes with "Fix performance regressions with WebExtension"
Ryan: I sent messages 2 days ago, but have received no responses.
Flags: needinfo?(sstangl)
(In reply to Sean Stangl [:sstangl] from comment #35)
> Ryan: I sent messages 2 days ago, but have received no responses.

It's probably not worth worrying about, at this point. I'm pretty confident that this solves the regression, and if there are still issues as bad as this, I'm sure we'll hear about them soon enough.
(In reply to Kris Maglione [:kmag] from comment #36)
> (In reply to Sean Stangl [:sstangl] from comment #35)
> > Ryan: I sent messages 2 days ago, but have received no responses.
> 
> It's probably not worth worrying about, at this point. I'm pretty confident
> that this solves the regression, and if there are still issues as bad as
> this, I'm sure we'll hear about them soon enough.

Hi Kris, we're still seeing a subset of our users on v55.0.2 experiencing 10-15sec startup times. I'm able to partially reproduce on a Win 10 machine running FF 55.0.2 x64. Here's a profile running Ghostery at startup and the browser locks up for about 10sec:

https://perfht.ml/2imUBFC

In this profile I disabled and re-enabled Ghostery around the 80sec mark, and Firefox locked up for about 20 seconds.

https://perfht.ml/2imq9LQ

This had been a massive issue for us recently and I'm grateful that you all were able to patch it in the last release.  Any further help you could provide would be much appreciated. Thanks,
Chris
(In reply to Christopher Tino from comment #37)
> Hi Kris, we're still seeing a subset of our users on v55.0.2 experiencing
> 10-15sec startup times. I'm able to partially reproduce on a Win 10 machine
> running FF 55.0.2 x64. Here's a profile running Ghostery at startup and the
> browser locks up for about 10sec:

There's too much going on in those profiles to make much sense of, but the problems are clearly unrelated to this bug. In both cases, a lot of the overhead is from Sync and SDK extensions (particularly LastPass).

Aside from that, most of the WebExtension overhead seems to be JSON/structured clone overhead in the storage API (some of which was probably fixed by bug 1356546 and bug 1370752), and other random messaging overhead.
It may be worth filing a separate bug for comment 37...
Thanks guys.  We're trying to reproduce the issue based on the info we have from users.  I'll see if I can get a few to run the profiler, then I'll open a new bug if we find anything useful.
You need to log in before you can comment on or make changes to this bug.