Very long cycle collector pauses caused by LastPass 4.21.0.4 while viewing/editing schedule on MedSpace

RESOLVED FIXED in Firefox 65

Status

defect
RESOLVED FIXED
Opened: 8 months ago
Last updated: 5 months ago

People

(Reporter: ehoogeveen, Assigned: smaug)

Tracking

Version: Trunk
Target Milestone: mozilla66
Hardware: All
OS: Windows
Points: ---

Firefox Tracking Flags

(firefox64 wontfix, firefox65 fixed, firefox66 fixed)

Details

(Whiteboard: [qf])

Attachments

(3 attachments)

I recently started working for Dutch company MedWeb to develop their MedSpace web application. MedSpace [1] is a scheduling application for medical professionals and uses IBM Notes/Domino for its backend.

While putting together a MedSpace for the company I noticed that I was getting increasingly bad pauses in Firefox. I did not notice the same pauses in Chrome, though I only did brief testing.

The profiler points to CCSlice as the culprit: On my work PC I've seen pauses of up to 150 seconds while saving/loading the schedule.

Due to the sensitive nature of the data I don't think I can grant access to the offending MedSpace; I could set up a new MedSpace for testing purposes but I only started noticing problems after populating it with a fair amount of data, so it might take some time to set up.

I tried to gather cycle collector logs, which I've uploaded here: https://mega.nz/#!GQ03RYKI!jtaAgtjXecV0vtKX2rIi4sXGZ_neWYij7_JYPF9esCE

I believe the most relevant log is 4720-7 (which was quickly followed by 4720-8). The gc-edges log is much larger than the cc-edges log, but as I mentioned the profiler points to CCSlice as the cause of the long pauses.

I'm using the latest 64-bit Firefox Nightly (though I initially started noticing the pauses on Firefox 63) on Windows 10, and I think the only noteworthy extension I have installed is LastPass (though I've disabled its autofill features and it didn't make a difference).

Let me know if there are any other useful logs I can gather or tests I can do (I've worked on SpiderMonkey as a volunteer so doing local builds is not a problem).

[1] https://www.medspace.nl/
I'm realizing the logs may have been more useful in conjunction with a memory report (and maybe a profile). I'll try to get a full set tomorrow.
Here's a profile, memory report and set of cc logs made with today's Nightly (build ID 20181217220100): https://mega.nz/#!DVkWhSoQ!ySENgXuaKA8J3uV8kBxamVkBDSzYDrEF90oAJP9rXqw

I apologize for the size; for some reason there was a lot going on even before I loaded the schedule. The offending process should be PID 5488, but maybe there are interesting things going on in some of the other processes too.

From the profile, there's over 16 seconds of self time in SnowWhiteKiller::Visit and nearly 3 seconds in its caller, nsPurpleBuffer::VisitEntries (the profiler also seems to be missing a lot of samples though - the total pause was 45 seconds).
Ah! Apparently this isn't just me: Things regressed in Firefox 64, and our customers are running into it on their MedSpaces. I'm going to mark this as a regression and see if I can find a range.
Keywords: regression
OS: Windows 10 → Windows
Hardware: x86_64 → All
Argh, this might not be a recent regression in Firefox after all. It might be a problem exposed by the new version of MedSpace we just rolled out. I also can't reproduce the hangs without LastPass installed - there must be some bad interaction going on there.
In fact, I don't think I can reproduce the problems with LastPass 4.18.1.2 (the version from 2018-10-18); only version 4.21.0.4 (released 5 days ago on 2018-12-13) seems to be affected. That would explain why I started seeing this recently!
Confirmed, this is caused by the latest version of LastPass. I'm going to change the component - this might just be a bug in LastPass, though I have no insight into how it and the cycle collector are clashing.
Component: XPCOM → Extension Compatibility
Product: Core → Firefox
Summary: Very long cycle collector pauses while viewing/editing schedule on MedSpace → Very long cycle collector pauses caused by LastPass 4.21.0.4 while viewing/editing schedule on MedSpace
In the profile, what is in the stacks with SnowWhiteKiller::Visit? That basically should just be destroying CCed objects, so is there any indication of what it is destroying? Maybe there's some C++ objects that are getting created and destroyed rapidly?

We've had previous problems with addons interacting badly with cycle collector tracing, but this sounds different.
This is the stack from the huge pause in the profile:

bool SnowWhiteKiller::Visit(struct nsPurpleBuffer & const, struct nsPurpleBufferEntry *)
nsPurpleBuffer::VisitEntries<SnowWhiteKiller>(SnowWhiteKiller &)
nsCycleCollector::Collect(ccType,js::SliceBudget &,nsICycleCollectorListener *,bool)
nsCycleCollector_collectSlice(js::SliceBudget &,bool)
nsCycleCollector_collectSlice
static void nsJSContext::RunCycleCollectorSlice(class mozilla::TimeStamp)
nsJSContext::RunCycleCollectorSlice
static bool CCRunnerFired(class mozilla::TimeStamp)
[...]

This just loops over the purple buffer and destroys stuff. Presumably something we're destroying is taking a really long time? But I have no idea what it could be, if it isn't showing up in the profiles.

The only other things showing up nested in that frame are _security_check_cookie at 3.1% (I have no idea what that is), xmalloc at 0.2%, and the QI for nsDOMMutationRecord, also at 0.2%. Maybe the nsDOMMutationRecord QI showing up is an indication that we're spending a lot of time destroying nsDOMMutationRecords, or more likely it is just a quirk of the sampling.
Whiteboard: [qf]
Is it okay if I share the profile in the attached file on perf.io so other people can look at it?
Flags: needinfo?(emanuel.hoogeveen)
The memory reports looked completely normal.
(In reply to Andrew McCreight [:mccr8] (PTO-ish 12-22 to -30) from comment #9)
> Is it okay if I share the profile in the attached file on perf.io so other
> people can look at it?

I think that should be fine.

Note that I've opened a bug ticket with LastPass and they say they've passed it onto the team. I linked this bug in the report, but I don't know if they'll use it.
Flags: needinfo?(emanuel.hoogeveen)
(In reply to Andrew McCreight [:mccr8] (PTO-ish 12-22 to -30) from comment #8)
> This is the stack from the huge pause in the profile:
> 
> bool SnowWhiteKiller::Visit(struct nsPurpleBuffer & const, struct
> nsPurpleBufferEntry *)
> nsPurpleBuffer::VisitEntries<SnowWhiteKiller>(SnowWhiteKiller &)
> nsCycleCollector::Collect(ccType,js::SliceBudget &,nsICycleCollectorListener
> *,bool)
> nsCycleCollector_collectSlice(js::SliceBudget &,bool)
> nsCycleCollector_collectSlice
> static void nsJSContext::RunCycleCollectorSlice(class mozilla::TimeStamp)
> nsJSContext::RunCycleCollectorSlice
> static bool CCRunnerFired(class mozilla::TimeStamp)
> [...]
> 
That stack doesn't kill anything.
It should be SnowWhiteKiller's dtor, which calls MaybeKillObject, though perhaps sampling or some compiler optimizations affect what the stack looks like.


> This just loops over the purple buffer and destroys stuff. Presumably
> something we're destroying is taking a really long time? But I have no idea
> what it could be, if it isn't showing up in the profiles.
Or is the purple buffer just massive?


> Maybe the
> nsDOMMutationRecord QI showing up is an indication that we're spending a lot
> of time destroying nsDOMMutationRecord, or more likely it is just a quirk of
> the sampling.
Having a MutationObserver somewhere and adding and removing nodes all the time could lead to high numbers of MutationRecords.
Wrappers for MutationRecords are allocated from the nursery, but the cycle collector still needs to delete the native objects.
It looks like what's happening is that we've got a very long chain of nsDOMMutationRecord objects. Each time we destroy one, it releases its pointer to the next entry, and we do another loop over the purple buffer, which destroys that one, and so on until we've destroyed them all. For each record we destroy, we iterate over the entire contents of the purple buffer again, so if all n of the objects are in the purple buffer to begin with, that becomes an O(n²) operation.

One of the CC logs from this profile has 35,843 nsDOMMutationEntry objects in it. If each is from a separate nsDOMMutationRecord, that's probably enough to cause this sort of pause.
This sounds like a DOM issue, so I'm moving it there, though maybe an extension could work around it.
Component: Extension Compatibility → DOM
Product: Firefox → Core
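For illustration, here is a small JavaScript model of the quadratic behaviour described above. This is a simulation, not Gecko code: it assumes the worst case, where the purple buffer holds the records in the opposite order to their mNext links, so each full sweep can only free the current head of the chain.

    // Toy model (not Gecko code): the buffer holds an n-record chain in the
    // opposite order to the chain's mNext links, so each full sweep over the
    // buffer can only free the current chain head.
    function simulate(n) {
      const alive = new Array(n).fill(true);   // alive[i]: buffer entry i
      const chainIndexAt = i => n - 1 - i;     // buffer order is reversed
      let freeableChainIndex = 0;              // only the head can be freed
      let visits = 0, passes = 0;
      while (alive.some(Boolean)) {
        passes++;
        for (let i = 0; i < n; i++) {
          visits++;
          if (alive[i] && chainIndexAt(i) === freeableChainIndex) {
            alive[i] = false;       // "destroy" the record...
            freeableChainIndex++;   // ...releasing its mNext for a later sweep
          }
        }
      }
      return { passes, visits };
    }

    console.log(simulate(1000)); // { passes: 1000, visits: 1000000 }

With 35,843 records that is on the order of a billion entry visits (somewhat fewer in reality, since the buffer shrinks as entries are freed), which seems consistent with pauses of this length.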
mNext isn't mentioned in Traverse or Unlink, which seems potentially bad in either case. Maybe that would fix this issue? Assuming the first record is destroyed by a CC.
(In reply to Andrew McCreight [:mccr8] (PTO-ish 12-22 to -30) from comment #15)
> mNext isn't mentioned in Traverse or Unlink, which seems potentially bad in
> either case. Maybe that would fix this issue? Assuming the first record is
> destroyed by a CC.

Oh, I'm sorry, I was looking at the wrong Traverse/Unlink functions. It is included.
Olli, could you look at this after the winter holiday? If not, I can try to figure something out.
Flags: needinfo?(bugs)
I guess if a MutationObserver is created and then, after mutations but before the microtask checkpoint, disconnect() is called, we end up with a long chain.
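As an illustration of that scenario, here is a minimal sketch of the kind of content that could build such a chain (my own sketch under that assumption, not the testcase attached later in this bug): it queues a large number of mutation records and then disconnects in the same task, before the microtask checkpoint delivers them.

    // Sketch: queue many mutation records, then disconnect before the
    // microtask checkpoint runs the observer callback.
    const target = document.createElement("div");
    document.body.appendChild(target);

    const observer = new MutationObserver(() => {});
    observer.observe(target, { childList: true, subtree: true });

    // Each append/remove pair queues two mutation records synchronously.
    for (let i = 0; i < 50000; i++) {
      const child = document.createElement("span");
      target.appendChild(child);
      target.removeChild(child);
    }

    // Disconnecting before yielding means the queued records are never
    // delivered to the callback and are left for GC/CC to clean up.
    observer.disconnect();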
(In reply to Olli Pettay [:smaug] (away-ish Dec 21-30) from comment #19)
> I guess if a MutationObserver is created and then, after mutations but before
> the microtask checkpoint, disconnect() is called, we end up with a long chain.

That definitely looks possible. There are a bunch of places where LastPass creates mutation observers and then disconnects them at arbitrary times. This one looks particularly suspicious:

            reset: f = function() {
                c && (c.disconnect(), c = null), l = []
            },
            onChange: function(t) {
                "undefined" != typeof MutationObserver && (c || (c = new MutationObserver(d)).observe(document.body, {
                    childList: !0,
                    attributes: !0,
                    characterData: !0,
                    subtree: !0,
                    attributeOldValue: !0
                }), -1 === l.indexOf(t) && l.push(t), setTimeout(function() {
                    f()
                }, 0))
            }

I'm not entirely clear on all of the interactions between mutation observer dispatch and timeout handlers, though.

And there are a couple of scripts that do something along the lines of:

  var e = function() {
      var e = document.forms[0];
      if (e && "string" == typeof e.action && 0 === e.action.indexOf(LPVARS.alp_url) && e.SAMLResponse) {
          var t = e.parentElement;
          t.removeChild(e), bg.FederatedLogin.validateK1Encryption(e.SAMLResponse.value, function(n) {
              n && (t.appendChild(e), e.submit())
          })
      }
  },
  t = new MutationObserver(function(n) {
      document.body && (e(), t.disconnect())
  });
  t.observe(document, {
      childList: !0,
      subtree: !0
  });
  document.addEventListener("DOMContentLoaded", e);

I'm not entirely sure whether it's possible to get queued mutation records there unless some other mutation observer mutates the document. And this would happen really early in the page load cycle. But I suppose it's possible.
Either way, it seems like we may want to just destroy the entire chain of mutation records when the observer is disconnected, or have the destructor for one mutation record null out the entire chain of mNext pointers if its mNext is non-null.
Or we want to use the normal LAST_RELEASE trick.

(I'm starting to wonder if we should in general call unlink when refcnt drops to 0, but not about this bug.)
Assignee: nobody → bugs
Flags: needinfo?(bugs)
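On the earlier note that an extension might be able to work around this on its side: one purely speculative option (standard MutationObserver API only; it has not been verified that this avoids the long internal record chain in the affected builds) would be to drain the queue before disconnecting, e.g. with a hypothetical helper like:

    // Speculative extension-side mitigation (unverified): drain the queued
    // records before disconnecting instead of dropping them all at once.
    function drainAndDisconnect(observer) {
      const pending = observer.takeRecords(); // returns and empties the queue
      observer.disconnect();
      return pending.length; // caller may inspect or simply discard them
    }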
Posted file a testcase
This helps at least with the testcase.
Attachment #9033760 - Flags: review?(continuation)
Comment on attachment 9033760 [details] [diff] [review]
mo_cc_hang.diff

Review of attachment 9033760 [details] [diff] [review]:
-----------------------------------------------------------------

I think it would be a good idea to add a comment about why we need this.
Attachment #9033760 - Flags: review?(continuation) → review+
Pushed by opettay@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/13191d88873d
break down MutationRecords list so that cycle collector can delete the objects sooner, r=mccr8
https://hg.mozilla.org/mozilla-central/rev/13191d88873d
Status: NEW → RESOLVED
Closed: 8 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla66
Emanuel, want to test Nightly with this fix? I think we could land this to branches too, but better to test first.
Flags: needinfo?(emanuel.hoogeveen)
Yep, I'll test this at work tomorrow. Thanks for the quick fix!
Initial results are very promising: I haven't been able to reproduce any long pauses so far!
Flags: needinfo?(emanuel.hoogeveen)
Hmm, I'm getting tab crashes loading the schedule of a random MedSpace. They look related:

https://crash-stats.mozilla.org/report/index/84b0c900-c4f2-47e6-851b-8c2ae0190102
https://crash-stats.mozilla.org/report/index/8a0b41bb-f01f-434f-9651-9d0820190102

The signature is nsDOMMutationRecord::Release and these are EXCEPTION_STACK_OVERFLOW crashes.
Flags: needinfo?(bugs)
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #32)
> Hmm, I'm getting tab crashes loading the schedule of a random MedSpace. They
> look related:
> 
> https://crash-stats.mozilla.org/report/index/84b0c900-c4f2-47e6-851b-
> 8c2ae0190102
> https://crash-stats.mozilla.org/report/index/8a0b41bb-f01f-434f-9651-
> 9d0820190102
> 
> The signature is nsDOMMutationRecord::Release and these are
> EXCEPTION_STACK_OVERFLOW crashes.

Heh. I was wondering if that would happen. Recursively releasing tens of thousands of mutation records is going to require a lot of stack space.
Bah, OK. Then we need to explicitly loop over the list. Ugly.
Depends on: 1517221
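For context on why an explicit loop is needed, here is a JavaScript analogy (not the actual C++ patch; the names are made up for illustration): releasing a very long chain recursively uses one stack frame per link, much like the recursive nsDOMMutationRecord::Release calls in those crashes, while an explicit loop runs at constant stack depth.

    // Analogy only (not the Gecko patch): tearing down a long linked chain.
    function makeChain(n) {
      let head = null;
      for (let i = 0; i < n; i++) head = { next: head };
      return head;
    }

    // One stack frame per link - typically overflows for chains this long,
    // just like recursing through mNext during Release.
    function freeRecursive(node) {
      if (node === null) return;
      freeRecursive(node.next);
      node.next = null;
    }

    // Constant stack depth: unhook one link per iteration.
    function freeIterative(node) {
      while (node !== null) {
        const next = node.next;
        node.next = null;   // drop the reference so the link can be collected
        node = next;
      }
    }

    try {
      freeRecursive(makeChain(100000)); // usually "too much recursion"
    } catch (e) {
      console.log("recursive teardown failed:", e.message);
    }
    freeIterative(makeChain(100000));   // completes without issue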
What do you think of possibly uplifting this + the fix in bug 1517221 to beta?
Bug 1517221 actually backed out this one.
I'll write a beta patch for bug 1517221.
Flags: needinfo?(bugs)
Fixed for Firefox 65 by the Beta patch in bug 1517221.

\o/ Great to have this fixed for 65!

Component: DOM → DOM: Core & HTML