Closed Bug 1811867 Opened 1 year ago Closed 1 year ago

thebay.com website crashes Firefox due to recursive DOM mutation events

Categories

(Core :: DOM: Core & HTML, defect)

Firefox 108
defect

Tracking

()

RESOLVED WORKSFORME

People

(Reporter: thane, Unassigned)

References

Details

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0

Steps to reproduce:

When you open https://www.thebay.com/ it works at first (the top menu opens, and you can select items - for instance Men, and then Boots). Most of the time, the product list will load, but sometimes, there are no products shown. You can use the top menu two or three times, and then it stops responding (nothing appears when you hover over the top menu. Firefox starts to take up more RAM and CPU time in the Task Manager and reports "This page is slowing Firefox down."

At that point, sometimes the tab will close, but sometimes it takes killing Firefox from the Task Manager.

I've tested this in 107, 108, and 109 with the same results. All machines were using Windows 10 x64 and Firefox x64.

I've refreshed Firefox and done an uninstall/reinstall with no change.

Actual results:

Top menu on site stops responding. Firefox starts to take up more RAM and CPU time in the Task Manager and reports "This page is slowing Firefox down."

Expected results:

Menu should have shown drop down choices.

The Bugbug bot thinks this bug should belong to the 'Core::Performance' component, and is moving the bug to that component. Please correct in case you think the bot is wrong.

Component: Untriaged → Performance
Product: Firefox → Core

I can reproduce the issue just by visiting https://www.thebay.com/c/men (on my macbook on today's nightly).

I've tried diagnosing this (and will continue), but it's a doozy. The slow script warning freezes when I click "debug script", and gives me a spurious stack trace when I click "stop" (and then freezes shortly after).

If I turn this script into a no-op, then I don't see the slow script warning: https://assets.adobedtm.com/aaa144e582c6/0d3a9380a0c9/launch-6afb67a89a6f.min.js

But changing user-agent strings to a Chrome one isn't helping, and a mozregression gives me a range going really far back and without any obvious clues to my eyes: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=7748cc7e9b63d86a40eb4799c0830172da579c84&tochange=f414b9e6d85710f92649566c1c6511265dadd476

I tried commenting out the various mutation listeners I saw in their code, but that isn't leading me anywhere. I will try to comment out various bits of code in that file for a while to see if I can't find a lead, but this is a tough nut to crack.

I encountered this in the wild and was about to open a bug, but it looks like the same issue.

Here's a profile I took on Linux on Nightly: https://share.firefox.dev/3kQEtNk.

The interesting thing I notice is that the stacks are incredibly deep, we spend time throwing stack overflow errors, and there seems to be a recursive loop involving EventHandler::Dispatch:

addclass/< 
each
addclass
j
94/<
dispatch
add/m.handle
runscript
handleevent
dispatch
runscript
runscript
runscript
removeclass/<
each
removeclass
j
94/<
dispatch
add/m.handle
runscript
handleEvent
dispatch
runscript
runscript
runscript

I'm not sure exactly which event is being handled recursively; near the leaf, we report that we're dispatching DOMSubtreeModified, but I don't see any information for the majority of the stack. I'm not sure if that's a limitation of the profiler, or if we're handling some other event recursively that doesn't get reported in the same way.

Collapsing indirect recursion on 94/< (https://www.thebay.com/on/demandware.static/Sites-TheBay-Site/-/en_CA/v1674980862464/js/algoliaSearch.js 5:57292) and add/m.handle (https://www.thebay.com/on/demandware.static/Sites-TheBay-Site/-/en_CA/v1674980862464/js/coreBundle.js 25:37405) makes the insane stack depth disappear.

Looking at this with fresh eyes, the stack trace I see inthe perf profiles was indeed correct, it was just difficult to confirm it due to Firefox freezing. This is the code in their algoliaSearch.js which is triggering the problem:

        $("#algolia-categories-list-placeholder").on("DOMSubtreeModified", (function() {
            j(),
            D()
        }

If I have this function return early, the page un-breaks. But both Chrome and Firefox are firing a lot of mutation events here, because j() is triggering them (while handling them):

        function j() {
            var t = $(".selected-category")
              , e = $("li.ais-HierarchicalMenu-item--selected");
            e.length ? e.siblings().css("display", "none") : $(".ais-HierarchicalMenu-item").css("display", "block"),
            t.removeClass("selected-category-last"),
            $(".selected-category:last").addClass("selected-category-last"),
            $(".selected-category").not($("a.selected-category-last")).css("display", "none"),
            $(".selected-category-last").parent().parent().hasClass("ais-HierarchicalMenu-item--parent") || $(".selected-category-last").parent().parent().parent("ul.ais-HierarchicalMenu-list").children().css("display", "list-item"),
            $(".selected-category-last").closest(".ais-HierarchicalMenu-item--parent").children().children().css("display", "block"),
            $(".selected-category-last").closest(".ais-HierarchicalMenu-item--parent").children().children().children().addClass("aa-category-last-li"),
            $(".ais-HierarchicalMenu-item--parent div a span.aa-category-left-arrow-svg").removeClass("aa-category-last-li"),
            $(".ais-HierarchicalMenu-item--parent div a span.ais-RefinementList-labelText").removeClass("aa-category-last-li"),
            $(".ais-HierarchicalMenu-item--parent div a span.ais-RefinementList-count").removeClass("aa-category-last-li"),
            $(".selected-category-last").closest(".ais-HierarchicalMenu-item--parent").children().children().children().children().children().children(".aa-category-left-arrow-svg").css("display", "none"),
            $(".aa-category-left-arrow-svg").removeClass("aa-category-last-li"),
            $(".ais-HierarchicalMenu-item--parent").css("display", "block"),
            $("a.selected-category-last").css("display", "flex")
        }

If I comment out the latter half of that function as a test, Firefox starts throwing 'too much recursion' errors rather than freezing, so that's definitely the problem here.

However, Chrome doesn't seem to be phased by this. If I add logpoints before j() is run, after j() is run, and when subtreemodified is fired, it looks like Chrome is allowing the function to finish running before firing the next mutation event, whereas Firefox's event loop works differently.

I wonder if we can match Chrome's behavior here, which should reduce the likelihood of a mutation event loop like this freezing up Firefox?

Component: Performance → DOM: Core & HTML
Summary: thebay.com website crashes Firefox → thebay.com website crashes Firefox due to recursive DOM mutation events

It is surprising if Chrome doesn't fire the events synchronously. Those events are supposed to be synchronous.

(In reply to Olli Pettay [:smaug][bugs@pettay.fi] from comment #5)

It is surprising if Chrome doesn't fire the events synchronously. Those events are supposed to be synchronous.

Olli, does that mean, it doesn't look right for us to match Chrome's behavior ? Is there anything we can do on our side?

Flags: needinfo?(smaug)

Is there any ETA on this bug? Sorry to be pressuring anyone, but my client was asking.

Since sounds like this is mostly a bug in thebay, has anyone reported the issue to them?

[Tracking Requested - why for this release]: Firefox crashes on a popular Canadian website.

(I'm assuming it affects all branches.)

Severity: -- → S2
Status: UNCONFIRMED → NEW
Ever confirmed: true

I can't reproduce this now on my Windows machine. It looks like the function j() from Tom's investigation has changed to include additional conditions:

    function j() {
      var t = $('.ais-HierarchicalMenu-item--selected');
      if (t.length) {
        if (t.siblings().hide(), t.length > 1) {
          $('a.ais-HierarchicalMenu-link.selected-category').hide(),
          $('a.ais-HierarchicalMenu-link.selected-category:not():last()').show(),
          $('.ais-HierarchicalMenu-item--parent').show(),
          $('.ais-HierarchicalMenu-item--parent:last()').find('.ais-HierarchicalMenu-list--child').addClass('ml-3');
          var e = $('.ais-HierarchicalMenu-list--child.ml-3 .ais-HierarchicalMenu-item--selected');
          0 === e.find('.ais-HierarchicalMenu-list--child').length && e.siblings().show()
        }
      } else t.siblings().show(),
      $('.ais-HierarchicalMenu-item--parent, .ais-HierarchicalMenu-item').show()
    }

Hi Thane, wonder if you could check whether this is still reproducible for you, please?

Flags: needinfo?(thane)

It's working fine now. A fix on their end, maybe? Thanks for all the attention on this. I'm very impressed on how the Mozilla team took immediate interest and action.

Flags: needinfo?(thane)

(In reply to Thane K. Sherrington from comment #11)

It's working fine now. A fix on their end, maybe? Thanks for all the attention on this. I'm very impressed on how the Mozilla team took immediate interest and action.

Great, thanks for checking! Yeah, it looks like they've made a change in the code.

Status: NEW → RESOLVED
Closed: 1 year ago
Flags: needinfo?(smaug)
Resolution: --- → FIXED

Removing tracking request for nightly as the problem was fixed on the website.

It still seems a little mysterious/concerning that Chrome wasn't running into this problem. If we are interpreting the spec to say that these events must be synchronous, and Chrome is interpreting the spec to say that they can be queued, does that leave us open to a similar problem in the future?

There is no proper specification for Mutation Events. Mutation Events have been basically deprecated for...11 years now.
Properly specified MutationObserver replaced them then.

You need to log in before you can comment on or make changes to this bug.