Closed
Bug 1184569
Opened 9 years ago
Closed 2 years ago
Firefox janks hard when loading a Google Docs presentation
Categories
(Core :: JavaScript Engine, defect)
Tracking
People
(Reporter: mconley, Unassigned)
References
(Blocks 2 open bugs)
Details
(Keywords: perf, Whiteboard: [platform-rel-Google][platform-rel-GoogleSuite][platform-rel-GoogleSlides])
Filing this on behalf of :Adora.

STR:
1) Make sure e10s is disabled
2) Go to https://docs.google.com/presentation/d/11w7wNcYKLS5sskNPoRGBksmKbr01MBMJfRhlvISwxL0/edit#slide=id.p
3) Watch the browser jank really hard.

Profile: http://people.mozilla.org/~bgirard/cleopatra/#report=e3362815e4da3080ad8eab44637c367a901cce3c
Comment 1•8 years ago
If this is still happening we probably need a new profile to figure out what's going on. :-(
Component: General → Untriaged
Comment 2•8 years ago
Hi,

The issue is still reproducible on the latest Nightly (48.0a1, Build ID 20160310030242) on Windows 10 x64 with e10s disabled.

Ran a gecko profile without e10s: https://cleopatra.io/#report=1457ea5b8993006a5cb46d721f86e391050e59cc&selection=0,1,2794,3213,3214,3215,2568,4,6,4,3204
And one with e10s: https://cleopatra.io/#report=82fbbb82eea05b904c05abaaeb1777b2799fca90&selection=0,124,125,126,127,128,129,130,1,131,132,133,134,135,136,137,138,284,139,285,286,287,288,289,290,291

But I'm not sure what component this could be put in. Any suggestions, Gijs?
Flags: needinfo?(gijskruitbosch+bugs)
OS: Unspecified → All
Hardware: Unspecified → All
Comment 3•8 years ago
(In reply to Ciprian Muresan [:cmuresan] from comment #2)
> Hi,
>
> The issue is still reproducible on the latest Nightly (48.0a1, Build ID 20160310030242) on Windows 10 x64 with e10s disabled.
> Ran a gecko profile without e10s:
> https://cleopatra.io/#report=1457ea5b8993006a5cb46d721f86e391050e59cc&selection=0,1,2794,3213,3214,3215,2568,4,6,4,3204
> And one with e10s:
> https://cleopatra.io/#report=82fbbb82eea05b904c05abaaeb1777b2799fca90&selection=0,124,125,126,127,128,129,130,1,131,132,133,134,135,136,137,138,284,139,285,286,287,288,289,290,291
> But I'm not sure what component this could be put in. Any suggestions Gijs?

The profile looks awful - 25% of the total time of the profile is spent in JS land for the page in question, but it's not immediately clear to me why that's so slow and janky, and why we seem to be spending a fraction of that time on this stuff in e10s mode.

Benoit or Mike, can you help find a component / owner for this?
Flags: needinfo?(mconley)
Flags: needinfo?(gijskruitbosch+bugs)
Flags: needinfo?(bgirard)
Comment 4•8 years ago
Yes, in the non-e10s profile we're spending too much time in JS. I see a lot of bailout notifications, so perhaps there's some tweaking we could do in the JS engine. In the e10s profile I see a lot of costly 'IPDL::PContent::SendRpcMessage' sync IPC messages and not so much JS execution.
Flags: needinfo?(bgirard)
Reporter
Comment 5•8 years ago
Those SendRpcMessages are due to an observer notification being proxied to the parent synchronously over the add-on shims. This means there's an add-on installed that's not marked multi-process compatible, and it's janking the browser pretty hard.
Flags: needinfo?(mconley)
Updated•8 years ago
Component: Untriaged → JavaScript Engine
Product: Firefox → Core
Comment 6•8 years ago
mconley, from comment 5 it sounds like this is not really a JS Engine bug. Can you re-assign or WONTFIX? cmuresan, does disabling addons make the problem go away?
Flags: needinfo?(mconley)
Flags: needinfo?(ciprian.muresan)
Reporter
Comment 7•8 years ago
Yes, assuming disabling add-ons makes this problem go away, we'll need a full list of enabled add-ons.
Component: JavaScript Engine → Add-ons
Flags: needinfo?(mconley)
Product: Core → Tech Evangelism
Comment 8•8 years ago
The issue is reproducible with e10s disabled and with add-ons disabled. The jank is not as apparent with e10s enabled. I put it in JS Engine based on comment 4, and because the STR specifically asks for e10s to be disabled.
Component: Add-ons → JavaScript Engine
Flags: needinfo?(ciprian.muresan)
Product: Tech Evangelism → Core
Comment 9•8 years ago
Fair enough. Thanks.
Updated•8 years ago
Blocks: Meta-Hasal-GDoc
Updated•8 years ago
Whiteboard: [platform-rel-Google][platform-rel-GoogleDocs]
Updated•8 years ago
platform-rel: --- → ?
Updated•8 years ago
platform-rel: ? → +
Updated•8 years ago
Flags: needinfo?(bkelly)
Comment 10•8 years ago
nbp, you've had some great results looking into bailout issues, so I'm going to ni? you on this. Thanks!
Flags: needinfo?(nicolas.b.pierron)
Comment 11•8 years ago
Here's a new profile: https://cleopatra.io/#report=3cf1cb633f1aab98158736804a6fb156eb771f34

From what I can tell we spend:
* 2.7 seconds in nsHtml5TreeOpExecutor::RunFlushLoop()
* 1.8 seconds in an XHR's OnStopRequest dispatching an event
* About 2.8 seconds in large timeout callbacks

The nsHtml5TreeOpExecutor and XHR seem to cause most of the jank here. We are still janking during the timeouts, but it starts recovering over that period.
Comment 12•8 years ago
Not sure it's relevant yet, but some of that nsHtml5TreeOpExecutor time is being triggered from the nsHtml5ExecutorReflusher() runnable. This fires on a repeating timer every 50ms: https://dxr.mozilla.org/mozilla-central/source/parser/html/nsHtml5TreeOpExecutor.cpp#278

I wonder if we are backing up the main thread with too many of these runnables.
Comment 13•8 years ago
Ah, the timer only runs for background tabs. Sorry for my confusion.
Flags: needinfo?(bkelly)
Comment 14•8 years ago
My anti-flooding patches help marginally here, but it's still spending a ton of time in JS. Given all the bailouts in the profile, I'll just leave this to nbp to look at. Sorry I couldn't be of more help.
Comment 15•8 years ago
Ben Kelly, thanks for making the profile in comment 11.

From the profile, I spotted only one repeated bailout (~10x "Bailout X from Y", "Invalidate Y", "Bailout X from Y"). This alone is unlikely to cause all the slow-down. I think the issue is similar to a bug that I already started to investigate (Bug 1300848). One observation is that we have a lot of diverse bailouts and invalidations, but none that stands out as the source of the slow-down. I have yet to understand these issues better; in the meantime I am adding blockers to Bug 1300848.

Still, I do not see any such large set of bailouts & invalidations in the original profiles, so this might be a recent regression introduced within the last 6 months, and thus unrelated to the initial issue. I will continue my investigation of this potential regression (from the last 6 months) in Bug 1300848, as I think these might have identical root causes, and also because I managed to get an offline version to reproduce the issue. I will keep the need-info to double check, once Bug 1300848 is fixed, that this is indeed the same issue, but I do not think this is the original issue reported in comment 0.

Is there a way to report Baseline / Ion frames in Cleopatra? The cost of a bailout is the time spent in between, not the number of bailouts. So having a large number of bailouts/invalidations is only a hint, not a symptom.
Updated•8 years ago
Whiteboard: [platform-rel-Google][platform-rel-GoogleDocs] → [platform-rel-Google][platform-rel-GoogleDocs][platform-rel-GoogleSlides]
Updated•8 years ago
Whiteboard: [platform-rel-Google][platform-rel-GoogleDocs][platform-rel-GoogleSlides] → [platform-rel-Google][platform-rel-GoogleSuite][platform-rel-GoogleSlides]
Updated•8 years ago
Rank: 5
Comment 16•7 years ago
(In reply to Nicolas B. Pierron [:nbp] from comment #15)
> Is there a way to report Baseline / Ion frames in Cleopatra?! The cost of a
> bailout is the time spent in between, not the number of bailouts. So having
> a large number of bailout/invalidation is only a hint, not a symptom.

It turns out that there is; Shu added instrumentation for it a long time ago. It was just getting stripped out when the old Cleopatra processed the profiles, which led me to think that we didn't have it.

It isn't all integrated yet, but I have a separate script now that can tell me the percentage of time spent in baseline vs ion for any given script (or even the "timeline" of baseline/ion samples for that script, e.g. BBBBIIBBBBBIBBBBB is a script that runs in baseline, then runs in ion long enough to get running for two samples before it reverts to baseline, etc.) Does that give you what you would want to know here?

Generally, if we had a "profile linter" that could tell you of suspicious behavior, what could it look for? From your discussion in this bug (from which I learned several things; thank you), it seems like a script with repeated bailouts at the same location indicates a problem in SpiderMonkey, and if it is accompanied by a high percentage of time spent in baseline, then it is a likely candidate for a fixable performance problem. Is that accurate? When computing the percentage of time spent in baseline vs ion, would it make sense to ignore the initial warmup time spent in baseline? Or more generally, what do you think would be useful to look for?

Is it always a problem to see repeated invalidating bailouts at the same location? I could imagine that if we are discarding ion code during GC, then perhaps that isn't necessarily bad. (Or if it's the same filename:lineno due to the same code running in separate tabs.) I could speculate as to how to properly define "same location", but I'd like to know what information is helpful here in the first place before figuring out how to make it completely accurate.
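Such timeline strings lend themselves to a simple first metric. A minimal sketch, assuming a format where 'B' is a baseline sample and 'I' is an ion sample (the function name is mine; the actual script mentioned above is not shown in this bug):

```python
from collections import Counter

def tier_percentages(timeline: str) -> dict:
    """Return the fraction of profiler samples spent in each tier,
    given a timeline string where 'B' = baseline and 'I' = ion."""
    counts = Counter(timeline)
    total = sum(counts.values())
    return {tier: n / total for tier, n in counts.items()}

# The example timeline from this comment is heavily baseline-dominated,
# which per the discussion flags a candidate for a fixable perf problem.
print(tier_percentages("BBBBIIBBBBBIBBBBB"))
```

A baseline fraction near 1.0 on a hot script, combined with repeated bailouts at one location, would be the "suspicious behavior" a profile linter could flag.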
Comment 17•7 years ago
(In reply to Steve Fink [:sfink] [:s:] from comment #16)
> (In reply to Nicolas B. Pierron [:nbp] from comment #15)
> > Is there a way to report Baseline / Ion frames in Cleopatra?! The cost of a
> > bailout is the time spent in between, not the number of bailouts. So having
> > a large number of bailout/invalidation is only a hint, not a symptom.
>
> […] I have a separate script now that can tell me the percentage of time
> spent in baseline vs ion for any given script (or even the "timeline" of
> baseline/ion samples for that script, eg BBBBIIBBBBBIBBBBB is a script that
> runs in baseline, then runs in ion long enough to get running for two
> samples before it reverts to baseline, etc.) Does that give you what you
> would want to know here?

Yes, that would be enough to start hacking some metrics, except that this would miss cases where we bail out immediately after the entry into Ion. For that, I guess we can annotate the bailouts with the compilation id (available from the IonScript*), then we can catch Ion entry even if no sample reports Ion execution.

> Generally, if we had a "profile linter" that could tell you of suspicious
> behavior, what could it look for? From your discussion in this bug (from
> which I learned several things; thank you), it seems like a script with
> repeated bailouts at the same location indicates a problem in spidermonkey,
> and if it is accompanied by a high percentage of time spent in baseline,
> then it is a likely candidate for a fixable performance problem. Is that
> accurate?

This is accurate, but as I noticed on Bug 1300848, these are not necessarily the major source of issues. We are looking, for example, at the following profile:

T¹ ............C¹C¹C¹C¹C¹C¹C¹C¹....................C²C²C²C²C²C²C²C²................
T⁰ B⁰B⁰B⁰B⁰B⁰C¹B⁰B⁰B⁰B⁰B⁰B⁰B⁰B⁰C¹I¹I¹I¹B¹B¹B¹B¹C²B¹B¹B¹B¹B¹B¹B¹B¹B¹C²I²I²B²B²B²B²B²

where C = compilation time; I = Ion time; B = Baseline time; T¹ is the compilation thread; T⁰ is the main thread.

We are looking for useless compilations:

  S = (C + I) / I > ~20, for long running (= B + I) scripts.

where S = the speed-up needed to compensate for the compilation time. Otherwise we could be looking for the overall bailout cost:

  2 · (1 - 1 / S) · B

Note, the normal formula does not use the sum of all compilations & executions, but works on a per-recompilation basis. Using the sums is not representative of the average, but as we do not know the expected speed-up (= S) yet, we cannot blame single bailouts from a profile, and using the sum might be a good enough approximation.

> When computing the percentage of time spent in baseline vs ion,
> would it make sense to ignore the initial warmup time spent in baseline?

The initial warm-up in baseline might be interesting if the compilation never ends successfully, in which case high baseline time would still highlight issues.

> Or more generally, what do you think would be useful to look for?

(see the above formulas)

> Is it always a problem to see repeated invalidating bailouts at the same
> location? I could imagine that if we are discarding ion code during GC, then
> perhaps that isn't necessarily bad. (Or if it's the same filename:lineno due
> to the same code running in separate tabs.) I could speculate as to how to
> properly define "same location", but I'd like to know what information is
> helpful here in the first place before figuring out how to make it
> completely accurate.

Repeated invalidation bailouts highlight that we failed to collect the information needed to prevent the same thing from happening the next time. Indeed, a GC would remove tons of information which could have helped avoid such cases, but if a GC causes us to have too many of these repeated invalidation bailouts, then this would still be a problem that we would have to fix.
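The two formulas above translate directly into code. A minimal sketch using the summed C, I, and B times (which, as noted above, is only an approximation of the per-recompilation metric; the function names are mine):

```python
def speedup_needed(c_ms: float, i_ms: float) -> float:
    """S = (C + I) / I: the Ion speed-up required to pay back the
    compilation time C given I ms of Ion execution. Values above
    ~20 flag likely-useless compilations for long-running scripts."""
    return (c_ms + i_ms) / i_ms

def bailout_cost(s: float, b_ms: float) -> float:
    """Approximate overall bailout cost: 2 * (1 - 1/S) * B, i.e. the
    baseline time a successful Ion compilation could have avoided."""
    return 2.0 * (1.0 - 1.0 / s) * b_ms

# A compilation costing 190ms that only ran for 10ms in Ion needs a
# 20x speed-up to pay for itself -- almost certainly wasted work.
s = speedup_needed(190.0, 10.0)
```

A profile linter could walk each script's C/I/B totals, flag scripts where S exceeds the ~20 threshold, and rank the rest by their estimated bailout cost.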
Updated•7 years ago
Whiteboard: [platform-rel-Google][platform-rel-GoogleSuite][platform-rel-GoogleSlides] → [qf:p1][platform-rel-Google][platform-rel-GoogleSuite][platform-rel-GoogleSlides]
Comment 18•7 years ago
Hannes, this bug is now a qf:p1. Is nbp still the right person to make sure this moves forward?
Flags: needinfo?(hv1989)
Comment 19•7 years ago
New profile from the Gecko Profiler: https://perfht.ml/2otEqZ6. Note that comment 0 talks about disabling e10s for historical reasons. There is nothing non-e10s specific about this bug.
Comment 20•7 years ago
(In reply to :Ehsan Akhgari (super long backlog, slow to respond) from comment #19)
> New profile from the Gecko Profiler: https://perfht.ml/2otEqZ6.
>
> Note that comment 0 talks about disabling e10s for historical reasons.
> There is nothing non-e10s specific about this bug.

It does not seem the new profile is JS bound, unless I'm missing something. This bug doesn't have anything else concrete in it -- unsetting this as qf:p1.
Whiteboard: [qf:p1][platform-rel-Google][platform-rel-GoogleSuite][platform-rel-GoogleSlides] → [qf][platform-rel-Google][platform-rel-GoogleSuite][platform-rel-GoogleSlides]
Comment 21•7 years ago
(In reply to Shu-yu Guo [:shu] from comment #20)
> (In reply to :Ehsan Akhgari (super long backlog, slow to respond) from
> comment #19)
> > New profile from the Gecko Profiler: https://perfht.ml/2otEqZ6.
> >
> > Note that comment 0 talks about disabling e10s for historical reasons.
> > There is nothing non-e10s specific about this bug.
>
> It does not seem the new profile is JS bound

It is. GDocs performance in general is JS bound and that's pretty well known.
Comment 22•7 years ago
needinfo Kannan to run his script to see what SpiderMonkey is doing here.
Flags: needinfo?(kvijayan)
Whiteboard: [qf][platform-rel-Google][platform-rel-GoogleSuite][platform-rel-GoogleSlides] → [qf:investigate:p1][platform-rel-Google][platform-rel-GoogleSuite][platform-rel-GoogleSlides]
Comment 23•7 years ago
Numbers for scripts >1MM ticks (I had to reload the page several times to hit the buffer threshold at which the tick counts were dumped). Lots of array stuff, mostly, and the usual culprits like _FinishBoundFunctionInit and _IsConstructing:

std_Math_max => ticks=10460052 count=95994
array_pop => ticks=10536018 count=24336
ToInteger => ticks=10923164 count=108243
std_Array_unshift => ticks=11557166 count=501
std_Array_slice => ticks=13607854 count=1334
ToString => ticks=13610192 count=144282
regexp_clone => ticks=16836900 count=10013
IsArray => ticks=18471105 count=288596
array_shift => ticks=19968867 count=16535
std_Function_apply => ticks=20769472 count=1000
NewArrayIterator => ticks=22165072 count=23576
intl_availableCollations => ticks=27741953 count=1
str_indexOf => ticks=29607368 count=44249
array_indexOf => ticks=44812005 count=9066
String_substr => ticks=46137106 count=137569
array_join => ticks=56424855 count=16457
String_substring => ticks=60051429 count=20958
str_toLowerCase => ticks=67185423 count=47336
array_splice => ticks=88442465 count=15485
RegExpTester => ticks=90288815 count=19427
RegExpSearcher => ticks=90492706 count=26342
ArrayValues => ticks=96327671 count=23405
array_unshift => ticks=189381907 count=76775
RegExpMatcher => ticks=192140874 count=32647
_IsConstructing => ticks=234657803 count=297712
_FinishBoundFunctionInit => ticks=312683955 count=56445
std_Array => ticks=408231750 count=183697
StringSplitString => ticks=438067699 count=78954
array_push => ticks=555615494 count=752127
array_sort => ticks=563290692 count=77097
String_split => ticks=628484359 count=79157
array_slice => ticks=1820460442 count=1662435
Flags: needinfo?(kvijayan)
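A dump like the one in comment 23 can also be ranked by ticks per call rather than total ticks, which separates hot-but-cheap builtins from expensive-per-invocation ones. A rough sketch, assuming the `name => ticks=N count=M` line format shown above:

```python
import re

LINE_RE = re.compile(r"(\w+) => ticks=(\d+) count=(\d+)")

def rank_by_avg_ticks(dump: str) -> list:
    """Parse 'name => ticks=N count=M' lines and return (name, avg)
    pairs sorted by average ticks per call, most expensive first."""
    rows = [(m.group(1), int(m.group(2)) / int(m.group(3)))
            for m in LINE_RE.finditer(dump)]
    return sorted(rows, key=lambda r: r[1], reverse=True)

# Three entries from the dump above: by this metric the one-shot
# intl_availableCollations dwarfs the high-total-tick array_slice.
sample = """\
std_Math_max => ticks=10460052 count=95994
intl_availableCollations => ticks=27741953 count=1
array_slice => ticks=1820460442 count=1662435
"""
for name, avg in rank_by_avg_ticks(sample):
    print(f"{name}: {avg:.0f} ticks/call")
```

Both views are useful: total ticks shows where the time went overall, while ticks per call hints at which individual builtins are unexpectedly slow.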
Comment 24•7 years ago
I added instrumentation for FunctionBind (the self-hosted function that uses _FinishBoundFunctionInit), and re-ran. That's high up there in the tick count too:

FunctionBind => ticks=457743853 count=49300
Updated•7 years ago
Whiteboard: [qf:investigate:p1][platform-rel-Google][platform-rel-GoogleSuite][platform-rel-GoogleSlides] → [qf:p3][platform-rel-Google][platform-rel-GoogleSuite][platform-rel-GoogleSlides]
Comment 25•7 years ago
(Already answered comment 10.)

Answering myself from comment 15: Bug 1300848 is fixed, probably because of the Cache IR work. So this is unlikely to be the same issue.
Flags: needinfo?(nicolas.b.pierron)
Updated•2 years ago
Performance Impact: --- → P3
Whiteboard: [qf:p3][platform-rel-Google][platform-rel-GoogleSuite][platform-rel-GoogleSlides] → [platform-rel-Google][platform-rel-GoogleSuite][platform-rel-GoogleSlides]
Updated•2 years ago
Severity: normal → S3
Comment 26•2 years ago
Clear a needinfo that is pending on an inactive user.
Inactive users most likely will not respond; if the missing information is essential and cannot be collected another way, the bug maybe should be closed as INCOMPLETE.
For more information, please visit auto_nag documentation.
Flags: needinfo?(hv1989)
Comment 27•2 years ago
I can't reproduce this. A lot has changed since 2015 so this was likely fixed on our side and/or by Google Slides changes.
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → WORKSFORME