Closed Bug 1412505 Opened 3 years ago Closed 2 years ago

Background thumbnail request of go.twitch.tv strains CPU

Categories

(Toolkit :: General, defect)

56 Branch
defect
Not set

Tracking

()

RESOLVED FIXED
mozilla59
Tracking Status
firefox-esr52 --- wontfix
firefox56 --- wontfix
firefox57 + fixed
firefox58 + fixed
firefox59 --- verified

People

(Reporter: fvsch, Assigned: Mardak)

References

Details

Attachments

(6 files)

I don’t have fully reproducible steps and am not 100% sure that this specific service worker was at fault, but empirical evidence points at it.

What happened:

1. Yesterday, I used the "Refresh Nightly…" option in about:support (to help investigate a different bug)
2. My profile was refreshed, with bookmarks and preferences reimported, addons reinstalled, etc.
3. I then used "Restart in Safe Mode" to continue testing.
4. After restarting in normal mode (still on the refreshed profile), over the following hours, Nightly was often unresponsive, with CPU spikes at 250%+ over 2 processes (main process and one Web Content process, both around 125% in macOS Activity Monitor).

Disabling all add-ons, restarting Nightly with only the about:newtab open (or even about:blank, I believe) didn't help: routinely, every 2-5 minutes (120s maybe?), a big CPU spike would happen and last for roughly 30-60s.

(Not sure how long the history graph in Activity Monitor is, but you could see the tail of one spike and the beginning of the other on it.)

Using Charles Proxy showed a lot of requests to domains handled by Twitch, eg:
video-edge-2bee62.sjc01.hls.ttvnw.net, twitchcdn.net, in addition to many twitch.tv subdomains.

Removing the go.twitch.tv site data stopped the CPU spikes altogether.

Two hypotheses so far:

1. During the Refresh process, an old Service Worker was imported, and it ended up misbehaving (perhaps because of missing cache/data/something)?

2. My about:newtab page had a Twitch page URL (https://go.twitch.tv/ogamingsc2) pinned in Top Sites. It looks like just having this config is enough for Firefox to visit the page in the background, and for that page to register a Service Worker. So the Service Worker could have been recreated that way, and misbehaved for some reason.


PS: shot in the dark, but maybe related to bug 1291905?
PPS: totally unrelated, but scrolling was smooth most of the time during the CPU spikes, even if many other things were slowed/halted. Congrats for that work!!
Screenshot shows CPU spike with 1 content process, but the issue was similar with 4 content processes (one content process out of 4 would spike, not all 4).

Sadly I did not create a profile with perf-html before deleting the rogue service worker.
Ben, any ideas about sw gone rogue like this, or other potential causes?
Flags: needinfo?(bkelly)
AFAICT this site does not use a service worker.  The attachment in comment 0 is for a normal web worker.  Its contents are:

try {
    importScripts("https://cvp.twitch.tv/2.0.13/worker.min.js");
} catch (e) {
    console.error('Exception in worker import', e);
}
The worker.min.js is a very large minified import.  I'm not sure exactly what it's doing.

If you can reproduce the high CPU usage, you could break the worker in the debugger to see if the CPU usage stops and maybe what operations it's performing.
Flags: needinfo?(bkelly)
I have a perf-html profile, not sure if it shows something useful:
https://perfht.ml/2loWIue

Context:
1. On the New Tab page I re-added a "top site" for a Twitch page: https://go.twitch.tv/ogamingsc2
2. Opened a dozen New Tab pages. After a while I could see a web worker in about:debugging.
Summary: Service Worker for go.twitch.tv goes rogue, routinely sends requests, strains CPU → Web Worker for go.twitch.tv goes rogue, routinely sends requests, strains CPU
(Comment sent too soon, sorry)

3. Used the browser for a while, then left it for a few hours.
4. When using it again, on a page load (loading https://en.wikipedia.org/wiki/Wolfenstein_3D) I noticed Fx becoming slow, UI unresponsive (e.g. the loading indicator stutters and/or freezes).
5. It took me about 10s to start recording the profile, so the recording might begin 20-30s after the start of the CPU spike.
6. CPU spike continued for roughly one minute.

Finally, while commenting on this bug I had another spike, so here's a second profile:
https://perfht.ml/2loQz14
I've found a way to reproduce/trigger this random bug within seconds, every time, on two different PCs. It happens so often it has become extremely annoying. It also occurs on a fresh profile with no extensions installed.

Firefox 56 or 57 Beta, Windows 10.

Please see the following links:

https://www.youtube.com/watch?v=P1bFMQRjzkY
https://www.youtube.com/watch?v=7OsOjuvgWyI
https://perfht.ml/2A9ezII
https://crash-stats.mozilla.com/report/index/aa389bde-7313-4b44-80ea-0acfe0171102
Component: General → DOM: Workers
Product: Firefox → Core
These profiles don't have any worker thread information.  Please add "DOM Worker" to the thread filter list in the profiler configuration dropdown.  Please see the attached screenshot for where to set it.
No problem. As I mentioned, it's easy to recreate the bug on demand. This time it actually crashed the tab as well and needed to be restored.

https://perfht.ml/2AaLM6E
https://perfht.ml/2xOoAti
The worker thread is completely idle in the first profile in comment 11.  It's blocked waiting on its event queue for any work to do.

The second profile doesn't have a worker thread for the page at all AFAICT.

I do, however, see vsync running and the compositor is quite busy.  There are also two compositor threads, which I don't understand.

There are also a ton of nsIChannel requests going on in the main thread in the first profile.  The second profile is doing a ton of cycle collector work.

Honestly, I'm not sure what to make of this.  I don't see any worker thread activity at all, though, so I'm not sure I can help.

Moving this up to DOM for now since I don't see any worker activity.  It's possible it's not a DOM issue either, but I don't know where else to put this.  Could be graphics, network, etc.

Mike, is there any chance you could look at the work flow documented in comment 6 videos and the profiles?  It feels like there is a bigger problem here.
Component: DOM: Workers → DOM
Flags: needinfo?(mconley)
Summary: Web Worker for go.twitch.tv goes rogue, routinely sends requests, strains CPU → go.twitch.tv routinely sends requests, strains CPU
Hi Florens,

Are you able to reproduce if you go to about:config, and create a new bool pref:

browser.pagethumbnails.capturing_disabled

and set it to true? You might need to restart the browser for this to have the intended effect.
Flags: needinfo?(mconley) → needinfo?(florens)
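For reference, the same pref suggested above can also be set persistently via a `user.js` file in the profile directory (a sketch of the workaround only; the pref name comes from the comment above, and as noted a restart may be needed for it to take effect):

```javascript
// user.js — disables ALL background page-thumbnail capturing,
// not just for twitch.tv (workaround, not a fix)
user_pref("browser.pagethumbnails.capturing_disabled", true);
```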
Interesting. I know that's directed at Florens, but browser.pagethumbnails.capturing_disabled has prevented me from recreating the bug using the "New Tab" method. It now sits around 500MB instead of spiking up to 3000MB and beyond. Time will tell if it has eliminated it completely. I hope it does, it has been annoying me for weeks. Thanks.
I've experienced the same issue — very easily reproduced as well, just add a twitch profile to the Top Sites (https://go.twitch.tv/user_profile). The CPU spikes on browser start and every 5-10 minutes for me.

Removing twitch url from Top Sites or setting browser.pagethumbnails.capturing_disabled fixes it.
Hi. Setting browser.pagethumbnails.capturing_disabled=true does seem to stop this issue.

Steps to reproduce:
1. Create a new profile in about:profiles, and open Nightly with this profile
2. In about:newtab, add a new Top Site with this URL: https://go.twitch.tv/giantwaffle (any go.twitch.tv URL seems to work)

Result: a long CPU spike happens. In my latest test it seemed to take 120 seconds.
I captured this profile (step 2 was at roughly 5-10 seconds in): https://perfht.ml/2yswGMD
But the profile is mostly blank during the spike, so maybe I’m not recording the right threads (config was: "GeckoMain,Compositor,DOM Worker").

Steps to NOT reproduce:
1. Create a new profile in about:profiles, and open Nightly with this profile
2. In about:config, create a new pref browser.pagethumbnails.capturing_disabled=true
3. In about:newtab, add a new Top Site with this URL: https://go.twitch.tv/giantwaffle

Result: no CPU spike, and the Top Site shows the letter "T" and no site thumbnail.
Captured profile (shorter, nothing unusual happening on CPU): https://perfht.ml/2yr5b6i
Flags: needinfo?(florens)
Ok, seems like an issue with the thumbnail capturing on about:newtab.  Since most (all?) of the reports here are from nightly I'm putting this in the activity stream version of newtab.

Ed, can you guys take a look at this?  It's possible there is still a platform issue, but if you could look at it from the activity stream and PageThumbs.jsm side, that would be great.
Component: DOM → Activity Streams: Newtab
Flags: needinfo?(edilee)
Product: Core → Firefox
Note that the thumbnailer is the same one that the previous about:newtab page used. I _think_ adw hacked on that way back in the day.

Hey adw, any idea what might be happening here?
Flags: needinfo?(adw)
Duplicate of this bug: 1414203
Do you see a thumbnail appear on about:newtab for the twitch page? The fact that it happens every 5 minutes seems to indicate the thumbnail failed or a new thumbnail is being requested.
If people could try to reproduce this on older versions of Firefox without Activity Stream, I think that would be helpful.  More generally, a regression window would be really helpful.  It seems like we would have heard before about this problem for a really popular site like Twitch, if it were an old problem.

I'm not sure how Activity Stream is using or triggering the background thumbnailer, but I wouldn't think that it's doing anything wrong to end up triggering the thumbnailer over and over again like this.  Comment 0 mentions a pinned top site on about:newtab, so it's possible the problem precedes Activity Stream.  Although I do recall doing one or two reviews on the background thumbnailer for Activity Stream people, so maybe those changes are related.
Flags: needinfo?(adw)
(In reply to Drew Willcoxon :adw from comment #19)
> Although I do recall doing one or two reviews on the background
> thumbnailer for Activity Stream people, so maybe those changes
> are related.

Not to mention recent changes to the thumbnailer that I didn't review of course.  Looking over the changelog for backgroundPageThumbsContent.js in particular, I'd suspect something there.
The "every 5 minutes" is suspiciously bug 1411637 introduced by bug 1410541 and is being fixed in the next export with bug 1413550. That's why I was asking about if the thumbnail is failing.

I just tested with a clean profile, and adding/editing/pinning https://go.twitch.tv/ogamingsc2 as a top site does trigger a thumbnail request that seems to never complete. And indeed the console there has thousands of error messages from the page (as it's a live streaming page) in addition to `getScreenshot error: Error: page-thumbnail:error  Screenshots.jsm:61`

The underlying thumbnail request and high CPU aren't being fixed by bug 1411637, but at least it will know not to retry a thumbnail.

ursula has a patch for bug 1413650 to be smarter about capturing a screenshot from an open tab; in particular, open tabs of a *pinned* top site are incorrectly *not* being used for screenshots right now.

I can confirm that even without the fix from bug 1413650, users who have restarted since getting a go.twitch.tv page as a top site and have opened that page in a tab can get a thumbnail. And from then on, the backgroundpagethumb shouldn't even be called for that page.

Even with the fixes for bug 1411637 and bug 1413650, a newly pinned go.twitch.tv page will immediately trigger a backgroundpagethumb request, leading to this bug. I believe it is by design to try to get a thumbnail immediately, so the true fix for this bug would be to kill off the request faster?

I'm removing the "routinely" part as that is mostly being fixed in bug 1411637. (Although it will still happen on restart but less likely to happen with bug 1413650 when a thumbnail is actually captured.)
Depends on: 1411637, 1413650
Flags: needinfo?(edilee)
Summary: go.twitch.tv routinely sends requests, strains CPU → Background thumbnail request of go.twitch.tv strains CPU
[Tracking Requested - why for this release]:

Do we know yet if this affects 57?  We're nearing release, but this sounds bad enough that it might be worth taking a late uplift.

Flagging, to get this on release-drivers radar.
I am able to trigger the bug on 57.0b14 the same way as 56.0.2. It can also be triggered on www.twitch.tv (may require you to be logged in to Twitch to disable the redirect to go.twitch.tv).
Attached image firefox 56 affected
Attached image firefox esr 52 affected
This thumbnailing bug has been around for a while: firefox esr 52 attachment 8925224 [details] and firefox 56 attachment 8925223 [details]. The originally filed bug of high CPU every 5 minutes was only introduced in the previous export of activity stream per comment 21.
Tracking for 57 since this sounds like a serious problem on a popular site.
We could still consider a patch for a dot release or if that's unrealistic, aim to fix in 58. 

Peter can you help find someone to take this on?
Flags: needinfo?(pdehaan)
Theoretically audio/video elements are already prevented from loading:
https://searchfox.org/mozilla-central/source/toolkit/components/thumbnails/content/backgroundPageThumbsContent.js#36-50

If I do this on a clean profile:
Cu.import("resource://gre/modules/BackgroundPageThumbs.jsm"); BackgroundPageThumbs.captureIfMissing("https://go.twitch.tv/")

There's over 200 of these messages in the console:
JavaScript warning: https://cvp.twitch.tv/2.1.0/mediaplayer.min.js, line 1: Error: This error message will be blank when privacy.resistFingerprinting = true.  If it is really necessary, please add it to the whitelist in MediaError::GetMessage:

And indeed if I set `docShell.allowJavascript = false` in backgroundPageThumbsContent.js, a not really useful screenshot is captured without high CPU.

So my guess is that twitch javascript is detecting a failed video and is retrying constantly.

A lower risk / scoped fix of blacklisting twitch.tv from BackgroundPageThumbs would avoid this specific issue.
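Ed's hypothesis above can be sketched as follows. This is a hypothetical illustration, not Twitch's actual code (`makePlayer` and `startPlayback` are invented names): a player that immediately retries on every media error will loop forever in an environment where media is always blocked, as it is in the background thumbnail browser.

```javascript
// Hypothetical retry-on-media-error pattern (invented names, for illustration).
// If playback always fails -- e.g. media is blocked in the thumbnail
// browser -- the retry loop only stops at the cap. A site with no cap
// would spin indefinitely, matching the sustained CPU spikes seen here.
function makePlayer(startPlayback) {
  let attempts = 0;
  const maxAttempts = 5; // real sites may have no cap, which is the problem
  function tryPlay() {
    attempts++;
    try {
      startPlayback();
    } catch (err) {
      if (attempts < maxAttempts) tryPlay(); // immediate retry on media error
    }
  }
  tryPlay();
  return attempts;
}

// Simulate an environment where playback always fails (media blocked):
const attempts = makePlayer(() => { throw new Error("media blocked"); });
```

This also illustrates why making the media error silent to page JS (the approach later taken in bug 1417869) addresses the root cause rather than one site.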
Is there some way to detect excessive CPU and interrupt/stop/blacklist the page automatically?  It would be nice not to have to hunt down these problems for every page.
The site handling JS errors by endlessly retrying, combined with the fact that this is not a new regression, would lead me to think that for 57/58 we either blacklist this site explicitly from the thumbnailer and dot-release/uplift, or 'wontfix' for imminent releases and start pursuing a 'proper' solution like the one Ben outlined.

Liz would you be OK with that line of reasoning?

My inclination is to 'wontfix' for 57, assign it to my team and try to get an upliftable/landable patch for 58 or 59, but I don't have a good feel for overall severity.

Ursula, could you dig into this one a bit and figure out if there's a workable solution?
Flags: needinfo?(usarracini)
Flags: needinfo?(lhenry)
Comment on attachment 8928036 [details]
Bug 1412505 - Background thumbnail request of go.twitch.tv strains CPU.

https://reviewboard.mozilla.org/r/199264/#review204318

This idea is OK with me as a short-term fix, but `url.includes("twitch.tv/")` seems not great.  Can you just create an nsIURI from the url and check the host instead?
Actually, using nsIEffectiveTLDService (on an nsIURI) is the way to go I would think.
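The review comments above can be illustrated with a minimal sketch. In Firefox chrome code this would use nsIEffectiveTLDService (`Services.eTLD.getBaseDomain`) on an nsIURI as suggested; here, a naive last-two-labels approximation (which does not handle multi-label public suffixes like `.co.uk`) stands in for illustration only, and `isBlockedForThumbnails` is an invented name, not the actual patch:

```javascript
// Why host-based matching beats url.includes("twitch.tv/"):
// the substring check also matches URLs that merely contain
// "twitch.tv/" in their path, while a host check does not.
const BLOCKED_BASE_DOMAINS = new Set(["twitch.tv"]);

function naiveBaseDomain(host) {
  // Approximation only; real code must consult the public-suffix list
  // via nsIEffectiveTLDService.
  return host.split(".").slice(-2).join(".");
}

function isBlockedForThumbnails(urlString) {
  let host;
  try {
    host = new URL(urlString).hostname;
  } catch (e) {
    return false; // unparseable URLs are not blocked
  }
  return BLOCKED_BASE_DOMAINS.has(naiveBaseDomain(host));
}
```

For example, `https://example.com/twitch.tv/page` passes the substring check but is correctly not blocked by the host check, while any go.twitch.tv or www.twitch.tv URL is.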
Redirecting n-i to Ritu as she is the 57 release owner.
Flags: needinfo?(lhenry) → needinfo?(rkothari)
Looks like Ed has this covered. Thanks!
Flags: needinfo?(usarracini)
Comment on attachment 8928036 [details]
Bug 1412505 - Background thumbnail request of go.twitch.tv strains CPU.

https://reviewboard.mozilla.org/r/199264/#review204596

Thanks Ed.
Attachment #8928036 - Flags: review?(adw) → review+
Pushed by edilee@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/bf0e4dd83b3b
Background thumbnail request of go.twitch.tv strains CPU. r=adw
See Also: → 1417221
https://hg.mozilla.org/mozilla-central/rev/bf0e4dd83b3b
Status: UNCONFIRMED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 59
As of yesterday, go.twitch.tv is no longer being used as a subdomain for twitch.tv.
If this patch is applicable to FF58, can it possibly land for 58?
Assignee: nobody → edilee
Component: Activity Streams: Newtab → General
Product: Firefox → Toolkit
Target Milestone: Firefox 59 → mozilla59
Version: unspecified → 56 Branch
The patch was written to be small / upliftable to 58, so if someone can verify it's fixed on 59, we can request uplift.
See Also: → 1417869
You can request uplift now for 58 if you like.  The next beta build will be next Monday. 

Andrei, can your team verify the fix? Thanks!
Flags: needinfo?(andrei.vaida)
I think this change should be reverted. It gets around the problem in a very crude fashion, and doesn't prevent other sites from having the same problem. 

We will prevent the problem from happening in bug 1417869 by making the media error silent to the JS, so websites won't be tempted to retry again and again.

ni? Myself so that a new bug is created to track the issue.
Flags: needinfo?(jyavenard)
jya, is the silent erroring intended to be upliftable to 58? If not, we should perhaps still keep this fix for 58, but yes definitely a nicer fix would be appreciated.
Will revert this change in bug 1417869. It will be upliftable to 58.

In the meantime, I contacted Twitch about this problem; they are currently testing a fix.
Flags: needinfo?(jyavenard)
Build ID: 20171121220115
User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:59.0) Gecko/20100101 Firefox/59.0

Verified as fixed on Firefox Nightly 59.0a1 on Windows 10 x64, Windows 7 x32, Mac OS X 10.12 and Ubuntu 16.04 x64.
Flags: needinfo?(andrei.vaida)
My testing was based on CPU usage. For example, using Firefox Nightly 58.0a1 (2017-10-27) on Windows 10 x64 and accessing any of the channels from twitch.tv, CPU usage reached 50-60%, but with Firefox Nightly 59.0a1 (2017-11-21) CPU usage peaked at 7-8%.
Flags: needinfo?(pdehaan)
Bug 1417869 was uplifted to beta and release, is this one still a problem there?
See Also: → 1421306
See Also: → 1384094
You need to log in before you can comment on or make changes to this bug.