Closed Bug 978439 Opened 10 years ago Closed 10 years ago

Plugin container eating 100% CPU with Adobe Flash and NouVeau Xorg Server

Categories

(Core :: General, defect, P3)

x86_64
Linux
defect

Tracking

()

RESOLVED WORKSFORME

People

(Reporter: fredbezies, Unassigned)

Details

Attachments

(4 files, 1 obsolete file)

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:30.0) Gecko/20100101 Firefox/30.0 (Beta/Release)
Build ID: 20140301102433

Steps to reproduce:

Since a few days (just after patch for bug #973192 was included ?), plugin-container is going "crazy" and eats all CPU on flash enabled site like youtube.

To reproduce it ?

1) Have a linux distribution with nouveau Xorg server.
2) Go to youtube
3) Click on connect button



Actual results:

Plugin container eating 100% CPU.


Expected results:

Plugin container not eating all CPU ?
Some technical information : Flash plugin : 11.2.202.341

Using Archlinux 64 bits.

NouVeau : 1.0.10
(In reply to Frederic Bezies from comment #0)
> Since a few days (just after patch for bug #973192 was included ?), [...]

Could you confirm this by narrowing it down with mozregression?

http://mozilla.github.io/mozregression/
Flags: needinfo?(fredbezies)
Cannot use this tool because of Python3 on my archlinux. Will try to track regression manually.
Flags: needinfo?(fredbezies)
After fighting a lot, I got a regression "window", and I hope it will get smaller soon.

Last good revision : d275eebfae04 -> February 14th
First bad revision : b80f7eece913 -> February 15th

Mozregression is stuck to this now :

... attempting to bisect inbound builds (starting from previous day, to make sure no inbound revision is missed)
Downloading build from: http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2014/02/2014-02-13-03-02-01-mozilla-central/firefox-30.0a1.en-US.linux-x86_64.tar.bz2
===== Downloaded 100% =====
Installing nightly
Getting inbound builds between a62bde1d6efe and b80f7eece913
Mozregression results :

Last good revision: a62bde1d6efe
First bad revision: c8f504ccd3c3
Pushlog:
https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=a62bde1d6efe&tochange=c8f504ccd3c3

Will try to see with homemade builds and find guilty commit. Will be very long. If you have an idea, I'll be pleased to hear it !
Got this in terminal when plugin-container is going crazy.

[26226] WARNING: [PluginModuleParent::ActorDestroy] abnormal shutdown without minidump!: file /home/fred/.moz-commitbuilder-cache/mozbuild-trunk/dom/plugins/ipc/PluginModuleParent.cpp, line 715
For application/x-shockwave-flash found plugin libflashplayer.so
[26341] WARNING: XPCOM objects created/destroyed from static ctor/dtor: file /home/fred/.moz-commitbuilder-cache/mozbuild-trunk/xpcom/base/nsTraceRefcntImpl.cpp, line 142
[26341] WARNING: XPCOM objects created/destroyed from static ctor/dtor: file /home/fred/.moz-commitbuilder-cache/mozbuild-trunk/xpcom/base/nsTraceRefcntImpl.cpp, line 142
LoadPlugin() /usr/lib/mozilla/plugins/libflashplayer.so returned 7f8e1df4c5e0
And another log part :

nsPluginNativeWindowGtk: NPPVpluginNeedsXEmbed=1
nsPluginNativeWindowGtk: call SetWindow with xid=0x1e007a7
nsPluginNativeWindowGtk: call SetWindow with xid=0x1e007a7
[3632] WARNING: blocked access to response header: file /home/fred/.moz-commitbuilder-cache/mozbuild-trunk/content/base/src/nsXMLHttpRequest.cpp, line 1203
For application/x-shockwave-flash found plugin libflashplayer.so
[3708] WARNING: '!compMgr', file /home/fred/.moz-commitbuilder-cache/mozbuild-trunk/xpcom/glue/nsComponentManagerUtils.cpp, line 59
For application/x-shockwave-flash found plugin libflashplayer.so
[3632] WARNING: NS_ENSURE_TRUE(mTransactionPump || mCachePump) failed: file /home/fred/.moz-commitbuilder-cache/mozbuild-trunk/netwerk/protocol/http/nsHttpChannel.cpp, line 5374
For application/x-shockwave-flash found plugin libflashplayer.so
nsPluginNativeWindowGtk: NPPVpluginNeedsXEmbed=1
nsPluginNativeWindowGtk: call SetWindow with xid=0x1e008f7
nsPluginNativeWindowGtk: call SetWindow with xid=0x1e008f7

Cannot find any files modified between February 14 and February 15 in it :(
(In reply to Frederic Bezies from comment #7)
> Got this in terminal when plugin-container is going crazy.
> 
> [26226] WARNING: [PluginModuleParent::ActorDestroy] abnormal shutdown
> without minidump!: file
> /home/fred/.moz-commitbuilder-cache/mozbuild-trunk/dom/plugins/ipc/
> PluginModuleParent.cpp, line 715
> For application/x-shockwave-flash found plugin libflashplayer.so
> [26341] WARNING: XPCOM objects created/destroyed from static ctor/dtor: file
> /home/fred/.moz-commitbuilder-cache/mozbuild-trunk/xpcom/base/
> nsTraceRefcntImpl.cpp, line 142

This looks like the plugin actually gets killed for timeouts, but then loaded again? Is the site triggering this or are you manually reloading?

Otherwise nothing in particular stands out to me so far, if you can narrow this down to a specific commit that would be great!
Killed it in order to "unfreeze" firefox.

Building gave nothing interesting. Will use tinderbox builds. Faster and more useful :D
Using ftp://ftp.mozilla.org/pub/firefox/tinderbox-builds/mozilla-central-linux64/ I have this regression window :

Last working tinderbox build : https://hg.mozilla.org/mozilla-central/rev/23f7a629a217
First broken tinderbox build : https://hg.mozilla.org/mozilla-central/rev/5d7caa093f4f

Don't have time for now using inbound tinderbox builds, will do it later this afternoon.
And the $64,000 question : which one is "guilty" here ? 

Using inbound tinderbox could help or not ?
At least to me there is nothing jumping out - narrowing it down to an inbound commit should help here.
Looks like I found something, but I've not finished testing all commits as there is a lot.

It seems the older "broken commit" is related to bug Bug 860254.

cf https://hg.mozilla.org/integration/mozilla-inbound/rev/00ea960e8164.

I will try to verify this information asap.
Doing my best with inbound builds, but I cannot find these ones :

5b69776cb061	David Major — Bug 860254 - Part 2: Split junk settings into opt_junk (allocate) versus opt_poison (deallocate). r=glandium
	c1b9533aa1a0	David Major — Bug 860254 - Part 1: #define MALLOC_FILL but keep the options const-false. r=glandium
	2c1080ed8c03	Glenn Randers-Pehrson — Bug 945912 - Add crashtest for libpng crash. r=ryanvm
	
	f57515a8bb3b	Luke Wagner — Bug 936236 - Inline getTransferableContents into caller and handlify (r=jonco)
	673bbf3b807f	Luke Wagner — Bug 936236 - Inline copyData into caller and handlify (r=jonco)
	0e0c6b897c47	Luke Wagner — Bug 936236 - Inline allocateSlots into caller and handlify (r=jonco)
	91edf542ecf4	Luke Wagner — Bug 971845 - Don't uninline in JS_GetArrayBufferData, add JS_GetStableArrayBufferData and use that when necessary (r=sfink)
	47870c0b90ba	Luke Wagner — Bug 971845 - Remove dead JSContext methods (r=sfink)
	
	c78178cf6702	Steve Fink — No bug. Comment clarification on how to free JS_AllocateArrayBufferContents results. r=woof!, DONTBUILD
Got it !

Reverting all 3 patches from bug 860254 is fixing the bug.
We are unlikely to revert bug 860254 to fix this bug. This is likely evidence of a use-after-free error in the X libraries or something else that comes with NouVeau. A valgrind memcheck run might help diagnose the error further.
Component: Plug-ins → General
Priority: -- → P3
I understand you won't revert it. But at least, I found why with nouveau plugin-container is going crazy.

I can do a valgring memcheck if you tell me how to do so.
A recent ASAN build from here might also help figure out what is happening:
  https://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-linux64-asan/
Downloading last ASAN build and will report any info I can grab.
Here is what I get in terminal with last ASAN build :

==22420==ERROR: AddressSanitizer: attempting free on address which was not malloc()-ed: 0x000003537030 in thread T0
    #0 0x444e65 (/home/fred/Téléchargements/firefox/plugin-container+0x444e65)
    #1 0x7faf6b432668 (/usr/lib/libX11.so.6+0x43668)
==22420==ABORTING

And I got a "firefox prompt" telling me that Adobe Flash plugin crashed. If I click on "reload", same error message in terminal.
It looks like Flash has other bugs that prevent it from running long enough with ASan. How about we make the bet that the root bug here is probably the same as in bug 973192, and so, by using a Nightly ASan build from before the workarounds landed there, you try to reproduce this with ASan just by starting up Firefox?
(In reply to Benoit Jacob [:bjacob] from comment #23)
> It looks like Flash has other bugs that prevent it from running long enough
> with ASan. How about we make the bet that the root bug here is probably the
> same as in bug 973192, and so, by using a Nightly ASan build from before the
> workarounds landed there, you try to reproduce this with ASan just by
> starting up Firefox?

I'm grabbing a February 15 ASAN build in order to get bug 973192 "alive". And I will report what I get in terminal.
Using ASAN nightly based on revision https://hg.mozilla.org/mozilla-central/rev/305117764929, in order to have February 15 build, when bug 973192 was "alive", I only got this :

==4299==ERROR: AddressSanitizer: attempting free on address which was not malloc()-ed: 0x000001d1e030 in thread T0
    #0 0x444e65 (/home/fred/Téléchargements/firefox/plugin-container+0x444e65)
    #1 0x7fbe4ff17668 (/usr/lib/libX11.so.6+0x43668)
==4299==ABORTING

Flash is crashing too soon :(
Valgrind would happily continue in presence of such errors. Let me know if you have the motivation (will require making a special build).
Just give me instructions. I will follow them asap. I want to see this nasty bug dead for good.
So in theory this page would be all you need to read,
  https://developer.mozilla.org/en-US/docs/Debugging_Mozilla_with_Valgrind

but in practice it's confusing, lacks turn-key instructions, and out of date regarding jemalloc.

So here are my valgrinding instructions:


 1) make a build with --enable-valgrind, --disable-icf, and with the -g -O2 compiler flags (if we were looking for gecko bugs, we'd use less optimization and inlining, but here we're not really looking for a gecko bug). Here is my mozconfig that does all that:

ac_add_options --enable-application=browser
ac_add_options --enable-optimize="-g -O2"
ac_add_options --enable-valgrind
ac_add_options --disable-icf
mk_add_options MOZ_OBJDIR=./obj-firefox-valgrind
mk_add_options MOZ_MAKE_FLAGS="-j8 -s"
mk_add_options AUTOCLOBBER=1

export MOZCONFIG=/path/to/this/mozconfig, and './mach build' as usual.


 2) run your build in valgrind with this command line (note the yourprofile here needs to be replaced by your actual testing profile name...). This will record the valgrind output in a valgrind-log file.

$ valgrind --soname-synonyms=somalloc=NONE --smc-check=all-non-file --track-origins=yes obj-firefox-valgrind/dist/bin/firefox -P yourprofile -no-remote 2>&1 | tee valgrind-log
Attached file Valgrind log. (obsolete) —
Valgrind log. Don't know if it will be really useful. I hope it will help.
That seems very interesting and relevant, since it happens under XCloseDisplay(). Too bad that you dont have symbols there. Have you installed debug symbols for X libraries, and so is it Valgrind that's bad at picking them up?

Julian, do you have generic advice to get symbols from system libraries in Valgrind stacks on Linux, that would have a chance to help with the stack in comment 29?
Flags: needinfo?(jseward)
I'm building debug packages to have something more complete. I will post another valgrind log asap.
OK, I'll clear the needinfo on Julian, then, for now.
Flags: needinfo?(jseward)
Another thing that would be super useful would be to valgrind the standalone version of glxtest that I attached on https://bugs.freedesktop.org/show_bug.cgi?id=75279#c3 .

If the same valgrind error reproduces there, then you've proven that there is a specific bug in Nouveau, you can then hopefully get Nouveau developers to pay attention, and from there on you dont need to worry about mozilla valgrind builds anymore...
(Valgrinding it is super easy: you just build it, it doesn't need special build settings to be valgrindable, and just run valgrind --track-origins=yes ./glxtest).
Here is the "full" version of valgrind log, I mean with debug symbols.
Attachment #8386255 - Attachment is obsolete: true
Looks like something is rotten in NouVeau :(

Will report to NouVeau developpers. Thanks for the tip.
Woohoo, comment 35 and 36 are like it's Christmas today. Yep, do talk to Nouveau devs, especially with comment 36 they will certainly be interested!
(In reply to Benoit Jacob [:bjacob] from comment #37)
> Woohoo, comment 35 and 36 are like it's Christmas today. Yep, do talk to
> Nouveau devs, especially with comment 36 they will certainly be interested!

I added all logs on freedesktop bug report. And christmas ? If you think so ;)

At least, until nouveau is fixed, I can apply a "dirty workaround" :/
Mesa 10.0.5 / 10.1.1 is released and got a fix for this issue. So, adding a release note could be a great idea, until all distributions are using Mesa 10.0.5+ or 10.1.1+

http://lists.freedesktop.org/archives/mesa-announce/2014-April/000080.html
http://lists.freedesktop.org/archives/mesa-announce/2014-April/000081.html

Fix ?

Ilia Mirkin (8):
      nouveau: fix fence waiting logic in screen destroy
It is fixed since Archlinux upgraded to a fixed MesaGL version. Please closed it as FIXED.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: