Closed Bug 984838 Opened 10 years ago Closed 10 years ago

100% CPU on yammer

Categories

(Firefox :: General, defect)

defect
Not set
normal

Tracking

()

RESOLVED WORKSFORME

People

(Reporter: markh, Unassigned)

References

Details

(Keywords: perf)

Attachments

(1 file)

STR for me on Win 7, on 2 different machines

* Open yammer; Select the "all company" group; shift-reload.
* Alternatively: Open yammer; select "all company" group; restart and restore session.

Firefox starts shewing 100% of a core.

Also:
* Closing the tab fixes it; re-open and it comes back.
* Selecting the home tab on yammer stops it.  Re-selecting "all company" does *not* restart the problem - it's necessary for the page-load to happen on the "all company" group to repro.

Possibly all groups - didn't try that.  Possibly happens on the community yammer etc, but also didn't try that - so sadly the above requires access to the staff yammer :(
Keywords: perf
Mark can you link to your mozilla profiler when this happens we already have axel's in the yammer thread.
Any change for a gecko profiler link, not to .json.
Something like http://people.mozilla.org/~bgirard/cleopatra/#report=xxxxxx
chance, not change, but anyhow, link to a profile which is easy to read.
I don't know how to import that .json to anywhere.
Ah, that profile is from devtools. Please use Gecko profiler or some other profiler which gives
better information what is happening inside Gecko.
I saw similar behaviour on http://telemetry.mozilla.org/#release/27/SIMPLE_MEASURES_AMI_STARTUP_END/saved_session/Fennec (not sure it's completely required to reproduce, but I had the "Show evolution over Calendar Dates" button selected).

In my case, dismissing the slow script dialog a few times eventually led to crashes:

bp-0160e74b-eea5-41ae-a478-7fbb92140317	
bp-b856d64a-2b53-4a8f-9304-b08ef2140317
bp-93e62d47-59fc-4acc-b3fe-90e0c2140317

One note: running Nightly from the command line, I see the following warning:

JavaScript warning: http://cdnjs.cloudflare.com/ajax/libs/d3/3.3.9/d3.min.js, line 3: mutating the [[Prototype]] of an object will cause your code to run very slowly; instead create the object with the correct initial [[Prototype]] value using Object.create

Here's a profile: http://people.mozilla.org/~bgirard/cleopatra/#report=8e78388aa31ba35b58dec7ecba48ca70af22d59e
That profile shows a bunch of JS running off timers, which then goes and creates tons of arrays, does something with Voronoi polygons, creates a ton of non-Array objects, etc.

NVD3 is some sort of charting package.  Is it possible that yammer is using this for something and continuously updating the charts?
bz: Comment 6 was referring to a similar symptom seen at a different URL; at smaug's suggestion I split it off to bug 984928.
(In reply to Boris Zbarsky [:bz] from comment #7)
> That profile shows a bunch of JS running off timers, which then goes and
> creates tons of arrays, does something with Voronoi polygons, creates a ton
> of non-Array objects, etc.
...which leads to GCing very often, since we certainly spend quite a bit time in gc.
That's showing mostly a bunch of painting.
(In reply to Boris Zbarsky [:bz] from comment #11)
> That's showing mostly a bunch of painting.

Yeah.  Looking at it with devtools, if you enable "Highlight painted area" you can see the group icon flashing repeatedly as fast as the flashing can be rendered.  It *looked* to me like repeated XHR requests complete caused the image's src attribute to be set.  However, no XHR requests could be seen in the network tab - I'm not sure if they were cached, or my attempts at reading the profile were simply wrong.

So yeah, I think repeatedly painting the group icon is the symptom.  Can't repro on Chrome (FWIW, which really isn't much).
Do we not support stackwalking on Win7? That profile seems to be from pseudostack which isn't too useful, since it for example doesn't tell which events are being dispatched.
(Stuff under nsEventDispatcher::Dispatch take quite a bit time. srcset() is suspicious.)

Hmm, https://developer.mozilla.org/en-US/docs/Performance/Profiling_with_the_Built-in_Profiler#Availability hints that native stack might be disabled on release builds.
Can you reproduce the issue on trunk?
Hmm, *lots* of async events dispatched, and the listener for them calls
event.add/y.handle() which ends up calling some srcset.
Do we end up changing some img all the time, fire load event for it, and then
change the src again, and fire load...?
Or could be error event too.

That is my guess from the profile. Lost of async dom events, and HTMLImageElement::PreHandleEvent
also in the profile, that certainly shouldn't be there at all unless it is called often.
Could be a Yammer bug. Have you reported the issue to them?
We've opened a support request with yammer about this. Let's see what they came back with. Will let everyone know when I hear back.
I think on Yammer it was mentioned that this was happening on Nightly. Can someone give me a version that is is happening on? Is it just the latest? Any other versions that people have been experiencing this issue?
Vien: for me it's happening in Nightly on OSX Mavericks, consistently.  If you open a Yammer tab and click on "All Company" then leave it open for a while, Nightly proceeds to eat 100% of your CPU.
I think the latest is still affected by this.
mozregression tells me:

Got as far as we can go bisecting nightlies...
Ensuring we have enough metadata to get a pushlog...
Last good revision: 41d962d23e81 (2014-03-11)
First bad revision: 44ae8462d6ab (2014-03-12)
Pushlog:
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=41d962d23e81&tochange=44ae8462d6ab

Sadly though, mozregression hangs trying to use inbound builds to bisect further.  That pushlog is fairly large and nothing springs out (except maybe ed30fc4d3e17, "Bug 982146 - Clean up SyncRunnable wait code, r=bsmedberg." but reversing that simple patch locally didn't have any affect)
Here is the output from the Mac Activity Monitor "Sample Process".

Some interesting things to search for in there:

- 50 nsJSContext::GarbageCollectNow
- 43 imgRequestNotifyRunnable::Run()
Possibly bug 980243?  Did we use to have the same behavior 100% CPU back in Firefox 26?
(In reply to Boris Zbarsky [:bz] from comment #23)
> Possibly bug 980243?

As usual, bz hits the nail on the head!  Reverting the patch from that bug locally solves the problem for me.
Blocks: 980243
OK, but then the important question is what did Firefox 26 do on this site?  The behavior prior to bug 980243 was to not do image loads we really should have been doing, and it got broken in the Firefox 27 cycle.  Note that WebKit does NOT handle those image loads quite the same way we do, so it's possible that the code was written to rely on their handling and effectively ends up in an infinite loop where it has a load handler on <img> which then sets src on that same image.
(In reply to Boris Zbarsky [:bz] from comment #25)
> OK, but then the important question is what did Firefox 26 do on this site?

I figured I'd test this while waiting for hg to update.  On 26 I see 100% CPU *and* firefox is unresponsive (ie, I can't interact with it, the "spinners" are frozen; strangely though, windows doesn't report it is "unresponsive" and doesn't offer to kill it when trying to close it as I'd expect - it must be hard-killed via the task manager)
Probably starving OS events; that's something we've changed around some since 26.

Sounds like this is a bug in the yammer script, probably as described in comment 25: a load handler on an <img> setting that <img>'s src to the same value it already has, which triggers a forced load and a new load event for the same image, etc.
I'm still working with Yammer support on this issue. I was able to replicate the issue myself and have given them the exact same instructions to replicate it on their end. The customer rep that's helping me was NOT able to replicate and has escalated to their tier 2 support. Should expect an answer back in a day.
So I did a bit of poking around in Yammer's code.

The site is doing tons of src sets all on the same same HTMLImageElement.  The src being set is https://mug0.assets-yammer.com/mugshot/images/150x150/all_company.png and the function immediately doing the set is:

    function i(e, t) {
        e.height = e.height, e.width = e.width, e.src = t
    }

via this stack trace:

0 i(e = [object HTMLImageElement], t = "https://mug0.assets-yammer.com/mugshot/images/150x150/all_company.png") ["https://c64.assets-yammer.com/assets/yam-requirejs-home-39f6e8696fcdb68131e547a660e49b05.js":42]
    this = [object Window]
1 anonymous([object Object]) ["https://c64.assets-yammer.com/assets/yam-requirejs-home-39f6e8696fcdb68131e547a660e49b05.js":42]
    this = [object HTMLImageElement]
2 anonymous(e = [object Object]) ["https://c64.assets-yammer.com/assets/vendor-4debc085c4ec407eb4852954143be359.js":25]
    this = [object HTMLImageElement]
3 anonymous(e = [object Event]) ["https://c64.assets-yammer.com/assets/vendor-4debc085c4ec407eb4852954143be359.js":24]
    this = [object HTMLImageElement]

Sadly, the source is minified, so finding who exactly calls this function is nontrivial, though I suspect it's jQuery, due to this call:

  n.complete ? i(n, c) : e(n).on("load", function () {
    i(n, c)
  })

I did check and there is exactly one src set for every time we fire the load event, so it's not like the site is adding more and more listeners...
After working with Yammer support, they are not able to replicate the issue on their end. I was able to replicate on Beta and Nightly but not GA. I gave them the exact steps that I used but they said that it still works normally for them. They tried on FF GA, Beta, and Nightly. I asked them if they had a yammer development network and they said yes we can go that route to help get this resolved.

I know everyone is busy but is there anyone who would like to volunteer to join their developer network to help and try to get this resolved?

Thanks
Vien, I'm happy to walk them through what things look like from our end...
Boris - I'll reach out to you directly. Thanks!
I'm seeing markh's symptoms on both Aurora (30.0a2 2014-04-22) and Nightly (31.0a1) running on Mac OS X 10.8.3
So I signed up for their development network thing (eventually; it took them a week to actually let me in) and asked about this two days ago.  No response so far.  I don't expect to get any at this point.
Flags: needinfo?(vdoan)
A former yammer employee, Oscar Godson, ran into this.  I'm asking him on twitter if he knows anyone who could help us.
We have an srcset polyfill to render higher resolution images for browsers with a devicePixelRation greater than 1. The polyfill was listening to the 'load' event of the img objects before replacing them. The problem was we weren't unbinding from that event and hence continuously triggering that bit of code. This has been fixed and will go out in our next deploy.
> for browsers with a devicePixelRation greater than 1

Aha!  That explains why some people couldn't reproduce: you needed a high-dpi display!
Flags: needinfo?(vdoan)
This seems to be fixed on the deployed yammer.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: